This section describes how pre-trained models can be downloaded and used in MatConvNet. Using the pre-trained model is easy; just start from the example code included in the quickstart guide.
Remark: The following CNN models may have been imported from other reference implementations and are equivalent to the originals up to numerical precision. However, note that:
Images need to be pre-processed (resized and cropped) before being submitted to a CNN for evaluation. Even small differences in the preprocessing details can have a non-negligible effect on the results.
The example below shows how to evaluate a CNN, but does not include data augmentation or encoding normalization such as those provided, for example, by the VGG code. While these are easy to implement, they are not applied automatically here.
These models are provided here for convenience, but please credit the original authors.
These models are trained for object detection in PASCAL VOC.
Fast R-CNN. Models from the Fast R-CNN page:
The model performance is as follows (mAP 11 indicates mean average precision computed using 11-point interpolation, as per the PASCAL VOC 07 specification):
|model||training set||PASCAL07 test mAP||mAP 11|
|fast-rcnn-caffenet-pascal07-dagnn||imnet12+pas07||57.3 %||58.1 %|
|fast-rcnn-vggm12-pascal07-dagnn||imnet12+pas07||59.4 %||60.5 %|
|fast-rcnn-vgg16-pascal07-dagnn||imnet12+pas07||67.3 %||68.7 %|
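As a rough illustration of the 11-point interpolation behind the mAP 11 column, the sketch below computes the interpolated average precision for one class in plain Python. It is not the MatConvNet or PASCAL VOC devkit code; the function name and the input layout (one recall/precision pair per ranked detection) are assumptions made for this example. The mAP is then the mean of this value over all classes.

```python
def average_precision_11pt(recall, precision):
    """11-point interpolated average precision for a single class.

    recall, precision: equal-length sequences with one entry per ranked
    detection (hypothetical input layout chosen for this sketch).
    """
    total = 0.0
    for step in range(11):
        threshold = step / 10  # recall levels 0.0, 0.1, ..., 1.0
        # Interpolated precision: the best precision reached at any
        # recall greater than or equal to the threshold (0 if none).
        candidates = [p for r, p in zip(recall, precision) if r >= threshold]
        total += max(candidates) if candidates else 0.0
    return total / 11


if __name__ == "__main__":
    # Three ranked detections for a class with two ground-truth objects.
    print(average_precision_11pt([0.5, 0.5, 1.0], [1.0, 0.5, 2 / 3]))
```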
These models are trained for face classification and verification.
VGG-Face. The face classification and verification network from the VGG project.
Deep face recognition, O. M. Parkhi and A. Vedaldi and A. Zisserman, Proceedings of the British Machine Vision Conference (BMVC), 2015 (paper).
See the script examples/cnn_vgg_face.m for an example of using VGG-Face for classification. To use this network for face verification instead, extract the 4K-dimensional features by removing the last classification layer and L2-normalize the resulting vector.
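The L2 normalization step for verification can be sketched as follows. This is plain Python, independent of MatConvNet; the helper names are hypothetical, and comparing two normalized descriptors by Euclidean distance is an assumption of this example rather than something prescribed here.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a feature vector to unit Euclidean (L2) norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    # eps guards against division by zero for an all-zero vector.
    return [x / max(norm, eps) for x in vec]


def euclidean_distance(a, b):
    """Distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    # Toy 2-D "descriptors" standing in for the 4K-dimensional features.
    f1 = l2_normalize([3.0, 4.0])
    f2 = l2_normalize([6.0, 8.0])
    print(f1, euclidean_distance(f1, f2))
```

After normalization, descriptors that differ only by a scale factor coincide, so the distance reflects direction rather than magnitude.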
These models are trained for semantic image segmentation using the PASCAL VOC category definitions.
Fully-Convolutional Networks (FCN) training and evaluation code is available here.
BVLC FCN (the original implementation) imported from the Caffe version [DagNN format].
'Fully Convolutional Networks for Semantic Segmentation', Jonathan Long, Evan Shelhamer and Trevor Darrell, CVPR, 2015 (paper).
These networks are trained on the PASCAL VOC 2011 training and (in part) validation data, using Berkeley's extended annotations (SBD).
The performance is measured on the PASCAL VOC 2011 validation data subset used in the revised version of the paper above (dubbed RV-VOC11):
|Model||Test data||Mean IOU||Mean pix. accuracy||Pixel accuracy|
|FCN-32s||RV-VOC11||59.43||89.12||73.28|
|FCN-16s||RV-VOC11||62.35||90.02||75.74|
|FCN-8s||RV-VOC11||62.69||90.33||75.86|
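For reference, the sketch below shows how pixel accuracy, mean per-class accuracy, and mean intersection-over-union are commonly derived from a confusion matrix, following the usual definitions in the semantic segmentation literature. It is an assumption that the figures above were computed in exactly this way (for instance, with respect to how classes absent from the test data are handled); the function name and dictionary keys are made up for this example.

```python
def segmentation_metrics(confusion):
    """Compute segmentation metrics from a square confusion matrix.

    confusion[i][j] = number of pixels of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = [confusion[i][i] for i in range(n)]
    true_count = [sum(confusion[i]) for i in range(n)]
    pred_count = [sum(confusion[j][i] for j in range(n)) for i in range(n)]

    # Fraction of all pixels that are labelled correctly.
    pixel_accuracy = sum(correct) / total if total else 0.0

    # Per-class accuracy, averaged over classes present in the ground truth.
    accs = [correct[i] / true_count[i] for i in range(n) if true_count[i] > 0]

    # Per-class IoU = intersection / union, averaged over classes that
    # appear in either the ground truth or the prediction.
    ious = []
    for i in range(n):
        union = true_count[i] + pred_count[i] - correct[i]
        if union > 0:
            ious.append(correct[i] / union)

    return {
        "pixel_accuracy": pixel_accuracy,
        "mean_accuracy": sum(accs) / len(accs) if accs else 0.0,
        "mean_iou": sum(ious) / len(ious) if ious else 0.0,
    }


if __name__ == "__main__":
    print(segmentation_metrics([[3, 1], [1, 5]]))
```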
Torr Vision Group FCN-8s. This is the FCN-8s subcomponent of the CRF-RNN network from the paper:
'Conditional Random Fields as Recurrent Neural Networks' Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr, ICCV 2015 (paper).
These networks are trained on the PASCAL VOC 2011 training and (in part) validation data, using Berkeley's extended annotations, as well as Microsoft COCO.
While the CRF component is missing (it may come later to MatConvNet), this model still outperforms the FCN-8s network above, partially because it is trained with additional data from COCO. In the table below, the RV-VOC12 data is the subset of the PASCAL VOC 12 data as described in the 'Conditional Random Fields' paper:
|Model||Test data||Mean IOU||Mean pix. accuracy||Pixel accuracy|
|FCN-8s-TVG||RV-VOC12||69.85||92.94||78.80|
TVG implementation note: The model was obtained by first fine-tuning the plain FCN-32s network (without the CRF-RNN part) on COCO data, then building an FCN-8s network with the learnt weights, and finally training the CRF-RNN network end-to-end using VOC 2012 training data only. The model available here is the FCN-8s part of this network (without the CRF-RNN part, although it was trained with 10 CRF-RNN iterations).
ImageNet ILSVRC classification
These models are trained to perform classification on the ImageNet ILSVRC challenge data.
ResNet models imported from the MSRA version.
'Deep Residual Learning for Image Recognition', K. He, X. Zhang, S. Ren and J. Sun, CVPR, 2016 (paper).
GoogLeNet model imported from the Princeton version [DagNN format].
'Going Deeper with Convolutions', Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, CVPR, 2015 (paper).
VGG-VD models from the Very Deep Convolutional Networks for Large-Scale Visual Recognition project.
'Very Deep Convolutional Networks for Large-Scale Image Recognition', Karen Simonyan and Andrew Zisserman, arXiv technical report, 2014 (paper).
VGG-S,M,F models from the Return of the Devil paper (v1.0.1).
'Return of the Devil in the Details: Delving Deep into Convolutional Networks', Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman, BMVC 2014 (BibTex and paper).
The following models have been trained using MatConvNet (beta17) with batch normalization, using the code in the examples/imagenet directory and the ILSVRC 2012 data:
The imagenet-matconvnet-*.mat files are deployed models. This means, in particular, that batch normalization layers have been removed for speed at test time. This, however, may affect fine-tuning.
Caffe reference model obtained here (version downloaded in September 2014).
Citation: please see the Caffe homepage.
'ImageNet classification with deep convolutional neural networks', A. Krizhevsky, I. Sutskever and G. E. Hinton, NIPS 2012 (BibTex and paper).
The first model has been imported from Caffe.
The MatConvNet model was trained using MatConvNet (beta17) and batch normalization using the code in the
This is a summary of the performance of these models on the ILSVRC 2012 validation data:
|model||introduced||top-1 err.||top-5 err.||images/s|
Some of the models trained using MatConvNet are slightly better than the original, probably due to the use of batch normalization during training.
Error rates are computed on a single centre-crop and are therefore higher than those reported in some publications, where multiple evaluations per image are combined. Likewise, no model ensembles are evaluated.
The evaluation speed was measured on a 12-core machine using a single NVIDIA Titan X, MATLAB R2015b, and CuDNN v5.1. A significant bottleneck for many networks is the data reading speed of
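To make the top-1 and top-5 error figures concrete, the following plain-Python sketch computes a top-k error rate from per-image class scores, with one score vector per image as in a single-crop evaluation. The function name and the data layout are assumptions of this example and do not correspond to the MatConvNet evaluation code.

```python
def top_k_error(scores, labels, k):
    """Fraction of images whose true label is not among the k best scores.

    scores: list of per-image score lists (one score per class).
    labels: list of true class indices, one per image.
    """
    errors = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(labels)


if __name__ == "__main__":
    scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
    labels = [1, 1, 0]
    print(top_k_error(scores, labels, 1), top_k_error(scores, labels, 2))
```

Combining several crops per image would typically mean averaging the score vectors before ranking, which is why multi-crop figures in publications tend to be lower.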
The GoogLeNet model performance is a little lower than expected (the model should be on par with or a little better than VGG-VD). This network was imported from the Princeton version of GoogLeNet, not from the Google team, so the difference might be due to parameter settings during training. On the positive side, GoogLeNet is much smaller (in terms of parameters) and faster than VGG-VD.
The following table summarizes the MD5 checksums for the model files.
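A downloaded model file can be checked against its published MD5 checksum with a few lines of Python, sketched below using the standard hashlib module. The helper names are hypothetical, and the file name and checksum in the usage comment are placeholders, not real values.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks avoids loading a large model file into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_md5):
    """Compare a file's MD5 digest with the expected value (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Hypothetical usage; substitute a real file name and its listed checksum:
# checksum_matches("some-model.mat", "<md5 from the table>")
```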
Older file versions
Older models for MatConvNet beta16 are available
here. They should be numerically equivalent, but in
beta17 the format has changed slightly for SimpleNN models. Older
models can also be updated using the