Combination of a pre-trained YOLO, a pre-trained CNN, and a custom classifier

Nouar Aldahoul
3 min read · Dec 30, 2020
Figure: Block diagram of the combination of a pre-trained YOLO, a pre-trained CNN, and a custom classifier

Transfer learning trains a CNN on a large-scale dataset and then reuses the trained network on a new small or medium-scale dataset for a custom use case. Large-scale datasets such as ImageNet [1] and COCO [2] are usually used for the initial training. A subset of ImageNet with 1.2 million images was used in the 2010 ImageNet Large Scale Visual Recognition Challenge to classify visual objects into 1000 categories [1]. Similarly, COCO 2017 contains 164k images and 80 classes [2]. However, it is difficult to collect a large-scale dataset for a specific application such as human activity recognition, nudity and pornography detection, or video distortion classification. Transfer learning therefore reuses the weights of CNNs trained on ImageNet, such as AlexNet [3], VGG16 [4], GoogLeNet [5], InceptionV3 [6], ResNet50, and ResNet101 [7]. The weights of the first layers are usually frozen, without fine-tuning, and used to extract features from the new small-scale dataset. A custom classifier can then replace the fully connected layers of the pre-trained CNN to customise the classification for the specific task.
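The idea of frozen layers plus a trainable custom classifier can be illustrated with a minimal, standard-library sketch. It is only a toy under stated assumptions: `make_frozen_extractor` is a hypothetical stand-in that uses a fixed random projection instead of a real pre-trained CNN such as ResNet50, the head is a plain logistic regression, and the four-sample dataset is made up.

```python
import math
import random


def make_frozen_extractor(n_inputs, n_features, seed=0):
    """Toy stand-in for the frozen early layers of a pre-trained CNN."""
    rng = random.Random(seed)
    # Tuples are immutable, so these weights cannot be updated during training.
    weights = tuple(
        tuple(rng.uniform(-1.0, 1.0) for _ in range(n_inputs))
        for _ in range(n_features)
    )

    def extract(x):
        # Linear projection followed by ReLU.
        return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

    return extract


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_head(features, labels, epochs=100, lr=0.1):
    """Fit a logistic-regression head on pre-computed features with SGD."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
            b -= lr * err
    return w, b


def predict(extract, head, x):
    """Return 1 or 0 for one input, using frozen features and the trained head."""
    w, b = head
    f = extract(x)
    return 1 if sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) >= 0.5 else 0


# Tiny made-up dataset: four 3-value "images" with binary labels.
samples = [[0.9, 0.8, 0.1], [0.8, 0.9, 0.2], [0.1, 0.2, 0.9], [0.2, 0.1, 0.8]]
labels = [1, 1, 0, 0]

extract = make_frozen_extractor(n_inputs=3, n_features=8)
features = [extract(x) for x in samples]  # extractor weights are never touched
head = train_head(features, labels)       # only the head is trained
predictions = [predict(extract, head, x) for x in samples]
```

Because the features are computed once and only the head is fitted, training stays cheap even when the extractor is large, which is the main appeal for small datasets.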

Transfer detection was demonstrated in [8] with YOLOv3 (You Only Look Once), a detector pre-trained on COCO, to detect a specific class such as a person and ignore the other classes. The YOLOv3 detector was transferred to a new dataset to determine the regions of interest (ROIs) that contain, for example, human patches.
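Keeping one class and cropping its ROIs can be sketched as follows. This is a hedged illustration, not the code from [8]: the `Detection` tuple, the `select_rois` and `crop` helpers, and the 0.5 confidence threshold are all assumptions, and a real YOLO implementation will return detections in its own format.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    label: str    # class name predicted by the detector
    score: float  # detection confidence in [0, 1]
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels


def select_rois(detections, width, height, target="person", min_score=0.5):
    """Keep boxes of the target class and clamp them to the image bounds."""
    rois: List[Tuple[int, int, int, int]] = []
    for det in detections:
        if det.label != target or det.score < min_score:
            continue  # ignore other classes and weak detections
        x0, y0, x1, y1 = det.box
        x0, y0 = max(0, int(x0)), max(0, int(y0))
        x1, y1 = min(width, int(x1)), min(height, int(y1))
        if x1 > x0 and y1 > y0:  # drop boxes that end up empty
            rois.append((x0, y0, x1, y1))
    return rois


def crop(image, box):
    """Cut a patch out of an image stored as a list of pixel rows."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


# Made-up 6x5 image and detections, standing in for real detector output.
image = [[y * 10 + x for x in range(6)] for y in range(5)]
detections = [
    Detection("person", 0.90, (1, 1, 3, 4)),
    Detection("dog", 0.95, (0, 0, 2, 2)),
    Detection("person", 0.20, (0, 0, 5, 5)),
]
patches = [crop(image, box) for box in select_rois(detections, width=6, height=5)]
```

Only the first detection survives here: the dog is a different class and the second person falls below the threshold.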

Combining transfer detection and transfer learning gives three blocks in the machine learning pipeline: a pre-trained object detector such as YOLO, a pre-trained CNN such as ResNet50, and a custom classifier such as a Support Vector Machine (SVM), Random Forest (RF), or Extreme Learning Machine (ELM). This approach is useful when the dataset is not large enough to train every block end to end. To read more about this combination and its advantages, see the paper at https://doi.org/10.3390/sym13010026, which applies it to nudity and pornography detection.
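The way the three blocks chain together can be sketched with plain callables. Everything named here is hypothetical: `run_pipeline` and the `toy_*` functions are placeholders for illustration only, where in practice `detect` would wrap a pre-trained YOLO, `extract` a pre-trained CNN such as ResNet50, and `classify` a trained SVM, RF, or ELM.

```python
def run_pipeline(image, detect, extract, classify):
    """Detect ROIs, extract features per ROI, and classify each ROI."""
    results = []
    for box in detect(image):
        x0, y0, x1, y1 = box
        patch = [row[x0:x1] for row in image[y0:y1]]
        if not patch or not patch[0]:
            continue  # skip empty regions
        results.append((box, classify(extract(patch))))
    return results


def toy_detect(image):
    # Placeholder detector: returns two fixed boxes (x_min, y_min, x_max, y_max).
    return [(2, 0, 4, 2), (0, 2, 2, 4)]


def toy_extract(patch):
    # Placeholder features: mean and maximum pixel value of the patch.
    pixels = [v for row in patch for v in row]
    return [sum(pixels) / len(pixels), max(pixels)]


def toy_classify(features):
    # Placeholder classifier: thresholds the mean pixel value.
    return "bright" if features[0] > 5 else "dark"


# Made-up 4x4 image with one bright and one dark region.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
results = run_pipeline(image, toy_detect, toy_extract, toy_classify)
```

Keeping the blocks as separate callables means each one can be swapped independently, for example trying a different classifier without touching the detector or the CNN.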

Please feel free to post a comment if you have any questions about the method described above. I’ll be happy to answer.

References

[1] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2009; pp. 248–255; https://doi.org/10.1109/CVPR.2009.5206848

[2] Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); 2014; pp. 740–755; arXiv:1405.0312v3

[3] Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems; 2012; https://doi.org/10.1145/3065386

[4] Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, Conference Track Proceedings; 2015; arXiv:1409.1556v6

[5] Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; 2015; https://doi.org/10.1109/CVPR.2015.7298594

[6] Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; 2016; https://doi.org/10.1109/CVPR.2016.308

[7] He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; 2016; pp. 770–778; https://doi.org/10.1109/CVPR.2016.90

[8] AlDahoul, N.; Abdul Karim, H.; Lye Abdullah, M.H.; Ahmad Fauzi, M.F.; Ba Wazir, A.S.; Mansor, S.; See, J. Transfer Detection of YOLO to Focus CNN’s Attention on Nude Regions for Adult Content Detection. Symmetry 2021, 13, 26; https://doi.org/10.3390/sym13010026

Nouar AlDahoul is an AI developer and researcher with a Ph.D. in Machine Learning. She has received many awards, including the ICIP20 challenge award.