Scene Text Detection and Recognition Using Maximally Stable Extremal Region
DOI: https://doi.org/10.37385/jaets.v6i1.5958
Keywords: MSER, SWT, Text Detection, Text Recognition, Deep Learning, CRNN
Abstract
In recent years, scene text detection and recognition have become important research areas in computer vision and machine learning. Traditional methods often struggle to detect and recognize text in images with low resolution, complex backgrounds, and varying font sizes. In this paper, we address these challenges with a method that combines Maximally Stable Extremal Regions (MSER) and the Stroke Width Transform (SWT) for scene text detection with a Convolutional Recurrent Neural Network (CRNN) for recognition. The method has two stages: text detection and text recognition. In the detection stage, MSER and SWT extract candidate text regions from the input image, and non-text regions are then removed using image-to-image translation. In the recognition stage, a CRNN reads the text in the detected regions. The CRNN architecture consists of convolutional and recurrent layers, which capture the spatial and sequential features of the text, respectively. The method is evaluated on several benchmark datasets and achieves an accuracy of 96%, which compares favorably with existing methods.
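As an illustration of the recognition stage, the sketch below shows how the per-frame outputs of a CRNN could be turned into a string. It assumes the network is trained with a CTC objective, as in the original CRNN formulation by Shi et al.; the abstract does not state the decoding scheme, so this is an assumption rather than the authors' implementation. The function name `ctc_greedy_decode`, the choice of index 0 for the blank symbol, and the toy alphabet are all hypothetical.

```python
from itertools import groupby


def ctc_greedy_decode(frame_scores, alphabet):
    """Best-path (greedy) CTC decoding.

    frame_scores: one list of class scores per time step.
    Class 0 is assumed to be the CTC blank; class i >= 1 maps to alphabet[i - 1].
    """
    # 1. Take the highest-scoring class at every time step.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # 2. Merge consecutive repeats of the same class.
    collapsed = [idx for idx, _ in groupby(best_path)]
    # 3. Drop blanks and map the remaining indices to characters.
    return "".join(alphabet[idx - 1] for idx in collapsed if idx != 0)


if __name__ == "__main__":
    # Toy example: classes are [blank, 'o', 't'], six time steps.
    alphabet = "ot"
    frames = [
        [0.10, 0.10, 0.80],  # 't'
        [0.10, 0.20, 0.70],  # 't' (repeat, merged)
        [0.90, 0.05, 0.05],  # blank
        [0.10, 0.80, 0.10],  # 'o'
        [0.70, 0.20, 0.10],  # blank separates the two 'o's
        [0.20, 0.70, 0.10],  # 'o'
    ]
    print(ctc_greedy_decode(frames, alphabet))  # prints: too
```

The blank between the two 'o' frames is what lets the decoder keep a doubled letter; without it, the repeated 'o' predictions would be merged into one.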