VTnet+Handcrafted based approach for food cuisines classification

Publication Name

Multimedia Tools and Applications


In this paper, we propose a novel hybrid transformer architecture for food cuisine detection and classification. The work combines a Vision Transformer ensemble architecture with hand-crafted features, yielding a hybrid Vision Transformer food recognition system. Vision Transformers have recently been introduced as an alternative to convolutional neural networks (CNNs) for classification. They perform pattern detection and classification without convolutions and interpret an image as a sequence of patches. The Vision Transformer was combined with the hand-crafted features GIST, HoG (Histogram of Oriented Gradients), and LBP (Local Binary Pattern) and applied to the dataset. The dataset was created specifically for this work from the public logging system. It consisted of 13 food categories with 400 images of Indian food items such as Ghevar, Idli, Dosa, and many more. It helped to capture a variety of images from every domain and culture. This work used common, readily available food items; the dataset can be extended by adding specialties (dishes) from different regions. Various experiments were performed on CNNs with classifiers such as Random Forest and SVM. Further, we compared our proposed approach with several ensembles of CNN architectures. The experiments showed that our proposed approach outperformed the state-of-the-art ensemble CNN architectures for detecting food cuisines. The proposed hybrid approach achieved an accuracy of 94.63%, a sensitivity of 84.42%, a specificity of 95.23%, and a kappa coefficient of 0.93, the best among all approaches compared.
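
For readers unfamiliar with the four reported metrics, the following is a minimal sketch of how accuracy, sensitivity, specificity, and Cohen's kappa can be derived from a multi-class confusion matrix. It is illustrative only: the abstract does not state how the per-class values were averaged, so macro-averaging is an assumption here, and the function name `classification_metrics` and the 3-class matrix `toy_cm` are hypothetical and not taken from the paper.

```python
def classification_metrics(cm):
    """Compute summary metrics from a square confusion matrix.

    cm[i][j] is the number of samples whose true class is i and whose
    predicted class is j. Every class is assumed to have at least one
    true sample and the chance agreement is assumed to be below 1,
    so no division by zero occurs.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(row) for row in cm]                              # true counts per class
    col_sums = [sum(cm[i][j] for i in range(k)) for j in range(k)]   # predicted counts per class

    # Overall accuracy: fraction of samples on the diagonal.
    accuracy = sum(cm[c][c] for c in range(k)) / total

    # One-vs-rest sensitivity and specificity for each class.
    sensitivities, specificities = [], []
    for c in range(k):
        tp = cm[c][c]
        fn = row_sums[c] - tp
        fp = col_sums[c] - tp
        tn = total - tp - fn - fp
        sensitivities.append(tp / (tp + fn))
        specificities.append(tn / (tn + fp))

    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = sum(row_sums[c] * col_sums[c] for c in range(k)) / (total * total)
    kappa = (accuracy - expected) / (1 - expected)

    return {
        "accuracy": accuracy,
        "sensitivity": sum(sensitivities) / k,   # macro average (assumption)
        "specificity": sum(specificities) / k,   # macro average (assumption)
        "kappa": kappa,
    }


# Hypothetical 3-class confusion matrix, not data from the paper.
toy_cm = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 1, 9],
]
metrics = classification_metrics(toy_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Kappa is typically lower than raw accuracy because it subtracts the agreement expected by chance from the class frequencies, which makes it a useful complement when classes are unevenly represented.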

Open Access Status

This publication may be available as open access


Link to publisher version (DOI)