Large Multimodal Models (LMMs) have advanced rapidly and now demonstrate remarkable capabilities across diverse vision-language tasks. While specialized architectures such as Vision Transformers offer compute-efficient solutions for visual recognition, the potential of LMMs for fine-grained visual classification remains underexplored.
This work investigates the capabilities of multimodal models for fine-grained visual recognition, specifically examining how well these general-purpose models perform on challenging classification problems that require distinguishing between visually similar categories. We conduct experiments on the Oxford Flowers102 dataset using Qwen3-VL, exploring several adaptation strategies, including Visual Instruction Tuning and the integration of a classification head.
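To make the classification-head strategy concrete, the sketch below shows one way a linear head can be attached to a pretrained multimodal backbone. This is a minimal illustration assuming an HF-style model that exposes hidden states; the wrapper name, pooling choice, and hyperparameters are illustrative rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn


class LMMClassifier(nn.Module):
    """Minimal sketch: pool the backbone's final hidden states and map them
    to the 102 Flowers classes with a linear classification head."""

    def __init__(self, backbone: nn.Module, hidden_size: int, num_classes: int = 102):
        super().__init__()
        self.backbone = backbone                      # pretrained multimodal backbone
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, **inputs) -> torch.Tensor:
        # Assumes an HF-style forward that can return hidden states of shape
        # (batch, seq_len, hidden_size); mean pooling is an illustrative choice.
        outputs = self.backbone(**inputs, output_hidden_states=True)
        pooled = outputs.hidden_states[-1].mean(dim=1)
        return self.classifier(pooled)                # logits over flower classes
```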
Our systematic evaluation shows that LMMs can achieve competitive performance on fine-grained tasks: our best configuration reaches 95.19% accuracy, approaching the performance of specialized vision models. These findings offer practical insight into the applicability and effectiveness of LMMs for specialized visual recognition tasks.