Abstract

Rapid advances in Large Multimodal Models (LMMs) have demonstrated remarkable capabilities across diverse vision-language tasks. While specialized architectures such as Vision Transformers offer compute-efficient solutions for visual recognition, the potential of LMMs for fine-grained visual classification remains underexplored.

This work investigates the capabilities of multimodal models for fine-grained visual recognition tasks, specifically examining how well these general-purpose models can perform on challenging classification problems that require distinguishing between visually similar categories. We conduct experiments on the Oxford Flowers102 dataset using Qwen3-VL, exploring various adaptation strategies including Visual Instruction Tuning and classification head integration.

Our systematic evaluation reveals that LMMs can achieve competitive performance on fine-grained tasks: our best configuration reaches 95.19% accuracy, exceeding our fine-tuned ResNet-50 baseline (93.24%) and approaching specialized state-of-the-art vision models. These findings provide insight into the applicability and effectiveness of LMMs for specialized visual recognition tasks.

Research Questions & Hypotheses

Primary Question

How effectively can Large Multimodal Models handle fine-grained visual classification tasks, and what factors influence their performance?

Hypothesis 1: SFT Enhancement

Supervised fine-tuning can help a general-purpose multimodal model increase its fine-grained classification accuracy

✓ Validated

Hypothesis 2: Classification Heads

Adding a custom classification head can further improve classification accuracy beyond SFT alone

✓ Validated

Hypothesis 3: Specialization

A more specialized (fine-tuned) base model will achieve better performance when combined with a classification head

✓ Validated

Experimental Results

Our experiments validate all three hypotheses, showing progressive improvements in classification accuracy through systematic enhancement strategies:

Model Configuration                      Accuracy (%)   Improvement
Qwen3-VL-8B-Instruct (baseline)          16.08
Qwen3-VL-4B-Instruct (baseline)          20.78          +4.70
InstructBLIP-Flan-T5-XL (baseline)       21.18          +0.40
Idefics2-8B (baseline)                   22.65          +1.47
ResNet-50 (CNN baseline)                 93.24
Qwen3-VL-4B + Classification Head        64.60          +43.82*
Qwen3-VL-4B-SFT (fine-tuned)             73.52          +8.92*
Qwen3-VL-4B-SFT + Classification Head    95.19          +21.67*

*Improvement over the preceding Qwen3-VL-4B configuration.

Key Findings

Hypothesis 1 Validated

SFT dramatically improved accuracy from 20.78% to 73.52% (+254% relative improvement)

Hypothesis 2 Validated

Classification heads provide consistent improvements (base: +43.82%, SFT: +21.67%)

Hypothesis 3 Validated

Specialized model + classifier (95.19%) significantly outperforms base + classifier (64.60%)

CNN Comparison

Our best LMM configuration (95.19%) surpasses ResNet-50 baseline (93.24%)

Augmentation Study Results

Our ablation studies reveal important insights about traditional computer vision augmentation techniques:

Model Configuration                Accuracy (%)   Change
ResNet-50 (baseline)               93.24
ResNet-50-MixUp-CutMix             90.59          -2.65
Qwen3-VL-4B-SFT (baseline)         73.52
Qwen3-VL-4B-SFT-MixUp-CutMix       66.27          -7.25

Finding: Traditional augmentation techniques (MixUp/CutMix) are detrimental to multimodal model performance, highlighting fundamental differences between vision-only and multimodal training paradigms.
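For reference, a minimal NumPy sketch of MixUp, the blending half of the ablated augmentation (CutMix differs in pasting a rectangular patch of one image into the other rather than blending pixels); the `alpha=0.2` value and the toy shapes are illustrative assumptions, not the project's exact settings:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two (image, one-hot label) pairs with a Beta-sampled weight.

    MixUp trains on convex combinations of inputs and labels; CutMix
    instead pastes a random crop of x2 into x1 and mixes the labels by
    the pasted area ratio.
    """
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)          # mixing coefficient in (0, 1)
    x = lam * x1 + (1.0 - lam) * x2       # pixel-wise blend
    y = lam * y1 + (1.0 - lam) * y2       # soft label, no longer one-hot
    return x, y, lam

# Toy example: two 4x4 "images" with one-hot labels over 3 classes.
x1, x2 = np.zeros((4, 4)), np.ones((4, 4))
y1, y2 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
x, y, lam = mixup(x1, y1, x2, y2)
```

The soft labels this produces are one plausible reason for the larger drop on the LMM side: the SFT objective predicts a single species name as text, which has no natural analogue of a fractional label.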

Models & Dataset

All trained models and processed datasets are available on Hugging Face for reproducing the experimental results:

Base SFT Model

Fine-tuned Qwen3-VL on flowers domain

73.52%
View Model

Base + Classifier

Base model with classification head

64.60%
View Model

Fine-tuned + Classifier

Fine-tuned model with classification head

95.19%
View Model

ResNet-50 Baseline

Fine-tuned ResNet-50 on flowers102 dataset

93.24%
View Model

Dataset

Flowers102 Dataset

Processed dataset with prompts for all tasks (open-qa, closed-qa, closed-negative-qa, open-qa-mixcut)

  • 102 flower categories
  • 8,289 total images
  • 6,149 training samples
  • 1,020 test samples
View Dataset

Methodology

Our systematic investigation follows a three-stage progressive enhancement approach to improve LMM performance on fine-grained visual recognition:

1

Baseline Evaluation

Comprehensive evaluation of state-of-the-art LMMs (Qwen3-VL, InstructBLIP, Idefics2) on the Flowers102 dataset using structured visual-instruction prompts.

Best baseline: 22.65% (Idefics2-8B)
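The exact prompt template is not reproduced on this page, so the following is a hypothetical sketch of what a closed-QA classification prompt over the class names could look like; the wording, the `<image>` placeholder, and the `closed_qa_prompt` helper are assumptions (the real templates ship with the released dataset):

```python
# Hypothetical closed-QA prompt builder; in the actual dataset the option
# list would cover all 102 Flowers102 species, three are shown here.
CLASS_NAMES = ["pink primrose", "hard-leaved pocket orchid", "canterbury bells"]

def closed_qa_prompt(class_names):
    # Number the candidate species so the model answers from a fixed set.
    options = "\n".join(f"{i}. {name}" for i, name in enumerate(class_names, 1))
    return (
        "<image>\n"
        "Which flower species is shown in the image? "
        "Answer with exactly one option from the list below.\n"
        f"{options}"
    )

print(closed_qa_prompt(CLASS_NAMES))
```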
2

Visual Instruction Tuning (SFT)

Domain-specific fine-tuning of Qwen3-VL-4B using structured visual instruction prompts on flower species classification for up to 4,000 steps.

Achievement: 73.52% (+254% improvement)
3

Classification Head Integration

Addition of specialized linear classification heads to both base and fine-tuned models, optimized using cross-entropy loss over 2,000 training steps.

Final: 95.19% (approaching SOTA)
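A minimal NumPy sketch of stage 3, assuming the linear head consumes a pooled hidden state from the frozen backbone and is trained with cross-entropy; the toy hidden size and all names here are illustrative, not the repository's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, NUM_CLASSES = 8, 102   # toy hidden size; Flowers102 has 102 classes

# Linear classification head: logits = h @ W + b. During stage 3 the
# multimodal backbone stays frozen, so only W and b receive gradients.
W = rng.normal(scale=0.02, size=(HIDDEN, NUM_CLASSES))
b = np.zeros(NUM_CLASSES)

def head_forward(h):
    return h @ W + b

def cross_entropy(logits, label):
    z = logits - logits.max()                 # stabilized log-softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

h = rng.normal(size=HIDDEN)                   # stand-in for a pooled LMM feature
loss = cross_entropy(head_forward(h), label=7)
```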

Technical Configuration

Visual Instruction Tuning

  • Learning Rate: 2 × 10⁻⁵ with cosine scheduling
  • Batch Size: 1
  • Max Steps: 4,000
  • Hardware: 4x NVIDIA A100-SXM4-40GB

Classification Head Training

  • Learning Rate: 1 × 10⁻⁴ with cosine annealing
  • Max Steps: 2,000
  • Dropout Rate: 0.1
  • Base Model: Frozen during training
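Both stages anneal the learning rate with a cosine schedule. As a sketch, the standard annealing formula, assuming no warmup and decay to zero (the actual training configs may differ):

```python
import math

def cosine_lr(step, max_steps, base_lr):
    """Standard cosine annealing from base_lr down to 0 over max_steps."""
    progress = min(step, max_steps) / max_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Head training: 1e-4 base LR over 2,000 steps (SFT: 2e-5 over 4,000 steps).
lr_start, lr_mid, lr_end = (cosine_lr(s, 2000, 1e-4) for s in (0, 1000, 2000))
```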

Paper & Citation

Research Paper

An Exploration on Fine-Grained Visual Recognition and Classification in Large Multimodal Models

SC4001 Neural Networks and Deep Learning, Nanyang Technological University, 2025

Citation

@misc{sc4001-flowers102-finegrained-recognition,
  title={An Exploration on Fine-Grained Visual Recognition and Classification in Large Multimodal Models},
  author={Oscar Qian and Suki Ng and Li You},
  year={2025},
  institution={Nanyang Technological University},
  course={SC4001 Neural Networks and Deep Learning},
  url={https://github.com/oscarqjh/SC4001-Group-Project}
}

Code & Reproducibility

SC4001-Group-Project

Complete implementation and reproduction pipeline

View Repository

Quick Start

# Clone repository with submodules
git clone --recurse-submodules https://github.com/oscarqjh/SC4001-Group-Project.git
cd SC4001-Group-Project

# Setup environment
uv venv -p 3.11
source .venv/bin/activate
uv pip install -e . -e ./extern/lmms-engine -e ./extern/lmms-eval

# Quick evaluation using pre-trained models
CUDA_VISIBLE_DEVICES=0 ./scripts/bash/evaluation/eval_qwen3vl.sh

Complete Pipeline

Full training and evaluation scripts for reproducing all experimental results

Data Processing

Automated dataset download, processing, and prompt generation tools

Comprehensive Analysis

Training dynamics visualization and detailed performance analysis

Documentation

Detailed documentation for every component and experiment

Hardware Requirements

  • Recommended: 4x NVIDIA A100-SXM4-40GB for distributed training
  • Storage: ~50GB for dataset and models
  • RAM: 32GB+ system memory