
Commit 2d57e21

Update InternVL 2.5 README (#798)
* Update InternVL 2.5 evaluation code
* Update README.md
1 parent e8dd6f8 commit 2d57e21

38 files changed: +6777 −259 lines

README.md (+114 −99; large diff not rendered)

README_zh.md (+118 −102; large diff not rendered)

internvl_chat/README.md (+618 −32; large diff not rendered)

internvl_chat/eval/README.md (new file, +95 lines)
# README for Evaluation

Here we list the codebases we used to obtain the evaluation results in the InternVL 2.5 technical report.
## Multimodal Reasoning and Mathematics

| Benchmark Name | Codebase                                                 |
| -------------- | -------------------------------------------------------- |
| MMMU           | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| MMMU-Pro       | [This Codebase](./mmmu_pro)                               |
| MathVista      | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| MATH-Vision    | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| MathVerse      | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| OlympiadBench  | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
## OCR, Chart, and Document Understanding

| Benchmark Name    | Codebase                                                 |
| ----------------- | -------------------------------------------------------- |
| AI2D with mask    | [This Codebase](./vqa)                                    |
| AI2D without mask | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| ChartQA           | [This Codebase](./vqa)                                    |
| DocVQA            | [This Codebase](./vqa)                                    |
| InfoVQA           | [This Codebase](./vqa)                                    |
| OCRBench          | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| SEED-2-Plus       | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| CharXiv           | [CharXiv](https://github.com/princeton-nlp/CharXiv)       |
| VCR               | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
## Multi-Image Understanding

| Benchmark Name | Codebase                                                 |
| -------------- | -------------------------------------------------------- |
| BLINK          | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| Mantis Eval    | [This Codebase](./mantis_eval)                            |
| MMIU           | [This Codebase](./mmiu)                                   |
| MuirBench      | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| MMT-Bench      | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| MIRB           | [This Codebase](./mirb)                                   |

## Real-World Comprehension

| Benchmark Name | Codebase                                                 |
| -------------- | -------------------------------------------------------- |
| RealWorldQA    | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| MME-RealWorld  | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| WildVision     | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| R-Bench        | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
## Comprehensive Multimodal Evaluation

| Benchmark Name | Codebase                                                 |
| -------------- | -------------------------------------------------------- |
| MME            | [This Codebase](./mme)                                    |
| MMBench        | [This Codebase](./mmbench)                                |
| MMBench v1.1   | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| MMVet          | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| MMVet v2       | [This Codebase](./mmvetv2)                                |
| MMStar         | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |

## Multimodal Hallucination Evaluation

| Benchmark Name | Codebase                                                 |
| -------------- | -------------------------------------------------------- |
| HallBench      | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| MMHal-Bench    | [This Codebase](./mmhal)                                  |
| CRPE           | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| POPE           | [This Codebase](./pope)                                   |
## Visual Grounding

| Benchmark Name | Codebase                   |
| -------------- | -------------------------- |
| RefCOCO        | [This Codebase](./refcoco) |
| RefCOCO+       | [This Codebase](./refcoco) |
| RefCOCOg       | [This Codebase](./refcoco) |

## Multimodal Multilingual Understanding

| Benchmark Name       | Codebase                                                 |
| -------------------- | -------------------------------------------------------- |
| MMMB                 | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| Multilingual MMBench | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| MTVQA                | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |

## Video Understanding

| Benchmark Name | Codebase                                                 |
| -------------- | -------------------------------------------------------- |
| Video-MME      | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| MVBench        | [This Codebase](./mvbench)                                |
| MMBench-Video  | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| MLVU           | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| LongVideoBench | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| CG-Bench       | Provided by the benchmark authors                         |
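
Most rows above point to one of two places. For the [VLMEvalKit] entries, evaluation goes through that toolkit's own entry point; the sketch below is illustrative only, and the dataset and model identifiers are assumptions (check VLMEvalKit's documentation for the names it actually registers).

```shell
# Minimal VLMEvalKit sketch -- identifiers are illustrative, not verified against our configs
git clone https://github.com/open-compass/VLMEvalKit && cd VLMEvalKit
pip install -e .
python run.py --data MMMU_DEV_VAL --model InternVL2_5-8B
```

For the [This Codebase] entries, see the linked subfolder's README for the exact commands.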

internvl_chat/eval/caption/README.md (new file, +134 lines)
# README for Evaluation

## 🌟 Overview

This script provides an evaluation pipeline for image captioning across three datasets: `COCO`, `Flickr30k`, and `NoCaps`.

## 🗂️ Data Preparation

Before starting to download the data, please create the `InternVL/internvl_chat/data` folder.
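
For example, from the root of the cloned repository (a minimal sketch of the folder setup described above):

```shell
# Create the shared data folder that the evaluation scripts expect
mkdir -p internvl_chat/data
cd internvl_chat
```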

### COCO Karpathy Test

Follow the instructions below to prepare the data:

```shell
# Step 1: Create the data directory
mkdir -p data/coco && cd data/coco

# Step 2: Download and unzip image files
wget http://images.cocodataset.org/zips/train2014.zip && unzip train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip && unzip val2014.zip
wget http://images.cocodataset.org/zips/test2015.zip && unzip test2015.zip

# Step 3: Download and place the annotation files
mkdir -p annotations && cd annotations/
wget https://github.com/OpenGVLab/InternVL/releases/download/data/coco_karpathy_test.json
wget https://github.com/OpenGVLab/InternVL/releases/download/data/coco_karpathy_test_gt.json

cd ../../..
```

After preparation is complete, the directory structure is:

```shell
data/coco
├── annotations
│   ├── coco_karpathy_test.json
│   └── coco_karpathy_test_gt.json
├── train2014
├── val2014
└── test2015
```
### Flickr30K Karpathy Test

Follow the instructions below to prepare the data:

```shell
# Step 1: Create the data directory
mkdir -p data/flickr30k && cd data/flickr30k

# Step 2: Download and unzip image files
# Download images from https://bryanplummer.com/Flickr30kEntities/

# Step 3: Download and place the annotation files
# Karpathy split annotations can be downloaded from the following link:
wget https://github.com/mehdidc/retrieval_annotations/releases/download/1.0.0/flickr30k_test_karpathy.txt
# This file is provided by the clip-benchmark repository.
# We converted this txt file to JSON format; download the converted file:
wget https://github.com/OpenGVLab/InternVL/releases/download/data/flickr30k_test_karpathy.json

cd ../..
```

After preparation is complete, the directory structure is:

```shell
data/flickr30k
├── Images
├── flickr30k_test_karpathy.txt
└── flickr30k_test_karpathy.json
```
### NoCaps Val

Follow the instructions below to prepare the data:

```shell
# Step 1: Create the data directory
mkdir -p data/nocaps && cd data/nocaps

# Step 2: Download and unzip image files
# Download images from https://nocaps.org/download

# Step 3: Download and place the annotation files
# Original annotations can be downloaded from https://nocaps.s3.amazonaws.com/nocaps_val_4500_captions.json
wget https://nocaps.s3.amazonaws.com/nocaps_val_4500_captions.json

cd ../..
```

After preparation is complete, the directory structure is:

```shell
data/nocaps
├── images
└── nocaps_val_4500_captions.json
```
## 🏃 Evaluation Execution

> ⚠️ Note: For testing InternVL (1.5, 2.0, 2.5, and later versions), always enable `--dynamic` to perform dynamic resolution testing.

To run the evaluation, execute the following command on an 8-GPU setup:

```shell
torchrun --nproc_per_node=8 eval/caption/evaluate_caption.py --checkpoint ${CHECKPOINT} --datasets ${DATASETS} --dynamic
```

Alternatively, you can run the following simplified commands:

```shell
# Test COCO, Flickr30K, and NoCaps
GPUS=8 sh evaluate.sh ${CHECKPOINT} caption --dynamic
# Test COCO only
GPUS=8 sh evaluate.sh ${CHECKPOINT} caption-coco --dynamic
# Test Flickr30K only
GPUS=8 sh evaluate.sh ${CHECKPOINT} caption-flickr30k --dynamic
# Test NoCaps only
GPUS=8 sh evaluate.sh ${CHECKPOINT} caption-nocaps --dynamic
```
### Arguments

The following arguments can be configured for the evaluation script:

| Argument         | Type   | Default                   | Description                                                                                                        |
| ---------------- | ------ | ------------------------- | ------------------------------------------------------------------------------------------------------------------ |
| `--checkpoint`   | `str`  | `''`                      | Path to the model checkpoint.                                                                                      |
| `--datasets`     | `str`  | `'coco,flickr30k,nocaps'` | Comma-separated list of datasets to evaluate.                                                                      |
| `--dynamic`      | `flag` | `False`                   | Enables dynamic high resolution preprocessing.                                                                     |
| `--max-num`      | `int`  | `6`                       | Maximum tile number for dynamic high resolution.                                                                   |
| `--load-in-8bit` | `flag` | `False`                   | Loads the model weights in 8-bit precision.                                                                        |
| `--auto`         | `flag` | `False`                   | Automatically splits a large model across 8 GPUs when needed, useful for models too large to fit on a single GPU.  |
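
As an illustration of how the optional flags combine, the sketch below raises the tile budget for dynamic high resolution. Whether `evaluate.sh` forwards such extra flags to the underlying Python script is an assumption here; if it does not, append them to the `torchrun` command instead. For checkpoints too large for a single GPU, `--auto` can be added in the same way.

```shell
# Illustrative only: evaluate with a larger tile budget (12 tiles instead of the default 6)
GPUS=8 sh evaluate.sh ${CHECKPOINT} caption --dynamic --max-num 12
```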
New file (+83 lines): LLaVA-Bench evaluation README
# README for Evaluation

## 🌟 Overview

This script provides an evaluation pipeline for `LLaVA-Bench`.

For scoring, we use **GPT-4-0613** as the evaluation model.
While the provided code can run the benchmark, we recommend using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) if you aim to align results with our technical report.

## 🗂️ Data Preparation

Before starting to download the data, please create the `InternVL/internvl_chat/data` folder.
### LLaVA-Bench

Follow the instructions below to prepare the data:

```shell
# Step 1: Download the dataset
cd data/
git clone https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild
cd ../
```

After preparation is complete, the directory structure is:

```shell
data/llava-bench-in-the-wild
├── images
├── answers_gpt4.jsonl
├── bard_0718.jsonl
├── bing_chat_0629.jsonl
├── context.jsonl
├── questions.jsonl
└── README.md
```
## 🏃 Evaluation Execution

> ⚠️ Note: For testing InternVL (1.5, 2.0, 2.5, and later versions), always enable `--dynamic` to perform dynamic resolution testing.

To run the evaluation, execute the following commands on a 1-GPU setup:

```shell
# Step 1: Remove old inference results if they exist
rm -rf results/llava_bench_results_review.jsonl

# Step 2: Run the evaluation
torchrun --nproc_per_node=1 eval/llava_bench/evaluate_llava_bench.py --checkpoint ${CHECKPOINT} --dynamic

# Step 3: Score the results using gpt-4-0613
export OPENAI_API_KEY="your_openai_api_key"
python -u eval/llava_bench/eval_gpt_review_bench.py \
    --question data/llava-bench-in-the-wild/questions.jsonl \
    --context data/llava-bench-in-the-wild/context.jsonl \
    --rule eval/llava_bench/rule.json \
    --answer-list \
        data/llava-bench-in-the-wild/answers_gpt4.jsonl \
        results/llava_bench_results.jsonl \
    --output \
        results/llava_bench_results_review.jsonl
python -u eval/llava_bench/summarize_gpt_review.py -f results/llava_bench_results_review.jsonl
```

Alternatively, you can run the following simplified commands:

```shell
export OPENAI_API_KEY="your_openai_api_key"
GPUS=1 sh evaluate.sh ${CHECKPOINT} llava-bench --dynamic
```
### Arguments

The following arguments can be configured for the evaluation script:

| Argument         | Type   | Default         | Description                                                                                                        |
| ---------------- | ------ | --------------- | ------------------------------------------------------------------------------------------------------------------ |
| `--checkpoint`   | `str`  | `''`            | Path to the model checkpoint.                                                                                      |
| `--datasets`     | `str`  | `'llava_bench'` | Comma-separated list of datasets to evaluate.                                                                      |
| `--dynamic`      | `flag` | `False`         | Enables dynamic high resolution preprocessing.                                                                     |
| `--max-num`      | `int`  | `6`             | Maximum tile number for dynamic high resolution.                                                                   |
| `--load-in-8bit` | `flag` | `False`         | Loads the model weights in 8-bit precision.                                                                        |
| `--auto`         | `flag` | `False`         | Automatically splits a large model across 8 GPUs when needed, useful for models too large to fit on a single GPU.  |
New file (+42 lines): Mantis-Eval evaluation README
# README for Evaluation

## 🌟 Overview

This script provides an evaluation pipeline for `Mantis-Eval`.

## 🗂️ Data Preparation

Before starting to download the data, please create the `InternVL/internvl_chat/data` folder.

### Mantis-Eval

The evaluation script automatically downloads the Mantis-Eval dataset from Hugging Face; the cached path is `data/mantis_eval`.
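
If you prefer to fetch the data ahead of time (for example, on a node without internet access at evaluation time), the sketch below is one option. Both the dataset repo id `TIGER-Lab/Mantis-Eval` and the assumption that the evaluation script will pick up a copy placed at `data/mantis_eval` are unverified here; the automatic download above remains the supported path.

```shell
# Optional pre-download sketch -- repo id and local layout are assumptions, not taken from this repo
pip install -U "huggingface_hub[cli]"
huggingface-cli download TIGER-Lab/Mantis-Eval --repo-type dataset --local-dir data/mantis_eval
```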

## 🏃 Evaluation Execution

> ⚠️ Note: For testing InternVL (1.5, 2.0, 2.5, and later versions), always enable `--dynamic` to perform dynamic resolution testing.

To run the evaluation, execute the following command on an 8-GPU setup:

```shell
torchrun --nproc_per_node=8 eval/mantis_eval/evaluate_mantis.py --checkpoint ${CHECKPOINT} --dynamic
```

Alternatively, you can run the following simplified command:

```shell
GPUS=8 sh evaluate.sh ${CHECKPOINT} mantis --dynamic
```
### Arguments

The following arguments can be configured for the evaluation script:

| Argument         | Type   | Default         | Description                                                                                                        |
| ---------------- | ------ | --------------- | ------------------------------------------------------------------------------------------------------------------ |
| `--checkpoint`   | `str`  | `''`            | Path to the model checkpoint.                                                                                      |
| `--datasets`     | `str`  | `'Mantis-Eval'` | Comma-separated list of datasets to evaluate.                                                                      |
| `--dynamic`      | `flag` | `False`         | Enables dynamic high resolution preprocessing.                                                                     |
| `--max-num`      | `int`  | `6`             | Maximum tile number for dynamic high resolution.                                                                   |
| `--load-in-8bit` | `flag` | `False`         | Loads the model weights in 8-bit precision.                                                                        |
| `--auto`         | `flag` | `False`         | Automatically splits a large model across 8 GPUs when needed, useful for models too large to fit on a single GPU.  |

internvl_chat/eval/mantis_eval/evaluate_mantis.py (+1 −1)

```diff
@@ -75,7 +75,7 @@ def __getitem__(self, idx):
                 images += tiles
                 num_patches_list.append(len(tiles))
         else:
-            images = [image]
+            images = image_list
             num_patches_list.append(1)
         pixel_values = [self.transform(image) for image in images]
         pixel_values = torch.stack(pixel_values)
```
