
Commit 2d57e21

Update InternVL 2.5 README (#798)
* Update InternVL 2.5 evaluation code
* Update README.md
1 parent e8dd6f8 commit 2d57e21

38 files changed: +6777 −259 lines

README.md (+114 −99; large diff not rendered)

README_zh.md (+118 −102; large diff not rendered)

internvl_chat/README.md (+618 −32; large diff not rendered)

internvl_chat/eval/README.md (new file, +95 lines)
# README for Evaluation

Here we list the codebases we used to obtain the evaluation results in the InternVL 2.5 technical report.
## Multimodal Reasoning and Mathematics

| Benchmark Name | Codebase                                                 |
| -------------- | -------------------------------------------------------- |
| MMMU           | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| MMMU-Pro       | [This Codebase](./mmmu_pro)                               |
| MathVista      | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| MATH-Vision    | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| MathVerse      | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| OlympiadBench  | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
## OCR, Chart, and Document Understanding

| Benchmark Name    | Codebase                                                 |
| ----------------- | -------------------------------------------------------- |
| AI2D with mask    | [This Codebase](./vqa)                                    |
| AI2D without mask | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| ChartQA           | [This Codebase](./vqa)                                    |
| DocVQA            | [This Codebase](./vqa)                                    |
| InfoVQA           | [This Codebase](./vqa)                                    |
| OCRBench          | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| SEED-2-Plus       | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| CharXiv           | [CharXiv](https://github.com/princeton-nlp/CharXiv)       |
| VCR               | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
## Multi-Image Understanding

| Benchmark Name | Codebase                                                 |
| -------------- | -------------------------------------------------------- |
| BLINK          | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| Mantis Eval    | [This Codebase](./mantis_eval)                            |
| MMIU           | [This Codebase](./mmiu)                                   |
| MuirBench      | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| MMT-Bench      | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| MIRB           | [This Codebase](./mirb)                                   |

## Real-World Comprehension

| Benchmark Name | Codebase                                                 |
| -------------- | -------------------------------------------------------- |
| RealWorldQA    | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| MME-RealWorld  | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| WildVision     | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| R-Bench        | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
## Comprehensive Multimodal Evaluation

| Benchmark Name | Codebase                                                 |
| -------------- | -------------------------------------------------------- |
| MME            | [This Codebase](./mme)                                    |
| MMBench        | [This Codebase](./mmbench)                                |
| MMBench v1.1   | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| MMVet          | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| MMVet v2       | [This Codebase](./mmvetv2)                                |
| MMStar         | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |

## Multimodal Hallucination Evaluation

| Benchmark Name | Codebase                                                 |
| -------------- | -------------------------------------------------------- |
| HallBench      | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| MMHal-Bench    | [This Codebase](./mmhal)                                  |
| CRPE           | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| POPE           | [This Codebase](./pope)                                   |
## Visual Grounding

| Benchmark Name | Codebase                   |
| -------------- | -------------------------- |
| RefCOCO        | [This Codebase](./refcoco) |
| RefCOCO+       | [This Codebase](./refcoco) |
| RefCOCOg       | [This Codebase](./refcoco) |

## Multimodal Multilingual Understanding

| Benchmark Name       | Codebase                                                 |
| -------------------- | -------------------------------------------------------- |
| MMMB                 | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| Multilingual MMBench | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| MTVQA                | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |

## Video Understanding

| Benchmark Name | Codebase                                                 |
| -------------- | -------------------------------------------------------- |
| Video-MME      | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| MVBench        | [This Codebase](./mvbench)                                |
| MMBench-Video  | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| MLVU           | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| LongVideoBench | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| CG-Bench       | Provided by the benchmark authors                         |
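
Most rows above point to one of two places. For the [VLMEvalKit] entries, evaluation goes through that toolkit's own entry point; the sketch below is illustrative only, and the dataset and model identifiers are assumptions (check VLMEvalKit's documentation for the names it actually registers).

```shell
# Minimal VLMEvalKit sketch -- identifiers are illustrative, not verified against our configs
git clone https://github.com/open-compass/VLMEvalKit && cd VLMEvalKit
pip install -e .
python run.py --data MMMU_DEV_VAL --model InternVL2_5-8B
```

For the [This Codebase] entries, see the linked subfolder's README for the exact commands.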

internvl_chat/eval/caption/README.md (new file, +134 lines)
# README for Evaluation

## 🌟 Overview

This script provides an evaluation pipeline for image captioning across three datasets: `COCO`, `Flickr30k`, and `NoCaps`.

## 🗂️ Data Preparation

Before starting to download the data, please create the `InternVL/internvl_chat/data` folder.
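
For example, from the root of the cloned repository (a minimal sketch of the folder setup described above):

```shell
# Create the shared data folder that the evaluation scripts expect
mkdir -p internvl_chat/data
cd internvl_chat
```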

### COCO Karpathy Test

Follow the instructions below to prepare the data:

```shell
# Step 1: Create the data directory
mkdir -p data/coco && cd data/coco

# Step 2: Download and unzip image files
wget http://images.cocodataset.org/zips/train2014.zip && unzip train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip && unzip val2014.zip
wget http://images.cocodataset.org/zips/test2015.zip && unzip test2015.zip

# Step 3: Download and place the annotation files
mkdir -p annotations && cd annotations/
wget https://github.com/OpenGVLab/InternVL/releases/download/data/coco_karpathy_test.json
wget https://github.com/OpenGVLab/InternVL/releases/download/data/coco_karpathy_test_gt.json

cd ../../..
```

After preparation is complete, the directory structure is:

```shell
data/coco
├── annotations
│   ├── coco_karpathy_test.json
│   └── coco_karpathy_test_gt.json
├── train2014
├── val2014
└── test2015
```
### Flickr30K Karpathy Test

Follow the instructions below to prepare the data:

```shell
# Step 1: Create the data directory
mkdir -p data/flickr30k && cd data/flickr30k

# Step 2: Download and unzip image files
# Download images from https://bryanplummer.com/Flickr30kEntities/

# Step 3: Download and place the annotation files
# Karpathy split annotations can be downloaded from the following link:
wget https://github.com/mehdidc/retrieval_annotations/releases/download/1.0.0/flickr30k_test_karpathy.txt
# This file is provided by the clip-benchmark repository.
# We converted this txt file to JSON format; download the converted file:
wget https://github.com/OpenGVLab/InternVL/releases/download/data/flickr30k_test_karpathy.json

cd ../..
```

After preparation is complete, the directory structure is:

```shell
data/flickr30k
├── Images
├── flickr30k_test_karpathy.txt
└── flickr30k_test_karpathy.json
```
### NoCaps Val

Follow the instructions below to prepare the data:

```shell
# Step 1: Create the data directory
mkdir -p data/nocaps && cd data/nocaps

# Step 2: Download and unzip image files
# Download images from https://nocaps.org/download

# Step 3: Download and place the annotation files
# Original annotations can be downloaded from https://nocaps.s3.amazonaws.com/nocaps_val_4500_captions.json
wget https://nocaps.s3.amazonaws.com/nocaps_val_4500_captions.json

cd ../..
```

After preparation is complete, the directory structure is:

```shell
data/nocaps
├── images
└── nocaps_val_4500_captions.json
```
## 🏃 Evaluation Execution

> ⚠️ Note: For testing InternVL (1.5, 2.0, 2.5, and later versions), always enable `--dynamic` to perform dynamic resolution testing.

To run the evaluation, execute the following command on an 8-GPU setup:

```shell
torchrun --nproc_per_node=8 eval/caption/evaluate_caption.py --checkpoint ${CHECKPOINT} --datasets ${DATASETS} --dynamic
```

Alternatively, you can run the following simplified commands:

```shell
# Test COCO, Flickr30K, and NoCaps
GPUS=8 sh evaluate.sh ${CHECKPOINT} caption --dynamic
# Test COCO only
GPUS=8 sh evaluate.sh ${CHECKPOINT} caption-coco --dynamic
# Test Flickr30K only
GPUS=8 sh evaluate.sh ${CHECKPOINT} caption-flickr30k --dynamic
# Test NoCaps only
GPUS=8 sh evaluate.sh ${CHECKPOINT} caption-nocaps --dynamic
```
### Arguments

The following arguments can be configured for the evaluation script:

| Argument         | Type   | Default                   | Description                                                                                                        |
| ---------------- | ------ | ------------------------- | ------------------------------------------------------------------------------------------------------------------ |
| `--checkpoint`   | `str`  | `''`                      | Path to the model checkpoint.                                                                                      |
| `--datasets`     | `str`  | `'coco,flickr30k,nocaps'` | Comma-separated list of datasets to evaluate.                                                                      |
| `--dynamic`      | `flag` | `False`                   | Enables dynamic high resolution preprocessing.                                                                     |
| `--max-num`      | `int`  | `6`                       | Maximum tile number for dynamic high resolution.                                                                   |
| `--load-in-8bit` | `flag` | `False`                   | Loads the model weights in 8-bit precision.                                                                        |
| `--auto`         | `flag` | `False`                   | Automatically splits a large model across 8 GPUs when needed, useful for models too large to fit on a single GPU.  |
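
As an illustration of how the optional flags combine, the sketch below raises the tile budget for dynamic high resolution. Whether `evaluate.sh` forwards such extra flags to the underlying Python script is an assumption here; if it does not, append them to the `torchrun` command instead. For checkpoints too large for a single GPU, `--auto` can be added in the same way.

```shell
# Illustrative only: evaluate with a larger tile budget (12 tiles instead of the default 6)
GPUS=8 sh evaluate.sh ${CHECKPOINT} caption --dynamic --max-num 12
```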
New file (+83 lines): LLaVA-Bench evaluation README
# README for Evaluation

## 🌟 Overview

This script provides an evaluation pipeline for `LLaVA-Bench`.

For scoring, we use **GPT-4-0613** as the evaluation model.
While the provided code can run the benchmark, we recommend using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) if you aim to align results with our technical report.

## 🗂️ Data Preparation

Before starting to download the data, please create the `InternVL/internvl_chat/data` folder.
### LLaVA-Bench

Follow the instructions below to prepare the data:

```shell
# Step 1: Download the dataset
cd data/
git clone https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild
cd ../
```

After preparation is complete, the directory structure is:

```shell
data/llava-bench-in-the-wild
├── images
├── answers_gpt4.jsonl
├── bard_0718.jsonl
├── bing_chat_0629.jsonl
├── context.jsonl
├── questions.jsonl
└── README.md
```
## 🏃 Evaluation Execution

> ⚠️ Note: For testing InternVL (1.5, 2.0, 2.5, and later versions), always enable `--dynamic` to perform dynamic resolution testing.

To run the evaluation, execute the following commands on a 1-GPU setup:

```shell
# Step 1: Remove old inference results if they exist
rm -rf results/llava_bench_results_review.jsonl

# Step 2: Run the evaluation
torchrun --nproc_per_node=1 eval/llava_bench/evaluate_llava_bench.py --checkpoint ${CHECKPOINT} --dynamic

# Step 3: Score the results using gpt-4-0613
export OPENAI_API_KEY="your_openai_api_key"
python -u eval/llava_bench/eval_gpt_review_bench.py \
    --question data/llava-bench-in-the-wild/questions.jsonl \
    --context data/llava-bench-in-the-wild/context.jsonl \
    --rule eval/llava_bench/rule.json \
    --answer-list \
        data/llava-bench-in-the-wild/answers_gpt4.jsonl \
        results/llava_bench_results.jsonl \
    --output \
        results/llava_bench_results_review.jsonl
python -u eval/llava_bench/summarize_gpt_review.py -f results/llava_bench_results_review.jsonl
```

Alternatively, you can run the following simplified commands:

```shell
export OPENAI_API_KEY="your_openai_api_key"
GPUS=1 sh evaluate.sh ${CHECKPOINT} llava-bench --dynamic
```
### Arguments

The following arguments can be configured for the evaluation script:

| Argument         | Type   | Default         | Description                                                                                                        |
| ---------------- | ------ | --------------- | ------------------------------------------------------------------------------------------------------------------ |
| `--checkpoint`   | `str`  | `''`            | Path to the model checkpoint.                                                                                      |
| `--datasets`     | `str`  | `'llava_bench'` | Comma-separated list of datasets to evaluate.                                                                      |
| `--dynamic`      | `flag` | `False`         | Enables dynamic high resolution preprocessing.                                                                     |
| `--max-num`      | `int`  | `6`             | Maximum tile number for dynamic high resolution.                                                                   |
| `--load-in-8bit` | `flag` | `False`         | Loads the model weights in 8-bit precision.                                                                        |
| `--auto`         | `flag` | `False`         | Automatically splits a large model across 8 GPUs when needed, useful for models too large to fit on a single GPU.  |
New file (+42 lines): Mantis-Eval evaluation README
# README for Evaluation

## 🌟 Overview

This script provides an evaluation pipeline for `Mantis-Eval`.

## 🗂️ Data Preparation

Before starting to download the data, please create the `InternVL/internvl_chat/data` folder.

### Mantis-Eval

The evaluation script automatically downloads the Mantis-Eval dataset from Hugging Face; the cached path is `data/mantis_eval`.
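
If you prefer to fetch the data ahead of time (for example, on a node without internet access at evaluation time), the sketch below is one option. Both the dataset repo id `TIGER-Lab/Mantis-Eval` and the assumption that the evaluation script will pick up a copy placed at `data/mantis_eval` are unverified here; the automatic download above remains the supported path.

```shell
# Optional pre-download sketch -- repo id and local layout are assumptions, not taken from this repo
pip install -U "huggingface_hub[cli]"
huggingface-cli download TIGER-Lab/Mantis-Eval --repo-type dataset --local-dir data/mantis_eval
```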

## 🏃 Evaluation Execution

> ⚠️ Note: For testing InternVL (1.5, 2.0, 2.5, and later versions), always enable `--dynamic` to perform dynamic resolution testing.

To run the evaluation, execute the following command on an 8-GPU setup:

```shell
torchrun --nproc_per_node=8 eval/mantis_eval/evaluate_mantis.py --checkpoint ${CHECKPOINT} --dynamic
```

Alternatively, you can run the following simplified command:

```shell
GPUS=8 sh evaluate.sh ${CHECKPOINT} mantis --dynamic
```
### Arguments

The following arguments can be configured for the evaluation script:

| Argument         | Type   | Default         | Description                                                                                                        |
| ---------------- | ------ | --------------- | ------------------------------------------------------------------------------------------------------------------ |
| `--checkpoint`   | `str`  | `''`            | Path to the model checkpoint.                                                                                      |
| `--datasets`     | `str`  | `'Mantis-Eval'` | Comma-separated list of datasets to evaluate.                                                                      |
| `--dynamic`      | `flag` | `False`         | Enables dynamic high resolution preprocessing.                                                                     |
| `--max-num`      | `int`  | `6`             | Maximum tile number for dynamic high resolution.                                                                   |
| `--load-in-8bit` | `flag` | `False`         | Loads the model weights in 8-bit precision.                                                                        |
| `--auto`         | `flag` | `False`         | Automatically splits a large model across 8 GPUs when needed, useful for models too large to fit on a single GPU.  |

internvl_chat/eval/mantis_eval/evaluate_mantis.py (+1 −1)

```diff
@@ -75,7 +75,7 @@ def __getitem__(self, idx):
                 images += tiles
                 num_patches_list.append(len(tiles))
         else:
-            images = [image]
+            images = image_list
             num_patches_list.append(1)
         pixel_values = [self.transform(image) for image in images]
         pixel_values = torch.stack(pixel_values)
```
