Data preprocessing error: ArrowInvalid: offset overflow while concatenating arrays #402

Open

Description

@YLiShu

Hi, could I ask for some help? I have about 10k samples, and when I preprocess them with the function below I get the error ArrowInvalid: offset overflow while concatenating arrays. How should I handle this? Thanks a lot.

import torch
from qwen_vl_utils import process_vision_info  # helper from the Qwen2-VL examples; `processor` and `tokenizer` are assumed to be loaded elsewhere

def extract_text_after_img_tag(input_string):
    # Split the string on the closing </img> tag and take the text that follows it
    parts = input_string.split("</img>")
    if len(parts) > 1:
        # Use the part after the tag, stripped of surrounding whitespace
        text_after_img = parts[1].strip()
        return text_after_img
    return ""

def process_func(example):
    """
    将数据集进行预处理
    """
    MAX_LENGTH = 8192
    input_ids, attention_mask, labels = [], [], []
    conversation = example["conversations"]
    input_content = conversation[0]["value"]
    output_content = conversation[1]["value"]
    file_path = input_content.split("<img>")[1].split("</img>")[0]  # extract the image path between <img> and </img>
    text_after_img = extract_text_after_img_tag(input_content)
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": f"{file_path}",
                    "resized_height": 1280,
                    "resized_width": 1280,
                },
                {"type": "text", "text": text_after_img},
            ],
        }
    ]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )  # build the chat-formatted prompt text
    image_inputs, video_inputs = process_vision_info(messages)  # load and preprocess the image/video inputs
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = {key: value.tolist() for key, value in inputs.items()}  # tensor -> list, to make concatenation easier
    instruction = inputs

    response = tokenizer(f"{output_content}", add_special_tokens=False)


    input_ids = (
            instruction["input_ids"][0] + response["input_ids"] + [tokenizer.pad_token_id]
    )

    attention_mask = instruction["attention_mask"][0] + response["attention_mask"] + [1]
    labels = (
            [-100] * len(instruction["input_ids"][0])
            + response["input_ids"]
            + [tokenizer.pad_token_id]
    )
    if len(input_ids) > MAX_LENGTH:  # truncate sequences that are too long
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]

    input_ids = torch.tensor(input_ids)
    attention_mask = torch.tensor(attention_mask)
    labels = torch.tensor(labels)
    inputs['pixel_values'] = torch.tensor(inputs['pixel_values'])
    inputs['image_grid_thw'] = torch.tensor(inputs['image_grid_thw']).squeeze(0)  # reshape from (1, h, w) to (h, w)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels,
            "pixel_values": inputs['pixel_values'], "image_grid_thw": inputs['image_grid_thw']}
