Instructor-style models (NV-Embed, e5-large-instruct, SFR-Embedding-2_R, etc.) #604

Open
@bigabig

Hi,

How would you go about training Instructor-style models that also take a prompt?

I am particularly interested in the clustering task, so my current approach looks like this:

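from datasets import Dataset
from setfit import sample_dataset
from sklearn.model_selection import train_test_split

instruction = "Identify the topic or theme of the given news articles"
model_name = "intfloat/multilingual-e5-large-instruct"

def get_detailed_instruct(task_description: str, query: str) -> str:
    # Prompt format used on the model card: "Instruct: ...\nQuery: ..."
    return f"Instruct: {task_description.strip()}\nQuery: {query}"

# docs: the loaded dataset (texts in docs.data, labels in docs.target)
data = [get_detailed_instruct(instruction, x) for x in docs.data]
labels = docs.target

# Stratified 80/20 split into training and evaluation sets
X_train_full, X_eval, y_train_full, y_eval = train_test_split(
    data, labels, test_size=0.2, random_state=42, stratify=labels
)

train_dataset = Dataset.from_dict(
    {
        "text": X_train_full,
        "label": y_train_full,
    }
)
# Few-shot training set: 3 examples per label
train_3_shot = sample_dataset(
    train_dataset, label_column="label", num_samples=3, seed=42
)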

So I am basically just prepending the instruction to the query. This seems to be the correct way according to https://huggingface.co/intfloat/multilingual-e5-large-instruct.

However, on the 20_newsgroup dataset I get worse results with the instruction than without :O Isn't this the correct way?

Do you guys have any suggestions?

EDIT: I just realized that the training probably needs to be adapted. If I understand correctly, the kind of models I am interested in are basically trained LoRA adapters: the original weights of the underlying base LLM (most of the time Mistral) are not changed at all.

SetFit, by default, adjusts all model weights. So for this to work correctly, I would need to update only the LoRA adapter weights while keeping the "SetFit strategy" (data sampling).

This is probably out of scope for SetFit (for now)? In any case, do you have any recommendations on how to achieve this?
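
To make the "SetFit strategy" I mean above concrete: as far as I understand it, the idea is to turn a few labeled examples into many text pairs for contrastive training, where same-label pairs are positives and different-label pairs are negatives. Below is a minimal pure-Python sketch under that assumption. The helper name `sample_contrastive_pairs` and the toy texts are hypothetical, and this is not SetFit's actual sampler, only an illustration of the kind of pair generation I would want to reuse while updating only the adapter weights.

```python
import random
from itertools import combinations


def sample_contrastive_pairs(texts, labels, seed=42):
    """Hypothetical helper: build balanced (text_a, text_b, similarity) pairs.

    Same-label pairs get similarity 1.0, different-label pairs get 0.0.
    """
    positives, negatives = [], []
    for (text_a, label_a), (text_b, label_b) in combinations(zip(texts, labels), 2):
        if label_a == label_b:
            positives.append((text_a, text_b, 1.0))
        else:
            negatives.append((text_a, text_b, 0.0))

    # Oversample the smaller group so positives and negatives are equally frequent.
    rng = random.Random(seed)
    target = max(len(positives), len(negatives))
    for group in (positives, negatives):
        if group:
            group.extend(rng.choices(group, k=target - len(group)))

    pairs = positives + negatives
    rng.shuffle(pairs)
    return pairs


# Toy few-shot set: 2 classes with 2 examples each (made-up texts).
texts = ["goal scored", "match tonight", "new GPU released", "CPU benchmark"]
labels = [0, 0, 1, 1]
pairs = sample_contrastive_pairs(texts, labels)
```

In the setup from this issue, the texts would be the instruction-prefixed strings, and the resulting pairs would feed a contrastive objective (for example a cosine-similarity loss) whose gradients only reach the adapter parameters. How best to wire that into SetFit is exactly the open question here.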
