Instructor-style models (NV-Embed, e5-large-instruct, SFR-Embedding-2_R, etc.)

Hi,

How would you go about training Instructor-style models that also take a prompt?

I am particularly interested in the clustering task, so my current approach looks like this:

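```
from datasets import Dataset
from setfit import sample_dataset
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

instruction = "Identify the topic or theme of the given news articles"
model_name = "intfloat/multilingual-e5-large-instruct"

# Assumption: docs is the 20 newsgroups corpus loaded via scikit-learn
# (the loading step was not part of the original snippet).
docs = fetch_20newsgroups()

def get_detailed_instruct(task_description: str, query: str) -> str:
    # Prompt format taken from the multilingual-e5-large-instruct model card.
    return f"Instruct: {task_description.strip()}\nQuery: {query}"

# Prepend the instruction to every document.
data = [get_detailed_instruct(instruction, x) for x in docs.data]
labels = docs.target

# Hold out 20% for evaluation, stratified by label.
X_train_full, X_eval, y_train_full, y_eval = train_test_split(
    data, labels, test_size=0.2, random_state=42, stratify=labels
)

train_dataset = Dataset.from_dict(
    {
        "text": X_train_full,
        "label": y_train_full,
    }
)
# Few-shot training set: 3 examples per label.
train_3_shot = sample_dataset(
    train_dataset, label_column="label", num_samples=3, seed=42
)
```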

So I am basically just prepending the instruction to the query. This seems to be the correct way according to https://huggingface.co/intfloat/multilingual-e5-large-instruct.

However, on the 20_newsgroup dataset I get worse results with the instruction than without :O Isn't this the correct way?

Do you guys have any suggestions?

EDIT: I just realized that the training probably needs to be adapted. If I understand correctly, the kind of models I am interested in are basically trained LoRA adapters. The original weights of the underlying base LLM (most of the time Mistral) are not changed at all.
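
To make the "base weights are not changed" idea concrete, here is a minimal pure-Python sketch of a single LoRA-style linear layer. The class `LoRALinear`, its method names, and the sizes used are hypothetical and only for illustration; this is not how any particular library implements it. The frozen weight `W` stands in for a pretrained layer, and only the small factors `A` and `B` would be updated during training.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Illustrative layer computing W x + (alpha / rank) * B (A x)."""

    def __init__(self, d_in, d_out, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # Frozen base weight: never updated when training the adapter.
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable low-rank factors. B starts at zero, so the adapter
        # initially leaves the base model's output unchanged.
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_parameters(self):
        # Only A and B count as trainable.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_parameters(self):
        return len(self.W) * len(self.W[0])


layer = LoRALinear(d_in=64, d_out=64, rank=4)
x = [1.0] * 64
print(layer.trainable_parameters(), "trainable vs", layer.frozen_parameters(), "frozen")
```

With these toy sizes the adapter has far fewer parameters than the frozen weight, which is the reason fine-tuning only the adapter is attractive for large base models.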

SetFit, by default, adjusts all model weights. So for this to work correctly, I would only need to modify the LoRA adapter weights with the "SetFit strategy" (data sampling).
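
For reference, what I mean by the "SetFit strategy" is roughly this: build contrastive pairs from the few labeled examples, with same-label pairs as positives and different-label pairs as negatives, and fine-tune the embedding model on those pairs. Below is a simplified sketch of just the pair generation. The function `make_contrastive_pairs` and the toy data are hypothetical, and SetFit's actual sampler has its own balancing options, so this is not its implementation.

```python
import random
from itertools import combinations


def make_contrastive_pairs(texts, labels, seed=42):
    """Return a balanced, shuffled list of (text_a, text_b, similarity) pairs."""
    positives, negatives = [], []
    for (t1, l1), (t2, l2) in combinations(zip(texts, labels), 2):
        if l1 == l2:
            positives.append((t1, t2, 1.0))  # same label: pull together
        else:
            negatives.append((t1, t2, 0.0))  # different label: push apart
    rng = random.Random(seed)
    # Balance by downsampling the larger group (an illustrative choice).
    n = min(len(positives), len(negatives))
    pairs = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(pairs)
    return pairs


# Toy 3-shot example with two classes.
texts = ["goal scored", "match won", "team lost", "gpu released", "cpu benchmark", "new laptop"]
labels = [0, 0, 0, 1, 1, 1]
pairs = make_contrastive_pairs(texts, labels)
print(len(pairs), "pairs")
```

The resulting pairs could then be fed to any contrastive or cosine-similarity loss, with the optimizer restricted to the adapter parameters.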

This is probably out of scope for SetFit (for now)? In any case, do you have any recommendations on how to achieve this?

Instructor-style models (NV-Embed, e5-large-instruct, SFR-Embedding-2_R, etc.) #604

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Instructor-style models (NV-Embed, e5-large-instruct, SFR-Embedding-2_R, etc.) #604

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions