
Commit 6a67aa2

feat: filter experiments in UI (#473)
- feat: filtering in experiments
- fix index
- fix
1 parent 8ddd784 commit 6a67aa2

File tree

6 files changed: +94 -0 lines changed

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
import {
  CodeTabs,
  python,
  typescript,
  PythonBlock,
  TypeScriptBlock,
} from "@site/src/components/InstructionsWithCode";

# Filter experiments in the UI

LangSmith lets you filter your previous experiments by feedback scores and metadata to make it easy
to find only the experiments you care about.

## Background: add metadata to your experiments

When you run an experiment in the SDK, you can attach metadata to make it easier to filter in the UI. This
is helpful if you know up front which axes you want to drill down into when comparing experiments.

In our example, we are going to attach metadata to our experiment describing the model used, the model provider,
and a known ID of the prompt:

<CodeTabs
  groupId="client-language"
  tabs={[
    python`
      # Imports assumed for this example
      from langchain import hub
      from langchain_anthropic import ChatAnthropic
      from langchain_openai import ChatOpenAI
      from langsmith.evaluation import evaluate

      models = {
          "openai-gpt-4o": ChatOpenAI(model="gpt-4o", temperature=0),
          "openai-gpt-3.5-turbo": ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
          "anthropic-claude-3-sonnet-20240229": ChatAnthropic(temperature=0, model_name="claude-3-sonnet-20240229")
      }
      prompts = {
          "singleminded": "always answer questions with the word banana.",
          "fruitminded": "always discuss fruit in your answers.",
          "basic": "you are a chatbot."
      }

      def answer_evaluator(run, example) -> dict:
          llm = ChatOpenAI(model="gpt-4o", temperature=0)
          answer_grader = hub.pull("langchain-ai/rag-answer-vs-reference") | llm

          score = answer_grader.invoke(
              {
                  "question": example.inputs["question"],
                  "correct_answer": example.outputs["answer"],
                  "student_answer": run.outputs,
              }
          )
          return {"key": "correctness", "score": score["Score"]}

      dataset_name = "Filterable Dataset"
      for model_type, model in models.items():
          for prompt_type, prompt in prompts.items():

              def predict(example):
                  return model.invoke(
                      [("system", prompt), ("user", example["question"])]
                  )

              model_provider = model_type.split("-")[0]
              model_name = model_type[len(model_provider) + 1:]

              evaluate(
                  predict,
                  data=dataset_name,
                  evaluators=[answer_evaluator],
                  # ADD IN METADATA HERE!!
                  metadata={
                      "model_provider": model_provider,
                      "model_name": model_name,
                      "prompt_id": prompt_type
                  }
              )
    `,
  ]}
/>

## Filter experiments in the UI

By default, the UI shows every experiment that has been run against the dataset.

![](./static/filter-all-experiments.png)

If we, say, have a preference for OpenAI models, we can easily filter down and compare scores across just the
OpenAI models first:

![](./static/filter-openai.png)

We can stack filters, allowing us to also filter out low correctness scores so that we only compare
relevant experiments:

![](./static/filter-feedback.png)

Finally, we can clear and reset filters. For example, if we see there's a clear winner with the
`singleminded` prompt, we can change the filter settings to see whether any other model providers' models work
as well with it:

![](./static/filter-singleminded.png)
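
If you also want to narrow experiments down outside the UI, the same metadata is available from the SDK. The snippet below is a rough sketch, assuming the `langsmith` Python client's `list_projects` accepts a `reference_dataset_name` filter and that each returned experiment exposes the attached metadata via a `metadata` attribute:

```python
from langsmith import Client

client = Client()

# List experiments (projects) recorded against our dataset.
# Assumption: list_projects supports filtering by reference dataset name.
experiments = client.list_projects(reference_dataset_name="Filterable Dataset")

# Keep only the runs whose attached metadata marks them as OpenAI experiments.
# Assumption: the metadata we attached via evaluate() is exposed as `metadata`.
openai_experiments = [
    experiment
    for experiment in experiments
    if (experiment.metadata or {}).get("model_provider") == "openai"
]

for experiment in openai_experiments:
    print(experiment.name)
```

The same idea extends to the other metadata keys (`model_name`, `prompt_id`), or to combining them, mirroring the stacked filters shown above.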

versioned_docs/version-2.0/how_to_guides/index.md

Lines changed: 1 addition & 0 deletions
@@ -149,6 +149,7 @@ Evaluate your LLM applications to measure their performance over time.
 - [Open a trace](./how_to_guides/evaluation/compare_experiment_results#open-a-trace)
 - [Expand detailed view](./how_to_guides/evaluation/compare_experiment_results#expand-detailed-view)
 - [Update display settings](./how_to_guides/evaluation/compare_experiment_results#update-display-settings)
+- [Filter experiments in the UI](./how_to_guides/evaluation/filter_experiments_ui)
 - [Evaluate an existing experiment](./how_to_guides/evaluation/evaluate_existing_experiment)
 - [Unit test LLM applications (Python only)](./how_to_guides/evaluation/unit_testing)
 - [Run pairwise evaluations](./how_to_guides/evaluation/evaluate_pairwise)
