
Commit 28cac05

Documentation for data export (#475)
2 parents 6a67aa2 + 88fcc58 commit 28cac05

File tree

3 files changed: +201 −2 lines


versioned_docs/version-2.0/how_to_guides/index.md

Lines changed: 1 addition & 0 deletions
@@ -88,6 +88,7 @@ Get started with LangSmith's tracing features to start adding observability to y
  - [Trace without setting environment variables](./how_to_guides/tracing/trace_without_env_vars)
  - [Trace using the LangSmith REST API](./how_to_guides/tracing/trace_with_api)
  - [Calculate token-based costs for traces](./how_to_guides/tracing/calculate_token_based_costs)
+ - [Bulk Exporting Traces](./how_to_guides/tracing/data_export)

## Datasets

Lines changed: 193 additions & 0 deletions
@@ -0,0 +1,193 @@
---
sidebar_position: 22
---

# [Beta] Bulk Exporting Trace Data

:::tip Note
Please note that the Data Export functionality is in Beta and only supported for LangSmith Plus or Enterprise tiers. To enable this feature, contact [email protected].
:::

LangSmith's bulk data export functionality allows you to export your traces to an external destination. This can be useful if you want to analyze the data offline in a tool such as BigQuery, Snowflake, RedShift, or Jupyter notebooks.

An export can be launched to target a specific LangSmith project and date range. Once a batch export is launched, our system will handle the orchestration and resilience of the export process. Please note that exporting your data may take some time depending on the size of your data, and we limit how many exports can run concurrently. Bulk exports also have a runtime timeout of 24 hours.

## Destinations

Currently, we support exporting to an S3 bucket or any S3 API-compatible bucket that you provide. The data will be exported in the [Parquet](https://parquet.apache.org/docs/overview/) columnar format, which makes it easy to import the data into other systems. The export contains the same data fields as the [Run data format](../../reference/data_formats/run_data_format).

## Exporting Data

### Destinations - Providing an S3 bucket

To export LangSmith data, you will need to provide an S3 bucket where the data will be exported to.

The following information is needed for the export:

- **Bucket Name**: The name of the S3 bucket where the data will be exported to.
- **Prefix**: The root prefix within the bucket where the data will be exported to.
- **S3 Region**: The region of the bucket - this is needed for AWS S3 buckets.
- **Endpoint URL**: The endpoint URL for the S3 bucket - this is needed for S3 API-compatible buckets.
- **Access Key**: The access key for the S3 bucket.
- **Secret Key**: The secret key for the S3 bucket.

We support any S3-compatible bucket; for non-AWS buckets such as GCS or MinIO, you will need to provide the endpoint URL.

### Preparing the Destination

The following example demonstrates how to create a destination using cURL. Replace the placeholder values with your actual configuration details. Note that credentials will be stored securely in an encrypted form in our system.

```bash
curl --request POST \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports/destinations' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
  --data '{
    "destination_type": "s3",
    "display_name": "My S3 Destination",
    "config": {
      "bucket_name": "your-s3-bucket-name",
      "prefix": "root_folder_prefix",
      "region": "your aws s3 region",
      "endpoint_url": "your endpoint url for s3 compatible buckets"
    },
    "credentials": {
      "access_key_id": "YOUR_S3_ACCESS_KEY_ID",
      "secret_access_key": "YOUR_S3_SECRET_ACCESS_KEY"
    }
  }'
```

Use the returned `id` to reference this destination in subsequent bulk export operations.

### Create an export job

To export data, you will need to create an export job. This job will specify the destination, the project, and the date range of the data to export.

You can use the following cURL command to create the job:

```bash
curl --request POST \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
  --data '{
    "bulk_export_destination_id": "your_destination_id",
    "session_id": "project_uuid",
    "start_time": "2024-01-01T00:00:00Z",
    "end_time": "2024-01-02T23:59:59Z"
  }'
```

Use the returned `id` to reference this export in subsequent bulk export operations.

## Monitoring the Export Job

### Monitor Export Status

To monitor the status of an export job, use the following cURL command:

```bash
curl --request GET \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports/{export_id}' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID'
```

Replace `{export_id}` with the ID of the export you want to monitor. This command retrieves the current status of the specified export job.

### List Runs for an Export

An export is typically broken up into multiple runs, each corresponding to a specific date partition to export. To list all runs associated with a specific export, use the following cURL command:

```bash
curl --request GET \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports/{export_id}/runs' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID'
```

This command fetches all runs related to the specified export, providing details such as run ID, status, creation time, and rows exported.

### List All Exports

To retrieve a list of all export jobs, use the following cURL command:

```bash
curl --request GET \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID'
```

This command returns a list of all export jobs along with their current statuses and creation timestamps.

### Stop an Export

To stop an existing export, use the following cURL command:

```bash
curl --request PATCH \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports/{export_id}' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
  --data '{
    "status": "Cancelled"
  }'
```

Replace `{export_id}` with the ID of the export you wish to cancel. Note that a job cannot be restarted once it has been cancelled; you will need to create a new export job instead.

## Partitioning Scheme

Data will be exported into your bucket in the following Hive-partitioned format:

```
<bucket>/<prefix>/export_id=<export_id>/tenant_id=<tenant_id>/session_id=<session_id>/runs/year=<year>/month=<month>/day=<day>
```
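
As a purely illustrative example, with a bucket named `my-bucket`, a prefix of `langsmith`, and a January 1, 2024 partition, the exported Parquet files would land under a path such as:

```
my-bucket/langsmith/export_id=<export_id>/tenant_id=<tenant_id>/session_id=<session_id>/runs/year=2024/month=01/day=01/
```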

## Importing Data into other systems

Importing Parquet data from S3 is supported by most analytical systems. See below for documentation links:

### BigQuery

To import your data into BigQuery, see [Loading Data from Parquet](https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet) and [Hive Partitioned loads](https://cloud.google.com/bigquery/docs/hive-partitioned-loads-gcs).
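
As a minimal sketch (not taken from the linked docs; the dataset, table, bucket, prefix, and export ID are all illustrative), a basic Parquet load from GCS using BigQuery's `LOAD DATA` statement might look like:

```sql
-- Illustrative load; replace the dataset, table, bucket, prefix, and export ID.
LOAD DATA INTO mydataset.langsmith_runs
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://your-bucket/root_folder_prefix/export_id=your_export_id/*']
);
```

For partition-aware loads that expose keys like `year`/`month`/`day` as columns, see the Hive Partitioned loads guide linked above.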

### Snowflake

You can load data into Snowflake from S3 by following the [Load from Cloud Document](https://docs.snowflake.com/en/user-guide/tutorials/load-from-cloud-tutorial).
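
A hedged sketch of that flow (all object names are illustrative, and it assumes the target table already exists with matching column names):

```sql
-- Create an external stage pointing at the export location (illustrative names).
CREATE OR REPLACE STAGE langsmith_stage
  URL = 's3://your-s3-bucket-name/root_folder_prefix/'
  CREDENTIALS = (AWS_KEY_ID = 'YOUR_S3_ACCESS_KEY_ID' AWS_SECRET_KEY = 'YOUR_S3_SECRET_ACCESS_KEY')
  FILE_FORMAT = (TYPE = PARQUET);

-- Load the Parquet files into an existing table, matching columns by name.
COPY INTO langsmith_runs
  FROM @langsmith_stage
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```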

### RedShift

You can COPY data from S3 / Parquet into RedShift by following the [AWS COPY Instructions](https://aws.amazon.com/about-aws/whats-new/2018/06/amazon-redshift-can-now-copy-from-parquet-and-orc-file-formats/).
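
A minimal sketch of such a COPY, assuming an existing target table and an IAM role with read access to the bucket (table name and role ARN are illustrative):

```sql
-- COPY Parquet data from S3 into an existing Redshift table (illustrative names).
COPY langsmith_runs
FROM 's3://your-s3-bucket-name/root_folder_prefix/'
IAM_ROLE 'arn:aws:iam::123456789012:role/your-redshift-role'
FORMAT AS PARQUET;
```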

### Clickhouse

You can directly query data in S3 / Parquet format in Clickhouse. As an example, if using GCS, you can query the data as follows:

```sql
SELECT count(DISTINCT id)
FROM s3('https://storage.googleapis.com/<bucket>/<prefix>/export_id=<export_id>/**',
        'access_key_id', 'access_secret', 'Parquet')
```

See [Clickhouse S3 Integration Documentation](https://clickhouse.com/docs/en/engines/table-engines/integrations/s3) for more information.

### DuckDB

You can query the data from S3 in-memory with SQL using DuckDB. See [S3 import Documentation](https://duckdb.org/docs/guides/network_cloud_storage/s3_import.html).
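
For example, a hedged sketch (bucket, prefix, region, and credentials are illustrative) using DuckDB's `httpfs` extension and `read_parquet`:

```sql
-- Enable S3 access (illustrative credentials).
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'your-aws-region';
SET s3_access_key_id = 'YOUR_S3_ACCESS_KEY_ID';
SET s3_secret_access_key = 'YOUR_S3_SECRET_ACCESS_KEY';

-- Query the export in place; hive_partitioning exposes the path keys
-- (export_id, tenant_id, session_id, year, month, day) as columns.
SELECT count(DISTINCT id)
FROM read_parquet(
  's3://your-s3-bucket-name/root_folder_prefix/**/*.parquet',
  hive_partitioning = true
);
```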

versioned_docs/version-2.0/how_to_guides/tracing/export_traces.mdx

Lines changed: 7 additions & 2 deletions
@@ -9,7 +9,7 @@ import {
} from "@site/src/components/InstructionsWithCode";
import { RegionalUrl } from "@site/src/components/RegionalUrls";

- # Export traces
+ # Query traces

:::tip Recommended Reading
Before diving into this content, it might be helpful to read the following:
@@ -20,7 +20,12 @@ Before diving into this content, it might be helpful to read the following:

:::

- The recommended way to export runs (the span data in LangSmith traces) is to use the `list_runs` method in the SDK or `/runs/query` endpoint in the API.
+ :::note
+ If you are looking to export a large volume of traces, we recommend that you use the [Bulk Data Export](./data_export) functionality, as it will better handle large data volumes and supports automatic retries and parallelization across partitions.
+ :::
+
+ The recommended way to query runs (the span data in LangSmith traces) is to use the `list_runs` method in the SDK or `/runs/query` endpoint in the API.

LangSmith stores traces in a simple format that is specified in the [Run (span) data format](../../reference/data_formats/run_data_format).
