Documentation for data export #475

Merged (8 commits, Oct 21, 2024)

1 change: 1 addition & 0 deletions versioned_docs/version-2.0/how_to_guides/index.md
@@ -88,6 +88,7 @@ Get started with LangSmith's tracing features to start adding observability to y
- [Trace without setting environment variables](./how_to_guides/tracing/trace_without_env_vars)
- [Trace using the LangSmith REST API](./how_to_guides/tracing/trace_with_api)
- [Calculate token-based costs for traces](./how_to_guides/tracing/calculate_token_based_costs)
- [Bulk exporting traces](./how_to_guides/tracing/data_export)

## Datasets

199 changes: 199 additions & 0 deletions versioned_docs/version-2.0/how_to_guides/tracing/data_export.mdx
@@ -0,0 +1,199 @@
---
sidebar_position: 22
---

# [Beta] Bulk Exporting Trace Data

:::tip Note
Please note that the Data Export functionality is in Beta and is only supported on the LangSmith Plus and Enterprise tiers. To enable this feature, contact [email protected].
:::

LangSmith's bulk data export functionality allows you to export your traces to an external destination. This can be useful if you want to analyze the
data offline in a tool such as BigQuery, Snowflake, Redshift, or Jupyter notebooks.

An export can be launched to target a specific LangSmith project and date range. Once a batch export is launched, our system handles the orchestration and resilience of the export process.
Note that exporting your data may take some time depending on its size, and that the number of exports that can run concurrently is limited.
Bulk exports also have a runtime timeout of 24 hours.

## Destinations

Currently we support exporting to an S3 bucket or S3 API-compatible bucket that you provide. The data will be exported in
[Parquet](https://parquet.apache.org/docs/overview/) columnar format, which allows you to easily import it into
other systems. The export contains the same data fields as the [Run data format](../../reference/data_formats/run_data_format).

## Exporting Data

### Destinations - Providing an S3 bucket

To export LangSmith data, you will need to provide an S3 bucket where the data will be exported.

The following information is needed for the export:

- **Bucket Name**: The name of the S3 bucket where the data will be exported.
- **Prefix**: The root prefix within the bucket where the data will be exported.
- **S3 Region**: The region of the bucket - required for AWS S3 buckets.
- **Endpoint URL**: The endpoint URL for the S3 bucket - required for S3 API-compatible buckets.
- **Access Key**: The access key for the S3 bucket.
- **Secret Key**: The secret key for the S3 bucket.

We support any S3-compatible bucket; for non-AWS buckets such as GCS or MinIO, you will need to provide the endpoint URL.
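
For instance (illustrative endpoints; consult your provider's documentation for the correct value):

```
GCS:   https://storage.googleapis.com
MinIO: https://minio.example.com:9000   (your MinIO server's URL)
```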

### Preparing the Destination

The following example demonstrates how to create a destination using cURL. Replace the placeholder values with your actual configuration details.
Note that credentials will be stored securely in an encrypted form in our system.

```bash
curl --request POST \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports/destinations' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
  --header 'X-Organization-Id: YOUR_ORG_ID' \
  --data '{
    "destination_type": "s3",
    "display_name": "My S3 Destination",
    "config": {
      "bucket_name": "your-s3-bucket-name",
      "prefix": "root_folder_prefix",
      "region": "your aws s3 region",
      "endpoint_url": "your endpoint url for s3 compatible buckets"
    },
    "credentials": {
      "access_key_id": "YOUR_S3_ACCESS_KEY_ID",
      "secret_access_key": "YOUR_S3_SECRET_ACCESS_KEY"
    }
  }'
```

Use the returned `id` to reference this destination in subsequent bulk export operations.
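
For example, with the JSON body above saved to a file such as `destination.json` (a hypothetical filename) and [`jq`](https://jqlang.github.io/jq/) installed, you can capture that `id` into a shell variable. This is a sketch; it assumes the response body is JSON with a top-level `id` field:

```bash
# Create the destination and capture its id for later use.
# Assumes the response is JSON with a top-level `id` field.
DESTINATION_ID=$(curl --silent --request POST \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports/destinations' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
  --header 'X-Organization-Id: YOUR_ORG_ID' \
  --data @destination.json | jq -r '.id')
echo "Destination id: ${DESTINATION_ID}"
```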

### Create an export job

To export data, you will need to create an export job. This job will specify the destination, the project, and the date range of the data to export.

You can use the following cURL command to create the job:

```bash
curl --request POST \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
  --header 'X-Organization-Id: YOUR_ORG_ID' \
  --data '{
    "bulk_export_destination_id": "your_destination_id",
    "session_id": "project_uuid",
    "start_time": "2024-01-01T00:00:00Z",
    "end_time": "2024-01-02T23:59:59Z"
  }'
```

Use the returned `id` to reference this export in subsequent bulk export operations.
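
The same pattern works here, assuming `jq` and a top-level `id` field in the response (`export.json` is a hypothetical file holding the request body above):

```bash
# Create the export job and capture its id for the monitoring calls below.
EXPORT_ID=$(curl --silent --request POST \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
  --header 'X-Organization-Id: YOUR_ORG_ID' \
  --data @export.json | jq -r '.id')
```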

## Monitoring the Export Job

### Monitor Export Status

To monitor the status of an export job, use the following cURL command:

```bash
curl --request GET \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports/{export_id}' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
  --header 'X-Organization-Id: YOUR_ORG_ID'
```

Replace `{export_id}` with the ID of the export you want to monitor. This command retrieves the current status of the specified export job.
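
If you want to block until the export finishes, a simple polling loop works. This is a sketch: it assumes the response carries a `status` field, and the terminal status names other than `Cancelled` (which appears in the cancellation example below) are illustrative:

```bash
# Poll the export status every 30 seconds until it reaches a terminal state.
while true; do
  STATUS=$(curl --silent --request GET \
    --url "https://api.smith.langchain.com/api/v1/bulk-exports/${EXPORT_ID}" \
    --header 'X-API-Key: YOUR_API_KEY' \
    --header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
    --header 'X-Organization-Id: YOUR_ORG_ID' | jq -r '.status')
  echo "Export status: ${STATUS}"
  case "${STATUS}" in
    Completed|Failed|Cancelled) break ;;  # terminal states (illustrative names)
  esac
  sleep 30
done
```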

### List Runs for an Export

An export is typically broken up into multiple runs, each corresponding to a specific date partition of the data.
To list all runs associated with a specific export, use the following cURL command:

```bash
curl --request GET \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports/{export_id}/runs' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
  --header 'X-Organization-Id: YOUR_ORG_ID'
```

This command fetches all runs related to the specified export, providing details such as run ID, status, creation time, and rows exported.
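
To skim those details at the command line, you can pipe the response through `jq`; the exact filter depends on the response shape, so treat this as a sketch:

```bash
# Pretty-print the run list; adjust the jq filter to the actual payload shape.
curl --silent --request GET \
  --url "https://api.smith.langchain.com/api/v1/bulk-exports/${EXPORT_ID}/runs" \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
  --header 'X-Organization-Id: YOUR_ORG_ID' | jq '.'
```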

### List All Exports

To retrieve a list of all export jobs, use the following cURL command:

```bash
curl --request GET \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
  --header 'X-Organization-Id: YOUR_ORG_ID'
```

This command returns a list of all export jobs along with their current statuses and creation timestamps.

### Stop an Export

To stop an existing export, use the following cURL command:

```bash
curl --request PATCH \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports/{export_id}' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
  --header 'X-Organization-Id: YOUR_ORG_ID' \
  --data '{
    "status": "Cancelled"
  }'
```

Replace `{export_id}` with the ID of the export you wish to cancel. Note that a job cannot be restarted once it has been cancelled;
you will need to create a new export job instead.

## Partitioning Scheme

Data will be exported into your bucket in the following Hive-partitioned format:

```
<bucket>/<prefix>/export_id=<export_id>/tenant_id=<tenant_id>/session_id=<session_id>/runs/year=<year>/month=<month>/day=<day>
```
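
For example, runs from January 1, 2024 might land under a key like this (all IDs and the zero-padding of the date parts are hypothetical):

```
my-bucket/langsmith-exports/export_id=3f1c.../tenant_id=9a2e.../session_id=5b7d.../runs/year=2024/month=01/day=01
```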

## Importing Data into other systems

Importing Parquet data from S3 is supported by most analytical systems. See below for documentation links:

### BigQuery

To import your data into BigQuery, see [Loading Data from Parquet](https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet) and
[Hive Partitioned loads](https://cloud.google.com/bigquery/docs/hive-partitioned-loads-gcs).
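
As a sketch, a Hive-partitioned load using the `bq` CLI might look like the following; the dataset and table names are placeholders, and this assumes the exported files are in a GCS bucket:

```bash
# Load all Parquet files under the export prefix, deriving partition
# columns (year/month/day, etc.) from the Hive-style paths.
bq load \
  --source_format=PARQUET \
  --hive_partitioning_mode=AUTO \
  --hive_partitioning_source_uri_prefix="gs://<bucket>/<prefix>/export_id=<export_id>" \
  my_dataset.langsmith_runs \
  "gs://<bucket>/<prefix>/export_id=<export_id>/*"
```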

### Snowflake

You can load data into Snowflake from S3 by following the [Load from Cloud Document](https://docs.snowflake.com/en/user-guide/tutorials/load-from-cloud-tutorial).
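
A minimal sketch using the SnowSQL CLI, assuming an existing `langsmith_runs` table whose columns match the run data format (all names and credentials are placeholders):

```bash
# Create an external stage over the export prefix, then COPY it into
# the target table, matching Parquet columns to table columns by name.
snowsql -q "
CREATE OR REPLACE STAGE langsmith_export_stage
  URL = 's3://<bucket>/<prefix>/export_id=<export_id>/'
  CREDENTIALS = (AWS_KEY_ID = '<access_key>' AWS_SECRET_KEY = '<secret_key>')
  FILE_FORMAT = (TYPE = PARQUET);

COPY INTO langsmith_runs
  FROM @langsmith_export_stage
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
"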

### Redshift

You can COPY data from S3 / Parquet into Redshift by following the [AWS COPY Instructions](https://aws.amazon.com/about-aws/whats-new/2018/06/amazon-redshift-can-now-copy-from-parquet-and-orc-file-formats/).
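
For instance, from any SQL client connected to your cluster (a sketch; the table name and IAM role are placeholders):

```bash
# Run the COPY via psql; Redshift reads the Parquet schema directly.
psql "$REDSHIFT_CONNECTION_STRING" -c "
COPY langsmith_runs
FROM 's3://<bucket>/<prefix>/export_id=<export_id>/'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<role-name>'
FORMAT AS PARQUET;
"
```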

### ClickHouse

You can directly query data in S3 / Parquet format in ClickHouse. For example, if using GCS, you can query the data as follows:

```sql
SELECT count(DISTINCT id)
FROM s3('https://storage.googleapis.com/<bucket>/<prefix>/export_id=<export_id>/**',
        'access_key_id', 'access_secret', 'Parquet')

See the [ClickHouse S3 Integration Documentation](https://clickhouse.com/docs/en/engines/table-engines/integrations/s3) for more information.

### DuckDB

You can query the data from S3 in-memory with SQL using DuckDB. See the [S3 Import Documentation](https://duckdb.org/docs/guides/network_cloud_storage/s3_import.html).
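
A minimal sketch with the DuckDB CLI; the S3 settings are illustrative, and the `httpfs` extension provides the S3 support:

```bash
# Query the exported Parquet files directly from S3, in-memory.
duckdb -c "
INSTALL httpfs; LOAD httpfs;
SET s3_region='<region>';
SET s3_access_key_id='<access_key>';
SET s3_secret_access_key='<secret_key>';
SELECT count(DISTINCT id)
FROM read_parquet('s3://<bucket>/<prefix>/export_id=<export_id>/**/*.parquet');
"
```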
@@ -9,7 +9,7 @@ import {
} from "@site/src/components/InstructionsWithCode";
import { RegionalUrl } from "@site/src/components/RegionalUrls";

# Export traces
# Query traces

:::tip Recommended Reading
Before diving into this content, it might be helpful to read the following:
@@ -20,7 +20,12 @@ Before diving into this content, it might be helpful to read the following:

:::

The recommended way to export runs (the span data in LangSmith traces) is to use the `list_runs` method in the SDK or `/runs/query` endpoint in the API.
:::note
If you are looking to export a large volume of traces, we recommend that you use the [Bulk Data Export](./data_export) functionality, as it better
handles large data volumes and supports automatic retries and parallelization across partitions.
:::

The recommended way to query runs (the span data in LangSmith traces) is to use the `list_runs` method in the SDK or `/runs/query` endpoint in the API.

LangSmith stores traces in a simple format that is specified in the [Run (span) data format](../../reference/data_formats/run_data_format).
