
Commit 28cac05

Documentation for data export (#475)
2 parents 6a67aa2 + 88fcc58 commit 28cac05

File tree

3 files changed: +201 −2 lines


versioned_docs/version-2.0/how_to_guides/index.md

Lines changed: 1 addition & 0 deletions
@@ -88,6 +88,7 @@ Get started with LangSmith's tracing features to start adding observability to y
  - [Trace without setting environment variables](./how_to_guides/tracing/trace_without_env_vars)
  - [Trace using the LangSmith REST API](./how_to_guides/tracing/trace_with_api)
  - [Calculate token-based costs for traces](./how_to_guides/tracing/calculate_token_based_costs)
+ - [Bulk Exporting Traces](./how_to_guides/tracing/data_export)

## Datasets

Lines changed: 193 additions & 0 deletions
@@ -0,0 +1,193 @@
---
sidebar_position: 22
---

# [Beta] Bulk Exporting Trace Data

:::tip Note
Please note that the Data Export functionality is in Beta and only supported for LangSmith Plus or Enterprise tiers. To enable this feature, contact [email protected].
:::

LangSmith's bulk data export functionality allows you to export your traces to an external destination. This can be useful if you want to analyze the data offline in a tool such as BigQuery, Snowflake, RedShift, or Jupyter notebooks.

An export can be launched to target a specific LangSmith project and date range. Once a batch export is launched, our system will handle the orchestration and resilience of the export process. Please note that exporting your data may take some time depending on the size of your data, and we limit how many exports can run concurrently. Bulk exports also have a runtime timeout of 24 hours.

## Destinations

Currently, we support exporting to an S3 bucket or any S3 API-compatible bucket that you provide. The data will be exported in the [Parquet](https://parquet.apache.org/docs/overview/) columnar format, which makes it easy to import the data into other systems. The export contains the same data fields as the [Run data format](../../reference/data_formats/run_data_format).

## Exporting Data

### Destinations - Providing an S3 bucket

To export LangSmith data, you will need to provide an S3 bucket where the data will be exported to.

The following information is needed for the export:

- **Bucket Name**: The name of the S3 bucket where the data will be exported to.
- **Prefix**: The root prefix within the bucket where the data will be exported to.
- **S3 Region**: The region of the bucket - this is needed for AWS S3 buckets.
- **Endpoint URL**: The endpoint URL for the S3 bucket - this is needed for S3 API-compatible buckets.
- **Access Key**: The access key for the S3 bucket.
- **Secret Key**: The secret key for the S3 bucket.

We support any S3-compatible bucket; for non-AWS buckets such as GCS or MinIO, you will need to provide the endpoint URL.

### Preparing the Destination

The following example demonstrates how to create a destination using cURL. Replace the placeholder values with your actual configuration details. Note that credentials will be stored securely in an encrypted form in our system.

```bash
curl --request POST \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports/destinations' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
  --data '{
    "destination_type": "s3",
    "display_name": "My S3 Destination",
    "config": {
      "bucket_name": "your-s3-bucket-name",
      "prefix": "root_folder_prefix",
      "region": "your aws s3 region",
      "endpoint_url": "your endpoint url for s3 compatible buckets"
    },
    "credentials": {
      "access_key_id": "YOUR_S3_ACCESS_KEY_ID",
      "secret_access_key": "YOUR_S3_SECRET_ACCESS_KEY"
    }
  }'
```

Use the returned `id` to reference this destination in subsequent bulk export operations.

### Create an export job

To export data, you will need to create an export job. This job will specify the destination, the project, and the date range of the data to export.

You can use the following cURL command to create the job:

```bash
curl --request POST \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
  --data '{
    "bulk_export_destination_id": "your_destination_id",
    "session_id": "project_uuid",
    "start_time": "2024-01-01T00:00:00Z",
    "end_time": "2024-01-02T23:59:59Z"
  }'
```

Use the returned `id` to reference this export in subsequent bulk export operations.

## Monitoring the Export Job

### Monitor Export Status

To monitor the status of an export job, use the following cURL command:

```bash
curl --request GET \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports/{export_id}' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID'
```

Replace `{export_id}` with the ID of the export you want to monitor. This command retrieves the current status of the specified export job.

### List Runs for an Export

An export is typically broken up into multiple runs, each corresponding to a specific date partition to export. To list all runs associated with a specific export, use the following cURL command:

```bash
curl --request GET \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports/{export_id}/runs' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID'
```

This command fetches all runs related to the specified export, providing details such as run ID, status, creation time, and rows exported.

### List All Exports

To retrieve a list of all export jobs, use the following cURL command:

```bash
curl --request GET \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID'
```

This command returns a list of all export jobs along with their current statuses and creation timestamps.

### Stop an Export

To stop an existing export, use the following cURL command:

```bash
curl --request PATCH \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports/{export_id}' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
  --data '{
    "status": "Cancelled"
  }'
```

Replace `{export_id}` with the ID of the export you wish to cancel. Note that a job cannot be restarted once it has been cancelled; you will need to create a new export job instead.

## Partitioning Scheme

Data will be exported into your bucket in the following Hive-partitioned format:

```
<bucket>/<prefix>/export_id=<export_id>/tenant_id=<tenant_id>/session_id=<session_id>/runs/year=<year>/month=<month>/day=<day>
```
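
As a purely illustrative example, with a bucket named `my-bucket`, a prefix of `langsmith`, and a January 1, 2024 partition, the exported Parquet files would land under a path such as:

```
my-bucket/langsmith/export_id=<export_id>/tenant_id=<tenant_id>/session_id=<session_id>/runs/year=2024/month=01/day=01/
```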

## Importing Data into other systems

Importing Parquet data from S3 is supported by most analytical systems. See below for documentation links:

### BigQuery

To import your data into BigQuery, see [Loading Data from Parquet](https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet) and [Hive Partitioned loads](https://cloud.google.com/bigquery/docs/hive-partitioned-loads-gcs).
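
As a minimal sketch (not taken from the linked docs; the dataset, table, bucket, prefix, and export ID are all illustrative), a basic Parquet load from GCS using BigQuery's `LOAD DATA` statement might look like:

```sql
-- Illustrative load; replace the dataset, table, bucket, prefix, and export ID.
LOAD DATA INTO mydataset.langsmith_runs
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://your-bucket/root_folder_prefix/export_id=your_export_id/*']
);
```

For partition-aware loads that expose keys like `year`/`month`/`day` as columns, see the Hive Partitioned loads guide linked above.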

### Snowflake

You can load data into Snowflake from S3 by following the [Load from Cloud Document](https://docs.snowflake.com/en/user-guide/tutorials/load-from-cloud-tutorial).
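
A hedged sketch of that flow (all object names are illustrative, and it assumes the target table already exists with matching column names):

```sql
-- Create an external stage pointing at the export location (illustrative names).
CREATE OR REPLACE STAGE langsmith_stage
  URL = 's3://your-s3-bucket-name/root_folder_prefix/'
  CREDENTIALS = (AWS_KEY_ID = 'YOUR_S3_ACCESS_KEY_ID' AWS_SECRET_KEY = 'YOUR_S3_SECRET_ACCESS_KEY')
  FILE_FORMAT = (TYPE = PARQUET);

-- Load the Parquet files into an existing table, matching columns by name.
COPY INTO langsmith_runs
  FROM @langsmith_stage
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```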

### RedShift

You can COPY data from S3 / Parquet into RedShift by following the [AWS COPY Instructions](https://aws.amazon.com/about-aws/whats-new/2018/06/amazon-redshift-can-now-copy-from-parquet-and-orc-file-formats/).
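
A minimal sketch of such a COPY, assuming an existing target table and an IAM role with read access to the bucket (table name and role ARN are illustrative):

```sql
-- COPY Parquet data from S3 into an existing Redshift table (illustrative names).
COPY langsmith_runs
FROM 's3://your-s3-bucket-name/root_folder_prefix/'
IAM_ROLE 'arn:aws:iam::123456789012:role/your-redshift-role'
FORMAT AS PARQUET;
```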

### Clickhouse

You can directly query data in S3 / Parquet format in Clickhouse. As an example, if using GCS, you can query the data as follows:

```sql
SELECT count(DISTINCT id)
FROM s3('https://storage.googleapis.com/<bucket>/<prefix>/export_id=<export_id>/**',
        'access_key_id', 'access_secret', 'Parquet')
```

See [Clickhouse S3 Integration Documentation](https://clickhouse.com/docs/en/engines/table-engines/integrations/s3) for more information.

### DuckDB

You can query the data from S3 in-memory with SQL using DuckDB. See [S3 import Documentation](https://duckdb.org/docs/guides/network_cloud_storage/s3_import.html).
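
For example, a hedged sketch (bucket, prefix, region, and credentials are illustrative) using DuckDB's `httpfs` extension and `read_parquet`:

```sql
-- Enable S3 access (illustrative credentials).
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'your-aws-region';
SET s3_access_key_id = 'YOUR_S3_ACCESS_KEY_ID';
SET s3_secret_access_key = 'YOUR_S3_SECRET_ACCESS_KEY';

-- Query the export in place; hive_partitioning exposes the path keys
-- (export_id, tenant_id, session_id, year, month, day) as columns.
SELECT count(DISTINCT id)
FROM read_parquet(
  's3://your-s3-bucket-name/root_folder_prefix/**/*.parquet',
  hive_partitioning = true
);
```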

versioned_docs/version-2.0/how_to_guides/tracing/export_traces.mdx

Lines changed: 7 additions & 2 deletions
@@ -9,7 +9,7 @@ import {
} from "@site/src/components/InstructionsWithCode";
import { RegionalUrl } from "@site/src/components/RegionalUrls";

- # Export traces
+ # Query traces

:::tip Recommended Reading
Before diving into this content, it might be helpful to read the following:
@@ -20,7 +20,12 @@ Before diving into this content, it might be helpful to read the following:

:::

- The recommended way to export runs (the span data in LangSmith traces) is to use the `list_runs` method in the SDK or `/runs/query` endpoint in the API.
+ :::note
+ If you are looking to export a large volume of traces, we recommend that you use the [Bulk Data Export](./data_export) functionality, as it will better handle large data volumes and supports automatic retries and parallelization across partitions.
+ :::
+
+ The recommended way to query runs (the span data in LangSmith traces) is to use the `list_runs` method in the SDK or `/runs/query` endpoint in the API.

LangSmith stores traces in a simple format that is specified in the [Run (span) data format](../../reference/data_formats/run_data_format).
