---
sidebar_position: 22
---

# [Beta] Bulk Exporting Trace Data

:::tip Note
The Data Export functionality is in Beta and is only supported on the LangSmith Plus or Enterprise tiers. To enable this feature, contact [email protected].
:::

LangSmith's bulk data export functionality allows you to export your traces to an external destination. This can be useful if you want to analyze the
data offline in a tool such as BigQuery, Snowflake, Redshift, Jupyter Notebooks, etc.

An export can be launched to target a specific LangSmith project and date range. Once a bulk export is launched, our system handles the orchestration and resilience of the export process.
Note that exporting your data may take some time depending on its size, and that only a limited number of your exports can run at the same time.
Bulk exports also have a runtime timeout of 24 hours.

## Destinations

Currently we support exporting to an S3 bucket or S3 API-compatible bucket that you provide. The data will be exported in
[Parquet](https://parquet.apache.org/docs/overview/) columnar format, which allows you to easily import the data into
other systems. The export will contain the same data fields as the [Run data format](../../reference/data_formats/run_data_format).

## Exporting Data

### Destinations - Providing an S3 bucket

To export LangSmith data, you will need to provide an S3 bucket where the data will be exported.

The following information is needed for the export:

- **Bucket Name**: The name of the S3 bucket where the data will be exported.
- **Prefix**: The root prefix within the bucket where the data will be exported.
- **S3 Region**: The region of the bucket - this is needed for AWS S3 buckets.
- **Endpoint URL**: The endpoint URL for the S3 bucket - this is needed for S3 API-compatible buckets.
- **Access Key**: The access key for the S3 bucket.
- **Secret Key**: The secret key for the S3 bucket.

We support any S3-compatible bucket. For non-AWS buckets such as GCS or MinIO, you will need to provide the endpoint URL.

### Preparing the Destination

The following example demonstrates how to create a destination using cURL. Replace the placeholder values with your actual configuration details.
Note that credentials will be stored securely in an encrypted form in our system.

```bash
curl --request POST \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports/destinations' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
  --data '{
    "destination_type": "s3",
    "display_name": "My S3 Destination",
    "config": {
      "bucket_name": "your-s3-bucket-name",
      "prefix": "root_folder_prefix",
      "region": "your aws s3 region",
      "endpoint_url": "your endpoint url for s3 compatible buckets"
    },
    "credentials": {
      "access_key_id": "YOUR_S3_ACCESS_KEY_ID",
      "secret_access_key": "YOUR_S3_SECRET_ACCESS_KEY"
    }
  }'
```

Use the returned `id` to reference this destination in subsequent bulk export operations.

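If you are scripting these steps, one option is to capture the returned destination `id` for reuse when creating export jobs. The snippet below is a minimal sketch: it assumes the response is a JSON object with an `id` field, that `jq` is installed, and that the request payload shown above has been saved to a hypothetical `destination.json` file.

```bash
# Sketch: create the destination and keep its ID for the export job request.
# Assumes the response JSON contains an "id" field; destination.json is a hypothetical
# file holding the same payload as the request above.
DESTINATION_ID=$(
  curl --silent --request POST \
    --url 'https://api.smith.langchain.com/api/v1/bulk-exports/destinations' \
    --header 'Content-Type: application/json' \
    --header 'X-API-Key: YOUR_API_KEY' \
    --header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
    --data @destination.json \
  | jq -r '.id'
)
echo "Destination ID: ${DESTINATION_ID}"
```
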
### Create an export job

To export data, you will need to create an export job. This job will specify the destination, the project, and the date range of the data to export.

You can use the following cURL command to create the job:

```bash
curl --request POST \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
  --data '{
    "bulk_export_destination_id": "your_destination_id",
    "session_id": "project_uuid",
    "start_time": "2024-01-01T00:00:00Z",
    "end_time": "2024-01-02T23:59:59Z"
  }'
```

Use the returned `id` to reference this export in subsequent bulk export operations.

## Monitoring the Export Job

### Monitor Export Status

To monitor the status of an export job, use the following cURL command:

```bash
curl --request GET \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports/{export_id}' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID'
```

Replace `{export_id}` with the ID of the export you want to monitor. This command retrieves the current status of the specified export job.

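If you want to wait for an export to finish from a script, you can poll this endpoint periodically. The loop below is a sketch, assuming the response exposes the status in a `status` field; the terminal status names (`Completed`, `Failed`, `Cancelled`) are assumptions, so adjust the check to the values your exports actually report.

```bash
# Sketch: poll the export status every 30 seconds until it reaches a terminal state.
# The "status" field name and the terminal status values below are assumptions.
EXPORT_ID="your_export_id"
while true; do
  STATUS=$(
    curl --silent --request GET \
      --url "https://api.smith.langchain.com/api/v1/bulk-exports/${EXPORT_ID}" \
      --header 'X-API-Key: YOUR_API_KEY' \
      --header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
    | jq -r '.status'
  )
  echo "$(date -u +%FT%TZ) export ${EXPORT_ID}: ${STATUS}"
  case "${STATUS}" in
    Completed|Failed|Cancelled) break ;;
  esac
  sleep 30
done
```
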
### List Runs for an Export

An export is typically broken up into multiple runs, each corresponding to a specific date partition to export.
To list all runs associated with a specific export, use the following cURL command:

```bash
curl --request GET \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports/{export_id}/runs' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID'
```

This command fetches all runs related to the specified export, providing details such as run ID, status, creation time, rows exported, etc.

### List All Exports

To retrieve a list of all export jobs, use the following cURL command:

```bash
curl --request GET \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID'
```

This command returns a list of all export jobs along with their current statuses and creation timestamps.

### Stop an Export

To stop an existing export, use the following cURL command:

```bash
curl --request PATCH \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports/{export_id}' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
  --data '{
    "status": "Cancelled"
  }'
```

Replace `{export_id}` with the ID of the export you wish to cancel. Note that a job cannot be restarted once it has been cancelled;
you will need to create a new export job instead.

## Partitioning Scheme

Data will be exported into your bucket in the following Hive-partitioned format:

```
<bucket>/<prefix>/export_id=<export_id>/tenant_id=<tenant_id>/session_id=<session_id>/runs/year=<year>/month=<month>/day=<day>
```
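
For example, once an export completes you could list the exported Parquet files with the AWS CLI (or your provider's equivalent). The bucket name, prefix, and export ID below are hypothetical placeholders.

```bash
# Sketch: list exported objects under the Hive-partitioned prefix.
# Bucket name, prefix, and export ID are hypothetical placeholders.
aws s3 ls --recursive \
  "s3://your-s3-bucket-name/root_folder_prefix/export_id=<export_id>/"
```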

## Importing Data into other systems

Importing data in Parquet format from S3 is supported by most analytical systems. See below for documentation links:

### BigQuery

To import your data into BigQuery, see [Loading Data from Parquet](https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet) and also
[Hive Partitioned loads](https://cloud.google.com/bigquery/docs/hive-partitioned-loads-gcs).

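As a rough sketch, assuming the export was written to a GCS bucket and using hypothetical bucket, dataset, and table names, a Hive-partitioned load with the `bq` CLI might look like:

```bash
# Sketch: load the Hive-partitioned Parquet export into BigQuery.
# Bucket, prefix, dataset, and table names are hypothetical placeholders.
bq load \
  --source_format=PARQUET \
  --hive_partitioning_mode=AUTO \
  --hive_partitioning_source_uri_prefix=gs://your-bucket/root_folder_prefix \
  your_dataset.langsmith_runs \
  "gs://your-bucket/root_folder_prefix/*"
```
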
### Snowflake

You can load data into Snowflake from S3 by following the [Load from Cloud Document](https://docs.snowflake.com/en/user-guide/tutorials/load-from-cloud-tutorial).

### Redshift

You can COPY data from S3 / Parquet into Redshift by following the [AWS COPY Instructions](https://aws.amazon.com/about-aws/whats-new/2018/06/amazon-redshift-can-now-copy-from-parquet-and-orc-file-formats/).

### ClickHouse

You can directly query data in S3 / Parquet format in ClickHouse. As an example, if using GCS, you can query the data as follows:

```sql
SELECT count(distinct id) FROM s3('https://storage.googleapis.com/<bucket>/<prefix>/export_id=<export_id>/**',
    'access_key_id', 'access_secret', 'Parquet')
```

See the [ClickHouse S3 Integration Documentation](https://clickhouse.com/docs/en/engines/table-engines/integrations/s3) for more information.

### DuckDB

You can query the data from S3 in-memory with SQL using DuckDB. See [S3 import Documentation](https://duckdb.org/docs/guides/network_cloud_storage/s3_import.html).
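
As a sketch, using the DuckDB CLI with the `httpfs` extension (the bucket, prefix, export ID, region, and credentials below are hypothetical placeholders):

```bash
# Sketch: query the exported Parquet files directly from S3 with the DuckDB CLI.
# Bucket, prefix, export ID, region, and credentials are hypothetical placeholders.
duckdb <<'SQL'
INSTALL httpfs;
LOAD httpfs;
SET s3_region='your-aws-region';
SET s3_access_key_id='YOUR_S3_ACCESS_KEY_ID';
SET s3_secret_access_key='YOUR_S3_SECRET_ACCESS_KEY';
SELECT count(DISTINCT id)
FROM read_parquet('s3://your-s3-bucket-name/root_folder_prefix/export_id=<export_id>/**/*.parquet');
SQL
```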