# Migrate a `PGVector` vector store to `PGVectorStore`

This guide shows how to migrate from the [`PGVector`](https://github.com/langchain-ai/langchain-postgres/blob/main/langchain_postgres/vectorstores.py) vector store class to the [`PGVectorStore`](https://github.com/langchain-ai/langchain-postgres/blob/main/langchain_postgres/vectorstore.py) class.

## Why migrate?

Migrating moves your vector data from a PGVector-style database (two shared tables for all collections) to a PGVectorStore-style database (one table per collection) for improved performance and manageability.

Migrating to the PGVectorStore interface provides the following benefits:

- **Simplified management**: A single table contains data corresponding to a single collection, making it easier to query, update, and maintain.
- **Improved metadata handling**: Metadata is stored in columns instead of JSON, resulting in significant performance improvements.
- **Schema flexibility**: The interface lets you create tables in any database schema.
- **Improved performance**: The single-table schema can lead to faster query execution, especially for large collections.
- **Clear separation**: Table and extension creation are clearly separated, allowing for distinct permissions and streamlined workflows.
- **Secure connections**: The PGVectorStore interface creates a secure connection pool that can be easily shared across your application using the `engine` object.

## Migration process

> **_NOTE:_** The `langchain-core` library is required for the fake embeddings service used in this guide. To use a different embedding service, install the appropriate library for your chosen provider; see [LangChain's Embedding models](https://python.langchain.com/v0.2/docs/integrations/text_embedding/).

While you can use the existing PGVector database, we **strongly recommend** migrating your data to the PGVectorStore-style schema to take full advantage of the performance benefits.

### (Recommended) Data migration

1. **Create a PG engine.**

   ```python
   from langchain_postgres import PGEngine

   # Replace CONNECTION_STRING with your database connection string
   engine = PGEngine.from_connection_string(url=CONNECTION_STRING)
   ```

   > **_NOTE:_** All sync methods have corresponding async methods.
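
   For reference, `PGEngine.from_connection_string` takes a SQLAlchemy-style URL; the `postgresql+asyncpg` driver and the credentials below are placeholder assumptions, so substitute your own values:

   ```python
   # Placeholder connection string; adjust user, password, host, port, and database.
   CONNECTION_STRING = "postgresql+asyncpg://my-user:my-password@localhost:5432/my-database"
   ```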

2. **Create a new table to migrate existing data.**

   ```python
   # Vertex AI embeddings use a vector size of 768.
   # Adjust this according to your embedding service.
   VECTOR_SIZE = 768

   engine.init_vectorstore_table(
       table_name="destination_table",
       vector_size=VECTOR_SIZE,
   )
   ```

   **(Optional) Customize your table.**

   When creating your vectorstore table, you have the flexibility to define custom metadata and ID columns. This is particularly useful for:

   - **Filtering**: Metadata columns allow you to easily filter your data within the vectorstore. For example, you might store the document source, date, or author as metadata for efficient retrieval.
   - **Non-UUID identifiers**: By default, the `id_column` uses UUIDs. If you need a different type of ID (e.g., an integer or string), you can define a custom `id_column`.

   ```python
   from langchain_postgres import Column

   # `collection_name` is the name of the PGVector collection being migrated.
   metadata_columns = [
       Column(f"col_0_{collection_name}", "VARCHAR"),
       Column(f"col_1_{collection_name}", "VARCHAR"),
   ]
   engine.init_vectorstore_table(
       table_name="destination_table",
       vector_size=VECTOR_SIZE,
       metadata_columns=metadata_columns,
       id_column=Column("langchain_id", "VARCHAR"),
   )
   ```

3. **Create a vector store object to interact with the new data.**

   > **_NOTE:_** The `FakeEmbeddings` embedding service is only used to initialise a vector store object, not to generate any embeddings. The embeddings are copied directly from the PGVector table.

   ```python
   from langchain_postgres import PGVectorStore
   from langchain_core.embeddings import FakeEmbeddings

   destination_vector_store = PGVectorStore.create_sync(
       engine,
       embedding_service=FakeEmbeddings(size=VECTOR_SIZE),
       table_name="destination_table",
   )
   ```

   If you customised the metadata or ID columns, pass them to the vector store as follows:

   ```python
   from langchain_postgres import PGVectorStore
   from langchain_core.embeddings import FakeEmbeddings

   destination_vector_store = PGVectorStore.create_sync(
       engine,
       embedding_service=FakeEmbeddings(size=VECTOR_SIZE),
       table_name="destination_table",
       metadata_columns=[col.name for col in metadata_columns],
       id_column="langchain_id",
   )
   ```
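
   As noted above, every sync method has an async counterpart. As a rough sketch (assuming the async factory methods are named `ainit_vectorstore_table` and `create`), the same setup inside an async function might look like this:

   ```python
   # Hedged async sketch; method names are assumed to mirror the sync API.
   await engine.ainit_vectorstore_table(
       table_name="destination_table",
       vector_size=VECTOR_SIZE,
   )
   destination_vector_store = await PGVectorStore.create(
       engine,
       embedding_service=FakeEmbeddings(size=VECTOR_SIZE),
       table_name="destination_table",
   )
   ```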

4. **Migrate the data to the new table.**

   ```python
   from langchain_postgres.utils.pgvector_migrator import migrate_pgvector_collection

   migrate_pgvector_collection(
       engine,
       # Set the name of the PGVector collection to migrate here
       collection_name="collection_name",
       vector_store=destination_vector_store,
       # Deletes data from the original table after migration; set to False to keep it.
       delete_pg_collection=True,
   )
   ```

   The data will only be deleted from the original table once all of it has been successfully copied to the destination table.
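
   As a quick sanity check (an optional suggestion, not part of the migration itself), you can query the new table. With `FakeEmbeddings` the ranking is meaningless, but the call confirms that documents were copied:

   ```python
   # Returns up to 5 documents from the migrated table.
   docs = destination_vector_store.similarity_search("query", k=5)
   print(len(docs))
   ```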

> **TIP:** If you would like to migrate multiple collections, you can use the `list_pgvector_collection_names` method (or its async counterpart, `alist_pgvector_collection_names`) to get the names of all collections, allowing you to iterate through them.
>
> ```python
> from langchain_postgres.utils.pgvector_migrator import list_pgvector_collection_names
>
> all_collection_names = list_pgvector_collection_names(engine)
> print(all_collection_names)
> ```
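
For example, building on the helpers above, you could migrate every collection in one pass; naming each destination table after its collection is just an illustration, not a requirement:

```python
from langchain_core.embeddings import FakeEmbeddings
from langchain_postgres import PGVectorStore
from langchain_postgres.utils.pgvector_migrator import (
    list_pgvector_collection_names,
    migrate_pgvector_collection,
)

# Assumes `engine` and VECTOR_SIZE are defined as in the steps above.
for collection_name in list_pgvector_collection_names(engine):
    # Create one destination table per collection.
    engine.init_vectorstore_table(
        table_name=collection_name,
        vector_size=VECTOR_SIZE,
    )
    vector_store = PGVectorStore.create_sync(
        engine,
        embedding_service=FakeEmbeddings(size=VECTOR_SIZE),
        table_name=collection_name,
    )
    migrate_pgvector_collection(
        engine,
        collection_name=collection_name,
        vector_store=vector_store,
        delete_pg_collection=True,
    )
```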

### (Not Recommended) Use the PGVectorStore interface on a PGVector database

If you choose not to migrate your data, you can still use the PGVectorStore interface with your existing PGVector database. However, you won't benefit from the performance improvements of the PGVectorStore-style schema.

1. **Create a PG engine.**

   ```python
   from langchain_postgres import PGEngine

   # Replace CONNECTION_STRING with your database connection string
   engine = PGEngine.from_connection_string(url=CONNECTION_STRING)
   ```

   > **_NOTE:_** All sync methods have corresponding async methods.

2. **Create a vector store object to interact with the data.**

   Use the same embedding service that was used to populate your database; see [LangChain's Embedding models](https://python.langchain.com/docs/integrations/text_embedding/) for the available integrations. `FakeEmbeddings` appears below only as a placeholder.

   ```python
   from langchain_postgres import PGVectorStore
   from langchain_core.embeddings import FakeEmbeddings

   vector_store = PGVectorStore.create_sync(
       engine=engine,
       table_name="langchain_pg_embedding",
       embedding_service=FakeEmbeddings(size=VECTOR_SIZE),
       content_column="document",
       metadata_json_column="cmetadata",
       metadata_columns=["collection_id"],
       id_column="id",
   )
   ```
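
   For example, if your collection was embedded with OpenAI's `text-embedding-3-small` model (an assumption for illustration; requires the `langchain-openai` package), the setup might look like this:

   ```python
   from langchain_openai import OpenAIEmbeddings
   from langchain_postgres import PGVectorStore

   # text-embedding-3-small produces 1536-dimensional vectors.
   vector_store = PGVectorStore.create_sync(
       engine=engine,
       table_name="langchain_pg_embedding",
       embedding_service=OpenAIEmbeddings(model="text-embedding-3-small"),
       content_column="document",
       metadata_json_column="cmetadata",
       metadata_columns=["collection_id"],
       id_column="id",
   )
   ```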

3. **Perform similarity search.**

   Filter by collection id:

   ```python
   vector_store.similarity_search("query", k=5, filter=f"collection_id='{uuid}'")
   ```

   Filter by collection id and metadata:

   ```python
   vector_store.similarity_search(
       "query", k=5, filter=f"collection_id='{uuid}' and cmetadata->>'col_name' = 'value'"
   )
   ```
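
   On a migrated table, the same kind of filter can target a dedicated metadata column directly instead of the `cmetadata` JSON blob (a sketch using the `destination_vector_store` from the migration steps above; `col_name` is a placeholder for one of your metadata columns):

   ```python
   destination_vector_store.similarity_search("query", k=5, filter="col_name = 'value'")
   ```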