Commit c957a09

feat: Add the PGVectorStore class (#168)
Co-authored-by: Averi Kitsch <[email protected]>
1 parent e72f740 commit c957a09

32 files changed, +7116 -801 lines changed
Lines changed: 174 additions & 0 deletions
@@ -0,0 +1,174 @@
1+
# Migrate a `PGVector` vector store to `PGVectorStore`
2+
3+
This guide shows how to migrate from the [`PGVector`](https://github.com/langchain-ai/langchain-postgres/blob/main/langchain_postgres/vectorstores.py) vector store class to the [`PGVectorStore`](https://github.com/langchain-ai/langchain-postgres/blob/main/langchain_postgres/vectorstore.py) class.
4+
5+
## Why migrate?
6+
7+
A PGVector-style database stores all collections in two shared tables, while a PGVectorStore-style database uses one table per collection. Migrating your vector data to the PGVectorStore-style schema improves performance and manageability.
8+
9+
Migrating to the PGVectorStore interface provides the following benefits:
10+
11+
- **Simplified management**: A single table contains data corresponding to a single collection, making it easier to query, update, and maintain.
12+
- **Improved metadata handling**: Metadata is stored in columns instead of JSON, which can significantly improve performance.
13+
- **Schema flexibility**: The interface allows users to add tables into any database schema.
14+
- **Improved performance**: The single-table schema can lead to faster query execution, especially for large collections.
15+
- **Clear separation**: Table creation and extension creation are separate steps, allowing for distinct permissions and streamlined workflows.
16+
- **Secure connections**: The PGVectorStore interface creates a secure connection pool that can be easily shared across your application using the `engine` object.
17+
18+
## Migration process
19+
20+
> **_NOTE:_** This guide uses the `FakeEmbeddings` service from the langchain-core library. To use a different embedding service, install the appropriate library for your chosen provider. Choose an embeddings service from [LangChain's Embedding models](https://python.langchain.com/v0.2/docs/integrations/text_embedding/).
21+
22+
While you can use the existing PGVector database, we **strongly recommend** migrating your data to the PGVectorStore-style schema to take full advantage of the performance benefits.
23+
24+
### (Recommended) Data migration
25+
26+
1. **Create a PG engine.**
27+
28+
```python
29+
from langchain_postgres import PGEngine
30+
31+
# Replace CONNECTION_STRING with the connection URL of your database
32+
engine = PGEngine.from_connection_string(url=CONNECTION_STRING)
33+
```
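The snippet above leaves `CONNECTION_STRING` undefined. As a minimal sketch, it could be assembled from its parts as follows. The helper name `build_connection_string` is hypothetical, and the `psycopg` driver suffix is borrowed from the notebook connection URL elsewhere in this commit; it is an assumption, so check which driver your `PGEngine` version expects.

```python
from urllib.parse import quote_plus


def build_connection_string(
    user: str,
    password: str,
    host: str,
    port: int,
    database: str,
    driver: str = "psycopg",  # assumption: replace with the driver your engine expects
) -> str:
    """Assemble a SQLAlchemy-style PostgreSQL URL, escaping the credentials."""
    return (
        f"postgresql+{driver}://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Placeholder values matching the local docker instance used in the notebook.
CONNECTION_STRING = build_connection_string(
    user="langchain",
    password="langchain",
    host="localhost",
    port=6024,
    database="langchain",
)
```

Escaping the user name and password keeps characters such as `@` or `/` from being misread as URL delimiters.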
34+
35+
> **_NOTE:_** All sync methods have corresponding async methods.
36+
37+
2. **Create a new table to migrate existing data.**
38+
39+
```python
40+
# Vertex AI embeddings use a vector size of 768.
41+
# Adjust this according to your embeddings service.
42+
VECTOR_SIZE = 768
43+
44+
engine.init_vectorstore_table(
45+
table_name="destination_table",
46+
vector_size=VECTOR_SIZE,
47+
)
48+
```
49+
50+
**(Optional) Customize your table.**
51+
52+
When creating your vectorstore table, you have the flexibility to define custom metadata and ID columns. This is particularly useful for:
53+
54+
- **Filtering**: Metadata columns allow you to easily filter your data within the vectorstore. For example, you might store the document source, date, or author as metadata for efficient retrieval.
55+
- **Non-UUID Identifiers**: By default, the id_column uses UUIDs. If you need to use a different type of ID (e.g., an integer or string), you can define a custom id_column.
56+
57+
```python
58+
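from langchain_postgres import Column

# collection_name is assumed to hold the name of the PGVector collection being migrated.
metadata_columns = [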
59+
Column(f"col_0_{collection_name}", "VARCHAR"),
60+
Column(f"col_1_{collection_name}", "VARCHAR"),
61+
]
62+
engine.init_vectorstore_table(
63+
table_name="destination_table",
64+
vector_size=VECTOR_SIZE,
65+
metadata_columns=metadata_columns,
66+
id_column=Column("langchain_id", "VARCHAR"),
67+
)
68+
```
69+
70+
3. **Create a vector store object to interact with the new data.**
71+
72+
> **_NOTE:_** The `FakeEmbeddings` embedding service is only used to initialise a vector store object, not to generate any embeddings. The embeddings are copied directly from the PGVector table.
73+
74+
```python
75+
from langchain_postgres import PGVectorStore
76+
from langchain_core.embeddings import FakeEmbeddings
77+
78+
destination_vector_store = PGVectorStore.create_sync(
79+
engine,
80+
embedding_service=FakeEmbeddings(size=VECTOR_SIZE),
81+
table_name="destination_table",
82+
)
83+
```
84+
85+
If you have any customisations on the metadata or the id columns, add them to the vector store as follows:
86+
87+
```python
88+
from langchain_postgres import PGVectorStore
89+
from langchain_core.embeddings import FakeEmbeddings
90+
91+
destination_vector_store = PGVectorStore.create_sync(
92+
engine,
93+
embedding_service=FakeEmbeddings(size=VECTOR_SIZE),
94+
table_name="destination_table",
95+
metadata_columns=[col.name for col in metadata_columns],
96+
id_column="langchain_id",
97+
)
98+
```
99+
100+
4. **Migrate the data to the new table.**
101+
102+
```python
103+
from langchain_postgres.utils.pgvector_migrator import migrate_pgvector_collection
104+
105+
migrate_pgvector_collection(
106+
engine,
107+
# Set collection name here
108+
collection_name="collection_name",
109+
vector_store=destination_vector_store,
110+
# Deletes the data from the original table after migration. Set to False to keep it.
111+
delete_pg_collection=True,
112+
)
113+
```
114+
115+
The data will only be deleted from the original table once all of it has been successfully copied to the destination table.
116+
117+
> **TIP:** To migrate multiple collections, use the `list_pgvector_collection_names` method to get the names of all collections, then iterate through them.
118+
>
119+
> ```python
120+
> from langchain_postgres.utils.pgvector_migrator import list_pgvector_collection_names
121+
>
122+
> all_collection_names = list_pgvector_collection_names(engine)
123+
> print(all_collection_names)
124+
> ```
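Iterating over several collections could be sketched as below. This is an illustration under stated assumptions, not part of the library: `table_name_for` and `migrate_all` are hypothetical helpers, the `pgvs_` prefix is an arbitrary naming scheme, and the per-collection work is passed in as a callable so the loop can be dry-run without a database.

```python
import re
from typing import Callable, Iterable


def table_name_for(collection_name: str, prefix: str = "pgvs_") -> str:
    """Derive a PostgreSQL-friendly table name from a collection name (hypothetical scheme)."""
    slug = re.sub(r"[^a-z0-9_]+", "_", collection_name.lower()).strip("_")
    return f"{prefix}{slug}"[:63]  # PostgreSQL truncates identifiers to 63 bytes


def migrate_all(
    collection_names: Iterable[str],
    migrate_one: Callable[[str, str], None],
) -> dict[str, str]:
    """Call migrate_one(collection_name, table_name) once per collection."""
    plan: dict[str, str] = {}
    for name in collection_names:
        table = table_name_for(name)
        # Two collections must never land in the same destination table.
        if table in plan.values():
            raise ValueError(f"table name collision for collection {name!r}")
        migrate_one(name, table)
        plan[name] = table
    return plan
```

In real use, `migrate_one` would wrap steps 2 to 4 of this guide for a single collection (create the table, create the vector store, run the migration), and `collection_names` would come from the collection-listing helper shown in the tip.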
125+
126+
### (Not Recommended) Use PGVectorStore interface on PGVector databases
127+
128+
If you choose not to migrate your data, you can still use the PGVectorStore interface with your existing PGVector database. However, you won't benefit from the performance improvements of the PGVectorStore-style schema.
129+
130+
1. **Create a PG engine.**
131+
132+
```python
133+
from langchain_postgres import PGEngine
134+
135+
# Replace CONNECTION_STRING with the connection URL of your database
136+
engine = PGEngine.from_connection_string(url=CONNECTION_STRING)
137+
```
138+
139+
> **_NOTE:_** All sync methods have corresponding async methods.
140+
141+
2. **Create a vector store object to interact with the data.**
142+
143+
Use the same embeddings service that generated the vectors in your database; `FakeEmbeddings` in the snippet is only a placeholder. See the [LangChain docs](https://python.langchain.com/docs/integrations/text_embedding/) for reference.
144+
145+
```python
146+
from langchain_postgres import PGVectorStore
147+
from langchain_core.embeddings import FakeEmbeddings
148+
149+
vector_store = PGVectorStore.create_sync(
150+
engine=engine,
151+
table_name="langchain_pg_embedding",
152+
embedding_service=FakeEmbeddings(size=VECTOR_SIZE),
153+
content_column="document",
154+
metadata_json_column="cmetadata",
155+
metadata_columns=["collection_id"],
156+
id_column="id",
157+
)
158+
```
159+
160+
3. **Perform similarity search.**
161+
162+
Filter by collection id:
163+
164+
```python
165+
vector_store.similarity_search("query", k=5, filter=f"collection_id='{uuid}'")
166+
```
167+
168+
Filter by collection id and metadata:
169+
170+
```python
171+
vector_store.similarity_search(
172+
"query", k=5, filter=f"collection_id='{uuid}' and cmetadata->>'col_name' = 'value'"
173+
)
174+
```
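The filters above interpolate values directly into a SQL string, and the `uuid` variable is assumed to hold the collection's UUID. A minimal sketch of a helper that validates each part before building the string might look like this; `collection_filter` is a hypothetical name, and the quote doubling assumes PostgreSQL's default `standard_conforming_strings` setting.

```python
import re
import uuid
from typing import Optional


def collection_filter(
    collection_id: str,
    metadata_key: Optional[str] = None,
    metadata_value: Optional[str] = None,
) -> str:
    """Build the SQL filter string, validating every interpolated part."""
    # uuid.UUID raises ValueError for anything that is not a well-formed UUID.
    clause = f"collection_id='{uuid.UUID(collection_id)}'"
    if metadata_key is not None and metadata_value is not None:
        # Only plain identifier-like keys are accepted in this sketch.
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", metadata_key):
            raise ValueError(f"unsupported metadata key: {metadata_key!r}")
        escaped = metadata_value.replace("'", "''")  # SQL quote doubling
        clause += f" and cmetadata->>'{metadata_key}' = '{escaped}'"
    return clause
```

The returned string can then be passed as the `filter` argument in place of the hand-written f-strings.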

examples/vectorstore.ipynb

Lines changed: 101 additions & 41 deletions
@@ -72,7 +72,7 @@
7272
"from langchain_core.documents import Document\n",
7373
"\n",
7474
"# See docker command above to launch a postgres instance with pgvector enabled.\n",
75-
"connection = \"postgresql+psycopg://langchain:langchain@localhost:6024/langchain\" \n",
75+
"connection = \"postgresql+psycopg://langchain:langchain@localhost:6024/langchain\"\n",
7676
"collection_name = \"my_docs\"\n",
7777
"embeddings = CohereEmbeddings()\n",
7878
"\n",
@@ -126,17 +126,47 @@
126126
"outputs": [],
127127
"source": [
128128
"docs = [\n",
129-
" Document(page_content='there are cats in the pond', metadata={\"id\": 1, \"location\": \"pond\", \"topic\": \"animals\"}),\n",
130-
" Document(page_content='ducks are also found in the pond', metadata={\"id\": 2, \"location\": \"pond\", \"topic\": \"animals\"}),\n",
131-
" Document(page_content='fresh apples are available at the market', metadata={\"id\": 3, \"location\": \"market\", \"topic\": \"food\"}),\n",
132-
" Document(page_content='the market also sells fresh oranges', metadata={\"id\": 4, \"location\": \"market\", \"topic\": \"food\"}),\n",
133-
" Document(page_content='the new art exhibit is fascinating', metadata={\"id\": 5, \"location\": \"museum\", \"topic\": \"art\"}),\n",
134-
" Document(page_content='a sculpture exhibit is also at the museum', metadata={\"id\": 6, \"location\": \"museum\", \"topic\": \"art\"}),\n",
135-
" Document(page_content='a new coffee shop opened on Main Street', metadata={\"id\": 7, \"location\": \"Main Street\", \"topic\": \"food\"}),\n",
136-
" Document(page_content='the book club meets at the library', metadata={\"id\": 8, \"location\": \"library\", \"topic\": \"reading\"}),\n",
137-
" Document(page_content='the library hosts a weekly story time for kids', metadata={\"id\": 9, \"location\": \"library\", \"topic\": \"reading\"}),\n",
138-
" Document(page_content='a cooking class for beginners is offered at the community center', metadata={\"id\": 10, \"location\": \"community center\", \"topic\": \"classes\"})\n",
139-
"]\n"
129+
" Document(\n",
130+
" page_content=\"there are cats in the pond\",\n",
131+
" metadata={\"id\": 1, \"location\": \"pond\", \"topic\": \"animals\"},\n",
132+
" ),\n",
133+
" Document(\n",
134+
" page_content=\"ducks are also found in the pond\",\n",
135+
" metadata={\"id\": 2, \"location\": \"pond\", \"topic\": \"animals\"},\n",
136+
" ),\n",
137+
" Document(\n",
138+
" page_content=\"fresh apples are available at the market\",\n",
139+
" metadata={\"id\": 3, \"location\": \"market\", \"topic\": \"food\"},\n",
140+
" ),\n",
141+
" Document(\n",
142+
" page_content=\"the market also sells fresh oranges\",\n",
143+
" metadata={\"id\": 4, \"location\": \"market\", \"topic\": \"food\"},\n",
144+
" ),\n",
145+
" Document(\n",
146+
" page_content=\"the new art exhibit is fascinating\",\n",
147+
" metadata={\"id\": 5, \"location\": \"museum\", \"topic\": \"art\"},\n",
148+
" ),\n",
149+
" Document(\n",
150+
" page_content=\"a sculpture exhibit is also at the museum\",\n",
151+
" metadata={\"id\": 6, \"location\": \"museum\", \"topic\": \"art\"},\n",
152+
" ),\n",
153+
" Document(\n",
154+
" page_content=\"a new coffee shop opened on Main Street\",\n",
155+
" metadata={\"id\": 7, \"location\": \"Main Street\", \"topic\": \"food\"},\n",
156+
" ),\n",
157+
" Document(\n",
158+
" page_content=\"the book club meets at the library\",\n",
159+
" metadata={\"id\": 8, \"location\": \"library\", \"topic\": \"reading\"},\n",
160+
" ),\n",
161+
" Document(\n",
162+
" page_content=\"the library hosts a weekly story time for kids\",\n",
163+
" metadata={\"id\": 9, \"location\": \"library\", \"topic\": \"reading\"},\n",
164+
" ),\n",
165+
" Document(\n",
166+
" page_content=\"a cooking class for beginners is offered at the community center\",\n",
167+
" metadata={\"id\": 10, \"location\": \"community center\", \"topic\": \"classes\"},\n",
168+
" ),\n",
169+
"]"
140170
]
141171
},
142172
{
@@ -159,7 +189,7 @@
159189
}
160190
],
161191
"source": [
162-
"vectorstore.add_documents(docs, ids=[doc.metadata['id'] for doc in docs])"
192+
"vectorstore.add_documents(docs, ids=[doc.metadata[\"id\"] for doc in docs])"
163193
]
164194
},
165195
{
@@ -191,7 +221,7 @@
191221
}
192222
],
193223
"source": [
194-
"vectorstore.similarity_search('kitty', k=10)"
224+
"vectorstore.similarity_search(\"kitty\", k=10)"
195225
]
196226
},
197227
{
@@ -212,17 +242,47 @@
212242
"outputs": [],
213243
"source": [
214244
"docs = [\n",
215-
" Document(page_content='there are cats in the pond', metadata={\"id\": 1, \"location\": \"pond\", \"topic\": \"animals\"}),\n",
216-
" Document(page_content='ducks are also found in the pond', metadata={\"id\": 2, \"location\": \"pond\", \"topic\": \"animals\"}),\n",
217-
" Document(page_content='fresh apples are available at the market', metadata={\"id\": 3, \"location\": \"market\", \"topic\": \"food\"}),\n",
218-
" Document(page_content='the market also sells fresh oranges', metadata={\"id\": 4, \"location\": \"market\", \"topic\": \"food\"}),\n",
219-
" Document(page_content='the new art exhibit is fascinating', metadata={\"id\": 5, \"location\": \"museum\", \"topic\": \"art\"}),\n",
220-
" Document(page_content='a sculpture exhibit is also at the museum', metadata={\"id\": 6, \"location\": \"museum\", \"topic\": \"art\"}),\n",
221-
" Document(page_content='a new coffee shop opened on Main Street', metadata={\"id\": 7, \"location\": \"Main Street\", \"topic\": \"food\"}),\n",
222-
" Document(page_content='the book club meets at the library', metadata={\"id\": 8, \"location\": \"library\", \"topic\": \"reading\"}),\n",
223-
" Document(page_content='the library hosts a weekly story time for kids', metadata={\"id\": 9, \"location\": \"library\", \"topic\": \"reading\"}),\n",
224-
" Document(page_content='a cooking class for beginners is offered at the community center', metadata={\"id\": 10, \"location\": \"community center\", \"topic\": \"classes\"})\n",
225-
"]\n"
245+
" Document(\n",
246+
" page_content=\"there are cats in the pond\",\n",
247+
" metadata={\"id\": 1, \"location\": \"pond\", \"topic\": \"animals\"},\n",
248+
" ),\n",
249+
" Document(\n",
250+
" page_content=\"ducks are also found in the pond\",\n",
251+
" metadata={\"id\": 2, \"location\": \"pond\", \"topic\": \"animals\"},\n",
252+
" ),\n",
253+
" Document(\n",
254+
" page_content=\"fresh apples are available at the market\",\n",
255+
" metadata={\"id\": 3, \"location\": \"market\", \"topic\": \"food\"},\n",
256+
" ),\n",
257+
" Document(\n",
258+
" page_content=\"the market also sells fresh oranges\",\n",
259+
" metadata={\"id\": 4, \"location\": \"market\", \"topic\": \"food\"},\n",
260+
" ),\n",
261+
" Document(\n",
262+
" page_content=\"the new art exhibit is fascinating\",\n",
263+
" metadata={\"id\": 5, \"location\": \"museum\", \"topic\": \"art\"},\n",
264+
" ),\n",
265+
" Document(\n",
266+
" page_content=\"a sculpture exhibit is also at the museum\",\n",
267+
" metadata={\"id\": 6, \"location\": \"museum\", \"topic\": \"art\"},\n",
268+
" ),\n",
269+
" Document(\n",
270+
" page_content=\"a new coffee shop opened on Main Street\",\n",
271+
" metadata={\"id\": 7, \"location\": \"Main Street\", \"topic\": \"food\"},\n",
272+
" ),\n",
273+
" Document(\n",
274+
" page_content=\"the book club meets at the library\",\n",
275+
" metadata={\"id\": 8, \"location\": \"library\", \"topic\": \"reading\"},\n",
276+
" ),\n",
277+
" Document(\n",
278+
" page_content=\"the library hosts a weekly story time for kids\",\n",
279+
" metadata={\"id\": 9, \"location\": \"library\", \"topic\": \"reading\"},\n",
280+
" ),\n",
281+
" Document(\n",
282+
" page_content=\"a cooking class for beginners is offered at the community center\",\n",
283+
" metadata={\"id\": 10, \"location\": \"community center\", \"topic\": \"classes\"},\n",
284+
" ),\n",
285+
"]"
226286
]
227287
},
228288
{
@@ -275,9 +335,7 @@
275335
}
276336
],
277337
"source": [
278-
"vectorstore.similarity_search('kitty', k=10, filter={\n",
279-
" 'id': {'$in': [1, 5, 2, 9]}\n",
280-
"})"
338+
"vectorstore.similarity_search(\"kitty\", k=10, filter={\"id\": {\"$in\": [1, 5, 2, 9]}})"
281339
]
282340
},
283341
{
@@ -309,10 +367,11 @@
309367
}
310368
],
311369
"source": [
312-
"vectorstore.similarity_search('ducks', k=10, filter={\n",
313-
" 'id': {'$in': [1, 5, 2, 9]},\n",
314-
" 'location': {'$in': [\"pond\", \"market\"]}\n",
315-
"})"
370+
"vectorstore.similarity_search(\n",
371+
" \"ducks\",\n",
372+
" k=10,\n",
373+
" filter={\"id\": {\"$in\": [1, 5, 2, 9]}, \"location\": {\"$in\": [\"pond\", \"market\"]}},\n",
374+
")"
316375
]
317376
},
318377
{
@@ -336,12 +395,15 @@
336395
}
337396
],
338397
"source": [
339-
"vectorstore.similarity_search('ducks', k=10, filter={\n",
340-
" '$and': [\n",
341-
" {'id': {'$in': [1, 5, 2, 9]}},\n",
342-
" {'location': {'$in': [\"pond\", \"market\"]}},\n",
343-
" ]\n",
344-
"}\n",
398+
"vectorstore.similarity_search(\n",
399+
" \"ducks\",\n",
400+
" k=10,\n",
401+
" filter={\n",
402+
" \"$and\": [\n",
403+
" {\"id\": {\"$in\": [1, 5, 2, 9]}},\n",
404+
" {\"location\": {\"$in\": [\"pond\", \"market\"]}},\n",
405+
" ]\n",
406+
" },\n",
345407
")"
346408
]
347409
},
@@ -372,9 +434,7 @@
372434
}
373435
],
374436
"source": [
375-
"vectorstore.similarity_search('bird', k=10, filter={\n",
376-
" 'location': { \"$ne\": 'pond'}\n",
377-
"})"
437+
"vectorstore.similarity_search(\"bird\", k=10, filter={\"location\": {\"$ne\": \"pond\"}})"
378438
]
379439
}
380440
],
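The notebook cells in this diff pass dictionary filters using `$in`, `$ne`, and `$and`. As a rough illustration of how those three operators read, here is a pure-Python sketch that evaluates them against plain metadata dictionaries. It is not the library's implementation: the `matches` function is hypothetical, it covers only these three operators, and it assumes that multiple top-level keys are combined with AND, as the paired notebook examples suggest.

```python
from typing import Any, Mapping


def matches(metadata: Mapping[str, Any], flt: Mapping[str, Any]) -> bool:
    """Evaluate a small subset of the filter syntax against one metadata dict."""
    for key, condition in flt.items():
        if key == "$and":
            # Every sub-filter in the list must match.
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif "$in" in condition:
            if metadata.get(key) not in condition["$in"]:
                return False
        elif "$ne" in condition:
            if metadata.get(key) == condition["$ne"]:
                return False
        else:
            raise ValueError(f"operator not covered by this sketch: {condition}")
    return True


# A reduced copy of the notebook's sample metadata.
docs = [
    {"id": 1, "location": "pond"},
    {"id": 2, "location": "pond"},
    {"id": 3, "location": "market"},
    {"id": 5, "location": "museum"},
    {"id": 9, "location": "library"},
]

implicit = {"id": {"$in": [1, 5, 2, 9]}, "location": {"$in": ["pond", "market"]}}
explicit = {"$and": [{"id": {"$in": [1, 5, 2, 9]}}, {"location": {"$in": ["pond", "market"]}}]}

implicit_ids = [d["id"] for d in docs if matches(d, implicit)]
explicit_ids = [d["id"] for d in docs if matches(d, explicit)]
```

Under these assumptions both filter forms select the same documents, which is why the notebook shows them side by side.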
