Commit b1c9819

feat: Add the PGVectorStore (#175)
This PR includes
* New PGVectorStore classes (in namespace v2)
* Migration module and guide
* Deprecation warning for PGVector
* Updated documentation

---------

Co-authored-by: dishaprakash <[email protected]>
1 parent 2883be6 commit b1c9819

35 files changed, +7851 -810 lines changed

README.md

Lines changed: 51 additions & 9 deletions
@@ -9,7 +9,7 @@
99

1010
The `langchain-postgres` package contains implementations of core LangChain abstractions using `Postgres`.
1111

12-
The package is released under the MIT license.
12+
The package is released under the MIT license.
1313

1414
Feel free to use the abstractions as provided, or modify and extend them as appropriate for your own application.
1515

@@ -23,22 +23,65 @@ The package currently only supports the [psycogp3](https://www.psycopg.org/psyco
2323
pip install -U langchain-postgres
2424
```
2525

26-
## Change Log
26+
## Usage
2727

28-
0.0.6:
29-
- Remove langgraph as a dependency as it was causing dependency conflicts.
30-
- Base interface for checkpointer changed in langgraph, so existing implementation would've broken regardless.
28+
### Vectorstore
3129

32-
## Usage
30+
> [!NOTE]
31+
> See example for the [PGVector vectorstore here](https://github.com/langchain-ai/langchain-postgres/blob/main/examples/vectorstore.ipynb)
32+
`PGVector` is being deprecated. Please migrate to `PGVectorStore`.
33+
`PGVectorStore` offers improved performance and manageability.
34+
See the [migration guide](https://github.com/langchain-ai/langchain-postgres/blob/main/examples/migrate_pgvector_to_pgvectorstore.md) for details on how to migrate from `PGVector` to `PGVectorStore`.
35+
36+
> [!TIP]
37+
> All synchronous functions have corresponding asynchronous functions.
38+
39+
```python
40+
from langchain_postgres import PGEngine, PGVectorStore
41+
from langchain_core.documents import Document
from langchain_core.embeddings import DeterministicFakeEmbedding
42+
import uuid
43+
44+
# Replace these variable values
45+
engine = PGEngine.from_connection_string(url=CONNECTION_STRING)
46+
47+
VECTOR_SIZE = 768
48+
embedding = DeterministicFakeEmbedding(size=VECTOR_SIZE)
49+
50+
engine.init_vectorstore_table(
51+
table_name="destination_table",
52+
vector_size=VECTOR_SIZE,
53+
)
54+
55+
store = PGVectorStore.create_sync(
56+
engine=engine,
57+
table_name="destination_table",
58+
embedding_service=embedding,
59+
)
60+
61+
all_texts = ["Apples and oranges", "Cars and airplanes", "Pineapple", "Train", "Banana"]
62+
metadatas = [{"len": len(t)} for t in all_texts]
63+
ids = [str(uuid.uuid4()) for _ in all_texts]
64+
docs = [
65+
Document(id=ids[i], page_content=all_texts[i], metadata=metadatas[i]) for i in range(len(all_texts))
66+
]
67+
68+
store.add_documents(docs)
69+
70+
query = "I'd like a fruit."
71+
docs = store.similarity_search(query)
72+
print(docs)
73+
```
74+
75+
For a detailed example of `PGVectorStore`, see [this notebook](https://github.com/langchain-ai/langchain-postgres/blob/main/examples/pg_vectorstore.ipynb).
3376

3477
### ChatMessageHistory
3578

36-
The chat message history abstraction helps to persist chat message history
79+
The chat message history abstraction helps to persist chat message history
3780
in a postgres table.
3881

3982
PostgresChatMessageHistory is parameterized using a `table_name` and a `session_id`.
4083

41-
The `table_name` is the name of the table in the database where
84+
The `table_name` is the name of the table in the database where
4285
the chat messages will be stored.
4386

4487
The `session_id` is a unique identifier for the chat session. It can be assigned
@@ -79,7 +122,6 @@ chat_history.add_messages([
79122
print(chat_history.messages)
80123
```
81124

82-
83125
### Vectorstore
84126

85127
See example for the [PGVector vectorstore here](https://github.com/langchain-ai/langchain-postgres/blob/main/examples/vectorstore.ipynb)
Lines changed: 174 additions & 0 deletions
@@ -0,0 +1,174 @@
1+
# Migrate a `PGVector` vector store to `PGVectorStore`
2+
3+
This guide shows how to migrate from the [`PGVector`](https://github.com/langchain-ai/langchain-postgres/blob/main/langchain_postgres/vectorstores.py) vector store class to the [`PGVectorStore`](https://github.com/langchain-ai/langchain-postgres/blob/main/langchain_postgres/vectorstore.py) class.
4+
5+
## Why migrate?
6+
7+
This guide explains how to migrate your vector data from a PGVector-style database (two tables) to a PGVectorStore-style database (one table per collection) for improved performance and manageability.
8+
9+
Migrating to the PGVectorStore interface provides the following benefits:
10+
11+
- **Simplified management**: A single table contains data corresponding to a single collection, making it easier to query, update, and maintain.
12+
- **Improved metadata handling**: It stores metadata in columns instead of JSON, resulting in significant performance improvements.
13+
- **Schema flexibility**: The interface allows users to add tables into any database schema.
14+
- **Improved performance**: The single-table schema can lead to faster query execution, especially for large collections.
15+
- **Clear separation**: Clearly separate table and extension creation, allowing for distinct permissions and streamlined workflows.
16+
- **Secure connections**: The PGVectorStore interface creates a secure connection pool that can be easily shared across your application using the `engine` object.
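To make the metadata point concrete, the sketch below splits a PGVector-style JSON metadata dict into dedicated column values plus a leftover JSON string. The function name `split_metadata` and the sample keys are hypothetical and used for illustration only; this is not the library's own migration code.

```python
import json
from typing import Any


def split_metadata(
    metadata: dict[str, Any], metadata_columns: list[str]
) -> tuple[dict[str, Any], str]:
    """Split a metadata dict into column values and a leftover JSON string.

    Keys named in `metadata_columns` become dedicated column values
    (missing keys map to None); every other key stays in the JSON blob.
    """
    columns = {name: metadata.get(name) for name in metadata_columns}
    leftover = {k: v for k, v in metadata.items() if k not in metadata_columns}
    return columns, json.dumps(leftover, sort_keys=True)


columns, leftover_json = split_metadata(
    {"source": "faq.md", "author": "ana", "len": 18},
    metadata_columns=["source", "author"],
)
print(columns)        # {'source': 'faq.md', 'author': 'ana'}
print(leftover_json)  # {"len": 18}
```

Values stored in their own columns can be filtered with ordinary column predicates, while anything left in the JSON string still needs JSON operators to query.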
17+
18+
## Migration process
19+
20+
> **_NOTE:_** The langchain-core library is installed to use the Fake embeddings service. To use a different embedding service, you'll need to install the appropriate library for your chosen provider. Choose embeddings services from [LangChain's Embedding models](https://python.langchain.com/v0.2/docs/integrations/text_embedding/).
21+
22+
While you can use the existing PGVector database, we **strongly recommend** migrating your data to the PGVectorStore-style schema to take full advantage of the performance benefits.
23+
24+
### (Recommended) Data migration
25+
26+
1. **Create a PG engine.**
27+
28+
```python
29+
from langchain_postgres import PGEngine
30+
31+
# Replace these variable values
32+
engine = PGEngine.from_connection_string(url=CONNECTION_STRING)
33+
```
34+
35+
> **_NOTE:_** All sync methods have corresponding async methods.
36+
37+
2. **Create a new table to migrate existing data.**
38+
39+
```python
40+
# Vertex AI embeddings use a vector size of 768.
41+
# Adjust this according to your embeddings service.
42+
VECTOR_SIZE = 768
43+
44+
engine.init_vectorstore_table(
45+
table_name="destination_table",
46+
vector_size=VECTOR_SIZE,
47+
)
48+
```
49+
50+
**(Optional) Customize your table.**
51+
52+
When creating your vectorstore table, you have the flexibility to define custom metadata and ID columns. This is particularly useful for:
53+
54+
- **Filtering**: Metadata columns allow you to easily filter your data within the vectorstore. For example, you might store the document source, date, or author as metadata for efficient retrieval.
55+
- **Non-UUID Identifiers**: By default, the id_column uses UUIDs. If you need to use a different type of ID (e.g., an integer or string), you can define a custom id_column.
56+
57+
```python
58+
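from langchain_postgres import Column

# Assumes `collection_name` holds the name of the collection being migrated.
metadata_columns = [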
59+
Column(f"col_0_{collection_name}", "VARCHAR"),
60+
Column(f"col_1_{collection_name}", "VARCHAR"),
61+
]
62+
engine.init_vectorstore_table(
63+
table_name="destination_table",
64+
vector_size=VECTOR_SIZE,
65+
metadata_columns=metadata_columns,
66+
id_column=Column("langchain_id", "VARCHAR"),
67+
)
68+
```
69+
70+
3. **Create a vector store object to interact with the new data.**
71+
72+
> **_NOTE:_** The `FakeEmbeddings` embedding service is only used to initialise a vector store object, not to generate any embeddings. The embeddings are copied directly from the PGVector table.
73+
74+
```python
75+
from langchain_postgres import PGVectorStore
76+
from langchain_core.embeddings import FakeEmbeddings
77+
78+
destination_vector_store = PGVectorStore.create_sync(
79+
engine,
80+
embedding_service=FakeEmbeddings(size=VECTOR_SIZE),
81+
table_name="destination_table",
82+
)
83+
```
84+
85+
If you have any customisations on the metadata or the id columns, add them to the vector store as follows:
86+
87+
```python
88+
from langchain_postgres import PGVectorStore
89+
from langchain_core.embeddings import FakeEmbeddings
90+
91+
destination_vector_store = PGVectorStore.create_sync(
92+
engine,
93+
embedding_service=FakeEmbeddings(size=VECTOR_SIZE),
94+
table_name="destination_table",
95+
metadata_columns=[col.name for col in metadata_columns],
96+
id_column="langchain_id",
97+
)
98+
```
99+
100+
4. **Migrate the data to the new table.**
101+
102+
```python
103+
from langchain_postgres.utils.pgvector_migrator import migrate_pgvector_collection
104+
105+
migrate_pgvector_collection(
106+
engine,
107+
# Set collection name here
108+
collection_name="collection_name",
109+
vector_store=destination_vector_store,
110+
# This deletes data from the original table upon migration. You can choose to turn it off.
111+
delete_pg_collection=True,
112+
)
113+
```
114+
115+
The data will only be deleted from the original table once all of it has been successfully copied to the destination table.
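The copy-then-delete ordering described above can be sketched with the standard library's `sqlite3` module. This is only an illustration of the ordering guarantee, not the migrator's implementation; the table names `source` and `destination` and the function `copy_then_delete` are hypothetical.

```python
import sqlite3


def copy_then_delete(
    conn: sqlite3.Connection, collection: str, batch_size: int = 2
) -> int:
    """Copy one collection's rows to `destination`, then delete the originals.

    The delete runs only after the number of newly inserted rows matches the
    number of source rows; a mismatch raises and the transaction rolls back.
    """
    with conn:  # commits on success, rolls back on exception
        before = conn.execute("SELECT COUNT(*) FROM destination").fetchone()[0]
        rows = conn.execute(
            "SELECT id, document FROM source WHERE collection = ?", (collection,)
        ).fetchall()
        # Insert in batches so large collections are not sent in one statement.
        for start in range(0, len(rows), batch_size):
            conn.executemany(
                "INSERT INTO destination (id, document) VALUES (?, ?)",
                rows[start : start + batch_size],
            )
        after = conn.execute("SELECT COUNT(*) FROM destination").fetchone()[0]
        if after - before != len(rows):
            raise RuntimeError("row count mismatch; source rows were kept")
        conn.execute("DELETE FROM source WHERE collection = ?", (collection,))
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source (id TEXT PRIMARY KEY, collection TEXT, document TEXT)"
)
conn.execute("CREATE TABLE destination (id TEXT PRIMARY KEY, document TEXT)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?)",
    [
        ("1", "fruits", "Banana"),
        ("2", "fruits", "Pineapple"),
        ("3", "vehicles", "Train"),
    ],
)
conn.commit()

migrated = copy_then_delete(conn, "fruits")
print(migrated)  # 2
```

Because the copy, the count check, and the delete share one transaction, a failure at any point leaves the source rows untouched.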
116+
117+
> **TIP:** To migrate multiple collections, use the `list_pgvector_collection_names` function (or its async counterpart, `alist_pgvector_collection_names`) to get the names of all collections, then iterate through them.
118+
>
119+
> ```python
120+
> from langchain_postgres.utils.pgvector_migrator import list_pgvector_collection_names
121+
>
122+
> all_collection_names = list_pgvector_collection_names(engine)
123+
> print(all_collection_names)
124+
> ```
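One way to iterate through the collection names is sketched below. The helper `migrate_all` and its `migrate_one` callback are hypothetical; in practice the callback would presumably wrap a call to `migrate_pgvector_collection` with a destination vector store created for that collection. A fake callback is used here so the sketch runs without a database.

```python
from typing import Callable, Iterable


def migrate_all(
    collection_names: Iterable[str],
    migrate_one: Callable[[str], None],
) -> dict[str, str]:
    """Run `migrate_one` per collection and record 'ok' or the error text."""
    results: dict[str, str] = {}
    for name in collection_names:
        try:
            migrate_one(name)
            results[name] = "ok"
        except Exception as exc:
            # Keep going so one bad collection does not block the rest.
            results[name] = f"failed: {exc}"
    return results


def fake_migrate(name: str) -> None:
    """Stand-in for a real migration call; fails for one collection."""
    if name == "broken":
        raise ValueError("no such collection")


summary = migrate_all(["fruits", "broken"], fake_migrate)
print(summary)  # {'fruits': 'ok', 'broken': 'failed: no such collection'}
```

Collecting per-collection results makes it easy to retry only the collections that failed.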
125+
126+
### (Not Recommended) Use PGVectorStore interface on PGVector databases
127+
128+
If you choose not to migrate your data, you can still use the PGVectorStore interface with your existing PGVector database. However, you won't benefit from the performance improvements of the PGVectorStore-style schema.
129+
130+
1. **Create a PG engine.**
131+
132+
```python
133+
from langchain_postgres import PGEngine
134+
135+
# Replace these variable values
136+
engine = PGEngine.from_connection_string(url=CONNECTION_STRING)
137+
```
138+
139+
> **_NOTE:_** All sync methods have corresponding async methods.
140+
141+
2. **Create a vector store object to interact with the data.**
142+
143+
Use the embeddings service used by your database. See the [LangChain docs](https://python.langchain.com/docs/integrations/text_embedding/) for reference.
144+
145+
```python
146+
from langchain_postgres import PGVectorStore
147+
from langchain_core.embeddings import FakeEmbeddings
148+
149+
vector_store = PGVectorStore.create_sync(
150+
engine=engine,
151+
table_name="langchain_pg_embedding",
152+
embedding_service=FakeEmbeddings(size=VECTOR_SIZE),
153+
content_column="document",
154+
metadata_json_column="cmetadata",
155+
metadata_columns=["collection_id"],
156+
id_column="id",
157+
)
158+
```
159+
160+
3. **Perform similarity search.**
161+
162+
Filter by collection id:
163+
164+
```python
165+
vector_store.similarity_search("query", k=5, filter=f"collection_id='{uuid}'")
166+
```
167+
168+
Filter by collection id and metadata:
169+
170+
```python
171+
vector_store.similarity_search(
172+
"query", k=5, filter=f"collection_id='{uuid}' and cmetadata->>'col_name' = 'value'"
173+
)
174+
```
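The filter examples above interpolate a collection id directly into the filter string. A minimal sketch of a safer way to build that string is shown below: the id is parsed as a UUID first, so malformed input is rejected before it reaches the filter. The helper name `collection_filter` is hypothetical and not part of the library.

```python
import uuid
from typing import Optional


def collection_filter(collection_id: str, extra: Optional[str] = None) -> str:
    """Build a filter string for one collection.

    `collection_id` is parsed as a UUID first, so arbitrary text never
    reaches the filter; `extra` is appended verbatim and must be trusted.
    """
    canonical = str(uuid.UUID(collection_id))  # raises ValueError if malformed
    clause = f"collection_id='{canonical}'"
    return f"{clause} and {extra}" if extra else clause


print(collection_filter("12345678-1234-5678-1234-567812345678"))
# collection_id='12345678-1234-5678-1234-567812345678'
```

The returned string can be passed as the `filter` argument in the `similarity_search` calls shown above.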
