feat: Add the PGVectorStore class #168

Merged
32 commits
a21a8e2
feat: Add the PGVectorStore class
dishaprakash Mar 19, 2025
926f417
Linter and format fix
dishaprakash Mar 19, 2025
a34ddbe
update poetry lock
dishaprakash Mar 19, 2025
bdd2bf6
minor variable name change
dishaprakash Mar 19, 2025
239b1c3
Fix import test
dishaprakash Mar 19, 2025
544cade
enabled socket in one test file
dishaprakash Mar 19, 2025
03dcac1
enabled socket in all test files
dishaprakash Mar 19, 2025
7b4fa7f
Debug tests being skipped
dishaprakash Mar 19, 2025
1d42314
Debug tests being skipped
dishaprakash Mar 19, 2025
dc9a5b8
Debug tests being skipped
dishaprakash Mar 19, 2025
4c3f93f
Debug tests being failed
dishaprakash Mar 19, 2025
b3a12b7
revert debug lines
dishaprakash Mar 19, 2025
8b30833
Remove IVFIndex
dishaprakash Mar 19, 2025
1496033
Minor change
dishaprakash Mar 19, 2025
cbd0889
Review changes
dishaprakash Apr 1, 2025
b436df3
Refactor vectorstore packaging in import
dishaprakash Apr 1, 2025
eb6954d
Change test table names
dishaprakash Apr 1, 2025
3e52c56
Linter fix
dishaprakash Apr 1, 2025
a24fe73
Minor fix
dishaprakash Apr 1, 2025
c74858e
Fix test
dishaprakash Apr 1, 2025
e52e609
Fix tests
dishaprakash Apr 1, 2025
1f6a70e
Remove chat message history format
dishaprakash Apr 1, 2025
8029731
Fix test
dishaprakash Apr 1, 2025
cf58c2a
Fix indexing tests
dishaprakash Apr 1, 2025
b9526c6
Make escape sql string function private
dishaprakash Apr 1, 2025
1d6563a
Rename namespaces
dishaprakash Apr 2, 2025
c9ad8f3
Enable support for TypedDict along with Column
dishaprakash Apr 2, 2025
a913b5a
Fix import test
dishaprakash Apr 2, 2025
9e539e0
Linter fix
dishaprakash Apr 2, 2025
1daac17
Linter fix
dishaprakash Apr 2, 2025
5062185
Add validation and quotes for indexes
dishaprakash Apr 3, 2025
fe62c35
Merge branch 'pg-vectorstore' into upstream-langchain
averikitsch Apr 4, 2025
174 changes: 174 additions & 0 deletions examples/migrate_pgvector_to_pgvectorstore.md
@@ -0,0 +1,174 @@
# Migrate a `PGVector` vector store to `PGVectorStore`

This guide shows how to migrate from the [`PGVector`](https://github.com/langchain-ai/langchain-postgres/blob/main/langchain_postgres/vectorstores.py) vector store class to the [`PGVectorStore`](https://github.com/langchain-ai/langchain-postgres/blob/main/langchain_postgres/vectorstore.py) class.

## Why migrate?

This guide explains how to migrate your vector data from a PGVector-style database (two tables) to a PGVectorStore-style database (one table per collection) for improved performance and manageability.

Migrating to the PGVectorStore interface provides the following benefits:

- **Simplified management**: A single table contains data corresponding to a single collection, making it easier to query, update, and maintain.
- **Improved metadata handling**: It stores metadata in columns instead of JSON, resulting in significant performance improvements.
- **Schema flexibility**: The interface allows users to add tables into any database schema.
- **Improved performance**: The single-table schema can lead to faster query execution, especially for large collections.
- **Clear separation**: Table creation and extension creation are separate steps, allowing for distinct permissions and streamlined workflows.
- **Secure connections**: The PGVectorStore interface creates a secure connection pool that can be easily shared across your application using the `engine` object.

## Migration process

> **_NOTE:_** This guide uses the `FakeEmbeddings` service from the `langchain-core` library. To use a different embedding service, install the appropriate library for your chosen provider. Choose an embedding service from [LangChain's Embedding models](https://python.langchain.com/v0.2/docs/integrations/text_embedding/).

While you can use the existing PGVector database, we **strongly recommend** migrating your data to the PGVectorStore-style schema to take full advantage of the performance benefits.

### (Recommended) Data migration

1. **Create a PG engine.**

```python
from langchain_postgres import PGEngine

# Replace CONNECTION_STRING with the URL of your own database
engine = PGEngine.from_connection_string(url=CONNECTION_STRING)
```
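
The `CONNECTION_STRING` above is not defined in this guide. As a minimal sketch, assuming a SQLAlchemy-style URL like the one used in `examples/vectorstore.ipynb`, it could be assembled as follows. The helper `build_connection_string` is hypothetical (it is not part of `langchain_postgres`), and the driver name is passed in explicitly because the driver that `PGEngine` expects should be checked against the library documentation.

```python
from urllib.parse import quote


def build_connection_string(driver, user, password, host, port, database):
    """Return a SQLAlchemy-style PostgreSQL URL (hypothetical helper).

    The user and password are percent-encoded so that characters such as
    "@" or "/" cannot break the URL structure.
    """
    return (
        f"postgresql+{driver}://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


# Example values taken from the notebook's local Docker setup; replace them
# with your own. The "psycopg" driver here is an assumption.
CONNECTION_STRING = build_connection_string(
    "psycopg", "langchain", "langchain", "localhost", 6024, "langchain"
)
```

Keeping the credentials out of the literal string also makes it easier to read them from environment variables or a secrets manager.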

> **_NOTE:_** All sync methods have corresponding async methods.

2. **Create a new table to migrate existing data.**

```python
# Vertex AI embeddings uses a vector size of 768.
# Adjust this according to your embeddings service.
VECTOR_SIZE = 768

engine.init_vectorstore_table(
    table_name="destination_table",
    vector_size=VECTOR_SIZE,
)
```

**(Optional) Customize your table.**

When creating your vectorstore table, you have the flexibility to define custom metadata and ID columns. This is particularly useful for:

- **Filtering**: Metadata columns allow you to easily filter your data within the vectorstore. For example, you might store the document source, date, or author as metadata for efficient retrieval.
- **Non-UUID identifiers**: By default, the `id_column` uses UUIDs. If you need to use a different type of ID (e.g., an integer or string), you can define a custom `id_column`.

```python
from langchain_postgres import Column

# Assumes `collection_name` holds the name of the collection being migrated.
metadata_columns = [
    Column(f"col_0_{collection_name}", "VARCHAR"),
    Column(f"col_1_{collection_name}", "VARCHAR"),
]
engine.init_vectorstore_table(
    table_name="destination_table",
    vector_size=VECTOR_SIZE,
    metadata_columns=metadata_columns,
    id_column=Column("langchain_id", "VARCHAR"),
)
```
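
When deciding which metadata fields deserve their own column, one option is to look at which keys appear in every existing metadata record. The following is a minimal sketch under that assumption; `keys_in_every_row` is a hypothetical helper, not part of `langchain_postgres`, and the sample rows reuse the metadata shape from `examples/vectorstore.ipynb`.

```python
from collections import Counter


def keys_in_every_row(metadata_rows):
    """Return the metadata keys present in every row, sorted by name."""
    # Each row is a dict, so a key is counted at most once per row.
    counts = Counter(key for row in metadata_rows for key in row)
    return sorted(key for key, n in counts.items() if n == len(metadata_rows))


rows = [
    {"id": 1, "location": "pond", "topic": "animals"},
    {"id": 3, "location": "market", "topic": "food", "author": "unknown"},
]
print(keys_in_every_row(rows))  # ['id', 'location', 'topic']
```

Keys that appear only in some rows (such as `author` here) are often better left in the JSON metadata column.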

3. **Create a vector store object to interact with the new data.**

> **_NOTE:_** The `FakeEmbeddings` embedding service is only used to initialise a vector store object, not to generate any embeddings. The embeddings are copied directly from the PGVector table.

```python
from langchain_postgres import PGVectorStore
from langchain_core.embeddings import FakeEmbeddings

destination_vector_store = PGVectorStore.create_sync(
    engine,
    embedding_service=FakeEmbeddings(size=VECTOR_SIZE),
    table_name="destination_table",
)
```

If you have any customisations on the metadata or the id columns, add them to the vector store as follows:

```python
from langchain_postgres import PGVectorStore
from langchain_core.embeddings import FakeEmbeddings

destination_vector_store = PGVectorStore.create_sync(
    engine,
    embedding_service=FakeEmbeddings(size=VECTOR_SIZE),
    table_name="destination_table",
    metadata_columns=[col.name for col in metadata_columns],
    id_column="langchain_id",
)
```

4. **Migrate the data to the new table.**

```python
from langchain_postgres.utils.pgvector_migrator import migrate_pgvector_collection

migrate_pgvector_collection(
    engine,
    # Set collection name here
    collection_name="collection_name",
    vector_store=destination_vector_store,
    # This deletes data from the original table upon migration. You can choose to turn it off.
    delete_pg_collection=True,
)
```

The data will only be deleted from the original table once all of it has been successfully copied to the destination table.

> **TIP:** If you would like to migrate multiple collections, you can use the `list_pgvector_collection_names` function to get the names of all collections, allowing you to iterate through them.
>
> ```python
> from langchain_postgres.utils.pgvector_migrator import list_pgvector_collection_names
>
> all_collection_names = list_pgvector_collection_names(engine)
> print(all_collection_names)
> ```
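
Iterating over the collection names from the tip above can be sketched as a small loop that keeps going when one collection fails. This is only an illustration: `migrate_all` is a hypothetical helper, and `migrate_one` stands for whatever callable you write to create a destination table, build a vector store, and run the migration for a single collection.

```python
from typing import Callable, Iterable


def migrate_all(
    collection_names: Iterable[str],
    migrate_one: Callable[[str], None],
) -> dict[str, str]:
    """Call migrate_one for each collection and record the outcome per name."""
    results = {}
    for name in collection_names:
        try:
            migrate_one(name)
            results[name] = "ok"
        except Exception as exc:
            # Record the failure and continue, so one bad collection
            # does not block the remaining ones.
            results[name] = f"failed: {exc}"
    return results
```

Collecting the outcomes in a dictionary makes it easy to retry only the collections that failed.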

### (Not Recommended) Use PGVectorStore interface on PGVector databases

If you choose not to migrate your data, you can still use the PGVectorStore interface with your existing PGVector database. However, you won't benefit from the performance improvements of the PGVectorStore-style schema.

1. **Create a PG engine.**

```python
from langchain_postgres import PGEngine

# Replace CONNECTION_STRING with the URL of your own database
engine = PGEngine.from_connection_string(url=CONNECTION_STRING)
```

> **_NOTE:_** All sync methods have corresponding async methods.

2. **Create a vector store object to interact with the data.**

Use the same embeddings service that generated the vectors stored in your database. See the [LangChain docs](https://python.langchain.com/docs/integrations/text_embedding/) for reference.

```python
from langchain_postgres import PGVectorStore
from langchain_core.embeddings import FakeEmbeddings

# FakeEmbeddings is a placeholder: replace it with the embeddings service
# that generated the vectors stored in your database.
vector_store = PGVectorStore.create_sync(
    engine=engine,
    table_name="langchain_pg_embedding",
    embedding_service=FakeEmbeddings(size=VECTOR_SIZE),
    content_column="document",
    metadata_json_column="cmetadata",
    metadata_columns=["collection_id"],
    id_column="id",
)
```

3. **Perform similarity search.**

Filter by collection id:

```python
vector_store.similarity_search("query", k=5, filter=f"collection_id='{uuid}'")
```

Filter by collection id and metadata:

```python
vector_store.similarity_search(
    "query", k=5, filter=f"collection_id='{uuid}' and cmetadata->>'col_name' = 'value'"
)
```
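
Because the filters above are interpolated into a string, it can help to validate the values before building them. The following is a minimal sketch, assuming the filter is passed through as SQL text as in the examples above; `collection_filter` is a hypothetical helper, not part of `langchain_postgres`.

```python
import uuid


def collection_filter(collection_id, metadata_equals=None):
    """Build a filter string for one collection, optionally matching JSON metadata."""
    # uuid.UUID raises ValueError for anything that is not a valid UUID.
    clauses = [f"collection_id='{uuid.UUID(str(collection_id))}'"]
    for key, value in (metadata_equals or {}).items():
        if not key.isidentifier():
            raise ValueError(f"unsupported metadata key: {key!r}")
        # Double any single quotes so the value stays inside its SQL literal.
        escaped = str(value).replace("'", "''")
        clauses.append(f"cmetadata->>'{key}' = '{escaped}'")
    return " and ".join(clauses)
```

The result can then be passed as the `filter` argument, for example `vector_store.similarity_search("query", k=5, filter=collection_filter(collection_id, {"col_name": "value"}))`.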
142 changes: 101 additions & 41 deletions examples/vectorstore.ipynb
@@ -72,7 +72,7 @@
"from langchain_core.documents import Document\n",
"\n",
"# See docker command above to launch a postgres instance with pgvector enabled.\n",
"connection = \"postgresql+psycopg://langchain:langchain@localhost:6024/langchain\" \n",
"connection = \"postgresql+psycopg://langchain:langchain@localhost:6024/langchain\"\n",
"collection_name = \"my_docs\"\n",
"embeddings = CohereEmbeddings()\n",
"\n",
@@ -126,17 +126,47 @@
"outputs": [],
"source": [
"docs = [\n",
" Document(page_content='there are cats in the pond', metadata={\"id\": 1, \"location\": \"pond\", \"topic\": \"animals\"}),\n",
" Document(page_content='ducks are also found in the pond', metadata={\"id\": 2, \"location\": \"pond\", \"topic\": \"animals\"}),\n",
" Document(page_content='fresh apples are available at the market', metadata={\"id\": 3, \"location\": \"market\", \"topic\": \"food\"}),\n",
" Document(page_content='the market also sells fresh oranges', metadata={\"id\": 4, \"location\": \"market\", \"topic\": \"food\"}),\n",
" Document(page_content='the new art exhibit is fascinating', metadata={\"id\": 5, \"location\": \"museum\", \"topic\": \"art\"}),\n",
" Document(page_content='a sculpture exhibit is also at the museum', metadata={\"id\": 6, \"location\": \"museum\", \"topic\": \"art\"}),\n",
" Document(page_content='a new coffee shop opened on Main Street', metadata={\"id\": 7, \"location\": \"Main Street\", \"topic\": \"food\"}),\n",
" Document(page_content='the book club meets at the library', metadata={\"id\": 8, \"location\": \"library\", \"topic\": \"reading\"}),\n",
" Document(page_content='the library hosts a weekly story time for kids', metadata={\"id\": 9, \"location\": \"library\", \"topic\": \"reading\"}),\n",
" Document(page_content='a cooking class for beginners is offered at the community center', metadata={\"id\": 10, \"location\": \"community center\", \"topic\": \"classes\"})\n",
"]\n"
" Document(\n",
" page_content=\"there are cats in the pond\",\n",
" metadata={\"id\": 1, \"location\": \"pond\", \"topic\": \"animals\"},\n",
" ),\n",
" Document(\n",
" page_content=\"ducks are also found in the pond\",\n",
" metadata={\"id\": 2, \"location\": \"pond\", \"topic\": \"animals\"},\n",
" ),\n",
" Document(\n",
" page_content=\"fresh apples are available at the market\",\n",
" metadata={\"id\": 3, \"location\": \"market\", \"topic\": \"food\"},\n",
" ),\n",
" Document(\n",
" page_content=\"the market also sells fresh oranges\",\n",
" metadata={\"id\": 4, \"location\": \"market\", \"topic\": \"food\"},\n",
" ),\n",
" Document(\n",
" page_content=\"the new art exhibit is fascinating\",\n",
" metadata={\"id\": 5, \"location\": \"museum\", \"topic\": \"art\"},\n",
" ),\n",
" Document(\n",
" page_content=\"a sculpture exhibit is also at the museum\",\n",
" metadata={\"id\": 6, \"location\": \"museum\", \"topic\": \"art\"},\n",
" ),\n",
" Document(\n",
" page_content=\"a new coffee shop opened on Main Street\",\n",
" metadata={\"id\": 7, \"location\": \"Main Street\", \"topic\": \"food\"},\n",
" ),\n",
" Document(\n",
" page_content=\"the book club meets at the library\",\n",
" metadata={\"id\": 8, \"location\": \"library\", \"topic\": \"reading\"},\n",
" ),\n",
" Document(\n",
" page_content=\"the library hosts a weekly story time for kids\",\n",
" metadata={\"id\": 9, \"location\": \"library\", \"topic\": \"reading\"},\n",
" ),\n",
" Document(\n",
" page_content=\"a cooking class for beginners is offered at the community center\",\n",
" metadata={\"id\": 10, \"location\": \"community center\", \"topic\": \"classes\"},\n",
" ),\n",
"]"
]
},
{
@@ -159,7 +189,7 @@
}
],
"source": [
"vectorstore.add_documents(docs, ids=[doc.metadata['id'] for doc in docs])"
"vectorstore.add_documents(docs, ids=[doc.metadata[\"id\"] for doc in docs])"
]
},
{
@@ -191,7 +221,7 @@
}
],
"source": [
"vectorstore.similarity_search('kitty', k=10)"
"vectorstore.similarity_search(\"kitty\", k=10)"
]
},
{
@@ -212,17 +242,47 @@
"outputs": [],
"source": [
"docs = [\n",
" Document(page_content='there are cats in the pond', metadata={\"id\": 1, \"location\": \"pond\", \"topic\": \"animals\"}),\n",
" Document(page_content='ducks are also found in the pond', metadata={\"id\": 2, \"location\": \"pond\", \"topic\": \"animals\"}),\n",
" Document(page_content='fresh apples are available at the market', metadata={\"id\": 3, \"location\": \"market\", \"topic\": \"food\"}),\n",
" Document(page_content='the market also sells fresh oranges', metadata={\"id\": 4, \"location\": \"market\", \"topic\": \"food\"}),\n",
" Document(page_content='the new art exhibit is fascinating', metadata={\"id\": 5, \"location\": \"museum\", \"topic\": \"art\"}),\n",
" Document(page_content='a sculpture exhibit is also at the museum', metadata={\"id\": 6, \"location\": \"museum\", \"topic\": \"art\"}),\n",
" Document(page_content='a new coffee shop opened on Main Street', metadata={\"id\": 7, \"location\": \"Main Street\", \"topic\": \"food\"}),\n",
" Document(page_content='the book club meets at the library', metadata={\"id\": 8, \"location\": \"library\", \"topic\": \"reading\"}),\n",
" Document(page_content='the library hosts a weekly story time for kids', metadata={\"id\": 9, \"location\": \"library\", \"topic\": \"reading\"}),\n",
" Document(page_content='a cooking class for beginners is offered at the community center', metadata={\"id\": 10, \"location\": \"community center\", \"topic\": \"classes\"})\n",
"]\n"
" Document(\n",
" page_content=\"there are cats in the pond\",\n",
" metadata={\"id\": 1, \"location\": \"pond\", \"topic\": \"animals\"},\n",
" ),\n",
" Document(\n",
" page_content=\"ducks are also found in the pond\",\n",
" metadata={\"id\": 2, \"location\": \"pond\", \"topic\": \"animals\"},\n",
" ),\n",
" Document(\n",
" page_content=\"fresh apples are available at the market\",\n",
" metadata={\"id\": 3, \"location\": \"market\", \"topic\": \"food\"},\n",
" ),\n",
" Document(\n",
" page_content=\"the market also sells fresh oranges\",\n",
" metadata={\"id\": 4, \"location\": \"market\", \"topic\": \"food\"},\n",
" ),\n",
" Document(\n",
" page_content=\"the new art exhibit is fascinating\",\n",
" metadata={\"id\": 5, \"location\": \"museum\", \"topic\": \"art\"},\n",
" ),\n",
" Document(\n",
" page_content=\"a sculpture exhibit is also at the museum\",\n",
" metadata={\"id\": 6, \"location\": \"museum\", \"topic\": \"art\"},\n",
" ),\n",
" Document(\n",
" page_content=\"a new coffee shop opened on Main Street\",\n",
" metadata={\"id\": 7, \"location\": \"Main Street\", \"topic\": \"food\"},\n",
" ),\n",
" Document(\n",
" page_content=\"the book club meets at the library\",\n",
" metadata={\"id\": 8, \"location\": \"library\", \"topic\": \"reading\"},\n",
" ),\n",
" Document(\n",
" page_content=\"the library hosts a weekly story time for kids\",\n",
" metadata={\"id\": 9, \"location\": \"library\", \"topic\": \"reading\"},\n",
" ),\n",
" Document(\n",
" page_content=\"a cooking class for beginners is offered at the community center\",\n",
" metadata={\"id\": 10, \"location\": \"community center\", \"topic\": \"classes\"},\n",
" ),\n",
"]"
]
},
{
@@ -275,9 +335,7 @@
}
],
"source": [
"vectorstore.similarity_search('kitty', k=10, filter={\n",
" 'id': {'$in': [1, 5, 2, 9]}\n",
"})"
"vectorstore.similarity_search(\"kitty\", k=10, filter={\"id\": {\"$in\": [1, 5, 2, 9]}})"
]
},
{
@@ -309,10 +367,11 @@
}
],
"source": [
"vectorstore.similarity_search('ducks', k=10, filter={\n",
" 'id': {'$in': [1, 5, 2, 9]},\n",
" 'location': {'$in': [\"pond\", \"market\"]}\n",
"})"
"vectorstore.similarity_search(\n",
" \"ducks\",\n",
" k=10,\n",
" filter={\"id\": {\"$in\": [1, 5, 2, 9]}, \"location\": {\"$in\": [\"pond\", \"market\"]}},\n",
")"
]
},
{
@@ -336,12 +395,15 @@
}
],
"source": [
"vectorstore.similarity_search('ducks', k=10, filter={\n",
" '$and': [\n",
" {'id': {'$in': [1, 5, 2, 9]}},\n",
" {'location': {'$in': [\"pond\", \"market\"]}},\n",
" ]\n",
"}\n",
"vectorstore.similarity_search(\n",
" \"ducks\",\n",
" k=10,\n",
" filter={\n",
" \"$and\": [\n",
" {\"id\": {\"$in\": [1, 5, 2, 9]}},\n",
" {\"location\": {\"$in\": [\"pond\", \"market\"]}},\n",
" ]\n",
" },\n",
")"
]
},
@@ -372,9 +434,7 @@
}
],
"source": [
"vectorstore.similarity_search('bird', k=10, filter={\n",
" 'location': { \"$ne\": 'pond'}\n",
"})"
"vectorstore.similarity_search(\"bird\", k=10, filter={\"location\": {\"$ne\": \"pond\"}})"
]
}
],