feat: Add the PGVectorStore #175

Merged · 4 commits · Apr 4, 2025
60 changes: 51 additions & 9 deletions README.md
@@ -9,7 +9,7 @@

The `langchain-postgres` package provides implementations of core LangChain abstractions using `Postgres`.

The package is released under the MIT license.

Feel free to use the abstractions as provided, or modify and extend them as appropriate for your own application.

@@ -23,22 +23,65 @@ The package currently only supports the [psycopg3](https://www.psycopg.org/psycopg3/) driver.
```sh
pip install -U langchain-postgres
```

## Usage

### Vectorstore

> [!NOTE]
> See an example of the `PGVector` vectorstore [here](https://github.com/langchain-ai/langchain-postgres/blob/main/examples/vectorstore.ipynb).

`PGVector` is being deprecated in favor of `PGVectorStore`, which offers improved performance and manageability.
See the [migration guide](https://github.com/langchain-ai/langchain-postgres/blob/main/examples/migrate_pgvector_to_pgvectorstore.md) for details on how to migrate from `PGVector` to `PGVectorStore`.

> [!TIP]
> All synchronous functions have corresponding asynchronous functions.

```python
import uuid

from langchain_core.documents import Document
from langchain_core.embeddings import DeterministicFakeEmbedding
from langchain_postgres import PGEngine, PGVectorStore

# Replace CONNECTION_STRING with your own connection string
engine = PGEngine.from_connection_string(url=CONNECTION_STRING)

VECTOR_SIZE = 768
embedding = DeterministicFakeEmbedding(size=VECTOR_SIZE)

TABLE_NAME = "destination_table"

engine.init_vectorstore_table(
    table_name=TABLE_NAME,
    vector_size=VECTOR_SIZE,
)

store = PGVectorStore.create_sync(
    engine=engine,
    table_name=TABLE_NAME,
    embedding_service=embedding,
)

all_texts = ["Apples and oranges", "Cars and airplanes", "Pineapple", "Train", "Banana"]
metadatas = [{"len": len(t)} for t in all_texts]
ids = [str(uuid.uuid4()) for _ in all_texts]
docs = [
    Document(id=ids[i], page_content=all_texts[i], metadata=metadatas[i])
    for i in range(len(all_texts))
]

store.add_documents(docs)

query = "I'd like a fruit."
docs = store.similarity_search(query)
print(docs)
```

For a detailed `PGVectorStore` example, see [this notebook](https://github.com/langchain-ai/langchain-postgres/blob/main/examples/pg_vectorstore.ipynb).
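
The tip above notes that every synchronous function has an asynchronous counterpart. Below is a minimal async sketch of the same flow; it assumes the counterparts are named `ainit_vectorstore_table`, `PGVectorStore.create`, `aadd_documents`, and `asimilarity_search` and mirror the synchronous signatures shown above, and `CONNECTION_STRING` remains a placeholder.

```python
import asyncio
import uuid

from langchain_core.documents import Document
from langchain_core.embeddings import DeterministicFakeEmbedding
from langchain_postgres import PGEngine, PGVectorStore

VECTOR_SIZE = 768
TABLE_NAME = "destination_table"


async def main() -> None:
    # CONNECTION_STRING is a placeholder; replace it with your own
    engine = PGEngine.from_connection_string(url=CONNECTION_STRING)
    embedding = DeterministicFakeEmbedding(size=VECTOR_SIZE)

    # Async counterparts of init_vectorstore_table / create_sync (assumed naming)
    await engine.ainit_vectorstore_table(
        table_name=TABLE_NAME,
        vector_size=VECTOR_SIZE,
    )
    store = await PGVectorStore.create(
        engine=engine,
        table_name=TABLE_NAME,
        embedding_service=embedding,
    )

    docs = [
        Document(id=str(uuid.uuid4()), page_content=text)
        for text in ["Apples and oranges", "Pineapple", "Train"]
    ]
    await store.aadd_documents(docs)
    print(await store.asimilarity_search("I'd like a fruit."))


asyncio.run(main())
```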

### ChatMessageHistory

The chat message history abstraction helps to persist chat message history
in a postgres table.

PostgresChatMessageHistory is parameterized using a `table_name` and a `session_id`.

The `table_name` is the name of the table in the database where
the chat messages will be stored.

The `session_id` is a unique identifier for the chat session. It can be assigned by the caller using `uuid.uuid4()`.
@@ -79,7 +122,6 @@ chat_history.add_messages([

```python
print(chat_history.messages)
```
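
The snippet above is truncated by the collapsed diff. For context, here is a self-contained sketch of the flow it belongs to; the connection string and table name are placeholders, and it assumes the documented `PostgresChatMessageHistory.create_tables(conn, table_name)` helper and the `(table_name, session_id, sync_connection=...)` constructor.

```python
import uuid

import psycopg
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage
from langchain_postgres import PostgresChatMessageHistory

# Placeholder connection string; replace with your own database
sync_connection = psycopg.connect("postgresql://user:pass@localhost:5432/mydb")

table_name = "chat_history"
# Create the table schema (only needs to be done once)
PostgresChatMessageHistory.create_tables(sync_connection, table_name)

session_id = str(uuid.uuid4())
chat_history = PostgresChatMessageHistory(
    table_name,
    session_id,
    sync_connection=sync_connection,
)

chat_history.add_messages([
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Hi!"),
    AIMessage(content="Hello, how can I help?"),
])
print(chat_history.messages)
```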


174 changes: 174 additions & 0 deletions examples/migrate_pgvector_to_pgvectorstore.md
@@ -0,0 +1,174 @@
# Migrate a `PGVector` vector store to `PGVectorStore`

This guide shows how to migrate from the [`PGVector`](https://github.com/langchain-ai/langchain-postgres/blob/main/langchain_postgres/vectorstores.py) vector store class to the [`PGVectorStore`](https://github.com/langchain-ai/langchain-postgres/blob/main/langchain_postgres/vectorstore.py) class.

## Why migrate?

This guide explains how to migrate your vector data from a `PGVector`-style database (two tables) to a `PGVectorStore`-style database (one table per collection) for improved performance and manageability.

Migrating to the PGVectorStore interface provides the following benefits:

- **Simplified management**: A single table contains data corresponding to a single collection, making it easier to query, update, and maintain.
- **Improved metadata handling**: It stores metadata in columns instead of JSON, resulting in significant performance improvements.
- **Schema flexibility**: The interface allows users to add tables into any database schema.
- **Improved performance**: The single-table schema can lead to faster query execution, especially for large collections.
- **Clear separation**: Clearly separate table and extension creation, allowing for distinct permissions and streamlined workflows.
- **Secure connections**: The `PGVectorStore` interface creates a secure connection pool that can be easily shared across your application using the `engine` object (see the sketch after this list).
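
As a rough illustration of that last point, a single `PGEngine` (and its connection pool) can back multiple `PGVectorStore` instances. This is only a sketch: the table names, the fake embedding, and `CONNECTION_STRING` are placeholders, and both tables are assumed to already exist via `init_vectorstore_table`.

```python
from langchain_core.embeddings import DeterministicFakeEmbedding
from langchain_postgres import PGEngine, PGVectorStore

# One engine (one connection pool) shared by the whole application.
# CONNECTION_STRING is a placeholder; replace it with your own.
engine = PGEngine.from_connection_string(url=CONNECTION_STRING)
embedding = DeterministicFakeEmbedding(size=768)

# Each collection lives in its own table but reuses the same pool.
docs_store = PGVectorStore.create_sync(
    engine=engine, table_name="docs", embedding_service=embedding
)
faq_store = PGVectorStore.create_sync(
    engine=engine, table_name="faqs", embedding_service=embedding
)
```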

## Migration process

> **_NOTE:_** The `langchain-core` library is used here only for its fake embeddings service. To use a different embedding service, install the appropriate library for your chosen provider. Choose an embeddings service from [LangChain's Embedding models](https://python.langchain.com/v0.2/docs/integrations/text_embedding/).

While you can use the existing PGVector database, we **strongly recommend** migrating your data to the PGVectorStore-style schema to take full advantage of the performance benefits.

### (Recommended) Data migration

1. **Create a PG engine.**

```python
from langchain_postgres import PGEngine

# Replace these variable values
engine = PGEngine.from_connection_string(url=CONNECTION_STRING)
```

> **_NOTE:_** All sync methods have corresponding async methods.

2. **Create a new table to migrate existing data.**

```python
# Vertex AI embeddings uses a vector size of 768.
# Adjust this according to your embeddings service.
VECTOR_SIZE = 768

engine.init_vectorstore_table(
table_name="destination_table",
vector_size=VECTOR_SIZE,
)
```

**(Optional) Customize your table.**

When creating your vectorstore table, you have the flexibility to define custom metadata and ID columns. This is particularly useful for:

- **Filtering**: Metadata columns allow you to easily filter your data within the vectorstore. For example, you might store the document source, date, or author as metadata for efficient retrieval.
- **Non-UUID identifiers**: By default, the `id_column` uses UUIDs. If you need to use a different type of ID (e.g., an integer or string), you can define a custom `id_column`.

```python
from langchain_postgres import Column  # Column describes custom metadata/id columns

# Name of the PGVector collection being migrated (placeholder)
collection_name = "collection_name"

metadata_columns = [
    Column(f"col_0_{collection_name}", "VARCHAR"),
    Column(f"col_1_{collection_name}", "VARCHAR"),
]
engine.init_vectorstore_table(
    table_name="destination_table",
    vector_size=VECTOR_SIZE,
    metadata_columns=metadata_columns,
    id_column=Column("langchain_id", "VARCHAR"),
)
```

3. **Create a vector store object to interact with the new data.**

> **_NOTE:_** The `FakeEmbeddings` embedding service is only used to initialise a vector store object, not to generate any embeddings. The embeddings are copied directly from the PGVector table.

```python
from langchain_postgres import PGVectorStore
from langchain_core.embeddings import FakeEmbeddings

destination_vector_store = PGVectorStore.create_sync(
engine,
embedding_service=FakeEmbeddings(size=VECTOR_SIZE),
table_name="destination_table",
)
```

If you customised the metadata or ID columns, pass them to the vector store as follows:

```python
from langchain_postgres import PGVectorStore
from langchain_core.embeddings import FakeEmbeddings

destination_vector_store = PGVectorStore.create_sync(
engine,
embedding_service=FakeEmbeddings(size=VECTOR_SIZE),
table_name="destination_table",
metadata_columns=[col.name for col in metadata_columns],
id_column="langchain_id",
)
```

4. **Migrate the data to the new table.**

```python
from langchain_postgres.utils.pgvector_migrator import migrate_pgvector_collection

migrate_pgvector_collection(
    engine,
    # Set collection name here
    collection_name="collection_name",
    vector_store=destination_vector_store,
    # This deletes data from the original table upon migration. You can choose to turn it off.
    delete_pg_collection=True,
)
```

The data will only be deleted from the original table once all of it has been successfully copied to the destination table.
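
The migrator module also provides an async variant, `amigrate_pgvector_collection`. Assuming it mirrors the synchronous signature above, the same step can be run asynchronously as in this sketch (`engine` and `destination_vector_store` come from the earlier steps):

```python
import asyncio

from langchain_postgres.utils.pgvector_migrator import amigrate_pgvector_collection


async def run_migration() -> None:
    # Assumes the same arguments as the synchronous helper above
    await amigrate_pgvector_collection(
        engine,
        collection_name="collection_name",
        vector_store=destination_vector_store,
        delete_pg_collection=True,
    )


asyncio.run(run_migration())
```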

> **TIP:** If you would like to migrate multiple collections, you can use the `list_pgvector_collection_names` helper (or its async counterpart, `alist_pgvector_collection_names`) to get the names of all collections, allowing you to iterate through them, as shown in the sketch below.
>
> ```python
> from langchain_postgres.utils.pgvector_migrator import list_pgvector_collection_names
>
> all_collection_names = list_pgvector_collection_names(engine)
> print(all_collection_names)
> ```
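
Putting the two helpers together, a migration loop over every collection might look like the following sketch. `make_destination_store` is a hypothetical helper standing in for steps 2–3 above (table creation plus `PGVectorStore.create_sync`).

```python
from langchain_postgres.utils.pgvector_migrator import (
    list_pgvector_collection_names,
    migrate_pgvector_collection,
)

for name in list_pgvector_collection_names(engine):
    # make_destination_store is a hypothetical helper that wraps
    # engine.init_vectorstore_table + PGVectorStore.create_sync for one collection.
    destination_store = make_destination_store(engine, name)
    migrate_pgvector_collection(
        engine,
        collection_name=name,
        vector_store=destination_store,
        delete_pg_collection=True,
    )
```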

### (Not Recommended) Use PGVectorStore interface on PGVector databases

If you choose not to migrate your data, you can still use the PGVectorStore interface with your existing PGVector database. However, you won't benefit from the performance improvements of the PGVectorStore-style schema.

1. **Create a PGVectorStore engine.**

```python
from langchain_postgres import PGEngine

# Replace these variable values
engine = PGEngine.from_connection_string(url=CONNECTION_STRING)
```

> **_NOTE:_** All sync methods have corresponding async methods.

2. **Create a vector store object to interact with the data.**

Use the same embedding service that was used to create the embeddings in your existing database. See the [LangChain docs](https://python.langchain.com/docs/integrations/text_embedding/) for reference.

```python
from langchain_postgres import PGVectorStore
from langchain_core.embeddings import FakeEmbeddings

vector_store = PGVectorStore.create_sync(
engine=engine,
table_name="langchain_pg_embedding",
embedding_service=FakeEmbeddings(size=VECTOR_SIZE),
content_column="document",
metadata_json_column="cmetadata",
metadata_columns=["collection_id"],
id_column="id",
)
```

3. **Perform similarity search.**

Filter by collection id:

```python
vector_store.similarity_search("query", k=5, filter=f"collection_id='{uuid}'")
```

Filter by collection id and metadata:

```python
vector_store.similarity_search(
"query", k=5, filter=f"collection_id='{uuid}' and cmetadata->>'col_name' = 'value'"
)
```
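
The filters above assume you already know the collection's UUID. If you only have the collection name, one way to look it up (assuming the default PGVector schema, where the `langchain_pg_collection` table stores `name` and `uuid`) is a direct query, for example with `psycopg`:

```python
import psycopg

# Placeholder connection string; replace with your own database
with psycopg.connect("postgresql://user:pass@localhost:5432/mydb") as conn:
    row = conn.execute(
        "SELECT uuid FROM langchain_pg_collection WHERE name = %s",
        ("collection_name",),
    ).fetchone()

collection_uuid = row[0]
docs = vector_store.similarity_search(
    "query", k=5, filter=f"collection_id='{collection_uuid}'"
)
```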