Skip to content

feat: Add the PGVectorStore class #168

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

dishaprakash
Copy link
Collaborator

No description provided.

@dishaprakash dishaprakash marked this pull request as draft March 19, 2025 12:28
@dishaprakash dishaprakash changed the base branch from main to pg-vectorstore March 19, 2025 20:48
"PGVector",
"PGVectorStore",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have a final convo on naming

Copy link
Collaborator

@eyurtsev eyurtsev Mar 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(aesthetic/nit) I'd probably prefer a typeddict over a dataclass for column or at least accept a typeddict. It doesn't super matter for this package b/c it's surface area is small, but generally it's nice to reduce the number of imports that are not strictly necessary. It takes a bit more work library side, but makes the user api nicer. (This is aesthetic, so not required for merging)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API has been updated to support either the dataclass or the TypedDict

Copy link
Collaborator

@eyurtsev eyurtsev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall. Would be good to add validation on inputs to avoid SQL injection attacks

id_data_type = "UUID" if isinstance(id_column, str) else id_column.data_type
id_column_name = id_column if isinstance(id_column, str) else id_column.name

query = f"""CREATE TABLE "{schema_name}"."{table_name}"(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAJOR: Could you add at least light weight validation here (i.e., checking that each variable uses the character set for a valid identifier.)

schema_name allows using escape characters right now, which allows for fairly easy sql injection.


You can also use the sqlalchemy functional API to generate the full sql query instead of using text. Table, Metadata, Columns are already being used in this code, so perhaps the code could keep a consistent style?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code using SQLAlchemy functional API has been removed as it has no use for now.
The code is consistent using the full sql query using text now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the light weight validation, Postgres uses a specific character set we could use to validate against. But since we wrap the table name and schema name in double quotes, it allows the user more flexibility in the character set (most characters can be used). So in this case, we could add a layer to escape the strings

ignore_metadata_columns: Optional[list[str]] = None,
id_column: str = "langchain_id",
metadata_json_column: Optional[str] = "langchain_metadata",
distance_strategy: DistanceStrategy = DEFAULT_DISTANCE_STRATEGY,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit/aesthetic) Feel free to ignore as this can be updated later if you agree.


My sense is that it's common for OSS libraries in python to accept string inputs in addition (and often in place of Enums). The main reason is to reduce imports in user code. (Literal[...] provides the auto-completion and static type checking support, run time checking can always be added.)

"PGVector",
"PGVectorStore",
Copy link
Collaborator

@eyurtsev eyurtsev Mar 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(aesthetic/nit) I'd probably prefer a typeddict over a dataclass for column or at least accept a typeddict. It doesn't super matter for this package b/c it's surface area is small, but generally it's nice to reduce the number of imports that are not strictly necessary. It takes a bit more work library side, but makes the user api nicer. (This is aesthetic, so not required for merging)

@@ -0,0 +1,842 @@
# TODO: Remove below import when minimum supported Python version is 3.10
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My additional proposal is to use v2 for namespacing all new classes like:

from langchain_postgres.v2.vectorstores import PGVectorStore
from langchain_postgres.v2.engine import PGEngine
from langchain_postgres.v2.indexes import IVFFlat

@averikitsch
Copy link
Collaborator

MyPy pulled in a new minor version causing the lint failures

Copy link
Collaborator

@eyurtsev eyurtsev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good overall left a few minor comments.

A few comments:

  1. let's add missing init.py files in all the test folders and in vectorstore module name.
  2. namespacing for v2 is inconsistent (see comment about namespacing w/ suggestion to change)
  3. likely still sql injection capability in indexes
  4. sql query construction is fairly complex in aadd_embeddings -- could try to do a pass with an LLM on it to see if it could suggest refactoring for readability / maintenance

@@ -0,0 +1,1244 @@
# TODO: Remove below import when minimum supported Python version is 3.10
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing init .py for vectorstore module

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want this exposed to customers. They can use this via the module path if they want to use the pure Async version, but we recommend they use the mixed async/sync PGVectorStore interface

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

schema_name: str = "public",
content_column: str = "content",
embedding_column: str = "embedding",
metadata_columns: list[str] = [],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: mutable default

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this was a user unreachable code, i hadn't changed it but I've made the change now

query += "\n);"

async with self._pool.connect() as conn:
await conn.execute(text(query))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code will still trigger static code scanners for security issues due to f string interpolation. do we have a way to pass things as params if we're using a text query?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have a way to pass params with a text query(CRUD queries), but according to postgres we cannot parameterize the DDL queries creating the structures.

"""
if (
self.extension_name
and re.match(r"^[a-zA-Z_][a-zA-Z0-9_]*$", self.extension_name) is None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: If we are doing validation at the application layer, this should probably be in a standalone function and used in other places as well (e.g., any of the index classes has the same injection issue)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is added as a post init to the BaseIndex, which is extended by all Index classes, so all the indexes run this check after init.
Is there a different way you would like this to be implemented?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is only for the extension_name if I'm understanding correctly. The other index types seem to also generate SQL (e.g., https://github.com/langchain-ai/langchain-postgres/pull/168/files/cf58c2ab9bedf43df0989a7cc707d70e7e5a66a0#diff-5dbed1276479f8f27084b783f123c32e6acc95a6662c9838f3e126f603925809R106)

We can also handle this in a follow up PR if easier?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right! I missed that. I've seperated that function and now it validates for both extension_name and index_type.

I've also wrapped the index_name in double quotes to allow the same flexibility as tables.

if len(self.metadata_columns) > 0
else ""
)
insert_stmt = f'INSERT INTO "{self.schema_name}"."{self.table_name}"("{self.id_column}", "{self.content_column}", "{self.embedding_column}"{metadata_col_names}'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SQL query construction is fairly difficult to read right now (also difficult to audit for validity).


Are we using a mixture of parameters and f string interpolation b/c some of the values (e.g., table name) cannot be expressed as parameters in conn.exec(.., params)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes according to postgres, we can only parameterize (through the connector) the values associated with the CRUD operations. All the names (schema/table/columns) has to be a part of the text query.

@@ -0,0 +1,842 @@
# TODO: Remove below import when minimum supported Python version is 3.10
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name spacing is a bit odd:

vectorstore.v2 <-- only contains sync
vectorstore.async_vectorstore <-- but this is also v2 implementation

You could do something like this:

langchain_postgres.v2.async_vectorstore
langchain_postgres.v2.sync_vectorstore

and import everything into langchain_postgres.init so users don't need to know about the internal namespacing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! I've made v2 to be the namespace now

@dishaprakash dishaprakash marked this pull request as ready for review April 2, 2025 17:21
@averikitsch
Copy link
Collaborator

@dishaprakash please run make format. That should fix the final lint issue.

@dishaprakash dishaprakash requested a review from eyurtsev April 3, 2025 19:55
@averikitsch
Copy link
Collaborator

Lint fix is in #174

@averikitsch
Copy link
Collaborator

Merging into langchain-ai:pg-vectorstore branch. Will make a final release PR.

@averikitsch averikitsch merged commit c957a09 into langchain-ai:pg-vectorstore Apr 4, 2025
3 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants