-
Notifications
You must be signed in to change notification settings - Fork 81
feat: Add the PGVectorStore class #168
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add the PGVectorStore class #168
Conversation
"PGVector", | ||
"PGVectorStore", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have a final convo on naming
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(aesthetic/nit) I'd probably prefer a typeddict over a dataclass for column or at least accept a typeddict. It doesn't super matter for this package b/c it's surface area is small, but generally it's nice to reduce the number of imports that are not strictly necessary. It takes a bit more work library side, but makes the user api nicer. (This is aesthetic, so not required for merging)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API has been updated to support either the dataclass or the TypedDict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall. Would be good to add validation on inputs to avoid SQL injection attacks
langchain_postgres/engine.py
Outdated
id_data_type = "UUID" if isinstance(id_column, str) else id_column.data_type | ||
id_column_name = id_column if isinstance(id_column, str) else id_column.name | ||
|
||
query = f"""CREATE TABLE "{schema_name}"."{table_name}"( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MAJOR: Could you add at least light weight validation here (i.e., checking that each variable uses the character set for a valid identifier.)
schema_name allows using escape characters right now, which allows for fairly easy sql injection.
You can also use the sqlalchemy functional API to generate the full sql query instead of using text
. Table, Metadata, Columns are already being used in this code, so perhaps the code could keep a consistent style?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code using SQLAlchemy functional API has been removed as it has no use for now.
The code is consistent using the full sql query using text
now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the light weight validation, Postgres uses a specific character set we could use to validate against. But since we wrap the table name and schema name in double quotes, it allows the user more flexibility in the character set (most characters can be used). So in this case, we could add a layer to escape the strings
langchain_postgres/vectorstore.py
Outdated
ignore_metadata_columns: Optional[list[str]] = None, | ||
id_column: str = "langchain_id", | ||
metadata_json_column: Optional[str] = "langchain_metadata", | ||
distance_strategy: DistanceStrategy = DEFAULT_DISTANCE_STRATEGY, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit/aesthetic) Feel free to ignore as this can be updated later if you agree.
My sense is that it's common for OSS libraries in python to accept string inputs in addition (and often in place of Enums). The main reason is to reduce imports in user code. (Literal[...] provides the auto-completion and static type checking support, run time checking can always be added.)
"PGVector", | ||
"PGVectorStore", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(aesthetic/nit) I'd probably prefer a typeddict over a dataclass for column or at least accept a typeddict. It doesn't super matter for this package b/c it's surface area is small, but generally it's nice to reduce the number of imports that are not strictly necessary. It takes a bit more work library side, but makes the user api nicer. (This is aesthetic, so not required for merging)
langchain_postgres/vectorstore.py
Outdated
@@ -0,0 +1,842 @@ | |||
# TODO: Remove below import when minimum supported Python version is 3.10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My additional proposal is to use v2 for namespacing all new classes like:
from langchain_postgres.v2.vectorstores import PGVectorStore
from langchain_postgres.v2.engine import PGEngine
from langchain_postgres.v2.indexes import IVFFlat
MyPy pulled in a new minor version causing the lint failures |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good overall left a few minor comments.
A few comments:
- let's add missing init.py files in all the test folders and in vectorstore module name.
- namespacing for v2 is inconsistent (see comment about namespacing w/ suggestion to change)
- likely still sql injection capability in indexes
- sql query construction is fairly complex in
aadd_embeddings
-- could try to do a pass with an LLM on it to see if it could suggest refactoring for readability / maintenance
@@ -0,0 +1,1244 @@ | |||
# TODO: Remove below import when minimum supported Python version is 3.10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing init .py for vectorstore module
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want this exposed to customers. They can use this via the module path if they want to use the pure Async version, but we recommend they use the mixed async/sync PGVectorStore interface
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
schema_name: str = "public", | ||
content_column: str = "content", | ||
embedding_column: str = "embedding", | ||
metadata_columns: list[str] = [], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: mutable default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this was a user unreachable code, i hadn't changed it but I've made the change now
langchain_postgres/engine.py
Outdated
query += "\n);" | ||
|
||
async with self._pool.connect() as conn: | ||
await conn.execute(text(query)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this code will still trigger static code scanners for security issues due to f string interpolation. do we have a way to pass things as params if we're using a text query?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do have a way to pass params with a text query(CRUD queries), but according to postgres we cannot parameterize the DDL queries creating the structures.
langchain_postgres/indexes.py
Outdated
""" | ||
if ( | ||
self.extension_name | ||
and re.match(r"^[a-zA-Z_][a-zA-Z0-9_]*$", self.extension_name) is None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: If we are doing validation at the application layer, this should probably be in a standalone function and used in other places as well (e.g., any of the index classes has the same injection issue)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is added as a post init to the BaseIndex, which is extended by all Index classes, so all the indexes run this check after init.
Is there a different way you would like this to be implemented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is only for the extension_name
if I'm understanding correctly. The other index types seem to also generate SQL (e.g., https://github.com/langchain-ai/langchain-postgres/pull/168/files/cf58c2ab9bedf43df0989a7cc707d70e7e5a66a0#diff-5dbed1276479f8f27084b783f123c32e6acc95a6662c9838f3e126f603925809R106)
We can also handle this in a follow up PR if easier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right! I missed that. I've seperated that function and now it validates for both extension_name and index_type.
I've also wrapped the index_name in double quotes to allow the same flexibility as tables.
if len(self.metadata_columns) > 0 | ||
else "" | ||
) | ||
insert_stmt = f'INSERT INTO "{self.schema_name}"."{self.table_name}"("{self.id_column}", "{self.content_column}", "{self.embedding_column}"{metadata_col_names}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SQL query construction is fairly difficult to read right now (also difficult to audit for validity).
Are we using a mixture of parameters and f string interpolation b/c some of the values (e.g., table name) cannot be expressed as parameters in conn.exec(.., params)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes according to postgres, we can only parameterize (through the connector) the values associated with the CRUD operations. All the names (schema/table/columns) has to be a part of the text query.
langchain_postgres/vectorstore/v2.py
Outdated
@@ -0,0 +1,842 @@ | |||
# TODO: Remove below import when minimum supported Python version is 3.10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name spacing is a bit odd:
vectorstore.v2 <-- only contains sync
vectorstore.async_vectorstore <-- but this is also v2 implementation
You could do something like this:
langchain_postgres.v2.async_vectorstore
langchain_postgres.v2.sync_vectorstore
and import everything into langchain_postgres.init so users don't need to know about the internal namespacing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it! I've made v2 to be the namespace now
@dishaprakash please run |
Lint fix is in #174 |
Merging into langchain-ai:pg-vectorstore branch. Will make a final release PR. |
No description provided.