feat: Add the PGVectorStore class #168

dishaprakash · 2025-03-19T12:27:04Z

No description provided.

tests/utils.py

averikitsch · 2025-03-21T23:25:27Z

tests/unit_tests/test_imports.py

    "PGVector",
+    "PGVectorStore",


We should have a final convo on naming

(aesthetic/nit) I'd probably prefer a TypedDict over a dataclass for Column, or at least accept a TypedDict as well. It doesn't matter much for this package because its surface area is small, but it's generally nice to reduce the number of imports that aren't strictly necessary. It takes a bit more work on the library side, but it makes the user API nicer. (This is aesthetic, so not required for merging.)

The API has been updated to support either the dataclass or the TypedDict
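A minimal sketch of what accepting either form could look like. The name and data_type fields mirror the attributes used in the diff above; ColumnDict and normalize_column are hypothetical names for illustration, not the package's actual API.

```python
from dataclasses import dataclass
from typing import TypedDict, Union


@dataclass
class Column:
    # Stand-in for the package's column dataclass (fields assumed).
    name: str
    data_type: str


class ColumnDict(TypedDict):
    # Plain-dict equivalent: callers can build one without importing anything.
    name: str
    data_type: str


def normalize_column(column: Union[Column, ColumnDict]) -> Column:
    """Coerce either accepted input form into the dataclass."""
    if isinstance(column, Column):
        return column
    return Column(name=column["name"], data_type=column["data_type"])


# A user can pass a plain dict literal instead of importing Column.
id_column = normalize_column({"name": "langchain_id", "data_type": "UUID"})
```

Normalizing once at the boundary keeps the rest of the library working against a single type.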

pyproject.toml

poetry.lock

langchain_postgres/engine.py

eyurtsev

Looks good overall. Would be good to add validation on inputs to avoid SQL injection attacks

langchain_postgres/engine.py

langchain_postgres/indexes.py

eyurtsev · 2025-03-24T17:51:09Z

langchain_postgres/engine.py

+        id_data_type = "UUID" if isinstance(id_column, str) else id_column.data_type
+        id_column_name = id_column if isinstance(id_column, str) else id_column.name
+
+        query = f"""CREATE TABLE "{schema_name}"."{table_name}"(


MAJOR: Could you add at least lightweight validation here (i.e., checking that each variable uses only the character set of a valid identifier)?

schema_name currently allows escape characters, which makes SQL injection fairly easy.

You can also use the SQLAlchemy functional API to generate the full SQL query instead of using text. Table, MetaData, and Column are already being used in this code, so perhaps the code could keep a consistent style?

The code using the SQLAlchemy functional API has been removed, as it has no use for now.
The code now consistently builds the full SQL query as text.

For the lightweight validation: Postgres defines a specific character set for unquoted identifiers that we could validate against. But since we wrap the table name and schema name in double quotes, the user gets more flexibility (most characters can be used). So in this case, we could add a layer that escapes the strings instead.
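One way such an escaping layer might look, as a sketch only: inside a double-quoted PostgreSQL identifier, an embedded double quote is written by doubling it. The function name quote_identifier is hypothetical and is not taken from this PR.

```python
def quote_identifier(name: str) -> str:
    """Wrap an identifier in double quotes, doubling any embedded quotes."""
    if "\x00" in name:
        # PostgreSQL identifiers cannot contain NUL bytes.
        raise ValueError("identifiers cannot contain NUL bytes")
    return '"' + name.replace('"', '""') + '"'


# A name that tries to break out of the quotes stays a single identifier.
schema_name = 'public"; DROP TABLE users; --'
table_name = "my table"
query = f"CREATE TABLE {quote_identifier(schema_name)}.{quote_identifier(table_name)}("
```

With this approach the user keeps the flexible character set that quoting allows, while a stray quote can no longer terminate the identifier early.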

langchain_postgres/engine.py

eyurtsev · 2025-03-24T18:08:14Z

langchain_postgres/vectorstore.py

+        ignore_metadata_columns: Optional[list[str]] = None,
+        id_column: str = "langchain_id",
+        metadata_json_column: Optional[str] = "langchain_metadata",
+        distance_strategy: DistanceStrategy = DEFAULT_DISTANCE_STRATEGY,


(nit/aesthetic) Feel free to ignore as this can be updated later if you agree.

My sense is that it's common for OSS libraries in Python to accept string inputs in addition to (and often in place of) Enums. The main reason is to reduce imports in user code. (Literal[...] provides auto-completion and static type checking support; runtime checking can always be added.)
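A sketch of that suggestion, assuming a three-member enum. The member names and string values below are illustrative and may not match the package's DistanceStrategy; resolve_distance_strategy is a hypothetical helper.

```python
from enum import Enum
from typing import Literal, Union


class DistanceStrategy(Enum):
    # Illustrative members only; the real enum may differ.
    EUCLIDEAN = "euclidean"
    COSINE_DISTANCE = "cosine"
    INNER_PRODUCT = "inner_product"


# Literal gives editors auto-completion and lets type checkers reject typos.
DistanceName = Literal["euclidean", "cosine", "inner_product"]


def resolve_distance_strategy(
    value: Union[DistanceStrategy, DistanceName],
) -> DistanceStrategy:
    """Accept either the enum or its string value; validate at runtime."""
    if isinstance(value, DistanceStrategy):
        return value
    try:
        return DistanceStrategy(value)  # lookup by value
    except ValueError:
        allowed = ", ".join(member.value for member in DistanceStrategy)
        raise ValueError(
            f"Unknown distance strategy {value!r}; expected one of: {allowed}"
        ) from None


# User code needs no enum import to select a strategy.
strategy = resolve_distance_strategy("cosine")
```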

tests/unit_tests/test_async_pg_vectorstore.py

eyurtsev · 2025-03-24T18:15:52Z

tests/unit_tests/test_imports.py

    "PGVector",
+    "PGVectorStore",


(aesthetic/nit) I'd probably prefer a TypedDict over a dataclass for Column, or at least accept a TypedDict as well. It doesn't matter much for this package because its surface area is small, but it's generally nice to reduce the number of imports that aren't strictly necessary. It takes a bit more work on the library side, but it makes the user API nicer. (This is aesthetic, so not required for merging.)

langchain_postgres/__init__.py

langchain_postgres/vectorstore/async_vectorstore.py

tests/unit_tests/v2/test_async_pg_vectorstore_index.py

averikitsch · 2025-04-01T18:08:09Z

langchain_postgres/vectorstore.py

@@ -0,0 +1,842 @@
+# TODO: Remove below import when minimum supported Python version is 3.10


My additional proposal is to use v2 for namespacing all new classes like:

from langchain_postgres.v2.vectorstores import PGVectorStore
from langchain_postgres.v2.engine import PGEngine
from langchain_postgres.v2.indexes import IVFFlat

averikitsch · 2025-04-01T18:25:25Z

MyPy pulled in a new minor version causing the lint failures

eyurtsev

Looks pretty good overall; left a few minor comments.

A few comments:

Let's add the missing __init__.py files in all the test folders and in the vectorstore module.
Namespacing for v2 is inconsistent (see the comment about namespacing, with a suggestion to change).
There is likely still SQL injection capability in indexes.
SQL query construction is fairly complex in aadd_embeddings; could try a pass with an LLM to see whether it suggests a refactoring for readability / maintenance.

eyurtsev · 2025-04-01T19:48:33Z

langchain_postgres/vectorstore/async_vectorstore.py

@@ -0,0 +1,1244 @@
+# TODO: Remove below import when minimum supported Python version is 3.10


Missing __init__.py for the vectorstore module.

We don't want this exposed to customers. They can use this via the module path if they want to use the pure Async version, but we recommend they use the mixed async/sync PGVectorStore interface

eyurtsev · 2025-04-01T19:48:43Z

langchain_postgres/vectorstore/async_vectorstore.py

+        schema_name: str = "public",
+        content_column: str = "content",
+        embedding_column: str = "embedding",
+        metadata_columns: list[str] = [],


nit: mutable default

Since this code was unreachable by users, I hadn't changed it, but I've made the change now.
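For context on the nit, a short sketch of why a list default is risky and the usual None-based replacement. Both function names are hypothetical and exist only to show the behavior.

```python
from typing import Optional


def shared_default(metadata_columns: list[str] = []) -> list[str]:
    # The default list is created once, at definition time, and is then
    # shared by every call that omits the argument.
    metadata_columns.append("extra")
    return metadata_columns


def fresh_default(metadata_columns: Optional[list[str]] = None) -> list[str]:
    # Use None as the sentinel and build a new list on every call.
    columns = list(metadata_columns) if metadata_columns is not None else []
    columns.append("extra")
    return columns
```

Calling shared_default() repeatedly keeps growing the same list, while fresh_default() starts from an empty list each time.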

eyurtsev · 2025-04-01T19:59:18Z

langchain_postgres/engine.py

+        query += "\n);"
+
+        async with self._pool.connect() as conn:
+            await conn.execute(text(query))


This code will still trigger static code scanners for security issues due to f-string interpolation. Do we have a way to pass things as params if we're using a text query?

We do have a way to pass params with a text query (for CRUD queries), but according to Postgres we cannot parameterize the DDL queries that create the structures.
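The same split between values and identifiers can be illustrated with the standard library's sqlite3 module. This is a stand-in chosen so the sketch runs without a server; it is not PostgreSQL and not the code in this PR, and placeholder syntax differs between drivers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "items" (name TEXT)')

# Values in a CRUD statement can be bound as parameters.
conn.execute('INSERT INTO "items" (name) VALUES (?)', ("widget",))

# An identifier in a DDL statement cannot: the placeholder is rejected
# as a syntax error before any binding happens.
try:
    conn.execute("CREATE TABLE ? (name TEXT)", ("other",))
    ddl_parameterized = True
except sqlite3.OperationalError:
    ddl_parameterized = False

rows = conn.execute('SELECT name FROM "items"').fetchall()
```

Because names cannot be bound, they have to be validated or escaped before being placed into the query text.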

eyurtsev · 2025-04-01T21:19:15Z

langchain_postgres/indexes.py

+        """
+        if (
+            self.extension_name
+            and re.match(r"^[a-zA-Z_][a-zA-Z0-9_]*$", self.extension_name) is None


nit: If we are doing validation at the application layer, this should probably be in a standalone function and used in other places as well (e.g., any of the index classes has the same injection issue)

This function is added as a post init to the BaseIndex, which is extended by all Index classes, so all the indexes run this check after init.
Is there a different way you would like this to be implemented?

This check is only for the extension_name if I'm understanding correctly. The other index types seem to also generate SQL (e.g., https://github.com/langchain-ai/langchain-postgres/pull/168/files/cf58c2ab9bedf43df0989a7cc707d70e7e5a66a0#diff-5dbed1276479f8f27084b783f123c32e6acc95a6662c9838f3e126f603925809R106)

We can also handle this in a follow up PR if easier?

You're right! I missed that. I've separated that function, and it now validates both extension_name and index_type.

I've also wrapped the index_name in double quotes to allow the same flexibility as tables.
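A sketch of the standalone-validator shape discussed in this thread, assuming dataclass-based index classes. The field names follow the comments above; validate_identifier and the HNSWIndex defaults are illustrative, not the merged code.

```python
import re
from dataclasses import dataclass
from typing import Optional

# fullmatch is used instead of match with "^...$" because "$" would also
# accept a value that ends in a trailing newline.
_IDENTIFIER_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def validate_identifier(value: str, field_name: str) -> str:
    """Raise if value is not a plain unquoted SQL identifier."""
    if _IDENTIFIER_RE.fullmatch(value) is None:
        raise ValueError(f"Invalid {field_name}: {value!r}")
    return value


@dataclass
class BaseIndex:
    name: Optional[str] = None
    index_type: str = "base"
    extension_name: Optional[str] = None

    def __post_init__(self) -> None:
        # Runs for every subclass, so all index types get the same checks.
        validate_identifier(self.index_type, "index_type")
        if self.extension_name:
            validate_identifier(self.extension_name, "extension_name")


@dataclass
class HNSWIndex(BaseIndex):
    index_type: str = "hnsw"
```

Keeping the check in one function means any other code path that interpolates a name into SQL can reuse it.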

eyurtsev · 2025-04-01T21:22:52Z

langchain_postgres/vectorstore/async_vectorstore.py

+                if len(self.metadata_columns) > 0
+                else ""
+            )
+            insert_stmt = f'INSERT INTO "{self.schema_name}"."{self.table_name}"("{self.id_column}", "{self.content_column}", "{self.embedding_column}"{metadata_col_names}'


The SQL query construction is fairly difficult to read right now (also difficult to audit for validity).

Are we using a mixture of parameters and f string interpolation b/c some of the values (e.g., table name) cannot be expressed as parameters in conn.exec(.., params)?

Yes. According to Postgres, we can only parameterize (through the connector) the values associated with CRUD operations. All the names (schema/table/columns) have to be part of the text query.
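One way to make that split easier to read and audit is to build the statement in a single helper that quotes every name and binds every value. This is a hedged sketch: build_insert and quote_identifier are hypothetical names, and the ":p0" placeholders assume the named-parameter style accepted by SQLAlchemy's text().

```python
from typing import Any


def quote_identifier(name: str) -> str:
    """Double-quote an identifier, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_insert(
    schema: str, table: str, values: dict[str, Any]
) -> tuple[str, dict[str, Any]]:
    """Return (sql, params): names are quoted into the text, values are bound."""
    columns = ", ".join(quote_identifier(column) for column in values)
    # Positional placeholder names avoid deriving bind names from column names.
    placeholders = ", ".join(f":p{i}" for i in range(len(values)))
    sql = (
        f"INSERT INTO {quote_identifier(schema)}.{quote_identifier(table)}"
        f"({columns}) VALUES ({placeholders})"
    )
    params = {f"p{i}": value for i, value in enumerate(values.values())}
    return sql, params


sql, params = build_insert(
    "public", "docs", {"langchain_id": "1", "content": "hello"}
)
```

The returned pair could then be passed along the lines of conn.execute(text(sql), params), keeping user-supplied values out of the query string entirely.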

eyurtsev · 2025-04-01T21:25:55Z

langchain_postgres/vectorstore/v2.py

@@ -0,0 +1,842 @@
+# TODO: Remove below import when minimum supported Python version is 3.10


name spacing is a bit odd:

vectorstore.v2 <-- only contains sync
vectorstore.async_vectorstore <-- but this is also v2 implementation

You could do something like this:

langchain_postgres.v2.async_vectorstore
langchain_postgres.v2.sync_vectorstore

and import everything into langchain_postgres.__init__ so users don't need to know about the internal namespacing.

Got it! I've made v2 the namespace now.

tests/unit_tests/test_imports.py

averikitsch · 2025-04-02T22:21:50Z

@dishaprakash please run make format. That should fix the final lint issue.

averikitsch · 2025-04-04T16:18:39Z

Lint fix is in #174

averikitsch · 2025-04-04T19:47:28Z

Merging into langchain-ai:pg-vectorstore branch. Will make a final release PR.

feat: Add the PGVectorStore class

a21a8e2

dishaprakash requested a review from averikitsch March 19, 2025 12:27

dishaprakash marked this pull request as draft March 19, 2025 12:28

dishaprakash added 12 commits March 19, 2025 17:16

Linter and format fix

926f417

update poetry lock

a34ddbe

minor variable name change

bdd2bf6

Fix import test

239b1c3

enabled socket in one test file

544cade

enabled socket in all test files

03dcac1

Debug tests being skipped

7b4fa7f

Debug tests being skipped

1d42314

Debug tests being skipped

dc9a5b8

Debug tests being failed

4c3f93f

revert debug lines

b3a12b7

Remove IVFIndex

8b30833

dishaprakash changed the base branch from main to pg-vectorstore March 19, 2025 20:48

Minor change

1496033

averikitsch approved these changes Mar 22, 2025


eyurtsev reviewed Mar 24, 2025


dishaprakash added 10 commits April 1, 2025 06:25

Review changes

cbd0889

Refactor vectorstore packaging in import

b436df3

Change test table names

eb6954d

Linter fix

3e52c56

Minor fix

a24fe73

Fix test

c74858e

Fix tests

e52e609

Remove chat message history format

1f6a70e

Fix test

8029731

Fix indexing tests

cf58c2a

averikitsch reviewed Apr 1, 2025


Make escape sql string function private

b9526c6

eyurtsev reviewed Apr 1, 2025


dishaprakash added 2 commits April 2, 2025 09:51

Rename namespaces

1d6563a

Enable support for TypedDict along with Column

c9ad8f3

averikitsch approved these changes Apr 2, 2025


averikitsch reviewed Apr 2, 2025


tests/unit_tests/test_imports.py

Fix import test

a913b5a

dishaprakash marked this pull request as ready for review April 2, 2025 17:21

Linter fix

9e539e0

dishaprakash added 2 commits April 2, 2025 23:22

Linter fix

1daac17

Add validation and quotes for indexes

5062185

dishaprakash requested a review from eyurtsev April 3, 2025 19:55

Merge branch 'pg-vectorstore' into upstream-langchain

fe62c35

averikitsch merged commit c957a09 into langchain-ai:pg-vectorstore Apr 4, 2025
3 of 5 checks passed
