Description
I want to be able to filter vector search results by records that include a specific item in an array type field. The current filtering operators do not support doing the element-wise inclusion test I need. The supported operators are listed here:
Operator | Meaning/Category |
---|---|
$eq | Equality (==) |
$ne | Inequality (!=) |
$lt | Less than (<) |
$lte | Less than or equal (<=) |
$gt | Greater than (>) |
$gte | Greater than or equal (>=) |
$in | Special Cased (in) |
$nin | Special Cased (not in) |
$between | Special Cased (between) |
$like | Text (like) |
$ilike | Text (case-insensitive like) |
$and | Logical (and) |
$or | Logical (or) |
None of the operators from that list does exactly the inclusion test for a list of text elements. The closest is the $like
operator which only works if the field is text. It doesn't work for array type because it tries to do an exact match for the entire array rather than looking into the elements inside.
To reproduce this issue you can use this sample code.
First let's define the schema and push some sample data in:
import os
from langchain_openai import AzureOpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_postgres.vectorstores import PGVector
from datetime import datetime, timezone
# Create a sample text file for testing
sample_text = """
This is a sample document to test our Postgres setup.
We'll use this to verify our vector search functionality with filtering.
The text can contain any information you want to search through later.
"""
with open("sample_test.txt", "w") as f:
f.write(sample_text)
# Load and process the document
loader = TextLoader("sample_test.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
# Initialize Azure OpenAI embeddings
embeddings = AzureOpenAIEmbeddings(
azure_deployment="myAzureDeployment", # Replace with your deployment name
model="text-embedding-ada-002", # This is typically the model name
azure_endpoint="https://yourEndpoint.openai.azure.com",
api_key="yourApiKey",
chunk_size=1,
)
connection_string = (
"postgresql+psycopg://postgres:postgres@your_url:5432/test"
)
collection_name = "test_filtering"
vectorstore = PGVector(
embeddings=embeddings,
collection_name=collection_name,
connection=connection_string,
use_jsonb=True,
)
def prepare_metadata(doc):
return {
"source": doc.metadata["source"],
"category": ["A"],
"timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
}
content_list = []
metadata_list = []
for doc in docs:
content_list.append(doc.page_content)
metadata_list.append(prepare_metadata(doc))
ids_added = vectorstore.add_texts(content_list, metadata_list)
print(f"Added {len(ids_added)} documents to the vector store.")
Now I use the following code to retrive data from the collection we just ingested:
filter = {"category": {"$like": ["A"]}} # Exact match of the array instead of looking to the elements inside
retriever = vectorstore.as_retriever(search_kwargs={"filter": filter})
retrieved_docs = retriever .invoke("Opensearch", k=5)
print(retrieved_docs) # No results are returned
I have implemented another operator for array element-wise inclusion test which will fix this issue. But, I want to report this issue first for your review before creating a PR.
Thanks for reading.