
Document metadata returns Fragment object instead of Dict in _results_to_docs_and_scores  #28029

Closed
@simadimonyan


Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

Here is my agent graph code:

from dotenv import load_dotenv
from langchain_ollama import ChatOllama
from langgraph.graph import StateGraph, START, END
from langgraph.prebuilt import ToolNode
from langgraph.checkpoint.memory import MemorySaver
from langgraph.prebuilt import tools_condition
from langgraph.graph import MessagesState
from src.agents import tools
import os

load_dotenv(".env")
ollama_url = os.getenv("OLLAMA_BASE_URL")
ollama_model = os.getenv("OLLAMA_MODEL")

llm = ChatOllama(model=ollama_model, base_url=ollama_url)
llm_with_tools = llm.bind_tools(tools.tools_list)
memory = MemorySaver()  # checkpoint every node state

# Node
def llm_call(state: MessagesState):
    return {"messages": [llm_with_tools.invoke(state["messages"])]}

# Build graph
builder = StateGraph(MessagesState)
builder.add_node("llm_call", llm_call)
builder.add_node("tools", ToolNode(tools.tools_list))

builder.add_edge(START, "llm_call")
builder.add_conditional_edges("llm_call", tools_condition)
builder.add_edge("tools", "llm_call")

graph = builder.compile(checkpointer=memory)
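
For reference, a minimal sketch of how this graph can be invoked (the question text and thread_id are placeholders; MemorySaver keys checkpoints by thread_id):

from langchain_core.messages import HumanMessage

# thread_id is arbitrary but required once a checkpointer is attached
config = {"configurable": {"thread_id": "demo"}}
result = graph.invoke({"messages": [HumanMessage("News for the last 24h")]}, config)
print(result["messages"][-1].content)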

Here is my tools code:

from src.database.db import Database
from src.parsers.habrnews import Habr
from langchain.agents import tool

@tool
def habr():
    """
    - Returns the last IT news article on Habr
    """
    return Habr.getNews()

@tool
def news_database():
    """
    - Returns news data for a query from the vector database
    """
    db = Database()
    return db.search("News for the last 24h")

# list of tools
tools_list = [news_database]

Here is my Database code:

from langchain_ollama import OllamaEmbeddings
from langchain_ollama import ChatOllama
from langchain_postgres import PGVector
from langchain_core.documents import Document
from dotenv import load_dotenv
import os
import json

class Database:

    def __init__(self):
        load_dotenv(".env")
        name = os.getenv("POSTGRES_NAME")
        pwd = os.getenv("POSTGRES_PASSWORD")
        ollama_url = os.getenv("OLLAMA_BASE_URL")
        ollama_model = os.getenv("OLLAMA_MODEL")
        self.ollama = ChatOllama(model=ollama_model, base_url=ollama_url)

        connection = f"postgresql+psycopg://{name}:{pwd}@postgresql-pgvector:5432/feedconveyor" 
        embeddings = OllamaEmbeddings(model=ollama_model, base_url=ollama_url)
        
        self.vector_database = PGVector(
            collection_name="store",
            connection=connection,
            embeddings=embeddings
        )
        self.vector_database.create_vector_extension()
        self.vector_database.create_tables_if_not_exists()


    async def store_data(self, data):
        self.vector_database.add_documents(data, ids=[doc.metadata["id"] for doc in data])

    def search(self, search):
        docs: list[Document] = self.vector_database.similarity_search(search)
        return docs

Here is my Docker Compose file (LangGraph Studio and the main app are on the same network):

services:
  bot:
    build:
      context: .
    container_name: telegram-bot
    develop:
      watch:
        - action: sync
          path: ./src
          target: /src
          ignore:
            - node_modules/
        - action: rebuild
          path: requirements.txt
    depends_on:
      - postgresql-pgvector
    networks:
      - local-network

  postgresql-pgvector:
    image: ankane/pgvector
    container_name: postgresql-pgvector
    env_file:
      - ".env"
    restart: always
    environment:
      - POSTGRES_DB=feedconveyor
      - POSTGRES_USER=${POSTGRES_NAME}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    volumes:
      - ./db:/var/lib/postgresql/data
    networks:
      - local-network
    ports:
      - 5432:5432
        
  pgadmin:
    image: dpage/pgadmin4:latest
    container_name: pgadmin
    restart: always
    env_file:
      - ".env"
    environment:
      - PGADMIN_DEFAULT_EMAIL=${PGADMIN_EMAIL}
      - PGADMIN_DEFAULT_PASSWORD=${PGADMIN_PASSWORD}
    ports:
      - 8080:80
    networks:
      - local-network

networks:
  local-network:
    driver: bridge

Error Message and Stack Trace (if applicable)

When I use llama3.1:8b (via Ollama) to answer a question with the RAG search tool, I get:

1 validation error for Document
metadata
  Input should be a valid dictionary [type=dict_type, input_value=Fragment(buf=b'{"id": 1}'), input_type=Fragment]
    For further information visit https://errors.pydantic.dev/2.9/v/dict_type

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/langchain_core/tools/base.py", line 657, in run
    response = context.run(self._run, *tool_args, **tool_kwargs)
  File "/usr/local/lib/python3.12/site-packages/langchain_core/tools/structured.py", line 80, in _run
    return self.func(*args, **kwargs)
  File "/deps/__outer_Feed-Conveyor/src/src/agents/tools.py", line 18, in news_database
    return db.search("News for the last 24h")
  File "/deps/__outer_Feed-Conveyor/src/src/database/db.py", line 37, in search
    docs: dict[Document] = self.vector_database.similarity_search(search)
  File "/usr/local/lib/python3.12/site-packages/langchain_postgres/vectorstores.py", line 943, in similarity_search
    return self.similarity_search_by_vector(
  File "/usr/local/lib/python3.12/site-packages/langchain_postgres/vectorstores.py", line 1498, in similarity_search_by_vector
    docs_and_scores = self.similarity_search_with_score_by_vector(
  File "/usr/local/lib/python3.12/site-packages/langchain_postgres/vectorstores.py", line 1043, in similarity_search_with_score_by_vector
    return self._results_to_docs_and_scores(results)
  File "/usr/local/lib/python3.12/site-packages/langchain_postgres/vectorstores.py", line 1063, in _results_to_docs_and_scores
    Document(
  File "/usr/local/lib/python3.12/site-packages/langchain_core/documents/base.py", line 285, in __init__
    super().__init__(page_content=page_content, **kwargs)  # type: ignore[call-arg]
  File "/usr/local/lib/python3.12/site-packages/langchain_core/load/serializable.py", line 125, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.12/site-packages/pydantic/main.py", line 212, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for Document
metadata
  Input should be a valid dictionary [type=dict_type, input_value=Fragment(buf=b'{"id": 1}'), input_type=Fragment]
    For further information visit https://errors.pydantic.dev/2.9/v/dict_type

Description

In #28027 (reply in thread) I discussed this issue with the AI assistant. The gist of its analysis:

The _results_to_docs_and_scores method is where the conversion of database results into Document objects occurs. The issue with metadata being returned as a Fragment object instead of a dictionary likely arises here. The metadata is being directly assigned from result.EmbeddingStore.cmetadata.
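
Indeed, the payload inside the Fragment is plain JSON, so it deserializes cleanly by hand. A quick sanity check using the exact bytes from the error above (the buf attribute is inferred from the error output, not from a documented API):

import json

# The raw bytes shown in the validation error:
raw = b'{"id": 1}'
print(json.loads(raw))  # {'id': 1} -- a valid dict, so only the deserialization step is missing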

To resolve the issue, ensure that result.EmbeddingStore.cmetadata is properly deserialized into a dictionary. If cmetadata is stored as a JSON field in the database, it should be automatically deserialized by the ORM (e.g., SQLAlchemy) into a dictionary. However, if it's being returned as a Fragment, you might need to explicitly convert it to a dictionary. Here's a potential fix:

def _results_to_docs_and_scores(self, results: Any) -> List[Tuple[Document, float]]:
    """Return docs and scores from results."""
    docs = [
        (
            Document(
                id=str(result.EmbeddingStore.id),
                page_content=result.EmbeddingStore.document,
                metadata=dict(result.EmbeddingStore.cmetadata),  # Ensure it's a dictionary
            ),
            result.distance if self.embeddings is not None else None,
        )
        for result in results
    ]
    return docs

By wrapping result.EmbeddingStore.cmetadata with dict(), you ensure that the metadata is explicitly converted to a dictionary, which should resolve the validation error.
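
Until the library is patched, a possible user-side workaround is to subclass PGVector and override the private method. This is a sketch only: overriding _results_to_docs_and_scores may break on other langchain_postgres versions, PatchedPGVector and _as_dict are my own names, and the buf attribute is inferred from the error output rather than a documented API.

import json
from typing import Any, List, Tuple

from langchain_core.documents import Document
from langchain_postgres import PGVector

class PatchedPGVector(PGVector):

    @staticmethod
    def _as_dict(raw: Any) -> dict:
        # Coerce whatever the driver returned into a plain dict
        if isinstance(raw, dict):
            return raw
        buf = getattr(raw, "buf", None)  # Fragment carries the raw JSON bytes
        if buf is not None:
            return json.loads(buf)
        return dict(raw)  # last resort: assume a mapping-like object

    def _results_to_docs_and_scores(self, results: Any) -> List[Tuple[Document, float]]:
        # Same logic as upstream, but with the metadata coercion applied
        return [
            (
                Document(
                    id=str(result.EmbeddingStore.id),
                    page_content=result.EmbeddingStore.document,
                    metadata=self._as_dict(result.EmbeddingStore.cmetadata),
                ),
                result.distance if self.embeddings is not None else None,
            )
            for result in results
        ]

Using PatchedPGVector in place of PGVector inside the Database class above sidesteps the validation error until the upstream fix lands.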

System Info

Apple MacBook Pro 14" (M1), macOS 15.1 (24B83), 16 GB RAM
