How to add logic that returns semantic similarity scores to the FAISS retriever in ensemble hybrid retrieval #5315

Open
ArronAI007 opened this issue Apr 22, 2025 · 3 comments

Comments

@ArronAI007

def do_search(
    self,
    query: str,
    top_k: int,
    score_threshold: float = Settings.kb_settings.SCORE_THRESHOLD,
) -> List[Tuple[Document, float]]:
    with self.load_vector_store().acquire() as vs:
        retriever = get_Retriever("ensemble").from_vectorstore(
            vs,
            top_k=top_k,
            score_threshold=score_threshold,
        )
        docs = retriever.get_relevant_documents(query)
    return docs


dosubot bot commented Apr 22, 2025

Hello @ArronAI007! I'm Dosu, here to help you solve bugs, answer questions, and guide you in becoming a contributor. While you wait for a human maintainer, feel free to ask me anything!

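To add logic that returns semantic similarity scores to the FAISS retriever in ensemble hybrid retrieval, you can use the from_vectorstore method of the EnsembleRetrieverService class. This method already combines the FAISS retriever with the BM25 retriever and performs hybrid retrieval through EnsembleRetriever. The implementation is as follows: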

from functools import lru_cache

@lru_cache(maxsize=10)
def create_bm25_retriever(vectorstore: VectorStore):
    docs = list(vectorstore.docstore._dict.values())
    import jieba
    bm25_retriever = BM25Retriever.from_documents(
        docs,
        preprocess_func=jieba.lcut_for_search,
    )
    return bm25_retriever

@staticmethod
def from_vectorstore(
    vectorstore: VectorStore,
    top_k: int,
    score_threshold: int | float,
):
    faiss_retriever = vectorstore.as_retriever(
        search_type="similarity_score_threshold",
        search_kwargs={"score_threshold": score_threshold, "k": top_k},
    )
    bm25_retriever = create_bm25_retriever(vectorstore)
    bm25_retriever.k = top_k
    ensemble_retriever = EnsembleRetriever(
        retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5]
    )
    return EnsembleRetrieverService(retriever=ensemble_retriever, top_k=top_k)

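In this implementation, faiss_retriever is created via vectorstore.as_retriever with similarity_score_threshold as the search type and with the score_threshold and k parameters set. The BM25 retriever is created by the create_bm25_retriever function, which is cached to avoid repeating the preprocessing. EnsembleRetriever combines bm25_retriever and faiss_retriever, using weights [0.5, 0.5] to balance their influence. When the knowledge base is updated, the cache can be cleared by calling create_bm25_retriever.cache_clear() [1][2].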
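A note on why the combined result carries no similarity value: as far as I understand, LangChain's EnsembleRetriever merges the two ranked lists with weighted Reciprocal Rank Fusion, which looks only at rank positions and not at the raw FAISS or BM25 scores. The sketch below illustrates that fusion rule in plain Python. The function name weighted_rrf, the constant c=60, and the document IDs are illustrative assumptions, not the library's actual code.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def weighted_rrf(
    rankings: Sequence[Sequence[str]],
    weights: Sequence[float],
    c: int = 60,
) -> List[Tuple[str, float]]:
    """Fuse several ranked lists of document IDs by weighted reciprocal rank."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        # Ranks start at 1; only the position matters, not the original score.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings from the two retrievers (best match first).
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
faiss_ranking = ["doc_b", "doc_d", "doc_a"]

fused = weighted_rrf([bm25_ranking, faiss_ranking], weights=[0.5, 0.5])
for doc_id, score in fused:
    print(doc_id, round(score, 6))
```

Because the fused value is a rank-based number, it should not be read as a semantic similarity. If the similarity is needed, it has to be obtained from the vector store separately.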

To continue talking to Dosu, mention @dosu.



@azhe1234

How can I output the actual similarity scores of the documents retrieved for a query?
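One possible approach, offered as a sketch rather than the project's method: LangChain vector stores expose similarity_search_with_relevance_scores(query, k, score_threshold=...), which returns (document, score) pairs instead of bare documents. The example below calls that method through a small helper. FakeVectorStore and the local Document class are hypothetical stand-ins so the snippet runs without a real index, and it assumes that higher scores mean more similar.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Document:
    """Minimal stand-in for LangChain's Document (assumption for this demo)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class FakeVectorStore:
    """Hypothetical store that mimics similarity_search_with_relevance_scores."""

    def __init__(self, scored: List[Tuple[Document, float]]):
        self._scored = scored

    def similarity_search_with_relevance_scores(self, query: str, k: int = 4, **kwargs):
        threshold = kwargs.get("score_threshold")
        hits = sorted(self._scored, key=lambda pair: pair[1], reverse=True)
        if threshold is not None:
            hits = [(doc, score) for doc, score in hits if score >= threshold]
        return hits[:k]


def search_with_scores(vs, query: str, top_k: int, score_threshold: float):
    """Return (Document, score) pairs from any store offering the scored search."""
    return vs.similarity_search_with_relevance_scores(
        query, k=top_k, score_threshold=score_threshold
    )


store = FakeVectorStore(
    [(Document("a"), 0.9), (Document("b"), 0.4), (Document("c"), 0.7)]
)
results = search_with_scores(store, "example query", top_k=2, score_threshold=0.5)
for doc, score in results:
    print(f"{score:.2f}  {doc.page_content}")
```

With a real FAISS store, the same helper would be called with the vs object acquired in do_search. Whether the score is a normalised relevance or a raw distance depends on how the store is configured, so it is worth checking a few known queries first.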

@ArronAI007
Author

ArronAI007 commented Apr 28, 2025

Thank you very much, but you only restated the source code without answering the question. How can I add logic that returns the similarity scores to the following code?

def from_vectorstore(
    vectorstore: VectorStore,
    top_k: int,
    score_threshold: int | float,
):
    faiss_retriever = vectorstore.as_retriever(
        search_type="similarity_score_threshold",
        search_kwargs={"score_threshold": score_threshold, "k": top_k},
    )
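A retriever built with as_retriever returns only documents, so one workaround is to run a scored vector search next to the ensemble retrieval and join the two results. The sketch below shows only the joining step. attach_scores and the local Document class are hypothetical names introduced for illustration, and matching on page_content assumes chunk texts are unique (a stable ID in metadata would be more robust).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Document:
    """Minimal stand-in for LangChain's Document (assumption for this demo)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def attach_scores(
    ensemble_docs: List[Document],
    scored_hits: List[Tuple[Document, float]],
) -> List[Tuple[Document, Optional[float]]]:
    """Pair each ensemble result with its vector similarity, if it has one.

    Documents that only BM25 found have no vector score and get None.
    """
    score_by_content = {doc.page_content: score for doc, score in scored_hits}
    return [(doc, score_by_content.get(doc.page_content)) for doc in ensemble_docs]


# Hypothetical outputs: ensemble order first, then the scored FAISS hits.
ensemble_docs = [Document("chunk x"), Document("chunk y")]
scored_hits = [(Document("chunk y"), 0.8)]

paired = attach_scores(ensemble_docs, scored_hits)
for doc, score in paired:
    print(doc.page_content, score)
```

In do_search, ensemble_docs would come from retriever.get_relevant_documents(query) and scored_hits from a scored search on the same vector store, which would also make the return value match the declared List[Tuple[Document, float]] more closely.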
