This project is a Python-based tool to extract images from a given PDF file and index the surrounding context of each image for subsequent search operations. Users can query the context in natural language, and the search results will present the related images. This tool supports both English and Chinese languages. For Chinese tokenization, the jieba
library is used.
- Extracts images from PDF files.
- Indexes the context around each image.
- Allows querying using natural language.
- Supports both English and Chinese languages. The Chinese language uses the
jieba
library for word segmentation. - Provides a Gradio demo interface showcasing the search functionality and displays the top-1 image for a given query.
- Clone the repository.
- Ensure you have the necessary dependencies by installing them from
requirements.txt
. - Run the Gradio demo (
gradioDemo.py
) to launch the interface. - Upload a PDF, set the desired parameters, input your query, and retrieve the associated image.
Contains utility functions:
get_text_around_image
: Extracts surrounding text (both before and after) an image from the given PDF blocks.get_title_of_image
: Retrieves the title of an image based on the nearest occurrence of the terms '图' or 'figure'.
Core functionalities include:
ChineseTokenizer
andChineseAnalyzer
: Custom tokenizers to handle Chinese tokenization using thejieba
library.load_pdf
: Converts a given PDF into images and extracts surrounding text for each image.build_index
: Indexes the extracted text using Whoosh.search
: Allows querying the index and retrieves associated images.return_image
: Returns the image based on the search results.
Sets up the Gradio interface and binds the functions for the demo.
The Gradio demo interface (gradioDemo.py
) offers an interactive way to test the functionality.
- Upload a PDF file.
- Adjust the parameters as needed:
- DPI
- Number of front pages to skip
- Number of back pages to skip
- Number of blocks to skip
- Language (English/Chinese)
- Input your search query.
- Hit "Submit" to get the title and the top-1 image associated with your query.
Refer to requirements.txt
for a list of necessary libraries.
Feel free to fork the project and submit pull requests. If you encounter any bugs or have suggestions, please open an issue.