Description
Version:
unstructured: 0.17.2
unstructured-client: 0.36.0
unstructured-inference: 1.0.2
unstructured_paddleocr: 2.10.0
paddlepaddle: 3.0.0
Environment variable:
os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"
I'm using PaddleOCR as my OCR agent, but when I run this code:
raw_pdf = partition_pdf(
    filename=filepath,
    strategy="hi_res",
    infer_table_structure=True,
    extract_images_in_pdf=True,
    # extract_image_block_types=["Image", "Table"],
    # extract_image_block_output_dir=path,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
)
I get this error:
ModuleNotFoundError: No module named 'unstructured_pytesseract'
Why do I have to install unstructured_pytesseract when I already have unstructured_paddleocr?
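For anyone trying to reproduce this, here is a minimal sketch that reports which OCR backend modules are importable and what `OCR_AGENT` is set to. It uses only the standard library. The module name `unstructured_pytesseract` comes from the traceback above. `unstructured_paddleocr` is taken from the version list, and I'm assuming the import name matches the package name. The helper names `module_available` and `ocr_environment_report` are made up for illustration.

```python
import importlib.util
import os


def module_available(name: str) -> bool:
    """Return True if `name` can be located, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec raises for dotted names with a missing parent package,
        # or for modules whose __spec__ is unset.
        return False


def ocr_environment_report(
    modules=("unstructured_pytesseract", "unstructured_paddleocr"),
):
    """Collect importability of each module plus the current OCR_AGENT value.

    The default module names are assumptions based on this issue, not a
    list taken from the unstructured documentation.
    """
    report = {name: module_available(name) for name in modules}
    report["OCR_AGENT"] = os.environ.get("OCR_AGENT")
    return report


if __name__ == "__main__":
    for key, value in ocr_environment_report().items():
        print(f"{key}: {value}")
```

This only checks whether the modules can be found. It does not explain why `partition_pdf` still tries to import `unstructured_pytesseract` when `OCR_AGENT` points at the Paddle agent.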