How to Extract Text from PDF Files Using Updated PDFMiner API in Python?

Published on 2024-11-09

Extracting Text from PDF Files with PDFMiner in Python

Extracting text is one of the most common tasks when working with PDF documents. PDFMiner is a Python library that parses PDF files and lets developers pull the text out of them.

Updated PDFMiner API and Outdated Examples

Recent updates to PDFMiner have introduced changes to its API, rendering many existing examples obsolete. The transition to the latest version can leave developers lost, unsure how to perform basic tasks like text extraction.

Example Implementation

To address this issue, let's explore a working example that demonstrates how to extract text from a PDF file using the current PDFMiner library:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

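def convert_pdf_to_txt(path):
    """Return the text of every page in the PDF at `path` as one string."""
    rsrcmgr = PDFResourceManager()   # caches shared resources such as fonts
    retstr = StringIO()              # in-memory buffer that collects the text
    codec = 'utf-8'
    laparams = LAParams()            # default layout-analysis settings
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0       # 0 means no page limit
    caching = True
    pagenos = set()    # an empty set selects every page

    try:
        # The with block closes the file even if a page fails to parse.
        with open(path, 'rb') as fp:
            for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                          caching=caching, check_extractable=True):
                interpreter.process_page(page)
        text = retstr.getvalue()
    finally:
        # Release the converter and the buffer whether or not extraction succeeded.
        device.close()
        retstr.close()
    return text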

The convert_pdf_to_txt function takes a file path as input, opens the file in binary mode, and passes each page through a PDFPageInterpreter. The TextConverter device writes the text of every page into a StringIO buffer, and the function returns the buffer's contents as a single string. No document parser is created explicitly; PDFPage.get_pages takes the open file and yields the pages directly.
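The pagenos argument in the function above is an empty set, which selects every page. To extract only some pages, pass a set of page indices instead. The sketch below is a hypothetical helper, not part of PDFMiner. It assumes that PDFPage.get_pages counts pages from 0, so page 1 of the document is index 0; check this against your installed version.

```python
def to_pagenos(human_pages):
    """Convert 1-based page numbers into a set of 0-based page indices.

    Assumption: PDFPage.get_pages numbers pages starting at 0.
    This helper is illustrative and not part of PDFMiner.
    """
    pagenos = set()
    for number in human_pages:
        if number < 1:
            raise ValueError(f"page numbers start at 1, got {number}")
        pagenos.add(number - 1)
    return pagenos


# Pages 1 and 3 of the document become indices 0 and 2.
selected = to_pagenos([1, 3])
```

Assigning the result to pagenos inside convert_pdf_to_txt, in place of set(), would limit extraction to those pages.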

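This example uses the current PDFMiner module layout (pdfminer.pdfinterp, pdfminer.converter, pdfminer.layout and pdfminer.pdfpage), so it does not depend on the calls that older tutorials rely on. It was tested against the latest PDFMiner version available at the time of writing.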
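Because convert_pdf_to_txt returns all pages as one string, it can be useful to split the result back into pages. The sketch below assumes that TextConverter writes a form feed character ("\f") after each page. This appears to be the behavior of pdfminer.six, but verify it on your own output. The split_pages name is hypothetical and not part of PDFMiner.

```python
def split_pages(text):
    """Split extracted text into a list with one string per page.

    Assumption: the converter emitted a form feed ("\f") after each page.
    """
    pages = text.split("\f")
    # A trailing form feed leaves an empty final chunk; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


# Stand-in for the string returned by convert_pdf_to_txt on a two-page file.
sample = "First page\n\fSecond page\n\f"
pages = split_pages(sample)
```

With real output, pages[0] would hold the text of the first page, pages[1] the second, and so on.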
