Extracting Text from PDF Files with PDFMiner in Python
When working with PDF documents, extracting text can be a crucial task. PDFMiner, a Python library, simplifies this process, enabling developers to parse and extract text from PDF files.
Updated PDFMiner API and Outdated Examples
Recent updates to PDFMiner have introduced changes to its API, rendering many existing examples obsolete. The transition to the latest version can leave developers lost, unsure how to perform basic tasks like text extraction.
Example Implementation
To address this issue, let's explore a working example that demonstrates how to extract text from a PDF file using the current PDFMiner library:
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    """Return the text of the PDF at `path` as a single string."""
    rsrcmgr = PDFResourceManager()  # stores shared resources such as fonts
    retstr = StringIO()             # in-memory buffer that collects the text
    codec = 'utf-8'
    laparams = LAParams()           # default layout-analysis parameters
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    password = ""     # empty string for unencrypted files
    maxpages = 0      # 0 means no page limit
    caching = True
    pagenos = set()   # an empty set selects every page

    # The with-block closes the file even if a page fails to parse.
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                      password=password, caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)

    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text
The convert_pdf_to_txt function takes a file path as input and opens the file in binary mode. PDFPage.get_pages parses the document and yields its pages one at a time, and each page is passed to a PDFPageInterpreter. The interpreter drives a TextConverter, which writes the extracted text into a StringIO buffer. Once every page has been processed, the function returns the buffer's contents as a single string.
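The pagenos argument in the function above is an empty set, which selects every page. To extract only some pages, it can be replaced with a set of zero-based page indices. The helper below is a minimal sketch of building such a set from a human-friendly, one-based specification. The name parse_page_ranges and the "1-3,5" format are illustrative assumptions, not part of PDFMiner.

```python
def parse_page_ranges(spec):
    """Convert a 1-based page spec such as "1-3,5" into a set of 0-based indices.

    Hypothetical helper: the spec format is an assumption made for this example.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # PDFPage.get_pages counts pages from 0, so shift each number down by one.
        pages.update(range(start - 1, end))
    return pages


print(sorted(parse_page_ranges("1-3,5")))  # [0, 1, 2, 4]
```

Passing the resulting set as the pagenos argument of PDFPage.get_pages would then limit extraction to the first three pages and the fifth.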
This example uses the updated PDFMiner syntax, so it can replace snippets written against the older API. It was tested with the latest PDFMiner version available at the time of writing.
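For simple cases, a shorter route may be available. The sketch below assumes the pdfminer.six fork is installed, which provides an extract_text function in pdfminer.high_level that wraps the same steps. The wrapper name extract_text_simple is a hypothetical choice for this example.

```python
from pathlib import Path


def extract_text_simple(path):
    """Return all text from the PDF at `path` using the high-level helper."""
    pdf_path = Path(path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"no such PDF file: {pdf_path}")
    # Assumption: the pdfminer.six fork is installed. The import sits here so
    # the missing-file check above still works when the library is absent.
    from pdfminer.high_level import extract_text
    return extract_text(str(pdf_path))
```

The lower-level version shown earlier remains useful when the layout parameters, page selection, or output device need to be controlled directly.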