PyMuPDF4LLM is a library designed to convert PDFs into Markdown format. Here, I’ll share my experience testing this library.
Start by installing the library using the following command:
pip install pymupdf4llm
The basic usage is quite simple, requiring just three lines of code to convert a PDF to Markdown:
import pymupdf4llm
md_text = pymupdf4llm.to_markdown("input.pdf")
print(md_text)
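In practice you will usually want to keep the result rather than print it. Here is a minimal sketch that writes the Markdown to a file; the helper names save_markdown and pdf_to_markdown_file are made up for illustration and are not part of the library.

```python
from pathlib import Path


def save_markdown(md_text, out_path):
    """Write Markdown text to out_path as UTF-8 and return the path."""
    out_file = Path(out_path)
    out_file.write_text(md_text, encoding="utf-8")
    return out_file


def pdf_to_markdown_file(pdf_path, out_path):
    """Convert a PDF and save the result (hypothetical convenience wrapper)."""
    # Imported here so save_markdown stays usable without the library installed.
    import pymupdf4llm

    md_text = pymupdf4llm.to_markdown(pdf_path)
    return save_markdown(md_text, out_path)
```

Calling pdf_to_markdown_file("input.pdf", "output.md") would then leave the converted text in output.md.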
You can specify arguments to adjust how content is extracted.
By default, the entire PDF is converted into a single Markdown string. To get the output page by page instead, specify page_chunks=True, which returns a list with one entry per page:
md_text = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)
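To make the page-by-page output concrete, here is a small sketch of how the result might be consumed. It assumes each entry is a dictionary with a "text" key; the "page" key under "metadata" is also an assumption, so the code falls back to the position in the list when it is missing. The function names are illustrative only.

```python
def split_chunks(chunks):
    """Return (page_number, text) pairs from page_chunks=True output."""
    pages = []
    for index, chunk in enumerate(chunks, start=1):
        # Assumption: metadata may carry a "page" number; otherwise use the index.
        page_no = chunk.get("metadata", {}).get("page", index)
        pages.append((page_no, chunk["text"]))
    return pages


def convert_by_page(pdf_path):
    """Convert a PDF and return its pages as (page_number, text) pairs."""
    # Imported lazily so split_chunks can be used without the library installed.
    import pymupdf4llm

    chunks = pymupdf4llm.to_markdown(pdf_path, page_chunks=True)
    return split_chunks(chunks)
```

Keeping the pages separate like this is handy when each page should become its own document or retrieval chunk.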
To extract images as files, use the write_images=True option:
md_text = pymupdf4llm.to_markdown("input.pdf", write_images=True)
It’s also possible to embed images directly in the Markdown using base64 encoding:
md_text = pymupdf4llm.to_markdown("input.pdf", embed_images=True)
For testing, I used several PDFs containing different Markdown elements.
Headers are correctly converted into Markdown format. Here is a portion of the result:
# Sample Markdown Guide
This is a sample markdown file that includes various features for quick reference.
## 1. Headers
...
## 3. Lists
Bold and italic formatting is also properly converted:
**Bold: **Bold Text****
_Italic: *Italic Text*_
**_Bold and Italic: ***Bold and Italic***_**
Ordered lists at the first level are converted without issues, but nested lists and unordered lists are not accurately converted.
## 3. Lists
### Unordered List
Item 1
Item 2
Sub-item 1
Sub-item 2
### Ordered List
1. First item
2. Second item
1. Sub-item A
2. Sub-item B
The URLs of links are extracted, but the entire line containing the link becomes a hyperlink, deviating from the original format.
## 4. Links and Images
[You can add links using [Link Text](URL).](https://www.example.com/)
Images are not extracted by default but can be saved locally with write_images=True.
md_text = pymupdf4llm.to_markdown("input.pdf", write_images=True)
The saved images are then referenced in the Markdown as follows:
### Image Example
![](input.pdf-1-0.png)
Simple tables without vertical borders are not accurately converted (likely because ambiguous column boundaries result in tables being treated as plain text).
## 5. Tables
**Column 1** **Column 2** **Column 3**
Row 1 Data A Data B
Row 2 Data C Data D
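One thing that may be worth trying for borderless tables is the table_strategy argument of to_markdown. To my knowledge it defaults to "lines_strict" and also accepts "lines" and "text", where "text" detects columns from text alignment rather than drawn borders. Whether it recovers the table above is untested here and may depend on the installed version. The wrapper below is a sketch: its name and the converter parameter are hypothetical and exist only so the call can be tried without a PDF at hand.

```python
def convert_with_table_strategy(pdf_path, strategy="text", converter=None):
    """Convert pdf_path, passing table_strategy through to the converter.

    converter is a hypothetical injection point for experiments; when it is
    omitted, pymupdf4llm.to_markdown is used.
    """
    if converter is None:
        # Imported lazily so the wrapper can be exercised with a stand-in.
        import pymupdf4llm

        converter = pymupdf4llm.to_markdown
    # Assumption: the installed version accepts the table_strategy keyword.
    return converter(pdf_path, table_strategy=strategy)
```

Comparing the output for "lines_strict" and "text" on the same PDF should show quickly whether the strategy makes a difference for a given layout.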
Code blocks are correctly converted, but the language specification (e.g., python) is not retained. Inline code is not preserved either: the backticks are dropped and the code appears as plain text.
## 6. Code
### Inline Code
Use backticks for inline code: print("Hello, world!")
### Code Block
Use triple backticks for code blocks:
```
def greet(name):
    return f"Hello, {name}!"
print(greet("Markdown"))
```
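If the missing language tag matters (for syntax highlighting, say), it can be put back in a post-processing step. The following is only a sketch of one way to do that: tag_code_fences and FENCE are names invented here, and applying the same language to every untagged block is a guess that holds only when the document contains a single language.

```python
# Built at runtime so this example does not contain a literal fence line.
FENCE = "`" * 3


def tag_code_fences(md_text, language="python"):
    """Add a language tag to opening code fences that have none."""
    out = []
    inside = False
    for line in md_text.splitlines():
        if line.strip().startswith(FENCE):
            # Only touch a bare opening fence; closing fences stay as they are.
            if not inside and line.strip() == FENCE:
                line = line.replace(FENCE, FENCE + language, 1)
            inside = not inside
        out.append(line)
    return "\n".join(out)
```

Fences that already carry a tag are left alone, so the function is safe to run more than once on the same text.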
For multi-line text, the line breaks are preserved as they appear in the original PDF.
Markdown is a lightweight and versatile markup language favored by developers, writers, and bloggers alike
due to its simplicity in formatting text, enabling users to create readable and well-structured documents—
whether for documentation, blog posts, or articles—without the complexity of HTML, while also offering the
ability to convert content seamlessly into other formats like HTML, PDF, and even slideshows, making it an
ideal choice for projects that require both clarity and flexibility in presentation.
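When those hard line breaks are unwanted, for example before feeding the text to a language model, consecutive lines can be merged back into paragraphs afterwards. Below is a rough sketch under simple assumptions: blank lines separate paragraphs, and lines that look like headings, list items, quotes, table rows, or code fences are never merged. The names BLOCK_START and unwrap_paragraphs are illustrative.

```python
import re

# Lines that start a Markdown block and must not be merged into a paragraph.
BLOCK_START = re.compile(r"^\s*(#{1,6}\s|[-*+]\s|\d+\.\s|>|\||`{3})")


def unwrap_paragraphs(md_text):
    """Join hard-wrapped paragraph lines, leaving other blocks untouched."""
    out = []
    in_code = False
    for line in md_text.splitlines():
        if line.strip().startswith("`" * 3):
            # Toggle code mode on fences and copy the fence line unchanged.
            in_code = not in_code
            out.append(line)
            continue
        can_merge = (
            not in_code
            and bool(out)
            and bool(out[-1].strip())
            and bool(line.strip())
            and not BLOCK_START.match(line)
            and not BLOCK_START.match(out[-1])
        )
        if can_merge:
            out[-1] = out[-1].rstrip() + " " + line.strip()
        else:
            out.append(line)
    return "\n".join(out)
```

Continuation lines of list items are deliberately left as they are, since merging them reliably would need real Markdown parsing.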
Despite its trouble with nested and unordered lists, links, borderless tables, and inline code, PyMuPDF4LLM is a useful tool for converting PDFs to Markdown. It runs locally and does not rely on an external language model, which makes it suitable for environments without internet access.