Processing PDFs with Python: installing and using PyMuPDF

01 Introduction to PyMuPDF

1. Introduction

Before introducing PyMuPDF, let's take a look at MuPDF. As the name suggests, PyMuPDF is the Python interface to MuPDF.

MuPDF

MuPDF is a lightweight PDF, XPS and eBook viewer. MuPDF consists of software libraries, command-line tools, and viewers for various platforms.

The renderer in MuPDF is tailored for high quality antialiased graphics. It renders text with measurements and spacing accurate to within a fraction of a pixel for maximum fidelity when reproducing the appearance of a printed page on screen.

The viewer is small and fast, yet complete. It supports multiple document formats such as PDF, XPS, OpenXPS, CBZ, EPUB and FictionBook 2. You can annotate PDF documents and fill out forms using the mobile viewer (this feature will soon be coming to the desktop viewer as well).

The command-line tools allow you to annotate, edit, and convert documents to other formats such as HTML, SVG, PDF, and CBZ. You can also write scripts in JavaScript to manipulate documents.

PyMuPDF

PyMuPDF (version 1.18.17 at the time of writing) is a Python binding for MuPDF (version 1.18.*).

Using PyMuPDF, you can access files with the extensions ".pdf", ".xps", ".oxps", ".cbz", ".fb2" or ".epub".

In addition, about 10 popular image formats can also be handled like documents: ".png", ".jpg", ".bmp", ".tiff", etc.

2. Features

For all supported document types you can:

- Decrypt files
- Access meta information, links and bookmarks
- Render pages in raster formats (PNG and others) or as SVG vector images
- Search for text
- Extract text and images
- Convert to other formats: PDF, (X)HTML, XML, JSON, text

For PDF documents, there are many additional features:

- PDFs can be created, merged or split. Pages can be inserted, deleted, rearranged or modified in a number of ways (including annotations and form fields).
- Images and fonts can be extracted or inserted.
- Embedded files are fully supported.
- PDF files can be reformatted to support double-sided printing, posterizing, and applying logos or watermarks.
- Password protection is fully supported: decryption, encryption, choice of encryption method, permission levels, and user/owner password settings.
- The PDF optional content concept is supported for images, text and drawings.
- The low-level PDF structure can be accessed and modified.

Command-line module "python -m fitz ...": a versatile utility with the following features:

- Encryption, decryption and optimization
- Creation of sub-documents
- Document joining
- Image and font extraction
- Full support for embedded files
- Layout-preserving text extraction (all documents)

**NEW: layout-preserving text extraction!** The script `fitzcli.py` offers text extraction in different formats via the subcommand `gettext`. Of particular interest is layout preservation, which produces text as close as possible to the original physical layout, leaving room for areas occupied by images and reproducing text in tables and multi-column layouts.

02 Installation

PyMuPDF can be installed from source or from wheels.

For Windows, Linux and macOS, wheels are available in the download section of PyPI. They cover 64-bit Python versions 3.6 to 3.9. There is also a 32-bit version for Windows. Wheels for the Linux ARM architecture have recently become available as well: look for the platform tag manylinux2014_aarch64.

Apart from the standard library, PyMuPDF has no mandatory external dependencies. A few methods are only available if certain packages are installed:

- Pillow: required when using Pixmap.pil_save() and Pixmap.pil_tobytes()
- fontTools: required when using Document.subset_fonts()
- pymupdf-fonts: a nice selection of fonts that can be used with the text output methods

Install PyMuPDF with pip:

pip install PyMuPDF

Then import the library with import fitz.

A note on naming fitz

The standard Python import statement for this library is import fitz. There's a historical reason for this: MuPDF's original rendering library was called Libart.

After Artifex Software acquired the MuPDF project, the focus of development shifted to writing a new, modern graphics library called "Fitz". Fitz started out as an R&D project to replace the aging Ghostscript graphics library, but instead became the rendering engine for MuPDF (quoted from Wikipedia).

03 How to use

1. Import the library and check the version

2. Open the document

Calling fitz.open(filename) creates the Document object doc. filename must be a Python string naming an existing file. Documents can also be opened from data in memory, and new, empty PDFs can be created. You can also use documents as context managers.

3. Document methods and properties

4. Get metadata

PyMuPDF fully supports standard metadata. Document.metadata is a Python dictionary with standard keys such as title, author and creationDate. It is available for all document types, though not every entry always contains data. Values are strings, or None unless indicated otherwise. Also note that an entry does not necessarily contain meaningful data, even when it is not None.

5. Get the outline (table of contents)

6. Page

Page handling is at the heart of MuPDF's functionality.
- You can render pages as raster or vector (SVG) images, with options to scale, rotate, move, or crop them.
- You can extract page text and images in various formats, and search for text strings.
- For PDF documents, there are many more ways to add text or images to pages.

First, a Page object must be created, using the Document method load_page(pno) or the shorthand doc[pno].

Any integer in the range -∞ < pno < page_count can be used here. Negative numbers count backwards from the end, so doc[-1] is the last page, just like with Python sequences.

A more advanced approach is to use the document as an iterator for pages:

Next, we mainly introduce the common operations of Page!

a. Check the page for links, comments, or form fields

When a document is displayed in viewer software, links appear as "hot areas". If you click while the cursor shows a hand symbol, you will usually be taken to the target encoded in that hot area.

Page.get_links() returns all links of a page as a list of Python dictionaries.

The same entries are also available one at a time through the iterator Page.links().

PDF pages may also contain annotations (Annot) or form fields (Widget), each with its own iterator: Page.annots() and Page.widgets().

b. Rendering the page

Page.get_pixmap() creates a raster image of the page content.

The result, pix, is a Pixmap object which (in this case) contains an RGB image of the page and can be used for a variety of purposes.

The method Page.get_pixmap() offers many options for controlling the image: resolution, color space (e.g. to generate grayscale images or images with a subtractive color scheme), transparency, rotation, mirroring, shifting, shearing, etc.

For example, to create an RGBA image (i.e., one containing an alpha channel), specify pix = page.get_pixmap(alpha=True).

Pixmap has many methods and attributes. These include the integers width and height (in pixels) and stride (the number of bytes in one horizontal image line). The attribute samples holds a rectangular area of bytes (a Python bytes object) representing the image data.
You can also use page.get_svg_image() to create a vector image of a page.

c. Save the page image to a file

We can simply store the image in a PNG file with Pixmap.save().

d. Extract text and images

We can also extract all text, images, and other information of a page in many different forms and levels of detail, using page.get_text(opt).

Use one of the following strings for opt to get a different format:

- "text": (default) plain text with line breaks. No formatting, no text position details, no images.
- "blocks": generates a list of text blocks (paragraphs).
- "words": generates a list of words (strings not containing spaces).
- "html": creates a full visual version of the page, including any images. This can be displayed in a web browser.
- "dict" / "json": the same level of information as HTML, but as a Python dictionary or a JSON string, respectively.
- "rawdict" / "rawjson": a superset of "dict" / "json". It additionally provides character details like XML.
- "xhtml": the same level of text information as the "text" version, but including images.
- "xml": contains no images, but full position and font information for each single text character. Use an XML module to interpret it.

e. Search text

Page.search_for() finds the exact position of a text string on the page.

Searching for "mupdf", for example, returns a list of rectangles, each of which surrounds one occurrence of the string (case-insensitive). You can use this information to highlight those areas (PDF only) or to create cross-references within the document.

7. PDF manipulation

PDF is the only document type that can be modified with PyMuPDF. Other file types are read-only.

However, you can convert any document (including images) to PDF with Document.convert_to_pdf() and then apply all PyMuPDF features to the conversion result.

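Document.save() always stores a PDF on disk in its current (possibly modified) state.

You can usually choose whether to save to a new file or just append your modifications to the existing file ("incremental save"), which is often much faster.

The following sections describe how to manipulate PDF documents.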

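a. Modifying, creating, rearranging and deleting pages

There are several ways to manipulate the so-called page tree (the structure describing all the pages) of a PDF:

- Document.delete_page() and Document.delete_pages() delete pages.
- Document.copy_page(), Document.fullcopy_page() and Document.move_page() copy or move a page to another location within the same document.

Document.select() shrinks a PDF down to selected pages. Its parameter is a sequence of the page numbers you want to keep. These integers must all be in the range 0 <= i < page_count. When executed, all pages missing from this list are deleted. The remaining pages occur in the order, and as many times (!), as you specify them. So you can easily create new PDFs with:

- the first or last 10 pages
- only the odd or only the even pages (for double-sided printing)
- pages that do or do not contain a given text
- the page order reversed

The saved new document will contain those links, annotations and bookmarks that are still valid (i.e. that point either to a selected page or to some external resource).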
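
The page lists for these cases can be sketched in plain Python. All helper names below are hypothetical, not part of PyMuPDF; only doc.select(), doc.page_count and page.search_for() are library calls, and doc is assumed to be an open PDF Document.

```python
def first_pages(page_count, n=10):
    """0-based numbers of the first n pages."""
    return list(range(min(n, page_count)))


def last_pages(page_count, n=10):
    """0-based numbers of the last n pages."""
    return list(range(max(page_count - n, 0), page_count))


def odd_numbered_pages(page_count):
    """Pages 1, 3, 5, ... as a reader counts them (0-based: 0, 2, 4, ...)."""
    return list(range(0, page_count, 2))


def even_numbered_pages(page_count):
    """Pages 2, 4, 6, ... as a reader counts them (0-based: 1, 3, 5, ...)."""
    return list(range(1, page_count, 2))


def reversed_pages(page_count):
    """All pages, last one first."""
    return list(range(page_count - 1, -1, -1))


def pages_containing(doc, text):
    """Numbers of the pages on which the text occurs at least once."""
    return [page.number for page in doc if page.search_for(text)]


def keep_pages(doc, page_numbers):
    """Check the range rule, then shrink doc with Document.select()."""
    if any(not 0 <= i < doc.page_count for i in page_numbers):
        raise ValueError("page numbers must satisfy 0 <= i < page_count")
    doc.select(page_numbers)


print(last_pages(12))
print(odd_numbered_pages(5), even_numbered_pages(5), reversed_pages(3))
```

With an open PDF, a call such as keep_pages(doc, reversed_pages(doc.page_count)) followed by doc.save() on a new file name would write the reversed document.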

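Document.insert_page() and Document.new_page() insert new pages. In addition, pages themselves can be modified by a range of methods (e.g. page rotation, annotation and link maintenance, text and image insertion).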

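b. Joining and splitting PDF documents

The method Document.insert_pdf() copies pages between different PDF documents. To join two documents, doc1 and doc2 (both opened as PDFs), call doc1.insert_pdf(doc2), which appends all pages of doc2 to doc1.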

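Splitting works the same way: to create a new document from the first and last 10 pages of doc1, insert those two page ranges into a new, empty PDF.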

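c. Saving

Document.save() always saves the document in its current state.

You can write changes back to the original PDF by specifying the option incremental=True. This process is (usually) very fast, because the changes are appended to the original file without completely rewriting it.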

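d. Closing

It is often desirable to "close" a document while the program continues running, in order to hand control of the underlying file back to the operating system.

This can be done with the Document.close() method. Apart from closing the underlying file, it also frees the buffers associated with the document.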
