Remove all images from docx files - python

I've searched the documentation for python-docx and other packages, as well as stack-overflow, but could not find how to remove all images from docx files with python.
My exact use-case: I need to convert hundreds of word documents to "draft" format to be viewed by clients. Those drafts should be identical the original documents but all the images must be deleted / redacted from them.
Sorry for not including an example of things I tried, what I have tried is hours of research that didn't give any info. I found this question on how to extract images from word files, but that doesn't delete them from the actual document: Extract pictures from Word and Excel with Python
From there and other sources I've found out that docx files could be read as simple zip files, I don't know if that means that it's possible to "re-zip" without the images without affecting the integrity of the docx file (edit: simply deleting the images works, but prevents python-docx from continuing to work with this file because of missing references to images), but thought this might be a path to a solution.
Any ideas?

If your goal is to redact images maybe this code I used for a similar usecase could be useful:
import sys
import zipfile
from PIL import Image, ImageFilter
import io
blur = ImageFilter.GaussianBlur(40)
def redact_images(filename):
outfile = filename.replace(".docx", "_redacted.docx")
with zipfile.ZipFile(filename) as inzip:
with zipfile.ZipFile(outfile, "w") as outzip:
for info in inzip.infolist():
name = info.filename
print(info)
content = inzip.read(info)
if name.endswith((".png", ".jpeg", ".gif")):
fmt = name.split(".")[-1]
img = Image.open(io.BytesIO(content))
img = img.convert().filter(blur)
outb = io.BytesIO()
img.save(outb, fmt)
content = outb.getvalue()
info.file_size = len(content)
info.CRC = zipfile.crc32(content)
outzip.writestr(info, content)
Here I used PIL to blur images in some files, but instead of the blur filter any other suitable operation could be used. This worked quite nicely for my usecase.

I don't think it's currently implemented in python-docx.
Pictures in the Word Object Model are defined as either floating shapes or inline shapes. The docx documentation states that it only supports inline shapes.
The Word Object Model for Inline Shapes supports a Delete() method, which should be accessible. However, it is not listed in the examples of InlineShapes and there is also a similar method for paragraphs. For paragraphs, there is an open feature request to add this functionality - which dates back to 2014! If it's not added to paragraphs it won't be available for InlineShapes as they are implemented as discrete paragraphs.
You could do this with win32com if you have a machine with Word and Python installed.
This would allow you to call the Word Object Model directly, giving you access to the Delete() method. In fact you could probably cheat - rather than scrolling through the document to get each image, you can call Find and Replace to clear the image. This SO question talks about win32com find and replace:
import win32com.client
from os import getcwd, listdir
docs = [i for i in listdir('.') if i[-3:]=='doc' or i[-4:]=='docx'] #All Word file
FromTo = {"First Name":"John",
"Last Name":"Smith"} #You can insert as many as you want
word = win32com.client.DispatchEx("Word.Application")
word.Visible = True #Keep comment after tests
word.DisplayAlerts = False
for doc in docs:
word.Documents.Open('{}\\{}'.format(getcwd(), doc))
for From in FromTo.keys():
word.Selection.Find.Text = From
word.Selection.Find.Replacement.Text = FromTo[From]
word.Selection.Find.Execute(Replace=2, Forward=True) #You made the mistake here=> Replace must be 2
name = doc.rsplit('.',1)[0]
ext = doc.rsplit('.',1)[1]
word.ActiveDocument.SaveAs('{}\\{}_2.{}'.format(getcwd(), name, ext))
word.Quit() # releases Word object from memory
In this case since we want images, we would need to use the short-code ^g as the find.Text and blank as the replacement.
word.Selection.Find
find.Text = "^g"
find.Replacement.Text = ""
find.Execute(Replace=1, Forward=True)

I don't know about this library, but looking through the documentation I found this section about images. It mentiones that it is currently not possible to insert images other than inline. If that is what you currently have in your documents, I assume you can also retrieve these by looking in the Document object and then remove them?
The Document is explained here.
Although not a duplicate, you might also want to look at this question's answer where user "scanny" explains how he finds images using the library.

Related

Is it possible to perform an OCR in certain area in a PDF with python?

Is it possible to perform an OCR in certain area in a PDF with python? I am trying to build a program to extract some information from every PDF sheet like what Autodesk BIM360 do
disclaimer: I am the author of borb (the library used in this answer)
borb is an open-source, pure Python PDF library that plays nice with other famous libraries. e.g. you can insert PIL images, or matplotlib charts, and you can use pytesseract to apply OCR to your document.
There is a large examples repository in case you want to learn more. You can find a working example of OCR there as well.
I'll repeat the example here for completeness:
#!chapter_007/src/snippet_005.py
def main():
# set up everything for OCR
tesseract_data_dir: Path = Path("/home/joris/Downloads/tessdata-master/")
assert tesseract_data_dir.exists()
l: OCRAsOptionalContentGroup = OCRAsOptionalContentGroup(tesseract_data_dir)
# read Document
doc: typing.Optional[Document] = None
with open("output_001.pdf", "rb") as pdf_file_handle:
doc = PDF.loads(pdf_file_handle, [l])
assert doc is not None
# store Document
with open("output_002.pdf", "wb") as pdf_file_handle:
PDF.dumps(pdf_file_handle, doc)
if __name__ == "__main__":
main()
What's happening here is that the OCR process is dumping its results in an invisible layer of the PDF (these are called optional content groups). Hence the name OCRAsOptionalContentGroup.
Afterwards, you can store the Document again, and it will appear to contain text, youll be able to search it, select the text, and everything youre used to.

How do I extract text in the right order from PDF using PyPDF2?

I am currently doing a project to extract the contents of a PDF. The code runs smoothly and I am able to extract the text but the extracted text are not in the right order. The code extracts the text in a weird way. The order of the text is all over the place. It does not go from top to bottom and is really confusing.
I looked up online but there was very little help on how to order the text extraction. Most tutorials came up with the same result. For reference, this is the PDF that I am currently testing it on (page 5): https://www.pidm.gov.my/PIDM/files/13/134b5c79-5319-4199-ac68-99f62aca6047.pdf
import PyPDF2
with open('pdftest2.pdf', 'rb') as pdfTest:
reader = PyPDF2.PdfFileReader(pdfTest)
page5 = reader.getPage(4)
text = page5.extractText()
print(text)
The extracted text would always start with the footer of the page and then go its way from bottom to top. I noticed in the next page it would start from top to bottom but only for a few certain sentences. Then it would extract text from a different position of the page instead of continuing from where it left off.
All of the text does get extracted but the order of which it is extracted is all over the place. Is there any solution for this problem?
I had to deal with a problem that was similar and it turned out that the module pdfplumber worked better than PyPDF. I guess it depends on the document itself, you should try.
Otherwise another answer to your problem would be to treat the PDFs as images with the pdf2image module and extract the text within them using pytesseract. However it might not be perfect method as the pdf2image method convert_from_path can take quite a long time to run.
I drop some code down here if you are interested.
First of all make sure you install all necessary depedencies as well as Tesseract and ImageMagik. You can find any information regarding install on the website. If you are working with windows there's a good Medium article here.
To convert PDFs to images using pdf2image:
Don't forget to add your poppler path if you are working on windows. It should look like something like that r'C:\<your_path>\poppler-21.02.0\Library\bin'
def pdftoimg(fic,output_folder, poppler_path):
# Store all the pages of the PDF in a variable
pages = convert_from_path(fic, dpi=500,output_folder=output_folder,thread_count=9, poppler_path=poppler_path)
image_counter = 0
# Iterate through all the pages stored above
for page in pages:
filename = "page_"+str(image_counter)+".jpg"
page.save(output_folder+filename, 'JPEG')
image_counter = image_counter + 1
for i in os.listdir(output_folder):
if i.endswith('.ppm'):
os.remove(output_folder+i)
To extract text from the image:
Your tesseract path is going to be something like that: r'C:\Program Files\Tesseract-OCR\tesseract.exe'
def imgtotext(img, tesseract_path):
# Recognize the text as string in image using pytesserct
pytesseract.pytesseract.tesseract_cmd = tesseract_path
text = str(((pytesseract.image_to_string(Image.open(img)))))
text = text.replace('-\n', '')
return text
I recently started using PyMuPDF. It’s licensing is a little confusing but some of their methods have ways to correctly sort the text as it naturally appears (left to right, top to bottom). Something like page.get_text(“words”, sort=True) is all it takes.

Extract image position from .docx file using python-docx

I'm trying to get the image index from the .docx file using python-docx library. I'm able to extract the name of the image, image height and width. But not the index where it is in the word file
import docx
doc = docx.Document(filename)
for s in doc.inline_shapes:
print (s.height.cm,s.width.cm,s._inline.graphic.graphicData.pic.nvPicPr.cNvPr.name)
output
21.228 15.920 IMG_20160910_220903848.jpg
In fact I would like to know if there is any simpler way to get the image name , like s.height.cm fetched me the height in cm. My primary requirement is to get to know where the image is in the document, because I need to extract the image and do some work on it and then again put the image back to the same location
This operation is not directly supported by the API.
However, if you're willing to dig into the internals a bit and use the underlying lxml API it's possible.
The general approach would be to access the ImagePart instance corresponding to the picture you want to inspect and modify, then read and write the ._blob attribute (which holds the image file as bytes).
This specimen XML might be helpful:
http://python-docx.readthedocs.io/en/latest/dev/analysis/features/shapes/picture.html#specimen-xml
From the inline shape containing the picture, you get the <a:blip> element with this:
blip = inline_shape._inline.graphic.graphicData.pic.blipFill.blip
The relationship id (r:id generally, but r:embed in this case) is available at:
rId = blip.embed
Then you can get the image part from the document part
document_part = document.part
image_part = document_part.related_parts[rId]
And then the binary image is available for read and write on ._blob.
If you write a new blob, it will replace the prior image when saved.
You probably want to get it working with a single image and get a feel for it before scaling up to multiple images in a single document.
There might be one or two image characteristics that are cached, so you might not get all the finer points working until you save and reload the file, so just be alert for that.
Not for the faint of heart as you can see, but should work if you want it bad enough and can trace through the code a bit :)
You can also inspect paragraphs with a simple loop, and check which xml contains an image (for example if an xml contains "graphicData"), that is which is an image container (you can do the same with runs):
from docx import Document
image_paragraphs = []
doc = Document(path_to_docx)
for par in doc.paragraphs:
if 'graphicData' in par._p.xml:
image_paragraphs.append(par)
Than you unzip docx file, images are in the "images" folder, and they are in the same order as they will be in the image_paragraphs list. On every paragraph element you have many options how to change it. If you want to extract img process it and than insert it in the same place, than
paragraph.clear()
paragraph.add_run('your description, if needed')
run = paragraph.runs[0]
run.add_picture(path_to_pic, width, height)
So, I've never really written any answers here, but i think this might be the solution to your problem. With this little code you can see the position of your images given all the paragraphs. Hope it helps.
import docx
doc = docx.Document(filename)
paraGr = []
index = []
par = doc.paragraphs
for i in range(len(par)):
paraGr.append(par[i].text)
if 'graphicData' in par[i]._p.xml:
index.append(i)
If you are using Python 3
pip install python-docx
import docx
doc = docx.Document(document_path)
P = []
I = []
par = doc.paragraphs
for i in range(len(par)):
P.append(par[i].text)
if 'graphicData' in par[i]._p.xml:
I.append(i)
print(I)
#returns list of index(Image_Reference)

Search and replace placeholder text in PDF with Python

I need to generate a customized PDF copy of a template document.
The easiest way - I thought - was to create a source PDF that has some placeholder text where customization needs to happen , ie <first_name> and <last_name>, and then replace these with the correct values.
I've searched high and low, but is there really no way of basically taking the source template PDF, replace the placeholders with actual values and write to a new PDF?
I looked at PyPDF2 and ReportLab but neither seem to be able to do so.
Any suggestions? Most of my searches lead to using a Perl app, CAM::PDF, but I'd prefer to keep it all in Python.
There is no direct way to do this that will work reliably. PDFs are not like HTML: they specify the positioning of text character-by-character. They may not even include the whole font used to render the text, just the characters needed to render the specific text in the document. No library I've found will do nice things like re-wrap paragraphs after updating the text. PDFs are for the most part a display-only format, so you'll be much better off using a tool that turns markup into a PDF than updating the PDF in-place.
If that's not an option, you can create a PDF form in something like Acrobat, then use a PDF manipulation library like iText (AGPL) or pdfbox, which has a nice clojure wrapper called pdfboxing that can handle some of that.
From my experience, Python's support for writing to PDFs is pretty limited. Java has, by far, the best language support. Also, you get what you pay for, so it would probably be worth paying for a iText license if you're using this for commercial purposes. I've had pretty good results writing python wrappers around PDF-manipulation CLI tools like pdfboxing and ghostscript. That will probably be much easier for your use case than trying to shoehorn this into Python's PDF ecosystem.
There is no definite solution but I found 2 solutions that works most of the time.
In python https://github.com/JoshData/pdf-redactor gives good results. Here is the example code:
# Redact things that look like social security numbers, replacing the
# text with X's.
options.content_filters = [
# First convert all dash-like characters to dashes.
(
re.compile(u"Tom Xavier"),
lambda m : "XXXXXXX"
),
# Then do an actual SSL regex.
# See https://github.com/opendata/SSN-Redaction for why this regex is complicated.
(
re.compile(r"(?<!\d)(?!666|000|9\d{2})([OoIli0-9]{3})([\s-]?)(?!00)([OoIli0-9]{2})\2(?!0{4})([OoIli0-9]{4})(?!\d)"),
lambda m : "XXX-XX-XXXX"
),
]
# Perform the redaction using PDF on standard input and writing to standard output.
pdf_redactor.redactor(options)
Full Example can be found here
In ruby https://github.com/gettalong/hexapdf works for black out text.
Example code:
require 'hexapdf'
class ShowTextProcessor < HexaPDF::Content::Processor
def initialize(page, to_hide_arr)
super()
#canvas = page.canvas(type: :overlay)
#to_hide_arr = to_hide_arr
end
def show_text(str)
boxes = decode_text_with_positioning(str)
return if boxes.string.empty?
if #to_hide_arr.include? boxes.string
#canvas.stroke_color(0, 0 , 0)
boxes.each do |box|
x, y = *box.lower_left
tx, ty = *box.upper_right
#canvas.rectangle(x, y, tx - x, ty - y).fill
end
end
end
alias :show_text_with_positioning :show_text
end
file_name = ARGV[0]
strings_to_black = ARGV[1].split("|")
doc = HexaPDF::Document.open(file_name)
puts "Blacken strings [#{strings_to_black}], inside [#{file_name}]."
doc.pages.each.with_index do |page, index|
processor = ShowTextProcessor.new(page, strings_to_black)
page.process_contents(processor)
end
new_file_name = "#{file_name.split('.').first}_updated.pdf"
doc.write(new_file_name, optimize: true)
puts "Writing updated file [#{new_file_name}]."
In this you can black out text on select text will be visible.
As another solution you may try Aspose.PDF Cloud SDK for Python, it provides the feature to replace text in a PDF document.
First thing first, install the Aspose.PDF Cloud SDK for Python
pip install asposepdfcloud
Sample Code upload PDF file to your cloud storage and replace multiple strings in a PDF document
import os
import asposepdfcloud
from asposepdfcloud.apis.pdf_api import PdfApi
# Get App key and App SID from https://aspose.cloud
pdf_api_client = asposepdfcloud.api_client.ApiClient(
app_key='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
app_sid='xxxxx-xxxx-xxxx-xxxx-xxxxxxxx')
pdf_api = PdfApi(pdf_api_client)
filename = '02_pages.pdf'
remote_name = '02_pages.pdf'
#upload PDF file to storage
pdf_api.upload_file(remote_name,filename)
#Replace Text
text_replace1 = asposepdfcloud.models.TextReplace(old_value='origami',new_value='aspose',regex='true')
text_replace2 = asposepdfcloud.models.TextReplace(old_value='candy',new_value='biscuit',regex='true')
text_replace_list = asposepdfcloud.models.TextReplaceListRequest(text_replaces=[text_replace1,text_replace2])
response = pdf_api.post_document_text_replace(remote_name, text_replace_list)
print(response)
I'm developer evangelist at aspose.

Searching text in a PDF using Python? [duplicate]

This question already has answers here:
How to extract text from a PDF file?
(33 answers)
Closed 2 months ago.
Problem
I'm trying to determine what type a document is (e.g. pleading, correspondence, subpoena, etc) by searching through its text, preferably using python. All PDFs are searchable, but I haven't found a solution to parsing it with python and applying a script to search it (short of converting it to a text file first, but that could be resource-intensive for n documents).
What I've done so far
I've looked into pypdf, pdfminer, adobe pdf documentation, and any questions here I could find (though none seemed to directly solve this issue). PDFminer seems to have the most potential, but after reading through the documentation I'm not even sure where to begin.
Is there a simple, effective method for reading PDF text, either by page, line, or the entire document? Or any other workarounds?
This is called PDF mining, and is very hard because:
PDF is a document format designed to be printed, not to be parsed. Inside a PDF document,
text is in no particular order (unless order is important for printing), most of the time
the original text structure is lost (letters may not be grouped
as words and words may not be grouped in sentences, and the order they are placed in
the paper is often random).
There are tons of software generating PDFs, many are defective.
Tools like PDFminer use heuristics to group letters and words again based on their position in the page. I agree, the interface is pretty low level, but it makes more sense when you know
what problem they are trying to solve (in the end, what matters is choosing how close from the neighbors a letter/word/line has to be in order to be considered part of a paragraph).
An expensive alternative (in terms of time/computer power) is generating images for each page and feeding them to OCR, may be worth a try if you have a very good OCR.
So my answer is no, there is no such thing as a simple, effective method for extracting text from PDF files - if your documents have a known structure, you can fine-tune the rules and get good results, but it is always a gambling.
I would really like to be proven wrong.
[update]
The answer has not changed but recently I was involved with two projects: one of them is using computer vision in order to extract data from scanned hospital forms. The other extracts data from court records. What I learned is:
Computer vision is at reach of mere mortals in 2018. If you have a good sample of already classified documents you can use OpenCV or SciKit-Image in order to extract features and train a machine learning classifier to determine what type a document is.
If the PDF you are analyzing is "searchable", you can get very far extracting all the text using a software like pdftotext and a Bayesian filter (same kind of algorithm used to classify SPAM).
So there is no reliable and effective method for extracting text from PDF files but you may not need one in order to solve the problem at hand (document type classification).
I am totally a green hand, but this script works for me:
# import packages
import PyPDF2
import re
# open the pdf file
reader = PyPDF2.PdfReader("test.pdf")
# get number of pages
num_pages = len(reader.pages)
# define key terms
string = "Social"
# extract text and do the search
for page in reader.pages:
rext = page.extract_text()
# print(text)
res_search = re.search(string, text)
print(res_search)
I've written extensive systems for the company I work for to convert PDF's into data for processing (invoices, settlements, scanned tickets, etc.), and #Paulo Scardine is correct--there is no completely reliable and easy way to do this. That said, the fastest, most reliable, and least-intensive way is to use pdftotext, part of the xpdf set of tools. This tool will quickly convert searchable PDF's to a text file, which you can read and parse with Python. Hint: Use the -layout argument. And by the way, not all PDF's are searchable, only those that contain text. Some PDF's contain only images with no text at all.
I recently started using ScraperWiki to do what you described.
Here's an example of using ScraperWiki to extract PDF data.
The scraperwiki.pdftoxml() function returns an XML structure.
You can then use BeautifulSoup to parse that into a navigatable tree.
Here's my code for -
import scraperwiki, urllib2
from bs4 import BeautifulSoup
def send_Request(url):
#Get content, regardless of whether an HTML, XML or PDF file
pageContent = urllib2.urlopen(url)
return pageContent
def process_PDF(fileLocation):
#Use this to get PDF, covert to XML
pdfToProcess = send_Request(fileLocation)
pdfToObject = scraperwiki.pdftoxml(pdfToProcess.read())
return pdfToObject
def parse_HTML_tree(contentToParse):
#returns a navigatibale tree, which you can iterate through
soup = BeautifulSoup(contentToParse)
return soup
pdf = process_PDF('http://greenteapress.com/thinkstats/thinkstats.pdf')
pdfToSoup = parse_HTML_tree(pdf)
soupToArray = pdfToSoup.findAll('text')
for line in soupToArray:
print line
This code is going to print a whole, big ugly pile of <text> tags.
Each page is separated with a </page>, if that's any consolation.
If you want the content inside the <text> tags, which might include headings wrapped in <b> for example, use line.contents
If you only want each line of text, not including tags, use line.getText()
It's messy, and painful, but this will work for searchable PDF docs. So far I've found this to be accurate, but painful.
Here is the solution that I found it comfortable for this issue. In the text variable you get the text from PDF in order to search in it. But I have kept also the idea of spiting the text in keywords as I found on this website: https://medium.com/#rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f from were I took this solution, although making nltk was not very straightforward, it might be useful for further purposes:
import PyPDF2
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
def searchInPDF(filename, key):
occurrences = 0
pdfFileObj = open(filename,'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""
while count < num_pages:
pageObj = pdfReader.getPage(count)
count +=1
text += pageObj.extractText()
if text != "":
text = text
else:
text = textract.process(filename, method='tesseract', language='eng')
tokens = word_tokenize(text)
punctuation = ['(',')',';',':','[',']',',']
stop_words = stopwords.words('english')
keywords = [word for word in tokens if not word in stop_words and not word in punctuation]
for k in keywords:
if key == k: occurrences+=1
return occurrences
pdf_filename = '/home/florin/Downloads/python.pdf'
search_for = 'string'
print searchInPDF (pdf_filename,search_for)
I agree with #Paulo PDF data-mining is a huge pain. But you might have success with pdftotext which is part of the Xpdf suite freely available here:
http://www.foolabs.com/xpdf/download.html
This should be sufficient for your purpose if you are just looking for single keywords.
pdftotext is a command line utility, but very straightforward to use. It will give you text files, which you may find easier to work with.
If you are on bash, There is a nice tool called pdfgrep,
Since, This is in apt repository, You can install this with:
sudo apt install pdfgrep
It had served my requirements well.
Trying to pick through PDFs for keywords is not an easy thing to do. I tried to use the pdfminer library with very limited success. It’s basically because PDFs are pandemonium incarnate when it comes to structure. Everything in a PDF can stand on its own or be a part of a horizontal or vertical section, backwards or forwards. Pdfminer was having issues translating one page, not recognizing the font, so I tried another direction — optical character recognition of the document. That worked out almost perfectly.
Wand converts all the separate pages in the PDF into image blobs, then you run OCR over the image blobs. What I have as a BytesIO object is the content of the PDF file from the web request. BytesIO is a streaming object that simulates a file load as if the object was coming off of disk, which wand requires as the file parameter. This allows you to just take the data in memory instead of having to save the file to disk first and then load it.
Here’s a very basic code block that should be able to get you going. I can envision various functions that would loop through different URL / files, different keyword searches for each file, and different actions to take, possibly even per keyword and file.
# http://docs.wand-py.org/en/0.5.9/
# http://www.imagemagick.org/script/formats.php
# brew install freetype imagemagick
# brew install PIL
# brew install tesseract
# pip3 install wand
# pip3 install pyocr
import pyocr.builders
import requests
from io import BytesIO
from PIL import Image as PI
from wand.image import Image
if __name__ == '__main__':
pdf_url = 'https://www.vbgov.com/government/departments/city-clerk/city-council/Documents/CurrentBriefAgenda.pdf'
req = requests.get(pdf_url)
content_type = req.headers['Content-Type']
modified_date = req.headers['Last-Modified']
content_buffer = BytesIO(req.content)
search_text = 'tourism investment program'
if content_type == 'application/pdf':
tool = pyocr.get_available_tools()[0]
lang = 'eng' if tool.get_available_languages().index('eng') >= 0 else None
image_pdf = Image(file=content_buffer, format='pdf', resolution=600)
image_jpeg = image_pdf.convert('jpeg')
for img in image_jpeg.sequence:
img_page = Image(image=img)
txt = tool.image_to_string(
PI.open(BytesIO(img_page.make_blob('jpeg'))),
lang=lang,
builder=pyocr.builders.TextBuilder()
)
if search_text in txt.lower():
print('Alert! {} {} {}'.format(search_text, txt.lower().find(search_text),
modified_date))
req.close()
This answer follows #Emma Yu's:
If you want to print out all the matches of a string pattern on every page.
(Note that Emma's code prints a match per page):
import PyPDF2
import re
pattern = input("Enter string pattern to search: ")
fileName = input("Enter file path and name: ")
object = PyPDF2.PdfFileReader(fileName)
numPages = object.getNumPages()
for i in range(0, numPages):
pageObj = object.getPage(i)
text = pageObj.extractText()
for match in re.finditer(pattern, text):
print(f'Page no: {i} | Match: {match}')
A version using PyMuPDF. I find it to be more robust than PyPDF2.
import fitz
import re
# load document
doc = fitz.open(filename)
# define keyterms
String = "hours"
# get text, search for string and print count on page.
for page in doc:
text = ''
text += page.getText()
print(f'count on page {page.number +1} is: {len(re.findall(String, text))}')
Example with pdfminer.six
from pdfminer import high_level
with open('file.pdf', 'rb') as f:
text = high_level.extract_text(f)
print(text)
Compared to PyPDF2, it can work with cyrillic

Categories