Can a text be searched Blockwise in a PDF using PyMuPDF?

page.getTextBlocks()
Output
[(42.5, 86.45002746582031, 523.260009765625, 100.22002410888672, TEXT, 0, 0),
(65.75, 103.4000244140625, 266.780029296875, 159.59010314941406, TEXT, 1, 0),
(48.5, 86.123456, 438.292048492, 100.92920404974, TEXT, 0, 0)]
(x0, y0, x1, y1, "lines in block", block_type, block_no)
My main aim is:
to search for a text in a PDF and highlight it
The text to be searched can occur any number of times on a page. Using tp.search(text, hit_max=1) limits the maximum number of occurrences, but that does not solve the problem: it selects the first occurrence of the text, whereas for me the second or third occurrence may be the important one.
My Idea is:
getTextBlocks extracts the text as shown above. Using this information, specifically the block_no, I want to run the page.searchFor function on that particular block only. Logically it should be possible, but in practice I need help on how to do it.
I would appreciate any input on achieving the main aim.
Thanks

As a preface, let me say that your question would be better placed on the issue page of my repository.
Page.searchFor() searches for any number of text items on the page. The only restriction is the number of hits, which has a limit you must specify in the call. But you can use any number here (take 100, for example). This method extracts no text, ignores character casing, and also supports non-horizontal text or text spread across multiple lines. Its output can be used directly to create text marker annotations and more.
You are of course free to extract text by using variations of Page.getText(option) and then apply your finesse to find what you want in the output. option may be "text", "words", "blocks", "dict", "rawdict", "html", "xhtml", or "xml". Each output has its pros and cons obviously. Many of the variants come with text position information, or font information including text color, etc.
But as said: it is up to you how you locate stuff. Let me suggest again we continue this conversation on the Github repo issue page, where I can better point to other resources. Or feel free to use my private e-mail.
If your question means to (1) locate text occurrences, and then (2) link each occurrence to a text block number, then just make a list of block rectangles and check each occurrence whether it is contained in a block rectangle:
for j, rect in enumerate(page.searchFor(text,...)):
    for i, bbox in enumerate(block_rectangles):
        if rect in bbox:
            print("occurrence %i is contained in block %i" % (j, i))
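For illustration, the same containment check can be sketched without any PDF at hand. This is only a sketch under assumptions: rectangles are plain (x0, y0, x1, y1) tuples, and the helper names contains and assign_hits_to_blocks are hypothetical, not part of PyMuPDF. The sample coordinates are rounded from the output shown in the question.

```python
def contains(outer, inner):
    """Return True if rectangle `inner` lies completely inside `outer`.

    Both rectangles are (x0, y0, x1, y1) tuples.
    """
    ox0, oy0, ox1, oy1 = outer
    ix0, iy0, ix1, iy1 = inner
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1


def assign_hits_to_blocks(hit_rects, block_rects):
    """Return (hit_index, block_index) pairs for hits lying inside a block."""
    pairs = []
    for j, hit in enumerate(hit_rects):
        for i, bbox in enumerate(block_rects):
            if contains(bbox, hit):
                pairs.append((j, i))
                break  # a hit is assigned to the first block containing it
    return pairs


# Made-up sample data: two block rectangles and two search hits.
block_rects = [(42.5, 86.45, 523.26, 100.22), (65.75, 103.40, 266.78, 159.59)]
hit_rects = [(70.0, 110.0, 120.0, 122.0), (100.0, 88.0, 150.0, 99.0)]

pairs = assign_hits_to_blocks(hit_rects, block_rects)
print(pairs)  # [(0, 1), (1, 0)]
```

With real data, block_rects would be built from the first four items of each tuple returned by page.getTextBlocks(), and hit_rects from the rectangles returned by page.searchFor(). Keeping only the pairs whose block index is the one you care about then gives you the occurrences to highlight.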

Related

Pywin32 how to replace text in entire word document

I am having trouble replacing words in the entire text document. The following code works to replace words in the main paragraphs but not when text is present in text boxes.
wdFindContinue = 1
wdReplaceAll = 2
word.Selection.Find.Execute(find_str, True, True, False, False, False, True, wdFindContinue, False, replace_str, wdReplaceAll)
Is there a way to replace words in the entire document?

See the article by Word MVPs on Using a macro to replace text wherever it appears in a document.
A collaborative effort of MVP’s Doug Robbins and Greg Maxey with
enhancements by Peter Hewett and Jonathan West
Using the Find or Replace utility on the Edit menu you can find or
replace text "almost" anywhere it appears in the document. If you
record that action however, the scope or "range" of the resulting
recorded macro will only act on the text contained in the body of the
document (or more accurately, it will only act on the part of the
document that contains the insertion point). This means that if the
insertion point is located in the main body of the document when your
macro is executed it will have no effect on text that is in the
headers or footers of the document, for example, or in a textbox,
footnotes, or any other area that is outside the main body of the
document.
Even the Find and Replace utility has a shortcoming. For example, text
in a textbox located in a header or footer is outside the scope of the
Find and Replace utility search range.
***
To use a macro to find or replace text anywhere in a document, it is
necessary to loop through each individual part of the document. In
VBA, these parts are called StoryRanges. Each StoryRange is identified
by a unique wdStoryType constant.
There are eleven different wdStoryType constants that can form the
StoryRanges (or parts) of a document (ok, a few more in later versions
of Word, but they have no bearing in this discussion). Simple
documents may contain only one or two StoryRanges, while more complex
documents may contain more. The wdStoryTypes that have a role in find
and replace are:
wdCommentsStory, wdEndnotesStory, wdEvenPagesFooterStory,
wdEvenPagesHeaderStory, wdFirstPageFooterStory,
wdFirstPageHeaderStory, wdFootnotesStory, wdMainTextStory,
wdPrimaryFooterStory, wdPrimaryHeaderStory, and wdTextFrameStory.
The complete code to find or replace text anywhere is a bit complex.
Accordingly, let’s take it a step at a time to better illustrate the
process. In many cases the simpler code is sufficient for getting the
job done.
Step 1
The following code loops through each StoryRange in the active
document and replaces the specified .Text with .Replacement.Text:
Sub FindAndReplaceFirstStoryOfEachType()
    Dim rngStory As Range
    For Each rngStory In ActiveDocument.StoryRanges
        With rngStory.Find
            .Text = "find text"
            .Replacement.Text = "I'm found"
            .Wrap = wdFindContinue
            .Execute Replace:=wdReplaceAll
        End With
    Next rngStory
End Sub
(Note for those already familiar with VBA: whereas if you use
Selection.Find, you have to specify all of the Find and Replace
parameters, such as .Forward = True, because the settings are
otherwise taken from the Find and Replace dialog's current settings,
which are “sticky”, this is not necessary if using [Range].Find –
where the parameters use their default values if you don't specify
their values in your code).
The simple macro above has shortcomings. It only acts on the "first"
StoryRange of each of the eleven StoryTypes (i.e., the first header,
the first textbox, and so on). While a document only has one
wdMainTextStory StoryRange, it can have multiple StoryRanges in some
of the other StoryTypes. If, for example, the document contains
sections with un-linked headers and footers, or if it contains
multiple textboxes, there will be multiple StoryRanges for those
StoryTypes and the code will not act upon the second and subsequent
StoryRanges. To even further complicate matters, if your document
contains unlinked headers or footers and one of the headers or footers
are empty then VBA can have trouble "jumping" that empty header or
footer and process subsequent headers and footers.
The page has more, but the above should help.
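Carried over to the pywin32 setting of the question, a minimal sketch could look like the following. It assumes doc is a Word Document COM object (for example word.ActiveDocument); the helper name replace_everywhere is hypothetical. The Execute call repeats the positional arguments from the question, and each story's NextStoryRange property is followed to reach the second and subsequent ranges of the same story type. It does not try to handle the harder cases the article mentions, such as textboxes inside headers or footers.

```python
WD_FIND_CONTINUE = 1  # numeric value of wdFindContinue, as in the question
WD_REPLACE_ALL = 2    # numeric value of wdReplaceAll, as in the question


def replace_everywhere(doc, find_str, replace_str):
    """Run a replace-all in every story range reachable from `doc`.

    `doc` is assumed to be a Word Document COM object.
    Returns the number of story ranges that were processed.
    """
    visited = 0
    for story in doc.StoryRanges:
        rng = story
        # Follow the chain of ranges that share this story type.
        while rng is not None:
            # Same positional arguments as the Selection.Find.Execute
            # call in the question, applied to a Range instead.
            rng.Find.Execute(find_str, True, True, False, False, False,
                             True, WD_FIND_CONTINUE, False,
                             replace_str, WD_REPLACE_ALL)
            visited += 1
            rng = rng.NextStoryRange
    return visited


# Possible usage (requires Windows, Word and pywin32; not run here):
#   import win32com.client
#   word = win32com.client.Dispatch("Word.Application")
#   replace_everywhere(word.ActiveDocument, "find text", "I'm found")
```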

How to search through a paragraph of text for a list of words, and once found, add to a dictionary

I am using a GET request to receive XML information back from a website. I have parsed the information and have gotten to a point where I want to set up a new scenario.
The scenario is: when I run a FOR loop through the XML (now turned into text in a dictionary []), how can I:
search the whole text for a variety of words? For example, say I know I'm looking for "value", "Lastupdatedate", and "Createdby" and want to grab that information every time I see it
IF I find one of those words above, grab the word, and everything in between them (this is xml, so if I find "value" I want to grab the information in between like (value - 5 - value) ?. So the final information would have value: 5
finalvalues = []
MASTER = [This is the text I'm searching through]
variables = ["value", "Lastupdatedate", "Createdby"]
for i in MASTER:
    #need something here to grab the actual name that it found
    tst = re.findall(variables(.*?)variables,i)
    finalvalues.append(tst)
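One possible approach, sketched under the assumption that the text contains simple, non-nested elements of the form <value>5</value>: build a single pattern from the list of names and use a backreference so that the closing tag has to match the opening one. The function name extract_fields and the sample string are made up for illustration. For anything beyond simple cases a real XML parser is usually the safer choice.

```python
import re


def extract_fields(text, names):
    """Return (name, content) pairs for each <name>content</name> found.

    Assumes simple, non-nested elements without attributes.
    """
    alternation = "|".join(re.escape(name) for name in names)
    # \1 refers back to whichever name matched in the opening tag.
    pattern = re.compile(r"<(%s)>(.*?)</\1>" % alternation, re.DOTALL)
    return [(m.group(1), m.group(2).strip()) for m in pattern.finditer(text)]


# Made-up sample input.
sample = "<item><value> 5 </value><Createdby>alice</Createdby></item>"
variables = ["value", "Lastupdatedate", "Createdby"]

finalvalues = extract_fields(sample, variables)
print(finalvalues)  # [('value', '5'), ('Createdby', 'alice')]
```

Each pair carries the name that was found together with the text between its tags, so building a dictionary afterwards is a matter of dict(finalvalues) when every name occurs only once.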

Can python-docx preserve font color and styles when importing documents?

Essentially what I need to do is write a program that takes in many .docx files and puts them all in one, ordered in a certain way. I have importing working via:
import docx, os, glob
finaldocname = 'Midterm-All-Questions.docx'
finaldoc=docx.Document()
docstoworkon = glob.glob('*.docx')
if finaldocname in docstoworkon:
    docstoworkon.remove(finaldocname) #dont process final doc if it exists
for f in docstoworkon:
    doc=docx.Document(f)
    fullText=[]
    for para in doc.paragraphs:
        fullText.append(para.text) #generates a long text list
    # finaldoc.styles = doc.styles
    for l in fullText:
        # if l=='u\'\\n\'':
        if '#' in l:
            print('We got here!')
            if '#1 ' not in l: #check last two characters to see if this is the first question
                finaldoc.add_section() #only add a page break between questions
        finaldoc.add_paragraph(l)
        # finaldoc.add_page_break
    # finaldoc.add_page_break
finaldoc.save(finaldocname)
But I need to preserve text styles, like font colors, sizes, italics, etc., and they aren't in this method since it just gets the raw text and dumps it. I can't find anything on the python-docx documentation about preserving text styles or importing in something other than raw text. Does anyone know how to go about this?

Styles are a bit difficult to work with in python-docx but it can be done.
See this explanation first to understand some of the problems with styles and Word.
The Long Way
When you read in a file as a Document() it will bring in all of the paragraphs and within each of these are the runs. These runs are chunks of text with the same style attached to them.
You can find out how many paragraphs or runs there are by doing len() on the object or you can iterate through them like you did in your example with paragraphs.
You can inspect the style of any given paragraph but runs may have different styles than the paragraph as a whole, so I would skip to the run itself and inspect the style there using paragraphs[0].runs[0].style which will give you a style object. You can inspect the font object beyond that which will tell you a number of attributes like size, italic, bold, etc.
Now to the long solution:
You should first create a new blank paragraph, then add_run() one by one with the text from your original. For each of these you can define a style attribute, but it would have to be a named style as described in the first link. You cannot apply a style object directly as it won't copy the attributes over. But there is a way around that: check the attributes that you care about copying to the output and then ensure your new run applies the same attributes.
doc_out = docx.Document()
for para in doc.paragraphs:
    p = doc_out.add_paragraph()
    for run in para.runs:
        r = p.add_run(run.text)
        if run.bold:
            r.bold = True
        if run.italic:
            r.italic = True
        # etc
Obviously this is inefficient and not a great solution, but it will work to ensure you have copied the style appropriately.
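To give an idea of what the "etc" might cover, here is a small helper sketch. The name copy_run_format is hypothetical; the attributes it touches (bold, italic, underline, font.name, font.size, font.color.rgb) are the ones python-docx exposes on a run. It only copies formatting applied directly to the run, so anything inherited from a named style is not carried over.

```python
def copy_run_format(src, dst):
    """Copy a handful of directly applied character settings between runs.

    `src` and `dst` are expected to behave like python-docx Run objects.
    """
    dst.bold = src.bold
    dst.italic = src.italic
    dst.underline = src.underline
    dst.font.name = src.font.name
    dst.font.size = src.font.size
    # Only copy an explicit RGB color; leave the target alone otherwise.
    if src.font.color.rgb is not None:
        dst.font.color.rgb = src.font.color.rgb
```

In the loop above it would be called as copy_run_format(run, r) right after r = p.add_run(run.text), replacing the individual if checks.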
Add New Styles
There is a way to add styles by name but because it isn't likely that the Word document you are getting the text and styles from is using named styles (rather than just applying bold, etc. to the words that you want), it is probably going to be a long road to adding a lot of slightly different styles or sometimes even the same ones.
Unfortunately that is the best answer I have for you on how to do this. Working with Word, Outlook, and Excel documents is not great in Python, especially for what you are trying to do.

Substring with multiple instances of the same character

So I am using a Magtek USB reader that will read card information,
As of right now I can swipe a card and I get a long string of information that goes into a Tkinter Entry textbox that looks like this
%B8954756016548963^LAST/FIRST INITIAL^180912345678912345678901234?;8954756016548963=180912345678912345678901234?
All of the data has been randomized, but that's the format
I've got a tkinter button (it gets the text from the entry box in the format I included above and runs this)
def printCD(self):
    print(self.carddata.get())
    self.card_data_get = self.carddata.get()
    self.creditnumber = self.card_data_get[
        self.card_data_get.find("B") + 1:self.card_data_get.find("^")]
    print(self.creditnumber)
    print(self.card_data_get.count("^"))
This outputs:
%B8954756016548963^LAST/FIRST INITIAL^180912345678912345678901234?;8954756016548963=180912345678912345678901234?
8954756016548963
This yields no issues, but if I wanted to get the next two variables firstname, and lastname
I would need to reuse self.variable.find("^") because in the format it's used before LAST and after INITIAL
So far when I've tried to do this it hasn't been able to reuse "^"
Any takers on how I can split that string of text up into individual variables:
Card Number
First Name
Last Name
Expiration Date

Regex will work for this. I didn't capture everything because you didn't detail what's what but here's an example of capturing the name:
import re
data = "%B8954756016548963^LAST/FIRST INITIAL^180912345678912345678901234?;8954756016548963=180912345678912345678901234?"
matches = re.search(r"\^(?P<name>.+)\^", data)
print(matches.group('name'))
# LAST/FIRST INITIAL
If you aren't familiar with regex, here's a way of testing pattern matching: https://regex101.com/r/lAARCP/1 and an intro tutorial: https://regexone.com/
But basically, I'm searching for one or more of anything (.+) between two carets (^).
Actually, since you mentioned having first and last separate, you'd use this regex:
\^(?P<last>.+)/(?P<first>.+)\^
This question may also interest you regarding finding something twice: Finding multiple occurrences of a string within a string in Python
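Putting the pieces together, all four fields could be captured in one pattern. This is a sketch under an assumption the question does not confirm: that the first four digits after the second caret are the expiration date in YYMM form, as is usual for track 1 data. The group names are arbitrary.

```python
import re

# Card number, then ^, then LAST/FIRST, then ^, then four digits (assumed YYMM).
TRACK1 = re.compile(
    r"%B(?P<number>\d+)"
    r"\^(?P<last>[^/^]+)/(?P<first>[^^]+)"
    r"\^(?P<expiry>\d{4})"
)

data = "%B8954756016548963^LAST/FIRST INITIAL^180912345678912345678901234?;8954756016548963=180912345678912345678901234?"

match = TRACK1.search(data)
if match:
    print(match.group("number"))  # 8954756016548963
    print(match.group("last"))    # LAST
    print(match.group("first"))   # FIRST INITIAL
    print(match.group("expiry"))  # 1809
```

Using [^^]+ instead of .+ keeps each name group from running past the next caret, which is what makes it possible to "reuse" the ^ separator.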

If you find regex difficult you can divide the problem into smaller pieces and attack one at a time:
data = '%B8954756016548963^LAST/FIRST INITIAL^180912345678912345678901234?;8954756016548963=180912345678912345678901234?'
pieces = data.split('^') # Divide in pieces, one of which contains name
for piece in pieces:
    if '/' in piece:
        last, the_rest = piece.split('/')
        first, initial = the_rest.split()
        print('Name:', first, initial, last)
    elif piece.startswith('%B'):
        print('Card no:', piece[2:])

how to regain the original font properties and its associated properties like bold, italics using python-docx while text replacement

I am using python-docx for an automation tool. My issue is that once I run the code that replaces certain words from one list with the corresponding words from another list, it removes all the properties of the text in the paragraphs and tables (font size, font name, parts of the text in bold or italics, bookmarks), and the result comes out as plain text in "Calibri" with a font size of 12.
The code that I used is:
from docx import Document
from docx.document import Document as _Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph

wrongWord = "xyz"
correctWord = "abcd"

def iter_block_items(parent):
    # Yield each paragraph and table child of parent, in document order.
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)

document = Document(r"F:\python\documentSample.docx")
for block in iter_block_items(document):
    if isinstance(block, Paragraph):
        if wrongWord in block.text:
            block.text = block.text.replace(wrongWord, correctWord)
    else:
        for row in block.rows:
            for cell in row.cells:
                if wrongWord in cell.text:
                    cell.text = cell.text.replace(wrongWord, correctWord)
document.save(r"F:\python\documentSampleAfterChanges.docx")
Could you help me keep the same font size, font name and other associated properties from the original file after the text replacement?

Search and replace is a hard problem in the general case, which is the main reason that feature hasn't been added yet.
What's happening here is that assigning to the .text attribute on the cell is removing all the existing runs and the font-related attributes are removed with those runs.
Font information (e.g. bold, italic, typeface, size) is stored at the run level (a paragraph is composed of zero or more runs). Assigning to the .text attribute removes all the runs and replaces them with a single new run containing the assigned text.
So the challenge is to find the text within the multiple runs somewhere, and preserve as much of the font formatting settings as possible.
This is a hard problem because Word breaks paragraph text into separate runs for many reasons, and runs tend to proliferate. There's no guarantee at all that your search term will be completely enclosed in a single run or start at a run boundary. So perhaps you start to see the challenge of a general-case solution.
One thing you can do that might work in your case is something like this:
# ---replace text of first run with new cell value---
runs = table_cell.paragraphs[0].runs
runs[0].text = replacement_text
# ---delete all remaining runs---
for run in runs[1:]:
    r = run._element
    r.getparent().remove(r)
Basically this replaces the text of the first run and deletes any remaining runs. Since the first run often contains the formatting you want, this can often work. If the first word is formatted differently though, say bold, then all the replacement text will be bold too. You'll have to see how this approach works in your specific case.
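Another partial approach, sketched here with a hypothetical helper name, is to replace the text run by run. Each run keeps its own formatting, but as explained above this only catches occurrences that sit entirely inside a single run; a search term split across two runs is silently missed.

```python
def replace_in_runs(paragraph, old, new):
    """Replace `old` with `new` in every run that fully contains it.

    `paragraph` is expected to behave like a python-docx Paragraph.
    Returns the number of runs that were changed.
    """
    changed = 0
    for run in paragraph.runs:
        if old in run.text:
            # Assigning to run.text keeps the run, and thus its formatting.
            run.text = run.text.replace(old, new)
            changed += 1
    return changed
```

It would be called once per paragraph, including the paragraphs of each table cell (cell.paragraphs). Comparing the return value with a check on paragraph.text tells you whether an occurrence was missed because it spans several runs.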
