How to replace text in a PDF using Python? - python

I have taken the code from another thread here that uses the library PyPDF2 to parse and replace the text of a PDF. The given example PDF in the thread is parsed as a PyPDF2.generic.DecodedStreamObject. I am currently working with a PDF that the company has provided me that was created using Microsoft Word's Export to PDF feature. This generates a PyPDF2.generic.EncodedStreamObject. From exploration, the main difference is that there is what appears to be kerning in some places in the text.
This caused two problems for me with the sample code. Firstly, the line if len(contents) > 0: in main seems to get erroneously triggered and attempts to use the key of the EncodedStreamObject dictionary instead of the EncodedStreamObject itself. To work around this, I commented out the if block and used the code in the else block for both cases.
The second problem was that the (what I assume are) kerning markings broke up the text I was trying to replace. I noticed that kerning was not in every line, so I made the assumption that the kerning markers were not strictly necessary, and tried to see what the output would look like with them removed. The text was structured something like so: [(Thi)4(s)-1(is t)2(ext)]. I replaced the line in the sample code replaced_line = line in replace_text with replaced_line = "[(" + "".join(re.findall(r'\((.*?)\)', line)) + ")] TJ". This preserved the observed structure while allowing the text to be searched for replacements. I verified this was actually replacing the text of the line.
Neither of those changes prevented the code from executing, however the output PDF seems to be completely unchanged despite the code appearing to work using print statements to check if the replaced line has the new text. I initially assumed this was because of the if block in process_data that determined if it was Encoded or Decoded. However, I dug through the actual source code for this library located here, and it seems that if the object is Encoded, it generates a Decoded version of itself which the if block reflects. My only other idea is that the if block that I commented out in main wasn't erroneously catching my scenario, but was instead handling it incorrectly. I have no idea how I would fix it so that it handles it properly.
I feel like I'm incredibly close to solving this, but I'm at my wits end as to what to do from here. I would ask the poster of the linked solution in a comment, but I do not have enough reputation to comment on SO. Does anyone have any leads on how to solve this problem? I don't particularly care what library or file format is used, but it must retain the formatting of the Word document I have been provided. I have already tried exporting to HTML, but that removes most of the formatting and also the header. I have also tried converting the .docx to PDF in Python, but that requires me to actually have Word installed on the machine, which is not a cross-platform solution. I also explored using RTF, but from what I found the solution for that file type is to convert it to a .docx and then to PDF.
Here is the full code that I have so far:
import PyPDF2
import re
def replace_text(content, replacements=dict()):
lines = content.splitlines()
result = ""
in_text = False
for line in lines:
if line == "BT":
in_text = True
elif line == "ET":
in_text = False
elif in_text:
cmd = line[-2:]
if cmd.lower() == 'tj':
replaced_line = "[(" + "".join(re.findall(r'\((.*?)\)', line)) + ")] TJ"
for k, v in replacements.items():
replaced_line = replaced_line.replace(k, v)
result += replaced_line + "\n"
else:
result += line + "\n"
continue
result += line + "\n"
return result
def process_data(obj, replacements):
data = obj.getData()
decoded_data = data.decode('utf-8')
replaced_data = replace_text(decoded_data, replacements)
encoded_data = replaced_data.encode('utf-8')
if obj.decodedSelf is not None:
obj.decodedSelf.setData(encoded_data)
else:
obj.setData(encoded_data)
pdf = PyPDF2.PdfFileReader("template.pdf")
# pdf = PyPDF2.PdfFileReader("sample.pdf")
writer = PyPDF2.PdfFileWriter()
replacements = {
"some text": "replacement text"
}
for page in pdf.pages:
contents = page.getContents()
# if len(contents) > 0:
# for obj in contents:
# streamObj = obj.getObject()
# process_data(streamObj, replacements)
# else:
process_data(contents, replacements)
writer.addPage(page)
with open("output.pdf", 'wb') as out_file:
writer.write(out_file)
EDIT:
I've somewhat tracked down the source of my problems. The line obj.decodedSelf.setData(encoded_data) seems to not actually set the data properly. After that line, I added
print(encoded_data[:2000])
print("----------------------")
print(obj.getData()[:2000])
The first print statement was different from the second print statement, which definitely should not be the case. To really test see if this was true, I replaced every single line with [()], which I know to be valid as there are many lines that are already that. For the life of me, though, I can't figure out why this function call fails to do any lasting changes.
EDIT 2:
I have further identified the problem. In the source code for an EncodedStreamObject in the getData method, it returnsself.decodedSelf.getData() if self.decodedSelf is True. HOWEVER, after doing obj.decodedSelf.setData(encoded_data), if I do print(bool(obj.decodedSelf)), it prints False. This means that when the EncodedStreamObject is getting accessed to be written out to the PDF, it is re-parsing the old PDF and overriding the self.decodedSelf object! Short of going in and fixing the source code, I'm not sure how I would solve this problem.
EDIT 3:
I have managed to convince the library to use the decoded version that has the replacements! By inserting the line page[PyPDF2.pdf.NameObject("/Contents")] = contents.decodedSelf before writer.addPage(page), it forces the page to have the updated contents. Unfortunately, my previous assumption about the text kerning was incorrect. After I replaced things, some of my text mysteriously disappeared from the PDF. I assume this is because the format is incorrect somehow.
FINAL EDIT:
I figure I'd put this in here in case anyone else stumbles across this. I never did manage to get it to finally work as expected. I instead moved to a solution to mimic the PDF with HTML/CSS. If you add the following style tag in your HTML, you can get it to print more like how you'd expect a PDF to print:
<style type="text/css" media="print">
#page {
size: auto;
margin: 0;
}
</style>
I'd recommend this solution for anyone looking to do what I was doing. There are Python libraries to convert HTML to CSS, but they do not support HTML5 and CSS3 (notably they do not support CSS flex or grid). You can just print the HTML page to PDF from any browser to accomplish the same thing. It definitely doesn't answer the question, so I felt it best to leave it as an edit. If anyone manages to complete what I have attempted, please post an answer for me and any others.

Related

Spaces are different in the same sentence/string

I have some stuff in an Excel spreadsheet, which is loaded into a webpage, where the content is displayed. However, what I have noticed is, that some of the content has weird formatting, i.e. a sudden line shift or something.
Then I just tried to copy the text from the spreadsheet, and pasting it into Notepad++, and enabled "Show White Space and Tab", and then the output was this:
The second line is the one directly copied from the spreadsheet, where the first one is just where I copied the string into a variable in Python, printed it, and then copied the output from the output console.
And as you can see the first line has all dots for space, where the other misses some dots. And I have an idea that that is what is doing this trickery, especially because it's at those place the line shift happens.
I have tried to just do something like:
import pandas as pd
data = pd.read_excel("my_spreadsheet.xlsx")
data["Strings"] = [str(x).replace(" ", " ") for x in data["Strings"]]
data.to_excel("my_spreadsheet.xlsx", index=False)
But that didn't change anything, as if I copied it straight from the output console.
So yeah, is there any easy way to make spaces the same type of spaces, or do I have to do something else ?
I think you would need to figure out which exact character is being used there.
You can load the file and print out the characters one by one together with the character code to figure out what's what.
See the code example below. I added some code to skip alphanumeric characters to reduce the actual output somewhat...
with open("filename.txt") as infile:
text = infile.readlines()
def print_ordinal(text: str, skip_alphanum: bool=True):
for line in text:
for character in line:
if not(skip_alphanum and character.isalnum()):
print(f"{character} - {ord(character)}")
print_ordinal(text)

Is it possible to save the results of the "google" package to a txt file in python?

Link to program:
https://replit.com/#MichaelGordon5/DetailedSearch#main.py
I've coded a program that in essence, does an advanced google search, and Im planning on adding more to it later.
It uses the google search package
The issue I'm having with it is that when I print out the search results into the console:
#userquery = search term, amt= amount of results
for j in search(userquery, tld="com", num=amt, stop=amt, pause=0):
print(j)
I am unable to save the result to a text file. I have had some experience using f.write/open/close in the past but nothing I do seems to work. Any advice?
Note that I haven't actually tried your code, but the code below should work. I'm not sure if j needs to be cast as a str but it can't hurt.
sidenote: fix your indentation. It hurts my eyes. 4 spaces is what should always be used...
results = []
for j in search(userquery, tld="com", num=amt, stop=amt, pause=0):
print(j)
results.append(str(j))
with open("filename_goes_here.txt", "w") as outfile:
outfile.writelines(results)
Note that you could also keep the file open and write to it whenever you print out j...

How to access remote and encrypted PDF text without writing to local drive [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 1 year ago.
Improve this question
I am very new to the coding world and have been stuck on this one problem for 3 days now, searching everywhere for an answer, so any help will be greatly appreciated. I am needing to extract a small amount of text from a url-located Pdf file. I'm using sessions.get(chart_PDF) as the driver for locating the URL where chart_PDF is the example url below.
Example url is https://www.airservicesaustralia.com/aip/pending/dap/PADGN01-166_09SEP2021.pdf
I know I am able to write it to my local drive but I don't want to do that, I want to be able to do it remotely, since I only need a couple of numbers from it.
I have tried finding the password from the url page for decrypting, couldn't find. I've tried to use PyPDF2, pdfminer and pikepdf (probably not well).
I only need to retrieve two numbers near the bottom of the PDF that can be used for the rest of my code. Please help, even if it is a simple fix, I'm new to all this and need some help. Thanks.
from io import BytesIO
from pikepdf import Pdf as PDF
from pdfminer import high_level
chart_PDF = https://www.airservicesaustralia.com/aip/pending/dap/PADGN01-166_09SEP2021.pdf
retrieve = s.get(chart_PDF)
content = retrieve.content
response =urllib.request.urlopen(chart_PDF)
p = BytesIO(content)
p.getbuffer()
check = PDFPage.get_pages(p, check_extractable=False)
extract = high_level.extract_text(p)
I'm getting:
PDFTextExtractionNotAllowedWarning: The PDF <_io.BytesIO object at 0x000001B007ABEC20> contains a metadata field indicating that it should not allow text extraction. Ignoring this field and proceeding.warnings.warn(warning_msg, PDFTextExtractionNotAllowedWarning)
Alternately, if I try this:
from pikepdf import Pdf as PDF
from pdfminer.pdfpage import PDFPage
from PyPDF2 import PdfFileReader
new_pdf = PDF.new()
with PDF.open(p) as pdf:
print(len(pdf.pages))
page1 = pdf.pages[0]
if PdfFileReader.getIsEncrypted(pdf):
print(True)
PdfFileReader.decrypt(page1, password='')
pdf.close()
I get:
line 1987, in decrypt
return self._decrypt(password)
AttributeError: _decrypt
UPDATE 3/8/21
Thank you so much K J! You've seriously been a huge help!
from io import BytesIO
from pdfminer.pdfpage import PDFPage
from pdfminer import high_level
retrieve = s.get(chart_PDF)
content = retrieve.content
bytes = BytesIO(content)
bytes.getbuffer()
PDFPage.get_pages(bytes, check_extractable=False)
extract = high_level.extract_text(bytes, password='') #THIS LINE THROWS ERROR
joined = ''.join(extract)
find_txt = re.findall(r'[(]\d*[-]\d[.]\d[)]', joined)
print(find_txt)
bytes.close()
This is now working well and I have been able to pull the numbers that I need (I have basically pulled all numbers from inside brackets off the PDF). I'll sort through that to find which one I need.
Strangely enough, although its giving me what I need, my extract = high_level.extract_text(bytes, password='') line still throws the Warning: (warning_msg, PDFTextExtractionNotAllowedWarning) which is rather annoying. Not sure how this process works but its still letting the info out.
I can't use try except or it skips over it. What is the way around this? how can I stop that error coming up?
FINAL UPDATE
I got around the warning and it works well now.
with warnings.catch_warnings():
warnings.simplefilter("ignore")
extract = high_level.extract_text(bytes)
Cheers fellas for putting up with my ignorance, you've helped so much.
The whole file has to be downloaded to a device via RAM so the blob as a FILE can be parsed at the very END for one OR more %%EOF and the location of page 0 (it gets converted to 1 or i) it could be ANYWHERE IN THE STREAM,.
THEN you can navigate to other sequential numbered pages in the RANDOM order they are built. Any complaints please contact Adobe.
However it is easiest if it is cached as a physical FILE object. If you dont want that on disk use a ram drive for your browser.
Again those two objects at bottom of page one could be anywhere mixed into the content of "page" 99's objects, or otherwise. each letter in a PDF can in its extreme be more than one object anywhere in the file. but a good authoring editor would try to keep them as lines by lines. (there is no such PDF thing as a word or paragraph.)
We can Print the file as Plain Text to see how it is composited and although (secured) that is allowed.
I tried printing from browser with little success but know that can depend on browser system and OS print drivers. Here I have printed the page as text using Acrobat portable, so we can see the sequential offsets of each text block from Left Hand margin JUST LIKE a PDF VIEWER would need to rebuild them.
UPDATE
You said your target is (1380-4.4) to the RIGHT of ALTERNATE but again A PDF has no concept of Left and Right or BEFORE or AFTER so we find IN THIS FILE the variable target is in 2 separate pieces PRIOR to the KNOWN characters which luckily is a complete single block (alternate). Thus here proximity of plain text could well work if the capture is confined to that nearby locality. However there is no guarantee that ALTERNATE would always be a single block.
It was perhaps not a good Idea To show the way a Printer would be given a stream of sequential data
Here is the way one PDF viewer goes about decrypting the file
As stated on this occasion the word ALTERNATE is defined as text however the next item is the "3" under "B" which is text as a vector path it is not called a "character" although it looks like one but a numbered glyph from a font table. We do see later that some of those numbers are stored as "text" and for your target it is mixed in with similar text in the same object.
Thus you need to call a PDF interpreter to give you a meaningful translation of all bits and pieces of objects so that you can extract the "right" text.
The easiest way for a "simple" one line target in a complex file is to use MuPDF to first tidy up the file
mutool clean -gggg -D infile.pdf outfile.pdf
combined with
PDFTOTXT -layout outfile.pdf outfile.txt
or similar to hopefully export that text on a line by line basis, such that you can consistently find your target instantly before ! or after ALTERNATE.
N.B Mutool convert to HTML would place the target value in a table entry AFTER the key word, and if the lines are consistent in number that would be a simpler way to find or grep.

Adding emoji's to a PDF using PyFPDF

I am writing a program that is supposed to turn a paragraph into a PDF. A lot of said paragraphs have emoji's in them and I cannot figure out how I am supposed to make them show up on the PDF.
Whenever there is an emoji in a paragraph I get the following error
File "C:\Python38\lib\site-packages\fpdf\fpdf.py", line 1449, in _putTTfontwidths
if (font['cw'][cid] == 0):
IndexError: list index out of range
Now from my understanding this basically says that an emoji was not found at this uncode location in the font. But I have looked at the font in detail and it does indeed contain emojis.
getting the length of font['cw'] reveals that it goes up to 65536 when the emoji in question is located at position 128522 which is almost twice as far.
Now if I edit the fpdf code from this
if (font['cw'][cid] == 0):
continue
to this
try:
if (font['cw'][cid] == 0):
continue
except:
continue
It prints 2 boxes instead of emojis but if I copy paste the boxes into a web browser they are displayed correctly.
I am assuming this is an encoding problem. But I haven't really meddled with encoding so i am unsure how to proceed.
Seems to be a known bug: https://github.com/reingart/pyfpdf/issues/131
It looks like Fpdf hasn't been updated in a while. There's apparently a fork called fpdf2: https://pypi.org/project/fpdf2/
If that fails too, you could see if the ReportLab or WeasyPrint libraries work better for you.

Python not splitting CRLF correctly

I'm writing a script to convert very simple function documentation to XML in python. The format I'm using would convert:
date_time_of(date) Returns the time part of the indicated date-time value, setting the date part to 0.
to:
<item name="date_time_of">
<arg>(date)</arg>
<help> Returns the time part of the indicated date-time value, setting the date part to 0.</help>
</item>
So far it works great (the XML I posted above was generated from the program) but the problem is that it should be working with several lines of documentation pasted, but it only works for the first line pasted into the application. I checked the pasted documentation in Notepad++ and the lines did indeed have CRLF at the end, so what is my problem?
Here is my code:
mainText = input("Enter your text to convert:\r\n")
try:
for line in mainText.split('\r\n'):
name = line.split("(")[0]
arg = line.split("(")[1]
arg = arg.split(")")[0]
hlp = line.split(")",1)[1]
print('<item name="%s">\r\n<arg>(%s)</arg>\r\n<help>%s</help>\r\n</item>\r\n' % (name,arg,hlp))
except:
print("Error!")
Any idea of what the issue is here?
Thanks.
input() only reads one line.
Try this. Enter a blank line to stop collecting lines.
lines = []
while True:
line = input('line: ')
if line:
lines.append(line)
else:
break
print(lines)
The best way to handle reading lines from standard input (the console) is to iterate over the sys.stdin object. Rewritten to do this, your code would look something like this:
from sys import stdin
try:
for line in stdin:
name = line.split("(")[0]
arg = line.split("(")[1]
arg = arg.split(")")[0]
hlp = line.split(")",1)[1]
print('<item name="%s">\r\n<arg>(%s)</arg>\r\n<help>%s</help>\r\n</item>\r\n' % (name,arg,hlp))
except:
print("Error!")
That said, It's worth noting that your parsing code could be significantly simplified with a little help from regular expressions. Here's an example:
import re, sys
for line in sys.stdin:
result = re.match(r"(.*?)\((.*?)\)(.*)", line)
if result:
name = result.group(1)
arg = result.group(2).split(",")
hlp = result.group(3)
print('<item name="%s">\r\n<arg>(%s)</arg>\r\n<help>%s</help>\r\n</item>\r\n' % (name,arg,hlp))
else:
print "There was an error parsing this line: '%s'" % line
I hope this helps you simplify your code.
Patrick Moriarty,
It seems to me that you didn't particularly mention the console and that your main concern is to pass several lines together at one time to be treated. There's only one manner in which I could reproduce your problem: it is, executing the program in IDLE, to copy manually several lines from a file and pasting them to raw_input()
Trying to understand your problem led me to the following facts:
when data is copied from a file and pasted to raw_input() , the newlines \r\n are transformed into \n , so the string returned by raw_input() has no more \r\n . Hence no split('\r\n') is possible on this string
pasting in a Notepad++ window a data containing isolated \r and \n characters, and activating display of the special characters, it appears CR LF symbols at all the extremities of the lines, even at the places where there are \r and \n alone. Hence, using Notepad++ to verify the nature of the newlines leads to erroneous conclusion
.
The first fact is the cause of your problem. I ignore the prior reason of this transformation affecting data copied from a file and passed to raw_input() , that's why I posted a question on stackoverflow:
Strange vanishing of CR in strings coming from a copy of a file's content passed to raw_input()
The second fact is responsible of your confusion and despair. Not a chance....
.
So, what to do to solve your problem ?
Here's a code that reproduce this problem. Note the modified algorithm in it, replacing your repeated splits applied to each line.
ch = "date_time_of(date) Returns the time part.\r\n"+\
"divmod(a, b) Returns quotient and remainder.\r\n"+\
"enumerate(sequence[, start=0]) Returns an enumerate object.\r\n"+\
"A\rB\nC"
with open('funcdoc.txt','wb') as f:
f.write(ch)
print "Having just recorded the following string in a file named 'funcdoc.txt' :\n"+repr(ch)
print "open 'funcdoc.txt' to manually copy its content, and paste it on the following line"
mainText = raw_input("Enter your text to convert:\n")
print "OK, copy-paste of file 'funcdoc.txt' ' s content has been performed"
print "\nrepr(mainText)==",repr(mainText)
try:
for line in mainText.split('\r\n'):
name,_,arghelp = line.partition("(")
arg,_,hlp = arghelp.partition(") ")
print('<item name="%s">\n<arg>(%s)</arg>\n<help>%s</help>\n</item>\n' % (name,arg,hlp))
except:
print("Error!")
.
Here's the solution mentioned by delnan : « read from the source instead of having a human copy and paste it. »
It works with your split('\r\n') :
ch = "date_time_of(date) Returns the time part.\r\n"+\
"divmod(a, b) Returns quotient and remainder.\r\n"+\
"enumerate(sequence[, start=0]) Returns an enumerate object.\r\n"+\
"A\rB\nC"
with open('funcdoc.txt','wb') as f:
f.write(ch)
print "Having just recorded the following string in a file named 'funcdoc.txt' :\n"+repr(ch)
#####################################
with open('funcdoc.txt','rb') as f:
mainText = f.read()
print "\nfile 'funcdoc.txt' has just been opened and its content copied and put to mainText"
print "\nrepr(mainText)==",repr(mainText)
print
try:
for line in mainText.split('\r\n'):
name,_,arghelp = line.partition("(")
arg,_,hlp = arghelp.partition(") ")
print('<item name="%s">\n<arg>(%s)</arg>\n<help>%s</help>\n</item>\n' % (name,arg,hlp))
except:
print("Error!")
.
And finally, here's the solution of Python to process the altered human copy: providing the splitlines() function that treat all kind of newlines (\r or \n or \r\n) as splitters. So replace
for line in mainText.split('\r\n'):
by
for line in mainText.splitlines():

Categories