Convert pdf files to raw text in new directory


Here is what I'm trying:
import PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileReader
import re
import config
import sys
import os
with open(config.ENCRYPTED_FILE_PATH, mode='rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    if reader.isEncrypted:
        reader.decrypt('Password123')
    print(f"Number of page: {reader.getNumPages()}")
    for i in range(reader.numPages):
        output = PdfFileWriter()
        output.addPage(reader.getPage(i))
        with open("./pdfs/document-page%s.pdf" % i, "wb") as outputStream:
            output.write(outputStream)
        print(outputStream)
        for page in output.pages:  # failing here
            print page.extractText()  # failing here
The program decrypts a large PDF file from one location and splits it into a separate PDF file per page in a new directory. This part works fine. After this, however, I would like to convert each page to a raw .txt file in a new directory, i.e. /txt_versions/ (which I'll use later).
Ideally, I can use my current imports, i.e. PyPDF2, without importing or installing more modules. Any thoughts?

You have not described how the last two lines are failing, but extractText does not work well on some PDFs, as its own docstring says:
def extractText(self):
    """
    Locate all text drawing commands, in the order they are provided in the
    content stream, and extract the text. This works well for some PDF
    files, but poorly for others, depending on the generator used. This will
    be refined in the future. Do not rely on the order of text coming out of
    this function, as it will change if this function is made more
    sophisticated.
    :return: a unicode string object.
    """
One thing to do is to see if there is text in your pdf. Just because you can see words doesn't mean they have been OCR'd or otherwise encoded in the file as text. Attempt to highlight the text in the pdf and copy/paste it into a text file to see what sort of text can even be extracted.
If you can't get your solution working you'll need to use another package like Tika.
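If the PDF does contain a text layer, the per-page conversion could look roughly like the sketch below. It is only a sketch under assumptions: it uses pypdf (the successor to PyPDF2) rather than the PyPDF2 calls in the question, the helper names write_pages_as_text and extract_page_texts are made up for illustration, and the txt_versions directory and document-page file naming simply mirror the question.

```python
from pathlib import Path


def write_pages_as_text(page_texts, out_dir="txt_versions"):
    """Write one .txt file per page and return the created paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, text in enumerate(page_texts):
        path = out / f"document-page{i}.txt"
        # Image-only pages may yield an empty string (or None); write an empty file then
        path.write_text(text or "", encoding="utf-8")
        paths.append(path)
    return paths


def extract_page_texts(pdf_path, password=None):
    """Yield the extracted text of each page (assumes pypdf is installed)."""
    from pypdf import PdfReader  # imported lazily so the helper above needs only the stdlib

    reader = PdfReader(pdf_path)
    if reader.is_encrypted and password is not None:
        reader.decrypt(password)
    for page in reader.pages:
        yield page.extract_text()
```

A call such as write_pages_as_text(extract_page_texts(config.ENCRYPTED_FILE_PATH, "Password123")) would then read the original file directly, so the text step does not depend on the split per-page PDFs.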

Related

pdftotext returns blank but the PDF has multiple lines and multiple pages. Why?

import pdftotext
# Load your PDF
with open("docs/doc1.pdf", "rb") as f:
    docs = pdftotext.PDF(f)
print(docs[0])
This code prints a blank string for this specific file; if I change the file, it gives me a result.
I even tried Apache Tika, which also returns None. How can I solve this problem?
One thing I would like to mention is that the PDF is made up of multiple images.
Here is the file
This is a sample PDF, not the original one, but I want to extract text from a PDF like this.
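Because the pages are images, a blank result most likely means there is no text layer to extract. A small check like the one below can confirm which pages are affected. This is a sketch: pages_without_text is a hypothetical helper, and it assumes the extracted pages can be iterated as strings, as the docs object above appears to allow.

```python
def pages_without_text(page_texts):
    """Return the indexes of pages whose extracted text is empty or whitespace."""
    return [i for i, text in enumerate(page_texts) if not (text or "").strip()]


# Hypothetical usage with the object from the question:
#     blank = pages_without_text(docs)
# Pages listed in `blank` have no extractable text and would need OCR instead.
```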

How to open/save docx file to be edited as xml and save the result as docx after editing using python

I have a docx file whose paragraphs I need to edit (the paragraphs might contain equations). I tried to do this with python-docx, but it was not successful: editing the text of each paragraph and replacing it with the edited paragraph requires calling p.add_paragraphs(editText(paragraph.text)), which ignores and omits any mathematical equation.
While searching for a way to achieve this, I found that it is possible through the XML, by finding the <w:t> tags and editing their content like this:
tree = ET.parse(filename)
root = tree.getroot()
for par in root.findall('w:p'):
    if par.find('w:r'):
        myText = par.find('w:r').find('w:t')
        myText.text = editText(myText.text)
Then I must save the result as docx.
My question is: what should the format of filename be? Should it be a document.xml file? If so, how can I get it from my original document.docx file? And how can I save the result as a .docx file again?
To save the docx as XML, I tried document.save('Document2.xml'), but the content of the result was not correct.
Could you give me some advice on how to do this?

Not experienced with this at all, but perhaps this is what you were looking for?
https://virantha.com/2013/08/16/reading-and-writing-microsoft-word-docx-files-with-python/
From the article:
import os
import shutil
import tempfile
import zipfile
from lxml import etree
class DocsWriter:
    def __init__(self, docx_file):
        self.zipfile = zipfile.ZipFile(docx_file)
    def _write_and_close_docx(self, xml_content, output_filename):
        """ Create a temp directory, expand the original docx zip.
            Write the modified xml to word/document.xml
            Zip it up as the new docx
        """
        tmp_dir = tempfile.mkdtemp()
        self.zipfile.extractall(tmp_dir)
        # etree.tostring returns bytes, so the file is opened in binary mode
        with open(os.path.join(tmp_dir, 'word/document.xml'), 'wb') as f:
            xmlstr = etree.tostring(xml_content, pretty_print=True)
            f.write(xmlstr)
        # Get a list of all the files in the original docx zipfile
        filenames = self.zipfile.namelist()
        # Now, create the new zip file and add all the files into the archive
        zip_copy_filename = output_filename
        with zipfile.ZipFile(zip_copy_filename, "w") as docx:
            for filename in filenames:
                docx.write(os.path.join(tmp_dir, filename), filename)
        # Clean up the temp dir
        shutil.rmtree(tmp_dir)
From what I can tell, this code block writes an xml document as .docx. Refer to the article for more context.
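To connect this to the question: a .docx file is a zip archive, and filename in the question's snippet would be the word/document.xml member inside it. The sketch below shows the whole round trip with only the standard library. The function name edit_docx_text is made up for illustration, and edit_text stands in for the asker's editText. One caveat to keep in mind: ElementTree may rename namespace prefixes other than w when it serializes, which can upset Word for documents that declare many namespaces, so lxml (as in the article's code) may be the safer choice for real files.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace that the "w:" prefix refers to in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def edit_docx_text(src_path, dst_path, edit_text):
    """Copy a .docx, applying edit_text() to the text of every <w:t> element."""
    # Keep the conventional "w" prefix when the tree is serialized again
    ET.register_namespace("w", W_NS)
    with zipfile.ZipFile(src_path) as src:
        with zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
            for item in src.infolist():
                data = src.read(item.filename)
                if item.filename == "word/document.xml":
                    root = ET.fromstring(data)
                    # Only the text runs are touched; equations and other
                    # elements are written back unchanged
                    for node in root.iter(f"{{{W_NS}}}t"):
                        if node.text:
                            node.text = edit_text(node.text)
                    data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
                dst.writestr(item, data)
```

Every other member of the archive is copied through untouched, so the result is still a complete .docx package.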

Python is not the best tool for this. Use VBA if you need to automate something in a Word document, or multiple Word documents. I can't tell what you are even trying to do here, but let's start at the beginning, with something simple. If, for instance, you want to loop through all paragraphs in your Word document, and select only the equations, you can run the code below to do just that.
Sub SelectAllEquations()
    Dim xMath As OMath
    Dim I As Integer
    With ActiveDocument
        .DeleteAllEditableRanges wdEditorEveryone
        For I = 1 To .OMaths.Count
            Set xMath = .OMaths.Item(I)
            xMath.Range.Paragraphs(1).Range.Editors.Add wdEditorEveryone
        Next
        .SelectAllEditableRanges wdEditorEveryone
        .DeleteAllEditableRanges wdEditorEveryone
    End With
End Sub
Again, I don't know what your end game is, but I think it's worthwhile to start with something like that, and build on your foundation.

How to encapsulate method to print console out to Word

I want to print my summary stats sometimes to console, and also other times to Word.
I don't want my code to be littered with lines calling to Word, because then I'd need to find and comment out like 100 lines each time I just wanted the console output.
I've thought about using a flag variable up at the front and changing it to false when I wanted to print versus not, but that's also a hassle.
The best solution I came up with was to write a separate script that opens a document, writes by calling my first summary stats script, and then closes the document:
import sys
import RunSummaryStats
from docx import Document
filename = "demo.docx"
document = Document()
document.save(filename)
f = open(filename, 'w')
sys.stdout = f
# actually call my summary stats script here: Call RunSummaryStats etc.
print("5")
f.close()
However, when I tried doing the above with python-docx, upon opening my docx file I received the error "We're sorry, we can't open this document because some parts are missing or invalid." As you can see, the code above just printed out one number, so it can't be a problem with the data I'm trying to write.
Finally, it needs to go to Word and not other file formats, to format some data tables.
By the way, this is an excerpt of RunSummaryStats. You can see how it's already filled with print lines which are helpful when I'm still exploring the data, and which I don't want to get rid of/replace with adding into a list:

The easy thing is to let a StringIO buffer do the work, and separate collecting all your data from writing it into a file. That is:
import io
import sys
import RunSummaryStats
# first, collect all your data into a StringIO object
orig_stdout = sys.stdout
stat_buffer = io.StringIO()
sys.stdout = stat_buffer
try:
    # actually call my summary stats script here: Call RunSummaryStats etc.
    print("5")
finally:
    sys.stdout = orig_stdout
# then, write the StringIO object's contents to a Document.
from docx import Document
filename = "demo.docx"
document = Document()
document.add_paragraph(stat_buffer.getvalue())
document.save(filename)
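On Python 3 the same capture can be written a little more compactly with contextlib.redirect_stdout, which restores sys.stdout on its own. The sketch below is one possible shape, not the answer's original code: capture_printed_lines and demo_stats are hypothetical names, and demo_stats merely stands in for the real summary stats script.

```python
import io
from contextlib import redirect_stdout


def capture_printed_lines(func, *args, **kwargs):
    """Run func and return everything it printed, split into lines."""
    buffer = io.StringIO()
    # redirect_stdout puts sys.stdout back even if func raises
    with redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue().splitlines()


def demo_stats():
    # Hypothetical stand-in for the real summary stats script
    print("mean: 5")
    print("count: 12")
```

Each string returned by capture_printed_lines(demo_stats) can then be passed to document.add_paragraph, while calling demo_stats() directly still prints to the console as before.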

The Document() constructor essentially creates the .docx file package (this is actually a .zip archive of lots of XML and other stuff, which later the Word Application parses and renders etc.).
This statement f = open(filename, 'w') opens that file object (NB: this does not open Word Application, nor does it open a Word Document instance) and then you dump your stdout into that object. That is 100% of the time going to result in a corrupted Word Document; because you simply cannot write to a word document that way. You're basically creating a plain text file with a docx extension, but none of the underlying "guts" that make a docx a docx. As a result, Word Application doesn't know what to do with it.
Modify your code so that this "summary" procedure returns an iterable (the items in this iterable will be whatever you want to put in the Word Document). Then you can use something like the add_paragraph method to add each item to the Word Document.
def get_summary_stats(console=False):
    """
    if console==True, results will be printed to console
    returns a list of string for use elsewhere
    """
    # hardcoded, but presume you will actually *get* these information somehow, modify as needed:
    stats = ["some statistics about something", "more stuff about things"]
    if console:
        for s in stats:
            print(s)
    return stats
Then:
from docx import Document
filename = "demo.docx"
document = Document()
# True will cause this function to print to console
for stat in get_summary_stats(True):
    document.add_paragraph(stat)
document.save(filename)

So maybe there was a better way to do it, but in the end I did the following:
created a single function, def run_summary, out of my summary stats script;
created a function, def print_word, based on @Charles Duffy's answer, where StringIO reads from RunSummaryStats.run_summary(filepath, filename);
called print_word in my final module. There I set the variables for path, filename, and raw data source like so:
PrintScriptToWord.print_word(ATSpath, RSBSfilename, curr_file + ".docx")
I welcome any suggestions to improve this or other approaches.

Edit text in PDF with python

I have a PDF file and I need to edit some text/values in it. For example, in the PDF files that I have, BIRTHDAY DD/MM/YYYY is always N/A. I want to change it to whatever value I desire and then save it as a new document. Overwriting the existing document is also alright.
This is what I have tried so far:
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("abc.pdf")
page = reader.pages[0]
writer = PdfWriter()
writer.add_page(reader.pages[0])
pdf_doc = writer.update_page_form_field_values(
    reader.pages[0], {"BIRTHDAY DD/MM/YYYY": "123"}
)
with open("new_abc1.pdf", "wb") as fh:
    writer.write(fh)
But this update_page_form_field_values() doesn't change the desired value, maybe because this is not a form field?
Any clues?

I'm the current maintainer of pypdf and PyPDF2. (Please use pypdf; PyPDF2 is deprecated.)
It is not possible to change text with pypdf at the moment.
Changing form contents is a different story. However, we have several issues with form fields: https://github.com/py-pdf/pypdf/labels/workflow-forms
The update_page_form_field_values is the correct function to use.
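To find out whether the value in question is a form field at all, one option is to list the field names the reader reports before trying to update them. The helper below is a sketch with a made-up name, form_field_names. It assumes a pypdf PdfReader, whose get_fields() returns a mapping of field names, or None when the document has no form.

```python
def form_field_names(reader):
    """Return the sorted names of the form fields a reader reports."""
    # get_fields() gives None for documents without interactive form fields
    fields = reader.get_fields() or {}
    return sorted(fields)


# Hypothetical usage (requires pypdf and the file from the question):
#     from pypdf import PdfReader
#     names = form_field_names(PdfReader("abc.pdf"))
#     "BIRTHDAY DD/MM/YYYY" in names
```

If the label is missing from that list, the value is ordinary page text rather than a form field, which would explain why update_page_form_field_values has no visible effect.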

How to parse text extracted from PDF file with delimiter using Python?

I have tried PyPDF2 to extract and parse text from a PDF using the following code segment:
import PyPDF2
import re
pdfFileObj = open('test.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
rawText = pdfReader.getPage(0).extractText()  # getPage() needs a page index; 0 is the first page
extractedText = re.split('\n|\t', rawText)
print("Extracted Text: " + str(extractedText) + "\n")
Case 1: When I try to parse the PDF text, I fail to get it exactly as it appears in the PDF. For example,
In this case, no line break or newline can be found in either rawText or extractedText, and the result looks like this:
input field, your old automation script will try to submit a form with missing data unless you update it.Another common case is asserting that a specific error message appeared and then updating the error message, which will also break the script.
Case 2: For the following case,
It gives this result:
2B. Community Living5710509-112C. Lifelong Learning69116310-122D. Employment5710509-11
which is more difficult to parse, because the individual scores are hard to tell apart. Is it possible to parse these scenarios correctly with PyPDF2 or any other Python library?
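For Case 2, the row boundaries can at least be recovered from the fused string, because each row starts with a label such as "2B. ". The sketch below relies on that assumed label pattern, and split_rows is a hypothetical helper. It deliberately leaves the run of digits in each row as one string, since nothing in the extracted text says where one score ends and the next begins.

```python
import re

# A label such as "2B. " is assumed to mark the start of each table row
ROW_START = re.compile(r"(?=\d[A-Z]\. )")
ROW_PARTS = re.compile(r"(\d[A-Z])\. (\D+)(.*)", re.S)


def split_rows(text):
    """Split fused table text into (label, name, digits) tuples."""
    rows = []
    for chunk in ROW_START.split(text):
        match = ROW_PARTS.fullmatch(chunk)
        if match:  # skips the empty chunk before the first label
            rows.append(match.groups())
    return rows
```

Applied to the Case 2 string, this yields one tuple per row, for example ("2B", "Community Living", "5710509-11"). Separating the individual scores would need column positions, which this text no longer carries.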
