PyPDF2 output incorrect: missing words. Coding issue? - python

I am trying to extract text from some .pdf files using PyPDF2:
import PyPDF2

# Open in binary mode; the file must stay open while pages are read.
with open(filepath, 'rb') as pdfFileObject:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
    count = pdfReader.numPages
    text = ''
    for i in range(count):
        page = pdfReader.getPage(i)
        text += page.extractText()
Sometimes text contains a (nearly) correct string (with some encoding issues: in some German texts everything is fine except for the umlauts (äöü)), but mostly (irrespective of language) it looks like this:
'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nÈ\n\n'
I have not recognized any pattern that separates the (almost) correctly parsed files from the completely incorrectly parsed ones. Some of them have tables (these are the worst, admittedly), but others look plain and simple.
Regrettably, PyPDF2 seems to be the only module that works for me... Does this have something to do with encoding? I'd like all characters to be represented correctly anyway, although it's not really important in this case.
I'll be grateful for any suggestions.

Related

The extractText() function does not return text

import PyPDF2

pdfFileObject = open('MDD.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
    page = pdfReader.getPage(i)
    print(page.extractText())
Above is my code. When I run the script, it just outputs a bunch of numbers and numerals, not the text of the file. Could anyone help me with that?
This function doesn't work for all PDF files. This is explained in the documentation:
This works well for some PDF
files, but poorly for others, depending on the generator used. This will
be refined in the future. Do not rely on the order of text coming out of
this function, as it will change if this function is made more
sophisticated.
:return: a unicode string object.
Try your code on this file. I'm sure it should work, so it seems that the problem is not in your code.
If you really need to parse files that are created the same way as your original MDD.pdf, you will have to choose another library.
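To find out which files extractText() handles poorly, one option is to check how much of the returned string is visible text. The following is only a minimal sketch: the helper name `looks_unextracted` and the 0.2 threshold are assumptions made for illustration, not part of PyPDF2.

```python
def looks_unextracted(text, min_ratio=0.2):
    """Return True when the string is empty or mostly whitespace.

    min_ratio is an arbitrary cut-off chosen for this sketch.
    """
    if not text:
        return True
    visible = sum(1 for ch in text if not ch.isspace())
    return visible / len(text) < min_ratio


# A string of almost nothing but newlines, versus ordinary text.
sample_bad = '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nÈ\n\n'
sample_good = 'Plain text extracted from a page.\n'
bad_flag = looks_unextracted(sample_bad)
good_flag = looks_unextracted(sample_good)
```

Files flagged this way are candidates for a different library. The check says nothing about text that is present but garbled.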

Searching for a string in a file and saving the results

I have a few quite large text files with data in them. I need to find a string that repeats throughout the data; the string is always followed by an id number, and I need to save that number.
I've done some simple scripting with Python, but I am unsure where to start with this, or whether Python is even a good fit for this problem. Any help is appreciated.
I will post more information next time (my bad), but I managed to get something to work that should do it for me.
import re
with open("test.txt", "r") as opened:
    text = opened.read()
# Match "data" at a word boundary plus the next 8 characters (newlines excluded)
output = re.findall(r"\bdata........", text)
out_str = ",".join(output)
print(out_str)
#with open("output.txt", "w") as outp:
#    outp.write(out_str)
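If only the id number is needed, a capture group can return just the digits instead of a fixed-width slice. This is a sketch under assumptions: the keyword "data", the separators between keyword and number, and the helper name `find_ids` are all made up for illustration, since the post does not show the real file format. Reading line by line avoids loading a large file into memory at once.

```python
import re


def find_ids(lines, keyword="data"):
    """Collect the digits that follow `keyword` in an iterable of lines."""
    # \b keeps "metadata" from matching; \W* skips spaces, colons, etc.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\W*(\d+)")
    ids = []
    for line in lines:
        ids.extend(pattern.findall(line))
    return ids


# Hypothetical sample lines; a real call would pass an open file object.
sample_lines = ["x data 1234 y\n", "data: 42 and metadata 7\n"]
ids = find_ids(sample_lines)
```

With a real file this could be called as `find_ids(opened)` inside the `with open(...)` block, because a file object yields its lines one at a time.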

How to parse XML data manually (without xml.etree)

I've been playing with XML data in a text file. Just some general stuff.
I played around with xml.etree and its commands, but now I am wondering how to get rid of the tags manually and write all that data into a new file.
I figure it would take a lot of str.splits or a loop to get rid of the tags.
Right now I have this to start (not working, it just copies the data):
def summarizeData(fileName):
    readFile = open(fileName, "r").read()
    newFile = input("")
    writeFile = open(newFile, "w")
    with open(fileName, "r") as file:
        for tags in file:
            Xtags = tags.split('>')[1].split('<')[0]
    writeFile.write(readFile)
    writeFile.close
So far it just copies the data, including the tags. I figured splitting the tags would do the trick, but it seems like it doesn't do anything. Would it be possible to do manually, or do I have to use xml.etree?
The reason you don't see any changes is that you're just writing the data you read from fileName into readFile in this line:
readFile = open(fileName, "r").read()
... straight back to writeFile in this line:
writeFile.write(readFile)
Nothing you do in between (with Xtags etc.) has any effect on readFile, so it gets written back as-is.
Apart from that issue, which you could fix with a little work ... parsing XML is nowhere near as straightforward as you think it is. You have to think about tags which span multiple lines, angle brackets that can appear inside attribute values, comments and CDATA sections, and a host of other subtle issues.
In summary: use a real XML parser like xml.etree.
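As a rough illustration of the parser route, here is a minimal sketch that drops the tags and keeps only the text. The helper name `extract_text` and the sample document are assumptions for illustration; the asker's actual XML is not shown.

```python
import xml.etree.ElementTree as ET


def extract_text(xml_string):
    """Return the non-empty text pieces of an XML document, in document order."""
    root = ET.fromstring(xml_string)
    # itertext() walks every element and yields its text and tail strings.
    pieces = (piece.strip() for piece in root.itertext())
    return [piece for piece in pieces if piece]


# Hypothetical input standing in for the contents of the XML file.
sample = "<people><person><name>Ann</name><age>30</age></person></people>"
values = extract_text(sample)
```

The resulting list can then be written to the new file, for example one value per line. The parser handles multi-line tags, attributes and comments, which is where hand-written splitting tends to break.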

Weird symbols / encoding showing up in output Python txt files

I'm having a frustrating issue outputting to text files from Python. The files appear perfectly normal when opened in a text editor, but I am uploading them into QDA Miner, a data analysis suite, and once they are uploaded, this is what the text looks like:
. 

"This problem really needs to be focused in a way that is particular to its cultural dynamics and tending in the industry,"
As you can see, many of these weird symbols show up throughout the texts. The text that my Python script parses is originally an RTF file that I convert to plain text using OS X's built-in text editor.
Is there an easy way to remove these symbols? I am parsing single 100+ MB text files and separating them into thousands of separate articles, so I need a way to batch convert them; otherwise it will be nearly impossible. I should also mention that the text in these files was originally copied from webpages.
Here is some relevant code from the script I wrote:
test1 = filedialog.askopenfile()
newFolder = ((str(test1)[25:])[:-32])
folderCreate(newFolder)
masterFileName = newFolder+"/"+"MASTER_FILE"
masterOutput = open(masterFileName,"w")
edit = test1.readlines()
for i,line in enumerate(edit):
    for j in line.split():
        if j in ["Author","Author:"]:
            try:
                outputFileName = "-".join(edit[i-2].lower().title().split())+".txt"
                output = open(newFolder+"/"+outputFileName,"w") # create file with article name # backslashed changed to front slash windows
                print("File created - ","-".join(edit[i-2].lower().title().split()))
                counter2 = counter2+1
            except:
                print("Filename error.")
                counter = counter+1
                pass
            #Count number of words in each article
            wordCount = 0
            for word in edit[i+1].split():
                wordCount+=1
            fileList.append((outputFileName,str(wordCount)))
            #Now write to file
            output.write(edit[i-2])
            output.write("\n")
            author = line
            output.write(author) # write article author
            output.write("\n")
            output.write("\n")
            content = edit[i+1]
            output.write(content) # write article content
Thanks
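The post does not say which characters QDA Miner is showing, so the following is only a sketch built on a guess: that they are invisible control, format, or line-separator characters carried over from the RTF or web source. The helper name `clean_text` is hypothetical. It keeps newlines and tabs, turns Unicode line and paragraph separators into ordinary newlines, and drops other control and format characters (including carriage returns).

```python
import unicodedata


def clean_text(text):
    """Strip control/format characters; normalise Unicode line separators."""
    cleaned = []
    for ch in text:
        category = unicodedata.category(ch)
        if ch in "\n\t":
            cleaned.append(ch)       # keep ordinary newlines and tabs
        elif category in ("Zl", "Zp"):
            cleaned.append("\n")     # U+2028 / U+2029 become plain newlines
        elif category in ("Cc", "Cf"):
            continue                 # drop other control and format characters
        else:
            cleaned.append(ch)
    return "".join(cleaned)


# Made-up sample containing a line separator, a vertical tab and a BOM.
sample = "one\u2028two\x0bthree\ufeff\n"
result = clean_text(sample)
```

Applied to each string just before `output.write(...)`, this would clean every article during the batch run. Printing `repr()` of an affected line first would show which characters are actually present.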

Pulling data out of MS Word with pywin32

I am running Python 3.3 on Windows and I need to pull strings out of Word documents. I have been searching far and wide for about a week for the best method to do this. Originally I tried to save the .docx files as .txt and parse through them using REs, but I had some formatting problems with hidden characters - I was using a script to open a .docx and save it as .txt. I am wondering whether a proper File>SaveAs>.txt would strip out the odd formatting so that I could then parse through properly. I don't know, but I gave up on this method.
I tried to use the docx module, but I've been told it is not compatible with Python 3.3. So I am left with using pywin32 and COM. I have used this successfully with Excel to get the data I need, but I am having trouble with Word because there is FAR less documentation, and reading through the object model on Microsoft's website is over my head.
Here is what I have so far to open the document(s):
import win32com.client as win32
import glob, os
word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = True
for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)
So at this point I can do something like
print(doc.Content.Text)
And see the contents of the files, but it still looks like there is some odd formatting in there and I have no idea how to actually parse through to grab the data I need. I can create RE's that will successfully find the strings that I'm looking for, I just don't know how to implement them into the program using the COM.
The code I have so far was mostly found through Google. I don't even think this is that hard, it's just that reading through the object model on Microsoft's website is like reading a foreign language. Any help is MUCH appreciated. Thank you.
Edit: code I was using to save the files from docx to txt:
for path, dirs, files in os.walk(r'mypath'):
    for doc in [os.path.abspath(os.path.join(path, filename)) for filename in files if fnmatch.fnmatch(filename, '*.docx')]:
        print("processing %s" % doc)
        wordapp.Documents.Open(doc)
        docastxt = doc.rstrip('docx') + 'txt'
        wordapp.ActiveDocument.SaveAs(docastxt,FileFormat=win32com.client.constants.wdFormatText)
        wordapp.ActiveDocument.Close()
If you don't want to learn the complicated way Word models documents, and then how that's exposed through the Office object model, a much simpler solution is to have Word save a plain-text copy of the file.
There are a lot of options here. Use tempfile to create temporary text files and then delete them, or store permanent ones alongside the doc files for later re-use? Use Unicode (which, in Microsoft speak, means UTF-16-LE with a BOM) or encoded text? And so on. So, I'll just pick something reasonable, and you can look at the Document.SaveAs, WdSaveFormat, etc. docs to modify it.
wdFormatUnicodeText = 7
for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)
    # Same path as the .docx, with a .txt extension
    txtpath = os.path.splitext(infile)[0] + '.txt'
    doc.SaveAs(txtpath, wdFormatUnicodeText)
    doc.Close()
    # Word's "Unicode text" is UTF-16 with a BOM
    with open(txtpath, encoding='utf-16') as f:
        process_the_file(f)
As noted in your comments, what this does to complex things like tables, multi-column text, etc. may not be exactly what you want. In that case, you might want to consider saving as, e.g., wdFormatFilteredHTML, which Python has nice parsers for. (It's a lot easier to BeautifulSoup a table than to win32com-Word it.)
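Once the text is in a plain file, the regular expressions can be applied with no COM involved. Below is a minimal sketch of what the `process_the_file` placeholder might look like; the pattern and the sample text are invented for illustration, and `io.StringIO` stands in for the file opened with `encoding='utf-16'`.

```python
import io
import re


def process_the_file(f, pattern=r"Invoice No\. (\d+)"):
    """Read the whole text file and return every match of `pattern`."""
    text = f.read()
    return re.findall(pattern, text)


# Stand-in for: open(txtpath, encoding='utf-16')
fake_file = io.StringIO("Invoice No. 1001\nSome paragraph\nInvoice No. 1002\n")
found = process_the_file(fake_file)
```

Because the pattern is a parameter, the same function can be reused for each expression that needs to be pulled out of the documents.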
oodocx is my fork of the python-docx module that is fully compatible with Python 3.3. You can use the replace method to do regular expression searches. Your code would look something like:
from oodocx import oodocx
d = oodocx.Docx('myfile.docx')
d.replace('searchstring', 'replacestring')
d.save('mynewfile.docx')
If you just want to remove strings, you can pass an empty string to the "replace" parameter.
