The extractText() function does not return text - python

import PyPDF2

pdfFileObject = open('MDD.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
    page = pdfReader.getPage(i)
    print(page.extractText())
Above is my code, and when I run the script it just outputs a bunch of numbers and numerals instead of the text of the file. Could anyone help me with that?

This function doesn't work for all PDF files. This is explained in the documentation:
This works well for some PDF
files, but poorly for others, depending on the generator used. This will
be refined in the future. Do not rely on the order of text coming out of
this function, as it will change if this function is made more
sophisticated.
:return: a unicode string object.
Try your code on this file. I'm sure it should work, so it seems that the problem is not in your code.
If you really need to parse files that are created the same way as your original MDD.pdf, you have to choose another library.
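One option is pdfminer.six (my suggestion, not something named in this answer; any extractor with better glyph handling would do). A minimal sketch, assuming the package is installed via pip install pdfminer.six:

# A minimal sketch using pdfminer.six; the library choice is an assumption.
from pdfminer.high_level import extract_text

# extract_text() handles opening, decoding, and page iteration itself
text = extract_text('MDD.pdf')
print(text)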


PyPDF2 output incorrect: missing words. Coding issue?

I try to extract text from some .pdf files using PyPDF:
import PyPDF2

pdfFileObject = open(filepath, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
text = ''
for i in range(count):
    page = pdfReader.getPage(i)
    text += page.extractText()
Sometimes text is a (nearly) correct string (with some encoding issues; in some German texts, for instance, everything is OK except for the umlauts äöü), but mostly (irrespective of language) it looks just like this:
'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nÈ\n\n'
I have not recognized any pattern in which files parse (almost) correctly and which parse completely incorrectly. Some of them have tables (these are the worst, admittedly), but some of them look plain and simple.
Regrettably, PyPDF2 seems to be the only working module for me... Does this have something to do with encoding? I'd like all characters to be represented correctly anyway, although it's not really important in this case.
I'll be grateful for any suggestion.

How to store a txt file in your program and reference it

Let me preface by saying I am very new to programming. I'm creating a fun program that I can use to start my day at work. One of the things I want it to do is display a random compliment. I made a text file that has multiple lines in it. How do I store that text file then open it?
I've opened text files before that were on my desktop but I want this one to be embedded in the code so when I compile the program I can take it to any computer.
I've googled a ton of different keywords and keep finding the basics of opening and reading txt files, but that's not exactly what I need.
Perhaps start by defining a default path to your file; this makes it easier to change the path when moving to another computer. Next, define a function in your program to read and return the contents of the file:
FILE_PATH = "my/path/to/file/"

def read_file(file_name):
    with open(FILE_PATH + file_name) as f:
        return f.read()
With that in place, you can use this function to read, modify, or display the file contents, for example to edit something from your file:
def edit_comments():
    text = read_file("daily_comments.txt")
    text = text.replace("foo", "foo2")
    return text
There are obviously many ways to approach this task; this is just a simple example to get you started.
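Since the goal is a random compliment, here is a minimal sketch building on read_file above (the filename compliments.txt is hypothetical):

import random

def random_compliment():
    # Split the stored file into lines and pick one at random;
    # "compliments.txt" is a hypothetical filename.
    lines = read_file("compliments.txt").splitlines()
    return random.choice(lines)

print(random_compliment())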

Pulling data out of MS Word with pywin32

I am running Python 3.3 on Windows and I need to pull strings out of Word documents. I have been searching far and wide for about a week for the best method to do this. Originally I tried to save the .docx files as .txt and parse through them using REs, but I had some formatting problems with hidden characters (I was using a script to open a .docx and save it as .txt). I am wondering: if I did a proper File > Save As > .txt, would it strip out the odd formatting so that I could properly parse through? I don't know, but I gave up on this method.
I tried to use the docx module, but I've been told it is not compatible with Python 3.3. So I am left with using pywin32 and COM. I have used this successfully with Excel to get the data I need, but I am having trouble with Word because there is FAR less documentation, and reading through the object model on Microsoft's website is over my head.
Here is what I have so far to open the document(s):
import win32com.client as win32
import glob, os

word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = True

for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)
So at this point I can do something like
print(doc.Content.Text)
And see the contents of the files, but it still looks like there is some odd formatting in there, and I have no idea how to actually parse through it to grab the data I need. I can create REs that will successfully find the strings I'm looking for; I just don't know how to implement them in the program using COM.
The code I have so far was mostly found through Google. I don't even think this is that hard; it's just that reading through the object model on Microsoft's website is like reading a foreign language. Any help is MUCH appreciated. Thank you.
Edit: code I was using to save the files from docx to txt:
import os, fnmatch
import win32com.client

wordapp = win32com.client.gencache.EnsureDispatch('Word.Application')

for path, dirs, files in os.walk(r'mypath'):
    for doc in [os.path.abspath(os.path.join(path, filename))
                for filename in files if fnmatch.fnmatch(filename, '*.docx')]:
        print("processing %s" % doc)
        wordapp.Documents.Open(doc)
        # splitext is safer than rstrip for swapping the extension
        docastxt = os.path.splitext(doc)[0] + '.txt'
        wordapp.ActiveDocument.SaveAs(docastxt, FileFormat=win32com.client.constants.wdFormatText)
        wordapp.ActiveDocument.Close()
If you don't want to learn the complicated way Word models documents, and then how that's exposed through the Office object model, a much simpler solution is to have Word save a plain-text copy of the file.
There are a lot of options here. Use tempfile to create temporary text files and then delete them, or store permanent ones alongside the doc files for later re-use? Use Unicode (which, in Microsoft speak, means UTF-16-LE with a BOM) or encoded text? And so on. So, I'll just pick something reasonable, and you can look at the Document.SaveAs, WdSaveFormat, etc. docs to modify it.
wdFormatUnicodeText = 7

for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)
    # note: splitext on the variable infile, not on the string 'infile'
    txtpath = os.path.splitext(infile)[0] + '.txt'
    doc.SaveAs(txtpath, wdFormatUnicodeText)
    doc.Close()
    with open(txtpath, encoding='utf-16') as f:
        process_the_file(f)
As noted in your comments, what this does to complex things like tables, multi-column text, etc. may not be exactly what you want. In that case, you might want to consider saving as, e.g., wdFormatFilteredHTML, which Python has nice parsers for. (It's a lot easier to BeautifulSoup a table than to win32com-Word it.)
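For example, something along these lines; a sketch only, where the output path is hypothetical and the document is assumed to have been saved with wdFormatFilteredHTML (value 10 in the Word object model) in the loop above:

from bs4 import BeautifulSoup

with open('mypath/somedoc.html') as f:  # hypothetical path to the saved HTML
    soup = BeautifulSoup(f, 'html.parser')

# Pull the text of every table cell, row by row
for table in soup.find_all('table'):
    for row in table.find_all('tr'):
        print([cell.get_text(strip=True) for cell in row.find_all(['td', 'th'])])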
oodocx is my fork of the python-docx module that is fully compatible with Python 3.3. You can use the replace method to do regular expression searches. Your code would look something like:
from oodocx import oodocx
d = oodocx.Docx('myfile.docx')
d.replace('searchstring', 'replacestring')
d.save('mynewfile.docx')
If you just want to remove strings, you can pass an empty string to the "replace" parameter.
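For example, using the same replace method shown above:

d.replace('unwantedstring', '')  # remove every occurrence of the string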

How to split large wikipedia dump .xml.bz2 files in Python?

I am trying to build an offline Wiktionary from the Wikimedia dump files (.xml.bz2) using Python. I started with this article as the guide. It involves a number of languages; I wanted to combine all the steps into a single Python project. I have found almost all the libraries required for the process. The only hump now is to effectively split the large .xml.bz2 file into a number of smaller files for quicker parsing during search operations.
I know that the bz2 library exists in Python, but it provides only compress and decompress operations. I need something that does what bz2recover does from the command line, i.e. split large files into a number of smaller chunks.
One more important point: the splitting shouldn't cut across page contents, which start with <page> and end with </page> in the XML document that has been compressed.
Is there an existing library which could handle this situation, or does the code have to be written from scratch? (Any outline/pseudocode would be greatly helpful.)
Note: I would like to make the resulting package cross-platform compatible, hence I couldn't use OS-specific commands.
In the end I wrote a Python script myself:
import os
import bz2

def split_xml(filename):
    '''Gets the filename of a wiktionary .xml.bz2 dump as input and
    creates smaller chunks of it in the directory "chunks".
    '''
    # Check for and create the chunk directory
    if not os.path.exists("chunks"):
        os.mkdir("chunks")
    # Counters
    pagecount = 0
    filecount = 1
    # Open the first chunk file in write mode
    chunkname = lambda filecount: os.path.join("chunks", "chunk-" + str(filecount) + ".xml.bz2")
    chunkfile = bz2.BZ2File(chunkname(filecount), 'w')
    # Read line by line (BZ2File yields bytes in Python 3)
    bzfile = bz2.BZ2File(filename)
    for line in bzfile:
        chunkfile.write(line)
        # </page> marks the end of a wiki page
        if b'</page>' in line:
            pagecount += 1
            if pagecount > 1999:
                chunkfile.close()
                pagecount = 0   # reset the page counter
                filecount += 1  # move on to the next chunk file
                chunkfile = bz2.BZ2File(chunkname(filecount), 'w')
    try:
        chunkfile.close()
    except ValueError:
        print('File already closed')

if __name__ == '__main__':
    # When the script is run directly
    split_xml('wiki-files/tawiktionary-20110518-pages-articles.xml.bz2')
Well, if you have a command-line tool that offers the functionality you are after, you can always wrap it in a call using the subprocess module.
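A minimal sketch of that (assuming the recovery tool the question names is on PATH as bz2recover, which is exactly the OS dependence the question wants to avoid; note also that it splits at bzip2 block boundaries, not at <page> boundaries):

import subprocess

# bz2recover extracts each compressed block into its own rec*.bz2 file
result = subprocess.run(
    ['bz2recover', 'tawiktionary-20110518-pages-articles.xml.bz2'],
    capture_output=True, text=True, check=True)
print(result.stdout)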
The method you are referencing is quite a dirty hack :)
I wrote an offline Wikipedia tool and just SAX-parsed the dump completely. The throughput is usable if you just pipe the uncompressed XML into stdin from a proper bzip2 decompressor. Especially if it's only the Wiktionary.
As a simple way of testing, I just compressed every page, wrote it into one big file, and saved the offset and length in a cdb (a small key-value store). This may be a valid solution for you.
Keep in mind, the MediaWiki markup is the most horrible piece of sh*t I've come across in a long time. But in the case of the Wiktionary it might be feasible to handle.
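A minimal sketch of that streaming SAX approach (the dump filename is taken from the question; printing page titles stands in for whatever per-page handling you need):

import bz2
import xml.sax

class PageHandler(xml.sax.ContentHandler):
    '''Collects the text of each <title> element and prints it.'''
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = []

    def startElement(self, name, attrs):
        if name == 'title':
            self.in_title = True
            self.title = []

    def characters(self, content):
        if self.in_title:
            self.title.append(content)

    def endElement(self, name):
        if name == 'title':
            self.in_title = False
            print(''.join(self.title))

# Stream straight out of the compressed dump; nothing is held in memory
with bz2.open('wiki-files/tawiktionary-20110518-pages-articles.xml.bz2') as f:
    xml.sax.parse(f, PageHandler())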

Instantiate a Python class from a class available as a string, only in memory!

I'm using Reportlab to create PDFs. I'm creating two PDFs which I want to merge after I've created them. Reportlab provides a way to save a pycanvas (source) (which is basically my PDF file in memory) as a Python file; calling the method doIt(filename) in that Python file will recreate the PDF. This is great, since you can combine two PDFs on a source-code basis and create one merged PDF.
This is done like this:
from reportlab.pdfgen import canvas, pycanvas

# create your canvas (buffer and PAGESIZE are defined elsewhere)
p = pycanvas.Canvas(buffer, pagesize=PAGESIZE)

# ...build your pdf...

# after that, close the PDF object cleanly
p.showPage()
p.save()

# now create the string equivalent of your canvas
source_code_equiv = str(p)
source_code_equiv2 = str(p)

# merge the two files on a string basis
# (exactly how is not shown, to keep this easy to read; one just has to take
# the middle part of source_code_equiv2 and add it into source_code_equiv)
final_pdf = source_code_equiv_part1 + source_code_equiv2_center_part + source_code_equiv_part2

# write the source-code equivalent of the pdf
open("n2.py", "w").write(final_pdf)
from myproject import n2
p = n2.doIt(buffer)

# Get the value of the StringIO buffer and write it to the response.
pdf = buffer.getvalue()
buffer.close()
response.write(pdf)
return response
This works fine, but I want to skip the step where I save n2.py to disk. Thus I'm looking for a way to instantiate, from the final_pdf string, the corresponding Python module and use it directly in the source. Is this possible?
It should work somehow like this:
n2 = instantiate_python_class_from_source(final_pdf)
p = n2.doIt(buffer)
The reason for this is mainly that there is not really a need to save the source to disk, and secondly that it is absolutely not thread-safe. I could name the created file at run time, but then I would not know what to import!? If there is no way to prevent saving the file, is there a way to define the import based on the name of the file, which is defined at runtime!?
One might ask why I do not create one PDF in advance, but this is not possible, since they are coming from different applications.
This seems like a really long way around to what you want. Doesn't Reportlab have a Canvas class from which you can pull the PDF document? I don't see why generated Python source code should be involved here.
But if for some reason it is necessary, then you can use StringIO to "write" the source to a string, then exec to execute it:
from cStringIO import StringIO

source_code = StringIO()
source_code.write(final_pdf)
# exec needs the string contents, not the StringIO object itself
exec(source_code.getvalue())
p = doIt(buffer)
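Since final_pdf is already a string, you can also skip StringIO entirely and give exec its own namespace (a sketch; final_pdf and buffer are assumed to exist as in the question):

# Execute the generated module source in an isolated namespace,
# then call its doIt() without ever touching the filesystem.
namespace = {}
exec(final_pdf, namespace)
p = namespace['doIt'](buffer)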
OK, I guess you could use the code module, which provides the standard interpreter's interactive mode. The following would execute the function doIt:
import code

coded_data = """
def doIt():
    print("XXXXX")
"""
script = coded_data + "\ndoIt()\n"
co = code.compile_command(script, "<stdin>", "exec")
if co:
    exec(co)
Let me know if this helped.
