Python3 pdf parsing

I have Python 3.2 installed (a 64-bit version, in case it matters) on a Windows machine. I need to open a bunch of PDF files, find certain numbers in the text, and store them. The most work I should have to put in (and the most I'd like to) is one day.
Can I parse the pdf with plain python without too much of a hassle?
Is there a library that would achieve this easily?
If it's too complicated to do with this Python installation, I can install a different version, but that requires a lot of extra work, so other solutions are appreciated.
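A minimal sketch of one way this could look, assuming a current Python 3 and the third-party pypdf package (neither is mentioned in the question, and pypdf will not run on Python 3.2). The helper names and the number pattern are hypothetical; the pattern only covers plain integers and decimals and would need adjusting for the actual documents.

```python
import re

# Assumption: the numbers of interest look like 42, 19.99 or -3.5.
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def find_numbers(text):
    """Return every integer or decimal found in the text, as strings."""
    return NUMBER_RE.findall(text)


def pdf_text(path):
    """Extract the text of all pages of a PDF using pypdf (third-party)."""
    # Imported here so the regex helper above works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # extract_text() may return an empty result for image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def numbers_in_pdf(path):
    """Convenience wrapper: all numbers found in one PDF file."""
    return find_numbers(pdf_text(path))
```

Text extraction quality depends heavily on how the PDF was produced; scanned documents contain no text layer and would need OCR instead.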

Related

looking for pure python package to create images of websites

I've previously used http://code.google.com/p/wkhtmltopdf/ with http://pypi.python.org/pypi/wkhtmltopdf/0.2 to create screenshots of websites from the command line. However, I was wondering whether a pure Python package exists that can do the same. Currently I always need to download the correct wkhtmltopdf binary if I switch computers. A pure Python package would relieve me of this. Any ideas?

That would require a browser engine written in pure Python. This means you need a CSS processor and, more importantly, a complete JavaScript engine written in Python. While this is undoubtedly possible, I'm pretty sure nobody has done it.

Easier way to access the OSX defaults system through Python and Ruby?

Recently I have become a fan of storing various settings used for my testing scripts in the OSX defaults system as it allows me to keep various scripts in git and push them to github without worrying about leaving passwords/settings/etc hardcoded into the script.
When writing a shell script using simple bash commands, it is easy enough to use backticks to call the defaults binary to read the preferences and if there is an error reading the preference, the script stops execution and you can see the error and fix it. When I try to do a similar thing in Python or Ruby it tends to be a little more annoying since you have to do additional work to check the return code of defaults to see if there is an error.
I have been attempting to search via google off and on for a library to use the OSX defaults system which ends up being somewhat difficult when "defaults" is part of your query string.
I thought of trying to read the plist files directly but it seems like the plist libraries I have found (such as the built in python one) are only able to read the XML ones (not the binary ones) which is a problem if I ever set anything with the defaults program since it will convert it back to a binary plist.
Recently, while trying another search for a Python library, I changed the search terms to something like NSUserDefaults (I have now forgotten the exact term) and found a Python library called userdefaults. However, it was developed for an older version of OSX (10.2) with an older version of Python (2.3), and I have not had much luck getting it to compile on OSX 10.6 with Python 2.6.
Ideally I would like to find a library that would make it easy to read from (and as a bonus write to) the OSX defaults system in a way similar to the following Python pseudocode.
from some.library.defaults import defaults
settings = defaults('com.example.app')
print settings['setting_key']
Since I am also starting to use Ruby more, I would also like to find a Ruby library with similar functionality.
It may be that I have to eventually just 'give up' and write my own simple library around the defaults binary but I thought it wouldn't hurt to try to query others to see if there was an existing solution.

You'll want to use PyObjC: have a look at this article at mactech.com (specifically, scroll down to "Accessing plists Via Python"). And this article from oreilly on PyObjC.
Run this, for example:
from Foundation import *
standardUserDefaults = NSUserDefaults.standardUserDefaults()
persistentDomains = standardUserDefaults.persistentDomainNames()
persistentDomains.objectAtIndex_(14)
aDomain = standardUserDefaults.persistentDomainForName_(persistentDomains[14])
aDomain.keys()
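As an alternative to PyObjC, here is a minimal sketch that reads the plist file directly. It assumes Python 3.4 or later, where the standard plistlib reads binary plists as well as XML ones, and it assumes the per-user preferences live in ~/Library/Preferences/<domain>.plist. The function names are hypothetical. The file on disk may lag behind values that were just written with the defaults tool, so treat this as a read-only convenience rather than a full replacement.

```python
import plistlib
from pathlib import Path


def read_plist(path):
    """Load a property list file (XML or binary) into Python objects."""
    with open(path, "rb") as handle:
        return plistlib.load(handle)


def read_defaults_domain(domain, prefs_dir=None):
    """Return the settings stored for a defaults domain as a dict.

    Assumption: the domain is backed by <prefs_dir>/<domain>.plist,
    with prefs_dir defaulting to ~/Library/Preferences.
    """
    if prefs_dir is None:
        prefs_dir = Path.home() / "Library" / "Preferences"
    return read_plist(Path(prefs_dir) / (domain + ".plist"))
```

Usage would then be close to the pseudocode in the question: settings = read_defaults_domain('com.example.app'), followed by settings['setting_key'].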

How to extract a windows cabinet file in python

Is it somehow possible to extract .cab files in python?

Not strictly answering what you asked, but if you are running on a Windows platform you could spawn a process to do it for you.
Taken from Wikipedia:
Microsoft Windows provides two command-line tools for creation and extraction of CAB files. They are MAKECAB.EXE (included within Windows packages such as 'ie501sp2.exe' and 'orktools.msi'; also available from the SDK, see below) and EXTRACT.EXE (included on the installation CD), respectively. Windows XP also provides the EXPAND.EXE command.
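Spawning EXPAND.EXE from Python could look roughly like the sketch below. It is Windows-only and assumes expand is on the PATH; the function names are hypothetical, and the -F:* argument asks expand to extract every file in the cabinet.

```python
import os
import subprocess


def build_expand_command(cab_path, dest_dir, pattern="*"):
    """Build the argument list for: expand <cab> -F:<pattern> <dest>."""
    return ["expand", cab_path, "-F:" + pattern, dest_dir]


def extract_cab(cab_path, dest_dir):
    """Extract all files from a .cab archive into dest_dir (Windows only)."""
    # expand requires the destination directory to exist already.
    os.makedirs(dest_dir, exist_ok=True)
    # check=True raises CalledProcessError if expand reports a failure.
    subprocess.run(build_expand_command(cab_path, dest_dir), check=True)
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.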

I had the same problem last week so I implemented this in python. Comments, additions and especially pull requests welcome: https://github.com/hughsie/python-cabarchive

Oddly, the msilib can only create or append to .CAB files, but not extract them. :(
However, the hachoir parser module can apparently read & edit Cabinets. (I have not used it, though, so I couldn't tell you how fitting it is or not!)

Converting a PDF to a series of images with Python

I'm attempting to use Python to convert a multi-page PDF into a series of JPEGs. I can split the PDF up into individual pages easily enough with available tools, but I haven't been able to find anything that can convert PDFs to images.
PIL does not work, as it can't read PDFs. The two options I've found are using either GhostScript or ImageMagick through the shell. This is not a viable option for me, since this program needs to be cross-platform, and I can't be sure either of those programs will be available on the machines it will be installed and used on.
Are there any Python libraries out there that can do this?

ImageMagick has Python bindings.

Here's what's worked for me using the Python ghostscript module (installed by '$ pip install ghostscript'):
import ghostscript
def pdf2jpeg(pdf_input_path, jpeg_output_path):
    args = ["pdf2jpeg",  # actual value doesn't matter
            "-dNOPAUSE",
            "-sDEVICE=jpeg",
            "-r144",
            "-sOutputFile=" + jpeg_output_path,
            pdf_input_path]
    ghostscript.Ghostscript(*args)
I also installed Ghostscript 9.18 on my computer and it probably wouldn't have worked otherwise.
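For a multi-page PDF, a variant that calls the Ghostscript executable directly might look like the sketch below. It assumes Ghostscript is installed and that its executable is named gs, gswin64c, or gswin32c; the function names are hypothetical. The %03d in the output pattern makes Ghostscript write one numbered JPEG per page, and -dBATCH makes it exit once the last page is done.

```python
import shutil
import subprocess


def find_ghostscript():
    """Return the path of a Ghostscript executable, or None if not found."""
    for name in ("gs", "gswin64c", "gswin32c"):
        path = shutil.which(name)
        if path:
            return path
    return None


def build_gs_args(gs_path, pdf_path, out_pattern, dpi=144):
    """Build the command line that renders each PDF page to a JPEG."""
    return [gs_path,
            "-dNOPAUSE", "-dBATCH",
            "-sDEVICE=jpeg",
            "-r%d" % dpi,
            "-sOutputFile=" + out_pattern,
            pdf_path]


def pdf_to_jpegs(pdf_path, out_pattern="page-%03d.jpg", dpi=144):
    """Render every page of pdf_path to files named after out_pattern."""
    gs_path = find_ghostscript()
    if gs_path is None:
        raise RuntimeError("Ghostscript executable not found on PATH")
    subprocess.run(build_gs_args(gs_path, pdf_path, out_pattern, dpi), check=True)
```

Checking for the executable up front gives a clear error on machines where Ghostscript is missing, which was the concern raised in the question.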

You can't avoid the Ghostscript dependency. Even ImageMagick relies on Ghostscript for its PDF reading functions. The reason for this is the complexity of the PDF format: a PDF doesn't just contain bitmap information, but mostly vector shapes, transparencies, etc.
Furthermore it is quite complex to figure out which of these objects appear on which page.
So the correct rendering of a PDF Page is clearly out of scope for a pure Python library.
The good news is that Ghostscript is pre-installed on many Windows and Linux systems, because it is also needed by all those PDF printers (except Adobe Acrobat).

If you're using Linux, some distributions come with a command-line utility called 'pdftopbm' out of the box. Check out netpbm.

Perhaps relevant: http://www.swftools.org/gfx_tutorial.html

Tiny python executable?

I plan to use PyInstaller to create a stand-alone Python executable. PyInstaller comes with built-in support for UPX and uses it to compress the executable, but the results are still really huge (about 2.7 MB).
Is there any way to create even smaller Python executables? For example, using a shrunken python.dll or something similar?

If you recompile pythonxy.dll, you can omit modules that you don't need. Going by size, stripping off the unicode database and the CJK codes creates the largest code reduction. This, of course, assumes that you don't need these. Remove the modules from the pythoncore project, and also remove them from PC/config.c

Using an earlier Python version will also decrease the size considerably if you really need a small file size. I don't recommend using a very old version; Python 2.3 would be the best option. I got my Python executable down to 700 KB! Also, I prefer py2exe over PyInstaller.

You can't go too low in size, because you obviously need to bundle the Python interpreter, and that alone takes a considerable amount of space.
I had the same concerns once, and there are two approaches:
Install Python on the computers you want to run on and only distribute the scripts
Install Python in the internal network on some shared drive, and rig the users' PATH to recognize where Python is located. With some installation script / program trickery, users can be completely oblivious to this, and you'll get to distribute minimal applications.
