How To Extract Data From PDF In Python Using PDFrw


I am trying to use PDFrw to get data from a certain PDF (let's say the one at the top right of the page HERE). I have looked through the documentation they provide (I couldn't find much) and at the example code they posted on git, but I can't gather enough information to do what I want. How would I write a simple program that opens the PDF using PDFrw (or another library, if there is a better one) and extracts a certain piece of text? I was thinking about converting it to HTML. Would that be easier? Taking the PDF above as an example, I would like to get, let's say, the voltage, which in the PDF is 600 w. How would I go about doing this in the simplest way? I couldn't find any other Stack Overflow questions about this, so hopefully someone who has used it before can help!
Thanks!

I am the author of pdfrw, and it's not really designed for this. You should probably look at pdfminer.
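
A minimal sketch of the pdfminer route, assuming the maintained pdfminer.six fork is installed (its `pdfminer.high_level.extract_text` function returns the document text as one string). The helper names, the file name and the "Voltage" label are hypothetical, and the sketch only works if the label and its value end up on the same line of the extracted text, which is not guaranteed for table layouts.

```python
import re


def find_labelled_value(text, label):
    """Return whatever follows `label` on the same line, or None."""
    # Optional colon and spaces/tabs after the label, then the rest of the line.
    pattern = re.compile(re.escape(label) + r"[ \t]*:?[ \t]*(.+)", re.IGNORECASE)
    match = pattern.search(text)
    return match.group(1).strip() if match else None


def extract_labelled_value(pdf_path, label):
    """Extract all text from the PDF, then search it for the label."""
    # Imported lazily so find_labelled_value() works without pdfminer.six.
    from pdfminer.high_level import extract_text

    return find_labelled_value(extract_text(pdf_path), label)


# Hypothetical usage ("datasheet.pdf" is a placeholder path):
# print(extract_labelled_value("datasheet.pdf", "Voltage"))
```

If the value is not on the same line as its label, printing the raw output of `extract_text` first shows what pattern to search for.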

Related

How to convert fb2 file in to txt using Python

Hello, I am struggling to convert hundreds of fb2 files to txt using Python. I found pyandoc and EbookLib, but I couldn't find this option in their functionality, or perhaps I didn't search carefully enough.
Can someone suggest something relevant in my case? Maybe a free API, but I think there could be a library.

something relevant in my case
I searched PyPI for fb2 and FictionBook2 and found two packages that may be useful to you: catpandoc and FB2. The first can cat multiple documents to the terminal and supports numerous file formats. The second is a Python package for working with FictionBook2. Its example shows how to create an FB2 file but not how to read one; I do not know whether that means the documentation is poor or that it has no read support at all.
EDIT: After some research I found that FictionBook2 files are XML files. An example can be seen here. That said, I encourage you to try existing FB2 tools first and to implement extraction by XML parsing only if they fail to deliver the desired result.
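
If the existing tools do fall short, the XML-parsing fallback could look roughly like the sketch below. It uses only the standard library and assumes the text you want sits in `<p>` elements under `<body>`, which matches the FB2 files I have seen but is not checked against the schema. The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def fb2_to_text(fb2_xml):
    """Return the paragraphs inside every <body> as plain text."""
    root = ET.fromstring(fb2_xml)  # accepts str or bytes
    paragraphs = []
    for body in root.iter():
        if local_name(body.tag) != "body":
            continue
        for element in body.iter():
            if local_name(element.tag) == "p":
                # itertext() also collects text nested in inline tags.
                text = "".join(element.itertext()).strip()
                if text:
                    paragraphs.append(text)
    return "\n\n".join(paragraphs)


# Hypothetical batch usage:
# from pathlib import Path
# for path in Path("books").glob("*.fb2"):
#     path.with_suffix(".txt").write_text(
#         fb2_to_text(path.read_bytes()), encoding="utf-8")
```

Reading the file as bytes lets the parser honour the encoding declared in the XML header, which matters because older FB2 files are not always UTF-8.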

What's the appropriate template system to export an HTML output as pdf or odt?

I'm a Python newbie trying to build my first app with Google App Engine (and Python). It saves time when you need to write a form contract, such as a house rental or car rental contract (link: http://contractpy.appspot.com).
I'd like to know what is the best and simplest way to export the final output (at the moment, an HTML page) to a pdf, odt or googledoc file. In other words: instead of copying and pasting, the user gets what he wants: the contract file ready to print.
This is the current state of the output (a sample):
http://contractpy.appspot.com/your_contract?resident=Nicola%20Tesla&nacionality=Serbian-American&SSN=1234567&driverLicense=0000001&email=&witness=Carl%20Sagan&owner=House%20owner&contractType=House%20Rental%20Contract&city=Smiljan
But I'd like to get something like this:
https://docs.google.com/open?id=0B8TXLR_e14aCeVVfZVZGdUVNUEE
How could I get this?
Thanks in advance for any help!!

You need to render an HTML file with some paged media rules, for example:
@page {
size: 21cm 29.7cm;
margin: 1.5cm;
}
Then convert it with the Conversion API and serve the output file.

Take a look at the Conversion API.

I had the same type of issue a while ago and couldn't find a solution to build the report on my own.
After some research, I went for Docmosis. They offer the report generation as a SAAS solution. It took me about one hour to generate custom reports. Their documentation is very good and their support has been great so far. It can generate .doc and .pdf documents.
The only problem I see is that their standard documentation is for Java. However their SAAS solution talks through JSON or XML. If you are able to generate JSON or XML from Python, I guess it will be easy for you to generate the documents you are looking for.
Cheers,
Hugues

In my experience with browser-engine-based conversion tools, the resulting documents are overly complicated and usually several MB in size.
I have never used pandoc, but it might be what you are looking for. My preferred tools for producing PDFs have been reportlab, and sphinx for documentation. For odt you might want to look into the OpenOffice docs.

text extraction project - best tool for extracting only specific rows / items out of a PDF?

I'm working on a project that is going to extract specified text from a pdf document. I have no experience with this type of extraction. One issue is that we don't just want a dump of all the text in the document. Rather, is there a way to extract only certain fields in the pdf? Is there a notion of pdf templates that could be used for something like this?
I'm trying to use Apple's Automator. It is able to get all the text but not specified text. Ideally, I would like someone in Pages to have, for example, 30 discrete rows of text, with 20 of those rows marked as 'catalog item', and have our Automator script take ONLY those twenty lines.
Any ideas on best workflow / extraction tools for this? I would prefer only consumer level items be used such as Apple Pages, Automator, and ruby or python as a scripting language.
thx
edit #1
looks like tagged pdf's might be one way to do this - not sure how well supported on Apple Pages this is

With python, the best choice would probably be PDFMiner. It can extract the coordinates for every text string, so you can work out the rectangles in your form on your own and pick out what falls within them. It's all pretty low level, but PDF is unfortunately a pretty low level format.
Be warned that unless you already know a lot about the structure of PDF, you'll find the API and documentation rather scanty. Look around for usage examples, including here on SO.
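
To make the rectangle idea concrete, here is a rough sketch assuming pdfminer.six, whose `extract_pages` yields page layouts and whose `LTTextContainer` elements expose `bbox` and `get_text()`. The helper names and the region numbers are hypothetical. Coordinates are in PDF points with the origin at the bottom-left of the page.

```python
def inside(bbox, region):
    """True if bbox (x0, y0, x1, y1) lies entirely within region."""
    x0, y0, x1, y1 = bbox
    rx0, ry0, rx1, ry1 = region
    return rx0 <= x0 and ry0 <= y0 and x1 <= rx1 and y1 <= ry1


def select_text(boxes, region):
    """Keep the text of (bbox, text) pairs whose bbox falls inside region."""
    return [text.strip() for bbox, text in boxes if inside(bbox, region)]


def text_boxes(pdf_path, page_index=0):
    """Collect (bbox, text) pairs for one page using pdfminer.six."""
    # Imported lazily so the helpers above work without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for index, page in enumerate(extract_pages(pdf_path)):
        if index == page_index:
            return [(element.bbox, element.get_text())
                    for element in page
                    if isinstance(element, LTTextContainer)]
    return []


# Hypothetical usage: the region must be measured from your own form.
# print(select_text(text_boxes("catalog.pdf"), (0, 650, 300, 800)))
```

Printing every `(bbox, text)` pair once is usually the quickest way to find the rectangle that encloses the rows you care about.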

For Ruby you might try pdf-reader for parsing a PDF and accessing both metadata and content. Extracting the specific items you're interested in is another story; how to go about it depends heavily on what format of data you're expecting.

You can use Origami in Ruby, a framework designed to parse, analyze,
and forge PDF documents, or the Python equivalent: Origapy, a simple Python
interface for the Ruby based Origami.

want to add url links to .csv datafeed using python

I've looked through the current related questions but have not managed to find anything similar to my needs.
I'm in the process of creating an affiliate store using zencart. One of the issues is that zencart is not designed for redirects and affiliate stores, but it can be done. I will be changing the store so it acts like a showcase store showing prices.
There is a mod called easy populate which allows me to upload datafeeds. This is all well and good, but my affiliate link will not be in each product. I can add it manually after uploading the data feed by going to each product and adding it as an image with a redirect link. However, with over 500 items that is going to be a long, repetitive and time-consuming job.
I have been told that I can add the links to the data feed before uploading it to zencart, and that this should be done using Python. I've been reading about Python for several days now and feel I'm looking for the wrong things. I was wondering if someone could please advise the simplest way for me to get this done.
I hope the question makes sense
thanks
abs

You could craft a Python script using the csv module like this (on Python 3, open the file in text mode with newline=''):
>>> import csv
>>> cartWriter = csv.writer(open('yourcart.csv', 'w', newline=''))
>>> cartWriter.writerow(['Product', 'yourinfo', 'yourlink'])
You need to know how the link should be formatted; hopefully it can be composed from the other fields present in the CSV file.
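
As a fuller sketch of the same idea, the function below copies an existing feed and appends one link column built from another field. The column names (`model`, `affiliate_link`) and the URL template are placeholders, since the real ones depend on your feed and your affiliate program.

```python
import csv
import io


def add_link_column(src, dst, id_column, link_template,
                    link_column="affiliate_link"):
    """Copy CSV rows from src to dst, appending a link built per row."""
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames or []) + [link_column]
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        # The template's {id} placeholder is filled from the chosen column.
        row[link_column] = link_template.format(id=row[id_column])
        writer.writerow(row)


# In-memory demonstration; real files would be opened with newline=''.
source = io.StringIO("model,price\nA1,9.99\n")
target = io.StringIO()
add_link_column(source, target, "model", "https://example.com/go?item={id}")
print(target.getvalue())
```

With real files, open both with `newline=''` as the csv module documentation recommends, and write to a new file rather than overwriting the original feed.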

First, use the csv module as systempuntoout told you. Second, you will want to change your header to:
mimetype='text/csv'
Content-Disposition = 'attachment; filename=name_of_your_file.csv'
How to do this depends very much on your website implementation. In pure Python you would probably do that with an HttpResponse object. In Django as well, but there are some shortcuts.
You can find a video demonstrating how to create CSV files with Python on showmedo. It's not free however.
Now, how to provide a link to download the CSV depends on your website. What technology is behind it: pure Python, Django, Pylons, TurboGears?
If you can't answer that question, you should ask your boss for training on your infrastructure before trying to make changes to it.

What program to write pdf including other pdf on Linux from Python?

On an Ubuntu server, I want to create pdfs which include other static pdfs. I have tried using ReportLab with pyPdf. Ideally I would use ReportLab to do the whole thing, but importing the pdfs requires their PageCatcher, which has a large recurring fee.
So I use pyPdf to merge a page created with ReportLab and my other pdfs. The problem is that even though this looks fine in Acrobat and Foxit, part of one of the pages prints garbled on a Xerox 7400 color printer. I can't figure out the issue, but would be willing to buy a more integrated solution if it existed and was reasonably priced. I thought PDF Creator Pilot was it until I saw that it was Windows only.
So is there a reasonably priced ($1K or less) solution or a different suggestion?

I have had a lot of success with the Java library iText. They have a great library of samples for pretty much anything you could think of doing with PDF files. This example is for concatenating PDF files and sounds like it would do what you need: http://itextpdf.com/examples/index.php?page=example&id=123. There is also PDFBox which is another great Java based PDF manipulation library.
I realize that you are looking for a Python-based solution, but there may not be many other options. If you are using the Jython interpreter instead of CPython, integrating iText should be trivial. If not, you could consider calling out to it as a separate process. I realize that may not be ideal for your situation, but I figured I would mention it as an option.

Another non-Python answer. If you are just merging pages, then pdftk does that well (along with a lot of other things).
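
Since the question is about driving this from Python, a thin wrapper around the pdftk command line might look like the sketch below. It assumes pdftk is installed and on PATH and uses its `cat output` form for concatenation; the function and file names are placeholders.

```python
import shutil
import subprocess


def pdftk_merge_command(inputs, output):
    """Build the argument list: pdftk in1.pdf in2.pdf cat output out.pdf"""
    return ["pdftk", *inputs, "cat", "output", output]


def merge_pdfs(inputs, output):
    """Concatenate the input PDFs into `output` by running pdftk."""
    if shutil.which("pdftk") is None:
        raise RuntimeError("pdftk was not found on PATH")
    # check=True raises CalledProcessError if pdftk exits with an error.
    subprocess.run(pdftk_merge_command(inputs, output), check=True)


# Hypothetical usage:
# merge_pdfs(["reportlab_page.pdf", "static.pdf"], "merged.pdf")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.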
