Export PDF with pre-filled and unfilled editable fields - python

I am trying to print out custom commercial invoices based on a known set of data while allowing for an unknown set of data. My known data includes addresses, general contact information, etc.
What I want is to be able to use a PDF template of a "commercial invoice" and have the known data auto-populated into the form where available. Then the user can download the (incomplete) PDF and fill in the empty / optional form fields using their own information - things like Tax ID, recipient care-of names, internal tracking IDs, etc.
How can I use JSON / XML + python + HTML + a PDF template to auto-fill some info and leave some info empty, on an editable PDF form?
Thanks!

You essentially want server-side filling of the form.
There are several possible approaches.
An industrial-strength approach would be to use a dedicated application that can be called via the command line (FDFMerge by Appligent comes to mind, which is very easy to integrate, as all you'd have to do is assemble the FDF data and then the command string).
Another approach is to use one of the PDF-creating libraries out there (iText, PDFlib, or Adobe's PDF Library come to mind here). In this case you have considerably more programming effort, but somewhat more flexibility.
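For the FDF route, the data file is simple enough to assemble yourself. Below is a minimal sketch, assuming a hypothetical template whose fields are named sender_name and sender_address; any field you leave out of the dict stays empty and editable for the user. (Real code should also escape the characters (, ) and \ inside values.)

```python
# Minimal sketch: build an FDF file that pre-fills the known fields.
def make_fdf(fields):
    entries = "\n".join(
        "<< /T ({0}) /V ({1}) >>".format(name, value)
        for name, value in fields.items()
    )
    return (
        "%FDF-1.2\n"
        "1 0 obj\n"
        "<< /FDF << /Fields [\n" + entries + "\n] >> >>\n"
        "endobj\n"
        "trailer\n"
        "<< /Root 1 0 R >>\n"
        "%%EOF\n"
    )

# Hypothetical field names -- use the ones defined in your PDF template.
known_data = {"sender_name": "ACME Corp", "sender_address": "1 Main St"}
with open("invoice.fdf", "w") as fdf_file:
    fdf_file.write(make_fdf(known_data))
# Then hand invoice.fdf plus your template to the merge tool's command line.
```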

Related

Best method to extract information from unstructured text

My aim is to extract information from old scanned reports and store it in a structured database. I have already extracted the text from these reports using Solr.
All of these are scientific reports, and each has a different structure in terms of content, but they all contain similar information. I want to create a structured database from these reports, with fields such as the name of the company involved in the report, the name of the software involved, the location, the date of the experiment, etc. For each of these fields I have some keywords to be used for extraction; for example, for the Location field: Location, Place of experiment, Place, Facility, etc. What would be the best way to proceed?
Also, in some of these files there are no sentences to process. The information is given in a form-like structure, for example:
Location: Canada
Date of the experiment: 1985-05-01.
Which techniques would be best for extracting the information? Also, which software or libraries should I use?
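One lightweight starting point for the form-like files is plain regular expressions driven by your keyword lists. A minimal sketch (the keywords and field names are just the examples from the question; longer keywords are listed first so they match before their prefixes):

```python
import re

# Keyword lists per target field, taken from the question; extend as needed.
FIELD_KEYWORDS = {
    "location": ["Place of experiment", "Location", "Place", "Facility"],
    "date": ["Date of the experiment", "Date"],
}

def extract_fields(text):
    results = {}
    for field, keywords in FIELD_KEYWORDS.items():
        for kw in keywords:
            # Match "Keyword: value" up to the end of the line.
            m = re.search(re.escape(kw) + r"\s*:\s*(.+)", text, re.IGNORECASE)
            if m:
                results[field] = m.group(1).strip().rstrip(".")
                break
    return results

sample = "Location: Canada\nDate of the experiment: 1985-05-01."
print(extract_fields(sample))  # {'location': 'Canada', 'date': '1985-05-01'}
```

For the files that do contain running sentences, a named-entity recognizer (NLTK or spaCy, for instance) is the usual next step.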

CKAN - Different datasets

I am starting to get involved with CKAN. So far I have done some of the tutorials, and currently I am installing some of the available extensions.
Does anybody know if there is any other extension for customizing dataset metadata fields according to differences between data sources?
For example:
Uploading text files or documents like PDFs: I want only 5 specific metadata fields to be requested.
Uploading CSV files with coordinate fields (georeferenced): I want 10 metadata fields to be requested, and these could be different fields than the PDF ones.
In fact, I would like to add a new page where the user could first specify the type of the data source, and then the application could request only those fields that are necessary.
I have seen in the tutorial how to customize a schema with some extra metadata fields, but I don't know how to work with different metadata schemas. This extension could also be useful for customizing dataset fields.
But does someone have any idea how to have different schemas depending on the type of a dataset?
Thanks for helping me :)
Jordi.
I think with the ckan-scheming extension you get everything you want.
As you can see in their documentation, you can specify different schemas according to your needs:
Camel
Standard dataset
Feel free to create your own customized schema, with exactly the fields that you need.
Once you have your schema (in fact you will want to create two different ones, one for the text files and one for the georeferenced CSVs), you can simply use the generated form to enter those specific types of datasets.
The important bit here is that you specify a new type of dataset in the schema, e.g. {"dataset_type": "my-custom-text-dataset"}. If everything is configured as it should be, you can find and add your datasets at: http://my-ckan-instance.com/my-custom-text-dataset
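For reference, a scheming schema is just a YAML or JSON file. A hypothetical sketch for the text/PDF dataset type (everything beyond the standard title/url fields is made up for illustration):

```json
{
  "scheming_version": 1,
  "dataset_type": "my-custom-text-dataset",
  "about": "Hypothetical schema for text/PDF uploads",
  "dataset_fields": [
    {"field_name": "title", "label": "Title", "preset": "title"},
    {"field_name": "document_language", "label": "Document language"}
  ],
  "resource_fields": [
    {"field_name": "url", "label": "URL", "preset": "resource_url_upload"}
  ]
}
```

A second schema with its own dataset_type (and the ten georeferenced fields) covers the CSV case.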

How do I access PDF form fields with python

I need to automatically save PDF form fields to a database and later write some of them to new forms I am sending out. I can save the fields no problem, but I don't know how to write to a PDF form field. I am using PDFMiner, but I can't find anything in it to do this.
Can anyone point me in the direction of a solution?
ReportLab has an open source PDF toolkit that lets you write PDFs, including form fields: http://www.reportlab.com. They also have a commercial product that reads PDFs, but I've only used the open source version.
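If you go the ReportLab route, the open source toolkit can create fillable text fields through the canvas's AcroForm support. A minimal sketch (the field name, label, and coordinates are arbitrary):

```python
# Sketch: create a PDF containing one fillable text field with ReportLab.
from reportlab.pdfgen import canvas

c = canvas.Canvas("form_out.pdf")
c.drawString(100, 725, "Tax ID:")  # static label next to the field
c.acroForm.textfield(
    name="tax_id", tooltip="Tax ID",
    x=100, y=700, width=200, height=20,
    borderStyle="inset", forceBorder=True,
)
c.save()
```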
I've never used it, but people seem to like PyPDF
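For what it's worth, the modern descendant of that project, pypdf, can both read and fill form fields. A minimal sketch ("tax_id" is a hypothetical field name; check get_fields() for the names your form actually uses):

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("form.pdf")
print(reader.get_fields())  # {field name: field object, ...} -- the "save" side

writer = PdfWriter()
writer.append(reader)  # copies the pages and the AcroForm into the writer
writer.update_page_form_field_values(  # the "write" side
    writer.pages[0], {"tax_id": "12-3456789"}
)
with open("filled.pdf", "wb") as out:
    writer.write(out)
```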

RSS aggregation packages

We are looking to add a news/articles section to an existing site, which will be powered by aggregating content via RSS feeds. The requirements are:
Be able to aggregate lots of feeds. Initially we will start with a small number, and eventually we may be aggregating a few hundred of them.
We don't want to display the whole post on our site. We will display a summary or short description, and when a user clicks "read more", they will be taken to the original post on the external site.
We would like to grab the image(s) related to a post and display them as small thumbnails alongside the post on our site.
Create an automated tag cloud out of all the aggregated content.
Categorize aggregated content using a category/sub-category structure.
The aggregation piece should perform well.
Our web app is built using Django, so I am looking into selecting one of the following packages. Based on our requirements, which package would you recommend?
django-planet
django-news
planetplanet
feedjack
If you have a good idea of what you want, why not just try them all? And if you have pretty strict requirements, write it yourself: roll your own aggregator with feedparser.
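A minimal aggregation loop with feedparser, covering the summary-plus-link requirement (the feed URL is a placeholder and the 200-character cut-off is arbitrary):

```python
import feedparser

feed = feedparser.parse("http://example.com/rss.xml")
for entry in feed.entries:
    title = entry.get("title", "")
    link = entry.get("link", "")              # the "read more" target
    summary = entry.get("summary", "")[:200]  # short description only
    print(title, link, summary, sep="\n")
```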

Efficient storage of and access to web pages with Python

So like many people I want a way to download, index/extract information from, and store web pages efficiently. My first thought is to use MySQL: simply shove the pages in, which would let me use FULLTEXT searches and do ad hoc queries easily (in case I want to see if something exists and extract it, etc.). But of course, performance-wise I have some concerns, especially with large objects/pages and high volumes of data. So that leads me to look at things like CouchDB, search engines, etc. To summarize, my basic requirements are:
It must be Python compatible (libraries/etc.)
Store metadata (URL, time retrieved, any GET/POST data I sent, response code, etc.) for each page I request.
Store a copy of the original web page as sent by the server (might be content, might be a 404 search page, etc.).
Extract information from the web page and store it in a database.
Have the ability to do ad hoc queries on the existing corpus of original web pages (for example, for a new type of information I want to extract, or to see how many of the pages contain the string "fizzbuzz" or whatever).
And of course it must be open source/Linux compatible, I have no interest in something I can't modify or fiddle with.
So I'm thinking several broad options are:
Toss everything into MySQL, use FULLTEXT, go nuts, shard the content if needed.
Toss metadata into MySQL, store the data on the file system or in something like CouchDB, and write some custom search stuff.
Toss metadata into MySQL, store the data on a file system served by a web server (maybe /YYYY/MM/DD/HH/MM/SS/URL/), make sure there is no default index.html etc. specified (in other words, directory-index each directory), and use a search engine like Lucene or Sphinx to index the content and use that to search (a sketch of this path scheme follows this list). The biggest downside I see here is the inefficiency of repeatedly crawling the site.
Other solutions?
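To make option 3 concrete, here is roughly the path scheme I have in mind (sketch only; the layout string is the one suggested above):

```python
from datetime import datetime, timezone
from urllib.parse import quote

def page_path(url, retrieved_at=None):
    # /YYYY/MM/DD/HH/MM/SS/URL/ -- percent-encode the URL so it is safe
    # as a single path component.
    retrieved_at = retrieved_at or datetime.now(timezone.utc)
    return retrieved_at.strftime("%Y/%m/%d/%H/%M/%S/") + quote(url, safe="")

print(page_path("http://example.com/a?b=1"))
```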
When answering, please include links to any technologies you mention and, if possible, which programming languages they have libraries for (i.e. if it's Scala-only or whatever, it's probably not that useful, since this is a Python project). If this question has already been asked (I'm sure it must have been), please let me know (I searched, no luck).
Why do you think solution (3), the Sphinx-based one, requires "repeatedly crawling the site"? Sphinx can accept and index many different data sources, including MySQL and PostgreSQL "natively" (there are contributed add-ons for other DBs such as Firebird) -- you can keep your HTML docs as columns in your DB if you like (modern PostgreSQL versions should have no trouble with that, and I imagine MySQL wouldn't either), and just use Sphinx's superior indexing and full-text search (including stemming etc.). Your metadata all comes from the headers, after all (plus the HTTP request body if you want to track requests in which you POSTed data, but not the HTTP response body at any rate).
One important practical consideration: I would recommend standardizing on UTF-8 -- HTML will come to you in all sorts of weird encodings, but there's no need to go crazy supporting them at search time -- just transcode every text page to UTF-8 upon arrival (from whatever funky encoding it came in) before storing and indexing it, and live happily ever after.
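A sketch of that transcode-on-arrival step, assuming chardet for the detection (any charset detector would do; Latin-1 is the fallback because it never fails to decode):

```python
import chardet

def to_utf8(raw_bytes):
    # Guess the source encoding, then normalize to UTF-8 before
    # storing and indexing the page.
    guess = chardet.detect(raw_bytes).get("encoding") or "latin-1"
    return raw_bytes.decode(guess, errors="replace").encode("utf-8")
```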
Maybe you could special-case non-textual responses to keep those in files (I can imagine that devoting gigabytes in the DB to storing e.g. videos which can't be body-indexed anyway might not be a good use of resources).
And BTW, Sphinx does come with Python bindings, as you request.
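A minimal query sketch with those bindings (the server details and index name are assumptions for illustration):

```python
import sphinxapi

client = sphinxapi.SphinxClient()
client.SetServer("localhost", 9312)         # searchd's default API port
result = client.Query("fizzbuzz", "pages")  # search the "pages" index
if result:
    for match in result["matches"]:
        print(match["id"], match["weight"])
else:
    print(client.GetLastError())
```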
You may be trying to achieve too much with the storage of the HTML (and supporting files). It seems you wish this repository would both:
allow you to display a particular page as it was on its original site
provide indexing for locating pages relevant to particular search criteria
The HTML underlying a web page once looked a bit like a self-standing document, but the pages crawled off the net nowadays are much messier: JavaScript, Ajax snippets, advertisement sections, image blocks, etc.
This reality may cause you to rethink the one-storage-for-all-HTML approach. (And also the parsing / pre-processing of the material crawled, but that's another story...)
On the other hand, the distinction between metadata and the true text content associated with the page doesn't need to be so marked. (By "true text content", I mean the [possibly partially marked-up] text from the web pages that is otherwise free of all the other "Web 2.0 noise".) Many search engines, including Solr (since you mentioned Lucene), now allow mixing the two genres in the form of semi-structured data. For operational purposes (e.g. to task the crawlers), you may keep a relational store with management-related metadata, but the idea is that, for search purposes, fielded and free-text info can coexist nicely (at the cost of pre-processing much of the input data).
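A hypothetical sketch of such a mixed fielded/free-text document with pysolr (one Solr client among several; the core name and field names are made up and would have to exist in your Solr schema):

```python
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/pages", timeout=10)
solr.add([{
    "id": "http://example.com/some-page",       # management metadata...
    "fetched_at": "2009-11-01T12:00:00Z",
    "status_code": 200,
    "content_text": "the page's 'true text content'",  # ...plus free text
}])
solr.commit()
```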
It sounds to me like you need a content management system. Check out Plone. If that's not what you want, maybe a web framework like Grok, BFG, Django, or TurboGears, or anything on this list. If that isn't good either, then I don't know what you are asking. :-)
