create pdf from python - python

I'm looking to generate PDFs from a Python application.
They start relatively simple, but some may become more complex (essentially letter-like documents, but they will later include watermarks, for example).
I've worked in raw PostScript before, and provided I can generate the correct headers etc. and get a valid file at the end of it, I want to avoid complex libraries that may not do entirely what I want, especially when I know PDF/PostScript can do exactly what I need. Some libraries seem to have bitrotted and are no longer supported (pypdf and pypdf2). PDF content really isn't that complex.
I can generate EPS (Encapsulated PostScript) fine by just writing the appropriate text headers and my PostScript code to a file. But when I inspect PDFs, there is a little binary header that I'm not sure how to generate.
I could generate an EPS and convert it. I'm not overly happy with this as the production environment is a Windows 2008 server (Dev is Ubuntu 12.04) and making something and converting it seems very silly.
Has anyone done this before?
Am I being pedantic by not wanting to use a library?

borrowed from ask.yahoo
A PDF file starts with "%PDF-1.1" if it is a version 1.1 type of PDF file. You can read PDF files ok when they don't have binary data objects stored in them, and you could even make one using Notepad if you didn't need to store a binary object like a Paint bitmap in it.
But after seeing the "%PDF-1.1" you ignore what's after that (Adobe Reader does, too) and go straight to the end of the file to where there is a line that says "%%EOF". That's always the last thing in the file; and if that's there you know that just a few characters before that place in the file there's the word "startxref" followed by a number. This number tells a reader program where to look in the file to find the start of the list of items describing the structure of the file. These items in the list can be page objects, dictionary objects, or stream objects (like the binary data of a bitmap), and each one has "obj" and "endobj" marking out where its description starts and ends.
For fairly simple PDF files, you might be able to type the text in just like you did with Notepad to make a working PDF file that Adobe Reader and other PDF viewer programs could read and display correctly.
Doing something like this is a challenge, even for a simple file, and you'd really have to know what you're doing to get any binary data into the file where it's supposed to go; but for character data, you'd just be able to type it in. And all of the commands used in the PDF are in the form of strings that you could type in. The hardest part is calculating those numbers that give the file offsets for items in the file (such as the number following "startxref").
If the way the file format is laid out intrigues you, go ahead and read the PDF manual, which tells the whole story.
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
But really, you should probably just use a library.
Thanks to @LukasGraf for providing this link, http://www.gnupdf.org/Introduction_to_PDF, which shows how to create a simple hello world PDF from scratch.
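To make the offset bookkeeping described above concrete, here is a minimal sketch that writes a one-page PDF by hand and computes the cross-reference offsets and the "startxref" value while building the file. It is an illustration under assumptions, not a recommended implementation: the function name build_minimal_pdf, the US Letter page size, and the choice of the built-in Helvetica font are all made up for the example, and it assumes Python 3.5+ (for %-formatting on bytes). The "binary header" seen in many PDFs is typically just a comment line containing a few high-bit bytes, and a text-only file like this one can omit it.

```python
def build_minimal_pdf(text):
    """Return the bytes of a one-page PDF that shows `text` in Helvetica."""
    # Escape the characters that are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    content = ("BT /F1 24 Tf 72 720 Td (%s) Tj ET" % escaped).encode("latin-1")

    # Object bodies, in object-number order (1, 2, 3, ...).
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.1\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Cross-reference table: every entry is exactly 20 bytes long.
    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


if __name__ == "__main__":
    with open("hello.pdf", "wb") as handle:
        handle.write(build_minimal_pdf("Hello, world"))
```

Opening the file in binary mode matters, particularly on Windows, because newline translation would shift the byte offsets recorded in the table.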

As long as you're working in Python 2.7, ReportLab seems to be the best solution out there at the moment. It's quite full-featured and can be a little complex to work with, depending on exactly what you're doing with it, but since you seem to be familiar with PDF internals in general, hopefully the learning curve won't be too steep.

I recommend that you use a library. I spent a lot of time creating pdfme and learned a lot along the way, but writing a PDF generator is not something you would do for a single project. If you want to use my library, check the docs here.

Related

the difference between .bin file and .mat files

Can TensorFlow read a file containing normal images, for example JPG, or can it only read a .bin file containing images?
What is the difference between a .mat file and a .bin file?
Also, when I rename a .bin file to .mat, does the data in the file change?
Sorry if my language is unclear; I do not speak English very well.
A file-name suffix is just a suffix (which sometimes helps to get info about that file; e.g. Windows uses it to decide which tool is called when the file is double-clicked). A suffix does not need to be correct. And of course, changing the suffix will not change the content.
Every format needs its own decoder: JPG, PNG, MAT and co.
To some extent, the right decoder is chosen automatically by reading out metadata (given some assumptions!). Many image tools have some imread function which works for jpg and png even if there is no suffix (because it checks for common and supported image formats).
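As a rough illustration of that kind of content-based check, the sketch below identifies a few formats from their leading bytes and ignores the file name entirely. The signature table is deliberately tiny, the names SIGNATURES and sniff_format are made up for the example, and real tools check many more formats and more metadata than this.

```python
# Leading "magic" bytes for a few well-known formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"BM", "bmp"),
    (b"MATLAB 5.0 MAT-file", "mat (MATLAB level 5)"),
]


def sniff_format(data):
    """Guess a format from the first bytes of `data`, not from any suffix."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"


def sniff_file(path):
    """Read the start of a file and guess its format from the content."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(32))
```

Renaming a PNG to something.mat would still yield "png" here, because only the content is inspected.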
I'm not sure what tensorflow does automatically, but:
jpg, png, bmp should be no problem
worst-case: use scipy to read and convert
mat is usually a matrix (with many different possible encodings) and often matlab-based
scipy can read many matlab-based formats
bin can be anything (usually stands for binary; no clear mapping like the above)
Don't get me wrong, but I expect someone trying to use tensorflow (not a small, not a simple tool) to know that changing a suffix should never magically transform the content to the new format (especially in the lossless/lossy case like png, jpg). I hope you evaluated this decision and you are not running blindly into using a popular tool.
A '.mat' file contains Matlab-formatted data (not Matlab code like you would expect from a '.m' file). I'm not sure if you're even using Matlab, since you didn't include the tag in your question. '.mat' files are associated with the Matlab workspace; if you wanted to save your current workspace in Matlab, you would save it as a '.mat' file.
A '.bin' file is a binary file read by the computer. In general, executable (ready-to-run) programs are often identified as binary files. I think this is what you would want to use. I am unsure what you really want though because the wording of the question is difficult to understand and it seems like you have two questions here.
Changing the suffix of a file just changes what will run the file. For example, if I were to change test.txt to test.py, the data inside the text file remains the same, but the way the file is opened has changed. In this case, the file was a text file usually opened using Notepad (or some variation) then it was opened by python once changed. If you were to change a .jpg file to a txt file, you wouldn't be able to view it as a picture again, but instead, you would open a text file with a bunch of seemingly random characters which describe the picture. The picture data never changed, but the way you see it and are able to use it does.
Take a look at this website which describes the .bin extension pretty well. Also, a quick Google search goes a long way especially with questions like this.

Converting PDF to any parse-able format

I have a PDF file which consists of tables which can spread across various pages and may have text in between. An example of it can be found here.
I am able to convert the PDF to any format but the output files are not in any way parse-able i.e. I cannot extract data out of it as they are scattered. Here are the links to the output files which I created using pdftotext and pdftohtml.
Is there a way to extract data in a more suitable way?
Thanks in advance.
The general answer is no. pdf is a format intended for visual presentation and printing, and there is no guarantee that the contents will be in any particular order let alone structured as a table in any way other than what appears when the pdf is rendered onto paper or a screen. Sometimes there is even deliberate obfuscation to prevent anyone doing what you are attempting.
In this case it appears to be possible to cut and paste the contents of each table element. For a small number of similar files that is almost certainly the quickest thing to do. Open the pdf on the left hand of your screen, a spreadsheet or data-entry program on the right hand, then cut and paste. For a medium number - tens, hundreds? - it's probably cheapest to hire a temp to do the donkey-work. For a large number - thousands? - it would be possible to create a program to automate this process, but definitely not easy. I might think about using human input via the mouse to identify the corners of the table and the horizontal / vertical divisions, then generating cut and paste operations via control of the human interface devices. Don't ask me how. I'd have to find out if I had to do this, and I'd much rather not. It's a WOMBAT.
Whatever form of analysis you did on the pdf contents would certainly not generalize to other pdfs created by different organisations using different software, and possibly not even by the same organisation using the same process but merely a later release of the same software.
Following in the line of @nigel222, it really depends on the PDF how easily you can get the data out in some useful way.
It is best if the PDF is structured (has a document structure, created when the PDF was written). In this case, you can access the structure, and you are all set.
As structure is a fundamental necessity of an accessible PDF, you may try to "massage" the document by applying the various "make accessible" utilities floating around; definitely something to follow.

How to parse a .shp file?

I am interested in gleaning information from an ESRI .shp file.
Specifically the .shp file of a polyline feature class.
When I open the .dbf of a feature class, I get what I would expect: a table that can open in excel and contains the information from the feature class' table.
However, when I try to open a .shp file in any program (excel, textpad, etc...) all I get is a bunch of gibberish and unusual ASCII characters.
I would like to use Python (2.x) to interpret this file and get information out of it (in this case the vertices of the polyline).
I do not want to use any modules or non built-in tools, as I am genuinely interested in how this process would work and I don't want any dependencies.
Thank you for any hints or points in the right direction you can give!
Your question, basically, is "I have a file full of data stored in an arbitrary binary format. How can I use python to read such a file?"
The answer is, this link contains a description of the format of the file. Write a dissector based on the technical specification.
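As a sketch of what such a dissector might look like using only the standard library's struct module: the published ESRI shapefile layout has a 100-byte header (big-endian file code 9994 at byte 0, big-endian file length in 16-bit words at byte 24, little-endian version and shape type at bytes 28 and 32), followed by records that each start with a big-endian record number and content length. The function name read_polylines is made up for this example, it handles only polyline records (shape type 3), and it is written for Python 3, so treat it as a starting point rather than a complete parser.

```python
import struct


def read_polylines(data):
    """Parse the bytes of a polyline .shp file.

    Returns a list of shapes; each shape is a list of parts, and each
    part is a list of (x, y) vertex tuples.
    """
    (file_code,) = struct.unpack(">i", data[0:4])
    if file_code != 9994:
        raise ValueError("not a shapefile (bad file code)")
    (length_words,) = struct.unpack(">i", data[24:28])  # 16-bit words
    version, shape_type = struct.unpack("<ii", data[28:36])
    if shape_type != 3:
        raise ValueError("expected polyline shape type 3, got %d" % shape_type)

    end = length_words * 2
    pos = 100  # records start right after the fixed-size header
    shapes = []
    while pos < end:
        record_number, content_words = struct.unpack(">ii", data[pos:pos + 8])
        pos += 8
        record_end = pos + content_words * 2
        (record_type,) = struct.unpack("<i", data[pos:pos + 4])
        if record_type == 3:  # skip null shapes (type 0)
            # Bytes 4..35 of the content hold the bounding box (4 doubles).
            num_parts, num_points = struct.unpack("<ii", data[pos + 36:pos + 44])
            parts_end = pos + 44 + 4 * num_parts
            parts = struct.unpack("<%di" % num_parts, data[pos + 44:parts_end])
            flat = struct.unpack(
                "<%dd" % (2 * num_points),
                data[parts_end:parts_end + 16 * num_points],
            )
            points = list(zip(flat[0::2], flat[1::2]))
            # Each part runs from its start index to the next part's start.
            bounds = list(parts) + [num_points]
            shapes.append(
                [points[bounds[i]:bounds[i + 1]] for i in range(num_parts)]
            )
        pos = record_end
    return shapes
```

It could be used as read_polylines(open("roads.shp", "rb").read()), where roads.shp is a placeholder file name.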
If you don't want to go to all the trouble of writing a parser, you should take a look at pyshp, a pure Python shapefile library. I've been using it for a couple of months now, and have found it quite easy to use.
There's also a python binding to shapelib, if you search the web. But I found the pure Python solution easier to hack around with.
Might be a long shot, but you should check out ctypes, and maybe use the .dll file that came with a program (if it even exists) that can read that type of file. In my experience, things get weird when you start digging around in .dlls.

Need suggestions for designing a content organizer

I am trying to write a Python program which takes content and categorizes it based on tags. I am using Nepomuk to tag files and PyQt for the GUI. The problem is that I am unable to decide how to save the content. Right now, I am saving each entry individually to a text file in a folder. When I need to read the contents, I tell the program to get all the files in that folder and then perform a read operation on each file. Since the number of files is small for now (fewer than 20), this approach is decent enough. But I am worried that when the file count increases, this method will become inefficient. Is there any other method to save content efficiently?
Thanks in advance.
You could use the sqlite3 module from the stdlib. Data will be stored in a single file. The code might be even simpler than the code used for reading all the ad hoc text files by hand.
You could always export the data in a format suitable for sharing in your case.
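A minimal sketch of the sqlite3 approach, assuming a simple two-table layout (entries and their tags). The table names, column names, and helper functions are invented for this example and are not taken from the question; adapt them to however the tags are actually obtained.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the single-file store; ':memory:' is for testing."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS entries (
            id INTEGER PRIMARY KEY,
            body TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS tags (
            entry_id INTEGER NOT NULL REFERENCES entries(id),
            tag TEXT NOT NULL
        );
        CREATE INDEX IF NOT EXISTS tags_by_tag ON tags(tag);
        """
    )
    return conn


def add_entry(conn, body, tags):
    """Store one piece of content together with its tags; return its id."""
    cursor = conn.execute("INSERT INTO entries (body) VALUES (?)", (body,))
    entry_id = cursor.lastrowid
    conn.executemany(
        "INSERT INTO tags (entry_id, tag) VALUES (?, ?)",
        [(entry_id, tag) for tag in tags],
    )
    conn.commit()
    return entry_id


def entries_with_tag(conn, tag):
    """Return the bodies of all entries carrying `tag`, oldest first."""
    rows = conn.execute(
        "SELECT e.body FROM entries e JOIN tags t ON t.entry_id = e.id "
        "WHERE t.tag = ? ORDER BY e.id",
        (tag,),
    )
    return [row[0] for row in rows]
```

With the index on the tag column, looking up entries by tag does not require reading every entry, which is the concern raised about the one-file-per-entry approach.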

Insert png image programatically in pdf file at specific location

Looking for easiest way to do the following:
I have created 10,000 unique QR-codes, with unique filenames.
I have one postcard design (.ai, eps, pdf - doesn't matter) with a placeholder for the QR code and for the unique filename (sans .png extension).
How would I go about inserting each of the 10,000 PNGs into 10,000 copies of the PDF file? (I need to do the same with the unique filename/text string that represents each QR code.)
Since I am really no good with programming, it doesn't matter which tools are used, as long as you hold my hand or there is a link to beginners' documentation.
however:
I am trying to learn python - so that is preferred.
I work a little bit with R - but that will not be the easiest solution.
If this can be done directly from the terminal with a shell script, then hallelujah :-)
But really - if you know of a solution - then please post it, regardless of the tools.
Thanks in advance.
You can do it in Python using pyPdf to merge documents.
Basically, you create a PDF with your QRCode placed where you want it in the end.
You can use the (c)StringIO module to store the created PDF file in memory.
You can find pyPDF here; there's an example that shows how you would add a watermark to a file, you should be following the same logic.
