I am interested in gleaning information from an ESRI .shp file.
Specifically the .shp file of a polyline feature class.
When I open the .dbf of a feature class, I get what I would expect: a table that opens in Excel and contains the information from the feature class' table.
However, when I try to open a .shp file in any program (Excel, TextPad, etc.), all I get is gibberish and unusual characters.
I would like to use Python (2.x) to interpret this file and get information out of it (in this case the vertices of the polyline).
I do not want to use any modules or non built-in tools, as I am genuinely interested in how this process would work and I don't want any dependencies.
Thank you for any hints or points in the right direction you can give!
Your question, basically, is "I have a file full of data stored in an arbitrary binary format. How can I use python to read such a file?"
The answer is, this link contains a description of the format of the file. Write a dissector based on the technical specification.
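A minimal sketch of such a dissector, assuming the layout in ESRI's published shapefile specification: a 100-byte file header, then records that each carry an 8-byte big-endian record header followed by little-endian content. The function name read_polylines is hypothetical, and the sketch handles only PolyLine records (shape type 3), skipping everything else. It needs nothing beyond the built-in struct module.

```python
import struct

POLYLINE = 3  # shape type code for PolyLine records


def read_polylines(data):
    """Parse the raw bytes of a .shp file.

    Returns a list of polylines; each polyline is a list of parts,
    and each part is a list of (x, y) tuples.
    """
    # File header: file code (big-endian) at byte 0, file length in
    # 16-bit words (big-endian) at byte 24.
    file_code, = struct.unpack(">i", data[0:4])
    if file_code != 9994:
        raise ValueError("not a shapefile (bad file code)")
    file_len_words, = struct.unpack(">i", data[24:28])
    end = min(file_len_words * 2, len(data))

    polylines = []
    pos = 100  # records start right after the 100-byte header
    while pos + 8 <= end:
        # Record header: record number and content length (in 16-bit words).
        rec_num, content_words = struct.unpack(">ii", data[pos:pos + 8])
        content = data[pos + 8:pos + 8 + content_words * 2]
        pos += 8 + content_words * 2

        shape_type, = struct.unpack("<i", content[0:4])
        if shape_type != POLYLINE:
            continue  # null shapes and other geometry types are skipped

        # Bytes 4..36 hold the bounding box (4 doubles), not needed here.
        num_parts, num_points = struct.unpack("<ii", content[36:44])
        parts = struct.unpack("<%di" % num_parts,
                              content[44:44 + 4 * num_parts])
        pts_start = 44 + 4 * num_parts
        flat = struct.unpack("<%dd" % (2 * num_points),
                             content[pts_start:pts_start + 16 * num_points])
        points = list(zip(flat[0::2], flat[1::2]))

        # Each entry in parts is the index of the first point of that part.
        bounds = list(parts) + [num_points]
        polylines.append([points[bounds[i]:bounds[i + 1]]
                          for i in range(num_parts)])
    return polylines
```

To try it, read the file in binary mode, for example data = open("roads.shp", "rb").read() (the file name is made up), and pass data to read_polylines. Reading everything into memory keeps the sketch short; a large file would be better served by reading record by record.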
If you don't want to go to all the trouble of writing a parser, you should take a look at pyshp, a pure Python shapefile library. I've been using it for a couple of months now, and have found it quite easy to use.
There's also a python binding to shapelib, if you search the web. But I found the pure Python solution easier to hack around with.
This might be a long shot, but you could check out ctypes and use the .dll file that came with a program that can read that type of file (if it even exists). In my experience, things get weird when you start digging around in .dlls.
Hello, I am struggling to convert hundreds of fb2 files to txt using Python. I found pyandoc and EbookLib, but I couldn't find this option in their functionality, or perhaps I didn't search carefully enough.
Can someone suggest something relevant in my case? Maybe a free API, but I think there could be a library.
something relevant in my case
I searched PyPI for fb2 and FictionBook2 and found two packages that may be useful to you: catpandoc and FB2. The first concatenates ("cats") multiple documents to the terminal and supports numerous file formats. The second is a Python package for working with FictionBook2. The FB2 documentation gives an example of how to create an FB2 file, but not how to read one. I do not know whether that means the documentation is poor or the package has no read support at all.
EDIT: After some research I found that FictionBook2 files are XML files. Example can be seen here. That being said I encourage you to first try existing FB2 tools and only if they fail to deliver desired result implement extraction by XML parsing.
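If the existing tools do fall short, extraction by XML parsing could look roughly like the sketch below. It assumes the text you want lives in p elements inside the body element(s) of the FB2 document, and it ignores the XML namespace by comparing local tag names only. The function names are hypothetical, and real files may need extra handling (poems, footnotes, embedded binaries).

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def fb2_to_text(xml_data):
    """Return the text of every <p> found inside a <body>, one per line."""
    root = ET.fromstring(xml_data)
    lines = []
    for body in root.iter():
        if local_name(body.tag) != "body":
            continue
        for elem in body.iter():
            if local_name(elem.tag) == "p":
                # itertext() also collects text nested in inline markup
                lines.append("".join(elem.itertext()).strip())
    return "\n".join(lines)
```

For hundreds of files, you could loop over the .fb2 file names, read each one in binary mode, pass the bytes to fb2_to_text, and write the result to a .txt file with the same base name.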
I have a PDF file which consists of tables which can spread across various pages and may have text in between. An example of it can be found here.
I am able to convert the PDF to any format but the output files are not in any way parse-able i.e. I cannot extract data out of it as they are scattered. Here are the links to the output files which I created using pdftotext and pdftohtml.
Is there a way to extract data in a more suitable way?
Thanks in advance.
The general answer is no. pdf is a format intended for visual presentation and printing, and there is no guarantee that the contents will be in any particular order let alone structured as a table in any way other than what appears when the pdf is rendered onto paper or a screen. Sometimes there is even deliberate obfuscation to prevent anyone doing what you are attempting.
In this case it appears to be possible to cut and paste the contents of each table element. For a small number of similar files that is almost certainly the quickest thing to do. Open the pdf on the left hand of your screen, a spreadsheet or data-entry program on the right hand, then cut and paste. For a medium number - tens, hundreds? - it's probably cheapest to hire a temp to do the donkey-work. For a large number - thousands? - it would be possible to create a program to automate this process, but definitely not easy. I might think about using human input via the mouse to identify the corners of the table and the horizontal / vertical divisions, then generating cut and paste operations via control of the human interface devices. Don't ask me how. I'd have to find out if I had to do this, and I'd much rather not. It's a WOMBAT.
Whatever form of analysis you did on the pdf contents would certainly not generalize to other pdfs created by different organisations using different software, and possibly not even by the same organisation using the same process but merely a later release of the same software.
Following in the line of @nigel222, it really depends on the PDF how easily you can get the data out in some useful way.
It is best if the PDF is structured (has a document structure, created when the PDF was written). In this case, you can access the structure, and you are all set.
As structure is a fundamental necessity of an accessible PDF, you may try to "massage" the document by applying the various "make accessible" utilities floating around; definitely something to follow.
I'm looking to generate PDFs from a Python application.
They start relatively simple, but some may become more complex (essentially letter-like documents, but later they will include watermarks, for example).
I've worked in raw PostScript before and, provided I can generate the correct headers etc. and a file at the end of it, I want to avoid complex libraries that may not do entirely what I want, especially when I know PDF/PostScript can do exactly what I need. Some libraries seem to have bit-rotted and are no longer supported (pypdf and pypdf2). PDF content really isn't that complex.
I can generate EPS (Encapsulated PostScript) fine by just writing the appropriate text headers and my PostScript code to a file. But when inspecting PDFs, I see a little binary header that I'm not sure how to generate.
I could generate an EPS and convert it. I'm not overly happy with this as the production environment is a Windows 2008 server (Dev is Ubuntu 12.04) and making something and converting it seems very silly.
Has anyone done this before?
Am I being pedantic by not wanting to use a library?
borrowed from ask.yahoo
A PDF file starts with "%PDF-1.1" if it is a version 1.1 type of PDF file. You can read PDF files ok when they don't have binary data objects stored in them, and you could even make one using Notepad if you didn't need to store a binary object like a Paint bitmap in it.
But after seeing the "%PDF-1.1" you ignore what's after that (Adobe Reader does, too) and go straight to the end of the file to where there is a line that says "%%EOF". That's always the last thing in the file; and if that's there you know that just a few characters before that place in the file there's the word "startxref" followed by a number. This number tells a reader program where to look in the file to find the start of the list of items describing the structure of the file. These items in the list can be page objects, dictionary objects, or stream objects (like the binary data of a bitmap), and each one has "obj" and "endobj" marking out where its description starts and ends.
For fairly simple PDF files, you might be able to type the text in just like you did with Notepad to make a working PDF file that Adobe Reader and other PDF viewer programs could read and display correctly.
Doing something like this is a challenge, even for a simple file, and you'd really have to know what you're doing to get any binary data into the file where it's supposed to go; but for character data, you'd just be able to type it in. And all of the commands used in the PDF are in the form of strings that you could type in. The hardest part is calculating those numbers that give the file offsets for items in the file (such as the number following "startxref").
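As a rough illustration of that offset bookkeeping, here is a minimal sketch that writes a one-page, text-only PDF and lets Python compute the numbers. The function name build_minimal_pdf, the page size, the font, and the text position are all arbitrary choices for the example, and it assumes the text fits in Latin-1.

```python
def build_minimal_pdf(text):
    """Return the bytes of a one-page PDF showing `text` in Helvetica."""
    # Backslashes and parentheses must be escaped inside a PDF string.
    escaped = (text.replace("\\", "\\\\")
                   .replace("(", "\\(")
                   .replace(")", "\\)"))
    stream = ("BT /F1 24 Tf 72 720 Td (%s) Tj ET" % escaped).encode("latin-1")

    # Object bodies, numbered 1..5 in this order.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length " + str(len(stream)).encode("ascii") + b" >>\n"
        b"stream\n" + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.1\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" starts
        out += str(num).encode("ascii") + b" 0 obj\n" + body + b"\nendobj\n"

    xref_pos = len(out)  # this is the number that follows "startxref"
    out += ("xref\n0 %d\n" % (len(objects) + 1)).encode("ascii")
    out += b"0000000000 65535 f \n"  # entry for the reserved object 0
    for off in offsets:
        # Each cross-reference entry is exactly 20 bytes long.
        out += ("%010d 00000 n \n" % off).encode("ascii")
    out += ("trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (len(objects) + 1, xref_pos)).encode("ascii")
    return bytes(out)
```

Writing the returned bytes to a file opened in binary mode should give a file that PDF viewers can open. Anything beyond plain text (images, compression, other fonts) needs more of the specification than this sketch covers.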
If the way the file format is laid out intrigues you, go ahead and read the PDF manual, which tells the whole story.
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
but really you should probably just use a library
Thanks to @LukasGraf for providing this link, http://www.gnupdf.org/Introduction_to_PDF, which shows how to create a simple hello world PDF from scratch.
As long as you're working in Python 2.7, Reportlab seems to be the best solution out there at the moment. It's quite full-featured, and can be a little complex to work with, depending on exactly what you're doing with it, but since you seem to be familiar with PDF internals in general hopefully the learning curve won't be too steep.
I recommend you to use a library. I spent a lot of time creating pdfme and learned a lot of things along the way, but it's not something you would do for a single project. If you want to use my library check the docs here.
I am working on a bio project.
I have a .pdb (Protein Data Bank) file which contains information about the molecule.
I want to find out the following properties of a molecule in the .pdb file:
Molecular Mass.
H bond donor.
H bond acceptor.
LogP.
Refractivity.
Is there any module in Python which can work with a .pdb file to find these?
If not, can anyone please let me know how I can do the same?
I found some modules like SeqUtils and ProtParam, but they don't do such things.
I have researched first and then posted, so, please don't down-vote.
Please comment, if you still down-vote as to why you did so.
Thanks in advance.
I don't know if it fits your needs, but Biopython looks like it might help.
The PDB also provides entries in an XML format, PDBML, which can be easily parsed using an XML parsing library.
http://pdbml.pdb.org/
A pdb file can contain pretty much anything.
A lot of projects allow you to parse them. Some are specific to biology and pdb files; others are less specific but will allow you to do more (set up calculations, measure distances, angles, etc.).
I think you got downvoted because these projects are numerous: you are not the only one wanting to do that so the chances that something perfectly fitting your needs exists are really high.
That said, if you just want to parse pdb files for this specific need, just do it yourself:
Open the files with a text editor.
Identify where the relevant data are (keywords, etc.).
Make a Python function that opens the file and looks for the keywords.
Extract the figures from the file.
Done.
This can be done with a short script written in less than 10 minutes (another likely reason for the downvotes).
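The steps above can be sketched for one of the requested properties, molecular mass. This is a minimal sketch under several assumptions: the element symbol is present in columns 77-78 of each ATOM/HETATM record (as in the standard fixed-column PDB layout), the file holds a single model, and only the few elements in the hypothetical ATOMIC_MASS table occur. Hydrogens that are missing from the file are not counted, and properties such as LogP or H-bond donors need real chemistry logic that this does not attempt.

```python
# Standard atomic masses for a few common elements (extend as needed).
ATOMIC_MASS = {
    "H": 1.008,
    "C": 12.011,
    "N": 14.007,
    "O": 15.999,
    "P": 30.974,
    "S": 32.06,
}


def molecular_mass(pdb_text):
    """Sum the atomic masses of all ATOM/HETATM records in a PDB file."""
    total = 0.0
    for line in pdb_text.splitlines():
        # Keyword in columns 1-6 identifies the record type.
        if not line.startswith(("ATOM", "HETATM")):
            continue
        # Columns 77-78 (0-based slice 76:78) hold the element symbol,
        # right-justified and upper case, e.g. " C" or "FE".
        element = line[76:78].strip().capitalize()
        if element not in ATOMIC_MASS:
            raise ValueError("no mass known for element %r" % element)
        total += ATOMIC_MASS[element]
    return total
```

Calling molecular_mass(open("molecule.pdb").read()) (the file name is made up) returns the summed mass of the atoms listed in the file.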
Looking for easiest way to do the following:
I have created 10,000 unique QR-codes, with unique filenames.
I have one postcard design (.ai, .eps, .pdf - doesn't matter) with a placeholder for the QR code and for the unique filename (sans .png extension).
How would I go about inserting each of the 10,000 PNGs into 10,000 copies of the PDF file? (I also need to do the same with the unique filename / text string that represents each QR code.)
Since I am really no good with programming, it doesn't matter which tools are used, as long as you hold my hand or there is a link to beginner's documentation.
however:
I am trying to learn python - so that is preferred.
I work a little bit with R - but that will not be the easiest solution.
If this can be done directly from the terminal with a shell script, then hallelujah :-)
But really - if you know of a solution - then please post it, regardless of the tools.
Thanks in advance.
You can do it in Python using pyPdf to merge documents.
Basically, you create a PDF with your QRCode placed where you want it in the end.
You can use the (c)StringIO module to store the created PDF file in memory.
You can find pyPdf here; there's an example that shows how you would add a watermark to a file, and you should follow the same logic.