Writing data to a zip archive in Python - python

I've been told in the past that there is simply no easy way to write a string to a zip file. It's okay to READ from a zip archive, but if you want to write to a zip file, the best option is to extract it, make the changes, and then zip it back up again. However, the library I am using (openpyxl) accomplishes the feat of writing to a zip file without any extraction. This package uses the writestr() function in the Python ZipFile library to make changes. Can someone explain to me how exactly this is possible? I know it has something to do with writing bytes, but I can't find a good explanation.
I'm aware of the vagueness of this question, but that's a circumstance of my lack of knowledge on the topic.

openpyxl does not modify the files in place because you can't do this with zipfiles. You must extract, modify and archive. We just hide this process in the library.
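The "read, modify, write a fresh archive" cycle described above can be sketched with the standard zipfile module. This is only an illustration, not openpyxl's actual code; the helper name replace_member and its parameters are made up for the example. Note that opening an archive in mode "a" does let writestr() append new members, but it cannot change or remove an existing one, which is why a rewrite is needed.

```python
import zipfile


def replace_member(src, dst, name, data):
    """Copy archive `src` to a new archive `dst`, replacing one member.

    `src` and `dst` may be paths or file-like objects. `name` is the
    member to replace (hypothetical helper, not part of any library).
    """
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            if info.filename == name:
                continue  # skip the old copy of the member
            # Re-write every other member unchanged, keeping its metadata.
            zout.writestr(info, zin.read(info.filename))
        # writestr() takes the new content directly from memory.
        zout.writestr(name, data)
```

Nothing is extracted to disk here: each member is read into memory and written straight into the new archive, which is what makes it look as if the zip were edited in place.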

Related

Python ZipFile extractall fails when external_attr is 0x10

I have a peculiar case where Python's ZipFile.extractall fails. I have a few zip files that seem to have been created in an uncommon way (I did not create them, I need to open them). For example, let's discuss a zip file that contains the following files:
a/f1.txt
a/f2.txt
b/f3.txt
However, the zip file contains the following file entries (as seen by using ZipFile.filelist):
a/f1.txt, a/f2.txt, a, b/f3.txt, b
When trying to use extractall, I get an error, saying that the file a cannot be created, because it is already a directory (makes sense, as a/f1.txt was handled before).
Looking further, the external_attr of the proper files is 0x20. The external_attr of the extra files is 0x10. They are also zero-length.
Windows' internal unzip works properly, as does 7zip, but Python's extractall fails.
Is this a bug in Python's ZipFile implementation? Or are these badly encoded zip files that Windows and 7zip just happen to understand? What is going on here?
Note: making the Python code work is simple, just pass all the files without external_attr==0x10 to extractall, but I still want to know what's going on.
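The workaround mentioned in the note can be sketched as follows. The low byte of external_attr carries MS-DOS file attributes, where 0x10 is the directory bit and 0x20 is the archive bit, so the zero-length extras appear to be directory entries stored without a trailing slash. The helper name extract_without_dir_stubs is hypothetical.

```python
import zipfile

MSDOS_DIRECTORY = 0x10  # directory bit in the low (MS-DOS) byte of external_attr


def extract_without_dir_stubs(archive, dest):
    """Extract every member except those flagged as MS-DOS directories.

    Returns the list of member names that were extracted.
    """
    with zipfile.ZipFile(archive) as zf:
        names = [info.filename for info in zf.infolist()
                 if not info.external_attr & MSDOS_DIRECTORY]
        # extractall() creates the parent directories of each file itself,
        # so the skipped directory entries are not needed.
        zf.extractall(dest, members=names)
        return names
```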

create pdf from python

I'm looking to generate PDFs from a Python application.
They start relatively simple, but some may become more complex (essentially letter-like documents, but they will include watermarks later, for example).
I've worked in raw PostScript before and, provided I can generate the correct headers etc. and end up with a valid file, I want to avoid complex libraries that may not do entirely what I want, especially when I know PDF/PostScript can do exactly what I need. Some seem to have bit-rotted and are no longer supported (pypdf and pypdf2). PDF content really isn't that complex.
I can generate EPS (Encapsulated PostScript) fine by just writing the appropriate text headers and my PostScript code to a file. But inspecting PDFs, there is a little binary header I'm not sure how to generate.
I could generate an EPS and convert it. I'm not overly happy with this as the production environment is a Windows 2008 server (Dev is Ubuntu 12.04) and making something and converting it seems very silly.
Has anyone done this before?
Am I being pedantic by not wanting to use a library?
borrowed from ask.yahoo
A PDF file starts with "%PDF-1.1" if it is a version 1.1 type of PDF file. You can read PDF files ok when they don't have binary data objects stored in them, and you could even make one using Notepad if you didn't need to store a binary object like a Paint bitmap in it.
But after seeing the "%PDF-1.1" you ignore what's after that (Adobe Reader does, too) and go straight to the end of the file to where there is a line that says "%%EOF". That's always the last thing in the file; and if that's there you know that just a few characters before that place in the file there's the word "startxref" followed by a number. This number tells a reader program where to look in the file to find the start of the list of items describing the structure of the file. These items in the list can be page objects, dictionary objects, or stream objects (like the binary data of a bitmap), and each one has "obj" and "endobj" marking out where its description starts and ends.
For fairly simple PDF files, you might be able to type the text in just like you did with Notepad to make a working PDF file that Adobe Reader and other PDF viewer programs could read and display correctly.
Doing something like this is a challenge, even for a simple file, and you'd really have to know what you're doing to get any binary data into the file where it's supposed to go; but for character data, you'd just be able to type it in. And all of the commands used in the PDF are in the form of strings that you could type in. The hardest part is calculating those numbers that give the file offsets for items in the file (such as the number following "startxref").
If the way the file format is laid out intrigues you, go ahead and read the PDF manual, which tells the whole story.
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
But really, you should probably just use a library.
Thanks to @LukasGraf for providing this link http://www.gnupdf.org/Introduction_to_PDF that shows how to create a simple hello world PDF from scratch.
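To make the offset bookkeeping described above concrete, here is a minimal sketch that writes a one-page PDF by hand and computes the cross-reference offsets and the "startxref" number as it goes. It is written for Python 3.5+ (it uses bytes % formatting) rather than the 2.7 mentioned in the thread, the function name build_minimal_pdf is made up, and the text is assumed to be plain ASCII with no parentheses or backslashes. Treat it as an illustration of the layout, not a replacement for a library.

```python
def build_minimal_pdf(text="Hello, world"):
    """Return the bytes of a one-page PDF showing `text` in Helvetica."""
    # Page content stream: select font F1 at 24pt, move to (72, 720), draw text.
    stream = ("BT /F1 24 Tf 72 720 Td (%s) Tj ET" % text).encode("ascii")
    # Bodies of objects 1..5, in order.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.1\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_pos = len(out)  # this is the number that follows "startxref"
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry for the reserved object 0
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)
```

Writing the result with open("hello.pdf", "wb") should give a file a viewer can open; because every offset is taken from len(out) at the moment the object is appended, nothing has to be counted by hand.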
As long as you're working in Python 2.7, Reportlab seems to be the best solution out there at the moment. It's quite full-featured, and can be a little complex to work with, depending on exactly what you're doing with it, but since you seem to be familiar with PDF internals in general hopefully the learning curve won't be too steep.
I recommend using a library. I spent a lot of time creating pdfme and learned a lot of things along the way, but it's not something you would do for a single project. If you want to use my library check the docs here.

How to parse a .shp file?

I am interested in gleaning information from an ESRI .shp file.
Specifically the .shp file of a polyline feature class.
When I open the .dbf of a feature class, I get what I would expect: a table that can open in excel and contains the information from the feature class' table.
However, when I try to open a .shp file in any program (Excel, TextPad, etc.) all I get is a bunch of gibberish and unusual characters.
I would like to use Python (2.x) to interpret this file and get information out of it (in this case the vertices of the polyline).
I do not want to use any modules or non built-in tools, as I am genuinely interested in how this process would work and I don't want any dependencies.
Thank you for any hints or points in the right direction you can give!
Your question, basically, is "I have a file full of data stored in an arbitrary binary format. How can I use python to read such a file?"
The answer is, this link contains a description of the format of the file. Write a dissector based on the technical specification.
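As a rough sketch of what such a dissector could look like using only the built-in struct module: the layout below follows the ESRI Shapefile Technical Description as I understand it (a 100-byte file header, then records with an 8-byte big-endian header and little-endian content). The function name read_polylines is made up, and only PolyLine records (shape type 3) are handled; everything else is skipped.

```python
import struct


def read_polylines(data):
    """Yield (record_number, parts) for each PolyLine record in .shp bytes.

    `parts` is a list of parts, each a list of (x, y) vertex tuples.
    """
    file_code, = struct.unpack(">i", data[0:4])
    if file_code != 9994:
        raise ValueError("not a shapefile")
    # File length is stored big-endian, counted in 16-bit words.
    file_words, = struct.unpack(">i", data[24:28])
    end = file_words * 2
    pos = 100  # records start right after the fixed-size header

    while pos < end:
        # Record header: record number and content length (in 16-bit words).
        rec_num, content_words = struct.unpack(">ii", data[pos:pos + 8])
        pos += 8
        content = data[pos:pos + content_words * 2]
        pos += content_words * 2

        shape_type, = struct.unpack("<i", content[0:4])
        if shape_type != 3:
            continue  # null shape or a type this sketch does not handle

        # Bytes 4..36 hold the bounding box (4 doubles), which is skipped here.
        num_parts, num_points = struct.unpack("<ii", content[36:44])
        starts = struct.unpack("<%di" % num_parts,
                               content[44:44 + 4 * num_parts])
        first = 44 + 4 * num_parts
        flat = struct.unpack("<%dd" % (2 * num_points),
                             content[first:first + 16 * num_points])
        points = list(zip(flat[0::2], flat[1::2]))

        # Each part runs from its start index to the next part's start.
        bounds = list(starts) + [num_points]
        yield rec_num, [points[bounds[i]:bounds[i + 1]]
                        for i in range(num_parts)]
```

Usage would be along the lines of reading the whole file with open(path, "rb").read() and iterating over read_polylines(data).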
If you don't want to go to all the trouble of writing a parser, you should take a look at pyshp, a pure Python shapefile library. I've been using it for a couple of months now, and have found it quite easy to use.
There's also a python binding to shapelib, if you search the web. But I found the pure Python solution easier to hack around with.
Might be a long shot, but you should check out ctypes, and maybe use the .dll file that came with a program (if it even exists, lol) that can read that type of file. In my experience, things get weird when you start digging around in .dlls.

Need suggestions for designing a content organizer

I am trying to write a Python program which could take content and categorize it based on tags. I am using Nepomuk to tag files and PyQt for the GUI. The problem is, I am unable to decide how to save the content. Right now, I am saving each entry individually to a text file in a folder. When I need to read the contents, I am telling the program to get all the files in that folder and then perform a read operation on each file. Since the number of files is small now (fewer than 20), this approach is decent enough. But I am worried that when the file count increases, this method will become inefficient. Is there any other method to save content efficiently?
Thanks in advance.
You could use the sqlite3 module from the stdlib. Data will be stored in a single file. The code might be even simpler than the code used for reading all the ad hoc text files by hand.
You could always export the data in a format suitable for sharing in your case.
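A minimal sketch of what this could look like with sqlite3. The table layout (entries plus a tags table) and the function names are assumptions made for the example; the real schema would depend on what the application stores and on how the Nepomuk tags are obtained.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the single-file store. ':memory:' is for demos."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS entries (
            id   INTEGER PRIMARY KEY,
            body TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS tags (
            entry_id INTEGER NOT NULL REFERENCES entries(id),
            tag      TEXT NOT NULL
        );
        CREATE INDEX IF NOT EXISTS tags_by_name ON tags(tag);
    """)
    return conn


def add_entry(conn, body, tags):
    """Store one piece of content together with its tags; return its id."""
    cur = conn.execute("INSERT INTO entries (body) VALUES (?)", (body,))
    conn.executemany("INSERT INTO tags (entry_id, tag) VALUES (?, ?)",
                     [(cur.lastrowid, tag) for tag in tags])
    conn.commit()
    return cur.lastrowid


def entries_with_tag(conn, tag):
    """Return the bodies of all entries carrying `tag`, oldest first."""
    rows = conn.execute(
        "SELECT e.body FROM entries e JOIN tags t ON t.entry_id = e.id "
        "WHERE t.tag = ? ORDER BY e.id", (tag,))
    return [row[0] for row in rows]
```

With the index on tags(tag), looking up a category does not require reading every entry, which is the part that becomes slow with one text file per entry.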

Insert png image programmatically in pdf file at specific location

Looking for easiest way to do the following:
I have created 10,000 unique QR-codes, with unique filenames.
I have one postcard design (.ai, eps, pdf - doesn't matter) with a placeholder for the QR code and for the unique filename (sans .png extension).
How would I go about inserting each of the 10,000 PNGs into 10,000 copies of the pdf file? (I also need to do the same with the unique filename / text string that represents each QR code.)
Since I am really no good with programming, it doesn't matter which tools to use, as long as you hold my hand or there is a link to beginner documentation.
however:
I am trying to learn python - so that is preferred.
I work a little bit with R - but that will not be the easiest solution.
If this can be done directly from the terminal with a shell script, then hallelujah :-)
But really - if you know of a solution - then please post it, regardless of the tools.
Thanks in advance.
You can do it in Python using pyPdf to merge documents.
Basically, you create a PDF with your QRCode placed where you want it in the end.
You can use the (c)StringIO module to store the created PDF file in memory.
You can find pyPDF here; there's an example that shows how you would add a watermark to a file, you should be following the same logic.
