I'm trying to read a field from an Active Directory entry which contains raw jpeg binary data. I'd like to read that data and convert it to an image file for use in my django-based application. I cannot for the life of me figure out how to handle this data in a nice way. Any ideas?
Edit:
To anyone who might come across this in the future: there's a function in Python 2's os module:
os.tmpfile()
It creates a file and destroys it once the file descriptor is closed. Very useful for this situation. (os.tmpfile() was removed in Python 3; tempfile.TemporaryFile() does the same job there.)
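A minimal sketch of the same idea using the tempfile module, which works on both Python 2 and 3. The helper name bytes_to_tempfile is made up for illustration, and the sample bytes are only a stand-in for the JPEG data read from the directory entry:

```python
import tempfile


def bytes_to_tempfile(data):
    """Write raw bytes to an anonymous temporary file and rewind it.

    The file is removed automatically once the returned object is closed.
    """
    tmp = tempfile.TemporaryFile()  # opened in binary read/write mode by default
    tmp.write(data)
    tmp.seek(0)  # rewind so the caller can read from the start
    return tmp


# Stand-in for the raw JPEG bytes pulled from the directory attribute.
jpeg_bytes = b"\xff\xd8\xff\xe0" + b"\x00" * 16
photo = bytes_to_tempfile(jpeg_bytes)
head = photo.read(4)
photo.close()  # the temporary file disappears here
```

The returned object behaves like any other open file, so it can be handed to code that expects a file-like object.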
Here is somebody who was having the same problem -- check out the latest post at the bottom.
http://groups.google.com/group/django-users/browse_thread/thread/4214db6699863ded/5d816b02daca3186
Looks like passing raw data to SimpleUploadedFile is what you are looking for.
request._raw_post_data
The raw HTTP POST data as a byte string. This is useful for processing data in different formats than of conventional HTML forms: binary images, XML payload etc.
http://docs.djangoproject.com/en/dev/ref/request-response/#httprequest-objects
I know this isn't part of the question, but this looks pretty awesome! "HttpRequest.read() file-like interface"
http://docs.djangoproject.com/en/dev/ref/request-response/#django.http.HttpRequest.read
I have a data set that looks like this.
b'\xa3\x95\x80\x80YFMT\x00BBnNZ\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00Type,Length,Name,Format,Columns\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xa3\x95\x80\x81\x17PARMNf\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00Name,Value\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xa3\x95\x80\x82-GPS\x00BIHBcLLeeEefI\x00\x00\x00Status,TimeMS,Week,NSats,HDop,Lat,Lng,RelAlt,Alt,Spd,GCrs,VZ,T\x00\x00\xa3\x95\x80\x83\x1fIMU\x00Iffffff\x00\x00\x00\x00\x00\x00\x00\x00\x00TimeMS,GyrX,GyrY,G
I have been reading around to find out how to write Python code that will parse this data, so that I can plot some of the columns against each other (mostly against time).
Some things I found that may help in doing this:
There is a script that converts this data into a CSV file. I know how to use it to produce a CSV file and plot from there, but as a learning experience I want to be able to do this without converting to CSV. I tried reading that code, but I am lost since I am very new to Python. Here is the link to the code:
https://github.com/PX4/Firmware/blob/master/Tools/sdlog2/sdlog2_dump.py
Also, someone posted this, saying it might be the log format, but again I couldn't understand or run any of the code on that page.
http://dev.px4.io/advanced-ulog-file-format.html
A good starting point for parsing binary data is the struct module https://docs.python.org/3/library/struct.html and its unpack function. That's what the CSV dump routine you linked to is doing as well. If you walk through the process method, it's doing the following:
Read a chunk of binary data
Figure out if it has a valid header
Check the message type: if it's a FORMAT message, parse that. If it's a description message, parse that.
Dump out a CSV row
You could modify this code to essentially replace the __printCSVRow method with something that captures the data into a pandas dataframe (or other handy data structure) so that when the main routine is all done you can grab all the data from the dataframe and plot it.
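As a rough illustration of the struct approach, here is a sketch that decodes a single format-definition record. The layout (two marker bytes, a one-byte message type, then type, length, a 4-byte name, a 16-byte format string and a 64-byte column list) is an assumption read off the sample bytes in the question, not a specification of the log format, and parse_fmt_record is a hypothetical helper name:

```python
import struct

HEADER = b"\xa3\x95"   # two marker bytes seen at the start of each record
FMT_TYPE = 0x80        # message type that appears to mark a format definition
# Assumed payload layout: type, length, name, format string, column labels.
FMT_STRUCT = struct.Struct("<BB4s16s64s")


def _text(raw):
    """Decode a NUL-padded ASCII field."""
    return raw.split(b"\x00")[0].decode("ascii")


def parse_fmt_record(buf, offset=0):
    """Return a dict describing the format record at offset, or None."""
    end = offset + 3 + FMT_STRUCT.size
    if len(buf) < end:
        return None  # not enough data for a full record
    if buf[offset:offset + 2] != HEADER or buf[offset + 2:offset + 3] != bytes([FMT_TYPE]):
        return None  # not a format record
    msg_type, length, name, fmt, labels = FMT_STRUCT.unpack(buf[offset + 3:end])
    return {
        "type": msg_type,
        "length": length,
        "name": _text(name),
        "format": _text(fmt),
        "columns": _text(labels).split(","),
    }


# First record from the question, rebuilt with explicit padding.
sample = (
    b"\xa3\x95\x80\x80YFMT\x00"
    + b"BBnNZ".ljust(16, b"\x00")
    + b"Type,Length,Name,Format,Columns".ljust(64, b"\x00")
)
record = parse_fmt_record(sample)
```

Each format record describes how the data messages of one type are packed, so a fuller parser would keep these dicts in a lookup table keyed by type and use them to unpack the records that follow.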
I'm looking to generate PDFs from a Python application.
They start relatively simple, but some may become more complex (essentially letter-like documents, but they will later include watermarks, for example).
I've worked in raw PostScript before, and provided I can generate the correct headers etc. and end up with a valid file, I want to avoid complex libraries that may not do entirely what I want. Some seem to have bit-rotted and are no longer supported (pypdf and pypdf2), which is frustrating when I know PDF/PostScript can do exactly what I need. PDF content really isn't that complex.
I can generate EPS (Encapsulated PostScript) fine by just writing the appropriate text headers and my PostScript code to a file. But when inspecting PDFs, there is a little binary header I'm not sure how to generate.
I could generate an EPS and convert it. I'm not overly happy with this as the production environment is a Windows 2008 server (Dev is Ubuntu 12.04) and making something and converting it seems very silly.
Has anyone done this before?
Am I being pedantic by not wanting to use a library?
borrowed from ask.yahoo
A PDF file starts with "%PDF-1.1" if it is a version 1.1 type of PDF file. You can read PDF files ok when they don't have binary data objects stored in them, and you could even make one using Notepad if you didn't need to store a binary object like a Paint bitmap in it.
But after seeing the "%PDF-1.1" you ignore what's after that (Adobe Reader does, too) and go straight to the end of the file to where there is a line that says "%%EOF". That's always the last thing in the file; and if that's there you know that just a few characters before that place in the file there's the word "startxref" followed by a number. This number tells a reader program where to look in the file to find the start of the list of items describing the structure of the file. These items in the list can be page objects, dictionary objects, or stream objects (like the binary data of a bitmap), and each one has "obj" and "endobj" marking out where its description starts and ends.
For fairly simple PDF files, you might be able to type the text in just like you did with Notepad to make a working PDF file that Adobe Reader and other PDF viewer programs could read and display correctly.
Doing something like this is a challenge, even for a simple file, and you'd really have to know what you're doing to get any binary data into the file where it's supposed to go; but for character data, you'd just be able to type it in. And all of the commands used in the PDF are in the form of strings that you could type in. The hardest part is calculating those numbers that give the file offsets for items in the file (such as the number following "startxref").
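To make the offset bookkeeping concrete, here is a minimal sketch that writes a one-page PDF by hand and computes the cross-reference table as it goes. The function name build_pdf, the page size, the font and the text position are all arbitrary choices for illustration, and the text is assumed to be plain ASCII:

```python
def build_pdf(text):
    """Return the bytes of a one-page PDF showing text in Helvetica."""
    # Escape the characters that are special inside a PDF string literal.
    safe = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    stream = b"BT /F1 24 Tf 72 720 Td (" + safe.encode("ascii") + b") Tj ET"
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_pos = len(out)  # this is the number that follows "startxref"
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


pdf_bytes = build_pdf("Hello, world")
# To view the result: open("hello.pdf", "wb").write(pdf_bytes)
```

Because the offsets are taken from the length of the buffer at the moment each object is appended, they stay correct however the object bodies change. The file must be written in binary mode so that line endings are not altered on Windows.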
If the way the file format is laid out intrigues you, go ahead and read the PDF manual, which tells the whole story.
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
but really you should probably just use a library
Thanks to @LukasGraf for providing this link http://www.gnupdf.org/Introduction_to_PDF that shows how to create a simple hello world PDF from scratch.
As long as you're working in Python 2.7, Reportlab seems to be the best solution out there at the moment. It's quite full-featured, and can be a little complex to work with, depending on exactly what you're doing with it, but since you seem to be familiar with PDF internals in general hopefully the learning curve won't be too steep.
I recommend you to use a library. I spent a lot of time creating pdfme and learned a lot of things along the way, but it's not something you would do for a single project. If you want to use my library check the docs here.
I am converting a Python 2 program to Python 3 and I'm not sure about the approach to take.
The program reads in either a single email from STDIN, or file(s) are specified containing emails. The program then parses the emails and does some processing on them.
So we need to work with the raw data of the email input, to store it on disk and do an MD5 hash on it. We also need to work with the text of the email input in order to run it through the Python email parser and extract fields etc.
With Python 3 it is unclear to me how we should be reading in the data. I believe we need the raw binary data in order to do an md5 on it, and also to be able to write it to disk. I understand we also need it in text form to be able to parse it with the email library. Python 3 has made significant changes to the IO handling and text handling and I can't see the "correct" approach to read the email raw data and also use the same data in text form.
Can anyone offer general guidance on this?
The general guidance is to convert everything to unicode ASAP and keep it that way until the last possible minute.
Remember that str is the old unicode and bytes is the old str.
See http://docs.python.org/dev/howto/unicode.html for a start.
With Python 3 it is unclear to me how we should be reading in the data.
Specify the encoding when you open the file and it will automatically give you unicode. If you're reading from sys.stdin, you'll get unicode. You can read from sys.stdin.buffer to get binary data.
I believe we need the raw binary data in order to do an md5 on it
Yes, you do. Encode it when you need to hash it.
and also to be able to write it to disk.
You specify the encoding when you open the file you're writing it to, and the file object encodes it for you.
I understand we also need it in text form to be able to parse it with the email library.
Yep, but since it'll get decoded when you open the file, that's what you'll have.
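For the specific case in the question, one alternative worth considering (a sketch, not the approach described above) is to keep the input as bytes throughout: hash and store the bytes exactly as read, and let the email package do the decoding via email.message_from_bytes. The helper name load_email and the sample message are made up for illustration:

```python
import email
import hashlib


def load_email(raw):
    """Return (md5 hex digest, parsed message) for raw email bytes."""
    digest = hashlib.md5(raw).hexdigest()    # hash the exact bytes received
    message = email.message_from_bytes(raw)  # the parser handles decoding
    return digest, message


# In the real program the bytes would come from sys.stdin.buffer.read()
# or from open(path, "rb").read(); a literal stands in for them here.
raw_email = b"From: a@example.com\r\nSubject: Hi\r\n\r\nBody text\r\n"
digest, message = load_email(raw_email)
subject = message["Subject"]
```

This avoids a decode and re-encode round trip, so the bytes written to disk and the bytes hashed are identical to what arrived.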
That said, this question is really too open ended for Stack Overflow. When you have a specific problem / question, come back and we'll help.
Usually I would download it to a StringIO object, then run this:
m = magic.Magic()
m.from_buffer(thefile.read(1024))
But this time, I can't download the file, because the image might be 20 megabytes. I want to use python-magic to find the file type without downloading the entire file.
If python-magic can't do it, is the next best way to look at the MIME type in the headers? But how accurate is this?
I need accuracy.
You can call read(1024) without downloading the whole file:
thefile = urllib2.urlopen(someURL)
Then, just use your existing code. urlopen returns a file-like object, so this works naturally.
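A sketch of the partial read for Python 3, where urllib2.urlopen has become urllib.request.urlopen. The helper read_head is hypothetical, and guess_image_type is only a tiny stand-in that checks a few well-known signatures so the example runs without python-magic installed; in practice the bytes returned by read_head would be passed to python-magic instead:

```python
import io


def read_head(fileobj, size=1024):
    """Read at most size bytes from a file-like object."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = fileobj.read(remaining)
        if not chunk:  # end of data reached early
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


# Stand-in for python-magic: a few well-known leading signatures.
SIGNATURES = {
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}


def guess_image_type(head):
    """Return a MIME type for a known signature, or None."""
    for signature, mime in SIGNATURES.items():
        if head.startswith(signature):
            return mime
    return None


# With a real URL this would be:
#     from urllib.request import urlopen
#     with urlopen(some_url) as response:
#         head = read_head(response)
# An in-memory buffer stands in for the response here.
fake_response = io.BytesIO(b"\xff\xd8\xff\xe0" + b"\x00" * 5000)
head = read_head(fake_response)
mime = guess_image_type(head)
```

Closing the response after reading the first kilobyte means the rest of the body is not read by the program, although the server may already have sent some of it.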
If it is one of the common image formats like PNG or JPG, and you know the server is a reliable one, then the 'Content-Type' header can give you what you are looking for.
But this is not as reliable as passing a portion of the file to python-magic, because the server may not have identified the proper format and may have set the header to application/octet-stream. This is more common with video formats; for pictures, I think Content-Type is okay.
Sorry, I can't find any statistics or research on Content-Type's accuracy. The suggested answer of downloading only part of the file is a good option too.
I'm trying to use the nntplib that comes with Python to make some posts to Usenet. However, I can't figure out how to post binary files using the .post method.
I can post plain text files just fine, but not binary files. Any ideas?
-- EDIT--
So thanks to Adrian's comment below I've managed to make one step towards my goal.
I now use the email library to make a multipart message and attach the binary files to the message. However, I can't seem to figure out how to pass that message directly to the nntplib post method.
I have to first write a temporary file, then read it back in to the nntplib method. There has to be a way to do this all in memory... any suggestions?
You have to MIME-encode your post: a binary post in an NNTP newsgroup is like a mail with an attachment.
The file has to be encoded in ASCII, generally using the base64 encoding, then the encoded file is packaged into a multipart MIME message and posted...
Have a look at the email module: it implements all that you want.
I encourage you to read RFC 3977, which is the official standard defining the NNTP protocol.
For the second part of your question:
Use StringIO to build a fake file object from a string (the post() method of nntplib accepts open file objects).
email.Message objects have an as_string() method to retrieve the content of the message as a plain string.
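A sketch of the in-memory approach for Python 3, where the message is serialised with as_bytes() and wrapped in io.BytesIO rather than StringIO, since nntplib's post() there expects a binary file object. The helper name build_post, the addresses and the group name are placeholders, and the actual posting call is left as a comment because it needs a live server:

```python
import io
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_post(sender, newsgroup, subject, body, filename, payload):
    """Build a multipart article and return it as an in-memory binary file."""
    msg = MIMEMultipart()
    msg["From"] = sender
    msg["Newsgroups"] = newsgroup
    msg["Subject"] = subject
    msg.attach(MIMEText(body))

    part = MIMEApplication(payload)  # base64-encodes the bytes by default
    part.add_header("Content-Disposition", "attachment", filename=filename)
    msg.attach(part)

    return io.BytesIO(msg.as_bytes())  # no temporary file on disk


article = build_post(
    "poster@example.com", "alt.test", "Binary post",
    "See attachment.", "a.bin", b"\x00\x01\x02",
)
article_bytes = article.getvalue()
# Posting would then look like this (server is an nntplib.NNTP instance):
#     server.post(article)
```

Note that nntplib was removed from the standard library in Python 3.13, so on newer versions it has to be installed separately.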