Text output to a temporary file, compatible with Python 2.7?

I'd like to have a way to write Unicode text output to a temporary file created with the tempfile API, which would support Python 3-style options for encoding and newline conversion, but would also work on Python 2.7 (for unicode values).
To open files with regular predictable names, a portable way is provided by io.open. But with temporary files, the secure way is to get an OS handle to the file, to ensure that the file name cannot be hijacked by a concurrent malicious process. There are no io workalikes to tempfile.NamedTemporaryFile or os.fdopen, and on Python 2.7 there are issues with the file objects obtained that way:
the built-in file objects cannot be wrapped by io.TextIOWrapper, which supports both encoding and newline conversion;
the codecs API can produce an encoding writer, but that does not perform newline conversion. The underlying file must be opened in binary mode; otherwise the same code breaks in Python 3 (and it's generally not sane to expect correct newline conversion on arbitrary character-encoded data).
I've come up with two ways to deal with the portability problem, each of which has certain disadvantages:
Close the file object (or the OS descriptor) without removing the file, and reopen the file by name with io.open (sketched after this list). When using NamedTemporaryFile, this means the delete construction parameter has to be set to False, and the user takes on the responsibility of deleting the file when it's no longer needed. There is also an added security hazard in the rather unusual case where the directory in which the temporary file is created is writable by potential attackers and the sticky bit is not set in its permission mode bits.
Write the entire output to an io.StringIO buffer created with the newline parameter as appropriate, then write the buffered string into the encoding writer obtained from codecs. This is bad for performance and memory usage on large files.
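For reference, a minimal sketch of the first approach, assuming UTF-8 output and Unix newlines (the file name, suffix, and contents are placeholders; cleanup is the caller's responsibility, as noted above):

import io
import os
import tempfile

tmp = tempfile.NamedTemporaryFile(suffix='.txt', delete=False)
tmp.close()  # keep the name; drop the built-in file object we cannot wrap
try:
    # io.open supports encoding and newline on both Python 2.7 and 3
    with io.open(tmp.name, 'w', encoding='utf-8', newline='\n') as out:
        out.write(u'unicode text, portable newlines\n')
    # ... hand tmp.name to whatever consumes the file ...
finally:
    os.remove(tmp.name)  # delete=False means we must clean up ourselves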
Are there other alternatives?

Related

Compressing text string with existing compression header

I wish to compress a given string with a pre-existing header retrieved from an already compressed file in an archive (a local file header).
I have looked at zlib, and while its compression/decompression works nicely, I cannot find an option to set the compression header.
I want to avoid decompressing a file, adding a string to it, and then re-compressing it. Instead, I simply want to "append" a given string to a given compressed file.
I have made attempts using Python's existing zipfile module, trying to modify it to deal with a pre-set header; however, from this I can conclude that the zipfile module relies too heavily on the zlib library for this to be possible.
While my attempts have been in Python, I am happy to use any programming language.
What you want to do is more complicated than you think. However, the code has already been written. Look at gzlog.h and gzlog.c in the examples directory of the zlib distribution.

Is InMemoryUploadedFile really "in memory"?

I understand that opening a file just creates a file handle that takes a fixed amount of memory irrespective of the size of the file.
Django has a type called InMemoryUploadedFile that represents files uploaded via forms.
I get the handle to my file object inside the django view like this:
file_object = request.FILES["uploadedfile"]
This file_object has type InMemoryUploadedFile.
Now we can see for ourselves that file_object has the method .read(), which is used to read files into memory.
bytes = file_object.read()
Wasn't file_object of type InMemoryUploadedFile already "in memory"?
The read() method on a file object is a way to access its content, irrespective of whether that file is in memory or stored on disk. It is similar to other utility file-access methods like readlines or seek.
The behavior is similar to that of Python's built-in file objects, which are in turn built on the C library's fread() function. The Python 2 documentation for file.read() describes it as follows:
Read at most size bytes from the file (less if the read hits EOF before obtaining size bytes). If the size argument is negative or omitted, read all data until EOF is reached. The bytes are returned as a string object. An empty string is returned when EOF is encountered immediately. (For certain files, like ttys, it makes sense to continue reading after an EOF is hit.) Note that this method may call the underlying C function fread() more than once in an effort to acquire as close to size bytes as possible. Also note that when in non-blocking mode, less data than was requested may be returned, even if no size parameter was given.
On the question of where exactly the InMemoryUploadedFile is stored, the answer is a bit more complicated. The Django documentation on file uploads explains:
Before you save uploaded files, the data needs to be stored somewhere. By default, if an uploaded file is smaller than 2.5 megabytes, Django will hold the entire contents of the upload in memory. This means that saving the file involves only a read from memory and a write to disk and thus is very fast.
However, if an uploaded file is too large, Django will write the uploaded file to a temporary file stored in your system's temporary directory. On a Unix-like platform this means you can expect Django to generate a file called something like /tmp/tmpzfp6I6.upload. If an upload is large enough, you can watch this file grow in size as Django streams the data onto disk.
These specifics – 2.5 megabytes; /tmp; etc. – are simply "reasonable defaults". Read on for details on how you can customize or completely replace upload behavior.
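As a rough illustration of that behavior (my sketch, not from the original answer), you can check inside the view which of the two cases you got, given file_object = request.FILES["uploadedfile"] as in the question:

from django.core.files.uploadedfile import (
    InMemoryUploadedFile,
    TemporaryUploadedFile,
)

if isinstance(file_object, InMemoryUploadedFile):
    # Small upload: the whole payload was held in memory.
    print("held in memory")
elif isinstance(file_object, TemporaryUploadedFile):
    # Large upload: Django streamed it to a temporary file on disk.
    print(file_object.temporary_file_path())
data = file_object.read()  # works the same way in either case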
One thing to consider is that in Python, file-like objects have an API that is adhered to fairly strictly. This allows code to be very flexible: file-like objects are abstractions over I/O streams, which let your code avoid worrying about where the data is coming from, i.e. memory, the filesystem, the network, etc.
File-like objects usually define a handful of methods, one of which is read.
I am not sure of the actual implementation of InMemoryUploadedFile, how such objects are generated, or where they are stored (I am assuming they are held entirely in memory), but you can rest assured that they are file-like objects and provide a read method, because they adhere to the file API (see the sketch below the source links).
For the implementation you could start checking out the source:
https://github.com/django/django/blob/master/django/core/files/uploadedfile.py#L90
https://github.com/django/django/blob/master/django/core/files/base.py
https://github.com/django/django/blob/master/django/core/files/uploadhandler.py
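To make the file-like abstraction concrete, here is a small sketch (mine, not from the answer above) of code that relies only on read and therefore does not care where the bytes live:

import io
import tempfile

def first_bytes(fileobj, n=16):
    # Any object exposing .read() will do: an in-memory buffer, a disk
    # file, or a Django upload object.
    return fileobj.read(n)

in_memory = io.BytesIO(b"data held entirely in memory")
on_disk = tempfile.TemporaryFile()
on_disk.write(b"data spooled to a file on disk")
on_disk.seek(0)

print(first_bytes(in_memory))  # b'data held entire'
print(first_bytes(on_disk))    # b'data spooled to '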

Efficient reading of 800 GB XML file in Python 2.7

I am reading an 800 GB XML file in Python 2.7 and parsing it with an etree iterative parser.
Currently, I am just using open('foo.txt') with no buffering argument. I am a little confused about whether this is the approach I should take, or whether I should use a buffering argument or something from io such as io.BufferedReader, io.open, or io.TextIOBase.
A pointer in the right direction would be much appreciated.
The standard open() function already, by default, returns a buffered file (if available on your platform). For file objects that is usually fully buffered.
"Usually" here means that Python leaves this to the C stdlib implementation; it uses a fopen() call (wfopen() on Windows, to support UTF-16 filenames), which means that the default buffering for a file is chosen; on Linux I believe that would be 8 KB. For a pure-read operation like XML parsing this type of buffering is exactly what you want.
The XML parsing done by iterparse reads the file in chunks of 16384 bytes (16 KB).
If you want to control the buffersize, use the buffering keyword argument:
open('foo.xml', buffering=(2<<16) + 8) # buffer enough for 8 full parser reads
which will override the default buffer size (which I'd expect to match the file block size or a multiple thereof). According to this article, increasing the read buffer should help, and using a size at least 4 times the expected read block size plus 8 bytes is going to improve read performance. In the above example I've set it to 8 times the ElementTree read size.
The io.open() function represents the new Python 3 I/O structure of objects, where I/O has been split up into a new hierarchy of class types to give you more flexibility. The price is more indirection, more layers for the data to have to travel through, and the Python C code does more work itself instead of leaving that to the OS.
You could try and see if io.open('foo.xml', 'rb', buffering=2<<16) is going to perform any better. Opening in rb mode will give you an io.BufferedReader instance.
You do not want to use io.TextIOWrapper; the underlying expat parser wants raw data, as it will decode your XML file's encoding itself. The wrapper would only add extra overhead; you get this type if you open in r (text mode) instead.
Using io.open() may give you more flexibility and a richer API, but the underlying C file object is opened using open() instead of fopen(), and all buffering is handled by the Python io.BufferedIOBase implementation.
Your problem will be processing this beast, not the file reads, I think. The disk cache will be pretty much shot anyway when reading an 800 GB file.
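Putting the suggestions above together, a rough sketch of a buffered, incremental parse that frees elements as it goes (the tag name 'record' and the file name are placeholders, not from the question):

try:
    import xml.etree.cElementTree as etree  # Python 2.7, as in the question
except ImportError:
    import xml.etree.ElementTree as etree
import io

with io.open('foo.xml', 'rb', buffering=2 << 16) as source:
    for event, elem in etree.iterparse(source, events=('end',)):
        if elem.tag == 'record':
            # ... process the completed element here ...
            elem.clear()  # release the subtree so memory stays bounded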
Have you tried a lazy function? See: Lazy Method for Reading Big File in Python?
This seems to already answer your question. However, I would consider using this method to write your data to a database; MySQL is free: http://dev.mysql.com/downloads/, and NoSQL is also free and might be a little better tailored to operations involving 800 GB of data, or similar amounts: http://www.oracle.com/technetwork/database/nosqldb/downloads/default-495311.html
I haven't tried it with such epic XML files, but the last time I had to deal with large (and relatively simple) XML files, I used a SAX parser.
It basically gives you callbacks for each "event" and leaves it to you to store the data you need. You can give it an open file so you don't have to read it all in at once.
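A bare-bones sketch of that callback style using the standard library's SAX parser (the element name 'record' and file name are placeholders):

import xml.sax

class RecordHandler(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.parts = []

    def startElement(self, name, attrs):
        if name == 'record':
            self.parts = []  # start collecting text for a new record

    def characters(self, content):
        self.parts.append(content)

    def endElement(self, name):
        if name == 'record':
            text = ''.join(self.parts)
            # ... store or process the record's text here ...

xml.sax.parse('foo.xml', RecordHandler())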

does Python for windows ever insert '\r\n' when told to insert '\n'?

I use a PC at home and a Mac at work. I've never had any problems with line breaks in Python scripts or their outputs, but whenever I send something to my boss I get an angry e-mail back about Windows line breaks in it.
The most recent case was the output of a Python script where I'd told it to end every line with '\n', but on closer inspection (on my Mac at work) it seems that each line did in fact end with '\r\n'.
What's going on, and how do I stop it? I used to run all my scripts in a Linux virtual machine at home, but I found that too slow and fiddly; surely there's a simpler fix?
This is because you have files opened in text mode and Python is normalizing the newlines in accordance with the platform you're using (Windows uses \r\n and Linux just uses \n). You need to open files in binary mode, like this:
f = open("myfile.txt","wb")
It does the same thing in reverse when you read in files (\r\n will be replaced by \n) unless you also specify binary mode:
f = open("myfile.txt", "rb")
The behavior you are seeing is not Python-specific. It comes from the buffered file-handling functions in the C standard library that underlies Python and other high-level languages. Unless told not to, they convert newline characters to the current platform's native text-file line-break sequence when writing, and do the reverse when reading. See the documentation for fopen() on your local system for details. On Windows, this means \n will be converted to \r\n on writes.
The Python docs mention newline conversion and other open() mode options here.
One solution would be to use open("filename", "wb") instead of open("filename", "w") when opening the output file in the first place. That will avoid the automatic newline conversion. It ought to solve the problem for your boss, as long as your boss is using some form of Unix (including OS X). Unfortunately, it will also mean that some Windows text editors (e.g. Notepad) will present your file strangely:
Windows acts like a teletype
                            when it sees new lines
                                                   without carriage returns.
Another approach would be to convert your files as needed before sending them to someone who doesn't use Windows. Various conversion programs exist for this purpose, such as dos2unix and flip.
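To make the difference concrete, a small sketch of the two modes (Python 2 shown, matching the answers above; the file names are placeholders):

# Binary mode: '\n' is written byte-for-byte on every platform.
with open("unix_endings.txt", "wb") as f:
    f.write("first line\nsecond line\n")

# Text mode on Windows: each '\n' is silently translated to '\r\n'.
with open("windows_endings.txt", "w") as f:
    f.write("first line\nsecond line\n")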

File encryption with Python

Is there a way to encrypt files (.zip, .doc, .exe, ... any type of file) with Python?
I've looked at a bunch of crypto libraries for Python, including pycrypto and ezpycrypto, but as far as I can see they only offer string encryption.
In Python versions prior to 3.0, the read method of a file object returns a string. Provide this string to the encryption library of your choice; the resulting string can then be written to a file.
Keep in mind that on Windows-based operating systems, the default mode used when reading files may not accurately reproduce the contents of the file. I suggest you familiarize yourself with the nuances of file modes and how they behave on Windows-based OSes.
You can read the complete file into a string, encrypt it, and write the encrypted string to a new file. If the file is too large, you can read it in chunks.
Every time you .read from a file, you get a string (in Python < 3.0).
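As a rough sketch only, using pycrypto (which the question mentions): the key handling, cipher mode, and file names below are simplified assumptions for illustration, not a vetted design.

from Crypto import Random
from Crypto.Cipher import AES

key = b'0123456789abcdef'               # 16-byte demo key; derive a real key properly
iv = Random.new().read(AES.block_size)  # random initialization vector
cipher = AES.new(key, AES.MODE_CFB, iv)

with open('document.zip', 'rb') as infile, open('document.zip.enc', 'wb') as outfile:
    outfile.write(iv)                    # store the IV so the file can be decrypted later
    while True:
        chunk = infile.read(64 * 1024)   # read in 64 KB chunks, as suggested above
        if not chunk:
            break
        outfile.write(cipher.encrypt(chunk))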
