I understand that opening a file just creates a file handle that takes a fixed amount of memory irrespective of the size of the file.
Django has a type called InMemoryUploadedFile that represents files uploaded via forms.
I get the handle to my file object inside the django view like this:
file_object = request.FILES["uploadedfile"]
This file_object has type InMemoryUploadedFile.
Now we can see for ourselves that file_object has the method .read(), which is used to read files into memory.
bytes = file_object.read()
Wasn't file_object of type InMemoryUploadedFile already "in memory"?
The read() method on a file object is a way to access content from within a file object irrespective of whether that file is in memory or stored on the disk. It is similar to other utility file access methods like readlines or seek.
The behavior is similar to what is built into Python, which in turn is built over the C library's fread() function.
Read at most size bytes from the file (less if the read hits EOF
before obtaining size bytes). If the size argument is negative or
omitted, read all data until EOF is reached. The bytes are returned as
a string object. An empty string is returned when EOF is encountered
immediately. (For certain files, like ttys, it makes sense to continue
reading after an EOF is hit.) Note that this method may call the
underlying C function fread() more than once in an effort to acquire
as close to size bytes as possible. Also note that when in
non-blocking mode, less data than was requested may be returned, even
if no size parameter was given.
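As a small illustration of the semantics quoted above, here is a sketch using io.BytesIO as a stand-in for any file-like object. Note that the quoted text describes Python 2; in Python 3 a binary stream returns bytes objects rather than strings. The sample data is made up for the example.

```python
import io

# An in-memory stream standing in for any binary file-like object.
stream = io.BytesIO(b"hello world")

first = stream.read(5)    # read at most 5 bytes
rest = stream.read()      # size omitted: read everything up to EOF
at_eof = stream.read(5)   # already at EOF: an empty bytes object comes back

print(first, rest, at_eof)
```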
On the question of where exactly the InMemoryUploadedFile is stored, it is a bit more complicated.
Before you save uploaded files, the data needs to be stored somewhere.
By default, if an uploaded file is smaller than 2.5 megabytes, Django
will hold the entire contents of the upload in memory. This means that
saving the file involves only a read from memory and a write to disk
and thus is very fast.
However, if an uploaded file is too large, Django will write the
uploaded file to a temporary file stored in your system’s temporary
directory. On a Unix-like platform this means you can expect Django to
generate a file called something like /tmp/tmpzfp6I6.upload. If an
upload is large enough, you can watch this file grow in size as Django
streams the data onto disk.
These specifics – 2.5 megabytes; /tmp; etc. – are simply “reasonable
defaults”. Read on for details on how you can customize or completely
replace upload behavior.
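The "in memory below a threshold, on disk above it" idea can be tried out with the standard library's tempfile.SpooledTemporaryFile. This is only an analogy and not how Django implements its upload handlers; the 1 KB threshold is a made-up value for the demonstration, and _rolled is a private CPython attribute that is read here purely for inspection.

```python
import tempfile

# Hypothetical threshold for illustration; Django's documented default is 2.5 MB.
THRESHOLD = 1024

def is_on_disk(spooled):
    # _rolled is a private CPython attribute; it is only inspected here.
    return spooled._rolled

small = tempfile.SpooledTemporaryFile(max_size=THRESHOLD)
small.write(b"x" * 100)              # stays in an in-memory buffer

large = tempfile.SpooledTemporaryFile(max_size=THRESHOLD)
large.write(b"x" * (THRESHOLD + 1))  # exceeds the threshold, rolls over to a real temp file

small_on_disk = is_on_disk(small)
large_on_disk = is_on_disk(large)

# Either way, the object is read back through the same file API.
small.seek(0)
large.seek(0)
small_len = len(small.read())
large_len = len(large.read())
small.close()
large.close()
```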
One thing to consider is that in Python, file-like objects have an API that is adhered to pretty strictly. They are abstractions over I/O streams, which makes code very flexible: your code does not have to worry about where the data is coming from (memory, filesystem, network, etc.).
File-like objects usually define a couple of methods, one of which is read.
I am not sure of the actual implementation of InMemoryUploadedFile, or how these objects are generated or where they are stored (I am assuming they are totally in memory, though), but you can rest assured that they are file-like objects and have a read method, because they adhere to the file API.
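To make the point about the shared file API concrete, here is a minimal sketch of a function that only relies on read() and therefore works the same on an in-memory stream and on a real file. The function name sha256_of and the sample payload are hypothetical, chosen for this example.

```python
import hashlib
import io
import tempfile

def sha256_of(fileobj, chunk_size=64 * 1024):
    """Hash any binary file-like object by reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:          # empty result means EOF
            break
        digest.update(chunk)
    return digest.hexdigest()

payload = b"example upload contents" * 1000

# The same function is used for a stream held in memory...
memory_digest = sha256_of(io.BytesIO(payload))

# ...and for a stream backed by a file on disk.
with tempfile.TemporaryFile() as on_disk:
    on_disk.write(payload)
    on_disk.seek(0)
    disk_digest = sha256_of(on_disk)
```

The caller never needs to know which kind of object it was handed, which is why an uploaded-file object can expose read() whether or not its data already lives in memory.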
For the implementation you could start checking out the source:
https://github.com/django/django/blob/master/django/core/files/uploadedfile.py#L90
https://github.com/django/django/blob/master/django/core/files/base.py
https://github.com/django/django/blob/master/django/core/files/uploadhandler.py
Related
I believe I have a very simple requirement for which a solution has befuddled me. I am new to the azure-python-sdk and have had little success with its new blob streaming functionality.
Some context
I have used the Java SDK for several years now. Each CloudBlockBlob object has a BlobInputStream and a BlobOutputStream object. When a BlobInputStream is opened, one can invoke its many functions (most notably its read() function) to retrieve data in a true-streaming fashion. A BlobOutputStream, once retrieved, has a write(byte[] data) function where one can continuously write data as frequently as they want until the close() function is invoked. So, it was very easy for me to:
Get a CloudBlockBlob object, open its BlobInputStream and essentially get back an InputStream that was 'tied' to the CloudBlockBlob. It usually maintained 4MB of data - at least, that's what I understood. When some amount of data is read from its buffer, a new (same amount) of data is introduced, so it always has approximately 4MB of new data (until all data is retrieved).
Perform some operations on that data.
Retrieve the CloudBlockBlob object that I am uploading to, get its BlobOutputStream, and write to it the data I did some operations on.
A good example of this is if I wanted to compress a file. I had a GzipStreamReader class that would accept a BlobInputStream and a BlobOutputStream. It would read data from the BlobInputStream and, whenever it had compressed some amount of data, write to the BlobOutputStream. It could call write() as many times as it wished; when it finished reading all the data, it would close both the input and output streams, and all was good.
Now for Python
Now, the Python SDK is a little different, and obviously for good reason; the io module works differently than Java's InputStream and OutputStream classes (which the Blob{Input/Output}Stream classes inherit from). I have been struggling to understand how streaming truly works in Azure's Python SDK. To start out, I am just trying to see how the StorageStreamDownloader class works. It seems like the StorageStreamDownloader is what holds the 'connection' to the BlockBlob object I am reading data from. If I want to put the data in a stream, I would make a new io.BytesIO() and pass that stream to the StorageStreamDownloader's readinto method.
For uploads, I would call the BlobClient's upload method. The upload method accepts a data parameter that is of type Union[Iterable[AnyStr], IO[AnyStr]].
I don't want to go into too much detail about what I understand, because what I understand and what I have done have gotten me nowhere. I am suspicious that I am expecting something that only the Java SDK offers. But, overall, here are the problems I am having:
When I call download_blob, I get back a StorageStreamDownloader with all the data in the blob. Some investigation has shown that I can use the offset and length to download the amount of data I want. Perhaps I can call it once with a download_blob(offset=0, length=4MB), process the data I get back, then again call download_blob(offset=4MB, length=4MB), process the data, etc. This is unfavorable. The other thing I could do is utilize the max_chunk_get_size parameter for the BlobClient and turn on the validate_content flag (make it true) so that the StorageStreamDownloader only downloads 4MB. But this all results in several problems: that's not really streaming from a stream object. I'll still have to call download and readinto several times. And fine, I would do that, if it weren't for the second problem:
How the heck do I stream an upload? The upload can take a stream. But if the stream doesn't auto-update itself, then I can only upload once, because all the blobs I deal with must be BlockBlobs. The docs for the upload_blob function say that I can provide a param overwrite that does:
keyword bool overwrite: Whether the blob to be uploaded should overwrite the current data.
If True, upload_blob will overwrite the existing data. If set to False, the
operation will fail with ResourceExistsError. The exception to the above is with Append
blob types: if set to False and the data already exists, an error will not be raised
and the data will be appended to the existing blob. If set overwrite=True, then the existing
append blob will be deleted, and a new one created. Defaults to False.
And this makes sense because BlockBlobs, once written to, cannot be written to again. So AFAIK, you can't 'stream' an upload. If I can't have a stream object that is directly tied to the blob, or holds all the data, then the upload() function will terminate as soon as it finishes, right?
Okay. I am certain I am missing something important. I am also somewhat ignorant when it comes to the io module in Python. Though I have developed in Python for a long time, I never really had to deal with that module too closely. I am sure I am missing something, because this functionality is very basic and exists in all the other azure SDKs I know about.
To recap
Everything I said above can honestly be ignored, and only this portion read; I am just trying to show I've done some due diligence. I want to know how to stream data from a blob, process the data I get in a stream, then upload that data. I cannot be receiving all the data in a blob at once. Blobs are likely to be over 1GB and all that pretty stuff. I would honestly love some example code that shows:
Retrieving some data from a blob (the data received in one call should not be more than 10MB) in a stream.
Compressing the data in that stream.
Upload the data to a blob.
This should work for blobs of all sizes; whether it's 1MB or 10MB or 10GB should not matter. Step 2 can be anything really; it can also be nothing. Just as long as data is being downloaded, inserted into a stream, then uploaded, that would be great. Of course, the other extremely important constraint is that the data per 'download' shouldn't be an amount more than 10MB.
I hope this makes sense! I just want to stream data. This shouldn't be that hard.
Edit:
Some people may want to close this and claim the question is a duplicate. I have forgotten to include something very important: I am currently using the newest, most up-to-date azure-sdk version. My azure-storage-blob package's version is 12.5.0. There have been other questions similar to what I have asked for severely outdated versions. I have searched for other answers, but haven't found any for 12+ versions.
If you want to download an Azure blob in chunks, process each chunk, and upload each processed chunk to an Azure blob, please refer to the following code:
import uuid
from azure.storage.blob import BlobBlock, BlobClient
key = '<account key>'
source_blob_client = BlobClient(account_url='https://andyprivate.blob.core.windows.net',
                                container_name='',
                                blob_name='',
                                credential=key,
                                max_chunk_get_size=4*1024*1024,  # each chunk is at most 4 MB
                                max_single_get_size=4*1024*1024)
des_blob_client = BlobClient(account_url='https://<account name>.blob.core.windows.net',
                             container_name='',
                             blob_name='',
                             credential=key)
stream = source_blob_client.download_blob()
block_list = []
# read the data chunk by chunk
for chunk in stream.chunks():
    # process your data here; this placeholder passes the chunk through unchanged
    processed = chunk
    # use the Put Block REST API to upload the processed chunk to Azure storage
    blk_id = str(uuid.uuid4())
    des_blob_client.stage_block(block_id=blk_id, data=processed)
    block_list.append(BlobBlock(block_id=blk_id))
# use the Put Block List REST API to commit the staged blocks so they make up one blob
des_blob_client.commit_block_list(block_list)
Besides, if you just want to copy one blob from one storage location to another, you can directly use the method start_copy_from_url.
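For the "process your data" step in the loop above, the question mentions gzip compression. Here is a minimal sketch of compressing an iterable of chunks incrementally with the standard library, without ever holding the whole input in memory. The helper name gzip_chunks is hypothetical, and the Azure calls are deliberately left out so the example runs on its own.

```python
import gzip
import zlib

def gzip_chunks(chunks):
    """Yield gzip-compressed pieces for an iterable of bytes chunks."""
    # wbits=31 asks zlib to write a gzip header and trailer around the data.
    compressor = zlib.compressobj(wbits=31)
    for chunk in chunks:
        piece = compressor.compress(chunk)
        if piece:  # the compressor may buffer input and return nothing yet
            yield piece
    tail = compressor.flush()  # emit whatever is still buffered, plus the trailer
    if tail:
        yield tail

# Stand-in for the chunks produced by a download; sizes are arbitrary.
source_chunks = [b"a" * 4096, b"b" * 4096, b"c" * 100]
compressed = b"".join(gzip_chunks(source_chunks))
```

In the upload loop, one would presumably iterate over gzip_chunks(stream.chunks()) and stage each yielded piece as a block. Because compressed pieces can be much smaller than the input chunks, it may be worth accumulating several pieces before staging a block, since a block blob can only hold a limited number of blocks.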
I have a python script which will run various other scripts when it sees various files have been updated. It rapidly polls the files to check for updates by looking at the file modified dates.
For the most part this has worked as expected. When one of my scripts updates a file, another script is triggered and the appropriate action(s) are taken. For reference I am using pickles as the file type.
However, adding a new file and corresponding script into the mix just now, I've noticed an issue where the file has its modified date updated twice. Once when I perform the pickle.dump() and again when I exit the "with" statement (when the file closes). This means that the corresponding actions trigger twice rather than once. I guess this makes sense but what's confusing is this behaviour doesn't happen with any of my other files.
I know a simple workaround would be to poll the files slightly less frequently since the gap between the file updates is extremely small. But I want to understand why this issue is occurring sometimes but not at other times.
I think what you observe is 2 actions: file created and file updated.
To resolve this, create and populate the file outside of the monitored folders, and once the "with" block is over (the file is closed), move it from the temporary location to its proper place.
To do this, look at the tempfile module in the standard library.
If the pickle is big enough (typically somewhere around 4+ KB, though it will vary by OS/file system), this would be expected behavior. The majority of the pickle would be written during the dump call as buffers filled and got written, but whatever fraction doesn't consume the full file buffer would be left in the buffer until the file is closed (which implicitly flushes any outstanding buffered data before closing the handle).
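The buffering effect described above can be observed directly. This is a small sketch, assuming CPython's default buffered writer; the 4096-byte buffer size and the 100-byte payload are arbitrary values picked for the demonstration.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

with open(path, "wb", buffering=4096) as f:
    f.write(b"x" * 100)                        # smaller than the buffer, so it stays in memory
    size_before_close = os.path.getsize(path)  # nothing has reached the file yet
size_after_close = os.path.getsize(path)       # closing flushed the buffered bytes

os.remove(path)
print(size_before_close, size_after_close)
```

A write larger than the buffer would reach the file during the write call itself, with any remainder following at close time, which would account for two separate modification times.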
I agree with the other answer that the usual solution is to write the file in a different folder (but on the same file system), then immediately after closing it, use os.replace to perform an atomic rename that moves it from the temporary location to the final location, so there is no gap between file open, file population, and file close; the file is either there in its entirety, or not at all.
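A minimal sketch of that write-then-rename approach follows. The function name atomic_pickle_dump, the staging_dir parameter, and the file name state.pkl are hypothetical; the staging directory is assumed to be on the same file system as the watched one and outside the folders being polled.

```python
import os
import pickle
import tempfile

def atomic_pickle_dump(obj, final_path, staging_dir):
    """Pickle obj into a temp file in staging_dir, then move it into place."""
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
        # The file is complete and closed before it ever appears at final_path.
        # os.replace is only atomic when both paths are on the same file system.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

# Usage: two temp directories stand in for the watched and staging folders.
watched = tempfile.mkdtemp()
staging = tempfile.mkdtemp()
target = os.path.join(watched, "state.pkl")

atomic_pickle_dump({"answer": 42}, target, staging)
with open(target, "rb") as f:
    loaded = pickle.load(f)
leftovers = os.listdir(staging)
```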
I'd like to have a way to write Unicode text output to a temporary file created with tempfile API, which would support Python 3 style options for encoding and newline conversion, but would work also on Python 2.7 (for unicode values).
To open files with regular predictable names, a portable way is provided by io.open. But with temporary files, the secure way is to get an OS handle to the file, to ensure that the file name cannot be hijacked by a concurrent malicious process. There are no io workalikes to tempfile.NamedTemporaryFile or os.fdopen, and on Python 2.7 there are issues with the file objects obtained that way:
the built-in file objects cannot be wrapped by io.TextIOWrapper, which supports both encoding and newline conversion;
the codecs API can produce an encoding writer, but that does not perform newline conversion. The underlying file must be opened in binary mode, otherwise the same code breaks in Python 3 (and it's generally not sane to expect correct newline conversion on arbitrary character-encoded data).
I've come up with two ways to deal with the portability problem, each of which has certain disadvantages:
Close the file object (or the OS descriptor) without removing the file, and reopen the file by name with io.open. When using NamedTemporaryFile, this means the delete construction parameter has to be set to false and the user has the responsibility to delete the file when it's no longer needed. There is also an added security hazard, in the rather unusual case when the directory where temporary file is created is writable to potential attackers and the sticky bit is not set in its permission mode bits.
Write the entire output to an io.StringIO buffer created with newline parameter as appropriate, then write the buffered string into the encoding writer obtained from codecs. This is bad for performance and memory usage on large files.
Are there other alternatives?
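One further alternative worth trying: io.open is documented to accept an integer file descriptor in place of a file name, so the descriptor returned by tempfile.mkstemp can be wrapped directly, with encoding and newline handling applied by the io layer. The sketch below was written against Python 3; I believe the io module in Python 2.7 behaves the same way, provided the text is a unicode object, but treat that as an assumption to verify. The function name write_text_tempfile is hypothetical.

```python
import io
import os
import tempfile

def write_text_tempfile(text, encoding="utf-8", newline=None):
    """Write text to a securely created temp file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        # io.open wraps the already-open descriptor; the name is never reopened.
        with io.open(fd, "w", encoding=encoding, newline=newline) as f:
            f.write(text)
    except Exception:
        os.remove(path)
        raise
    return path  # the caller is responsible for deleting the file

path = write_text_tempfile(u"h\u00e9llo\nworld\n", newline="\r\n")
with io.open(path, "rb") as f:
    raw = f.read()
os.remove(path)
```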
I am using Pythonista on iOS, although I hope that does not matter. Some library calls require a path to a JSON file in order to render the content into a form or user interface. However, as far as I can see, there is no API to render the JSON data from a variable. I can read in the JSON data, write it out again as a file, and use that file, and it all works correctly. However, I would like to have some type of virtual filename that points to a file object in memory that I can pass to the function, so that the function being called is oblivious to the fact that the path I have provided is an in-memory file handle. I have searched here, and it seems this subject is not dealt with well, or I have searched incorrectly. I would imagine this functionality is very sought after.
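I am not aware of a portable way to obtain a path that points at an in-memory object, so the following is only a sketch of hiding the temporary-file workaround behind a context manager. The name json_as_path and the sample data are hypothetical, and whether the target library accepts a path outside its usual locations is an assumption that would need checking.

```python
import contextlib
import json
import os
import tempfile

@contextlib.contextmanager
def json_as_path(data):
    """Yield the path of a temp file holding data as JSON, then delete it."""
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
        yield path
    finally:
        os.remove(path)

# Usage: any function that expects a file path can be called inside the block.
with json_as_path({"title": "Settings", "rows": 3}) as p:
    with open(p) as f:
        round_tripped = json.load(f)
    existed_inside = os.path.exists(p)
existed_after = os.path.exists(p)
```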
I am reading an 800 GB xml file in python 2.7 and parsing it with an etree iterative parser.
Currently, I am just using open('foo.txt') with no buffering argument. I am a little confused whether this is the approach I should take or I should use a buffering argument or use something from io like io.BufferedReader or io.open or io.TextIOBase.
A point in the right direction would be much appreciated.
The standard open() function already, by default, returns a buffered file (if available on your platform). For file objects that is usually fully buffered.
Usually here means that Python leaves this to the C stdlib implementation; it uses a fopen() call (_wfopen() on Windows to support UTF-16 filenames), which means that the default buffering for a file is chosen; on Linux I believe that would be 8 KB. For a pure-read operation like XML parsing this type of buffering is exactly what you want.
The XML parsing done by iterparse reads the file in chunks of 16384 bytes (16kb).
If you want to control the buffersize, use the buffering keyword argument:
open('foo.xml', buffering=(2<<16) + 8) # buffer enough for 8 full parser reads
which will override the default buffer size (which I'd expect to match the file block size or a multiple thereof). According to this article increasing the read buffer should help, and using a size at least 4 times the expected read block size plus 8 bytes is going to improve read performance. In the above example I've set it to 8 times the ElementTree read size.
The io.open() function represents the new Python 3 I/O structure of objects, where I/O has been split up into a new hierarchy of class types to give you more flexibility. The price is more indirection, more layers for the data to have to travel through, and the Python C code does more work itself instead of leaving that to the OS.
You could try and see if io.open('foo.xml', 'rb', buffering=2<<16) is going to perform any better. Opening in rb mode will give you an io.BufferedReader instance.
You do not want to use io.TextIOWrapper; the underlying expat parser wants raw data as it'll decode your XML file encoding itself. It would only add extra overhead; you get this type if you open in r (textmode) instead.
Using io.open() may give you more flexibility and a richer API, but the underlying C file object is opened using open() instead of fopen(), and all buffering is handled by the Python io.BufferedIOBase implementation.
Your problem will be processing this beast, not the file reads, I think. The disk cache will be pretty much shot anyway when reading a 800GB file.
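For reference, a minimal sketch of the iterative parsing pattern discussed here, fed from a binary stream so the parser decodes the bytes itself. The function name count_records, the tag name rec, and the tiny sample document are made up; with a real file one would pass the object returned by open(..., 'rb', buffering=...) instead of the BytesIO.

```python
import io
import xml.etree.ElementTree as ET

def count_records(binary_file, tag):
    """Count elements named tag, discarding each one once it is processed."""
    count = 0
    for event, elem in ET.iterparse(binary_file, events=("end",)):
        if elem.tag == tag:
            count += 1
            elem.clear()  # drop the element's children and text to limit memory use
    return count

sample = io.BytesIO(b"<root><rec id='1'/><rec id='2'/><other/><rec id='3'/></root>")
total = count_records(sample, "rec")
```

Note that elem.clear() empties each element but the root still keeps a reference to it, so for a file of this size some extra pruning of the root's children is likely to be needed as well.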
Have you tried a lazy function?: Lazy Method for Reading Big File in Python?
This seems to already answer your question. However, I would consider using this method to write your data to a database. MySQL is free: http://dev.mysql.com/downloads/ , and Oracle's NoSQL database is also free and might be a little more tailored to operations involving writing 800 GB of data, or similar amounts: http://www.oracle.com/technetwork/database/nosqldb/downloads/default-495311.html
I haven't tried it with such epic XML files, but last time I had to deal with large (and relatively simple) XML files, I used a SAX parser.
It basically gives you callbacks for each "event" and leaves it to you to store the data you need. You can give an open file so you don't have to read it in all at once.
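A minimal sketch of the callback style described above, using the standard library's xml.sax module. The handler class TagCounter and the sample document are hypothetical; a real run would pass an open file (or a file name) to xml.sax.parse instead of the BytesIO.

```python
import io
import xml.sax

class TagCounter(xml.sax.ContentHandler):
    """Count how many times each element name starts."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once for every opening tag; nothing else is kept in memory.
        self.counts[name] = self.counts.get(name, 0) + 1

handler = TagCounter()
source = io.BytesIO(b"<root><rec/><rec/><other/></root>")
xml.sax.parse(source, handler)  # the parser pulls data from the stream incrementally
```

Since the handler decides what to store, memory use depends on what the callbacks keep rather than on the size of the document.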