Writing a BytesIO object to a file, 'efficiently' - python

So a quick way to write a BytesIO object to a file would be to just use:
with open('myfile.ext', 'wb') as f:
    f.write(myBytesIOObj.getvalue())
myBytesIOObj.close()
However, if I wanted to iterate over myBytesIOObj as opposed to writing it in one chunk, how would I go about it? I'm on Python 2.7.1. Also, if the BytesIO is huge, would writing it by iteration be more efficient?
Thanks

shutil has a utility that will write the file efficiently. It copies in chunks, defaulting to 16K in Python 2.7 (later Python 3 releases use a larger default). Any multiple of 4K should be a good cross-platform chunk size. I chose 131072 rather arbitrarily because the file is really written to the OS cache in RAM before going to disk, so the chunk size isn't that big of a deal.
import shutil
myBytesIOObj.seek(0)
with open('myfile.ext', 'wb') as f:
    shutil.copyfileobj(myBytesIOObj, f, length=131072)
BTW, there was no need to close the file object at the end. The file is opened by the with statement, which acts as a context manager, so the file handle is closed automatically on exit from the with block.
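If you would rather iterate by hand, as the question asks, the loop below is a minimal sketch of roughly what shutil.copyfileobj does. The helper name write_in_chunks is made up for illustration, and a second BytesIO stands in for the real output file.

```python
import io

def write_in_chunks(src, dst, chunk_size=131072):
    """Copy src to dst in fixed-size chunks and return the bytes written."""
    total = 0
    src.seek(0)  # start from the beginning of the in-memory buffer
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty result means the end of the buffer
            break
        dst.write(chunk)
        total += len(chunk)
    return total

# In-memory objects stand in for the real source and destination here.
source = io.BytesIO(b"x" * 300000)
target = io.BytesIO()
written = write_in_chunks(source, target)
```

Iterating mainly avoids the extra full copy that getvalue() makes; the data already sits in RAM, so it is unlikely to be faster for small buffers.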

Since Python 3.2 it's possible to use the BytesIO.getbuffer() method as follows:
from io import BytesIO
buf = BytesIO(b'test')
with open('path/to/file', 'wb') as f:
    f.write(buf.getbuffer())
This way it doesn't copy the buffer's content, streaming it straight to the open file.
Note: The StringIO buffer doesn't support the getbuffer() protocol (as of Python 3.9).
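getbuffer() exposes the whole buffer regardless of the current position, so no seek is needed for the code above. If you read from the buffer instead (for example with read() or shutil.copyfileobj), you might want to set its position to the beginning first: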
buf.seek(0)
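As a small sketch of the getbuffer() behaviour, the example below moves the position forward first and still writes the full contents. The temporary path is just a placeholder for illustration.

```python
import os
import tempfile
from io import BytesIO

buf = BytesIO(b"hello world")
buf.read(6)  # move the position forward, as earlier reads would

view = buf.getbuffer()  # memoryview over the whole buffer, no copy made
path = os.path.join(tempfile.mkdtemp(), "out.bin")  # placeholder path
with open(path, "wb") as f:
    f.write(view)
view.release()  # the BytesIO cannot be resized or closed while a view exists

with open(path, "rb") as f:
    written = f.read()
```

Releasing the view matters if you intend to keep writing to, or close, the BytesIO afterwards.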

Related

Python: Read compressed (.gz) HDF file without writing and saving uncompressed file

I have a large number of compressed HDF files, which I need to read.
file1.HDF.gz
file2.HDF.gz
file3.HDF.gz
...
I can read in uncompressed HDF files with the following method
from pyhdf.SD import SD, SDC
import os
os.system('gunzip < file1.HDF.gz > file1.HDF')
HDF = SD('file1.HDF')
and repeat this for each file. However, this is more time consuming than I want.
I'm thinking it's possible that most of the time overhead comes from writing the compressed file to a new uncompressed version, and that I could speed it up if I were able to read an uncompressed version of the file into the SD function in one step.
Am I correct in this thinking? And if so, is there a way to do what I want?
According to the pyhdf package documentation, this is not possible.
__init__(self, path, mode=1)
SD constructor. Initialize an SD interface on an HDF file,
creating the file if necessary.
There is no other way to instantiate an SD object that takes a file-like object. This is likely because they are conforming to an external interface (NCSA HDF). The HDF format also normally handles massive files that are impractical to store in memory at one time.
Unzipping it as a file is likely your most performant option.
If you would like to stay in Python, use the gzip module (docs):
import gzip
import shutil
with gzip.open('file1.HDF.gz', 'rb') as f_in, open('file1.HDF', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)
sascha is correct that HDF's transparent compression is a better fit than gzipping. Nonetheless, if you can't control how the HDF files are stored, the gzip Python module (docs) is what you're looking for; it can read the data from these files.
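Since SD only accepts a path, one way to keep the uncompressed copies from piling up is a small context manager that decompresses to a temporary file and deletes it afterwards. This is only a sketch: the name gunzipped is made up for illustration, and it does not avoid the write itself.

```python
import contextlib
import gzip
import os
import shutil
import tempfile

@contextlib.contextmanager
def gunzipped(gz_path):
    """Yield the path of a temporary uncompressed copy of gz_path."""
    fd, tmp_path = tempfile.mkstemp(suffix=".HDF")
    try:
        # Wrap the descriptor first so it is always closed by the with block.
        with os.fdopen(fd, "wb") as f_out, gzip.open(gz_path, "rb") as f_in:
            shutil.copyfileobj(f_in, f_out)
        yield tmp_path
    finally:
        os.remove(tmp_path)  # delete the uncompressed copy when done

# Intended usage (assumes pyhdf is installed):
#     with gunzipped('file1.HDF.gz') as path:
#         HDF = SD(path)
```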

how to decompress .tar.bz2 in memory with python

How to decompress *.bz2 file in memory with python?
The bz2 file comes from a csv file.
I use the code below to decompress it in memory. It works, but it brings in some dirty data, such as the filename of the csv file and its author's name. Is there a better way to handle it?
#!/usr/bin/python
# -*- coding: utf-8 -*-
import StringIO
import bz2
with open("/app/tmp/res_test.tar.bz2", "rb") as f:
    content = f.read()
compressedFile = StringIO.StringIO(content)
decompressedFile = bz2.decompress(compressedFile.buf)
compressedFile.seek(0)
with open("/app/tmp/decompress_test", 'w') as outfile:
    outfile.write(decompressedFile)
I found this question, which is about gzip; however, my data is in bz2 format. I tried to do as instructed there, but it seems that bz2 cannot be handled this way.
Edit:
With either @metatoaster's answer or the code above, some extra dirty data ends up in the final decompressed file.
For example, my original data is attached below, in csv format, with the name res_test.csv:
Then I cd into the directory containing the file and compress it with tar -cjf res_test.tar.bz2 res_test.csv to get the compressed file res_test.tar.bz2. This file simulates the bz2 data that I will get from the internet, which I wish to decompress in memory without caching it to disk first. But what I get is the data below, which contains too much dirty data:
The data is still there, but submerged in noise. Is it possible to decompress it into pure data identical to the original, instead of decompressing it and then extracting the real data from all the noise?
For generic bz2 decompression, BZ2File class may be used.
from bz2 import BZ2File
with BZ2File("/app/tmp/res_test.tar.bz2") as f:
    content = f.read()
content should contain the decompressed contents of the file.
However, given that this is a tar file (an archive file that is normally extracted to disk as a directory of files), the tarfile module could be used instead, and it has extended mode flags for handling bz2. Assuming the target file contains a res_test.csv, the following can be used:
import tarfile
tf = tarfile.open('/app/tmp/res_test.tar.bz2', 'r:bz2')
csvfile = tf.extractfile('res_test.csv').read()
The r:bz2 flag opens the tar archive in a way that makes it possible to seek backwards. This is important because the streaming alternative, r|bz2, makes it impractical to read from the file objects returned by extractfile. The second line simply calls extractfile to return the contents of 'res_test.csv' from the archive as a string (a bytes object in Python 3).
The transparent open mode ('r:*') is typically recommended, however, so if the input tar file is compressed using gzip instead no failure will be encountered.
Naturally, the tarfile module's open function can also be used on arbitrary stream objects through its fileobj argument. If the file was already opened using BZ2File, this can also be used:
with BZ2File("/app/tmp/res_test.tar.bz2") as f:
    tf = tarfile.open(fileobj=f, mode='r:')
    csvfile = tf.extractfile('res_test.csv').read()
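Since the question is about data that never touches the disk, here is a minimal sketch of the same idea starting from raw bytes. The helper name read_member_from_tar_bytes is made up for illustration, and the archive is built in memory only to stand in for the downloaded data.

```python
import io
import tarfile

def read_member_from_tar_bytes(archive_bytes, member_name):
    """Return the contents of one member of an in-memory tar archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tf:
        member = tf.extractfile(member_name)
        if member is None:  # directories and special files carry no data
            raise ValueError("%s is not a regular file" % member_name)
        return member.read()

# Build a small tar.bz2 in memory to stand in for the downloaded bytes.
payload = b"id,value\n1,a\n2,b\n"
raw = io.BytesIO()
with tarfile.open(fileobj=raw, mode="w:bz2") as tf:
    info = tarfile.TarInfo(name="res_test.csv")
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))

csv_bytes = read_member_from_tar_bytes(raw.getvalue(), "res_test.csv")
```

Going through tarfile is what removes the "dirty data": the filename and owner fields live in the tar header, which plain bz2 decompression leaves in place.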

Compressing A Series of JSON Objects While Maintaining Serial Reading?

I have a bunch of json objects that I need to compress as it's eating too much disk space, approximately 20 gigs worth for a few million of them.
Ideally what I'd like to do is compress each individually and then when I need to read them, just iteratively load and decompress each one. I tried doing this by creating a text file with each line being a compressed json object via zlib, but this is failing with a decompress error due to a truncated stream, which I believe is due to the compressed strings containing new lines.
Anyone know of a good method to do this?
Just use a gzip.GzipFile() object and treat it like a regular file; write JSON objects line by line, and read them line by line.
The object takes care of compression transparently, and will buffer reads, decompressing chunks as needed.
import gzip
import json
# writing
with gzip.GzipFile(jsonfilename, 'w') as outfile:
    for obj in objects:
        outfile.write(json.dumps(obj) + '\n')
# reading
with gzip.GzipFile(jsonfilename, 'r') as infile:
    for line in infile:
        obj = json.loads(line)
        # process obj
This has the added advantage that the compression algorithm can make use of repetition across objects for compression ratios.
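The snippet above is written for Python 2, where json.dumps returns a byte string. On Python 3 a GzipFile opened in binary mode rejects str, so a rough equivalent is to open the file in text mode with gzip.open. The function names and the temporary path below are placeholders for illustration.

```python
import gzip
import json
import os
import tempfile

def write_json_lines(path, objects):
    """Write one JSON document per line into a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as outfile:
        for obj in objects:
            outfile.write(json.dumps(obj) + "\n")

def read_json_lines(path):
    """Lazily yield the objects back, one line at a time."""
    with gzip.open(path, "rt", encoding="utf-8") as infile:
        for line in infile:
            yield json.loads(line)

path = os.path.join(tempfile.mkdtemp(), "objects.jsonl.gz")  # placeholder
records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b\nc"}]
write_json_lines(path, records)
loaded = list(read_json_lines(path))
```

Newlines inside string values are safe here because json.dumps escapes them, so each object always occupies exactly one line.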
You might want to try an incremental json parser, such as jsaone.
That is, create a single json with all your objects, and parse it like
with gzip.GzipFile(file_path, 'r') as f_in:
    for key, val in jsaone.load(f_in):
        ...
This is quite similar to Martin's answer, wasting slightly more space but maybe slightly more comfortable.
EDIT: oh, by the way, it's probably fair to clarify that I wrote jsaone.

Adding a file-like object to a Zip file in Python

The Python ZipFile API seems to allow the passing of a file path to ZipFile.write or a byte string to ZipFile.writestr but nothing in between. I would like to be able to pass a file like object, in this case a django.core.files.storage.DefaultStorage but any file-like object in principle. At the moment I think I'm going to have to either save the file to disk, or read it into memory. Neither of these is perfect.
You are correct, those are the only two choices. If your DefaultStorage object is large, you may want to go with saving it to disk first; otherwise, I would use:
zipped = ZipFile(...)
zipped.writestr('archive_name', default_storage_object.read())
If default_storage_object is a StringIO object, you can use default_storage_object.getvalue().
While there's no option that takes a file-like object, there is an option to open a zip entry for writing (ZipFile.open). [doc]
import zipfile
import shutil
with zipfile.ZipFile('test.zip','w') as archive:
    with archive.open('test_entry.txt','w') as outfile:
        with open('test_file.txt','rb') as infile:
            shutil.copyfileobj(infile, outfile)
You can use your input stream as the source instead, and not have to copy the file to disk first. The downside is that if something goes wrong with your stream, the zip file will be unusable. In my application, we bypass files with errors, so we end up getting a local copy of the file anyway to ensure integrity and keep a usable zip file.
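To make the "use your input stream as the source" part concrete, here is a minimal sketch with a BytesIO standing in for the storage object. Opening an entry for writing needs Python 3.6 or later, and the helper name add_stream_to_zip is made up for illustration.

```python
import io
import shutil
import zipfile

def add_stream_to_zip(archive, entry_name, stream):
    """Copy a readable binary stream into a new entry of an open ZipFile."""
    with archive.open(entry_name, "w") as entry:
        shutil.copyfileobj(stream, entry)

# A BytesIO stands in for any file-like source, such as a storage object.
source = io.BytesIO(b"streamed contents")
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    add_stream_to_zip(archive, "test_entry.txt", source)

# Read the entry back to check that the archive is usable.
with zipfile.ZipFile(zip_buffer) as archive:
    stored = archive.read("test_entry.txt")
```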

How do the compression codecs work in Python?

I'm querying a database and archiving the results using Python, and I'm trying to compress the data as I write it to the log files. I'm having some problems with it, though.
My code looks like this:
log_file = codecs.open(archive_file, 'w', 'bz2')
for id, f1, f2, f3 in cursor:
    log_file.write('%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3))
However, my output file has a size of 1,409,780. Running bunzip2 on the file results in a file with a size of 943,634, and running bzip2 on that results in a size of 217,275. In other words, the uncompressed file is significantly smaller than the file compressed using Python's bzip codec. Is there a way to fix this, other than running bzip2 on the command line?
I tried Python's gzip codec (changing the line to codecs.open(archive_file, 'a+', 'zip')) to see if it fixed the problem. I still get large files, but I also get a gzip: archive_file: not in gzip format error when I try to uncompress the file. What's going on there?
EDIT: I originally had the file opened in append mode, not write mode. While this may or may not be a problem, the question still holds if the file's opened in 'w' mode.
As other posters have noted, the issue is that the codecs library doesn't use an incremental encoder to encode the data; instead it encodes every snippet of data fed to the write method as a compressed block. This is horribly inefficient, and just a terrible design decision for a library designed to work with streams.
The ironic thing is that there's a perfectly reasonable incremental bz2 encoder already built into Python. It's not difficult to create a "file-like" class which does the correct thing automatically.
import bz2
class BZ2StreamEncoder(object):
    def __init__(self, filename, mode):
        self.log_file = open(filename, mode)
        self.encoder = bz2.BZ2Compressor()
    def write(self, data):
        self.log_file.write(self.encoder.compress(data))
    def flush(self):
        # Only flush the underlying file: BZ2Compressor.flush() ends the
        # stream, and the compressor cannot be used after that.
        self.log_file.flush()
    def close(self):
        # Finish the bz2 stream exactly once, then close the file.
        self.log_file.write(self.encoder.flush())
        self.log_file.close()
log_file = BZ2StreamEncoder(archive_file, 'ab')
A caveat: In this example, I've opened the file in append mode; appending multiple compressed streams to a single file works perfectly well with bunzip2, but Python itself can't handle it (although there is a patch for it). If you need to read the compressed files you create back into Python, stick to a single stream per file.
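To see why compressing every write separately hurts so much, the sketch below compares the two approaches on some made-up log lines. The sample data is purely illustrative; exact sizes will vary with the input.

```python
import bz2

# Made-up log lines standing in for the rows read from the cursor.
lines = [("%d alpha beta NULL\n" % i).encode("ascii") for i in range(2000)]

# One complete bz2 stream per write, which is what per-write encoding produces.
per_write = b"".join(bz2.compress(line) for line in lines)

# A single incremental compressor shared across all the writes.
compressor = bz2.BZ2Compressor()
incremental = b"".join(compressor.compress(line) for line in lines)
incremental += compressor.flush()  # finish the stream exactly once

print(len(per_write), len(incremental))
```

Every separate stream carries its own header and trailer and cannot reuse repetition from earlier lines, which is where the size blow-up comes from.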
The problem seems to be that output is being written on every write(). This causes each line to be compressed in its own bzip block.
I would try building a much larger string (or list of strings if you are worried about performance) in memory before writing it out to the file. A good size to shoot for would be 900K (or more), as that is the block size that bzip2 uses.
The problem is due to your use of append mode, which results in files that contain multiple compressed blocks of data. Look at this example:
>>> import codecs
>>> with codecs.open("myfile.zip", "a+", "zip") as f:
...     f.write("ABCD")
On my system, this produces a file 12 bytes in size. Let's see what it contains:
>>> with codecs.open("myfile.zip", "r", "zip") as f:
...     f.read()
'ABCD'
Okay, now let's do another write in append mode:
>>> with codecs.open("myfile.zip", "a+", "zip") as f:
...     f.write("EFGH")
The file is now 24 bytes in size, and its contents are:
>>> with codecs.open("myfile.zip", "r", "zip") as f:
...     f.read()
'ABCD'
What's happening here is that unzip expects a single zipped stream. You'll have to check the specs to see what the official behavior is with multiple concatenated streams, but in my experience they process the first one and ignore the rest of the data. That's what Python does.
I expect that bunzip2 is doing the same thing. So in reality your file is compressed, and is much smaller than the data it contains. But when you run it through bunzip2, you're getting back only the first set of records you wrote to it; the rest is discarded.
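If you do end up with several streams concatenated in one file, the leftover data is still recoverable from Python. This is a sketch assuming the "zip" codec writes plain zlib streams; the helper name decompress_all_streams is made up for illustration.

```python
import zlib

def decompress_all_streams(data):
    """Decompress every zlib stream concatenated in data."""
    parts = []
    while data:
        decomp = zlib.decompressobj()
        parts.append(decomp.decompress(data))
        data = decomp.unused_data  # bytes left over after this stream ended
    return b"".join(parts)

# Two separately compressed writes, as append mode would produce.
combined = zlib.compress(b"ABCD") + zlib.compress(b"EFGH")
first_only = zlib.decompressobj().decompress(combined)
everything = decompress_all_streams(combined)
```

A single decompressor stops at the end of the first stream and parks the rest in unused_data, which matches the 'ABCD' result shown above.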
I'm not sure how different this is from the codecs way of doing it but if you use GzipFile from the gzip module you can incrementally append to the file but it's not going to compress very well unless you are writing large amounts of data at a time (maybe > 1 KB). This is just the nature of the compression algorithms. If the data you are writing isn't super important (i.e. you can deal with losing it if your process dies) then you could write a buffered GzipFile class wrapping the imported class that writes out larger chunks of data.
