I would like to read compressed files directly from Google Cloud Storage and open them with the Python csv package.
The code for a local file would be:
def reader(self):
    print "reading local compressed file: ", self._filename
    self._localfile = gzip.open(self._filename, 'rb')
    csvReader = csv.reader(self._localfile, delimiter=',', quotechar='"')
    return csvReader
I have played with several GCS APIs (JSON-based, cloud.storage), but none of them seem to give me something that I can stream through gzip. What is more, even if the file were uncompressed, I could not open it and hand it to csv.reader (which expects an iterator).
My compressed CSV files are about 500 MB, while uncompressed they use up to a few GB. I don't think it would be a good idea to: 1) download the files locally before opening them (unless I can overlap download and computation), or 2) open them entirely in memory before computing.
Finally, I currently run this code on my local machine, but ultimately I will move it to AppEngine, so it must work there too.
Thanks!!
Using GCS, cloudstorage.open(filename, 'r') will give you a read-only file-like object (the object would earlier have been created similarly, but with 'w') which you can read a chunk at a time and feed to the standard Python library's zlib module, specifically a zlib.decompressobj -- provided, of course, that the GCS object was originally created in the complementary way (with a zlib.compressobj).
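A minimal sketch of that chunk-at-a-time approach (the function name, object name, and chunk size below are placeholders; zlib.decompressobj(32 + 15) auto-detects a gzip or zlib header):
import zlib
import cloudstorage

def iter_uncompressed_chunks(objname, chunk_size=64 * 1024):
    # Read the GCS object chunk by chunk and decompress incrementally.
    d = zlib.decompressobj(32 + 15)  # auto-detects a gzip/zlib wrapper
    gcs_file = cloudstorage.open(objname, 'r')
    try:
        while True:
            chunk = gcs_file.read(chunk_size)
            if not chunk:
                break
            data = d.decompress(chunk)
            if data:
                yield data
        tail = d.flush()
        if tail:
            yield tail
    finally:
        gcs_file.close()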
Alternatively, for convenience, you can use the standard Python library's gzip module, e.g. for the reading phase something like:
compressed_flo = cloudstorage.open('objname', 'r')
uncompressed_flo = gzip.GzipFile(fileobj=compressed_flo, mode='rb')
csvReader = csv.reader(uncompressed_flo)
and vice versa for the earlier writing phase, of course.
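For instance, the writing phase might look roughly like this (a sketch only; the object name and row values are placeholders):
compressed_flo = cloudstorage.open('objname', 'w')
uncompressed_flo = gzip.GzipFile(fileobj=compressed_flo, mode='wb')
csvWriter = csv.writer(uncompressed_flo)
csvWriter.writerow(['col1', 'col2', 'col3'])  # placeholder row
uncompressed_flo.close()  # flushes the gzip trailer
compressed_flo.close()    # finalizes the GCS object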
Note that when you run locally (with the dev_appserver), the GCS client library uses local disk files to simulate GCS -- in my experience that's good for development purposes, and I can use gsutil or other tools when I need to interact with "real" GCS storage from my local workstation. The GCS client library is for when I need such interaction from my GAE app (and for developing said GAE app locally in the first place :-).
So, you have gzipped files stored on GCS. You can process the data stored on GCS in a stream-like fashion: download, unzip, and process simultaneously. This avoids:
having to store the unzipped file on disk
having to wait until the download is complete before being able to process the data.
gzip files have a small header and footer, and the body is a compressed (DEFLATE) stream that can be decompressed incrementally, chunk by chunk. Python's zlib module helps you with that!
Edit: This is example code for how to decompress and analyze a zlib or gzip stream chunk-wise, purely based on zlib:
import zlib
from collections import Counter


def stream(filename):
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(1024)
            if not chunk:
                break
            yield chunk


def decompress(stream):
    # Generate decompression object. Auto-detect and ignore
    # gzip wrapper, if present.
    z = zlib.decompressobj(32 + 15)
    for chunk in stream:
        r = z.decompress(chunk)
        if r:
            yield r


c = Counter()
s = stream("data.gz")
for chunk in decompress(s):
    for byte in chunk:
        c[byte] += 1

print c
I tested this code with an example file data.gz, created with GNU gzip.
Quotes from http://www.zlib.net/manual.html:
windowBits can also be greater than 15 for optional gzip decoding. Add
32 to windowBits to enable zlib and gzip decoding with automatic
header detection, or add 16 to decode only the gzip format (the zlib
format will return a Z_DATA_ERROR). If a gzip stream is being decoded,
strm->adler is a crc32 instead of an adler32.
and
Any information contained in the gzip header is not retained [...]
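To tie this back to the csv part of the question: csv.reader accepts any iterable of lines, so a small helper (hypothetical, reusing the stream and decompress generators above) can reassemble the decompressed chunks into lines:
import csv

def lines(chunks):
    # Reassemble decompressed chunks into complete text lines.
    buf = ""
    for chunk in chunks:
        buf += chunk
        while "\n" in buf:
            line, buf = buf.split("\n", 1)
            yield line
    if buf:
        yield buf

csvReader = csv.reader(lines(decompress(stream("data.gz"))),
                       delimiter=',', quotechar='"')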
I am currently writing json files to disk using
print('writing to disk .... ')
f = open('mypath/myfile', 'wb')
f.write(getjsondata.read())
f.close()
This works perfectly, except that the json files are very large and I would like to compress them. How can I do that automatically?
Thanks!
Python has a standard zlib module, which can compress and decompress data for you. You can use it directly on your data and write (and read) a custom format, or use the gzip module, which wraps the inner workings of zlib to read and write gzip-compatible files, while automatically compressing or decompressing the data so that it looks like an ordinary file object.
It thus neatly replaces the default open function for interacting with files, and all you need is this:
import gzip

print('writing to disk .... ')
with gzip.open('mypath/myfile', 'wb') as f:
    f.write(getjsondata.read())
(with a change in the open line because I highly recommend using the with syntax to handle file objects.)
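As a small companion sketch (assuming the file written above and Python 3's text mode), reading the compressed JSON back is just as transparent:
import gzip
import json

with gzip.open('mypath/myfile', 'rt') as f:
    data = json.load(f)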
According to S3.Client.upload_file and S3.Client.upload_fileobj, upload_fileobj may sound faster. But does anyone know the specifics? Should I just upload the file, or should I open the file in binary mode and use upload_fileobj? In other words,
import boto3

s3 = boto3.resource('s3')

### Version 1
s3.meta.client.upload_file('/tmp/hello.txt', 'mybucket', 'hello.txt')

### Version 2
with open('/tmp/hello.txt', 'rb') as data:
    s3.meta.client.upload_fileobj(data, 'mybucket', 'hello.txt')
Is version 1 or version 2 better? Is there a difference?
The main point with upload_fileobj is that the file object doesn't have to be stored on local disk in the first place; it may be represented as a file-like object in RAM.
Python has a standard library module (io) for that purpose.
The code will look like this:
import io
import boto3
s3 = boto3.client('s3')
fo = io.BytesIO(b'my data stored as file object in RAM')
s3.upload_fileobj(fo, 'mybucket', 'hello.txt')
In that case it will perform faster, since you don't have to read from local disk.
TL;DR
In terms of speed, both methods will perform roughly the same; both are written in Python, and the bottleneck will be either disk I/O (reading the file from disk) or network I/O (writing to S3).
Use upload_file() when writing code that only handles uploading files from disk.
Use upload_fileobj() when writing generic code to handle S3 uploads that may be reused in the future for more than just files from disk.
What is fileobj anyway?
There is a convention in multiple places, including the Python standard library, that when one uses the term fileobj she means a file-like object.
There are even some libraries exposing functions that can take a file path (str) or a fileobj (file-like object) as the same parameter.
When using a file object, your code is not limited to disk. For example:
you can copy data from one S3 object to another in a streaming fashion (without using disk space or slowing down the process with read/write I/O to disk) -- see the sketch after this list
you can compress or encrypt data on the fly when writing objects to S3 (and decompress or decrypt when reading them)
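A sketch of that streaming s3-to-s3 copy (bucket and key names are made up; the StreamingBody returned by get_object() is itself a file-like object):
import boto3

s3 = boto3.client('s3')
source = s3.get_object(Bucket='src-bucket', Key='big-file.csv.gz')['Body']
s3.upload_fileobj(source, 'dst-bucket', 'big-file.csv.gz')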
Example using the Python gzip module with a file-like object in a generic way:
import gzip, io

def gzip_greet_file(fileobj):
    """write gzipped hello message to a file"""
    with gzip.open(filename=fileobj, mode='wb') as fp:
        fp.write(b'hello!')

# using opened file
gzip_greet_file(open('/tmp/a.gz', 'wb'))

# using filename from disk
gzip_greet_file('/tmp/b.gz')

# using io buffer
file = io.BytesIO()
gzip_greet_file(file)
file.seek(0)
print(file.getvalue())
tarfile, on the other hand, has two separate parameters, name and fileobj:
tarfile.open(name=None, mode='r', fileobj=None, bufsize=10240, **kwargs)
Example of compressing data on the fly (in an in-memory buffer, without touching disk) before s3.upload_fileobj():
import gzip, io, shutil, boto3

s3 = boto3.client('s3')

def upload_file(fileobj, bucket, key, compress=False):
    if compress:
        # gzip the source stream into an in-memory buffer first
        buf = io.BytesIO()
        with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
            shutil.copyfileobj(fileobj, gz)
        buf.seek(0)
        fileobj = buf
        key = key + '.gz'
    s3.upload_fileobj(fileobj, bucket, key)
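Example usage (the local path and bucket/key names are placeholders):
with open('/tmp/report.csv', 'rb') as src:
    upload_file(src, 'mybucket', 'report.csv', compress=True)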
Neither is better, because they're not comparable. While the end result is the same (an object is uploaded to S3), they source that object quite differently. One expects you to supply the path on disk of the file to upload while the other expects you to provide a file-like object.
If you have a file on disk and want to upload it, then use upload_file. If you have a file-like object (which could ultimately be many things including an open file, a stream, a socket, a buffer, a string) then use upload_fileobj.
A 'file-like object' in this context is anything that implements the read method, and returns bytes.
As per the documentation in https://boto3.amazonaws.com/v1/documentation/api/1.9.185/guide/s3-uploading-files.html
"The upload_file and upload_fileobj methods are provided by the S3 Client, Bucket, and Object classes. The method functionality provided by each class is identical. No benefits are gained by calling one class's method over another's. Use whichever class is most convenient."
The answers above seem to be false.
How to decompress a *.bz2 file in memory with Python?
The bz2 file comes from a csv file.
I use the code below to decompress it in memory. It works, but it brings in some dirty data, such as the filename of the csv file and its author name. Is there a better way to handle it?
#!/usr/bin/python
# -*- coding: utf-8 -*-
import StringIO
import bz2

with open("/app/tmp/res_test.tar.bz2", "rb") as f:
    content = f.read()

compressedFile = StringIO.StringIO(content)
decompressedFile = bz2.decompress(compressedFile.buf)
compressedFile.seek(0)

with open("/app/tmp/decompress_test", 'w') as outfile:
    outfile.write(decompressedFile)
I found this question; it is about gzip, whereas my data is in bz2 format. I tried to do as instructed there, but it seems that bz2 cannot be handled in that way.
Edit:
No matter whether I use the answer from #metatoaster or the code above, both bring some dirty data into the final decompressed file.
For example, my original data (attached below) is in csv format, with the name res_test.csv:
Then I cd into the directory containing the file and compress it with tar -cjf res_test.tar.bz2 res_test.csv to get the compressed file res_test.tar.bz2. This file simulates the bz2 data that I will get from the internet, and I wish to decompress it in memory without caching it to disk first. But what I get (shown below) contains too much dirty data:
The data is still there, but submerged in noise. Is it possible to decompress it into pure data, just the same as the original, instead of decompressing it and then having to extract the real data from all that noise?
For generic bz2 decompression, the BZ2File class may be used.
from bz2 import BZ2File

with BZ2File("/app/tmp/res_test.tar.bz2") as f:
    content = f.read()
content should contain the decompressed contents of the file.
However, given that this is a tar file (an archive that is normally extracted to disk as a directory of files), the tarfile module could be used instead, and it has extended mode flags for handling bz2. Assuming the target archive contains a res_test.csv, the following can be used:
import tarfile

tf = tarfile.open('/app/tmp/res_test.tar.bz2', 'r:bz2')
csvfile = tf.extractfile('res_test.csv').read()
The r:bz2 mode opens the tar archive in a way that makes it possible to seek backwards, which is important, as the alternative streaming mode r|bz2 makes it impractical to extract files from the members it returns. The second line simply calls extractfile to return the contents of res_test.csv from the archive as a string.
The transparent open mode ('r:*') is typically recommended, however, so that if the input tar file is compressed with gzip instead, no failure will be encountered.
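For instance, with the same archive path as above:
tf = tarfile.open('/app/tmp/res_test.tar.bz2', 'r:*')  # compression auto-detected
csvfile = tf.extractfile('res_test.csv').read()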
Naturally, tarfile.open can also be used on arbitrary stream objects via its fileobj argument. If the file was already opened using BZ2File, that object can be used as well:
with BZ2File("/app/tmp/res_test.tar.bz2") as f:
    tf = tarfile.open(fileobj=f, mode='r:')
    csvfile = tf.extractfile('res_test.csv').read()
According to this FAQ on zlib.net it is possible to:
access data randomly in a compressed stream
I know about the Bio.bgzf module of Biopython 1.60, which:
supports reading and writing BGZF files (Blocked GNU Zip Format), a variant of GZIP with efficient random access, most commonly used as part of the BAM file format and in tabix. This uses Python’s zlib library internally, and provides a simple interface like Python’s gzip library.
But for my use case I don't want to use that format. Basically I want something, which emulates the code below:
import gzip

large_integer_new_line_start = 10**9
with gzip.open('large_file.gz', 'rt') as f:
    f.seek(large_integer_new_line_start)
but with the efficiency offered by the native zlib.net to provide random access to the compressed stream. How do I leverage that random access capability in Python?
I gave up on doing random access on a gzipped file using Python. Instead I converted my gzipped file to a block gzipped file with a block compression/decompression utility on the command line:
zcat large_file.gz | bgzip > large_file.bgz
Then I used Biopython and tell() to get the virtual_offset of line number 1 million in the bgzipped file. After that I was able to rapidly seek to that virtual_offset:
from Bio import bgzf

file = 'large_file.bgz'
handle = bgzf.BgzfReader(file)
for i in range(10**6):
    handle.readline()
virtual_offset = handle.tell()
line1 = handle.readline()
handle.close()

handle = bgzf.BgzfReader(file)
handle.seek(virtual_offset)
line2 = handle.readline()
handle.close()

assert line1 == line2
I would like to also point to the SO answer by Mark Adler here on examples/zran.c in the zlib distribution.
You are looking for dictzip.py, part of the serpento package. However, you have to compress the files with dictzip, which is a randomly seekable, backward-compatible variant of gzip compression.
The indexed_gzip package might be what you want. It also uses zran.c under the hood.
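A minimal sketch of how it is typically used (the class name and arguments are taken from the package's documentation; treat them as an assumption here):
import indexed_gzip as igzip

# Builds a seek index on the fly, so later seeks into the
# uncompressed stream are fast.
f = igzip.IndexedGzipFile('large_file.gz')
f.seek(10**9)
data = f.read(1024)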
If you just want to access the file from a random point, can't you just do:
from random import randint

with open(filename, 'rb') as f:
    f.seek(0, 2)               # seek to the end to find the file size
    size = f.tell()
    f.seek(randint(0, size))   # then seek to a random absolute offset
I'm querying a database and archiving the results using Python, and I'm trying to compress the data as I write it to the log files. I'm having some problems with it, though.
My code looks like this:
log_file = codecs.open(archive_file, 'w', 'bz2')
for id, f1, f2, f3 in cursor:
    log_file.write('%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3))
However, my output file has a size of 1,409,780. Running bunzip2 on the file results in a file with a size of 943,634, and running bzip2 on that results in a size of 217,275. In other words, the uncompressed file is significantly smaller than the file compressed using Python's bzip codec. Is there a way to fix this, other than running bzip2 on the command line?
I tried Python's gzip codec (changing the line to codecs.open(archive_file, 'a+', 'zip')) to see if it fixed the problem. I still get large files, but I also get a gzip: archive_file: not in gzip format error when I try to uncompress the file. What's going on there?
EDIT: I originally had the file opened in append mode, not write mode. While this may or may not be a problem, the question still holds if the file's opened in 'w' mode.
As other posters have noted, the issue is that the codecs library doesn't use an incremental encoder to encode the data; instead it encodes every snippet of data fed to the write method as a compressed block. This is horribly inefficient, and just a terrible design decision for a library designed to work with streams.
The ironic thing is that there's a perfectly reasonable incremental bz2 encoder already built into Python. It's not difficult to create a "file-like" class which does the correct thing automatically.
import bz2

class BZ2StreamEncoder(object):
    def __init__(self, filename, mode):
        self.log_file = open(filename, mode)
        self.encoder = bz2.BZ2Compressor()

    def write(self, data):
        self.log_file.write(self.encoder.compress(data))

    def flush(self):
        self.log_file.write(self.encoder.flush())
        self.log_file.flush()

    def close(self):
        self.flush()
        self.log_file.close()

log_file = BZ2StreamEncoder(archive_file, 'ab')
A caveat: In this example, I've opened the file in append mode; appending multiple compressed streams to a single file works perfectly well with bunzip2, but Python itself can't handle it (although there is a patch for it). If you need to read the compressed files you create back into Python, stick to a single stream per file.
The problem seems to be that output is being written on every write(). This causes each line to be compressed in its own bzip block.
I would try building a much larger string (or list of strings, if you are worried about performance) in memory before writing it out to the file. A good size to shoot for would be 900K (or more), as that is the block size that bzip2 uses.
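A rough sketch of that buffering approach, reusing archive_file and cursor from the question (the 900K threshold matches bzip2's block size; the exact structure is illustrative):
import codecs

BUFFER_SIZE = 900 * 1024
log_file = codecs.open(archive_file, 'w', 'bz2')
buffer, buffered = [], 0
for id, f1, f2, f3 in cursor:
    line = '%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3)
    buffer.append(line)
    buffered += len(line)
    if buffered >= BUFFER_SIZE:
        log_file.write(''.join(buffer))  # one large, better-compressed write
        buffer, buffered = [], 0
if buffer:
    log_file.write(''.join(buffer))
log_file.close()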
The problem is due to your use of append mode, which results in files that contain multiple compressed blocks of data. Look at this example:
>>> import codecs
>>> with codecs.open("myfile.zip", "a+", "zip") as f:
...     f.write("ABCD")
On my system, this produces a file 12 bytes in size. Let's see what it contains:
>>> with codecs.open("myfile.zip", "r", "zip") as f:
...     f.read()
'ABCD'
Okay, now let's do another write in append mode:
>>> with codecs.open("myfile.zip", "a+", "zip") as f:
...     f.write("EFGH")
The file is now 24 bytes in size, and its contents are:
>>> with codecs.open("myfile.zip", "r", "zip") as f:
...     f.read()
'ABCD'
What's happening here is that unzip expects a single zipped stream. You'll have to check the specs to see what the official behavior is with multiple concatenated streams, but in my experience they process the first one and ignore the rest of the data. That's what Python does.
I expect that bunzip2 is doing the same thing. So in reality your file is compressed, and is much smaller than the data it contains. But when you run it through bunzip2, you're getting back only the first set of records you wrote to it; the rest is discarded.
I'm not sure how different this is from the codecs way of doing it, but if you use GzipFile from the gzip module you can incrementally append to the file. However, it's not going to compress very well unless you are writing large amounts of data at a time (maybe > 1 KB); this is just the nature of compression algorithms. If the data you are writing isn't super important (i.e. you can deal with losing it if your process dies), then you could write a buffered GzipFile class wrapping the imported class that writes out larger chunks of data.
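A minimal sketch of that incremental-append idea (the helper name is made up; each append adds a new gzip member, which gunzip/zcat concatenate transparently on decompression):
import gzip

def append_chunk(filename, buffered_lines):
    # Each call opens in append mode and writes a new gzip member,
    # so flush reasonably large chunks at a time for decent compression.
    with gzip.open(filename, 'ab') as f:
        f.write(''.join(buffered_lines))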