Importing bz2 compressed binary file as numpy array - python

I have a bz2-compressed binary (big-endian) file containing an array of data. Uncompressing it with external tools and then reading the file into NumPy works:
import numpy as np
dim = 3
rows = 1000
cols = 2000
mydata = np.fromfile('myfile.bin').reshape(dim,rows,cols)
However, since there are plenty of other files like this, I cannot extract each one individually beforehand. I found the bz2 module in Python, which should be able to decompress the file directly, but I get an error message:
dfile = bz2.BZ2File('myfile.bz2').read()
mydata = np.fromfile(dfile).reshape(dim,rows,cols)
>>IOError: first argument must be an open file
Obviously, the BZ2File function does not return a file object. What is the correct way to read the compressed file?

BZ2File does return a file-like object (although not an actual file). The problem is that you're calling read() on it:
dfile = bz2.BZ2File('myfile.bz2').read()
This reads the entire file into memory as one big string, which you then pass to fromfile.
Depending on your versions of numpy and python and your platform, reading from a file-like object that isn't an actual file may not work. In that case, you can use the buffer you read in with frombuffer.
So, either this:
dfile = bz2.BZ2File('myfile.bz2')
mydata = np.fromfile(dfile).reshape(dim,rows,cols)
… or this:
dbuf = bz2.BZ2File('myfile.bz2').read()
mydata = np.frombuffer(dbuf).reshape(dim,rows,cols)
(Needless to say, there are a slew of other alternatives that might be better than reading the whole buffer into memory. But if your file isn't too huge, this will work.)
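One caveat either way: np.fromfile and np.frombuffer default to native-endian float64, while the file described here is big-endian, so you will probably need to spell out the dtype. A minimal sketch, assuming 8-byte floats (use '>f4' instead if the data is 32-bit):

import bz2
import numpy as np

dim, rows, cols = 3, 1000, 2000
dbuf = bz2.BZ2File('myfile.bz2').read()
# '>f8' = big-endian 64-bit float; adjust to the file's actual element type
mydata = np.frombuffer(dbuf, dtype='>f8').reshape(dim, rows, cols)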

Related

How do I read in wav files in .gz?

I am learning machine learning and data analysis on wav files.
I know that if I have the wav files directly, I can do something like this to read in the data:
import librosa
mono, fs = librosa.load('./small_data/time_series_audio.wav', sr = 44100)
Now I'm given a gz-file "music_feature_extraction_test.tar.gz"
I'm not sure what to do now.
I tried:
import gzip

with gzip.open('music_train.tar.gz', 'rb') as f:
    for files in f:
        mono, fs = librosa.load(files, sr=44100)
but it gives me:
TypeError: lstat() argument 1 must be encoded string without null bytes, not str
Can anyone help me out?
There are several things going on:
The file you are given is a gzip-compressed tarball. Take a look at the tarfile module; it can read gzip-compressed files directly. You'll get an iterator over its members, each of which is an individual file.
As far as I can see, librosa can't read from an in-memory buffer, so you have to unpack the tar members to temporary files. The tempfile module is your friend here: a NamedTemporaryFile provides a self-deleting file that you can uncompress to and hand to librosa.
You probably want to implement this as a simple generator function that takes the tarfile name as its input, iterates over its members, and yields what librosa.load() provides. That way everything gets cleaned up automatically.
The basic loop would therefore be (see the sketch after this list):
Open the tarball using the tarfile module. For each member:
Get a new temporary file using NamedTemporaryFile. Copy the content of the tarball member to that file; you may want to use shutil.copyfileobj to avoid reading the entire wav file into memory before writing it to disk.
The NamedTemporaryFile has a name attribute. Pass that to librosa.load().
yield the return value of librosa.load() to the caller.
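A minimal sketch of that generator (the function name and the .wav filter are mine; adjust to the actual member names in your tarball):

import shutil
import tarfile
import tempfile

import librosa

def load_wavs_from_tarball(tarball_name, sr=44100):
    """Yield (mono, fs) for each wav member of a gzipped tarball."""
    with tarfile.open(tarball_name, 'r:gz') as tar:
        for member in tar:
            if not member.isfile() or not member.name.endswith('.wav'):
                continue
            source = tar.extractfile(member)
            # Self-deleting temporary file; librosa reads it by name.
            with tempfile.NamedTemporaryFile(suffix='.wav') as tmp:
                # Copy in chunks so the wav is never held in memory whole.
                shutil.copyfileobj(source, tmp)
                tmp.flush()
                yield librosa.load(tmp.name, sr=sr)

The caller then just iterates (for mono, fs in load_wavs_from_tarball('music_train.tar.gz'): ...) and each temporary file is cleaned up automatically.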
You can use PySoundFile to read from the compressed file.
https://pysoundfile.readthedocs.io/en/0.9.0/#virtual-io
import tarfile
import soundfile

# Tar members are file-like objects, which soundfile's virtual I/O can read.
with tarfile.open('music_train.tar.gz', 'r:gz') as tar:
    for member in tar:
        if member.isfile():
            mono, fs = soundfile.read(tar.extractfile(member))
Maybe you should also check if you need to resample the data before processing it with librosa:
https://librosa.github.io/librosa/ioformats.html#read-specific-formats

How to make sure a netcdf file is closed in python?

It's probably simple, but I haven't been able to find a solution online...
I'm trying to work with a series of datasets stored as netcdf files. I open each one, read in some key points, then move on to the next file. I am finding that I constantly hit mmap errors and the script slows down as more files are read in. I believe it may be because the netcdf files are not being properly closed by the .close() command.
I've been testing this:
from scipy.io.netcdf import netcdf_file as ncfile
f=ncfile(netcdf_file,mode='r')
f.close()
then if I try
>>>f
<scipy.io.netcdf.netcdf_file object at 0x24d29e10>
and
>>>f.variables['temperature'][:]
array([ 1234.68034431, 1387.43136567, 1528.35794546, ..., 3393.91061952,
3378.2844357 , 3433.06715226])
So it appears the file is still open? What does close() actually do? How do I know it has worked?
Is there a way to close/clear all open files from python?
Software:
Python 2.7.6, scipy 0.13.2, netcdf 4.0.1
The code for f.close is:
Definition: f.close(self)
Source:
def close(self):
    """Closes the NetCDF file."""
    if not self.fp.closed:
        try:
            self.flush()
        finally:
            self.fp.close()
f.fp is the file object. So
In [451]: f.fp
Out[451]: <open file 'test.cdf', mode 'wb' at 0x939df40>
In [452]: f.close()
In [453]: f.fp
Out[453]: <closed file 'test.cdf', mode 'wb' at 0x939df40>
But I see from playing around with f that I can still create dimensions and variables, though f.flush() returns an error.
It does not look like it uses mmap during data writes, just during read.
def _read_var_array(self):
    ....
    if self.use_mmap:
        mm = mmap(self.fp.fileno(), begin_+a_size, access=ACCESS_READ)
        data = ndarray.__new__(ndarray, shape, dtype=dtype_,
                               buffer=mm, offset=begin_, order=0)
    else:
        pos = self.fp.tell()
        self.fp.seek(begin_)
        data = fromstring(self.fp.read(a_size), dtype=dtype_)
        data.shape = shape
        self.fp.seek(pos)
I don't have much experience with mmap. It looks like it sets up a mmap object based on a block of bytes in the file, and uses that as the data buffer for the variable. I don't know what happens to that access if the underlying file is closed. I wouldn't be surprised if there is some sort of mmap error.
If the file is opened with mmap=False, then the whole variable is read into memory, and accessed like a regular numpy array.
mmap : None or bool, optional
    Whether to mmap `filename` when reading. Default is True
    when `filename` is a file name, False when `filename` is a
    file-like object.
My guess is that if you open a file without specifying the mmap mode, read a variable from it, and then close the file, it is unsafe to reference that variable and its data later. Any reference that requires loading more data could result in a mmap error.
But if you open the file with mmap=False, you should be able to slice the variable even after closing the file.
I don't see how the mmap for one file or variable could interfere with access to other files and variables. But I'd have to read more on mmap to be sure of that.
And from the netcdf docs:
Note that when netcdf_file is used to open a file with mmap=True (default for read-only), arrays returned by it refer to data directly on the disk. The file should not be closed, and cannot be cleanly closed when asked, if such arrays are alive. You may want to copy data arrays obtained from mmapped Netcdf file if they are to be processed after the file is closed, see the example below.
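Putting that together, a small sketch of both options, reusing the test.cdf and 'temperature' names from above:

from scipy.io.netcdf import netcdf_file as ncfile

# Option 1: read the data into memory up front; closing is then safe.
f = ncfile('test.cdf', mode='r', mmap=False)
temperature = f.variables['temperature'][:]
f.close()

# Option 2: keep the mmap default, but copy the array before closing,
# so no live array still points into the mapped file.
f = ncfile('test.cdf', mode='r')
temperature = f.variables['temperature'][:].copy()
f.close()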

How do the compression codecs work in Python?

I'm querying a database and archiving the results using Python, and I'm trying to compress the data as I write it to the log files. I'm having some problems with it, though.
My code looks like this:
log_file = codecs.open(archive_file, 'w', 'bz2')
for id, f1, f2, f3 in cursor:
    log_file.write('%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3))
However, my output file has a size of 1,409,780. Running bunzip2 on the file results in a file with a size of 943,634, and running bzip2 on that results in a size of 217,275. In other words, the uncompressed file is significantly smaller than the file compressed using Python's bzip codec. Is there a way to fix this, other than running bzip2 on the command line?
I tried Python's gzip codec (changing the line to codecs.open(archive_file, 'a+', 'zip')) to see if it fixed the problem. I still get large files, but I also get a gzip: archive_file: not in gzip format error when I try to uncompress the file. What's going on there?
EDIT: I originally had the file opened in append mode, not write mode. While this may or may not be a problem, the question still holds if the file's opened in 'w' mode.
As other posters have noted, the issue is that the codecs library doesn't use an incremental encoder to encode the data; instead it encodes every snippet of data fed to the write method as a compressed block. This is horribly inefficient, and just a terrible design decision for a library designed to work with streams.
The ironic thing is that there's a perfectly reasonable incremental bz2 encoder already built into Python. It's not difficult to create a "file-like" class which does the correct thing automatically.
import bz2

class BZ2StreamEncoder(object):
    def __init__(self, filename, mode):
        self.log_file = open(filename, mode)
        self.encoder = bz2.BZ2Compressor()

    def write(self, data):
        self.log_file.write(self.encoder.compress(data))

    def flush(self):
        self.log_file.write(self.encoder.flush())
        self.log_file.flush()

    def close(self):
        self.flush()
        self.log_file.close()

log_file = BZ2StreamEncoder(archive_file, 'ab')
A caveat: In this example, I've opened the file in append mode; appending multiple compressed streams to a single file works perfectly well with bunzip2, but Python itself can't handle it (although there is a patch for it). If you need to read the compressed files you create back into Python, stick to a single stream per file.
The problem seems to be that output is being written on every write(). This causes each line to be compressed in its own bzip block.
I would try building a much larger string (or list of strings, if you are worried about performance) in memory before writing it out to the file. A good size to shoot for would be 900K (or more), as that is the block size that bzip2 uses. A sketch of this idea follows.
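A minimal sketch of that buffering idea, reusing archive_file and cursor from the question (the 900K threshold matches the bzip2 block size mentioned above):

import codecs

log_file = codecs.open(archive_file, 'w', 'bz2')
buf, buf_len = [], 0
for id, f1, f2, f3 in cursor:
    line = '%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3)
    buf.append(line)
    buf_len += len(line)
    if buf_len >= 900 * 1024:  # hand bzip roughly one full block at a time
        log_file.write(''.join(buf))
        buf, buf_len = [], 0
if buf:
    log_file.write(''.join(buf))
log_file.close()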
The problem is due to your use of append mode, which results in files that contain multiple compressed blocks of data. Look at this example:
>>> import codecs
>>> with codecs.open("myfile.zip", "a+", "zip") as f:
...     f.write("ABCD")
On my system, this produces a file 12 bytes in size. Let's see what it contains:
>>> with codecs.open("myfile.zip", "r", "zip") as f:
...     data = f.read()
>>> data
'ABCD'
Okay, now let's do another write in append mode:
>>> with codecs.open("myfile.zip", "a+", "zip") as f:
...     f.write("EFGH")
The file is now 24 bytes in size, and its contents are:
>>> with codecs.open("myfile.zip", "r", "zip") as f:
...     data = f.read()
>>> data
'ABCD'
What's happening here is that unzip expects a single zipped stream. You'll have to check the specs to see what the official behavior is with multiple concatenated streams, but in my experience they process the first one and ignore the rest of the data. That's what Python does.
I expect that bunzip2 is doing the same thing. So in reality your file is compressed, and is much smaller than the data it contains. But when you run it through bunzip2, you're getting back only the first set of records you wrote to it; the rest is discarded.
I'm not sure how different this is from the codecs way of doing it, but if you use GzipFile from the gzip module you can incrementally append to the file. It won't compress very well, though, unless you are writing large amounts of data at a time (maybe > 1 KB); this is just the nature of the compression algorithm. If the data you are writing isn't super important (i.e. you can deal with losing it if your process dies), you could write a buffered GzipFile class wrapping the imported class that writes out larger chunks of data, along the lines of the sketch below.
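A rough sketch of such a wrapper (the class name and buffer size are mine; as noted, anything still in the buffer is lost if the process dies before a flush):

import gzip

class BufferedGzipFile(object):
    """Collect writes into larger chunks before handing them to GzipFile."""
    def __init__(self, filename, mode='ab', buffer_size=64 * 1024):
        self.gz = gzip.GzipFile(filename, mode)
        self.chunks = []
        self.size = 0
        self.buffer_size = buffer_size

    def write(self, data):
        self.chunks.append(data)
        self.size += len(data)
        if self.size >= self.buffer_size:
            self.flush()

    def flush(self):
        if self.chunks:
            self.gz.write(''.join(self.chunks))
            self.chunks = []
            self.size = 0

    def close(self):
        self.flush()
        self.gz.close()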

parsing large compressed xml files, python

file = BZ2File(SOME_FILE_PATH)
p = xml.parsers.expat.ParserCreate()
p.Parse(file)
Here's code that tries to parse an XML file compressed with bz2. Unfortunately, it fails with the message:
TypeError: Parse() argument 1 must be string or read-only buffer, not bz2.BZ2File
Is there a way to parse bz2-compressed XML files on the fly?
Note: p.Parse(file.read()) is not an option here. I want to parse a file that is larger than available memory, so I need a stream.
Just use p.ParseFile(file) instead of p.Parse(file).
Parse() takes a string, ParseFile() takes a file handle, and reads the data in as required.
Ref: http://docs.python.org/library/pyexpat.html#xml.parsers.expat.xmlparser.ParseFile
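For completeness, a minimal sketch; the start_element handler is just a placeholder, and a BZ2File works here because ParseFile only needs an object with a read() method:

import bz2
import xml.parsers.expat

def start_element(name, attrs):
    # Placeholder handler: do something with each element.
    print name

p = xml.parsers.expat.ParserCreate()
p.StartElementHandler = start_element

f = bz2.BZ2File(SOME_FILE_PATH)
p.ParseFile(f)  # expat pulls the data in chunks; the file is never fully in memory
f.close()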
Use .read() on the file object to read in the entire file as a string, and then pass that to Parse?
file = BZ2File(SOME_FILE_PATH)
p = xml.parsers.expat.ParserCreate()
p.Parse(file.read())
Can you pass in an mmap()'ed file? That should take care of automatically paging the needed parts of the file in, and avoid memory overflow. Of course, if expat builds a parse tree, it might still run out of memory.
http://docs.python.org/library/mmap.html
Memory-mapped file objects behave like both strings and like file objects. Unlike normal string objects, however, these are mutable. You can use mmap objects in most places where strings are expected; for example, you can use the re module to search through a memory-mapped file.

How do I extract a ieee-be binary file embedded in a zipfile?

I have a set of zip files, each of which contains several ieee-be encoded binary and text files. I have used Python's zipfile module and can extract the contents of the text files:
def readPropFile(myZipFile):
    zf = zipfile.ZipFile(myZipFile, 'r')  # open zip file for reading
    zFileList = zf.namelist()             # list of files embedded in myZipFile
    # Text files in myZipFile contain the word 'properties', so
    # collect the property file names here.
    propFileList = []
    for f in zFileList:
        if f.find('properties') > 0:
            propFileList.append(f)
    # Open the first file in propFileList.
    pp2 = cStringIO.StringIO(zf.read(propFileList[0]))
    fileLines = []
    for ll in pp2:
        fileLines.append(ll)
    # Return the lines in the property text file.
    return fileLines
Now I would like to do the same sort of thing, except read the data in the binary files and create an array of floats. How would I proceed?
Update 1
The format of the binary files is such that, after extracting them to a temporary location, I can read them in MATLAB with the following:
>>fid=fopen('dataFile.bin','r','ieee-be');
>>dat=fread(fid,[1 inf],'float');
Update 2
I now have a simple function that attempts to read the binary data, something like:
def readBinaryFile(myZipFile):
    zFile = zipfile.ZipFile(myZipFile, 'r')
    dataFileName = 'dataFile.bin'
    stringData = zFile.read(dataFileName)
    ss = stringData[0:4]
    data = struct.unpack('>f', ss)
but the value I get is not the same as the value reported in MATLAB.
Update 3
The first float in my binary file:
hex value: BD 98 99 3D
float: -.07451103
Most of what you need is in this answer: How do I convert a Python float to a hexadecimal string in Python 2.5? (Nonworking solution attached).
See the stuff about struct.pack.
More details on struct are in the Python docs
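Building on readBinaryFile() from the question, a sketch that unpacks the whole member in one go, assuming the file contains nothing but big-endian 4-byte floats:

import struct
import zipfile

def readBinaryFile(myZipFile, dataFileName='dataFile.bin'):
    zFile = zipfile.ZipFile(myZipFile, 'r')
    stringData = zFile.read(dataFileName)
    count = len(stringData) // 4
    # '>' = big-endian (ieee-be), 'f' = 4-byte float
    return struct.unpack('>%df' % count, stringData)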
You could also try the NumPy extension (here), which is a bit lighter than SciPy. NumPy has lots of I/O routines. For example,
import numpy
f = open('example.dat', 'rb')
data_type = numpy.dtype('float32').newbyteorder('>')
x = numpy.fromfile(f, dtype=data_type)
gives you a numpy array. (A less clunky way to specify the same type is dtype='>f4'.)
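Since the data here lives inside a zipfile rather than in a standalone file, you can combine this with ZipFile.read and frombuffer; a sketch reusing myZipFile and dataFile.bin from the question:

import numpy
import zipfile

zFile = zipfile.ZipFile(myZipFile, 'r')
x = numpy.frombuffer(zFile.read('dataFile.bin'), dtype='>f4')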
In the snippet from the question, the "properties" files are detected, in a rather loose fashion, by the presence of the string 'properties' in their names. I don't know what the equivalent of this would be for the binary ieee-be files.
However, with Python, an easy way to read ieee-be (or other formats) files is with SciPy's io.fopen module.
Edit
Since reading such a binary file requires you to know its structure anyway, you can express that structure in a struct format string, as described in Michael Dillon's response! This only requires the standard library, and is just as easy!
