Python: Read compressed (.gz) HDF file without writing and saving uncompressed file - python

I have a large number of compressed HDF files, which I need to read.
file1.HDF.gz
file2.HDF.gz
file3.HDF.gz
...
I can read in uncompressed HDF files with the following method
from pyhdf.SD import SD, SDC
import os
os.system('gunzip < file1.HDF.gz > file1.HDF')
HDF = SD('file1.HDF')
and repeat this for each file. However, this is more time-consuming than I would like.
I suspect most of the time overhead comes from writing the compressed file out as a new uncompressed file, and that I could speed things up if I were able to read the uncompressed contents straight into the SD function in one step.
Am I correct in this thinking? And if so, is there a way to do what I want?

According to the pyhdf package documentation, this is not possible.
__init__(self, path, mode=1)
SD constructor. Initialize an SD interface on an HDF file,
creating the file if necessary.
There is no other way to instantiate an SD object, and no constructor that takes a file-like object. This is likely because pyhdf conforms to an external interface (NCSA HDF). The HDF format also normally handles massive files that are impractical to hold in memory all at once.
Unzipping it as a file is likely your most performant option.
If you would like to stay in Python, use the gzip module (docs):
import gzip
import shutil
with gzip.open('file1.HDF.gz', 'rb') as f_in, open('file1.HDF', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)
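If you want to avoid littering your working directory, one option is to decompress into a temporary file and hand its path to SD. This is only a sketch of that idea; the file and variable names are illustrative:
import gzip
import os
import shutil
import tempfile

from pyhdf.SD import SD

# Decompress into a temporary file, open it with pyhdf, then clean up.
with gzip.open('file1.HDF.gz', 'rb') as f_in:
    with tempfile.NamedTemporaryFile(suffix='.HDF', delete=False) as f_out:
        shutil.copyfileobj(f_in, f_out)
        tmp_path = f_out.name

hdf = SD(tmp_path)   # read-only by default
# ... work with the datasets ...
hdf.end()
os.remove(tmp_path)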

sascha is correct that HDF transparent compression is more appropriate than gzipping; nonetheless, if you can't control how the HDF files are stored, you're looking for the gzip Python module (docs). It can read the data out of these files.

Related

how to compress json files?

I am currently writing json files to disk using
print('writing to disk .... ')
f = open('mypath/myfile', 'wb')
f.write(getjsondata.read())
f.close()
This works perfectly, except that the json files are very large and I would like to compress them. How can I do that automatically? What should I do?
Thanks!
Python has a standard zlib module, which can compress and decompress data for you. You can use it directly on your data and write (and read) a custom format, or use the gzip module, which wraps zlib to read and write gzip-compatible files while automatically compressing and decompressing the data so that it looks like an ordinary file object.
It is thus a near drop-in replacement for the built-in open() when interacting with files, and all you need is this:
import gzip
print('writing to disk .... ')
with gzip.open('mypath/myfile', 'wb') as f:
    f.write(getjsondata.read())
(with a change in the open line because I highly recommend using the with syntax to handle file objects.)
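Reading the data back is symmetric; a short sketch, assuming the file was written as above:
import gzip

# gzip.open decompresses transparently, so reading back looks like a plain file read.
with gzip.open('mypath/myfile', 'rb') as f:
    raw = f.read()   # the original, uncompressed json bytes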

How do I read in wav files in .gz?

I am learning machine learning and data analysis on wav files.
I know if I have wav files directly I can do something like this to read in the data
import librosa
mono, fs = librosa.load('./small_data/time_series_audio.wav', sr = 44100)
Now I'm given a gz-file "music_feature_extraction_test.tar.gz"
I'm not sure what to do now.
I tried:
with gzip.open('music_train.tar.gz', 'rb') as f:
    for files in f:
        mono, fs = librosa.load(files, sr=44100)
but it gives me:
TypeError: lstat() argument 1 must be encoded string without null bytes, not str
Can anyone help me out?
There are several things going on:
The file you are given is a gzip-compressed tarball. Take a look at the tarfile module; it can read gzip-compressed files directly. You'll get an iterator over its members, each of which is an individual file.
AFAIK librosa can't read from an in-memory buffer, so you have to unpack the tar members to temporary files. The tempfile module is your friend here: a NamedTemporaryFile will provide you with a self-deleting file that you can decompress to and hand to librosa.
You probably want to implement this as a simple generator function that takes the tarfile name as its input, iterates over its members, and yields what librosa.load() provides you. That way everything gets cleaned up automatically.
The basic loop would therefore be
Open the tarball using the tarfile-module. For each member
Get a new temporary file using NamedTemporaryFile. Copy the content of the tarball-member to that file. You may want to use shutil.copyfileobj to avoid reading the entire wav-file into memory before writing it to disk.
The NamedTemporaryFile has a name attribute. Pass that to librosa.load.
yield the return value of librosa.load to the caller, as in the sketch below.
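A minimal sketch of that generator, assuming the tarball contains only wav files (the helper name iter_tar_wavs and the .wav suffix are illustrative):
import shutil
import tarfile
import tempfile

import librosa


def iter_tar_wavs(tar_path, sr=44100):
    """Yield (mono, fs) pairs for every file member of a gzipped tarball."""
    with tarfile.open(tar_path, 'r:gz') as tar:
        for member in tar:
            if not member.isfile():
                continue
            extracted = tar.extractfile(member)
            # Copy the member into a self-deleting temporary file so librosa
            # can open it by name (works on POSIX; on Windows you may need
            # delete=False and manual cleanup).
            with tempfile.NamedTemporaryFile(suffix='.wav') as tmp:
                shutil.copyfileobj(extracted, tmp)
                tmp.flush()
                yield librosa.load(tmp.name, sr=sr)
Calling for mono, fs in iter_tar_wavs('music_train.tar.gz') then cleans up each temporary file automatically as the with block exits.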
You can use PySoundFile to read from the compressed file.
https://pysoundfile.readthedocs.io/en/0.9.0/#virtual-io
import tarfile
import soundfile

with tarfile.open('music_train.tar.gz', 'r:gz') as tar:
    for member in tar.getmembers():
        wav_f = tar.extractfile(member)       # file-like object for soundfile's virtual I/O
        if wav_f is not None:                 # skip directories
            mono, fs = soundfile.read(wav_f)  # returns (data, samplerate)
Maybe you should also check if you need to resample the data before processing it with librosa:
https://librosa.github.io/librosa/ioformats.html#read-specific-formats
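As a rough illustration of that resampling check (the file name is hypothetical, and the keyword names follow recent librosa versions):
import librosa
import soundfile

mono, fs = soundfile.read('example.wav')          # data, native sample rate
if fs != 44100:
    # Resample to the rate the rest of the pipeline expects.
    mono = librosa.resample(mono, orig_sr=fs, target_sr=44100)
    fs = 44100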

how to decompress .tar.bz2 in memory with python

How to decompress *.bz2 file in memory with python?
The bz2 file comes from a csv file.
I use the code below to decompress it in memory. It works, but it brings in some dirty data, such as the filename of the csv file and its author name. Is there a better way to handle it?
#!/usr/bin/python
# -*- coding: utf-8 -*-
import StringIO
import bz2
with open("/app/tmp/res_test.tar.bz2", "rb") as f:
    content = f.read()
compressedFile = StringIO.StringIO(content)
decompressedFile = bz2.decompress(compressedFile.buf)
compressedFile.seek(0)
with open("/app/tmp/decompress_test", 'w') as outfile:
    outfile.write(decompressedFile)
I found this question; it is about gzip, but my data is in bz2 format. I tried to follow its instructions, but it seems bz2 cannot be handled this way.
Edit:
Regardless of whether I use @metatoaster's answer or the code above, some extra dirty data ends up in the final decompressed file.
For example, my original data is attached below, in csv format, with the name res_test.csv:
Then I cd into the directory containing the file and compress it with tar -cjf res_test.tar.bz2 res_test.csv, which gives me the compressed file res_test.tar.bz2. This file simulates the bz2 data I will get from the internet, and I want to decompress it in memory without caching it to disk first. But what I get is the data below, which contains too much dirty data:
The data is still there, but submerged in noise. Is it possible to decompress it into pure data, identical to the original, instead of decompressing it and then having to extract the real data from all that noise?
For generic bz2 decompression, the BZ2File class may be used.
from bz2 import BZ2File
with BZ2File("/app/tmp/res_test.tar.bz2") as f:
    content = f.read()
content should contain the decompressed contents of the file.
However, given that this is a tar file (an archive file that is normally extracted to disk as a directory of files), the tarfile module could be used instead, and it has extended mode flags for handling bz2. Assuming the target file contains a res_test.csv, the following can be used
tf = tarfile.open('/app/tmp/res_test.tar.bz2', 'r:bz2')
csvfile = tf.extractfile('res_test.csv').read()
The r:bz2 mode opens the tar archive in a way that supports seeking backwards, which matters because the streaming alternative, r|bz2, makes it impractical to pull individual members out with extractfile. The second line simply calls extractfile to return the contents of res_test.csv from the archive as a string.
The transparent open mode ('r:*') is typically recommended, however, so that no failure is encountered if the input tar file turns out to be compressed with gzip instead.
Naturally, tarfile.open also accepts arbitrary stream objects via its fileobj argument. If the file has already been opened with BZ2File, that object can be passed in directly:
with BZ2File("/app/tmp/res_test.tar.bz2") as f:
    tf = tarfile.open(fileobj=f, mode='r:')
    csvfile = tf.extractfile('res_test.csv').read()
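Putting the pieces together, a sketch of a fully in-memory round trip (paths mirror the question; 'r:*' lets tarfile pick the compression):
import io
import tarfile

# Load the compressed bytes once, then let tarfile decompress them in memory.
with open('/app/tmp/res_test.tar.bz2', 'rb') as f:
    raw = io.BytesIO(f.read())

with tarfile.open(fileobj=raw, mode='r:*') as tf:
    csvfile = tf.extractfile('res_test.csv').read()   # just the csv payload, no tar metadata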

Writing a BytesIO object to a file, 'efficiently'

So a quick way to write a BytesIO object to a file would be to just use:
with open('myfile.ext', 'wb') as f:
    f.write(myBytesIOObj.getvalue())
myBytesIOObj.close()
However, if I wanted to iterate over myBytesIOObj rather than write it in one chunk, how would I go about it? I'm on Python 2.7.1. Also, if the BytesIO object is huge, would writing by iteration be more efficient?
Thanks
shutil has a utility that will write the file efficiently. It copies in chunks, defaulting to 16K. Any multiple of 4K chunks should be a good cross platform number. I chose 131072 rather arbitrarily because really the file is written to the OS cache in RAM before going to disk and the chunk size isn't that big of a deal.
import shutil
myBytesIOObj.seek(0)
with open('myfile.ext', 'wb') as f:
    shutil.copyfileobj(myBytesIOObj, f, length=131072)
BTW, there was no need to close the file object at the end. with defines a scope, and the file object is defined inside that scope. The file handle is therefore closed automatically on exit from the with block.
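If you would rather write the loop yourself instead of delegating to shutil, a minimal sketch of chunked copying looks like this (the chunk size is arbitrary):
CHUNK_SIZE = 64 * 1024   # arbitrary; any reasonable multiple of 4K works

myBytesIOObj.seek(0)
with open('myfile.ext', 'wb') as f:
    while True:
        chunk = myBytesIOObj.read(CHUNK_SIZE)
        if not chunk:
            break
        f.write(chunk)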
Since Python 3.2 it's possible to use the BytesIO.getbuffer() method as follows:
from io import BytesIO
buf = BytesIO(b'test')
with open('path/to/file', 'wb') as f:
    f.write(buf.getbuffer())
This way it doesn't copy the buffer's content, streaming it straight to the open file.
Note: The StringIO buffer doesn't support the getbuffer() protocol (as of Python 3.9).
Before streaming the BytesIO buffer to file, you might want to set its position to the beginning:
buf.seek(0)

Python Gzip - Appending to file on the fly

Is it possible to append to a gzipped text file on the fly using Python ?
Basically I am doing this:-
import gzip
content = "Lots of content here"
f = gzip.open('file.txt.gz', 'a', 9)
f.write(content)
f.close()
A line is appended (note "appended") to the file every 6 seconds or so, but the resulting file is just as big as a standard uncompressed file (roughly 1MB when done).
Explicitly specifying the compression level does not seem to make a difference either.
If I gzip an existing uncompressed file afterwards, its size comes down to roughly 80kb.
I'm guessing it's not possible to "append" to a gzip file on the fly and have it compress?
Is this a case of writing to a StringIO buffer and then flushing to a gzip file when done?
That works in the sense of creating and maintaining a valid gzip file, since the gzip format permits concatenated gzip streams.
However, it doesn't work in the sense that you get lousy compression, since you are giving each instance of gzip compression so little data to work with. Compression depends on taking advantage of the history of previous data, but here gzip has been given essentially none.
You could either (a) accumulate at least a few kilobytes of data, many of your lines, before invoking gzip to add another gzip stream to the file, or (b) do something much more sophisticated that appends to a single gzip stream, leaving a valid gzip stream each time and permitting efficient compression of the data. Option (a) is sketched below.
You find an example of b) in C, in gzlog.h and gzlog.c. I do not believe that Python has all of the interfaces to zlib needed to implement gzlog directly in Python, but you could interface to the C code from Python.
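A minimal sketch of option (a), buffering lines in memory and only appending a new gzip stream once a reasonable amount has accumulated (the helper names and the 64 KB threshold are illustrative):
import gzip

_buffer = []
_buffered_bytes = 0
FLUSH_THRESHOLD = 64 * 1024   # illustrative: append a new gzip stream every ~64 KB

def append_line(line):
    """Collect a line; flush to file.txt.gz once enough text has accumulated."""
    global _buffered_bytes
    _buffer.append(line)
    _buffered_bytes += len(line)
    if _buffered_bytes >= FLUSH_THRESHOLD:
        flush()

def flush():
    """Append one more gzip stream; concatenated streams are still a valid gzip file."""
    global _buffered_bytes
    if not _buffer:
        return
    with gzip.open('file.txt.gz', 'ab') as f:
        f.write(''.join(_buffer).encode('utf-8'))
    del _buffer[:]
    _buffered_bytes = 0
Remember to call flush() one last time when the program finishes so the tail of the buffer isn't lost.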
