I'm running a WSGI server and part of the API I'm writing returns some (rather large) files along with meta-data about them. I'd like to tar/gzip the files together both to conserve bandwidth and so only one file has to be downloaded. Since WSGI lets you return an iterable object, I'd like to return an iterable that returns chunks of the tar.gz file as it's produced.
My question is what's a good way to tar/gzip files together in Python in a way that's amenable to streaming the output back to the user?
EDIT:
To elaborate on my response to Oben Sonne below, I'll have a function such as:
def iter_file(f, chunk=32768):
    # iterate until read() returns an empty result (b'' for files opened in binary mode)
    return iter(lambda: f.read(chunk), b'')
This lets me specify a chunk size to read from the file when returning it to the WSGI server.
Then it's a simple matter of:
return iter_file(subprocess.Popen(["tar", "-Ocz"] + files, stdout=subprocess.PIPE).stdout)
or, if I want to return a file:
return iter_file(open(filename, "rb"))
The bz2 module supports sequential compression, and it seems the zlib module can compress data sequentially too. So with these modules you could:
tar your files (shouldn't take that long),
read the archive iteratively in binary mode,
pass read chunks to a sequential compression function, and
yield the compressed output of these functions so it may be consumed iteratively by some other component (WSGI)
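For example, a generator along these lines could follow those steps: tar the files into a temporary file, then read the archive back in chunks and compress them sequentially with zlib (the function name, chunk size, and compression level are illustrative; wbits=31 makes zlib emit a gzip-wrapped stream):

import tarfile
import tempfile
import zlib

def stream_tar_gz(paths, chunk_size=32768):
    # Build the tar archive in a temporary file first, then stream it back
    # out as sequentially compressed gzip chunks.
    with tempfile.NamedTemporaryFile() as tmp:
        with tarfile.open(fileobj=tmp, mode="w") as tar:
            for path in paths:
                tar.add(path)
        tmp.seek(0)
        compressor = zlib.compressobj(9, zlib.DEFLATED, 31)  # 31 = gzip container
        while True:
            chunk = tmp.read(chunk_size)
            if not chunk:
                break
            data = compressor.compress(chunk)
            if data:
                yield data
        yield compressor.flush()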
AFAIK Python's tar API does not support sequential tarring (correct me if I'm wrong). But if your files are so large that you really need to tar sequentially, you could use the subprocess module to run tar on the command line and read its standard output in chunks. In that case you could also let the tar command compress your data; then you would only have to read your subprocess's stdout and yield the chunks you read.
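If you go the subprocess route, a small generator could wrap the tar process. This is only a sketch (the function name and chunk size are made up; -f - tells tar explicitly to write the archive to stdout):

import subprocess

def stream_tar_cmd(paths, chunk_size=32768):
    # -c create, -z gzip-compress, -f - write the archive to stdout
    proc = subprocess.Popen(["tar", "-czf", "-"] + list(paths),
                            stdout=subprocess.PIPE)
    try:
        while True:
            chunk = proc.stdout.read(chunk_size)
            if not chunk:
                break
            yield chunk
    finally:
        proc.stdout.close()
        proc.wait()  # reap the child so it doesn't linger as a zombie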
I am learning machine learning and data analysis on wav files.
I know if I have wav files directly I can do something like this to read in the data
import librosa
mono, fs = librosa.load('./small_data/time_series_audio.wav', sr = 44100)
Now I'm given a gz-file "music_feature_extraction_test.tar.gz"
I'm not sure what to do now.
I tried:
with gzip.open('music_train.tar.gz', 'rb') as f:
    for files in f:
        mono, fs = librosa.load(files, sr=44100)
but it gives me:
TypeError: lstat() argument 1 must be encoded string without null bytes, not str
Can anyone help me out?
There are several things going on:
The file you are given is a gzip-compressed tarball. Take a look at the tarfile module; it can read gzip-compressed files directly. You'll get an iterator over its members, each of which is an individual file.
AFAIK librosa can't read from an in-memory buffer, so you have to unpack the tar members to temporary files. The tempfile module is your friend here: a NamedTemporaryFile gives you a self-deleting file that you can extract a member to and pass to librosa.
You probably want to implement this as a simple generator function that takes the tarfile name as its input, iterates over its members, and yields what librosa.load() provides. That way everything gets cleaned up automatically.
The basic loop would therefore be
Open the tarball using the tarfile module. For each member:
Get a new temporary file using NamedTemporaryFile. Copy the content of the tarball member to that file. You may want to use shutil.copyfileobj to avoid reading the entire wav file into memory before writing it to disk.
The NamedTemporaryFile has a name attribute. Pass that to librosa.load.
yield the return value of librosa.load to the caller.
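Put together, a generator roughly like this could implement that loop (the name iter_wavs and the fixed sample rate are just for illustration):

import shutil
import tarfile
import tempfile

import librosa

def iter_wavs(tar_path, sr=44100):
    # Yield (samples, sample_rate) for each file inside a .tar.gz archive.
    with tarfile.open(tar_path, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            extracted = tar.extractfile(member)
            # Copy the member into a named temporary file so librosa can
            # open it by path; the file deletes itself when the block exits.
            with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
                shutil.copyfileobj(extracted, tmp)
                tmp.flush()
                yield librosa.load(tmp.name, sr=sr)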
You can use PySoundFile to read from the compressed file.
https://pysoundfile.readthedocs.io/en/0.9.0/#virtual-io
import tarfile
import soundfile

# soundfile.read accepts file-like objects (virtual I/O) and returns (data, samplerate)
with tarfile.open('music_train.tar.gz', 'r:gz') as tar:
    for member in tar:
        if member.isfile():
            mono, fs = soundfile.read(tar.extractfile(member))
Maybe you should also check if you need to resample the data before processing it with librosa:
https://librosa.github.io/librosa/ioformats.html#read-specific-formats
I use multiple python scripts that collect data and write it into one single json data file.
It is not possible to combine the scripts.
The writing process is fast, and it often happens that errors occur (e.g. some characters at the end get duplicated), which is fatal, especially since I am using the JSON format.
Is there a way to prevent a Python script from writing to a file while other scripts are currently writing to it? (It would be absolutely fine if the data that the script tries to write gets lost, but it is important that the file's syntax does not get corrupted.)
Code Snippet:
This opens the file and retrieves the data:
data = json.loads(open("data.json").read())
This appends a new dictionary:
data.append(new_dict)
And the old file is overwritten:
open("data.json","w").write( json.dumps(data) )
Info: data is a list which contains dicts.
Operating System: The whole process takes place on a Linux server.
On Windows, you could try to create the file and bail out if an exception occurs (because the file is locked by another script). But on Linux, your approach is bound to fail.
Instead, I would
write one file per new dictionary, suffixing the filename with the process ID and a counter
have the consuming process(es) read not one single file but all the files, sorted by modification time, and build the data from them
So in each script:
filename = "data_{}_{}.json".format(os.getpid(),counter)
counter+=1
open(filename ,"w").write( json.dumps(new_dict) )
and in the consumers (reading each dict of sorted files in a protected loop):
import glob
import json
import os

files = sorted(glob.glob("*.json"), key=os.path.getmtime)  # pass the function itself as key
data = []
for f in files:
    try:
        with open(f) as fh:
            data.append(json.load(fh))
    except Exception:
        # IO error, malformed json file: ignore
        pass
I will post my own solution, since it works for me:
Every single Python script checks (before opening and writing the data file) whether a file called data_check exists. If so, the script does not try to read and write the data file and discards the data that was supposed to be written. If not, the script creates the file data_check and then starts to read and write the data file. After the writing process is done, the file data_check is removed.
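A minimal sketch of that scheme, assuming the list-of-dicts layout from the question (the function name try_append is made up, and the lock file is created with mode 'x' so the creation itself is atomic):

import json
import os

LOCK_FILE = "data_check"

def try_append(new_dict, data_file="data.json"):
    # Returns True if new_dict was written, False if it was discarded
    # because another script currently holds the lock.
    try:
        lock = open(LOCK_FILE, "x")   # atomic: fails if the file already exists
    except FileExistsError:
        return False
    try:
        data = []
        if os.path.exists(data_file):
            with open(data_file) as fh:
                data = json.load(fh)
        data.append(new_dict)
        with open(data_file, "w") as fh:
            json.dump(data, fh)
        return True
    finally:
        lock.close()
        os.remove(LOCK_FILE)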
I am trying to use "requests" package and retrieve info from Github, like the Requests doc page explains:
import requests
r = requests.get('https://api.github.com/events')
And this:
with open(filename, 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)
I have to say I don't understand the second code block.
filename - in what form do I provide the path to the file if created? where will it be saved if not?
'wb' - what is this variable? (shouldn't second parameter be 'mode'?)
the following two lines probably iterate over the data retrieved with the request and write it to the file
The explanation in the Python docs isn't helping much either.
EDIT: What I am trying to do:
use Requests to connect to an API (Github and later Facebook GraphAPI)
retrieve data into a variable
write this into a file (later, as I get more familiar with Python, into my local MySQL database)
Filename
When using open, the path is relative to your current working directory. So open('file.txt', 'w') would create a new file named file.txt in whatever directory you ran the script from. You can also specify an absolute path, for example /home/user/file.txt on Linux. If a file named file.txt already exists, its contents will be completely overwritten.
Mode
The 'wb' option is indeed the mode. The 'w' means write and the 'b' means bytes. You use 'w' when you want to write to (rather than read from) a file, and you use 'b' for binary files (rather than text files). It is actually a little odd to use 'b' in this case, as the content you are writing is text. Specifying 'w' would work just as well here. Read more on the modes in the docs for open.
The Loop
This part is using the iter_content method from requests, which is intended for use with large files that you may not want in memory all at once. This is unnecessary in this case, since the page in question is only 89 KB. See the requests library docs for more info.
Conclusion
The example you are looking at is meant to handle the most general case, in which the remote file might be binary and too big to be in memory. However, we can make your code more readable and easy to understand if you are only accessing small webpages containing text:
import requests
r = requests.get('https://api.github.com/events')
with open('events.txt', 'w') as fd:
    fd.write(r.text)
filename is a string of the path you want to save it at. It accepts either a relative or an absolute path, so you can just have filename = 'example.html'
wb stands for write and bytes (binary mode); see the docs for open to learn more
The for loop goes over the entire returned content (in chunks, in case it is too large for proper memory handling) and writes them until there are no more. Useful for large files, but for a single webpage you could just do:
# just 'w' because we are not writing bytes anymore, just text
with open(filename, 'w') as fd:
    fd.write(r.text)
Is it possible to append to a gzipped text file on the fly using Python ?
Basically I am doing this:-
import gzip
content = "Lots of content here"
f = gzip.open('file.txt.gz', 'a', 9)
f.write(content)
f.close()
A line is appended (note "appended") to the file every 6 seconds or so, but the resulting file is just as big as a standard uncompressed file (roughly 1MB when done).
Explicitly specifying the compression level does not seem to make a difference either.
If I gzip an existing uncompressed file afterwards, its size comes down to roughly 80kb.
I'm guessing it's not possible to "append" to a gzip file on the fly and have it compress well?
Is this a case of writing to a StringIO buffer and then flushing to a gzip file when done?
That works in the sense of creating and maintaining a valid gzip file, since the gzip format permits concatenated gzip streams.
However it doesn't work in the sense that you get lousy compression, since you are giving each instance of gzip compression so little data to work with. Compression depends on taking advantage of the history of previous data, but here gzip has been given essentially none.
You could either a) accumulate at least a few K of data, many of your lines, before invoking gzip to add another gzip stream to the file, or b) do something much more sophisticated that appends to a single gzip stream, leaving a valid gzip stream each time and permitting efficient compression of the data.
You can find an example of b) in C, in gzlog.h and gzlog.c. I do not believe that Python has all of the interfaces to zlib needed to implement gzlog directly in Python, but you could interface to the C code from Python.
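A rough sketch of option a), assuming the lines arrive one at a time from some source (line_source() below is a placeholder) and reusing the file name from the question:

import gzip

buffer, size = [], 0
for line in line_source():              # placeholder for wherever the lines come from
    buffer.append(line)
    size += len(line)
    if size >= 64 * 1024:               # a few tens of KB per appended gzip stream
        with gzip.open('file.txt.gz', 'ab', 9) as f:
            f.write(''.join(buffer).encode('utf-8'))
        buffer, size = [], 0
if buffer:                              # flush whatever is left at the end
    with gzip.open('file.txt.gz', 'ab', 9) as f:
        f.write(''.join(buffer).encode('utf-8'))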
I'm querying a database and archiving the results using Python, and I'm trying to compress the data as I write it to the log files. I'm having some problems with it, though.
My code looks like this:
log_file = codecs.open(archive_file, 'w', 'bz2')
for id, f1, f2, f3 in cursor:
    log_file.write('%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3))
However, my output file has a size of 1,409,780. Running bunzip2 on the file results in a file with a size of 943,634, and running bzip2 on that results in a size of 217,275. In other words, the uncompressed file is significantly smaller than the file compressed using Python's bzip codec. Is there a way to fix this, other than running bzip2 on the command line?
I tried Python's gzip codec (changing the line to codecs.open(archive_file, 'a+', 'zip')) to see if it fixed the problem. I still get large files, but I also get a gzip: archive_file: not in gzip format error when I try to uncompress the file. What's going on there?
EDIT: I originally had the file opened in append mode, not write mode. While this may or may not be a problem, the question still holds if the file's opened in 'w' mode.
As other posters have noted, the issue is that the codecs library doesn't use an incremental encoder to encode the data; instead it encodes every snippet of data fed to the write method as a compressed block. This is horribly inefficient, and just a terrible design decision for a library designed to work with streams.
The ironic thing is that there's a perfectly reasonable incremental bz2 encoder already built into Python. It's not difficult to create a "file-like" class which does the correct thing automatically.
import bz2
class BZ2StreamEncoder(object):
    def __init__(self, filename, mode):
        self.log_file = open(filename, mode)
        self.encoder = bz2.BZ2Compressor()

    def write(self, data):
        self.log_file.write(self.encoder.compress(data))

    def flush(self):
        self.log_file.write(self.encoder.flush())
        self.log_file.flush()

    def close(self):
        self.flush()
        self.log_file.close()
log_file = BZ2StreamEncoder(archive_file, 'ab')
A caveat: In this example, I've opened the file in append mode; appending multiple compressed streams to a single file works perfectly well with bunzip2, but Python itself can't handle it (although there is a patch for it). If you need to read the compressed files you create back into Python, stick to a single stream per file.
The problem seems to be that output is being written on every write(). This causes each line to be compressed in its own bzip block.
I would try building a much larger string (or list of strings if you are worried about performance) in memory before writing it out to the file. A good size to shoot for would be 900K (or more), as that is the block size that bzip2 uses.
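For instance, a sketch of that buffering, reusing archive_file and cursor from the question but writing through bz2.BZ2File instead of the codecs wrapper:

import bz2

BLOCK = 900 * 1024                      # bzip2's block size
log_file = bz2.BZ2File(archive_file, 'w')
pending, pending_size = [], 0
for id, f1, f2, f3 in cursor:
    line = '%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3)
    pending.append(line)
    pending_size += len(line)
    if pending_size >= BLOCK:
        # one large write per ~900K so the rows share compression blocks
        log_file.write(''.join(pending).encode('utf-8'))
        pending, pending_size = [], 0
if pending:
    log_file.write(''.join(pending).encode('utf-8'))
log_file.close()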
The problem is due to your use of append mode, which results in files that contain multiple compressed blocks of data. Look at this example:
>>> import codecs
>>> with codecs.open("myfile.zip", "a+", "zip") as f:
>>> f.write("ABCD")
On my system, this produces a file 12 bytes in size. Let's see what it contains:
>>> with codecs.open("myfile.zip", "r", "zip") as f:
...     f.read()
'ABCD'
Okay, now let's do another write in append mode:
>>> with codecs.open("myfile.zip", "a+", "zip") as f:
>>> f.write("EFGH")
The file is now 24 bytes in size, and its contents are:
>>> with codecs.open("myfile.zip", "r", "zip") as f:
...     f.read()
'ABCD'
What's happening here is that unzip expects a single zipped stream. You'll have to check the specs to see what the official behavior is with multiple concatenated streams, but in my experience they process the first one and ignore the rest of the data. That's what Python does.
I expect that bunzip2 is doing the same thing. So in reality your file is compressed, and is much smaller than the data it contains. But when you run it through bunzip2, you're getting back only the first set of records you wrote to it; the rest is discarded.
I'm not sure how different this is from the codecs way of doing it, but if you use GzipFile from the gzip module you can incrementally append to the file. However, it's not going to compress very well unless you are writing large amounts of data at a time (maybe > 1 KB). This is just the nature of the compression algorithms. If the data you are writing isn't super important (i.e. you can deal with losing it if your process dies), you could write a buffered GzipFile class wrapping the imported class that writes out larger chunks of data.
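Something along these lines could serve as that buffered wrapper; the class name and the 64 KB threshold are made up, and data passed to write() is expected to be bytes:

import gzip

class BufferedGzipFile(object):
    # Collect writes in memory and hand them to GzipFile in larger chunks so
    # the compressor has more data to work with. Buffered data is lost if the
    # process dies before close().
    def __init__(self, path, threshold=64 * 1024):
        self.gz = gzip.GzipFile(path, 'ab')
        self.threshold = threshold
        self.chunks = []
        self.size = 0

    def write(self, data):
        self.chunks.append(data)
        self.size += len(data)
        if self.size >= self.threshold:
            self._drain()

    def _drain(self):
        if self.chunks:
            self.gz.write(b''.join(self.chunks))
            self.chunks = []
            self.size = 0

    def close(self):
        self._drain()
        self.gz.close()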