I am learning machine learning and data analysis on wav files.
I know that if I have wav files directly I can do something like this to read in the data:
import librosa
mono, fs = librosa.load('./small_data/time_series_audio.wav', sr = 44100)
Now I'm given a gz-file "music_feature_extraction_test.tar.gz"
I'm not sure what to do now.
I tried:
with gzip.open('music_train.tar.gz', 'rb') as f:
    for files in f:
        mono, fs = librosa.load(files, sr=44100)
but it gives me:
TypeError: lstat() argument 1 must be encoded string without null bytes, not str
Can anyone help me out?
There are several things going on:
The file you are given is a gzip-compressed tarball. Take a look at the tarfile module; it can read gzip-compressed files directly. You'll get an iterator over its members, each of which is an individual file.
AFAIK librosa can't read from an in-memory buffer, so you have to unpack the tar members to temporary files. The tempfile module is your friend here; a NamedTemporaryFile will provide you with a self-deleting file that you can uncompress to and hand to librosa.
You probably want to implement this as a simple generator function that takes the tarfile name as its input, iterates over its members, and yields what librosa.load() provides. That way everything gets cleaned up automatically.
The basic loop would therefore be:
Open the tarball using the tarfile module. For each member:
Get a new temporary file using NamedTemporaryFile. Copy the content of the tarball member to that file. You may want to use shutil.copyfileobj to avoid reading the entire wav file into memory before writing it to disk.
The NamedTemporaryFile has a name attribute. Pass that to librosa.load.
yield the return value of librosa.load to the caller. A sketch of this loop follows.
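A minimal sketch of that generator, assuming the archive members are all wav files; the function name and default sample rate are placeholders:
import shutil
import tarfile
import tempfile

import librosa

def load_wavs_from_tarball(tar_path, sr=44100):
    # open the gzip-compressed tarball; tarfile handles the decompression
    with tarfile.open(tar_path, 'r:gz') as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            # unpack the member into a self-deleting temporary file
            with tempfile.NamedTemporaryFile(suffix='.wav') as tmp:
                shutil.copyfileobj(tar.extractfile(member), tmp)
                tmp.flush()
                # librosa reads from the temporary file's name
                yield librosa.load(tmp.name, sr=sr)

# usage:
# for mono, fs in load_wavs_from_tarball('music_feature_extraction_test.tar.gz'):
#     ...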
You can use PySoundFile to read from the compressed file.
https://pysoundfile.readthedocs.io/en/0.9.0/#virtual-io
import io
import tarfile
import soundfile

with tarfile.open('music_train.tar.gz', 'r:gz') as tar:
    for member in tar.getmembers():
        if member.isfile():
            # soundfile accepts file-like objects (virtual I/O), so no unpacking to disk is needed
            data, fs = soundfile.read(io.BytesIO(tar.extractfile(member).read()))
Maybe you should also check whether you need to resample the data before processing it with librosa:
https://librosa.github.io/librosa/ioformats.html#read-specific-formats
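For example, a hedged sketch of resampling the data and fs from the snippet above (the keyword names below match recent librosa versions; older releases take the rates positionally):
import librosa

# resample from the file's native rate to 44.1 kHz
mono_44k = librosa.resample(data, orig_sr=fs, target_sr=44100)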
Related
I need to convert an MP4 downloaded with the Requests module into an MP3 with the Moviepy module.
The two operations work perfectly.
However, in order to convert the MP4 into MP3 (using the moviepy audio.write_audiofile() method),
I need to save the MP4 to disk (by writing the Requests content to a file),
which is basically useless since I will delete it right after.
Do you know if there's a method that takes the content downloaded with Requests and converts it directly into an MP3 file?
Thank you in advance!
I am not so familiar with this, but I think you can use io.BytesIO (https://docs.python.org/3/library/io.html#io.BytesIO). With it you can write data (e.g. from requests) into a BytesIO object instead of a file (you can use it for any file read/write operation in place of a real file). An example:
import io

b = io.BytesIO()
# read a file's bytes into the in-memory buffer
with open("file.dat", "rb") as f:
    b.write(f.read())
# rewind the buffer so it can be read from the start
b.seek(0)
# write the buffered bytes back out to a new file
with open("new_file.dat", "wb") as f:
    f.write(b.read())
b.close()
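A minimal sketch of the download half, using a hypothetical URL; note that as far as I know moviepy's write_audiofile() ultimately shells out to ffmpeg and expects real file paths, so the conversion step itself may still need a temporary file:
import io
import requests

# hypothetical URL standing in for the real download location
resp = requests.get("https://example.com/clip.mp4")
mp4_buffer = io.BytesIO(resp.content)  # the MP4 bytes, held in memory instead of on disk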
I am trying to edit the length of wav files using the wave module. However, it seems that I can't get anywhere because I keep getting the same error that the number of channels is not specified. Still, when I write something to see the number of channels I get that error, even when I try to set the number of channels, as seen here:
def editLength(wavFile):
    file = wave.open(wavFile, 'w')
    file.setnchannels(file.getnchannels())
    x = file.getnchannels()
    print(x)
from https://docs.python.org/3.7/library/wave.html#wave.open
wave.open(file, mode=None)
    If file is a string, open the file by that name, otherwise treat it as a file-like object.
    mode can be:
    'rb'    Read only mode.
    'wb'    Write only mode.
    Note that it does not allow read/write WAV files.
You attempt to read and write the same WAV file at once; since it was opened in write mode, the file object has not had its number of channels specified at the time of the first file.getnchannels() call.
import wave

def editLength(wavFile):
    with wave.open(wavFile, "rb") as file:
        x = file.getnchannels()
        print(x)
If you want to edit the file you should first read from the original file and write to a temporary file, then copy the temporary file over the original file; a sketch of that pattern follows.
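A minimal sketch of that read-then-overwrite pattern, assuming a hypothetical helper that truncates the file to a given number of frames (the name and arguments are illustrative):
import shutil
import tempfile
import wave

def editLength(wavFile, nframes):
    # read the source parameters and the frames we want to keep
    with wave.open(wavFile, "rb") as src:
        params = src.getparams()
        frames = src.readframes(nframes)
    # write the truncated audio to a temporary file first
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    tmp.close()
    with wave.open(tmp.name, "wb") as dst:
        dst.setparams(params)   # writeframes()/close() fix up nframes in the header
        dst.writeframes(frames)
    # then copy the temporary file over the original
    shutil.move(tmp.name, wavFile)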
It's maybe not super obvious from the docs: https://docs.python.org/3/library/wave.html
The Wave_write object expects you to explicitly set all params for the object.
After a little trial and error I was able to read my wav file and write a specific duration to disk.
For example if I have a 44.1k sample rate wav file with 2 channels...
import wave

with wave.open("some_wavfile.wav", "rb") as handle:
    params = handle.getparams()
    # only read the first 10 seconds of audio
    frames = handle.readframes(441000)
    print(handle.tell())
    print(params)

params = list(params)
params[3] = len(frames)
print(params)

with wave.open("output_wavfile.wav", "wb") as handle:
    handle.setparams(params)
    handle.writeframes(frames)
This should leave you with output on stdout looking something like this:
441000
_wave_params(nchannels=2, sampwidth=2, framerate=44100, nframes=10348480, comptype='NONE', compname='not compressed')
[2, 2, 44100, 1764000, 'NONE', 'not compressed']
nframes here is 1764000 probably because nchannels=2 and sampwidth=2, so 1764000/4=441000 (I guess)
Oddly enough, setparams was able to accept a list instead of a tuple.
ffprobe shows exactly 10 seconds of audio for the output file and sounds perfect to me.
I have a large number of compressed HDF files, which I need to read.
file1.HDF.gz
file2.HDF.gz
file3.HDF.gz
...
I can read in uncompressed HDF files with the following method
from pyhdf.SD import SD, SDC
import os
os.system('gunzip < file1.HDF.gz > file1.HDF')
HDF = SD('file1.HDF')
and repeat this for each file. However, this is more time consuming than I want.
I'm thinking it's possible that most of the time overhead comes from writing the compressed file out as a new uncompressed version, and that I could speed things up if I were simply able to read an uncompressed version of the file into the SD function in one step.
Am I correct in this thinking? And if so, is there a way to do what I want?
According to the pyhdf package documentation, this is not possible.
__init__(self, path, mode=1)
SD constructor. Initialize an SD interface on an HDF file,
creating the file if necessary.
There is no other way to instantiate an SD object, and none that takes a file-like object. This is likely because the package conforms to an external interface (NCSA HDF). The HDF format is also normally used for massive files that are impractical to hold in memory all at once.
Unzipping it as a file is likely your most performant option.
If you would like to stay in Python, use the gzip module (docs):
import gzip
import shutil
with gzip.open('file1.HDF.gz', 'rb') as f_in, open('file1.HDF', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)
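A hedged sketch combining this with a temporary file and pyhdf; the helper name is illustrative, and the caller is responsible for removing the scratch file when finished:
import gzip
import shutil
import tempfile

from pyhdf.SD import SD

def open_gzipped_hdf(gz_path):
    # decompress into a named scratch file that pyhdf can open by path
    tmp = tempfile.NamedTemporaryFile(suffix='.HDF', delete=False)
    with gzip.open(gz_path, 'rb') as f_in:
        shutil.copyfileobj(f_in, tmp)
    tmp.close()
    # return the path too, so the caller can os.remove() it when done
    return SD(tmp.name), tmp.name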
sascha is correct that HDF transparent compression is more suitable than gzipping; nonetheless, if you can't control how the HDF files are stored, you're looking for the gzip Python module (docs), which can read the data from these files.
How to decompress *.bz2 file in memory with python?
The bz2 file comes from a csv file.
I use the code below to decompress it in memory. It works, but it brings in some dirty data such as the filename of the csv file and its author name; is there any better way to handle this?
#!/usr/bin/python
# -*- coding: utf-8 -*-
import StringIO
import bz2
with open("/app/tmp/res_test.tar.bz2", "rb") as f:
content = f.read()
compressedFile = StringIO.StringIO(content)
decompressedFile = bz2.decompress(compressedFile.buf)
compressedFile.seek(0)
with open("/app/tmp/decompress_test", 'w') as outfile:
outfile.write(decompressedFile)
I found this question; it is about gzip, but my data is in bz2 format. I tried to do as instructed there, but it seems that bz2 cannot handle it this way.
Edit:
No matter whether I use the answer of @metatoaster or the code above, both of them bring some extra dirty data into the final decompressed file.
For example: my original data is attached below, in csv format, with the name res_test.csv:
Then I cd into the directory containing the file and compress it with tar -cjf res_test.tar.bz2 res_test.csv to get the compressed file res_test.tar.bz2. This file simulates the bz2 data that I will get from the internet, and I wish to decompress it in memory without caching it to disk first, but what I get is the data below, which contains too much dirty data:
The data is still there, but submerged in noise. Is it possible to decompress it into pure data, the same as the original data, instead of decompressing it and then having to extract the real data from so much noise?
For generic bz2 decompression, the BZ2File class may be used.
from bz2 import BZ2File

with BZ2File("/app/tmp/res_test.tar.bz2") as f:
    content = f.read()
content should contain the decompressed contents of the file.
However, given that this is a tar file (an archive file that is normally extracted to disk as a directory of files), the tarfile module could be used instead, and it has extended mode flags for handling bz2. Assuming the target file contains a res_test.csv, the following can be used
tf = tarfile.open('/app/tmp/res_test.tar.bz2', 'r:bz2')
csvfile = tf.extractfile('res_test.csv').read()
The r:bz2 flag opens the tar archive in a way that makes it possible to seek backwards, which is important because the alternative mode r|bz2 makes it impractical to extract files from the members it returns via extractfile. The second line simply calls extractfile to return the contents of 'res_test.csv' from the archive as a string.
The transparent open mode ('r:*') is typically recommended, however, so that if the input tar file turns out to be compressed with gzip instead, no failure will be encountered.
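For instance, the same call with the transparent mode:
import tarfile

tf = tarfile.open('/app/tmp/res_test.tar.bz2', 'r:*')  # autodetects the compression in use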
Naturally, the tarfile module has a lower-level open method which may be used on arbitrary stream objects. If the file was already opened using BZ2File, this can also be used:
with BZ2File("/app/tmp/res_test.tar.bz2") as f:
    tf = tarfile.open(fileobj=f, mode='r:')
    csvfile = tf.extractfile('res_test.csv').read()
I'm running a WSGI server and part of the API I'm writing returns some (rather large) files along with meta-data about them. I'd like to tar/gzip the files together both to conserve bandwidth and so only one file has to be downloaded. Since WSGI lets you return an iterable object, I'd like to return an iterable that returns chunks of the tar.gz file as it's produced.
My question is what's a good way to tar/gzip files together in Python in a way that's amenable to streaming the output back to the user?
EDIT:
To elaborate on my response to Oben Sonne below, I'll have a function such as:
def iter_file(f, chunk=32768): return iter(lambda: f.read(chunk), b'')  # b'' sentinel since the streams are binary
Which will let me specify a chunk size to return from the file when returning it to the WSGI server.
Then it's a simple matter of:
return iter_file(subprocess.Popen(["tar", "-Ocz"] + files, stdout=subprocess.PIPE).stdout)
or, if I want to return a file:
return iter_file(open(filename, "rb"))
The bz2 module provides sequential compression, and it seems the zlib package can compress data sequentially too. So with these modules you could:
tar your files (shouldn't take that long),
read the archive iteratively in binary mode,
pass read chunks to a sequential compression function, and
yield the compressed output of these functions so it may be consumed iteratively by some other component (WSGI)
AFAIK Python's tar API does not support sequential tar'ing (correct me if I'm wrong). But if your files are so large that you really need to tar sequentially, you could use the subprocess module to run tar on the command line and read its standard output in chunks. In that case you could also let the tar command compress your data, so you would only have to read the stdout of your subprocess and yield the read chunks.
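A minimal sketch of the pure-Python variant of the steps above, assuming the tar archive already exists on disk (the function name and chunk size are illustrative):
import zlib

def stream_gzipped(path, chunk=32768):
    # wbits=31 (the third positional argument) asks zlib for a gzip wrapper,
    # so the client can treat the stream as an ordinary .gz download
    compressor = zlib.compressobj(9, zlib.DEFLATED, 31)
    with open(path, 'rb') as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            out = compressor.compress(data)
            if out:
                yield out
    yield compressor.flush()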