Noise reduction on multiple WAV files in Python

I have used the code from
https://github.com/davidpraise45/Audio-Signal-Processing
to write a function that runs it over an entire folder containing around 100 WAV files, but I am unable to get any output and can't understand what the problem is.
def noise_reduction(dirName):
    types = ('*.wav', '*.aif', '*.aiff', '*.mp3', '*.au', '*.ogg')
    wav_file_list = []
    for files in types:
        wav_file_list.extend(glob.glob(os.path.join(dirName, files)))
    wav_file_list = sorted(wav_file_list)
    wav_file_list2 = []
    for i, wavFile in enumerate(wav_file_list):
        # samples = get_samples(wavFile,)
        (Frequency, samples) = read(wavFile)
        FourierTransformation = sp.fft(samples)  # Calculating the fourier transformation of the signal
        scale = sp.linspace(0, Frequency, len(samples))
        b, a = signal.butter(5, 9800/(Frequency/2), btype='highpass')  # ButterWorth filter 4350
        filteredSignal = signal.lfilter(b, a, samples)
        c, d = signal.butter(5, 200/(Frequency/4), btype='lowpass')  # ButterWorth low-filter
        newFilteredSignal = signal.lfilter(c, d, filteredSignal)  # Applying the filter to the signal
        write(New,wavFile, Frequency, newFilteredSignal)

noise_reduction("C:\\Users\\adity\\Desktop\\capstone\\hindi_dia_2\\sad\\sad_1.wav")

scipy.io.wavfile.read only supports the WAV format. It cannot read aif, aiff, mp3, au, or ogg files.
You are passing four arguments to scipy.io.wavfile.write, which only takes three. New,wavFile should most likely be os.path.join(os.path.dirname(wavFile), "New"+os.path.basename(wavFile)). This creates a file with a New prefix in the same directory as the original. If you want to create the files in the current directory instead, use "New"+os.path.basename(wavFile).
You are passing a filename, not the name of a directory to your function:
noise_reduction("C:\\Users\\adity\\Desktop\\capstone\\hindi_dia_2\\sad\\sad_1.wav")
should likely be:
noise_reduction("C:\\Users\\adity\\Desktop\\capstone\\hindi_dia_2\\sad")
This causes the glob pattern to end up as C:\\Users\\adity\\Desktop\\capstone\\hindi_dia_2\\sad\\sad_1.wav\\*.wav, which has no matches unless sad_1.wav were a directory containing files ending in .wav.
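Putting those fixes together, one possible corrected version of the function looks like the sketch below (only .wav files are globbed, since scipy.io.wavfile cannot read the other formats, and the output naming is just one choice):

import os
import glob
from scipy.io.wavfile import read, write
from scipy import signal

def noise_reduction(dirName):
    # Only WAV files, since scipy.io.wavfile supports nothing else
    wav_file_list = sorted(glob.glob(os.path.join(dirName, '*.wav')))
    for wavFile in wav_file_list:
        (Frequency, samples) = read(wavFile)
        # Butterworth high-pass followed by low-pass, as in the original code
        b, a = signal.butter(5, 9800/(Frequency/2), btype='highpass')
        filteredSignal = signal.lfilter(b, a, samples)
        c, d = signal.butter(5, 200/(Frequency/4), btype='lowpass')
        newFilteredSignal = signal.lfilter(c, d, filteredSignal)
        # Write "New<original name>" next to the original file
        outFile = os.path.join(os.path.dirname(wavFile),
                               "New" + os.path.basename(wavFile))
        write(outFile, Frequency, newFilteredSignal)

noise_reduction("C:\\Users\\adity\\Desktop\\capstone\\hindi_dia_2\\sad")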


Store output from for loop in an array

I am still at the very beginning of my Python journey, so this question might be basic for more advanced programmers.
I would like to analyse a bunch of .wav files that are all stored in the same directory, so I created a list of all the filenames so I could then get their audio signal and sample rate.
dirPath=r"path_to_directory"
global files
files = [f for f in os.listdir(dirPath) if os.path.isfile(os.path.join(dirPath, f)) and f.endswith(".wav")]
for file_name in files:
path_to_file= dirPath+"\\"+file_name
audio_signal, sample_rate = sf.read(path_to_file)
with sf being the soundfile library.
audio_signal is an array, sample_rate is a number.
Now I would like to be able to store both audio_signal and sample_rate together with the corresponding file_name, so I can access them later. How do I do this?
I tried
files = [f for f in os.listdir(dirPath) if os.path.isfile(os.path.join(dirPath, f)) and f.endswith(".wav")], [], []

for file_name in files[0]:
    path_to_file = dirPath + "\\" + file_name
    audio_signal, sample_rate = sf.read(path_to_file)
    files[1].append(audio_signal)
    files[2].append(sample_rate)
which seems to work, but is there a more elegant way? It feels like audio_signal, sample_rate, and file_name end up as separate values rather than being linked together.
The data structure you are looking for is an associative array which associates a key with a value – the key being the file name and the value being a tuple consisting of the audio signal and sample rate in this example.
An implementation of associative arrays is built into Python as the dictionary type.
You can learn about dictionaries here:
https://docs.python.org/3/tutorial/datastructures.html#dictionaries
The application in your code would look like:
result = {}
for file_name in files:
    # as before:
    # ...
    audio_signal, sample_rate = ...
    # new:
    result[file_name] = audio_signal, sample_rate
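Spelled out with the file-reading code from the question, a complete version of this approach might look like the following sketch (dirPath is assumed to point at the directory of .wav files):

import os
import soundfile as sf

dirPath = r"path_to_directory"

# file name -> (audio_signal, sample_rate)
result = {}
for file_name in os.listdir(dirPath):
    path_to_file = os.path.join(dirPath, file_name)
    if os.path.isfile(path_to_file) and file_name.endswith(".wav"):
        audio_signal, sample_rate = sf.read(path_to_file)
        result[file_name] = (audio_signal, sample_rate)

# Later access by file name:
# signal, rate = result["example.wav"]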

Pytables: Can Appended Earray be reduced in size?

Following suggestions in an SO post, I also found PyTables append to be exceptionally time efficient. However, in my case the output file (earray.h5) is huge. Is there a way to append the data such that the output file is not as huge? For example, in my case (see the link below) a 13 GB input file (dset_1: 2.1E8 x 4 and dset_2: 2.1E8 x 4) gives a 197 GB output file with just one column (2.5E10 x 1). All elements are float64.
I want to reduce the output file size such that the execution speed of the script is not compromised and the output file is also efficient to read later. Can saving the data along columns and not just rows help? Any suggestions on this? Given below is an MWE.
Output and input files' details here
import h5py
import tables

# no. of chunks from dset-1 and dset-2 in inp.h5
loop_1 = 40
loop_2 = 20

# save to disk after these many rows
app_len = 10**6

# **********************************************
# Grabbing input.h5 file
# **********************************************
filename = 'inp.h5'
f2 = h5py.File(filename, 'r')
chunks1 = f2['dset_1']
chunks2 = f2['dset_2']
shape1, shape2 = chunks1.shape[0], chunks2.shape[0]

f1 = tables.open_file("table.h5", "w")
a = f1.create_earray(f1.root, "dataset_1", atom=tables.Float64Atom(), shape=(0, 4))

size1 = shape1//loop_1
size2 = shape2//loop_2

# ***************************************************
# Grabbing chunks to process and append data
# ***************************************************
for c in range(loop_1):
    h = c*size1
    # grab chunks from dset_1 of inp.h5
    chunk1 = chunks1[h:(h + size1)]
    for d in range(loop_2):
        g = d*size2
        chunk2 = chunks2[g:(g + size2)]  # grab chunks from dset_2 of inp.h5
        r1 = chunk1.shape[0]
        r2 = chunk2.shape[0]
        left, right = 0, 0
        for j in range(r1):  # grab col.2 values from dataset-1
            e1 = chunk1[j, 1]
            # ...Algebraic operations here to output a row containing 4 float64
            # ...append to a (earray) when no. of rows reach a million
        del chunk2
    del chunk1
f2.close()
I wrote the answer you are referencing. That is a simple example that "only" writes 1.5e6 rows. I didn't do anything to optimize performance for very large files. You are creating a very large file, but did not say how many rows (obviously way more than 10**6). Here are some suggestions based on comments in another thread.
Areas I recommend (3 related to PyTables code, and 2 based on external utilities).
PyTables code suggestions:
Enable compression when you create the file (add the filters= parameter to the file-creation call). Start with tb.Filters(complevel=1).
Define the expectedrows= parameter in .create_earray() (per the PyTables docs, 'this will optimize the HDF5 B-Tree and amount of memory used'). The default value is set in tables/parameters.py (look for EXPECTED_ROWS_TABLE; it's only 10000 in my installation). I suggest you set this to a larger value if you are creating 10**6 (or more) rows.
There is a side benefit to setting expectedrows=. If you don't define chunkshape, 'a sensible value is calculated based on the expectedrows parameter'. Check the value used. This won't decrease the created file size, but will improve I/O performance.
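Applied to the file-creation lines in the MWE above, the first two suggestions might look roughly like this (the complevel/complib choice and the row estimate are illustrative values to tune for your data):

import tables as tb

# Enable compression and give PyTables a realistic row estimate up front
filters = tb.Filters(complevel=1, complib='zlib')
f1 = tb.open_file("table.h5", "w", filters=filters)
a = f1.create_earray(f1.root, "dataset_1",
                     atom=tb.Float64Atom(), shape=(0, 4),
                     expectedrows=10**8,   # set to your estimated final row count
                     filters=filters)
print(a.chunkshape)  # inspect the chunk shape PyTables chose based on expectedrows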
If you didn't use compression when you created the file, there are 2 methods to compress existing files:
External Utilities:
The PyTables utility ptrepack - run against an HDF5 file to create a new file (useful to go from uncompressed to compressed, or vice versa). It is delivered with PyTables, and runs on the command line.
The HDF5 utility h5repack - works similar to ptrepack. It is delivered with the HDF5 installer from The HDF Group.
There are trade-offs with file compression: it reduces the file size but increases access time (reduces I/O performance). I tend to use uncompressed files for the ones I open frequently (for best I/O performance). Then, when done, I convert them to compressed format for long-term archiving. You can continue to work with them in compressed format (the API handles it cleanly).
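If running ptrepack or h5repack is not convenient, roughly the same conversion can be done from Python; the sketch below assumes tables.copy_file passes a filters keyword through to the copy (check your PyTables version's docs):

import tables as tb

# Re-write an uncompressed HDF5 file as a compressed copy
# (an in-Python alternative to the ptrepack command-line utility)
tb.copy_file("table.h5", "table_compressed.h5",
             overwrite=True,
             filters=tb.Filters(complevel=1, complib='zlib'))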

compute mfcc for audio file over fixed chunks and save them separately

WHAT I WAS ABLE TO DO:
Currently I'm able to generate mfcc for all files in a given folder and save them as follows:
def gen_features(in_path, out_path):
    src = in_path + '/'
    output_path = out_path + '/'
    sr = 22050
    path_to_audios = [os.path.join(src, f) for f in os.listdir(src)]
    for audio in path_to_audios:
        audio_data = librosa.load(audio, sr=22050)[0]  # getting y
        mfcc_feature_list = librosa.feature.mfcc(y=audio_data, sr=sr)  # create mfcc features
        np.savetxt(blah blahblah, mfcc_feature_list, delimiter="\t")  # placeholder output file name

gen_features('/home/data', 'home/data/features')
DIFFICULTY:
My input audio recordings are pretty long; each is at least 3-4 hours.
This program is inefficient, as the file produced by np.savetxt gets pretty big: roughly a 1.5 MB text file for 1 minute of audio. I plan to combine MFCC with more features in the future, so the saved text file size will explode. I want to keep things in smaller 5-minute chunks for ease of processing.
WHAT I WANT TO DO:
Add one more parameter, len, to gen_features; it must specify the length of audio to be processed at a time.
So if the input audio abc.mp3 is 13 minutes long and I specify len = 5, meaning 5 minutes,
then MFCCs should be computed for [0.0,5.0), [5.0,10.0), and [10.0,13.0] and they should be saved
as
mfcc_filename_chunk_1.csv
mfcc_filename_chunk_2.csv
mfcc_filename_chunk_3.csv
I want to do this for all the files in that directory.
I want to achieve this using librosa.
I can't come up with any ideas on how to proceed.
An even better thing to do would be to compute this over overlapping intervals; for example, if len = 5 is passed,
then
chunk one should be over [0.0,5.1]
chunk two should be over [5.0,10.1]
chunk three should be over [10.0,13.0]
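One possible way to sketch this with librosa (the chunking logic, output naming, and the 0.1-minute overlap below are assumptions) is to pass offset and duration to librosa.load so that only one chunk is decoded at a time:

import os
import librosa
import numpy as np

def gen_chunked_features(in_path, out_path, chunk_len=5, overlap=0.1, sr=22050):
    """Compute MFCCs per chunk_len-minute chunk (with a small overlap) and save each chunk separately."""
    for fname in os.listdir(in_path):
        audio = os.path.join(in_path, fname)
        total_min = librosa.get_duration(filename=audio) / 60.0  # newer librosa uses path= instead of filename=
        base = os.path.splitext(fname)[0]
        chunk, start = 1, 0.0
        while start < total_min:
            # read chunk_len minutes plus a small overlap, starting at 'start' minutes
            dur_min = min(chunk_len + overlap, total_min - start)
            y, _ = librosa.load(audio, sr=sr,
                                offset=start * 60.0,
                                duration=dur_min * 60.0)
            mfcc = librosa.feature.mfcc(y=y, sr=sr)
            out_name = os.path.join(out_path, "mfcc_{}_chunk_{}.csv".format(base, chunk))
            np.savetxt(out_name, mfcc, delimiter="\t")
            start += chunk_len
            chunk += 1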

Time-series data analysis using scientific python: continuous analysis over multiple files

The Problem
I'm doing time-series analysis. Measured data comes from sampling the voltage output of a sensor at 50 kHz and then dumping that data to disk as separate files in hour chunks. Data is saved to an HDF5 file using pytables as a CArray. This format was chosen to maintain interoperability with MATLAB.
The full data set is now multiple TB, far too large to load into memory.
Some of my analysis requires me to iterate over the full data set. For analysis that requires me to grab chunks of data, I can see a path forward through creating a generator method. I'm a bit uncertain of how to proceed with analysis that requires a continuous time series.
Example
For example, let's say I'm looking to find and categorize transients using some moving window process (e.g. wavelet analysis) or to apply an FIR filter. How do I handle the boundaries, either at the beginning or end of a file, or at chunk boundaries? I would like the data to appear as one continuous data set.
Request
I would love to:
Keep the memory footprint low by loading data as necessary.
Keep a map of the entire data set in memory so that I can address the data set as I would a regular pandas Series object, e.g. data[time1:time2].
I'm using scientific python (Enthought distribution) with all the regular stuff: numpy, scipy, pandas, matplotlib, etc. I only recently started incorporating pandas into my work flow and I'm still unfamiliar with all of its capabilities.
I've looked over related stackexchange threads and didn't see anything that exactly addressed my issue.
EDIT: FINAL SOLUTION.
Based upon the helpful hints, I built an iterator that steps over files and returns chunks of arbitrary size: a moving window that hopefully handles file boundaries with grace. I've added the option of padding the front and back of each window with data (overlapping windows). I can then apply a succession of filters to the overlapping windows and remove the overlaps at the end. This, I hope, gives me continuity.
I haven't yet implemented __getitem__ but it's on my list of things to do.
Here's the final code. A few details are omitted for brevity.
class FolderContainer(readdata.DataContainer):

    def __init__(self, startdir):
        readdata.DataContainer.__init__(self, startdir)

        self.filelist = None
        self.fs = None
        self.nsamples_hour = None

        # Build the file list
        self._build_filelist(startdir)

    def _build_filelist(self, startdir):
        """
        Populate the filelist dictionary with active files and their associated
        file date (YYYY,MM,DD) and hour.

        Each entry in 'filelist' has the form (abs. path : datetime) where the
        datetime object contains the complete date and hour information.
        """
        print('Building file list....', end='')
        # Use the full file path instead of a relative path so that we don't
        # run into problems if we change the current working directory.
        filelist = {os.path.abspath(f): self._datetime_from_fname(f)
                    for f in os.listdir(startdir)
                    if fnmatch.fnmatch(f, 'NODE*.h5')}

        # If we haven't found any files, raise an error
        if not filelist:
            msg = "Input directory does not contain Illionix h5 files."
            raise IOError(msg)

        # Filelist is an ordered dictionary. Sort before saving.
        self.filelist = OrderedDict(sorted(filelist.items(),
                                           key=lambda t: t[0]))
        print('done')

    def _datetime_from_fname(self, fname):
        """
        Return the year, month, day, and hour from a filename as a datetime
        object
        """
        # Filename has the prototype: NODE##-YY-MM-DD-HH.h5. Split this up and
        # take only the date parts. Convert the year from YY to YYYY.
        (year, month, day, hour) = [int(d) for d in re.split('-|\.', fname)[1:-1]]
        year += 2000
        return datetime.datetime(year, month, day, hour)

    def chunk(self, tstart, dt, **kwargs):
        """
        Generator for returning consecutive chunks of data with
        overlaps from the entire set of Illionix data files.

        Parameters
        ----------
        Arguments:
            tstart: UTC start time [provided as a datetime or date string]
            dt: Chunk size [integer number of samples]

        Keyword arguments:
            tend: UTC end time [provided as a datetime or date string].
            frontpad: Padding in front of sample [integer number of samples].
            backpad: Padding in back of sample [integer number of samples]

        Yields:
            chunk: generator expression
        """
        # PARSE INPUT ARGUMENTS

        # Ensure 'tstart' is a datetime object.
        tstart = self._to_datetime(tstart)
        # Find the offset, in samples, of the starting position of the window
        # in the first data file
        tstart_samples = self._to_samples(tstart)

        # Convert dt to samples. Because dt is a timedelta object, we can't use
        # '_to_samples' for conversion.
        if isinstance(dt, int):
            dt_samples = dt
        elif isinstance(dt, datetime.timedelta):
            dt_samples = np.int64((dt.day*24*3600 + dt.seconds +
                                   dt.microseconds*1000) * self.fs)
        else:
            # FIXME: Pandas 0.13 includes a 'to_timedelta' function. Change
            # below when EPD pushes the update.
            t = self._parse_date_str(dt)
            dt_samples = np.int64((t.minute*60 + t.second) * self.fs)

        # Read keyword arguments. 'tend' defaults to the end of the last file
        # if a time is not provided.
        default_tend = self.filelist.values()[-1] + datetime.timedelta(hours=1)
        tend = self._to_datetime(kwargs.get('tend', default_tend))
        tend_samples = self._to_samples(tend)

        frontpad = kwargs.get('frontpad', 0)
        backpad = kwargs.get('backpad', 0)

        # CREATE FILE LIST

        # Build the list of data files we will iterate over based upon
        # the start and stop times.
        print('Pruning file list...', end='')
        tstart_floor = datetime.datetime(tstart.year, tstart.month, tstart.day,
                                         tstart.hour)
        filelist_pruned = OrderedDict([(k, v) for k, v in self.filelist.items()
                                       if v >= tstart_floor and v <= tend])
        print('done.')

        # Check to ensure that we're not missing files by enforcing that there
        # is exactly an hour offset between all files.
        if not all([dt == datetime.timedelta(hours=1)
                    for dt in np.diff(np.array(filelist_pruned.values()))]):
            raise readdata.DataIntegrityError("Hour gap(s) detected in data")

        # MOVING WINDOW GENERATOR ALGORITHM

        # Keep two files open, the current file and the next in line (que file)
        fname_generator = self._file_iterator(filelist_pruned)
        fname_current = fname_generator.next()
        fname_next = fname_generator.next()

        # Iterate over all the files. 'lastfile' indicates when we're
        # processing the last file in the que.
        lastfile = False
        i = tstart_samples
        while True:
            with tables.openFile(fname_current) as fcurrent, \
                 tables.openFile(fname_next) as fnext:
                # Point to the data
                data_current = fcurrent.getNode('/data/voltage/raw')
                data_next = fnext.getNode('/data/voltage/raw')
                # Process all data windows associated with the current pair of
                # files. Avoid unnecessary file access operations as we move
                # the sliding window.
                while True:
                    # Conditionals that depend on if our slice is:
                    #   (1) completely into the next hour
                    #   (2) partially spills into the next hour
                    #   (3) completely in the current hour.
                    if i - backpad >= self.nsamples_hour:
                        # If we're already on our last file in the processing
                        # que, we can't continue to the next. Exit. Generator
                        # is finished.
                        if lastfile:
                            raise GeneratorExit

                        # Advance the active and que file names.
                        fname_current = fname_next
                        try:
                            fname_next = fname_generator.next()
                        except GeneratorExit:
                            # We've reached the end of our file processing que.
                            # Indicate this is the last file so that if we try
                            # to pull data across the next file boundary, we'll
                            # exit.
                            lastfile = True

                        # Our data slice has completely moved into the next
                        # hour.
                        i -= self.nsamples_hour

                        # Return the data
                        yield data_next[i-backpad:i+dt_samples+frontpad]

                        # Move window by amount dt
                        i += dt_samples

                        # We've completely moved on to the next pair of files.
                        # Move to the outer scope to grab the next set of
                        # files.
                        break

                    elif i + dt_samples + frontpad >= self.nsamples_hour:
                        if lastfile:
                            raise GeneratorExit

                        # Slice spills over into the next hour
                        yield np.r_[data_current[i-backpad:],
                                    data_next[:i+dt_samples+frontpad-self.nsamples_hour]]
                        i += dt_samples

                    else:
                        if lastfile:
                            # Exit once our slice crosses the boundary of the
                            # last file.
                            if i + dt_samples + frontpad > tend_samples:
                                raise GeneratorExit

                        # Slice is completely within the current hour
                        yield data_current[i-backpad:i+dt_samples+frontpad]
                        i += dt_samples

    def _to_samples(self, input_time):
        """Convert input time, if not in samples, to samples"""
        if isinstance(input_time, int):
            # Input time is already in samples
            return input_time
        elif isinstance(input_time, datetime.datetime):
            # Input time is a datetime object
            return self.fs * (input_time.minute * 60 + input_time.second)
        else:
            raise ValueError("Invalid input 'tstart' parameter")

    def _to_datetime(self, input_time):
        """Return the passed time as a datetime object"""
        if isinstance(input_time, datetime.datetime):
            converted_time = input_time
        elif isinstance(input_time, str):
            converted_time = self._parse_date_str(input_time)
        else:
            raise TypeError("A datetime object or string date/time were "
                            "expected")
        return converted_time

    def _file_iterator(self, filelist):
        """Generator for iterating over file names."""
        for fname in filelist:
            yield fname
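To illustrate the pad/filter/trim idea from the edit above in isolation, here is a small self-contained sketch (the FIR filter, chunk size, and pad length are arbitrary examples; with a causal FIR filter, a front pad at least as long as the filter is what makes the pieces join up exactly):

import numpy as np
from scipy import signal

fs = 50000
taps = signal.firwin(numtaps=101, cutoff=1000, fs=fs)  # example low-pass FIR filter
pad = len(taps)                                        # front pad >= filter length

data = np.random.randn(10 * fs)                        # stand-in for one continuous record
chunk_size = 2 * fs

pieces = []
for start in range(0, len(data), chunk_size):
    lo = max(start - pad, 0)
    hi = min(start + chunk_size, len(data))
    window = data[lo:hi]                               # overlapping (padded) window
    filtered = signal.lfilter(taps, 1.0, window)
    pieces.append(filtered[start - lo:])               # trim the pad so pieces butt together

continuous = np.concatenate(pieces)
# 'continuous' matches filtering the whole record in one go:
assert np.allclose(continuous, signal.lfilter(taps, 1.0, data))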
Sean, here's my 2c.
Take a look at this issue here, which I created a while back. This is essentially what you are trying to do, and it is a bit non-trivial.
Without knowing more details, I would offer a couple of suggestions:
HDFStore CAN read in a standard CArray type of format, see here
You can easily create a 'Series'-like object that has the nice properties of a) knowing where each file is and its extents, and b) using __getitem__ to 'select' those files, e.g. s[time1:time2]. From a top-level view this might be a very nice abstraction, and you can then dispatch operations.
e.g.
class OutOfCoreSeries(object):

    def __init__(self, dir):
        .... load a list of the files in the dir where you have them ...

    def __getitem__(self, key):
        .... map the selection key (say its a slice, which 'time1:time2' resolves) ...
        .... to the files that make it up .... , then return a new Series that only
        .... those file pointers ....

    def apply(self, func, **kwargs):
        """ apply a function to the files """
        results = []
        for f in self.files:
            results.append(func(self.read_file(f)))
        return Results(results)
This can very easily get quite complicated. For instance, if you apply an operation that does a reduction you can fit in memory, Results can simply be a pandas.Series (or Frame). However, you may be doing a transformation which necessitates writing out a new set of transformed data files. If so, then you have to handle this.
Several more suggestions:
You may want to hold onto your data in multiple ways. For instance, you say that you are saving multiple values in a 1-hour slice. It may be that you can instead split these 1-hour files into one file per variable you are saving, covering a much longer slice that then becomes memory readable.
You might want to resample the data to lower frequencies, and work on these, loading the data in a particular slice as needed for more detailed work.
You might want to create a dataset that is queryable across time, e.g. high-low peaks at varying frequencies, maybe using the Table format (see here).
Thus you may have multiple variations of the same data. Disk space is usually much cheaper/easier to manage than main memory. It makes a lot of sense to take advantage of that.
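As a rough illustration of the resample-and-store suggestions (the names, dates, and frequencies below are made up; HDFStore's table format is what makes the stored copy queryable by time):

import numpy as np
import pandas as pd

fs = 50000                                  # original 50 kHz sampling rate
raw = np.random.randn(fs * 60)              # stand-in for one minute of raw voltage data

# Give the data a time index and downsample it to 1-second means
index = pd.date_range('2013-05-01 00:00', periods=len(raw), freq='20us')  # 1/50000 s = 20 microseconds
series = pd.Series(raw, index=index)
lowres = series.resample('1s').mean()

# Store the low-resolution copy in HDF5 'table' format so it can be queried by time later
with pd.HDFStore('lowres.h5') as store:
    store.put('voltage_1s', lowres, format='table')
    subset = store.select('voltage_1s',
                          where="index >= '2013-05-01 00:00:10' & index < '2013-05-01 00:00:20'")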

From Python Code to a Working Executable File (Downsizing Grid Files Program)

I posted a question earlier about a syntax error here: Invalid Syntax error in Python Code I copied from the Internet. Fortunately, my problem was fixed really fast thanks to you. However, now that there is no syntax error, I find myself helpless, as I don't know what to do with this code. As I've said, I did some basic Python training 3 years ago, but the human brain seems to forget things fast.
In a few words, I need to reduce the grid resolution of some files by half, and I've been searching for a way to do it for weeks. Luckily, I found some Python code that seems to do exactly what I am looking for. The code is this:
#!/bin/env python
# -----------------------------------------------------------------------------
# Reduce grid data to a smaller size by averaging over cells of specified
# size and write the output as a netcdf file. xyz_origin and xyz_step
# attributes are adjusted.
#
# Syntax: downsize.py <x-cell-size> <y-cell-size> <z-cell-size>
#                     <in-file> <netcdf-out-file>
#
import sys

import Numeric

from VolumeData import Grid_Data, Grid_Component

# -----------------------------------------------------------------------------
#
def downsize(mode, cell_size, inpath, outpath):

    from VolumeData import fileformats
    try:
        grid_data = fileformats.open_file(inpath)
    except fileformats.Uknown_File_Type as e:
        sys.stderr.write(str(e))
        sys.exit(1)

    reduced = Reduced_Grid(grid_data, mode, cell_size)

    from VolumeData.netcdf.netcdf_grid import write_grid_as_netcdf
    write_grid_as_netcdf(reduced, outpath)

# -----------------------------------------------------------------------------
# Average over cells to produce reduced size grid object.
#
# If the grid data sizes are not multiples of the cell size then the
# final data values along the dimension are not included in the reduced
# data (ie ragged blocks are not averaged).
#
class Reduced_Grid(Grid_Data):

    def __init__(self, grid_data, mode, cell_size):

        size = map(lambda s, cs: s / cs, grid_data.size, cell_size)
        xyz_origin = grid_data.xyz_origin
        xyz_step = map(lambda step, cs: step*cs, grid_data.xyz_step, cell_size)
        component_name = grid_data.component_name
        components = []
        for component in grid_data.components:
            components.append(Reduced_Component(component, mode, cell_size))

        Grid_Data.__init__(self, '', '', size, xyz_origin, xyz_step,
                           component_name, components)

# -----------------------------------------------------------------------------
# Average over cells to produce reduced size grid object.
#
class Reduced_Component(Grid_Component):

    def __init__(self, component, mode, cell_size):

        self.component = component
        self.mode = mode
        self.cell_size = cell_size
        Grid_Component.__init__(self, component.name, component.rgba)

    # ---------------------------------------------------------------------------
    #
    def submatrix(self, ijk_origin, ijk_size):

        ijk_full_origin = map(lambda i, cs: i * cs, ijk_origin, self.cell_size)
        ijk_full_size = map(lambda s, cs: s*cs, ijk_size, self.cell_size)
        values = self.component.submatrix(ijk_full_origin, ijk_full_size)
        if mode == 'ave':
            m = average_down(values, self.cell_size)
I have this saved as a .py file, and when I double-click it, the command prompt appears for a millisecond and then disappears. I managed to take a screenshot of that command prompt, which says "Unable to create process using 'bin/env python "C:\Users...........py".
What I want is to be able to do this downsizing using the syntax that the code tells me to use:
# Syntax: downsize.py <x-cell-size> <y-cell-size> <z-cell-size>
#                     <in-file> <netcdf-out-file>
Can you help me?
Don't run the file by double-clicking it. Run the file by opening a new shell, and typing in the path to the .py file (or just cd to the parent directory) followed by the arguments you want to pass. For example:
python downsize.py 1 2 3 foo bar
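Also note that the snippet above, as posted, never actually calls downsize(), so if your copy of the script is also missing its argument-handling section you would need something like the following at the bottom (a sketch based on the Syntax comment; the original script's own parsing may differ, and 'ave' is assumed because it is the only mode visible in the snippet):

# Sketch of an argument-handling block matching the documented syntax:
#   downsize.py <x-cell-size> <y-cell-size> <z-cell-size> <in-file> <netcdf-out-file>
if __name__ == '__main__':
    if len(sys.argv) != 6:
        sys.stderr.write('Syntax: downsize.py <x-cell-size> <y-cell-size> '
                         '<z-cell-size> <in-file> <netcdf-out-file>\n')
        sys.exit(1)
    cell_size = (int(sys.argv[1]), int(sys.argv[2]), int(sys.argv[3]))
    downsize('ave', cell_size, sys.argv[4], sys.argv[5])  # 'ave' = averaging mode (assumed)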
