Let us suppose we have a table with the following dimensions:
print(metadata.shape)  # (8732, 8)
Suppose we want to read slice_file_name for each row (and then read the sound files from the drive) and extract mel-frequency cepstral coefficients (MFCCs):
def feature_extractor(file_name):
    audio, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
    mfccs_features = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
    mfccs_scaled_features = np.mean(mfccs_features.T, axis=0)
    return mfccs_scaled_features
If I use the following loop:
from tqdm import tqdm

extracted_features = []
for index_num, row in tqdm(metadata.iterrows()):
    file_name = os.path.join(os.path.abspath(Base_Directory), str(row["slice_file_name"]))
    final_class_labels = row["class"]
    data = feature_extractor(file_name)
    extracted_features.append([data, final_class_labels])
it takes the following amount of time in total:
3555it [21:15, 2.79it/s]/usr/local/lib/python3.7/dist-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=2048 is too small for input signal of length=1323
n_fft, y.shape[-1]
8326it [48:40, 3.47it/s]/usr/local/lib/python3.7/dist-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=2048 is too small for input signal of length=1103
n_fft, y.shape[-1]
8329it [48:41, 3.89it/s]/usr/local/lib/python3.7/dist-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=2048 is too small for input signal of length=1523
n_fft, y.shape[-1]
8732it [50:53, 2.86it/s]
How can I optimize this code so that it takes less time? Is that possible?
You could try running the feature extractor in parallel; this would give you a new column in your dataframe containing the mfccs_scaled_features.
import os
import librosa
import numpy as np
from pandarallel import pandarallel

pandarallel.initialize()

PATH = os.path.abspath(Base_Directory)

def feature_extractor(file_name):
    # If using Windows, you may need to repeat the imports here inside the worker:
    # import librosa
    # import numpy as np
    # import os
    file_name = os.path.join(PATH, file_name)
    audio, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
    mfccs_features = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
    mfccs_scaled_features = np.mean(mfccs_features.T, axis=0)
    return mfccs_scaled_features

df['mfccs_scaled_features'] = df['slice_file_name'].parallel_apply(feature_extractor)
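If you would rather keep something close to the original loop, a minimal alternative sketch using joblib (an assumption on my part: joblib is available, and metadata, Base_Directory and feature_extractor are as defined in the question) could look like this:

from joblib import Parallel, delayed
import os

file_names = [
    os.path.join(os.path.abspath(Base_Directory), str(f))
    for f in metadata["slice_file_name"]
]
# n_jobs=-1 uses all available cores; each call runs feature_extractor on one file
features = Parallel(n_jobs=-1)(delayed(feature_extractor)(f) for f in file_names)
extracted_features = list(zip(features, metadata["class"]))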
I have a set of images, all of the same size, and I want to insert them into a dataframe, with the rows being the names of the images and the columns being the pixels. They are all in the same directory.
I can already do this for a folder with a few images (as shown in the "Example for 7 images" link below), but when I try it on a dataset with 9912 images the process gets "killed". How can I optimize this code so that it gets through all the images?
from matplotlib import image
import numpy as np
import pandas as pd
import glob

columns = ["file"]
for i in range(150528):
    columns.append("pixel" + str(i))

df = pd.DataFrame(columns=columns)

i = 0
for file in glob.glob('/home/nuno/resizepics/*.jpg'):
    imgarr = image.imread(file)
    imgarr = imgarr.flatten()
    df.loc[i, "file"] = file
    for j in range(len(imgarr)):
        df.iloc[i, j + 1] = imgarr[j]
    i += 1

#print(df)
df.to_csv('pixels.csv')
Example for 7 images
If "killed" means it raises an error you can try using exeptions (try, except, else) and make it try again from the spot it stopped. You can also try to delay it a bit with time module because it works with large data.
I'm writing code that takes a small portion of a dataset (shopping baskets), converts it into a one-hot encoded dataframe, and I want to run mlxtend's apriori algorithm on it to get frequent itemsets.
However, whenever I run the apriori algorithm, it seems to finish instantly and it returns a generator object rather than a dataframe. I followed the instructions from the documentation, and in their example apriori returns a dataframe. What am I doing wrong?
Here is my code:
import numpy as np
import pandas as pd
import csv
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder
from apyori import apriori
def simpleRandomisedSample(filename, support_frac, sample_frac):
    df1 = pd.read_csv("%s.csv" % filename, header=None)  # Saving csv file into a dataframe in memory
    size = len(df1)
    support = support_frac * len(df1)  # Sets the original support value to x% of the original dataset
    sample_support = support * sample_frac  # Support for our reduced sample as a fraction of the original support
    sample = df1.sample(frac=sample_frac)  # Saving x% (randomised) of the dataset as our sample
    sample = sample.reset_index(drop=True)  # Resetting indexes (which previously got randomised along with the data)
    del df1  # Deleting original dataframe from memory to clear up space
    sample_size = len(sample)
    return size, support, sample_size, sample_support, sample

def main():
    size, support, sample_size, sample_support, sample = simpleRandomisedSample("chess", 0.01, 0.1)
    print("The original dataset had %d rows and a support of %.2f" % (size, support))
    print("The dataset was reduced to %d rows and the sample has a support of %.2f" % (sample_size, sample_support))
    sample_list = sample.values.tolist()  # Converting dataframe to list of lists for use with Apriori
    te = TransactionEncoder()
    te_ary = te.fit(sample_list).transform(sample_list)  # Preprocessing our sample to work with the Apriori algorithm
    df = pd.DataFrame(te_ary, columns=te.columns_)
    print(df)
    frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
    print(frequent_itemsets)

if __name__ == "__main__":
    main()
You have a name conflict in your imports:
from mlxtend.frequent_patterns import apriori
[...]
from apyori import apriori
Your code is not using the mlxtend algorithm but the one provided by apyori; the one that is imported later overwrites the previous one.
You can remove the one you're not using or, if you want to have access to both later on, you can give one a different name:
from mlxtend.frequent_patterns import apriori as mlx_apriori
from apyori import apriori as apy_apriori
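With aliases like these, the call in main() would then pick the mlxtend version explicitly (the same call as in the question, just renamed):

frequent_itemsets = mlx_apriori(df, min_support=0.6, use_colnames=True)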
I am trying to insert a startle burst into a .mp4 file. I have been able to do this successfully, but the way I have altered the original audio leaves it as an array. The trouble I am having is that the write_videofile function seems to only accept a specific type of structure in order to write the audio. This is what I have so far; any help would be great.
#Necessary imports
import os as os
import moviepy.editor as mp
from playsound import playsound
import librosa as librosa
import librosa.display
import sounddevice as sd
import numpy as np
from moviepy.audio.AudioClip import AudioArrayClip
#Set working directory
os.chdir('/Users/matthew/Desktop/Python_startle')
#Upload the video and indicate where in the clip the burst should be inserted
vid = mp.VideoFileClip("37vids_balanced/Baby_Babble.mp4")
audio, sr = librosa.load('37vids_balanced/Baby_Babble.mp4')
librosa.display.waveplot(audio, sr = sr)
sd.play(audio, sr)
audio_duration = librosa.get_duration(filename = '37vids_balanced/Baby_Babble.mp4')
start_point_sec = 6
start_point_samp = start_point_sec * sr
burst_samp = int(sr * .05) #how many samples are in the noise burst
burst_window = audio[start_point_samp : start_point_samp + burst_samp].astype(np.int32) #Isolating part of the audio clip to add burst to
#Creating startle burst
noise = np.random.normal(0,1,len(burst_window))
#Inserting the noise created into the portion of the audio that is wanted
audio[start_point_samp : start_point_samp + burst_samp] = audio[start_point_samp : start_point_samp + burst_samp].astype(np.int64) + noise #Puts the modified segment into the original audio
librosa.display.waveplot(audio, sr = sr)
sd.play(audio, sr)
new_vid = vid.set_audio(audio)
sd.play(new_vid.audio, sr)
#Trying to get the audio into the proper form to convert into a mp4
new_vid.audio = [[i] for i in new_vid.audio]  # Turns the data into a list of single-sample lists, because AudioArrayClip would not take in the raw array (it is not 'iterable' in the expected way); this might not be the best solution but it has worked so far
new_vid.audio = AudioArrayClip(new_vid.audio, fps = new_vid.fps)
new_vid.write_videofile('37vids_balanced_noise/Baby_Babble.mp4')
Maybe there is an easier way to get the array into an iterable form that would work.
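One possible direction (an untested sketch, and an assumption on my part: AudioArrayClip expects an array of shape (n_samples, n_channels) and its fps should be the audio sample rate rather than the video frame rate):

import numpy as np
from moviepy.audio.AudioClip import AudioArrayClip

audio_2d = np.column_stack([audio, audio])   # duplicate the mono signal into two channels
new_audio = AudioArrayClip(audio_2d, fps=sr)  # fps here is the audio sample rate, not the video fps
new_vid = vid.set_audio(new_audio)
new_vid.write_videofile('37vids_balanced_noise/Baby_Babble.mp4')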
Background
For a place of interest, we want to extract some useful information from an open data source. Take meteorological data for example: we just want to recognize the long-term temporal pattern at one point, but the data files sometimes cover the whole world.
Here, I use Python to extract the vertical velocity for one spot from FNL (ds083.2) files, every 6 hours over one year.
In other words, I want to read the original data and save the target variable along the timeline.
My attempt
import numpy as np
from netCDF4 import Dataset
import pygrib
import pandas as pd
import os, time, datetime

# Find the corresponding grid box
def find_nearest(array, value):
    idx = (np.abs(array - value)).argmin()
    return array[idx]

## Obtain the X, Y indices
site_x, site_y = 116.4074, 39.9042  ## The area of interest
grib = './fnl_20140101_06_00.grib2'  ## Any file for obtaining lat-lon
grbs = pygrib.open(grib)
grb = grbs.select(name='Vertical velocity')[8]
lon_list, lat_list = grb.latlons()[1][0], grb.latlons()[0].T[0]
x_indice = np.where(lon_list == find_nearest(lon_list, site_x))[0]
y_indice = np.where(lat_list == find_nearest(lat_list, site_y))[0]

def extract_vm():
    files = os.listdir('.')  ### All files have already been saved in one path
    files.sort()
    dict_vm = {"V": []}
    ### Traversing the files
    for file in files[1:]:
        if file[-5:] == "grib2":
            grib = file
            grbs = pygrib.open(grib)
            grb = grbs.select(name='Vertical velocity')[4]  ## Select a certain Z level
            data = grb.values
            data = data[y_indice, x_indice]
            dict_vm['V'].append(data)
    ff = pd.DataFrame(dict_vm)
    return ff

extract_vm()
My thought
How can I speed up the reading process? At the moment I read the files linearly, so the run time increases linearly with the length of the temporal period being processed.
Could we split those files into several chunks and tackle them separately on a multi-core processor, as in the sketch below? Are there any other suggestions for improving the speed of my code?
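A minimal sketch of that multi-process idea (only an assumption about how it could be wired up: it reuses the x_indice/y_indice globals computed above, assumes a fork-based start method such as on Linux so those globals are visible to the workers, and assumes each grib2 file can be processed independently):

from multiprocessing import Pool
import os
import pygrib
import pandas as pd

def read_one_file(file):
    # Open one grib2 file and pull out the vertical velocity at the target grid box
    grbs = pygrib.open(file)
    grb = grbs.select(name='Vertical velocity')[4]
    value = grb.values[y_indice, x_indice]
    grbs.close()
    return value

if __name__ == '__main__':
    files = sorted(f for f in os.listdir('.') if f.endswith('grib2'))
    with Pool() as pool:  # one worker per CPU core by default
        values = pool.map(read_one_file, files)
    ff = pd.DataFrame({"V": values})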
Any comments will be appreciated!
I wish to perform a Fourier transform of the function stress from 0 to infinity and extract the real and imaginary parts. I have the following code that does it using a numerical integration technique:
import numpy as np
from scipy.integrate import trapz
import fileinput
import sys, string

window = 200000  # length of the array I wish to transform (number of data points)
time = np.linspace(1, window, window)
freq = np.logspace(-5, 2, window)

output = [0] * len(freq)
for index, f in enumerate(freq):
    visco = trapz(stress * np.exp(-1j * f * time), time)
    soln = visco * (1j * f)
    output[index] = soln

print 'f storage loss'
for i in range(len(freq)):
    print freq[i], output[i].real, output[i].imag
This gives me a nice transformation of my input data.
Now I have an array of size 2x10^6, and using the above technique is not feasible (the computation time scales as O(N^2)), so I have turned to the built-in fft function in numpy.
There aren't too many arguments you can specify to change this function's behaviour, so I'm finding it difficult to customize it to my needs.
So far I have
import numpy as np
import fileinput
import sys, string

np.set_printoptions(threshold='nan')

N = len(stress)
fvi = np.fft.fft(stress, n=N)
gprime = fvi.real
gdoubleprime = fvi.imag
for i in range(len(stress)):
    print gprime[i], gdoubleprime[i]
And it's not giving me accurate results.
The DFT in NumPy is of the form A_k = sum_{m=0}^{n-1} a_m * exp(-2*pi*i*m*k/n) (http://docs.scipy.org/doc/numpy-1.10.1/reference/routines.fft.html). How can I change it to the form I used in my first code, i.e. exp(-1j*freq*t) (where freq is the frequency and t is the time, both of which have already been defined)? Or is there some post-processing of the data that I have to do?
Thanks in advance for all your help.
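For reference, a hedged sketch of how the two forms relate (assuming the time samples are uniformly spaced with step dt; note the FFT only evaluates the transform on its own frequency grid, not on an arbitrary logarithmically spaced freq array):

import numpy as np

dt = time[1] - time[0]                            # uniform sampling interval
N = len(stress)
fvi = np.fft.fft(stress) * dt                     # scale by dt to approximate the integral
fft_freqs = 2 * np.pi * np.fft.fftfreq(N, d=dt)   # angular frequencies 2*pi*k/(N*dt)
# Because the time axis starts at time[0] rather than 0, each bin carries an extra
# phase factor exp(-1j * fft_freqs * time[0]):
fvi = fvi * np.exp(-1j * fft_freqs * time[0])
# fvi[k] then approximates the sum of stress * exp(-1j * fft_freqs[k] * time) * dt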