Using datasketch to find similarity between 3 audios using MFCCs - Python

So I am using the datasketch library to find out whether audio 2 and audio 3 are similar to audio 1. However, even at threshold=1, where it should only output audios that are 100% the same, it returns the other 2 audios as well, even though they are really different from the 1st audio. The link to the audios: all of them are different audios, but with the same 29-second length.
import librosa
from datasketch import MinHash, MinHashLSH

x1, Sr1 = librosa.load(r'path\f1.mp3')
mfcc1 = librosa.feature.mfcc(y=x1, sr=Sr1)
mfcc1 = mfcc1.tobytes()

x2, Sr2 = librosa.load(r'path\f2.mp3')
mfcc2 = librosa.feature.mfcc(y=x2, sr=Sr2)
mfcc2 = mfcc2.tobytes()

x3, Sr3 = librosa.load(r'path\f3.mp3')
mfcc3 = librosa.feature.mfcc(y=x3, sr=Sr3)
mfcc3 = mfcc3.tobytes()

minhash1 = MinHash(num_perm=128, hashfunc=hash)
minhash2 = MinHash(num_perm=128, hashfunc=hash)
minhash3 = MinHash(num_perm=128, hashfunc=hash)

# iterating over a bytes object yields single integer byte values
for col1 in mfcc1:
    minhash1.update(col1)
for col2 in mfcc2:
    minhash2.update(col2)
for col3 in mfcc3:
    minhash3.update(col3)

lsh = MinHashLSH(threshold=1, num_perm=128)
lsh.insert("minhash2", minhash2)
lsh.insert("minhash3", minhash3)
result = lsh.query(minhash1)
print(result)
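A detail worth noting when reading this code: iterating over a bytes object in Python 3 yields individual integers in the range 0-255, so each MinHash here is built from at most 256 distinct tokens, and very different audios can end up with near-identical signatures. A minimal sketch of an alternative tokenization, assuming whole MFCC frames are used as set elements (the frame-as-token choice is illustrative, not the only option):

import librosa
from datasketch import MinHash

def mfcc_minhash(path, num_perm=128):
    """Build a MinHash from the set of MFCC frames of one audio file."""
    y, sr = librosa.load(path)
    mfcc = librosa.feature.mfcc(y=y, sr=sr)
    m = MinHash(num_perm=num_perm)
    for frame in mfcc.T:          # one token per frame (column)
        m.update(frame.tobytes())
    return m

m1 = mfcc_minhash(r'path\f1.mp3')
m2 = mfcc_minhash(r'path\f2.mp3')
print(m1.jaccard(m2))             # estimated Jaccard similarity of the frame sets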

Related

Python: Writing a function with a Markov chain matrix

I'm learning Python for biologists.
I have this Markov chain of an infectious disease process:
I had to save a dictionary with states as keys and their indices as values, and then fill in the corresponding transition matrix representing the Markov chain, so I did this:
import scipy as sp  # older SciPy re-exports numpy's matrix as sp.matrix

state_index = {'S': 0, 'I': 1, 'R': 2, 'D': 3}
initial_distribution = [0.9999, 0.0001, 0, 0]

# model parameters
r = 0.05
d = 0.02
h = 0.95

S, I, R, D = initial_distribution
T = sp.matrix([[1 - (r*I), r*I,   0,   0        ],
               [0,         1 - d, d*h, d*(1 - h)],
               [0,         0,     1,   0        ],
               [0,         0,     0,   1        ]])
Then I had to write a function named 'step' that takes a list with the state probability distribution (indexed according to the dictionary), the model parameters, and the transition matrix.
The function returns an updated probability distribution (after multiplying it by the transition matrix) and an updated transition matrix (taking the change in I into account).
So I wrote this, but couldn't understand how I can update the transition matrix here:
import numpy as np

def step(dist, T, r, d, h):
    next_dist_matrix = dist * T  # updated probability distribution
    next_dist = list(np.array(next_dist_matrix)[0])
    next_T = T  # ??? how do I update the transition matrix here?
    return next_dist, next_T
And now I'm stuck here, because I also need a for-loop around this function to take 1000 simulation steps, each time saving the current state distribution to a list.
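One possible reading of the assignment: only the first row of T depends on I, so the update step can rebuild the matrix from the new distribution. A minimal sketch under that assumption; make_T and history are hypothetical names, not part of the original exercise:

import numpy as np

state_index = {'S': 0, 'I': 1, 'R': 2, 'D': 3}

def make_T(I, r, d, h):
    """Rebuild the transition matrix for the current infected fraction I."""
    return np.matrix([[1 - r*I, r*I,   0,   0        ],
                      [0,       1 - d, d*h, d*(1 - h)],
                      [0,       0,     1,   0        ],
                      [0,       0,     0,   1        ]])

def step(dist, T, r, d, h):
    next_dist = list(np.array(np.matrix(dist) * T)[0])
    next_T = make_T(next_dist[state_index['I']], r, d, h)  # I after the step
    return next_dist, next_T

# 1000 simulation steps, saving each distribution
r, d, h = 0.05, 0.02, 0.95
dist = [0.9999, 0.0001, 0, 0]
T = make_T(dist[state_index['I']], r, d, h)
history = [dist]
for _ in range(1000):
    dist, T = step(dist, T, r, d, h)
    history.append(dist)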

Tensorflow 2.0: Packing numerical features of a dataset together in a functional way

I am trying to reproduce the TensorFlow tutorial code from here, which downloads a CSV file and preprocesses the data (up to combining the numerical columns together).
The reproducible example goes as follows:
import pandas as pd
import tensorflow as tf

print("TF version is: {}".format(tf.__version__))

# Download data
train_url = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
test_url = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"
train_path = tf.keras.utils.get_file("train.csv", train_url)
test_path = tf.keras.utils.get_file("test.csv", test_url)

# Get data into a batched dataset
def get_dataset(path):
    dataset = tf.data.experimental.make_csv_dataset(path,
                                                    batch_size=5,
                                                    num_epochs=1,
                                                    label_name='survived',
                                                    na_value='?',
                                                    ignore_errors=True)
    return dataset

raw_train_dataset = get_dataset(train_path)
raw_test_dataset = get_dataset(test_path)

# Define numerical and categorical column lists
def get_df_batch(dataset):
    for batch, label in dataset.take(1):
        df = pd.DataFrame()
        df['survived'] = label.numpy()
        for key, value in batch.items():
            df[key] = value.numpy()
    return df

dfb = get_df_batch(raw_train_dataset)
num_columns = [i for i in dfb if (dfb[i].dtype != 'O' and i != 'survived')]
cat_columns = [i for i in dfb if dfb[i].dtype == 'O']

# Combine numerical columns into one `numerics` column
class Pack():
    def __init__(self, names):
        self.names = names

    def __call__(self, features, labels):
        num_features = [features.pop(name) for name in self.names]
        num_features = [tf.cast(feat, tf.float32) for feat in num_features]
        num_features = tf.stack(num_features, axis=1)
        features["numerics"] = num_features
        return features, labels

packed_train = raw_train_dataset.map(Pack(num_columns))

# Show what we got
def show_batch(dataset):
    for batch, label in dataset.take(1):
        for key, value in batch.items():
            print("{:20s}: {}".format(key, value.numpy()))

show_batch(packed_train)
TF version is: 2.0.0
sex : [b'female' b'female' b'male' b'male' b'male']
class : [b'Third' b'First' b'Second' b'First' b'Third']
deck : [b'unknown' b'E' b'unknown' b'C' b'unknown']
embark_town : [b'Queenstown' b'Cherbourg' b'Southampton' b'Cherbourg' b'Queenstown']
alone : [b'n' b'n' b'y' b'n' b'n']
numerics : [[ 28. 1. 0. 15.5 ]
[ 40. 1. 1. 134.5 ]
[ 32. 0. 0. 10.5 ]
[ 49. 1. 0. 89.1042]
[ 2. 4. 1. 29.125 ]]
Then I try, and fail, to combine the numeric features in a functional way:
#tf.function
def pack_func(row, num_columns=num_columns):
    features, labels = row
    num_features = [features.pop(name) for name in num_columns]
    num_features = [tf.cast(feat, tf.float32) for feat in num_features]
    num_features = tf.stack(num_features, axis=1)
    features['numerics'] = num_features
    return features, labels

packed_train = raw_train_dataset.map(pack_func)
Partial traceback:
ValueError: in converted code:

    :3 pack_func  *
        features, labels = row

    ValueError: too many values to unpack (expected 2)
2 questions here:
How do features and labels get assigned in def __call__(self, features, labels): in the definition of class Pack? My intuition is that they should be passed in as defined variables, though I absolutely do not understand where they get defined.
When I do

for row in raw_train_dataset.take(1):
    print(type(row))
    print(len(row))
    f, l = row
    print(f)
    print(l)

I see that row in raw_train_dataset is a 2-tuple, which can be successfully unpacked into features and labels. Why can't the same be done via the map API? Can you suggest the right way of combining numerical features in a functional way?
Many thanks in advance!!!
After some research and trial, the answer to the second question seems to be that Dataset.map unpacks each (features, labels) tuple element into separate arguments, so the mapped function must accept them as two parameters (in the failing version above, map bound features to row and labels to num_columns, and unpacking the features dict raised the ValueError):
def pack_func(features, labels, num_columns=num_columns):
    num_features = [features.pop(name) for name in num_columns]
    num_features = [tf.cast(feat, tf.float32) for feat in num_features]
    num_features = tf.stack(num_features, axis=1)
    features['numerics'] = num_features
    return features, labels

packed_train = raw_train_dataset.map(pack_func)
show_batch(packed_train)
sex : [b'male' b'male' b'male' b'female' b'male']
class : [b'Third' b'Third' b'Third' b'First' b'Third']
deck : [b'unknown' b'unknown' b'unknown' b'E' b'unknown']
embark_town : [b'Southampton' b'Southampton' b'Queenstown' b'Cherbourg' b'Queenstown']
alone : [b'y' b'n' b'n' b'n' b'y']
numerics : [[24. 0. 0. 8.05 ]
[14. 5. 2. 46.9 ]
[ 2. 4. 1. 29.125 ]
[39. 1. 1. 83.1583]
[21. 0. 0. 7.7333]]
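This also answers the first question about Pack.__call__: the dataset's elements are (features, labels) tuples, and map unpacks each tuple into the callable's positional arguments. A minimal toy sketch of that behavior (the dataset contents here are made up for illustration):

import tensorflow as tf

# Toy dataset whose elements are (features_dict, label) tuples,
# mirroring the structure produced by make_csv_dataset.
ds = tf.data.Dataset.from_tensor_slices(
    ({"a": [1.0, 2.0], "b": [3.0, 4.0]}, [0, 1])
)

# map() unpacks each tuple element into separate arguments,
# so the mapped function takes two parameters:
packed = ds.map(lambda features, label: (features["a"] + features["b"], label))

for x, y in packed:
    print(x.numpy(), y.numpy())  # -> 4.0 0, then 6.0 1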

Trying to make a histogram in Python of the number of magnitudes from a text file

I need help making a histogram of the number of times a magnitude falls within a given range. I have a histogram made against galaxy number, but I realized that doesn't really give any information.
I have tried binning the galaxy numbers, but realized that didn't really matter, nor did it work.
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
import csv
import math
from collections import Counter
import numpy as np
from numpy.polynomial.polynomial import polyfit

histflux = []
galnum = []

with open('/home/jacob/PHOTOMETRY/PHOTOM_CATS/SpARCS-0035_totalall_HAWKIKs.cat', 'r') as magfile:
    magplots = csv.reader(magfile)
    firstmagline = magfile.readline()  # skip the header line
    for line in magfile:
        id, ra, dec, x, y, hawkiks_tot, k_flag, k_star, k_fluxrad, totmask, hawkiks, ehawkiks, vimosu, evimosu, vimosb, \
            evimosb, vimosv, evimosv, vimosr, evimosr, vimosi, evimosi, decamz, edecamz, fourstarj1, efourstarj1, hawkij, ehawkij, \
            irac1, eirac1, irac2, eirac2, irac3, eirac3, irac4, eirac4 = line.split()
        goodflag = float(k_flag)
        goodhawki = float(hawkiks)
        if goodflag != 0.0:
            continue
        try:
            histfluxk = -2.5 * math.log10(goodhawki) + 25
        except ValueError:
            continue  # non-positive flux: no valid magnitude for this row
        histflux.append(histfluxk)
        galnum.append(float(id))

plt.hist([galnum, histflux])
plt.xlabel('Galaxy Number')
plt.ylabel('K-Band Magnitude')
plt.title('K-Band Magnitudes of Galaxies')
plt.legend()
plt.show()
What I want to see is a histogram with the x axis ranging from 0 to 20 magnitudes in intervals of 2. The y axis should be the number of times the magnitudes fall within those ranges. I am stumped on how to do this because I am new to Python, and especially to making graphs in Python.
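For what the last paragraph describes, a minimal sketch (assuming histflux holds the magnitudes computed above) is to pass the magnitude list itself to plt.hist with explicit bin edges:

import numpy as np
import matplotlib.pyplot as plt

# x axis: magnitudes from 0 to 20 in steps of 2
# y axis: number of galaxies whose magnitude falls in each bin
bins = np.arange(0, 22, 2)  # edges 0, 2, ..., 20
plt.hist(histflux, bins=bins)
plt.xlabel('K-Band Magnitude')
plt.ylabel('Number of Galaxies')
plt.title('Distribution of K-Band Magnitudes')
plt.show()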

Python - Clipping out data to fit profiles

I have several sets of data to which I'm trying to fit different profiles. In the centre of one of the minima there is contamination that prevents me from getting a good fit, as you can see in this image:
How can I clip out those spikes in the bottom of my data, taking into account that the spike is not always in the same position? Or how would you deal with data like this? I'm using lmfit to fit the profiles, in this case a Lorentzian and a Gaussian. Here is a minimal working example, where I have played with the initial values to fit the data more closely:
import numpy as np
import matplotlib.pyplot as plt
from lmfit import Model
from lmfit.models import GaussianModel, ConstantModel, LorentzianModel
x = np.array([4085.18084467, 4085.38084374, 4085.5808428 , 4085.78084186, 4085.98084092, 4086.18083999, 4086.38083905, 4086.58083811, 4086.78083717, 4086.98083623, 4087.1808353 , 4087.38083436, 4087.58083342, 4087.78083248, 4087.98083155, 4088.18083061, 4088.38082967, 4088.58082873, 4088.78082779, 4088.98082686, 4089.18082592, 4089.38082498, 4089.58082404, 4089.78082311, 4089.98082217, 4090.18082123, 4090.38082029, 4090.58081935, 4090.78081842, 4090.98081748, 4091.18081654, 4091.3808156 , 4091.58081466, 4091.78081373, 4091.98081279, 4092.18081185, 4092.38081091, 4092.58080998, 4092.78080904, 4092.9808081 , 4093.18080716, 4093.38080622, 4093.58080529, 4093.78080435, 4093.98080341, 4094.18080247, 4094.38080154, 4094.5808006 , 4094.78079966, 4094.98079872, 4095.18079778, 4095.38079685, 4095.58079591, 4095.78079497, 4095.98079403, 4096.1807931 , 4096.38079216, 4096.58079122, 4096.78079028, 4096.98078934, 4097.18078841, 4097.38078747, 4097.58078653, 4097.78078559,4097.98078466, 4098.18078372, 4098.38078278, 4098.58078184, 4098.7807809 , 4098.98077997, 4099.18077903, 4099.38077809, 4099.58077715, 4099.78077622, 4099.98077528, 4100.18077434, 4100.3807734 , 4100.58077246, 4100.78077153, 4100.98077059, 4101.18076965, 4101.38076871, 4101.58076778, 4101.78076684, 4101.9807659 , 4102.18076496, 4102.38076402, 4102.58076309, 4102.78076215, 4102.98076121, 4103.18076027, 4103.38075934, 4103.5807584 , 4103.78075746, 4103.98075652, 4104.18075558, 4104.38075465, 4104.58075371, 4104.78075277, 4104.98075183, 4105.1807509 , 4105.38074996, 4105.58074902, 4105.78074808, 4105.98074714, 4106.18074621, 4106.38074527, 4106.58074433, 4106.78074339, 4106.98074246, 4107.18074152, 4107.38074058, 4107.58073964, 4107.7807387 , 4107.98073777, 4108.18073683, 4108.38073589, 4108.58073495, 4108.78073401, 4108.98073308, 4109.18073214, 4109.3807312 , 4109.58073026, 4109.78072933, 4109.98072839, 4110.18072745, 4110.38072651, 4110.58072557, 4110.78072464, 4110.9807237 , 4111.18072276, 4111.38072182, 4111.58072089, 4111.78071995, 4111.98071901, 4112.18071807, 4112.38071713, 4112.5807162 , 4112.78071526, 4112.98071432, 4113.18071338, 4113.38071245, 4113.58071151, 4113.78071057, 4113.98070963, 4114.18070869, 4114.38070776, 4114.58070682, 4114.78070588, 4114.98070494, 4115.18070401, 4115.38070307, 4115.58070213, 4115.78070119, 4115.98070025, 4116.18069932, 4116.38069838, 4116.58069744, 4116.7806965 , 4116.98069557, 4117.18069463, 4117.38069369, 4117.58069275, 4117.78069181, 4117.98069088, 4118.18068994, 4118.380689 , 4118.58068806, 4118.78068713, 4118.98068619, 4119.18068525, 4119.38068431, 4119.58068337, 4119.78068244, 4119.9806815 , 4120.18068056, 4120.38067962, 4120.58067869, 4120.78067775, 4120.98067681, 4121.18067587, 4121.38067493, 4121.580674 , 4121.78067306, 4121.98067212, 4122.18067118, 4122.38067025, 4122.58066931, 4122.78066837, 4122.98066743, 4123.18066649, 4123.38066556, 4123.58066462, 4123.78066368, 4123.98066274, 4124.1806618 , 4124.38066087, 4124.58065993, 4124.78065899, 4124.98065805, 4125.18065712, 4125.38065618, 4125.58065524, 4125.7806543 , 4125.98065336, 4126.18065243, 4126.38065149, 4126.58065055, 4126.78064961, 4126.98064868, 4127.18064774, 4127.3806468 , 4127.58064586, 4127.78064492, 4127.98064399, 4128.18064305, 4128.38064211, 4128.58064117, 4128.78064024, 4128.9806393 , 4129.18063836, 4129.38063742, 4129.58063648, 4129.78063555, 4129.98063461, 4130.18063367, 4130.38063273, 4130.5806318 , 4130.78063086, 4130.98062992, 4131.18062898, 4131.38062804, 4131.58062711, 4131.78062617, 4131.98062523, 4132.18062429, 
4132.38062336, 4132.58062242, 4132.78062148, 4132.98062054, 4133.1806196 , 4133.38061867, 4133.58061773, 4133.78061679, 4133.98061585, 4134.18061492, 4134.38061398, 4134.58061304, 4134.7806121 , 4134.98061116])
y = np.array([0.90312759, 1.00923175, 0.94618369, 0.98284045, 0.91510612, 0.96737804, 0.97690214, 0.94363369, 1.00887784, 1.00110387, 0.91647096, 0.97943202, 1.00672907, 1.01552094, 1.01089407, 0.96914584, 0.9908419 , 1.0176613 , 0.97032148, 0.96003562, 0.9702355 , 0.93684173, 0.94652734, 0.94895018, 1.01214356, 0.85777678, 0.89308203, 0.9789272 , 0.93901884, 0.9684622 , 0.96969321, 0.86326307, 0.89607392, 0.92459571, 1.00454429, 1.06019733, 0.97291196, 0.95646497, 0.95899707, 1.02830351, 0.94938178, 0.91481128, 0.92606219, 0.97085631, 0.93597434, 0.91316857, 0.90644542, 0.91726926, 0.91686184, 0.96445563, 0.92166362, 0.95831572, 0.93859066, 0.85285273, 0.89944073, 0.91812428, 0.94265677, 0.88281406, 0.9470601 , 0.94921529, 0.97289222, 0.94632251, 0.96633195, 0.94096512, 0.95324803, 0.90920845, 0.92100257, 0.91181745, 0.95715298, 0.91715382, 0.90219214, 0.87585035, 0.86592191, 0.89335902, 0.85536392, 0.89619274, 0.9450366 , 0.82780137, 0.81214176, 0.83461329, 0.82858317, 0.80851704, 0.79253546, 0.85440086, 0.81679169, 0.80579976, 0.72312218, 0.75583125, 0.75204599, 0.84519188, 0.68686821, 0.71472154, 0.71706318, 0.72640234, 0.70526356, 0.68295282, 0.66795774, 0.65004383, 0.68096834, 0.72697547, 0.72436393, 0.77128385, 0.79666758, 0.67349101, 0.61479406, 0.57046337, 0.51614312, 0.52945366, 0.53112169, 0.53757761, 0.56680358, 0.63839684, 0.60704329, 0.62377533, 0.67862515, 0.64587581, 0.71316115, 0.76309798, 0.72217569, 0.7477785 , 0.79731849, 0.76934137, 0.77063868, 0.77871584, 0.77688526, 0.84342722, 0.85382332, 0.88700466, 0.85837992, 0.79589266, 0.83798993, 0.79835529, 0.84612746, 0.83214907, 0.86373676, 0.90729115, 0.82111605, 0.86165685, 0.84090099, 0.90389133, 0.89554032, 0.90792356, 0.92798016, 0.95588479, 0.95019718, 0.95447497, 0.89845759, 0.91638311, 0.99263342, 0.97477606, 0.95482538, 0.94489498, 0.94344967, 0.90526465, 0.92538486, 0.96279787, 0.94005143, 0.96842454, 0.92296494, 0.89954172, 0.8684367 , 0.95039002, 0.95229769, 0.93752274, 0.94741173, 0.96704449, 1.01130839, 0.95499414, 0.99596569, 0.95130622, 1.00014723, 1.00252218, 0.95130331, 1.0022896 , 0.99851989, 0.94405282, 0.95814021, 0.94851972, 1.01302067, 1.01400272, 0.97960083, 0.97070283, 1.01312797, 0.9842154 , 1.01147273, 0.97331853, 0.91403182, 0.96813051, 0.92319169, 0.9294103 , 0.96960715, 0.94811518, 0.97115083, 0.84687543, 0.90725159, 0.88061293, 0.87319615, 0.85331661, 0.89775082, 0.90956716, 0.83174505, 0.89753388, 0.89554364, 0.95329739, 0.87687031, 0.93883127, 0.97433899, 0.99515225, 0.97519981, 0.91956466, 0.97977674, 0.93582089, 1.00662722, 0.90157277, 1.02887754, 0.9777419 , 0.94257094, 1.02359615, 0.98968414, 1.00075502, 1.03230265, 1.05904074, 1.00488442, 1.05507886, 1.05085518, 1.02561781, 1.05896008, 0.98024381, 1.08005691, 0.94528977, 1.03853637, 1.02064405, 1.0467137 , 1.05375156, 1.12907949, 0.99295611, 1.06601022, 1.02846374, 0.98006807, 0.96446772, 0.97702428, 0.97788589, 0.93889781, 0.96366778, 0.96645265, 0.95857242, 1.05796304, 0.99441763, 1.00573183, 1.05001927])
e = np.array([0.0647344 , 0.04583914, 0.05665552, 0.04447208, 0.05644753, 0.03968611, 0.05985188, 0.04252311, 0.03366922, 0.04237672, 0.03765898, 0.03290132, 0.04626836, 0.05106203, 0.03619188, 0.03944098, 0.08115469, 0.05859644, 0.06091101, 0.05170821, 0.0427244 , 0.06804469, 0.06708318, 0.03369381, 0.04160575, 0.08007032, 0.09292148, 0.04378329, 0.08216214, 0.06087074, 0.05375458, 0.06185891, 0.06385766, 0.08084546, 0.04864063, 0.06400878, 0.04988693, 0.06689165, 0.05989534, 0.08010138, 0.0681177 , 0.04478208, 0.03876582, 0.05977015, 0.06610619, 0.05020086, 0.07244604, 0.0445143 , 0.06970626, 0.04423994, 0.0414573 , 0.06892836, 0.05715395, 0.04014724, 0.07908425, 0.06082051, 0.08380691, 0.08576757, 0.06571406, 0.04842625, 0.05298355, 0.05271857, 0.06340425, 0.10849621, 0.0811072 , 0.03642638, 0.10614094, 0.09865099, 0.06711037, 0.10244762, 0.11843505, 0.1092357 , 0.09748241, 0.09657009, 0.09970179, 0.10203563, 0.18494082, 0.14097796, 0.1151294 , 0.16172895, 0.17611204, 0.16226913, 0.2295418 , 0.17795924, 0.1253298 , 0.1771586 , 0.15139061, 0.14739618, 0.1620105 , 0.19158538, 0.21431605, 0.19292715, 0.23308884, 0.30519423, 0.31401994, 0.30569885, 0.31216375, 0.35147676, 0.25016472, 0.16232236, 0.09058787, 0.0604483 , 0.05168302, 0.21432774, 0.38149791, 0.5061975 , 0.44281541, 0.50646427, 0.43761581, 0.44989111, 0.47778238, 0.39944325, 0.32462726, 0.34560857, 0.3175776 , 0.30253441, 0.23059451, 0.24516185, 0.20708065, 0.26429751, 0.1830661 , 0.15155041, 0.16497299, 0.15794139, 0.13626666, 0.17839823, 0.13502886, 0.14148522, 0.10869864, 0.11723602, 0.09074029, 0.06922157, 0.07719777, 0.13181317, 0.11441895, 0.10655855, 0.12073767, 0.0846133 , 0.07974657, 0.06538693, 0.0573741 , 0.07864047, 0.08351471, 0.08130351, 0.0768824 , 0.07951992, 0.04478989, 0.0765122 , 0.04842814, 0.04355571, 0.05138656, 0.07215294, 0.04681987, 0.05790133, 0.06163808, 0.082449 , 0.06127927, 0.04971221, 0.05107901, 0.04493687, 0.06072161, 0.06094332, 0.03630467, 0.04162285, 0.04058228, 0.04526251, 0.06191432, 0.04901982, 0.0454908 , 0.06186274, 0.0407017 , 0.03865571, 0.04353665, 0.03898987, 0.04666321, 0.05856035, 0.04225933, 0.04797901, 0.03523971, 0.04728414, 0.05494382, 0.04773011, 0.03210954, 0.05651663, 0.03625933, 0.03596701, 0.03800191, 0.06267668, 0.06431192, 0.0602614 , 0.05139896, 0.04571979, 0.04375182, 0.0576867 , 0.07491418, 0.05339972, 0.07619115, 0.11569378, 0.07087871, 0.09076518, 0.13554717, 0.07811761, 0.07180695, 0.05831886, 0.06042863, 0.08759576, 0.06650081, 0.08420164, 0.08185432, 0.04338836, 0.04970979, 0.04008252, 0.03605485, 0.03456321, 0.05594584, 0.03856822, 0.03576337, 0.03118799, 0.0441686 , 0.0469118 , 0.03591666, 0.03562582, 0.04934832, 0.03280972, 0.03201576, 0.04338048, 0.07443531, 0.04121059, 0.03774147, 0.03717577, 0.03354207, 0.03806978, 0.0319364 , 0.03715712, 0.0379478 , 0.04867626, 0.0304592 , 0.03393844, 0.034518 , 0.04293514, 0.05177898, 0.05332907, 0.0352937 , 0.03359781, 0.04625272, 0.03733088, 0.03501259, 0.03346308, 0.04333749, 0.05741173])
cont = ConstantModel(prefix='cte_')
pars = cont.guess(y, x=x)

gauss = GaussianModel(prefix='g_')
pars.update(gauss.make_params())
pars['cte_c'].set(1)
pars['g_center'].set(4125, min=4120, max=4130)
pars['g_sigma'].set(1, min=0.5)
pars['g_amplitude'].set(-0.2, min=-0.5)

loren = LorentzianModel(prefix='l_')
pars.update(loren.make_params())
pars['l_center'].set(4106, min=4095, max=4115)
pars['l_sigma'].set(4, max=6)
pars['l_amplitude'].set(-6., max=-4.)

model = gauss + loren + cont
init = model.eval(pars, x=x)
result = model.fit(y, pars, x=x, weights=1/e)
# print(result.fit_report(min_correl=0.5))

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(x, y, 'k-', lw=2)                # data
ax.plot(x, init, 'g--', lw=2)            # initial guess
ax.plot(x, result.best_fit, 'r-', lw=2)  # best fit
ax.set(xlim=(4085, 4135), ylim=(0.4, 1.14))
If the bad point is always at the same x value, you could remove that point from the data, perhaps with something like:
import numpy as np

def index_nearest(array, value):
    """index of array nearest to value"""
    return np.abs(array - value).argmin()

ybad = index_nearest(x, 4150)
y[ybad] = x[ybad] = np.nan

x = x[np.where(np.isfinite(y))]
y = y[np.where(np.isfinite(y))]
and then fit your model to those data with the bad point removed.
But, also: if there is not an obviously errant point and the data is "just" noisy, there is probably no advantage to removing what look like bad points. Your data looks noisy to me, but it's hard to see that there is a systematically bad point. If you are going to remove a point, remember that you are asserting that this measurement was not merely affected by normal noise, but was wrong.
Finally: another approach to treating noisy data might be to try to smooth the data, say with a Savitzky-Golay filter. There is always some danger of smoothing out features with such an approach, but a modest S-G filter is often good for cleaning up noisy data enough to detect features. Of course, if fits to filtered data give significantly different results from fits to unfiltered data, you will probably need to understand why that is.
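For reference, a minimal sketch of the Savitzky-Golay approach with scipy, reusing the model and pars defined above (the window length and polynomial order here are illustrative values to be tuned, not recommendations):

from scipy.signal import savgol_filter

# smooth y with an 11-point window and a cubic polynomial
y_smooth = savgol_filter(y, window_length=11, polyorder=3)

# fit the same lmfit model to the smoothed data and compare the results
result_smooth = model.fit(y_smooth, pars, x=x, weights=1/e)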

Calculate similarity/distance between rows using pandas faster

I am fairly new to Python and pandas.
I have the following columns in a pandas DataFrame:

SongNumber songID albumID artistID similarArtists artistHotttnesss songHotness loudness tempo year

with numerical data in the columns from artistHotttnesss to year.
So I tried calculating the distance/cosine similarity between songs using the code below:
from time import time

import numpy as np
from scipy.spatial.distance import euclidean

t1 = time()
m = 1000
mat = np.zeros((m, m))
for i in range(0, m):
    for j in range(0, m):
        if i != j:
            # .ix is deprecated in newer pandas; .iloc is the modern equivalent
            mat[i][j] = euclidean(data.ix[i, 5:], data.ix[j, 5:])
            '''if data.ix[i,2] == data.ix[j,2]:
                mat[i][j] += 1
            if data.ix[i,3] == data.ix[j,3]:
                mat[i][j] += 1
            # l1, l2 - lists of similar artists
            l1_str = data.ix[i,4].strip(']')[1:]
            l2_str = data.ix[j,4].strip(']')[1:]
            l1 = l1_str.split()
            l2 = l2_str.split()
            common = len(set(l1).intersection(l2))
            mat[i][j] += common
            mat[i][j] /= 3'''
        else:
            mat[i][j] = 0.0
t2 = time()
print(t2 - t1)
So this essentially requires looping 10^4 * 10^4 times.
If I run this for m = 1000, I get results in 2249 sec, or 37.48 min, so I am not getting the results for m = 10000 in a reasonable time.
How can I speed it up (by avoiding loops? pandas functions?)?
Thanks for the help.
You can avoid using loops by using the euclidean_distances function in scikit-learn.
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np
mat = np.random.rand(5, 5)
pairwise_dist_mat = euclidean_distances(mat)
pairwise_dist_mat
array([[ 0. , 1.19602663, 1.08341967, 1.07792121, 1.1245057 ],
[ 1.19602663, 0. , 0.52135682, 0.82797734, 0.78247091],
[ 1.08341967, 0.52135682, 0. , 0.87764513, 0.81903634],
[ 1.07792121, 0.82797734, 0.87764513, 0. , 0.1386294 ],
[ 1.1245057 , 0.78247091, 0.81903634, 0.1386294 , 0. ]])
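Applied to the DataFrame from the question, this replaces the double loop with one call; a sketch assuming, as in the loop above, that the numeric block starts at column position 5:

from sklearn.metrics.pairwise import euclidean_distances

# numeric columns (artistHotttnesss ... year) as a plain array
X = data.iloc[:, 5:].values

# full (n, n) distance matrix in one vectorized call;
# the diagonal is zero, matching the i == j case of the loop
mat = euclidean_distances(X)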
