Representing a similarity matrix as a heatmap - Python

I know there are several questions here about similarity matrices. I read them, but I couldn't find my answer.
I have a dataframe with 8 rows like the example below. I want to compare every pair of rows and generate a heatmap showing how similar each row is to the others.
df:
   speed(km/h)  acceleration(m/s2)  Deceleration(m/s2)  Satisfaction(%)
0          100                 2.1                -1.1               10
1          150                 3.6                -2.2               20
2          250                 0.1                -4.0               30
3          100                 0.6                -0.1               20
I am looking for a Python function to measure the similarity between rows and build a matrix from it. Ideally, I would then display that matrix as a heatmap, with each cell showing the similarity between one pair of rows.
Thanks in advance
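One possible approach, sketched below under assumptions the question leaves open: compute pairwise distances between rows with scipy's pdist, convert them to similarities, and plot the square matrix with seaborn. Cosine similarity and the seaborn heatmap are choices of convenience here, not the only option; any pdist metric works the same way, and because the columns are on very different scales you may want to standardize them first.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

# Rebuild the example dataframe from the question
df = pd.DataFrame({'speed(km/h)': [100, 150, 250, 100],
                   'acceleration(m/s2)': [2.1, 3.6, 0.1, 0.6],
                   'Deceleration(m/s2)': [-1.1, -2.2, -4.0, -0.1],
                   'Satisfaction(%)': [10, 20, 30, 20]})

# pdist computes a cosine *distance* for every pair of rows;
# 1 - distance turns it into a cosine similarity (range -1 to 1)
sim = 1 - squareform(pdist(df, metric='cosine'))

sns.heatmap(pd.DataFrame(sim), annot=True, cmap='viridis')
plt.title('Row-to-row cosine similarity')
plt.show()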

Related

Euclidean distance calculation - can it be done non-symmetrically to reduce redundancy?

I'm calculating the Euclidean distance between all rows in a large data frame.
This code works:
import pandas as pd
from scipy.spatial.distance import pdist, squareform

distances = pdist(df, metric='euclidean')
dist_matrix = squareform(distances)
pd.DataFrame(dist_matrix).to_csv('distance_matrix.txt')
And this prints out a matrix like this:
     0    1    2
0  0.0  4.7  2.3
1  4.7  0.0  3.3
2  2.3  3.3  0.0
But there's a lot of redundant calculation happening (e.g. the distance between row 1 and row 2 is computed, and then the distance between row 2 and row 1 gets the same score).
Would someone know a more efficient, non-redundant way of calculating the Euclidean distance between the rows of a big data frame (the dataframe is about 35 GB)?
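A minimal sketch of one way around this: pdist already computes each pair only once and returns a condensed vector of length n*(n-1)/2; the duplication only appears when squareform mirrors it into a full square matrix. Storing the condensed form avoids the duplication entirely. The filename and the condensed_index helper below are illustrative, not from the original post.
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

df = pd.DataFrame(np.random.rand(5, 3))    # small stand-in for the real data

condensed = pdist(df, metric='euclidean')  # length n*(n-1)/2, each pair once

# Save the condensed vector instead of the mirrored square matrix
np.savetxt('condensed_distances.txt', condensed)

# Look up the distance between rows i and j (i < j) without ever
# building the full n x n matrix
def condensed_index(n, i, j):
    return n * i - i * (i + 1) // 2 + (j - i - 1)

n = len(df)
print(condensed[condensed_index(n, 1, 2)])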

Wasserstein distance for multiple histograms

I'm trying to calculate the distance matrix between histograms. I can only find code for calculating the distance between 2 histograms, and my data has more than 10. My data is a CSV file in which each histogram is a column whose values add up to 100. There are about 65,000 entries; I ran the code with only 20% of the data, but it still did not work.
I've tried distance_matrix from scipy.spatial, but it ignores the fact that the data are histograms and treats them as ordinary numerical data. I've also tried the Wasserstein distance, but it raised "object too deep for desired array":
from scipy.stats import wasserstein_distance
distance = wasserstein_distance(df3, df3)
I expected the result to be somewhat like this:
            0           1           2           3           4           5
0    0.000000  259.730341  331.083554  320.302997  309.577373  249.868085
1  259.730341    0.000000  208.368304  190.441382  262.030304  186.033572
2  331.083554  208.368304    0.000000  112.255111  256.269253  227.510879
3  320.302997  190.441382  112.255111    0.000000  246.350482  205.346804
4  309.577373  262.030304  256.269253  246.350482    0.000000  239.642379
but it was an error instead
ValueError: object too deep for desired array
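A hedged sketch of one way to build such a matrix: wasserstein_distance compares exactly two 1-D distributions, so loop over column pairs and pass each histogram as weights over a shared set of bin positions. The df3 contents and bin positions below are stand-ins for the real CSV data.
import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance

df3 = pd.DataFrame(np.random.rand(10, 4))  # stand-in: 4 histograms with 10 bins each
bins = np.arange(len(df3))                 # shared bin positions 0..9

n = df3.shape[1]
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        # each histogram enters as weights over the shared bin positions
        d = wasserstein_distance(bins, bins, df3.iloc[:, i], df3.iloc[:, j])
        dist[i, j] = dist[j, i] = d

print(pd.DataFrame(dist))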

Data calculation in pandas python

I have:
   A1   A2  Random data  Random data2  Average    Stddev
0  0.1  2.0          300          3000     1.05  1.343503
1  0.5  4.5         4500           450     2.50  2.828427
2  3.0  1.2          800            80     2.10  1.272792
3  9.0  9.0          900            90     9.00  0.000000
And I would like to add a column 'ColumnX' whose values are calculated as:
ColumnX = min(df['Random data'] - df['Average'], (df['Random data2'] - df['Stddev']) / (3.0 * df['A2']))
I get the error:
ValueError: The truth value of a Series is ambiguous.
Your error comes from using Python's built-in min on two Series: to pick the smaller argument, min compares them with <, which returns a boolean Series whose truth value is ambiguous. The built-in min isn't going to work row-wise.
A potential solution would be to make two new calculated columns then using the pandas dataframe .min method.
df['calc_col_1'] = df['Random data']-df['Average']
df['calc_col_2'] = (df['Random data2']-df['Stddev'])/(3.0*df['A2'])
df['min_col'] = df[['calc_col_1','calc_col_2']].min(axis=1)
The min(axis=1) method finds the row-wise minimum of the two columns and assigns it to the new column. This approach is efficient because it uses numpy vectorization, and it is easier to read.
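If you'd rather skip the intermediate columns, np.minimum gives the same row-wise minimum in one step (a small variant, not part of the original answer; it assumes numpy is imported as np):
import numpy as np

df['ColumnX'] = np.minimum(df['Random data'] - df['Average'],
                           (df['Random data2'] - df['Stddev']) / (3.0 * df['A2']))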

Detecting outliers in a Pandas dataframe using a rolling standard deviation

I have a DataFrame for a fast Fourier transformed signal.
There is one column for the frequency in Hz and another column for the corresponding amplitude.
I read a post from a couple of years ago saying that you can use a simple boolean mask to exclude (or keep only) the rows that lie more than a few standard deviations from the mean:
df = pd.DataFrame({'Data': np.random.normal(size=200)})  # example dataset of normally distributed data
df[~(np.abs(df.Data - df.Data.mean()) > (3 * df.Data.std()))]  # keep rows within 3 standard deviations of the mean
The problem is that my signal drops by several orders of magnitude (up to 10,000 times smaller) as the frequency increases up to 50,000 Hz. Therefore I cannot use a fixed 3-standard-deviation cutoff, because it would only pick up the peak outliers in the first 50 Hz.
Is there a way to export the outliers in my dataframe that are above 3 rolling standard deviations of a rolling mean instead?
This is probably best illustrated with a quick example. Basically, you compare your existing data against a new series that is the rolling mean plus three rolling standard deviations.
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'Data':np.random.normal(size=200)})
# Create a few outliers (3 of them, at index locations 10, 55, 80)
df.iloc[[10, 55, 80]] = 40.
r = df.rolling(window=20) # Create a rolling object (no computation yet)
mps = r.mean() + 3. * r.std() # Combine a mean and stdev on that object
print(df[df.Data > mps.Data]) # Boolean filter
# Data
# 55 40.0
# 80 40.0
To add a new column filtering only to outliers, with NaN elsewhere:
df['Peaks'] = df['Data'].where(df.Data > mps.Data, np.nan)
print(df.iloc[50:60])
Data Peaks
50 -1.29409 NaN
51 -1.03879 NaN
52 1.74371 NaN
53 -0.79806 NaN
54 0.02968 NaN
55 40.00000 40.0
56 0.89071 NaN
57 1.75489 NaN
58 1.49564 NaN
59 1.06939 NaN
Here .where returns "an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other" (quoting the pandas docs).

How to get the indices of both lists of different lengths when there is a unique match between them, using Python 2.7

I want to know how to get the indices for both lists whenever there is a match between them. I came across enumerate and the zip function, but they only work if the two lists are the same length. Since my inputs differ in length, I want to get the indices of both lists.
Simulated Time(msec)  Simulated O/p  Actual Time(msec)  Actual o/p
0                     12.57            0                12.55
50                    12.58          100                12.56
100                   12.55          200                12.60
150                   12.59          300                12.45
200                   12.53          400                12.59
250                   12.87          500                12.78
300                   12.50          600                12.57
350                   12.75          700                12.66
400                   12.80          800                12.78
...                   ...            ...                ...
Also, my simulated data is in a different file and is generated at a 50 Hz rate, unlike my actual data, so the simulated data is longer than the actual data. But the actual data is contained in the simulated data. I want to get the indices of both lists. For example, Simulated Time(msec) 100 (i=2) matches index j=1 of the actual time. If I get both indices i and j, I can compare the corresponding simulated and actual outputs at that particular instant.
Lastly, I want to iterate to the end of the simulated time.
Please suggest how I can solve this.
If sim and act contain unique values, here is a way to do that, using the numpy set routine np.in1d:
import numpy as np

sim = np.unique(np.random.randint(0, 10, 3)) * 10  # sample simulated times
act = np.unique(np.random.randint(0, 10, 5)) * 10  # sample actual times

i = np.arange(len(sim))[np.in1d(sim, act)]  # indices of sim values also present in act
j = np.arange(len(act))[np.in1d(act, sim)]  # indices of act values also present in sim
print(sim, act, i, j)
# [40 50 70] [10 30 40 50] [0 1] [2 3]
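As a hypothetical application to the question's own data (the arrays below are typed in from the table above; the output variable names are assumptions):
import numpy as np

sim_time = np.array([0, 50, 100, 150, 200, 250, 300, 350, 400])
act_time = np.array([0, 100, 200, 300, 400])

i = np.nonzero(np.in1d(sim_time, act_time))[0]  # indices into the simulated times
j = np.nonzero(np.in1d(act_time, sim_time))[0]  # indices into the actual times

print(i)  # [0 2 4 6 8]
print(j)  # [0 1 2 3 4]
# sim_output[i] can now be compared with act_output[j] element by element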
