Euclidean distance calculation - can the data be calculated non-symmetrically to reduce redundancy? - python

I'm calculating the Euclidean distance between all rows in a large data frame.
This code works:
import pandas as pd
from scipy.spatial.distance import pdist, squareform

distances = pdist(df, metric='euclidean')
dist_matrix = squareform(distances)
pd.DataFrame(dist_matrix).to_csv('distance_matrix.txt')
And this prints out a matrix like this:
0 1 2
0 0.0 4.7 2.3
1 4.7 0.0 3.3
2 2.3 3.3 0.0
But there's a lot of redundant calculation happening (e.g. the distance between sequence 1 and sequence 2 gets a score... and then the distance between sequence 2 and sequence 1 gets the same score).
Would someone know a more efficient way of calculating the Euclidean distance between the rows of a big data frame (about 35 GB), non-redundantly?
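Worth noting: pdist already computes each pairwise distance exactly once and returns a condensed 1-D array; the redundant scores come from squareform, which mirrors that condensed result into a full symmetric matrix. A minimal sketch that skips the expansion and writes the unique pairs directly (the output file name is illustrative):
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

# pdist returns the condensed form: one entry per unordered pair (i, j), i < j
distances = pdist(df, metric='euclidean')

# recover the (i, j) pair behind each condensed entry; np.triu_indices
# enumerates the upper triangle in the same row-major order as pdist
n = len(df)
i, j = np.triu_indices(n, k=1)
pd.DataFrame({'row_i': i, 'row_j': j, 'distance': distances}).to_csv(
    'condensed_distances.txt', index=False)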

Related

Representing a similarity matrix in a heatmap

I know there are several questions here about similarity matrices. I read them, but I couldn't find my answer.
I have a dataframe with 8 rows like the example below. I want to compare the rows with each other and generate a heatmap showing how similar each row is to the others.
df:
speed(km/h) acceleration(m/s2) Deceleration(m/s2) Satisfaction(%)
100 2.1 -1.1 10
150 3.6 -2.2 20
250 0.1 -4 30
100 0.6 -0.1 20
I am looking for a function in Python to measure the similarity between rows and generate a matrix. Finally, it would be great if I could show the resulting matrix as a heatmap, with each pixel showing the similarity.
Thanks in advance
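For what it's worth, a minimal sketch of one way to do this with scipy and matplotlib, assuming euclidean distance as the (dis)similarity measure; any other pdist metric would slot in the same way:
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

# standardise first, since km/h and m/s2 are on very different scales
scaled = (df - df.mean()) / df.std()

# pairwise distances between rows, expanded to a square matrix
sim_matrix = squareform(pdist(scaled, metric='euclidean'))

# draw the matrix as a heatmap; darker cells = more similar rows
plt.imshow(sim_matrix, cmap='viridis')
plt.colorbar(label='euclidean distance')
plt.show()
Standardising matters here because speed (hundreds of km/h) would otherwise dominate the other columns in the distance.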

Wasserstein distance for multiple histograms

I'm trying to calculate the distance matrix between histograms. I can only find code for calculating the distance between 2 histograms, and my data has more than 10. My data is a CSV file where the histograms come in columns that add up to 100, consisting of about 65,000 entries; I only ran with 20% of the data, but the code still does not work.
I've tried distance_matrix from scipy.spatial, but it ignores the fact that the data are histograms and treats them as ordinary numerical data. I've also tried the Wasserstein distance, but the error was 'object too deep for desired array':
from scipy.stats import wasserstein_distance
distance = wasserstein_distance(df3, df3)
I expected the result to be somewhat like this:
0 1 2 3 4 5
0 0.000000 259.730341 331.083554 320.302997 309.577373 249.868085
1 259.730341 0.000000 208.368304 190.441382 262.030304 186.033572
2 331.083554 208.368304 0.000000 112.255111 256.269253 227.510879
3 320.302997 190.441382 112.255111 0.000000 246.350482 205.346804
4 309.577373 262.030304 256.269253 246.350482 0.000000 239.642379
but I got an error instead:
ValueError: object too deep for desired array
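The error occurs because wasserstein_distance compares two 1-D distributions at a time, not whole DataFrames. A sketch of one way to build the matrix pairwise, assuming each column of df3 is one histogram and that the bin positions are simply 0, 1, 2, ... (swap in your real bin centres):
import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance

cols = df3.columns
n = len(cols)
bins = np.arange(len(df3))  # assumed bin positions; use your real bin centres
dist = np.zeros((n, n))

# compare every pair of histograms once and mirror the result
for a in range(n):
    for b in range(a + 1, n):
        d = wasserstein_distance(bins, bins,
                                 u_weights=df3[cols[a]],
                                 v_weights=df3[cols[b]])
        dist[a, b] = dist[b, a] = d

print(pd.DataFrame(dist, index=cols, columns=cols))
Passing the histograms as u_weights/v_weights (rather than as values) is what tells scipy to treat them as distributions over the bin positions.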

Pandas function to apply a multi-input function to every cell in a data frame?

I'm setting up a dataframe by reading a CSV file in pandas; the columns represent one-dimensional positions for different samples, and each row represents a 0.01 s time step. I want to create a new dataframe for velocity and acceleration, so basically apply the operation [point(i) - point(i-1)] / 0.01 to every cell in the data frame.
I'm having trouble using pandas.applymap or other approaches because I don't quite know how to refer to multiple cells of the dataframe in every operation, if that makes sense.
import pandas as pd
import numpy as np

data = pd.read_csv("file_name")

def velocity(xf, xi):
    # dividing by the 0.01s step is the same as multiplying by 100
    v = (xf - xi) * 100
    return v

velocity = data.applymap(velocity)
This is what the first few column and rows of the original data frame look like:
X LFHD Y LFHD Z LFHD X RFHD Y RFHD
0 700.003 -1769.61 1556.05 811.922 -1878.46
1 699.728 -1769.50 1555.99 811.942 -1878.14
2 699.465 -1769.38 1555.99 811.980 -1877.81
3 699.118 -1769.38 1555.83 812.005 -1877.48
4 699.017 -1768.78 1556.19 812.003 -1877.11
For every positional value in each column, I want to calculate the velocity where the initial position is the cell above (xi in the velocity function) and the final position is the cell in question (xf).
When I try to run the above code, it gives me an error because only one argument is provided to velocity, when it expects two. I don't know how to go about providing the second argument so that it outputs the proper new dataframe with the velocity calculated in each cell.
DataFrame.diff subtracts the cell above from each cell, so dividing by the 0.01 s step is just multiplying by 100:
df_velocity = data.diff()*100
df_velocity
Out[6]:
X_LFHD Y_LFHD Z_LFHD X_RFHD Y_RFHD
0 NaN NaN NaN NaN NaN
1 -27.5 11.0 -6.0 2.0 32.0
2 -26.3 12.0 0.0 3.8 33.0
3 -34.7 0.0 -16.0 2.5 33.0
4 -10.1 60.0 36.0 -0.2 37.0
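Since the question also asks for acceleration, the same idea applies a second time to the velocity frame (df_acceleration is an illustrative name):
# acceleration: difference the velocities over the same 0.01s step
df_acceleration = df_velocity.diff() * 100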

Given an index, list the 5 closest indices

I'm looking for a method or function that, given an index (or the name of a movie), returns the list of the 5 closest indices (the 5 most similar films).
My DataFrame :
movie_title movieId Action Adventure Fantasy Sci-Fi Thriller
Avatar 1 1.0 1.0 1.0 1.0 0.0
Spectre 2 1.0 1.0 0.0 0.0 1.0
John Carter 3 1.0 1.0 0.0 1.0 0.0
Converting the DataFrame to a matrix (to_numpy replaces the removed as_matrix):
df_matrix = userGenreTable[userGenreTable.columns[2:]].to_numpy()
Calculating the distance between two vectors:
from scipy.spatial import distance
for i in range(len(df_matrix)):
    for j in range(len(df_matrix)):
        print(distance.euclidean(df_matrix[i, :], df_matrix[j, :]))
I do not see how to get the indices of the five nearest vectors.
You can use .loc like this:
import numpy as np

# Build the pairwise distance array from the genre columns
# (.loc label slicing works on the DataFrame, not on the numpy matrix)
arr = np.array([[distance.euclidean(userGenreTable.loc[i, 'Action':'Thriller'],
                                    userGenreTable.loc[j, 'Action':'Thriller'])
                 for j in range(len(userGenreTable))]
                for i in range(len(userGenreTable))])

# Mask the self-distances on the diagonal (always 0) so argmin
# finds the closest pair of distinct rows
np.fill_diagonal(arr, np.inf)
i, j = np.unravel_index(arr.argmin(), arr.shape)
print([i, j])  # the row locations of the minimum euclidean distance
It's tricky to reference dataframe columns as indexes, but .loc's label-based slicing lets us scan through a 'range' of them. Hope that helps!
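To get the list of 5 closest indices the question actually asks for, one option is to argsort a row of that same distance array (a sketch; query_index is illustrative):
import numpy as np

query_index = 0  # e.g. the row for Avatar

# with the diagonal set to inf above, each row's own entry sorts last,
# so the first five positions are the five nearest movies
nearest_5 = np.argsort(arr[query_index])[:5]
print(nearest_5)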

Data calculation in pandas python

I have:
A1 A2 Random data Random data2 Average Stddev
0 0.1 2.0 300 3000 1.05 1.343503
1 0.5 4.5 4500 450 2.50 2.828427
2 3.0 1.2 800 80 2.10 1.272792
3 9.0 9.0 900 90 9.00 0.000000
And would like to add a column 'ColumnX' whose values are calculated as:
ColumnX = min(df['Random data'] - df['Average'],
              (df['Random data2'] - df['Stddev']) / (3.0 * df['A2']))
I get the error:
ValueError: The truth value of a Series is ambiguous.
Your error comes from Python's built-in min, which isn't going to work row-wise: comparing two Series yields a Series of booleans, and pandas cannot reduce that to a single truth value (the same ambiguity is why pandas prefers the bitwise operators & and | over and and or).
A potential solution would be to make two new calculated columns, then use the pandas DataFrame .min method:
df['calc_col_1'] = df['Random data']-df['Average']
df['calc_col_2'] = (df['Random data2']-df['Stddev'])/(3.0*df['A2'])
df['min_col'] = df[['calc_col_1','calc_col_2']].min(axis=1)
The .min(axis=1) call finds the minimum of the two columns row by row and assigns it to the new column. This way is efficient because it uses numpy vectorization, and it is easier to read.
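If the two intermediate columns aren't needed, numpy's element-wise minimum collapses the same calculation into one step:
import numpy as np

# element-wise minimum of the two expressions, without intermediate columns
df['ColumnX'] = np.minimum(df['Random data'] - df['Average'],
                           (df['Random data2'] - df['Stddev']) / (3.0 * df['A2']))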
