wasserstein distance for multiple histograms - python

I'm trying to calculate a distance matrix between histograms. I can only find code for calculating the distance between two histograms, and my data has more than ten. My data is a CSV file in which the histograms come in columns that add up to 100; it consists of about 65,000 entries. I only run with 20% of the data, but the code still does not work.
I've tried distance_matrix from scipy.spatial, but it ignores the fact that the data are histograms and treats them as ordinary numerical data. I've also tried the Wasserstein distance, but it raised the error "object too deep for desired array":
from scipy.stats import wasserstein_distance
distance = wasserstein_distance(df3, df3)
I expected the result to look something like this:
0 1 2 3 4 5
0 0.000000 259.730341 331.083554 320.302997 309.577373 249.868085
1 259.730341 0.000000 208.368304 190.441382 262.030304 186.033572
2 331.083554 208.368304 0.000000 112.255111 256.269253 227.510879
3 320.302997 190.441382 112.255111 0.000000 246.350482 205.346804
4 309.577373 262.030304 256.269253 246.350482 0.000000 239.642379
but I got an error instead:
ValueError: object too deep for desired array
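One way around the error, assuming each row of df3 is one histogram over the same bins: scipy.stats.wasserstein_distance compares two 1-D distributions at a time, taking sample values plus optional weights, so for histograms you pass the bin positions as the values and the histogram heights as the weights, and build the matrix pairwise (for example with scipy.spatial.distance.pdist and a callable metric). A minimal sketch with a small synthetic stand-in for df3:

import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance
from scipy.spatial.distance import pdist, squareform

# small synthetic stand-in for df3: each row is one histogram whose bins sum to 100
rng = np.random.default_rng(0)
raw = rng.random((6, 10))
df3 = pd.DataFrame(100 * raw / raw.sum(axis=1, keepdims=True))

hists = df3.to_numpy()
bins = np.arange(hists.shape[1])  # bin positions, assumed equally spaced

def wdist(h1, h2):
    # wasserstein_distance takes sample values plus optional weights,
    # so pass the bin positions as values and the histogram heights as weights
    return wasserstein_distance(bins, bins, u_weights=h1, v_weights=h2)

condensed = pdist(hists, metric=wdist)             # non-redundant pairwise distances
dist_matrix = pd.DataFrame(squareform(condensed))  # full symmetric matrix
print(dist_matrix)

Note that this scales quadratically with the number of histograms, so computing the full matrix for all ~65,000 rows at once will not be practical.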

Related

Representing a similarity matrix as a heatmap

I know there are several questions here about similarity matrices. I read them, but I couldn't find my answer.
I have a dataframe with 8 rows like the example below. I want to compare the rows with each other and generate a heatmap showing how similar each row is to the others.
df:
speed(km/h) acceleration(m/s2) Deceleration(m/s2) Satisfaction(%)
100 2.1 -1.1 10
150 3.6 -2.2 20
250 0.1 -4 30
100 0.6 -0.1 20
I am looking for a function in Python to measure the similarity between rows and generate a matrix. Ideally, I would also like to show the result as a heatmap, with each pixel showing the similarity.
Thanks in advance
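A minimal sketch of one possible approach: standardise the columns (an assumption, since the units differ widely), compute pairwise Euclidean distances with scipy, convert them to a similarity score, and plot with seaborn.heatmap. The values below are the rows from the question:

import pandas as pd
from scipy.spatial.distance import pdist, squareform
import seaborn as sns
import matplotlib.pyplot as plt

# the example rows from the question
df = pd.DataFrame({
    'speed(km/h)':        [100, 150, 250, 100],
    'acceleration(m/s2)': [2.1, 3.6, 0.1, 0.6],
    'Deceleration(m/s2)': [-1.1, -2.2, -4.0, -0.1],
    'Satisfaction(%)':    [10, 20, 30, 20],
})

# standardise so that speed (large numbers) does not dominate the distance
scaled = (df - df.mean()) / df.std()

# pairwise Euclidean distance, then map distance 0 -> similarity 1
dist = squareform(pdist(scaled, metric='euclidean'))
similarity = 1.0 / (1.0 + dist)

sns.heatmap(pd.DataFrame(similarity), annot=True, cmap='viridis')
plt.title('Row-to-row similarity')
plt.show()

The choice of similarity score (here 1 / (1 + distance)) is arbitrary; cosine similarity or a Gaussian kernel would work just as well depending on what "similar" should mean for these rows.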

Euclidean calculation - calculate the distances non-symmetrically to reduce redundancy?

I'm calculating the Euclidean distance between all rows in a large data frame.
This code works:
from scipy.spatial.distance import pdist,squareform
distances = pdist(df,metric='euclidean')
dist_matrix = squareform(distances)
pd.DataFrame(dist_matrix).to_csv('distance_matrix.txt')
And this prints out a matrix like this:
0 1 2
0 0.0 4.7 2.3
1 4.7 0.0 3.3
2 2.3 3.3 0.0
But there's a lot of redundant calculation happening (e.g. the distance between sequence 1 and sequence 2 gets a score, and then the distance between sequence 2 and sequence 1 gets the same score).
Would someone know a more efficient way of calculating the Euclidean distance between the rows of a big data frame, non-redundantly? (The dataframe is about 35 GB.)
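Worth noting: pdist itself already does the non-redundant work and returns a condensed 1-D array of length n*(n-1)/2; it is the squareform call that expands it into the redundant symmetric matrix. A minimal sketch with a small stand-in frame (the condensed_index helper is hypothetical, just to show how to look pairs up in the condensed array):

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

# small stand-in for the 35 GB frame
df = pd.DataFrame(np.random.rand(5, 3))
n = len(df)

# pdist computes each pair only once and returns a condensed 1-D array of
# length n*(n-1)/2; squareform is the step that duplicates it into n x n
distances = pdist(df, metric='euclidean')
np.savetxt('distances_condensed.txt', distances)

# hypothetical helper: index of the pair (i, j), with i < j, in the condensed array
def condensed_index(i, j, n):
    return n * i - i * (i + 1) // 2 + (j - i - 1)

print(distances[condensed_index(1, 2, n)])  # distance between rows 1 and 2

This halves the storage, but for a 35 GB frame the quadratic number of pairs is the real limit, so some form of chunked or out-of-core processing would still be needed.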

smoothing curve with pandas and interpolate not modifying data

I'm sure I'm not doing this right. I have a dataframe with a series of data, basically a year and a value. I want to smooth the curve and was looking at using a spline to test the results.
Basically I was trying to take a column and write the new data points into another column:
df['smooth'] = df['value'].interpolate(method='spline', order=3, s=0.)
but the results in smooth and value are the same.
value periodDate smooth diffSmooth
6 422976.72 2019 422976.72 0.0
7 190865.94 2018 190865.94 0.0
8 188440.89 2017 188440.89 0.0
9 192481.64 2016 192481.64 0.0
10 191958.64 2015 191958.64 0.0
11 681376.60 2014 681376.60 0.0
Any suggestions of what I'm doing wrong?
According to the Pandas docs, the interpolate function fills missing values in a sequence, so, for example, linear interpolation would turn [0, 1, NaN, 3] into [0, 1, 2, 3]. In short, you're using the wrong function. If you want to fit a spline, sklearn, scipy, or numpy may be better bets.
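A minimal sketch of a smoothing spline with scipy.interpolate.UnivariateSpline, using a small stand-in frame shaped like the one in the question; the smoothing factor s below is an arbitrary starting point and needs tuning to the scale of the data:

import pandas as pd
from scipy.interpolate import UnivariateSpline

# stand-in frame shaped like the one in the question
df = pd.DataFrame({
    'periodDate': [2014, 2015, 2016, 2017, 2018, 2019],
    'value': [681376.60, 191958.64, 192481.64, 188440.89, 190865.94, 422976.72],
}).sort_values('periodDate')

# UnivariateSpline needs strictly increasing x; s is the allowed sum of squared
# residuals, so it must be tuned to the scale of the data (s=0 forces the curve
# through every point, i.e. no smoothing at all)
s = len(df) * df['value'].var()  # arbitrary starting point
spline = UnivariateSpline(df['periodDate'], df['value'], k=3, s=s)

df['smooth'] = spline(df['periodDate'])
df['diffSmooth'] = df['smooth'] - df['value']
print(df)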

Python - Pandas: how can I interpolate between values that grow exponentially?

I have a Pandas Series that contains the price evolution of a product (my country has high inflation), or, say, the number of people infected with coronavirus in a certain country. The values in both of these datasets grow exponentially; that means that if you had something like [3, NaN, 27] you'd want to interpolate so that the missing value is filled with 9 in this case. I checked the interpolation methods in the Pandas documentation but, unless I missed something, I didn't find anything about this type of interpolation.
I can do it manually: you just take the geometric mean, or, in the case of more values, get the average growth rate with (final value / initial value)^(1 / distance between them) and then multiply accordingly. But there are a lot of values to fill in my Series, so how do I do this automatically? I guess I'm missing something, since this seems to be something very basic.
Thank you.
You could take the logarithm of your series, interpolate linearly, and then transform it back to the exponential scale.
import pandas as pd
import numpy as np
arr = np.exp(np.arange(1,10))
arr = pd.Series(arr)
arr[3] = None
0 2.718282
1 7.389056
2 20.085537
3 NaN
4 148.413159
5 403.428793
6 1096.633158
7 2980.957987
8 8103.083928
dtype: float64
arr = np.log(arr) # Transform according to assumed process.
arr = arr.interpolate('linear') # Interpolate.
np.exp(arr) # Invert previous transformation.
0 2.718282
1 7.389056
2 20.085537
3 54.598150
4 148.413159
5 403.428793
6 1096.633158
7 2980.957987
8 8103.083928
dtype: float64

Python - FFT of 2D correlation function: is it the 2D power spectrum?

I have a density field on a 2d grid. I have a matrix that describes the "correlation" of each point of the grid with all the other points. For example, let's say that I have a 4x4 grid, whose points I label with these indices:
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
then I have a matrix which is 16x16. The first row describes the correlation of pixel 1 with all the others, the second row describes the correlation of pixel 2 with all the others and so on until row 16. Symbolically:
1-1 1-2 1-3 ...... 1-16
2-1 2-2 2-3 ...... 2-16
.
.
.
16-1 16-2 16-3...... 16-16
If I use a 2-dimensional FFT of this matrix (np.fft.fft2), do I get a quantity which is the density field's power spectrum? Or do I have to do some extra operation on the matrix before applying the FFT, such as reordering the entries according to some rules, or padding extra entries with zeros?
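Not an authoritative answer, but by the Wiener-Khinchin theorem the power spectrum is the Fourier transform of the correlation expressed as a function of separation, not of the full pixel-by-pixel matrix. If the field is statistically homogeneous and the grid is treated as periodic, any row of the 16x16 matrix, reshaped back to 4x4, is that function of separation, and fft2 of it gives the spectrum. A toy sketch under those assumptions (the correlation function below is made up just to have something to compute):

import numpy as np

n = 4  # 4x4 grid, so the correlation matrix is 16x16

def corr_of_offset(dx, dy):
    # toy isotropic correlation function with periodic wrapping
    dx = min(dx % n, (-dx) % n)
    dy = min(dy % n, (-dy) % n)
    return np.exp(-(dx**2 + dy**2) / 2.0)

# full 16x16 matrix: entry (p, q) is the correlation between pixels p and q
coords = [(i, j) for i in range(n) for j in range(n)]
C = np.array([[corr_of_offset(a[0] - b[0], a[1] - b[1]) for b in coords]
              for a in coords])

# under homogeneity, one row reshaped to the grid is the correlation as a
# function of separation; its 2-D FFT is the power spectrum (Wiener-Khinchin)
xi = C[0].reshape(n, n)
power_spectrum = np.fft.fft2(xi).real
print(power_spectrum)

If the field is not homogeneous, the 16x16 matrix does not reduce to a single 4x4 correlation function, and a plain fft2 of the matrix itself does not give the power spectrum.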
