Pandas - Euclidean Distance Between Columns - python

I have a dataframe as follows:
uuid x_1 y_1 x_2 y_2
0 di-ab5 82.31 184.20 148.06 142.54
1 di-de6 92.35 185.21 24.12 16.45
2 di-gh7 123.45 0.01 NaN NaN
...
I am trying to calculate the Euclidean distance between [x_1, y_1] and [x_2, y_2] in a new column (the dist values below are illustrative, not the real results).
uuid dist
0 di-ab5 12.31
1 di-de6 62.35
2 di-gh7 NaN
Caveats:
some rows have NaN on some of the datapoints
it is okay to represent data in the original dataframe as points (e.g. [1.23, 4.56]) instead of splitting up the x and y coordinates
I am currently using the following script:
df['dist'] = np.sqrt((df['x_1'] - df['x_2'])**2 + (df['y_1'] - df['y_2'])**2)
But it seems verbose and often fails.
Is there a better way to do this using pandas, numpy, or scipy?

You can use np.linalg.norm, i.e.:
df['dist'] = np.linalg.norm(df[['x_1', 'y_1']].values - df[['x_2', 'y_2']].values, axis=1)
Output:
uuid x_1 y_1 x_2 y_2 dist
0 di-ab5 82.31 184.20 148.06 142.54 77.837125
1 di-de6 92.35 185.21 24.12 16.45 182.030960
2 di-gh7 123.45 0.01 NaN NaN NaN

def getDist(df, a, b):
    return np.sqrt((df[f'x_{a}'] - df[f'x_{b}'])**2 + (df[f'y_{a}'] - df[f'y_{b}'])**2)
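For the frame in the question, the call would be:
df['dist'] = getDist(df, 1, 2)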

np.sqrt(df.filter(like='x').diff(axis=1).sum(axis=1, min_count=1)**2 + df.filter(like='y').diff(axis=1).sum(axis=1, min_count=1)**2)
How it works
Filter the x and y columns respectively:
df.filter(like='x')
Take the cross-column difference, collapse it to a single value per row, and square it (diff leaves NaN in the first column; min_count=1 keeps all-NaN rows as NaN instead of silently summing them to 0):
df.filter(like='x').diff(axis=1).sum(axis=1, min_count=1)**2
Add the two outcomes together and take the square root:
np.sqrt(df.filter(like='x').diff(axis=1).sum(axis=1, min_count=1)**2 + df.filter(like='y').diff(axis=1).sum(axis=1, min_count=1)**2)
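For the first row of the sample frame, the intermediates work out as follows (a quick arithmetic check against the numbers in the question):
x diff: 148.06 - 82.31 = 65.75, squared: 4323.0625
y diff: 142.54 - 184.20 = -41.66, squared: 1735.5556
sqrt(4323.0625 + 1735.5556) ≈ 77.837125, matching the output of the first answer.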

Another solution using numpy:
diff = (df[['x_1','y_1']].to_numpy()-df[['x_2','y_2']].to_numpy())
df['dist'] = np.sqrt((diff*diff).sum(-1))
Output:
uuid x_1 y_1 x_2 y_2 dist
0 di-ab5 82.31 184.20 148.06 142.54 77.837125
1 di-de6 92.35 185.21 24.12 16.45 182.030960
2 di-gh7 123.45 0.01 NaN NaN NaN
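For the two-column case, np.hypot is an even more compact equivalent (a standard NumPy function, not from the answers above; it computes sqrt(dx**2 + dy**2) elementwise and propagates NaN the same way):
df['dist'] = np.hypot(df['x_1'] - df['x_2'], df['y_1'] - df['y_2'])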

Related

how to loop a dataframe with increment factor based on a particular column value

The dataframe I am working with looks like this:
vid2 COS fsim FWeight
0 -_aaMGK6GGw_57_61 2 0.253792 0.750000
1 -_aaMGK6GGw_57_61 2 0.192565 0.250000
2 -_hbPLsZvvo_5_8 2 0.562707 0.333333
3 -_hbPLsZvvo_5_8 2 0.179969 0.666667
4 -_hbPLsZvvo_18_25 1 0.275962 0.714286
Here, the features have the following meanings:
FWeight - weight of each fragment (or row)
fsim - similarity score between the two columns cap1 and cap2
The weighted formula is:
vid_score = sum(fsim[i] * FWeight[i]) / sum(FWeight[i]), summed over the COS rows belonging to one vid2
For example, for vid2 "-_aaMGK6GGw_57_61", COS = 2.
Thus, the two rows with this vid2 come under it:
fsim FWeight
0 0.253792 0.750000
1 0.192565 0.250000
The calculated value vid_score needs to be
vid_score(1st video) = (fsim[0] * FWeight[0] + fsim[1] * FWeight[1])/(FWeight[0] + FWeight[1])
The expected output value vid_score for vid2 = -_aaMGK6GGw_57_61 is
(0.750000 * 0.253792 + 0.250000 * 0.192565) / (0.750000 + 0.250000)
= 0.238485 (the weights here sum to 1, so the denominator drops out)
For some videos, this COS = 1, 2, 3, 4, 5, ...
Thus this needs to be dynamic
I am trying to calculate the weighted similarity score for each video ID that is vid2 here. However, there are a number of captions and weights respectively for each video. It varies, some have 2, some 1, some 3, etc. This number of segments and captions has been stored in the feature COS (that is, count of segments).
I want to iterate through the dataframe, storing the score for each video as a weighted average of fsim (the fragment similarity score). But the number of iterations per video is not regular.
I have tried the code below, but I am not able to make the iteration step dynamic, with the step being COS instead of a constant value:
vems_score = 0.0
video_scores = []
for i, row in merged.iterrows():
    vid_score = 0.0
    total_weight = 0.0
    for j in range(row['COS']):
        total_weight = total_weight + row['FWeight']
        vid_score = vid_score + (row['FWeight'] * row['fsim'])
    i = i + row['COS']
    vid_score = vid_score / total_weight
    video_scores.append(vid_score)
print(video_scores)
Here is my solution, which you can modify/optimize to your needs.
import pandas as pd, numpy as np

def computeSim():
    vid = [1, 1, 2, 2, 3]
    cos = [2, 2, 2, 2, 1]
    fsim = [0.25, .19, .56, .17, .27]
    weight = [.75, .25, .33, .66, .71]
    df = pd.DataFrame({'vid': vid, 'cos': cos, 'fsim': fsim, 'fw': weight})
    print(df)
    similarity = []
    for _, group in df.groupby('vid'):
        # weighted average of fsim within each vid group
        similarity.append(np.sum(group['fsim'] * group['fw']) / np.sum(group['fw']))
    return similarity
Output:
0.235
0.30000000000000004
0.27
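(These are the weighted averages for vid 1, 2 and 3 respectively, i.e. the list that computeSim() returns.)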
Solution
Try this with your data. I assume that you stored the dataframe as df.
df['Prod'] = df['fsim']*df['FWeight']
grp = df.groupby(['vid2', 'COS'])
result = grp['Prod'].sum()/grp['FWeight'].sum()
print(result)
Output with your data (Dummy Data B, loaded by the script below):
vid2 COS
-_aaMGK6GGw_57_61 2 0.238485
-_hbPLsZvvo_18_25 1 0.275962
-_hbPLsZvvo_5_8 2 0.307548
dtype: float64
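The same weighted average can also be written with np.average, which takes the weights directly and makes the Prod helper column unnecessary (a sketch, assuming the same df):
result = df.groupby('vid2').apply(lambda g: np.average(g['fsim'], weights=g['FWeight']))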
Dummy Data: A
I made the following dummy data to test a few aspects of the logic.
df = pd.DataFrame({'vid2': [1,1,2,5,2,6,7,4,8,7,6,2],
'COS': [2,2,3,1,3,2,2,1,1,2,2,3],
'fsim': np.random.rand(12),
'FWeight': np.random.rand(12)})
df['Prod'] = df['fsim']*df['FWeight']
print(df)
# Groupby and apply formula
grp = df.groupby(['vid2', 'COS'])
result = grp['Prod'].sum()/grp['FWeight'].sum()
print(result)
Output:
vid2 COS
1 2 0.405734
2 3 0.535873
4 1 0.534456
5 1 0.346937
6 2 0.369810
7 2 0.479250
8 1 0.065854
dtype: float64
Dummy Data: B (OP Provided)
This is your dummy data. I made this script so anyone could easily run it and load the data as a dataframe.
import pandas as pd
from io import StringIO
s = """
vid2 COS fsim FWeight
0 -_aaMGK6GGw_57_61 2 0.253792 0.750000
1 -_aaMGK6GGw_57_61 2 0.192565 0.250000
2 -_hbPLsZvvo_5_8 2 0.562707 0.333333
3 -_hbPLsZvvo_5_8 2 0.179969 0.666667
4 -_hbPLsZvvo_18_25 1 0.275962 0.714286
"""
df = pd.read_csv(StringIO(s), sep=r'\s+')
#print(df)

Quantify relative position of coordinates - python

I have a df of coordinates representing points at various timestamps. I want to calculate the average position of these points in relation to each other.
To achieve this, I'm aiming to calculate the offset between each point and the rest of the points, and then average those offsets across timestamps.
The following calculates the distance between each pair of points.
import pandas as pd
from scipy.spatial import distance
import itertools
df = pd.DataFrame({
'Time' : [1,1,1,2,2,2,3,3,3],
'id' : ['A','B','C','A','B','C','A','B','C'],
'X' : [1.0,3.0,2.0,2.0,4.0,3.0,3.0,5.0,4.0],
'Y' : [1.0,1.0,0.5,2.0,2.0,2.5,3.0,3.0,3.0],
})
ids = list(df['id'])
# get the points
points = df[["X", "Y"]].values
# calculate distance of each point from every other point.
# row i contains distances for point i.
# distances[i, j] contains distance of point i from point j.
distances = distance.cdist(points, points, "euclidean")
distances = distances.flatten()
# get the start and end points
cartesian = list(itertools.product(ids, ids))
data = dict(
start_region = [x[0] for x in cartesian],
end_region = [x[1] for x in cartesian],
distance = distances
)
df1 = pd.DataFrame(data)
All I really need to output is:
Time start_point end_point X Y
0 1 A B 2.0 0.0
1 1 A C 1.0 -0.5
2 1 B C -1.0 -0.5
3 2 A B 2.0 0.0
4 2 A C 1.0 0.5
5 2 B C -1.0 0.5
6 3 A B 2.0 0.0
7 3 A C 1.0 0.0
8 3 B C -1.0 0.0
So the average position of these points in relation to each other would be the green coordinates [plot omitted].
But if I average the dataset above directly, it displays something different [plot omitted].
I understand how this occurs. It's not referencing the other points.
Here's my take on it:
import itertools

def relative_dist(gp):
    # all (start, end) id pairs within one timestamp
    combs = list(itertools.combinations(gp.index, 2))
    # diff() on each pair yields the (X, Y) offset; dropna keeps only the offset row
    df_gp = pd.concat([gp.loc[tup, :].diff() for tup in combs], keys=combs).dropna()
    return df_gp

df_dist = (df.set_index('id').groupby('Time')[['X', 'Y']].apply(relative_dist)
             .droplevel('id').rename_axis(['Time', 'start_point', 'end_point'])
             .reset_index())
Output:
Time start_point end_point X Y
0 1 A B 2.0 0.0
1 1 A C 1.0 -0.5
2 1 B C -1.0 -0.5
3 2 A B 2.0 0.0
4 2 A C 1.0 0.5
5 2 B C -1.0 0.5
6 3 A B 2.0 0.0
7 3 A C 1.0 0.0
8 3 B C -1.0 0.0
df_avg = df_dist.groupby(['start_point','end_point'], as_index=False)[['X','Y']].mean()
Output:
start_point end_point X Y
0 A B 2.0 0.0
1 A C 1.0 0.0
2 B C -1.0 0.0
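These averaged offsets match the relative positions listed in the question's expected output, i.e. the "green coordinates" relationship the OP was after.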
Here's a suggestion on how to visualise the relative positions of your points. For each timestamp, I would plot an ellipse at position (X_, Y_) where:
X_ is the mean of your points' X coordinates for that timestamp.
Y_ is the mean of your points' Y coordinates for that timestamp.
the width of the ellipse equals the variance of your points' X coordinates for that timestamp.
the height of the ellipse equals the variance of your points' Y coordinates for that timestamp.
That way, at a glance and for each timestamp, you can read some very high-level statistics about your coordinate distribution at that timestamp.
Here's some code to generate such a visualisation:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.patches import Ellipse

# sample data with 4 timestamps
df = pd.DataFrame({
    'Time': [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
    'id':   ['A','B','C','D','A','B','C','D','A','B','C','D','A','B','C','D'],
    'X':    [1,2,1,2,1,2,1,2,4,4,3,4,10,8,5,6],
    'Y':    [1,1,3,3,1,1,2,2,5,5,8,5,6,6,7,6],
})

# for each timestamp, compute means and variances over all samples for that timestamp
means = df.groupby("Time")[["X", "Y"]].mean()
variances = df.groupby("Time")[["X", "Y"]].var()
df_ = pd.concat([means, variances], axis=1)
df_.columns = ["X_", "Y_", "var_X", "var_Y"]

# plot
fig, ax = plt.subplots(subplot_kw={'aspect': 'equal'})
for row in df_.itertuples():
    ellipse = Ellipse(xy=(row.X_, row.Y_),  # position of the ellipse is (X_, Y_)
                      width=row.var_X,      # width reflects X variance
                      height=row.var_Y,     # height reflects Y variance
                      angle=0)
    ax.add_artist(ellipse)
    ellipse.set_clip_box(ax.bbox)
    ellipse.set_alpha(.4)
    plt.text(x=row.X_ + 0.2, y=row.Y_ + 0.2, s=f"t={row.Index}")  # timestamp label
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
plt.show()
Which would look like this: [plot omitted]
What do you think? Another idea could be an animated GIF, one frame per timestamp, in case the ellipses for different timestamps overlap too much.

Percentage difference between any two columns of pandas dataframe

I would like to have a function defined for percentage difference calculation between any two pandas columns.
Let's say that my dataframe is defined by:
R1 R2 R3 R4 R5 R6
A B 1 2 3 4
I would like my calculation defined as
df['R7'] = df[['R3','R4']].apply( method call to calculate perc diff)
and
df['R8'] = df[['R5','R6']].apply(same method call to calculate perc diff)
How can I do that?
I have tried the following:
df['perc_cnco_error'] = df[['CumNetChargeOffs_x','CumNetChargeOffs_y']].apply(lambda x,y: percCalc(x,y))

def percCalc(x, y):
    if x < 1e-9:
        return 0
    else:
        return (y - x) * 100 / x

and it gives me the error message:
TypeError: ('<lambda>() takes exactly 2 arguments (1 given)',
u'occurred at index CumNetChargeOffs_x')
In its simplest terms:
def percentage_change(col1, col2):
    return ((col2 - col1) / col1) * 100
You can apply it to any 2 columns of your dataframe:
df['a'] = percentage_change(df['R3'],df['R4'])
df['b'] = percentage_change(df['R6'],df['R5'])
>>> print(df)
R1 R2 R3 R4 R5 R6 a b
0 A B 1 2 3 4 100.0 -25.0
Equivalently, using the pandas arithmetic methods:
def percentage_change(col1, col2):
    return col2.sub(col1).div(col1).mul(100)
See pandas.Series.sub, pandas.Series.div and pandas.Series.mul.
You can also utilise pandas built-in pct_change which computes the percentage change across all the columns passed, and select the column you want to return:
df['R7'] = df[['R3','R4']].pct_change(axis=1)['R4']
df['R8'] = df[['R6','R5']].pct_change(axis=1)['R5']
>>> print(df)
R1 R2 R3 R4 R5 R6 a b R7 R8
0 A B 1 2 3 4 100.0 -25.0 1.0 -0.25
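Note that pct_change returns a fractional change rather than a percentage, which is why R7 and R8 above are 1.0 and -0.25; multiply by 100 (e.g. df[['R3','R4']].pct_change(axis=1)['R4'] * 100) to put them on the same scale as the earlier columns.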
Setup:
df = pd.DataFrame({'R1':'A','R2':'B',
'R3':1,'R4':2,'R5':3,'R6':4},
index=[0])
To calculate the percent difference between R3 and R4 you can use:
df['R7'] = (df.R3 - df.R4) / df.R3 * 100
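Mind the sign convention: (df.R3 - df.R4) / df.R3 is the negative of the percentage_change function above, which computes (col2 - col1) / col1.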
This would give you the deviation in percentage, assuming the frame holds only the two numeric columns:
df.apply(lambda row: (row.iloc[0]-row.iloc[1])/row.iloc[0]*100, axis=1)
If you have more columns, select the pair first:
df[['R3', 'R5']].apply(lambda row: (row.iloc[0]-row.iloc[1])/row.iloc[0]*100, axis=1)

Pandas new column with constant increments

I need a new column that increases in constant increments, in this case 0.02.
DF before:
x y x2 y2
0 1.022467 1.817298 1.045440 3.302572
1 1.026426 1.821669 1.053549 3.318476
2 1.018198 1.818419 1.036728 3.306648
3 1.013077 1.813290 1.026325 3.288020
4 1.017878 1.811058 1.036076 3.279930
DF after:
x y x2 y2 t
0 1.022467 1.817298 1.045440 3.302572 0.000000
1 1.026426 1.821669 1.053549 3.318476 0.020000
2 1.018198 1.818419 1.036728 3.306648 0.040000
3 1.013077 1.813290 1.026325 3.288020 0.060000
4 1.017878 1.811058 1.036076 3.279930 0.080000
5 1.016983 1.814031 1.034254 3.290708 0.100000
I have looked around for a while and cannot find a good solution. The only way I can think of is to build a standard Python list and bring it in. There has to be a better way. Thanks.
Because your index is the perfect range for this (i.e. 0...n), just multiply it by your constant:
df['t'] = .02 * df.index.values
>>> df
x y x2 y2 t
0 1.022467 1.817298 1.045440 3.302572 0.00
1 1.026426 1.821669 1.053549 3.318476 0.02
2 1.018198 1.818419 1.036728 3.306648 0.04
3 1.013077 1.813290 1.026325 3.288020 0.06
4 1.017878 1.811058 1.036076 3.279930 0.08
You could also use a list comprehension:
df['t'] = [0.02 * i for i in range(len(df))]
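A vectorized alternative that doesn't depend on the index being the default 0...n range (a sketch, assuming numpy is imported as np):
df['t'] = np.arange(len(df)) * 0.02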

Euclidean distance in Python

I have two 3000x3 vectors and I'd like to compute 1-to-1 Euclidean distance between them. For example, vec1 is
1 1 1
2 2 2
3 3 3
4 4 4
...
The vec2 is
2 2 2
3 3 3
4 4 4
5 5 5
...
I'd like to get the results as
1.73205081
1.73205081
1.73205081
1.73205081
...
I tried scipy.spatial.distance.cdist(vec1, vec2), and it returns a 3000x3000 matrix, whereas I only need the main diagonal. I also tried np.sqrt(np.sum((vec1-vec2)**2 for vec1,vec2 in zip(vec1,vec2))) and it didn't work for my purpose. Is there any way to compute the distances? I'd appreciate any comments.
cdist gives you back a 3000 x 3000 array because it computes the distance between every pair of row vectors in your two input arrays.
To compute only the distances between corresponding row indices, you could use np.linalg.norm:
a = np.repeat((np.arange(3000) + 1)[:, None], 3, 1)
b = a + 1
dist = np.linalg.norm(a - b, axis=1)
Or using standard vectorized array operations:
dist = np.sqrt(((a - b) ** 2).sum(1))
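Either way, dist is a length-3000 array with every entry equal to sqrt(3) ≈ 1.7320508. An equivalent einsum form, a common idiom though not part of the original answer, is:
diff = a - b
dist = np.sqrt(np.einsum('ij,ij->i', diff, diff))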
Here's another way that works. It still uses the np.linalg.norm function, but it also parses the vectors from text, if that is something you needed.
import numpy as np

vec1 = '''1 1 1
2 2 2
3 3 3
4 4 4'''
vec2 = '''2 2 2
3 3 3
4 4 4
5 5 5'''

process_vec1 = np.array([])
process_vec2 = np.array([])
# iterate over lines (not characters) and parse the three floats on each line
for line in vec1.splitlines():
    process_vec1 = np.append(process_vec1, [float(x) for x in line.split()])
for line in vec2.splitlines():
    process_vec2 = np.append(process_vec2, [float(x) for x in line.split()])
process_vec1 = process_vec1.reshape((len(process_vec1) // 3, 3))
process_vec2 = process_vec2.reshape((len(process_vec2) // 3, 3))
dist = np.linalg.norm(process_vec1 - process_vec2, axis=1)
print(dist)
[1.73205081 1.73205081 1.73205081 1.73205081]
