calculating average distance of nearest neighbours in pandas dataframe - python

I have a set of objects (cars) and their positions over time. I would like to get the distance between each car and its nearest neighbour, and calculate an average of this for each time point. An example dataframe is as follows:
time = [0, 0, 0, 1, 1, 2, 2]
x = [216, 218, 217, 280, 290, 130, 132]
y = [13, 12, 12, 110, 109, 3, 56]
car = [1, 2, 3, 1, 3, 4, 5]
df = pd.DataFrame({'time': time, 'x': x, 'y': y, 'car': car})
df
   time    x    y  car
0     0  216   13    1
1     0  218   12    2
2     0  217   12    3
3     1  280  110    1
4     1  290  109    3
5     2  130    3    4
6     2  132   56    5
For each time point, I would like to know the nearest car neighbour for each car. Example:
df2
      car  nearest_neighbour  euclidean_distance
time
0       1                  3                1.41
0       2                  3                1.00
0       3                  1                1.41
1       1                  3               10.05
1       3                  1               10.05
2       4                  5               53.04
2       5                  4               53.04
I know I can calculate the pairwise distances between cars from How to apply euclidean distance function to a groupby object in pandas dataframe?, but how do I get the nearest neighbour for each car?
After that it seems simple enough to get an average of the distances for each frame using groupby, but it's the second step that really throws me off.
Help appreciated!

It might be a bit overkill, but you could use NearestNeighbors from scikit-learn.
An example:
import numpy as np
from sklearn.neighbors import NearestNeighbors
import pandas as pd
def nn(x):
    nbrs = NearestNeighbors(n_neighbors=2, algorithm='auto',
                            metric='euclidean').fit(x)
    distances, indices = nbrs.kneighbors(x)
    return distances, indices
time = [0, 0, 0, 1, 1, 2, 2]
x = [216, 218, 217, 280, 290, 130, 132]
y = [13, 12, 12, 110, 109, 3, 56]
car = [1, 2, 3, 1, 3, 4, 5]
df = pd.DataFrame({'time': time, 'x': x, 'y': y, 'car': car})
#This has the index of the nearest neighbor in the group, as well as the distance
nns = df.drop(columns='car').groupby('time').apply(lambda x: nn(x.to_numpy()))
groups = df.groupby('time')
nn_rows = []
for i, nn_set in enumerate(nns):
    group = groups.get_group(i)
    for j, tup in enumerate(zip(nn_set[0], nn_set[1])):
        nn_rows.append({'time': i,
                        'car': group.iloc[j]['car'],
                        'nearest_neighbour': group.iloc[tup[1][1]]['car'],
                        'euclidean_distance': tup[0][1]})
nn_df = pd.DataFrame(nn_rows).set_index('time')
Result:
      car  euclidean_distance  nearest_neighbour
time
0       1            1.414214                  3
0       2            1.000000                  3
0       3            1.000000                  2
1       1           10.049876                  3
1       3           10.049876                  1
2       4           53.037722                  5
2       5           53.037722                  4
(Note that at time 0, car 3's nearest neighbour is car 2: sqrt((217-216)**2 + (12-13)**2) is about 1.4142135623730951, while sqrt((218-217)**2 + (12-12)**2) = 1.)
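From there, the per-frame average the question asks about is just a groupby on the result; a minimal sketch using the nn_df built above (the avg_dist name is mine):
avg_dist = nn_df.groupby(level='time')['euclidean_distance'].mean()
avg_dist
# time
# 0     1.138071
# 1    10.049876
# 2    53.037722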

Use cdist from scipy.spatial.distance to get a matrix representing the distance from each car to every other car. Since each car's distance to itself is 0, the diagonal elements are all 0.
example (for time == 0):
from scipy.spatial.distance import cdist

X = df[df.time==0][['x','y']]
dist = cdist(X, X)
dist
array([[0.        , 2.23606798, 1.41421356],
       [2.23606798, 0.        , 1.        ],
       [1.41421356, 1.        , 0.        ]])
Use np.argsort to get the indexes that would sort the distance-matrix. The first column is just the row number because the diagonal elements are 0.
idx = np.argsort(dist)
idx
array([[0, 2, 1],
       [1, 2, 0],
       [2, 1, 0]], dtype=int64)
Then, just pick out the cars & closest distances using idx:
dist[idx[:, 0], idx[:, 1]]
array([1.41421356, 1.        , 1.        ])
df[df.time==0].car.values[idx[:, 1]]
array([3, 3, 2], dtype=int64)
combine the above logic into a function that returns the required dataframe:
def closest(df):
    X = df[['x', 'y']]
    dist = cdist(X, X)
    v = np.argsort(dist)
    return df.assign(euclidean_distance=dist[v[:, 0], v[:, 1]],
                     nearest_neighbour=df.car.values[v[:, 1]])
& use it with groupby, finally dropping the index because the groupby-apply adds an additional index
df.groupby('time').apply(closest).reset_index(drop=True)
   time    x    y  car  euclidean_distance  nearest_neighbour
0     0  216   13    1            1.414214                  3
1     0  218   12    2            1.000000                  3
2     0  217   12    3            1.000000                  2
3     1  280  110    1           10.049876                  3
4     1  290  109    3           10.049876                  1
5     2  130    3    4           53.037722                  5
6     2  132   56    5           53.037722                  4
By the way, your sample output is wrong for time 0; my answer & Bacon's answer both show the correct result.
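For completeness, the per-time average the question ultimately wants can be taken from this result too; a small sketch (the res name is mine):
res = df.groupby('time').apply(closest).reset_index(drop=True)
res.groupby('time')['euclidean_distance'].mean()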

Related

Numpy / Pandas slicing based on intervals

Trying to figure out a way to slice non-contiguous and non-equal length rows of a pandas / numpy matrix so I can set the values to a common value. Has anyone come across an elegant solution for this?
import numpy as np
import pandas as pd
x = pd.DataFrame(np.arange(12).reshape(3,4))
#x is the matrix we want to index into
"""
x before:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
"""
y = pd.DataFrame([[0,3],[2,2],[1,2],[0,0]])
#y is a matrix where each row contains a start idx and an (inclusive) end idx for one column of x
"""
   0  1
0  0  3
1  2  2
2  1  2
3  0  0
"""
What I'm looking for is a way to effectively select different-length slices of x based on the rows of y, something like:
x[y] = 0
"""
x afterwards:
array([[ 0,  1,  2,  0],
       [ 0,  5,  0,  7],
       [ 0,  0,  0, 11]])
"""
Masking can still be useful, because even if a loop cannot be entirely avoided, the main dataframe x would not need to be involved in the loop, so this should speed things up:
mask = np.zeros_like(x, dtype=bool)
for i in range(len(y)):
    mask[y.iloc[i, 0]:(y.iloc[i, 1] + 1), i] = True
x[mask] = 0
x
   0  1  2   3
0  0  1  2   0
1  0  5  0   7
2  0  0  0  11
As a further improvement, consider defining y as a NumPy array if possible.
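For instance, a minimal sketch of the same loop with y converted to a NumPy array up front (the y_np name is illustrative, not from the original post):
y_np = y.to_numpy()                      # convert once, outside the loop
mask = np.zeros(x.shape, dtype=bool)
for i, (start, end) in enumerate(y_np):  # end indices are inclusive
    mask[start:end + 1, i] = True
x[mask] = 0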
I customized this answer to your problem:
y_t = y.values.transpose()
r = np.arange(x.shape[0])
# y holds inclusive end indices, so compare with '>= r'
# (if your end indices were exclusive, subtract 1 from y_t[1,:] or use '> r')
mask = ((y_t[0,:,None] <= r) & (y_t[1,:,None] >= r)).transpose()
res = x.where(~mask, 0)
res
#    0  1  2   3
# 0  0  1  2   0
# 1  0  5  0   7
# 2  0  0  0  11

Python: multiplying fixed array by element wise second array

Example code:
import numpy as np
a = np.arange(1,11)
b = np.arange(1,11)
b[:] = 0
b[3] = 10
b[4] = 10
print(a, b)
[ 1 2 3 4 5 6 7 8 9 10] [ 0 0 0 10 10 0 0 0 0 0]
I am attempting to multiply b with the a array element-wise such that my resulting array is the following:
[0 0 10 30 50 70 90 110 130 150]
Any help would be greatly appreciated.
It looks like you want the convolution of both arrays, shifted by one to line up with your expected output:
np.convolve(a, b)[1:len(a)+1]
# array([  0,   0,  10,  30,  50,  70,  90, 110, 130, 150])
Element-wise multiplication of b with a would give you [0, 0, 0, 40, 50, 0, 0, 0, 0, 0], and not what you have stated.
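For reference, a quick check of that claim:
a * b
# array([ 0,  0,  0, 40, 50,  0,  0,  0,  0,  0])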

Check if X,Y column pair in table B is within delta distance of any X, Y column pair in table A

I have a dataframe named origA:
X, Y
10, 20
11, 2
9, 35
8, 7
And another one named calcB:
Xc, Yc
1, 7
9, 22
I want to check, for every (Xc, Yc) pair in calcB, whether there is an (X, Y) pair in origA whose Euclidean distance to (Xc, Yc) is less than delta, and if so, put True in the respective row of a new column Detected in origA.
@Wen-Ben's solution might work for small datasets. However, you quickly run into performance problems when you try to compute the distances for many points. Hence, there are already plenty of smart algorithms that reduce the number of required distance calculations; one of them is BallTree (provided by scikit-learn):
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

# Prepare the data and the search radius:
origA = pd.DataFrame()
origA['X'] = [10, 11, 9, 8]
origA['Y'] = [20, 2, 35, 7]
calcB = pd.DataFrame()
calcB['Xc'] = [1, 9]
calcB['Yc'] = [7, 22]
delta = 5
# Stack the coordinates together:
pointsA = np.column_stack([origA.X, origA.Y])
pointsB = np.column_stack([calcB.Xc, calcB.Yc])
# Create the Ball Tree and search for close points:
tree = BallTree(pointsB)
detected = tree.query_radius(pointsA, r=delta, count_only=True)
# Add results as additional column:
origA['Detected'] = detected.astype(bool)
Output
    X   Y  Detected
0  10  20      True
1  11   2     False
2   9  35     False
3   8   7     False
You can use cdist from scipy:
from scipy.spatial.distance import cdist

delta = 5
ary = cdist(origA[['X', 'Y']], calcB[['Xc', 'Yc']], metric='euclidean')
ary
Out[189]:
array([[15.8113883 ,  2.23606798],
       [11.18033989, 20.09975124],
       [29.12043956, 13.        ],
       [ 7.        , 15.03329638]])
origA['detected'] = (ary < delta).any(1)
origA
Out[191]:
    X   Y  detected
0  10  20      True
1  11   2     False
2   9  35     False
3   8   7     False

Pandas aggregating average while excluding current row

How can I aggregate to get the average of b within group a while excluding the current row (the target result is in column c)?
a  b    c
1  1  0.5  # (avg of 0 & 1, excluding 1)
1  1  0.5  # (avg of 0 & 1, excluding 1)
1  0  1    # (avg of 1 & 1, excluding 0)
2  1  0.5  # (avg of 0 & 1, excluding 1)
2  0  1    # (avg of 1 & 1, excluding 0)
2  1  0.5  # (avg of 0 & 1, excluding 1)
3  1  0.5  # (avg of 0 & 1, excluding 1)
3  0  1    # (avg of 1 & 1, excluding 0)
3  1  0.5  # (avg of 0 & 1, excluding 1)
Data dump:
import pandas as pd
data = pd.DataFrame([[1, 1, 0.5], [1, 1, 0.5], [1, 0, 1], [2, 1, 0.5], [2, 0, 1],
[2, 1, 0.5], [3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]],
columns=['a', 'b', 'c'])
Suppose a group has values x_1, ..., x_n.
The average of the entire group would be
m = (x_1 + ... + x_n)/n
The sum of the group without x_i would be
(m*n - x_i)
The average of the group without x_i would be
(m*n - x_i)/(n-1)
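For example, in the first group b = (1, 1, 0): m = 2/3 and n = 3, so excluding the 0 gives (m*n - 0)/(n - 1) = 2/2 = 1, and excluding one of the 1s gives (2 - 1)/2 = 0.5, which matches column c.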
Therefore, you could compute the desired column of values with
import pandas as pd
df = pd.DataFrame([[1, 1, 0.5], [1, 1, 0.5], [1, 0, 1], [2, 1, 0.5], [2, 0, 1],
[2, 1, 0.5], [3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]],
columns=['a', 'b', 'c'])
grouped = df.groupby(['a'])
n = grouped['b'].transform('count')
mean = grouped['b'].transform('mean')
df['result'] = (mean*n - df['b'])/(n-1)
which yields
In [32]: df
Out[32]:
   a  b    c  result
0  1  1  0.5     0.5
1  1  1  0.5     0.5
2  1  0  1.0     1.0
3  2  1  0.5     0.5
4  2  0  1.0     1.0
5  2  1  0.5     0.5
6  3  1  0.5     0.5
7  3  0  1.0     1.0
8  3  1  0.5     0.5
In [33]: assert df['result'].equals(df['c'])
Per the comments below, in the OP's actual use case, the DataFrame's a column contains strings:
import numpy as np

def make_random_str_array(letters, strlen, size):
    return (np.random.choice(list(letters), size*strlen)
            .view('|S{}'.format(strlen)))
N = 3*10**6
df = pd.DataFrame({'a':make_random_str_array(letters='ABCD', strlen=10, size=N),
'b':np.random.randint(10, size=N)})
so that there are about a million unique values in df['a'] out of 3 million
total:
In [87]: uniq, key = np.unique(df['a'], return_inverse=True)
In [88]: len(uniq)
Out[88]: 988337
In [89]: len(df)
Out[89]: 3000000
In this case the calculation above requires (on my machine) about 11 seconds:
In [86]: %%timeit
   ....: grouped = df.groupby(['a'])
   ....: n = grouped['b'].transform('count')
   ....: mean = grouped['b'].transform('mean')
   ....: df['result'] = (mean*n - df['b'])/(n-1)
   ....:
1 loops, best of 3: 10.5 s per loop
Pandas converts all string-valued columns to object dtype. But we could convert the DataFrame column to a NumPy array with a fixed-width string dtype, and then group according to those values.
Here is a benchmark showing that if we convert the Series with object dtype to a NumPy array with fixed-width string dtype, the calculation requires less than 2 seconds:
In [97]: %%timeit
   ....: grouped = df.groupby(df['a'].values.astype('|S10'))
   ....: n = grouped['b'].transform('count')
   ....: mean = grouped['b'].transform('mean')
   ....: df['result'] = (mean*n - df['b'])/(n-1)
   ....:
1 loops, best of 3: 1.39 s per loop
Beware that you need to know the maximum length of the strings in df['a'] to choose the appropriate fixed-width dtype. In the example above, all the strings have length 10, so |S10 works. If you use |Sn for some integer n that is smaller than the longest string, those strings will get silently truncated with no error or warning. This could potentially lead to the grouping of values which should not be grouped together. Thus, the onus is on you to choose the correct fixed-width dtype.
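A quick illustration of that silent truncation (the strings here are made up for the example):
np.array(['ABCDEFGHIJ', 'ABCDXXXXXX']).astype('|S4')
# array([b'ABCD', b'ABCD'], dtype='|S4')   # two different strings collapse to the same group key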
You could use
dtype = '|S{}'.format(df['a'].str.len().max())
grouped = df.groupby(df['a'].values.astype(dtype))
to ensure the conversion uses the correct dtype.
You can calculate the statistics manually by iterating group by group:
# Set up input
import pandas as pd
df = pd.DataFrame([
[1, 1, 0.5], [1, 1, 0.5], [1, 0, 1],
[2, 1, 0.5], [2, 0, 1], [2, 1, 0.5],
[3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]
], columns=['a', 'b', 'c'])
df
   a  b    c
0  1  1  0.5
1  1  1  0.5
2  1  0  1.0
3  2  1  0.5
4  2  0  1.0
5  2  1  0.5
6  3  1  0.5
7  3  0  1.0
8  3  1  0.5
# Perform grouping, excluding the current row
results = []
grouped = df.groupby(['a'])
for key, group in grouped:
    for idx, row in group.iterrows():
        # The group excluding current row
        group_other = group.drop(idx)
        avg = group_other['b'].mean()
        results.append(row.tolist() + [avg])

# Compare our results with what is expected
results_df = pd.DataFrame(
    results, columns=['a', 'b', 'c', 'c_new']
)
results_df
   a  b    c  c_new
0  1  1  0.5    0.5
1  1  1  0.5    0.5
2  1  0  1.0    1.0
3  2  1  0.5    0.5
4  2  0  1.0    1.0
5  2  1  0.5    0.5
6  3  1  0.5    0.5
7  3  0  1.0    1.0
8  3  1  0.5    0.5
This way you can use any statistic you want.
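For example, here is a small sketch of the same idea wrapped in a helper so that any reducer can be plugged in (the function and column names are illustrative, not from the answer above):
def leave_one_out(df, group_col, value_col, stat):
    out = {}
    for _, group in df.groupby(group_col):
        for idx in group.index:
            # apply the chosen statistic to the group minus the current row
            out[idx] = stat(group.drop(idx)[value_col])
    return pd.Series(out)

# e.g. a leave-one-out median instead of a mean
df['b_median_excl'] = leave_one_out(df, 'a', 'b', lambda s: s.median())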

Indexing on DataFrame with MultiIndex

I have a large pandas DataFrame that I need to fill.
Here is my code:
trains = np.arange(1, 101)
#The above are example values, it's actually 900 integers between 1 and 20000
tresholds = np.arange(10, 70, 10)
tuples = []
for i in trains:
    for j in tresholds:
        tuples.append((i, j))
index = pd.MultiIndex.from_tuples(tuples, names=['trains', 'tresholds'])
df = pd.DataFrame(np.zeros((len(index), len(trains))), index=index, columns=trains, dtype=float)

metrics = dict()
for i in trains:
    m = binary_metric_train(True, i)
    #Above function returns a binary array of length 35
    #Example: [1, 0, 0, 1, ...]
    metrics[i] = m

for i in trains:
    for j in tresholds:
        trA = binary_metric_train(True, i, tresh=j)
        for k in trains:
            if k != i:
                trB = metrics[k]
                corr = abs(pearsonr(trA, trB)[0])
                df[k][i][j] = corr
            else:
                df[k][i][j] = np.nan
My problem is that when this piece of code is finally done computing, my DataFrame df still contains nothing but zeros. Even the NaNs are not inserted. I think that my indexing is correct. Also, I have tested my binary_metric_train function separately; it does return an array of length 35.
Can anyone spot what I am missing here?
EDIT: For clarity, this DataFrame looks like this:
                  1  2  3  4  5  ...
trains tresholds
1      10
       20
       30
       40
       50
       60
2      10
       20
       30
       40
       50
       60
...
As @EdChum noted, you should take a look at pandas indexing. Here's some test data for the purpose of illustration, which should clear things up.
import numpy as np
import pandas as pd

trains = [ 1,  1,  1,  2,  2,  2]
thresholds = [10, 20, 30, 10, 20, 30]
data = [ 1,  0,  1,  0,  1,  0]
df = pd.DataFrame({
    'trains': trains,
    'thresholds': thresholds,
    'C1': data,
    'C2': data
}).set_index(['trains', 'thresholds'])
print(df)

df.loc[(2, 30), 'C1'] = 3  # using the column name
# (the older df.ix[(2, 30), 0] = 3 / df.ix[(2, 30), 'C1'] = 3 forms also worked,
#  but .ix has been removed from recent pandas versions)
# but note that...
df.loc[(2, 30), 1] = 3     # with .loc an integer is a label, so this creates a new column
print(df)
Which outputs the DataFrame before and after modification:
                   C1  C2
trains thresholds
1      10           1   1
       20           0   0
       30           1   1
2      10           0   0
       20           1   1
       30           0   0

                   C1  C2    1
trains thresholds
1      10           1   1  NaN
       20           0   0  NaN
       30           1   1  NaN
2      10           0   0  NaN
       20           1   1  NaN
       30           3   0  3.0
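Applied to the loop in the question, the fix is to replace the chained df[k][i][j] = corr assignment (which writes into a temporary object, so the original df never changes) with a single .loc call on the (trains, tresholds) MultiIndex. A sketch, assuming the question's binary_metric_train, pearsonr and metrics are available:
for i in trains:
    for j in tresholds:
        trA = binary_metric_train(True, i, tresh=j)
        for k in trains:
            if k != i:
                df.loc[(i, j), k] = abs(pearsonr(trA, metrics[k])[0])
            else:
                df.loc[(i, j), k] = np.nan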
