I have a list of places and I need to find the distance between each pair of them. Can anyone suggest a faster method? There are about 10k unique places; the method I'm using creates a 10k x 10k matrix and I'm running out of memory. I have 15 GB of RAM.
test_df
Latitude Longitude site
0 32.3 -94.1 1
1 35.2 -93.1 2
2 33.1 -83.4 3
3 33.2 -94.5 4
import pandas as pd
from haversine import haversine

test_df = test_df[['site', 'Longitude', 'Latitude']]
test_df['coord'] = list(zip(test_df['Longitude'], test_df['Latitude']))

# one distance column per site
for _, row in test_df.iterrows():
    test_df[row.coord] = test_df['coord'].apply(lambda x: haversine(row.coord, x, unit='mi')).round(2)

df = test_df.rename(columns=dict(zip(test_df['coord'], test_df['site'])))
df.drop(['coord', 'Longitude', 'Latitude'], axis=1, inplace=True)
new_df = pd.melt(df, id_vars='site', value_vars=df.columns[1:])
new_df.rename(columns={'variable': 'Place', 'value': 'dist_in_mi'}, inplace=True)
new_df
site Place dist_in_mi
0 1 1 0.00
1 2 1 70.21
2 3 1 739.28
3 4 1 28.03
4 1 2 70.21
5 2 2 0.00
6 3 2 670.11
7 4 2 97.15
8 1 3 739.28
9 2 3 670.11
10 3 3 0.00
11 4 3 766.94
12 1 4 28.03
13 2 4 97.15
14 3 4 766.94
15 4 4 0.00
If you want to resolve your memory problem, you need to use datatypes that use less memory.
In this case, since the maximum distance between two points on Earth is less than about 20,005 km, you can store each value as a uint16 (if a 1 km resolution is enough for you).
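For a sense of scale, here is a quick size check (just arithmetic on numpy dtype sizes; the 10,000 x 10,000 shape is taken from the question):

import numpy as np

n = 10_000
for dtype in (np.float64, np.float32, np.uint16):
    size_gb = np.dtype(dtype).itemsize * n * n / 1e9
    print(f"{np.dtype(dtype).name}: {size_gb:.1f} GB")
# float64: 0.8 GB
# float32: 0.4 GB
# uint16: 0.2 GB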
Since I didn't have any data to work with, I generated some with the following code:
import random
import numpy as np
from haversine import haversine
def getNFacilities(n):
    """ returns n random pairs of coordinates in the range [-90, +90]"""
    for i in range(n):
        yield random.random() * 180 - 90, random.random() * 180 - 90
facilities = list(getNFacilities(10000))
Then I addressed the memory problem in two different ways:
1- By storing the distance data in uint16 numbers
def calculateDistance(start, end):
    mirror = start is end  # if start and end are the same list the matrix is symmetric,
                           # so each pair is calculated only once instead of twice
    out = np.zeros((len(start), len(end)), dtype=np.uint16)  # might be better to use empty?
    for i, coords1 in enumerate(start[mirror:], mirror):
        # when mirrored, only fill the lower triangle (j < i); otherwise fill the whole row
        for j, coords2 in enumerate(end[:mirror and i or None]):
            out[i, j] = int(haversine(coords1, coords2))
    return out
After calculating the distances, the memory used by the array was about 200 MB:
In [133]: l = calculateDistance(facilities, facilities)
In [134]: sys.getsizeof(l)
Out[134]: 200000112
2- Alternatively, you can just use a generator:
def calculateDistance(start, end):
    mirror = start is end  # if start and end are the same list the matrix is symmetric,
                           # so each pair is calculated only once
    for i, coords1 in enumerate(start[mirror:], mirror):
        for j, coords2 in enumerate(end[:mirror and i or None]):
            yield [i, j, haversine(coords1, coords2)]
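Since the generator yields one [i, j, distance] row at a time, you can consume it without ever materializing the full matrix. For example, one way to stream it straight to disk (just a sketch; the file name and column names are illustrative):

import csv

with open('distances.csv', 'w', newline='') as fh:
    writer = csv.writer(fh)
    writer.writerow(['site_i', 'site_j', 'dist_km'])
    # only one row is ever held in memory at a time
    for i, j, dist in calculateDistance(facilities, facilities):
        writer.writerow([i, j, round(dist, 2)])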
To set up the question: I have a dataframe containing spots and their x, y positions. I want to iterate over each spot and check all other spots to see whether they are within a given radius, then count the number of spots within that radius in a new column of the dataframe. I would like to iterate over the index, since I have a decent understanding of how that works. I know I am missing something simple, but I have not been able to find a solution that works for me yet. Thank you in advance!
import pandas as pd

radius = 3
df = pd.DataFrame({'spot_id': [1, 2, 3, 4, 5], 'x_pos': [5, 4, 10, 3, 8], 'y_pos': [4, 10, 8, 6, 3]})
spot_id x_pos y_pos
0 1 5 4
1 2 4 10
2 3 10 8
3 4 3 6
4 5 8 3
I then want to get something that looks like this
spot_id x_pos y_pos spots_within_radius
0 1 5 4 1
1 2 4 10 0
2 3 10 8 0
3 4 3 6 1
4 5 8 3 0
To do it in a vectorized way, you can use scipy.spatial.distance_matrix to compute the distance matrix D between all N position vectors ('x_pos', 'y_pos'). D is an N x N matrix (a 2D numpy.ndarray) whose entry (i, j) is the Euclidean distance between the ith and jth positions.
Then check which positions are within a distance radius of each other (D <= radius), which gives you a boolean matrix. Finally, count the True values with sum(axis=0) (D is symmetric, so summing along either axis gives the same result). You have to subtract 1 at the end, because each position is always within radius of itself (the diagonal entries).
import pandas as pd
from scipy.spatial import distance_matrix
df = pd.DataFrame({'spot_id':[1,2,3,4,5],'x_pos':[5,4,10,3,8],'y_pos':[4,10,8,6,3]})
radius = 3
pos = df[['x_pos','y_pos']]
df['spots_within_radius'] = (distance_matrix(pos, pos) <= radius).sum(axis=0) - 1
Output
>>> df
spot_id x_pos y_pos spots_within_radius
0 1 5 4 1
1 2 4 10 0
2 3 10 8 0
3 4 3 6 1
4 5 8 3 0
If you don't want to use scipy.spatial.distance_matrix, you can compute D yourself using numpy's broadcasting.
import numpy as np
pos = df[['x_pos','y_pos']].to_numpy()
D = np.sum((pos - pos[:, None])**2, axis=-1) ** 0.5
df['spots_within_radius'] = (D <= radius).sum(axis=0) - 1
I would suggest using a KD Tree to answer this kind of question. It's a data structure designed to efficiently search for nearby points, and it's faster than computing a distance matrix. You can use scikit-learn to implement this.
The code
Here's how:
import sklearn.neighbors
import pandas as pd

df = pd.DataFrame({'spot_id': [1, 2, 3, 4, 5], 'x_pos': [5, 4, 10, 3, 8], 'y_pos': [4, 10, 8, 6, 3]})

def add_points_in_range_column_kd(df, radius):
    # Get positions as a numpy array
    positions = df[['x_pos', 'y_pos']].to_numpy(dtype='float32')
    # Build a KD Tree on those positions
    tree = sklearn.neighbors.KDTree(positions)
    # For each position, count how many points are in range
    # (count_only=True returns counts rather than the points themselves);
    # subtract 1 so a point does not count itself
    return tree.query_radius(positions, r=radius, count_only=True) - 1

df['spots_within_radius'] = add_points_in_range_column_kd(df, 3)
The efficiency argument
Since a distance matrix needs the distance between every pair of points, it has a time complexity of O(N^2). In contrast, building the KD Tree takes O(N log N), and each radius query costs roughly the depth of the tree plus the number of points it returns, so answering all N queries is about O(N log N) on average. This method will therefore be more efficient for a large number of points.
Benchmarking
Theory is nice, but is it actually faster in practice?
I ran both the KD Tree method and the distance matrix method on dataframes with sizes ranging from N=10 to N=3000, using the timeit module and running both methods in random order for 100 iterations at each size, then plotted the runtime of each method against N on log-log axes.
For small numbers of points, the distance matrix method is faster; beyond about 300 points, the KD Tree is faster.
Full testing details can be found here.
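For reference, a comparison along those lines can be reproduced with a small harness like the one below (a sketch, not the exact code behind the graph; the sizes, seed and repeat counts are illustrative):

import timeit
import numpy as np
import pandas as pd
import sklearn.neighbors
from scipy.spatial import distance_matrix

def make_df(n, seed=0):
    rng = np.random.default_rng(seed)
    return pd.DataFrame({'x_pos': rng.uniform(0, 100, n), 'y_pos': rng.uniform(0, 100, n)})

def count_matrix(df, radius=3):
    # distance matrix approach from the first answer
    pos = df[['x_pos', 'y_pos']]
    return (distance_matrix(pos, pos) <= radius).sum(axis=0) - 1

def count_kdtree(df, radius=3):
    # KD Tree approach from this answer
    pos = df[['x_pos', 'y_pos']].to_numpy(dtype='float32')
    tree = sklearn.neighbors.KDTree(pos)
    return tree.query_radius(pos, r=radius, count_only=True) - 1

for n in (10, 100, 300, 1000, 3000):
    df = make_df(n)
    t_mat = timeit.timeit(lambda: count_matrix(df), number=20)
    t_kd = timeit.timeit(lambda: count_kdtree(df), number=20)
    print(f"n={n:5d}  distance matrix: {t_mat:.3f}s  KD Tree: {t_kd:.3f}s")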
I have a problem calculating variance with "hidden" NULL (zero) values. Usually that wouldn't be a problem, since a NULL is not a value, but in my case it is essential to include those NULLs as zeros in the variance calculation. So I have a DataFrame that looks like this:
TableA:
A X Y
1 1 30
1 2 20
2 1 15
2 2 20
2 3 20
3 1 30
3 2 35
Then I need the variance for each distinct X value, and I do this:
TableA.groupby(['X']).agg({'Y':'var'})
But the answer is not what I need, since the calculation should also include a zero Y for X=3 when A=1 and A=3.
What my dataset should look like to get the needed variance results:
A X Y
1 1 30
1 2 20
1 3 0
2 1 15
2 2 20
2 3 20
3 1 30
3 2 35
3 3 0
So I need the variance to take into account that every A should have X = 1, 2 and 3, and when there is no Y value for one of those X values it should be treated as 0. Could you help me with this? How should I change my TableA dataframe to do this, or is there another way?
Desired output for TableA should be like this:
X Y
1 75.000000
2 75.000000
3 133.333333
Compute the variance directly, but treat the missing rows as zeros: compute the mean over the number of different possibilities for A, add the squared deviations contributed by the missing zeros, and divide by that number minus one.
import numpy as np

# three in your example; adjust as needed
a_choices = len(TableA['A'].unique())

def variance_with_missing(vals):
    # mean over all possible values of A, counting the missing ones as 0
    mean_with_missing = np.sum(vals) / a_choices
    # squared deviations of the values that are present
    ss_present = np.sum((vals - mean_with_missing)**2)
    # each missing value contributes (0 - mean)**2
    ss_missing = (a_choices - len(vals)) * mean_with_missing**2
    return (ss_present + ss_missing) / (a_choices - 1)

TableA.groupby(['X']).agg({'Y': variance_with_missing})
The approach of the solution below is to append the missing (A, X) combinations with Y = 0. It's a little messy, but I hope it helps.
import pandas as pd

TableA = pd.DataFrame({'A': [1, 1, 2, 2, 2, 3, 3],
                       'X': [1, 2, 1, 2, 3, 1, 2],
                       'Y': [30, 20, 15, 20, 20, 30, 35]})
TableA['A'] = TableA['A'].astype(int)

#### Create rows for the missing (A, X) combinations and fill Y with 0 ####
for i in range(1, TableA.X.max() + 1):
    for j in TableA.A.unique():
        if TableA[(TableA.X == i) & (TableA.A == j)].empty:
            TableA = pd.concat([TableA, pd.DataFrame({'A': [j], 'X': [i], 'Y': [0]})],
                               ignore_index=True)

TableA.groupby('X').agg({'Y': 'var'})
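A more compact way to build the same filled-in table (a sketch, not part of the original answer, using pandas MultiIndex.from_product plus reindex) would be:

import pandas as pd

TableA = pd.DataFrame({'A': [1, 1, 2, 2, 2, 3, 3],
                       'X': [1, 2, 1, 2, 3, 1, 2],
                       'Y': [30, 20, 15, 20, 20, 30, 35]})

# build every (A, X) combination and fill the missing ones with Y = 0
full_index = pd.MultiIndex.from_product([TableA['A'].unique(), TableA['X'].unique()],
                                        names=['A', 'X'])
filled = TableA.set_index(['A', 'X']).reindex(full_index, fill_value=0).reset_index()

filled.groupby('X').agg({'Y': 'var'})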
I have the following dataframe (the real one is actually several hundred MB):
X Y Size
0 10 20 5
1 11 21 2
2 9 35 1
3 8 7 7
4 9 19 2
I want to discard any (X, Y) point that has a Euclidean distance of less than delta=3 from any other (X, Y) point in the dataframe. In those cases I want to keep only the row with the larger Size.
In this example the intended result would be:
X Y Size
0 10 20 5
2 9 35 1
3 8 7 7
As the question is stated, it is not clear how the desired algorithm should deal with chaining of distances (A is close to B and B is close to C, but A is not close to C).
If chaining is allowed, one solution is to cluster the dataset using a density-based clustering algorithm such as DBSCAN.
You just need to set the neighbourhood radius eps to delta and the min_samples parameter to 1 so that isolated points form their own clusters. Then you can find, in each group, the point with the maximum Size.
from sklearn.cluster import DBSCAN

# df is the dataframe from the question, with columns X, Y and Size
X = df[['X', 'Y']]
db = DBSCAN(eps=3, min_samples=1).fit(X)
df['grp'] = db.labels_
# in each cluster, keep the row with the largest Size
df_new = df.loc[df.groupby('grp')['Size'].idxmax()]
print(df_new)
>>>
X Y Size grp
0 10 20 5 0
2 9 35 1 1
3 8 7 7 2
You can use the script below, and also try improving it.
import pandas as pd
from itertools import chain
from sklearn.metrics.pairwise import euclidean_distances

# compute all pairwise Euclidean distances with sklearn,
# then collect the index pairs from df whose distance is less than 3
Z = df[['X', 'Y']]
euc = euclidean_distances(Z, Z)
idx = [(i, j) for i in range(len(euc) - 1) for j in range(i + 1, len(euc)) if euc[i, j] < 3]

# collect every df index that appears in a close pair and keep only the row with
# the maximum Size among them, then combine it with the rows that were never close
df_idx = list(set(chain(*idx)))
df2 = df.loc[df_idx]
idx_max = df2[df2['Size'] == df2['Size'].max()].index.tolist()
df_new = pd.concat([df.loc[~df.index.isin(df_idx)], df2.loc[idx_max]])
df_new
Result:
X Y Size
2 9 35 1
3 8 7 7
0 10 20 5
I have a very large pandas dataset, and at some point I need to use the following function
def proc_trader(data):
    data['_seq'] = np.nan
    # mark every end of a round trip (cumq == 0) with its own index
    data.loc[data.cumq == 0, 'tag'] = np.arange(1, (data.cumq == 0).sum() + 1)
    # backfill the round-trip index until the previous round trip,
    # then fill the rest with 0s (round trip incomplete for the most recent trades)
    data['_seq'] = data['tag'].bfill().fillna(0)
    return data['_seq']
    # btw, why on earth does this return a dataframe instead of the series `data['_seq']`??
and I use apply
reshaped['_spell']=reshaped.groupby(['trader','stock'])[['cumq']].apply(proc_trader)
Obviously, I cannot share the data here, but do you see a bottleneck in my code? Could it be the arange thing? There are many name-productid combinations in the data.
Minimal Working Example:
import pandas as pd
import numpy as np
reshaped= pd.DataFrame({'trader' : ['a','a','a','a','a','a','a'],'stock' : ['a','a','a','a','a','a','b'], 'day' :[0,1,2,4,5,10,1],'delta':[10,-10,15,-10,-5,5,0] ,'out': [1,1,2,2,2,0,1]})
reshaped.sort_values(by=['trader', 'stock','day'], inplace=True)
reshaped['cumq']=reshaped.groupby(['trader', 'stock']).delta.transform('cumsum')
reshaped['_spell']=reshaped.groupby(['trader','stock'])[['cumq']].apply(proc_trader).reset_index()['_seq']
Nothing really fancy here, just tweaked in a couple of places. There's really no need to put it in a function, so I didn't. On this tiny sample data, it's about twice as fast as the original.
reshaped.sort_values(by=['trader', 'stock', 'day'], inplace=True)
reshaped['cumq'] = reshaped.groupby(['trader', 'stock']).delta.cumsum()
# mark the end of each round trip (cumq == 0) ...
reshaped.loc[reshaped.cumq == 0, '_spell'] = 1
# ... number the round trips within each trader/stock group ...
reshaped['_spell'] = reshaped.groupby(['trader', 'stock'])['_spell'].cumsum()
# ... and backfill so every row of a round trip carries its number; incomplete trips get 0
reshaped['_spell'] = reshaped.groupby(['trader', 'stock'])['_spell'].bfill().fillna(0)
Result:
day delta out stock trader cumq _spell
0 0 10 1 a a 10 1.0
1 1 -10 1 a a 0 1.0
2 2 15 2 a a 15 2.0
3 4 -10 2 a a 5 2.0
4 5 -5 2 a a 0 2.0
5 10 5 0 a a 5 0.0
6 1 0 1 b a 0 1.0
I have a pandas Series of value_counts for a data set. I would like to plot the data with a color band (I'm using bokeh, but calculating the band is the important part).
I hesitate to use the term standard deviation, since all the references I've seen calculate it around the mean, and I specifically want to use the mode as the center.
So, basically, I'm looking for a way in pandas to start at the mode and return a new series of value counts that covers 68.2% of the sum of the value_counts. If I had this series:
val count
1 0
2 0
3 3
4 1
5 2
6 5 <-- mode
7 4
8 3
9 2
10 1
total = sum(count) # example value 21
band1_count = 21 * 0.682 # example value ~ 14.3
This is the order in which values would be added by an algorithm that walks the value counts on each side of the mode and includes the higher of the two at each step, until the running sum of the counts exceeds 14.3.
band1_values = [6, 7, 8, 5, 9]
Here are the steps:
val count step
1 0
2 0
3 3
4 1
5 2 <-- 4) add to list -- eq (9,2), closer to (6,5)
6 5 <-- 1) add to list -- mode
7 4 <-- 2) add to list -- gt (5,2)
8 3 <-- 3) add to list -- gt (5,2)
9 2 <-- 5) add to list -- gt (4,1), stop since sum of counts > 14.3
10 1
Is there a native way to do this calculation in pandas or numpy? If there is a formal name for this study, I would appreciate knowing what it's called.
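For reference, here is a literal implementation of the walk described above (a sketch of my own; the helper name mode_centered_band is just illustrative, and ties are broken in favor of the side closer to the mode, as in the steps). It reproduces band1_values = [6, 7, 8, 5, 9] on the example series, and I'm asking whether there is a more native or vectorized way to do the same thing:

import pandas as pd

counts = pd.Series([0, 0, 3, 1, 2, 5, 4, 3, 2, 1], index=range(1, 11), name='count')

def mode_centered_band(counts, coverage=0.682):
    """Walk outward from the mode, adding whichever neighbor has the larger
    count (ties go to the side closer to the mode), until the included counts
    exceed `coverage` of the total. Returns the values in the order added."""
    target = counts.sum() * coverage
    pos = counts.index.get_loc(counts.idxmax())   # position of the mode
    left, right = pos - 1, pos + 1
    band = [counts.index[pos]]
    running = counts.iloc[pos]
    while running <= target:
        left_count = counts.iloc[left] if left >= 0 else -1
        right_count = counts.iloc[right] if right < len(counts) else -1
        if left_count < 0 and right_count < 0:
            break                                  # nothing left to add
        take_right = (right_count > left_count or
                      (right_count == left_count and (right - pos) < (pos - left)))
        if take_right:
            band.append(counts.index[right])
            running += right_count
            right += 1
        else:
            band.append(counts.index[left])
            running += left_count
            left -= 1
    return band

print(mode_centered_band(counts))   # [6, 7, 8, 5, 9]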