I have the following dataframe (the real one is several hundred MB):
X Y Size
0 10 20 5
1 11 21 2
2 9 35 1
3 8 7 7
4 9 19 2
I want to discard any X, Y point that is within a Euclidean distance delta=3 of any other X, Y point in the dataframe. In those cases I want to keep only the row with the larger Size.
In this example the intended result would be:
X Y Size
0 10 20 5
2 9 35 1
3 8 7 7
As the question is stated, it is not clear how the desired algorithm should handle chaining of distances (e.g. A is within delta of B and B is within delta of C, but A is not within delta of C).
If chaining is allowed, one solution is to cluster the dataset using a density-based clustering algorithm such as DBSCAN.
You just need to set the neighborhood radius eps to delta and the min_samples parameter to 1 so that isolated points form their own clusters. Then, you can find in each group which point has the maximum Size.
from sklearn.cluster import DBSCAN
X = df[['X', 'Y']]
db = DBSCAN(eps=3, min_samples=1).fit(X)
df['grp'] = db.labels_
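# for each cluster label, keep the row with the largest Size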
df_new = df.loc[df.groupby('grp').idxmax()['Size']]
print(df_new)
>>>
X Y Size grp
0 10 20 5 0
2 9 35 1 1
3 8 7 7 2
You can use the script below, and also try improving it.
#get all euclidean distances using sklearn;
#it will create an array of euc distances;
#then get index from df whose euclidean distance is less than 3
from sklearn.metrics.pairwise import euclidean_distances
Z = df[['X', 'Y']]
euc = euclidean_distances(Z, Z)
idx = [(i, j) for i in range(len(euc)-1) for j in range(i+1, len(euc)) if euc[i, j] < 3]
# collect all indices of df involved in a pair with euc dist < 3 and find their max Size
# then take all rows of df NOT involved in any close pair and add back the row with max Size
# build a new dataframe called df_new by combining the untouched rows with the max-Size row
import pandas as pd
from itertools import chain
df_idx = list(set(chain(*idx)))
df2 = df.iloc[df_idx]  # assumes the default RangeIndex, so positions and labels coincide
idx_max = df2[df2['Size'] == df2['Size'].max()].index.tolist()
df_new = pd.concat([df.loc[~df.index.isin(df_idx)], df2.loc[idx_max]])
df_new
Result:
X Y Size
2 9 35 1
3 8 7 7
0 10 20 5
I am trying to create a variable (Days) that shows how many days a bulb was functional, derived from the score variables (Score_Day_0 onward).
The dataset I am using is like the one below, where the score on a given day ranges from 1 (working very well) to 10 (stopped working).
What I want is to understand how to create the variable Days, which should hold the number of days the bulb was working, i.e. for sample 2 the score at day 10 is 8 and at day 20 is 10 (stopped working), so the number of days the bulb was working is 20.
Any suggestion?
Thank you so much for your help, hope you have a terrific day!!
sample    Score_Day_0  Score_Day_10  Score_Day_20  Score_Day_30  Score_Day_40  Days
sample 1            1             3             5             8            10    40
sample 2            3             8            10            10            10    20
I've tried to solve it myself with a conditional loop, but the number of observations in Days ends up much higher than the number of observations in the original df.
Here is the code I used:
cols = df[['Score_Day_0', 'Score_Day_10', 'Score_Day_20', 'Score_Day_30', 'Score_Day_40']]
Days = []
for j in cols['Score_Day_0']:
    if j == 10:
        Days.append(0)
for k in cols['Score_Day_10']:
    if k == 10:
        Days.append('10')
for l in cols['Score_Day_20']:
    if l == 10:
        Days.append('20')
for n in cols['Score_Day_30']:
    if n == 10:
        Days.append('30')
for m in cols['Score_Day_40']:
    if m == 10:
        Days.append('40')
You're looking for the first column label (left to right) at which the value is maximal in each row.
You can apply a given function to each row using pandas.DataFrame.apply with axis=1:
df.apply(function, axis=1)
The passed function receives the row as a Series object. To find the first occurrence of the row's maximum, select the entries equal to that maximum and take the first value of the resulting index, which is exactly what we are looking for: the label of the column where the row first reaches its maximal value.
lambda x: x[x == x.max()].index[0]
Example:
df = pd.DataFrame(dict(d0=[1,1,1],d10=[1,5,10],d20=[5,10,10],d30=[8,10,10]))
# d0 d10 d20 d30
# 0 1 1 5 8
# 1 1 5 10 10
# 2 1 10 10 10
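# take, for each row, the label of the first column where the row's maximum occurs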
df['days'] = df.apply(lambda x: x[x == x.max()].index[0], axis=1)
df
# d0 d10 d20 d30 days
# 0 1 1 5 8 d30
# 1 1 5 10 10 d20
# 2 1 10 10 10 d10
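If a numeric value is needed instead of the label (like the Days column in the question), the digits can be stripped out of the winning label. A small sketch continuing the example above; the same idea works for column names like 'Score_Day_30':
# turn a label like 'd30' into the number 30 by extracting the trailing digits
df['days'] = df['days'].str.extract(r'(\d+)$', expand=False).astype(int)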
To set up the question: I have a dataframe containing spots and their x, y positions. I want to iterate over each spot and check all other spots to see if they are within a radius. I then want to count the number of spots within the radius in a new column of the dataframe. I would like to iterate over the index as I have a decent understanding of how that works. I know that I am missing something simple, but I have not been able to find a solution that works for me yet. Thank you in advance!
radius = 3
df = pd.DataFrame({'spot_id':[1,2,3,4,5],'x_pos':[5,4,10,3,8],'y_pos':[4,10,8,6,3]})
spot_id x_pos y_pos
0 1 5 4
1 2 4 10
2 3 10 8
3 4 3 6
4 5 8 3
I then want to get something that looks like this
spot_id x_pos y_pos spots_within_radius
0 1 5 4 1
1 2 4 10 0
2 3 10 8 0
3 4 3 6 1
4 5 8 3 0
To do it in a vectorized way, you can use scipy.spatial.distance_matrix to compute the distance matrix D between all the N position vectors ('x_pos', 'y_pos'). D is an N x N matrix (a 2D numpy.ndarray) whose entry (i, j) is the Euclidean distance between the ith and jth positions.
Then, check which positions are within a distance radius of each other (D <= radius), which gives you a boolean matrix. Finally, count the True values for each point with sum(axis=0). You have to subtract 1 at the end, since each point is at distance 0 from itself and the diagonal entries would otherwise be counted.
import pandas as pd
from scipy.spatial import distance_matrix
df = pd.DataFrame({'spot_id':[1,2,3,4,5],'x_pos':[5,4,10,3,8],'y_pos':[4,10,8,6,3]})
radius = 3
pos = df[['x_pos','y_pos']]
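# boolean N x N matrix of pairs within the radius; each column sum counts a point's
# neighbours, and the -1 removes the point itself (distance 0 on the diagonal)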
df['spots_within_radius'] = (distance_matrix(pos, pos) <= radius).sum(axis=0) - 1
Output
>>> df
spot_id x_pos y_pos spots_within_radius
0 1 5 4 1
1 2 4 10 0
2 3 10 8 0
3 4 3 6 1
4 5 8 3 0
If you don't want to use scipy.spatial.distance_matrix, you can compute D yourself using numpy's broadcasting.
import numpy as np
pos = df[['x_pos','y_pos']].to_numpy()
D = np.sum((pos - pos[:, None])**2, axis=-1) ** 0.5
df['spots_within_radius'] = (D <= radius).sum(axis=0) - 1
I would suggest using a KD Tree to answer this kind of question. It's a data structure designed to efficiently search for nearby points, and it's faster than computing a distance matrix. You can use scikit-learn to implement this.
The code
Here's how:
import sklearn.neighbors
import pandas as pd

df = pd.DataFrame({'spot_id':[1,2,3,4,5],'x_pos':[5,4,10,3,8],'y_pos':[4,10,8,6,3]})

def add_points_in_range_column_kd(df, radius):
    # Get positions as numpy array
    positions = df[['x_pos', 'y_pos']].to_numpy(dtype='float32')
    # Build KD Tree on those positions
    tree = sklearn.neighbors.KDTree(positions)
    # For each position, check how many points are in range.
    # Return a count, and not the actual points.
    return tree.query_radius(positions, r=radius, count_only=True) - 1

df['spots_within_radius'] = add_points_in_range_column_kd(df, 3)
The efficiency argument
Since a distance matrix needs to compute the distance between every pair of points, it has a time complexity of O(N^2). In contrast, querying the KD Tree takes time roughly proportional to the depth of the tree (log N on average) for each of the N points, i.e. O(N log N) overall. So this method will be more efficient for a large number of points.
Benchmarking
Theory is nice, but is it actually faster in practice?
I ran both a KD Tree method and a distance matrix method on dataframes of sizes ranging from N=10 to N=3000. I used the timeit module, running both methods in random order for 100 iterations at each size, and plotted the time taken by each method (log scale on both axes).
For small numbers of points, the distance matrix method is faster. Once you get past about 300 points, the KD Tree is faster.
Full testing details can be found here.
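For reference, a rough sketch of this kind of timing comparison (my own assumed setup, not the original benchmark script; it uses 10 iterations per size rather than 100 to keep the run short):
import timeit
import numpy as np
import pandas as pd
import sklearn.neighbors
from scipy.spatial import distance_matrix

def make_df(n, seed=0):
    # random spot positions in a 100 x 100 box
    rng = np.random.default_rng(seed)
    return pd.DataFrame({'x_pos': rng.uniform(0, 100, n), 'y_pos': rng.uniform(0, 100, n)})

def count_with_kdtree(df, radius):
    positions = df[['x_pos', 'y_pos']].to_numpy(dtype='float32')
    tree = sklearn.neighbors.KDTree(positions)
    return tree.query_radius(positions, r=radius, count_only=True) - 1

def count_with_distance_matrix(df, radius):
    pos = df[['x_pos', 'y_pos']]
    return (distance_matrix(pos, pos) <= radius).sum(axis=0) - 1

for n in (10, 100, 1000, 3000):
    df_n = make_df(n)
    t_kd = timeit.timeit(lambda: count_with_kdtree(df_n, 3), number=10)
    t_dm = timeit.timeit(lambda: count_with_distance_matrix(df_n, 3), number=10)
    print(f"N={n}: KD Tree {t_kd:.3f}s, distance matrix {t_dm:.3f}s")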
I have a list of places and I need to find the distance between each pair of them. Can anyone suggest a faster method? There are about 10k unique places; the method I'm using creates a 10k x 10k matrix and I'm running out of memory. I have 15 GB of RAM.
test_df
Latitude Longitude site
0 32.3 -94.1 1
1 35.2 -93.1 2
2 33.1 -83.4 3
3 33.2 -94.5 4
test_df = test_df[['site', 'Longitude', 'Latitude']]
test_df['coord'] = list(zip(test_df['Longitude'], test_df['Latitude']))
from haversine import haversine
# add one distance column per place -- this is the slow, memory-hungry part
for _, row in test_df.iterrows():
    test_df[row.coord] = round(test_df['coord'].apply(lambda x: haversine(row.coord, x, unit='mi')), 2)
df = test_df.rename(columns=dict(zip(test_df['coord'], test_df['site'])))
df.drop(['Longitude', 'Latitude', 'coord'], axis=1, inplace=True)
new_df = pd.melt(df, id_vars='site', value_vars=df.columns[1:])
new_df.rename(columns={'variable':'Place', 'value':'dist_in_mi'}, inplace=True)
new_df
site Place dist_in_mi
0 1 1 0.00
1 2 1 70.21
2 3 1 739.28
3 4 1 28.03
4 1 2 70.21
5 2 2 0.00
6 3 2 670.11
7 4 2 97.15
8 1 3 739.28
9 2 3 670.11
10 3 3 0.00
11 4 3 766.94
12 1 4 28.03
13 2 4 97.15
14 3 4 766.94
15 4 4 0.00
If you want to resolve your memory problem, you need to use datatypes that use less memory.
In this case, since the maximum distance between two points on Earth is less than 20,005 km, you can use uint16 to store each value (if a 1 km resolution is enough for you).
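For scale, a quick back-of-the-envelope check of what a full 10k x 10k distance matrix costs per dtype:
import numpy as np

n = 10_000
print(n * n * np.dtype(np.float64).itemsize / 1e6)  # 800.0 MB with float64
print(n * n * np.dtype(np.uint16).itemsize / 1e6)   # 200.0 MB with uint16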
Since I didn't have any data to work with, I generated some with the following code:
import random
import numpy as np
from haversine import haversine

def getNFacilities(n):
    """ returns n random pairs of coordinates in the range [-90, +90]"""
    for i in range(n):
        yield random.random()*180 - 90, random.random()*180 - 90

facilities = list(getNFacilities(10000))
I then resolved the memory problem in two different ways:
1- By storing the distance data as uint16 numbers
def calculateDistance(start, end):
    mirror = start is end  # if the matrix is mirrored the values are calculated just once instead of twice
    out = np.zeros((len(start), len(end)), dtype=np.uint16)  # might be better to use empty?
    for i, coords1 in enumerate(start[mirror:], mirror):
        for j, coords2 in enumerate(end[:mirror and i or None]):
            out[i, j] = int(haversine(coords1, coords2))
    return out
After calculating the distances, the memory used by the array was about 200 MB:
In [133]: l = calculateDistance(facilities, facilities)
In [134]: sys.getsizeof(l)
Out[134]: 200000112
2- Alternatively, you can just use a generator:
def calculateDistance(start, end):
    mirror = start is end  # if the matrix is mirrored the values are calculated just once
    for i, coords1 in enumerate(start[mirror:], mirror):
        for j, coords2 in enumerate(end[:mirror and i or None]):
            yield [i, j, haversine(coords1, coords2)]
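One way to consume the generator (a sketch, assuming the facilities list generated above and a hypothetical output file name) is to stream the pairwise distances straight to disk, so the full matrix never has to be held in memory:
import csv

# write the (i, j, distance) triples to a CSV file instead of building the matrix
with open('distances.csv', 'w', newline='') as fh:
    writer = csv.writer(fh)
    writer.writerow(['i', 'j', 'km'])
    for i, j, dist in calculateDistance(facilities, facilities):
        writer.writerow([i, j, round(dist, 2)])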
I have a problem calculating variance with "hidden" NULL (zero) values. Usually that wouldn't be a problem, because a NULL value is not a value, but in my case it is essential to include those NULLs as zeros in the variance calculation. So I have a Dataframe that looks like this:
TableA:
A X Y
1 1 30
1 2 20
2 1 15
2 2 20
2 3 20
3 1 30
3 2 35
Then I need to get the variance for each different X value, and I do this:
TableA.groupby(['X']).agg({'Y':'var'})
But the answer is not what I need, since the variance calculation should also include a zero Y for X=3 when A=1 and A=3.
What my dataset should look like to get the needed variance results:
A X Y
1 1 30
1 2 20
1 3 0
2 1 15
2 2 20
2 3 20
3 1 30
3 2 35
3 3 0
So I need the variance to take into account that every X should have a Y value for each A (1, 2 and 3), and when there is no Y value for a certain combination it should count as 0. Could you help me with this? How should I change my TableA dataframe to make this possible, or is there another way?
Desired output for TableA should be like this:
X Y
1 75.000000
2 75.000000
3 133.333333
Compute the variance directly, but divide by the number of different possibilities for A:
import numpy as np

# three in your example; adjust as needed
a_choices = len(TableA['A'].unique())

def variance_with_missing(vals):
    mean_with_missing = np.sum(vals) / a_choices
    ss_present = np.sum((vals - mean_with_missing)**2)
    ss_missing = (a_choices - len(vals)) * mean_with_missing**2
    return (ss_present + ss_missing) / (a_choices - 1)

TableA.groupby(['X']).agg({'Y': variance_with_missing})
The approach of the solution below is to append the missing combinations with Y=0. It's a little messy, but I hope this will help.
import numpy as np
import pandas as pd

TableA = pd.DataFrame({'A':[1,1,2,2,2,3,3],
                       'X':[1,2,1,2,3,1,2],
                       'Y':[30,20,15,20,20,30,35]})
TableA['A'] = TableA['A'].astype(int)

#### Create rows for the non-existing combinations and fill Y with 0 ####
for i in range(1, TableA.X.max()+1):
    for j in TableA.A.unique():
        if TableA[(TableA.X==i) & (TableA.A==j)].empty:
            TableA = pd.concat([TableA, pd.DataFrame({'A':[j],'X':[i],'Y':[0]})], ignore_index=True)

TableA.groupby('X').agg({'Y':'var'})
I have a data frame that looks like this, but with several hundred thousand rows:
df
D x y
0 y 5.887672 6.284714
1 y 9.038657 10.972742
2 n 2.820448 6.954992
3 y 5.319575 15.475197
4 n 1.647302 7.941926
5 n 5.825357 13.747091
6 n 5.937630 6.435687
7 y 7.789661 11.868023
8 n 2.669362 11.300062
9 y 1.153347 17.625158
I want to know what proportion of values ("D") in each x:y grid space is "n".
I can do it by brute force, by stepping through x and y and calculating the percentage:
zonexy = {}
for x in np.arange(0, 10, 2.5):
    dfx = df[(df['x'] >= x) & (df['x'] < x+2.5)]
    zonexy[x] = {}
    for y in np.arange(0, 24, 6):
        dfy = dfx[(dfx['y'] >= y) & (dfx['y'] < y+6)]
        try:
            pctn = len(dfy[dfy['D']=='n'])/len(dfy) * 100.0
        except ZeroDivisionError:
            pctn = 0
        zonexy[x][y] = pctn
Output:
pd.DataFrame(zonexy)
0.0 2.5 5.0 7.5
0 0 0 0 0
6 100 100 50 0
12 0 0 50 0
18 0 0 0 0
But this, and all the variations on this theme that I've tried, is very slow. It seems like there should be a much more efficient way (probably via numpy), but I'm blanking on it.
One way would be to use the 2D histogram function of numpy:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram2d.html
Then:
Run it once on the data where the criterion is matched (here, where "D" is "n").
Run it again on all of the data.
Divide the first result, element by element, by the second result (see the sketch below).
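A minimal sketch of that idea, assuming the df and the same 2.5 x 6 grid as in the question (names like x_edges and y_edges are just for illustration):
import numpy as np
import pandas as pd

# bin edges matching the question's grid: x in steps of 2.5, y in steps of 6
# (note: histogram2d's last bin includes its right edge, unlike the half-open
# intervals in the loop above)
x_edges = np.arange(0, 12.5, 2.5)
y_edges = np.arange(0, 30, 6)

# counts of the "n" rows, then counts of all rows, on the same grid
n_counts, _, _ = np.histogram2d(df.loc[df['D'] == 'n', 'x'],
                                df.loc[df['D'] == 'n', 'y'],
                                bins=[x_edges, y_edges])
all_counts, _, _ = np.histogram2d(df['x'], df['y'], bins=[x_edges, y_edges])

# element-wise percentage, with empty cells reported as 0 instead of NaN
with np.errstate(divide='ignore', invalid='ignore'):
    pctn = np.where(all_counts > 0, n_counts / all_counts * 100.0, 0)

# same orientation as the question's output: columns are x bins, rows are y bins
zonexy = pd.DataFrame(pctn.T, index=y_edges[:-1], columns=x_edges[:-1])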