The first dataframe, df1, contains an Id and its corresponding pair of coordinates. For each coordinate pair in the first dataframe, I have to loop through the second dataframe to find the row with the least distance. I tried taking the individual coordinates and computing the distance between them, but it does not work as expected. I believe the coordinates have to be treated as a pair when computing the distance. I am not sure whether Python offers some methods to achieve this.
For example, df1:
Id Co1 Co2
334 30.371353 -95.384010
337 39.497448 -119.789623
df2
Id Co1 Co2
339 40.914585 -73.892456
441 34.760395 -77.999260
dfloc3 = [[38.991512, -77.441536],
          [40.89869, -72.37637],
          [40.936115, -72.31452],
          [30.371353, -95.38401],
          [39.84819, -75.37162],
          [36.929306, -76.20035],
          [40.682342, -73.979645]]
dfloc4 = [[40.914585,-73.892456],
[41.741543,-71.406334],
[50.154522,-96.88806],
[39.743565,-121.795761],
[30.027597,-89.91014],
[36.51881,-82.560844],
[30.449587,-84.23629],
[42.920475,-85.8208]]
Given you can get your points into a list like so...
df1 = [[30.371353, -95.384010], [39.497448, -119.789623]]
df2 = [[40.914585, -73.892456], [34.760395, -77.999260]]
Import math, then create a function to make finding the distance easier:
import math

def distance(pt1, pt2):
    return math.sqrt((pt1[0] - pt2[0])**2 + (pt1[1] - pt2[1])**2)
Then simply traverse your list, saving the closest points:
for pt1 in df1:
    closestPoints = [pt1, df2[0]]
    for pt2 in df2:
        if distance(pt1, pt2) < distance(closestPoints[0], closestPoints[1]):
            closestPoints = [pt1, pt2]
    print("Point: " + str(closestPoints[0]) + " is closest to " + str(closestPoints[1]))
Outputs:
Point: [30.371353, -95.38401] is closest to [34.760395, -77.99926]
Point: [39.497448, -119.789623] is closest to [34.760395, -77.99926]
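As a side note, if you are on Python 3.8 or newer, the helper can be replaced by the built-in math.dist, which computes the same Euclidean distance between two points:
def distance(pt1, pt2):
    return math.dist(pt1, pt2)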
The code below creates a new column in df1 showing the Id of the nearest point in df2. (I can't tell from the question if this is what you want.) I'm assuming the coordinates are in a Euclidean space, i.e., that the distance between points is given by the Pythagorean Theorem. If not, you could easily use some other calculation instead of dist_squared.
import pandas as pd
df1 = pd.DataFrame(dict(Id=[334, 337], Co1=[30.371353, 39.497448], Co2=[-95.384010, -119.789623]))
df2 = pd.DataFrame(dict(Id=[339, 441], Co1=[40.914585, 34.760395], Co2=[-73.892456, -77.999260]))
def nearest(row, df):
    # calculate Euclidean distance from the given row to all rows of df
    dist_squared = (row.Co1 - df.Co1) ** 2 + (row.Co2 - df.Co2) ** 2
    # find the closest row of df
    smallest_idx = dist_squared.argmin()
    # return the Id for the closest row of df
    return df.loc[smallest_idx, 'Id']
near = df1.apply(nearest, args=(df2,), axis=1)
df1['nearest'] = near
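For the sample frames above, df1 should then look something like this (both rows are nearest to Id 441, consistent with the output of the first answer):
    Id        Co1         Co2  nearest
0  334  30.371353  -95.384010      441
1  337  39.497448 -119.789623      441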
I have four dataframes, as given below:
import pandas as pd

df_raw = pd.DataFrame(
    {'stud_id': [101, 101, 101],
     'prod_id': [12, 13, 16],
     'total_qty': [100, 1000, 80],
     'ques_date': ['13/11/2020', '10/1/2018', '11/11/2017']})
df_accu = pd.DataFrame(
    {'stud_id': [101, 101, 101],
     'prod_id': [12, 13, 16],
     'accu_qty': [10, 500, 10],
     'accu_date': ['13/08/2021', '02/11/2019', '17/12/2018']})
df_inv = pd.DataFrame(
    {'stud_id': [101, 101, 101],
     'prod_id': [12, 13, 18],
     'inv_qty': [5, 100, 15],
     'inv_date': ['16/02/2022', '22/11/2020', '19/10/2019']})
df_bkl = pd.DataFrame(
    {'stud_id': [101, 101, 101, 101],
     'prod_id': [12, 12, 12, 17],
     'bkl_qty': [15, 40, 2, 10],
     'bkl_date': ['16/01/2022', '22/10/2021', '09/10/2020', '25/06/2020']})
My objective is to find out the following:
a) Get the date when the threshold exceeds 50%.
The threshold is given by the formula below:
threshold = (((df_inv['inv_qty']+df_bkl['bkl_qty']+df_accu['accu_qty'])/df_raw['total_qty'])*100)
We have to add in that order: first inv_qty, then bkl_qty, and finally accu_qty. We do it this way in order to identify the correct date when the running total exceeded 50% of the total quantity. Additionally, this has to be computed for each stud_id and prod_id.
The problem is that df_bkl has multiple records for the same stud_id and prod_id, and that is by design; the real data also looks like this. In contrast, df_accu and df_inv have only one row for each stud_id and prod_id.
In the above formula, we have to use each value of df_bkl['bkl_qty'] when computing the sum.
For example, let's take stud_id = 101 and prod_id = 12.
The total_qty is 100, inv_qty is 5 and accu_qty is 10, but there are three bkl_qty values: 15, 40 and 2. So the threshold has to be computed in a fashion like below:
5 (value of inv_qty) + 15 (1st value of bkl_qty) + 40 (2nd value of bkl_qty) + 2 (3rd value of bkl_qty) + 10 (value of accu_qty)
With the above, we can see that the threshold exceeded 50% when the bkl_qty value was 40, because 5 + 15 + 40 = 60, which is greater than 50% of total_qty (100).
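As a quick sanity check of that running total, here is a minimal sketch with just the sample numbers above (not the full per-group logic):
import numpy as np

# quantities in the required order: inv_qty, then each bkl_qty in turn, then accu_qty
qty = np.array([5, 15, 40, 2, 10])
running = qty.cumsum()                      # array([ 5, 20, 60, 62, 72])
first_over = int(np.argmax(running > 50))   # 2, i.e. the bkl_qty value 40 crosses 50% of total_qty = 100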
I was trying something like below
df_stage_1 = df_raw.merge(df_inv,on=['stud_id','prod_id'], how='left').fillna(0)
df_stage_2 = df_stage_1.merge(df_bkl,on=['stud_id','prod_id'])
df_stage_3 = df_stage_2.merge(df_accu,on=['stud_id','prod_id'])
df_stage_3['threshold'] = ((df_stage_3['inv_qty'] + df_stage_3['bkl_qty'] + df_stage_3['accu_qty'])/df_stage_3['total_qty'])*100
But this is incorrect, as I am not able to process each value of bkl_qty from df_bkl one by one.
In this post I have shown only sample data with one stud_id = 101, but in reality I have thousands of stud_id and prod_id values.
Therefore, any elegant and efficient approach would be useful; we have to apply this logic to datasets with millions of records.
I expect my output to be as shown below. Whenever the sum exceeds 50% of total_qty, we need to get the corresponding date:
stud_id,prod_id,total_qty,threshold,threshold_date
101 12 100 72 22/10/2021
This can be achieved using groupby and cumsum, which computes the cumulative sum.
# add cumulative sum column to df_bkl
df_bkl['csum'] = df_bkl.groupby(['stud_id','prod_id'])['bkl_qty'].cumsum()
# use df_bkl['csum'] to compute threshold instead of bkl_qty
df_stage_3['threshold'] = ((df_stage_3['inv_qty'] + df_stage_3['csum'] + df_stage_3['accu_qty'])/df_stage_3['total_qty'])*100
# if inv_qty alone already exceeds 50% of total_qty, use inv_date as the crossing date
df_stage_3.loc[df_stage_3.inv_qty > df_stage_3.total_qty/2, 'bkl_date'] = df_stage_3['inv_date']
# next doing some filter and merge to arrive at the desired df
gt_thres = df_stage_3[df_stage_3['threshold'] > df_stage_3['total_qty']/2]
df_f1 = gt_thres.groupby(['stud_id','prod_id','total_qty'])['threshold'].min().to_frame(name='threshold').reset_index()
df_f2 = gt_thres.groupby(['stud_id','prod_id','total_qty'])['threshold'].max().to_frame(name='threshold_max').reset_index()
df = pd.merge(df_f1, df_stage_3, on=['stud_id','prod_id','total_qty','threshold'], how='inner')
df2 = pd.merge(df,df_f2, on=['stud_id','prod_id','total_qty'], how='inner')
df2 = df2[['stud_id','prod_id','total_qty','threshold_max','bkl_date']].rename(columns={'threshold_max':'threshold', 'bkl_date':'threshold_date'})
print(df2)
provides the output as:
stud_id prod_id total_qty threshold threshold_date
0 101 12 100 72.0 22/10/2021
Does this work?
I did first differencing as the time series is not stationary.
When I do the inverse transformation, some values come out negative, since diff() produces negative values. Is there a way to sort this out and bring the data back to its original format, as close as possible to the expected result?
This is my Python code. Is there a way to fix the code, or any alternative logic to make the data stationary and forecast the series?
count = 0

def invert_transformation(df_train, df_forecast):
    """Revert the differencing to get the forecast back to the original scale."""
    df_fc = df_forecast.copy()
    columns = df_train.columns
    if count > 0:  # For 1st differencing
        print("Enter into invert transformation")
        for col in columns:
            df_fc[str(col)+'_f'] = df_train[col].iloc[-1] + df_fc[str(col)+'_f'].cumsum()
        print("df_fc: \n", df_fc)
    return df_fc
# Since the data is not stationary, I did the 1st difference
df_differenced = df_train.diff().dropna()
count = count + 1 #increase the count
count
....
....
model = VAR(df_differenced)
....
fc = model_fitted.forecast(y=forecast_input, steps=10)
df_forecast2 = pd.DataFrame(fc, index=df2.index[-nobs:], columns=df2.columns + '_f')
df_results = invert_transformation(df_train, df_forecast2)
The values of df_results (TS is the index column) are:
TS Field1_f Field2_f
44:13.0 6.826511e+05 1.198614e+06
44:14.0 -8.620101e+05 4.694556e+05
..
..
44:22.0 -1.401620e+07 -2.092826e+06
The values of df_differenced are:
TS Field1 Field2
43:34.0 187000.0 29000.0
43:35.0 175000.0 76722.0
43:36.0 -10000.0 31000.0
43:37.0 90000.0 42000.0
43:38.0 -130000.0 -42000.0
43:39.0 40000.0 -98444.0
..
..
44:11.0 -130000.0 40722.0
44:12.0 117000.0 -42444.0
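For reference, this is the inversion logic I am relying on, checked on a tiny made-up series (the first value plus the cumulative sum of the differences recovers the rest of the series):
s = pd.Series([10.0, 12.0, 9.0, 15.0])
d = s.diff().dropna()                  # first differences: 2.0, -3.0, 6.0
restored = s.iloc[0] + d.cumsum()      # 12.0, 9.0, 15.0 -> matches s[1:]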
I have several large data sets (~3000 rows, 100 columns) that I need to process with pandas. Each row represents a point on a map and there's a bunch of data associated with that point. I am doing spatial calculations (may introduce a few more variables in the future), so for each row I am only using the data from 1-4 columns. The issue is that I have to compare each row to every other row - essentially, I am trying to figure out spatial relationships between every point. At this stage in the project, I am doing calculations to determine how many points are inside a given radius for each point in the table. I have to do this 5 or 6 times (i.e. running the distance calculation function for multiple radius sizes.) This means that I end up with ~10-50 million calculations. It is slow. Very slow (like 9+ hours of computing time.)
After I run all these calculations, I need to append them as new columns in the original (very large) dataframe. Currently, I have been passing the entire dataframe to my function, which might further slow things down.
I know that many people are running this size of calculation on super computers or dedicated multicore units, but I would like to do what I can to optimize my code to run as efficiently as possible, regardless of the hardware.
I am currently using a double for loop with .iterrows(). I have stripped away as many of the unnecessary steps as possible. I may be able to pare down the dataframe into a subset, pass that to the function, and append the calculations to the original in another step, if that would help speed things up. I have also considered using .apply() to eliminate the outside loop (e.g. .apply() the inner loop to all rows in the dataframe...?)
Below, I have shown the functions that I am using. This is probably the simplest application that I have for this project... there are others that do more calculations/return pairs or groups of points based on certain spatial criteria, but this is the best example to show the basic idea of what I am doing.
import math
import pandas as pd

# specify file to be read into pandas
df = pd.read_csv('input_file.csv', low_memory=False)

# function to return distance between two points w/ (x,y) coordinates
def xy_distance_calc(x1, x2, y1, y2):
    return math.sqrt((x1 - x2)**2 + (y1 - y2)**2)
# function to calculate number of points inside a given radius for each point
def spacing_calc(data, rad_crit, col_x, col_y):
    count_list = list()
    df_list = pd.DataFrame()
    for index, row in data.iterrows():
        x_row_current = row[col_x]
        y_row_current = row[col_y]
        count = 0
        # dist_list = list()
        for index1, row1 in data.iterrows():
            x1 = row1[col_x]
            y1 = row1[col_y]
            dist = xy_distance_calc(x_row_current, x1, y_row_current, y1)
            if dist < rad_crit:
                count += 1
            else:
                continue
        count_list.append(count)
    df_list = pd.DataFrame(data=count_list, columns=[str(rad_crit) + ' radius'])
    return df_list
# call the function for each radius in question, append new data
df_2640 = spacing_calc(df, 2640.0, 'MID_X', 'MID_Y')
df = df.join(df_2640)
df_1320 = spacing_calc(df, 1320.0, 'MID_X', 'MID_Y')
df = df.join(df_1320)
df_1155 = spacing_calc(df, 1155.0, 'MID_X', 'MID_Y')
df = df.join(df_1155)
df_990 = spacing_calc(df, 990.0, 'MID_X', 'MID_Y')
df = df.join(df_990)
df_660 = spacing_calc(df, 660.0, 'MID_X', 'MID_Y')
df = df.join(df_660)
df_330 = spacing_calc(df, 330.0, 'MID_X', 'MID_Y')
df = df.join(df_330)
df.to_csv('spacing_calc_all.csv', index=None)
No errors, everything works, I just don't think it's as efficient as it could be.
Your problem is that you loop too many times. At the very least, you should calculate a distance matrix and then count how many points fall within a radius from that matrix. However, the fastest solution is to use NumPy's vectorized functions, which are highly optimized C code.
As with most learning experiences, it's better to start with a small problem:
>>> import numpy as np
>>> import pandas as pd
>>> from scipy.spatial import distance_matrix
# Create a dataframe with two columns, MID_X and MID_Y, assigned at random
>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.uniform(1, 10, size=(5, 2)), columns=['MID_X', 'MID_Y'])
>>> df.index.name = 'PointID'
MID_X MID_Y
PointID
0 4.370861 9.556429
1 7.587945 6.387926
2 2.404168 2.403951
3 1.522753 8.795585
4 6.410035 7.372653
# Calculate the distance matrix
>>> cols = ['MID_X', 'MID_Y']
>>> d = distance_matrix(df[cols].values, df[cols].values)
array([[0. , 4.51542241, 7.41793942, 2.94798323, 2.98782637],
[4.51542241, 0. , 6.53786001, 6.52559479, 1.53530446],
[7.41793942, 6.53786001, 0. , 6.4521226 , 6.38239593],
[2.94798323, 6.52559479, 6.4521226 , 0. , 5.09021286],
[2.98782637, 1.53530446, 6.38239593, 5.09021286, 0. ]])
# The radii for which you want to measure. They need to be raised
# up 2 extra dimensions to prepare for array broadcasting later
>>> radii = np.array([3,6,9])[:, None, None]
array([[[3]],
[[6]],
[[9]]])
# Count how many points fall within a certain radius from another
# point using numpy's array broadcasting. `d < radii` will return
# an array of `True/False` and we can count the number of `True`
# by `sum` over the last axis.
#
# The distance between a point to itself is 0 and we don't want
# to count that hence the -1.
>>> count = (d < radii).sum(axis=-1) - 1
array([[2, 1, 0, 1, 2],
[3, 2, 0, 2, 3],
[4, 4, 4, 4, 4]])
# Putting everything together for export
>>> result = pd.DataFrame(count, index=radii.flatten()).stack().to_frame('Count')
>>> result.index.names = ['Radius', 'PointID']
Count
Radius PointID
3 0 2
1 1
2 0
3 1
4 2
6 0 3
1 2
2 0
3 2
4 3
9 0 4
1 4
2 4
3 4
4 4
The final result means that within a radius of 3, point #0 has 2 neighbours, point #1 has 1 neighbour, point #2 has 0 neighbours, and so on. Reshape and format the frame to your liking.
You shouldn't have a problem scaling this to thousands of points.
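If you prefer the wide layout from your original code (one count column per radius joined back onto df), the same count array can be reshaped instead; a small sketch, assuming the df, count and radii variables from above:
wide = pd.DataFrame(count.T, index=df.index,
                    columns=[str(r) + ' radius' for r in radii.flatten()])
df = df.join(wide)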
I'm trying to loop each row of df1 against every row of df2, create a new column in df1, and store the minimum of all the values in it.
lat_sc = shopping_centers['lat']
long_sc = shopping_centers['lng']
for i, j in zip(lat_sc, long_sc):
    for lat_real, long_real in zip(real_estate['lat'], real_estate['lng']):
        euclid_dist.append(lat_real - i)
    short_dist.append(min(euclid_dist))
    euclid_dist = []
Result:
df1['shortest'] = min(df1['lat']- each lat of df2)
df1['nearest sc'] = that corresponding sc_id
Edit: I also want to include the corresponding sc_id in df1.
This could get computationally intensive as df2 gets big, but you can find the distance between each df1 point and all the df2 points like this (it's possible to do this more efficiently):
import numpy as np

def find_euclid_dist(row):
    dist_arr = np.sqrt((ref_lats - row["lat"])**2 + (ref_longs - row["lng"])**2)
    return np.min(dist_arr)

ref_lats = df2["lat"].values
ref_longs = df2["lng"].values
df1["shortest"] = df1.apply(find_euclid_dist, axis=1)
How about using cdist from SciPy?
from scipy.spatial.distance import cdist
df1['shortest'] = cdist(df1[['lat','lng']], df2[['lat','lng']], metric='euclidean').min(1)
print(df1) returns:
lat lng addr_street shortest
0 -37.980523 -37.980523 37 Scarlet Drive 183.022436
1 -37.776161 -37.776161 999 Heidelberg Road 182.817951
2 -37.926238 -37.926238 47 New Street 182.968096
3 -37.800056 -37.800056 3/113 Normanby Road 182.841849
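To also get the id of the nearest shopping centre (per the edit above), the same cdist matrix can be reused with argmin; here sc_id is assumed to be the name of the id column in df2:
dists = cdist(df1[['lat','lng']], df2[['lat','lng']], metric='euclidean')
df1['shortest'] = dists.min(axis=1)
df1['nearest sc'] = df2['sc_id'].values[dists.argmin(axis=1)]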
I'll try to explain what I'm currently working with:
I have two dataframes: one for Gas Station A (165 stations), and other for Gas Station B (257 stations). They both share the same format:
id Coor
1 (a1,b1)
2 (a2,b2)
Coor has tuples with the location coordinates. What I want to do is to add 3 columns to Dataframe A with nearest Competitor #1, #2 and #3 (from Gas Station B).
Currently I managed to get every distance from A to B (42405 distance measures), but in a list format:
distances = []
for (u, v) in gasA['coor']:
    for (w, x) in gasB['coor']:
        distances.append(sp.distance.euclidean((u, v), (w, x)))
This lets me have the values I need, but I still need to match them with the ID from Gas Station A, and get the top 3. I have the suspicion working with lists is not the best approach here. Do you have any suggestions?
Edit: as suggested, first 5 rows are:
in GasA:
id coor
60712 (-333525363206695,-705191013427772)
60512 (-333539879388388, -705394161580837)
60085 (-333545609177068, -703168832659184)
60110 (-333601677229216, -705167284798638)
60078 (-333608898397271, -707213099595404)
in GasB:
id coor
70174 (-333427160000000,-705459060000000)
70223 (-333523030000000, -706705470000000)
70383 (-333549270000000, -705320990000000)
70162 (-333556960000000, -705384750000000)
70289 (-333565850000000, -705104360000000)
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np
Creating the data:
A = pd.DataFrame({'id':['60712','60512','60085', '60110','60078'], 'coor':[ (-333525363206695,-705191013427772),\
(-333539879388388, -705394161580837),\
(-333545609177068, -703168832659184),\
(-333601677229216, -705167284798638),\
(-333608898397271, -707213099595404)]})
B = pd.DataFrame({'id':['70174','70223','70383', '70162','70289'], 'coor':[ (-333427160000000,-705459060000000),\
(-333523030000000, -706705470000000),\
(-333549270000000, -705320990000000),\
(-333556960000000, -705384750000000),\
(-333565850000000, -705104360000000)]})
Calculating the distances:
res = euclidean_distances(list(A.coor), list(B.coor))
Selecting top 3 closest stations from B and appending to a column in A:
d = []
for i, id_ in enumerate(A.index):
    distances = np.argsort(res[i])[0:3]  # select top 3
    distances = B.iloc[distances]['id'].values
    d.append(distances)
A = A.assign(dist=d)
Edit: result of running with the example data:
coor id dist
0 (-333525363206695, -705191013427772) 60712 [70223, 70174, 70162]
1 (-333539879388388, -705394161580837) 60512 [70223, 70289, 70174]
2 (-333545609177068, -703168832659184) 60085 [70223, 70174, 70162]
3 (-333601677229216, -705167284798638) 60110 [70223, 70174, 70162]
4 (-333608898397271, -707213099595404) 60078 [70289, 70383, 70162]
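If you want the three nearest ids as separate columns rather than a list column, the dist column can be split afterwards (a sketch; the new column names are just placeholders):
A[['nearest_1', 'nearest_2', 'nearest_3']] = pd.DataFrame(A['dist'].tolist(), index=A.index)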
Define a function that calculates the distances from A to all B's and returns indices of B with the three smallest distances.
def get_nearest_three(row):
    (u, v) = row['Coor']
    dist_list = gasB.Coor.apply(sp.distance.euclidean, args=[(u, v)])
    # want the indices of the 3 rows of B with the smallest distances
    return list(np.argsort(dist_list))[0:3]

gasA['dists'] = gasA.apply(get_nearest_three, axis=1)
You can do something like this.
a = np.array(gasA.coor.tolist())
b = np.array(gasB.coor.tolist())
c = np.sqrt(((a[:, None, :] - b[None, :, :])**2).sum(axis=-1))
We can get the NumPy arrays of the coordinates for both, broadcast a against b to get every combination, and then take the Euclidean distance; c is then a (len(gasA), len(gasB)) matrix of distances.
Consider a cross join (matching every row by every row between both datasets) which should be manageable with your small sets, 165 X 257, then calculate the distance. Then, rank by distance and filter for top 3.
cj_df = pd.merge(gasA.assign(key=1), gasB.assign(key=1),
on="key", suffixes=['_A', '_B'])
cj_df['distance'] = cj_df.apply(lambda row: sp.distance.euclidean(row['Coor_A'],
row['Coor_B']),
axis = 1)
# RANK BY DISTANCE
cj_df['rank'] = cj_df.groupby('id_A')['distance'].rank()
# FILTER FOR TOP 3
top3_df = cj_df[cj_df['rank'] <= 3].sort_values(['id_A', 'rank'])
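To turn those top-3 rows into the three competitor columns described in the question, they can be pivoted on the rank; a sketch, assuming the ranks have no ties (otherwise use rank(method='first')) and that gasA's id column is named id:
top3_wide = (top3_df.assign(rank=top3_df['rank'].astype(int))
                    .pivot(index='id_A', columns='rank', values='id_B')
                    .rename(columns=lambda r: 'nearest_' + str(r)))
gasA = gasA.merge(top3_wide, left_on='id', right_index=True, how='left')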