I'll try to explain what I'm currently working with:
I have two dataframes: one for Gas Station A (165 stations), and the other for Gas Station B (257 stations). They both share the same format:
id Coor
1 (a1,b1)
2 (a2,b2)
Coor has tuples with the location coordinates. What I want to do is to add 3 columns to Dataframe A with nearest Competitor #1, #2 and #3 (from Gas Station B).
Currently I managed to get every distance from A to B (42405 distance measures), but in a list format:
from scipy import spatial as sp  # assumed import; the code below uses sp.distance.euclidean

distances = []
for (u, v) in gasA['coor']:
    for (w, x) in gasB['coor']:
        distances.append(sp.distance.euclidean((u, v), (w, x)))
This lets me have the values I need, but I still need to match them with the ID from Gas Station A, and get the top 3. I have the suspicion working with lists is not the best approach here. Do you have any suggestions?
Edit: as suggested, first 5 rows are:
in GasA:
id coor
60712 (-333525363206695,-705191013427772)
60512 (-333539879388388, -705394161580837)
60085 (-333545609177068, -703168832659184)
60110 (-333601677229216, -705167284798638)
60078 (-333608898397271, -707213099595404)
in GasB:
id coor
70174 (-333427160000000,-705459060000000)
70223 (-333523030000000, -706705470000000)
70383 (-333549270000000, -705320990000000)
70162 (-333556960000000, -705384750000000)
70289 (-333565850000000, -705104360000000)
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np
import pandas as pd
Creating the data:
A = pd.DataFrame({'id':['60712','60512','60085', '60110','60078'], 'coor':[ (-333525363206695,-705191013427772),\
(-333539879388388, -705394161580837),\
(-333545609177068, -703168832659184),\
(-333601677229216, -705167284798638),\
(-333608898397271, -707213099595404)]})
B = pd.DataFrame({'id':['70174','70223','70383', '70162','70289'], 'coor':[ (-333427160000000,-705459060000000),\
(-333523030000000, -706705470000000),\
(-333549270000000, -705320990000000),\
(-333556960000000, -705384750000000),\
(-333565850000000, -705104360000000)]})
Calculating the distances:
res = euclidean_distances(list(A.coor), list(B.coor))
Selecting top 3 closest stations from B and appending to a column in A:
d = []
for i, id_ in enumerate(A.index):
    distances = np.argsort(res[i])[0:3]  # select top 3
    distances = B.iloc[distances]['id'].values
    d.append(distances)
A = A.assign(dist=d)
Edit: result of running with the example data:
coor id dist
0 (-333525363206695, -705191013427772) 60712 [70223, 70174, 70162]
1 (-333539879388388, -705394161580837) 60512 [70223, 70289, 70174]
2 (-333545609177068, -703168832659184) 60085 [70223, 70174, 70162]
3 (-333601677229216, -705167284798638) 60110 [70223, 70174, 70162]
4 (-333608898397271, -707213099595404) 60078 [70289, 70383, 70162]
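Since the question asks for three separate columns rather than one list column, the dist lists could be expanded afterwards; here is a small sketch (the competitor_* column names are just placeholders I'm introducing):
comp = pd.DataFrame(A['dist'].tolist(), index=A.index,
                    columns=['competitor_1', 'competitor_2', 'competitor_3'])
A = A.join(comp)  # one column per nearest competitor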
Define a function that calculates the distances from A to all B's and returns indices of B with the three smallest distances.
def get_nearest_three(row):
    (u, v) = row['Coor']
    # pass the point as a single tuple so euclidean() receives two points
    dist_list = gasB.Coor.apply(sp.distance.euclidean, args=[(u, v)])
    # indices of the 3 rows of B with the smallest distances
    return list(np.argsort(dist_list))[0:3]

gasA['dists'] = gasA.apply(get_nearest_three, axis=1)
You can do something like this.
a = np.array(list(gasA.coor))   # shape (len(A), 2)
b = np.array(list(gasB.coor))   # shape (len(B), 2)
c = np.sqrt(((a[:, None, :] - b[None, :, :])**2).sum(axis=-1))  # shape (len(A), len(B))
We convert the coordinate columns to numpy arrays, broadcast a against b so that every A/B pair is represented, and then take the euclidean distance over the last axis.
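From that distance matrix, one possible way to pick the three nearest B stations per A row (the nearest_competitors column name is just an example, not from the original answer):
nearest_idx = np.argsort(c, axis=1)[:, :3]  # positions of the three smallest distances per row
gasA['nearest_competitors'] = [gasB['id'].iloc[idx].tolist() for idx in nearest_idx]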
Consider a cross join (matching every row of one dataset with every row of the other), which should be manageable with your small sets (165 × 257 = 42,405 pairs); then calculate the distance for each pair, rank by distance, and filter for the top 3.
cj_df = pd.merge(gasA.assign(key=1), gasB.assign(key=1),
                 on="key", suffixes=['_A', '_B'])

cj_df['distance'] = cj_df.apply(lambda row: sp.distance.euclidean(row['Coor_A'],
                                                                  row['Coor_B']),
                                axis=1)
# RANK BY DISTANCE
cj_df['rank'] = cj_df.groupby('id_A')['distance'].rank()
# FILTER FOR TOP 3
top3_df = cj_df[cj_df['rank'] <= 3].sort_values(['id_A', 'rank'])
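To turn the long top-3 table back into three columns on gasA (what the question asked for), a possible final step is a pivot; this assumes no exact distance ties (otherwise use rank(method='first')), and the nearest_* column names are placeholders:
top3_wide = (top3_df
             .assign(rank=top3_df['rank'].astype(int))
             .pivot(index='id_A', columns='rank', values='id_B')
             .rename(columns={1: 'nearest_1', 2: 'nearest_2', 3: 'nearest_3'}))
gasA = gasA.merge(top3_wide, left_on='id', right_index=True, how='left')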
I have a fairly large dataframe on which I want to:
1. search row-wise for an interval containing a given value
2. perform a linear interpolation between the two elements found at point 1 and the two corresponding elements of another array
3. add a column to the dataframe with the interpolated values
What I have done involves a for loop, i.e.:
Given a sample of the dataframe Fak
beta0 beta1 beta2 beta3 beta4 beta5 beta6 beta7 beta8 beta9 beta10
0 0.008665 0.061391 0.159690 0.223275 0.232535 0.251266 0.279847 0.465671 0.672253 0.914753 1.0
1 0.009121 0.064322 0.166623 0.232418 0.241945 0.261106 0.290169 0.477621 0.682283 0.916384 1.0
2 0.009491 0.066689 0.172210 0.239776 0.249516 0.269020 0.298463 0.487108 0.690031 0.917638 1.0
3 0.009733 0.068232 0.175837 0.244542 0.254418 0.274140 0.303820 0.493102 0.694703 0.918304 1.0
4 0.009860 0.069027 0.177687 0.246963 0.256906 0.276734 0.306523 0.495985 0.696696 0.918511 1.0
I have an array psi
[-12.97, -11.97, -10.97, -9.97, -8.97, -7.97, -6.97, -5.97, -4.97, -3.97, -2.97, -1.97]
I define the value I want to search for in Fak as a percentage, i.e. intF = 16 (so the target fraction used in the code is intF/100 = 0.16)
I calculate the new dataframe with the following loop
dxlist = []
for i, Faki in Fak.iterrows():
    # interpolation boundaries ID
    if intF == 0.0:
        ip1 = 1
    elif intF == 1.0:
        ip1 = -1
    else:
        ip1 = np.where(Faki > int(intF)/100)[0][0]
    im1 = ip1 - 1
    # coefficients
    dfak = Faki[ip1] - Faki[im1]
    dpsi = psi[ip1] - psi[im1]
    m = dfak/dpsi
    q = Faki[im1] - m*psi[im1]
    # calculate
    intPsi = (int(intF)/100 - q)/m
    intDi = 2**intPsi
    dxlist.append(intDi)
dfout['newcolumn'] = dxlist
which works, but it is quite slow.
What I am missing is how to calculate the linear interpolation row-wise and use the indices on an outside array.
Apparently I found a vectorized solution:
psidf = Fak.copy()
psidf.loc[Fak.index] = psi
Fakp1 = Fak[Fak.ge(intF/100)].fillna(method='bfill',axis=1).iloc[:,0]
Fakm1 = Fak[Fak.le(intF/100)].fillna(method='ffill',axis=1).iloc[:,-1]
psip1 = psidf[Fak.ge(intF/100)].fillna(method='bfill',axis=1).iloc[:,0]
psim1 = psidf[Fak.le(intF/100)].fillna(method='ffill',axis=1).iloc[:,-1]
m = (Fakp1-Fakm1)/(psip1-psim1)
q = Fakm1-m*psim1
intDi_series = 2**((intF/100-q)/m)
intDi['d'+str(int(intF))+nsfx] = intDi_series
The key is to generate a dataframe with the psi array repeated as rows, so that it has the same shape as Fak (which is done in the first two lines of the code above).
Then I isolate the columns I need from each dataframe using the ge and le methods, and use those positions to pick the matching values from the newly generated psi dataframe.
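For reference, a rough numpy-only sketch of the same row-wise interpolation (a sketch under assumptions: each row of Fak is monotonically increasing, psi has one entry per column of Fak, and the target never falls below the first column; otherwise the edge handling from the loop version is still needed):
import numpy as np

vals = Fak.to_numpy()
psi_arr = np.asarray(psi)
target = intF / 100                      # same target fraction as above

ip1 = (vals >= target).argmax(axis=1)    # first column where Fak >= target, per row
im1 = ip1 - 1
rows = np.arange(len(vals))

m = (vals[rows, ip1] - vals[rows, im1]) / (psi_arr[ip1] - psi_arr[im1])
q = vals[rows, im1] - m * psi_arr[im1]
int_di = 2.0 ** ((target - q) / m)       # same 2**psi transform as in the loop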
Title says most of it, i.e. find the maximum number of consecutive ones (or Trues) for each year, and if the consecutive 1s at the end of one year continue into the following year, merge them together.
I have tried to implement this, but it seems a bit of a 'hack', and I'm wondering if there is a better way to do it.
Reproducible Example Code:
# Modules needed
import pandas as pd
import numpy as np
# Example Input array of Ones and Zeroes with a datetime-index (Original data is time-series)
InputArray = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1])
InputArray.index = (pd.date_range('2000-12-22', '2001-01-06'))
boolean_array = InputArray == 1 #convert to boolean
# Wanted Output
# Year MaxConsecutive-Ones
# 2000 9
# 2001 3
Below is my initial code to achieve the wanted output.
# function to get max consecutive for a particular array. i.e. will be done for each year below (groupby)
def GetMaxConsecutive(boolean_array):
    distinct = boolean_array.ne(boolean_array.shift()).cumsum() # associate each trues/false to a number
    distinct = distinct[boolean_array] # only consider trues from the distinct values
    consect = distinct.value_counts().max() # find the number of consecutives of distincts values then find the maximum value
    return consect
# Find the maximum consecutive 'Trues' for each year.
MaxConsecutive = boolean_array.groupby(lambda x: x.year).apply(GetMaxConsecutive)
print(MaxConsecutive)
# Year MaxConsecutive-Ones
# 2000 7
# 2001 3
However, the output above is still not what we want, because the groupby cuts the data at each year boundary.
So the code below tries to 'fix' this by computing the consecutive ones at the boundaries (i.e. current_year-01-01 and previous_year-12-31); if the boundary count is larger than the original MaxConsecutive value from the output above, we replace it.
# First) we aquire all start_of_year and end_of_year data
start_of_year = boolean_array.loc[(boolean_array.index.month==1) & (boolean_array.index.day==1)]
end_of_year = boolean_array.loc[(boolean_array.index.month==12) & (boolean_array.index.day==31)]
# Second) we mask above start_of_year and end_of_year data: to only have elements that are "True"
start_of_year = start_of_year[start_of_year]
end_of_year = end_of_year[end_of_year]
# Third) Change index to only contain years (rather than datetime index)
# Also for "start_of_year" array include -1 to the years when setting the index.
# So we can match end_of_year to start_of_year arrays!
start_of_year = pd.Series(start_of_year)
start_of_year.index = start_of_year.index.year - 1
end_of_year = pd.Series(end_of_year)
end_of_year.index = end_of_year.index.year
# Combine index-years that are 'matched'
matched_years = pd.concat([end_of_year, start_of_year], axis = 1)
matched_years = matched_years.dropna()
matched_years = matched_years.index
# Finally) Compute the consecutive 1s/trues at the boundaries
# for each matched year
Modify_MaxConsecutive = MaxConsecutive.copy()
for year in matched_years:
    # Compute the amount of consecutive 1s/trues at the start-of-year
    start = boolean_array.loc[boolean_array.index.year == (year + 1)]
    distinct = start.ne(start.shift()).cumsum()   # associate each run of trues/falses with a number
    distinct_masked = distinct[start]             # keep only the trues
    count_distincts = distinct_masked.value_counts()  # index: run label, value: run length
    start_consecutive = count_distincts.loc[distinct_masked.min()]  # run length at the start of the year
    # Compute the amount of consecutive 1s/trues at the previous end-of-year
    end = boolean_array.loc[boolean_array.index.year == year]
    distinct = end.ne(end.shift()).cumsum()
    distinct_masked = distinct[end]
    count_distincts = distinct_masked.value_counts()
    end_consecutive = count_distincts.loc[distinct_masked.max()]    # run length at the end of the year
    # Merge/add the consecutive runs at the boundary (start-of-year + previous end-of-year)
    ConsecutiveAtBoundaries = start_consecutive + end_consecutive
    # Replace the original MaxConsecutive value if the boundary run is larger
    if Modify_MaxConsecutive.loc[year] < ConsecutiveAtBoundaries:
        Modify_MaxConsecutive.loc[year] = ConsecutiveAtBoundaries
# Wanted Output is achieved!
print(Modify_MaxConsecutive)
# Year MaxConsecutive-Ones
# 2000 9
# 2001 3
Now I've got the time. Here is my solution:
# Modules needed
import pandas as pd
import numpy as np
input_array = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1], dtype=bool)
input_dates = pd.date_range('2000-12-22', '2001-01-06')
df = pd.DataFrame({"input": input_array, "dates": input_dates})
# index positions where a streak of Trues starts / ends
streak_starts = df.index[~df.input.shift(1, fill_value=False) & df.input]
streak_ends = df.index[~df.input.shift(-1, fill_value=False) & df.input] + 1
streak_lengths = streak_ends - streak_starts
# one row per streak, labelled with its start date and its length
streak_df = df.iloc[streak_starts].copy()
streak_df["streak_length"] = streak_lengths
# each streak is attributed to the year it starts in, which handles streaks crossing year boundaries
longest_streak_per_year = streak_df.groupby(streak_df.dates.dt.year).streak_length.max()
output:
dates
2000 9
2001 3
Name: streak_length, dtype: int64
Not sure if this is the most efficient, but it's one solution:
arr = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1])
arr.index = (pd.date_range('2000-12-22', '2001-01-06'))
arr = arr.astype(bool)
df = arr.reset_index() # convert to df
df['adj_year'] = df['index'].dt.year # adj_year will be adjusted for streaks
mask = (df[0].eq(True)) & (df[0].shift().eq(True))
df.loc[mask, 'adj_year'] = np.NaN # we mark streaks as NaN and fill from above
df.adj_year = df.adj_year.fillna(method='ffill').astype('int')
df.groupby('adj_year').apply(lambda x: ((x[0] == x[0].shift()).cumsum() + 1).max())
# find max streak for each adjusted year
Output:
adj_year
2000 9
2001 3
dtype: int64
Note:
By convention, variable names in Python (except for classes) are lower case, so arr as opposed to InputArray
1 and 0 are equivalent to True and False, so you can convert them to boolean without the explicit comparison
cumsum is zero-indexed (as is usual in Python) so we add 1
This solution doesn't answer the question exactly, so it won't be the final answer: a run of Trues that crosses a year boundary is counted towards both the current year and the following year.
boolean_array = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1]).astype(bool)
boolean_array.index = (pd.date_range('2000-12-22', '2001-01-06'))
distinct = boolean_array.ne(boolean_array.shift()).cumsum()
distinct_masked = distinct[boolean_array]
streak_sum = distinct_masked.value_counts()
streak_sum_series = pd.Series(streak_sum.loc[distinct_masked].values, index = distinct_masked.index.copy())
max_consect = streak_sum_series.groupby(lambda x: x.year).max()
Output:
max_consect
2000 9
2001 9
dtype: int64
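One possible way to finish this idea (not part of the original answer) is to attribute each streak to the year in which it starts, rather than to the years of all its member days:
# first day of the streak that each True date belongs to
first_day = distinct_masked.index.to_series().groupby(distinct_masked).transform('min')
# maximum streak length per starting year; should give 2000 -> 9, 2001 -> 3 for this example
max_consect = streak_sum_series.groupby(first_day.dt.year).max()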
I have several large data sets (~3000 rows, 100 columns) that I need to process with pandas. Each row represents a point on a map and there's a bunch of data associated with that point. I am doing spatial calculations (may introduce a few more variables in the future), so for each row I am only using the data from 1-4 columns. The issue is that I have to compare each row to every other row - essentially, I am trying to figure out spatial relationships between every point. At this stage in the project, I am doing calculations to determine how many points are inside a given radius for each point in the table. I have to do this 5 or 6 times (i.e. running the distance calculation function for multiple radius sizes.) This means that I end up with ~10-50 million calculations. It is slow. Very slow (like 9+ hours of computing time.)
After I run all these calculations, I need to append them as new columns in the original (very large) dataframe. Currently, I have been passing the entire dataframe to my function, which might further slow things down.
I know that many people are running this size of calculation on super computers or dedicated multicore units, but I would like to do what I can to optimize my code to run as efficiently as possible, regardless of the hardware.
I am currently using a double for loop with .iterrows(). I have stripped away as many of the unnecessary steps as possible. I may be able to pare the dataframe down to a subset, pass that to the function, and append the calculations to the original in another step, if that would help speed things up. I have also considered using .apply() to eliminate the outside loop (e.g. .apply() the inner loop to all rows in the dataframe...?)
Below, I have shown the functions that I am using. This is probably the simplest application that I have for this project... there are others that do more calculations or return pairs or groups of points based on certain spatial criteria, but this is the best example to show the basic idea of what I am doing.
import math
import pandas as pd

# specify file to be read into pandas
df = pd.read_csv('input_file.csv', low_memory=False)

# function to return distance between two points w/ (x,y) coordinates
def xy_distance_calc(x1, x2, y1, y2):
    return math.sqrt((x1 - x2)**2 + (y1 - y2)**2)

# function to calculate number of points inside a given radius for each point
def spacing_calc(data, rad_crit, col_x, col_y):
    count_list = list()
    df_list = pd.DataFrame()
    for index, row in data.iterrows():
        x_row_current = row[col_x]
        y_row_current = row[col_y]
        count = 0
        # dist_list = list()
        for index1, row1 in data.iterrows():
            x1 = row1[col_x]
            y1 = row1[col_y]
            dist = xy_distance_calc(x_row_current, x1, y_row_current, y1)
            if dist < rad_crit:
                count += 1
            else:
                continue
        count_list.append(count)
    df_list = pd.DataFrame(data=count_list, columns=[str(rad_crit) + ' radius'])
    return df_list
# call the function for each radius in question, append new data
df_2640 = spacing_calc(df, 2640.0, 'MID_X', 'MID_Y')
df = df.join(df_2640)
df_1320 = spacing_calc(df, 1320.0, 'MID_X', 'MID_Y')
df = df.join(df_1320)
df_1155 = spacing_calc(df, 1155.0, 'MID_X', 'MID_Y')
df = df.join(df_1155)
df_990 = spacing_calc(df, 990.0, 'MID_X', 'MID_Y')
df = df.join(df_990)
df_660 = spacing_calc(df, 660.0, 'MID_X', 'MID_Y')
df = df.join(df_660)
df_330 = spacing_calc(df, 330.0, 'MID_X', 'MID_Y')
df = df.join(df_330)
df.to_csv('spacing_calc_all.csv', index=None)
No errors, everything works, I just don't think it's as efficient as it could be.
Your problem is that you loop too many times. At the very least, you should calculate a distance matrix and then count how many points fall within a radius from that matrix. However, the fastest solution is to use numpy's vectorized functions, which are highly optimized C code.
As with most learning experiences, it's better to start with a small problem:
>>> import numpy as np
>>> import pandas as pd
>>> from scipy.spatial import distance_matrix
# Create a dataframe with columns two MID_X and MID_Y assigned at random
>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.uniform(1, 10, size=(5, 2)), columns=['MID_X', 'MID_Y'])
>>> df.index.name = 'PointID'
MID_X MID_Y
PointID
0 4.370861 9.556429
1 7.587945 6.387926
2 2.404168 2.403951
3 1.522753 8.795585
4 6.410035 7.372653
# Calculate the distance matrix
>>> cols = ['MID_X', 'MID_Y']
>>> d = distance_matrix(df[cols].values, df[cols].values)
array([[0. , 4.51542241, 7.41793942, 2.94798323, 2.98782637],
[4.51542241, 0. , 6.53786001, 6.52559479, 1.53530446],
[7.41793942, 6.53786001, 0. , 6.4521226 , 6.38239593],
[2.94798323, 6.52559479, 6.4521226 , 0. , 5.09021286],
[2.98782637, 1.53530446, 6.38239593, 5.09021286, 0. ]])
# The radii for which you want to measure. They need to be raised
# up 2 extra dimensions to prepare for array broadcasting later
>>> radii = np.array([3,6,9])[:, None, None]
array([[[3]],
[[6]],
[[9]]])
# Count how many points fall within a certain radius from another
# point using numpy's array broadcasting. `d < radii` will return
# an array of `True/False` and we can count the number of `True`
# by `sum` over the last axis.
#
# The distance between a point to itself is 0 and we don't want
# to count that hence the -1.
>>> count = (d < radii).sum(axis=-1) - 1
array([[2, 1, 0, 1, 2],
[3, 2, 0, 2, 3],
[4, 4, 4, 4, 4]])
# Putting everything together for export
>>> result = pd.DataFrame(count, index=radii.flatten()).stack().to_frame('Count')
>>> result.index.names = ['Radius', 'PointID']
Count
Radius PointID
3 0 2
1 1
2 0
3 1
4 2
6 0 3
1 2
2 0
3 2
4 3
9 0 4
1 4
2 4
3 4
4 4
The final result means that within a radius of 3, point #0 has 2 neighbours, point #1 has 1 neighbour, point #2 has 0 neighbours, and so on. Reshape and format the frame to your liking.
You shouldn't have a problem scaling this to thousands of points.
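If the point count grows much beyond a few thousand, a KD-tree avoids materialising the full N x N distance matrix. Here is a rough sketch using scipy.spatial.cKDTree, assuming the same MID_X/MID_Y columns and radii as in the question:
from scipy.spatial import cKDTree
import pandas as pd

df = pd.read_csv('input_file.csv', low_memory=False)
points = df[['MID_X', 'MID_Y']].to_numpy()
tree = cKDTree(points)

for radius in [2640.0, 1320.0, 1155.0, 990.0, 660.0, 330.0]:
    # query_ball_point returns, for each point, the indices of all points within `radius`
    neighbours = tree.query_ball_point(points, r=radius)
    # subtract 1 so a point does not count itself
    df[str(radius) + ' radius'] = [len(idx) - 1 for idx in neighbours]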
I'm trying to loop over each row of df1 against every row of df2, create a new column in df1, and store the minimum of all those values in it.
euclid_dist = []
short_dist = []
lat_sc = shopping_centers['lat']
long_sc = shopping_centers['lng']
for i, j in zip(lat_sc, long_sc):
    for lat_real, long_real in zip(real_estate['lat'], real_estate['lng']):
        euclid_dist.append(lat_real - i)
    short_dist.append(min(euclid_dist))
    euclid_dist = []
Desired result:
df1['shortest'] = min(df1['lat'] - each lat of df2)
df1['nearest sc'] = the corresponding sc_id
Edit: I also want to include the corresponding sc_id in df1.
This could get computationally intensive as df2 gets big, but you can find the distance between each df1 row and all the df2 rows like this (it's possible to do this more efficiently):
import numpy as np

def find_euclid_dist(row):
    dist_arr = np.sqrt((ref_lats - row["lat"])**2 + (ref_longs - row["lng"])**2)
    return np.min(dist_arr)

ref_lats = df2["lat"].values
ref_longs = df2["lng"].values
df1["shortest"] = df1.apply(find_euclid_dist, axis=1)
How about using cdist from scipy?
from scipy.spatial.distance import cdist
df1['shortest'] = cdist(df1[['lat','lng']], df2[['lat','lng']], metric='euclidean').min(1)
print(df1) returns:
lat lng addr_street shortest
0 -37.980523 -37.980523 37 Scarlet Drive 183.022436
1 -37.776161 -37.776161 999 Heidelberg Road 182.817951
2 -37.926238 -37.926238 47 New Street 182.968096
3 -37.800056 -37.800056 3/113 Normanby Road 182.841849
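If you also want the id of the nearest shopping centre (the 'nearest sc' from the edit), a possible extension of the cdist approach, assuming df2 carries an sc_id column:
import numpy as np
from scipy.spatial.distance import cdist

dists = cdist(df1[['lat', 'lng']], df2[['lat', 'lng']], metric='euclidean')
df1['shortest'] = dists.min(axis=1)
df1['nearest sc'] = df2['sc_id'].iloc[dists.argmin(axis=1)].values  # sc_id column assumed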
The first dataframe df1 contains ids and their corresponding pairs of coordinates. For each coordinate pair in the first dataframe, I have to loop through the second dataframe to find the one with the least distance. I tried taking the individual coordinates and finding the distance between them, but it does not work as expected. I believe the coordinates have to be taken as a pair when finding the distance. Not sure whether Python offers some methods to achieve this.
For eg: df1
Id Co1 Co2
334 30.371353 -95.384010
337 39.497448 -119.789623
df2
Id Co1 Co2
339 40.914585 -73.892456
441 34.760395 -77.999260
dfloc3 = [[38.991512, -77.441536],
[40.89869, -72.37637],
[40.936115, -72.31452],
[30.371353, -95.38401],
[39.84819, -75.37162],
[36.929306, -76.20035],
[40.682342, -73.979645]]
dfloc4 = [[40.914585,-73.892456],
[41.741543,-71.406334],
[50.154522,-96.88806],
[39.743565,-121.795761],
[30.027597,-89.91014],
[36.51881,-82.560844],
[30.449587,-84.23629],
[42.920475,-85.8208]]
Given you can get your points into a list like so...
df1 = [[30.371353, -95.384010], [39.497448, -119.789623]]
df2 = [[40.914585, -73.892456], [34.760395, -77.999260]]
Import math then create a function to make finding the distance easier:
import math
def distance(pt1, pt2):
    return math.sqrt((pt1[0] - pt2[0])**2 + (pt1[1] - pt2[1])**2)
Then simply traverse your list, saving the closest points:
for pt1 in df1:
    closestPoints = [pt1, df2[0]]
    for pt2 in df2:
        if distance(pt1, pt2) < distance(closestPoints[0], closestPoints[1]):
            closestPoints = [pt1, pt2]
    print("Point: " + str(closestPoints[0]) + " is closest to " + str(closestPoints[1]))
Outputs:
Point: [30.371353, -95.38401] is closest to [34.760395, -77.99926]
Point: [39.497448, -119.789623] is closest to [34.760395, -77.99926]
The code below creates a new column in df1 showing the Id of the nearest point in df2. (I can't tell from the question if this is what you want.) I'm assuming the coordinates are in a Euclidean space, i.e., that the distance between points is given by the Pythagorean Theorem. If not, you could easily use some other calculation instead of dist_squared.
import pandas as pd
df1 = pd.DataFrame(dict(Id=[334, 337], Co1=[30.371353, 39.497448], Co2=[-95.384010, -119.789623]))
df2 = pd.DataFrame(dict(Id=[339, 441], Co1=[40.914585, 34.760395], Co2=[-73.892456, -77.999260]))
def nearest(row, df):
    # calculate the Euclidean distance from the given row to all rows of df
    dist_squared = (row.Co1 - df.Co1) ** 2 + (row.Co2 - df.Co2) ** 2
    # find the closest row of df
    smallest_idx = dist_squared.argmin()
    # return the Id for the closest row of df
    return df.loc[smallest_idx, 'Id']
near = df1.apply(nearest, args=(df2,), axis=1)
df1['nearest'] = near
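With the two sample rows above, both points work out closest to Id 441 (441 has the smaller squared distance in each case), so df1 should end up looking roughly like this:
print(df1)
#     Id        Co1         Co2  nearest
# 0  334  30.371353  -95.384010      441
# 1  337  39.497448 -119.789623      441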