I have several large data sets (~3000 rows, 100 columns) that I need to process with pandas. Each row represents a point on a map, and there's a bunch of data associated with that point. I am doing spatial calculations (and may introduce a few more variables in the future), so for each row I only use the data from 1-4 columns. The issue is that I have to compare each row to every other row - essentially, I am trying to figure out spatial relationships between every point. At this stage in the project, I am calculating how many points fall inside a given radius of each point in the table. I have to do this 5 or 6 times (i.e. running the distance calculation function for multiple radius sizes), which means I end up with ~10-50 million calculations. It is slow. Very slow (like 9+ hours of computing time).
After I run all these calculations, I need to append them as new columns in the original (very large) dataframe. Currently, I have been passing the entire dataframe to my function, which might further slow things down.
I know that many people run this size of calculation on supercomputers or dedicated multicore units, but I would like to do what I can to optimize my code to run as efficiently as possible, regardless of the hardware.
I am currently using a double for loop with .iterrows(). I have stripped away as many of the unnecessary steps as possible. I may be able to pare the dataframe down to a subset, pass that to the function, and append the calculations to the original in another step, if that would help speed things up. I have also considered using .apply() to eliminate the outer loop (e.g. .apply() the inner loop to all rows in the dataframe...?)
Below, I have shown the functions that I am using. This is probably the simplest application that I have for this project... there are others that do more calculations/return pairs or groups of points based on certain spatial criteria, but this is the best example to show the basic idea of what I am doing.
# imports needed by the snippets below
import math
import pandas as pd

# specify file to be read into pandas
df = pd.read_csv('input_file.csv', low_memory=False)

# function to return distance between two points w/ (x,y) coordinates
def xy_distance_calc(x1, x2, y1, y2):
    return math.sqrt((x1 - x2)**2 + (y1 - y2)**2)

# function to calculate number of points inside a given radius for each point
def spacing_calc(data, rad_crit, col_x, col_y):
    count_list = list()
    df_list = pd.DataFrame()
    for index, row in data.iterrows():
        x_row_current = row[col_x]
        y_row_current = row[col_y]
        count = 0
        # dist_list = list()
        for index1, row1 in data.iterrows():
            x1 = row1[col_x]
            y1 = row1[col_y]
            dist = xy_distance_calc(x_row_current, x1, y_row_current, y1)
            if dist < rad_crit:
                count += 1
        count_list.append(count)
    df_list = pd.DataFrame(data=count_list, columns=[str(rad_crit) + ' radius'])
    return df_list
# call the function for each radius in question, append new data
df_2640 = spacing_calc(df, 2640.0, 'MID_X', 'MID_Y')
df = df.join(df_2640)
df_1320 = spacing_calc(df, 1320.0, 'MID_X', 'MID_Y')
df = df.join(df_1320)
df_1155 = spacing_calc(df, 1155.0, 'MID_X', 'MID_Y')
df = df.join(df_1155)
df_990 = spacing_calc(df, 990.0, 'MID_X', 'MID_Y')
df = df.join(df_990)
df_660 = spacing_calc(df, 660.0, 'MID_X', 'MID_Y')
df = df.join(df_660)
df_330 = spacing_calc(df, 330.0, 'MID_X', 'MID_Y')
df = df.join(df_330)
df.to_csv('spacing_calc_all.csv', index=None)
No errors, everything works, I just don't think it's as efficient as it could be.
Your problem is that you loop too many times. At the very least, you should calculate a distance matrix and then count how many points fall within each radius from that matrix. The fastest solution, however, is to use numpy's vectorized functions, which are backed by highly optimized C code.
As with most learning experiences, it's better to start with a small problem:
>>> import numpy as np
>>> import pandas as pd
>>> from scipy.spatial import distance_matrix
# Create a dataframe with two columns, MID_X and MID_Y, assigned at random
>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.uniform(1, 10, size=(5, 2)), columns=['MID_X', 'MID_Y'])
>>> df.index.name = 'PointID'
>>> df
MID_X MID_Y
PointID
0 4.370861 9.556429
1 7.587945 6.387926
2 2.404168 2.403951
3 1.522753 8.795585
4 6.410035 7.372653
# Calculate the distance matrix
>>> cols = ['MID_X', 'MID_Y']
>>> d = distance_matrix(df[cols].values, df[cols].values)
>>> d
array([[0. , 4.51542241, 7.41793942, 2.94798323, 2.98782637],
[4.51542241, 0. , 6.53786001, 6.52559479, 1.53530446],
[7.41793942, 6.53786001, 0. , 6.4521226 , 6.38239593],
[2.94798323, 6.52559479, 6.4521226 , 0. , 5.09021286],
[2.98782637, 1.53530446, 6.38239593, 5.09021286, 0. ]])
# The radii for which you want to measure. They need to be raised
# up 2 extra dimensions to prepare for array broadcasting later
>>> radii = np.array([3,6,9])[:, None, None]
>>> radii
array([[[3]],
[[6]],
[[9]]])
# Count how many points fall within a certain radius from another
# point using numpy's array broadcasting. `d < radii` will return
# an array of `True/False` and we can count the number of `True`
# by `sum` over the last axis.
#
# The distance between a point to itself is 0 and we don't want
# to count that hence the -1.
>>> count = (d < radii).sum(axis=-1) - 1
>>> count
array([[2, 1, 0, 1, 2],
[3, 2, 0, 2, 3],
[4, 4, 4, 4, 4]])
# Putting everything together for export
>>> result = pd.DataFrame(count, index=radii.flatten()).stack().to_frame('Count')
>>> result.index.names = ['Radius', 'PointID']
>>> result
Count
Radius PointID
3 0 2
1 1
2 0
3 1
4 2
6 0 3
1 2
2 0
3 2
4 3
9 0 4
1 4
2 4
3 4
4 4
The final result means that within a radius of 3, point #0 has 2 neighbours, point #1 has 1 neighbour, point #2 has 0 neighbours, and so on. Reshape and format the frame to your liking.
You shouldn't have a problem scaling this to thousands of points.
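To map this back onto the original question, a minimal sketch could look like the following (it assumes the CSV and the MID_X/MID_Y columns from the question, and that a ~3000 x 3000 distance matrix fits comfortably in memory):
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix

df = pd.read_csv('input_file.csv', low_memory=False)

# pairwise distances between all points, shape (n, n)
coords = df[['MID_X', 'MID_Y']].values
d = distance_matrix(coords, coords)

# count neighbours within each radius, excluding the point itself
for rad in [2640.0, 1320.0, 1155.0, 990.0, 660.0, 330.0]:
    df[str(rad) + ' radius'] = (d < rad).sum(axis=1) - 1

df.to_csv('spacing_calc_all.csv', index=None)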
I have a dataframe that looks like this:
index Rod_1 label
0 [[1.94559799] [1.94498416] [1.94618273] ... [1.8941952 ] [1.89461277] [1.89435902]] F0
1 [[1.94129488] [1.94268905] [1.94327065] ... [1.93593512] [1.93689935] [1.93802091]] F0
2 [[1.94034818] [1.93996006] [1.93940095] ... [1.92700882] [1.92514855] [1.92449449]] F0
3 [[1.95784532] [1.96333782] [1.96036528] ... [1.94958261] [1.95199495] [1.95308231]] F2
Each cell in the Rod_1 column holds an array of 12 million values. I'm trying to calculate the difference between every two consecutive values in this array to remove seasonality. That way my model will perform better, potentially.
This is the code that I've written:
interval = 1
for j in range(0, len(df_all['Rod_1'])):
    for i in range(1, len(df_all['Rod_1'][0])):
        df_all['Rod_1'][j][i - interval] = df_all['Rod_1'][j][i] - df_all['Rod_1'][j][i - interval]
I have 45 rows, and as I said each cell has 12 million values, so it takes my laptop 20 min to calculate this. Is there a faster way to do this?
Thanks in advance.
This should be much faster. I've tested up to 1M elements per cell for 10 rows, which took 1.5 seconds to calculate the diffs (but a lot longer to build the test table).
import pandas as pd
import numpy as np
import time

# Create test data
np.random.seed(1)
num_rows = 10
rod1_array_lens = 5  # I tried with this at 1000000
possible_labels = ['F0', 'F1']
df = pd.DataFrame({
    'Rod_1': [[[np.random.randint(10)] for _ in range(rod1_array_lens)] for _ in range(num_rows)],
    'label': np.random.choice(possible_labels, num_rows)
})

# flatten Rod_1 from [[1],[2],[3]] --> [1,2,3]
# then use np.roll to make the diffs, throwing away the last element since it rolls over
start = time.time()  # start timing now
df['flat_Rod_1'] = df['Rod_1'].apply(lambda v: np.array([z for x in v for z in x]))
df['diffs'] = df['flat_Rod_1'].apply(lambda v: (np.roll(v, -1) - v)[:-1])
print('Took', time.time() - start, 'to calculate diff')
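As an aside, numpy's np.diff computes exactly these consecutive differences, so the roll-and-slice step could arguably be replaced with a one-liner:
df['diffs'] = df['flat_Rod_1'].apply(np.diff)  # v[1:] - v[:-1], same result as (np.roll(v, -1) - v)[:-1]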
I have a fairly large dataframe on which I want to:
1. row-wise, search for an interval containing a value
2. perform a linear interpolation between the two elements found at point 1 and two elements from another array
3. add a column to the dataframe with the interpolated values
What I have done involves a for loop, i.e.:
Given a sample of the dataframe Fak
beta0 beta1 beta2 beta3 beta4 beta5 beta6 beta7 beta8 beta9 beta10
0 0.008665 0.061391 0.159690 0.223275 0.232535 0.251266 0.279847 0.465671 0.672253 0.914753 1.0
1 0.009121 0.064322 0.166623 0.232418 0.241945 0.261106 0.290169 0.477621 0.682283 0.916384 1.0
2 0.009491 0.066689 0.172210 0.239776 0.249516 0.269020 0.298463 0.487108 0.690031 0.917638 1.0
3 0.009733 0.068232 0.175837 0.244542 0.254418 0.274140 0.303820 0.493102 0.694703 0.918304 1.0
4 0.009860 0.069027 0.177687 0.246963 0.256906 0.276734 0.306523 0.495985 0.696696 0.918511 1.0
I have an array psi
[-12.97, -11.97, -10.97, -9.97, -8.97, -7.97, -6.97, -5.97, -4.97, -3.97, -2.97, -1.97]
I define the value I want to search in Fak, i.e. intF = 0.16
I calculate the new dataframe with the following loop
dxlist = []
for i, Faki in Fak.iterrows():
    # interpolation boundaries ID
    if intF == 0.0:
        ip1 = 1
    elif intF == 1.0:
        ip1 = -1
    else:
        ip1 = np.where(Faki > int(intF)/100)[0][0]
    im1 = ip1 - 1
    # coefficients
    dfak = Faki[ip1] - Faki[im1]
    dpsi = psi[ip1] - psi[im1]
    m = dfak/dpsi
    q = Faki[im1] - m*psi[im1]
    # calculate
    intPsi = (int(intF)/100 - q)/m
    intDi = 2**intPsi
    dxlist.append(intDi)
dfout['newcolumn'] = dxlist
which works, but it is quite slow.
What I am missing is how to calculate the linear interpolation row-wise and use the indices on an outside array.
Apparently I found a vectorized solution:
psidf = Fak.copy()
psidf.loc[Fak.index] = psi
Fakp1 = Fak[Fak.ge(intF/100)].fillna(method='bfill',axis=1).iloc[:,0]
Fakm1 = Fak[Fak.le(intF/100)].fillna(method='ffill',axis=1).iloc[:,-1]
psip1 = psidf[Fak.ge(intF/100)].fillna(method='bfill',axis=1).iloc[:,0]
psim1 = psidf[Fak.le(intF/100)].fillna(method='ffill',axis=1).iloc[:,-1]
m = (Fakp1-Fakm1)/(psip1-psim1)
q = Fakm1-m*psim1
intDi_series = 2**((intF/100-q)/m)
intDi['d'+str(int(intF))+nsfx] = intDi_series
The key is to generate a dataframe with the array repeated as rows, having the same shape as Fak (which is done in the first two lines of the code above).
Then I isolate the columns I need from each dataframe using the ge and le methods of a pandas DataFrame, and use those positions to pick the corresponding values out of the newly generated dataframe.
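For reference, since each row of Fak is monotonically increasing, the same row-wise bracketing can also be written with plain numpy indexing. This is only a sketch, under the assumptions that the target value falls strictly inside each row's range and that psi covers at least as many entries as Fak has columns:
import numpy as np

F = Fak.values                  # (n_rows, n_cols), each row sorted ascending
psi_arr = np.asarray(psi)
target = intF / 100             # same target expression as in the code above

ip1 = (F < target).sum(axis=1)  # per row: index of the first column with F >= target
im1 = ip1 - 1
rows = np.arange(len(F))

m = (F[rows, ip1] - F[rows, im1]) / (psi_arr[ip1] - psi_arr[im1])
q = F[rows, im1] - m * psi_arr[im1]
intDi_alt = 2 ** ((target - q) / m)  # should match intDi_series above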
I'm trying to implement a runs-based kernel using Numba CUDA, where I need to traverse the elements of a 3D matrix on a row-per-thread basis, i.e. each thread is assigned a row and iterates over all elements of that row.
For example, if, for simplicity, I were to use a 2D matrix with 50 rows and 100 columns, I would need to create 50 threads that would go through the 100 elements of their respective row.
Can someone tell me how to do this?
Turns out it's actually quite simple. You only need to launch as many threads as there are rows (times slices, here) and have each thread loop over the columns of its row inside the kernel.
Here’s a simple kernel that demonstrates how to do such an iteration over a 3D matrix (binary_image). The kernel itself is part of the CCL algorithm I’m implementing but that can safely be ignored:
from numba import cuda

@cuda.jit
def kernel_1(binary_image, image_width, s_matrix, labels_matrix):
    # notice how we're only getting the row and depth of each thread
    row, image_slice = cuda.grid(2)
    sm_pos, lm_pos = 0, 0
    span_found = False
    if row < binary_image.shape[0] and image_slice < binary_image.shape[2]:  # guard for rows and slices
        # and here's the traversing over the columns
        for column in range(binary_image.shape[1]):
            if binary_image[row, column, image_slice] == 0:
                if not span_found:  # Connected Component found
                    span_found = True
                    s_matrix[row, sm_pos, image_slice] = column
                    sm_pos = sm_pos + 1
                    # converting 2D coordinate to 1D
                    linearized_index = row * image_width + column
                    labels_matrix[row, lm_pos, image_slice] = linearized_index
                    lm_pos = lm_pos + 1
                else:
                    s_matrix[row, sm_pos, image_slice] = column
            elif binary_image[row, column, image_slice] == 255 and span_found:
                span_found = False
                s_matrix[row, sm_pos, image_slice] = column - 1
                sm_pos = sm_pos + 1
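For completeness, launching such a kernel could look roughly like the sketch below; the array shapes and the block/grid sizes are illustrative assumptions, not values from the original post:
import numpy as np

# hypothetical input: rows x columns x slices
binary_image = np.random.choice([0, 255], size=(50, 100, 4)).astype(np.int32)
s_matrix = np.zeros_like(binary_image)
labels_matrix = np.zeros_like(binary_image)
image_width = binary_image.shape[1]

# one thread per (row, slice) pair; each thread loops over the columns itself
threads_per_block = (16, 4)  # arbitrary choice
blocks_per_grid = (
    (binary_image.shape[0] + threads_per_block[0] - 1) // threads_per_block[0],
    (binary_image.shape[2] + threads_per_block[1] - 1) // threads_per_block[1],
)
kernel_1[blocks_per_grid, threads_per_block](binary_image, image_width, s_matrix, labels_matrix)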
Learning Python, just began last week; I haven't otherwise coded for about 20 years and was never that advanced to begin with. I've got the hello-world thing down, and now I'm trying to backtest FX pairs. Any help up the learning curve is appreciated, and of course I'm scouring this site while on my Lynda vids.
I'm getting a funky error, and also wondering if there are blatantly more efficient ways to loop through columns of Excel data the way I am.
The spreadsheet being read is simple: 56 FX pairs down column A, and 8 columns across whose headers are dates, where the cells in each column are the respective FX pair's closing price on that date. The strategy starts at the top of the 2nd column (so that a return % can be calculated vs. the prior period), calculates period/period % returns for each pair, identifies which is the maximum, and then "goes long" that highest performer, whose performance in the subsequent period is recorded as PnL to the portfolio ("p" in the code). It loops through that until the current, most recent column is read.
The error relates to using 8 columns instead of 7: it works when I limit the loop to 7 columns but not 8. With 8 I get a wall of text concluding with "IndexError: index 8 is out of bounds for axis 0 with size 8". I get a similar error when I use too many rows, 56 instead of 55, so I think I'm missing the bottom row.
Here's my code:
# set up imports
import pandas as pd

# import spreadsheet
x1 = pd.ExcelFile(r"C:\Users\Gamblor\Desktop\Python\test2020.xlsx")
df = pd.read_excel(x1, "Sheet1", header=1)

# define counters for loops
o = 1  # observation counter
c = 3  # column counter
r = 0  # active row counter for sorting through for max

# define identifiers for the portfolio
rpos = 0  # static row, for identifying which currency pair is in column 0 of that row
p = 100   # portfolio size starts at $100

# define the stuff we are evaluating for
pair = df.iat[r, 0]    # starting pair at 0,0 where each loop will begin
pair_pct_rtn = 0       # starts out at zero, becomes something at first evaluation, then gets compared to each subsequent eval
pair_pct_rtn_calc = 0  # a second version of above, for comparison to prior return

# runs a loop starting at the top to find the max period/period % return in a specific column
while (c < 8):       # manually limiting this to 5 columns left to right
    while (r < 55):  # i am manually limiting this to 55 data rows per the spreadsheet ... would be better if automatic
        pair_pct_rtn_calc = ((df.iat[r, c]) / (df.iat[r, c - 1]) - 1)
        if pair_pct_rtn_calc > pair_pct_rtn:  # if it's a higher return, it must be the "max" to that point
            pair = df.iat[r, 0]               # identifies the max pair for this column observation, so far
            pair_pct_rtn = pair_pct_rtn_calc  # sets pair_pct_rtn as the new max
            rpos = r                          # identifies the max pair's ROW for this column observation, so far
        r = r + 1  # adds to r in order to jump down and calc the next row
    print('in obs #', o, ', ', pair, 'did best at', pair_pct_rtn, '.')
    o = o + 1
    # now adjust the portfolio by however well USDMXN did in the subsequent week
    p = p * (1 + ((df.iat[rpos, c + 1]) / (df.iat[rpos, c]) - 1))
    print('then the subsequent period it did: ', (df.iat[rpos, c + 1]) / (df.iat[rpos, c]) - 1)
    print('resulting in portfolio value of', p)
    rpos = 0
    r = 0
    pair_pct_rtn = 0
    c = c + 1  # adds to c in order to move to the next period to the right

print(p)
Since indices are labelled from 0 onwards, the 8th element you are looking for will have index 7. Likewise, row index 55 (the 56th row) will be your last row.
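As a side note, those manual limits could be taken from the dataframe itself instead of being hard-coded. A sketch of the idea (the column arithmetic assumes the layout described in the question):
import pandas as pd

df = pd.read_excel(r"C:\Users\Gamblor\Desktop\Python\test2020.xlsx", "Sheet1", header=1)
n_rows, n_cols = df.shape    # rows = FX pairs, cols = pair label column + date columns

c = 3
while c < n_cols - 1:        # the portfolio update reads column c+1, so stop one column early
    r = 0
    while r < n_rows:        # covers every data row, including the bottom one
        # ... same per-row return comparison as in the question ...
        r = r + 1
    c = c + 1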
I'll try to explain what I'm currently working with:
I have two dataframes: one for Gas Station A (165 stations), and other for Gas Station B (257 stations). They both share the same format:
id Coor
1 (a1,b1)
2 (a2,b2)
Coor has tuples with the location coordinates. What I want to do is to add 3 columns to Dataframe A with nearest Competitor #1, #2 and #3 (from Gas Station B).
Currently I managed to get every distance from A to B (42405 distance measures), but in a list format:
distances = []
for (u, v) in gasA['coor']:
    for (w, x) in gasB['coor']:
        distances.append(sp.distance.euclidean((u, v), (w, x)))
This lets me have the values I need, but I still need to match them with the ID from Gas Station A, and get the top 3. I have the suspicion working with lists is not the best approach here. Do you have any suggestions?
Edit: as suggested, first 5 rows are:
in GasA:
id coor
60712 (-333525363206695,-705191013427772)
60512 (-333539879388388, -705394161580837)
60085 (-333545609177068, -703168832659184)
60110 (-333601677229216, -705167284798638)
60078 (-333608898397271, -707213099595404)
in GasB:
id coor
70174 (-333427160000000,-705459060000000)
70223 (-333523030000000, -706705470000000)
70383 (-333549270000000, -705320990000000)
70162 (-333556960000000, -705384750000000)
70289 (-333565850000000, -705104360000000)
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np
Creating the data:
A = pd.DataFrame({'id': ['60712', '60512', '60085', '60110', '60078'],
                  'coor': [(-333525363206695, -705191013427772),
                           (-333539879388388, -705394161580837),
                           (-333545609177068, -703168832659184),
                           (-333601677229216, -705167284798638),
                           (-333608898397271, -707213099595404)]})
B = pd.DataFrame({'id': ['70174', '70223', '70383', '70162', '70289'],
                  'coor': [(-333427160000000, -705459060000000),
                           (-333523030000000, -706705470000000),
                           (-333549270000000, -705320990000000),
                           (-333556960000000, -705384750000000),
                           (-333565850000000, -705104360000000)]})
Calculating the distances:
res = euclidean_distances(list(A.coor), list(B.coor))
Selecting top 3 closest stations from B and appending to a column in A:
d = []
for i, id_ in enumerate(A.index):
    distances = np.argsort(res[i])[0:3]   # select top 3
    distances = B.iloc[distances]['id'].values
    d.append(distances)
A = A.assign(dist=d)
Edit - result of running with the example data:
coor id dist
0 (-333525363206695, -705191013427772) 60712 [70223, 70174, 70162]
1 (-333539879388388, -705394161580837) 60512 [70223, 70289, 70174]
2 (-333545609177068, -703168832659184) 60085 [70223, 70174, 70162]
3 (-333601677229216, -705167284798638) 60110 [70223, 70174, 70162]
4 (-333608898397271, -707213099595404) 60078 [70289, 70383, 70162]
Define a function that calculates the distances from A to all B's and returns indices of B with the three smallest distances.
def get_nearest_three(row):
    (u, v) = row['Coor']
    dist_list = gasB.Coor.apply(sp.distance.euclidean, args=[(u, v)])
    # we want the indices of the 3 rows of B with the smallest distances
    return list(np.argsort(dist_list))[0:3]

gasA['dists'] = gasA.apply(get_nearest_three, axis=1)
You can do something like this.
a = np.array(gasA.coor.tolist())   # shape (nA, 2)
b = np.array(gasB.coor.tolist())   # shape (nB, 2)
c = np.sqrt(np.sum((a[:, None, :] - b[None, :, :])**2, axis=-1))   # shape (nA, nB)
We can get the numpy arrays of the coordinates for both, broadcast a against b to form every combination, and then take the euclidean distance.
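From that matrix, the three nearest B stations for each A row can then be read off with argsort (a sketch reusing c from the code above and the id columns from the example data):
nearest3 = np.argsort(c, axis=1)[:, :3]   # column indices of the 3 smallest distances per A row
gasA['dist'] = [gasB['id'].values[idx] for idx in nearest3]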
Consider a cross join (matching every row by every row between both datasets) which should be manageable with your small sets, 165 X 257, then calculate the distance. Then, rank by distance and filter for top 3.
cj_df = pd.merge(gasA.assign(key=1), gasB.assign(key=1),
                 on="key", suffixes=['_A', '_B'])

cj_df['distance'] = cj_df.apply(lambda row: sp.distance.euclidean(row['Coor_A'],
                                                                  row['Coor_B']),
                                axis=1)

# RANK BY DISTANCE
cj_df['rank'] = cj_df.groupby('id_A')['distance'].rank()

# FILTER FOR TOP 3
top3_df = cj_df[cj_df['rank'] <= 3].sort_values(['id_A', 'rank'])