I have a cell grid of large dimensions. Each cell has an ID (p1), a cell value (p3) and coordinates in real units (X, Y). Here is what the first 10 rows/cells look like:
p1 p2 p3 X Y
0 0 0.0 0.0 0 0
1 1 0.0 0.0 100 0
2 2 0.0 12.0 200 0
3 3 0.0 0.0 300 0
4 4 0.0 70.0 400 0
5 5 0.0 40.0 500 0
6 6 0.0 20.0 600 0
7 7 0.0 0.0 700 0
8 8 0.0 0.0 800 0
9 9 0.0 0.0 900 0
The neighbouring cells of cell i (by its p1 value) can be determined as (i-500+1, i-500-1, i-1, i+1, i+500+1, i+500-1).
For example: p1 of 5 has the neighbours 4, 6, 504, 505, 506 (these are the IDs of the rows in the table above, i.e. p1).
What I am trying to do is:
For a chosen value/row i in p1, I would like to find all neighbours within a chosen distance of i and sum all their p3 values.
I tried to apply this solution (link), but I don't know how to incorporate the distance parameter. The cell value can be retrieved with df.iloc, but the steps before that are a bit tricky for me.
Can you give me any advice?
EDIT:
Using the solution from Thomas and having df called CO:
p3
0 45
1 580
2 12000
3 12531
4 22456
I'd like to add another column that uses the values from the p3 column:
CO['new'] = format(sum_neighbors(data, CO['p3']))
But it doesn't work. If I pass a number instead of the reference to the p3 column, it works like a charm. How can I feed the values from the p3 column into the function automatically?
SOLVED:
It worked with:
CO['new'] = CO.apply(lambda row: sum_neighbors(data, row.p3), axis=1)
Solution:
import numpy as np
import pandas

# Generating toy data
N = 10
data = pandas.DataFrame({'p3': np.random.randn(N)})
print(data)

# Finding neighbours
get_candidates = lambda i: [i-500+1, i-500-1, i-1, i+1, i+500+1, i+500-1]
filter_valid = lambda neighbors, N: [n for n in neighbors if 0 <= n < N]
get_neighbors = lambda i, N: filter_valid(get_candidates(i), N)

print("Neighbors of 5: {}".format(get_neighbors(5, len(data))))

# Summing p3 on neighbors
def sum_neighbors(data, i, col='p3'):
    return data.iloc[get_neighbors(i, len(data))][col].sum()

print("p3 sum on neighbors of 5: {}".format(sum_neighbors(data, 5)))
Output:
p3
0 -1.106541
1 -0.760620
2 1.282252
3 0.204436
4 -1.147042
5 1.363007
6 -0.030772
7 -0.461756
8 -1.110459
9 -0.491368
Neighbors of 5: [4, 6]
p3 sum on neighbors of 5: -1.1778133703169344
Notes:
I assumed p1 was range(N) as seemed to be implied (so we don't need it at all).
I don't think that 505 is a neighbour of 5 given the list of neighbors of i defined by the OP.
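A possible generalisation for the distance parameter from the question (my own sketch, not part of the answer above): if "distance" is taken to mean a number of neighbour steps d, the neighbour set can be grown by repeatedly expanding the same get_neighbors function defined in the solution before summing p3. The names neighbors_within and sum_neighbors_within are mine.
def neighbors_within(i, d, N):
    """All cells reachable from cell i in at most d neighbour steps (i itself excluded)."""
    seen = {i}
    frontier = {i}
    for _ in range(d):
        frontier = {n for f in frontier for n in get_neighbors(f, N)} - seen
        seen |= frontier
    return sorted(seen - {i})

def sum_neighbors_within(data, i, d, col='p3'):
    return data.iloc[neighbors_within(i, d, len(data))][col].sum()

print(sum_neighbors_within(data, 5, 2))  # p3 summed over every cell within 2 steps of cell 5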
Related
I have a list of places and I need to find the distance between each of them. Can anyone suggest a faster method? There are about 10k unique places; the method I'm using creates a 10k x 10k matrix and I'm running out of memory. I have 15 GB of RAM.
test_df
Latitude Longitude site
0 32.3 -94.1 1
1 35.2 -93.1 2
2 33.1 -83.4 3
3 33.2 -94.5 4
import pandas as pd
from haversine import haversine

test_df = test_df[['site', 'Longitude', 'Latitude']]
test_df['coord'] = list(zip(test_df['Longitude'], test_df['Latitude']))

for _, row in test_df.iterrows():
    test_df[row.coord] = round(test_df['coord'].apply(lambda x: haversine(row.coord, x, unit='mi')), 2)

df = test_df.rename(columns=dict(zip(test_df['coord'], test_df['site'])))
df.drop('coord', axis=1, inplace=True)

new_df = pd.melt(df, id_vars='site', value_vars=df.columns[3:])  # keep only the distance columns
new_df.rename(columns={'variable': 'Place', 'value': 'dist_in_mi'}, inplace=True)
new_df
new_df
site Place dist_in_mi
0 1 1 0.00
1 2 1 70.21
2 3 1 739.28
3 4 1 28.03
4 1 2 70.21
5 2 2 0.00
6 3 2 670.11
7 4 2 97.15
8 1 3 739.28
9 2 3 670.11
10 3 3 0.00
11 4 3 766.94
12 1 4 28.03
13 2 4 97.15
14 3 4 766.94
15 4 4 0.00
If you want to resolve your memory problem, you need to use data types that take up less memory.
In this case, since the maximum distance between two points on Earth is roughly 20,000 km, you can store the value as a uint16 (if a 1 km resolution is enough for you).
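As a quick back-of-the-envelope check (my own arithmetic, not part of the original answer), this is where the saving comes from for a 10k x 10k matrix:
n = 10000
print(n * n * 8 / 1e6)  # float64 distances: ~800 MB
print(n * n * 2 / 1e6)  # uint16 distances:  ~200 MB, matching the getsizeof figure reported below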
Since I didn't have any data to work with, I generated some with the following code:
import random
import numpy as np
from haversine import haversine

def getNFacilities(n):
    """Returns n random pairs of coordinates in the range [-90, +90]."""
    for i in range(n):
        yield random.random() * 180 - 90, random.random() * 180 - 90

facilities = list(getNFacilities(10000))
Then I resolved the memory problem in two different ways:
1- By storing the distance data in uint16 numbers
def calculateDistance(start, end):
    mirror = start is end  # if the matrix is mirrored, the values are calculated only once instead of twice
    out = np.zeros((len(start), len(end)), dtype=np.uint16)  # might be better to use empty?
    for i, coords1 in enumerate(start[mirror:], mirror):
        for j, coords2 in enumerate(end[:mirror and i or None]):
            out[i, j] = int(haversine(coords1, coords2))
    return out
After calculating the distances, the memory used by the array was about 200 MB:
In [133]: l = calculateDistance(facilities, facilities)
In [134]: sys.getsizeof(l)
Out[134]: 200000112
2- Or, alternatively, you can just use a generator:
def calculateDistance(start, end):
    mirror = start is end  # if the matrix is mirrored, the values are calculated only once
    for i, coords1 in enumerate(start[mirror:], mirror):
        for j, coords2 in enumerate(end[:mirror and i or None]):
            yield [i, j, haversine(coords1, coords2)]
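A hedged usage sketch of my own: because the generator yields one [i, j, distance] triple at a time, you can stream over all pairs without ever building the full 10k x 10k matrix, for example to keep only pairs closer than some threshold (the 100 km cut-off below is just an illustration):
# with start is end, each unordered pair is produced exactly once (lower triangle, no diagonal)
close_pairs = [(i, j, d) for i, j, d in calculateDistance(facilities, facilities) if d < 100]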
I have a df of coordinates representing points at various points in time. I want to calculate the average position of these points in relation to each other.
To achieve this, I'm aiming to calculate the offset between each point and each of the other points, and then average those offsets.
The following calculates the distance between each pair of points.
import pandas as pd
from scipy.spatial import distance
import itertools
df = pd.DataFrame({
    'Time': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'id':   ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'X':    [1.0, 3.0, 2.0, 2.0, 4.0, 3.0, 3.0, 5.0, 4.0],
    'Y':    [1.0, 1.0, 0.5, 2.0, 2.0, 2.5, 3.0, 3.0, 3.0],
})

ids = list(df['id'])

# get the points
points = df[["X", "Y"]].values

# calculate the distance of each point from every other point.
# row i contains the distances for point i.
# distances[i, j] contains the distance of point i from point j.
distances = distance.cdist(points, points, "euclidean")
distances = distances.flatten()

# get the start and end points
cartesian = list(itertools.product(ids, ids))

data = dict(
    start_region=[x[0] for x in cartesian],
    end_region=[x[1] for x in cartesian],
    distance=distances
)
df1 = pd.DataFrame(data)
All I really need to output is:
Time start_point end_point X Y
0 1 A B 2.0 0.0
1 1 A C 1.0 -0.5
2 1 B C -1.0 -0.5
3 2 A B 2.0 0.0
4 2 A C 1.0 0.5
5 2 B C -1.0 0.5
6 3 A B 2.0 0.0
7 3 A C 1.0 0.0
8 3 B C -1.0 0.0
So the average position of these points in relation to each other would be the green coordinates.
But if I average the dataset above it displays:
I understand how this occurs. It's not referencing the other points.
Here's my take on it:
import itertools

def relative_dist(gp):
    combs = list(itertools.combinations(gp.index, 2))
    df_gp = pd.concat([gp.loc[tup, :].diff() for tup in combs], keys=combs).dropna()
    return df_gp

df_dist = (df.set_index('id').groupby('Time')[['X', 'Y']].apply(relative_dist)
             .droplevel('id').rename_axis(['Time', 'start_point', 'end_point'])
             .reset_index())
Out[341]:
Time start_point end_point X Y
0 1 A B 2.0 0.0
1 1 A C 1.0 -0.5
2 1 B C -1.0 -0.5
3 2 A B 2.0 0.0
4 2 A C 1.0 0.5
5 2 B C -1.0 0.5
6 3 A B 2.0 0.0
7 3 A C 1.0 0.0
8 3 B C -1.0 0.0
df_avg = df_dist.groupby(['start_point','end_point'], as_index=False)[['X','Y']].mean()
Out[347]:
start_point end_point X Y
0 A B 2.0 0.0
1 A C 1.0 0.0
2 B C -1.0 0.0
Here's a suggestion on how to visualise the relative positions of your points. For each timestamp, I would plot an ellipse at position (X_, Y_) where:
X_ is the mean of your points' X coordinates for that timestamp.
Y_ is the mean of your points' Y coordinates for that timestamp.
The width of the ellipse equals the variance of your points' X coordinates for that timestamp.
The height of the ellipse equals the variance of your points' Y coordinates for that timestamp.
That way, at a glance and for each timestamp, you can read some very high-level statistics about the distribution of your coordinates at that timestamp.
Here's some code to generate such a visualisation:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.patches import Ellipse

# sample data with 4 timestamps
df = pd.DataFrame({
    'Time': [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
    'id':   ['A','B','C','D','A','B','C','D','A','B','C','D','A','B','C','D'],
    'X':    [1,2,1,2,1,2,1,2,4,4,3,4,10,8,5,6],
    'Y':    [1,1,3,3,1,1,2,2,5,5,8,5,6,6,7,6],
})

# for each timestamp, compute means and variances of all samples for that timestamp
means = df.groupby("Time")[["X", "Y"]].mean()
variances = df.groupby("Time")[["X", "Y"]].var()
df_ = pd.concat([means, variances], axis=1)
df_.columns = ["X_", "Y_", "var_X", "var_Y"]

# plot
fig, ax = plt.subplots(subplot_kw={'aspect': 'equal'})
for row in df_.itertuples():
    ellipse = Ellipse(xy=(row.X_, row.Y_),  # position of the ellipse is (X_, Y_)
                      width=row.var_X,      # width reflects the X variance
                      height=row.var_Y,     # height reflects the Y variance
                      angle=0)
    ax.add_artist(ellipse)
    ellipse.set_clip_box(ax.bbox)
    ellipse.set_alpha(.4)
    plt.text(x=row.X_ + 0.2, y=row.Y_ + 0.2, s=f"t={row.Index}")  # add a timestamp label
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
plt.show()
Which would look like this:
What do you think? Another idea could be to make a GIF (in case the timestamp averages overlap too much).
I have the following dataframe (it is actually several hundred MB in size):
X Y Size
0 10 20 5
1 11 21 2
2 9 35 1
3 8 7 7
4 9 19 2
I want to discard any X, Y point that has a Euclidean distance of less than delta=3 from any other X, Y point in the dataframe. In those cases I want to keep only the row with the larger Size.
In this example the intended result would be:
X Y Size
0 10 20 5
2 9 35 1
3 8 7 7
As stated, the question is not clear about how the desired algorithm should handle chaining of distances (A close to B, B close to C, but A not close to C).
If chaining is allowed, one solution is to cluster the dataset with a density-based clustering algorithm such as DBSCAN.
You just need to set the neighbourhood radius eps to delta and the min_samples parameter to 1 so that isolated points form their own clusters. Then you can find, within each group, the point with the maximum Size.
from sklearn.cluster import DBSCAN
X = df[['X', 'Y']]
db = DBSCAN(eps=3, min_samples=1).fit(X)
df['grp'] = db.labels_
df_new = df.loc[df.groupby('grp').idxmax()['Size']]
print(df_new)
>>>
X Y Size grp
0 10 20 5 0
2 9 35 1 1
3 8 7 7 2
You can use the script below and also try improving it.
# get all pairwise Euclidean distances using sklearn;
# it will create an array of distances;
# then get the index pairs from df whose Euclidean distance is less than 3
from sklearn.metrics.pairwise import euclidean_distances

Z = df[['X', 'Y']]
euc = euclidean_distances(Z, Z)
idx = [(i, j) for i in range(len(euc) - 1) for j in range(i + 1, len(euc)) if euc[i, j] < 3]

# collect all indices of df that appear in a pair with distance < 3 and find the row with the max Size among them;
# then combine the rows of df NOT involved in any such pair with that max-Size row
# into a new dataframe called df_new
from itertools import chain

df_idx = list(set(chain(*idx)))
df2 = df.iloc[df_idx]
idx_max = df2[df2['Size'] == df2['Size'].max()].index.tolist()
df_new = pd.concat([df.iloc[~df.index.isin(df_idx)], df2.loc[idx_max]])
df_new
Result:
X Y Size
2 9 35 1
3 8 7 7
0 10 20 5
This is a follow-up to this question: determine the coordinates where two pandas time series cross, and how many times the time series cross
I have 2 series in my Pandas dataframe, and would like to know where they intersect.
A B
0 1 0.5
1 2 3.0
2 3 1.0
3 4 1.0
4 5 6.0
With this code, we can create a third column that will contain True every time the two series intersect:
df['difference'] = df.A - df.B
df['cross'] = np.sign(df.difference.shift(1))!=np.sign(df.difference)
np.sum(df.cross)-1
Now, instead of a simple True or False, I would like to know in which direction the intersection took place. For example: from 1 to 2 it intersected upwards, from 2 to 3 downwards, from 3 to 4 there was no intersection, and from 4 to 5 upwards.
A B Cross_direction
0 1 0.5 None
1 2 3.0 Upwards
2 3 1.0 Downwards
3 4 1.0 None
4 5 6.0 Upwards
In pseudo-code, it should be like this:
cross_directions = [none, none, ... * series size]
for item in df['difference']:
    if item > 0 and next_item < 0:
        cross_directions.append("up")
    elif item < 0 and next_item > 0:
        cross_directions.append("down")
The problem is that next_item is not available with this syntax (in the original approach we obtain it with .shift(1)), and that this takes a lot of code.
Should I look into implementing the code above using something that can group the loop by 2 items at a time? Or is there a simpler and more elegant solution like the one from the previous question?
You can use numpy.select.
The code below should work for you:
df = pd.DataFrame({'A': [1, 2, 3, 4,5], 'B': [0.5, 3, 1, 1, 6]})
df['Diff'] = df.A - df.B
df['Cross'] = np.select(
    [(df.Diff < 0) & (df.Diff.shift() > 0),
     (df.Diff > 0) & (df.Diff.shift() < 0)],
    ['Up', 'Down'],
    'None')
#Output dataframe
A B Diff Cross
0 1 0.5 0.5 None
1 2 3.0 -1.0 Up
2 3 1.0 2.0 Down
3 4 1.0 3.0 None
4 5 6.0 -1.0 Up
My very lousy and redundant solution.
dataframe['difference'] = dataframe['A'] - dataframe['B']
dataframe['temporary_a'] = np.array(dataframe.difference) > 0
dataframe['temporary_b'] = np.array(dataframe.difference.shift(1)) < 0

cross_directions = []
for index, row in dataframe.iterrows():
    if not row['temporary_a'] and not row['temporary_b']:
        cross_directions.append("up")
    elif row['temporary_a'] and row['temporary_b']:
        cross_directions.append("down")
    else:
        cross_directions.append("not")

dataframe['cross_direction'] = cross_directions
I have a large set (thousands) of smooth lines (series of x,y pairs) with different sampling of x and y and different length for each line, i.e.
x_0 = {x_00, x_01, ..., } # length n_0
x_1 = {x_10, x_11, ..., } # length n_1
...
x_m = {x_m0, x_m1, ..., } # length n_m
y_0 = {y_00, y_01, ..., } # length n_0
y_1 = {y_10, y_11, ..., } # length n_1
...
y_m = {y_m0, y_m1, ..., } # length n_m
I want to find cumulative properties of each line interpolated to a regular set of x points, i.e. x = {x_0, x_1 ..., x_n-1}
Currently I'm looping over each line, creating an interpolant, resampling, and then taking the sum/median/whatever of the result. It works, but it's really slow. Is there any way to vectorize/matricize this operation?
I was thinking, since linear interpolation can be a matrix operation, perhaps it's possible. At the same time, since each row can have a different length... it might be complicated. Edit: but zero padding the shorter arrays would be easy...
What I'm doing now looks something like,
import numpy as np
import scipy as sp
import scipy.interpolate
...
# `xx` and `yy` are lists of lists with the x and y points respectively
# `xref` are the reference x values at which I want interpolants
yref = np.zeros([len(xx), len(xref)])
for ii, (xi, yi) in enumerate(zip(xx, yy)):
    yref[ii] = np.interp(xref, xi, yi)  # scipy.interp was only an alias for numpy.interp in older SciPy versions
y_med = np.median(yref, axis=-1)
y_sum = np.sum(yref, axis=-1)
...
Hopefully, you can adjust the following for your purposes.
I included pandas because it has an interpolation feature to fill in missing values.
Setup
import pandas as pd
import numpy as np
x = np.arange(19)
x_0 = x[::2]
x_1 = x[::3]
np.random.seed([3,1415])
y_0 = x_0 + np.random.randn(len(x_0)) * 2
y_1 = x_1 + np.random.randn(len(x_1)) * 2
xy_0 = pd.DataFrame(y_0, index=x_0)
xy_1 = pd.DataFrame(y_1, index=x_1)
Note:
x is length 19
x_0 is length 10
x_1 is length 7
xy_0 looks like:
0
0 -4.259448
2 -0.536932
4 0.059001
6 1.481890
8 7.301427
10 9.946090
12 12.632472
14 14.697564
16 17.430729
18 19.541526
xy_0 can be aligned with x via reindex
xy_0.reindex(x)
0
0 -4.259448
1 NaN
2 -0.536932
3 NaN
4 0.059001
5 NaN
6 1.481890
7 NaN
8 7.301427
9 NaN
10 9.946090
11 NaN
12 12.632472
13 NaN
14 14.697564
15 NaN
16 17.430729
17 NaN
18 19.541526
We can then fill in the missing values with interpolate:
xy_0.reindex(x).interpolate()
0
0 -4.259448
1 -2.398190
2 -0.536932
3 -0.238966
4 0.059001
5 0.770445
6 1.481890
7 4.391659
8 7.301427
9 8.623759
10 9.946090
11 11.289281
12 12.632472
13 13.665018
14 14.697564
15 16.064147
16 17.430729
17 18.486128
18 19.541526
What about xy_1?
xy_1.reindex(x)
0
0 -1.216416
1 NaN
2 NaN
3 3.704781
4 NaN
5 NaN
6 5.294958
7 NaN
8 NaN
9 8.168262
10 NaN
11 NaN
12 10.176849
13 NaN
14 NaN
15 14.714924
16 NaN
17 NaN
18 19.493678
Interpolated
xy_1.reindex(x).interpolate()
0
0 -1.216416
1 0.423983
2 2.064382
3 3.704781
4 4.234840
5 4.764899
6 5.294958
7 6.252726
8 7.210494
9 8.168262
10 8.837791
11 9.507320
12 10.176849
13 11.689541
14 13.202233
15 14.714924
16 16.307842
17 17.900760
18 19.493678
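To tie this back to the original goal of cumulative properties per line, a hedged extension of the same idea (my sketch, reusing x, xy_0 and xy_1 from the setup above; with thousands of lines you would build the list of frames programmatically) is to align every series on the common index and then aggregate each column:
aligned = pd.concat(
    [xy_0.reindex(x).interpolate(), xy_1.reindex(x).interpolate()],
    axis=1, keys=['line_0', 'line_1'])
y_sum = aligned.sum(axis=0)     # one sum per line, analogous to np.sum(yref, axis=-1)
y_med = aligned.median(axis=0)  # one median per line, analogous to np.median(yref, axis=-1)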