For example, I have a dataframe (df) whose target column is df['Z'], plus two other columns, df['X'] and df['Y']. All of this data comes from real-world data collection.
How can I express Z as the following functions in Python (i.e. fit Z as a function of X and Y)?
> 1. Z = f(X)
> 2. Z = f(X,Y)
Here is how you can do that:

def function(x, y):
    return x + y + 4  # obviously the function can be more complex

df["Z"] = function(df["A"], df["B"])
Example:

import pandas as pd

data = {'A': [x for x in range(5)], 'B': [x for x in range(6, 11)]}
df = pd.DataFrame(data)

def function(x, y):
    return x + y + 4

df["Z"] = function(df["A"], df["B"])
print(df)
Output:
A B Z
0 0 6 10
1 1 7 12
2 2 8 14
3 3 9 16
4 4 10 18
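Note that if you don't know f in advance and actually want to fit it from the X, Y, Z data, you need a model-fitting routine rather than a plain function application. A minimal sketch using scipy.optimize.curve_fit, assuming a linear model Z = a*X + b*Y + c (the model form and the Z_fit column are illustrative placeholders, not something from the question):

import numpy as np
from scipy.optimize import curve_fit

# assumed model form: Z = a*X + b*Y + c
def model(xy, a, b, c):
    x, y = xy
    return a * x + b * y + c

xdata = np.vstack([df["X"].to_numpy(), df["Y"].to_numpy()])
popt, pcov = curve_fit(model, xdata, df["Z"].to_numpy())
df["Z_fit"] = model(xdata, *popt)  # fitted values from the estimated coefficients

For the one-variable case Z = f(X), the same call works with a single array as xdata.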
I have lots of files containing x, y, yerr columns. I read each one, apply a transformation to the x values, and then I would like to limit the range of the new x values (newx) that I use afterwards:
import numpy as np
import pandas as pd
from lmfit.models import ConstantModel, GaussianModel

for key, value in files_data.items():
    file_short_name = key
    D_value_sale = value[1]
    data = pd.DataFrame(value[0])
    if data.shape[1] == 3:
        data.columns = ["x", "y", "yerr"]
    else:
        data.columns = ["x", "y"]
    D = D_value_sale
    b = 111
    c = 222
    data["newx"] = -c * (((data.x * (1 / (1 + D))) - b) / b)
    data["newy"] = (data.y - data.y.min()) / (data.y.max() - data.y.min())
    w = data[(data.newx < 20000) & (data.newx > 8000)]
    dfx = w["newx"]
    dfy = w["newy"]
    offset = ConstantModel()  # provides the constant c used below
    peak = GaussianModel()
    model = offset + peak  # composite model used for the fit
    pars = offset.make_params(c=np.median(dfy))
    pars += peak.guess(dfy, x=dfx, amplitude=-0.5)
    result = model.fit(dfy, pars, x=dfx)
If I'm understanding correctly what you are asking, this is what you could do:
for key, value in files_data.items():
    file_short_name = key
    # main = value[1]
    data = pd.DataFrame(value[0])
    if data.shape[1] == 3:
        data.columns = ["x", "y", "yerr"]
    else:
        # Here you should define what happens in case
        # the data isn't what you expected it to be
        pass
    data["newx"] = data.x + 1  # Perform whatever transformation you need
    # data["newy"] = data.y * 1.01234  # Etc.
    # Then you can limit the newx column (assign the result, or the filter is lost):
    data = data[(data.newx < upper_limit) & (data.newx > lower_limit)]
What you're doing won't work if you want to preserve the relationship between columns. When you assign the data columns to their own variables xval, yval and error, you implicitly "lose" their relationship.
I'll open with the same caveat of "if I'm understanding you correctly": the crux of what you are looking for is the boolean array that you have created to apply your limits:
data = data[(data[0] >= xlim[0]) & (data[0] <= xlim[1])]
This boolean array can be saved and applied to any array of the same shape.
idx = (data[0] >= xlim[0]) & (data[0] <= xlim[1])
filtered_data = data[0][idx]
filtered_newxval = newxval[idx]
By way of a more complete, self-contained example, see the code below, where this concept is applied to multidimensional arrays and pandas dataframes.
import numpy as np
import pandas as pd
np.random.seed(42)
x = np.random.randint(0, 20, 10)
y = np.random.randint(0, 20, 10)
print("x", x)
# >>> x [ 6 19 14 10 7 6 18 10 10 3]
print("y", y)
# >>> y [ 7 2 1 11 5 1 0 11 11 16]
xmin = 3
xmax = 17
idx = (x >= xmin) & (x <= xmax)
data = np.vstack((x, y))
print("filtered_data:\n", data[:, idx])
# >>> filtered_data:
# [[ 6 14 10 7 6 10 10 3]
# [ 7 1 11 5 1 11 11 16]]
df = pd.DataFrame({"x": x, "y": y})
df["xnew"] = df["x"] * 2
print(df[idx])
# >>> x y xnew
# >>> 0 6 7 12
# >>> 2 14 1 28
# >>> 3 10 11 20
# >>> 4 7 5 14
# >>> 5 6 1 12
# >>> 7 10 11 20
# >>> 8 10 11 20
# >>> 9 3 16 6
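In pandas specifically, the same mask can also be built with Series.between, which is inclusive on both ends by default (equivalent to the >= and <= comparisons above):

mask = df["x"].between(xmin, xmax)
print(df[mask])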
So I made this dataframe:

import numpy as np
import pandas as pd

alp = "abcdefghijklmnopqrstuvwxyz0123456789"
s = "carl"
for i in s:
    alp = alp.replace(i, "")
jaa = s + alp
x = list(jaa)
array = np.array(x)
re = np.reshape(array, (6, 6))
dt = pd.DataFrame(re)
dt.columns = [1, 2, 3, 4, 5, 6]
dt.index = [1, 2, 3, 4, 5, 6]
dt
1 2 3 4 5 6
1 c a r l b d
2 e f g h i j
3 k m n o p q
4 s t u v w x
5 y z 0 1 2 3
6 4 5 6 7 8 9
I want to search for a value and print its row (index) and column.
For example, for 'h' the output I want is 2,4.
Is there any way to get that output?
row, col = np.where(dt == "h")
print(dt.index[row[0]], dt.columns[col[0]])
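Note that np.where returns every match, so if the value can occur more than once you can iterate over all the hits:

rows, cols = np.where(dt == "h")
for r, c in zip(rows, cols):
    print(dt.index[r], dt.columns[c])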
I have two dataframes:
1) A list of suppliers and their lat, long coordinates:

sup_essential = pd.DataFrame({'supplier': ['A', 'B', 'C'],
                              'coords': [(51.1235, -0.3453), (52.1245, -0.3423), (53.1235, -1.4553)]})

2) A list of stores and their lat, long coordinates:

stores_essential = pd.DataFrame({'storekey': [1, 2, 3],
                                 'coords': [(54.1235, -0.6553), (49.1245, -1.3423), (50.1235, -1.8553)]})
I want to create an output table that has: store, store_coordinates, supplier, supplier_coordinates, distance for every combination of store and supplier.
I currently have:
test = []
for row in sup_essential.iterrows():
    for row in stores_essential.iterrows():
        r = sup_essential['supplier'], stores_essential['storeKey']
        test.append(r)

But this just gives me repeats of all the values.
Source DFs
In [105]: sup
Out[105]:
coords supplier
0 (51.1235, -0.3453) A
1 (52.1245, -0.3423) B
2 (53.1235, -1.4553) C
In [106]: stores
Out[106]:
coords storekey
0 (54.1235, -0.6553) 1
1 (49.1245, -1.3423) 2
2 (50.1235, -1.8553) 3
Solution:
from sklearn.neighbors import DistanceMetric

dist = DistanceMetric.get_metric('haversine')

# cross join via a dummy key, then drop the helper column
m = pd.merge(sup.assign(x=0), stores.assign(x=0), on='x', suffixes=['1', '2']).drop('x', axis=1)

# split the (lat, lon) tuples into separate numeric columns
d1 = sup[['coords']].assign(lat=sup.coords.str[0], lon=sup.coords.str[1]).drop('coords', axis=1)
d2 = stores[['coords']].assign(lat=stores.coords.str[0], lon=stores.coords.str[1]).drop('coords', axis=1)

# haversine works on (lat, lon) in radians; scale by the Earth's radius in km
m['dist_km'] = np.ravel(dist.pairwise(np.radians(d1), np.radians(d2)) * 6367)
Result:
In [135]: m
Out[135]:
coords1 supplier coords2 storekey dist_km
0 (51.1235, -0.3453) A (54.1235, -0.6553) 1 334.029670
1 (51.1235, -0.3453) A (49.1245, -1.3423) 2 233.213416
2 (51.1235, -0.3453) A (50.1235, -1.8553) 3 153.880680
3 (52.1245, -0.3423) B (54.1235, -0.6553) 1 223.116901
4 (52.1245, -0.3423) B (49.1245, -1.3423) 2 340.738587
5 (52.1245, -0.3423) B (50.1235, -1.8553) 3 246.116984
6 (53.1235, -1.4553) C (54.1235, -0.6553) 1 122.997130
7 (53.1235, -1.4553) C (49.1245, -1.3423) 2 444.459052
8 (53.1235, -1.4553) C (50.1235, -1.8553) 3 334.514028
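As a side note, on pandas 1.2+ the dummy-key merge can be replaced by a native cross join, which produces the same store/supplier combinations:

m = pd.merge(sup, stores, how='cross', suffixes=['1', '2'])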
I'm trying to bin a sample of observations into n discrete groups, then combine these groups until each subgroup has a minimum of 6 members. So far, I've generated bins and grouped my DataFrame into them:

# df is a DataFrame containing 135 measurements
bins = np.linspace(df.heights.min(), df.heights.max(), 21)
grp = df.groupby(np.digitize(df.heights, bins))
grp.size()
1 4
2 1
3 2
4 3
5 2
6 8
7 7
8 6
9 19
10 12
11 13
12 12
13 7
14 12
15 12
16 2
17 3
18 6
19 3
21 1
So I can see that I need to combine groups 1 - 3, 3 - 5, and 16 - 21, while leaving the others intact, but I don't know how to do this programmatically.
You can do this:
df = pd.DataFrame(np.random.randint(1, 201, 135), columns=['heights'])
bins = np.linspace(df.heights.min(), df.heights.max(), 21)
grp = df.groupby(np.digitize(df.heights, bins))
sizes = grp.size()
def f(vals, max_size):
    total = 0
    group = 1
    for v in vals:
        total += v
        if total <= max_size:
            yield group
        else:
            group += 1
            total = v
            yield group

# I've changed 6 to 30 for the example because I don't have your original dataset
grp.size().groupby([g for g in f(sizes, 30)])
And if you do print(grp.size().groupby([g for g in f(sizes, 30)]).cumsum()) you will see that the cumulative sums are grouped as expected.
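To see the merged bin sizes themselves, aggregate the grouped sizes with sum:

merged_sizes = grp.size().groupby([g for g in f(sizes, 30)]).sum()
print(merged_sizes)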
Also if you want to group the original values you can do something like:
# dat = np.random.randint(0, 201, 135)  # the fixed sample below was generated like this
dat = np.array([78,116,146,111,147,78,14,91,196,92,163,144,107,182,58,89,77,134,
83,126,94,70,121,175,174,88,90,42,93,131,91,175,135,8,142,166,
1,112,25,34,119,13,95,182,178,200,97,8,60,189,49,94,191,81,
56,131,30,107,16,48,58,65,78,8,0,11,45,179,151,130,35,64,
143,33,49,25,139,20,53,55,20,3,63,119,153,14,81,93,62,162,
46,29,84,4,186,66,90,174,55,48,172,83,173,167,66,4,197,175,
184,20,23,161,70,153,173,127,51,186,114,27,177,96,93,105,169,158,
83,155,161,29,197,143,122,72,60])
df = pd.DataFrame({'heights': dat})
bins = np.digitize(dat, np.linspace(0, 200, 21))
grp = df.heights.groupby(bins)

m = 15  # you should put 6 here, the minimum
s = 0
c = 1

def f(x):
    global c, s
    res = pd.Series([c] * x.size, index=x.index)
    s += x.size
    if s > m:
        s = 0
        c += 1
    return res

g = grp.apply(f)
print(df.groupby(g).size())
# another way of doing the same, just a matter of taste
m = 15  # you should put 6 here, the minimum
s = 0
c = 1

def f2(x):
    global c, s
    res = [c] * x.size  # here is the main difference with f
    s += x.size
    if s > m:
        s = 0
        c += 1
    return res

g = grp.transform(f2)  # call it this way
print(df.groupby(g).size())
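For what it's worth, the same merging can be done without global state by walking the bin sizes once and building a bin-to-group mapping. A small sketch on the same data; merge_small_bins is a hypothetical helper name, not something from the original answer:

def merge_small_bins(sizes, min_size=6):
    # map each original bin label to a merged group id,
    # starting a new group once the running count reaches min_size
    mapping = {}
    group, running = 1, 0
    for bin_label, size in sizes.items():
        mapping[bin_label] = group
        running += size
        if running >= min_size:
            group += 1
            running = 0
    return mapping

bin_ids = np.digitize(df.heights, np.linspace(0, 200, 21))
sizes = pd.Series(bin_ids).value_counts().sort_index()
g = pd.Series(bin_ids).map(merge_small_bins(sizes, min_size=6))
print(df.groupby(g.values).size())
# caveat: the trailing group can still end up under min_size;
# merge it into its predecessor if that matters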