Applying diff on selected rows for comparing angles from math.atan2 - python

I have a data frame like this that I want to apply the diff function on:
test = pd.DataFrame({'Observation': ['0', '1', '2',
                                     '3', '4', '5',
                                     '6', '7', '8'],
                     'Value': [30, 60, 170, -170, -130, -60, -30, 10, 20]})
Observation  Value
          0     30
          1     60
          2    170
          3   -170
          4   -130
          5    -60
          6    -30
          7     10
          8     20
The column 'Value' is in degrees. So, the difference between -170 and 170 should be 20, not -340. In other words, when d2*d1 < 0, instead of d2-d1, I'd like to get 360-(abs(d1)+abs(d2))
Here's what I tried, but I don't know how to continue it without using a for loop:
test['Value_diff_1st_attempt'] = test['Value'].diff(1)
test['sign_temp'] = test['Value'].shift()
test['Sign'] = np.sign(test['Value']*test['sign_temp'])
Here's what the result should look like:
Observation  Value  Delta_Value
          0     30          NaN
          1     60           30
          2    170          110
          3   -170           20
          4   -130           40
          5    -60           70
          6    -30           30
          7     10           40
          8     20           10
Eventually I'd like to get just the magnitudes of the differences, all as positive values. Thanks.
Update: the values are produced by the math.atan2 function, so they lie in 0 < theta < 180 or -180 < theta < 0. The problem arises with a change of direction, e.g. from 170 (upper left quadrant) to -170 (lower left quadrant), where the change is really just 20 degrees. However, when we go from -30 (lower right quadrant) to 10 (upper right quadrant), the change is really 40 degrees. I hope I explained it well.

I believe this should work (took the definition from @JasonD's answer):
test["Value"].rolling(2).apply(lambda x: 180 - abs(abs(x[0] - x[1]) - 180))
Out[45]:
0 NaN
1 30.0
2 110.0
3 20.0
4 40.0
5 70.0
6 30.0
7 40.0
8 10.0
Name: Value, dtype: float64
How it works:
Based on your question, the two angles a and b are between 0 and +/-180. For 0 < d < 180 I will write d < 180 and for -180 < d < 0 I will write d < 0. There are four possibilities:
a < 180, b < 180 -> the result is simply |a - b|. And since |a - b| - 180 cannot be greater than 180, the formula will simplify to a - b if a > b and b - a if b > a.
a < 0, b < 0 - > The same logic applies here. Both negative and their absolute difference cannot be greater than 180. The result will be |a - b|.
a < 180, b < 0 - > a - b will be greater than 0 for sure. For the cases where |a - b| > 180, we should look at the other angle and this translates to 360 - |a - b|.
a < 0, b < 180 -> again, similar to the above. If the absolute difference is greater than 180, calculate 360 - absolute difference.
For the pandas part: rolling(2) slides a window of size 2 over the series: (row 0, row 1), (row 1, row 2), ... apply then evaluates the formula on each window, where x[0] is the first element (a) and x[1] is the second (b).
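The same correction also vectorises without rolling/apply. A minimal alternative sketch, my addition rather than part of the answer above, reusing the test frame from the question:
# diff() gives the raw step between consecutive angles; the wrap-around
# correction 180 - ||d| - 180| is then applied elementwise.
d = test['Value'].diff().abs()
test['Delta_Value'] = 180 - (d - 180).abs()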

Related

rolling mean with a moving window

My dataframe has a daily price column and a window size column :
df = pd.DataFrame(columns=['price', 'window'],
                  data=[[100, 1], [120, 2], [115, 2], [116, 2], [100, 4]])
df
   price  window
0    100       1
1    120       2
2    115       2
3    116       2
4    100       4
I would like to compute the rolling mean of price for each row, using the window size given in the window column.
The result would be this:
df
   price  window  rolling_mean_price
0    100       1              100.00
1    120       2              110.00
2    115       2              117.50
3    116       2              115.50
4    100       4              112.75
I can't find any elegant way to do it with apply, and I refuse to loop over each row of my DataFrame...
The best solutions, in terms of raw speed and complexity, are based on the idea of a summed-area table, which here reduces to a one-dimensional prefix sum. Below are several approaches, ranked from best to worst.
Numpy + Linear complexity
import numpy as np

size = len(df['price'])
# Prefix sums with a leading zero: price[k] = sum of the first k prices
price = np.zeros(size + 1)
price[1:] = df['price'].values.cumsum()
# Start index of each row's window, clipped at 0 when the window would run past the start
window = np.clip(np.arange(size) - (df['window'].values - 1), 0, None)
# Window sum = difference of prefix sums; divide by the window length for the mean
df['rolling_mean_price'] = (price[1:] - price[window]) / df['window'].values
print(df)
Output
   price  window  rolling_mean_price
0    100       1              100.00
1    120       2              110.00
2    115       2              117.50
3    116       2              115.50
4    100       4              112.75
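The linear approaches rest on the prefix-sum identity: with S[k] holding the sum of the first k prices, any window sum is a difference of two prefix sums. A small self-contained check (the array and names here are mine, for illustration):
import numpy as np

a = np.array([100, 120, 115, 116, 100])
S = np.concatenate(([0], a.cumsum()))  # S[k] = a[0] + ... + a[k-1]

# Sum of the window a[i-w+1 .. i] equals S[i+1] - S[i+1-w]
i, w = 4, 4
assert a[i - w + 1:i + 1].sum() == S[i + 1] - S[i + 1 - w]  # both are 451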
Loopy + Linear complexity
price = df['price'].values.cumsum()
# Subtract the prefix sum at i - w only when that index exists (i - w >= 0)
df['rolling_mean_price'] = [(price[i] - float((i - w) > -1) * price[i - w]) / w
                            for i, w in enumerate(df['window'])]
Loopy + Quadratic complexity
price = df['price'].values
# Slice each window and take its mean directly: O(n * w) overall
# (assumes w - 1 <= i; a negative slice start would wrap around)
df['rolling_mean_price'] = [price[i - (w - 1):i + 1].mean() for i, w in enumerate(df['window'])]
I would not recommend this approach using pandas.DataFrame.apply() (reasons described here), but if you insist on it, here is one solution:
df['rolling_mean_price'] = df.apply(
    lambda row: df.rolling(row.window).price.mean().iloc[row.name], axis=1)
The output looks like this:
>>> print(df)
   price  window  rolling_mean_price
0    100       1              100.00
1    120       2              110.00
2    115       2              117.50
3    116       2              115.50
4    100       4              112.75

Fill with the values from the nearest neighbor, comparing another column, in Pandas

I am having dataframe like this:
azimuth   id
     15  100
     15    1
     15  100
    150    2
    150  100
    240    3
    240  100
    240  100
    350  100
What I need is to replace the 100 values with the id from the row whose azimuth is closest:
Desired output:
azimuth  id
     15   1
     15   1
     15   1
    150   2
    150   2
    240   3
    240   3
    240   3
    350   1
350 is near to 15 because this is a circle (angle representation). The difference is 25.
What I have:
def mysubstitution(x):
    for i in x.index[x['id'] == 100]:
        i = int(i)
        diff = (x['azimuth'] - x.loc[i, 'azimuth']).abs()
        for ind in diff.index:
            if diff[ind] > 180:
                diff[ind] = 360 - diff[ind]
        exclude = [y for y in x.index if y not in x.index[x['id'] == 100]]
        closer_idx = diff[exclude]
        closer_df = pd.DataFrame(closer_idx)
        sorted_df = closer_df.sort_values('azimuth', ascending=True)
        try:
            a = sorted_df.index[0]
            x.loc[i, 'id'] = x.loc[a, 'id']
        except Exception as a:
            print(a)
    return x
Which works OK most of the time, but I guess there is a simpler solution.
Thanks in advance.
I tried to implement the functionality in two steps. First, I built a grouped dataframe that holds the id value for each azimuth (for ids other than 100).
Then, using this, I implemented the replaceAzimuth function, which takes each row of the dataframe and first checks whether an id for that azimuth already exists. If so, it uses it directly. Otherwise, it replaces the id value with the one belonging to the closest azimuth in the grouped dataframe.
Here is the implementation:
df = pd.DataFrame([[15, 100], [15, 1], [15, 100], [150, 2], [150, 100],
                   [240, 3], [240, 100], [240, 100], [350, 100]],
                  columns=['azimuth', 'id'])
df_non100 = df[df['id'] != 100]
df_grouped = df_non100.groupby(['azimuth'])['id'].min().reset_index()

def replaceAzimuth(df_grouped, id_val):
    real_id = df_grouped[df_grouped['azimuth'] == id_val['azimuth']]['id']
    if real_id.size == 0:
        # Work on a copy so df_grouped is not mutated between apply calls
        df_diff = df_grouped.copy()
        # Circular distance: the smaller of the two arcs between the angles
        df_diff['azimuth'] = df_diff['azimuth'].apply(
            lambda x: min(abs(id_val['azimuth'] - x), 360 - abs(id_val['azimuth'] - x)))
        id_val['id'] = df_grouped.iloc[df_diff['azimuth'].idxmin()]['id']
    else:
        id_val['id'] = real_id.iloc[0]  # take the scalar, not the one-element Series
    return id_val

df = df.apply(lambda x: replaceAzimuth(df_grouped, x), axis=1)
df
For me, the code seems to give the output you have shown, but I'm not sure whether it will work in all cases!
First set all ids to nan if they are 100.
df.id = np.where(df.id==100, np.nan, df.id)
Then calculate the pairwise angle differences and find the closest id to fill the NaNs.
df.id = df.id.combine_first(
    pd.DataFrame(np.abs(((df.azimuth.values[:, None] - df.azimuth.values) + 180) % 360 - 180))
      .pipe(np.argsort)                             # per row: column positions sorted by angular distance
      .applymap(lambda x: df.id.iloc[x])            # translate those positions into ids
      .apply(lambda x: x.dropna().iloc[0], axis=1)  # first non-NaN id, i.e. the closest defined one
)
df
   azimuth   id
0       15  1.0
1       15  1.0
2       15  1.0
3      150  2.0
4      150  2.0
5      240  3.0
6      240  3.0
7      240  3.0
8      350  1.0
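Both answers rely on circular (wrap-around) angle distance. The expression ((a - b + 180) % 360 - 180) used in the second answer maps any difference into (-180, 180], so its absolute value is the shorter arc. A small hedged check (the function name is mine):
def ang_dist(a, b):
    # Absolute circular distance between two angles in degrees
    return abs((a - b + 180) % 360 - 180)

assert ang_dist(350, 15) == 25   # wraps across 0/360, as in the question
assert ang_dist(150, 15) == 135  # plain difference when no wrap is needed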

Calculating grid values given the distance in python

I have a cell grid of big dimensions. Each cell has an ID (p1), a cell value (p3) and coordinates in actual measures (X, Y). This is what the first 10 rows/cells look like:
   p1   p2    p3    X  Y
0   0  0.0   0.0    0  0
1   1  0.0   0.0  100  0
2   2  0.0  12.0  200  0
3   3  0.0   0.0  300  0
4   4  0.0  70.0  400  0
5   5  0.0  40.0  500  0
6   6  0.0  20.0  600  0
7   7  0.0   0.0  700  0
8   8  0.0   0.0  800  0
9   9  0.0   0.0  900  0
Neighbouring cells of cell i in p1 can be determined as (i-500+1, i-500-1, i-1, i+1, i+500+1, i+500-1).
For example: p1 of 5 has neighbours 4, 6, 504, 505, 506 (these are the IDs of rows in the table above, i.e. p1).
What I am trying to do is:
For the chosen value/row i in p1, I would like to know all neighbours within the chosen distance from i and sum their p3 values.
I tried to apply this solution (link), but I don't know how to incorporate the distance parameter. The cell value can be taken with df.iloc, but the steps before this are a bit tricky for me.
Can you give me any advice?
EDIT:
Using the solution from Thomas and having df called CO:
      p3
0     45
1    580
2  12000
3  12531
4  22456
I'd like to add another column that uses the values from the p3 column:
CO['new'] = format(sum_neighbors(data, CO['p3']))
But it doesn't work. If I put a number instead of the reference to CO['p3'], it works like a charm. But how can I use the values from the p3 column automatically in the format function?
SOLVED:
It worked with:
CO['new'] = CO.apply(lambda row: sum_neighbors(data, row.p3), axis=1)
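A side note, my observation rather than anything from the thread: sum_neighbors (defined in the solution below) expects the row index i as its second argument, while row.p3 passes the cell's value. If the intent is to sum the neighbours of each row, row.name (the row's index label) is likely the argument wanted:
# Hedged variant: pass the row's index label instead of its p3 value
CO['new'] = CO.apply(lambda row: sum_neighbors(data, row.name), axis=1)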
Solution:
import numpy as np
import pandas

# Generating toy data
N = 10
data = pandas.DataFrame({'p3': np.random.randn(N)})
print(data)

# Finding neighbours
get_candidates = lambda i: [i-500+1, i-500-1, i-1, i+1, i+500+1, i+500-1]
# Note: this shadows the built-in filter(); it drops out-of-range indices
filter = lambda neighbors, N: [n for n in neighbors if 0 <= n < N]
get_neighbors = lambda i, N: filter(get_candidates(i), N)
print("Neighbors of 5: {}".format(get_neighbors(5, len(data))))

# Summing p3 on neighbors
def sum_neighbors(data, i, col='p3'):
    return data.iloc[get_neighbors(i, len(data))][col].sum()

print("p3 sum on neighbors of 5: {}".format(sum_neighbors(data, 5)))
Output:
         p3
0 -1.106541
1 -0.760620
2  1.282252
3  0.204436
4 -1.147042
5  1.363007
6 -0.030772
7 -0.461756
8 -1.110459
9 -0.491368
Neighbors of 5: [4, 6]
p3 sum on neighbors of 5: -1.1778133703169344
Notes:
I assumed p1 was range(N) as seemed to be implied (so we don't need it at all).
I don't think that 505 is a neighbour of 5 given the list of neighbors of i defined by the OP.
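The answer above handles only the immediate neighbours; the question also asks for neighbours within a chosen distance. One possible generalisation, strictly a sketch under assumptions I'm adding myself (the grid is 500 cells wide, p1 == row * 500 + col, and "distance" means Chebyshev distance in grid steps):
NCOLS = 500  # assumed grid width

def neighbors_within(i, d, n):
    # All cell indices within d grid steps of cell i, staying inside the grid
    row, col = divmod(i, NCOLS)
    out = []
    for dr in range(-d, d + 1):
        for dc in range(-d, d + 1):
            if dr == 0 and dc == 0:
                continue  # skip the cell itself
            r, c = row + dr, col + dc
            j = r * NCOLS + c
            if r >= 0 and 0 <= c < NCOLS and 0 <= j < n:
                out.append(j)
    return out

def sum_neighbors_within(data, i, d, col='p3'):
    return data.iloc[neighbors_within(i, d, len(data))][col].sum()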

New column based on multiple conditions

     a    b
0  100   90
1   30  117
2   90   99
3  200   94
I want to create a new df["c"] with the following conditions:
If a > 50 and b is inside (a ± 0.5a), then c = a
If a > 50 and b is outside (a ± 0.5a), then c = b
If a <= 50, then c = a
Output should be:
     a    b    c
0  100   90  100
1   30  117   30
2   90   99   90
3  200   94   94
I've tried:
df['c'] = np.where(df.eval("0.5 * a <= b <= 1.5 * a"), df.a, df.b)
But I don't know how to include the last condition (if a <= 50, then c = a) in this expression.
You're almost there; you just need to add an or clause inside your eval string.
np.where(df.eval("(0.5 * a <= b <= 1.5 * a) or (a <= 50)"), df.a, df.b)
# ~~~~~~~~~~~~
array([100, 30, 90, 94])
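For reference, the same logic as plain boolean masks, without eval; a minimal sketch rebuilding the frame from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [100, 30, 90, 200], 'b': [90, 117, 99, 94]})

# b inside a +/- 0.5a, or a <= 50 -> take a; otherwise take b
cond = df.b.between(0.5 * df.a, 1.5 * df.a) | (df.a <= 50)
df['c'] = np.where(cond, df.a, df.b)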

Fraction of values in (x, y) space

I have a data frame that looks like this, but with several hundred thousand rows:
df
   D         x          y
0  y  5.887672   6.284714
1  y  9.038657  10.972742
2  n  2.820448   6.954992
3  y  5.319575  15.475197
4  n  1.647302   7.941926
5  n  5.825357  13.747091
6  n  5.937630   6.435687
7  y  7.789661  11.868023
8  n  2.669362  11.300062
9  y  1.153347  17.625158
I want to know what proportion of values ("D") in each x:y grid space is "n".
I can do it by brute force, stepping through x and y and calculating the percentage:
zonexy = {}
for x in np.arange(0, 10, 2.5):
    dfx = df[(df['x'] >= x) & (df['x'] < x + 2.5)]
    zonexy[x] = {}
    for y in np.arange(0, 24, 6):
        dfy = dfx[(dfx['y'] >= y) & (dfx['y'] < y + 6)]
        try:
            # Note: the criterion column is 'D' in the frame shown above
            pctn = len(dfy[dfy['D'] == 'n']) / len(dfy) * 100.0
        except ZeroDivisionError:
            pctn = 0
        zonexy[x][y] = pctn
Output:
pd.DataFrame(zonexy)
    0.0  2.5  5.0  7.5
0     0    0    0    0
6   100  100   50    0
12    0    0   50    0
18    0    0    0    0
But this, and all the variations on this theme that I've tried, is very slow. It seems like there should be a much more efficient way (probably via numpy), but I'm blanking on it.
One way would be to use the 2D histogram function of numpy:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram2d.html
Then,
Run it once on the data where the criteria is matched (here, "D" is "n")
Run it again on all of the data.
Divide the first result, element by element, by the second result; a sketch putting these steps together follows below.
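A minimal sketch of those steps, with bin edges chosen to match the 2.5-by-6 grid from the question (an illustration, not a tested drop-in implementation):
import numpy as np

x_edges = np.arange(0, 12.5, 2.5)  # 0, 2.5, 5, 7.5, 10
y_edges = np.arange(0, 30, 6)      # 0, 6, 12, 18, 24

# Histogram of the rows matching the criterion ('D' == 'n') ...
n_counts, _, _ = np.histogram2d(df.x[df.D == 'n'], df.y[df.D == 'n'],
                                bins=[x_edges, y_edges])
# ... and of all the rows.
all_counts, _, _ = np.histogram2d(df.x, df.y, bins=[x_edges, y_edges])

# Element-by-element ratio; empty cells become NaN instead of raising
with np.errstate(divide='ignore', invalid='ignore'):
    pctn = 100.0 * n_counts / all_counts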
