joint probability with a condition

joint probability with a condition - python

I am working with wind speed (sknt) and visbility (vsby) data in hourly intervals from weather stations. I was able to calculate the joint probability for both wind speed and visibility using this,
df1=df.groupby('vsby').size().div(len(df))
df2=df.groupby(['vsby', 'sknt']).size().div(len(df)).div(vprob, axis=0, level='vsby')
vsby sknt 0
0 6.0 15.0 1.000000
1 10.0 0.0 1.000000
2 11.0 7.0 0.500000
3 11.0 16.0 0.500000
4 13.0 12.0 1.000000
5 14.0 3.0 0.500000
6 14.0 4.0 0.250000
7 14.0 12.0 0.250000
8 16.0 0.0 0.099796
9 16.0 2.0 0.209776
10 16.0 3.0 0.173116
11 16.0 4.0 0.134420
12 16.0 5.0 0.175153
13 16.0 6.0 0.024440
14 16.0 7.0 0.032587
15 16.0 8.0 0.018330
16 16.0 9.0 0.024440
17 16.0 10.0 0.024440
18 16.0 11.0 0.026477
19 16.0 12.0 0.016293
20 16.0 13.0 0.014257
21 16.0 14.0 0.008147
22 16.0 15.0 0.008147
23 16.0 16.0 0.004073
24 16.0 17.0 0.004073
25 16.0 18.0 0.002037
I am interested in finding the probability of wind speed >= x for all visibility recorded. For example, vsby 16, probability = (0.018330 + 0.024440 + 0.024440 + 0.026477 + 0.016293 + 0.014257 + 0.008147 + 0.008147 + 0.004073 + 0.004073 + 0.002037)
I tried,
df2.loc[df2.sknt >= 7, df2.vsby].sum()
but its not working.

Try the below. To select a column using .loc it is sufficient to just provide the name.
df2 = df2.reset_index()
df2.loc[df2['sknt'] >= 7, 'vsby'].sum()

Related

Pandas rolling but involves last rows value

I have this dataframe
hour = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]
visitor = [4,6,2,4,3,7,5,7,8,3,2,8,3,6,4,5,1,8,9,4,2,3,4,1]
df = {"Hour":hour, "Total_Visitor":visitor}
df = pd.DataFrame(df)
print(df)
I applied 6 window rolling sum
df_roll = df.rolling(6, min_periods=6).sum()
print(df_roll)
The first 5 rows will give you NaN value,
The problem is I want to know the sum of total visitor from 9pm to 3am, so I have to sum total visitor from hour 21 and then back to hour 0 until 3
How do you do that automatically with rolling?

I think you need add last N values, then using rolling and filter by length of Series:
N = 6
df_roll = df.iloc[-N:].append(df).rolling(N).sum().iloc[-len(df):]
print (df_roll)
Hour Total_Visitor
0 105.0 18.0
1 87.0 20.0
2 69.0 20.0
3 51.0 21.0
4 33.0 20.0
5 15.0 26.0
6 21.0 27.0
7 27.0 28.0
8 33.0 34.0
9 39.0 33.0
10 45.0 32.0
11 51.0 33.0
12 57.0 31.0
13 63.0 30.0
14 69.0 26.0
15 75.0 28.0
16 81.0 27.0
17 87.0 27.0
18 93.0 33.0
19 99.0 31.0
20 105.0 29.0
21 111.0 27.0
22 117.0 30.0
23 123.0 23.0
Check original solution:
df_roll = df.rolling(6, min_periods=6).sum()
print(df_roll)
Hour Total_Visitor
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 15.0 26.0
6 21.0 27.0
7 27.0 28.0
8 33.0 34.0
9 39.0 33.0
10 45.0 32.0
11 51.0 33.0
12 57.0 31.0
13 63.0 30.0
14 69.0 26.0
15 75.0 28.0
16 81.0 27.0
17 87.0 27.0
18 93.0 33.0
19 99.0 31.0
20 105.0 29.0
21 111.0 27.0
22 117.0 30.0
23 123.0 23.0
Numpy alternative with strides is complicated, but faster if large one Series:
def rolling_window(a, window):
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
N = 3
x = np.concatenate([fv[-N+1:], fv.to_numpy()])
cv = pd.Series(rolling_window(x, N).sum(axis=1), index=fv.index)
print (cv)
0 5
1 4
2 4
3 6
4 5
dtype: int64

Though you have mentioned a series, see if this is helpful-
import pandas as pd
def cyclic_roll(s, n):
s = s.append(s[:n-1])
result = s.rolling(n).sum()
return result[-n+1:].append(result[n-1:-n+1])
fv = pd.DataFrame([1, 2, 3, 4, 5])
cv = fv.apply(cyclic_roll, n=3)
cv.reset_index(inplace=True, drop=True)
print cv
Output
0
0 10.0
1 8.0
2 6.0
3 9.0
4 12.0

How to use a for loop length as an optimization parameter in Pytorch

I am trying to register my loop's length as an optimization parameter so that the optimizer will automatically adjust the length of the loop.
Here is an example code (the real problem is more complex but you will get the idea):
import torch
positive_iter = torch.tensor([10.0], requires_grad=True)
negative_iter = torch.tensor([20.0], requires_grad=True)
optimizer = torch.optim.Adam([positive_iter, negative_iter], lr=0.02, betas=(0.5, 0.999))
for i in range(100):
loss = torch.tensor([0.0], requires_grad=True)
for g in range(int(positive_iter)):
loss = loss + torch.rand(1)
for d in range(int(negative_iter)):
loss = loss - torch.rand(1) * 2
loss = torch.abs(loss)
loss.backward()
optimizer.step()
print(i, loss.item(), positive_iter.item(), negative_iter.item())
The optimization does not seem to work, here is the output:
0 19.467784881591797 10.0 20.0
1 14.334418296813965 10.0 20.0
2 13.515042304992676 10.0 20.0
3 13.477707862854004 10.0 20.0
4 15.240434646606445 10.0 20.0
5 18.45014190673828 10.0 20.0
6 18.557266235351562 10.0 20.0
7 16.325769424438477 10.0 20.0
8 13.95105266571045 10.0 20.0
9 12.435094833374023 10.0 20.0
10 13.70322322845459 10.0 20.0
11 10.128765106201172 10.0 20.0
12 16.986034393310547 10.0 20.0
13 15.652003288269043 10.0 20.0
14 10.300052642822266 10.0 20.0
15 18.038368225097656 10.0 20.0
16 11.830389022827148 10.0 20.0
17 14.917057037353516 10.0 20.0
18 18.603071212768555 10.0 20.0
19 17.595298767089844 10.0 20.0
20 17.17181968688965 10.0 20.0
21 14.548274993896484 10.0 20.0
22 18.839675903320312 10.0 20.0
23 13.375761032104492 10.0 20.0
24 14.045333862304688 10.0 20.0
25 13.088285446166992 10.0 20.0
26 15.019135475158691 10.0 20.0
27 16.992284774780273 10.0 20.0
28 13.883159637451172 10.0 20.0
29 12.695013999938965 10.0 20.0
30 17.23816680908203 10.0 20.0
...continued
Could you advise on how to make the for loop length an optimization parameter.
Thank you

Cumulative percentage of pandas data frame

I have a data frame like below with a specific ID (code) and areas and length by a specific distance (Dist_km)
code Dist_km Shape_Leng Shape_Area
0 M0017 5.0 57516.601608 5.076465e+07
1 M0017 10.0 94037.663673 4.638184e+07
2 M0017 15.0 39106.310470 1.426327e+07
3 M0017 20.0 138.038115 6.464380e+02
4 M0017 30.0 12158.395200 4.102351e+06
5 M0073 5.0 51922.847698 3.375080e+07
6 M0073 10.0 75543.660382 5.966612e+07
7 M0073 15.0 55277.027428 3.423961e+07
8 M0073 20.0 26945.782055 2.584022e+07
9 M0073 25.0 4052.670711 6.904536e+05
10 M0333 5.0 30090.687597 5.468791e+07
11 M0333 10.0 55946.815385 5.768929e+07
12 M0333 15.0 65026.329732 4.008600e+07
13 M0333 20.0 59014.487216 2.994337e+07
14 M0333 25.0 17423.635441 6.358991e+06
Using:
mrb['cum_area_sqm'] = mrb.groupby(['code'])['Shape_Area'].apply(lambda x: x.cumsum())
mrb['cum_area_ha'] = mrb['cum_area_sqm']/10000
mrb_cumsum = mrb.groupby(['code','Dist_km']).agg({'cum_area_ha': 'sum'})
I have managed to convert the data frame to the below
cum_area_ha
code Dist_km
M0017 5.0 5076.464548
10.0 9714.648238
15.0 11140.974881
20.0 11141.039525
30.0 11551.274623
M0073 5.0 3375.080465
10.0 9341.692680
15.0 12765.654064
20.0 15349.676332
25.0 15418.721691
M0333 5.0 5468.790981
10.0 11237.720454
15.0 15246.320869
20.0 18240.658255
25.0 18876.557351
However, I would like to now get a cumulative percentages of these areas for each code by Dist_km up to a 100 percent.
So, for example for M0017, I would like to have something like the below.
cum_area_ha cum_area_pc
code Dist_km
M0017 5.0 5076.464548 43.49
10.0 9714.648238 84.10
15.0 11140.974881 96.45
20.0 11141.039525 96.45
30.0 11551.274623 100.00

You can divide each element by the last cum_area_ha in the same code group.
mrb_cumsum.div(mrb_cumsum.groupby(level=0).last())
Out[97]:
cum_area_ha
code Dist_km
M0017 5.0 0.439472
10.0 0.841002
15.0 0.964480
20.0 0.964486
30.0 1.000000
M0073 5.0 0.218895
10.0 0.605867
15.0 0.827932
20.0 0.995522
25.0 1.000000
M0333 5.0 0.289713
10.0 0.595327
15.0 0.807685
20.0 0.966313
25.0 1.000000

Getting most recent observation & date from several columns

Take the following toy DataFrame:
data = np.arange(35, dtype=np.float32).reshape(7, 5)
data = pd.concat((
pd.DataFrame(list('abcdefg'), columns=['field1']),
pd.DataFrame(data, columns=['field2', '2014', '2015', '2016', '2017'])),
axis=1)
data.iloc[1:4, 4:] = np.nan
data.iloc[4, 3:] = np.nan
print(data)
field1 field2 2014 2015 2016 2017
0 a 0.0 1.0 2.0 3.0 4.0
1 b 5.0 6.0 7.0 NaN NaN
2 c 10.0 11.0 12.0 NaN NaN
3 d 15.0 16.0 17.0 NaN NaN
4 e 20.0 21.0 NaN NaN NaN
5 f 25.0 26.0 27.0 28.0 29.0
6 g 30.0 31.0 32.0 33.0 34.0
I'd like to replace the "year" columns (2014-2017) with two fields: the most recent non-null observation, and the corresponding year of that observation. Assume field1 is a unique key. (I'm not looking to do any groupby ops, just 1 row per record.) I.e.:
field1 field2 obs date
0 a 0.0 4.0 2017
1 b 5.0 7.0 2015
2 c 10.0 12.0 2015
3 d 15.0 17.0 2015
4 e 20.0 21.0 2014
5 f 25.0 29.0 2017
6 g 30.0 34.0 2017
I've gotten this far:
pd.melt(data, id_vars=['field1', 'field2'],
value_vars=['2014', '2015', '2016', '2017'])\
.dropna(subset=['value'])
field1 field2 variable value
0 a 0.0 2014 1.0
1 b 5.0 2014 6.0
2 c 10.0 2014 11.0
3 d 15.0 2014 16.0
4 e 20.0 2014 21.0
5 f 25.0 2014 26.0
6 g 30.0 2014 31.0
# ...
But am struggling with how to pivot back to desired format.

Maybe:
d2 = data.melt(id_vars=["field1", "field2"], var_name="date", value_name="obs").dropna(subset=["obs"])
d2["date"] = d2["date"].astype(int)
df = d2.loc[d2.groupby(["field1", "field2"])["date"].idxmax()]
which gives me
field1 field2 date obs
21 a 0.0 2017 4.0
8 b 5.0 2015 7.0
9 c 10.0 2015 12.0
10 d 15.0 2015 17.0
4 e 20.0 2014 21.0
26 f 25.0 2017 29.0
27 g 30.0 2017 34.0

what about the following apporach:
In [160]: df
Out[160]:
field1 field2 2014 2015 2016 2017
0 a 0.0 1.0 2.0 3.0 -10.0
1 b 5.0 6.0 7.0 NaN NaN
2 c 10.0 11.0 12.0 NaN NaN
3 d 15.0 16.0 17.0 NaN NaN
4 e 20.0 21.0 NaN NaN NaN
5 f 25.0 26.0 27.0 28.0 29.0
6 g 30.0 31.0 32.0 33.0 34.0
In [180]: df.groupby(lambda x: 'obs' if x.isdigit() else x, axis=1) \
...: .last() \
...: .assign(date=df.filter(regex='^\d{4}').loc[:, ::-1].notnull().idxmax(1))
Out[180]:
field1 field2 obs date
0 a 0.0 -10.0 2017
1 b 5.0 7.0 2015
2 c 10.0 12.0 2015
3 d 15.0 17.0 2015
4 e 20.0 21.0 2014
5 f 25.0 29.0 2017
6 g 30.0 34.0 2017

last_valid_index + agg('last')
A=data.iloc[:,2:].apply(lambda x : x.last_valid_index(),1)
B=data.groupby(['value'] * data.shape[1], 1).agg('last')
data['date']=A
data['obs']=B
data
Out[1326]:
field1 field2 2014 2015 2016 2017 date obs
0 a 0.0 1.0 2.0 3.0 4.0 2017 4.0
1 b 5.0 6.0 7.0 NaN NaN 2015 7.0
2 c 10.0 11.0 12.0 NaN NaN 2015 12.0
3 d 15.0 16.0 17.0 NaN NaN 2015 17.0
4 e 20.0 21.0 NaN NaN NaN 2014 21.0
5 f 25.0 26.0 27.0 28.0 29.0 2017 29.0
6 g 30.0 31.0 32.0 33.0 34.0 2017 34.0
By using assign we can push them into one line as blow
data.assign(date=data.iloc[:,2:].apply(lambda x : x.last_valid_index(),1),obs=data.groupby(['value'] * data.shape[1], 1).agg('last'))
Out[1340]:
field1 field2 2014 2015 2016 2017 date obs
0 a 0.0 1.0 2.0 3.0 4.0 2017 4.0
1 b 5.0 6.0 7.0 NaN NaN 2015 7.0
2 c 10.0 11.0 12.0 NaN NaN 2015 12.0
3 d 15.0 16.0 17.0 NaN NaN 2015 17.0
4 e 20.0 21.0 NaN NaN NaN 2014 21.0
5 f 25.0 26.0 27.0 28.0 29.0 2017 29.0
6 g 30.0 31.0 32.0 33.0 34.0 2017 34.0

Also another possibility by using sort_values and drop_duplicates:
data.melt(id_vars=["field1", "field2"], var_name="date",
value_name="obs")\
.dropna(subset=['obs'])\
.sort_values(['field1', 'date'], ascending=[True, False])\
.drop_duplicates('field1', keep='first')
which gives you
field1 field2 date obs
21 a 0.0 2017 4.0
8 b 5.0 2015 7.0
9 c 10.0 2015 12.0
10 d 15.0 2015 17.0
4 e 20.0 2014 21.0
26 f 25.0 2017 29.0
27 g 30.0 2017 34.0

Efficiently updating NaN's in a pandas dataframe from a prior row & specific columns value

I have a pandas'DataFrame, it looks like this:
# Output
# A B C D
# 0 3.0 6.0 7.0 4.0
# 1 42.0 44.0 1.0 3.0
# 2 4.0 2.0 3.0 62.0
# 3 90.0 83.0 53.0 23.0
# 4 22.0 23.0 24.0 NaN
# 5 5.0 2.0 5.0 34.0
# 6 NaN NaN NaN NaN
# 7 NaN NaN NaN NaN
# 8 2.0 12.0 65.0 1.0
# 9 5.0 7.0 32.0 7.0
# 10 2.0 13.0 6.0 12.0
# 11 NaN NaN NaN NaN
# 12 23.0 NaN 23.0 34.0
# 13 61.0 NaN 63.0 3.0
# 14 32.0 43.0 12.0 76.0
# 15 24.0 2.0 34.0 2.0
What I would like to do is fill the NaN's with the earliest preceding row's B value. Apart from Column D, on this row, I would like NaN's replaced with zeros.
I've looked into ffill, fillna.. neither seem to be able to do the job.
My solution so far:
def fix_abc(row, column, df):
# If the row/column value is null/nan
if pd.isnull( row[column] ):
# Get the value of row[column] from the row before
prior = row.name
value = df[prior-1:prior]['B'].values[0]
# If that values empty, go to the row before that
while pd.isnull( value ) and prior >= 1 :
prior = prior - 1
value = df[prior-1:prior]['B'].values[0]
else:
value = row[column]
return value
df['A'] = df.apply( lambda x: fix_abc(x,'A',df), axis=1 )
df['B'] = df.apply( lambda x: fix_abc(x,'B',df), axis=1 )
df['C'] = df.apply( lambda x: fix_abc(x,'C',df), axis=1 )
def fix_d(x):
if pd.isnull(x['D']):
return 0
return x
df['D'] = df.apply( lambda x: fix_d(x), axis=1 )
It feels like this quite inefficient, and slow. So I'm wondering if there is a quicker, more efficient way to do this.
Example output;
# A B C D
# 0 3.0 6.0 7.0 3.0
# 1 42.0 44.0 1.0 42.0
# 2 4.0 2.0 3.0 4.0
# 3 90.0 83.0 53.0 90.0
# 4 22.0 23.0 24.0 0.0
# 5 5.0 2.0 5.0 5.0
# 6 2.0 2.0 2.0 0.0
# 7 2.0 2.0 2.0 0.0
# 8 2.0 12.0 65.0 2.0
# 9 5.0 7.0 32.0 5.0
# 10 2.0 13.0 6.0 2.0
# 11 13.0 13.0 13.0 0.0
# 12 23.0 13.0 23.0 23.0
# 13 61.0 13.0 63.0 61.0
# 14 32.0 43.0 12.0 32.0
# 15 24.0 2.0 34.0 24.0
I have dumped the code including the data for the dataframe into a python fiddle available (here)

fillna allows for various ways to do the filling. In this case, column D can just fill with 0. Column B can fill via pad. And then columns A and C can fill from column B, like:
Code:
df['D'] = df.D.fillna(0)
df['B'] = df.B.fillna(method='pad')
df['A'] = df.A.fillna(df['B'])
df['C'] = df.C.fillna(df['B'])
Test Code:
df = pd.read_fwf(StringIO(u"""
A B C D
3.0 6.0 7.0 4.0
42.0 44.0 1.0 3.0
4.0 2.0 3.0 62.0
90.0 83.0 53.0 23.0
22.0 23.0 24.0 NaN
5.0 2.0 5.0 34.0
NaN NaN NaN NaN
NaN NaN NaN NaN
2.0 12.0 65.0 1.0
5.0 7.0 32.0 7.0
2.0 13.0 6.0 12.0
NaN NaN NaN NaN
23.0 NaN 23.0 34.0
61.0 NaN 63.0 3.0
32.0 43.0 12.0 76.0
24.0 2.0 34.0 2.0"""), header=1)
print(df)
df['D'] = df.D.fillna(0)
df['B'] = df.B.fillna(method='pad')
df['A'] = df.A.fillna(df['B'])
df['C'] = df.C.fillna(df['B'])
print(df)
Results:
A B C D
0 3.0 6.0 7.0 4.0
1 42.0 44.0 1.0 3.0
2 4.0 2.0 3.0 62.0
3 90.0 83.0 53.0 23.0
4 22.0 23.0 24.0 NaN
5 5.0 2.0 5.0 34.0
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 2.0 12.0 65.0 1.0
9 5.0 7.0 32.0 7.0
10 2.0 13.0 6.0 12.0
11 NaN NaN NaN NaN
12 23.0 NaN 23.0 34.0
13 61.0 NaN 63.0 3.0
14 32.0 43.0 12.0 76.0
15 24.0 2.0 34.0 2.0
A B C D
0 3.0 6.0 7.0 4.0
1 42.0 44.0 1.0 3.0
2 4.0 2.0 3.0 62.0
3 90.0 83.0 53.0 23.0
4 22.0 23.0 24.0 0.0
5 5.0 2.0 5.0 34.0
6 2.0 2.0 2.0 0.0
7 2.0 2.0 2.0 0.0
8 2.0 12.0 65.0 1.0
9 5.0 7.0 32.0 7.0
10 2.0 13.0 6.0 12.0
11 13.0 13.0 13.0 0.0
12 23.0 13.0 23.0 34.0
13 61.0 13.0 63.0 3.0
14 32.0 43.0 12.0 76.0
15 24.0 2.0 34.0 2.0

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

joint probability with a condition - python

Try the below. To select a column using .loc it is sufficient to just provide the name. df2 = df2.reset_index() df2.loc[df2['sknt'] >= 7, 'vsby'].sum()

Related

Pandas rolling but involves last rows value

How to use a for loop length as an optimization parameter in Pytorch

Cumulative percentage of pandas data frame

Getting most recent observation & date from several columns

Efficiently updating NaN's in a pandas dataframe from a prior row & specific columns value

Categories

Resources