Cumulative percentage of pandas data frame - python

I have a data frame like the one below, with an ID (code) and length and area values per distance band (Dist_km):
code Dist_km Shape_Leng Shape_Area
0 M0017 5.0 57516.601608 5.076465e+07
1 M0017 10.0 94037.663673 4.638184e+07
2 M0017 15.0 39106.310470 1.426327e+07
3 M0017 20.0 138.038115 6.464380e+02
4 M0017 30.0 12158.395200 4.102351e+06
5 M0073 5.0 51922.847698 3.375080e+07
6 M0073 10.0 75543.660382 5.966612e+07
7 M0073 15.0 55277.027428 3.423961e+07
8 M0073 20.0 26945.782055 2.584022e+07
9 M0073 25.0 4052.670711 6.904536e+05
10 M0333 5.0 30090.687597 5.468791e+07
11 M0333 10.0 55946.815385 5.768929e+07
12 M0333 15.0 65026.329732 4.008600e+07
13 M0333 20.0 59014.487216 2.994337e+07
14 M0333 25.0 17423.635441 6.358991e+06
Using:
mrb['cum_area_sqm'] = mrb.groupby(['code'])['Shape_Area'].apply(lambda x: x.cumsum())
mrb['cum_area_ha'] = mrb['cum_area_sqm']/10000
mrb_cumsum = mrb.groupby(['code','Dist_km']).agg({'cum_area_ha': 'sum'})
I have managed to convert the data frame to the following:
cum_area_ha
code Dist_km
M0017 5.0 5076.464548
10.0 9714.648238
15.0 11140.974881
20.0 11141.039525
30.0 11551.274623
M0073 5.0 3375.080465
10.0 9341.692680
15.0 12765.654064
20.0 15349.676332
25.0 15418.721691
M0333 5.0 5468.790981
10.0 11237.720454
15.0 15246.320869
20.0 18240.658255
25.0 18876.557351
However, I would now like to get the cumulative percentage of these areas for each code by Dist_km, running up to 100 percent.
So, for example for M0017, I would like to have something like the below.
cum_area_ha cum_area_pc
code Dist_km
M0017 5.0 5076.464548 43.49
10.0 9714.648238 84.10
15.0 11140.974881 96.45
20.0 11141.039525 96.45
30.0 11551.274623 100.00

You can divide each element by the last cum_area_ha in the same code group.
mrb_cumsum.div(mrb_cumsum.groupby(level=0).last())
Out[97]:
cum_area_ha
code Dist_km
M0017 5.0 0.439472
10.0 0.841002
15.0 0.964480
20.0 0.964486
30.0 1.000000
M0073 5.0 0.218895
10.0 0.605867
15.0 0.827932
20.0 0.995522
25.0 1.000000
M0333 5.0 0.289713
10.0 0.595327
15.0 0.807685
20.0 0.966313
25.0 1.000000
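The result above is a fraction; to get the cum_area_pc column exactly as requested (percentages rounded to two decimal places), multiply by 100. A small sketch building on mrb_cumsum from the question:
mrb_cumsum['cum_area_pc'] = (
    mrb_cumsum['cum_area_ha']
    .div(mrb_cumsum.groupby(level=0)['cum_area_ha'].transform('last'))
    .mul(100)
    .round(2)
)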

Related

How to use a for loop length as an optimization parameter in Pytorch

I am trying to register my loop's length as an optimization parameter so that the optimizer will automatically adjust the length of the loop.
Here is an example code (the real problem is more complex but you will get the idea):
import torch

positive_iter = torch.tensor([10.0], requires_grad=True)
negative_iter = torch.tensor([20.0], requires_grad=True)
optimizer = torch.optim.Adam([positive_iter, negative_iter], lr=0.02, betas=(0.5, 0.999))
for i in range(100):
    loss = torch.tensor([0.0], requires_grad=True)
    for g in range(int(positive_iter)):
        loss = loss + torch.rand(1)
    for d in range(int(negative_iter)):
        loss = loss - torch.rand(1) * 2
    loss = torch.abs(loss)
    loss.backward()
    optimizer.step()
    print(i, loss.item(), positive_iter.item(), negative_iter.item())
The optimization does not seem to work; here is the output:
0 19.467784881591797 10.0 20.0
1 14.334418296813965 10.0 20.0
2 13.515042304992676 10.0 20.0
3 13.477707862854004 10.0 20.0
4 15.240434646606445 10.0 20.0
5 18.45014190673828 10.0 20.0
6 18.557266235351562 10.0 20.0
7 16.325769424438477 10.0 20.0
8 13.95105266571045 10.0 20.0
9 12.435094833374023 10.0 20.0
10 13.70322322845459 10.0 20.0
11 10.128765106201172 10.0 20.0
12 16.986034393310547 10.0 20.0
13 15.652003288269043 10.0 20.0
14 10.300052642822266 10.0 20.0
15 18.038368225097656 10.0 20.0
16 11.830389022827148 10.0 20.0
17 14.917057037353516 10.0 20.0
18 18.603071212768555 10.0 20.0
19 17.595298767089844 10.0 20.0
20 17.17181968688965 10.0 20.0
21 14.548274993896484 10.0 20.0
22 18.839675903320312 10.0 20.0
23 13.375761032104492 10.0 20.0
24 14.045333862304688 10.0 20.0
25 13.088285446166992 10.0 20.0
26 15.019135475158691 10.0 20.0
27 16.992284774780273 10.0 20.0
28 13.883159637451172 10.0 20.0
29 12.695013999938965 10.0 20.0
30 17.23816680908203 10.0 20.0
...continued
Could you advise on how to make the for-loop length an optimization parameter?
Thank you
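One observation on the code above: int(positive_iter) truncates the tensor to a plain Python integer, so the loop count never enters the autograd graph, no gradient reaches positive_iter or negative_iter, and the loop never calls optimizer.zero_grad() either. A common workaround is to relax the count into a continuous quantity by gating each candidate iteration with a differentiable weight. A minimal sketch of that idea, assuming a fixed upper bound max_iters (not part of the original question):
import torch

max_iters = 50  # assumed upper bound on the loop length
positive_iter = torch.tensor(10.0, requires_grad=True)
optimizer = torch.optim.Adam([positive_iter], lr=0.02)

for step in range(100):
    optimizer.zero_grad()
    # Soft loop length: candidate iteration g contributes with weight
    # sigmoid(positive_iter - g) instead of a hard int() cutoff, so the
    # effective length stays differentiable.
    g = torch.arange(max_iters, dtype=torch.float32)
    weights = torch.sigmoid(positive_iter - g)
    loss = (weights * torch.rand(max_iters)).sum().abs()
    loss.backward()
    optimizer.step()
print(positive_iter.item())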

How do I change multiple values in pandas df column to np.nan, based on condition in other column?

I don't have much experience in coding and this is my first question, so please be patient with me. I need to find a way to change multiple values of a pandas df column to np.nan, based on a condition in another column. Therefore I have created copies of the required columns, "Vorgabe" and "Temp".
Whenever the value in "Grad" isn't 0, I want to change the values in a defined area of "Vorgabe" and "Temp" to np.nan.
print(df)
OptOpTemp OpTemp BSP Grad Vorgabe Temp
0 22.0 20.0 5 0.0 22.0 20.0
1 22.0 20.5 7 0.0 22.0 20.5
2 22.0 21.0 8 1.0 22.0 21.0
3 22.0 21.0 6 0.0 22.0 21.0
4 22.0 23.5 7 0.0 22.0 20.0
5 23.0 21.5 1 0.0 23.0 21.5
6 24.0 22.5 3 1.0 24.0 22.5
7 24.0 23.0 4 0.0 24.0 23.0
8 24.0 25.5 9 0.0 24.0 25.5
So I want to achieve something like this:
OptOpTemp OpTemp BSP Grad Vorgabe Temp
0 22.0 20.0 5 0.0 22.0 20.0
1 22.0 20.5 7 0.0 nan nan <-one row above
2 22.0 21.0 8 1.0 nan nan
3 22.0 21.0 6 0.0 nan nan <- one row below
4 22.0 23.5 7 0.0 22.0 20.0
5 23.0 21.5 1 0.0 nan nan
6 24.0 22.5 3 1.0 nan nan
7 24.0 23.0 4 0.0 nan nan
8 24.0 25.5 9 0.0 24.0 25.5
Does somebody have a solution to my problem?
EDIT: I may have been unclear. The goal is to change every value in "Vorgabe" and "Temp" in a defined area to NaN. In my example the area is the row with 1.0 in it, plus one row above and one row below. So not only the row where the 1.0 is located, but also the rows above and below it.
Use loc:
df.loc[df.Grad != 0.0, ['Vorgabe', 'Temp']] = np.nan
print(df)
Output
OptOpTemp OpTemp BSP Grad Vorgabe Temp
0 22.0 20.0 5 0.0 22.0 20.0
1 22.0 20.5 7 0.0 22.0 20.5
2 22.0 21.0 8 1.0 NaN NaN
3 22.0 21.0 6 0.0 22.0 21.0
4 22.0 23.5 7 0.0 22.0 20.0
5 23.0 21.5 1 0.0 23.0 21.5
6 24.0 22.5 3 1.0 NaN NaN
7 24.0 23.0 4 0.0 24.0 23.0
8 24.0 25.5 9 0.0 24.0 25.5
You could use numpy.where.
import numpy as np
df['Vorgabe'] = np.where(df['Grad'] != 0, np.nan, df['OptOpTemp'])
df['Temp'] = np.where(df['Grad'] != 0, np.nan, df['OpTemp'])
Chain the 3 conditions with | (bitwise OR); for the rows above and below the 1 use masks built with shift:
mask1 = df['Grad'] == 1
mask2 = df['Grad'].shift() == 1
mask3 = df['Grad'].shift(-1) == 1
Or, to catch any non-zero value rather than just 1 (note that shift inserts NaN at the edges and NaN != 0 is True, so the first and last rows need care with this variant):
mask1 = df['Grad'] != 0
mask2 = df['Grad'].shift() != 0
mask3 = df['Grad'].shift(-1) != 0
mask = mask1 | mask2 | mask3
df.loc[mask, ['Vorgabe', 'Temp']] = np.nan
print (df)
OptOpTemp OpTemp BSP Grad Vorgabe Temp
0 22.0 20.0 5 0.0 22.0 20.0
1 22.0 20.5 7 0.0 NaN NaN
2 22.0 21.0 8 1.0 NaN NaN
3 22.0 21.0 6 0.0 NaN NaN
4 22.0 23.5 7 0.0 22.0 20.0
5 23.0 21.5 1 0.0 NaN NaN
6 24.0 22.5 3 1.0 NaN NaN
7 24.0 23.0 4 0.0 NaN NaN
8 24.0 25.5 9 0.0 24.0 25.5
General solution for multiple rows:
N = 1
#create range for test values between -N and N
r = np.concatenate([np.arange(0, N+1), np.arange(-1, -N-1, -1)])
#create boolean mask by comparing with shift and join together by reduce
mask = np.logical_or.reduce([df['Grad'].shift(x) == 1 for x in r])
df.loc[mask, ['Vorgabe', 'Temp']] = np.nan
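For intuition, with N = 1 the array r evaluates to [0, 1, -1], i.e. the row itself, one row below, and one row above, which reproduces the three explicit masks:
import numpy as np
N = 1
r = np.concatenate([np.arange(0, N+1), np.arange(-1, -N-1, -1)])
print(r)  # [ 0  1 -1]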
EDIT:
You can join both masks together:
N = 1
r1 = np.concatenate([np.arange(0, N+1), np.arange(-1, -N-1, -1)])
mask1 = np.logical_or.reduce([df['Grad'].shift(x) == 1 for x in r1])
N = 2
r2 = np.concatenate([np.arange(0, N+1), np.arange(-1, -N-1, -1)])
mask2 = np.logical_or.reduce([df['Grad'].shift(x) == 1.5 for x in r2])
#if == 1.5 fails to match because of float precision, compare with np.isclose instead:
#mask2 = np.logical_or.reduce([np.isclose(df['Grad'].shift(x), 1.5) for x in r2])
mask = mask1 | mask2
df.loc[mask, ['Vorgabe', 'Temp']] = np.nan
print (df)
OptOpTemp OpTemp BSP Grad Vorgabe Temp
0 22.0 20.0 5 0.0 22.0 20.0
1 22.0 20.5 7 0.0 NaN NaN
2 22.0 21.0 8 1.0 NaN NaN
3 22.0 21.0 6 0.0 NaN NaN
4 22.0 23.5 7 0.0 NaN NaN
5 23.0 21.5 1 0.0 NaN NaN
6 24.0 22.5 3 1.5 NaN NaN <- changed value to 1.5
7 24.0 23.0 4 0.0 NaN NaN
8 24.0 25.5 9 0.0 NaN NaN
You can use df.apply(f, axis=1) and define f to be what you want to do on each row. Your description seems to be saying you want:
def f(row):
    if row['Grad'] != 0:
        row.loc[['Vorgabe', 'Temp']] = np.nan
    return row
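Applied row-wise and assigned back (apply returns a new frame rather than modifying df in place):
df = df.apply(f, axis=1)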
However, your example seems to be saying you want something else.

joint probability with a condition

I am working with wind speed (sknt) and visibility (vsby) data in hourly intervals from weather stations. I was able to calculate the probability of each wind speed conditional on visibility using this (df2 divides the joint frequencies by the per-visibility totals in df1):
df1 = df.groupby('vsby').size().div(len(df))
df2 = df.groupby(['vsby', 'sknt']).size().div(len(df)).div(df1, axis=0, level='vsby')
vsby sknt 0
0 6.0 15.0 1.000000
1 10.0 0.0 1.000000
2 11.0 7.0 0.500000
3 11.0 16.0 0.500000
4 13.0 12.0 1.000000
5 14.0 3.0 0.500000
6 14.0 4.0 0.250000
7 14.0 12.0 0.250000
8 16.0 0.0 0.099796
9 16.0 2.0 0.209776
10 16.0 3.0 0.173116
11 16.0 4.0 0.134420
12 16.0 5.0 0.175153
13 16.0 6.0 0.024440
14 16.0 7.0 0.032587
15 16.0 8.0 0.018330
16 16.0 9.0 0.024440
17 16.0 10.0 0.024440
18 16.0 11.0 0.026477
19 16.0 12.0 0.016293
20 16.0 13.0 0.014257
21 16.0 14.0 0.008147
22 16.0 15.0 0.008147
23 16.0 16.0 0.004073
24 16.0 17.0 0.004073
25 16.0 18.0 0.002037
I am interested in finding the probability of wind speed >= x for all visibility recorded. For example, vsby 16, probability = (0.018330 + 0.024440 + 0.024440 + 0.026477 + 0.016293 + 0.014257 + 0.008147 + 0.008147 + 0.004073 + 0.004073 + 0.002037)
I tried,
df2.loc[df2.sknt >= 7, df2.vsby].sum()
but it's not working.
Try the below. To select a column with .loc it is sufficient to provide its name; after reset_index the probability column produced by the groupby is labelled 0, so that is the column to sum:
df2 = df2.reset_index()
df2.loc[df2['sknt'] >= 7, 0].sum()
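To get this for every visibility value at once, one option (a sketch, assuming the reset-index frame above with its probability column still labelled 0) is to filter and then group:
threshold = 8  # the vsby 16 example above sums the probabilities for sknt >= 8
prob_by_vsby = df2[df2['sknt'] >= threshold].groupby('vsby')[0].sum()
print(prob_by_vsby)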

Efficiently updating NaN's in a pandas dataframe from a prior row & specific columns value

I have a pandas DataFrame that looks like this:
# Output
# A B C D
# 0 3.0 6.0 7.0 4.0
# 1 42.0 44.0 1.0 3.0
# 2 4.0 2.0 3.0 62.0
# 3 90.0 83.0 53.0 23.0
# 4 22.0 23.0 24.0 NaN
# 5 5.0 2.0 5.0 34.0
# 6 NaN NaN NaN NaN
# 7 NaN NaN NaN NaN
# 8 2.0 12.0 65.0 1.0
# 9 5.0 7.0 32.0 7.0
# 10 2.0 13.0 6.0 12.0
# 11 NaN NaN NaN NaN
# 12 23.0 NaN 23.0 34.0
# 13 61.0 NaN 63.0 3.0
# 14 32.0 43.0 12.0 76.0
# 15 24.0 2.0 34.0 2.0
What I would like to do is fill the NaN's with the nearest preceding row's non-null B value. The exception is column D, where NaN's should be replaced with zeros instead.
I've looked into ffill and fillna, but neither seems to be able to do the job on its own.
My solution so far:
def fix_abc(row, column, df):
    # If the row/column value is null/nan
    if pd.isnull(row[column]):
        # Get the value of row[column] from the row before
        prior = row.name
        value = df[prior-1:prior]['B'].values[0]
        # If that value is empty, go to the row before that
        while pd.isnull(value) and prior >= 1:
            prior = prior - 1
            value = df[prior-1:prior]['B'].values[0]
    else:
        value = row[column]
    return value

df['A'] = df.apply(lambda x: fix_abc(x, 'A', df), axis=1)
df['B'] = df.apply(lambda x: fix_abc(x, 'B', df), axis=1)
df['C'] = df.apply(lambda x: fix_abc(x, 'C', df), axis=1)

def fix_d(x):
    if pd.isnull(x['D']):
        return 0
    return x['D']

df['D'] = df.apply(lambda x: fix_d(x), axis=1)
This feels quite inefficient and slow, so I'm wondering if there is a quicker, more efficient way to do this.
Example output:
# A B C D
# 0 3.0 6.0 7.0 4.0
# 1 42.0 44.0 1.0 3.0
# 2 4.0 2.0 3.0 62.0
# 3 90.0 83.0 53.0 23.0
# 4 22.0 23.0 24.0 0.0
# 5 5.0 2.0 5.0 34.0
# 6 2.0 2.0 2.0 0.0
# 7 2.0 2.0 2.0 0.0
# 8 2.0 12.0 65.0 1.0
# 9 5.0 7.0 32.0 7.0
# 10 2.0 13.0 6.0 12.0
# 11 13.0 13.0 13.0 0.0
# 12 23.0 13.0 23.0 34.0
# 13 61.0 13.0 63.0 3.0
# 14 32.0 43.0 12.0 76.0
# 15 24.0 2.0 34.0 2.0
I have dumped the code including the data for the dataframe into a python fiddle available (here)
fillna allows for various ways to do the filling. In this case, column D can just fill with 0. Column B can fill via pad. And then columns A and C can fill from column B, like:
Code:
df['D'] = df.D.fillna(0)
df['B'] = df.B.fillna(method='pad')
df['A'] = df.A.fillna(df['B'])
df['C'] = df.C.fillna(df['B'])
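As a side note, newer pandas releases deprecate fillna(method='pad') in favour of ffill, so the forward fill can equally be written as:
df['B'] = df.B.ffill()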
Test Code:
from io import StringIO
import pandas as pd

df = pd.read_fwf(StringIO(u"""
A B C D
3.0 6.0 7.0 4.0
42.0 44.0 1.0 3.0
4.0 2.0 3.0 62.0
90.0 83.0 53.0 23.0
22.0 23.0 24.0 NaN
5.0 2.0 5.0 34.0
NaN NaN NaN NaN
NaN NaN NaN NaN
2.0 12.0 65.0 1.0
5.0 7.0 32.0 7.0
2.0 13.0 6.0 12.0
NaN NaN NaN NaN
23.0 NaN 23.0 34.0
61.0 NaN 63.0 3.0
32.0 43.0 12.0 76.0
24.0 2.0 34.0 2.0"""), header=1)
print(df)
df['D'] = df.D.fillna(0)
df['B'] = df.B.fillna(method='pad')
df['A'] = df.A.fillna(df['B'])
df['C'] = df.C.fillna(df['B'])
print(df)
Results:
A B C D
0 3.0 6.0 7.0 4.0
1 42.0 44.0 1.0 3.0
2 4.0 2.0 3.0 62.0
3 90.0 83.0 53.0 23.0
4 22.0 23.0 24.0 NaN
5 5.0 2.0 5.0 34.0
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 2.0 12.0 65.0 1.0
9 5.0 7.0 32.0 7.0
10 2.0 13.0 6.0 12.0
11 NaN NaN NaN NaN
12 23.0 NaN 23.0 34.0
13 61.0 NaN 63.0 3.0
14 32.0 43.0 12.0 76.0
15 24.0 2.0 34.0 2.0
A B C D
0 3.0 6.0 7.0 4.0
1 42.0 44.0 1.0 3.0
2 4.0 2.0 3.0 62.0
3 90.0 83.0 53.0 23.0
4 22.0 23.0 24.0 0.0
5 5.0 2.0 5.0 34.0
6 2.0 2.0 2.0 0.0
7 2.0 2.0 2.0 0.0
8 2.0 12.0 65.0 1.0
9 5.0 7.0 32.0 7.0
10 2.0 13.0 6.0 12.0
11 13.0 13.0 13.0 0.0
12 23.0 13.0 23.0 34.0
13 61.0 13.0 63.0 3.0
14 32.0 43.0 12.0 76.0
15 24.0 2.0 34.0 2.0

Combining dataframes in pandas with the same rows and columns, but different cell values

I'm interested in combining two dataframes in pandas that have the same row indices and column names, but different cell values. See the example below:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A': [22, 2, np.NaN, np.NaN],
                    'B': [23, 4, np.NaN, np.NaN],
                    'C': [24, 6, np.NaN, np.NaN],
                    'D': [25, 8, np.NaN, np.NaN]})
df2 = pd.DataFrame({'A': [np.NaN, np.NaN, 56, 100],
                    'B': [np.NaN, np.NaN, 58, 101],
                    'C': [np.NaN, np.NaN, 59, 102],
                    'D': [np.NaN, np.NaN, 60, 103]})
In[6]: print(df1)
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
In[7]: print(df2)
A B C D
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
I would like the resulting frame to look like this:
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
I have tried different ways of pd.concat and pd.merge but some of the data always gets replaced with NaNs. Any pointers in the right direction would be greatly appreciated.
Use combine_first:
print (df1.combine_first(df2))
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
Or fillna:
print (df1.fillna(df2))
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
Or update (note it modifies df1 in place and takes every non-NaN value from df2; it matches the other approaches here only because the frames' non-NaN cells do not overlap):
df1.update(df2)
print (df1)
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
