Applying a function to chunks of the Dataframe - python

I have a Dataframe (df) (for instance - simplified version)
A B
0 2.0 3.0
1 3.0 4.0
and generated 20 bootstrap resamples that are all now in the same df but differ in the Resample Nr.
A B
0 1 0 2.0 3.0
1 1 1 3.0 4.0
2 2 1 3.0 4.0
3 2 1 3.0 4.0
.. ..
.. ..
39 20 0 2.0 3.0
40 20 0 2.0 3.0
Now I want to apply a certain function to each Resample Nr., say:
C = sum(df['A'] * df['B']) / sum(df['B'] ** 2)
The output would look like this:
A B C
0 1 0 2.0 3.0 Calculated Value X1
1 1 1 3.0 4.0 Calculated Value X1
2 2 1 3.0 4.0 Calculated Value X2
3 2 1 3.0 4.0 Calculated Value X2
.. ..
.. ..
39 20 0 2.0 3.0 Calculated Value X20
40 20 0 2.0 3.0 Calculated Value X20
So there are 20 different new values.
I know there is a df.iloc command where I can specify my row selection df.iloc[row, column] but I would like to find a command where I don't have to repeat the code for the 20 samples.
My goal is to find a command that identifies the Resample Nr. automatically and then calculates the function for each Resample Nr.
How can I do this?
Thank you!

Use DataFrame.assign to create two new columns x and y that correspond to df['A'] * df['B'] and df['B']**2, then use DataFrame.groupby on Resample Nr. (or level=1) and transform using sum:
s = df.assign(x=df['A'].mul(df['B']), y=df['B']**2)\
      .groupby(level=1)[['x', 'y']].transform('sum')
df['C'] = s['x'].div(s['y'])
Result:
A B C
0 1 0 2.0 3.0 0.720000
1 1 1 3.0 4.0 0.720000
2 2 1 3.0 4.0 0.750000
3 2 1 3.0 4.0 0.750000
39 20 0 2.0 3.0 0.666667
40 20 0 2.0 3.0 0.666667
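For anyone who wants to run the idea end to end, here is a minimal sketch. It assumes the resample number lives in an ordinary column called `resample` (a hypothetical name; in the question it appears to be an index level) and uses a four-row toy frame rather than the asker's data.

```python
import pandas as pd

# Toy data: two resamples of two rows each (illustrative values only).
df = pd.DataFrame({
    "resample": [1, 1, 2, 2],   # hypothetical column holding the Resample Nr.
    "A": [2.0, 3.0, 3.0, 3.0],
    "B": [3.0, 4.0, 4.0, 4.0],
})

# Per-row numerator and denominator terms.
terms = df.assign(x=df["A"] * df["B"], y=df["B"] ** 2)

# Sum the terms within each resample and broadcast the sums back to every row.
sums = terms.groupby("resample")[["x", "y"]].transform("sum")

df["C"] = sums["x"] / sums["y"]
print(df)
```

Using transform rather than an aggregation keeps the result the same length as df, so it can be assigned straight back as a column.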

Related

How to populate NaN values based on conditions from two other columns using Pandas?

I have a dataframe that looks something like this:
ID hiqual Wave
1 1.0 g
1 NaN i
1 NaN k
2 1.0 g
2 NaN i
2 NaN k
3 1.0 g
3 NaN i
4 5.0 g
4 NaN i
This is a long format dataframe and I have my hiqual variable for my first measurement wave (g). I would like to populate the NaN values for the subsequent measurement waves (i and k) with the same value given in wave g for each ID.
I tried using fillna() but I am not sure how to provide the two conditions of ID and Wave and how to populate based on that. I would be grateful for any help/suggestions on this?
The exact expected output is unclear, but I think you might want:
m = df['hiqual'].isna()
df.loc[m, 'hiqual'] = df['Wave'].mask(m).ffill()
If your dataframe is already ordered by the ID and Wave columns, you can simply forward fill the values:
>>> df.sort_values(['ID', 'Wave']).ffill()
ID hiqual Wave
0 1 1.0 g
1 1 1.0 i
2 1 1.0 k
3 2 1.0 g
4 2 1.0 i
5 2 1.0 k
6 3 1.0 g
7 3 1.0 i
8 4 5.0 g
9 4 5.0 i
You can also use the g values explicitly:
g_vals = df[df['Wave']=='g'].set_index('ID')['hiqual']
df['hiqual'] = df['hiqual'].fillna(df['ID'].map(g_vals))
print(df)
print(g_vals)
# Output
ID hiqual Wave
0 1 1.0 g
1 1 1.0 i
2 1 1.0 k
3 2 1.0 g
4 2 1.0 i
5 2 1.0 k
6 3 1.0 g
7 3 1.0 i
8 4 5.0 g
9 4 5.0 i
# g_vals
ID
1 1.0
2 1.0
3 1.0
4 5.0
Name: hiqual, dtype: float64
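A further option, sketched here under the assumption that each ID has exactly one non-missing hiqual value (the wave g measurement), is to broadcast the first non-null value per ID with groupby and transform. The frame below is rebuilt from the table in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID":     [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    "hiqual": [1.0, np.nan, np.nan, 1.0, np.nan, np.nan, 1.0, np.nan, 5.0, np.nan],
    "Wave":   ["g", "i", "k", "g", "i", "k", "g", "i", "g", "i"],
})

# GroupBy "first" picks the first non-null value in each group,
# and transform broadcasts it back to every row of that group.
df["hiqual"] = df.groupby("ID")["hiqual"].transform("first")
print(df)
```

Note that this overwrites every row of hiqual, not only the NaN ones, so it is only equivalent to the fillna approach when the assumption above holds.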

How to calculate cumulative sum until a threshold and reset it after the threshold is reached considering groups in pandas dataframe in python?

I have a dataframe like this:
import pandas as pd
import numpy as np
data={'trip':[1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3],
'timestamps':[1235471761, 1235471763, 1235471765, 1235471767, 1235471770, 1235471772, 1235471776, 1235471779, 1235471780, 1235471789,1235471792,1235471793,1235471829,1235471833,1235471835,1235471838,1235471844,1235471847,1235471848,1235471852,1235471855,1235471859,1235471900,1235471904,1235471911,1235471913]}
df = pd.DataFrame(data)
df['TimeDistance'] = df.groupby('trip')['timestamps'].diff(1)
df
What I am looking for is to start from the first row (treated as an origin) of the "TimeDistance" column and take a cumulative sum over its values; whenever this sum reaches 10, the cumsum should restart, and this procedure should continue until the end of the trip (as you can see, this dataframe has 3 trips in the "trip" column).
I want all the cumulative sums in a new column, let's say a "cumu" column.
Another important point is that after the threshold is reached, the next row in the "cumu" column must be zero, and the summation restarts from this new origin.
I hope I've understood your question right. You can use generator with .send():
def my_accumulate(maxval):
    val = 0
    yield
    while True:
        if val < maxval:
            val += yield val
        else:
            yield val
            val = 0
def fn(x):
    a = my_accumulate(10)
    next(a)
    x["cumu"] = [a.send(v) for v in x["TimeDistance"]]
    return x
df = df.groupby("trip").apply(fn)
print(df)
Prints:
trip timestamps TimeDistance cumu
0 1 1235471761 NaN 0.0
1 1 1235471763 2.0 2.0
2 1 1235471765 2.0 4.0
3 1 1235471767 2.0 6.0
4 1 1235471770 3.0 9.0
5 1 1235471772 2.0 11.0
6 1 1235471776 4.0 0.0
7 1 1235471779 3.0 3.0
8 1 1235471780 1.0 4.0
9 1 1235471789 9.0 13.0
10 1 1235471792 3.0 0.0
11 1 1235471793 1.0 1.0
12 2 1235471829 NaN 0.0
13 2 1235471833 4.0 4.0
14 2 1235471835 2.0 6.0
15 2 1235471838 3.0 9.0
16 2 1235471844 6.0 15.0
17 2 1235471847 3.0 0.0
18 2 1235471848 1.0 1.0
19 2 1235471852 4.0 5.0
20 2 1235471855 3.0 8.0
21 2 1235471859 4.0 12.0
22 3 1235471900 NaN 0.0
23 3 1235471904 4.0 4.0
24 3 1235471911 7.0 11.0
25 3 1235471913 2.0 0.0
Another solution:
df = df.groupby("trip").apply(
    lambda x: x.assign(
        cumu=(
            val := 0,
            *(
                val := val + v if val < 10 else (val := 0)
                for v in x["TimeDistance"][1:]
            ),
        )
    ),
)
print(df)
Andrej's answer is better: mine is probably not as efficient, and it depends on df being ordered by trip and on TimeDistance being NaN in the first row of each trip.
cummulative_sum = 0
df['cumu'] = 0
for i in range(len(df)):
    if np.isnan(df.loc[i,'TimeDistance']) or cummulative_sum >= 10:
        cummulative_sum = 0
        df.loc[i, 'cumu'] = 0
    else:
        cummulative_sum += df.loc[i,'TimeDistance']
        df.loc[i, 'cumu'] = cummulative_sum
print(df) outputs:
trip timestamps TimeDistance cumu
0 1 1235471761 NaN 0
1 1 1235471763 2.0 2
2 1 1235471765 2.0 4
3 1 1235471767 2.0 6
4 1 1235471770 3.0 9
5 1 1235471772 2.0 11
6 1 1235471776 4.0 0
7 1 1235471779 3.0 3
8 1 1235471780 1.0 4
9 1 1235471789 9.0 13
10 1 1235471792 3.0 0
11 1 1235471793 1.0 1
12 2 1235471829 NaN 0
13 2 1235471833 4.0 4
14 2 1235471835 2.0 6
15 2 1235471838 3.0 9
16 2 1235471844 6.0 15
17 2 1235471847 3.0 0
18 2 1235471848 1.0 1
19 2 1235471852 4.0 5
20 2 1235471855 3.0 8
21 2 1235471859 4.0 12
22 3 1235471900 NaN 0
23 3 1235471904 4.0 4
24 3 1235471911 7.0 11
25 3 1235471913 2.0 0
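The same reset rule can also be written as a small helper applied per trip with groupby and transform, which removes the dependence on row order across trips. This is only a sketch: `reset_cumsum` is a hypothetical helper name, and the short frame below is a made-up subset in the same shape as the question's data.

```python
import numpy as np
import pandas as pd

def reset_cumsum(values, threshold=10):
    """Running sum that restarts at 0 on the row after the threshold is reached."""
    out = []
    total = 0.0
    for v in values:
        if np.isnan(v) or total >= threshold:
            total = 0.0      # new origin: first row of a trip, or row after a hit
        else:
            total += v
        out.append(total)
    # Return a Series aligned with the input so transform can place the values.
    return pd.Series(out, index=values.index)

df = pd.DataFrame({
    "trip": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
    "TimeDistance": [np.nan, 2, 2, 2, 3, 2, 4, np.nan, 4, 7],
})

# transform passes each trip's TimeDistance values to the helper as a Series.
df["cumu"] = df.groupby("trip")["TimeDistance"].transform(reset_cumsum)
print(df)
```

The loop is still plain Python, so this is about readability rather than speed; the reset depends on the running total, which makes a purely vectorised version awkward.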

How to find the cumsum of a column?

dict={"asset":["S3","S2","E4","E1","A6","A8"],
"Rank":[1,2,3,4,5,6],"number_of_attributes":[2,1,2,2,1,1],
"number_of_cards":[1,2,2,1,2," "],"cards_plus1":[2,3,3,2,3," "]}
dframe=pd.DataFrame(dict,index=[1,2,3,4,5,6],
columns=["asset","Rank","number_of_attributes","number_of_cards","cards_plus1"])
I want to compute the cumulative sum of the column "cards_plus1".
How can I do this?
The cumsum column should look like this:
0
2
5
8
10
13
I want it to start with zero instead of 2, i.e. I want this output: cards_plus1_cumsum 0 2 5 8 10 13
We can just pad a zero before the sums:
dframe["cumsum"] = np.pad(dframe["cards_plus1"][:-1].cumsum(), (1, 0), 'constant')
Try this:
First, replace the blank values with NaN:
import pandas as pd
import numpy as np
dict={"asset":["S3","S2","E4","E1","A6","A8"],"Rank":[1,2,3,4,5,6],"number_of_attributes":[2,1,2,2,1,1],
"number_of_cards":[1,2,2,1,2," "],"cards_plus1":[2,3,3,2,3," "]}
dframe=pd.DataFrame(dict,index=[1,2,3,4,5,6],
columns=["asset","Rank","number_of_attributes","number_of_cards","cards_plus1"])
## replace blank values with NaN (inplace=True returns None, so there is nothing to print here)
dframe.replace(r'^\s*$', np.nan, regex=True, inplace=True)
print (dframe)
>>> asset Rank number_of_attributes number_of_cards cards_plus1
1 S3 1 2 1.0 2.0
2 S2 2 1 2.0 3.0
3 E4 3 2 2.0 3.0
4 E1 4 2 1.0 2.0
5 A6 5 1 2.0 3.0
6 A8 6 1 NaN NaN
Now the data type of the cards_plus1 column is object - change to numeric
### convert data type of the cards_plus1 to numeric
dframe['cards_plus1'] = pd.to_numeric(dframe['cards_plus1'])
Now calculate cumulative sum
### now we can calculate cumsum
dframe['cards_plus1_cumsum'] = dframe['cards_plus1'].cumsum()
print(dframe)
>>>
asset Rank number_of_attributes number_of_cards cards_plus1 \
1 S3 1 2 1.0 2.0
2 S2 2 1 2.0 3.0
3 E4 3 2 2.0 3.0
4 E1 4 2 1.0 2.0
5 A6 5 1 2.0 3.0
6 A8 6 1 NaN NaN
cards_plus1_cumsum
1 2.0
2 5.0
3 8.0
4 10.0
5 13.0
6 NaN
Instead of replacing the blank values with NaN, you can replace them with zero, depending on what you want. Hope this helps.
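To get a series that starts at zero, as the question asks, one possibility is to shift the cumulative sum down by one row. A minimal sketch, assuming the blank entry has already been converted to NaN so that the column is numeric:

```python
import numpy as np
import pandas as pd

dframe = pd.DataFrame(
    {"cards_plus1": [2, 3, 3, 2, 3, np.nan]},
    index=[1, 2, 3, 4, 5, 6],
)

# cumsum gives 2, 5, 8, 10, 13, NaN; shifting by one row moves each total
# down, and fill_value=0 supplies the leading zero.
dframe["cards_plus1_cumsum"] = dframe["cards_plus1"].cumsum().shift(1, fill_value=0)
print(dframe)
```

The trailing NaN from the blank row is pushed out by the shift, so the new column has no missing values.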

How to find the difference between multiple columns of a given data frame and save the result as a separate data frame

I have a data frame as below:
df = pd.DataFrame({'A':[1,4,7,1,4,7],'B':[2,5,8,2,5,8],'C':[3,6,9,3,6,9],'D':[1,2,3,1,2,3]})
A B C D
0 1 2 3 1
1 4 5 6 2
2 7 8 9 3
3 1 2 3 1
4 4 5 6 2
5 7 8 9 3
How can I find the difference between columns A and B and save it as AB, and do the same with C and D and save it as CD, in a separate data frame?
Expected output:
AB CD
0 1.0 -2.0
1 1.0 -4.0
2 1.0 -6.0
3 1.0 -2.0
4 1.0 -4.0
5 1.0 -6.0
I tried using:
d = dict(A='AB', B='AB', C='CD', D='CD')
df.groupby(d, axis=1).diff()
As explained here, this works well for sum(), but does not work as expected for diff(). Can someone please explain why?
The difference is that diff does not aggregate values like sum does; it returns two new columns per group, the first filled with NaN and the second with the differences.
So a possible solution is to drop the all-NaN columns with DataFrame.dropna:
d = dict(A='AB', B='AB', C='CD', D='CD')
df1 = df.rename(columns=d).groupby(level=0, axis=1).diff().dropna(axis=1, how='all')
print (df1)
AB CD
0 1.0 -2.0
1 1.0 -4.0
2 1.0 -6.0
3 1.0 -2.0
4 1.0 -4.0
5 1.0 -6.0
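Since only two columns feed each result, the same frame can also be built with plain column subtraction, which avoids grouping along the column axis altogether (as far as I know, groupby with axis=1 is deprecated in recent pandas releases, so check this for your version). A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7, 1, 4, 7], 'B': [2, 5, 8, 2, 5, 8],
                   'C': [3, 6, 9, 3, 6, 9], 'D': [1, 2, 3, 1, 2, 3]})

# Second column minus first column for each pair, matching what diff() yields.
df1 = pd.DataFrame({
    'AB': df['B'] - df['A'],
    'CD': df['D'] - df['C'],
})
print(df1)
```

The values stay integers here, whereas the diff-based version produces floats because of the intermediate NaN columns.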

Implement a counter which resets in a Python pandas data frame

Hi I would like to implement a counter which counts the number of successive zero observations in a dataframe (across multiple columns). But I would like to reset it if a non-zero observation is found. I have used a for loop but it is incredibly slow, I am sure there must be far more efficient ways. This is my code:
Here is a snapshot of df
df.head()
ACL ACT ADH ADR AFE AFH AFT
2013-02-05 NaN NaN NaN NaN NaN NaN NaN
2013-02-12 -0.136861 -0.020406 0.046150 0.000000 -0.005321 NaN 0.058195
2013-02-19 -0.006632 0.041665 0.007365 0.012738 0.040930 NaN -0.037818
2013-02-26 -0.023848 -0.023999 -0.030677 -0.003144 0.050604 NaN -0.047604
2013-03-05 0.009771 -0.024589 -0.021073 -0.039432 0.047315 NaN 0.068727
I first initialise an empty data frame which has the same properties of df (dataframe) above
df1=pd.DataFrame(index=df.index, columns=df.columns)
df1=df1.fillna(0)
Then I create my function which iterates over the rows, but this only deals with one column at a time
def zero_obs(x=df,y=df1):
    for i in range(len(x)):
        if x[i] == 0:
            y[i] = y[i-1] + 1
        else:
            y[i] = 0
    return y
for col in df.columns:
    df1[col] = zero_obs(x=df[col],y=df1[col])
Really appreciate any help!!
The output I expect is as follows:
df1.tail()
BRN AXL TTO AGL ACL
2017-01-03 3 125 0 0 0
2017-01-10 0 126 0 0 0
2017-01-17 1 127 0 0 0
2017-01-24 0 128 0 0 0
2017-01-31 0 129 1 0 0
setup
Consider the dataframe df
df = pd.DataFrame(
    np.zeros((10, 2), dtype=int),
    columns=list('AB')
)
df.loc[[0, 4, 8], 'A'] = 1
df.loc[6, 'B'] = 1
print(df)
A B
0 1 0
1 0 0
2 0 0
3 0 0
4 1 0
5 0 0
6 0 1
7 0 0
8 1 0
9 0 0
Option 1
pandas apply
def zero_obs(x):
    """`x` is assumed to be a `pd.Series`"""
    csum = x.eq(0).cumsum()
    cpos = csum.where(x.ne(0)).ffill().fillna(0)
    return csum.sub(cpos)
print(df.apply(zero_obs))
A B
0 0.0 1.0
1 1.0 2.0
2 2.0 3.0
3 3.0 4.0
4 0.0 5.0
5 1.0 6.0
6 2.0 0.0
7 3.0 1.0
8 0.0 2.0
9 1.0 3.0
Option 2
don't use apply
This function works just as well on df
zero_obs(df)
A B
0 0.0 1.0
1 1.0 2.0
2 2.0 3.0
3 3.0 4.0
4 0.0 5.0
5 1.0 6.0
6 2.0 0.0
7 3.0 1.0
8 0.0 2.0
9 1.0 3.0
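A different way to get the same counts, shown as a sketch rather than as part of the answer above, is to label each run with a block id and take a cumulative count inside each block. `zero_run_length` is a hypothetical helper name; the frame is the setup frame from above.

```python
import numpy as np
import pandas as pd

def zero_run_length(s):
    """Count consecutive zeros in a Series, resetting at each non-zero value."""
    is_zero = s.eq(0)
    # Each non-zero value starts a new block; the zeros after it share its id.
    block = (~is_zero).cumsum()
    # Within a block, the cumulative count of zeros is the current run length.
    return is_zero.astype(int).groupby(block).cumsum()

df = pd.DataFrame(np.zeros((10, 2), dtype=int), columns=list('AB'))
df.loc[[0, 4, 8], 'A'] = 1
df.loc[6, 'B'] = 1

counts = df.apply(zero_run_length)
print(counts)
```

This keeps the counts as integers, and NaN entries reset the counter because they do not compare equal to zero.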
