Pandas rolling function with overlap - python

I would like to apply a function to one pandas dataframe column that does the following task:
I have a cycle counter that starts at some value but occasionally restarts.
I would like the counter to keep increasing instead of restarting.
The function I use at the moment is the following one:
Code
import pandas as pd
d = {'Cycle':[100,100,100,100,101,101,101,102,102,102,102,102,102,103,103,103,100,100,100,100,101,101,101,101]}
df = pd.DataFrame(data=d)
df.loc[:,'counter'] = df['Cycle'].to_numpy()
df.loc[:,'counter'] = df['counter'].rolling(2).apply(lambda x: x[0] if (x[0] == x[1]) else x[0]+1, raw=True)
print(df)
Output
Cycle counter
0 100 NaN
1 100 100.0
2 100 100.0
3 100 100.0
4 101 101.0
5 101 101.0
6 101 101.0
7 102 102.0
8 102 102.0
9 102 102.0
10 102 102.0
11 102 102.0
12 102 102.0
13 103 103.0
14 103 103.0
15 103 103.0
16 100 104.0
17 100 100.0
18 100 100.0
19 100 100.0
20 101 101.0
21 101 101.0
22 101 101.0
23 101 101.0
My goal is to get a dataframe similar to this one:
Cycle counter
0 100 NaN
1 100 100.0
2 100 100.0
3 100 100.0
4 101 101.0
5 101 101.0
6 101 101.0
7 102 102.0
8 102 102.0
9 102 102.0
10 102 102.0
11 102 102.0
12 102 102.0
13 103 103.0
14 103 103.0
15 103 103.0
16 100 104.0
17 100 104.0
18 100 104.0
19 100 104.0
20 101 105.0
21 101 105.0
22 101 105.0
23 101 105.0
How do I use the rolling function with an overlap of one?
Do you have any recommendations for reaching my goal?
Best regards,
Matteo

Another approach is to identify the points in the Cycle column where the value changes, using .diff(). At those points, increment from the original starting cycle value, then merge back onto the original dataframe, forward filling the new values.
# Rows where Cycle changes (the first row has diff() == NaN, which also counts as a change)
df2 = df[df['Cycle'].diff().apply(lambda x: x != 0)].reset_index()
# Number those change points, starting from the original initial cycle value
df2['Target Count'] = df[df['Cycle'].diff().apply(lambda x: x != 0)].reset_index().reset_index().apply(lambda x: df.iloc[0, 0] + x['level_0'], axis=1)
# Merge back onto the original dataframe and forward fill the new counter
df = df.merge(df2.drop('Cycle', axis=1), right_on='index', left_index=True, how='left').ffill().set_index('index', drop=True)
df.index.name = None  # drop the leftover 'index' name
df
Cycle Target Count
0 100 100.0
1 100 100.0
2 100 100.0
3 100 100.0
4 101 101.0
5 101 101.0
6 101 101.0
7 102 102.0
8 102 102.0
9 102 102.0
10 102 102.0
11 102 102.0
12 102 102.0
13 103 103.0
14 103 103.0
15 103 103.0
16 100 104.0
17 100 104.0
18 100 104.0
19 100 104.0
20 101 105.0
21 101 105.0
22 101 105.0
23 101 105.0

We can use shift and ne (the same as !=) to check where the Cycle column changes.
Then we use cumsum to build a counter that increases each time Cycle changes.
We add the first value of Cycle minus 1 to the counter, so that it starts at 100:
groups = df['Cycle'].ne(df['Cycle'].shift()).cumsum()
df['counter'] = groups + df['Cycle'].iat[0] - 1
Cycle counter
0 100 100
1 100 100
2 100 100
3 100 100
4 101 101
5 101 101
6 101 101
7 102 102
8 102 102
9 102 102
10 102 102
11 102 102
12 102 102
13 103 103
14 103 103
15 103 103
16 100 104
17 100 104
18 100 104
19 100 104
20 101 105
21 101 105
22 101 105
23 101 105
Details: groups gives us a counter starting at 1:
print(groups)
0 1
1 1
2 1
3 1
4 2
5 2
6 2
7 3
8 3
9 3
10 3
11 3
12 3
13 4
14 4
15 4
16 5
17 5
18 5
19 5
20 6
21 6
22 6
23 6
Name: Cycle, dtype: int64

Joining 2 dataframe based on a column [duplicate]

The following is the structure of one of my dataframes:
strike coi chgcoi
120 200 20
125 210 15
130 230 12
135 240 9
and the other one is:
strike poi chgpoi
125 210 15
130 230 12
135 240 9
140 225 12
What I want is:
strike coi chgcoi strike poi chgpoi
120 200 20 120 0 0
125 210 15 125 210 15
130 230 12 130 230 12
135 240 9 135 240 9
140 0 0 140 225 12
First, you need to create the two dataframes using pandas:
df1 = pd.DataFrame({'column_1': [val_1, val_2, ..., val_n], 'column_2': [val_1, val_2, ..., val_n]})
df2 = pd.DataFrame({'column_1': [val_1, val_2, ..., val_n], 'column_2': [val_1, val_2, ..., val_n]})
Then you can use an outer join:
df1.merge(df2, on='common_column_name', how='outer')
df1
strike coi chgcoi
0 120 200 20
1 125 210 15
2 130 230 12
3 135 240 9
df2
strike poi chgpoi
0 125 210 15
1 130 230 12
2 135 240 9
3 140 225 12
merged = df1.merge(df2, how="outer", on="strike")
merged
strike coi chgcoi poi chgpoi
0 120 200.0 20.0 NaN NaN
1 125 210.0 15.0 210.0 15.0
2 130 230.0 12.0 230.0 12.0
3 135 240.0 9.0 240.0 9.0
4 140 NaN NaN 225.0 12.0
merged.fillna(0)
strike coi chgcoi poi chgpoi
0 120 200.0 20.0 0.0 0.0
1 125 210.0 15.0 210.0 15.0
2 130 230.0 12.0 230.0 12.0
3 135 240.0 9.0 240.0 9.0
4 140 0.0 0.0 225.0 12.0
This is your expected result, with the only difference that 'strike' is not repeated.
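If the duplicated strike column from your expected layout is really needed, a copy of it can be inserted back after filling the NaNs. This is only a sketch: the column name strike_poi is made up here, because two columns cannot both simply be named 'strike'.
merged_filled = merged.fillna(0)
# insert a copy of 'strike' in front of the poi columns (hypothetical name 'strike_poi')
merged_filled.insert(3, 'strike_poi', merged_filled['strike'])
print(merged_filled)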

Concatenate Dataframes on common values yields NaN values for non-matches [python]

I am trying to merge/concatenate two dataframes on a common column and match up all the corresponding values. However, while matching rows receive the corresponding values, a NaN value is produced wherever there is no match. I am using Python for this. I will explain in more detail here.
I have this dataframe A:
ID Area Distance Height Temp
----------------------------------------------------
0 100 8.31 0 1.30 24.27
1 101 3.11 0 1.29 25.99
2 102 5.10 0 1.23 29.51
3 105 9.70 0 1.97 15.17
4 107 4.77 0 1.53 27.84
...
Each ID represents a different building footprint (polygon), with its recorded area, the height of the building, and the mean outdoor temperature recorded at the site of the building. The "Distance" column denotes the distance from the building at which the temperature was recorded, so onsite = 0 meters away.
And I have this dataframe B:
ID Temp Distance
---------------------------------
0 100 25.68 5
1 100 26.05 10
2 100 26.85 15
3 100 27.25 20
4 100 27.78 25
5 101 22.68 5
6 101 26.44 10
7 101 26.83 15
8 101 27.26 20
9 101 28.38 25
10 102 25.63 5
11 102 26.26 10
12 102 26.57 15
13 102 26.91 20
14 102 28.84 25
15 105 25.33 5
16 105 26.25 10
17 105 26.54 15
18 105 26.23 20
19 105 27.53 25
20 107 25.23 5
21 107 26.73 10
22 107 26.26 15
23 107 26.11 20
24 107 27.16 25
...
This shows for the same building IDs the temperatures recorded at different distances away from the building, and so for each building I want the recorded mean outdoor temperature 5 meters away, 10 meters away, 15 meters away, 20 meters away, and 25 meters away.
What I want to do is join dataframes A and B by the common "ID" column, producing a dataframe C that shows, for each ID, that building's Temperature at Distances 0, 5, 10, 15, 20, and 25. The issue is that I want the Area and Height for each building ID to remain the same, because of course the building's area and height do not change! And so I want to produce the following dataframe C:
ID Area Distance Height Temp
----------------------------------------------------
0 100 8.31 0 1.30 24.27
1 100 8.31 5 1.30 25.68
2 100 8.31 10 1.30 26.05
3 100 8.31 15 1.30 26.85
4 100 8.31 20 1.30 27.25
5 100 8.31 25 1.30 27.78
6 101 3.11 0 1.29 25.99
7 101 3.11 5 1.29 22.68
8 101 3.11 10 1.29 26.44
9 101 3.11 15 1.29 26.83
10 101 3.11 20 1.29 27.26
11 101 3.11 25 1.29 28.38
12 102 5.10 0 1.23 29.51
13 102 5.10 5 1.23 25.63
14 102 5.10 10 1.23 26.26
15 102 5.10 15 1.23 26.57
16 102 5.10 20 1.23 26.91
17 102 5.10 25 1.23 28.84
18 105 9.70 0 1.97 15.17
19 105 9.70 5 1.97 25.33
20 105 9.70 10 1.97 26.25
21 105 9.70 15 1.97 26.54
22 105 9.70 20 1.97 26.23
23 105 9.70 25 1.97 27.53
24 107 4.77 0 1.53 27.84
25 107 4.77 5 1.53 25.23
26 107 4.77 10 1.53 26.73
27 107 4.77 15 1.53 26.26
28 107 4.77 20 1.53 26.11
29 107 4.77 25 1.53 27.16
...
And so to obtain this I try the following, trying to concatenate dataframes A and B on the "ID" column, and then sorting the rows by "ID" and "Distance":
df_C = pd.concat([df_A, df_B]).sort_values(["ID", "Distance"]).reset_index(drop=True)
However this yields:
ID Area Distance Height Temp
----------------------------------------------------
0 100 8.31 0 1.30 24.27
1 100 NaN 5 NaN 25.68
2 100 NaN 10 NaN 26.05
3 100 NaN 15 NaN 26.85
4 100 NaN 20 NaN 27.25
5 100 NaN 25 NaN 27.78
6 101 3.11 0 1.29 25.99
7 101 NaN 5 NaN 22.68
8 101 NaN 10 NaN 26.44
9 101 NaN 15 NaN 26.83
10 101 NaN 20 NaN 27.26
11 101 NaN 25 NaN 28.38
12 102 5.10 0 1.23 29.51
13 102 NaN 5 NaN 25.63
14 102 NaN 10 NaN 26.26
15 102 NaN 15 NaN 26.57
16 102 NaN 20 NaN 26.91
17 102 NaN 25 NaN 28.84
18 105 9.70 0 1.97 15.17
19 105 NaN 5 NaN 25.33
20 105 NaN 10 NaN 26.25
21 105 NaN 15 NaN 26.54
22 105 NaN 20 NaN 26.23
23 105 NaN 25 NaN 27.53
24 107 4.77 0 1.53 27.84
25 107 NaN 5 NaN 25.23
26 107 NaN 10 NaN 26.73
27 107 NaN 15 NaN 26.26
28 107 NaN 20 NaN 26.11
29 107 NaN 25 NaN 27.16
...
And so it appears that the Area and Height values are not getting matched up, because dataframe B does not contain the corresponding Area and Height values, so there is nothing to fill in when I combine the two dataframes. How can I fix this so that I get my intended dataframe C?
If you are sure that every ID appears in df_A with a distance of 0, and that there are no NaNs other than in Area and Height, then ffill can do it once the rows are sorted as you did:
df_C = df_C.ffill()
If you are not sure, then you can use groupby.transform with first and fillna
df_C = df_C.fillna(df_C.groupby('ID')[['Area', 'Height']].transform('first'))
Finally, another option is to add the Area and Height columns to df_B first, then concat:
df_C = pd.concat([
    df_A,
    df_B.merge(df_A[['ID', 'Area', 'Height']], on='ID', how='left')
]).sort_values(["ID", "Distance"]).reset_index(drop=True)
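For completeness, here is a minimal self-contained sketch of the forward-fill route, using only a two-building cut of the data from the question (distances 0, 5 and 10); restricting ffill to Area and Height is a small variation on the first option above, not a change in the idea.
import pandas as pd

# toy stand-ins for df_A and df_B, with values taken from the question
df_A = pd.DataFrame({'ID': [100, 101], 'Area': [8.31, 3.11], 'Distance': [0, 0],
                     'Height': [1.30, 1.29], 'Temp': [24.27, 25.99]})
df_B = pd.DataFrame({'ID': [100, 100, 101, 101],
                     'Temp': [25.68, 26.05, 22.68, 26.44],
                     'Distance': [5, 10, 5, 10]})

# stack, sort so the Distance == 0 row of each ID comes first, then forward fill
df_C = pd.concat([df_A, df_B]).sort_values(['ID', 'Distance']).reset_index(drop=True)
df_C[['Area', 'Height']] = df_C[['Area', 'Height']].ffill()
print(df_C)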

How to add additional text to matplotlib annotations

I have used seaborn's titanic dataset as a proxy for my very large dataset, and created the chart and data below based on it.
The following code runs without any errors:
import seaborn as sns
import pandas as pd
import numpy as np
sns.set_theme(style="darkgrid")
# Load the example Titanic dataset
df = sns.load_dataset("titanic")
# split fare into decile groups and order them
df['fare_grp'] = pd.qcut(df['fare'], q=10,labels=None, retbins=False, precision=0).astype(str)
df.groupby(['fare_grp'],dropna=False).size()
df['fare_grp_num'] = pd.qcut(df['fare'], q=10,labels=False, retbins=False, precision=0).astype(str)
df.groupby(['fare_grp_num'],dropna=False).size()
df['fare_ord_grp'] = df['fare_grp_num'] + ' ' +df['fare_grp']
df['fare_ord_grp']
# set variables
target = 'survived'
ydim = 'fare_ord_grp'
xdim = 'embark_town'
#del [result]
non_events = pd.DataFrame(df[df[target]==0].groupby([ydim,xdim],as_index=False, dropna=False)[target].count()).rename(columns={target: 'non_events'})
non_events[xdim]=non_events[xdim].replace(np.nan, 'Missing', regex=True)
non_events[ydim]=non_events[ydim].replace(np.nan, 'Missing', regex=True)
non_events_total = pd.DataFrame(df[df[target]==0].groupby([xdim],dropna=False,as_index=False)[target].count()).rename(columns={target: 'non_events_total_by_xdim'}).replace(np.nan, 'Missing', regex=True)
events = pd.DataFrame(df[df[target]==1].groupby([ydim,xdim],as_index=False, dropna=False)[target].count()).rename(columns={target: 'events'})
events[xdim]=events[xdim].replace(np.nan, 'Missing', regex=True)
events[ydim]=events[ydim].replace(np.nan, 'Missing', regex=True)
events_total = pd.DataFrame(df[df[target]==1].groupby([xdim],dropna=False,as_index=False)[target].count()).rename(columns={target: 'events_total_by_xdim'}).replace(np.nan, 'Missing', regex=True)
grand_total = pd.DataFrame(df.groupby([xdim],dropna=False,as_index=False)[target].count()).rename(columns={target: 'total_by_xdim'}).replace(np.nan, 'Missing', regex=True)
grand_total=grand_total.merge(non_events_total, how='left', on=xdim).merge(events_total, how='left', on=xdim)
result = pd.merge(non_events, events, how="outer",on=[ydim,xdim])
result['total'] = result['non_events'].fillna(0) + result['events'].fillna(0)
result[xdim] = result[xdim].replace(np.nan, 'Missing', regex=True)
result = pd.merge(result, grand_total, how="left",on=[xdim])
result['survival rate %'] = round(result['events']/result['total']*100,2)
result['% event dist by xdim'] = round(result['events']/result['events_total_by_xdim']*100,2)
result['% non-event dist by xdim'] = round(result['non_events']/result['non_events_total_by_xdim']*100,2)
result['% total dist by xdim'] = round(result['total']/result['total_by_xdim']*100,2)
display(result)
value_name1 = "% dist by " + str(xdim)
dfl = pd.melt(result, id_vars=[ydim, xdim],value_vars =['% total dist by xdim'], var_name = 'Type',value_name=value_name1).drop(columns='Type')
dfl2 = dfl.pivot(index=ydim, columns=xdim, values=value_name1)
print(dfl2)
title1 = "% dist by " + str(xdim)
ax=dfl2.T.plot(kind='bar', stacked=True, rot=1, figsize=(8, 8), title=title1)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)
ax.legend(bbox_to_anchor=(1.0, 1.0),title = 'Fare Range')
ax.set_ylabel('% Dist')
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x+width/2, y+height/2, '{:.0f}%'.format(height), horizontalalignment='center', verticalalignment='center')
It produces the following stacked percent bar chart, which shows the % of total distribution by embark town.
I also want to show the survival rate along with the % distribution in each block. For example, for Queenstown, fare range 1 (7.6, 7.9], the % total distribution is 56%. I want to display the survival rate 37.21% as (56%, 37.21%). I am not able to figure it out. Kindly offer any suggestions. Thanks.
Here is the output summary table for reference
   | fare_ord_grp   | embark_town | non_events | events | total | total_by_xdim | non_events_total_by_xdim | events_total_by_xdim | survival rate % | % event dist by xdim | % non-event dist by xdim | % total dist by xdim
0  | 0 (-0.1,7.6]   | Cherbourg   | 22  | 7   | 29 | 168 | 75  | 93  | 24.14 | 7.53  | 29.33 | 17.26
1  | 0 (-0.1,7.6]   | Queenstown  | 4   | NaN | 4  | 77  | 47  | 30  | NaN   | NaN   | 8.51  | 5.19
2  | 0 (-0.1,7.6]   | Southampton | 53  | 6   | 59 | 644 | 427 | 217 | 10.17 | 2.76  | 12.41 | 9.16
3  | 1 (7.6,7.9]    | Queenstown  | 27  | 16  | 43 | 77  | 47  | 30  | 37.21 | 53.33 | 57.45 | 55.84
4  | 1 (7.6,7.9]    | Southampton | 34  | 10  | 44 | 644 | 427 | 217 | 22.73 | 4.61  | 7.96  | 6.83
5  | 2 (7.9,8]      | Cherbourg   | 4   | 1   | 5  | 168 | 75  | 93  | 20    | 1.08  | 5.33  | 2.98
6  | 2 (7.9,8]      | Southampton | 83  | 13  | 96 | 644 | 427 | 217 | 13.54 | 5.99  | 19.44 | 14.91
7  | 3 (8.0,10.5]   | Cherbourg   | 2   | 1   | 3  | 168 | 75  | 93  | 33.33 | 1.08  | 2.67  | 1.79
8  | 3 (8.0,10.5]   | Queenstown  | 2   | NaN | 2  | 77  | 47  | 30  | NaN   | NaN   | 4.26  | 2.6
9  | 3 (8.0,10.5]   | Southampton | 56  | 17  | 73 | 644 | 427 | 217 | 23.29 | 7.83  | 13.11 | 11.34
10 | 4 (10.5,14.5]  | Cherbourg   | 7   | 8   | 15 | 168 | 75  | 93  | 53.33 | 8.6   | 9.33  | 8.93
11 | 4 (10.5,14.5]  | Queenstown  | 1   | 2   | 3  | 77  | 47  | 30  | 66.67 | 6.67  | 2.13  | 3.9
12 | 4 (10.5,14.5]  | Southampton | 40  | 26  | 66 | 644 | 427 | 217 | 39.39 | 11.98 | 9.37  | 10.25
13 | 5 (14.5,21.7]  | Cherbourg   | 9   | 10  | 19 | 168 | 75  | 93  | 52.63 | 10.75 | 12    | 11.31
14 | 5 (14.5,21.7]  | Queenstown  | 5   | 3   | 8  | 77  | 47  | 30  | 37.5  | 10    | 10.64 | 10.39
15 | 5 (14.5,21.7]  | Southampton | 37  | 24  | 61 | 644 | 427 | 217 | 39.34 | 11.06 | 8.67  | 9.47
16 | 6 (21.7,27]    | Cherbourg   | 1   | 4   | 5  | 168 | 75  | 93  | 80    | 4.3   | 1.33  | 2.98
17 | 6 (21.7,27]    | Queenstown  | 2   | 3   | 5  | 77  | 47  | 30  | 60    | 10    | 4.26  | 6.49
18 | 6 (21.7,27]    | Southampton | 40  | 39  | 79 | 644 | 427 | 217 | 49.37 | 17.97 | 9.37  | 12.27
19 | 7 (27.0,39.7]  | Cherbourg   | 14  | 10  | 24 | 168 | 75  | 93  | 41.67 | 10.75 | 18.67 | 14.29
20 | 7 (27.0,39.7]  | Queenstown  | 5   | NaN | 5  | 77  | 47  | 30  | NaN   | NaN   | 10.64 | 6.49
21 | 7 (27.0,39.7]  | Southampton | 38  | 24  | 62 | 644 | 427 | 217 | 38.71 | 11.06 | 8.9   | 9.63
22 | 8 (39.7,78]    | Cherbourg   | 5   | 19  | 24 | 168 | 75  | 93  | 79.17 | 20.43 | 6.67  | 14.29
23 | 8 (39.7,78]    | Southampton | 37  | 28  | 65 | 644 | 427 | 217 | 43.08 | 12.9  | 8.67  | 10.09
24 | 9 (78.0,512.3] | Cherbourg   | 11  | 33  | 44 | 168 | 75  | 93  | 75    | 35.48 | 14.67 | 26.19
25 | 9 (78.0,512.3] | Queenstown  | 1   | 1   | 2  | 77  | 47  | 30  | 50    | 3.33  | 2.13  | 2.6
26 | 9 (78.0,512.3] | Southampton | 9   | 30  | 39 | 644 | 427 | 217 | 76.92 | 13.82 | 2.11  | 6.06
27 | 2 (7.9,8]      | Queenstown  | NaN | 5   | 5  | 77  | 47  | 30  | 100   | 16.67 | NaN   | 6.49
28 | 9 (78.0,512.3] | Missing     | NaN | 2   | 2  | 2   | NaN | 2   | 100   | 100   | NaN   | 100
dfl2.T is being plotted, but 'survival rate %' is in result. As such, the indices for the values from dfl2.T do not correspond with 'survival rate %'.
Because the values in result['% total dist by xdim'] are not all unique, we can't use a dict of matched key-values.
Create a corresponding pivoted DataFrame for 'survival rate %', and then flatten it. All of the values will be in the same order as the '% total dist by xdim' values from dfl2.T. As such, they can be indexed.
With respect to dfl2.T, the plot API plots in column order, which means .flatten(order='F') must be used to flatten the array in the correct order to be indexed.
# create a corresponding pivoted dataframe for survival rate %
dfl3 = pd.melt(result, id_vars=[ydim, xdim],value_vars =['survival rate %'], var_name = 'Type',value_name=value_name1).drop(columns='Type')
dfl4 = dfl3.pivot(index=ydim, columns=xdim, values=value_name1)
# flatten dfl4.T in column order
dfl4_flattened = dfl4.T.to_numpy().flatten(order='F')
for i, p in enumerate(ax.patches):
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    # only print values when height is not 0
    if height != 0:
        # create the text string
        text = f'{height:.0f}%, {dfl4_flattened[i]:.0f}%'
        # annotate the bar segments
        ax.text(x+width/2, y+height/2, text, horizontalalignment='center', verticalalignment='center')
Notes
Here we can see dfl2.T and dfl4.T
# dfl2.T
fare_ord_grp 0 (-0.1, 7.6] 1 (7.6, 7.9] 2 (7.9, 8.0] 3 (8.0, 10.5] 4 (10.5, 14.5] 5 (14.5, 21.7] 6 (21.7, 27.0] 7 (27.0, 39.7] 8 (39.7, 78.0] 9 (78.0, 512.3]
embark_town
Cherbourg 17.26 NaN 2.98 1.79 8.93 11.31 2.98 14.29 14.29 26.19
Missing NaN NaN NaN NaN NaN NaN NaN NaN NaN 100.00
Queenstown 5.19 55.84 6.49 2.60 3.90 10.39 6.49 6.49 NaN 2.60
Southampton 9.16 6.83 14.91 11.34 10.25 9.47 12.27 9.63 10.09 6.06
# dfl4.T
fare_ord_grp 0 (-0.1, 7.6] 1 (7.6, 7.9] 2 (7.9, 8.0] 3 (8.0, 10.5] 4 (10.5, 14.5] 5 (14.5, 21.7] 6 (21.7, 27.0] 7 (27.0, 39.7] 8 (39.7, 78.0] 9 (78.0, 512.3]
embark_town
Cherbourg 24.14 NaN 20.00 33.33 53.33 52.63 80.00 41.67 79.17 75.00
Missing NaN NaN NaN NaN NaN NaN NaN NaN NaN 100.00
Queenstown NaN 37.21 100.00 NaN 66.67 37.50 60.00 NaN NaN 50.00
Southampton 10.17 22.73 13.54 23.29 39.39 39.34 49.37 38.71 43.08 76.92

Calculate mean of data rows in dataframe with date-headers, dictated by a 'datetime'-column

I have a dataframe with ID's of clients and their expenses for 2014-2018. What I want is to have the mean of the expenses per ID but only the years before a certain date can be taken into account when calculating the mean value (so column 'Date' dictates which columns can be taken into account for the mean).
Example: for index 0 (ID: 12), the date is '2016-03-08', so the mean should be taken from the columns 'y_2014' and 'y_2015'; for this index the mean is therefore 111.0.
If the date is too early (e.g. somewhere in 2014 or earlier in this case), then NaN should be returned (see indices 6 and 9).
Initial dataframe:
y_2014 y_2015 y_2016 y_2017 y_2018 Date ID
0 100.0 122.0 324 632 NaN 2016-03-08 12
1 120.0 159.0 54 452 541.0 2015-04-09 96
2 NaN 164.0 687 165 245.0 2016-02-15 20
3 180.0 421.0 512 184 953.0 2018-05-01 73
4 110.0 654.0 913 173 103.0 2017-08-04 84
5 130.0 NaN 754 124 207.0 2016-07-03 26
6 170.0 256.0 843 97 806.0 2013-02-04 87
7 140.0 754.0 95 101 541.0 2016-06-08 64
8 80.0 985.0 184 84 90.0 2019-03-05 11
9 96.0 65.0 127 130 421.0 2014-05-14 34
Desired output:
y_2014 y_2015 y_2016 y_2017 y_2018 Date ID mean
0 100.0 122.0 324 632 NaN 2016-03-08 12 111.0
1 120.0 159.0 54 452 541.0 2015-04-09 96 120.0
2 NaN 164.0 687 165 245.0 2016-02-15 20 164.0
3 180.0 421.0 512 184 953.0 2018-05-01 73 324.25
4 110.0 654.0 913 173 103.0 2017-08-04 84 559.0
5 130.0 NaN 754 124 207.0 2016-07-03 26 130.0
6 170.0 256.0 843 97 806.0 2013-02-04 87 NaN
7 140.0 754.0 95 101 541.0 2016-06-08 64 447
8 80.0 985.0 184 84 90.0 2019-03-05 11 284.6
9 96.0 65.0 127 130 421.0 2014-05-14 34 NaN
Tried code: I'm still working on it, as I don't really know how to start. I have only set up the dataframe so far; probably something with the datetime package has to be done to get the desired dataframe?
import pandas as pd
import numpy as np
import datetime
df = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
                   "y_2014": [100,120,np.nan,180,110,130,170,140,80,96],
                   "y_2015": [122,159,164,421,654,np.nan,256,754,985,65],
                   "y_2016": [324,54,687,512,913,754,843,95,184,127],
                   "y_2017": [632,452,165,184,173,124,97,101,84,130],
                   "y_2018": [np.nan,541,245,953,103,207,806,541,90,421],
                   "Date": ['2016-03-08', '2015-04-09', '2016-02-15', '2018-05-01', '2017-08-04',
                            '2016-07-03', '2013-02-04', '2016-06-08', '2019-03-05', '2014-05-14']})
print(df)
Due to your naming convention, one needs to extract the years from the column names for comparison. Then you can mask the data and take the mean:
# the years from the columns
data = df.filter(like='y_')
data_years = data.columns.str.extract(r'(\d+)')[0].astype(int)
# the years from Date
years = pd.to_datetime(df.Date).dt.year.values
df['mean'] = data.where(data_years < years[:, None]).mean(1)
Output:
y_2014 y_2015 y_2016 y_2017 y_2018 Date ID mean
0 100.0 122.0 324 632 NaN 2016-03-08 12 111.00
1 120.0 159.0 54 452 541.0 2015-04-09 96 120.00
2 NaN 164.0 687 165 245.0 2016-02-15 20 164.00
3 180.0 421.0 512 184 953.0 2018-05-01 73 324.25
4 110.0 654.0 913 173 103.0 2017-08-04 84 559.00
5 130.0 NaN 754 124 207.0 2016-07-03 26 130.00
6 170.0 256.0 843 97 806.0 2013-02-04 87 NaN
7 140.0 754.0 95 101 541.0 2016-06-08 64 447.00
8 80.0 985.0 184 84 90.0 2019-03-05 11 284.60
9 96.0 65.0 127 130 421.0 2014-05-14 34 NaN
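If the comparison between the pandas objects and the 2-D numpy array behaves differently across pandas versions, the mask can also be built explicitly with plain numpy broadcasting; this is just a sketch reusing the names defined above, not a change to the approach.
# boolean mask of shape (n_rows, n_columns): True where the column's year lies before the row's Date year
mask = data_years.to_numpy() < years[:, None]
df['mean'] = data.where(mask).mean(axis=1)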
One more answer:
import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
                   "y_2014": [100,120,np.nan,180,110,130,170,140,80,96],
                   "y_2015": [122,159,164,421,654,np.nan,256,754,985,65],
                   "y_2016": [324,54,687,512,913,754,843,95,184,127],
                   "y_2017": [632,452,165,184,173,124,97,101,84,130],
                   "y_2018": [np.nan,541,245,953,103,207,806,541,90,421],
                   "Date": ['2016-03-08', '2015-04-09', '2016-02-15', '2018-05-01', '2017-08-04',
                            '2016-07-03', '2013-02-04', '2016-06-08', '2019-03-05', '2014-05-14']})
# Subset of the original df used to calculate the mean
subset = df.loc[:, ['y_2014', 'y_2015', 'y_2016', 'y_2017', 'y_2018']]
# An expense value only becomes available for the mean once that year has passed,
# so 2015-01-01 is chosen for the 'y_2014' column in the subset etc., to compare
# against the 'Date' column
subset.columns = ['2015-01-01', '2016-01-01', '2017-01-01', '2018-01-01', '2019-01-01']

s = subset.columns[0:].values < df.Date.values[:,None]
t = s.astype(float)
t[t == 0] = np.nan
df['mean'] = (subset.iloc[:,0:]*t).mean(1)

print(df)
# Additionally: gives the sum of the expenses before the date in the 'Date' column
df['sum'] = (subset.iloc[:,0:]*t).sum(1)

print(df)

Iterating over loc on dataframes

I'm trying to extract data from a list of dataframes by extracting row ranges. The dataframes might not all contain the same rows, so I have a list of possible index ranges that I would like loc to loop over: from the code sample below, I might want CIN to LAN, but on another dataframe the CIN row doesn't exist, so I will want DET to LAN or HOU to LAN.
So I was thinking of putting them in a list and iterating over it, i.e.
for df in dfs:
    ranges = [[df.loc["CIN":"LAN"]], [df.loc["DET":"LAN"]]]
    extracted_ranges = (i for i in ranges)
I'm not sure how you would iterate over a list and feed it into loc, or perhaps .query().
df1 stint g ab r h X2b X3b hr rbi sb cs bb \
year team
2007 CIN 6 379 745 101 203 35 2 36 125.0 10.0 1.0 105
DET 5 301 1062 162 283 54 4 37 144.0 24.0 7.0 97
HOU 4 311 926 109 218 47 6 14 77.0 10.0 4.0 60
LAN 11 413 1021 153 293 61 3 36 154.0 7.0 5.0 114
NYN 13 622 1854 240 509 101 3 61 243.0 22.0 4.0 174
SFN 5 482 1305 198 337 67 6 40 171.0 26.0 7.0 235
TEX 2 198 729 115 200 40 4 28 115.0 21.0 4.0 73
TOR 4 459 1408 187 378 96 2 58 223.0 4.0 2.0 190
df2 so ibb hbp sh sf gidp
year team
2008 DET 176.0 3.0 10.0 4.0 8.0 28.0
HOU 212.0 3.0 9.0 16.0 6.0 17.0
LAN 141.0 8.0 9.0 3.0 8.0 29.0
NYN 310.0 24.0 23.0 18.0 15.0 48.0
SFN 188.0 51.0 8.0 16.0 6.0 41.0
TEX 140.0 4.0 5.0 2.0 8.0 16.0
TOR 265.0 16.0 12.0 4.0 16.0 38.0
Here is a solution:
import pandas as pd
# Prepare a list of ranges
ranges = [('CIN','LAN'), ('DET','LAN')]
# Declare an empty list of data frames and a list with the existing data frames
df_ranges = []
df_list = [df1, df2]
# Loop over multi-indices
for i, idx_range in enumerate(ranges):
    df = df_list[i]
    row1, row2 = idx_range
    df_ranges.append(df.loc[(slice(None), slice(row1, row2)), :])
# Print the extracted data
print('Extracted data:\n')
print(df_ranges)
Output:
[ stint g ab r h X2b X3b hr rbi sb cs bb
year team
2007 CIN 6 379 745 101 203 35 2 36 125 10 1 105
DET 5 301 1062 162 283 54 4 37 144 24 7 97
HOU 4 311 926 109 218 47 6 14 77 10 4 60
LAN 11 413 1021 153 293 61 3 36 154 7 5 114
so ibb hbp sh sf gidp
year team
2008 DET 176 3 10 4 8 28
HOU 212 3 9 16 6 17
LAN 141 8 9 3 8 29]
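If the ranges should adapt to whatever labels each frame actually contains, rather than being hard-coded per frame, a small helper along these lines could also work. This is only a sketch: the helper name, the candidate start labels and the empty-frame fallback are assumptions, not part of the answer above.
def slice_available_range(df, starts=('CIN', 'DET', 'HOU'), end='LAN'):
    # assumes the 'team' index level is sorted, as in df1/df2 above
    teams = df.index.get_level_values('team')
    for start in starts:
        if start in teams:
            # slice all years, and teams from the first available start label to the end label
            return df.loc[(slice(None), slice(start, end)), :]
    return df.iloc[0:0]  # none of the candidate labels found: return an empty frame

extracted_ranges = [slice_available_range(df) for df in [df1, df2]]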
