I have this pandas DataFrame df:
df.head()
windIntensity year month day hour AOBT delay
3 2015 1 1 0 0 0.0 15.0
2 2015 1 1 0 0 0.0 10.0
2 2015 1 1 1 0 0.0 5.0
2 2015 1 1 1 0 0.0 0.0
1 2015 1 1 2 0 0.0 0.0
When I execute this code:
df = df.groupby(["year","hour"]).agg({'windIntensity':'mean','delay':['mean','count']}).reset_index()
I get this result:
year hour windIntensity delay
mean mean count
0 2015 0 4.239207 24.240373 857
1 2015 1 4.029024 15.770449 758
2 2015 2 3.863928 7.431322 779
3 2015 3 3.859801 4.161290 806
4 2015 4 3.782659 4.722230 6851
But how can I rename the columns so that I get a single header row instead of two?
Expected result:
year hour windIntensity_mean delay_mean count
0 2015 0 4.239207 24.240373 857
1 2015 1 4.029024 15.770449 758
2 2015 2 3.863928 7.431322 779
3 2015 3 3.859801 4.161290 806
4 2015 4 3.782659 4.722230 6851
Demo:
source DF with multi-level columns:
In [223]: r
Out[223]:
year hour windIntensity delay
mean mean count
0 1 0 2015 6.0 5
solution:
In [224]: r.columns = r.columns.map(lambda c: ('_' if c[1] else '').join(c))
result:
In [225]: r
Out[225]:
year hour windIntensity_mean delay_mean delay_count
0 1 0 2015 6.0 5
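For reference, here is a minimal self-contained sketch of the same flattening with toy numbers (not the original data), plus a named-aggregation variant (pandas 0.25+) that avoids the MultiIndex in the first place:

```python
import pandas as pd

df = pd.DataFrame({'year': [2015, 2015], 'hour': [0, 0],
                   'windIntensity': [3, 2], 'delay': [15.0, 10.0]})
r = df.groupby(['year', 'hour']).agg(
        {'windIntensity': 'mean', 'delay': ['mean', 'count']}).reset_index()

# flatten the two-level header: join non-empty parts with '_'
r.columns = ['_'.join(c).rstrip('_') for c in r.columns]

# alternative: named aggregation builds flat column names directly
r2 = df.groupby(['year', 'hour']).agg(
        windIntensity_mean=('windIntensity', 'mean'),
        delay_mean=('delay', 'mean'),
        delay_count=('delay', 'count')).reset_index()
```

Both produce the single header row year, hour, windIntensity_mean, delay_mean, delay_count.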
My dataframe looks like this:
customer_nr  order_value  year_ordered  payment_successful
1            50           1980          1
1            75           2017          0
1            10           2020          1
2            55           2000          1
2            300          2007          1
2            15           2010          0
I want to know the total amount a customer has successfully paid in the years before, for a specific order.
The expected output is as follows:
customer_nr  order_value  year_ordered  payment_successful  total_successfully_previously_paid
1            50           1980          1                   0
1            75           2017          0                   50
1            10           2020          1                   50
2            55           2000          1                   0
2            300          2007          1                   55
2            15           2010          0                   355
Closest I've gotten is this:
df.groupby(['customer_nr', 'payment_successful'], as_index=False)['order_value'].sum()
That just gives me the summed amount successfully and unsuccessfully paid all time per customer. It doesn't account for selecting only previous orders to participate in the sum.
Try:
df["total_successfully_previously_paid"] = (
    df["payment_successful"].mul(df["order_value"])
      .groupby(df["customer_nr"])
      .transform(lambda x: x.cumsum().shift().fillna(0))
)
>>> df
customer_nr ... total_successfully_previously_paid
0 1 ... 0.0
1 1 ... 50.0
2 1 ... 50.0
3 2 ... 0.0
4 2 ... 55.0
5 2 ... 355.0
[6 rows x 5 columns]
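A self-contained sketch of this answer with the question's data; note that it assumes the rows are already sorted by year within each customer (as they are here):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_nr":        [1, 1, 1, 2, 2, 2],
    "order_value":        [50, 75, 10, 55, 300, 15],
    "year_ordered":       [1980, 2017, 2020, 2000, 2007, 2010],
    "payment_successful": [1, 0, 1, 1, 1, 0],
})

# successful amounts only, then a per-customer running total shifted one row
# down so each order only sees strictly earlier orders
df["total_successfully_previously_paid"] = (
    df["payment_successful"].mul(df["order_value"])
      .groupby(df["customer_nr"])
      .transform(lambda x: x.cumsum().shift().fillna(0))
)
```

If the input might be unsorted, sort by customer_nr and year_ordered first so "previously" really means earlier years.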
I have a df similar to the one below. I need to select the rows where df['Year 2'] is equal to, or closest to, df['Year'] within subsets grouped by df['ID'], so in this example rows 1, 2 and 5.
df
Year ID A Year 2 C
0 2020 12 0 2019 0
1 2020 12 0 2020 0 <-
2 2017 10 1 2017 0 <-
3 2017 10 0 2018 0
4 2019 6 0 2017 0
5 2019 6 1 2018 0 <-
I am trying to achieve that with the following piece of code, using groupby and passing a function intended to pick the row with the closest value across both columns.
df1 = df.groupby(['ID']).apply(min(df['Year 2'], key=lambda x:abs(x-df['Year'].min())))
This line raises TypeError: 'int' object is not callable. Any ideas on how to fix this line of code, or a fresh approach to the problem, would be appreciated.
TYIA.
You can subtract the two columns with Series.sub, take the absolute value, and then aggregate the indices of the per-group minima with GroupBy.idxmin:
idx = df['Year 2'].sub(df['Year']).abs().groupby(df['ID']).idxmin()
If you need a new boolean column, use Index.isin:
df['new'] = df.index.isin(idx)
print (df)
Year ID A Year 2 C new
0 2020 12 0 2019 0 False
1 2020 12 0 2020 0 True
2 2017 10 1 2017 0 True
3 2017 10 0 2018 0 False
4 2019 6 0 2017 0 False
5 2019 6 1 2018 0 True
If you need to filter the rows instead, use DataFrame.loc:
df1 = df.loc[idx]
print (df1)
Year ID A Year 2 C
5 2019 6 1 2018 0
2 2017 10 1 2017 0
1 2020 12 0 2020 0
One-line solution:
df1 = df.loc[df['Year 2'].sub(df['Year']).abs().groupby(df['ID']).idxmin()]
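A self-contained version of the one-liner with the question's data (the row order of the result follows the sorted groupby order of ID, as in the output above):

```python
import pandas as pd

df = pd.DataFrame({
    "Year":   [2020, 2020, 2017, 2017, 2019, 2019],
    "ID":     [12, 12, 10, 10, 6, 6],
    "A":      [0, 0, 1, 0, 0, 1],
    "Year 2": [2019, 2020, 2017, 2018, 2017, 2018],
    "C":      [0, 0, 0, 0, 0, 0],
})

# index label of the row with the smallest |Year 2 - Year| per ID
idx = df["Year 2"].sub(df["Year"]).abs().groupby(df["ID"]).idxmin()
df1 = df.loc[idx]
```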
You could get the idxmin per group:
idx = (df['Year 2']-df['Year']).abs().groupby(df['ID']).idxmin()
# assignment for test
df.loc[idx, 'D'] = '<-'
for selection only:
df2 = df.loc[idx]
output:
Year ID A Year 2 C D
0 2020 12 0 2019 0 NaN
1 2020 12 0 2020 0 <-
2 2017 10 1 2017 0 <-
3 2017 10 0 2018 0 NaN
4 2019 6 0 2017 0 NaN
5 2019 6 1 2018 0 <-
Note that there is a difference between:
df.loc[df.index.isin(idx)]
which gets all the min rows
and:
df.loc[idx]
which gets the first match
I am trying to change the structure of the data inside the data frame:
year month count reason
2001 1 1 a
2001 2 3 b
2001 3 4 c
2005 1 4 a
2005 1 3 c
The new data frame should look like:
year month count reason_a reason_b reason_c
2001 1 1 1 0 0
2001 2 3 0 3 0
2001 3 4 0 0 4
2005 1 7 4 0 3
Can anyone show some Python code to do this? Thank you in advance.
Using
DataFrame.join() - Join columns of another DataFrame.
pandas.get_dummies() - Convert categorical variable into dummy/indicator variables.
DataFrame.mul() - Get Multiplication of dataframe and other, element-wise (binary operator mul).
DataFrame.groupby() - Group DataFrame or Series using a mapper or by a Series of columns.
DataFrameGroupBy.agg() - Aggregate using callable, string, dict, or list of string/callables.
Ex.
dummies = df.join(pd.get_dummies(df["reason"], prefix='reason').mul(df['count'], axis=0))
f = {'count': 'sum', 'reason_a': 'sum', 'reason_b': 'sum', 'reason_c': 'sum'}  # 'sum' is robust to row order within each group
df1 = dummies.groupby(['year','month'], sort=False, as_index=False).agg(f)
print(df1)
Output:
year month count reason_a reason_b reason_c
0 2001 1 1 1 0 0
1 2001 2 3 0 3 0
2 2001 3 4 0 0 4
3 2005 1 7 4 0 3
Using pivot_table:
df2 = pd.pivot_table(df,index=["year","month",],values=["count"],columns="reason").reset_index().fillna(0)
df2.columns = [i[0] if i[0]!="count" else f"reason_{i[1]}" for i in df2.columns]
df2["count"] = df2.iloc[:,2:5].sum(axis=1)
print (df2)
#
year month reason_a reason_b reason_c count
0 2001 1 1.0 0.0 0.0 1.0
1 2001 2 0.0 3.0 0.0 3.0
2 2001 3 0.0 0.0 4.0 4.0
3 2005 1 4.0 0.0 3.0 7.0
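A third option, sketched here under the question's column names: pd.crosstab builds the reason columns in one call, and the total count can be recovered as the row sum:

```python
import pandas as pd

df = pd.DataFrame({"year":   [2001, 2001, 2001, 2005, 2005],
                   "month":  [1, 2, 3, 1, 1],
                   "count":  [1, 3, 4, 4, 3],
                   "reason": ["a", "b", "c", "a", "c"]})

out = (pd.crosstab(index=[df["year"], df["month"]],
                   columns=df["reason"],
                   values=df["count"], aggfunc="sum")
         .fillna(0).astype(int)
         .add_prefix("reason_")
         .reset_index())
# total count per (year, month) is the sum across the reason columns
out["count"] = out.filter(like="reason_").sum(axis=1)
```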
I have a pandas dataframe for which I'm trying to compute an expanding windowed aggregation after grouping by columns. The data structure is something like this:
df = pd.DataFrame([['A',1,2015,4],['A',1,2016,5],['A',1,2017,6],['B',1,2015,10],['B',1,2016,11],['B',1,2017,12],
['A',1,2015,24],['A',1,2016,25],['A',1,2017,26],['B',1,2015,30],['B',1,2016,31],['B',1,2017,32],
['A',2,2015,4],['A',2,2016,5],['A',2,2017,6],['B',2,2015,10],['B',2,2016,11],['B',2,2017,12]],columns=['Typ','ID','Year','dat'])\
.sort_values(by=['Typ','ID','Year'])
i.e.
Typ ID Year dat
0 A 1 2015 4
6 A 1 2015 24
1 A 1 2016 5
7 A 1 2016 25
2 A 1 2017 6
8 A 1 2017 26
12 A 2 2015 4
13 A 2 2016 5
14 A 2 2017 6
3 B 1 2015 10
9 B 1 2015 30
4 B 1 2016 11
10 B 1 2016 31
5 B 1 2017 12
11 B 1 2017 32
15 B 2 2015 10
16 B 2 2016 11
17 B 2 2017 12
In general, there is a completely varying number of years per Type-ID and rows per Type-ID-Year. I need to group this dataframe by the columns Type and ID, then compute an expanding windowed median & std of all observations by Year. I would like to get output results like this:
Typ ID Year median std
0 A 1 2015 14.0 14.14
1 A 1 2016 14.5 11.56
2 A 1 2017 15.0 10.99
3 A 2 2015 4.0 0
4 A 2 2016 4.5 0
5 A 2 2017 5.0 0
6 B 1 2015 20.0 14.14
7 B 1 2016 20.5 11.56
8 B 1 2017 21.0 10.99
9 B 2 2015 10.0 0
10 B 2 2016 10.5 0
11 B 2 2017 11.0 0
Hence, I want something like a groupby over ['Type','ID','Year'], with the median & std for each Type-ID-Year computed over all data sharing that Type-ID up to and including that Year.
How can I do this without manual iteration?
There's been no activity on this question, so I'll post the solution I found.
mn = df.groupby(by=['Typ','ID']).dat.expanding().median().reset_index().set_index('level_2')
mylast = lambda x: x.iloc[-1]
mn = mn.join(df['Year'])
mn = mn.groupby(by=['Typ','ID','Year']).agg(mylast).reset_index()
My solution follows this algorithm:
group the data, compute the windowed median, and get the original index back
with the original index back, get the year back from the original dataframe
group by the grouping columns, taking the last (in order) value for each
This gives the output desired. The same process can be followed for the standard deviation (or any other statistic desired).
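Wrapped up as a self-contained helper (same algorithm as above, demonstrated on a shortened version of the example frame; 'std' works the same way as 'median'):

```python
import pandas as pd

df = pd.DataFrame(
    [['A', 1, 2015, 4], ['A', 1, 2015, 24], ['A', 1, 2016, 5],
     ['A', 1, 2016, 25], ['A', 1, 2017, 6], ['A', 1, 2017, 26]],
    columns=['Typ', 'ID', 'Year', 'dat'])

def expanding_last(df, stat):
    # expanding statistic per Typ/ID, restored to the original row index
    s = (df.groupby(['Typ', 'ID']).dat.expanding().agg(stat)
           .reset_index().set_index('level_2'))
    # recover Year from the original frame, then keep the last (most
    # complete) expanding value within each Typ/ID/Year
    s = s.join(df['Year'])
    return s.groupby(['Typ', 'ID', 'Year']).agg(lambda x: x.iloc[-1]).reset_index()

med = expanding_last(df, 'median')
sd = expanding_last(df, 'std')
```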
I am working with pandas on a dataset in which maintenance work is done at various sites. The maintenance is done every four years at each site. I want to find the number of years since the last maintenance action at each site. I show only two sites in the following example, but the original dataset has thousands of them. My data only covers the years 2014 through 2017.
Action = 0 means no action has been performed that year, Action = 1 means some action has been done. Measurement is a performance reading related to the effect of the action. The action can happen in any year. I know that if the action has been performed in Year Y, the previous maintenance has been performed in Year Y-4.
Site Year Action Measurement
A 2014 0 100
A 2015 0 150
A 2016 1 300
A 2017 0 80
B 2014 0 200
B 2015 1 250
B 2016 0 60
B 2017 0 110
Given this dataset, first I want to have a temporary dataset like this:
Site Year Action Measurement Years_Since_Last_Action
A 2014 0 100 2
A 2015 0 150 3
A 2016 1 300 4
A 2017 0 80 1
B 2014 0 200 3
B 2015 1 250 4
B 2016 0 60 1
B 2017 0 110 2
Then, I want to have:
Years_Since_Last_Action Mean_Measurement
1 70
2 105
3 175
4 275
Thanks in advance!
Your first question:
import numpy as np

s = df.loc[df.Action==1, ['Site','Year']].set_index('Site')  # the action year at each site
df['Newyear'] = df.Site.map(s.Year)                          # map that year back onto every row
s1 = df.Year - df.Newyear
df['action since last year'] = np.where(s1<=0, s1+4, s1)     # wrap non-positive gaps into the 4-year cycle
df
Out[167]:
Site Year Action Measurement Newyear action since last year
0 A 2014 0 100 2016 2
1 A 2015 0 150 2016 3
2 A 2016 1 300 2016 4
3 A 2017 0 80 2016 1
4 B 2014 0 200 2015 3
5 B 2015 1 250 2015 4
6 B 2016 0 60 2015 1
7 B 2017 0 110 2015 2
Second question:
df.groupby('action since last year').Measurement.mean()
Out[168]:
action since last year
1 70
2 105
3 175
4 275
Name: Measurement, dtype: int64
First, build your intermediate using groupby, ffill/bfill and a little arithmetic.
import numpy as np

v = (df.Year
       .where(df.Action.astype(bool))  # keep Year only on the action rows
       .groupby(df.Site)
       .ffill()
       .bfill()
       .sub(df.Year))
df['Years_Since_Last_Action'] = np.select([v > 0, v < 0], [4 - v, v.abs()], default=4)
df
Site Year Action Measurement Years_Since_Last_Action
0 A 2014 0 100 2.0
1 A 2015 0 150 3.0
2 A 2016 1 300 4.0
3 A 2017 0 80 1.0
4 B 2014 0 200 3.0
5 B 2015 1 250 4.0
6 B 2016 0 60 1.0
7 B 2017 0 110 2.0
Next,
df.groupby('Years_Since_Last_Action', as_index=False).Measurement.mean()
Years_Since_Last_Action Measurement
0 1.0 70
1 2.0 105
2 3.0 175
3 4.0 275
How about:
delta_year = df.loc[df.groupby("Site")["Action"].transform("idxmax"), "Year"].values
years_since = ((df.Year - delta_year) % 4).replace(0, 4)
df["Years_Since_Last_Action"] = years_since
out = df.groupby("Years_Since_Last_Action")["Measurement"].mean().reset_index()
out = out.rename(columns={"Measurement": "Mean_Measurement"})
which gives me
In [230]: df
Out[230]:
Site Year Action Measurement Years_Since_Last_Action
0 A 2014 0 100 2
1 A 2015 0 150 3
2 A 2016 1 300 4
3 A 2017 0 80 1
4 B 2014 0 200 3
5 B 2015 1 250 4
6 B 2016 0 60 1
7 B 2017 0 110 2
In [231]: out
Out[231]:
Years_Since_Last_Action Mean_Measurement
0 1 70
1 2 105
2 3 175
3 4 275
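A self-contained version of this last approach (using a lambda in place of transform("idxmax") for portability across pandas versions):

```python
import pandas as pd

df = pd.DataFrame({
    "Site": list("AAAABBBB"),
    "Year": [2014, 2015, 2016, 2017] * 2,
    "Action": [0, 0, 1, 0, 0, 1, 0, 0],
    "Measurement": [100, 150, 300, 80, 200, 250, 60, 110],
})

# index label of the Action==1 row per site, broadcast to every row of that site
delta_year = df.loc[df.groupby("Site")["Action"].transform(lambda s: s.idxmax()),
                    "Year"].values
# year difference mod 4, with 0 mapped to a full 4-year gap
df["Years_Since_Last_Action"] = ((df["Year"] - delta_year) % 4).replace(0, 4)
out = (df.groupby("Years_Since_Last_Action")["Measurement"].mean()
         .reset_index(name="Mean_Measurement"))
```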