Pandas GroupBy and CumSum on a column - python

I have a data set which looks like the following:
doc_created_month  doc_created_year  speciality     doc_id count
8                  2016              Acupuncturist  1
2                  2017              Acupuncturist  1
4                  2017              Acupuncturist  1
4                  2017              Allergist      1
5                  2018              Allergist      1
10                 2018              Allergist      2
I want to group by the month, year and speciality and get the cumulative sum of the 'doc_id count' column.
Here is what I tried:
1) docProfileDf2.groupby(by=['speciality','doc_created_year','doc_created_month']).sum().groupby(level=[0]).cumsum()
2) docProfileDf2.groupby(['doc_created_month','doc_created_year','speciality'])['doc_id count'].apply(lambda x: x.cumsum())
Neither of them returns the proper cumulative sum.
Any solution would help.
The expected output should be:
doc_created_month  doc_created_year  speciality     doc_id count
8                  2016              Acupuncturist  1
2                  2017              Acupuncturist  2
4                  2017              Acupuncturist  3
4                  2017              Allergist      1
5                  2018              Allergist      2
10                 2018              Allergist      4
For each year, month and speciality I want the cumsum of 'doc_id count'.

It is simple. The solution is:
df.groupby(by=['speciality','doc_created_year','doc_created_month']).sum().groupby(level=[0]).cumsum()
I had to sum first and then group by again at the speciality level.
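The grouped result carries a three-level index; as a small follow-up sketch (assuming the sample frame is df, as above), you can flatten it back into the flat layout of the expected output with reset_index():
res = df.groupby(by=['speciality','doc_created_year','doc_created_month']).sum().groupby(level=[0]).cumsum()
res = res.reset_index()  # turn the index levels back into ordinary columns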

Please note that I changed doc_id count to doc_id_count.
You first call groupby('speciality') to group the data by that column. The second step is to call apply(), which runs a function on each group. In this case the function performs another groupby on the remaining key columns and then calls .sum().cumsum() to produce the desired result.
from io import StringIO
import pandas as pd
data = """
doc_created_month doc_created_year speciality doc_id_count
8 2016 Acupuncturist 1
2 2017 Acupuncturist 1
4 2017 Acupuncturist 1
4 2017 Allergist 1
5 2018 Allergist 1
10 2018 Allergist 2
"""
df = pd.read_csv(StringIO(data), sep=r'\s+')  # raw string avoids the invalid '\s' escape warning
(df.groupby('speciality')
   .apply(lambda df_: df_.groupby(['doc_created_year', 'doc_created_month'])[['doc_id_count']]
                         .sum()
                         .cumsum())
)
which outputs:
                                                  doc_id_count
speciality    doc_created_year doc_created_month
Acupuncturist 2016             8                             1
              2017             2                             2
                               4                             3
Allergist     2017             4                             1
              2018             5                             2
                               10                            4
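If you prefer the flat row layout of the expected output directly, here is a minimal alternative sketch; it assumes the rows are already in chronological order within each speciality, as in the sample:
# running total of doc_id_count within each speciality, aligned to the original rows
df['doc_id_count'] = df.groupby('speciality')['doc_id_count'].cumsum()
print(df)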

Related

Add two dataframe rows?

I have a df with a year column, and I'm trying to combine two consecutive rows in the dataframe.
df
year
0 2020
1 2019
2 2018
3 2017
4 2016
Final df
year combine
0 2020 2020-2019
1 2019 2019-2018
2 2018 2018-2017
3 2017 2017-2016
4 2016 NaN
Let us use shift:
df['combine'] = df.year.astype(str) + '-' + df.year.astype(str).shift(-1)
df
Out[302]:
year combine
0 2020 2020-2019
1 2019 2019-2018
2 2018 2018-2017
3 2017 2017-2016
4 2016 NaN
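A minimal runnable sketch of the same idea, assuming the sample df above:
import pandas as pd

df = pd.DataFrame({'year': [2020, 2019, 2018, 2017, 2016]})
# pair each year with the year from the row below; the last row has nothing below it, so it gets NaN
df['combine'] = df.year.astype(str) + '-' + df.year.astype(str).shift(-1)
print(df)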

Pandas group by id and year(date), but show year for all years, not just those which are present in id?

I have years of transaction data which I am working with by customer id. The transaction information is at the invoice level; an id could easily have multiple invoices on the same day, or no invoices for years. I am attempting to create dataframes which contain sums of invoices by customer by year, but which also show the years where no invoices were added. Something akin to:
tmp = invoices[invoices['invoice_year'].isin([2018, 2019, 2020])]
tmp = tmp.groupby(['id', pd.Grouper(key='invoice_date', freq='Y')])['sales'].sum()
This would return something akin to:
id invoice_year sales
1 2018 483982.20
1 2019 3453
1 2020 453533
2 2018 243
2 2020 23423
3 2020 2330202
However the desired output would be:
id invoice_year sales
1 2018 483982.20
1 2019 3453
1 2020 453533
2 2018 243
2 2019 nan
2 2020 23423
3 2018 nan
3 2019 nan
3 2020 2330202
Ideas?
Let's suppose the original values are defined in the dataframe named df; then you can try the following:
output = (df.groupby(['id', 'invoice_date'])['val'].sum()
            .unstack(fill_value=0)
            .stack()
            .reset_index(name='val'))
Alternatively, you can first create the invoice_year column:
df['invoice_year'] = df['invoice_date'].dt.year
and run the same code with 'invoice_year' in place of 'invoice_date', which outputs:
id invoice_year val
0 1 2018 1
1 1 2019 1
2 1 2020 0
3 2 2018 1
4 2 2019 0
5 2 2020 1
6 3 2018 0
7 3 2019 1
8 3 2020 1
Using the following data as an example:
df = pd.DataFrame({'id': [1]*2 + [2]*2 + [3]*2,
                   'invoice_date': pd.to_datetime(['2018-12-01', '2019-12-01', '2020-12-01']*2),
                   'val': [1]*6})
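Note that fill_value=0 above yields 0 for the missing years; if you want NaN as in the desired output, a variant sketch (assuming the invoice_year column has been created as described):
output = (df.groupby(['id', 'invoice_year'])['val'].sum()
            .unstack()               # missing (id, year) pairs become NaN
            .stack(dropna=False)     # keep the NaN rows when stacking back
            .reset_index(name='val'))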
Stefan has posted a comment that may help: simply passing dropna=False to your .groupby seems like the best bet. But you could also take the approach of bringing the NaNs back afterward, which may be required on earlier versions of pandas that don't have the dropna=False parameter. Starting from:
id invoice_year sales
1 2018 483982.20
1 2019 3453
1 2020 453533
2 2018 243
2 2020 23423
3 2020 2330202
You can use pd.MultiIndex.from_product and reindex the dataframe with a newly created index called idx:
i, iy = df['id'], df['invoice_year']
idx = pd.MultiIndex.from_product([range(i.min(), i.max() + 1),
                                  range(iy.min(), iy.max() + 1)],
                                 names=[i.name, iy.name])
df = df.set_index([i.name, iy.name]).reindex(idx).reset_index()
df
Out[1]:
id invoice_year sales
0 1 2018 483982.2
1 1 2019 3453.0
2 1 2020 453533.0
3 2 2018 243.0
4 2 2019 NaN
5 2 2020 23423.0
6 3 2018 NaN
7 3 2019 NaN
8 3 2020 2330202.0

Sort by both index and value in Multi-indexed data of Pandas dataframe

Suppose, I have a dataframe as below:
year month message
0 2018 2 txt1
1 2017 4 txt2
2 2019 5 txt3
3 2017 5 txt5
4 2017 5 txt4
5 2020 4 txt3
6 2020 6 txt3
7 2020 6 txt3
8 2020 6 txt4
I want to figure out the top three months by number of messages in each year. So, I grouped the data as below:
df.groupby(['year','month']).count()
which results in:
            message
year month
2017 4            1
     5            2
2018 2            1
2019 5            1
2020 4            1
     6            3
The data is sorted in ascending order on both index levels. But how do I get the result below, sorted by year (ascending) and count (descending) and limited to the top n values? The 'month' index can land wherever the counts put it.
            message
year month
2017 5            2
     4            1
2018 2            1
2019 5            1
2020 6            3
     4            1
value_counts sorts by count in descending order by default:
df.groupby('year')['month'].value_counts()
Output:
year  month
2017  5        2
      4        1
2018  2        1
2019  5        1
2020  6        3
      4        1
Name: month, dtype: int64
If you want only the top 2 values for each year, do another groupby:
(df.groupby('year')['month'].value_counts()
   .groupby('year').head(2)
)
Output:
year  month
2017  5        2
      4        1
2018  2        1
2019  5        1
2020  6        3
      4        1
Name: month, dtype: int64
This will sort by year (ascending) and count (descending):
df = df.groupby(['year', 'month']).count().sort_values(['year', 'message'], ascending=[True, False])
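If you then only want the top n rows per year (n = 2 here), a small follow-up sketch on the sorted result:
top2 = df.groupby(level='year').head(2)  # first 2 rows per year, order preserved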
You can use sort_index, specifying ascending=[True, False] so that only the second level is sorted in descending order:
df = df.groupby(['year', 'month']).count().sort_index(ascending=[True, False])
            message
year month
2017 5            2
     4            1
2018 2            1
2019 5            1
2020 6            3
     4            1
Note, though, that this sorts by the month label rather than by the count; it matches the desired output here only because the higher counts happen to fall in later months.
Here you go:
df.groupby(['year', 'month']).count().sort_values(axis=0, ascending=False, by='message').sort_values(axis=0, ascending=True, by='year', kind='mergesort')
(kind='mergesort' makes the second sort stable, so the within-year ordering from the first sort is preserved.)
You can use this code:
df.groupby(['year', 'month']).count().sort_index(axis=0, ascending=False).sort_values(by="year", ascending=True, kind='mergesort')

Sort GroupBy object by certain max. value within individual groups

I am trying to sort my groupby result by the highest value for a certain year, i.e. the 2018 values. However, so far unsuccessfully.
Code:
aggs = {'Sales': 'sum'}
df.groupby(by=['Segment', 'Year']).agg(aggs)
Default result from pandas when grouping
(sorted alphabetically by level 0, then ascending by level 1):
Segment Year Sales
A 2016 2
A 2017 10
A 2018 6
B 2016 1
B 2017 4
B 2018 8
Expected result:
Segment Year Sales
B 2016 1
B 2017 4
B 2018 8
A 2016 2
A 2017 10
A 2018 6
i.e. A is sorted after B, because the 2018 sum for B is 8 while for A it is 6.
The idea is to create an ordered Categorical whose categories are the Segment values filtered to 2018 and sorted by Sales in descending order:
cats = df[df['Year'] == 2018].sort_values('Sales', ascending=False)['Segment']
aggs = {'Sales':'sum'}
df['Segment'] = pd.Categorical(df['Segment'], ordered=True, categories=cats)
df1 = df.groupby(by=['Segment', 'Year']).agg(aggs)
print (df1)
              Sales
Segment Year
B       2016      1
        2017      4
        2018      8
A       2016      2
        2017     10
        2018      6
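A sketch of an alternative that avoids overwriting the Segment column, reordering the aggregated result instead (assuming the same df):
order = (df[df['Year'] == 2018]
         .sort_values('Sales', ascending=False)['Segment']
         .tolist())
df1 = (df.groupby(['Segment', 'Year'])
         .agg({'Sales': 'sum'})
         .reindex(order, level='Segment'))  # reorder level 0 to B, A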

Remove rows that are not duplicated n time

I have a df with 4 observations per company (4 quarters). However, for several companies I have fewer than 4 observations. When I don't have all 4 quarters for a firm, I would like to delete all observations for that firm. Any ideas how to do this?
This is how the df looks like:
Quarter Year Company
1 2018 A
2 2018 A
3 2018 A
4 2018 A
1 2018 B
2 2018 B
1 2018 C
2 2018 C
3 2018 C
4 2018 C
In this df I would like to delete the rows for company B, because I only have 2 quarters for it.
Many thanks!
Use transform with 'size' to get a Series the same length as the original DataFrame, which makes boolean filtering possible:
df = df[df.groupby('Company')['Quarter'].transform('size') == 4]
# if you want to check by company and year instead:
# df = df[df.groupby(['Company','Year'])['Quarter'].transform('size') == 4]
print (df)
Quarter Year Company
0 1 2018 A
1 2 2018 A
2 3 2018 A
3 4 2018 A
6 1 2018 C
7 2 2018 C
8 3 2018 C
9 4 2018 C
If performance is not important or the DataFrame is small, use DataFrameGroupBy.filter:
df = df.groupby('Company').filter(lambda x: len(x) == 4)
Using value_counts:
s = df.Company.value_counts()
df.loc[df.Company.isin(s[s == 4].index)]
Out[527]:
Quarter Year Company
0 1 2018 A
1 2 2018 A
2 3 2018 A
3 4 2018 A
6 1 2018 C
7 2 2018 C
8 3 2018 C
9 4 2018 C
You can go through your Company column and check whether you have all 4 quarter results:
for i in set(df['Company']):
    if len(df[df['Company'] == i]) != 4:
        df = df[df['Company'] != i]
