Python - Pandas subtotals on groupby - python

Here's a sample of the data I'm using:
SCENARIO DATE POD AREA IDOC STATUS TYPE
AAA 02.06.2015 JKJKJKJKJKK 4210 713375 51 1
AAA 02.06.2015 JWERWERE 4210 713375 51 1
AAA 02.06.2015 JAFDFDFDFD 4210 713375 51 9
BBB 02.06.2015 AAAAAAAA 5400 713504 51 43
CCC 05.06.2015 BBBBBBBBBB 4100 756443 51 187
AAA 05.06.2015 EEEEEEEE 4100 756457 53 228
I have written the following code in pandas to groupby:
import pandas as pd
import numpy as np
xl = pd.ExcelFile("MRD.xlsx")
df = xl.parse("Sheet3")
#print (df.columns.values)
# The following gave ValueError: Cannot label index with a null key
# dfi = df.pivot('SCENARIO')
# Here I do not actually need it to count every column, just a specific one
table = df.groupby(["SCENARIO", "STATUS", "TYPE"]).agg(['count'])
writer = pd.ExcelWriter('pandas.out.xlsx', engine='xlsxwriter')
table.to_excel(writer, sheet_name='Sheet1')
writer.save()
table2 = pd.DataFrame(df.groupby(["SCENARIO", "STATUS", "TYPE"])['TYPE'].count())
print (table2)
writer2 = pd.ExcelWriter('pandas2.out.xlsx', engine='xlsxwriter')
table2.to_excel(writer2, sheet_name='Sheet1')
writer2.save()
This yields the following result:
SCENARIO STATUS TYPE TYPE
AAA 51 1 2
9 1
53 228 1
BBB 51 43 1
CCC 51 187 1
Name: TYPE, dtype: int64
How could I add subtotals per group? Ideally I would want to achieve something like:
SCENARIO STATUS TYPE TYPE
AAA 51 1 2
9 1
Total 3
53 228 1
Total 1
BBB 51 43 1
Total 1
CCC 51 187 1
Total 1
Name: TYPE, dtype: int64
Is this possible?

Use:
#if necessary convert TYPE column to string
df['TYPE'] = df['TYPE'].astype(str)
df = df.groupby(["SCENARIO", "STATUS", "TYPE"])['TYPE'].count()
#aggregate sum by first 2 levels
df1 = df.groupby(["SCENARIO", "STATUS"]).sum()
#add 3rd level of MultiIndex
df1.index = [df1.index.get_level_values(0),
             df1.index.get_level_values(1),
             ['Total'] * len(df1)]
#thanks MaxU for improving
#df1 = df1.set_index(np.array(['Total'] * len(df1)), append=True)
print (df1)
SCENARIO STATUS
AAA 51 Total 3
53 Total 1
BBB 51 Total 1
CCC 51 Total 1
Name: TYPE, dtype: int64
#join together and sort
df = pd.concat([df, df1]).sort_index(level=[0,1])
print (df)
SCENARIO STATUS TYPE
AAA 51 1 2
9 1
Total 3
53 228 1
Total 1
BBB 51 43 1
Total 1
CCC 51 187 1
Total 1
Name: TYPE, dtype: int64
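For reference, the concat approach above can be run end to end; this is a sketch using a hand-built frame with the question's column names instead of the Excel file:

```python
import pandas as pd

# Hand-built stand-in for the Excel data (TYPE kept as strings so the
# 'Total' label can live on the same index level)
df = pd.DataFrame({
    'SCENARIO': ['AAA', 'AAA', 'AAA', 'BBB', 'CCC', 'AAA'],
    'STATUS':   [51, 51, 51, 51, 51, 53],
    'TYPE':     ['1', '1', '9', '43', '187', '228'],
})

# per-(SCENARIO, STATUS, TYPE) counts
counts = df.groupby(['SCENARIO', 'STATUS', 'TYPE'])['TYPE'].count()

# per-(SCENARIO, STATUS) subtotals, labelled 'Total' on the third level
subtotals = counts.groupby(level=['SCENARIO', 'STATUS']).sum()
subtotals.index = pd.MultiIndex.from_arrays([
    subtotals.index.get_level_values(0),
    subtotals.index.get_level_values(1),
    ['Total'] * len(subtotals),
])

# interleave; 'Total' sorts after the numeric strings, so it appears
# last within each (SCENARIO, STATUS) group
result = pd.concat([counts, subtotals]).sort_index(level=[0, 1])
print(result)
```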

The same thing can be achieved with a pandas pivot table:
table = pd.pivot_table(df, values=['TYPE'], index=['SCENARIO', 'STATUS'], aggfunc='count')
table

Chris Moffitt has created a library named sidetable to ease this process; it provides an accessor that works with the groupby result, making this very easy. That said, the accepted answer and comments are a gold mine, which I feel is worth checking out first.


Finding the 'Date' column in a dataframe

I'm programmatically trying to detect the column in a dataframe that contains dates & I'm converting the date values to the same format.
My logic is to find the column name that contains the word 'Date' either as a whole word or as a sub-word (using contains()) & then work on the dates in that column.
My code:
from dateutil.parser import parse
import re
from datetime import datetime
import calendar
import pandas as pd

def date_fun(filepath):
    lst_to_ser = pd.Series(filepath.columns.values)
    date_col_search = lst_to_ser.str.contains(pat='date')
    #print(date_col_search.columns.values)
    for i in date_col_search:
        if i is True:
            formatted_dates = pd.to_datetime(date_col_search[i], errors='coerce')
            print(formatted_dates)

main_path = pd.read_csv('C:/Data_Cleansing/random_dateset.csv')
fpath = main_path.copy()
date_fun(fpath)
The retrieved column names are stored in an array, and since contains() works only on a Series, I converted the array to a series.
This is what 'date_col_search' variable contains:
0 False
1 True
2 False
dtype: bool
I want to access the column corresponding to the 'True' value. But I'm getting the following error at the line formatted_dates=......:
Exception has occurred: KeyError
True
How should I access the 'True' column?
My dataframe:
random joiningdate branch
1 25.09.2019 rev
8 9/16/2015 pop
98 10.12.2017 switch
65 02.12.2014 high
45 08-Mar-18 aim
2 08-12-2016 docker
0 26.04.2016 grit
9 05-03-2016 trevor
56 24.12.2016 woll
4 10-Aug-19 qerty
78 abc yak
54 05-06-2015 water
42 12-2012-18 rance
43 24-02-2010 stream
38 2008,13,02 verge
78 16-09-2015 atom
I would use:
def mixed_datetime(s):
    # this is just an example, adapt this function to your need
    return (pd.to_datetime(s, yearfirst=False, dayfirst=True, errors='coerce')
            .fillna(pd.to_datetime(s, yearfirst=True, dayfirst=False, errors='coerce'))
            )

cols = df.columns.str.contains('date', case=False)
df.loc[:, cols] = df.loc[:, cols].apply(mixed_datetime)
Updated DataFrame:
random joiningdate branch
0 1 2019-09-25 rev
1 8 2015-09-16 pop
2 98 2017-12-10 switch
3 65 2014-12-02 high
4 45 2018-03-08 aim
5 2 2016-12-08 docker
6 0 2016-04-26 grit
7 9 2016-03-05 trevor
8 56 2016-12-24 woll
9 4 2019-08-10 qerty
10 78 NaT yak
11 54 2015-06-05 water
12 42 NaT rance
13 43 2010-02-24 stream
14 38 2008-02-01 verge
15 78 2015-09-16 atom
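As for the KeyError in the original attempt: the loop iterates over the boolean values themselves, so `date_col_search[i]` looks up the label `True`, which doesn't exist. Indexing the columns with a boolean mask avoids the loop entirely; a minimal sketch with inline data (column names borrowed from the question, a single date format assumed):

```python
import pandas as pd

df = pd.DataFrame({
    'random': [1, 98],
    'joiningdate': ['25.09.2019', '10.12.2017'],
    'branch': ['rev', 'switch'],
})

# Boolean mask aligned with df.columns, True where the name contains 'date'
mask = df.columns.str.contains('date', case=False)

# Select the matching columns with the mask instead of looping over the booleans
for col in df.columns[mask]:
    df[col] = pd.to_datetime(df[col], dayfirst=True, errors='coerce')

print(df.dtypes)
```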

How can I calculate pct changes between groups of columns efficiently?

I have a set of columns like so:
q1_cash_total, q2_cash_total, q3_cash_total,
q1_shop_us, q2_shop_us, q3_shop_us,
etc. I have about 40 similarly named columns like this. I wish to calculate the pct changes between each of these groups of 3. E.g. I know that individually I can do:
df[['q1_cash_total', 'q2_cash_total', 'q3_cash_total']].pct_change().add_suffix('_PCT_CHG')
To do this for every group of 3, I do:
q1 = [col for col in df.columns if 'q1' in col]
q2 = [col for col in df.columns if 'q2' in col]
q3 = [col for col in df.columns if 'q3' in col]
q_cols = q1 + q2 + q3
dflist = []
for col in df[q_cols].columns:
    # col[3:] to just get the col name without the q1_/q2_ etc.
    print(col[3:])
    cols = [c for c in df.columns if col[3:] in c]
    pct = df[cols].pct_change().add_suffix('_PCT_CHG')
    dflist.append(pct)
pcts_df = pd.concat(dflist)
I cannot think of a cleaner way to do this. Does anybody have any ideas? How can I also do the pct change between q1 and q3 directly, instead of only successively?
You could create a dataframe containing only the desired columns; for that, filter column names starting with q immediately followed by one or more digits and an underscore (^q\d+?_). Remove the prefix and keep only unique column names using pd.unique. For each unique column name, filter columns with that specific name and apply the percentage change along the columns axis (.pct_change(axis='columns')) to obtain the changes between q1, q2 and q3.
To get the percentage change between q1 and q3, select those columns by name from the previously created dataframe (df_q) and apply the same pct_change as before.
df used as input
q1_cash_total q1_shop_us q2_cash_total q2_shop_us q3_cash_total q3_shop_us another_col numCols dataCols
0 52 93 15 72 61 21 83 87 75
1 75 88 24 3 22 53 2 88 30
2 38 2 64 60 21 33 76 58 22
3 89 49 91 59 42 92 60 80 15
4 62 62 47 62 51 55 64 3 51
df_q = df.filter(regex=r'^q\d+?_')
unique_cols = pd.unique([c[3:] for c in df_q.columns])
dflist = []
for col in unique_cols:
    q_name = df_q.filter(like=col)
    df_s = q_name.pct_change(axis='columns').add_suffix('_PCT_CHG')
    dflist.append(df_s)
    df_s = df_q[[f'q1_{col}', f'q3_{col}']].pct_change(axis='columns').add_suffix('_Q1-Q3')
    dflist.append(df_s)
pcts_df = pd.concat(dflist, axis=1)
Output from pcts_df
q1_cash_total_PCT_CHG q2_cash_total_PCT_CHG q3_cash_total_PCT_CHG ... q3_shop_us_PCT_CHG q1_shop_us_Q1-Q3 q3_shop_us_Q1-Q3
0 NaN -0.711538 3.066667 ... -0.708333 NaN -0.774194
1 NaN -0.680000 -0.083333 ... 16.666667 NaN -0.397727
2 NaN 0.684211 -0.671875 ... -0.450000 NaN 15.500000
3 NaN 0.022472 -0.538462 ... 0.559322 NaN 0.877551
4 NaN -0.241935 0.085106 ... -0.112903 NaN -0.112903
[5 rows x 10 columns]
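Condensed into a runnable sketch with a tiny inline frame (two column groups only; the numbers are taken from the first rows of the sample above):

```python
import pandas as pd

# Tiny inline frame following the question's naming pattern
df = pd.DataFrame({
    'q1_cash_total': [52, 75],
    'q2_cash_total': [15, 24],
    'q3_cash_total': [61, 22],
    'q1_shop_us':    [93, 88],
    'q2_shop_us':    [72, 3],
    'q3_shop_us':    [21, 53],
    'another_col':   [83, 2],
})

df_q = df.filter(regex=r'^q\d+?_')                   # only the q*-prefixed columns
suffixes = pd.unique([c[3:] for c in df_q.columns])  # 'cash_total', 'shop_us'

parts = []
for sfx in suffixes:
    grp = df_q.filter(like=sfx)
    # successive change q1 -> q2 -> q3, computed across columns
    parts.append(grp.pct_change(axis='columns').add_suffix('_PCT_CHG'))
    # direct change q1 -> q3
    parts.append(df_q[[f'q1_{sfx}', f'q3_{sfx}']]
                 .pct_change(axis='columns').add_suffix('_Q1-Q3'))

pcts_df = pd.concat(parts, axis=1)
print(pcts_df.round(3))
```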

How to rename the column names post flattening

This is my dataframe:
Date Group Value Duration
2018-01-01 A 20 30
2018-02-01 A 10 60
2018-03-01 A 25 88
2018-01-01 B 15 180
2018-02-01 B 30 210
2018-03-01 B 25 238
I want to pivot the above df
My Approach:
df_pivot = dealer_f.pivot_table(index='Group', columns='Date', fill_value=0)
df_pivot.columns = df_pivot.columns.map('_'.join)
df_pivot = df_pivot.reset_index()
I am getting an error as TypeError: sequence item 1: expected str instance, int found
If I simply follow reset_index then I get the column names as ('Value',2018-01-01),('Value',2018-02-10) etc.
I want to flatten the columns so that my output looks like below
df_pivot.columns.tolist()
['2018-01-01_Value','2018-02-01_Value',.....'2018-01-01_Duration',...]
Any clue? Or what am I missing?
Use:
df_pivot.columns = [f'{b}_{a}' for a, b in df_pivot.columns]
Or:
df_pivot.columns = [f'{b.strftime("%Y-%m-%d")}_{a}' for a, b in df_pivot.columns]
df_pivot = df_pivot.reset_index()
print (df_pivot)
Group 2018-01-01_Duration 2018-02-01_Duration 2018-03-01_Duration \
0 A 30 60 88
1 B 180 210 238
2018-01-01_Value 2018-02-01_Value 2018-03-01_Value
0 20 10 25
1 15 30 25
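A self-contained sketch of the whole round trip on the question's data (here `Date` stays a plain string, so the first f-string variant applies):

```python
import pandas as pd

# Inline data from the question
df = pd.DataFrame({
    'Date': ['2018-01-01', '2018-02-01', '2018-03-01'] * 2,
    'Group': ['A'] * 3 + ['B'] * 3,
    'Value': [20, 10, 25, 15, 30, 25],
    'Duration': [30, 60, 88, 180, 210, 238],
})

df_pivot = df.pivot_table(index='Group', columns='Date', fill_value=0)

# Each column is a tuple like ('Value', '2018-01-01'); put the date first
df_pivot.columns = [f'{date}_{name}' for name, date in df_pivot.columns]
df_pivot = df_pivot.reset_index()
print(df_pivot.columns.tolist())
```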

changing wide to long table format and splitting dates by year

I have a table that looks like this:
temp = [['K98R', 'AB',34,'2010-07-27', '2013-08-17', '2008-03-01', '2011-05-02', 44],['S33T','ES',55, '2009-07-23', '2012-03-12', '2010-09-17', '', 76]]
Data = pd.DataFrame(temp,columns=['ID','Initials','Age', 'Entry','Exit','Event1','Event2','Weight'])
What you see in the table above is that there are entry and exit dates, along with dates for events 1 and 2; the date for event 2 is missing for the second patient because the event didn't happen. Also note that event 1 for the first patient happened before the entry date.
What I am trying to achieve is threefold:
1. Split the time between the entry and exit into years
2. Convert the wide format to long one with one row per year
3. Check if event 1 and 2 have occurred during the time period included in each row
To explain further, here is the output I am trying to get:
ID Initial Age Entry Exit Event1 Event2 Weight
K89R AB 34 27/07/2010 31/12/2010 1 0 44
K89R AB 35 1/01/2011 31/12/2011 1 1 44
K89R AB 36 1/01/2012 31/12/2012 1 1 44
K89R AB 37 1/01/2013 17/08/2013 1 1 44
S33T ES 55 23/07/2009 31/12/2009 0 0 76
S33T ES 56 1/01/2010 31/12/2010 1 0 76
S33T ES 57 1/01/2011 31/12/2011 1 0 76
S33T ES 58 1/01/2012 12/03/2012 1 0 76
What you notice here is that the entry to exit date period is split into individual rows per patient, each representing a year. The event columns are now coded as 0 (meaning the event has not yet happened) or 1 (the event happened) which is then carried over to the years after because the event has already happened.
The age increases in every row per patient as time progresses
The patient ID and initial remain the same as well as the weight.
Could anyone please help with this, thank you
Begin by getting the number of years between Entry and Exit:
# Convert to datetime
df.Entry = pd.to_datetime(df.Entry)
df.Exit = pd.to_datetime(df.Exit)
df.Event1 = pd.to_datetime(df.Event1)
df.Event2 = pd.to_datetime(df.Event2)
# Round up, to include the upper years
import math
df['Years_Between'] = (df.Exit - df.Entry).apply(lambda x: math.ceil(x.days/365))
# printing the df will provide the following:
ID Initials Age Entry Exit Event1 Event2 Weight Years_Between
0 K98R AB 34 2010-07-27 2013-08-17 2008-03-01 2011-05-02 44 4
1 S33T ES 55 2009-07-23 2012-03-12 2010-09-17 NaT 76 3
Loop through your data and create a new row for each year:
new_data = []
for idx, row in df.iterrows():
    year = row['Entry'].year
    new_entry = pd.to_datetime(year, format='%Y')
    for y in range(row['Years_Between']):
        new_entry = new_entry + pd.DateOffset(years=1)
        new_exit = new_entry + pd.DateOffset(years=1) - pd.DateOffset(days=1)
        record = {'Entry': new_entry, 'Exit': new_exit}
        if row['Entry'] > new_entry:
            record['Entry'] = row['Entry']
        if row['Exit'] < new_exit:
            record['Exit'] = row['Exit']
        for col in ['ID', 'Initials', 'Age', 'Event1', 'Event2', 'Weight']:
            record[col] = row[col]
        new_data.append(record)
Create a new DataFrame, then compare the dates:
df_new = pd.DataFrame(new_data, columns = ['ID','Initials','Age', 'Entry','Exit','Event1','Event2','Weight'])
df_new['Event1'] = (df_new.Event1 <= df_new.Exit).astype(int)
df_new['Event2'] = (df_new.Event2 <= df_new.Exit).astype(int)
# printing df_new will provide:
ID Initials Age Entry Exit Event1 Event2 Weight
0 K98R AB 34 2011-01-01 2011-12-31 1 1 44
1 K98R AB 34 2012-01-01 2012-12-31 1 1 44
2 K98R AB 34 2013-01-01 2013-08-17 1 1 44
3 K98R AB 34 2014-01-01 2013-08-17 1 1 44
4 S33T ES 55 2010-01-01 2010-12-31 1 0 76
5 S33T ES 55 2011-01-01 2011-12-31 1 0 76
6 S33T ES 55 2012-01-01 2012-03-12 1 0 76
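A variant sketch that clips each calendar year to the actual Entry/Exit window and advances the age per row, which lands closer to the expected output; the clipping and age-increment logic are my assumptions, not part of the answer above:

```python
import pandas as pd

# Inline data from the question
temp = [['K98R', 'AB', 34, '2010-07-27', '2013-08-17', '2008-03-01', '2011-05-02', 44],
        ['S33T', 'ES', 55, '2009-07-23', '2012-03-12', '2010-09-17', '', 76]]
df = pd.DataFrame(temp, columns=['ID', 'Initials', 'Age', 'Entry', 'Exit',
                                 'Event1', 'Event2', 'Weight'])
for c in ['Entry', 'Exit', 'Event1', 'Event2']:
    df[c] = pd.to_datetime(df[c], errors='coerce')   # '' becomes NaT

rows = []
for _, row in df.iterrows():
    # one record per calendar year touched by the Entry..Exit interval
    for i, y in enumerate(range(row['Entry'].year, row['Exit'].year + 1)):
        rec = row.copy()
        rec['Entry'] = max(row['Entry'], pd.Timestamp(y, 1, 1))    # clip to window
        rec['Exit'] = min(row['Exit'], pd.Timestamp(y, 12, 31))
        rec['Age'] = row['Age'] + i                                # age advances yearly
        rows.append(rec)

long_df = pd.DataFrame(rows).reset_index(drop=True)
# an event is coded 1 once its date falls on or before the row's Exit (NaT stays 0)
long_df['Event1'] = (long_df['Event1'] <= long_df['Exit']).astype(int)
long_df['Event2'] = (long_df['Event2'] <= long_df['Exit']).astype(int)
print(long_df)
```

Because the flag is computed against each row's Exit date, an event stays 1 in every later year once it has happened.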

Python/Pandas: groupby / find specific row / drop all rows below

I've got a dataframe - and I want to drop specific rows per group ("id"):
id - month - max
1 - 112016 - 41
1 - 012017 - 46
1 - 022017 - 156
1 - 032017 - 164
1 - 042017 - 51
2 - 042017 - 26
2 - 052017 - 156
2 - 062017 - 17
for each "id", find location of first row (sorted by "month") where "max" is >62
keep all rows above (within this group), drop rest of rows
Expected result:
id - month - max
1 - 112016 - 41
1 - 012017 - 46
2 - 042017 - 26
I'm able to identify the first row which has to be deleted per group, but I'm stuck from that point on:
df[df['max'] > 62].sort_values(['month'], ascending=[True]).groupby('id', as_index=False).first()
How can I get rid of the rows?
Best regards,
david
Use:
#convert to datetimes
df['month'] = pd.to_datetime(df['month'], format='%m%Y')
#sorting per groups if necessary
df = df.sort_values(['id','month'])
#compare by gt (>), take the cumulative sum per group and keep rows where it equals 0
df1 = df[df['max'].gt(62).groupby(df['id']).cumsum().eq(0)]
print (df1)
id month max
0 1 2016-11-01 41
1 1 2017-01-01 46
Or use a custom function if you also need the first value >62 included:
#convert to datetimes
df['month'] = pd.to_datetime(df['month'], format='%m%Y')
#sorting per groups if necessary
df = df.sort_values(['id','month'])
def f(x):
    m = x['max'].gt(62)
    first = m[m].index[0]
    x = x.loc[:first]
    return x
df = df.groupby('id', group_keys=False).apply(f)
print (df)
id month max
0 1 2016-11-01 41
1 1 2017-01-01 46
2 1 2017-02-01 156
5 2 2017-04-01 83
import pandas as pd

datadict = {
    'id': [1, 1, 1, 1, 1, 2, 2, 2],
    'max': [41, 46, 156, 164, 51, 83, 156, 17],
    'month': ['112016', '012017', '022017', '032017', '042017', '042017', '052017', '062017'],
}
df = pd.DataFrame(datadict)
print (df)
id max month
0 1 41 112016
1 1 46 012017
2 1 156 022017
3 1 164 032017
4 1 51 042017
5 2 83 042017
6 2 156 052017
7 2 17 062017
df = df.loc[df['max']>62,:]
print (df)
id max month
2 1 156 022017
3 1 164 032017
5 2 83 042017
6 2 156 052017
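Putting the cumulative-sum idea together with the question's own numbers gives this runnable sketch (the frame is rebuilt inline, so the id 2 group starts at 26 as in the question, and only the rows above the first value >62 survive):

```python
import pandas as pd

df = pd.DataFrame({
    'id':    [1, 1, 1, 1, 1, 2, 2, 2],
    'month': ['112016', '012017', '022017', '032017', '042017',
              '042017', '052017', '062017'],
    'max':   [41, 46, 156, 164, 51, 26, 156, 17],
})

df['month'] = pd.to_datetime(df['month'], format='%m%Y')
df = df.sort_values(['id', 'month'])

# within each id, the cumulative count of values > 62 stays 0 until the
# first offending row; keeping those rows drops it and everything below
mask = df['max'].gt(62).groupby(df['id']).cumsum().eq(0)
result = df[mask]
print(result)
```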
