This is my dataframe:
Date Group Value Duration
2018-01-01 A 20 30
2018-02-01 A 10 60
2018-03-01 A 25 88
2018-01-01 B 15 180
2018-02-01 B 30 210
2018-03-01 B 25 238
I want to pivot the above df
My Approach:
df_pivot = dealer_f.pivot_table(index='Group', columns='Date', fill_value=0)
df_pivot.columns = df_pivot.columns.map('_'.join)
df_pivot = df_pivot.reset_index()
I am getting this error: TypeError: sequence item 1: expected str instance, int found
If I simply call reset_index (without flattening), I get column names like ('Value', 2018-01-01), ('Value', 2018-02-01), etc.
I want to flatten the columns so that my output looks like below
df_pivot.columns.tolist()
['2018-01-01_Value','2018-02-01_Value',.....'2018-01-01_Duration',...]
Any clue? Or what am I missing?
Use:
df_pivot.columns = [f'{b}_{a}' for a, b in df_pivot.columns]
Or:
df_pivot.columns = [f'{b.strftime("%Y-%m-%d")}_{a}' for a, b in df_pivot.columns]
df_pivot = df_pivot.reset_index()
print (df_pivot)
Group 2018-01-01_Duration 2018-02-01_Duration 2018-03-01_Duration \
0 A 30 60 88
1 B 180 210 238
2018-01-01_Value 2018-02-01_Value 2018-03-01_Value
0 20 10 25
1 15 30 25
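For reference, here is the complete flow in one place, assuming the source frame is called dealer_f as in the question:

df_pivot = dealer_f.pivot_table(index='Group', columns='Date', fill_value=0)
# swap the (value, date) levels and join them as 'date_value'
df_pivot.columns = [f'{b}_{a}' for a, b in df_pivot.columns]
df_pivot = df_pivot.reset_index()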
I have a set of columns like so:
q1_cash_total, q2_cash_total, q3_cash_total,
q1_shop_us, q2_shop_us, q3_shop_us,
etc. I have about 40 similarly named columns like this. I wish to calculate the pct changes between each of these groups of 3. E.g. I know individually I can do:
df[['q1_cash_total', 'q2_cash_total', 'q3_cash_total']].pct_change().add_suffix('_PCT_CHG')
To do this for every group of 3, I do:
q1 = [col for col in df.columns if 'q1' in col]
q2 = [col for col in df.columns if 'q2' in col]
q3 = [col for col in df.columns if 'q3' in col]
q_cols = q1 + q2 + q3
dflist = []
for col in df[q_cols].columns:
    # col[3:] to get the column name without the q1_/q2_ prefix
    print(col[3:])
    cols = [c for c in df.columns if col[3:] in c]
    pct = df[cols].pct_change().add_suffix('_PCT_CHG')
    dflist.append(pct)
pcts_df = pd.concat(dflist)
I cannot think of a cleaner way to do this. Does anybody have any ideas? How can I also get the pct change between q1 and q3 directly, instead of only between successive quarters?
You could create a dataframe containing only the desired columns: filter column names starting with q immediately followed by one or more digits and an underscore (^q\d+?_). Remove the prefix and keep only the unique column names using pd.unique. For each unique column name, filter the columns with that specific name and apply the percentage change along the columns axis (.pct_change(axis='columns')) to obtain the changes between q1, q2 and q3.
To get the percentage change between q1 and q3, you can select those columns by name from the previously created dataframe (df_q) and apply the same pct_change used earlier.
df used as input
q1_cash_total q1_shop_us q2_cash_total q2_shop_us q3_cash_total q3_shop_us another_col numCols dataCols
0 52 93 15 72 61 21 83 87 75
1 75 88 24 3 22 53 2 88 30
2 38 2 64 60 21 33 76 58 22
3 89 49 91 59 42 92 60 80 15
4 62 62 47 62 51 55 64 3 51
df_q = df.filter(regex=r'^q\d+?_')
unique_cols = pd.unique([c[3:] for c in df_q.columns])
dflist = []
for col in unique_cols:
    q_name = df_q.filter(like=col)
    df_s = q_name.pct_change(axis='columns').add_suffix('_PCT_CHG')
    dflist.append(df_s)
    df_s = df_q[[f'q1_{col}', f'q3_{col}']].pct_change(axis='columns').add_suffix('_Q1-Q3')
    dflist.append(df_s)
pcts_df = pd.concat(dflist, axis=1)
Output from pcts_df
q1_cash_total_PCT_CHG q2_cash_total_PCT_CHG q3_cash_total_PCT_CHG ... q3_shop_us_PCT_CHG q1_shop_us_Q1-Q3 q3_shop_us_Q1-Q3
0 NaN -0.711538 3.066667 ... -0.708333 NaN -0.774194
1 NaN -0.680000 -0.083333 ... 16.666667 NaN -0.397727
2 NaN 0.684211 -0.671875 ... -0.450000 NaN 15.500000
3 NaN 0.022472 -0.538462 ... 0.559322 NaN 0.877551
4 NaN -0.241935 0.085106 ... -0.112903 NaN -0.112903
[5 rows x 10 columns]
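A possible alternative sketch (not part of the answer above): give the columns a (quarter, metric) MultiIndex so that pct_change can run across quarters for every metric at once.

import pandas as pd

df_q = df.filter(regex=r'^q\d+_')
# split 'q1_cash_total' into ('q1', 'cash_total') and use it as a column MultiIndex
df_q.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split('_', 1)) for c in df_q.columns], names=['quarter', 'metric'])
# stack the metric level into the rows, compute successive changes q1->q2->q3,
# then unstack back to one column per (quarter, metric)
pct = df_q.stack('metric').pct_change(axis='columns').unstack('metric')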
I have a table that looks like this:
temp = [['K98R', 'AB',34,'2010-07-27', '2013-08-17', '2008-03-01', '2011-05-02', 44],['S33T','ES',55, '2009-07-23', '2012-03-12', '2010-09-17', '', 76]]
Data = pd.DataFrame(temp,columns=['ID','Initials','Age', 'Entry','Exit','Event1','Event2','Weight'])
What you see in the table above is that there are entry and exit dates, along with dates for events 1 and 2. There is also a missing date for event 2 for the second patient because the event didn't happen. Also note that event 1 for the first patient happened before the entry date.
What I am trying to achieve is threefold:
1. Split the time between the entry and exit into years
2. Convert the wide format to long one with one row per year
3. Check if event 1 and 2 have occurred during the time period included in each row
To explain further, here is the output I am trying to get.
ID Initials Age Entry Exit Event1 Event2 Weight
K98R AB 34 27/07/2010 31/12/2010 1 0 44
K98R AB 35 1/01/2011 31/12/2011 1 1 44
K98R AB 36 1/01/2012 31/12/2012 1 1 44
K98R AB 37 1/01/2013 17/08/2013 1 1 44
S33T ES 55 23/07/2009 31/12/2009 0 0 76
S33T ES 56 1/01/2010 31/12/2010 1 0 76
S33T ES 57 1/01/2011 31/12/2011 1 0 76
S33T ES 58 1/01/2012 12/03/2012 1 0 76
What you notice here is that the entry-to-exit period is split into individual rows per patient, each representing a year. The event columns are now coded as 0 (the event has not yet happened) or 1 (the event happened), and the 1 is carried over to later years because the event has already occurred.
The age increases in every row per patient as time progresses.
The patient ID and initials remain the same, as does the weight.
Could anyone please help with this? Thank you.
Begin by getting the number of years between Entry and Exit:
# Convert to datetime
df.Entry = pd.to_datetime(df.Entry)
df.Exit = pd.to_datetime(df.Exit)
df.Event1 = pd.to_datetime(df.Event1)
df.Event2 = pd.to_datetime(df.Event2)
# Round up, to include the upper years
import math
df['Years_Between'] = (df.Exit - df.Entry).apply(lambda x: math.ceil(x.days/365))
# printing the df will provide the following:
ID Initials Age Entry Exit Event1 Event2 Weight Years_Between
0 K98R AB 34 2010-07-27 2013-08-17 2008-03-01 2011-05-02 44 4
1 S33T ES 55 2009-07-23 2012-03-12 2010-09-17 NaT 76 3
Loop through your data and create a new row for each year:
new_data = []
for idx, row in df.iterrows():
    year = row['Entry'].year
    new_entry = pd.to_datetime(year, format='%Y')
    for y in range(row['Years_Between']):
        new_entry = new_entry + pd.DateOffset(years=1)
        new_exit = new_entry + pd.DateOffset(years=1) - pd.DateOffset(days=1)
        record = {'Entry': new_entry, 'Exit': new_exit}
        if row['Entry'] > new_entry:
            record['Entry'] = row['Entry']
        if row['Exit'] < new_exit:
            record['Exit'] = row['Exit']
        for col in ['ID', 'Initials', 'Age', 'Event1', 'Event2', 'Weight']:
            record[col] = row[col]
        new_data.append(record)
Create a new DataFrame, then compare the dates:
df_new = pd.DataFrame(new_data, columns = ['ID','Initials','Age', 'Entry','Exit','Event1','Event2','Weight'])
df_new['Event1'] = (df_new.Event1 <= df_new.Exit).astype(int)
df_new['Event2'] = (df_new.Event2 <= df_new.Exit).astype(int)
# printing df_new will provide:
ID Initials Age Entry Exit Event1 Event2 Weight
0 K98R AB 34 2011-01-01 2011-12-31 1 1 44
1 K98R AB 34 2012-01-01 2012-12-31 1 1 44
2 K98R AB 34 2013-01-01 2013-08-17 1 1 44
3 K98R AB 34 2014-01-01 2013-08-17 1 1 44
4 S33T ES 55 2010-01-01 2010-12-31 1 0 76
5 S33T ES 55 2011-01-01 2011-12-31 1 0 76
6 S33T ES 55 2012-01-01 2012-03-12 1 0 76
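A more vectorised sketch of the same idea (not the answer above; it reuses the column names from the question and skips the Age increment): build one row per calendar year with explode, then clip each year to the Entry/Exit window and flag the events.

import pandas as pd

# one list entry per calendar year covered by the Entry/Exit window
df['year'] = df.apply(lambda r: list(range(r.Entry.year, r.Exit.year + 1)), axis=1)
long = df.explode('year')
year_start = pd.to_datetime(long['year'].astype(str) + '-01-01')
year_end = pd.to_datetime(long['year'].astype(str) + '-12-31')
# clip the year boundaries to the patient's Entry/Exit window
long['Entry'] = long['Entry'].where(long['Entry'] > year_start, year_start)
long['Exit'] = long['Exit'].where(long['Exit'] < year_end, year_end)
# an event counts as 1 once its date falls on or before the row's Exit
long['Event1'] = (long['Event1'] <= long['Exit']).astype(int)
long['Event2'] = (long['Event2'] <= long['Exit']).astype(int)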
I've got a dataframe - and I want to drop specific rows per group ("id"):
id - month - max
1 - 112016 - 41
1 - 012017 - 46
1 - 022017 - 156
1 - 032017 - 164
1 - 042017 - 51
2 - 042017 - 26
2 - 052017 - 156
2 - 062017 - 17
for each "id", find location of first row (sorted by "month") where "max" is >62
keep all rows above (within this group), drop rest of rows
Expected result:
id - month - max
1 - 112016 - 41
1 - 012017 - 46
2 - 042017 - 26
I'm able to identify the first row which has to be deleted per group, but I'm stuck from that point on:
df[df['max'] > 62].sort_values(['month'], ascending=[True]).groupby('id', as_index=False).first()
How can I get rid of the rows?
Best regards,
david
Use:
#convert to datetimes
df['month'] = pd.to_datetime(df['month'], format='%m%Y')
#sorting per groups if necessary
df = df.sort_values(['id','month'])
#compare by gt (>), take the cumulative sum per group and keep rows where it equals 0
df1 = df[df['max'].gt(62).groupby(df['id']).cumsum().eq(0)]
print (df1)
id month max
0 1 2016-11-01 41
1 1 2017-01-01 46
Or use a custom function if you also need the first value >62:
#convert to datetimes
df['month'] = pd.to_datetime(df['month'], format='%m%Y')
#sorting per groups if necessary
df = df.sort_values(['id','month'])
def f(x):
    m = x['max'].gt(62)
    first = m[m].index[0]
    x = x.loc[:first]
    return x
df = df.groupby('id', group_keys=False).apply(f)
print (df)
id month max
0 1 2016-11-01 41
1 1 2017-01-01 46
2 1 2017-02-01 156
5 2 2017-04-01 83
import pandas as pd
datadict = {
'id': [1,1,1,1,1,2,2,2],
'max': [41,46,156,164,51,83,156,17],
'month': ['112016', '012017', '022017', '032017', '042017', '042017', '052017', '062017'],
}
df = pd.DataFrame(datadict)
print (df)
id max month
0 1 41 112016
1 1 46 012017
2 1 156 022017
3 1 164 032017
4 1 51 042017
5 2 83 042017
6 2 156 052017
7 2 17 062017
df = df.loc[df['max']>62,:]
print (df)
id max month
2 1 156 022017
3 1 164 032017
5 2 83 042017
6 2 156 052017
I have the following sample data frame:
id category time
43 S 8
22 I 10
15 T 350
18 L 46
I want to apply the following logic:
1) if category value equals "T" then create new column called "time_2" where "time" value is divided by 24.
2) if category value equals "L" then create new column called "time_2" where "time" value is divided by 3.5.
3) otherwise keep the existing "time" value (categories S or I).
Below is my desired output table:
id category time time_2
43 S 8 8
22 I 10 10
15 T 350 14.58333333
18 L 46 13.14285714
I've tried using pd.np.where to get the above to work but am confused by the syntax.
You can use map for the rules:
In [1066]: df['time_2'] = df.time / df.category.map({'T': 24, 'L': 3.5}).fillna(1)
In [1067]: df
Out[1067]:
id category time time_2
0 43 S 8 8.000000
1 22 I 10 10.000000
2 15 T 350 14.583333
3 18 L 46 13.142857
You can use np.select. This is a good alternative to nested np.where logic.
import numpy as np

conditions = [df['category'] == 'T', df['category'] == 'L']
values = [df['time'] / 24, df['time'] / 3.5]
df['time_2'] = np.select(conditions, values, df['time'])
print(df)
id category time time_2
0 43 S 8 8.000000
1 22 I 10 10.000000
2 15 T 350 14.583333
3 18 L 46 13.142857
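For comparison, the same logic written with nested np.where, which is roughly what the question was attempting:

import numpy as np

df['time_2'] = np.where(df['category'] == 'T', df['time'] / 24,
                        np.where(df['category'] == 'L', df['time'] / 3.5, df['time']))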
Here's a sample of the data I'm using:
SCENARIO DATE POD AREA IDOC STATUS TYPE
AAA 02.06.2015 JKJKJKJKJKK 4210 713375 51 1
AAA 02.06.2015 JWERWERE 4210 713375 51 1
AAA 02.06.2015 JAFDFDFDFD 4210 713375 51 9
BBB 02.06.2015 AAAAAAAA 5400 713504 51 43
CCC 05.06.2015 BBBBBBBBBB 4100 756443 51 187
AAA 05.06.2015 EEEEEEEE 4100 756457 53 228
I have written the following code in pandas to groupby:
import pandas as pd
import numpy as np
xl = pd.ExcelFile("MRD.xlsx")
df = xl.parse("Sheet3")
#print(df.columns.values)
# The following gave ValueError: Cannot label index with a null key
# dfi = df.pivot('SCENARIO')
# Here I do not actually need it to count every column, just a specific one
table = df.groupby(["SCENARIO", "STATUS", "TYPE"]).agg(['count'])
writer = pd.ExcelWriter('pandas.out.xlsx', engine='xlsxwriter')
table.to_excel(writer, sheet_name='Sheet1')
writer.save()
table2 = pd.DataFrame(df.groupby(["SCENARIO", "STATUS", "TYPE"])['TYPE'].count())
print (table2)
writer2 = pd.ExcelWriter('pandas2.out.xlsx', engine='xlsxwriter')
table2.to_excel(writer2, sheet_name='Sheet1')
writer2.save()
This yields the result:
SCENARIO STATUS TYPE TYPE
AAA 51 1 2
9 1
53 228 1
BBB 51 43 1
CCC 51 187 1
Name: TYPE, dtype: int64
How could I add subtotals per group? Ideally I would want to achieve something like:
SCENARIO STATUS TYPE TYPE
AAA 51 1 2
9 1
Total 3
53 228 1
Total 1
BBB 51 43 1
Total 1
CCC 51 187 1
Total 1
Name: TYPE, dtype: int64
Is this possible?
Use:
#if necessary convert TYPE column to string
df['TYPE'] = df['TYPE'].astype(str)
df = df.groupby(["SCENARIO", "STATUS", "TYPE"])['TYPE'].count()
#aggregate sum by first 2 levels
df1 = df.groupby(["SCENARIO", "STATUS"]).sum()
#add 3 level of MultiIndex
df1.index = [df1.index.get_level_values(0),
df1.index.get_level_values(1),
['Total'] * len(df1)]
#thanks MaxU for improving
#df1 = df1.set_index(np.array(['Total'] * len(df1)), append=True)
print (df1)
SCENARIO STATUS
AAA 51 Total 3
53 Total 1
BBB 51 Total 1
CCC 51 Total 1
Name: TYPE, dtype: int64
#join together and sort
df = pd.concat([df, df1]).sort_index(level=[0,1])
print (df)
SCENARIO STATUS TYPE
AAA 51 1 2
9 1
Total 3
53 228 1
Total 1
BBB 51 43 1
Total 1
CCC 51 187 1
Total 1
Name: TYPE, dtype: int64
The same thing can be achieved with a pandas pivot table:
table = pd.pivot_table(df, values=['TYPE'], index=['SCENARIO', 'STATUS'], aggfunc='count')
table
Chris Moffitt has created a library named sidetable to ease this process; it exposes an accessor that makes adding subtotals to a grouped result very easy. That said, the accepted answer and its comments are a gold mine, which I feel is worth checking out first.
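For illustration, a rough sketch of how sidetable might be applied here (the exact accessor behaviour is an assumption to verify against the library's documentation):

import pandas as pd
import sidetable  # registers the .stb accessor on DataFrames

# assumption: stb.subtotal() adds subtotal rows to a grouped, MultiIndexed result
table = df.groupby(['SCENARIO', 'STATUS', 'TYPE'])['TYPE'].count().to_frame('count')
table_with_totals = table.stb.subtotal()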