changing wide to long table format and splitting dates by year - python

I have a table that looks like this:
temp = [['K98R', 'AB',34,'2010-07-27', '2013-08-17', '2008-03-01', '2011-05-02', 44],['S33T','ES',55, '2009-07-23', '2012-03-12', '2010-09-17', '', 76]]
df = pd.DataFrame(temp, columns=['ID','Initials','Age','Entry','Exit','Event1','Event2','Weight'])
As you can see, the table has entry and exit dates, plus dates for events 1 and 2. The date for event 2 is missing for the second patient because that event never happened. Note also that event 1 for the first patient happened before the entry date.
What I am trying to achieve is threefold:
1. Split the time between the entry and exit into years
2. Convert the wide format to long one with one row per year
3. Check if event 1 and 2 have occurred during the time period included in each row
To explain further, here is the output I am trying to get:
ID    Initials  Age  Entry       Exit        Event1  Event2  Weight
K98R  AB        34   27/07/2010  31/12/2010  1       0       44
K98R  AB        35   01/01/2011  31/12/2011  1       1       44
K98R  AB        36   01/01/2012  31/12/2012  1       1       44
K98R  AB        37   01/01/2013  17/08/2013  1       1       44
S33T  ES        55   23/07/2009  31/12/2009  0       0       76
S33T  ES        56   01/01/2010  31/12/2010  1       0       76
S33T  ES        57   01/01/2011  31/12/2011  1       0       76
S33T  ES        58   01/01/2012  12/03/2012  1       0       76
Notice that the entry-to-exit period is split into one row per patient per year. The event columns are now coded as 0 (the event has not happened yet) or 1 (the event has happened), and a 1 is carried over to all later years because the event has already occurred.
The age increases by one in every row per patient as time progresses.
The patient ID and initials remain the same, as does the weight.
Could anyone please help with this? Thank you.

Begin by getting the number of calendar years between Entry and Exit:
# Convert to datetime
df.Entry = pd.to_datetime(df.Entry)
df.Exit = pd.to_datetime(df.Exit)
df.Event1 = pd.to_datetime(df.Event1)
df.Event2 = pd.to_datetime(df.Event2)
# Count the calendar years each stay touches, inclusive on both ends
# (rounding up days/365 undercounts stays that straddle year boundaries)
df['Years_Between'] = df.Exit.dt.year - df.Entry.dt.year + 1
# printing the df will provide the following:
     ID Initials  Age      Entry       Exit     Event1     Event2  Weight  Years_Between
0  K98R       AB   34 2010-07-27 2013-08-17 2008-03-01 2011-05-02      44              4
1  S33T       ES   55 2009-07-23 2012-03-12 2010-09-17        NaT      76              4
Loop through your data and create a new row for each year:
new_data = []
for idx, row in df.iterrows():
    # January 1st of the entry year anchors the yearly windows
    year_start = pd.to_datetime(row['Entry'].year, format='%Y')
    for y in range(row['Years_Between']):
        new_entry = year_start + pd.DateOffset(years=y)
        new_exit = new_entry + pd.DateOffset(years=1) - pd.DateOffset(days=1)
        record = {'Entry': new_entry, 'Exit': new_exit}
        # Clamp the first and last windows to the actual Entry/Exit dates
        if row['Entry'] > new_entry:
            record['Entry'] = row['Entry']
        if row['Exit'] < new_exit:
            record['Exit'] = row['Exit']
        # Age advances by one for each elapsed year
        record['Age'] = row['Age'] + y
        for col in ['ID', 'Initials', 'Event1', 'Event2', 'Weight']:
            record[col] = row[col]
        new_data.append(record)
Create a new DataFrame, then compare the dates:
df_new = pd.DataFrame(new_data, columns = ['ID','Initials','Age', 'Entry','Exit','Event1','Event2','Weight'])
df_new['Event1'] = (df_new.Event1 <= df_new.Exit).astype(int)
df_new['Event2'] = (df_new.Event2 <= df_new.Exit).astype(int)
# printing df_new will provide:
     ID Initials  Age      Entry       Exit  Event1  Event2  Weight
0  K98R       AB   34 2010-07-27 2010-12-31       1       0      44
1  K98R       AB   35 2011-01-01 2011-12-31       1       1      44
2  K98R       AB   36 2012-01-01 2012-12-31       1       1      44
3  K98R       AB   37 2013-01-01 2013-08-17       1       1      44
4  S33T       ES   55 2009-07-23 2009-12-31       0       0      76
5  S33T       ES   56 2010-01-01 2010-12-31       1       0      76
6  S33T       ES   57 2011-01-01 2011-12-31       1       0      76
7  S33T       ES   58 2012-01-01 2012-03-12       1       0      76
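Two side notes. First, the missing Event2 (NaT) compares as False against any date, so it correctly stays coded 0:
pd.NaT <= pd.Timestamp('2012-12-31')   # False
Second, here is a rough sketch of a more vectorized alternative (not the approach above; it assumes the same df with already-parsed dates and the Years_Between column): build the list of calendar years per row, explode it, then clamp the first and last windows.
import numpy as np
df2 = df.copy()
df2['Year'] = [list(range(e.year, x.year + 1)) for e, x in zip(df2['Entry'], df2['Exit'])]
df2 = df2.explode('Year')
df2['Year'] = df2['Year'].astype(int)
year_start = pd.to_datetime(df2['Year'].astype(str) + '-01-01')
year_end = pd.to_datetime(df2['Year'].astype(str) + '-12-31')
df2['Age'] += df2['Year'] - df2['Entry'].dt.year        # age grows with each year
df2['Entry'] = np.maximum(df2['Entry'].to_numpy(), year_start.to_numpy())
df2['Exit'] = np.minimum(df2['Exit'].to_numpy(), year_end.to_numpy())
df2['Event1'] = (df2['Event1'] <= df2['Exit']).astype(int)
df2['Event2'] = (df2['Event2'] <= df2['Exit']).astype(int)
df2 = df2.drop(columns=['Year', 'Years_Between']).reset_index(drop=True)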


Finding the 'Date' column in a dataframe

I'm programmatically trying to detect which column in a dataframe contains dates, and then convert the date values in that column to a single format.
My logic is to find the column name that contains the word 'date', either as a whole word or as a sub-word (using contains()), and then work on the dates in that column.
My code:
from dateutil.parser import parse
import re
from datetime import datetime
import calendar
import pandas as pd
def date_fun(filepath):
    lst_to_ser = pd.Series(filepath.columns.values)
    date_col_search = lst_to_ser.str.contains(pat='date')
    #print(date_col_search.columns.values)
    for i in date_col_search:
        if i is True:
            formatted_dates = pd.to_datetime(date_col_search[i], errors='coerce')
            print(formatted_dates)
main_path = pd.read_csv('C:/Data_Cleansing/random_dateset.csv')
fpath=main_path.copy()
date_fun(fpath)
The retrieved column names are stored in an array, and since contains() works only on a Series, I converted the array to a Series.
This is what 'date_col_search' variable contains:
0 False
1 True
2 False
dtype: bool
I want to access the column corresponding to the 'True' value. But I'm getting the following error at the line formatted_dates=......:
Exception has occurred: KeyError
True
How should I access the 'True' column?
My dataframe:
random joiningdate branch
1 25.09.2019 rev
8 9/16/2015 pop
98 10.12.2017 switch
65 02.12.2014 high
45 08-Mar-18 aim
2 08-12-2016 docker
0 26.04.2016 grit
9 05-03-2016 trevor
56 24.12.2016 woll
4 10-Aug-19 qerty
78 abc yak
54 05-06-2015 water
42 12-2012-18 rance
43 24-02-2010 stream
38 2008,13,02 verge
78 16-09-2015 atom
The KeyError happens because iterating over the boolean Series yields its values (True/False), so date_col_search[i] ends up looking up a label named True, which doesn't exist. Instead of looping, I would use:
def mixed_datetime(s):
    # this is just an example, adapt this function to your need
    return (pd.to_datetime(s, yearfirst=False, dayfirst=True, errors='coerce')
            .fillna(pd.to_datetime(s, yearfirst=True, dayfirst=False, errors='coerce')))

cols = df.columns.str.contains('date', case=False)
df.loc[:, cols] = df.loc[:, cols].apply(mixed_datetime)
Updated DataFrame:
random joiningdate branch
0 1 2019-09-25 rev
1 8 2015-09-16 pop
2 98 2017-12-10 switch
3 65 2014-12-02 high
4 45 2018-03-08 aim
5 2 2016-12-08 docker
6 0 2016-04-26 grit
7 9 2016-03-05 trevor
8 56 2016-12-24 woll
9 4 2019-08-10 qerty
10 78 NaT yak
11 54 2015-06-05 water
12 42 NaT rance
13 43 2010-02-24 stream
14 38 2008-02-01 verge
15 78 2015-09-16 atom
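If you also want the matching column names themselves (the question's literal goal), the same boolean mask can index the labels:
date_cols = df.columns[cols]
print(list(date_cols))   # ['joiningdate'] for the sample frame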

group by and concatenate dataframe

I have a df with frame, m_label, and details columns, and values in any of them can be duplicated; the same frame may hold different labels with different details. m_label+details follow a fixed pattern with a few options: for example, Findings may be PL or DV, and "Findings PL start" always has a matching "Findings PL end". The exception is BBPS, which may start with detail 3 and end with detail 2, or start and end with the same number. In the end I need to know, for each label, when it starts (for example Action IR starts in frame 31) and when it ends (Action IR ends in frame 101).
That my input:
frame m_label details
0 BBPS 3
0 BBPS start
0 Findings DV
0 Findings start
0 Findings DV
0 Findings end
31 Actions IR
31 Actions start
99 BBPS 2
99 Findings PL
99 Findings start
99 BBPS end
99 Findings PL
99 Findings end
101 Action IR
101 Action end
So I want to convert this df to something like this:
frame m_label details
0 Findings.DV start
0 Findings.DV end
0 BBPS.3 start
31 Actions.IR start
99 Action.IR end
99 Findings.PL start
99 Findings.PL end
99 BBPS.2 end
101 Action.IR end
So I need to concatenate the rows only where details is not start/end, and then groupby(?) or transform(?) by frame.
I tried this code, but then I got stuck:
def concat_func(x):
    if not x[1] in ['start', 'end']:
        result = x[0] + '.' + x[1]
    else:
        result = np.nan
    return result

data_cv["concat"] = data_cv[["m_label", "details"]].apply(concat_func, axis=1)
First I find it useful to move the start/end info to a new column, which is done by merging together the rows that have start/end on one side and the ones that don’t on the other:
>>> detail_type = df['details'].isin({'start', 'end'})
>>> df = pd.merge(df[~detail_type], df[detail_type].rename(columns={'details': 'detail_type'}))
>>> df
frame m_label details detail_type
0 0 BBPS 3 start
1 0 Findings DV start
2 0 Findings DV end
3 0 Findings DV start
4 0 Findings DV end
5 31 Actions IR start
6 99 BBPS 2 end
7 99 Findings PL start
8 99 Findings PL end
9 99 Findings PL start
10 99 Findings PL end
11 101 Action IR end
Now we can replace the 2 columns by their concatenated text:
>>> df = df.drop(columns=['m_label', 'details']).join(df['m_label'].str.cat(df['details'], sep='.'))
>>> df.drop_duplicates()
frame detail_type m_label
0 0 start BBPS.3
1 0 start Findings.DV
2 0 end Findings.DV
5 31 start Actions.IR
6 99 end BBPS.2
7 99 start Findings.PL
8 99 end Findings.PL
11 101 end Action.IR
You could even pivot to have a start and an end column:
>>> df.drop_duplicates().pivot(columns='detail_type', index='m_label', values='frame')
detail_type end start
m_label
Action.IR 101.0 NaN
Actions.IR NaN 31.0
BBPS.2 99.0 NaN
BBPS.3 NaN 0.0
Findings.DV 0.0 0.0
Findings.PL 99.0 99.0
But for that to be efficient you’ll first need to define rules that uniquely name your labels, e.g. BBPS regardless of details 2 and 3, Action / Actions always spelled the same way, etc.
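For example, a minimal normalization pass on the original frame, before the merge, might look like this (a sketch; it assumes 'Action'/'Actions' are the only spelling variants and that BBPS's numeric detail should be ignored when pairing start/end):
df['m_label'] = df['m_label'].replace({'Actions': 'Action'})
# collapse BBPS.2 / BBPS.3 into a single label by blanking the numeric detail
bbps = df['m_label'].eq('BBPS') & ~df['details'].isin(['start', 'end'])
df.loc[bbps, 'details'] = 'any'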
I don't think groupby would help, as the order inside the group also matter.
Try this (since you didn't post the df in a copiable way, I can't test it myself):
df = df.assign(new_label=None).sort_values(['frame', 'm_label'])
df.loc[~df['details'].isin(['start', 'end']), 'new_label'] = df['m_label'] + '.' + df['details']
# where the next row is the start/end marker of the same frame and label,
# pull that marker onto the current row
mask = ((df['frame'] == df['frame'].shift(-1))
        & (df['m_label'] == df['m_label'].shift(-1))
        & df['details'].shift(-1).isin(['start', 'end']))
df.loc[mask, 'details'] = df['details'].shift(-1)
df = df.loc[pd.notna(df['new_label']) & df['details'].isin(['start', 'end']),
            ['frame', 'new_label', 'details']]
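For reference, here is the sample frame in copy-pasteable form (reconstructed from the table in the question):
import pandas as pd
df = pd.DataFrame({
    'frame':   [0, 0, 0, 0, 0, 0, 31, 31, 99, 99, 99, 99, 99, 99, 101, 101],
    'm_label': ['BBPS', 'BBPS', 'Findings', 'Findings', 'Findings', 'Findings',
                'Actions', 'Actions', 'BBPS', 'Findings', 'Findings', 'BBPS',
                'Findings', 'Findings', 'Action', 'Action'],
    'details': ['3', 'start', 'DV', 'start', 'DV', 'end', 'IR', 'start', '2',
                'PL', 'start', 'end', 'PL', 'end', 'IR', 'end'],
})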

Python/Pandas: groupby / find specific row / drop all rows below

I've got a dataframe - and I want to drop specific rows per group ("id"):
id - month - max
1 - 112016 - 41
1 - 012017 - 46
1 - 022017 - 156
1 - 032017 - 164
1 - 042017 - 51
2 - 042017 - 26
2 - 052017 - 156
2 - 062017 - 17
for each "id", find location of first row (sorted by "month") where "max" is >62
keep all rows above (within this group), drop rest of rows
Expected result:
id - month - max
1 - 112016 - 41
1 - 012017 - 46
2 - 042017 - 26
I'm able to identify the first row which has to be deleted per group, but I'm stuck from that point on:
df[df['max'] > 62].sort_values(['month'], ascending=[True]).groupby('id', as_index=False).first()
How can I get rid of the rows?
Use:
#convert to datetimes
df['month'] = pd.to_datetime(df['month'], format='%m%Y')
#sorting per groups if necessary
df = df.sort_values(['id','month'])
#compare with gt (>), take the cumulative sum per group, and keep rows where it is still 0
df1 = df[df['max'].gt(62).groupby(df['id']).cumsum().eq(0)]
print (df1)
id month max
0 1 2016-11-01 41
1 1 2017-01-01 46
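To see why the cumulative-sum trick works, trace the mask for id 1 (values 41, 46, 156, 164, 51):
# gt(62):        False, False,  True,  True, False
# cumsum per id:     0,     0,     1,     2,     2
# eq(0):          True,  True, False, False, False
The running sum stays 0 only until the first value above 62, so eq(0) keeps exactly the rows before it.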
Or use a custom function if you also need to keep the first value > 62:
#convert to datetimes
df['month'] = pd.to_datetime(df['month'], format='%m%Y')
#sorting per groups if necessary
df = df.sort_values(['id','month'])
def f(x):
    m = x['max'].gt(62)
    first = m[m].index[0]
    x = x.loc[:first]
    return x
df = df.groupby('id', group_keys=False).apply(f)
print (df)
id month max
0 1 2016-11-01 41
1 1 2017-01-01 46
2 1 2017-02-01 156
5 2 2017-04-01 83
import pandas as pd

datadict = {
    'id': [1, 1, 1, 1, 1, 2, 2, 2],
    'max': [41, 46, 156, 164, 51, 83, 156, 17],
    'month': ['112016', '012017', '022017', '032017', '042017', '052017', '062017'],
}
datadict['month'] = ['112016', '012017', '022017', '032017', '042017', '042017', '052017', '062017']
df = pd.DataFrame(datadict)
print(df)
id max month
0 1 41 112016
1 1 46 012017
2 1 156 022017
3 1 164 032017
4 1 51 042017
5 2 83 042017
6 2 156 052017
7 2 17 062017
df = df.loc[df['max']>62,:]
print (df)
id max month
2 1 156 022017
3 1 164 032017
5 2 83 042017
6 2 156 052017

Python - Pandas subtotals on groupby

Here's a sample of the data I'm using:
SCENARIO DATE POD AREA IDOC STATUS TYPE
AAA 02.06.2015 JKJKJKJKJKK 4210 713375 51 1
AAA 02.06.2015 JWERWERE 4210 713375 51 1
AAA 02.06.2015 JAFDFDFDFD 4210 713375 51 9
BBB 02.06.2015 AAAAAAAA 5400 713504 51 43
CCC 05.06.2015 BBBBBBBBBB 4100 756443 51 187
AAA 05.06.2015 EEEEEEEE 4100 756457 53 228
I have written the following code in pandas to groupby:
import pandas as pd
import numpy as np
xl = pd.ExcelFile("MRD.xlsx")
df = xl.parse("Sheet3")
#print (df.column.values)
# The following gave ValueError: Cannot label index with a null key
# dfi = df.pivot('SCENARIO)
# Here i do not actually need it to count every column, just a specific one
table = df.groupby(["SCENARIO", "STATUS", "TYPE"]).agg(['count'])
writer = pd.ExcelWriter('pandas.out.xlsx', engine='xlsxwriter')
table.to_excel(writer, sheet_name='Sheet1')
writer.save()
table2 = pd.DataFrame(df.groupby(["SCENARIO", "STATUS", "TYPE"])['TYPE'].count())
print (table2)
writer2 = pd.ExcelWriter('pandas2.out.xlsx', engine='xlsxwriter')
table2.to_excel(writer2, sheet_name='Sheet1')
writer2.save()
This yields the result:
SCENARIO STATUS TYPE TYPE
AAA 51 1 2
9 1
53 228 1
BBB 51 43 1
CCC 51 187 1
Name: TYPE, dtype: int64
How could I add subtotals per group? Ideally I would want to achieve something like:
SCENARIO STATUS TYPE TYPE
AAA 51 1 2
9 1
Total 3
53 228 1
Total 1
BBB 51 43 1
Total 1
CCC 51 187 1
Total 1
Name: TYPE, dtype: int64
Is this possible?
Use:
#if necessary convert TYPE column to string
df['TYPE'] = df['TYPE'].astype(str)
df = df.groupby(["SCENARIO", "STATUS", "TYPE"])['TYPE'].count()
#aggregate sum by first 2 levels
df1 = df.groupby(["SCENARIO", "STATUS"]).sum()
#add 3 level of MultiIndex
df1.index = [df1.index.get_level_values(0),
             df1.index.get_level_values(1),
             ['Total'] * len(df1)]
#thanks MaxU for improving
#df1 = df1.set_index(np.array(['Total'] * len(df1)), append=True)
print (df1)
SCENARIO STATUS
AAA 51 Total 3
53 Total 1
BBB 51 Total 1
CCC 51 Total 1
Name: TYPE, dtype: int64
#join together and sorts
df = pd.concat([df, df1]).sort_index(level=[0,1])
print (df)
SCENARIO STATUS TYPE
AAA 51 1 2
9 1
Total 3
53 228 1
Total 1
BBB 51 43 1
Total 1
CCC 51 187 1
Total 1
Name: TYPE, dtype: int64
The same thing can be achieved with a pandas pivot table:
table = pd.pivot_table(df, values=['TYPE'], index=['SCENARIO', 'STATUS'], aggfunc='count')
table
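With the sample data above this should give one count per SCENARIO/STATUS pair:
                 TYPE
SCENARIO STATUS
AAA      51         3
         53         1
BBB      51         1
CCC      51         1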
Chris Moffitt has created a library named sidetable to ease this process; it can be used on grouped data via an accessor, which makes this very easy. That said, the accepted answer and comments are a gold mine and worth checking out first.
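A hedged sketch with sidetable (check its docs for the exact API; the .stb accessor is registered on import):
import sidetable  # pip install sidetable
agg = df.groupby(['SCENARIO', 'STATUS', 'TYPE']).agg({'IDOC': 'count'})
print(agg.stb.subtotal())  # inserts subtotal rows for each group level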

Two DataFrames Random Sample by Day grouping instead of hour

I have two dataframes: one is Price and the other is Volume. Both are hourly and cover the same timeframe (one year).
dfP = pd.DataFrame(np.random.randint(5, 10, (8760,4)), index=pd.date_range('2008-01-01', periods=8760, freq='H'), columns='Col1 Col2 Col3 Col4'.split())
dfV = pd.DataFrame(np.random.randint(50, 100, (8760,4)), index=pd.date_range('2008-01-01', periods=8760, freq='H'), columns='Col1 Col2 Col3 Col4'.split())
Each day is a SET in the sense that its values have to stay together: when a sample is generated, it needs to be a full day (for example, all 24 hours of Feb 2, 2008). I would like to generate a 185-day (50%) sample set from dfP and take the Volumes from the same days, so I can compute a sum product.
dfProduct = dfP_Sample * dfV_Sample
I am lost on how to achieve this. Any help is appreciated.
It sounds like you're expecting to get the sum of the volumes and prices for each day and then multiply them together?
If that's the case, try the following. If not, please clarify your question.
priceGroup = dfP.groupby(by=dfP.index.date).sum()
volumeGroup = dfV.groupby(by=dfV.index.date).sum()
dfProduct = priceGroup*volumeGroup
If you want to just look at a specific date range, try
import datetime
dfProduct[np.logical_and(dfProduct.index > datetime.date(2006, 8, 9),
                         dfProduct.index < datetime.date(2007, 1, 2))]
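As a side note (an addition, not part of the original answer): if you convert the grouped index back to a DatetimeIndex, you can slice date ranges directly with .loc, which avoids the datetime literals above:
dfProduct.index = pd.to_datetime(dfProduct.index)  # date objects -> DatetimeIndex
dfProduct.loc['2008-08-09':'2008-12-31']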
First of all we'll generate a column that holds the day index within the year; for example 2008-01-01 will be assigned 1 because it is the first day of the year, and so on:
day_order = [date.timetuple().tm_yday for date in dfP.index]
dfP['day_order'] = day_order
then generate 185 random day numbers from 1 to 365; these represent the day order within the year, so for example the number 1 indicates 2008-01-01:
random_days = np.random.choice(np.arange(1 , 366) , size = 185 , replace=False)
then slice your original data frame to get only the values from the random sample, according to the day_order column we created previously:
dfP_sample = dfP[dfP.day_order.isin(random_days)]
then you can merge both frames on index, and do whatever you want:
final = pd.merge(dfP_sample , dfV , left_index=True , right_index=True)
final.head()
Out[47]:
Col1_x Col2_x Col3_x Col4_x day_order Col1_y Col2_y Col3_y Col4_y
2008-01-03 00:00:00 9 6 9 9 3 66 85 62 82
2008-01-03 01:00:00 5 8 9 8 3 54 89 65 98
2008-01-03 02:00:00 7 5 5 9 3 83 58 60 96
2008-01-03 03:00:00 9 5 7 6 3 59 54 67 78
2008-01-03 04:00:00 9 5 8 9 3 92 66 66 55
if you don't want to merge both frames, you can apply the same logic to dfV, and then you will get samples from both data frames for the same days.
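Putting it together, a minimal sketch of the full workflow (sampling whole days and keeping both frames aligned; the variable names are my own):
import numpy as np
import pandas as pd
days = pd.Series(dfP.index.date).unique()                  # 365 distinct days
sampled = np.random.choice(days, size=185, replace=False)  # 50% of the days
mask = pd.Series(dfP.index.date).isin(sampled).to_numpy()
dfP_sample = dfP[mask]
dfV_sample = dfV[mask]   # identical index, so the same hourly rows are selected
dfProduct = dfP_sample * dfV_sample                        # elementwise product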
