Transforming dates in chronological order using pandas dataframe - python

I need help comparing dates across different rows and columns and making sure that they follow chronological order.
First, I group the data by the id and group columns. Within each group, every date is supposed to occur after the one before it.
The first group [1111 + A] contains an error because the dates don't follow chronological order:
1/1/2016 > 2/20/2016 > **2/19/2016** > 4/25/2016 > **4/1/2016** > 5/1/2016
Current result
id start end group
0 1111 01/01/2016 02/20/2016 A
1 1111 02/19/2016 04/25/2016 A
2 1111 04/01/2016 05/01/2016 A
3 2345 05/01/2016 05/28/2016 B
4 2345 05/29/2016 06/28/2016 B
5 1234 08/01/2016 09/16/2016 F
6 9882 01/01/2016 08/29/2016 D
7 9992 03/01/2016 03/15/2016 C
8 9992 03/16/2016 08/03/2016 C
9 9992 05/16/2016 09/16/2016 C
10 9992 09/17/2016 10/16/2016 C
11 9992 10/17/2016 12/13/2016 C
The answer should be:
1/1/2016 > 2/20/2016 > **2/21/2016** > 4/25/2016 > **4/26/2016** > 5/1/2016
Desired output
id start end group
0 1111 01/01/2016 02/20/2016 A
1 1111 02/21/2016 04/25/2016 A
2 1111 04/26/2016 05/01/2016 A
3 2345 05/01/2016 05/28/2016 B
4 2345 05/29/2016 06/28/2016 B
5 1234 08/01/2016 09/16/2016 F
6 9882 01/01/2016 08/29/2016 D
7 9992 03/01/2016 03/15/2016 C
8 9992 03/16/2016 08/03/2016 C
9 9992 08/04/2016 09/16/2016 C
10 9992 09/17/2016 10/16/2016 C
11 9992 10/17/2016 12/13/2016 C
Any help will be greatly appreciated.

One way is to apply your logic to each group, then concatenate your groups.
# convert series to datetime
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
# iterate groups and add results to grps list
grps = []
for _, group in df.groupby(['id', 'group'], sort=False):
    end_shift = group['end'].shift()
    group.loc[group['start'] <= end_shift, 'start'] = end_shift + pd.DateOffset(1)
    grps.append(group)
# concatenate dataframes in grps to build a single dataframe
res = pd.concat(grps, ignore_index=True)
print(res)
id start end group
0 1111 2016-01-01 2016-02-20 A
1 1111 2016-02-21 2016-04-25 A
2 1111 2016-04-26 2016-05-01 A
3 2345 2016-05-01 2016-05-28 B
4 2345 2016-05-29 2016-06-28 B
5 1234 2016-08-01 2016-09-16 F
6 9882 2016-01-01 2016-08-29 D
7 9992 2016-03-01 2016-03-15 C
8 9992 2016-03-16 2016-08-03 C
9 9992 2016-08-04 2016-09-16 C
10 9992 2016-09-17 2016-10-16 C
11 9992 2016-10-17 2016-12-13 C
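For completeness, the same idea can be written without an explicit Python loop. A minimal sketch, assuming each (id, group) block is contiguous and already sorted by start as in the example:
# shift the previous end date within each (id, group) block
prev_end = df.groupby(['id', 'group'])['end'].shift()
# where a start does not come after the previous end, push it one day past that end
overlap = df['start'] <= prev_end
df.loc[overlap, 'start'] = prev_end[overlap] + pd.Timedelta(days=1)
This relies on index alignment, so it should give the same result as the loop above for the sample data.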

I believe this should work:
# First make sure your column are datetimes:
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
# Get your new start times:
new_times = (df.groupby(['id', 'group'])
               .apply(lambda x: (x.end + pd.Timedelta(days=1)).shift())
               .reset_index(['id', 'group'], drop=True))
# put back into original dataframe
df.loc[new_times.notnull(), 'start'] = new_times[new_times.notnull()]
>>> df
id start end group
0 1111 2016-01-01 2016-02-20 A
1 1111 2016-02-21 2016-04-25 A
2 1111 2016-04-26 2016-05-01 A
3 2345 2016-05-01 2016-05-28 B
4 2345 2016-05-29 2016-06-28 B
5 1234 2016-08-01 2016-09-16 F
6 9882 2016-01-01 2016-08-29 D
7 9992 2016-03-01 2016-03-15 C
8 9992 2016-03-16 2016-08-03 C
9 9992 2016-08-04 2016-09-16 C
10 9992 2016-09-17 2016-10-16 C
11 9992 2016-10-17 2016-12-13 C
Explanation:
new_times looks like this:
>>> new_times
0 NaT
1 2016-02-21
2 2016-04-26
5 NaT
3 NaT
4 2016-05-29
6 NaT
7 NaT
8 2016-03-16
9 2016-08-04
10 2016-09-17
11 2016-10-17
You can then use df.loc[new_times.notnull(), 'start'] = new_times[new_times.notnull()] to find where new_times is not null (i.e. where it is not the first row in a given group), and insert those new_times into your original start column.
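If you want to verify the result, a small sanity check is possible; a sketch using the same column names as above:
# every start should now come strictly after the previous end in its (id, group) block
prev_end = df.groupby(['id', 'group'])['end'].shift()
ok = (df['start'] > prev_end) | prev_end.isna()
assert ok.all()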

Related

How to keep observations for individuals who showed up for the first time in week t in the data

I have the following data-frame:
ID date X
0 A 2021-12-15 7
1 A 2022-01-30 6
2 A 2022-02-15 2
3 B 2022-01-30 2
4 B 2022-02-15 2
5 B 2022-02-18 7
6 C 2021-12-01 7
7 C 2021-12-15 4
8 C 2022-01-30 2
9 C 2022-02-15 7
10 D 2021-12-16 5
11 D 2022-01-30 4
12 D 2022-03-15 9
I want to keep the observations for those IDs that first showed up in, say, week 50 of the year (I would like to change this parameter in the future).
For example, IDs A and D showed up for the first time in week 50 of the data, B didn't, and C showed up in week 50 but not for the first time.
So in this example I want to keep only the data pertaining to A and D.
Filter rows whose week matches the week variable and which are the first occurrence of their ID in the DataFrame (via Series.duplicated), then take the ID values:
week = 50
df['date'] = pd.to_datetime(df['date'])
s = df.loc[df['date'].dt.isocalendar().week.eq(week) & ~df['ID'].duplicated(), 'ID']
Or:
df1 = df.drop_duplicates(['ID'])
s = df1.loc[df1['date'].dt.isocalendar().week.eq(week), 'ID']
print (s)
0 A
10 D
Name: ID, dtype: object
Finally, filter by ID with Series.isin and boolean indexing:
df = df[df['ID'].isin(s)]
print (df)
ID date X
0 A 2021-12-15 7
1 A 2022-01-30 6
2 A 2022-02-15 2
10 D 2021-12-16 5
11 D 2022-01-30 4
12 D 2022-03-15 9
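Since the week number should be easy to change later, the same logic can be wrapped in a small helper. A sketch, assuming the columns are named ID and date as above:
import pandas as pd

def first_seen_in_week(df, week):
    # keep rows for IDs whose first appearance falls in the given ISO week
    d = df.copy()
    d['date'] = pd.to_datetime(d['date'])
    first_rows = ~d['ID'].duplicated()
    ids = d.loc[d['date'].dt.isocalendar().week.eq(week) & first_rows, 'ID']
    return d[d['ID'].isin(ids)]

print(first_seen_in_week(df, week=50))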

Count days by ID - Pandas

Given the following table, how can I count the days by ID without using for or any other loop? The data is large.
ID Date
a 01/01/2020
a 05/01/2020
a 08/01/2020
a 10/01/2020
b 05/05/2020
b 08/05/2020
b 12/05/2020
c 08/08/2020
c 22/08/2020
to have this result
ID Date Days Evolved Since Initial Date
a 01/01/2020 1
a 05/01/2020 4
a 08/01/2020 7
a 10/01/2020 9
b 05/05/2020 1
b 08/05/2020 3
b 12/05/2020 7
c 08/08/2020 1
c 22/08/2020 14
Use GroupBy.transform with 'first' to broadcast each group's first value into a new column, which can then be subtracted. If there are no duplicated datetimes within a group, you can then replace 0 with 1:
df['new'] = df['Date'].sub(df.groupby("ID")['Date'].transform('first')).dt.days.replace(0, 1)
print (df)
ID Date new
0 a 2020-01-01 1
1 a 2020-01-05 4
2 a 2020-01-08 7
3 a 2020-01-10 9
4 b 2020-05-05 1
5 b 2020-05-08 3
6 b 2020-05-12 7
7 c 2020-08-08 1
8 c 2020-08-22 14
Or set 1 for the first value of each group with Series.where and Series.duplicated:
df['new'] = (df['Date'].sub(df.groupby("ID")['Date'].transform('first'))
                       .dt.days.where(df['ID'].duplicated(), 1))
print (df)
ID Date new
0 a 2020-01-01 1
1 a 2020-01-05 4
2 a 2020-01-08 7
3 a 2020-01-10 9
4 b 2020-05-05 1
5 b 2020-05-08 3
6 b 2020-05-12 7
7 c 2020-08-08 1
8 c 2020-08-22 14
You could do something like this (with df being your dataframe):
def days_evolved(sdf):
    sdf["Days_evolved"] = sdf.Date - sdf.Date.iat[0]
    sdf["Days_evolved"].iat[0] = pd.Timedelta(days=1)
    return sdf

df = df.groupby("ID", as_index=False, sort=False).apply(days_evolved)
Result for the sample:
ID Date Days_evolved
0 a 2020-01-01 1 days
1 a 2020-01-05 4 days
2 a 2020-01-08 7 days
3 a 2020-01-10 9 days
4 b 2020-05-05 1 days
5 b 2020-05-08 3 days
6 b 2020-05-12 7 days
7 c 2020-08-08 1 days
8 c 2020-08-22 14 days
If you want int instead of pd.Timedelta then do
df["Days_evolved"] = df["Days_evolved"].dt.days
at the end.
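If the rows within each ID are not guaranteed to be sorted by date, a variant that uses the earliest date per ID as the starting point (an assumption about what "initial date" means) would be:
# a sketch: subtract each ID's earliest date; the sample dates are day-first (dd/mm/yyyy)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
days = df['Date'].sub(df.groupby('ID')['Date'].transform('min')).dt.days
df['Days_evolved'] = days.mask(days.eq(0), 1)  # count the initial day as 1, as in the example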

to_datetime assemblage error due to extra keys

My pandas version is 0.23.4.
I tried to run this code:
df['date_time'] = pd.to_datetime(df[['year','month','day','hour_scheduled_departure','minute_scheduled_departure']])
and the following error appeared:
extra keys have been passed to the datetime assemblage: [hour_scheduled_departure, minute_scheduled_departure]
Any ideas of how to get the job done by pd.to_datetime?
@anky_91: The image shows an extract of the first 10 rows. First column [int32]: year; second column [int32]: month; third column [int32]: day; fourth column [object]: hour; fifth column [object]: minute. The object values have length 2.
Another solution:
pd.concat([df.A,
           pd.to_datetime(pd.Series(df[df.columns[1:]].fillna('').values.tolist(), name='Date')
                            .map(lambda x: '0'.join(map(str, x))))], axis=1)
A Date
0 a 2002-07-01 05:07:00
1 b 2002-08-03 03:08:00
2 c 2002-09-05 06:09:00
3 d 2002-04-07 09:04:00
4 e 2002-02-01 02:02:00
5 f 2002-03-05 04:03:00
For the example you added as an image (I skipped the last 3 columns to save time):
df.month = df.month.map("{:02}".format)
df.day = df.day.map("{:02}".format)
pd.concat([df.A,
           pd.to_datetime(pd.Series(df[df.columns[1:]].fillna('').values.tolist(), name='Date')
                            .map(lambda x: ''.join(map(str, x))))], axis=1)
A Date
0 a 2015-01-01 00:05:00
1 b 2015-01-01 00:01:00
2 c 2015-01-01 00:02:00
3 d 2015-01-01 00:02:00
4 e 2015-01-01 00:25:00
5 f 2015-01-01 00:25:00
You can rename the columns so that pandas.to_datetime sees the expected year, month, day, hour, minute names:
df = pd.DataFrame({
    'A': list('abcdef'),
    'year': [2002, 2002, 2002, 2002, 2002, 2002],
    'month': [7, 8, 9, 4, 2, 3],
    'day': [1, 3, 5, 7, 1, 5],
    'hour_scheduled_departure': [5, 3, 6, 9, 2, 4],
    'minute_scheduled_departure': [7, 8, 9, 4, 2, 3]
})
print (df)
A year month day hour_scheduled_departure minute_scheduled_departure
0 a 2002 7 1 5 7
1 b 2002 8 3 3 8
2 c 2002 9 5 6 9
3 d 2002 4 7 9 4
4 e 2002 2 1 2 2
5 f 2002 3 5 4 3
cols = ['year','month','day','hour_scheduled_departure','minute_scheduled_departure']
d = {'hour_scheduled_departure':'hour','minute_scheduled_departure':'minute'}
df['date_time'] = pd.to_datetime(df[cols].rename(columns=d))
#if necessary remove columns
df = df.drop(cols, axis=1)
print (df)
A date_time
0 a 2002-07-01 05:07:00
1 b 2002-08-03 03:08:00
2 c 2002-09-05 06:09:00
3 d 2002-04-07 09:04:00
4 e 2002-02-01 02:02:00
5 f 2002-03-05 04:03:00
Detail:
print (df[cols].rename(columns=d))
year month day hour minute
0 2002 7 1 5 7
1 2002 8 3 3 8
2 2002 9 5 6 9
3 2002 4 7 9 4
4 2002 2 1 2 2
5 2002 3 5 4 3
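As an aside, the question notes that the hour and minute columns are stored as strings (object dtype). Casting them to integers before the assembly removes any ambiguity; a hedged sketch, not required for the constructed example above:
# hypothetical pre-processing step for string-typed hour/minute columns
for c in ['hour_scheduled_departure', 'minute_scheduled_departure']:
    df[c] = df[c].astype(int)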

How to calculate quarterly differences and add missing quarters with counts in Python pandas

I have a data frame like the one below. For each Id I need to find the missing quarters and count how many are missing in each gap, then report those gaps. The data frame is:
year Data Id
2019Q4 57170 A
2019Q3 55150 A
2019Q2 51109 A
2019Q1 51109 A
2018Q1 57170 B
2018Q4 55150 B
2017Q4 51109 C
2017Q2 51109 C
2017Q1 51109 C
Id start-quarter end-quarter count
B 2018Q2 2018Q3 2
C 2017Q3 2017Q3 1
How can I achieve this using Python pandas?
Use:
# changed data for a more general solution - multiple missing years per group
print (df)
year Data Id
0 2015 57170 A
1 2016 55150 A
2 2019 51109 A
3 2023 51109 A
4 2000 47740 B
5 2002 44563 B
6 2003 43643 C
7 2004 42050 C
8 2007 37312 C
import numpy as np

# add missing rows for absent years with reindex
df1 = (df.set_index('year')
         .groupby('Id')['Id']
         .apply(lambda x: x.reindex(np.arange(x.index.min(), x.index.max() + 1)))
         .reset_index(name='val'))
print (df1)
Id year val
0 A 2015 A
1 A 2016 A
2 A 2017 NaN
3 A 2018 NaN
4 A 2019 A
5 A 2020 NaN
6 A 2021 NaN
7 A 2022 NaN
8 A 2023 A
9 B 2000 B
10 B 2001 NaN
11 B 2002 B
12 C 2003 C
13 C 2004 C
14 C 2005 NaN
15 C 2006 NaN
16 C 2007 C
# boolean mask marking the non-NaN rows, stored for reuse
m = df1['val'].notnull().rename('g')
# cumulative sum creates a unique group id for each run of consecutive NaNs
df1.index = m.cumsum()
# keep only the NaN rows and aggregate first, last and count
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
                     .agg(['first', 'last', 'size'])
                     .reset_index(level=1, drop=True)
                     .reset_index())
print (df2)
Id first last size
0 A 2017 2018 2
1 A 2020 2022 3
2 B 2001 2001 1
3 C 2005 2006 2
EDIT:
# convert to datetimes (the year column here holds values in YYYYMM format)
df['year'] = pd.to_datetime(df['year'], format='%Y%m')
# resample by month starts with asfreq
df1 = (df.set_index('year')
         .groupby('Id')['Id']
         .resample('MS').asfreq()
         .rename('val')
         .reset_index())
print (df1)
Id year val
0 A 2015-05-01 A
1 A 2015-06-01 NaN
2 A 2015-07-01 A
3 A 2015-08-01 NaN
4 A 2015-09-01 A
5 B 2000-01-01 B
6 B 2000-02-01 NaN
7 B 2000-03-01 B
8 C 2003-01-01 C
9 C 2003-02-01 C
10 C 2003-03-01 NaN
11 C 2003-04-01 NaN
12 C 2003-05-01 C
m = df1['val'].notnull().rename('g')
# cumulative sum creates a unique group id for each run of consecutive NaNs
df1.index = m.cumsum()
# keep only the NaN rows and aggregate first, last and count
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
                     .agg(['first', 'last', 'size'])
                     .reset_index(level=1, drop=True)
                     .reset_index())
print (df2)
Id first last size
0 A 2015-06-01 2015-06-01 1
1 A 2015-08-01 2015-08-01 1
2 B 2000-02-01 2000-02-01 1
3 C 2003-03-01 2003-04-01 2
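The question's data actually uses quarterly labels rather than plain years. A sketch of the same approach adapted to quarters, assuming the year column holds strings like '2019Q4', using a PeriodIndex so the reindex steps by quarter:
# convert the quarter labels to a quarterly PeriodIndex
df['year'] = pd.PeriodIndex(df['year'], freq='Q')

# reindex each Id over the full range of quarters it spans
df1 = (df.set_index('year')
         .groupby('Id')['Id']
         .apply(lambda x: x.reindex(pd.period_range(x.index.min(), x.index.max(),
                                                    freq='Q', name='year')))
         .reset_index(name='val'))

# then the same run-length logic as above finds the gaps per Id
m = df1['val'].notnull().rename('g')
df1.index = m.cumsum()
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
                     .agg(['first', 'last', 'size'])
                     .reset_index(level=1, drop=True)
                     .reset_index())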

Dataframe has every other column as a timestamp, how to get them into one column?

I have a dataframe that I import from Excel, of n x n size, that looks like the following (sorry, I don't know how to easily reproduce this with code).
How do I get the timestamps into one column, like the following? (I've tried pivot.)
You may need to extract the data in column pairs, rename the columns, add an "A"/"B"/"C" flag column, and concatenate the pieces together. See the test below:
abc_list = [["2017-10-01",0,"2017-10-02",1,"2017-10-03",8],["2017-11-01",3,"2017-11-01",5,"2017-11-05",10],["2017-12-01",0,"2017-12-07",7,"2017-12-07",12]]
df = pd.DataFrame(abc_list,columns=["Time1","A","Time2","B","Time3","C"])
The output:
Time1 A Time2 B Time3 C
0 2017-10-01 0 2017-10-02 1 2017-10-03 8
1 2017-11-01 3 2017-11-01 5 2017-11-05 10
2 2017-12-01 0 2017-12-07 7 2017-12-07 12
Then:
df_a = df.iloc[:, 0:2].rename(columns={'Time1': 'time', 'A': 'value'})
df_a['flag'] = "A"
df_b = df.iloc[:, 2:4].rename(columns={'Time2': 'time', 'B': 'value'})
df_b['flag'] = "B"
df_c = df.iloc[:, 4:].rename(columns={'Time3': 'time', 'C': 'value'})
df_c['flag'] = "C"
df_final = pd.concat([df_a, df_b, df_c])
df_final.reset_index(drop=True)
output:
time value flag
0 2017-10-01 0 A
1 2017-11-01 3 A
2 2017-12-01 0 A
3 2017-10-02 1 B
4 2017-11-01 5 B
5 2017-12-07 7 B
6 2017-10-03 8 C
7 2017-11-05 10 C
8 2017-12-07 12 C
Admittedly, this is a bit of an un-pythonic way to do it.
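A slightly more general sketch of the same slicing idea, assuming the (time, value) pairs simply alternate across the columns as in the example:
# build one (time, value, flag) frame per column pair, then stack them
pairs = [df.iloc[:, i:i + 2]
           .set_axis(['time', 'value'], axis=1)
           .assign(flag=df.columns[i + 1])
         for i in range(0, df.shape[1], 2)]
df_final = pd.concat(pairs, ignore_index=True)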
Here is another way:
columns = pd.MultiIndex.from_tuples(
    [('A', 'Time'), ('A', 'Value'), ('B', 'Time'), ('B', 'Value'), ('C', 'Time'), ('C', 'Value')],
    names=['Group', 'Sub_value'])
df.columns = columns
Output:
Group A B C
Sub_value Time Value Time Value Time Value
0 2017-10-01 0 2017-10-02 1 2017-10-03 8
1 2017-11-01 3 2017-11-01 5 2017-11-05 10
2 2017-12-01 0 2017-12-07 7 2017-12-07 12
Run:
df.stack(level='Group')
Output:
Sub_value Time Value
Group
0 A 2017-10-01 0
B 2017-10-02 1
C 2017-10-03 8
1 A 2017-11-01 3
B 2017-11-01 5
C 2017-11-05 10
2 A 2017-12-01 0
B 2017-12-07 7
C 2017-12-07 12
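If you prefer ordinary columns afterwards, the stacked frame can be flattened back; a small usage sketch:
out = df.stack(level='Group').reset_index(level='Group').reset_index(drop=True)
# out now has the columns Group, Time and Value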
This is one method. It is fairly easy to extend to any number of columns.
import pandas as pd

# read in pairs of columns and assign a 'Category' column
dfs = {j: pd.read_excel('file.xlsx', usecols=[2*i, 2*i + 1], skiprows=[0],
                        header=None, names=['Date', 'Value']).assign(Category=j)
       for i, j in enumerate(['A', 'B', 'C'])}
# concatenate dataframes
df = pd.concat(list(dfs.values()), ignore_index=True)
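A typical follow-up (a sketch, assuming the Date column parses cleanly) is to convert to real datetimes and order the rows:
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['Category', 'Date'], ignore_index=True)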
