to_datetime assemblage error due to extra keys - python

My pandas version is 0.23.4.
I tried to run this code:
df['date_time'] = pd.to_datetime(df[['year','month','day','hour_scheduled_departure','minute_scheduled_departure']])
and the following error appeared:
extra keys have been passed to the datetime assemblage: [hour_scheduled_departure, minute_scheduled_departure]
Any ideas of how to get the job done by pd.to_datetime?
The image shows the first 10 rows. The columns are: year (int32), month (int32), day (int32), hour (object) and minute (object). Each object value is a string of length 2.

Another solution:
>>pd.concat([df.A,pd.to_datetime(pd.Series(df[df.columns[1:]].fillna('').values.tolist(),name='Date').map(lambda x: '0'.join(map(str,x))))],axis=1)
A Date
0 a 2002-07-01 05:07:00
1 b 2002-08-03 03:08:00
2 c 2002-09-05 06:09:00
3 d 2002-04-07 09:04:00
4 e 2002-02-01 02:02:00
5 f 2002-03-05 04:03:00
For the example you added as an image (I have skipped the last 3 columns to save time):
>>df.month=df.month.map("{:02}".format)
>>df.day = df.day.map("{:02}".format)
>>pd.concat([df.A,pd.to_datetime(pd.Series(df[df.columns[1:]].fillna('').values.tolist(),name='Date').map(lambda x: ''.join(map(str,x))))],axis=1)
A Date
0 a 2015-01-01 00:05:00
1 b 2015-01-01 00:01:00
2 c 2015-01-01 00:02:00
3 d 2015-01-01 00:02:00
4 e 2015-01-01 00:25:00
5 f 2015-01-01 00:25:00
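The join-based approach above relies on the number of digits in each value. A minimal sketch of a more explicit variant, assuming hypothetical sample values and hypothetical column names hour and minute, pads every part with Series.str.zfill and parses the result with a fixed format:

```python
import pandas as pd

# Hypothetical sample: integer date parts, two-character string time parts.
df = pd.DataFrame({
    'year': [2015, 2015],
    'month': [1, 11],
    'day': [1, 23],
    'hour': ['00', '13'],
    'minute': ['05', '45'],
})

# Zero-pad every part to a fixed width so the concatenation is unambiguous.
parts = [
    df['year'].astype(str),
    df['month'].astype(str).str.zfill(2),
    df['day'].astype(str).str.zfill(2),
    df['hour'].astype(str).str.zfill(2),
    df['minute'].astype(str).str.zfill(2),
]
stamp = parts[0].str.cat(parts[1:])  # e.g. '201511231345'

# An explicit format avoids any guessing about the layout.
df['Date'] = pd.to_datetime(stamp, format='%Y%m%d%H%M')
print(df)
```

Unlike joining on '0', this does not break when a month, day, hour or minute has two digits.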

You can rename the columns so that pandas.to_datetime sees only the names it can assemble: year, month, day, hour, minute:
df = pd.DataFrame({
'A':list('abcdef'),
'year':[2002,2002,2002,2002,2002,2002],
'month':[7,8,9,4,2,3],
'day':[1,3,5,7,1,5],
'hour_scheduled_departure':[5,3,6,9,2,4],
'minute_scheduled_departure':[7,8,9,4,2,3]
})
print (df)
A year month day hour_scheduled_departure minute_scheduled_departure
0 a 2002 7 1 5 7
1 b 2002 8 3 3 8
2 c 2002 9 5 6 9
3 d 2002 4 7 9 4
4 e 2002 2 1 2 2
5 f 2002 3 5 4 3
cols = ['year','month','day','hour_scheduled_departure','minute_scheduled_departure']
d = {'hour_scheduled_departure':'hour','minute_scheduled_departure':'minute'}
df['date_time'] = pd.to_datetime(df[cols].rename(columns=d))
#if necessary remove columns
df = df.drop(cols, axis=1)
print (df)
A date_time
0 a 2002-07-01 05:07:00
1 b 2002-08-03 03:08:00
2 c 2002-09-05 06:09:00
3 d 2002-04-07 09:04:00
4 e 2002-02-01 02:02:00
5 f 2002-03-05 04:03:00
Detail:
print (df[cols].rename(columns=d))
year month day hour minute
0 2002 7 1 5 7
1 2002 8 3 3 8
2 2002 9 5 6 9
3 2002 4 7 9 4
4 2002 2 1 2 2
5 2002 3 5 4 3
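In the question the hour and minute columns have object dtype (two-character strings). A minimal sketch of the same rename idea for that case, using hypothetical sample values and converting the parts to integers explicitly before assembling:

```python
import pandas as pd

# Hypothetical sample mirroring the dtypes described in the question.
df = pd.DataFrame({
    'year': [2015, 2015, 2015],
    'month': [1, 1, 1],
    'day': [1, 1, 1],
    'hour_scheduled_departure': ['00', '00', '13'],
    'minute_scheduled_departure': ['05', '10', '45'],
})

d = {'hour_scheduled_departure': 'hour',
     'minute_scheduled_departure': 'minute'}
cols = ['year', 'month', 'day'] + list(d)

# Rename to the names to_datetime can assemble and make every part an int.
parts = df[cols].rename(columns=d).astype(int)
df['date_time'] = pd.to_datetime(parts)
print(df['date_time'])
```

The explicit astype(int) is a precaution in this sketch; it makes the conversion of the string columns visible instead of relying on to_datetime to coerce them.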

Related

How to keep observations for individuals who showed up for the first time in week t in the data

I have the following data-frame:
ID date X
0 A 2021-12-15 7
1 A 2022-01-30 6
2 A 2022-02-15 2
3 B 2022-01-30 2
4 B 2022-02-15 2
5 B 2022-02-18 7
6 C 2021-12-01 7
7 C 2021-12-15 4
8 C 2022-01-30 2
9 C 2022-02-15 7
10 D 2021-12-16 5
11 D 2022-01-30 4
12 D 2022-03-15 9
I want to keep the observations for those IDs who first showed up in week, say, 51 of the year (I would like to change this parameter in the future).
For example, IDs A and D showed up first in week 51 in the data, B didn't, C showed up in week 51, but not for the first time.
So I want to keep in this example only the data pertaining to A and D.
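Keep the rows whose ISO week equals the variable week and that are the first row for their ID (found with Series.duplicated), then take the ID values. Note that Series.dt.isocalendar() places 2021-12-15 and 2021-12-16 in ISO week 50, so week is set to 50 here: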
week = 50
df['date'] = pd.to_datetime(df['date'])
s = df.loc[df['date'].dt.isocalendar().week.eq(week) & ~df['ID'].duplicated(), 'ID']
Or:
df1 = df.drop_duplicates(['ID'])
s = df1.loc[df1['date'].dt.isocalendar().week.eq(week) ,'ID']
print (s)
0 A
10 D
Name: ID, dtype: object
Finally, filter the original rows by ID with Series.isin and boolean indexing:
df = df[df['ID'].isin(s)]
print (df)
ID date X
0 A 2021-12-15 7
1 A 2022-01-30 6
2 A 2022-02-15 2
10 D 2021-12-16 5
11 D 2022-01-30 4
12 D 2022-03-15 9
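Since the week should be easy to change later, the steps above can be wrapped in a small function. This is a sketch, not the answer's exact code: the function name keep_first_seen_in_week is hypothetical, and it uses the earliest date per ID so that it does not depend on the row order.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': list('AAABBBCCCCDDD'),
    'date': ['2021-12-15', '2022-01-30', '2022-02-15',
             '2022-01-30', '2022-02-15', '2022-02-18',
             '2021-12-01', '2021-12-15', '2022-01-30', '2022-02-15',
             '2021-12-16', '2022-01-30', '2022-03-15'],
    'X': [7, 6, 2, 2, 2, 7, 7, 4, 2, 7, 5, 4, 9],
})
df['date'] = pd.to_datetime(df['date'])


def keep_first_seen_in_week(frame, week):
    """Keep all rows of the IDs whose earliest date falls in the given ISO week."""
    first_dates = frame.groupby('ID')['date'].min()
    first_weeks = first_dates.dt.isocalendar().week
    ids = first_weeks[first_weeks == week].index
    return frame[frame['ID'].isin(ids)]


kept = keep_first_seen_in_week(df, week=50)
print(kept)
```

The check uses only the week number, so the same week in a different year would also match; comparing the year column of isocalendar() as well would tighten that.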

Count days by ID - Pandas

Given the following table, how can I count the days elapsed per ID?
I want to avoid a for loop or any other loop because the data set is large.
ID Date
a 01/01/2020
a 05/01/2020
a 08/01/2020
a 10/01/2020
b 05/05/2020
b 08/05/2020
b 12/05/2020
c 08/08/2020
c 22/08/2020
to get this result:
ID Date Days Evolved Since Initial Date
a 01/01/2020 1
a 05/01/2020 4
a 08/01/2020 7
a 10/01/2020 9
b 05/05/2020 1
b 08/05/2020 3
b 12/05/2020 7
c 08/08/2020 1
c 22/08/2020 14
Use GroupBy.transform with GroupBy.first to broadcast each group's first date to every row, then subtract it from Date. If dates are not duplicated within an ID, the only zero is the first row of each group, so it can be replaced with 1:
df['new']=df['Date'].sub(df.groupby("ID")['Date'].transform('first')).dt.days.replace(0, 1)
print (df)
ID Date new
0 a 2020-01-01 1
1 a 2020-01-05 4
2 a 2020-01-08 7
3 a 2020-01-10 9
4 b 2020-05-05 1
5 b 2020-05-08 3
6 b 2020-05-12 7
7 c 2020-08-08 1
8 c 2020-08-22 14
Or set 1 for the first row of each group with Series.where and Series.duplicated:
df['new'] = (df['Date'].sub(df.groupby("ID")['Date'].transform('first'))
.dt.days.where(df['ID'].duplicated(), 1))
print (df)
ID Date new
0 a 2020-01-01 1
1 a 2020-01-05 4
2 a 2020-01-08 7
3 a 2020-01-10 9
4 b 2020-05-05 1
5 b 2020-05-08 3
6 b 2020-05-12 7
7 c 2020-08-08 1
8 c 2020-08-22 14
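You could do something like this (df is your dataframe):
def days_evolved(sdf):
    sdf = sdf.copy()  # work on a copy of the group
    sdf["Days_evolved"] = sdf.Date - sdf.Date.iat[0]
    # set the first row of the group to 1 day, avoiding chained assignment
    sdf.iloc[0, sdf.columns.get_loc("Days_evolved")] = pd.Timedelta(days=1)
    return sdf
df = df.groupby("ID", as_index=False, sort=False).apply(days_evolved)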
Result for the sample:
ID Date Days_evolved
0 a 2020-01-01 1 days
1 a 2020-01-05 4 days
2 a 2020-01-08 7 days
3 a 2020-01-10 9 days
4 b 2020-05-05 1 days
5 b 2020-05-08 3 days
6 b 2020-05-12 7 days
7 c 2020-08-08 1 days
8 c 2020-08-22 14 days
If you want int instead of pd.Timedelta then do
df["Days_evolved"] = df["Days_evolved"].dt.days
at the end.
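All of the answers above assume that Date is already a datetime column. The sample dates are written day first, so a minimal end-to-end sketch (assuming the format is dd/mm/yyyy throughout, and using a hypothetical column name days) parses them with an explicit format before subtracting:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': list('aaaabbbcc'),
    'Date': ['01/01/2020', '05/01/2020', '08/01/2020', '10/01/2020',
             '05/05/2020', '08/05/2020', '12/05/2020',
             '08/08/2020', '22/08/2020'],
})

# Day-first strings: an explicit format avoids month/day confusion.
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')

# Earliest date per ID, broadcast back to every row of that ID.
first = df.groupby('ID')['Date'].transform('min')
elapsed = (df['Date'] - first).dt.days

# The question wants 1 instead of 0 on the initial date.
df['days'] = elapsed.where(elapsed > 0, 1)
print(df)
```

Using 'min' instead of 'first' is a design choice in this sketch: it gives the same result on sorted data and still works if the rows are not in date order.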

Pandas groupby datetime columns by periods

I have the following dataframe:
df=pd.DataFrame(np.array([[1,2,3,4,7,9,5],[2,6,5,4,9,8,2],[3,5,3,21,12,6,7],[1,7,8,4,3,4,3]]),
columns=['9:00:00','9:05:00','09:10:00','09:15:00','09:20:00','09:25:00','09:30:00'])
>>> 9:00:00 9:05:00 09:10:00 09:15:00 09:20:00 09:25:00 09:30:00 ....
a 1 2 3 4 7 9 5
b 2 6 5 4 9 8 2
c 3 5 3 21 12 6 7
d 1 7 8 4 3 4 3
I would like to get, for each row (e.g. a, b, c, d, ...), the mean value between specific hours. The hours run from 9 to 15, and I want to group by period, for example to calculate the mean between 09:00:00 and 11:00:00, between 11 and 12, and between 13 and 15 (or any period I choose).
I first tried to convert the column names to datetime format, thinking that would make this easier:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
but then the column names got a fake year, e.g. "1900-01-01 09:00:00"...
Also, the column header type was object, so I felt a bit lost...
My end goal is to calculate new columns holding, for each row, the mean of only those columns that fall inside a defined time period (e.g. 9-11, etc.).
If you need regular periods, e.g. every 2 hours:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
df1 = df.resample('2H', axis=1).mean()
print (df1)
1900-01-01 08:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
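Resampling along the columns can also be written by transposing first, which does not need the axis argument of resample (as far as I know, that argument is deprecated in recent pandas releases). A minimal sketch on the question's data, with the hypothetical name two_hourly for the result:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([[1, 2, 3, 4, 7, 9, 5],
              [2, 6, 5, 4, 9, 8, 2],
              [3, 5, 3, 21, 12, 6, 7],
              [1, 7, 8, 4, 3, 4, 3]]),
    columns=['9:00:00', '9:05:00', '09:10:00', '09:15:00',
             '09:20:00', '09:25:00', '09:30:00'],
)
df.columns = pd.to_datetime(df.columns, format="%H:%M:%S")

# Transpose so the times form the row index, resample in 2-hour bins,
# then transpose back to the original orientation.
two_hourly = df.T.resample(pd.Timedelta(hours=2)).mean().T
print(two_hourly)
```

Passing a Timedelta as the rule sidesteps the spelling of the hourly frequency alias, which differs between pandas versions.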
If you need custom periods, you can use cut:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
bins = ['5:00:00','9:00:00','11:00:00','12:00:00', '23:59:59']
dates = pd.to_datetime(bins,format="%H:%M:%S")
labels = [f'{i}-{j}' for i, j in zip(bins[:-1], bins[1:])]
df.columns = pd.cut(df.columns, bins=dates, labels=labels, right=False)
print (df)
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
Finally, take the mean per column label. The all-NaN columns appear because the column labels are categorical, so unused categories are kept:
df2 = df.mean(level=0, axis=1)
print (df2)
9:00:00-11:00:00 5:00:00-9:00:00 11:00:00-12:00:00 12:00:00-23:59:59
0 4.428571 NaN NaN NaN
1 5.142857 NaN NaN NaN
2 8.142857 NaN NaN NaN
3 4.285714 NaN NaN NaN
To avoid the NaN columns, convert the column names to strings:
df3 = df.rename(columns=str).mean(level=0, axis=1)
print (df3)
9:00:00-11:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
EDIT: The same solution with timedeltas, which fit the HH:MM:SS format:
df.columns = pd.to_timedelta(df.columns)
print (df)
0 days 09:00:00 0 days 09:05:00 0 days 09:10:00 0 days 09:15:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
0 days 09:20:00 0 days 09:25:00 0 days 09:30:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
bins = ['9:00:00','11:00:00','12:00:00']
dates = pd.to_timedelta(bins)
labels = [f'{i}-{j}' for i, j in zip(bins[:-1], bins[1:])]
df.columns = pd.cut(df.columns, bins=dates, labels=labels, right=False)
print (df)
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
# NaN column because no times fall between 11:00:00 and 12:00:00
df2 = df.mean(level=0, axis=1)
print (df2)
9:00:00-11:00:00 11:00:00-12:00:00
0 4.428571 NaN
1 5.142857 NaN
2 8.142857 NaN
3 4.285714 NaN
df3 = df.rename(columns=str).mean(level=0, axis=1)
print (df3)
9:00:00-11:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
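The mean(level=0, axis=1) call used above depends on the level argument of DataFrame.mean, which, as far as I know, is no longer available in pandas 2.0 and later. A sketch of the same custom-period mean that avoids it, binning on seconds since midnight and grouping the transposed frame (the names col_seconds, edge_seconds, keys and period_means are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([[1, 2, 3, 4, 7, 9, 5],
              [2, 6, 5, 4, 9, 8, 2],
              [3, 5, 3, 21, 12, 6, 7],
              [1, 7, 8, 4, 3, 4, 3]]),
    columns=['9:00:00', '9:05:00', '09:10:00', '09:15:00',
             '09:20:00', '09:25:00', '09:30:00'],
)

bins = ['9:00:00', '11:00:00', '12:00:00']
labels = [f'{i}-{j}' for i, j in zip(bins[:-1], bins[1:])]

# Work with seconds since midnight so pd.cut sees plain numbers.
col_seconds = np.asarray(pd.to_timedelta(df.columns).total_seconds())
edge_seconds = np.asarray(pd.to_timedelta(bins).total_seconds())

# One period label per column; right=False means [start, end).
period = pd.cut(col_seconds, bins=edge_seconds, labels=labels, right=False)
keys = np.asarray(period.astype(str))  # plain strings, no unused categories

# Group the transposed frame by period label, average, transpose back.
period_means = df.T.groupby(keys).mean().T
print(period_means)
```

Converting the labels to strings plays the same role as rename(columns=str) in the answer: periods with no columns simply do not appear in the result.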
I am going to show you my code and the results after execution.
First import the libraries and create the dataframe:
import numpy as np
import pandas as pd
df=pd.DataFrame(np.array([[1,2,3,4,7,9,5],[2,6,5,4,9,8,2],[3,5,3,21,12,6,7],
[1,7,8,4,3,4,3]]),
columns=
['9:00:00','9:05:00','09:10:00','09:15:00','09:20:00','09:25:00','09:30:00'])
It would be nice to create a class that defines what a period is:
class Period():
    def __init__(self, initial, end):
        self.initial = initial
        self.end = end
    def __repr__(self):
        return self.initial + ' -- ' + self.end
With .loc we can get a sub-dataframe with the desired columns:
def get_colMean(df, period):
    # select the columns from period.initial to period.end (both inclusive)
    df2 = df.loc[:, period.initial:period.end]
    # row-wise mean of the selected columns only
    array_mean = df2.mean(axis=1).values
    col_name = 'mean_' + period.initial + '--' + period.end
    pd_colMean = pd.DataFrame(array_mean, index=df.index, columns=[col_name])
    return pd_colMean
Finally we use .join to add the column of means to the original dataframe:
def join_colMean(df, period):
    pd_colMean = get_colMean(df, period)
    df = df.join(pd_colMean)
    return df
I am going to show you my results:

How to calculate Quarterly difference and add missing Quarterly with count in python pandas

I have a data frame like the first table below. For each Id, I need to find the missing quarters and count how many quarters are missing in each gap.
The input data frame and the expected result are:
year Data Id
2019Q4 57170 A
2019Q3 55150 A
2019Q2 51109 A
2019Q1 51109 A
2018Q1 57170 B
2018Q4 55150 B
2017Q4 51109 C
2017Q2 51109 C
2017Q1 51109 C
Id Start year end-year count
B 2018Q2 2018Q3 2
B 2017Q3 2018Q3 1
How can I achieve this using Python pandas?
Use:
#changed data for more general solution - multiple missing years per groups
print (df)
year Data Id
0 2015 57170 A
1 2016 55150 A
2 2019 51109 A
3 2023 51109 A
4 2000 47740 B
5 2002 44563 B
6 2003 43643 C
7 2004 42050 C
8 2007 37312 C
# add rows for the missing years with reindex
df1 = (df.set_index('year')
.groupby('Id')['Id']
.apply(lambda x: x.reindex(np.arange(x.index.min(), x.index.max() + 1)))
.reset_index(name='val'))
print (df1)
Id year val
0 A 2015 A
1 A 2016 A
2 A 2017 NaN
3 A 2018 NaN
4 A 2019 A
5 A 2020 NaN
6 A 2021 NaN
7 A 2022 NaN
8 A 2023 A
9 B 2000 B
10 B 2001 NaN
11 B 2002 B
12 C 2003 C
13 C 2004 C
14 C 2005 NaN
15 C 2006 NaN
16 C 2007 C
# boolean mask of non-NaN rows, kept in a variable for reuse
m = df1['val'].notnull().rename('g')
# use the cumulative sum of the mask as index, so each run of consecutive NaNs shares one group id
df1.index = m.cumsum()
# keep only the NaN rows and aggregate first, last and count
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
.agg(['first','last','size'])
.reset_index(level=1, drop=True)
.reset_index())
print (df2)
Id first last size
0 A 2017 2018 2
1 A 2020 2022 3
2 B 2001 2001 1
3 C 2005 2006 2
EDIT:
#convert to datetimes
df['year'] = pd.to_datetime(df['year'], format='%Y%m')
#resample by start of months with asfreq
df1 = df.set_index('year').groupby('Id')['Id'].resample('MS').asfreq().rename('val').reset_index()
print (df1)
Id year val
0 A 2015-05-01 A
1 A 2015-06-01 NaN
2 A 2015-07-01 A
3 A 2015-08-01 NaN
4 A 2015-09-01 A
5 B 2000-01-01 B
6 B 2000-02-01 NaN
7 B 2000-03-01 B
8 C 2003-01-01 C
9 C 2003-02-01 C
10 C 2003-03-01 NaN
11 C 2003-04-01 NaN
12 C 2003-05-01 C
m = df1['val'].notnull().rename('g')
#create index by cumulative sum for unique groups for consecutive NaNs
df1.index = m.cumsum()
#filter only NaNs row and aggregate first, last and count.
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
.agg(['first','last','size'])
.reset_index(level=1, drop=True)
.reset_index())
print (df2)
Id first last size
0 A 2015-06-01 2015-06-01 1
1 A 2015-08-01 2015-08-01 1
2 B 2000-02-01 2000-02-01 1
3 C 2003-03-01 2003-04-01 2
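The answer demonstrates the idea on yearly and monthly data. For the quarterly labels in the question, a minimal sketch with pandas periods could look like the following. It is a different, simpler technique than the reindex and cumsum approach above: it walks consecutive quarters per Id in a plain loop, and the function name missing_quarter_runs and the output column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'year': ['2019Q4', '2019Q3', '2019Q2', '2019Q1', '2018Q1', '2018Q4',
             '2017Q4', '2017Q2', '2017Q1'],
    'Data': [57170, 55150, 51109, 51109, 57170, 55150, 51109, 51109, 51109],
    'Id': list('AAAABBCCC'),
})


def missing_quarter_runs(frame):
    """Return one row per run of missing quarters between observed quarters."""
    rows = []
    for key, grp in frame.groupby('Id'):
        present = pd.PeriodIndex(grp['year'].tolist(), freq='Q')
        present = present.sort_values().unique()
        # Compare each observed quarter with the next one.
        for prev, nxt in zip(present[:-1], present[1:]):
            gap = nxt.ordinal - prev.ordinal - 1  # quarters skipped
            if gap > 0:
                rows.append({'Id': key,
                             'start': str(prev + 1),
                             'end': str(nxt - 1),
                             'count': gap})
    return pd.DataFrame(rows, columns=['Id', 'start', 'end', 'count'])


gaps = missing_quarter_runs(df)
print(gaps)
```

The loop runs once per pair of consecutive observed quarters, so for very large data the vectorized reindex approach from the answer is likely the better fit.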

How to group time series data by Monday, Tuesday .. ? pandas

I have a time series pandas DataFrame that looks like this:
value
12-01-2014 1
13-01-2014 2
....
01-05-2014 5
I want to group the rows by
1 day of the week (Monday, Tuesday, ..., Saturday, Sunday)
2 type of day (Workday, Weekend)
How can I achieve that in pandas?
Make sure your dates column is a datetime object and use the datetime attributes:
df = pd.DataFrame({'dates':['1/1/15','1/2/15','1/3/15','1/4/15','1/5/15','1/6/15',
'1/7/15','1/8/15','1/9/15','1/10/15','1/11/15','1/12/15'],
'values':[1,2,3,4,5,1,2,3,1,2,3,4]})
df['dates'] = pd.to_datetime(df['dates'])
df['dayofweek'] = df['dates'].apply(lambda x: x.dayofweek)
dates values dayofweek
0 2015-01-01 1 3
1 2015-01-02 2 4
2 2015-01-03 3 5
3 2015-01-04 4 6
4 2015-01-05 5 0
5 2015-01-06 1 1
6 2015-01-07 2 2
7 2015-01-08 3 3
8 2015-01-09 1 4
9 2015-01-10 2 5
10 2015-01-11 3 6
11 2015-01-12 4 0
df.groupby(df['dates'].apply(lambda x: x.dayofweek)).sum()
df.groupby(df['dates'].apply(lambda x: 0 if x.dayofweek in [5,6] else 1)).sum()
Output:
In [1]: df.groupby(df['dates'].apply(lambda x: x.dayofweek)).sum()
Out[1]:
values
dates
0 9
1 1
2 2
3 4
4 3
5 5
6 7
In [2]: df.groupby(df['dates'].apply(lambda x: 0 if x.dayofweek in [5,6] else 1)).sum()
Out[2]:
values
dates
0 12
1 19
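The same grouping can be written with the .dt accessor, which also gives readable group labels. A minimal sketch on the answer's sample data, assuming the dates are month/day/two-digit year (the names by_day, kind and by_kind are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'dates': ['1/1/15', '1/2/15', '1/3/15', '1/4/15', '1/5/15', '1/6/15',
              '1/7/15', '1/8/15', '1/9/15', '1/10/15', '1/11/15', '1/12/15'],
    'values': [1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 4],
})
df['dates'] = pd.to_datetime(df['dates'], format='%m/%d/%y')

# 1) Group by weekday name (Monday, Tuesday, ...).
by_day = df.groupby(df['dates'].dt.day_name())['values'].sum()

# 2) Group by workday vs weekend; dayofweek is 0 for Monday, 6 for Sunday.
kind = (df['dates'].dt.dayofweek >= 5).map({True: 'Weekend', False: 'Workday'})
by_kind = df.groupby(kind)['values'].sum()

print(by_day)
print(by_kind)
```

Selecting the values column before summing keeps the datetime column out of the aggregation.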
