How to group time series data by Monday, Tuesday .. ? pandas - python

I have a time series pandas DataFrame that looks like this:
value
12-01-2014 1
13-01-2014 2
....
01-05-2014 5
I want to group them into
1 (Monday, Tuesday, ..., Saturday, Sunday)
2 (Workday, Weekend)
How could I achieve that in pandas ?

Make sure your dates column is a datetime object and use the datetime attributes:
df = pd.DataFrame({'dates':['1/1/15','1/2/15','1/3/15','1/4/15','1/5/15','1/6/15',
'1/7/15','1/8/15','1/9/15','1/10/15','1/11/15','1/12/15'],
'values':[1,2,3,4,5,1,2,3,1,2,3,4]})
df['dates'] = pd.to_datetime(df['dates'])
df['dayofweek'] = df['dates'].apply(lambda x: x.dayofweek)
dates values dayofweek
0 2015-01-01 1 3
1 2015-01-02 2 4
2 2015-01-03 3 5
3 2015-01-04 4 6
4 2015-01-05 5 0
5 2015-01-06 1 1
6 2015-01-07 2 2
7 2015-01-08 3 3
8 2015-01-09 1 4
9 2015-01-10 2 5
10 2015-01-11 3 6
11 2015-01-12 4 0
df.groupby(df['dates'].apply(lambda x: x.dayofweek)).sum()
df.groupby(df['dates'].apply(lambda x: 0 if x.dayofweek in [5,6] else 1)).sum()
Output:
In [1]: df.groupby(df['dates'].apply(lambda x: x.dayofweek)).sum()
Out[1]:
values
dates
0 9
1 1
2 2
3 4
4 3
5 5
6 7
In [2]: df.groupby(df['dates'].apply(lambda x: 0 if x.dayofweek in [5,6] else 1)).sum()
Out[2]:
values
dates
0 12
1 19
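A minimal sketch of the same grouping using the `.dt` accessor instead of `apply`, on the same sample data as above. Selecting the `values` column explicitly keeps the datetime column out of the sum. The `Workday`/`Weekend` labels and the variable names are my own choices for illustration, and the day names assume the default English locale.

```python
import pandas as pd

df = pd.DataFrame({
    "dates": pd.to_datetime(
        ["1/1/15", "1/2/15", "1/3/15", "1/4/15", "1/5/15", "1/6/15",
         "1/7/15", "1/8/15", "1/9/15", "1/10/15", "1/11/15", "1/12/15"],
        format="%m/%d/%y",
    ),
    "values": [1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 4],
})

# Group by weekday name; select the numeric column explicitly.
by_day = df.groupby(df["dates"].dt.day_name())["values"].sum()

# dayofweek is 0 (Monday) .. 6 (Sunday), so 5 and 6 are the weekend.
is_weekend = df["dates"].dt.dayofweek >= 5
kind = is_weekend.map({True: "Weekend", False: "Workday"})
by_kind = df.groupby(kind)["values"].sum()

print(by_day)
print(by_kind)
```

The totals should agree with the outputs shown above (9 for Monday, 12 for the weekend, 19 for workdays).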

Related

convert month of dates into sequence

I want to turn the months of the dates, across years, into a running sequence. For example, I have a dataframe like this:
stuff_id date
1 2015-02-03
2 2015-03-03
3 2015-05-19
4 2015-10-13
5 2016-01-07
6 2016-03-20
I want to number the months of the dates sequentially. The desired output is:
stuff_id date month
1 2015-02-03 1
2 2015-03-03 2
3 2015-05-19 4
4 2015-10-13 9
5 2016-01-07 12
6 2016-03-20 14
That is, Feb 2015 is the first month in the date list, and Jan 2016 is the 12th month counting from Feb 2015.
If your date column is a datetime (if it's not, cast it to one), you can use the .dt.month and .dt.year properties for this!
https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.month.html
Recast the column to datetime
(text copied as in the answer to "Pasting data into a pandas dataframe")
>>> df = pd.read_table(io.StringIO(s), sep=r"\s+") # s holds the text copied from SO
>>> df["date"] = pd.to_datetime(df["date"])
>>> df
stuff_id date
0 1 2015-02-03
1 2 2015-03-03
2 3 2015-05-19
3 4 2015-10-13
4 5 2016-01-07
5 6 2016-03-20
>>> df.dtypes
stuff_id int64
date datetime64[ns]
dtype: object
Convert year and month to an absolute month count, then make it relative to the earliest month
>>> months = df["date"].dt.year * 12 + df["date"].dt.month # series
>>> df["months"] = months - min(months) + 1
>>> df
stuff_id date months
0 1 2015-02-03 1
1 2 2015-03-03 2
2 3 2015-05-19 4
3 4 2015-10-13 9
4 5 2016-01-07 12
5 6 2016-03-20 14
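For anyone who wants to run this without the copied text `s`, here is a self-contained sketch of the same idea. It builds the frame by hand from the question's data; the column name `month` follows the desired output in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "stuff_id": [1, 2, 3, 4, 5, 6],
    "date": pd.to_datetime(["2015-02-03", "2015-03-03", "2015-05-19",
                            "2015-10-13", "2016-01-07", "2016-03-20"]),
})

# Absolute month count (year * 12 + month), shifted so the earliest month is 1.
absolute = df["date"].dt.year * 12 + df["date"].dt.month
df["month"] = absolute - absolute.min() + 1
print(df)
```

Because the count is relative to the minimum, the rows do not need to be sorted by date for this to work.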

Pandas groupby datetime columns by periods

I have the following dataframe:
df=pd.DataFrame(np.array([[1,2,3,4,7,9,5],[2,6,5,4,9,8,2],[3,5,3,21,12,6,7],[1,7,8,4,3,4,3]]),
columns=['9:00:00','9:05:00','09:10:00','09:15:00','09:20:00','09:25:00','09:30:00'])
>>> 9:00:00 9:05:00 09:10:00 09:15:00 09:20:00 09:25:00 09:30:00 ....
a 1 2 3 4 7 9 5
b 2 6 5 4 9 8 2
c 3 5 3 21 12 6 7
d 1 7 8 4 3 4 3
I would like to get, for each row (e.g. a, b, c, d, ...), the mean value between specific hours. The hours run from 9 to 15, and I want to group by period, for example to calculate the mean value between 09:00:00 and 11:00:00, between 11 and 12, between 13 and 15 (or any period I choose).
I first tried to convert the column names to datetime format, thinking that would make this easier:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
but then the column names got a fake date: "1900-01-01 09:00:00"...
Also, the column header type was object, so I felt a bit lost...
My end goal is to calculate new columns holding, for each row, the mean of only those columns that fall inside the defined time period (e.g. 9-11, etc.).
If you need regular periods, e.g. every 2 hours:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
df1 = df.resample('2H', axis=1).mean()
print (df1)
1900-01-01 08:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
If you need custom periods, you can use cut:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
bins = ['5:00:00','9:00:00','11:00:00','12:00:00', '23:59:59']
dates = pd.to_datetime(bins,format="%H:%M:%S")
labels = [f'{i}-{j}' for i, j in zip(bins[:-1], bins[1:])]
df.columns = pd.cut(df.columns, bins=dates, labels=labels, right=False)
print (df)
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
Finally, take the mean per column label. The NaN columns appear because the columns are categorical, so the unused categories are kept:
df2 = df.mean(level=0, axis=1)
print (df2)
9:00:00-11:00:00 5:00:00-9:00:00 11:00:00-12:00:00 12:00:00-23:59:59
0 4.428571 NaN NaN NaN
1 5.142857 NaN NaN NaN
2 8.142857 NaN NaN NaN
3 4.285714 NaN NaN NaN
To avoid the NaN columns, convert the column names to strings:
df3 = df.rename(columns=str).mean(level=0, axis=1)
print (df3)
9:00:00-11:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
EDIT: The same solution with timedeltas, because the format is HH:MM:SS:
df.columns = pd.to_timedelta(df.columns)
print (df)
0 days 09:00:00 0 days 09:05:00 0 days 09:10:00 0 days 09:15:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
0 days 09:20:00 0 days 09:25:00 0 days 09:30:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
bins = ['9:00:00','11:00:00','12:00:00']
dates = pd.to_timedelta(bins)
labels = [f'{i}-{j}' for i, j in zip(bins[:-1], bins[1:])]
df.columns = pd.cut(df.columns, bins=dates, labels=labels, right=False)
print (df)
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
# missing values because no times fall between 11:00:00 and 12:00:00
df2 = df.mean(level=0, axis=1)
print (df2)
9:00:00-11:00:00 11:00:00-12:00:00
0 4.428571 NaN
1 5.142857 NaN
2 8.142857 NaN
3 4.285714 NaN
df3 = df.rename(columns=str).mean(level=0, axis=1)
print (df3)
9:00:00-11:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
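Note that `DataFrame.mean(level=...)` was removed in pandas 2.0, so the snippets above need an older pandas. A minimal sketch that avoids `level=` altogether is to build a boolean column mask per period from the timedelta column labels. The period boundaries below are hypothetical (chosen so that both periods contain data), and the `a`..`d` index is taken from the question's printed frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([[1, 2, 3, 4, 7, 9, 5], [2, 6, 5, 4, 9, 8, 2],
              [3, 5, 3, 21, 12, 6, 7], [1, 7, 8, 4, 3, 4, 3]]),
    columns=["9:00:00", "9:05:00", "09:10:00", "09:15:00",
             "09:20:00", "09:25:00", "09:30:00"],
    index=list("abcd"),
)

# Column labels as timedeltas, so they can be compared with period bounds.
times = pd.to_timedelta(df.columns)

# Hypothetical period boundaries, treated as half-open intervals [start, end).
periods = [("9:00:00", "9:15:00"), ("9:15:00", "11:00:00")]

means = pd.DataFrame(index=df.index)
for start, end in periods:
    mask = (times >= pd.Timedelta(start)) & (times < pd.Timedelta(end))
    # Row-wise mean over only the columns that fall inside this period.
    means[f"{start}-{end}"] = df.loc[:, mask].mean(axis=1)

print(means)
```

The original column labels are left untouched here, so `means` can be joined back onto `df` if the new columns are wanted next to the raw data.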
I am going to show you my code and the results after execution.
First, import the libraries and build the dataframe:
import numpy as np
import pandas as pd
df=pd.DataFrame(np.array([[1,2,3,4,7,9,5],[2,6,5,4,9,8,2],[3,5,3,21,12,6,7],
[1,7,8,4,3,4,3]]),
columns=
['9:00:00','9:05:00','09:10:00','09:15:00','09:20:00','09:25:00','09:30:00'])
It would be nice to create a class that defines what a period is:
class Period():
    def __init__(self, initial, end):
        self.initial = initial
        self.end = end
    def __repr__(self):
        return self.initial + ' -- ' + self.end
With .loc we can get a sub-dataframe with just the columns we want:
def get_colMean(df, period):
    # label-based slice: both endpoints are included
    df2 = df.loc[:, period.initial:period.end]
    # take the mean of the selected columns only (df2, not the full df)
    array_mean = df2.mean(axis=1).values
    col_name = 'mean_' + period.initial + '--' + period.end
    pd_colMean = pd.DataFrame(array_mean, columns=[col_name])
    return pd_colMean
Finally, we use .join in order to add the column of means to the original dataframe:
def join_colMean(df, period):
    pd_colMean = get_colMean(df, period)
    df = df.join(pd_colMean)
    return df
I am going to show you my results:
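The results themselves are not included here, so as a rough illustration, this is a compact, self-contained sketch of the same `.loc` plus `.join` idea. The helper name `period_mean` and the chosen period are my own, not from the answer. It relies on the start and end labels existing exactly as written in the column names (note the mix of `9:00:00` and `09:10:00`).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([[1, 2, 3, 4, 7, 9, 5], [2, 6, 5, 4, 9, 8, 2],
              [3, 5, 3, 21, 12, 6, 7], [1, 7, 8, 4, 3, 4, 3]]),
    columns=["9:00:00", "9:05:00", "09:10:00", "09:15:00",
             "09:20:00", "09:25:00", "09:30:00"],
)

def period_mean(df, initial, end):
    """Row-wise mean over the columns labelled initial..end (both included)."""
    subset = df.loc[:, initial:end]  # label slices include both endpoints
    return subset.mean(axis=1).rename(f"mean_{initial}--{end}")

# Append the mean of the first three columns as a new column.
result = df.join(period_mean(df, "9:00:00", "09:10:00"))
print(result)
```

Since the slice is positional between two existing labels, this only makes sense when the columns are already in chronological order.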

to_datetime assemblage error due to extra keys

My pandas version is 0.23.4.
I tried to run this code:
df['date_time'] = pd.to_datetime(df[['year','month','day','hour_scheduled_departure','minute_scheduled_departure']])
and the following error appeared:
extra keys have been passed to the datetime assemblage: [hour_scheduled_departure, minute_scheduled_departure]
Any ideas of how to get the job done by pd.to_datetime?
@anky_91
This image shows an extract of the first 10 rows. First column [int32]: year; second column [int32]: month; third column [int32]: day; fourth column [object]: hour; fifth column [object]: minute. The object values have length 2.
Another solution:
>>pd.concat([df.A,pd.to_datetime(pd.Series(df[df.columns[1:]].fillna('').values.tolist(),name='Date').map(lambda x: '0'.join(map(str,x))))],axis=1)
A Date
0 a 2002-07-01 05:07:00
1 b 2002-08-03 03:08:00
2 c 2002-09-05 06:09:00
3 d 2002-04-07 09:04:00
4 e 2002-02-01 02:02:00
5 f 2002-03-05 04:03:00
For the example you added as an image (I have skipped the last 3 columns to save time):
>>df.month=df.month.map("{:02}".format)
>>df.day = df.day.map("{:02}".format)
>>pd.concat([df.A,pd.to_datetime(pd.Series(df[df.columns[1:]].fillna('').values.tolist(),name='Date').map(lambda x: ''.join(map(str,x))))],axis=1)
A Date
0 a 2015-01-01 00:05:00
1 b 2015-01-01 00:01:00
2 c 2015-01-01 00:02:00
3 d 2015-01-01 00:02:00
4 e 2015-01-01 00:25:00
5 f 2015-01-01 00:25:00
You can rename the columns so that pandas.to_datetime can be used with the column names year, month, day, hour, minute:
df = pd.DataFrame({
'A':list('abcdef'),
'year':[2002,2002,2002,2002,2002,2002],
'month':[7,8,9,4,2,3],
'day':[1,3,5,7,1,5],
'hour_scheduled_departure':[5,3,6,9,2,4],
'minute_scheduled_departure':[7,8,9,4,2,3]
})
print (df)
A year month day hour_scheduled_departure minute_scheduled_departure
0 a 2002 7 1 5 7
1 b 2002 8 3 3 8
2 c 2002 9 5 6 9
3 d 2002 4 7 9 4
4 e 2002 2 1 2 2
5 f 2002 3 5 4 3
cols = ['year','month','day','hour_scheduled_departure','minute_scheduled_departure']
d = {'hour_scheduled_departure':'hour','minute_scheduled_departure':'minute'}
df['date_time'] = pd.to_datetime(df[cols].rename(columns=d))
#if necessary remove columns
df = df.drop(cols, axis=1)
print (df)
A date_time
0 a 2002-07-01 05:07:00
1 b 2002-08-03 03:08:00
2 c 2002-09-05 06:09:00
3 d 2002-04-07 09:04:00
4 e 2002-02-01 02:02:00
5 f 2002-03-05 04:03:00
Detail:
print (df[cols].rename(columns=d))
year month day hour minute
0 2002 7 1 5 7
1 2002 8 3 3 8
2 2002 9 5 6 9
3 2002 4 7 9 4
4 2002 2 1 2 2
5 2002 3 5 4 3
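In the question the hour and minute columns are object dtype (two-character strings), while the example above uses integers. A small sketch of the same rename approach for that case, with made-up sample rows, is to cast the parts to integers before assembling. The explicit `astype(int)` is a precaution on my part rather than something the answer states is required.

```python
import pandas as pd

# Made-up rows; hour and minute are two-character strings as in the question.
df = pd.DataFrame({
    "year": [2015, 2015],
    "month": [1, 1],
    "day": [1, 2],
    "hour_scheduled_departure": ["00", "13"],
    "minute_scheduled_departure": ["05", "40"],
})

names = {"hour_scheduled_departure": "hour",
         "minute_scheduled_departure": "minute"}

# Rename to the keys to_datetime recognises and make every part numeric.
parts = (df[["year", "month", "day", *names]]
         .rename(columns=names)
         .astype(int))
df["date_time"] = pd.to_datetime(parts)
print(df)
```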

How to indicate the multi index columns using read_sql_query (pandas dataframes)

I have a table with the following columns:
| Date | ProductId | SubProductId | Value |
I am trying to retrieve the data from that table and to put it in a pandas DataFrame.
I want the DataFrame to have the following structure:
index: dates
columns: products
sub-columns: sub-products
(products) 1 2 ...
(subproducts) 1 2 3 1 2 3 ...
date
2015-01-02 val val val ...
2015-01-03 val val val ...
2015-01-04 ...
2015-01-05
...
I already have dataframes with the products and the subproducts and the dates.
I understand that I need to use the MultiIndex, here is what I tried:
query ="SELECT Date, ProductId, SubProductId, Value " \
" FROM table "\
" WHERE SubProductId in (1,2,3)"\
" AND ProductId in (1,2,3)"\
" AND Date BETWEEN '2015-01-02' AND '2015-01-08' "\
" GROUP BY Date, ProductId, SubProductId, Value "\
" ORDER BY Date, ProductId, SubProductId "
df = pd.read_sql_query(query, conn, index_col=pd.MultiIndex.from_product([df_products['products'].tolist(), df_subproducts['subproducts'].tolist()]))
But it does not work because the query returns a vector of values (shape: number of values x 1), while I need a matrix (shape: number of distinct dates x (number of subproducts * number of products)) in the dataframe.
How can this be achieved:
directly via the read_sql_query call?
or by transforming the dataframe once the database values have been loaded into it?
NB: I am using Microsoft SQL Server.
IIUC you can use unstack() method:
df = pd.read_sql_query(query, conn, index_col=['Date','ProductId','SubProductId']) \
.unstack(['ProductId','SubProductId'])
Demo:
In [413]: df
Out[413]:
Date ProductID SubProductId Value
0 2015-01-02 1 1 11
1 2015-01-02 1 2 12
2 2015-01-02 1 3 13
3 2015-01-02 2 1 14
4 2015-01-02 2 2 15
5 2015-01-02 2 3 16
6 2015-01-03 1 1 17
7 2015-01-03 1 2 18
8 2015-01-03 1 3 19
9 2015-01-03 2 1 20
10 2015-01-03 2 2 21
In [414]: df.set_index(['Date','ProductID','SubProductId']).unstack(['ProductID','SubProductId'])
Out[414]:
Value
ProductID 1 2
SubProductId 1 2 3 1 2 3
Date
2015-01-02 11.0 12.0 13.0 14.0 15.0 16.0
2015-01-03 17.0 18.0 19.0 20.0 21.0 NaN
You can also use pivot_table
df.pivot_table('Value', 'Date', ['ProductId', 'SubProductId'])
demo
df = pd.DataFrame(dict(
Date=pd.date_range('2017-03-31', periods=2).repeat(9),
ProductId=[1, 1, 1, 2, 2, 2, 3, 3, 3] * 2,
SubProductId=list('abc') * 6,
Value=np.random.randint(10, size=18)
))
print(df)
Date ProductId SubProductId Value
0 2017-03-31 1 a 8
1 2017-03-31 1 b 2
2 2017-03-31 1 c 5
3 2017-03-31 2 a 4
4 2017-03-31 2 b 3
5 2017-03-31 2 c 2
6 2017-03-31 3 a 9
7 2017-03-31 3 b 3
8 2017-03-31 3 c 1
9 2017-04-01 1 a 3
10 2017-04-01 1 b 5
11 2017-04-01 1 c 7
12 2017-04-01 2 a 3
13 2017-04-01 2 b 6
14 2017-04-01 2 c 4
15 2017-04-01 3 a 5
16 2017-04-01 3 b 2
17 2017-04-01 3 c 0
df.pivot_table('Value', 'Date', ['ProductId', 'SubProductId'])
ProductId 1 2 3
SubProductId a b c a b c a b c
Date
2017-03-31 8 2 5 4 3 2 9 3 1
2017-04-01 3 5 7 3 6 4 5 2 0
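To try the whole round trip (SQL query, then long to wide) without a SQL Server instance, here is a sketch that uses an in-memory SQLite database as a stand-in. The table name `prices` and the sample rows are invented for illustration; only the column names come from the question.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for the real SQL Server table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prices (Date TEXT, ProductId INTEGER, "
    "SubProductId INTEGER, Value REAL)"
)
rows = [
    ("2015-01-02", 1, 1, 11.0), ("2015-01-02", 1, 2, 12.0),
    ("2015-01-02", 2, 1, 13.0), ("2015-01-02", 2, 2, 14.0),
    ("2015-01-03", 1, 1, 15.0), ("2015-01-03", 1, 2, 16.0),
    ("2015-01-03", 2, 1, 17.0), ("2015-01-03", 2, 2, 18.0),
]
conn.executemany("INSERT INTO prices VALUES (?, ?, ?, ?)", rows)

query = (
    "SELECT Date, ProductId, SubProductId, Value FROM prices "
    "ORDER BY Date, ProductId, SubProductId"
)
long_df = pd.read_sql_query(query, conn, parse_dates=["Date"])

# Long -> wide: dates as rows, (ProductId, SubProductId) as column levels.
wide = (long_df
        .set_index(["Date", "ProductId", "SubProductId"])["Value"]
        .unstack(["ProductId", "SubProductId"]))
print(wide)
```

Selecting the `Value` column before `unstack` gives a two-level column index (product, sub-product) without the extra `Value` level seen in the first demo.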

time interval partitioned by 2 fields in pandas

I have the following data frame:
id datetime interval
0 1 20160101 070000 NaN
1 1 20160101 080000 60
2 1 20160102 070000 NaN
3 1 20160102 073000 30
4 2 20160101 071500 NaN
5 2 20160101 071600 1
I would like to generate the interval column: the minutes between consecutive rows, but only within the same id and the same day, as in the example. In SQL I would partition by id and date and use LAG to get the time interval from the previous row. How can I do it in pandas?
You can convert the datetime column with to_datetime, use groupby with diff, and convert the timedelta to minutes with astype:
print(df)
id datetime interval
0 1 20160101 070000 NaN
1 1 20160101 080000 60
2 1 20160102 070000 NaN
3 1 20160102 073000 30
4 2 20160101 071500 NaN
5 2 20160101 071600 1
df['datetime'] = pd.to_datetime(df['datetime'])
df['new']=df.groupby(['id',df['datetime'].dt.day])['datetime'].diff().astype('timedelta64[m]')
print(df)
id datetime interval new
0 1 2016-01-01 07:00:00 NaN NaN
1 1 2016-01-01 08:00:00 60 60
2 1 2016-01-02 07:00:00 NaN NaN
3 1 2016-01-02 07:30:00 30 30
4 2 2016-01-01 07:15:00 NaN NaN
5 2 2016-01-01 07:16:00 1 1
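Two caveats about the snippet above: as far as I know, pandas 2.0 and later reject `astype('timedelta64[m]')` (only second and finer resolutions are supported), and grouping on `.dt.day` uses the day of the month, so the same day number in different months would land in one group. A minimal sketch that sidesteps both, assuming the `YYYYMMDD HHMMSS` format shown in the question:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2],
    "datetime": ["20160101 070000", "20160101 080000", "20160102 070000",
                 "20160102 073000", "20160101 071500", "20160101 071600"],
})
df["datetime"] = pd.to_datetime(df["datetime"], format="%Y%m%d %H%M%S")

# Partition by id and calendar date, then take the gap to the previous row.
day = df["datetime"].dt.normalize()
gap = df.groupby(["id", day])["datetime"].diff()

# Timedelta -> minutes as float; the first row of each group stays NaN.
df["interval"] = gap.dt.total_seconds() / 60
print(df)
```

This mirrors the SQL `LAG ... PARTITION BY` pattern: `diff` within each group plays the role of subtracting the lagged value.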
