Pandas groupby datetime columns by periods - python

I have the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1,2,3,4,7,9,5],[2,6,5,4,9,8,2],[3,5,3,21,12,6,7],[1,7,8,4,3,4,3]]),
                  columns=['9:00:00','9:05:00','09:10:00','09:15:00','09:20:00','09:25:00','09:30:00'])
>>> 9:00:00 9:05:00 09:10:00 09:15:00 09:20:00 09:25:00 09:30:00 ....
a 1 2 3 4 7 9 5
b 2 6 5 4 9 8 2
c 3 5 3 21 12 6 7
d 1 7 8 4 3 4 3
I would like to get, for each row (e.g. a, b, c, d ...), the mean value between specific hours. The hours are between 9 and 15, and I want to group by period, for example to calculate the mean value between 09:00:00 and 11:00:00, between 11 and 12, or between 13 and 15 (or any period I decide on).
I first tried to convert the column values to datetime format, thinking it would then be easier to work with:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
but then I got the column names with a fake year, "1900-01-01 09:00:00"...
Also, the column headers' type was object, so I felt a bit lost...
My end goal is to be able to calculate new columns with the mean value for each row, using only the columns that fall inside the defined time period (e.g. 9-11, etc.).

If you need a fixed period, e.g. every 2 hours:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
df1 = df.resample('2H', axis=1).mean()
print (df1)
1900-01-01 08:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
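Note: newer pandas versions (2.x) deprecate the axis argument of resample; a sketch of an equivalent, assuming the columns were converted to datetimes as above, is to transpose, resample along the index and transpose back:
# same result as df.resample('2H', axis=1).mean(), without the deprecated axis keyword
df1 = df.T.resample('2H').mean().T
print (df1)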
If you need custom periods, it is possible to use cut:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
bins = ['5:00:00','9:00:00','11:00:00','12:00:00', '23:59:59']
dates = pd.to_datetime(bins,format="%H:%M:%S")
labels = [f'{i}-{j}' for i, j in zip(bins[:-1], bins[1:])]
df.columns = pd.cut(df.columns, bins=dates, labels=labels, right=False)
print (df)
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
And last, take the mean per column label; the reason for the all-NaN columns is that the column labels are categoricals, so every bin appears even when no column falls into it:
df2 = df.mean(level=0, axis=1)
print (df2)
9:00:00-11:00:00 5:00:00-9:00:00 11:00:00-12:00:00 12:00:00-23:59:59
0 4.428571 NaN NaN NaN
1 5.142857 NaN NaN NaN
2 8.142857 NaN NaN NaN
3 4.285714 NaN NaN NaN
To avoid the NaN columns, convert the column names to strings:
df3 = df.rename(columns=str).mean(level=0, axis=1)
print (df3)
9:00:00-11:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
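Note: DataFrame.mean(level=...) used above was later deprecated and removed (pandas 2.0); a sketch of an equivalent for newer versions, assuming the same renamed string labels:
# group the duplicated column labels and take the mean per label
df3 = df.rename(columns=str).T.groupby(level=0).mean().T
print (df3)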
EDIT: The same solution with timedeltas instead of datetimes, because the format is HH:MM:SS:
df.columns = pd.to_timedelta(df.columns)
print (df)
0 days 09:00:00 0 days 09:05:00 0 days 09:10:00 0 days 09:15:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
0 days 09:20:00 0 days 09:25:00 0 days 09:30:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
bins = ['9:00:00','11:00:00','12:00:00']
dates = pd.to_timedelta(bins)
labels = [f'{i}-{j}' for i, j in zip(bins[:-1], bins[1:])]
df.columns = pd.cut(df.columns, bins=dates, labels=labels, right=False)
print (df)
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
#missing values because no columns fall between 11:00:00-12:00:00
df2 = df.mean(level=0, axis=1)
print (df2)
9:00:00-11:00:00 11:00:00-12:00:00
0 4.428571 NaN
1 5.142857 NaN
2 8.142857 NaN
3 4.285714 NaN
df3 = df.rename(columns=str).mean(level=0, axis=1)
print (df3)
9:00:00-11:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
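For the question's end goal (a new column with the row-wise mean over an arbitrary window), a minimal helper sketch of my own, assuming df still has its original HH:MM:SS string column labels:
import pandas as pd

def mean_between(df, start, end):
    # row-wise mean over the columns whose time-of-day label falls in [start, end)
    cols = pd.to_timedelta(df.columns)
    mask = (cols >= pd.Timedelta(start)) & (cols < pd.Timedelta(end))
    return df.loc[:, mask].mean(axis=1)

# e.g. the mean of all columns between 09:00:00 and 11:00:00, per row
df['mean_9-11'] = mean_between(df, '9:00:00', '11:00:00')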

I am going to show you my code and the results after execution.
First, import the libraries and create the dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1,2,3,4,7,9,5],[2,6,5,4,9,8,2],[3,5,3,21,12,6,7],[1,7,8,4,3,4,3]]),
                  columns=['9:00:00','9:05:00','09:10:00','09:15:00','09:20:00','09:25:00','09:30:00'])
It would be nice to create a class in order to define what a period is:
class Period():
    def __init__(self, initial, end):
        self.initial = initial
        self.end = end

    def __repr__(self):
        return self.initial + ' -- ' + self.end
With the .loc command we can get a sub-dataframe with the columns that I desire:
def get_colMean(df, period):
    df2 = df.loc[:, period.initial:period.end]
    array_mean = df2.mean(axis=1).values
    col_name = 'mean_' + period.initial + '--' + period.end
    pd_colMean = pd.DataFrame(array_mean, columns=[col_name])
    return pd_colMean
Finally, we use .join in order to add our column with the means to our original dataframe:
def join_colMean(df, period):
    pd_colMean = get_colMean(df, period)
    df = df.join(pd_colMean)
    return df
I am going to show you my results (shown as images).
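A hedged usage sketch of the helpers above on the question's sample dataframe (the period bounds here are my own illustrative choice, not from the original answer):
period = Period('09:10:00', '09:25:00')   # bounds must match the column labels exactly
print(period)                              # 09:10:00 -- 09:25:00

# add a column with the mean of the columns between 09:10:00 and 09:25:00 (inclusive)
df_out = join_colMean(df, period)
print(df_out)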

Related

creating daily price change for a product on a pandas dataframe

I am working on a data set with the following columns:
order_id
order_item_id
product mrp
units
sale_date
I want to create a new column which shows how much the mrp changed since the last time this product was sold. Is there a way I can do this with a pandas data frame?
Sorry if this question is very basic but I am pretty new to pandas.
Sample data and expected output were attached as images.
For each row of the data I want to compute the amount of the price change since the last time the product was sold.
You can do this as follows:
# define a function that applies a rolling window calculation
# taking the difference between the current value and the previous one
def calc_mrp(ser):
    # in case you want the relative change, just
    # divide by x[1] or x[0] in the lambda function
    return ser.rolling(window=2).apply(lambda x: x[1] - x[0])

# apply this to the grouped 'product_mrp' column
# and store the result in a new column
df['mrp_change'] = df.groupby('product_id')['product_mrp'].apply(calc_mrp)
If this is executed on a dataframe like:
Out[398]:
order_id product_id product_mrp units_sold sale_date
0 0 2 647.169280 8 2019-08-23
1 1 0 500.641188 0 2019-08-24
2 2 1 647.789399 15 2019-08-25
3 3 0 381.278167 12 2019-08-26
4 4 2 373.685000 7 2019-08-27
5 5 4 553.472850 2 2019-08-28
6 6 4 634.482718 7 2019-08-29
7 7 3 536.760482 11 2019-08-30
8 8 0 690.242274 6 2019-08-31
9 9 4 500.515521 0 2019-09-01
It yields:
Out[400]:
order_id product_id product_mrp units_sold sale_date mrp_change
0 0 2 647.169280 8 2019-08-23 NaN
1 1 0 500.641188 0 2019-08-24 NaN
2 2 1 647.789399 15 2019-08-25 NaN
3 3 0 381.278167 12 2019-08-26 -119.363022
4 4 2 373.685000 7 2019-08-27 -273.484280
5 5 4 553.472850 2 2019-08-28 NaN
6 6 4 634.482718 7 2019-08-29 81.009868
7 7 3 536.760482 11 2019-08-30 NaN
8 8 0 690.242274 6 2019-08-31 308.964107
9 9 4 500.515521 0 2019-09-01 -133.967197
The NaNs are in the rows for which there is no previous order with the same product_id.
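A shorter equivalent sketch (my addition, assuming the same column names): groupby followed by diff computes the same per-product change from the previous sale:
# difference from the previous row within each product_id group;
# rows with no previous order for that product stay NaN, as above
df['mrp_change'] = df.groupby('product_id')['product_mrp'].diff()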

to_datetime assemblage error due to extra keys

My pandas version is 0.23.4.
I tried to run this code:
df['date_time'] = pd.to_datetime(df[['year','month','day','hour_scheduled_departure','minute_scheduled_departure']])
and the following error appeared:
extra keys have been passed to the datetime assemblage: [hour_scheduled_departure, minute_scheduled_departure]
Any ideas of how to get the job done by pd.to_datetime?
@anky_91: In this image an extract of the first 10 rows is presented. First column [int32]: year; second column [int32]: month; third column [int32]: day; fourth column [object]: hour; fifth column [object]: minute. The length of the objects is 2.
Another solution:
>>pd.concat([df.A,pd.to_datetime(pd.Series(df[df.columns[1:]].fillna('').values.tolist(),name='Date').map(lambda x: '0'.join(map(str,x))))],axis=1)
A Date
0 a 2002-07-01 05:07:00
1 b 2002-08-03 03:08:00
2 c 2002-09-05 06:09:00
3 d 2002-04-07 09:04:00
4 e 2002-02-01 02:02:00
5 f 2002-03-05 04:03:00
For the example you have added as an image (I have skipped the last 3 columns to save time):
>>df.month=df.month.map("{:02}".format)
>>df.day = df.day.map("{:02}".format)
>>pd.concat([df.A,pd.to_datetime(pd.Series(df[df.columns[1:]].fillna('').values.tolist(),name='Date').map(lambda x: ''.join(map(str,x))))],axis=1)
A Date
0 a 2015-01-01 00:05:00
1 b 2015-01-01 00:01:00
2 c 2015-01-01 00:02:00
3 d 2015-01-01 00:02:00
4 e 2015-01-01 00:25:00
5 f 2015-01-01 00:25:00
You can use rename on the columns, so that it is possible to use pandas.to_datetime with the columns year, month, day, hour, minute:
df = pd.DataFrame({
    'A': list('abcdef'),
    'year': [2002, 2002, 2002, 2002, 2002, 2002],
    'month': [7, 8, 9, 4, 2, 3],
    'day': [1, 3, 5, 7, 1, 5],
    'hour_scheduled_departure': [5, 3, 6, 9, 2, 4],
    'minute_scheduled_departure': [7, 8, 9, 4, 2, 3]
})
print (df)
A year month day hour_scheduled_departure minute_scheduled_departure
0 a 2002 7 1 5 7
1 b 2002 8 3 3 8
2 c 2002 9 5 6 9
3 d 2002 4 7 9 4
4 e 2002 2 1 2 2
5 f 2002 3 5 4 3
cols = ['year','month','day','hour_scheduled_departure','minute_scheduled_departure']
d = {'hour_scheduled_departure':'hour','minute_scheduled_departure':'minute'}
df['date_time'] = pd.to_datetime(df[cols].rename(columns=d))
#if necessary remove columns
df = df.drop(cols, axis=1)
print (df)
A date_time
0 a 2002-07-01 05:07:00
1 b 2002-08-03 03:08:00
2 c 2002-09-05 06:09:00
3 d 2002-04-07 09:04:00
4 e 2002-02-01 02:02:00
5 f 2002-03-05 04:03:00
Detail:
print (df[cols].rename(columns=d))
year month day hour minute
0 2002 7 1 5 7
1 2002 8 3 3 8
2 2002 9 5 6 9
3 2002 4 7 9 4
4 2002 2 1 2 2
5 2002 3 5 4 3
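For reference, a small sketch of why the rename is needed: the DataFrame-assembly path of pd.to_datetime only accepts column names it can map to datetime components (year, month, day, hour, minute, second and a few common variants), and any other name triggers the 'extra keys' error:
import pandas as pd

parts = pd.DataFrame({'year': [2002], 'month': [7], 'day': [1],
                      'hour': [5], 'minute': [7]})
print (pd.to_datetime(parts))       # 0   2002-07-01 05:07:00

bad = parts.rename(columns={'hour': 'hour_scheduled_departure'})
# pd.to_datetime(bad)               # raises: extra keys have been passed to the datetime assemblage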

How to calculate Quarterly difference and add missing Quarterly with count in python pandas

I have a data frame like the one below. For each Id, I have to find the missing quarters and the count of missing quarters between the existing ones. The data frame is:
year Data Id
2019Q4 57170 A
2019Q3 55150 A
2019Q2 51109 A
2019Q1 51109 A
2018Q1 57170 B
2018Q4 55150 B
2017Q4 51109 C
2017Q2 51109 C
2017Q1 51109 C
The expected output is:
Id Start year end-year count
B 2018Q2 2018Q3 2
B 2017Q3 2018Q3 1
How can I achieve this using Python pandas?
Use:
#changed data for a more general solution - multiple missing years per group
print (df)
year Data Id
0 2015 57170 A
1 2016 55150 A
2 2019 51109 A
3 2023 51109 A
4 2000 47740 B
5 2002 44563 B
6 2003 43643 C
7 2004 42050 C
8 2007 37312 C
#add missing rows for the absent years with reindex
df1 = (df.set_index('year')
         .groupby('Id')['Id']
         .apply(lambda x: x.reindex(np.arange(x.index.min(), x.index.max() + 1)))
         .reset_index(name='val'))
print (df1)
Id year val
0 A 2015 A
1 A 2016 A
2 A 2017 NaN
3 A 2018 NaN
4 A 2019 A
5 A 2020 NaN
6 A 2021 NaN
7 A 2022 NaN
8 A 2023 A
9 B 2000 B
10 B 2001 NaN
11 B 2002 B
12 C 2003 C
13 C 2004 C
14 C 2005 NaN
15 C 2006 NaN
16 C 2007 C
#boolean mask marking the non-NaN rows, stored in a variable for reuse
m = df1['val'].notnull().rename('g')
#create an index by cumulative sum to label groups of consecutive NaNs
df1.index = m.cumsum()
#filter only the NaN rows and aggregate first, last and count
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
         .agg(['first','last','size'])
         .reset_index(level=1, drop=True)
         .reset_index())
print (df2)
Id first last size
0 A 2017 2018 2
1 A 2020 2022 3
2 B 2001 2001 1
3 C 2005 2006 2
EDIT: If the year column instead holds year-month values (format %Y%m):
#convert to datetimes
df['year'] = pd.to_datetime(df['year'], format='%Y%m')
#resample by start of months with asfreq
df1 = df.set_index('year').groupby('Id')['Id'].resample('MS').asfreq().rename('val').reset_index()
print (df1)
Id year val
0 A 2015-05-01 A
1 A 2015-06-01 NaN
2 A 2015-07-01 A
3 A 2015-08-01 NaN
4 A 2015-09-01 A
5 B 2000-01-01 B
6 B 2000-02-01 NaN
7 B 2000-03-01 B
8 C 2003-01-01 C
9 C 2003-02-01 C
10 C 2003-03-01 NaN
11 C 2003-04-01 NaN
12 C 2003-05-01 C
m = df1['val'].notnull().rename('g')
#create an index by cumulative sum to label groups of consecutive NaNs
df1.index = m.cumsum()
#filter only the NaN rows and aggregate first, last and count
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
         .agg(['first','last','size'])
         .reset_index(level=1, drop=True)
         .reset_index())
print (df2)
Id first last size
0 A 2015-06-01 2015-06-01 1
1 A 2015-08-01 2015-08-01 1
2 B 2000-02-01 2000-02-01 1
3 C 2003-03-01 2003-04-01 2
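Since the question's data is actually quarterly (e.g. 2019Q4), here is a hedged adaptation of the same idea using a quarterly PeriodIndex; this is my own sketch built on the question's sample data:
import pandas as pd

df = pd.DataFrame({
    'year': ['2019Q4','2019Q3','2019Q2','2019Q1','2018Q1','2018Q4','2017Q4','2017Q2','2017Q1'],
    'Data': [57170,55150,51109,51109,57170,55150,51109,51109,51109],
    'Id':   ['A','A','A','A','B','B','C','C','C']
})

# parse the quarter strings into a quarterly period dtype
df['year'] = pd.PeriodIndex(df['year'], freq='Q')

# reindex every Id to its full quarterly range so missing quarters become NaN
df1 = (df.set_index('year')
         .groupby('Id')['Id']
         .apply(lambda x: x.reindex(pd.period_range(x.index.min(), x.index.max(),
                                                    freq='Q', name='year')))
         .rename('val')
         .reset_index())

# group consecutive missing quarters and aggregate start, end and count, as above
m = df1['val'].notnull().rename('g')
df1.index = m.cumsum()
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
         .agg(['first', 'last', 'size'])
         .reset_index(level=1, drop=True)
         .reset_index())
print (df2)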

How to indicate the multi index columns using read_sql_query (pandas dataframes)

I have a table with the following columns:
| Date | ProductId | SubProductId | Value |
I am trying to retrieve the data from that table and to put it in a pandas DataFrame.
I want the DataFrame to have the following structure:
index: dates
columns: products
sub-columns: sub-products
(products) 1 2 ...
(subproducts) 1 2 3 1 2 3 ...
date
2015-01-02 val val val ...
2015-01-03 val val val ...
2015-01-04 ...
2015-01-05
...
I already have dataframes with the products and the subproducts and the dates.
I understand that I need to use the MultiIndex, here is what I tried:
query ="SELECT Date, ProductId, SubProductId, Value " \
" FROM table "\
" WHERE SubProductId in (1,2,3)"\
" AND ProductId in (1,2,3)"\
" AND Date BETWEEN '2015-01-02' AND '2015-01-08' "\
" GROUP BY Date, ProductId, SubProductId, Value "\
" ORDER BY Date, ProductId, SubProductId "
df = pd.read_sql_query(query, conn, index_col=pd.MultiIndex.from_product([df_products['products'].tolist(), df_subproducts['subproducts'].tolist()])
But it does not work, because the query returns a vector of "Value" (shape: number of values x 1), while I need a matrix (shape: number of distinct dates x (number of subproducts * number of products)) in the dataframe.
How can it be achieved:
directly via the read_sql_query?
or by "transforming" the dataframe once the database values are inserted?
NB: I am using Microsoft SQL Server.
IIUC you can use the unstack() method:
df = pd.read_sql_query(query, conn, index_col=['Date','ProductID','SubProductId']) \
       .unstack(['ProductID','SubProductId'])
Demo:
In [413]: df
Out[413]:
Date ProductID SubProductId Value
0 2015-01-02 1 1 11
1 2015-01-02 1 2 12
2 2015-01-02 1 3 13
3 2015-01-02 2 1 14
4 2015-01-02 2 2 15
5 2015-01-02 2 3 16
6 2015-01-03 1 1 17
7 2015-01-03 1 2 18
8 2015-01-03 1 3 19
9 2015-01-03 2 1 20
10 2015-01-03 2 2 21
In [414]: df.set_index(['Date','ProductID','SubProductId']).unstack(['ProductID','SubProductId'])
Out[414]:
Value
ProductID 1 2
SubProductId 1 2 3 1 2 3
Date
2015-01-02 11.0 12.0 13.0 14.0 15.0 16.0
2015-01-03 17.0 18.0 19.0 20.0 21.0 NaN
You can also use pivot_table
df.pivot_table('Value', 'Date', ['ProductId', 'SubProductId'])
demo
df = pd.DataFrame(dict(
    Date=pd.date_range('2017-03-31', periods=2).repeat(9),
    ProductId=[1, 1, 1, 2, 2, 2, 3, 3, 3] * 2,
    SubProductId=list('abc') * 6,
    Value=np.random.randint(10, size=18)
))
print(df)
Date ProductId SubProductId Value
0 2017-03-31 1 a 8
1 2017-03-31 1 b 2
2 2017-03-31 1 c 5
3 2017-03-31 2 a 4
4 2017-03-31 2 b 3
5 2017-03-31 2 c 2
6 2017-03-31 3 a 9
7 2017-03-31 3 b 3
8 2017-03-31 3 c 1
9 2017-04-01 1 a 3
10 2017-04-01 1 b 5
11 2017-04-01 1 c 7
12 2017-04-01 2 a 3
13 2017-04-01 2 b 6
14 2017-04-01 2 c 4
15 2017-04-01 3 a 5
16 2017-04-01 3 b 2
17 2017-04-01 3 c 0
df.pivot_table('Value', 'Date', ['ProductId', 'SubProductId'])
ProductId 1 2 3
SubProductId a b c a b c a b c
Date
2017-03-31 8 2 5 4 3 2 9 3 1
2017-04-01 3 5 7 3 6 4 5 2 0
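Another option, assuming pandas >= 1.1 (where DataFrame.pivot accepts a list of columns), is pivot, which, unlike pivot_table, raises on duplicate (Date, ProductId, SubProductId) combinations instead of silently aggregating them:
# pivot keeps the values as-is; pivot_table would aggregate duplicates with mean
out = (df.pivot(index='Date', columns=['ProductId', 'SubProductId'], values='Value')
         .sort_index(axis=1))
print(out)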

How to group time series data by Monday, Tuesday .. ? pandas

I have a time series pandas DataFrame that looks like
value
12-01-2014 1
13-01-2014 2
....
01-05-2014 5
I want to group them into
1 (Monday, Tuesday, ..., Saturday, Sunday)
2 (Workday, Weekend)
How could I achieve that in pandas?
Make sure your dates column is a datetime object and use the datetime attributes:
df = pd.DataFrame({'dates': ['1/1/15','1/2/15','1/3/15','1/4/15','1/5/15','1/6/15',
                             '1/7/15','1/8/15','1/9/15','1/10/15','1/11/15','1/12/15'],
                   'values': [1,2,3,4,5,1,2,3,1,2,3,4]})
df['dates'] = pd.to_datetime(df['dates'])
df['dayofweek'] = df['dates'].apply(lambda x: x.dayofweek)
dates values dayofweek
0 2015-01-01 1 3
1 2015-01-02 2 4
2 2015-01-03 3 5
3 2015-01-04 4 6
4 2015-01-05 5 0
5 2015-01-06 1 1
6 2015-01-07 2 2
7 2015-01-08 3 3
8 2015-01-09 1 4
9 2015-01-10 2 5
10 2015-01-11 3 6
11 2015-01-12 4 0
df.groupby(df['dates'].apply(lambda x: x.dayofweek)).sum()
df.groupby(df['dates'].apply(lambda x: 0 if x.dayofweek in [5,6] else 1)).sum()
Output:
In [1]: df.groupby(df['dates'].apply(lambda x: x.dayofweek)).sum()
Out[1]:
values
dates
0 9
1 1
2 2
3 4
4 3
5 5
6 7
In [2]: df.groupby(df['dates'].apply(lambda x: 0 if x.dayofweek in [5,6] else 1)).sum()
Out[2]:
values
dates
0 12
1 19
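A vectorized variant of the same two groupings using the .dt accessor (a sketch, assuming a reasonably recent pandas where Series.dt.day_name exists), which avoids the row-wise apply:
df['dates'] = pd.to_datetime(df['dates'])

# 1) group by weekday name (Monday, Tuesday, ...)
print(df.groupby(df['dates'].dt.day_name())['values'].sum())

# 2) group by workday vs. weekend (dayofweek 5 and 6 are Saturday and Sunday)
is_weekend = df['dates'].dt.dayofweek >= 5
print(df.groupby(is_weekend.map({True: 'Weekend', False: 'Workday'}))['values'].sum())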
