Group periodic data in pandas dataframe - python

I have a pandas dataframe that looks like this:
idx
A
B
01/01/01 00:00:01
5
2
01/01/01 00:00:02
4
5
01/01/01 00:00:03
5
4
02/01/01 00:00:01
3
8
02/01/01 00:00:02
7
4
02/01/01 00:00:03
1
3
I would like to group data based on its periodicity such that the final dataframe is:
new_idx
01/01/01
02/01/01
old_column
00:00:01
5
3
A
00:00:02
4
7
A
00:00:03
5
1
A
00:00:01
2
8
B
00:00:02
5
4
B
00:00:03
4
3
B
Is there a way to this that holds when the first dataframe gets big (more columns, more periods and more samples)?

One way is to melt the DataFrame, then split the datetime to dates and times; finally pivot the resulting DataFrame for the final output:
df = df.melt('idx', var_name='old_column')
df[['date','new_idx']] = df['idx'].str.split(expand=True)
out = df.pivot(['new_idx','old_column'], 'date', 'value').reset_index().rename_axis(columns=[None]).sort_values(by='old_column')
Output
new_idx old_column 01/01/01 02/01/01
0 00:00:01 A 5 3
2 00:00:02 A 4 7
4 00:00:03 A 5 1
1 00:00:01 B 2 8
3 00:00:02 B 5 4
5 00:00:03 B 4 3

Related

to_datetime assemblage error due to extra keys

My pandas version is 0.23.4.
I tried to run this code:
df['date_time'] = pd.to_datetime(df[['year','month','day','hour_scheduled_departure','minute_scheduled_departure']])
and the following error appeared:
extra keys have been passed to the datetime assemblage: [hour_scheduled_departure, minute_scheduled_departure]
Any ideas of how to get the job done by pd.to_datetime?
#anky_91
In this image an extract of first 10 rows is presented. First column [int32]: year; Second column[int32]: month; Third column[int32]: day; Fourth column[object]: hour; Fifth column[object]: minute. The length of objects is 2.
Another solution:
>>pd.concat([df.A,pd.to_datetime(pd.Series(df[df.columns[1:]].fillna('').values.tolist(),name='Date').map(lambda x: '0'.join(map(str,x))))],axis=1)
A Date
0 a 2002-07-01 05:07:00
1 b 2002-08-03 03:08:00
2 c 2002-09-05 06:09:00
3 d 2002-04-07 09:04:00
4 e 2002-02-01 02:02:00
5 f 2002-03-05 04:03:00
For the example you have added as image (i have skipped the last 3 columns due to save time)
>>df.month=df.month.map("{:02}".format)
>>df.day = df.day.map("{:02}".format)
>>pd.concat([df.A,pd.to_datetime(pd.Series(df[df.columns[1:]].fillna('').values.tolist(),name='Date').map(lambda x: ''.join(map(str,x))))],axis=1)
A Date
0 a 2015-01-01 00:05:00
1 b 2015-01-01 00:01:00
2 c 2015-01-01 00:02:00
3 d 2015-01-01 00:02:00
4 e 2015-01-01 00:25:00
5 f 2015-01-01 00:25:00
You can use rename to columns, so possible use pandas.to_datetime with columns year, month, day, hour, minute:
df = pd.DataFrame({
'A':list('abcdef'),
'year':[2002,2002,2002,2002,2002,2002],
'month':[7,8,9,4,2,3],
'day':[1,3,5,7,1,5],
'hour_scheduled_departure':[5,3,6,9,2,4],
'minute_scheduled_departure':[7,8,9,4,2,3]
})
print (df)
A year month day hour_scheduled_departure minute_scheduled_departure
0 a 2002 7 1 5 7
1 b 2002 8 3 3 8
2 c 2002 9 5 6 9
3 d 2002 4 7 9 4
4 e 2002 2 1 2 2
5 f 2002 3 5 4 3
cols = ['year','month','day','hour_scheduled_departure','minute_scheduled_departure']
d = {'hour_scheduled_departure':'hour','minute_scheduled_departure':'minute'}
df['date_time'] = pd.to_datetime(df[cols].rename(columns=d))
#if necessary remove columns
df = df.drop(cols, axis=1)
print (df)
A date_time
0 a 2002-07-01 05:07:00
1 b 2002-08-03 03:08:00
2 c 2002-09-05 06:09:00
3 d 2002-04-07 09:04:00
4 e 2002-02-01 02:02:00
5 f 2002-03-05 04:03:00
Detail:
print (df[cols].rename(columns=d))
year month day hour minute
0 2002 7 1 5 7
1 2002 8 3 3 8
2 2002 9 5 6 9
3 2002 4 7 9 4
4 2002 2 1 2 2
5 2002 3 5 4 3

Dataframe has Everyother column timestamp, how to get it in one column?

I have a dataframe I import from excel that is of 'n x n' length, that looks like the following (sorry, i do not know how to easily duplicate this with code)
How do I get the timestamps into one column? Like the following (I've tried pivot)
You may need to extract the data by 3 columns group. Then rename the columns and add the "A,B,C" flag column and concatenate them together. See the test as below:
abc_list = [["2017-10-01",0,"2017-10-02",1,"2017-10-03",8],["2017-11-01",3,"2017-11-01",5,"2017-11-05",10],["2017-12-01",0,"2017-12-07",7,"2017-12-07",12]]
df = pd.DataFrame(abc_list,columns=["Time1","A","Time2","B","Time3","C"])
The output:
Time1 A Time2 B Time3 C
0 2017-10-01 0 2017-10-02 1 2017-10-03 8
1 2017-11-01 3 2017-11-01 5 2017-11-05 10
2 2017-12-01 0 2017-12-07 7 2017-12-07 12
Then:
df_a=df.iloc[:,0:2].rename(columns={'Time1':'time','A':'value'})
df_a['flag']="A"
df_b=df.iloc[:,2:4].rename(columns={'Time2':'time','B':'value'})
df_b['flag']="B"
df_c=df.iloc[:,4:].rename(columns={'Time3':'time','C':'value'})
df_c['flag']="C"
df_final=pd.concat([df_a,df_b,df_c])
df_final.reset_index(drop=True)
output:
time value flag
0 2017-10-01 0 A
1 2017-11-01 3 A
2 2017-12-01 0 A
3 2017-10-02 1 B
4 2017-11-01 5 B
5 2017-12-07 7 B
6 2017-10-03 8 C
7 2017-11-05 10 C
8 2017-12-07 12 C
This is a quit bit not a pythonic way to do it.
Here is another way:
columns = pd.MultiIndex.from_tuples([('A','Time'),('A','Value'),('B','Time'),('B','Value'),('C','Time'),('C','Value')],names=['Group','Sub_value'])
df.columns=columns
Output:
Group A B C
Sub_value Time Value Time Value Time Value
0 2017-10-01 0 2017-10-02 1 2017-10-03 8
1 2017-11-01 3 2017-11-01 5 2017-11-05 10
2 2017-12-01 0 2017-12-07 7 2017-12-07 12
Run:
df.stack(level='Group')
Output:
Sub_value Time Value
Group
0 A 2017-10-01 0
B 2017-10-02 1
C 2017-10-03 8
1 A 2017-11-01 3
B 2017-11-01 5
C 2017-11-05 10
2 A 2017-12-01 0
B 2017-12-07 7
C 2017-12-07 12
This is one method. It is fairly easy to extend to any number of columns.
import pandas as pd
dfs = {}
# read in pairs of columns and assign 'Category' column
dfs[i] = {i: pd.read_excel('file.xlsx', usecols=[2*i, 2*i+1], skiprows=[0],
header=None, columns=['Date', 'Value']).assign(Category=j) \
for i, j in enumerate(['A', 'B', 'C'])}
# concatenate dataframes
df = pd.concat(list(dfs.values()), ignore_index=True)

How to indicate the multi index columns using read_sql_query (pandas dataframes)

I have a table with the following columns:
| Date | ProductId | SubProductId | Value |
I am trying to retrieve the data from that table and to put it in a pandas DataFrame.
I want the DataFrame to have the following structure:
index: dates
columns: products
sub-columns: sub-products
(products) 1 2 ...
(subproducts) 1 2 3 1 2 3 ...
date
2015-01-02 val val val ...
2015-01-03 val val val ...
2015-01-04 ...
2015-01-05
...
I already have dataframes with the products and the subproducts and the dates.
I understand that I need to use the MultiIndex, here is what I tried:
query ="SELECT Date, ProductId, SubProductId, Value " \
" FROM table "\
" WHERE SubProductId in (1,2,3)"\
" AND ProductId in (1,2,3)"\
" AND Date BETWEEN '2015-01-02' AND '2015-01-08' "\
" GROUP BY Date, ProductId, SubProductId, Value "\
" ORDER BY Date, ProductId, SubProductId "
df = pd.read_sql_query(query, conn, index_col=pd.MultiIndex.from_product([df_products['products'].tolist(), df_subproducts['subproducts'].tolist()])
But it does not work because the query returns a vector of "value" (shape is nb of value x 1), while I need to have a matrix (shape: nb of distinct dates x (nb of subproducts*nb of prodcuts)) in the dataframe.
How can it be achieved:
directly via the read sql query ?
or by "trandofrming" the dataframe once the database values inserted in ?
NB: I am using Microsoft SQL Server.
IIUC you can use unstack() method:
df = pd.read_sql_query(query, conn, index_col=['Date','ProductID','SubProductId']) \
.unstack(['ProductID','SubProductId'])
Demo:
In [413]: df
Out[413]:
Date ProductID SubProductId Value
0 2015-01-02 1 1 11
1 2015-01-02 1 2 12
2 2015-01-02 1 3 13
3 2015-01-02 2 1 14
4 2015-01-02 2 2 15
5 2015-01-02 2 3 16
6 2015-01-03 1 1 17
7 2015-01-03 1 2 18
8 2015-01-03 1 3 19
9 2015-01-03 2 1 20
10 2015-01-03 2 2 21
In [414]: df.set_index(['Date','ProductID','SubProductId']).unstack(['ProductID','SubProductId'])
Out[414]:
Value
ProductID 1 2
SubProductId 1 2 3 1 2 3
Date
2015-01-02 11.0 12.0 13.0 14.0 15.0 16.0
2015-01-03 17.0 18.0 19.0 20.0 21.0 NaN
You can also use pivot_table
df.pivot_table('Value', 'Date', ['ProductId', 'SubProductId'])
demo
df = pd.DataFrame(dict(
Date=pd.date_range('2017-03-31', periods=2).repeat(9),
ProductId=[1, 1, 1, 2, 2, 2, 3, 3, 3] * 2,
SubProductId=list('abc') * 6,
Value=np.random.randint(10, size=18)
))
print(df)
Date ProductId SubProductId Value
0 2017-03-31 1 a 8
1 2017-03-31 1 b 2
2 2017-03-31 1 c 5
3 2017-03-31 2 a 4
4 2017-03-31 2 b 3
5 2017-03-31 2 c 2
6 2017-03-31 3 a 9
7 2017-03-31 3 b 3
8 2017-03-31 3 c 1
9 2017-04-01 1 a 3
10 2017-04-01 1 b 5
11 2017-04-01 1 c 7
12 2017-04-01 2 a 3
13 2017-04-01 2 b 6
14 2017-04-01 2 c 4
15 2017-04-01 3 a 5
16 2017-04-01 3 b 2
17 2017-04-01 3 c 0
df.pivot_table('Value', 'Date', ['ProductId', 'SubProductId'])
ProductId 1 2 3
SubProductId a b c a b c a b c
Date
2017-03-31 8 2 5 4 3 2 9 3 1
2017-04-01 3 5 7 3 6 4 5 2 0

time interval partitioned by 2 fields in pandas

I have the following data frame:
id datetime interval
0 1 20160101 070000 NaN
1 1 20160101 080000 60
2 1 20160102 070000 NaN
3 1 20160102 073000 30
4 2 20160101 071500 NaN
5 2 20160101 071600 1
And would like to generate the interval column - the minutes between rows but only for the same id & the same day, just like in the example - so in sql I would partition by id and datetime and use LAG for the time interval between the previous row. How can I do it in Pandas?
You can convert column datetime to_datetime and use groupby with diff and convert timedelta to minutes by astype:
print df
id datetime interval
0 1 20160101 070000 NaN
1 1 20160101 080000 60
2 1 20160102 070000 NaN
3 1 20160102 073000 30
4 2 20160101 071500 NaN
5 2 20160101 071600 1
df['datetime'] = pd.to_datetime(df['datetime'])
df['new']=df.groupby(['id',df['datetime'].dt.day])['datetime'].diff().astype('timedelta64[m]')
print df
id datetime interval new
0 1 2016-01-01 07:00:00 NaN NaN
1 1 2016-01-01 08:00:00 60 60
2 1 2016-01-02 07:00:00 NaN NaN
3 1 2016-01-02 07:30:00 30 30
4 2 2016-01-01 07:15:00 NaN NaN
5 2 2016-01-01 07:16:00 1 1

How to group time series data by Monday, Tuesday .. ? pandas

I have time series pandas DataFrame looks like
value
12-01-2014 1
13-01-2014 2
....
01-05-2014 5
I want to group them into
1 (Monday, Tuesday, ..., Saturday, Sonday)
2 (Workday, Weekend)
How could I achieve that in pandas ?
Make sure your dates column is a datetime object and use the datetime attributes:
df = pd.DataFrame({'dates':['1/1/15','1/2/15','1/3/15','1/4/15','1/5/15','1/6/15',
'1/7/15','1/8/15','1/9/15','1/10/15','1/11/15','1/12/15'],
'values':[1,2,3,4,5,1,2,3,1,2,3,4]})
df['dates'] = pd.to_datetime(df['dates'])
df['dayofweek'] = df['dates'].apply(lambda x: x.dayofweek)
dates values dayofweek
0 2015-01-01 1 3
1 2015-01-02 2 4
2 2015-01-03 3 5
3 2015-01-04 4 6
4 2015-01-05 5 0
5 2015-01-06 1 1
6 2015-01-07 2 2
7 2015-01-08 3 3
8 2015-01-09 1 4
9 2015-01-10 2 5
10 2015-01-11 3 6
11 2015-01-12 4 0
df.groupby(df['dates'].apply(lambda x: x.dayofweek)).sum()
df.groupby(df['dates'].apply(lambda x: 0 if x.dayofweek in [5,6] else 1)).sum()
Output:
In [1]: df.groupby(df['dates'].apply(lambda x: x.dayofweek)).sum()
Out[1]:
values
dates
0 9
1 1
2 2
3 4
4 3
5 5
6 7
In [2]: df.groupby(df['dates'].apply(lambda x: 0 if x.dayofweek in [5,6] else 1)).sum()
Out[2]:
values
dates
0 12
1 19

Categories