Create a dataframe based on common timestamps of multiple dataframes - python

I am looking for an elegant solution for selecting common timestamps from multiple dataframes. I know that something like this could work, assuming df is the dataframe of common timestamps:
df = df1[df1['Timestamp'].isin(df2['Timestamp'])]
However, if I have several other dataframes, this solution becomes quite inelegant. Therefore, I have been wondering whether there is an easier approach to achieving my goal when working with multiple dataframes.
So, let's say for example that I have:
date1 = pd.date_range(start='1/1/2018', end='1/02/2018', freq='H')
date2 = pd.date_range(start='1/1/2018', end='1/02/2018', freq='15min')
date3 = pd.date_range(start='1/1/2018', end='1/02/2018', freq='45min')
date4 = pd.date_range(start='1/1/2018', end='1/02/2018', freq='30min')
data1 = np.random.randn(len(date1))
data2 = np.random.randn(len(date2))
data3 = np.random.randn(len(date3))
data4 = np.random.randn(len(date4))
df1 = pd.DataFrame(data = {'date1' : date1, 'data1' : data1})
df2 = pd.DataFrame(data = {'date2' : date2, 'data2' : data2})
df3 = pd.DataFrame(data = {'date3' : date3, 'data3' : data3})
df4 = pd.DataFrame(data = {'date4' : date4, 'data4' : data4})
I would like as an output a dataframe containing the common timestamps of the four dataframes as well as the respective data column from each of them, for example (just to illustrate the shape; the numbers don't reflect the actual result):
common Timestamp data1 data2 data3 data4
0 2018-01-01 00:00:00 -1.129439 1.2312 1.11 -0.83
1 2018-01-01 01:00:00 0.853421 0.423 0.241 0.123
2 2018-01-01 02:00:00 -1.606047 1.001 -0.005 -0.12
3 2018-01-01 03:00:00 -0.668267 0.98 1.11 -0.23
[...]

You can use reduce from functools to perform the chained inner merge. We'll need to rename the timestamp columns to a common name just so the merge is a bit easier.
from functools import reduce
lst = [df1.rename(columns={'date1': 'Timestamp'}),
       df2.rename(columns={'date2': 'Timestamp'}),
       df3.rename(columns={'date3': 'Timestamp'}),
       df4.rename(columns={'date4': 'Timestamp'})]
reduce(lambda l, r: l.merge(r, on='Timestamp'), lst)
Timestamp data1 data2 data3 data4
0 2018-01-01 00:00:00 -0.971201 -0.978107 1.163339 0.048824
1 2018-01-01 03:00:00 -1.063810 0.125318 -0.818835 -0.777500
2 2018-01-01 06:00:00 0.862549 -0.671529 1.902272 0.011490
3 2018-01-01 09:00:00 1.030826 -1.306481 0.438610 -1.817053
4 2018-01-01 12:00:00 -1.191646 -1.700694 1.007190 -1.932421
5 2018-01-01 15:00:00 -1.803248 0.415256 0.690243 1.387650
6 2018-01-01 18:00:00 -0.304502 0.514616 0.974318 -0.062800
7 2018-01-01 21:00:00 -0.668874 -1.262635 -0.504298 -0.043383
8 2018-01-02 00:00:00 -0.943615 1.010958 1.343095 0.119853
Alternatively, use concat with an inner join after setting Timestamp as the index:
pd.concat([x.set_index('Timestamp') for x in lst], axis=1, join='inner')
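If there are many frames following the dateN/dataN pattern above, the rename step can be generated instead of written out. A minimal sketch (not from the original answer), assuming the timestamp is the first column of each frame:
from functools import reduce
frames = [df1, df2, df3, df4]
# rename each frame's first (timestamp) column to a common name, then chain inner merges
lst = [f.rename(columns={f.columns[0]: 'Timestamp'}) for f in frames]
common = reduce(lambda l, r: l.merge(r, on='Timestamp'), lst)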

If it is acceptable to name every timestamp column the same way (date, for example), something like this could work:
def common_stamps(*args):  # *args lets you feed it any number of dataframes
    df = (pd.concat([df_i.set_index('date') for df_i in args], axis=1)
          .dropna()  # removes all rows with uncommon stamps (and any pre-existing NaNs)
          .reset_index())
    return df
df = common_stamps(df1, df2, df3, df4)
print(df)
Output:
date data1 data2 data3 data4
0 2018-01-01 00:00:00 -0.667090 0.487676 -1.001807 -0.200328
1 2018-01-01 03:00:00 -1.639815 2.320734 -0.396013 -1.838732
2 2018-01-01 06:00:00 0.469890 0.626428 0.040004 -2.063454
3 2018-01-01 09:00:00 -0.916928 -0.260329 -0.598313 0.383281
4 2018-01-01 12:00:00 0.132670 1.771344 -0.441651 0.664980
5 2018-01-01 15:00:00 -0.761542 0.255955 1.378836 -1.235562
6 2018-01-01 18:00:00 -0.120083 0.243652 -1.261733 1.045454
7 2018-01-01 21:00:00 0.339921 -0.901171 1.492577 -0.797161
8 2018-01-02 00:00:00 -1.397864 -0.173818 -0.581590 -0.402472
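Note that with the example frames above, whose timestamp columns are named date1..date4, a rename is needed before calling the function; a sketch:
renamed = [df.rename(columns={col: 'date'})
           for df, col in zip([df1, df2, df3, df4], ['date1', 'date2', 'date3', 'date4'])]
df = common_stamps(*renamed)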

Related

Pandas change time values based on condition

I have a dataframe:
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
I would like to convert the time based on conditions: if the hour is less than 9, I want to set it to 9 and if the hour is more than 17, I need to set it to 17.
I tried this approach:
df['time'] = np.where(((df['time'].dt.hour < 9) & (df['time'].dt.hour != 0)), dt.time(9, 00))
I am getting an error: Can only use .dt accessor with datetimelike values.
Can anyone please help me with this? Thanks.
Here's a way to do what your question asks:
df.time = pd.to_datetime(df.time)
# shift the out-of-window rows by whole hours: convert to int64 nanoseconds,
# add the hour difference in nanoseconds, then cast back to datetime64[ns]
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
Input:
time
0 2022-06-06 08:45:00
1 2022-06-06 09:30:00
2 2022-06-06 18:00:00
3 2022-06-06 15:00:00
Output:
time
0 2022-06-06 09:45:00
1 2022-06-06 09:30:00
2 2022-06-06 17:00:00
3 2022-06-06 15:00:00
UPDATE:
Here's an alternative version that tries to address the OP's error as described in the comments:
import pandas as pd
import datetime
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print('', 'df loaded as strings:', df, sep='\n')
df.time = pd.to_datetime(df.time, format='%H:%M:%S')
print('', 'df converted to datetime by pd.to_datetime():', df, sep='\n')
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.time = [time.time() for time in pd.to_datetime(df.time)]
print('', 'df with time column adjusted to have hour between 9 and 17, converted to type "time":', df, sep='\n')
Output:
df loaded as strings:
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
df converted to datetime by pd.to_datetime():
time
0 1900-01-01 08:45:00
1 1900-01-01 09:30:00
2 1900-01-01 18:00:00
3 1900-01-01 15:00:00
df with time column adjusted to have hour between 9 and 17, converted to type "time":
time
0 09:45:00
1 09:30:00
2 17:00:00
3 15:00:00
UPDATE #2:
To not just change the hour for out-of-window times, but to simply apply 9:00 and 17:00 as min and max times, respectively (see OP's comment on this), you can do this:
df.loc[df['time'].dt.hour < 9, 'time'] = pd.to_datetime(pd.DataFrame({
    'year': df['time'].dt.year, 'month': df['time'].dt.month, 'day': df['time'].dt.day,
    'hour': [9]*len(df.index)}))
df.loc[df['time'].dt.hour > 17, 'time'] = pd.to_datetime(pd.DataFrame({
    'year': df['time'].dt.year, 'month': df['time'].dt.month, 'day': df['time'].dt.day,
    'hour': [17]*len(df.index)}))
df['time'] = [time.time() for time in pd.to_datetime(df['time'])]
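As a side note, the same clamp-to-window behaviour can be had more compactly with Series.clip on the datetime column. A sketch (not from the original answer), assuming the strings are parsed with format='%H:%M:%S' so every row carries the dummy date 1900-01-01:
import pandas as pd
df = pd.DataFrame({'time': ['08:45:00', '09:30:00', '18:00:00', '15:00:00']})
dt = pd.to_datetime(df['time'], format='%H:%M:%S')  # all rows dated 1900-01-01
# clip() applies the 09:00/17:00 bounds in one vectorized call
dt = dt.clip(lower=pd.Timestamp('1900-01-01 09:00:00'),
             upper=pd.Timestamp('1900-01-01 17:00:00'))
df['time'] = dt.dt.time  # back to plain time objects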
Since your 'time' column contains strings, they can be kept as strings and assigned new string values where appropriate. To filter for your criteria it is convenient to: create a datetime Series from the 'time' column, build boolean Series by comparing it against your criteria, and use those boolean Series to select the rows which need to be changed.
Your data:
import numpy as np
import pandas as pd
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print(df.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
Convert to datetime and make boolean Series with your criteria:
dts = pd.to_datetime(df['time'])
lt_nine = dts.dt.hour < 9
gt_seventeen = (dts.dt.hour >= 17)
print(lt_nine)
print(gt_seventeen)
>>>
0 True
1 False
2 False
3 False
Name: time, dtype: bool
0 False
1 False
2 True
3 False
Name: time, dtype: bool
Use the boolean Series to assign new values:
df.loc[lt_nine,'time'] = '09:00:00'
df.loc[gt_seventeen,'time'] = '17:00:00'
print(df.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
Or just stick with strings altogether and create the boolean Series using regex patterns and .str.match.
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00','07:22:00','22:02:06']}
dg = pd.DataFrame(data)
print(dg.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
4 07:22:00
5 22:02:06
# regex patterns (str.match anchors each pattern at the start of the string)
pattern_lt_nine = '^(00|01|02|03|04|05|06|07|08)'
pattern_gt_seventeen = '^(17|18|19|20|21|22|23)'
Make the boolean Series and assign new values:
gt_seventeen = dg['time'].str.match(pattern_gt_seventeen)
lt_nine = dg['time'].str.match(pattern_lt_nine)
dg.loc[lt_nine,'time'] = '09:00:00'
dg.loc[gt_seventeen,'time'] = '17:00:00'
print(dg.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
4 09:00:00
5 17:00:00
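Alternatively, instead of enumerating every hour in a regex, the two-character hour prefix can be compared numerically; a small sketch of that variant, assuming zero-padded 'HH:MM:SS' strings as above:
dg = pd.DataFrame(data)  # fresh copy of the example frame
hours = dg['time'].str.slice(0, 2).astype(int)  # leading 'HH' as integers
dg.loc[hours < 9, 'time'] = '09:00:00'
dg.loc[hours >= 17, 'time'] = '17:00:00'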

Dataframe missing the time label after assigning clusters

I have the below data sample
date,00:00:00,00:15:00,00:30:00,00:45:00,01:00:00,01:15:00,01:30:00,01:45:00,02:00:00,event
2008-01-01,115.87869701,115.37569504,79.9510802,123.68891355,110.89528693, 112.15190765,110.1277647,76.16662078,100.39338951,A
2008-01-02,104.29757522,89.11652179,91.80890697,109.91423556,112.91809129,114.91459611,117.50170579,111.08030786,81.5893157,B
2008-01-02,81.16506701,97.13170328,89.25478466,93.51884481,107.11447296,120.40638709,116.1653649,79.8861492,111.99530301,C
2008-01-02,121.98507602,105.20973701,84.46996209,96.2210916,107.65437228,121.4604217,120.96638889,117.94695867,94.33309319,D
2008-01-02,82.5839125,104.3308685,98.32658468,101.79562494,86.02883206,90.61788466,109.89027977,107.89093632,101.64082595,E
2008-01-02,100.68446746,89.90700858,115.97450984,112.85364917,100.76204374,87.49141078,81.69930821,79.78106694,99.97354515,F
2008-01-02,98.49917234,112.93161335,85.30015915,120.59233515,102.15602621,84.9536008,116.98786228,107.95753105,112.75693735,G
2008-01-02,76.5186262,111.22137123,102.20065099,88.4490991,84.67584098,86.00205813,95.02734271,114.29076806,102.62969032,H
2008-01-02,93.27785451,122.90242719,123.27263927,102.83454346,87.84973282,95.38098403,88.03719802,108.68335342,97.6581398,I
2008-01-02,119.589143,94.15858259,94.32809506,120.5637488,120.43827996,79.66190052,100.40782173,89.90362719,80.46005726,J
I want to assign clusters to the data and have the final output in the format below.
Expected output
time date 00:00:00 00:15:00 00:30:00 00:45:00 01:00:00 01:15:00 01:30:00 01:45:00 02:00:00 cluster_num
0 2008-01-01 115.878697 115.375695 79.951080 123.688914 110.895287 112.151908 110.127765 76.166621 100.393390 0
1 2008-01-02 97.622322 102.989982 98.326255 105.193686 101.066410 97.876583 105.187030 101.935633 98.115212 1
I have tried the below, and the current output does not have the 'time' label in the first row.
import pandas as pd
import numpy as np
from datetime import datetime
from scipy.cluster.vq import kmeans, vq, whiten
from scipy.spatial.distance import cdist
from sklearn import metrics
#read data
df = pd.read_csv('df.csv', index_col=0)
df = df.drop(['event'], axis=1)
#stack the data
df = df.stack()
df.index = pd.to_datetime([' '.join(i) for i in df.index])
df = df.rename_axis('event_timestamp').reset_index(name='value')
df.index = df.event_timestamp
df = df.drop(['event_timestamp'], axis=1)
df.columns = ['value']
#normalize the df
df_norm = (df - df.mean()) / (df.max() - df.min())
df['time'] = df.index.map(lambda t: t.time())
df['date'] = df.index.map(lambda t: t.date())
df_norm['time'] = df_norm.index.map(lambda t: t.time())
df_norm['date'] = df_norm.index.map(lambda t: t.date())
#pivot data
df_daily = pd.pivot_table(df, values='value', index='date', columns='time', aggfunc='mean')
df_daily_norm = pd.pivot_table(df_norm, values='value', index='date', columns='time', aggfunc='mean')
#assign clusters to daily data
df_daily_matrix_norm = np.matrix(df_daily_norm.dropna())
centers, _ = kmeans(df_daily_matrix_norm, 2)
cluster, _ = vq(df_daily_matrix_norm, centers)
clusterdf = pd.DataFrame(cluster, columns=['cluster_num'])
dailyclusters = pd.concat([df_daily.dropna().reset_index(), clusterdf], axis=1)
print(dailyclusters)
Current output
date 00:00:00 00:15:00 00:30:00 00:45:00 01:00:00 01:15:00 01:30:00 01:45:00 02:00:00 cluster_num
0 2008-01-01 115.878697 115.375695 79.951080 123.688914 110.895287 112.151908 110.127765 76.166621 100.393390 0
1 2008-01-02 97.622322 102.989982 98.326255 105.193686 101.066410 97.876583 105.187030 101.935633 98.115212 1
What do I need to do to get the desired output with the 'time' label?
Simply add the name to the index:
dailyclusters.index.name = "time"
Use assign instead of concat; assign keeps the pivot table's columns name ('time'), which the pd.concat in your code drops:
dailyclusters = df_daily.dropna().assign(cluster_num=cluster).reset_index()
print(dailyclusters)
# Output
time date 00:00:00 00:15:00 00:30:00 00:45:00 01:00:00 01:15:00 01:30:00 01:45:00 02:00:00 cluster_num
0 2008-01-01 115.878697 115.375695 79.951080 123.688914 110.895287 112.151908 110.127765 76.166621 100.393390 1
1 2008-01-02 97.622322 102.989982 98.326255 105.193686 101.066410 97.876583 105.187030 101.935633 98.115212 0

Create a list of years with pandas

I have a dataframe with a column of dates of the form
2004-01-01
2005-01-01
2006-01-01
2007-01-01
2008-01-01
2009-01-01
2010-01-01
2011-01-01
2012-01-01
2013-01-01
2014-01-01
2015-01-01
2016-01-01
2017-01-01
2018-01-01
2019-01-01
Given an integer number k, let's say k=5, I would like to generate an array of the next k years after the maximum date of the column. The output should look like:
2020-01-01
2021-01-01
2022-01-01
2023-01-01
2024-01-01
Let's use pd.to_datetime + max to compute the largest date in the date column, then use pd.date_range to generate the dates with a one-year offset frequency and the number of periods equal to k=5:
strt, offs = pd.to_datetime(df['date']).max(), pd.DateOffset(years=1)
dates = pd.date_range(strt + offs, freq=offs, periods=k).strftime('%Y-%m-%d').tolist()
print(dates)
['2020-01-01', '2021-01-01', '2022-01-01', '2023-01-01', '2024-01-01']
Here you go:
import pandas as pd
# this is your k
k = 5
# Creating a test DF
array = {'dt': ['2018-01-01', '2019-01-01']}
df = pd.DataFrame(array)
# Extracting column of year
df['year'] = pd.DatetimeIndex(df['dt']).year
year1 = df['year'].max()
# creating a new DF and populating it with k years
# (DataFrame.append was removed in pandas 2.0, so build the rows first and construct once)
rows = [{'dates': str(year1 + i) + '-01-01'} for i in range(1, k + 1)]
years_df = pd.DataFrame(rows)
years_df
The output:
dates
2020-01-01
2021-01-01
2022-01-01
2023-01-01
2024-01-01

Pandas Pivot with datetime index

I am having some trouble pivoting a dataframe with a datetime value as the index.
My df looks like this:
Timestamp Value
2016-01-01 00:00:00 16.546900
2016-01-01 01:00:00 16.402375
2016-01-01 02:00:00 16.324250
where the Timestamp is a datetime64[ns]. I am trying to pivot the table so that it looks like this:
Hour 0 1 2 4 ....
Date
2016-01-01 16.5 16.4 16.3 17 ....
....
....
I've tried using the code below but am getting an error when I run it.
df3 = pd.pivot_table(df2,index=np.unique(df2.index.date),columns=np.unique(df2.index.hour),values=df2.Temp)
KeyError Traceback (most recent call last)
<ipython-input> in <module>()
1 # Pivot Table
----> 2 df3 = pd.pivot_table(df2,index=np.unique(df2.index.date),columns=np.unique(df2.index.hour),values=df2.Temp)
~\Anaconda3\lib\site-packages\pandas\core\reshape\pivot.py in pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name)
56 for i in values:
57 if i not in data:
---> 58 raise KeyError(i)
59
60 to_filter = []
KeyError: 16.5469
Any help or insights would be greatly appreciated.
A different way of accomplishing this without the lambda is to create the indices from the DatetimeIndex.
df2 = pd.pivot_table(df, index=df.index.date, columns=df.index.hour, values="Value")
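Note that pivot_table aggregates duplicates with mean by default; if the same date/hour could occur more than once and you want, say, the first reading instead, pass aggfunc explicitly (a sketch):
df2 = pd.pivot_table(df, index=df.index.date, columns=df.index.hour,
                     values="Value", aggfunc="first")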
I slightly extended the input data as below (assuming no duplicate entries for the same date/hour):
Timestamp Value
2016-01-01 00:00:00 16.546900
2016-01-01 01:00:00 16.402375
2016-01-01 02:00:00 16.324250
2016-01-01 04:00:00 16.023928
2016-01-03 04:00:00 16.101919
2016-01-05 23:00:00 13.405928
It looks a bit awkward, but something like the following works.
df2['Date'] = df2.Timestamp.apply(lambda x: str(x).split(" ")[0])
df2['Hour'] = df2.Timestamp.apply(lambda x: str(x).split(" ")[1].split(":")[0])
df3 = pd.pivot_table(df2, values='Value', index='Date', columns='Hour')
[Output]
Hour 00 01 02 04 23
Date
2016-01-01 16.5469 16.402375 16.32425 16.023928 NaN
2016-01-03 NaN NaN NaN 16.101919 NaN
2016-01-05 NaN NaN NaN NaN 13.405928
Finally, if your columns need to be integers:
df3.columns = [int(x) for x in df3.columns]
Hope this helps.
Adapting @Seanny123's answer above for an arbitrary cadence:
import datetime
from datetime import date

import numpy as np
import pandas as pd
import pytz

start = [2018, 1, 1, 0, 0, 0]
end = [date.today().year, date.today().month, date.today().day]
quant = 'freq'
sTime_tmp = datetime.datetime(start[0], start[1], start[2], tzinfo=pytz.UTC)
eTime_tmp = datetime.datetime(end[0], end[1], end[2], tzinfo=pytz.UTC)
cadence = '5min'
t = pd.date_range(start=sTime_tmp, end=eTime_tmp, freq=cadence)
keo = pd.DataFrame(np.nan, index=t, columns=[quant])
keo[quant] = 0
# rows become time-of-day, columns become dates
keo = pd.pivot_table(keo, index=keo.index.time, columns=keo.index.date, values=quant)
keo

rearrange groups in dataframe based on day and month from another dataframe index

I have 2 dataframes:
df_a
datetime var
2016-10-15 110.232790
2016-10-16 111.020661
2016-10-17 112.193496
2016-10-18 113.638143
2016-10-19 115.241448
2017-01-01 113.638143
2017-01-02 115.241448
and df_b
datetime var
2000-01-01 165.792185
2000-01-02 166.066959
2000-01-03 166.411669
2000-01-04 167.816046
2000-01-05 169.777814
2000-10-15 114.232790
2000-10-16 113.020661
2001-01-01 164.792185
2001-01-02 161.066959
2001-01-03 156.411669
2002-01-04 167.816046
2002-01-05 169.777814
2002-10-15 174.232790
2003-10-16 114.020661
df_a has information for the year 2016, 2017 and df_b has information for years from 2000 to 2015 (there is no overlap in the years).
Can I arrange each group in the df_b dataframe to have the same order in terms of day of year as df_a? A group is defined as rows with the same year, e.g. 2000.
You can chain a new condition to check the year:
df = df_b[df_b.index.month.isin(df_a.index.month) &
df_b.index.day.isin(df_a.index.day) &
(df_b.index.year == 2000)]
print (df)
var
datetime
2000-01-01 165.792185
2000-01-02 166.066959
2000-10-15 114.232790
2000-10-16 113.020661
EDIT:
df = df_b[df_b.index.month.isin(df_a.index.month) & df_b.index.day.isin(df_a.index.day)]
print (df)
var
datetime
2000-01-01 165.792185
2000-01-02 166.066959
2000-10-15 114.232790
2000-10-16 113.020661
2001-01-01 164.792185
2001-01-02 161.066959
2002-10-15 174.232790
2003-10-16 114.020661
#create dictionary of weights by factorize
a = pd.factorize(df_a.index.strftime('%m-%d'))
d = dict(zip(a[1], a[0]))
print (d)
{'01-02': 6, '10-19': 4, '10-18': 3, '10-15': 0, '01-01': 5, '10-16': 1, '10-17': 2}
# ordering Series; multiply the year by 1000 because the MM-DD weights run from 0 to 365
order = pd.Series(df.index.strftime('%m-%d'), index=df.index).map(d) + df.index.year * 1000
print (order)
datetime
2000-01-01 2000005
2000-01-02 2000006
2000-10-15 2000000
2000-10-16 2000001
2001-01-01 2001005
2001-01-02 2001006
2002-10-15 2002000
2003-10-16 2003001
Name: datetime, dtype: int64
Finally, reindex by the sorted order's index:
df = df.reindex(order.sort_values().index)
print (df)
var
datetime
2000-10-15 114.232790
2000-10-16 113.020661
2000-01-01 165.792185
2000-01-02 166.066959
2001-01-01 164.792185
2001-01-02 161.066959
2002-10-15 174.232790
2003-10-16 114.020661
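For reuse, the filtering and reordering steps can be pulled together into one hypothetical helper. A sketch, assuming df_b's index has no duplicate dates; it matches on the full 'MM-DD' string, which is slightly stricter than the separate month/day isin checks above:
import pandas as pd

def reorder_like(df_a, df_b):
    # weights follow df_a's month-day order
    codes, uniques = pd.factorize(df_a.index.strftime('%m-%d'))
    weights = dict(zip(uniques, codes))
    # keep only month-days that actually appear in df_a
    df = df_b[df_b.index.strftime('%m-%d').isin(uniques)]
    # year * 1000 keeps the years in order; the weight orders days inside each year
    order = pd.Series(df.index.strftime('%m-%d'), index=df.index).map(weights) + df.index.year * 1000
    return df.reindex(order.sort_values().index)

df = reorder_like(df_a, df_b)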
