Merge two pandas dataframes with timeseries index - python

I have two pandas dataframes that I would like to merge/join together.
For example:
#required packages
import os
import pandas as pd
import numpy as np
import datetime as dt
# create sample time series
dates1 = pd.date_range('1/1/2000', periods=4, freq='10min')
dates2 = dates1
column_names = ['A','B','C']
df1 = pd.DataFrame(np.random.randn(4, 3), index=dates1,
columns=column_names)
df2 = pd.DataFrame(np.random.randn(4, 3), index=dates2,
columns=column_names)
df3 = df1.merge(df2, how='outer', left_index=True, right_index=True, suffixes=('_x', '_y'))
From here I would like to merge the two datasets in the following manner (note the order of columns):
A_x A_y B_x B_y C_x C_y
2000-01-01 00:00:00 2000-01-01 00:00:00 -0.572616 -0.867554 -0.382594 1.866238 -0.756318 0.564087
2000-01-01 00:10:00 2000-01-01 00:10:00 -0.814776 -0.458378 1.011491 0.196498 -0.523433 -0.296989
2000-01-01 00:20:00 2000-01-01 00:20:00 -0.617766 0.081141 1.405145 -1.183592 0.400720 -0.872507
2000-01-01 00:30:00 2000-01-01 00:30:00 1.083721 0.137422 -1.013840 -1.610531 -1.258841 0.142301
I would like to preserve both dataframe indexes by either creating a multi-index dataframe or creating a column for the second index. Would it be easier to use merge_ordered instead of merge or join?
Any help is appreciated.

I think you want to concat rather than merge:
In [11]: pd.concat([df1, df2], keys=["df1", "df2"], axis=1)
Out[11]:
df1 df2
A B C A B C
2000-01-01 00:00:00 1.621737 0.093015 -0.698715 0.319212 1.021829 1.707847
2000-01-01 00:10:00 0.780523 -1.169127 -1.097695 -0.444000 0.170283 1.652005
2000-01-01 00:20:00 1.560046 -0.196604 -1.260149 0.725005 -1.290074 0.606269
2000-01-01 00:30:00 -1.074419 -2.488055 -0.548531 -1.046327 0.895894 0.423743
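Since dates1 and dates2 are identical here, a single shared index suffices. If you specifically want the interleaved suffixed layout from the question (A_x, A_y, B_x, ...), one possible sketch is to suffix each frame before concatenating and then sort the columns (this relies on the suffixes sorting lexicographically):
# suffix, concat, then sort columns into A_x, A_y, B_x, B_y, C_x, C_y
df3 = pd.concat([df1.add_suffix('_x'), df2.add_suffix('_y')], axis=1)
df3 = df3[sorted(df3.columns)]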

Using concat
(pd.concat([df1.reset_index().add_suffix('_x'),
            df2.reset_index().add_suffix('_y')], axis=1)
   .set_index(['index_x', 'index_y']))
A_x B_x C_x A_y B_y C_y
index_x index_y
2000-01-01 00:00:00 2000-01-01 00:00:00 -1.437311 -1.414127 0.344057 -0.533669 -0.260106 -1.316879
2000-01-01 00:10:00 2000-01-01 00:10:00 0.662025 1.860933 -0.485169 -0.825603 -0.973267 -0.760737
2000-01-01 00:20:00 2000-01-01 00:20:00 -0.300213 0.047812 -2.279631 -0.739694 -1.872261 2.281126
2000-01-01 00:30:00 2000-01-01 00:30:00 1.499468 0.633967 -1.067881 0.174793 1.197813 -0.879132

merge will indeed merge both indices.
You can create an extra column in df2 before you merge:
df2["index_2"] = df2.index
This creates a column in the final result that holds the value of df2's index.
Note that the only case this column will differ from the merged index is when an element does not appear in df2, in which case it will be null, so I'm not sure I understand your final goal here.
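Put together, a minimal sketch of that flow:
# keep df2's index as a regular column so it survives the merge
df2["index_2"] = df2.index
df3 = df1.merge(df2, how='outer', left_index=True, right_index=True,
                suffixes=('_x', '_y'))
# rows that exist only in df1 will show NaT in index_2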

Related

Dataframe missing the time label after assigning clusters

I have the below data sample
date,00:00:00,00:15:00,00:30:00,00:45:00,01:00:00,01:15:00,01:30:00,01:45:00,02:00:00,event
2008-01-01,115.87869701,115.37569504,79.9510802,123.68891355,110.89528693, 112.15190765,110.1277647,76.16662078,100.39338951,A
2008-01-02,104.29757522,89.11652179,91.80890697,109.91423556,112.91809129,114.91459611,117.50170579,111.08030786,81.5893157,B
2008-01-02,81.16506701,97.13170328,89.25478466,93.51884481,107.11447296,120.40638709,116.1653649,79.8861492,111.99530301,C
2008-01-02,121.98507602,105.20973701,84.46996209,96.2210916,107.65437228,121.4604217,120.96638889,117.94695867,94.33309319,D
2008-01-02,82.5839125,104.3308685,98.32658468,101.79562494,86.02883206,90.61788466,109.89027977,107.89093632,101.64082595,E
2008-01-02,100.68446746,89.90700858,115.97450984,112.85364917,100.76204374,87.49141078,81.69930821,79.78106694,99.97354515,F
2008-01-02,98.49917234,112.93161335,85.30015915,120.59233515,102.15602621,84.9536008,116.98786228,107.95753105,112.75693735,G
2008-01-02,76.5186262,111.22137123,102.20065099,88.4490991,84.67584098,86.00205813,95.02734271,114.29076806,102.62969032,H
2008-01-02,93.27785451,122.90242719,123.27263927,102.83454346,87.84973282,95.38098403,88.03719802,108.68335342,97.6581398,I
2008-01-02,119.589143,94.15858259,94.32809506,120.5637488,120.43827996,79.66190052,100.40782173,89.90362719,80.46005726,J
I want to assign clusters to the data and have the final output in the below format
Expected output
time date 00:00:00 00:15:00 00:30:00 00:45:00 01:00:00 01:15:00 01:30:00 01:45:00 02:00:00 cluster_num
0 2008-01-01 115.878697 115.375695 79.951080 123.688914 110.895287 112.151908 110.127765 76.166621 100.393390 0
1 2008-01-02 97.622322 102.989982 98.326255 105.193686 101.066410 97.876583 105.187030 101.935633 98.115212 1
I have tried the below, but the current output does not include the 'time' label in the first row
import pandas as pd
import numpy as np
from datetime import datetime
from scipy.cluster.vq import kmeans, vq, whiten
from scipy.spatial.distance import cdist
from sklearn import metrics
#read data
df = pd.read_csv('df.csv', index_col=0)
df = df.drop(['event'], axis=1)
#stack the data
df = df.stack()
df.index = pd.to_datetime([' '.join(i) for i in df.index])
df = df.rename_axis('event_timestamp').reset_index(name='value')
df.index = df.event_timestamp
df = df.drop(['event_timestamp'], axis=1)
df.columns = ['value']
#normalize the df
df_norm = (df - df.mean()) / (df.max() - df.min())
df['time'] = df.index.map(lambda t: t.time())
df['date'] = df.index.map(lambda t: t.date())
df_norm['time'] = df_norm.index.map(lambda t: t.time())
df_norm['date'] = df_norm.index.map(lambda t: t.date())
#pivot data
df_daily = pd.pivot_table(df, values='value', index='date', columns='time', aggfunc='mean')
df_daily_norm = pd.pivot_table(df_norm, values='value', index='date', columns='time', aggfunc='mean')
#assign clusters to daily data
df_daily_matrix_norm = np.matrix(df_daily_norm.dropna())
centers, _ = kmeans(df_daily_matrix_norm, 2)
cluster, _ = vq(df_daily_matrix_norm, centers)
clusterdf = pd.DataFrame(cluster, columns=['cluster_num'])
dailyclusters = pd.concat([df_daily.dropna().reset_index(), clusterdf], axis=1)
print(dailyclusters)
Current output
date 00:00:00 00:15:00 00:30:00 00:45:00 01:00:00 01:15:00 01:30:00 01:45:00 02:00:00 cluster_num
0 2008-01-01 115.878697 115.375695 79.951080 123.688914 110.895287 112.151908 110.127765 76.166621 100.393390 0
1 2008-01-02 97.622322 102.989982 98.326255 105.193686 101.066410 97.876583 105.187030 101.935633 98.115212 1
What do I need to do to get the desired output with the 'time' label?
Simply add the name to the columns axis (the 'time' label in the expected header row is the name of the columns, not of the index):
dailyclusters.columns.name = "time"
Alternatively, rebuild the frame with assign, which preserves the columns' name ('time', set by pivot_table(columns='time')); concatenating with clusterdf is what drops it:
dailyclusters = df_daily.dropna().assign(cluster_num=cluster).reset_index()
print(dailyclusters)
# Output
time date 00:00:00 00:15:00 00:30:00 00:45:00 01:00:00 01:15:00 01:30:00 01:45:00 02:00:00 cluster_num
0 2008-01-01 115.878697 115.375695 79.951080 123.688914 110.895287 112.151908 110.127765 76.166621 100.393390 1
1 2008-01-02 97.622322 102.989982 98.326255 105.193686 101.066410 97.876583 105.187030 101.935633 98.115212 0

Create a dataframe based in common timestamps of multiple dataframes

I am looking for an elegant solution for selecting common timestamps from multiple dataframes. I know that something like this could work, supposing df is the dataframe of common timestamps:
df = df1[df1['Timestamp'].isin(df2['Timestamp'])]
However, with several other dataframes this solution becomes quite inelegant. Therefore, I have been wondering if there is an easier approach when working with multiple dataframes.
So, let's say for example that I have:
date1 = pd.date_range(start='1/1/2018', end='1/02/2018', freq='H')
date2 = pd.date_range(start='1/1/2018', end='1/02/2018', freq='15min')
date3 = pd.date_range(start='1/1/2018', end='1/02/2018', freq='45min')
date4 = pd.date_range(start='1/1/2018', end='1/02/2018', freq='30min')
data1 = np.random.randn(len(date1))
data2 = np.random.randn(len(date2))
data3 = np.random.randn(len(date3))
data4 = np.random.randn(len(date4))
df1 = pd.DataFrame(data = {'date1' : date1, 'data1' : data1})
df2 = pd.DataFrame(data = {'date2' : date2, 'data2' : data2})
df3 = pd.DataFrame(data = {'date3' : date3, 'data3' : data3})
df4 = pd.DataFrame(data = {'date4' : date4, 'data4' : data4})
I would like as output a dataframe containing the common timestamps of the four dataframes as well as the respective data column out of each of them, for example (just to illustrate what I mean; it doesn't reflect the actual values):
common Timestamp data1 data2 data3 data4
0 2018-01-01 00:00:00 -1.129439 1.2312 1.11 -0.83
1 2018-01-01 01:00:00 0.853421 0.423 0.241 0.123
2 2018-01-01 02:00:00 -1.606047 1.001 -0.005 -0.12
3 2018-01-01 03:00:00 -0.668267 0.98 1.11 -0.23
[...]
You can use reduce from functools to perform the complete inner merge. We'll need to rename the columns just so the merge is a bit easier.
from functools import reduce
lst = [df1.rename(columns={'date1': 'Timestamp'}), df2.rename(columns={'date2': 'Timestamp'}),
df3.rename(columns={'date3': 'Timestamp'}), df4.rename(columns={'date4': 'Timestamp'})]
reduce(lambda l,r: l.merge(r, on='Timestamp'), lst)
Timestamp data1 data2 data3 data4
0 2018-01-01 00:00:00 -0.971201 -0.978107 1.163339 0.048824
1 2018-01-01 03:00:00 -1.063810 0.125318 -0.818835 -0.777500
2 2018-01-01 06:00:00 0.862549 -0.671529 1.902272 0.011490
3 2018-01-01 09:00:00 1.030826 -1.306481 0.438610 -1.817053
4 2018-01-01 12:00:00 -1.191646 -1.700694 1.007190 -1.932421
5 2018-01-01 15:00:00 -1.803248 0.415256 0.690243 1.387650
6 2018-01-01 18:00:00 -0.304502 0.514616 0.974318 -0.062800
7 2018-01-01 21:00:00 -0.668874 -1.262635 -0.504298 -0.043383
8 2018-01-02 00:00:00 -0.943615 1.010958 1.343095 0.119853
Alternatively, concat with an 'inner' join, setting Timestamp as the index:
pd.concat([x.set_index('Timestamp') for x in lst], axis=1, join='inner')
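If you then want the same flat layout as the merge result, a reset_index on top should do it (a small sketch, reusing the renamed lst from above):
out = pd.concat([x.set_index('Timestamp') for x in lst], axis=1, join='inner')
out = out.reset_index()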
If it would be acceptable to name every timestamp column in the same way (date for example), something like this could work:
def common_stamps(*args):  # *args lets you feed it any number of dataframes
    df = (pd.concat([df_i.set_index('date') for df_i in args], axis=1)
            .dropna()  # this removes all rows with uncommon stamps
            .reset_index())
    return df
df = common_stamps(df1, df2, df3, df4)
print(df)
Output:
date data1 data2 data3 data4
0 2018-01-01 00:00:00 -0.667090 0.487676 -1.001807 -0.200328
1 2018-01-01 03:00:00 -1.639815 2.320734 -0.396013 -1.838732
2 2018-01-01 06:00:00 0.469890 0.626428 0.040004 -2.063454
3 2018-01-01 09:00:00 -0.916928 -0.260329 -0.598313 0.383281
4 2018-01-01 12:00:00 0.132670 1.771344 -0.441651 0.664980
5 2018-01-01 15:00:00 -0.761542 0.255955 1.378836 -1.235562
6 2018-01-01 18:00:00 -0.120083 0.243652 -1.261733 1.045454
7 2018-01-01 21:00:00 0.339921 -0.901171 1.492577 -0.797161
8 2018-01-02 00:00:00 -1.397864 -0.173818 -0.581590 -0.402472
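Since the example dataframes keep their original column names (date1 through date4), a small bridging sketch to rename them uniformly before calling common_stamps:
# rename each timestamp column to 'date' so common_stamps can set_index on it
names = ['date1', 'date2', 'date3', 'date4']
dfs = [d.rename(columns={c: 'date'}) for d, c in zip([df1, df2, df3, df4], names)]
df = common_stamps(*dfs)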

Copy row to another dataframe

I have 2 dataframes with index type DatetimeIndex, and I would like to copy a column from one to the other. The dataframes are:
variable: df
DateTime
2013-01-01 01:00:00 0.0
2013-01-01 02:00:00 0.0
2013-01-01 03:00:00 0.0
....
Freq: H, Length: 8759, dtype: float64
variable: consumption_year
Potência Ativa ... Costs
Datetime ...
2019-01-01 00:00:00 11.500000 ... 1.08874
2019-01-01 01:00:00 6.500000 ... 0.52016
2019-01-01 02:00:00 5.250000 ... 0.38183
2019-01-01 03:00:00 5.250000 ... 0.38183
[8760 rows x 5 columns]
here is my code:
mc.run_model(tmy_data)
df=round(mc.ac.fillna(0)/1000,3)
consumption_year['PVProduction'] = df.iloc[:,[1]] #1
consumption_year['PVProduction'] = df[:,1] #2
I am trying to copy the second column of df to a new column in the consumption_year dataframe, but neither of those attempts worked. Looking at the indexes, I see 3 major differences:
year (2013 and 2019)
starting hour: 01:00 and 00:00
length: 8760 and 8759
Do I need to resolve those 3 differences first (making df's datetime index match consumption_year's) before I can copy from one to the other? If so, could you provide a solution to fix those differences?
Those are the errors:
1: consumption_year['PVProduction'] = df.iloc[:,[1]]
raise IndexingError("Too many indexers")
pandas.core.indexing.IndexingError: Too many indexers
2: consumption_year['PVProduction'] = df[:,1]
raise ValueError("Can only tuple-index with a MultiIndex")
ValueError: Can only tuple-index with a MultiIndex
Both errors occur because df is a Series, not a DataFrame, so two-dimensional indexers like df.iloc[:, [1]] and df[:, 1] are invalid. You can merge the two together on their indexes (note that merge needs a DataFrame or a named Series, hence the to_frame):
pd.merge(df.to_frame('PVProduction'), consumption_year, left_index=True, right_index=True, how='outer')
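Note, though, that with the indexes in different years (2013 vs 2019) an outer merge aligns nothing, so the new column would be NaN on every 2019 row. A hypothetical sketch, if re-stamping the 2013 index onto 2019 is acceptable for your data:
# re-stamp df's index into 2019 so index-aligned assignment works; the
# 2019-01-01 00:00 row has no 2013 counterpart and will stay NaN
df.index = df.index.map(lambda t: t.replace(year=2019))
consumption_year['PVProduction'] = df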

rearrange groups in dataframe based on day and month from another dataframe index

I have 2 dataframes:
df_a
datetime var
2016-10-15 110.232790
2016-10-16 111.020661
2016-10-17 112.193496
2016-10-18 113.638143
2016-10-19 115.241448
2017-01-01 113.638143
2017-01-02 115.241448
and df_b
datetime var
2000-01-01 165.792185
2000-01-02 166.066959
2000-01-03 166.411669
2000-01-04 167.816046
2000-01-05 169.777814
2000-10-15 114.232790
2000-10-16 113.020661
2001-01-01 164.792185
2001-01-02 161.066959
2001-01-03 156.411669
2002-01-04 167.816046
2002-01-05 169.777814
2002-10-15 174.232790
2003-10-16 114.020661
df_a has information for the year 2016, 2017 and df_b has information for years from 2000 to 2015 (there is no overlap in the years).
Can I arrange each group in the df_b dataframe to have the same order in terms of day of year as df_a? A group is defined as rows with the same year e.g. 2000
You can chain a new condition to check the year:
df = df_b[df_b.index.month.isin(df_a.index.month) &
df_b.index.day.isin(df_a.index.day) &
(df_b.index.year == 2000)]
print (df)
var
datetime
2000-01-01 165.792185
2000-01-02 166.066959
2000-10-15 114.232790
2000-10-16 113.020661
EDIT:
df = df_b[df_b.index.month.isin(df_a.index.month) & df_b.index.day.isin(df_a.index.day)]
print (df)
var
datetime
2000-01-01 165.792185
2000-01-02 166.066959
2000-10-15 114.232790
2000-10-16 113.020661
2001-01-01 164.792185
2001-01-02 161.066959
2002-10-15 174.232790
2003-10-16 114.020661
#create dictionary of weights by factorize
a = pd.factorize(df_a.index.strftime('%m-%d'))
d = dict(zip(a[1], a[0]))
print (d)
{'01-02': 6, '10-19': 4, '10-18': 3, '10-15': 0, '01-01': 5, '10-16': 1, '10-17': 2}
#ordering Series, multiply by 1000 because there are at most 366 possible MM-DD values
order = pd.Series(df.index.strftime('%m-%d'), index=df.index).map(d) + df.index.year * 1000
print (order)
datetime
2000-01-01 2000005
2000-01-02 2000006
2000-10-15 2000000
2000-10-16 2000001
2001-01-01 2001005
2001-01-02 2001006
2002-10-15 2002000
2003-10-16 2003001
Name: datetime, dtype: int64
Last, reindex by the sorted order index:
df = df.reindex(order.sort_values().index)
print (df)
var
datetime
2000-10-15 114.232790
2000-10-16 113.020661
2000-01-01 165.792185
2000-01-02 166.066959
2001-01-01 164.792185
2001-01-02 161.066959
2002-10-15 174.232790
2003-10-16 114.020661
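On pandas >= 1.1 the same ordering can be written with sort_index(key=...) instead of building the order Series by hand (a sketch reusing the d mapping from above):
# sort by (year, position of MM-DD in df_a's day order) in one step
df = df.sort_index(key=lambda idx: idx.year * 1000 + idx.strftime('%m-%d').map(d))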

pandas transform timeseries into multiple column DataFrame

I have a timeseries of intraday data that looks like below:
ts = pd.Series(np.random.randn(60), index=pd.date_range('1/1/2000', periods=60, freq='2h'))
I am hoping to transform the data into a DataFrame, with the columns as each date, and rows as the time in the date.
I have tried these,
key = lambda x:x.date()
grouped = ts.groupby(key)
But how do I transform the groups into a date-columned DataFrame? Or is there a better way?
import pandas as pd
import numpy as np
index = pd.date_range('1/1/2000', periods=60, freq='2h')
ts = pd.Series(np.random.randn(60), index = index)
key = lambda x: x.time()
groups = ts.groupby(key)
print(pd.DataFrame({k: g for k, g in groups}).resample('D').mean().T)
out:
2000-01-01 2000-01-02 2000-01-03 2000-01-04 2000-01-05 2000-01-06 \
00:00:00 0.109959 -0.124291 -0.137365 0.054729 -1.305821 -1.928468
03:00:00 1.336467 0.874296 0.153490 -2.410259 0.906950 1.860385
06:00:00 -1.172638 -0.410272 -0.800962 0.568965 -0.270307 -2.046119
09:00:00 -0.707423 1.614732 0.779645 -0.571251 0.839890 0.435928
12:00:00 0.865577 -0.076702 -0.966020 0.589074 0.326276 -2.265566
15:00:00 1.845865 -1.421269 -0.141785 0.433011 -0.063286 0.129706
18:00:00 -0.054569 0.277901 0.383375 -0.546495 -0.644141 -0.207479
21:00:00 1.056536 0.031187 -1.667686 -0.270580 -0.678205 0.750386
2000-01-07 2000-01-08
00:00:00 -0.657398 -0.630487
03:00:00 2.205280 -0.371830
06:00:00 -0.073235 0.208831
09:00:00 1.720097 -0.312353
12:00:00 -0.774391 NaN
15:00:00 0.607250 NaN
18:00:00 1.379823 NaN
21:00:00 0.959811 NaN
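A more direct route, without the groupby/dict detour, is to unstack a two-level index built from the date and time parts (a sketch, assuming the same ts as in the question):
# rows = time of day, columns = date
tmp = ts.to_frame('value')
out = tmp.set_index([tmp.index.time, tmp.index.date])['value'].unstack(1)
print(out)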
