Pandas merge on `datetime` or `datetime` in `DatetimeIndex` - python

Currently I have two data frames representing Excel spreadsheets. I wish to join the data where the dates are equal. This is a one-to-many join: one spreadsheet has a single date per row, and I need to add data from the other, which has multiple rows with the same date.
An example:
   A                        B
   date      data           date                   data
0  2015-0-1  ...         0  2015-0-1 to 2015-0-2   ...
1  2015-0-2  ...         1  2015-0-1 to 2015-0-2   ...
In this case both rows from A would receive rows 0 and 1 from B because they fall in that range.
I tried using
df3 = pandas.merge(df2, df1, how='right', validate='1:m', left_on='Travel Date/Range', right_on='End')
to accomplish this, but received this error:
Traceback (most recent call last):
File "<pyshell#61>", line 1, in <module>
df3 = pandas.merge(df2, df1, how='right', validate='1:m', left_on='Travel Date/Range', right_on='End')
File "C:\Users\M199449\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\reshape\merge.py", line 61, in merge
validate=validate)
File "C:\Users\M199449\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\reshape\merge.py", line 555, in __init__
self._maybe_coerce_merge_keys()
File "C:\Users\M199449\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\reshape\merge.py", line 990, in _maybe_coerce_merge_keys
raise ValueError(msg)
ValueError: You are trying to merge on object and datetime64[ns] columns. If you wish to proceed you should use pd.concat
I can add more information as needed of course

So here's the option with merging:
Assume you have two DataFrames:
import pandas as pd

df1 = pd.DataFrame({'date': ['2015-01-01', '2015-01-02', '2015-01-03'],
                    'data': ['A', 'B', 'C']})
df2 = pd.DataFrame({'date': ['2015-01-01 to 2015-01-02', '2015-01-01 to 2015-01-02', '2015-01-02 to 2015-01-03'],
                    'data': ['E', 'F', 'G']})
Now do some cleaning to get all of the dates you need, and make sure they are datetime:
df1['date'] = pd.to_datetime(df1.date)
df2[['start', 'end']] = df2['date'].str.split(' to ', expand=True)
df2['start'] = pd.to_datetime(df2.start)
df2['end'] = pd.to_datetime(df2.end)
# No need for this anymore
df2 = df2.drop(columns='date')
Now merge it all together on a dummy key. This is a full cross join, so you'll get len(df1) × len(df2) rows (here 3 × 3 = 9).
df = df1.assign(dummy=1).merge(df2.assign(dummy=1), on='dummy').drop(columns='dummy')
And subset to the dates that fall in between the ranges:
df[(df.date >= df.start) & (df.date <= df.end)]
# date data_x data_y start end
#0 2015-01-01 A E 2015-01-01 2015-01-02
#1 2015-01-01 A F 2015-01-01 2015-01-02
#3 2015-01-02 B E 2015-01-01 2015-01-02
#4 2015-01-02 B F 2015-01-01 2015-01-02
#5 2015-01-02 B G 2015-01-02 2015-01-03
#8 2015-01-03 C G 2015-01-02 2015-01-03
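As an aside (not part of the original answer): on pandas 1.2 or newer, the dummy-key trick can be replaced with the built-in cross join, and the range filter stays the same.
df = df1.merge(df2, how='cross')  # pandas >= 1.2; equivalent to the dummy-key merge
df = df[(df.date >= df.start) & (df.date <= df.end)]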
If, for instance, some dates in df2 were a single date, then since we're using .str.split we will get None for the second date. Just use .loc to set it appropriately:
df2 = pd.DataFrame({'date': ['2015-01-01 to 2015-01-02', '2015-01-01 to 2015-01-02', '2015-01-02 to 2015-01-03',
                             '2015-01-03'],
                    'data': ['E', 'F', 'G', 'H']})
df2[['start', 'end']] = df2['date'].str.split(' to ', expand=True)
df2.loc[df2.end.isnull(), 'end'] = df2.loc[df2.end.isnull(), 'start']
# data start end
#0 E 2015-01-01 2015-01-02
#1 F 2015-01-01 2015-01-02
#2 G 2015-01-02 2015-01-03
#3 H 2015-01-03 2015-01-03
Now the rest follows unchanged.
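As a minor aside (not in the original answer), the .loc fill above can also be written with fillna:
df2['end'] = df2['end'].fillna(df2['start'])  # same result as the .loc assignment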

Let's use this numpy broadcasting method by @piRSquared:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date': ['2015-01-01', '2015-01-02', '2015-01-03'],
                    'data': ['A', 'B', 'C']})
df2 = pd.DataFrame({'date': ['2015-01-01 to 2015-01-02', '2015-01-01 to 2015-01-02', '2015-01-02 to 2015-01-03'],
                    'data': ['E', 'F', 'G']})

df2[['start', 'end']] = df2['date'].str.split(' to ', expand=True)
df2['start'] = pd.to_datetime(df2.start)
df2['end'] = pd.to_datetime(df2.end)
df1['date'] = pd.to_datetime(df1['date'])

a = df1['date'].values
bh = df2['end'].values
bl = df2['start'].values

# Broadcast to a len(df1) x len(df2) boolean matrix, then take the
# (row, column) index pairs where the date falls inside [start, end].
i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

pd.DataFrame(np.column_stack([df1.values[i], df2.values[j]]),
             columns=df1.columns.append(df2.columns))
Output:
date data date data start end
0 2015-01-01 00:00:00 A 2015-01-01 to 2015-01-02 E 2015-01-01 00:00:00 2015-01-02 00:00:00
1 2015-01-01 00:00:00 A 2015-01-01 to 2015-01-02 F 2015-01-01 00:00:00 2015-01-02 00:00:00
2 2015-01-02 00:00:00 B 2015-01-01 to 2015-01-02 E 2015-01-01 00:00:00 2015-01-02 00:00:00
3 2015-01-02 00:00:00 B 2015-01-01 to 2015-01-02 F 2015-01-01 00:00:00 2015-01-02 00:00:00
4 2015-01-02 00:00:00 B 2015-01-02 to 2015-01-03 G 2015-01-02 00:00:00 2015-01-03 00:00:00
5 2015-01-03 00:00:00 C 2015-01-02 to 2015-01-03 G 2015-01-02 00:00:00 2015-01-03 00:00:00

Related

Generating DataFrame with combination of columns and sum of grouped values

So, I have a DataFrame in which each row represents an event, formed by essentially 4 columns:
happen start_date end_date number
0 2015-01-01 2015-01-01 2015-01-03 100.0
1 2015-01-01 2015-01-01 2015-01-01 20.0
2 2015-01-01 2015-01-02 2015-01-02 50.0
3 2015-01-02 2015-01-02 2015-01-02 40.0
4 2015-01-02 2015-01-02 2015-01-03 50.0
where happen is the date the event took place, start_date and end_date are the validity of that event, and number is just a summable variable.
What I'd like to get is a DataFrame that has for each row the combination of the happen date and validity date and, contextually, the sum of the number column.
What I tried so far is a double for loop on all dates, knowing that start_date >= happen:
startdate = pd.to_datetime('01/06/2014', format='%d/%m/%Y')  # the minimum possible happen
enddate = pd.to_datetime('31/12/2021', format='%d/%m/%Y')    # the maximum possible happen (and validity)
df_day = pd.DataFrame()
for dt1 in pd.date_range(start=startdate, end=enddate):
    for dt2 in pd.date_range(start=dt1, end=enddate):
        num_sum = df[(df['happen'] == dt1) & (df['start_date'] <= dt2) &
                     (df['end_date'] >= dt2)]['number'].sum()
        row = {'happen': dt1, 'valid': dt2, 'number': num_sum}
        df_day = df_day.append(row, ignore_index=True)
and that never came to an end. So I tried another way: I generated the df with all date combinations first (about 3.8e6 rows), and then tried to fill it with a lambda func (it's crazy, I know, but I don't know how to work around it):
from itertools import product

dt1 = pd.date_range(start=startdate, end=enddate).tolist()
df_day = pd.DataFrame()
for i in dt1:
    dt_acc1 = [i]
    dt2 = pd.date_range(start=i, end=enddate).tolist()
    df_comb = pd.DataFrame(list(product(dt_acc1, dt2)), columns=['happen', 'valid'])
    df_day = df_day.append(df_comb, ignore_index=True)
df_day['number'] = 0

def append_num(happen, valid):
    return df[(df['happen'] == happen) & (df['start_date'] <= valid) &
              (df['end_date'] >= valid)]['number'].sum()

df_day['number'] = df_day.apply(lambda x: append_num(x['happen'], x['valid']), axis=1)
and this loop also takes forever.
My expected output is something like this:
happen valid number
0 2015-01-01 2015-01-01 120.0
1 2015-01-01 2015-01-02 150.0
2 2015-01-01 2015-01-03 100.0
3 2015-01-02 2015-01-02 90.0
4 2015-01-02 2015-01-03 50.0
5 2015-01-03 2015-01-03 0.0
As you can see, the first row represents the sum of all rows with happen on 2015-01-01 and with a start_date and end_date that contain 2015-01-01 in valid. The number column contains the sum (120. = 100. + 20.). On the second row, with valid going one day forward, I "lose" the element with index 1 and "gain" the element with index 2 (150. = 100. + 50.).
Every help or suggestion is appreciated!
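A vectorized sketch (not from the original thread), reusing the cross-join-then-filter idea from the first answer above: pair every event with every candidate valid date, keep the in-range pairs, and sum per (happen, valid). Note this drops combinations with no matching events (such as the 0.0 row in the expected output); reindexing against the full (happen, valid) grid would restore those zeros.
import pandas as pd

# events table from the question
df = pd.DataFrame({
    'happen':     pd.to_datetime(['2015-01-01', '2015-01-01', '2015-01-01',
                                  '2015-01-02', '2015-01-02']),
    'start_date': pd.to_datetime(['2015-01-01', '2015-01-01', '2015-01-02',
                                  '2015-01-02', '2015-01-02']),
    'end_date':   pd.to_datetime(['2015-01-03', '2015-01-01', '2015-01-02',
                                  '2015-01-02', '2015-01-03']),
    'number':     [100.0, 20.0, 50.0, 40.0, 50.0],
})

# every candidate valid date in the overall range
grid = pd.DataFrame({'valid': pd.date_range(df['happen'].min(),
                                            df['end_date'].max())})

out = (df.merge(grid, how='cross')                      # pandas >= 1.2
         .query('start_date <= valid and valid <= end_date')
         .groupby(['happen', 'valid'], as_index=False)['number'].sum())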

Inserting rows in specific location using pandas

I have a CSV-file containing the following data structure:
2015-01-02,09:30:00,64.815
2015-01-02,09:35:00,64.8741
2015-01-02,09:55:00,65.0255
2015-01-02,10:00:00,64.9269
By using Pandas in Python, I would like to quadruple the 2nd row and insert the new rows after the 2nd row (filling up the missing intervals with the 2nd row). Eventually, it should look like:
2015-01-02,09:30:00,64.815
2015-01-02,09:35:00,64.8741
2015-01-02,09:40:00,64.8741
2015-01-02,09:45:00,64.8741
2015-01-02,09:50:00,64.8741
2015-01-02,09:55:00,65.0255
2015-01-02,10:00:00,64.9269
2015-01-02,10:05:00,64.815
I have the following code:
df = pd.read_csv("csv.file", header=0, names=['date', 'minute', 'price'])
for i in range(len(df)):
    if i != len(df)-1:
        next_i = i+1
        if df.loc[next_i, 'date'] == df.loc[i, 'date'] and df.loc[i, 'minute'] != "16:00:00":
            now = int(df.loc[i, "minute"][:2]+df.loc[i, "minute"][3:5])
            future = int(df.loc[next_i, "minute"][:2]+df.loc[next_i, "minute"][3:5])
            while now + 5 != future and df.loc[next_i, "minute"][3:5] != "00" and df.loc[next_i, "minute"][3:5] != "60":
                newminutes = str(int(df.loc[i, "minute"][3:5])+5*a)  # note: `a` is never defined, so this raises NameError
                newtime = df.loc[next_i, "minute"][:2] + ":" + newminutes + ":00"
                df.loc[next_i-0.5] = [df.loc[next_i, 'date'], newtime, df.loc[i, 'price']]
                df = df.sort_index().reset_index(drop=True)
                now = int(newtime[:2]+newtime[3:5])
                future = int(df.loc[next_i+1, "minute"][:2]+df.loc[next_i+1, "minute"][3:5])
However, it's not working.
I see there is an extra row in the expected output: 2015-01-02,10:05:00,64.815.
To accommodate that as well, you can reindex using pd.date_range.
Creating data
data = {
    'date' : ['2015-01-02', '2015-01-02', '2015-01-02', '2015-01-02'],
    'time' : ['09:30:00', '09:35:00', '09:55:00', '10:00:00'],
    'val'  : [64.815, 64.8741, 65.0255, 64.9269]
}
df = pd.DataFrame(data)
Creating datetime column for reindexing
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])
df.set_index('datetime', inplace=True)
Generating output
df = df.resample('5min').asfreq().reindex(pd.date_range('2015-01-02 09:30:00', '2015-01-02 10:05:00', freq='5 min')).ffill()
df[['date', 'time']] = df.index.astype(str).to_series().str.split(' ', expand=True).values
df.reset_index(drop=True)
Output
This gives us the expected output
date time val
0 2015-01-02 09:30:00 64.8150
1 2015-01-02 09:35:00 64.8741
2 2015-01-02 09:40:00 64.8741
3 2015-01-02 09:45:00 64.8741
4 2015-01-02 09:50:00 64.8741
5 2015-01-02 09:55:00 65.0255
6 2015-01-02 10:00:00 64.9269
7 2015-01-02 10:05:00 64.9269
However, if that was a typo and you don't want the last row, you can do this:
df = df.resample('5min').asfreq().reindex(pd.date_range(df.index[0], df.index[len(df)-1], freq='5 min')).ffill()
df[['date', 'time']] = df.index.astype(str).to_series().str.split(' ', expand=True).values
df.reset_index(drop=True)
which gives us:
date time val
0 2015-01-02 09:30:00 64.8150
1 2015-01-02 09:35:00 64.8741
2 2015-01-02 09:40:00 64.8741
3 2015-01-02 09:45:00 64.8741
4 2015-01-02 09:50:00 64.8741
5 2015-01-02 09:55:00 65.0255
6 2015-01-02 10:00:00 64.9269
Try pandas' merge_ordered function.
Create the original data frame:
data = {
    'date' : ['2015-01-02', '2015-01-02', '2015-01-02', '2015-01-02'],
    'time' : ['09:30:00', '09:35:00', '09:55:00', '10:00:00'],
    'val'  : [64.815, 64.8741, 65.0255, 64.9269]
}
df = pd.DataFrame(data)
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])
Create a second data frame df2 with 5-minute time intervals from the min to the max of df:
df2=pd.DataFrame(pd.date_range(df['datetime'].min(), df['datetime'].max(), freq='5 min').rename('datetime'))
Using pandas' merge_ordered function:
result = pd.merge_ordered(df2, df, on='datetime', how='left')
result['date'] = result['datetime'].dt.date
result['time'] = result['datetime'].dt.time
result['val'] = result['val'].ffill()
result = result.drop('datetime', axis=1)
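Printing result should then give (derived from the code above; the 09:40-09:50 rows are forward-filled):
         date      time      val
0  2015-01-02  09:30:00  64.8150
1  2015-01-02  09:35:00  64.8741
2  2015-01-02  09:40:00  64.8741
3  2015-01-02  09:45:00  64.8741
4  2015-01-02  09:50:00  64.8741
5  2015-01-02  09:55:00  65.0255
6  2015-01-02  10:00:00  64.9269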

Create a dataframe based in common timestamps of multiple dataframes

I am looking for an elegant solution for selecting common timestamps from multiple dataframes. I know that something like this could work, supposing the dataframe of common timestamps to be df:
df = df1[df1['Timestamp'].isin(df2['Timestamp'])]
However, if I have several other dataframes, this solution becomes quite inelegant. Therefore, I have been wondering if there is an easier approach to achieve my goal when working with multiple dataframes.
So, let's say for example that I have:
date1 = pd.date_range(start='1/1/2018', end='1/02/2018', freq='H')
date2 = pd.date_range(start='1/1/2018', end='1/02/2018', freq='15min')
date3 = pd.date_range(start='1/1/2018', end='1/02/2018', freq='45min')
date4 = pd.date_range(start='1/1/2018', end='1/02/2018', freq='30min')
data1 = np.random.randn(len(date1))
data2 = np.random.randn(len(date2))
data3 = np.random.randn(len(date3))
data4 = np.random.randn(len(date4))
df1 = pd.DataFrame(data = {'date1' : date1, 'data1' : data1})
df2 = pd.DataFrame(data = {'date2' : date2, 'data2' : data2})
df3 = pd.DataFrame(data = {'date3' : date3, 'data3' : data3})
df4 = pd.DataFrame(data = {'date4' : date4, 'data4' : data4})
I would like as output a dataframe containing the common timestamps of the four dataframes as well as the respective data column out of each of them, for example (just to illustrate what I mean; it doesn't reflect the actual values):
common Timestamp data1 data2 data3 data4
0 2018-01-01 00:00:00 -1.129439 1.2312 1.11 -0.83
1 2018-01-01 01:00:00 0.853421 0.423 0.241 0.123
2 2018-01-01 02:00:00 -1.606047 1.001 -0.005 -0.12
3 2018-01-01 03:00:00 -0.668267 0.98 1.11 -0.23
[...]
You can use reduce from functools to perform the complete inner merge. We'll need to rename the columns just so the merge is a bit easier.
from functools import reduce

lst = [df1.rename(columns={'date1': 'Timestamp'}), df2.rename(columns={'date2': 'Timestamp'}),
       df3.rename(columns={'date3': 'Timestamp'}), df4.rename(columns={'date4': 'Timestamp'})]

reduce(lambda l, r: l.merge(r, on='Timestamp'), lst)
Timestamp data1 data2 data3 data4
0 2018-01-01 00:00:00 -0.971201 -0.978107 1.163339 0.048824
1 2018-01-01 03:00:00 -1.063810 0.125318 -0.818835 -0.777500
2 2018-01-01 06:00:00 0.862549 -0.671529 1.902272 0.011490
3 2018-01-01 09:00:00 1.030826 -1.306481 0.438610 -1.817053
4 2018-01-01 12:00:00 -1.191646 -1.700694 1.007190 -1.932421
5 2018-01-01 15:00:00 -1.803248 0.415256 0.690243 1.387650
6 2018-01-01 18:00:00 -0.304502 0.514616 0.974318 -0.062800
7 2018-01-01 21:00:00 -0.668874 -1.262635 -0.504298 -0.043383
8 2018-01-02 00:00:00 -0.943615 1.010958 1.343095 0.119853
Alternatively, use concat with an 'inner' join, setting the Timestamp as the index:
pd.concat([x.set_index('Timestamp') for x in lst], axis=1, join='inner')
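As a side note (a sketch, not from the original answer): the manual renames can be generated programmatically so the list scales to any number of frames, assuming every timestamp column name starts with 'date'.
lst = [f.rename(columns=lambda c: 'Timestamp' if c.startswith('date') else c)
       for f in (df1, df2, df3, df4)]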
If it would be acceptable to name every timestamp column in the same way (date for example), something like this could work:
def common_stamps(*args):  # *args lets you feed it any number of dataframes
    df = pd.concat([df_i.set_index('date') for df_i in args], axis=1)\
           .dropna()\
           .reset_index()  # dropna() removes all rows with uncommon stamps
    return df
df = common_stamps(df1, df2, df3, df4)
print(df)
Output:
date data1 data2 data3 data4
0 2018-01-01 00:00:00 -0.667090 0.487676 -1.001807 -0.200328
1 2018-01-01 03:00:00 -1.639815 2.320734 -0.396013 -1.838732
2 2018-01-01 06:00:00 0.469890 0.626428 0.040004 -2.063454
3 2018-01-01 09:00:00 -0.916928 -0.260329 -0.598313 0.383281
4 2018-01-01 12:00:00 0.132670 1.771344 -0.441651 0.664980
5 2018-01-01 15:00:00 -0.761542 0.255955 1.378836 -1.235562
6 2018-01-01 18:00:00 -0.120083 0.243652 -1.261733 1.045454
7 2018-01-01 21:00:00 0.339921 -0.901171 1.492577 -0.797161
8 2018-01-02 00:00:00 -1.397864 -0.173818 -0.581590 -0.402472

Pandas resample doesn't return anything

I am learning to use the pandas resample() function; however, the following code does not return anything as expected. I re-sampled the time series by day.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
range = pd.date_range('2015-01-01','2015-12-31',freq='15min')
df = pd.DataFrame(index = range)
df['speed'] = np.random.randint(low=0, high=60, size=len(df.index))
df['distance'] = df['speed'] * 0.25
df['cumulative_distance'] = df.distance.cumsum()
print df.head()
weekly_summary = pd.DataFrame()
weekly_summary['speed'] = df.speed.resample('D').mean()
weekly_summary['distance'] = df.distance.resample('D').sum()
print weekly_summary.head()
Output
speed distance cumulative_distance
2015-01-01 00:00:00 40 10.00 10.00
2015-01-01 00:15:00 6 1.50 11.50
2015-01-01 00:30:00 31 7.75 19.25
2015-01-01 00:45:00 41 10.25 29.50
2015-01-01 01:00:00 59 14.75 44.25
[5 rows x 3 columns]
Empty DataFrame
Columns: [speed, distance]
Index: []
[0 rows x 2 columns]
Depending on your pandas version, how you will do this will vary.
In pandas 0.19.0, your code works as expected:
In [7]: pd.__version__
Out[7]: '0.19.0'
In [8]: df.speed.resample('D').mean().head()
Out[8]:
2015-01-01 28.562500
2015-01-02 30.302083
2015-01-03 30.864583
2015-01-04 29.197917
2015-01-05 30.708333
Freq: D, Name: speed, dtype: float64
In older versions, your solution might not work but at least in 0.14.1, you can tweak it to do so:
>>> pd.__version__
'0.14.1'
>>> df.speed.resample('D').mean()
29.41087328767123
>>> df.speed.resample('D', how='mean').head()
2015-01-01 29.354167
2015-01-02 26.791667
2015-01-03 31.854167
2015-01-04 26.593750
2015-01-05 30.312500
Freq: D, Name: speed, dtype: float64
This looks like an issue with an old version of pandas; in newer versions it will enlarge the df when assigning a new column where the index is not the same shape. What should work is to not make an empty df, and instead pass the initial call to resample as the data arg for the df ctor:
In [8]:
range = pd.date_range('2015-01-01','2015-12-31',freq='15min')
df = pd.DataFrame(index = range)
df['speed'] = np.random.randint(low=0, high=60, size=len(df.index))
df['distance'] = df['speed'] * 0.25
df['cumulative_distance'] = df.distance.cumsum()
print (df.head())
weekly_summary = pd.DataFrame(df.speed.resample('D').mean())
weekly_summary['distance'] = df.distance.resample('D').sum()
print( weekly_summary.head())
speed distance cumulative_distance
2015-01-01 00:00:00 28 7.0 7.0
2015-01-01 00:15:00 8 2.0 9.0
2015-01-01 00:30:00 10 2.5 11.5
2015-01-01 00:45:00 56 14.0 25.5
2015-01-01 01:00:00 6 1.5 27.0
speed distance
2015-01-01 27.895833 669.50
2015-01-02 29.041667 697.00
2015-01-03 27.104167 650.50
2015-01-04 28.427083 682.25
2015-01-05 27.854167 668.50
Here I pass the call to resample as the data arg for the df ctor; this will take the index and column name and create a single-column df:
weekly_summary = pd.DataFrame(df.speed.resample('D').mean())
Then subsequent assignments should work as expected.
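On modern pandas you can also do both aggregations in one pass with agg (a sketch, not part of the original answers):
weekly_summary = df.resample('D').agg({'speed': 'mean', 'distance': 'sum'})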

Convert pandas DateOffset to microsecond

I would like to retrieve the sampling frequency of a dataframe, say as an integer in microseconds or a float in seconds.
I found the following to work:
import pandas as pd
(pd.datetime(1,1,1) + data_frame.index.freq - pd.datetime(1,1,1)).total_seconds()
but somehow I think there might be a less cumbersome way of doing it…
You might want to use pd.Timedelta.
import pandas as pd
import numpy as np
# your dataframe with some unknown freq
# ====================================
df = pd.DataFrame(np.random.randn(100), columns=['col'], index=pd.date_range('2015-01-01 00:00:00', periods=100, freq='20ms'))
Out[263]:
col
2015-01-01 00:00:00.000 0.8647
2015-01-01 00:00:00.020 -0.2269
2015-01-01 00:00:00.040 0.8112
2015-01-01 00:00:00.060 0.2878
2015-01-01 00:00:00.080 -0.5385
2015-01-01 00:00:00.100 1.9085
2015-01-01 00:00:00.120 -0.4758
2015-01-01 00:00:00.140 1.4407
2015-01-01 00:00:00.160 -1.1491
2015-01-01 00:00:00.180 0.8057
... ...
2015-01-01 00:00:01.800 -0.6615
2015-01-01 00:00:01.820 0.7059
2015-01-01 00:00:01.840 -0.3586
2015-01-01 00:00:01.860 0.7320
2015-01-01 00:00:01.880 -0.0364
2015-01-01 00:00:01.900 0.5889
2015-01-01 00:00:01.920 -0.7796
2015-01-01 00:00:01.940 0.4763
2015-01-01 00:00:01.960 0.8339
2015-01-01 00:00:01.980 1.3138
[100 rows x 1 columns]
# processing using pd.Timedelta()
# =================================
# get the freq in ms
(df.index[1] - df.index[0])/pd.Timedelta('1ms')
Out[262]: 20.0
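To get the units the question asked for, the same Timedelta can be converted directly; a short sketch, assuming the fixed-frequency index above:
step = df.index[1] - df.index[0]      # a pd.Timedelta of 20ms
step // pd.Timedelta('1us')           # integer microseconds: 20000
step.total_seconds()                  # float seconds: 0.02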
