Pandas: how to map one dataframe onto another with missing data? - python

I'm sure an easy command exists to do this in pandas, but for the life of me I can't figure it out.
I have two dataframes. The first is an **ideal** stock market timeline (times where I expect data to exist). The second dataframe is the **actual** data, with gaps. I need to map one to the other, and fill in the gaps with NaN.
First DataFrame: (an ideal timeline)
datetime
0 2005-01-03 10:00:00
1 2005-01-03 11:00:00
2 2005-01-03 12:00:00
3 2005-01-03 13:00:00
4 2005-01-03 14:00:00
Second DataFrame: (actual data with missing value at time 12:00:00)
datetime open high low close volume
1 2005-01-03 10:00:00 15.1118 15.1745 14.7478 14.8294 586463
2 2005-01-03 11:00:00 14.8294 14.9737 14.7792 14.9423 344888
3 2005-01-03 13:00:00 15.0490 15.0929 14.9549 14.9612 343767
4 2005-01-03 14:00:00 14.9674 15.0616 14.9674 15.0051 364739
I want the finished product to be:
datetime open high low close volume
1 2005-01-03 10:00:00 15.1118 15.1745 14.7478 14.8294 586463
2 2005-01-03 11:00:00 14.8294 14.9737 14.7792 14.9423 344888
3 2005-01-03 12:00:00 NaN NaN NaN NaN NaN
4 2005-01-03 13:00:00 15.0490 15.0929 14.9549 14.9612 343767
5 2005-01-03 14:00:00 14.9674 15.0616 14.9674 15.0051 364739
where the dataframe's datetime column is now the ideal time series, and missing points are NaN.
I've tried to study the documentation on this but I'm still a noob and I can't figure this out. Any suggestions?

You can use the merge function in pandas: merge the two dataframes on datetime and pass 'outer' to the how parameter. An outer merge uses the union of keys from both dataframes.
sample code:
import pandas as pd

# First DataFrame: the ideal timeline (including the 12:00:00 and 14:00:00 slots)
df1 = pd.DataFrame({'datetime': ['2005-01-03 10:00:00', '2005-01-03 11:00:00',
                                 '2005-01-03 12:00:00', '2005-01-03 13:00:00',
                                 '2005-01-03 14:00:00']})
# Second DataFrame: the actual data, with 12:00:00 missing
df2 = pd.DataFrame({'datetime': ['2005-01-03 10:00:00', '2005-01-03 11:00:00',
                                 '2005-01-03 13:00:00', '2005-01-03 14:00:00'],
                    'open': [15.1118, 14.8294, 15.0490, 14.9674],
                    'high': [15.1745, 14.9737, 15.0929, 15.0616],
                    'low': [14.7478, 14.7792, 14.9549, 14.9674],
                    'close': [14.8294, 14.9423, 14.9612, 15.0051],
                    'volume': [586463, 344888, 343767, 364739]})
merged_df = pd.merge(df1, df2, how='outer', on='datetime')
merged_df
output:
              datetime     open     high      low    close    volume
0  2005-01-03 10:00:00  15.1118  15.1745  14.7478  14.8294  586463.0
1  2005-01-03 11:00:00  14.8294  14.9737  14.7792  14.9423  344888.0
2  2005-01-03 12:00:00      NaN      NaN      NaN      NaN       NaN
3  2005-01-03 13:00:00  15.0490  15.0929  14.9549  14.9612  343767.0
4  2005-01-03 14:00:00  14.9674  15.0616  14.9674  15.0051  364739.0
Reference:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

IIUC, it is quite simple. You can prepare the desired index (ix below), and then df.reindex(ix) is the result you are looking for (assuming your df has 'datetime' as index; if not, make it so with df = df.set_index('datetime')). If the index is that of another DataFrame, use that instead of building ix from scratch. And if you just want to make sure the index is complete at hourly frequency, there is no need for ix at all: df.resample('H').asfreq() gives exactly what you want.
Note: here, there is no need to use pd.merge(). It is overkill for this problem and many times slower in this case.
Example (using your df):
start, end = '2005-01-03 10:00:00', '2005-01-03 14:00:00'
ix = pd.date_range(start, end, freq='H')
>>> df.reindex(ix)
open high low close volume
2005-01-03 10:00:00 15.1118 15.1745 14.7478 14.8294 586463.0
2005-01-03 11:00:00 14.8294 14.9737 14.7792 14.9423 344888.0
2005-01-03 12:00:00 NaN NaN NaN NaN NaN
2005-01-03 13:00:00 15.0490 15.0929 14.9549 14.9612 343767.0
2005-01-03 14:00:00 14.9674 15.0616 14.9674 15.0051 364739.0
With this specific ix, you get the same with simply:
>>> df.resample('H').asfreq()
# same as above
Note: the initial df is the one from your example, with 'datetime' set as the index:
df = pd.DataFrame({
    'datetime': pd.to_datetime([
        '2005-01-03 10:00:00', '2005-01-03 11:00:00',
        '2005-01-03 13:00:00', '2005-01-03 14:00:00']),
    'open': [15.1118, 14.8294, 15.049, 14.9674],
    'high': [15.1745, 14.9737, 15.0929, 15.0616],
    'low': [14.7478, 14.7792, 14.9549, 14.9674],
    'close': [14.8294, 14.9423, 14.9612, 15.0051],
    'volume': [586463, 344888, 343767, 364739],
}).set_index('datetime')
>>> df
open high low close volume
datetime
2005-01-03 10:00:00 15.1118 15.1745 14.7478 14.8294 586463
2005-01-03 11:00:00 14.8294 14.9737 14.7792 14.9423 344888
2005-01-03 13:00:00 15.0490 15.0929 14.9549 14.9612 343767
2005-01-03 14:00:00 14.9674 15.0616 14.9674 15.0051 364739
Speed
Let's see what happens at scale, and compare solutions:
import numpy as np

start, end = '2005', '2010'
ix = pd.date_range(start, end, freq='H', inclusive='left')
cols = ['open', 'high', 'low', 'close', 'volume']
np.random.seed(0)  # reproducible example
fulldf = pd.DataFrame(np.random.uniform(size=(len(ix), len(cols))), index=ix, columns=cols)
df = fulldf.sample(frac=.9).sort_index().copy(deep=True)  # randomly drop 10% of the rows
>>> df.shape
(39442, 5)
Now:
t0 = %timeit -o df.reindex(ix)
# 1.57 ms ± 42.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
By contrast:
df1 = pd.DataFrame({'datetime': ix})
df2 = df.rename_axis(index='datetime').reset_index()
t1 = %timeit -o pd.merge(df1, df2, how='outer', on='datetime')
# 14.3 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> t1.best / t0.best
9.312707763833396
Conclusion: reindex() is 9x faster than merge() in this example.

Related

Group timestamps by day in pandas

I want to combine multiple timestamps (datetime64) into a single row representing one day, and then sum up the amounts in the last column to get the total sales per day.
In this case I want to end up with two rows, one per day, each with the total sales.
I have tried to solve my problem with the groupby operation, but it won't work.
You could try to use resample:
df_1d = df.resample('1d', on='timestamp').sum()
It will sum all the data for each day (or any other interval you pass as the frequency).
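For concreteness, here is a minimal runnable sketch of that one-liner (the data mirrors the example further below):
import pandas as pd

df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2022-09-30 11:21', '2022-09-30 20:55',
                                 '2022-10-01 10:35', '2022-10-01 22:42']),
    'sales': [99.90, 10.20, 5.99, 21.00],
})

# bin rows into calendar days keyed on the 'timestamp' column, then sum each bin
df_1d = df.resample('1d', on='timestamp').sum()
print(df_1d)
#              sales
# timestamp
# 2022-09-30  110.10
# 2022-10-01   26.99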
The one-liner df.resample('1d', on='timestamp').sum()
from Aeroxer Support is perfect, but it does not explain why your attempts with groupby failed.
For groupby to work, you need a column that contains just the day; then you can group by that column.
Below is the example code. I add the extra column with just the day in it at In [4], and then df.groupby('day').sum() is what you are looking for.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({
...: 'timestamp': map(pd.Timestamp, ['2022-09-30 11:21', '2022-09-30 20:55', '2022-10-01 10:35', '2022-10-01 22:42']),
...: 'sales': [99.90, 10.20, 5.99, 21.00]
...: })
In [3]: df
Out[3]:
timestamp sales
0 2022-09-30 11:21:00 99.90
1 2022-09-30 20:55:00 10.20
2 2022-10-01 10:35:00 5.99
3 2022-10-01 22:42:00 21.00
In [4]: df['day'] = df.timestamp.dt.floor('1D')
In [5]: df
Out[5]:
timestamp sales day
0 2022-09-30 11:21:00 99.90 2022-09-30
1 2022-09-30 20:55:00 10.20 2022-09-30
2 2022-10-01 10:35:00 5.99 2022-10-01
3 2022-10-01 22:42:00 21.00 2022-10-01
In [6]: df.groupby('day').sum()
Out[6]:
sales
day
2022-09-30 110.10
2022-10-01 26.99
You don't have to explicitly save the day in a new column; this works just as well:
df.groupby(df.timestamp.dt.floor('1D')).sum()
although I find it harder to read. See the docs on Series.dt.floor(). (Note that in recent pandas versions you may need .sum(numeric_only=True), or to select the 'sales' column first, so the non-numeric timestamp column is excluded from the sum.)
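If you prefer not to build the grouping key by hand at all, pd.Grouper expresses the same daily grouping directly; a small sketch using the df above:
# group the 'timestamp' column into daily bins without a helper column
df.groupby(pd.Grouper(key='timestamp', freq='1D'))['sales'].sum()
# timestamp
# 2022-09-30    110.10
# 2022-10-01     26.99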

How to index out open and close in pandas datetime dataframe?

Okay so I have a csv with minute data for the S&P 500 index for 2020, and I am looking to index out only the open and close for 9:30 and 16:00 (4:00 pm). In essence I just want what the market open and close was. So far the code is:
import pandas as pd
import datetime as dt
import numpy as np
d = pd.read_csv('/Volumes/Seagate Portable/usindex_2020_all_tickers_awvbxk9/SPX_2020_2020.txt')
d.columns = ['Dates', 'Open', 'High', 'Low', 'Close']
d.drop(['High', 'Low'], axis=1, inplace=True)
d.set_index('Dates', inplace=True)
d.head()
It wont let me share the csv file but this is what the output looks like:
Open Close
Dates
2020-01-02 09:31:00 3247.19 3245.22
2020-01-02 09:32:00 3245.07 3244.66
2020-01-02 09:33:00 3244.89 3247.61
2020-01-02 09:34:00 3247.38 3246.92
2020-01-02 09:35:00 3246.89 3249.09
I have tried using loc and dt.time, which I am assuming is the right approach; I just cannot think of the exact code to index out these two times. Any ideas? Thank you!
If the .dt extractor is used on the 'Dates' column (d.Dates.dt.time[0]), the .time component is datetime.time(9, 30). Therefore the Boolean match must compare against dtime(9, 30), not the string '09:30:00':
import pandas as pd
from datetime import time as dtime
# test dataframe
d = pd.DataFrame({
    'Dates': ['2020-01-02 09:30:00', '2020-01-02 09:31:00', '2020-01-02 09:32:00',
              '2020-01-02 09:33:00', '2020-01-02 09:34:00', '2020-01-02 09:35:00',
              '2020-01-02 16:00:00'],
    'Open': [3247.19, 3247.19, 3245.07, 3244.89, 3247.38, 3246.89, 3247.19],
    'Close': [3245.22, 3245.22, 3244.66, 3247.61, 3246.92, 3249.09, 3245.22]})
# display(d)
Dates Open Close
0 2020-01-02 09:30:00 3247.19 3245.22
1 2020-01-02 09:31:00 3247.19 3245.22
2 2020-01-02 09:32:00 3245.07 3244.66
3 2020-01-02 09:33:00 3244.89 3247.61
4 2020-01-02 09:34:00 3247.38 3246.92
5 2020-01-02 09:35:00 3246.89 3249.09
6 2020-01-02 16:00:00 3247.19 3245.22
# verify Dates is a datetime format
d.Dates = pd.to_datetime(d.Dates)
# use Boolean selection for 9:30 and 16:00 (4pm)
d = d[(d.Dates.dt.time == dtime(9, 30)) | (d.Dates.dt.time == dtime(16, 0))].copy()
# set the index
d.set_index('Dates', inplace=True)
# display(d)
Open Close
Dates
2020-01-02 09:30:00 3247.19 3245.22
2020-01-02 16:00:00 3247.19 3245.22
Try:
import pandas as pd
# create dummy daterange
date_range = pd.DatetimeIndex(pd.date_range("00:00", "23:59", freq='1min'))
# create df with enumerated column as data, and with daterange(DatetimeIndex) as index
df = pd.DataFrame(data=[i for i, d in enumerate(date_range)], index=date_range)
# boolean index using strings
four_and_nine = df[(df.index == '16:00:00') | (df.index == '21:00:00')]
print(four_and_nine)
0
2021-01-01 16:00:00 960
2021-01-01 21:00:00 1260
Pandas is pretty smart with comparing strings to actual datetimes(DatetimeIndex in this case).
Above is selecting top of the hour. If you wanted all minutes/seconds within specific hours, use boolean index like: df[(df.index.hour == 4) | (df.index.hour == 9)]
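As a further option (a sketch, assuming a DatetimeIndex as in the examples above), pandas also ships dedicated time selectors:
# rows at exactly 16:00, regardless of date
df.at_time('16:00')
# all rows between 09:30 and 16:00 (inclusive by default)
df.between_time('09:30', '16:00')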

Extend datetimeindex to previous times in pandas

MRE:
import numpy as np
import pandas as pd

idx = pd.date_range('2015-07-03 08:00:00', periods=30, freq='H')
data = np.random.randint(1, 100, size=len(idx))
df = pd.DataFrame({'index': idx, 'col': data})
df.set_index("index", inplace=True)
which looks like:
col
index
2015-07-03 08:00:00 96
2015-07-03 09:00:00 79
2015-07-03 10:00:00 15
2015-07-03 11:00:00 2
2015-07-03 12:00:00 84
2015-07-03 13:00:00 86
2015-07-03 14:00:00 5
.
.
.
Note that the dataframe contains multiple days. Since the frequency is hourly, starting from 07/03 08:00:00 it contains one row per hour.
I want the data to start from 05:00:00 on 07/03, even if "col" will hold 0 for those rows; that is, I want to extend the index backwards so it starts from 05:00:00.
No, I can't just start the range at 05:00:00, since I already have a dataframe that starts from 08:00:00. I am trying to keep everything the same but add 3 rows at the beginning for 05:00:00, 06:00:00, and 07:00:00.
The reindex method is handy for changing the index values:
idx = pd.date_range('2015-07-03 08:00:00', periods=30, freq='H')
data = np.random.randint(1, 100, size=len(idx))
# use the index param to set the index or you might lose the freq
df = pd.DataFrame({'col': data}, index=idx)
# reindex with a new index that starts 3 hours earlier
# (df.tshift(-3) also worked here, but tshift has since been removed from pandas)
start = df.index[0] - pd.Timedelta(hours=3)
end = df.index[-1]
new_index = pd.date_range(start, end, freq='H')
new_df = df.reindex(new_index)
resample is also very useful for date indices
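Since the question mentions wanting 0 rather than NaN in "col" for the added hours, note that reindex also takes a fill_value; a small sketch reusing new_index from above:
# fill the three newly added rows with 0 instead of NaN
new_df = df.reindex(new_index, fill_value=0)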
Just change the time from 08:00:00 to 05:00:00 in your code, create 3 more rows, and prepend them to the existing dataframe:
idx1 = pd.date_range('2015-07-03 05:00:00', periods=3, freq='H')
df1 = pd.DataFrame({'index': idx1, 'col': np.random.randint(1, 100, size=3)})
df1.set_index('index', inplace=True)
# DataFrame.append has been removed from pandas; use pd.concat instead
df = pd.concat([df1, df])
print(df)
Add this snippet to your code...

Pandas resample offset from the most recent year end date?

Can someone explain what is going on with my resampling?
For example,
In [53]: daily_3mo_treasury.resample('5Y').mean()
Out[53]:
1993-12-31 2.997120
1998-12-31 4.917730
2003-12-31 3.297176
2008-12-31 2.997204
2013-12-31 0.097330
2018-12-31 0.534476
where the last entry in my time series is 2018-08-23 (value 2.04).
I really want my resample bins anchored to the most recent year-end instead, so for example 2017-12-31, 2012-12-31, and so on.
I tried,
from datetime import date

end = daily_3mo_treasury.index.searchsorted(date(2017, 12, 31))
daily_3mo_treasury.iloc[:end].resample('5Y').mean()
In [66]: daily_3mo_treasury.iloc[:end].resample('5Y').mean()
Out[66]:
1993-12-31 2.997120
1998-12-31 4.917730
2003-12-31 3.297176
2008-12-31 2.997204
2013-12-31 0.097330
2018-12-31 0.333467
dtype: float64
where the last value in daily_3mo_treasury.iloc[:end] is 2017-12-29 (1.37).
How come my second 5-year resample does not end on 2017-12-31?
Edit: My index is sorted.
From @ALollz: when you resample, the bins are based on the first date in your index.
sistart = daily_3mo_treasury.index.searchsorted(date(1992,12,31))
siend = daily_3mo_treasury.index.searchsorted(date(2017,12,31))
In [95]: daily_3mo_treasury.iloc[sistart:siend].resample('5Y').mean()
Out[95]:
1992-12-31 3.080000
1997-12-31 4.562246
2002-12-31 4.050696
2007-12-31 2.925971
2012-12-31 0.360775
2017-12-31 0.278233
dtype: float64
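As a side note, since the index is sorted (per the edit above), plain .loc slicing is a simpler alternative to searchsorted for trimming the series before resampling; a sketch of the same idea (string slices on a DatetimeIndex are inclusive on both endpoints):
# keep 1992-12-31 through 2017-12-31, then resample as before
daily_3mo_treasury.loc['1992-12-31':'2017-12-31'].resample('5Y').mean()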

Modify hour in datetimeindex in pandas dataframe

I have a dataframe that looks like this:
master.head(5)
Out[73]:
hour price
day
2014-01-01 0 1066.24
2014-01-01 1 1032.11
2014-01-01 2 1028.53
2014-01-01 3 963.57
2014-01-01 4 890.65
In [74]: master.index.dtype
Out[74]: dtype('<M8[ns]')
What I need to do is update the hour in the index with the hour in the column but the following approaches don't work:
In [82]: master.index.hour = master.index.hour(master['hour'])
TypeError: 'numpy.ndarray' object is not callable
In [83]: master.index.hour = [master.index.hour(master.iloc[i,0]) for i in len(master.index.hour)]
TypeError: 'int' object is not iterable
How to proceed?
IIUC I think you want to construct a TimedeltaIndex:
In [89]:
df.index += pd.TimedeltaIndex(df['hour'], unit='h')
df
Out[89]:
hour price
2014-01-01 00:00:00 0 1066.24
2014-01-01 01:00:00 1 1032.11
2014-01-01 02:00:00 2 1028.53
2014-01-01 03:00:00 3 963.57
2014-01-01 04:00:00 4 890.65
Just to compare against using apply:
In [87]:
%timeit df.index + pd.TimedeltaIndex(df['hour'], unit='h')
%timeit df.index + df['hour'].apply(lambda x: pd.Timedelta(x, 'h'))
1000 loops, best of 3: 291 µs per loop
1000 loops, best of 3: 1.18 ms per loop
You can see that using a TimedeltaIndex is significantly faster.
master.index = pd.to_datetime(
    master.index.map(lambda x: x.strftime('%Y-%m-%d')) + '-' + master.hour.map(str),
    format='%Y-%m-%d-%H.0')
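An alternative sketch that skips the string round-trip: normalize the index down to midnight and add the hour column back as a timedelta (assuming 'hour' holds integer hours as in the example):
# midnight of each day + N hours, all vectorized
master.index = master.index.normalize() + pd.to_timedelta(master['hour'], unit='h')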
