I have a set of data points, some of which share the same date index:
import pandas as pd
df = pd.DataFrame({'Date': ["2016-01-08", "2016-01-15", "2016-01-15", "2016-01-23"],
                   'Set': ["1", "2", "3", "4"]})
df
Out[2]:
Date Set
0 2016-01-08 1
1 2016-01-15 2
2 2016-01-15 3
3 2016-01-23 4
How can I obtain a pandas DataFrame that has the business days of a specified period (say January 2016) as its index, with the numbers from df aligned to it?
df_out
Out[3]:
Set
2016-01-04 NaN
2016-01-05 NaN
2016-01-06 NaN
2016-01-07 NaN
2016-01-08 1
2016-01-11 NaN
2016-01-12 NaN
2016-01-13 NaN
2016-01-14 NaN
2016-01-15 2
2016-01-15 3
2016-01-18 NaN
2016-01-19 NaN
2016-01-20 NaN
2016-01-21 NaN
2016-01-22 NaN
2016-01-25 NaN
2016-01-26 NaN
2016-01-27 NaN
2016-01-28 NaN
2016-01-29 NaN
Since you're working on the DatetimeIndex, I rebuilt your example as a Series rather than a DataFrame (shifting the duplicated 2016-01-15 entry to 2016-01-16 so the index stays unique):
s = pd.Series({"2016-01-08": 1,
               "2016-01-15": 2,
               "2016-01-16": 3,
               "2016-01-23": 4})
Then I would assign the datetime index:
s.index = pd.DatetimeIndex(s.index)
Then I build the new index of business days only with:
bd = pd.bdate_range('2016-01-01', '2016-01-31')
and reindex the original Series against it:
s = s.reindex(bd)
This returns:
2016-01-01 NaN
2016-01-04 NaN
2016-01-05 NaN
2016-01-06 NaN
2016-01-07 NaN
2016-01-08 1
2016-01-11 NaN
2016-01-12 NaN
2016-01-13 NaN
2016-01-14 NaN
2016-01-15 2
2016-01-18 NaN
2016-01-19 NaN
2016-01-20 NaN
2016-01-21 NaN
2016-01-22 NaN
2016-01-25 NaN
2016-01-26 NaN
2016-01-27 NaN
2016-01-28 NaN
2016-01-29 NaN
Freq: B, dtype: float64
This does not handle the duplicate-index problem (reindex raises an error when the source index contains duplicate labels), but I hope it helps.
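If the duplicates matter, one workaround (a sketch, not the only option) is to avoid reindex entirely: build the business-day index as an empty DataFrame and left-join the original data onto it, since join tolerates a duplicated index on the right-hand side and simply repeats those days:
import pandas as pd
df = pd.DataFrame({'Date': ["2016-01-08", "2016-01-15", "2016-01-15", "2016-01-23"],
                   'Set': [1, 2, 3, 4]})
df['Date'] = pd.to_datetime(df['Date'])
# empty frame indexed by the business days of January 2016
bd = pd.DataFrame(index=pd.bdate_range('2016-01-01', '2016-01-31'))
# the left join keeps every business day and repeats 2016-01-15;
# 2016-01-23 is a Saturday, so Set 4 drops out, as in df_out above
df_out = bd.join(df.set_index('Date'))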
Related
My original dataframe looked like:
timestamp variables value
1 2017-05-26 19:46:41.289 inf 0.000000
2 2017-05-26 20:40:41.243 tubavg 225.489639
... ... ... ...
899541 2017-05-02 20:54:41.574 caspre 684.486450
899542 2017-04-29 11:17:25.126 tvol 50.895000
Now I want to bucket this dataset by time, which can be done with the code:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby(pd.Grouper(key='timestamp', freq='5min'))
But I also want all the different metrics to become columns in the new dataframe. For example the first two rows from the original dataframe would look like:
timestamp inf tubavg caspre tvol ...
1 2017-05-26 19:46:41.289 0.000000 225.489639 xxxxxxx xxxxx
... ... ... ...
xxxxx 2017-05-02 20:54:41.574 xxxxxx xxxxxx 684.486450 50.895000
As can be seen, time has been bucketed into 5-minute intervals, and each distinct entry in variables should become its own column across all the buckets, with each bucket taking on the timestamp of the first row that fell into it.
In order to solve this I have tried a couple of different approaches, but I keep running into errors.
Try unstacking the variables column from rows to columns with .unstack(1). The parameter is 1 because we want the second index column (0 would be the first).
Then, drop the level of the multi-index you just created to make it a little bit cleaner with .droplevel().
Finally, use pd.Grouper. Since the date/time is on the index, you don't need to specify a key.
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.set_index(['timestamp','variables']).unstack(1)
df.columns = df.columns.droplevel()
df = df.groupby(pd.Grouper(freq='5min')).mean().reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-04-29 11:20:00 NaN NaN NaN NaN
2 2017-04-29 11:25:00 NaN NaN NaN NaN
3 2017-04-29 11:30:00 NaN NaN NaN NaN
4 2017-04-29 11:35:00 NaN NaN NaN NaN
... ... ... ... ...
7885 2017-05-26 20:20:00 NaN NaN NaN NaN
7886 2017-05-26 20:25:00 NaN NaN NaN NaN
7887 2017-05-26 20:30:00 NaN NaN NaN NaN
7888 2017-05-26 20:35:00 NaN NaN NaN NaN
7889 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
Another way would be to .groupby the variables column as well and then .unstack(1) again; note that this keeps only the 5-minute buckets that actually contain data, rather than the full grid above:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby([pd.Grouper(freq='5min', key='timestamp'), 'variables']).mean().unstack(1)
df.columns = df.columns.droplevel()
df = df.reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-05-02 20:50:00 684.48645 NaN NaN NaN
2 2017-05-26 19:45:00 NaN 0.0 NaN NaN
3 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
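For what it's worth, pd.pivot_table can express the same group-and-unstack in a single call; a sketch starting again from the original long frame (and assuming value is numeric):
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
out = df.pivot_table(index=pd.Grouper(key='timestamp', freq='5min'),
                     columns='variables', values='value', aggfunc='mean')
Like the second approach, this should keep only the buckets that actually contain observations.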
I have the following data:
(Pdb) df1 = pd.DataFrame(
          {'id': ['SE0000195570', 'SE0000195570', 'SE0000195570', 'SE0000195570',
                  'SE0000191827', 'SE0000191827', 'SE0000191827', 'SE0000191827',
                  'SE0000191827'],
           'val': ['1', '2', '3', '4', '5', '6', '7', '8', '9'],
           'date': pd.to_datetime(['2014-10-23', '2014-07-16', '2014-04-29', '2014-01-31',
                                   '2018-10-19', '2018-07-11', '2018-04-20', '2018-02-16',
                                   '2018-12-29'])})
(Pdb) df1
id val date
0 SE0000195570 1 2014-10-23
1 SE0000195570 2 2014-07-16
2 SE0000195570 3 2014-04-29
3 SE0000195570 4 2014-01-31
4 SE0000191827 5 2018-10-19
5 SE0000191827 6 2018-07-11
6 SE0000191827 7 2018-04-20
7 SE0000191827 8 2018-02-16
8 SE0000191827 9 2018-12-29
UPDATE:
As per the suggestions of @user3483203 I have gotten a bit further, but not quite there. I've amended the example data above with a new row to illustrate the problem better.
(Pdb) df2.assign(calc=(df2.dropna()['val'].groupby(level=0).rolling(4).sum().shift(-3).reset_index(0, drop=True)))
id val date calc
id date
SE0000191827 2018-02-28 SE0000191827 8 2018-02-16 26.0
2018-03-31 NaN NaN NaT NaN
2018-04-30 SE0000191827 7 2018-04-20 27.0
2018-05-31 NaN NaN NaT NaN
2018-06-30 NaN NaN NaT NaN
2018-07-31 SE0000191827 6 2018-07-11 NaN
2018-08-31 NaN NaN NaT NaN
2018-09-30 NaN NaN NaT NaN
2018-10-31 SE0000191827 5 2018-10-19 NaN
2018-11-30 NaN NaN NaT NaN
2018-12-31 SE0000191827 9 2018-12-29 NaN
SE0000195570 2014-01-31 SE0000195570 4 2014-01-31 10.0
2014-02-28 NaN NaN NaT NaN
2014-03-31 NaN NaN NaT NaN
2014-04-30 SE0000195570 3 2014-04-29 NaN
2014-05-31 NaN NaN NaT NaN
2014-06-30 NaN NaN NaT NaN
2014-07-31 SE0000195570 2 2014-07-16 NaN
2014-08-31 NaN NaN NaT NaN
2014-09-30 NaN NaN NaT NaN
2014-10-31 SE0000195570 1 2014-10-23 NaN
For my requirements, the row (SE0000191827, 2018-03-31) should have a calc value since it has four consecutive rows with a value. Currently the row is being removed with the dropna call and I can't figure out how to solve that problem.
What I need
Calculations: The dates in my initial data are quarterly. However, I need to transform this data into monthly rows ranging between the first and last date of each id, and for each month calculate the sum of the four closest consecutive rows of the input data within that id. That's a mouthful. This led me to resample; see the expected output below. I need the data to be grouped by both id and the monthly dates.
Performance: The data I'm testing on now is just for benchmarking, but I will need the solution to be performant. I'm expecting to run this on upwards of 100k unique ids, which may result in around 12 million rows (100k ids, dates ranging back up to 10 years, 10 years * 12 months = 120 months per id, 100k * 120 = 12 million rows).
What I've tried
(Pdb) res = df.groupby('id').resample('M',on='date')
(Pdb) res.first()
id val date
id date
SE0000191827 2018-02-28 SE0000191827 8 2018-02-16
2018-03-31 NaN NaN NaT
2018-04-30 SE0000191827 7 2018-04-20
2018-05-31 NaN NaN NaT
2018-06-30 NaN NaN NaT
2018-07-31 SE0000191827 6 2018-07-11
2018-08-31 NaN NaN NaT
2018-09-30 NaN NaN NaT
2018-10-31 SE0000191827 5 2018-10-19
SE0000195570 2014-01-31 SE0000195570 4 2014-01-31
2014-02-28 NaN NaN NaT
2014-03-31 NaN NaN NaT
2014-04-30 SE0000195570 3 2014-04-29
2014-05-31 NaN NaN NaT
2014-06-30 NaN NaN NaT
2014-07-31 SE0000195570 2 2014-07-16
2014-08-31 NaN NaN NaT
2014-09-30 NaN NaN NaT
2014-10-31 SE0000195570 1 2014-10-23
This data looks very nice for my case since it's grouped by id and has the dates lined up by month. Here it seems like I could use something like df['val'].rolling(4), make sure it skips NaN values, and put that result in a new column.
Expected output (new column calc):
id val date calc
id date
SE0000191827 2018-02-28 SE0000191827 8 2018-02-16 26
2018-03-31 NaN NaN NaT
2018-04-30 SE0000191827 7 2018-04-20 NaN
2018-05-31 NaN NaN NaT
2018-06-30 NaN NaN NaT
2018-07-31 SE0000191827 6 2018-07-11 NaN
2018-08-31 NaN NaN NaT
2018-09-30 NaN NaN NaT
2018-10-31 SE0000191827 5 2018-10-19 NaN
SE0000195570 2014-01-31 SE0000195570 4 2014-01-31 10
2014-02-28 NaN NaN NaT
2014-03-31 NaN NaN NaT
2014-04-30 SE0000195570 3 2014-04-29 NaN
2014-05-31 NaN NaN NaT
2014-06-30 NaN NaN NaT
2014-07-31 SE0000195570 2 2014-07-16 NaN
2014-08-31 NaN NaN NaT
2014-09-30 NaN NaN NaT
2014-10-31 SE0000195570 1 2014-10-23 NaN
2014-11-30 NaN NaN NaT
2014-12-31 SE0000195570 1 2014-10-23 NaN
Here the result in calc is 26 for the first date since it sums that row's value and the three that follow within the id (8+7+6+5). The rest for that id is NaN since four consecutive values are not available.
The problems
While it may look like the data is grouped by id and date, it seems it's actually grouped only by date; I'm not sure how this works. I need the data to be grouped by id and date.
(Pdb) res['val'].get_group(datetime.date(2018,2,28))
7 6.730000e+08
Name: val, dtype: object
The result of the resample above returns a DatetimeIndexResamplerGroupby which doesn't have rolling...
(Pdb) res['val'].rolling(4)
*** AttributeError: 'DatetimeIndexResamplerGroupby' object has no attribute 'rolling'
What to do? My guess is that my approach is wrong but after scouring the documentation I'm not sure where to start.
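For reference, the direction I'm currently considering, as an untested sketch using the df1 defined above: compute the rolling sum per id on the raw quarterly rows first (so the NaN rows introduced by resampling never enter the window), then upsample to month-end:
import pandas as pd
df1['val'] = pd.to_numeric(df1['val'])
df1 = df1.sort_values(['id', 'date'])
# forward-looking sum of each row and the three rows that follow it,
# computed per id on the raw quarterly data
df1['calc'] = (df1.groupby('id')['val']
                  .transform(lambda s: s.rolling(4).sum().shift(-3)))
# then upsample each id to month-end; months without an observation
# become NaN/NaT as in the resample output above
res = df1.set_index('date').groupby('id').resample('M').first()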
I wonder if there is a way to upsample a DataFrame without having to decide immediately how NAs should be filled.
I tried the following but got this FutureWarning:
FutureWarning: .resample() is now a deferred operation use .resample(...).mean() instead of .resample(...)
Code:
import pandas as pd
dates = pd.date_range('2015-01-01', '2016-01-01', freq='BM')
dummy = [i for i in range(len(dates))]
df = pd.DataFrame({'A': dummy})
df.index = dates
df.resample('B')
Is there a better way to do this, that doesn't show warnings?
Thanks.
Use Resampler.asfreq:
print (df.resample('B').asfreq())
A
2015-01-30 0.0
2015-02-02 NaN
2015-02-03 NaN
2015-02-04 NaN
2015-02-05 NaN
2015-02-06 NaN
2015-02-09 NaN
2015-02-10 NaN
2015-02-11 NaN
2015-02-12 NaN
2015-02-13 NaN
2015-02-16 NaN
2015-02-17 NaN
2015-02-18 NaN
2015-02-19 NaN
2015-02-20 NaN
2015-02-23 NaN
2015-02-24 NaN
2015-02-25 NaN
2015-02-26 NaN
2015-02-27 1.0
2015-03-02 NaN
2015-03-03 NaN
2015-03-04 NaN
...
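.asfreq() just inserts the new rows without filling anything, so the fill decision is deferred; if you later do want to fill, you can chain it on, e.g.:
df.resample('B').asfreq().ffill()  # or .bfill(), .interpolate(), ...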
I'm attempting to convert a DataFrame into a Series using code which, simplified, looks like this:
dates = ['2016-1-{}'.format(i) for i in range(1, 21)]
values = [i for i in range(20)]
data = {'Date': dates, 'Value': values}
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])
ts = pd.Series(df['Value'], index=df['Date'])
print(ts)
However, print output looks like this:
Date
2016-01-01 NaN
2016-01-02 NaN
2016-01-03 NaN
2016-01-04 NaN
2016-01-05 NaN
2016-01-06 NaN
2016-01-07 NaN
2016-01-08 NaN
2016-01-09 NaN
2016-01-10 NaN
2016-01-11 NaN
2016-01-12 NaN
2016-01-13 NaN
2016-01-14 NaN
2016-01-15 NaN
2016-01-16 NaN
2016-01-17 NaN
2016-01-18 NaN
2016-01-19 NaN
2016-01-20 NaN
Name: Value, dtype: float64
Where does NaN come from? Is a view on a DataFrame object not a valid input for the Series class?
I have found the to_series function for pd.Index objects; is there something similar for DataFrames?
I think you can use values; it converts the column Value to an array:
ts = pd.Series(df['Value'].values, index=df['Date'])
import pandas as pd
import numpy as np
import io
dates = ['2016-1-{}'.format(i) for i in range(1, 21)]
values = [i for i in range(20)]
data = {'Date': dates, 'Value': values}
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])
print(df['Value'].values)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
ts = pd.Series(df['Value'].values, index=df['Date'])
print(ts)
Date
2016-01-01 0
2016-01-02 1
2016-01-03 2
2016-01-04 3
2016-01-05 4
2016-01-06 5
2016-01-07 6
2016-01-08 7
2016-01-09 8
2016-01-10 9
2016-01-11 10
2016-01-12 11
2016-01-13 12
2016-01-14 13
2016-01-15 14
2016-01-16 15
2016-01-17 16
2016-01-18 17
2016-01-19 18
2016-01-20 19
dtype: int64
Or you can use:
ts1 = pd.Series(data=values, index=pd.to_datetime(dates))
print(ts1)
2016-01-01 0
2016-01-02 1
2016-01-03 2
2016-01-04 3
2016-01-05 4
2016-01-06 5
2016-01-07 6
2016-01-08 7
2016-01-09 8
2016-01-10 9
2016-01-11 10
2016-01-12 11
2016-01-13 12
2016-01-14 13
2016-01-15 14
2016-01-16 15
2016-01-17 16
2016-01-18 17
2016-01-19 18
2016-01-20 19
dtype: int64
Thank you @ajcr for the better explanation of why you get NaN:
When you give a Series or DataFrame column to pd.Series, it will reindex it using the index you specify. Since your DataFrame column has an integer index (not a date index) you get lots of missing values.
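A tiny demonstration of that reindexing behaviour, on made-up data:
s2 = pd.Series([10, 20], index=[0, 1])
# the constructor reindexes s2 against the new labels; neither 'a'
# nor 'b' exists in s2's integer index, so both become NaN
print(pd.Series(s2, index=['a', 'b']))
a   NaN
b   NaN
dtype: float64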
You can just do:
s = df.set_index('Date')
which is now a one-column DataFrame.
If you really want it as a Series:
s = df.set_index('Date').Value
By the way, NaN is NumPy's Not-a-Number.
Using your method, you could use:
ts = pd.Series(df['Value'].values, name='Value', index=df['Date'])
The reason you are getting the NaNs is that you are not providing the data in the correct format: you are passing a Series, which carries its own integer index, to the Series constructor.
If you are only looking to create a Series with those values, you could also have done:
pd.Series([i for i in range(20)], pd.date_range('2016-01-01', periods=20, freq='D'))
I have 2 datasets (cex2.txt and cex3.txt) which I would like to resample in pandas. With one dataset I get the expected output; with the other I don't.
The datasets are tick data and are formatted identically; in fact, the two datasets are just from two different days.
import pandas as pd
import datetime as dt
pd.set_option('display.mpl_style', 'default')
time_converter = lambda x: dt.datetime.fromtimestamp(float(x))
data_frame = pd.read_csv('cex2.txt', sep=';', converters={'time': time_converter})
data_frame.drop('Unnamed: 7', axis=1, inplace=True)
data_frame.drop('low', axis=1, inplace=True)
data_frame.drop('high', axis=1, inplace=True)
data_frame.drop('last', axis=1, inplace=True)
data_frame = data_frame.reindex_axis(['time', 'ask', 'bid', 'vol'], axis=1)
data_frame.set_index(pd.DatetimeIndex(data_frame['time']), inplace=True)
ask = data_frame['ask'].resample('15Min', how='ohlc')
bid = data_frame['bid'].resample('15Min', how='ohlc')
vol = data_frame['vol'].resample('15Min', how='sum')
print ask
From the cex2.txt dataset I get this wrong output:
open high low close
1970-01-01 01:00:00 NaN NaN NaN NaN
1970-01-01 01:15:00 NaN NaN NaN NaN
1970-01-01 01:30:00 NaN NaN NaN NaN
1970-01-01 01:45:00 NaN NaN NaN NaN
1970-01-01 02:00:00 NaN NaN NaN NaN
1970-01-01 02:15:00 NaN NaN NaN NaN
1970-01-01 02:30:00 NaN NaN NaN NaN
1970-01-01 02:45:00 NaN NaN NaN NaN
1970-01-01 03:00:00 NaN NaN NaN NaN
1970-01-01 03:15:00 NaN NaN NaN NaN
From the cex3.txt dataset I get correct values:
open high low close
2014-08-10 13:30:00 0.003483 0.003500 0.003483 0.003485
2014-08-10 13:45:00 0.003485 0.003570 0.003467 0.003471
2014-08-10 14:00:00 0.003471 0.003500 0.003470 0.003494
2014-08-10 14:15:00 0.003494 0.003500 0.003493 0.003498
2014-08-10 14:30:00 0.003498 0.003549 0.003498 0.003500
2014-08-10 14:45:00 0.003500 0.003533 0.003487 0.003533
2014-08-10 15:00:00 0.003533 0.003600 0.003520 0.003587
I'm really at my wits' end. Does anyone have an idea why this happens?
Edit:
Here are the data sources:
https://dl.dropboxusercontent.com/u/14055520/cex2.txt
https://dl.dropboxusercontent.com/u/14055520/cex3.txt
Thanks!
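One thing worth checking first (only a guess: 1970-01-01 is the Unix epoch, so those bucket labels suggest the time column ended up as values near zero instead of real timestamps):
print data_frame['time'].head()
print data_frame.index.min(), data_frame.index.max()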