Is datetime data in Pandas supposed to be in the index? - python

By supposed to, what I mean is
Is that the way Pandas is designed?, Are all Pandas time series functions built upon that assumption?
A few weeks ago I was experimenting with pandas.rolling_mean which seemed to want the datetime to be in the index.
Given a dataframe like this:
df = pd.DataFrame({'date' : ['23/10/2017', '24/10/2017', '25/10/2017','26/10/2017','27/10/2017'], 'dax-close' : [13003.14, 13013.19, 12953.41,13133.28,13217.54]})
df['date'] = pd.to_datetime(df['date'])
df
...is it important to always do this:
df.set_index('date', inplace=True)
df
...as one of the first steps of an analysis?

The short answer is usually timeseries data has date as a DatetimeIndex. and many pandas functions do make use of that e.g. resample is a big one.
That said, you don't need to have Dates as an index, for example you may even have multiple Datetime columns, then you're out of luck calling the vanilla resample... however you can use pd.Grouper to define the "resample" on a column (or as part of a larger/multi-column groupby)
In [11]: df.groupby(pd.Grouper(key="date", freq="2D")).sum()
Out[11]:
dax-close
date
2017-10-23 26016.33
2017-10-25 26086.69
2017-10-27 13217.54
In [12]: df.set_index("date").resample("2D").sum()
Out[12]:
dax-close
date
2017-10-23 26016.33
2017-10-25 26086.69
2017-10-27 13217.54
The former gives more flexibility in that you can groupby multiple columns:
In [21]: df["X"] = list("AABAC")
In [22]: df.groupby(["X", pd.Grouper(key="date", freq="2D")]).sum()
Out[22]:
dax-close
X date
A 2017-10-23 26016.33
2017-10-25 13133.28
B 2017-10-25 12953.41
C 2017-10-27 13217.54

Related

How to fill in missing dates and values in a Pandas DataFrame?

so the data set I am using is only business days but I want to change the date index such that it reflects every calendar day. When I use reindex and have to use reindex(), I am unsure how to use 'fill value' field of reindex to inherit the value above.
import pandas as pd
idx = pd.date_range("12/18/2019","12/24/2019")
df = pd.Series({'12/18/2019':22.63,
'12/19/2019':22.2,
'12/20/2019':21.03,
'12/23/2019':17,
'12/24/2019':19.65})
df.index = pd.DatetimeIndex(df.index)
df = df.reindex()
Currently, my data set looks like this.
However, when I use reindex I get the below result
In reality I want it to inherit the values directly above if it is a NaN result so the data set becomes the following
Thank you guys for your help!
You were close! You just need to pass the index you want to reindex on (idx in this case) as a parameter to the reindex method, and then you can set the method parameter to 'ffill' to propagate the last valid value forward.
idx = pd.date_range("12/18/2019","12/24/2019")
df = pd.Series({'12/18/2019':22.63,
'12/19/2019':22.2,
'12/20/2019':21.03,
'12/23/2019':17,
'12/24/2019':19.65})
df.index = pd.DatetimeIndex(df.index)
df = df.reindex(idx, method='ffill')
It seems that you have created a 'Series', not a dataframe. See if the code below helps you.
df = df.to_frame().reset_index() #to convert series to dataframe
df = df.fillna(method='ffill')
print(df)
Output You will have to rename columns
index 0
0 2019-12-18 22.63
1 2019-12-19 22.20
2 2019-12-20 21.03
3 2019-12-21 21.03
4 2019-12-22 21.03
5 2019-12-23 17.00
6 2019-12-24 19.65

Extracting just Month and Year separately from Pandas Datetime column

I have a Dataframe, df, with the following column:
df['ArrivalDate'] =
...
936 2012-12-31
938 2012-12-29
965 2012-12-31
966 2012-12-31
967 2012-12-31
968 2012-12-31
969 2012-12-31
970 2012-12-29
971 2012-12-31
972 2012-12-29
973 2012-12-29
...
The elements of the column are pandas.tslib.Timestamp.
I want to just include the year and month. I thought there would be simple way to do it, but I can't figure it out.
Here's what I've tried:
df['ArrivalDate'].resample('M', how = 'mean')
I got the following error:
Only valid with DatetimeIndex or PeriodIndex
Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I got the following error:
'Timestamp' object has no attribute '__getitem__'
Any suggestions?
Edit: I sort of figured it out.
df.index = df['ArrivalDate']
Then, I can resample another column using the index.
But I'd still like a method for reconfiguring the entire column. Any ideas?
If you want new columns showing year and month separately you can do this:
df['year'] = pd.DatetimeIndex(df['ArrivalDate']).year
df['month'] = pd.DatetimeIndex(df['ArrivalDate']).month
or...
df['year'] = df['ArrivalDate'].dt.year
df['month'] = df['ArrivalDate'].dt.month
Then you can combine them or work with them just as they are.
The df['date_column'] has to be in date time format.
df['month_year'] = df['date_column'].dt.to_period('M')
You could also use D for Day, 2M for 2 Months etc. for different sampling intervals, and in case one has time series data with time stamp, we can go for granular sampling intervals such as 45Min for 45 min, 15Min for 15 min sampling etc.
You can directly access the year and month attributes, or request a datetime.datetime:
In [15]: t = pandas.tslib.Timestamp.now()
In [16]: t
Out[16]: Timestamp('2014-08-05 14:49:39.643701', tz=None)
In [17]: t.to_pydatetime() #datetime method is deprecated
Out[17]: datetime.datetime(2014, 8, 5, 14, 49, 39, 643701)
In [18]: t.day
Out[18]: 5
In [19]: t.month
Out[19]: 8
In [20]: t.year
Out[20]: 2014
One way to combine year and month is to make an integer encoding them, such as: 201408 for August, 2014. Along a whole column, you could do this as:
df['YearMonth'] = df['ArrivalDate'].map(lambda x: 100*x.year + x.month)
or many variants thereof.
I'm not a big fan of doing this, though, since it makes date alignment and arithmetic painful later and especially painful for others who come upon your code or data without this same convention. A better way is to choose a day-of-month convention, such as final non-US-holiday weekday, or first day, etc., and leave the data in a date/time format with the chosen date convention.
The calendar module is useful for obtaining the number value of certain days such as the final weekday. Then you could do something like:
import calendar
import datetime
df['AdjustedDateToEndOfMonth'] = df['ArrivalDate'].map(
lambda x: datetime.datetime(
x.year,
x.month,
max(calendar.monthcalendar(x.year, x.month)[-1][:5])
)
)
If you happen to be looking for a way to solve the simpler problem of just formatting the datetime column into some stringified representation, for that you can just make use of the strftime function from the datetime.datetime class, like this:
In [5]: df
Out[5]:
date_time
0 2014-10-17 22:00:03
In [6]: df.date_time
Out[6]:
0 2014-10-17 22:00:03
Name: date_time, dtype: datetime64[ns]
In [7]: df.date_time.map(lambda x: x.strftime('%Y-%m-%d'))
Out[7]:
0 2014-10-17
Name: date_time, dtype: object
If you want the month year unique pair, using apply is pretty sleek.
df['mnth_yr'] = df['date_column'].apply(lambda x: x.strftime('%B-%Y'))
Outputs month-year in one column.
Don't forget to first change the format to date-time before, I generally forget.
df['date_column'] = pd.to_datetime(df['date_column'])
SINGLE LINE: Adding a column with 'year-month'-paires:
('pd.to_datetime' first changes the column dtype to date-time before the operation)
df['yyyy-mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y-%m')
Accordingly for an extra 'year' or 'month' column:
df['yyyy'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y')
df['mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%m')
Extracting the Year say from ['2018-03-04']
df['Year'] = pd.DatetimeIndex(df['date']).year
The df['Year'] creates a new column. While if you want to extract the month just use .month
You can first convert your date strings with pandas.to_datetime, which gives you access to all of the numpy datetime and timedelta facilities. For example:
df['ArrivalDate'] = pandas.to_datetime(df['ArrivalDate'])
df['Month'] = df['ArrivalDate'].values.astype('datetime64[M]')
#KieranPC's solution is the correct approach for Pandas, but is not easily extendible for arbitrary attributes. For this, you can use getattr within a generator comprehension and combine using pd.concat:
# input data
list_of_dates = ['2012-12-31', '2012-12-29', '2012-12-30']
df = pd.DataFrame({'ArrivalDate': pd.to_datetime(list_of_dates)})
# define list of attributes required
L = ['year', 'month', 'day', 'dayofweek', 'dayofyear', 'weekofyear', 'quarter']
# define generator expression of series, one for each attribute
date_gen = (getattr(df['ArrivalDate'].dt, i).rename(i) for i in L)
# concatenate results and join to original dataframe
df = df.join(pd.concat(date_gen, axis=1))
print(df)
ArrivalDate year month day dayofweek dayofyear weekofyear quarter
0 2012-12-31 2012 12 31 0 366 1 4
1 2012-12-29 2012 12 29 5 364 52 4
2 2012-12-30 2012 12 30 6 365 52 4
Thanks to jaknap32, I wanted to aggregate the results according to Year and Month, so this worked:
df_join['YearMonth'] = df_join['timestamp'].apply(lambda x:x.strftime('%Y%m'))
Output was neat:
0 201108
1 201108
2 201108
There is two steps to extract year for all the dataframe without using method apply.
Step1
convert the column to datetime :
df['ArrivalDate']=pd.to_datetime(df['ArrivalDate'], format='%Y-%m-%d')
Step2
extract the year or the month using DatetimeIndex() method
pd.DatetimeIndex(df['ArrivalDate']).year
df['Month_Year'] = df['Date'].dt.to_period('M')
Result :
Date Month_Year
0 2020-01-01 2020-01
1 2020-01-02 2020-01
2 2020-01-03 2020-01
3 2020-01-04 2020-01
4 2020-01-05 2020-01
df['year_month']=df.datetime_column.apply(lambda x: str(x)[:7])
This worked fine for me, didn't think pandas would interpret the resultant string date as date, but when i did the plot, it knew very well my agenda and the string year_month where ordered properly... gotta love pandas!
Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I think here the proper input should be string.
df['ArrivalDate'].astype(str).apply(lambda(x):x[:-2])

Convert tick data to daily

I'd like to convert a csv file with tick data to daily prices and volume. the csv file I have is formatted as: unix,price,volume.
the groupby function has only gotten me to group by unix seconds. What is a good way to get daily close prices AND the sum of volume for each day?
Im working with python 2.7 and also have pandas installed, but im not very familiar with it yet.
really, the furthest I've got anything to work is this:
import pandas as pd
data = pd.read_csv('file.csv',names=['unix','price','vol'])
datagr = data.groupby('unix')
dataPrice = datagr['price'].last()
dataVol = datagr['vol'].sum()
Sample data:
1391067323,772.000000000000,0.020200000000
1391067323,772.000000000000,0.020000000000
1391067323,771.379000000000,1.389480000000
1391067323,772.000000000000,1.244540000000
1391067326,774.955000000000,0.084830600000
1391067326,774.955000000000,0.084833400000
1391067327,774.955000000000,0.084830600000
1391067331,774.953000000000,0.200000000000
1391067336,774.951000000000,0.101202000000
This retrieves the last price per unix second and sums the volume of trades that took place within the unix second. The problem is that it groups to the unix second, and I don't want to use any super convoluted method because of time considerations
You can convert unix time to pandas' datetime using to_datetime:
df['unix'] = pd.to_datetime(df['unix'], unit='s')
Now you can now set this as the index and resample:
df = df.set_index('unix')
df.resample('D', how={'volume': 'sum', 'price': 'last'})
Note: We're using different methods for the respective columns.
Example:
In [11]: df = pd.DataFrame(np.random.randn(5, 2), pd.date_range('2014-01-01', periods=5, freq='H'), columns=list('AB'))
In [12]: df
Out[12]:
A B
2014-01-01 00:00:00 -1.185459 -0.854037
2014-01-01 01:00:00 -1.232376 -0.817346
2014-01-01 02:00:00 0.478683 -0.467169
2014-01-01 03:00:00 -0.407009 0.290612
2014-01-01 04:00:00 0.181207 -0.171356
In [13]: df.resample('D', how={'A': 'sum', 'B': 'last'})
Out[13]:
A B
2014-01-01 -2.164955 -0.171356

Pandas Timedelta in Days

I have a dataframe in pandas called 'munged_data' with two columns 'entry_date' and 'dob' which i have converted to Timestamps using pd.to_timestamp.I am trying to figure out how to calculate ages of people based on the time difference between 'entry_date' and 'dob' and to do this i need to get the difference in days between the two columns ( so that i can then do somehting like round(days/365.25). I do not seem to be able to find a way to do this using a vectorized operation. When I do munged_data.entry_date-munged_data.dob i get the following :
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However i do not seem to be able to extract the days as an integer so that i can continue with my calculation.
Any help appreciated.
Using the Pandas type Timedelta available since v0.15.0 you also can do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
You need 0.11 for this (0.11rc1 is out, final prob next week)
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because not yet full support for timedelta64[ns] scalars (e.g. like how we use Timestamps now for datetime64[ns], coming in 0.12)
Not sure if you still need it, but in Pandas 0.14 i usually use .astype('timedelta64[X]') method
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.ix[0]-df.ix[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.ix[0]-df.ix[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
Let's specify that you have a pandas series named time_difference which has type
numpy.timedelta64[ns]
One way of extracting just the day (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.tslib.Timedelta(x).days)
This function is used because the numpy.timedelta64 object does not have a 'days' attribute.
To convert any type of data into days just use pd.Timedelta().days:
pd.Timedelta(1985, unit='Y').days
84494

Forcing dates to conform to a given frequency in pandas

Suppose we have a monthly time series, possibly with missing months, and upon loading the data into a pandas Series object with DatetimeIndex we wish to make sure each date observation is labeled as an end-of-month date. However, the raw input dates may fall anywhere in the month, so we need to force them to end-of-month observations.
My first thought was to do something like this:
import pandas as pd
pd.DatetimeIndex([datetime(2012,1,20), datetime(2012,7,31)], freq='M')
However, this just leaves the dates as is [2012-01-20,2012-07-31] and does not force them to end-of-month values [2012-01-31,2012-07-31].
My second attempt was:
ix = pd.DatetimeIndex([datetime(2012,1,20), datetime(2012,7,31)], freq='M')
s = pd.Series(np.random.randn(len(ix)), index=ix)
s.asfreq('M')
But this gives:
2012-01-31 NaN
2012-02-29 NaN
2012-03-31 NaN
2012-04-30 NaN
2012-05-31 NaN
2012-06-30 NaN
2012-07-31 0.79173
Freq: M
as under the hood the asfreq function is calling date_range for a DatetimeIndex.
This problem is easily solved if I'm using PeriodIndex instead of DatetimeIndex; however, I need to support some frequencies that are not currently supported by PeriodIndex and as far as I know there is no way to extend pandas with my own Period frequencies.
It's a workaround, but it works without using periodindex:
from pandas.tseries.offsets import *
In [164]: s
Out[164]:
2012-01-20 -1.266376
2012-07-31 -0.865573
In [165]: s.index=s.index+MonthEnd(n=0)
In [166]: s
Out[166]:
2012-01-31 -1.266376
2012-07-31 -0.865573

Categories