Convert tick data to daily - python

I'd like to convert a csv file with tick data to daily prices and volume. The csv file I have is formatted as: unix,price,volume.
The groupby function has only gotten me as far as grouping by unix second. What is a good way to get daily close prices AND the sum of volume for each day?
I'm working with Python 2.7 and also have pandas installed, but I'm not very familiar with it yet.
Really, the furthest I've gotten anything to work is this:
import pandas as pd

data = pd.read_csv('file.csv', names=['unix', 'price', 'vol'])  # headerless csv
datagr = data.groupby('unix')           # one group per unix second
dataPrice = datagr['price'].last()      # last price within each second
dataVol = datagr['vol'].sum()           # total volume within each second
Sample data:
1391067323,772.000000000000,0.020200000000
1391067323,772.000000000000,0.020000000000
1391067323,771.379000000000,1.389480000000
1391067323,772.000000000000,1.244540000000
1391067326,774.955000000000,0.084830600000
1391067326,774.955000000000,0.084833400000
1391067327,774.955000000000,0.084830600000
1391067331,774.953000000000,0.200000000000
1391067336,774.951000000000,0.101202000000
This retrieves the last price per unix second and sums the volume of trades that took place within that second. The problem is that it groups by the unix second rather than by day, and for performance reasons I'd rather avoid anything overly convoluted.

You can convert unix time to pandas datetime using to_datetime:
df['unix'] = pd.to_datetime(df['unix'], unit='s')
Now you can set this as the index and resample (in modern pandas the old how= argument is gone; pass the per-column methods to agg):
df = df.set_index('unix')
df.resample('D').agg({'vol': 'sum', 'price': 'last'})
Note: we're using different aggregation methods for the respective columns.
Example:
In [11]: df = pd.DataFrame(np.random.randn(5, 2), pd.date_range('2014-01-01', periods=5, freq='H'), columns=list('AB'))
In [12]: df
Out[12]:
A B
2014-01-01 00:00:00 -1.185459 -0.854037
2014-01-01 01:00:00 -1.232376 -0.817346
2014-01-01 02:00:00 0.478683 -0.467169
2014-01-01 03:00:00 -0.407009 0.290612
2014-01-01 04:00:00 0.181207 -0.171356
In [13]: df.resample('D').agg({'A': 'sum', 'B': 'last'})
Out[13]:
A B
2014-01-01 -2.164955 -0.171356
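Putting the answer together for the file in the question (a minimal sketch; 'file.csv' and the column names follow the asker's code):

import pandas as pd

# read the headerless tick file and parse the unix column as seconds
data = pd.read_csv('file.csv', names=['unix', 'price', 'vol'])
data['unix'] = pd.to_datetime(data['unix'], unit='s')

# daily close (last trade of the day) and total traded volume
daily = data.set_index('unix').resample('D').agg({'price': 'last', 'vol': 'sum'})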

Related

H2O Python - how to convert an H2OFrame to a DataFrame with correct characters and datetimes

I have a csv file and want to use H2O for deep learning. But it contains some Chinese text and datetime columns, and when I finish my deep learning and need to save the output to csv, it can't be converted back to the original data.
I'll use a small example to show my problem here.
In[1]: df = pd.DataFrame({'datetime':['2016-12-17 00:00:00'],'time':['00:00:30'],'month':['月'], 'weekend':['周六']})
print(df.dtypes)
df
out[1]:
datetime    object
time        object
month       object
weekend     object
dtype: object

             datetime      time month weekend
0 2016-12-17 00:00:00  00:00:30     月     周六
In[2]: h2o_frame = h2o.H2OFrame(df);h2o_frame ;h2o_frame.types ;h2o_frame
C:\Users\thi\Anaconda3\lib\site-packages\h2o\utils\shared_utils.py:170: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
data = _handle_python_lists(python_obj.as_matrix().tolist(), -1)[1]
out[2]: Parse progress: |█████████████████████████████████████████████████████████| 100%
datetime             time                 month     weekend
2016-12-17 00:00:00  1970-01-01 00:00:30  <0xA4EB>  <0xA9>P<0xA4BB>
For the time column I just want 00:00:30; is there any way to fix that?
For month and weekend I haven't found any way to make them show the Chinese characters, but I can still finish my deep learning.
The real problem is that when I convert the h2oframe back to a DataFrame and save it to a csv file, it saves <0xA4EB> for me instead of 月, and the datetime changes to an int:
In[3]: dff = h2o_frame.as_data_frame();dff
out[3]:
        datetime   time     month          weekend
0  1481932800000  30000  <0xA4EB>  <0xA9>P<0xA4BB>
How do I correctly get characters back from an h2oframe to a DataFrame?
How do I correctly get datetimes back from an h2oframe to a DataFrame?
The simplest way to solve this is to pass the column_types argument when you convert the pandas frame to an H2OFrame, as below:
In [69]: col_types
Out[69]: ['categorical', 'categorical', 'categorical', 'categorical']
In [70]: h2o_frame = h2o.H2OFrame(df,column_types=col_types);h2o_frame ;h2o_frame.types ;h2o_frame
Parse progress: |█████████████████████████████████████████████████████████████████████████████| 100%
Out[70]:
datetime month time weekend
------------------- ------- -------- ---------
2016-12-17 00:00:00 月 00:00:30 周六
[1 row x 4 columns]
In [71]: dff = h2o_frame.as_data_frame();dff
Out[71]:
datetime month time weekend
0 2016-12-17 00:00:00 月 00:00:30 周六
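If the frame has many columns, the col_types list does not have to be typed out by hand. A minimal sketch, assuming every column should be treated as categorical as in the example above:

col_types = ['categorical'] * len(df.columns)
h2o_frame = h2o.H2OFrame(df, column_types=col_types)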
If you later read the csv files back with h2o.import_file, the datetime column comes back as a millisecond epoch; convert it with to_datetime:
allfiles = h2o.import_file(path='data/', pattern=".csv")
df = allfiles.as_data_frame()
df['datetime'] = pd.to_datetime(df["datetime"], unit='ms')

Pandas index setting and importing values to columns

I am new to Python and working my way through my crawling project. I have two questions regarding a few pandas features.
Below is my data table "js"
apple banana
period
2017-01-01 100.00000 22.80130
2017-02-01 94.13681 16.28664
2017-03-01 85.34201 13.68078
2017-04-01 65.79804 9.77198
2017-05-01 43.32247 13.35504
2017-06-01 72.63843 9.44625
2017-07-01 78.82736 9.77198
2017-08-01 84.03908 10.09771
2017-09-01 90.55374 13.35504
2017-10-01 86.64495 9.12052
Below is my code to apply apple and banana values to new DataFrame.
import pandas as pd
from datetime import datetime, timedelta
dd = pd.date_range('2017-01-01',datetime.now().date() - timedelta(1))
df = pd.DataFrame.set_index(dd)  # this part has an error
The first step is to set my df index as a date_range ('2017-01-01' to yesterday, daily), and the error message says I am missing 1 required positional argument: 'keys'. Is it possible to set the index as daily dates from '2017-01-01' to yesterday?
After that is solved, I am trying to put my "js" data such as 'apple' and 'banana' in as columns, with each value placed at the matching df index date. This example only shows the 'apple' and 'banana' columns, but in my real data set I have thousands more...
Please let me know an efficient way to solve my problem. Thanks in advance!
------------------EDIT------------------------
The date indexing works perfectly with @COLDSPEED's answer.
dd = pd.date_range('2017-01-01',datetime.now().date() - timedelta(1))
df.index = pd.to_datetime(df.index) # ignore if not needed
df = df.reindex(dd, fill_value=0.0)
One problem is that if I have another dataframe "js2" (below) and want to combine these data in the single df above, I believe it will not work. Any suggestions?
kiwi mango
period
2017-01-01 9.03614 100.00000
2017-02-01 5.42168 35.54216
2017-03-01 7.83132 50.00000
2017-04-01 10.24096 55.42168
2017-05-01 10.84337 60.84337
2017-06-01 12.04819 65.66265
2017-07-01 17.46987 34.93975
2017-08-01 9.03614 30.72289
2017-09-01 9.63855 56.02409
2017-10-01 12.65060 45.18072
You can use pd.to_datetime and pd.Timedelta -
idx = pd.date_range('2017-01-01', pd.to_datetime('today') - pd.Timedelta(days=1))
idx
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
'2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
'2017-01-09', '2017-01-10',
...
'2017-11-18', '2017-11-19', '2017-11-20', '2017-11-21',
'2017-11-22', '2017-11-23', '2017-11-24', '2017-11-25',
'2017-11-26', '2017-11-27'],
dtype='datetime64[ns]', length=331, freq='D')
You can then use this to reindex your dataframe -
df.index = pd.to_datetime(df.index) # ignore if not needed
df = df.reindex(idx, fill_value=0.0)
If your dates are day-first (day before month), make sure you specify that when converting your index -
df.index = pd.to_datetime(df.index, dayfirst=True)
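As for combining "js" and "js2" from the edit: since both frames share the same monthly period index, you can join them side by side before reindexing. A minimal sketch, assuming js and js2 are already loaded as shown:

combined = pd.concat([js, js2], axis=1)          # aligns kiwi/mango with apple/banana on the index
combined.index = pd.to_datetime(combined.index)  # ignore if not needed
combined = combined.reindex(idx, fill_value=0.0)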

Is datetime data in Pandas supposed to be in the index?

By "supposed to", what I mean is:
Is that the way pandas is designed? Are all pandas time series functions built on that assumption?
A few weeks ago I was experimenting with pandas.rolling_mean which seemed to want the datetime to be in the index.
Given a dataframe like this:
df = pd.DataFrame({'date' : ['23/10/2017', '24/10/2017', '25/10/2017','26/10/2017','27/10/2017'], 'dax-close' : [13003.14, 13013.19, 12953.41,13133.28,13217.54]})
df['date'] = pd.to_datetime(df['date'])
df
...is it important to always do this:
df.set_index('date', inplace=True)
df
...as one of the first steps of an analysis?
The short answer is: usually time series data does have the date as a DatetimeIndex, and many pandas functions do make use of that; resample is a big one.
That said, you don't need to have dates in the index. For example, you may even have multiple datetime columns, in which case you're out of luck calling vanilla resample... however you can use pd.Grouper to define the "resample" on a column (or as part of a larger/multi-column groupby); a sketch of the multiple-datetime-column case follows the examples below.
In [11]: df.groupby(pd.Grouper(key="date", freq="2D")).sum()
Out[11]:
dax-close
date
2017-10-23 26016.33
2017-10-25 26086.69
2017-10-27 13217.54
In [12]: df.set_index("date").resample("2D").sum()
Out[12]:
dax-close
date
2017-10-23 26016.33
2017-10-25 26086.69
2017-10-27 13217.54
The former gives more flexibility in that you can groupby multiple columns:
In [21]: df["X"] = list("AABAC")
In [22]: df.groupby(["X", pd.Grouper(key="date", freq="2D")]).sum()
Out[22]:
dax-close
X date
A 2017-10-23 26016.33
2017-10-25 13133.28
B 2017-10-25 12953.41
C 2017-10-27 13217.54
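For the multiple-datetime-column case mentioned above, each column can get its own Grouper. A minimal sketch with hypothetical 'ordered' and 'shipped' columns:

import pandas as pd

orders = pd.DataFrame({
    'ordered': pd.date_range('2017-10-23', periods=5, freq='D'),
    'shipped': pd.date_range('2017-10-25', periods=5, freq='D'),
    'amount': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# neither datetime column needs to be the index; one Grouper per column
out = orders.groupby([pd.Grouper(key='ordered', freq='2D'),
                      pd.Grouper(key='shipped', freq='2D')])['amount'].sum()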

Handling monthly-binned data in pandas

I have a dataset I'm analyzing in pandas where all data is binned monthly. The data originates from a MySQL database where all dates are in the format 'YYYY-MM-01', such that, for example, all rows for October 2013 would have "2013-10-01" in the month column.
I'm currently reading the data into pandas (via a .tsv dump of the MySQL table) with
data = pd.read_table(filename, header=None, names=('uid','iid','artist','tag','date'), index_col=indexes, parse_dates=['date'])
This is all fine, except that any subsequent analysis in which I do monthly resampling represents dates using the end-of-month convention (i.e. data from October becomes '2013-10-31' instead of '2013-10-01'). This leads to inconsistencies where the original data has months labeled as 'YYYY-MM-01' while any resampled data has them labeled as 'YYYY-MM-31' (or '-30' or '-28', as appropriate).
My question is this: what is the easiest and/or fastest way to convert all the dates in my dataframe to the end-of-month format from the outset? Keep in mind that the date is one of several indexes in a multi-index, not a column. I think my best bet is a modified date_parser in my pd.read_table call that always converts the month to the end-of-month convention, but I'm not sure how to approach it.
Read your dates in exactly like you are doing.
Create some test data. I am setting the dates to the start of month, but it doesn't matter.
In [39]: df = DataFrame(np.random.randn(10,2),columns=list('AB'),
index=date_range('20130101',periods=10,freq='MS'))
In [40]: df
Out[40]:
A B
2013-01-01 -0.553482 0.049128
2013-02-01 0.337975 -0.035897
2013-03-01 -0.394849 -1.755323
2013-04-01 -0.555638 1.903388
2013-05-01 -0.087752 1.551916
2013-06-01 1.000943 -0.361248
2013-07-01 -1.855171 -2.215276
2013-08-01 -0.582643 1.661696
2013-09-01 0.501061 -1.455171
2013-10-01 1.343630 -2.008060
Force-convert them to the end of month (in timestamp space), regardless of the day:
In [41]: df.index = df.index.to_period().to_timestamp('M')
In [42]: df
Out[42]:
A B
2013-01-31 -0.553482 0.049128
2013-02-28 0.337975 -0.035897
2013-03-31 -0.394849 -1.755323
2013-04-30 -0.555638 1.903388
2013-05-31 -0.087752 1.551916
2013-06-30 1.000943 -0.361248
2013-07-31 -1.855171 -2.215276
2013-08-31 -0.582643 1.661696
2013-09-30 0.501061 -1.455171
2013-10-31 1.343630 -2.008060
Back to the start
In [43]: df.index = df.index.to_period().to_timestamp('MS')
In [44]: df
Out[44]:
A B
2013-01-01 -0.553482 0.049128
2013-02-01 0.337975 -0.035897
2013-03-01 -0.394849 -1.755323
2013-04-01 -0.555638 1.903388
2013-05-01 -0.087752 1.551916
2013-06-01 1.000943 -0.361248
2013-07-01 -1.855171 -2.215276
2013-08-01 -0.582643 1.661696
2013-09-01 0.501061 -1.455171
2013-10-01 1.343630 -2.008060
You can also work with (and resample) the data as periods:
In [45]: df.index = df.index.to_period()
In [46]: df
Out[46]:
A B
2013-01 -0.553482 0.049128
2013-02 0.337975 -0.035897
2013-03 -0.394849 -1.755323
2013-04 -0.555638 1.903388
2013-05 -0.087752 1.551916
2013-06 1.000943 -0.361248
2013-07 -1.855171 -2.215276
2013-08 -0.582643 1.661696
2013-09 0.501061 -1.455171
2013-10 1.343630 -2.008060
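With a PeriodIndex you can also aggregate to coarser periods directly. A minimal sketch, continuing the frame above: convert the monthly periods to quarters with asfreq and group on that:

quarterly = df.groupby(df.index.asfreq('Q')).mean()  # one row per quarter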
Use replace() to change the day value. You can get the last day of the month using calendar.monthrange:
from datetime import date
import calendar
d = date(2000, 1, 1)
d = d.replace(day=calendar.monthrange(d.year, d.month)[1])  # date(2000, 1, 31)
UPDATE
I've added an example for pandas.
Sample file date.csv:
2013-01-01, 1
2013-02-01, 2
IPython shell log:
In [27]: import pandas as pd
In [28]: from datetime import datetime, date
In [29]: import calendar
In [30]: def parse(dt):
   ....:     dt = datetime.strptime(dt, '%Y-%m-%d')
   ....:     dt = dt.replace(day=calendar.monthrange(dt.year, dt.month)[1])
   ....:     return dt.date()
   ....:
In [31]: parse('2013-01-01')
Out[31]: datetime.date(2013, 1, 31)
In [32]: r = pd.read_csv('date.csv', header=None, names=('date', 'value'), parse_dates=['date'], date_parser=parse)
In [33]: r
Out[33]:
date value
0 2013-01-31 1
1 2013-02-28 2
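In later pandas versions the same end-of-month snap can be applied after parsing, without a custom date parser. A minimal sketch:

import pandas as pd

r = pd.read_csv('date.csv', header=None, names=('date', 'value'), parse_dates=['date'])
r['date'] = r['date'] + pd.offsets.MonthEnd(0)  # roll each date forward to its month end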

Pandas Timedelta in Days

I have a dataframe in pandas called 'munged_data' with two columns, 'entry_date' and 'dob', which I have converted to Timestamps using pd.to_datetime. I am trying to figure out how to calculate ages based on the time difference between 'entry_date' and 'dob', and to do this I need the difference in days between the two columns (so that I can then do something like round(days/365.25)). I do not seem to be able to find a way to do this using a vectorized operation. When I do munged_data.entry_date - munged_data.dob I get the following:
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However, I do not seem to be able to extract the days as an integer so that I can continue with my calculation.
Any help appreciated.
Using the pandas type Timedelta, available since v0.15.0, you can also do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
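Applied to the question's frame, the age calculation then becomes a one-liner (a sketch using the asker's column names):

ages = ((munged_data['entry_date'] - munged_data['dob']).dt.days / 365.25).round()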
You need 0.11 for this (0.11rc1 is out, final probably next week):
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because there is not yet full support for timedelta64[ns] scalars (e.g. like how we use Timestamps now for datetime64[ns]; coming in 0.12).
Not sure if you still need it, but in pandas 0.14 I usually use the .astype('timedelta64[X]') method:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.iloc[0] - df.iloc[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.iloc[0] - df.iloc[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that helps.
Suppose you have a pandas Series named time_difference with dtype
timedelta64[ns]
One way of extracting just the days (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.Timedelta(x).days)
The apply is used because a raw numpy.timedelta64 object does not have a 'days' attribute.
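In recent pandas versions the apply is unnecessary, since the .dt accessor exposes the days directly (a sketch, assuming time_difference is a timedelta64[ns] Series):

just_day = time_difference.dt.days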
To convert a number in some other unit into days, you can use pd.Timedelta(...).days (note that year- and month-based units were removed from pd.Timedelta in later pandas versions):
pd.Timedelta(1985, unit='Y').days
84494
