I have previously only worked in Stata but am now trying to switch to Python. I want to conduct an event study. More specifically, I have 4 fixed event dates a year, namely the first day of every quarter (1st January, 1st April, ...), and an event window of +/- 10 days around each event date. To restrict my sample to the desired window I am using the following command:
smpl = merged.ix[datetime.date(year=2013,month=12,day=21):datetime.date(year=2014,month=1,day=10)]
I want to write a loop that automatically shifts the chosen sample period 90 days forward in every run of the loop, so that I can subsequently run the required analysis in that step. I know how to run the analysis, but I do not know how to shift the sample 90 days forward for every step in the loop. For example, the next sample in the loop should be:
smpl = merged.ix[datetime.date(year=2014,month=3,day=21):datetime.date(year=2014,month=4,day=10)]
It's probably pretty simple, something like month=i and then shifting by +3 in every iteration. I am just too much of a noob in Python to get the syntax right.
Any help is greatly appreciated.
I'd use this:
for beg in pd.date_range('2013-12-21', '2017-05-17', freq='90D'):
    smpl = merged.loc[beg:beg + pd.Timedelta('20D')]
    ...
Demo:
In [158]: for beg in pd.date_range('2013-12-21', '2017-05-17', freq='90D'):
...: print(beg, beg + pd.Timedelta('20D'))
...:
2013-12-21 00:00:00 2014-01-10 00:00:00
2014-03-21 00:00:00 2014-04-10 00:00:00
2014-06-19 00:00:00 2014-07-09 00:00:00
2014-09-17 00:00:00 2014-10-07 00:00:00
2014-12-16 00:00:00 2015-01-05 00:00:00
2015-03-16 00:00:00 2015-04-05 00:00:00
2015-06-14 00:00:00 2015-07-04 00:00:00
2015-09-12 00:00:00 2015-10-02 00:00:00
2015-12-11 00:00:00 2015-12-31 00:00:00
2016-03-10 00:00:00 2016-03-30 00:00:00
2016-06-08 00:00:00 2016-06-28 00:00:00
2016-09-06 00:00:00 2016-09-26 00:00:00
2016-12-05 00:00:00 2016-12-25 00:00:00
2017-03-05 00:00:00 2017-03-25 00:00:00
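Note that a fixed '90D' step slowly drifts away from the actual quarter starts (you can see the window start creeping backwards in the demo above). If you want each window anchored exactly on the first day of a quarter, here is a minimal sketch; it assumes merged has a DatetimeIndex, and the date-range bounds are just placeholders:

import pandas as pd

for q_start in pd.date_range('2014-01-01', '2017-04-01', freq='QS'):
    beg = q_start - pd.Timedelta('10D')   # 10 days before the quarter start
    end = q_start + pd.Timedelta('10D')   # 10 days after the quarter start
    smpl = merged.loc[beg:end]            # assumes merged has a DatetimeIndex
    # ... run the event-study analysis on smpl here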
I have the following table:
Hora_Retiro count_uses
0 00:00:18 1
1 00:00:34 1
2 00:02:27 1
3 00:03:13 1
4 00:06:45 1
... ... ...
748700 23:58:47 1
748701 23:58:49 1
748702 23:59:11 1
748703 23:59:47 1
748704 23:59:56 1
And I want to group all values within each hour, so I can see the total number of uses per hour (00:00:00 - 23:00:00)
I have the following code:
hora_pico_aug = hora_pico.groupby(pd.Grouper(key="Hora_Retiro", freq='H')).count()
Hora_Retiro column is of timedelta64[ns] type
Which gives the following output:
count_uses
Hora_Retiro
00:00:02 2566
01:00:02 602
02:00:02 295
03:00:02 5
04:00:02 10
05:00:02 4002
06:00:02 16075
07:00:02 39410
08:00:02 76272
09:00:02 56721
10:00:02 36036
11:00:02 32011
12:00:02 33725
13:00:02 41032
14:00:02 50747
15:00:02 50338
16:00:02 42347
17:00:02 54674
18:00:02 76056
19:00:02 57958
20:00:02 34286
21:00:02 22509
22:00:02 13894
23:00:02 7134
However, the index column starts at 00:00:02, and I want it to start at 00:00:00 and then proceed in one-hour intervals. Something like this:
count_uses
Hora_Retiro
00:00:00 2565
01:00:00 603
02:00:00 295
03:00:00 5
04:00:00 10
05:00:00 4002
06:00:00 16075
07:00:00 39410
08:00:00 76272
09:00:00 56721
10:00:00 36036
11:00:00 32011
12:00:00 33725
13:00:00 41032
14:00:00 50747
15:00:00 50338
16:00:00 42347
17:00:00 54674
18:00:00 76056
19:00:00 57958
20:00:00 34286
21:00:00 22509
22:00:00 13894
23:00:00 7134
How can I make it start at 00:00:00?
Thanks for the help!
You can create an hour column from the Hora_Retiro column.
df['hour'] = df['Hora_Retiro'].dt.hour
Then group by hour:
gpby_df = df.groupby('hour')['count_uses'].sum().reset_index()
gpby_df['hour'] = pd.to_datetime(gpby_df['hour'], format='%H').dt.time
gpby_df.columns = ['Hora_Retiro', 'sum_count_uses']
gpby_df
gives
Hora_Retiro sum_count_uses
0 00:00:00 14
1 09:00:00 1
2 10:00:00 2
3 20:00:00 2
I assume that the Hora_Retiro column in your DataFrame is of
Timedelta type. It is not datetime, because in that case the
date part would also be printed.
Indeed, your code creates groups starting at the minute / second
taken from the first row.
To group by "full hours", floor each element in this column
to the hour, then group by that floored value.
The code to do it is:
hora_pico.groupby(hora_pico.Hora_Retiro.dt.floor('H')).count_uses.count()
However, I advise you to decide what you actually want to count:
rows, or the values in the count_uses column.
In the latter case, replace count with sum.
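Putting the pieces together, here is a minimal sketch assuming Hora_Retiro really is timedelta64[ns] and that you want the sum of count_uses per hour; flooring makes the index start at 00:00:00 and advance in one-hour steps:

hourly = (hora_pico
          .groupby(hora_pico['Hora_Retiro'].dt.floor('H'))['count_uses']
          .sum())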
I have a DataFrame that has time stamps in the form of (yyyy-mm-dd hh:mm:ss). I'm trying to delete data between two different time stamps. At the moment I can delete the data between 1 range of time stamps but I have trouble extending this to multiple time stamps.
For example, with the DataFrame I can delete a range of rows (e.g. 2015-03-01 00:20:00 to 2015-08-01 01:10:00) however, I'm not sure how to go about deleting another range alongside it. The code that does that is shown below.
index_list = df.timestamp[(df.timestamp >= "2015-07-01 00:00:00") & (df.timestamp <= "2015-12-30 23:50:00")].index.tolist()
df.drop(index_list, inplace=True)
The DataFrame extends over 3 years and has every day in the 3 years included.
I'm trying to delete all the rows from months July to December (2015-07-01 00:00:00 to 2015-12-30 23:50:00) for all 3 years.
I was thinking of creating a helper column that gets the month from the Date column and then dropping rows based on the month in that helper column.
I would greatly appreciate any advice. Thanks!
Edit:
I've added in a small summarised version of the DataFrame. This is what the initial DataFrame looks like.
df
Date                   v
2015-01-01 00:00:00 30.0
2015-02-01 00:10:00 55.0
2015-03-01 00:20:00 36.0
2015-04-01 00:30:00 65.0
2015-05-01 00:40:00 35.0
2015-06-01 00:50:00 22.0
2015-07-01 01:00:00 74.0
2015-08-01 01:10:00 54.0
2015-09-01 01:20:00 86.0
2015-10-01 01:30:00 91.0
2015-11-01 01:40:00 65.0
2015-12-01 01:50:00 35.0
To get something like this
df
Date                   v
2015-01-01 00:00:00 30.0
2015-02-01 00:10:00 55.0
2015-03-01 00:20:00 36.0
2015-05-01 00:40:00 35.0
2015-06-01 00:50:00 22.0
2015-11-01 01:40:00 65.0
2015-12-01 01:50:00 35.0
Where time stamps "2015-07-01 00:20:00 to 2015-10-01 00:30:00" and "2015-07-01 01:00:00 to 2015-10-01 01:30:00" are removed. Sorry if my formatting isn't up to standard.
If your timestamp column uses the correct dtype, you can just do:
df.loc[df.timestamp.dt.month.isin([1, 2, 3, 5, 6, 11, 12])]
This should filter out the months not inside the list.
As you hinted, data manipulation is always easier when you use the right data types. To support time stamps, pandas has the Timestamp type. You can do this as follows:
df['Date'] = pd.to_datetime(df['Date']) # No date format needs to be specified,
# "YYYY-MM-DD HH:MM:SS" is the standard
Then, removing all entries in the months of July to December for all years is straightforward:
df = df[df['Date'].dt.month < 7] # Keep only months less than July
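If you instead need to remove several explicit timestamp ranges (one per year, say) rather than whole months, a sketch along these lines should work; the 2016 and 2017 bounds below simply mirror the 2015 range from the question and are assumptions:

import pandas as pd

ranges = [('2015-07-01 00:00:00', '2015-12-30 23:50:00'),
          ('2016-07-01 00:00:00', '2016-12-30 23:50:00'),
          ('2017-07-01 00:00:00', '2017-12-30 23:50:00')]
mask = pd.Series(False, index=df.index)
for start, end in ranges:
    mask |= df['Date'].between(start, end)   # True for rows inside any range
df = df[~mask]                               # keep only rows outside every range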
I have a list of nodes (about 2300 of them) that have hourly price data for about a year. I have a script that, for each node, loops through the times of the day to create a 4-hour trailing average, then groups the averages by month and hour. Finally, these hours in a month are averaged to give, for each month, a typical day of prices. I'm wondering if there is a faster way to do this because what I have seems to take a significant amount of time (about an hour). I also save the dataframes as csv files for later visualization (that's not the slow part).
df (before anything is done to it)
Price_Node_Name Local_Datetime_HourEnding Price Irrelevant_column
0 My-node 2016-08-17 01:00:00 20.95 EST
1 My-node 2016-08-17 02:00:00 21.45 EST
2 My-node 2016-08-17 03:00:00 25.60 EST
df_node (after the groupby as it looks going to csv)
Month Hour MA
1 0 23.55
1 1 23.45
1 2 21.63
for node in node_names:
    df_node = df[df['Price_Node_Name'] == node]
    df_node['MA'] = df_node['Price'].rolling(4).mean()
    df_node = df_node.groupby([df_node['Local_Datetime_HourEnding'].dt.month,
                               df_node['Local_Datetime_HourEnding'].dt.hour]).mean()
    df_node.to_csv('%s_rollingavg.csv' % node)
I get a SettingWithCopy warning, but I haven't quite figured out how to use .loc here, since the column ['MA'] doesn't exist until I create it in this snippet, and any way I can think of to create it beforehand and fill it seems slower than what I have. Could be totally wrong though. Any help would be great.
python 3.6
Edit: I might have misread the question here; hopefully this at least sparks some ideas for the solution.
I think it is useful to have the index as the datetime column when working with time series data in Pandas.
Here is some sample data:
Out[3]:
price
date
2015-01-14 00:00:00 155.427361
2015-01-14 01:00:00 205.285202
2015-01-14 02:00:00 205.305021
2015-01-14 03:00:00 195.000000
2015-01-14 04:00:00 213.102000
2015-01-14 05:00:00 214.500000
2015-01-14 06:00:00 222.544375
2015-01-14 07:00:00 227.090251
2015-01-14 08:00:00 227.700000
2015-01-14 09:00:00 243.456190
We use Series.rolling to create an MA column, i.e. we apply the method to the price column, with a two-period window, and call mean on the resulting rolling object:
In [4]: df['MA'] = df.price.rolling(window=2).mean()
In [5]: df
Out[5]:
price MA
date
2015-01-14 00:00:00 155.427361 NaN
2015-01-14 01:00:00 205.285202 180.356281
2015-01-14 02:00:00 205.305021 205.295111
2015-01-14 03:00:00 195.000000 200.152510
2015-01-14 04:00:00 213.102000 204.051000
2015-01-14 05:00:00 214.500000 213.801000
2015-01-14 06:00:00 222.544375 218.522187
2015-01-14 07:00:00 227.090251 224.817313
2015-01-14 08:00:00 227.700000 227.395125
2015-01-14 09:00:00 243.456190 235.578095
And if you want month and hour columns, you can extract those from the index:
In [7]: df['month'] = df.index.month
In [8]: df['hour'] = df.index.hour
In [9]: df
Out[9]:
price MA month hour
date
2015-01-14 00:00:00 155.427361 NaN 1 0
2015-01-14 01:00:00 205.285202 180.356281 1 1
2015-01-14 02:00:00 205.305021 205.295111 1 2
2015-01-14 03:00:00 195.000000 200.152510 1 3
2015-01-14 04:00:00 213.102000 204.051000 1 4
2015-01-14 05:00:00 214.500000 213.801000 1 5
2015-01-14 06:00:00 222.544375 218.522187 1 6
2015-01-14 07:00:00 227.090251 224.817313 1 7
2015-01-14 08:00:00 227.700000 227.395125 1 8
2015-01-14 09:00:00 243.456190 235.578095 1 9
Then we can use groupby:
In [11]: df.groupby([
...: df['month'],
...: df['hour']
...: ]).mean()[['MA']]
Out[11]:
MA
month hour
1 0 NaN
1 180.356281
2 205.295111
3 200.152510
4 204.051000
5 213.801000
6 218.522187
7 224.817313
8 227.395125
9 235.578095
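Since the question is mainly about speed, note that the per-node Python loop can often be replaced by a single grouped operation over all nodes at once; a sketch, with column names taken from the question and assuming Local_Datetime_HourEnding is already datetime64:

df = df.set_index('Local_Datetime_HourEnding')
df['MA'] = df.groupby('Price_Node_Name')['Price'].transform(lambda s: s.rolling(4).mean())
typical = df.groupby(['Price_Node_Name', df.index.month, df.index.hour])['MA'].mean()

typical is then a Series indexed by (node, month, hour), which you can slice per node for the CSV export.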
Here are a few things to try:
set 'Price_Node_Name' as the index before the loop
df.set_index('Price_Node_Name', inplace=True)
for node in node_names:
    df_node = df.loc[node]
use sort=False as a kwarg in the groupby
df_node.groupby(..., sort=False).mean()
Perform the rolling average after the groupby, or don't do it at all; I don't think you need it in your case. Averaging the hourly totals for a month will give you the expected values for a typical day, which is what you want. If you still want the rolling average, perform it on the averaged hourly totals for each month. See the sketch below.
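A rough sketch combining these suggestions, skipping the rolling average; the column names come from the question, and the output file name is hypothetical:

df = df.set_index('Price_Node_Name').sort_index()
for node in node_names:
    df_node = df.loc[node]                      # all rows for this node
    out = df_node.groupby(
        [df_node['Local_Datetime_HourEnding'].dt.month,
         df_node['Local_Datetime_HourEnding'].dt.hour],
        sort=False)['Price'].mean()             # typical day of prices per month
    out.to_csv('%s_hourly_avg.csv' % node)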
I have a group of dates. I would like to subtract them from their forward neighbor to get the delta between them. My code looks like this:
import pandas
from io import StringIO
txt = '''ID,DATE
002691c9cec109e64558848f1358ac16,2003-08-13 00:00:00
002691c9cec109e64558848f1358ac16,2003-08-13 00:00:00
0088f218a1f00e0fe1b94919dc68ec33,2006-05-07 00:00:00
0088f218a1f00e0fe1b94919dc68ec33,2006-06-03 00:00:00
00d34668025906d55ae2e529615f530a,2006-03-09 00:00:00
00d34668025906d55ae2e529615f530a,2006-03-09 00:00:00
0101d3286dfbd58642a7527ecbddb92e,2007-10-13 00:00:00
0101d3286dfbd58642a7527ecbddb92e,2007-10-27 00:00:00
0103bd73af66e5a44f7867c0bb2203cc,2001-02-01 00:00:00
0103bd73af66e5a44f7867c0bb2203cc,2008-01-20 00:00:00
'''
df = pandas.read_csv(StringIO(txt))
df = df.sort_values('DATE')
df.DATE = pandas.to_datetime(df.DATE)
grouped = df.groupby('ID')
df['X_SEQUENCE_GAP'] = pandas.concat([g['DATE'].sub(g['DATE'].shift(), fill_value=0) for title,g in grouped])
I am getting pretty incomprehensible results, so I am going to assume I have a logic error.
The results I get are as follows:
ID DATE X_SEQUENCE_GAP
0 002691c9cec109e64558848f1358ac16 2003-08-13 00:00:00 12277 days, 00:00:00
1 002691c9cec109e64558848f1358ac16 2003-08-13 00:00:00 00:00:00
3 0088f218a1f00e0fe1b94919dc68ec33 2006-06-03 00:00:00 27 days, 00:00:00
2 0088f218a1f00e0fe1b94919dc68ec33 2006-05-07 00:00:00 13275 days, 00:00:00
5 00d34668025906d55ae2e529615f530a 2006-03-09 00:00:00 13216 days, 00:00:00
4 00d34668025906d55ae2e529615f530a 2006-03-09 00:00:00 00:00:00
6 0101d3286dfbd58642a7527ecbddb92e 2007-10-13 00:00:00 13799 days, 00:00:00
7 0101d3286dfbd58642a7527ecbddb92e 2007-10-27 00:00:00 14 days, 00:00:00
9 0103bd73af66e5a44f7867c0bb2203cc 2008-01-20 00:00:00 2544 days, 00:00:00
8 0103bd73af66e5a44f7867c0bb2203cc 2001-02-01 00:00:00 11354 days, 00:00:00
I was expecting, for example, that rows 0 and 1 would both have a result of 0. Any help is most appreciated.
This is in 0.11rc1 (I don't think it will work on a prior version).
When you shift dates, the first one in each group is a NaT (like a nan, but for datetimes/timedeltas).
In [27]: df['X_SEQUENCE_GAP'] = grouped.apply(lambda g: g['DATE']-g['DATE'].shift())
In [30]: df.sort()
Out[30]:
ID DATE X_SEQUENCE_GAP
0 002691c9cec109e64558848f1358ac16 2003-08-13 00:00:00 NaT
1 002691c9cec109e64558848f1358ac16 2003-08-13 00:00:00 00:00:00
2 0088f218a1f00e0fe1b94919dc68ec33 2006-05-07 00:00:00 NaT
3 0088f218a1f00e0fe1b94919dc68ec33 2006-06-03 00:00:00 27 days, 00:00:00
4 00d34668025906d55ae2e529615f530a 2006-03-09 00:00:00 NaT
5 00d34668025906d55ae2e529615f530a 2006-03-09 00:00:00 00:00:00
6 0101d3286dfbd58642a7527ecbddb92e 2007-10-13 00:00:00 NaT
7 0101d3286dfbd58642a7527ecbddb92e 2007-10-27 00:00:00 14 days, 00:00:00
8 0103bd73af66e5a44f7867c0bb2203cc 2001-02-01 00:00:00 NaT
9 0103bd73af66e5a44f7867c0bb2203cc 2008-01-20 00:00:00 2544 days, 00:00:00
You can then fillna (but you have to do this awkward type conversion because of a numpy bug; it will get fixed in 0.12).
In [57]: df['X_SEQUENCE_GAP'].sort_index().astype('timedelta64[ns]').fillna(0)
Out[57]:
0 00:00:00
1 00:00:00
2 00:00:00
3 27 days, 00:00:00
4 00:00:00
5 00:00:00
6 00:00:00
7 14 days, 00:00:00
8 00:00:00
9 2544 days, 00:00:00
Name: X_SEQUENCE_GAP, dtype: timedelta64[ns]
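In recent pandas versions the same result can be obtained more directly with a grouped diff; a sketch, assuming DATE is already datetime64:

df = df.sort_values(['ID', 'DATE'])
df['X_SEQUENCE_GAP'] = df.groupby('ID')['DATE'].diff().fillna(pandas.Timedelta(0))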
I am dealing with a dataset where observations occur between opening and closing hours -- but the service closes on the day after it opens. For example, opening occurs at 7am and closing at 1am, the following day.
This feels like a very common problem -- I've searched around for it and am open to the fact I might just not know the correct terms to search for.
For most of my uses it's enough to do something like:
open_close = pd.DatetimeIndex(start='2012-01-01 05:00:00', periods = 15, offset='D')
Then I can just do fun little groupbys on the df:
df.groupby(open_close.asof).agg(func).
But I've run into an instance where I need to grab multiple of these open-close periods. What I really want to be able to do is just have a DatetimeIndex where I get to pick when a day starts. So I could just redefine 'day' to be from 5AM to 5AM. The nice thing about this is I can then use things like df[df.index.dayofweek == 6] and get back everything from 5AM on Sunday to 5AM on Monday.
It feels like Periods...or something inside of pandas anticipated this request. Would love help figuring it out.
EDIT:
I've also figured this out by creating another column with the right day:
df['shift_day'] = df['datetime'].apply(magicFunctionToFigureOutOpenClose)
-- so this isn't blocking my progress. Just feels like something that could be nicely integrated into the package (or datetime...or somewhere...)
Perhaps the base parameter of df.resample() would help:
base : int, default 0
For frequencies that evenly subdivide 1 day, the "origin" of the
aggregated intervals. For example, for '5min' frequency, base could
range from 0 through 4. Defaults to 0
Here's an example:
In [44]: df = pd.DataFrame(np.random.rand(28),
....: index=pd.DatetimeIndex(start='2012/9/1', periods=28, freq='H'))
In [45]: df
Out[45]:
0
2012-09-01 00:00:00 0.970273
2012-09-01 01:00:00 0.730171
2012-09-01 02:00:00 0.508588
2012-09-01 03:00:00 0.535351
2012-09-01 04:00:00 0.940255
2012-09-01 05:00:00 0.143483
2012-09-01 06:00:00 0.792659
2012-09-01 07:00:00 0.231413
2012-09-01 08:00:00 0.071676
2012-09-01 09:00:00 0.995202
2012-09-01 10:00:00 0.236551
2012-09-01 11:00:00 0.904853
2012-09-01 12:00:00 0.652873
2012-09-01 13:00:00 0.488400
2012-09-01 14:00:00 0.396647
2012-09-01 15:00:00 0.967261
2012-09-01 16:00:00 0.554188
2012-09-01 17:00:00 0.884086
2012-09-01 18:00:00 0.418577
2012-09-01 19:00:00 0.189584
2012-09-01 20:00:00 0.577041
2012-09-01 21:00:00 0.100332
2012-09-01 22:00:00 0.294672
2012-09-01 23:00:00 0.925425
2012-09-02 00:00:00 0.630807
2012-09-02 01:00:00 0.400261
2012-09-02 02:00:00 0.156469
2012-09-02 03:00:00 0.658608
In [46]: df.resample("24H", how=sum, label='left', closed='left', base=5)
Out[46]:
0
2012-08-31 05:00:00 3.684638
2012-09-01 05:00:00 11.671068
In [47]: df.ix[:5].sum()
Out[47]: 0 3.684638
In [48]: df.ix[5:].sum()
Out[48]: 0 11.671068
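Note that in recent pandas versions the base argument of resample() has been replaced by offset / origin, and how= by an explicit aggregation call; a rough equivalent sketch:

df.resample('24H', label='left', closed='left', offset='5H').sum()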