How to groupby a non-unique datetime index and a column - python

Just starting with Pandas. I have a DataFrame with a datetime index and a number of columns (data from parsing a log file). I have been able to convert the DataFrame index to a monthly period index. One of the columns contains the user name associated with the event in the logfile. I would like to get an overview of the number of occurrences (i.e. rows in the DataFrame) per month per user. The index has non-unique values, so I have been able to group by month using
grp = DF_monthly.groupby(level=0)
However, I don't seem to be able to add that extra grouping on the user column. How can I do this?

Say your raw log looks like:
import pandas as pd
from io import StringIO  # Python 3; the original answer used Python 2's StringIO module
infile = StringIO("""datetime,user,event
2013-01-01 11:15:23,A,error
2013-01-02 11:15:23,C,warning
2013-01-03 11:15:23,C,message
2013-02-01 11:15:23,A,error
2013-02-02 11:15:23,B,warning
2013-02-03 11:15:23,A,message""")
df = pd.read_csv(infile, parse_dates=True, index_col='datetime')
                    user    event
datetime
2013-01-01 11:15:23    A    error
2013-01-02 11:15:23    C  warning
2013-01-03 11:15:23    C  message
2013-02-01 11:15:23    A    error
2013-02-02 11:15:23    B  warning
2013-02-03 11:15:23    A  message
Then you can get a count per user per month with:
df.groupby([lambda x: x.strftime('%Y-%b'), 'user']).count()['event']
          user
2013-Feb  A       2
          B       1
2013-Jan  A       1
          C       2
So it's not necessary to group by month first, unless you have other reasons to do so; if you do, you can apply the last groupby to the monthly df as well.
The lambda function converts each timestamp from the index to a 'Year-Month' string, and the groupby is performed on that string.
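As a side note (a variation not in the original answer): you can also group on a monthly PeriodIndex derived from the index, which keeps the months in chronological rather than alphabetical order:
# Group on a monthly period plus the user column; unlike the formatted
# string key, periods sort chronologically
df.groupby([df.index.to_period('M'), 'user'])['event'].count()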

Related

Counting backwards from end date in pd.Grouper

I want to aggregate daily data to weekly (7-day sums), but with the last date as the 'origin'. Is it possible to do a groupby from the end date using pd.Grouper? This is what the data looks like:
This code:
df.groupby(pd.Grouper(key='date', freq='7d'))['value'].sum()
results in
2020-01-01 5
2020-01-08 12
2020-01-15 4
but I was hoping for this:
2020-01-01 0
2020-01-03 7
2020-01-10 14
The approach you have used can be shortened with pandas' resample method on df, but I think the problem is where your dates fall: the result you expect is anchored day-wise to the end date rather than to fixed weekly bins. What I would recommend is splitting the df and then merging the pieces again:
df.set_index(['date'], inplace=True)
df_below = df[3:].resample('W').sum()  # weekly sums for everything after the first rows
df_up = df.iloc[0:3, :].sum()
# or you can give dates instead of 0:3 in iloc
For the first rows ([0, 1, 2]) you take their sum, and then you can use concat (or merge) to combine the two pieces back into one DataFrame. Feel free to ask further questions.
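As a side note (not part of the original answer): newer pandas (1.3 and up) lets pd.Grouper anchor the bins to the end of the series directly via the origin parameter, which may avoid the split-and-merge entirely. A minimal sketch with hypothetical daily data standing in for the question's frame:
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2020-01-01', '2020-01-17'),
    'value': 1,
})

# origin='end' (pandas >= 1.3) anchors the 7-day bins to the last date,
# so the bins are effectively counted backwards from the end; note the
# bin labels may be the right edges rather than the left ones
print(df.groupby(pd.Grouper(key='date', freq='7D', origin='end'))['value'].sum())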

How to select a subset of pandas DateTimeIndex whose data are in a list?

Let's say I have an idx=pd.DatetimeIndex with one-minute frequency. I also have a list of bad dates (each of type pd.Timestamp, without the time information) that I want to remove from the original idx. How do I do that in pandas?
Use normalize to remove the time part from your index, so you can do a simple ~ + isin selection, i.e. find the dates not in that bad list. You can further ensure your list of dates doesn't have a time part with the same trick, [x.normalize() for x in bad_dates], if you need to be extra safe.
Sample Data
import pandas as pd
df = pd.DataFrame(range(9), index=pd.date_range('2010-01-01', freq='11H', periods=9))
bad_dates = [pd.Timestamp('2010-01-02'), pd.Timestamp('2010-01-03')]
Code
df[~df.index.normalize().isin(bad_dates)]
# 0
#2010-01-01 00:00:00 0
#2010-01-01 11:00:00 1
#2010-01-01 22:00:00 2
#2010-01-04 05:00:00 7
#2010-01-04 16:00:00 8
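If you want to filter the index itself (as in the question) rather than a frame, the same pattern applies directly:
# Keep only the timestamps whose date is not in the bad list
good_idx = df.index[~df.index.normalize().isin(bad_dates)]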

Why does date_range give a result different from indexing [] for DataFrame Pandas dates?

Here is some simple code with date_range and indexing [ ] that I used with Pandas:
period_start = '2013-01-01'
period_end = '2019-12-24'
print(pd.DataFrame({'close': aapl_close, 'returns': aapl_returns},
                   index=pd.date_range(start=period_start, periods=6)))
print(pd.DataFrame({'close': aapl_close,
                    'returns': aapl_returns})[period_start:'20130110'])
date_range gives NaN results:
close returns
2013-01-01 NaN NaN
2013-01-02 NaN NaN
2013-01-03 NaN NaN
2013-01-04 NaN NaN
Indexing gives correct results:
close returns
2013-01-02 00:00:00+00:00 68.732 0.028322
2013-01-03 00:00:00+00:00 68.032 -0.010184
2013-01-04 00:00:00+00:00 66.091 -0.028531
Based on how the dates are shown by date_range, I suppose the date format of date_range does not match the date format in the Pandas DataFrame.
1) Can you please explain why it gives NaN?
2) What would you suggest to get a specific time range from the pandas DataFrame?
As I'm a beginner in Python and its libraries, I didn't realize that this question relates to the Quantopian library, not to Pandas itself.
I got a solution on their forum. All the times returned by methods on Quantopian are timezone-aware, with a timezone of 'UTC'. By default, the date_range method returns timezone-naive dates. Simply supply the timezone information to the date_range method, like this:
pd.DataFrame({'close': aapl_close, 'returns': aapl_returns},
             index=pd.date_range(start=period_start, periods=6, tz='UTC'))
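To see why the mismatch produces NaN, here is a minimal sketch with toy data standing in for the Quantopian series: with mismatched time zones the index labels simply fail to align, so every reindexed value comes back empty.
import pandas as pd

# Toy timezone-aware series standing in for aapl_close
s = pd.Series([68.732, 68.032],
              index=pd.date_range('2013-01-02', periods=2, tz='UTC'))

# tz-naive index: no labels match the tz-aware ones, so the column is all NaN
print(pd.DataFrame({'close': s}, index=pd.date_range('2013-01-02', periods=2)))

# tz-aware index: labels align and the values appear
print(pd.DataFrame({'close': s},
                   index=pd.date_range('2013-01-02', periods=2, tz='UTC')))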
To get a specific date or time range in pandas perhaps the easiest is simple bracket notation. For example, to get dates between 2013-01-04 and 2013-01-08 (inclusive) simply enter this:
df = pd.DataFrame({'close': aapl_close, 'returns': aapl_returns})
my_selected_dates = df['2013-01-04':'2013-01-08']
This bracket notation is really shorthand for using the loc method:
my_selected_dates = df.loc['2013-01-04':'2013-01-08']
Both work the same, but the loc method has a bit more flexibility. This notation also works with datetimes if desired.

Indexing by multiple fields with pandas in python

I am new to Pandas. I have the following data (stock prices)
id,date,time,price
0,2015-01-01,9:00,21.72
1,2015-01-01,9:00,17.65
2,2015-01-01,9:00,54.24
0,2015-01-01,11:00,21.82
1,2015-01-01,11:00,18.65
2,2015-01-01,11:00,52.24
0,2015-01-02,9:00,21.02
1,2015-01-02,9:00,19.01
2,2015-01-02,9:00,50.21
0,2015-01-02,11:00,20.61
1,2015-01-02,11:00,18.70
2,2015-01-02,11:00,51.21
...
...
I want to sort by date and calculate returns for each id, both across dates and across times within a date. I tried this:
import pandas as pd
import numpy as np
df = pd.read_csv("/path/to/csv", index_col=[0,2,1])
df['returns'] = df['price'].pct_change()
However, the returns are calculated across the ids in the order they appear. Any idea how to do this correctly? I would also like to access the data as
price_0 = df['id'==0]['date'=='2014-01-01'][time=='9:00']['price']
Assuming that those are the columns in your dataframe (and none are the index), then you want to group by date, time, and id on price. You then unstack the id, which effectively creates a pivot table with dates and times as the rows and ids as the columns. You then need to use pct_change to achieve your objective.
returns = df.groupby(['date', 'time', 'id']).price.first().unstack().pct_change()
>>> returns
id 0 1 2
date time
1/1/15 11:00 NaN NaN NaN
9:00 -0.004583 -0.053619 0.038285
1/2/15 11:00 -0.051105 0.059490 -0.055863
9:00 0.019893 0.016578 -0.019527
It will probably be better, however, to combine the dates and times into timestamps. Assuming your dates and times are text representations, the following should work:
df['timestamp'] = df.apply(lambda row: pd.Timestamp(row.date + ' ' + row.time), axis=1)
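(A side note, not in the original answer: the row-wise apply above can be replaced with a vectorized call, which produces the same column but is usually much faster on large frames.)
df['timestamp'] = pd.to_datetime(df['date'] + ' ' + df['time'])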
Then, just group on the timestamp and id, and unstack the id.
returns = df.groupby(['timestamp', 'id']).price.first().unstack('id').pct_change()
>>> returns
id 0 1 2
timestamp
2015-01-01 09:00:00 NaN NaN NaN
2015-01-01 11:00:00 0.004604 0.056657 -0.036873
2015-01-02 09:00:00 -0.036664 0.019303 -0.038859
You would index the returns for a given security as follows:
>>> returns.loc[pd.Timestamp('2015-01-02 09:00'), 1]
0.0193029490616623

get subset dataframe by date

I have the following subset with a starting date (DD/MM/YYYY) and Amount
Start Date Amount
1 01/01/2013 20
2 02/05/2007 10
3 01/05/2004 15
4 01/06/2014 20
5 17/08/2008 21
I'd like to create a subset of this dataframe containing only the rows where the day of Start Date is 01:
Start Date Amount
1 01/01/2013 20
3 01/05/2004 15
4 01/06/2014 20
I've tried to loop through the table and use the index, but couldn't find a suitable way to iterate through the dataframe's rows.
Assuming your dates are datetime already, the following should work. If they are strings, you can convert them using to_datetime, so df['Start Date'] = pd.to_datetime(df['Start Date']); you may also need to pass dayfirst=True if required. If you imported the data using read_csv, you could have done this at the point of import with df = pd.read_csv('data.csv', parse_dates=[n], dayfirst=True), where n is the column index (0-based, of course), so if it was the first column pass parse_dates=[0].
One method could be to apply a lambda to the column and use the boolean Series it returns to index against the dataframe:
In [19]:
df[df['Start Date'].apply(lambda x: x.day == 1)]
Out[19]:
Start Date Amount
index
1 2013-01-01 20
3 2004-05-01 15
4 2014-06-01 20
I'm not sure whether there is a built-in method that doesn't involve setting the column to be the index, which would convert the frame into a time series.
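(A side note, not in the original answer: modern pandas exposes datetime components through the .dt accessor, which does exactly this without apply and without touching the index.)
df[df['Start Date'].dt.day == 1]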
