I have a column in a dataframe which contains non-continuous dates. I need to group these date by a frequency of 2 days. Data Sample(after normalization):
2015-04-18 00:00:00
2015-04-20 00:00:00
2015-04-20 00:00:00
2015-04-21 00:00:00
2015-04-27 00:00:00
2015-04-30 00:00:00
2015-05-07 00:00:00
2015-05-08 00:00:00
I tried following but as the dates are not continuous I am not getting the desired result.
df.groupby(pd.Grouper(key = 'l_date', freq='2D'))
Is these a way to achieve the desired grouping using pandas or should I write a separate logic?
Once you have a l_date sorted dataframe. you can create a continuous dummy date (dum_date) column and groupby 2D frequency on it.
df = df.sort_values(by='l_date')
df['dum_date'] = pd.date_range(pd.datetime.today(), periods=df.shape[0]).tolist()
df.groupby(pd.Grouper(key = 'dum_date', freq='2D'))
OR
If you are fine with groupings other than dates. then a generalized way to group n consecutive rows could be:
n = 2 # n = 2 for your use case
df = df.sort_values(by='l_date')
df['grouping'] = [(i//n + 1) for i in range(df.shape[0])]
df.groupby(pd.Grouper(key = 'grouping'))
Related
Edit: Changing example to use Timedelta indices.
I have a DataFrame of different time ranges that represent indices in my main DataFrame. eg:
ranges = pd.DataFrame(data=np.array([[1,10,20],[3,15,30]]).T, columns=["Start","Stop"])
ranges = ranges.apply(pd.to_timedelta, unit="s")
ranges
Start Stop
0 0 days 00:00:01 0 days 00:00:03
1 0 days 00:00:10 0 days 00:00:15
2 0 days 00:00:20 0 days 00:00:30
my_data= pd.DataFrame(data=list(range(0,40*5,5)), columns=["data"])
my_data.index = pd.to_timedelta(my_data.index, unit="s")
I want to calculate the averages of the data in my_data for each of the time ranges in ranges. How can I do this?
One option would be as follows:
ranges.apply(lambda row: my_data.loc[row["Start"]:row["Stop"]].iloc[:-1].mean(), axis=1)
data
0 7.5
1 60.0
2 122.5
But can we do this without apply?
Here is one way to approach it:
Generate the timedeltas and concatenate into a single block:
# note the use of closed='left' (`Stop` is not included in the build)
timedelta = [pd.timedelta_range(a,b, closed='left', freq='1s')
for a, b in zip(ranges.Start, ranges.Stop)]
timedelta = timedelta[0].append(timedelta[1:])
Get the grouping which will be used for the groupby and aggregation:
counts = ranges.Stop.sub(ranges.Start).dt.seconds
counts = np.arange(counts.size).repeat(counts)
Group by and aggregate:
my_data.loc[timedelta].groupby(counts).mean()
data
0 7.5
1 60.0
2 122.5
I have such a dataframe:
ds y
2018-07-25 22:00:00 1
2018-07-25 23:00:00 2
2018-07-26 00:00:00 3
2018-07-26 01:00:00 4
2018-07-26 02:00:00 5
What I want to get is a new dataframe which looks like this
ds y
2018-07-25 3
2018-07-26 12
I want to get a new dataframe df1 where all the entries of one day are summed up in y and I only want to keep one column of this day without a timestamp.
What I did so far is this:
df1 = df.groupby(df.index.date).transform(lambda x: x[:24].sum())
24 because I have 24 entries every day (for every hour). I get the correct sum for every day but I also get 24 rows for every day together with the existing timestamps. How can I achieve what I want?
If need sum all values per days then filtering first 24 rows is not necessary:
df1 = df.groupby(df.index.date)['y'].sum().reset_index()
Try out:
df.groupby([df.dt.year, df.dt.month, df.dt.day])['y'].sum()
Lets say I have a idx=pd.DatatimeIndex with one minute frequency. I also have a list of bad dates (each are of type pd.Timestamp without the time information) that I want to remove from the original idx. How do I do that in pandas?
Use normalize to remove the time part from your index so you can do a simple ~ + isin selection, i.e. find the dates not in that bad list. You can further ensure your list of dates don't have a time part with the same [x.normalize() for x in bad_dates] if you need to be extra safe.
Sample Data
import pandas as pd
df = pd.DataFrame(range(9), index=pd.date_range('2010-01-01', freq='11H', periods=9))
bad_dates = [pd.Timestamp('2010-01-02'), pd.Timestamp('2010-01-03')]
Code
df[~df.index.normalize().isin(bad_dates)]
# 0
#2010-01-01 00:00:00 0
#2010-01-01 11:00:00 1
#2010-01-01 22:00:00 2
#2010-01-04 05:00:00 7
#2010-01-04 16:00:00 8
I have a pandas Dataframe that is indexed by Date. I would like to select all consecutive gaps by period and all consecutive days by Period. How can I do this?
Example of Dataframe with No Columns but a Date Index:
In [29]: import pandas as pd
In [30]: dates = pd.to_datetime(['2016-09-19 10:23:03', '2016-08-03 10:53:39','2016-09-05 11:11:30', '2016-09-05 11:10:46','2016-09-05 10:53:39'])
In [31]: ts = pd.DataFrame(index=dates)
As you can see there is a gap from 2016-08-03 and 2016-09-19. How do I detect these so I can create descriptive statistics, i.e. 40 gaps, with median gap duration of "x", etc. Also, I can see that 2016-09-05 and 2016-09-06 is a two day range. How I can detect these and also print descriptive stats?
Ideally the result would be returned as another Dataframe in each case since I want use other columns in the Dataframe to groupby.
Pandas version 1.0.1 has a built-in method DataFrame.diff() which you can use to accomplish this. One benefit is you can use pandas series functions like mean() to quickly compute summary statistics on the gaps series object
from datetime import datetime, timedelta
import pandas as pd
# Construct dummy dataframe
dates = pd.to_datetime([
'2016-08-03',
'2016-08-04',
'2016-08-05',
'2016-08-17',
'2016-09-05',
'2016-09-06',
'2016-09-07',
'2016-09-19'])
df = pd.DataFrame(dates, columns=['date'])
# Take the diff of the first column (drop 1st row since it's undefined)
deltas = df['date'].diff()[1:]
# Filter diffs (here days > 1, but could be seconds, hours, etc)
gaps = deltas[deltas > timedelta(days=1)]
# Print results
print(f'{len(gaps)} gaps with average gap duration: {gaps.mean()}')
for i, g in gaps.iteritems():
gap_start = df['date'][i - 1]
print(f'Start: {datetime.strftime(gap_start, "%Y-%m-%d")} | '
f'Duration: {str(g.to_pytimedelta())}')
here's something to get started:
df = pd.DataFrame(np.ones(5),columns = ['ones'])
df.index = pd.DatetimeIndex(['2016-09-19 10:23:03', '2016-08-03 10:53:39', '2016-09-05 11:11:30', '2016-09-05 11:10:46', '2016-09-06 10:53:39'])
daily_rng = pd.date_range('2016-08-03 00:00:00', periods=48, freq='D')
daily_rng = daily_rng.append(df.index)
daily_rng = sorted(daily_rng)
df = df.reindex(daily_rng).fillna(0)
df = df.astype(int)
df['ones'] = df.cumsum()
The cumsum() creates a grouping variable on 'ones' partitioning your data at the points your provided. If you print df to say a spreadsheet it will make sense:
print df.head()
ones
2016-08-03 00:00:00 0
2016-08-03 10:53:39 1
2016-08-04 00:00:00 1
2016-08-05 00:00:00 1
2016-08-06 00:00:00 1
print df.tail()
ones
2016-09-16 00:00:00 4
2016-09-17 00:00:00 4
2016-09-18 00:00:00 4
2016-09-19 00:00:00 4
2016-09-19 10:23:03 5
now to complete:
df = df.reset_index()
df = df.groupby(['ones']).aggregate({'ones':{'gaps':'count'},'index':{'first_spotted':'min'}})
df.columns = df.columns.droplevel()
which gives:
first_time gaps
ones
0 2016-08-03 00:00:00 1
1 2016-08-03 10:53:39 34
2 2016-09-05 11:10:46 1
3 2016-09-05 11:11:30 2
4 2016-09-06 10:53:39 14
5 2016-09-19 10:23:03 1
I am new to Pandas. I have the following data (stock prices)
id,date,time,price
0,2015-01-01,9:00,21.72
1,2015-01-01,9:00,17.65
2,2015-01-01,9:00,54.24
0,2015-01-01,11:00,21.82
1,2015-01-01,11:00,18.65
2,2015-01-01,11:00,52.24
0,2015-01-02,9:00,21.02
1,2015-01-02,9:00,19.01
2,2015-01-02,9:00,50.21
0,2015-01-02,11:00,20.61
1,2015-01-02,11:00,18.70
2,2015-01-02,11:00,51.21
...
...
I want to sort by date and calculate returns for each id and across dates and times within a date. I tried this
import pandas as pd
import numpy as np
df = pd.read_csv("/path/to/csv", index_col=[0,2,1])
df['returns'] = df['price'].pct_change()
However, the returns are calculated across the ids in the order they appear. Any idea how to do this correctly? I would also like to access the data as
price_0 = df['id'==0]['date'=='2014-01-01'][time=='9:00']['price']
Assuming that those are the columns in your dataframe (and none are the index), then you want to group by date, time, and id on price. You then unstack the id, which effectively creates a pivot table with dates and times as the rows and ids as the columns. You then need to use pct_change to achieve your objective.
returns = df.groupby(['date', 'time', 'id']).price.first().unstack().pct_change()
>>> returns
id 0 1 2
date time
1/1/15 11:00 NaN NaN NaN
9:00 -0.004583 -0.053619 0.038285
1/2/15 11:00 -0.051105 0.059490 -0.055863
9:00 0.019893 0.016578 -0.019527
It will probably be better, however, to combine the dates and times into timestamps. Assuming your dates and times are text representations, the following should work:
df['timestamp'] = df.apply(lambda row: pd.Timestamp(row.date + ' ' + row.time), axis=1)
Then, just group on the the timestamp and id, and unstack the id.
returns = df.groupby(['timestamp, 'id']).price.first().unstack('id').pct_change()
>>> returns
id 0 1 2
timestamp
2015-01-01 09:00:00 NaN NaN NaN
2015-01-01 11:00:00 0.004604 0.056657 -0.036873
2015-01-02 09:00:00 -0.036664 0.019303 -0.038859
You would index the returns for a given security as follows:
>>> returns.ix['2015-01-02 9:00'].loc[1]
0.0193029490616623