My data can have multiple events on a given date or NO events on a date. I take these events, get a count by date and plot them. However, when I plot them, my two series don't always match.
idx = pd.date_range(df['simpleDate'].min(), df['simpleDate'].max())
s = df.groupby(['simpleDate']).size()
In the above code idx becomes a range of say 30 dates. 09-01-2013 to 09-30-2013
However S may only have 25 or 26 days because no events happened for a given date. I then get an AssertionError as the sizes dont match when I try to plot:
fig, ax = plt.subplots()
ax.bar(idx.to_pydatetime(), s, color='green')
What's the proper way to tackle this? Do I want to remove dates with no values from IDX or (which I'd rather do) is add to the series the missing date with a count of 0. I'd rather have a full graph of 30 days with 0 values. If this approach is right, any suggestions on how to get started? Do I need some sort of dynamic reindex function?
Here's a snippet of S ( df.groupby(['simpleDate']).size() ), notice no entries for 04 and 05.
09-02-2013 2
09-03-2013 10
09-06-2013 5
09-07-2013 1
You could use Series.reindex:
import pandas as pd
idx = pd.date_range('09-01-2013', '09-30-2013')
s = pd.Series({'09-02-2013': 2,
'09-03-2013': 10,
'09-06-2013': 5,
'09-07-2013': 1})
s.index = pd.DatetimeIndex(s.index)
s = s.reindex(idx, fill_value=0)
print(s)
yields
2013-09-01 0
2013-09-02 2
2013-09-03 10
2013-09-04 0
2013-09-05 0
2013-09-06 5
2013-09-07 1
2013-09-08 0
...
A quicker workaround is to use .asfreq(). This doesn't require creation of a new index to call within .reindex().
# "broken" (staggered) dates
dates = pd.Index([pd.Timestamp('2012-05-01'),
pd.Timestamp('2012-05-04'),
pd.Timestamp('2012-05-06')])
s = pd.Series([1, 2, 3], dates)
print(s.asfreq('D'))
2012-05-01 1.0
2012-05-02 NaN
2012-05-03 NaN
2012-05-04 2.0
2012-05-05 NaN
2012-05-06 3.0
Freq: D, dtype: float64
One issue is that reindex will fail if there are duplicate values. Say we're working with timestamped data, which we want to index by date:
df = pd.DataFrame({
'timestamps': pd.to_datetime(
['2016-11-15 1:00','2016-11-16 2:00','2016-11-16 3:00','2016-11-18 4:00']),
'values':['a','b','c','d']})
df.index = pd.DatetimeIndex(df['timestamps']).floor('D')
df
yields
timestamps values
2016-11-15 "2016-11-15 01:00:00" a
2016-11-16 "2016-11-16 02:00:00" b
2016-11-16 "2016-11-16 03:00:00" c
2016-11-18 "2016-11-18 04:00:00" d
Due to the duplicate 2016-11-16 date, an attempt to reindex:
all_days = pd.date_range(df.index.min(), df.index.max(), freq='D')
df.reindex(all_days)
fails with:
...
ValueError: cannot reindex from a duplicate axis
(by this it means the index has duplicates, not that it is itself a dup)
Instead, we can use .loc to look up entries for all dates in range:
df.loc[all_days]
yields
timestamps values
2016-11-15 "2016-11-15 01:00:00" a
2016-11-16 "2016-11-16 02:00:00" b
2016-11-16 "2016-11-16 03:00:00" c
2016-11-17 NaN NaN
2016-11-18 "2016-11-18 04:00:00" d
fillna can be used on the column series to fill blanks if needed.
An alternative approach is resample, which can handle duplicate dates in addition to missing dates. For example:
df.resample('D').mean()
resample is a deferred operation like groupby so you need to follow it with another operation. In this case mean works well, but you can also use many other pandas methods like max, sum, etc.
Here is the original data, but with an extra entry for '2013-09-03':
val
date
2013-09-02 2
2013-09-03 10
2013-09-03 20 <- duplicate date added to OP's data
2013-09-06 5
2013-09-07 1
And here are the results:
val
date
2013-09-02 2.0
2013-09-03 15.0 <- mean of original values for 2013-09-03
2013-09-04 NaN <- NaN b/c date not present in orig
2013-09-05 NaN <- NaN b/c date not present in orig
2013-09-06 5.0
2013-09-07 1.0
I left the missing dates as NaNs to make it clear how this works, but you can add fillna(0) to replace NaNs with zeroes as requested by the OP or alternatively use something like interpolate() to fill with non-zero values based on the neighboring rows.
Here's a nice method to fill in missing dates into a dataframe, with your choice of fill_value, days_back to fill in, and sort order (date_order) by which to sort the dataframe:
def fill_in_missing_dates(df, date_col_name = 'date',date_order = 'asc', fill_value = 0, days_back = 30):
df.set_index(date_col_name,drop=True,inplace=True)
df.index = pd.DatetimeIndex(df.index)
d = datetime.now().date()
d2 = d - timedelta(days = days_back)
idx = pd.date_range(d2, d, freq = "D")
df = df.reindex(idx,fill_value=fill_value)
df[date_col_name] = pd.DatetimeIndex(df.index)
return df
You can always just use DataFrame.merge() utilizing a left join from an 'All Dates' DataFrame to the 'Missing Dates' DataFrame. Example below.
# example DataFrame with missing dates between min(date) and max(date)
missing_df = pd.DataFrame({
'date':pd.to_datetime([
'2022-02-10'
,'2022-02-11'
,'2022-02-14'
,'2022-02-14'
,'2022-02-24'
,'2022-02-16'
])
,'value':[10,20,5,10,15,30]
})
# first create a DataFrame with all dates between specified start<-->end using pd.date_range()
all_dates = pd.DataFrame(pd.date_range(missing_df['date'].min(), missing_df['date'].max()), columns=['date'])
# from the all_dates DataFrame, left join onto the DataFrame with missing dates
new_df = all_dates.merge(right=missing_df, how='left', on='date')
s.asfreq('D').interpolate().asfreq('Q')
Related
I have a dataframe df that contains data in periods of 3 hours:
index , values
2003-01-01 00:00:00, 2.0
2003-01-01 03:00:00, 1.8
2003-01-01 06:00:00, 1.4
2003-01-01 09:00:00, 1.1
....
I want to resample the data to every hour and interpolate the missing values in between linearly. I can achieve something similar, filling the missing values with .bfill(), and it looks like this:
df2 = df.resample('H').bfill()
I tried to alter this to achive my task as follows:
df2 = df.resample('H')
df2.interpolate(method='linear', axis=0, inplace=True)
But df2 = df.resample('H') in contrast to df2 = df.resample('H').bfill() doesn't return a dataframe object, but a pandas.core.resample.DatetimeIndexResampler object.
Do you know how I can do the resampling and interpolation? Do you have some other work around? Tnx
I found out, that I could just append my initial approach with .interpolate() and it would work:
df2 = df.resample('H').interpolate()
df2.interpolate(method='linear', axis=0, inplace=True)
so the data set I am using is only business days but I want to change the date index such that it reflects every calendar day. When I use reindex and have to use reindex(), I am unsure how to use 'fill value' field of reindex to inherit the value above.
import pandas as pd
idx = pd.date_range("12/18/2019","12/24/2019")
df = pd.Series({'12/18/2019':22.63,
'12/19/2019':22.2,
'12/20/2019':21.03,
'12/23/2019':17,
'12/24/2019':19.65})
df.index = pd.DatetimeIndex(df.index)
df = df.reindex()
Currently, my data set looks like this.
However, when I use reindex I get the below result
In reality I want it to inherit the values directly above if it is a NaN result so the data set becomes the following
Thank you guys for your help!
You were close! You just need to pass the index you want to reindex on (idx in this case) as a parameter to the reindex method, and then you can set the method parameter to 'ffill' to propagate the last valid value forward.
idx = pd.date_range("12/18/2019","12/24/2019")
df = pd.Series({'12/18/2019':22.63,
'12/19/2019':22.2,
'12/20/2019':21.03,
'12/23/2019':17,
'12/24/2019':19.65})
df.index = pd.DatetimeIndex(df.index)
df = df.reindex(idx, method='ffill')
It seems that you have created a 'Series', not a dataframe. See if the code below helps you.
df = df.to_frame().reset_index() #to convert series to dataframe
df = df.fillna(method='ffill')
print(df)
Output You will have to rename columns
index 0
0 2019-12-18 22.63
1 2019-12-19 22.20
2 2019-12-20 21.03
3 2019-12-21 21.03
4 2019-12-22 21.03
5 2019-12-23 17.00
6 2019-12-24 19.65
I have a column in a dataframe which contains non-continuous dates. I need to group these date by a frequency of 2 days. Data Sample(after normalization):
2015-04-18 00:00:00
2015-04-20 00:00:00
2015-04-20 00:00:00
2015-04-21 00:00:00
2015-04-27 00:00:00
2015-04-30 00:00:00
2015-05-07 00:00:00
2015-05-08 00:00:00
I tried following but as the dates are not continuous I am not getting the desired result.
df.groupby(pd.Grouper(key = 'l_date', freq='2D'))
Is these a way to achieve the desired grouping using pandas or should I write a separate logic?
Once you have a l_date sorted dataframe. you can create a continuous dummy date (dum_date) column and groupby 2D frequency on it.
df = df.sort_values(by='l_date')
df['dum_date'] = pd.date_range(pd.datetime.today(), periods=df.shape[0]).tolist()
df.groupby(pd.Grouper(key = 'dum_date', freq='2D'))
OR
If you are fine with groupings other than dates. then a generalized way to group n consecutive rows could be:
n = 2 # n = 2 for your use case
df = df.sort_values(by='l_date')
df['grouping'] = [(i//n + 1) for i in range(df.shape[0])]
df.groupby(pd.Grouper(key = 'grouping'))
I have a pandas Dataframe that is indexed by Date. I would like to select all consecutive gaps by period and all consecutive days by Period. How can I do this?
Example of Dataframe with No Columns but a Date Index:
In [29]: import pandas as pd
In [30]: dates = pd.to_datetime(['2016-09-19 10:23:03', '2016-08-03 10:53:39','2016-09-05 11:11:30', '2016-09-05 11:10:46','2016-09-05 10:53:39'])
In [31]: ts = pd.DataFrame(index=dates)
As you can see there is a gap from 2016-08-03 and 2016-09-19. How do I detect these so I can create descriptive statistics, i.e. 40 gaps, with median gap duration of "x", etc. Also, I can see that 2016-09-05 and 2016-09-06 is a two day range. How I can detect these and also print descriptive stats?
Ideally the result would be returned as another Dataframe in each case since I want use other columns in the Dataframe to groupby.
Pandas version 1.0.1 has a built-in method DataFrame.diff() which you can use to accomplish this. One benefit is you can use pandas series functions like mean() to quickly compute summary statistics on the gaps series object
from datetime import datetime, timedelta
import pandas as pd
# Construct dummy dataframe
dates = pd.to_datetime([
'2016-08-03',
'2016-08-04',
'2016-08-05',
'2016-08-17',
'2016-09-05',
'2016-09-06',
'2016-09-07',
'2016-09-19'])
df = pd.DataFrame(dates, columns=['date'])
# Take the diff of the first column (drop 1st row since it's undefined)
deltas = df['date'].diff()[1:]
# Filter diffs (here days > 1, but could be seconds, hours, etc)
gaps = deltas[deltas > timedelta(days=1)]
# Print results
print(f'{len(gaps)} gaps with average gap duration: {gaps.mean()}')
for i, g in gaps.iteritems():
gap_start = df['date'][i - 1]
print(f'Start: {datetime.strftime(gap_start, "%Y-%m-%d")} | '
f'Duration: {str(g.to_pytimedelta())}')
here's something to get started:
df = pd.DataFrame(np.ones(5),columns = ['ones'])
df.index = pd.DatetimeIndex(['2016-09-19 10:23:03', '2016-08-03 10:53:39', '2016-09-05 11:11:30', '2016-09-05 11:10:46', '2016-09-06 10:53:39'])
daily_rng = pd.date_range('2016-08-03 00:00:00', periods=48, freq='D')
daily_rng = daily_rng.append(df.index)
daily_rng = sorted(daily_rng)
df = df.reindex(daily_rng).fillna(0)
df = df.astype(int)
df['ones'] = df.cumsum()
The cumsum() creates a grouping variable on 'ones' partitioning your data at the points your provided. If you print df to say a spreadsheet it will make sense:
print df.head()
ones
2016-08-03 00:00:00 0
2016-08-03 10:53:39 1
2016-08-04 00:00:00 1
2016-08-05 00:00:00 1
2016-08-06 00:00:00 1
print df.tail()
ones
2016-09-16 00:00:00 4
2016-09-17 00:00:00 4
2016-09-18 00:00:00 4
2016-09-19 00:00:00 4
2016-09-19 10:23:03 5
now to complete:
df = df.reset_index()
df = df.groupby(['ones']).aggregate({'ones':{'gaps':'count'},'index':{'first_spotted':'min'}})
df.columns = df.columns.droplevel()
which gives:
first_time gaps
ones
0 2016-08-03 00:00:00 1
1 2016-08-03 10:53:39 34
2 2016-09-05 11:10:46 1
3 2016-09-05 11:11:30 2
4 2016-09-06 10:53:39 14
5 2016-09-19 10:23:03 1
I am new to Pandas. I have the following data (stock prices)
id,date,time,price
0,2015-01-01,9:00,21.72
1,2015-01-01,9:00,17.65
2,2015-01-01,9:00,54.24
0,2015-01-01,11:00,21.82
1,2015-01-01,11:00,18.65
2,2015-01-01,11:00,52.24
0,2015-01-02,9:00,21.02
1,2015-01-02,9:00,19.01
2,2015-01-02,9:00,50.21
0,2015-01-02,11:00,20.61
1,2015-01-02,11:00,18.70
2,2015-01-02,11:00,51.21
...
...
I want to sort by date and calculate returns for each id and across dates and times within a date. I tried this
import pandas as pd
import numpy as np
df = pd.read_csv("/path/to/csv", index_col=[0,2,1])
df['returns'] = df['price'].pct_change()
However, the returns are calculated across the ids in the order they appear. Any idea how to do this correctly? I would also like to access the data as
price_0 = df['id'==0]['date'=='2014-01-01'][time=='9:00']['price']
Assuming that those are the columns in your dataframe (and none are the index), then you want to group by date, time, and id on price. You then unstack the id, which effectively creates a pivot table with dates and times as the rows and ids as the columns. You then need to use pct_change to achieve your objective.
returns = df.groupby(['date', 'time', 'id']).price.first().unstack().pct_change()
>>> returns
id 0 1 2
date time
1/1/15 11:00 NaN NaN NaN
9:00 -0.004583 -0.053619 0.038285
1/2/15 11:00 -0.051105 0.059490 -0.055863
9:00 0.019893 0.016578 -0.019527
It will probably be better, however, to combine the dates and times into timestamps. Assuming your dates and times are text representations, the following should work:
df['timestamp'] = df.apply(lambda row: pd.Timestamp(row.date + ' ' + row.time), axis=1)
Then, just group on the the timestamp and id, and unstack the id.
returns = df.groupby(['timestamp, 'id']).price.first().unstack('id').pct_change()
>>> returns
id 0 1 2
timestamp
2015-01-01 09:00:00 NaN NaN NaN
2015-01-01 11:00:00 0.004604 0.056657 -0.036873
2015-01-02 09:00:00 -0.036664 0.019303 -0.038859
You would index the returns for a given security as follows:
>>> returns.ix['2015-01-02 9:00'].loc[1]
0.0193029490616623