Sum large pandas dataframe based on smaller date ranges - python

I have a large pandas dataframe that has hourly data associated with it. I then want to aggregate it into "monthly" data that sums the hourly values. However, the months aren't necessarily calendar months; they typically start in the middle of one month and end in the middle of the next.
I could build a list of the "months" that each of these date ranges fall into and loop through it, but I would think there is a much better way to do this via pandas.
Here's my current code, the last line throws an error and is the crux of the question:
import pandas as pd
import numpy as np

dates = pd.Series(pd.date_range('1/1/2015 00:00','3/31/2015 23:45',freq='1H'))
nums = np.random.randint(0,100,dates.count())
df = pd.DataFrame({'date':dates, 'num':nums})
month = pd.DataFrame({'start':['1/4/2015 00:00','1/24/2015 00:00'], 'end':['1/23/2015 23:00','2/23/2015 23:00']})
month['start'] = pd.to_datetime(month['start'])
month['end'] = pd.to_datetime(month['end'])
month['num'] = df['num'][(df['date'] >= month['start']) & (df['date'] <= month['end'])].sum()
I would want an output similar to:
start end num
0 2015-01-04 2015-01-23 23:00:00 33,251
1 2015-01-24 2015-02-23 23:00:00 39,652
but of course, I'm not getting that.

pd.merge_asof is only available with pandas 0.19+
combination of pd.merge_asof + query + groupby
pd.merge_asof(df, month, left_on='date', right_on='start') \
.query('date <= end').groupby(['start', 'end']).num.sum().reset_index()
explanation
pd.merge_asof
From docs
For each row in the left DataFrame, we select the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key. Both DataFrames must be sorted by the key.
But this only takes into account the start date.
query
I take care of the end date with query, since end is now conveniently in my dataframe after pd.merge_asof
groupby
I trust this part is obvious
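Putting the pieces together, here is a runnable sketch of the same pipeline using the question's sample data (the exact sums will differ because num is random):
import numpy as np
import pandas as pd

dates = pd.Series(pd.date_range('1/1/2015 00:00', '3/31/2015 23:45', freq='1H'))
df = pd.DataFrame({'date': dates, 'num': np.random.randint(0, 100, dates.count())})
month = pd.DataFrame({'start': pd.to_datetime(['1/4/2015 00:00', '1/24/2015 00:00']),
                      'end': pd.to_datetime(['1/23/2015 23:00', '2/23/2015 23:00'])})

# both frames are already sorted by their join keys, as merge_asof requires
result = (pd.merge_asof(df, month, left_on='date', right_on='start')
            .query('date <= end')
            .groupby(['start', 'end'])
            .num.sum()
            .reset_index())
Rows before the first start date get NaT for start/end after the asof merge and are then dropped by the query, so only hours inside one of the windows are summed.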

Maybe you can convert to a period and add a number of days
# create data
dates = pd.Series(pd.date_range('1/1/2015 00:00','3/31/2015 23:45',freq='1H'))
nums = np.random.randint(0,100,dates.count())
df = pd.DataFrame({'date':dates, 'num':nums})
# offset days and then create period
df['periods'] = (df.date + pd.tseries.offsets.Day(23)).dt.to_period('M')
# group and sum
df.groupby('periods')['num'].sum()
Output
periods
2015-01 10051
2015-02 34229
2015-03 37311
2015-04 26655
You can then shift the dates back and make new columns
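As a rough sketch of that last step (the monthly, start, and end names are just illustrative), you can shift each period's boundaries back by the same 23-day offset:
monthly = df.groupby('periods')['num'].sum().reset_index()
# undo the 23-day shift to recover the original window boundaries
monthly['start'] = monthly['periods'].dt.start_time - pd.tseries.offsets.Day(23)
monthly['end'] = monthly['periods'].dt.end_time - pd.tseries.offsets.Day(23)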

Related

How to aggregate irregularly sampled data for Time Series Analysis

I am trying to forecast daily profit using time series analysis, but daily profit is recorded unevenly and some of the data is missing entirely.
Raw Data:
Date         Revenue
2020/1/19    10$
2020/1/20    7$
2020/1/25    14$
2020/1/29    18$
2020/2/1     12$
2020/2/2     17$
2020/2/9     28$
The above table is an example of what kind of data I have. Profit is not recorded daily, so the dates between 2020/1/20 and 2020/1/24 simply do not appear. On top of that, say the profit recorded during the period between 2020/2/3 and 2020/2/8 went missing from the database. I would like to recover this missing data and use time series analysis to predict the profit from 2020/2/9 onward.
My approach was to first aggregate the profit every 6 days, since I have to recover the profit between 2020/2/3 and 2020/2/8. So my cleaned data will look something like this:
Date                     Revenue
2020/1/16 ~ 2020/1/21    17$
2020/1/22 ~ 2020/1/27    14$
2020/1/28 ~ 2020/2/2     47$
2020/2/3 ~ 2020/2/8      ? (to predict)
After applying this to a time series model, I would like to further predict the profit after 2020/2/9 ~.
This is my general idea, but as a Python beginner using the pandas library, I am having trouble executing it. Could you please show me how to aggregate the profit every 6 days so that the data looks like the table above?
The easiest way is to use the pandas resample function.
Provided you have a DatetimeIndex, resampling to aggregate profits every 6 days is as simple as your_dataframe.resample('6D').sum()
You can do all sorts of resampling (end of month, end of quarter, beginning of week, every hour, minute, second, ...). Check the full documentation if you're interested: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html?highlight=resample#pandas.DataFrame.resample
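As a minimal sketch on the question's data (assuming the Date column has already been parsed with pd.to_datetime):
df = df.set_index('Date')   # resample needs a DatetimeIndex, or pass on='Date'
df.resample('6D').sum()     # sums Revenue within each 6-day bin
Note that the 6-day bins are anchored at the first timestamp in the data by default; anchoring them at a specific date such as 2020/1/16 needs the base/origin argument shown further down.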
I suggest using a combination of .rolling, pd.date_range, and .reindex
Say your DataFrame is df, with proper datetime indexing:
df = pd.DataFrame([['2020/1/19', 10],
                   ['2020/1/20', 7],
                   ['2020/1/25', 14],
                   ['2020/1/29', 18],
                   ['2020/2/1', 12],
                   ['2020/2/2', 17],
                   ['2020/2/9', 28]], columns=['Date', 'Revenue'])
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date',inplace=True)
The first step is to 'fill in' the missing days with dummy, zero revenue. We can use pd.date_range to get an index with evenly spaced dates from 2020/1/16 to 2020/2/8, and then .reindex to bring this into the main df DataFrame:
evenly_spaced_idx = pd.date_range(start='2020/1/16',end='2020/2/8',freq='1d')
df = df.reindex(evenly_spaced_idx, fill_value=0)
Now we can take a rolling sum over each 6-day period. We're not interested in every day's six-day revenue total, though, only every 6th day's:
summary_df = df.rolling('6d').sum().iloc[5::6, :]
The last step with summary_df is just to format it the way you'd like, so that it clearly states the date range each row refers to.
summary_df['Start Date'] = summary_df.index - pd.Timedelta('5d')  # window end is inclusive, so the start is 5 days earlier
summary_df['End Date'] = summary_df.index
summary_df.reset_index(drop=True,inplace=True)
You can use resample for this.
Make sure to have the "Date" column as datetime type.
>>> df = pd.DataFrame([["2020/1/19" ,10],
... ["2020/1/20" ,7],
... ["2020/1/25" ,14],
... ["2020/1/29" ,18],
... ["2020/2/1" ,12],
... ["2020/2/2" ,17],
... ["2020/2/9" ,28]], columns=['Date', 'Revenue'])
>>> df['Date'] = pd.to_datetime(df.Date)
For pandas < 1.1.0
>>> df.set_index('Date').resample('6D', base=3).sum()
Revenue
Date
2020-01-16 17
2020-01-22 14
2020-01-28 47
2020-02-03 0
2020-02-09 28
For pandas >= 1.1.0
>>> df.set_index('Date').resample('6D', origin='2020-01-16').sum()
Revenue
Date
2020-01-16 17
2020-01-22 14
2020-01-28 47
2020-02-03 0
2020-02-09 28

Pandas Columns with Date and Time - How to sort?

I have concatenated several csv files into one dataframe to make a combined csv file. But one of the columns has both date and time (e.g. 02:33:01 21-Jun-2018) after being converted to date_time format. However when I call
new_dataframe = old_dataframe.sort_values(by = 'Time')
It sorts the dataframe by time, completely ignoring the date.
Index Time Depth(ft) Pit Vol(bbl) Trip Tank(bbl)
189147 00:00:00 03-May-2018 2283.3578 719.6753 54.2079
3875 00:00:00 07-May-2018 5294.7308 1338.7178 29.5781
233308 00:00:00 20-May-2018 8073.7988 630.7964 41.3574
161789 00:00:01 05-May-2018 122.2710 353.6866 58.9652
97665 00:00:01 01-May-2018 16178.8666 769.1328 66.0688
How do I get it to sort by date and then time, so that April's days come first and everything is in chronological order?
In order to sort by date first and then time, your Time column needs to be in the right order: date followed by time. Currently, it's the opposite.
You can do this:
df['Time'] = df['Time'].str.split(' ').str[::-1].apply(lambda x: ' '.join(x))
df['Time'] = pd.to_datetime(df['Time'])
Now sort your df by Time like this:
df.sort_values('Time')
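A small self-contained sketch of that idea, with made-up values in the question's format:
df = pd.DataFrame({'Time': ['00:00:00 03-May-2018', '23:59:59 01-May-2018']})
df['Time'] = df['Time'].str.split(' ').str[::-1].apply(lambda x: ' '.join(x))  # -> '03-May-2018 00:00:00'
df['Time'] = pd.to_datetime(df['Time'])
df = df.sort_values('Time')   # 01-May-2018 now sorts before 03-May-2018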

Is it possible to resample and sum values in a Pandas dataframe by specifying a date range?

I have a dataframe like the following (dates with an associated binary value (whether or not a flood occurs), spanning a total of 20 years):
...
2019-12-27 0.0
2019-12-28 1.0
2019-12-29 1.0
2019-12-30 0.0
2019-12-31 0.0
...
I need to produce a count (i.e. sum, considering the values are binary) over a series of custom date ranges, e.g. '24-05-2019 to 09-09-2019', or '15-10-2019 to 29-12-2019', etc.
My initial thought was to use the resample method, but as I understand it, that won't allow me to select a custom date range; rather, it resamples over a set time period, e.g. month or year.
Any ideas out there?
Thanks in advance
If the dates are a DatetimeIndex and the index of the dataframe or Series, you can directly select the relevant rows:
df.loc['24-05-2019':'09-09-2019', 'flood'].sum()
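For example, a minimal sketch (the one-year index, column name and random 0/1 values are just assumptions for illustration):
import numpy as np
import pandas as pd

idx = pd.date_range('2019-01-01', '2019-12-31', freq='D')
df = pd.DataFrame({'flood': np.random.randint(0, 2, len(idx))}, index=idx)

# count of flood days between 24-05-2019 and 09-09-2019 (inclusive)
df.loc['2019-05-24':'2019-09-09', 'flood'].sum()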
Since it's a Pandas dataframe you should be able to do something like:
start_date = df[df.date == '24-05-2019'].index.values
end_date = df[df.date == '09-09-2019'].index.values
subset = df[start_date:end_date]
sum(subset.flood) # Or other maths
where 'date' and 'flood' are your column headers, and 'df' is your dataframe. This assumes your dates are strings, and that each date only appears once. If not, you'll have to pick which date you want from the list of index values in 'start_date' and 'end_date'.

How to find the top 10 performing values of each week in python?

I would like to return the top 10 performing (by average) variables for each week in my DataFrame. It is about 2 years' worth of data.
I am using Python to figure this out but, would also eventually like to do it in SQL.
I have been able to produce code that returns the top 10 for the latest week, but I would like results for every week.
Creating a df that covers the latest week's datetime range:
range_max = rtbinds['pricedate'].max()
range_min = range_max - datetime.timedelta(days=7)
sliced_df = rtbinds[(rtbinds['pricedate'] >= range_min)
& (rtbinds['pricedate'] <= range_max)]
Grouping and sorting by 'shadow':
sliced_df.groupby(['pricedate','cons_name']).aggregate(np.mean) \
    .sort_values('shadow').head(10)
returns, for the first week of data:
pricedate cons_name shadow
2019-04-26 TEMP71_24753 -643.691
2019-04-27 TMP175_24736 -508.062
2019-04-25 TMP109_22593 -383.263
2019-04-23 TEMP48_24759 -376.967
2019-04-29 TEMP71_24753 -356.476
TMP175_24736 -327.230
TMP273_23483 -303.234
2019-04-27 TEMP71_24753 -294.377
2019-04-28 TMP175_24736 -272.603
TMP109_22593 -270.887
But, I would like a list that returns the top 10 for each week until the earliest date of my data
Heads up: pd.sort_values sorts in ascending order by default, so when you take head(10) it's actually the worst 10 if we consider the natural ordering of real numbers.
Now for your problem, here is a solution
First we need to create some columns to identify the week of the year (rtbinds is renamed df):
df['year'] = df['pricedate'].apply(lambda x: x.year)
df['week'] = df['pricedate'].apply(lambda x: x.isocalendar()[1])
Then we will group the data by ['year', 'week', 'cons_name'] :
df2 = df.groupby(['year', 'week', 'cons_name'], as_index=False).aggregate(np.mean)
You should now get a dataframe where, for each (year, week), there is only one record per cons_name, holding the mean shadow.
Then we will take the top 10 for each (year, week):
def udf(df):
    return df.sort_values('shadow').head(10)

df2.groupby(['year', 'week'], as_index=False).apply(udf)
This should give you the result you want.
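If you actually want the largest averages rather than the most negative ones (see the note about ascending order above), one possible variation, just a sketch, is to use nlargest:
def top10(g):
    return g.nlargest(10, 'shadow')   # or g.nsmallest(10, 'shadow') for the most negative

df2.groupby(['year', 'week'], as_index=False).apply(top10)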

Pandas- Split dataset based on overlapping time periods

I have reporting time periods that start on Mondays, end on Sundays, and run for 5 weeks. For example:
11/20/2017 - 12/24/2017 = t1
11/27/2017 - 12/31/2017 = t2
I have a dataframe that consists of 6 of these periods (starting 11/20/2017) and I'm trying to split it into 6 dataframes for each time period using the LeaveDate column. My data looks like this:
Barcode LeaveDate
ABC123 2017-11-22
ABC124 2017-12-04
ABC125 2017-12-15
As the dataframe is separated, some of the barcodes will fall into multiple periods; that's OK. I know I can do:
df['period'] = df['LeaveDate'].dt.to_period('M-SUN')
df['week'] = df['period'].dt.week
To get single weeks, but I don't know how to define a "multi-week" period. The problem also is that a barcode can fall under multiple periods, so it needs to be output to multiple dataframes. Any ideas? Thanks!
There might be a more succinct solution, but this should work (will give you a dictionary of DataFrames, one for each period):
df = pd.DataFrame([['ABC123', '2017-11-22'],
                   ['ABC124', '2017-12-04'],
                   ['ABC125', '2017-12-15']],
                  columns=['Barcode', 'LeaveDate'])
periods = [('2017-11-20', '2017-12-24'), ('2017-11-27', '2017-12-31')]
results = {}
for period in periods:
    period_df = df[(df['LeaveDate'] >= period[0]) & (df['LeaveDate'] <= period[1])]
    results[period] = period_df
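Each period's frame can then be pulled out by its (start, end) key, for example:
t1 = results[('2017-11-20', '2017-12-24')]   # barcodes whose LeaveDate falls in the first period
Note that the string comparisons in the filter work here only because LeaveDate is ISO-formatted (YYYY-MM-DD); if the dates may come in other formats, convert the column with pd.to_datetime first.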
