Pandas: Value With Greatest Quantity Resampled - python

I have a DataFrame df
value quantity
2020-01-02 08:50:03 A 20
2020-01-02 08:52:39 B 29
2020-01-02 08:54:51 C 30
2020-01-02 08:55:03 C 20
2020-01-02 08:56:43 A 20
2020-01-02 08:59:59 B 10
2020-01-02 09:02:01 A 29
2020-01-02 09:03:29 B 27
2020-01-02 09:06:51 C 30
2020-01-02 09:07:03 C 20
2020-01-02 09:07:43 A 33
2020-01-02 09:09:59 B 10
I want to resample my DataFrame every T minutes (10 minutes in this example) and return the value that has the highest total quantity. In the above example, I want to return the following:
value quantity
2020-01-02 08:50:00 C 50
2020-01-02 09:00:00 A 62
My current solution works but is slow due to redundant computations.
def get_value_with_max_qty(df_rl):
    """Returns the value and total quantity of the value with max qty.

    Args:
        df_rl: the resampled group (a DataFrame)

    Returns:
        A pandas Series
    """
    gper = df_rl.groupby(df_rl.value).quantity.sum()
    return pd.Series([gper.idxmax(), gper.max()])
Then, I run:
df.resample('10T', label='right', closed='left').apply(get_value_with_max_qty)
Is there a way to make my code faster and more memory-efficient without using groupby in an apply method?

Three observations:
resample is often slower than pd.Grouper
boolean indexing is often faster than idxmax (a toy illustration follows the list)
avoid nesting a groupby inside an apply if possible
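As a toy illustration of the second point (my own aside, not part of the benchmark below): given a per-group total such as
s = pd.Series({'A': 40, 'B': 39, 'C': 50})
s.idxmax()         # 'C'
s[s == s.max()]    # same information via boolean indexing
the boolean-indexing form is the one the proposal below relies on, via transform('max').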
I benchmarked an alternative method, largest_quantity_proposal, against yours, largest_quantity_baseline. It computes the per-group quantity aggregates in a single groupby instead of nesting a groupby inside resample. Both produce the same result:
>>> largest_quantity_baseline(df)
0 1
datetime
2020-01-02 09:00:00 C 50
2020-01-02 09:10:00 A 62
>>> largest_quantity_proposal(df)
datetime value
2020-01-02 09:00:00 C 50
2020-01-02 09:10:00 A 62
Name: quantity, dtype: int64
For 1,000 repetitions, the suggested method takes on average 2.9ms vs. 4ms for the baseline method on my machine, i.e., it's roughly 30% faster.
import pandas as pd
import timeit
from io import StringIO


def largest_quantity_baseline(df):
    def get_value_with_max_qty(df_rl):
        gper = df_rl.groupby(df_rl.value).quantity.sum()
        return pd.Series([gper.idxmax(), gper.max()])
    return df.resample('10T', label='right', closed='left').apply(get_value_with_max_qty)


def largest_quantity_proposal(df):
    test = df.groupby([pd.Grouper(level='datetime', freq='10Min', label='right', closed='left'),
                       'value'])['quantity'].sum()
    return test[test == test.groupby('datetime').transform('max')]


# sample data
df = """
datetime,value,quantity
2020-01-02 08:50:03,A,20
2020-01-02 08:52:39,B,29
2020-01-02 08:54:51,C,30
2020-01-02 08:55:03,C,20
2020-01-02 08:56:43,A,20
2020-01-02 08:59:59,B,10
2020-01-02 09:02:01,A,29
2020-01-02 09:03:29,B,27
2020-01-02 09:06:51,C,30
2020-01-02 09:07:03,C,20
2020-01-02 09:07:43,A,33
2020-01-02 09:09:59,B,10
"""
df = pd.read_csv(StringIO(df.strip()), sep=',', engine='python',
                 parse_dates=['datetime'], index_col='datetime')

# check result for correctness
largest_quantity_baseline(df)
largest_quantity_proposal(df)

# benchmark tests
num_runs = 1000
duration = timeit.Timer(lambda: largest_quantity_baseline(df)).timeit(number=num_runs)
print(duration / num_runs)
duration = timeit.Timer(lambda: largest_quantity_proposal(df)).timeit(number=num_runs)
print(duration / num_runs)
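One caveat: the two functions return differently shaped objects (a DataFrame with unnamed columns vs. a Series with a MultiIndex). If the original value/quantity layout is needed, a small follow-up step (my own addition, not included in the benchmark) could be:
result = largest_quantity_proposal(df)
# move the 'value' level out of the index so the layout matches the question
result = result.reset_index(level='value')   # columns: value, quantity
print(result)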

Related

Efficient groupby when rows of groups are contiguous?

The context
I am looking to apply a ufunc (cumsum in this case) to blocks of contiguous rows in a time series, which is stored in a pandas DataFrame.
This time series is sorted by its DatetimeIndex.
Blocks are defined by a custom DatetimeIndex.
To do so, I came up with this (working) code.
import random
import pandas as pd

# input dataset
length = 10
ts = pd.date_range(start='2021/01/01 00:00', periods=length, freq='1h')
random.seed(1)
val = random.sample(range(1, 10 + length), length)
df = pd.DataFrame({'val': val}, index=ts)

# groupby custom datetimeindex
key_ts = [ts[i] for i in [1, 3, 7]]
df.loc[key_ts, 'id'] = range(len(key_ts))
df['id'] = df['id'].ffill()

# cumsum
df['cumsum'] = df.groupby('id')['val'].cumsum()
# initial dataset
In [13]: df
Out[13]:
val
2021-01-01 00:00:00 5
2021-01-01 01:00:00 3
2021-01-01 02:00:00 9
2021-01-01 03:00:00 4
2021-01-01 04:00:00 8
2021-01-01 05:00:00 13
2021-01-01 06:00:00 15
2021-01-01 07:00:00 14
2021-01-01 08:00:00 11
2021-01-01 09:00:00 7
# DatetimeIndex defining custom time intervals for 'resampling'.
In [14]: key_ts
Out[14]:
[Timestamp('2021-01-01 01:00:00', freq='H'),
Timestamp('2021-01-01 03:00:00', freq='H'),
Timestamp('2021-01-01 07:00:00', freq='H')]
# result
In [16]: df
Out[16]:
val id cumsum
2021-01-01 00:00:00 5 NaN -1
2021-01-01 01:00:00 3 0.0 3
2021-01-01 02:00:00 9 0.0 12
2021-01-01 03:00:00 4 1.0 4
2021-01-01 04:00:00 8 1.0 12
2021-01-01 05:00:00 13 1.0 25
2021-01-01 06:00:00 15 1.0 40
2021-01-01 07:00:00 14 2.0 14
2021-01-01 08:00:00 11 2.0 25
2021-01-01 09:00:00 7 2.0 32
The question
Is groupby the most efficient in terms of CPU and memory in this case where blocks are made with contiguous rows?
I would think that with groupby, a first pass over the full dataset is made to identify all rows to group together.
Knowing rows are contiguous in my case, I don't need to read the full dataset to know I have gathered all the rows of current group.
As soon as I hit the row of the next group, I know calculations are done with previous group.
In case rows are contiguous, the sorting step is lighter.
Hence the question, is there a way to mention this to pandas to save some CPU?
Thanks in advance for your feedback.
groupby is clearly not the fastest solution here because it has to rely on either a slow sort or slow hashing operations to group the values.
What you want to implement is called a segmented cumulative sum. You can implement it quite efficiently using NumPy, but it is a bit tricky (especially because of the NaN values) and not the fastest option, since it needs multiple passes over the id/val columns. The fastest solution is to use something like Numba to do the whole thing in one pass.
Here is an implementation:
import numpy as np
import numba as nb

# To avoid the compilation cost at runtime, use an explicit signature:
# @nb.njit('int64[:](float64[:], int64[:])')
@nb.njit
def segmentedCumSum(ids, values):
    size = len(ids)
    res = np.empty(size, dtype=values.dtype)
    if size == 0:
        return res
    zero = values.dtype.type(0)
    curValue = zero
    for i in range(size):
        if not np.isnan(ids[i]):
            if i > 0 and ids[i-1] != ids[i]:
                curValue = zero
            curValue += values[i]
            res[i] = curValue
        else:
            res[i] = -1
            curValue = zero
    return res

df['cumsum'] = segmentedCumSum(df['id'].to_numpy(), df['val'].to_numpy())
Note that ids[i-1] != ids[i] might fail with big floats because of their imprecision. The best solution is to use integers and -1 to replace the NaN value. If you do want to keep the float values, you can use the expression np.abs(ids[i-1]-ids[i]) > epsilon with a very small epsilon. See this for more information.
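For reference, here is one way the pure-NumPy route mentioned above could look. This is my own sketch, not the answer's code: it assumes the ids form contiguous blocks and that NaN ids only occur before the first block (as in the frame above), and it uses a cumulative-sum-with-offsets trick instead of an explicit loop:
import numpy as np

def segmented_cumsum_np(ids, values):
    valid = ~np.isnan(ids)
    out = np.full(len(values), -1, dtype=values.dtype)  # -1 marks rows without an id
    v = values[valid]
    grp = ids[valid]
    # indices where a new contiguous block starts
    starts = np.flatnonzero(np.r_[True, grp[1:] != grp[:-1]])
    csum = np.cumsum(v)
    # for each row, subtract the running total reached just before its block started
    offsets = np.repeat(np.r_[0, csum[starts[1:] - 1]],
                        np.diff(np.r_[starts, len(v)]))
    out[valid] = csum - offsets
    return out

df['cumsum_np'] = segmented_cumsum_np(df['id'].to_numpy(), df['val'].to_numpy())
On the frame above this reproduces the segmentedCumSum output; the Numba version remains the simpler and, as the answer notes, faster option.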

How to loop through a pandas grouped time series?

I have a dataframe like this:
datetime type d13C ... dayofyear week dmy
1 2018-01-05 15:22:30 air -8.88 ... 5 1 5-1-2018
2 2018-01-05 15:23:30 air -9.08 ... 5 1 5-1-2018
3 2018-01-05 15:24:30 air -10.08 ... 5 1 5-1-2018
4 2018-01-05 15:25:30 air -9.51 ... 5 1 5-1-2018
5 2018-01-05 15:26:30 air -9.61 ... 5 1 5-1-2018
... ... ... ... ... ... ...
341543 2018-12-17 12:42:30 air -9.99 ... 351 51 17-12-2018
341544 2018-12-17 12:43:30 air -9.53 ... 351 51 17-12-2018
341545 2018-12-17 12:44:30 air -9.54 ... 351 51 17-12-2018
341546 2018-12-17 12:45:30 air -9.93 ... 351 51 17-12-2018
341547 2018-12-17 12:46:30 air -9.66 ... 351 51 17-12-2018
Full data here: https://drive.google.com/file/d/1KmOwnpvrG2Edz1AlLyD0CKZlBpaFervM/view?usp=sharing
I'm plotting the d13C column on the Y-axis and the inverse of total_co2 on the X-axis, and then fitting a regression line for each day in the data. I then filter out and store the dates I want depending on whether the r^2 value of the regression line is > 0.8, like this:
import pandas as pd
from numpy.polynomial.polynomial import polyfit
import numpy as np
from scipy import stats

df = pd.read_csv('dataset.txt',
                 usecols=['datetime', 'type', 'total_co2', 'd13C', 'day', 'month', 'year',
                          'dayofyear', 'week', 'hour'],
                 dtype={'total_co2': np.float64, 'd13C': np.float64, 'day': str, 'month': str,
                        'year': str, 'week': str, 'hour': str, 'dayofyear': str})

# adding a full date column to make it easier to filter through the rows, i.e. each day
df['dmy'] = df['day'] + '-' + df['month'] + '-' + df['year']
# window18 = df[((df['year']=='2018'))]  # selecting just the data from the year 2018

accepted_dates_list = []  # empty list to store the dates we're interested in
for d in df['dmy'].unique():  # pass through each day; .unique() ensures no day is repeated
    acceptable_date = {}  # dictionary to store a valid date's details
    period = df[df.dmy == d]  # defining each period from the dmy column
    p = (period['total_co2'])**-1
    q = period['d13C']
    c, m = polyfit(p, q, 1)  # intercept and gradient of the regression line
    slope, intercept, r_value, p_value, std_err = stats.linregress(p, q)  # statistical properties of the regression line
    if r_value**2 >= 0.8:
        acceptable_date['period'] = d  # populating the dictionary with the accepted date and corresponding values
        acceptable_date['r-squared'] = r_value**2
        acceptable_date['intercept'] = intercept
        accepted_dates_list.append(acceptable_date)  # append the valid entry to the list
    else:
        pass

accepted_dates18 = pd.DataFrame(accepted_dates_list)  # converting the list to a df
print(accepted_dates18)
But now I want to do the same thing, just over three-day periods, which I'm trying to select from the dayofyear column (unsure if this is the best way or not). For example, I would want to fit the regression line using all the rows with dayofyear=5, dayofyear=6, dayofyear=7, then for the next three days, and so on until the end of the data. There are some days missing, but essentially I just need to do this for every 3 days in the data.
The output dataframe I am then trying to get would list the three-day intervals with r^2 > 0.8, so anything like this that will show the valid date ranges:
Accepted dates
0 23-08-2018 - 25-08-2018
1 26-08-2018 - 28-08-2018
2 31-08-2018 - 02-09-2018
3 15-09-2018 - 17-09-2018
4 24-09-2018 - 26-09-2018
I'm not too sure what to do to iterate over every three days. Any help would go a long way, thanks!
Your code loops through a list of unique dates and filters the dataframe on each iteration.
Pandas implements this with df.groupby(). It can be used to loop over the groups, or it can be combined with aggregations, function applications, and transformations; you can read more about it in the user guide. This function can form groups according to any of the columns (or set of columns) in df, levels of the index, or any other exogenous list-like of the same length as df (we are grouping rows, but note it can also group columns). It even has built-in implementations for the most common statistical aggregations such as mean, std, and corr, among many others.
Now to your problem. You not only want the correlation but the equation, so you do need to loop. And to get three-day groups you can use that dayofyear column with a twist.
Take this data
import io
import pandas as pd

fo = io.StringIO(
'''datetime,d13C
2018-01-05 15:22:30,-8.88
2018-01-05 15:23:30,-9.08
2018-01-06 15:24:30,-10.0
2018-01-06 15:25:30,-9.51
2018-01-07 15:26:30,-9.61
2018-01-07 15:27:30,-9.61
2018-01-08 15:28:30,-9.61
2018-01-08 15:29:30,-9.61
2018-01-09 15:26:30,-9.61
2018-01-09 15:27:30,-9.61
''')
df = pd.read_csv(fo)
df.datetime = pd.to_datetime(df.datetime)
fo.close()
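As an aside, the built-in aggregations mentioned above are one-liners on this frame (not needed for the regression loop below, just to show the idea):
# mean d13C per calendar day, grouping directly on a derived key
print(df.groupby(df.datetime.dt.dayofyear)['d13C'].mean())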
With the code for grouping and looping
first_day = 5
days_to_group = 3
for doy, gdf in df.groupby((df.datetime.dt.dayofyear.sub(first_day) // days_to_group)
                           * days_to_group + first_day):
    print(gdf, '\n')
    print(doy, '\n')
Output
datetime d13C
0 2018-01-05 15:22:30 -8.88
1 2018-01-05 15:23:30 -9.08
2 2018-01-06 15:24:30 -10.00
3 2018-01-06 15:25:30 -9.51
4 2018-01-07 15:26:30 -9.61
5 2018-01-07 15:27:30 -9.61
5
datetime d13C
6 2018-01-08 15:28:30 -9.61
7 2018-01-08 15:29:30 -9.61
8 2018-01-09 15:26:30 -9.61
9 2018-01-09 15:27:30 -9.61
8
Now you can plug your code into this loop and get what you need.
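For example, a rough sketch of that plugging-in (my own addition), assuming the full dataset from the question with its total_co2 column, and that df.datetime has already been parsed with pd.to_datetime:
from scipy import stats

accepted = []
group_key = (df.datetime.dt.dayofyear.sub(first_day) // days_to_group) * days_to_group + first_day
for doy, gdf in df.groupby(group_key):
    p = gdf['total_co2'] ** -1          # inverse total_co2, as in the question's per-day loop
    q = gdf['d13C']
    slope, intercept, r_value, p_value, std_err = stats.linregress(p, q)
    if r_value ** 2 >= 0.8:
        accepted.append({'period': f"{gdf.datetime.min():%d-%m-%Y} - {gdf.datetime.max():%d-%m-%Y}",
                         'r-squared': r_value ** 2,
                         'intercept': intercept})
accepted_dates = pd.DataFrame(accepted)
print(accepted_dates)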
PS
You can also use df.datetime.dt.floor('3d') as the grouper but I am not aware of how to control the first_day, so use it with caution.
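If the floor route is preferred, one way to control the starting day (my own workaround, not from the answer above) is to floor the offset from a chosen anchor date rather than the timestamps themselves:
anchor = pd.Timestamp('2018-01-05')            # hypothetical first day of the first bin
bins = (df.datetime - anchor).dt.floor('3D') + anchor
for start, gdf in df.groupby(bins):
    print(start, len(gdf))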
Here is one approach. As I understand it, the primary goal is to get from current observations (multiple per day) to a 3-day moving average. First, I created a smaller, simpler data set:
import pandas as pd
df = pd.DataFrame({'counter': [*range(100)],
                   'date': pd.date_range('2020-01-01', periods=100, freq='7H')})
df = df.set_index('date')
print(df.head())
counter
date
2020-01-01 00:00:00 0
2020-01-01 07:00:00 1
2020-01-01 14:00:00 2
2020-01-01 21:00:00 3
2020-01-02 04:00:00 4
Second, I re-sampled on a daily basis:
df2 = df['counter'].resample('1D').mean() # <-- called df2
print(df2.head())
date
2020-01-01 1.5
2020-01-02 5.0
2020-01-03 8.5
2020-01-04 12.0
2020-01-05 15.5
Freq: D, Name: counter, dtype: float64
Third, I computed mean value for a 3-day moving window:
print(df2.rolling(3).mean().head())
date
2020-01-01 NaN
2020-01-02 NaN
2020-01-03 5.0
2020-01-04 8.5
2020-01-05 12.0
Freq: D, Name: counter, dtype: float64
Seems like resample().mean() and rolling().mean() would be useful in this case.

SAX method: cut time series into subsequences then calculate distances (Python)

I am trying to apply SAX (Symbolic Aggregation Approximation) method to detect outliers on my time series data. Basically I need to cut the whole series into equal length sub-series, then calculate the distances between each of them. Then the top-K sub-series are marked as abnormal.
Tried a few packages:
pyts - not sure how to cut the series in the first place
This question is relatable - is there any better solution in python?
tslearn.metrics.dtw_path_from_metric - looks like it's calculating distances between two series, but I am missing the first "cutting" part.
Also I was thinking whether a matrix would work (with each sub-series as a row and a column, and the pairwise distances filling the matrix).
The desired outcome is: 1) cut the series by week; 2) calculate the distances between the subseries; 3) rank them and take the top-k longest-distance ones. I know it's probably a lot to ask, but any suggestion will be really appreciated!
import datetime
import pandas as pd
import numpy as np
rng = np.random.RandomState(0)
base = datetime.datetime.today()
dates = pd.date_range(start='1/1/2020', end='6/1/2020', freq='D')
df = pd.DataFrame(dates, columns=['date'])
df['sales'] = np.random.randint(0, 100, size=(len(dates)))
An answer to 1) cut the series by week
Although you could probably just get away with using df.groupby(pd.Grouper(key='date', freq='W')), perhaps more useful would be to populate the dataframe with week_number and week_date attributes.
week = 1
weekly_data = []
week_data = []
for data in df.groupby(pd.Grouper(key='date', freq='W')):
    week_date = data[0]
    week_dates = list(data[1]['date'])
    week_sales = list(data[1]['sales'])
    week_data_list = list(zip(week_dates, week_sales))
    for i in week_data_list:
        week_data.append([week, week_date, i[0], i[1]])
    weekly_data.append(week_data)
    week += 1

df = pd.DataFrame(week_data, columns=['week_number', 'week_date', 'date', 'sales'])
df
This produces a dataframe of shape:
week_number week_date date sales
0 1 2020-01-05 2020-01-01 57
1 1 2020-01-05 2020-01-02 64
2 1 2020-01-05 2020-01-03 51
3 1 2020-01-05 2020-01-04 77
4 1 2020-01-05 2020-01-05 69
... ... ... ... ...
148 22 2020-05-31 2020-05-28 34
149 22 2020-05-31 2020-05-29 51
150 22 2020-05-31 2020-05-30 66
151 22 2020-05-31 2020-05-31 77
152 23 2020-06-07 2020-06-01 31
153 rows × 4 columns
You can simply select or iterate on the dimension you want, e.g.
df.loc[df['week_number'] == 1]
week_number week_date date sales
0 1 2020-01-05 2020-01-01 57
1 1 2020-01-05 2020-01-02 64
2 1 2020-01-05 2020-01-03 51
3 1 2020-01-05 2020-01-04 77
4 1 2020-01-05 2020-01-05 69
Do note that this will not give you equal length subseries for each week because your data example does not allow for that, the first week having only 5 values and week 23 having only 1.
Good luck with 2) and 3)
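For what it's worth, here is a rough sketch of 2) and 3) on top of the frame built above. This is my own addition: it uses plain Euclidean distance rather than a SAX/MINDIST measure, and it keeps only full-length weeks so the subseries are comparable:
import numpy as np
from scipy.spatial.distance import pdist, squareform

# position of each observation within its week (0-6)
df['day_in_week'] = df.groupby('week_number').cumcount()
weekly = (df.pivot(index='week_number', columns='day_in_week', values='sales')
            .dropna())                       # drop partial weeks so rows have equal length
dist = squareform(pdist(weekly.values))      # pairwise (Euclidean) distance matrix
scores = dist.mean(axis=1)                   # simple anomaly score: mean distance to the other weeks
k = 3
top_k_weeks = weekly.index[np.argsort(scores)[::-1][:k]]
print(top_k_weeks)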

Count on a rolling time window in pandas

I'm trying to return a count on a time window about a (moving) fixed point.
It's an attempt to understand the condition of an instrument at any time, as a function of usage prior to it.
So if the instrument is used at 12.05pm, 12.10, 12.15, 12.30, 12.40 and 1pm, the usage counts would be:
12.05 -> 1 (once in the last hour)
12.10 -> 2
12.15 -> 3
12.30 -> 4
12.40 -> 5
1.00 -> 6
... but then lets say usage resumes at 1.06:
1.06 -> 6
this doesn't increase the count, as the first run is over an hour ago.
How can I calculate this count and append it as a column?
It feels like this is a groupby/aggregate/count, possibly using timedeltas in a lambda function, but I don't know where to start past that.
I'd like to be able to play with the time window too, so not just the past hour, but the hour surrounding an instance i.e. + and -30 minutes.
The following code gives a starting dataframe:
s = pd.Series(pd.date_range('2020-1-1', periods=8000, freq='250s'))
df = pd.DataFrame({'Run time': s})
df_sample = df.sample(6000)
df_sample = df_sample.sort_index()
The best help I found (and to be fair I can usually hack something together from the logic) was Distinct count on a rolling time window, but I've not managed it this time.
Thanks
I've done something similar previously with the DataFrame.rolling function:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html
So for your dataset, first you need to update the index to the datetime field, then you can perform the analysis you need. Continuing on from your code:
s = pd.Series(pd.date_range('2020-1-1', periods=8000, freq='250s'))
df = pd.DataFrame({'Run time': s})
df_sample = df.sample(6000)
df_sample = df_sample.sort_index()

# Create a value we can count
df_sample['Occurrences'] = 1
# Set the index to the datetime element
df_sample = df_sample.set_index('Run time')
# Use pandas' rolling method, 3600s = 1 hour
df_sample['Occurrences in Last Hour'] = df_sample['Occurrences'].rolling('3600s').sum()

df_sample.head(15)
Occurrences Occurrences in Last Hour
Run time
2020-01-01 00:00:00 1 1.0
2020-01-01 00:04:10 1 2.0
2020-01-01 00:08:20 1 3.0
2020-01-01 00:12:30 1 4.0
2020-01-01 00:16:40 1 5.0
2020-01-01 00:25:00 1 6.0
2020-01-01 00:29:10 1 7.0
2020-01-01 00:37:30 1 8.0
2020-01-01 00:50:00 1 9.0
2020-01-01 00:54:10 1 10.0
2020-01-01 00:58:20 1 11.0
2020-01-01 01:02:30 1 11.0
2020-01-01 01:06:40 1 11.0
2020-01-01 01:15:00 1 10.0
2020-01-01 01:19:10 1 10.0
You need to set the index to a datetime element to utilise the time-based window; otherwise you can only use integer window sizes corresponding to a number of rows.
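For the +/- 30 minute window mentioned in the question, the offset-based rolling above only looks backwards. A centered count can be built directly with searchsorted on the sorted index (my own sketch, assuming the index is sorted):
import numpy as np

times = df_sample.index.values                    # sorted datetime64[ns] array
half = np.timedelta64(30, 'm')
left = np.searchsorted(times, times - half, side='left')
right = np.searchsorted(times, times + half, side='right')
df_sample['Occurrences +/- 30 min'] = right - left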

Comparing date column values in one dataframe with two date columns in another dataframe by row in Pandas

I have a dataframe like this with two date columns and a quantity column:
start_date end_date qty
1 2018-01-01 2018-01-08 23
2 2018-01-08 2018-01-15 21
3 2018-01-15 2018-01-22 5
4 2018-01-22 2018-01-29 12
I have a second dataframe with just one column containing yearly holidays for a couple of years, like this:
holiday
1 2018-01-01
2 2018-01-27
3 2018-12-25
4 2018-12-26
I would like to go through the first dataframe row by row and assign a boolean value to a new column holidays, indicating whether a date in the second dataframe falls between the date values of the first dataframe. The result would look like this:
start_date end_date qty holidays
1 2018-01-01 2018-01-08 23 True
2 2018-01-08 2018-01-15 21 False
3 2018-01-15 2018-01-22 5 False
4 2018-01-22 2018-01-29 12 True
When I try to do that with a for loop I get the following error:
ValueError: Can only compare identically-labeled Series objects
An answer would be appreciated.
If you want a fully-vectorized solution, consider using the underlying numpy arrays:
import numpy as np

def holiday_arr(start, end, holidays):
    start = start.reshape((-1, 1))
    end = end.reshape((-1, 1))
    holidays = holidays.reshape((1, -1))
    result = np.any(
        (start <= holidays) & (holidays <= end),
        axis=1
    )
    return result
If you have your dataframes as above (calling them df1 and df2), you can obtain your desired result by running:
df1["contains_holiday"] = holiday_arr(
df1["start_date"].to_numpy(),
df1["end_date"].to_numpy(),
df2["holiday"].to_numpy()
)
df1 then looks like:
start_date end_date qty contains_holiday
1 2018-01-01 2018-01-08 23 True
2 2018-01-08 2018-01-15 21 False
3 2018-01-15 2018-01-22 5 False
4 2018-01-22 2018-01-29 12 True
Try:
def _is_holiday(row, df2):
    return ((df2['holiday'] >= row['start_date']) & (df2['holiday'] <= row['end_date'])).any()

df1.apply(lambda x: _is_holiday(x, df2), axis=1)
I'm not sure why you would want to go row by row, but boolean comparisons would be way faster.
df['holiday'] = ((df2.holiday >= df.start_date) & (df2.holiday <= df.end_date))
Time
>>> 1000 loops, best of 3: 1.05 ms per loop
Quoting @hchw's solution (row-by-row):
def _is_holiday(row, df2):
    return ((df2['holiday'] >= row['start_date']) & (df2['holiday'] <= row['end_date'])).any()

df.apply(lambda x: _is_holiday(x, df2), axis=1)
>>> The slowest run took 4.89 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 4.46 ms per loop
Try IntervalIndex.contains with a list comprehension and np.sum
iix = pd.IntervalIndex.from_arrays(df1.start_date, df1.end_date, closed='both')
df1['holidays'] = np.sum([iix.contains(x) for x in df2.holiday], axis=0) >= 1
Out[812]:
start_date end_date qty holidays
1 2018-01-01 2018-01-08 23 True
2 2018-01-08 2018-01-15 21 False
3 2018-01-15 2018-01-22 5 False
4 2018-01-22 2018-01-29 12 True
Note: I assume the start_date, end_date, and holiday columns are in datetime format. If they are not, you need to convert them before running the above commands, as follows:
df1.start_date = pd.to_datetime(df1.start_date)
df1.end_date = pd.to_datetime(df1.end_date)
df2.holiday = pd.to_datetime(df2.holiday)
