increase efficiency of pandas groupby with custom aggregation function - python

I have a not-so-large dataframe (somewhere in the 2000x10000 range in terms of shape).
I am trying to group by a column and average the first N non-null entries:
e.g.
import numpy as np

def my_part_of_interest(v, N=42):
    valid = v[~np.isnan(v)]
    return np.mean(valid.values[0:N])

mydf.groupby('key').agg(my_part_of_interest)
It now takes a long time (dozens of minutes), whereas .agg(np.nanmean)
ran in a matter of seconds.
How can I get it running faster?

Some things to consider:
Dropping the NaN entries on the entire df in a single operation is faster than doing it on chunks of grouped datasets: mydf.dropna(subset=['v'], inplace=True)
Use .head to slice: mydf.groupby('key').apply(lambda x: x.head(42).agg('mean'))
I think those combined can optimize things a bit, and they are more idiomatic pandas.
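For example, a minimal sketch combining the two ideas (the column name 'v' and the toy frame below are assumptions; adjust to your data):
import numpy as np
import pandas as pd

# toy frame standing in for mydf; 'v' holds the values, with some NaNs
mydf = pd.DataFrame({
    'key': ['a', 'a', 'b', 'b', 'b'],
    'v': [1.0, np.nan, 2.0, 3.0, np.nan],
})

# drop NaNs once, up front, instead of inside every group
clean = mydf.dropna(subset=['v'])

# take the first 42 remaining rows per group, then average them
result = clean.groupby('key')['v'].apply(lambda s: s.head(42).mean())
print(result)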

Related

Extracting latest values in a Dask dataframe with non-unique index column dates

I'm quite familiar with pandas dataframes but I'm very new to Dask so I'm still trying to wrap my head around parallelizing my code.
I've obtained my desired results using pandas and pandarallel already so what I'm trying to figure out is if I can scale up the task or speed it up somehow using Dask.
Let's say my dataframe has datetimes as non-unique indices, a values column and an id column.
time value id
2021-01-01 00:00:00.210281 28.08 293707
2021-01-01 00:00:00.279228 28.07 293708
2021-01-01 00:00:00.697341 28.08 293709
2021-01-01 00:00:00.941704 28.08 293710
2021-01-01 00:00:00.945422 28.07 293711
... ... ...
2021-01-01 23:59:59.288914 29.84 512665
2021-01-01 23:59:59.288914 29.83 512666
2021-01-01 23:59:59.288914 29.82 512667
2021-01-01 23:59:59.525227 29.84 512668
2021-01-01 23:59:59.784754 29.84 512669
What I want to extract is the latest value for every second, e.g. if the last row before 2021-01-01 00:00:01 is the one with index 2021-01-01 00:00:00.945422, then the latest value is 28.07.
In my case the index values are sometimes not unique, so as a tie breaker I'd like to use the id column: the value with the largest id number is considered the latest. For the three values tied at 2021-01-01 23:59:59.288914, the value 29.82 would be chosen, since the largest id for that timestamp is 512667. Also note that id is not consistent throughout the dataset, so I cannot rely on it alone for ordering my data.
In pandas I simply do this by obtaining the last index
last_index = df.loc[date_minus60: date_curr].index[-1]
last_values = df.loc[last_index]
and then if the value of last_values.index.is_unique is false, I finally perform last_values.sort_values('id').iloc[-1].
I've been having a hard time translating this code to Dask: I keep running into issues where my delayed functions need to be computed before I can reindex my dataframe again.
I'd like to know if there are any best practices to dealing with this sort of problem.
@Kafkaesque here's another approach to consider using map_partitions, which maps a custom function across each partition, treating each as a pandas DataFrame. Generally, it's advisable to use dask.dataframe methods directly. In this case, however, dask.DataFrame.sort_values only supports sorting by a single column, so map_partitions is a good alternative. You may also find these Dask Groupby examples helpful.
It's worth noting that using map_partitions + groupby only works if your dataset is already sorted, such that the same seconds are in the same partition. The example below is for the case where the data are not sorted:
import dask
import dask.dataframe as dd
import pandas as pd
# example dataset, use sample() to "unsort"
ddf = dask.datasets.timeseries(
    freq="250ms", partition_freq="5d", seed=42
).sample(frac=0.9, replace=True, random_state=42)
# first set the rounded timestamp as the index before calling map_partitions
# (don't need to reset the index if your dataset is already sorted)
ddf = ddf.reset_index()
ddf = ddf.assign(round_timestamp=ddf['timestamp'].dt.floor('S')).set_index('round_timestamp')
def custom_func(df):
    return (
        df
        .sort_values(by=['timestamp', 'id'])
        .groupby('round_timestamp')
        .last()
    )
new_ddf = ddf.map_partitions(custom_func)
# shows embarrassingly parallel execution of 'custom_func' across each partition
new_ddf.visualize(optimize_graph=True)
# check the result of the first partition
new_ddf.partitions[0].compute()
The snippet below shows that the syntax is very similar:
import dask
# generate dask dataframe
ddf = dask.datasets.timeseries(freq="500ms", partition_freq="1h")
# generate a pandas dataframe
df = ddf.partitions[0].compute() # pandas df for example
# sample dates
date_minus60 = "2000-01-01 00:00:00.000"
date_curr = "2000-01-01 00:00:02.000"
# pandas code
last_index_pandas = df.loc[date_minus60:date_curr].index[-1]
last_values_pandas = df.loc[last_index_pandas]
# dask code
last_index_dask = ddf.loc[date_minus60:date_curr].compute().index[-1]
last_values_dask = ddf.loc[last_index_dask].compute()
# check equality of the results
print(last_values_pandas == last_values_dask)
Note that the distinction is the two .compute steps in the Dask version, since two lazy values need to be computed: the first finds the correct index location and the second gets the actual value. This also assumes that the data is already indexed by the timestamp; if it is not, it's best to index the data before loading it into Dask, since .set_index is in general a slow operation.
However, depending on what you are really after, this is probably not a great use of Dask. If the underlying idea is to do fast lookups, then a better solution is to use indexed databases (including specialised time-series databases).
Finally, the snippet above uses a unique index. If the actual data has non-unique indexes, then the requirement to select by the largest id should be handled once last_values_dask is computed, using something like this (pseudo code, not expected to work right away):
def get_largest_id(last_values):
    return last_values.sort_values('id').tail(1)

last_values_dask = get_largest_id(last_values_dask)
There is scope for designing a better pipeline if the lookup is for batches (rather than specific sample dates).
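As one rough illustration of that batch idea (my own extrapolation, not part of the original answer): if what is really needed is the last value for every second of the day, a single resample pass avoids issuing one lookup per second. Note that this sketch ignores the id tie-break; that would still need the map_partitions sort shown earlier.
import dask

# toy data with a sorted DatetimeIndex and known divisions, standing in for the real feed
ddf = dask.datasets.timeseries(
    start="2000-01-01", end="2000-01-02", freq="250ms", partition_freq="1h", seed=42
)

# one batch pass: last observed value of column 'x' in every 1-second bucket
# (no per-second .loc lookups; no id tie-breaking applied here)
last_per_second = ddf['x'].resample('1s').last()

print(last_per_second.head())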

How to apply multiple functions to several chunks of a dask dataframe?

I have a dataframe of 500,000 lines and 3 columns. I would like to compute the result of three functions for every chunk of 5,000 lines in the dataframe (that is, 100 chunks). Two of the three functions are user-defined, while the third is the mean of the values in column 3.
At the moment, I am first extracting a chunk, and then computing the results of the functions for that chunk. For the mean of column 3 I am using df.iloc[:,2].compute().mean() but the other functions are performed outside of dask.
Is there a way to leverage Dask's multithreading ability, taking the entire dataframe and a chunk size as input, and have it compute the same functions automatically? This feels like the more appropriate way of using Dask.
Also, this feels like a basic dask question to me, so please if this is a duplicate, just point me to the right place (I'm new to dask and I might have not looked for the right thing so far).
I would repartition your dataframe, and then use the map_partitions function to apply each of your functions in parallel
import dask

df = df.repartition(npartitions=100)
a = df.map_partitions(func1)
b = df.map_partitions(func2)
c = df.map_partitions(func3)
a, b, c = dask.compute(a, b, c)
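For a self-contained illustration of that pattern (func1/func2/func3 below are placeholder aggregations, not the asker's actual functions):
import dask
import dask.dataframe as dd
import numpy as np
import pandas as pd

# toy 500,000 x 3 frame, loaded into Dask and split into 100 chunks of ~5,000 rows
pdf = pd.DataFrame(np.random.rand(500_000, 3), columns=['a', 'b', 'c'])
df = dd.from_pandas(pdf, npartitions=10).repartition(npartitions=100)

# placeholder per-chunk functions; each receives one partition as a pandas DataFrame
# and returns a one-row Series (stand-ins for the real user-defined functions)
def func1(part):
    return pd.Series({'a_span': part['a'].max() - part['a'].min()})

def func2(part):
    return pd.Series({'b_over_half': (part['b'] > 0.5).sum()})

def func3(part):
    return pd.Series({'c_mean': part['c'].mean()})

a = df.map_partitions(func1)
b = df.map_partitions(func2)
c = df.map_partitions(func3)
a, b, c = dask.compute(a, b, c)  # one pass, all three evaluated in parallel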
You can create an artificial column for grouping indices into those 100 chunks.
import numpy as np

ranges = np.arange(0, df.shape[0], 5000)
df['idx_group'] = ranges.searchsorted(df.index, side='right')
Then use this idx_group to perform your operations using pandas groupby.
NOTE: You can play with searchsorted to exactly fit your chunk requirements.
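As a small, self-contained sketch of that groupby idea (the aggregations below are placeholders standing in for the real functions; the mean of column 'c' is the one the question asks for):
import numpy as np
import pandas as pd

# toy frame standing in for the real 500,000 x 3 dataframe
df = pd.DataFrame(np.random.rand(20_000, 3), columns=['a', 'b', 'c'])

# label every block of 5,000 consecutive rows with a chunk id
ranges = np.arange(0, df.shape[0], 5000)
df['idx_group'] = ranges.searchsorted(df.index, side='right')

# apply several aggregations per chunk in one groupby
result = df.groupby('idx_group').agg(
    c_mean=('c', 'mean'),
    a_span=('a', lambda s: s.max() - s.min()),  # stand-in for one user-defined function
    b_sum=('b', 'sum'),                         # stand-in for the other
)
print(result)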

High performance apply on group by pandas

I need to calculate a percentile on a column of a pandas dataframe. I want the 20th percentile of SaleQTY, but for each group of ["Barcode", "ShopCode"],
so I define a function as below:
import numpy as np

def quant(group):
    group["Quantile"] = np.quantile(group["SaleQTY"], 0.2)
    return group
And I apply this function to each group of my sales data, which has almost 18 million rows and roughly 3 million groups of ["Barcode","ShopCode"]:
quant_sale = sales.groupby(['Barcode','ShopCode']).apply(quant)
That took 2 hours to complete on a Windows server with 128 GB RAM and 32 cores.
That makes no sense, because this is only one small part of my code, so I started searching the net for ways to improve the performance.
I came up with a "numba" solution, with the code below, which didn't work:
import numpy as np
from numba import njit, jit

@jit(nopython=True)
def quant_numba(df):
    final_quant = []
    for bar_shop, group in df.groupby(['Barcode', 'ShopCode']):
        group["Quantile"] = np.quantile(group["SaleQTY"], 0.2)
        final_quant.append((bar_shop, group["Quantile"]))
    return final_quant

result = quant_numba(sales)
It seems that I cannot use pandas objects within this decorator.
I am not sure whether I can use multiprocessing (I'm unfamiliar with the whole concept), or whether there is any other solution to speed up my code. Any help would be appreciated.
You can try DataFrameGroupBy.quantile:
df1 = df.groupby(['Barcode', 'ShopCode'])['SaleQTY'].quantile(0.2)
Or, as mentioned by @Jon Clements, for a new column filled with per-group percentiles use GroupBy.transform:
df['Quantile'] = df.groupby(['Barcode', 'ShopCode'])['SaleQTY'].transform('quantile', q=0.2)
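A small self-contained demo of both forms, using toy data since the real sales frame isn't shown:
import pandas as pd

sales = pd.DataFrame({
    'Barcode':  [1, 1, 1, 2, 2, 2],
    'ShopCode': ['A', 'A', 'A', 'B', 'B', 'B'],
    'SaleQTY':  [10, 20, 30, 5, 15, 25],
})

# one row per (Barcode, ShopCode) group
per_group = sales.groupby(['Barcode', 'ShopCode'])['SaleQTY'].quantile(0.2)

# same length as the original frame, ready to assign as a column
sales['Quantile'] = sales.groupby(['Barcode', 'ShopCode'])['SaleQTY'].transform('quantile', q=0.2)

print(per_group)
print(sales)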
There is a built-in function in pandas called quantile().
quantile() will help to get the nth percentile of a column in a df.
Doc reference link
geeksforgeeks example reference

Apply LSH approxNearestNeighbors on all rows of a dataframe

I am trying to apply BucketedRandomProjectionLSH's function model.approxNearestNeighbors(df, key, n) on all the rows of a dataframe in order to approx-find the top n most similar items for every item. My dataframe has 1 million rows.
My problem is that I have to find a way to compute it within a reasonable time (no more than 2 hrs). I've read about the function approxSimilarityJoin(df, df, threshold), but it takes way too long and doesn't return the right number of rows: if my dataframe has 100,000 rows and I set a VERY high/permissive threshold, I get back something like not even 10% of the number of rows.
So, I'm thinking about using approxNearestNeighbors on all rows so that the computation time is almost linear.
How do you apply that function to every row of a dataframe? I can't use a UDF, since I need the model + a dataframe as inputs.
Do you have any suggestions?

Creating dataframe by merging a number of unknown length dataframes

I am trying to do some analysis on baseball pitch F/x data. All the pitch data is stored in a pandas dataframe with columns like 'Pitch speed' and 'X location.' I have a wrapper function (using pandas.query) that, for a given pitch, will find other pitches with similar speed and location. This function returns a pandas dataframe of unknown size. I would like to use this function over large numbers of pitches; for example, to find all pitches similar to those thrown in a single game. I have a function that does this correctly, but it is quite slow (probably because it is constantly resizing resampled_pitches):
def get_pitches_from_templates(template_pitches, all_pitches):
    resampled_pitches = pd.DataFrame(columns=all_pitches.columns.values.tolist())
    for i, row in template_pitches.iterrows():
        resampled_pitches = resampled_pitches.append(get_pitches_from_template(row, all_pitches))
    return resampled_pitches
I have tried to rewrite the function using pandas.apply on each row, or by creating a list of dataframes and then merging, but can't quite get the syntax right.
What would be the fastest way to do this type of sampling and merging?
It sounds like you should use pd.concat for this:
def get_pitches_from_templates(template_pitches, all_pitches):
    res = []
    for i, row in template_pitches.iterrows():
        res.append(get_pitches_from_template(row, all_pitches))
    return pd.concat(res)
I think that a merge might be even faster. Usage of df.iterrows() isn't recommended as it generates a series for every row.
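For intuition on why collecting pieces and concatenating once helps (the asker's hunch about constant resizing is right): growing a frame block by block re-copies all previously accumulated rows on every step, while a single pd.concat at the end copies the data once. A toy sketch, with random frames standing in for the per-template results:
import numpy as np
import pandas as pd

# toy stand-ins for the DataFrames returned by get_pitches_from_template
pieces = [pd.DataFrame(np.random.rand(50, 3), columns=['speed', 'x', 'z'])
          for _ in range(200)]

# slow pattern: grow the result incrementally (each step copies everything so far)
slow = pieces[0]
for piece in pieces[1:]:
    slow = pd.concat([slow, piece])

# fast pattern: collect everything first, copy once at the end
fast = pd.concat(pieces, ignore_index=True)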
