Is there an equivalent to pandas.cut() in Dask?
I am trying to bin and group a large dataset in Python. It is a list of measured electrons with the properties (positionX, positionY, energy, time). I need to group it by positionX and positionY and bin it into energy classes.
So far I have been able to do this with pandas, but I would like to run it in parallel, so I am trying to use dask.
The groupby method works very well, but unfortunately I run into difficulties when trying to bin the data by energy. I found a solution using pandas.cut(), but it requires calling compute() on the raw dataset (turning it essentially into non-parallel code). Is there an equivalent to pandas.cut() in dask, or another (elegant) way to achieve the same functionality?
import dask.dataframe
import pandas

# create dask dataframe from the array
dd = dask.dataframe.from_array(mainArray, chunksize=100000, columns=('posX', 'posY', 'time', 'energy'))
# Set the bins to bin along energy
bins = range(0, 10000, 500)
# Create the cut in energy (using non-parallel pandas code...)
energyBinner = pandas.cut(dd['energy'], bins)
# Group the data according to posX, posY and energy
grouped = dd.compute().groupby([energyBinner, 'posX', 'posY'])
# Apply the count() method to the data:
numberOfEvents = grouped['time'].count()
Thanks a lot!
You should be able to do dd['energy'].map_partitions(pd.cut, bins).
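For example, a minimal sketch of the question's pipeline without calling compute() on the raw data (variable names taken from the question; the binned column is assigned back to the dataframe so groupby can use it):
import dask.dataframe
import pandas as pd

dd = dask.dataframe.from_array(
    mainArray, chunksize=100000,
    columns=('posX', 'posY', 'time', 'energy'))
bins = range(0, 10000, 500)

# explicit bin edges keep the labels consistent across partitions
dd = dd.assign(energyBin=dd['energy'].map_partitions(pd.cut, bins))

numberOfEvents = dd.groupby(['energyBin', 'posX', 'posY'])['time'].count().compute()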
Related
I need to use pd.cut on a dask dataframe.
This answer indicates that map_partitions will work by passing pd.cut as the function.
It seems that map_partitions passes only one partition at a time to the function. However, pd.cut will need access to an entire column of my df in order to create the bins. So, my question is: will map_partitions in this case actually operate on the entire dataframe, or am I going to get incorrect results with this approach?
In your question you correctly identify why the bins should be provided explicitly.
By specifying the exact bin cuts (either based on some calculation or external reasoning), you ensure that what dask does is comparable across partitions.
# this does not guarantee comparable cuts
ddf['a'].map_partitions(pd.cut)
# this ensures the cuts are as per the specified bins
ddf['a'].map_partitions(pd.cut, bins)
If you want to generate bins in an automatic way, one way is to get the min/max for the column of interest and generate the bins with np.linspace:
# note that computation is needed to give
# actual (not delayed) values to np.linspace
import dask
import numpy as np

bmin, bmax = dask.compute(ddf['a'].min(), ddf['a'].max())
# specify the number of desired cuts here
bins = np.linspace(bmin, bmax, num=123)
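These computed edges can then be used with map_partitions exactly as above (a short sketch; ddf as before, and pandas imported as pd as in the earlier snippets):
# apply the consistent, precomputed edges on every partition
binned = ddf['a'].map_partitions(pd.cut, bins)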
Context of the Problem
I am running a discrete event simulation where, at the end of each event, I store the state of the system in a row of a dataframe. Each simulation run has a clock that determines which events are run and when; all the simulations have the same initial clock (0) and the same end clock (the simulation end), but the number of rows in each dataframe may differ because the simulation has several stochastic components. The clock column is then converted to a timestamp and used as the index of the dataframe.
Whenever one does a simulation, it is good practice to run it many times and then average over all the replicates. Doing so in this setup is a bit complicated because each simulation produces a dataframe with a different index.
Proposed Solution
So far this is the solution I found:
# Auxiliary functions
import pandas as pd

def get_union_index(dfs):
    index_union = dfs[0].index
    for df in dfs[1:]:
        index_union = index_union.union(df.index)
    return pd.Series(index_union).drop_duplicates().reset_index(drop=True)

def interpolate(df, index, method='time'):
    # currently only forward/backward filling is used; 'method' is kept as a
    # placeholder for more general interpolation strategies
    aux = df.astype(float)
    aux = aux.reindex(index).fillna(method='ffill').fillna(method='bfill')
    return aux
# Simulation
replicates = 30
seeds = list(range(40, 40 + replicates))
dfs = [simulate(seed=seed, **parameters) for seed in seeds]
# Main Code
union_index = get_union_index(dfs)
dfs_interpolates = [interpolate(df, union_index) for df in dfs]
df_concat = pd.concat(dfs_interpolates)
by_row_index = df_concat.groupby(df_concat.index)
# Averaging
df_means = by_row_index.mean()
df_std = by_row_index.std()
Explanation
First, it is necessary to combine all the indexes; this combined index is then used to re-index all the dataframes, and the NaN values are filled using interpolation.
Questions
Is there a native pandas function that could simplify this?
(If not 1) Is there an alternative way to combine the datasets directly? Since the majority of the indexes are disjoint, union_index has a length of approximately len(df) * len(dfs), which is huge.
In this case I am using ffill and then bfill, but the solution should allow the more general interpolate method, so as to support different and more complex approaches. The main issue is that, since each row is an event (before interpolation), only a few columns (sometimes none) change between two consecutive rows. This makes it possible to have several consecutive rows with identical values in several columns, and thus a lot of redundant values, and even more after interpolation (especially with ffill and bfill).
EDIT
Added data to reproduce the problem and changed question 3.
dfs is a list of 3 DataFrames as follows:
Then dfs_interpolates is also a list of 3 DataFrames but this time the indexes are exactly the same, namely:
Finally the expected result should be a row-wise application of a function (mean, median, etc.) across the different dataframes in dfs_interpolates. For the case of the mean the result should be:
I have a dataframe of 500,000 lines and 3 columns. I would like to compute the result of three functions for every chunk of 5,000 lines in the dataframe (that is, 100 chunks). Two of the three functions are user-defined, while the third is the mean of the values in column 3.
At the moment, I am first extracting a chunk, and then computing the results of the functions for that chunk. For the mean of column 3 I am using df.iloc[:,2].compute().mean() but the other functions are performed outside of dask.
Is there a way to leverage dask's multithreading ability, taking the entire dataframe and a chunk size as input, and have it compute the same functions automatically? This feels like the more appropriate way of using dask.
Also, this feels like a basic dask question to me, so please if this is a duplicate, just point me to the right place (I'm new to dask and I might have not looked for the right thing so far).
I would repartition your dataframe, and then use the map_partitions function to apply each of your functions in parallel:
import dask

df = df.repartition(npartitions=100)
a = df.map_partitions(func1)
b = df.map_partitions(func2)
c = df.map_partitions(func3)
a, b, c = dask.compute(a, b, c)
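Each function receives one partition as a plain pandas DataFrame, so, for example, the "mean of column 3" function could be sketched as follows (func3 is just the placeholder name used above):
def func3(partition):
    # partition is an ordinary pandas DataFrame holding one chunk of rows
    return partition.iloc[:, 2].mean()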
You can create an artificial column for grouping indices into those 100 chunks.
import numpy as np

ranges = np.arange(0, df.shape[0], 5000)
df['idx_group'] = ranges.searchsorted(df.index, side='right')
Then use this idx_group to perform your operations using pandas groupby.
NOTE: You can play with searchsorted to exactly fit your chunk requirements.
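For example, assuming df is a pandas DataFrame at this point (func1 and func2 stand in for the user-defined functions from the question), the per-chunk results could be collected like this:
import pandas as pd

results = df.groupby('idx_group').apply(
    lambda chunk: pd.Series({
        'f1': func1(chunk),
        'f2': func2(chunk),
        'col3_mean': chunk.iloc[:, 2].mean(),  # mean of the third column
    }))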
I need to calculate percentile on a column of a pandas dataframe. A subset of the dataframe is as below:
I want to calculate the 20th percentile of SaleQTY, but for each group of ["Barcode", "ShopCode"]:
so I define a function as below:
import numpy as np

def quant(group):
    group["Quantile"] = np.quantile(group["SaleQTY"], 0.2)
    return group
I apply this function to each group of my sales data, which has almost 18 million rows and roughly 3 million groups of ["Barcode", "ShopCode"]:
quant_sale = sales.groupby(['Barcode','ShopCode']).apply(quant)
That took 2 hours to complete on a Windows server with 128 GB of RAM and 32 cores.
That makes no sense, because this is only one small part of my code, so I started searching the net for ways to improve the performance.
I came up with "numba" solution with below code which didn't work:
from numba import njit, jit

@jit(nopython=True)
def quant_numba(df):
    final_quant = []
    for bar_shop, group in df.groupby(['Barcode', 'ShopCode']):
        group["Quantile"] = np.quantile(group["SaleQTY"], 0.2)
        final_quant.append((bar_shop, group["Quantile"]))
    return final_quant
result = quant_numba(sales)
It seems that I cannot use pandas objects within this decorator.
I am not sure whether I can use multiprocessing (I'm unfamiliar with the whole concept) or whether there is any other solution to speed up my code. Any help would be appreciated.
You can try DataFrameGroupBy.quantile:
df1 = df.groupby(['Barcode', 'ShopCode'])['SaleQTY'].quantile(0.2)
Or, as mentioned by @Jon Clements, for a new column filled with the group percentiles use GroupBy.transform:
df['Quantile'] = df.groupby(['Barcode', 'ShopCode'])['SaleQTY'].transform('quantile', q=0.2)
There is a built-in function in pandas called quantile().
quantile() will give you the nth percentile of a column in a df.
Doc reference link
geeksforgeeks example reference
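For example, a minimal sketch using the question's column name:
p20 = df['SaleQTY'].quantile(0.2)  # 20th percentile of the whole column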
I ran into a problem processing a wide Spark dataframe (about 9,000 columns, and sometimes more).
Task:
Create a wide DF via groupBy and pivot.
Transform the columns to a vector and process it with KMeans from pyspark.ml.
So I built the wide frame, tried to create the vector with VectorAssembler, cached it, and trained KMeans on it.
Assembling took about 11 minutes, and KMeans took 2 minutes for 7 different cluster counts, on my PC in standalone mode with a frame of ~500x9000. On the other hand, the same processing in pandas (pivot the df and iterate over the 7 cluster counts) takes less than one minute.
Obviously I understand the overhead and performance decrease of standalone mode, caching, and so on, but it really discourages me.
Could somebody explain how I can avoid this overhead?
How do people work with wide DFs instead of using VectorAssembler and taking the performance hit?
A more formal question (per SO rules) would be: how can I speed up this code?
%%time
tmp = (df_states.select('ObjectPath', 'User', 'PropertyFlagValue')
       .groupBy('User')
       .pivot('ObjectPath')
       .agg({'PropertyFlagValue': 'max'})
       .fillna(0))
ignore = ['User']
assembler = VectorAssembler(
    inputCols=[x for x in tmp.columns if x not in ignore],
    outputCol='features')
Wall time: 36.7 s
print(tmp.count(), len(tmp.columns))
552, 9378
%%time
transformed = assembler.transform(tmp).select('User', 'features').cache()
Wall time: 10min 45s
%%time
lst_levels = []
for num in range(3, 14):
    kmeans = KMeans(k=num, maxIter=50)
    model = kmeans.fit(transformed)
    lst_levels.append(model.computeCost(transformed))
rs = [i - j for i, j in list(zip(lst_levels, lst_levels[1:]))]
for i, j in zip(rs, rs[1:]):
    if i - j < j:
        print(rs.index(i))
        kmeans = KMeans(k=rs.index(i) + 3, maxIter=50)
        model = kmeans.fit(transformed)
        break
Wall time: 1min 32s
Config:
.config("spark.sql.pivotMaxValues", "100000") \
.config("spark.sql.autoBroadcastJoinThreshold", "-1") \
.config("spark.sql.shuffle.partitions", "4") \
.config("spark.sql.inMemoryColumnarStorage.batchSize", "1000") \
VectorAssembler's transform function processes all the columns and stores metadata on each column in addition to the original data. This takes time, and also takes up RAM.
To put an exact figure on how much things have increased, you can dump your data frame before and after the transformation as parquet files and compare. In my experience, a feature vector built by hand or other feature extraction methods compared to one built by VectorAssembler can cause a size increase of 10x and this was for a logistic regression with only 10 parameters. Things will get a lot worse with a data set with as many columns as you have.
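For instance, a quick way to make that comparison could be to write both of the question's frames out and compare the file sizes (a sketch; the paths are placeholders):
tmp.write.parquet('/tmp/before_assembly')
transformed.write.parquet('/tmp/after_assembly')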
A few suggestions:
See if you can build your feature vector another way. I'm not sure how performant this would be in Python, but I've got a lot of mileage out of this approach in Scala. I've noticed something like a 5x-6x performance difference comparing logistic regressions (10 params) for vectors built manually or via other extraction methods (TF-IDF) versus VectorAssembled ones.
See if you can reshape your data to reduce the number of columns that need to be processed by VectorAssembler.
See if increasing the RAM available to Spark helps.
Actually, the solution was found in a map over the RDD.
First of all, we create a map of values for each row.
We also extract all the distinct names.
As the penultimate step, we look up each value of the row's map in the dict of names and return the value, or 0 if nothing is found.
Finally, VectorAssembler is run on the results (a rough sketch of the whole approach follows).
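A rough PySpark sketch of the idea (column names as in the question; this variant builds the vectors directly inside the map rather than running VectorAssembler at the end, but the lookup logic is the same):
from pyspark.ml.linalg import Vectors

# all distinct ObjectPath names, in a fixed order
names = [r[0] for r in df_states.select('ObjectPath').distinct().collect()]

# per user: merge (ObjectPath -> value) pairs, keeping the max as the pivot did
pairs = (df_states.rdd
         .map(lambda r: (r['User'], {r['ObjectPath']: float(r['PropertyFlagValue'])}))
         .reduceByKey(lambda a, b: {k: max(a.get(k, 0.0), b.get(k, 0.0))
                                    for k in set(a) | set(b)}))

# look up every name in the user's map (0.0 if missing) and build the vector
transformed = pairs.map(
    lambda kv: (kv[0], Vectors.dense([kv[1].get(n, 0.0) for n in names]))
).toDF(['User', 'features'])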
Advantages:
You don't have to create a wide dataframe with a huge number of columns, and hence you avoid that overhead. (The runtime went down from 11 minutes to 1.)
You still work on the cluster and execute your code within the Spark paradigm.
Example of code: Scala implementation.