I need to use pd.cut on a dask dataframe.
This answer indicates that map_partitions will work by passing pd.cut as the function.
It seems that map_partitions passes only one partition at a time to the function. However, pd.cut needs access to an entire column of my df in order to create the bins. So, my question is: will map_partitions in this case actually operate on the entire dataframe, or am I going to get incorrect results with this approach?
In your question you correctly identify why the bins should be provided explicitly.
By specifying the exact bin cuts (either based on some calculation or external reasoning), you ensure that what dask does is comparable across partitions.
# this does not guarantee comparable cuts
ddf['a'].map_partitions(pd.cut)
# this ensures the cuts are as per the specified bins
ddf['a'].map_partitions(pd.cut, bins)
If you want to generate bins in an automatic way, one way is to get the min/max for the column of interest and generate the bins with np.linspace:
# note that computation is needed to give
# actual (not delayed) values to np.linspace
bmin, bmax = dask.compute(ddf['a'].min(), ddf['a'].max())
# specify the number of desired cuts here
bins = np.linspace(bmin, bmax, num=123)
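Putting the pieces together, a minimal self-contained sketch; the toy dataframe, the column name 'a', and the number of bins are just placeholders:

import dask
import dask.dataframe as dd
import numpy as np
import pandas as pd

# toy dask dataframe standing in for ddf
ddf = dd.from_pandas(pd.DataFrame({'a': np.random.rand(10_000)}), npartitions=4)

# compute the global min/max once, so every partition is cut against the same edges
bmin, bmax = dask.compute(ddf['a'].min(), ddf['a'].max())

# choose the number of cuts you want here
bins = np.linspace(bmin, bmax, num=11)

# the same explicit bins are applied to each partition
# (the exact minimum sits on an open left edge; add include_lowest=True if you need it)
binned = ddf['a'].map_partitions(pd.cut, bins)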
Related
I am trying to take a number and multiply it by a unique value depending on which interval it falls within.
I did a groupby on my pandas dataframe according to which bins a value fell into
bins = pd.cut(df['A'], 50)
grouped = df['B'].groupby(bins)
interval_averages = grouped.mean()
A
(0.00548, 0.0209] 0.010970
(0.0209, 0.0357] 0.019546
(0.0357, 0.0504] 0.036205
(0.0504, 0.0651] 0.053656
(0.0651, 0.0798] 0.068580
(0.0798, 0.0946] 0.086754
(0.0946, 0.109] 0.094038
(0.109, 0.124] 0.114710
(0.124, 0.139] 0.136236
(0.139, 0.153] 0.142115
(0.153, 0.168] 0.161752
(0.168, 0.183] 0.185066
(0.183, 0.198] 0.205451
I need to be able to check which interval a number falls into, and then multiply it by the average value of the B column for that interval range.
From the docs I know I can use the in keyword to check if a number is in an interval, but I cannot find how to access the value for a given interval. In addition, I don't want to have to loop through the Series checking if the number is in each interval, that seems quite slow.
Does anybody know how to do this efficiently?
Thanks a lot.
You can store the numbers being tested in an array and use pd.cut() with your bins to sort the values into their respective intervals. This returns an array of the bins that each number falls into. You can use that array to determine where the value you need (the mean) sits in the dataframe, since you will know the correct row, and access it via iloc.
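A rough sketch of that idea; the toy data and the looked-up values are placeholders, and observed=False is spelled out so that the row order of the means lines up with the category codes:

import numpy as np
import pandas as pd

# toy data standing in for your df with columns A and B
df = pd.DataFrame({'A': np.random.rand(1000), 'B': np.random.rand(1000)})

# same binning/grouping as in the question; observed=False keeps all 50
# intervals in category order, so positions line up with category codes
bins = pd.cut(df['A'], 50)
interval_averages = df['B'].groupby(bins, observed=False).mean()

# the numbers you want to look up (placeholder values)
new_values = np.array([0.02, 0.15, 0.42])

# cut the new values against the same intervals used for the groupby
cats = pd.cut(new_values, bins.cat.categories)

# each category code is the row position of its interval, so iloc fetches
# the mean of B for that interval (values outside all bins get code -1
# and would need separate handling)
means = interval_averages.iloc[cats.codes].to_numpy()
result = new_values * means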
Hopefully this helps a bit
I'm working with histograms presented as pandas Series and representing the realizations of random variables from an observation set. I'm looking for an efficient way to store and read them back.
The histogram's bins are the index of the Series. For example:
histogram1 :
(-1.3747106810983318, 3.529160051186781] 0.012520
(3.529160051186781, 8.433030783471894] 0.013830
(8.433030783471894, 13.336901515757006] 0.016495
(13.336901515757006, 18.24077224804212] 0.007194
(18.24077224804212, 23.144642980327234] 0.041667
(23.144642980327234, 28.048513712612344] 0.000000
I would like to store several of these histograms in a single csv file (one file for each set of random variables, one file would store ~100 histograms), and read them back later exactly as they were before storing (each histogram from the file as a single Series, all values as floats).
How can I do this? Since speed matters, is there a more efficient way than csv files?
Then, when a new realization of a variable comes in, I would retrieve its histogram from the corresponding file and assess which bin it "falls in". Something like this:
# Not very elegant
for bin in histogram1.index:
    if 1.0232545 in bin:
        print("It's in!")
        print(histogram1.loc[bin])
Thanks!
You are addressing two different topics here:
What is an efficient way to store multiple series?
How to determine the bin for a float from an already formed IntervalIndex?
The first part is straightforward. I would use pandas.concat() to create one big frame before saving, and rather than csv I would save it to parquet:
(pd.concat(histograms, keys=hist_names, names=['hist_name', 'bin'])
   .rename('random_variable')
   .to_frame()
   .to_parquet())
see .to_parquet(), this answer, and this benchmark for more
Then when reading back, select a single histogram with
hist1 = df.loc[('hist1', slice(None)), 'random_variable']
or
grouped = df.reset_index('hist_name').groupby('hist_name')
hist1 = grouped.get_group('hist1')
The second part is already answered here.
In short, you need to flatten the IntervalIndex by:
bins = hist1.index.right
Then you can find the bin for your value (or list of values) with numpy.digitize:
i = np.digitize(my_value, bins)
return_value = hist1.iloc[i]
Edit
Just found this answer about Indexing with an IntervalIndex, which also works:
return_value = hist1.loc[my_value]
I have a dataframe of 500,000 lines and 3 columns. I would like to compute the result of three functions for every chunk of 5,000 lines in the dataframe (that is, 100 chunks). Two of the three functions are user-defined, while the third is the mean of the values in column 3.
At the moment, I am first extracting a chunk, and then computing the results of the functions for that chunk. For the mean of column 3 I am using df.iloc[:,2].compute().mean() but the other functions are performed outside of dask.
Is there a way to leverage dask's multithreading ability, taking the entire dataframe and a chunk size as input, and have it computing the same functions but automatically? This feels like the more appropriate way of using Dask.
Also, this feels like a basic dask question to me, so please if this is a duplicate, just point me to the right place (I'm new to dask and I might have not looked for the right thing so far).
I would repartition your dataframe, and then use the map_partitions function to apply each of your functions in parallel:
import dask

df = df.repartition(npartitions=100)
a = df.map_partitions(func1)
b = df.map_partitions(func2)
c = df.map_partitions(func3)
a, b, c = dask.compute(a, b, c)
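For context, map_partitions hands each function one partition as a plain pandas DataFrame, so the functions might look roughly like this; the bodies of func1 and func2 are purely illustrative placeholders, and only the column-3 mean comes from the question:

import pandas as pd

def func1(partition: pd.DataFrame):
    # placeholder for the first user-defined function
    return partition.iloc[:, 0].sum()

def func2(partition: pd.DataFrame):
    # placeholder for the second user-defined function
    return partition.iloc[:, 1].std()

def func3(partition: pd.DataFrame):
    # the third result from the question: the mean of column 3, per chunk
    return partition.iloc[:, 2].mean()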
You can create an artificial column for grouping indices into those 100 chunks.
ranges = np.arange(0, df.shape[0], 5000)
df['idx_group'] = ranges.searchsorted(df.index, side='right')
Then use this idx_group to perform your operations using pandas groupby.
NOTE: You can play with searchsorted to exactly fit your chunk requirements.
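Here is a self-contained sketch of that idea; the aggregations at the end are only examples of what you might run per chunk:

import numpy as np
import pandas as pd

# toy frame standing in for the 500,000-row, 3-column dataframe
df = pd.DataFrame(np.random.rand(500_000, 3), columns=['c1', 'c2', 'c3'])

# label every row with the 5,000-row chunk it belongs to
ranges = np.arange(0, df.shape[0], 5000)
df['idx_group'] = ranges.searchsorted(df.index, side='right')

# per-chunk results via pandas groupby; swap in your own functions here
chunk_means = df.groupby('idx_group')['c3'].mean()
chunk_other = df.groupby('idx_group')['c1'].apply(lambda chunk: chunk.max() - chunk.min())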
I am trying to understand the following:
1) How the percentiles are calculated.
2) Why did Python not return the values in sorted order (which was my expectation) as output?
3) My requirement is to know the actual value below which x% of the population lies. How do I do that?
Thanks
Python-2
import pandas as pd

new = pd.DataFrame({'a': range(10),
                    'b': [60510, 60053, 54968, 62269, 91107,
                          29812, 45503, 6460, 62521, 37128]})
print new.describe(percentiles=[0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
1) How the percentiles are calculated
The 90% percentile/quantile means that 10% of the data is greater than that value and 90% of the data falls below it. By default, it is based on linear interpolation. This is why, in your a column, the values increment by 0.9 instead of being the original data values of [0, 1, 2, ...]. If you want nearest values instead of interpolation, you can use the quantile method instead of describe and change the interpolation parameter.
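For instance, a minimal illustration with the question's a column (values 0 through 9):

import pandas as pd

a = pd.Series(range(10))

# default linear interpolation: the 10th percentile falls between 0 and 1
print(a.quantile(0.1))                           # 0.9

# nearest observed value instead of an interpolated one
print(a.quantile(0.1, interpolation='nearest'))  # 1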
2) Why did Python not return the values in sorted order (which was my expectation) as output?
Your question is unclear here. It does return the values in a sorted order, following the structure of the .describe output: count, mean, std, min, the quantiles from low to high, then max. If you only want the quantiles and not the other statistics, you can use the quantile method instead.
3) My requirement is to know the actual value below which x% of the population lies. How do I do that?
Nothing is wrong with the output. Those quantiles are accurate, although they aren't very meaningful when your data only has 10 observations.
Edit: It wasn't originally clear to me that you were attempting to do stats on a frequency table. I don't know of a direct solution in pandas that doesn't involve moving your data over to a numpy array. You could use numpy.repeat to get a raw list of observations to put back into pandas and do descriptive stats on.
import numpy as np

vals = np.array(new.a)
freqs = np.array(new.b)
observations = np.repeat(vals, freqs)
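For example, here is a short, self-contained sketch of that last step, treating column b as the frequencies of the values in column a; the chosen percentiles are just placeholders:

import numpy as np
import pandas as pd

# question's data: values in a, frequencies in b
vals = np.arange(10)
freqs = np.array([60510, 60053, 54968, 62269, 91107,
                  29812, 45503, 6460, 62521, 37128])

# expand to raw observations and put them back into pandas
obs = pd.Series(np.repeat(vals, freqs))

# the value below which x% of the population lies, e.g. x = 10, 50, 90
print(obs.quantile([0.1, 0.5, 0.9]))
print(obs.describe(percentiles=[0.1, 0.5, 0.9]))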
Is there an equivalent to pandas.cut() in Dask?
I try to bin and group a large dataset in Python. It is a list of measured electrons with the properties (positionX, positionY, energy, time). I need to group it along positionX, positionY and do binning in energy classes.
So far I could do it with pandas, but I would like to run it in parallel. So, I try to use dask.
The groupby method works very well, but unfortunately, I run into difficulties when trying to bin the data in energy. I found a solution using pandas.cut(), but it requires calling compute() on the raw dataset (turning it essentially into non-parallel code). Is there an equivalent to pandas.cut() in dask, or is there another (elegant) way to achieve the same functionality?
import dask.dataframe
import pandas

# create dask dataframe from the array
dd = dask.dataframe.from_array(mainArray, chunksize=100000,
                               columns=('posX', 'posY', 'time', 'energy'))
# Set the bins to bin along energy
bins = range(0, 10000, 500)
# Create the cut in energy (using non-parallel pandas code...)
energyBinner = pandas.cut(dd['energy'], bins)
# Group the data according to posX, posY and energy
grouped = dd.compute().groupby([energyBinner, 'posX', 'posY'])
# Apply the count() method to the data:
numberOfEvents = grouped['time'].count()
Thanks a lot!
You should be able to do dd['energy'].map_partitions(pd.cut, bins).
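If that works, an untested sketch of the question's pipeline without calling compute() on the raw data might look like the following; the energyBin column name is made up, mainArray is the array from the question, and the lazy groupby over the categorical column that pd.cut produces is worth verifying on your own data:

import dask.dataframe
import pandas as pd

# same setup as in the question (mainArray comes from there)
dd = dask.dataframe.from_array(mainArray, chunksize=100000,
                               columns=('posX', 'posY', 'time', 'energy'))
bins = range(0, 10000, 500)

# cut each partition against the same explicit bins, in parallel
dd = dd.assign(energyBin=dd['energy'].map_partitions(pd.cut, bins))

# group by binned energy plus position and count, still lazily
numberOfEvents = dd.groupby(['energyBin', 'posX', 'posY'])['time'].count()

# only the final, already-aggregated result is computed
numberOfEvents = numberOfEvents.compute()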