Slicing operations with Dask in an optimal way with Python - python

I have a few questions about slicing operations.
In pandas we can do operations as follows:
df["A"].iloc[0]
df["B"].iloc[-1]
# here df["A"] and df["B"] are sorted
Since we can't do this (Slicing and Multiple_col_sorting) with Dask (I am not 100% sure), I used another way to do it:
df["A"]=df.sort_values(by=['A'])
first=list(df["A"])[0]
df["B"]=df.sort_values(by=['B'])
end=list(df["B"])[-1]
this way is really time-consuming when the dataframe is large, is there any other way to do this operation?
https://docs.dask.org/en/latest/dataframe-indexing.html
https://docs.dask.org/en/latest/array-slicing.html
I tried working with these, but it did not work.

The index in Dask is different from pandas: pandas maintains a global ordering of the data, whereas Dask indexes each partition from 1 to N, so there can be multiple rows with an index value of 1. I think this is why iloc on a row is disallowed.
For this specifically, use
first: https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.first.html
last: https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.last.html
Sorting is a very expensive operation for large dataframes spread across multiple machines, whereas first and last are very parallelizable operations because they can be computed per partition and then combined across the results of each partition.
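For the specific case in the question, the first value of a sorted column is just its minimum and the last value is its maximum, so a per-partition reduction avoids the sort entirely. A minimal sketch, with made-up example data:
import pandas as pd
import dask.dataframe as dd

# Hypothetical example data
pdf = pd.DataFrame({"A": [3, 1, 2], "B": [10, 30, 20]})
df = dd.from_pandas(pdf, npartitions=2)

# min/max reduce each partition independently and then combine the partial results
first = df["A"].min().compute()  # first value of sorted df["A"]
end = df["B"].max().compute()    # last value of sorted df["B"]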

It's possible to get almost .iloc-like behaviour with Dask dataframes, but it requires one pass through the whole dataset. For very large datasets, this might be a meaningful time cost.
The rough steps are: 1) create a unique index that matches the row numbering (modifying this answer to start from zero, or using this answer), and 2) swap .iloc[N] for .loc[N].
This won't help with relative syntax like .iloc[-1]; however, if you know the total number of rows, you can compute the corresponding absolute position to pass into .loc.
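A minimal sketch of those two steps, using a cumulative sum to build a zero-based row number (the column and variable names here are illustrative, not from the linked answers):
import pandas as pd
import dask.dataframe as dd

# Hypothetical example data
pdf = pd.DataFrame({"value": list("abcdefghij")})
df = dd.from_pandas(pdf, npartitions=3)

# Step 1: build a zero-based global row number and make it the index.
# The cumulative sum runs across partitions in order, so the new index is
# monotonically increasing and set_index can be told it is already sorted.
df["row_number"] = 1
df["row_number"] = df["row_number"].cumsum() - 1
df = df.set_index("row_number", sorted=True)

# Step 2: swap .iloc[N] for .loc[N].
third_row = df.loc[2].compute()

# There is no direct .iloc[-1], but knowing the total row count lets you
# compute the absolute position instead.
n_rows = len(df)  # requires a pass over the data
last_row = df.loc[n_rows - 1].compute()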

Related

Dask: Sorting truly lazily

If I have a dataset with unknown divisions and would like to sort it according to a column and output to Parquet, it seems to me that Dask does at least some of the work twice:
import dask
import dask.dataframe as dd
def my_identity(x):
    """Does nothing, but shows up on the Dask dashboard"""
    return x
df = dask.datasets.timeseries()
df = df.map_partitions(my_identity)
df = df.set_index(['name']) # <- `my_identity` is calculated here, as well as other tasks
df.to_parquet('temp.parq') # <- previous tasks seem to be recalculated here
If my_identity were computationally demanding, then recomputing it would be really costly.
Am I correct in my understanding that Dask does some work twice here? Is there any way to prevent that?
The explanation below may not be accurate, but hopefully it helps a bit.
Let's try to get into Dask's shoes on this. We are asking Dask to create an index based on some variable... Dask only works with sorted indexes, so it needs to know how to re-arrange the data to make it sorted, and also what the appropriate divisions for the partitions will be. The first calculation you see is doing exactly that, and Dask stores only the parts of the calculation necessary for the divisions and the data reshuffling.
Then, when we ask Dask to save the data, it computes the variables, shuffles the data (in line with the previous computations) and stores it in the corresponding partitions.
How to avoid this? Possible options:
Persist before setting the index. Once you persist, Dask computes the variable and keeps the results on the workers, so setting the index will refer to the results of that computation (there will still be a reshuffle of the data). Note that the documentation suggests persisting after setting the index, but that case assumes the column already exists and does not require a separate computation. See the sketch after this list.
Sort within partitions. This can be done lazily, but of course it is only an option if you do not need a global sort.
Use plain pandas. This may necessitate some manual chunking of the data (this is what I tend to use for sorting).
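A minimal sketch of the first option (persisting before set_index), reusing the example from the question:
import dask
import dask.dataframe as dd

def my_identity(x):
    """Does nothing, but shows up on the Dask dashboard"""
    return x

df = dask.datasets.timeseries()
df = df.map_partitions(my_identity)

# Persist first: `my_identity` is computed once and the resulting
# partitions are kept in worker memory.
df = df.persist()

# set_index and to_parquet now reuse the persisted partitions instead of
# recomputing the earlier tasks (the shuffle itself still happens).
df = df.set_index('name')
df.to_parquet('temp.parq')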

Dask map_partitions results in duplicates when reducing and gives wrong results compared to pure pandas

When I use Dask to group by using map_partitions, I obtain duplicated data and wrong results compared to a simple pandas groupby. But when I use npartitions=1, I get the correct results.
Why does this happen? And how can I use multiple partitions and still get the correct results?
My code is:
measurements = measurements.repartition(npartitions=38)
measurements.map_partitions(
    lambda df: df.groupby(
        ["id", df.time.dt.to_period("M"), "country", "job"]
    ).source.nunique()
).compute().reset_index()
In pandas, I do
measurements.groupby(["id",measurements.time.dt.to_period("M"),
"country","job"]).source.nunique().reset_index()
PS: I'm using a local cluster on a single machine.
When you call map_partitions, you are saying that you want to perform that action on each partition separately. Given that each unique grouping value can occur in multiple partitions, you will get an entry for each group in each partition where it is found.
What if there were a way to do the groupby across partitions and have the results smartly merged for you automatically? Fortunately, this is exactly what Dask does, and you do not need to use map_partitions at all:
measurements.groupby(...).field.nunique().compute()
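Spelled out with the column names from the question (the grouping keys below are taken from that snippet), the aggregation would look roughly like this:
# Let Dask perform the groupby itself; it aggregates per partition and then
# merges the partial results across partitions automatically.
result = (
    measurements.groupby(
        ["id", measurements.time.dt.to_period("M"), "country", "job"]
    )
    .source.nunique()
    .compute()
    .reset_index()
)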

What is the time complexity of .at and .loc in pandas?

I'm looking for the time complexity of these methods as a function of the number of rows in a dataframe, n.
Another way of asking this question is: are the indexes for dataframes in pandas B-trees (with log(n) time lookups) or hash tables (with constant-time lookups)?
I'm asking because I'd like a way to do constant-time lookups for rows in a dataframe based on a custom index.
Alright, so it would appear that:
1) You can build your own index on a dataframe with .set_index in O(n) time, where n is the number of rows in the dataframe.
2) The index is lazily initialized and built (in O(n) time) the first time you try to access a row using that index, so the first access through that index takes O(n) time.
3) All subsequent row accesses take constant time.
So it looks like the indexes are hash tables and not B-trees.
From the Pandas Internals documentation, the default DataFrame index
Populates a dict of label to location in Cython to do O(1) lookups.
dict uses hash tables, supporting Peter Berg's answer to this question.
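A quick way to see the lazy build and the constant-time lookups described above (timings are machine-dependent; this is just an illustrative sketch):
import numpy as np
import pandas as pd
from timeit import timeit

n = 1_000_000
df = pd.DataFrame({"value": np.arange(n)}, index=np.random.permutation(n))

# The first .loc lookup pays the one-time O(n) cost of building the
# label-to-location hash table.
print(timeit(lambda: df.loc[123, "value"], number=1))

# Subsequent lookups reuse the hash table and are effectively O(1).
print(timeit(lambda: df.loc[456, "value"], number=1_000))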

After how many columns does the Pandas 'set_index' function stop being useful?

As I understand it, the advantage of using the set_index function with a particular column is that it allows direct access to a row based on a value. As long as you know the value, this eliminates the need to search using something like loc, thus cutting down the running time of the operation. Pandas also allows you to set multiple columns as the index using this function. My question is: after how many columns do these indexes stop being valuable? If I were to specify every column in my dataframe as the index, would I still see increased speed in indexing rows over searching with loc?
The real downside of setting everything as the index is buried deep in the advanced indexing docs of pandas: indexing can change the dtype of the column being set as the index. I would expect you to encounter this problem before realizing the prospective performance benefit.
As for that performance benefit, you pay for indexing up front when you construct the Series objects, regardless of whether you explicitly set them as the index. AFAIK pandas indexes everything by default. And as Jake VanderPlas puts it in his excellent book:
If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by "aligned" we mean that they share the same index.
-- Jake VanderPlas, The Python Data Science Handbook
So, the reason to set something as index is to make it easier for you to work with your data or to support your data access pattern, not necessarily for performance optimization like a database index.
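For illustration, a small sketch (with made-up data) of what a multi-column index buys you in terms of access pattern:
import pandas as pd

df = pd.DataFrame(
    {"city": ["NYC", "NYC", "LA"], "year": [2020, 2021, 2020], "sales": [10, 12, 9]}
)

# Moving two columns into a MultiIndex allows direct lookup by a tuple of
# values, at the cost of those columns no longer being regular columns.
indexed = df.set_index(["city", "year"]).sort_index()
print(indexed.loc[("NYC", 2021), "sales"])  # 12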

Pandas memory usage?

I was wondering how pandas handles memory usage in Python, and more specifically how memory is handled if I assign the results of a dataframe query to a variable. Under the hood, would it just be some memory addresses pointing back to the original dataframe object, or would I be cloning all of the data?
I'm afraid of memory ballooning out of control, but I have a dataframe with non-unique fields that I can't index by. It's incredibly slow to query and plot data from it using commands like df[(df[''] == x) & (df[''] == y)].
(They're both integer values in the rows. They're also not unique, hence the fact that it returns multiple results.)
I'm very new to pandas anyway, but any insights on how to handle a situation where I'm looking for the arrays of values where two conditions match would be great too. Right now I'm using an O(n) algorithm to loop through and index it, because even that runs faster than the search queries when I need to access the data quickly. Watching my system take twenty seconds on a dataset of only 6,000 rows is foreboding.
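For reference, a small sketch of the setup described above (the column names are hypothetical). A boolean-mask query materialises the matching rows as a new DataFrame, so the result owns its own copy of the data rather than pointing back into the original:
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.randint(0, 10, 6000),
                   "b": np.random.randint(0, 10, 6000)})

# Two-condition boolean mask, as in the question.
subset = df[(df["a"] == 3) & (df["b"] == 7)]

print(len(subset))                            # several non-unique matches
print(subset.memory_usage(deep=True).sum())   # memory owned by the copy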
