I am trying to understand the use patterns for Dask on a local machine.
Specifically,
I have a dataset that fits in memory
I'd like to do some pandas operations
groupby...
date parsing
etc.
Pandas performs these operations via a single core and these operations are taking hours for me. I have 8 cores on my machine and, as such, I'd like to use Dask to parallelize these operations as best as possible.
My question is: what is the difference between these two ways of doing this in Dask?
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
(1)
import dask.dataframe as dd
df = dd.from_pandas(
    pd.DataFrame(iris.data, columns=iris.feature_names),
    npartitions=2
)
df.mean().compute()
(2)
import dask.dataframe as dd
from distributed import Client
client = Client()
df = client.persist(
    dd.from_pandas(
        pd.DataFrame(iris.data, columns=iris.feature_names),
        npartitions=2
    )
)
df.mean().compute()
What is the benefit of one use pattern over the other? Why should I use one over the other?
Version (2) has two differences compared to version (1): the choice to use the distributed scheduler, and persist. These are separate factors. There is a lot of documentation about both: https://distributed.readthedocs.io/en/latest/quickstart.html, http://dask.pydata.org/en/latest/dataframe-performance.html#persist-intelligently , so this answer can be kept brief.
1) The distributed scheduler is newer and smarter than the previous threaded and multiprocess schedulers. As the name suggests, it is able to use a cluster, but it also works on a single machine. Although the latency when calling .compute() is generally higher, in many ways it is more efficient, and it has more advanced features such as real-time dynamic programming and more diagnostics such as the dashboard. When created with Client(), you get by default a set of worker processes sized to your core count (the exact split between processes and threads varies between versions), but you can choose the number of processes and threads, and get close to the original threads-only situation with Client(processes=False).
2) Persisting means evaluating a computation and storing the result in memory, so that further computations are faster. You can also persist without the distributed client (dask.persist). It effectively trades memory for performance, because you don't need to re-evaluate the computation each time you use it for anything that depends on it. In the case where you go on to perform only one computation on the intermediate, as in the example, it should make no difference to performance.
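The memory-for-performance trade can be illustrated without Dask at all. The sketch below is a toy, not Dask's implementation: the Lazy class and its runs counter are made up for illustration, and it only mimics the fact that separate compute() calls re-evaluate a shared intermediate unless it has been persisted.

```python
class Lazy:
    """Toy lazy value: the wrapped function runs only when compute() is called."""

    def __init__(self, func, *deps):
        self.func = func
        self.deps = deps
        self.runs = 0  # how many times this node's function has executed

    def compute(self):
        # Evaluate dependencies first, then this node; nothing is cached.
        args = [d.compute() if isinstance(d, Lazy) else d for d in self.deps]
        self.runs += 1
        return self.func(*args)

    def persist(self):
        # Evaluate once now and keep the result in memory behind a new node.
        value = self.compute()
        return Lazy(lambda: value)


# An "expensive" intermediate that two downstream results depend on.
cleaned = Lazy(sorted, [3, 1, 2])
smallest = Lazy(lambda xs: xs[0], cleaned)
largest = Lazy(lambda xs: xs[-1], cleaned)

smallest.compute()
largest.compute()
runs_without_persist = cleaned.runs  # one evaluation per downstream compute()

kept = cleaned.persist()  # one more evaluation; the result now stays in memory
smallest_p = Lazy(lambda xs: xs[0], kept)
largest_p = Lazy(lambda xs: xs[-1], kept)
results = (smallest_p.compute(), largest_p.compute())
runs_after_persist = cleaned.runs  # unchanged by the two computes above
```

With a single downstream computation there is nothing to reuse, which is why persisting makes no difference in the example from the question.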
I have a function f that takes as input a variable x, which is a large np.ndarray (length 20000).
Execution of f takes very little time (about 5ms).
A for loop over a matrix M with many rows
for x in M:
    f(x)
takes about 5 times longer than parallelizing using multiprocessing
import multiprocessing
with multiprocessing.Pool() as pool:
    pool.map(f, M)
I have tried to parallelize with dask, but it loses even against sequential execution. A related post is here, but the accepted answer doesn't work for me. I have tried many things, like using partitions of the data as the best practices say, or using dask.bag. I'm running Dask on a local machine with 4 physical cores.
So the question is how to use dask with short tasks that take large data as input?
Firstly, the dask documentation makes clear the following contraindications:
it is a bad idea to create big data in the client and pass it to workers; you should have workers load the data they need
if the data you need fit into memory, the standard python tool (in this case numpy) probably works as well or better than dask
if you want to share memory and are running processes such as numpy that release the GIL, then you should prefer threads over processes.
dask multiprocessing should not generally be used if you can run distributed (i.e., always)
don't use a python loop over an array, you should vectorize
Since we don't know much about what you are doing or your system, I will offer a guess as to why dask is slower than multiprocessing. When you use multiprocessing.Pool, the system probably created the processes via fork and copied (or copy-on-write duplicated) the array into each process, so they can all access it. Dask requires threads and event loops to run, so it is not safe to use with fork. This means that when you want data in the client to be processed in a worker, it must be serialised and sent over IPC. This is very likely the cause of your slowdown.
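To get a feel for the serialisation cost, here is a rough stdlib-only sketch. It assumes pickle as a stand-in serializer (Dask has its own serialization for NumPy arrays, so real numbers will differ) and uses array.array in place of a NumPy row; the row index 7 is an arbitrary placeholder.

```python
import pickle
from array import array

ROW_LENGTH = 20_000  # matches the array length mentioned in the question

# One row of float64 values, standing in for a NumPy row of M.
row = array("d", [0.0] * ROW_LENGTH)

# Option A: the client ships the data itself with every task.
payload_with_data = pickle.dumps(row)

# Option B: the client ships only a small reference (a hypothetical row index)
# and the worker loads the row from wherever it is stored.
payload_with_reference = pickle.dumps(7)

print(len(payload_with_data))       # at least 8 bytes per value
print(len(payload_with_reference))  # a handful of bytes
```

For a task that runs in about 5 ms, moving the larger payload for every call can easily cost more than the computation itself, which is why having workers load their own data is the recommended pattern.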
I am using Dask Bag to run some simple map-reduce computation on a special cluster:
import dask.bag as bag
summed_image = bag.from_sequence(my_ids).map(gen_image_from_ids).reduction(sum, sum).compute()
This code builds a chained computation: it maps gen_image_from_ids over the sequence created by from_sequence, and then reduces all results into one with sum. Thanks to Dask Bag's reduction, the summation is done in parallel in a multi-level tree.
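For readers unfamiliar with the multi-level tree, the idea can be sketched in plain Python. This is only an illustration of the shape of the computation, not Dask's code: the helper names, partition_size, and the default of merging two partials at a time are assumptions, and everything runs sequentially here, whereas Dask would schedule the independent pieces in parallel.

```python
from itertools import islice


def chunks(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def tree_reduce(items, perpartition, aggregate, partition_size=4, split_every=2):
    """Reduce each partition, then combine partial results level by level.

    Assumes `items` is non-empty and `split_every` is at least 2.
    """
    # Level 0: one partial result per partition (independent tasks).
    partials = [perpartition(part) for part in chunks(items, partition_size)]
    # Higher levels: merge groups of `split_every` partials until one remains.
    while len(partials) > 1:
        partials = [aggregate(group) for group in chunks(partials, split_every)]
    return partials[0]


total = tree_reduce(range(1, 21), sum, sum)
print(total)
```

Every partial result in a level depends on results from the level below, which is why losing one intermediate node can force recomputation of everything beneath it.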
My special cluster setup has a higher failure rate because my workers can be killed at any time: the CPU is taken over by other higher-order processes and released after a while. A kill may hit only a single node about once every 5 minutes, but my total reduction job may take more than 5 minutes.
Although Dask is good at failure recovery, my job sometimes just never ends. If any internal node in the job tree gets killed, the temporary intermediate results from all previous computations are lost, and the computation has to restart from the beginning.
There is replicate for Dask Future objects, but I could not find a similar feature on the higher-level Dask Bag or Dataframe to ensure data resiliency. Please let me know if there is a common way to keep intermediate results in a Dask cluster with a very high failure rate.
Update - My workaround
Maybe any distributed computing system will suffer from frequent failures, even if the system can recover from them. In my case the worker shutdown is not really a system failure but is triggered by the higher-order process. So instead of directly killing my workers, the higher-order process now launches a small Python script that sends a retire_workers() command when it starts running.
As documented, with retire_workers() the scheduler will move data from the retired worker to another available one. So my problem is temporarily solved. However, I still leave the question open, since I think replicated, redundant computing would be a faster solution and would make better use of idle nodes in the cluster.
This might not be the solution you are looking for, but one option is to divide up the task sequence into small-enough batches that can ensure that the task will complete in time (or will be quick to re-do from scratch).
Something like this perhaps:
import dask.bag as db
from toolz import partition_all
n_per_chunk = 100 # just a guess, the best number depends on the case
tasks = list(partition_all(n_per_chunk, my_ids))
results = []
for t in tasks:
    summed_image = (
        db
        .from_sequence(t)  # process only the ids in this batch
        .map(gen_image_from_ids)
        .reduction(sum, sum)
        .compute()
    )
    results.append(summed_image)
summed_image = sum(results) # final result
There are other things to keep in mind here regarding re-starting the workflow on failure (or potentially launching smaller tasks in parallel), but hopefully this gives you a starting point for a workable solution.
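One way the restart-on-failure part could look is sketched below in plain Python. The names sum_in_batches and flaky_batch_sum are hypothetical, and RuntimeError is an assumed stand-in for whatever exception a lost worker surfaces in your setup; the per-batch function would be the Dask computation from the loop above.

```python
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def sum_in_batches(ids, process_batch, batch_size=100, max_retries=3):
    """Sum per-batch results, redoing only the batch that failed."""
    partial_sums = []  # completed batches survive later failures
    for batch in batches(ids, batch_size):
        for attempt in range(1, max_retries + 1):
            try:
                partial_sums.append(process_batch(batch))
                break
            except RuntimeError:
                if attempt == max_retries:
                    raise  # give up after the last allowed attempt
    return sum(partial_sums)


# Hypothetical stand-in for the Dask computation on one batch: it fails the
# first time it sees each batch, as if a worker had been killed mid-run.
seen = set()


def flaky_batch_sum(batch):
    key = tuple(batch)
    if key not in seen:
        seen.add(key)
        raise RuntimeError("worker lost")
    return sum(batch)


total = sum_in_batches(range(10), flaky_batch_sum, batch_size=4)
print(total)
```

Because finished partial sums live on the client side, a failure costs at most one batch of work rather than the whole reduction.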
Update: After more trials, this answer is not ideal because the client.replicate() command is blocking. I suspect it requires all futures to be done before making replicas. This is unwanted because 1. any intermediate node can disconnect before all are ready, and 2. it prevents other tasks from running asynchronously. I need another way to make replicas.
After lots of trials, I found one way to replicate the intermediate results during a chained computation to achieve data redundancy. Note that the parallel reduction function is a Dask Bag feature, which does not directly support replicate. However, as the Dask documentation states, one can replicate low-level Dask Future objects to improve resiliency.
Following @SultanOrazbayev's post, I manually compute partial sums and use persist() to keep them in cluster memory, as suggested in the comment; the returned item is essentially a Dask Future:
import dask.bag as db
from dask.distributed import futures_of
from toolz import partition_all
n_per_chunk = 100 # just a guess, the best number depends on the case
tasks = list(partition_all(n_per_chunk, my_ids))
bags = []
for t in tasks:
    summed_image = (
        db
        .from_sequence(t)  # process only the ids in this batch
        .map(gen_image_from_ids)
        .reduction(sum, sum)
        .persist()
    )
    bags.append(summed_image)
futures = futures_of(bags) # This can only be called on the .persist() result
I can then replicate these remote intermediate partial sums and feel safer summing the futures to get the final result:
client.replicate(futures, 5) # Improve resiliency by replicating to 5 workers
summed_image = client.submit(sum, futures).result() # The only line that blocks for the final result
Here I find that 5 replicas is stable for my cluster, although a higher value will incur more network overhead to pass the replicas among workers.
This works but could be improved, for example by performing a parallel reduction (sum) on the intermediate results, especially when there are lots of tasks. Please leave me your suggestions.
I have an API that loads data from MongoDB (with pymongo) and then applies relatively "complex" data transformations with pandas, such as groupby on datetime columns with a parametrized frequency, among other things. Since I'm more experienced with pandas than with mongo, I prefer doing it this way, but I have no idea whether writing these transformations as mongo aggregate queries would be significantly faster.
To simplify the question, and setting aside the difficulty of writing the queries on either side: is it faster to do a [simple groupby on mongo and select * results] or to [select * and do the groupby in pandas/dask (in a distributed scenario)]? Is the former faster or slower than the latter on large datasets, or on smaller ones?
In general, aggregation will be much faster than pandas because the aggregation pipeline is:
Uploaded to the server and processed there, eliminating the network latency associated with round trips
Optimised internally to achieve the fastest execution
Executed in C++ on the server with all the available threading, as opposed to running on a single thread in pandas
Working on the same dataset within the same process, which means your working set will be in memory, eliminating disk accesses
As a stopgap, I recommend pre-processing your data into a new collection using $out and then using pandas to process that.
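As a rough sketch of that pre-processing step, the pipeline could be built as plain Python data and handed to pymongo. The field names timestamp and value, the collection names, and the build_pipeline helper are all hypothetical; $dateTrunc also assumes MongoDB 5.0 or later.

```python
def build_pipeline(unit="day", out_collection="events_summary"):
    """Build an aggregation pipeline that groups documents by a time bucket."""
    return [
        {
            "$group": {
                # Truncate the (hypothetical) `timestamp` field to the unit.
                "_id": {"$dateTrunc": {"date": "$timestamp", "unit": unit}},
                # `value` is a hypothetical numeric field.
                "total": {"$sum": "$value"},
                "count": {"$sum": 1},
            }
        },
        {"$sort": {"_id": 1}},
        # $out must be the last stage; it writes the results to a collection.
        {"$out": out_collection},
    ]


pipeline = build_pipeline(unit="hour")
# With a pymongo collection object (called `events` here, an assumption):
# events.aggregate(pipeline)
```

Note that $out replaces the target collection if it already exists, so the summary collection should be dedicated to this purpose. The much smaller result can then be loaded into pandas for the remaining transformations.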
I am using Dask for parallel computation and want to detect the language of sentences in a column using langdetect. However, I still can not gain any speed in getting the language of the rows in the column.
Below is my code:
import dask.dataframe as dd
import langdetect
data = dd.read_csv('name.csv')  # has a column called short_description
def some_fn(e):
    return e['short_description'].apply(langdetect.detect)
data['Language'] = data.map_partitions(some_fn, meta='string')  # adds a new column called Language
This CSV file has 800000 rows, each containing a sentence of approx. 20 words.
Any suggestions on how I can make the language detection faster? It currently takes 2-3 hours.
By default, dask dataframe uses a thread pool for processing. My guess is that your language detection algorithm is written in pure Python (rather than C/Cython like most of Pandas) and so is limited by the GIL. This means that you should use processes rather than threads. You can ask Dask to use processes by adding the scheduler="processes" keyword to any compute or persist calls:
df.compute(scheduler="processes")
More information on Dask's different schedulers and when to use them is here: https://docs.dask.org/en/latest/scheduling.html
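The same threads-versus-processes choice exists in the standard library, where both pool types share one interface. The sketch below is not Dask and not langdetect: fake_detect is a made-up pure-Python stand-in, and the demo line uses threads only so that it runs anywhere.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_detect(text):
    """Hypothetical pure-Python stand-in for langdetect.detect."""
    # Pure-Python loops like this one hold the GIL while they run.
    vowels = sum(1 for ch in text.lower() if ch in "aeiou")
    return "many-vowels" if vowels * 3 > len(text) else "few-vowels"


def detect_all(texts, executor_cls=ThreadPoolExecutor, workers=4):
    """Apply fake_detect to every text using the given pool type."""
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(fake_detect, texts))


labels = detect_all(["aeiou aeiou", "rhythm myths"])
print(labels)
```

Passing concurrent.futures.ProcessPoolExecutor as executor_cls (from a script guarded by if __name__ == "__main__":) gives each worker its own interpreter and its own GIL, which is the effect scheduler="processes" aims for in Dask.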
What is the difference between the following two approaches?
# Dask
b = db.from_sequence(_query, npartitions=2)
df = b.to_dataframe()
df = df.compute()
# pandas
df = pd.DataFrame(_query)
I want to choose the best option for splitting up large amounts of data without losing performance.
As per Dask's best practices for dataframes (https://docs.dask.org/en/latest/dataframe-best-practices.html), for data that fits into RAM, use pandas; it will probably be more efficient.
If you choose to use Dask, avoid very large partitions. If you change the partition count manually, take into account your available memory and cores. For instance, a machine with 100 GB and 10 cores would typically want partitions in the 1 GB range.
As of Dask 2.0.0 you can repartition by target size with something like:
df.repartition(partition_size="100MB")
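To get a feel for what a size target implies, the partition count is just the dataset size divided by the target, rounded up. A small sketch, assuming decimal megabytes and a hypothetical helper name:

```python
import math


def partition_count(total_bytes, partition_bytes=100 * 10**6):
    """Number of partitions needed so that none exceeds `partition_bytes`."""
    return max(1, math.ceil(total_bytes / partition_bytes))


n = partition_count(8 * 10**9)  # an 8 GB dataframe at 100 MB per partition
print(n)
```

If the resulting count is far below the number of cores, some cores will sit idle; if it is in the many thousands, scheduling overhead starts to matter.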
Another tip, if you choose to stick with Dask, is to set up a local client so you can take advantage of Dask Distributed (http://distributed.dask.org/en/latest/client.html). From there, avoid full data shuffles and reduce the data as far as you can before computing to pandas.