Dask: why has CPU usage suddenly dropped? - python

I'm doing some Monte Carlo for a model and figured that Dask could be quite useful for this purpose. For the first 35 hours or so, things were running quite "smoothly" (apart from the fan noise giving a sense that the computer was taking off). Each model run would take about 2 seconds and there were 8 partitions running it in parallel. Activity monitor was showing 8 python3.6 instances.
However, the computer has become "silent" and CPU usage (as displayed in Spyder) hardly exceeds 20%. Model runs are happening sequentially (not in parallel) and taking about 4 seconds each. This happened today at some point while I was working on other things. I understand that depending on the sequence of actions, Dask won't use all cores at the same time. However, in this case there is really just one task to be performed (see further below), so one could expect all partitions to run and finish more or less simultaneously. Edit: the whole set up has run successfully for 10.000 simulations in the past, the difference now being that there are nearly 500.000 simulations to run.
Edit 2: now it has shifted to doing 2 partitions in parallel (instead of the previous 1 and original 8). It appears that something is making it change how many partitions are simultaneously processed.
Edit 3: Following recommendations, I have used a dask.distributed.Client to track what is happening, and ran it for the first 400 rows. An illustration of what it looks like after completing is included below. I am struggling to understand the x-axis labels, hovering over the rectangles shows about 143 s.
Some questions therefore are:
Is there any relationship between running other software (Chrome, MS Word) and having the computer "take back" some CPU from python?
Or instead, could it be related to the fact that at some point I ran a second Spyder instance?
Or even, could the computer have somehow run out of memory? But then wouldn't the command have stopped running?
... any other possible explanation?
Is it possible to "tell" Dask to keep up the hard work and go back to using all CPU power while it is still running the original command?
Is it possible to interrupt an execution and keep whichever calculations have already been performed? I have noticed that stopping the current command doesn't seem to do much.
Is it possible to inquire on the overall progress of the computation while it is running? I would like to know how many model runs are left to have an idea of how long it would take to complete in this slow pace. I have tried using the ProgressBar in the past but it hangs on 0% until a few seconds before the end of the computations.
To be clear, uploading the model and the necessary data would be very complex. I haven't created a reproducible example either, out of fear of making the issue worse (for now the model is still running at least...) and because, as you can probably tell by now, I have very little idea of what could be causing it and I am not expecting anyone to be able to reproduce it. I'm aware this is not best practice and apologise in advance. However, I would really appreciate some thoughts on what could be going on and possible ways to go about it, if anyone has been through something similar before and/or has experience with Dask.
Running:
- macOS 10.13.6 (Memory: 16 GB | Processor: 2.5 GHz Intel Core i7 | 4 cores)
- Spyder 3.3.1
- dask 0.19.2
- pandas 0.23.4
Please let me know if anything needs to be made clearer
If you believe it can be relevant, the main idea of the script is:
# Create a pandas DataFrame where each column is a parameter and each row is a possible parameter combination (cartesian product). At the end of each row some columns to store the respective values of some objective functions are pre-allocated too.
# Generate a dask dataframe that is the DataFrame above split into 8 partitions
# Define a function that takes a partition and, for each row:
#     Runs the model with the coefficient values defined in the row
#     Retrieves the values of objective functions
#     Assigns these values to the respective columns of the current row in the partition (columns have been pre-allocated)
#   and then returns the partition with columns for objective functions populated with the calculated values
# map_partitions() to this function in the dask dataframe
Any thoughts?
This shows how simple the script is:
The dashboard:
Update: The approach I took was to:
Set a large number of partitions (npartitions=nCores*200). This made it much easier to visualise the progress. I'm not sure if setting so many partitions is good practice but it worked without much of a slowdown.
Instead of trying to get a single huge pandas DataFrame in the end by .compute(), I got the dask dataframe to be written to Parquet (in this way each partition was written to a separate file). Later, reading all files into a dask dataframe and computing it into a pandas DataFrame wasn't difficult, and if something went wrong in the middle at least I wouldn't lose the partitions that had been successfully processed and written.
This is what it looked like at a given point:
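The idea behind the update (many small chunks, each written to its own file as soon as it finishes) does not depend on Dask. Below is a minimal standard-library sketch of it, not the Dask/Parquet code actually used: run_model, the CSV format, and the part-NNNNN.csv file naming are all assumptions made up for illustration.

```python
import csv
import itertools
import tempfile
from pathlib import Path


def run_model(a, b):
    """Hypothetical stand-in for one model run; returns an objective value."""
    return a * a + b


def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk


def process_all(param_rows, out_dir, chunk_size):
    """Process rows chunk by chunk, writing one CSV file per chunk.

    Chunks whose output file already exists are skipped, so an interrupted
    run can be restarted without redoing finished work.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for idx, chunk in enumerate(chunked(param_rows, chunk_size)):
        target = out_dir / f"part-{idx:05d}.csv"
        if target.exists():
            continue  # already completed in an earlier run
        tmp = target.with_suffix(".tmp")
        with open(tmp, "w", newline="") as f:
            writer = csv.writer(f)
            for a, b in chunk:
                writer.writerow([a, b, run_model(a, b)])
        tmp.rename(target)  # only finished chunks get the final name


def finished_chunks(out_dir):
    """Progress check: number of chunks fully written so far."""
    return len(list(Path(out_dir).glob("part-*.csv")))


# Cartesian product of two small parameter ranges: 4 * 5 = 20 rows.
params = list(itertools.product(range(4), range(5)))
work_dir = tempfile.mkdtemp()
process_all(params, work_dir, chunk_size=6)
n_done = finished_chunks(work_dir)
```

Counting the finished files gives a rough progress indicator while the job is still running, and re-running process_all after an interruption only computes the chunks that are missing.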

Dask has many diagnostic tools to help you understand what is going on inside your computation. See http://docs.dask.org/en/latest/understanding-performance.html
In particular I recommend using the distributed scheduler locally and watching the Dask dashboard to get a sense of what is going on in your computation. See http://docs.dask.org/en/latest/diagnostics-distributed.html#dashboard
This is a webpage that you can visit that will tell you exactly what is going on in all of your processors.

Related

More optimized way to do itertools.combinations

I'm trying to find unique combinations of ~70,000 IDs.
I'm currently doing an itertools.combinations([list name], 2) to get unique 2 ID combinations but it's been running for more than 800 minutes.
Is there a faster way to do this?
I tried converting the IDs into a matrix where the IDs are both the index and the columns and populating the matrix using itertools.product.
I tried doing it the manual way with loops too.
But after more than a full day of letting them run, none of my methods have actually finished running.
For additional information, I'm storing these into a data frame, to later run a function that compares each of the unique set of IDs.
(70_000 * 69_999) / 2 ≈ 2.4 billion. That is not such a large number as to be uncomputable in a few hours (update: I did a dry run of itertools.combinations(range(70000), 2) and it took less than 70 seconds on a 2017-era i7 @ 3 GHz, naively using a single core). But if you are trying to keep all of this data in memory at once, then it won't fit, and if your system is configured to swap memory to disk before failing with a MemoryError, this may slow the program down by 2 or more orders of magnitude. That is where your problem comes from.
itertools.combinations does the right thing in this respect, and there is no need to swap it for something else: it yields one combination at a time. What you are doing with the result, however, does change things: if you are streaming the combinations to a file and not keeping them in memory, it should be fine, and then it is just computational time that you can't speed up anyway.
If, on the other hand, you are collecting the combinations into a list or other data structure: there is your problem. Don't do it.
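As a rough sketch of the streaming approach described above (the output path and the "a,b" line format are assumptions for illustration), each pair can be written out as soon as it is generated, so only one pair is ever held in memory:

```python
import itertools
import math
import os
import tempfile


def write_pairs(ids, path):
    """Stream every unordered pair of ids to `path`, one pair per line.

    Only one pair is held in memory at a time.
    """
    count = 0
    with open(path, "w") as f:
        for a, b in itertools.combinations(ids, 2):
            f.write(f"{a},{b}\n")
            count += 1
    return count


# Number of pairs for 70,000 IDs, computed without generating them.
total_pairs = math.comb(70_000, 2)

# Small demonstration run so the example finishes instantly.
demo_path = os.path.join(tempfile.mkdtemp(), "pairs.csv")
written = write_pairs(range(100), demo_path)
```

math.comb (Python 3.8+) confirms the size of the full problem up front: 2,449,965,000 pairs for 70,000 IDs.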
Now, going a step further than your question: since these combinations are checkable and predictable, maybe generating all of them is not the right approach at all. You don't give details on how they are to be used, but if they are consumed reactively or lazily, you might have an instantaneous workflow instead.
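One possible reading of the "lazy form" is a generator that runs the comparison on each pair as it is produced and yields only the pairs of interest. The sketch below assumes a made-up comparison function, similar, since the question does not say how the IDs are compared:

```python
import itertools


def similar(a, b):
    """Hypothetical comparison: here, two IDs 'match' if they differ by 1."""
    return abs(a - b) == 1


def matching_pairs(ids, compare):
    """Lazily yield only the pairs for which compare(a, b) is true."""
    for a, b in itertools.combinations(ids, 2):
        if compare(a, b):
            yield (a, b)


pairs = matching_pairs(range(1000), similar)
# Nothing has been computed yet; pairs are produced only as they are consumed.
first_three = list(itertools.islice(pairs, 3))
```

The consumer decides how far to iterate, so no list or data frame of every combination ever has to exist.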
Your RAM will fill up. You can counter this with gc.collect() or by emptying the results, but the results found so far have to be saved in between.
You could try something similar to the code below. I would create individual file names or save the results into a database, since the result file will be several GB in size. Additionally, the range of the second loop can probably be divided by 2.
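import gc

new_set = set(range(70000))  # the IDs to combine
n = len(new_set)

combined_set = set()
for i in range(n):
    print(i)  # progress indicator
    if i % 300 == 0:
        # Flush the pairs collected so far to disk, then free the memory
        with open("results", "a") as f:
            f.write(str(combined_set))
        combined_set = set()
        gc.collect()
    for b in range(n):
        combined_set.add((i, b))

# Write whatever is left after the last flush
with open("results", "a") as f:
    f.write(str(combined_set))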

Dataflow pipeline throughput decreases drastically as execution advances + unexpected side input behavior

I have a dataflow pipeline processing an input of about 1Gb of data with two dicts as side_inputs. The goal is to calculate features from the main dataset with the help of those two side_inputs.
Overall structure of the pipeline is as follows:
# First side input, ends up as a 2GB dict with 3.5 million keys
side_inp1 = (
    p
    | "read side_input1" >> beam.io.ReadFromAvro("$PATH/*.avro")
    | "to list of tuples" >> beam.Map(lambda row: (row["key"], row["value"]))
)
# Second side input, ends up as a 1.6GB dict with 4.5 million keys
side_inp2 = (
    p
    | "read side_input2" >> beam.io.ReadFromAvro("$PATH2/*.avro")
    | "to list of tuples" >> beam.Map(lambda row: (row["key"], row["value"]))
)
# The main part of the pipeline, reading an avro dataset of 1 million rows -- 20GB
(
    p
    | "read inputs" >> beam.io.ReadFromAvro("$MainPath/*.avro")
    | "main func" >> beam.Map(MyMapper, pvalue.AsDict(side_inp1), pvalue.AsDict(side_inp2))
)
Here's the Dataflow graph:
And the "Featurize" step unwrapped:
So Featurize is a function that looks up ids in the side inputs, .get()s the vectors, and computes around 180 different vector dot products to calculate some features. It's a completely CPU-bound process and is expected to take longer than the rest of the pipeline, but the stalling is what's strange here.
My problems are twofold:
The Dataflow pipeline seems to slow down drastically as it moves further in the process. I don't know what the reasons are or how I can alleviate this problem. A throughput chart of the MyMapper step can be seen below; I'm wondering about the declining throughput (from ~400 rows/sec to nearly ~1 row/sec at the end).
Also, the behavior of side_inputs is strange to me. I expected the side_inputs to be read once and only once, but when I check out the Job Metrics / Throughput chart, I observe the following chart. As can be seen, the pipeline is constantly reading in side_inputs, while what I want is only two dicts that are kept in memory.
Other job configurations
zone: us-central1-a
machine_type: m1-ultramem-40 (40 CPU cores, 960GB RAM)
disk_type/size: ssd/50GB
experiments: shuffle-service enabled.
max_num_workers: 1 to help ease calculations and metrics, and not have them vary due to auto-scaling.
Extra Observations
I'm constantly seeing log entries like the following in LogViewer: [INFO] Completed workitem: 4867151000103436312 in 1069.056863785 seconds
All completed workItems so far have taken about 1000-1100 seconds. This is another source of confusion: why should throughput drop while processing workItems takes the same time as before? Has parallelism dropped for some reason? (Maybe some hidden threading threshold that's out of my control, like harness_threads?)
In the later parts of the pipeline, looking at the logs, it looks like the execution pattern is very sequential (it seems to execute 1 workItem, finish it, then go to the next, which is strange to me, considering there's 1TB of available memory and 40 cores).
There are 0 errors or even warnings
The throughput chart in point 1 is a good indicator that the performance in your job decreased somehow.
The side input is intended to be in memory; however, I'm not sure that a pipeline with only 1 highmem node is a good approach. With only one node, the pipeline might have bottlenecks that are difficult to identify, e.g. network or OS limitations (like the maximum number of open files in the OS, related to the files loaded into memory). Because of Beam's architecture, I think having more nodes is not a problem, even if autoscaling is enabled, since autoscaling automatically chooses the appropriate number of worker instances required to run your job. If you are worried about calculations and metrics for other reasons, please share.
Regarding point 2, I think it is expected to find activity on the graph since the side input (in memory) is read by each element being processed. However, if this doesn't make sense for you, you can always add the complete job graph for us to understand any other details of the pipeline steps.
My recommendation is to add more workers to distribute the workload, as a PCollection is a distributed dataset that will be spread among the available nodes. You can try to get similar computational resources with more nodes, for example 4 n2d-highmem-16 instances (16 vCPU, 128GB). With these changes it is possible that any bottlenecks disappear or are mitigated; in addition, you can monitor the new job in the same way:
Remember to check for errors in your pipeline, so you can identify any other issues that are happening or causing the performance issue.
Check the CPU and memory usage in the Dataflow UI. If memory errors are happening at job level, Stackdriver should show them as memory errors, but the memory on the host instance should also be checked to be sure that it is not reaching the OS limit for other reasons.
You might want to check this example with side inputs as dictionaries. I'm not an expert, but you can follow the best practice in the example.
UPDATE
If the n2d-highmem-16 machines run out of memory (OOM), it seems to me that each harness thread might be using its own copy of the dicts. I'm not sure whether configuring the number of threads can help, but you can try setting number_of_worker_harness_threads in the pipeline options.
On the other hand, can you expand the step Featurize? The wall time is very high in this step (~6 days), let's check the composite transforms that absorbed such latency. For the problematic composite transforms let us know the code snippet. To identify the composite transforms that can have issues please refer to Side Inputs Metrics especially Time spent writing and Time spent reading.

Measuring the execution time of python scripts

I am currently working on an ML NLP project and I want to measure the execution time of certain parts and also potentially predict how long the execution will take. For example, I want to measure the ML training process (including sub-processes like the data preprocessing part). I have been looking online and I have come across different Python modules that can measure the execution time of functions (like time or timeit). However, I still haven't found a concrete solution to predict the time it will take for a function to execute. I have thought about running the code several times, saving the (data_size, time) values, and then using them to extrapolate for future data. I also thought about then updating this estimate with the time it took to run several subparts of a function (like seeing how much of the process was computed and how long it took, and then using that to adjust the time left).
However, I am not sure of any of this and I wanted to see if there were better options out there that I wasn't aware of, so if anyone has a better idea, I'd be thankful if you could share it.
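The extrapolation idea from the question can be sketched with the standard library alone: time a few runs at different sizes, fit a straight line through the (data_size, time) points, and read off a prediction. This assumes the run time grows roughly linearly with data size, which may well not hold for a real training job; preprocess is a hypothetical stand-in workload and the sizes are arbitrary.

```python
import time


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def time_call(func, *args):
    """Return the wall-clock seconds one call to func(*args) takes."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start


def preprocess(n):
    """Hypothetical workload whose cost grows with n."""
    return sorted(str(i) for i in range(n))


sizes = [10_000, 20_000, 40_000]
timings = [time_call(preprocess, n) for n in sizes]
slope, intercept = fit_line(sizes, timings)

# Extrapolated estimate (in seconds) for a larger input.
predicted_seconds = slope * 200_000 + intercept
```

Single timings are noisy, so in practice each size would probably be measured several times (timeit.repeat does this) before fitting.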
Have you looked into using profiling? It should give a detailed breakdown of the function execution times, the number of calls, etc. You will have to execute the script with profiling, and then you will get the detailed breakdown.
https://docs.python.org/3/library/profile.html#module-cProfile
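As an illustration of what profiling looks like in practice, here is a small sketch using cProfile and pstats from the standard library. The tokenize and train functions are hypothetical placeholders for the real preprocessing and training code.

```python
import cProfile
import io
import pstats


def tokenize(text):
    """Hypothetical preprocessing step."""
    return text.lower().split()


def train(corpus):
    """Hypothetical 'training' step: count tokens across the corpus."""
    counts = {}
    for doc in corpus:
        for tok in tokenize(doc):
            counts[tok] = counts.get(tok, 0) + 1
    return counts


corpus = ["The quick brown fox", "jumps over the lazy dog"] * 5000

# Profile only the part of the program we care about.
profiler = cProfile.Profile()
profiler.enable()
model = train(corpus)
profiler.disable()

# Collect the five most expensive entries by cumulative time into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
```

Printing report shows call counts and per-function times. A whole script can also be profiled without code changes via python -m cProfile -s cumulative your_script.py.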
If you want in-time progress reports there are a couple of libraries I've seen. https://pypi.org/project/tqdm/
https://pypi.org/project/progressbar2/
Hope these help!
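For the "adjust the time left" part of the question, the estimate that progress-bar libraries display can be approximated by hand: average time per finished item multiplied by the number of items left. The class below is a minimal sketch of that arithmetic, not the API of tqdm or progressbar2; the clock parameter exists only to make the example testable.

```python
import time


class ProgressEstimator:
    """Estimate remaining time from the average pace so far."""

    def __init__(self, total, clock=time.monotonic):
        self.total = total
        self.done = 0
        self._clock = clock
        self._start = clock()

    def update(self, n=1):
        """Record that n more items have finished."""
        self.done += n

    def remaining_seconds(self):
        """Elapsed time per finished item times the items still left."""
        if self.done == 0:
            return None  # no pace information yet
        elapsed = self._clock() - self._start
        return elapsed / self.done * (self.total - self.done)
```

Calling update() after each batch and remaining_seconds() whenever a report is wanted gives a running estimate that corrects itself as the pace changes.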

HDBSCAN won't utilize all available cpus. Processes just sleep

For the past few weeks I've been attempting to perform a fairly large clustering analysis using the HDBSCAN algorithm in Python 3.7. The data in question is roughly 4 million rows by 40 columns at around 1.5GB in CSV format. It's a mixture of ints, bools, and floats up to 9 digits.
During this period each time I've been able to get the data to cluster it has taken 3 plus days, which seems weird given HDBSCAN is revered for its speed and I'm running this on a Google Cloud Compute Instance with 96 cpus. I've spent days trying to get it to utilize the cloud instance's processing power but to no avail.
Using the auto algorithm detection in HDBSCAN, it selects the boruvka_kdtree as the best algorithm to use. And I've tried passing in all sorts of values to core_dist_n_jobs parameter. From -2,-1, 1, 96, multiprocessing.cpu_count(), to 300. All seem to have a similar effect of causing 4 main python processes to utilize a full core while spawning way more sleeping processes.
I refuse to believe I'm doing this right and this is truly how long this takes on this hardware. I'm convinced I must be missing something like an issue where using JupyterHub on the same machine causes some sort of GIL lock, or I'm missing some parameter for HDBSCAN.
Here is my current call to HDBSCAN:
hdbscan.HDBSCAN(min_cluster_size = 100000, min_samples = 500, algorithm='best', alpha=1.0, memory=mem, core_dist_n_jobs = multiprocessing.cpu_count())
I've followed all existing issues and posts related to this issue I could find and nothing has worked so far, but I'm always down to try even radical ideas, because this isn't even the full data I want to cluster and at this rate it would take 4 years to cluster the full data!
According to the author
Only the core distance computation can use all the cores, sadly that is apparently the first few seconds. The rest of the computation is quite challenging to parallelise unfortunately and will run on a single thread.
you can read the issues from the links below:
Not using all available CPUs?
core_dist_n_jobs =1 or -1 -> no difference at all and computation time extremely high

How to speed up the run time of computation heavy python programs

I want to know how to speed up programs that involve a lot of computation on large datasets. I have 5 Python programs, each of which performs some convoluted computations on a large dataset. For example, a portion of one of the programs is as follows:
import pandas as pd

df = get_data_from_redshift()
# Every entry uses the lowercase 'col' key, which the lambda below looks up
cols = [
    {'col': 'Col1', 'func': pd.Series.nunique},
    {'col': 'Col2', 'func': pd.Series.nunique},
    {'col': 'Col3', 'func': lambda x: x.value_counts().to_dict()},
    {'col': 'Col4', 'func': pd.Series.nunique},
    {'col': 'Col5', 'func': pd.Series.nunique},
]
d = df.groupby('Column_Name').apply(
    lambda x: tuple(c['func'](x[c['col']]) for c in cols)
).to_dict()
where get_data_from_redshift() connects to a Redshift cluster, gets data from the database, and writes it to a dataframe (the dataframe is about 600,000 rows x 6 columns).
The other programs also use this dataframe df and perform a lot of computation, and each program writes its result to a pickle file.
The final program loads the pickle files created by the 5 programs, does some computations to get some 300,000 values and then checks them against another database in the cluster, to get a final file output.
Running each program individually takes hours (sometimes overnight). However, I need the whole thing to run within an hour and give me the final output file.
I tried putting one of the programs on an EC2 instance to see if the performance improves, but it has been over 3 hours and it's still running. I tried m4.xlarge, c4.xlarge, r4.xlarge instances, but none of them were useful.
Is there a way to speed up the total run time?
Maybe I could run each of the 5 programs on separate EC2 instances, but then each program gives an output file, which the final program has to use. So, if I run on multiple instances, the output files from each program will be saved on different servers, right? Then how will the final program use them? Can we save the output file from each program to a common location that the final program can access?
I've heard of GPUs being 14 times faster than CPUs, however I've never used them. Will using a GPU instance be of any help in this case?
Sorry, I'm new here, but don't really know how to go about it.
You need to find out what's making it slow; you can start with a profiler if you cannot think of anything else at the moment. Finding the exact problem is the simplest way to make it work better.
Below is a generic approach.
First, optimizations in architecture/algorithms can substantially outperform any other optimization (like those provided by programming languages, tools, or even techniques like memoization). So first look thoroughly at whether your algorithm can be improved. This includes looking for parts that are independent of each other and can be executed in parallel. For example, using map-reduce instead of linear data processing can cut execution times to a fraction! But it requires that the processing can be divided into mutually exclusive pieces for parallel processing.
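To make the map-reduce point concrete, here is a minimal sketch of the pattern: split the data into non-overlapping chunks, process each chunk independently (map), then merge the partial results (reduce). The value-counting task and all function names are made up for illustration. The example runs with the built-in map; the map_fn parameter is where a parallel map could be plugged in.

```python
from functools import reduce


def split(data, n_chunks):
    """Split a list into at most n_chunks contiguous, non-overlapping pieces."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def count_values(chunk):
    """Map step: count occurrences within one chunk, independently of others."""
    counts = {}
    for value in chunk:
        counts[value] = counts.get(value, 0) + 1
    return counts


def merge(left, right):
    """Reduce step: combine two partial count dictionaries."""
    merged = dict(left)
    for key, n in right.items():
        merged[key] = merged.get(key, 0) + n
    return merged


def map_reduce(data, n_chunks, map_fn=map):
    """Run count_values over each chunk via map_fn, then merge the results."""
    partials = map_fn(count_values, split(data, n_chunks))
    return reduce(merge, partials, {})


data = ["a", "b", "a", "c", "b", "a"] * 1000
totals = map_reduce(data, n_chunks=4)
```

Because the chunks do not depend on each other, the map method of a concurrent.futures.ProcessPoolExecutor could in principle be passed as map_fn to spread the chunks over several cores. Whether that pays off depends on how expensive the per-chunk work is compared with the cost of shipping data between processes.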
Next, look for unnecessary loops or computations. Techniques like memoization can also improve performance greatly.
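As a small illustration of memoization, functools.lru_cache from the standard library caches the results of a pure function so that repeated calls with the same argument are not recomputed. The expensive function here is a hypothetical placeholder.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def expensive(n):
    """Hypothetical costly pure function: sum of squares below n."""
    return sum(i * i for i in range(n))


inputs = [1000, 2000, 1000, 1000, 2000]
results = [expensive(n) for n in inputs]

# cache_info() reports how many calls were answered from the cache.
info = expensive.cache_info()
```

This only helps when the same inputs really do recur and the function has no side effects; the arguments also have to be hashable.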
Then check for any communication or I/O tasks (e.g. the communication with the Redshift cluster you've mentioned) that are time-consuming (this doesn't seem to apply to you, as your concerns are about the computation being slow).
Finally, there are minor optimizations, such as using functions like map and filter, or generator expressions instead of lists, which can help (very) slightly.
