Is Spark's KMeans unable to handle big data? - python

KMeans has several parameters for its training, with the initialization mode defaulting to kmeans||. The problem is that it marches quickly (in less than 10 min) through the first 13 stages, but then hangs completely, without yielding an error!
Minimal example that reproduces the issue (it will succeed if I use 1000 points or random initialization):
from pyspark.context import SparkContext
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.random import RandomRDDs

if __name__ == "__main__":
    sc = SparkContext(appName='kmeansMinimalExample')
    # same with 10000 points
    data = RandomRDDs.uniformVectorRDD(sc, 10000000, 64)
    C = KMeans.train(data, 8192, maxIterations=10)
    sc.stop()
The job does nothing: it doesn't succeed, fail, or progress. There are no active/failed tasks in the Executors tab, and the stdout and stderr logs don't have anything particularly interesting.
If I use k=81 instead of 8192, it will succeed.
Notice that the two calls of takeSample() should not be an issue, since they were also called twice in the random-initialization case.
So, what is happening? Is Spark's KMeans unable to scale? Does anybody know? Can you reproduce it?
If it were a memory issue, I would expect warnings and errors, as I had seen before.
Note: placeybordeaux's comments are based on executing the job in client mode, where the driver's configuration is invalidated, causing exit code 143 and the like (see the edit history). In cluster mode there is no error reported at all; the application just hangs.
From zero323: "Why is Spark Mllib KMeans algorithm extremely slow?" is related, but I think he witnesses some progress, while mine hangs; I did leave a comment...

I think the 'hanging' is because your executors keep dying. As I mentioned in a side conversation, this code runs fine for me, locally and on a cluster, in PySpark and Scala. However, it takes a lot longer than it should, with almost all of the time spent in k-means|| initialization.
I opened https://issues.apache.org/jira/browse/SPARK-17389 to track two main improvements, one of which you can use now. Edit: really, see also https://issues.apache.org/jira/browse/SPARK-11560
First, there are some code optimizations that would speed up the init by about 13%.
However, most of the issue is that it defaults to 5 steps of k-means|| init, when it seems that 2 is almost always just as good. You can set the initialization steps to 2 to see a speedup, especially in the stage that's hanging now.
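In PySpark this is just the initializationSteps parameter of KMeans.train; a minimal sketch, assuming the same data RDD as in the question:
# Sketch: cut k-means|| init from the default 5 steps down to 2
C = KMeans.train(data, 8192, maxIterations=10,
                 initializationMode="k-means||",
                 initializationSteps=2)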
In my (smaller) test on my laptop, init time went from 5:54 to 1:41 with both changes, mostly due to setting init steps.

If your RDD is that large, collectAsMap will attempt to copy every single element in the RDD onto the single driver program, and then run out of memory and crash. Even though you had partitioned the data, collectAsMap sends everything to the driver, and your job crashes.
You can make sure the number of elements you return is capped by calling take or takeSample, or perhaps by filtering or sampling your RDD. For example:
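# Sketch: cap what comes back to the driver (the sizes here are arbitrary)
first_k = data.take(100)                        # deterministic cap of 100 elements
sample = data.takeSample(False, 1000, seed=42)  # random sample of at most 1000 elements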
Similarly, be cautious of these other actions unless you are sure your dataset is small enough to fit in memory:
countByKey,
countByValue,
collect
If you really do need every one of the RDD's values and the data is too big to fit into memory, you could write the RDD out to files or export it to a database that is large enough to hold all the data. As you are using an API, I think you are not able to do that here (rewrite all the code, maybe? increase memory?). I think this collectAsMap in the runAlgorithm method is a really bad thing in KMeans (https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dont_call_collect_on_a_very_large_rdd.html)...
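For the write-out option, a one-line sketch (the output path is hypothetical):
data.saveAsTextFile("hdfs:///tmp/kmeans_points")  # persist to storage instead of collecting to the driver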

Related

multiprocessing pool map incredibly slow running a Monte Carlo ensemble of python simulations

I have a python code that runs a 2D diffusion simulation for a set of parameters. I need to run the code many times, O(1000), like a Monte Carlo approach, using different parameter settings each time. In order to do this more quickly I want to use all the cores on my machine (or cluster), so that each core runs one instance of the code.
In the past I have done this successfully for serial fortran codes by writing a python wrapper that then used multiprocessing map (or starmap in the case of multiple arguments) to call the fortran code in an ensemble of simulations. It works very nicely in that you loop over the 1000 simulations, and the python wrapper farms out a new integration to a core as soon as it becomes free after completing a previous one.
However, when I now set this up in the same way to run multiple instances of my python (instead of fortran) code, I find it is incredibly slow, much slower than simply running the code 1000 times in serial on a single core. Using the system monitor I see that only one core is working at a time, and it never goes above 10-20% load, whereas I of course expected to see N cores running near 100% (as is the case when I farm out fortran jobs).
I thought it might be a write issue, so I checked the code carefully to ensure that all plotting is switched off; in fact there is no file/disk access at all, and I now merely have one print statement at the end for a final diagnostic.
The structure of my code is like this:
I have the main python code in toy_diffusion_2d.py, which takes a single argument, a dictionary with the run parameters in it:
def main(arg):
    loop over timesteps:
        calculate the simulation over a large grid
    print the result statistic
And then I wrote a "wrapper" script, where I import the main simulation code and try to run it in parallel:
from multiprocessing import Pool, cpu_count
import toy_diffusion_2d

# dummy list of arguments
par1 = [1, 2, 3]
par2 = [4, 5, 6]

# make a list of dictionaries to loop over, 3x3=9 simulations in all
arglist = [{"par1": p1, "par2": p2} for p1 in par1 for p2 in par2]

ncore = min(len(arglist), int(cpu_count()))
with Pool(processes=ncore) as p:
    p.map(toy_diffusion_2d.main, arglist)
The above is a shorter, paraphrased example; my actual codes are longer, so I have placed them here:
Main code: http://clima-dods.ictp.it/Users/tompkins/files/toy_diffusion_2d.py
You can run this with the default values like this:
python3 toy_diffusion_2d.py
Wrapper script: http://clima-dods.ictp.it/Users/tompkins/files/toy_diffusion_loop.py
You can run a 4 member ensemble like this:
python3 toy_diffusion_loop.py --diffK=[33000,37500] --tau_sub=[20,30]
(Note that the final statistic is slightly different on each run, even with the same values, as the model is stochastic; it is a version of the stochastic Allen-Cahn equations, in case anyone is interested, but uses a stupid explicit solver on the diffusion term.)
As I said, the second parallel code works, but it is reeeeeallly slow... as if it is constantly being gated.
I also tried using starmap, but that was no different; it is almost as if the desktop only allows one python interpreter to run at a time...? I have spent hours on it, and I'm almost at the point of rewriting the code in Fortran. I'm sure I'm just doing something really stupid that prevents parallel execution.
EDIT(1): this problem is occurring on
4.15.0-112-generic x86_64 GNU/Linux, with Python 3.6.9
In response to the comments: in fact, I also find it runs fine on my Mac laptop...
EDIT(2): so it seems my question was a bit of a duplicate of several other postings, apologies! As well as the useful links provided by Pavel, I also found this page very helpful: Importing scipy breaks multiprocessing support in Python. I'll edit the solution into the accepted answer below.
The code sample you provide works just fine on my macOS Catalina 10.15.6. I would guess you're using some Linux distribution where, according to this answer, importing numpy can meddle with core affinity because numpy is linked against the OpenBLAS library.
If your Unix supports the scheduler interface, something like this will work:
>>> import os
>>> os.sched_setaffinity(0, set(range(os.cpu_count())))
Another question with a good explanation of this problem can be found here, and the suggested solution is this:
os.system('taskset -cp 0-%d %s' % (ncore, os.getpid()))
inserted right before the multiprocessing call.
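Putting it together, a sketch of the wrapper with the affinity reset (assuming Linux and the same toy_diffusion_2d module as above):
import os
from multiprocessing import Pool, cpu_count

import toy_diffusion_2d  # importing this pulls in numpy, which may pin the affinity

if __name__ == "__main__":
    # undo any core pinning done by the OpenBLAS-linked numpy import
    os.sched_setaffinity(0, set(range(os.cpu_count())))
    arglist = [{"par1": p1, "par2": p2} for p1 in [1, 2, 3] for p2 in [4, 5, 6]]
    ncore = min(len(arglist), cpu_count())
    with Pool(processes=ncore) as p:
        p.map(toy_diffusion_2d.main, arglist)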

Why is Spark's show() function very slow?

I have
df.select("*").filter(df.itemid==itemid).show()
and that never terminated, however if I do
print df.select("*").filter(df.itemid==itemid)
It prints in less than a second. Why is this?
That's because select and filter are just building up the execution instructions; they aren't doing anything with the data. Then, when you call show, it actually executes those instructions. If it isn't terminating, I'd review the logs to see if there are any errors or connection issues. Or maybe the dataset is still too large - try only taking 5 rows to see if that comes back quickly.
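For instance, both of these cap the work the action triggers (a sketch, assuming the df and itemid from the question):
df.filter(df.itemid == itemid).show(5)         # computes only enough partitions to display 5 rows
rows = df.filter(df.itemid == itemid).take(5)  # same idea, returning the rows to the driver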
This usually happens if you don't have enough available memory on your computer. Free up some memory and try again.

Python script runs slower when executed multiple times

I have a python script with a normal runtime of ~90 seconds. However, when I change only minor things in it (like alternating the colors in my final pyplot figure) and execute it multiple times in quick succession, its runtime increases to close to 10 minutes.
Some bullet points of what I'm doing:
I'm not downloading anything, nor creating new files with my script.
I merely open some locally saved .dat-files using numpy.genfromtxt and crunch some numbers with them.
I transform my data into a rec-array and use indexing via array.columnname extensively.
For each file I loop over a range of criteria that basically constitute different maximum and minimum values for evaluation, and embedded in that I use an inner loop over the lines of the data arrays. A few if's here and there but nothing fancy, really.
I use the multiprocessing module as follows:
import multiprocessing
npro = multiprocessing.cpu_count() # Count the number of processors
pool = multiprocessing.Pool(processes=npro)
bigdata = list(pool.map(analyze, range(len(FileEndings))))
pool.close()
with analyze being my main function and FileEndings its input, a string, used to create the right name of the file I want to load and then evaluate. Afterwards, I use it a second time with
pool2 = multiprocessing.Pool(processes=npro)
listofaverages = list(pool2.map(averaging, range(8)))
pool2.close()
averaging being another function of mine.
I use numba's @jit decorator to speed up the basic calculations I do in my inner loops, with nogil, nopython, and cache all set to True. Commenting these out doesn't resolve the issue.
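For reference, a minimal sketch of that decorator configuration (the function body is a placeholder, not my real evaluation):
from numba import jit

@jit(nopython=True, nogil=True, cache=True)
def count_in_range(values, lower, upper):
    # stand-in for the real inner-loop evaluation over the data lines
    n = 0
    for v in values:
        if lower <= v <= upper:
            n += 1
    return n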
I run the script on Ubuntu 16.04 and am using a recent Anaconda build of Python.
I write the code in PyCharm and run it in its console most of the time. However, changing to bash doesn't help either.
Simply not running the script for about 3 minutes lets it go back to its normal runtime.
Using htop reveals that all processors are at full capacity when running. I am also seeing a lot of processes stemming from PyCharm (50 or so) that are each at equal MEM% of 7.9. The CPU% is at 0 for most of them, a few exceptions are in the range of several %.
Has anyone experienced such an issue before? And if so, any suggestions what might help? Or are any of the things I use simply prone to cause these problems?
This may be closed; the problem was caused by a malfunctioning fan in my machine.

Make python process writes be scheduled for writeback immediately without being marked dirty

We are building a python framework that captures data from a framegrabber card through a cffi interface. After some manipulation, we try to write RAW images (numpy arrays, using the tofile method) to disk at a rate of around 120 MB/s. We are well aware that our disks are capable of handling this throughput.
The problem we were experiencing was dropped frames, often entire seconds of data completely missing from the framegrabber output. What we found was that these frame drops were occurring when our Debian system hit the dirty_background_ratio set in sysctl. The system was then calling the background flush gang, which would choke up the framegrabber and cause it to skip frames.
Not surprisingly, setting dirty_background_ratio to 0% managed to get rid of the problem entirely. (It is worth noting that even small values like 1% and 2% still resulted in ~40% frame loss.)
So, my question is, is there any way to get this python process to write in such a way that it is immediately scheduled for writeout, bypassing the dirty buffer entirely?
Thanks
So here's one way I've managed to do it.
By using the numpy memmap object, you can instantiate an array that directly corresponds to a part of the disk. Calling the method flush(), or Python's del, causes the array to sync to disk, sidestepping the dirty-buffer problem. I've successfully written ~280 GB to disk at max throughput using this method.
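A minimal sketch of the approach (the path, dtype, and shape are hypothetical):
import numpy as np

# map an on-disk file as a writable array
mm = np.memmap("/data/frames.raw", dtype=np.uint16, mode="w+",
               shape=(1000, 1024, 1024))
frame = np.zeros((1024, 1024), dtype=np.uint16)  # stand-in for a captured frame
mm[0] = frame  # copy the frame into the mapped region
mm.flush()     # sync the dirty pages to disk now
del mm         # deleting the memmap also flushes it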
Will continue researching.
Another option is to get the OS file descriptor and call os.fsync on it. This will schedule it for writeback immediately.
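A sketch of that (the array and path here are stand-ins):
import os
import numpy as np

arr = np.zeros((1024, 1024), dtype=np.uint16)  # stand-in for a captured frame
with open("/data/frame.raw", "wb") as f:
    arr.tofile(f)
    f.flush()             # push Python's userspace buffer to the kernel
    os.fsync(f.fileno())  # ask the kernel to write the dirty pages out now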

Insight to np.linalg.lstsq and np.linalg.pinv limitations and failures during multiprocessing?

I am seeking insight into a vexing problem I have encountered using numpy in a multiprocessing environment. Initially, some weeks ago, "things did not work" at all; as a result of guidance from this site, the problem was traced to numpy needing to be recompiled to allow multiprocessing. This was done, and it appears at first blush to be functional, EXCEPT that now there are limits on the problem size that I can attack.
When numpy was used "as downloaded, as a binary distribution" in single-processor mode, I could solve large problems using linalg.pinv and linalg.svd without complaint (except for throughput). Now, relatively small problems fail. Specifically, when linalg.lstsq is called (which appears to invoke svd and/or pinv), the code expends the CPU effort (as shown by the process monitor), but it never executes the next statement after the linalg call. No exception is raised, no error indication given, no results generated; it just bails, and the process that contains the call is deemed done, and that is that.
These matrices are not that large: 8 x 4096 will solve and create results in a timely manner; 8 x 8192 fails with no error indication, no exception raised, and no results produced. The single-process version of numpy solved 12 x 50,000 without issue. My platform is a 12-core Mac running the most recent OS X (10.11.6) with 64 GB of physical RAM. Minimal memory pressure is put on the machine during these runs; there is plenty of headroom. Python is 2.7.11 under anaconda, and numpy was recompiled for multiprocessing using gcc.
I am running out of places to look; I have not found a numpy control that addresses this issue. Does anyone have any pointers as to where I should concentrate my search for a resolution?
This is a show-stopper for me and I have to assume that the underlying cause is simple, but perhaps not obvious. I am rather new to python, so I would appreciate any insight fellow over-flowers may have.