Running big data in Jupyter Notebooks with scikit-learn
I usually try to segment my issues into specific component questions, so bear with me here! We are attempting to model some fairly sparse health conditions against pretty ordinary demographics. We have access to a lot of data (hundreds of millions of records), and we would like to get up to 20 million records into a Random Forest classifier. I have been told that scikit-learn should be able to handle this. We are running into issues, the sort that don't generate tracebacks! Just dead processes.
I recognize that this is not much to go on, but what I'm looking for is any advice on how to scale this process up, or how to debug it as we do.
So we want to run genuinely pretty big data through Jupyter notebooks and scikit-learn, primarily Random Forest.
This is all on an 8 GB Core i7 notebook running Linux 14.04, with Jupyter on Python 3.5 and the notebook running Python 2.7 code via a conda env. (FWIW we store the data in Google BigQuery and use the pandas BigQuery connector, which allows us to run the same code both on the local machines and in the cloud.)
I can get a 100,000-record dataset consumed and a model built, with a bunch of diagnostic reports and charts that we have built into the notebook, no problem. The following scenarios all use the same code and the same versions of the respective conda environments, but several different pieces of hardware, in an attempt to scale up our process.
When I expand the dataset to 1,000,000 records, I get about 95% of the way through the data load from BigQuery and then see a lot of pegged CPU activity and what seems to be memory-swap activity (viewed in the Linux process monitor). The entire machine seems to freeze: load progress from BigQuery appears to stop for an extended period, and the browser indicates that it has lost connection to the kernel, sometimes with a failed-status message in the browser and sometimes with the broken-link icon on the Jupyter notebook. Amazingly, it eventually completes normally, and the output does in fact get fully rendered in the notebook.
I tried to put some memory-profiling code in using psutil, and interestingly I saw no change as the DataFrame went through several steps, including a fairly substantial get_dummies step that expands the categorical variables into separate columns (which would presumably increase memory usage). But this time, after an extended wait, the memory-usage stats printed out while the model diagnostics never rendered in the browser (as they had previously); the terminal window indicated a web socket timeout and the browser froze. I finally terminated the notebook from the terminal.
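As an illustration, here is a minimal sketch of that kind of psutil instrumentation around the get_dummies step, using a tiny stand-in frame and hypothetical column names in place of the real BigQuery load:

```python
import os

import pandas as pd
import psutil

proc = psutil.Process(os.getpid())

def mem_mb():
    # Resident set size of the current kernel process, in MB.
    return proc.memory_info().rss / 1024 ** 2

# Tiny stand-in frame; in the real notebook `df` comes from the BigQuery pull.
df = pd.DataFrame({"age": [34, 61, 47] * 100000,
                   "gender": ["F", "M", "F"] * 100000,
                   "region": ["NE", "SW", "NW"] * 100000})

print("before get_dummies: %.0f MB" % mem_mb())
df = pd.get_dummies(df, columns=["gender", "region"])
print("after get_dummies:  %.0f MB" % mem_mb())

# DataFrame-level accounting can reveal growth that process RSS alone hides,
# e.g. when the allocator holds on to pages that pandas has already freed.
print("frame itself: %.0f MB" % (df.memory_usage(deep=True).sum() / 1024 ** 2))
```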
Interestingly, my colleague can process 3,000,000 records successfully on his 16 GB MacBook.
When we try to process 4,000,000 records on a 30 GB VM on Google Compute Engine, we get a MemoryError. So even a huge machine does not let us complete.
I realize we could use serially built models and/or Spark, but we wish to stick with scikit-learn and Random Forest and get past the memory issue if possible.
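Before giving up on a single-machine fit, one memory-reduction angle worth trying: scikit-learn's tree ensembles accept scipy sparse input, so a sparse one-hot encoding avoids the dense blow-up that get_dummies causes. A sketch, assuming a reasonably recent scikit-learn (ColumnTransformer and a string-aware OneHotEncoder), with a CSV read and hypothetical column names standing in for the BigQuery pull:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# Hypothetical file and column names; substitute the real BigQuery load.
df = pd.read_csv("demographics_sample.csv")
y = df.pop("condition")

cat_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()
num_cols = [c for c in df.columns if c not in cat_cols]

# OneHotEncoder emits a scipy sparse matrix, unlike get_dummies, which
# materialises every dummy column densely.
pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
     ("num", "passthrough", num_cols)],
    sparse_threshold=1.0,  # keep the stacked output sparse
)

X = pre.fit_transform(df)  # scipy sparse matrix, far smaller than a dense array

# Random forests in scikit-learn accept sparse matrices directly.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf.fit(X, y)
```

Downcasting float64 columns to float32 and 64-bit integers to smaller integer types before encoding also roughly halves the numeric footprint.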
Related
I am working on a project with some ML models. I worked on it during the summer and have returned to it recently. For some reason, it is a lot slower to train and test now than it was then; I think the problem is that Python is suddenly not using all of my system resources.
I am working on a project where I am building, training, and testing some ML models. I am using the sktime package in a Python 3.7 conda environment with Jupyter Notebook.
I first started working on this project in the summer, and when I was building the models, I timed how long the training process took. Resuming the project now, I tried training the exact same model with the exact same data again, and it took around 6 hours this time compared to 76 minutes when I trained it during the summer. I have noticed that running inference on the test set also takes longer.
I am running on a 10-core M1 Max with 64 GB of RAM. I can tell that my computer is barely breaking a sweat right now: Activity Monitor says that Python is using 99.9% CPU, but that overall the user is only using around 11.6% of the CPU. I remember my computer working a little harder when I was on this project during the summer (the fan would actually turn on and the computer would get hot, which is not happening now), so I have a feeling that my environment is not using the full system resources available to it. I have checked, and my RAM limit in Jupyter is sufficient, so that is not the issue. I am very confused about what could have changed in this environment between the summer and now to cause this problem. Any help would be much appreciated.
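A quick diagnostic sketch for this kind of "Python is only using one core" symptom; it assumes threadpoolctl is installed (it is not part of sktime itself) and simply reports what the process can see:

```python
import os

import joblib
from threadpoolctl import threadpool_info  # pip install threadpoolctl

print("logical CPUs visible to Python:", os.cpu_count())
print("CPUs joblib would use by default:", joblib.cpu_count())

# Each entry is a BLAS/OpenMP thread pool loaded in this process and the
# number of threads it is currently allowed to use. A value of 1 here
# (e.g. an OMP_NUM_THREADS=1 that leaked into the conda env, or a BLAS
# backend swapped out by an update) would explain a quiet, single-core run.
for pool in threadpool_info():
    print(pool["internal_api"], pool["num_threads"])

# Any thread-count variables already set in the environment.
print({k: v for k, v in os.environ.items() if k.endswith("_NUM_THREADS")})
```

It is also worth double-checking that any n_jobs argument on the estimator did not change between the two runs.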
I share a computer with a colleague. They are running parallelized calculations on it, and I need to run Jupyter. I may use only a few cores, not all of them.
However, every time I run a cell which uses numpy in Jupyter, it tries to use as many cores as possible. While my colleague's calculation is running, Python takes half of the cores.
I tried setting the niceness of the Jupyter process to 19, so that its Python child processes would inherit the niceness value and not try to use all cores, but it does not work.
Is there a way to limit Jupyter and its Python children to some maximum number of cores? I hope there is a variable for this limit.
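Capping the BLAS/OpenMP thread pools that numpy calls into is one common way to keep a numpy-heavy kernel to a few cores; a minimal sketch, assuming an OpenBLAS or MKL build of numpy:

```python
import os

# These must be set before numpy (and its BLAS backend) is first imported,
# e.g. in the very first notebook cell or in the kernel's environment.
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["OPENBLAS_NUM_THREADS"] = "4"
os.environ["MKL_NUM_THREADS"] = "4"

import numpy as np  # imported only after the caps are in place

# Alternatively, threadpoolctl can throttle an already-imported numpy
# for a specific block of work:
from threadpoolctl import threadpool_limits

a = np.random.rand(2000, 2000)
with threadpool_limits(limits=4):
    np.linalg.svd(a)  # runs on at most 4 BLAS threads
```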
Jupyter Notebook doesn't have any resource managers like that built in. That is mostly because all of that functionality ended up in JupyterHub, which is another layer on top of the Jupyter architecture meant for making Jupyter play nicely with others in a multi-user environment, which is pretty much where you're at.
JupyterHub does offer a way to set a hard limit on the number of cores it will use. See here for details.
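For reference, a minimal jupyterhub_config.py sketch; whether the limit is actually enforced depends on the spawner in use (container- and systemd-based spawners honour it, the default LocalProcessSpawner does not):

```python
# jupyterhub_config.py -- `c` is provided by JupyterHub's config loader.
c.Spawner.cpu_limit = 2        # at most 2 CPUs per single-user server
c.Spawner.mem_limit = "8G"     # optional companion memory cap
```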
I made a very interesting observation while trying to tune my recommendation engine on a Tesla K80 housed on Google Cloud Platform. Unfortunately, I have been unsuccessful at finding any literature that might point me in the right direction. Here is my predicament...
I run the same code for fitting a fully connected model from both a Python script and a Jupyter notebook. What is surprising is that, with the same hyperparameters (batch size, etc.), the code runs faster using the Jupyter notebook kernel and uses more memory on the GPU than when I fit the model with the Python code invoked from the shell.
Since I want my code to run in the least possible time, and Jupyter often ends up closing the connection when left unattended due to a web socket timeout, is there any way I can increase the amount of memory a Python process can use on the GPU? I am open to any alternative way of making this work. Thanks in advance.
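If the model is built with TensorFlow/Keras (the framework is not actually stated here), per-process GPU memory allocation is controlled with the options below; a sketch against the TF 2.x API:

```python
import tensorflow as tf

# Assumption: a TensorFlow/Keras model; other frameworks have their own knobs.
# Both options must run before the first GPU op initialises the device.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Option 1: allocate GPU memory on demand instead of reserving it all up front.
    tf.config.experimental.set_memory_growth(gpus[0], True)

    # Option 2 (use instead of option 1): pin an explicit per-process cap, in MB.
    # tf.config.set_logical_device_configuration(
    #     gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=10240)])
```

On the older TF 1.x API the equivalent knob is tf.GPUOptions(per_process_gpu_memory_fraction=...) passed via tf.ConfigProto to the session.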
I'm using JupyterHub to share the computational power of a big computer among several users. The software that's primarily used is a Python script that drives a sophisticated C/C++ code through a ctypes extension. This code isn't invulnerable to memory problems and crashes.
My question is: if a low-level problem happens for one user and their kernel gets, say, a segmentation fault, will that crash the main server by design and cause all users to lose their kernel state? Or is it designed to create a new server for every user that logs in, so that such problems don't happen?
Even if you were using a straight Jupyter Notebook instead of JupyterHub, each kernel is a single process that runs more or less independently of the notebook server. Crashes of individual kernels will not take down the notebook server.
Check out the architecture documentation. We've been running a setup with a single Jupyter Notebook instance (not even JupyterHub, because Windows :/) for about 3 years now. The only problems occurring are due to resource constraints (e.g. a single kernel taking up a lot of memory), but that's solvable at both the OS and the organisational level.
I just wanted to educate myself more on Spark, so I wanted to ask this question.
I currently have Spark installed on my local machine. It's a Mac with 16 GB.
I have connected a Jupyter notebook running with PySpark.
So now when I do any coding in that notebook, like reading the data and converting it into a Spark DataFrame, I wanted to check:
1) Where is the dataset distributed on the local machine? Does it distribute the dataset across different cores of the CPU, etc.? Is there a way to find that out?
2) Is running the code and computation in a plain Jupyter notebook without Spark different from running a Jupyter notebook with PySpark? That is, does the first just use one core of the machine on a single thread, while a Jupyter notebook with PySpark runs the code and computation on different cores of the CPU with multi-threading/processing? Is this understanding correct?
Is there a way to check these?
Thanks
Jupyter has three main parts: the Jupyter Notebook, Jupyter clients, and kernels. http://jupyter.readthedocs.io/en/latest/architecture/how_jupyter_ipython_work.html
Here is a short note on kernels from the Jupyter homepage.
Kernels are processes that run interactive code in a particular programming language and return output to the user. Kernels also respond to tab completion and introspection requests.
Jupyter's job is to communicate between the kernel (be it a Python kernel, a Spark kernel, ...) and the web interface (your notebook). So under the hood Spark is running drivers and executors for you; Jupyter helps in communicating with the Spark driver.
1) Where is the dataset distributed on the local machine? Does it distribute the dataset across different cores of the CPU, etc.? Is there a way to find that out?
Spark will spawn n executors (the processes responsible for executing tasks), as specified with --num-executors; the executors are managed by a Spark driver (the program/process responsible for running the job on the Spark engine). So the number of executors is specified when you run a Spark program, and you will find this in the kernel's conf directory.
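A small sketch of how to inspect this for a local PySpark session from the notebook (the master URL and the row count are just examples):

```python
from pyspark.sql import SparkSession

# "local[*]" uses every core on the machine; "local[4]" would pin it to four.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("jupyter-pyspark-check")
         .getOrCreate())
sc = spark.sparkContext

print("master:", sc.master)                           # e.g. local[*]
print("default parallelism:", sc.defaultParallelism)  # usually the local core count

df = spark.range(1000000)
print("partitions:", df.rdd.getNumPartitions())       # how the data is split across cores
```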
2) Is running the code and computation in a plain Jupyter notebook without Spark different from running a Jupyter notebook with PySpark? That is, does the first just use one core of the machine on a single thread, while a Jupyter notebook with PySpark runs the code and computation on different cores of the CPU with multi-threading/processing? Is this understanding correct?
Yes. As I explained, Jupyter is merely an interface that lets you run code; under the hood, Jupyter connects to kernels, be they plain Python or Apache Spark.
Spark has its own good UI for monitoring jobs; by default it runs on the driver on port 4040. In your case it will be available at http://localhost:4040.