Set maximum number of cores for Jupyter notebook - python

I share a computer with a colleague who runs parallelized calculations on it, and I need to run Jupyter. I am only allowed to use a few of the cores, not all of them.
However, every time I run a cell that uses numpy in Jupyter, it tries to use as many cores as possible, so while my colleague's calculation is running, Python grabs half of the cores.
I tried setting the niceness of the Jupyter process to 19, hoping that its Python child processes would inherit the value and stop trying to use all the cores, but it does not work.
Is there a way to limit Jupyter and its Python children to some maximum number of cores? I hope there is a variable for this limit.

Jupyter/notebook doesn't have any resource managers like that built in. Mostly that's because all of that stuff ended up in JupyterHub, which is another layer on top of the Jupyter architecture meant for making Jupyter play nicely with others in a multi-user environment. Which is pretty much where you're at.
JupyterHub does offer a way to set a hard limit on the number of cores it will use. See the JupyterHub documentation for details.
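For illustration, that limit lives in jupyterhub_config.py; a minimal sketch, with the caveat that cpu_limit is only actually enforced by spawners that support it (such as the Docker or Kubernetes spawners), not by the default LocalProcessSpawner:

    # jupyterhub_config.py -- sketch; enforcement depends on the spawner
    # (e.g. DockerSpawner or KubeSpawner honour these limits, while the
    # default LocalProcessSpawner does not enforce them).
    c.Spawner.cpu_limit = 4        # cap each single-user server at ~4 cores
    c.Spawner.mem_limit = '8G'     # optional: also cap memory per user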

Related

Run Python ZMQ Workers simultaneously

I am pretty new to Python and to distributed systems.
I am using the ZeroMQ Ventilator-Worker-Sink configuration:
Ventilator - Worker - Sink
Everything is working fine at the moment; my problem is that I need a lot of workers. Every worker does the same work.
At the moment every worker lives in its own Python file and has its own output console.
When the program changes, I have to change (or copy) the code in every file.
The next problem is that I have to start/run every file, and it is quite annoying to start 12 files.
What are the best solutions here? Threads, processes?
I should say that the goal is to run every worker on a different Raspberry Pi.
This appears to be more of a dev/ops problem. You have your worker code, which is presumably a single codebase, on multiple distributed machines or instances. You make a change to that codebase and you need the resulting code to be distributed to each instance, and then the process restarted.
To start, you should at minimum be using a source control system, like Git. With such a system you could at least go to each instance and pull the most recent commit and restart. Beyond that, you could set up a system like Ansible to go out and run those actions on each instance initiated from a single command.
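As a stopgap before adopting a full tool like Ansible, even a small fan-out script can do the pull-and-restart step. A minimal sketch, assuming key-based SSH access to each Pi and a git checkout of the worker at ~/worker (the hostnames, paths, and restart command here are all made-up placeholders):

    #!/usr/bin/env python
    """Sketch of a minimal fan-out deploy: pull the latest commit and
    restart the worker on each machine over SSH."""
    import subprocess

    HOSTS = ['pi01.local', 'pi02.local', 'pi03.local']  # hypothetical Pis

    # Pull the latest code, kill the old worker, start a new one detached.
    REMOTE_CMD = ('cd ~/worker && git pull && '
                  'pkill -f worker.py; '
                  'nohup python worker.py > /dev/null 2>&1 &')

    for host in HOSTS:
        subprocess.check_call(['ssh', host, REMOTE_CMD])
        print('updated %s' % host)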
There's a whole host of other tools, strategies and services that will help you do those things in a myriad of different ways. Using Docker to create a single worker container and then distribute and run that container on your various instances is probably one of the more popular ways to do what you're after, but it'll require a more fundamental change to your infrastructure.
Hope this helps.

Is JupyterHub kernel safe across users?

I'm using JupyterHub to share the computational power of a big computer among some users. The software that's primarily used is a Python script that calls into sophisticated C/C++ code via a ctypes extension. This code isn't invulnerable to memory problems and crashes.
My question is: if a low-level problem happens with one user and his kernel gets, say, a segmentation fault, will that crash the main server by design and cause all users to lose their kernel state? Or is it designed to create a new server for every user who logs in, so that such problems don't happen?
Even if you were using straight Jupyter Notebook instead of JupyterHub, each kernel is a single process that runs more or less independently of the notebook server. Crashes of individual kernels will not take down the notebook server.
Check out the architecture documentation. We've been running a setup with a single Jupyter Notebook instance (not even JupyterHub, because Windows :/) for about 3 years now. The only problems that occur are due to resource constraints (e.g. a single kernel takes up a lot of memory), but that's solvable at both the OS and the organisational level.
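If you want to see that isolation for yourself, you can deliberately crash one kernel from a notebook cell (destructive to that kernel only) and watch the server and everyone else's kernels carry on:

    # Run in a throwaway notebook: reading from address 0 segfaults this
    # kernel's process. The notebook server stays up, other kernels keep
    # their state, and Jupyter simply reports that this kernel died and
    # offers to restart it.
    import ctypes
    ctypes.string_at(0)  # NULL dereference -> SIGSEGV in this kernel only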

Scaling up big data in Jupyter and scikit-learn Random Forest

Running big data in Jupyter Notebooks with scikit-learn
I usually try to segment my issues into specific component questions, so bear with me here! We are attempting to model some fairly sparse health conditions against pretty ordinary demographics. We have access to a lot of data, hundreds of millions of records, and we would like to get up to 20 million records into a Random Forest classifier. I have been told that scikit-learn should be able to handle this. We are running into issues, the sort that don't generate tracebacks: just dead processes.
I recognize that this is not much to go on, but what I'm looking for is any advice on how to scale and/or debug this process.
So we want to run truly pretty big data through Jupyter notebooks and scikit-learn, primarily Random Forest.
The hardware is an 8 GB Core i7 notebook running Ubuntu 14.04, with Jupyter on Python 3.5 and the notebook running Python 2.7 code via a conda env. (FWIW, we store the data in Google BigQuery; we use the pandas BigQuery connector, which allows us to run the same code both on the local machines and in the cloud.)
I can get a 100,000-record dataset consumed and a model built, with a bunch of diagnostic reports and charts that we have built into the notebook, no problem. The following scenarios all use the same code and the same versions of the respective conda environments, but several different pieces of hardware, in an attempt to scale up our process.
When I expand the dataset to 1,000,000 records, I get about 95% of the way into the data load from BigQuery and then see a lot of pegged CPU activity and what seems to be memory-swap activity (viewing it in the Linux process monitor). The entire machine seems to freeze, load progress from BigQuery seems to stop for an extended period of time, and the browser indicates that it has lost the connection to the kernel, sometimes with a message indicating failed status and sometimes with the broken-link icon in the Jupyter notebook. Amazingly, it eventually completes normally, and the output in fact gets completely rendered in the notebook.
I tried to put some memory-profiling code in using psutil, and interestingly I saw no changes as the DataFrame went through several processing steps, including a fairly substantial get_dummies step, which expands the categorical variables into separate columns (and would presumably expand memory usage). But with that instrumentation in place, after an extended time the memory usage stats printed out, yet the model diagnostics never rendered in the browser (as they had previously); the terminal window indicated a web-socket timeout and the browser froze. I finally terminated the notebook terminal.
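For reference, a minimal sketch of what such psutil instrumentation might look like; the toy DataFrame below is a made-up stand-in for the real BigQuery load:

    # Sketch: sample this kernel's resident memory around expensive steps.
    import os
    import psutil
    import numpy as np
    import pandas as pd

    proc = psutil.Process(os.getpid())

    def rss_mb(label):
        # Resident set size of the kernel process, in megabytes.
        print('%-20s %8.1f MB' % (label, proc.memory_info().rss / 1e6))

    n = 10**6
    df = pd.DataFrame({'age': np.random.randint(18, 90, n),
                       'region': np.random.choice(list('ABCDEFGH'), n)})
    rss_mb('after load')

    df = pd.get_dummies(df, columns=['region'])  # one 0/1 column per level
    rss_mb('after get_dummies')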
Interestingly, my colleague can process 3,000,000 records successfully on his 16 GB MacBook.
When we try to process 4,000,000 records on a 30 GB VM on Google Compute Engine, we get an error indicating "Mem error" (presumably Python's MemoryError). So even a huge machine does not allow us to complete.
I realize we could build models serially and/or use Spark, but we wish to use scikit-learn and Random Forest and get through the memory issue if possible.

Distributing jobs over multiple servers using python

I currently have an executable that uses all the cores on my server when it runs. I want to add another server and have the jobs split between the two machines, with each job still using all the cores on the machine it runs on. If both machines are busy, I need the next job to queue until one of the two machines becomes free.
I thought this might be controlled with Python; however, I am a novice and not sure which Python package would be best for this problem.
I liked the heapq module for queuing the jobs, but it looks like it is designed for single-server use. I then looked into IPython.parallel, but it seemed more designed for creating a separate smaller job for every core (on one or more servers).
I saw a huge list of different options here (https://wiki.python.org/moin/ParallelProcessing), but I could do with some guidance on which way to go for a problem like this.
Can anyone suggest a package that may help with this problem, or a different way of approaching it?
Celery does exactly what you want: it makes it easy to distribute a task queue across multiple (many) machines.
See the Celery tutorial to get started.
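A minimal sketch of how that could look for your case (the Redis broker URL and the executable path are made-up placeholders):

    # tasks.py -- sketch; assumes a Redis broker reachable by both machines.
    import subprocess
    from celery import Celery

    app = Celery('tasks', broker='redis://broker-host:6379/0')

    @app.task
    def run_job(args):
        # The executable saturates all cores by itself, so Celery only
        # needs to hand each machine one job at a time.
        return subprocess.call(['/path/to/executable'] + list(args))

Start one worker per machine with celery -A tasks worker --concurrency=1, so each machine takes a single job at a time; jobs submitted with run_job.delay([...]) while both machines are busy simply wait in the broker's queue until a worker frees up, which is exactly the queuing behaviour you describe.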
Alternatively, IPython has its own parallel-computing library built in, based on ZeroMQ; see the introduction. I have not used it before, but it looks pretty straightforward.
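For completeness, the IPython route would be a load-balanced view, which also queues work until an engine is free. A rough sketch, assuming one engine has been started per machine (in older IPython the module is IPython.parallel rather than ipyparallel; the executable path and job arguments are placeholders):

    # Sketch: assumes a controller is running and one engine per machine
    # has been started via ipcluster/ipengine.
    from ipyparallel import Client

    def run_job(args):
        # Imported inside the function so it resolves on the engines.
        import subprocess
        return subprocess.call(['/path/to/executable'] + list(args))

    rc = Client()
    view = rc.load_balanced_view()

    # Hypothetical argument lists; excess jobs queue until an engine frees up.
    jobs = [['--input', 'a'], ['--input', 'b'], ['--input', 'c']]
    results = [view.apply_async(run_job, j) for j in jobs]
    print([r.get() for r in results])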

Python multithreading: how is it using multiple cores?

I am running a multithreaded application (Python 2.7.3) on an Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz. I thought it would be using only one core, but using the "top" command I see that the Python processes are constantly changing core number. Enabling "show threads" in top shows different threads working on different cores.
Can anyone please explain this? It is bothering me, as I know from theory that Python multithreading is executed on a single core.
First off, multithreading means the opposite, namely that multiple cores are being utilized (via threading) at the same time. CPython is indeed crippled when it comes to this, though whenever you call into C code (this includes parts of the standard library, but also extension modules like numpy) the Global Interpreter Lock (GIL), which prevents concurrent execution of Python code, may be released. You can still have multiple threads; they just won't be interpreting Python at the same time (instead, they'll take turns quite frequently). You also speak of "Python processes": are you confusing terminology, or is this "multithreaded" Python application in fact multiprocessing? Multiple Python processes can, of course, run concurrently.
However, from your wording I suspect another source of confusion. Even a single thread can run on multiple cores... just not at the same time. It is up to the operating system which thread runs on which CPU, and the OS scheduler does not necessarily re-assign a thread to the same CPU it ran on before it was suspended (that's beneficial, as David Schwartz notes in the comments, but not vital). That is, it's perfectly normal for a single thread/process to hop from CPU to CPU.
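A quick way to see the GIL effect for yourself: in the sketch below, CPU-bound pure-Python work gains nothing from a second thread, because only one thread may execute Python bytecode at a time, even though the OS may well schedule the two threads on different cores while they take turns.

    # Sketch: CPU-bound pure-Python work does not speed up with threads.
    import time
    import threading

    def burn(n=10**7):
        # Pure-Python busy loop; never releases the GIL for long.
        while n:
            n -= 1

    start = time.time()
    burn(); burn()
    print('sequential: %.2fs' % (time.time() - start))

    start = time.time()
    threads = [threading.Thread(target=burn) for _ in range(2)]
    for t in threads: t.start()
    for t in threads: t.join()
    print('2 threads:  %.2fs' % (time.time() - start))  # about the same, or worse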
Threads are designed to take advantage of multiple cores when they are available. If you only have one core, they'll run on one core too. :-)
There's nothing to be concerned about, what you observe is "working as intended".
