I'm using JupyterHub to share the computational power of a large machine among several users. The software that's primarily used is a Python script that calls into a sophisticated C/C++ library via ctypes. This code is not immune to memory problems and crashes.
My question is: if a low-level problem occurs for one user and their kernel hits, say, a segmentation fault, will that take down the main server and make all users lose their kernel state? Or does JupyterHub spawn a separate server for every user who logs in, so that such problems stay contained?
Even if you were using plain Jupyter Notebook instead of JupyterHub, each kernel is a separate process that runs independently of the notebook server. A crash in an individual kernel will not take down the notebook server.
Check out the architecture documentation. We've been running a setup with a single Jupyter Notebook instance (not even JupyterHub, because Windows :/) for about 3 years now. The only problems occurring are due to resource constraints (e.g. a single kernel taking up a lot of memory), but that's solvable on both the OS and organisational level.
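If you want to see the isolation for yourself, here is a small sketch that lists the kernel processes hanging off a notebook server using psutil; the process names ("jupyter-notebook", "jupyterhub-singleuser") are how they typically appear on Linux and may differ on your system. A segfault in one of the child kernel processes only kills that child.

```python
# Sketch: show that each kernel is a separate child process of the notebook
# server, so a crash in one kernel cannot take the server down with it.
# Assumes psutil is installed; process names are typical, not guaranteed.
import psutil

for proc in psutil.process_iter(["pid", "name", "cmdline"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "jupyter-notebook" in cmdline or "jupyterhub-singleuser" in cmdline:
        print("server: pid=%d" % proc.pid)
        for child in proc.children(recursive=True):
            # each running kernel shows up as its own ipykernel process
            print("  kernel: pid=%d cmd=%s" % (child.pid, " ".join(child.cmdline())))
```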
I made a very interesting observation while trying to tune my recommendation engine on a Tesla K80 housed on Google Cloud Platform. Unfortunately I have been unsuccessful at finding any literature which might point me in the right direction. Here is my predicament ...
I run the same code for fitting a fully connected model both as a Python script and in a Jupyter notebook. What is surprising is that, with the same hyper-parameters (batch size etc.), the code runs faster in the Jupyter notebook kernel and uses more GPU memory than when I fit the model with a Python script invoked from the shell.
Since I want my code to run in the least possible time, and Jupyter often ends up closing the connection when left unattended due to a WebSocket timeout, is there any way I can increase the amount of GPU memory a Python process can use? I'm open to any alternative way of making this work. Thanks in advance.
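A sketch of how the per-process GPU memory reservation can be set explicitly, assuming the model is built with Keras on a TensorFlow 1.x backend (the question does not name the framework, so treat this as an assumption):

```python
# Assumption: Keras with a TensorFlow 1.x backend. TensorFlow reserves GPU
# memory per process; these options make the reservation explicit instead
# of relying on the framework default.
import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
# claim up to ~95% of the card's memory for this one process ...
config.gpu_options.per_process_gpu_memory_fraction = 0.95
# ... or, alternatively, grow allocations on demand:
# config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))
```

As for the WebSocket timeouts, executing the notebook headlessly (for example with `jupyter nbconvert --to notebook --execute mynotebook.ipynb`) avoids keeping a browser connection open at all.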
Running big data in Jupyter Notebooks with Sci-Kit Learn
I usually try to segment my issues into specific component questions, so bear with me here! We are attempting to model some fairly sparse health conditions against pretty ordinary demographics. We have access to a lot of data (hundreds of millions of records), and we would like to get up to 20 million records into a Random Forest classifier. I have been told that scikit-learn should be able to handle this. We are running into issues, though: the sort that don't generate tracebacks, just dead processes.
I recognize that this is not much to go on, but what I'm looking for is any advice on how to scale this up, or debug the scaling process.
So we want to run genuinely big data through Jupyter notebooks and scikit-learn, primarily Random Forest.
Our baseline is an 8 GB Core i7 laptop running Ubuntu 14.04, with Jupyter on Python 3.5 and the notebook running Python 2.7 code via a conda env. (FWIW, we store the data in Google BigQuery and use the pandas BigQuery connector, which allows us to run the same code both on local machines and in the cloud.)
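A minimal sketch of that load path, assuming the pandas-gbq connector; the project ID, dataset, and table names below are made up:

```python
# Hypothetical BigQuery -> pandas load; requires the pandas-gbq package.
import pandas as pd

query = """
SELECT *
FROM demo_dataset.health_records
LIMIT 1000000
"""
df = pd.read_gbq(query, project_id="my-gcp-project")
# rough in-memory size of the resulting frame, in GB
print(df.shape, df.memory_usage(deep=True).sum() / 1e9, "GB")
```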
I can get a 100,000 record dataset consumed and model built with a bunch of diagnostic reports and charts that we have built into the notebook, no problem. The following scenarios all use the same code and the same versions of the respective Conda environments but several different pieces of hardware in an attempt to scale up our process.
When I expand the dataset to 1,000,000 records, I get about 95% of the way through the data load from BigQuery and then see a lot of pegged CPU activity and what appears to be memory-swap activity (viewed in the Linux process monitor). The entire machine seems to freeze, load progress from BigQuery appears to stall for an extended period, and the browser indicates that it has lost its connection to the kernel, sometimes with a "failed" status message and sometimes with the broken-link icon in the Jupyter notebook. Amazingly, it eventually completes normally, and the output is in fact fully rendered in the notebook.
I tried adding some memory-profiling code using psutil, and interestingly I saw no changes as the DataFrame went through several steps, including a fairly substantial get_dummies step that expands the categorical variables into separate columns (which would presumably increase memory usage). When doing this, however, after an extended time the memory-usage stats printed out but the model diagnostics never rendered in the browser (as they had previously); the terminal window indicated a WebSocket timeout and the browser froze. I finally killed the notebook from the terminal.
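For reference, a hedged sketch of the kind of psutil instrumentation described above; the toy DataFrame and column names are placeholders, not the actual code:

```python
import os

import numpy as np
import pandas as pd
import psutil

proc = psutil.Process(os.getpid())

def report(stage):
    # resident set size of this Python process, in GB
    print("{0}: {1:.2f} GB resident".format(stage, proc.memory_info().rss / 1e9))

# toy frame standing in for the BigQuery extract
df = pd.DataFrame({
    "age": np.random.randint(0, 100, 1000000),
    "region": np.random.choice(list("ABCDE"), 1000000),
})
report("after load")
df = pd.get_dummies(df, columns=["region"])
report("after get_dummies")
```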
Interestingly, my colleague can process 3,000,000 records successfully on his 16 GB MacBook.
When we try to process 4,000,000 records on a 30 GB VM on Google Compute Engine, we get an error indicating a MemoryError. So even a huge machine does not let us complete.
I realize we could use serially built models and/or Spark, but we would like to stick with scikit-learn and Random Forest and get past the memory issue if possible.
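Not an answer from the thread, but a hedged sketch of the usual memory-saving knobs that keep the workflow on RandomForestClassifier; the column names, thresholds, and toy data are examples only:

```python
# Sketch: shrink the in-memory footprint before handing data to the forest.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def shrink(df):
    """Downcast numerics and turn string columns into categories."""
    for col in df.select_dtypes(include=["int64"]).columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")
    for col in df.select_dtypes(include=["float64"]).columns:
        # scikit-learn's tree code works on float32 internally, so float32
        # input avoids an extra up-cast copy
        df[col] = df[col].astype(np.float32)
    for col in df.select_dtypes(include=["object"]).columns:
        df[col] = df[col].astype("category")
    return df

# toy stand-in for the BigQuery extract
df = pd.DataFrame({
    "age": np.random.randint(0, 100, 100000).astype("int64"),
    "bmi": np.random.rand(100000) * 40,
    "region": np.random.choice(list("ABCDE"), 100000),
    "condition": np.random.randint(0, 2, 100000),
})
df = shrink(df)

X = pd.get_dummies(df.drop("condition", axis=1))  # one-hot columns use a small dtype
y = df["condition"]

# capping tree depth bounds the size of each tree; n_jobs=1 avoids the data
# being duplicated per worker, which can happen in older scikit-learn/joblib
clf = RandomForestClassifier(n_estimators=100, max_depth=20, n_jobs=1)
clf.fit(X, y)
```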
We all know there are situations where you cannot go open source and freely distribute software, and I am in one of those situations.
I have an app that consists of a number of binaries (compiled from C sources) and Python code that wraps it all into a system. This app used to work as a cloud solution, so users had access to its functions over the network but no way to touch the actual server where the binaries and code are stored.
Now we want to deliver a "local" version of our system. The app will be running on PCs that our users physically own. We know that any protection can be broken, but we at least want to protect the app from copying and reverse engineering as much as possible.
I know that Docker is a wonderful deployment tool so I wonder: is it possible to create encrypted Docker containers where no one can see any data stored in the container's filesystem? Is there a known solution to this problem?
Also, maybe there are well known solutions not based on Docker?
The root user on the host machine (where the docker daemon runs) has full access to all the processes running on the host. That means the person who controls the host machine can always get access to the RAM of the application as well as the file system. That makes it impossible to hide a key for decrypting the file system or protecting RAM from debugging.
Using obfuscation on a standard Linux box, you can make it harder to read the file system and RAM, but you can't make it impossible, or the container could not run at all.
If you can control the hardware running the operating system, then you might want to look at the Trusted Platform Module which starts system verification as soon as the system boots. You could then theoretically do things before the root user has access to the system to hide keys and strongly encrypt file systems. Even then, given physical access to the machine, a determined attacker can always get the decrypted data.
What you are asking about is called obfuscation. It has nothing to do with Docker and is a very language-specific problem; for data you can always do whatever mangling you want, but while you can hope to discourage the attacker it will never be secure. Even state-of-the-art encryption schemes can't help since the program (which you provide) has to contain the key.
C is usually hard enough to reverse engineer; for Python you can try pyobfuscate and similar tools.
For data, I found this question (keywords: encrypting files game).
If you want a completely secure solution, you're searching for the 'holy grail' of confidentiality: homomorphic encryption. In short, you want to encrypt your application and data, send them to a PC, and have this PC run them without its owner, OS, or anyone else being able to snoop on the data.
Doing so without a massive performance penalty is an active research area. There has been at least one project that managed this, but it still has limitations:
It's Windows-only
The CPU has access to the key (i.e. you have to trust Intel)
It's optimised for cloud scenarios. If you want to install this on multiple PCs, you need to provide the key in a secure way (i.e. go there and type it in yourself) on one of the PCs where the application will be installed, and that PC must be able to securely propagate the key to the other PCs.
Andy's suggestion on using the TPM has similar implications to points 2 and 3.
Sounds like Docker is not the right tool, because it was never intended to be used as a full-blown sandbox (at least based on what I've been reading). Why aren't you using a more full-blown VirtualBox approach? At least then you're able to lock up the virtual machine behind logins (as much as a physical installation on someone else's computer can be locked up) and run it isolated, with encrypted filesystems and the whole nine yards.
You can either go lightweight and open, or fat and closed. I don't know that there's a "lightweight and closed" option.
I have exactly the same problem. What I have been able to discover so far is below.
A. Asylo (https://asylo.dev)
Asylo requires programs/algorithms to be written in C++.
The Asylo library is integrated with Docker, and it seems feasible to create a custom Docker image based on Asylo.
Asylo depends on several less common technologies like protocol buffers and Bazel. To me it seems that the learning curve will be steep, i.e. the person creating the Docker images/programs will need a lot of time to understand how to do it.
Asylo is free of charge
Asylo is brand new, with all the advantages and disadvantages that come with that.
Asylo is produced by Google but it is NOT an officially supported Google product according to the disclaimer on its page.
Asylo promises that data in the trusted environment can be protected even from a user with root privileges. However, there is a lack of documentation, and currently it is not clear how this could be implemented.
B. Scone (https://sconedocs.github.io)
It is bound to Intel SGX technology, but there is also a simulation mode (for development).
It is not free; only a small set of its features is available free of charge.
It seems to support a lot of security functionality.
It is easy to use.
They seem to have more documentation and instructions on how to build your own Docker image with their technology.
For the Python part, you might consider using PyInstaller. With appropriate options, it can pack your whole Python app into a single executable file that end users can run without having Python installed. It effectively bundles a Python interpreter with the packaged code, and it has a cipher option that allows you to encrypt the bytecode.
Yes, the key will be somewhere inside the executable, and a very savvy customer might have the means to extract it, thus unraveling a not-so-readable version of the code. It's up to you to decide whether your code contains some big secret you need to hide at all costs. I would probably not do it if I wanted to charge big money for fixing bugs in the deployed product. I could use it if the client has good compliance standards, is not a potential competitor, and is not expected to pay for more licenses.
While I've done this once, I honestly would avoid doing it again.
Regarding the C code: if you compile it into executables and/or shared libraries, those can be included in the executable generated by PyInstaller.
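A sketch of driving PyInstaller from Python for that kind of packaging; "wrapper.py", the shared library name, and the key are placeholders. Note that the --key bytecode cipher was available in older PyInstaller releases and has been removed in recent ones, so check your version.

```python
# Hypothetical packaging run; file names and key are placeholders.
import PyInstaller.__main__

PyInstaller.__main__.run([
    "wrapper.py",
    "--onefile",                         # single self-contained executable
    "--add-binary", "libengine.so:.",    # bundle the compiled C parts (hypothetical name)
    "--key", "not-a-real-secret",        # encrypt bytecode in the archive (older PyInstaller only)
])
```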
I'm working on an embedded device that runs Linux on ARM7 with 64 MB RAM and 64 MB storage (12 MB free). The device should be configured via the web, so it needs to run an embedded web server. Currently it uses Lighttpd and Lua, but I'm thinking about replacing Lua (or maybe even Lighttpd) with Python. The server will occasionally be accessed by one or two users to change internal settings of the C program that is running on Linux, so the server load isn't really a lot. I also need it to be open source software. Web.py seems to be small enough, but I still need to compile Python, which I haven't done before. So I'm wondering: what are the system requirements of Python? Lua seems to do quite well on small embedded systems, but I don't like its syntax for C bindings.
However, I couldn't find updated information about system requirements for embedding Python in such settings. This page from Michael Lauer seems to be old.
Any ideas? Suggestions? Hints? Links?
I'm working on this device using OpenWRT + Python:
http://wiki.openwrt.org/oldwiki/OpenWrtDocs/Hardware/Meraki/Mini
The first Python run is veeeeeery slow, but that's because it is byte-compiling all the .pyc files; after that it works well.
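One way to avoid that slow first run is to byte-compile the standard library (and your own packages) ahead of time, e.g. while building the firmware image; the paths below are examples and should be adjusted to where Python lives on the device.

```python
# Precompile .pyc files at image-build time so the device never pays the
# first-run compilation cost. Paths are examples only.
import compileall

compileall.compile_dir("/usr/lib/python2.7", force=True, quiet=True)
compileall.compile_dir("/root/myapp", force=True, quiet=True)
```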
I'm a complete novice in this area, so please excuse my ignorance.
I have three questions:
What's the best (fastest, easiest, headache-free) way of hosting a python program online?
I'm currently looking at Google App Engine and Web Frameworks for Python, but all the options are a bit overwhelming.
Which GUI/visualization libraries will transfer to a web-app environment without problems?
I'm willing to sacrifice some performance for the sake of simplicity.
(Google App Engine can't do C libraries, so this is causing a dilemma.)
Where can I learn more about running a program locally vs. having a program continuously run on a server and taking requests from multiple users?
Currently I have a working Python program that only uses standard Python libraries. It currently uses around 2.7 GB of RAM, but as I increase my dataset, I'm predicting it will use closer to 6 GB. I can run it on my personal machine, and everything is just peachy. I'd like to continue developing the front end on my home machine and implement the web app later.
Here is a relevant, previous post of mine.
Depending on your knowledge of server administration, you should consider a dedicated server. I was running some custom Python modules with NumPy, SciPy, Pandas, etc. on some data on a shared server with GoDaddy. One program I wrote took 120 seconds to complete; after we recently switched to a dedicated server, it now takes 2 seconds. The shared environment used CGI to run Python, and I installed mod_python on the dedicated server.
A dedicated server gives you COMPLETE control (including root access), which allows the compilation and/or installation of anything. It is a bit pricey, but if you're making money with your stuff it might be worth it.
Another option would be to use something like http://www.dyndns.com/ where you can host a domain on your own machine.
So with that said, perhaps some answers:
It depends on your requirements. ~4 GB of RAM might require a dedicated server. What you are asking for is not necessarily an easy task, so don't be afraid to get your hands dirty.
Not sure what you mean here.
A server is just a computer that responds to requests. On the dedicated server (which I keep mentioning) you are operating in a Unix (or Windows) environment just like you would locally. You use software (e.g. the Apache web server) to serve client requests. My vote is mod_python.
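A minimal handler sketch for the mod_python suggestion above, assuming mod_python is installed and Apache routes requests to it via a PythonHandler directive; the module and response text are placeholders:

```python
# Minimal mod_python handler; Apache maps requests here with
# "PythonHandler myhandler" (module name is hypothetical).
from mod_python import apache

def handler(req):
    req.content_type = "text/plain"
    req.write("served by mod_python")
    return apache.OK
```

mod_wsgi is the more commonly used Apache module for this kind of thing nowadays, but the structure of a handler is similar.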
Going with an Amazon EC2 instance is a greater headache than a dedicated server, but it should be much closer to your needs.
http://aws.amazon.com/ec2/#instance
Their extra-large instance should be more than enough for what you need to do, and you only turn the instance on when you need it, so you don't get the massive bill that comes with a similarly sized dedicated server.
There are some nice JavaScript-based visualization toolkits out there, so you can design your application to return raw (JSON) data and render it on the client.
I can mention d3.js http://mbostock.github.com/d3/ and the JavaScript InfoVis Toolkit http://thejit.org/
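A sketch of the "return raw JSON, render with d3.js on the client" pattern using only the standard library; the data, port, and endpoint are placeholders:

```python
# Tiny JSON endpoint a d3.js (or InfoVis) front end could fetch from.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class DataHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # placeholder payload; a real app would return its computed results
        payload = json.dumps({"points": [{"x": i, "y": i * i} for i in range(10)]})
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload.encode("utf-8"))

if __name__ == "__main__":
    HTTPServer(("", 8000), DataHandler).serve_forever()
```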