Package Python code dependencies for remote execution on the fly

My situation is as follows. We have:
an "image" with all of our software dependencies plus our in-house Python package
a "pod" (Kubernetes pod) into which that image is loaded on command
a Python project with some uncommitted code of its own that leverages the in-house package
Please also assume you cannot work on the machine (or cluster) directly (say, via a remote SSH interpreter). The cluster is multi-tenant and we want to utilize it as efficiently as possible, with no idle time between trials.
Also, for now, forget security issues: everything is locked down on our side, so there are no concerns there.
We essentially want to "distribute" the workload remotely, i.e. a script.py that is unfeasible to run on our local machines, without being constrained by git commits, so that we can do it "on the fly". This is necessary because all of the changes are experimental in nature (think ETL/pipeline-style analysis): we want to be able to experiment at scale but without being bound to git.
I tried dill but I could not manage to make it work (probably due to the structure of the code); a rough sketch of that route is shown below. Ideally, I would like to replicate the concept MLeap applies to ML pipelines for Spark, but on a much smaller scale: basically packaging, with little to no constraints.
What would be the preferred route for this use case?
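For illustration, the dill route mentioned above might look roughly like the following sketch (illustrative only; run_pipeline, the payload file name and the in-pod commands are hypothetical):

```python
# Rough sketch of the dill route (illustrative only; run_pipeline, the
# payload file name and the in-pod commands are hypothetical).
import dill

def run_pipeline(input_path):
    # ... experimental ETL logic that uses the in-house package ...
    return input_path

# On the local machine: serialize the entry point together with its closures.
with open("payload.pkl", "wb") as f:
    dill.dump(run_pipeline, f)

# Copy payload.pkl into the pod (e.g. with `kubectl cp`), then inside the pod:
#
#     import dill
#     with open("payload.pkl", "rb") as f:
#         run_pipeline = dill.load(f)
#     run_pipeline("/data/input.parquet")
```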

Related

Build Process Tool for Python

I am a Java developer writing (stand-alone) applications for Apache Spark. To create artifacts I use Gradle along with the ShadowJar plugin.
A few teammates want to use Python. Currently, they use JetBrains PyCharm to write these Python scripts and execute them remotely on the Spark cluster environment. However, this process does not scale well (what do we do when more than one file is involved?), and I am looking for a solution in the Python ecosystem. A problem is that neither I nor any of my team members is a Python expert (in fact, the other teammates are not developers but have to write code. Management decisions...), so we have no clue what best practice for Python development looks like.
I tried PyGradle, but it did not feel easy to integrate, especially with Apache Spark. I stumbled over names like pip, Pex, setuptools and virtualenv. What are these tools? How do they interact with each other?
To prevent the X-Y problem: I want a codebase which can be built, (unit-)tested and packaged with one command (like gradle build). The resulting artifact should be deployable and executable on a Spark cluster.
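For orientation, a minimal setuptools-based layout giving the "one command" build/test/package workflow described above might look like the sketch below (illustrative only; the project name myjob and its dependencies are hypothetical, and tools such as PyBuilder, Pex or Poetry can wrap this further):

```python
# setup.py -- minimal setuptools configuration (illustrative sketch)
from setuptools import setup, find_packages

setup(
    name="myjob",                          # hypothetical project name
    version="0.1.0",
    packages=find_packages(exclude=["tests"]),
    install_requires=["pyspark"],          # runtime dependencies
    python_requires=">=3.8",
)

# Typical workflow from the project root:
#   python -m pytest                       # run the unit tests
#   python setup.py bdist_egg              # build a deployable artifact
#   spark-submit --py-files dist/myjob-0.1.0-py3.8.egg main.py
```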
i am also new to this world and want to setup process in the AI startup. i think http://pybuilder.github.io/ is at least good startpoint for automation as i am trying to setup this among us.

Linking docker containers to combine different libraries

Docker containers can be linked. Most examples involve linking a Redis container with an SQL container. The beauty of linking containers is that you can keep the SQL environment separate from your Redis environment, and instead of building one monolithic image one can maintain two nicely separate ones.
I can see how this works for server applications (where communication happens over ports), but I have trouble replicating a similar approach for different libraries. As a concrete example, I'd like to use a container with IPython Notebook together with the C/C++ library Caffe (which exposes a Python interface through a package in one of its subfolders) and an optimisation library such as Ipopt. Containers for IPython and Caffe readily exist, and I am currently working on a separate image for Ipopt. Yet how do I link the three together without building one giant monolithic Dockerfile? Caffe, IPython and Ipopt each have a range of dependencies, making combined maintenance a real nightmare.
My view on Docker containers is that each container typically represents one process, e.g. Redis or nginx. Containers typically communicate with each other using networking or via shared files in volumes.
Each container runs its own operating system (typically specified in the FROM section of your Dockerfile). In your case, you are not running any specific processes; you simply wish to share libraries. This is not what Docker was designed for, and I am not even sure it is doable, but it certainly seems like a strange way of doing things.
My suggestion is therefore that you create a base image with the least common denominator (some of the shared libraries that are common to all other images) and that your other images use that image as the FROM-image.
Furthermore, if you need a more complex environment setup with lots of dependencies and heavy provisioning, I suggest that you take a look at provisioning tools such as Chef or Puppet.
Docker linking is about linking microservices, that is, separate processes, and has no relation to your question as far as I can see.
There is no out-of-the-box facility to compose separate Docker images into one container, which is what you call 'linking' in your question.
If you don't want that giant monolithic image, you might consider using provisioning tools like Puppet, Chef or Ansible together with Docker. One example is here. There you might, in theory, be able to reuse existing recipes/playbooks for the libraries you need. I would be surprised, though, if this approach turned out to be much easier than maintaining your "big monolithic" Dockerfile.

Encrypted and secure docker containers

We all know situations when you cannot go open source and freely distribute software - and I am in one of these situations.
I have an app that consists of a number of binaries (compiled from C sources) and Python code that wraps it all into a system. This app used to work as a cloud solution so users had access to app functions via network but no chance to touch the actual server where binaries and code are stored.
Now we want to deliver a "local" version of our system. The app will be running on PCs that our users physically own. We know that everything can be cracked, but we at least want to protect the app from copying and reverse-engineering as much as possible.
I know that Docker is a wonderful deployment tool so I wonder: is it possible to create encrypted Docker containers where no one can see any data stored in the container's filesystem? Is there a known solution to this problem?
Also, maybe there are well known solutions not based on Docker?
The root user on the host machine (where the docker daemon runs) has full access to all the processes running on the host. That means the person who controls the host machine can always get access to the RAM of the application as well as the file system. That makes it impossible to hide a key for decrypting the file system or protecting RAM from debugging.
Using obfuscation on a standard Linux box, you can make it harder to read the file system and RAM, but you can't make it impossible or the container cannot run.
If you can control the hardware running the operating system, then you might want to look at the Trusted Platform Module which starts system verification as soon as the system boots. You could then theoretically do things before the root user has access to the system to hide keys and strongly encrypt file systems. Even then, given physical access to the machine, a determined attacker can always get the decrypted data.
What you are asking about is called obfuscation. It has nothing to do with Docker and is a very language-specific problem; for data you can always do whatever mangling you want, but while you can hope to discourage the attacker it will never be secure. Even state-of-the-art encryption schemes can't help since the program (which you provide) has to contain the key.
C is usually hard enough to reverse engineer, for Python you can try pyobfuscate and similar.
For data, I found this question (keywords: encrypting files game).
If you want a completely secure solution, you're searching for the 'holy grail' of confidentiality: homomorphic encryption. In short, you want to encrypt your application and data, send them to a PC, and have that PC run them without its owner, its OS, or anyone else being able to snoop on the data.
Doing so without a massive performance penalty is an active research area. There has been at least one project that managed this, but it still has limitations:
It's Windows-only.
The CPU has access to the key (i.e., you have to trust Intel).
It's optimised for cloud scenarios. If you want to install this on multiple PCs, you need to provide the key in a secure way (i.e., go there and type it in yourself) to one of the PCs you're going to install your application on, and that PC should be able to securely propagate the key to the other PCs.
Andy's suggestion on using the TPM has similar implications to points 2 and 3.
It sounds like Docker is not the right tool, because it was never intended to be used as a full-blown sandbox (at least based on what I've been reading). Why not use a more full-blown VirtualBox approach? At least then you're able to lock the virtual machine up behind logins (as much as a physical installation on someone else's computer can be locked up) and run it isolated, with encrypted filesystems and the whole nine yards.
You can either go lightweight and open, or fat and closed. I don't know that there's a "lightweight and closed" option.
I have exactly the same problem. What I have been able to discover so far is below.
A. Asylo (https://asylo.dev)
Asylo requires programs/algorithms to be written in C++.
The Asylo library is integrated with Docker, and it seems feasible to create a custom Docker image based on Asylo.
Asylo depends on several less common technologies such as protocol buffers and Bazel. To me it seems the learning curve will be steep, i.e. the person creating the Docker images/programs will need a lot of time to understand how to do it.
Asylo is free of charge.
Asylo is brand new, with all the advantages and disadvantages that brings.
Asylo is produced by Google, but it is NOT an officially supported Google product according to the disclaimer on its page.
Asylo promises that data in the trusted environment can be protected even from users with root privileges. However, documentation is lacking and it is currently unclear how this should be implemented.
B. Scone (https://sconedocs.github.io)
It is tied to Intel SGX technology, but there is also a simulation mode (for development).
It is not free; only a small set of its features is available free of charge.
It seems to support a lot of security functionality.
It is easy to use.
They seem to have more documentation and instructions on how to build your own Docker image with their technology.
For the Python part, you might consider using PyInstaller. With appropriate options, it can pack your whole Python app into a single executable file, which end users can run without a Python installation. It effectively runs a Python interpreter on the packaged code, and it has a cipher option which allows you to encrypt the bytecode.
Yes, the key will be somewhere around the executable, and a very savvy customer might have the means to extract it, thereby unravelling the not-so-readable code. It's up to you to decide whether your code contains some big secret you need to hide at all costs. I would probably not do it if I wanted to charge big money for any bug fixing in the deployed product. I might use it if the client has good compliance standards and is not a potential competitor, nor expected to pay for more licenses.
While I've done this once, I honestly would avoid doing it again.
Regarding the C code: if you compile it into executables and/or shared libraries, these can be included in the executable generated by PyInstaller.
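For illustration, the Python wrapper might load such a bundled shared library via ctypes along these lines (a minimal sketch; the library name libcore.so and the process function are hypothetical, while sys._MEIPASS is the directory into which PyInstaller unpacks bundled files at runtime):

```python
import ctypes
import sys
from pathlib import Path

# When running as a PyInstaller bundle, bundled files live under sys._MEIPASS;
# otherwise fall back to the directory containing this script.
base = Path(getattr(sys, "_MEIPASS", Path(__file__).parent))

# Load the compiled C library that was bundled alongside the executable.
lib = ctypes.CDLL(str(base / "libcore.so"))  # hypothetical library name

# Declare the C function's signature before calling it.
lib.process.argtypes = [ctypes.c_char_p]
lib.process.restype = ctypes.c_int

def process(path: str) -> int:
    """Thin Python wrapper around the hypothetical C routine."""
    return lib.process(path.encode("utf-8"))
```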

Are all cluster computing libraries compatible with starcluster?

I have a highly threaded application running on Amazon EC2. I would like to convert this application into a cluster on EC2, and I would like to use StarCluster for this as it makes the cluster easy to manage.
However, I am new to cluster/distributed computing. After googling, I found the following list of Python libraries for cluster computing:
http://wiki.python.org/moin/ParallelProcessing (look at the cluster computing section)
I would like to know whether all of these libraries will work with StarCluster. Is there anything I need to keep in mind, such as a dependency, when choosing a library, given that I want the application to work with StarCluster?
Basically, StarCluster is a tool to help you manage your cluster. It can add/remove nodes, set them within a placement group and security group, register them with Open Grid Scheduler, and more. You can also easily create commands and plugins to help you in your work; a minimal plugin sketch is shown below.
How were you intending to use StarCluster?
If it's as a watcher to load balance your cluster then there shouldn't be any problems.
If it's as an actor (making it directly do the computation by launching it with a command you would craft yourself and parallelizing its execution among the cluster) then I don't know. It might be possible, but StarCluster was not designed for it. We can read from the website:
StarCluster has been designed to simplify the process of building, configuring, and managing clusters of virtual machines on Amazon’s EC2 cloud.
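For reference, a StarCluster plugin is just a small Python class; a minimal sketch along the lines of the ClusterSetup interface described in the StarCluster docs might look like this (the package installed here is purely illustrative):

```python
# Minimal StarCluster plugin sketch (assumes the ClusterSetup plugin
# interface from the StarCluster docs; the installed package is hypothetical).
from starcluster.clustersetup import ClusterSetup


class InstallDeps(ClusterSetup):
    def run(self, nodes, master, user, user_shell, volumes):
        # Called once the cluster is up; run a command on every node.
        for node in nodes:
            node.ssh.execute("pip install numpy")
```

Such a plugin would then be declared in the StarCluster config file and listed in the cluster template so it runs when the cluster starts.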

Python: tool to keep track of deployments

I'm looking for a tool to keep track of "what's running where". We have a bunch of servers, and on each of those a bunch of projects. These projects may be running on a specific version (hg tag/commit nr) and have their requirements at specific versions as well.
Fabric looks like a great start to do the actual deployments by automating the ssh part. However, once a deployment is done there is no overview of what was done.
Before reinventing the wheel I'd like to check here on SO as well (I did my best w/ Google but could be looking for the wrong keywords). Is there any such tool already?
(In practice I'm deploying Django projects, but I'm not sure that's relevant for the question; anything that keeps track of pip/virtualenv installs or server state in general should be fine)
many thanks,
Klaas
==========
EDIT FOR TEMP. SOLUTION
==========
For now, we've chosen to simply store this information in a simple key-value store (in our case: the filesystem) that we take great care to back up (in our case: using a DCVS). We keep track of this store with the same deployment tool that we use to do the actual deploys (in our case: fabric)
Passwords are stored inside a TrueCrypt volume that's stored inside our key-value store.
==========
I will still gladly accept any answer when some kind of Open Source solution to this problem pops up somewhere. I might share (part of) our solution somewhere myself in the near future.
pip freeze gives you a listing of all installed packages. Bonus: if you redirect the output to a file, you can use it as part of your deployment process to install all those packages (pip can programmatically install all packages from the file).
I see you're already using virtualenv. Good. You can run pip freeze -E myvirtualenv > myproject.reqs to generate a dependency file that doubles as a status report of the Python environment.
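Building on that, a minimal sketch of how the pip freeze output could be collected per host with Fabric and stored as a simple deployment record (using the Fabric 2 API; the hostnames and remote paths here are hypothetical):

```python
# Sketch: snapshot each host's Python environment into a per-host record
# file that can be committed or backed up. Hostnames and paths are hypothetical.
from pathlib import Path
from fabric import Connection

HOSTS = ["app1.example.com", "app2.example.com"]
RECORDS = Path("deploy_records")
RECORDS.mkdir(exist_ok=True)

for host in HOSTS:
    with Connection(host) as c:
        # Freeze the packages installed in the remote virtualenv.
        result = c.run("/srv/myproject/venv/bin/pip freeze", hide=True)
        (RECORDS / f"{host}.reqs").write_text(result.stdout)
```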
Perhaps you want something like Opscode Chef.
In their own words:
Chef works by allowing you to write recipes that describe how you want a part of your server (such as Apache, MySQL, or Hadoop) to be configured. These recipes describe a series of resources that should be in a particular state - for example, packages that should be installed, services that should be running, or files that should be written. We then make sure that each resource is properly configured, only taking corrective action when it's necessary. The result is a safe, flexible mechanism for making sure your servers are always running exactly how you want them to be.
EDIT: Note that Chef is not a Python tool; it is a general-purpose tool, written in Ruby (it seems). But it is capable of supporting various "cookbooks", including ones for installing/maintaining Python apps.
