Docker containers can be linked. Most examples involve linking a Redis container with an SQL container. The beauty of linking containers is that you can keep the SQL environment separate from your Redis environment, and instead of building one monolithic image one can maintain two nicely separate ones.
I can see how this works for server applications (where the communication goes through ports), but I have trouble replicating a similar approach for different libraries. As a concrete example, I'd like to use a container with IPython Notebook together with the C/C++ library Caffe (which exposes a Python interface through a package in one of its subfolders) and an optimisation library such as Ipopt. Containers for IPython and Caffe readily exist, and I am currently working on a separate image for Ipopt. Yet how do I link the three together without building one giant monolithic Dockerfile? Caffe, IPython and Ipopt each have a range of dependencies, which makes combined maintenance a real nightmare.
My view on Docker containers is that each container typically represents one process, e.g. Redis or nginx. Containers typically communicate with each other over the network or via shared files in volumes.
Each container ships its own userland (typically specified by the FROM line in your Dockerfile), although all containers share the host kernel. In your case, you are not running any specific processes; you simply wish to share libraries. This is not what Docker was designed for, and I am not even sure it is doable, but it certainly seems like a strange way of doing things.
My suggestion is therefore that you create a base image with the least common denominator (the shared libraries that are common to all the other images) and have your other images use that image as their FROM image.
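A minimal sketch of that layout could look like the following; the image names, base distribution and package choices are hypothetical placeholders, not taken from your setup:

# Dockerfile for the shared base image (e.g. pushed as myrepo/science-base)
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential cmake python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*
RUN pip3 install numpy scipy

# Dockerfile for one of the specialised images, e.g. the Caffe one
FROM myrepo/science-base:latest
RUN apt-get update && apt-get install -y libboost-all-dev libprotobuf-dev
# ... build and install Caffe on top of the shared layers ...

Each of the Caffe, IPython and Ipopt images then only adds its own dependencies on top of the shared base, and because they start from the same image, Docker reuses those layers instead of duplicating them.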
Furthermore, if you need a more complex environment setup with lots of dependencies and heavy provisioning, I suggest you take a look at other provisioning tools such as Chef or Puppet.
Docker linking is about linking microservices, that is, separate processes, and as far as I can see it has no relation to your question.
There is no out-of-the-box facility to compose separate Docker images into one container in the way you describe as 'linking' in your question.
If you don't want that giant monolithic image, you might consider using provisioning tools such as Puppet, Chef or Ansible together with Docker. One example here. There you might, in theory, make use of existing recipes/playbooks for the libraries you need. I would be surprised, though, if this approach turned out to be much easier than maintaining your "big monolithic" Dockerfile.
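For what it's worth, one common way to combine the two (sketched here with hypothetical playbook names) is to run Ansible in local mode during the image build, so existing playbooks for Caffe, Ipopt and friends can be reused:

# hypothetical Dockerfile that provisions the image with Ansible playbooks
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y ansible
COPY playbooks/ /tmp/playbooks/
# apply the playbooks to the container itself via the local connection
RUN ansible-playbook -i "localhost," -c local /tmp/playbooks/caffe.yml && \
    ansible-playbook -i "localhost," -c local /tmp/playbooks/ipopt.yml

You still end up with one image per combination of libraries, so this mainly moves the maintenance burden from Dockerfile syntax to playbooks rather than removing it.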
Related
I am working on a project that's evolved from one Dockerfile supporting several apps to one Dockerfile per app.
This generally works better than having them all together in one, but I'd like to share one Python library file among the apps without duplicating it.
I don't see a good way to do this, at least with the structure as currently set up: all apps have individual Bitbucket repos.
I don't think it's worth it to change the repo structure just for this, but is there some easier way I'm missing?
You could create one base Dockerfile containing the pieces that all your applications share.
To reuse the base image you have to push it to your registry. Then you can use this base image in your application images:
FROM yourrepo/baseimage:latest
The main issue with this solution is that the image is not updated at runtime, so if you update the base image you have to rebuild all your application images.
You should therefore use a CI/CD pipeline once the number of application images grows.
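As a sketch, the base image itself can carry little more than the shared library; the file names and paths below are hypothetical:

# Dockerfile for yourrepo/baseimage -- holds only the shared Python code
FROM python:3.11-slim
# put the shared library somewhere every application image will inherit
COPY shared_lib.py /usr/local/lib/shared/shared_lib.py
ENV PYTHONPATH=/usr/local/lib/shared

Each application image then starts FROM yourrepo/baseimage:latest and only adds its own code; when the shared library changes, you rebuild and push the base image once, and the pipeline rebuilds the application images on top of it.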
A bioinformatics protocol was developed, and we would like to dockerize it to make it easier for others to use. It consists of two pieces of software and several Python scripts that prepare and parse data. Locally, we run these modules on a cluster as several dependent jobs (some requiring substantial resources) with a wrapper script, like this:
1. parsing input data (Python)
2. running some 10-100 jobs on a cluster (each on a piece of the output of step 1). Each sub-step's jobs depend on the previous one finishing, and involve:
   a) a compiled C++ program on each piece from step 1
   b) a parsing Python script on each piece from 2a
   c) another, resource-intensive compiled C++ program, which uses mpirun to distribute work over all the output of 2b
3. finalizing the results (Python script) on all results from step 2
The dockerized version does not necessarily need to be organized in the same manner, but at least step 2c needs to be distributed with mpirun, because users will run it on a cluster.
How could I organize this? Should I have X different containers in a workflow? Is there any other possible solution that does not involve multiple containers?
Thank you!
PS: I hope I described it clearly enough, but I can clarify further if needed.
I think that in your project it is important to differentiate between Docker images and Docker containers.
In a Docker image, you package your code and the dependencies needed to make it work. The first question is whether all your code should be in the same image: you have Python scripts and C++ software, so it could make sense to have several images, each capable of running one job of your process.
A Docker container is a running instance of a Docker image. So if you decide to have several images, you will have several containers running during your process. If you decide to have only one image, then you can run everything in one container by running your wrapper script inside it, or you could have a new wrapper script that instantiates a Docker container for each step. The latter could be interesting, as you seem to use different hardware depending on the step.
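To make the "several images, each running one job" option concrete, a minimal sketch could look like this; the base images, paths and binary names are hypothetical placeholders for your actual code:

# image for the Python parsing/finalising steps (1, 2b, 3)
FROM python:3.11-slim
COPY scripts/ /opt/pipeline/scripts/
ENTRYPOINT ["python", "/opt/pipeline/scripts/run_step.py"]

# image for the compiled C++ steps (2a, 2c); the mpirun launch itself is left to your wrapper
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y openmpi-bin libopenmpi-dev
COPY bin/solver /usr/local/bin/solver
ENTRYPOINT ["/usr/local/bin/solver"]

A wrapper (running on the host, or as its own small container) can then start one container per piece of data with docker run, passing inputs and outputs through mounted volumes.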
I can't give specifics about mpirun, as I'm not familiar with it.
My situation is as follows. We have:
an "image" with all our dependencies in terms of software + our in-house python package
a "pod" into which this image is loaded on command (a Kubernetes pod)
a Python project with some uncommitted code of its own that leverages the in-house package
Also, please assume you cannot work on the machine (or cluster) directly (say, via a remote SSH interpreter). The cluster is multi-tenant and we want to use it as efficiently as possible, so no idle time between trials.
Also, for now, forget about security issues; everything is locked down on our side, so there are no issues there.
We essentially want to "distribute" the workload remotely, i.e. a script.py that is unfeasible to run on our local machines, without being constrained by git commits, and therefore be able to do it "on the fly". This is necessary because all of the changes are experimental in nature (think ETL/pipeline kinds of analysis): we want to be able to experiment at scale, but with no bounds with regard to git.
I tried dill, but I could not manage to make it work (probably due to the structure of the code). Ideally, I would like to replicate the concept MLeap applied to ML pipelines for Spark, but on a much smaller scale: basically packaging, but with little to no constraints.
What would be the preferred route for this use case?
The solution I am working on consists of modules that are decoupled across multiple virtual server instances. All of the modules require exactly the same DTO (data transfer object) classes. Currently I package the DTOs into a library and deploy it to all server modules; when a change is made to the DTOs, I have to redeploy the library to all the modules to ensure they are using the latest version.
Are there any technologies or concepts available to share class definitions across multiple server instances without having to redeploy the library manually each time a change occurs?
The only way to share modules across servers without redeploying would be some sort of shared filesystem (NFS, Samba/CIFS, etc.). One thing to explore is whether SaltStack might work for you. It would still require deploying, but it would make such deployments a snap -- http://www.saltstack.com/
I have a highly threaded application running on Amazon EC2. I would like to convert this application to a cluster on EC2, and I would like to use StarCluster for this, as it makes managing the cluster easy.
However, I am new to cluster/distributed computing. After googling, I found the following list of Python libraries for cluster computing:
http://wiki.python.org/moin/ParallelProcessing (look at the cluster computing section)
I would like to know whether all of these libraries will work with StarCluster. Is there anything I need to keep in mind, such as a dependency, when choosing a library, since I want the application to work with StarCluster?
Basically, StarCluster is a tool to help you manage your cluster. It can add/remove nodes, set them up within a placement group and security group, register them with Open Grid Scheduler, and more. You can also easily create commands and plugins to help you in your work.
How were you intending to use StarCluster?
If it's as a watcher to load balance your cluster then there shouldn't be any problems.
If it's as an actor (making it directly do the computation by launching it with a command you would craft yourself and parallelizing its execution across the cluster), then I don't know. It might be possible, but StarCluster was not designed for it. We can read on the website:
StarCluster has been designed to simplify the process of building, configuring, and managing clusters of virtual machines on Amazon’s EC2 cloud.