Greetings Stackoverflow community!
I recently learned about the power of microservices and containers, and I decided to wrap some of my numerical simulations codes in C++ and make them available as an API. Here are some requirements/details of my applications:
My simulators are coded in C++ with a few dependencies that I link via dynamic or static libraries in windows (e.g. Hypre, for solution of linear systems). They also run in parallel with MPI/OpenMP (in the future I would like to implement CUDA support as well).
The input for a simulator is a simple configuration file with some keys (.json format) and a data file (ascii but could be binary as well) with millions of entries (these are fields with one value for each simulation cells, and my models can be as large as 500x500x500 (=125000000 cells).
A typical call to the simulator in Windows is: mpiexec -n 4 mysimulator.exe "C:\path\to\config.json". Inside my configuration file I have other absolute path to the ascii file with the cellwise values.
I would like to "containerize" this whole mess and create an api available through HTTP requests or any other protocol that would allow the code to be run from outside the container. While the simulation microservice is running on a remote machine, anyone should be able to send a configuration file and the big ascii or binary file to the container, which would receive the request, perfom the simulation and somehow send back the results (which can be files and/or numerical values).
After some research, I feel this could be achieved with the following approach.
Create a docker image with the C++ code. When the container is created using the image as a blueprint, we obtain a binary executable of the C++ simulator.
Implement a python interface that handles the incoming requests using flask or django. We listen to requests at a certain port and once we get a request, we call the binary executable using python's subprocess.
The simulator somehow needs to send a "simulation status" back since these simulations can take hours to finish.
I have a few questions:
Is python "subprocess" call to a binary executable with the C++ code the way to go? Or is it easier/more recommended to implement the treatment to the API calls inside the C++ code?
How do you typically send a big binary/ascii file through HTTP to the microservice running inside a docker container?
If I have a workstation with - let's say - 16 cores...and I want to allow each user to run at most 2 processors, I could have a max of 8 parallel instances. This way, would I need 8 containers running simultaneously in the computer?
Since the simulations take hours to finish, what's the best approach to interact with the client who's requesting the simulation results? Are events typically used in this context?
Thanks,
Rafael.
Is python "subprocess" call to a binary executable with the C++ code the way to go? Or is it easier/more recommended to implement the treatment to the API calls inside the C++ code?
If you don't have performance concerns, use whatever faster to achieve and easier to scale according to your skills. Use the language that you're comfortable with. If performance is essential, then choose it wisely or refactor them later.
How do you typically send a big binary/ascii file through HTTP to the microservice running inside a docker container?
Depends on the scenario. It's possible to send a data through end point or send them part by part. You may refer to this post for restful update.
If I have a workstation with - let's say - 16 cores...and I want to allow each user to run at most 2 processors, I could have a max of 8 parallel instances. This way, would I need 8 containers running simultaneously in the computer?
Keep your service simple. If one service uses only 1 or 2 cores. Then run multiple instance. Since it's easy to scale rather than create a complex multithreading program.
Since the simulations take hours to finish, what's the best approach to interact with the client who's requesting the simulation results? Are events typically used in this context?
Event would be good enough. Use polling if simulation status is important.
Note: This is more of opinion based post, but it has general scenarios worth answering.
Related
I have code that uses a GET command from the Python REQUESTS library to pull data from an API. I am expecting, for example, 10 large files to be sent to me.
Can someone help explain to me how my code should be written where I can take 1 file and analyze it and then take another file in parallel to analyze that and so on? Is it possible to analyze all 10 at once?
First, this is not really a question about AWS and EC2.
Assuming that you don't want to rewrite your code too significantly, you may want to concurrently run many instances of your Python program, each with a different input file as the argument.
Assuming a typical workflow is:
python blah.py inputfile.xyz
You could now run something like:
python blah.py inputfile1.xyz &
python blah.py inputfile2.xyz &
...
python blah.py inputfileN.xyz &
wait
Note: this is the lazy way out. Optimal solutions will require rewriting code to be multithreaded, and analyzing your various resource limits.
The number of processes that you run should be limited by the number of vCPUs provided by your EC2 instance.
You may also be limited by your network bandwidth, in terms of multiple parallel downloads. Finally, some EC2 instances have burst limits after which they perform noticeably poorly.
I've created a python (and C, but the "controlling" part is Python) program for carrying out Bayesian inversion using Markov chain Monte Carlo methods. Unfortunately, McMC can take several days to run. Part of my research is in reducing the time, but we can only reduce so much.
I'm running it on a headless Centos 7 machine using nohup since maintaining a connection and receiving prints for several days is not practical. However, I would like to be able to check in on my program occasionally to see how many iterations it's done, how many proposals have been accepted, whether it's out of burn-in, etc.
Is there something I can use to interact with the python process to get this info?
I would recommend SAWs (Scientific Application Web server). It creates a thread in your process to handle HTTP request. The variables of interest are returned to the client in JSON format .
For the variables managed by the python runtime, write them into a (JSON) file on the harddisk (or any shared memory) and use SimpleHTTPServer to serve it. The SAWs web interface on the client side can still be used as long as you follow the JSON format of SAWs.
I am looking for some advice/opinions of which Python Framework to use in an implementation of multiple 'Worker' PCs co-ordinated from a central Queue Manager.
For completeness, the 'Worker' PCs will be running Audio Conversion routines (which I do not need advice on, and have standalone code that works).
The Audio conversion takes a long time, and I need to co-ordinate an arbitrary number of the 'Workers' from a central location, handing them conversion tasks (such as where to get the source files, or where to ask for the job configuration) with them reporting back some additional info, such as the runtime of the converted audio etc.
At present, I have a script that makes a webservice call to get the 'configuration' for a conversion task, based on source files located on the worker already (we manually copy the source files to the worker, and that triggers a conversion routine). I want to change this, so that we can distribute conversion tasks ("Oy you, process this: xxx") based on availability, and in an ideal world, based on pending tasks too.
There is a chance that Workers can go offline mid-conversion (but this is not likely).
All the workers are Windows based, the co-ordinator can be WIndows or Linux.
I have (in my initial searches) come across the following - and I know that some are cross-dependent:
Celery (with RabbitMQ)
Twisted
Django
Using a framework, rather than home-brewing, seems to make more sense to me right now. I have a limited timeframe in which to develop this functional extension.
An additional consideration would be using a Framework that is compatible with PyQT/PySide so that I can write a simple UI to display Queue status etc.
I appreciate that the specifics above are a little vague, and I hope that someone can offer me a pointer or two.
Again: I am looking for general advice on which Python framework to investigate further, for developing a Server/Worker 'Queue management' solution, for non-web activities (this is why DJango didn't seem the right fit).
How about using pyro? It gives you remote object capability and you just need a client script to coordinate the work.
I'd like to build a website with a sandboxed interpreter (or compiler) either on the client side of on the server side that can take short blocks of code (python/java/c/c++ any common language would do) as input and execute it.
What I want to build is a place where given a programming question, the user can type in the solution and we can run it through some test cases, to either approve the solution or provide a test case where it breaks.
Looking for pointers to libraries, existing implementation or a general idea.
Any help much appreciated.
There are many contest websites that do something like this-- TopCoder and Timus Online Judge are two examples. They don't have much information on the technology, however.
codepad.org is the closest to what you want to do. They run programs on heavily sandboxed and firewalled EC2 servers that are periodically wiped, to prevent exploits.
Codepad is at least partially based on geordi, an IRC bot designed to run arbitrary C++ programs. It uses Haskell and traps system calls to prevent harmful activity.
Of slightly less interest, one of Google App Engine's example projects is a Python shell. It relies on GAE's server-side sandboxing to prevent malicious activity.
In terms of interface, the simplest would be to do something like the Internation Informatics Olympiad. Have people write a function with a certain name in the target language, then invoke that from your testing framework. Have simple functions that will let them request information from the framework, if necessary.
For Python you can compile PyPy in sandboxed mode which gives you a complete interpreter and full standard library but without the ability to execute arbitrary system calls. You can also limit the runtime and heap size of executed scripts.
Here's some code I wrote a while back to execute an arbitrary string containing a Python script in the pypy-sandbox binary and return the output. You can call this code from regular CPython.
Take a look at the paper An Enticing Environment for Programming which discusses building just such an environment.
I'm working on a turn-based web game that will perform all world updates (player orders, physics, scripted events, etc.) on the server. For now, I could simply update the world in a web request callback. Unfortunately, that naive approach is not at all scalable. I don't want to bog down my web server when I start running many concurrent games.
So what is the best way to separate the load from the web server, ideally in a way that could even be run on a separate machine?
A simple python module with infinite loop?
A distributed task in something like Celery?
Some sort of cross-platform Cron scheduler?
Some other fancy Django feature or third-party library that I don't know about?
I also want to minimize code duplication by using the same model layer. That probably means my service would need access to the Django model code, so that definitely determines how I architect the service.
I think Celery, which you mention in your question, is the way to go here. It will interface nicely with the rest of your setup, support your eventual aim of separating out the systems, and is compatible with Django.
I'd just write the backend to just use the Django database interface (look at the setup code in your manage.py), spawn it as its own process, and interface to it with Protocol Buffers. That route should move to a separate machine with little work. MPI may be an option, too.
Pipes, FIFOs, and most other IPC requires both processes to be on the same box.
Though I have to point out a flaw in your premise:
Unfortunately, that naive approach is not at all scalable. I don't want to bog down my web server when I start running many concurrent games.
If you run concurrent games, so long as you keep all the parts for a given game on the same server, this is a non-issue, unless there's a common resource needed by all games. Then the real issue becomes load balancing across the servers.