I have code that uses a GET request from the Python requests library to pull data from an API. I am expecting, for example, 10 large files to be sent to me.
Can someone explain how my code should be written so that I can take one file and analyze it while taking and analyzing another file in parallel, and so on? Is it possible to analyze all 10 at once?
First, this is not really a question about AWS and EC2.
Assuming that you don't want to rewrite your code too significantly, you may want to concurrently run many instances of your Python program, each with a different input file as the argument.
Assuming a typical workflow is:
python blah.py inputfile.xyz
You could now run something like:
python blah.py inputfile1.xyz &
python blah.py inputfile2.xyz &
...
python blah.py inputfileN.xyz &
wait
Note: this is the lazy way out. Optimal solutions will require rewriting code to be multithreaded, and analyzing your various resource limits.
The number of processes that you run should be limited by the number of vCPUs provided by your EC2 instance.
You may also be limited by your network bandwidth when running multiple downloads in parallel. Finally, some (burstable) EC2 instances have burst limits, after which performance drops noticeably.
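If you do eventually rewrite the code itself, a minimal sketch of that kind of concurrency using concurrent.futures might look like the following; the URL list, the chunk size and the analysis step are placeholders for whatever your script actually does:

import concurrent.futures
import requests

URLS = ["https://example.com/file1", "https://example.com/file2"]  # placeholder list of your 10 files

def download_and_analyze(url):
    # Stream the download to disk so a large file never sits entirely in memory.
    local_name = url.rsplit("/", 1)[-1]
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        with open(local_name, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                fh.write(chunk)
    # analyze(local_name) would be your existing analysis code.
    return local_name

# Threads are fine for I/O-bound downloads; switch to ProcessPoolExecutor
# if the analysis itself is CPU-bound.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for finished in pool.map(download_and_analyze, URLS):
        print("done:", finished)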
Greetings Stackoverflow community!
I recently learned about the power of microservices and containers, and I decided to wrap some of my numerical simulation codes written in C++ and make them available as an API. Here are some requirements/details of my applications:
My simulators are coded in C++ with a few dependencies that I link via dynamic or static libraries on Windows (e.g. Hypre, for the solution of linear systems). They also run in parallel with MPI/OpenMP (in the future I would like to implement CUDA support as well).
The input for a simulator is a simple configuration file with some keys (.json format) and a data file (ASCII, but it could be binary as well) with millions of entries. These are fields with one value per simulation cell, and my models can be as large as 500x500x500 (= 125,000,000 cells).
A typical call to the simulator on Windows is: mpiexec -n 4 mysimulator.exe "C:\path\to\config.json". Inside my configuration file there is another absolute path to the ASCII file with the cellwise values.
I would like to "containerize" this whole mess and create an API available through HTTP requests or any other protocol that would allow the code to be run from outside the container. While the simulation microservice is running on a remote machine, anyone should be able to send a configuration file and the big ASCII or binary file to the container, which would receive the request, perform the simulation and somehow send back the results (which can be files and/or numerical values).
After some research, I feel this could be achieved with the following approach.
Create a Docker image with the C++ code, so that when a container is created from that image the compiled binary executable of the C++ simulator is available inside it.
Implement a Python interface that handles the incoming requests using Flask or Django. We listen for requests on a certain port and, once we get a request, we call the binary executable using Python's subprocess module.
The simulator somehow needs to send a "simulation status" back since these simulations can take hours to finish.
I have a few questions:
Is python "subprocess" call to a binary executable with the C++ code the way to go? Or is it easier/more recommended to implement the treatment to the API calls inside the C++ code?
How do you typically send a big binary/ascii file through HTTP to the microservice running inside a docker container?
If I have a workstation with - let's say - 16 cores...and I want to allow each user to run at most 2 processors, I could have a max of 8 parallel instances. This way, would I need 8 containers running simultaneously in the computer?
Since the simulations take hours to finish, what's the best approach to interact with the client who's requesting the simulation results? Are events typically used in this context?
Thanks,
Rafael.
Is python "subprocess" call to a binary executable with the C++ code the way to go? Or is it easier/more recommended to implement the treatment to the API calls inside the C++ code?
If you don't have performance concerns, use whatever faster to achieve and easier to scale according to your skills. Use the language that you're comfortable with. If performance is essential, then choose it wisely or refactor them later.
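For the Python route, a minimal sketch of the Flask-plus-subprocess idea could look like the following; the endpoint names, the mpiexec command line and the in-memory job table are assumptions for illustration, not a prescription:

import subprocess
import uuid
from flask import Flask, request, jsonify

app = Flask(__name__)
jobs = {}  # job id -> Popen handle; in-memory only, use a real store in production

@app.route("/simulations", methods=["POST"])
def start_simulation():
    config_path = request.json["config_path"]  # assumes the config file is already on disk
    job_id = str(uuid.uuid4())
    # Launch the existing simulator binary without blocking the web worker.
    jobs[job_id] = subprocess.Popen(["mpiexec", "-n", "4", "./mysimulator", config_path])
    return jsonify({"job_id": job_id}), 202

@app.route("/simulations/<job_id>", methods=["GET"])
def simulation_status(job_id):
    proc = jobs.get(job_id)
    if proc is None:
        return jsonify({"error": "unknown job"}), 404
    state = "running" if proc.poll() is None else "finished"
    return jsonify({"job_id": job_id, "state": state, "returncode": proc.returncode})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)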
How do you typically send a big binary/ASCII file through HTTP to the microservice running inside a Docker container?
Depends on the scenario. You can send the data to an endpoint in a single request, or send it part by part (chunked or multipart upload). You may refer to this post on RESTful uploads.
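As one hedged example, Python's requests can stream a large file as the request body instead of loading it into memory; the URL and filename here are placeholders:

import requests

# Passing a file object as 'data' makes requests stream it in chunks,
# so the whole ASCII/binary file never has to fit in memory at once.
with open("cellwise_values.dat", "rb") as fh:
    resp = requests.post("http://simulator.example.com/upload", data=fh)
resp.raise_for_status()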
If I have a workstation with - let's say - 16 cores, and I want to allow each user to run on at most 2 cores, I could have a maximum of 8 parallel instances. In that case, would I need 8 containers running simultaneously on the machine?
Keep your service simple. If one service instance uses only 1 or 2 cores, then run multiple instances: it's easier to scale that way than to write a complex multithreaded program.
Since the simulations take hours to finish, what's the best approach to interact with the client who's requesting the simulation results? Are events typically used in this context?
Events would be good enough. Use polling if intermediate simulation status is important.
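A polling client can be as simple as the sketch below; the status URL is assumed to be whatever endpoint your service exposes (for instance the hypothetical /simulations/<job_id> route sketched above):

import time
import requests

status_url = "http://simulator.example.com/simulations/some-job-id"  # hypothetical endpoint

while True:
    status = requests.get(status_url).json()
    print(status)
    if status.get("state") == "finished":
        break
    time.sleep(60)  # the simulations run for hours, so a coarse polling interval is fine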
Note: this is more of an opinion-based post, but it covers general scenarios worth answering.
I've created a Python (and C, but the "controlling" part is Python) program for carrying out Bayesian inversion using Markov chain Monte Carlo (MCMC) methods. Unfortunately, MCMC can take several days to run. Part of my research is in reducing that time, but we can only reduce it so much.
I'm running it on a headless CentOS 7 machine using nohup, since maintaining a connection and receiving printed output for several days is not practical. However, I would like to be able to check in on my program occasionally to see how many iterations it has done, how many proposals have been accepted, whether it's out of burn-in, etc.
Is there something I can use to interact with the python process to get this info?
I would recommend SAWs (Scientific Application Web server). It creates a thread in your process to handle HTTP requests; the variables of interest are returned to the client in JSON format.
For the variables managed by the Python runtime, write them into a (JSON) file on the hard disk (or any shared memory) and use SimpleHTTPServer (http.server in Python 3) to serve it. The SAWs web interface on the client side can still be used, as long as you follow the JSON format of SAWs.
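A minimal sketch of that second option: have the MCMC loop periodically dump its counters to a JSON file and serve the directory over HTTP. The field names below are illustrative, not part of any existing API:

import json
import os

def write_status(iteration, accepted, in_burn_in, path="status.json"):
    # Call this every N iterations from the main MCMC loop.
    status = {
        "iteration": iteration,
        "accepted": accepted,
        "acceptance_rate": accepted / iteration if iteration else 0.0,
        "in_burn_in": in_burn_in,
    }
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(status, fh)
    os.replace(tmp, path)  # atomic rename, so readers never see a half-written file

# Serve the directory containing status.json, e.g.
#   python -m http.server 8000
# and fetch http://<host>:8000/status.json from your workstation.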
I have written a script to do some research on HTTP Archive data. This script needs to make HTTP requests to sites scraped by HTTP Archive in order to classify sites into groups (e.g., Drupal, WordPress, etc). The script is working really well; however, the list of sites that I am handling is 300,000 sites long.
I would like to be able to complete the categorization of sites as fast as possible. I have experimented with running multiple instances of the script at the same time and it is working well with appropriate locks in place to prevent race conditions.
How can I max this out to get all of these operations completed as fast as possible? For instance, I am looking at spinning up a VPS with 8 CPUs and 16 GB RAM. How do I maximize these resources to make sure I'm using every bit of processing power possible? I may consider spinning up something more powerful, but I want to make sure I understand how to get the most out of it so I'm not wasting money.
Thanks!
The multiprocessing module is the best option; it lets you harness the full power of your 8 CPUs:
https://docs.python.org/3.3/library/multiprocessing.html
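A sketch of how that might look for your case; classify_site and the input file name are stand-ins for your existing per-site logic:

from multiprocessing import Pool

def classify_site(url):
    # Placeholder: fetch the site and decide whether it is Drupal, WordPress, etc.
    # Your existing per-site code would go here.
    return url, "unknown"

if __name__ == "__main__":
    with open("sites.txt") as fh:
        sites = [line.strip() for line in fh if line.strip()]

    # Pool() defaults to one worker per CPU; raise the count if the work is
    # mostly waiting on network I/O rather than burning CPU.
    with Pool() as pool:
        for url, category in pool.imap_unordered(classify_site, sites, chunksize=100):
            print(url, category)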
I am using Python programs for nearly everything:
deploy scripts
Nagios routines
website backend (web2py)
The reason I am doing this is that I can reuse the code to provide different kinds of services.
For a while now I have noticed that those scripts are putting a high CPU load on my servers. I have taken several steps to mitigate this:
late initialization, using cached_property (see here and here), so that only the objects that are actually needed get initialized (including the import of the related modules)
turning some of my scripts into HTTP services (with a simple web.py implementation wrapping up my classes). The services are then triggered (by Nagios, for example) with simple curl calls.
This has reduced the load dramatically, going from over 20 CPU load to well under 1. It seems Python startup is very resource-intensive for complex programs with lots of interdependencies.
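For reference, the late-initialization pattern described above might look something like the sketch below, here using functools.cached_property; the class, the attribute and the psycopg2 dependency are purely illustrative:

from functools import cached_property  # Python 3.8+; earlier versions can use an equivalent descriptor

class Deployer:
    @cached_property
    def db(self):
        # The expensive import and connection setup only happen if a caller
        # actually touches self.db, and only once per instance.
        import psycopg2  # hypothetical heavy dependency
        return psycopg2.connect("dbname=deploys")

    def cheap_task(self, name):
        print("running", name)  # the cheap path never pays for self.db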
I would like to know what other strategies are people here implementing to improve the performance of python software.
An easy one-off improvement is to use PyPy instead of the standard CPython for long-lived scripts and daemons (for short-lived scripts it's unlikely to help and may actually have longer startup times). Other than that, it sounds like you've already hit upon one of the biggest improvements for short-lived system scripts, which is to avoid the overhead of starting the Python interpreter for frequently-invoked scripts.
For example, if you invoke one script from another and they're both in Python you should definitely consider importing the other script as a module and calling its functions directly, as opposed to using subprocess or similar.
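For instance, instead of something like subprocess.call(["python", "other_script.py", "arg"]), the in-process version might be (other_script and its main function are hypothetical names):

import other_script  # hypothetical module; pays the import cost once per process

# Call the function directly rather than starting a new interpreter each time.
result = other_script.main("arg")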
I appreciate that it's not always possible to do this, since some use-cases rely on external scripts being invoked - Nagios checks, for example, are going to be tricky to keep resident at all times. Your approach of making the actual check script a simple HTTP request seems reasonable enough, but the approach I took was to use passive checks and run an external service to periodically update the status. This allows the service generating check results to be resident as a daemon rather than requiring Nagios to invoke a script for each check.
Also, watch your system to see whether the slowness really is CPU overload or IO issues. You can use utilities like vmstat to watch your IO usage. If you're IO bound then optimising your code won't necessarily help a lot. In this case, if you're doing something like processing lots of text files (e.g. log files) then you can store them gzipped and access them directly using Python's gzip module. This increases CPU load but reduces IO load because you only need to transfer the compressed data from disk. You can also write output files directly in gzipped format using the same approach.
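For example, reading a gzipped log file directly is usually a one-line change; the filename and the filter are placeholders:

import gzip

# gzip.open decompresses transparently; only the compressed bytes come off the disk.
with gzip.open("access.log.gz", "rt") as fh:
    error_lines = sum(1 for line in fh if " 500 " in line)
print(error_lines)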
I'm afraid I'm not particularly familiar with web2py specifically, but you can investigate whether it's easy to put a caching layer in front if the freshness of the data isn't totally critical. Try and make sure both your server and clients use conditional requests correctly, which will reduce request processing time. If they're using a back-end database, you could investigate whether something like memcached will help. These measures are only likely to give you real benefit if you're experiencing a reasonably high volume of requests or if each request is expensive to handle.
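If memcached does turn out to be worth trying, the basic pattern is to cache the result of an expensive back-end call; this sketch uses the pymemcache client, and the key name and TTL are arbitrary:

from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def rendered_report(report_id, render):
    # 'render' stands in for whatever expensive function builds the response.
    key = "report:%s" % report_id
    cached = cache.get(key)
    if cached is not None:
        return cached.decode()
    html = render(report_id)
    cache.set(key, html.encode(), expire=300)  # cache the result for five minutes
    return html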
I should also add that generally reducing system load in other ways can occasionally give surprising benefits. I used to have a relatively small server running Apache and I found moving to nginx helped a surprising amount - I believe it was partly more efficient request handling, but primarily it freed up some memory that the filesystem cache could then use to further boost IO-bound operations.
Finally, if overhead is still a problem then carefully profile your most expensive scripts and optimise the hotspots. This could mean improving your Python code, or it could mean pushing code out to C extensions if that's an option for you. I've had some great performance gains by pushing data-path code out into C extensions for large-scale log processing and similar tasks (talking about hundreds of GB of logs at a time). However, this is a heavy-duty and time-consuming approach and should be reserved for the few places where you really need the speed boost. It also depends on whether you have someone available who's familiar enough with C to do it.
I recently created a Python script that performed some natural language processing tasks and worked quite well in solving my problem. But it took 9 hours. I first investigated using Hadoop to break the problem down into steps and hopefully take advantage of the scalable parallel processing I'd get by using Amazon Web Services.
But a friend of mine pointed out that Hadoop is really for large amounts of data stored on disk, on which you want to perform many simple operations. In my situation I have a comparatively small initial data set (low 100s of MBs) on which I perform many complex operations, taking up a lot of memory during the process and taking many hours.
What framework can I use in my script to take advantage of scalable clusters on AWS (or similar services)?
Parallel Python is one option for distributing things over multiple machines in a cluster.
This example shows how to do a MapReduce-like script, using processes on a single machine. Secondly, if you can, try caching intermediate results. I did this for an NLP task and obtained a significant speed-up.
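A very simple version of that caching, assuming the intermediate results are picklable, is to write them to disk keyed by a filename; the helper and the usage example are hypothetical:

import os
import pickle

def cached(path, compute):
    # Reuse a previously computed intermediate result if it already exists on disk.
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    result = compute()
    with open(path, "wb") as fh:
        pickle.dump(result, fh)
    return result

# e.g. tokens = cached("tokens.pkl", lambda: tokenize(corpus))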
My package, jug, might be very appropriate for your needs. Without more information I can't really say what the code would look like, but I designed it for sub-Hadoop-sized problems.