Fastest way of communication between multiple EC2 instances in Python

I am looking for the absolute fastest way to submit a short string from one EC2 instance to multiple others, all of type t2.nano. Example: something happens on instance 1; instances 2, 3, and 4 should (almost) instantly know about it. The target is < 5ms. For now the instances are all in the same region and the same availability zone.
What I have looked at so far:
A shared drive where instance 1 can drop the data and the rest of the instances can check it
-> Not possible, as this instance type does not support shared drives
Redis
-> I tested this locally and it is actually pretty slow: at least XXms, and sometimes XXXms, for one read and one write (just for testing).
Any ideas how to solve this problem?

You can try AWS EFS
Multiple compute instances, including Amazon EC2, Amazon ECS, and AWS Lambda, can access an Amazon EFS file system at the same time, providing a common data source for workloads and applications running on more than one compute instance or server.
https://docs.aws.amazon.com/efs/latest/ug/whatisefs.html
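As a rough sketch of the original "shared drive" idea on top of EFS: once the file system is mounted on every instance (the /mnt/efs path below is an assumption), instance 1 can drop a flag file and the others can poll for it. Note that NFS round-trips plus polling make a hard < 5ms target unlikely.

import time
from pathlib import Path

FLAG = Path("/mnt/efs/event.flag")  # assumed EFS mount point

# On instance 1: publish the event.
FLAG.write_text("something happened")

# On instances 2-4: poll until the flag appears.
while not FLAG.exists():
    time.sleep(0.001)
print(FLAG.read_text())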

Consider using Twisted for Publish/Subscribe between your clients, where clients see all messages posted by other clients.
Alternatively, consider Autobahn, which builds abstraction layers on top of Twisted, including WebSocket-based pub/sub and WAMP.
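As a rough illustration of the Twisted approach, here is a minimal line-based broadcast hub: every line a client sends is fanned out to all other connected clients. The class names and port are illustrative, not part of the original answer.

from twisted.internet import protocol, reactor
from twisted.protocols import basic


class PubSubProtocol(basic.LineReceiver):
    def connectionMade(self):
        self.factory.clients.add(self)

    def connectionLost(self, reason):
        self.factory.clients.discard(self)

    def lineReceived(self, line):
        # Fan the message out to every other subscriber.
        for client in self.factory.clients:
            if client is not self:
                client.sendLine(line)


class PubSubFactory(protocol.Factory):
    protocol = PubSubProtocol

    def __init__(self):
        self.clients = set()


reactor.listenTCP(9000, PubSubFactory())
reactor.run()

Clients can then connect with any TCP client (even netcat), and every newline-terminated message they send is delivered to the other subscribers.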

Related

What's the optimal way to store image data temporarily in a containerized website?

I'm currently working on a website where I want the user to upload one or more images; my Flask backend will make some changes to these pictures and then return them to the front end.
Where do I optimally save these images temporarily, especially if there is more than one user on my website at the same time (I'm planning on containerizing the website)? Is it safe for me to save the images in the folder of the website, or do I need e.g. a database for that?
You should use a database, or external object storage like Amazon S3.
I say this for a couple of reasons:
Accidents do happen. Say the client does an HTTP POST, gets a URL back, and does an HTTP GET to retrieve the result. If in the meantime the container restarts (because the system crashed; your cloud instance got terminated; you restarted the container to upgrade its image; the application failed), the container-temporary filesystem will be lost.
A worker can run in a separate container. It's very reasonable to structure this application as a front-end Web server, that pushes messages into a job queue, and then a back-end worker picks up messages out of that queue to process the images. The main server and the worker will have separate container-local filesystems.
You might want to scale up the parts of this. You can easily run multiple containers from the same image; they'll each have separate container-local filesystems, and you won't directly control which replica a request goes to, so every container needs access to the same underlying storage.
...and it might not be on the same host. In particular, cluster technologies like Kubernetes or Docker Swarm make it reasonably straightforward to run container-based applications spread across multiple systems; sharing files between hosts isn't straightforward, even in these environments. (Most of the Kubernetes Volume types that are easy to get aren't usable across multiple hosts, unless you set up a separate NFS server.)
That set of constraints would imply trying to avoid even named volumes as much as you can. It makes sense to use volumes for the underlying storage for your database, and it can make sense to use Docker bind mounts to inject configuration files or get log files out, but ideally your container doesn't really use its local filesystem at all and doesn't care how many copies of itself are running.
(Do not rely on Docker's behavior of populating a named volume on first use. There are three big problems with it: it is on first use only, so if you update the underlying image, the volume won't get updated; it only works with Docker named volumes and not other options like bind-mounts; and it only works in Docker proper and not in Kubernetes.)
Other decisions are possible given other sets of constraints. If you're absolutely sure you will never ever want to run this application spread across multiple nodes, Docker volumes or bind mounts might make sense. I'd still avoid the container-temporary filesystem.
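To make the object-storage recommendation concrete, here is a minimal sketch of a Flask handler that streams an upload straight to S3 instead of the container filesystem; the bucket name, route, and key scheme are assumptions for illustration.

import uuid

import boto3
from flask import Flask, request

app = Flask(__name__)
s3 = boto3.client("s3")
BUCKET = "my-image-uploads"  # assumed bucket name


@app.route("/upload", methods=["POST"])
def upload():
    file = request.files["image"]
    key = f"uploads/{uuid.uuid4()}-{file.filename}"
    # Stream the file object straight to S3; nothing touches local disk.
    s3.upload_fileobj(file, BUCKET, key)
    return {"key": key}, 201


if __name__ == "__main__":
    app.run()

Any replica of this container can serve the follow-up GET by fetching the same key from S3, which is exactly the property the container-local filesystem lacks.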

Centralized object store for multiple Python processes

I'm looking for a pythonic and simple way to synchronously share a common data source across multiple Python processes.
I've been thinking about using Pyro4 or Flask to write a kind of CRUD service that I can get objects from and put objects into. But Flask appears to require a lot of coding for a simple task, and Pyro4 seems to require some name service.
Do you know of any (preferably easy to use, matured, high-level) library or package that provides centralized storage and high performance access to objects shared across multiple Python processes?
Take a look at Redis
Redis is an in-memory key-value database.
And you can install redis-py to use Redis with Python.
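A minimal sketch with redis-py (pip install redis): pickling lets arbitrary Python objects be shared across processes, with pickle's usual security caveats. The host, key name, and payload below are illustrative.

import pickle

import redis

r = redis.Redis(host="localhost", port=6379)  # assumed local Redis server

# Put: serialize any Python object and store it under a key.
r.set("shared:config", pickle.dumps({"threshold": 0.5}))

# Get: any process connected to the same Redis sees the same object.
config = pickle.loads(r.get("shared:config"))
print(config["threshold"])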

How to get a filtered VM list on vCenter with Python

I'm trying to get a list of registered virtual machines - by their name - in my vCenter. The problem is that I have many VMs (~5K), and I am doing this a lot of times (on the order of 1000 requests/hour).
The SDKs I'm using cause a lot of traffic (1-2MB/request):
pysphere, which asks for all VMs and filters on the client side.
pyVmomi, which needs to use recursion to list all VMs (I saw SI.content.searchIndex.FindByDnsName in reboot_vm.py, but my machines' DNS configuration is not accurate).
Looking into the SOAP documentation didn't help (I got into RetrievePropertiesEx.objectSet, but it doesn't look like it filters anything), and the new REST API (v6.5) didn't help either (since I need to get each VM's "datastore path", and all I can get is the name).
Have you tried using a property collector, as in pyVmomi.vmodl.query.PropertyCollector.FilterSpec?
The pyVmomi Community Samples contain examples of using this API, such as
https://github.com/vmware/pyvmomi-community-samples/blob/master/samples/filter_vms.py.
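For orientation, here is a condensed sketch of that pattern: a container view plus a PropertyCollector filter that retrieves only each VM's name, rather than pulling whole VM objects over the wire. The hostname and credentials are placeholders.

import ssl

from pyVim.connect import Disconnect, SmartConnect
from pyVmomi import vim, vmodl

# Lab-only: skip certificate verification.
ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="user", pwd="pass", sslContext=ctx)
content = si.RetrieveContent()

# View over all VirtualMachine objects in the inventory.
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)

# Traversal spec: walk the view's 'view' property to reach the VMs.
traversal = vmodl.query.PropertyCollector.TraversalSpec(
    name="traverseView", path="view", skip=False, type=vim.view.ContainerView)
obj_spec = vmodl.query.PropertyCollector.ObjectSpec(obj=view, skip=True, selectSet=[traversal])

# Fetch only the 'name' property instead of entire VM objects.
prop_spec = vmodl.query.PropertyCollector.PropertySpec(type=vim.VirtualMachine, pathSet=["name"])
filter_spec = vmodl.query.PropertyCollector.FilterSpec(objectSet=[obj_spec], propSet=[prop_spec])

result = content.propertyCollector.RetrieveContents([filter_spec])
names = [prop.val for obj in result for prop in obj.propSet]
print(len(names))

view.Destroy()
Disconnect(si)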
vAPI is based on REST; what's more, it makes network requests every time, so it will be slow. Another way is to use the VCDB, where all the VMs are stored (there is such a database, but I cannot help you more here); see https://pubs.vmware.com/vsphere-4-esx-vcenter/index.jsp?topic=/com.vmware.vsphere.installclassic.doc_41/install/prep_db/c_preparing_the_vmware_infrastructure_databases.html or https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1025914

How to show Spark application's percentage of completion on AWS EMR (and Boto3)?

I am running a Spark step on AWS EMR. This step is added to EMR through Boto3. I would like to return to the user a percentage of completion of the task; is there any way to do this?
I was thinking of calculating this percentage from the number of completed Spark stages. I know this won't be too precise, as stage 4 may take double the time of stage 5, but I am fine with that.
Is it possible to access this information with boto3?
I checked the method list_steps (here are the docs), but in the response I only get whether it's running, without other information.
DISCLAIMER: I know nothing about AWS EMR and Boto3
I would like to return to the user a percentage of completion of the task, is there any way to do this?
Any way? Perhaps. Just register a SparkListener and intercept events as they come. That's how the web UI works under the covers (and it is the definitive source of truth for Spark applications).
Use spark.extraListeners property to register a SparkListener and do whatever you want with the events.
Quoting the official documentation's Application Properties:
spark.extraListeners A comma-separated list of classes that implement SparkListener; when initializing SparkContext, instances of these classes will be created and registered with Spark's listener bus. If a class has a single-argument constructor that accepts a SparkConf, that constructor will be called; otherwise, a zero-argument constructor will be called.
You could also consider REST API interface:
In addition to viewing the metrics in the UI, they are also available as JSON. This gives developers an easy way to create new visualizations and monitoring tools for Spark. The JSON is available for both running applications, and in the history server. The endpoints are mounted at /api/v1. Eg., for the history server, they would typically be accessible at http://<server-url>:18080/api/v1, and for a running application, at http://localhost:4040/api/v1.
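As a rough sketch of the REST route, something like this could be polled from a machine that can reach the driver (or the history server). The host, port, and picking the first application are assumptions, and the stage-count ratio is only the coarse estimate described in the question.

import requests

BASE = "http://localhost:4040/api/v1"  # running app; history server is typically :18080

# Take the first (often only) application; adjust if several are running.
app_id = requests.get(f"{BASE}/applications").json()[0]["id"]
stages = requests.get(f"{BASE}/applications/{app_id}/stages").json()

done = sum(1 for s in stages if s["status"] == "COMPLETE")
if stages:
    print(f"~{100 * done / len(stages):.0f}% of stages complete")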
This is not supported at the moment, and I don't think it will be anytime soon.
You'll just have to follow the application logs the old-fashioned way. So maybe consider formatting your logs in a way that lets you know what has actually finished.

Amazon EC2 file structure / web app with separate Python backend?

I'm currently running a t2.micro instance on EC2. I have the HTML/web interface side of it working, along with a MySQL database.
The site allows users to register and stores them in the DB via a PHP script.
I want there to be an actual Python application that queries the MySQL database and returns user data, to then be used in a Python script.
What I cannot find is whether I should host this Python application as a totally separate instance or whether it can exist on the same instance, in a different directory. I ultimately just need to query the database, which makes me think it must exist on the same instance.
Could someone please provide some guidance?
Let me just be clear: this is not a Python web app. This Python backend is entirely separate except for making queries against the database.
Either approach is possible, but there are pros & cons to each.
Running separate Python app on the same server:
Pros:
Setting up local access to the database is fairly simple
Only need to handle backups or making snapshots, etc. for a single instance
Cons:
Harder to scale up individual pieces if you need more memory, processing power, etc. in the future
Running the Python app on a separate server:
Pros:
Separate pieces means you can scale up & down the hardware each piece is running on, according to their individual needs
If you're using all micro instances, you get more resources to work with, without any extra costs (assuming you're still meeting all the other 'free tier eligible' criteria)
Cons:
In general, more pieces == more time spent on configuration, administration tasks, etc.
You have to open up the database to non-local access
Simplest: open up the database to access from anywhere (e.g. all remote IP addresses), and have the Python app log in via the internet
Somewhat safer, more complex: set the Python app server up with an elastic IP, open up the database to access only from that address
Much safer, more complex: set up your own virtual private cloud (VPC), and allow connections to the database only from within the VPC. You'd have to configure public access for each of the servers for whatever public traffic you'll have, presumably ports 80 and/or 443.
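For the database-access part itself, a minimal sketch of the separate Python backend querying the same MySQL database (here with PyMySQL; the credentials, database, and table are placeholders). On the same instance the host is localhost; from a separate instance it would be the database server's address or elastic IP, per the options above.

import pymysql

conn = pymysql.connect(
    host="localhost",        # or the DB server's address / elastic IP from another instance
    user="app_user",         # placeholder credentials
    password="app_password",
    database="site_db",
)

with conn.cursor() as cur:
    # Parameterized query; PyMySQL uses %s placeholders.
    cur.execute("SELECT id, username, email FROM users WHERE id = %s", (42,))
    print(cur.fetchone())

conn.close()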
