How can I tell what strain is caused by pyVmomi on VMware?

I need to make a Python script that gathers large amounts of information about vCenters. I basically call every diagnostic available, one after the other, gathering this info into a database daily for machine learning. The script logs into the vCenter via Python's pyVmomi, then calls for information about all resource centers, all clusters, the hosts on each cluster, and every VM on every host. Where can I see the strain this puts on vCenter? Where is it hosted? I guess the issue is that I've been assigned a task, and I can gather documentation and get it done, but I don't want to cause strain on tools that are very important to our business. How can I be sure this does not cause issues with bandwidth or slow down important processes like CPU sharing, memory allocation, and host switching?

You can see the overall stats of the vCenter in the VAMI (vCenter Appliance Management Interface), where you can monitor CPU and RAM utilization. If you want deeper, more precise info, that would require logging into the vCenter locally, which is generally not recommended.
vCenter will, in general, handle your queries sequentially; they will queue up if the vCenter can't handle them immediately. The issue is that the other tasks being processed will queue up alongside yours, and there's no real way to give tasks a priority.
If you have vROps running in your environment, that might be a better place to pull all of those metrics... assuming they're available there.
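If you do go ahead with pyVmomi polling, you can also reduce the load you generate by batching reads instead of walking the inventory object by object. A minimal sketch, with placeholder hostname and credentials, that pulls every VM's summary through a single ContainerView:

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

context = ssl._create_unverified_context()  # lab only; verify certificates in production
si = SmartConnect(host="vcenter.example.com", user="readonly@vsphere.local",
                  pwd="secret", sslContext=context)
try:
    content = si.RetrieveContent()
    # One view over all VMs is far cheaper for vCenter than per-host queries.
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], recursive=True)
    for vm in view.view:
        s = vm.summary  # a single, relatively cheap property read per VM
        print(s.config.name, s.quickStats.overallCpuUsage, s.quickStats.guestMemoryUsage)
    view.DestroyView()
finally:
    Disconnect(si)

Running this off-peak and spacing collections out (e.g. once daily, as described) keeps the extra load on vCenter small.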

Related

When running dask on a local computer, should I create a `client` object?

I run Dask distributed jobs on a local computer. I want to utilize all the available CPUs and don't care about the diagnostic dashboard. Are there any advantages to creating a Client object in this scenario?
I'm asking because creating such an object takes some time during the startup of a script, and I would like to cut this time.
This is clearly covered in the documentation: https://docs.dask.org/en/latest/scheduler-overview.html
In short: maybe, depending on the type of workload you are running. Threads have the lowest latency and share memory, but will not parallelise code that needs the GIL. The distributed scheduler allows you more configuration, e.g., the thread/process mix, memory limits, and such.
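For illustration, a minimal sketch of both options (the array size and worker counts are arbitrary):

import dask.array as da

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

# Default threaded scheduler: no Client, no startup cost, shared memory,
# but pure-Python code that holds the GIL will not run in parallel.
result = x.mean().compute(scheduler="threads")

# Distributed scheduler: slower to start, but configurable
# (process/thread mix, memory limits, dashboard, etc.).
from dask.distributed import Client
client = Client(n_workers=4, threads_per_worker=2, memory_limit="2GB")
result = x.mean().compute()
client.close()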

Possible to outsource computations to AWS and utilize results locally?

I'm working on a robot that uses a CNN that needs much more memory than my embedded computer (Jetson TX1) can handle. I was wondering if it would be possible (with an extremely low latency connection) to outsource the heavy computations to EC2 and send the results back to be used in a Python script. If this is possible, how would I go about it, and what would the latency look like (not computations, just sending to and from)?
I think it's certainly possible. You would need some scripts or a web server to transfer data to and from. Here is how I think you might achieve it:
1. Send all your training data to an EC2 instance.
2. Train your CNN there.
3. Save the weights and/or any other generated parameters you may need.
4. Construct the CNN on your embedded system and load in the weights from the EC2 instance. Since you won't be doing any training here and won't need to load in the training set, the memory usage will be minimal.
5. Use your embedded device to predict whatever you may need.
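For illustration, a minimal sketch of steps 3-4; the answer names no framework, so PyTorch and the toy model here are my assumptions:

import torch
import torch.nn as nn

def build_model():
    # Stand-in for your CNN; the real architecture is whatever you trained.
    return nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Flatten())

# On the EC2 instance, after training:
net = build_model()
torch.save(net.state_dict(), "weights.pt")

# On the Jetson, after copying weights.pt over (e.g. with scp):
net = build_model()
net.load_state_dict(torch.load("weights.pt", map_location="cpu"))
net.eval()  # inference only; no training set needed on-device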
It's hard to give you an exact answer on latency because you haven't given enough information. The exact latency is highly dependent on your hardware, internet connection, amount of data you'd be transferring, software, etc. If you're only training once on an initial training set, you only need to transfer your weights once and thus latency will be negligible. If you're constantly sending data and training, or doing predictions on the remote server, latency will be higher.
Possible: of course it is.
You can use any kind of RPC to implement this: HTTPS requests, XML-RPC, raw UDP packets, and many more. If you're more interested in latency and small amounts of data, then something UDP-based could be better than TCP, but you'd need to build extra logic for ordering the messages and retrying the lost ones. Alternatively, something like ZeroMQ could help.
As for the latency: only you can answer that, because it depends on where you're connecting from. Start up an instance in the region closest to you and run ping or mtr against it to find out the round-trip time. That's the absolute minimum you can achieve; your processing time goes on top of that.
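For illustration, a minimal sketch of the request/reply round trip with ZeroMQ (pyzmq); the endpoint and the stand-in inference are assumptions, and the two functions run in separate processes:

import zmq

def server():  # runs on the EC2 instance
    sock = zmq.Context().socket(zmq.REP)
    sock.bind("tcp://*:5555")
    while True:
        frame = sock.recv()                    # raw input from the robot
        sock.send(b"prediction for " + frame)  # stand-in for real CNN inference

def predict(frame_bytes):  # runs on the robot
    sock = zmq.Context().socket(zmq.REQ)
    sock.connect("tcp://ec2-host.example.com:5555")
    sock.send(frame_bytes)                     # one network round trip per prediction
    return sock.recv()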
I am a former employee of CENAPAD-UFC (National Centre of HPC, Federal University of Ceará), so I have something to say about outsourcing computing power.
CENAPAD has a big cluster that provides computational power for academic research. Professors and students send in their computations and data, define the desired output, and go drink a coffee or two while the cluster gets on with the hard work. After lots of FLOPs, the job finishes, and they retrieve the results via ssh and go back to their laptops.
For big chunks of computation, you want to minimize any work that is not useful computation. One such thing is communication between detached computers. If you need to know when the computation has ended, let the HPC machine tell you that.
To compute effectively, you may want to go deeper into the machine and perform some kind of distribution. I use OpenMP to distribute computation inside the same machine, via threads. To distribute between physically separated computers that are close together (latency-wise), I use MPI. I also installed another cluster at UFC for a different department; there, the researchers used only MPI.
Some reading about distributed/grid/cluster computing may help you:
https://en.m.wikipedia.org/wiki/SETI@home ; the first example of massive distributed computing that I ever heard of
https://en.m.wikipedia.org/wiki/Distributed_computing
https://en.m.wikipedia.org/wiki/Grid_computing
https://en.m.wikipedia.org/wiki/Supercomputer ; this is CENAPAD-like stuff
In my opinion, you want a grid-like computation, with your personal PC working as a master node that calls the EC2 slaves. In this scenario, use master-to-slave communication only to send the program (if really needed) and the data, so that the master has other things to do that are unrelated to the data it sent; also, let the slave tell your master when the computation has ended.
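For illustration, a minimal sketch of that master/worker split using mpi4py as a Python stand-in for MPI (the chunking and the squared-sum workload are arbitrary); run it with e.g. mpiexec -n 4 python script.py:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:  # master: prepare one chunk of work per process
    chunks = [list(range(i, i + 10)) for i in range(0, 10 * comm.Get_size(), 10)]
else:
    chunks = None

chunk = comm.scatter(chunks, root=0)   # each process receives its chunk
partial = sum(x * x for x in chunk)    # the "useful computation"
totals = comm.gather(partial, root=0)  # slaves tell the master when they are done

if rank == 0:
    print("total:", sum(totals))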

Rabbitmq memory control, queue is full and is not paging. Connection hangs

I'm testing out a RabbitMQ + Celery setup.
In the current setup there is a job queue (2GB RAM, 65GB HD) and only one worker, which pushes a lot of messages to the queue (later, we'll add a bunch of workers). When the job queue reaches about ~11 million messages, the connection hangs; I'm pretty sure this is a case of blocking due to memory-based flow control, as in http://www.rabbitmq.com/memory.html. But the connection hangs forever, never closing and never paging to disk. This is undesirable behavior, causing the celery workers to become zombie processes.
In thinking about the total size the system might actually require: we would like the queue to be able to take something like 10,000 times this load, a total max of around ~30 billion messages in the queue at a time.
Here are some relevant settings:
{vm_memory_high_watermark,0.8},
{vm_memory_high_watermark_paging_ratio,0.5}]
We initially changed vm_memory_high_watermark from 0.4 to 0.8, which allowed more messages in the queue, but still not enough.
We're thinking of course the system will need more RAM at some point, although before that happens we want to understand the current problem and how to deal with it.
Right now, there are only 11M tasks in the queue, it is using 80% of the 2GB of RAM, and the entire system is only using 8GB of disk. The memory usage makes sense given that we set vm_memory_high_watermark to 0.8. The disk usage does not make sense to me at all, though, and suggests that paging is not happening. Why isn't RabbitMQ paging to disk in order to allow the queue to grow more? While obviously slowing down the queue machine, this would allow it to not die, and seems like desirable fallback behavior. AFAIK this is indeed the whole point of paging.
Other notes:
We confirmed that the connections are hanging and have in fact been blocked for 41 hours (by examining the connections section of rabbitmqctl report). According to http://www.rabbitmq.com/memory.html this means that "flow control is taking place". The question is: why isn't it paging messages to disk?
Other details:
Ubuntu 12.04.3 LTS
RabbitMQ 3.2.2, Erlang R14B04
Celery 3.0.24
Python 2.7.3
If your queue is not durable, no messages will be paged to disk, and the system will be limited by available memory. If you need messages to be flushed to disk, use a durable=true queue.
Also, this design, with a lot of load and nothing consuming the messages, is not ideal. RabbitMQ is not a database; the messages are meant to be transient. If you need a datastore, use Redis, an RDBMS, etc.
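For illustration, a minimal sketch of the durable-queue suggestion using pika (queue name, payload, and connection details are placeholders); marking messages persistent with delivery_mode=2 is what gets the message bodies themselves written to disk:

import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()
ch.queue_declare(queue="jobs", durable=True)  # queue definition survives a broker restart
ch.basic_publish(
    exchange="",
    routing_key="jobs",
    body=b"task payload",
    properties=pika.BasicProperties(delivery_mode=2),  # persistent message
)
conn.close()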

What's a distributed lock and why use it?

Why do people need a distributed lock?
When the shared resource is protected by its local machine, does this mean that we do not need a distributed lock?
I mean, when the shared resource is exposed to others by some kind of API or service, and this API or service is protected using its own local locks, then we do not need this kind of distributed lock; am I right?
After asking people on Quora, I think I got the answer.
Let's say N worker servers access a database server. There are two parts here:
The database has its own lock mechanisms to protect the data from being corrupted by concurrent access from its clients (the N worker servers). This is where the database's local lock comes in.
The N worker servers may need some coordination to make sure they are doing the right thing, and this is application-specific. This is where a distributed lock comes in. Say two worker servers run the same job, which drops a table in the database and adds some records to it. The database server can certainly guarantee that its internal data stays consistent, but these two processes need a distributed lock to coordinate with each other; otherwise, one process will drop the other's table.
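For illustration, a sketch of one common way those two workers could coordinate: a Redis-based lock (the key name, TTL, and client library are assumptions, not something from the question):

import uuid
import redis

r = redis.Redis()

def acquire(lock_key, ttl_ms=10_000):
    token = str(uuid.uuid4())
    # NX: set only if absent; PX: auto-expire so a crashed holder can't deadlock everyone
    if r.set(lock_key, token, nx=True, px=ttl_ms):
        return token
    return None

def release(lock_key, token):
    # Delete only if we still hold the lock (atomic compare-and-delete via Lua)
    script = """
    if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('del', KEYS[1])
    end
    return 0
    """
    r.eval(script, 1, lock_key, token)

token = acquire("locks:rebuild-table")
if token:
    try:
        pass  # drop and rebuild the table safely here
    finally:
        release("locks:rebuild-table", token)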
Yes and no. If you're exposing your information through a local API that uses a lock for mutual exclusion, then depending on how the lock is set up, your implementation might accomplish exactly what a distributed lock is trying to accomplish. But if you haven't developed the API yourself, you'll have to dig into its source to find out whether the locking is localized or distributed. Honestly, a lock is a lock is a lock; it's attempting to do the same thing no matter what. The benefit of a distributed lock over a localized one is that it already accounts for queueing, preventing clients from over-accessing expensive cache points.

Coordinating distributed Python processes using queuing or REST web service

Server A has a process that exports n database tables as flat files. Server B contains a utility that loads the flat files into a DW appliance database.
A process runs on server A that exports and compresses about 50-75 tables. Each time a table is exported and a file produced, a .flag file is also generated.
Server B has a bash process that repeatedly checks for each .flag file produced by server A. It does this by connecting to A and checking for the existence of a file. If the flag file exists, Server B will scp the file from Server A, uncompress it, and load it into an analytics database. If the file doesn't yet exist, it will sleep for n seconds and try again. This process is repeated for each table/file that Server B expects to be found on Server A. The process executes serially, processing a single file at a time.
Additionally: The process that runs on Server A cannot 'push' the file to Server B. Because of file-size and geographic concerns, Server A cannot load the flat file into the DW Appliance.
I find this process cumbersome, and it just so happens to be up for a rewrite/revamp. I'm proposing a messaging-based solution. I initially thought this would be a good candidate for RabbitMQ (or the like), where
Server A would write a file, compress it and then produce a message for a queue.
Server B would subscribe to the queue and would process files named in the message body.
I feel that a messaging-based approach would not only save time as it would eliminate the check-wait-repeat cycle for each table, but also permit us to run processes in parallel (as there are no dependencies).
I showed my team a proof-of-concept using RabbitMQ and they were all receptive to using messaging. A number of them quickly identified other opportunities where we would benefit from message-based processing. One such area would be to populate our DW dimensions in real time rather than through batch.
It then occurred to me that an MQ-based solution might be overkill given the low volume (50-75 tasks). It would also mean our operations team has to install RabbitMQ (and its dependencies, including Erlang), introducing new administration headaches.
I then realized this could be made simpler with a REST-based solution. Server A could produce a file and then make an HTTP call to a simple (web.py) web service on Server B. Server B could then initiate the transfer-and-load process based on the URL that is called. Given the time it takes to transfer, uncompress, and load each file, I would likely use Python's multiprocessing to create a subprocess that loads each file.
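A minimal sketch of that web.py receiver (the URL scheme and the load_table body are illustrative placeholders):

import multiprocessing
import web

urls = ("/load/(.+)", "Load")

def load_table(name):
    pass  # scp the file from Server A, uncompress it, and load it into the DW

class Load:
    def GET(self, name):
        # Run the slow transfer-and-load in a subprocess so the HTTP call returns quickly.
        multiprocessing.Process(target=load_table, args=(name,)).start()
        return "accepted: " + name

if __name__ == "__main__":
    web.application(urls, globals()).run()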
I'm thinking that the REST-based solution is ideal given the fact that it's simpler. In my opinion, using an MQ would be more appropriate for higher-volume tasks, but we're only talking (for now) about 50-75 operations, with potentially more to come.
Would REST-based be a good solution given my requirements and volume? Are there other frameworks or OSS products that already do this? I'm looking to add messaging without creating other administration and development headaches.
Message brokers such as Rabbit contain practical solutions for a number of problems:
multiple producers and consumers are supported without risk of duplication of messages
atomicity and unit-of-work logic provide transactional integrity, preventing duplication and loss of messages in the event of failure
horizontal scaling--most mature brokers can be clustered so that a single queue exists on multiple machines
no-rendezvous messaging--it is not necessary for sender and receiver to be running at the same time, so one can be brought down for maintenance without affecting the other
preservation of FIFO order
Depending on the particular web service platform you are considering, you may find that you need some of these features and must implement them yourself if not using a broker. The web service protocols and formats such as HTTP, SOAP, JSON, etc. do not solve these problems for you.
In my previous job the project management passed on using message brokers early on, but later the team ended up implementing quick-and-dirty logic meant to solve some of the same issues as above in our web service architecture. We had less time to provide business value because we were fixing so many concurrency and error-recovery issues.
So while a message broker may seem on its face like a heavyweight solution, and may actually be more than you need right now, it does have a lot of benefits that you may need later without yet realizing it.
As wberry alluded to, a REST or web-hook based solution can be functional, but will not be very tolerant of failure. Paying the operations price up front for messaging will pay long-term dividends, as you will find additional problems which are a natural fit for the messaging model.
Regarding other OSS options: if you are considering stream-based processing in addition to this specific use case, I would recommend taking a look at Apache Kafka. Kafka provides similar messaging semantics to RabbitMQ, but is tightly focused on processing message streams (not to mention that it has been battle-tested in production at LinkedIn).
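For illustration, a minimal sketch of this file-announcement pattern on Kafka, using the kafka-python client (broker address, topic name, and file path are assumptions):

from kafka import KafkaProducer, KafkaConsumer

# Server A: announce each exported file as it becomes ready.
producer = KafkaProducer(bootstrap_servers="broker:9092")
producer.send("table-exports", b"/exports/customers.csv.gz")
producer.flush()

# Server B: consume announcements and kick off the transfer-and-load.
consumer = KafkaConsumer("table-exports", bootstrap_servers="broker:9092",
                         group_id="dw-loaders")
for msg in consumer:
    path = msg.value.decode()
    # scp + uncompress + load the file named by `path` here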
