Evaluation / Simulation of existing python program - python

I implemented a distributed monitoring solution that I want to analyze. To do so, I would like to scale the number of monitoring agents and evaluate the resulting overhead (CPU, memory, disk, network) on each host (or on the network). Unfortunately, I do not have enough physical nodes. I therefore first tried mininet, but I ran into negative timing and scheduling effects. In addition, CPU or memory usage on a host is difficult to investigate when emulating on only one physical host.
Hence, I tried simpy, but I am not able to monitor the CPU, memory, disk and network usage of processes there. Is there any way to do so? It would be good to have an evaluation solution that does not depend on host resources and correct timing, as a simulation does. But I'm not sure whether simpy is a good choice or how this tool could provide this.

Related

How to bind a process to only physical cores in a cross system way?

I’m using a project where each time you double the number of threads, you add between 40% and 60% overhead. As hyperthreading increases performance by at most 30%, this means the program runs slower than in single-thread mode on hyperthreaded systems.
The first steps seem to be simple.
Get the number of threads on the system through len(os.sched_getaffinity(0))
Restrict the number of threads through z3 parameters.
Bind the threads to physical cores using os.sched_setaffinity(0,mask).
Leave smt solutions enabled for systems not containing Intel or amd inside platform.machine().
However, several problems arise when doing this.
How to know if the system has hyperthreading enabled?
Before using os.sched_setaffinity(0,mask), how to know which cpu core numbers are physical or logical?
The problem is that the program currently supports a wide range of platforms through Python 3: all Unixes, as well as Windows, OS X and OpenVMS, not forgetting PyPy.
Any patch to fix the problem should neither spawn a new process, nor add a dependency that is not already included, nor drop support for any of the platforms above.
What can be a clean way to fix this?
The loky library contains a fairly portable solution to this. It does spawn a process, and then caches the result -- so it's not like you're spawning a process more than once. Given this is the solution which backs popular libraries like sklearn, I would guess that it's almost as good as it gets.
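For Linux only, the question of which logical CPU numbers belong to the same physical core can be approached without spawning a process by reading the kernel's sysfs topology files. The following is a minimal sketch under that assumption, not loky's implementation and not a cross-platform answer. The helper names (`parse_cpu_list`, `one_cpu_per_core`, `physical_core_mask`) are made up for illustration; on systems without these sysfs files the function returns None and the caller should leave the affinity alone.

```python
import glob
import os


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0-3,8" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def one_cpu_per_core(sibling_lists):
    """Pick the lowest-numbered logical CPU from each group of SMT siblings."""
    return {min(parse_cpu_list(s)) for s in sibling_lists if s.strip()}


def physical_core_mask():
    """Return one logical CPU id per physical core, or None if unknown."""
    # Assumption: Linux sysfs layout; each file lists the hardware threads
    # that share a core with the given logical CPU.
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    sibling_lists = []
    for path in glob.glob(pattern):
        try:
            with open(path) as f:
                sibling_lists.append(f.read())
        except OSError:
            continue
    if not sibling_lists:
        return None  # not Linux, or sysfs not available
    mask = one_cpu_per_core(sibling_lists)
    if hasattr(os, "sched_getaffinity"):
        mask &= os.sched_getaffinity(0)  # respect an existing restriction
    return mask or None


# Possible usage (not executed here):
#   mask = physical_core_mask()
#   if mask is not None and hasattr(os, "sched_setaffinity"):
#       os.sched_setaffinity(0, mask)
```

If every sibling list contains a single CPU, hyperthreading is off or absent and the mask is simply the current affinity set.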

How can I tell what strain is caused by pyvmomi on vmware?

I need to write a Python script that gathers large amounts of information about vCenters. I basically call every single diagnostic possible, one after the other, and store this info in a database daily for machine learning. The script logs into the vCenter via Python's pyvmomi and then requests information about all resource centers, about all clusters, about the hosts on each cluster, and about every VM on every host. Where can I see the strain on vCenter? Where is it hosted? I guess the issue is that I've been assigned a task, and I can gather documentation and get it done, but I don't want to cause strain on tools that are very important for our business. How can I be sure this does not cause issues with bandwidth or slow down important processes like CPU sharing, memory allocation, and host switching?
You can see the overall stats of the vCenter in the VAMI (vCenter Appliance Management Interface), where you can monitor the CPU and RAM utilization. If you want deeper, more precise, info, it would require logging into the vCenter locally, which is generally not recommended.
vCenter will, in general, handle your queries sequentially. They will queue up if the vCenter can't handle them immediately. The issue is that it will also queue up the other tasks being processed and there's no real way to give tasks a priority.
If you have vROps running in your environment, that might be a better place to pull all of those metrics... assuming they're available there.

Does multiprocessing speed up file transfers compared to multithreading

I am writing a script to simultaneously accept many file transfers from many computers on a subnet using sockets (around 40 jpg files in total). I want to use multithreading or multiprocessing to make the transfer occur as fast as possible.
I'm wondering if this type of image transfer is limited by the CPU - and therefore I should use multiprocessing - or if multithreading will be just as good here.
I would also be curious as to what types of activities are limited by the CPU and require multiprocessing, and which are better suited for multithreading.
If the following assumptions are true:
Your script is simply receiving data from the network and writing that data to disk (more or less) verbatim, i.e. it isn't doing any expensive processing on the data
Your script is running on a modern CPU with typical modern networking hardware (e.g. gigabit Ethernet or slower)
Your script's download routines are not grossly inefficient (e.g. you are receiving reasonably-sized chunks of data and not just 1 byte at a time or something silly like that)
... then it's unlikely that your download rate will be CPU-limited. More likely the bottleneck will be either network bandwidth or disk I/O bandwidth.
In any case, since AFAICT your use-case is embarrassingly parallel (i.e. the various downloads never have to communicate or interact with each other, they just each do their own thing independently), it's unlikely that using multithreading vs multiprocessing will make much difference in terms of performance. Of course, the only way to be certain is to try it both ways and measure the throughput each way.
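As a starting point for such a measurement, here is a minimal sketch of the threaded variant, assuming each sender opens one connection, streams one file, and then closes the connection. The function names and the 64 KiB chunk size are illustrative choices, not part of the original script.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def receive_to_file(conn, path, chunk_size=64 * 1024):
    """Read from a connected socket until the peer closes it; write to path."""
    total = 0
    with conn, open(path, "wb") as out:
        while True:
            chunk = conn.recv(chunk_size)
            if not chunk:  # empty bytes means the sender closed the connection
                break
            out.write(chunk)
            total += len(chunk)
    return total


def receive_all(connections_and_paths, max_workers=8):
    """Receive several transfers concurrently, one worker thread per transfer."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(receive_to_file, conn, path)
                   for conn, path in connections_and_paths]
        # Byte counts are returned in the same order as the input pairs.
        return [f.result() for f in futures]
```

Timing `receive_all` with `time.perf_counter()` and dividing the total byte count by the elapsed time gives a throughput figure to compare against a process-based variant.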
Short answer:
Generally, it really depends on your workload. If you're serious about performance, please provide details, for example whether you store the images to disk, whether image sizes are > 1GB or not, etc.
Note: Generally again, if it is not mission-critical, both ways are acceptable, since we can easily switch between multithreaded and multiprocess implementations using threading.Thread and multiprocessing.Process.
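The switch is easy because threading.Thread and multiprocessing.Process both accept `target` and `args` and expose `start()` and `join()`. A minimal sketch of that idea follows; `run_workers` and `WORKER_KINDS` are hypothetical names used only for illustration.

```python
import multiprocessing
import threading

# Both classes share the same constructor keywords and start()/join() methods.
WORKER_KINDS = {
    "thread": threading.Thread,
    "process": multiprocessing.Process,
}


def run_workers(worker_cls, target, args_list):
    """Start one worker per argument tuple and wait for all of them."""
    workers = [worker_cls(target=target, args=args) for args in args_list]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return len(workers)


# Possible usage (not executed here), with a module-level function download(name):
#   if __name__ == "__main__":
#       jobs = [("a.jpg",), ("b.jpg",)]
#       run_workers(WORKER_KINDS["thread"], download, jobs)
#       run_workers(WORKER_KINDS["process"], download, jobs)
```

With processes, the target must be a picklable module-level function on platforms that do not fork, and the workers do not share memory with the parent, so results have to be passed back explicitly (for example through a multiprocessing.Queue).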
Some more comments:
It seems that not CPU but IO will be the bottleneck.
For multiprocess / multithread, due to the GIL and/or your implementation, there may be a performance difference. You may implement both ways and try them. BTW, IMHO it won't differ much. I think that async IO vs blocking IO will have a greater impact.
Unless your file transfer is extremely slow, i.e. slower than writing the data to disk, multithreading/multiprocessing isn't going to help. By file transfer I mean downloading images and writing them to the local computer with a single HDD.
Using multithreading or multiprocessing when transferring data from several computers with separate disks can definitely improve overall download performance, simply because data on several physical disks can be read in parallel. The problem arises when you try to save these images to your local drive.
You have just a single local HDD (if no disk array is used), and a single HDD, like most hardware devices, can do just one IO operation at a time. So trying to write several images to disk at the same time won't improve the overall performance; it can even hamper it.
Just imagine that 40 already downloaded images are being written to a single mechanical HDD with a single head, to different locations (different physical files), especially if the disk is fragmented. This can even slow down the whole process, because the HDD wastes time moving its magnetic head from one position to another (drives can partially mitigate this by reordering IO operations to limit head movement).
On the other hand, if you do some CPU-intensive preprocessing on these images and only then save them to disk, multithreading can be really helpful.
As for which is preferred: on modern OSs there is no significant difference between multithreading and multiprocessing (spawning multiple processes). OSs like Linux or Windows schedule threads, not processes, based on process and thread priorities, so there is not a big difference between 40 single-threaded processes and a single process containing 40 threads. Using multiple processes normally consumes more memory, because the OS has to allocate some extra memory (not much) for every process, but in terms of speed the difference between multithreading and multiprocessing is not significant. There are other important questions to consider when choosing a method: will the downloads share some data, such as a common GUI (multithreading is easier to use)? Are the files so big that 40 transfers can exhaust the virtual address space of a single process (use multiprocessing)?
Generally:
Multithreading - easier to use in a single application, because all threads share the virtual address space of a single process and can easily communicate with each other. On the other hand, a single process has a limited virtual address space (less than 4GB on a 32-bit computer).
Multiprocessing - harder to use in a single application (it needs inter-process communication), but more scalable and more robust (if a file transfer process crashes, only a single file transfer fails), with more virtual address space to use.

Python distributed computing over LAN

Most of my work involves "embarrassingly parallel" computations, i.e. relatively CPU heavy tasks looped over a range of parameters without any need for communication between the individual computations.
I would like to be able to share computations between my workplace PC and my home PC (and potentially other workplace PCs).
My question is about how to achieve this in the easiest way. All PCs are behind firewalls, but the PCs at the workplace share a common domain (Windows), if it makes a difference.
Thanks!

Possible to outsource computations to AWS and utilize results locally?

I'm working on a robot that uses a CNN that needs much more memory than my embedded computer (Jetson TX1) can handle. I was wondering if it would be possible (with an extremely low latency connection) to outsource the heavy computations to EC2 and send the results back to be used in a Python script. If this is possible, how would I go about it and what would the latency look like (not computations, just sending to and from)?
I think it's certainly possible. You would need some scripts or a web server to transfer data to and from. Here is how I think you might achieve it:
Send all your training data to an EC2 instance
Train your CNN
Save the weights and/or any other generated parameters you may need
Construct the CNN on your embedded system and input the weights from the EC2 instance. Since you won't be needing to do any training here and won't need to load in the training set, the memory usage will be minimal.
Use your embedded device to predict whatever you may need
It's hard to give you an exact answer on latency because you haven't given enough information. The exact latency is highly dependent on your hardware, internet connection, amount of data you'd be transferring, software, etc. If you're only training once on an initial training set, you only need to transfer your weights once and thus latency will be negligible. If you're constantly sending data and training, or doing predictions on the remote server, latency will be higher.
Possible: of course it is.
You can use any kind of RPC to implement this. HTTPS requests, xml-rpc, raw UDP packets, and many more. If you're more interested in latency and small amounts of data, then something UDP based could be better than TCP, but you'd need to build extra logic for ordering the messages and retrying the lost ones. Alternatively something like Zeromq could help.
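To make the RPC idea concrete, here is a minimal sketch using XML-RPC from the Python standard library, with server and client both on the local machine for demonstration. The function `heavy_computation` is a hypothetical stand-in for the real remote work; in the robot scenario the server part would run on the EC2 instance and the client part on the embedded computer.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def heavy_computation(values):
    # Stand-in for the real remote work (hypothetical): sum of squares.
    return sum(v * v for v in values)


def start_server(host="127.0.0.1", port=0):
    """Start an XML-RPC server in a background thread; return (server, port)."""
    # Port 0 asks the OS for any free port; the real one is read back below.
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(heavy_computation, "heavy_computation")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, server.server_address[1]


# Server side (would run on the remote machine).
server, port = start_server()

# Client side (would run on the embedded device).
with ServerProxy(f"http://127.0.0.1:{port}") as remote:
    result = remote.heavy_computation([1, 2, 3])

server.shutdown()
server.server_close()
```

XML-RPC is convenient but verbose on the wire, so for large arrays or tight latency budgets a binary protocol would likely be a better fit.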
As for the latency: only you can answer that, because it depends on where you're connecting from. Start up an instance in the region closest to you and run ping, or mtr against it to find out what's the roundtrip time. That's the absolute minimum you can achieve. Your processing time goes on top of that.
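If ping is blocked or you want the number from inside Python, timing TCP connection setup gives a rough estimate, since the handshake takes about one round trip. This is a minimal sketch under that assumption; the function names are illustrative, and the host and port you pass (for example the instance's address and an open port such as SSH) depend on your own setup.

```python
import socket
import statistics
import time


def tcp_connect_times(host, port, attempts=5, timeout=2.0):
    """Time TCP connection setup to host:port; returns seconds per attempt."""
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Return (min, median, max) in milliseconds."""
    ms = [s * 1000 for s in samples]
    return min(ms), statistics.median(ms), max(ms)
```

The minimum is usually the most useful figure here, because the larger samples include scheduling and queueing noise on top of the network path.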
I am a former employee of CENAPAD-UFC (National Centre of HPC, Federal University of Ceará), so I have something to say about outsourcing computer power.
CENAPAD has a big cluster, and it provides computational power for academic research. There, professors and students send their computation and their data, define the output and go drink a coffee or two while the cluster goes on with the hard work. After lots of flops, the operation ends and they retrieve the results via ssh and go back to their laptops.
For big chunks of computation, you want to minimize any work that is not useful computation. One such thing is communication between separate computers. If you need to know when the computation has ended, let the HPC machine tell you that.
To compute effectively, you may want to go deeper into the machine and perform some kind of distribution. I use OpenMP to distribute computation across threads within the same machine. To distribute between physically separate computers that are close to each other (in terms of latency), I use MPI. I have also installed another cluster at UFC for another department. There, the researchers used only MPI.
Maybe some reading on distributed/grid/cluster computing will help you:
https://en.m.wikipedia.org/wiki/SETI#home ; the first example of massive distributed computing that I ever heard
https://en.m.wikipedia.org/wiki/Distributed_computing
https://en.m.wikipedia.org/wiki/Grid_computing
https://en.m.wikipedia.org/wiki/Supercomputer ; this is CENAPAD like stuff
In my opinion, you want a grid-like computation, with your personal PC working as a master node that calls the EC2 slaves. In this scenario, use master-to-slave communication only to send the program (if really needed) and the data, so that the master can do other things unrelated to the sent data in the meantime. Also, let the slave tell your master when the computation has reached its end.
