Most of my work involves "embarrassingly parallel" computations, i.e. relatively CPU-heavy tasks looped over a range of parameters, with no need for communication between the individual computations.
I would like to be able to share computations between my workplace PC and my home PC (and potentially other workplace PCs).
My question is about how to achieve this in the easiest possible way. All PCs are behind firewalls, but the PCs at the workplace share a common (Windows) domain, if that makes a difference.
Thanks!
I'm working on a project where each time you double the number of threads, you add between 40% and 60% of overhead. Since hyperthreading improves performance by at most about 30%, this means the program runs slower than in single-threaded mode on hyperthreaded systems.
The first steps seem simple (a sketch follows these steps):
Get the number of threads on the system through len(os.sched_getaffinity(0))
Restrict the number of threads through z3 parameters.
Bind the threads to physical cores using os.sched_setaffinity(0, mask).
Leave SMT enabled on systems whose platform.machine() does not contain Intel or AMD.
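For context, the straightforward version of those first steps might look roughly like this on Linux; the z3 parameter names and the core mask below are assumptions rather than anything I have verified:

    import os
    import z3

    # 1. Number of CPUs this process may use (Linux-only API).
    n_threads = len(os.sched_getaffinity(0))

    # 2. Restrict z3's thread count -- the parameter names are an
    #    assumption, check the z3 documentation for the version you ship.
    z3.set_param("parallel.enable", True)
    z3.set_param("parallel.threads.max", n_threads)

    # 3. Pin the process to a chosen set of cores. Which IDs map to
    #    physical cores is exactly the open question below, so this
    #    mask is only a placeholder.
    physical_cores = {0, 2, 4, 6}   # hypothetical
    os.sched_setaffinity(0, physical_cores)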
However, several problems arise when doing this.
How do I know whether the system has hyperthreading enabled?
Before using os.sched_setaffinity(0, mask), how do I know which CPU core numbers are physical and which are logical?
The problem is that the program currently supports a wide range of platforms through Python 3: all Unixes, as well as Windows, macOS, and OpenVMS, not forgetting PyPy.
Any patch to fix the problem shouldn't spawn a new process, add a dependency that isn't already included, or drop support for any of the platforms above.
What would be a clean way to fix this?
The loky library contains a fairly portable solution to this. It does spawn a process, and then caches the result -- so it's not like you're spawning a process more than once. Given this is the solution which backs popular libraries like sklearn, I would guess that it's almost as good as it gets.
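If I remember the API correctly, the call you want is roughly the following; only_physical_cores appeared in a relatively recent loky release, so check the version you have installed:

    from loky import cpu_count

    # Physical core count as detected by loky (the backend behind
    # joblib/sklearn); falls back to logical cores where it cannot tell.
    n_physical = cpu_count(only_physical_cores=True)
    n_logical = cpu_count()
    print(n_physical, n_logical)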
I'm working on a robot that uses a CNN that needs much more memory than my embedded computer (Jetson TX1) can handle. I was wondering if it would be possible (with an extremely low-latency connection) to outsource the heavy computations to EC2 and send the results back to be used in a Python script. If this is possible, how would I go about it, and what would the latency look like (not the computations, just the sending to and from)?
I think it's certainly possible. You would need some scripts or a web server to transfer data to and from. Here is how I think you might achieve it:
Send all your training data to an EC2 instance
Train your CNN
Save the weights and/or any other generated parameters you may need
Construct the CNN on your embedded system and input the weights from the EC2 instance (see the sketch below these steps). Since you won't be needing to do any training here and won't need to load in the training set, the memory usage will be minimal.
Use your embedded device to predict whatever you may need
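As a concrete illustration of the save/load steps, here is a minimal sketch that assumes a Keras model and a made-up weights file name; adapt it to whatever framework your CNN actually uses:

    from tensorflow import keras   # assuming a Keras/TensorFlow CNN

    def build_cnn():
        # placeholder architecture -- use your real network here
        return keras.Sequential([
            keras.layers.Conv2D(16, 3, activation="relu",
                                input_shape=(64, 64, 3)),
            keras.layers.GlobalAveragePooling2D(),
            keras.layers.Dense(10, activation="softmax"),
        ])

    # On the EC2 instance, after training:
    #   model.save_weights("cnn_weights.h5")   # hypothetical file name

    # On the embedded device: rebuild the identical architecture and
    # load only the weights -- no training set has to fit in memory.
    model = build_cnn()
    model.load_weights("cnn_weights.h5")
    # predictions = model.predict(batch_of_images)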
It's hard to give you an exact answer on latency because you haven't given enough information. The exact latency is highly dependent on your hardware, internet connection, amount of data you'd be transferring, software, etc. If you're only training once on an initial training set, you only need to transfer your weights once and thus latency will be negligible. If you're constantly sending data and training, or doing predictions on the remote server, latency will be higher.
Possible: of course it is.
You can use any kind of RPC to implement this: HTTPS requests, XML-RPC, raw UDP packets, and many more. If you're more interested in latency and only sending small amounts of data, then something UDP-based could be better than TCP, but you'd need to build extra logic for ordering the messages and retrying the lost ones. Alternatively, something like ZeroMQ could help.
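For example, a minimal request/reply round trip with pyzmq might look like this; the endpoint, the run_cnn function and the payload shape are all made up for illustration:

    import zmq

    # Server side (on EC2): receive an input, run the model, reply.
    def serve(run_cnn, port=5555):
        ctx = zmq.Context()
        sock = ctx.socket(zmq.REP)
        sock.bind("tcp://*:%d" % port)
        while True:
            request = sock.recv_pyobj()                  # blocks for a request
            sock.send_pyobj(run_cnn(request["image"]))   # run_cnn: your model code

    # Client side (on the robot): connect once, then send frames as needed.
    def make_client(endpoint="tcp://your-ec2-host:5555"):  # hypothetical host
        ctx = zmq.Context()
        sock = ctx.socket(zmq.REQ)
        sock.connect(endpoint)
        def classify(frame):
            sock.send_pyobj({"image": frame})
            return sock.recv_pyobj()                     # blocks until EC2 replies
        return classify

The REQ/REP pattern keeps the ordering logic out of your hands; if you go raw UDP instead, you take on sequencing and retransmission yourself.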
As for the latency: only you can answer that, because it depends on where you're connecting from. Start up an instance in the region closest to you and run ping or mtr against it to find out the round-trip time. That's the absolute minimum you can achieve; your processing time goes on top of that.
I am a former employee of CENAPAD-UFC (National Centre of HPC, Federal University of Ceará), so I have something to say about outsourcing computer power.
CENAPAD has a big cluster, and it provides computational power for academic research. There, professors and students send in their computations and data, define the output they need, and go drink a coffee or two while the cluster gets on with the hard work. After lots of FLOPs, the job finishes, and they retrieve the results via ssh and go back to their laptops.
For big chunks of computation, you want to minimize any work that is not useful computation. One such thing is communication between separate computers. If you need to know when the computation has ended, let the HPC machine tell you.
To compute things effectively, you may want to go deeper into the machine and perform some kind of distribution. I use OpenMP to distribute computation across threads within the same machine. To distribute between computers that are physically separate but close in terms of latency, I use MPI. I also installed another cluster at UFC for a different department; there, the researchers used only MPI.
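If you go the MPI route from Python, mpi4py is the usual binding. A minimal scatter/gather of a parameter range might look like this (heavy_computation is only a stand-in for your real work; run it with something like mpirun -n 4 python compute.py):

    from mpi4py import MPI
    import numpy as np

    def heavy_computation(p):
        return p * p              # stand-in for the real per-parameter work

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Rank 0 splits the parameter range; every rank receives one chunk.
    chunks = np.array_split(np.arange(1000), size) if rank == 0 else None
    chunk = comm.scatter(chunks, root=0)

    local = np.array([heavy_computation(p) for p in chunk])

    results = comm.gather(local, root=0)
    if rank == 0:
        print(np.concatenate(results))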
Maybe some reading about distributed/grid/clustered computing will help you:
https://en.m.wikipedia.org/wiki/SETI#home ; the first example of massive distributed computing that I ever heard of
https://en.m.wikipedia.org/wiki/Distributed_computing
https://en.m.wikipedia.org/wiki/Grid_computing
https://en.m.wikipedia.org/wiki/Supercomputer ; this is CENAPAD-like stuff
In my opinion, you want a grid-like setup, with your personal PC working as a master node that calls the EC2 slaves. In this scenario, use master-to-slave communication only to send the program (if really needed) and the data, so that the master is left free to do other work unrelated to the data it sent; also, let the slave tell your master when the computation has reached its end.
I implemented a distributed monitoring solution that I now need to evaluate. For that, it would be good to scale the number of monitoring agents and measure the resulting overhead (CPU, memory, disk, network) on each host (or on the network). Unfortunately, I don't have enough physical nodes. I primarily experimented with mininet, but I ran into negative timing and scheduling effects. In addition, CPU or memory usage on a host is difficult to investigate when everything is emulated on a single physical host.
Hence, I tried simpy, but I'm not able to monitor the CPU, memory, disk and network usage of processes with it. Is there any way to do so? It would be good to have a solution for the evaluation that does not depend on host resources and exact timing, as a simulation does. But I'm not sure whether simpy is a good choice or how it could provide this.
Upon hearing that the scientific computing project (it happens to be the stochastic tractography method described here) I'm currently running for an investigator would take 4 months on our 50-node cluster, the investigator has asked me to examine other options. The project is currently using Parallel Python to farm out chunks of a 4D array to different cluster nodes and put the processed chunks back together.
The jobs I'm currently working with are probably much too coarsely grained (5 seconds to 10 minutes; I had to increase the default timeout in Parallel Python), and I estimate I could speed up the process by 2-4 times by rewriting it to make better use of resources (splitting up and putting back together the data takes too long; that should be parallelized as well). Most of the work is done with numpy arrays.
Let's assume that 2-4 times isn't enough, and I decide to get the code off of our local hardware. For high throughput computing like this, what are my commercial options and how will I need to modify the code?
You might be interested in PiCloud. I have never used it, but their offer apparently includes the Enthought Python Distribution, which covers the standard scientific libraries.
It's tough to say if this will work for your specific case, but the Parallel Python interface is pretty generic. So hopefully not too many changes would be needed. Maybe you can even write a custom scheduler class (implementing the same interface as PP). Actually that might be useful for many people, so maybe you can drum up some support in the PP forum.
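For reference, the generic PP interface this refers to looks roughly like the following (from memory, so treat the details as approximate; chunks stands for your already-split 4D array):

    import pp

    # Local workers plus any remote ppserver instances you have running.
    ppservers = ("node1:60000", "node2:60000")   # hypothetical hosts
    job_server = pp.Server(ppservers=ppservers)

    def process_chunk(chunk):
        # stand-in for the real per-chunk numpy processing
        return chunk.sum()

    jobs = [job_server.submit(process_chunk, (chunk,), (), ("numpy",))
            for chunk in chunks]                 # chunks: your split 4D array
    results = [job() for job in jobs]            # each call blocks until done

A custom scheduler would only have to reproduce that Server/submit surface while dispatching to cloud instances underneath.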
The most obvious commercial options which come to mind are Amazon EC2 and the Rackspace Cloud. I have played with both and found the Rackspace API a little easier to use.
The good news is that you can prototype and play with their compute instances (short- or long-lived virtual machines of the OS of your choice) for very little investment, typically $US 0.10 / hr or so. You create them on demand and then release them back to the cloud when you are done, and only pay for what you use. For example, I saw a demo on Django deployment using 6 Rackspace instances which took perhaps an hour and cost the speakers less than a dollar.
For your use case (not clear exactly what you meant by 'high throughput'), you will have to look at your budget and your computing needs, as well as your total network throughput (you pay for that, too). A few small-scale tests and a simple spreadsheet calculation should tell you if it's really practical or not.
There are Python APIs for both Rackspace Cloud and Amazon EC2. Whichever you use, I recommend python-based Fabric for automated deployment and configuration of your instances.
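For example, pushing your worker code onto a fresh instance with Fabric can be as little as this (assuming the Fabric 2 Connection API and made-up host names and paths):

    from fabric import Connection   # Fabric 2.x API

    # Hypothetical instance address and paths -- substitute your own.
    c = Connection("ubuntu@ec2-203-0-113-10.compute-1.amazonaws.com")
    c.put("worker.tar.gz", remote="/home/ubuntu/")
    c.run("tar xzf worker.tar.gz")
    c.run("pip install --user -r worker/requirements.txt")
    c.run("nohup python worker/run.py > worker.log 2>&1 &", pty=False)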
What are some best practices for prototyping a filesystem?
I've had an attempt in Python using fusepy, and now I'm curious:
In the long run, should any respectable filesystem implementation be in C? Will not being in C hamper portability, or eventually cause performance issues?
Are there other implementations like FUSE?
Evidently core filesystem technology moves slowly (FAT32, ext3, NTFS; everything else is small fish), so what debugging techniques are employed?
What is the general course filesystem development takes in arriving at a highly optimized, fully supported implementation in major OSs?
A filesystem that lives in userspace (be that in FUSE or the Mac version thereof) is a very handy thing indeed, but will not have the same performance as a traditional one that lives in kernel space (and thus must be in C). You could say that's the reason that microkernel systems (where filesystems and other things live in userspace) never really "left monolithic kernels in the dust" as A. Tanenbaum so assuredly stated when he attacked Linux in a famous posting on the Minix mailing list almost twenty years ago (as a CS professor, he said he'd fail Linus for choosing a monolithic architecture for his OS -- Linus of course responded spiritedly, and the whole exchange is now pretty famous and can be found in many spots on the web;-).
Portability's not really a problem, unless perhaps you're targeting "embedded" devices with very limited amounts of memory -- with the exception of such devices, you can run Python where you can run C (if anything it's the availability of FUSE that will limit you, not that of a Python runtime). But performance could definitely be.
In the long run, should any respectable filesystem implementation be in C? Will not being in C hamper portability, or eventually cause performance issues?
Not necessarily; there are plenty of performant languages other than C (OCaml and C++ are the first that come to mind). In fact, I expect NTFS to be written in C++. The thing is, you seem to come from a Linux background, and since the Linux kernel is written in C, any filesystem that hopes to be merged into the kernel has to be written in C as well.
Are there other implementations like FUSE?
There are a couple for Windows, for example http://code.google.com/p/winflux/ and http://dokan-dev.net/en/, in various stages of maturity.
Evidently core filesystem technology moves slowly (FAT32, ext3, NTFS; everything else is small fish), so what debugging techniques are employed?
Again, that is mostly true of Windows; in Solaris you have ZFS, and in Linux ext4 and btrfs exist. Debugging techniques usually involve turning machines off in the middle of various operations to see what state the data is left in, and storing huge amounts of data to see how performance holds up.
What is the general course filesystem development takes in arriving at a highly optimized, fully supported implementation in major OSs?
Again, this depends on which OS, but it does involve a fair amount of testing, especially making sure that failures do not lose data.
I recommend you create a mock object for the kernel block device API layer. The mock layer should use an mmap'd file as a backing store for the file system; a minimal sketch follows the list below. There are a lot of benefits to doing this:
Extremely fast FS performance for running unit test cases.
Ability to insert debug code/break points into the mock layer to check for failure conditions.
Easy to save multiple copies of the file system state for study or running test cases.
Ability to deterministically introduce block device errors or other system events that the file system will have to handle.
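A Python version of that idea can stay very small; the block size, the error-injection hook and the method names below are made up for illustration:

    import mmap
    import os

    BLOCK_SIZE = 4096

    class MockBlockDevice:
        """mmap-backed stand-in for the kernel block device layer."""

        def __init__(self, path, n_blocks):
            self._fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
            os.ftruncate(self._fd, n_blocks * BLOCK_SIZE)
            self._map = mmap.mmap(self._fd, n_blocks * BLOCK_SIZE)
            self.fail_on = set()   # block numbers that should raise I/O errors

        def read_block(self, n):
            if n in self.fail_on:
                raise IOError("injected read failure on block %d" % n)
            start = n * BLOCK_SIZE
            return self._map[start:start + BLOCK_SIZE]

        def write_block(self, n, data):
            if n in self.fail_on:
                raise IOError("injected write failure on block %d" % n)
            start = n * BLOCK_SIZE
            self._map[start:start + BLOCK_SIZE] = data.ljust(BLOCK_SIZE, b"\0")

        def close(self):
            self._map.flush()
            self._map.close()
            os.close(self._fd)

Your filesystem code then talks only to read_block/write_block, so the same test suite can later be pointed at a real device.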
Respectable filesystems will be fast and efficient. For Linux, that will basically mean writing in C, because you won't be taken seriously if you're not distributed with the kernel.
As for other tools like FUSE, there's MacFUSE, which will allow you to use the same code on Macs as well as Linux.