Can I write to an HDF5 file from multiple processes/threads? - python

Does hdf5 support parallel writes to the same file, from different threads or from different processes? Alternatively, does hdf5 support non-blocking writes?
If so then is this also supported by NetCDF4, and by the python bindings for either?
I am writing an application where I want different CPU cores to concurrently compute output intended for non-overlapping tiles of a very large output array. (Later I will want to read sections from it as a single array, without needing my own driver to manage indexing many separate files, and ideally without the additional IO task of rearranging it on disk.)

Not trivially, but there are various potential work-arounds.
The ordinary HDF5 library apparently does not even support concurrent reading of different files by multiple threads. Consequently, NetCDF4 and the Python bindings for either will not support parallel writing.
If the output file is pre-initialised and has chunking and compression disabled, to avoid having a chunk index, then (in principle) concurrent non-overlapping writes to the same file by separate processes might work(?).
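A rough sketch of that idea with h5py and multiprocessing (filenames and shapes are illustrative; plain HDF5 makes no guarantee that this is safe, it merely may work because the writes never overlap and the file layout never changes):

    import h5py
    import numpy as np
    from multiprocessing import Process

    FILENAME = "output.h5"     # hypothetical output path
    SHAPE = (4, 1000)          # one row per worker process, for illustration

    def init_file():
        # Pre-create the full dataset with no chunking and no compression,
        # so the on-disk layout is fixed before any worker opens the file.
        with h5py.File(FILENAME, "w") as f:
            f.create_dataset("tiles", shape=SHAPE, dtype="f8")

    def worker(row):
        # Each process writes only to its own, non-overlapping row.
        with h5py.File(FILENAME, "r+") as f:
            f["tiles"][row, :] = np.full(SHAPE[1], row, dtype="f8")

    if __name__ == "__main__":
        init_file()
        procs = [Process(target=worker, args=(i,)) for i in range(SHAPE[0])]
        for p in procs:
            p.start()
        for p in procs:
            p.join()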
More recent versions of HDF5 (1.10 and later) support virtual datasets. Each process would write output to a different file, and afterward a new container file would be created, consisting of references to the individual data files (but otherwise readable like a normal HDF5 file).
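With h5py (2.9+) the stitching step might look roughly like this, assuming each worker has already written its tile to its own file (file and dataset names are illustrative):

    import h5py

    n_tiles, tile_len = 4, 1000     # illustrative sizes

    # Build a container file whose "tiles" dataset just points at the per-process files.
    layout = h5py.VirtualLayout(shape=(n_tiles, tile_len), dtype="f8")
    for i in range(n_tiles):
        # Assumes worker i wrote a dataset named "tile" into tile_<i>.h5.
        layout[i, :] = h5py.VirtualSource(f"tile_{i}.h5", "tile", shape=(tile_len,))

    with h5py.File("container.h5", "w") as f:
        f.create_virtual_dataset("tiles", layout)
        # f["tiles"] can now be read like an ordinary (n_tiles, tile_len) dataset.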
There exists a "Parallel HDF5" library for MPI. Although MPI might otherwise seem like overkill, it would have advantages if scaling up later to multiple machines.
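If h5py is built against a parallel (MPI-enabled) HDF5, a collective write looks roughly like this (names and sizes are illustrative):

    # Run with e.g.: mpiexec -n 4 python write_mpi.py
    from mpi4py import MPI
    import h5py

    comm = MPI.COMM_WORLD
    tile_len = 1000                  # illustrative tile size

    with h5py.File("parallel.h5", "w", driver="mpio", comm=comm) as f:
        dset = f.create_dataset("tiles", shape=(comm.size, tile_len), dtype="f8")
        dset[comm.rank, :] = comm.rank   # each rank writes its own row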
If writing output is not a performance bottleneck, a multithreaded application could probably implement one output thread (utilising some form of queue data-structure).
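A minimal single-writer sketch using the standard library queue module (dataset and file names are illustrative, and the dataset is assumed to have been created beforehand):

    import queue
    import threading

    import h5py

    write_queue = queue.Queue()
    STOP = object()                   # sentinel used to shut the writer down

    def writer(path):
        # The only thread that ever touches the HDF5 file.
        with h5py.File(path, "r+") as f:
            dset = f["tiles"]
            while True:
                item = write_queue.get()
                if item is STOP:
                    break
                row, data = item
                dset[row, :] = data

    t = threading.Thread(target=writer, args=("output.h5",), daemon=True)
    t.start()
    # Compute threads enqueue results: write_queue.put((row, computed_tile))
    # and the main thread finally calls write_queue.put(STOP) followed by t.join().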
[Edit:] Another option is to use zarr format instead, which places each chunk in a separate file (an approach which future versions of HDF currently seem likely to adopt).
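A sketch with the zarr package, where one chunk per tile means concurrent non-overlapping writers never touch the same file (names and shapes are illustrative):

    import numpy as np
    import zarr

    # One chunk per tile: writers that touch different chunks write different files,
    # so concurrent non-overlapping writes do not conflict.
    z = zarr.open("output.zarr", mode="w",
                  shape=(4, 1000), chunks=(1, 1000), dtype="f8")

    # In each worker process (illustrative):
    z[2, :] = np.random.rand(1000)    # updates exactly one chunk on disk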

If you are running in AWS, check out HDF Cloud: https://www.hdfgroup.org/solutions/hdf-cloud.
This is a service that enables multiple reader/multiple writer workflows and is largely feature compatible with the HDF5 library.
The client SDK doesn't support non-blocking writes, but of course if you are using the REST API directly you could do non-blocking I/O just like you would with any http-based service.

Related

How to dynamically share input data to parallel processes in Python

I have several parallel processes working on the same data. The said data is composed of 100,000+ arrays (all stored in an HDF5 file, 90,000 values per array).
For now, each process accesses the data individually, and it works well since the HDF5 file supports concurrent reading... but only up to a certain number of parallel processes. Above 14-16 processes accessing the data, I see a drop in efficiency. I was expecting it (too many I/O operations on the same file, I reckon), but I don't know how to correct this problem properly.
Since the processes all use the same data, the best would be for the main process to read the file, load the array (or a batch of arrays), and feed it to the running parallel processes, without needing to stop them. A kind of dynamic shared memory if you will.
Is there any way to do it properly and solve my scalability issue?
I use the native "multiprocessing" Python library.
Thanks,
Victor
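A rough sketch of the single-reader idea described above, using multiprocessing.shared_memory (Python 3.8+) so the main process loads from HDF5 once and the workers attach without copying; the file and dataset names are illustrative and the example is simplified to one big array:

    import numpy as np
    from multiprocessing import Process, shared_memory

    import h5py

    def worker(shm_name, shape, dtype):
        # Attach to the block the main process filled; no per-worker copy of the data.
        shm = shared_memory.SharedMemory(name=shm_name)
        arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        ...                                # read-only work on arr goes here
        shm.close()

    if __name__ == "__main__":
        with h5py.File("data.h5", "r") as f:      # illustrative file/dataset names
            data = f["arrays"][...]               # only the main process reads the file

        shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
        np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)[:] = data

        procs = [Process(target=worker, args=(shm.name, data.shape, data.dtype))
                 for _ in range(8)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()

        shm.close()
        shm.unlink()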

Store objects between remote functions in Ray

I'm working on a project which uses the same data a ton of times, and I have been using Ray to scale this up in a cluster setting; however, the files are too large to send back and forth or keep in the Ray object store all the time. Is there a way to save the Python objects on the local nodes between the calls of the remote functions?
Writing to files always tends to be tricky in distributed systems since regular file systems aren't shared between machines. Ray generally doesn't interfere with the file system, but I think you have a few options here.
Expand the object store size: You can change the plasma store's size and where it is stored by setting the --object-store-memory and --plasma-directory flags.
Use a distributed file system: Distributed filesystems like NFS allow you to share part of your filesystem across machines. If you manually set up an NFS share, you can direct Ray to write to a file within NFS.
Don't use a filesystem: While this is technically a non-answer, this is arguably the most typical approach to distributed systems. Instead of writing to your filesystem, consider writing to S3 or a similar key-value or blob store.
Downsides of these approaches:
The biggest downside of (1) is that if you aren't careful, you could badly affect your performance.
The biggest downside of (2) is that it can be slow, particularly if you need to read and write data from multiple nodes. A secondary downside is that you will have to set up NFS yourself.
The biggest downside to (3) is that you're now relying on an external service, and it arguably isn't a direct solution to your problem.
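A sketch of option (3), writing results to S3 from inside a Ray task with boto3 (the bucket name and the "processing" step are illustrative):

    import boto3
    import ray

    ray.init()

    @ray.remote
    def process_and_upload(part_id, payload):
        # Do the work on the worker node, then push the (large) result straight
        # to S3 instead of routing it back through the object store.
        result = payload.upper()                       # stand-in for real processing
        boto3.client("s3").put_object(
            Bucket="my-results-bucket",                # hypothetical bucket name
            Key=f"results/part-{part_id}",
            Body=result.encode())
        return part_id                                 # only a tiny handle comes back

    done = ray.get([process_and_upload.remote(i, "some data") for i in range(4)])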

Python 3 - Faster Print & I/O

I'm currently involved in a Python project that involves handling massive amounts of data. In this, I have to print massive amounts of data to files. They are always one-liners, but sometimes consisting of millions of digits.
The actual mathematical operations in Python only take seconds, minutes at most. Printing them to a file takes up to several hours; which I don't always have.
Is there any way of speeding up the I/O?
From what I figure, the number is stored in RAM (or at least I assume so; it's the only thing that would take up 11GB of RAM), but Python does not print it to a text file immediately. Is there a way to dump that information -- if it is the number -- to a file? I've tried Task Manager's Dump, which gave me a 22GB dump file (yes, you read that right), and it doesn't look like what I was looking for is in there, although it wasn't very clear.
If it makes a difference, I have Python 3.5.1 (Anaconda and Spyder), Windows 8.1 x64 and 16GB RAM.
By the way, I do run Garbage Collect (gc module) inside the script, and I delete variables that are not needed, so those 11GB aren't just junk.
If you are indeed I/O bound by the time it takes to write the file, multi-threading with a pool of threads may help. Of course, there is a limit to that, but at least, it would allow you to issue non-blocking file writes.
Multithreading could speed it up (have printer threads with an in-memory queue that you write to).
From a system design standpoint, maybe evaluate whether or not you need to write everything to the file. Perhaps consider creating various levels of logging so that a release mode could run faster (if that makes sense in your context).
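A small sketch of the non-blocking write idea from the answers above, using a thread pool so the main thread never waits on the disk (filenames and the "result" are illustrative):

    from concurrent.futures import ThreadPoolExecutor

    def write_result(path, text):
        # Runs in a worker thread, so the main thread can keep computing.
        with open(path, "w") as f:
            f.write(text)

    with ThreadPoolExecutor(max_workers=2) as pool:
        for i in range(10):
            big_text = "9" * 1_000_000                 # stand-in for a million-digit result
            pool.submit(write_result, f"out_{i}.txt", big_text)
        # Leaving the with-block waits for the queued writes to finish.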
Use HDF5 file format
The problem is, you have to write a lot of data.
HDF5 is a format that is very efficient in terms of size and allows the data to be accessed by various tools.
Be prepared for a few challenges:
there are multiple Python packages for HDF5; you will have to find the one that fits your needs
installation is not always very simple (but there might be a Windows installation binary)
expect a bit of study to understand the data structures to be stored.
it will occasionally need some CPU cycles - typically you write a lot of data quickly, and at some moment it has to be flushed to disk. At this moment it starts compressing the data, which can take a few seconds. See GIL for IO bounded thread in C extension (HDF5)
Anyway, I think it is very likely you will manage, and apart from faster writes to the files you will also gain smaller files, which are simpler to handle.
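For reference, a minimal h5py sketch of a chunked, compressed write (dataset name and compression settings are illustrative):

    import h5py
    import numpy as np

    data = np.random.randint(0, 10, size=50_000_000, dtype="i1")   # stand-in for the digits

    with h5py.File("results.h5", "w") as f:
        # Chunked, gzip-compressed dataset: much smaller files and fast binary writes,
        # at the cost of some CPU time spent compressing when the data is flushed.
        f.create_dataset("digits", data=data,
                         chunks=(1_000_000,), compression="gzip", compression_opts=4)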

Alternatives to mincemeatpy and octopy

Briefly, octopy and mincemeatpy are lightweight Python implementations of map-reduce, and clients can join the cluster in an ad-hoc manner without requiring any installation (except Python, of course). Here are the project details: OCTOPY and Mincemeatpy.
The problem with these is that they need to hold the entire data in memory (including intermediate key-value pairs). So even for moderately sized data, they throw out-of-memory exceptions.
The key-reasons I'm using them are:
Python.
No cluster installation required.
I just prototype, and I can directly port the algorithm once I'm ready.
So my question is: Is there any package which handles the same stuff, but not just in-memory (which can handle moderate size data) ?
Try PyMapReduce. It runs on your own machine, but across several processes, so you don't need to build up a master-node architecture, and it has plenty of runners, for example DiskBasedRunner, which seems to store map data in temp files and reduce them afterwards.

Fastest Way to Write Data To Individual Machines?

I have a network of 100 machines, all running Ubuntu Linux.
On a continuous (streaming) basis, machine X is 'fed' with some real-time data. I need to write a python script that would get the data as input, load it in-memory, process it, and then save it to disk.
It's a lot of data; hence, I would ideally want to split the data in memory (using some logic) and just send pieces of it to each individual computer, in the fastest possible way. Each individual computer will accept its piece of data, handle it, and write it to its local disk.
Suppose I have a container of data in Python (be it a list, a dictionary etc), already processed and split to pieces. What is the fastest way to send each 'piece' of data to each individual machine?
You should take a look at pyzmq:
http://www.zeromq.org/bindings:python
and general guides to zeromq (0mq)
http://nichol.as/zeromq-an-introduction
http://www.zeromq.org/
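For example, a minimal PUSH/PULL sketch with pyzmq (port and payloads are illustrative); PUSH sockets already distribute messages round-robin across connected workers:

    import zmq

    # On machine X (the distributor):
    ctx = zmq.Context()
    push = ctx.socket(zmq.PUSH)
    push.bind("tcp://*:5557")
    for piece in [{"tile": 0}, {"tile": 1}, {"tile": 2}]:   # pieces of the split container
        push.send_pyobj(piece)                              # pickles the object onto the wire

    # On each worker machine:
    #   pull = zmq.Context().socket(zmq.PULL)
    #   pull.connect("tcp://machine-x:5557")
    #   piece = pull.recv_pyobj()    # process it and write it to the local disk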
You have two (classes of) choices:
You could build some distribution mechanism yourself.
You could use an existing tool to handle the distribution and storage.
In the simplest case, you write a program on each machine in your network that simply listens, processes and writes. You distribute from X to each machine in your pool round-robin. But, you might want to address higher-level concerns like handling node failures or dealing with requests that take longer to process than others, adding new nodes to the system, etc.
As you want more functionality, you'll probably want to find some existing tool to help you. It sounds like you might want to investigate some combinations of AMQP (for reliable messaging), Hadoop (for distributed data processing) or more complete NoSQL solutions like Cassandra or Riak. By leveraging these tools, your system will be significantly more robust than what you could probably build out yourself.
What you want is a message queue like RabbitMQ. It is easy to add consumers and producers to a queue. Consumers can either poll or get notified through a callback...
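For example, the producer side with the pika client might look roughly like this (host and queue name are illustrative):

    import pickle
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq-host"))
    channel = connection.channel()
    channel.queue_declare(queue="pieces", durable=True)

    for piece in [{"tile": 0}, {"tile": 1}]:                # pieces of the split data
        channel.basic_publish(exchange="",
                              routing_key="pieces",         # default exchange routes by queue name
                              body=pickle.dumps(piece))
    connection.close()

    # Each consumer machine registers a callback via channel.basic_consume(...)
    # and writes its piece to local disk when the callback fires.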
