Is there any parallel way of accessing Netcdf files in Python - python

Is there any way of doing parallel IO for Netcdf files in Python?
I understand that there is a project called PyPNetCDF, but apparently it's old, not updated and doesn't seem to work at all. Has anyone had any success with parallel IO with NetCDF in Python at all?
Any help is greatly appreciated

It's too bad PyPnetcdf is not a bit more mature. I see hard-coded paths and abandoned domain names. It doesn't look like it will take a lot to get something compiled, but then there's the issue of getting it to actually work...
in setup.py you should change the library_dirs_list and include_dirs_list to point to the places on your system where Northwestern/Argonne Parallel-NetCDF is installed and where your MPI distribution is installed.
then one will have to go through and update the way pypnetcdf calls pnetcdf. A few years back (quite a few, actually) we promoted a lot of types to larger versions.

I haven't seen good examples from either of the two python NetCDF modules, see https://github.com/Unidata/netcdf4-python/issues/345
However, if You only need to read files and they are NetCDF4 format, You should be able to use HDF5 directly -- http://docs.h5py.org/en/latest/mpi.html
because NetCDF4 is basically HDF5 with restricted data model. Probably won't work with NetCDF3.

Related

Can I load .npy/.npz files from inside a python package?

I am trying to create my own python package where hopefully one day other people will be able to contribute with their own content. For the package to work it must be possible to have small data files that will be installed as part of the package. It turns out, loading data file that are part of a Python module is not easy. I managed to load very basic ASCII files using something like this:
data = pkgutil.get_data(__name__, "data/test.txt")
data=data.decode('utf-8')
rr = re.findall("[-+]?[.]?[\d]+(?:,\d\d\d)*[\.]?\d*(?:[eE][-+]?\d+)?", data)
datanp=np.zeros(len(rr))
for k in range(len(rr)):
datanp[k]=np.float(rr[k])
I found many comments online that say you should not use commands like np.load(path_to_package) because on some systems packages might actually be stored in zip files or something. That is why I am using pkgutil.get_data() , it is apparently the more robust. Here for example they talk in great length about different ways to safely load data, but not so much how you would actually load different data types.
My question: Are there ways to load .npy files from inside a Python package?

Save Data in C++, Load from Python - Recommended Data Formats

I have a ROS/CPP simulator that saves large amounts of data to a rosbag (around 90 MB). I want to read this data frequently from Python and since reading rosbags is slow and cumbersome, I currently have another python script that reads the rosbag and saves the relevant contents to a HDF5 file.
It would be nice though to be able to just save the data from the simulator directly (in C++) and then read it from my scripts (in Python). So I was wondering which data format I should use.
It should be:
Fast to load from Python
Be compact (so ideally a binary of some sort)
Be easy to use
You might be wondering why I don't just save to HDF5 from my C++ simulator, but it just doesn't seem to be easy. There is basically nothing on forums such as Stackoverflow and the HDF5 Group website is opaque, seems to have some complicated licensing and very poor examples. I just want something quick and dirty that I can get running this afternoon.
You may want to have a look at HDFql as it is a high-level language (similar to SQL) to manage HDF5 files. Amongst others, HDFql supports C++ and Python. There are some examples that illustrates how to use HDFql in these languages here.
I see two solutions that can be useful for your problem :
LV: Length Value that you can store directly in binary into a file.
JSON: This does not add many data more than you need, and there are many libraries in Python or C++ that can simplify you the work
Protocol Buffers is an option with language bindings in C++ and Python, though it might be more time investment than quick/dirty running this afternoon.

Reading multiple (CERN) ROOT files into NumPy array - using n nodes and say, 2n GPUs

I am reading many (say 1k) CERN ROOT files using a loop and storing some data into a nested NumPy array. The use of loops makes it serial task and each file take quite some time to complete the process. Since I am working on a deep learning model, I must create a large enough dataset - but the reading time itself is taking a very long time (reading 835 events takes about 21 minutes). Can anyone please suggest if it is possible to use multiple GPUs to read the data, so that less time is required for the reading? If so, how?
Adding some more details: I pushed to program to GitHub so that this can be seen (please let me know if posting GitHub link is not allowed, in that case, I will post the relevant portion here):
https://github.com/Kolahal/SupervisedCounting/blob/master/read_n_train.py
I run the program as:
python read_n_train.py <input-file-list>
where the argument is a text file containing the list of the files with addresses. I was opening the ROOT files in a loop in the read_data_into_list() function. But as I mentioned, this serial task is consuming a lot of time. Not only that, I notice that the reading speed is getting worse as we read more and more data.
Meanwhile I tried to used slurmpy package https://github.com/brentp/slurmpy
With this, I can distribute the job into, say, N worker nodes, for example. In this case, an individual reading program will read the file assigned to it and will return a corresponding list. It is just that in the end, I need to add the lists. I couldn't figure out a way to do this.
Any help is highly appreciated.
Regards,
Kolahal
You're looping over all the events sequentially from python, that's probably the bottleneck.
You can look into root_numpy to load the data you need from the root file into numpy arrays:
root_numpy is a Python extension module that provides an efficient interface between ROOT and NumPy. root_numpy’s internals are compiled C++ and can therefore handle large amounts of data much faster than equivalent pure Python implementations.
I'm also currently looking at root_pandas which seems similar.
While this solution does not precisely answer the request for parallelization, it may make the parallelization unnecessary. And if it is still too slow, then it can still be used on parallel using slurm or something else.

Debugging a python script which first needs to read large files. Do I have to load them every time anew?

I have a python script which starts by reading a few large files and then does something else. Since I want to run this script multiple times and change some of the code until I am happy with the result, it would be nice if the script did not have to read the files every time anew, because they will not change. So I mainly want to use this for debugging.
It happens to often, that I run scripts with bugs in them, but I only see the error message after minutes, because the reading took so long.
Are there any tricks to do something like this?
(If it is feasible, I create smaller test files)
I'm not good at Python, but it seems to be able to dynamically reload code from a changed module: How to re import an updated package while in Python Interpreter?
Some other suggestions not directly related to Python.
Firstly, try to create a smaller test file. Is the whole file required to demonstrate the bug you are observing? Most probably it is only a small part of your input file that is relevant.
Secondly, are these particular files required, or the problem will show up on any big amount of data? If it shows only on particular files, then once again most probably it is related to some feature of these files and will show also on a smaller file with the same feature. If the main reason is just big amount of data, you might be able to avoid reading it by generating some random data directly in a script.
Thirdly, what is a bottleneck of your reading the file? Is it just hard drive performance issue, or do you do some heavy processing of the read data in your script before actually coming to the part that generates problems? In the latter case, you might be able to do that processing once and write the results to a new file, and then modify your script to load this processed data instead of doing the processing each time anew.
If the hard drive performance is the issue, consider a faster filesystem. On Linux, for example, you might be able to use /dev/shm.

Utilities or libraries for finding most closely matched binary file

I would like to be able to compare a binary file X to a directory of other binary files and find which other file is most similar to X. The nature of the data is such that identical chunks will exist between files, but possibly shifted in location. The files are all 1MB in size, and there are about 200 of them. I would like to be have something quick enough to analyze these in a few minutes or less on a modern desktop computer.
I've googled a bit and found a few different binary diff utilities, but none of them seem appropriate for my application.
For example there is bsdiff, which looks like it creates some a patch file which is optimized for size. Or vbindiff which just displays the differences graphically, but those don't really seem to help me figure out if one file is more similar to X than another file.
If there is not a tool that I can use directly for this purpose, is there a good library someone could recommend for writing my own utility? Python would be preferable, but I'm flexible.
Here's a simple perl script which more or less tries to do exactly that.
Edit: Also have a look at the following stackoverflow thread.

Categories