Multifile or NcML reader for NetCDF4 in Python

I would like to find a way in Python to aggregate over the slow index (time) of a NetCDF dataset with dimensions (time, y, x), where the files store blocks of time. Apparently netCDF4-python can do this for a NetCDF4 classic file or NetCDF3, but the files are a done deal. Can anyone explain whether there is a way to do this with NetCDF4 files, either with multifile access or with something like NcML wrappers?
Or does NetCDF4 not do this for a reason that cannot be overcome?
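For reference, the multifile access I mean looks something like this minimal sketch; as far as I can tell it only works for NETCDF3 and NETCDF4_CLASSIC files, and the file pattern and variable name here are placeholders:
import netCDF4

# aggregate along the unlimited (time) dimension across many files
f = netCDF4.MFDataset('data_*.nc')
chunk = f.variables['var'][0:10]  # a time slice spanning file boundaries
f.close()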
Thanks.

Related

Reading file with huge number of columns in python

I have a huge CSV file with around 4 million columns and around 300 rows. The file size is about 4.3 GB. I want to read this file and run some machine learning algorithms on the data.
I tried reading the file via pandas read_csv in Python, but it is taking a long time to read even a single row (I suspect due to the large number of columns). I checked a few other options like numpy fromfile, but nothing seems to be working.
Can someone please suggest some way to load a file with this many columns in Python?
Pandas/numpy should be able to handle that volume of data no problem. I hope you have at least 8GB of RAM on that machine. To import a CSV file with Numpy, try something like
import numpy as np

data = np.loadtxt('test.csv', dtype=np.uint8, delimiter=',')
If there is missing data, np.genfromtxt might work instead. If none of these meet your needs and you have enough RAM to hold a duplicate of the data temporarily, you could first build a Python list of lists, one list per row, using readline and str.split. Then pass that to pandas or numpy, assuming that's how you intend to operate on the data. You could then save it to disk in a format that is easier to ingest later. hdf5 was already mentioned and is a good option. You can also save a numpy array to disk with numpy.savez, or my favorite, the speedy bloscpack.(un)pack_ndarray_file.
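A minimal sketch of that list-of-lists approach (assuming a purely numeric CSV; the file names are placeholders):
import numpy as np

rows = []
with open('test.csv') as f:
    for line in f:
        rows.append(line.rstrip('\n').split(','))
data = np.asarray(rows, dtype=np.uint8)  # hand the list of lists to numpy
np.savez('test.npz', data=data)          # save in a format that loads faster later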
CSV is a very inefficient format for storing large datasets. You should convert your CSV file into a better-suited format. Try HDF5 (h5py.org or pytables.org); it is very fast and allows you to read parts of the dataset without fully loading it into memory.
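For example, a one-time conversion along those lines might look like this (a sketch with placeholder file names, assuming a purely numeric CSV that fits in memory once):
import h5py
import numpy as np

data = np.loadtxt('test.csv', dtype=np.uint8, delimiter=',')
with h5py.File('test.h5', 'w') as f:
    f.create_dataset('data', data=data)

# later: read only part of the dataset, without loading it all
with h5py.File('test.h5', 'r') as f:
    some_rows = f['data'][:10]
    some_cols = f['data'][:, 1000:1100]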
According to this answer, pandas (which you already tried) is the fastest library available to read a CSV in Python, or at least was in 2014.

Python: how to close mat file?

I'm reading data from 20k mat files into an array.
After reading around 13k files, the process is killed with a "Killed" message.
It looks like the problem is that too many files are open.
I've tried to find out how to explicitly "close" mat files in Python, but didn't find anything except for savemat, which is not what I need in this case.
How can I explicitly close mat files in python?
import scipy.io

x = []
with open('mat_list.txt', 'r') as f:
    for l in f:
        l = l.replace('\n', '')
        mat = scipy.io.loadmat(l)
        x.append(mat['data'])
You don't need to. loadmat does not keep the file open. If given a file name, it loads the contents of the file into memory, then immediately closes it. You can use a file object like @nils-werner suggested, but you will get no benefit from doing so. You can see this by looking at the source code.
You are most likely running out of memory due to simply having too much data at a time. The first thing I would try is to load all the data into one big numpy array. You know the size of each file, and you know how many files there are, so you can pre-allocate an array of the right size and write the data to slices of that array. This will also tell you right away if this is a problem with your array size.
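A minimal sketch of the pre-allocation approach (assuming each file's 'data' array has the same known shape, here (1000,) as a placeholder):
import numpy as np
import scipy.io

filenames = [line.strip() for line in open('mat_list.txt')]
x = np.empty((len(filenames), 1000), dtype=np.float64)  # pre-allocate once
for i, fn in enumerate(filenames):
    x[i] = scipy.io.loadmat(fn)['data']  # write into a slice, no growing list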
If you are still running out of memory, you will need another solution. A simple solution would be to use dask. This allows you to create something that looks and acts like a numpy array, but lives in a file rather than in memory. This allows you to work with data sets too large to fit into memory. bcolz and blaze offer similar capabilities, although not as seamlessly.
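A sketch of the dask route, under the same per-file shape assumption as above:
import dask
import dask.array as da
import numpy as np
from scipy.io import loadmat

filenames = [line.strip() for line in open('mat_list.txt')]
load = dask.delayed(lambda fn: loadmat(fn)['data'])
arrays = [da.from_delayed(load(fn), shape=(1000,), dtype=np.float64)
          for fn in filenames]
x = da.stack(arrays)  # looks like one big array, but files are read lazily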
If these are not an option, h5py and pytables allow you to store data sets to files incrementally rather than having to keep the whole thing in memory at once.
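For example, writing the files into one HDF5 dataset incrementally (again assuming a common per-file shape) could look like:
import h5py
import scipy.io

filenames = [line.strip() for line in open('mat_list.txt')]
with h5py.File('all_data.h5', 'w') as f:
    dset = f.create_dataset('data', shape=(len(filenames), 1000), dtype='f8')
    for i, fn in enumerate(filenames):
        dset[i] = scipy.io.loadmat(fn)['data']  # one file in memory at a time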
Overall, I think this question is a classic example of the XY Problem. It is generally much better to state your symptoms, and ask for help on those symptoms, rather than guessing what the solution is and asking for someone to help you implement the solution.
You can pass an open file handle to scipy.io.loadmat:
import scipy.io

x = []
with open('mat_list.txt', 'r') as f:
    for l in f:
        l = l.replace('\n', '')
        with open(l, 'rb') as matfile:  # open in binary mode for loadmat
            mat = scipy.io.loadmat(matfile)
            x.append(mat['data'])
Leaving the with open() context will then automatically close the file.

Convert HDF5 file to other formats

I have a few big sets of HDF5 files and I am looking for an efficient way of converting the data in these files into XML, TXT, or some other easily readable format. I tried working with the Python package h5py (www.h5py.org), but I was not able to figure out any methods with which I can get this done fast enough. I am not restricted to Python and can also code in Java, Scala, or Matlab. Can someone give me some suggestions on how to proceed?
Thanks,
TM
Mathias711's method is the best direct way. If you want to do it within python, then use pandas.HDFStore:
from pandas import HDFStore
store = HDFStore('inputFile.hd5')
store['table1Name'].to_csv('outputFileForTable1.csv')
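If the file holds several tables, you could dump them all; note that HDFStore only works on HDF5 files that were written by pandas (or PyTables) in the first place:
for key in store.keys():                       # keys look like '/table1Name'
    store[key].to_csv(key.strip('/') + '.csv')
store.close()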
You can use h5dump -o dset.asci -y -w 400 dset.h5
-o dset.asci specifies the output file
-y suppresses the printing of array indices, and -w 400 sets the width of the output lines; it should be at least the dimension size multiplied by the number of characters and spaces needed to print each value, so take a very large number here
dset.h5 is of course the HDF5 file you want to convert
I think this is the easiest way to convert it to an ASCII file, which you can import into Excel or whatever you want. I did it a couple of times, and it worked for me. I got this information from this website.

Converting NetCDF to GRIB2

I know there is software like wgrib2 that will convert files in GRIB and GRIB2 format to NetCDF files, but I need to go the other way: from NetCDF to GRIB2, because the local weather offices here can only consume gridded data in GRIB2 format.
It appears that one solution could be in Python, using the NetCDF4-Python library (or other) to read the NetCDF files and using pygrib to write grib2.
Is there a better way?
After some more research, I ended up using the British Met Office "Iris" package (http://scitools.org.uk/iris/docs/latest/index.html), which can read NetCDF as well as OPeNDAP, GRIB, and several other formats, and allows saving as NetCDF or GRIB.
Basically the code looks like:
import iris
cubes = iris.load('input.nc')        # each variable in the netcdf file is a cube
iris.save(cubes[0], 'output.grib2')  # save a specific variable to grib
But if your netcdf file doesn't contain sufficient metadata, you may need to add it, which you can also do with Iris. Here's a full working example:
https://github.com/rsignell-usgs/ipython-notebooks/blob/master/files/Iris_CFSR_wave_wind.ipynb
One can also use the Climate Data Operators (CDO) for this task (https://code.zmaw.de/projects/cdo/wiki), but you need to install the software along with all the additional libraries.
I know CDO is mentioned above, but I thought it would be useful to give the full command:
cdo -f grb2 copy in.nc out.grb
ECMWF has a command line based tool to do just this: https://software.ecmwf.int/wiki/display/GRIB/grib_to_netcdf

How to extend h5py so that I can access data within an HDF5 file?

I have a small Python program which creates an HDF5 file using the h5py module. I want to write a Python module to work on the data from the HDF5 file. How could I do that?
More specifically, I can declare the numpy arrays as PyArrayObject and read them using PyArg_ParseTuple. This way, I can read elements from the numpy array when I am writing a Python module in C. How do I read HDF5 files so that I can access individual elements the same way?
Update: Thanks for the answers below. I need to read the HDF5 file from C, not from Python (which I already know how to do). For example:
import h5py as t
import numpy as np
f = t.File('/tmp/tmp.h5', 'w')  # forward slashes: in '\tmp\tmp.h5' the '\t' would be a tab character
# this file is 2+GB
ofmat = np.load('offsetmatrix.npy')
f['FileDataset'] = ofmat
f.close()
Now I have an HDF5 file called '/tmp/tmp.h5'. What I need to do is read the individual array elements from the HDF5 file using C (and not Python) so that I can do something with those elements. This shows how to extend numpy arrays. How do I extend it to HDF5?
h5py gives you a direct interface for reading/writing and manipulating data stored in an hdf5 file. Have you looked at the docs?
http://docs.h5py.org/
I advise starting with these; they have pretty clear examples of how to do simple data access. If there are specific things you are trying to do that aren't covered by the methods in h5py, could you please give a more specific description of your desired usage?
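For the Python side, reading individual elements back with h5py is straightforward; for example, using the file from the question:
import h5py

with h5py.File('/tmp/tmp.h5', 'r') as f:
    dset = f['FileDataset']
    print(dset.shape, dset.dtype)
    first_row = dset[0]   # only this row is read from disk
    block = dset[10:20]   # a slice, without loading the whole array
(For the C side, the HDF5 library itself ships a C API, e.g. H5Fopen, H5Dopen, and H5Dread, which is what h5py wraps.)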
If you don't actually need a particular structure of HDF5, but you just need the speed and cross-platform compatibility, I'd recommend taking a look at PyTables. It has the built-in ability to read and write Numpy arrays.
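A tiny sketch of that with PyTables (the file and array names here are placeholders):
import numpy as np
import tables

with tables.open_file('data.h5', 'w') as f:
    f.create_array('/', 'data', np.arange(1000))  # write a numpy array directly
with tables.open_file('data.h5', 'r') as f:
    part = f.root.data[2:5]  # read a slice directly from disk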
