Convert HDF5 file to other formats - python

I have a few big sets of HDF5 files and I am looking for an efficient way of converting the data in these files into XML, TXT, or some other easily readable format. I tried working with the Python package h5py (www.h5py.org), but I was not able to figure out a method that gets this done fast enough. I am not restricted to Python and can also code in Java, Scala, or MATLAB. Can someone give me some suggestions on how to proceed?
Thanks,
TM

Mathias711's method is the best direct way. If you want to do it within python, then use pandas.HDFStore:
from pandas import HDFStore
store = HDFStore('inputFile.hd5')
store['table1Name'].to_csv('outputFileForTable1.csv')
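If the file holds several pandas-written tables, a small extension of the same idea (a sketch, assuming every key in the store is a DataFrame or Series) dumps each one to its own CSV:
from pandas import HDFStore

# Sketch: write every pandas-written table in the store to its own CSV file.
with HDFStore('inputFile.hd5') as store:
    for key in store.keys():                   # keys look like '/table1Name'
        store[key].to_csv(key.strip('/') + '.csv')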

You can use h5dump -o dset.asci -y -w 400 dset.h5
-o dset.asci specifies the output file
-y suppresses printing of the array indices alongside the data, and -w 400 sets the output line width in columns; pick a large enough value that rows are not wrapped.
dset.h5 is of course the hdf5 file you want to convert
I think this is the easiest way to convert it to an ASCII file, which you can then import into Excel or whatever you want. I did it a couple of times and it worked for me. I got this information from this website.
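If you'd rather stay in Python with h5py (which the question already mentions), a rough sketch is to walk the file and dump each dataset with numpy; the file and dataset names here are placeholders for whatever your files contain:
import h5py
import numpy as np

# Sketch: visit every dataset in the file and write it out as comma-separated text.
# Assumes the datasets are numeric 1-D or 2-D arrays (np.savetxt's limitation).
with h5py.File('dset.h5', 'r') as f:
    def dump_dataset(name, obj):
        if isinstance(obj, h5py.Dataset):
            np.savetxt(name.replace('/', '_') + '.txt', obj[()], delimiter=',')
    f.visititems(dump_dataset)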

Related

change chunk block shape in netCDF file

I have several ~100 GB NetCDF files.
Within each NetCDF file there is a variable a, from which I have to extract several data series.
Its dimensions are (1440, 721, 6, 8760).
I need to extract ~20k slices of dimension (1,1,1,8760) from each NetCDF file.
Since it is extremely slow to extract one slice (several minutes), I read about how to optimize the process.
Most likely, the chunks are not set optimally.
Therefore, my goal is to change the chunk size to (1,1,1,8760) for a more efficient I/O.
However, I struggle to understand how I can best re-chunk this NetCDF variable.
First of all, by running ncdump -k file.nc, I found that the type is 64-bit offset.
Based on my research, I think this is NetCDF3 which does not support defining chunk sizes.
Therefore, I copied it to NetCDF4 format using nccopy -k 3 source.nc dest.nc.
ncdump -k file.nc now returns netCDF-4.
However, now I'm stuck. I do not know how to proceed.
If anybody has a proper solution in python, matlab, or using nccopy, please share it.
What I'm trying now is the following:
nccopy -k 3 -w -c latitude/1,longitude/1,level/1,time/8760 source.nc dest.nc
Is this the correct approach in theory?
Unfortunately, after 24 hours it still had not finished on a potent server with more than enough RAM (250 GB) and many CPUs (80).
Your command appears to be correct; re-chunking a file this size simply takes time. You could also try NCO's ncks:
ncks -4 --cnk_dmn latitude,1 --cnk_dmn longitude,1 --cnk_dmn level,1 --cnk_dmn time,8760 in.nc out.nc
to see if that is any faster.
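If you want a Python route instead of nccopy, one option (a sketch only; it assumes the variable is called a, that its dimensions are ordered (longitude, latitude, level, time), and that dask is installed for lazy reading) is to let xarray rewrite the file with an explicit chunk layout:
import xarray as xr

# Sketch: copy the file to NetCDF-4 with (1, 1, 1, 8760) chunks on variable 'a'.
# chunks='auto' opens the data lazily via dask so the ~100 GB file is not read at once.
ds = xr.open_dataset('source.nc', chunks='auto')
encoding = {'a': {'chunksizes': (1, 1, 1, 8760), 'zlib': True, 'complevel': 1}}
ds.to_netcdf('dest.nc', format='NETCDF4', encoding=encoding)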

Python plyfile vs pymesh

I need to read, manipulate and write PLY files in Python. PLY is a format for storing 3D objects. Through a simple search I've found two relevant libraries, PyMesh and plyfile. Has anyone had any experience with either of them, and does anyone have any recommendations? plyfile seems to have been dormant for a year now, judging by GitHub.
I know this question instigates opinion-based answers but I don't really know where else to ask this question.
As of January 2020: neither. Use open3d; it's the easiest and it reads .ply files directly into numpy.
import numpy as np
import open3d as o3d
# Read .ply file
input_file = "input.ply"
pcd = o3d.io.read_point_cloud(input_file) # Read the point cloud
# Visualize the point cloud within open3d
o3d.visualization.draw_geometries([pcd])
# Convert the open3d point cloud to a numpy array
# Here, you have the point cloud in numpy format.
point_cloud_in_numpy = np.asarray(pcd.points)
References:
http://www.open3d.org/docs/release/tutorial/Basic/visualization.html
http://www.open3d.org/docs/release/tutorial/Basic/working_with_numpy.html
I have successfully used plyfile while working with point clouds.
It's true that the project hasn't seen any activity for a long time, but it serves its purpose.
And parsing a PLY file isn't really the kind of task that invites you to keep adding new features.
On the other hand, PyMesh offers many other features besides parsing PLY files.
So maybe the question is:
Do you want to just 'read, manipulate and write PLY files', or are you looking for a library that provides more features on top of that?
What made me choose plyfile was that I could incorporate it into my project by copying a single source file. Also, I wasn't interested in any of the other features PyMesh offers.
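For reference, reading a point cloud with plyfile looks roughly like this (a sketch, assuming the file has a standard vertex element with x/y/z properties):
import numpy as np
from plyfile import PlyData

# Sketch: load the vertex element of a .ply file into an (N, 3) numpy array.
ply = PlyData.read('input.ply')
vertex = ply['vertex']
points = np.column_stack([vertex['x'], vertex['y'], vertex['z']])
print(points.shape)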
Update
I ended up writing my own functions to read/write PLY files (supporting both ascii and binary) because I found the plyfile source code a little messy.
If anyone is interested, here is a link to the file:
ply reader/writer
I've just updated meshio to support PLY as well, next to about 20 other formats. Install with
pip install meshio
and use either on the command line
meshio convert in.ply out.vtk
or from within Python like
import meshio
mesh = meshio.read("in.ply")
# mesh.points, mesh.cells, ...
I rolled my own ascii ply writer (because it's so simple, I didn't want to take a dependency). Later, I was lazy and took a dependency on plyfile for loading binary .ply files coming from other places. Nothing has caught on fire yet.
A thing to mention, for better or worse, the .ply format is extensible. We shoehorned custom data into it, and that was easy since we also wrote our own writer.
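To give an idea of how little is involved in rolling your own ascii writer, here is a minimal sketch (not the poster's code; it only handles a bare x/y/z vertex list passed in as an (N, 3) sequence):
def write_ascii_ply(path, points):
    # Sketch of a bare-bones ascii PLY writer for an (N, 3) list/array of vertices.
    with open(path, 'w') as f:
        f.write('ply\n')
        f.write('format ascii 1.0\n')
        f.write('element vertex %d\n' % len(points))
        f.write('property float x\n')
        f.write('property float y\n')
        f.write('property float z\n')
        f.write('end_header\n')
        for x, y, z in points:
            f.write('%f %f %f\n' % (x, y, z))

write_ascii_ply('out.ply', [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)])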

What data format for large files in R?

I produce a very large data file with Python, consisting mostly of 0s (false) and only a few 1s (true). It has about 700,000 columns and 15,000 rows, and thus a size of 10.5 GB. The first row is the header.
This file then needs to be read and visualized in R.
I'm looking for the right data format to export my file from Python.
As stated here:
HDF5 is row based. You get MUCH efficiency by having tables that are
not too wide but are fairly long.
As I have a very wide table, I assume HDF5 is inappropriate in my case?
So what data format suits best for this purpose?
Would it also make sense to compress (zip) it?
Example of my file:
id,col1,col2,col3,col4,col5,...
1,0,0,0,1,0,...
2,1,0,0,0,1,...
3,0,1,0,0,1,...
4,...
Zipping won't help you, as you'll have to unzip it to process it. If you could post the code that generates the file, that might help a lot.
Also, what do you want to accomplish in R? Might it be faster to visualize it in Python, avoiding the read/write of 10.5 GB?
Perhaps rethinking your approach to how you're storing the data (e.g. storing only the coordinates of the 1s if there are very few) might be a better angle here.
For instance, instead of storing a 700K by 15K table of all zeroes except for a 1 in line 600492 column 10786, I might just store the tuple (600492, 10786) and achieve the same visualization in R.
SciPy has scipy.io.mmwrite which makes files that can be read by R's readMM command. SciPy also supports several different sparse matrix representations.
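A rough sketch of that route (the shape and indices below are only illustrative, following the example above); on the R side the resulting file can be loaded with readMM from the Matrix package:
import numpy as np
import scipy.sparse as sp
from scipy.io import mmwrite

# Sketch: store only the coordinates of the 1s as a sparse matrix and write it
# in MatrixMarket format, which R's Matrix::readMM() reads directly.
rows = np.array([600492])                 # row indices of the 1s
cols = np.array([10786])                  # column indices of the 1s
vals = np.ones(len(rows), dtype=np.int8)
m = sp.coo_matrix((vals, (rows, cols)), shape=(700000, 15000))
mmwrite('flags.mtx', m)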

create a large dataset using Python

I want to create a large dataset (that conforms to a given schema) using Python. Is there a nice way to specify the schema (datatype & length of each of the fields) and let Python create about 100,000 observations for me? Any nice tools already out there?
I am familiar with Python, hence I would like to stick with it. If there is a way using Bash or anything else, please let me know as well.
Thanks!
PD.
You should probably check out the fake-factory package.
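A quick sketch of what that looks like (the package is now published as Faker; the three columns are just an example schema, not anything the question specifies):
import csv
from faker import Faker   # pip install Faker (the successor of fake-factory)

# Sketch: generate ~100,000 fake observations for an example three-field schema.
fake = Faker()
with open('sample.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'email', 'joined'])
    for _ in range(100000):
        writer.writerow([fake.name(), fake.email(), fake.date()])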
Please have a look at this:
https://github.com/sanju51/Generate-large-Dataset-dynamically-in-Python
Speed: 100,000 records in 5 seconds (10 columns)
Usage: python generate_dataset.py -i Metadata.csv -f sample.csv -nrec 100000 -d ',' -hdr Y

Python: handling a large set of data. Scipy or Rpy? And how?

In my Python environment, the Rpy and SciPy packages are already installed.
The problem I want to tackle is this:
1) A huge set of financial data is stored in a text file; it is too large to load into Excel.
2) I need to sum certain fields and get the totals.
3) I need to show the top 10 rows based on the totals.
Which package (SciPy or Rpy) is best suited for this task?
Could you provide some pointers (e.g. documentation or online examples) that would help me implement a solution?
Speed is a concern. Ideally, SciPy or Rpy should be able to handle the files even when they are so large that they cannot fit into memory.
Neither Rpy nor SciPy is necessary, although numpy may make it a bit easier.
This problem seems ideally suited to a line-by-line parser.
Simply open the file, read a row into a string, scan the row into an array (see numpy.fromstring), update your running sums and move to the next line.
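A sketch of that loop (assuming comma-separated numeric rows with a header line; which column ranks the rows is an assumption here):
import heapq
import numpy as np

# Sketch: single pass that keeps running column sums and the 10 largest rows.
totals = None
top10 = []                                    # min-heap of (metric, raw line) pairs

with open('BigScaryFinancialData.txt') as f:
    next(f)                                   # skip the header line
    for line in f:
        row = np.fromstring(line, sep=',')    # parse one row into a float array
        totals = row if totals is None else totals + row
        heapq.heappush(top10, (row[3], line)) # row[3] is the assumed ranking field
        if len(top10) > 10:
            heapq.heappop(top10)              # drop the smallest of the 11

print('column sums:', totals)
print('top 10 rows:', [l for _, l in sorted(top10, reverse=True)])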
Python's file I/O doesn't have bad performance, so you can just use the built-in file type directly. You can see what functions it provides by typing help(file) in the interactive interpreter. Opening a file is part of the core language functionality and doesn't require an import.
Something like:
f = open("C:\\BigScaryFinancialData.txt", "r")
for line in f.readlines():
    # line is a string type
    # do whatever you want to do on a per-line basis here, for example:
    print len(line)
Disclaimer: this is a Python 2 answer; in Python 3 the print statement above would need to be print(len(line)).
I'll leave it to you to figure out how to show the top 10 rows and find the row sums. This can be done with simple program logic that shouldn't be a problem without any special libraries. Of course, if the rows have some kind of complicated formatting that makes it difficult to parse out the values, you might want to use some kind of module for parsing, re for example (type help(re) into the interactive interpreter).
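For example, a sketch of pulling numeric fields out of a messier line with re (the pattern only handles plain decimal numbers with optional thousands separators):
import re

# Sketch: extract all decimal numbers from a messy line of text.
line = 'acct 42;  debit 1,234.56 USD;  credit 78.90 USD'
values = [float(v.replace(',', '')) for v in re.findall(r'[-+]?[\d,]*\.?\d+', line)]
print(values)   # [42.0, 1234.56, 78.9]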
As #gsk3 noted, bigmemory is a great package for this, along with the packages biganalytics and bigtabulate (there are more, but these are worth checking out). There's also ff, though that isn't as easy to use.
Common to both R and Python is support for HDF5 (see the ncdf4 or NetCDF4 packages in R), which makes it very speedy and easy to access massive data sets on disk. Personally, I primarily use bigmemory, though that's R specific. As HDF5 is available in Python and is very, very fast, it's probably going to be your best bet in Python.
How huge is your data? Is it larger than your PC's memory? If it can be loaded into memory, you can use numpy.loadtxt() to load the text data into a numpy array, for example:
import numpy as np
with file("data.csv", "rb") as f:
    title = f.readline()                  # if your data has a title line
    data = np.loadtxt(f, delimiter=",")   # if your data is separated by ","
    print np.sum(data, axis=0)            # sum along axis 0 to get the sum of every column
I don't know anything about Rpy. I do know that SciPy is used to do serious number-crunching with truly large data sets, so it should work for your problem.
As zephyr noted, you may not need either one; if you just need to keep some running sums, you can probably do it in Python. If it is a CSV file or other common file format, check and see if there is a Python module that will parse it for you, and then write a loop that sums the appropriate values.
I'm not sure how to get the top ten rows. Can you gather them on the fly as you go, or do you need to compute the sums and then choose the rows? To gather them you might want to use a dictionary to keep track of the current 10 best rows, and use the keys to store the metric you used to rank them (to make it easy to find and toss out a row if another row supersedes it). If you need to find the rows after the computation is done, slurp all the data into a numpy.array, or else just take a second pass through the file to pull out the ten rows.
Since this has the R tag I'll give some R solutions:
Overview: http://www.r-bloggers.com/r-references-for-handling-big-data/
bigmemory package: http://www.cybaea.net/Blogs/Data/Big-data-for-R.html
XDF format: http://blog.revolutionanalytics.com/2011/03/analyzing-big-data-with-revolution-r-enterprise.html
Hadoop interfaces to R (RHIPE, etc.)
