Is it possible to create a PyTables table without opening or creating an HDF5 file?
What I mean, and what I need, is to create a table (well, actually very many tables) in different processes, work with these tables, and store them into an HDF5 file only at the end, after some calculations (while ensuring that only one process at a time performs the storage).
In principle I could do all the calculations on normal Python data (arrays, strings, etc.) and perform the storage at the end. However, the reason I would like to work with PyTables right from the start is the sanity checks. I want to always ensure that the data I work with fits into the predefined tables and does not violate shape constraints etc. (and since PyTables checks for those problems, I don't need to implement it all by myself).
Thanks a lot and kind regards,
Robert
You are looking for pandas, which has great PyTables integration. You will be working with tables all the way, and in the end you will be able to save to HDF5 in the easiest possible way.
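A minimal sketch of that workflow (the column names and dtypes below are placeholders for your real layout): build the DataFrame with an explicit dtype per column, do all the work in memory, and only touch HDF5 at the very end, guarded by your own inter-process lock.

import numpy as np
import pandas as pd

# Hypothetical column layout; the dtypes play the role of the predefined table.
df = pd.DataFrame({
    'particle_id': np.array([1, 2, 3], dtype=np.int64),
    'x': np.array([0.1, 0.2, 0.3], dtype=np.float64),
    'y': np.array([1.0, 2.0, 3.0], dtype=np.float64),
})

# ... calculations on df ...

# Final storage step; format='table' writes a PyTables Table under the hood.
df.to_hdf('results.h5', key='results', format='table')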
You can create a numpy array with a given shape and datatype.

import numpy as np
my_array = np.empty(shape=my_shape, dtype=np.float64)

If you need indexing by name, have a look at numpy record arrays (numpy.recarray).
But if you work directly with the PyTables Table object it can be faster (see the benchmark here).
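To illustrate the record-array idea, here is a minimal sketch (field names are made up): the dtype plays the role of the predefined table, and assignments must match its field layout.

import numpy as np

row_dtype = np.dtype([('particle_id', np.int64), ('x', np.float64), ('y', np.float64)])
table = np.zeros(100, dtype=row_dtype)

table[0] = (1, 0.5, 1.5)                      # one row; wrong-length tuples raise an error
table['x'][:10] = np.linspace(0.0, 1.0, 10)   # column access by name

# Later, a PyTables Table can be created directly from such an array:
#   h5file.create_table('/', 'results', obj=table)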
I produce a very large data file with Python, mostly consisting of 0 (false) and only a few 1 (true). It has about 700,000 columns and 15,000 rows, and thus a size of about 10.5 GB. The first row is the header.
This file then needs to be read and visualized in R.
I'm looking for the right data format to export my file from Python.
As stated here:
HDF5 is row based. You get MUCH efficiency by having tables that are
not too wide but are fairly long.
Since I have a very wide table, I assume HDF5 is inappropriate in my case?
So what data format suits best for this purpose?
Would it also make sense to compress (zip) it?
Example of my file:
id,col1,col2,col3,col4,col5,...
1,0,0,0,1,0,...
2,1,0,0,0,1,...
3,0,1,0,0,1,...
4,...
Zipping won't help you, as you'll have to unzip it to process it. If you could post your code that generates the file, that might help a lot.
Also, what do you want to accomplish in R? Might it be faster to visualize it in Python, avoiding the read/write of 10.5 GB?
Perhaps rethinking how you're storing the data (e.g. storing the coordinates of the 1s if there are very few) might be a better angle here.
For instance, instead of storing a 700K by 15K table of all zeroes except for a 1 at row 600492, column 10786, I might just store the tuple (600492, 10786) and achieve the same visualization in R.
SciPy has scipy.io.mmwrite, which writes files that can be read by R's readMM command. SciPy also supports several different sparse matrix representations.
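A short sketch of that route (the coordinates below are placeholders): store only the positions of the 1s in a COO sparse matrix and write it in Matrix Market format, which R's Matrix package reads with readMM.

import numpy as np
from scipy import sparse, io

rows = np.array([0, 3, 14999])        # hypothetical row indices of the 1s
cols = np.array([10786, 5, 699999])   # hypothetical column indices
data = np.ones(len(rows), dtype=np.int8)

m = sparse.coo_matrix((data, (rows, cols)), shape=(15000, 700000))
io.mmwrite('matrix.mtx', m)

# In R:
#   library(Matrix)
#   m <- readMM("matrix.mtx")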
I'm trying to decide the best way to store my time series data in MongoDB. Outside of Mongo I'm working with them as numpy arrays or pandas DataFrames. I have seen a number of people (such as in this post) recommend pickling the data and storing the binary, but I was under the impression that pickle should never be used for long-term storage. Is that only true for data structures that might have underlying code changes to their class structures? To put it another way: numpy arrays are probably stable, so fine to pickle, but pandas DataFrames might go bad as pandas is still evolving?
UPDATE:
A friend pointed me to this, which seems to be a good start on exactly what I want:
http://docs.scipy.org/doc/numpy/reference/routines.io.html
NumPy has its own binary file format, which should be stable for long-term storage. Once I get it actually working I'll come back and post my code. If someone else has made this work already, I'll happily accept your answer.
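For reference, a minimal sketch of the .npy/.npz route (file names are arbitrary); whether the resulting files stay on disk or their bytes get pushed into MongoDB is a separate decision.

import numpy as np

a = np.random.rand(1000, 10)
np.save('series.npy', a)            # NumPy's own binary format
b = np.load('series.npy')
assert np.array_equal(a, b)

# Several arrays can go into one compressed .npz archive:
np.savez_compressed('bundle.npz', prices=a, volumes=a * 100)
with np.load('bundle.npz') as bundle:
    prices = bundle['prices']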
We've built an open source library for storing numeric data (Pandas, numpy, etc.) in MongoDB:
https://github.com/manahl/arctic
Best of all, it's easy to use, pretty fast and supports data versioning, multiple data libraries and more.
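A rough usage sketch, based on the examples in the project's README at the time of writing (check the current docs, as the API may have changed); the library and symbol names are made up.

import pandas as pd
from arctic import Arctic

store = Arctic('localhost')               # connect to MongoDB
store.initialize_library('timeseries')    # one-off setup
lib = store['timeseries']

df = pd.DataFrame({'price': [1.0, 1.1, 1.2]},
                  index=pd.date_range('2015-01-01', periods=3, freq='D'))
lib.write('MY_SYMBOL', df, metadata={'source': 'example'})

item = lib.read('MY_SYMBOL')              # versioned item
restored = item.data                      # the original DataFrame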
I'm looking for a good database solution to store large (hundreds of GB to several TB) amounts of scientific data. Ideally it would be able to handle even larger quantities of data.
Requirements
My data files are "images": each is a ~4-million-entry array (1000x1000x3 ints + 1000x1000 floats), plus associated metadata of ~50-100 entries per image. The metadata is stored hierarchically. Images will be organized into one or several "folders" (or "projects"), which themselves can contain other folders. Everything has owners, etc.
I will need to search 100-10,000 images, in one or several folders, based predominantly on their metadata. Then I might need to pull slices from an image -- I really don't want to load all of the data if I only need a fraction of it. The images should be stored in a compressed format.
Edit: It is important to emphasize that my data is not uniform. Images, for instance, are floats or ints of unknown dimensions with typically 10^5-10^6 entries, and the number of metadata entries per image may vary. Searching metadata across images would of course be limited to those with identical keys.
Current Approach
My current, and not so great, solution is to mix databases. First, I'm using an SQL database (Django + MySQL right now) to handle folders and owners, with a record for each image but none of its data. I might create records for the metadata as well. Second, I'm using PyTables to store the images and metadata in HDF5 format and treat it like a database. This solves the slicing and compression problem, and allows me to store the metadata hierarchically, but PyTables does not seem scalable and is far less developed than commercial databases. (It's not made for a multi-user environment: I'm writing my own locks, which is a bad sign.)
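For context, here is roughly what the PyTables part of that setup looks like (file, node, and attribute names are made up); compression comes from the filters, and slicing a stored array only reads the requested part from disk.

import numpy as np
import tables

filters = tables.Filters(complevel=5, complib='blosc')
with tables.open_file('images.h5', mode='w') as h5:
    img = h5.create_carray(h5.root, 'image_0001',
                           atom=tables.Float32Atom(),
                           shape=(1000, 1000, 3),
                           filters=filters)
    img[:] = np.random.rand(1000, 1000, 3).astype(np.float32)
    img.attrs.owner = 'robert'            # metadata stored as HDF5 attributes
    img.attrs.exposure_s = 0.02

with tables.open_file('images.h5', mode='r') as h5:
    patch = h5.root.image_0001[100:200, 100:200, 0]   # reads only this slice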
Help!
I'm not a hardcore programmer, so a standard database solution is strongly preferred. My "optimization" would definitely include maintenance and programming cost. Can anyone recommend favorite database solutions or architectures? Ideas on relational vs hierarchical vs other?
Options might be SciDB (not common, could be good), SQL (heard it's bad for these applications, maybe PostgreSQL?), and HBase (actually, I know nothing about it). I feel like there must be good solutions in the scientific, especially astronomy, community, but the large-scale projects seem to require a serious team to build and maintain.
I'm happy to provide lots more info.
Did you store the data in HDF5 format? Since you already mentioned that you were reluctant to load all of the data, you may not really like the array database options like SciDB, MonetDB or RasDaMan. It is very painful to load big data in raw scientific format into a database, and it usually also requires some extra programming work.
You can check this paper: Supporting a Light-Weight Data Management Layer over HDF5. This work proposes to run SQL directly over HDF5.
I have a large Python dictionary of vectors (150k vectors, 10k dimensions each) of floats that can't be loaded into memory, so I have to use one of the two methods for storing this on disk and retrieving specific vectors when appropriate. The vectors will be created and stored once, but might be read many (thousands of) times -- so it is really important to have efficient reading. After some tests with the shelve module, I tend to believe that sqlite will be a better option for this kind of task, but before I start writing code I would like to hear some more opinions on this... For example, are there any other options besides those two that I'm not aware of?
Now, assuming we agree that the best option is sqlite, another question relates to the exact form of the table. I'm thinking of using a fine-grained structure with rows of the form (vector_key, element_no, value) to help with efficient pagination, instead of storing all 10k elements of a vector in the same record. I would really appreciate any suggestions on this issue.
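A small sketch of that fine-grained layout with the standard sqlite3 module (table and column names as described above); the composite primary key is what makes per-vector lookups cheap.

import sqlite3

conn = sqlite3.connect('vectors.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS vectors (
        vector_key INTEGER,
        element_no INTEGER,
        value      REAL,
        PRIMARY KEY (vector_key, element_no)
    )
""")

# Store one 10k-dimensional vector as 10k rows.
vec_id, vec = 42, [0.1] * 10000
conn.executemany(
    "INSERT INTO vectors VALUES (?, ?, ?)",
    ((vec_id, i, v) for i, v in enumerate(vec)),
)
conn.commit()

# Retrieve the whole vector (or any contiguous slice of it) later.
rows = conn.execute(
    "SELECT value FROM vectors WHERE vector_key = ? ORDER BY element_no",
    (vec_id,),
).fetchall()
vector = [r[0] for r in rows]
conn.close()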
You want sqlite3; if you use an ORM like SQLAlchemy, you can easily grow later and switch to other back-end databases.
Shelve is more of a "toy" than something actually useful in production code.
The other point you are talking about is called normalization; I have personally never been very good at it, but this should explain it for you.
Just as an extra note, this shows the performance shortcomings of shelve vs sqlite3.
As you are dealing with numeric vectors, you may find PyTables an interesting alternative.
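A minimal sketch of what that could look like, assuming all vectors share the same dimensionality: one compressed, chunked array with one chunk per vector, so reading vectors[i] touches only that row on disk.

import numpy as np
import tables

n_vectors, dim = 150000, 10000
with tables.open_file('vectors.h5', mode='w') as h5:
    vecs = h5.create_carray(h5.root, 'vectors',
                            atom=tables.Float32Atom(),
                            shape=(n_vectors, dim),
                            chunkshape=(1, dim),
                            filters=tables.Filters(complevel=5, complib='blosc'))
    vecs[0] = np.random.rand(dim).astype(np.float32)   # write one vector

with tables.open_file('vectors.h5', mode='r') as h5:
    v = h5.root.vectors[0]    # reads a single 10k-dimensional vector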
There seem to be many choices for Python to interface with SQLite (sqlite3, atpy) and HDF5 (h5py, pyTables) -- I wonder if anyone has experience using these together with numpy arrays or data tables (structured/record arrays), and which of these most seamlessly integrates with "scientific" modules (numpy, scipy) for each data format (SQLite and HDF5).
Most of it depends on your use case.
I have a lot more experience dealing with the various HDF5-based methods than traditional relational databases, so I can't comment too much on SQLite libraries for python...
At least as far as h5py vs pyTables, they both offer very seamless access via numpy arrays, but they're oriented towards very different use cases.
If you have n-dimensional data that you want to quickly access an arbitrary index-based slice of, then it's much simpler to use h5py. If you have data that's more table-like, and you want to query it, then pyTables is a much better option.
h5py is a relatively "vanilla" wrapper around the HDF5 libraries compared to pyTables. This is a very good thing if you're going to be regularly accessing your HDF file from another language (pyTables adds some extra metadata). h5py can do a lot, but for some use cases (e.g. what pyTables does) you're going to need to spend more time tweaking things.
pyTables has some really nice features. However, if your data doesn't look much like a table, then it's probably not the best option.
To give a more concrete example, I work a lot with fairly large (tens of GB) 3- and 4-dimensional arrays of data. They're homogeneous arrays of floats, ints, uint8s, etc. I usually want to access a small subset of the entire dataset. h5py makes this very simple, and does a fairly good job of auto-guessing a reasonable chunk size. Grabbing an arbitrary chunk or slice from disk is much, much faster than for a simple memmapped file. (Emphasis on arbitrary... Obviously, if you want to grab an entire "X" slice, then a C-ordered memmapped array is impossible to beat, as all the data in an "X" slice is adjacent on disk.)
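For instance, a small sketch of that pattern (dataset name and sizes are made up): a chunked, compressed 3-D dataset from which an arbitrary slab is pulled without reading the rest.

import numpy as np
import h5py

with h5py.File('volume.h5', 'w') as f:
    dset = f.create_dataset('data', shape=(2000, 2000, 500),
                            dtype='f4', chunks=True, compression='gzip')
    dset[0:10, :, :] = np.random.rand(10, 2000, 500).astype('f4')

with h5py.File('volume.h5', 'r') as f:
    sub = f['data'][500:600, 1000:1100, 100:110]   # only the needed chunks are read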
As a counter-example, my wife collects data from a wide array of sensors that sample at minute to second intervals over several years. She needs to store and run arbitrary queries (and relatively simple calculations) on her data. pyTables makes this use case very easy and fast, and still has some advantages over traditional relational databases. (Particularly in terms of disk usage and the speed at which a large (index-based) chunk of data can be read into memory.)
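A sketch of that kind of use case in pyTables (the column names are invented): a table with an indexed column and an out-of-core query via Table.where.

import tables

class Reading(tables.IsDescription):
    timestamp = tables.Float64Col()
    sensor_id = tables.Int32Col()
    value = tables.Float64Col()

with tables.open_file('sensors.h5', mode='w') as h5:
    tbl = h5.create_table(h5.root, 'readings', Reading)
    row = tbl.row
    for i in range(1000):
        row['timestamp'] = float(i)
        row['sensor_id'] = i % 10
        row['value'] = i * 0.1
        row.append()
    tbl.flush()
    tbl.cols.sensor_id.create_index()    # speeds up queries on this column

    # Evaluated chunk by chunk on disk, without loading the whole table:
    hits = [r['value'] for r in tbl.where('(sensor_id == 3) & (value > 50)')]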