pytables - how to copy a row to memory - python

I have a pytable. I often need to copy the rows to an in-memory object and then insert into another pytable. I am wondering what is the easiest way to do this. The following code does not work as one cannot convert a Row object to a dict.
for row in hf.root.my_table.iterrows():
    rec = dict(row)
Also, I want to conditionally copy the data to another file, possibly adding 1-2 new columns. To do this, I will need to extract the table description from one table, modify it, and use the modified description to create a new table. How can I do that?
This won't work either. In general, I find my way of using pytables a little bit awkward, and would like to know a better way of doing things.

As mentioned in the documentation, you can use the Row.fetch_all_fields method to obtain an independent copy that retains its data after the table file is closed.
Your example code would look like:
for row in hf.root.my_table.iterrows():
    rec = row.fetch_all_fields()
rec is a numpy void scalar with the same keys as the row; rec['field'] yields the same data as row['field'].
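For the second part of the question (conditionally copying rows into a new table with extra columns), you can rebuild the description from the existing table. A minimal sketch, using the old-style camelCase API that appears in this question; the condition field flag, the new column extra, and the table name my_table_copy are all made up:

import tables

desc = dict(hf.root.my_table.coldescrs)   # maps column names to Col objects
desc['extra'] = tables.Float64Col()       # hypothetical new column

new_table = hf.createTable('/', 'my_table_copy', desc)
for row in hf.root.my_table.iterrows():
    if row['flag']:                       # hypothetical condition
        new_row = new_table.row
        for name in hf.root.my_table.colnames:
            new_row[name] = row[name]     # copy the existing fields
        new_row['extra'] = 0.0            # fill the added column
        new_row.append()
new_table.flush()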

Related

python equivalent to listObjects in VBA for Excel (tables)

I have implemented a program in VBA for Excel to generate automatic communications based on user inputs (selections of cells).
The macro, written in VBA, makes extensive use of VBA's ListObject feature,
i.e.
defining a table (list object)
Dim ClsSht As Worksheet
Set ClsSht = ThisWorkbook.Sheets("paragraph texts")
Dim ClsTbl As ListObject
Set ClsTbl = ClsSht.ListObjects(1)
accessing the table in the code in a very logical manner:
ClsTbl is now the table from which I want to pick up data.
myvariable= ClsTbl.listcolumns("D1").databodyrange.item(34).value
This means myvariable holds item (row) 34 of the data body of column D1 of the table ClsTbl.
I decided to learn Python to "translate" all that code into Python and make a Django-based program accessible to anyone.
I am a beginner in Python and I am wondering what the equivalent of VBA's ListObject is in Python. This decision will shape my whole program from the beginning, and I am hesitating a lot over it.
The main idea here is to get a way to access table data in a readable manner,
i.e. give me the value of column "text" where column "chapter" is 3 and column "paragraph" is 2. The values are unique, meaning there is only one value in the "text" column where that occurs.
Some observations:
I know everything can be done with lists in Python (lists can contain lists that contain lists...), but this is terrible for readability: mylist1[2][3], assuming for instance that every row is a list of values and the whole table is a list of such rows.
I don't consider building a database an option. There are multiple relatively small tables (from 10 to 500 rows and from 3 to 15 columns) that are related, but not in a database manner. A database would also force me to learn yet another language (SQL or similar), and I have more than enough with Python and Django.
The user modifies the structure of many tables (chapters coming together or getting split).
The data is 100% strings. The only integers are numbers used to sort text. I don't perform any mathematical operations on the values; I simply join pieces of text together and make replacements in texts.
The tables will be loaded into Python as CSV text files.
Please let me know if anything in the question is unclear and I will complete it.
Would it be necessary to operate with numpy or pandas, i.e. to get the value of a cell?
A DataFrame using pandas should provide everything you need, i.e. conversion to strings, manipulation, import and export. As a start, try:
import pandas as pd
df = pd.read_csv('your_file.csv')
print(df)
print(df['text'])
The entries of the first row will be converted to labels of the DataFrame columns.
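For the lookup described in the question (the value of column "text" where "chapter" is 3 and "paragraph" is 2), boolean indexing reads almost like the English sentence. A short sketch, assuming the CSV has columns named chapter, paragraph and text (whether the values compare as numbers or strings depends on what read_csv inferred):

import pandas as pd

df = pd.read_csv('your_file.csv')

# select the single matching row and pull out its 'text' value
match = df[(df['chapter'] == 3) & (df['paragraph'] == 2)]
value = match['text'].iloc[0]   # the question says the match is unique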

Is it possible to read field names from a compound Dataset in an HDF5 file in Python?

I have an HDF5 file that contains a 2D table with column names. It shows up as such in HDFView when I look at this object, called results.
It turns out that results is a "compound Dataset": a one-dimensional array where each element is a row.
I can get a handle of this object, let's call it res.
The column names are V2pt, R2pt, etc.
I can read the entire array as data, and I can read one element with
res[0,...,"V2pt"].
This will return the number in the first row of column V2pt. Replacing 0 with 1 will return the second row value, etc.
That works if I know the column name a priori. But I don't.
I simply want to get the whole Dataset and its column names. How can I do that?
I see that there is a get_field_info function in the HDF5 documentation, but I find no such function in h5py.
Am I screwed?
Even better would be a solution to read this table as a pandas DataFrame...
This is pretty easy to do in h5py and works just like compound types in Numpy.
If res is a handle to your dataset, res.dtype.fields.keys() will return all the field names.
If you need to know a specific dtype, something like res.dtype.fields['V2pt'] will give it.
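Putting it together, a minimal sketch; the filename is made up, and the pandas step assumes the compound dtype has no nested fields:

import h5py
import pandas as pd

with h5py.File('myfile.h5', 'r') as f:      # hypothetical filename
    res = f['results']
    names = list(res.dtype.fields.keys())   # field (column) names
    data = res[...]                         # whole dataset as a numpy structured array

# a structured array converts straight into a DataFrame
df = pd.DataFrame.from_records(data, columns=names)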

FITS file change

I have some data given to me by my mentor. The data consists of thousands of .fits files. Some of the .fits files are older versions of the others and the way the data tables are constructed is different. Here is what I mean:
Let's say I have two .fits files: FITS1.fits and FITS2.fits
$ python
>>> import pyfits
>>> a = pyfits.getdata('FITS1.fits')
>>> b = pyfits.getdata('FITS2.fits')
>>> a.names
['time', 'timeerr', 'sap_flux', 'sap_flux_err']
>>> b.names
['time', 'sap_flux', 'timeerr', 'sap_flux_err']
Does anyone know of a way that I can switch around the columns in the data tables, so that FITS2.fits's format matches FITS1.fits?
Your best bet is to not use pyfits directly, but to use the newer, shinier Astropy Table interface. You can read in a FITS table like:
from astropy.table import Table
table = Table.read('FITS1.fits')
As demonstrated in the section on modifying tables, you can then reorder the columns like:
table = table[['time', 'timeerr', 'sap_flux', 'sap_flux_err']]
(technically this creates a new copy of the table, with the columns selected in the order you wanted them to be in; however IIRC this does not copy the underlying column arrays and should still be a fast operation).
It is also perfectly possible to do this with the legacy pyfits interface, but I wouldn't recommend it for most cases.
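For completeness, the full round trip might look like this (a sketch; the output filename is made up, and overwrite=True is only needed if the file already exists):

from astropy.table import Table

table = Table.read('FITS2.fits')
table = table[['time', 'timeerr', 'sap_flux', 'sap_flux_err']]
table.write('FITS2_reordered.fits', overwrite=True)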
You shouldn't write code which depends on the order of keys in a dictionary: a dictionary is a hash table and the order in which keys are stored is essentially arbitrary.
If you need to match or compare the entries you should get a list of the keys and sort them.
It's probable that pyfits builds a dictionary in the order that the keys are stored in the FITS header, but that isn't necessarily true.
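For example, to check that two tables have the same columns regardless of their order:

sorted(a.names) == sorted(b.names)   # True if the column sets match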

How to store numerical lookup table in Python (with labels)

I have a scientific model which I am running in Python which produces a lookup table as output. That is, it produces a many-dimensional 'table' where each dimension is a parameter in the model and the value in each cell is the output of the model.
My question is how best to store this lookup table in Python. I am running the model in a loop over every possible parameter combination (using the fantastic itertools.product function), but I can't work out how best to store the outputs.
It would seem sensible to simply store the output as an ndarray, but I'd really like to be able to access the outputs based on the parameter values, not just indices. For example, rather than accessing the values as table[16][5][17][14] I'd prefer to access them somehow using variable names/values, for example:
table[solar_z=45, solar_a=170, type=17, reflectance=0.37]
or something similar to that. It'd be brilliant if I were able to iterate over the values and get their parameter values back - that is, being able to find out that table[16]... corresponds to the outputs for solar_z = 45.
Is there a sensible way to do this in Python?
Why don't you use a database? I have found MongoDB (and the official Python driver, Pymongo) to be a wonderful tool for scientific computing. Here are some advantages:
Easy to install - simply download the executables for your platform (2 minutes tops, seriously).
Schema-less data model
Blazing fast
Provides map/reduce functionality
Very good querying functionalities
So, you could store each entry as a MongoDB entry, for example:
{"_id":"run_unique_identifier",
"param1":"val1",
"param2":"val2" # etcetera
}
Then you could query the entries as you will:
import pymongo
data = pymongo.Connection("localhost", 27017)["mydb"]["mycollection"]
for entry in data.find():  # this will yield all results
    print(entry["param1"])  # do something with param1
Whether or not MongoDB/pymongo are the answer to your specific question, I don't know. However, you could really benefit from checking them out if you are into data-intensive scientific computing.
If you want to access the results by name, you could use a Python nested dictionary instead of an ndarray, and serialize it to a JSON text file using the json module.
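A minimal sketch of that approach; the parameter values are made up, and note that JSON object keys are always strings, so numeric keys are converted on the way out:

import json

# nested dict: solar_z -> solar_a -> type -> reflectance -> model output
table = {45: {170: {17: {0.37: 1.234}}}}   # made-up values for illustration

with open('lookup.json', 'w') as f:
    json.dump(table, f)

with open('lookup.json') as f:
    loaded = json.load(f)
print(loaded['45']['170']['17']['0.37'])   # keys come back as strings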
One option is to use a numpy ndarray for the data (as you do now), and write a parser function to convert the query values into row/column indices.
For example:
solar_z_dict = {...}
solar_a_dict = {...}
...
def lookup(dataArray, solar_z, solar_a, type, reflectance):
    return dataArray[solar_z_dict[solar_z], solar_a_dict[solar_a], ...]
You could also build the index expression as a string and eval it, if you want some of the fields to be given as "None" and translated to ":" (to give the full table for that variable).
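A concrete version of the lookup-function idea, with made-up parameter grids and a 2-D array for brevity:

import numpy as np

# hypothetical parameter grids, in the order they index the array
solar_z_vals = [0, 15, 30, 45]
solar_a_vals = [0, 85, 170]
solar_z_dict = {v: i for i, v in enumerate(solar_z_vals)}
solar_a_dict = {v: i for i, v in enumerate(solar_a_vals)}

data = np.zeros((len(solar_z_vals), len(solar_a_vals)))

def lookup(dataArray, solar_z, solar_a):
    return dataArray[solar_z_dict[solar_z], solar_a_dict[solar_a]]

print(lookup(data, 45, 170))   # value for solar_z=45, solar_a=170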
For example, rather than accessing the values as table[16][5][17][14] I'd prefer to access them somehow using variable names/values
That's what numpy's dtypes are for:
from sys import argv
import pylab as plb  # loadtxt here is re-exported from numpy
dt = [('L', 'float64'), ('T', 'float64'), ('NMSF', 'float64'), ('err', 'float64')]
data = plb.loadtxt(argv[1], dtype=dt)
Now you can access the data columns by name, e.g. data['T'], data['L'], data['NMSF'].
More info on dtypes:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html

How should python dictionaries be stored in pytables?

pytables doesn't natively support python dictionaries. The way I've approached it is to make a data structure of the form:
tables_dict = {
    'key': tables.StringCol(itemsize=40),
    'value': tables.Int32Col(),
}
(note that I ensure that the keys are <40 characters long) and then create a table using this structure:
file_handle.createTable('/', 'dictionary', tables_dict)
and then populate it with:
file_handle.root.dictionary.append(dictionary.items())
and retrieve data with:
dict(file_handle.root.dictionary.read())
This works ok, but reading the dictionary back in is extremely slow. I think the problem is that the read() function is causing the entire dictionary to be loaded into memory, which shouldn't really be necessary. Is there a better way to do this?
You can ask PyTables to search inside the table, and also create an index on the key column to speed that up.
To create an index:
table.cols.key.createIndex()
To query the values where key equals the variable search_key:
[row['value'] for row in table.where('key == search_key')]
http://pytables.github.com/usersguide/optimization.html#searchoptim
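A minimal sketch of the whole pattern, using the old-style API names that appear in this question (openFile/createIndex; newer PyTables spells them open_file/create_index); the filename and key are made up:

import tables

fh = tables.openFile('data.h5', 'a')   # hypothetical filename
table = fh.root.dictionary

table.cols.key.createIndex()           # one-off: index the key column

search_key = 'some_key'                # made-up key to look for
# PyTables picks up search_key from the local scope
values = [row['value'] for row in table.where('key == search_key')]

fh.close()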
