How to store a numerical lookup table in Python (with labels)

I have a scientific model, which I am running in Python, that produces a lookup table as output. That is, it produces a many-dimensional 'table' where each dimension is a parameter in the model and the value in each cell is the output of the model.
My question is how best to store this lookup table in Python. I am running the model in a loop over every possible parameter combination (using the fantastic itertools.product function), but I can't work out how best to store the outputs.
It would seem sensible to simply store the output as an ndarray, but I'd really like to be able to access the outputs based on the parameter values, not just the indices. For example, rather than accessing the values as table[16][5][17][14], I'd prefer to access them somehow using variable names/values, for example:
table[solar_z=45, solar_a=170, type=17, reflectance=0.37]
or something similar to that. It'd be brilliant if I were able to iterate over the values and get their parameter values back - that is, being able to find out that table[16]... corresponds to the outputs for solar_z = 45.
Is there a sensible way to do this in Python?

Why don't you use a database? I have found MongoDB (and the official Python driver, Pymongo) to be a wonderful tool for scientific computing. Here are some advantages:
Easy to install - simply download the executables for your platform (2 minutes tops, seriously).
Schema-less data model
Blazing fast
Provides map/reduce functionality
Very good querying functionality
So, you could store each model run as a MongoDB document, for example:
{"_id": "run_unique_identifier",
 "param1": "val1",
 "param2": "val2"  # etcetera
}
Then you could query the entries as you will:
import pymongo

# connect to the local MongoDB server and select a database and collection
data = pymongo.MongoClient("localhost", 27017)["mydb"]["mycollection"]

for entry in data.find():       # this will yield all stored documents
    print(entry["param1"])      # do something with param1
Whether or not MongoDB/pymongo are the answer to your specific question, I don't know. However, you could really benefit from checking them out if you are into data-intensive scientific computing.

If you want to access the results by name, then you could use a Python nested dictionary instead of an ndarray, and serialize it to a JSON text file using the json module.
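A minimal sketch of that idea (the parameter values and the output number are made up; note that JSON keys must be strings):
import json

# nested dict keyed by parameter values (as strings, since JSON keys are strings)
table = {"45": {"170": {"17": {"0.37": 0.85}}}}   # 0.85 is a made-up model output

with open("table.json", "w") as f:
    json.dump(table, f)

with open("table.json") as f:
    table = json.load(f)

print(table["45"]["170"]["17"]["0.37"])   # -> 0.85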

One option is to use a numpy ndarray for the data (as you do now), and write a parser function to convert the query values into row/column indices.
For example:
solar_z_dict = {...}   # maps each solar_z value to its index along that axis
solar_a_dict = {...}
...
def lookup(dataArray, solar_z, solar_a, type, reflectance):
    return dataArray[solar_z_dict[solar_z], solar_a_dict[solar_a], ...]
You could also build the index expression as a string and eval it, if you want some of the fields to accept None and be translated to ":" (to give the full table along that dimension).
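A variation on that idea (my own sketch, not the eval approach above): None can be mapped to slice(None), which is the programmatic form of ":", so no eval is needed. The parameter names and the tiny 2x2 table below are made up:
import numpy as np

# hypothetical index dictionaries: parameter value -> position along that axis
solar_z_dict = {45: 0, 50: 1}
solar_a_dict = {170: 0, 175: 1}

def lookup(data_array, solar_z=None, solar_a=None):
    # None keeps the whole axis, like ':' in a manual index
    idx = (
        solar_z_dict[solar_z] if solar_z is not None else slice(None),
        solar_a_dict[solar_a] if solar_a is not None else slice(None),
    )
    return data_array[idx]

table = np.array([[0.1, 0.2], [0.3, 0.4]])     # made-up 2x2 output table
print(lookup(table, solar_z=45, solar_a=170))  # -> 0.1
print(lookup(table, solar_z=45))               # -> [0.1 0.2], the full solar_a axis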

For example, rather than accessing the values as table[16][5][17][14]
I'd prefer to access them somehow using variable names/values
That's what numpy's dtypes are for:
from sys import argv
import pylab as plb
dt = [('L', 'float64'), ('T', 'float64'), ('NMSF', 'float64'), ('err', 'float64')]
data = plb.loadtxt(argv[1], dtype=dt)
Now you can access the columns by name: data['T'], data['L'], data['NMSF'].
More info on dtypes:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html
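A self-contained sketch of the same idea without a file (the numbers are made up):
import numpy as np

dt = [('L', 'float64'), ('T', 'float64'), ('NMSF', 'float64'), ('err', 'float64')]
data = np.array([(1.0, 300.0, 0.5, 0.01),
                 (2.0, 310.0, 0.6, 0.02)], dtype=dt)

print(data['T'])        # every value in the T column -> [300. 310.]
print(data[0]['NMSF'])  # NMSF value of the first row  -> 0.5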

Related

Insert list into SQLite3 cell

I'm new to python and even newer to SQL and have just run into the following problem:
I want to insert a list (or actually, a list containing one or more dictionaries) into a single cell in my SQL database. This is one row of my data:
[a,b,c,[{key1: int, key2: int},{key1: int, key2: int}]]
As the number of dictionaries inside the lists varies and I want to iterate through the elements of the list later on, I thought it would make sense to keep it in one place (thus not splitting the list into its individual elements). However, when trying to insert the list as it is, I get the following error:
sqlite3.InterfaceError: Error binding parameter 2 - probably unsupported type.
How can this kind of list be inserted into a single cell of my SQL database?
SQLite has no facility for a 'nested' column; you'd have to store your list as text or a binary data blob: serialise it on the way in, deserialise it again on the way out.
How you serialise to text or binary data depends on your use cases. JSON (via the json module) could be suitable if your lists and dictionaries consist only of text, numbers, booleans and None (with the dictionaries only using strings as keys). JSON is supported by a wide range of other languages, so you keep your data reasonably compatible. Or you could use pickle, which lets you serialise to a binary format and can handle just about anything Python can throw at it, but it's specific to Python.
You can then register an adapter to handle converting between the serialisation format and Python lists:
import json
import sqlite3

def adapt_list_to_JSON(lst):
    # serialise the list to JSON text, stored as bytes
    return json.dumps(lst).encode('utf8')

def convert_JSON_to_list(data):
    # deserialise the JSON bytes back into a Python list
    return json.loads(data.decode('utf8'))

sqlite3.register_adapter(list, adapt_list_to_JSON)
sqlite3.register_converter("json", convert_JSON_to_list)
then connect with detect_types=sqlite3.PARSE_DECLTYPES and declare your column type as json, or use detect_types=sqlite3.PARSE_COLNAMES and use [json] in a column alias (SELECT datacol AS "datacol [json]" FROM ...) to trigger the conversion on loading.
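A usage sketch with the adapter and converter registered above (the table and column names are made up):
con = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
con.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, data json)")

row = ["a", "b", "c", [{"key1": 1, "key2": 2}, {"key1": 3, "key2": 4}]]
con.execute("INSERT INTO mytable (data) VALUES (?)", (row,))

stored = con.execute("SELECT data FROM mytable").fetchone()[0]
print(stored[3][0]["key1"])   # the nested structure is a Python list again -> 1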

ElasticSearch throwing mapper parsing exception when indexing JSON array of integers and strings

I am attempting to use python to pull a JSON array from a file and input it into ElasticSearch. The array looks as follows:
{"name": [["string1", 1, "string2"],["string3", 2, "string4"], ... (variable length) ... ["string n-1", 3, "string n"]]}
ElasticSearch throws a TransportError(400, mapper_parsing_exception, failed to parse) when attempting to index the array. I discovered that ElasticSearch sometimes throws the same error whenever I try to feed it an array with both strings and integers. So, for example, the following will sometimes crash and sometimes succeed:
import json
from elasticsearch import Elasticsearch

es = Elasticsearch()
test = json.loads('{"test": ["a", 1, "b"]}')
print(test)
es.index(index, body=test)   # index is the variable holding the target index name
This code is everything I could safely comment out without breaking the program. I put the JSON in the program instead of having it read from a file. The actual strings I'm inputting are quite long (or else I would just post them) and will always crash the program. Changing the JSON to "test": ["a"] will cause it to work. The current setup crashes if it last crashed, or works if it last worked. What is going on? Will some sort of mapping setup fix this? I haven't figured out how to set a map with variable array length. I'd prefer to take advantage of the schema-less input but I'll take whatever works.
It is possible you are running into type conflicts with your mapping. Since you have expressed a desire to stay "schema-less", I am assuming you have not explicitly provided a mapping for your index. That works fine; just recognize that the first document you index will determine the schema for your index. For each document you index afterwards that has the same fields (by name), those fields must conform to the same types as in the first document.
Elasticsearch has no issues with arrays of values. In fact, under the hood it treats all values as arrays (with one or more entries). What is slightly concerning is the example array you chose, which mixes string and numeric types. Since each value in your array gets mapped to the field named "test", and that field may only have one type, if the first value of the first document ES processes is numeric, it will likely assign that field as a long type. Then, future documents that contain a string that does not parse nicely into a number, will cause an exception in Elasticsearch.
Have a look at the documentation on Dynamic Mapping.
It can be nice to go schema-less, but in your scenario you may have more success by explicitly declaring a mapping on your index for at least some of the fields in your documents. If you plan to index arrays full of mixed datatypes, you are better off declaring that field as string type.
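A sketch of what an explicit mapping could look like with the Python client (the index and type names are made up, and this assumes an older Elasticsearch release where the string field type still exists):
from elasticsearch import Elasticsearch

es = Elasticsearch()

# declare "name" as a string field up front, so mixed values such as
# ["string1", 1, "string2"] are all stored as strings
es.indices.create(
    index="my_index",
    body={"mappings": {"my_type": {"properties": {"name": {"type": "string"}}}}},
)

es.index(index="my_index", doc_type="my_type",
         body={"name": ["string1", 1, "string2"]})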

storing python list into mysql and accessing it

How can I store Python 'list' values into MySQL and access them later from the same database like a normal list?
I tried storing the list as a varchar type and it did store it. However, while accessing the data from MySQL I couldn't access the same stored value as a list; instead it acts as a string, so accessing the list by index was no longer possible. Would it perhaps be easier to store some of the data as the set datatype? I see the MySQL datatype 'set', but I'm unable to use it from Python. When I try to store a set from Python into MySQL, it throws the following error: 'MySQLConverter' object has no attribute '_set_to_mysql'. Any help is appreciated.
P.S. I have to store the coordinates of an image within the list along with the image number. So, it is going to be in the form [1,157,421]
Use a serialization library like json:
import json

l1 = [1, 157, 421]
s = json.dumps(l1)    # '[1, 157, 421]' - store this string in a VARCHAR/TEXT column
l2 = json.loads(s)    # back to the list [1, 157, 421]
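Putting that together with MySQL (a sketch; the table and column names and the mysql.connector connection details are assumptions):
import json
import mysql.connector

conn = mysql.connector.connect(user="user", password="password", database="mydb")
cur = conn.cursor()

coords = [1, 157, 421]
cur.execute("INSERT INTO images (coords) VALUES (%s)", (json.dumps(coords),))
conn.commit()

cur.execute("SELECT coords FROM images")
for (raw,) in cur:
    print(json.loads(raw)[1])   # indexing works again -> 157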
Are you using an ORM like SQLAlchemy?
Anyway, to answer your question directly, you can use json or pickle to convert your list to a string and store that. To read it back, parse it (as JSON or a pickle) to get the list again.
However, if your list is always a 3 point coordinate, I'd recommend making separate x, y, and z columns in your table. You could easily write functions to store a list in the correct columns and convert the columns to a list, if you need that.
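If you go the separate-columns route, the conversion helpers are short (the [image_number, x, y] layout is taken from the question; any column names are assumptions):
def list_to_columns(coords):
    # split [image_number, x, y] into values for separate columns
    image_number, x, y = coords
    return image_number, x, y

def columns_to_list(image_number, x, y):
    # rebuild the original list from the three column values
    return [image_number, x, y]

print(list_to_columns([1, 157, 421]))   # (1, 157, 421)
print(columns_to_list(1, 157, 421))     # [1, 157, 421]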

How to do data analysis using Python on a file with thousands of lines, one dictionary per line

I currently have a file with 5 thousand lines, with one dictionary in each line. All dictionaries have the same fields. My question is:
Should I learn SQL to store this data and do the analysis with it, or is the file I've got good enough, so that I should just use pandas or some other module to do the data analysis?
I'm really lost on which path should I take.
While the question is very general, it should be noted that 'how do I store my dataset' and 'what tool do I use to analyze my data' are very different problems.
Very often, for datasets that need to be modified or updated at a regular interval, a database will be preferable to, e.g., a compressed file (since modifying the compressed file's contents requires you to rewrite all of the data). For example, I probably wouldn't use sqlite for an nltk.corpus, although there may be use cases for that as well.
If you do decide to use sqlite, and your original data is in dictionary format (especially with many fields), you may find exectrace and rowtrace useful:
http://apidoc.apsw.googlecode.com/hg/connection.html#apsw.Connection.setrowtrace
and
http://apidoc.apsw.googlecode.com/hg/connection.html#apsw.Connection.setexectrace
For example to get rows out of sqlite in dict rather than tuple format, you may do:
def rowtracer(cursor, row):
    # build a dict mapping each column name (from the cursor description)
    # onto the corresponding value in the row
    dictionary = {}
    for index, (name, type_) in enumerate(cursor.getdescription()):
        dictionary[name] = row[index]
    return dictionary

con.setrowtrace(rowtracer)
And for inserts you can pass values in a dict, e.g.
"""insert into my_table(name, data) values(:name, :data)"""

Python categorize datatypes

I plan to make a 'table' class that I can use throughout my data-analysis program to store gathered data in. The objective is to make simple tables like this:
ID   Mean size   Stdv    Date measured   Relative flatness
-----------------------------------------------------------
1    133.4242    34.43   Oct 20, 2013    32093
2    239.244     34.43   Oct 21, 2012    3434
I will follow the sqlite3 suggestion from this post: python-data-structure-for-maintaing-tabular-data-in-memory, but I will still need to save it as a csv file (not as a dbase) and I want it to eat my data as we go: add columns on the fly whenever new measures become available and are deemed to be interesting. For that the class will need to be able to determine the data type of the data thrown at it.
Sqlite3 has limited datatypes: float, int, date and string. Python and numpy together have many types. Is there an easy way to quickly decide what the datatype of a variable is? So my table class can automatically add a column when new data is entered containing new fields.
I am not too concerned about performance; the table should be fairly small.
I want to use my class like so:
dt = Table()
dt.add_record({'ID': 5, 'Mean size': 39.4334})
dt.add_record({'ID': 5, 'Goodness of fit': 12})
In the last line, there is new data. The Table class needs to figure out what kind of data that is and then add a column to the sqlite3 table. Making it all strings seems a bit too floppy; I still want to keep my high-precision floats correct...
Also: If something like this already exists, I'd like to know about it.
It seems that your question is: "Is there an easy way to quickly decide what the datatype of a variable is?". This is a simple question, and the answer is:
type(variable).
But the context you provide requires a more careful answer.
Since SQLite3 provides only a few data types (slightly different ones than you listed), you need to map your input variables to the types provided by SQLite3.
But you may encounter further problems: You may need to change the types of columns as you receive new records, if you do not want to require that the column type be fixed in advance.
For example, for the Goodness of fit column in your example, you get an int (12) first. But you may get a float (e.g. 10.1) the second time, which shows that both values must be interpreted as floats. And if next time you receive a string, then all of them must be strings, right? But then the exact formatting of the numbers also counts: whereas 12 and 12.0 are the same when you interpret them as floats, they are not when you interpret them as strings; and the first value may become "12.0" when you convert all of them to strings.
So either you throw an exception when the types of consecutive values for the same column do not match, or you try to convert the previous values according to the new ones; but occasionally you may need to re-read the input.
Nevertheless, once you make those decisions regarding the expected behavior, it should not be a very difficult problem to implement.
Regarding your last question: I personally do not know of an existing implementation to this problem.
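A sketch of the type-mapping idea (my own illustration, not an existing library): map each Python value to an SQLite column type, and promote a column's type when a new value does not fit the current one:
import datetime

# assumed mapping from Python types to SQLite column types
SQLITE_TYPE = {int: "INTEGER", float: "REAL", str: "TEXT",
               datetime.date: "TEXT"}   # dates stored as ISO strings here

# promotion order: INTEGER can widen to REAL, and anything can widen to TEXT
PROMOTION = ["INTEGER", "REAL", "TEXT"]

def column_type(value):
    # fall back to TEXT for anything not in the mapping
    return SQLITE_TYPE.get(type(value), "TEXT")

def promote(existing, new):
    # return the wider of the two column types
    return max(existing, new, key=PROMOTION.index)

print(column_type(12))             # INTEGER
print(column_type(10.1))           # REAL
print(promote("INTEGER", "REAL"))  # REAL
print(promote("REAL", "TEXT"))     # TEXT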
