Python dictionaries are hashmaps of pointers to...what?

I know that in Python, dictionaries are implemented by hashing the keys and storing pointers (to the key-value pairs) in an array, the index being determined by the hash.
But how are the key-value pairs themselves stored? Are they stored together (i.e., in contiguous spots in memory)? Are they stored as a tuple or array of pointers, one pointing to the key and one to the value? Or is it something else entirely?
Googling has turned up lots of explanations about hashing and open addressing and the like, but nothing addressing this question.

Roughly speaking, there is a function, let's call it F, which calculates an index F(h) into an array of values. So the values are stored in an array and looked up as F(h). The reason it's "roughly speaking" is that hashes are computed differently for different objects: for pointers it's p>>3, while for strings the hash is a digest of all the bytes of the string.
If you want to look at the C code, search for lookdict_index or just look at the dictobject.c file in CPython's source code. It's pretty readable if you are used to reading C code.
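As a very rough illustration (a simplification, not CPython's actual probing scheme), F can be thought of as reducing the hash to a slot number in a fixed-size table:
def F(h, table_size=8):
    # Simplified: CPython masks the hash with (table_size - 1) and then
    # probes further slots on collision (open addressing with perturbation).
    return h & (table_size - 1)

print(F(hash('spam')))  # slot where the entry for 'spam' would be looked up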
Edit 1:
From the comment in Python 3.6.1's Include/dictobject.h:
/* If ma_values is NULL, the table is "combined": keys and values
are stored in ma_keys.
If ma_values is not NULL, the table is splitted:
keys are stored in ma_keys and values are stored in ma_values */
And an explanation from Objects/dictobject.c:
/*
The DictObject can be in one of two forms.
Either:
A combined table:
ma_values == NULL, dk_refcnt == 1.
Values are stored in the me_value field of the PyDictKeysObject.
Or:
A split table:
ma_values != NULL, dk_refcnt >= 1
Values are stored in the ma_values array.
Only string (unicode) keys are allowed.
All dicts sharing same key must have same insertion order.
....
*/
So the values are either stored in a separate array of value pointers (ma_values, the split-table case) that parallels the array of key entries, or each value's pointer is stored in the me_value field of a PyDictKeyEntry (the combined-table case). The keys are stored in the me_key fields of PyDictKeyEntry; the keys table is really an array of PyDictKeyEntry structs.
Just as a reference, PyDictKeyEntry is defined as:
typedef struct {
    /* Cached hash code of me_key. */
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* This field is only meaningful for combined tables */
} PyDictKeyEntry;
Relevant files to look at: Objects/dict-common.h, Objects/dictobject.c, and Include/dictobject.h in Python's source code.
Objects/dictobject.c has an extensive write up in the comments in the beginning of the file explaining the whole scheme and historical background.
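To make the combined-table layout concrete, here is a heavily simplified Python sketch of the structure described above (collision handling and resizing are left out; CPython does all of this in C with open addressing):
class Entry(object):
    """Mirrors PyDictKeyEntry: cached hash plus references to key and value."""
    def __init__(self, h, key, value):
        self.me_hash = h
        self.me_key = key
        self.me_value = value

class TinyCombinedDict(object):
    def __init__(self, size=8):
        self.indices = [None] * size  # sparse table: slot -> position in entries
        self.entries = []             # dense array of Entry records

    def _slot(self, h):
        return h & (len(self.indices) - 1)  # simplified F(h)

    def put(self, key, value):
        h = hash(key)
        pos = self.indices[self._slot(h)]
        if pos is None:
            self.indices[self._slot(h)] = len(self.entries)
            self.entries.append(Entry(h, key, value))
        elif self.entries[pos].me_key == key:
            self.entries[pos].me_value = value
        else:
            raise NotImplementedError('collision: CPython would probe further')

    def get(self, key):
        pos = self.indices[self._slot(hash(key))]
        if pos is None or self.entries[pos].me_key != key:
            raise KeyError(key)
        return self.entries[pos].me_value

d = TinyCombinedDict()
d.put('spam', 1)
print(d.get('spam'))  # 1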

Related

Insert list into SQLite3 cell

I'm new to Python and even newer to SQL, and have just run into the following problem:
I want to insert a list (or actually, a list containing one or more dictionaries) into a single cell in my SQL database. This is one row of my data:
[a,b,c,[{key1: int, key2: int},{key1: int, key2: int}]]
As the number of dictionaries inside the lists varies and I want to iterate through the elements of the list later on, I thought it would make sense to keep it in one place (thus not splitting the list into its single elements). However, when trying to insert the list as it is, I get the following error:
sqlite3.InterfaceError: Error binding parameter 2 - probably unsupported type.
How can this kind of list be inserted into a single cell of my SQL database?
SQLite has no facility for a 'nested' column; you'd have to store your list as text or a binary data blob, serialising it on the way in and deserialising it again on the way out.
How you serialise to text or binary data depends on your use-cases. JSON (via the json module) could be suitable if your lists and dictionaries consist only of text, numbers, booleans and None (with the dictionaries only using strings as keys). JSON is supported by a wide range of other languages, so you keep your data reasonably compatible. Or you could use pickle, which lets you serialise to a binary format and can handle just about anything Python can throw at it, but it's specific to Python.
You can then register an adapter to handle converting between the serialisation format and Python lists:
import json
import sqlite3

def adapt_list_to_JSON(lst):
    return json.dumps(lst).encode('utf8')

def convert_JSON_to_list(data):
    return json.loads(data.decode('utf8'))

sqlite3.register_adapter(list, adapt_list_to_JSON)
sqlite3.register_converter("json", convert_JSON_to_list)
Then connect with detect_types=sqlite3.PARSE_DECLTYPES and declare your column type as json, or use detect_types=sqlite3.PARSE_COLNAMES and use [json] in a column alias (SELECT datacol AS "datacol [json]" FROM ...) to trigger the conversion on loading.
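For example, a usage sketch under the PARSE_DECLTYPES approach, assuming the adapter and converter above have already been registered (the table and column names here are made up):
# continues from the registration code above
conn = sqlite3.connect(':memory:', detect_types=sqlite3.PARSE_DECLTYPES)
conn.execute('CREATE TABLE mytable (a TEXT, b TEXT, c TEXT, datacol json)')

row = ['a', 'b', 'c', [{'key1': 1, 'key2': 2}, {'key1': 3, 'key2': 4}]]
conn.execute('INSERT INTO mytable VALUES (?, ?, ?, ?)', row)

stored = conn.execute('SELECT datacol FROM mytable').fetchone()[0]
print(stored)  # the list of dictionaries, deserialised back from JSON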

Python hash() function on strings

How is the hash value of a particular string calculated in CPython 2.7?
For instance, this code:
print hash('abcde' * 1000)
returns the same value even after I restart the Python process and try again (I did it many times).
So, it seems that the id() (memory address) of the string isn't used in this computation, right? Then how?
Hash values are not dependent on the memory location but on the contents of the object itself. From the documentation:
Return the hash value of the object (if it has one). Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup. Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for 1 and 1.0).
See CPython's implementation of str.__hash__ in:
Objects/unicodeobject.c (for unicode_hash)
Python/pyhash.c (for _Py_HashBytes)
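A quick demonstration that the hash depends on the contents rather than the identity of the string (note that with hash randomization enabled, via -R on 2.7 or by default on 3.3+, the value changes between interpreter runs but is still content-based within a single process):
a = 'abcde' * 1000
b = 'abcde' * 1000
print(a is b)              # usually False: two distinct string objects
print(hash(a) == hash(b))  # True: equal contents give equal hashes
print(hash(a) == id(a))    # False in general: the address is not the hash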

ElasticSearch throwing mapper parsing exception when indexing JSON array of integers and strings

I am attempting to use python to pull a JSON array from a file and input it into ElasticSearch. The array looks as follows:
{"name": [["string1", 1, "string2"],["string3", 2, "string4"], ... (variable length) ... ["string n-1", 3, "string n"]]}
ElasticSearch throws a TransportError(400, mapper_parsing_exception, failed to parse) when attempting to index the array. I discovered that ElasticSearch sometimes throws the same error when I try to feed it an array containing both strings and integers. So, for example, the following will sometimes crash and sometimes succeed:
import json
from elasticsearch import Elasticsearch
es = Elasticsearch()
test = json.loads('{"test": ["a", 1, "b"]}')
print test
es.index(index, body=test)
This code is everything that was left after I commented out whatever I could without breaking the program. I put the JSON in the program instead of having it read from a file. The actual strings I'm inputting are quite long (or else I would just post them) and will always crash the program. Changing the JSON to "test": ["a"] will cause it to work. The current setup crashes if it last crashed, or works if it last worked. What is going on? Will some sort of mapping setup fix this? I haven't figured out how to set up a mapping with a variable array length. I'd prefer to take advantage of the schema-less input, but I'll take whatever works.
It is possible you are running into type conflicts with your mapping. Since you have expressed a desire to stay "schema-less", I am assuming you have not explicitly provided a mapping for your index. That works fine; just recognize that the first document you index will determine the schema for your index. For each document you index afterwards that has the same fields (by name), those fields must conform to the same types as in the first document.
Elasticsearch has no issues with arrays of values. In fact, under the hood it treats all values as arrays (with one or more entries). What is slightly concerning is the example array you chose, which mixes string and numeric types. Since each value in your array gets mapped to the field named "test", and that field may only have one type, if the first value of the first document ES processes is numeric, it will likely assign that field the long type. Then future documents that contain a string that does not parse nicely into a number will cause an exception in Elasticsearch.
Have a look at the documentation on Dynamic Mapping.
It can be nice to go schema-less, but in your scenario you may have more success by explicitly declaring a mapping on your index for at least some of the fields in your documents. If you plan to index arrays full of mixed datatypes, you are better off declaring that field as string type.
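For instance, a hedged sketch of declaring such a mapping up front with the same client as above; the index name, type name and field name are just examples, and on Elasticsearch 5.x+ you would use "text"/"keyword" instead of "string":
from elasticsearch import Elasticsearch

es = Elasticsearch()
es.indices.create(
    index='test-index',
    body={
        'mappings': {
            'doc': {
                'properties': {
                    # force every value in "test" to be treated as a string
                    'test': {'type': 'string'}
                }
            }
        }
    },
)
es.index(index='test-index', doc_type='doc', body={'test': ['a', 1, 'b']})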

Python get JSON object by index

I have a JSON file that looks something like this:
{
    "data": {
        "123": { ... },
        "212": { ... },
        "343": { ... }
    }
}
In Python I load the JSON using json.loads(r.text). The strings inside the data object are not guaranteed to be these but can be any numbers. What I want is to be able to get those numbers so I can store them in an array. In this example I want the array to look like [123, 212, 343]. Is there any way to do this, since they are nested objects and not a JSON array?
Thanks
Very briefly:
#!/usr/bin/env python
import json
foo = json.loads('{"123": null, "234": null, "456": null}')
print map(int, foo.keys())
The list foo.keys() will not be in the same order as presented in the JSON object (or its string representation).
If you need to preserve ordering, you might try the following modification:
#!/usr/bin/env python
import json, collections
foo = json.loads('{"123": null, "234": null, "456": null}', object_pairs_hook=collections.OrderedDict)
print map(int, foo.keys())
As you iterate over the list of keys, you can do a reverse lookup on an individual key in the usual way.
Note that you will likely want to convert the integer-key back to a Python string with str(), in order to retrieve its associated value. If you just want the list of keys for lookups, however, and you don't really need the actual integer values, you can skip the initial map call and just preserve the keys as strings.
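Putting it together for the structure in the question (a small sketch, assuming the numbers are sibling keys of the data object as in the examples above; the values here are invented just to show the reverse lookup):
import json

doc = json.loads('{"data": {"123": "a", "212": "b", "343": "c"}}')
keys = [int(k) for k in doc['data'].keys()]
print(keys)                       # e.g. [123, 212, 343] (order not guaranteed)
print(doc['data'][str(keys[0])])  # convert back with str() to fetch a value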

How should python dictionaries be stored in pytables?

pytables doesn't natively support python dictionaries. The way I've approached it is to make a data structure of the form:
tables_dict = {
    'key': tables.StringCol(itemsize=40),
    'value': tables.Int32Col(),
}
(note that I ensure that the keys are <40 characters long) and then create a table using this structure:
file_handle.createTable('/', 'dictionary', tables_dict)
and then populate it with:
file_handle.dictionary.append(dictionary.items())
and retrieve data with:
dict(file_handle.dictionary.read())
This works ok, but reading the dictionary back in is extremely slow. I think the problem is that the read() function is causing the entire dictionary to be loaded into memory, which shouldn't really be necessary. Is there a better way to do this?
You can ask PyTables to search inside the table, and also create an index on the key column to speed that up.
To create an index:
table.cols.key.createIndex()
To query the values where key equals the variable search_key:
[row['value'] for row in table.where('key == search_key')]
http://pytables.github.com/usersguide/optimization.html#searchoptim
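An end-to-end sketch with the table layout from the question (the file name and key are invented; this uses the older camelCase PyTables API that the rest of this post uses, which newer releases spell open_file / create_index):
import tables

file_handle = tables.openFile('data.h5', mode='a')
table = file_handle.root.dictionary

# Create the index once; subsequent where() queries on `key` will use it.
table.cols.key.createIndex()

# Fetch a single value without reading the whole table into memory.
search_key = 'some_key'
values = [row['value'] for row in table.where('key == search_key')]
print(values)

file_handle.close()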
