Implementation of a hash table in SWI-Prolog - python

So I have to implement an ADT, in this case a hash table, in SWI-Prolog.
I need help, because I'm new to this programming language and don't know how to start.
This started as an implementation in Python (3) where I defined a class and added functions to work with it (add, is_empty?, delete, rehash, hash, etc.).
But now I need to do something similar in Prolog.
I have visited some other Stack Overflow questions similar to mine, but I'm still stuck.
I expect to define a basic hash table and be able to add key+value data and some other basic functions. I'm not sure whether this is already implemented somewhere else.
Please help.

Several Prolog systems provide a term-hashing built-in or library predicate. For example, SWI-Prolog provides the term_hash/2 and term_hash/4 built-in predicates. These predicates are often combined with first-argument indexing. A simple example:
% dynamic predicate to hold hash table entries
% with the term hash used as first argument to
% take advantage of first-argument indexing
%
% hash_table(Hash, Term).
:- dynamic(hash_table/2).

add_hash_table_entry(Term) :-
    ground(Term),                       % term_hash/2 leaves Hash unbound for non-ground terms
    term_hash(Term, Hash),              % or term_hash/4
    assertz(hash_table(Hash, Term)).

del_hash_table_entry(Term) :-
    ground(Term),
    term_hash(Term, Hash),              % or term_hash/4
    retractall(hash_table(Hash, Term)). % remove only this term; other terms may share the hash

hash_table_entry(Term) :-
    (   var(Term) ->
        hash_table(_, Term)
    ;   term_hash(Term, Hash),          % or term_hash/4
        hash_table(Hash, Term)
    ).

This sounds like a misguided idea. How is this "hash table" going to be used? What algorithmic complexity do you expect the different operations to have? Why do you want to implement a hash table in a language that doesn't need hash tables implemented by users?
The only half-decent way to do it will be to use a flat term for the table, one argument per bucket. If you have k buckets, you use a term of arity k, so for 256 buckets you get hash_table/256.
empty_hash_table(T) :-
    length(Buckets, 256),
    maplist(=(nil), Buckets),
    T =.. [hash_table|Buckets].
You can now use arg/3 to get a bucket in constant time. You can use setarg/3 to change them.
But this all is starting to sound very fishy. You need to explain your reasons better. Why do you want to implement a hash table in Prolog? How is it going to be used?

Related

Python Efficiency of the in statement

Just a quick question, I know that when looking up entries in a dictionary there's a fast efficient way of doing it:
(Assuming the dictionary is ordered in some way using collections.OrderedDict())
You start at the middle of the dictionary, and find whether the desired key is off to one half or another, such as when testing the position of a name in an alphabetically ordered dictionary (or in rare cases dead on). You then check the next half, and continue this pattern until the item is found (meaning that with a dictionary of 1000000 keys you could effectively find any key within 20 iterations of this algorithm).
So I was wondering, if I were to use an in statement (i.e. if a in somedict:), would it use this same method of checking for the desired key? Does it use a faster/slower algorithm?
Nope. Python's dictionaries basically use a hash table (actually a modified hash table, to improve speed); I won't bother to explain a hash table here, the linked Wikipedia article describes it well. It is a neat structure which allows ~O(1) (very fast) access on average. in looks up the object (the same thing that dict[object] does), except it doesn't return the object, so that is already about as fast as it gets.
The code for in for dictionaries contains this line (dk_lookup() returns a hash table entry if it exists, otherwise NULL (the equivalent of None in C, often indicating an error)):
ep = (mp->ma_keys->dk_lookup)(mp, key, hash, &value_addr);
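If you want to see the difference yourself, here is a rough timing sketch (my own, not from the original answer; absolute numbers will vary by machine). The dictionary membership test stays roughly constant, while the list membership test has to scan the elements:
import timeit

d = {i: None for i in range(1000000)}
l = list(range(1000000))

# dict membership goes through the hash table: roughly constant time
print(timeit.timeit('999999 in d', globals={'d': d}, number=1000))
# list membership scans the elements one by one: time grows with the list
print(timeit.timeit('999999 in l', globals={'l': l}, number=1000))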

identify 5-tuple flows by hash value in Python

I identify Internet traffic flows by their 5-tuple (src IP, dst IP, src port, dst port, transport protocol number) and I would like to turn this 5-tuple into a much more compact alphanumeric ID for internal use in my script.
What choices do I have in Python?
I read that the built-in hash function is only consistent within a single platform/run, so I would prefer something else.
I will only ever have to deal with no more than a few hundreds different 5-tuples.
Just choose your own hash function:
import hashlib

h = hashlib.md5()
t = (1, 2, 3, 4, 5)  # whatever
t_as_string = str(t)
h.update(t_as_string.encode('utf-8'))  # hashlib digests work on bytes
print(h.hexdigest())
You can use any of the functions in hashlib. And since this isn't a security issue, it doesn't really matter which one...
BUT: wanna bet, comparing tuples will be faster / more efficient?
The following Python hash function, by Ewen Cheslack-Postava, remains consistent across operating systems and CPUs:
https://pypi.python.org/pypi/pyhashxx/
Are you worried about collisions across OSs? Is that your issue?
But since you are only dealing with a few hundred 5-tuples, can't you apply some kind of hash collision resolution technique like chaining or open addressing, etc.?
If I am not missing anything else, I believe the above method is better than devising a new hashing algorithm yourself.
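If cross-machine consistency is the main requirement, one option (a sketch of the hashlib approach above; the flow_id name and the 12-character truncation are my own arbitrary choices) is to take a cryptographic digest of the tuple's text form and truncate it. With only a few hundred flows, collisions in a 48-bit prefix are vanishingly unlikely:
import hashlib

def flow_id(five_tuple, length=12):
    # the digest depends only on the tuple's contents (ints and strings here),
    # not on the OS, CPU, or PYTHONHASHSEED
    data = repr(five_tuple).encode('utf-8')
    return hashlib.sha1(data).hexdigest()[:length]

print(flow_id(('10.0.0.1', '10.0.0.2', 12345, 80, 6)))  # 12 hex characters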

Python unhash value

I am a newbie to Python. Can I unhash, or rather, how can I unhash a value? I am using the standard hash() function. What I would like to do is first hash a value, send it somewhere, and then unhash it, as such:
# process X
hashedVal = hash(someVal)
# send and receive in process Y
someVal = unhash(hashedVal)
# for example, print it
print(someVal)
Thanks in advance
It can't be done.
A hash is not a compressed version of the original value; it is a number (or something similar) derived from the original value. The nature of hash implementations is that it is possible (but statistically unlikely, if the hash algorithm is a good one) that two different objects produce the same hash value.
This is known as the Pigeonhole Principle, which basically states that if you have N different items and want to place them into M different categories, where N is larger than M (i.e. more items than categories), you're going to end up with some categories containing multiple items. Since a hash value is typically much smaller in size than the data it hashes, it follows the same principle.
As such, it is impossible to go back once you have the hash value. You need a different way of transporting data than this.
For instance, an example (but not a very good one) of a hash algorithm would be to calculate the number modulo 3 (i.e. the remainder after dividing by 3). Then you would have the following hash values for the numbers:
1 --> 1  <--+-- same hash value, but different original values
2 --> 2     |
3 --> 0     |
4 --> 1  <--+
Are you trying to use the hash function in this way in order to:
Save space (you have observed that the hash value is much smaller in size than the original data)
Secure transportation (you have observed that the hash value is difficult to reverse)
Transport data (you have observed that the hash number/string is easier to transport than a complex object hierarchy)
... ?
Knowing why you want to do this might give you a better answer than just "it can't be done".
For instance, for the above 3 different observations, here's a way to do each of them properly:
Compression/Decompression, for instance using gzip or zlib (the two typically available in most programming languages/runtimes); a minimal sketch follows this list
Encryption/Decryption, for instance using RSA, AES or a similar secure encryption algorithm
Serialization/Deserialization, which is code built to take a complex object hierarchy and produce either a binary or textual representation that later on can be deserialized back into new objects
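For the compression case, a minimal round trip with zlib looks like the sketch below (my own illustration; the other two cases work analogously with an encryption or serialization library). The point is that, unlike a hash, the transformation is reversible:
import zlib

original = b"some payload that must arrive intact"
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)
assert restored == original  # fully reversible, unlike a hash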
Even though I'm almost 8 years late with an answer, I want to say that it is possible to unhash data (not with the standard hash() function, though).
The previous answers all describe cryptographic hash functions, which by design compute hashes that are impossible (or at least very hard) to reverse.
However, this is not the case with all hash functions.
Solution
You can use basehash python lib (pip install basehash) to achieve what you want.
There is an important thing to keep in mind though: in order to be able to unhash the data, you need to hash it without loss of data. This generally means that the bigger the pool of data types and values you would like to hash, the bigger the hash length has to be, so that you won't get hash collisions.
Anyway, here's a simple example of how to hash/unhash data:
import basehash
hash_fn = basehash.base36() # you can initialize a 36, 52, 56, 58, 62 and 94 base fn
hash_value = hash_fn.hash(1) # returns 'M8YZRZ'
unhashed = hash_fn.unhash('M8YZRZ') # returns 1
You can define the hash length on hash function initialization and hash other data types as well.
I leave out the explanation of the necessity for various bases and hash lengths to the readers who would like to find out more about hashing.
You can't "unhash" data; hash functions are irreversible due to the pigeonhole principle:
http://en.wikipedia.org/wiki/Hash_function
http://en.wikipedia.org/wiki/Pigeonhole_principle
I think what you are looking for is encryption/decryption. (Or compression or serialization, as mentioned in other answers/comments.)
This is not possible in general. A hash function necessarily loses information, and python's hash is no exception.
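As a quick illustration of that information loss (a CPython implementation detail: -1 is reserved as an error code at the C level, so hash(-1) is mapped to -2):
print(hash(-1) == hash(-2))  # True in CPython: two different values, one hash value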

Python Extension Returned Object Etiquette

I am writing a Python extension to provide access to Solaris kstat data (in the same spirit as the shipping Perl library Sun::Solaris::Kstat) and I have a question about conditionally returning a list or a single object. The Python use case would look something like:
cpu_stats = cKstats.lookup(module='cpu_stat')
cpu_stat0 = cKstats.lookup('cpu_stat',0,'cpu_stat0')
As it's currently implemented, lookup() returns a list of all kstat objects which match. The first case would result in a list of objects ( as many as there are CPUs ) and the second call specifies a single kstat completely and would return a list containing one kstat.
My question is: is it poor form to return a single object when there is only one match, and a list when there are many?
Thank you for the thoughtful answer! My python-fu is weak but growing stronger due to folks like you.
"My question is it poor form to return a single object when there is only one match, and a list when there are many?"
It's poor form to return inconsistent types.
Return a consistent type: List of kstat.
Most Pythonistas don't like using type(result) to determine if it's a kstat or a list of kstats.
We'd rather check the length of the list in a simple, consistent way.
Also, if the length depends on a piece of system information, perhaps an API method could provide this metadata.
Look at DB-API PEP for advice and ideas on how to handle query-like things.
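As a sketch of that convention (the records and the lookup signature below are stand-ins, not the actual cKstats API), the extension could always return a list and let the caller check its length:
# stand-in records; the real extension would pull these from the kstat framework
KSTATS = [
    {'module': 'cpu_stat', 'instance': 0, 'name': 'cpu_stat0'},
    {'module': 'cpu_stat', 'instance': 1, 'name': 'cpu_stat1'},
]

def lookup(module=None, instance=None, name=None):
    # always return a list, even when exactly one kstat matches
    return [k for k in KSTATS
            if (module is None or k['module'] == module)
            and (instance is None or k['instance'] == instance)
            and (name is None or k['name'] == name)]

cpu_stats = lookup(module='cpu_stat')           # list with one entry per CPU
cpu_stat0 = lookup('cpu_stat', 0, 'cpu_stat0')  # still a list, of length 1
if len(cpu_stat0) == 1:
    the_one = cpu_stat0[0]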

What is the best way to store set data in Python?

I have a list of data in the following form:
[(id_1, description, id_type), (id_2, description, id_type), ..., (id_n, description, id_type)]
The data are loaded from files that belong to the same group. In each group there could be multiples of the same id, each coming from different files. I don't care about the duplicates, so I thought that a nice way to store all of this would be to throw it into a Set type. But there's a problem.
Sometimes for the same id the descriptions can vary slightly, as follows:
IPI00110753
Tubulin alpha-1A chain
Tubulin alpha-1 chain
Alpha-tubulin 1
Alpha-tubulin isotype M-alpha-1
(Note that this example is taken from the uniprot protein database.)
I don't care if the descriptions vary. I cannot throw them away because there is a chance that the protein database I am using will not contain a listing for a certain identifier. If this happens I will want to be able to display the human readable description to the biologists so they know roughly what protein they are looking at.
I am currently solving this problem by using a dictionary type. However I don't really like this solution because it uses a lot of memory (I have a lot of these ID's). This is only an intermediary listing of them. There is some additional processing the ID's go through before they are placed in the database so I would like to keep my data-structure smaller.
I have two questions really. First, will I get a smaller memory footprint using the Set type (over the dictionary type) for this, or should I use a sorted list where I check every time I insert into the list to see if the ID exists, or is there a third solution that I haven't thought of? Second, if the Set type is the better answer how do I key it to look at just the first element of the tuple instead of the whole thing?
Thank you for reading my question,
Tim
Update
based on some of the comments I received, let me clarify a little. Most of what I do with the data structure is insert into it. I only read it twice: once to annotate it with additional information,* and once when it is inserted into the database. However, down the line there may be additional annotation that is done before I insert it into the database. Unfortunately, I don't know if that will happen at this time.
Right now I am looking into storing this data in a structure that is not based on a hash table (i.e. a dictionary). I would like the new structure to be fairly quick on insertion, but reading it can be linear since I only really do it twice. I am trying to move away from the hash table to save space. Is there a better structure, or is a hash table about as good as it gets?
*The information is a list of Swiss-Prot protein identifiers that I get by querying uniprot.
Sets don't have keys. The element is the key.
If you think you want keys, you have a mapping. More-or-less by definition.
Sequential list lookup can be slow, even using a binary search. Mappings use hashes and are fast.
Are you talking about a dictionary like this?
{ 'id1': [ ('description1a', 'type1'), ('description1b', 'type1') ],
  'id2': [ ('description2', 'type2') ],
  ...
}
This sure seems minimal. ID's are only represented once.
Perhaps you have something like this?
{ 'id1': ( ('description1a', 'description1b'), 'type1' ),
  'id2': ( ('description2',), 'type2' ),
  ...
}
I'm not sure you can find anything more compact unless you resort to using the struct module.
I'm assuming that the problem you are trying to solve by cutting down on memory use is the address space limit of your process. Additionally, you are looking for a data structure that allows fast insertion and reasonably fast sequential read-out.
Use few structures other than strings (str)
The question you ask is how to structure your data in one process to use less memory. The one canonical answer to this (as long as you still need associative lookups) is to use as few structures other than Python strings (str, not unicode) as possible. A Python hash (dictionary) stores the references to your strings fairly efficiently (it is not a b-tree implementation).
However I think that you will not get very far with that approach, since what you face are huge datasets that might eventually just exceed the process address space and the physical memory of the machine you're working with altogether.
Alternative Solution
I would propose a different solution that does not involve changing your data structure to something that is harder to insert or interpret.
Split your information up into multiple processes, each holding whatever data structure is convenient for that part.
Implement inter-process communication with sockets, so that processes can reside on other machines altogether.
Try to divide your data so as to minimize inter-process communication (I/O is glacially slow compared to CPU cycles).
The advantages of the approach I outline are:
You get to use two or more cores of a machine fully for performance
You are not limited by the address space of one process, or even by the physical memory of one machine
There are numerous packages and approaches to distributed processing, some of which are
linda
processing
If you're doing an n-way merge with removing duplicates, the following may be what you're looking for.
This generator will merge any number of sources. Each source must be a sequence.
The key must be in position 0. It yields the merged sequence one item at a time.
def merge( *sources ):
    keyPos = 0
    for s in sources:
        s.sort()
    while any( [len(s) > 0 for s in sources] ):
        topEnum = enumerate([ s[0][keyPos] if len(s) > 0 else None for s in sources ])
        top = [ t for t in topEnum if t[1] is not None ]
        top.sort( key=lambda a: a[1] )
        src, key = top[0]
        # print(src, key)
        yield sources[ src ].pop(0)
This generator removes duplicates from a sequence.
def unique( sequence ):
    keyPos = 0
    seqIter = iter(sequence)
    curr = next(seqIter)
    for item in seqIter:
        if item[keyPos] == curr[keyPos]:
            # might want to create a sub-list of matches
            continue
        yield curr
        curr = item
    yield curr
Here's a script which uses these functions to produce a resulting sequence which is the union of all the sources with duplicates removed.
for u in unique( merge( source1, source2, source3, ... ) ):
    print(u)
The complete set of data in each sequence must exist in memory once because we're sorting in memory. However, the resulting sequence does not actually exist in memory. Indeed, it works by consuming the other sequences.
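For example, with two small made-up source lists, the combination produces a sorted union keyed on the id, keeping the first description seen for a duplicated id:
source1 = [('id1', 'desc a', 't'), ('id3', 'desc c', 't')]
source2 = [('id1', 'desc b', 't'), ('id2', 'desc d', 't')]
print(list(unique(merge(source1, source2))))
# [('id1', 'desc a', 't'), ('id2', 'desc d', 't'), ('id3', 'desc c', 't')]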
How about using a {id: (description, id_type)} dictionary? Or a {(id, id_type): description} dictionary, if (id, id_type) is the key?
Sets in Python are implemented using hash tables. In earlier versions, they were actually implemented using dicts, but that has changed AFAIK. The only thing you save by using a set would then be the size of a pointer for each entry (the pointer to the value).
To use only a part of a tuple for the hash code, you'd have to subclass tuple and override the __hash__ method:
class ProteinTuple(tuple):
    def __new__(cls, m1, m2, m3):
        return tuple.__new__(cls, (m1, m2, m3))
    def __hash__(self):
        return hash(self[0])
Keep in mind that you pay for the extra function call to __hash__ in this case, because otherwise it would be a C method.
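One caveat worth adding (my own note, not part of the original answer): a set also calls __eq__ when it finds a hash match, so overriding __hash__ alone will not make a set collapse entries that share an id. A hypothetical variant that keys both operations on the id, using the identifier and descriptions from the question ('protein' as the id_type is just a placeholder):
class ProteinRecord(tuple):
    # like ProteinTuple, but equality is also keyed on the id, so a set
    # treats entries with the same id as duplicates
    def __new__(cls, id_, description, id_type):
        return tuple.__new__(cls, (id_, description, id_type))
    def __hash__(self):
        return hash(self[0])
    def __eq__(self, other):
        return self[0] == other[0]

p1 = ProteinRecord('IPI00110753', 'Tubulin alpha-1A chain', 'protein')
p2 = ProteinRecord('IPI00110753', 'Alpha-tubulin 1', 'protein')
print(len({p1, p2}))  # 1: the second entry has the same id and is dropped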
I'd go for Constantin's suggestions and take out the id from the tuple and see how much that helps.
It's still murky, but it sounds like you have several lists of [(id, description, type)...].
The id's are unique within a list and consistent between lists.
You want to create a UNION: a single list, where each id occurs once, with possibly multiple descriptions.
For some reason, you think a mapping might be too big. Do you have any evidence of this? Don't over-optimize without actual measurements.
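A crude way to get such a measurement is sys.getsizeof; the helper below is my own sketch and only counts the container plus its immediate elements (shallow sizes), so treat the numbers as a comparison rather than an exact footprint:
import sys

def rough_size(obj):
    # container itself plus its immediate elements; nested objects are not followed
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(sys.getsizeof(k) + sys.getsizeof(v) for k, v in obj.items())
    elif isinstance(obj, (list, set, tuple)):
        size += sum(sys.getsizeof(x) for x in obj)
    return size

as_list = [('id%06d' % i, 'description', 'type') for i in range(1000)]
as_dict = {'id%06d' % i: ('description', 'type') for i in range(1000)}
print(rough_size(as_list), rough_size(as_dict))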
This may be (if I'm guessing correctly) the standard "merge" operation from multiple sources.
source1.sort()
source2.sort()
result = []
while len(source1) > 0 or len(source2) > 0:
    if len(source1) == 0:
        result.append( source2.pop(0) )
    elif len(source2) == 0:
        result.append( source1.pop(0) )
    elif source1[0][0] < source2[0][0]:
        result.append( source1.pop(0) )
    elif source2[0][0] < source1[0][0]:
        result.append( source2.pop(0) )
    else:
        # keys are equal: keep one entry and drop the duplicate from source2
        # (check source2 here if you care whether the descriptions differ)
        result.append( source1.pop(0) )
        source2.pop(0)
This assembles a union of two lists by sorting and merging. No mapping, no hash.
