Skip duplicates for EMR - python

I currently have an enormous number of medical records containing medical terms that need to be translated. For cost reasons, we don't want to translate every term in every record. For example, if the terms in a record have already appeared frequently in previous records, they have most likely been translated already, so we don't want to translate them again. I was asked to design a program to accomplish this goal. The hints I got are that I may need to break the records down to the alphabet level, and that a matrix may be needed to solve this problem. I am literally a beginner in programming, so I'm looking for help here. Rough thoughts/suggestions are enough for now. Thanks.
[Edit by Spektre] moved from comments
My problem boils down to this:
Say there are two sentences A and B. A has m tokens (a1, a2, ……, am) and B has n tokens (b1, b2, ……, bn). A and B might have tokens in common. I need a function to estimate the likelihood that the tokens in B are not covered by A.
The tokens are already stored in dictionary.
How to implement this?

So if I see it right, you want to know whether bi is in A or not.
I do not code in Python, but I see it like this (in C++-like languages):
bool untranslated(int j,int m,int n,string *a,string *b)
{
    // the dictionaries are: a[m], b[n]
    for (int i=0;i<m;i++)   // inspect all tokens of A
        if (b[j]==a[i])     // if b[j] is present in A
            return false;
    return true;
}
Now if the dictionaries are rather large you need to change this linear search to a binary search. Also, to speed things up (if the words are big) you should use hashes (a hash map) for matching. Of course, depending on your language, you cannot compare words naively with ==; instead, implement a function that converts a word into its simple grammatical form and store only that form in the dictionary. That can be pretty complicated to implement.
Now the probability of whole sentence would be:
// your dictionaries:
const int m=?,n=?;
string A[m],B[n];
// code:
int j; float p;
for (p=0.0,j=0;j<n;j++)               // test all words of B
    if (untranslated(j,m,n,A,B)) p++; // and count how many are untranslated
p/=float(n); // normalize p to <0,1>; it is your probability that sentence B is not covered by A
the resulting probability p is in range <0,1> so if you want percentage instead just multiply it by 100.
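Since the question is about Python, here is a minimal sketch (mine, not part of the answer above) of the same computation using a set for A, which gives you the hash-map matching for free:

def untranslated_fraction(a_tokens, b_tokens):
    """Fraction of tokens in sentence B that do not appear in sentence A."""
    a_set = set(a_tokens)                      # hash-based lookup instead of a linear search
    if not b_tokens:
        return 0.0
    missing = sum(1 for tok in b_tokens if tok not in a_set)
    return missing / float(len(b_tokens))      # in <0,1>; multiply by 100 for a percentage

print(untranslated_fraction(["chronic", "renal", "failure"],
                            ["acute", "renal", "failure"]))   # 1/3 of B is new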
[Edit1] occurrence of bi
That is an entirely different problem, but also relatively easy to solve. It is the same as computing a histogram, so:
Add a counter for each word in the A dictionary,
so each record of A will look like this:
struct A_record
{
    string word;
    int cnt;
};
int m=0;      // number of distinct words stored so far
A_record a[]; // in real code give this a fixed size or use a dynamic container
Process the B sentences:
for each word bi, look it up in dictionary A. If it is not present there, add it to the dictionary and set its counter to 1. If it is present, just increment its counter by one instead.
const int n=?;      // input sentence word count
string b[n]={...};  // input sentence words
int i,j;
for (i=0;i<n;i++)   // process B
{
    for (j=0;j<m;j++)               // search in A (should be a binary search or hash-map search)
        if (b[i]==a[j].word)
            { a[j].cnt++; break; }  // found: a[j].cnt is the occurrence count of bi; divided by the total word count it gives a probability in <0,1>
    if (j>=m)
        { a[m].word=b[i]; a[m].cnt=1; m++; } // not found: first occurrence of bi, add it
}
Now if you want just the previous occurrences of bi, look at the matched a[j].cnt during the search. If you want the occurrences of any b[i] word in the whole text, look at the same counter after the whole text has been processed.
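In Python the same bookkeeping collapses to a dict of counters; here is a small sketch (mine; collections.Counter would do the same job):

def update_counts(counts, b_tokens):
    """Add each token of B to the histogram: new tokens get count 1, known ones are incremented."""
    for tok in b_tokens:
        counts[tok] = counts.get(tok, 0) + 1   # previous occurrences of tok = counts[tok] - 1
    return counts

counts = {}
update_counts(counts, ["acute", "renal", "failure"])
update_counts(counts, ["chronic", "renal", "failure"])
print(counts)   # {'acute': 1, 'renal': 2, 'failure': 2, 'chronic': 1}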

Related

How to do random.choice in C

As an intermediate Python learner, I made an 8 ball in Python.
Now that I am starting to learn C, is there a way to simulate the way random.choice can select a string from a list of strings, but in C?
The closest thing to a "list of strings" in C is an array of string pointers; and the only standard library function that produces random numbers is rand(), defined in <stdlib.h>.
A simple example:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>   // needed for the usual srand() initialization

int main(void)
{
    const char *string_table[] = {  // array of pointers to constant strings
        "alpha",
        "beta",
        "gamma",
        "delta",
        "epsilon"
    };
    int table_size = 5;     // This must match the number of entries above

    srand(time(NULL));      // randomize the start value

    for (int i = 1; i <= 10; ++i)
    {
        const char *rand_string = string_table[rand() % table_size];
        printf("%2d. %s\n", i, rand_string);
    }
    return 0;
}
That will generate and print ten random choices from an array of five strings.
The string_table variable is an array of const char * pointers. You should always use a constant pointer to refer to a literal character string like "alpha". It keeps you from using that pointer in a context where the string contents might be changed.
The random numbers are what are called "pseudorandom"; statistically uncorrelated, but completely determined by a starting "seed" value. Using the statement srand(time(NULL)) takes the current time/date value (seconds since some starting date) and uses that as a seed that won't be repeated in any computer's lifetime. But you will get exactly the same "random" numbers if you manage to run the program twice in the same second. This is easy to do in a shell script, for example. A higher-resolution timestamp would be nice, but there isn't anything useful in the C standard library.
The rand() function returns a non-negative int value from 0 to some implementation-dependent maximum value. The symbolic constant RAND_MAX has that value. The expression rand() % N will return the remainder from dividing that value by N, which is a number from 0 to N-1.
Aconcagua has pointed out that this isn't ideal. If N doesn't evenly divide RAND_MAX, there will be a bias toward smaller numbers. It's okay for now, but plan to learn some other methods later if you do serious simulation or statistical work; and if you do get to that point, you probably won't use the built-in rand() function anyway.
You can write a function that, given the size of your array, uses rand() % size to get a random index into the array and then returns the value of arr[randidx].

Two different values for the Wavelet Transform (Daubechies D4)

I am testing this code
protected final double sqrt_3 = Math.sqrt( 3 );
protected final double denom = 4 * Math.sqrt( 2 );

//
// forward transform scaling (smoothing) coefficients
//
protected final double h0 = (1 + sqrt_3)/denom;
protected final double h1 = (3 + sqrt_3)/denom;
protected final double h2 = (3 - sqrt_3)/denom;
protected final double h3 = (1 - sqrt_3)/denom;

//
// forward transform wavelet coefficients
//
protected final double g0 =  h3;
protected final double g1 = -h2;
protected final double g2 =  h1;
protected final double g3 = -h0;

protected void transform( double a[], int n )
{
    if (n >= 4) {
        int i, j;
        int half = n >> 1;

        double tmp[] = new double[n];

        i = 0;
        for (j = 0; j < n-3; j = j + 2) {
            tmp[i]      = a[j]*h0 + a[j+1]*h1 + a[j+2]*h2 + a[j+3]*h3;
            tmp[i+half] = a[j]*g0 + a[j+1]*g1 + a[j+2]*g2 + a[j+3]*g3;
            i++;
        }

        tmp[i]      = a[n-2]*h0 + a[n-1]*h1 + a[0]*h2 + a[1]*h3;
        tmp[i+half] = a[n-2]*g0 + a[n-1]*g1 + a[0]*g2 + a[1]*g3;

        for (i = 0; i < n; i++) {
            a[i] = tmp[i];
        }
    }
} // transform
to perform a Daubechies D4 wavelet transform on this discrete array:
[1,2,0,4,5,6,8,10]
the result is
- 0 : 1.638357430415108
- 1 : 3.6903274198537357
- 2 : -2.6439375651698196
- 3 : 79.01146993331695
- 4 : 7.399237211089009
- 5 : 0.3882285676537802
- 6 : -39.6029588778518
- 7 : -19.794010741818195
- 8 : -2.1213203435596424
- 9 : 0.0
but when I use python pywt.dwt on the same array, I get this:
import pywt
[cA, cD] = pywt.dwt([1,2,0,4,5,6,8,10], 'db4')
>>> cA
array([ 7.14848277, 1.98754736, 1.9747116 , 0.95510018, 4.90207373,
8.72887094, 14.23995582])
>>> cD
array([-0.5373913 , -2.00492859, 0.01927609, 0.1615668 , -0.0823509 ,
-0.32289939, 0.92816281])
Beyond the different values, one has 10 items and the other 7.
what am I missing?
I have never used either of these codes and I am not really sure about your question, but maybe this information will help you get closer to an answer:
Daubechies 4 Wiki
Daubechies Coefficients Wiki
Before that, I think your input vector (signal) may be too small for the wavelet calculations to come out right. Not sure, though; maybe try something of size 1x128.
The Java code may be a Fast Wavelet Transform, guessing based on the following methods:
Code
/**
   Forward Daubechies D4 transform
 */
public void daubTrans( double s[] )
{
    final int N = s.length;
    int n;
    for (n = N; n >= 4; n >>= 1) {
        transform( s, n );
    }
}

/**
   Inverse Daubechies D4 transform
 */
public void invDaubTrans( double coef[])
{
    final int N = coef.length;
    int n;
    for (n = 4; n <= N; n <<= 1) {
        invTransform( coef, n );
    }
}
Based on the above methods, it seems this would be a "Fast Wavelet Transform"; I am not so sure about its calculations either, so you might look into this link.
There are so many similar-sounding "terms" around wavelet transforms that it might be best to go through the math to see what the exact method is (e.g., Discrete Wavelet Transform, Continuous Wavelet Transform, Discrete with Packet Decomposition). Every library has its own terminology and assumptions and makes different calculations. You might first print things out to see whether you get anything close to D4 Wavelet = {−0.1830127, −0.3169873, 1.1830127, −0.6830127} for DB4. Or you may do other testing to see whether the calculations are correct.
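As a concrete way to do that check in Python (a sketch of mine, not from the original answer): pywt exposes each wavelet's filter coefficients, so you can print them next to the constants from the Java code and see which pywt name actually matches them. The choice of candidate names to compare is my assumption.

import math
import pywt

# Coefficients from the question's Java code (the 4-tap Daubechies filter).
sqrt3 = math.sqrt(3)
denom = 4 * math.sqrt(2)
h = [(1 + sqrt3) / denom, (3 + sqrt3) / denom,
     (3 - sqrt3) / denom, (1 - sqrt3) / denom]
print("Java h0..h3:", h)

# pywt's decomposition low-pass filters for a couple of candidate names.
for name in ("db2", "db4"):
    w = pywt.Wavelet(name)
    print(name, "dec_lo:", w.dec_lo, "filter length:", w.dec_len)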
Methods of Decomposition in Wavelets
It looks like cA and cD are the coefficients of the "approximation" and "detail" signals produced by a discrete wavelet transform. However, I am not sure to how many levels your input vector has been decomposed.
There are two well-known ways of decomposing a signal in Wavelet, one is "packet" (which decomposes both "approximations" and "details" signals, so you would get 2^4=16 sub-signals for decomposing your original signal to 4 layers).
The other decomposition method only further decomposes the low-frequency (approximation) part of the signal. So you might need to find out to what level your vector is being decomposed.
Also, if you write your own code, you can decompose it however you wish.
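For reference, here is a small sketch (mine, not from the question) showing the difference between pywt's single-level dwt, which the question used, and the multi-level wavedec, where the level argument controls how many decomposition steps are applied:

import pywt

data = [1, 2, 0, 4, 5, 6, 8, 10]

# Single-level DWT (what the question used): one cA and one cD vector.
cA, cD = pywt.dwt(data, 'db4')
print(len(cA), len(cD))

# Multi-level DWT: a list [cA_n, cD_n, ..., cD_1] of sub-signals.
max_level = pywt.dwt_max_level(len(data), pywt.Wavelet('db2').dec_len)
coeffs = pywt.wavedec(data, 'db2', level=max_level)
print("levels:", max_level, "sub-signal lengths:", [len(c) for c in coeffs])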
Simple Keys to Understand Wavelet
Shifting (Time) vs Scale (Frequency)
There is one simple thing that, once you understand it, makes wavelets much easier. First, as you may know, a wavelet is a time-frequency method. However, instead of plotting time vs frequency, you plot time vs scale, where scale is the "inverse" of frequency.
Children of a Wavelet Function such as DB4
The wavelet transform maps a wavelet function - such as DB4 - across your original signal, and that is how it computes the numbers you have printed out. One thing to consider is to find a base function, DB4, that "looks like" your original signal. How do you do that?
Basically, you pick a base function, DB4, and the wavelet transform creates multiple forms of that base function (e.g., imagine you name them DB4-0, DB4-1, DB4-2, ..., DB4-15). These children are created based on:
(a) Shifting (in a loop, incrementing time, sliding a child function along and calculating coefficients); shifting relates to time, obviously.
(b) Scaling ("stretching" a wavelet function, vertically or horizontally, which changes the frequency character of the base function, then sliding it through time again); scale has an inverse relation with frequency, meaning higher scale, lower frequency, and vice versa.
Therefore, how many child functions you need depends on the decomposition (the sub-signals). If you have 16 sub-signals (4 levels of decomposition with a packet method), then you will have 16 of those "children" functions mapped across your signal, and that is how the coefficient vectors are calculated. Then you may toss the unnecessary sub-signals and keep focusing on the sub-signals (frequencies) you are interested in. The point is that wavelets preserve the time information, as opposed to Fourier.
Normal Decomposition
Also, since you are a good programmer, I am pretty sure you can quickly crack the code; I don't think you are missing anything here. You can just go through their methods and read a few pages of Wikipedia, and you will probably get there if you wish.
If you have really detailed questions, you may try DSP SE; many signal-processing experts are there. Sorry about this answer: I wrote it too fast and I am not a good writer/explainer; hopefully others will later edit it and provide the right answer. I am not really an expert.
In short, you are not missing anything; good method, good luck and best wishes!

Hash function that can return an integer range based on string [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
I am looking to create a simple C++ hash function that returns a number within a maximum range based on a string input, so that the same string always returns the same integer value. Here is an arbitrary example where the maximum desired range is 36.
Fred Smith -> 25
tree -> 34
Frog -> 0
Fred Smith -> 25
fred smith -> 7
These numbers are arbitrary, but the function should use an algorithm that does a numeric calculation against the string and results in an integer within a defined range. I will eventually rewrite this function for use in Python 2.7 as well.
I am using VS2008 (a.k.a. VC++ 9) and std::hash is not available.
I need some advice on the approach.
Why not std::hash?
#include <iostream>
#include <functional>
#include <string>

int main()
{
    int max = 100;
    std::string str = "Fred Smith";
    std::hash<std::string> hash_fn;
    int num = (int) hash_fn(str) % max;
    std::cout << num << '\n';
}
Output:
33
If you need a custom hashing algorithm implementation that will work across languages, I suggest starting here or here.
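One concrete option for a cross-language hash (my suggestion, not necessarily what the linked pages propose) is FNV-1a, which is trivial to implement identically in C++ and Python 2.7; here is a sketch with the same modulo range reduction as above:

def fnv1a_32(s, max_range):
    """32-bit FNV-1a hash of a string, reduced to the range 0..max_range-1."""
    h = 2166136261                        # FNV-1a offset basis (32-bit)
    for ch in s:
        h ^= ord(ch)
        h = (h * 16777619) & 0xFFFFFFFF   # FNV-1a prime, truncated to 32 bits
    return h % max_range

print(fnv1a_32("Fred Smith", 36))   # same string always gives the same value
print(fnv1a_32("fred smith", 36))   # different case gives a (usually) different value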
# Very simple minded hash
def hashval(str, siz):
    hash = 0
    # Take ordinal number of char in str, and just add
    for x in str: hash += (ord(x))
    return (hash % siz)  # Depending on the range, do a modulo operation.

print(hashval('stack', 33))
Two important elements in creating good hashes are hash table sizes and salting the hash (adding your own unpredictable touch). Typically there will be some operation on the given string to hash, maybe adding the ASCII values of each character, or something like the length of the string being involved in some operation. These are EXTREMELY simple hashing examples for strings.
Now, assuming we use an algorithm based on the ASCII values of each character in the string, we can incorporate the two elements I mentioned above to create our hash function like so...
int hash(string s, int tableSize)
{
    int sum = 0;
    for (int i = 0; i < s.length(); i++)
        sum += int(s[i]) * 3;  // <- * 3 being my salt to the hash
    return sum % tableSize;
}
Using prime numbers for the table-size modulus and for the salt is good practice because it reduces the risk of creating patterns in your hashes.
I hope this helps get you on the right track!
A HashMap in Java uses the object's hashing function to get a 32-bit hash and a second hashing function, implemented by the HashMap itself, to further reduce the length of the hash. This is explained in the answer to this question: What hashing function does Java use to implement Hashtable class?
You can look at the hashing function used by the HashMap implementation, since it can generate hashes of the desired length.
You could just take the integer representation of each character of the string and compute the sum modulo max, where max is one more than the highest value you want the hash to be.
Edit:
This hash is easily reversible, so whether it is acceptable depends on your requirements.

Converting Algorithm from Python to C: Suggestions for Using bin() in C?

So essentially, I have a homework problem to write in C, and instead of taking the easy route, I thought I would implement a little algorithm and get some coding practice to impress my professor. The assignment is meant to help us pick up C (or review it; the former for me), and it asks us to return all of the integers that divide a given integer (such that there is no remainder).
What I did in Python was to create an is_prime() method, a pool_of_primes() method, and a combinations() method. So far, I have everything done in C up to the combinations() method. The problem I am running into now is some syntax errors (i.e., not being able to alter a string by declaration) and mainly the binary string I was using to track what should be included in my list of combinations. Without being able to alter my string by declaration, the Python approach doesn't carry over...
Here is the python code:
def combinations(aList):
    '''
    The idea is to provide a list of ints and combinations will provide
    all of the combinations of that list using binary.

    To track the combinations, we use a string representation of binary
    and count down from there. Each spot in the binary represents an
    on/off (included/excluded) indicator for the numbers.
    '''
    length = len(aList)  # Have this figured out
    s = ""
    canidates = 0
    nList = []
    if (length >= 21):
        print("\nTo many possible canidates for integers that divide our number.\n")
        return False
    for i in range(0, length):
        s += "1"
        canidates += pow(2, i)
    # We now have a string for on/off switch of the elements in our
    # new list. Canidates is the size of the new list.
    nList.append(1)
    while (canidates != 0):
        x = 1
        for i in range(0, length):
            if (int(s[i]) == 1):
                x = x * aList[i]
        nList.append(x)
        canidates -= 1
        s = ''
        temp = bin(canidates)
        for i in range(2, len(temp)):
            s = s + temp[i]
        if (len(s) != length):
            # This part is needed in cases of [1...000-1 = 0...111]
            while (len(s) != length):
                s = '0' + s
    return nList
Sorry if the code is too lengthy or not optimized to your taste, but it works, and it works well :)
Again, I currently have everything that aList would hold stored as a singly-linked list in C (which I am able to print/use). I also have a little macro in C to convert binary to an integer:
#define B(x) S_to_binary_(#x)

static inline unsigned long long S_to_binary_(const char *s)
{
    unsigned long long i = 0;
    while (*s) {
        i <<= 1;
        i += *s++ - '0';
    }
    return i;
}
This may be Coder's Block setting in, but I am not seeing how I can change the binary in the same way that I did in Python... Any help would be greatly appreciated! Also, as a note, what is typically the best way to return a finalized code in C?
EDIT:
Accidentally took credit for the macro above.
UPDATE
I just finished the code, and I uploaded it onto GitHub. I would like to thank @nneonneo for providing the step that I needed to finish it, with exemplary code. If anyone has any further suggestions about the code, I would be happy to see their ideas on GitHub!
Why use a string at all? Keep it simple: use an integer, and use bitwise math to work with the number. Then you don't have to do any conversions back and forth. It will also be loads faster.
You can use a uint32_t to store the "bits", which is enough to hold 32 bits (since you max out at 21, this should work great).
For example, you can loop over the bits that are set by using a loop like this:
uint32_t my_number = ...;

for (int i = 0; i < 32; i++) {
    if (my_number & (1 << i)) {
        /* bit i is set */
    }
}
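Translated back into the question's Python, the same bitwise idea removes the string juggling entirely. A minimal sketch (mine), assuming aList holds the prime factors whose subset products you want:

def combinations(a_list):
    """Products of every subset of a_list, using an int as the on/off mask."""
    n = len(a_list)
    products = []
    for mask in range(1 << n):         # 0 .. 2**n - 1; each bit = include/exclude
        x = 1
        for i in range(n):
            if mask & (1 << i):        # bit i set -> include a_list[i]
                x *= a_list[i]
        products.append(x)
    return products

print(combinations([2, 3, 5]))  # prime factors of 30 -> all divisors of 30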

Hash value for directed acyclic graph

How do I transform a directed acyclic graph into a hash value such that any two isomorphic graphs hash to the same value? It is acceptable, but undesirable for two isomorphic graphs to hash to different values, which is what I have done in the code below. We can assume that the number of vertices in the graph is at most 11.
I am particularly interested in Python code.
Here is what I did. If self.lt is a mapping from node to descendants (not children!), then I relabel the nodes according to a modified topological sort (that prefers to order elements with more descendants first if it can). Then, I hash the sorted dictionary. Some isomorphic graphs will hash to different values, especially as the number of nodes grows.
I have included all the code to motivate my use case. I am calculating the number of comparisons required to find the median of 7 numbers. The more that isomorphic graphs hash to the same value the less work that has to be redone. I considered putting larger connected components first, but didn't see how to do that quickly.
from tools.decorator import memoized  # A standard memoization decorator

class Graph:
    def __init__(self, n):
        self.lt = {i: set() for i in range(n)}

    def compared(self, i, j):
        return j in self.lt[i] or i in self.lt[j]

    def withedge(self, i, j):
        retval = Graph(len(self.lt))
        implied_lt = self.lt[j] | set([j])
        for (s, lt_s), (k, lt_k) in zip(self.lt.items(),
                                        retval.lt.items()):
            lt_k |= lt_s
            if i in lt_k or k == i:
                lt_k |= implied_lt
        return retval.toposort()

    def toposort(self):
        mapping = {}
        while len(mapping) < len(self.lt):
            for i, lt_i in self.lt.items():
                if i in mapping:
                    continue
                if any(i in lt_j or len(lt_i) < len(lt_j)
                       for j, lt_j in self.lt.items()
                       if j not in mapping):
                    continue
                mapping[i] = len(mapping)
        retval = Graph(0)
        for i, lt_i in self.lt.items():
            retval.lt[mapping[i]] = {mapping[j]
                                     for j in lt_i}
        return retval

    def median_known(self):
        n = len(self.lt)
        for i, lt_i in self.lt.items():
            if len(lt_i) != n // 2:
                continue
            if sum(1
                   for j, lt_j in self.lt.items()
                   if i in lt_j) == n // 2:
                return True
        return False

    def __repr__(self):
        return("[{}]".format(", ".join("{}: {{{}}}".format(
            i,
            ", ".join(str(x) for x in lt_i))
            for i, lt_i in self.lt.items())))

    def hashkey(self):
        return tuple(sorted({k: tuple(sorted(v))
                             for k, v in self.lt.items()}.items()))

    def __hash__(self):
        return hash(self.hashkey())

    def __eq__(self, other):
        return self.hashkey() == other.hashkey()

@memoized
def mincomps(g):
    print("Calculating:", g)
    if g.median_known():
        return 0
    nodes = g.lt.keys()
    return 1 + min(max(mincomps(g.withedge(i, j)),
                       mincomps(g.withedge(j, i)))
                   for i in nodes
                   for j in nodes
                   if j > i and not g.compared(i, j))

g = Graph(7)
print(mincomps(g))
To effectively test for graph isomorphism you will want to use nauty. Specifically for Python there is the wrapper pynauty, but I can't attest to its quality (to compile it correctly I had to do some simple patching of its setup.py). If this wrapper is doing everything correctly, then it simplifies nauty a lot for the uses you are interested in, and it is only a matter of hashing pynauty.certificate(somegraph) -- which will be the same value for isomorphic graphs.
Some quick tests showed that pynauty gives the same certificate for every graph (with the same number of vertices). But that is only because of a minor issue in the wrapper when converting the graph to nauty's format. After fixing this, it works for me (I also used the graphs at http://funkybee.narod.ru/graphs.htm for comparison). Here is the short patch, which also includes the modifications needed in setup.py:
diff -ur pynauty-0.5-orig/setup.py pynauty-0.5/setup.py
--- pynauty-0.5-orig/setup.py 2011-06-18 20:53:17.000000000 -0300
+++ pynauty-0.5/setup.py 2013-01-28 22:09:07.000000000 -0200
@@ -31,7 +31,9 @@
ext_pynauty = Extension(
name = MODULE + '._pynauty',
- sources = [ pynauty_dir + '/' + 'pynauty.c', ],
+ sources = [ pynauty_dir + '/' + 'pynauty.c',
+ os.path.join(nauty_dir, 'schreier.c'),
+ os.path.join(nauty_dir, 'naurng.c')],
depends = [ pynauty_dir + '/' + 'pynauty.h', ],
extra_compile_args = [ '-O4' ],
extra_objects = [ nauty_dir + '/' + 'nauty.o',
diff -ur pynauty-0.5-orig/src/pynauty.c pynauty-0.5/src/pynauty.c
--- pynauty-0.5-orig/src/pynauty.c 2011-03-03 23:34:15.000000000 -0300
+++ pynauty-0.5/src/pynauty.c 2013-01-29 00:38:36.000000000 -0200
@@ -320,7 +320,7 @@
PyObject *adjlist;
PyObject *p;
- int i,j;
+ Py_ssize_t i, j;
int adjlist_length;
int x, y;
Graph isomorphism for directed acyclic graphs is still GI-complete. Therefore there is currently no known (worst-case sub-exponential) solution to guarantee that two isomorphic directed acyclic graphs will yield the same hash. Only if the mapping between different graphs is known - for example if all vertices have unique labels - could one efficiently guarantee matching hashes.
Okay, let's brute force this for a small number of vertices. We have to find a representation of the graph that is independent of the ordering of the vertices in the input and therefore guarantees that isomorphic graphs yield the same representation. Further this representation must ensure that no two non-isomorphic graphs yield the same representation.
The simplest solution is to construct the adjacency matrix for all n! permutations of the vertices and just interpret each adjacency matrix as an n^2-bit integer. Then we can pick the smallest or largest of these numbers as the canonical representation. This number completely encodes the graph and therefore ensures that no two non-isomorphic graphs yield the same number - one could consider this function a perfect hash function. And because we choose the smallest or largest number encoding the graph over all possible permutations of the vertices, we further ensure that isomorphic graphs yield the same representation.
How good or bad is this in the case of 11 vertices? Well, the representation will have 121 bits. We can reduce this by 11 bits because the diagonal, representing loops, will be all zeros in an acyclic graph, and we are left with 110 bits. This number could in theory be decreased further; not all 2^110 remaining graphs are acyclic, and for each graph there may be up to 11! - roughly 2^25 - isomorphic representations, but in practice this might be quite hard to exploit. Does anybody know how to compute the number of distinct directed acyclic graphs with n vertices?
How long will it take to find this representation? Naively 11!, or 39,916,800 iterations. This is not nothing and probably already impractical, but I did not implement and test it. We can probably speed this up a bit, though. If we interpret the adjacency matrix as an integer by concatenating the rows from top to bottom, left to right, we want many ones (zeros) at the left of the first row to obtain a large (small) number. Therefore we pick as the first vertex the one (or one of the vertices) with the largest (smallest) degree (indegree or outdegree depending on the representation), and then vertices connected (not connected) to this vertex in subsequent positions to bring the ones (zeros) to the left.
There are likely more possibilities to prune the search space but I am not sure if there are enough to make this a practical solution. Maybe there are or maybe somebody else can at least build something upon this idea.
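Here is a small sketch (my own) of the brute-force canonical form described above, assuming the DAG is given as an adjacency dict of node -> set of successors; it treats the adjacency matrix under each vertex permutation as an integer and keeps the smallest:

from itertools import permutations

def canonical_code(adj):
    """Smallest adjacency-matrix integer over all vertex permutations.

    adj maps each vertex to the set of its successors.
    Only practical for small graphs (the loop is O(n!))."""
    nodes = sorted(adj)
    n = len(nodes)
    best = None
    for perm in permutations(nodes):
        pos = {v: i for i, v in enumerate(perm)}
        code = 0
        for v in nodes:
            for w in adj[v]:
                code |= 1 << (pos[v] * n + pos[w])   # set the bit for edge v->w
        if best is None or code < best:
            best = code
    return best

# Two isomorphic 3-vertex DAGs get the same canonical code:
g1 = {0: {1, 2}, 1: {2}, 2: set()}
g2 = {"a": set(), "b": {"a"}, "c": {"a", "b"}}
print(canonical_code(g1) == canonical_code(g2))  # True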
How good does the hash have to be? I assume that you do not want a full serialization of the graph. A hash rarely guarantees that there is no second (but different) element (graph) that evaluates to the same hash. If it is very important to you, that isomorphic graphs (in different representations) have the same hash, then only use values that are invariant under a change of representation. E.g.:
the total number of nodes
the total number of (directed) connections
the total number of nodes with (indegree, outdegree) = (i,j) for any tuple (i,j) up to (max(indegree), max(outdegree)) (or limited for tuples up to some fixed value (m,n))
All of this information can be gathered in O(#nodes) [assuming the graph is stored properly]. Concatenate it and you have a hash. If you prefer, you can use some well-known hash algorithm like SHA on the concatenated information. Without additional hashing it is a continuous hash (it allows you to find similar graphs); with additional hashing it is uniform and fixed in size, provided the chosen hash algorithm has these properties.
As it is, it is already good enough to register any added or removed connection. It might miss connections that were changed though (a -> c instead of a -> b).
This approach is modular and can be extended as far as you like. Any additional property that is being included will reduce the number of collisions but increase the effort necessary to get the hash value. Some more ideas:
same as above but with second-order in- and outdegree, i.e. the number of nodes that can be reached by a node->child->child chain ( = second-order outdegree), or respectively the number of nodes that lead to the given node in two steps.
or more general n-th order in- and outdegree (can be computed in O((average-number-of-connections) ^ (n-1) * #nodes) )
number of nodes with eccentricity = x (again for any x)
if the nodes store any information (other than their neighbours), use an xor of some hash of all the node contents. Due to the xor, the specific order in which the nodes were added to the hash does not matter.
You requested "a unique hash value" and clearly I cannot offer you one. But I see the terms "hash" and "unique to every graph" as mutually exclusive (not entirely true of course) and decided to answer the "hash" part and not the "unique" part. A "unique hash" (perfect hash) basically needs to be a full serialization of the graph (because the amount of information stored in the hash has to reflect the total amount of information in the graph). If that is really what you want, just define some unique order of nodes (e.g. sorted by own outdegree, then indegree, then outdegree of children and so on until the order is unambiguous) and serialize the graph in any way (using the position in the aforementioned ordering as the index of the nodes).
Of course this is much more complex though.
Years ago, I created a simple and flexible algorithm for exactly this problem (finding duplicate structures in a database of chemical structures by hashing them).
I named it "Powerhash", and creating the algorithm required two insights. The first is the power iteration graph algorithm, also used in PageRank. The second is the ability to replace power iteration's inside step function with anything that we want. I replaced it with a function that does the following on each step, and for each node:
Sort the hashes of the node's neighbors
Hash the concatenated sorted hashes
On the first step, a node's hash is affected by its direct neighbors. On the second step, a node's hash is affected by the neighborhood 2-hops away from it. On the Nth step a node's hash will be affected by the neighborhood N-hops around it. So you only need to continue running the Powerhash for N = graph_radius steps. In the end, the graph center node's hash will have been affected by the whole graph.
To produce the final hash, sort the final step's node hashes and concatenate them together. After that, you can compare the final hashes to find if two graphs are isomorphic. If you have labels, then add them in the internal hashes that you calculate for each node (and at each step).
For more on this you can look at my post here:
https://plus.google.com/114866592715069940152/posts/fmBFhjhQcZF
The algorithm above was implemented inside the "madIS" functional relational database. You can find the source code of the algorithm here:
https://github.com/madgik/madis/blob/master/src/functions/aggregate/graph.py
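Here is a minimal sketch (my own, not the madIS implementation) of the iteration described above, assuming the graph is an adjacency dict of node -> iterable of neighbours and that node labels, if present, seed the initial hashes:

import hashlib

def powerhash(adj, labels=None, steps=None):
    """Repeatedly rehash every node from its own hash plus its neighbours'
    sorted hashes, then combine the sorted node hashes into one graph hash."""
    def h(text):
        return hashlib.sha1(text.encode("utf8")).hexdigest()

    nodes = sorted(adj)
    if steps is None:
        steps = len(nodes)                    # safe upper bound on the graph radius
    hashes = {v: h(str(labels[v]) if labels else "") for v in nodes}
    for _ in range(steps):
        hashes = {v: h(hashes[v] + "|" + ",".join(sorted(hashes[w] for w in adj[v])))
                  for v in nodes}
    return h("|".join(sorted(hashes.values())))

g1 = {0: [1, 2], 1: [2], 2: []}
g2 = {"a": [], "b": ["a"], "c": ["a", "b"]}   # an isomorphic relabelling of g1
print(powerhash(g1) == powerhash(g2))         # True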
Imho, if the graph can be topologically sorted, a very straightforward solution exists.
For each vertex with index i, you can build a unique hash (for example, using the hashing technique for strings) of its (sorted) direct neighbours (e.g. if vertex 1 has direct neighbours {43, 23, 2, 7, 12, 19, 334}, the hash function should hash the array {2, 7, 12, 19, 23, 43, 334}).
For the whole DAG you can create a hash as a hash of the string of hashes for each node: Hash(DAG) = Hash(vertex_1) U Hash(vertex_2) U ..... Hash(vertex_N);
I think the complexity of this procedure is around O(N*N) in the worst case. If the graph cannot be topologically sorted, the approach proposed is still applicable, but you need to order the vertices in a unique way (and this is the hard part).
I will describe an algorithm to hash an arbitrary directed graph, not taking into account that the graph is acyclic. In fact even counting the acyclic graphs of a given order is a very complicated task and I believe here this will only make the hashing significantly more complicated and thus slower.
A unique representation of the graph can be given by the neighbourhood list. For each vertex, create a list with all its neighbours. Write all the lists one after the other, appending the number of neighbours for each list to the front. Also keep the neighbours sorted in ascending order to make the representation unique for each graph. So for example assume you have the graph:
1->2, 1->5
2->1, 2->4
3->4
5->3
What I propose is that you transform this to ({2,2,5}, {2,1,4}, {1,4}, {0}, {1,3}), the curly brackets being only to visualize the representation, not part of Python's syntax. So the list is in fact: (2,2,5, 2,1,4, 1,4, 0, 1,3).
Now to compute the unique hash, you need to order these representations somehow and assign a unique number to them. I suggest you do something like a lexicographical sort to do that. Let's assume you have two sequences (a1, b_1_1, b_1_2,...b_1_a1, a2, b_2_1, b_2_2,...b_2_a2,...an, b_n_1, b_n_2,...b_n_an) and (c1, d_1_1, d_1_2,...d_1_c1, c2, d_2_1, d_2_2,...d_2_c2,...cn, d_n_1, d_n_2,...d_n_cn). Here the a's and c's are the numbers of neighbours of each vertex, and b_i_j and d_k_l are the corresponding neighbours. For the ordering, first compare the sequences (a1,a2,...,an) and (c1,c2,...,cn) and, if they are different, use this to order the representations. If these sequences are equal, compare the lists from left to right, first comparing lexicographically (b_1_1, b_1_2,...b_1_a1) to (d_1_1, d_1_2,...d_1_c1) and so on until the first mismatch.
In fact, what I propose to use as the hash is the lexicographical number of a word of size N over the alphabet formed by all possible subsets of the elements of {1,2,3,...,N}. The neighbourhood list for a given vertex is a letter over this alphabet, e.g. {2,2,5} is the subset consisting of two elements of the set, namely 2 and 5.
The alphabet(set of possible letters) for the set {1,2,3} would be(ordered lexicographically):
{0}, {1,1}, {1,2}, {1,3}, {2, 1, 2}, {2, 1, 3}, {2, 2, 3}, {3, 1, 2, 3}
The first number, as above, is the number of elements in the given subset, and the remaining numbers are the subset itself. So form all the 3-letter words from this alphabet and you will get all the possible directed graphs with 3 vertices.
Now the number of subsets of the set {1,2,3,....,N} is 2^N and thus the number of letters of this alphabet is 2^N. We code each directed graph of N nodes with a word of exactly N letters from this alphabet, and thus the number of possible hash codes is precisely (2^N)^N. This is to show that the hash code grows really fast as N increases. Also, this is the number of possible different directed graphs with N nodes, so what I suggest is optimal hashing in the sense that it is a bijection and no smaller hash can be unique.
There is a linear algorithm to get a given subset's number in the lexicographical ordering of all subsets of a given set, in this case {1,2,....,N}. Here is the code I have written for coding/decoding a subset to/from a number. It is written in C++ but quite easy to understand, I hope. For the hashing you will need only the code function, but since the hash I propose is reversible I add the decode function as well - you will be able to reconstruct the graph from the hash, which is quite cool I think:
typedef long long ll;

// Returns the number in the lexicographical order of all combinations of n numbers
// of the provided combination.
ll code(vector<int> a, int n)
{
    sort(a.begin(), a.end()); // not needed if the set you pass is already sorted.
    int cur = 0;
    int m = a.size();
    ll res = 0;
    for (int i = 0; i < a.size(); i++)
    {
        if (a[i] == cur + 1)
        {
            res++;
            cur = a[i];
            continue;
        }
        else
        {
            res++;
            int number_of_greater_nums = n - a[i];
            for (int j = a[i] - 1, increment = 1; j > cur; j--, increment++)
                res += 1LL << (number_of_greater_nums + increment);
            cur = a[i];
        }
    }
    return res;
}

// Takes the lexicographical code of a combination of n numbers and returns the
// combination
vector<int> decode(ll kod, int n)
{
    vector<int> res;
    int cur = 0;
    int left = n; // Out of how many numbers are we left to choose.
    while (kod)
    {
        ll all = 1LL << left; // how many are the total combinations
        for (int i = n; i >= 0; i--)
        {
            if (all - (1LL << (n - i + 1)) + 1 <= kod)
            {
                res.push_back(i);
                left = n - i;
                kod -= all - (1LL << (n - i + 1)) + 1;
                break;
            }
        }
    }
    return res;
}
Also, this code stores the result in a long long variable, which is only enough for graphs with fewer than 64 elements. The number of possible hashes of graphs with 64 nodes is (2^64)^64; this number has about 1,230 digits, so it is a big number. Still, the algorithm I describe will work really fast, and I believe you should be able to hash and 'unhash' graphs with a lot of vertices.
Also have a look at this question.
I'm not sure that it's 100% working, but here is an idea:
Let's code a graph into a string and then take its hash.
hash of an empty graph is ""
hash of a vertex with no outgoing edges is "."
hash of a vertex with outgoing edges is concatenation of every child hash with some delimiter (e.g. ",")
To produce the same hash for isomorphic graphs, just sort the hashes before the concatenation in step 3 (e.g. in lexicographical order).
For the hash of a graph, just take the hash of its root (or the sorted concatenation, if there are several roots).
edit While I hoped that the resulting string would describe the graph without collisions, hynekcer found that sometimes non-isomorphic graphs get the same hash. That happens when a vertex has several parents - then it is "duplicated" for every parent. For example, the algorithm does not differentiate a "diamond" {A->B->C,A->D->C} from the case {A->B->C,A->D->E}.
I'm not familiar with Python and it's hard for me to understand how the graph is stored in the example, but here is some code in C++ which should be easy to convert to Python:
THash GetHash(const TGraph &graph)
{
    return ComputeHash(GetVertexStringCode(graph,FindRoot(graph)));
}

std::string GetVertexStringCode(const TGraph &graph,TVertexIndex vertex)
{
    std::vector<std::string> childHashes;
    for(auto c:graph.GetChildren(vertex))
        childHashes.push_back(GetVertexStringCode(graph,c));
    std::sort(childHashes.begin(),childHashes.end());

    std::string result=".";
    for(auto h:childHashes)
        result+=h+",";
    return result;
}
I am assuming there are no common labels on vertices or edges, for then you could put the graph in a canonical form, which itself would be a perfect hash. This proposal is therefore based on isomorphism only.
For this, combine hashes for as many simple aggregate characteristics of a DAG as you can imagine, picking those that are quick to compute. Here is a starter list:
2d histogram of nodes' in and out degrees.
4d histogram of edges a->b where a and b are both characterized by in/out degree.
Addition
Let me be more explicit. For 1, we'd compute a set of triples <I,O;N> (where no two triples have the same I,O values), signifying that there are N nodes with in-degree I and out-degree O. You'd hash this set of triples or better yet use the whole set arranged in some canonical order e.g. lexicographically sorted. For 2, we compute a set of quintuples <aI,aO,bI,bO;N> signifying that there are N edges from nodes with in degree aI and out degree aO, to nodes with bI and bO respectively. Again hash these quintuples or else use them in canonical order as-is for another part of the final hash.
Starting with this and then looking at collisions that still occur will probably provide insights on how to get better.
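A rough sketch (mine) of points 1 and 2 above in Python, assuming the DAG is an adjacency dict of node -> set of successors and using sorted() as the canonical order:

def degree_invariant_hash(adj):
    """Hash built from the in/out-degree histogram of the nodes and of the edges."""
    indeg = {v: 0 for v in adj}
    for v, succs in adj.items():
        for w in succs:
            indeg[w] += 1
    deg = {v: (indeg[v], len(adj[v])) for v in adj}   # (in-degree, out-degree) per node

    node_hist = {}                                    # <I,O> -> N   (point 1)
    for d in deg.values():
        node_hist[d] = node_hist.get(d, 0) + 1

    edge_hist = {}                                    # <aI,aO,bI,bO> -> N   (point 2)
    for v, succs in adj.items():
        for w in succs:
            key = deg[v] + deg[w]
            edge_hist[key] = edge_hist.get(key, 0) + 1

    return hash((tuple(sorted(node_hist.items())),
                 tuple(sorted(edge_hist.items()))))

g1 = {0: {1, 2}, 1: {2}, 2: set()}
g2 = {"a": set(), "b": {"a"}, "c": {"a", "b"}}        # isomorphic to g1
print(degree_invariant_hash(g1) == degree_invariant_hash(g2))  # True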
When I saw the question, I had essentially the same idea as @example. I wrote a function providing a graph tag such that the tag coincides for two isomorphic graphs.
This tag consists of the sequence of out-degrees in ascending order. You can hash this tag with the string hash function of your choice to obtain a hash of the graph.
Edit: I expressed my proposal in the context of @NeilG's original question. The only modification to make to his code is to redefine the hashkey function as:
def hashkey(self):
    return tuple(sorted(map(len, self.lt.values())))
With a suitable ordering of your descendants (and a single root node - not a given, but obtainable with suitable ordering, maybe by including a virtual root node), the method for hashing a tree ought to work with a slight modification.
Example code is in this StackOverflow answer; the modification would be to sort children in some deterministic order (increasing hash?) before hashing the parent.
Even if you have multiple possible roots, you can create a synthetic single root, with all roots as children.
