identify 5-tuple flows by hash value in Python

I identify Internet traffic flows by their 5-tuple (src IP, dst IP, sport, dport, transport protocol number), and I would like to turn this 5-tuple into a much more compact alphanumeric ID for internal use in my script.
What choices do I have in Python?
I read that the built-in function hash is not guaranteed to be consistent across operating systems, so I would prefer something else.
I will only ever have to deal with a few hundred different 5-tuples at most.

Just choose your own hash function:
import hashlib
h = hashlib.md5()  # named h to avoid shadowing the built-in hash()
t = (1, 2, 3, 4, 5)  # whatever
t_as_string = str(t)
h.update(t_as_string.encode('utf-8'))  # update() needs bytes on Python 3
print(h.hexdigest())
You can use any of the functions in hashlib. And since this isn't a security issue, it doesn't really matter which one...
BUT: I would bet that simply comparing the tuples themselves is faster and more efficient.

The following Python hash function, by Ewen Cheslack-Postava, is meant to remain consistent across operating systems and CPUs:
https://pypi.python.org/pypi/pyhashxx/

Are you worried about collisions across OSes? Is that your issue?
But since you are only dealing with a few hundred 5-tuples, can't you apply a hash collision resolution technique such as chaining or open addressing?
Unless I am missing something, I believe that is better than devising a new hashing algorithm yourself.
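
For what it's worth, here is a minimal sketch of the hashlib approach suggested above, turned into a short alphanumeric flow ID. The function name flow_id, the "|" separator and the default length of 10 hex characters are assumptions made for illustration, not something from the original answers. Because hashlib digests depend only on the input bytes, the ID should be the same on any OS or CPU.

```python
import hashlib

def flow_id(five_tuple, length=10):
    """Return a short hex ID for a (src IP, dst IP, sport, dport, proto) tuple.

    Hypothetical helper: the canonical form and the truncation length
    are illustrative choices, not a standard.
    """
    # Build a canonical string so the same 5-tuple always yields the same bytes
    canonical = "|".join(str(field) for field in five_tuple)
    digest = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
    # Keep only the first `length` hex characters (4 bits each)
    return digest[:length]

flow = ("10.0.0.1", "192.168.1.20", 51514, 443, 6)
print(flow_id(flow))
```

With 10 hex characters (40 bits) and only a few hundred flows, an accidental collision is very unlikely, but if you want certainty you can keep a dict from ID to 5-tuple and check for clashes when inserting.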

Related

Implementation of a hash table in SWI-Prolog

So I have to implement an ADT, in this case a hash table in SWI-Prolog.
I need help, because I'm new to this programming language and don't know how to start.
This started as an implementation in python(3) where I defined a class and added functions to work with (add, is_empty?, delete, rehash, hash, etc).
But now, I need to do something similar in prolog.
I have visited some other stackoverflow questions similar to mine, but I'm still helpless.
I expect to define a basic hash table and be able to add key+value data and some other basic functions. I'm not very sure if this is already implemented somewhere else.
Pls help.

Several Prolog systems provide a term hash built-in or library predicate. For example, SWI-Prolog provides term_hash/2 and term_hash/4 built-in predicates. These predicates are often combined with first-argument indexing. A simple example:
% dynamic predicate to hold hash table entries
% with the term hash used as first argument to
% take advantage of first-argument indexing
%
% hash_table(Hash, Term).
:- dynamic(hash_table/2).
add_hash_table_entry(Term) :-
    nonvar(Term),
    term_hash(Term, Hash), % or term_hash/4
    assertz(hash_table(Hash, Term)).
del_hash_table_entry(Term) :-
    nonvar(Term),
    term_hash(Term, Hash), % or term_hash/4
    retractall(hash_table(Hash, _)).
hash_table_entry(Term) :-
    (   var(Term) ->
        hash_table(_, Term)
    ;   term_hash(Term, Hash), % or term_hash/4
        hash_table(Hash, Term)
    ).

This sounds like a misguided idea. How is this "hash table" going to be used? What algorithmic complexity do you expect the different operations to have? Why do you want to implement a hash table in a language that doesn't need hash tables implemented by users?
The only half-decent way to do it will be to use a flat term for the table, one argument per bucket. If you have k buckets, then you use a term with that arity k, so for 256 buckets you get hash_table/256.
empty_hash_table(T) :-
    length(Buckets, 256),
    maplist(=(nil), Buckets),
    T =.. [hash_table|Buckets].
You can now use arg/3 to get a bucket in constant time. You can use setarg/3 to change them.
But this all is starting to sound very fishy. You need to explain your reasons better. Why do you want to implement a hash table in Prolog? How is it going to be used?

python data type to track duplicates

I often keep track of duplicates with something like this:
processed = set()
for big_string in strings_generator:
    if big_string not in processed:
        processed.add(big_string)
        process(big_string)
I am dealing with massive amounts of data so don't want to maintain the processed set in memory. I have a version that uses sqlite to store the data on disk, but then this process runs much slower.
To cut down on memory use what do you think of using hashes like this:
processed = set()
for big_string in strings_generator:
    key = hash(big_string)
    if key not in processed:
        processed.add(key)
        process(big_string)
The drawback is I could lose data through occasional hash collisions.
1 collision in 1 billion hashes would not be a problem for my use.
I tried the md5 hash but found generating the hashes became a bottleneck.
What would you suggest instead?

I'm going to assume you are hashing web pages. You have to hash at most 55 billion web pages (and that measure almost certainly overlooks some overlap).
You are willing to accept a less than one in a billion chance of collision, which means that if we look at a hash function whose number of collisions is close to what we would get if the hash was truly random[^1], we want a hash range of size (55*10^9)*10^9. That is log2((55*10^9)*10^9) ≈ 66 bits.
[^1]: since the hash can be considered to be chosen at random for this purpose,
p(collision) = (occupied range)/(total range)
Since there is a speed issue, but no real cryptographic concern, we can use a non-cryptographic hash of more than 66 bits with the nice collision distribution property outlined above.
It looks like we are looking for the 128-bit version of the Murmur3 hash. People have been reporting speed increases upwards of 12x comparing Murmur3_128 to MD5 on a 64-bit machine. You can use this library to do your speed tests. See also this related answer, which shows speed test results in the range of Python's str_hash (a speed you have already deemed acceptable elsewhere) and which spawned a library of Python bindings that you should be able to use directly.
Note that Python's hash is only a 32-bit hash on 32-bit builds, leaving you only 2^32/(10^9) (that is, about 4) values stored with a less than one in a billion chance of collision.
Finally, I hope to have outlined the reasoning that could allow you to compare with other functions of varied size should you feel the need for it (e.g. if you up your collision tolerance, if the size of your indexed set is smaller than the whole Internet, etc, ...).
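The sizing argument above can be reproduced with a couple of lines. This is only a sketch of that back-of-the-envelope estimate (range = items / tolerated probability); the function name bits_needed is made up for illustration.

```python
import math

def bits_needed(n_items, max_collision_probability):
    """Hash width in bits so that n_items occupy at most the given
    fraction of the hash range (the estimate used in the answer above)."""
    required_range = n_items / max_collision_probability
    return math.ceil(math.log2(required_range))

# 55 billion pages, at most a one in a billion chance of collision
print(bits_needed(55e9, 1e-9))
```

Changing either argument shows how the required width moves if you relax the collision tolerance or index fewer items.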

You have to decide which is more important: space or time.
If time, then you need to create unique representations of your large_item which take as little space as possible (probably some str value) that is easy (i.e. quick) to calculate and will not have collisions, and store them in a set.
If space, find the quickest disk-backed solution you can and store the smallest possible unique value that will identify a large_item.
So either way, you want small unique identifiers -- depending on the nature of large_item this may be a big win, or not possible.
Update
they are strings of html content
Perhaps a hybrid solution then: Keep a set in memory of the normal Python hash, while also keeping the actual html content on disk, keyed by that hash; when you check to see if the current large_item is in the set and get a positive, double-check with the disk-backed solution to see if it's a real hit or not, then skip or process as appropriate. Something like this:
import dbf
on_disk = dbf.Table('/tmp/processed_items', 'hash N(17,0); value M')
index = on_disk.create_index(lambda rec: rec.hash)
fast_check = set()
def slow_check(hashed, item):
    matches = on_disk.search((hashed,))
    for record in matches:
        if item == record.value:
            return True
    return False
for large_item in many_items:
    hashed = hash(large_item) # only calculate once
    if hashed not in fast_check or not slow_check(hashed, large_item):
        on_disk.append((hashed, large_item))
        fast_check.add(hashed)
        process(large_item)
FYI: dbf is a module I wrote which you can find on PyPI

If many_items already resides in memory, you are not creating another copy of the large_item. You are just storing a reference to it in the ignored set.
If many_items is a file or some other generator, you'll have to look at other alternatives.
Eg if many_items is a file, perhaps you can store a pointer to the item in the file instead of the actual item

You have already seen a few options, but unfortunately none of them fully addresses the situation, partly because of the following:
Memory constraints: the entire set of objects cannot be stored in memory.
No perfect hash function: for a huge data set there is always a chance of collision.
Better hash functions (md5) are slower.
Using a database like sqlite would actually make things slower.
Regarding this excerpt from your question:
I have a version that uses sqlite to store the data on disk, but then this process runs much slower.
I feel that if you work on this, it might help you marginally. Here is how it could be done:
Use tmpfs to create a ramdisk. tmpfs has an advantage over other implementations because it supports moving less-used pages to swap space.
Store the sqlite database on the ramdisk.
Change the size of the ramdisk and profile your code to check your performance.
I suppose you already have working code to save your data in sqlite. You only need to define a tmpfs mount and use its path to store your database.
Caveat: This is a linux only solution

a bloom filter? http://en.wikipedia.org/wiki/Bloom_filter
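To make the suggestion concrete, here is a small, self-contained Bloom filter sketch. The class, its default sizes and the double-hashing scheme built on sha256 are illustrative assumptions, not a tuned implementation; in practice you would size n_bits and n_hashes for your expected number of items and might swap in a faster non-cryptographic hash.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, n_bits=1 << 20, n_hashes=7):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions from one digest (double hashing)
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd, so never zero
        for i in range(self.n_hashes):
            yield (h1 + i * h2) % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
for big_string in ["<html>a</html>", "<html>b</html>", "<html>a</html>"]:
    if big_string not in seen:
        seen.add(big_string)
        print("processing", big_string)
```

The memory use is fixed by n_bits regardless of how large the strings are; the price is that a small fraction of new strings may be wrongly reported as already seen.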

well you can always decorate large_item with a processed flag. Or something similar.

You can give the str type's __hash__ function a try.
In [1]: hash('http://stackoverflow.com')
Out[1]: -5768830964305142685
It's definitely not a cryptographic hash function, but with a little luck you won't have too many collisions. It works as described here: http://effbot.org/zone/python-hash.htm.

I suggest you profile standard Python hash functions and choose the fastest: they are all "safe" against collisions enough for your application.
Here are some benchmarks for hash, md5 and sha1:
In [37]: very_long_string = 'x' * 1000000
In [39]: %timeit hash(very_long_string)
10000000 loops, best of 3: 86 ns per loop
In [40]: from hashlib import md5, sha1
In [42]: %timeit md5(very_long_string).hexdigest()
100 loops, best of 3: 2.01 ms per loop
In [43]: %timeit sha1(very_long_string).hexdigest()
100 loops, best of 3: 2.54 ms per loop
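md5 and sha1 are comparable in speed. hash appears to be about 20k times faster for this string, and the timing does not seem to depend much on the size of the string. Be aware, though, that CPython caches a str object's hash after the first call, so this %timeit loop mostly measures the cached lookup rather than the hashing itself.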

How does your sqlite version work? If you insert all your strings into a database table and then run the query "select distinct big_string from table_name", the database should optimize it for you.
Another option for you would be to use Hadoop.
Another option could be to split the strings into partitions such that each partition is small enough to fit in memory. Then you only need to check for duplicates within each partition. The formula you use to decide the partition must choose the same partition for each duplicate. The easiest way is to just look at the first few characters of the string, e.g.:
from collections import defaultdict
d = defaultdict(int)
for big_string in strings_generator:
    d[big_string[:4]] += 1
print(d)
Now you can decide on your partitions, go through the generator again and write each big_string to a file that has the start of the big_string in the filename. Then you can use your original method on each file and just loop through all the files.
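The partitioning idea can be sketched as follows. This variant picks the partition with a cheap checksum (zlib.crc32) instead of the string prefix, and stores one JSON-encoded string per line so embedded newlines survive; both choices, as well as the function name and file naming, are assumptions for the sake of a runnable example.

```python
import json
import os
import tempfile
import zlib

def unique_strings(strings, workdir, n_partitions=8):
    """Yield each distinct string once, holding only one partition in memory."""
    paths = [os.path.join(workdir, "part_%d.jsonl" % i)
             for i in range(n_partitions)]
    files = [open(p, "w", encoding="utf-8") for p in paths]
    try:
        # Pass 1: equal strings always land in the same partition file
        for s in strings:
            idx = zlib.crc32(s.encode("utf-8")) % n_partitions
            files[idx].write(json.dumps(s) + "\n")
    finally:
        for f in files:
            f.close()
    # Pass 2: the original in-memory set method, one partition at a time
    for p in paths:
        seen = set()
        with open(p, encoding="utf-8") as f:
            for line in f:
                s = json.loads(line)
                if s not in seen:
                    seen.add(s)
                    yield s

workdir = tempfile.mkdtemp()
for s in unique_strings(["a", "b", "a", "c", "b"], workdir, n_partitions=4):
    print(s)
```

Note that the output order follows the partitions, not the input order, and that every string is written to disk once, so this trades I/O for memory.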

This can be achieved much more easily by performing simpler checks first, then investigating only the matching cases with more elaborate checks. The example below contains extracts of your code, but it performs the expensive checks on much smaller sets of data. It does this by first matching on a simple case that is cheap to check. And if you find that a (filesize, checksum) pair is not discriminating enough, you can easily swap it for another cheap yet more rigorous check.
# Need to define the following functions
def GetFileSize(filename):
    pass
def GenerateChecksum(filename):
    pass
def LoadBigString(filename):
    pass
# Returns a list of the big_strings that occur more than once.
def GetDuplicates(filename_list):
    duplicates = list()
    # Maps a quick check to the list of filenames that share it.
    filename_lists_by_quick_checks = dict()
    for filename in filename_list:
        quickcheck = GetQuickCheck(filename)
        if quickcheck not in filename_lists_by_quick_checks:
            filename_lists_by_quick_checks[quickcheck] = list()
        filename_lists_by_quick_checks[quickcheck].append(filename)
    # Only files that share a quick check can be duplicates of each other.
    for quickcheck, filenames in filename_lists_by_quick_checks.items():
        big_strings = GetBigStrings(filenames)
        duplicates.extend(GetBigstringDuplicates(big_strings))
    return duplicates
# Processes each new big_string and returns the ones that were already seen.
def GetBigstringDuplicates(strings_generator):
    processed = set()
    duplicates = list()
    for big_string in strings_generator:
        if big_string not in processed:
            processed.add(big_string)
            process(big_string)
        else:
            duplicates.append(big_string)
    return duplicates
# Returns a tuple containing (filesize, checksum).
def GetQuickCheck(filename):
    return (GetFileSize(filename), GenerateChecksum(filename))
# Returns a list of big_strings from a list of filenames.
def GetBigStrings(file_list):
    big_strings = list()
    for filename in file_list:
        big_strings.append(LoadBigString(filename))
    return big_strings

Python unhash value

I am a newbie to Python. Can I unhash a value, and if so, how? I am using the standard hash() function. What I would like to do is first hash a value, send it somewhere, and then unhash it, like this:
#process X
hashedVal = hash(someVal)
#send n receive in process Y
someVal = unhash(hashedVal)
#for example print it
print someVal
Thx in advance

It can't be done.
A hash is not a compressed version of the original value; it is a number (or something similar) derived from the original value. The nature of hash implementations is that it is possible (but statistically unlikely if the hash algorithm is a good one) that two different objects produce the same hash value.
This follows from the Pigeonhole Principle, which basically states that if you have N different items and want to place them into M different categories, where N is larger than M (i.e. more items than categories), you're going to end up with some categories containing multiple items. Since a hash value is typically much smaller in size than the data it hashes, it follows the same principle.
As such, it is impossible to go back once you have the hash value. You need a different way of transporting data than this.
For instance, an example (but not a very good one) hash algorithm would be to calculate the number modulus 3 (ie. the remainder after dividing by 3). Then you would have the following hash values from numbers:
1 --> 1 <--+- same hash number, but different original values
2 --> 2    |
3 --> 0    |
4 --> 1 <--+
Are you trying to use the hash function in this way in order to:
Save space (you have observed that the hash value is much smaller in size than the original data)
Secure transportation (you have observed that the hash value is difficult to reverse)
Transport data (you have observed that the hash number/string is easier to transport than a complex object hierarchy)
... ?
Knowing why you want to do this might give you a better answer than just "it can't be done".
For instance, for the above 3 different observations, here's a way to do each of them properly:
Compression/Decompression, for instance using gzip or zlib (the two typically available in most programming languages/runtimes)
Encryption/Decryption, for instance using RSA, AES or a similar secure encryption algorithm
Serialization/Deserialization, which is code built to take a complex object hierarchy and produce either a binary or textual representation that later on can be deserialized back into new objects
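As an illustration of the serialization and compression options (as opposed to hashing), here is a minimal round trip using only the standard library. The helper names pack and unpack are made up for this sketch, and JSON is just one possible serialization format; it only handles basic types such as dicts, lists, strings and numbers.

```python
import json
import zlib

def pack(value):
    """Serialize a value to JSON, then compress it to a bytes blob."""
    return zlib.compress(json.dumps(value).encode("utf-8"))

def unpack(blob):
    """Reverse of pack(): decompress, then deserialize."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

# process X
some_val = {"name": "example", "values": [1, 2, 3]}
blob = pack(some_val)
# send blob to process Y, then:
print(unpack(blob))
```

Unlike a hash, the blob contains all the information of the original value, which is exactly why it can be turned back into it.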

Even if I'm almost 8 years late with an answer, I want to say it is possible to unhash data (not with the std hash() function though).
The previous answers all describe cryptographic hash functions, which by design should compute hashes that are impossible (or at least very hard) to unhash.
However, this is not the case with all hash functions.
Solution
You can use basehash python lib (pip install basehash) to achieve what you want.
There is an important thing to keep in mind though: in order to be able to unhash the data, you need to hash it without loss of data. This generally means that the bigger the pool of data types and values you would like to hash, the bigger the hash length has to be, so that you won't get hash collisions.
Anyway, here's a simple example of how to hash/unhash data:
import basehash
hash_fn = basehash.base36() # you can initialize a 36, 52, 56, 58, 62 and 94 base fn
hash_value = hash_fn.hash(1) # returns 'M8YZRZ'
unhashed = hash_fn.unhash('M8YZRZ') # returns 1
You can define the hash length on hash function initialization and hash other data types as well.
I leave out the explanation of the necessity for various bases and hash lengths to the readers who would like to find out more about hashing.

You can't "unhash" data, hash functions are irreversible due to the pigeonhole principle
http://en.wikipedia.org/wiki/Hash_function
http://en.wikipedia.org/wiki/Pigeonhole_principle
I think what you are looking for is encryption/decryption. (Or compression or serialization as mentioned in other answers/comments.)

This is not possible in general. A hash function necessarily loses information, and python's hash is no exception.

Creating a unique key based on file content in python

I got many, many files to be uploaded to the server, and I just want a way to avoid duplicates.
Thus, generating a unique and small key value from a big string seemed something that a checksum was intended to do, and hashing seemed like the evolution of that.
So I was going to use hash md5 to do this. But then I read somewhere that "MD5 are not meant to be unique keys" and I thought that's really weird.
What's the right way of doing this?
edit: by the way, I took two sources to get to the following, which is how I'm currently doing it and it's working just fine, with Python 2.5:
import hashlib
def md5_from_file(fileName, block_size=2**14):
    md5 = hashlib.md5()
    f = open(fileName, 'rb')  # binary mode, so the hash covers the raw bytes
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    f.close()
    return md5.hexdigest()

Sticking with MD5 is a good idea. Just to make sure I'd append the file length or number of chunks to your file-hash table.
Yes, there is the possibility that you run into two files that have the same MD5 hash, but that's quite unlikely (if your files are decent sized). Thus adding the number of chunks to your hash may help you reduce that since now you have to find two files the same size with the same MD5.
class DuplicateFileException(Exception):
    pass
def store_file(file, nchunks, file_hash):
    """Tells you whether there is another file with the same contents already,
    by making a table lookup."""
    # This can be a DB lookup or some way to obtain your hash map
    big_table = ObtainTable()
    # Two level lookup table might help performance
    # Will vary on the number of entries and nature of big_table
    if nchunks in big_table:
        if file_hash in big_table[nchunks]:
            raise DuplicateFileException(
                'File is dup with %s' % big_table[nchunks][file_hash])
    else:
        big_table[nchunks] = {}
    big_table[nchunks].update({
        file_hash: file.filename
    })
    file.save() # or something
# This is the algorithm you described, but also returns the number of chunks.
new_file_hash, nchunks = hash_for_file(new_file)
store_file(new_file, nchunks, new_file_hash)
To reduce that possibility further, switch to SHA1 and use the same method. Or even use both (concatenated) if performance is not an issue.
Of course, keep in mind that this will only work with duplicate files at binary level, not images, sounds, video that are "the same" but have different signatures.
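A compact, runnable variant of the same idea might look like this. It keys each file on the pair (size in bytes, digest) rather than on the number of chunks, and uses SHA-256; both are substitutions chosen for this sketch, and the function names are hypothetical.

```python
import hashlib
import os

def file_key(path, block_size=2**14):
    """Return (size in bytes, SHA-256 hex digest) for the file at path."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in fixed-size blocks so large files never sit fully in memory
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return os.path.getsize(path), h.hexdigest()

def find_duplicates(paths):
    """Return (duplicate, first_seen) path pairs among the given files."""
    seen = {}
    duplicates = []
    for path in paths:
        key = file_key(path)
        if key in seen:
            duplicates.append((path, seen[key]))
        else:
            seen[key] = path
    return duplicates
```

If you want to rule out collisions entirely, compare the two files byte for byte whenever find_duplicates reports a pair.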

The issue with hashing is that it's generating a "small" identifier from a "large" dataset. It's like a lossy compression. While you can't guarantee uniqueness, you can use it to substantially limit the number of other items you need to compare against.
Consider that MD5 yields a 128 bit value (I think that's what it is, although the exact number of bits is irrelevant). If your input data set has 129 bits and you actually use them all, each MD5 value will appear on average twice. For longer datasets (e.g. "all text files of exactly 1024 printable characters") you're still going to run into collisions once you get enough inputs. Contrary to what another answer said, it is a mathematical certainty that you will get collisions.
See http://en.wikipedia.org/wiki/Birthday_Paradox
Granted, you have around a 1% chance of collisions with a 128 bit hash at 2.6*10^18 entries, but it's better to handle the case that you do get collisions than to hope that you never will.
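The figure above can be checked with the usual birthday-bound approximation, p ≈ 1 - exp(-n(n-1) / (2 * 2^bits)). The snippet below is just that approximation written out; the function name is made up.

```python
import math

def collision_probability(n_items, hash_bits):
    """Approximate chance of at least one collision among n_items
    uniformly random hash values of hash_bits bits (birthday bound)."""
    exponent = n_items * (n_items - 1) / (2.0 * 2 ** hash_bits)
    # -expm1(-x) computes 1 - exp(-x) accurately for small x
    return -math.expm1(-exponent)

print(collision_probability(2.6e18, 128))  # close to 0.01
print(collision_probability(1e9, 128))     # vanishingly small
```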

The issue with MD5 is that it's broken. For most common uses there's little problem and people still use both MD5 and SHA1, but I think that if you need a hashing function then you need a strong hashing function. To the best of my knowledge there is still no standard substitute for either of these. There are a number of algorithms that are "supposed" to be strong, but we have most experience with SHA1 and MD5. That is, we (think) we know when these two break, whereas we don't really know as much when the newer algorithms break.
Bottom line: think about the risks. If you wish to walk the extra mile then you might add extra checks when you find a hash duplicate, for the price of the performance penalty.

Storing huge hash table in a file in Python

Hey. I have a function I want to memoize, however, it has too many possible values. Is there any convenient way to store the values in a text file and make it read from them? For example, something like storing a pre-computed list of primes up to 10^9 in a text file? I know it's slow to read from a text file but there's no other option if the amount of data is really huge. Thanks!

For a list of primes up to 10**9, why do you need a hash? What would the KEYS be?! Sounds like a perfect opportunity for a simple, straightforward binary file! By the Prime Number Theorem, there are about 10**9/ln(10**9) such primes -- i.e. roughly 50 million. At 4 bytes per prime, that's only about 200 MB -- perfect for an array.array("L") and its methods such as fromfile, etc (see the docs). In many cases you could actually suck all of the 200 MB into memory, but, worst case, you can get a slice of those (e.g. via mmap and the fromstring method of array.array, called frombytes in Python 3), do binary searches there (e.g. via bisect), etc, etc.
When you DO need a huge key-values store -- gigabytes, not a paltry 200 MB!-) -- I used to recommend shelve but after unpleasant real-life experience with huge shelves (performance, reliability, etc), I currently recommend a database engine instead -- sqlite is good and comes with Python, PostgreSQL is even better, non-relational ones such as CouchDB can be better still, and so forth.
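Here is a rough sketch of the binary-file approach for the primes case, using array.array and bisect from the standard library. It uses typecode "I" (typically 4 bytes) rather than "L", whose size varies by platform; the function names and the file layout are assumptions for illustration.

```python
import array
import bisect
import os

def save_primes(primes, path):
    """Write a sorted sequence of primes as raw unsigned ints."""
    arr = array.array("I", primes)
    with open(path, "wb") as f:
        arr.tofile(f)

def load_primes(path):
    """Read the whole file back into a compact array."""
    arr = array.array("I")
    n_items = os.path.getsize(path) // arr.itemsize
    with open(path, "rb") as f:
        arr.fromfile(f, n_items)
    return arr

def contains(sorted_arr, n):
    """Binary search: is n in the sorted array?"""
    i = bisect.bisect_left(sorted_arr, n)
    return i < len(sorted_arr) and sorted_arr[i] == n
```

The file is only portable between machines with the same item size and byte order, so treat it as a local cache rather than an exchange format.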

You can use the shelve module to store a dictionary like structure in a file. From the Python documentation:
import shelve
d = shelve.open(filename) # open -- file may get suffix added by low-level
# library
d[key] = data # store data at key (overwrites old data if
# using an existing key)
data = d[key] # retrieve a COPY of data at key (raise KeyError
# if no such key)
del d[key] # delete data stored at key (raises KeyError
# if no such key)
flag = key in d # true if the key exists
klist = list(d.keys()) # a list of all existing keys (slow!)
# as d was opened WITHOUT writeback=True, beware:
d['xx'] = [0, 1, 2] # this works as expected, but...
d['xx'].append(3) # *this doesn't!* -- d['xx'] is STILL [0, 1, 2]!
# having opened d without writeback=True, you need to code carefully:
temp = d['xx'] # extracts the copy
temp.append(5) # mutates the copy
d['xx'] = temp # stores the copy right back, to persist it
# or, d=shelve.open(filename,writeback=True) would let you just code
# d['xx'].append(5) and have it work as expected, BUT it would also
# consume more memory and make the d.close() operation slower.
d.close() # close it

You could also just go with the ultimate brute force, and create a Python file with just a single statement in it:
seedprimes = [3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,
79,83,89,97,101,103,107,109,113,127,131,137,139,149,151,157,163,167,173, ...
and then just import it. (Here is a file with the primes up to 1e5: http://python.pastebin.com/f177ec30).
from primes_up_to_1e9 import seedprimes

For Project Euler, I stored a precomputed list of primes up to 10**8 in a text file just by writing them in comma separated format. It worked well for that size, but it doesn't scale well to going much larger.
If your "huge" is not really that huge, I would use something simple like I did; otherwise I would go with shelve as the others have said.

Just naively sticking a hash table onto disk will result in about 5 orders of magnitude performance loss compared to an in memory implementation (or at least 3 if you have a SSD). When dealing with hard disks you'll want to extract every bit of data-locality and caching you can get.
The correct choice will depend on details of your use case. How much performance do you need? What kind of operations do you need on data-structure? Do you need to only check if the table contains a key or do you need to fetch a value based on the key? Can you precompute the table or do you need to be able to modify it on the fly? What kind of hit rate are you expecting? Can you filter out a significant amount of the operations using a bloom filter? Are the requests uniformly distributed or do you expect some kind of temporal locality? Can you predict the locality clusters ahead of time?
If you don't need ultimate performance or can parallelize and throw hardware at the problem check out some distributed key-value stores.

You can also go one step down the ladder and use pickle. Shelve is built on top of pickle, so if you don't need the added functionality of shelve, this may spare you some clock cycles (although they really don't matter much here, as you have chosen Python for storing large amounts of numbers).

Let's see where the bottleneck is. When you're going to read a file, the hard drive has to seek to the right position before it can read from it; then it reads a big block and caches the results.
So you want some method that computes exactly which position in the file you're going to read from and then reads it exactly once. I'm pretty sure standard DB modules will work for you, but you can do it yourself -- just open the file in binary mode for reading/writing and store your values as, say, 30-digit (= 100-bit = 13-byte) numbers.
Then use the standard file methods (seek, read, write).
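A minimal sketch of that fixed-width record layout, assuming non-negative integers below 10**30 stored as 13 big-endian bytes each; the function names and the choice of byte order are illustrative.

```python
RECORD_SIZE = 13  # bytes; enough for any value below 10**30

def write_values(path, values):
    """Store each value as a fixed-width big-endian record."""
    with open(path, "wb") as f:
        for v in values:
            f.write(v.to_bytes(RECORD_SIZE, "big"))

def read_value(path, index):
    """Fetch the record at position index with a single seek and read."""
    with open(path, "rb") as f:
        f.seek(index * RECORD_SIZE)
        data = f.read(RECORD_SIZE)
    if len(data) != RECORD_SIZE:
        raise IndexError(index)
    return int.from_bytes(data, "big")
```

Because every record has the same size, the byte offset of entry i is simply i * RECORD_SIZE, so a lookup costs one seek regardless of how large the file is.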
