I'm looking for a hashing algorithm that will take a 16-character string as input and output a different 16-character string [that can't be converted back to the original string].
I've thought of taking an MD5 result and slicing off the first 16 chars, but I think that is not the right way to solve the problem, since it loses the hashing idea.
Any suggestions?
The platform, if it matters, is Python.
Actually, you lose the hashing idea as soon as you decide the input size needs to match the output size, since, according to Wikipedia, "A hash function is any function that can be used to map digital data of arbitrary size to digital data of fixed size."
If you are building a credit card number tokenization system, just make up a random string after checking that the number has not already been tokenized, check that the token does not collide with an existing one, save the original number in the ways allowed by the PCI standards (read them: https://www.pcisecuritystandards.org/documents/Tokenization_Guidelines_Info_Supplement.pdf) and you are good to go.
If not, a truncated hash such as SHA-256 or MD5 will give you repeatability outside your system too, plus a risk of collisions (that comes with hashing), but whether those properties make sense really depends on your use case.
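For what it's worth, such a truncated hash could look like the sketch below (standard hashlib; keeping 16 hex characters retains only 64 bits of the digest, so collisions become correspondingly more likely):

import hashlib

def truncated_hash(s: str, length: int = 16) -> str:
    """One-way map from an input string to a fixed-length hex string.

    Truncating the digest keeps only length * 4 bits, so collisions are more
    likely than with the full digest, and the result is repeatable by anyone
    who knows the scheme.
    """
    return hashlib.sha256(s.encode("utf-8")).hexdigest()[:length]

print(truncated_hash("1234567890123456"))  # 16 hex characters, not reversible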
Related
My case is this: I need to offer people a one-time code that they can use to log in. These people are not tech literate, so they need to be offered a human-readable code.
The format is something along the lines of this:
ACBE-adK3-SdLK-K23J
a set of 4 groups of 4 human-readable characters, for a total of 16 characters, which seems reasonably secure, comparable to a UUID, and can easily be extended if needed.
Now, is using, say, NanoID 4 times to generate four 4-character strings equivalent to using it once to generate a 16-character string and then chopping that up? I think it is. Programmatically it's trivial to implement either. But I really wonder about the actual, factual answer. Would some math specialist indulge me?
Edit:
To answer the questions:
It's to allow people access to photos that only they should have access to; think passport photos, school photos and the like. People use the code once to link the photos to their e-mail and from there on log in using an e-mail/password combo. Having people sign up with an e-mail beforehand is not an option in this case.
I am aware that using hex digits is the usual approach. I need something easily human readable, so cutting a 16-digit hex block into 4 distinct parts seemed the logical step.
The chosen alphabet would be a-z, A-Z and 0-9, excluding a few confusable symbols such as 0/o/O and I/1/l to limit mistakes. This would allow expressing the same ID in fewer characters.
I am aware now that NanoID is not a UUID implementation. Thanks. But for my goal it would be sufficient, I think. If not, I'd like to know that as well.
I am using Python 3
A string format such as the one you give in your question is ultimately a one-to-one mapping from integers to human-readable strings. If the integer is generated so as to be unique, so will the human-readable string be.
In your case, you can generate a uniform random integer in the interval [0, A^S), where A is the alphabet size (such as 36 for upper-case letters and digits), and S is the number of characters in the ID (which is 16 in your example, excluding hyphens). Then map that integer one-to-one with human-readable strings in the desired format.
In your case, the ID will serve as a secret "confirmation code", in which case it should be generated using a secure random generator, such as secrets.SystemRandom or random.SystemRandom or secrets.randbelow in Python (but note that randomly generated values are not unique by themselves).
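As a rough sketch of that idea in Python: picking S characters independently with secrets.choice gives the same distribution as drawing one uniform integer in [0, A^S) and expanding it in base A. The alphabet and the 4x4 grouping below are just the assumptions taken from the question.

import secrets

# Alphabet without the easily confused characters (0/o/O and I/1/l).
ALPHABET = "23456789abcdefghijkmnpqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ"

def make_code(groups: int = 4, group_len: int = 4) -> str:
    """Generate a human-readable one-time code such as 'ACBE-adK3-SdLK-K23J'.

    secrets.choice uses a cryptographically secure RNG; uniqueness still has
    to be enforced by checking against codes that have already been issued.
    """
    parts = [
        "".join(secrets.choice(ALPHABET) for _ in range(group_len))
        for _ in range(groups)
    ]
    return "-".join(parts)

print(make_code())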
I have a list of 500 million strings. The strings are alphanumeric, ASCII characters, of varying length (usually 2-30 characters). They're also single words (or a combination of words without spaces, like 'helloiamastring').
What I need is a fast way to check them against a target, say 'hi'. The result should be all strings from the 500-million list which start with 'hi' (e.g. 'hithere', 'hihowareyou', etc.). This needs to be fast because there will be a new query each time the user types something: if he types "hi", all strings starting with "hi" from the 500-million list should be shown; if he types "hey", all strings starting with "hey", and so on.
I've tried the trie approach, but the memory footprint to store 300 million strings is just huge; it would require 100GB+ of RAM. And I'm pretty sure the list will grow to a billion.
What is a fast algorithm for this use case?
P.S. If there's no fast option, the best alternative would be to require people to enter at least, say, 4 characters before results show up. Is there a fast way to retrieve the results then?
You want a Directed Acyclic Word Graph, or DAWG. This generalizes @greybeard's suggestion to use stemming.
See, for example, the discussion in section 3.2 of this.
If the strings are sorted then a binary search is reasonable. As a speedup, you could maintain a dictionary of all possible bigrams ("aa", "ab", etc.) where the corresponding values are the first and last index starting with that bigram (if any do) and so in O(1) time zero in on a much smaller sublist that contains the strings that you are looking for. Once you find a match, do a linear search to the right and left to get all other matches.
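A minimal sketch of that sorted-list approach with the standard bisect module (this variant uses two binary searches instead of a linear scan; the bigram dictionary would simply supply tighter lo/hi bounds before the same search):

import bisect

def find_with_prefix(sorted_strings, prefix, lo=0, hi=None):
    """Return all strings in a sorted list that start with `prefix`."""
    if hi is None:
        hi = len(sorted_strings)
    left = bisect.bisect_left(sorted_strings, prefix, lo, hi)
    # Every string starting with `prefix` sorts before prefix + a maximal character.
    right = bisect.bisect_left(sorted_strings, prefix + "\U0010FFFF", lo, hi)
    return sorted_strings[left:right]

words = sorted(["hithere", "hihowareyou", "hey", "helloiamastring", "hi"])
print(find_with_prefix(words, "hi"))  # ['hi', 'hihowareyou', 'hithere']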
If you want to force the user to type at least 4 letters, for example, you can keep a key-value map, in memory or on disk, where the keys are all combinations of 4 letters (there are not too many of them if it is case insensitive, otherwise you can limit it to three) and the values are lists of positions of all strings that begin with that combination.
After the user has typed the three (or four) letters you have all the possible strings at once. From this point on you just loop over this subset.
On average this subset is small enough, i.e. 500M divided by 26^4, just as an example. Actually it is bigger, because probably not every combination of 4 letters is a prefix of one of your strings.
I forgot to say: when you add a new string to the big list, you also update the list of indexes corresponding to that key in the map.
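A rough in-memory sketch of that prefix map (the 4-character minimum and the lower-casing are assumptions made for the example):

from collections import defaultdict

PREFIX_LEN = 4

def build_prefix_index(strings):
    """Map each 4-character prefix to the positions of strings starting with it."""
    index = defaultdict(list)
    for pos, s in enumerate(strings):
        if len(s) >= PREFIX_LEN:
            index[s[:PREFIX_LEN].lower()].append(pos)
    return index

def add_string(strings, index, s):
    """Append a new string and keep the prefix index up to date."""
    strings.append(s)
    if len(s) >= PREFIX_LEN:
        index[s[:PREFIX_LEN].lower()].append(len(strings) - 1)

def query(strings, index, typed):
    """Return matching strings once at least 4 characters have been typed."""
    candidates = index.get(typed[:PREFIX_LEN].lower(), [])
    return [strings[i] for i in candidates if strings[i].lower().startswith(typed.lower())]

strings = ["helloiamastring", "hellothere", "heyyou"]
index = build_prefix_index(strings)
add_string(strings, index, "helloworld")
print(query(strings, index, "hello"))  # ['helloiamastring', 'hellothere', 'helloworld']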
If you don't want to use a database, you should recreate some of the data-handling routines that already exist in all database engines:
Don't try to load all the data into memory.
Use a fixed length for all strings. This increases storage consumption but significantly decreases seek time (the i-th string can be found at position L*i bytes in the file, where L is the fixed length); see the sketch after this list. Create an additional mechanism to handle extremely long strings: store them in a different place and use special pointers.
Sort all of the strings. You can use a merge sort to do this without loading all the strings into memory at once.
Create indexes (the address of the first line starting with 'a', 'b', ...); indexes can also be created for 2-grams, 3-grams, etc. Indexes can be kept in memory to increase search speed.
Use advanced strategies to avoid full index regeneration on data updates: split the data into a number of files by first letter and update only the affected indexes, leave empty spaces in the data to reduce the impact of read-modify-write procedures, and keep a cache of new lines before they are added to the main storage, searching that cache as well.
Use a query cache to speed up processing of popular requests.
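As an illustration of the fixed-length-record idea mentioned in the list (the file name and record length are made up for the example; a real system would add the index layers described above):

import os

RECORD_LEN = 32  # fixed length L; longer strings would be stored elsewhere

def write_records(path, strings):
    """Write the sorted strings as fixed-length, null-padded records."""
    with open(path, "wb") as f:
        for s in sorted(strings):
            f.write(s.encode("ascii").ljust(RECORD_LEN, b"\x00"))

def read_record(f, i):
    """The i-th string sits at byte offset L * i."""
    f.seek(RECORD_LEN * i)
    return f.read(RECORD_LEN).rstrip(b"\x00").decode("ascii")

def find_prefix(path, prefix):
    """Binary search for the first record >= prefix, then scan forward."""
    matches = []
    with open(path, "rb") as f:
        n = os.fstat(f.fileno()).st_size // RECORD_LEN
        lo, hi = 0, n
        while lo < hi:
            mid = (lo + hi) // 2
            if read_record(f, mid) < prefix:
                lo = mid + 1
            else:
                hi = mid
        while lo < n:
            s = read_record(f, lo)
            if not s.startswith(prefix):
                break
            matches.append(s)
            lo += 1
    return matches

write_records("strings.dat", ["helloiamastring", "hithere", "hihowareyou", "hey"])
print(find_prefix("strings.dat", "hi"))  # ['hihowareyou', 'hithere']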
In this hypothetical, where the strings being indexed are not associated with any other information (e.g. other columns in the same row), there is relatively little difference between a complete index and keeping the strings sorted in the first place (as in, some difference, but not as much as you are hoping for). In light of the growing nature of the list and the cost of updating it, perhaps the opposite approach will better accomplish the performance tradeoffs that you are looking for.
For any given character at any given location in the string, your base case is that no string exists containing that letter. For example, once 'hello' has been typed, if the next letter typed is 't', then your base case is that there is no string beginning 'hellot'. There is a finite number of characters that could follow 'hello' at location 5 (say, 26). You need 26 fixed-length spaces in which to store information about characters that follow 'hello' at location 5. Each space either says zero if there is no string in which, e.g., 't' follows 'hello', or contains a number of data-storage addresses by which to advance to find the list of characters for which one or more strings involve that character following 'hellot' at location 6 (or use absolute data-storage addresses, although only relative addresses allow the algorithm I propose to support an infinite number of strings of infinite length without any modification to allow for larger pointers as the list grows).
The algorithm can then move forward through this data stored on disk, building a tree of string-beginnings in memory as it goes, and avoiding delays caused by random-access reads. For an in-memory index, simply store the part of the tree closest to the root in memory. After the user has typed 'hello' and the algorithm has tracked that information about one or more strings beginning 'hellot' exists at data-storage address X, the algorithm finds one of two types of lists at location X. Either it is another sequence of, e.g., 26 fixed-length spaces with information about characters following 'hellot' at location 6, or it is a pre-allocated block of space listing all post-fixes that follow 'hellot', depending on how many such post-fixes exist. Once there are enough post-fixes that using some traditional search and/or sort algorithm to both update and search the post-fix list fails to provide the performance benefits that you desire, it gets divided up and replaced with a sequence of, e.g., 26 fixed-length spaces.
This involves pre-allocating a relatively substantial amount of disk storage upfront, with the tradeoff that your tree can be maintained in sorted form without needing to move anything around for most updates, and your searches can be performed in full in a single sequential read. It also provides more flexibility and probably requires less storage space than a solution based on storing the strings themselves as fixed-length strings.
First of all I should say that the tag you should have added for your question is "Information Retrieval".
I think using Apache Lucene's PrefixQuery is the best way you can handle wildcard queries. Apache has a Python version if you are comfortable with Python. But to use Apache Lucene to solve your problem you should first learn about indexing your data (which is the part where your data is compressed and saved in a more efficient manner).
Also, looking at the indexing and wildcard-query sections of an IR book will give you a better picture.
I have a large static binary (10GB) that doesn't change.
I want to be able to take small strings as input (15 bytes or fewer each) and then determine which string is the least frequent.
I understand that without actually searching the whole binary I won't be able to determine this exactly, so I know it will be an approximation.
Building a tree/hash table isn't feasible since it would require about 256^15 bytes, which is a LOT.
I have about 100GB of disk space and 8GB of RAM which will be dedicated to this task, but I can't seem to find any way to accomplish the task without actually going over the whole file.
I have as much time as I want to prepare the big binary, and after that I'll need to decide which is the least frequent string many, many times.
Any ideas?
Thanks!
Daniel.
(BTW: if it matters, I'm using Python)
Maybe build a hash table with the counts for as many n-tuples as you can afford storage for? You can prune the entries that don't appear anymore. I wouldn't call it an "approximation", but rather "upper bounds", with the assurance of detecting strings that don't appear at all.
So, say you can build all 4-tuples.
Then, to count occurrences of "ABCDEF", you'd take the minimum of count(ABCD), count(BCDE) and count(CDEF). If that is zero for any of those, the string is guaranteed not to appear. If it is one, it will appear at most once (but maybe not at all).
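A small sketch of that upper-bound idea with 4-tuples (an in-memory Counter is used only for illustration; the real table has to fit whatever storage you can afford):

from collections import Counter

N = 4  # tuple size you can afford to store

def build_counts(data: bytes) -> Counter:
    """Count every N-byte substring of the big binary once, up front."""
    return Counter(data[i:i + N] for i in range(len(data) - N + 1))

def upper_bound(counts: Counter, query: bytes) -> int:
    """Upper bound on occurrences of `query`: the minimum count of its N-grams.

    If any N-gram of the query never appears, the query cannot appear at all.
    """
    return min(counts[query[i:i + N]] for i in range(len(query) - N + 1))

counts = build_counts(b"ABCDEFABCDXYZ")
print(upper_bound(counts, b"ABCDEF"))  # 1 -> appears at most once
print(upper_bound(counts, b"ABCDQX"))  # 0 -> guaranteed absent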
Because you have a large static string that does not change, you can distinguish the one-time work of preprocessing this string, which never has to be repeated, from the work of answering queries. It might be convenient to do the one-time work on a more powerful machine.
If you can find a machine with an order of magnitude or so more internal storage you could build a suffix array - an array of offsets into the data, in sorted order of the suffixes starting at those offsets. This could be stored in external storage for queries, and you could use it with binary search to find the first and last positions in sorted order where your query string appears. Obviously the distance between the two will give you the number of occurrences, and a binary search will need about 34 binary chops to cover 16 Gbytes (assuming 16 Gbytes is 2^34 bytes), so each query should cost about 68 disk seeks.
It may not be reasonable to expect you to find that amount of internal storage, but I just bought a 1TB USB hard drive for about 50 pounds, so I think you could increase external storage for the one-time work. There are algorithms for suffix array construction in external memory, but because your query strings are limited to 15 bytes you don't need anything that complicated. Just create 200GB of data by writing out the 15-byte string found at every offset followed by a 5-byte offset number, then sort these 20-byte records with an external sort. Keeping just the sorted 5-byte offsets will give you 50GB of indexes into the string, in sorted order, to put into external storage to answer queries with.
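For illustration, here is a toy in-memory version of that sorted-index idea, just to show how two binary searches over the sorted 15-byte keys give the occurrence count (the real version would keep the sorted records in external storage and seek into them):

import bisect

MAX_QUERY = 15

def build_index(data: bytes):
    """Sorted (15-byte-prefix, offset) records, one per position in the data."""
    records = sorted((data[i:i + MAX_QUERY], i) for i in range(len(data)))
    return [key for key, _offset in records]  # only the keys are needed to count

def count_occurrences(keys, query: bytes) -> int:
    """Number of positions whose prefix starts with `query` (len(query) <= 15)."""
    left = bisect.bisect_left(keys, query)
    # Anything starting with `query` sorts before query padded with 0xFF bytes.
    right = bisect.bisect_left(keys, query + b"\xff" * (MAX_QUERY - len(query) + 1))
    return right - left

keys = build_index(b"abracadabra")
print(count_occurrences(keys, b"abra"))  # 2
print(count_occurrences(keys, b"zzz"))   # 0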
If you know all of the queries in advance, or are prepared to batch them up, another approach would be to build an Aho-Corasick string-matching automaton (http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm) from them. This takes time linear in the total size of the queries. Then you can stream the 10GB of data past it in time proportional to the sum of the size of that data and the number of times any string finds a match.
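As a sketch of that batched approach, assuming the third-party pyahocorasick package is installed (a hand-rolled automaton would work the same way, and a real run would stream the 10GB file in overlapping chunks rather than loading it):

import ahocorasick  # the third-party "pyahocorasick" package, assumed available

queries = [b"abra", b"cad", b"zzz"]

automaton = ahocorasick.Automaton()
for q in queries:
    # Keys are unicode strings, so map the bytes through latin-1 one-to-one.
    automaton.add_word(q.decode("latin-1"), q)
automaton.make_automaton()

counts = {q: 0 for q in queries}
data = b"abracadabra"  # stand-in for the 10GB binary
for _end_index, q in automaton.iter(data.decode("latin-1")):
    counts[q] += 1

least_frequent = min(counts, key=counts.get)
print(counts)          # {b'abra': 2, b'cad': 1, b'zzz': 0}
print(least_frequent)  # b'zzz'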
Since you are looking for which string is least frequent, and are willing to accept an approximate solution, you could use a series of Bloom filters instead of a hash table. If you use sufficiently large ones, you shouldn't need to worry about the query size, as you can probably keep the false positive rate low.
The idea would be to go through all of the possible query sizes and make substrings out of them. For example, if the queries will be between 3 and 100, then it would cost (N * (sum of i from i = 3 to i = 100)). Then, one by one, add each substring to a Bloom filter in which it does not already exist, creating a new Bloom filter with the same hash functions if needed. You obtain the count by going through each filter and checking whether the query exists in it. Each query then simply goes through each of the filters and checks if it's there; if it is, it adds 1 to the count.
You'll need to balance the false positive rate against the number of filters. If the false positive rate gets too high on one of the filters it isn't useful; likewise, it's bad if you have trillions of Bloom filters (quite possible if you have one filter per substring). There are a couple of ways these issues can be dealt with.
To reduce the number of filters:
Randomly delete filters until there are only so many left. This will likely increase the false negative rate, which probably means it's better to simply delete the filters with the highest expected false positive rates.
Randomly merge filters until there are only so many left, ideally avoiding merging a filter too often, as merging increases the false positive rate. Practically speaking, you probably have too many to do this without making use of the scalable version (see below), as it will probably be hard enough just to manage the false positive rate.
It also may not be a bad idea to avoid a greedy approach when adding to a Bloom filter: be rather selective about which filter something is added to.
You might end up having to implement scalable Bloom filters to keep things manageable, which sounds similar to what I'm suggesting anyway, so it should work well.
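A bare-bones sketch of that counting-by-filters idea (the Bloom filter here is a minimal hand-rolled one with made-up sizes, not a tuned implementation):

import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array."""

    def __init__(self, m=1 << 20, k=4):
        self.m, self.k, self.bits = m, k, bytearray(m // 8)

    def _positions(self, item: bytes):
        for i in range(self.k):
            h = hashlib.blake2b(item, salt=i.to_bytes(8, "little")).digest()
            yield int.from_bytes(h[:8], "little") % self.m

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def add_occurrence(filters, substring: bytes):
    """Record one occurrence: put it in the first filter that doesn't have it yet."""
    for f in filters:
        if substring not in f:
            f.add(substring)
            return
    filters.append(BloomFilter())
    filters[-1].add(substring)

def approx_count(filters, query: bytes) -> int:
    """Approximate count = number of filters reporting the query (an overestimate)."""
    return sum(query in f for f in filters)

filters = [BloomFilter()]
data = b"abracadabra"
for n in range(3, 6):  # substring lengths of interest (an assumption)
    for i in range(len(data) - n + 1):
        add_occurrence(filters, data[i:i + n])
print(approx_count(filters, b"abra"))  # about 2
print(approx_count(filters, b"qqq"))   # about 0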
I'm writing a program which has to compute a multiple sequence alignment of a set of strings. I was thinking of doing this in Python, but I could use an external piece of software or another language if that's more practical. The data is not particularly big, I do not have strong performance requirements and I can tolerate approximations (i.e. I just need to find a good enough alignment). The only problem is that the strings are regular strings (i.e. UTF-8 strings, potentially with newlines that should be treated as regular characters); they aren't DNA sequences or protein sequences.
I can find tons of tools and information for the usual bioinformatics cases, with their specific complicated file formats and a host of features I don't need, but it is unexpectedly hard to find software, libraries or example code for the simple case of plain strings. I could probably reimplement any one of the many algorithms for this problem or encode my strings as DNA, but there must be a better way. Do you know of any solutions?
Thanks!
The easiest way to align multiple sequences is to do a number of pairwise alignments.
First get pairwise similarity scores for each pair and store those scores. This is the most expensive part of the process. Choose the pair that has the best similarity score and do that alignment. Now pick the sequence which aligned best to one of the sequences in the set of aligned sequences, and align it to the aligned set, based on that pairwise alignment. Repeat until all sequences are in.
When you are aligning a sequence to the aligned sequences (based on a pairwise alignment), when you insert a gap in the sequence that is already in the set, you insert gaps in the same place in all sequences in the aligned set.
Lafrasu has suggested the SequenceMatcher() algorithm for pairwise alignment of UTF-8 strings. What I've described gives you a fairly painless, reasonably decent way to extend that to multiple sequences.
In case you are interested, it is equivalent to building up small sets of aligned sequences and aligning them on their best pair. It gives exactly the same result, but it is a simpler implementation.
Are you looking for something quick and dirty, as in the following?
from difflib import SequenceMatcher

a = "dsa jld lal"
b = "dsajld kll"
c = "dsc jle kal"
d = "dsd jlekal"
ss = [a, b, c, d]

s = SequenceMatcher()
# Compare every pair of strings once.
for i in range(len(ss)):
    x = ss[i]
    s.set_seq1(x)
    for j in range(i + 1, len(ss)):
        y = ss[j]
        s.set_seq2(y)
        print()
        print(s.ratio())
        print(s.get_matching_blocks())
MAFFT version 7.120+ supports multiple text alignment. The input is like FASTA format, but with LATIN1 text instead of sequences, and the output is aligned FASTA format. Once installed, it is easy to run:
mafft --text input_text.fa > output_alignment.fa
Although MAFFT is a mature tool for biological sequence alignment, the text alignment mode is still under development, with future plans including support for user-defined scoring matrices. You can see further details in the documentation.
I've pretty recently written a python script that runs the Smith-Waterman algorithm (which is what is used to generate gapped local sequence alignments for DNA or protein sequences). It's almost certainly not the fastest implementation, as I haven't optimized it for speed at all (not my bottleneck at the moment), but it works and doesn't care about the identity of each character in the strings. I could post it here or email you the files if that's the kind of thing you're looking for.
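For reference, a minimal Smith-Waterman sketch that treats its inputs as plain Python strings (this is not the script mentioned above, and the scoring values are arbitrary assumptions):

def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-1):
    """Local alignment of two plain strings; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best-scoring cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        score = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + score:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

print(smith_waterman("dsa jld lal", "dsajld kll"))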
I'm developing a genetic algorithm in Python where chromosomes are composed of strings and integers. To apply the genetic operations, I want to convert these groups of integers and strings into bit strings.
For example, if one chromosome is:
["Hello", 4, "anotherString"]
I'd like it to become something like:
0100100100101001010011110011
(this is not an actual translation). So... how can I do this? Chromosomes will contain the same number of strings and integers, but these numbers can vary from one algorithm run to another.
To be clear, what I want to obtain is the bit representation of each element in the chromosome concatenated.
If you think this would not be the best way to apply genetic operators (such as mutation and simple crossover) just tell me! I'm open to new ideas.
Thanks a lot!
Manuel
You can turn strings and integers into bytestrings (and back) with the struct module, and that's exactly 8 bits to a byte. If for some reason you want these binary bytestrings as text strings made up of 0 and 1 characters, you can print them in binary form, of course.
Edit: forgot to remind you how to format a byte into a text string made up of 0 and 1 characters -- in Python 2.6 or better:
>>> format(23, '08b')
'00010111'
and to get back from such a string to a byte, of course:
>>> int('00010111', 2)
23
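For instance, a minimal sketch of that round trip for a chromosome like the one in the question (the fixed field lengths below are assumptions; pick whatever layout fits your data):

import struct

# Assumed layout: a 16-byte string, a 4-byte signed int, a 16-byte string.
FMT = "<16si16s"

def chromosome_to_bits(chromosome):
    """Pack [str, int, str] into bytes with struct, then render it as a 0/1 string."""
    s1, n, s2 = chromosome
    packed = struct.pack(FMT, s1.encode("utf-8"), n, s2.encode("utf-8"))
    return "".join(format(byte, "08b") for byte in packed)

def bits_to_chromosome(bits):
    """Reverse: turn the 0/1 string back into bytes and unpack with the same format."""
    packed = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    s1, n, s2 = struct.unpack(FMT, packed)
    return [s1.rstrip(b"\x00").decode("utf-8"), n, s2.rstrip(b"\x00").decode("utf-8")]

bits = chromosome_to_bits(["Hello", 4, "anotherString"])
print(bits)                      # a 288-character string of 0s and 1s
print(bits_to_chromosome(bits))  # ['Hello', 4, 'anotherString']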
Converting everything into one concatenated string and then applying genetic operations doesn't seem to be the best idea. Genetic operations can break many things here (especially if you have constraints on individuals); additionally, the effectiveness of such a solution is probably low. I would suggest a different approach.
Try implementing the individual using the SuperGene concept (wiki). An example of applying it to a GA is described here. Additionally, as per this, they say it improves overall GA performance.
In my opinion it will make the design clearer. I would try this approach.
Once you describe exactly how the translation from strings to bitstrings should go, the "how" should be fairly easy. If the genetic algorithm is meant to work at the bit level then obviously a bit-level string makes sense, but it is probably way slower than working with numbers or character strings.