How many times string appears in another string - python

I have a large static binary (10GB) that doesn't change.
I want to be able to take as input small strings (15 bytes or lower each) and then to determine which string is the least frequent.
I understand that without actually searching the whole binary I wont be able to determine this exactly, so I know it will be an approximation.
Building a tree/hash table isn't feasible since it will require about 256^15 bytes which is ALOT.
I have about 100GB of disk space and 8GB RAM which will be dedicated into this task, but I can't seem to find any way to accomplish this task without actually going over the file.
I have as much time as I want to prepare the big binary, and after that I'll need to decide which is the least frequent string many many times.
Any ideas?
Thanks!
Daniel.
(BTW: if it matters, I'm using Python)

Maybe build a hashtable with the counts for as many n-tuples as you can afford storage for? You can prune the trees that don't appear anymore. I wouldn't call it "approximation", but could be "upper bounds", with assurance to detect strings that don't appear.
So, say you can build all 4-tuples.
Then to count occurrences for "ABCDEF" you'd have the minimum of count(ABCD), count(BCDE), count(CDEF). If that is zero for any of those, the string is guaranteed to not appear. If it is one, it will appear at most once (but maybe not at all).

Because you have a large static string that does not change you could distinguish one-time work preprocessing this string which never has to be repeated from the work of answering queries. It might be convenient to do the one-time work on a more powerful machine.
If you can find a machine with an order of magnitude or so more internal storage you could build a suffix array - an array of offsets into the stream in sorted order of the suffixes starting at the offset. This could be stored in external storage for queries, and you could use this with binary search to find the first and last positions in sorted order where your query string appears. Obviously the distance between the two will give you the number of occurrences, and a binary search will need about 34 binary chops to do 16 Gbyte assuming 16Gbytes is 2^34 bytes so each query should cost about 68 disk seeks.
It may not be reasonable to expect you to find that amount of internal storage, but I just bought a 1TB USB hard drive for about 50 pounds, so I think you could increase external storage for one time work. There are algorithms for suffix array construction in external memory, but because your query strings are limited to 15 bytes you don't need anything that complicated. Just create 200GB of data by writing out the 15-byte string found at every offset followed by an 5-byte offset number, then sort these 20-byte records with an external sort. This will give you 50Gbytes of indexes into the string in sorted order for you to put into external storage to answer queries with.

If you know all of the queries in advance, or are prepared to batch them up, another approach would be to build an http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm tree from them. This takes time linear in the total size of the queries. Then you can stream the 10GB data past them in time proportional to the sum of the size of that data and the number of times any string finds a match.

Since you are looking for which is least frequent, and are willing to accept approximate solution. You could use a series of Bloom filters instead of a hash table. If you use sufficiently large ones, you shouldn't need to worry about the query size, as you can probably keep the false positive rate low.
The idea would be to go through all of the possible query sizes and make sub-strings out of them. For example, if the queries will be between 3 and 100, then it would cost (N * (sum of (i) from i = 3 to i = 100)). Then one by one add the subsets to one of the bloom filters, such that the query doesn't exist within the filter, creating a new one Bloom filter with the same hash functions if needed. You obtain the count by going through each filter and checking if the query exists within it. Each query then simply goes through each of the filter and checks if it's there, if it is, it adds 1 to a count.
You'll need to try to balance the false positive rate as well as the number of filters. If the false positive rate gets too high on one of the filters it isn't useful, likewise it's bad if you have trillions of bloom filters (quite possible if you one filter per sub-string). There are a couple of ways these issues can be dealt with.
To reduce the number of filters:
Randomly delete filters until there are only so many left. This will likely increase the false negative rate, which probably means it's better to simply delete the filters with the highest expected false positive rates.
Randomly merge filters until there are only so many left. Ideally avoiding merging a filter too often as it increases the false positive rate. Practically speaking, you probably have too many to do this without making use of the scalable version (see below), as it'll probably be hard enough to manage the false positive rate.
It also may not be a bad to avoid a greedy approach when adding to a bloom filter. Be rather selective in which filter something is added to.
You might end up having to implement scalable bloom filters to keep things manageable, which sounds similar to what I'm suggesting anyway, so should work well.

Related

More efficient data structure/algorithm to find similar imagehashes in database

I am writing a small python program that tries to find images similar enough to some already in a database (to detect duplicates that have been resized/recompressed/etc). I am using the imagehash library and average hashing, and want to know if there is a hash in a known database that has a hamming distance lower than, say, 3 or 4.
I am currently just using a dictionary that matches hashes to filenames and use brute force for every new image. However, with tens or hundreds of thousands of images to compare to, performance is starting to suffer.
I believe there must be data structures and algorithms that can allow me to search a lot more efficiently but wasn’t able to find much that would match my particular use case. Would anyone be able to suggest where to look?
Thanks!
Here's a suggestion. You mention a database, so initially I will assume we can use that (and don't have to read it all into memory first). If your new image has a hash of 3a6c6565498da525, think of it as 4 parts: 3a6c 6565 498d a525. For a hamming distance of 3 or less any matching image must have a hash where at least one of these parts is identical. So you can start with a database query to find all images whose hash contains the substring 3a6c or 6565 or 498d or a525. This should be a tiny subset of the full dataset, so you can then run your comparison on that.
To improve further you could pre-compute all the parts and store them separately as additional columns in the database. This will allow more efficient queries.
For a bigger hamming distance you would need to split the hash into more parts (either smaller, or you could even use parts that overlap).
If you want to do it all in a dictionary, rather than using the database you could use the parts as keys that each point to a list of images. Either a single dictionary for simplicity, or for more accurate matching, a dictionary for each "position".
Again, this would be used to get a much smaller set of candidate matches on which to run the full comparison.

what the fastest method to loop through a very large number of account in a log in page

I was wonder what if i had 1 million users, how much time will take it to loop every account to check for the username and email if the registration page to check if they are already used or not, or check if the username and password and correct in the log in page, won't it take ages if i do it in a traditional for loop?
Rather than give a detailed technical answer, I will try to give a theoretical illustration of how to address your concerns. What you seem to be saying is this:
Linear search might be too slow when logging users in.
I imagine what you have in mind is this: a user types in a username and password, clicks a button, and then you loop through a list of username/password combinations and see if there is a matching combination. This takes time that is linear in terms of the number of users in the system; if you have a million users in the system, the loop will take about a thousand times as long as when you just had a thousand users... and if you get a billion users, it will take a thousand times longer over again.
Whether this is a performance problem in practice can only be determined through testing and requirements analysis. However, if it is determined to be a problem, then there is a place for theory to come to the rescue.
Imagine one small improvement to our original scheme: rather than storing the username/password combinations in arbitrary order and looking through the whole list each time, imagine storing these combinations in alphabetic order by username. This enables us to use binary search, rather than linear search, to determine whether there exists a matching username:
check the middle element in the list
if the target element is equal to the middle element, you found a match
otherwise, if the target element comes before the middle element, repeat binary search on the left half of the list
otherwise, if the target element comes after the middle element, repeat binary search on the right half of the list
if you run out of list without finding the target, it's not in the list
The time complexity of this is logarithmic in terms of the number of users in the system: if you go from a thousand users to a million users, the time taken goes up by a factor of roughly ten, rather than one thousand as was the case for linear search. This is already a vast improvement over linear search and for any realistic number of users is probably going to be efficient enough. However, if additional performance testing and requirements analysis determine that it's still too slow, there are other possibilities.
Imagine now creating a large array of username/password pairs and whenever a pair is added to the collection, a function is used to transform the username into a numeric index. The pair is then inserted at that index in the array. Later, when you want to find whether that entry exists, you use the same function to calculate the index, and then check just that index to see if your element is there. If the function that maps the username to indices (called a hash function; the index is called a hash) is perfect - different strings don't map to the same index - then this unambiguously tells you whether your element exists. Notably, under somewhat reasonable assumptions, the time to make this determination is mostly independent from the number of users currently in the system: you can get (amortized) constant time behavior from this scheme, or something reasonably close to it. That means the performance hit from going from a thousand to a million users might be negligible.
This answer does not delve into the ugly real-world minutia of implementing these ideas in a production system. However, real world systems to implement these ideas (and many more) for precisely the kind of situation presented.
EDIT: comments asked for some pointers on actually implementing a hash table in Python. Here are some thoughts on that.
So there is a built-in hash() function that can be made to work if you disable the security feature that causes it to produce different hashes for different executions of the program. Otherwise, you can import hashlib and use some hash function there and convert the output to an integer using e.g. int.from_bytes. Once you get your number, you can take the modulus (or remainder after division, using the % operator) w.r.t. the capacity of your hash table. This gives you the index in the hash table where the item gets put. If you find there's already an item there - i.e. the assumption we made in theory that the hash function is perfect turns out to be incorrect - then you need a strategy. Two strategies for handling collisions like this are:
Instead of putting items at each index in the table, put a linked list of items. Add items to the linked list at the index computed by the hash function, and look for them there when doing the search.
Modify the index using some deterministic method (e.g., squaring and taking the modulus) up to some fixed number of times, to see if a backup spot can easily be found. Then, when searching, if you do not find the value you expected at the index computed by the hash method, check the next backup, and so on. Ultimately, you must fall back to something like method 1 in the worst case, though, since this process could fail indefinitely.
As for how large to make the capacity of the table: I'd recommend studying recommendations but intuitively it seems like creating it larger than necessary by some constant multiplicative factor is the best bet generally speaking. Once the hash table begins to fill up, this can be detected and the capacity expanded (all hashes will have to be recomputed and items re-inserted at their new positions - a costly operation, but if you increase capacity in a multiplicative fashion then I imagine this will not be too frequent an issue).

Optimizing a counter that loops through documents to check how many words from the document appear in a list

I am using a lexicon of positive and negative words, and I want to count how many positive and negative words appear in each document from a large corpus. The corpus has almost 2 million documents, so the code I'm running is taking too long to count all these occurrences.
I have tried using numpy, but get a memory error when trying to convert the list of documents into an array.
This is the code I am currently running to count just the positive words in each document.
reviews_pos_wc = []
for review in reviews_upper:
pos_words = 0
for word in review:
if word in pos_word_list:
pos_words += 1
reviews_pos_wc.append(pos_words)
After running this for half an hour, it only gets through 300k documents.
I have done a search for similar questions on this website. I found someone else doing a similar thing, but not nearly on the same scale as they only used one document. The answer suggested using the Counter class, but I thought this would just add more overhead.
It appears that your central problem is that you don't have the hardware needed to do the job you want in the time you want. For instance, your RAM appears insufficient to hold the names of 2M documents in both list and array form.
I do see a couple of possibilities. Note that "vectorization" is not a magic solution to large problems; it's merely a convenient representation that allows certain optimizations to occur among repeated operations.
Regularize your file names, so that you can represent their names in fewer bytes. Iterate through a descriptive expression, rather than the full file names. This could give you freedom to vectorize something later.
Your variable implies that your lexicon is a list. This has inherently linear access. Change this to a data structure amenable to faster search, such as a set (hash function) or some appropriate search tree. Even a sorted list with an interpolation search would speed up your work.
Do consider using popular modules (such as Collections); let the module developers optimize the common operations on your behalf. Write a prototype and time its performance: given the simplicity of your processing, the coding shouldn't take long.
Does that give you some ideas for experimentation? I'm hopeful that my first paragraph proves to be unrealistically pessimistic (i.e. that something does provide a solution, especially the lexicon set).

Prefix search against half a billion strings

I have a list of 500 mil strings. The strings are alphanumeric, ASCII characters, of varying size (usually from 2-30 characters). Also, they're single words (or a combination of words without spaces like 'helloiamastring').
What I need is a fast way to check against a target, say 'hi'. The result should be all strings from the 500mil list which start with 'hi' (for eg. 'hithere', 'hihowareyou' etc.). This needs to be fast because there will be a new query each time the user types something, so if he types "hi", all strings starting with "hi" from the 500 mil list will be shown, if he types "hey", all strings starting with "hey" will show etc.
I've tried with the Tries algo, but the memory footprint to store 300 mil strings is just huge. It should require me 100GB+ ram for that. And I'm pretty sure the list will grow up to a billion.
What is a fast algorithm for this use case?
P.S. In case there's no fast option, the best alternative would be to limit people to enter at least, say, 4 characters, before results show up. Is there a fast way to retrieve the results then?
You want a Directed Acyclic Word Graph or DAWG. This generalizes #greybeard's suggestion to use stemming.
See, for example, the discussion in section 3.2 of this.
If the strings are sorted then a binary search is reasonable. As a speedup, you could maintain a dictionary of all possible bigrams ("aa", "ab", etc.) where the corresponding values are the first and last index starting with that bigram (if any do) and so in O(1) time zero in on a much smaller sublist that contains the strings that you are looking for. Once you find a match, do a linear search to the right and left to get all other matches.
If you want to force the user to digit at least 4 letters, for example, you can keep a key-value map, memory or disk, where the keys are all combinations of 4 letters (they are not too many if it is case insensitive, otherwise you can limit to three), and the values are list of positions of all strings that begin with the combination.
After the user has typed the three (or four) letters you have at once all the possible strings. From this point on you just loop on this subset.
On average this subset is small enough, i.e. 500M divided by 26^4...just as example. Actually bigger because probably not all sets of 4 letters can be prefix for your strings.
Forgot to say: when you add a new string to the big list, you also update the list of indexes corresponding to the key in the map.
If you doesn't want to use some database, you should create some data related routines pre-existing in all database engines:
Doesn't try to load all data in memory.
Use fixed length for all string. It increase storage memory consumption but significantly decrease seeking time (i-th string can be found at position L*i bytes in file, where L - fixed length). Create additional mechanism to work with extremely long strings: store it in different place and use special pointers.
Sort all of strings. You can use merge sort to do it without load all strings in memory in one time.
Create indexes (address of first line starts with 'a','b',... ) also indexes can be created for 2-grams, 3-grams, etc. Indexes can be placed in memory to increase search speed.
Use advanced strategies to avoid full indexes regeneration on data update: split a data to a number of files by first letters and update only affected indexes, create an empty spaces in data to decrease affect of read-modify-write procedures, create a cache for a new lines before they will be added to main storage and search in this cache.
Use query cache to fast processing a popular requests.
In this hypothetical, where the strings being indexed are not associated with any other information (e.g. other columns in the same row), there is relatively little difference between a complete index and keeping the strings sorted in the first place (as in, some difference, but not as much as you are hoping for). In light of the growing nature of the list and the cost of updating it, perhaps the opposite approach will better accomplish the performance tradeoffs that you are looking for.
For any given character at any given location in the string, your base case is that no string exists containing that letter. For example, once 'hello' has been typed, if the next letter typed is 't', then your base case is that there is no string beginning 'hellot'. There is a finite number of characters that could follow 'hello' at location 5 (say, 26). You need 26 fixed-length spaces in which to store information about characters that follow 'hello' at location 5. Each space either says zero if there is no string in which, e.g., 't' follows 'hello', or contains a number of data-storage addresses by which to advance to find the list of characters for which one or more strings involve that character following 'hellot' at location 6 (or use absolute data-storage addresses, although only relative addressess allow the algorithm I propose to support an infinite number of strings of infinite length without any modification to allow for larger pointers as the list grows).
The algorithm can then move forward through this data stored on disk, building a tree of string-beginnings in memory as it goes, and avoiding delays caused by random-access reads. For an in-memory index, simply store the part of the tree closest to the root in memory. After the user has typed 'hello' and the algorithm has tracked that information about one or more strings beginning 'hellot' exists at data-storage address X, the algorithm finds one of two types of lists at location X. Either it is another sequence of, e.g., 26 fixed-length spaces with information about characters following 'hellot' at location 6, or it is a pre-allocated block of space listing all post-fixes that follow 'hellot', depending on how many such post-fixes exist. Once there are enough post-fixes that using some traditional search and/or sort algorithm to both update and search the post-fix list fails to provide the performance benefits that you desire, it gets divided up and replaced with a sequence of, e.g., 26 fixed-length spaces.
This involves pre-allocating a relatively substantial amount of disk-storage upfront, with the tradeoff that your tree can be maintained in sorted form without needing to move anything around for most updates, and your searches can be peformed in full in a single sequential read. It also provides more flexibility and probably requires less storage space than a solution based on storing the strings themselves as fixed-length strings.
First of all I should say that the tag you should have added for your question is "Information Retrieval".
I think using Apache Lucene's PrefixQuery is the best way you can handle wildcard queries. Apache has a Python version if you are comfortable with python. But to use Apache lucent to solve your problem you should first know about indexing your data(which is the part that your data will be compressed and saved in a more efficient manner).
Also looking to indexing and wildcard query section of IR book will give you a better vision.

BST or Hash Table?

I have large text files upon which all kinds of operations need to be performed, mostly involving row by row validations. The data are generally of a sales / transaction nature, and thus tend to contain a huge amount of redundant information across rows, such as customer names. Iterating and manipulating this data has become such a common task that I'm writing a library in C that I hope to make available as a Python module.
In one test, I found that out of 1.3 million column values, only ~300,000 were unique. Memory overhead is a concern, as our Python based web application could be handling simultaneous requests for large data sets.
My first attempt was to read in the file and insert each column value into a binary search tree. If the value has never been seen before, memory is allocated to store the string, otherwise a pointer to the existing storage for that value is returned. This works well for data sets of ~100,000 rows. Much larger and everything grinds to a halt, and memory consumption skyrockets. I assume the overhead of all those node pointers in the tree isn't helping, and using strcmp for the binary search becomes very painful.
This unsatisfactory performance leads me to believe I should invest in using a hash table instead. This, however, raises another point -- I have no idea ahead of time how many records there are. It could be 10, or ten million. How do I strike the right balance of time / space to prevent resizing my hash table again and again?
What are the best data structure candidates in a situation like this?
Thank you for your time.
Hash table resizing isn't a concern unless you have a requirement that each insert into the table should take the same amount of time. As long as you always expand the hash table size by a constant factor (e.g. always increasing the size by 50%), the computational cost of adding an extra element is amortized O(1). This means that n insertion operations (when n is large) will take an amount of time that is proportionate to n - however, the actual time per insertion may vary wildly (in practice, one of the insertions will be very slow while the others will be very fast, but the average of all operations is small). The reason for this is that when you insert an extra element that forces the table to expand from e.g. 1000000 to 1500000 elements, that insert will take a lot of time, but now you've bought yourself 500000 extremely fast future inserts before you need to resize again. In short, I'd definitely go for a hash table.
You need to use incremental resizing of your hash table. In my current project, I keep track of the hash key size used in every bucket, and if that size is below the current key size of the table, then I rehash that bucket on an insert or lookup. On a resizing of the hash table, the key size doubles (add an extra bit to the key) and in all the new buckets, I just add a pointer back to the appropriate bucket in the existing table. So if n is the number of hash buckets, the hash expand code looks like:
n=n*2;
bucket=realloc(bucket, sizeof(bucket)*n);
for (i=0,j=n/2; j<n; i++,j++) {
bucket[j]=bucket[i];
}
library in C that I hope to make
available as a Python module
Python already has very efficient finely-tuned hash tables built in. I'd strongly suggest that you get your library/module working in Python first. Then check the speed. If that's not fast enough, profile it and remove any speed-humps that you find, perhaps by using Cython.
setup code:
shared_table = {}
string_sharer = shared_table.setdefault
scrunching each input row:
for i, field in enumerate(fields):
fields[i] = string_sharer(field, field)
You may of course find after examining each column that some columns don't compress well and should be excluded from "scrunching".

Categories