So I was wondering about YouTube's URLs, specifically the video ID in watch?v=XZmGGAbHqa0. The same question applies to TinyURL-style services.
I came across Tom's video, Will YouTube Ever Run Out Of Video IDs? That was quite interesting. But generating a seemingly random number and then checking it for duplicates looked quite expensive.
In case you haven't watched the video: YouTube generates a large unique number and base64-encodes it. I'm not sure the whole process is really that simple inside YouTube, but I'm trying to write something similar.
I came across a mathematical function called the modular multiplicative inverse. Using this, along with a bit of the extended Euclidean algorithm, will generate a unique number from our given input. The mapping is invertible too, but someone can't easily invert or brute-force it without knowing the seeds. So we can easily get a large random-looking number from our sequential database ID, and maybe even base64 it.
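A tiny sketch of what I mean (the multiplier and modulus here are arbitrary placeholders, not real seeds I'd use, and pow(a, -1, m) needs Python 3.8+):

# toy example: multiplier and modulus are arbitrary placeholders, not real seeds
MODULUS = 2 ** 40                        # id space of 2^40 possible ids
MULTIPLIER = 0x5DEECE66D                 # odd, therefore coprime to the modulus
INVERSE = pow(MULTIPLIER, -1, MODULUS)   # modular multiplicative inverse (Python 3.8+)

def obfuscate(sequential_id):
    return (sequential_id * MULTIPLIER) % MODULUS

def deobfuscate(obfuscated_id):
    return (obfuscated_id * INVERSE) % MODULUS

assert deobfuscate(obfuscate(12345)) == 12345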
But I have some points of confusion.
As Tom mentioned in the video, syncing a sequential ID across multiple servers can lead to problems and overhead. But if we don't make it sequential, won't it be a tedious and expensive task to find a unique ID? We'd have to check whether the ID is already taken in the database.
Should I avoid putting sequential IDs in the URL at all? Or is masking them with something like that multiplicative inverse good enough?
Why would I want to base64 it? To save storage, or to save URL space? I would still need to generate something seemingly random and unique, be able to generate it quickly without duplicates, and be able to search for it.
Are there better ways to achieve what I'm trying to do? I did check all the similar SO questions and a few blog posts, one of which is this.
Sorry if I couldn't explain this well; I'm quite confused about what to ask.
This is why you bypass the issue of ensuring uniqueness of randomly generated numbers entirely and use a pseudorandom permutation (PRP) family to transform a pre-coordinated counter.
A PRP family is like a PRF family in that it takes a key and a plaintext and produces a ciphertext; the difference is that a PRP is a one-to-one map between plaintexts and ciphertexts, i.e. a cipher.
This lets one use a Twitter Snowflake-like design and then simply encode the internal sequential identifiers into non-sequential external identifiers.
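As an illustration only (not what YouTube or Snowflake actually do), here is a minimal PRP sketch: a 4-round Feistel network over 64-bit IDs with HMAC-SHA256 as the round function; the key, the round count and the URL-safe base64 slug step are my own assumptions.

import base64
import hashlib
import hmac
import struct

SECRET_KEY = b"replace-with-a-real-secret"   # assumption: a server-side secret
ROUNDS = 4
MASK32 = 0xFFFFFFFF

def _round(i, half):
    # 32-bit round function derived from HMAC-SHA256
    msg = struct.pack(">II", i, half)
    digest = hmac.new(SECRET_KEY, msg, hashlib.sha256).digest()
    return struct.unpack(">I", digest[:4])[0]

def permute(n):
    # bijectively map a 64-bit sequential id to a non-sequential 64-bit id
    left, right = (n >> 32) & MASK32, n & MASK32
    for i in range(ROUNDS):
        left, right = right, left ^ _round(i, right)
    return (left << 32) | right

def unpermute(n):
    # invert permute() by running the rounds backwards
    left, right = (n >> 32) & MASK32, n & MASK32
    for i in reversed(range(ROUNDS)):
        left, right = right ^ _round(i, left), left
    return (left << 32) | right

def to_slug(n):
    # URL-safe base64 of the 8-byte id, padding stripped
    return base64.urlsafe_b64encode(struct.pack(">Q", n)).rstrip(b"=").decode()

assert unpermute(permute(123456789)) == 123456789

Because the permutation is a bijection, two external IDs can only collide if the underlying counters collide, so uniqueness reduces to coordinating the counter, which is exactly what a Snowflake-style scheme already does.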
See Algorithm to turn numeric IDs into short, different alphanumeric codes for an implementation in Python.
Also
M. Luby and C. Rackoff. How to construct pseudorandom permutations from pseudorandom functions. SIAM J. Comput., 17(2):373–386, Apr. 1988.
I am writing a small python program that tries to find images similar enough to some already in a database (to detect duplicates that have been resized/recompressed/etc). I am using the imagehash library and average hashing, and want to know if there is a hash in a known database that has a hamming distance lower than, say, 3 or 4.
I am currently just using a dictionary that matches hashes to filenames and use brute force for every new image. However, with tens or hundreds of thousands of images to compare to, performance is starting to suffer.
I believe there must be data structures and algorithms that can allow me to search a lot more efficiently but wasn’t able to find much that would match my particular use case. Would anyone be able to suggest where to look?
Thanks!
Here's a suggestion. You mention a database, so initially I will assume we can use that (and don't have to read it all into memory first). If your new image has a hash of 3a6c6565498da525, think of it as 4 parts: 3a6c 6565 498d a525. For a hamming distance of 3 or less any matching image must have a hash where at least one of these parts is identical. So you can start with a database query to find all images whose hash contains the substring 3a6c or 6565 or 498d or a525. This should be a tiny subset of the full dataset, so you can then run your comparison on that.
To improve further you could pre-compute all the parts and store them separately as additional columns in the database. This will allow more efficient queries.
For a bigger hamming distance you would need to split the hash into more parts (either smaller, or you could even use parts that overlap).
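If the database happens to be SQL-like, a rough sketch of the pre-computed-parts idea (the table and column names are made up):

import sqlite3

# hypothetical schema:
#   CREATE TABLE images (hash TEXT, part0 TEXT, part1 TEXT,
#                        part2 TEXT, part3 TEXT, filename TEXT)

def candidate_matches(conn, h):
    # split a 16-hex-char hash into 4 parts and find rows sharing any part
    parts = [h[i:i + 4] for i in range(0, 16, 4)]
    cur = conn.execute(
        "SELECT hash, filename FROM images "
        "WHERE part0 = ? OR part1 = ? OR part2 = ? OR part3 = ?",
        parts)
    return cur.fetchall()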
If you want to do it all in a dictionary, rather than using the database you could use the parts as keys that each point to a list of images. Either a single dictionary for simplicity, or for more accurate matching, a dictionary for each "position".
Again, this would be used to get a much smaller set of candidate matches on which to run the full comparison.
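A sketch of the pure-dictionary version, using a single dictionary keyed by (position, part) and a final hamming-distance check over the small candidate set (function names are just placeholders):

from collections import defaultdict

PARTS = 4   # split a 16-hex-char hash into 4 parts of 4 chars each

def split_parts(h):
    step = len(h) // PARTS
    return [(i, h[i * step:(i + 1) * step]) for i in range(PARTS)]

index = defaultdict(list)   # (position, part) -> list of (hash, filename)

def add_image(h, filename):
    for key in split_parts(h):
        index[key].append((h, filename))

def hamming(a, b):
    return bin(int(a, 16) ^ int(b, 16)).count("1")

def find_similar(h, max_distance=3):
    candidates = set()
    for key in split_parts(h):
        candidates.update(index[key])
    return [(c, name) for c, name in candidates if hamming(c, h) <= max_distance]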
I was wondering: if I had 1 million users, how long would it take to loop over every account to check whether a username and email are already in use on the registration page, or whether a username and password are correct on the login page? Won't it take ages if I do it with a traditional for loop?
Rather than give a detailed technical answer, I will try to give a theoretical illustration of how to address your concerns. What you seem to be saying is this:
Linear search might be too slow when logging users in.
I imagine what you have in mind is this: a user types in a username and password, clicks a button, and then you loop through a list of username/password combinations and see if there is a matching combination. This takes time that is linear in terms of the number of users in the system; if you have a million users in the system, the loop will take about a thousand times as long as when you just had a thousand users... and if you get a billion users, it will take a thousand times longer over again.
Whether this is a performance problem in practice can only be determined through testing and requirements analysis. However, if it is determined to be a problem, then there is a place for theory to come to the rescue.
Imagine one small improvement to our original scheme: rather than storing the username/password combinations in arbitrary order and looking through the whole list each time, imagine storing these combinations in alphabetic order by username. This enables us to use binary search, rather than linear search, to determine whether there exists a matching username:
check the middle element in the list
if the target element is equal to the middle element, you found a match
otherwise, if the target element comes before the middle element, repeat binary search on the left half of the list
otherwise, if the target element comes after the middle element, repeat binary search on the right half of the list
if you run out of list without finding the target, it's not in the list
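A direct translation of those steps into Python might look like this (the find_user name and the (username, password_hash) pair layout are just for illustration):

def find_user(sorted_users, username):
    # sorted_users: list of (username, password_hash) pairs, sorted by username
    lo, hi = 0, len(sorted_users) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        mid_name = sorted_users[mid][0]
        if mid_name == username:
            return sorted_users[mid]       # found a match
        elif username < mid_name:
            hi = mid - 1                   # repeat on the left half
        else:
            lo = mid + 1                   # repeat on the right half
    return None                            # ran out of list: not present

The same thing can be done with the standard bisect module instead of hand-rolling the loop.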
The time complexity of this is logarithmic in terms of the number of users in the system: if you go from a thousand users to a million users, the time taken goes up by a factor of roughly ten, rather than one thousand as was the case for linear search. This is already a vast improvement over linear search and for any realistic number of users is probably going to be efficient enough. However, if additional performance testing and requirements analysis determine that it's still too slow, there are other possibilities.
Imagine now creating a large array of username/password pairs and whenever a pair is added to the collection, a function is used to transform the username into a numeric index. The pair is then inserted at that index in the array. Later, when you want to find whether that entry exists, you use the same function to calculate the index, and then check just that index to see if your element is there. If the function that maps the username to indices (called a hash function; the index is called a hash) is perfect - different strings don't map to the same index - then this unambiguously tells you whether your element exists. Notably, under somewhat reasonable assumptions, the time to make this determination is mostly independent from the number of users currently in the system: you can get (amortized) constant time behavior from this scheme, or something reasonably close to it. That means the performance hit from going from a thousand to a million users might be negligible.
This answer does not delve into the ugly real-world minutiae of implementing these ideas in a production system. However, real-world systems do implement these ideas (and many more) for precisely the kind of situation presented.
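In Python, for example, the built-in dict is already a hash table under the hood, so the constant-time lookup described above is essentially one line (the password-hash strings below are placeholders):

# username -> password hash; dict lookups are (amortized) constant time on average
users = {"alice": "pbkdf2$abc...", "bob": "pbkdf2$def..."}

def credentials_match(username, password_hash):
    return users.get(username) == password_hash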
EDIT: comments asked for some pointers on actually implementing a hash table in Python. Here are some thoughts on that.
So there is a built-in hash() function that can be made to work if you disable the security feature that causes it to produce different hashes for different executions of the program. Otherwise, you can import hashlib and use some hash function there and convert the output to an integer using e.g. int.from_bytes. Once you get your number, you can take the modulus (or remainder after division, using the % operator) w.r.t. the capacity of your hash table. This gives you the index in the hash table where the item gets put. If you find there's already an item there - i.e. the assumption we made in theory that the hash function is perfect turns out to be incorrect - then you need a strategy. Two strategies for handling collisions like this are:
Instead of putting items at each index in the table, put a linked list of items. Add items to the linked list at the index computed by the hash function, and look for them there when doing the search.
Modify the index using some deterministic method (e.g., squaring and taking the modulus) up to some fixed number of times, to see if a backup spot can easily be found. Then, when searching, if you do not find the value you expected at the index computed by the hash method, check the next backup, and so on. Ultimately, you must fall back to something like method 1 in the worst case, though, since this process could fail indefinitely.
As for how large to make the capacity of the table: I'd recommend looking at what established implementations do, but intuitively it seems like creating it larger than necessary by some constant multiplicative factor is the best bet generally speaking. Once the hash table begins to fill up, this can be detected and the capacity expanded (all hashes will have to be recomputed and items re-inserted at their new positions - a costly operation, but if you increase capacity in a multiplicative fashion then I imagine this will not be too frequent an issue).
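Pulling those pieces together, a toy chained hash table using hashlib, int.from_bytes, the modulus step (strategy 1 for collisions) and multiplicative growth might look like this; it is purely pedagogical, not production code:

import hashlib

class HashTable:
    # toy hash table with chaining; capacity doubles when it gets too full

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.size = 0
        self.buckets = [[] for _ in range(capacity)]

    def _index(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.capacity

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)      # overwrite an existing entry
                return
        bucket.append((key, value))
        self.size += 1
        if self.size > 0.75 * self.capacity:  # load factor threshold
            self._resize()

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

    def _resize(self):
        old_items = [item for bucket in self.buckets for item in bucket]
        self.capacity *= 2
        self.size = 0
        self.buckets = [[] for _ in range(self.capacity)]
        for k, v in old_items:
            self.put(k, v)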
I am using a lexicon of positive and negative words, and I want to count how many positive and negative words appear in each document from a large corpus. The corpus has almost 2 million documents, so the code I'm running is taking too long to count all these occurrences.
I have tried using numpy, but get a memory error when trying to convert the list of documents into an array.
This is the code I am currently running to count just the positive words in each document.
reviews_pos_wc = []
for review in reviews_upper:
    pos_words = 0
    for word in review:
        if word in pos_word_list:
            pos_words += 1
    reviews_pos_wc.append(pos_words)
After running this for half an hour, it only gets through 300k documents.
I have done a search for similar questions on this website. I found someone else doing a similar thing, but not nearly on the same scale as they only used one document. The answer suggested using the Counter class, but I thought this would just add more overhead.
It appears that your central problem is that you don't have the hardware needed to do the job you want in the time you want. For instance, your RAM appears insufficient to hold the names of 2M documents in both list and array form.
I do see a couple of possibilities. Note that "vectorization" is not a magic solution to large problems; it's merely a convenient representation that allows certain optimizations to occur among repeated operations.
Regularize your file names, so that you can represent their names in fewer bytes. Iterate through a descriptive expression, rather than the full file names. This could give you freedom to vectorize something later.
Your variable name implies that your lexicon is a list. Membership tests on a list are inherently linear. Change it to a data structure amenable to faster search, such as a set (hash-based) or some appropriate search tree. Even a sorted list with an interpolation search would speed up your work.
Do consider using popular modules (such as collections); let the module developers optimize the common operations on your behalf. Write a prototype and time its performance: given the simplicity of your processing, the coding shouldn't take long.
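For the lexicon and collections suggestions in particular, a quick sketch of one such prototype (assuming the pos_word_list and reviews_upper names from your question):

from collections import Counter

pos_word_set = set(pos_word_list)      # set membership is roughly O(1)

reviews_pos_wc = []
for review in reviews_upper:
    word_counts = Counter(review)      # count each distinct word only once
    pos_words = sum(count for word, count in word_counts.items()
                    if word in pos_word_set)
    reviews_pos_wc.append(pos_words)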
Does that give you some ideas for experimentation? I'm hopeful that my first paragraph proves to be unrealistically pessimistic (i.e. that something does provide a solution, especially the lexicon set).
I need to generate a large amount of data, mostly of basic data types, for stress testing NoSQL databases (Cassandra right now, maybe others in the future). I also need to be able to re-create this randomly created data in the future and, more problematically, retrieve random entries from this already generated data to build queries.
Re-creating the data itself poses no problem: just provide the same seed to the random number generator. The hard part is retrieving a random item from the generated data. The obvious way would be to store all of it in a data structure, but we are talking about potentially GBs of data, so this should not be an option (or am I wrong here?).
The random re-generation of previously generated items should also be as fast as possible, synchronisable across different threads, and ideally provide a way to specify the underlying distribution for both the generated data and the selection of test data items.
[edit] I just found out that the random.jumpahead(n) function might come in handy; the only problem is that it does not work with the pseudorandom number generator (PRNG) used since Python 2.3. But the old one is still available (random.WichmannHill()), where I could just "jump ahead" n steps from my initial seed.
And just to be clear: I'm using python 2.7.
[edit2] What this question might boil down to is skipping n generation steps. You can do it with the default PRNG with some code like I found here:
import random

def skip(n):
    for _ in xrange(n):
        random.random()
But, as stated in the source and confirmed by my own testing, this is only efficient for n < ~100,000, which is way too small. Using random.WichmannHill() I can use jumpahead(n) for any n with the same performance.
If you already know 1) the number of entries you will be generating, and 2) the number of random entries you need from that data, you could just obtain the random entries as you are generating them, storing only those in a data structure.
Say you need to create a million entries for your NoSQL database, and you know you'll want to grab 100 random items out of there to test queries. You can generate 100 random numbers between 1 and 1,000,000, and as you're generating the entries for your stress test, you can take the entries that match up with your randomly-generated numbers and store those specific ones in a data structure. Alternately, you can just save a randomly generated entry to your data structure with some probability m/n, where m is the number of random test queries you need, and n is the total volume of data you're creating.
Basically, it's going to be much easier to obtain the random data while it's being generated than to store everything and pluck data randomly from there. As for how to generate the data, that's going to probably be dependent on your NoSQL implementation and the specific data format you want to use.
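A sketch of the first approach (generate_entry is a stand-in for whatever your real generator does; swap range for xrange on Python 2 if memory matters):

import random

SEED = 12345                 # fixed seed so the same data can be re-created later
TOTAL_ENTRIES = 1000000
SAMPLE_SIZE = 100

# separate generators so the data stream stays identical no matter how many
# test entries you decide to sample
data_rng = random.Random(SEED)
sample_rng = random.Random(SEED + 1)

def generate_entry(r):
    # placeholder for your real row generator
    return (r.randint(0, 10 ** 9), r.random())

# decide up front which generated entries will double as test queries
sample_indices = set(sample_rng.sample(range(TOTAL_ENTRIES), SAMPLE_SIZE))

test_entries = []
for i in range(TOTAL_ENTRIES):
    entry = generate_entry(data_rng)
    # ... insert entry into Cassandra here ...
    if i in sample_indices:
        test_entries.append(entry)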
EDIT: as dcorking pointed out, you don't even need to store the test items themselves in a data structure. You can just execute them as they show up while you're generating data. All you would need to store is the sequence that determines which tests get run. Or, if you don't want to run the same tests every time, you can just randomly select certain elements to be your test elements as I mentioned above, and store nothing at all.
There's a great deal of information I can find on hashing strings for obfuscation or lookup tables, where collision avoidance is a primary concern. I'm trying to put together a hashing function for the purpose of load balancing, where I want to fit an unknown set of strings into an arbitrarily small number of buckets with a relatively even distribution. Collisions are expected (desired, even).
My immediate use case is load distribution in an application, where I want each instance of the application to fire at a different time of the half-hour, without needing any state information about other instances. So I'm trying to hash strings into integer values from 0 to 29. However, the general approach has wider application with different int ranges for different purposes.
Can anyone make suggestions, or point me to docs that would cover this little corner of hash generation?
My language of choice for this is Python, but I can read most common languages so anything should be applicable.
You might consider something simple, like the adler32() algorithm, and just take the result mod the bucket count.
import zlib

buf = 'arbitrary and unknown string'
# adler32 expects bytes in Python 3, so encode the string first
bucket = zlib.adler32(buf.encode('utf-8')) % 30
# at this point bucket is in the range 0 - 29