I've been learning about HashMaps and their best practices. One of the things that I stumbled upon was collision resolution.
The methods include:
Direct Chaining,
Linear Probing,
Quadratic Probing,
Double Hashing.
So far I've found direct chaining much easier to implement, and it makes the most sense to me. I'm not sure which I should focus on to be prepared for technical interviews.
For technical interviews, I'd suggest getting a high level understanding of the pros/cons of these approaches - specifically:
direct chaining degrades gracefully as the load factor increases, whereas the closed-hashing / open-addressing approaches (all the others you list) degrade sharply as the load factor approaches 1.0, because it gets harder and harder to find an empty bucket
linear probing can be CPU-cache friendly with small keys (compared to any of the other techniques): if several keys fit on the same cache line, the CPU is likely to spend less time groping around in memory after collisions (and SIMD instructions can sometimes compare against several buckets' keys concurrently)
linear probing combined with identity hashing can produce lower-than-cryptographic-hashing collision rates for keys that happen to have a nice distribution across the buckets, such as ids that mostly increment but have the occasional gap
linear probing is much more prone to clusters of collisions than quadratic probing, especially with poor hash functions / difficult-to-hash-well keys and higher load factors (ballpark >= 0.8), as collisions in one part of the table (even if caused by chance rather than a flawed hash) tend to exacerbate future use of that part of the table
quadratic probing's first couple of bucket offsets may still fall on the same cache line, so there's a useful probability of the second and even third probe landing on the same cache line as the first, but after a few failures the jumps get larger, which reduces clustering at the expense of more cache misses
double hashing is a bit of a compromise: if the second hash happens to produce 1 it's equivalent to linear probing, but you might instead try every 2nd bucket, or every 3rd, etc., up to some limit. There's still plenty of room for clustering (e.g. if h2(k) returned 6 for one key, and 3 for another key that hashed to a bucket 3 further into the table than the first, they'll visit many of the same buckets during their searches)
I wouldn't recommend focusing on any one of them in too much depth, or ignoring any of them, because the contrasts between them reinforce your understanding of each.
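If it helps to see the first contrast concretely, here is a minimal sketch of insert/lookup under separate chaining versus linear probing (fixed table size, no resizing or deletion; the class names are just illustrative):

    # Minimal sketch: separate chaining vs. linear probing (no resizing, no deletion).

    class ChainedMap:
        def __init__(self, capacity=8):
            self.buckets = [[] for _ in range(capacity)]

        def put(self, key, value):
            bucket = self.buckets[hash(key) % len(self.buckets)]
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)   # overwrite existing key
                    return
            bucket.append((key, value))        # a collision just extends the chain

        def get(self, key):
            bucket = self.buckets[hash(key) % len(self.buckets)]
            for k, v in bucket:
                if k == key:
                    return v
            raise KeyError(key)

    class LinearProbingMap:
        def __init__(self, capacity=8):
            self.slots = [None] * capacity     # keep the load factor well below 1.0

        def put(self, key, value):
            i = hash(key) % len(self.slots)
            for _ in range(len(self.slots)):
                if self.slots[i] is None or self.slots[i][0] == key:
                    self.slots[i] = (key, value)
                    return
                i = (i + 1) % len(self.slots)  # probe the next bucket
            raise RuntimeError("table full - a real implementation would resize")

        def get(self, key):
            i = hash(key) % len(self.slots)
            for _ in range(len(self.slots)):
                if self.slots[i] is None:
                    raise KeyError(key)        # hit an empty slot: key is absent
                if self.slots[i][0] == key:
                    return self.slots[i][1]
                i = (i + 1) % len(self.slots)
            raise KeyError(key)

Notice how chaining simply grows a bucket's list under collisions, while the probing version has to walk the table looking for an empty slot, which is exactly why it suffers as the load factor approaches 1.0.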
I've been playing around with GEKKO for solving flow optimizations and I have come across behavior that is confusing me.
Context:
Sources --> [mixing and delivery] --> Sinks
I have multiple sources (where my flow is coming from) and multiple sinks (where my flow goes to). For a given source (e.g., SOURCE_1), the total flow to the resulting sinks must equal the volume from SOURCE_1. This is my idea of conservation of mass, where the 'mixing' plant blends all the source volumes together.
Constraint Example (DOES NOT WORK AS INTENDED):
When I try to create a constraint for the two SINK volumes, and the one SOURCE volume:
m.Equation(volume_sink_1[i] + volume_sink_2[i] == max_volumes_for_source_1)
I end up with weird results. By that I mean it's not actually optimal; it assigns values very poorly, and I am off from the optimum by at least 10% (I tried with different max volumes).
Constraint Example (WORKS BUT I DON'T GET WHY):
When I try to create a constraint for the two SINK volumes, and the one SOURCE volume like this:
m.Equation(volume_sink_1[i] + volume_sink_2[i] <= max_volumes_for_source_1 * 0.999999)
With this, I get MUCH closer to the actual optimum, to the point where I can just treat it as the optimum. Please note that I had to change it to a less-than-or-equal and also multiply by 0.999999, which I arrived at purely by trial and error.
Also, please note that this uses practically all of the source (up to 99.9999% of it) as I would expect. So both formulations make sense to me but the first approach doesn't work.
The only thing I can think of for this behavior is that it's stricter to solve for == than <=. That doesn't explain to me why I have to multiply by 0.999999 though.
Why is this the case? Also, is there a way for me to debug occurrences like this easier?
This same improvement occurs with complementary constraints for conditional statements when using s1*s2<=0 (easier to solve) versus s1*s2==0 (harder to solve).
From the research papers I've seen, the justification is that the solver has more room to search for the optimal solution, even if it always ends up at s1*s2==0. It also sounds like your problem may have multiple local minima, if it converges to a solution that isn't the global optimum.
If you can post a complete and minimal problem that demonstrates the issue, we can give more specific suggestions.
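In the meantime, here is a toy sketch of how the two formulations differ in GEKKO (made-up numbers and a made-up objective, not your model), purely as a shape to experiment with:

    from gekko import GEKKO

    # Toy sketch only: one source feeding two sinks, hypothetical capacity and objective.
    max_volume_source_1 = 100.0          # hypothetical source capacity

    m = GEKKO(remote=False)
    volume_sink_1 = m.Var(value=0, lb=0, ub=max_volume_source_1)
    volume_sink_2 = m.Var(value=0, lb=0, ub=max_volume_source_1)

    # Formulation A: hard mass-balance equality (the one that misbehaved)
    # m.Equation(volume_sink_1 + volume_sink_2 == max_volume_source_1)

    # Formulation B: relaxed inequality (the one that behaved better)
    m.Equation(volume_sink_1 + volume_sink_2 <= max_volume_source_1)

    # Hypothetical objective: pretend sink 2 is worth twice as much per unit of flow
    m.Obj(-(2.0 * volume_sink_2 + volume_sink_1))   # m.Obj() minimizes, so negate to maximize

    m.options.SOLVER = 3                 # IPOPT
    m.solve(disp=False)
    print(volume_sink_1.value[0], volume_sink_2.value[0])

With the relaxed inequality and an objective that rewards delivering flow, the solver will typically push the sum up against the bound anyway, which matches the behaviour you describe.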
Cross posted from csexchange:
Most versions of simulated annealing I've seen are implemented similar to what is outlined in the wikipedia pseudocode below:
Let s = s0
For k = 0 through kmax (exclusive):
    T ← temperature( 1 - (k+1)/kmax )
    Pick a random neighbour, snew ← neighbour(s)
    If P(E(s), E(snew), T) ≥ random(0, 1):
        s ← snew
Output: the final state s
I am having trouble understanding how this algorithm does not get stuck in a local optimum as the temperature cools. If we jump around at the start while the temperature is high, and eventually only take uphill moves as it cools, then isn't the solution found highly dependent on where we happened to end up in the search space as the temperature started to cool? We may have found a better solution early on, jumped off of it while the temperature was high, and then be left in a worse position as the temperature cools and we transition to hill climbing.
An often listed modification to this approach is to keep track of the best solution found so far. I see how this change mitigates the risk of "throwing away" a better solution found in the exploratory stage when the temp is high, but I don't see how this is any better than simply performing repeated random hill-climbing to sample the space, without the temperature theatrics.
Another approach that comes to mind is to combine the ideas of keeping track of the "best so far" with repeated hill climbing and beam search. For each temperature, we could perform simulated annealing and track the best 'n' solutions. Then for the next temperature, start from each of those local peaks.
UPDATE:
It was rightly pointed out that I didn't really ask a specific question; I've updated in the comments but want to reflect it here:
I don't need a transcription of the pseudo-code into essay form, I understand how it works - how the temperature balances exploration vs exploitation, and how this (theoretically) helps avoid local optima.
My specific questions are:
Without the modification to keep track of the global best solution, wouldn't this algorithm be extremely prone to local optima, despite the temperature component? Yes, I understand the probability of taking a worse move declines as the temperature cools (i.e. it transitions to pure hill-climbing), but it's entirely possible that you found a better solution early on in the exploration portion (temp hot), jumped off of it, and then, as the temperature cools, have no way back because you're on a path that leads to a new local peak.
With the addition of tracking the global optimum, I can absolutely see how this mitigates getting stuck at local peaks, avoiding the problem described above. But how does this improve upon simple random search, one specific example being repeated random hill climbing? If you're tracking the global optimum and you happen to hit it while in the high-temperature portion, then that's essentially random search.
In what circumstances would this algorithm be preferable to something like repeated random hill-climbing, and why? What properties does a problem have that make it particularly suited to SA versus an alternative?
By my understanding, simulated annealing is not guaranteed to avoid getting stuck in a local maximum (for maximization problems), especially as it "cools" later in the cycle, as k -> kmax.
So as you say, we "jump all over the place" at the start, but we are still selective in whether or not we accept that jump, as the P() function that determines the probability of acceptance is a function of the goal, E(). Later in the same wikipedia article, they describe the P() function a bit, and suggest that if e_new > e, then perhaps P=1, always taking the move if it is an improvement, but perhaps not always if it isn't an improvement. Later in the cycle, we aren't as willing to take random leaps to lesser results, so the algorithm tends to settle into a max (or min), which may or may not be the global.
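Written in the minimisation convention of the Wikipedia pseudocode, the acceptance rule plus the cheap best-so-far modification look roughly like this (the energy, neighbour and temperature callables are placeholders you would supply):

    import math
    import random

    def simulated_annealing(s0, energy, neighbour, kmax, temperature):
        """Minimise energy(s). `neighbour` proposes a random move; `temperature`
        maps the fraction of the schedule remaining to a temperature."""
        s, e = s0, energy(s0)
        best_s, best_e = s, e                      # the cheap "global best" modification
        for k in range(kmax):
            T = temperature(1 - (k + 1) / kmax)
            s_new = neighbour(s)
            e_new = energy(s_new)
            # Metropolis-style acceptance: always take improvements,
            # take worsenings with probability exp(-(e_new - e) / T).
            if e_new < e or random.random() < math.exp(-(e_new - e) / max(T, 1e-12)):
                s, e = s_new, e_new
            if e < best_e:
                best_s, best_e = s, e
        return best_s, best_e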
The challenge in global function optimization is to obtain good efficiency, i.e. to find the optimal solution, or one close to the optimum, with as few function evaluations as possible. So many "we could perform..." ideas are correct in terms of convergence to a good solution, but may fail to be efficient.
Now the main difference between pure climbing and annealing is that annealing temporarily admits a worsening of the objective to avoid... getting trapped in a local optimum. Hill climbing only searches the subspace where the function takes larger values than the best so far, which may fail to be connected to the global optimum. Annealing adds some "tunneling" effect.
Temperature decrease is not the key ingredient here; by the way, hill-climbing methods such as the simplex or Hooke-Jeeves pattern searches also work with diminishing perturbations. Temperature schemes are more related to a multiscale decomposition of the target function.
The question as asked doesn't seem very coherent; it asks about guarantees offered by the vanilla simulated annealing algorithm, then proceeds to propose multiple modifications with little to no justification.
In any case, I will try to be coherent in this answer.
The only rigorous convergence proof for simulated annealing that I know of is described as follows:
The concept of using an adaptive step-size has been essential to the development of better annealing algorithms. The Classical Simulated Annealing (CSA) was the first annealing algorithm with a rigorous mathematical proof for its global convergence (Geman and Geman, 1984). It was proven to converge if a Gaussian distribution is used for g_XY(T_k), coupled with an annealing schedule S(T_k) that decreases no faster than T = T_0/(log k). The integer k is a counter for the external annealing loop, as will be detailed in the next section. However, a logarithmic decreasing schedule is considered to be too slow, and for many problems the number of iterations required is considered as "overkill" (Ingber, 1993).
The proof can be found in:
Geman, Stuart, and Donald Geman. "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images." IEEE Transactions on Pattern Analysis and Machine Intelligence 6 (1984): 721-741.
In terms of successful modifications of SA: if you peruse the literature you will find many empirical claims of superiority, but few proofs. Having studied this problem in the past, I can say that of the various modified versions of SA I tried empirically on the problems I was interested in at the time, the only modification that consistently gave better results than classical simulated annealing is known as parallel tempering.
Parallel tempering is the modification of SA whereby, instead of mixing a single Markov chain (i.e. the annealing loop), multiple chains are mixed concurrently at very specific temperatures, with exchanges between chains occurring on a dynamic or static schedule.
Original paper on parallel tempering:
Swendsen, Robert H., and Jian-Sheng Wang. "Replica Monte Carlo simulation of spin-glasses." Physical review letters 57.21 (1986): 2607.
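For intuition only, a toy sketch of the swap (replica-exchange) step might look like the following; real implementations choose the temperature ladder and swap schedule far more carefully, and actually run the chains in parallel:

    import math
    import random

    def metropolis_step(s, e, energy, neighbour, T):
        # One Metropolis move of a single chain at temperature T (minimisation).
        s_new = neighbour(s)
        e_new = energy(s_new)
        if e_new < e or random.random() < math.exp(-(e_new - e) / T):
            return s_new, e_new
        return s, e

    def parallel_tempering(s0, energy, neighbour, temps, sweeps, swap_every=10):
        # temps sorted coldest first; one chain per temperature.
        # Sequential here for clarity; a real implementation runs chains concurrently.
        states = [(s0, energy(s0)) for _ in temps]
        for sweep in range(sweeps):
            states = [metropolis_step(s, e, energy, neighbour, T)
                      for (s, e), T in zip(states, temps)]
            if sweep % swap_every == 0:
                for i in range(len(temps) - 1):
                    ei, ej = states[i][1], states[i + 1][1]
                    # Replica-exchange acceptance between adjacent temperatures
                    delta = (1.0 / temps[i] - 1.0 / temps[i + 1]) * (ei - ej)
                    if random.random() < math.exp(min(0.0, delta)):
                        states[i], states[i + 1] = states[i + 1], states[i]
        return min(states, key=lambda se: se[1])   # lowest-energy state across chains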
There are various free and proprietary implementations of parallel tempering available due to its effectiveness; I would highly recommend starting with a high-quality implementation instead of rolling your own.
Note also that parallel tempering is used extensively throughout the physical sciences for modelling purposes due to its efficacy, but for some reason has very little representation in other areas of the literature (such as computer science), which baffles me somewhat. Perhaps computer scientists' general disinterest in SA modifications stems from a belief that SA is unlikely to be improved upon.
I'm looking for an algorithm with the fastest time per query for a problem similar to nearest-neighbor search, but with two differences:
I need to only approximately confirm (tolerating Type I and Type II error) the existence of a neighbor within some distance k or return the approximate distance of the nearest neighbor.
I can query many at once
I'd like better throughput than the approximate nearest neighbor libraries out there (https://github.com/erikbern/ann-benchmarks), which seem better designed for single queries. In particular, the algorithmic relaxation in the first criterion seems like it should leave room for a shortcut, but I can't find any solutions in the literature, nor can I figure out how to design one.
Here's my current best solution, which operates at about 10k queries/sec per CPU. I'm looking for something close to an order-of-magnitude speedup if possible.
import annoy
import numpy as np

vector_size = 128  # placeholder dimensionality; the original value isn't specified

sample_vectors = np.random.randint(low=0, high=2, size=(10000, vector_size))
new_vectors = np.random.randint(low=0, high=2, size=(100000, vector_size))

ann = annoy.AnnoyIndex(vector_size, metric='hamming')
for i, v in enumerate(sample_vectors):
    ann.add_item(i, v)
ann.build(20)

for v in new_vectors:
    print(ann.get_nns_by_vector(v, n=1, include_distances=True))
I'm a bit skeptical of benchmarks such as the one you have linked, as in my experience the definition of the problem at hand far outweighs in importance the merits of any one algorithm measured across a set of other (possibly similar-looking) problems.
More simply put, an algorithm being a high performer on a given benchmark does not imply it will be a higher performer on the problem you care about. Even small or apparently trivial changes to the formulation of your problem can significantly change the performance of any fixed set of algorithms.
That said, given the specifics of the problem you care about I would recommend the following:
use the cascading approach described in the paper [1]
use SIMD operations (either SSE on Intel chips, or GPUs) to accelerate; the nearest-neighbour problem is one where operations closer to the metal and parallelism can really shine
tune the parameters of the algorithm to maximize your objective; in particular, the algorithm of [1] has a few easy-to-tune parameters which dramatically trade performance for accuracy, so make sure you perform a grid search over them to find the sweet spot for your problem (a generic grid-search skeleton is sketched after the note below)
Note: I have recommended the paper [1] because I have tried many of the algorithms listed in the benchmark you linked and found them all inferior (for the task of image reconstruction) to the approach in [1], while at the same time being much more complicated than [1]; both are undesirable properties. YMMV depending on your problem definition.
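As a loose illustration of the grid-search suggestion, here is a generic skeleton using annoy's n_trees/search_k knobs as stand-ins for whatever parameters your chosen algorithm exposes; a real sweep would also measure recall against exact neighbours, not just throughput:

    import time
    import numpy as np
    import annoy

    # Illustrative sizes and parameter values only.
    vector_size = 128
    data = np.random.randint(0, 2, size=(10000, vector_size))
    queries = np.random.randint(0, 2, size=(1000, vector_size))

    results = []
    for n_trees in [5, 10, 20]:
        index = annoy.AnnoyIndex(vector_size, metric='hamming')
        for i, v in enumerate(data):
            index.add_item(i, v)
        index.build(n_trees)
        for search_k in [100, 1000, 10000]:
            start = time.perf_counter()
            for q in queries:
                index.get_nns_by_vector(q, 1, search_k=search_k)
            qps = len(queries) / (time.perf_counter() - start)
            # A real sweep would also record accuracy here, since these
            # knobs trade accuracy for speed.
            results.append((n_trees, search_k, qps))

    for n_trees, search_k, qps in sorted(results, key=lambda r: -r[2]):
        print(f"n_trees={n_trees:3d}  search_k={search_k:6d}  ->  {qps:,.0f} queries/sec")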
I appreciate the answers, they gave me some ideas, but I'll answer my own question, as I found a solution that mostly resolves it, and maybe it will help someone else in the future.
I used one of the libraries linked in the benchmarks, hnswlib; it not only has slightly better performance than annoy, it also has a bulk-query option. Hnswlib's algorithm also allows a highly flexible performance/accuracy tradeoff in favor of performance, which suits the highly error-tolerant approximate checking I want to do. Further, even though the parallelization improvements are far from linear per core, they are still something. The above factors combined for a ~5x speedup in my particular situation.
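For reference, the bulk-query path looks roughly like this (illustrative, untuned parameters; hnswlib exposes no Hamming space, but for 0/1 vectors squared L2 distance equals Hamming distance):

    import numpy as np
    import hnswlib

    vector_size = 128                                   # illustrative dimensionality
    sample_vectors = np.random.randint(0, 2, size=(10000, vector_size)).astype(np.float32)
    new_vectors = np.random.randint(0, 2, size=(100000, vector_size)).astype(np.float32)

    # 'l2' works as a stand-in for Hamming on binary vectors.
    index = hnswlib.Index(space='l2', dim=vector_size)
    index.init_index(max_elements=len(sample_vectors), ef_construction=100, M=16)
    index.add_items(sample_vectors, np.arange(len(sample_vectors)))
    index.set_ef(20)                                    # lower ef -> faster, less accurate

    # Bulk query: one call for all query vectors, optionally multi-threaded.
    labels, distances = index.knn_query(new_vectors, k=1, num_threads=-1)
    within_k = distances[:, 0] <= 10                    # approximate "neighbour within k?" check (k=10 is illustrative)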
As ldog said, your mileage may vary depending on your problem statement.
I've run into an interesting problem, where I need to make a many-to-many hash with a minimized number of entries. I'm working in python, so that comes in the form of a dictionary, but this problem would be equally applicable in any language.
The data initially comes in as one key paired with one value per input record (each record representing one link in the many-to-many relationship).
So like:
A-1, B-1, B-2, B-3, C-2, C-3
A simple way of handling the data would be linking them one to many:
A: 1
B: 1,2,3
C: 2,3
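For concreteness, that baseline one-to-many table is easy to build with a dictionary of sets, for example:

    from collections import defaultdict

    pairs = [("A", 1), ("B", 1), ("B", 2), ("B", 3), ("C", 2), ("C", 3)]

    one_to_many = defaultdict(set)
    for key, value in pairs:
        one_to_many[key].add(value)
    # {'A': {1}, 'B': {1, 2, 3}, 'C': {2, 3}}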
However the number of entries is the primary computational cost for a later process, as a file will need to be generated and sent over the internet for each entry (that is a whole other story), and there would most likely be thousands of entries in the one-to-many implementation.
Thus a more optimized hash would be:
[A, B]: 1
[B, C]: 2,3
This table would be discarded after use, so maintainability is not a concern; the only concern is the time complexity of reducing the entries (the time the algorithm spends reducing entries must not exceed the time it saves relative to the baseline one-to-many table).
Now, I'm pretty sure that at least someone has faced this problem before; it seems like a problem straight out of my Algorithms class in college. However, I'm having trouble finding applicable algorithms, as I can't find the right search terms. I'm about to take a crack at making an algorithm for this from scratch, but I figured it wouldn't hurt to ask around to see if people can identify this as a problem commonly solved by a modified [insert well-known algorithm here].
I personally think it's best to start by creating a one-to-many hash and then examining subsets of the values in each entry, creating an entry in the solution hash for the maximum identified set of shared values. But I'm unsure how to guarantee a smaller number of subsets than just the one-to-many baseline implementation.
Let's go back to your unoptimised dictionary of letters to sets of numbers:
A: 1
B: 1,2,3
C: 2,3
There's a - in this case 2-branch - tree of refactoring steps you could do:
                     A:1  B:1,2,3  C:2,3
                    /                   \
        factor using set 2,3        factor using set 1
                  /                          \
       A:1  B:1  B,C:2,3          A,B:1  B:2,3  C:2,3
                  |                          |
        factor using set 1          factor using set 2,3
                  |                          |
       A,B:1  B,C:2,3              A,B:1  B,C:2,3
In this case at least, you arrive at the same result regardless of which factoring you do first, but I'm not sure if that would always be the case.
Doing an exhaustive exploration of the tree sounds expensive, so you might want to avoid that, but if we could pick the optimal path, or at least a likely-good path, it would be relatively cheap computationally. Rather than branching at random, my gut instinct is that it would be faster and closer to optimal if you made the largest-set factoring change possible at each point in the tree. For example, considering the two-branch tree above, you'd prefer the initial 2,3 factoring over the initial 1 factoring, because 2,3 is the larger set (size two). More dramatic refactorings suggest fewer refactorings will be needed before you reach a stable result.
What that amounts to is iterating the sets from largest towards smallest (it doesn't matter which order you iterate over same-length sets), looking for refactoring opportunities.
Much like a bubble-sort, after each refactoring the approach would be "I made a change; it's not stable yet; let's repeat". Restart by iterating from second-longest sets towards shortest sets, checking for optimisation opportunities as you go.
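If it helps, here's a rough sketch of that greedy loop (an illustrative, untuned implementation of the idea above, with frozensets of key-groups mapping to sets of values):

    def factor_once(table):
        # table: dict mapping frozenset-of-keys -> set-of-values.
        # If one entry's value set is a subset of another's, pull the shared set
        # out and merge the key groups. Returns True if a change was made.
        entries = sorted(table.items(), key=lambda kv: len(kv[1]), reverse=True)
        for keys_a, vals_a in entries:
            for keys_b, vals_b in entries:
                if keys_a != keys_b and vals_a <= vals_b:
                    del table[keys_a]
                    del table[keys_b]
                    merged = keys_a | keys_b
                    table[merged] = table.get(merged, set()) | set(vals_a)
                    remainder = vals_b - vals_a
                    if remainder:
                        table[keys_b] = remainder
                    return True
        return False

    def reduce_entries(one_to_many):
        table = {frozenset([k]): set(v) for k, v in one_to_many.items()}
        while factor_once(table):       # repeat until no further factoring is possible
            pass
        return table

    print(reduce_entries({'A': {1}, 'B': {1, 2, 3}, 'C': {2, 3}}))
    # -> two entries: {B, C} -> {2, 3} and {A, B} -> {1}

Each factoring step preserves exactly which key-value links are covered, and strictly shrinks the total number of stored values, so the loop always terminates; it is quadratic in the number of entries per pass, which may or may not be cheap enough for your data.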
(I'm not sure about Python specifically, but in general set comparisons can be expensive. You might want to maintain a value for each set that is the XOR of the hashed values of its elements; that's easy and cheap to update when a few elements change, and a trivial comparison can tell you two large sets are unequal, saving comparison time. It won't tell you when sets are equal, though: multiple sets could have the same XOR-of-hashes value.)
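A minimal sketch of that XOR-fingerprint idea (a cheap pre-filter; equal fingerprints still need a full comparison):

    class FingerprintedSet:
        """A set plus an XOR-of-hashes fingerprint for cheap inequality checks."""
        def __init__(self, items=()):
            self.items = set(items)
            self.fingerprint = 0
            for x in self.items:
                self.fingerprint ^= hash(x)

        def add(self, x):
            if x not in self.items:
                self.items.add(x)
                self.fingerprint ^= hash(x)      # cheap incremental update

        def discard(self, x):
            if x in self.items:
                self.items.discard(x)
                self.fingerprint ^= hash(x)      # XOR is its own inverse

        def maybe_equal(self, other):
            # Different fingerprints -> definitely different sets.
            # Same fingerprint -> *might* be equal; fall back to a full comparison.
            return self.fingerprint == other.fingerprint

    a = FingerprintedSet([1, 2, 3])
    b = FingerprintedSet([2, 3])
    if not a.maybe_equal(b):
        print("definitely different (no full comparison needed)")
    elif a.items == b.items:
        print("equal")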
I would like to compare different methods of finding roots of functions in Python (like Newton's method or other simple calculus-based methods). I don't think I will have too much trouble writing the algorithms.
What would be a good way to make the actual comparison? I read up a little bit about Big-O. Would this be the way to go?
The answer from #sarnold is right -- it doesn't make sense to do a Big-Oh analysis.
The principal differences between root finding algorithms are:
rate of convergence (number of iterations)
computational effort per iteration
what is required as input (e.g. do you need to know the first derivative, do you need to set lo/hi limits for bisection, etc.)
what functions it works well on (e.g. works fine on polynomials but fails on functions with poles)
what assumptions it makes about the function (e.g. a continuous first derivative, or being analytic, etc.)
how simple the method is to implement
I think you will find that each of the methods has some good qualities, some bad qualities, and a set of situations where it is the most appropriate choice.
Big O notation is ideal for expressing the asymptotic behavior of algorithms as the inputs to the algorithms "increase". This is probably not a great measure for root finding algorithms.
Instead, I would think the number of iterations required to bring the actual error below some epsilon ε would be a better measure. Another measure would be the number of iterations required to bring the difference between successive iterates below some epsilon ε. (The difference between successive iterates is probably the better choice if you don't have exact root values at hand for your inputs. You would use a criterion such as successive differences to decide when to terminate your root finders in practice, so you could, or should, use it here too.)
While you can characterize the number of iterations required for different algorithms by the ratios between them (one algorithm may take roughly ten times more iterations to reach the same precision as another), there often isn't "growth" in the iterations as inputs change.
Of course, if your algorithms take more iterations with "larger" inputs, then Big O notation makes sense.
Big-O notation is designed to describe how an algorithm behaves in the limit, as n goes to infinity. This is much easier to work with in a theoretical study than in a practical experiment. I would pick things to study that you can easily measure and that people care about, such as accuracy and computer resources (time/memory) consumed.
When you write and run a computer program to compare two algorithms, you are performing a scientific experiment, just like somebody who measures the speed of light, or somebody who compares the death rates of smokers and non-smokers, and many of the same factors apply.
Try to choose an example problem, or problems, to solve that is representative, or at least interesting to you, because your results may not generalise to situations you have not actually tested. You may be able to increase the range of situations to which your results apply if you sample at random from a large set of possible problems and find that all your random samples behave in much the same way, or at least follow much the same trend. You can get unexpected results even when the theoretical studies predict a nice n log n trend, because theoretical studies rarely account for suddenly running out of cache or memory, or usually even for things like integer overflow.
Be alert for sources of error, and try to minimise them, or have them apply to the same extent to all the things you are comparing. Of course you want to use exactly the same input data for all of the algorithms you are testing. Make multiple runs of each algorithm and check how variable the results are; perhaps a few runs are slower because the computer was doing something else at the time. Be aware that caching may make later runs of an algorithm faster, especially if you run them immediately after each other. Which time you want depends on what you decide you are measuring. If you have a lot of I/O to do, remember that modern operating systems and computers cache huge amounts of disk I/O in memory. I once ended up powering the computer off and on again after every run, as that was the only way I could find to be sure the device I/O cache was flushed.
You can get wildly different answers for the same problem just by changing starting points. Pick an initial guess that's close to the root and Newton's method will give you a result that converges quadratically. Choose another in a different part of the problem space and the root finder will diverge wildly.
What does this say about the algorithm? Good or bad?
I would suggest you have a look at the following Python root-finding demo.
It is simple code, with several different methods and comparisons between them (in terms of rate of convergence).
http://www.math-cs.gordon.edu/courses/mat342/python/findroot.py
I just finished a project comparing the bisection, Newton, and secant root-finding methods. Since this is a practical case, I don't think you need to use Big-O notation; Big-O is more suitable for an asymptotic view. What you can do is compare them in terms of:
Speed - for example, Newton's method is the fastest when the conditions are favourable
Number of iterations - for example, bisection takes the most iterations
Accuracy - how often it converges to the right root when there is more than one root, or whether it converges at all
Input - what information it needs to get started; for example, Newton needs an x0 near the root in order to converge, and it also needs the first derivative, which is not always easy to find
Other - rounding errors
For the sake of visualization, you can store the value of each iteration in arrays and plot them. Use a function whose roots you already know.
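For instance, a minimal sketch of that kind of comparison, counting iterations of Newton's method and bisection on x*x - 2 (whose root, sqrt(2), is known):

    import math

    def newton(f, df, x0, tol=1e-12, max_iter=100):
        """Newton's method; returns (root_estimate, iterations_used)."""
        x = x0
        for i in range(1, max_iter + 1):
            x_new = x - f(x) / df(x)
            if abs(x_new - x) < tol:
                return x_new, i
            x = x_new
        return x, max_iter

    def bisection(f, lo, hi, tol=1e-12, max_iter=200):
        """Bisection; assumes f(lo) and f(hi) have opposite signs."""
        for i in range(1, max_iter + 1):
            mid = (lo + hi) / 2
            if hi - lo < tol:
                return mid, i
            if f(lo) * f(mid) <= 0:
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2, max_iter

    f = lambda x: x * x - 2
    df = lambda x: 2 * x

    print("newton   :", newton(f, df, x0=1.0))        # converges in a handful of iterations
    print("bisection:", bisection(f, lo=0.0, hi=2.0)) # needs ~40 halvings for the same tolerance
    print("true root:", math.sqrt(2))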
Although this is a very old post, my 2 cents :)
Once you've decided on the method you'll use to compare them (your "evaluation protocol", so to speak), you might be interested in ways to run your challengers on actual datasets.
This tutorial explains how to do it, based on an example (comparing polynomial fitting algorithms on several datasets).
(I'm the author, feel free to provide feedback on the github page!)