Minimizing entries in a many to many hashtable - python

I've run into an interesting problem, where I need to make a many-to-many hash with a minimized number of entries. I'm working in python, so that comes in the form of a dictionary, but this problem would be equally applicable in any language.
The data initially comes in as input of one key to one entry (representing one link in the many-to-many relationship).
So like:
A-1, B-1, B-2, B-3, C-2, C-3
A simple way of handling the data would be linking them one to many:
A: 1
B: 1,2,3
C: 2,3
However the number of entries is the primary computational cost for a later process, as a file will need to be generated and sent over the internet for each entry (that is a whole other story), and there would most likely be thousands of entries in the one-to-many implementation.
Thus a more optimized hash would be:
[A, B]: 1
[B, C]: 2,3
This table would be discarded after use, so maintainability is not a concern; the only concern is the time-complexity of reducing the entries (the time the algorithm takes to reduce the entries must not exceed the time it saves relative to the baseline one-to-many table).
Now, I'm pretty sure that at least someone has faced this problem, this seems like a problem straight out of my Algorithms class in college. However, I'm having trouble finding applicable algorithms, as I can't find the right search terms. I'm about to take a crack at making an algorithm for this from scratch, but I figured it wouldn't hurt to ask around to see if people can't identify this as a problem commonly solved by a modified [insert well-known algorithm here].
I personally think it's best to start by creating a one-to-many hash and then examining subsets of the values in each entry, creating an entry in the solution hash for the maximum identified set of shared values. But I'm unsure how to guarantee a smaller number of subsets than just the one-to-many baseline implementation.
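For concreteness, the one-to-many baseline is easy to build, and for the example above, inverting it (value to key-set) and then grouping values with identical key-sets happens to reproduce the optimized table. I suspect a greedy grouping like this isn't guaranteed minimal in general, but it's a cheap starting point:

```python
from collections import defaultdict

def minimize_entries(pairs):
    # Invert the input links: map each value to the set of keys linked to it.
    value_to_keys = defaultdict(set)
    for key, value in pairs:
        value_to_keys[value].add(key)
    # Group values whose key-sets are identical into a single entry.
    grouped = defaultdict(set)
    for value, keys in value_to_keys.items():
        grouped[frozenset(keys)].add(value)
    return dict(grouped)

pairs = [("A", 1), ("B", 1), ("B", 2), ("B", 3), ("C", 2), ("C", 3)]
# minimize_entries(pairs) -> {frozenset({'A','B'}): {1}, frozenset({'B','C'}): {2, 3}}
```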

Let's go back to your unoptimised dictionary of letters to sets of numbers:
A: 1
B: 1,2,3
C: 2,3
There's a - in this case 2-branch - tree of refactoring steps you could do:
A:1  B:1,2,3  C:2,3
|- factor using set 2,3 -> A:1  B:1  B,C:2,3
|    `- factor using set 1 -> A,B:1  B,C:2,3
`- factor using set 1 -> A,B:1  B:2,3  C:2,3
     `- factor using set 2,3 -> A,B:1  B,C:2,3
In this case at least, you arrive at the same result regardless of which factoring you do first, but I'm not sure if that would always be the case.
Doing an exhaustive exploration of the tree sounds expensive, so you might want to avoid that; but if we could pick the optimal path, or at least a likely-good path, it would be relatively cheap computationally. Rather than branching at random, my gut instinct is that it would be faster and closer to optimal to make the largest-set factoring change possible at each point in the tree. For example, in the two-branch tree above you'd prefer the initial 2,3 factoring over the initial 1 factoring, because 2,3 is the larger set, with size two. More dramatic refactorings suggest fewer refactorings will be needed before you reach a stable result.
What that amounts to is iterating the sets from largest towards smallest (it doesn't matter which order you iterate over same-length sets), looking for refactoring opportunities.
Much like a bubble-sort, after each refactoring the approach would be "I made a change; it's not stable yet; let's repeat". Restart by iterating from second-longest sets towards shortest sets, checking for optimisation opportunities as you go.
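A rough sketch of that largest-first, repeat-until-stable factoring (greedy, so not guaranteed globally minimal; here the table maps frozensets of keys to sets of values):

```python
def factor_table(table):
    """Repeatedly factor out shared value-sets, trying largest sets first.

    table: dict mapping frozenset-of-keys -> set-of-values.
    Greedy sketch: each change strictly shrinks the total number of stored
    values, so the loop always terminates.
    """
    table = {k: set(v) for k, v in table.items()}
    changed = True
    while changed:
        changed = False
        # Candidate sets to factor out, largest first.
        for candidate in sorted((set(v) for v in table.values()), key=len, reverse=True):
            holders = [k for k, v in table.items() if candidate <= v]
            if len(holders) > 1:
                merged_keys = frozenset().union(*holders)
                for k in holders:
                    table[k] -= candidate
                table = {k: v for k, v in table.items() if v}  # drop emptied entries
                table[merged_keys] = table.get(merged_keys, set()) | candidate
                changed = True
                break  # a change was made: restart the scan
    return table

start = {frozenset("A"): {1}, frozenset("B"): {1, 2, 3}, frozenset("C"): {2, 3}}
# factor_table(start) -> {frozenset({'B','C'}): {2, 3}, frozenset({'A','B'}): {1}}
```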
(I'm not sure about Python specifically, but set comparisons can be expensive in general. You might want to maintain, for each set, the XOR of its elements' hashes: that's cheap to update when a few set elements change, and a trivial comparison can tell you two large sets are unequal, saving comparison time. It won't tell you if sets are equal, though: multiple sets could have the same XOR-of-hashes value.)
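The XOR-of-hashes idea might be sketched like this (remember: equal fingerprints prove nothing, only unequal fingerprints are conclusive):

```python
def set_fingerprint(s):
    # XOR of element hashes: order-independent and cheap to update when
    # a few elements are added or removed (XOR the same hash in or out).
    fp = 0
    for x in s:
        fp ^= hash(x)
    return fp

a, b, c = {1, 2, 3}, {3, 2, 1}, {1, 2}
assert set_fingerprint(a) == set_fingerprint(b)  # same set, same fingerprint
if set_fingerprint(a) != set_fingerprint(c):
    pass  # definitely unequal: skip the expensive element-wise comparison
```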

Related

Hash Map, when to use which method?

I've been learning about HashMaps and their best practices. One of the things that I stumbled upon was collision resolution.
The methods include:
Direct Chaining,
Linear Probing,
Quadratic Probing,
Double Hashing.
So far I've found that direct chaining was much easier to implement and made the most sense. I'm not sure which I should focus on in terms of being prepared for technical interviews.
For technical interviews, I'd suggest getting a high level understanding of the pros/cons of these approaches - specifically:
direct chaining degrades slowly as load factor increases, whereas closed hashing/open addressing approaches (all the others you list) grow exponentially worse as the load factor approaches 1.0, as it gets harder and harder to find an empty bucket
linear probing can be CPU cache friendly with small keys (compared to any of the other techniques): if you can get several keys onto the same cache line, that means the CPU is likely to spend less time groping around in memory after collisions (and SIMD instructions can sometimes help compare against multiple buckets' keys concurrently)
linear probing and identity hashing can produce lower-than-cryptographic-hashing collision rates for keys that happened to have a nice distribution across the buckets, such as ids that tend to increment but may have a gap here and there
linear probing is much more prone to clusters of collisions than quadratic probing, especially with poor hash functions / difficult-to-hash-well keys and higher load factors (e.g. ballpark >= 0.8), as collisions in one part of the table (even if just by chance more than flawed hashing) tend to exacerbate future use of that part of the table
quadratic probing can have a couple bucket offsets that might fall on the same cache line, so you might get a useful probability of the second and even third bucket being on the same cache line as the first, but after a few failures you'll jump off by larger increments, decreasing clustering problems at the expense of more cache misses
double hashing is a bit of a compromise: if the second hash happens to produce a 1, it's equivalent to linear probing, but you might instead try every 2nd bucket, or every 3rd, etc., up to some limit. There's still plenty of room for clustering (e.g. if h2(k) returned 6 for one key, and 3 for another key that had hashed to a bucket 3 further into the table than the first key, then they'll visit many of the same buckets during their searches).
I wouldn't recommend focusing on any of them in too much depth, or ignoring any, because the contrasts reinforce your understanding of the others.
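To make the probing mechanics above concrete, here's a toy open-addressing insert using linear probing (integer keys are used so the bucket indices are deterministic; Python randomizes string hashes per process):

```python
def linear_probe_insert(table, key, value):
    """Insert (key, value) into a fixed-size open-addressing table (toy sketch)."""
    n = len(table)
    i = hash(key) % n
    for _ in range(n):
        if table[i] is None or table[i][0] == key:
            table[i] = (key, value)
            return i  # bucket index actually used
        i = (i + 1) % n  # collision: step linearly to the next bucket
    raise RuntimeError("table is full")

table = [None] * 8
linear_probe_insert(table, 3, "a")   # lands in bucket 3
linear_probe_insert(table, 11, "b")  # 11 % 8 == 3 too: collides, probes to bucket 4
```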

Find the 'pits' in a list of integers. Fast

I'm attempting to write a program which finds the 'pits' in a list of integers.
A pit is any integer x where x is less than or equal to the integers immediately preceding and following it. If the integer is at the start or end of the list it is only compared on the inward side.
For example in:
[2,1,3] 1 is a pit.
[1,1,1] all elements are pits.
[4,3,4,3,4] the elements at 1 and 3 are pits.
I know how to work this out by taking a linear approach and walking along the list, but I am curious how to apply divide-and-conquer techniques to do this comparatively quickly. I am quite inexperienced and am not really sure where to start; I feel like something similar to a binary tree could be applied?
If it's pertinent, I'm working in Python 3.
Thanks for your time :).
Without any additional information on the distribution of the values in the list, it is not possible to achieve an algorithmic complexity of less than O(n), where n is the number of elements in the list.
Logically, if the dataset is random, such as Brownian noise, a pit can occur anywhere, requiring a full 1:1 sampling in order to correctly find every pit.
Even if one only wants to find the absolute lowest pit in the sequence, that would not be possible in sub-linear time without sacrificing correctness of the results.
Optimizations can be considered, such as parallelization or skipping the immediate neighbours of a found pit, but the overall complexity would stay the same.
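For reference, the linear scan really is both simple and optimal here; a minimal Python version matching the examples above:

```python
def find_pits(xs):
    # A pit is <= both neighbours; endpoints compare only inward. O(n) single pass.
    n = len(xs)
    pits = []
    for i, x in enumerate(xs):
        left_ok = (i == 0) or (x <= xs[i - 1])
        right_ok = (i == n - 1) or (x <= xs[i + 1])
        if left_ok and right_ok:
            pits.append(i)
    return pits

find_pits([2, 1, 3])        # -> [1]
find_pits([1, 1, 1])        # -> [0, 1, 2]
find_pits([4, 3, 4, 3, 4])  # -> [1, 3]
```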

Scheduling: Minimizing Gaps between Non-Overlapping Time Ranges

Using Django to develop a small scheduling web application where people are assigned certain times to meet with their superiors. Employees are stored as models, with a OneToMany relation to a model representing time ranges and day of the week where they are free. For instance:
Bob: (W 9:00, 9:15), (W 9:15, 9:30), ... (W 15:00, 15:20)
Sarah: (Th 9:05, 9:20), (F 9:20, 9:30), ... (Th 16:00, 16:05)
...
Mary: (W 8:55, 9:00), (F 13:00, 13:35), ... etc
My program allows a basic schedule setup, where employers can choose to view the first N possible schedules with the least gaps in between meetings under the condition that they meet all their employees at least once during that week. I am currently generating all possible permutations of meetings, and filtering out schedules where there are overlaps in meeting times. Is there a way to generate the first N schedules out of M possible ones, without going through all M possibilities?
Clarification: We are trying to get the minimum sum of gaps for any given day, summed over all days.
I would use a search algorithm, like A-star, to do this. Each node in the graph represents a person's available time slots and a path from one node to another means that node_a and node_b are in the partial schedule.
Another solution would be to create a graph in which the nodes are each person's availability times, and there is an edge from node_a to node_b if the person associated with node_a is not the same as the person associated with node_b. The weight of each edge is the amount of time between the times associated with the two nodes.
After creating this graph, you could generate a variant of a minimum spanning tree from the graph. The variant would differ from MSTs in that:
you'll only add a node to the MST if the person associated with the node is not already in the MST.
you finish creating the MST when all persons are in the MST.
The minimum spanning tree generated would represent a single schedule.
To generate other schedules, remove all the edges from the graph which are found in the schedule you just created and then create a new minimum spanning tree from the graph with the removed edges.
In general, scheduling problems are NP-hard, and while I can't figure out a reduction for this problem to prove it such, it's quite similar to a number of other well-known NP-complete problems. There may be a polynomial-time solution for finding the minimum gap for a single day (though I don't know it off hand, either), but I have less hopes for needing to solve it for multiple days. Unfortunately, it's a complicated problem, and there may not be a perfectly elegant answer. (Or, I'm going to kick myself when someone posts one later.)
First off, I'd say that if your dataset is reasonably small and you've been able to compute all possible schedules fairly quickly, you may just want to stick with that solution, as all others will be approximations and could possibly end up running slower if the constant factor of their running time is large. (That factor doesn't grow with the size of the dataset, so it will be relatively smaller for a large dataset.)
The simplest approximation would be to just use a greedy heuristic. It will almost assuredly not find the optimum schedules, and may take a long time to find a solution if most of the schedules are overlapping, and there are only a few that are even valid solutions - but I'm going to assume that this is not the case for employee times.
Start with an arbitrary schedule, choosing one timeslot for each employee at random. For each iteration, pick one employee and change his timeslot to the best possible time with respect to the rest of the current schedule. Repeat this process until you're satisfied with the result: when it isn't improving quickly enough anymore, or has taken too long. You're probably not going to want to repeat until you can't make any more improving changes, since this process will likely loop for most data.
It's not a great heuristic, but it should produce some reasonable schedules, and has a lot of room for adjustment. You may want to always try to switch overlapping times first before any others, or you may want to try to flip the employee who currently contributes to the largest gap, or maybe eliminate certain slots that you've already tried. You may want to sometimes allow a move to a less optimal solution in hopes that you're at a local minima and want to get out of it - some randomness can also help with this. Make sure you always keep track of the best solution you've seen so far.
To generate more schedules, the most obvious thing would be to just start the process over with a different random schedule. Or, maybe flip a few arbitrary times from the previous solution you found, and repeat from there.
Edit: This is all fairly related to genetic algorithms, and you may want to use some of the ideas I presented here in a GA.
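As a concrete starting point for any of these heuristics, the objective from the clarification (sum of gaps per day, summed over all days) can be scored like this; the (day, start, end) tuple format and minutes-since-midnight convention are assumptions for illustration:

```python
from collections import defaultdict

def total_gap_minutes(schedule):
    """Sum idle minutes between consecutive meetings on each day.

    schedule: iterable of (day, start, end) tuples, times in minutes since midnight.
    """
    by_day = defaultdict(list)
    for day, start, end in schedule:
        by_day[day].append((start, end))
    gap = 0
    for meetings in by_day.values():
        meetings.sort()  # order each day's meetings by start time
        for (_, prev_end), (next_start, _) in zip(meetings, meetings[1:]):
            gap += max(0, next_start - prev_end)
    return gap

# Bob at W 9:00-9:15 and W 9:30-9:45 leaves a 15-minute gap on Wednesday.
total_gap_minutes([("W", 540, 555), ("W", 570, 585), ("Th", 600, 615)])  # -> 15
```

A hill-climbing loop would then just compare `total_gap_minutes` before and after each candidate timeslot swap.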

How to track the progress of a tree traversal?

I have a tree. It has a flat bottom. We're only interested in the bottom-most leaves, but this is roughly how many leaves there are at the bottom...
2 x 1600 x 1600 x 10 x 4 x 1600 x 10 x 4
That's ~13,107,200,000,000 leaves? Because of the size (the calculation performed on each leaf seems unlikely to be optimised to ever take less than one second) I've given up thinking it will be possible to visit every leaf.
So I'm thinking I'll build a 'smart' leaf crawler which inspects the most "likely" nodes first (based on results from the ones around it). So it's reasonable to expect the leaves to be evaluated in branches/groups of neighbours, but the groups will vary in size and distribution.
What's the smartest way to record which leaves have been visited and which have not?
You don't give a lot of information, but I would suggest tuning your search algorithm to help you keep track of what it's seen. If you had a global way of ranking leaves by "likelihood", you wouldn't have a problem since you could just visit leaves in descending order of likelihood. But if I understand you correctly, you're just doing a sort of hill climbing, right? You can reduce storage requirements by searching complete subtrees (e.g., all 1600 x 10 x 4 leaves in a cluster that was chosen as "likely"), and keeping track of clusters rather than individual leaves.
It sounds like your tree geometry is consistent, so depending on how your search works, it should be easy to merge your nodes upwards... e.g., keep track of level 1 nodes whose leaves have all been examined, and when all children of a level 2 node are in your list, drop the children and keep their parent. This might also be a good way to choose what to examine: If three children of a level 3 node have been examined, the fourth and last one is probably worth examining too.
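That upward merging might be sketched as follows, assuming the branching factor at each level is known (the arity values below are placeholders, not the question's actual geometry):

```python
def mark_visited(visited, path, arity):
    """Mark a node's subtree as fully explored; merge complete sibling groups upward.

    visited: set of path tuples; a stored prefix means that whole subtree is done.
    arity[d]: branching factor at depth d (assumed known and uniform per level).
    """
    visited.add(path)
    while path:
        parent, depth = path[:-1], len(path) - 1
        siblings = [parent + (i,) for i in range(arity[depth])]
        if all(s in visited for s in siblings):
            for s in siblings:
                visited.discard(s)
            visited.add(parent)  # one parent entry replaces all its children
            path = parent
        else:
            break

def is_visited(visited, path):
    # A node counts as visited if it or any ancestor prefix is marked complete.
    return any(path[:k] in visited for k in range(len(path) + 1))

visited = set()
mark_visited(visited, (0, 0), arity=[2, 2])
mark_visited(visited, (0, 1), arity=[2, 2])  # both children done: collapses to (0,)
```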
Finally, a thought: Are you really, really sure that there's no way to exclude some solutions in groups (without examining every individual one)? Problems like sudoku have an astronomically large search space, but a good brute-force solver eliminates large blocks of possibilities without examining every possible 9 x 9 board. Given the scale of your problem, this would be the most practical way to attack it.
It seems that you're looking for a quick and memory-efficient way to do a membership test. If so, and if you can cope with some false positives, go for a Bloom filter.
Bottom line: use Bloom filters in situations where your data set is really big, all you need is to check whether a particular element exists in the set, and a small chance of false positives is tolerable.
Implementations for Python exist.
Hope this helps.
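In case it helps, a toy Bloom filter is only a few lines in Python (the sizes here are arbitrary; a real implementation would tune num_bits and num_hashes to the expected element count and acceptable false-positive rate):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: no deletions, tunable false-positive rate."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive k independent bit positions by salting a cryptographic hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # All k bits set: "probably present". Any bit clear: definitely absent.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("leaf-1")
```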
Maybe this is too obvious, but you could store your results in a similar tree. Since your computation is slow, the results tree should not grow out of hand too quickly. Then just look up if you have results for a given node.

how to generate all possible combinations of a 14x10 matrix containing only 1's and 0's

I'm working on a problem and one solution would require an input of every 14x10 matrix that is possible to be made up of 1's and 0's... how can I generate these so that I can input every possible 14x10 matrix into another function? Thank you!
Added March 21: It looks like I didn't word my post appropriately. Sorry. What I'm trying to do is optimize the output of 10 different production units (given different speeds and amounts of downtime) for several scenarios. My goal is to place blocks of downtime to minimize the differences in production on a day-to-day basis. The amount and frequency of downtime each unit is allowed is given.
I am currently trying to evaluate a three-week cycle, meaning every three weeks each production unit is taken down for a given amount of hours. I was asking the computer to determine the order the units would be taken down, based on the constraints that the lines come down only once every 3 weeks and that the difference in daily production is the smallest possible.
My first approach was to use Excel (as I tried to describe above) and it didn't work (no surprise there), where 1 = running, 0 = off, and these are summed to calculate production. The calculated production is subtracted from a set max daily production. Then these differences were compared going from Mon-Tues, Tues-Wed, etc. for a three-week time frame and minimized using Solver. My next approach was to write a Matlab code where the input was a tolerance (the allowed day-to-day variation).
Is there a program that already does this, or an approach that does this most easily? It seems simple enough, but I'm still thinking through the different ways to go about it. Any insight would be much appreciated.
The actual implementation depends heavily on how you want to represent matrices… But assuming the matrix can be represented by a 14 * 10 = 140 element list:
from itertools import product
for matrix in product([0, 1], repeat=140):
    # ... do stuff with the matrix ...
Of course, as other posters have noted, this probably isn't what you want to do… But if it really is what you want to do, that's the best code (given your requirements) to do it.
Generating every possible 14x10 matrix of 1's and 0's would produce 2**140 matrices. I don't believe you would have enough lifetime for this; I'm not sure the sun would still be shining before you finished. This is why it is impossible to generate all those matrices. You must look for some other solution; this is brute force.
This is absolutely impossible! The number of possible matrices is 2^140, which is around 1.4e42. However, consider the following...
If you were to generate two 14-by-10 matrices at random, the odds that they would be the same are 1 in 1.4e42.
If you were to generate 1 billion unique 14-by-10 matrices, then the odds that the next one you generate would be the same as one of those would still be exceedingly slim: 1 in 1.4e33.
The default random number stream in MATLAB uses a Mersenne twister algorithm that has a period of 2^19937-1. Therefore, the random number generator shouldn't start repeating itself any time this eon.
Your approach should be thus:
Find a computer no one ever wants to use again.
Give it as much storage space as possible to save your results.
Install MATLAB on it and fire it up.
Start computing matrices at random like so:
while true
    newMatrix = randi([0 1],14,10);
    %# Process the matrix and output your results to disk
end
Walk away
Since there are so many combinations, you don't have to compare newMatrix with any of the previous matrices since the length of time before a repeat is likely to occur is astronomically large. Your processing is more likely to stop due to other reasons first, such as (in order of likely occurrence):
You run out of disk space to store your results.
There's a power outage.
Your computer suffers a fatal hardware failure.
You pass away.
The Earth passes away.
The Universe dies a slow heat death.
NOTE: Although I injected some humor into the above answer, I think I have illustrated one useful alternative. If you simply want to sample a small subset of the possible combinations (where even 1 billion could be considered "small" due to the sheer number of combinations) then you don't have to go through the extra time- and memory-consuming steps of saving all of the matrices you've already processed and comparing new ones to it to make sure you aren't repeating matrices. Since the odds of repeating a combination are so low, you could safely do this:
for iLoop = 1:whateverBigNumberYouWant
    newMatrix = randi([0 1],14,10); %# Generate a new matrix
    %# Process the matrix and save your results
end
Are you sure you want every possible 14x10 matrix? There are 140 elements in each matrix, and each element can be on or off. Therefore there are 2^140 possible matrices. I suggest you reconsider what you really want.
Edit: I noticed you mentioned in a comment that you are trying to minimize something. There is an entire mathematical field called optimization devoted to doing this type of thing. The reason this field exists is because quite often it is not possible to exhaustively examine every solution in anything resembling a reasonable amount of time.
Trying this:
import numpy
for i in xrange(int(1e9)):
    a = numpy.random.random_integers(0, 1, (14, 10))
(which is much, much, much smaller than what you require) should be enough to convince you that this is not feasible. It also shows you how to compute one, or a few, such random matrices; even up to a million is pretty fast.
EDIT: changed to xrange to "improve speed and memory requirements" :)
You don't have to materialize them all; a generator can produce each matrix on demand:
def everyPossibleMatrix(x, y):
    N = x * y
    for i in range(2**N):
        b = "{:0{}b}".format(i, N)
        yield '\n'.join(b[j*x:(j+1)*x] for j in range(y))
Depending on what you want to accomplish with the generated matrices, you might be better off generating a random sample and running a number of simulations. Something like:
import numpy

matrix_samples = []
# generate 10 matrices
for i in range(10):
    sample = numpy.random.binomial(1, .5, 14*10)
    sample.shape = (14, 10)
    matrix_samples.append(sample)
You could do this a number of times to see how results vary across simulations. Of course, you could also modify the code to ensure that there are no repeats in a sample set, again depending on what you're trying to accomplish.
Are you saying that you have a table with 140 cells and each value can be 1 or 0 and you'd like to generate every possible output? If so, you would have 2^140 possible combinations...which is quite a large number.
Instead of just suggesting that this is infeasible, I would suggest considering a scheme that samples the important subset of all possible combinations instead of applying a brute-force approach. As one of the replies suggested, you are doing minimization. There are numerical techniques for this, such as simulated annealing and Monte Carlo sampling, as well as traditional minimization algorithms. You might want to look into whether one is appropriate in your case.
I was actually much more pessimistic to begin with, but consider:
from math import log, e

def timeInYears(totalOpsNeeded=2**140, currentOpsPerSecond=10**9, doublingPeriodInYears=1.5):
    secondsPerYear = 365.25 * 24 * 60 * 60
    doublingPeriodInSeconds = doublingPeriodInYears * secondsPerYear
    k = log(2, e) / doublingPeriodInSeconds  # time-proportionality constant
    timeInSeconds = log(1 + k*totalOpsNeeded/currentOpsPerSecond, e) / k
    return timeInSeconds / secondsPerYear
if we assume that computer processing power continues to double every 18 months, and you can currently do a billion combinations per second (optimistic, but for sake of argument) and you start today, your calculation will be complete on or about April 29th 2137.
Here is an efficient way to get started in Matlab:
First generate all 1024 possible rows of length 10 containing only zeros and ones:
dec2bin(0:2^10-1)
Now you have all possible rows, and you can sample from them as you wish. For example by calling the following line a few times:
randperm(1024,14)
