Optimization of Unique, non-reversible permutations in Python - python

I'm currently trying to solve the 'dance recital' kattis challenge in Python 3. See here
After taking input on how many performances there are in the dance recital, you must arrange performances in such a way that sequential performances by a dancer are minimized.
I've seen this challenge completed in C++, but my code kept running out of time and I wanted to optimize it.
Question: As of right now, I generate all possible permutations of performances and run comparisons off of that. A faster way to would be to not generate all permutations, as some of them are simply reversed and would result in the exact same output.
import itertools
print(list(itertools.permutations(range(2)))) --> [(0,1),(1,0)] #They're the same, backwards and forwards
print(magic_algorithm(range(2))) --> [(0,1)] #This is what I want
How might I generate such a list of permutations?
I've tried:
-Generating all permutation, running over them again to reversed() duplicates and saving them. This takes too long and the result cannot be hard coded into the solution as the file becomes too big.
-Only generating permutations up to the half-way mark, then stopping, assuming that after that, no unique permutations are generated (not true, as I found out)
-I've checked out questions here, but no one seems to have the same question as me, ditto on the web
Here's my current code:
from itertools import permutations
number_of_routines = int(input()) #first line is number of routines
dance_routine_list = [0]*10
permutation_list = list(permutations(range(number_of_routines))) #generate permutations
for q in range(number_of_routines):
s = input()
for c in s:
v = ord(c) - 65
dance_routine_list[q] |= (1 << v) #each routine ex.'ABC' is A-Z where each char represents a performer in the routine
def calculate():
least_changes_possible = 1e9 #this will become smaller, as optimizations are found
for j in permutation_list:
tmp = 0
for i in range(1,number_of_routines):
tmp += (bin(dance_routine_list[j[i]] & dance_routine_list[j[i - 1]]).count('1')) #each 1 represents a performer who must complete sequential routines
least_changes_possible = min(least_changes_possible, tmp)
return least_changes_possible
print(calculate())
Edit: Took a shower and decided adding a 2-element-comparison look-up table would speed it up, as many of the operations are repeated. Still doesn't fix iterating over the whole permutations, but it should help.
Edit: Found another thread that answered this pretty well. How to generate permutations of a list without "reverse duplicates" in Python using generators
Thank you all!

There are at most 10 possible dance routines, so at most 3.6M permutations, and even bad algorithms like generate 'em all and test will be done very quickly.
If you wanted a fast solution for up to 24 or so routines, then I would do it like this...
Given the the R dance routines, at any point in the recital, in order to decide which routine you can perform next, you need to know:
Which routines you've already performed, because there you can't do those ones next. There are 2R possible sets of already-performed routines; and
Which routine was performed last, because that helps determine the cost of the next one. There are at most R-1 possible values for that.
So there are at less than (R-2)*2R possible recital states...
Imagine a directed graph that connects each possible state to all the possible following states, by an edge for the routine that you would perform to get to that state. Label each edge with the cost of performing that routine.
For example, if you've performed routines 5 and 6, with 5 last, then you would be in state (5,6):5, and there would be an edge to (3,5,6):3 that you could get to after performing routine 3.
Starting at the initial "nothing performed yet" state ():-, use Dijkstra's algorithm to find the least cost path to a state with all routines performed.
Total complexity is O(R2*2R) or so, depending exactly how you implement it.
For R=10, R2*2R is ~100 000, which won't take very long at all. For R=24 it's about 9 billion, which is going to take under half a minute in pretty good C++.

Related

Endless Permutation and RAM

I have a list with around 1500 number (1 to 1500) and I want to get all the possible Permutation out of it to do some calculations and choose the smallest number out of all of it.
Problem is that the number of possibilites as you can figure is wayyy wayyy too big and my computer just freeze whiel running the code so I have to make a forced restart. Also my RAM is 8GB so it should be big enough (?) or so I hope.
To limit it I can specify a start point but that won't reduce it much.
Also it's a super important thing but I feel so lost. what do you think should I do to make it run ?
Use of generators and itertools saves memory a lot. Generators are just like lists with an exception that new elements generated one by one rather than stored in memory. Even if your problem has a better solution, mastering the generator will help you to save memory in future.
Note in Python 3 map is already a generator (while for Python 2 use imap instead of map)
import itertools
array = [ 1, 2, ... , 1500] # or other numbers
f = sum # or whatever function you have
def fmax_on_perms(array, f):
perms = itertools.permutations(array) # generator of all the
permutations rather than static list
fvalues = map(my_function, perms) # generator of values of your function on different permutations
return max(fvalues)

Fast Constrained Least Squares

I have to solve many independent constrained linear least squares problems (with both bounds and constrains). So I do it multiple times in loops. For each problem I am looking for x that is min||Cx-d|| and x is bounded (in 0,1) and all x elements must sum to 1.
I am seeking a way of doing it fast, because although each optimization doesn't require a lot of time, I need to include it in a large loop.
For example my Matlab implementation seems like this:
img = imread('test.tif');
C = randi(255,6,4);
x=zeros(size(C,2),1);
tp = zeros(size(C,2),1);
Aeq = ones(1,size(C,2));
beq = 1;
options = optimset('LargeScale','off','Display','off');
A = (-1).*eye(size(C,2));
b = zeros(1,size(C,2));
result = zeros(size(img,1),size(img,2),size(C,2));
for i=1:size(img,1)
for j=1:size(img,2)
for k=1:size(img,3)
tp(k) = img(i,j,k);
end
x = lsqlin(C,tp,A,b,Aeq,beq,[],[],[],options);
for l=1:size(C,2)
result(i,j,l)=x(l);
end
end
end
and for a 500x500 loop it takes about 5 minutes. But my loops are much bigger than this. Any idea is welcome, but I would prefer a Matlab, Python or R solution.
As mentioned in the comments: The used commands are highly optimized, and switching languages is not going to make a world of difference.
First of all go back to what you are trying to achieve, perhaps to achieve that you don't need to solve this many hard problems.
If you come to the conclusion that you do want to do this exact calculation, consider changing the settings, and play around with the algorithm.
From doc lsqlin it appears the two main parameters that determine the speed are:
Number of iterations
Accepted tolerance
If all these things don't help enough, consider some tricks like:
Only solving problems that are sufficiently different
Solving with easier constraints, and rounding the result

Python loop transfer in other programming language

Again I have a question concerning large loops.
Suppose I have a function
limits
def limits(a,b):
*evaluate integral with upper and lower limits a and b*
return float result
A and B are simple np.arrays that store my values a and b. Now I want to calculate the integral 300'000^2/2 times because A and B are of the length of 300'000 each and the integral is symmetrical.
In Python I tried several ways like itertools.combinations_with_replacement to create the combinations of A and B and then put them into the integral but that takes huge amount of time and the memory is totally overloaded.
Is there any way, for example transferring the loop in another language, to speed this up?
I would like to run the loop
for i in range(len(A)):
for j in range(len(B)):
np.histogram(limits(A[i],B[j]))
I think histrogramming the return of limits is desirable in order not to store additional arrays that grow squarely.
From what I read python is not really the best choice for this iterative ansatzes.
So would it be reasonable to evaluate this loop in another language within Python, if yes, How to do it. I know there are ways to transfer code, but I have never done it so far.
Thanks for your help.
If you're worried about memory footprint, all you need to do is bin the results as you go in the for loop.
num_bins = 100
bin_upper_limits = np.linspace(-456, 456, num=num_bins-1)
# (last bin has no upper limit, it goes from 456 to infinity)
bin_count = np.zeros(num_bins)
for a in A:
for b in B:
if b<a:
# you said the integral is symmetric, so we can skip these, right?
continue
new_result = limits(a,b)
which_bin = np.digitize([new_result], bin_upper_limits)
bin_count[which_bin] += 1
So nothing large is saved in memory.
As for speed, I imagine that the overwhelming majority of time is spent evaluating limits(a,b). The looping and binning is plenty fast in this case, even in python. To convince yourself of this, try replacing the line new_result = limits(a,b) with new_result = 234. You'll find that the loop runs very fast. (A few minutes on my computer, much much less than the 4 hour figure you quote.) Python does not loop very fast compared to C, but it doesn't matter in this case.
Whatever you do to speed up the limits() call (including implementing it in another language) will speed up the program.
If you change the algorithm, there is vast room for improvement. Let's take an example of what it seems you're doing. Let's say A and B are 0,1,2,3. You're integrating a function over the ranges 0-->0, 0-->1, 1-->1, 1-->2, 0-->2, etc. etc. You're re-doing the same work over and over. If you have integrated 0-->1 and 1-->2, then you can add up those two results to get the integral 0-->2. You don't have to use a fancy integration algorithm, you just have to add two numbers you already know.
Therefore it seems to me that you can compute integrals in all the smallest ranges (0-->1, 1-->2, 2-->3), store the results in an array, and add subsets of the results to get the integral over whatever range you want. If you want this program to run in a few minutes instead of 4 hours, I suggest thinking through an alternative algorithm along those lines.
(Sorry if I'm misunderstanding the problem you're trying to solve.)

Generating non-repeating random numbers in Python

Ok this is one of those trickier than it sounds questions so I'm turning to stack overflow because I can't think of a good answer. Here is what I want: I need Python to generate a simple a list of numbers from 0 to 1,000,000,000 in random order to be used for serial numbers (using a random number so that you can't tell how many have been assigned or do timing attacks as easily, i.e. guessing the next one that will come up). These numbers are stored in a database table (indexed) along with the information linked to them. The program generating them doesn't run forever so it can't rely on internal state.
No big deal right? Just generate a list of numbers, shove them into an array and use Python "random.shuffle(big_number_array)" and we're done. Problem is I'd like to avoid having to store a list of numbers (and thus read the file, pop one off the top, save the file and close it). I'd rather generate them on the fly. Problem is that the solutions I can think of have problems:
1) Generate a random number and then check if it has already been used. If it has been used generate a new number, check, repeat as needed until I find an unused one. Problem here is that I may get unlucky and generate a lot of used numbers before getting one that is unused. Possible fix: use a very large pool of numbers to reduce the chances of this (but then I end up with silly long numbers).
2) Generate a random number and then check if it has already been used. If it has been used add or subtract one from the number and check again, keep repeating until I hit an unused number. Problem is this is no longer a random number as I have introduced bias (eventually I will get clumps of numbers and you'd be able to predict the next number with a better chance of success).
3) Generate a random number and then check if it has already been used. If it has been used add or subtract another randomly generated random number and check again, problem is we're back to simply generating random numbers and checking as in solution 1.
4) Suck it up and generate the random list and save it, have a daemon put them into a Queue so there are numbers available (and avoid constantly opening and closing a file, batching it instead).
5) Generate much larger random numbers and hash them (i.e. using MD5) to get a smaller numeric value, we should rarely get collisions, but I end up with larger than needed numbers again.
6) Prepend or append time based information to the random number (i.e. unix timestamp) to reduce chances of a collision, again I get larger numbers than I need.
Anyone have any clever ideas that will reduce the chances of a "collision" (i.e. generating a random number that is already taken) but will also allow me to keep the number "small" (i.e. less than a billion (or a thousand million for your europeans =)).
Answer and why I accepted it:
So I will simply go with 1, and hope it's not an issue, however if it is I will go with the deterministic solution of generating all the numbers and storing them so that there is a guarentee of getting a new random number, and I can use "small" numbers (i.e. 9 digits instead of an MD5/etc.).
This is a neat problem, and I've been thinking about it for a while (with solutions similar to Sjoerd's), but in the end, here's what I think:
Use your point 1) and stop worrying.
Assuming real randomness, the probability that a random number has already been chosen before is the count of previously chosen numbers divided by the size of your pool, i.e. the maximal number.
If you say you only need a billion numbers, i.e. nine digits: Treat yourself to 3 more digits, so you have 12-digit serial numbers (that's three groups of four digits – nice and readable).
Even when you're close to having chosen a billion numbers previously, the probability that your new number is already taken is still only 0,1%.
Do step 1 and draw again. You can still check for an "infinite" loop, say don't try more than 1000 times or so, and then fallback to adding 1 (or something else).
You'll win the lottery before that fallback ever gets used.
You could use Format-Preserving Encryption to encrypt a counter. Your counter just goes from 0 upwards, and the encryption uses a key of your choice to turn it into a seemingly random value of whatever radix and width you want.
Block ciphers normally have a fixed block size of e.g. 64 or 128 bits. But Format-Preserving Encryption allows you to take a standard cipher like AES and make a smaller-width cipher, of whatever radix and width you want (e.g. radix 10, width 9 for the parameters of the question), with an algorithm which is still cryptographically robust.
It is guaranteed to never have collisions (because cryptographic algorithms create a 1:1 mapping). It is also reversible (a 2-way mapping), so you can take the resulting number and get back to the counter value you started with.
AES-FFX is one proposed standard method to achieve this.
I've experimented with some basic Python code for AES-FFX--see Python code here (but note that it doesn't fully comply with the AES-FFX specification). It can e.g. encrypt a counter to a random-looking 7-digit decimal number. E.g.:
0000000 0731134
0000001 6161064
0000002 8899846
0000003 9575678
0000004 3030773
0000005 2748859
0000006 5127539
0000007 1372978
0000008 3830458
0000009 7628602
0000010 6643859
0000011 2563651
0000012 9522955
0000013 9286113
0000014 5543492
0000015 3230955
... ...
For another example in Python, using another non-AES-FFX (I think) method, see this blog post "How to Generate an Account Number" which does FPE using a Feistel cipher. It generates numbers from 0 to 2^32-1.
With some modular arithmic and prime numbers, you can create all numbers between 0 and a big prime, out of order. If you choose your numbers carefully, the next number is hard to guess.
modulo = 87178291199 # prime
incrementor = 17180131327 # relative prime
current = 433494437 # some start value
for i in xrange(1, 100):
print current
current = (current + incrementor) % modulo
If they don't have to be random, but just not obviously linear (1, 2, 3, 4, ...), then here's a simple algorithm:
Pick two prime numbers. One of them will be the largest number you can generate, so it should be around one billion. The other should be fairly large.
max_value = 795028841
step = 360287471
previous_serial = 0
for i in xrange(0, max_value):
previous_serial += step
previous_serial %= max_value
print "Serial: %09i" % previous_serial
Just store the previous serial each time so you know where you left off. I can't prove mathmatically that this works (been too long since those particular classes), but it's demonstrably correct with smaller primes:
s = set()
with open("test.txt", "w+") as f:
previous_serial = 0
for i in xrange(0, 2711):
previous_serial += 1811
previous_serial %= 2711
assert previous_serial not in s
s.add(previous_serial)
You could also prove it empirically with 9-digit primes, it'd just take a bit more work (or a lot more memory).
This does mean that given a few serial numbers, it'd be possible to figure out what your values are--but with only nine digits, it's not likely that you're going for unguessable numbers anyway.
If you don't need something cryptographically secure, but just "sufficiently obfuscated"...
Galois Fields
You could try operations in Galois Fields, e.g. GF(2)32, to map a simple incrementing counter x to a seemingly random serial number y:
x = counter_value
y = some_galois_function(x)
Multiply by a constant
Inverse is to multiply by the reciprocal of the constant
Raise to a power: xn
Reciprocal x-1
Special case of raising to power n
It is its own inverse
Exponentiation of a primitive element: ax
Note that this doesn't have an easily-calculated inverse (discrete logarithm)
Ensure a is a primitive element, aka generator
Many of these operations have an inverse, which means, given your serial number, you can calculate the original counter value from which it was derived.
As for finding a library for Galois Field for Python... good question. If you don't need speed (which you wouldn't for this) then you could make your own. I haven't tried these:
NZMATH
Finite field Python package
Sage, although it's a whole environment for mathematical computing, much more than just a Python library
Matrix multiplication in GF(2)
Pick a suitable 32×32 invertible matrix in GF(2), and multiply a 32-bit input counter by it. This is conceptually related to LFSR, as described in S.Lott's answer.
CRC
A related possibility is to use a CRC calculation. Based on the remainder of long-division with an irreducible polynomial in GF(2). Python code is readily available for CRCs (crcmod, pycrc), although you might want to pick a different irreducible polynomial than is normally used, for your purposes. I'm a little fuzzy on the theory, but I think a 32-bit CRC should generate a unique value for every possible combination of 4-byte inputs. Check this. It's quite easy to experimentally check this, by feeding the output back into the input, and checking that it produces a complete cycle of length 232-1 (zero just maps to zero). You may need to get rid of any initial/final XORs in the CRC algorithm for this check to work.
I think you are overestimating the problems with approach 1). Unless you have hard-realtime requirements just checking by random choice terminates rather fast. The probability of needing more than a number of iterations decays exponentially. With 100M numbers outputted (10% fillfactor) you'll have one in billion chance of requiring more than 9 iterations. Even with 50% of numbers taken you'll on average need 2 iterations and have one in a billion chance of requiring more than 30 checks. Or even the extreme case where 99% of the numbers are already taken might still be reasonable - you'll average a 100 iterations and have 1 in a billion change of requiring 2062 iterations
The standard Linear Congruential random number generator's seed sequence CANNOT repeat until the full set of numbers from the starting seed value have been generated. Then it MUST repeat precisely.
The internal seed is often large (48 or 64 bits). The generated numbers are smaller (32 bits usually) because the entire set of bits are not random. If you follow the seed values they will form a distinct non-repeating sequence.
The question is essentially one of locating a good seed that generates "enough" numbers. You can pick a seed, and generate numbers until you get back to the starting seed. That's the length of the sequence. It may be millions or billions of numbers.
There are some guidelines in Knuth for picking suitable seeds that will generate very long sequences of unique numbers.
You can run 1) without running into the problem of too many wrong random numbers if you just decrease the random interval by one each time.
For this method to work, you will need to save the numbers already given (which you want to do anyway) and also save the quantity of numbers taken.
It is pretty obvious that, after having collected 10 numbers, your pool of possible random numbers will have been decreased by 10. Therefore, you must not choose a number between 1 and 1.000.000 but between 1 an 999.990. Of course this number is not the real number but only an index (unless the 10 numbers collected have been 999.991, 999.992, …); you’d have to count now from 1 omitting all the numbers already collected.
Of course, your algorithm should be smarter than just counting from 1 to 1.000.000 but I hope you understand the method.
I don’t like drawing random numbers until I get one which fits either. It just feels wrong.
My solution https://github.com/glushchenko/python-unique-id, i think you should extend matrix for 1,000,000,000 variations and have fun.
I'd rethink the problem itself... You don't seem to be doing anything sequential with the numbers... and you've got an index on the column which has them. Do they actually need to be numbers?
Consider a sha hash... you don't actually need the entire thing. Do what git or other url shortening services do, and take first 3/4/5 characters of the hash. Given that each character now has 36 possible values instead of 10, you have 2,176,782,336 combinations instead of 999,999 combinations (for six digits). Combine that with a quick check on whether the combination exists (a pure index query) and a seed like a timestamp + random number and it should do for almost any situation.
Do you need this to be cryptographically secure or just hard to guess? How bad are collisions? Because if it needs to be cryptographically strong and have zero collisions, it is, sadly, impossible.
I started trying to write an explanation of the approach used below, but just implementing it was easier and more accurate. This approach has the odd behavior that it gets faster the more numbers you've generated. But it works, and it doesn't require you to generate all the numbers in advance.
As a simple optimization, you could easily make this class use a probabilistic algorithm (generate a random number, and if it's not in the set of used numbers add it to the set and return it) at first, keep track of the collision rate, and switch over to the deterministic approach used here once the collision rate gets bad.
import random
class NonRepeatingRandom(object):
def __init__(self, maxvalue):
self.maxvalue = maxvalue
self.used = set()
def next(self):
if len(self.used) >= self.maxvalue:
raise StopIteration
r = random.randrange(0, self.maxvalue - len(self.used))
result = 0
for i in range(1, r+1):
result += 1
while result in self.used:
result += 1
self.used.add(result)
return result
def __iter__(self):
return self
def __getitem__(self):
raise NotImplemented
def get_all(self):
return [i for i in self]
>>> n = NonRepeatingRandom(20)
>>> n.get_all()
[12, 14, 13, 2, 20, 4, 15, 16, 19, 1, 8, 6, 7, 9, 5, 11, 10, 3, 18, 17]
If it is enough for you that a casual observer can't guess the next value, you can use things like a linear congruential generator or even a simple linear feedback shift register to generate the values and keep the state in the database in case you need more values. If you use these right, the values won't repeat until the end of the universe. You'll find more ideas in the list of random number generators.
If you think there might be someone who would have a serious interest to guess the next values, you can use a database sequence to count the values you generate and encrypt them with an encryption algorithm or another cryptographically strong perfect has function. However you need to take care that the encryption algorithm isn't easily breakable if one can get hold of a sequence of successive numbers you generated - a simple RSA, for instance, won't do it because of the Franklin-Reiter Related Message Attack.
Bit late answer, but I haven't seen this suggested anywhere.
Why not use the uuid module to create globally unique identifiers
To generate a list of totally random numbers within a defined threshold, as follows:
plist=list()
length_of_list=100
upbound=1000
lowbound=0
while len(pList)<(length_of_list):
pList.append(rnd.randint(lowbound,upbound))
pList=list(set(pList))
I bumped into the same problem and opened a question with a different title before getting to this one. My solution is a random sample generator of indexes (i.e. non-repeating numbers) in the interval [0,maximal), called itersample. Here are some usage examples:
import random
generator=itersample(maximal)
another_number=generator.next() # pick the next non-repeating random number
or
import random
generator=itersample(maximal)
for random_number in generator:
# do something with random_number
if some_condition: # exit loop when needed
break
itersample generates non-repeating random integers, storage need is limited to picked numbers, and the time needed to pick n numbers should be (as some tests confirm) O(n log(n)), regardelss of maximal.
Here is the code of itersample:
import random
def itersample(c): # c = upper bound of generated integers
sampled=[]
def fsb(a,b): # free spaces before middle of interval a,b
fsb.idx=a+(b+1-a)/2
fsb.last=sampled[fsb.idx]-fsb.idx if len(sampled)>0 else 0
return fsb.last
while len(sampled)<c:
sample_index=random.randrange(c-len(sampled))
a,b=0,len(sampled)-1
if fsb(a,a)>sample_index:
yielding=sample_index
sampled.insert(0,yielding)
yield yielding
elif fsb(b,b)<sample_index+1:
yielding=len(sampled)+sample_index
sampled.insert(len(sampled),yielding)
yield yielding
else: # sample_index falls inside sampled list
while a+1<b:
if fsb(a,b)<sample_index+1:
a=fsb.idx
else:
b=fsb.idx
yielding=a+1+sample_index
sampled.insert(a+1,yielding)
yield yielding
You are stating that you store the numbers in a database.
Wouldn't it then be easier to store all the numbers there, and ask the database for a random unused number?
Most databases support such a request.
Examples
MySQL:
SELECT column FROM table
ORDER BY RAND()
LIMIT 1
PostgreSQL:
SELECT column FROM table
ORDER BY RANDOM()
LIMIT 1

Fastest way to search 1GB+ a string of data for the first occurrence of a pattern in Python

There's a 1 Gigabyte string of arbitrary data which you can assume to be equivalent to something like:
1_gb_string=os.urandom(1*gigabyte)
We will be searching this string, 1_gb_string, for an infinite number of fixed width, 1 kilobyte patterns, 1_kb_pattern. Every time we search the pattern will be different. So caching opportunities are not apparent. The same 1 gigabyte string will be searched over and over. Here is a simple generator to describe what's happening:
def findit(1_gb_string):
1_kb_pattern=get_next_pattern()
yield 1_gb_string.find(1_kb_pattern)
Note that only the first occurrence of the pattern needs to be found. After that, no other major processing should be done.
What can I use that's faster than python's bultin find for matching 1KB patterns against 1GB or greater data strings?
(I am already aware of how to split up the string and searching it in parallel, so you can disregard that basic optimization.)
Update: Please bound memory requirements to 16GB.
As you clarify that long-ish preprocessing is acceptable, I'd suggest a variant of Rabin-Karp: "an algorithm of choice for multiple pattern search", as wikipedia puts it.
Define a "rolling hash" function, i.e., one such that, when you know the hash for haystack[x:x+N], computing the hash for haystack[x+1:x+N+1] is O(1). (Normal hashing functions such as Python's built-in hash do not have this property, which is why you have to write your own, otherwise the preprocessing becomes exhaustingly long rather than merely long-ish;-). A polynomial approach is fruitful, and you could use, say, 30-bit hash results (by masking if needed, i.e., you can do the computation w/more precision and just store the masked 30 bits of choice). Let's call this rolling hash function RH for clarity.
So, compute 1G of RH results as you roll along the haystack 1GB string; if you just stored these it would give you an array H of 1G 30-bit values (4GB) mapping index-in-haystack->RH value. But you want the reverse mapping, so use instead an array A of 2**30 entries (1G entries) that for each RH value gives you all the indices of interest in the haystack (indices at which that RH value occurs); for each entry you store the index of the first possibly-interesting haystack index into another array B of 1G indices into the haystack which is ordered to keep all indices into haystack with identical RH values ("collisions" in hashing terms) adjacent. H, A and B both have 1G entries of 4 bytes each, so 12GB total.
Now for each incoming 1K needle, compute its RH, call it k, and use it as an index into A; A[k] gives you the first index b into B at which it's worth comparing. So, do:
ib = A[k]
b = B[ib]
while b < len(haystack) - 1024:
if H[b] != k: return "not found"
if needle == haystack[b:b+1024]: return "found at", b
ib += 1
b = B[ib]
with a good RH you should have few collisions, so the while should execute very few times until returning one way or another. So each needle-search should be really really fast.
There are a number of string matching algorithms use in the field of genetics to find substrings. You might try this paper or this paper
As far as I know, standard find algorithm is naive algorithm with complexity about n*m comparisons, because
it checks patterns against every possible offset. There are some more effective algoithms, requiring about n+m comparisons.
If your string is not a natural language string, you can try
Knuth–Morris–Pratt algorithm . Boyer–Moore search algorithm is fast and simple enough too.
Are you willing to spend a significant time preprocessing the string?
If you are, what you can do is build a list of n-grams with offsets.
Suppose your alphabet is hex bytes and you are using 1-grams.
Then for 00-ff, you can create a dictionary that looks like this(perlese, sorry)
$offset_list{00} = #array_of_offsets
$offset_list{01} = #...etc
where you walk down the string and build the #array_of_offsets from all points where bytes happen. You can do this for arbitrary n-grams.
This provides a "start point for search" that you can use to walk.
Of course, the downside is that you have to preprocess the string, so that's your tradeoff.
edit:
The basic idea here is to match prefixes. This may bomb out badly if the information is super-similar, but if it has a fair amount of divergence between n-grams, you should be able to match prefixes pretty well.
Let's quantify divergence, since you've not discussed the kind of info you're analyzing. For the purposes of this algorithm, we can characterize divergence as a distance function: you need a decently high Hamming distance. If the hamming distance between n-grams is, say, 1, the above idea won't work. But if it's n-1, the above algorithm will be much easier.
To improve on my algorithm, let's build an algorithm that does some successive elimination of possibilities:
We can invoke Shannon Entropy to define information of a given n-gram. Take your search string and successively build a prefix based upon the first m characters. When the entropy of the m-prefix is 'sufficiently high', use it later.
Define p to be an m-prefix of the search string
Search your 1 GB string and create an array of offsets that match p.
Extend the m-prefix to be some k-prefix, k > m, entropy of k-prefix higher than m-prefix.
Keep the elements offset array defined above, such that they match the k-prefix string. Discard the non-matching elements.
Goto 4 until the entire search string is met.
In a sense, this is like reversing Huffman encoding.
With infinite memory, you can hash every 1k string along with its position in the 1 GB file.
With less than infinite memory, you will be bounded by how many memory pages you touch when searching.
I don't know definitively if the find() method for strings is faster than the search() method provided by Python's re (regular expressions) module, but there's only one way to find out.
If you're just searching a string, what you want is this:
import re
def findit(1_gb_string):
yield re.search(1_kb_pattern, 1_gb_string)
However, if you really only want the first match, you might be better off using finditer(), which returns an iterator, and with such large operations might actually be better.
http://www.youtube.com/watch?v=V5hZoJ6uK-s
Will be of most value to you. Its an MIT lecture on Dynamic Programming
If the patterns are fairly random, you can precompute the location of n-prefixes of strings.
Instead of going over all options for n-prefixes, just use the actual ones in the 1GB string - there will be less than 1Gig of those. Use as big a prefix as fits in your memory, I don't have 16GB RAM to check but a prefix of 4 could work (at least in a memory-efficient data structures), if not try 3 or even 2.
For a random 1GB string and random 1KB patterns, you should get a few 10s of locations per prefix if you use 3-byte prefixes, but 4-byte prefixes should get you an average of 0 or 1 , so lookup should be fast.
Precompute Locations
def find_all(pattern, string):
cur_loc = 0
while True:
next_loc = string.find(pattern, cur_loc)
if next_loc < 0: return
yield next_loc
cur_loc = next_loc+1
big_string = ...
CHUNK_SIZE = 1024
PREFIX_SIZE = 4
precomputed_indices = {}
for i in xrange(len(big_string)-CHUNK_SIZE):
prefix = big_string[i:i+PREFIX_SIZE]
if prefix not in precomputed_indices:
precomputed_indices[prefix] = tuple(find_all(prefix, big_string))
Look up a pattern
def find_pattern(pattern):
prefix = pattern[:PREFIX_SIZE]
# optimization - big prefixes will result in many misses
if prefix not in precomputed_indices:
return -1
for loc in precomputed_indices[prefix]:
if big_string[loc:loc+CHUNK_SIZE] == pattern:
return loc
return -1
Someone hinted at a possible way to index this thing if you have abundant RAM (or possibly even disk/swap) available.
Imagine if you performed a simple 32-bit CRC on a 1K block extending from each character in the original Gig string. This would result in 4bytes of checksum data for each byte offset from the beginning of the data.
By itself this might give a modest improvement in search speed. The checksum of each 1K search target could be checked against each CRC ... which each collision tested for a true match. That should still be a couple orders of magnitude faster than a normal linear search.
That, obviously, costs us 4GB of RAM for he CRC array (plus the original Gig for the original data and a little more overhead for the environment and our program).
If we have ~16GB we could sort the checksums and store a list of offsets where each is found. That becomes an indexed search (average of about 16 probes per target search ... worst case around 32 or 33 (might be a fence post there).
It's possible that a 16BG file index would still give better performance than a linear checksum search and it would almost certainly be better than a linear raw search (unless you have extremely slow filesystems/storage).
(Adding): I should clarify that this strategy is only beneficial given that you've described a need to do many searches on the same one gigabyte data blob.
You might use a threaded approach to building the index (while reading it as well as having multiple threads performing the checksumming). You might also offload the indexing into separate processes or a cluster of nodes (particularly if you use a file-based index --- the ~16GB option described above). With a simple 32-bit CRC you might be able to perform the checksums/indexing as fast as your reader thread can get the data (but we are talking about 1024 checksums for each 1K of data so perhaps not).
You might further improve performance by coding a Python module in C for actually performing the search ... and/or possibly for performing the checksumming/indexing.
The development and testing of such C extensions entail other trade-offs, obviously enough. It sounds like this would have near zero re-usability.
One efficient but complex way is full-text indexing with the Burrows-Wheeler transform. It involves performing a BWT on your source text, then using a small index on that to quickly find any substring in the text that matches your input pattern.
The time complexity of this algorithm is roughly O(n) with the length of the string you're matching - and independent of the length of the input string! Further, the size of the index is not much larger than the input data, and with compression can even be reduced below the size of the source text.

Categories