Create a list with initial capacity in Python

Code like this often happens:
l = []
while foo:
    # baz
    l.append(bar)
    # qux
This is really slow if you're about to append thousands of elements to your list, as the list will have to be constantly resized to fit the new elements.
In Java, you can create an ArrayList with an initial capacity. If you have some idea how big your list will be, this will be a lot more efficient.
I understand that code like this can often be refactored into a list comprehension. If the for/while loop is very complicated, though, this is unfeasible. Is there an equivalent for us Python programmers?

Warning: This answer is contested. See comments.
def doAppend(size=10000):
    result = []
    for i in range(size):
        message = "some unique object %d" % (i,)
        result.append(message)
    return result

def doAllocate(size=10000):
    result = size * [None]
    for i in range(size):
        message = "some unique object %d" % (i,)
        result[i] = message
    return result
Results (each function evaluated 144 times, durations averaged):
simple append 0.0102
pre-allocate 0.0098
Conclusion. It barely matters.
Premature optimization is the root of all evil.

Python lists have no built-in pre-allocation. If you really need to make a list, and need to avoid the overhead of appending (and you should verify that you do), you can do this:
l = [None] * 1000  # Make a list of 1000 Nones
for i in xrange(1000):
    # baz
    l[i] = bar
    # qux
Perhaps you could avoid the list by using a generator instead:
def my_things():
    while foo:
        # baz
        yield bar
        # qux

for thing in my_things():
    # do something with thing
This way, the list is never stored all in memory at all; it is merely generated as needed.

Short version: use
pre_allocated_list = [None] * size
to preallocate a list (that is, to be able to address 'size' elements of the list instead of gradually forming the list by appending). This operation is very fast, even on big lists. Allocating new objects that will be later assigned to list elements will take much longer and will be the bottleneck in your program, performance-wise.
Long version:
I think that initialization time should be taken into account.
Since in Python everything is a reference, it doesn't matter whether you set each element to None or to some string - either way it's only a reference. Though it will take longer if you want to create a new object for each element to reference.
For Python 3.2:
import time
import copy

def print_timing(func):
    def wrapper(*arg):
        t1 = time.time()
        res = func(*arg)
        t2 = time.time()
        print("{} took {} ms".format(func.__name__, (t2 - t1) * 1000.0))
        return res
    return wrapper

@print_timing
def prealloc_array(size, init=None, cp=True, cpmethod=copy.deepcopy, cpargs=(), use_num=False):
    result = [None] * size
    if init is not None:
        if cp:
            for i in range(size):
                result[i] = init
        else:
            if use_num:
                for i in range(size):
                    result[i] = cpmethod(i)
            else:
                for i in range(size):
                    result[i] = cpmethod(cpargs)
    return result

@print_timing
def prealloc_array_by_appending(size):
    result = []
    for i in range(size):
        result.append(None)
    return result

@print_timing
def prealloc_array_by_extending(size):
    result = []
    none_list = [None]
    for i in range(size):
        result.extend(none_list)
    return result

def main():
    n = 1000000
    x = prealloc_array_by_appending(n)
    y = prealloc_array_by_extending(n)
    a = prealloc_array(n, None)
    b = prealloc_array(n, "content", True)
    c = prealloc_array(n, "content", False, "some object {}".format, ("blah"), False)
    d = prealloc_array(n, "content", False, "some object {}".format, None, True)
    e = prealloc_array(n, "content", False, copy.deepcopy, "a", False)
    f = prealloc_array(n, "content", False, copy.deepcopy, (), False)
    g = prealloc_array(n, "content", False, copy.deepcopy, [], False)

    print("x[5] = {}".format(x[5]))
    print("y[5] = {}".format(y[5]))
    print("a[5] = {}".format(a[5]))
    print("b[5] = {}".format(b[5]))
    print("c[5] = {}".format(c[5]))
    print("d[5] = {}".format(d[5]))
    print("e[5] = {}".format(e[5]))
    print("f[5] = {}".format(f[5]))
    print("g[5] = {}".format(g[5]))

if __name__ == '__main__':
    main()
Evaluation:
prealloc_array_by_appending took 118.00003051757812 ms
prealloc_array_by_extending took 102.99992561340332 ms
prealloc_array took 3.000020980834961 ms
prealloc_array took 49.00002479553223 ms
prealloc_array took 316.9999122619629 ms
prealloc_array took 473.00004959106445 ms
prealloc_array took 1677.9999732971191 ms
prealloc_array took 2729.999780654907 ms
prealloc_array took 3001.999855041504 ms
x[5] = None
y[5] = None
a[5] = None
b[5] = content
c[5] = some object blah
d[5] = some object 5
e[5] = a
f[5] = []
g[5] = ()
As you can see, just making a big list of references to the same None object takes very little time.
Appending or extending takes longer (I didn't average anything, but after running this a few times I can tell you that extending and appending take roughly the same time).
Allocating a new object for each element - that is what takes the most time. And S.Lott's answer does that - it formats a new string every time. Which is not strictly required - if you want to preallocate some space, just make a list of None, then assign data to the list elements at will. Either way it takes more time to generate the data than to append/extend a list, whether you generate it while creating the list or after. But if you want a sparsely populated list, then starting with a list of None is definitely faster.

The Pythonic way for this is:
x = [None] * numElements
Or whatever default value you wish to prepopulate with, e.g.
bottles = [Beer()] * 99
sea = [Fish()] * many
vegetarianPizzas = [None] * peopleOrderingPizzaNotQuiche
(Caveat Emptor: The [Beer()] * 99 syntax creates one Beer and then populates an array with 99 references to the same single instance)
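To see why that matters, here is a small, hypothetical illustration (the Beer class below is a stand-in, not from the original answer): with * every slot aliases one instance, while a list comprehension creates distinct objects.
class Beer:
    def __init__(self):
        self.sips_left = 10

shared = [Beer()] * 3                   # three references to one Beer
shared[0].sips_left -= 1
print([b.sips_left for b in shared])    # [9, 9, 9] - they all "changed"

separate = [Beer() for _ in range(3)]   # three distinct Beers
separate[0].sips_left -= 1
print([b.sips_left for b in separate])  # [9, 10, 10]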
Python's default approach can be pretty efficient, although that efficiency decays as you increase the number of elements.
Compare
import time

class Timer(object):
    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, *args):
        end = time.time()
        secs = end - self.start
        msecs = secs * 1000  # Milliseconds
        print('%fms' % msecs)

Elements = 100000
Iterations = 144

print('Elements: %d, Iterations: %d' % (Elements, Iterations))

def doAppend():
    result = []
    i = 0
    while i < Elements:
        result.append(i)
        i += 1

def doAllocate():
    result = [None] * Elements
    i = 0
    while i < Elements:
        result[i] = i
        i += 1

def doGenerator():
    return list(i for i in range(Elements))

def test(name, fn):
    print("%s: " % name, end="")
    with Timer() as t:
        x = 0
        while x < Iterations:
            fn()
            x += 1

test('doAppend', doAppend)
test('doAllocate', doAllocate)
test('doGenerator', doGenerator)
with
#include <vector>

typedef std::vector<unsigned int> Vec;

static const unsigned int Elements = 100000;
static const unsigned int Iterations = 144;

void doAppend()
{
    Vec v;
    for (unsigned int i = 0; i < Elements; ++i) {
        v.push_back(i);
    }
}

void doReserve()
{
    Vec v;
    v.reserve(Elements);
    for (unsigned int i = 0; i < Elements; ++i) {
        v.push_back(i);
    }
}

void doAllocate()
{
    Vec v;
    v.resize(Elements);
    for (unsigned int i = 0; i < Elements; ++i) {
        v[i] = i;
    }
}

#include <iostream>
#include <chrono>
using namespace std;

void test(const char* name, void(*fn)(void))
{
    cout << name << ": ";
    auto start = chrono::high_resolution_clock::now();
    for (unsigned int i = 0; i < Iterations; ++i) {
        fn();
    }
    auto end = chrono::high_resolution_clock::now();
    auto elapsed = end - start;
    cout << chrono::duration<double, milli>(elapsed).count() << "ms\n";
}

int main()
{
    cout << "Elements: " << Elements << ", Iterations: " << Iterations << '\n';
    test("doAppend", doAppend);
    test("doReserve", doReserve);
    test("doAllocate", doAllocate);
}
On my Windows 7 Core i7, 64-bit Python gives
Elements: 100000, Iterations: 144
doAppend: 3587.204933ms
doAllocate: 2701.154947ms
doGenerator: 1721.098185ms
While C++ gives (built with Microsoft Visual C++, 64-bit, optimizations enabled)
Elements: 100000, Iterations: 144
doAppend: 74.0042ms
doReserve: 27.0015ms
doAllocate: 5.0003ms
C++ debug build produces:
Elements: 100000, Iterations: 144
doAppend: 2166.12ms
doReserve: 2082.12ms
doAllocate: 273.016ms
The point here is that with Python you can achieve a performance improvement of roughly 25% with these numbers, and if you think you're writing a high-performance application (or something that is used in a web service) then that isn't to be sniffed at, but you may need to rethink your choice of language.
Also, the Python code here isn't really idiomatic Python. Switching to more Pythonic code gives better performance:
import time

class Timer(object):
    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, *args):
        end = time.time()
        secs = end - self.start
        msecs = secs * 1000  # Milliseconds
        print('%fms' % msecs)

Elements = 100000
Iterations = 144

print('Elements: %d, Iterations: %d' % (Elements, Iterations))

def doAppend():
    for x in range(Iterations):
        result = []
        for i in range(Elements):
            result.append(i)

def doAllocate():
    for x in range(Iterations):
        result = [None] * Elements
        for i in range(Elements):
            result[i] = i

def doGenerator():
    for x in range(Iterations):
        result = list(i for i in range(Elements))

def test(name, fn):
    print("%s: " % name, end="")
    with Timer() as t:
        fn()

test('doAppend', doAppend)
test('doAllocate', doAllocate)
test('doGenerator', doGenerator)
Which gives
Elements: 100000, Iterations: 144
doAppend: 2153.122902ms
doAllocate: 1346.076965ms
doGenerator: 1614.092112ms
(in 32-bit, doGenerator does better than doAllocate).
Here the gap between doAppend and doAllocate is significantly larger.
Obviously, the differences here really only apply if you are doing this more than a handful of times or if you are doing this on a heavily loaded system where those numbers are going to get scaled out by orders of magnitude, or if you are dealing with considerably larger lists.
The point here: Do it the Pythonic way for the best performance.
But if you are worrying about general, high-level performance, Python is the wrong language. The most fundamental problem is that Python function calls have traditionally been up to 300x slower than in other languages, due to Python features like decorators, etc. (PythonSpeed/PerformanceTips, Data Aggregation).

As others have mentioned, the simplest way to pre-seed a list is with None.
That being said, you should understand how Python lists actually work before deciding this is necessary.
In the CPython implementation of a list, the underlying array is always created with extra room, in progressively larger sizes (4, 8, 16, 25, 35, 46, 58, 72, 88, 106, 126, 148, 173, 201, 233, 269, 309, 354, 405, 462, 526, 598, 679, 771, 874, 990, 1120, etc.), so that resizing the list does not happen nearly so often.
Because of this behavior, most list.append() calls are O(1); the complexity only increases when crossing one of these boundaries, at which point it is O(n). This behavior is what leads to the minimal increase in execution time in S.Lott's answer.
Source: Python list implementation
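A rough way to observe this over-allocation is to watch sys.getsizeof() jump as a list grows; the exact sizes vary by CPython version and platform, so treat the output as illustrative only:
import sys

l = []
last = sys.getsizeof(l)
for i in range(64):
    l.append(None)
    size = sys.getsizeof(l)
    if size != last:
        # the allocated size grew: report the length at which the resize happened
        print(f"len={len(l):>3}  size={size} bytes")
        last = size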

I ran S.Lott's code and produced the same 10% performance increase by preallocating. I tried Ned Batchelder's idea using a generator, and the generator performed better than doAllocate. For my project the 10% improvement matters, so thanks to everyone, as this helps a bunch.
def doAppend(size=10000):
    result = []
    for i in range(size):
        message = "some unique object %d" % (i,)
        result.append(message)
    return result

def doAllocate(size=10000):
    result = size * [None]
    for i in range(size):
        message = "some unique object %d" % (i,)
        result[i] = message
    return result

def doGen(size=10000):
    return list("some unique object %d" % (i,) for i in xrange(size))

size = 1000

@print_timing
def testAppend():
    for i in xrange(size):
        doAppend()

@print_timing
def testAlloc():
    for i in xrange(size):
        doAllocate()

@print_timing
def testGen():
    for i in xrange(size):
        doGen()

testAppend()
testAlloc()
testGen()
Output
testAppend took 14440.000ms
testAlloc took 13580.000ms
testGen took 13430.000ms

Concerns about preallocation in Python arise if you're working with NumPy, which has more C-like arrays. In this instance, preallocation concerns are about the shape of the data and the default value.
Consider NumPy if you're doing numerical computation on massive lists and want performance.
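For example, a minimal sketch of preallocating NumPy arrays where the shape and default value are chosen up front (the sizes and values here are arbitrary):
import numpy as np

zeros = np.zeros((1000, 3))                  # 1000 x 3 array of 0.0
filled = np.full(1000, -1, dtype=np.int64)   # 1000 ints, all -1
uninit = np.empty(1000)                      # uninitialized - fastest, but contains garbage

filled[42] = 7                               # assign into a preallocated slot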

Python's list doesn't support preallocation. Numpy allows you to preallocate memory, but in practice it doesn't seem to be worth it if your goal is to speed up the program.
This test simply writes an integer into the list, but in a real application you'd likely do more complicated things per iteration, which further reduces the importance of the memory allocation.
import timeit
import numpy as np
def list_append(size=1_000_000):
    result = []
    for i in range(size):
        result.append(i)
    return result

def list_prealloc(size=1_000_000):
    result = [None] * size
    for i in range(size):
        result[i] = i
    return result

def numpy_prealloc(size=1_000_000):
    result = np.empty(size, np.int32)
    for i in range(size):
        result[i] = i
    return result
setup = 'from __main__ import list_append, list_prealloc, numpy_prealloc'
print(timeit.timeit('list_append()', setup=setup, number=10)) # 0.79
print(timeit.timeit('list_prealloc()', setup=setup, number=10)) # 0.62
print(timeit.timeit('numpy_prealloc()', setup=setup, number=10)) # 0.73
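The loops above fill the NumPy array one Python-level assignment at a time, which hides most of NumPy's advantage; if the fill itself can be vectorized, the gap is much larger. A hedged sketch (timings will vary by machine):
import numpy as np
import timeit

def numpy_vectorized(size=1_000_000):
    # the fill happens in C, with no per-element Python loop
    return np.arange(size, dtype=np.int32)

print(timeit.timeit(numpy_vectorized, number=10))  # typically far below the looped versions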

For some applications, a dictionary may be what you are looking for. For example, in the find_totient method, I found it more convenient to use a dictionary since I didn't have a zero index.
import math

def totient(n):
    totient = 0
    if n == 1:
        totient = 1
    else:
        for i in range(1, n):
            if math.gcd(i, n) == 1:
                totient += 1
    return totient

def find_totients(max):
    totients = dict()
    for i in range(1, max + 1):
        totients[i] = totient(i)

    print('Totients:')
    for i in range(1, max + 1):
        print(i, totients[i])
This problem could also be solved with a preallocated list:
def find_totients(max):
    totients = [None] * (max + 1)
    for i in range(1, max + 1):
        totients[i] = totient(i)

    print('Totients:')
    for i in range(1, max + 1):
        print(i, totients[i])
I feel that this is less elegant and more prone to bugs, because I'm storing None values, which could throw an exception if I accidentally use them the wrong way, and because I need to think about edge cases that the dictionary lets me avoid.
It's true the dictionary won't be as efficient, but as others have commented, small differences in speed are not always worth significant maintenance hazards.

Fastest way: use *, as in list1 = [False] * 1_000_000.
Comparing all the common methods (list appending vs. preallocation vs. for vs. while), I found that using * gives the most efficient execution time.
import time

large_int = 10_000_000
start_time = time.time()

# Test 1: List comprehension
l1 = [False for _ in range(large_int)]
end_time_1 = time.time()

# Test 2: Using *
l2 = [False] * large_int
end_time_2 = time.time()

# Test 3: Using append with for loop & range
l3 = []
for _ in range(large_int):
    l3.append(False)
end_time_3 = time.time()

# Test 4: Using append with while loop
l4, i = [], 0
while i < large_int:
    l4.append(False)
    i += 1
end_time_4 = time.time()

# Results
diff_1 = end_time_1 - start_time
diff_2 = end_time_2 - end_time_1
diff_3 = end_time_3 - end_time_2
diff_4 = end_time_4 - end_time_3
print(f"Test 1. {diff_1:.4f} seconds")
print(f"Test 2. {diff_2:.4f} seconds")
print(f"Test 3. {diff_3:.4f} seconds")
print(f"Test 4. {diff_4:.4f} seconds")
print("\nTest 2 is faster than - ")
print(f" Test 1 by - {(diff_1 / diff_2 - 1) * 100:,.0f}%")
print(f" Test 3 by - {(diff_3 / diff_2 - 1) * 100:,.0f}%")
print(f" Test 4 by - {(diff_4 / diff_2 - 1) * 100:,.0f}%")

From what I understand, Python lists are already quite similar to ArrayLists. But if you want to tweak those parameters I found this post on the Internet that may be interesting (basically, just create your own ScalableList extension):
http://mail.python.org/pipermail/python-list/2000-May/035082.html

Related

Python string comparison doesn't short circuit?

The usual saying is that string comparison must be done in constant time when checking things like passwords or hashes, and thus it is recommended to avoid a == b.
However, I ran the following script and the results don't support the hypothesis that a == b short-circuits on the first non-identical character.
from time import perf_counter_ns
import random

def timed_cmp(a, b):
    start = perf_counter_ns()
    a == b
    end = perf_counter_ns()
    return end - start

def n_timed_cmp(n, a, b):
    "average time for a==b done n times"
    ts = [timed_cmp(a, b) for _ in range(n)]
    return sum(ts) / len(ts)

def check_cmp_time():
    random.seed(123)

    # generate a random string of n characters
    n = 2 ** 8
    s = "".join([chr(random.randint(ord("a"), ord("z"))) for _ in range(n)])

    # generate a list of strings, which all differs from the original string
    # by one character, at a different position
    # only do that for the first 50 char, it's enough to get data
    diffs = [s[:i] + "A" + s[i+1:] for i in range(min(50, n))]

    timed = [(i, n_timed_cmp(10000, s, d)) for (i, d) in enumerate(diffs)]
    sorted_timed = sorted(timed, key=lambda t: t[1])

    # print the 10 fastest
    for x in sorted_timed[:10]:
        i, t = x
        print("{}\t{:3f}".format(i, t))

    print("---")
    i, t = timed[0]
    print("{}\t{:3f}".format(i, t))
    i, t = timed[1]
    print("{}\t{:3f}".format(i, t))

if __name__ == "__main__":
    check_cmp_time()
Here is the result of a run, re-running the script gives slightly different results, but nothing satisfactory.
# ran with cpython 3.8.3
6 78.051700
1 78.203200
15 78.222700
14 78.384800
11 78.396300
12 78.441800
9 78.476900
13 78.519000
8 78.586200
3 78.631500
---
0 80.691100
1 78.203200
I would've expected that the fastest comparison would be where the first differing character is at the beginning of the string, but it's not what I get.
Any idea what's going on ???
There's a difference, you just don't see it on such small strings. Here's a small patch to apply to your code: I use longer strings, and I do 10 checks, putting the A at evenly spaced positions in the original string, from the beginning to the end, like this:
A_______________________________________________________________
______A_________________________________________________________
____________A___________________________________________________
__________________A_____________________________________________
________________________A_______________________________________
______________________________A_________________________________
____________________________________A___________________________
__________________________________________A_____________________
________________________________________________A_______________
______________________________________________________A_________
____________________________________________________________A___
@@ -15,13 +15,13 @@ def n_timed_cmp(n, a, b):
 def check_cmp_time():
     random.seed(123)
     # generate a random string of n characters
-    n = 2 ** 8
+    n = 2 ** 16
     s = "".join([chr(random.randint(ord("a"), ord("z"))) for _ in range(n)])
     # generate a list of strings, which all differs from the original string
     # by one character, at a different position
     # only do that for the first 50 char, it's enough to get data
-    diffs = [s[:i] + "A" + s[i+1:] for i in range(min(50, n))]
+    diffs = [s[:i] + "A" + s[i+1:] for i in range(0, n, n // 10)]
     timed = [(i, n_timed_cmp(10000, s, d)) for (i, d) in enumerate(diffs)]
     sorted_timed = sorted(timed, key=lambda t: t[1])
and you'll get:
0 122.621000
1 213.465700
2 380.214100
3 460.422000
5 694.278700
4 722.010000
7 894.630300
6 1020.722100
9 1149.473000
8 1341.754500
---
0 122.621000
1 213.465700
Note that with your example, with only 2**8 characters, it's already noticeable; apply this patch:
@@ -21,7 +21,7 @@ def check_cmp_time():
     # generate a list of strings, which all differs from the original string
     # by one character, at a different position
     # only do that for the first 50 char, it's enough to get data
-    diffs = [s[:i] + "A" + s[i+1:] for i in range(min(50, n))]
+    diffs = [s[:i] + "A" + s[i+1:] for i in [0, n - 1]]
     timed = [(i, n_timed_cmp(10000, s, d)) for (i, d) in enumerate(diffs)]
     sorted_timed = sorted(timed, key=lambda t: t[1])
to only keep the two extreme cases (first letter change vs last letter change) and you'll get:
$ python3 cmp.py
0 124.131800
1 135.566000
Numbers may vary, but most of the time test 0 is a tad faster than test 1.
To isolate more precisely which character is modified, it's possible as long as memcmp compares character by character - that is, as long as it does not use integer comparisons - typically on the last characters if they are misaligned, or on really short strings, like the 8-character string I demo here:
from time import perf_counter_ns
from statistics import median
import random

def check_cmp_time():
    random.seed(123)

    # generate a random string of n characters
    n = 8
    s = "".join([chr(random.randint(ord("a"), ord("z"))) for _ in range(n)])

    # generate a list of strings, which all differs from the original string
    # by one character, at a different position
    diffs = [s[:i] + "A" + s[i + 1:] for i in range(n)]

    values = {x: [] for x in range(n)}
    for _ in range(10_000_000):
        for i, diff in enumerate(diffs):
            start = perf_counter_ns()
            s == diff
            values[i].append(perf_counter_ns() - start)

    timed = [[k, median(v)] for k, v in values.items()]
    sorted_timed = sorted(timed, key=lambda t: t[1])

    # print the 10 fastest
    for x in sorted_timed[:10]:
        i, t = x
        print("{}\t{:3f}".format(i, t))

    print("---")
    i, t = timed[0]
    print("{}\t{:3f}".format(i, t))
    i, t = timed[1]
    print("{}\t{:3f}".format(i, t))

if __name__ == "__main__":
    check_cmp_time()
Which gives me:
1 221.000000
2 222.000000
3 223.000000
4 223.000000
5 223.000000
6 223.000000
7 223.000000
0 241.000000
The differences are so small, Python and perf_counter_ns may no longer be the right tools here.
See, to know why it doesn't short circuit, you'll have to do some digging. The simple answer is, of course, that it doesn't short circuit because the standard doesn't specify so. But you might think, "Why wouldn't the implementations choose to short circuit? Surely, it must be faster!". Not quite.
Let's take a look at CPython, for obvious reasons. Look at the code for the unicode_compare_eq function defined in unicodeobject.c:
static int
unicode_compare_eq(PyObject *str1, PyObject *str2)
{
    int kind;
    void *data1, *data2;
    Py_ssize_t len;
    int cmp;

    len = PyUnicode_GET_LENGTH(str1);
    if (PyUnicode_GET_LENGTH(str2) != len)
        return 0;
    kind = PyUnicode_KIND(str1);
    if (PyUnicode_KIND(str2) != kind)
        return 0;
    data1 = PyUnicode_DATA(str1);
    data2 = PyUnicode_DATA(str2);

    cmp = memcmp(data1, data2, len * kind);
    return (cmp == 0);
}
(Note: This function is actually called after deducing that str1 and str2 are not the same object - if they are - well that's just a simple True immediately)
Focus on this line specifically-
cmp = memcmp(data1, data2, len * kind);
Ahh, we're back at another crossroads. Does memcmp short circuit? The C standard does not specify such a requirement. As seen in the opengroup docs and also in Section 7.24.4.1 of the C Standard Draft:
7.24.4.1 The memcmp function
Synopsis
#include <string.h>
int memcmp(const void *s1, const void *s2, size_t n);
Description
The memcmp function compares the first n characters of the object pointed to by s1 to
the first n characters of the object pointed to by s2.
Returns
The memcmp function returns an integer greater than, equal to, or less than zero,
accordingly as the object pointed to by s1 is greater than, equal to, or less than the object pointed to by s2.
Some C implementations (including glibc) choose not to short circuit. But why? Are we missing something? Why would you not short circuit?
Because the comparison they use might not be as naive as a byte-by-byte check. The standard does not require the objects to be compared byte by byte. Therein lies the chance for optimization.
What glibc does is compare elements of type unsigned long int instead of single bytes represented by unsigned char. Check out the implementation.
There's a lot more going on under the hood - a discussion far outside the scope of this question; after all, this isn't even tagged as a C question ;). Though I found that this answer may be worth a look. But just know, the optimization is there, just in a much different form than the approach that may come to mind at first glance.
Edit: Fixed wrong function link
Edit: As @Konrad Rudolph has stated, glibc memcmp does apparently short circuit. I've been misinformed.

Given a string of a million numbers, return all repeating 3 digit numbers

I had an interview with a hedge fund company in New York a few months ago and unfortunately, I did not get the internship offer as a data/software engineer. (They also asked the solution to be in Python.)
I pretty much screwed up on the first interview problem...
Question: Given a string of a million numbers (Pi, for example), write a function/program that returns all repeating 3-digit numbers with a repetition count greater than 1.
For example: if the string was: 123412345123456 then the function/program would return:
123 - 3 times
234 - 3 times
345 - 2 times
They did not give me the solution after I failed the interview, but they did tell me that the time complexity for the solution was constant of 1000 since all the possible outcomes are between:
000 --> 999
Now that I'm thinking about it, I don't think it's possible to come up with a constant time algorithm. Is it?
You got off lightly, you probably don't want to be working for a hedge fund where the quants don't understand basic algorithms :-)
There is no way to process an arbitrarily-sized data structure in O(1) if, as in this case, you need to visit every element at least once. The best you can hope for is O(n) in this case, where n is the length of the string.
Although, as an aside, a nominal O(n) algorithm will be O(1) for a fixed input size so, technically, they may have been correct here. However, that's not usually how people use complexity analysis.
It appears to me you could have impressed them in a number of ways.
First, by informing them that it's not possible to do it in O(1), unless you use the "suspect" reasoning given above.
Second, by showing your elite skills by providing Pythonic code such as:
inpStr = '123412345123456'

# O(1) array creation.
freq = [0] * 1000

# O(n) string processing.
for val in [int(inpStr[pos:pos+3]) for pos in range(len(inpStr) - 2)]:
    freq[val] += 1

# O(1) output of relevant array values.
print([(num, freq[num]) for num in range(1000) if freq[num] > 1])
This outputs:
[(123, 3), (234, 3), (345, 2)]
though you could, of course, modify the output format to anything you desire.
And, finally, by telling them there's almost certainly no problem with an O(n) solution, since the code above delivers results for a one-million-digit string in well under half a second. It seems to scale quite linearly as well, since a 10,000,000-character string takes 3.5 seconds and a 100,000,000-character one takes 36 seconds.
And, if they need better than that, there are ways to parallelise this sort of stuff that can greatly speed it up.
Not within a single Python interpreter of course, due to the GIL, but you could split the string into something like this (the overlap marked with vv is required to allow proper processing of the boundary areas):
    vv
123412
    123451
        5123456
You can farm these out to separate workers and combine the results afterwards.
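A minimal sketch of that farming-out idea with multiprocessing, assuming the helper names below (they are mine, not from the original answer); chunks overlap by two characters so boundary triplets are not lost:
from collections import Counter
from multiprocessing import Pool

def count_triples(chunk):
    return Counter(chunk[i:i+3] for i in range(len(chunk) - 2))

def chunks_with_overlap(s, parts):
    step = max(1, len(s) // parts)
    for start in range(0, len(s), step):
        # the +2 overlap covers triplets that straddle a chunk boundary
        yield s[start:start + step + 2]

def parallel_count(s, parts=4):
    with Pool(parts) as pool:
        partials = pool.map(count_triples, chunks_with_overlap(s, parts))
    total = Counter()
    for c in partials:
        total += c
    return {k: v for k, v in total.items() if v > 1}

# usage sketch (on Windows this must run under an `if __name__ == "__main__":` guard):
# print(parallel_count("123412345123456"))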
The splitting of input and combining of output are likely to swamp any saving with small strings (and possibly even million-digit strings) but, for much larger data sets, it may well make a difference. My usual mantra of "measure, don't guess" applies here, of course.
This mantra also applies to other possibilities, such as bypassing Python altogether and using a different language which may be faster.
For example, the following C code, running on the same hardware as the earlier Python code, handles a hundred million digits in 0.6 seconds, roughly the same amount of time as the Python code processed one million. In other words, much faster:
#include <stdio.h>
#include <string.h>

int main(void) {
    static char inpStr[100000000+1];
    static int freq[1000];

    // Set up test data.
    memset(inpStr, '1', sizeof(inpStr));
    inpStr[sizeof(inpStr)-1] = '\0';

    // Need at least three digits to do anything useful.
    if (strlen(inpStr) <= 2) return 0;

    // Get initial feed from first two digits, process others.
    int val = (inpStr[0] - '0') * 10 + inpStr[1] - '0';
    char *inpPtr = &(inpStr[2]);
    while (*inpPtr != '\0') {
        // Remove hundreds, add next digit as units, adjust table.
        val = (val % 100) * 10 + *inpPtr++ - '0';
        freq[val]++;
    }

    // Output (relevant part of) table.
    for (int i = 0; i < 1000; ++i)
        if (freq[i] > 1)
            printf("%3d -> %d\n", i, freq[i]);

    return 0;
}
Constant time isn't possible. All 1 million digits need to be looked at at least once, so that is a time complexity of O(n), where n = 1 million in this case.
For a simple O(n) solution, create an array of size 1000 that represents the number of occurrences of each possible 3 digit number. Advance 1 digit at a time, first index == 0, last index == 999997, and increment array[3 digit number] to create a histogram (count of occurrences for each possible 3 digit number). Then output the content of the array with counts > 1.
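A short sketch of that histogram approach (the function name and formatting are mine):
def repeated_triples(digits):
    counts = [0] * 1000                       # one bin per possible 3-digit number
    for pos in range(len(digits) - 2):
        counts[int(digits[pos:pos + 3])] += 1
    return [(f"{n:03d}", c) for n, c in enumerate(counts) if c > 1]

print(repeated_triples("123412345123456"))
# [('123', 3), ('234', 3), ('345', 2)]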
A million is small for the answer I give below. Expecting only that you have to be able to run the solution in the interview, without a pause, the following works in less than two seconds and gives the required result:
from collections import Counter

def triple_counter(s):
    c = Counter(s[n-3: n] for n in range(3, len(s)))
    for tri, n in c.most_common():
        if n > 1:
            print('%s - %i times.' % (tri, n))
        else:
            break

if __name__ == '__main__':
    import random
    s = ''.join(random.choice('0123456789') for _ in range(1_000_000))
    triple_counter(s)
Hopefully the interviewer would be looking for use of the standard library's collections.Counter class.
Parallel execution version
I wrote a blog post on this with more explanation.
The simple O(n) solution would be to count each 3-digit number:
for nr in range(1000):
    cnt = text.count('%03d' % nr)
    if cnt > 1:
        print '%03d is found %d times' % (nr, cnt)
This would search through all 1 million digits 1000 times.
Traversing the digits only once:
counts = [0] * 1000
for idx in range(len(text)-2):
    counts[int(text[idx:idx+3])] += 1

for nr, cnt in enumerate(counts):
    if cnt > 1:
        print '%03d is found %d times' % (nr, cnt)
Timing shows that iterating only once over the index is twice as fast as using count.
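A quick, hedged way to check that claim with timeit; the random digit string and the exact ratio are machine-dependent:
import random
import timeit

text = ''.join(random.choice('0123456789') for _ in range(1_000_000))

def with_count():
    return [(nr, text.count('%03d' % nr)) for nr in range(1000)]

def single_pass():
    counts = [0] * 1000
    for idx in range(len(text) - 2):
        counts[int(text[idx:idx + 3])] += 1
    return counts

print(timeit.timeit(with_count, number=1))
print(timeit.timeit(single_pass, number=1))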
Here is a NumPy implementation of the "consensus" O(n) algorithm: walk through all triplets and bin as you go. The binning is done by, upon encountering say "385", adding one to bins[3, 8, 5], which is an O(1) operation. Bins are arranged in a 10x10x10 cube. As the binning is fully vectorized, there is no loop in the code.
def setup_data(n):
    import random
    digits = "0123456789"
    return dict(text=''.join(random.choice(digits) for i in range(n)))

def f_np(text):
    # Get the data into NumPy
    import numpy as np
    a = np.frombuffer(bytes(text, 'utf8'), dtype=np.uint8) - ord('0')
    # Rolling triplets
    a3 = np.lib.stride_tricks.as_strided(a, (3, a.size-2), 2*a.strides)

    bins = np.zeros((10, 10, 10), dtype=int)
    # Next line performs O(n) binning
    np.add.at(bins, tuple(a3), 1)
    # Filtering is left as an exercise
    return bins.ravel()

def f_py(text):
    counts = [0] * 1000
    for idx in range(len(text)-2):
        counts[int(text[idx:idx+3])] += 1
    return counts

import numpy as np
import types
from timeit import timeit

for n in (10, 1000, 1000000):
    data = setup_data(n)
    ref = f_np(**data)
    print(f'n = {n}')
    for name, func in list(globals().items()):
        if not name.startswith('f_') or not isinstance(func, types.FunctionType):
            continue
        try:
            assert np.all(ref == func(**data))
            print("{:16s}{:16.8f} ms".format(name[2:], timeit(
                'f(**data)', globals={'f': func, 'data': data}, number=10)*100))
        except:
            print("{:16s} apparently crashed".format(name[2:]))
Unsurprisingly, NumPy is a bit faster than @Daniel's pure Python solution on large data sets. Sample output:
# n = 10
# np 0.03481400 ms
# py 0.00669330 ms
# n = 1000
# np 0.11215360 ms
# py 0.34836530 ms
# n = 1000000
# np 82.46765980 ms
# py 360.51235450 ms
I would solve the problem as follows:
def find_numbers(str_num):
    final_dict = {}
    buffer = {}
    for idx in range(len(str_num) - 2):
        num = int(str_num[idx:idx + 3])
        if num not in buffer:
            buffer[num] = 0
        buffer[num] += 1
        if buffer[num] > 1:
            final_dict[num] = buffer[num]
    return final_dict
Applied to your example string, this yields:
>>> find_numbers("123412345123456")
{345: 2, 234: 3, 123: 3}
This solution runs in O(n) for n being the length of the provided string, and is, I guess, the best you can get.
As per my understanding, you cannot have a solution in constant time. It will take at least one pass over the million-digit number (assuming it's a string). You can do a 3-digit rolling iteration over the digits of the million-length number and increase the value of the hash key by 1 if it already exists, or create a new hash key (initialized to value 1) if it doesn't exist already in the dictionary.
The code will look something like this:
def calc_repeating_digits(number):
    hash = {}
    for i in range(len(str(number))-2):
        current_three_digits = number[i:i+3]
        if current_three_digits in hash.keys():
            hash[current_three_digits] += 1
        else:
            hash[current_three_digits] = 1
    return hash
You can filter down to the keys which have item value greater than 1.
As mentioned in another answer, you cannot do this algorithm in constant time, because you must look at at least n digits. Linear time is the fastest you can get.
However, the algorithm can be done in O(1) space. You only need to store the counts of each 3 digit number, so you need an array of 1000 entries. You can then stream the number in.
My guess is that either the interviewer misspoke when they gave you the solution, or you misheard "constant time" when they said "constant space."
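A small sketch of that streaming idea: only the 1000 counters plus a two-character carry-over live in memory, so the space stays constant regardless of input length (the file name and block size are placeholders):
def count_triples_streaming(fileobj, block_size=4096):
    counts = [0] * 1000
    carry = ""                          # last two digits of the previous block
    while True:
        block = fileobj.read(block_size)
        if not block:
            break
        chunk = carry + block
        for i in range(len(chunk) - 2):
            counts[int(chunk[i:i + 3])] += 1
        carry = chunk[-2:]
    return {f"{n:03d}": c for n, c in enumerate(counts) if c > 1}

# usage sketch:
# with open("digits.txt") as f:
#     print(count_triples_streaming(f))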
Here's my answer:
from timeit import timeit
from collections import Counter
import types
import random

def setup_data(n):
    digits = "0123456789"
    return dict(text=''.join(random.choice(digits) for i in range(n)))

def f_counter(text):
    c = Counter()
    for i in range(len(text)-2):
        ss = text[i:i+3]
        c.update([ss])
    return (i for i in c.items() if i[1] > 1)

def f_dict(text):
    d = {}
    for i in range(len(text)-2):
        ss = text[i:i+3]
        if ss not in d:
            d[ss] = 0
        d[ss] += 1
    return ((i, d[i]) for i in d if d[i] > 1)

def f_array(text):
    a = [[[0 for _ in range(10)] for _ in range(10)] for _ in range(10)]
    for n in range(len(text)-2):
        i, j, k = (int(ss) for ss in text[n:n+3])
        a[i][j][k] += 1
    for i, b in enumerate(a):
        for j, c in enumerate(b):
            for k, d in enumerate(c):
                if d > 1: yield (f'{i}{j}{k}', d)

for n in (1E1, 1E3, 1E6):
    n = int(n)
    data = setup_data(n)
    print(f'n = {n}')
    results = {}
    for name, func in list(globals().items()):
        if not name.startswith('f_') or not isinstance(func, types.FunctionType):
            continue
        print("{:16s}{:16.8f} ms".format(name[2:], timeit(
            'results[name] = f(**data)',
            globals={'f': func, 'data': data, 'results': results, 'name': name},
            number=10)*100))
    for r in results:
        print('{:10}: {}'.format(r, sorted(list(results[r]))[:5]))
The array lookup method is very fast (even faster than @Paul Panzer's NumPy method!). Of course, it cheats, since it isn't technically finished after it completes, because it returns a generator. It also doesn't have to check on every iteration whether the value already exists, which is likely to help a lot.
n = 10
counter 0.10595780 ms
dict 0.01070654 ms
array 0.00135370 ms
f_counter : []
f_dict : []
f_array : []
n = 1000
counter 2.89462101 ms
dict 0.40434612 ms
array 0.00073838 ms
f_counter : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
f_dict : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
f_array : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
n = 1000000
counter 2849.00500992 ms
dict 438.44007806 ms
array 0.00135370 ms
f_counter : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
f_dict : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
f_array : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
(This answer was posted as an image of code.)
Looks like a sliding window.
Here is my solution:
from collections import defaultdict

string = "103264685134845354863"

d = defaultdict(int)
for elt in range(len(string)-2):
    d[string[elt:elt+3]] += 1
d = {key: d[key] for key in d.keys() if d[key] > 1}
With a bit of creativity in the for loop (and an additional lookup list with True/False/None, for example) you should be able to get rid of the last line, as you only want to keep keys in the dict for substrings seen more than once up to that point.
Hope it helps :)
- Telling from the perspective of C:
- You can have an int 3-D array results[10][10][10];
- Go from the 0th location to the (n-4)th location, where n is the size of the string array.
- At each location, check the current, next and next's next characters.
- Increment the counter as results[current][next][next's next]++;
- Print the values of
results[1][2][3]
results[2][3][4]
results[3][4][5]
results[4][5][6]
results[5][6][7]
results[6][7][8]
results[7][8][9]
- It is O(n) time; there are no comparisons involved.
- You can run some parallel stuff here by partitioning the array and calculating the matches around the partitions.
inputStr = '123456123138276237284287434628736482376487234682734682736487263482736487236482634'

count = {}
for i in range(len(inputStr) - 2):
    subNum = int(inputStr[i:i+3])
    if subNum not in count:
        count[subNum] = 1
    else:
        count[subNum] += 1

print count

Vectorizing numpy Multiple Condition Nested Loops

While attempting to implement Automatic Peak Detection in Noisy Periodic and Quasi-Periodic Signals, by Felix Scholkmann, Jens Boss and Martin Wolf, in Python, I've hit a stumbling block in the implementation.
Upon attempting to optimise, I've noticed that the nested for loops are creating a bottleneck in processing time (taking 115394 ms on average to complete).
Is there a more efficient means of constructing the nested for loop?
N.B:
The parameter, signal, is a list of co-ordinates which the algorithm will process, of the form:
-48701.0
-20914.0
-1757.0
-49278.0
-106781.0
-88139.0
-13587.0
28071.0
11880.0
-13375.0
-18056.0
-15248.0
-12476.0
-9832.0
-26365.0
-65734.0
-81657.0
-41566.0
6382.0
872.0
-30666.0
-20261.0
17543.0
6278.0
...
The list is 32768 lines long.
The function returns the indexes of the detected peaks, which are processed in another function.
import math
import time
from operator import itemgetter

import numpy as np

def ampd(signal):
    s_time = range(1, len(signal)+1)

    [fitPolynomial, fitError] = np.polyfit(s_time, signal, 1)
    fitSignal = np.polyval([fitPolynomial, fitError], s_time)
    dtrSignal = signal - fitSignal

    N = len(dtrSignal)
    L = math.ceil(N/2.0)-1

    creation_start = time.time()
    np.random.seed(1969)
    LSM = np.random.uniform(0, 2, size=(L, N))
    creation_elapsedTime = time.time() - creation_start
    print('LSM created in %s ms' % int(creation_elapsedTime * 1000))

    loop_start = time.time()
    for k in range(1, L):
        for i in range(k+2, N-k+1):
            if signal[i-1] > signal[i-k-1] and signal[i-1] > signal[i+k-1]:
                LSM[k, i] = 0
    loop_elapsedTime = time.time() - loop_start
    print('Loop completed in %s ms' % int(loop_elapsedTime * 1000))

    G = np.sum(LSM, axis=1)
    l = min(enumerate(G), key=itemgetter(1))[0]

    MLSM = LSM[0:l]
    S = np.std(MLSM, ddof=1)
    found_indices = np.where(MLSM == ((S-1) == 0))
    del LSM
    del MLSM
    return found_indices[1]
Here is a solution which uses only one loop
for k in range(1, L):
    mat = 1 - ((signal[k+1:N-k] > signal[1:N-2*k]) & (signal[k+1:N-k] > signal[2*k+1:N]))
    LSM[k, k+2:N-k+1] *= mat
It's faster and seems to give the same solutions. You compare slices (as suggested by Ami Tavory) and combine the comparisons with &, which gives a True/False array; with the 1 - ... operation you transform it into zeros and ones, the zeros corresponding to where the conditions are met. Lastly, you multiply the row by the result.

Efficient way to find longest duplicate string for Python (From Programming Pearls)

From Section 15.2 of Programming Pearls
The C codes can be viewed here: http://www.cs.bell-labs.com/cm/cs/pearls/longdup.c
When I implement it in Python using suffix-array:
example = open("iliad10.txt").read()
def comlen(p, q):
i = 0
for x in zip(p, q):
if x[0] == x[1]:
i += 1
else:
break
return i
suffix_list = []
example_len = len(example)
idx = list(range(example_len))
idx.sort(cmp = lambda a, b: cmp(example[a:], example[b:])) #VERY VERY SLOW
max_len = -1
for i in range(example_len - 1):
this_len = comlen(example[idx[i]:], example[idx[i+1]:])
print this_len
if this_len > max_len:
max_len = this_len
maxi = i
I found it very slow for the idx.sort step. I think it's slow because Python needs to pass the substrings by value instead of by pointer (as the C code above does).
The tested file can be downloaded from here.
The C code needs only 0.3 seconds to finish:
time cat iliad10.txt |./longdup
On this the rest of the Achaeans with one voice were for
respecting the priest and taking the ransom that he offered; but
not so Agamemnon, who spoke fiercely to him and sent him roughly
away.
real 0m0.328s
user 0m0.291s
sys 0m0.006s
But the Python code never ends on my computer (I waited for 10 minutes and then killed it).
Does anyone have ideas for how to make the code efficient? (For example, less than 10 seconds.)
My solution is based on suffix arrays, constructed by prefix doubling, together with the longest common prefix (LCP) array. The worst-case complexity is O(n (log n)^2). The file "iliad.mb.txt" takes 4 seconds on my laptop. The longest_common_substring function is short and can be easily modified, e.g. for searching the 10 longest non-overlapping substrings. This Python code is faster than the original C code from the question if duplicate strings are longer than 10000 characters.
from itertools import groupby
from operator import itemgetter

def longest_common_substring(text):
    """Get the longest common substrings and their positions.
    >>> longest_common_substring('banana')
    {'ana': [1, 3]}
    >>> text = "not so Agamemnon, who spoke fiercely to "
    >>> sorted(longest_common_substring(text).items())
    [(' s', [3, 21]), ('no', [0, 13]), ('o ', [5, 20, 38])]

    This function can be easily modified for any criteria, e.g. for searching
    the ten longest non-overlapping repeated substrings.
    """
    sa, rsa, lcp = suffix_array(text)
    maxlen = max(lcp)
    result = {}
    for i in range(1, len(text)):
        if lcp[i] == maxlen:
            j1, j2, h = sa[i - 1], sa[i], lcp[i]
            assert text[j1:j1 + h] == text[j2:j2 + h]
            substring = text[j1:j1 + h]
            if not substring in result:
                result[substring] = [j1]
            result[substring].append(j2)
    return dict((k, sorted(v)) for k, v in result.items())

def suffix_array(text, _step=16):
    """Analyze all common strings in the text.

    Short substrings of length _step are first pre-sorted. These results are
    then repeatedly merged so that the guaranteed number of compared bytes is
    doubled in every iteration until all substrings are sorted exactly.

    Arguments:
        text:  The text to be analyzed.
        _step: Is only for optimization and testing. It is the optimal length
               of substrings used for initial pre-sorting. A bigger value is
               faster if there is enough memory. Memory requirements are
               approximately (estimate for 32 bit Python 3.3):
                   len(text) * (29 + (_size + 20 if _size > 2 else 0)) + 1MB

    Return value: (tuple)
        (sa, rsa, lcp)
        sa:  Suffix array                  for i in range(1, size):
                 assert text[sa[i-1]:] < text[sa[i]:]
        rsa: Reverse suffix array          for i in range(size):
                 assert rsa[sa[i]] == i
        lcp: Longest common prefix         for i in range(1, size):
                 assert text[sa[i-1]:sa[i-1]+lcp[i]] == text[sa[i]:sa[i]+lcp[i]]
                 if sa[i-1] + lcp[i] < len(text):
                     assert text[sa[i-1] + lcp[i]] < text[sa[i] + lcp[i]]
    >>> suffix_array(text='banana')
    ([5, 3, 1, 0, 4, 2], [3, 2, 5, 1, 4, 0], [0, 1, 3, 0, 0, 2])
    Explanation: 'a' < 'ana' < 'anana' < 'banana' < 'na' < 'nana'
    The Longest Common String is 'ana': lcp[2] == 3 == len('ana')
    It is between tx[sa[1]:] == 'ana' < 'anana' == tx[sa[2]:]
    """
    tx = text
    size = len(tx)
    step = min(max(_step, 1), len(tx))
    sa = list(range(len(tx)))
    sa.sort(key=lambda i: tx[i:i + step])
    grpstart = size * [False] + [True]  # a boolean map for iteration speedup.
    # It helps to skip yet resolved values. The last value True is a sentinel.
    rsa = size * [None]
    stgrp, igrp = '', 0
    for i, pos in enumerate(sa):
        st = tx[pos:pos + step]
        if st != stgrp:
            grpstart[igrp] = (igrp < i - 1)
            stgrp = st
            igrp = i
        rsa[pos] = igrp
        sa[i] = pos
    grpstart[igrp] = (igrp < size - 1 or size == 0)
    while grpstart.index(True) < size:
        # assert step <= size
        nextgr = grpstart.index(True)
        while nextgr < size:
            igrp = nextgr
            nextgr = grpstart.index(True, igrp + 1)
            glist = []
            for ig in range(igrp, nextgr):
                pos = sa[ig]
                if rsa[pos] != igrp:
                    break
                newgr = rsa[pos + step] if pos + step < size else -1
                glist.append((newgr, pos))
            glist.sort()
            for ig, g in groupby(glist, key=itemgetter(0)):
                g = [x[1] for x in g]
                sa[igrp:igrp + len(g)] = g
                grpstart[igrp] = (len(g) > 1)
                for pos in g:
                    rsa[pos] = igrp
                igrp += len(g)
        step *= 2
    del grpstart
    # create LCP array
    lcp = size * [None]
    h = 0
    for i in range(size):
        if rsa[i] > 0:
            j = sa[rsa[i] - 1]
            while i != size - h and j != size - h and tx[i + h] == tx[j + h]:
                h += 1
            lcp[rsa[i]] = h
            if h > 0:
                h -= 1
    if size > 0:
        lcp[0] = 0
    return sa, rsa, lcp
I prefer this solution over a more complicated O(n log n) approach because Python has a very fast list sorting algorithm (Timsort). Python's sort is probably faster than the necessary linear-time operations in the method from that article, which would be O(n) only under very special presumptions of random strings together with a small alphabet (typical for DNA genome analysis). I read in Gog 2011 that the worst-case O(n log n) of my algorithm can in practice be faster than many O(n) algorithms that cannot use the CPU memory cache.
The code in another answer based on grow_chains is 19 times slower than the original example from the question if the text contains a repeated string 8 kB long. Long repeated texts are not typical for classical literature, but they are frequent, e.g., in "independent" school homework collections. The program should not freeze on it.
I wrote an example and tests with the same code for Python 2.7, 3.3 - 3.6.
The translation of the algorithm into Python:
from itertools import imap, izip, starmap, tee
from os.path import commonprefix

def pairwise(iterable):  # itertools recipe
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

def longest_duplicate_small(data):
    suffixes = sorted(data[i:] for i in xrange(len(data)))  # O(n*n) in memory
    return max(imap(commonprefix, pairwise(suffixes)), key=len)
buffer() allows to get a substring without copying:
def longest_duplicate_buffer(data):
    n = len(data)
    sa = sorted(xrange(n), key=lambda i: buffer(data, i))  # suffix array
    def lcp_item(i, j):  # find longest common prefix array item
        start = i
        while i < n and data[i] == data[i + j - start]:
            i += 1
        return i - start, start
    size, start = max(starmap(lcp_item, pairwise(sa)), key=lambda x: x[0])
    return data[start:start + size]
It takes 5 seconds on my machine for the iliad.mb.txt.
In principle it is possible to find the duplicate in O(n) time and O(n) memory using a suffix array augmented with a lcp array.
Note: the *_memoryview() version is superseded by the *_buffer() version above.
More memory efficient version (compared to longest_duplicate_small()):
def cmp_memoryview(a, b):
    for x, y in izip(a, b):
        if x < y:
            return -1
        elif x > y:
            return 1
    return cmp(len(a), len(b))

def common_prefix_memoryview((a, b)):
    for i, (x, y) in enumerate(izip(a, b)):
        if x != y:
            return a[:i]
    return a if len(a) < len(b) else b

def longest_duplicate(data):
    mv = memoryview(data)
    suffixes = sorted((mv[i:] for i in xrange(len(mv))), cmp=cmp_memoryview)
    result = max(imap(common_prefix_memoryview, pairwise(suffixes)), key=len)
    return result.tobytes()
It takes 17 seconds on my machine for the iliad.mb.txt. The result is:
On this the rest of the Achaeans with one voice were for respecting
the priest and taking the ransom that he offered; but not so Agamemnon,
who spoke fiercely to him and sent him roughly away.
I had to define custom functions to compare memoryview objects because memoryview comparison either raises an exception in Python 3 or produces wrong result in Python 2:
>>> s = b"abc"
>>> memoryview(s[0:]) > memoryview(s[1:])
True
>>> memoryview(s[0:]) < memoryview(s[1:])
True
Related questions:
Find the longest repeating string and the number of times it repeats in a given string
finding long repeated substrings in a massive string
The main problem seems to be that Python does slicing by copy: https://stackoverflow.com/a/5722068/538551
You'll have to use a memoryview instead to get a reference instead of a copy. When I did this, the program hung after the idx.sort function (which was very fast).
I'm sure with a little work, you can get the rest working.
Edit:
The above change will not work as a drop-in replacement because cmp does not work the same way as strcmp. For example, try the following C code:
#include <stdio.h>
#include <string.h>

int main() {
    char* test1 = "ovided by The Internet Classics Archive";
    char* test2 = "rovided by The Internet Classics Archive.";
    printf("%d\n", strcmp(test1, test2));
}
And compare the result to this python:
test1 = "ovided by The Internet Classics Archive";
test2 = "rovided by The Internet Classics Archive."
print(cmp(test1, test2))
The C code prints -3 on my machine while the python version prints -1. It looks like the example C code is abusing the return value of strcmp (it IS used in qsort after all). I couldn't find any documentation on when strcmp will return something other than [-1, 0, 1], but adding a printf to pstrcmp in the original code showed a lot of values outside of that range (3, -31, 5 were the first 3 values).
To make sure that -3 wasn't some error code, if we reverse test1 and test2, we'll get 3.
Edit:
The above is interesting trivia, but not actually correct in terms of affecting either chunks of code. I realized this just as I shut my laptop and left a wifi zone... Really should double check everything before I hit Save.
FWIW, cmp most certainly works on memoryview objects (prints -1 as expected):
print(cmp(memoryview(test1), memoryview(test2)))
I'm not sure why the code isn't working as expected. Printing out the list on my machine does not look as expected. I'll look into this and try to find a better solution instead of grasping at straws.
This version takes about 17 seconds on my circa-2007 desktop using a totally different algorithm:
#!/usr/bin/env python

ex = open("iliad.mb.txt").read()

chains = dict()

# populate initial chains dictionary
for (a, b) in enumerate(zip(ex, ex[1:])):
    s = ''.join(b)
    if s not in chains:
        chains[s] = list()
    chains[s].append(a)

def grow_chains(chains):
    new_chains = dict()
    for (string, pos) in chains:
        offset = len(string)
        for p in pos:
            if p + offset >= len(ex): break
            # add one more character
            s = string + ex[p + offset]
            if s not in new_chains:
                new_chains[s] = list()
            new_chains[s].append(p)
    return new_chains

# grow and filter, grow and filter
while len(chains) > 1:
    print 'length of chains', len(chains)
    # remove chains that appear only once
    chains = [(i, chains[i]) for i in chains if len(chains[i]) > 1]
    print 'non-unique chains', len(chains)
    print [i[0] for i in chains[:3]]
    chains = grow_chains(chains)
The basic idea is to create a list of substrings and the positions where they occur, thus eliminating the need to compare the same strings again and again. The resulting list looks like [('ind him, but', [466548, 739011]), (' bulwark bot', [428251, 428924]), (' his armour,', [121559, 124919, 193285, 393566, 413634, 718953, 760088])]. Unique strings are removed. Then every list member grows by 1 character and a new list is created. Unique strings are removed again. And so on and so forth...

splitting list in chunks of balanced weight

I need an algorithm to split a list of values into chunks such that the sum of the values in every chunk is (approximately) equal (it's some variation of the knapsack problem, I suppose).
So, for example [1, 2, 1, 4, 10, 3, 8] => [[8, 2], [10], [1, 3, 1, 4]]
Chunks of equal lengths are preferred, but it's not a constraint.
Python is the preferred language, but others are welcome as well.
Edit: number of chunks is defined
Greedy:
1. Order the available items descending.
2. Create N empty groups
3. Start adding the items one at a time into the group that has the smallest sum in it.
I think in most real life situations this should be enough.
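A small sketch of that greedy heuristic, using a heap to find the group with the smallest sum (the function and variable names are mine):
import heapq

def greedy_chunks(values, n):
    # heap of (current_sum, group_index); all groups start empty
    heap = [(0, i) for i in range(n)]
    heapq.heapify(heap)
    groups = [[] for _ in range(n)]
    for value in sorted(values, reverse=True):   # largest items first
        total, idx = heapq.heappop(heap)         # group with the smallest sum
        groups[idx].append(value)
        heapq.heappush(heap, (total + value, idx))
    return groups

print(greedy_chunks([1, 2, 1, 4, 10, 3, 8], 3))
# [[10], [8, 1, 1], [4, 3, 2]]  (sums 10, 10, 9)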
Based on @Alin Purcaru's answer and @amit's remarks, I wrote code (Python 3.1). It has, as far as I tested, linear performance (both in the number of items and the number of chunks, so finally it's O(N * M)). I avoid sorting the list every time by keeping the current sum of values for every chunk in a dict (which can be less practical with a greater number of chunks).
import time, random

def split_chunks(l, n):
    """
    Splits list l into n chunks with approximately equal sums of values
    see http://stackoverflow.com/questions/6855394/splitting-list-in-chunks-of-balanced-weight
    """
    result = [[] for i in range(n)]
    sums = {i: 0 for i in range(n)}
    c = 0
    for e in l:
        for i in sums:
            if c == sums[i]:
                result[i].append(e)
                break
        sums[i] += e
        c = min(sums.values())
    return result

if __name__ == '__main__':
    MIN_VALUE = 0
    MAX_VALUE = 20000000
    ITEMS = 50000
    CHUNKS = 256

    l = [random.randint(MIN_VALUE, MAX_VALUE) for i in range(ITEMS)]

    t = time.time()
    r = split_chunks(l, CHUNKS)
    print(ITEMS, CHUNKS, time.time() - t)
Just because, you know, we can, the same code in PHP 5.3 (2 - 3 times slower than Python 3.1):
function split_chunks($l, $n){
    $result = array_fill(0, $n, array());
    $sums   = array_fill(0, $n, 0);
    $c = 0;
    foreach ($l as $e){
        foreach ($sums as $i=>$sum){
            if ($c == $sum){
                $result[$i][] = $e;
                break;
            } // if
        } // foreach
        $sums[$i] += $e;
        $c = min($sums);
    } // foreach
    return $result;
}

define('MIN_VALUE', 0);
define('MAX_VALUE', 20000000);
define('ITEMS', 50000);
define('CHUNKS', 128);

$l = array();
for ($i=0; $i<ITEMS; $i++){
    $l[] = rand(MIN_VALUE, MAX_VALUE);
}

$t = microtime(true);
$r = split_chunks($l, CHUNKS);
$t = microtime(true) - $t;

print(ITEMS. ' ' . CHUNKS .' ' . $t . ' ');
This will be faster and a little cleaner (based on above ideas!)
def split_chunks2(l, n):
    result = [[] for i in range(n)]
    sums = [0] * n
    i = 0
    for e in l:
        result[i].append(e)
        sums[i] += e
        i = sums.index(min(sums))
    return result
You may want to use Artificial Intelligence tools for the problem. First, define your problem:
States = {(c1, c2, ..., ck) | c1, ..., ck are subgroups of your problem, and union(c1, ..., ck) = S}
successors((c1, ..., ck)) = {switch one element from one sub-list to another}
utility(c1, ..., ck) = max{sum(c1), sum(c2), ...} - min{sum(c1), sum(c2), ...}
Now you can use steepest-ascent hill climbing with random restarts.
This algorithm will be an anytime algorithm, meaning you can start searching and, when time's up, stop it, and you will have the best result found so far. The result improves as run time increases. A sketch follows below.
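A minimal sketch of that formulation, assuming the utility above (max chunk sum minus min chunk sum, to be minimized) and single-element moves as successors; all names and parameters here are mine:
import random

def spread(chunks):
    sums = [sum(c) for c in chunks]
    return max(sums) - min(sums)            # the utility we want to minimize

def hill_climb(values, n, restarts=20, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        chunks = [[] for _ in range(n)]
        for v in values:                    # random initial assignment
            rng.choice(chunks).append(v)
        improved = True
        while improved:
            improved = False
            current = spread(chunks)
            best_move = None
            # try every single-element move; remember the steepest improvement
            for i, src in enumerate(chunks):
                for k, v in enumerate(src):
                    for j in range(n):
                        if j == i:
                            continue
                        src.pop(k)
                        chunks[j].append(v)
                        s = spread(chunks)
                        if s < current and (best_move is None or s < best_move[0]):
                            best_move = (s, i, k, j)
                        chunks[j].pop()      # undo the trial move
                        src.insert(k, v)
            if best_move is not None:
                _, i, k, j = best_move
                chunks[j].append(chunks[i].pop(k))
                improved = True
        if best is None or spread(chunks) < spread(best):
            best = [list(c) for c in chunks]
    return best

print(hill_climb([1, 2, 1, 4, 10, 3, 8], 3))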
Scala version of foxtrotmikew's answer:
def workload_balancer(element_list: Seq[(Long, Any)], partitions: Int): Seq[Seq[(Long, Any)]] = {
  val result  = scala.collection.mutable.Seq.fill(partitions)(null : Seq[(Long, Any)])
  val index   = (0 to element_list.size-1)
  val weights = scala.collection.mutable.Seq.fill(partitions)(0l)

  (0 to partitions-1).foreach( x => weights(x) = 0 )

  var i = 0
  for (e <- element_list){
    result(i)  = if(result(i) == null) Seq(e) else result(i) ++: Seq(e)
    weights(i) = weights(i) + e._1
    i = weights.indexOf( weights.min )
  }
  result.toSeq
}
element_list should be (weight: Long, obj: Any); then you can order and split objects into different workloads (result). It helped me a lot, thanks!
