Python: simple list merging based on intersections

Consider some lists of integers:
#--------------------------------------
0 [0,1,3]
1 [1,0,3,4,5,10,...]
2 [2,8]
3 [3,1,0,...]
...
n []
#--------------------------------------
The task is to merge all lists that share at least one common element, so the result for just the part shown above would be:
#--------------------------------------
0 [0,1,3,4,5,10,...]
2 [2,8]
#--------------------------------------
What is the most efficient way to do this on large data (elements are just numbers)?
Is tree structure something to think about?
I currently do the job by converting the lists to sets and iterating over their intersections, but it is slow! Furthermore, I have a feeling that it is so elementary! In addition, the implementation lacks something (unknown), because some lists sometimes remain unmerged! Having said that, if you propose a self-implementation, please be generous and provide simple sample code [apparently Python is my favorite :)] or pseudo-code.
Update 1:
Here is the code I was using:
#--------------------------------------
lsts = [[0, 1, 3],
        [1, 0, 3, 4, 5, 10, 11],
        [2, 8],
        [3, 1, 0, 16]]
#--------------------------------------
The function is (buggy!!):
#--------------------------------------
def merge(lsts):
    sts = [set(l) for l in lsts]
    i = 0
    while i < len(sts):
        j = i + 1
        while j < len(sts):
            if len(sts[i].intersection(sts[j])) > 0:
                sts[i] = sts[i].union(sts[j])
                sts.pop(j)
            else: j += 1  #---corrected
        i += 1
    lst = [list(s) for s in sts]
    return lst
#--------------------------------------
The result is:
#--------------------------------------
>>> merge(lsts)
[[0, 1, 3, 4, 5, 10, 11, 16], [8, 2]]
#--------------------------------------
Update 2:
In my experience, the code given by Niklas Baumstark below proved to be a bit faster for the simple cases. I have not tested the method given by "Hooked" yet, since it is a completely different approach (by the way, it seems interesting).
The testing procedure for all of these could be really hard, or even impossible, to verify. The real data set I will use is so large and complex that it is impossible to trace any error just by repeating the run. That is, I need to be 100% satisfied with the reliability of the method before pushing it into its place within a large code as a module. For now, Niklas's method is faster and, of course, the answer for simple sets is correct.
However, how can I be sure that it works well for a real, large data set? Since I will not be able to trace the errors visually!
Update 3:
Note that for this problem the reliability of the method is much more important than its speed. I will hopefully be able to translate the Python code to Fortran for maximum performance in the end.
Update 4:
There are many interesting points in this post, along with generously given answers and constructive comments. I would recommend reading all of them thoroughly. Please accept my appreciation for the development of the question, the amazing answers, and the constructive comments and discussion.

My attempt:
def merge(lsts):
    sets = [set(lst) for lst in lsts if lst]
    merged = True
    while merged:
        merged = False
        results = []
        while sets:
            common, rest = sets[0], sets[1:]
            sets = []
            for x in rest:
                if x.isdisjoint(common):
                    sets.append(x)
                else:
                    merged = True
                    common |= x
            results.append(common)
        sets = results
    return sets
lst = [[65, 17, 5, 30, 79, 56, 48, 62],
       [6, 97, 32, 93, 55, 14, 70, 32],
       [75, 37, 83, 34, 9, 19, 14, 64],
       [43, 71],
       [],
       [89, 49, 1, 30, 28, 3, 63],
       [35, 21, 68, 94, 57, 94, 9, 3],
       [16],
       [29, 9, 97, 43],
       [17, 63, 24]]
print merge(lst)
Benchmark:
import random
# adapt parameters to your own usage scenario
class_count = 50
class_size = 1000
list_count_per_class = 100
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.5
if False: # change to true to generate the test data file (takes a while)
    with open("/tmp/test.txt", "w") as f:
        lists = []
        classes = [
            range(class_size * i, class_size * (i + 1)) for i in range(class_count)
        ]
        for c in classes:
            # distribute each class across ~300 lists
            for i in xrange(list_count_per_class):
                lst = []
                if random.random() < large_list_probability:
                    size = random.choice(large_list_sizes)
                else:
                    size = random.choice(small_list_sizes)
                nums = set(c)
                for j in xrange(size):
                    x = random.choice(list(nums))
                    lst.append(x)
                    nums.remove(x)
                random.shuffle(lst)
                lists.append(lst)
        random.shuffle(lists)
        for lst in lists:
            f.write(" ".join(str(x) for x in lst) + "\n")
setup = """
# Niklas'
def merge_niklas(lsts):
    sets = [set(lst) for lst in lsts if lst]
    merged = 1
    while merged:
        merged = 0
        results = []
        while sets:
            common, rest = sets[0], sets[1:]
            sets = []
            for x in rest:
                if x.isdisjoint(common):
                    sets.append(x)
                else:
                    merged = 1
                    common |= x
            results.append(common)
        sets = results
    return sets
# Rik's
def merge_rik(data):
    sets = (set(e) for e in data if e)
    results = [next(sets)]
    for e_set in sets:
        to_update = []
        for i, res in enumerate(results):
            if not e_set.isdisjoint(res):
                to_update.insert(0, i)
        if not to_update:
            results.append(e_set)
        else:
            last = results[to_update.pop(-1)]
            for i in to_update:
                last |= results[i]
                del results[i]
            last |= e_set
    return results
# katrielalex's
def pairs(lst):
    i = iter(lst)
    first = prev = item = i.next()
    for item in i:
        yield prev, item
        prev = item
    yield item, first

import networkx

def merge_katrielalex(lsts):
    g = networkx.Graph()
    for lst in lsts:
        for edge in pairs(lst):
            g.add_edge(*edge)
    return networkx.connected_components(g)
# agf's (optimized)
from collections import deque

def merge_agf_optimized(lists):
    sets = deque(set(lst) for lst in lists if lst)
    results = []
    disjoint = 0
    current = sets.pop()
    while True:
        merged = False
        newsets = deque()
        for _ in xrange(disjoint, len(sets)):
            this = sets.pop()
            if not current.isdisjoint(this):
                current.update(this)
                merged = True
                disjoint = 0
            else:
                newsets.append(this)
                disjoint += 1
        if sets:
            newsets.extendleft(sets)
        if not merged:
            results.append(current)
            try:
                current = newsets.pop()
            except IndexError:
                break
            disjoint = 0
        sets = newsets
    return results
# agf's (simple)
def merge_agf_simple(lists):
    newsets, sets = [set(lst) for lst in lists if lst], []
    while len(sets) != len(newsets):
        sets, newsets = newsets, []
        for aset in sets:
            for eachset in newsets:
                if not aset.isdisjoint(eachset):
                    eachset.update(aset)
                    break
            else:
                newsets.append(aset)
    return newsets
# alexis'
def merge_alexis(data):
    bins = range(len(data))  # Initialize each bin[n] == n
    nums = dict()
    data = [set(m) for m in data]  # Convert to sets
    for r, row in enumerate(data):
        for num in row:
            if num not in nums:
                # New number: tag it with a pointer to this row's bin
                nums[num] = r
                continue
            else:
                dest = locatebin(bins, nums[num])
                if dest == r:
                    continue  # already in the same bin
                if dest > r:
                    dest, r = r, dest  # always merge into the smallest bin
                data[dest].update(data[r])
                data[r] = None
                # Update our indices to reflect the move
                bins[r] = dest
                r = dest
    # Filter out the empty bins
    have = [m for m in data if m]
    return have

def locatebin(bins, n):
    while bins[n] != n:
        n = bins[n]
    return n
lsts = []
size = 0
num = 0
max = 0
for line in open("/tmp/test.txt", "r"):
    lst = [int(x) for x in line.split()]
    size += len(lst)
    if len(lst) > max:
        max = len(lst)
    num += 1
    lsts.append(lst)
"""
setup += """
print "%i lists, {class_count} equally distributed classes, average size %i, max size %i" % (num, size/num, max)
""".format(class_count=class_count)
import timeit
print "niklas"
print timeit.timeit("merge_niklas(lsts)", setup=setup, number=3)
print "rik"
print timeit.timeit("merge_rik(lsts)", setup=setup, number=3)
print "katrielalex"
print timeit.timeit("merge_katrielalex(lsts)", setup=setup, number=3)
print "agf (1)"
print timeit.timeit("merge_agf_optimized(lsts)", setup=setup, number=3)
print "agf (2)"
print timeit.timeit("merge_agf_simple(lsts)", setup=setup, number=3)
print "alexis"
print timeit.timeit("merge_alexis(lsts)", setup=setup, number=3)
These timings are obviously dependent on the specific parameters to the benchmark, like number of classes, number of lists, list size, etc. Adapt those parameters to your need to get more helpful results.
Below are some example outputs on my machine for different parameters. They show that all the algorithms have their strength and weaknesses, depending on the kind of input they get:
=====================
# many disjoint classes, large lists
class_count = 50
class_size = 1000
list_count_per_class = 100
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.5
=====================
niklas
5000 lists, 50 equally distributed classes, average size 298, max size 999
4.80084705353
rik
5000 lists, 50 equally distributed classes, average size 298, max size 999
9.49251699448
katrielalex
5000 lists, 50 equally distributed classes, average size 298, max size 999
21.5317108631
agf (1)
5000 lists, 50 equally distributed classes, average size 298, max size 999
8.61671280861
agf (2)
5000 lists, 50 equally distributed classes, average size 298, max size 999
5.18117713928
=> alexis
=> 5000 lists, 50 equally distributed classes, average size 298, max size 999
=> 3.73504281044
===================
# less number of classes, large lists
class_count = 15
class_size = 1000
list_count_per_class = 300
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.5
===================
niklas
4500 lists, 15 equally distributed classes, average size 296, max size 999
1.79993700981
rik
4500 lists, 15 equally distributed classes, average size 296, max size 999
2.58237695694
katrielalex
4500 lists, 15 equally distributed classes, average size 296, max size 999
19.5465381145
agf (1)
4500 lists, 15 equally distributed classes, average size 296, max size 999
2.75445604324
=> agf (2)
=> 4500 lists, 15 equally distributed classes, average size 296, max size 999
=> 1.77850699425
alexis
4500 lists, 15 equally distributed classes, average size 296, max size 999
3.23530197144
===================
# less number of classes, smaller lists
class_count = 15
class_size = 1000
list_count_per_class = 300
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.1
===================
niklas
4500 lists, 15 equally distributed classes, average size 95, max size 997
0.773697137833
rik
4500 lists, 15 equally distributed classes, average size 95, max size 997
1.0523750782
katrielalex
4500 lists, 15 equally distributed classes, average size 95, max size 997
6.04466891289
agf (1)
4500 lists, 15 equally distributed classes, average size 95, max size 997
1.20285701752
=> agf (2)
=> 4500 lists, 15 equally distributed classes, average size 95, max size 997
=> 0.714507102966
alexis
4500 lists, 15 equally distributed classes, average size 95, max size 997
1.1286110878

I tried to summarize everything that's been said and done about this topic in this question and in the duplicate one.
I tried to test and time every solution (all the code here).
Testing
This is the TestCase from the testing module:
class MergeTestCase(unittest.TestCase):

    def setUp(self):
        with open('./lists/test_list.txt') as f:
            self.lsts = json.loads(f.read())
        self.merged = self.merge_func(deepcopy(self.lsts))

    def test_disjoint(self):
        """Check disjoint-ness of merged results"""
        from itertools import combinations
        for a, b in combinations(self.merged, 2):
            self.assertTrue(a.isdisjoint(b))

    def test_coverage(self):    # Credit to katrielalex
        """Check coverage original data"""
        merged_flat = set()
        for s in self.merged:
            merged_flat |= s
        original_flat = set()
        for lst in self.lsts:
            original_flat |= set(lst)
        self.assertTrue(merged_flat == original_flat)

    def test_subset(self):      # Credit to WolframH
        """Check that every original data is a subset"""
        for lst in self.lsts:
            self.assertTrue(any(set(lst) <= e for e in self.merged))
This test assumes a list of sets as the result, so I couldn't test a couple of solutions that worked with lists.
I couldn't test the following:
katrielalex
steabert
Among the ones I could test, two failed:
-- Going to test: agf (optimized) --
Check disjoint-ness of merged results ... FAIL
-- Going to test: robert king --
Check disjoint-ness of merged results ... FAIL
Timing
Performance is strongly dependent on the test data employed.
So far, three answers have tried to time their own and other solutions. Since they used different test data, they got different results.
Niklas' benchmark is very tweakable. With it, one can run different tests by changing some parameters.
I've used the same three sets of parameters he used in his own answer, and I put them in three different files:
filename = './lists/timing_1.txt'
class_count = 50,
class_size = 1000,
list_count_per_class = 100,
large_list_sizes = (100, 1000),
small_list_sizes = (0, 100),
large_list_probability = 0.5,
filename = './lists/timing_2.txt'
class_count = 15,
class_size = 1000,
list_count_per_class = 300,
large_list_sizes = (100, 1000),
small_list_sizes = (0, 100),
large_list_probability = 0.5,
filename = './lists/timing_3.txt'
class_count = 15,
class_size = 1000,
list_count_per_class = 300,
large_list_sizes = (100, 1000),
small_list_sizes = (0, 100),
large_list_probability = 0.1,
These are the results that I got:
From file: timing_1.txt
Timing with: >> Niklas << Benchmark
Info: 5000 lists, average size 305, max size 999
Timing Results:
10.434 -- alexis
11.476 -- agf
11.555 -- Niklas B.
13.622 -- Rik. Poggi
14.016 -- agf (optimized)
14.057 -- ChessMaster
20.208 -- katrielalex
21.697 -- steabert
25.101 -- robert king
76.870 -- Sven Marnach
133.399 -- hochl
From file: timing_2.txt
Timing with: >> Niklas << Benchmark
Info: 4500 lists, average size 305, max size 999
Timing Results:
8.247 -- Niklas B.
8.286 -- agf
8.637 -- Rik. Poggi
8.967 -- alexis
9.090 -- ChessMaster
9.091 -- agf (optimized)
18.186 -- katrielalex
19.543 -- steabert
22.852 -- robert king
70.486 -- Sven Marnach
104.405 -- hochl
From file: timing_3.txt
Timing with: >> Niklas << Benchmark
Info: 4500 lists, average size 98, max size 999
Timing Results:
2.746 -- agf
2.850 -- Niklas B.
2.887 -- Rik. Poggi
2.972 -- alexis
3.077 -- ChessMaster
3.174 -- agf (optimized)
5.811 -- katrielalex
7.208 -- robert king
9.193 -- steabert
23.536 -- Sven Marnach
37.436 -- hochl
With Sven's testing data I got the following results:
Timing with: >> Sven << Benchmark
Info: 200 lists, average size 10, max size 10
Timing Results:
2.053 -- alexis
2.199 -- ChessMaster
2.410 -- agf (optimized)
3.394 -- agf
3.398 -- Rik. Poggi
3.640 -- robert king
3.719 -- steabert
3.776 -- Niklas B.
3.888 -- hochl
4.610 -- Sven Marnach
5.018 -- katrielalex
And finally with Agf's benchmark I got:
Timing with: >> Agf << Benchmark
Info: 2000 lists, average size 246, max size 500
Timing Results:
3.446 -- Rik. Poggi
3.500 -- ChessMaster
3.520 -- agf (optimized)
3.527 -- Niklas B.
3.527 -- agf
3.902 -- hochl
5.080 -- alexis
15.997 -- steabert
16.422 -- katrielalex
18.317 -- robert king
1257.152 -- Sven Marnach
As I said at the beginning, all the code is available at this git repository. All the merging functions are in a file called core.py; every function there whose name ends with _merge will be auto-loaded during the tests, so it shouldn't be hard to add/test/improve your own solution.
Let me also know if there's something wrong, it's been a lot of coding and I could use a couple of fresh eyes :)

Using Matrix Manipulations
Let me preface this answer with the following comment:
THIS IS THE WRONG WAY TO DO THIS. IT IS PRONE TO NUMERICAL INSTABILITY AND IS MUCH SLOWER THAN THE OTHER METHODS PRESENTED, USE AT YOUR OWN RISK.
That being said, I couldn't resist solving the problem from a dynamical point of view (and I hope you'll get a fresh perspective on the problem). In theory this should work all the time, but eigenvalue calculations can often fail. The idea is to think of your list as a flow from rows to columns. If two rows share a common value, there is a connecting flow between them. If we were to think of these flows as water, we would see that the flows cluster into little pools when there is a connecting path between them. For simplicity, I'm going to use a smaller set, though it works with your data set as well:
from numpy import where, newaxis
from scipy import linalg, array, zeros
X = [[0,1,3],[2],[3,1]]
We need to convert the data into a flow graph. If row i flows into value j we put it in the matrix. Here we have 3 rows and 4 unique values:
A = zeros((4,len(X)), dtype=float)
for i,row in enumerate(X):
    for val in row: A[val, i] = 1
In general, you'll need to change the 4 to capture the number of unique values you have. If the set is a list of integers starting from 0, as we have here, you can simply make this the largest number. We now perform an eigenvalue decomposition. An SVD, to be exact, since our matrix is not square.
S = linalg.svd(A)
We want to keep only the 3x3 portion of this answer, since it will represent the flow of the pools. In fact we only want the absolute values of this matrix; we only care if there is a flow in this cluster space.
M = abs(S[2])
We can think of this matrix M as a Markov matrix and make it explicit by row normalizing. Once we have this we compute the (left) eigenvalue decomp. of this matrix.
M /= M.sum(axis=1)[:,newaxis]
U,V = linalg.eig(M,left=True, right=False)
V = abs(V)
Now, a disconnected (non-ergodic) Markov matrix has the nice property that, for each non-connected cluster, there is an eigenvalue of unity. The eigenvectors associated with these unity eigenvalues are the ones we want:
idx = where(U > .999)[0]
C = V.T[idx] > 0
I have to use .999 due to the aforementioned numerical instability. At this point, we are done! Each independent cluster can now pull the corresponding rows out:
for cluster in C:
    print where(A[:, cluster].sum(axis=1))[0]
Which gives, as intended:
[0 1 3]
[2]
Change X to your lst and you'll get: [ 0 1 3 4 5 10 11 16] [2 8].
Addendum
Why might this be useful? I don't know where your underlying data comes from, but what happens when the connections are not absolute? Say row 1 has entry 3 80% of the time - how would you generalize the problem? The flow method above would work just fine, and would be completely parametrized by that .999 value: the further away from unity it is, the looser the association.
Visual Representation
Since a picture is worth 1K words, here are the plots of the matrices A and V for my example and your lst, respectively. Notice how V splits into two clusters (it is a block-diagonal matrix with two blocks after permutation), since for each example there were only two unique lists!
Faster Implementation
In hindsight, I realized that you can skip the SVD step and compute only a single decomp:
from numpy import dot
M = dot(A.T, A)
M /= M.sum(axis=1)[:, newaxis]
U, V = linalg.eig(M, left=True, right=False)
The advantage with this method (besides speed) is that M is now symmetric, hence the computation can be faster and more accurate (no imaginary values to worry about).

EDIT: OK, the other question has been closed, so I'm posting here.
Nice question! It's much simpler if you think of it as a connected-components problem in a graph. The following code uses the excellent networkx graph library and the pairs function from this question.
def pairs(lst):
    i = iter(lst)
    first = prev = item = i.next()
    for item in i:
        yield prev, item
        prev = item
    yield item, first
lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]
import networkx
g = networkx.Graph()
for sub_list in lists:
    for edge in pairs(sub_list):
        g.add_edge(*edge)
networkx.connected_components(g)
[[1, 2, 3, 5, 6], [8, 9, 10], [11, 12, 13]]
Explanation
We create a new (empty) graph g. For each sub-list in lists, consider its elements as nodes of the graph and add an edge between them. (Since we only care about connectedness, we don't need to add all the edges -- only adjacent ones!) Note that add_edge takes two objects, treats them as nodes (and adds them if they aren't already there), and adds an edge between them.
Then, we just find the connected components of the graph -- a solved problem! -- and output them as our intersecting sets.

Here's my answer. I haven't checked it against today's batch of answers.
The intersection-based algorithms are O(N^2) since they check each new set against all the existing ones, so I used an approach that indexes each number and runs in close to O(N) (if we accept that dictionary lookups are O(1)). Then I ran the benchmarks and felt like a complete idiot because it ran slower, but on closer inspection it turned out that the test data ends up with only a handful of distinct result sets, so the quadratic algorithms don't have a lot of work to do. Test it with more than 10-15 distinct bins and my algorithm is much faster. Try test data with more than 50 distinct bins, and it is enormously faster.
(Edit: There was also a problem with the way the benchmark is run, but I was wrong in my diagnosis. I altered my code to work with the way the repeated tests are run).
def mergelists5(data):
    """Check each number in our arrays only once, merging when we find
    a number we have seen before.
    """
    bins = range(len(data))  # Initialize each bin[n] == n
    nums = dict()
    data = [set(m) for m in data]  # Convert to sets
    for r, row in enumerate(data):
        for num in row:
            if num not in nums:
                # New number: tag it with a pointer to this row's bin
                nums[num] = r
                continue
            else:
                dest = locatebin(bins, nums[num])
                if dest == r:
                    continue  # already in the same bin
                if dest > r:
                    dest, r = r, dest  # always merge into the smallest bin
                data[dest].update(data[r])
                data[r] = None
                # Update our indices to reflect the move
                bins[r] = dest
                r = dest
    # Filter out the empty bins
    have = [m for m in data if m]
    print len(have), "groups in result"
    return have

def locatebin(bins, n):
    """
    Find the bin where list n has ended up: Follow bin references until
    we find a bin that has not moved.
    """
    while bins[n] != n:
        n = bins[n]
    return n
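For illustration, here is a minimal usage sketch on the question's small example (this is just my assumption of how you would call it; the ordering of the sets in the output may differ):

lsts = [[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]]
groups = mergelists5(lsts)   # prints "2 groups in result"
print groups                 # e.g. [set([0, 1, 3, 4, 5, 10, 11, 16]), set([8, 2])]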

This new function only does the minimum necessary number of disjointness tests, something the other similar solutions fail to do. It also uses a deque to avoid as many linear time operations as possible, like list slicing and deletion from early in the list.
from collections import deque

def merge(lists):
    sets = deque(set(lst) for lst in lists if lst)
    results = []
    disjoint = 0
    current = sets.pop()
    while True:
        merged = False
        newsets = deque()
        for _ in xrange(disjoint, len(sets)):
            this = sets.pop()
            if not current.isdisjoint(this):
                current.update(this)
                merged = True
                disjoint = 0
            else:
                newsets.append(this)
                disjoint += 1
        if sets:
            newsets.extendleft(sets)
        if not merged:
            results.append(current)
            try:
                current = newsets.pop()
            except IndexError:
                break
            disjoint = 0
        sets = newsets
    return results
The less overlap between the sets in a given set of data, the better this will do compared to the other functions.
Here is an example case. If you have 4 sets, you need to compare:
1, 2
1, 3
1, 4
2, 3
2, 4
3, 4
If 1 overlaps with 3, then 2 needs to be re-tested to see if it now overlaps with 1, in order to safely skip testing 2 against 3.
There are two ways to deal with this. The first is to restart the testing of set 1 against the other sets after each overlap and merge. The second is to continue with the testing by comparing 1 with 4, then going back and re-testing. The latter results in fewer disjointness tests, as more merges happen in a single pass, so on the re-test pass, there are fewer sets left to test against.
The problem is to track which sets have to be re-tested. In the above example, 1 needs to be re-tested against 2 but not against 4, since 1 was already in its current state before 4 was tested the first time.
The disjoint counter allows this to be tracked.
My answer doesn't help with the main problem of finding an improved algorithm for recoding into FORTRAN; it is just what appears to me to be the simplest and most elegant way to implement the algorithm in Python.
According to my testing (or the test in the accepted answer), it's slightly (up to 10%) faster than the next fastest solution.
def merge0(lists):
    newsets, sets = [set(lst) for lst in lists if lst], []
    while len(sets) != len(newsets):
        sets, newsets = newsets, []
        for aset in sets:
            for eachset in newsets:
                if not aset.isdisjoint(eachset):
                    eachset.update(aset)
                    break
            else:
                newsets.append(aset)
    return newsets
No need for the un-Pythonic counters (i, range) or complicated mutation (del, pop, insert) used in the other implementations. It uses only simple iteration, merges overlapping sets in the simplest manner, and builds a single new list on each pass through the data.
My (faster and simpler) version of the test code:
import random
tenk = range(10000)
lsts = [random.sample(tenk, random.randint(0, 500)) for _ in range(2000)]
setup = """
def merge0(lists):
    newsets, sets = [set(lst) for lst in lists if lst], []
    while len(sets) != len(newsets):
        sets, newsets = newsets, []
        for aset in sets:
            for eachset in newsets:
                if not aset.isdisjoint(eachset):
                    eachset.update(aset)
                    break
            else:
                newsets.append(aset)
    return newsets

def merge1(lsts):
    sets = [set(lst) for lst in lsts if lst]
    merged = 1
    while merged:
        merged = 0
        results = []
        while sets:
            common, rest = sets[0], sets[1:]
            sets = []
            for x in rest:
                if x.isdisjoint(common):
                    sets.append(x)
                else:
                    merged = 1
                    common |= x
            results.append(common)
        sets = results
    return sets
lsts = """ + repr(lsts)
import timeit
print timeit.timeit("merge0(lsts)", setup=setup, number=10)
print timeit.timeit("merge1(lsts)", setup=setup, number=10)

Here's an implementation using a disjoint-set data structure (specifically a disjoint forest), thanks to comingstorm's hint at merging sets which have even one element in common. I'm using path compression for a slight (~5%) speed improvement; it's not entirely necessary (and it prevents find being tail recursive, which could slow things down). Note that I'm using a dict to represent the disjoint forest; given that the data are ints, an array would also work although it might not be much faster.
def merge(data):
    parents = {}
    def find(i):
        j = parents.get(i, i)
        if j == i:
            return i
        k = find(j)
        if k != j:
            parents[i] = k
        return k
    for l in filter(None, data):
        parents.update(dict.fromkeys(map(find, l), find(l[0])))
    merged = {}
    for k, v in parents.items():
        merged.setdefault(find(v), []).append(k)
    return merged.values()
This approach is comparable to the other best algorithms on Rik's benchmarks.
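For illustration, a small usage sketch on the question's sample data (my own example; the order of the groups and of the elements inside them may differ):

lsts = [[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]]
print merge(lsts)   # e.g. [[0, 1, 3, 4, 5, 10, 11, 16], [8, 2]]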

This would be my updated approach:
def merge(data):
    sets = (set(e) for e in data if e)
    results = [next(sets)]
    for e_set in sets:
        to_update = []
        for i, res in enumerate(results):
            if not e_set.isdisjoint(res):
                to_update.insert(0, i)
        if not to_update:
            results.append(e_set)
        else:
            last = results[to_update.pop(-1)]
            for i in to_update:
                last |= results[i]
                del results[i]
            last |= e_set
    return results
Note: during the merge, empty lists will be removed.
Update: Reliability.
You need two tests for 100% reliability of success:
Check that all the resulting sets are mutually disjoint:
merged = [{0, 1, 3, 4, 5, 10, 11, 16}, {8, 2}, {8}]
from itertools import combinations
for a,b in combinations(merged,2):
    if not a.isdisjoint(b):
        raise Exception(a, b)  # just an example
Check that the merged sets cover the original data (as suggested by katrielalex).
I think this will take some time, but maybe it'll be worth it if you want to be 100% sure.
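A minimal sketch of that coverage check, assuming `lsts` holds the original input and `merged` holds the result (these names are mine, not from the answer):

original_flat = set()
for lst in lsts:
    original_flat |= set(lst)
merged_flat = set()
for s in merged:
    merged_flat |= s
if merged_flat != original_flat:
    raise Exception(original_flat ^ merged_flat)   # elements that were lost or invented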

Just for fun...
def merge(mylists):
    results, sets = [], [set(lst) for lst in mylists if lst]
    upd, isd, pop = set.update, set.isdisjoint, sets.pop
    while sets:
        if not [upd(sets[0], pop(i)) for i in xrange(len(sets)-1, 0, -1) if not isd(sets[0], sets[i])]:
            results.append(pop(0))
    return results
and my rewrite of the best answer
def merge(lsts):
    sets = map(set, lsts)
    results = []
    while sets:
        first, rest = sets[0], sets[1:]
        merged = False
        sets = []
        for s in rest:
            if s and s.isdisjoint(first):
                sets.append(s)
            else:
                first |= s
                merged = True
        if merged: sets.append(first)
        else: results.append(first)
    return results

Here's a function (Python 3.1) to check if the result of a merge function is OK. It checks:
Are the result sets disjoint? (number of elements of union == sum of numbers of elements)
Are the elements of the result sets the same as of the input lists?
Is every input list a subset of a result set?
Is every result set minimal, i.e. is it impossible to split it into two smaller sets?
It does not check if there are empty result sets - I don't know if you want them or not...
from itertools import chain

def check(lsts, result):
    lsts = [set(s) for s in lsts]
    all_items = set(chain(*lsts))
    all_result_items = set(chain(*result))
    num_result_items = sum(len(s) for s in result)
    if num_result_items != len(all_result_items):
        print("Error: result sets overlap!")
        print(num_result_items, len(all_result_items))
        print(sorted(map(len, result)), sorted(map(len, lsts)))
    if all_items != all_result_items:
        print("Error: result doesn't match input lists!")
    if not all(any(set(s).issubset(t) for t in result) for s in lsts):
        print("Error: not all input lists are contained in a result set!")
    seen = set()
    todo = list(filter(bool, lsts))
    done = False
    while not done:
        deletes = []
        for i, s in enumerate(todo):  # intersection with seen, or with unseen result set, is OK
            if not s.isdisjoint(seen) or any(t.isdisjoint(seen) for t in result if not s.isdisjoint(t)):
                seen.update(s)
                deletes.append(i)
        for i in reversed(deletes):
            del todo[i]
        done = not deletes
    if todo:
        print("Error: A result set should be split into two or more parts!")
        print(todo)

lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]
import networkx as nx
g = nx.Graph()
for sub_list in lists:
    for i in range(1, len(sub_list)):
        g.add_edge(sub_list[0], sub_list[i])
print nx.connected_components(g)
#[[1, 2, 3, 5, 6], [8, 9, 10], [11, 12, 13]]
Performance:
5000 lists, 5 classes, average size 74, max size 1000
15.2264976415
Performance of merge1:
print timeit.timeit("merge1(lsts)", setup=setup, number=10)
5000 lists, 5 classes, average size 74, max size 1000
1.26998780571
So it is 11x slower than the fastest, but the code is much simpler and more readable!

This is slower than the solution offered by Niklas (I got 3.9 s on test.txt instead of 0.5 s for his solution), but it yields the same result and might be easier to implement in, e.g., Fortran, since it doesn't use sets, only sorting of the total number of elements followed by a single run through all of them.
It returns a list with the ids of the merged lists, so it also keeps track of empty lists; they stay unmerged.
def merge(lsts):
    # this is an index list that stores the joined id for each list
    joined = range(len(lsts))
    # create an ordered list with indices
    indexed_list = sorted((el, index) for index, lst in enumerate(lsts) for el in lst)
    # loop through the ordered list, and if two elements are the same and
    # the lists are not yet joined, alter the list with joined id
    el_0, idx_0 = None, None
    for el, idx in indexed_list:
        if el == el_0 and joined[idx] != joined[idx_0]:
            old = joined[idx]
            rep = joined[idx_0]
            joined = [rep if id == old else id for id in joined]
        el_0, idx_0 = el, idx
    return joined

Firstly, I'm not exactly sure the benchmarks are fair. Adding the following code to the start of my function:

from collections import Counter
from itertools import chain
c = Counter(chain(*lists))
print c[1]

prints "88". This means that out of all the values in all the lists, there are only 88 distinct values. Usually in the real world duplicates are rare, and you would expect a lot more distinct values. (Of course, I don't know where your data comes from, so I can't make assumptions.)
Because duplicates are more common, the sets are less likely to be disjoint. This means the set.isdisjoint() method will be much faster, because it only needs a few tests to find that the sets aren't disjoint.
Having said all that, I do believe the methods presented that use isdisjoint are the fastest anyway; I'm just saying that, instead of being 20x faster, maybe they should only be 10x faster than the other methods with different benchmark testing.
Anyway, I thought I would try a slightly different technique to solve this. However, the merge sorting was too slow; this method is about 20x slower than the two fastest methods in the benchmarking:
I thought I would order everything
import heapq
from itertools import chain

def merge6(lists):
    for l in lists:
        l.sort()
    one_list = heapq.merge(*[zip(l, [i]*len(l)) for i, l in enumerate(lists)])  # iterating through one_list takes 25 seconds!!
    previous = one_list.next()
    d = {i: i for i in range(len(lists))}
    for current in one_list:
        if current[0] == previous[0]:
            d[current[1]] = d[previous[1]]
        previous = current
    groups = [[] for i in range(len(lists))]
    for k in d:
        groups[d[k]].append(lists[k])  # add each list to its group
    return [set(chain(*g)) for g in groups if g]  # since each subgroup in each g is sorted, it would be faster to merge these subgroups removing duplicates along the way.

lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]
print merge6(lists)
"[set([1, 2, 3, 5, 6]), set([8, 9, 10]), set([11, 12, 13])]""
import timeit
print timeit.timeit("merge1(lsts)", setup=setup, number=10)
print timeit.timeit("merge4(lsts)", setup=setup, number=10)
print timeit.timeit("merge6(lsts)", setup=setup, number=10)
5000 lists, 5 classes, average size 74, max size 1000
1.26732238315
5000 lists, 5 classes, average size 74, max size 1000
1.16062907437
5000 lists, 5 classes, average size 74, max size 1000
30.7257182826

I found Niklas B.'s answer really helpful... but it took me a while to read through it and understand the logic. This is a rewrite of exactly the same code with new variable names and more explanation... to help the other N00bs out there!
def mergeUntilOnlyDisjointSetsRemain(_listsOfLists):
    """Function for merging lists until there are only disjoint sets"""

    """Imagine this algorithm as if it were processing train cars filled with
    integers. It takes the first car of the train, separates it from the rest,
    and then compares the first car to each subsequent car.
    Start by renaming the first car to 'common'
    If the two train cars have a common integer, you merge the two cars into
    common, and continue down the line until you reach the end of the train.
    Once you reach the end of the train, place the common car in results, (which
    is essentially a collection of train cars that have already been compared
    to all other cars).
    You can exit the loop as soon as you get to the end of the train without
    merging any of the cars. This is controlled by the continueMerge variable.
    This variable is only reset to True after a merge operation.
    """

    # Start by creating a trainCar (i.e. a set) from each list in our listOfLists
    freightTrain = [set(trainCar) for trainCar in _listsOfLists if trainCar]

    # This continueMerge means that we have not yet compared all cars in the train.
    continueMerge = True
    while continueMerge:
        # Reset the while loop trigger.
        continueMerge = False
        # Create a fresh empty list of cars that have already been cross checked
        crossCheckedCars = []
        # While there are still elements in freightTrain
        while freightTrain:
            # Split the freightTrain up, into first car vs all the remaining cars
            commonFirstTrainCar = freightTrain[0]
            remainingCars = freightTrain[1:]
            # The freightTrain is now empty
            freightTrain = []
            # Iterate over all the remaining traincars
            for currentTrainCar in remainingCars:
                # If there are not any common integers with the first car...
                if currentTrainCar.isdisjoint(commonFirstTrainCar):
                    # Add each train car back onto the freightTrain
                    freightTrain.append(currentTrainCar)
                # But if they share a common integer...
                else:
                    # Trigger the reset switch to continue merging cars
                    continueMerge = True
                    # and join the cars together
                    commonFirstTrainCar |= currentTrainCar
            # Once we have checked commonFirstTrainCar, remove it from the
            # freightTrain and place it in crossCheckedCars
            crossCheckedCars.append(commonFirstTrainCar)
        # Now we have compared the first car to all subsequent cars
        # (... but we aren't finished because the 5th and 7th cars might have
        # had a common integer with each other... but only 1st and 5th cars
        # shared an integer the first time around... so the 1st and 5th cars
        # were merged, but the 7th car is still alone!)
        # Reset the system by creating a new freightTrain
        freightTrain = crossCheckedCars

    # Post-process the freight train to turn it into lists instead of sets
    listsForReturnTripHome = []
    for processedTraincar in freightTrain:
        listsForReturnTripHome.append(list(processedTraincar))
    return listsForReturnTripHome

My solution works well on small lists and is quite readable, with no dependencies.
def merge_list(starting_list):
    final_list = []
    for i, v in enumerate(starting_list[:-1]):
        if set(v) & set(starting_list[i+1]):
            starting_list[i+1].extend(list(set(v) - set(starting_list[i+1])))
        else:
            final_list.append(v)
    final_list.append(starting_list[-1])
    return final_list
Benchmarking it:
lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]
%timeit merge_list(lists)
100000 loops, best of 3: 4.9 µs per loop

This can be solved in O(n) by using the union-find algorithm. Given the first two rows of your data, edges to use in the union-find are the following pairs:
(0,1),(1,3),(1,0),(0,3),(3,4),(4,5),(5,10)
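The answer gives no code, so here is a minimal union-find sketch of that idea (my own illustration, not the poster's implementation): link consecutive elements of each list as described above, then group the numbers by their root.

def merge_union_find(lsts):
    parent = {}

    def find(x):
        # follow parent links to the root, halving the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for lst in lsts:
        for x in lst:
            parent.setdefault(x, x)
        for a, b in zip(lst, lst[1:]):   # the consecutive-pair edges listed above
            union(a, b)

    groups = {}
    for x in parent:
        groups.setdefault(find(x), []).append(x)
    return list(groups.values())

For example, merge_union_find([[0, 1, 3], [1, 0, 3, 4, 5, 10], [2, 8]]) would give something like [[0, 1, 3, 4, 5, 10], [2, 8]].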

Use a flag to ensure you get the final, mutually exclusive results:

def merge(lists):
    while 1:
        flag = 0
        for i in range(0, len(lists)):
            for j in range(i + 1, len(lists)):
                if len(set(lists[i]) & set(lists[j])) != 0:
                    lists[i] = list(set(lists[i]) | set(lists[j]))
                    lists.remove(lists[j])
                    flag += 1
                    break
            if flag:
                break        # the list changed length, so restart the scan
        if flag == 0:
            break            # a full pass with no merge: done
    return lists

from itertools import combinations

def merge(elements_list):
    d = {index: set(elements) for index, elements in enumerate(elements_list)}
    while any(not set.isdisjoint(d[i], d[j]) for i, j in combinations(d.keys(), 2)):
        merged = set()
        for i, j in combinations(d.keys(), 2):
            if not set.isdisjoint(d[i], d[j]):
                d[i] = set.union(d[i], d[j])
                merged.add(j)
        for k in merged:
            d.pop(k)
    return [v for v in d.values() if v]
lst = [[65, 17, 5, 30, 79, 56, 48, 62],
       [6, 97, 32, 93, 55, 14, 70, 32],
       [75, 37, 83, 34, 9, 19, 14, 64],
       [43, 71],
       [],
       [89, 49, 1, 30, 28, 3, 63],
       [35, 21, 68, 94, 57, 94, 9, 3],
       [16],
       [29, 9, 97, 43],
       [17, 63, 24]]
print(merge(lst))

Related

how to generate 100000 random numbers in python [duplicate]

I tried using random.randint(0, 100), but some numbers were the same. Is there a method/module to create a list of unique random numbers?
This will return a list of 10 numbers selected from the range 0 to 99, without duplicates.
import random
random.sample(range(100), 10)
You can use the shuffle function from the random module like this:
import random
nums = list(range(1, 100)) # list of integers from 1 to 99
# adjust these boundaries to fit your needs
random.shuffle(nums)
print(nums) # <- List of unique random numbers
Note here that the shuffle method doesn't return a list, as one might expect; it only shuffles the list passed to it, in place.
You can first create a list of numbers from a to b, where a and b are respectively the smallest and greatest numbers in your list, then shuffle it with the Fisher-Yates algorithm or using Python's random.shuffle method.
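For illustration, a minimal sketch of the Fisher-Yates idea (the helper name is mine; random.shuffle does essentially the same thing internally):

import random

def fisher_yates_shuffle(nums):
    # walk backwards, swapping each slot with a randomly chosen earlier (or same) slot
    for i in range(len(nums) - 1, 0, -1):
        j = random.randint(0, i)
        nums[i], nums[j] = nums[j], nums[i]
    return nums

unique_randoms = fisher_yates_shuffle(list(range(1, 100)))[:10]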
Linear Congruential Pseudo-random Number Generator
O(1) Memory
O(k) Operations
This problem can be solved with a simple Linear Congruential Generator. This requires constant memory overhead (8 integers) and at most 2*(sequence length) computations.
All other solutions use more memory and more compute! If you only need a few random sequences, this method will be significantly cheaper. For ranges of size N, if you want to generate on the order of N unique k-sequences or more, I recommend the accepted solution using the builtin methods random.sample(range(N),k) as this has been optimized in python for speed.
Code
# Return a randomized "range" using a Linear Congruential Generator
# to produce the number sequence. Parameters are the same as for
# python builtin "range".
# Memory -- storage for 8 integers, regardless of parameters.
# Compute -- at most 2*"maximum" steps required to generate sequence.
#
def random_range(start, stop=None, step=None):
    import random, math
    # Set a default values the same way "range" does.
    if (stop == None): start, stop = 0, start
    if (step == None): step = 1
    # Use a mapping to convert a standard range into the desired range.
    mapping = lambda i: (i*step) + start
    # Compute the number of numbers in this range.
    maximum = (stop - start) // step
    # Seed range with a random integer.
    value = random.randint(0, maximum)
    #
    # Construct an offset, multiplier, and modulus for a linear
    # congruential generator. These generators are cyclic and
    # non-repeating when they maintain the properties:
    #
    # 1) "modulus" and "offset" are relatively prime.
    # 2) ["multiplier" - 1] is divisible by all prime factors of "modulus".
    # 3) ["multiplier" - 1] is divisible by 4 if "modulus" is divisible by 4.
    #
    offset = random.randint(0, maximum) * 2 + 1      # Pick a random odd-valued offset.
    multiplier = 4*(maximum//4) + 1                  # Pick a multiplier 1 greater than a multiple of 4.
    modulus = int(2**math.ceil(math.log2(maximum)))  # Pick a modulus just big enough to generate all numbers (power of 2).
    # Track how many random numbers have been returned.
    found = 0
    while found < maximum:
        # If this is a valid value, yield it in generator fashion.
        if value < maximum:
            found += 1
            yield mapping(value)
        # Calculate the next value in the sequence.
        value = (value*multiplier + offset) % modulus
Usage
The usage of this function "random_range" is the same as for any generator (like "range"). An example:
# Show off random range.
print()
for v in range(3,6):
    v = 2**v
    l = list(random_range(v))
    print("Need", v, "found", len(set(l)), "(min,max)", (min(l), max(l)))
    print("", l)
print()
Sample Results
Required 8 cycles to generate a sequence of 8 values.
Need 8 found 8 (min,max) (0, 7)
[1, 0, 7, 6, 5, 4, 3, 2]
Required 16 cycles to generate a sequence of 9 values.
Need 9 found 9 (min,max) (0, 8)
[3, 5, 8, 7, 2, 6, 0, 1, 4]
Required 16 cycles to generate a sequence of 16 values.
Need 16 found 16 (min,max) (0, 15)
[5, 14, 11, 8, 3, 2, 13, 1, 0, 6, 9, 4, 7, 12, 10, 15]
Required 32 cycles to generate a sequence of 17 values.
Need 17 found 17 (min,max) (0, 16)
[12, 6, 16, 15, 10, 3, 14, 5, 11, 13, 0, 1, 4, 8, 7, 2, ...]
Required 32 cycles to generate a sequence of 32 values.
Need 32 found 32 (min,max) (0, 31)
[19, 15, 1, 6, 10, 7, 0, 28, 23, 24, 31, 17, 22, 20, 9, ...]
Required 64 cycles to generate a sequence of 33 values.
Need 33 found 33 (min,max) (0, 32)
[11, 13, 0, 8, 2, 9, 27, 6, 29, 16, 15, 10, 3, 14, 5, 24, ...]
The solution presented in this answer works, but it could become problematic memory-wise if the sample size is small but the population is huge (e.g. random.sample(insanelyLargeNumber, 10)).
To fix that, I would go with this:
answer = set()
sampleSize = 10
answerSize = 0
while answerSize < sampleSize:
    r = random.randint(0, 100)
    if r not in answer:
        answerSize += 1
        answer.add(r)
# answer now contains 10 unique, random integers from 0.. 100
If you need to sample extremely large numbers, you cannot use range
random.sample(range(10000000000000000000000000000000), 10)
because it throws:
OverflowError: Python int too large to convert to C ssize_t
Also, if random.sample cannot produce the number of items you want due to the range being too small
random.sample(range(2), 1000)
it throws:
ValueError: Sample larger than population
This function resolves both problems:
import random
def random_sample(count, start, stop, step=1):
    def gen_random():
        while True:
            yield random.randrange(start, stop, step)
    def gen_n_unique(source, n):
        seen = set()
        seenadd = seen.add
        for i in (i for i in source() if i not in seen and not seenadd(i)):
            yield i
            if len(seen) == n:
                break
    return [i for i in gen_n_unique(gen_random,
                                    min(count, int(abs(stop - start) / abs(step))))]
Usage with extremely large numbers:
print('\n'.join(map(str, random_sample(10, 2, 10000000000000000000000000000000))))
Sample result:
7822019936001013053229712669368
6289033704329783896566642145909
2473484300603494430244265004275
5842266362922067540967510912174
6775107889200427514968714189847
9674137095837778645652621150351
9969632214348349234653730196586
1397846105816635294077965449171
3911263633583030536971422042360
9864578596169364050929858013943
Usage where the range is smaller than the number of requested items:
print(', '.join(map(str, random_sample(100000, 0, 3))))
Sample result:
2, 0, 1
It also works with negative ranges and steps:
print(', '.join(map(str, random_sample(10, 10, -10, -2))))
print(', '.join(map(str, random_sample(10, 5, -5, -2))))
Sample results:
2, -8, 6, -2, -4, 0, 4, 10, -6, 8
-3, 1, 5, -1, 3
If the list of N numbers from 1 to N is randomly generated, then yes, there is a possibility that some numbers may be repeated.
If you want a list of numbers from 1 to N in a random order, fill an array with integers from 1 to N, and then use a Fisher-Yates shuffle or Python's random.shuffle().
Here is a very small function I made, hope this helps!
import random
numbers = list(range(0, 100))
random.shuffle(numbers)
A very simple function that also solves your problem
from random import randint
data = []
def unique_rand(inicial, limit, total):
    data = []
    i = 0
    while i < total:
        number = randint(inicial, limit)
        if number not in data:
            data.append(number)
            i += 1
    return data
data = unique_rand(1, 60, 6)
print(data)
"""
prints something like
[34, 45, 2, 36, 25, 32]
"""
One straightforward alternative is to use np.random.choice() as shown below
np.random.choice(range(10), size=3, replace=False)
This results in three integer numbers that are different from each other. e.g., [1, 3, 5], [2, 5, 1]...
The answer provided here works very well with respect to time as well as memory, but it is a bit more complicated, as it uses advanced Python constructs such as yield. The simpler answer works well in practice, but the issue with that answer is that it may generate many spurious integers before actually constructing the required set. Try it out with populationSize = 1000, sampleSize = 999. In theory, there is a chance that it doesn't terminate.
The answer below addresses both issues, as it is deterministic and somewhat efficient, though currently not as efficient as the other two.
def randomSample(populationSize, sampleSize):
    populationStr = str(populationSize)
    dTree, samples = {}, []
    for i in range(sampleSize):
        val, dTree = getElem(populationStr, dTree, '')
        samples.append(int(val))
    return samples, dTree
where the functions getElem, percolateUp are as defined below
import random
def getElem(populationStr, dTree, key):
    msd = int(populationStr[0])
    if not key in dTree.keys():
        dTree[key] = range(msd + 1)
    idx = random.randint(0, len(dTree[key]) - 1)
    key = key + str(dTree[key][idx])
    if len(populationStr) == 1:
        dTree[key[:-1]].pop(idx)
        return key, (percolateUp(dTree, key[:-1]))
    newPopulation = populationStr[1:]
    if int(key[-1]) != msd:
        newPopulation = str(10**(len(newPopulation)) - 1)
    return getElem(newPopulation, dTree, key)

def percolateUp(dTree, key):
    while (dTree[key] == []):
        dTree[key[:-1]].remove(int(key[-1]))
        key = key[:-1]
    return dTree
Finally, the timing on average was about 15ms for a large value of n as shown below,
In [3]: n = 10000000000000000000000000000000
In [4]: %time l,t = randomSample(n, 5)
Wall time: 15 ms
In [5]: l
Out[5]:
[10000000000000000000000000000000L,
5731058186417515132221063394952L,
85813091721736310254927217189L,
6349042316505875821781301073204L,
2356846126709988590164624736328L]
In order to obtain a program that generates a list of random values without duplicates that is deterministic, efficient and built with basic programming constructs consider the function extractSamples defined below,
def extractSamples(populationSize, sampleSize, intervalLst):
    import random
    if (sampleSize > populationSize):
        raise ValueError("sampleSize = " + str(sampleSize) + " > populationSize (= " + str(populationSize) + ")")
    samples = []
    while (len(samples) < sampleSize):
        i = random.randint(0, (len(intervalLst) - 1))
        (a, b) = intervalLst[i]
        sample = random.randint(a, b)
        if (a == b):
            intervalLst.pop(i)
        elif (a == sample):  # shorten beginning of interval
            intervalLst[i] = (sample + 1, b)
        elif (sample == b):  # shorten interval end
            intervalLst[i] = (a, sample - 1)
        else:
            intervalLst[i] = (a, sample - 1)
            intervalLst.append((sample + 1, b))
        samples.append(sample)
    return samples
The basic idea is to keep track of intervals intervalLst for possible values from which to select our required elements from. This is deterministic in the sense that we are guaranteed to generate a sample within a fixed number of steps (solely dependent on populationSize and sampleSize).
To use the above function to generate our required list,
In [3]: populationSize, sampleSize = 10**17, 10**5
In [4]: %time lst1 = extractSamples(populationSize, sampleSize, [(0, populationSize-1)])
CPU times: user 289 ms, sys: 9.96 ms, total: 299 ms
Wall time: 293 ms
We may also compare with an earlier solution (for a lower value of populationSize)
In [5]: populationSize, sampleSize = 10**8, 10**5
In [6]: %time lst = random.sample(range(populationSize), sampleSize)
CPU times: user 1.89 s, sys: 299 ms, total: 2.19 s
Wall time: 2.18 s
In [7]: %time lst1 = extractSamples(populationSize, sampleSize, [(0, populationSize-1)])
CPU times: user 449 ms, sys: 8.92 ms, total: 458 ms
Wall time: 442 ms
Note that I reduced populationSize value as it produces Memory Error for higher values when using the random.sample solution (also mentioned in previous answers here and here). For above values, we can also observe that extractSamples outperforms the random.sample approach.
P.S.: Though the core approach is similar to my earlier answer, there are substantial modifications in the implementation as well as the approach, along with an improvement in clarity.
The problem with the set-based approaches ("if the random value is already among the return values, try again") is that their runtime is non-deterministic due to collisions (which require another "try again" iteration), especially when a large number of random values are drawn from the range.
An alternative that isn't prone to this non-deterministic runtime is the following:
import bisect
import random
def fast_sample(low, high, num):
    """ Samples :param num: integer numbers in range of
        [:param low:, :param high:) without replacement
        by maintaining a list of ranges of values that
        are permitted.
        This list of ranges is used to map a random number
        of a contiguous a range (`r_n`) to a permissible
        number `r` (from `ranges`).
    """
    ranges = [high]
    high_ = high - 1
    while len(ranges) - 1 < num:
        # generate a random number from an ever decreasing
        # contiguous range (which we'll map to the true
        # random number).
        # consider an example with low=0, high=10,
        # part way through this loop with:
        #
        # ranges = [0, 2, 3, 7, 9, 10]
        #
        # r_n :-> r
        #   0 :-> 1
        #   1 :-> 4
        #   2 :-> 5
        #   3 :-> 6
        #   4 :-> 8
        r_n = random.randint(low, high_)
        range_index = bisect.bisect_left(ranges, r_n)
        r = r_n + range_index
        for i in xrange(range_index, len(ranges)):
            if ranges[i] <= r:
                # as many "gaps" we iterate over, as much
                # is the true random value (`r`) shifted.
                r = r_n + i + 1
            elif ranges[i] > r_n:
                break
        # mark `r` as another "gap" of the original
        # [low, high) range.
        ranges.insert(i, r)
        # Fewer values possible.
        high_ -= 1
    # `ranges` happens to contain the result.
    return ranges[:-1]
I found a quicker way than using the range function (which is very slow), and without using the random functions from Python (I don't like the random built-in library because, when you seed it, it repeats the pattern of the random number generator):
import numpy as np
nums = set(np.random.randint(low=0, high=100, size=150)) #generate some more for the duplicates
nums = list(nums)[:100]
This is quite fast.
You can use the NumPy library for a quick answer, as shown below.
The given code snippet lists down 6 unique numbers in the range 0 to 5. You can adjust the parameters for your comfort.
import numpy as np
import random
a = np.linspace( 0, 5, 6 )
random.shuffle(a)
print(a)
Output
[ 2. 1. 5. 3. 4. 0.]
It doesn't impose the constraints we see in random.sample, as referred to here.
import random
sourcelist=[]
resultlist=[]
for x in range(100):
    sourcelist.append(x)
for y in sourcelist:
    resultlist.insert(random.randint(0, len(resultlist)), y)
print (resultlist)
Try using...
import random
LENGTH = 100
random_with_possible_duplicates = [random.randrange(-3, 3) for _ in range(LENGTH)]
random_without_duplicates = list(set(random_with_possible_duplicates)) # This removes duplicates
Advantages
Fast, efficient and readable.
Possible Issues
This method can change the length of the list if there are duplicates.
If you wish to ensure that the numbers being added are unique, you could use a Set object
if using 2.7 or greater, or import the sets module if not.
As others have mentioned, this means the numbers are not truly random.
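A minimal sketch of that set-based idea (my own example, assuming a range of 0-99 and 10 numbers):

import random

chosen = set()
while len(chosen) < 10:
    chosen.add(random.randint(0, 99))   # duplicates are silently absorbed by the set
unique_numbers = list(chosen)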
If the amount of numbers you want is random, you can do something like this. In this case, length is the highest number you want to choose from.
If it notices that the new random number was already chosen, it'll subtract 1 from count (since a count was added before it knew whether it was a duplicate or not). If it's not in the list, then do what you want with it and add it to the list so it can't get picked again.
import random

def randomizer():
    chosen_number = []
    count = 0
    user_input = int(input("Enter number for how many rows to randomly select: "))
    numlist = []
    #length = whatever the highest number you want to choose from
    while 1 <= user_input <= length:
        count = count + 1
        if count > user_input:
            break
        else:
            chosen_number = random.randint(0, length)
            if chosen_number in numlist:
                count = count - 1
                continue
            if chosen_number not in numlist:
                numlist.append(chosen_number)
                #do what you want here
Edit: ignore my answer here. Use Python's random.shuffle or random.sample, as mentioned in other answers.
To sample integers without replacement between `minval` and `maxval`:
import numpy as np
minval, maxval, n_samples = -50, 50, 10
generator = np.random.default_rng(seed=0)
samples = generator.permutation(np.arange(minval, maxval))[:n_samples]
# or, if minval is 0,
samples = generator.permutation(maxval)[:n_samples]
with jax:
import jax
minval, maxval, n_samples = -50, 50, 10
key = jax.random.PRNGKey(seed=0)
samples = jax.random.shuffle(key, jax.numpy.arange(minval, maxval))[:n_samples]
From the CLI in win xp:
python -c "import random; print(sorted(set([random.randint(6,49) for i in range(7)]))[:6])"
In Canada we have the 6/49 Lotto. I just wrap the above code in lotto.bat and run C:\home\lotto.bat or just C:\home\lotto.
Because random.randint often repeats a number, I use set with range(7) and then shorten it to a length of 6.
Occasionally if a number repeats more than 2 times the resulting list length will be less than 6.
EDIT: However, random.sample(range(6,49),6) is the correct way to go.

Maximum path sum of 2 lists

My question is about this kata on Codewars. The function takes two sorted lists with distinct elements as arguments. These lists might or might not have common items. The task is to find the maximum path sum. While computing the sum, if there are any common items, you can choose to switch your path to the other list.
The given example is like this:
list1 = [0, 2, 3, 7, 10, 12]
list2 = [1, 5, 7, 8]
0->2->3->7->10->12 => 34
0->2->3->7->8 => 20
1->5->7->8 => 21
1->5->7->10->12 => 35 (maximum path)
I solved the kata, but my code doesn't meet the performance criteria, so I get "execution timed out". What can I do about it?
Here is my solution:
import itertools

def max_sum_path(l1: list, l2: list):
    common_items = list(set(l1).intersection(l2))
    if not common_items:
        return max(sum(l1), sum(l2))
    common_items.sort()
    s = 0
    new_start1 = 0
    new_start2 = 0
    s1 = 0
    s2 = 0
    for item in common_items:
        s1 = sum(itertools.islice(l1, new_start1, l1.index(item)))
        s2 = sum(itertools.islice(l2, new_start2, l2.index(item)))
        new_start1 = l1.index(item)
        new_start2 = l2.index(item)
        s += max(s1, s2)
    s1 = sum(itertools.islice(l1, new_start1, len(l1)))
    s2 = sum(itertools.islice(l2, new_start2, len(l2)))
    s += max(s1, s2)
    return s
Your algorithm is actually fast, just your implementation is slow.
The two things that make it take overall O(n²) time:
l1.index(item) always searches from the start of the list. Should be l1.index(item, new_start1).
itertools.islice(l1, new_start1, ...) creates an iterator for l1 and iterates over the first new_start1 elements before it reaches the elements you want. So just use a normal list slice instead.
Then it's just O(n log n) for the sorting and O(n) for everything else. And the sorting's O(n log n) is fast; it might easily take less time than the O(n) part for any allowed input, and even for larger ones.
Here's the rewritten version, gets accepted in about 6 seconds, just like the solutions from the other answers.
def max_sum_path(l1: list, l2: list):
    common_items = list(set(l1).intersection(l2))
    if not common_items:
        return max(sum(l1), sum(l2))
    common_items.sort()
    s = 0
    new_start1 = 0
    new_start2 = 0
    s1 = 0
    s2 = 0
    for item in common_items:
        next_start1 = l1.index(item, new_start1)  # changed
        next_start2 = l2.index(item, new_start2)  # changed
        s1 = sum(l1[new_start1 : next_start1])    # changed
        s2 = sum(l2[new_start2 : next_start2])    # changed
        new_start1 = next_start1                  # changed
        new_start2 = next_start2                  # changed
        s += max(s1, s2)
    s1 = sum(l1[new_start1:])                     # changed
    s2 = sum(l2[new_start2:])                     # changed
    s += max(s1, s2)
    return s
Or you could use iterators instead of indexes. Here's your solution rewritten to do that, also gets accepted in about 6 seconds:
def max_sum_path(l1: list, l2: list):
    common_items = sorted(set(l1) & set(l2))
    s = 0
    it1 = iter(l1)
    it2 = iter(l2)
    for item in common_items:
        s1 = sum(iter(it1.__next__, item))
        s2 = sum(iter(it2.__next__, item))
        s += max(s1, s2) + item
    s1 = sum(it1)
    s2 = sum(it2)
    s += max(s1, s2)
    return s
I'd combine the last four lines into one, just left it like you had so it's easier to compare.
The problem says "aim for linear time complexity", which is a pretty big hint that things like nested loops won't fly (the .index calls form nested O(n) loops here, and sort() is O(n log n) when there are many values common to the input lists).
This answer shows how you can cache the repeated .index calls and use start offsets from the last chunk to bring the complexity down.
As the linked answer also states, itertools.islice isn't appropriate here because it traverses from the start of the list; use native slicing instead. This, coupled with the modifications to index above, gives you linearithmic complexity overall, and linear time on most inputs.
For context, here's my approach, which isn't that different from yours, although I cache indices and avoid sorting.
I started by formulating the problem as a directed acyclic graph with the idea of searching for the maximum path sum:
       +---> [0, 2, 3] ---+            +---> [10, 12]
[0] ---|                  |---> [7] ---|
       +---> [1, 5] ------+            +---> [8]
We might as well also sum the values of each node for clarity:
     +---> 5 ---+          +---> 22
0 ---|          |---> 7 ---|
     +---> 6 ---+          +---> 8
The diagram above reveals that a greedy solution will be optimal, given the uniqueness constraints. For example, starting from the root, we can only pick the 5 or 6 value path to get to 7. The larger of the two, 6, is guaranteed to be part of the maximum-weight path, so we take it.
Now, the question is only how to implement this logic. Going back to the lists, here's a more substantial input with formatting and annotations to help motivate an approach:
[1, 2, 4, 7, 8, 10, 14,         15    ]
[      4,    8,  9, 11, 12,     15, 90]
       ^     ^                  ^
       |     |                  |
This illustrates how the linked indices line up. Our goal is to iterate over each chunk between the links, taking the larger of the two sublist sums:
[1, 2, 4, 7, 8, 10, 14,         15    ]
[      4,    8,  9, 11, 12,     15, 90]
 ^~~^     ^     ^~~~~~~~^          ^^
  0       1         2              3    <-- chunk number
The expected result for the above input should be 3 + 4 + 7 + 8 + 32 + 15 + 90 = 159, taking all of the link values plus the top list's sublist sum for chunks 0 and 1 and the bottom list's for chunks 2 and 3.
Here's a rather verbose, but hopefully easy to understand, implementation; you can visit the thread to see more elegant solutions:
def max_sum_path(a, b):
    b_idxes = {k: i for i, k in enumerate(b)}
    link_to_a = {}
    link_to_b = {}
    for i, e in enumerate(a):
        if e in b_idxes:
            link_to_a[e] = i
            link_to_b[e] = b_idxes[e]
    total = 0
    start_a = 0
    start_b = 0
    for link in link_to_a:  # dicts assumed sorted, Python 3.6+
        end_a = link_to_a[link]
        end_b = link_to_b[link]
        total += max(sum(a[start_a:end_a]), sum(b[start_b:end_b])) + link
        start_a = end_a + 1
        start_b = end_b + 1
    return total + max(sum(a[start_a:]), sum(b[start_b:]))
This can be done in a single pass in O(n) runtime and O(1) space complexity. All you need is two pointers to traverse both arrays in parallel and two path values.
You increment the pointer to the smaller element and add its value to its path. When you find a common element, you add it to both paths and then set both paths to the max value.
def max_sum_path(l1, l2):
    path1 = 0
    path2 = 0
    i = 0
    j = 0
    while i < len(l1) and j < len(l2):
        if l1[i] < l2[j]:
            path1 += l1[i]
            i += 1
        elif l2[j] < l1[i]:
            path2 += l2[j]
            j += 1
        else:
            # Same element in both paths
            path1 += l1[i]
            path2 += l1[i]
            path1 = max(path1, path2)
            path2 = path1
            i += 1
            j += 1
    while i < len(l1):
        path1 += l1[i]
        i += 1
    while j < len(l2):
        path2 += l2[j]
        j += 1
    return max(path1, path2)
Once you know the items shared between the two lists, you can iterate over each list separately to sum up the items in between the shared items, thus constructing a list of partial sums. These lists will have the same length for both input lists, because the number of shared items is the same.
The maximum path sum can then be found by taking the maximum between the two lists for each stretch between shared values:
def max_sum_path(l1, l2):
    shared_items = set(l1) & set(l2)

    def partial_sums(lst):
        result = []
        partial_sum = 0
        for item in lst:
            partial_sum += item
            if item in shared_items:
                result.append(partial_sum)
                partial_sum = 0
        result.append(partial_sum)
        return result

    return sum(map(max, partial_sums(l1), partial_sums(l2)))
Time complexity: we only iterate once over each list (the iteration over the shorter lists of partial sums is irrelevant here), so this code is linear in the length of the input lists. However, as you and Kelly Bundy have noted, your own algorithm actually has the same time complexity, except for sorting the common items, which does not appear to be too relevant for the given test cases.
So as a general conclusion: if your goal is just to make your code fast enough to pass certain test cases, it can be better to profile the execution to find the time sinks in the actual implementation rather than worry about theoretical worst-case scenarios.
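For instance, a minimal profiling sketch along those lines, assuming max_sum_path is defined in the script being profiled; the input lists are made up purely for illustration:
import cProfile
import pstats

# Profile one call and show the ten most expensive functions by cumulative time.
cProfile.run("max_sum_path(list(range(0, 200000, 2)), list(range(0, 300000, 3)))",
             "max_sum_path.prof")
pstats.Stats("max_sum_path.prof").sort_stats("cumulative").print_stats(10)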
Benchmarks
On the Discourse tab you can click "Show Kata Test Cases" (once you have solved the kata) to see their test case generator. I used that to benchmark the solutions posted so far, as well as one of my own, over a few dozen rounds, since the test cases are pretty random and cause big runtime fluctuations. In each round, all generated test cases were given to all solutions (so in each round, all solutions got the same test cases).
And also Kelly Bundy's worst case for sorting the set of common values:
Code shall follow.
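Until then, here is a minimal sketch of the kind of harness that could be used; the test-case generator below is only a stand-in, not the kata's actual generator, and the solution names are placeholders:
import random
from timeit import default_timer as timer

def generate_case(n=10_000):
    # Stand-in generator: two sorted lists of distinct values with some overlap.
    universe = random.sample(range(10 * n), 2 * n)
    a = sorted(random.sample(universe, n))
    b = sorted(random.sample(universe, n))
    return a, b

def benchmark(solutions, rounds=10):
    totals = {name: 0.0 for name in solutions}
    for _ in range(rounds):
        case = generate_case()
        for name, func in solutions.items():
            start = timer()
            func(*case)
            totals[name] += timer() - start
    for name, seconds in sorted(totals.items(), key=lambda kv: kv[1]):
        print(f"{name:25s} {seconds:.3f} s")

# Example usage, with solution functions defined elsewhere:
# benchmark({"index/slice": max_sum_path})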

How to make sure that a list of generated numbers follows a uniform distribution

I have a list of 150 numbers from 0 to 149. I would like to use a for loop with 150 iterations in order to generate 150 lists of 6 numbers, such that in each iteration k, the number k is included as well as 5 different random numbers. For example:
S0 = [0, r1, r2, r3, r4, r5] # r1, r2,..., r5 are random numbers between 0 and 150
S1 = [1, r1', r2', r3', r4', r5'] # r1', r2',..., r5' are new random numbers between 0 and 150
...
S149 = [149, r1'', r2'', r3'', r4'', r5'']
In addition, the numbers in each list have to be different and with a minimum distance of 5. This is the code I am using:
import random
import numpy as np

final_list = []
for k in range(150):
    S = [k]
    for it in range(5):
        domain = [ele for ele in range(150) if ele not in S]
        d = 0
        x = k
        while d < 5:
            d = np.inf
            x = random.sample(domain, 1)[0]
            for ch in S:
                if np.abs(ch - x) < d:
                    d = np.abs(ch - x)
        S.append(x)
    final_list.append(S)
Output:
[[0, 149, 32, 52, 39, 126],
[1, 63, 16, 50, 141, 79],
[2, 62, 21, 42, 35, 71],
...
[147, 73, 38, 115, 82, 47],
[148, 5, 78, 115, 140, 43],
[149, 36, 3, 15, 99, 23]]
Now, the code is working, but I would like to know if it is possible to force the number of repetitions of each number across all the iterations to be approximately the same. For example, after using the previous code, this plot indicates how many times each number has appeared in the generated lists:
As you can see, there are numbers that have appeared more than 10 times while there are others that have appeared only 2 times. Is it possible to reduce this level of variation so that this plot can be approximated as a uniform distribution? Thanks.
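For reference, the counts behind such a plot can be reproduced with collections.Counter; a small sketch that assumes the final_list built by the code above:
from collections import Counter

# How often each number 0..149 appears across all 150 generated lists.
appearances = Counter(x for sublist in final_list for x in sublist)
for number in range(150):
    print(number, appearances[number])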
First, I am not sure that your assertion that the current results are not uniformly distributed is necessarily correct. It would seem prudent to me to try and examine the histogram over several repetitions of the process, rather than just one.
I am not a statistician, but when I want to approximate uniform distribution (and assuming that the functions in random provide uniform distribution), what I try to do is to simply accept all results returned by random functions. For that, I need to limit the choices given to these functions ahead of calling them. This is how I would go about your task:
import random
import numpy as np

N = 150

def random_subset(n):
    result = []
    cands = set(range(N))
    for i in range(6):
        result.append(n)                   # Initially, n is the number that must appear in the result
        cands -= set(range(n - 4, n + 5))  # Remove candidates less than 5 away
        n = random.choice(list(cands))     # Select next number
    return result

result = np.array([random_subset(n) for n in range(N)])
print(result)
Simply put, whenever I add a number n to the result, I remove from the candidate set a neighborhood of the proper size, to ensure that no number at a distance of less than 5 from it can be selected later on.
The code is not optimized (multiple set-to-list conversions), but it works (as per my understanding).
You can force it to be precisely uniform, if you so desire.
Apologies for the mix of globals and locals, this seemed the most readable. You would want to rewrite according to how variable your constants are =)
import random

SIZE = 150
SAMPLES = 5

def get_samples():
    pool = list(range(SIZE)) * SAMPLES
    random.shuffle(pool)
    items = []
    for i in range(SIZE):
        selection, pool = pool[:SAMPLES], pool[SAMPLES:]
        item = [i] + selection
        items.append(item)
    return items
Then you will have exactly 5 of each (and one more in the leading position, which is a weird data structure).
>>> set(collections.Counter(vv for v in get_samples() for vv in v).values())
{6}
The method above does not guarantee that the last 5 numbers of each list are unique; in fact, you would expect roughly 10 of the 150 lists to contain a duplicate. If that is important, you need to filter your distribution a little more and decide how much you value tight uniformity, duplicates, etc.
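That claim is easy to check empirically with the get_samples above (a small sketch):
# Count how many of the 150 lists contain a duplicate among their 5 sampled numbers.
with_duplicates = sum(len(set(item[1:])) < SAMPLES for item in get_samples())
print(with_duplicates)  # roughly in line with the ~10/150 estimate above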
If your numbers are approximately what you gave above, you can also patch up the results (fairly) and hope to avoid long search times (not the case for SAMPLES sizes closer to the SIZE of the option pool):
def get_samples():
    pool = list(range(SIZE)) * SAMPLES
    random.shuffle(pool)
    i = 0
    while i < len(pool):
        if i % SAMPLES == 0:
            seen = set()
        v = pool[i]
        if v in seen:  # swap
            dst = random.choice(range(SIZE))
            pool[dst], pool[i] = pool[i], pool[dst]
            i = dst - dst % SAMPLES  # Restart from swapped segment
        else:
            seen.add(v)
            i += 1
    items = []
    for i in range(SIZE):
        selection, pool = pool[:SAMPLES], pool[SAMPLES:]
        assert len(set(selection)) == SAMPLES, selection
        item = [i] + selection
        items.append(item)
    return items
This will typically take less than 5 passes through to clean up any duplicates, and should leave all arrangements satisfying your conditions equally likely.

Given a string of a million numbers, return all repeating 3 digit numbers

I had an interview with a hedge fund company in New York a few months ago and unfortunately, I did not get the internship offer as a data/software engineer. (They also asked the solution to be in Python.)
I pretty much screwed up on the first interview problem...
Question: Given a string of a million numbers (Pi for example), write
a function/program that returns all repeating 3 digit numbers and number of
repetition greater than 1
For example: if the string was: 123412345123456 then the function/program would return:
123 - 3 times
234 - 3 times
345 - 2 times
They did not give me the solution after I failed the interview, but they did tell me that the time complexity of the solution was a constant of 1000, since all the possible outcomes lie between:
000 --> 999
Now that I'm thinking about it, I don't think it's possible to come up with a constant time algorithm. Is it?
You got off lightly, you probably don't want to be working for a hedge fund where the quants don't understand basic algorithms :-)
There is no way to process an arbitrarily-sized data structure in O(1) if, as in this case, you need to visit every element at least once. The best you can hope for is O(n) in this case, where n is the length of the string.
Although, as an aside, a nominal O(n) algorithm will be O(1) for a fixed input size so, technically, they may have been correct here. However, that's not usually how people use complexity analysis.
It appears to me you could have impressed them in a number of ways.
First, by informing them that it's not possible to do it in O(1), unless you use the "suspect" reasoning given above.
Second, by showing your elite skills by providing Pythonic code such as:
inpStr = '123412345123456'

# O(1) array creation.
freq = [0] * 1000

# O(n) string processing.
for val in [int(inpStr[pos:pos+3]) for pos in range(len(inpStr) - 2)]:
    freq[val] += 1

# O(1) output of relevant array values.
print([(num, freq[num]) for num in range(1000) if freq[num] > 1])
This outputs:
[(123, 3), (234, 3), (345, 2)]
though you could, of course, modify the output format to anything you desire.
And, finally, by telling them there's almost certainly no problem with an O(n) solution, since the code above delivers results for a one-million-digit string in well under half a second. It seems to scale quite linearly as well, since a 10,000,000-character string takes 3.5 seconds and a 100,000,000-character one takes 36 seconds.
And, if they need better than that, there are ways to parallelise this sort of stuff that can greatly speed it up.
Not within a single Python interpreter of course, due to the GIL, but you could split the string into something like the following (the overlap indicated by vv is required to allow proper processing of the boundary areas):
    vv
123412  vv
    123451
        5123456
You can farm these out to separate workers and combine the results afterwards.
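A hedged sketch of that farm-out with multiprocessing follows; the chunk size and worker count are arbitrary, and on platforms that spawn processes it needs to run under an if __name__ == '__main__': guard:
from collections import Counter
from multiprocessing import Pool

def count_triples(chunk):
    # Count every 3-digit window in this chunk.
    return Counter(chunk[i:i+3] for i in range(len(chunk) - 2))

def parallel_count(digits, n_workers=4, chunk_size=250_000):
    # Chunks overlap by 2 characters so windows straddling a boundary are not lost,
    # and each window is counted by exactly one worker.
    chunks = [digits[start:start + chunk_size + 2]
              for start in range(0, len(digits), chunk_size)]
    with Pool(n_workers) as pool:
        partial_counts = pool.map(count_triples, chunks)
    total = Counter()
    for part in partial_counts:
        total += part
    return {k: v for k, v in total.items() if v > 1}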
The splitting of input and combining of output are likely to swamp any saving with small strings (and possibly even million-digit strings) but, for much larger data sets, it may well make a difference. My usual mantra of "measure, don't guess" applies here, of course.
This mantra also applies to other possibilities, such as bypassing Python altogether and using a different language which may be faster.
For example, the following C code, running on the same hardware as the earlier Python code, handles a hundred million digits in 0.6 seconds, roughly the same amount of time as the Python code took for one million. In other words, much faster:
#include <stdio.h>
#include <string.h>

int main(void) {
    static char inpStr[100000000+1];
    static int freq[1000];

    // Set up test data.
    memset(inpStr, '1', sizeof(inpStr));
    inpStr[sizeof(inpStr)-1] = '\0';

    // Need at least three digits to do anything useful.
    if (strlen(inpStr) <= 2) return 0;

    // Get initial feed from first two digits, process others.
    int val = (inpStr[0] - '0') * 10 + inpStr[1] - '0';
    char *inpPtr = &(inpStr[2]);
    while (*inpPtr != '\0') {
        // Remove hundreds, add next digit as units, adjust table.
        val = (val % 100) * 10 + *inpPtr++ - '0';
        freq[val]++;
    }

    // Output (relevant part of) table.
    for (int i = 0; i < 1000; ++i)
        if (freq[i] > 1)
            printf("%3d -> %d\n", i, freq[i]);

    return 0;
}
Constant time isn't possible. All 1 million digits need to be looked at at least once, so that is a time complexity of O(n), where n = 1 million in this case.
For a simple O(n) solution, create an array of size 1000 that represents the number of occurrences of each possible 3 digit number. Advance 1 digit at a time, first index == 0, last index == 999997, and increment array[3 digit number] to create a histogram (count of occurrences for each possible 3 digit number). Then output the content of the array with counts > 1.
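A minimal sketch of that histogram approach; the function name and the test string are only for illustration:
def repeating_triples(digits):
    # One counter per possible 3-digit number, 000..999.
    counts = [0] * 1000
    for i in range(len(digits) - 2):
        counts[int(digits[i:i+3])] += 1
    return [(f"{n:03d}", c) for n, c in enumerate(counts) if c > 1]

print(repeating_triples("123412345123456"))
# [('123', 3), ('234', 3), ('345', 2)]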
A million is small for the answer I give below. Assuming only that you have to be able to run the solution in the interview without a pause, the following works in less than two seconds and gives the required result:
from collections import Counter

def triple_counter(s):
    # Count every 3-character window in s.
    c = Counter(s[n-3:n] for n in range(3, len(s) + 1))
    for tri, n in c.most_common():
        if n > 1:
            print('%s - %i times.' % (tri, n))
        else:
            break

if __name__ == '__main__':
    import random
    s = ''.join(random.choice('0123456789') for _ in range(1_000_000))
    triple_counter(s)
Hopefully the interviewer would be looking for use of the standard libraries collections.Counter class.
Parallel execution version
I wrote a blog post on this with more explanation.
The simple O(n) solution would be to count each 3-digit number:
for nr in range(1000):
    cnt = text.count('%03d' % nr)
    if cnt > 1:
        print('%03d is found %d times' % (nr, cnt))
This would search through all 1 million digits 1000 times.
Traversing the digits only once:
counts = [0] * 1000
for idx in range(len(text)-2):
    counts[int(text[idx:idx+3])] += 1
for nr, cnt in enumerate(counts):
    if cnt > 1:
        print('%03d is found %d times' % (nr, cnt))
Timing shows that iterating only once over the index is twice as fast as using count.
Here is a NumPy implementation of the "consensus" O(n) algorithm: walk through all triplets and bin as you go. The binning is done by, upon encountering say "385", adding one to bin[3, 8, 5], which is an O(1) operation. Bins are arranged in a 10x10x10 cube. As the binning is fully vectorized, there is no loop in the code.
def setup_data(n):
    import random
    digits = "0123456789"
    return dict(text=''.join(random.choice(digits) for i in range(n)))

def f_np(text):
    # Get the data into NumPy
    import numpy as np
    a = np.frombuffer(bytes(text, 'utf8'), dtype=np.uint8) - ord('0')
    # Rolling triplets
    a3 = np.lib.stride_tricks.as_strided(a, (3, a.size-2), 2*a.strides)
    bins = np.zeros((10, 10, 10), dtype=int)
    # Next line performs O(n) binning
    np.add.at(bins, tuple(a3), 1)
    # Filtering is left as an exercise
    return bins.ravel()

def f_py(text):
    counts = [0] * 1000
    for idx in range(len(text)-2):
        counts[int(text[idx:idx+3])] += 1
    return counts
import numpy as np
import types
from timeit import timeit

for n in (10, 1000, 1000000):
    data = setup_data(n)
    ref = f_np(**data)
    print(f'n = {n}')
    for name, func in list(globals().items()):
        if not name.startswith('f_') or not isinstance(func, types.FunctionType):
            continue
        try:
            assert np.all(ref == func(**data))
            print("{:16s}{:16.8f} ms".format(name[2:], timeit(
                'f(**data)', globals={'f':func, 'data':data}, number=10)*100))
        except:
            print("{:16s} apparently crashed".format(name[2:]))
Unsurprisingly, NumPy is a bit faster than @Daniel's pure Python solution on large data sets. Sample output:
# n = 10
# np 0.03481400 ms
# py 0.00669330 ms
# n = 1000
# np 0.11215360 ms
# py 0.34836530 ms
# n = 1000000
# np 82.46765980 ms
# py 360.51235450 ms
I would solve the problem as follows:
def find_numbers(str_num):
    final_dict = {}
    buffer = {}
    for idx in range(len(str_num) - 2):
        num = int(str_num[idx:idx + 3])
        if num not in buffer:
            buffer[num] = 0
        buffer[num] += 1
        if buffer[num] > 1:
            final_dict[num] = buffer[num]
    return final_dict
Applied to your example string, this yields:
>>> find_numbers("123412345123456")
{345: 2, 234: 3, 123: 3}
This solution runs in O(n) for n being the length of the provided string, and is, I guess, the best you can get.
As per my understanding, you cannot have the solution in constant time. It will take at least one pass over the million-digit number (assuming it's a string). You can do a 3-digit rolling iteration over the digits of the million-length number and increase the value of a hash key by 1 if it already exists, or create a new hash key (initialized with value 1) if it doesn't already exist in the dictionary.
The code will look something like this:
def calc_repeating_digits(number):
    hash = {}
    number = str(number)  # work on the digits as a string
    for i in range(len(number) - 2):
        current_three_digits = number[i:i+3]
        if current_three_digits in hash:
            hash[current_three_digits] += 1
        else:
            hash[current_three_digits] = 1
    return hash
You can filter down to the keys which have item value greater than 1.
As mentioned in another answer, you cannot do this algorithm in constant time, because you must look at at least n digits. Linear time is the fastest you can get.
However, the algorithm can be done in O(1) space. You only need to store the counts of each 3 digit number, so you need an array of 1000 entries. You can then stream the number in.
My guess is that either the interviewer misspoke when they gave you the solution, or you misheard "constant time" when they said "constant space."
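A hedged sketch of such a streaming version, keeping only the rolling 3-digit value and the 1000 counters while reading one character at a time from any file-like source:
import io

def stream_triples(stream):
    counts = [0] * 1000
    value = 0  # rolling value of the last three digits
    seen = 0   # how many digits have been read so far
    while True:
        ch = stream.read(1)
        if not ch:
            break
        value = (value % 100) * 10 + int(ch)
        seen += 1
        if seen >= 3:
            counts[value] += 1
    return [(f"{n:03d}", c) for n, c in enumerate(counts) if c > 1]

print(stream_triples(io.StringIO("123412345123456")))
# [('123', 3), ('234', 3), ('345', 2)]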
Here's my answer:
from timeit import timeit
from collections import Counter
import types
import random

def setup_data(n):
    digits = "0123456789"
    return dict(text=''.join(random.choice(digits) for i in range(n)))

def f_counter(text):
    c = Counter()
    for i in range(len(text)-2):
        ss = text[i:i+3]
        c.update([ss])
    return (i for i in c.items() if i[1] > 1)

def f_dict(text):
    d = {}
    for i in range(len(text)-2):
        ss = text[i:i+3]
        if ss not in d:
            d[ss] = 0
        d[ss] += 1
    return ((i, d[i]) for i in d if d[i] > 1)

def f_array(text):
    a = [[[0 for _ in range(10)] for _ in range(10)] for _ in range(10)]
    for n in range(len(text)-2):
        i, j, k = (int(ss) for ss in text[n:n+3])
        a[i][j][k] += 1
    for i, b in enumerate(a):
        for j, c in enumerate(b):
            for k, d in enumerate(c):
                if d > 1: yield (f'{i}{j}{k}', d)

for n in (1E1, 1E3, 1E6):
    n = int(n)
    data = setup_data(n)
    print(f'n = {n}')
    results = {}
    for name, func in list(globals().items()):
        if not name.startswith('f_') or not isinstance(func, types.FunctionType):
            continue
        print("{:16s}{:16.8f} ms".format(name[2:], timeit(
            'results[name] = f(**data)', globals={'f':func, 'data':data, 'results':results, 'name':name}, number=10)*100))
    for r in results:
        print('{:10}: {}'.format(r, sorted(list(results[r]))[:5]))
The array lookup method is very fast (even faster than @Paul Panzer's NumPy method!). Of course, it cheats, since it isn't technically finished after it completes, because it's returning a generator. It also doesn't have to check on every iteration whether the value already exists, which is likely to help a lot.
n = 10
counter 0.10595780 ms
dict 0.01070654 ms
array 0.00135370 ms
f_counter : []
f_dict : []
f_array : []
n = 1000
counter 2.89462101 ms
dict 0.40434612 ms
array 0.00073838 ms
f_counter : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
f_dict : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
f_array : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
n = 1000000
counter 2849.00500992 ms
dict 438.44007806 ms
array 0.00135370 ms
f_counter : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
f_dict : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
f_array : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
(This answer was posted as an image of code, not reproduced here.) It looks like a sliding window.
Here is my solution:
from collections import defaultdict

string = "103264685134845354863"
d = defaultdict(int)
for elt in range(len(string)-2):
    d[string[elt:elt+3]] += 1
d = {key: d[key] for key in d.keys() if d[key] > 1}
With a bit of creativity in the for loop (and an additional lookup list with True/False/None, for example) you should be able to get rid of the last line, as you only want to create keys in the dict for triples that have already been seen at that point.
Hope it helps :)
-Telling from the perspective of C.
-You can have an int 3-d array results[10][10][10];
-Go from 0th location to n-4th location, where n being the size of the string array.
-On each location, check the current, next and next's next.
-Increment the counter as results[current][next][next's next]++;
-Print the values of
results[1][2][3]
results[2][3][4]
results[3][4][5]
results[4][5][6]
results[5][6][7]
results[6][7][8]
results[7][8][9]
-It is O(n) time, and there are no comparisons involved.
-You can run some parallel stuff here by partitioning the array and calculating the matches around the partitions.
inputStr = '123456123138276237284287434628736482376487234682734682736487263482736487236482634'

count = {}
for i in range(len(inputStr) - 2):
    subNum = int(inputStr[i:i+3])
    if subNum not in count:
        count[subNum] = 1
    else:
        count[subNum] += 1
print(count)

Counting number of values between interval

Is there any efficient way in Python to count how many values of an array of numbers fall within certain intervals? The number of intervals I will be using may get quite large.
like:
mylist = [4,4,1,18,2,15,6,14,2,16,2,17,12,3,12,4,15,5,17]
some function(mylist, startpoints):
# startpoints = [0,10,20]
count values in range [0,9]
count values in range [10,19]
output = [9,10]
You will have to iterate the list at least once.
The solution below works with any sequence/interval that implements comparison (<, >, etc.) and uses the bisect algorithm to find the correct point in the interval, so it is very fast.
It will work with floats, text, or whatever. Just pass a sequence and a list of the intervals.
from collections import defaultdict
from bisect import bisect_left

def count_intervals(sequence, intervals):
    count = defaultdict(int)
    intervals.sort()
    for item in sequence:
        pos = bisect_left(intervals, item)
        if pos == len(intervals):
            count[None] += 1
        else:
            count[intervals[pos]] += 1
    return count

data = [4,4,1,18,2,15,6,14,2,16,2,17,12,3,12,4,15,5,17]
print(count_intervals(data, [10, 20]))
Will print
defaultdict(<class 'int'>, {10: 10, 20: 9})
Meaning that you have 10 values below 10 and 9 values between 10 and 20.
I don't know how large your list will get but here's another approach.
import numpy as np
mylist = [4,4,1,18,2,15,6,14,2,16,2,17,12,3,12,4,15,5,17]
np.histogram(mylist, bins=[0,9,19])
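np.histogram returns the counts together with the bin edges, so the counts can be unpacked directly; a small usage sketch:
import numpy as np

mylist = [4, 4, 1, 18, 2, 15, 6, 14, 2, 16, 2, 17, 12, 3, 12, 4, 15, 5, 17]
counts, edges = np.histogram(mylist, bins=[0, 9, 19])
print(counts.tolist())  # [10, 9]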
You can also use a combination of value_counts() and pd.cut() to help you get the job done.
import pandas as pd
mylist = [4,4,1,18,2,15,6,14,2,16,2,17,12,3,12,4,15,5,17]
split_mylist = pd.cut(mylist, [0, 9, 19]).value_counts(sort=False)
print(split_mylist)
This piece of code will return this:
(0, 9]     10
(9, 19]     9
dtype: int64
Then you can utilise the tolist() function to get what you want:
split_mylist = split_mylist.tolist()
print(split_mylist)
Output: [10, 9]
If the numbers are integers, as in your example, representing the intervals as frozensets can perhaps be fastest (worth trying). Not sure if the intervals are guaranteed to be mutually exclusive -- if not, then
intervals = [frozenset(range(10)), frozenset(range(10, 20))]
counts = [0] * len(intervals)
for n in mylist:
    for i, inter in enumerate(intervals):
        if n in inter:
            counts[i] += 1
if the intervals are mutually exclusive, this code could be sped up a bit by breaking out of the inner loop right after the increment. However for mutually exclusive intervals of integers >= 0, there's an even more attractive option: first, prepare an auxiliary index, e.g. given your startpoints data structure that could be
indices = [sum(i >= x for x in startpoints) - 1 for i in range(max(startpoints))]
and then
counts = [0] * len(intervals)
for n in mylist:
    if 0 <= n < len(indices):
        counts[indices[n]] += 1
This can be adjusted if the intervals can include negative numbers (everything needs to be offset by -min(startpoints) in that case).
If the "numbers" can be arbitrary floats (or decimal.Decimals, etc), not just integer, the possibilities for optimization are more restricted. Is that the case...?
