large array searching with numpy - python

I have two arrays of integers
a = numpy.array([1109830922873, 2838383, 839839393, ..., 29839933982])
b = numpy.array([2838383, 555555555, 2839474582, ..., 29839933982])
where len(a) ~ 15,000 and len(b) ~ 2 million.
What I want is to find the indices of the elements in b that match those in a. Currently I'm using a list comprehension with numpy.argwhere() to achieve this:
bInds = [ numpy.argwhere(b == c)[0] for c in a ]
However, this obviously takes a long time to complete, and array a will only grow larger, so it is not a sensible route to take.
Is there a better way to achieve this result, considering the large arrays I'm dealing with? It currently takes around 5 minutes, so any speedup is welcome!
More info: I want the indices to match the order of array a too. (Thanks Charles)

Unless I'm mistaken, your approach searches the entire array b for each element of a again and again.
Alternatively, you could create a dictionary mapping the individual elements from b to their indices.
indices = {}
for i, e in enumerate(b):
    indices[e] = i  # if elements in b are unique
    # indices.setdefault(e, []).append(i)  # otherwise, use lists instead
Then you can use this mapping for quickly finding the indices where elements from a can be found in b.
bInds = [ indices[c] for c in a ]
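Putting the pieces together, a minimal sketch of the whole approach (assuming the elements of b are unique; elements of a that never occur in b are skipped):

import numpy

a = numpy.array([1109830922873, 2838383, 839839393, 29839933982])
b = numpy.array([2838383, 555555555, 2839474582, 29839933982])

# build the value -> index mapping once: O(len(b))
indices = {e: i for i, e in enumerate(b)}

# one O(1) lookup per element of a, preserving the order of a
bInds = [indices[c] for c in a if c in indices]
print(bInds)  # [0, 3]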

This takes about a second to run.
import numpy

# make some fake data...
a = (numpy.random.random(15000) * 2**16).astype(int)
b = (numpy.random.random(2000000) * 2**16).astype(int)

# find indices of b whose values are contained in a
set_a = set(a)
result = set()
for i, val in enumerate(b):
    if val in set_a:
        result.add(i)

result = numpy.array(list(result))
result.sort()
print(result)
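For reference, numpy also has a vectorized membership test that avoids the Python-level loop entirely; a sketch (np.isin is my suggestion, not part of the answers above, and like the set approach it returns indices ordered by position in b, not in the order of a):

import numpy

mask = numpy.isin(b, a)          # True wherever b's value occurs somewhere in a
bInds = numpy.nonzero(mask)[0]   # the matching indices into b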

Related

Calling indices in an expression while iterating over a variable number of indices (Python)

I want to iterate over a variable number of indices given by the length of a list, using the values on the list as the ranges. Further, I want to call the indices in my expression.
For example, if I have a list [2,4,5], I would want something like:
import itertools
for i0, i1, i2 in itertools.product(range(2), range(4), range(5)):
    otherlist[i0]**i0 + otherlist[i2]**i2
The closest I can get is
for [i for i in range(len(mylist))] in itertools.product(*[range(i) for i in mylist]):
But I don't know how to call the indices from here.
You were so close already. When you use a for statement, I find it best to keep the target list simple and to access components of the target list inside the loop body; in this code, that means indexing into the tuples generated by product().
import itertools
mylist = [2,4,5]
otherlist = list(range(5))
for t in itertools.product(*[range(i) for i in mylist]):
    print(otherlist[t[0]]**t[0] + otherlist[t[2]]**t[2])
# 2
# 2
# 5
# 28
# 257
# ...

How to form a unique collection with one element taken from each array?

Say I have 3 integer arrays: {1,2,3}, {2,3}, {1}
I must take exactly one element from each array, to form a new array where all numbers are unique. In this example, the correct answers are: {2,3,1} and {3,2,1}. (Since I must take one element from the 3rd array, and I want all numbers to be unique, I must never take the number 1 from the first array.)
What I have done:
for a in array1:
    for b in array2:
        for c in array3:
            if a != b and a != c and b != c:
                AddAnswer(a, b, c)
This is brute force, which works, but it doesn't scale well. What if we are now dealing with 20 arrays instead of just 3? I don't think it's good to write 20 nested for-loops. Is there a clever way to do this?
What about:
import itertools
arrs = [[1,2,3], [2,3], [1]]
for x in itertools.product(*arrs):
    if len(set(x)) < len(arrs):
        continue
    AddAnswer(x)
AddAnswer(x) is called twice, with the tuples:
(2, 3, 1)
(3, 2, 1)
You can think of this as finding a matching in a bipartite graph.
You are trying to select one element from each set, but are not allowed to select the same element twice, so you are trying to match sets to numbers.
You can use the matching function in the graph library NetworkX to do this efficiently.
Python example code:
import networkx as nx

A = [[1,2,3], [2,3], [1]]

# collect every number that appears in any of the sets
numbers = set()
for s in A:
    for n in s:
        numbers.add(n)

# build the bipartite graph: set nodes ('s0', 's1', ...) on one side,
# the numbers themselves on the other
B = nx.Graph()
for n in numbers:
    B.add_node(n, bipartite=1)
for i, s in enumerate(A):
    set_name = 's%d' % i
    B.add_node(set_name, bipartite=0)
    for n in s:
        B.add_edge(set_name, n)

matching = nx.maximal_matching(B)
if len(matching) != len(A):
    print('No complete matching')
else:
    for u, v in matching:
        # edge orientation in the matching is arbitrary; set names are strings
        set_name, number = (u, v) if isinstance(u, str) else (v, u)
        print('choose', number, 'from', set_name)
This is a simple, efficient method for finding a single matching.
If you want to enumerate all matchings, you may wish to read "Algorithms for Enumerating All Perfect, Maximum and Maximal Matchings in Bipartite Graphs" by Takeaki Uno, which gives O(V) complexity per matching.
A recursive solution (not tested):
def max_sets(list_of_sets, excluded=[]):
    if not list_of_sets:
        return [set()]
    res = []
    for x in list_of_sets[0]:
        if x not in excluded:
            for candidate in max_sets(list_of_sets[1:], excluded + [x]):
                candidate.add(x)
                res.append(candidate)
    return res
(You could probably dispense with the set but it's not clear if it was in the question or not...)
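For instance, on the arrays from the question, the two valid selections (2,3,1) and (3,2,1) come back as identical sets, so the per-array choice order is lost:

>>> max_sets([[1, 2, 3], [2, 3], [1]])
[{1, 2, 3}, {1, 2, 3}]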

Python: Fast way to create image from a list of tuples

I am doing the following.
import numpy as np
import pylab
.....
x = np.zeros([250,200])
for tup in tups:
    x[tup[1],tup[0]] = x[tup[1],tup[0]] + 1
pylab.imshow(x)
Where
tups = [(x1,y1),(x2,y2),....]
and xi,yi are integers
This is fine when tups contains a small number of points. For a large number of points, ~10^6, it is taking hours.
Can you think of a faster way of doing this?
One small improvement I can see easily: instead of
for tup in tups:
    x[tup[1],tup[0]] = x[tup[1],tup[0]] + 1
try doing
for tup in tups:
    x[tup[1],tup[0]] += 1
since this writes to the same memory address instead of allocating a new spot for 'old value + 1'. (Note: this will probably give only a marginal speedup in this case, but the same trick, A += B instead of C = A + B where A and B are numpy ndarrays of a GB or so each, is a massive speedup.)
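To illustrate that last point (a minimal sketch; the array sizes are arbitrary, just big enough for the allocation to matter):

import numpy as np

a = np.ones((4096, 4096))   # ~134 MB each as float64
b = np.ones((4096, 4096))

a += b       # in-place: writes into a's existing buffer
c = a + b    # allocates a brand-new array of the same size for the result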
Also, why do you read the data in as tuples at all? Shouldn't you read it into a numpy ndarray in the first place, instead of building a list of tuples and then converting it to a numpy array? Where do you create that big list of tuples? If it can be avoided, it is much better not to build the list of tuples at all.
Edit: I just wanted to point out the speedup you can get from +=, and at the same time ask why you have a big list of tuples, but that was too long to fit in a comment.
Another question: am I right in assuming your tuples can contain repeats? Like
tups = [(1,0), (2,4), (1,0), (1,2), ..., (999, 999), (992, 999)]
so that values other than 0 and 1 will appear in your end result? Or does your resulting array contain only ones and zeros?
Using numpy you could convert your pairs of indices into a flat index and bincount it:
import numpy as np
import random

rows, cols = 250, 200
n = 1000
tups = [(random.randint(0, rows-1),
         random.randint(0, cols-1)) for _ in range(n)]

# looping reference implementation
x = np.zeros((rows, cols))
for tup in tups:
    x[tup[0],tup[1]] += 1

# vectorized: flatten the 2D indices, count occurrences, then reshape
flat_idx = np.ravel_multi_index(tuple(zip(*tups)), (rows, cols))
y = np.bincount(flat_idx, minlength=rows*cols).reshape(rows, cols)

np.testing.assert_equal(x, y)
It will be much faster than any looping solution.

Sort list of lists by unique reversed absolute condition

Context: I am developing an algorithm to determine loop flows in a power flow network.
Issue:
I have a list of lists; each list represents a loop within the network, determined via my algorithm. Unfortunately, the algorithm also picks up reversed duplicates.
i.e.
L1 = [a, b, c, -d, -a]
L2 = [a, d, c, -b, -a]
(Please note that c should not be negative; it is correct as written, due to the structure of the network and the defined flows.)
Now these two loops are equivalent, simply following the reverse structure throughout the network.
I wish to retain L1, whilst discarding L2 from the list of lists.
Thus, if I have a list of 6 loops of which 3 are reversed duplicates, I wish to retain the three unique ones.
Additionally, the loops do not have to follow the format specified above: they can be shorter or longer, and the sign structure (e.g. pos pos pos neg neg) will not occur in all instances.
I have been attempting to sort this by reversing the list and comparing the absolute values.
I am completely stumped and any assistance would be appreciated.
Based upon some of the code provided by mgibson I was able to create the following.
def Check_Dup(Loops):
    Act = []
    while Loops:
        L = Loops.pop()
        Act.append(L)
        Loops = Popper(Loops, L)
    return Act

def Popper(Loops, L):
    for loop in Loops[:]:  # iterate over a copy so removal is safe
        Rev = loop[::-1]
        if all(abs(x) == abs(y) for x, y in zip(L, Rev)):
            Loops.remove(loop)
    return Loops
This runs until there are no loops left, discarding the duplicates each time. I'm accepting mgibson's answer, as it provided the keys needed to create the solution.
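A quick check with numeric stand-ins for a, b, c, d (values chosen arbitrarily for illustration):

a, b, c, d = 1, 2, 3, 4
L1 = [a, b, c, -d, -a]
L2 = [a, d, c, -b, -a]
print(Check_Dup([L1, L2]))  # [[1, 4, 3, -2, -1]] -- one of the pair survives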
I'm not sure I get your question, but reversing a list is easy:
a = [1,2]
a_rev = a[::-1] #new list -- if you just want an iterator, reversed(a) also works.
To compare the absolute values of a and a_rev:
all( abs(x) == abs(y) for x,y in zip(a,a_rev) )
which can be simplified to:
all( abs(x) == abs(y) for x,y in zip(a,reversed(a)) )
Now, in order to make this as efficient as possible, I would first sort the arrays based on the absolute value:
your_list_of_lists.sort(key=lambda x: [abs(v) for v in x])
Now you know that if two lists are going to be equal, they have to be adjacent in the list and you can just pull that out using enumerate:
def cmp_list(x, y):
    return True if x == y else all(abs(a) == abs(b) for a, b in zip(x, y))

duplicate_idx = [idx for idx, val in enumerate(your_list_of_lists[1:])
                 if cmp_list(val, your_list_of_lists[idx])]
# now remove the duplicates:
for idx in reversed(duplicate_idx):
    your_list_of_lists.pop(idx)
If your (sub) lists are either strictly increasing or strictly decreasing, this becomes MUCH simpler.
lists = list(set( tuple(sorted(x)) for x in your_list_of_lists ) )
I don't see how they can be equivalent if you have c in both directions - one of them must be -c
>>> a,b,c,d = range(1,5)
>>> L1 = [a, b, c, -d, -a]
>>> L2 = [a, d, -c, -b, -a]
>>> L1 == [-x for x in reversed(L2)]
True
now you can write a function to collapse those two loops into a single value
>>> def normalise(loop):
...     return min(loop, [-x for x in reversed(loop)])
...
>>> normalise(L1)
[1, 2, 3, -4, -1]
>>> normalise(L2)
[1, 2, 3, -4, -1]
A good way to eliminate duplicates is to use a set; we just need to convert the lists to tuples
>>> L=[L1, L2]
>>> set(tuple(normalise(loop)) for loop in L)
set([(1, 2, 3, -4, -1)])
[pair[0] for pair in frozenset(tuple(sorted((c, negReversed(c)))) for c in cycles)]
Where:
def negReversed(cycle):
    return tuple(-x for x in cycle[::-1])
and where cycles must be tuples.
This takes each cycle, computes its reversed-and-negated duplicate, and sorts the two (putting each canonically equivalent pair into a fixed order). The outer frozenset(...) uniquifies any duplicates. Then you extract the canonical element (here I arbitrarily chose pair[0]).
Keep in mind that your algorithm might be returning cycles starting in arbitrary places. If this is the case (i.e. your algorithm might return either [1,2,-3] or [-3,1,2]), then you need to consider these as equivalent necklaces
There are many ways to canonicalize necklaces. The above way is less efficient because we don't canonicalize the necklace directly: we just treat the entire equivalence class as the canonical element, by turning each cycle (a,b,c,d,e) into {(a,b,c,d,e), (e,a,b,c,d), (d,e,a,b,c), (c,d,e,a,b), (b,c,d,e,a)}. In your case, since you consider negatives to be equivalent, you would turn each cycle into {(a,b,c,d,e), (e,a,b,c,d), (d,e,a,b,c), (c,d,e,a,b), (b,c,d,e,a), (-a,-b,-c,-d,-e), (-e,-a,-b,-c,-d), (-d,-e,-a,-b,-c), (-c,-d,-e,-a,-b), (-b,-c,-d,-e,-a)}. Make sure to use frozenset for the inner sets, since set is not hashable:
[next(iter(eq)) for eq in {frozenset(eqClass(c)) for c in cycles}]
where:
def eqClass(cycle):
    for rotation in rotations(cycle):
        yield rotation
        yield tuple(-x for x in rotation)
where rotations() is something like Efficient way to shift a list in python, but yielding tuples.
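For completeness, a minimal rotations() helper along those lines (a sketch, not part of the original answer) could be:

def rotations(cycle):
    # yield every rotation of the cycle as a hashable tuple
    cycle = tuple(cycle)
    for i in range(len(cycle)):
        yield cycle[i:] + cycle[:i]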

python union of 2 nested lists with index

I want to get the union of 2 nested lists plus an index to the common values.
I have two lists like A = [[1,2,3],[4,5,6],[7,8,9]] and B = [[1,2,3,4],[3,3,5,7]], but the length of each list is about 100,000. A has an associated index vector of length len(A): I = [2,3,4]
What I want is to find all sublists in B where the first 3 elements are equal to a sublist in A. In this example I want to get B[0] returned ([1,2,3,4]) because its first three elements are equal to A[0]. In addition, I also want the index to A[0] in this example, that is I[0].
I tried different things, but nothing worked so far :(
First I tried this:
Common = []
for i in range(len(B)):
    if B[i][:3] in A:
        id = [I[x] for x, y in enumerate(A) if y == B[i][:3]][0]
        Common.append([int(id)] + B[i])
But that takes ages, or never finishes.
Then I transformed A and B into sets and took their union, which was very quick, but then I don't know how to get the corresponding indices.
Does anyone have an idea?
Create an auxiliary dict (work is O(len(A))), assuming the first three items of a sublist in A uniquely identify it (otherwise you need a dict of lists):
aud = dict((tuple(a[:3]), i) for i, a in enumerate(A))
Use said dict to loop once on B (work is O(len(B))) to get B sublists and A indices:
result = [(b, aud[tuple(b[:3])]) for b in B if tuple(b[:3]) in aud]
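Combining the two steps on the example data from the question, with one extra lookup into the index vector I (that last step is my addition, following the question's requirement):

A = [[1,2,3],[4,5,6],[7,8,9]]
B = [[1,2,3,4],[3,3,5,7]]
I = [2,3,4]

aud = dict((tuple(a[:3]), i) for i, a in enumerate(A))
result = [(I[aud[tuple(b[:3])]], b) for b in B if tuple(b[:3]) in aud]
print(result)  # [(2, [1, 2, 3, 4])]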
