Finding Indices for Repeat Sequences in NumPy Array

This is a follow-up to a previous question. If I have a NumPy array [0, 1, 2, 2, 3, 4, 2, 2, 5, 5, 6, 5, 5, 2, 2], for each repeat sequence (starting at each index), is there a fast way to find all matches of that repeat sequence and return the indices of those matches?
Here, the repeat sequences are [2, 2] and [5, 5] (note that the length of the repeat is specified by the user but will be the same length and can be much greater than 2). The repeats can be found at [2, 6, 8, 11, 13] via:
import numpy as np
def consec_repeat_starts(a, n):
    # Start indices of every run of n equal consecutive values
    N = n-1
    m = a[:-1]==a[1:]
    return np.flatnonzero(np.convolve(m,np.ones(N, dtype=int))==N)-N+1
But for each unique type of repeat sequence (i.e., [2, 2] and [5, 5]) I want to return something like the repeat followed by the indices for where the repeat is located:
[([2, 2], [2, 6, 13]), ([5, 5], [8, 11])]
Update
Additionally, given the repeat sequence, can you return the results from a second array. So, look for [2, 2] and [5, 5] in:
[2, 2, 5, 5, 1, 4, 9, 2, 5, 5, 0, 2, 2, 2]
And the function would return:
[([2, 2], [0, 11, 12]), ([5, 5], [2, 8])]

Here's a way to do so -
def group_consec(a, n):
    idx = consec_repeat_starts(a, n)
    b = a[idx]
    sidx = b.argsort()
    c = b[sidx]
    cut_idx = np.flatnonzero(np.r_[True, c[:-1]!=c[1:],True])
    idx_s = idx[sidx]
    indices = [idx_s[i:j] for (i,j) in zip(cut_idx[:-1],cut_idx[1:])]
    return c[cut_idx[:-1]], indices
# Perform lookup in another array, b
n = 2
v_a,indices_a = group_consec(a, n)
v_b,indices_b = group_consec(b, n)
idx = np.searchsorted(v_a, v_b)
idx[idx==len(v_a)] = 0
valid_mask = v_a[idx]==v_b
common_indices = [j for (i,j) in zip(valid_mask,indices_b) if i]
common_val = v_b[valid_mask]
Note that for simplicity and ease of use, the first output of group_consec holds one unique value per sequence. If you need them in (val, val, ...) format, simply replicate each value n times at the end. The same applies to common_val.
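To get exactly the output format from the question, the pieces above can be wrapped as in the following minimal sketch. It reuses consec_repeat_starts from the question (with an explicit integer cast before the convolution); repeats_with_indices and lookup_repeats are hypothetical helper names introduced here for illustration, and the grouping uses np.unique instead of the argsort-based splitting, which is simpler but not tuned for speed.

```python
import numpy as np

def consec_repeat_starts(a, n):
    # Start indices of every run of n equal consecutive values (n >= 2)
    N = n - 1
    m = (a[:-1] == a[1:]).astype(int)
    return np.flatnonzero(np.convolve(m, np.ones(N, dtype=int)) == N) - N + 1

def repeats_with_indices(a, n):
    # Hypothetical helper: [([v] * n, [start indices]), ...] sorted by value
    a = np.asarray(a)
    starts = consec_repeat_starts(a, n)
    vals = a[starts]
    return [([int(v)] * n, starts[vals == v].tolist()) for v in np.unique(vals)]

def lookup_repeats(a, b, n):
    # Hypothetical helper: report only those repeats of b that also occur in a
    wanted = {tuple(seq) for seq, _ in repeats_with_indices(a, n)}
    return [(seq, idx) for seq, idx in repeats_with_indices(b, n)
            if tuple(seq) in wanted]

a = [0, 1, 2, 2, 3, 4, 2, 2, 5, 5, 6, 5, 5, 2, 2]
b = [2, 2, 5, 5, 1, 4, 9, 2, 5, 5, 0, 2, 2, 2]
in_a = repeats_with_indices(a, 2)
in_b = lookup_repeats(a, b, 2)
print(in_a)  # [([2, 2], [2, 6, 13]), ([5, 5], [8, 11])]
print(in_b)  # [([2, 2], [0, 11, 12]), ([5, 5], [2, 8])]
```

Overlapping matches (indices 11 and 12 in the second array) are both reported, which agrees with the expected output in the question.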

Related

Python - doing maths with a nested list

If I have a nested list, e.g. x = [[1, 2, 3], [2, 4, 6], [3, 5, 7]], how can I calculate the difference between all of them? Let's call the lists inside x A, B, and C. I want to calculate the difference of A from B & C, then B from A & C, then C from A & B, then put them in a list diff = [].
My problem is correctly indexing the numbers and using them to do maths with corresponding elements in other lists.
This is what I have so far:
for i in range(len(x)):
    diff = []
    for j in range(len(x)):
        if x[i]!=x[j]:
            a = x[i]
            b = x[j]
            for h in range(len(a)):
                d = a[h] - b[h]
                diff.append(d)
Essentially for the difference of A to B it is ([1-2] + [2-4] + [3-6])
I would like it to return: diff = [[diff(A,B), diff(A,C)], [diff(B,A), diff(B,C)], [diff(C,A), diff(C,B)]] with the correct differences between points.
Thanks in advance!

Your solution is actually not that far off. As Aniketh mentioned, one issue is your use of x[i] != x[j]. Since x[i] and x[j] are lists, that compares their contents, not their positions in x.
The check happens to work while every sublist is different, but two sublists with equal contents at different positions would be skipped as well. This is not what you want: you are trying to see whether the two lists sit at the same index in x. For that use i != j.
Though there are other solutions posted here, I'll add mine below because I already wrote it. It makes use of python's list comprehensions.
def pairwise_diff(x):
    diff = []
    for i in range(len(x)):
        A = x[i]
        for j in range(len(x)):
            if i != j:
                B = x[j]
                assert len(A) == len(B)
                item_diff = [A[k] - B[k] for k in range(len(A))]
                diff.append(sum(item_diff))
    # Take the answers and group them into arrays of length 2
    return [diff[i : i + 2] for i in range(0, len(diff), 2)]
x = [[1, 2, 3], [2, 4, 6], [3, 5, 7]]
print(pairwise_diff(x))

This is one of those problems where it's really helpful to know a bit of Python's standard library — especially itertools.
For example, to get the pairs of lists you want to operate on, you can reach for itertools.permutations:
from itertools import permutations
x = [[1, 2, 3], [2, 4, 6], [3, 5, 7]]
list(permutations(x, r=2))
This gives the pairs of lists you want:
[([1, 2, 3], [2, 4, 6]),
([1, 2, 3], [3, 5, 7]),
([2, 4, 6], [1, 2, 3]),
([2, 4, 6], [3, 5, 7]),
([3, 5, 7], [1, 2, 3]),
([3, 5, 7], [2, 4, 6])]
Now, if you could just group those by the first of each pair...itertools.groupby does just this.
from itertools import groupby, permutations
x = [[1, 2, 3], [2, 4, 6], [3, 5, 7]]
list(list(g) for k, g in groupby(permutations(x, r=2), key=lambda p: p[0]))
Which produces a list of lists grouped by the first:
[[([1, 2, 3], [2, 4, 6]), ([1, 2, 3], [3, 5, 7])],
[([2, 4, 6], [1, 2, 3]), ([2, 4, 6], [3, 5, 7])],
[([3, 5, 7], [1, 2, 3]), ([3, 5, 7], [2, 4, 6])]]
Putting it all together, you can make a simple function that subtracts the lists the way you want and pass each pair in:
from itertools import permutations, groupby
def sum_diff(pairs):
    return [sum(p - q for p, q in zip(*pair)) for pair in pairs]
x = [[1, 2, 3], [2, 4, 6], [3, 5, 7]]
# call sum_diff for each group of pairs
result = [sum_diff(g) for k, g in groupby(permutations(x, r=2), key=lambda p: p[0])]
# [[-6, -9], [6, -3], [9, 3]]
This reduces the problem to just a couple of lines of code. Because permutations and groupby are lazy iterators, the pairs are generated on demand rather than stored all at once. And, since you mentioned the difficulty in keeping indices straight, notice that this uses no indices in the code other than selecting the first element for grouping.

Here is the code I believe you're looking for. I will explain it below:
def diff(a, b):
    total = 0
    for i in range(len(a)):
        total += a[i] - b[i]
    return total
x = [[1, 2, 3], [2, 4, 6], [3, 5, 7]]
differences = []
for i in range(len(x)):
    soloDiff = []
    for j in range(len(x)):
        if i != j:
            soloDiff.append(diff(x[i],x[j]))
    differences.append(soloDiff)
print(differences)
Output:
[[-6, -9], [6, -3], [9, 3]]
First off, your description of the algorithm makes it clear that a function for the difference between two lists is worthwhile, since you use it repeatedly.
Your for loops start off fine, but you need a second list that collects one inner list of differences per row (three in total here). Also, when you skip the self-comparison, check that i != j, not x[i] != x[j].
Let me know if you have any other questions!

This is the simplest solution I can think of:
import numpy as np
x = [[1, 2, 3], [2, 4, 6], [3, 5, 7]]
x = np.array(x)
vectors = ['A','B','C']
for j in range(3):
    for k in range(3):
        if j!=k:
            print(vectors[j],'-',vectors[k],'=', x[j]-x[k])
which will print the element-wise differences:
A - B = [-1 -2 -3]
A - C = [-2 -3 -4]
B - A = [1 2 3]
B - C = [-1 -1 -1]
C - A = [2 3 4]
C - B = [1 1 1]
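If the summed differences from the question are wanted instead of the element-wise ones, NumPy broadcasting can produce the whole table at once. The sketch below relies on the fact that the sum of element-wise differences equals the difference of the row sums; the names row_sums, all_diffs and off_diagonal are illustrative, not from the answers above.

```python
import numpy as np

x = np.array([[1, 2, 3], [2, 4, 6], [3, 5, 7]])

# sum(A - B) equals sum(A) - sum(B), so only the row sums are needed
row_sums = x.sum(axis=1)

# Broadcasting a column against a row gives every pairwise difference
all_diffs = row_sums[:, None] - row_sums[None, :]

# Drop the diagonal (a row compared with itself) and regroup per row
n = len(x)
off_diagonal = ~np.eye(n, dtype=bool)
result = all_diffs[off_diagonal].reshape(n, n - 1)

print(result.tolist())  # [[-6, -9], [6, -3], [9, 3]]
```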

How to sum a row and a column in a list of lists?

Given
X = [[3, 2, 3, 2], [3, 2, 2, 3],[3, 2, 2, 2], [3, 2, 2, 10], [3, 3, 3, 3]]
How could I write a function in Python that gives me the sum of the values of a column (the sum of the first values of each inner list) and of a row (the sum of the values of one inner list)?

You can use zip(*some_list_of_lists) to iterate over the columns of the sublists. Note, however, that this will give you sums up to the length of the shortest sublist. If you have uneven lists, you can use itertools.zip_longest with a default value of zero.
l = [[3, 2, 3, 2], [3, 2, 2, 3],[3, 2, 2, 2], [3, 2, 2, 10], [3, 3, 3, 3]]
columns_sums = [sum(col) for col in zip(*l)]
# [15, 11, 12, 20]
For the sum of rows you can just use a regular list comprehension and take the sum() of each item:
row_sums = [sum(row) for row in l]
# [10, 10, 9, 17, 12]
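For uneven sublists, the zip_longest variant mentioned above could look like the following sketch. The list ragged is made-up example data; fillvalue=0 pads the short rows so that they contribute nothing to the missing columns.

```python
from itertools import zip_longest

ragged = [[3, 2, 3], [1], [4, 5]]

# zip stops at the shortest sublist, so only the first column is summed
truncated_sums = [sum(col) for col in zip(*ragged)]

# zip_longest pads the missing entries with 0 and covers every column
column_sums = [sum(col) for col in zip_longest(*ragged, fillvalue=0)]

print(truncated_sums)  # [8]
print(column_sums)     # [8, 7, 3]
```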

Given that your input represents the matrix row by row ([3, 2, 3, 2] is the first row, [3, 3, 3, 3, 3] is the first column), you can sum a column by iterating over the rows, taking the n-th value from each and adding them up:
def sum_n_column(n, matrix):
    res = 0
    for arr in matrix:
        res += arr[n]
    return res
Or, using functools:
from functools import reduce
def sum_n_column(n, matrix):
    return reduce((lambda x, y: x + y), [x[n] for x in matrix])
This can actually be simplified to:
def sum_n_column(n, matrix):
    return sum([x[n] for x in matrix])
In all functions, n means the number of the column to sum and matrix should contain an array of arrays, like in the example variable you provided.
EDIT: To get sum of the row (sum of all values in the array), do:
def sum_n_row(n, matrix):
    res = 0
    # iterate over the values of row n; using a value as an index would be a bug
    for value in matrix[n]:
        res += value
    return res
or, the easy way:
def sum_n_row(n, matrix):
    return sum(matrix[n])

def get_sum(arr, indx, ax=0):
    # ax == 0: sum of row indx; any other ax: sum of column indx
    if ax==0:
        return sum(arr[indx])
    return sum(a[indx] for a in arr)
arr = [[3, 2, 3, 2], [3, 2, 2, 3],[3, 2, 2, 2], [3, 2, 2, 10], [3, 3, 3, 3]]
get_sum(arr, 0, 1)  # sum of the first column: 15

How do I get the index of the common integer element from two separate lists and plug it to another list?

I have 3 lists.
A_set = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Q_act = [2, 3]
dur = [0, 4, 5, 2, 1, 3, 4, 8, 2, 3]
All lists are integers.
What I am trying to do is to compare Q_act with A_set then obtain the indices of the numbers that match from A_set.
(Example:
Q_act has the elements [2,3]
it is located in indices [1,2] from A_set)
Afterwards, I will use those indices to obtain the corresponding value in dur and store this in a list called p_dur_Q_act.
(Example: using the result from the previous example, [1,2]
The values in the dur list corresponding to the indices [1,2] should be stored in another list called p_dur_Q_act
i.e. [4,5] should be the values stored in the list p_dur_Q_act)
So, how do I get the index of the common integer element (which is [1,2]) from two separate lists and plug it to another list?
Here is the code I have tried so far.
I wrote this one because it returns indices, but it does not give [4, 5]:
p_Q = set(Q_act).intersection(A_set)
p_dur_Q_act = [i + 1 for i, x in enumerate(p_Q)]
print(p_dur_Q_act)
I also tried this, but I get the error TypeError: argument of type 'int' is not iterable:
p_dur_Q_act = [i + 1 for i, x in enumerate(Q_act) if any(elem in x for elem in A_set)]
print(p_dur_Q_act)

Another option is to use the enumerate iterator to generate every index, and then select only the ones you want:
a_set = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
q_act = [2, 3]
dur = [0, 4, 5, 2, 1, 3, 4, 8, 2, 3]
p_dur_q_act = [i for i,v in enumerate(a_set) if v in q_act]  # matching indices: [1, 2]
print([dur[p] for p in p_dur_q_act]) # [4, 5]
This is more efficient than repeatedly calling index if the number of matches is large, because the number of index calls is proportional to the number of matches and the duration of each call is proportional to the length of a_set. The enumerate approach can be made even more efficient by turning q_act into a set, since membership tests with in take constant time on average for sets but linear time for lists. At these scales, though, there will be no observable difference.
You don't need to map these to index values, though. You can get the same result if you use zip to map a_set to dur and then select the d values whose a values are in q_act.
a_set = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
q_act = {2, 3}
dur = [0, 4, 5, 2, 1, 3, 4, 8, 2, 3]
p_dur_q_act = [d for a, d in zip(a_set, dur) if a in q_act]  # [4, 5]

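Use the list's index method to get the index of each element. Note that a set has no guaranteed iteration order, so the order of the result may differ from the order of q_act.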
>>> a_set = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> q_act = [2, 3]
>>> dur = [0, 4, 5, 2, 1, 3, 4, 8, 2, 3]
>>>
>>> print([dur[a_set.index(q)] for q in set(a_set).intersection(q_act)])
[4, 5]
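A further variant, sketched here under the assumption that the values in a_set are unique, builds a value-to-index dictionary once and then looks up each element of q_act in it. This keeps the order of q_act and avoids a linear index call per element; position and indices are illustrative names. If a_set contained duplicates, the dictionary would keep the last position of each value.

```python
a_set = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
q_act = [2, 3]
dur = [0, 4, 5, 2, 1, 3, 4, 8, 2, 3]

# Map each value of a_set to its index (assumes the values are unique)
position = {value: i for i, value in enumerate(a_set)}

# Look up the indices in the order given by q_act, skipping unknown values
indices = [position[q] for q in q_act if q in position]
p_dur_q_act = [dur[i] for i in indices]

print(indices)      # [1, 2]
print(p_dur_q_act)  # [4, 5]
```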

How to have an accuracy for comparing 2 lists in python?

I may not be explaining this clearly.
I want a comparison function that checks whether the values at the same indices of two lists are grouped in the same way.
For example, take two lists A and B, which should count as the same (accuracy = 100%).
A=[1,2,1,1,3,4,3,2,5]
B=[4,2,4,4,3,1,3,2,5]
since A(0),A(2),A(3) are the same value = 1,and B(0),B(2),B(3) are the same value = 4; A(1),A(7) are the same value = 2, the same as B(1),B(7);
A(4),A(6) are the same value = 3, the same as B(4),B(6);
A(5), unique value in list A, is the same as B(5);
A(8), unique value in list A, is the same as B(8).
Applying the same rule to lists C and D should give an accuracy of 80%.
C=[1,2,2,2,3,4,4,4,5,6]
D=[3,4,4,4,1,5,5,6,5,6]
D(7) should have the same value as D(5) and D(6) rather than the same value as D(9), and D(8) should not share a value with D(5) and D(6); it should be a standalone value.
Note: the values in a list may not be sequential numbers. List A could also be [1,26,1,1,30,4,30,26,5], and B could be [4,22,4,4,3,100,3,22,5], and I would still treat them as the same.
How can I write a comparison function that reports this accuracy?
Thanks!

If you want to compare the length of the set intersection to the length of the set union:
How many elements are in both lists? (set intersection &)
How many elements are there in total? (set union |)
This method doesn't take position or distribution into account:
A = [1, 2, 1, 1, 3, 4, 3, 2, 5]
B = [4, 2, 4, 4, 3, 1, 3, 2, 5]
C = [1, 2, 2, 2, 3, 4, 4, 4, 5, 6]
D = [3, 4, 4, 4, 1, 5, 5, 6, 5, 6]
def overlapping_percentage(x, y):
    return (100.0 * len(set(x) & set(y))) / len(set(x) | set(y))
print(overlapping_percentage(A, B))
# 100.0
print(overlapping_percentage(C, D))
# approximately 83.33 (5 shared values out of 6 distinct values)

Here's a different method, which might be closer to what you want. It's not perfect and you'll probably have to optimize it.
To be honest, I don't understand exactly where those 80% come from.
This method extracts a "fingerprint" out of lists: where the elements are placed, independently from their values. The fingerprints are then compared to one another:
from collections import defaultdict
A=[1,2,1,1,3,4,3,2,5]
B=[4,2,4,4,3,1,3,2,5]
C = [1, 2, 2, 2, 3, 4, 4, 4, 5, 6]
D = [3, 4, 4, 4, 1, 5, 5, 6, 5, 6]
def fingerprint(lst):
    r = defaultdict(list)
    for i,x in enumerate(lst):
        r[x].append(i)
    return sorted(r.values())
fA = fingerprint(A)
# [[0, 2, 3], [1, 7], [4, 6], [5], [8]]
fB = fingerprint(B)
# [[0, 2, 3], [1, 7], [4, 6], [5], [8]]
fC = fingerprint(C)
# [[0], [1, 2, 3], [4], [5, 6, 7], [8], [9]]
fD = fingerprint(D)
# [[0], [1, 2, 3], [4], [5, 6, 8], [7, 9]]
print((100.0*sum(1 for a,b in zip(fA, fB) if a == b)/len(fB)))
# 100.0
print((100.0*sum(1 for c,d in zip(fC, fD) if c == d)/len(fD)))
# 60.0
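One possible reading of the 80% in the question, and it is only an interpretation, is this: relabel the values of the second list with the best one-to-one mapping onto the values of the first list, then count the positions that agree. The sketch below tries every mapping by brute force, so it is only practical for a handful of distinct values; best_mapping_accuracy is a hypothetical name, and both lists are assumed to have the same length.

```python
from collections import Counter
from itertools import permutations

def best_mapping_accuracy(x, y):
    # How often each (x value, y value) pair shares a position
    pair_counts = Counter(zip(x, y))
    x_labels = sorted(set(x))
    y_labels = sorted(set(y))
    # Pad with None so an x value may also stay unmatched
    padded = y_labels + [None] * max(0, len(x_labels) - len(y_labels))
    best = 0
    # Every one-to-one assignment of y values to x values (factorial cost)
    for perm in permutations(padded, len(x_labels)):
        matched = sum(pair_counts[(a, b)] for a, b in zip(x_labels, perm))
        best = max(best, matched)
    return 100.0 * best / len(x)

A = [1, 2, 1, 1, 3, 4, 3, 2, 5]
B = [4, 2, 4, 4, 3, 1, 3, 2, 5]
C = [1, 2, 2, 2, 3, 4, 4, 4, 5, 6]
D = [3, 4, 4, 4, 1, 5, 5, 6, 5, 6]

print(best_mapping_accuracy(A, B))  # 100.0
print(best_mapping_accuracy(C, D))  # 80.0
```

For C and D the best mapping pairs the 4s of C with the 5s of D and the 6 of C with the 6 of D, which leaves positions 7 and 8 as the two mismatches described in the question.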

It's late to answer, but I did the same thing another way: this computes the share of rows in which two columns hold the same number at the same index.
Here is my code for reference. It assumes pandas is imported as pd, and that final_result (a DataFrame) and csvpath (an output directory) are defined elsewhere.
def accuracy(self,*args):
    check = final_result['train_num'] == final_result['test_num']
    passed = final_result[check]
    accuracy = len(passed.index) / len(final_result.index)
    analysis = passed['test_num'].value_counts()
    analysis = analysis / 50
    analysis['accuracy']=round(accuracy,5)
    pd.Series.to_csv(analysis,csvpath+"accuracy.csv",sep=',')
    print("accuracy: {:.3f}".format(accuracy))
train_num and test_num are two columns of the dataframe final_result; you can replace them with your data.
Be careful with analysis = analysis / 50: 50 is the total number of occurrences of each element in my data, so you should change it to match yours.

Fastest way to count identical sub-arrays in a nd-array?

Let's consider a 2d-array A
2 3 5 7
2 3 5 7
1 7 1 4
5 8 6 0
2 3 5 7
The first, second and last rows are identical. The algorithm I'm looking for should return the number of identical rows for each distinct row (= the number of duplicates of each element). If the script could easily be modified to count identical columns as well, that would be great.
I use an inefficient naive algorithm to do that:
import numpy
A=numpy.array([[2, 3, 5, 7],[2, 3, 5, 7],[1, 7, 1, 4],[5, 8, 6, 0],[2, 3, 5, 7]])
i=0
end = len(A)
while i<end:
    print(i, end=' ')
    j=i+1
    numberID = 1
    while j<end:
        print(j)
        if numpy.array_equal(A[i,:] ,A[j,:]):
            numberID+=1
        j+=1
    i+=1
print(A, len(A))
Expected result:
array([3,1,1]) # number identical arrays per line
My algorithm amounts to plain Python loops over NumPy data and is therefore inefficient. Thanks for your help.

In numpy >= 1.9.0, np.unique has a return_counts keyword argument you can combine with the solution here to get the counts:
b = np.ascontiguousarray(A).view(np.dtype((np.void, A.dtype.itemsize * A.shape[1])))
unq_a, unq_cnt = np.unique(b, return_counts=True)
unq_a = unq_a.view(A.dtype).reshape(-1, A.shape[1])
>>> unq_a
array([[1, 7, 1, 4],
       [2, 3, 5, 7],
       [5, 8, 6, 0]])
>>> unq_cnt
array([1, 3, 1])
In an older numpy, you can replicate what np.unique does, which would look something like:
a_view = np.array(A, copy=True)
a_view = a_view.view(np.dtype((np.void,
                              a_view.dtype.itemsize*a_view.shape[1]))).ravel()
a_view.sort()
a_flag = np.concatenate(([True], a_view[1:] != a_view[:-1]))
# a_flag refers to the sorted copy, so take the unique rows from a_view, not from A
a_unq = a_view[a_flag].view(A.dtype).reshape(-1, A.shape[1])
a_idx = np.concatenate(np.nonzero(a_flag) + ([a_view.size],))
a_cnt = np.diff(a_idx)
>>> a_unq
array([[1, 7, 1, 4],
       [2, 3, 5, 7],
       [5, 8, 6, 0]])
>>> a_cnt
array([1, 3, 1])
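In numpy >= 1.13, np.unique also accepts an axis argument, so the void-view trick is no longer needed. A minimal sketch follows; the variable names are illustrative, and the last step reorders the counts by first appearance to reproduce the [3, 1, 1] expected in the question.

```python
import numpy as np

A = np.array([[2, 3, 5, 7],
              [2, 3, 5, 7],
              [1, 7, 1, 4],
              [5, 8, 6, 0],
              [2, 3, 5, 7]])

# Unique rows (sorted lexicographically) and how often each occurs
unique_rows, first_idx, row_counts = np.unique(
    A, axis=0, return_index=True, return_counts=True)

# The same call with axis=1 counts identical columns
unique_cols, col_counts = np.unique(A, axis=1, return_counts=True)

# Reorder the row counts by first appearance in A
counts_in_order = row_counts[np.argsort(first_idx)]

print(row_counts)       # [1 3 1]
print(counts_in_order)  # [3 1 1]
print(col_counts)       # [1 1 1 1]
```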

You can lexsort on the row entries (lexsort here is numpy.lexsort), which gives you the indices for traversing the rows in sorted order. After the O(n log n) sort, identical rows are adjacent, so finding them takes a single O(n) pass rather than O(n^2) pairwise comparisons. Note that by default, the last column is the primary sort key, i.e. the rows are 'alphabetized' right to left rather than left to right.
In [9]: a
Out[9]:
array([[2, 3, 5, 7],
       [2, 3, 5, 7],
       [1, 7, 1, 4],
       [5, 8, 6, 0],
       [2, 3, 5, 7]])
In [10]: lexsort(a.T)
Out[10]: array([3, 2, 0, 1, 4])
In [11]: a[lexsort(a.T)]
Out[11]:
array([[5, 8, 6, 0],
       [1, 7, 1, 4],
       [2, 3, 5, 7],
       [2, 3, 5, 7],
       [2, 3, 5, 7]])
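Once the rows are sorted, the counting pass itself can be sketched as follows. This is one way to finish the idea, not necessarily what the answer had in mind; new_group, starts and counts are illustrative names.

```python
import numpy as np

a = np.array([[2, 3, 5, 7],
              [2, 3, 5, 7],
              [1, 7, 1, 4],
              [5, 8, 6, 0],
              [2, 3, 5, 7]])

# Sort the rows so that identical rows become neighbours
s = a[np.lexsort(a.T)]

# A new group starts wherever a row differs from the row before it
new_group = np.r_[True, np.any(s[1:] != s[:-1], axis=1)]
starts = np.flatnonzero(new_group)

# The distance between consecutive group starts is the group size
counts = np.diff(np.r_[starts, len(s)])
unique_rows = s[starts]

print(unique_rows.tolist())  # [[5, 8, 6, 0], [1, 7, 1, 4], [2, 3, 5, 7]]
print(counts.tolist())       # [1, 1, 3]
```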

You can use the Counter class from the collections module for this.
It works like this:
x = [2, 2, 1, 5, 2]
from collections import Counter
c=Counter(x)
print(c)
Output : Counter({2: 3, 1: 1, 5: 1})
The only issue in your case is that every value of x is itself a list, which is not hashable.
If you convert every value of x to a tuple, it works:
x = [(2, 3, 5, 7),(2, 3, 5, 7),(1, 7, 1, 4),(5, 8, 6, 0),(2, 3, 5, 7)]
from collections import Counter
c=Counter(x)
print(c)
Output : Counter({(2, 3, 5, 7): 3, (1, 7, 1, 4): 1, (5, 8, 6, 0): 1})
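Starting from the NumPy array in the question, the conversion to tuples can be done in one step. A minimal sketch, assuming the array is small enough that converting it to Python lists is acceptable:

```python
import numpy as np
from collections import Counter

A = np.array([[2, 3, 5, 7],
              [2, 3, 5, 7],
              [1, 7, 1, 4],
              [5, 8, 6, 0],
              [2, 3, 5, 7]])

# Each row becomes a hashable tuple of plain Python ints
row_counts = Counter(map(tuple, A.tolist()))

# Transposing first counts identical columns instead
col_counts = Counter(map(tuple, A.T.tolist()))

print(row_counts[(2, 3, 5, 7)])   # 3
print(max(col_counts.values()))   # 1, every column is distinct
```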
