Fastest way to count identical sub-arrays in a nd-array? - python

Let's consider a 2d-array A
2 3 5 7
2 3 5 7
1 7 1 4
5 8 6 0
2 3 5 7
The first, second and last rows are identical. The algorithm I'm looking for should return the number of identical rows for each distinct row (i.e. the number of duplicates of each element). If the script can easily be modified to also count identical columns, that would be great.
I use an inefficient naive algorithm to do that:
import numpy
A = numpy.array([[2, 3, 5, 7], [2, 3, 5, 7], [1, 7, 1, 4], [5, 8, 6, 0], [2, 3, 5, 7]])
i = 0
end = len(A)
while i < end:
    print i,
    j = i + 1
    numberID = 1
    while j < end:
        print j
        if numpy.array_equal(A[i, :], A[j, :]):
            numberID += 1
        j += 1
    i += 1
print A, len(A)
Expected result:
array([3, 1, 1]) # number of identical rows for each distinct row
My algorithm uses native Python loops on top of numpy, hence inefficient. Thanks for your help.

In numpy >= 1.9.0, np.unique has a return_counts keyword argument you can combine with the solution here to get the counts:
b = np.ascontiguousarray(A).view(np.dtype((np.void, A.dtype.itemsize * A.shape[1])))
unq_a, unq_cnt = np.unique(b, return_counts=True)
unq_a = unq_a.view(A.dtype).reshape(-1, A.shape[1])
>>> unq_a
array([[1, 7, 1, 4],
       [2, 3, 5, 7],
       [5, 8, 6, 0]])
>>> unq_cnt
array([1, 3, 1])
In an older numpy, you can replicate what np.unique does, which would look something like:
a_view = np.array(A, copy=True)
a_view = a_view.view(np.dtype((np.void,
                               a_view.dtype.itemsize*a_view.shape[1]))).ravel()
a_view.sort()
a_flag = np.concatenate(([True], a_view[1:] != a_view[:-1]))
a_unq = A[a_flag]
a_idx = np.concatenate(np.nonzero(a_flag) + ([a_view.size],))
a_cnt = np.diff(a_idx)
>>> a_unq
array([[1, 7, 1, 4],
       [2, 3, 5, 7],
       [5, 8, 6, 0]])
>>> a_cnt
array([1, 3, 1])
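As an aside, on numpy >= 1.13 the same counts are available directly through the axis keyword of np.unique, which avoids the void-view trick. A minimal sketch:
import numpy as np

A = np.array([[2, 3, 5, 7], [2, 3, 5, 7], [1, 7, 1, 4], [5, 8, 6, 0], [2, 3, 5, 7]])

# Unique rows and how often each one occurs.
unq_rows, counts = np.unique(A, axis=0, return_counts=True)
print(unq_rows)  # [[1 7 1 4] [2 3 5 7] [5 8 6 0]]
print(counts)    # [1 3 1]

# Identical columns can be counted the same way on the transpose.
unq_cols, col_counts = np.unique(A.T, axis=0, return_counts=True)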

You can lexsort on the row entries, which will give you the indices for traversing the rows in sorted order, turning the duplicate search into a linear scan (after the sort) rather than O(n^2). Note that np.lexsort treats its last key as the primary key, so with a.T the rows are 'alphabetized' right to left (last column most significant) rather than left to right.
In [9]: a
Out[9]:
array([[2, 3, 5, 7],
       [2, 3, 5, 7],
       [1, 7, 1, 4],
       [5, 8, 6, 0],
       [2, 3, 5, 7]])
In [10]: lexsort(a.T)
Out[10]: array([3, 2, 0, 1, 4])
In [11]: a[lexsort(a.T)]
Out[11]:
array([[5, 8, 6, 0],
       [1, 7, 1, 4],
       [2, 3, 5, 7],
       [2, 3, 5, 7],
       [2, 3, 5, 7]])
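To go from the lexsorted rows to counts, one possible follow-up (a sketch, not part of the answer above) is to flag where consecutive sorted rows differ and take the differences of those positions:
import numpy as np

a = np.array([[2, 3, 5, 7], [2, 3, 5, 7], [1, 7, 1, 4], [5, 8, 6, 0], [2, 3, 5, 7]])
order = np.lexsort(a.T)
sorted_a = a[order]

# A True marks the first row of each group of identical rows.
change = np.concatenate(([True], np.any(sorted_a[1:] != sorted_a[:-1], axis=1)))
counts = np.diff(np.concatenate((np.flatnonzero(change), [len(a)])))
print(sorted_a[change])  # the unique rows, in lexsorted order
print(counts)            # [1 1 3]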

You can use the Counter class from the collections module for this.
It works like this:
x = [2, 2, 1, 5, 2]
from collections import Counter
c=Counter(x)
print c
Output : Counter({2: 3, 1: 1, 5: 1})
The only issue you will face in your case is that every value of x is itself a list, which is not hashable.
If you convert every value of x into a tuple, it works:
x = [(2, 3, 5, 7),(2, 3, 5, 7),(1, 7, 1, 4),(5, 8, 6, 0),(2, 3, 5, 7)]
from collections import Counter
c=Counter(x)
print c
Output : Counter({(2, 3, 5, 7): 3, (5, 8, 6, 0): 1, (1, 7, 1, 4): 1})
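Applying the same idea to the 2d array from the question (a small sketch, assuming the array A defined there), the rows just need to be converted to tuples first:
import numpy as np
from collections import Counter

A = np.array([[2, 3, 5, 7], [2, 3, 5, 7], [1, 7, 1, 4], [5, 8, 6, 0], [2, 3, 5, 7]])

row_counts = Counter(map(tuple, A.tolist()))    # rows as hashable tuples
col_counts = Counter(map(tuple, A.T.tolist()))  # same idea for columns
print(row_counts)  # Counter({(2, 3, 5, 7): 3, (1, 7, 1, 4): 1, (5, 8, 6, 0): 1})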

Pair two 1D arrays by their value and index [closed]

I have two 1D arrays of the same length as this:
import numpy as np
a = np.array([1, 1, 1, 2, 2, 3, 4, 5])
b = np.array([7, 7, 8, 8, 9, 8, 10, 10])
The value of a is increasing while b is random.
I want to pair them by their values, following the steps below:
Pick the first unique value ([1]) of array a and get the unique numbers ([7, 8]) of array b at the same indices.
If any of those paired numbers ([8]) appear again in b, also pick the numbers of a at those indices.
Then, for any newly paired number ([2]) that appears again in a, the numbers in b at those indices are selected as well.
Finally, the result should be:
[1, 2, 3] is paired with [7, 8, 9]
[4, 5] is paired with [10]
It looks like there is no easy way for a vectorised (no looping) solution, since this is a graph-theory problem of finding connected components. If you still want a performant script that works fast on big data, you could use the igraph library, which is written in C.
TL;DR
I assume your input corresponds to edges of some graph:
>>> np.transpose([a, b])
array([[ 1,  7],
       [ 1,  7],
       [ 1,  8],
       [ 2,  8],
       [ 2,  9],
       [ 3,  8],
       [ 4, 10],
       [ 5, 10]])
So your vertices are:
>>> np.unique(np.transpose([a, b]))
array([ 1, 2, 3, 4, 5, 7, 8, 9, 10])
And you would be quite happy (at least at the beginning) to recognise communities, like:
tags = np.transpose([a, b, communities])
>>> tags
array([[ 1,  7,  0],
       [ 1,  7,  0],
       [ 1,  8,  0],
       [ 2,  8,  0],
       [ 2,  9,  0],
       [ 3,  8,  0],
       [ 4, 10,  1],
       [ 5, 10,  1]])
so that you have vertices (1, 2, 3, 7, 8, 9) included in community number 0 and vertices (4, 5, 10) included in community number 1.
Unfortunately, igraph requires vertex ids that start at 0 and have no gaps, so labels like 1 to 10 with missing values won't work directly. You need to store the original labels and then relabel the vertices, so that the edges become:
vertices_old, inv = np.unique(np.transpose([a,b]), return_inverse=True)
edges_new = inv.reshape(-1, 2)
>>> vertices_old
array([ 1, 2, 3, 4, 5, 7, 8, 9, 10]) #new ones are: [0, 1, 2, ..., 8]
>>> edges_new
array([[0, 5],
       [0, 5],
       [0, 6],
       [1, 6],
       [1, 7],
       [2, 6],
       [3, 8],
       [4, 8]], dtype=int64)
The next step is to find communities using igraph (pip install python-igraph). You can run the following:
import igraph as ig
graph = ig.Graph(edges = edges_new)
communities = graph.clusters().membership #type: list
communities = np.array(communities)
>>> communities
array([0, 0, 0, 1, 1, 0, 0, 0, 1]) #tags of nodes [1 2 3 4 5 7 8 9 10]
And then retrieve tags of source vertices (as well as tags of target vertices):
>>> communities = communities[edges_new[:, 0]] #or [:, 1]
array([0, 0, 0, 0, 0, 0, 1, 1])
After you find communities, the second part of the solution appears to be a typical groupby problem. You can do it in pandas:
import pandas as pd
def get_part(source, communities):
    part_edges = np.transpose([source, communities])
    part_idx = pd.DataFrame(part_edges).groupby([1]).indices.values() #might contain duplicated source values
    part = [np.unique(source[idx]) for idx in part_idx]
    return part
>>> get_part(a, communities), get_part(b, communities)
([array([1, 2, 3]), array([4, 5])], [array([7, 8, 9]), array([10])])
Final Code
import igraph as ig
import numpy as np
import pandas as pd
def get_part(source, communities):
    '''find set of nodes for each community'''
    part_edges = np.transpose([source, communities])
    part_idx = pd.DataFrame(part_edges).groupby([1]).indices.values() #might contain duplicated source values
    part = [np.unique(source[idx]) for idx in part_idx]
    return part
a = np.array([1, 1, 1, 2, 2, 3, 4, 5])
b = np.array([7, 7, 8, 8, 9, 8, 10, 10])
vertices_old, inv = np.unique(np.transpose([a,b]), return_inverse=True)
edges_new = inv.reshape(-1, 2)
graph = ig.Graph(edges = edges_new)
communities = np.array(graph.clusters().membership)
communities = communities[edges_new[:,0]] #or communities[edges_new[:,1]]
>>> get_part(a, communities), get_part(b, communities)
([array([1, 2, 3]), array([4, 5])], [array([7, 8, 9]), array([10])])
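If adding igraph as a dependency is not an option, roughly the same idea can be sketched with scipy.sparse.csgraph.connected_components (an alternative sketch, not part of the answer above):
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

a = np.array([1, 1, 1, 2, 2, 3, 4, 5])
b = np.array([7, 7, 8, 8, 9, 8, 10, 10])

# Relabel all values to 0..n-1 so they can serve as graph node ids.
values, inv = np.unique(np.concatenate([a, b]), return_inverse=True)
src, dst = inv[:len(a)], inv[len(a):]

n = len(values)
adj = coo_matrix((np.ones(len(src)), (src, dst)), shape=(n, n))
_, labels = connected_components(adj, directed=False)

# Collect the original a-values and b-values that fall in each component.
for comp in np.unique(labels):
    members = set(values[labels == comp].tolist())
    print(sorted(set(a.tolist()) & members), sorted(set(b.tolist()) & members))
# [1, 2, 3] [7, 8, 9]
# [4, 5] [10]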
I tried doing this by iterating over both arrays simultaneously and keeping track of which element is associated with which index of the result. Let me know if this works for you.
a = [1, 1, 1, 2, 2, 3, 4, 5]
b = [7, 7, 8, 8, 9, 8, 10, 10]
tracker_a = dict()
tracker_b = dict()
result = []
index = 0
for elem_a, elem_b in zip(a, b):
    if elem_a in tracker_a:
        result[tracker_a[elem_a]][1].add(elem_b)
        tracker_b[elem_b] = tracker_a[elem_a]
    elif elem_b in tracker_b:
        result[tracker_b[elem_b]][0].add(elem_a)
        tracker_a[elem_a] = tracker_b[elem_b]
    else:
        tracker_a[elem_a] = index
        tracker_b[elem_b] = index
        result.append([{elem_a}, {elem_b}])
        index += 1
print(result)
Output:
[[{1, 2, 3}, {8, 9, 7}], [{4, 5}, {10}]]
Complexity: O(n)

Finding Indices for Repeat Sequences in NumPy Array

This is a follow up to a previous question. If I have a NumPy array [0, 1, 2, 2, 3, 4, 2, 2, 5, 5, 6, 5, 5, 2, 2], for each repeat sequence (starting at each index), is there a fast way to then find all matches of that repeat sequence and return the index for those matches?
Here, the repeat sequences are [2, 2] and [5, 5] (note that the repeat length is specified by the user; all repeats have that same length, which can be much greater than 2). The repeats can be found at [2, 6, 8, 11, 13] via:
def consec_repeat_starts(a, n):
    N = n-1
    m = a[:-1]==a[1:]
    return np.flatnonzero(np.convolve(m,np.ones(N, dtype=int))==N)-N+1
But for each unique type of repeat sequence (i.e., [2, 2] and [5, 5]) I want to return something like the repeat followed by the indices for where the repeat is located:
[([2, 2], [2, 6, 13]), ([5, 5], [8, 11])]
Update
Additionally, given the repeat sequence, can you return the results from a second array? So, look for [2, 2] and [5, 5] in:
[2, 2, 5, 5, 1, 4, 9, 2, 5, 5, 0, 2, 2, 2]
And the function would return:
[([2, 2], [0, 11, 12]), ([5, 5], [2, 8])]
Here's a way to do so -
def group_consec(a, n):
    idx = consec_repeat_starts(a, n)
    b = a[idx]
    sidx = b.argsort()
    c = b[sidx]
    cut_idx = np.flatnonzero(np.r_[True, c[:-1]!=c[1:], True])
    idx_s = idx[sidx]
    indices = [idx_s[i:j] for (i,j) in zip(cut_idx[:-1], cut_idx[1:])]
    return c[cut_idx[:-1]], indices
# Perform lookup in another array, b
n = 2
v_a,indices_a = group_consec(a, n)
v_b,indices_b = group_consec(b, n)
idx = np.searchsorted(v_a, v_b)
idx[idx==len(v_a)] = 0
valid_mask = v_a[idx]==v_b
common_indices = [j for (i,j) in zip(valid_mask,indices_b) if i]
common_val = v_b[valid_mask]
Note that for simplicity and ease of use, the first output arg of group_consec has the unique values per sequence. If you need them in (val, val, ...) format, simply replicate them at the end. The same applies to common_val.
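As a quick check (a usage sketch, assuming consec_repeat_starts and group_consec are defined as above) on the arrays from the question:
import numpy as np

a = np.array([0, 1, 2, 2, 3, 4, 2, 2, 5, 5, 6, 5, 5, 2, 2])
b = np.array([2, 2, 5, 5, 1, 4, 9, 2, 5, 5, 0, 2, 2, 2])

vals, indices = group_consec(a, 2)
print(vals)     # [2 5]
print(indices)  # [array([ 2,  6, 13]), array([ 8, 11])]

# Running the lookup block above with these a and b then gives
# common_val = [2 5] and common_indices = [array([ 0, 11, 12]), array([2, 8])].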

Python Pandas rolling aggregate a column of lists

I have a simple dataframe df with a column of lists named lists. I would like to generate an additional column based on lists.
The df looks like:
import pandas as pd
lists={1:[[1]],2:[[1,2,3]],3:[[2,9,7,9]],4:[[2,7,3,5]]}
#create test dataframe
df=pd.DataFrame.from_dict(lists,orient='index')
df=df.rename(columns={0:'lists'})
df
lists
1 [1]
2 [1, 2, 3]
3 [2, 9, 7, 9]
4 [2, 7, 3, 5]
I would like df to look like this:
df
Out[9]:
lists rolllists
1 [1] [1]
2 [1, 2, 3] [1, 1, 2, 3]
3 [2, 9, 7, 9] [1, 2, 3, 2, 9, 7, 9]
4 [2, 7, 3, 5] [2, 9, 7, 9, 2, 7, 3, 5]
Basically I want to 'sum'/append the rolling 2 lists. Note that in row 1, because there is only one list, rolllists is just that list. But in row 2, I have 2 lists that I want appended. Then for row 3, append df[2].lists and df[3].lists, etc. I have worked on similar things before, see this: Pandas Dataframe, Column of lists, Create column of sets of cumulative lists, and record by record differences.
In addition, once the part above works, I want to do this within a groupby (so the example above would be one group; for instance, the df might look like this in the groupby case):
Group lists rolllists
1 A [1] [1]
2 A [1, 2, 3] [1, 1, 2, 3]
3 A [2, 9, 7, 9] [1, 2, 3, 2, 9, 7, 9]
4 A [2, 7, 3, 5] [2, 9, 7, 9, 2, 7, 3, 5]
5 B [1] [1]
6 B [1, 2, 3] [1, 1, 2, 3]
7 B [2, 9, 7, 9] [1, 2, 3, 2, 9, 7, 9]
8 B [2, 7, 3, 5] [2, 9, 7, 9, 2, 7, 3, 5]
I have tried various things like df.lists.rolling(2).sum() and I get this error:
TypeError: cannot handle this type -> object
in Pandas 0.24.1. Unfortunately, in Pandas 0.22.0 the command doesn't error, but instead returns the exact same values as in lists. So it looks like newer versions of Pandas can't sum lists? That's a secondary issue.
Love any help! Have Fun!
You can start with
import pandas as pd
mylists={1:[[1]],2:[[1,2,3]],3:[[2,9,7,9]],4:[[2,7,3,5]]}
mydf=pd.DataFrame.from_dict(mylists,orient='index')
mydf=mydf.rename(columns={0:'lists'})
mydf = pd.concat([mydf, mydf], axis=0, ignore_index=True)
mydf['group'] = ['A']*4 + ['B']*4
# initialize your new series
mydf['newseries'] = mydf['lists']
# define the function that appends lists overs rows
def append_row_lists(data):
    for i in data.index:
        try: data.loc[i+1, 'newseries'] = data.loc[i, 'lists'] + data.loc[i+1, 'lists']
        except: pass
    return data
# loop over your groups
for gp in mydf.group.unique():
    condition = mydf.group == gp
    mydf[condition] = append_row_lists(mydf[condition])
Output
lists Group newseries
0 [1] A [1]
1 [1, 2, 3] A [1, 1, 2, 3]
2 [2, 9, 7, 9] A [1, 2, 3, 2, 9, 7, 9]
3 [2, 7, 3, 5] A [2, 9, 7, 9, 2, 7, 3, 5]
4 [1] B [1]
5 [1, 2, 3] B [1, 1, 2, 3]
6 [2, 9, 7, 9] B [1, 2, 3, 2, 9, 7, 9]
7 [2, 7, 3, 5] B [2, 9, 7, 9, 2, 7, 3, 5]
How about this?
rolllists = [df.lists[1].copy()]
for row in df.iterrows():
    index, values = row
    if index > 1:  # or > 0 if zero-indexed
        rolllists.append(df.loc[index - 1, 'lists'] + values['lists'])
df['rolllists'] = rolllists
Or as a slightly more extensible function:
lists={1:[[1]],2:[[1,2,3]],3:[[2,9,7,9]],4:[[2,7,3,5]]}
df=pd.DataFrame.from_dict(lists,orient='index')
df=df.rename(columns={0:'lists'})
def rolling_lists(df, roll_period=2):
    new_roll, rolllists = [], [df.lists[1].copy()] * (roll_period - 1)
    for row in df.iterrows():
        index, values = row
        if index > roll_period - 1:  # or -2 if zero-indexed
            res = []
            for i in range(index - roll_period, index):
                res.append(df.loc[i + 1, 'lists'])  # or i if 0-indexed
            rolllists.append(res)
    for li in rolllists:
        while isinstance(li[0], list):
            li = [item for sublist in li for item in sublist]  # flatten nested list
        new_roll.append(li)
    df['rolllists'] = new_roll
    return df
This is easily extensible to groupby as well: apply it per group, e.g. df.groupby('Group').apply(rolling_lists). You can give any number of rolling rows to use as roll_period. Hope this helps!
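For what it's worth, a shorter alternative sketch (not from either answer above) builds the rolling window of 2 with groupby().shift(), reusing the column and group names from the example:
import pandas as pd

mylists = {1: [[1]], 2: [[1, 2, 3]], 3: [[2, 9, 7, 9]], 4: [[2, 7, 3, 5]]}
df = pd.DataFrame.from_dict(mylists, orient='index').rename(columns={0: 'lists'})
df = pd.concat([df, df], ignore_index=True)
df['Group'] = ['A'] * 4 + ['B'] * 4

# Previous row's list within each group (NaN for the first row of a group).
prev = df.groupby('Group')['lists'].shift(1)
df['rolllists'] = [cur if not isinstance(p, list) else p + cur
                   for p, cur in zip(prev, df['lists'])]
print(df)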

Create a multidimensional array from 1-D list, numpy

I'm struggling to describe this issue in words, but I have a seemingly simple problem I can't find an answer for.
I want to create an array using values from one list/array and indices from another. I want the shape of the new array to be the same as the index array.
import numpy as np
a = np.array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2])
b = [[0, 1], [2, 3, 4], [6, 7, 8, 9, 10]]
result = func(a, b) #some function or operator...
print(result)
[[9, 8], [7, 6, 5], [3, 2, 1, 0, -1]]
Thank you! :)
EDIT:
Good solutions so far, but I would rather do this without a for loop as we are looking at hundreds of thousands of rows and need to keep computing time down. Thanks again :)
You can use a list comprehension:
>>> [a[x[0]:x[-1]+1] for x in b]
[array([9, 8]), array([7, 6, 5]), array([ 3, 2, 1, 0, -1])]
EDIT: Your question indicates that you want a faster option, so you might test the following script to see which is faster for your Python installation:
#!/usr/bin/env python
import timeit
setup = '''
import numpy as np
a = np.array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2])
b = [[0, 1], [2, 3, 4], [6, 7, 8, 9, 10]]
'''
test1 = '''
def test():
    return [a[x[0]:x[-1]+1] for x in b]
'''
test2 = '''
def test():
    return [a[idx] for idx in b]
'''
print(timeit.timeit(setup=setup,
                    stmt=test1,
                    number=1000000))
print(timeit.timeit(setup=setup,
                    stmt=test2,
                    number=1000000))
On my machine, the two approaches given you so far run about the same, but hpaulj's answer might be very slightly faster (unless Python is caching data behind the scenes), which may be of more use to you in production. Test it out locally and see if you get a similar or different answer.
Just apply each indexing sublist to a:
In [483]: a = np.array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2])
...:
...: b = [[0, 1], [2, 3, 4], [6, 7, 8, 9, 10]]
...:
...:
In [484]: [a[idx] for idx in b]
Out[484]: [array([9, 8]), array([7, 6, 5]), array([ 3, 2, 1, 0, -1])]
The sublists differ in length, so the result cannot be made into a 2d array - it has to remain a list (or, if you insist, a 1d object-dtype array).
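Regarding the edit about avoiding an explicit Python loop: one option (a sketch, not from the answers above) is to gather all indices with a single fancy-indexing call and then split the result back into pieces of the original sublist lengths:
import numpy as np

a = np.array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2])
b = [[0, 1], [2, 3, 4], [6, 7, 8, 9, 10]]

flat_idx = np.concatenate(b)                  # all indices in one array
lengths = [len(x) for x in b]
pieces = np.split(a[flat_idx], np.cumsum(lengths)[:-1])
print(pieces)  # [array([9, 8]), array([7, 6, 5]), array([ 3,  2,  1,  0, -1])]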

Pandas: Can you access rolling window items

Can you access the items of a pandas rolling window object?
import pandas as pd
rs = pd.Series(range(10))
rs.rolling(window = 3)
# prints
Rolling [window=3,center=False,axis=0]
Can I get the windows as groups like this?
[0,1,2]
[1,2,3]
[2,3,4]
I will start off by saying that this reaches into the internal implementation. But if you really, really wanted to compute the indexers the same way pandas does, here it is.
You will need v0.19.0rc1 (just about released); you can conda install -c pandas pandas=0.19.0rc1.
In [41]: rs = pd.Series(range(10))
In [42]: rs
Out[42]:
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
dtype: int64
# this reaches into an internal implementation
# the first 3 is the window size, the second is the minimum
# number of periods we need
In [43]: start, end, _, _, _, _ = pandas._window.get_window_indexer(rs.values,3,3,None,use_mock=False)
# starting index
In [44]: start
Out[44]: array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])
# ending index
In [45]: end
Out[45]: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# window size
In [3]: end-start
Out[3]: array([1, 2, 3, 3, 3, 3, 3, 3, 3, 3])
# the indexers
In [47]: [np.arange(s, e) for s, e in zip(start, end)]
Out[47]:
[array([0]),
 array([0, 1]),
 array([0, 1, 2]),
 array([1, 2, 3]),
 array([2, 3, 4]),
 array([3, 4, 5]),
 array([4, 5, 6]),
 array([5, 6, 7]),
 array([6, 7, 8]),
 array([7, 8, 9])]
So this is sort of trivial in the fixed-window case, but it becomes extremely useful in a variable-window scenario; e.g. in 0.19.0 you can specify things like 2S to aggregate by time.
All of that said, getting these indexers is not particularly useful; you generally want to do something with the results. That is the point of the aggregation functions, or .apply if you want to aggregate generically.
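For example (a small illustrative sketch, not from the original answer), the usual route is to aggregate over each window rather than materialise the windows:
rs.rolling(window=3).mean()                               # built-in aggregation
rs.rolling(window=3).apply(lambda w: w.max() - w.min())   # custom per-window aggregation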
Here's a workaround, but I'm waiting to see if anyone has a pandas solution:
def rolling_window(a, step):
    shape = a.shape[:-1] + (a.shape[-1] - step + 1, step)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
rolling_window(rs.values, 3)
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4],
       [3, 4, 5],
       [4, 5, 6],
       [5, 6, 7],
       [6, 7, 8],
       [7, 8, 9]])
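On NumPy >= 1.20 the same windows can be produced without manual stride arithmetic (an additional sketch, not part of the original answer):
import numpy as np
np.lib.stride_tricks.sliding_window_view(rs.values, 3)
# the same 8 x 3 array of windows as above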
This is solved in pandas 1.1, as the rolling object is now an iterable:
[window.tolist() for window in rs.rolling(window=3) if len(window) == 3]
