Pair two 1D arrays by their value and index [closed] - python

I have two 1D arrays of the same length as this:
import numpy as np
a = np.array([1, 1, 1, 2, 2, 3, 4, 5])
b = np.array([7, 7, 8, 8, 9, 8, 10, 10])
The values in a are non-decreasing, while the values in b are in arbitrary order.
I want to pair them by value, following the steps below:
Pick the first unique value ([1]) of array a and collect the unique values ([7, 8]) of array b at the same indices.
If any of those paired values ([8]) appears again in b, pick the values of a at those indices.
If that brings in a new value ([2]) that appears elsewhere in a, select the values of b at those indices as well, and so on.
Finally, the result should be:
[1, 2, 3] is paired with [7, 8, 9]
[4, 5] is paired with [10]

It looks like there is no easy vectorised (loop-free) solution, since this is a graph theory problem: finding connected components. If you still want a script that runs fast on big data, you could use the igraph library, whose core is written in C.
TL;DR
I assume your input corresponds to edges of some graph:
>>> np.transpose([a, b])
array([[ 1, 7],
[ 1, 7],
[ 1, 8],
[ 2, 8],
[ 2, 9],
[ 3, 8],
[ 4, 10],
[ 5, 10]])
So your vertices are:
>>> np.unique(np.transpose([a, b]))
array([ 1, 2, 3, 4, 5, 7, 8, 9, 10])
And you would be quite happy (at least at the beginning) to recognise communities, like:
tags = np.transpose([a, b, communities])
>>> tags
array([[ 1, 7, 0],
[ 1, 7, 0],
[ 1, 8, 0],
[ 2, 8, 0],
[ 2, 9, 0],
[ 3, 8, 0],
[ 4, 10, 1],
[ 5, 10, 1]])
so that you have vertices (1, 2, 3, 7, 8, 9) included in community number 0 and vertices (4, 5, 10) included in community number 1.
Unfortunately, igraph does not support arbitrary integer vertex ids such as 1 to 10 with gaps: the ids must start from 0 and be consecutive. So you need to store the original values and relabel the vertices, so that the edges become:
vertices_old, inv = np.unique(np.transpose([a,b]), return_inverse=True)
edges_new = inv.reshape(-1, 2)
>>> vertices_old
array([ 1, 2, 3, 4, 5, 7, 8, 9, 10]) #new ones are: [0, 1, 2, ..., 8]
>>> edges_new
array([[0, 5],
[0, 5],
[0, 6],
[1, 6],
[1, 7],
[2, 6],
[3, 8],
[4, 8]], dtype=int64)
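The next step is to find communities using igraph (pip install python-igraph; recent releases are published on PyPI as igraph, and they deprecate clusters() in favour of connected_components()). You can run the following: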
import igraph as ig
graph = ig.Graph(edges = edges_new)
communities = graph.clusters().membership #type: list
communities = np.array(communities)
>>> communities
array([0, 0, 0, 1, 1, 0, 0, 0, 1]) #tags of nodes [1 2 3 4 5 7 8 9 10]
And then retrieve tags of source vertices (as well as tags of target vertices):
>>> communities = communities[edges_new[:, 0]] #or [:, 1]
array([0, 0, 0, 0, 0, 0, 1, 1])
After you find communities, the second part of solution appears to be a typical groupby problem. You can do it in pandas:
import pandas as pd
def get_part(source, communities):
    part_edges = np.transpose([source, communities])
    part_idx = pd.DataFrame(part_edges).groupby([1]).indices.values() #might contain duplicated source values
    part = [np.unique(source[idx]) for idx in part_idx]
    return part
>>> get_part(a, communities), get_part(b, communities)
([array([1, 2, 3]), array([4, 5])], [array([7, 8, 9]), array([10])])
Final Code
import igraph as ig
import numpy as np
import pandas as pd
def get_part(source, communities):
    '''find set of nodes for each community'''
    part_edges = np.transpose([source, communities])
    part_idx = pd.DataFrame(part_edges).groupby([1]).indices.values() #might contain duplicated source values
    part = [np.unique(source[idx]) for idx in part_idx]
    return part
a = np.array([1, 1, 1, 2, 2, 3, 4, 5])
b = np.array([7, 7, 8, 8, 9, 8, 10, 10])
vertices_old, inv = np.unique(np.transpose([a,b]), return_inverse=True)
edges_new = inv.reshape(-1, 2)
graph = ig.Graph(edges = edges_new)
communities = np.array(graph.clusters().membership)
communities = communities[edges_new[:,0]] #or communities[edges_new[:,1]]
>>> get_part(a, communities), get_part(b, communities)
([array([1, 2, 3]), array([4, 5])], [array([7, 8, 9]), array([10])])
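If igraph is not available, the connected-components step can probably also be done with scipy.sparse.csgraph.connected_components. The following is only a minimal sketch, not part of the answer above. It assumes that a and b are two separate sets of vertices (so equal numbers in a and b would not be confused), and the names adj and pairs are illustrative.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

a = np.array([1, 1, 1, 2, 2, 3, 4, 5])
b = np.array([7, 7, 8, 8, 9, 8, 10, 10])

# Relabel each side to 0..k-1; b-vertices are shifted past the a-vertices
ua, ia = np.unique(a, return_inverse=True)
ub, ib = np.unique(b, return_inverse=True)
n = len(ua) + len(ub)

# One edge per (a[i], b[i]) pair; duplicate edges are harmless here
adj = coo_matrix((np.ones(len(a)), (ia, ib + len(ua))), shape=(n, n))
n_comp, labels = connected_components(adj, directed=False)

# Split the labels back into the a-side and the b-side
pairs = [(ua[labels[:len(ua)] == k].tolist(),
          ub[labels[len(ua):] == k].tolist())
         for k in range(n_comp)]
pairs.sort()
print(pairs)  # [([1, 2, 3], [7, 8, 9]), ([4, 5], [10])]
```

Relabelling the two sides separately avoids the pandas groupby step, at the cost of treating a and b as disjoint vertex sets.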

I tried doing this by iterating both the arrays simultaneously and keeping track of what element is associated with which index of the result. Let me know if this works for you?
a = [1, 1, 1, 2, 2, 3, 4, 5]
b = [7, 7, 8, 8, 9, 8, 10, 10]
tracker_a = dict()
tracker_b = dict()
result = []
index = 0
for elem_a, elem_b in zip(a, b):
    if elem_a in tracker_a:
        result[tracker_a[elem_a]][1].add(elem_b)
        tracker_b[elem_b] = tracker_a[elem_a]
    elif elem_b in tracker_b:
        result[tracker_b[elem_b]][0].add(elem_a)
        tracker_a[elem_a] = tracker_b[elem_b]
    else:
        tracker_a[elem_a] = index
        tracker_b[elem_b] = index
        result.append([{elem_a}, {elem_b}])
        index += 1
print(result)
Output:
[[{1, 2, 3}, {8, 9, 7}], [{4, 5}, {10}]]
Complexity: O(n)
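One caveat with the single pass above: it never merges two groups that were created separately and only later turn out to be connected. For example, with a = [1, 2, 1] and b = [7, 8, 8] the value 8 ends up in two groups. A union-find (disjoint-set) structure handles that case. The following is a minimal sketch under that reading of the problem; pair_groups and the ("a", x) / ("b", y) tagging are illustrative names, not from the question.

```python
def pair_groups(a, b):
    """Group values of a and b that are linked through shared indices."""
    parent = {}

    def find(x):
        # Return the representative of x's group (with path halving)
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for x, y in zip(a, b):
        # Tag each value with its side so equal numbers in a and b stay distinct
        ra, rb = find(("a", x)), find(("b", y))
        if ra != rb:
            parent[rb] = ra  # merge the two groups

    groups = {}
    for side, value in list(parent):
        root = find((side, value))
        a_vals, b_vals = groups.setdefault(root, (set(), set()))
        (a_vals if side == "a" else b_vals).add(value)
    return [(sorted(ga), sorted(gb)) for ga, gb in groups.values()]


print(pair_groups([1, 1, 1, 2, 2, 3, 4, 5], [7, 7, 8, 8, 9, 8, 10, 10]))
# [([1, 2, 3], [7, 8, 9]), ([4, 5], [10])]
print(pair_groups([1, 2, 1], [7, 8, 8]))
# [([1, 2], [7, 8])]
```

This stays close to linear time in practice, and the result does not depend on the order in which the pairs arrive.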


What is the difference between the results of atom.GetIdx() and Chem.CanonicalRankAtoms()?

The RDKit docs say that rdkit.Chem.rdchem.Atom.GetIdx() returns the atom's index within the molecule.
https://www.rdkit.org/docs/source/rdkit.Chem.rdchem.html
They also say that rdkit.Chem.rdmolfiles.CanonicalRankAtoms(mol) returns the canonical ranking of the atoms in the molecule.
https://www.rdkit.org/docs/source/rdkit.Chem.rdchem.html
My question is: how is an atom's canonical rank different from its index? Thank you very much in advance.
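from rdkit import Chem
from rdkit.Chem import rdmolfiles

moleculeSmiles = "C1NCN1"
mol1 = Chem.AddHs(Chem.MolFromSmiles(moleculeSmiles))
rankList = list(rdmolfiles.CanonicalRankAtoms(mol1))
atomIdxList = []
# atomIter was not defined in the snippet; iterating mol1.GetAtoms() is assumed here
for atom in mol1.GetAtoms():
    atomIdxList.append(atom.GetIdx())
print(atomIdxList)
print(rankList)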
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[8, 6, 9, 7, 2, 3, 0, 4, 5, 1]

Create a multidimensional array from 1-D list, numpy

Struggling to describe this issue in words, but have a seemingly simple issue I can't find an answer for.
I want to create an array using values from one list/array and indices from another. I want the shape of the new array to be the same as the index array.
import numpy as np
a = np.array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2])
b = [[0, 1], [2, 3, 4], [6, 7, 8, 9, 10]]
result = func(a, b) #some function or operator...
print(result)
[[9, 8], [7, 6, 5], [3, 2, 1, 0, -1]]
Thank you! :)
EDIT:
Good solutions so far, but I would rather do this without a for loop as we are looking at hundreds of thousands of rows and need to keep computing time down. Thanks again :)
If every sublist of b is a contiguous run of increasing indices (as in your example), you can slice a inside a list comprehension:
>>> [a[x[0]:x[-1]+1] for x in b]
[array([9, 8]), array([7, 6, 5]), array([ 3, 2, 1, 0, -1])]
EDIT: Your question indicates that you want a faster option, so you might test the following script to see which is faster for your Python installation:
#!/usr/bin/env python
import timeit

setup = '''
import numpy as np
a = np.array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2])
b = [[0, 1], [2, 3, 4], [6, 7, 8, 9, 10]]
'''
# Time the expressions themselves: wrapping them in a def that is never
# called would only measure how long it takes to define the function.
test1 = '[a[x[0]:x[-1]+1] for x in b]'  # slicing
test2 = '[a[idx] for idx in b]'  # integer-list indexing
print(timeit.timeit(setup=setup, stmt=test1, number=1000000))
print(timeit.timeit(setup=setup, stmt=test2, number=1000000))
On my machine, the two approaches given you so far run about the same, but hpaulj's answer might be very slightly faster (unless Python is caching data behind the scenes), which may be of more use to you in production. Test it out locally and see if you get a similar or different answer.
Just apply each indexing sublist to a:
In [483]: a = np.array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2])
...:
...: b = [[0, 1], [2, 3, 4], [6, 7, 8, 9, 10]]
...:
...:
In [484]: [a[idx] for idx in b]
Out[484]: [array([9, 8]), array([7, 6, 5]), array([ 3, 2, 1, 0, -1])]
The sublists differ in length, so the result cannot be made into a 2D array; it has to remain a list (or, if you insist, a 1D object-dtype array).
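If the Python-level loop over hundreds of thousands of sublists is the bottleneck, one option is to index a once with all the indices concatenated and then cut the flat result back into pieces. This is only a sketch; the names lengths, flat and result are illustrative, and whether it is actually faster should be measured on your data.

```python
import numpy as np

a = np.array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2])
b = [[0, 1], [2, 3, 4], [6, 7, 8, 9, 10]]

lengths = [len(idx) for idx in b]
flat = a[np.concatenate(b)]  # a single fancy-indexing call
result = np.split(flat, np.cumsum(lengths)[:-1])  # cut back into sublists
print(result)
# [array([9, 8]), array([7, 6, 5]), array([ 3,  2,  1,  0, -1])]
```

The result is still a list of arrays, since the pieces have different lengths. If only the flat values are needed downstream, the np.split step can be skipped entirely.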

Pose keypoints numpy averaging

I know you're supposed to give examples when you ask questions here, but I can't really think of anything that wouldn't involve pasting a massive project worth of code, so I'll just try to describe this as well as possible.
I'm working on a project that involves using keypoints generated by using OpenPose (after I've done some preprocessing on them to simplify everything, I come up with data formatted like this: [x0, y0, c0, x1, y1, c1...], where there are 18 points total, and the x's and y's represent their coordinates, while the c's represent confidence.) I want to take a nested list that has the keypoints for a single person listed in the above manner for each frame, and output a new nested list of lists, made up of the weighted average x's and y's (the weights would be the confidence values for each point) along with the average confidences by each second (instead of by frame), in the same format as above.
I have already converted the original list into a 3-dimensional list, with each second holding each of its frames, each of which holds its keypoint list. I know that I can write code myself to do all of this without using numpy.average(), but I was hoping that I wouldn't have to, because it quickly becomes confusing. Instead, I was wondering if there were a way I could iterate over each second, using said method, in a reasonably simple manner, and just append the resulting lists to a new list, like this:
out = []
for second in lst:
    out.append(average(second, axis=1, weights=?, other params?))
Again, I'm sorry for not giving an example of some sort.
Maybe you could get some inspiration from this code:
import numpy as np
def pose_average(sequence):
    x, y, c = sequence[0::3], sequence[1::3], sequence[2::3]
    x_avg = np.average(x, weights=c)
    y_avg = np.average(y, weights=c)
    return x_avg, y_avg
sequence = [2, 4, 1, 5, 6, 3, 5, 2, 1]
pose_average(sequence)
>>> (4.4, 4.8)
For multiple sequences of grouped poses:
data = [[1, 2, 3, 2, 3, 4, 3, 4, 5], [1, 2, 3, 4, 5, 6, 7, 8, 9], [4, 1, 2, 5, 3, 3, 4, 1, 2]]
out = [ pose_average(seq) for seq in data ]
out
>>> [(2.1666666666666665, 3.1666666666666665),
(5.0, 6.0),
(4.428571428571429, 1.8571428571428572)]
Edit
By assuming that:
data is a list of sequences
a sequence is a list of grouped poses (for example, grouped by second)
a pose is the list of joint coordinates and confidences: [x1, y1, c1, x2, y2, c2, ...]
the slightly modified code is now:
import numpy as np
data = [
[[1, 2, 3, 2, 3, 4, 3, 4, 5], [9, 2, 3, 4, 5, 6, 7, 8, 9], [4, 1, 2, 5, 3, 3, 4, 1, 2], [5, 3, 4, 1, 10, 6, 5, 0, 0]],
[[6, 9, 11, 0, 8, 6, 1, 5, 11], [3, 5, 4, 2, 0, 2, 0, 8, 8], [1, 5, 9, 5, 1, 0, 6, 6, 6]],
[[9, 4, 7, 0, 2, 1], [9, 4, 7, 0, 2, 1], [9, 4, 7, 0, 2, 1]]
]
def pose_average(sequence):
    sequence = np.asarray(sequence)
    x, y, c = sequence[:, 0::3], sequence[:, 1::3], sequence[:, 2::3]
    x_avg = np.average(x, weights=c, axis=0)
    y_avg = np.average(y, weights=c, axis=0)
    return x_avg, y_avg
out = [ pose_average(seq) for seq in data ]
out
>>> [(array([4.83333333, 2.78947368, 5.375 ]),
array([2.16666667, 5.84210526, 5.875 ])),
(array([3.625, 0.5 , 1.88 ]), array([6.83333333, 6. , 6.2 ])),
(array([9., 0.]), array([4., 2.]))]
x_avg is now the array of x positions for each point, averaged over the sequence and weighted by c (and likewise y_avg for the y positions).
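The question also asks for the average confidence and for output in the same [x0, y0, c0, x1, y1, c1, ...] layout. A possible extension is sketched below; the function name average_second and the use of a plain (unweighted) mean for the confidences are assumptions, not something the question specifies.

```python
import numpy as np

def average_second(frames):
    """Collapse the frames of one second into a single keypoint list."""
    frames = np.asarray(frames, dtype=float)  # shape (n_frames, 3 * n_points)
    x, y, c = frames[:, 0::3], frames[:, 1::3], frames[:, 2::3]
    # np.average raises ZeroDivisionError if a point has zero confidence
    # in every frame, so such points need separate handling.
    x_avg = np.average(x, axis=0, weights=c)
    y_avg = np.average(y, axis=0, weights=c)
    c_avg = c.mean(axis=0)  # assumption: unweighted mean of the confidences
    # Interleave back into [x0, y0, c0, x1, y1, c1, ...]
    return np.column_stack([x_avg, y_avg, c_avg]).ravel().tolist()

second = [[0, 0, 1], [2, 4, 3]]  # two frames, one keypoint each
print(average_second(second))  # [1.5, 3.0, 2.0]
```

Applied to a list of seconds, out = [average_second(s) for s in data] gives one keypoint list per second.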

Conditional logic with Python ndimage generic_filter

I am trying to write a Python function to remove hot pixels in 2D image data. The function should take the mean of the neighbors around each element in the 2D array and conditionally overwrite that element if its value exceeds the mean of its neighbors by a specific amount (for example, 3 sigma). This is where I am:
def myFunction(values):
    if np.mean(values) + 3*np.std(values) < origin:
        return np.mean(values)
footprint = np.array([[1,1,1],
                      [1,0,1],
                      [1,1,1]])
correctedData = ndimage.generic_filter(data, myFunction, footprint = footprint)
'origin' in the above code is demonstrative. I know it isn't correct; I am just trying to show what I want to do. Is there a way to pass the value of the current element to the function given to generic_filter?
Thanks!
Your footprint is not passing the central value back to your function.
I find it easier to use size (equivalent to using all ones in the footprint), then deal with everything in the callback function. So in your case I'd extract the central value inside the callback function. Something like this:
import numpy as np
from scipy.ndimage import generic_filter

def despike(values):
    # values is the flattened neighbourhood; its middle element is the pixel itself
    centre = values.size // 2
    neighbours = np.delete(values, centre)
    avg = np.mean(neighbours)
    std = np.std(neighbours)
    if avg + 3 * std < values[centre]:
        return avg
    else:
        return values[centre]
Let's make some fake data:
data = np.random.randint(0, 10, (5, 5))
data[2, 2] = 100
This yields (for example):
array([[ 2, 8, 4, 2, 4],
[ 9, 4, 7, 6, 5],
[ 9, 9, 100, 7, 3],
[ 0, 1, 0, 8, 0],
[ 9, 9, 7, 6, 0]])
Now you can apply the filter:
correctedData = generic_filter(data, despike, size=3)
This removes the spike I added (the neighbour mean, 5.25, is stored as 5 because the output keeps the integer dtype of data):
array([[2, 8, 4, 2, 4],
[9, 4, 7, 6, 5],
[9, 9, 5, 7, 3],
[0, 1, 0, 8, 0],
[9, 9, 7, 6, 0]])
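For large images the Python callback can become slow, because it is called once per pixel. The same test can be sketched without a callback by building all 3x3 neighbourhoods at once. This is an illustrative alternative, not the answer's method: it assumes NumPy 1.20 or later (for sliding_window_view), and the "symmetric" padding is meant to mirror the default 'reflect' boundary handling of generic_filter. The name despike_vectorised is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def despike_vectorised(data, n_sigma=3.0):
    data = np.asarray(data, dtype=float)
    # Pad by one pixel so that edge pixels also get a full 3x3 window
    padded = np.pad(data, 1, mode="symmetric")
    windows = sliding_window_view(padded, (3, 3)).reshape(*data.shape, 9)
    # Drop the centre of each window (flat index 4) to keep the 8 neighbours
    neighbours = np.delete(windows, 4, axis=-1)
    avg = neighbours.mean(axis=-1)
    std = neighbours.std(axis=-1)
    # Replace a pixel only where it exceeds its neighbours by n_sigma
    return np.where(data > avg + n_sigma * std, avg, data)

data = np.array([[2, 8, 4, 2, 4],
                 [9, 4, 7, 6, 5],
                 [9, 9, 100, 7, 3],
                 [0, 1, 0, 8, 0],
                 [9, 9, 7, 6, 0]])
cleaned = despike_vectorised(data)
print(cleaned[2, 2])  # 5.25, the mean of the eight neighbours
```

Because the result is a float array, the replacement value is not cut down to an integer as it is in the generic_filter call on integer data.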

Fastest way to count identical sub-arrays in a nd-array?

Let's consider a 2d-array A
2 3 5 7
2 3 5 7
1 7 1 4
5 8 6 0
2 3 5 7
The first, second and last rows are identical. The algorithm I'm looking for should return the number of identical rows for each distinct row (i.e. the number of duplicates of each element). If the script can easily be modified to count identical columns as well, that would be great.
I use an inefficient naive algorithm to do that:
import numpy
A = numpy.array([[2, 3, 5, 7], [2, 3, 5, 7], [1, 7, 1, 4], [5, 8, 6, 0], [2, 3, 5, 7]])
i = 0
end = len(A)
while i < end:
    print(i, end=' ')
    j = i + 1
    numberID = 1
    while j < end:
        print(j)
        if numpy.array_equal(A[i, :], A[j, :]):
            numberID += 1
        j += 1
    i += 1
print(A, len(A))
Expected result:
array([3,1,1]) # number identical arrays per line
My algorithm is essentially plain Python loops over a numpy array, and thus inefficient. Thanks for your help.
In numpy >= 1.9.0, np.unique has a return_counts keyword argument that you can combine with a void-dtype view of the array (each row viewed as a single item) to get the counts:
b = np.ascontiguousarray(A).view(np.dtype((np.void, A.dtype.itemsize * A.shape[1])))
unq_a, unq_cnt = np.unique(b, return_counts=True)
unq_a = unq_a.view(A.dtype).reshape(-1, A.shape[1])
>>> unq_a
array([[1, 7, 1, 4],
[2, 3, 5, 7],
[5, 8, 6, 0]])
>>> unq_cnt
array([1, 3, 1])
In an older numpy, you can replicate what np.unique does, which would look something like:
a_view = np.array(A, copy=True)
a_view = a_view.view(np.dtype((np.void,
                               a_view.dtype.itemsize*a_view.shape[1]))).ravel()
a_view.sort()
a_flag = np.concatenate(([True], a_view[1:] != a_view[:-1]))
# index the sorted view (not A, whose rows are still in their original order)
a_unq = a_view[a_flag].view(A.dtype).reshape(-1, A.shape[1])
a_idx = np.concatenate(np.nonzero(a_flag) + ([a_view.size],))
a_cnt = np.diff(a_idx)
>>> a_unq
array([[1, 7, 1, 4],
[2, 3, 5, 7],
[5, 8, 6, 0]])
>>> a_cnt
array([1, 3, 1])
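In NumPy 1.13 and later, np.unique also accepts an axis argument, which makes the void-dtype view unnecessary and covers the column case from the question as well. A short sketch, with illustrative variable names:

```python
import numpy as np

A = np.array([[2, 3, 5, 7],
              [2, 3, 5, 7],
              [1, 7, 1, 4],
              [5, 8, 6, 0],
              [2, 3, 5, 7]])

# Unique rows and how often each occurs
rows, row_counts = np.unique(A, axis=0, return_counts=True)
print(rows.tolist())        # [[1, 7, 1, 4], [2, 3, 5, 7], [5, 8, 6, 0]]
print(row_counts.tolist())  # [1, 3, 1]

# The same for columns: every column of A is distinct
cols, col_counts = np.unique(A, axis=1, return_counts=True)
print(col_counts.tolist())  # [1, 1, 1, 1]
```

As with the approaches above, the unique rows come back sorted, so the counts are not in order of first appearance.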
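You can lexsort on the row entries, which gives you the indices for traversing the rows in sorted order. Identical rows then sit next to each other, so after the O(n log n) sort a single O(n) pass finds the duplicates, rather than O(n^2) pairwise comparisons. Note that lexsort uses its last key as the primary sort key, so with a.T the rows are ordered by the last column first, i.e. they are 'alphabetized' right to left rather than left to right.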
In [9]: a
Out[9]:
array([[2, 3, 5, 7],
[2, 3, 5, 7],
[1, 7, 1, 4],
[5, 8, 6, 0],
[2, 3, 5, 7]])
In [10]: lexsort(a.T)
Out[10]: array([3, 2, 0, 1, 4])
In [11]: a[lexsort(a.T)]
Out[11]:
array([[5, 8, 6, 0],
[1, 7, 1, 4],
[2, 3, 5, 7],
[2, 3, 5, 7],
[2, 3, 5, 7]])
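Once the rows are sorted, the counts follow from the positions where a row differs from the one before it. One way to finish the idea is sketched below; the names is_new, starts and counts are illustrative.

```python
import numpy as np

A = np.array([[2, 3, 5, 7],
              [2, 3, 5, 7],
              [1, 7, 1, 4],
              [5, 8, 6, 0],
              [2, 3, 5, 7]])

s = A[np.lexsort(A.T)]  # rows sorted, last column first
# A row starts a new group if it differs from the row before it
is_new = np.concatenate(([True], np.any(s[1:] != s[:-1], axis=1)))
starts = np.flatnonzero(is_new)
counts = np.diff(np.append(starts, len(s)))

print(s[starts].tolist())  # [[5, 8, 6, 0], [1, 7, 1, 4], [2, 3, 5, 7]]
print(counts.tolist())     # [1, 1, 3]
```

The unique rows appear in lexsort order, so the counts are listed in that order too.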
You can use the Counter class from the collections module for this.
It works like this:
x = [2, 2, 1, 5, 2]
from collections import Counter
c = Counter(x)
print(c)
Output : Counter({2: 3, 1: 1, 5: 1})
The only issue in your case is that every element of x is itself a list, which is not hashable.
If you convert every element of x to a tuple, it works:
x = [(2, 3, 5, 7),(2, 3, 5, 7),(1, 7, 1, 4),(5, 8, 6, 0),(2, 3, 5, 7)]
from collections import Counter
c = Counter(x)
print(c)
Output : Counter({(2, 3, 5, 7): 3, (1, 7, 1, 4): 1, (5, 8, 6, 0): 1})
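The conversion to tuples can be done in one step with map(tuple, ...), and transposing with zip(*rows) gives the column counts the question also asks about. A small sketch using a nested list (which is what A.tolist() would return for the numpy array in the question); the variable names are illustrative.

```python
from collections import Counter

# Same data as the array A in the question, as a nested list
rows = [[2, 3, 5, 7],
        [2, 3, 5, 7],
        [1, 7, 1, 4],
        [5, 8, 6, 0],
        [2, 3, 5, 7]]

row_counts = Counter(map(tuple, rows))
col_counts = Counter(zip(*rows))  # zip(*rows) yields the columns as tuples

print(row_counts[(2, 3, 5, 7)])  # 3
print(len(col_counts))           # 4 distinct columns
```

This is plain Python, so for very large arrays the numpy-based answers are likely to be faster.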
