Sort nested lists by list - python

How can I efficiently sort nested lists so that the first element of each nested list follows the order given in a separate order list?
List of lists: [[97, 2, 0, 2], [97, 2, 0, 2], [98, 1, 2, 3], [99, 3, 3, 6], [99, 3, 3, 6], [99, 3, 3, 6], [101, 1, 6, 7], [100, 1, 7, 8]]
Order list: [97, 98, 99, 99, 101, 100, 97, 99]
Desired list: [[97, 2, 0, 2], [98, 1, 2, 3], [99, 3, 3, 6], [99, 3, 3, 6], [101, 1, 6, 7], [100, 1, 7, 8], [97, 2, 0, 2], [99, 3, 3, 6]]

Try creating a dict keyed to the first value from your nested list. Then build the output list from that dict:
nl = [[97, 2, 0, 2], [97, 2, 0, 2], [98, 1, 2, 3], [99, 3, 3, 6], [99, 3, 3, 6],
      [99, 3, 3, 6], [101, 1, 6, 7], [100, 1, 7, 8]]
# Associate first value in list to the list
d = {v[0]: v for v in nl}
order_lst = [97, 98, 99, 99, 101, 100, 97, 99]
# Grab the list associated with each value in order_lst from d
out = [d[v] for v in order_lst]
print(out)
out:
[[97, 2, 0, 2], [98, 1, 2, 3], [99, 3, 3, 6], [99, 3, 3, 6], [101, 1, 6, 7],
[100, 1, 7, 8], [97, 2, 0, 2], [99, 3, 3, 6]]
*Note: this assumes that any sub-list with a given first value is an acceptable match for that value, because a dict keeps only one entry per key (here, the last sub-list seen with that first element).
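If sub-lists that share a first element can differ, one possible variation is to queue the sub-lists per key and consume them in order. This is a minimal sketch, not part of the answer above; the name `reorder` and the sample data are made up for illustration, and it assumes the order list mentions each key exactly as often as it occurs in the nested list.

```python
from collections import defaultdict, deque

def reorder(nested, order):
    """Arrange sub-lists so their first elements follow `order`."""
    # Queue the sub-lists under their first element, keeping original order
    buckets = defaultdict(deque)
    for sub in nested:
        buckets[sub[0]].append(sub)
    # Take the next queued sub-list for each key in `order`
    # (raises IndexError if a key is requested more often than it occurs)
    return [buckets[key].popleft() for key in order]

# Hypothetical data: the two sub-lists keyed 97 differ from each other
nl = [[97, "a"], [97, "b"], [98, "c"], [99, "d"]]
order_lst = [97, 98, 99, 97]
out = reorder(nl, order_lst)
print(out)  # [[97, 'a'], [98, 'c'], [99, 'd'], [97, 'b']]
```

Building the queues and reading them back are both single passes, so the cost stays linear in the length of the input.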

Related

Python parallel processing - joblib for nested loops - track original indices for the input

I was using joblib to process a list (>500k rows) in parallel in order to find duplicates in the file, so I needed to track the indices of the input list. However, the indices returned are local to each thread/process rather than the original indices in the list (range 0-500k+). How can I track the original indices of the input during parallel processing? Thank you.
import time
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from joblib import Parallel, delayed
start_time = time.time()
texts = a_list
def match_name(texts):
    result = []
    for i, text in enumerate(texts):
        for j, name in enumerate(texts[i+1:]):
            fratio = fuzz.token_set_ratio(text, name)
            if fratio>=75:
                result.append([i,j, fratio])
    return result
results2 = Parallel(n_jobs=200, verbose=5, backend="loky")(map(delayed(match_name), texts))
print(time.time() - start_time)
The actual result is:
[[[1, 1, 100],
[1, 4, 100],
[1, 6, 100],
[2, 2, 100],
[2, 4, 100],
[3, 2, 100],
[3, 4, 100],
[5, 1, 100],
[6, 1, 100]],
[[0, 14, 100],
[1, 6, 100],
[1, 14, 100],
[2, 9, 100],
[2, 14, 100],
[8, 7, 100],
[9, 0, 100],
[9, 12, 100],
[10, 11, 100],
[12, 4, 100],
[13, 9, 100]],
[[1, 24, 100],
[3, 21, 100],
[5, 7, 100],
[6, 17, 100],
[9, 1, 100],
[9, 9, 100],
[11, 7, 100],
[12, 2, 100],
[17, 4, 100]],
[[0, 18, 100],
[0, 19, 100],
[2, 5, 100],
...]
The expected indices should range from 0 to 500k+, which is the length of the list.
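One likely reason the indices look local is that `j` is counted within the slice `texts[i+1:]`, and, if the input is split into chunks, `i` is counted within each chunk. A minimal sketch of keeping global indices follows. It assumes each worker receives the row index `i` together with the full list; `match_from` and `similarity` are hypothetical names, and `difflib.SequenceMatcher` is only a standard-library stand-in for `fuzz.token_set_ratio` so that the example runs without extra packages.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Stand-in scorer (assumption): returns 0-100, like fuzz.token_set_ratio
    return round(100 * SequenceMatcher(None, a, b).ratio())

def match_from(i, texts, threshold=75):
    """Compare texts[i] with every later item, reporting global indices."""
    hits = []
    # start=i + 1 makes j an index into the full list, not into the slice
    for j, other in enumerate(texts[i + 1:], start=i + 1):
        score = similarity(texts[i], other)
        if score >= threshold:
            hits.append([i, j, score])
    return hits

texts = ["acme corp", "acme corp", "globex", "acme corp"]
# Run serially here; each call only depends on i and the shared list
results = [hit for i in range(len(texts)) for hit in match_from(i, texts)]
print(results)  # [[0, 1, 100], [0, 3, 100], [1, 3, 100]]
```

Because every call is independent, the same calls could be dispatched with joblib, for example `Parallel(n_jobs=4)(delayed(match_from)(i, texts) for i in range(len(texts)))`, and the reported pairs would still refer to positions in the original list.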

How to sum values from two iterator "lists"?

I have two iterators, each of which yields a "list" that looks something like this:
[[1, 2, 3, 4, 5, 6],
[2, 4, 6, 8, 10, 12],
[3, 5, 8, 6, 1, 19],
[5, 9, 1, 9, 4, 6]]
Or rather, that is what it would look like if I just ran a for loop over them.
I use iterators rather than lists because of memory: the real lists/arrays are far larger, and this is just an example.
What I need to do is sum each column across all rows of one list, do the same for the other list, and then add the two results together, like sum(list1) + sum(list2).
So basically:
list1:                     list2:
[[1, 2, 3, 4, 5, 6],       [[5, 4, 3, 2, 1, 9],
 [2, 4, 6, 8, 10, 12],      [6, 3, 8, 1, 1, 6],
 [3, 5, 8, 6, 1, 19],       [1, 3, 2, 8, 2, 3],
 [5, 9, 1, 9, 4, 6]]        [5, 2, 9, 4, 2, 5]]
=>                         =>
[11, 20, 18, 27, 20, 43]   [17, 12, 22, 15, 6, 23]
=>
[28, 32, 40, 42, 26, 66]
So I iterate over the two lists, sum the columns of each, and at the end add the columns of the two resulting lists into one combined list.
I know how to do this with regular lists, but since these are iterators/generators (I don't know the correct term), I am really not sure how it is done.
You can use this to sum each one without loading everything into memory:
def sumIter(rows):
    result = [0, 0, 0, 0, 0, 0]  # Assuming there are always 6 items in each sub-list
    for row in rows:
        # Add the current row to the running column totals
        result = [(result[i] + row[i]) for i in range(6)]
    return result
And then:
sum1 = sumIter(iter1)
sum2 = sumIter(iter2)
result = [(sum1[i] + sum2[i]) for i in range(6)]
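Since the column sums of the first iterator plus those of the second equal the column sums over all rows of both, the two iterators can also be chained and consumed in a single pass. The following is a minimal sketch under the assumption that every row has the same length; `column_sums`, `rows1`, and `rows2` are illustrative names that do not appear in the question.

```python
from itertools import chain

def column_sums(rows):
    """Sum rows column by column, holding one row in memory at a time."""
    totals = None
    for row in rows:
        if totals is None:
            totals = list(row)  # the first row seeds the totals
        else:
            totals = [t + v for t, v in zip(totals, row)]
    return totals if totals is not None else []

def rows1():
    # Hypothetical generator standing in for the first iterator
    yield from ([1, 2, 3, 4, 5, 6], [2, 4, 6, 8, 10, 12],
                [3, 5, 8, 6, 1, 19], [5, 9, 1, 9, 4, 6])

def rows2():
    # Hypothetical generator standing in for the second iterator
    yield from ([5, 4, 3, 2, 1, 9], [6, 3, 8, 1, 1, 6],
                [1, 3, 2, 8, 2, 3], [5, 2, 9, 4, 2, 5])

result = column_sums(chain(rows1(), rows2()))
print(result)  # [28, 32, 40, 42, 26, 66]
```

This version does not hard-code the row width, and it never materializes more than one row plus the running totals.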
Using zip (note that zip(*l1) unpacks every row at once, so this suits data that fits in memory).
Example:
l1 = [
    [1, 2, 3, 4, 5, 6],
    [2, 4, 6, 8, 10, 12],
    [3, 5, 8, 6, 1, 19],
    [5, 9, 1, 9, 4, 6]
]
l2 = [
    [5, 4, 3, 2, 1, 9],
    [6, 3, 8, 1, 1, 6],
    [1, 3, 2, 8, 2, 3],
    [5, 2, 9, 4, 2, 5]
]
l1 = (sum(i) for i in zip(*l1))
l2 = (sum(i) for i in zip(*l2))
print( [sum(i) for i in zip(l1, l2)] )
Output:
[28, 32, 40, 42, 26, 66]
Using reduce, since the rows of a numpy array can be added directly.
reduce is a built-in function in Python 2; in Python 3 it must be imported from functools.
import numpy as np
from functools import reduce  # required in Python 3
def sumup(one_row, another_row):
    return one_row + another_row
test_list = np.array([[1, 2, 3, 4, 5, 6],
                      [2, 4, 6, 8, 10, 12],
                      [3, 5, 8, 6, 1, 19],
                      [5, 9, 1, 9, 4, 6]])
reduce(sumup, test_list)
Output
array([11, 20, 18, 27, 20, 43])
using numpy.sum
import numpy as np
l1 = np.sum([[1, 2, 3, 4, 5, 6], [2, 4, 6, 8, 10, 12], [3, 5, 8, 6, 1, 19], [5, 9, 1, 9, 4, 6]], axis=0)
l2 = np.sum([[5, 4, 3, 2, 1, 9],[6, 3, 8, 1, 1, 6], [1, 3, 2, 8, 2, 3],[5, 2, 9, 4, 2, 5]], axis=0)
print(l1 + l2)
Output
[28 32 40 42 26 66]

How do I sum data from certain columns and rows in a dataframe?

I have a bunch of matrices that I stored in a big dataframe. Let's say here is my dataframe.
data = pd.DataFrame([[13, 1, 3, 4, 0, 0], [0, 2, 6, 2, 0, 0], [3, 1, 5, 2, 2, 0], [0, 0, 10, 11, 6, 0], [5, 5, 21, 25, 41, 0],
[11, 1, 3, 2, 0, 1], [3, 1, 7, 3, 1, 1], [1, 1, 6, 5, 3, 1], [1, 1, 6, 7, 6, 1], [6, 6, 21, 24, 42, 1],
[17, 1, 7, 0, 0, 2], [1, 1, 6, 1, 1, 2], [2, 4, 6, 2, 1, 2], [0, 2, 11, 7, 8, 2], [5, 6, 17, 16, 46, 2],
[11, 1, 10, 2, 1, 3], [2, 2, 7, 1, 1, 3], [0, 0, 14, 4, 1, 3], [0, 0, 7, 7, 5, 3], [5, 1, 20, 18, 48, 3],
[16, 3, 7, 1, 2, 4], [1, 2, 4, 1, 0, 4], [2, 4, 7, 5, 3, 4], [3, 0, 4, 4, 7, 4], [7, 2, 13, 12, 58, 4]],
columns=['1', '2', '3', '4', '5', 'iteration'])
print(pd.DataFrame(data))
Each data['iteration'] is a matrix on its own. So, as you can see there are 5 matrices here (iteration-0 to 4). I want to add them all, like in basic matrix addition, to get one single matrix.
I have tried the following, but there's something wrong with it. It doesn't work.
matrix = data[['1','2','3','4','5']]
print(np.sum([matrix[matrix_list['iteration']==i] for i in range(0,9)], axis=0))
How do I do this the right way?
You can use:
In [98]: d = data.set_index('iteration')
In [99]: sum(d.loc[i].values for i in d.index.drop_duplicates().values)
Out[99]:
array([[ 68, 7, 30, 9, 3],
[ 7, 8, 30, 8, 3],
[ 8, 10, 38, 18, 10],
[ 4, 3, 38, 36, 32],
[ 28, 20, 92, 95, 235]])
Or alternatively, use groupby():
sum(e[1].iloc[:, :-1].values for e in data.groupby('iteration'))
array([[ 68, 7, 30, 9, 3],
[ 7, 8, 30, 8, 3],
[ 8, 10, 38, 18, 10],
[ 4, 3, 38, 36, 32],
[ 28, 20, 92, 95, 235]])
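Another possible approach, sketched here under the assumption that every iteration holds the same number of rows in the same order, is to number the rows within each iteration with `groupby().cumcount()` and then sum by that position. The small frame below is made up to keep the example short; it is not the data from the question.

```python
import pandas as pd

# Hypothetical data: two 2x2 matrices stacked, tagged by 'iteration'
data = pd.DataFrame(
    [[1, 2, 0], [3, 4, 0],
     [10, 20, 1], [30, 40, 1]],
    columns=["1", "2", "iteration"],
)

# Position of each row inside its own iteration: 0, 1, 0, 1
pos = data.groupby("iteration").cumcount()

# Rows that share a position are added together across iterations
total = data.drop(columns="iteration").groupby(pos).sum()
print(total)
```

The result stays a DataFrame with the original column labels, which can be convenient if the summed matrix is used in further pandas operations.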

Python Matrix Multiplication Variations

I just asked a question about multiplying matrices, and I have one more question on the same topic. Say I have the following matrices:
matrix_a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
matrix_b = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
How could I get a result like this:
[[1, 4, 9], [16, 25, 36], [49, 64, 81]]
...so that each element is basically being multiplied by the single corresponding element of the other array. Does anyone know how to do that?
Thanks guys!
You could express the element-wise product (and matrix product) using list comprehensions, zip, and the * argument-unpacking operator:
matrix_a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
matrix_b = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
elementwise_product = [[ai*bi for ai, bi in zip(*rows)]
                       for rows in zip(matrix_a, matrix_b)]
print(elementwise_product)
# [[1, 4, 9], [16, 25, 36], [49, 64, 81]]
matrix_product = [[sum([ai*bi for ai, bi in zip(row_a, col_b)])
                   for col_b in zip(*matrix_b)]
                  for row_a in matrix_a]
print(matrix_product)
# [[30, 36, 42], [66, 81, 96], [102, 126, 150]]
The numpy package provides an array object that can do both element-wise and matrix-wise calculations:
import numpy as np
matrix_a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
matrix_b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
matrix_a*matrix_b
np.dot(matrix_a, matrix_b)
This outputs:
array([[ 1, 4, 9],
[16, 25, 36],
[49, 64, 81]])
array([[ 30, 36, 42],
[ 66, 81, 96],
[102, 126, 150]])
Numpy is available using pip install numpy or by using one of the numerical python distributions such as anaconda or pythonxy.
Since the two matrices are equal, you can just multiply the matrix by itself. Here is a slightly verbose way to iterate over the matrix and store the result in a new one.
matrix = [[1,2,3],[4,5,6],[7,8,9]]
result_matrix = [[],[],[]]
print (matrix)
for i in range(0, len(matrix)):
    for j in range(0,len(matrix[i])):
        result_matrix[i].append(matrix[i][j] * matrix[i][j])
print (result_matrix)
Output
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
[[1, 4, 9], [16, 25, 36], [49, 64, 81]]
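For two matrices that are not identical, the same idea can be wrapped in a small helper. This is a minimal sketch in plain Python; the function name and the shape check are illustrative additions, not part of the answers above.

```python
def elementwise_product(a, b):
    """Multiply two equally shaped nested lists element by element."""
    # Reject inputs whose row counts or row lengths differ
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

matrix_a = [[1, 2, 3], [4, 5, 6]]
matrix_b = [[10, 20, 30], [40, 50, 60]]
product = elementwise_product(matrix_a, matrix_b)
print(product)  # [[10, 40, 90], [160, 250, 360]]
```

The explicit check matters because `zip` silently stops at the shorter input, which would otherwise hide a shape mismatch.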

From argwhere to where?

Is there a fast way of getting the output of argwhere in the format that where returns?
Let me show you what I'm doing with a bit of code:
In [123]: filter = np.where(scores[:,:,:,4,:] > 21000)
In [124]: filter
Out[124]:
(array([ 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 23, 23, 23, 23, 23]),
array([13, 13, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5]),
array([0, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2]),
array([44, 44, 0, 1, 2, 3, 6, 8, 12, 14, 22, 31, 58, 76, 82, 41]))
In [125]: filter2 = np.argwhere(scores[:,:,:,4,:] > 21000)
In [126]: filter2
Out[126]:
array([[ 2, 13, 0, 44],
[ 2, 13, 1, 44],
[ 4, 4, 3, 0],
[ 4, 4, 3, 1],
[ 4, 4, 3, 2],
[ 4, 4, 3, 3],
[ 4, 4, 3, 6],
[ 4, 4, 3, 8],
[ 4, 4, 3, 12],
[ 4, 4, 3, 14],
[ 4, 4, 3, 22],
[23, 4, 2, 31],
[23, 4, 2, 58],
[23, 4, 2, 76],
[23, 4, 2, 82],
[23, 5, 2, 41]])
In [150]: scores[:,:,:,4,:][filter]
Out[150]:
array([ 21344., 21344., 24672., 24672., 24672., 24672., 25232.,
25232., 25232., 25232., 24672., 21152., 21152., 21152.,
21152., 21344.], dtype=float16)
In [129]: filter2[np.argsort(scores[:,:,:,4,:][filter])]
Out[129]:
array([[23, 4, 2, 31],
[23, 4, 2, 58],
[23, 4, 2, 76],
[23, 4, 2, 82],
[ 2, 13, 0, 44],
[ 2, 13, 1, 44],
[23, 5, 2, 41],
[ 4, 4, 3, 0],
[ 4, 4, 3, 1],
[ 4, 4, 3, 2],
[ 4, 4, 3, 3],
[ 4, 4, 3, 22],
[ 4, 4, 3, 6],
[ 4, 4, 3, 8],
[ 4, 4, 3, 12],
[ 4, 4, 3, 14]])
Out[129] is my desired output, so my code works, but I'm trying to make it as fast as possible. Should I get filter2 with np.array(filter).transpose()? Is there something even better?
Edit, trying to put it more clearly: I want a list of indices ordered by the value they return when applied to an array. To do that, I need both the output of np.where and np.argwhere, and I'm wondering what is the fastest way to switch from one output to the other, or if there's another way of getting my result.
Look at the code for argwhere:
return transpose(asanyarray(a).nonzero())
while where docs say:
where(condition, [x, y])
If only condition is given, return condition.nonzero().
In effect, both use a.nonzero(). One uses it as is, the other transposes it.
In [933]: x=np.zeros((2,3),int)
In [934]: x[[0,1,0],[0,1,2]]=1
In [935]: x
Out[935]:
array([[1, 0, 1],
[0, 1, 0]])
In [936]: x.nonzero()
Out[936]: (array([0, 0, 1], dtype=int32), array([0, 2, 1], dtype=int32))
In [937]: np.where(x) # same as nonzero()
Out[937]: (array([0, 0, 1], dtype=int32), array([0, 2, 1], dtype=int32))
In [938]: np.argwhere(x)
Out[938]:
array([[0, 0],
[0, 2],
[1, 1]], dtype=int32)
In [939]: np.argwhere(x).T
Out[939]:
array([[0, 0, 1],
[0, 2, 1]], dtype=int32)
argwhere().T holds the same indices as where, except in a 2d array rather than a tuple of arrays.
np.transpose(filter) and np.array(filter).T look equally good. For a large array the time spent in nonzero is much larger than the time spent on these transformations.
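Putting this together, the value-ordered index list can be produced from a single `nonzero` call. This is a minimal sketch on a small made-up array rather than the `scores` array from the question; the variable names are illustrative.

```python
import numpy as np

scores = np.array([[5, 30, 10],
                   [40, 1, 20]])

# Tuple of index arrays, the same form np.where(condition) returns
idx = np.nonzero(scores > 8)

# Order the selected positions by the values they point at
order = np.argsort(scores[idx], kind="stable")

# Transposing the tuple gives argwhere-style rows; reorder them by value
sorted_rows = np.transpose(idx)[order]
print(sorted_rows)
```

Here the selected values are 30, 10, 40, 20, so the rows come out as (0, 2), (1, 2), (0, 1), (1, 0), and the condition is only evaluated once.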
