Filter dictionary whose values are arrays - python

I have data which looks like this:
features_dict = {
'feat1': np.array([[0,1],[2,3],[4,5]]),
'feat2': np.array([[6,7],[8,9],[10,11]]),
'feat3': np.array([1, 0, 0]),
'feat4': np.array([[1],[2],[1]])
}
I want to filter the values of the above dictionary along the first dimension, keeping only the indices where the feat3 values are 0. Hence, the output I'm looking for is:
features_dict = {
'feat1': np.array([[2,3],[4,5]]),
'feat2': np.array([[8,9],[10,11]]),
'feat3': np.array([0, 0]),
'feat4': np.array([[2],[1]])
}
Notice that I want to have only the 2nd and 3rd elements of each dict value because that's where feat3 values are 0.
Initially, I was thinking of converting the dict to a pandas DataFrame and filtering the rows using .loc, but it turned out that pandas can't accept these arrays.
Can anyone please help? Thanks

import numpy as np
features_dict = {
'feat1': np.array([[0,1],[2,3],[4,5]]),
'feat2': np.array([[6,7],[8,9],[10,11]]),
'feat3': np.array([1, 0, 0]),
'feat4': np.array([[1],[2],[1]])
}
ind = features_dict['feat3'] == 0
features_dict = {k: v[ind] for k,v in features_dict.items()}
After filtering:
{
'feat1': array([[2, 3],[4, 5]]),
'feat2': array([[ 8, 9],[10, 11]]),
'feat3': array([0, 0]),
'feat4': array([[2],[1]])
}
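The same mask-based filtering can be wrapped in a small helper. This is only a sketch: the name filter_by_key and its parameters are made up for illustration, and it assumes every array in the dictionary has the same length along its first axis.

```python
import numpy as np

def filter_by_key(features, key, value):
    """Keep only the first-axis entries where features[key] == value."""
    mask = features[key] == value
    # Assumption: every array has as many rows as the mask has entries.
    for name, arr in features.items():
        if arr.shape[0] != mask.shape[0]:
            raise ValueError(f"{name!r} has {arr.shape[0]} rows, expected {mask.shape[0]}")
    return {name: arr[mask] for name, arr in features.items()}

features_dict = {
    'feat1': np.array([[0, 1], [2, 3], [4, 5]]),
    'feat2': np.array([[6, 7], [8, 9], [10, 11]]),
    'feat3': np.array([1, 0, 0]),
    'feat4': np.array([[1], [2], [1]]),
}
filtered = filter_by_key(features_dict, 'feat3', 0)
```

The length check is optional; without it, an array of a different length would raise an IndexError from NumPy instead.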

Related

Comparing two dimensional arrays to one another

I want to write code that outputs the rows that the arrays a, b and c have in common. I will be comparing b and c to a. For example, [ 0, 1624580882] exists in both a and c, and so on. Both columns must be equal for a row to count as a match.
import numpy as np
a= np.array([[ 0, 1624580882],
[ 1, 1624584458],
[ 0, 1624589467],
[ 1, 1624592213],
[ 0, 1624595336],
[ 1, 1624596349]])
b= np.array([[ 1, 1624580882],
[ 1, 1624584460],
[ 1, 1624595336],
[ 1, 1624596349]])
c = np.array([[ 0, 1624580882],
[ 1, 1624584458],
[ 0, 1624589495],
[ 1, 1624592238],
[ 0, 1624595336],
[ 1, 1624596349]])
Expected Output:
b comparison
Similarities= None
c comparison
Similarities= [ 0, 1624580882],[ 1, 1624584464], [ 0, 1624595350],[ 1, 1624596380]
I'm not giving you the complete solution, but here is a simple function to start from. You can design the rest of your code around it.
def compare_arrays(arr_1, arr_2):
    result = []
    for row in arr_1:
        result.append(row in arr_2)
    return result
Edit:
For getting the index of the duplicate values.
from numpy.lib import recfunctions as rfn
ndtype = [('a', int)]
a = np.ma.array([1, 1, 1, 2, 2, 3, 3],mask=[0, 0, 1, 0, 0, 0, 1]).view(ndtype)
rfn.find_duplicates(a, ignoremask=True, return_index=True)
Not the most beautiful solution, but the first thing that comes to mind:
result = []
for row in a:
    for irow in c:
        if np.all(np.equal(row, irow)):
            result.append(row)
            break
I note that the solution proposed by Fatin Ishrak Rafi does not work. The in operator on a NumPy array returns True as soon as any single element matches, so a row that is not in c is still reported as present. For example:
>>> [0, 1624589467] in c
True
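The nested loop above can also be written with broadcasting. The following is a minimal sketch, not a tuned implementation; the helper name common_rows is made up for illustration, and it assumes both arrays are 2-D with the same number of columns.

```python
import numpy as np

a = np.array([[0, 1624580882], [1, 1624584458], [0, 1624589467],
              [1, 1624592213], [0, 1624595336], [1, 1624596349]])
b = np.array([[1, 1624580882], [1, 1624584460],
              [1, 1624595336], [1, 1624596349]])
c = np.array([[0, 1624580882], [1, 1624584458], [0, 1624589495],
              [1, 1624592238], [0, 1624595336], [1, 1624596349]])

def common_rows(x, y):
    """Return the rows of x that also appear, column for column, in y."""
    # Compare every row of x with every row of y: shape (len(x), len(y), n_cols).
    equal = x[:, None, :] == y[None, :, :]
    # A row of x matches only if all of its columns equal those of some row of y.
    mask = equal.all(axis=2).any(axis=1)
    return x[mask]

rows_ac = common_rows(a, c)
rows_ab = common_rows(a, b)
```

With the arrays from the question this keeps four rows of a when compared to c, and the single row [1, 1624596349] when compared to b. The intermediate boolean array grows with len(x) * len(y), so the explicit loop may be preferable for very large inputs.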

Create stack of arrays from diagonal values using numpy

I'm trying to do some matrix calculations in python and came across a problem when I tried to speed up my code using stacked arrays instead of simple for loops. I need to create a 2D-array with values (given as 1D-array) on the diagonal, but couldn't figure out a smart way to do it with stacked arrays.
In the old (loop) version, I used the np.diag() method, which returns exactly what I need (a 2D-array in that case) if I give the values as 1D-array as input. However, when I switched to stacked arrays my input is not a 1D-array anymore, so that the np.diag() method returns a copy of the diagonal of my 2D-input instead.
Old version with 1D input:
import numpy as np
vals = np.array([1,2,3])
mat = np.diag(vals)
print(mat.shape)
Out: (3, 3)
New version with 2D input:
vals_stack = np.repeat(np.expand_dims(vals, axis=0), 5, axis=0)
# btw: is there a better way to repeat/stack my array?
mat_stack = np.diag(vals_stack)
print(mat_stack.shape)
Out: (3,)
So you can see that np.diag() returns a 1D-array (as expected from the documentation), but I actually need a stack of 2D-arrays. So the shape of mat_stack must be (5,3,3) and not (3,). Is there any function for that in numpy? Or do I have to loop over that additional dimension like this:
def mydiag(stack):
    diag = np.zeros([stack.shape[0], stack.shape[1], stack.shape[1]])
    for i in np.arange(stack.shape[0]):
        # pass the 1D row itself; wrapping it in a list would make np.diag extract a diagonal instead
        diag[i,:,:] = np.diag(stack[i,:])
    return diag
In numpy you can use apply_along_axis. There is even an example at the end of its documentation for your specific case. So the answer is:
np.apply_along_axis(np.diag, -1, vals_stack)
A more pythonic way would be something like this:
[np.diag(row) for row in vals_stack]
Is this what you had in mind:
In [499]: x = np.arange(12).reshape(4,3)
In [500]: X = np.zeros((4,3,3),int)
In [501]: X[np.arange(4)[:,None],np.arange(3), np.arange(3)] = x
In [502]: X
Out[502]:
array([[[ 0, 0, 0],
[ 0, 1, 0],
[ 0, 0, 2]],
[[ 3, 0, 0],
[ 0, 4, 0],
[ 0, 0, 5]],
[[ 6, 0, 0],
[ 0, 7, 0],
[ 0, 0, 8]],
[[ 9, 0, 0],
[ 0, 10, 0],
[ 0, 0, 11]]])
X[0,np.arange(3), np.arange(3)] indexes the diagonal on the first plane. np.arange(4)[:,None] is a (4,1) array, which broadcasts with a (3,) to index a (4,3) block, matching the size of x.
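The indexing trick above can be applied to the vals_stack from the question. This is a sketch under the assumption that the input is a 2-D array of shape (n, k); the function name stacked_diag is made up for illustration, and np.tile is used here only as one possible way to build the repeated input.

```python
import numpy as np

def stacked_diag(stack):
    """Turn an (n, k) array into an (n, k, k) stack of diagonal matrices."""
    n, k = stack.shape
    out = np.zeros((n, k, k), dtype=stack.dtype)
    idx = np.arange(k)
    # The slice covers every plane; (idx, idx) addresses each plane's diagonal.
    out[:, idx, idx] = stack
    return out

vals = np.array([1, 2, 3])
vals_stack = np.tile(vals, (5, 1))  # shape (5, 3), five copies of vals
mat_stack = stacked_diag(vals_stack)
print(mat_stack.shape)  # (5, 3, 3)
```

Using a plain slice for the first axis avoids building the (n, 1) index array, since the assignment target already has shape (n, k).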

How to find the number of rows which satisfy different conditions on two numpy arrays

Let's say we have two numpy arrays, a = [4, 5, 8, 10, 4, 8, 4]
and b = [1, 0, 1, 1, 1, 0, 0].
We have to find the number of rows in which the element of the first array is 4 and the element of the second array is 1.
4,1 5,0 8,1 10,1 4,1 8,0 4,0
Here the answer is 2, since there are two rows in which the first element is 4 and the second is 1.
You should use something like
np.sum((a == 4) & (b == 1))
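As a runnable sketch of the one-liner above: it relies on a and b being NumPy arrays (plain Python lists would not compare elementwise), and the parentheses are needed because & binds more tightly than ==. The variable names mask and count are only illustrative.

```python
import numpy as np

a = np.array([4, 5, 8, 10, 4, 8, 4])
b = np.array([1, 0, 1, 1, 1, 0, 0])

# Elementwise conditions combined with a logical AND.
mask = (a == 4) & (b == 1)
# Each True counts as one matching row.
count = int(np.count_nonzero(mask))
print(count)  # 2
```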
You can do this with basic Python:
import numpy as np
a = np.array([4, 5, 8, 10, 4, 8, 4])
b = np.array([1, 0, 1, 1, 1, 0, 0])
new_pair = []
for a_value, b_value in zip(a,b):
    if a_value==4 and b_value==1:
        new_pair.append([a_value,b_value])
print( len(new_pair) )
I hope this helps.
The loop pairs up the two arrays element by element and keeps only the pairs that match.
Have you tried the isin() method in pandas?
import pandas as pd
df = pd.DataFrame({'List_1': a, 'List_2':b})
# keep only the rows where List_1 is 4 and List_2 is 1
matches = df.loc[df['List_1'].isin([4]) & df['List_2'].isin([1])]
print(len(matches))
# len(matches) is the number of rows you need
Hope this helps :))

Map a Numpy array into a list of characters

Given a two dim numpy array:
a = array([[-1, -1],
[-1, 1],
[ 1, 1],
[ 1, 1],
[ 1, 0],
[ 0, -1],
[-1, 0],
[ 0, -1],
[-1, 0],
[ 0, 1],
[ 1, 1],
[ 1, 1]])
and a dictionary of conversions:
d = {-1:'a', 0:'b', 1:'c'}
How can I map the original array into a list of character combinations?
What I need is the following list (or array)
out_put = ['aa', 'ac', 'cc', 'cc', 'cb', 'ba', ....]
(I am doing some machine learning classification and my classes are labeled by the combination of -1, 0, 1 and I need to convert the array of 'labels' into something readable, such as 'aa', 'bc' and so on).
If there is a simple function (binarizer, or one-hot-encoding) within the sklearn package which can convert the original numpy array into a set of labels, that would be perfect!
Here's another approach with list comprehension:
my_dict = {-1:'a', 0:'b', 1:'c'}
out_put = ["".join([my_dict[val] for val in row]) for row in a]
I think you ought to be able to do this via a list comprehension:
# naming something `dict` is a bad idea
d = {-1:'a', 0:'b', 1:'c'}
out_put = ['%s%s' % (d[x], d[y]) for x, y in a]
I think the following is very readable:
def switch(row):
    dic = {
        -1:'a',
        0:'b',
        1:'c'
    }
    return dic.get(row)
out_put = [switch(x)+switch(y) for x,y in a]
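The dictionary lookups above can also be replaced by indexing into a small lookup array. This is a sketch that assumes the labels are exactly the consecutive integers -1, 0 and 1, so that a + 1 gives valid positions 0, 1 and 2; the names lookup and chars are made up for illustration.

```python
import numpy as np

a = np.array([[-1, -1], [-1, 1], [1, 1], [1, 1], [1, 0], [0, -1],
              [-1, 0], [0, -1], [-1, 0], [0, 1], [1, 1], [1, 1]])

# Position 0 stands for -1, position 1 for 0, position 2 for 1.
lookup = np.array(['a', 'b', 'c'])
# Shift the labels to 0..2 and use them as indices; chars has the same shape as a.
chars = lookup[a + 1]
# Join the characters of each row into one string.
out_put = [''.join(row) for row in chars]
print(out_put)
```

If the labels were not consecutive integers, the dictionary-based versions would be the safer choice.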

Calculate percentage of count for a list of arrays

Simple problem, but I cannot seem to get it to work. I want to calculate the percentage a number occurs in a list of arrays and output this percentage accordingly.
I have a list of arrays which looks like this:
import numpy as np
# Create some data
listvalues = []
arr1 = np.array([0, 0, 2])
arr2 = np.array([1, 1, 2, 2])
arr3 = np.array([0, 2, 2])
listvalues.append(arr1)
listvalues.append(arr2)
listvalues.append(arr3)
listvalues
>[array([0, 0, 2]), array([1, 1, 2, 2]), array([0, 2, 2])]
Now I count the occurrences using collections, which returns a list of collections.Counter:
import collections
counter = []
for i in range(len(listvalues)):
    counter.append(collections.Counter(listvalues[i]))
counter
>[Counter({0: 2, 2: 1}), Counter({1: 2, 2: 2}), Counter({0: 1, 2: 2})]
The result I am looking for is an array with 3 columns, representing the values 0 to 2, and len(listvalues) rows. Each cell should be filled with the percentage of that value occurring in the array:
# Result
66.66 0 33.33
0 50 50
33.33 0 66.66
So 0 makes up 66.66% of array 1, 0% of array 2 and 33.33% of array 3, and so on.
What would be the best way to achieve this?
Many thanks!
Here's an approach -
# Get lengths of each element in input list
lens = np.array([len(item) for item in listvalues])
# Form group ID array to ID elements in flattened listvalues
ID_arr = np.repeat(np.arange(len(lens)),lens)
# Extract all values and count them, using each (group ID, value) pair as a flat index
vals = np.concatenate(listvalues)
out_shp = [ID_arr.max()+1,vals.max()+1]
# minlength keeps the reshape valid even if the last array lacks the largest value
counts = np.bincount(ID_arr*out_shp[1] + vals, minlength=out_shp[0]*out_shp[1])
# Finally get the percentages by dividing by the group lengths
out = 100*np.true_divide(counts.reshape(out_shp),lens[:,None])
Sample run with an additional fourth array in input list -
In [316]: listvalues
Out[316]: [array([0, 0, 2]),array([1, 1, 2, 2]),array([0, 2, 2]),array([4, 0, 1])]
In [317]: print(out)
[[ 66.66666667 0. 33.33333333 0. 0. ]
[ 0. 50. 50. 0. 0. ]
[ 33.33333333 0. 66.66666667 0. 0. ]
[ 33.33333333 33.33333333 0. 0. 33.33333333]]
The numpy_indexed package has a utility function for this, called count_table, which can be used to solve your problem efficiently as such:
import numpy_indexed as npi
arrs = [arr1, arr2, arr3]
idx = [np.ones(len(a))*i for i, a in enumerate(arrs)]
(rows, cols), table = npi.count_table(np.concatenate(idx), np.concatenate(arrs))
table = table / table.sum(axis=1, keepdims=True)
print(table * 100)
You can get a sorted list of all values and then simply iterate over the individual arrays to get the percentages:
values = sorted(set(y for row in listvalues for y in row))
print([[(a==x).sum()*100.0/len(a) for x in values] for a in listvalues])
You can create a list with the percentages with the following code:
percentage_list = [((counter[i].get(j) if counter[i].get(j) else 0)*10000)//len(listvalues[i])/100.0 for i in range(len(listvalues)) for j in range(3)]
After that, create a NumPy array from that list:
results = np.array(percentage_list)
Reshape it to get the desired layout:
results = results.reshape(3,3)
This should give you what you wanted.
This is most likely not the most efficient or the best way to do this, but it works.
Do not hesitate to ask if you have any questions.
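I would like to use a functional style to solve this problem. Note that in the result below each row corresponds to a value and each column to an array, and the entries are fractions rather than percentages. For example: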
>>> import numpy as np
>>> import pprint
>>>
>>> arr1 = np.array([0, 0, 2])
>>> arr2 = np.array([1, 1, 2, 2])
>>> arr3 = np.array([0, 2, 2])
>>>
>>> arrays = (arr1, arr2, arr3)
>>>
>>> u = np.unique(np.hstack(arrays))
>>>
>>> result = [[1.0 * c.get(uk, 0) / l
... for l, c in ((len(arr), dict(zip(*np.unique(arr, return_counts=True))))
... for arr in arrays)] for uk in u]
>>>
>>> pprint.pprint(result)
[[0.6666666666666666, 0.0, 0.3333333333333333],
[0.0, 0.5, 0.0],
[0.3333333333333333, 0.5, 0.6666666666666666]]
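For the layout asked for in the question (one row per array, one column per value, entries in percent), a per-array np.bincount is another option. This is a minimal sketch that assumes every array is non-empty and holds only non-negative integers; the name n_values is made up for illustration.

```python
import numpy as np

listvalues = [np.array([0, 0, 2]), np.array([1, 1, 2, 2]), np.array([0, 2, 2])]

# Number of columns: one per value from 0 up to the largest value seen.
n_values = max(arr.max() for arr in listvalues) + 1
# Count each value per array, padded to n_values, then scale by the array length.
out = np.array([100.0 * np.bincount(arr, minlength=n_values) / len(arr)
                for arr in listvalues])
print(np.round(out, 2))
```

The minlength argument pads each count vector to the same width, so the rows can be stacked into one 2-D array even when an array does not contain the largest value.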
