Given three lists, e.g.
a = [0.4, 0.6, 0.8]
b = [0.3, 0.2, 0.5]
c = [0.1, 0.6, 0.12]
I want to generate a confusion matrix, which essentially applies a function (e.g. the correlation) between each of the combinations of the lists.
Essentially the calculations then look like this:
confusion_matrix = np.array([
[1,
scipy.stats.pearsonr(a, b)[0],
scipy.stats.pearsonr(a, c)[0]],
[scipy.stats.pearsonr(b, a)[0],
1,
scipy.stats.pearsonr(b, c)[0]],
[scipy.stats.pearsonr(c, a)[0],
scipy.stats.pearsonr(c, b)[0],
1]
])
Does a Python function exist that can generate such a matrix automatically, without spelling out every element? If it could also generate a heatmap from the matrix, that would be even better.
You can write a list comprehension:
import numpy as np
from scipy.stats import pearsonr
matrix = [a, b, c]
n = len(matrix)  # number of lists, not the length of one list
np.array([
    [1 if i1 == i2 else pearsonr(matrix[i1], matrix[i2])[0]
     for i2 in range(n)] for i1 in range(n)
])
This outputs:
[[ 1. 0.65465367 0.03532591]
[ 0.65465367 1. -0.73233089]
[ 0.03532591 -0.73233089 1. ]]
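For Pearson correlation specifically, NumPy's np.corrcoef builds the whole matrix in one call when the lists are stacked as rows. The sketch below also shows a small helper for arbitrary pairwise functions and a matplotlib heatmap. The names pairwise_matrix, pearson and plot_heatmap are illustrative, not library functions, and matplotlib is assumed to be installed if the plot is wanted.

```python
import numpy as np

a = [0.4, 0.6, 0.8]
b = [0.3, 0.2, 0.5]
c = [0.1, 0.6, 0.12]

# Built-in route: each row is treated as one variable.
corr = np.corrcoef([a, b, c])


def pairwise_matrix(rows, func):
    """Apply func to every ordered pair of rows and return an n x n array."""
    n = len(rows)
    return np.array([[func(rows[i], rows[j]) for j in range(n)]
                     for i in range(n)])


def pearson(u, v):
    """Pearson correlation of two sequences (off-diagonal of the 2 x 2 matrix)."""
    return np.corrcoef(u, v)[0, 1]


# Generic route: works for any function of two sequences.
generic = pairwise_matrix([a, b, c], pearson)


def plot_heatmap(matrix, labels):
    """Draw the matrix as a heatmap; matplotlib is imported only when called."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    image = ax.imshow(matrix, vmin=-1, vmax=1, cmap="coolwarm")
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels)
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels)
    fig.colorbar(image, ax=ax)
    return fig
```

Calling plot_heatmap(corr, ["a", "b", "c"]) followed by matplotlib's show() would display the figure. The generic helper evaluates every pair twice, which is fine for a handful of lists but wasteful for symmetric functions on large inputs.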
I have a data frame that looks like the one below. Notice that the index is not sequential.
df = pd.DataFrame(np.array([[0.1, 0.2, 0.1, 1], [0.4, 0.5, 0, 0], [0.2, 0.4, 0.2, 0], [0.3, 0.1, 0.2, 1], [0.4, 0.2, 0.2, 1]]),
                  columns=['a', 'b', 'c', 'manager'])
df = df.set_index(pd.Index([0, 2, 10, 14, 16], name='id'))
I would like to calculate the cosine distance between each row and those that have 1 in manager (excluding itself), and then take an average and append it to a new column cos_distance. For example, for row0, I will get cosine distance with row 3 and 4 and then take the average. How do I add the condition to restrict it to those with 1 in the manager column only?
I tried running the code below, but it returned an empty list, probably because the indices are not sequential.
from scipy.spatial.distance import cosine as cos
x = df.iloc[:, :3]
manager = df[df['manager'] == 1].iloc[:, :3]
lead_cos = []
for i in range(0):
    person_cos = []
    for j in range(0, len(manager)):
        person_cos.append(cos(x.loc[i], manager.loc[j]))
    lead_cos.append(np.average(person_cos))
lead_cos
Desired output:
This is what I'm trying. I'm not getting the exact values as your desired output, probably because for each "manager" I include itself in the cosine calculation (maybe you need to avoid that too, not sure).
EDIT: I managed to avoid repeating the current manager. However, index 14 gives me a value different from yours. I also included rounding to 2 decimal places.
from scipy.spatial.distance import cosine as cos
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[0.1, 0.2, 0.1, 1], [0.4, 0.5, 0, 0], [0.2, 0.4, 0.2, 0], [0.3, 0.1, 0.2, 1], [0.4, 0.2, 0.2, 1]]),
                  columns=['a', 'b', 'c', 'manager'])
# Non-sequential index, named 'id'
df = df.set_index(pd.Index([0, 2, 10, 14, 16], name='id'))
n = df.shape[0]
x = df.iloc[:, :3]
manager = df[df['manager'] == 1].iloc[:, :3]
n_man = manager.shape[0]
lead_cos = []
for i in range(n):
    person_cos = []
    for j in range(n_man):
        # Skip the row itself when it is also a manager
        if x.index[i] != manager.index[j]:
            # Positional access (iloc) sidesteps the non-sequential index labels
            person_cos.append(cos(x.iloc[i], manager.iloc[j]))
    lead_cos.append(round(np.average(person_cos), 2))
df['lead_cos'] = lead_cos
print(df)
Output:
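The same averages can also be computed without explicit loops. This is a sketch that assumes scipy.spatial.distance.cdist with metric="cosine" is acceptable; the column name cos_distance follows the question, the variable names are illustrative, and the result is left unrounded.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

df = pd.DataFrame(
    [[0.1, 0.2, 0.1, 1], [0.4, 0.5, 0, 0], [0.2, 0.4, 0.2, 0],
     [0.3, 0.1, 0.2, 1], [0.4, 0.2, 0.2, 1]],
    columns=["a", "b", "c", "manager"],
    index=pd.Index([0, 2, 10, 14, 16], name="id"),
)

features = df[["a", "b", "c"]]
managers = features[df["manager"] == 1]

# Cosine distance from every row to every manager row: shape (5, 3).
dist = cdist(features.to_numpy(), managers.to_numpy(), metric="cosine")

# Blank out the entries where a manager would be compared with itself.
is_self = features.index.to_numpy()[:, None] == managers.index.to_numpy()[None, :]
dist[is_self] = np.nan

# Average over the remaining managers, ignoring the blanked entries.
df["cos_distance"] = np.nanmean(dist, axis=1)
```

Comparing index labels rather than positions is what makes the self-exclusion work with a non-sequential index.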
I'm currently writing some code, and I have to extract points from a numpy array.
For example, given [[1,1], [0.6,0.6], [0,0]], each extracted point [x, y] must satisfy x >= 0.5 and y >= 0.5.
I've tried to use numpy extract with the condition arr[0]>=0.5 & arr[1]>=0.5, but that does not seem to work.
It applies the condition to all the elements, and I just want it to apply to the points inside my array.
Thanks in advance!
You can use multiple conditions to slice an array as follows:
import numpy as np
a = np.array([[1, 1] , [0.6, 0.6], [0, 0]])
new = a[(a[:, 0] >= 0.5) & (a[:, 1] >= 0.5)]
Results:
array([[1. , 1. ],
[0.6, 0.6]])
The first condition filters on column 0 and the second condition filters on column 1. Only rows where both conditions are met will be in the results.
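As for why the attempt in the question behaves differently: arr[0] selects the first row (the first point), not the first column, and & binds more tightly than >=, so each comparison needs its own parentheses. The sketch below builds the per-point mask explicitly and also shows np.compress, which selects whole rows along an axis; the variable names are illustrative.

```python
import numpy as np

arr = np.array([[1, 1], [0.6, 0.6], [0, 0]])

# arr[0] is the first point, arr[:, 0] is the x coordinate of every point.
first_point = arr[0]
x_values = arr[:, 0]
y_values = arr[:, 1]

# One boolean per point; the parentheses are required around each comparison.
row_mask = (x_values >= 0.5) & (y_values >= 0.5)

# Boolean indexing keeps the rows where the mask is True.
selected = arr[row_mask]

# np.compress does the same selection along axis 0.
selected_compress = np.compress(row_mask, arr, axis=0)
```

Because the mask has one entry per row, the result keeps the two-column shape of the points.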
I would do it the following way: first, look for the rows fulfilling the condition:
import numpy as np
a = np.array([[1,1] , [0.6,0.6], [0,0]])
rows = np.apply_along_axis(lambda x:x[0]>=0.5 and x[1]>=0.5,1,a)
then use it for indexing:
out = a[rows]
print(out)
output:
[[1. 1. ]
[0.6 0.6]]
It can be solved using a Python list comprehension.
import numpy as np
p = [[1,1] , [0.6,0.6], [0,0]]
result = np.array([x for x in p if x[0] >= 0.5 and x[1] >= 0.5])
You can also try this:
p = np.array(p)
result = p[np.all(p >= 0.5, axis=1)]
I have a tensor as follows:
arr = [[1.5,0.2],[2.3,0.1],[1.3,0.21],[2.2,0.09],[4.4,0.8]]
I would like to group the small arrays whose first elements differ by less than 0.3 and whose second elements differ by less than 0.03.
For example, [1.5,0.2] and [1.3,0.21] should belong to the same category: their first elements differ by 0.2 < 0.3 and their second elements by 0.01 < 0.03.
I want a tensor that looks like this:
arr = {[[1.5,0.2],[1.3,0.21]],[[2.3,0.1],[2.2,0.09]]}
How to do this in tensorflow? Eager mode is ok.
I found a way which is a bit ugly and slow:
samples = np.array([[1.5,0.2],[2.3,0.1],[1.3,0.2],[2.2,0.09],[4.4,0.8],[2.3,0.11]],dtype=np.float32)
ini_samples = samples
samples = tf.split(samples,2,1)
a = samples[0]
b = samples[1]
find_match1 = tf.reduce_sum(tf.abs(tf.expand_dims(a,0) - tf.expand_dims(a,1)),2)
a = tf.logical_and(tf.greater(find_match1, tf.zeros_like(find_match1)),tf.less(find_match1, 0.3*tf.ones_like(find_match1)))
find_match2 = tf.reduce_sum(tf.abs(tf.expand_dims(b,0) - tf.expand_dims(b,1)),2)
b = tf.logical_and(tf.greater(find_match2, tf.zeros_like(find_match2)),tf.less(find_match2, 0.03*tf.ones_like(find_match2)))
x,y = tf.unique(tf.reshape(tf.where(tf.logical_or(a,b)),[1,-1])[0])
r = tf.gather(ini_samples, x)
Does tensorflow have more elegant functions?
You cannot get a result composed of "groups" of vectors with different sizes. Instead, you can make a "group id" tensor that classifies each vector into a group according to your criteria. The part that makes this a bit more complicated is that you have to "fuse" groups with common elements, which I think can only be done with a loop. This code does something like that:
import tensorflow as tf  # written against the TensorFlow 1.x graph API (tf.Session)
def make_groups(correspondences):
    # Replace each True entry with its column index (False entries become 0)
    m = tf.cast(correspondences, tf.int32) * tf.range(tf.shape(correspondences)[0])
    # Pick the largest index for each row
    r = tf.reduce_max(m, axis=1)
    # While loop accounts for transitive correspondences
    # (e.g. if A and B go together and B and C go together, then A, B and C go together)
    # The loop makes sure every element gets the largest common group id
    r_prev = -tf.ones_like(r)
    r, _ = tf.while_loop(lambda r, r_prev: tf.reduce_any(tf.not_equal(r, r_prev)),
                         lambda r, r_prev: (tf.gather(r, r), tf.identity(r)),
                         [r, r_prev])
    # Use unique indices to make sequential group ids starting from 0
    return tf.unique(r)[1]
# Test
with tf.Graph().as_default(), tf.Session() as sess:
    arr = tf.constant([[1.5 , 0.2 ],
                       [2.3 , 0.1 ],
                       [1.3 , 0.21],
                       [2.2 , 0.09],
                       [4.4 , 0.8 ],
                       [1.1 , 0.23]])
    a = arr[:, 0]
    b = arr[:, 1]
    # Both tolerances must hold, as stated in the question
    cond = (tf.abs(a - a[:, tf.newaxis]) < 0.3) & (tf.abs(b - b[:, tf.newaxis]) < 0.03)
    groups = make_groups(cond)
    print(sess.run(groups))
    # [0 1 0 1 2 0]
So in this case, the groups would be:
[1.5, 0.2], [1.3, 0.21] and [1.1, 0.23]
[2.3, 0.1] and [2.2, 0.09]
[4.4, 0.8]
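For comparison, the same transitive grouping can be sketched in plain NumPy, which may be easier to step through than the graph version. This is not the answer's TensorFlow code: group_ids and its tolerance parameters are hypothetical names, and the label propagation below is one simple way to merge chained matches, assuming both tolerances must hold.

```python
import numpy as np


def group_ids(points, tol_first=0.3, tol_second=0.03):
    """Return one group id per point, merging points linked through chains of matches."""
    points = np.asarray(points, dtype=float)
    first, second = points[:, 0], points[:, 1]

    # close[i, j] is True when both coordinates of i and j are within tolerance.
    close = ((np.abs(first[:, None] - first[None, :]) < tol_first)
             & (np.abs(second[:, None] - second[None, :]) < tol_second))

    # Each point repeatedly adopts the largest label among its close neighbours
    # until nothing changes, so every chain ends up sharing one label.
    labels = np.arange(len(points))
    while True:
        new_labels = np.where(close, labels[None, :], -1).max(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

    # Renumber the labels 0, 1, 2, ... in order of first appearance.
    _, first_index, inverse = np.unique(labels, return_index=True, return_inverse=True)
    rank = np.argsort(np.argsort(first_index))
    return rank[inverse]


arr = [[1.5, 0.2], [2.3, 0.1], [1.3, 0.21], [2.2, 0.09], [4.4, 0.8], [1.1, 0.23]]
groups = group_ids(arr)
```

The pairwise mask costs memory quadratic in the number of points, so this suits small inputs; a graph library's connected-components routine would be the usual choice for larger ones.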
For machine learning, I'm applying the Parzen window algorithm.
I have an array (m,n). I would like to check on each row if any of the values is > 0.5 and if each of them is, then I would return 0, otherwise 1.
I would like to know if there is a way to do this without a loop thanks to numpy.
You can use np.all with axis=1 on a boolean array.
import numpy as np
arr = np.array([[0.8, 0.9], [0.1, 0.6], [0.2, 0.3]])
print(np.all(arr>0.5, axis=1))
>> [True False False]
import numpy as np
# Value Initialization
a = np.array([0.75, 0.25, 0.50])
y_predict = np.zeros((1, a.shape[0]))
#If the value is greater than 0.5, the value is 1; otherwise 0
y_predict = (a > 0.5).astype(float)
I have an array (m,n). I would like to check on each row if any of the values is > 0.5
That will be stored in b:
import numpy as np
a = np.random.rand(4, 3)  # placeholder: substitute your own np.array of shape (m, n)
b = np.any(a > 0.5, axis=1)
and if each of them is, then I would return 0, otherwise 1.
I'm assuming you mean 'and if this is the case for all rows'. In this case:
c = 1 - 1 * np.all(b)
c contains your return value, either 0 or 1.
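If the intended result is one flag per row rather than a single number, np.where can turn the boolean row test into 0/1 values. The question is ambiguous between "any" and "all", so this sketch shows both readings; flags_all and flags_any are illustrative names.

```python
import numpy as np

arr = np.array([[0.8, 0.9], [0.1, 0.6], [0.2, 0.3]])

# Row-wise tests on the boolean array arr > 0.5.
all_above = (arr > 0.5).all(axis=1)
any_above = (arr > 0.5).any(axis=1)

# 0 where the row passes the test, 1 otherwise.
flags_all = np.where(all_above, 0, 1)
flags_any = np.where(any_above, 0, 1)
```

For this sample array, flags_all is [0, 1, 1] and flags_any is [0, 0, 1].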
I am working on matrix multiplications in NumPy using np.dot(). As the data set is very large, I would like to reduce the overall run time as far as possible - i.e. perform as little as possible np.dot() products.
Specifically, I need to calculate the overall matrix product as well as the associated flow from each element of my values vector.
Is there a way in NumPy to calculate all of this together in one or two np.dot() products?
In the code below, is there a way to reduce the number of np.dot() products and still get the same output?
import pandas as pd
import numpy as np
vector = pd.DataFrame([1, 2, 3],
['A', 'B', 'C'], ["Values"])
matrix = pd.DataFrame([[0.5, 0.4, 0.1],
[0.2, 0.6, 0.2],
[0.1, 0.3, 0.6]],
index = ['A', 'B', 'C'], columns = ['A', 'B', 'C'])
# Can the number of matrix multiplications in this part be reduced?
overall = np.dot(vector.T, matrix)
from_A = np.dot(vector.T * [1,0,0], matrix)
from_B = np.dot(vector.T * [0,1,0], matrix)
from_C = np.dot(vector.T * [0,0,1], matrix)
print("Overall:", overall)
print("From A:", from_A)
print("From B:", from_B)
print("From C:", from_C)
If the vectors you use to select the row are indeed the unit vectors, you are much better off not doing matrix multiplication at all for from_A, from_B, from_C. Matrix multiplication requires many more additions and multiplications than you need to just multiply each row of the matrix by its corresponding entry in the vector:
from_ABC = matrix.values * vector.values
You will only need a single call to np.dot to get overall.
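Taking this one step further, the overall product is just the column-wise sum of the scaled rows, so under the unit-vector assumption no np.dot call is strictly needed. A minimal sketch with the data from the question; the contributions frame and its from_ labels are illustrative additions.

```python
import numpy as np
import pandas as pd

vector = pd.DataFrame([1, 2, 3], index=["A", "B", "C"], columns=["Values"])
matrix = pd.DataFrame([[0.5, 0.4, 0.1],
                       [0.2, 0.6, 0.2],
                       [0.1, 0.3, 0.6]],
                      index=["A", "B", "C"], columns=["A", "B", "C"])

# Broadcasting (3, 3) * (3, 1): row i of the matrix is scaled by vector entry i.
from_ABC = matrix.values * vector.values

# Summing the per-source flows over the rows gives the overall product.
overall = from_ABC.sum(axis=0)

# Optional: keep the labels so each row reads as "flow from A", and so on.
contributions = pd.DataFrame(from_ABC,
                             index=["from_" + label for label in matrix.index],
                             columns=matrix.columns)
```

Whether this is faster than a single np.dot on a very large matrix depends on the data and the BLAS build, so it is worth timing both on real inputs.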
You could define a 3 x 3 shaped 2D array of those scaling values and perform matrix-multiplication, like so -
scale = np.array([[1,0,0],[0,1,0],[0,0,1]])
from_ABC = np.dot(vector.values.ravel()*scale,matrix)
Sample run -
In [901]: from_A
Out[901]: array([[ 0.5, 0.4, 0.1]])
In [902]: from_B
Out[902]: array([[ 0.4, 1.2, 0.4]])
In [903]: from_C
Out[903]: array([[ 0.3, 0.9, 1.8]])
In [904]: from_ABC
Out[904]:
array([[ 0.5, 0.4, 0.1],
[ 0.4, 1.2, 0.4],
[ 0.3, 0.9, 1.8]])
Here's an alternative with np.einsum to do all those in one step -
np.einsum('ij,ji,ik->jk',vector.values,scale,matrix)
Sample run -
In [915]: np.einsum('ij,ji,ik->jk',vector.values,scale,matrix)
Out[915]:
array([[ 0.5, 0.4, 0.1],
[ 0.4, 1.2, 0.4],
[ 0.3, 0.9, 1.8]])