I am working on matrix multiplications in NumPy using np.dot(). As the data set is very large, I would like to reduce the overall run time as much as possible, i.e. perform as few np.dot() products as possible.
Specifically, I need to calculate the overall matrix product as well as the associated flow from each element of my values vector.
Is there a way in NumPy to calculate all of this together in one or two np.dot() products?
In the code below, is there a way to reduce the number of np.dot() products and still get the same output?
import pandas as pd
import numpy as np
vector = pd.DataFrame([1, 2, 3],
                      ['A', 'B', 'C'], ["Values"])
matrix = pd.DataFrame([[0.5, 0.4, 0.1],
                       [0.2, 0.6, 0.2],
                       [0.1, 0.3, 0.6]],
                      index=['A', 'B', 'C'], columns=['A', 'B', 'C'])
# Can the number of matrix multiplications in this part be reduced?
overall = np.dot(vector.T, matrix)
from_A = np.dot(vector.T * [1,0,0], matrix)
from_B = np.dot(vector.T * [0,1,0], matrix)
from_C = np.dot(vector.T * [0,0,1], matrix)
print("Overall:", overall)
print("From A:", from_A)
print("From B:", from_B)
print("From C:", from_C)
If the vectors you use to select the row are indeed the unit vectors, you are much better off not doing matrix multiplication at all for from_A, from_B, from_C. Matrix multiplication requires many more additions and multiplications than you need just to multiply each row of the matrix by its corresponding entry in the vector:
from_ABC = matrix.values * vector.values
You will only need a single call to np.dot to get overall.
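In fact, since overall is just the column-wise sum of those scaled rows, you could arguably skip np.dot() entirely. A minimal sketch, reusing vector and matrix from the question:
import numpy as np
# Row i of the matrix scaled by entry i of the vector; the rows are
# from_A, from_B and from_C stacked into one array.
from_ABC = matrix.values * vector.values
# overall = vector.T @ matrix equals the column-wise sum of those rows,
# so a single reduction recovers it without another np.dot().
overall = from_ABC.sum(axis=0)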
You could define a 3 x 3 2D array of those scaling values and perform matrix multiplication, like so -
scale = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
from_ABC = np.dot(vector.values.ravel() * scale, matrix)
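(Note that scale here is just the 3 x 3 identity matrix, so np.eye(3) would build the same array of scaling values.)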
Sample run -
In [901]: from_A
Out[901]: array([[ 0.5,  0.4,  0.1]])
In [902]: from_B
Out[902]: array([[ 0.4,  1.2,  0.4]])
In [903]: from_C
Out[903]: array([[ 0.3,  0.9,  1.8]])
In [904]: from_ABC
Out[904]:
array([[ 0.5,  0.4,  0.1],
       [ 0.4,  1.2,  0.4],
       [ 0.3,  0.9,  1.8]])
Here's an alternative with np.einsum to do all of those in one step -
np.einsum('ij,ji,ik->jk', vector.values, scale, matrix)
Sample run -
In [915]: np.einsum('ij,ji,ik->jk', vector.values, scale, matrix)
Out[915]:
array([[ 0.5,  0.4,  0.1],
       [ 0.4,  1.2,  0.4],
       [ 0.3,  0.9,  1.8]])
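A quick consistency check against the original three products (a sketch; it just stacks them and compares):
import numpy as np
stacked = np.vstack([from_A, from_B, from_C])
print(np.allclose(stacked, from_ABC))  # True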
Related
Given three lists, e.g.
a = [0.4, 0.6, 0.8]
b = [0.3, 0.2, 0.5]
c = [0.1, 0.6, 0.12]
I want to generate a confusion matrix, which essentially applies a function (e.g. the correlation) to each pairwise combination of the lists.
Essentially the calculations then look like this:
confusion_matrix = np.array([
    [1,
     scipy.stats.pearsonr(a, b)[0],
     scipy.stats.pearsonr(a, c)[0]],
    [scipy.stats.pearsonr(b, a)[0],
     1,
     scipy.stats.pearsonr(b, c)[0]],
    [scipy.stats.pearsonr(c, a)[0],
     scipy.stats.pearsonr(c, b)[0],
     1]
])
Does a Python function exist that is capable of generating such a matrix automatically, without spelling out every element? If it could also generate a heatmap from the matrix, that would be even better.
You can write a list comprehension (note range(len(matrix)), not range(len(a)); the two coincide here only because there happen to be three lists of length three):
import numpy as np
from scipy.stats import pearsonr
matrix = [a, b, c]
np.array([
    [1 if i1 == i2 else pearsonr(matrix[i1], matrix[i2])[0]
     for i2 in range(len(matrix))]
    for i1 in range(len(matrix))
])
This outputs:
[[ 1. 0.65465367 0.03532591]
[ 0.65465367 1. -0.73233089]
[ 0.03532591 -0.73233089 1. ]]
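For the heatmap part, here is a minimal sketch, assuming matplotlib and seaborn are available. It uses np.corrcoef, which treats each row as a variable and returns the same pairwise Pearson matrix (with 1.0 on the diagonal) in a single call:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
a = [0.4, 0.6, 0.8]
b = [0.3, 0.2, 0.5]
c = [0.1, 0.6, 0.12]
corr = np.corrcoef([a, b, c])  # pairwise Pearson correlations in one call
sns.heatmap(corr, annot=True, xticklabels=['a', 'b', 'c'], yticklabels=['a', 'b', 'c'])
plt.show()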
I have a data frame that looks like the one below. Notice that the index is not sequential.
df = pd.DataFrame(np.array([[0.1, 0.2, 0.1, 1], [0.4, 0.5, 0, 0], [0.2, 0.4, 0.2, 0], [0.3, 0.1, 0.2, 1], [0.4, 0.2, 0.2, 1]]),
                  columns=['a', 'b', 'c', 'manager'])
df = df.set_index([pd.Index([0, 2, 10, 14, 16])], 'id')
I would like to calculate the cosine distance between each row and those that have 1 in manager (excluding itself), and then take the average and append it to a new column cos_distance. For example, for row 0, I will get the cosine distance to rows 3 and 4 and then take the average. How do I add the condition to restrict it to rows with 1 in the manager column only?
I tried running the code below, but it returned an empty list (range(0) never enters the loop, and the non-sequential index would also break the .loc lookups):
from scipy.spatial.distance import cosine as cos
x=df.iloc[:, :3]
manager=df[df['manager']==1].iloc[:, :3]
lead_cos = []
for i in range(0):
    person_cos = []
    for j in range(0, len(manager)):
        person_cos.append(cos(x.loc[i], manager.loc[j]))
    lead_cos.append(np.average(person_cos))
lead_cos
Desired output:
This is what I'm trying. I'm not getting exactly the values in your desired output, probably because for each "manager" I include itself in the cosine calculation (maybe you need to avoid that too, not sure).
EDIT: I managed to avoid repeating the current manager. However, index 14 gives me a value different from yours. I also included rounding to 2 decimal places.
from scipy.spatial.distance import cosine as cos
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[0.1, 0.2, 0.1, 1], [0.4, 0.5, 0, 0], [0.2, 0.4, 0.2, 0], [0.3, 0.1, 0.2, 1], [0.4, 0.2, 0.2, 1]]),
                  columns=['a', 'b', 'c', 'manager'])
df = df.set_index([pd.Index([0, 2, 10, 14, 16])], 'id')
n = df.shape[0]
x = df.iloc[:, :3]
manager = df[df['manager'] == 1].iloc[:, :3]
n_man = manager.shape[0]
lead_cos = []
for i in range(n):
    person_cos = []
    for j in range(n_man):
        # Skip the comparison of a manager row with itself.
        if x.index[i] != manager.index[j]:
            person_cos.append(cos(x.values.tolist()[i], manager.values.tolist()[j]))
    lead_cos.append(round(np.average(person_cos), 2))
df['lead_cos'] = lead_cos
print(df)
Output: (df printed with the new lead_cos column)
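As an aside, the same computation can be vectorized with scipy's cdist, which computes all pairwise cosine distances in one call. A sketch, reusing x, manager and df from above:
from scipy.spatial.distance import cdist
import numpy as np
# All pairwise cosine distances between every row and every manager row.
d = cdist(x.values, manager.values, metric='cosine')
# Mask the self-pairs (rows that are themselves managers), then average.
self_pair = x.index.values[:, None] == manager.index.values[None, :]
df['lead_cos'] = np.round(np.nanmean(np.where(self_pair, np.nan, d), axis=1), 2)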
numpy.random has the following function to generate multinomial random samples.
multinomial(n, p, size)
But I wonder if there is an efficient way to generate multinomial samples for different parameters n and p. For example,
n = np.array([[10],
              [20]])
p = np.array([[0.1, 0.2, 0.7],
              [0.4, 0.4, 0.2]])
and even for higher-dimensional n and p like these:
n = np.array([[[10],
               [20]],
              [[10],
               [20]]])
p = np.array([[[0.1, 0.2, 0.7],
               [0.1, 0.2, 0.7]],
              [[0.3, 0.2, 0.5],
               [0.4, 0.1, 0.5]]])
I know we can do this kind of thing for univariate random variables, but I don't know how to do it for the multinomial in Python.
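For reference, the straightforward baseline for the 2D case is one np.random.multinomial call per (n, p) pair; this is the loop the question hopes to avoid. A minimal sketch:
import numpy as np
n = np.array([[10],
              [20]])
p = np.array([[0.1, 0.2, 0.7],
              [0.4, 0.4, 0.2]])
samples = np.stack([np.random.multinomial(ni, pi) for ni, pi in zip(n.ravel(), p)])
print(samples.shape)  # (2, 3): one length-3 count vector per (n, p) pair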
Say that I have a 2D array ar like this:
0.9, 0.1, 0.3
0.4, 0.5, 0.1
0.5, 0.8, 0.5
And I want to sample from [1, 0] according to this probability array, where each entry x is the probability of drawing 1:
rdchoice = lambda x: numpy.random.choice([1, 0], p=[x, 1-x])
I have tried two methods:
1) reshape it into a 1d array first, apply numpy.random.choice to each element, and then reshape it back to 2d:
np.array(list(map(rdchoice, ar.reshape((-1,))))).reshape(ar.shape)
2) use the vectorize function.
func = numpy.vectorize(rdchoice)
func(ar)
But these two ways are both too slow, and I learned that vectorize is essentially a for-loop under the hood; in my experiments, map was no faster than vectorize.
I think this can be done faster; if the 2d array is large, it is unbearably slow.
You should be able to do this like so:
>>> p = np.array([[0.9, 0.1, 0.3], [0.4, 0.5, 0.1], [0.5, 0.8, 0.5]])
>>> (np.random.rand(*p.shape) < p).astype(int)
This works because each uniform draw in [0, 1) falls below p with probability exactly p, so the comparison yields a 1 in each cell with the desired probability.
Actually I can use np.random.binomial:
import numpy as np
p = [[0.9, 0.1, 0.3],
     [0.4, 0.5, 0.1],
     [0.5, 0.8, 0.5]]
np.random.binomial(1, p)
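The same elementwise broadcast also works with the newer Generator API, if you prefer it (a minor variant, assuming NumPy >= 1.17):
import numpy as np
rng = np.random.default_rng()
sample = rng.binomial(1, p)  # p is broadcast elementwise, one draw per cell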
I'm supposed to normalize an array. I've read about normalization and come across this formula:
normalized_x = (x - min(x)) / (max(x) - min(x))
I wrote the following function for it:
def normalize_list(list):
    max_value = max(list)
    min_value = min(list)
    for i in range(0, len(list)):
        list[i] = (list[i] - min_value) / (max_value - min_value)
That is supposed to normalize an array of elements.
Then I have come across this: https://stackoverflow.com/a/21031303/6209399
Which says you can normalize an array by simply doing this:
def normalize_list_numpy(list):
    normalized_list = list / np.linalg.norm(list)
    return normalized_list
If I normalize this test array test_array = [1, 2, 3, 4, 5, 6, 7, 8, 9] with my own function and with the numpy method, I get these answers:
My own function: [0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0]
The numpy way: [0.059234887775909233, 0.11846977555181847, 0.17770466332772769, 0.23693955110363693, 0.29617443887954614, 0.35540932665545538, 0.41464421443136462, 0.47387910220727386, 0.5331139899831830]
Why do the functions give different answers? Are there other ways to normalize an array of data? What does numpy.linalg.norm(list) do? What am I getting wrong?
There are different types of normalization. You are using min-max normalization. The min-max normalization from scikit-learn is as follows.
import numpy as np
from sklearn.preprocessing import minmax_scale
# your function
def normalize_list(list_normal):
    max_value = max(list_normal)
    min_value = min(list_normal)
    for i in range(len(list_normal)):
        list_normal[i] = (list_normal[i] - min_value) / (max_value - min_value)
    return list_normal

# scikit-learn version
def normalize_list_numpy(list_numpy):
    normalized_list = minmax_scale(list_numpy)
    return normalized_list
test_array = [1, 2, 3, 4, 5, 6, 7, 8, 9]
test_array_numpy = np.array(test_array)
print(normalize_list(test_array))
print(normalize_list_numpy(test_array_numpy))
Output:
[0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0]
[0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0]
MinMaxScaler uses exactly your formula for normalization/scaling:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html
@OuuGiii: NOTE: It is not a good idea to use Python built-in function names as variable names. list() is a Python builtin, so its use as a variable name should be avoided.
The question/answer that you reference doesn't explicitly relate your own formula to the np.linalg.norm(list) version that you use here.
One NumPy solution would be this:
import numpy as np
def normalize(x):
    x = np.asarray(x)
    return (x - x.min()) / np.ptp(x)
print(normalize(test_array))
# [ 0. 0.125 0.25 0.375 0.5 0.625 0.75 0.875 1. ]
Here np.ptp is "peak-to-peak", i.e.
Range of values (maximum - minimum) along an axis.
This approach scales the values to the interval [0, 1], as pointed out by @phg.
The more traditional definition of normalization would be to scale to a 0 mean and unit variance:
x = np.asarray(test_array)
res = (x - x.mean()) / x.std()
print(res.mean(), res.std())
# 0.0 1.0
Or use sklearn.preprocessing.scale as a pre-canned function (sklearn.preprocessing.normalize, by contrast, does unit-norm scaling rather than standardization).
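A minimal sketch of that pre-canned route (assuming scikit-learn is installed):
from sklearn.preprocessing import scale
res = scale(test_array)  # zero mean, unit variance, like the manual version above
print(res.mean(), res.std())  # ~0.0 1.0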
Using test_array / np.linalg.norm(test_array) creates a result that is of unit length; you'll see that np.linalg.norm(test_array / np.linalg.norm(test_array)) equals 1. So you're talking about two different fields here, one being statistics and the other being linear algebra.
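To see this concretely (a quick check using the test array from the question):
import numpy as np
test_array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
unit = test_array / np.linalg.norm(test_array)
print(np.linalg.norm(unit))  # 1.0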
The power of NumPy is its broadcasting property, which allows you to do vectorized array operations without explicit looping. So you do not need to write a function with an explicit for loop, which is slow and time-consuming, especially if your dataset is big.
The pythonic way of doing min-max normalization is
test_array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
normalized_test_array = (test_array - min(test_array)) / (max(test_array) - min(test_array))
output >> [ 0., 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1. ]