What is the derivative of Shannon's Entropy? - python

I have the following simple python function that calculates the entropy of a single input X according to Shannon's Theory of Information:
import numpy as np
def entropy(X:'numpy array'):
    _, frequencies = np.unique(X, return_counts=True)
    probabilities = frequencies/X.shape[0]
    return -np.sum(probabilities*np.log2(probabilities))
a = np.array([1., 1., 1., 3., 3., 2.])
b = np.array([1., 1., 1., 3., 3., 3.])
c = np.array([1., 1., 1., 1., 1., 1.])
print(f"entropy(a): {entropy(a)}")
print(f"entropy(b): {entropy(b)}")
print(f"entropy(c): {entropy(c)}")
With the output being the following:
entropy(a): 1.4591479170272446
entropy(b): 1.0
entropy(c): -0.0
However, I also need to calculate the derivative over dx:
d entropy / dx
This is not an easy task since the main formula
-np.sum(probabilities*np.log2(probabilities))
takes in probabilities, not x values, therefore it is not clear how to differentiate over dx.
Does anyone have an idea on how to do this?

One way to solve this is to use finite differences to compute the derivative numerically.
In this context, we can define a small constant ε as the step size. The function below takes a one-argument function f and approximates its derivative at the input x:
ε = 1e-12
def derivative(f, x):
    return (f(x + ε) - f(x)) / ε
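A step of 1e-12 is close to float64 round-off, which is likely why the numerical results further down agree with the analytical ones to only about four digits. As a minimal sketch of a common alternative, a central difference with a larger step tends to be more accurate. The name central_difference and the step h = 1e-6 are assumptions chosen for illustration, not part of the original answer:

```python
import math

def central_difference(f, x, h=1e-6):
    """Approximate f'(x) with a symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2 * h)

def inner(p):
    # Same innermost term as in the entropy formula
    return p * math.log2(p)

p = 0.5
approx = central_difference(inner, p)
exact = math.log2(p) + 1 / math.log(2)  # product rule result
print(approx, exact)
```

The two printed values should agree to many more digits than the forward difference with ε = 1e-12 does.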
To make our work easier, let us define a function that computes the innermost operation of the entropy:
def inner(x):
    return x * np.log2(x)
Recall that the derivative of the sum is the sum of derivatives. Therefore, the real derivative computation takes place in the inner function we just defined.
So, the numerical derivative of the entropy is:
def numerical_dentropy(X):
    _, frequencies = np.unique(X, return_counts=True)
    probabilities = frequencies / X.shape[0]
    return -np.sum([derivative(inner, p) for p in probabilities])
Can we do better? Of course we can! The key insight here is the product rule: (f g)' = fg' + gf', where f=x and g=np.log2(x). (Also notice that d[log_a(x)]/dx = 1/(x ln(a)).)
So, the analytical derivative of the entropy can be computed as:
import math
def dentropy(X):
    _, frequencies = np.unique(X, return_counts=True)
    probabilities = frequencies / X.shape[0]
    return -np.sum([(1/math.log(2, math.e) + np.log2(p)) for p in probabilities])
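Note that this quantity is the sum over all classes of the partial derivatives of the entropy with respect to each probability, -(log2(p_i) + 1/ln(2)). The same computation can be written without the Python loop. This is only a sketch; the name dentropy_vectorized is hypothetical:

```python
import numpy as np

def dentropy_vectorized(X):
    """Sum of the partial derivatives of the entropy w.r.t. each probability."""
    _, counts = np.unique(X, return_counts=True)
    p = counts / X.shape[0]
    # d/dp [p * log2(p)] = log2(p) + 1/ln(2), applied elementwise
    return -np.sum(np.log2(p) + 1.0 / np.log(2.0))

a = np.array([1., 1., 1., 3., 3., 2.])
print(dentropy_vectorized(a))
```

For the sample vector a this should match the analytical value reported in the output below.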
Using the sample vectors for testing, we have:
a = np.array([1., 1., 1., 3., 3., 2.])
b = np.array([1., 1., 1., 3., 3., 3.])
c = np.array([1., 1., 1., 1., 1., 1.])
print(f"numerical d[entropy(a)]: {numerical_dentropy(a)}")
print(f"numerical d[entropy(b)]: {numerical_dentropy(b)}")
print(f"numerical d[entropy(c)]: {numerical_dentropy(c)}")
print(f"analytical d[entropy(a)]: {dentropy(a)}")
print(f"analytical d[entropy(b)]: {dentropy(b)}")
print(f"analytical d[entropy(c)]: {dentropy(c)}")
Which, when executed, gives us:
numerical d[entropy(a)]: 0.8417710972707937
numerical d[entropy(b)]: -0.8854028621385623
numerical d[entropy(c)]: -1.4428232973189605
analytical d[entropy(a)]: 0.8418398787754222
analytical d[entropy(b)]: -0.8853900817779268
analytical d[entropy(c)]: -1.4426950408889634
As a bonus, we can test whether this is correct with an automatic differentiation library:
import torch
a, b, c = torch.from_numpy(a), torch.from_numpy(b), torch.from_numpy(c)
def torch_entropy(X):
    _, frequencies = torch.unique(X, return_counts=True)
    frequencies = frequencies.type(torch.float32)
    probabilities = frequencies / X.shape[0]
    probabilities.requires_grad_(True)
    return -(probabilities * torch.log2(probabilities)).sum(), probabilities
for v in a, b, c:
    h, p = torch_entropy(v)
    print(f'torch entropy: {h}')
    h.backward()
    print(f'torch derivative: {p.grad.sum()}')
Which gives us:
torch entropy: 1.4591479301452637
torch derivative: 0.8418397903442383
torch entropy: 1.0
torch derivative: -0.885390043258667
torch entropy: -0.0
torch derivative: -1.4426950216293335

Related

using Numpy for Kmean Clustering

I'm new to machine learning and want to build a k-means algorithm with k = 2, but I'm struggling to calculate the new centroids. Here is my code for k-means:
def euclidean_distance(x: np.ndarray, y: np.ndarray):
    # x shape: (N1, D)
    # y shape: (N2, D)
    # output shape: (N1, N2)
    dist = []
    for i in x:
        for j in y:
            new_list = np.sqrt(sum((i - j) ** 2))
            dist.append(new_list)
    distance = np.reshape(dist, (len(x), len(y)))
    return distance
def kmeans(x, centroids, iterations=30):
    assignment = None
    for i in iterations:
        dist = euclidean_distance(x, centroids)
        assignment = np.argmin(dist, axis=1)
        for c in range(len(y)):
            centroids[c] = np.mean(x[assignment == c], 0) #error here
    return centroids, assignment
My inputs are x = [[1., 0.], [0., 1.], [0.5, 0.5]] and y = [[1., 0.], [0., 1.]], and the distance array looks like this:
[[0. 1.41421356]
[1.41421356 0. ]
[0.70710678 0.70710678]]
When I run kmeans(x, y), it raises this error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_40086/2170434798.py in
      5
      6 for c in range(len(y)):
----> 7     centroids[c] = (x[classes == c], 0)
      8     print(centroids)
TypeError: only integer scalar arrays can be converted to a scalar index
Does anyone know how to fix it or improve my code? Thank you in advance!
Changing inputs to NumPy arrays should get rid of errors:
x = np.array([[1., 0.], [0., 1.], [0.5, 0.5]])
y = np.array([[1., 0.], [0., 1.]])
You must also change for i in iterations to for i in range(iterations) in the kmeans function, since an integer is not iterable.
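Putting both fixes together, a corrected version might look like the sketch below. It also loops over len(centroids) rather than the global y, computes distances by broadcasting, and keeps the old centroid when a cluster ends up empty. Those last choices are assumptions of this sketch, not something the question or answer specifies:

```python
import numpy as np

def euclidean_distance(x, y):
    # x: (N1, D), y: (N2, D) -> pairwise distances of shape (N1, N2)
    return np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)

def kmeans(x, centroids, iterations=30):
    x = np.asarray(x, dtype=float)
    # np.array copies, so the caller's centroids are not modified
    centroids = np.array(centroids, dtype=float)
    assignment = None
    for _ in range(iterations):
        dist = euclidean_distance(x, centroids)
        assignment = np.argmin(dist, axis=1)
        for c in range(len(centroids)):
            members = x[assignment == c]
            if len(members) > 0:  # assumption: leave an empty cluster's centroid as is
                centroids[c] = members.mean(axis=0)
    return centroids, assignment

x = [[1., 0.], [0., 1.], [0.5, 0.5]]
y = [[1., 0.], [0., 1.]]
centroids, assignment = kmeans(x, y)
print(centroids)
print(assignment)
```

Because the inputs are converted with np.asarray inside the function, plain Python lists work as well; the boolean mask assignment == c can only index a NumPy array, which is what caused the original TypeError.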

Need help understanding the gradient function in pytorch

The following code
w = np.array([[2., 2.],[2., 2.]])
x = np.array([[3., 3.],[3., 3.]])
b = np.array([[4., 4.],[4., 4.]])
w = torch.tensor(w, requires_grad=True)
x = torch.tensor(x, requires_grad=True)
b = torch.tensor(b, requires_grad=True)
y = w*x + b
print(y)
# tensor([[10., 10.],
# [10., 10.]], dtype=torch.float64, grad_fn=<AddBackward0>)
y.backward(torch.FloatTensor([[1, 1],[ 1, 1]]))
print(w.grad)
# tensor([[3., 3.],
# [3., 3.]], dtype=torch.float64)
print(x.grad)
# tensor([[2., 2.],
# [2., 2.]], dtype=torch.float64)
print(b.grad)
# tensor([[1., 1.],
# [1., 1.]], dtype=torch.float64)
Since the tensor argument passed to the backward function is an all-ones tensor with the shape of the input tensor, my understanding is that:
1. w.grad means the derivative of y w.r.t. w, and produces b,
2. x.grad means the derivative of y w.r.t. x, and produces b, and
3. b.grad means the derivative of y w.r.t. b, and produces all ones.
Of these, only point 3 matches my expected result. Can someone help me understand the first two results? I think I understand the accumulation part, but I don't think that is what is happening here.
To find the correct derivatives in this example, we need to take the sum and product rules into consideration.
Sum rule: (f + g)' = f' + g'
Product rule: (f g)' = fg' + gf'
That means the derivatives of your equation y = w*x + b are calculated as follows.
With respect to x: dy/dx = w
With respect to w: dy/dw = x
With respect to b: dy/db = 1
The gradients reflect exactly that:
torch.equal(w.grad, x) # => True
torch.equal(x.grad, w) # => True
torch.equal(b.grad, torch.tensor([[1, 1], [1, 1]], dtype=torch.float64)) # => True
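As an independent check that does not rely on autograd, the same gradients can be approximated with finite differences in NumPy. Calling y.backward(v) differentiates the scalar sum(v * y), so with v all ones it is the gradient of y.sum(). The helper names weighted_sum and numerical_grad and the step h = 1e-6 are assumptions made for this sketch:

```python
import numpy as np

w = np.full((2, 2), 2.0)
x = np.full((2, 2), 3.0)
b = np.full((2, 2), 4.0)
v = np.ones((2, 2))  # plays the role of the tensor passed to backward()

def weighted_sum(w, x, b, v):
    # Scalar whose gradient y.backward(v) computes: sum of v * y
    return np.sum(v * (w * x + b))

def numerical_grad(f, arr, h=1e-6):
    # Central difference, one entry of arr at a time
    grad = np.zeros_like(arr)
    for idx in np.ndindex(arr.shape):
        up, down = arr.copy(), arr.copy()
        up[idx] += h
        down[idx] -= h
        grad[idx] = (f(up) - f(down)) / (2 * h)
    return grad

grad_w = numerical_grad(lambda t: weighted_sum(t, x, b, v), w)
grad_x = numerical_grad(lambda t: weighted_sum(w, t, b, v), x)
grad_b = numerical_grad(lambda t: weighted_sum(w, x, t, v), b)
print(grad_w, grad_x, grad_b, sep="\n")
```

Up to floating-point error, grad_w should equal x, grad_x should equal w, and grad_b should be all ones.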

Does pytorch do eager pruning of its computational graph?

This is a very simple example:
import torch
x = torch.tensor([1., 2., 3., 4., 5.], requires_grad=True)
y = torch.tensor([2., 2., 2., 2., 2.], requires_grad=True)
z = torch.tensor([1., 1., 0., 0., 0.], requires_grad=True)
s = torch.sum(x * y * z)
s.backward()
print(x.grad)
This will print,
tensor([2., 2., 0., 0., 0.]),
since, of course, ds/dx is zero for the entries where z is zero.
My question is: Is pytorch smart enough to stop the computation when it reaches a zero? Or does it in fact do the calculation "2*5", only to later do "10 * 0 = 0"?
In this simple example it doesn't make a big difference, but in the (bigger) problem I am looking at, this will make a difference.
Thank you for any input.
No, pytorch does no such thing as pruning any subsequent calculations when zero is reached. Even worse, due to how float arithmetic works all subsequent multiplication by zero will take roughly the same time as any regular multiplication.
For some cases there are ways around it though, for example if you want to use a masked loss you can just set the masked outputs to be zero, or detach them from gradients.
This example makes the difference clear:
import time
import torch
def time_backward(do_detach):
    # Create leaf tensors directly instead of wrapping torch.rand in torch.tensor
    x = torch.rand(100000000, requires_grad=True)
    y = torch.rand(100000000, requires_grad=True)
    s2 = torch.sum(x * y)
    s1 = torch.sum(x * y)
    if do_detach:
        # Detaching removes s2 from the graph, so backward skips it
        s2 = s2.detach()
    s = s1 + 0 * s2
    t = time.time()
    s.backward()
    print(time.time() - t)
time_backward(do_detach=False)
time_backward(do_detach=True)
outputs:
0.502875089645
0.198422908783

Expanding tensor using native tensorflow ops

I have a single dimensional data (floats) as shown below:
[-8., 18., 9., -3., 12., 11., -13., 38., ...]
I want to replace each negative element with an equivalent number of zeros.
My result would look something like this for the example above:
[0., 0., 0., 0., 0., 0., 0., 0., 18., 9., 0., 0., 0., 12., ...]
I am able to do this in Tensorflow by using tf.py_func().
But it turns out the graph is not serializable if I use that method.
Are there native tensorflow ops that can help me get the same result?
Not a straightforward task! Here is a pure TensorFlow implementation:
import tensorflow as tf
# Input vector
inp = tf.placeholder(tf.int32, [None])
# Find positive and negative indices
mask = inp < 0
num_inputs = tf.size(inp)
pos_idx, neg_idx = tf.dynamic_partition(tf.range(num_inputs), tf.cast(mask, tf.int32), 2)
# Negative values
negs = -tf.gather(inp, neg_idx)
total_neg = tf.reduce_sum(negs)
cum_neg = tf.cumsum(negs)
# Compute the final index of each positive element
pos_neg_idx = tf.cast(pos_idx[:, tf.newaxis] > neg_idx, inp.dtype)
neg_ref = tf.reduce_sum(pos_neg_idx, axis=1)
shifts = tf.gather(tf.concat([[0], cum_neg], axis=0), neg_ref) - neg_ref
final_pos_idx = pos_idx + shifts
# Compute the final size
final_size = num_inputs + total_neg - tf.size(negs)
# Make final vector by scattering positive values
result = tf.scatter_nd(final_pos_idx[:, tf.newaxis], tf.gather(inp, pos_idx), [final_size])
with tf.Session() as sess:
    print(sess.run(result, feed_dict={inp: [-1, 1, -2, 2, 1, -3]}))
Output:
[0 1 0 0 2 1 0 0 0]
There is some "more than necessary" computational cost in this solution, namely the computation of final indices of positive elements through pos_neg_idx, which is O(n^2), while it could be done iteratively in O(n). However, I cannot think of a way to replicate the loop iteratively, and a TensorFlow loop (using tf.while_loop) would be awkward and slow. In any case, unless you are using quite large vectors (with evenly distributed positive and negative values) it should not be a big issue.
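For testing the graph above, it can help to have a plain NumPy reference of the intended transformation. The sketch below is not a TensorFlow-native solution; it only states the expected semantics with np.repeat, and the function name expand_negatives is hypothetical. Newer TensorFlow releases expose tf.repeat, which might allow a similar formulation natively, but that is untested here:

```python
import numpy as np

def expand_negatives(values):
    """Replace each negative entry -k with k zeros; keep other entries."""
    values = np.asarray(values)
    is_neg = values < 0
    # Negative entries are repeated |value| times, all others once
    repeats = np.where(is_neg, -values, 1).astype(int)
    # Negative entries contribute zeros, others keep their value
    filled = np.where(is_neg, 0, values)
    return np.repeat(filled, repeats)

print(expand_negatives([-1, 1, -2, 2, 1, -3]))
```

This runs in time linear in the length of the output, so it can also serve as a baseline when judging the cost of the quadratic index computation.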

Solve seemingly (but not actually!) overdetermined sparse linear system in Python

I have a sparse matrix A (using scipy.sparse) and a vector b, and want to solve Ax = b for x. A has more rows than columns, so it appears to be overdetermined; however, the rows of A are linearly dependent, so that in actuality the row rank of A is equal to the number of columns. For example, A could be
A = np.array([[1., 1.], [-1., -1.], [1., 0.]])
while b is
b = np.array([0., 0., 1.])
The solution is then x = [1., -1.]. I'm wondering how to solve this system in Python, using the functions available in scipy.sparse.linalg. Thanks!
Is your system possibly underdetermined? If it is not, and there is actually a solution, then the least squares solution will be that solution, so you can try
from scipy.sparse.linalg import lsqr
return_values = lsqr(A, b)
x = return_values[0]
If your system is actually underdetermined, this should find you the minimum L2 norm solution. If it doesn't work, set the parameter damp to something very small (e.g. 1e-5).
If your system is exactly determined (i.e. A is of full rank) and has a solution, and your matrix A is tall, as you describe it, then you can find an equivalent system in the normal equations:
A.T.dot(A).dot(x) == A.T.dot(b)
has a unique solution in x. This is a square linear system and is thus solvable using linear system solvers such as scipy.sparse.linalg.spsolve.
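Both suggestions can be tried on the example from the question. This is a minimal sketch, assuming A is stored as a SciPy sparse matrix; the variable names x_lsqr and x_normal are chosen here for illustration. Keep in mind that forming A.T @ A squares the condition number, so the normal equations may lose accuracy on ill-conditioned systems:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import lsqr, spsolve

A = csr_matrix(np.array([[1., 1.], [-1., -1.], [1., 0.]]))
b = np.array([0., 0., 1.])

# Least-squares solution; lsqr returns a tuple whose first entry is x
x_lsqr = lsqr(A, b)[0]

# Normal equations: (A^T A) x = A^T b is a square sparse system
AtA = (A.T @ A).tocsc()
Atb = A.T @ b
x_normal = spsolve(AtA, Atb)

print(x_lsqr)
print(x_normal)
```

Both results should be close to the solution x = [1., -1.] stated in the question.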
The formally correct way of solving your problem is to use SVD. You have a system of the form
A [MxN] * x [Nx1] = b [Mx1]
The SVD decomposes the matrix A into three others, so you get:
U [MxM] * S [MxN] * V [NxN] * x [Nx1] = b [Mx1]
The matrices U and V are both orthogonal (their inverse is their transpose), and S is a diagonal matrix. If we rewrite the above we get:
S [MxN] * V [NxN] * x [Nx1] = U.T [MxM] * b [Mx1]
If M > N then the matrix S will have its last M - N rows full of zeros, and if your system is truly determined, then U.T b should also have its last M - N entries equal to zero. That means that you can solve your system as:
>>> a = np.array([[1., 1.], [-1., -1.], [1., 0.]])
>>> b = np.array([0., 0., 1.])
>>> m, n = a.shape
>>> u, s, v = np.linalg.svd(a)
>>> np.allclose(u.T.dot(b)[-m+n:], 0) #check system is not overdetermined
True
>>> np.linalg.solve(s[:, None] * v, u.T.dot(b)[:n])
array([ 1., -1.])
