I'm learning about neural networks, specifically looking at MLPs with a back-propagation implementation. I'm trying to implement my own network in python and I thought I'd look at some other libraries before I started. After some searching I found Neil Schemenauer's python implementation bpnn.py. (http://arctrix.com/nas/python/bpnn.py)
Having worked through the code and read the first part of Christopher M. Bishops book titled 'Neural Networks for Pattern Recognition' I found an issue in the backPropagate function:
# calculate error terms for output
output_deltas = [0.0] * self.no
for k in range(self.no):
error = targets[k]-self.ao[k]
output_deltas[k] = dsigmoid(self.ao[k]) * error
The line of code that calculates the error is different in Bishops book. On page 145, equation 4.41 he defines the output units error as:
d_k = y_k - t_k
Where y_k are the outputs and t_k are the targets. (I'm using _ to represent subscript)
So my question is should this line of code:
error = targets[k]-self.ao[k]
Be infact:
error = self.ao[k] - targets[k]
I'm most likely completely wrong but could someone help clear up my confusion please. Thanks
It all depends on the error measure you use. To give just a few examples of error measures (for brevity, I'll use ys to mean a vector of n outputs and ts to mean a vector of n targets):
mean squared error (MSE):
sum((y - t) ** 2 for (y, t) in zip(ys, ts)) / n
mean absolute error (MAE):
sum(abs(y - t) for (y, t) in zip(ys, ts)) / n
mean logistic error (MLE):
sum(-log(y) * t - log(1 - y) * (1 - t) for (y, t) in zip(ys, ts)) / n
Which one you use depends entirely on the context. MSE and MAE can be used for when the target outputs can take any values, and MLE gives very good results when your target outputs are either 0 or 1 and when y is in the open range (0, 1).
With that said, I haven't seen the errors y - t or t - y used before (I'm not very experienced in machine learning myself). As far as I can see, the source code you provided doesn't square the difference or use the absolute value, are you sure the book doesn't either? The way I see it y - t or t - y can't be very good error measures and here's why:
n = 2 # We only have two output neurons
ts = [ 0, 1 ] # Our target outputs
ys = [ 0.999, 0.001 ] # Our sigmoid outputs
# Notice that your outputs are the exact opposite of what you want them to be.
# Yet, if you use (y - t) or (t - y) to measure your error for each neuron and
# then sum up to get the total error of the network, you get 0.
t_minus_y = (0 - 0.999) + (1 - 0.001)
y_minus_t = (0.999 - 0) + (0.001 - 1)
Edit: Per alfa's comment, in the book, y - t is actually the derivative of MSE. In that case, t - y is incorrect. Note, however, that the actual derivative of MSE is 2 * (y - t) / n, not simply y - t.
If you don't divide by n (so you actually have a summed squared error (SSE), not a mean squared error), then the derivative would be 2 * (y - t). Furthermore, if you use SSE / 2 as your error measure, then the 1 / 2 and the 2 in the derivative cancel out and you are left with y - t.
You have to backpropagate the derivative of
0.5*(y-t)^2 or 0.5*(t-y)^2 with respect to y
which is always
y-t = (y-t)(+1) = (t-y)(-1)
You can study this implementation of MLP from Padasip library.
And the documentation is here
In actual code, we often calculate NEGATIVE grad(of loss with regard to w), and use w += eta*grad to update weight. Actually its a grad ascent.
In some text book, POSITIVE grad is calculated and w -= eta*grad to update weight.
Related
I have a CSV file of various persons with 8 parameters to determine whether the person is diabetic or not.
You will get the CSV file from here
I am making a model that will train and predict if a person is diabetic or not without using of third-party applications like Tensorlfow Scikitlearn etc. I am making it from scratch.
here is my code:
from numpy import genfromtxt
import numpy as np
my_data = genfromtxt('E:/diabaties.csv', delimiter=',')
X,Y = my_data[1: ,:-1], my_data[1: ,-1:] #striping data and output from my_data
def sigmoid(x):
return (1/(1+np.exp(-x)))
m = X.shape[0]
def propagate(W, b, X, Y):
#forward propagation
A = sigmoid(np.dot(X, W) + b)
cost = (- 1 / m) * np.sum(Y * np.log(A) + (1 - Y) * (np.log(1 - A)))
print(cost)
#backward propagation
dw = (1 / m) * np.dot(X.T, (A - Y))
db = (1 / m) * np.sum(A - Y)
return(dw, db, cost)
def optimizer(W,b,X,Y,number_of_iterration,learning_rate):
for i in range(number_of_iterration):
dw, db, cost = propagate(W,b,X,Y)
W = W - learning_rate*dw
b = b - learning_rate*db
return(W, b)
W = np.zeros((X.shape[1],1))
b = 0
W,b = optimizer(W, b, X, Y, 100, 0.05)
The output which is getting generated is:
It is in this link please take a look.
I have tried to -
initialize the value of W with random numbers.
spent a lot of time to debug but cannot find what I have done wrong
This short answer is that your learning rate is about 500x too big for this problem. Think about it like you're trying to pilot your W vector into a canyon in the cost function. At each step, the gradient tells you which way is down hill, but the steps you take in that direction are so big that you jump over the canyon and end up on the other side. Each time this happens, your cost goes up because you're getting farther and farther out of the canyon until after 2 iterations, it blows up.
If you replace the line
W,b = optimizer(W, b, X, Y, 100, 0.05)
with
W,b = optimizer(W, b, X, Y, 100, 0.0001)
It will converge, though still not at a reasonable speed. (Side note, there's no good way to know the learning rate you need for a given problem. You just try lower and lower values until your cost value doesn't diverge.)
The longer answer is that the problem is that your features are all on different scales.
col_means = X.mean(axis=0)
col_stds = X.std(axis=0)
print('column means: ', col_means)
print('column stdevs: ', col_stds)
yields
column means: [ 3.84505208 120.89453125 69.10546875 20.53645833 79.79947917
31.99257812 0.4718763 33.24088542]
column stdevs: [ 3.36738361 31.95179591 19.34320163 15.94182863 115.16894926
7.87902573 0.33111282 11.75257265]
This means that the variations in the numbers of the second feature are about 100x as large as the variations in the numbers of the second to last feature which in turn means that the number of the second value in your W vector will have to be tuned to about 100x the precision of the value of the second to last number in your W vector.
There are two ways to deal with this in practice. First, you could use a fancier optimizer. Instead of basic gradient descent, you could use gradient descent with momentum, but that would change all your code. The second, simpler, way is just to scale your features so they're all about the same size.
col_means = X.mean(axis=0)
col_stds = X.std(axis=0)
print('column means: ', col_means)
print('column stdevs: ', col_stds)
X -= col_means
X /= col_stds
W, b = optimizer(W, b, X, Y, 100, 1.0)
Here we subtract the mean value of each feature and divide each feature's value by its standard deviation. Sometimes newbies are thrown off by this -- "you can't change your data values, that changes the problem" -- but it makes sense if you realize that it's just another mathematical transformation, just like multiplying by W, adding b, taking the sigmoid, etc. The only catch is that you've got to make sure you do the same thing for any future data. Just like the values of your W vector are learned parameters of your model, the values of the col_means and col_stds are too, so you've got to save them like W and b and use them if you want to perform inference with this model on new data in the future.
That lets us use a much bigger learning rater of 1.0 because now all the features are approximately the same size.
Now if you try, you'll get the following output:
column means: [ 3.84505208 120.89453125 69.10546875 20.53645833 79.79947917
31.99257812 0.4718763 33.24088542]
column stdevs: [ 3.36738361 31.95179591 19.34320163 15.94182863 115.16894926
7.87902573 0.33111282 11.75257265]
0.6931471805599452
0.5902957589079032
0.5481784378158732
0.5254804089153315
...
0.4709931321295562
0.4709931263193595
0.47099312122176273
0.4709931167488006
0.470993112823447
This is what you want. Your cost function is going down at each step and at the end of your 100 iterations, the cost is stable to ~8 significant figures, so dropping it more probably won't do much.
Welcome to machine learning!
The problem is with your initialization of the weight and bias. It’s important that you don’t initialize at least the weights to zero and instead initialize them with some random small numbers. The value of A is coming out to be zero making your cost function undefined
Update:
Try something like this:
from numpy import genfromtxt
import numpy as np
# my_data = genfromtxt('E:/diabaties.csv', delimiter=',')
# X,Y = my_data[1: ,:-1], my_data[1: ,-1:] #striping data and output from my_data
# Using random data
n_points = 100
n_neurons = 5
X = np.random.rand(n_points, n_neurons) # 5 dimensional data from uniform distribution [0, 1)
Y = np.random.randint(low=0, high=2, size=(n_points, 1)) # Binary labels
def sigmoid(x):
return (1/(1+np.exp(-x)))
m = X.shape[0]
def propagate(W, b, X, Y):
#forward propagation
A = sigmoid(np.dot(X, W) + b)
cost = (- 1 / m) * np.sum(Y * np.log(A) + (1 - Y) * (np.log(1 - A)))
print(cost)
#backward propagation
dw = (1 / m) * np.dot(X.T, (A - Y))
db = (1 / m) * np.sum(A - Y)
return(dw, db, cost)
def optimizer(W,b,X,Y,number_of_iterration,learning_rate):
for i in range(number_of_iterration):
dw, db, cost = propagate(W,b,X,Y)
W = W - learning_rate*dw
b = b - learning_rate*db
return(W, b)
W = np.random.normal(loc=0, scale=0.01, size=(n_neurons, 1)) # Drawing random initialization from gaussian
b = 0
W,b = optimizer(W, b, X, Y, 100, 0.05)
Your NaN problem is simply due to np.log encountering a zero value. You always want to scale your X values. Statistical (mean, std) normalization will work, but I find min-max scaling works best. Here is code for that:
def minmax_scaler(x):
min = np.nanmin(x, axis=0)
max = np.nanmax(x, axis=0)
return (x-min)/(max-min)
Also, your neural net has only one neuron. When you call np.dot(X, W) these should be matrices of shape (cases, features) and (features, neurons) respectively. So, now your initialization code looks like this:
X = minmax_scaler(X)
neurons = 10
learning_rate = 0.05
W = np.random.random((X.shape[1], neurons))
b = np.zeros((1, neurons)) # b width to match W
I got decent convergence without needing to change the learning rate. See chart:
This is such a small dataset that, even with 10-20 neurons, you are in danger of overfitting it. Ordinarily, you would code a predict() method and an accuracy check, and then set aside some of the data to test for overfitting.
I am going to implement binary addition by Recurrent Neural Network (RNN) as a sample. I have coped with an issue to implement it by Python, so I decided to share my problem in there to come up with ideas to fix it.
As can be seen in my notebook code (Backpropagation through time (BPTT) section),
There is a chain rule like below to update input weight matrix like below:
My problem is this part:
I've tried to implement this part in my Python code or notebook code (class input_layer, backward method), but unmatched dimensions raises an error.
In my sample code, W_hidden is 16*16, whereas the result of delta pre_hidden is 1*2. This makes the error. If you run the code, you could see the error.
I spent a lot of time to check my chain rule as well as my code. I guess my chain rule is right. Only reason to make this error is my code.
As I know, multiple unmatched matrices in terms of dimension is impossible. If my chain rule is correct, how it could be implemented by Python?
Any idea?
Thanks in advance.
You need to apply dimension balancing on the gradients. Taken from the Stanford's cs231n course, it comes down to two simple modifications:
Given , and , we will have:
,
Here is the code I used to ensure the gradient calculation is correct. You should be able to update your code accordingly.
import torch
torch.random.manual_seed(0)
x_1, x_2 = torch.zeros(size=(1, 8)).normal_(0, 0.01), torch.zeros(size=(1, 8)).normal_(0, 0.01)
y = torch.zeros(size=(1, 8)).normal_(0, 0.01)
h_0 = torch.zeros(size=(1, 16)).normal_(0, 0.01)
weight_ih = torch.zeros(size=(8, 16)).normal_(mean=0, std=0.01).requires_grad_(True)
weight_hh = torch.zeros(size=(16, 16)).normal_(mean=0, std=0.01).requires_grad_(True)
weight_ho = torch.zeros(size=(16, 8)).normal_(mean=0, std=0.01).requires_grad_(True)
h_1 = x_1.mm(weight_ih) + h_0.mm(weight_hh)
h_2 = x_2.mm(weight_ih) + h_1.mm(weight_hh)
g_2 = h_2.sigmoid()
j_2 = g_2.mm(weight_ho)
y_predicted = j_2.sigmoid()
loss = 0.5 * (y - y_predicted).pow(2).sum()
loss.backward()
delta_1 = -1 * (y - y_predicted) * y_predicted * (1 - y_predicted)
delta_2 = delta_1.mm(weight_ho.t()) * (g_2 * (1 - g_2))
delta_3 = delta_2.mm(weight_hh.t())
# 16 x 8
weight_ho_grad = g_2.t() * delta_1
# 16 x 16
weight_hh_grad = h_1.t() * delta_2 + (h_0.t() * delta_3)
# 8 x 16
weight_ih_grad = x_2.t() * delta_2 + x_1.t() * delta_3
atol = 1e-10
assert torch.allclose(weight_ho.grad, weight_ho_grad, atol=atol)
assert torch.allclose(weight_hh.grad, weight_hh_grad, atol=atol)
assert torch.allclose(weight_ih.grad, weight_ih_grad, atol=atol)
So I'm new to learning ML and I am using gradient descent as my first algorithm I would like to get good at and learn well. I wrote my first code and have looked online for the issue I'm facing but due to lack of concrete knowledge I'm having a hard time understanding how I would go about diagnosing my issue. My gradient begins by approaching the correct answer and when the error has been cut by a factor of 8, the algorithm loses it's value and the b-value begins to go negative and the m-value goes past the target value. I'm sorry if I worded this odd, hopefully the code will help.
I am learning this from multiple sources on youtube and on google. I have been following Siraj Raval's math of intelligence playlist on youtube, I understood how the underlying algorithm worked but I decided to take my own approach and it seems to not be working too great. I'm struggling to read online resources as I'm inexperienced in what ever algorithm means and how it's implemented into python. I know this issue has something to do with training and testing but I don't know where to apply this.
def gradient_updater(error, mcurr, bcurr):
for i in x:
# gets the predicted y-value
ypred = (mcurr * i) + bcurr
# uses partial derivative formula to get new m and b
new_m = -(2/N) * sum(x*(y - ypred))
new_b = -(2/N) * sum(y - ypred)
# applies the new b and m value
mcurr = mcurr - (learning_rate * new_m)
bcurr = bcurr - (learning_rate * new_b)
return mcurr, bcurr
def run(iterations, initial_m, initial_b):
current_m = initial_m
current_b = initial_b
for i in range(iterations):
error = get_error(current_m, current_b)
current_m, current_b = gradient_updater(error, current_m, current_b)
print(current_m, current_b, error)
I expected the m and b values to converge to a specific value, this didn't occur and the values kept increasing in opposite direction.
If I am understanding your code correctly, I think your problem is that your taking the partial derivative to get your new slope and intercept on just one point. I'm not sure what exactly some of the variables within the gradient_updater are, so I will try to provide an example that better explains the concept:
I'm not sure we are calculating the optimization in the same way, so in my code, b0 is your 'x' in y=mx+b and b1 is your 'b' that same equation. The following code is for calculating a total b0_temp and b1_temp that will be divided by the batch size to present a new b0 and b1 to fit your graph.
for i in range(len(X)):
ERROR = ERROR + (b1*X[i] + b0 - Y[i])**2
b1_temp = b1_temp + (1/2)*((1/len(X))*(b1*X[i] + b0 - Y[i])**2)**(-1/2) * (2/len(X))*(b1*X[i] + b0 - Y[i])*X[i]
b0_temp = b0_temp + (1/2)*((1/len(X))*(b1*X[i] + b0 - Y[i])**2)**(-1/2) * (2/len(X))*(b1*X[i] + b0 - Y[i])
I run through this for every value within my dataset, where X[i] and Y[i] represent an individual datapoint.
Next, I adjust the slope that is currently fitting the graph:
b1_temp = b1_temp / batch_size
b0_temp = b0_temp / batch_size
b0 = b0 - learning_rate * b0_temp
b1 = b1 - learning_rate * b1_temp
b1_temp = 0
b0_temp = 0
Where batch_size can just be taken as len(X). I run through this for some number of epochs (i.e. a for loop of some number, 100 should work), and the line of best fit will adjust accordingly over time. The overall concept behind it is decrease the distance between each point and the line to where it is at a minimum.
Hope I was able to better explain this to you and provide you with a basic code base to adjust your's upon!
Here's where I think the error in your code lies - the calculation of the gradient. I believe that your cost function is similar to the one used in https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html. To solve the gradient, you need to aggregate the effects from all partial derivatives. In your implementation however, you iterate over the range x, without accumulating the effects. Therefore, your new_m and new_b are only calculated for the final term, x (Items marked 1 and 2 below).
Your implementation:
def gradient_updater(error, mcurr, bcurr):
for i in x:
# gets the predicted y-value
ypred = (mcurr * i) + bcurr
# uses partial derivative formula to get new m and b
new_m = -(2/N) * sum(x*(y - ypred)) #-- 1 --
new_b = -(2/N) * sum(y - ypred) #-- 2 --
# applies the new b and m value <-- Indent this block to place inside the for loop
mcurr = mcurr - (learning_rate * new_m)
bcurr = bcurr - (learning_rate * new_b)
return mcurr, bcurr
That said, I think your implementation should come closer to the mathematical formula if you just update mcurr and bcurr in every iteration (See inline comment). The other thing to do is to divide both sum(x*(y - ypred)) and sum(y - ypred) by N as well, in computing new_m and new_b.
Note
Since I do not know what your actual cost function is, I just want to point out that you are also using a constant y value in your code. It is more likely to be an array of different values and be called by Y[i] and X[i] respectively.
I have the following equations:
sqrt((x0 - x)^2 + (y0 - y)^2) - sqrt((x1 - x)^2 + (y1 - y)^2) = c1
sqrt((x3 - x)^2 + (y3 - y)^2) - sqrt((x4 - x)^2 + (y4 - y)^2) = c2
And I would like to find the intersection. I tried using fsolve, and transforming the equations into linear f(x) functions, and it worked for small numbers. I am working with huge numbers and to solve the linear equation there are lots of calculations performed, specifically the calculations reach to a square root of a subtraction, and when handling huge numbers precision is lost, and the left operand is smaller than the right one getting to a math value domain error trying to solve the square root of a negative number.
I am trying to solve this issue in different manners:
Trying to use bigger precision floats. Tried using numpy.float128 but fsolve wont allow using this.
Currently searching for a library that allows to solve non linear equations system, but no luck so far.
Any help/guidance/tip I will appreciate!!
Thanks!!
Taking all advice, i ended using code like the following:
for the the system:
0 = x + y - 8
0 = sqrt((-6 - x)^2 + (4 - y)^2) - sqrt((1 - x)^2 + y^) - 5
from math import sqrt
import numpy as np
from scipy.optimize import fsolve
def f(x):
y = np.zeros(2)
y[0] = x[1] + x[0] - 8
y[1] = sqrt((-6 - x[0]) ** 2 + (4 - x[1]) ** 2) - sqrt((1 - x[0]) ** 2 + x[1] ** 2) - 5
return y
x0 = np.array([0, 0])
solution = fsolve(f, x0)
print "(x, y) = (" + str(solution[0]) + ", " + str(solution[1]) + ")"
Note: the line x0 = np.array([0, 0]) corresponds to the seed that the method uses in fsolve in order to get to a solution. It is important to have a close seed to reach for a solution.
The example provided works :)
You might find some use in SymPy, which is a symbolic algebra manipulation in Python.
From it's home page:
SymPy is a Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python.
As you have a non-linear equation you need some kind of optimizer to solve it. Probably you can use something like scipy.optimize (https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html). However, as I have no experience with that scipy function can offer you only a solution with the gradient descent method of the tensorflow library. You can find a short guide here: https://learningtensorflow.com/lesson7/ (check out the Gradient descent cahpter). Analog to the method described there you could do something like that:
# These arrays are pseudo code, fill in your values for x0,x1,y0,y1,...
x_array = [x0,x1,x3,x4]
y_array = [y0,y1,y3,y4]
c_array = [c1,c2]
# Tensorflow model starts here
x=tf.placeholder("float")
y=tf.placeholder("float")
z=tf.placeholder("float")
# the array [0,0] are initial guesses for the "correct" x and y that solves the equation
xy_array = tf.Variable([0,0], name="xy_array")
x0 = tf.constant(x_array[0], name="x0")
x1 = tf.constant(x_array[1], name="x1")
x3 = tf.constant(x_array[2], name="x3")
x4 = tf.constant(x_array[3], name="x4")
y0 = tf.constant(y_array[0], name="y0")
y1 = tf.constant(y_array[1], name="y1")
y3 = tf.constant(y_array[2], name="y3")
y4 = tf.constant(y_array[3], name="y4")
c1 = tf.constant(c_array[0], name="c1")
c2 = tf.constant(c_array[1], name="c2")
# I took your first line and subtracted c1 from it, same for the second line, and introduced d_1 and d_2
d_1 = tf.sqrt(tf.square(x0 - xy_array[0])+tf.square(y0 - xy_array[1])) - tf.sqrt(tf.square(x1 - xy_array[0])+tf.square(y1 - xy_array[1])) - c_1
d_2 = tf.sqrt(tf.square(x3 - xy_array[0])+tf.square(y3 - xy_array[1])) - tf.sqrt(tf.square(x4 - xy_array[0])+tf.square(y4 - xy_array[1])) - c_2
# this z_model should actually be zero in the end, in that case there is an intersection
z_model = d_1 - d_2
error = tf.square(z-z_model)
# you can try different values for the "learning rate", here 0.01
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(error)
model = tf.global_variables_initializer()
with tf.Session() as session:
session.run(model)
# here you are creating a "training set" of size 1000, you can also make it bigger if you like
for i in range(1000):
x_value = np.random.rand()
y_value = np.random.rand()
d1_value = np.sqrt(np.square(x_array[0]-x_value)+np.square(y_array[0]-y_value)) - np.sqrt(np.square(x_array[1]-x_value)+np.square(y_array[1]-y_value)) - c_array[0]
d2_value = np.sqrt(np.square(x_array[2]-x_value)+np.square(y_array[2]-y_value)) - np.sqrt(np.square(x_array[3]-x_value)+np.square(y_array[3]-y_value)) - c_array[1]
z_value = d1_value - d2_value
session.run(train_op, feed_dict={x: x_value, y: y_value, z: z_value})
xy_value = session.run(xy_array)
print("Predicted model: {a:.3f}x + {b:.3f}".format(a=xy_value[0], b=xy_value[1]))
But be aware: This code will probably run a while... This is why haven't tested it...
Also I am currently not sure what will happen if there is no intersection. Probably you get the coordinates of the closest distance of the functions...
Tensorflow can be somewhat difficult if you haven't used it yet, but it is worth to learn it, as you can also use it for any deep learning application (actual purpose of this library).
My SARSA with gradient-descent keep escalating the weights exponentially. At Episode 4 step 17 the value is already nan
Exception: Qa is nan
e.g:
6) Qa:
Qa = -2.00890180632e+303
7) NEXT Qa:
Next Qa with west = -2.28577776413e+303
8) THETA:
1.78032402991e+303 <= -0.1 + (0.1 * -2.28577776413e+303) - -2.00890180632e+303
9) WEIGHTS (sample)
5.18266630725e+302 <= -1.58305782482e+301 + (0.3 * 1.78032402991e+303 * 1)
I don't know where to look for the mistake I made.
Here's some code FWIW:
def getTheta(self, reward, Qa, QaNext):
""" let t = r + yQw(s',a') - Qw(s,a) """
theta = reward + (self.gamma * QaNext) - Qa
def updateWeights(self, Fsa, theta):
""" wi <- wi + alpha * theta * Fi(s,a) """
for i, w in enumerate(self.weights):
self.weights[i] += (self.alpha * theta * Fsa[i])
I have about 183 binary features.
you need normalization in each trial. This will keep the weights in a bounded range. (e.g. [0,1]). They way you are adding the weights each time, just grows the weights and it would be useless after the first trial.
I would do something like this:
self.weights[i] += (self.alpha * theta * Fsa[i])
normalize(self.weights[i],wmin,wmax)
or see the following example (from literature of RL):
You need to write the normalization function by yourself though ;)
I do not have access to the full code in your application, so I might be wrong. But I think that I know where you are going wrong.
First and foremost, normalization should not be necessary here. For weights to get bloated so soon in this situation suggests something wrong with your implementation.
I think your update equation should be:-
self.weights[:, action_i] = self.weights[:, action_i] + (self.alpha * theta * Fsa[i])
That is to say that you should be updating columns instead of rows, because rows are for states and columns for for actions in the weight matrix.