I am new to machine learning and TensorFlow. I am trying to follow the tutorial's logic to create a simple linear regression model of the form y = a*x (there is no bias term here). However, for some reason the model fails to converge to the correct value of "a". I created the data set myself in Excel, as shown below:
Here is my code that tries to run TensorFlow on this dummy data set.
import tensorflow as tf
import pandas as pd
w = tf.Variable([[5]],dtype=tf.float32)
b = tf.Variable([-5],dtype=tf.float32)
x = tf.placeholder(shape=(None,1),dtype=tf.float32)
y = tf.add(tf.matmul(x,w),b)
label = tf.placeholder(dtype=tf.float32)
loss = tf.reduce_mean(tf.squared_difference(y,label))
data = pd.read_csv("D:\\dat2.csv")
xs = data.iloc[:,:1].as_matrix()
ys = data.iloc[:,1].as_matrix()
optimizer = tf.train.GradientDescentOptimizer(0.000001).minimize(loss)
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
for i in range(10000):
    sess.run(optimizer,{x:xs,label:ys})
    if i%100 == 0: print(i,sess.run(w))
print(sess.run(w))
Below is the printout from the IPython console. As you can see, after the 10000th iteration the value of w is around 4.53 instead of the correct value 6.
I would really appreciate it if anyone could shed some light on what is going wrong here. I have played around with different learning rates from 0.01 to 0.0000001, and none of the settings makes w converge to 6. I have read suggestions to normalize the features to a standard normal distribution. Is this normalization a must? Without normalization, is gradient descent unable to find the solution? Thank you very much!
It is a shape problem: y and label don't have the same shape ([batch_size, 1] vs [batch_size]). In loss = tf.reduce_mean(tf.squared_difference(y, label)), TensorFlow broadcasts the two tensors against each other to shape [batch_size, batch_size], so the loss averages the squared difference between every prediction and every label rather than between matching pairs. The result is that your loss is not at all the one you want.
To correct that, simply replace
y = tf.add(tf.matmul(x, w), b)
by
y = tf.add(tf.matmul(x, w), b)
y = tf.reshape(y, shape=[-1])
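The effect can be reproduced outside TensorFlow, since its elementwise ops follow NumPy-style broadcasting. Here is a minimal NumPy sketch; the data (y = 6*x for x = 1..5) is made up for illustration and is not the asker's CSV:

```python
import numpy as np

# Made-up data following y = 6 * x (an assumption; not the original CSV).
xs = np.arange(1.0, 6.0).reshape(-1, 1)   # shape (5, 1), like the placeholder x
labels = 6.0 * xs[:, 0]                   # shape (5,), like ys fed to `label`

w = np.array([[6.0]])                     # the "correct" weight
pred = xs @ w                             # shape (5, 1), like tf.matmul(x, w)

# (5, 1) - (5,) broadcasts to (5, 5): every prediction against every label.
broadcast_loss = np.mean((pred - labels) ** 2)

# Flattening the prediction first gives the intended per-sample loss.
intended_loss = np.mean((pred.reshape(-1) - labels) ** 2)

print((pred - labels).shape)   # (5, 5)
print(intended_loss)           # 0.0
print(broadcast_loss)          # positive even though w is exactly right
```

Because the broadcast loss is nonzero at the true weight, gradient descent is pulled away from it, which would explain w settling at some other value.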
My full working code below:
import tensorflow as tf
import pandas as pd
w = tf.Variable([[4]], dtype=tf.float64)
b = tf.Variable([10.0], dtype=tf.float64, trainable=True)
x = tf.placeholder(shape=(None, 1), dtype=tf.float64)
y = tf.add(tf.matmul(x, w), b)
y = tf.reshape(y, shape=[-1])
label = tf.placeholder(shape=(None), dtype=tf.float64)
loss = tf.reduce_mean(tf.squared_difference(y, label))
my_path = "/media/sf_ShareVM/data2.csv"
data = pd.read_csv(my_path, sep=";")
max_n_samples_to_use = 50
xs = data.iloc[:max_n_samples_to_use, :1].as_matrix()
ys = data.iloc[:max_n_samples_to_use, 1].as_matrix()
lr = 0.000001
optimizer = tf.train.GradientDescentOptimizer(learning_rate=lr).minimize(loss)
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
for i in range(100000):
    _, loss_value, w_value, b_value, y_val, lab_val = sess.run([optimizer, loss, w, b, y, label], {x: xs, label: ys})
    if i % 100 == 0: print(i, loss_value, w_value, b_value)
    if (i%2000 == 0 and 0< i < 10000): # We use a smaller LR at first to avoid exploding gradient. It would be MUCH cleaner to use gradient clipping (by global norm)
        lr*=2
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=lr).minimize(loss)
print(sess.run(w))
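The comment in the loop mentions gradient clipping by global norm as the cleaner alternative to ramping up the learning rate. As a rough illustration of what that operation does, here is a NumPy sketch. The helper below is hypothetical and written for this note; in TF 1.x the corresponding op is tf.clip_by_global_norm.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Hypothetical helper: rescale all gradients together so that their
    joint L2 norm does not exceed max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / global_norm) if global_norm > 0 else 1.0
    return [g * scale for g in grads], global_norm

# Example: made-up gradients for w (shape (1, 1)) and b (shape (1,)).
grads = [np.array([[300.0]]), np.array([400.0])]
clipped, norm = clip_by_global_norm(grads, max_norm=5.0)

print(norm)      # 500.0, the joint norm before clipping
print(clipped)   # both gradients shrunk by the same factor
```

Since every gradient is scaled by the same factor, the update direction is preserved and only its length is capped.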
I am trying to factor Ax^2+Bx+C into (ax+b)(cx+d), where A, B, C are known, and solve for the values of a, b, c, d.
Here is the code:
import tensorflow as tf
a = tf.Variable([.5])
b = tf.Variable([.5])
c = tf.Variable([.5])
d = tf.Variable([.5])
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
fn1 = 2*x**2+3*x+4 #A=2,B=3,C=4
fn2 = (a*x+b)*(c*x+d)
x_train = [1,2,3,4]
y_train = [9,18,31,48]
loss = tf.reduce_sum(tf.square(fn2-y))
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
for i in range(1000):
    sess.run(train, {x:x_train, y:y_train})
print(sess.run([a,b,c,d]))
The result shows nan for all of a, b, c and d.
How can I fix that? Did I miss something? Thanks for the help.
Your cost function fails to converge at a learning rate of 0.01: the updates overshoot and the parameters blow up to nan. Set the learning rate to 0.0001 (or lower) and the cost function begins to converge.
optimizer = tf.train.GradientDescentOptimizer(0.0001)
Also, if you change fn2 to a * x ** 2 + b * x + c, you will get a solution closer to the Ax^2+Bx+C you started from. With (ax+b)(cx+d), you might get a different solution that merely fits the small training dataset with x = [1,2,3,4].
Another small tip: do not initialize all the variables to the same value (0.5 in your case). Initialize them randomly between -1.0 and 1.0.
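To see the effect of the learning rate without TensorFlow, here is a plain-Python sketch of the same problem, with the gradients written out by hand. The function names are made up for this illustration; the data, starting values and the two learning rates are the ones from the question and the answer.

```python
import math

x_train = [1.0, 2.0, 3.0, 4.0]
y_train = [9.0, 18.0, 31.0, 48.0]

def loss_and_grads(params):
    """Sum-of-squares loss of (a*x+b)*(c*x+d) and its gradient."""
    a, b, c, d = params
    loss = 0.0
    grads = [0.0, 0.0, 0.0, 0.0]
    for x, y in zip(x_train, y_train):
        p, q = a * x + b, c * x + d
        r = p * q - y                 # residual
        loss += r * r
        grads[0] += 2 * r * x * q     # d(loss)/da
        grads[1] += 2 * r * q         # d(loss)/db
        grads[2] += 2 * r * x * p     # d(loss)/dc
        grads[3] += 2 * r * p         # d(loss)/dd
    return loss, grads

def train(lr, steps=1000):
    params = [0.5, 0.5, 0.5, 0.5]     # same start as in the question
    for _ in range(steps):
        _, grads = loss_and_grads(params)
        params = [p - lr * g for p, g in zip(params, grads)]
    return loss_and_grads(params)[0]

initial_loss = loss_and_grads([0.5, 0.5, 0.5, 0.5])[0]
loss_big_lr = train(0.01)
loss_small_lr = train(0.0001)

print(math.isfinite(loss_big_lr))     # False: the run with lr=0.01 blew up
print(math.isfinite(loss_small_lr))   # True: the run with lr=0.0001 stayed finite
print(loss_small_lr < initial_loss)   # True
```

Note that with identical starting values a and c (and b and d) receive identical gradients at every step, so they stay equal throughout, which is one reason to randomize the initialization.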
I was trying to train a simple polynomial linear regression model in PyTorch with SGD. I wrote some self-contained code (which I thought would be extremely simple), but for some reason my model does not train as I thought it should.
I have 5 points sampled from a sine curve and try to fit them with a polynomial of degree 4. This is a convex problem, so GD or SGD should eventually find a solution with zero train error as long as we have enough iterations and a small enough step size. For some reason, however, my model does not train well (even though it does seem to be changing the parameters of the model). Does anyone have an idea why? Here is the code (I tried making it self-contained and minimal):
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
import torch
from torch.autograd import Variable
from maps import NamedDict
from plotting_utils import *
def index_batch(X,batch_indices,dtype):
    '''
    returns the batch indexed/sliced batch
    '''
    if len(X.shape) == 1: # i.e. dimension (M,) just a vector
        batch_xs = torch.FloatTensor(X[batch_indices]).type(dtype)
    else:
        batch_xs = torch.FloatTensor(X[batch_indices,:]).type(dtype)
    return batch_xs
def get_batch2(X,Y,M,dtype):
    '''
    get batch for pytorch model
    '''
    # TODO fix and make it nicer, there is pytorch forum question
    X,Y = X.data.numpy(), Y.data.numpy()
    N = len(Y)
    valid_indices = np.array( range(N) )
    batch_indices = np.random.choice(valid_indices,size=M,replace=False)
    batch_xs = index_batch(X,batch_indices,dtype)
    batch_ys = index_batch(Y,batch_indices,dtype)
    return Variable(batch_xs, requires_grad=False), Variable(batch_ys, requires_grad=False)
def get_sequential_lifted_mdl(nb_monomials,D_out, bias=False):
    return torch.nn.Sequential(torch.nn.Linear(nb_monomials,D_out,bias=bias))
def train_SGD(mdl, M,eta,nb_iter,logging_freq ,dtype, X_train,Y_train):
    ##
    N_train,_ = tuple( X_train.size() )
    #print(N_train)
    for i in range(nb_iter):
        # Forward pass: compute predicted Y using operations on Variables
        batch_xs, batch_ys = get_batch2(X_train,Y_train,M,dtype) # [M, D], [M, 1]
        ## FORWARD PASS
        y_pred = mdl.forward(batch_xs)
        ## LOSS + Regularization
        batch_loss = (1/M)*(y_pred - batch_ys).pow(2).sum()
        ## BACKWARD PASS
        batch_loss.backward() # Use autograd to compute the backward pass. Now w will have gradients
        ## SGD update
        for W in mdl.parameters():
            delta = eta*W.grad.data
            W.data.copy_(W.data - delta)
        ## train stats
        if i % (nb_iter/10) == 0 or i == 0:
            current_train_loss = (1/N_train)*(mdl.forward(X_train) - Y_train).pow(2).sum().data.numpy()
            print('i = {}, current_loss = {}'.format(i, current_train_loss ) )
        ## Manually zero the gradients after updating weights
        mdl.zero_grad()
##
logging_freq = 100
dtype = torch.FloatTensor
## SGD params
M = 3
eta = 0.0002
nb_iter = 20*1000
##
lb,ub = 0,1
f_target = lambda x: np.sin(2*np.pi*x)
N_train = 5
X_train = np.linspace(lb,ub,N_train)
Y_train = f_target(X_train)
## degree of mdl
Degree_mdl = 4
## pseudo-inverse solution
c_pinv = np.polyfit( X_train, Y_train , Degree_mdl )[::-1]
## linear mdl to train with SGD
nb_terms = c_pinv.shape[0]
mdl_sgd = get_sequential_lifted_mdl(nb_monomials=nb_terms,D_out=1, bias=False)
## Make polynomial Kernel
poly_feat = PolynomialFeatures(degree=Degree_mdl)
Kern_train = poly_feat.fit_transform(X_train.reshape(N_train,1))
Kern_train_pt, Y_train_pt = Variable(torch.FloatTensor(Kern_train).type(dtype), requires_grad=False), Variable(torch.FloatTensor(Y_train).type(dtype), requires_grad=False)
train_SGD(mdl_sgd, M,eta,nb_iter,logging_freq ,dtype, Kern_train_pt,Y_train_pt)
the error seems to hover on 2ish:
i = 0, current_loss = [ 2.08996224]
i = 2000, current_loss = [ 2.03536892]
i = 4000, current_loss = [ 2.02014995]
i = 6000, current_loss = [ 2.01307297]
i = 8000, current_loss = [ 2.01300406]
i = 10000, current_loss = [ 2.01125693]
i = 12000, current_loss = [ 2.01162267]
i = 14000, current_loss = [ 2.01296973]
i = 16000, current_loss = [ 2.00951076]
i = 18000, current_loss = [ 2.00967121]
which is weird because it should be able to reach zero.
I also plotted the learned function:
the code for the plotting:
##
x_horizontal = np.linspace(lb,ub,1000).reshape(1000,1)
X_plot = poly_feat.fit_transform(x_horizontal)
X_plot_pytorch = Variable( torch.FloatTensor(X_plot), requires_grad=False)
##
fig1 = plt.figure()
#plots objs
p_sgd, = plt.plot(x_horizontal, [ float(f_val) for f_val in mdl_sgd.forward(X_plot_pytorch).data.numpy() ])
p_pinv, = plt.plot(x_horizontal, np.dot(X_plot,c_pinv))
p_data, = plt.plot(X_train,Y_train,'ro')
## legend
nb_terms = c_pinv.shape[0]
legend_mdl = f'SGD solution standard parametrization, number of monomials={nb_terms}, batch-size={M}, iterations={nb_iter}, step size={eta}'
plt.legend(
[p_sgd,p_pinv,p_data],
[legend_mdl,f'linear algebra soln, number of monomials={nb_terms}',f'data points = {N_train}']
)
##
plt.xlabel('x'), plt.ylabel('f(x)')
plt.show()
I actually went ahead and implemented a TensorFlow version. That one does seem to train the model. I tried having both of them match by giving them the same initialization:
mdl_sgd[0].weight.data.fill_(0)
but that still didn't work. Tensorflow code:
graph = tf.Graph()
with graph.as_default():
    X = tf.placeholder(tf.float32, [None, nb_terms])
    Y = tf.placeholder(tf.float32, [None,1])
    w = tf.Variable( tf.zeros([nb_terms,1]) )
    #w = tf.Variable( tf.truncated_normal([Degree_mdl,1],mean=0.0,stddev=1.0) )
    #w = tf.Variable( 1000*tf.ones([Degree_mdl,1]) )
    ##
    f = tf.matmul(X,w) # [N,1] = [N,D] x [D,1]
    #loss = tf.reduce_sum(tf.square(Y - f))
    loss = tf.reduce_sum( tf.reduce_mean(tf.square(Y-f), 0))
    l2loss_tf = (1/N_train)*2*tf.nn.l2_loss(Y-f)
    ##
    learning_rate = eta
    #global_step = tf.Variable(0, trainable=False)
    #learning_rate = tf.train.exponential_decay(learning_rate=eta, global_step=global_step,decay_steps=nb_iter/2, decay_rate=1, staircase=True)
    train_step = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(loss)
with tf.Session(graph=graph) as sess:
    Y_train = Y_train.reshape(N_train,1)
    tf.global_variables_initializer().run()
    # Train
    for i in range(nb_iter):
        #if i % (nb_iter/10) == 0:
        if i % (nb_iter/10) == 0 or i == 0:
            current_loss = sess.run(fetches=loss, feed_dict={X: Kern_train, Y: Y_train})
            print(f'i = {i}, current_loss = {current_loss}')
        ## train
        batch_xs, batch_ys = get_batch(Kern_train,Y_train,M)
        sess.run(train_step, feed_dict={X: batch_xs, Y: batch_ys})
I also tried changing the initialization, but it didn't change anything, which makes sense because it shouldn't make a big difference:
mdl_sgd[0].weight.data.normal_(mean=0,std=0.001)
Original post:
https://discuss.pytorch.org/t/how-to-train-a-simple-linear-regression-model-with-sgd-in-pytorch-successfully/9620
This is what it should look like:
SOLUTION:
It seems the problem was the shape of the model output: it was returned as a column vector rather than a flat vector. The following code fixed things:
y_pred = model(batch_xs).view(-1) # change this to "y_pred = model(batch_xs)" to get the incorrect results
loss = (y_pred - batch_ys).pow(2).mean()
which seems completely mysterious to me. Does someone know why this fixed the issue? It just seems like magic.
The bug is really subtle, but essentially it's because PyTorch uses NumPy broadcasting rules. When you subtract a 1-D array of shape (3,) from a column vector of shape (3,1), broadcasting produces a (3,3) matrix. (Note this wouldn't happen when you subtract a (3,) array from a row vector of shape (1,3), because 1-D arrays are treated like row vectors.) This is really bad because it means that we compute the matrix of all pairwise differences between every label and every prediction. That is nonsensical: we don't want the prediction for the first data point to match the label of every other point in the data set, so the loss cannot produce anything sensible.
So the answer is to avoid unintended broadcasting by reshaping things either during training or before the data is fed. Either one should work.
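The shape rules described above can be checked directly. This is a small sketch in NumPy rather than PyTorch (an assumption made to keep it dependency-light; the broadcasting rules are the ones PyTorch follows), with arbitrary example values:

```python
import numpy as np

col = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1), like the model output
row = np.array([[1.0, 2.0, 3.0]])       # shape (1, 3)
vec = np.array([1.0, 2.0, 3.0])         # shape (3,), like the targets

print((col - vec).shape)   # (3, 3): all pairwise differences
print((row - vec).shape)   # (1, 3): elementwise, no blow-up

# Only the diagonal of the (3, 3) result holds the differences we wanted.
pairwise = col - vec
print(np.diag(pairwise))               # [0. 0. 0.]
print((col.reshape(-1) - vec).shape)   # (3,) after flattening the column
```

Flattening the model output, as .view(-1) does in the fix, makes both operands 1-D so the subtraction is elementwise again.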
To guard against this error, one can use this check:
def check_vectors_have_same_dimensions(Y,Y_):
    '''
    Checks that vector Y and Y_ have the same dimensions. If they don't
    then there might be an error that could be caused due to wrong broadcasting.
    Returns True if the shapes differ, False if they match.
    '''
    DY = tuple( Y.size() )
    DY_ = tuple( Y_.size() )
    if len(DY) != len(DY_):
        return True
    for i in range(len(DY)):
        if DY[i] != DY_[i]:
            return True
    return False
I have been trying to use an LSTM for regression in TensorFlow, but it doesn't fit the data. I have successfully fit the same data in Keras (with the same size network). My code for trying to overfit a sine wave is below:
import tensorflow as tf
import numpy as np
yt = np.cos(np.linspace(0, 2*np.pi, 256))
xt = np.array([yt[i-50:i] for i in range(50, len(yt))])[...,None]
yt = yt[-xt.shape[0]:]
g = tf.Graph()
with g.as_default():
    x = tf.constant(xt, dtype=tf.float32)
    y = tf.constant(yt, dtype=tf.float32)
    lstm = tf.nn.rnn_cell.BasicLSTMCell(32)
    outputs, state = tf.nn.dynamic_rnn(lstm, x, dtype=tf.float32)
    pred = tf.layers.dense(outputs[:,-1], 1)
    loss = tf.reduce_mean(tf.square(pred-y))
    train_op = tf.train.AdamOptimizer().minimize(loss)
    init = tf.global_variables_initializer()
sess = tf.InteractiveSession(graph=g)
sess.run(init)
for i in range(200):
    _, l = sess.run([train_op, loss])
print(l)
This results in a MSE of 0.436067 (while Keras got to 0.0022 after 50 epochs), and the predictions range from -0.1860 to -0.1798. What am I doing wrong here?
Edit:
When I change my loss function to the following, the model fits properly:
def pinball(y_true, y_pred):
    tau = np.arange(1,100).reshape(1,-1)/100
    pin = tf.reduce_mean(tf.maximum(y_true[:,None] - y_pred, 0) * tau +
                         tf.maximum(y_pred - y_true[:,None], 0) * (1 - tau))
    return pin
I also change the assignments of pred and loss to
pred = tf.layers.dense(outputs[:,-1], 99)
loss = pinball(y, pred)
This results in a decrease of loss from 0.3 to 0.003 as it trains, and seems to properly fit the data.
Looks like a shape/broadcasting issue. Here's a working version:
import tensorflow as tf
import numpy as np
yt = np.cos(np.linspace(0, 2*np.pi, 256))
xt = np.array([yt[i-50:i] for i in range(50, len(yt))])
yt = yt[-xt.shape[0]:]
g = tf.Graph()
with g.as_default():
    x = tf.constant(xt, dtype=tf.float32)
    y = tf.constant(yt, dtype=tf.float32)
    lstm = tf.nn.rnn_cell.BasicLSTMCell(32)
    outputs, state = tf.nn.dynamic_rnn(lstm, x[None, ...], dtype=tf.float32)
    pred = tf.squeeze(tf.layers.dense(outputs, 1), axis=[0, 2])
    loss = tf.reduce_mean(tf.square(pred-y))
    train_op = tf.train.AdamOptimizer().minimize(loss)
    init = tf.global_variables_initializer()
sess = tf.InteractiveSession(graph=g)
sess.run(init)
for i in range(200):
    _, l = sess.run([train_op, loss])
print(l)
x gets a batch dimension of 1 before going into dynamic_rnn, since with time_major=False the first dimension is expected to be a batch dimension. It's important that the last dimension of the output of tf.layers.dense get squeezed off so that it doesn't broadcast with y (TensorShape([256, 1]) and TensorShape([256]) broadcast to TensorShape([256, 256])). With those fixes it converges:
5.78507e-05
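The broadcasting also explains why the question's predictions collapsed into a narrow band. Averaging (pred_i - y_j)^2 over all pairs equals the mean squared distance of each prediction from mean(y), plus the variance of y, so minimizing it pulls every prediction toward mean(y). For this data mean(y) is roughly -0.18, which is consistent with the reported range. A NumPy sketch that checks the identity, using random stand-in predictions (an assumption, since the point holds for any predictions):

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.cos(np.linspace(0, 2 * np.pi, 256))[50:]   # targets, shape (206,)
pred = rng.normal(size=(y.shape[0], 1))           # stand-in predictions, shape (206, 1)

# Loss as written in the question: (206, 1) - (206,) broadcasts to (206, 206).
broadcast_loss = np.mean((pred - y) ** 2)

# Same quantity, decomposed: distance of each prediction from mean(y), plus var(y).
decomposed = np.mean((pred[:, 0] - y.mean()) ** 2) + y.var()

print(np.isclose(broadcast_loss, decomposed))   # True
print(y.mean())                                 # roughly -0.18
```

The var(y) term does not depend on the predictions, so the broadcast loss can never drop below it however long the model trains.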
You are not passing-on the state from one call of dynamic_rnn to next. That's the problem for sure.
Also, why take only last item of the output through the dense layer and onward?
I am new to TensorFlow and neural networks, and I am trying to create a model that just multiplies two float values together.
I wasn't sure how many neurons I would want, but I picked 10 neurons and tried to see where I could go from that. I figured that would probably introduce enough complexity in order to semi-accurately learn that operation.
Anyways, here is my code:
import tensorflow as tf
import numpy as np
# Teach how to multiply
def generate_data(how_many):
    data = np.random.rand(how_many, 2)
    answers = data[:, 0] * data[:, 1]
    return data, answers
sess = tf.InteractiveSession()
# Input data
input_data = tf.placeholder(tf.float32, shape=[None, 2])
correct_answers = tf.placeholder(tf.float32, shape=[None])
# Use 10 neurons--just one layer for now, but it'll be fully connected
weights_1 = tf.Variable(tf.truncated_normal([2, 10], stddev=.1))
bias_1 = tf.Variable(.1)
# Output of this will be a [None, 10]
hidden_output = tf.nn.relu(tf.matmul(input_data, weights_1) + bias_1)
# Weights
weights_2 = tf.Variable(tf.truncated_normal([10, 1], stddev=.1))
bias_2 = tf.Variable(.1)
# Softmax them together--this will be [None, 1]
calculated_output = tf.nn.softmax(tf.matmul(hidden_output, weights_2) + bias_2)
cross_entropy = tf.reduce_mean(correct_answers * tf.log(calculated_output))
optimizer = tf.train.GradientDescentOptimizer(.5).minimize(cross_entropy)
sess.run(tf.initialize_all_variables())
for i in range(1000):
    x, y = generate_data(100)
    sess.run(optimizer, feed_dict={input_data: x, correct_answers: y})
error = tf.reduce_sum(tf.abs(calculated_output - correct_answers))
x, y = generate_data(100)
print("Total Error: ", error.eval(feed_dict={input_data: x, correct_answers: y}))
It seems that the error is always around 7522.1, which is very bad for just 100 data points, so I assume it is not learning.
My questions: Is my machine learning? If so, what can I do to make it more accurate? If not, how can I make it learn?
There are a few major issues with the code. Aaron has already identified some of them, but there's another important one: calculated_output and correct_answers are not the same shape, so you're creating a 2D matrix when you subtract them. (The shape of calculated_output is (100, 1) and the shape of correct_answers is (100,), so the difference broadcasts to (100, 100).) So you need to adjust the shape (for example, by using tf.squeeze on calculated_output).
A product is not a linear function, so a single linear layer cannot represent it exactly, but for inputs in [0, 1) you can already get a rough fit with no activations and only one layer. The following code gets a total error of about 6 (~0.06 error on average for each test point). Hope that helps!
import tensorflow as tf
import numpy as np
# Teach how to multiply
def generate_data(how_many):
    data = np.random.rand(how_many, 2)
    answers = data[:, 0] * data[:, 1]
    return data, answers
sess = tf.InteractiveSession()
input_data = tf.placeholder(tf.float32, shape=[None, 2])
correct_answers = tf.placeholder(tf.float32, shape=[None])
weights_1 = tf.Variable(tf.truncated_normal([2, 1], stddev=.1))
bias_1 = tf.Variable(.0)
output_layer = tf.matmul(input_data, weights_1) + bias_1
mean_squared = tf.reduce_mean(tf.square(correct_answers - tf.squeeze(output_layer)))
optimizer = tf.train.GradientDescentOptimizer(.1).minimize(mean_squared)
sess.run(tf.initialize_all_variables())
for i in range(1000):
    x, y = generate_data(100)
    sess.run(optimizer, feed_dict={input_data: x, correct_answers: y})
error = tf.reduce_sum(tf.abs(tf.squeeze(output_layer) - correct_answers))
x, y = generate_data(100)
print("Total Error: ", error.eval(feed_dict={input_data: x, correct_answers: y}))
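The error of about 0.06 per point is what the best possible linear model gives on this data. For x1, x2 uniform on [0, 1), x1*x2 = 0.5*x1 + 0.5*x2 - 0.25 + (x1 - 0.5)*(x2 - 0.5), and the last term, which no linear model can capture, has a mean absolute value of 1/16 = 0.0625. A NumPy sketch that checks this with a closed-form least-squares fit (sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((10000, 2))                 # inputs uniform in [0, 1)
answers = data[:, 0] * data[:, 1]

# Closed-form least-squares fit of answers ~ w1*x1 + w2*x2 + bias.
design = np.column_stack([data, np.ones(len(data))])
coeffs, *_ = np.linalg.lstsq(design, answers, rcond=None)

mean_abs_error = np.mean(np.abs(design @ coeffs - answers))
print(coeffs)           # close to [0.5, 0.5, -0.25]
print(mean_abs_error)   # close to 0.0625
```

Getting much below this error would need a model that can represent the interaction between the two inputs, for example a hidden layer with a non-linearity.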
The way you are using softmax is weird. Softmax is normally used when you want to have a probability distribution over a set of classes. In your code it looks like you have a one dimensional output. The softmax is not helping you there.
The cross entropy loss function is appropriate in classification problems but you are doing regression. You should try using a mean squared error loss function instead.
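To make the softmax point concrete: tf.nn.softmax normalizes over the last axis, and here that axis has size 1, so every output is exactly 1 regardless of the weights. Its log is 0, the loss is constant, and nothing is learned. It also accounts for the reported error: with the output stuck at 1, the (100, 1) vs (100,) subtraction sums 100 * 100 terms of size about 0.75 each, i.e. about 7500. A NumPy sketch, where the softmax helper is a hand-written stand-in for tf.nn.softmax:

```python
import numpy as np

def softmax(z, axis=-1):
    """Hand-written stand-in for tf.nn.softmax over the given axis."""
    z = z - z.max(axis=axis, keepdims=True)    # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(100, 1))             # one output per example
probs = softmax(logits)                        # shape (100, 1)
print(np.all(probs == 1.0))                    # True: softmax over one value is always 1

# With the output stuck at 1, the question's error metric broadcasts
# (100, 1) against (100,) and sums 100 * 100 terms.
answers = rng.random(100) * rng.random(100)    # products of uniform numbers
total_error = np.sum(np.abs(probs - answers))
print(total_error)                             # on the order of 7500
```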
I started with a simple implementation of single-variable linear gradient descent, but I don't know how to extend it to a multivariate stochastic gradient descent algorithm.
Single variable linear regression
import tensorflow as tf
import numpy as np
# create random data
x_data = np.random.rand(100).astype(np.float32)
y_data = x_data * 0.5
# Find values for W that compute y_data = W * x_data
W = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
y = W * x_data
# Minimize the mean squared errors.
loss = tf.reduce_mean(tf.square(y - y_data))
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
# Before starting, initialize the variables
init = tf.initialize_all_variables()
# Launch the graph.
sess = tf.Session()
sess.run(init)
# Fit the line.
for step in range(2001):
    sess.run(train)
    if step % 200 == 0:
        print(step, sess.run(W))
Your question has two parts:
How to move this problem to a higher-dimensional space.
How to change from batch gradient descent to stochastic gradient descent.
To get a higher-dimensional setting, you can define your linear problem as y = <x, w>. Then you just need to change the dimension of your Variable W to match that of w and replace the multiplication W*x_data with the matrix product tf.matmul(x_data, W), and your code should run just fine.
To change the learning method to stochastic gradient descent, you need to abstract the input of your cost function by using tf.placeholder.
Once you have defined X and y_ to hold your input at each step, you can construct the same cost function. Then you need to run your training step, feeding it the proper mini-batch of your data.
Here is an example of how you could implement such behavior; it should show that W quickly converges to w.
import tensorflow as tf
import numpy as np
# Define dimensions
d = 10 # Size of the parameter space
N = 1000 # Number of data sample
# create random data
w = .5*np.ones(d)
x_data = np.random.random((N, d)).astype(np.float32)
y_data = x_data.dot(w).reshape((-1, 1))
# Define placeholders to feed mini_batches
X = tf.placeholder(tf.float32, shape=[None, d], name='X')
y_ = tf.placeholder(tf.float32, shape=[None, 1], name='y')
# Find values for W that compute y_data = <x, W>
W = tf.Variable(tf.random_uniform([d, 1], -1.0, 1.0))
y = tf.matmul(X, W, name='y_pred')
# Minimize the mean squared errors.
loss = tf.reduce_mean(tf.square(y_ - y))
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
# Before starting, initialize the variables
init = tf.initialize_all_variables()
# Launch the graph.
sess = tf.Session()
sess.run(init)
# Fit the line.
mini_batch_size = 100
n_batch = N // mini_batch_size + (N % mini_batch_size != 0)
for step in range(2001):
    i_batch = (step % n_batch)*mini_batch_size
    batch = x_data[i_batch:i_batch+mini_batch_size], y_data[i_batch:i_batch+mini_batch_size]
    sess.run(train, feed_dict={X: batch[0], y_: batch[1]})
    if step % 200 == 0:
        print(step, sess.run(W))
Two side notes:
The implementation above is called mini-batch gradient descent, since at each step the gradient is computed using a subset of the data of size mini_batch_size. This is a variant of stochastic gradient descent that is commonly used to stabilize the estimate of the gradient at each step. Plain stochastic gradient descent can be obtained by setting mini_batch_size = 1.
The dataset can be shuffled at every epoch to get an implementation closer to the theoretical setting. Some recent work also considers using only one pass through the dataset, as it prevents over-fitting. For a more mathematical and detailed explanation, see Bottou12. This can easily be changed according to your problem setup and the statistical properties you are looking for.
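As an illustration of the shuffling note, here is a sketch of mini-batch gradient descent with a fresh permutation at every epoch. It is written in plain NumPy rather than TensorFlow to keep it short, and the learning rate and number of epochs are illustrative choices, not tuned values:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 10, 1000
w_true = 0.5 * np.ones(d)
x_data = rng.random((N, d))
y_data = x_data @ w_true

W = rng.uniform(-1.0, 1.0, size=d)
lr, mini_batch_size, n_epochs = 0.1, 100, 200   # illustrative values

for epoch in range(n_epochs):
    order = rng.permutation(N)                  # reshuffle once per epoch
    for start in range(0, N, mini_batch_size):
        idx = order[start:start + mini_batch_size]
        xb, yb = x_data[idx], y_data[idx]
        grad = 2.0 * xb.T @ (xb @ W - yb) / len(idx)   # gradient of the batch MSE
        W -= lr * grad

print(np.max(np.abs(W - w_true)))   # small: W has converged to w_true
```

Setting mini_batch_size = 1 in this loop gives plain stochastic gradient descent, and setting it to N gives full-batch gradient descent.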