How to avoid too many ops from a loop in tensorflow? - python

** Edit: to avoid ambiguity, my problem is about "too many" ops, i.e. a large number of ops, not about the numerical value of any particular op.
My old title: How to avoid a large number of ops from a loop in tensorflow?
** The main text follows.
Hi, Guys. I'm using tensorflow 1.12 to implement a network.
The background is:
In my loss function, I want to compute the loss over a large number (about 5000) of random point pairs sampled from every input image, which costs me a lot of memory and time. The indices of the point pairs are chosen by a method that involves randomness, so I don't see how to fold the loop into a single matrix operation. It looks like this (inside my_loss_func()):
# gt & pred are tensors of shape [w, h]
# x1, y1, x2, y2 are lists of the point pairs' coordinates, randomly sampled from the input image
# for example, x1 = [1, 2, 3], y1 = [4, 5, 6], x2 = [7, 8, 9], y2 = [10, 11, 12]
# then it will compute the loss between gt[1, 4] & pred[7, 10], gt[2, 5] & pred[8, 11], etc.
num_pairs = 5000
...
# compute loss for this batch
loss = 0
for i in range(num_pairs):
    diff = gt[x1[i], y1[i]] - pred[x2[i], y2[i]]
    loss += some_math_computation(diff)
return loss / num_pairs
The problem is:
If I build the graph like the above, it creates the loss-computing ops 5k times, which is very expensive. In fact, every time I run the program I have to wait about 10 minutes before the first batch of data starts training. From my DEBUG log I found that the loop runs only about 25 times per second, so the large time cost clearly comes from here. I feel like such a noob.
Can you tell me how to avoid building the graph like this? By the way, I'm using tf.estimator, so I can't just call sess.run() directly.
My opinions:
Maybe I could build the whole graph once and save it to a file. Then every time I run the model I could load the graph from that file, so I wouldn't need to wait so long. Would that work?
Please tell me the correct way to implement this kind of loss. I think maybe this kind of loss is quite common in some fields of computer vision.
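For illustration, here is a rough sketch of what I imagine a fully vectorized version might look like using tf.gather_nd (untested; names follow my snippet above):
import tensorflow as tf

# x1, y1, x2, y2 here would be the sampled coordinates as int32 tensors of shape [num_pairs]
gt_idx = tf.stack([x1, y1], axis=1)        # shape [num_pairs, 2]
pred_idx = tf.stack([x2, y2], axis=1)      # shape [num_pairs, 2]

gt_vals = tf.gather_nd(gt, gt_idx)         # shape [num_pairs]
pred_vals = tf.gather_nd(pred, pred_idx)   # shape [num_pairs]

diff = gt_vals - pred_vals
# some_math_computation would have to operate elementwise on the whole vector
loss = tf.reduce_mean(some_math_computation(diff))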

Related

Linear regression using gradient descent algorithm, getting unexpected results

I'm trying to create a function which returns the values of θ0 & θ1 of the hypothesis function of linear regression. But I'm getting different results for different initial (random) values of θ0 & θ1.
What's wrong in the code?
training_data_set = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]
initial_theta = [1, 0]
def gradient_descent(data, theta0, theta1):
    def h(x, theta0, theta1):
        return theta0 + theta1 * x

    m = len(data)
    alpha = 0.01

    for n in range(m):
        cost = 0
        for i in range(m):
            cost += (h(data[i][0], theta0, theta1) - data[i][1])**2
        cost = cost/(2*m)

        error = 0
        for i in range(m):
            error += h(data[i][0], theta0, theta1) - data[i][1]

        theta0 -= alpha*error/m
        theta1 -= alpha*error*data[n][0]/m

    return theta0, theta1

for i in range(5):
    initial_theta = gradient_descent(training_data_set, initial_theta[0], initial_theta[1])

final_theta0 = initial_theta[0]
final_theta1 = initial_theta[1]

print(f'theta0 = {final_theta0}\ntheta1 = {final_theta1}')
Output:
When initial_theta = [0, 0]
theta0 = 0.27311526522692103
theta1 = 0.7771301328221445
When initial_theta = [1, 1]
theta0 = 0.8829506006170339
theta1 = 0.6669442287905096
Convergence
You've run five iterations of gradient descent over just 5 training samples with a (probably reasonable) learning rate of 0.01. That is not enough to reach a "final" answer to your problem - you'd need to run many iterations of gradient descent, just like you implemented, repeating the process until your thetas converge to a stable value. Only then would it make sense to compare the resulting values.
Replace the 5 in for i in range(5) with 5000 and see what happens. It can also be illustrative to plot the decrease of the cost function to see how fast the process converges to a solution (see the sketch below).
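For instance, a rough sketch of that kind of plot, reusing training_data_set and gradient_descent from your code (the cost helper here is my own addition, not part of your snippet):
import matplotlib.pyplot as plt

def cost(data, theta0, theta1):
    m = len(data)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in data) / (2 * m)

theta = (0, 0)
history = []
for i in range(5000):
    theta = gradient_descent(training_data_set, theta[0], theta[1])
    history.append(cost(training_data_set, theta[0], theta[1]))

plt.plot(history)
plt.xlabel('iteration')
plt.ylabel('cost')
plt.show()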
This is not a problem, it's a very usual thing. To see why, you need to understand how gradient descent works.
Every time you randomly initialise your parameters, the hypothesis starts its journey from a random place. With every iteration it updates the parameters so that the cost function converges. In your case, since you have run your gradient descent for only 5 iterations, different initialisations end up with very different results. Run more iterations and you will see significant similarity even with different initialisations. A visualisation would make this clearer.
Here is how I see gradient descent: imagine that you are high up on a rocky mountainside in the fog. Because of the fog, you cannot see the fastest path down the mountain. So, you look around your feet and go down based on what you see nearby. After taking a step, you look around your feet again, and take another step. Sometimes this will trap you in a small low spot where you cannot see any way down (a local minimum) and sometimes this will get you safely to the bottom of the mountain (global minimum). Starting from different random locations on the foggy mountainside might trap you in different local minima, though you might find your way down safely if the random starting location is good.

Poor Accuracy of Gradient Descent Perceptron

I'm trying to make a start with neural networks from the very beginning. This means starting off toying with perceptrons. At the moment I'm trying to implement batch gradient descent, following the pseudocode from the guide I'm using.
I've tried implementing it as below with some dummy data and noticed it isn't particularly accurate. It converges to what I presume is some local minimum.
My question is:
What ways are there for me to check that this is in fact a local minimum? I've been looking into how to plot this, but I'm unsure how to actually go about it. In addition, is there a way to achieve a more accurate result using gradient descent? Or would I have to use a more complex approach, or possibly run it numerous times starting from different random weights to try and find the global minimum?
I had a look around the forum before posting this, but didn't find much that made me feel confident that what I'm doing here, or what's happening, is in fact correct, so any help would be great.
import pandas as pd
import numpy as np
import random
import math

def main():
    learningRate = 0.1
    np.random.seed(1)

    trainingInput = np.asmatrix([
        [1, -1],
        [2, 1],
        [1.5, 0.5],
        [2, -1],
        [1, 2]
    ])

    biasAccount = np.ones((5, 1))
    trainingInput = np.append(biasAccount, trainingInput, axis=1)

    trainingOutput = np.asmatrix([
        [0],
        [1],
        [0],
        [0],
        [1]
    ])

    weights = 1 * np.random.random((3, 1)) - 1

    for iteration in range(10000):
        prediction = np.dot(trainingInput, weights)
        print("Weights: \n" + str(weights))
        print("Prediction: \n" + str(prediction))

        error = trainingOutput - prediction
        print("Error: \n" + str(error))

        intermediateResult = np.dot(error.T, trainingInput)
        delta = np.dot(learningRate, intermediateResult)
        print("Delta: \n" + str(delta))

        weights += delta.T

main()
There is no guarantee that you'll find the global minimum. Often, people perform multiple runs and take the best one. More advanced approaches include decaying the learning rate, using an adaptive learning rate (e.g. with RMSProp or Adam), or using GD with momentum (see the sketch below).
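For instance, a rough sketch of momentum added to your update loop (the momentum coefficient 0.9 is just a common default, and I've dropped the prints):
velocity = np.zeros_like(weights)
momentum = 0.9  # common default; worth tuning

for iteration in range(10000):
    prediction = np.dot(trainingInput, weights)
    error = trainingOutput - prediction
    gradient = np.dot(error.T, trainingInput).T
    # accumulate a running direction instead of stepping on the raw gradient
    velocity = momentum * velocity + learningRate * gradient
    weights += velocity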
There are multiple ways to monitor convergence:
Use the loss (hint: (t - Xw)X is the derivative) and check for small values or for small changes between steps (see the sketch after this list).
Early stopping: check that the error on a (held-out) validation set decreases; if it doesn't, stop training.
(Possibly, you could even check the distances between the weights in consecutive steps to see if anything still changes.)
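For example, a minimal sketch of the loss-based check, reusing the names from the question's code (the tolerance value is arbitrary):
prev_loss = float('inf')
for iteration in range(10000):
    prediction = np.dot(trainingInput, weights)
    error = trainingOutput - prediction
    loss = float(np.sum(np.square(error))) / 2

    # stop once the loss barely changes between iterations
    if abs(prev_loss - loss) < 1e-9:
        break
    prev_loss = loss

    weights += learningRate * np.dot(error.T, trainingInput).T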

Tensorflow gradients for every item of tensor

I have a network that takes as input an Nx3 matrix and produces an N-dimensional vector. Let's say batch size is 1 and N=1024, so the output has shape (1, 1024). I want to compute the gradient of every dimension of the output with respect to the input, i.e. dy/dx for every y. However, tensorflow's tf.gradients computes d(sum(y))/dx, an aggregated gradient. I know there's no straightforward way to compute the gradients for every output dimension separately, so I finally decided to run tf.gradients 1024 times, because I only have to do this once in the project, and never again.
So I do this:
start = datetime.datetime.now()

output_code_split = tf.split(output_code, 1024)
# output shape = (1024,)

grad_ops = []
for i in range(1024):
    gr = tf.gradients(output_code_split[i], input)
    # output shape = (1024, 1, 16, 1024, 3), where 16 = batch size
    gr = tf.reduce_mean(gr, [0, 1, 2, 3])
    # output shape = (1024,)
    grad_ops.append(gr)

    present = datetime.datetime.now()
    print(i, (present - start).seconds, flush=True)
    # prints time taken to finish the previous computation
    start = datetime.datetime.now()
When the code started running, the time between two iterations was 4 seconds, so I figured it would run for roughly 4096 seconds. However, as the number of iterations increases, the time taken by subsequent iterations keeps increasing. The gap, which was 4 seconds when the code started, eventually grew to 30 seconds after about 500 iterations, which is too much.
Is the list holding the gradient ops, grad_ops, growing bigger and occupying more memory? I'm unfortunately not in a position to do detailed memory profiling of this code. Any ideas about what causes the iteration time to blow up as time goes on?
(Note that in the code I'm only creating the gradient ops, not actually evaluating them. That part comes later, but my code never reaches it because of the extreme slowdown mentioned above.)
Thanks.
What blows up your execution time is that you define new operations on the graph in every iteration of your for loop. Every call to tf.gradients and tf.reduce_mean pushes new nodes onto the graph, which then needs to be recompiled before it can run. What should actually work for you is to use tf.gather with an int32 placeholder that supplies the dimension to your gradient operation, so something like this:
idx_placeholder = tf.placeholder(tf.int32, shape=(None,))
grad_operation = tf.gradients(tf.gather(output_code_split, idx_placeholder), input)

for i in range(1024):
    sess.run(grad_operation, {idx_placeholder: np.array([i])})

Implementing a weighted average on the fly in Python

I have a stream of incoming data and I want to compute a moving average on the fly. If all the elements in the moving average have the same weight it is fairly easy to implement using a 'Queue', but I want the most recent elements to have higher weights, with the weights distributed linearly (not exponentially).
For example, if the moving average has length 5, the current value should have weight 1, the previous one weight 0.8, and so on down to the fifth element in the queue, which should have weight 0.2; so the weight vector is [0.2, 0.4, 0.6, 0.8, 1.0].
I was wondering if anybody knows how to implement this in Python. If there is a faster way to do it, please recommend it; efficiency is important for my specific job.
If you want to keep a weighting vector as described (linearly decreasing weights), you will need to keep all of the past stream values in memory. I quickly tried to sketch a recurrence that would avoid storing the past scalars, without success. This is where exponential weighting has a powerful advantage:
average_t = (1 - a) * x_t + a * average_(t-1)
You only need to keep two variables in memory.
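For instance, a tiny sketch of that recurrence (the smoothing factor a is an arbitrary choice, and stream_of_values stands in for your incoming data):
a = 0.9  # closer to 1 means a longer memory
average = None
for x in stream_of_values:
    average = x if average is None else (1 - a) * x + a * average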
Anyway, if memory is not the limiting factor, your problem boils down to a vector multiplication. I would therefore suggest using the numpy library [1] [2]. See a solution example below (you may well find a more efficient one):
import numpy as np

stream = np.array((20, 40))
n = len(stream)  # n tracks the length of the stream
# (assumed to be more efficient than calling len() repeatedly, at the cost of some safety)

latest_scalar = 60
stream = np.append(stream, latest_scalar)
n += 1

weights = np.arange(1, n + 1)
# [1, 2, 3]

average = np.dot(stream, weights).sum() / (n * (n + 1) / 2)
# n*(n+1)/2 is the total of the weights
# output: 46.666... ok!
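For the fixed-length window from the question (newest weight 1.0 down to 0.2), a deque-based sketch might look like this (my own untested variant, not the answer's code):
from collections import deque
import numpy as np

class LinearWeightedAverage:
    def __init__(self, window=5):
        self.buffer = deque(maxlen=window)
        # oldest element gets the smallest weight, newest gets 1.0
        self.weights = np.arange(1, window + 1) / window  # [0.2, 0.4, 0.6, 0.8, 1.0]

    def update(self, value):
        self.buffer.append(value)
        w = self.weights[-len(self.buffer):]  # handle the warm-up phase
        return float(np.dot(np.asarray(self.buffer), w) / w.sum())

avg = LinearWeightedAverage(window=5)
for x in [20, 40, 60, 80, 100, 120]:
    print(avg.update(x))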

Integer step size in scipy optimize minimize

I have a computer vision algorithm I want to tune up using scipy.optimize.minimize. Right now I only want to tune up two parameters but the number of parameters might eventually grow so I would like to use a technique that can do high-dimensional gradient searches. The Nelder-Mead implementation in SciPy seemed like a good fit.
I got the code all set up, but it seems that the minimize function really wants to use floating-point values with a step size of less than one. The current parameters are both integers: one has a step size of one and the other has a step size of two (i.e. the value must be odd; if it isn't, the thing I am trying to optimize will convert it to an odd number). Roughly speaking, one parameter is a window size in pixels and the other is a threshold (a value from 0-255).
For what it is worth I am using a fresh build of scipy from the git repo. Does anyone know how to tell scipy to use a specific step size for each parameter? Is there some way I can roll my own gradient function? Is there a scipy flag that could help me out? I am aware that this could be done with a simple parameter sweep, but I would eventually like to apply this code to much larger sets of parameters.
The code itself is dead simple:
import numpy as np
from scipy.optimize import minimize
from ScannerUtil import straightenImg
import bson
def doSingleIteration(parameters):
    # do some machine vision magic
    # return the difference between my value and the truth value

parameters = np.array([11, 10])

res = minimize(doSingleIteration, parameters, method='Nelder-Mead',
               options={'xtol': 1e-2, 'disp': True, 'ftol': 1.0})  # not sure if these params do anything

print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
print res
This is what my output looks like. As you can see we are repeating a lot of runs and not getting anywhere in the minimization.
+++++++++++++++++++++++++++++++++++++++++
[ 11. 10.] <-- Output from scipy minimize
{'block_size': 11, 'degree': 10} <-- input to my algorithm rounded and made int
+++++++++++++++++++++++++++++++++++++++++
120 <-- output of the function I am trying to minimize
+++++++++++++++++++++++++++++++++++++++++
[ 11.55 10. ]
{'block_size': 11, 'degree': 10}
+++++++++++++++++++++++++++++++++++++++++
120
+++++++++++++++++++++++++++++++++++++++++
[ 11. 10.5]
{'block_size': 11, 'degree': 10}
+++++++++++++++++++++++++++++++++++++++++
120
+++++++++++++++++++++++++++++++++++++++++
[ 11.55 9.5 ]
{'block_size': 11, 'degree': 9}
+++++++++++++++++++++++++++++++++++++++++
120
+++++++++++++++++++++++++++++++++++++++++
[ 11.1375 10.25 ]
{'block_size': 11, 'degree': 10}
+++++++++++++++++++++++++++++++++++++++++
120
+++++++++++++++++++++++++++++++++++++++++
[ 11.275 10. ]
{'block_size': 11, 'degree': 10}
+++++++++++++++++++++++++++++++++++++++++
120
+++++++++++++++++++++++++++++++++++++++++
[ 11. 10.25]
{'block_size': 11, 'degree': 10}
+++++++++++++++++++++++++++++++++++++++++
120
+++++++++++++++++++++++++++++++++++++++++
[ 11.275 9.75 ]
{'block_size': 11, 'degree': 9}
+++++++++++++++++++++++++++++++++++++++++
120
+++++++++++++++++++++++++++++++++++++++++
~~~
SNIP
~~~
+++++++++++++++++++++++++++++++++++++++++
[ 11. 10.0078125]
{'block_size': 11, 'degree': 10}
+++++++++++++++++++++++++++++++++++++++++
120
Optimization terminated successfully.
Current function value: 120.000000
Iterations: 7
Function evaluations: 27
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
status: 0
nfev: 27
success: True
fun: 120.0
x: array([ 11., 10.])
message: 'Optimization terminated successfully.'
nit: 7
Assuming that the function to minimize is arbitrarily complex (nonlinear), this is a very hard problem in general. It cannot be guaranteed to be solved optimally unless you try every possible option. I do not know if there are any integer-constrained nonlinear optimizers (I somewhat doubt it), and I will assume you know that Nelder-Mead should work fine if it were a continuous function.
Edit: Considering the comment from @Dougal, I will just add here: set up a coarse+fine grid search first; if you then feel like trying whether your Nelder-Mead works (and converges faster), the points below may help...
But maybe some points that help:
Considering how difficult the integer constraint is, maybe it would be an option to do some simple interpolation to help the optimizer. It should still converge to an integer solution. Of course this requires calculating extra points, but it might solve many other problems. (Even in linear integer programming it is common to solve the unconstrained system first, AFAIK.)
Nelder-Mead starts with N+1 points; these are hard-wired in scipy (at least older versions) to (1+0.05) * x0[j] (for j in all dimensions, unless x0[j] is 0), which you will see in your first evaluation steps. Maybe these can be supplied in newer versions; otherwise you could just change/copy the scipy code (it is pure python) and set it to something more reasonable. Or, if you feel that is simpler, scale all input variables down so that (1+0.05)*x0 is of a sensible size.
Maybe you should cache all function evaluations, since with Nelder-Mead I would guess you will always run into duplicate evaluations (at least at the end); see the sketch after this list.
You have to check how likely it is that Nelder-Mead will just shrink to a single value and give up because it always finds the same result.
You generally must check if your function is well behaved at all... This optimization is doomed if the function does not change smoothly over the parameter space, and even then it can easily run into local minima if you have any of those. (Since you cached all evaluations - see the caching point above - you could at least plot them and have a look at the error landscape without needing any extra evaluations.)
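Here is a rough sketch of the caching idea combined with snapping to the integer grid (my own illustration, guessing at how doSingleIteration is called; lru_cache needs Python 3):
from functools import lru_cache
import numpy as np
from scipy.optimize import minimize

@lru_cache(maxsize=None)
def cached_objective(block_size, degree):
    # the expensive machine-vision evaluation happens only once per integer pair
    return doSingleIteration(np.array([block_size, degree]))

def wrapped(params):
    block_size = int(round(params[0]))
    degree = int(round((params[1] - 1) / 2) * 2 + 1)  # force odd values
    return cached_objective(block_size, degree)

res = minimize(wrapped, np.array([11, 10]), method='Nelder-Mead')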
Unfortunately, Scipy's built-in optimization tools don't easily allow for this. But never fear; it sounds like you have a convex problem, and so you should be able to find a unique optimum, even if it won't be mathematically pretty.
Two options that I've implemented for different problems are creating a custom gradient descent algorithm, and using bisection on a series of univariate problems. If you're doing cross-validation in your tuning, your loss function unfortunately won't be smooth (because of noise from cross-validation on different datasets), but will be generally convex.
To implement gradient descent numerically (without having an analytical method for evaluating the gradient), choose a test point and a second point that is delta away from your test point in all dimensions. Evaluating your loss function at these two points can allow you to numerically compute a local subgradient. It is important that delta be large enough that it steps outside of local minima created by cross-validation noise.
A slower but potentially more robust alternative is to implement bisection for each parameter you're testing. If you know that the problem is jointly convex in your two parameters (or n parameters), you can separate this into n univariate optimization problems and write a bisection algorithm which recursively hones in on the optimal parameters. This can help handle some types of quasiconvexity (e.g. if your loss function takes a background noise value for part of its domain and is convex in another region), but requires a good guess as to the bounds for the initial iteration.
If you simply snap the requested x values to an integer grid without fixing xtol to map to that gridsize, you risk having the solver request two points within a grid cell, receiving the same output value, and concluding that it is at a minimum.
No easy answer, unfortunately.
Snap your floats x, y (a.k.a. winsize, threshold) to an integer grid inside your function, like this:
def func(x, y):
    x = round(x)
    y = round((y - 1) / 2) * 2 + 1  # 1 3 5 ...
    ...
Then Nelder-Mead will see function values only on the grid, and should give you near-integer x, y.
(If you'd care to post your code someplace, I'm looking for test cases for a Nelder-Mead with restarts.)
The Nelder-Mead minimize method now lets you specify the initial simplex vertex points, so you should be able to set the simplex points far apart; the simplex will then flop around, find the minimum, and converge when the simplex size drops below 1 (see the sketch after the link).
https://docs.scipy.org/doc/scipy/reference/optimize.minimize-neldermead.html#optimize-minimize-neldermead
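For the two-parameter case in the question, that might look roughly like this (the vertex values are arbitrary guesses of mine, just chosen to be spread far apart):
init_simplex = np.array([[5.0, 50.0],
                         [25.0, 50.0],
                         [5.0, 200.0]])  # 3 vertices for a 2-parameter problem

res = minimize(doSingleIteration, np.array([11, 10]), method='Nelder-Mead',
               options={'initial_simplex': init_simplex, 'xatol': 0.5})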
The problem is that the algorithm gets stuck trying to shrink its (N+1) simplex.
I'd highly recommend that anyone new to the concept learn more about the geometric shape of a simplex and figure out how the input parameters relate to the points of the simplex. Once you grasp that, then, as I.P. Freeley suggested, the problem can be solved by defining strong initial points for your simplex. Note that this is different from defining your x0, and goes into Nelder-Mead's dedicated options. Here is an example of a higher (4-)dimensional problem. Also note that the initial simplex has to have N+1 points, in this case 5 and in your case 3.
init_simplex = np.array([[1, .1, .3, .3], [.1, 1, .3, .3], [.1, .1, 5, .3],
                         [.1, .1, .3, 5], [1, 1, 5, 5]])

minimum = minimize(Optimize.simplex_objective, x0=np.array([.01, .01, .01, .01]),
                   method='Nelder-Mead',
                   options={'adaptive': True, 'xatol': 0.1, 'fatol': .00001,
                            'initial_simplex': init_simplex})
In this example, x0 gets ignored because initial_simplex is defined. Another useful option in high-dimensional problems is 'adaptive', which takes the number of parameters into account when setting the method's operational coefficients (i.e. α, γ, ρ and σ for reflection, expansion, contraction and shrink, respectively). And if you haven't already, I'd also recommend familiarizing yourself with the steps of the algorithm.
Now, as for the reason this problem is happening: it's because the method gets no good results from an expansion, so it keeps shrinking the simplex smaller and smaller, trying to find a better solution that may or may not exist.
