Hi, I was reading the "Using GPUs" page on the TensorFlow site and I was wondering whether GPU precision performance is ever a factor in TensorFlow. For example, given a machine with two cards,
gaming GPU
+
workstation GPU
is there any implementation in which the workstation card's higher-precision performance could overcome its slower clock speed?
I'm not sure whether such situations would arise in the context of gradient descent, in network performance after training, or elsewhere entirely, but I would love to get some more information on the topic!
Thanks in advance.
TL;DR
The opposite is actually the case. Higher-precision calculations are less desirable for frameworks like TensorFlow, because they mean slower training and larger models (more RAM and disk space).
The long version
Neural networks actually benefit from using lower precision representations. This paper is a good introduction to the topic.
The key finding of our exploration is that deep neural networks can
be trained using low-precision fixed-point arithmetic, provided
that the stochastic rounding scheme is applied while operating on
fixed-point numbers.
They use 16-bit fixed-point numbers rather than the much higher-precision 32-bit floating-point numbers (more information on their difference here).
The following image was taken from that paper. It shows the test error for different rounding schemes, as well as the number of bits dedicated to the integer part of the fixed-point representation. As you can see, the solid red and blue lines (16-bit fixed) have a very similar error to the black line (32-bit float).
The main benefit of, and driver for, going to lower precision is the computational cost and the storage size of the weights, so the higher-precision hardware would not give enough of an accuracy increase to outweigh the cost of slower computation.
Studies like this, I believe, are a large driver behind the specs for neural-network-specific processing hardware, such as Google's new TPU. And even though most GPUs don't support 16-bit floats yet, Google is working to support it.
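As an aside, newer TensorFlow releases expose this trade-off directly through the Keras mixed-precision API. The sketch below is only a minimal illustration (the toy model and layer sizes are made up): it keeps most of the math in float16 while leaving variables and the final layer in float32 for numerical stability.

    import tensorflow as tf

    # Compute in float16 where it is safe; variables stay in float32.
    tf.keras.mixed_precision.set_global_policy("mixed_float16")

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        # Keep the final softmax in float32 to avoid numeric issues.
        tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")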
Related
I'm trying to solve a system of equations with a 1 million x 1 million square matrix and a 1-million-entry solution vector.
To do this, I'm using np.linalg.solve(matrix, answers) but it's taking a very long time.
Is there a way to speed it up?
Thanks @Chris, but that doesn't answer the question, since I've also tried using the SciPy module and it still takes a very long time to solve. I don't think my computer can hold that much data in RAM.
OK, for clarity: I've just found out that the matrix I'm trying to solve is a Hilbert matrix.
Please reconsider the need for solving such a HUGE system unless your system is very sparse.
Indeed, it is barely possible even to store the input and output on a PC storage device: the dense input matrix takes 8 TB with double-precision values, and the output will certainly also take a few TB, not to mention the temporary storage needed to compute the result (at least 8 TB for a dense matrix). Sparse matrices can help a lot if your input matrix is almost entirely zeros, but you would need the matrix to contain >99.95% zeros just to store it in RAM.
Furthermore, the time complexity of solving a system is O(n m min(n,m)), so O(n^3) in your case (see: this post). This means several billion billion operations. A basic mainstream processor does not exceed 0.5 TFlops. In fact, my relatively good i5-9600KF reaches 0.3 TFlops in the computationally intensive LINPACK benchmark. This means the computation would take about a month, assuming it is bounded only by the speed of a mainstream processor. In reality, solving a large system of equations is known to be memory-bound, so it will be much slower in practice, because modern RAM is a bottleneck in modern computers (see: memory wall). So on a mainstream PC this should take from several months to a year, assuming the computation can be done in your RAM, which, as said before, is not possible for a dense system. Since high-end SSDs are about an order of magnitude slower than the RAM of a good PC, you should expect the computation to take several years. Not to mention that a 20 TB high-end SSD is very expensive, and it might be a good idea to consider power outages and OS failures over such a long computation time... Again, sparse matrices can help a lot, but note that sparse solvers are significantly slower than dense ones relative to the operation count, so sparsity only pays off if the number of non-zeros is pretty small.
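For a rough sanity check of these numbers, here is the back-of-the-envelope arithmetic in Python (assuming roughly (2/3)*n^3 flops for an LU-based dense solve and the ~0.3 TFlops sustained rate mentioned above):

    n = 1_000_000
    flops = (2.0 / 3.0) * n**3           # ~6.7e17 operations for a dense LU-based solve
    sustained = 0.3e12                   # ~0.3 TFlop/s sustained on a mainstream CPU
    print(flops / sustained / 86_400)    # ~26 days, i.e. roughly a month

    bytes_needed = n * n * 8             # dense double-precision input matrix
    print(bytes_needed / 1e12)           # 8 TB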
Such systems are solved on supercomputers (or at least large computing clusters), not regular PCs. This requires distributed computing and tools like MPI and distributed linear solvers. A whole field of research works on making these efficient on large-scale systems.
Note that computing approximations can be faster, but the storage problem still has to be solved in the first place...
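If the real system does turn out to be very sparse, a minimal SciPy sketch would look like the following (a hypothetical tridiagonal system is used here purely for illustration; note that a Hilbert matrix is dense and very ill-conditioned, so this route would not apply to it):

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import spsolve

    n = 100_000
    # Hypothetical sparse, diagonally dominant system (3 non-zeros per row).
    A = sp.diags([1.0, -4.0, 1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
    b = np.ones(n)

    x = spsolve(A, b)                    # direct sparse solve
    print(np.allclose(A @ x, b))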
Could anyone suggest a way of defining the computational complexity of a neural network after quantization?
I understand computational complexity as the amount of arithmetic "work" needed to calculate the entire network or a single layer. However, when a neural network has been quantized, the numbers are no longer represented in the same format (the new format depends on the quantization method used, as described here): e.g.,
we go from multiplying two real numbers in all operations to multiplying pairs of int8, float16, etc. values. The latter operations are evidently "simpler" than the multiplication of two reals.
Therefore, this has an effect on the time and memory it takes to carry out the computations, and as a consequence the traditional metrics, such as big-O notation, do not make sense.
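To illustrate the point, one purely hypothetical way to separate the two effects is to count multiply-accumulate operations, which quantization does not change, and then weight each operation by the bit-widths of its operands:

    def dense_layer_macs(in_features: int, out_features: int) -> int:
        # Multiply-accumulate count of a fully connected layer; independent
        # of the numeric format, so big-O is unchanged by quantization.
        return in_features * out_features

    def weighted_cost(macs: int, weight_bits: int, act_bits: int) -> int:
        # MACs weighted by operand bit-widths (a rough "bit-operations" count).
        return macs * weight_bits * act_bits

    macs = dense_layer_macs(1024, 512)
    print(weighted_cost(macs, 32, 32))   # fp32 baseline
    print(weighted_cost(macs, 8, 8))     # int8: same MAC count, ~16x fewer bit-ops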
Is it recommended to use Python's native floating point implementation, or its decimal implementation for use-cases where precision is important?
I thought this question would be easy to answer: if accumulated error has significant implications, e.g. perhaps in calculating orbital trajectories or the like, then an exact representation might make more sense.
I'm unsure what the norms are for run-of-the-mill deep learning use-cases, for scientific computing generally (e.g. many people use numpy or scikit-learn, which I think use floating-point implementations), and for financial computing (e.g. trading strategies).
Does anyone know the norms for floating point vs. Decimal use in python for these three areas?
Finance (Trading Strategies)
Deep Learning
Scientific Computing
Thanks
N.B.: This is /not/ a question about the difference between floating-point and fixed-point representations, or about why floating-point arithmetic produces surprising results. This is a question about what the norms are.
I work more with Deep Learning and Scientific Computing, but since my family runs a finance business, I think I can answer the question.
First and foremost, floating-point numbers are not evil; all you need to do is understand how much precision your project needs.
Finance
In finance, depending on the usage, you can use either decimal or floating-point numbers, and different banks have different requirements. Generally, if you are dealing with cash or cash equivalents, you may use decimal, since the fractional monetary unit is known. For example, for dollars the fractional monetary unit is 0.01, so you can use a decimal to store it, and in the database you can just use NUMBER(20,2) (Oracle) or something similar to store your decimal number. The precision is enough, since banks had systematic ways to minimize errors long before computers appeared; the programmers only need to correctly implement what the bank's guidelines say.
For other things in finance, like analysis and interest rates, using double is enough. Here the precision is not as important, but simplicity matters. CPUs are optimized to calculate floating-point numbers, so no special methods are needed for float arithmetic. Since arithmetic in computers is a huge topic, using an optimized and stable way to perform a calculation is much safer than inventing your own arithmetic. Plus, one or two float calculations will not have a huge impact on the precision. For example, banks usually store the value as a decimal, perform the multiplication with a float interest rate, and then convert back to decimal. This way, errors do not accumulate. Considering we only need two digits to the right of the decimal point, the float's precision is quite enough for such a computation.
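A minimal sketch of that pattern in Python (the rate and amounts are made up): keep the cash amount in Decimal, apply a float interest rate, and round back to two decimal places.

    from decimal import Decimal, ROUND_HALF_EVEN

    balance = Decimal("1000.00")                  # exact decimal, e.g. NUMBER(20,2) in the database
    rate = 0.0175                                 # a plain float is fine for the rate itself

    interest = balance * Decimal(str(rate))       # convert the float once for the multiplication
    interest = interest.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)
    print(interest)                               # 17.50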
I have heard that investment banks use double in all of their systems, since they deal with very large amounts of cash; in these banks, simplicity and performance are more important than precision.
Deep Learning
Deep Learning is one of the fields that does not need high precision but does need high performance. A neural network can have millions of parameters, so the precision of a single weight or bias will not impact the prediction of the network. Instead, the neural network needs to compute very fast in order to train on a given dataset and give out a prediction in a reasonable time. Plus, many accelerators can specifically accelerate a certain type of float: half precision, i.e. fp16. Thus, to reduce the size of the network in memory and to accelerate training and prediction, many neural networks run in a hybrid (mixed-precision) mode. The framework and accelerator driver decide which parameters can be computed in fp16 with minimal overflow and underflow risk, since fp16 has a fairly small range: roughly 6x10^-8 up to 65504. The other parameters are still computed in fp32. In some edge deployments, the usable memory is very small (for example, the K210 and the Edge TPU only have 8 MB of on-board SRAM), so neural networks need to use 8-bit fixed-point numbers to fit on these devices. Fixed-point numbers are, in a sense, the opposite of floating-point numbers: they have a fixed number of digits after the radix point. In the system they are usually represented as int8 or uint8.
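As a toy illustration of the 8-bit idea (assuming the common affine scale/zero-point scheme; real frameworks calibrate these values per tensor or per channel), quantizing and dequantizing a small weight matrix looks roughly like this:

    import numpy as np

    def quantize_int8(x):
        # Map float32 values onto int8 with a single scale and zero point.
        x_min, x_max = float(x.min()), float(x.max())
        scale = (x_max - x_min) / 255.0
        zero_point = round(-x_min / scale) - 128
        q = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)
        return q, scale, zero_point

    def dequantize(q, scale, zero_point):
        return (q.astype(np.float32) - zero_point) * scale

    w = np.random.randn(4, 4).astype(np.float32)
    q, scale, zp = quantize_int8(w)
    print(np.max(np.abs(w - dequantize(q, scale, zp))))   # small reconstruction error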
Scientific Computation
The double type (i.e., 64-bit floating-point numbers) usually meets scientists' needs in scientific computation. In addition, IEEE 754 also defines quad precision (128 bit) to facilitate scientific computation, and Intel's x86 processors have an 80-bit extended-precision format.
However, some scientific computations need arbitrary-precision arithmetic. For example, computing pi or running astronomical simulations requires very high-precision computation. Thus they need something different, called arbitrary-precision floating-point numbers. One of the most famous libraries that supports arbitrary-precision floating-point numbers is the GNU Multiple Precision Arithmetic Library (GMP). Such libraries store the digits of a number across multiple machine words and carry out the arithmetic on them in software, much like long-hand column arithmetic.
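In Python, the standard-library decimal module already provides arbitrary (user-chosen) precision; libraries such as GMP (e.g. via gmpy2) or mpmath serve a similar purpose with better performance for heavy workloads. A minimal example:

    from decimal import Decimal, getcontext

    getcontext().prec = 50               # 50 significant digits
    print(Decimal(2).sqrt())             # sqrt(2) to 50 digits
    print(Decimal(1) / Decimal(7))       # 1/7 carried out to 50 digits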
In general, standard floating-point numbers are designed quite well and elegantly. As long as you understand your needs, floating-point numbers are adequate for most usages.
In the Q-learning algorithm used in Reinforcement Learning with replay, one uses a data structure to store previous experiences that are then used in training (a basic example would be a tuple in Python). For a complex state space, I would need to train the agent in a very large number of different situations to obtain a NN that correctly approximates the Q-values. The experience data will occupy more and more memory, so I should impose an upper limit on the number of experiences to be stored, after which the computer should drop experiences from memory.
Do you think FIFO (first in, first out) would be a good way of handling how experiences are dropped from the agent's memory (that way, after reaching the memory limit, I would discard the oldest experiences, which may be useful for allowing the agent to adapt more quickly to changes in the environment)? How could I compute a good maximum number of experiences to keep in memory, to make sure that Q-learning on the agent's NN converges towards the Q-function approximator I need? (I know this could be done empirically; I would like to know whether an analytical estimator for this limit exists.)
In the preeminent paper on "Deep Reinforcement Learning", DeepMind achieved their results by randomly selecting which experiences should be stored. The rest of the experiences were dropped.
It's hard to say how a FIFO approach would affect your results without knowing more about the problem you're trying to solve. As dblclik points out, this may cause your learning agent to overfit. That said, it's worth trying. There very well may be a case where using FIFO to saturate the experience replay would result in an accelerated rate of learning. I would try both approaches and see if your agent reaches convergence more quickly with one.
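For concreteness, here is a minimal sketch of such a replay buffer (assuming transitions are (state, action, reward, next_state, done) tuples; the capacity is something you would still have to pick empirically):

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity):
            # deque with maxlen gives FIFO eviction: the oldest experience is dropped first.
            self.buffer = deque(maxlen=capacity)

        def push(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            # Uniform random sampling for the training updates.
            return random.sample(self.buffer, batch_size)

        def __len__(self):
            return len(self.buffer)

    memory = ReplayBuffer(capacity=100_000)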
I have a few questions regarding the Attention-OCR model described in this paper: https://arxiv.org/pdf/1704.03549.pdf
Some context
My goal is to let Attention-OCR learn where to look for and read a specific piece of information in a scanned document. It should find a 10-digit number that is (in most cases) preceded by a descriptive label. The layout and type of the documents vary, so I concluded that without an attention mechanism the task is unsolvable due to the variable position...
My first question is: am I interpreting the model's capabilities right? Can it actually solve my problem? (1)
Progress so far
I managed to run the training on my own dataset of about 200k images of size 736x736 (pretty big, though the quality isn't that high, and scaling them down more would make the text unrecognizable). Unfortunately, the machine I have at my disposal only has one GPU (Nvidia Quadro M4000), and time is a significant constraint. I need a proof of concept soon, so I figured I could try to overtrain the model on a significantly smaller dataset, just to see whether it is able to learn.
I managed to overtrain it with 5k images - it predicts every image successfully. But I have some concerns regarding my interpretation of this result. It seems like the model hasn't learned where to look for the desired information, but has simply memorized all of the strings, regardless of whether they are actually written somewhere in the document or not. I mean, it's not very surprising that the model memorized it all, but my question is: how many images are needed before the model starts generalizing and actually learning the attention? (2)
Spatial attention
Another thing I'd like to address is the spatial attention mechanism. In the early stage of implementing the model, I assumed that the spatial attention mechanism described in the paper was already included and working. Some time ago I stumbled upon an issue in the TensorFlow repository created by Alexander Gorban (one of the developers of Attention-OCR), where he stated that it is disabled by default.
So I turned it back on and realized that the memory usage became unbelievably high. The shape of the tensor that includes the encoded coordinates changed from
[batch_size, width, height, features]
to
[batch_size, width, height, features+width+height]
That caused the memory consumption to jump by a factor of ~10 (taking the size of my images into account), which I can't afford. This leads to my third question: Is the spatial attention mechanism necessary for my task? (3)
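For reference, the growth of the last dimension can be reproduced with a few lines of NumPy: one-hot x and y coordinates are concatenated onto the feature map, so the channel count goes from features to features+width+height (the sizes below are made up; this only illustrates the shape change, not the actual implementation):

    import numpy as np

    batch_size, width, height, features = 2, 92, 92, 288   # hypothetical sizes
    f = np.zeros((batch_size, width, height, features), dtype=np.float32)

    # One-hot encodings of the x and y position of every spatial location.
    x_coords = np.tile(np.eye(width, dtype=np.float32)[:, None, :], (1, height, 1))
    y_coords = np.tile(np.eye(height, dtype=np.float32)[None, :, :], (width, 1, 1))
    coords = np.concatenate([x_coords, y_coords], axis=-1)              # (w, h, w + h)
    coords = np.broadcast_to(coords, (batch_size,) + coords.shape)

    encoded = np.concatenate([f, coords], axis=-1)
    print(f.shape, "->", encoded.shape)   # last dim grows from features to features+width+height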
Bonus question
Is it possible to visualize the saliency and attention maps with the coordinate encoding disabled?