tensorflow distribute integers according to probabilities - python

I would like to distribute an integer, for example 20, into four parts following a given probability for each part: p=[0.02, 0.5, 0.3, 0.18]
The corresponding Python code is:
import numpy as np
from collections import Counter

frequency = np.random.choice([1, 2, 3, 4], 20, p=[0.02, 0.5, 0.3, 0.18])
np.fromiter(Counter(frequency).values(), dtype=np.float32)
# Out[86]:
# array([8., 8., 4.], dtype=float32)
However, I have over ~1e8 parts, and the number to distribute is not 20 but around 1e10,
so plain Python is really slow.
For example:
frequency = np.random.choice([i for i in range(10**7)], 16**10,
                             p=[0.0000001 for i in range(10**7)])
from collections import Counter
r = np.fromiter(Counter(frequency).values(), dtype=np.float32)
Now it simply yields a MemoryError.
I think tensorflow-gpu is able to overcome this issue, since the output is only of size 10**7.
Does anyone know how to do this?

There are a few issues to think about here.
Running the code on a GPU will not help by itself: GPUs are built for fast computation rather than storage, so a GPU typically has less memory available than the CPU. Moreover, this code may produce a memory error on a CPU too, as it did on my machine. So we first try to overcome that.
Overcoming the MemoryError on CPU:
The line producing the MemoryError is line 1 itself:
In [1]: frequency = np.random.choice([i for i in range(10**7)], 16**10,
   ...:                              p=[0.0000001 for i in range(10**7)])
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
The reason for this is that the output of line 1 is not of size 10**7 but 16**10. Since this is what is causing the MemoryError, the goal should be never to create an array of that size.
To do this, we reduce the size of each sample by a factor and loop over blocks of that reduced size factor times, so that each block is storable. On my machine, a factor of 1000000 does the trick. Once a block has been sampled, we use Counter to turn it into a dictionary of frequencies. The advantage is that the dictionary of frequencies, when converted to a list or numpy array, will never exceed the size of 10**7, which does not give a memory error.
As some elements might not appear in the sampled array of a given block, instead of converting each block's Counter into a list directly, we accumulate the counts into a single dictionary across iterations so that the frequencies of all elements are preserved.
Once the whole loop is done, we convert the accumulated dictionary to an array. I have added a progress bar to track the progress, since the computation may take a long time. Also, you don't need to pass the parameter p to np.random.choice() in your specific case, as the distribution is uniform anyway.
import numpy as np
import tensorflow as tf
from click import progressbar
from collections import Counter

def large_uniform_sample_frequencies(factor=1000000, total_elements=10**7,
                                     sample_size=16**10):
    # Counter that accumulates frequencies across all blocks
    counter_dict = Counter()
    # Iterate over `factor` blocks, each of size sample_size // factor,
    # with a progress bar to track the long-running loop
    with progressbar(range(factor)) as bar:
        for iteration in bar:
            # Generate a uniform random sample of size (16 ** 10) / factor
            # (passing the int is equivalent to choosing from range(total_elements))
            frequency = np.random.choice(total_elements,
                                         sample_size // factor)
            # Add this block's counts to the running totals
            counter_dict.update(Counter(frequency))
    return np.fromiter(counter_dict.values(), dtype=np.float32)
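A quick way to sanity-check the function is to run it on a much smaller, hypothetical configuration before launching the full 16**10 run:

freqs = large_uniform_sample_frequencies(factor=10,
                                         total_elements=10**3,
                                         sample_size=10**6)
# freqs holds at most 10**3 frequencies and its values sum to 10**6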
Using tensorflow-gpu:
Since you mention tensorflow-gpu, I assume you either want to get rid of the MemoryError using tensorflow-gpu, or to run this in conjunction with tensorflow-gpu while using a GPU.
To address the MemoryError, you may try the tf.multinomial() function to the same effect as np.random.choice(), as shown here, but it is unlikely to help, since the underlying problem is storing data of a certain size, not performing some alternate computation.
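For reference, a minimal sketch of drawing one block of samples with tf.multinomial(); the uniform zero logits and the block size are assumptions carried over from the blocked approach above:

import tensorflow as tf

total_elements = 10**7
block_size = 10**6  # assumed block size; sample in blocks for the same memory reasons

# Zero logits give a uniform distribution over total_elements classes
logits = tf.zeros([1, total_elements])
samples = tf.multinomial(logits, block_size)  # shape [1, block_size], dtype int64

with tf.Session() as sess:
    block = sess.run(samples)[0]  # 1-D array of sampled indices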
If you want to run this as part of training some model, for instance, you can use Distributed TensorFlow to place this part of the computation graph on the CPU as a PS task, using the code given above. Here is the final code for that:
# Mention the devices for PS and worker tasks
ps_dev = '/cpu:0'
worker_dev = '/gpu:0'

# Toggle True to place computation on CPU
# and False to place it on the least loaded GPU
is_ps_task = True

# Set device for a PS task
if is_ps_task:
    device_setter = tf.train.replica_device_setter(worker_device=worker_dev,
                                                   ps_device=ps_dev,
                                                   ps_tasks=1)

# Allocate the computation to CPU
with tf.device(device_setter):
    freqs = large_uniform_sample_frequencies()

Related

In SVC from Sklearn, why is the training time not strictly linear to maximum iteration when label size is big?

I have been doing an analysis trying to see the relation between training time and maximum iterations in SVC. The data I use is some randomly generated numbers, and I plotted the training time against max_iter of the SVC fit. I checked the logs, and each binary classifier reached max_iter (I captured all console logs, which showed a detailed warning for each binary classifier, and counted them). However, I assumed the training time would be strictly linear in the number of iterations, but actually, in the case where the training data has many labels (say 40), the plot does not show a linear relationship.
It seems that as the maximum iterations go up, each iteration takes slightly less time than before, while if we change label_size to 2 (which means each fit contains only one binary classifier), the line is straight.
What causes that to happen?
Here is my source code:
# -*- coding: utf-8 -*-
import numpy as np
from sklearn.svm import SVC
import time
import pandas as pd

def main(row_size, label_size):
    np.random.seed(2019)
    y = np.array([i for i in range(label_size)
                  for j in range(row_size // label_size)])
    if len(y) < row_size:
        y = np.append(y, [y[-1]] * (row_size - len(y)))
    X = np.random.rand(row_size, 300)
    print X.shape, y.shape
    return (X, y)

def train_svm(X, y, max_iter):
    best_params = {'C': 1}
    clf = SVC(
        C=best_params['C'],
        kernel=str('linear'),
        probability=False,
        class_weight='balanced',
        max_iter=max_iter,
        random_state=2018,
        verbose=True,
    )
    start = time.time()
    clf.fit(X, y)
    end = time.time()
    return end - start

if __name__ == '__main__':
    row_size = 20000
    m_iter = range(10, 401, 20)
    label_size = [40]
    data = {
        'label_size': [],
        'max_iter': [],
        'row_size': [],
        'time': [],
    }
    for it in m_iter:
        for l in label_size:
            (X, y) = main(row_size, l)
            t = train_svm(X, y, max_iter=it)
            data['label_size'].append(l)
            data['max_iter'].append(it)
            data['row_size'].append(row_size)
            data['time'].append(t)
    df = pd.DataFrame(data)
    df.to_csv('svc_iter.csv', index=None)
Well, there could be many reasons for that "very slight change". Scikit-Learn doesn't do the computation natively; it's built upon other libraries, and it may be using various optimizers, etc.!
Besides, your first graph is very close to linear!
Nevertheless, a big factor contributing to those tiny changes is the decomposition method used in Support Vector Machines.
The idea of the decomposition methodology for classification tasks is to break down a complex classification task into several simpler and more manageable sub-tasks that are solvable with existing induction methods, then join their solutions together in order to solve the original problem.
This method is an iterative process, and in each iteration only a few variables are updated.
For more details about the mathematical approach, please refer to this paper, section 6.2 The Decomposition Method.
Moreover, and more specifically, the SVM implementation uses two tricks for the decomposition method, called shrinking and caching.
The Shrinking idea is that an optimal solution α of the SVM dual problem may contain some bounded elements (i.e., αi = 0 or C). These elements may have already been bounded in the middle of the decomposition iterations. To save the training time, the shrinking technique tries to identify and remove some bounded elements, so a smaller optimization problem is solved.
The caching idea is an effective technique for reducing the computational time of the decomposition method: kernel elements are calculated only as needed, and the available memory (the kernel cache) is used to store recently used elements of the matrix Qij, so some kernel elements may not need to be recalculated.
For more details about the mathematical approach, please refer to this paper, section 5 Shrinking and Caching.
Technical Proof:
I repeated your experiment (that's why I asked for your code, to follow the exact same approach), with and without using the shrinking and caching optimization.
Using Shrinking and Caching
The default value of the shrinking parameter in sklearn's SVC is True; keeping it as-is produced the following output:
Plotting it gives:
Note how, at some point, the time drops noticeably, reflecting the effect of shrinking and caching.
Without Using Shrinking and Caching
Using the exact same approach, but this time setting the parameter shrinking explicitly to False:
clf = SVC(
    C=best_params['C'],
    kernel=str('linear'),
    probability=False,
    class_weight='balanced',
    max_iter=max_iter,
    random_state=2018,
    verbose=True,
    shrinking=False
)
Produced the following output:
Plotting it gives:
Note how, unlike the previous plot, there is no noticeable drop in time at any point; there are just very tiny fluctuations across the entire plot.
Comparing Pearson Correlations
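The comparison itself can be reproduced from the CSV written by the script above; a minimal sketch, assuming pandas is available and the svc_iter.csv file name used earlier:

import pandas as pd

# Pearson correlation between max_iter and training time,
# computed from the CSV produced by the script above
df = pd.read_csv('svc_iter.csv')
print(df['max_iter'].corr(df['time'], method='pearson'))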
In conclusion:
Without using shrinking and caching (updated later regarding caching), the linearity improved, although it is not 100% linear. But if you take into account that Scikit-Learn internally uses the libsvm library to handle all computations, and that this library is wrapped in C and Cython, you should allow more tolerance in your definition of a 'linear' relationship between maximum iterations and time. Also, here is a cool discussion about why algorithms may not give exactly the same precise running time every time.
This becomes even clearer if you plot the interval times, where you can clearly see how the drops happen suddenly and noticeably in more than one place,
while the timings keep almost the same flow when the optimization tricks are not used.
Important Update
It turned out that the aforementioned reason for this issue (i.e. shrinking and caching) is correct, or more precisely, it's a very big factor in that phenomenon.
But the thing I missed is the following:
I talked about shrinking and caching, but I missed the latter's parameter, cache_size, which is set by default to 200 MB.
Repeating the same simulations more than once and setting the cache_size parameter to a very small number (zero is not accepted and throws an error), in addition to shrinking=False, resulted in an extremely-close-to-linear pattern between max_iter and time:
clf = SVC(
    C=best_params['C'],
    kernel=str('linear'),
    probability=False,
    class_weight='balanced',
    max_iter=max_iter,
    random_state=2018,
    verbose=False,
    shrinking=False,
    cache_size=0.000000001
)
By the way, you don't need to set verbose=True; you can check whether the maximum iterations were reached via the ConvergenceWarning. Redirect those warnings to a file and it will be a million times easier to follow; just add this code:
import warnings

def customwarn(message, category, filename, lineno, file=None, line=None):
    with open('warnings.txt', 'a') as the_file:
        the_file.write(warnings.formatwarning(message, category, filename, lineno))

warnings.showwarning = customwarn
Also, you don't need to re-generate the dataset on each iteration, so take it out of the loop, like this:
(X, y) = main(row_size, 40)
for it in m_iter:
    ....
    ....
Final Conclusion
The shrinking and caching tricks, which come from the decomposition method in SVM, play a significant role in improving the execution time as the number of iterations increases. Besides, there are other small players that may contribute to this, such as the internal usage of the libsvm library to handle all computations, which is wrapped in C and Cython.

Optimizing a multithreaded numpy array function

Given 2 large arrays of 3D points (I'll call the first "source" and the second "destination"), I needed a function that would return, for each element of "source", the index of its closest element in "destination", with this limitation: I can only use numpy... so no scipy, pandas, numexpr, cython...
To do this I wrote a function based on the "brute force" answer to this question. I iterate over the elements of source, find the closest element in destination, and return its index. Due to performance concerns, and again because I can only use numpy, I tried multithreading to speed it up. Here are both the threaded and unthreaded functions and how they compare in speed on an 8-core machine.
import timeit
import numpy as np
from numpy.core.umath_tests import inner1d
from multiprocessing.pool import ThreadPool

def threaded(sources, destinations):
    # Define worker function
    def worker(point):
        dlt = (destinations - point)  # delta between destinations and given point
        d = inner1d(dlt, dlt)         # get distances
        return np.argmin(d)           # return closest index

    # Multithread!
    p = ThreadPool()
    return p.map(worker, sources)

def unthreaded(sources, destinations):
    results = []
    #for p in sources:
    for i in range(len(sources)):
        dlt = (destinations - sources[i])  # difference between destinations and given point
        d = inner1d(dlt, dlt)              # get distances
        results.append(np.argmin(d))       # append closest index
    return results

# Setup the data
n_destinations = 10000  # 10k random destinations
n_sources = 10000       # 10k random sources
destinations = np.random.rand(n_destinations, 3) * 100
sources = np.random.rand(n_sources, 3) * 100

# Compare!
print 'threaded: %s' % timeit.Timer(lambda: threaded(sources, destinations)).repeat(1, 1)[0]
print 'unthreaded: %s' % timeit.Timer(lambda: unthreaded(sources, destinations)).repeat(1, 1)[0]
Results:
threaded: 0.894030461056
unthreaded: 1.97295164054
Multithreading seems beneficial, but I was hoping for more than a 2x increase, given that the real-life datasets I deal with are much larger.
All recommendations to improve performance (within the limitations described above) will be greatly appreciated!
OK, I've been reading the Maya documentation on Python, and I came to these conclusions/guesses:
They're probably using CPython inside (several references to that documentation and not any other).
They're not fond of threads (lots of non-thread safe methods)
Given the above, I'd say it's better to avoid threads. Because of the GIL, this is a common problem, and there are several ways to work around it:
Try to build a C/C++ extension. Once that is done, use threads in C/C++. Personally, I'd only try to get SIP to work, and then move on.
Use multiprocessing. Even if your custom Python distribution doesn't include it, you can get a working version since it's all pure Python code. multiprocessing is not affected by the GIL since it spawns separate processes (see the sketch after these suggestions).
The above should've worked out for you. If not, try another parallel tool (after some serious praying).
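As an illustration of the multiprocessing suggestion, here is a rough sketch adapted from the worker in the question. The module-level helpers (_init, _worker, multiprocessed) and the einsum expression standing in for inner1d are my own placeholders, and how well this plays with Maya's bundled interpreter is something you'd have to verify.

import numpy as np
from multiprocessing import Pool

# Nested functions can't be pickled, so the worker lives at module level and
# `destinations` is handed to each process once via an initializer.
_dest = None

def _init(destinations):
    global _dest
    _dest = destinations

def _worker(point):
    dlt = _dest - point                  # delta between destinations and this point
    d = np.einsum('ij,ij->i', dlt, dlt)  # squared distances (same role as inner1d)
    return np.argmin(d)                  # index of the closest destination

def multiprocessed(sources, destinations, processes=None):
    p = Pool(processes, initializer=_init, initargs=(destinations,))
    try:
        return p.map(_worker, list(sources))
    finally:
        p.close()
        p.join()

if __name__ == '__main__':
    destinations = np.random.rand(10000, 3) * 100
    sources = np.random.rand(10000, 3) * 100
    indices = multiprocessed(sources, destinations)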
On a side note, if you're using outside modules, be mindful to match Maya's Python version. This may have been the reason you couldn't build scipy. Of course, scipy has a huge codebase, and the Windows platform is not the friendliest for building things.

multiple cpu usage when accessing data attached to traited classes

I have an application that uses a number of classes inheriting from HasTraits. Some of these classes manage access to data and others provide functions for analyzing that data. This works wonderfully for a gui -- I can check that the data and analysis code is doing what it should. However, I've noticed that when I use these classes for gui-less computations, all the cpus on the system end up getting used.
Here is a small example that shows the cpu usage:
from traits.api import HasTraits, List, Int, Enum, Instance
import numpy as np
import psutil
from itertools import combinations

"""
Small example of high CPU usage by traited classes
"""

class DataStorage(HasTraits):
    nsamples = Int(2000)
    samples = List

    def _samples_default(self):
        return np.random.randn(self.nsamples, 2000).tolist()

    def sample_samples(self, indices):
        """ return a 2D array of data at indices """
        return np.array(
            [self.samples[i] for i in indices])

class DataAccessor(HasTraits):
    """ Class that grabs data and computes something """
    measure = Enum("correlation", "covariance")
    data_source = Instance(DataStorage, ())

    def compute_measure(self, indices):
        """ example of some computation """
        samples = self.data_source.sample_samples(indices)
        percentage = psutil.cpu_percent(interval=0, percpu=True)
        if self.measure == "correlation":
            result = np.corrcoef(samples)
        elif self.measure == "covariance":
            result = np.cov(samples)
        return percentage

# Run a simulation to see cpu usage
analyzer = DataAccessor()
usage = []
n_iterations = 0
max_iterations = 500
for combo in combinations(np.arange(2000), 500):
    # evaluate the measurement on a subset of the data
    usage.append(analyzer.compute_measure(combo))
    n_iterations += 1
    if n_iterations > max_iterations:
        break

print n_iterations
use_percents = np.array(usage).T
When I run this on an 8-cpu machine running CentOS, top reports the python process at roughly 600%.
>>> use_percents.mean(1)
shows
array([ 67.05548902, 67.06906188, 66.89041916, 67.28942116,
66.69421158, 67.61437126, 99.8007984 , 67.31996008])
Question:
My computation is embarrassingly parallel, so it would be great to have the other cpus available to split up the job. Does anyone know what's happening here? A plain python version of this uses 100% on a single cpu.
Is there a way to keep everything local to a single CPU without rewriting all my classes to avoid Traits?
Traits is not causing the CPU usage. It's easy to rewrite this bit of code without Traits, and you will see that you get the same pattern of CPU usage (at least, I do).
Instead, what you are probably seeing is the CPU usage of the BLAS library that your build of numpy is linked against. numpy.corrcoef() calls numpy.cov(), and much of the computation of numpy.cov() is taken up by a numpy.dot() call, which does a matrix-matrix multiplication using BLAS. If it is an optimized BLAS library, then it will usually use non-Python threads internally to split up these computations among your CPUs. You will have to consult the documentation of your optimized BLAS library to find out how to change this.
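If you just want the computation to stay on one core, a common approach is to cap the BLAS thread count through environment variables before numpy is imported; a minimal sketch, where the variable that actually takes effect depends on which BLAS your numpy build links against:

import os

# Limit BLAS threading to a single core; which variable matters depends on
# the backend (OpenBLAS, MKL, OpenMP-based ATLAS...), so set the common ones
# before numpy is imported.
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

import numpy as np  # imported only after the environment is set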

more efficient Python scripting in Blender3D

I am basically building a 3D scatter plot using primitive UV spheres and am running into memory issues when attempting to create more than a couple hundred points at one time. I am limited on my laptop with a 2.1 GHz processor, but wanted to know if there is a better way to write this:
import bpy
import random

count = 0
while count < 5:
    bpy.ops.mesh.primitive_uv_sphere_add(size=.3,
        location=(random.randint(-9, 9), random.randint(-9, 9),
                  random.randint(-9, 9)), rotation=(0, 0, 0))
    count += 1
I realize that with such a simple script any performance increase is likely negligible but wanted to give it a shot anyway.
Some possible suggestions:
I would pre-calculate the x, y, z values, store them in mathutils vectors, and add them to a dict to be iterated over.
Duplication should provide a smaller memory footprint than instantiating new objects, e.g. something like bpy.ops.object.duplicate_move(OBJECT_OT_duplicate={"linked": False}, TRANSFORM_OT_translate={"value": transform}).
Edit:
Doing further research, it appears that each time a bpy.ops.* operator is called, the redraw function runs. One user documented an exponential increase in the time taken to generate UV spheres.
CoDEmanX provided the following code snippet to another user.
import bpy

bpy.ops.object.select_all(action='DESELECT')
bpy.ops.mesh.primitive_uv_sphere_add()
sphere = bpy.context.object

for i in range(-1000, 1000, 2):
    ob = sphere.copy()
    ob.location.y = i
    #ob.data = sphere.data.copy() # uncomment this, if you want full copies and no linked duplicates
    bpy.context.scene.objects.link(ob)

bpy.context.scene.update()
Then it is just a case of adapting the code to set the object locations:
obj.location = location_dict[i]
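For instance, a rough sketch that combines both suggestions; the 500-point count and the size value mirror the original script, and the scene.objects.link / scene.update calls assume the same pre-2.8 Blender API as the snippet above:

import bpy
import random
from mathutils import Vector

# Pre-compute the random point locations once
locations = [Vector((random.randint(-9, 9),
                     random.randint(-9, 9),
                     random.randint(-9, 9))) for _ in range(500)]

# Create a single template sphere with the operator...
bpy.ops.mesh.primitive_uv_sphere_add(size=.3)
template = bpy.context.object

# ...then make cheap linked copies of it instead of calling the operator per point
for loc in locations:
    ob = template.copy()  # shares the mesh data with the template
    ob.location = loc
    bpy.context.scene.objects.link(ob)

bpy.context.scene.update()

Only one mesh datablock is created this way; uncomment a data.copy() line, as in the snippet above, if full copies are needed instead of linked duplicates.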

Optimising memory usage in numpy

The following program loads two images with PyGame, converts them to Numpy arrays, and then performs some other Numpy operations (such as FFT) to emit a final result (of a few numbers). The inputs can be large, but at any moment only one or two large objects should be live.
A test image is about 10M pixels, which translates to 10MB once it's greyscaled. It gets converted to a Numpy array of dtype uint8, which after some processing (applying Hamming windows), is an array of dtype float64. Two images are loaded into arrays this way; later FFT steps result in an array of dtype complex128. Prior to adding the excessive gc.collect calls, the program memory size tended to increase with each step. Additionally, it seems most Numpy operations will give a result in the highest precision available.
Running the test (sans the gc.collect calls) on my 1GB Linux machine results in prolonged thrashing, which I have not waited for. I don't yet have detailed memory use stats -- I tried some Python modules and the time command to no avail; now I'm looking into valgrind. Watching PS (and dealing with machine unresponsiveness in the later stages of the test) suggests a maximum memory usage of about 800 MB.
A 10 million cell array of complex128 should occupy 160 MB. Having (ideally) at most two of these live at one time, plus the not-insubstantial Python and Numpy libraries and other paraphernalia, probably means allowing for 500 MB.
I can think of two angles from which to attack the problem:
Discarding intermediate arrays as soon as possible. That's what the gc.collect calls are for -- they seem to have improved the situation, as it now completes with only a few minutes of thrashing ;-). I think one can expect that memory-intensive programming in a language like Python will require some manual intervention.
Using less-precise Numpy arrays at each step. Unfortunately the operations that return arrays, like fft2, do not appear to allow the type to be specified.
So my main question is: is there a way of specifying output precision in Numpy array operations?
More generally, are there other common memory-conserving techniques when using Numpy?
Additionally, does Numpy have a more idiomatic way of freeing array memory? (I imagine this would leave the array object live in Python, but in an unusable state.) Explicit deletion followed by immediate GC feels hacky.
import sys
import numpy
import pygame
import gc

def get_image_data(filename):
    im = pygame.image.load(filename)
    im2 = im.convert(8)
    a = pygame.surfarray.array2d(im2)
    hw1 = numpy.hamming(a.shape[0])
    hw2 = numpy.hamming(a.shape[1])
    a = a.transpose()
    a = a * hw1
    a = a.transpose()
    a = a * hw2
    return a

def check():
    gc.collect()
    print 'check'

def main(args):
    pygame.init()
    pygame.sndarray.use_arraytype('numpy')
    filename1 = args[1]
    filename2 = args[2]
    im1 = get_image_data(filename1)
    im2 = get_image_data(filename2)
    check()
    out1 = numpy.fft.fft2(im1)
    del im1
    check()
    out2 = numpy.fft.fft2(im2)
    del im2
    check()
    out3 = out1.conjugate() * out2
    del out1, out2
    check()
    correl = numpy.fft.ifft2(out3)
    del out3
    check()
    maxs = correl.argmax()
    maxpt = maxs % correl.shape[0], maxs / correl.shape[0]
    print correl[maxpt], maxpt, (correl.shape[0] - maxpt[0], correl.shape[1] - maxpt[1])

if __name__ == '__main__':
    args = sys.argv
    exit(main(args))
This answer on SO says "Scipy 0.8 will have single precision support for almost all the fft code", and SciPy 0.8.0 beta 1 is just out.
(Haven't tried it myself, cowardly.)
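A minimal sketch of what that would look like, assuming a SciPy version with single-precision FFT support and reusing get_image_data from the code above (the file names are placeholders):

import numpy
from scipy import fftpack

filename1, filename2 = 'a.png', 'b.png'  # placeholder paths

# float32 inputs keep the FFT work in complex64 rather than complex128,
# roughly halving the memory needed for the intermediate arrays
im1 = get_image_data(filename1).astype(numpy.float32)
im2 = get_image_data(filename2).astype(numpy.float32)

out1 = fftpack.fft2(im1)
out2 = fftpack.fft2(im2)
correl = fftpack.ifft2(out1.conjugate() * out2)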
If I understand correctly, you are calculating a convolution between two images. The SciPy package contains a dedicated module for that (ndimage), which might be more memory efficient than the "manual" approach via Fourier transforms. It would be good to try it instead of going through Numpy.
