I have a piece of code which calculates positions of some satellites and planets using Skyfield. For clarity, I use Pandas DataFrame as a container of positions and corresponding time moments. I want to make calculation parallel, but always getting the same error: TypeError: can't pickle Satrec objects. Different parallelizers were tested, like Dask, pandarallel, swifter and Pool.map().
Example of piece of code to be parallelized:
def get_sun_position(self, row):
t = self.ts.utc(row["Date"]) # from skyfield
pos = self.earth.at(t).observe(self.sun).apparent().position.m # from skyfield, error is here
return pos
def get_sat_position(self, row):
t = self.ts.utc(row["Date"]) # from skyfield
pos = self.sat.at(t).position.m # from skyfield, error is here
return pos
def get_positions(self):
self.df["sat_pos"] = self.df.swifter.apply(self.get_sat_position, axis=1) # all the parallelization goes here
self.df["sun_pos"] = self.df.swifter.apply(self.get_sun_position, axis=1) # and here
# the same implementation but using dask
# self.df["sat_pos"] = dd.from_pandas(self.df, npartitions=4*cpu_count())\
# .map_partitions(lambda df : df.apply(lambda row : self.get_sat_position(row),axis=1))\
# .compute(scheduler='processes')
# self.df["sun_pos"] = dd.from_pandas(self.df, npartitions=4*cpu_count())\
# .map_partitions(lambda df : df.apply(lambda row : self.get_sun_position(row),axis=1))\
# .compute(scheduler='processes')
For Dask to avoid Pickle I tried to set serializaton manually like this serializers=['dask', 'pickle'] but it didn't help.
As I understand, Skyfield uses sgp4 which contains Satrec class.
I would be wondering if there is some way to parallelize this .apply(). Or maybe I should not try Skyfield functions for parallel processing at all?
Alas, all of the mechanisms you are using to make the computation parallel do so by creating another process and then sending copies of all of the objects involved in the computation over to the other process — and the Satrec object is written in C++, not Python, to make it faster, and C++ objects have no native way to "serialize" themselves into bytes for transmission to another process. (Python objects have that ability built-in.)
Have you profiled your code to see what the most expensive steps are? My guess is that most of your expense is in the Sun computation, because to achieve its high precision Skyfield needs to compute the Earth's orientation to very high accuracy to give the Sun's position in the sky to high enough precision for even radio astronomers.
But if you yourself don't need that high an accuracy, you could switch to lower-precision sky coordinates for the Sun. Before using t in get_sun_position(), try doing this to it:
t._nutation_angles = iau2000b(t.tt)
That will use a lower precision estimate of the Earth's nutation (print out the values before and after this change to see how big the difference is, and compare that to how much inaccuracy your application can stand), but also hopefully run faster.
Related
I have doing an analysis trying to see the relation between training time and maximum iteration in SVC. The data I use is some randomly generated number and I plotted the training time against max_iter of SVC fit. I checked logs and each binary classifier has reached the max_iter (I output all console logs which showed detailed warning for each binary classifier and count them). However, I was assuming that the training time will be strictly linear to the iteration but actually, in the case that the training data has many labels e.g. say 40, then the plot does not show it's linear.
It seems as the maximum iteration goes up, each iteration takes slight less time than before. While if we changed label_size to be 2 (which means each fit contains only 1 binary classifier), the line is straight.
What causes that to happen?
Here is my source code:
# -*- coding: utf-8 -*-
import numpy as np
from sklearn.svm import SVC
import time
import pandas as pd
def main(row_size, label_size):
np.random.seed(2019)
y = np.array([i for i in range(label_size) for j in range(row_size
/ label_size)])
if len(y) < row_size:
y = np.append(y, [y[-1]] * (row_size - len(y)))
X = np.random.rand(row_size, 300)
print X.shape, y.shape
return (X, y)
def train_svm(X, y, max_iter):
best_params = {'C': 1}
clf = SVC(
C=best_params['C'],
kernel=str('linear'),
probability=False,
class_weight='balanced',
max_iter=max_iter,
random_state=2018,
verbose=True,
)
start = time.time()
clf.fit(X, y)
end = time.time()
return end - start
if __name__ == '__main__':
row_size = 20000
m_iter = range(10, 401, 20)
label_size = [40]
data = {
'label_size': [],
'max_iter': [],
'row_size': [],
'time': [],
}
for it in m_iter:
for l in label_size:
(X, y) = main(row_size, l)
t = train_svm(X, y, max_iter=it)
data['label_size'].append(l)
data['max_iter'].append(it)
data['row_size'].append(row_size)
data['time'].append(t)
df = pd.DataFrame(data)
df.to_csv('svc_iter.csv', index=None)
Well, there could be loads of reasons for that "very slight change". Scikit-Learn doesn't operate natively, it's built upon different libraries and it may be using loads of optimizers..etc!
Besides, your first graph is very close to linear!
Nevertheless, a big noticeable reasonable factor that contributed in those tiny changes is the Decomposition Method in Support Vector Machine.
The idea of decomposition methodology for classification tasks is to break down a complex classification task into several simpler and more manageable sub-tasks that are solvable by using existing induction methods, then joining their solutions together in order to solve the original problem.
This method is an iterative process and in each iteration only few variables are updated.
For more details about the mathematical approach, please refer to this paper, section 6.2 The Decomposition Method..
Moreover and specifically speaking, SVM implements two tricks called shrinking and caching for the decomposition method.
The Shrinking idea is that an optimal solution α of the SVM dual problem may contain some bounded elements (i.e., αi = 0 or C). These elements may have already been bounded in the middle of the decomposition iterations. To save the training time, the shrinking technique tries to identify and remove some bounded elements, so a smaller optimization problem is solved.
The Caching idea is an effective technique for reducing the computational time of the decomposition method, so elements are calculated as needed. We can use available memory (called kernel cache) to store some recently used permutation of the matrix Qij. Then, some kernel elements may not need to be recalculate.
For more details about the mathematical approach, please refer to this paper, section 5 Shrinking and Caching.
Technical Proof:
I repeated your experiment (that's way I asked for your code to follow the same exact approach), with and without using the shrinking and caching optimization.
Using Shrinking and Caching
The default value of the parameter shrinking in sklearn SVC is set to True, keeping that as it is, produced the following output:
Plotting it gives:
Note how at some point, the time drops noticeably reflecting the effect of shrinking and caching.
Without Using Shrinking and Caching
Using the same exact approach but this time, setting the parameter shrinking explicitly to False as follows:
clf = SVC(
C=best_params['C'],
kernel=str('linear'),
probability=False,
class_weight='balanced',
max_iter=max_iter,
random_state=2018,
verbose=True,
shrinking=False
)
Produced the following output:
Plotting it gives:
Note how unlike previous plot, there is no noticeable drop in time at some point, it's rather just a very tiny fluctuations along with the entire plot.
Comparing Pearson Correlations
In conclusion:
Without using the Shrinking and Caching (updated later with caching), the linearity improved, although it's not 100% linear, but if you take into account that Scikit-Learn internally uses libsvm library to handle all computations. And this library is wrapped using C and Cython, you would have a higher tolerance to your definition about 'Linear' of the relationship between maximum iterations and time. Also, here is a cool discussion about why algorithms may not give the exact same precise definite running time every time.
And that would be even clearer to you if you plot the interval times, so you can see clearly how the drops happen suddenly noticeably in more than one place.
While it keeps almost the same flow without using the optimization tricks.
Important Update
It turned out that the aforementioned reason for this issue (i.e. Shrinking and Caching) is correct, or more precisely, it's a very big factor of that phenomenon.
But the thing I missed is the following:
I was talking about Shrinking and Caching but I missed the later parameter for caching which is set by default to 200 MB.
Repeating the same simulations more than one time and setting the cache_size parameter to a very small number (because zero is not acceptable and throws an error) in addition to shrinking=False, resulted in an extremely-close-to linear pattern between max_iter and time:
clf = SVC(
C=best_params['C'],
kernel=str('linear'),
probability=False,
class_weight='balanced',
max_iter=max_iter,
random_state=2018,
verbose=False,
shrinking=False,
cache_size = 0.000000001
)
By the way, you don't need to set verbose=True, you can check if it reached the maximum iteration via the ConvergenceWarning, so you can redirect those warnings to a file and it'll be million times easier to follow, just add this code:
import warnings, sys
def customwarn(message, category, filename, lineno, file=None, line=None):
with open('warnings.txt', 'a') as the_file:
the_file.write(warnings.formatwarning(message, category, filename, lineno))
warnings.showwarning = customwarn
Also you don't need to re-generate the dataset after each iteration, so take it out the loop like this:
(X, y) = main(row_size, 40)
for it in m_iter:
....
....
Final Conclusion
Shrinking and Caching tricks coming from Decomposition Method in SVM play a big significant role in improving the execution time as the number of iterations increases. Besides, there are other small players that may be contributing in this matter such as internal usage of libsvm library to handle all computations which is wrapped using C and Cython.
Given 2 large arrays of 3D points (I'll call the first "source", and the second "destination"), I needed a function that would return indices from "destination" which matched elements of "source" as its closest, with this limitation: I can only use numpy... So no scipy, pandas, numexpr, cython...
To do this i wrote a function based on the "brute force" answer to this question. I iterate over elements of source, find the closest element from destination and return its index. Due to performance concerns, and again because i can only use numpy, I tried multithreading to speed it up. Here are both threaded and unthreaded functions and how they compare in speed on an 8 core machine.
import timeit
import numpy as np
from numpy.core.umath_tests import inner1d
from multiprocessing.pool import ThreadPool
def threaded(sources, destinations):
# Define worker function
def worker(point):
dlt = (destinations-point) # delta between destinations and given point
d = inner1d(dlt,dlt) # get distances
return np.argmin(d) # return closest index
# Multithread!
p = ThreadPool()
return p.map(worker, sources)
def unthreaded(sources, destinations):
results = []
#for p in sources:
for i in range(len(sources)):
dlt = (destinations-sources[i]) # difference between destinations and given point
d = inner1d(dlt,dlt) # get distances
results.append(np.argmin(d)) # append closest index
return results
# Setup the data
n_destinations = 10000 # 10k random destinations
n_sources = 10000 # 10k random sources
destinations= np.random.rand(n_destinations,3) * 100
sources = np.random.rand(n_sources,3) * 100
#Compare!
print 'threaded: %s'%timeit.Timer(lambda: threaded(sources,destinations)).repeat(1,1)[0]
print 'unthreaded: %s'%timeit.Timer(lambda: unthreaded(sources,destinations)).repeat(1,1)[0]
Retults:
threaded: 0.894030461056
unthreaded: 1.97295164054
Multithreading seems beneficial but I was hoping for more than 2X increase given the real life dataset i deal with are much larger.
All recommendations to improve performance (within the limitations described above) will be greatly appreciated!
Ok, I've been reading Maya documentation on python and I came to these conclusions/guesses:
They're probably using CPython inside (several references to that documentation and not any other).
They're not fond of threads (lots of non-thread safe methods)
Since the above, I'd say it's better to avoid threads. Because of the GIL problem, this is a common problem and there are several ways to do the earlier.
Try to build a tool C/C++ extension. Once that is done, use threads in C/C++. Personally, I'd only try SIP to work, and then move on.
Use multiprocessing. Even if your custom python distribution doesn't include it, you can get to a working version since it's all pure python code. multiprocessing is not affected by the GIL since it spawns separate processes.
The above should've worked out for you. If not, try another parallel tool (after some serious praying).
On a side note, if you're using outside modules, be most mindful of trying to match maya's version. This may have been the reason because you couldn't build scipy. Of course, scipy has a huge codebase and the windows platform is not the most resilient to build stuff.
I have an application that uses a number of classes inheriting from HasTraits. Some of these classes manage access to data and others provide functions for analyzing that data. This works wonderfully for a gui -- I can check that the data and analysis code is doing what it should. However, I've noticed that when I use these classes for gui-less computations, all the cpus on the system end up getting used.
Here is a small example that shows the cpu usage:
from traits.api import HasTraits, List, Int, Enum, Instance
import numpy as np
import psutil
from itertools import combinations
"""
Small example of high CPU usage by traited classes
"""
class DataStorage(HasTraits):
nsamples = Int(2000)
samples = List
def _samples_default(self):
return np.random.randn(self.nsamples,2000).tolist()
def sample_samples(self,indices):
""" return a 2D array of data at indices """
return np.array(
[self.samples[i] for i in indices])
class DataAccessor(HasTraits):
""" Class that grabs data and computes something """
measure = Enum("correlation","covariance")
data_source = Instance(DataStorage,())
def compute_measure(self,indices):
""" example of some computation """
samples = self.data_source.sample_samples(indices)
percentage = psutil.cpu_percent(interval=0, percpu=True)
if self.measure == "correlation":
result = np.corrcoef(samples)
elif self.measure == "covariance":
result = np.cov(samples)
return percentage
# Run a simulation to see cpu usage
analyzer = DataAccessor()
usage = []
n_iterations = 0
max_iterations = 500
for combo in combinations(np.arange(2000),500):
# evaluate the measurement on a subset of the data
usage.append(analyzer.compute_measure(combo))
n_iterations += 1
if n_iterations > max_iterations:
break
print n_iterations
use_percents = np.array(usage).T
When I run this on an 8-cpu machine running CentOS, top reports the python process at roughly 600%.
>>> use_percents.mean(1)
shows
array([ 67.05548902, 67.06906188, 66.89041916, 67.28942116,
66.69421158, 67.61437126, 99.8007984 , 67.31996008])
Question:
My computation is embarrassingly parallel, so it would be great to have the other cpus available to split up the job. Does anyone know what's happening here? A plain python version of this uses 100% on a single cpu.
Is there a way to keep everything local to a single cpu without rewriting all my classes without traits?
Traits is not causing the CPU usage. It's easy to rewrite this bit of code without Traits, and you will see that you get the same pattern of CPU usage (at least, I do).
Instead, what you are probably seeing is the CPU usage of the BLAS library that your build of numpy is linked against. numpy.corrcoeff() calls numpy.cov(), and much of the computation of numpy.cov() is taken up by a numpy.dot() call, which does a matrix-matrix multiplication using BLAS. If it is an optimized BLAS library, then it will usually use non-Python threads internally to split up these computations among your CPUs. You will have to consult the documentation of your optimized BLAS library to find out how to change this.
I am basically building a 3D scatter plot using primitive UV spheres and am running into memory issues when attempting to create more than a couple hundred points at one time. I am limited on my laptop with a 2.1Ghz processor but wanted to know if there is a better way to write this:
import bpy
import random
while count < 5:
bpy.ops.mesh.primitive_uv_sphere_add(size=.3,\
location=(random.randint(-9,9), random.randint(-9,9),\
random.randint(-9,9)), rotation=(0,0,0))
count += 1
I realize that with such a simple script any performance increase is likely negligible but wanted to give it a shot anyway.
Some possible suggestions
I would pre-calculate the x,y,z values, store them in a mathutil vector and add it to a dict to be iterated over.
Duplication should provide a smaller memory footprint than
instantiating new objects. bpy.ops.object.duplicate_move(OBJECT_OT_duplicate=(linked:false, TRANSFORM_OT_translate=(transform)
Edit:
Doing further research it appears each time a bpy.ops.* is called the redraw function . One user documentented exponential increase in time taken to genenerate UV sphere.
CoDEmanX provided the following code snippet to another user.
import bpy
bpy.ops.object.select_all(action='DESELECT')
bpy.ops.mesh.primitive_uv_sphere_add()
sphere = bpy.context.object
for i in range(-1000, 1000, 2):
ob = sphere.copy()
ob.location.y = i
#ob.data = sphere.data.copy() # uncomment this, if you want full copies and no linked duplicates
bpy.context.scene.objects.link(ob)
bpy.context.scene.update()
Then it is just a case of adapting the code to set the object locations
obj.location = location_dict[i]
I am developing an application for color blind people to enable them smoothly surf the Internet. I have a set of colors, lets say A , which consists of all the colors seen by a color blind person. Set A is calculated using a big calculation involving millions of colors. Set A is independent of inputs taken in my application i.e set A is like a 'constant' to me (just like 'pi' in mathematics). Now I want to store set A so that whenever I run my application, it is available without any added computational cost i.e i don't have to calculate A every time I run my application.
My Try:
I think this can be done by building a class having one constant but can it be done without creating any special class for just a constant?
I am using Python!
No need for a class. You want to store the calculated values on disk and load them back again on startup: for that you will want to look into the shelve or pickle libraries.
Yes, you can certainly do this with Python
If your constant was just a number -- say, you had just discovered tau -- then you would just declare it in a module, and import that module in all of your other source files:
constants.py:
# Define my new super-useful number
TAU = 6.28318530718
everywhere else:
from constants import TAU # Look, no calculations!
Expanding a bit, if you had a more complicated structure, like a dictionary, that took you a long time to compute, then you could just declare that in your module instead:
constants.py:
# Verified results of the national survey
PEPSI_CHALLENGE = {
'Pepsi': 0.57,
'Coke': 0.43,
}
And you can do this for more and more complicated data. The problem, eventually, is that just writing your constants module gets harder and harder, the more complex your data is, and it can be especially hard to update if you occasionally recompute the value you want to cache. In that case, you want to look at pickling the data, possibly as the final step of a python script which calculates it, and then load that data in a module that you import.
To do that, import pickle, and dump a single object out to a disk file:
recalculate.py:
# Here is the script that computes a small value from the hugely complicated domain:
import random
from itertools import groupby
import pickle
# Collect all of the random numbers
random_numbers = [random.randint(0,10) for r in xrange(1000000)]
# TODO: Check this -- this should definitely be 7
most_popular = max(groupby(sorted(random_numbers)),
key=lambda(x, v):(len(list(v)),-L.index(x)))[0]
# Now save the most common random number to disk, using pickle
# Almost any object is picklable like this, but check the docs for the exact details
pickle.dump(most_popular, open('data_cache','w'))
Now, in your constants file, you can simply read the pickled data from the file on disk, and have it available without recalculating it:
constants.py:
import pickle
most_popular = pickle.load(open('data_cache'))
everywhere else:
from constants import most_popular