At the company where I am interning, I was introduced to multi-core programming, and I would like to apply it to a project I am developing for my thesis (I'm not from this area, but my work involves coding).
I want to know if the following is possible: I have a function that is run three times, once for each of three different variables. Since the three runs don't need each other's information, is it possible to run them at the same time on different cores? The calculation process is the same for all of them, so instead of running one variable at a time I would like to run all three at once (performing all the calculations at the same time) and return the results at the end.
Some part of what I would like to optimize:
for v in [obj2_v1, obj2_v2, obj2_v3]:
    distancia_final_v, pontos_intersecao_final_v = calculo_vertice(obj1_normal,
                                                                   obj1_v1,
                                                                   obj1_v2,
                                                                   obj1_v3,
                                                                   obj2_normal,
                                                                   v,
                                                                   criterio)
def calculo_vertice(obj1_normal, obj1_v1, obj1_v2, obj1_v3,
                    obj2_normal, obj2_v, criterio):
    i = 0
    distancia_final_v = []
    pontos_intersecao_final_v = []
    while i < len(obj2_v):
        distancia_relevante_v = []
        pontos_intersecao_v = []
        distancia_inicial = 1000
        for x in range(len(obj1_v1)):
            planeNormal  = np.array([obj1_normal[x][0], obj1_normal[x][1], obj1_normal[x][2]])
            planePoint   = np.array([obj1_v1[x][0], obj1_v1[x][1], obj1_v1[x][2]])              # Any point on the plane
            rayDirection = np.array([obj2_normal[i][0], obj2_normal[i][1], obj2_normal[i][2]])  # Define a ray
            rayPoint     = np.array([obj2_v[i][0], obj2_v[i][1], obj2_v[i][2]])                 # Any point along the ray
            Psi = Calculos.line_plane_collision(planeNormal, planePoint, rayDirection, rayPoint)
            a = Calculos.area_trianglo_3d(obj1_v1[x][0], obj1_v1[x][1], obj1_v1[x][2],
                                          obj1_v2[x][0], obj1_v2[x][1], obj1_v2[x][2],
                                          obj1_v3[x][0], obj1_v3[x][1], obj1_v3[x][2])
            b = Calculos.area_trianglo_3d(obj1_v1[x][0], obj1_v1[x][1], obj1_v1[x][2],
                                          obj1_v2[x][0], obj1_v2[x][1], obj1_v2[x][2],
                                          Psi[0][0], Psi[0][1], Psi[0][2])
            c = Calculos.area_trianglo_3d(obj1_v1[x][0], obj1_v1[x][1], obj1_v1[x][2],
                                          obj1_v3[x][0], obj1_v3[x][1], obj1_v3[x][2],
                                          Psi[0][0], Psi[0][1], Psi[0][2])
            d = Calculos.area_trianglo_3d(obj1_v2[x][0], obj1_v2[x][1], obj1_v2[x][2],
                                          obj1_v3[x][0], obj1_v3[x][1], obj1_v3[x][2],
                                          Psi[0][0], Psi[0][1], Psi[0][2])
            if float("{:.5f}".format(a)) == float("{:.5f}".format(b + c + d)):
                P1 = Ponto(Psi[0][0], Psi[0][1], Psi[0][2])
                P2 = Ponto(obj2_v[i][0], obj2_v[i][1], obj2_v[i][2])
                distancia = Calculos.distancia_pontos(P1, P2) * 10
                if distancia < distancia_inicial and distancia < criterio:
                    distancia_inicial = distancia
                    distancia_relevante_v = []
                    distancia_relevante_v.append(distancia_inicial)
                    pontos_intersecao_v = []
                    pontos_intersecao_v.append(Psi)
            x += 1
        distancia_final_v.append(distancia_relevante_v)
        pontos_intersecao_final_v.append(pontos_intersecao_v)
        i += 1
    return distancia_final_v, pontos_intersecao_final_v
In this example of my code, I want the same process to happen for obj2_v1, obj2_v2 and obj2_v3.
Is it possible to make them run at the same time?
I will be working with a considerable amount of data, so this would probably save me some processing time.
Multiprocessing (using processes to avoid the GIL) is the easiest option, but the achievable gain is limited: the speedup is bounded by the number of cores, see Amdahl's law. There is also a bit of latency involved in starting / stopping workers, which means it is much better suited to tasks that take more than ~10 ms.
In numeric-heavy code (like this seems to be) you really want to move as much of the work as possible "inside numpy"; look at vectorisation and broadcasting. This can give speedups of >50x (just on a single core) while staying easier to understand and reason about.
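For illustration, a minimal sketch of what moving the ray/plane part of the question "inside numpy" could look like (the array names, shapes and random placeholder data are assumptions, not the O/P's actual layout):

import numpy as np

# hypothetical stacked inputs: N planes and M rays, one row per item
plane_normals = np.random.rand(100, 3)      # (N, 3)
plane_points  = np.random.rand(100, 3)      # (N, 3)
ray_dirs      = np.random.rand(50, 3)       # (M, 3)
ray_points    = np.random.rand(50, 3)       # (M, 3)

# t parameter of every ray/plane pair at once, via broadcasting: shape (M, N)
num = np.einsum('nk,mnk->mn', plane_normals,
                plane_points[None, :, :] - ray_points[:, None, :])
den = ray_dirs @ plane_normals.T
t   = num / den                              # assumes no ray is parallel to a plane

# all M*N intersection points in one shot, shape (M, N, 3)
intersections = ray_points[:, None, :] + t[..., None] * ray_dirs[:, None, :]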
If your algorithm is difficult to express using numpy intrinsics, you could also look at using Cython. It lets you write Python-like code that gets automatically compiled down to C, and hence runs a lot faster; 50x faster is also a reasonable expectation, still on a single core.
The numpy and Cython techniques can be combined with multiprocessing (i.e. using multiple cores) to give code that runs hundreds of times faster than naive implementations.
Jupyter notebooks have friendly extensions (known affectionately as "magics") that make it easier to get started with this sort of performance work: the %timeit magic lets you easily time parts of the code, and the Cython extension means you can keep everything in the same notebook.
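As a rough illustration of that notebook workflow (the cell magics are standard IPython/Jupyter commands; the timed call simply re-uses the question's function and argument names):

%load_ext cython        # enables %%cython cells in the same notebook
%timeit calculo_vertice(obj1_normal, obj1_v1, obj1_v2, obj1_v3, obj2_normal, obj2_v1, criterio)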
It's possible, but use Python's multiprocessing lib, because the threading lib doesn't deliver parallel execution.
UPDATE
DON'T do something like this (thanks to user3666197 for pointing out the error):
from multiprocessing.pool import ThreadPool

def calculo_vertice(obj1_normal, obj1_v1, obj1_v2, obj1_v3, obj2_normal, obj2_v, criterio):
    #your code
    return distancia_final_v, pontos_intersecao_final_v

pool = ThreadPool(processes=3)
async_result1 = pool.apply_async(calculo_vertice, (#your args here))
async_result2 = pool.apply_async(calculo_vertice, (#your args here))
async_result3 = pool.apply_async(calculo_vertice, (#your args here))
result1 = async_result1.get()  # result1
result2 = async_result2.get()  # result2
result3 = async_result3.get()  # result3
Instead, something like this should do the job:
from multiprocessing import Process, Pipe

def calculo_vertice(obj1_normal, obj1_v1, obj1_v2, obj1_v3, obj2_normal, obj2_v, criterio, send_end):
    #your code
    send_end.send((distancia_final_v, pontos_intersecao_final_v))

numberOfWorkers = 3
jobs = []
pipeList = []

#Start the processes and build the job list
for i in range(numberOfWorkers):
    recv_end, send_end = Pipe(False)
    process = Process(target=calculo_vertice, args=(#<... your args...>#, send_end))
    jobs.append(process)
    pipeList.append(recv_end)
    process.start()

#Show the results
for job in jobs: job.join()
resultList = [x.recv() for x in pipeList]
print(resultList)
REF.
https://docs.python.org/3/library/multiprocessing.html
https://stackoverflow.com/a/37737985/8738174
This code will create a pool of 3 worker processes, each of which receives its call asynchronously. It's important to point out that in this case you should have 3+ CPU cores, otherwise your system kernel will just switch between the processes and things won't really run in parallel.
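For completeness, a hedged sketch of how the three calls from the question could be wired into this pattern (the argument names are taken from the question, and the imports and calculo_vertice definition are re-used from the snippet above; receiving before joining avoids blocking on large payloads):

if __name__ == '__main__':
    jobs, pipeList = [], []
    for v in [obj2_v1, obj2_v2, obj2_v3]:
        recv_end, send_end = Pipe(False)
        p = Process(target=calculo_vertice,
                    args=(obj1_normal, obj1_v1, obj1_v2, obj1_v3,
                          obj2_normal, v, criterio, send_end))
        jobs.append(p)
        pipeList.append(recv_end)
        p.start()
    results = [r.recv() for r in pipeList]   # one (distancias, pontos) tuple per v
    for job in jobs:
        job.join()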
Q : " Is it possible to make them happen at the same time? "
Yes.
The best results ever will be obtained without adding any Python at all ( the multiprocessing module is not necessary for making 3 full copies ( yes, top-down fully replicated copies ) of the __main__ python process for this so embarrassingly independent processing ).
The reasons for this are explained in detail here and here.
A just-enough tool is GNU parallel :

$ parallel --jobs 3 python job-script.py {} ::: "v1" "v2" "v3"
For all performance-tweaking, read about more configuration details in man parallel.
"Because I will be using a considerable amount of data..."
The Devil is hidden in the details :
The O/P code may be syntactically driving the python interpreter to results, precise ( approximate ) to within some 5 decimal places, yet its core sin is that it stands ultimately poor chances of demonstrating any reasonably achievable performance in doing that, the worse so on a "considerable amount of data".
If they, "at the company", expect some "considerable amount of data", you should do at least some elementary research on what is the processing aimed at.
The worst part ( not mentioning the decomposition of once vectorised-ready numpy-arrays back into atomic "float" coordinate values ) is the point-inside-triangle test.
For a brief analysis on how to speed-up this part ( the more if going to pour "considerable amount of data" on doing this ), get inspired from this post and get the job done in fraction of the time it was drafted in the O/P above.
Indirect testing of a point-inside-triangle by comparing an in-equality of a pair of re-float()-ed-strings, received from sums of triangle-areas ( b + c + d ) is just one of the performance blockers, you will find to get removed.
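As one concrete illustration of that last point ( a sketch, not the only possible fix; the 1e-5 tolerance mirrors the 5-decimal rounding used in the O/P code ):

import math

# instead of: if float("{:.5f}".format(a)) == float("{:.5f}".format(b + c + d)):
if math.isclose(a, b + c + d, rel_tol=0.0, abs_tol=1e-5):
    ...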
Related
I regularly have to download and rename HTML pages in bulk and wrote this simple code for it a while ago:
import shutil
import os
import sys
import socket

socket.setdefaulttimeout(5)

file_read = open(my_file, "r")
lines = file_read.readlines()
for line in lines:
    try:
        import urllib.request
        sl = line.strip().split(";")
        url = sl[0]
        newname = str(sl[1]) + ".html"
        urllib.request.urlretrieve(url, newname)
    except:
        pass
file_read.close()
This works well enough for a few hundred websites, but takes waaaaay too long for a larger number of downloads (20-50k). What would be the simplest and best way to speed it up?
Q :" I regularly have to ...What would be the simplest and best way to speed it up ? "
A :The SIMPLEST ( what the commented approach is not ) &the BEST wayis to at least : (a) minimise all overheads ( 50k times Thread-instantiation costs being one such class of costs ),(b) harness embarrasing independence ( yet, not a being a True-[PARALLEL] ) in process-flow(c) go as close as possible to bleeding edges of a just-[CONCURRENT], latency-masked process-flow
Givenboth the simplicity & performance seem to be the measure of "best"-ness:
Any costs, that do not first justify the costs of introducing themselves by so much increased performance, and second, that do not create additional positive net-effect on performance ( speed-up ) are performance ANTI-patterns & unforgivable Computer Science sins.
ThereforeI could not promote using GIL-lock (by-design even a just-[CONCURRENT]-processing prevented) bound & performance-suffocated step-by-step round-robin stepping of any amount of Python-threads in a one-after-another-after-another-...-re-[SERIAL]-ised chain of about 100 [ms]-quanta of code-interpretation time-blocks a one and only one such Python-thread is being let to run ( where all others are blocked-waiting ... being rather a performance ANTI-pattern, isn't it? ),sorather go in for process-based concurrency of work-flow ( performance gains a lot here, for ~ 50k url-fetches, where large hundreds / thousands of [ms]-latencies ( protocol-and-security handshaking setup + remote url-decode + remote content-assembly + remote content-into- protocol-encapsulation + remote-to-local network-flows + local protocol-decode + ... ).
Sketched process-flow framework :
from joblib import Parallel, delayed

MAX_WORKERs = ( n_CPU_cores - 1 )

def main( files_in ):
    """ __doc__
    .INIT worker-processes, each with a split-scope of tasks
    """
    IDs = range( max( 1, MAX_WORKERs ) )
    RES_if_need = Parallel( n_jobs = MAX_WORKERs
                            )( delayed( block_processor_FUN #-- fun CALLABLE
                                        )( my_file,         #-- fun PAR1
                                           wPROC            #-- fun PAR2
                                           )
                               for wPROC in IDs
                               )

def block_processor_FUN( file_with_URLs = None,
                         file_from_PART = 0
                         ):
    """ __doc__
    .OPEN file_with_URLs
    .READ file_from_PART, row-wise - till next part starts
          - ref. global MAX_WORKERs
    """
    ...
This is the initial, __main__-side Python-interpreter trick: spawn just enough worker-processes, which start crawling the my_file-"list" of URL-s independently, AND an indeed just-[CONCURRENT] flow of work starts, each part being independent of any other.
The block_processor_FUN(), passed by reference to the workers, simply opens the file and starts fetching/processing only its "own" fraction, being from ( wPROC / MAX_WORKERs ) to ( ( wPROC + 1 ) / MAX_WORKERs ) of its number of lines.
That simple.
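A hedged sketch of that per-worker slicing ( names follow the framework above; the exact line-splitting rule and the urlretrieve() body are assumptions, not the framework's final code ):

def block_processor_FUN( file_with_URLs = None,
                         file_from_PART = 0
                         ):
    import urllib.request                                # worker-side deferred import
    with open( file_with_URLs ) as fh:
        lines = fh.readlines()
    lo = (   file_from_PART       * len( lines ) ) // MAX_WORKERs   # ref. global MAX_WORKERs
    hi = ( ( file_from_PART + 1 ) * len( lines ) ) // MAX_WORKERs   #      from the sketch above
    for line in lines[lo:hi]:                            # only this worker's own fraction
        url, name = line.strip().split( ";" )
        urllib.request.urlretrieve( url, name + ".html" )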
If willing to tune up the corner-cases, where some URL may take, and does take, longer, one may improve the form of load-balancing via fair-queueing, yet at a cost of a more complex design ( many process-to-process messaging queues are available ), having a { __main__ | main() }-side FQ/LB-feeder and making the worker-processes retrieve their next task from such a job-request FQ/LB-facility.
That is more complex, but more robust to an uneven distribution of URL-serving durations "across" the my_file-ordered list of URL-s to serve.
The choices of the levels of simplicity / complexity compromises, that impact the resulting performance / robustness, are yours.
For more details you may like to read this, the code from this, and the there-directed examples or tips for further performance-boosting.
Part of my python function looks like this:
for i in range(0, len(longitude_aq)):
    center = Coordinates(latitude_aq[i], longitude_aq[i])
    currentAq = aq[i, :]
    for j in range(0, len(longitude_meo)):
        currentMeo = meo[j, :]
        grid_point = Coordinates(latitude_meo[j], longitude_meo[j])
        if is_in_circle(center, RADIUS, grid_point):
            if currentAq[TIME_AQ] == currentMeo[TIME_MEO]:
                humidity += currentMeo[HUMIDITY_MEO]
                pressure += currentMeo[PRESSURE_MEO]
                temperature += currentMeo[TEMPERATURE_MEO]
                wind_speed += currentMeo[WIND_SPEED_MEO]
                wind_direction += currentMeo[WIND_DIRECTION_MEO]
                count += 1.0
    if count != 0.0:
        final_tmp[i, HUMIDITY_FINAL] = humidity/count
        final_tmp[i, PRESSURE_FINAL] = pressure/count
        final_tmp[i, TEMPERATURE_FINAL] = temperature/count
        final_tmp[i, WIND_SPEED_FINAL] = wind_speed/count
        final_tmp[i, WIND_DIRECTION_FINAL] = wind_direction/count
        humidity, pressure, temperature, wind_speed, wind_direction, count = 0.0, 0.0, 0.0, 0.0, 0.0, 0.0

final.loc[:, :] = final_tmp[:, :]
Problem: len(longitude_aq) is approx. 320k and len(longitude_meo) is 7 million, which brings this code to somewhere close to 2,100 billion iterations...
I have to iterate over one file (the longitude_aq one) and then compute some average value, iterating over the second file (the longitude_meo one) given some features extracted from the first file.
It doesn't seem like I can proceed any differently.
Possible solution: parallel programming. My university allows me access to their HPC. They have several nodes + several GPUs accessible (list here for the GPUs and here for the CPUs)
Target: Having no experience in CUDA programming with Python, I am wondering what would be the easiest way to transform my code into something runnable by the HPC, so that the computation time drops drastically.
Sorry, this read might get hard, but reality is cruel and many enthusiasts may easily spoil man*months of "coding" efforts in a principally a-priori lost war. Better to carefully re-assess all the a-priori known CONS / PROS of any re-engineering plans, before going to blindly spend a single man*day in a principally wrong direction. Most probably I would not post it here, if I were not exposed to a Project, where top-level academicians had spent dozens of man*years, yes, more than a year with a team of 12+, for "producing" a processing taking ~ 26 [hr], which was reproducible in less than ~ 15 [min] ( and way cheaper in HPC/GPU infrastructure costs ), if designed using the proper ( hardware-performance non-devastating ) design methods...
It doesn't seem like I can proceed any differently.
Well, actually pretty hard to tell, if not impossible :
Your post seems to assume a few cardinal things that may pretty much prevent getting any real benefit from moving the above sketched idea onto an indeed professional HPC / GPU infrastructure.
Possible solution: parallel programming
Way easier to say / type than to actually do.
A wish-to-run-in-true-[PARALLEL] process scheduling remains just a wish and ( believe me, or Gene Amdahl, or other C/S veterans, or not ) an indeed hard process re-design is required, if your code ought to get any remarkably better than in a pure-[SERIAL] code-execution flow ( as posted above ).
1 ) a pure-[SERIAL] nature of the fileIO can ( almost ) kill the game :
the not-posted part about pure-[SERIAL] file-accesses ( two files with data-points ) ... any fileIO is by nature the most expensive resource, and except for smart re-engineering it remains a pure-[SERIAL] ( at best a one-stop-cost, but still ) (re-)reading in a sequential manner, so do not expect any Giant-Leap anywhere far from this in the re-engineered code. This will always be the slowest and always an expensive phase.
BONUS: While this may seem the least sexy item in the inventory list of parallel-computing, pycuda, distributed-computing, hpc, parallelism-amdahl or whatever the slang brings next, the rudimentary truth is that for making an HPC-computing indeed fast and resources-efficient, both the inputs ( yes, the static files ) and the computing strategy are typically optimised for stream-processing and to best also enjoy a non-broken ( collision-avoided ) data-locality, if peak performance is to be achieved. Any inefficiency in these two domains can not just add-on, but actually FACTOR the computing expenses ( so DIVIDE performance ), and the differences may easily get into several orders of magnitude ( from [ns] -> [us] -> [ms] -> [s] -> [min] -> [hr] -> [day] -> [week], you name them all ... )
2 ) cost / benefits may get you PAY-WAY-MORE-THAN-YOU-GET
This part is indeed your worst enemy : if the lump sum of your efforts is higher than the sum of net benefits, GPU will not add any value at all, or not enough so as to cover your add-on costs.
Why?
GPU engines are SIMD devices, great at using latency-masking over a vast area of repetitively the very same block of SMX-instructions, which needs a certain "weight"-of-"nice"-mathematics to happen locally, if they are to show any processing speedup over other problem-implementation strategies - GPU devices ( not the gamers' ones, but the HPC ones, which not all cards in the class are, are they? ) deliver best for indeed small areas of data-locality ( micro-kernel matrix operations, having a very dense, best very small SMX-local "RAM" footprint of such a dense kernel << ~ 100 [kB] as of 2018/Q2 ).
Your "computing"-part of the code has ZERO RE-USE of any single data-element that was ( rather expensively ) fetched from the original static storage, so almost all the benefits, that the GPU / SMX / SIMD artillery has been invented for, are NOT USED AT ALL, and you receive a NEGATIVE net-benefit from trying to load that sort of code onto such a heterogeneous ( NUMA-complicated ) distributed-computing system ( yes, each GPU-device is a rather "far", "expensive" ( unless your code harnesses its SMX-resources up until almost smoke comes out of the GPU-silicon ... ) and "remote" asynchronously operated distributed-computing node, inside your global computing strategy ).
Any first branching of the GPU code will be devastatingly expensive in the SIMD-execution costs of doing so, thus your heavily if-ed code is syntactically fair, but performance-wise almost a killer of the game:
for i in range( 0,
                len( longitude_aq )
                ):   #______________________________________ITERATOR #1 ( SEQ-of-I-s )
    currentAq  = aq[i, :]                           # .SET
    center     = Coordinates( latitude_aq[i],       # .SET
                              longitude_aq[i]
                              )                     # |
                                                    # +-------------> # EASY2VECTORISE in [i]
    for j in range( 0,
                    len( longitude_meo )
                    ):   #- - - - - - - - - - - - - - - - - ITERATOR #2 ( SEQ-of-J-s )
        currentMeo = meo[j, :]                      # .SET
        grid_point = Coordinates( latitude_meo[j],  # .SET
                                  longitude_meo[j]
                                  )                 # |
                                                    # +-------> # EASY2VECTORISE in [j]
        if is_in_circle( center,
                         RADIUS,
                         grid_point
                         ):   # /\/\/\/\/\/\/\/\/\/\/\/\/\/ IF-ed SIMD-KILLER #1
            if ( currentAq[TIME_AQ]
              == currentMeo[TIME_MEO]
                 ):   # /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\ IF-ed SIMD-KILLER #2
                humidity       += currentMeo[      HUMIDITY_MEO] # BEST PERF.
                pressure       += currentMeo[      PRESSURE_MEO] #     IF SMART
                temperature    += currentMeo[   TEMPERATURE_MEO] #     CURATED
                wind_speed     += currentMeo[    WIND_SPEED_MEO] #     AS NON
                wind_direction += currentMeo[WIND_DIRECTION_MEO] #     ATOMICS
                count          += 1.0
    if count != 0.0:   # !!!!!!!!!!!!!!!!!! THIS NEVER HAPPENS
        # EXCEPT WHEN ZERO DATA-POINTS WERE AVAILABLE FOR THE i-TH ZONE,
        #        FILE DID NOT CONTAIN ANY SUCH,
        #        WHICH IS FAIR,
        #    BUT SUCH A BLOCK OUGHT NEVER HAVE STARTED ANY COMPUTING AT ALL
        #     IF ASPIRING FOR INDEED BEING LOADED
        #        ONTO AN HPC-GRADE COMPUTING INFRASTRUCTURE ( SPONSORED OR NOT )
        #
        final_tmp[i, HUMIDITY_FINAL]       = humidity       / count
        final_tmp[i, PRESSURE_FINAL]       = pressure       / count
        final_tmp[i, TEMPERATURE_FINAL]    = temperature    / count
        final_tmp[i, WIND_SPEED_FINAL]     = wind_speed     / count
        final_tmp[i, WIND_DIRECTION_FINAL] = wind_direction / count
If we omit both the iterators over the domain of all the [i,j]-s and the if-ed crossroads, the actual "useful" part of the computing does a very shallow amount of mathematics -- the job contains a few SLOC-s, in which principally independent values are summed ( best having avoided any collision of the adding operations, so each could be operated very cheaply, independently of the others, best with well-ahead pre-fetched constants ) in less than a few [ns]. YES, your computing payload does not require anything more than just a few units of [ns] to execute.
The problem is in the smart-engineering of the data-flow ( I like to call that DATA-HYDRAULICS ( how to make a further incompressible flow of DATA into the { CPU | GPU | APU | *** }-processor registers, so as to get 'em processed ) ).
All the rest is easy. A smart solution of the HPC-grade DATA-HYDRAULICS typically is not.
No language, no framework will help you with this automatically. Some can release some part of the solution-engineering from your "manual" efforts, some cannot, and some can even spoil a possible computing performance, due to "cheap" shortcuts in their internal design decisions and the compromises made, that do not benefit the same target you have - The Performance.
The Best next step?
A ) Try to better understand the limits of the computing infrastructures you expect to use for your extensive ( but not intensive ( yes, just a few SLOC's per [i,j] ) ) processing, which HPC-supervisors do not like to see flowing onto their expensively operated HPC-resources.
B ) If in trouble with the time + headcount + financial resources to re-engineer the top-down DATA-HYDRAULICS solution, best re-factor your code so as to get at least into a vectorised, numpy / numba form ( numba will not always get remarkably farther than an already smart numpy-vectorised code, but a quantitative test will tell the facts per incident, not in general ) -- see the sketch right after this list.
C ) If your computing-problem is expected to get re-run more often, definitely assess a re-designed pipeline starting from the early pre-processing of the data-storage ( the slowest part of the processing ), where a stream-based pre-processing of principally static values is possible, which could impact the resulting DATA-HYDRAULICS flow ( performance ) with pre-computed + smart-aligned values the most. The block of a few ADD-s down the lane will not get improved beyond a few [ns], as reported above, but the slow flow can jump orders of magnitude faster, if re-arranged into a smarter flow, harnessing all available, yet "just"-[CONCURRENT]-ly operated resources ( any attempt to try to arrange a True-[PARALLEL] scheduling is pure nonsense here, as the task is principally by no means a [PARALLEL] scheduling problem, but a stream of pure-[SERIAL] (re-)processing of data-points, where a smart, yet "just"-[CONCURRENT] processing re-arrangement may help scale-down the resulting duration of the process ).
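A hedged numpy sketch of the kind of re-factoring meant in B ) above -- collapsing the inner j-loop of the question into masked array operations ( array names follow the question; the planar-distance stand-in for is_in_circle() and the column bookkeeping are assumptions ):

import numpy as np

def averages_for_zone_i( i,
                         latitude_aq, longitude_aq, aq,
                         latitude_meo, longitude_meo, meo,
                         RADIUS, TIME_AQ, TIME_MEO, meo_value_cols ):
    # squared planar distance as a stand-in for is_in_circle()
    d2   = ( latitude_meo  - latitude_aq[i]  )**2 \
         + ( longitude_meo - longitude_aq[i] )**2
    mask = ( d2 <= RADIUS**2 ) & ( meo[:, TIME_MEO] == aq[i, TIME_AQ] )
    if not mask.any():
        return None                                   # the "count == 0" case
    # humidity, pressure, temperature, ... averaged in one go
    return meo[mask][:, meo_value_cols].mean( axis = 0 )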
BONUS: If interested in deeper reasoning about the achievable performance gains from going into N-CPU-operated computing graphs, feel free to learn more about the re-formulated Amdahl's Law and related issues, as posted in further detail here.
As suggested in this answer, I tried to use joblib to train multiple scikit-learn models in parallel.
import joblib
import numpy
from sklearn import tree, linear_model

classifierParams = {
    "Decision Tree": (tree.DecisionTreeClassifier, {}),
    "Logistic Regression": (linear_model.LogisticRegression, {})
}

XTrain = numpy.array([[1,2,3],[4,5,6]])
yTrain = numpy.array([0, 1])

def trainModel(name, clazz, params, XTrain, yTrain):
    print("training ", name)
    model = clazz(**params)
    model.fit(XTrain, yTrain)
    return model

joblib.Parallel(n_jobs=4)(joblib.delayed(trainModel)(name, clazz, params, XTrain, yTrain) for (name, (clazz, params)) in classifierParams.items())
However, the call in the last line takes ages without utilizing the CPU; in fact, it just seems to block and never returns anything. What is my mistake?
A test with a very small amount of data in XTrain suggests that copying of the numpy array across multiple processes is not the reason for the delay.
Production-grade Machine Learning pipelines have CPU utilisations more like this, almost 24 / 7 / 365:
Check both the CPU% and also other resources' state figures across this node.
What is my mistake?
Having read your profile was a stunning moment, Sir:
I am a computer scientist specializing on algorithms and data analysis by training, and a generalist by nature. My skill set combines a strong scientific background with experience in software architecture and development, especially on solutions for the analysis of big data. I offer consulting and development services and I am looking for challenging projects in the area of data science.
The problem IS deeply determined by respect for elementary Computer Science + algorithmic rules.
The problem does NOT demand a strong scientific background, but common sense.
The problem is NOT any especially Big Data one, but requires one to smell how the things actually work.
Facts or Emotions? ... that's The Question! ( The tragedy of Hamlet, Prince of Denmark )
May I be honest? Let's prefer FACTS, always:
Step #1:
Never hire, or straight away fire, each and every Consultant who does not respect facts ( the answer referred to above did not suggest anything, much less grant any promises ). Ignoring facts might be a "successful sin" in PR / MARCOM / Advertisement / media businesses ( in case The Customer tolerates such dishonesty and/or manipulative habits ), but not in scientifically fair quantitative domains. This is unforgivable.
Step #2:
Never hire, or straight away fire, each and every Consultant who claims having experience in software architecture, especially on solutions for ... big data, but pays zero attention to the accumulated lump sum of all the add-on overhead costs that are going to be introduced by each of the respective elements of the system architecture, once the processing starts to go distributed across some pool of hardware and software resources. This is unforgivable.
Step #3:
Never hire, or straight away fire, each and every Consultant who turns passive-aggressive once the facts do not fit her/his wishes, and who starts to accuse another knowledgeable person, who has already delivered a helping hand, to rather "improve ( their ) communication skills" instead of learning from the mistake(s). Sure, skill may help to express the obvious mistakes in some other way, yet the gigantic mistakes will remain gigantic mistakes, and each and every scientist, being fair to her/his scientific title, should NEVER resort to attacking a helping colleague, but rather start searching for the root cause of the mistakes, one after the other. This ---
#sascha ... May I suggest you take little a break from stackoverflow to cool off, work a little on your interpersonal communication skills
--- was nothing but a straight and intellectually unacceptable nasty foul to #sascha.
Next, the toys. The architecture, resources and process-scheduling facts that matter:
The imperative form of a syntax-constructor ignites an immense amount of activities to start:
joblib.Parallel( n_jobs = <N> )( joblib.delayed( <aFunction> )
                                               ( <anOrderedSetOfFunParameters> )
                                 for ( <anOrderedSetOfIteratorParams> )
                                 in  <anIterator>
                                 )
To at least guess what happens, a scientifically fair approach would be to test several representative cases, benchmark their actual execution, collect quantitatively supported facts, and draw a hypothesis on the model of behaviour and its principal dependencies on CPU_core-count, on RAM-size, on <aFunction>-complexity and resources-allocation envelopes, etc.
Test case A:
def a_NOP_FUN( aNeverConsumedPAR ):
    """ __doc__
    The intent of this FUN() is indeed to do nothing at all,
    so as to be able to benchmark
    all the process-instantiation
    add-on overhead costs.
    """
    pass

##############################################################
### A NAIVE TEST BENCH
##############################################################
from zmq import Stopwatch; aClk = Stopwatch()

JOBS_TO_SPAWN =  4         # TUNE:  1,  2,  4,   5,  10, ..
RUNS_TO_RUN   = 10         # TUNE: 10, 20, 50, 100, 200, 500, 1000, ..

try:
    aClk.start()
    joblib.Parallel( n_jobs = JOBS_TO_SPAWN
                     )( joblib.delayed( a_NOP_FUN )
                                      ( aSoFunPAR )
                        for          ( aSoFunPAR )
                        in    range( RUNS_TO_RUN )
                        )
except:
    pass
finally:
    try:
        _ = aClk.stop()
    except:
        _ = -1
        pass

print( "CLK:: {0:_>24d} [us] #{1: >3d} run{2: >5d} RUNS".format( _,
                                                                 JOBS_TO_SPAWN,
                                                                 RUNS_TO_RUN
                                                                 )
       )
Having collected representatively enough data on this NOP-case, over a reasonably scaled 2D-landscape of [ RUNS_TO_RUN, JOBS_TO_SPAWN ]-cartesian-space DataPoints, you generate at least some first-hand experience of the actual system costs of launching the intrinsically empty-process overhead workloads, related to the imperatively instructed joblib.Parallel(...)( joblib.delayed(...) )-syntax constructor, spawning into the system-scheduler just a few joblib-managed a_NOP_FUN() instances.
Let's also agree that all the real-world problems, Machine Learning models included, are way more complex tools than the just-tested a_NOP_FUN(), while in both cases you have to pay the already benchmarked overhead costs ( even if they were paid for getting literally zero product ).
Thus a scientifically fair, rigorous work will follow from this simplest-ever case, already showing the benchmarked costs of all the associated setup-overheads -- a smallest-ever joblib.Parallel() penalty sine-qua-non -- forwards into a direction where the real-world algorithms live, best with next adding some larger and larger "payload"-sizes into the testing loop:
Test-case B:
def a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR( aNeverConsumedPAR ):
    """ __doc__
    The intent of this FUN() is to do nothing but
    a MEM-allocation
    so as to be able to benchmark
    all the process-instantiation
    add-on overhead costs.
    """
    import numpy as np              # yes, deferred import, libs do defer imports

    SIZE1D    = 1000                # here, feel free to be as keen as needed

    aMemALLOC = np.zeros( ( SIZE1D, #       so as to set
                            SIZE1D, #       realistic ceilings
                            SIZE1D, #       as how big the "Big Data"
                            SIZE1D  #       may indeed grow into
                            ),
                          dtype = np.float64,
                          order = 'F'
                          )         # .ALLOC + .SET
    aMemALLOC[2,3,4,5] = 8.7654321  # .SET
    aMemALLOC[3,3,4,5] = 1.2345678  # .SET

    return aMemALLOC[2:3,3,4,5]
Again, collect representatively enough quantitative data about the costs of actual remote-process MEM-allocations, by running a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR() over some reasonably wide landscape of SIZE1D-scaling,
again over a reasonably scaled 2D-landscape of [ RUNS_TO_RUN, JOBS_TO_SPAWN ]-cartesian-space DataPoints, so as to touch a new dimension in the performance scaling, under an extended black-box PROCESS_under_TEST experimentation inside the joblib.Parallel() tool, leaving its magics still unopened.
Test-case C:
def a_NOP_FUN_WITH_SOME_MEM_DATAFLOW( aNeverConsumedPAR ):
    """ __doc__
    The intent of this FUN() is to do nothing but
    a MEM-allocation plus some Data MOVs
    so as to be able to benchmark
    all the process-instantiation + MEM OPs
    add-on overhead costs.
    """
    import numpy as np              # yes, deferred import, libs do defer imports

    SIZE1D    = 1000                # here, feel free to be as keen as needed

    aMemALLOC = np.ones( ( SIZE1D,  #       so as to set
                           SIZE1D,  #       realistic ceilings
                           SIZE1D,  #       as how big the "Big Data"
                           SIZE1D   #       may indeed grow into
                           ),
                         dtype = np.float64,
                         order = 'F'
                         )          # .ALLOC + .SET
    aMemALLOC[2,3,4,5] = 8.7654321  # .SET
    aMemALLOC[3,3,4,5] = 1.2345678  # .SET

    aMemALLOC[:,:,:,:]*= 0.1234567
    aMemALLOC[:,3,4,:]+= aMemALLOC[4,5,:,:]
    aMemALLOC[2,:,4,:]+= aMemALLOC[:,5,6,:]
    aMemALLOC[3,3,:,:]+= aMemALLOC[:,:,6,7]
    aMemALLOC[:,3,:,5]+= aMemALLOC[4,:,:,7]

    return aMemALLOC[2:3,3,4,5]
Bang! The Architecture-related issues start to slowly show up:
One may soon notice, that not only the static sizing matters, but also the MEM-transport BANDWIDTH ( hardware-hardwired ) will start to cause problems, as moving data from/to CPU into/from MEM costs well ~ 100 .. 300 [ns], way more than any smart-shuffling of the few bytes "inside" the CPU_core, { CPU_core_private | CPU_core_shared | CPU_die_shared }-cache hierarchy-architecture alone ( and any non-local NUMA-transfer exhibits the same order of magnitude of add-on pain ).
All the above Test-Cases have not asked much effort from the CPU yet
So let's start to burn the oil!
If all the above was fine for starting to smell how the things under the hood actually work, this is where it grows to become ugly and dirty.
Test-case D:
def a_CPU_1_CORE_BURNER_FUN( aNeverConsumedPAR ):
    """ __doc__
    The intent of this FUN() is to do nothing but
    add some CPU-load
    to a MEM-allocation plus some Data MOVs
    so as to be able to benchmark
    all the process-instantiation + MEM OPs
    add-on overhead costs.
    """
    import numpy as np              # yes, deferred import, libs do defer imports

    SIZE1D    = 1000                # here, feel free to be as keen as needed

    aMemALLOC = np.ones( ( SIZE1D,  #       so as to set
                           SIZE1D,  #       realistic ceilings
                           SIZE1D,  #       as how big the "Big Data"
                           SIZE1D   #       may indeed grow into
                           ),
                         dtype = np.float64,
                         order = 'F'
                         )          # .ALLOC + .SET
    aMemALLOC[2,3,4,5] = 8.7654321  # .SET
    aMemALLOC[3,3,4,5] = 1.2345678  # .SET

    aMemALLOC[:,:,:,:]*= 0.1234567
    aMemALLOC[:,3,4,:]+= aMemALLOC[4,5,:,:]
    aMemALLOC[2,:,4,:]+= aMemALLOC[:,5,6,:]
    aMemALLOC[3,3,:,:]+= aMemALLOC[:,:,6,7]
    aMemALLOC[:,3,:,5]+= aMemALLOC[4,:,:,7]

    aMemALLOC[:,:,:,:]+= int( [ np.math.factorial( x + aMemALLOC[-1,-1,-1] )
                                for x in range( 1005 )
                                ][-1]
                            / [ np.math.factorial( y + aMemALLOC[ 1, 1, 1] )
                                for y in range( 1000 )
                                ][-1]
                              )

    return aMemALLOC[2:3,3,4,5]
Still nothing extraordinary, compared to the common grade of payloads in the domain of a Machine Learning many-D-space, where all the dimensions of the { aMlModelSPACE, aSetOfHyperParameterSPACE, aDataSET }-state-space impact the scope of the processing required ( some having O( N ), some other O( N.logN ) complexity ), and where, almost immediately, well-engineered code harnesses more than just one CPU_core even on a single "job" being run.
An indeed nasty smell starts once naive ( read: resources-usage un-coordinated ) CPU-load mixtures get down the road, and when mixes of task-related CPU-loads start to get mixed with naive ( read: resources-usage un-coordinated ) O/S-scheduler processes that happen to fight for common ( resorted to just a naive shared-use policy ) resources - i.e. MEM ( introducing SWAPs as HELL ), CPU ( introducing cache-misses and MEM re-fetches ( yes, with SWAP penalties added ) ), not speaking about paying any kind of more-than ~ 15+ [ms] latency-fees, if one forgets and lets a process touch a fileIO-( 5 (!)-orders-of-magnitude slower + shared + being a pure-[SERIAL], by nature )-device. No prayers help here ( SSD included, just a few orders of magnitude less, but still a hell to share & running a device incredibly fast into its wear + tear grave ).
What happens if all the spawned processes do not fit into the physical RAM?
Virtual-memory paging and swapping start to literally deteriorate the rest of the so-far somehow "just"-by-coincidence-( read: weakly-co-ordinated )-[CONCURRENTLY]-scheduled processing ( read: further-decreased individual PROCESS-under-TEST performance ).
Things may soon wreak havoc, if not kept under due control & supervision.
Again - facts matter: a light-weight resources-monitor class may help:
aResRECORDER.show_usage_since0() method returns:
ResCONSUMED[T0+ 166036.311 ( 0.000000)]
user= 2475.15
nice= 0.36
iowait= 0.29
irq= 0.00
softirq= 8.32
stolen_from_VM= 26.95
guest_VM_served= 0.00
Similarly, a somewhat richer-constructed resources-monitor may report a wider O/S context, to see where additional resource stealing / contention / race-conditions deteriorate the actually achieved process-flow:
>>> psutil.Process( os.getpid()
).memory_full_info()
( rss = 9428992,
vms = 158584832,
shared = 3297280,
text = 2322432,
lib = 0,
data = 5877760,
dirty = 0
)
.virtual_memory()
( total = 25111490560,
available = 24661327872,
percent = 1.8,
used = 1569603584,
free = 23541886976,
active = 579739648,
inactive = 588615680,
buffers = 0,
cached = 1119440896
)
.swap_memory()
( total = 8455712768,
used = 967577600,
free = 7488135168,
percent = 11.4,
sin = 500625227776,
sout = 370585448448
)
Wed Oct 19 03:26:06 2017
166.445 ___VMS______________Virtual Memory Size MB
10.406 ___RES____Resident Set Size non-swapped MB
2.215 ___TRS________Code in Text Resident Set MB
14.738 ___DRS________________Data Resident Set MB
3.305 ___SHR_______________Potentially Shared MB
0.000 ___LIB_______________Shared Memory Size MB
__________________Number of dirty pages 0x
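A hedged helper that gathers the same kind of figures as shown above ( module-level psutil calls; field availability differs per platform ):

import os, psutil

def resources_snapshot():
    proc = psutil.Process( os.getpid() )
    return { "process_mem": proc.memory_full_info()._asdict(),
             "virtual_mem": psutil.virtual_memory()._asdict(),
             "swap_mem":    psutil.swap_memory()._asdict(),
             }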
Last but not least: why can one easily pay more than one earns in return?
Besides the gradually built record of evidence of how the real-world system-deployment add-on overheads accumulate the costs, the recently re-formulated Amdahl's Law, extended so as to cover both the add-on overhead-costs plus the "process-atomicity" of the further-indivisible parts' sizing, defines a maximum add-on costs threshold that might reasonably be paid, if some distributed processing is to provide any computing process speedup >= 1.00.
Dis-obeying the explicit logic of the re-formulated Amdahl's Law causes a process to proceed worse than if it had been processed in a pure-[SERIAL] process-scheduling ( and sometimes the results of poor design and/or operations practices may look as if it were a case when the joblib.Parallel()( joblib.delayed(...) ) method "blocks the process" ).
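A hedged sketch of the overhead-strict speedup meant above ( the exact symbols and their placement are assumptions spelled out in the docstring; for the precise formulation, follow the links ):

def overhead_strict_speedup( s, N, o_setup = 0.0, o_term = 0.0 ):
    """ s ....... serial fraction of the original work ( 0.0 .. 1.0 )
        N ....... number of workers the ( 1 - s ) part is spread over
        o_setup . add-on overhead fraction paid to spawn / distribute the work
        o_term .. add-on overhead fraction paid to collect / terminate it
    """
    return 1.0 / ( s + o_setup + o_term + ( 1.0 - s ) / N )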
I looked up a lot of questions about slowness in Python multiprocessing, but none of them solved my problem.
Inside my algorithm, I have a for loop from 0 to 2 that runs the most important function of the algorithm (and the most time-consuming one). The 3 iterations of the loop are independent from each other. So, to take advantage of this, I was trying to run my algorithm using parallel processing.
The thing is that when I run my algorithm with parallel processing, the simulation time is higher than with sequential programming. Depending on the input data, my original sequential algorithm can take from ~30 ms to ~1500 ms to run. Even in the ~1500 ms cases, the multiprocessing version is slower. Does multiprocessing only pay off for really computationally expensive problems, or is there something I can do to make it work better for me?
For now I won't post my algorithm because it's really long, but just as an example, what I'm doing is this:
from multiprocessing import Pool

def FUNCTION(A, B, C, f):
    R1 = A * B * C * f
    R2 = A * B / C * f   # The function has several operations, I'm just doing an example here.
    return R1, R2

if __name__ == '__main__':
    pool = Pool()
    while CP[0] or CP[1] or CP[2] or CPVT[0] or CPVT[1] or CPVT[2]:
        f = 0
        result1 = pool.apply_async(FUNCTION, [A0, B0, C0, f])
        f = 1
        result2 = pool.apply_async(FUNCTION, [A1, B1, C1, f])
        f = 2
        result3 = pool.apply_async(FUNCTION, [A2, B2, C2, f])
        [R0, S0] = result1.get(timeout=1)
        [R1, S1] = result2.get(timeout=1)
        [R2, S2] = result3.get(timeout=1)
Any ideas why it is taking longer than the sequential way of doing it, or any solutions to this issue?
Thanks! :)
I am trying this code, and it works well; however, it is really slow, because the number of iterations is high.
I am thinking about threads, which should increase the performance of this script, right? Well, the question is how can I change this code to work with synchronized threads.
def get_duplicated(self):
    db_pais_origuem = self.country_assoc(int(self.Pais_origem))
    db_pais_destino = self.country_assoc(int(self.Pais_destino))
    condicao = self.condition_assoc(int(self.Condicoes))

    origem = db_pais_origuem.query("xxx")
    destino = db_pais_destino.query("xxx")

    origem_result = origem.getresult()
    destino_result = destino.getresult()

    for i in origem_result:
        for a in destino_result:
            text1 = i[2]
            text2 = a[2]
            vector1 = self.text_to_vector(text1)
            vector2 = self.text_to_vector(text2)
            cosine = self.get_cosine(vector1, vector2)
origem_result and destino_result structure:
[(382360, 'name abcd', 'some data'), (361052, 'name abcd', 'some data'), (361088, 'name abcd', 'some data')]
From what I can see, you are computing a distance function between pairs of vectors. Given a list of vectors v1, ..., vn and a second list w1, ..., wn, you want the distance/similarity between all pairs from v and w. This is usually highly amenable to parallel computation, and is sometimes referred to as an embarrassingly parallel computation. IPython works very well for this.
If your distance function distance(a,b) is independent and does not depend on results from other distance function values (this is usually the case that I have seen), then you can easily use the IPython parallel computing toolbox. I would recommend it over threads, queues, etc. for a wide variety of tasks, especially exploratory ones. However, the same principles could be extended to threads or the queue module in Python.
I recommend following along with http://ipython.org/ipython-doc/stable/parallel/parallel_intro.html#parallel-overview and http://ipython.org/ipython-doc/stable/parallel/parallel_task.html#quick-and-easy-parallelism -- they provide a very easy, gentle introduction to parallelization.
In the simple case, you will simply use the threads on your computer (or the network if you want a bigger speed-up), and let each thread compute as many of the distance(a,b) values as it can.
Assuming a command prompt that can see the ipcluster executable, type
ipcluster start -n 3
This starts the cluster. You will want to adjust the number of cores/threads depending on your specific circumstances. Consider using n-1 cores, to leave one core for handling the scheduling.
The hello world example goes as follows:
serial_result = map(lambda z:z**10, range(32))
from IPython.parallel import Client
rc = Client()
rc
rc.ids
dview = rc[:] # use all engines
parallel_result = dview.map_sync(lambda z: z**10, range(32))
#a couple of caveats, are this template will not work directly
#for our use case of computing distance between a matrix (observations x variables)
#because the allV data matrix and the distance function are not visible to the nodes
serial_result == parallel_result
For the sake of simplicity I will show how to compute the distance between all pairs of vectors specified in allV. Assume that each row represents a data point (observation) that has three dimensions.
Also, I am not going to present this the "pedagogically correct" way, but the way that I stumbled through it, wrestling with the visibility of my functions and data on the remote nodes. I found that to be the biggest hurdle to entry.
import itertools
import numpy
from numpy import arange

dataPoints = 10
allV = numpy.random.rand(dataPoints, 3)
mesh = list(itertools.product(arange(dataPoints), arange(dataPoints)))

#given the following distance function we can evaluate locally
def DisALocal(a, b):
    return numpy.linalg.norm(a - b)

serial_result = map(lambda z: DisALocal(allV[z[0]], allV[z[1]]), mesh)
parallel_result = dview.map_sync(lambda z: DisALocal(allV[z[0]], allV[z[1]]), mesh)
#will not work as DisALocal is not visible to the nodes
#also will not work as allV is not visible to the nodes
There are a few ways to define remote functions, depending on whether we want to send our data matrix to the nodes or not. There are trade-offs as to how big the matrix is, and whether you want to send lots of vectors individually to the nodes or send the entire matrix upfront...
#in the first case we send the function def to the nodes via the autopx magic
%autopx
def DisARemote(a, b):
    import numpy
    return numpy.linalg.norm(a - b)
%autopx
#It requires us to push allV. Also note the import numpy in the function
dview.push(dict(allV=allV))
parallel_result = dview.map_sync(lambda z: DisARemote(allV[z[0]], allV[z[1]]), mesh)
serial_result == parallel_result
#here we will generate the vectors to compute differences between
#and pass the vectors only, so we do not need to load allV across the
#nodes. We must pre-compute the vectors, but this could, perhaps, be
#done more cleverly
z1, z2 = zip(*mesh)
z1 = numpy.array(z1)
z2 = numpy.array(z2)
allVectorsA = allV[z1]
allVectorsB = allV[z2]

@dview.parallel(block=True)
def DisB(a, b):
    return numpy.linalg.norm(a - b)

parallel_result = DisB.map(allVectorsA, allVectorsB)
serial_result == parallel_result
In the final case we will do the following
#this relies on the allV data matrix being pre-loaded on the nodes.
#note with DisC we do not import numpy in the function, but
#import it via the sync_imports command
with dview.sync_imports():
    import numpy

@dview.parallel(block=True)
def DisC(a):
    return numpy.linalg.norm(allV[a[0]] - allV[a[1]])

#the data structure must be passed to all threads
dview.push(dict(allV=allV))

parallel_result = DisC.map(mesh)
serial_result == parallel_result
All of the above can easily be extended to work in a load-balanced fashion, for instance as sketched below.
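A hedged sketch of that load-balanced variant (it re-uses the Client rc and the DisALocal/allV/mesh names from above, and the same function/data visibility caveats discussed earlier still apply):

lview = rc.load_balanced_view()        # instead of the direct view rc[:]
parallel_result = lview.map_sync(lambda z: DisALocal(allV[z[0]], allV[z[1]]), mesh)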
Of course, the easiest speedup (assuming distance(a,b) = distance(b,a)) would be the following. It will only cut the run time in half, but it can be used together with the parallelization ideas above to compute only the upper triangle of the distance matrix.
for vIndex, currentV in enumerate(v):
    for wIndex, currentW in enumerate(w):
        if vIndex > wIndex:
            continue  # we can skip the other half of the computations
        distance[vIndex, wIndex] = get_cosine(currentV, currentW)
        #if distance(a,b) = distance(b,a) then use this trick
        distance[wIndex, vIndex] = distance[vIndex, wIndex]