Training sklearn models in parallel with joblib blocks the process - python

As suggested in this answer, I tried to use joblib to train multiple scikit-learn models in parallel.
import joblib
import numpy
from sklearn import tree, linear_model
classifierParams = {
"Decision Tree": (tree.DecisionTreeClassifier, {}),''
"Logistic Regression" : (linear_model.LogisticRegression, {})
}
XTrain = numpy.array([[1,2,3],[4,5,6]])
yTrain = numpy.array([0, 1])
def trainModel(name, clazz, params, XTrain, yTrain):
print("training ", name)
model = clazz(**params)
model.fit(XTrain, yTrain)
return model
joblib.Parallel(n_jobs=4)(joblib.delayed(trainModel)(name, clazz, params, XTrain, yTrain) for (name, (clazz, params)) in classifierParams.items())
However, the call to the last line takes ages without utilizing the CPU, in fact it just seems to block and never return anything. What is my mistake?
A test with a very small amount of data in XTrain suggests that copying of the numpy array across multiple processes is not the reason for the delay.

Production-grade Machine Learning pipelines have CPU utilisations more like this, almost 24 / 7 / 365:
Check both the CPU% and also other resources' state figures across this node.
What is my mistake?
Having read your profile was a stunning moment, Sir:
I am a computer scientist specializing on algorithms and data analysis by training, and a generalist by nature. My skill set combines a strong scientific background with experience in software architecture and development, especially on solutions for the analysis of big data. I offer consulting and development services and I am looking for challenging projects in the area of data science.
The problem IS deeply determined by a respect to elementary Computer Science + algorithm rules.
The problem IS NOT demanding a strong scientific background, but a common sense.
The problem IS NOT any especially Big Data but requires to smell how the things actually work.
FactsorEmotions? ... that's The Question! ( The tragedy of Hamlet, Prince of Denmark )
May I be honest? Let's prefer FACTS, always:
Step #1:
Never hire or fire straight each and every Consultant, who does not respect facts ( the answer referred above did not suggest anything, the less granted any promises ). Ignoring facts might be a "successful sin" in PR / MARCOM / Advertisement / media businesses ( in case The Customer tolerates such dishonesty and/or manipulative habit ) , but not in a scientifically fair quantitative domains. This is unforgivable.
Step #2:
Never hire or fire straight each and every Consultant, who claimed having experience in software architecture, especially on solutions for ... big data but pays zero attention to the accumulated lumpsum of all the add-on overhead costs that are going to be introduced by each of the respective elements of the system architecture, once the processing started to go distributed across some pool of hardware and software resources. This is unforgivable.
Step #3:
Never hire or fire straight each and every Consultant, who turns passive aggressive once facts do not fit her/his wishes and starts to accuse other knowledgeable person who have already delivered a helping hand to rather "improve ( their ) communication skills" instead of learning from mistake(s). Sure, skill may help to express the obvious mistakes in some other way, yet, the gigantic mistakes will remain gigantic mistakes and each and every scientist, being fair to her/his scientific title, should NEVER resort to attack on a helping colleague, but rather start searching for the root cause of the mistakes, one after the other. This ---
#sascha ... May I suggest you take little a break from stackoverflow to cool off, work a little on your interpersonal communication skills
--- was nothing but a straight and intellectually unacceptable nasty foul to #sascha.
Next, the toysThe architecture, Resources and Process-scheduling facts that matter:
The imperative form of a syntax-constructor ignites an immense amount of activities to start:
joblib.Parallel( n_jobs = <N> )( joblib.delayed( <aFunction> )
( <anOrderedSetOfFunParameters> )
for ( <anOrderedSetOfIteratorParams> )
in <anIterator>
)
To at least guess what happens, a scientifically fair approach would be to test several representative cases, benchmarking their actual execution, collect quantitatively supported facts and draw a hypothesis on a model of behaviour and its principal dependencies on CPU_core-count, on RAM-size, on <aFunction>-complexity and resources-allocation envelopes etc.
Test case A:
def a_NOP_FUN( aNeverConsumedPAR ):
""" __doc__
The intent of this FUN() is indeed to do nothing at all,
so as to be able to benchmark
all the process-instantiation
add-on overhead costs.
"""
pass
##############################################################
### A NAIVE TEST BENCH
##############################################################
from zmq import Stopwatch; aClk = Stopwatch()
JOBS_TO_SPAWN = 4 # TUNE: 1, 2, 4, 5, 10, ..
RUNS_TO_RUN = 10 # TUNE: 10, 20, 50, 100, 200, 500, 1000, ..
try:
aClk.start()
joblib.Parallel( n_jobs = JOBS_TO_SPAWN
)( joblib.delayed( a_NOP_FUN )
( aSoFunPAR )
for ( aSoFunPAR )
in range( RUNS_TO_RUN )
)
except:
pass
finally:
try:
_ = aClk.stop()
except:
_ = -1
pass
print( "CLK:: {0:_>24d} [us] #{1: >3d} run{2: >5d} RUNS".format( _,
JOBS_TO_SPAWN,
RUNS_TO_RUN
)
)
Having collected representatively enough data on this NOP-case over a reasonably scaled 2D-landscape of [ RUNS_TO_RUN, JOBS_TO_SPAWN]-cartesian-space DataPoints, so as to generate at least some first-hand experience of the actual system costs of launching an actually intrinsically empty-processes' overhead workloads, related to the imperatively instructed joblib.Parallel(...)( joblib.delayed(...) )-syntax constructor, spawning into the system-scheduler just a few joblib-managed a_NOP_FUN() instances.
Let's also agree that all the real-world problems, Machine Learning models included, are way more complex tools, that the just tested a_NOP_FUN(), while in both cases, you have to pay the already benchmarked overhead costs ( even if it was paid for getting literally zero product ).
Thus a scientifically fair, rigorous work will follow from this simplest ever case, already showing the benchmarked costs of all the associated setup-overheads a smallest ever joblib.Parallel() penalty sine-qua-non forwards into a direction, where real world algorithms live - best with next adding some larger and larger "payload"-sizes into the testing loop:
Test-case B:
def a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR( aNeverConsumedPAR ):
""" __doc__
The intent of this FUN() is to do nothing but
a MEM-allocation
so as to be able to benchmark
all the process-instantiation
add-on overhead costs.
"""
import numpy as np # yes, deferred import, libs do defer imports
SIZE1D = 1000 # here, feel free to be as keen as needed
aMemALLOC = np.zeros( ( SIZE1D, # so as to set
SIZE1D, # realistic ceilings
SIZE1D, # as how big the "Big Data"
SIZE1D # may indeed grow into
),
dtype = np.float64,
order = 'F'
) # .ALLOC + .SET
aMemALLOC[2,3,4,5] = 8.7654321 # .SET
aMemALLOC[3,3,4,5] = 1.2345678 # .SET
return aMemALLOC[2:3,3,4,5]
Again,
collect a representatively enough quantitative data about the costs of actual remote-process MEM-allocations, by running a a_NOP_FUN_WITH_JUST_A_MEM_ALLOCATOR() over some reasonable wide landscape of SIZE1D-scaling,
again
over a reasonably scaled 2D-landscape of [ RUNS_TO_RUN, JOBS_TO_SPAWN]-cartesian-space DataPoints, so as to touch a new dimension in the performance scaling, under an extended black-box PROCESS_under_TEST experimentation inside the joblib.Parallel() tool, leaving its magics yet left un-opened.
Test-case C:
def a_NOP_FUN_WITH_SOME_MEM_DATAFLOW( aNeverConsumedPAR ):
""" __doc__
The intent of this FUN() is to do nothing but
a MEM-allocation plus some Data MOVs
so as to be able to benchmark
all the process-instantiation + MEM OPs
add-on overhead costs.
"""
import numpy as np # yes, deferred import, libs do defer imports
SIZE1D = 1000 # here, feel free to be as keen as needed
aMemALLOC = np.ones( ( SIZE1D, # so as to set
SIZE1D, # realistic ceilings
SIZE1D, # as how big the "Big Data"
SIZE1D # may indeed grow into
),
dtype = np.float64,
order = 'F'
) # .ALLOC + .SET
aMemALLOC[2,3,4,5] = 8.7654321 # .SET
aMemALLOC[3,3,4,5] = 1.2345678 # .SET
aMemALLOC[:,:,:,:]*= 0.1234567
aMemALLOC[:,3,4,:]+= aMemALLOC[4,5,:,:]
aMemALLOC[2,:,4,:]+= aMemALLOC[:,5,6,:]
aMemALLOC[3,3,:,:]+= aMemALLOC[:,:,6,7]
aMemALLOC[:,3,:,5]+= aMemALLOC[4,:,:,7]
return aMemALLOC[2:3,3,4,5]
Bang, The Architecture related issues start to slowly show up:
One may soon notice, that not only the static-sizing matters, but also the MEM-transport BANDWIDTH ( hardware-hardwired ) will start cause problems, as moving data from/to CPU into/from MEM costs well ~ 100 .. 300 [ns], a way more, than any smart-shuffling of the few bytes "inside" the CPU_core, { CPU_core_private | CPU_core_shared | CPU_die_shared }-cache hierarchy-architecture alone ( and any non-local NUMA-transfer exhibits the same order of magnitude add-on pain ).
All the above Test-Cases have not asked much efforts from CPU yet
So let's start to burn the oil!
If all above was fine for starting to smell how the things under the hood actually work, this will grow to become ugly and dirty.
Test-case D:
def a_CPU_1_CORE_BURNER_FUN( aNeverConsumedPAR ):
""" __doc__
The intent of this FUN() is to do nothing but
add some CPU-load
to a MEM-allocation plus some Data MOVs
so as to be able to benchmark
all the process-instantiation + MEM OPs
add-on overhead costs.
"""
import numpy as np # yes, deferred import, libs do defer imports
SIZE1D = 1000 # here, feel free to be as keen as needed
aMemALLOC = np.ones( ( SIZE1D, # so as to set
SIZE1D, # realistic ceilings
SIZE1D, # as how big the "Big Data"
SIZE1D # may indeed grow into
),
dtype = np.float64,
order = 'F'
) # .ALLOC + .SET
aMemALLOC[2,3,4,5] = 8.7654321 # .SET
aMemALLOC[3,3,4,5] = 1.2345678 # .SET
aMemALLOC[:,:,:,:]*= 0.1234567
aMemALLOC[:,3,4,:]+= aMemALLOC[4,5,:,:]
aMemALLOC[2,:,4,:]+= aMemALLOC[:,5,6,:]
aMemALLOC[3,3,:,:]+= aMemALLOC[:,:,6,7]
aMemALLOC[:,3,:,5]+= aMemALLOC[4,:,:,7]
aMemALLOC[:,:,:,:]+= int( [ np.math.factorial( x + aMemALLOC[-1,-1,-1] )
for x in range( 1005 )
][-1]
/ [ np.math.factorial( y + aMemALLOC[ 1, 1, 1] )
for y in range( 1000 )
][-1]
)
return aMemALLOC[2:3,3,4,5]
Still nothing extraordinary, compared to the common grade of payloads in the domain of a Machine Learning many-D-space, where all dimensions of the { aMlModelSPACE, aSetOfHyperParameterSPACE, aDataSET }-state-space impact the scope of the processing required ( some having O( N ), some other O( N.logN ) complexity ), where almost immediately, where well engineered-in more than just one CPU_core soon gets harnessed even on a single "job"-being run.
An indeed nasty smell starts, once a naive ( read resources-usage un-coordinated ) CPU-load mixtures get down the road and when mixes of task-related CPU-loads start to get mixed with naive ( read resources-usage un-coordinated ) O/S-scheduler processes happen to fight for common ( resorted to just a naive shared-use policy ) resources - i.e. MEM ( introducing SWAPs as HELL ), CPU ( introducing cache-misses and MEM re-fetches ( yes, with SWAPs penalties added ), not speaking about paying any kind of more than ~ 15+ [ms] latency-fees, if one forgets and lets a process to touch a fileIO-( 5 (!)-orders-of-magnitude slower + shared + being a pure-[SERIAL], by nature )-device. No prayers help here ( SSD included, just a few orders of magnitude less, but still a hell to share & running a device incredibly fast into its wear + tear grave ).
What happens, if all the spawned processes do not fit into the physical RAM?
Virtual memory paging and swaps start to literally deteriorate the rest of the so far somehow "just"-by-coincidence-( read: weakly-co-ordinated )-[CONCURRENTLY]-scheduled processing ( read: further-decreased individual PROCESS-under-TEST performance ).
Things may so soon go wreck havoc, if not under due control & supervision.
Again - fact matters: a light-weight resources-monitor class may help:
aResRECORDER.show_usage_since0() method returns:
ResCONSUMED[T0+ 166036.311 ( 0.000000)]
user= 2475.15
nice= 0.36
iowait= 0.29
irq= 0.00
softirq= 8.32
stolen_from_VM= 26.95
guest_VM_served= 0.00
Similarly a bit richer constructed resources-monitor may report a wider O/S context, to see where additional resource stealing / contention / race-conditions deteriorate the actually achieved process-flow:
>>> psutil.Process( os.getpid()
).memory_full_info()
( rss = 9428992,
vms = 158584832,
shared = 3297280,
text = 2322432,
lib = 0,
data = 5877760,
dirty = 0
)
.virtual_memory()
( total = 25111490560,
available = 24661327872,
percent = 1.8,
used = 1569603584,
free = 23541886976,
active = 579739648,
inactive = 588615680,
buffers = 0,
cached = 1119440896
)
.swap_memory()
( total = 8455712768,
used = 967577600,
free = 7488135168,
percent = 11.4,
sin = 500625227776,
sout = 370585448448
)
Wed Oct 19 03:26:06 2017
166.445 ___VMS______________Virtual Memory Size MB
10.406 ___RES____Resident Set Size non-swapped MB
2.215 ___TRS________Code in Text Resident Set MB
14.738 ___DRS________________Data Resident Set MB
3.305 ___SHR_______________Potentially Shared MB
0.000 ___LIB_______________Shared Memory Size MB
__________________Number of dirty pages 0x
Last but not least, why one can easily pay more than earn in return?
Besides the gradually built records of evidence, how the real-world system-deployment add-on overheads accumulate the costs, the recently re-formulated Amdahl's Law, extended so as to cover both the add-on overhead-costs plus the "process-atomicity" of the further indivisible parts' sizing, defines a maximum add-on costs threshold, that might be reasonable paid, if some distributed processing is to provide any above >= 1.00 computing process speedup.
Dis-obeying the explicit logic of the re-formulated Amdahl's Law causes a process to proceed worse than if having been processed in a pure-[SERIAL] process-scheduling ( and sometimes the results of poor design and/or operations practices may look as if it were a case, when a joblib.Parallel()( joblib.delayed(...) ) method "blocks the process" ).

Related

How to improve code performance ( using Google Translate API )

import time
start = time.time()
import pandas as pd
from deep_translator import GoogleTranslator
data = pd.read_excel(r"latestdata.xlsx")
translatedata = data['column']. fillna('novalue')
list = []
for i in translatedata:
finaldata = GoogleTranslator(source='auto', target='english').translate(i)
print(finaldata)
list.append(finaldata)
df = pd.DataFrame(list, columns=['Translated_values'])
df.to_csv(r"jobdone.csv", sep= ';')
end = time.time()
print(f"Runtime of the program is {end - start}")
I have data of 220k points and trying to translate a column data At first I tried to use pool method parallel program but got an error that I can not access API several time at once. My question is if there is other way to improve performance of code that I have right now.
# 4066.826668739319 with just 10000 data all together.
# 3809.4675991535187 computation time when I run in 2 batch's of 5000
Q :" ... is ( there ) other way to improve performance of code ...? "
A :Yes, there are a few ways,yet do not expect anything magical, as you have already reported the API-provider's throttling/blocking somewhat higher levels of concurrent API-call from being served
There still might be some positive effects from latency-masking tricks from a just-[CONCURRENT] orchestration of several API-calls, as the End-to-End latencies are principally "long" as going many-times across the over-the-"network"-horizons and having also some remarkable server-side TAT-latency on translation-matching engines.
Details matter, a lot...
A performance boosting code-template to start with( avoiding 220k+ repeated local-side overheads' add-on costs ) :
import time
import pandas as pd
from deep_translator import GoogleTranslator as gXLTe
xltDF = pd.read_excel( r"latestdata.xlsx" )['column'].fillna( 'novalue' )
resDF = xltDF.copy( deep = True )
PROC_ns_START = time.perf_counter_ns()
#________________________________________________________ CRITICAL SECTION: start
for i in range( len( xltDF ) ):
resDF.iloc( i ) = gXLTe( source = 'auto',
target = 'english'
).translate( xltDF.iloc( i ) )
#________________________________________________________ CRITICAL SECTION: end
PROC_ns_END = time.perf_counter_ns()
resDF.to_csv( r"jobdone.csv",
sep = ';'
)
print( f"Runtime was {0:} [ns]".format( PROC_ns_END - PROC_ns_START ) )
Tips for performance boosting :
if Google API-policy permits, we may increase thread-count, that participate on CRITICAL SECTION,
as the Python-interpreter threads are "inside" the same address-space and still are GIL-lock MUTEX-blocked, we may operate all just-[CONCURRENT] accesses to the same DataFrame-objects, best using non-overlapping, separate (thread-private) block-iterators over disjunct halves ( for a pair of threads ) over disjunct thirds ( for 3 threads ) etc...
as the Google API-policy is limiting attempts to overly concurrent access to the API-service, you shall build-in some, even naive-robustness
def thread_hosted_blockCRAWLer( i_start, i_end ):
for i in range( i_start, i_end ):
while True:
try:
resDF.iloc( i ) = gXLTe( source = 'auto',
target = 'english'
).translate( xltDF.iloc( i ) )
# SUCCEDED
break
except:
# FAILED
print( "EXC: _blockCRAWLer() on index ", i )
time.sleep( ... )
# be careful here, not to get on API-provider's BLACK-LIST
continue
if more time-related details per thread, may reuse this
Do not hesitate to go tuning & tweaking - and anyway, keep us posted how fast you managed to get, that's fair, isn't it?

Parallel downloading with urlretrieve

I regularly have to download and rename HTML pages in bulk and wrote this simple code for it a while ago:
import shutil
import os
import sys
import socket
socket.setdefaulttimeout(5)
file_read = open(my_file, "r")
lines = file_read.readlines()
for line in lines:
try:
import urllib.request
sl = line.strip().split(";")
url = sl[0]
newname = str(sl[1])+".html"
urllib.request.urlretrieve(url, newname)
except:
pass
file_read.close()
This works well enough for a few hundred websites, but takes waaaaay too long for a larger number of downloads (20-50k). What would be the simplest and best way to speed it up?
Q :" I regularly have to ...What would be the simplest and best way to speed it up ? "
A :The SIMPLEST ( what the commented approach is not ) &the BEST wayis to at least : (a) minimise all overheads ( 50k times Thread-instantiation costs being one such class of costs ),(b) harness embarrasing independence ( yet, not a being a True-[PARALLEL] ) in process-flow(c) go as close as possible to bleeding edges of a just-[CONCURRENT], latency-masked process-flow
Givenboth the simplicity & performance seem to be the measure of "best"-ness:
Any costs, that do not first justify the costs of introducing themselves by so much increased performance, and second, that do not create additional positive net-effect on performance ( speed-up ) are performance ANTI-patterns & unforgivable Computer Science sins.
ThereforeI could not promote using GIL-lock (by-design even a just-[CONCURRENT]-processing prevented) bound & performance-suffocated step-by-step round-robin stepping of any amount of Python-threads in a one-after-another-after-another-...-re-[SERIAL]-ised chain of about 100 [ms]-quanta of code-interpretation time-blocks a one and only one such Python-thread is being let to run ( where all others are blocked-waiting ... being rather a performance ANTI-pattern, isn't it? ),sorather go in for process-based concurrency of work-flow ( performance gains a lot here, for ~ 50k url-fetches, where large hundreds / thousands of [ms]-latencies ( protocol-and-security handshaking setup + remote url-decode + remote content-assembly + remote content-into- protocol-encapsulation + remote-to-local network-flows + local protocol-decode + ... ).
Sketched process-flow framework :
from joblib import Parallel, delayed
MAX_WORKERs = ( n_CPU_cores - 1 )
def main( files_in ):
""" __doc__
.INIT worker-processes, each with a split-scope of tasks
"""
IDs = range( max( 1, MAX_WORKERs ) )
RES_if_need = Parallel( n_jobs = MAX_WORKERs
)( delayed( block_processor_FUN #-- fun CALLABLE
)( my_file, #---------- fun PAR1
wPROC #---------- fun PAR2
)
for wPROC in IDs
)
def block_processor_FUN( file_with_URLs = None,
file_from_PART = 0
):
""" __doc__
.OPEN file_with_URLs
.READ file_from_PART, row-wise - till next part starts
- ref. global MAX_WORKERs
"""
...
This is the initial Python-interpreter __main__-side trick to spawn just-enough worker-processes, that start crawling the my_file-"list" of URL-s independently AND an indeed just-[CONCURENT]-flow of work starts, one being independent of any other.
The block_processor_FUN(), passed by reference to the workers does simlpy open the file, and starts fetching/processing only its "own"-fraction, being from ( wPROC / MAX_WORKERs ) to ( ( wPROC + 1 ) / MAX_WORKERs ) of it's number of lines.
That simple.
If willing to tune-up corner-cases, where some URL may take and takes longer, than one may improve the form of load-balancing fair-queueing, yet at a cost of more complex design ( many process-to-process messaging queues are available ), having a { __main__ | main() }-side FQ/LB-feeder and making worker-processes retrieve their next task from such job-request FQ/LB-facility.
More complex & more robust to uneven distribution of URL-serving durations "across" the my_file-ordered list of URL-s to serve.
The choices of levels of simplicity / complexity compromises, that impact the resulting performance / robustness are yours.
For more details you may like to read this and code from this and there directed examples or tips for further performace-boosting.

Multicore programming

At the company where I am interning, I was told about the use of multi-core programming and, in view of a project I am developing for my thesis (I'm not from the area but I'm working on something that involves coding).
I want to know if this is possible:I have a defined function that will be repeated 3x for 3 different variables. Is it possible to put the 3 running at the same time in different core (because they don't need each other information)? Because the calculation process is the same for all of them and instead of running 1 variable at a time, I would like to run all 3 at once (performing all the calculations at the same time) and in the end returning the results.
Some part of what I would like to optimize:
for v in [obj2_v1, obj2_v2, obj2_v3]:
distancia_final_v, \
pontos_intersecao_final_v = calculo_vertice( obj1_normal,
obj1_v1,
obj1_v2,
obj1_v3,
obj2_normal,
v,
criterio
)
def calculo_vertice( obj1_normal,
obj1_v1,
obj1_v2,
obj1_v3,
obj2_normal,
obj2_v,
criterio
):
i = 0
distancia_final_v = []
pontos_intersecao_final_v = []
while i < len(obj2_v):
distancia_relevante_v = []
pontos_intersecao_v = []
distancia_inicial = 1000
for x in range(len(obj1_v1)):
planeNormal = np.array( [obj1_normal[x][0],
obj1_normal[x][1],
obj1_normal[x][2]
] )
planePoint = np.array( [ obj1_v1[x][0],
obj1_v1[x][1],
obj1_v1[x][2]
] ) # Any point on the plane
rayDirection = np.array([obj2_normal[i][0],
obj2_normal[i][1],
obj2_normal[i][2]
] ) # Define a ray
rayPoint = np.array([ obj2_v[i][0],
obj2_v[i][1],
obj2_v[i][2]
] ) # Any point along the ray
Psi = Calculos.line_plane_collision( planeNormal,
planePoint,
rayDirection,
rayPoint
)
a = Calculos.area_trianglo_3d( obj1_v1[x][0],
obj1_v1[x][1],
obj1_v1[x][2],
obj1_v2[x][0],
obj1_v2[x][1],
obj1_v2[x][2],
obj1_v3[x][0],
obj1_v3[x][1],
obj1_v3[x][2]
)
b = Calculos.area_trianglo_3d( obj1_v1[x][0],
obj1_v1[x][1],
obj1_v1[x][2],
obj1_v2[x][0],
obj1_v2[x][1],
obj1_v2[x][2],
Psi[0][0],
Psi[0][1],
Psi[0][2]
)
c = Calculos.area_trianglo_3d( obj1_v1[x][0],
obj1_v1[x][1],
obj1_v1[x][2],
obj1_v3[x][0],
obj1_v3[x][1],
obj1_v3[x][2],
Psi[0][0],
Psi[0][1],
Psi[0][2]
)
d = Calculos.area_trianglo_3d( obj1_v2[x][0],
obj1_v2[x][1],
obj1_v2[x][2],
obj1_v3[x][0],
obj1_v3[x][1],
obj1_v3[x][2],
Psi[0][0],
Psi[0][1],
Psi[0][2]
)
if float("{:.5f}".format(a)) == float("{:.5f}".format(b + c + d)):
P1 = Ponto( Psi[0][0], Psi[0][1], Psi[0][2] )
P2 = Ponto( obj2_v[i][0], obj2_v[i][1], obj2_v[i][2] )
distancia = Calculos.distancia_pontos( P1, P2 ) * 10
if distancia < distancia_inicial and distancia < criterio:
distancia_inicial = distancia
distancia_relevante_v = []
distancia_relevante_v.append( distancia_inicial )
pontos_intersecao_v = []
pontos_intersecao_v.append( Psi )
x += 1
distancia_final_v.append( distancia_relevante_v )
pontos_intersecao_final_v.append( pontos_intersecao_v )
i += 1
return distancia_final_v, pontos_intersecao_final_v
In this example of my code, I want to make the same process happen for obj2_v1, obj2_v2, obj2_v3.
Is it possible to make them happen at the same time?
Because I will be using a considerable amount of data and it would probably save me some time of processing.
multiprocessing (using processes to avoid the GIL) is the easiest but you're limited to relatively small performance improvements, number of cores speedup is the limit, see Amdahl's law. there's also a bit of latency involved in starting / stopping work which means it's much better for things that take >10ms
in numeric heavy code (like this seems to be) you really want to be moving as much of the it "inside numpy", look at vectorisation and broadcasting. this can give speedups of >50x (just on a single core) while staying easier to understand and reason about
if your algorithm is difficult to express using numpy intrinsics then you could also look at using Cython. this allows you to write Python code that automatically gets compiled down to C, and hence a lot faster. 50x faster is probably also a reasonable speedup, and this is still running on a single core
the numpy and Cython techniques can be combined with multiprocessing (i.e. using multiple cores) to give code that runs hundreds of times faster than naive implementations
Jupyter notebooks have friendly extensions (known affectionately as "magic") that make it easier to get started with this sort of performance work. the %timeit special allows you to easily time parts of the code, and the Cython extension means you can put everything into the same file
It's possible, but use python multiprocessing lib, because the threading lib doesn't delivery parallel execution.
UPDATE
DON'T do something like that (thanks for #user3666197 for pointing the error) :
from multiprocessing.pool import ThreadPool
def calculo_vertice(obj1_normal,obj1_v1,obj1_v2,obj1_v3,obj2_normal,obj2_v,criterio):
#your code
return distancia_final_v,pontos_intersecao_final_v
pool = ThreadPool(processes=3)
async_result1 = pool.apply_async(calculo_vertice, (#your args here))
async_result2 = pool.apply_async(calculo_vertice, (#your args here))
async_result3 = pool.apply_async(calculo_vertice, (#your args here))
result1 = async_result1.get() # result1
result2 = async_result2.get() # result2
result3 = async_result3.get() # result3
Instead, something like this should do the job:
from multiprocessing import Process, Pipe
def calculo_vertice(obj1_normal,obj1_v1,obj1_v2,obj1_v3,obj2_normal,obj2_v,criterio, send_end):
#your code
send_end.send((distancia_final_v,pontos_intersecao_final_v))
numberOfWorkers = 3
jobs = []
pipeList = []
#Start process and build job list
for i in range(numberOfWorkers):
recv_end, send_end = Pipe(False)
process = Process(target=calculo_vertice, args=(#<... your args...>#, send_end))
jobs.append(process)
pipeList.append(recv_end)
process.start()
#Show the results
for job in jobs: job.join()
resultList = [x.recv() for x in pipeList]
print (resultList)
REF.
https://docs.python.org/3/library/multiprocessing.html
https://stackoverflow.com/a/37737985/8738174
This code will create a pool of 3 working process and each of it will async receive the function. It's important to point that in this case you should have 3+ CPU cores, otherwise, your system kernel will just switch between process and things won't real run in parallel.
Q : " Is it possible to make them happen at the same time? "
Yes.
The best results ever will be get if not adding any python ( the multiprocessing module is not necessary at all for doing 3 full-copies ( yes, top-down fully replicated copies ) of the __main__ python process for this so embarrasingly independent processing ).
The reasons for this are explained in detail here and here.
A just-enough tool is GNU's :$ parallel --jobs 3 python job-script.py {} ::: "v1" "v2" "v3"
For all performance-tweaking, read about more configuration details in man parallel.
"Because I will be using a considerable amount of data..."
The Devil is hidden in the details :
The O/P code may be syntactically driving the python interpreter to results, precise (approximate) within some 5 decimal places, yet the core sin is it's ultimately bad chances to demonstrate any reasonably achievable performance in doing that, the worse on "considerable amount of data".
If they, "at the company", expect some "considerable amount of data", you should do at least some elementary research on what is the processing aimed at.
The worst part ( not mentioning the decomposition of once vectorised-ready numpy-arrays back into atomic "float" coordinate values ) is the point-inside-triangle test.
For a brief analysis on how to speed-up this part ( the more if going to pour "considerable amount of data" on doing this ), get inspired from this post and get the job done in fraction of the time it was drafted in the O/P above.
Indirect testing of a point-inside-triangle by comparing an in-equality of a pair of re-float()-ed-strings, received from sums of triangle-areas ( b + c + d ) is just one of the performance blockers, you will find to get removed.

Parallel programming on a nested for loop in Python using PyCuda (or else?)

Part of my python function looks like this:
for i in range(0, len(longitude_aq)):
center = Coordinates(latitude_aq[i], longitude_aq[i])
currentAq = aq[i, :]
for j in range(0, len(longitude_meo)):
currentMeo = meo[j, :]
grid_point = Coordinates(latitude_meo[j], longitude_meo[j])
if is_in_circle(center, RADIUS, grid_point):
if currentAq[TIME_AQ] == currentMeo[TIME_MEO]:
humidity += currentMeo[HUMIDITY_MEO]
pressure += currentMeo[PRESSURE_MEO]
temperature += currentMeo[TEMPERATURE_MEO]
wind_speed += currentMeo[WIND_SPEED_MEO]
wind_direction += currentMeo[WIND_DIRECTION_MEO]
count += 1.0
if count != 0.0:
final_tmp[i, HUMIDITY_FINAL] = humidity/count
final_tmp[i, PRESSURE_FINAL] = pressure/count
final_tmp[i, TEMPERATURE_FINAL] = temperature/count
final_tmp[i, WIND_SPEED_FINAL] = wind_speed/count
final_tmp[i, WIND_DIRECTION_FINAL] = wind_direction/count
humidity, pressure, temperature, wind_speed, wind_direction, count = 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
final.loc[:, :] = final_tmp[:, :]
Problem: len(longitude_aq) is approx. 320k and len(longitude_meo) is 7millions. Which brings this code to somewhere close to 2100 billions iterations...
I have to iterate over one file (the longitude_aq one) and then compute some average value, iterating over the second file (the longitude_meo one) given some features extracted from the first file.
It doesn't seem like I can proceed any differently.
Possible solution: parallel programming. My university allows me access to their HPC. They have several nodes + several GPUs accessible (list here for the GPUs and here for the CPUs)
Target: Having no experience in CUDA programming with Python, I am wondering what would be the easiest way to transform my code into something runnable by the HPC, so that the computation time drops drastically.
Sorry, the read might get hard to read, but reality is cruel and many enthusiasts might easily spoil man*months of "coding" efforts into a principally a-priori lost war. Better carefully re-assess all the a-priori known CONS / PROS of any re-engineering plans, before going to blind spend a single man*day into a principally wrong direction.Most probably I would not post it here, if I were not exposed to a Project, where top-level academicians have spent dozens of man*years, yes, more than a year with a team of 12+, for "producing" a processing taking ~ 26 [hr], which was reproducible in less than ~ 15 [min] ( and way cheaper in HPC/GPU infrastructure costs ), if designed using the proper ( hardware-performance non-devastating ) design methods...
It doesn't seem like I can proceed any differently.
Well, actually pretty hard to tell, if not impossible :
Your post seems to assume a few cardinal things that may pretty avoid getting any real benefit from moving the above sketched idea onto an indeed professional HPC / GPU infastructure.
Possible solution: parallel programming
A way easier to say it / type it, than to actually do it.
A-wish-to-run-in-true-[PARALLEL] process scheduling remains just a wish and ( believe me, or Gene Amdahl, or other C/S veterans or not ) indeed hard process re-design is required, if your code ought get any remarkably better than in a pure-[SERIAL] code-execution flow ( as posted above )
1 ) a pure-[SERIAL] nature of the fileIO can( almost )kill the game :
the not posted part about pure-[SERIAL] file-accesses ( two files with data-points ) ... any fileIO is by nature the most expensive resource and except smart re-engineering a pure-[SERIAL] ( at best a one-stop-cost, but still ) (re-)reading in a sequential manner, so do not expect any Giant-Leap anywhere far from this in the re-engineered code. This will be always slowest and always expensive phase.
BONUS:While this may seem as the least sexy item in the inventory list of the parallel-computing, pycuda, distributed-computing, hpc, parallelism-amdahl or whatever the slang brings next, the rudimentary truth is that making an HPC-computing indeed fast and resources efficient, both the inputs ( yes, the static files ) and the computing strategy are typically optimised for stream-processing and to best also enjoy a non-broken ( collision avoided ) data-locality, if peak performance is to get achieved. Any inefficiency in these two domains can, not just add-on, but actually FACTOR the computing expenses ( so DIVIDE performance ) and differences may easily get into several orders of magnitude ( from [ns] -> [us] -> [ms] -> [s] -> [min] -> [hr] -> [day] -> [week], you name them all ... )
2 ) cost / benefits may get you PAY-WAY-MORE-THAN-YOU-GET
This part is indeed your worst enemy : if a lumpsum of your efforts is higher than a sum of net benefits, GPU will not add any added value at all, or not enough, so as to cover your add-on costs.
Why?
GPU engines are SIMD devices, that are great for using a latency-masking over a vast area of a repetitively the very same block of SMX-instructions, whih needs a certain "weight"-of-"nice"-mathematics to happen locally, if they are to show any processing speedup over other problem implementation strategies - GPU devices ( not the gamers' ones, but the HPC ones, which not all cards in the class are, are they? ) deliver best for indeed small areas of data-locality ( micro-kernel matrix operations, having a very dense, best very small SMX-local "RAM" footprint of such a dense kernel << ~ 100 [kB] as of 2018/Q2 ).
Your "computing"-part of the code has ZERO-RE-USE of any single data-element that was ( rather expensively ) fetched from an original static storage, so almost all the benefits, that the GPU / SMX / SIMD artillery has been invented for is NOT USED AT ALL and you receive a NEGATIVE net-benefit from trying to load that sort of code onto such a heterogeneous ( NUMA complicated ) distributed computing ( yes, each GPU-device is a rather "far", "expensive" ( unless your code will harness it's SMX-resources up until almost smoke comes out of the GPU-silicon ... ) and "remote" asynchronously operated distributed-computing node, inside your global computing strategy ) system.
Any first branching of the GPU code will be devastatingly expensive in SIMD-execution costs of doing so, thus your heavily if-ed code is syntactically fair, but performance-wise an almost killer of the game:
for i in range( 0,
len( longitude_aq )
): #______________________________________ITERATOR #1 ( SEQ-of-I-s )
currentAq = aq[i, :] # .SET
center = Coordinates( latitude_aq[i], # .SET
longitude_aq[i]
) # |
# +-------------> # EASY2VECTORISE in [i]
for j in range( 0,
len( longitude_meo )
): #- - - - - - - - - - - - - - - - - ITERATOR #2 ( SEQ-of-J-s )
currentMeo = meo[j, :] # .SET
grid_point = Coordinates( latitude_meo[j], # .SET
longitude_meo[j]
) # |
# +-------> # EASY2VECTORISE in [j]
if is_in_circle( center,
RADIUS,
grid_point
): # /\/\/\/\/\/\/\/\/\/\/\/\/\/ IF-ed SIMD-KILLER #1
if ( currentAq[TIME_AQ]
== currentMeo[TIME_MEO]
): # /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\ IF-ed SIMD-KILLER #2
humidity += currentMeo[ HUMIDITY_MEO] # BEST PERF.
pressure += currentMeo[ PRESSURE_MEO] # IF SMART
temperature += currentMeo[ TEMPERATURE_MEO] # CURATED
wind_speed += currentMeo[ WIND_SPEED_MEO] # AS NON
wind_direction += currentMeo[WIND_DIRECTION_MEO] # ATOMICS
count += 1.0
if count != 0.0: # !!!!!!!!!!!!!!!!!! THIS NEVER HAPPENS
# EXCEPT WHEN ZERO DATA-POINTS WERE AVAILABLE FOR THE i-TH ZONE,
# FILE DID NOT CONTAIN ANY SUCH,
# WHICH IS FAIR,
# BUT SUCH A BLOCK OUGHT NEVER HAVE STARTED ANY COMPUTING AT ALL
# IF ASPIRING FOR INDEED BEING LOADED
# ONTO AN HPC-GRADE COMPUTING INFRASTRUCTURE ( SPONSORED OR NOT )
#
final_tmp[i, HUMIDITY_FINAL] = humidity / count
final_tmp[i, PRESSURE_FINAL] = pressure / count
final_tmp[i, TEMPERATURE_FINAL] = temperature / count
final_tmp[i, WIND_SPEED_FINAL] = wind_speed / count
final_tmp[i, WIND_DIRECTION_FINAL] = wind_direction / count
If we omit both the iterators over the domain of all-[i,j]-s and the if-ed crossroads, the actual "useful"-part of the computing does a very shallow amount of mathematics -- the job contains a few SLOC-s, where principally independent values are summed ( best having avoided any collision of adding operation, so could be very cheaply operated independently each to the other ( best with well ahead pre-fetched constants ) in less than a few [ns] YES, your computing payload does not require anything more than just a few units [ns] to execute.
The problem is in smart-engineering of the data-flow ( I like to call that a DATA-HYDRAULICS ( how to make a further incompressible flow of DATA into the { CPU | GPU | APU | *** }-processor registers, so as to get 'em processes ) )
All the rest is easy. A smart solution of the HPC-grade DATA-HYDRAULICS typically is not.
No language, no framework will help you in this automatically. Some can release some part of the solution-engineering from your "manual" efforts, some cannot, some can even spoil a possible computing performance, due to "cheap" shortcuts in their internal design decision and compromises made, that do not benefit the same target you have - The Performance.
The Best next step?
A ) Try to better understand the limits of computing infrastructures you expect to use for your extensive ( but not intensive ( yes, just a few SLOC's per [i,j] ), which HPC-supervisors do not like to see flowing onto their operated expensive HPC-resources ).
B ) If in troubles with time + headcount + financial resouces to re-engineer the top-down DATA-HYDRAULICS solution, best re-factor your code so as to get at least into the vectorised, numpy / numba ( not always will numba get remarkably farther than an already smart numpy-vectorised code, but a quantitative test will tell the facts per inciden, not in general )
C ) If your computing-problem is expected to get re-run more often, definitely assess a re-designed pipeline from the early pre-processing of the data-storage ( the slowest part of the processing ), where a stream-based pre-processing of principally static values is possible, which could impact the resulting DATA-HYDRAULICS flow ( performance ) with a pre-computed + smart-aligned values the most. The block of a few ADD-s down the lane will not get improved beyond a few [ns], as reported above, but the slow-flow can jump orders of magnitude faster, if re-arranged into a smarted flow, harnessing all available, yet "just"-[CONCURRENT]-ly operated resources ( any attempt to try to arrange a True-[PARALLEL] scheduling is a pure nonsense here, as the task is principally by no means a [PARALLEL] scheduling problem, but a stream of pure-[SERIAL] (re-)processing of data-points, where a smart, yet "just"-[CONCURRENT] processing re-arrangement may help scale-down the resulting duration of the process ).
BONUS:If interested in deeper reasoning about the achievable performance gains from going into N-CPU operated computing graphs, feel free to learn more about re-formulated Amdahl's Law and related issues, as posted in further details here.

Pygmo2: migration between islands in an archipelago during evolution

I'm trying to use the Python library Pygmo2 (https://esa.github.io/pagmo2/index.html) to parallelize an optimization problem.
To my understanding, parallelization can be achieved with an archipelago of islands (in this case, mp_island).
As a minimal working example, one of the tutorials from the official site can serve: https://esa.github.io/pagmo2/docs/python/tutorials/using_archipelago.html
I extracted the code:
class toy_problem:
def __init__(self, dim):
self.dim = dim
def fitness(self, x):
return [sum(x), 1 - sum(x*x), - sum(x)]
def gradient(self, x):
return pg.estimate_gradient(lambda x: self.fitness(x), x)
def get_nec(self):
return 1
def get_nic(self):
return 1
def get_bounds(self):
return ([-1] * self.dim, [1] * self.dim)
def get_name(self):
return "A toy problem"
def get_extra_info(self):
return "\tDimensions: " + str(self.dim)
import pygmo as pg
a_cstrs_sa = pg.algorithm(pg.cstrs_self_adaptive(iters=1000))
p_toy = pg.problem(toy_problem(50))
p_toy.c_tol = [1e-4, 1e-4]
archi = pg.archipelago(n=32,algo=a_cstrs_sa, prob=p_toy, pop_size=70)
print(archi)
archi.evolve()
print(archi)
Looking at the documentation of the old version of the library (http://esa.github.io/pygmo/documentation/migration.html), migration between islands seems to be an essential feature of the island parallelization model.
Also, to my understanding, optimization algorithms like evolutionary algorithms could not work without it.
However, in the documentation of Pygmo2, I can nowhere find how to perform migration.
Is it happening automatically in an archipelago?
Does it depend on the selected algorithm?
Is it not yet implemented in Pygmo2?
Is the documentation on this yet missing or did I just not find it?
Can somebody enlighten me?
pagmo2 is now implementing migration since v2.11, the PR has benn completed and merged into master. Almost all capabilities present in pagmo1.x are restored. We will still add more topologies in the future, but they can already be implemented manually. Refer to docs here: https://esa.github.io/pagmo2/docs/cpp/cpp_docs.html
Tutorial and example are missing and will be added in the near future (help is welcome)
The migration framework has not been fully ported from pagmo1 to pagmo2 yet. There is a long-standing PR open here:
https://github.com/esa/pagmo2/pull/102
We will complete the implementation of the migration framework in the next few months, hopefully by the beginning of the summer.
IMHO, the PyGMO2/pagmo documentation is confirming the migration feature to be present.
The archipelago class is the main parallelization engine of pygmo. It essentially is a container of island able to initiate evolution (optimization tasks) in each island asynchronously while keeping track of the results and of the information exchange (migration) between the tasks ...
With an exception of thread_island-s ( where some automated inference may take place and enforce 'em for thread-safe UDI-s ), all other island types - { mp_island | ipyparallel_island }-s do create a GIL-independent form of a parallelism, yet the computing is performed via an async-operated .evolve() method
In original PyGMO, the archipelago class was auto .__init__()-ed with attribute topology = unconnected(), unless specified explicitly, as documented in PyGMO, having a tuple of call-interfaces for archipelago.__init__() method ( showing just the matching one ):
__init__( <PyGMO.algorithm> algo,
<PyGMO.problem> prob,
<int> n_isl,
<int> n_ind [, topology = unconnected(),
distribution_type = point_to_point,
migration_direction = destination
]
)
But, adding that, one may redefine the default, so as to meet one's PyGMO evolutionary process preferences:
topo = topology.erdos_renyi( nodes = 100,
p = 0.03
) # Erdos-Renyi ( random ) topology
or
set a Clustered Barabási-Albert, with ageing vertices graph topology:
topo = topology.clustered_ba( m0 = 3,
m = 3,
p = 0.5,
a = 1000,
nodes = 0
) # clustered Barabasi-Albert,
# # with Ageing vertices topology
or:
topo = topology.watts_strogatz( nodes = 100,
p = 0.1
) # Watts-Strogatz ( circle
# + links ) topology
and finally, set it by assignment into the class-instance attribute:
archi = pg.archipelago( n = 32,
algo = a_cstrs_sa,
prob = p_toy,
pop_size = 70
) # constructs an archipelago
archi.topology = topo # sets the topology to the
# # above selected, pre-defined <topo>

Categories