Byzantine Consensus Randomized - Monte Carlo Implementation with matrix for value sending

Byzantine Consensus Randomized - Monte Carlo Implementation with matrix for value sending - python

In order to achieve the Bizantine Consensus, these 3 rules must apply: during every round each process must send a bit (with value either 0 or 1) to the others n-1 processes; before the start of a new round each process must have received a bit from every other process; in each round every process has sent his bit to all the other processes.
I need to implement the following Rabin-based, Monte Carlo Byzantine Consensus randomized protocol for 661 non-traitor processes out of 991 total processes:
Variables:
b(i) = value to send to all other processes, initially = input value
p = global random number which can be either 1 and 0, picked after every run
maj(i) = majority value
tally(i) = number of times majority appears in b(i)
Implementation:
loop = TRUE
while (loop)
1. send b(i) to the other n−1 processes
2. receive the values sent by the other n−1 processes
3. set maj(i) = majority value in b(i)
4. set tally(i) = number of times the majority value appears
5. if tally(i) ≥ 2t + 1
then b(i) = maj(i)
else if (p=1) then b(i) = 1
else b(i) ← 0
My question would be: how can I implement the data structure which stores for every process the bits they have sent and received, not to mention how to implement the sending mechanism itself? I was thinking about implementing an array A.i[j] for each process, where j ranges over all process ids. I have heard it is possible to make processes read the values of the other n-1 processes from such a table instead of implemeneting a sending mechanism; how could I implement this?

I solved this problem by creating a class Process with a Type and vote attribute and a getVote() method,
Type: reliable, unreliable
Vote: the vote (v0 at the beginning, it will converge to v following the protocol)
setVote(): sets the vote.
getVote(): if type is reliable then return the bit else return random number between 0 and 1.
To simulate a distributed algo, you can run the protocol in separate threads, each one containing a process class and communicate with a shared common array which you can update each round.
An example here
You can also simulate everything starting from an array of these objects, without the need of separate threads.
I believe that should be enough to handle the model.
something like this:
export class Process {
constructor(public pid: number, public vote: number, public type: ProcessType ) { }
set(bit: number) {
this.vote = this.send(bit);
}
send(b) {
if (this.type === ProcessType.unreliable) {
return Math.round(Math.random());
}
return b;
}
get() {
return { pid: this.pid, vote: this.vote, type: this.type};
}
getVote() {
return this.send(this.vote);
}
}
Good luck!

Related

Implementing multiple branching rules in the same branch and bound tree in PySCIPOpt

I would like to implement a custom branching rule initially (for a few nodes in the top of the tree), and then use Scip's implementation of vanilla full strong branching rule (or some other rule like pseudocost). Is this possible to do using/by extending
PySCIPOpt?
import pyscipopt as scip
import random
class oddevenbranch(scip.Branchrule):
def branchexeclp(self, allowaddcons):
'''
This rule uses the branching rule I have defined if the node number is odd,
and should use strong branching otherwise.
'''
node_ = self.model.getCurrentNode()
num = node_.getNumber()
if num % 2 == 1:
candidate_vars, *_ = self.model.getLPBranchCands()
branch_var_idx = random.randint(0,len(candidate_vars)-1)
branch_var = candidate_vars[branch_var_idx]
self.model.branchVar(branch_var)
result = scip.SCIP_RESULT.BRANCHED
return {"result": result}
else:
print(num, ': Did not branch')
result = scip.SCIP_RESULT.DIDNOTRUN
return {"result": result}
if __name__ == "__main__":
m1 = scip.Model()
m1.readProblem('xyz.mps') # Used to read the instance
m1.setIntParam('branching/fullstrong/priority', 11000)
branchrule = oddevenbranch()
m1.includeBranchrule(branchrule=branchrule,
name="CustomRand", # name of the branching rule
desc="", # description of the branching rule
priority=100000, # priority: set to this to make it default
maxdepth=-1, # maximum depth up to which it will be used, or -1 for no restriction
maxbounddist=1) # maximal relative distance from current node's dual bound to primal
m1.optimize()
I wonder what is causing this behavior. Does it need to call branching multiple times to perform strong branching?
2 : Did not branch
2 : Did not branch
2 : Did not branch
2 : Did not branch
2 : Did not branch
2 : Did not branch
2 : Did not branch
2 : Did not branch
2 : Did not branch
2 : Did not branch

I believe you can achieve this effect. SCIP has several branching rules and executes them one by one according to their priorities until one of them produces a result.
The default rule "relpscost" has a priority of 10000, so you should create a custom rule with a higher priority.
If you do not want to use your own rule at a node (deeper in the tree), you can
decide in the branchexeclp(allowedcons) callback of your rule to return a dict
{'result': pyscipopt.SCIP_RESULT.DIDNOTRUN}
which should inform SCIP that your rule did not execute and should be skipped, in which case the "relpscost" rule should take over (see here).
Edit: It is not quite obvious to me how your instance looks like so I am guessing a bit here: I think you are right in assuming that the multiple calls at node 2 are due to the strong branching. You can check by switching to another backup strategy, such as "mostinf".
Additionally, it is really helpful to check out the statistics by calling model.printStatistics() after the optimization has finished. I tried your code on the "cod105" instance of the MIPlIB benchmarks and similarly found a set of "Did not branch" outputs. The (abbreviated) statistics regarding branching rules are the following:
Branching Rules : ExecTime SetupTime BranchLP BranchExt BranchPS Cutoffs DomReds Cuts Conss Children
CustomRand : 0.00 0.00 1 0 0 0 0 0 0 2
fullstrong : 5.32 0.00 7 0 0 0 7 0 3 0
So the fallback rule is in fact called several times. Apparently, the "fullstrong" rule calls "CustomRand" internally, though this is not reflected in the actual statistics...

Multicore programming

At the company where I am interning, I was told about the use of multi-core programming and, in view of a project I am developing for my thesis (I'm not from the area but I'm working on something that involves coding).
I want to know if this is possible:I have a defined function that will be repeated 3x for 3 different variables. Is it possible to put the 3 running at the same time in different core (because they don't need each other information)? Because the calculation process is the same for all of them and instead of running 1 variable at a time, I would like to run all 3 at once (performing all the calculations at the same time) and in the end returning the results.
Some part of what I would like to optimize:
for v in [obj2_v1, obj2_v2, obj2_v3]:
distancia_final_v, \
pontos_intersecao_final_v = calculo_vertice( obj1_normal,
obj1_v1,
obj1_v2,
obj1_v3,
obj2_normal,
v,
criterio
)
def calculo_vertice( obj1_normal,
obj1_v1,
obj1_v2,
obj1_v3,
obj2_normal,
obj2_v,
criterio
):
i = 0
distancia_final_v = []
pontos_intersecao_final_v = []
while i < len(obj2_v):
distancia_relevante_v = []
pontos_intersecao_v = []
distancia_inicial = 1000
for x in range(len(obj1_v1)):
planeNormal = np.array( [obj1_normal[x][0],
obj1_normal[x][1],
obj1_normal[x][2]
] )
planePoint = np.array( [ obj1_v1[x][0],
obj1_v1[x][1],
obj1_v1[x][2]
] ) # Any point on the plane
rayDirection = np.array([obj2_normal[i][0],
obj2_normal[i][1],
obj2_normal[i][2]
] ) # Define a ray
rayPoint = np.array([ obj2_v[i][0],
obj2_v[i][1],
obj2_v[i][2]
] ) # Any point along the ray
Psi = Calculos.line_plane_collision( planeNormal,
planePoint,
rayDirection,
rayPoint
)
a = Calculos.area_trianglo_3d( obj1_v1[x][0],
obj1_v1[x][1],
obj1_v1[x][2],
obj1_v2[x][0],
obj1_v2[x][1],
obj1_v2[x][2],
obj1_v3[x][0],
obj1_v3[x][1],
obj1_v3[x][2]
)
b = Calculos.area_trianglo_3d( obj1_v1[x][0],
obj1_v1[x][1],
obj1_v1[x][2],
obj1_v2[x][0],
obj1_v2[x][1],
obj1_v2[x][2],
Psi[0][0],
Psi[0][1],
Psi[0][2]
)
c = Calculos.area_trianglo_3d( obj1_v1[x][0],
obj1_v1[x][1],
obj1_v1[x][2],
obj1_v3[x][0],
obj1_v3[x][1],
obj1_v3[x][2],
Psi[0][0],
Psi[0][1],
Psi[0][2]
)
d = Calculos.area_trianglo_3d( obj1_v2[x][0],
obj1_v2[x][1],
obj1_v2[x][2],
obj1_v3[x][0],
obj1_v3[x][1],
obj1_v3[x][2],
Psi[0][0],
Psi[0][1],
Psi[0][2]
)
if float("{:.5f}".format(a)) == float("{:.5f}".format(b + c + d)):
P1 = Ponto( Psi[0][0], Psi[0][1], Psi[0][2] )
P2 = Ponto( obj2_v[i][0], obj2_v[i][1], obj2_v[i][2] )
distancia = Calculos.distancia_pontos( P1, P2 ) * 10
if distancia < distancia_inicial and distancia < criterio:
distancia_inicial = distancia
distancia_relevante_v = []
distancia_relevante_v.append( distancia_inicial )
pontos_intersecao_v = []
pontos_intersecao_v.append( Psi )
x += 1
distancia_final_v.append( distancia_relevante_v )
pontos_intersecao_final_v.append( pontos_intersecao_v )
i += 1
return distancia_final_v, pontos_intersecao_final_v
In this example of my code, I want to make the same process happen for obj2_v1, obj2_v2, obj2_v3.
Is it possible to make them happen at the same time?
Because I will be using a considerable amount of data and it would probably save me some time of processing.

multiprocessing (using processes to avoid the GIL) is the easiest but you're limited to relatively small performance improvements, number of cores speedup is the limit, see Amdahl's law. there's also a bit of latency involved in starting / stopping work which means it's much better for things that take >10ms
in numeric heavy code (like this seems to be) you really want to be moving as much of the it "inside numpy", look at vectorisation and broadcasting. this can give speedups of >50x (just on a single core) while staying easier to understand and reason about
if your algorithm is difficult to express using numpy intrinsics then you could also look at using Cython. this allows you to write Python code that automatically gets compiled down to C, and hence a lot faster. 50x faster is probably also a reasonable speedup, and this is still running on a single core
the numpy and Cython techniques can be combined with multiprocessing (i.e. using multiple cores) to give code that runs hundreds of times faster than naive implementations
Jupyter notebooks have friendly extensions (known affectionately as "magic") that make it easier to get started with this sort of performance work. the %timeit special allows you to easily time parts of the code, and the Cython extension means you can put everything into the same file

It's possible, but use python multiprocessing lib, because the threading lib doesn't delivery parallel execution.
UPDATE
DON'T do something like that (thanks for #user3666197 for pointing the error) :
from multiprocessing.pool import ThreadPool
def calculo_vertice(obj1_normal,obj1_v1,obj1_v2,obj1_v3,obj2_normal,obj2_v,criterio):
#your code
return distancia_final_v,pontos_intersecao_final_v
pool = ThreadPool(processes=3)
async_result1 = pool.apply_async(calculo_vertice, (#your args here))
async_result2 = pool.apply_async(calculo_vertice, (#your args here))
async_result3 = pool.apply_async(calculo_vertice, (#your args here))
result1 = async_result1.get() # result1
result2 = async_result2.get() # result2
result3 = async_result3.get() # result3
Instead, something like this should do the job:
from multiprocessing import Process, Pipe
def calculo_vertice(obj1_normal,obj1_v1,obj1_v2,obj1_v3,obj2_normal,obj2_v,criterio, send_end):
#your code
send_end.send((distancia_final_v,pontos_intersecao_final_v))
numberOfWorkers = 3
jobs = []
pipeList = []
#Start process and build job list
for i in range(numberOfWorkers):
recv_end, send_end = Pipe(False)
process = Process(target=calculo_vertice, args=(#<... your args...>#, send_end))
jobs.append(process)
pipeList.append(recv_end)
process.start()
#Show the results
for job in jobs: job.join()
resultList = [x.recv() for x in pipeList]
print (resultList)
REF.
https://docs.python.org/3/library/multiprocessing.html
https://stackoverflow.com/a/37737985/8738174
This code will create a pool of 3 working process and each of it will async receive the function. It's important to point that in this case you should have 3+ CPU cores, otherwise, your system kernel will just switch between process and things won't real run in parallel.

Q : " Is it possible to make them happen at the same time? "
Yes.
The best results ever will be get if not adding any python ( the multiprocessing module is not necessary at all for doing 3 full-copies ( yes, top-down fully replicated copies ) of the __main__ python process for this so embarrasingly independent processing ).
The reasons for this are explained in detail here and here.
A just-enough tool is GNU's :$ parallel --jobs 3 python job-script.py {} ::: "v1" "v2" "v3"
For all performance-tweaking, read about more configuration details in man parallel.
"Because I will be using a considerable amount of data..."
The Devil is hidden in the details :
The O/P code may be syntactically driving the python interpreter to results, precise (approximate) within some 5 decimal places, yet the core sin is it's ultimately bad chances to demonstrate any reasonably achievable performance in doing that, the worse on "considerable amount of data".
If they, "at the company", expect some "considerable amount of data", you should do at least some elementary research on what is the processing aimed at.
The worst part ( not mentioning the decomposition of once vectorised-ready numpy-arrays back into atomic "float" coordinate values ) is the point-inside-triangle test.
For a brief analysis on how to speed-up this part ( the more if going to pour "considerable amount of data" on doing this ), get inspired from this post and get the job done in fraction of the time it was drafted in the O/P above.
Indirect testing of a point-inside-triangle by comparing an in-equality of a pair of re-float()-ed-strings, received from sums of triangle-areas ( b + c + d ) is just one of the performance blockers, you will find to get removed.

Parallel programming on a nested for loop in Python using PyCuda (or else?)

Part of my python function looks like this:
for i in range(0, len(longitude_aq)):
center = Coordinates(latitude_aq[i], longitude_aq[i])
currentAq = aq[i, :]
for j in range(0, len(longitude_meo)):
currentMeo = meo[j, :]
grid_point = Coordinates(latitude_meo[j], longitude_meo[j])
if is_in_circle(center, RADIUS, grid_point):
if currentAq[TIME_AQ] == currentMeo[TIME_MEO]:
humidity += currentMeo[HUMIDITY_MEO]
pressure += currentMeo[PRESSURE_MEO]
temperature += currentMeo[TEMPERATURE_MEO]
wind_speed += currentMeo[WIND_SPEED_MEO]
wind_direction += currentMeo[WIND_DIRECTION_MEO]
count += 1.0
if count != 0.0:
final_tmp[i, HUMIDITY_FINAL] = humidity/count
final_tmp[i, PRESSURE_FINAL] = pressure/count
final_tmp[i, TEMPERATURE_FINAL] = temperature/count
final_tmp[i, WIND_SPEED_FINAL] = wind_speed/count
final_tmp[i, WIND_DIRECTION_FINAL] = wind_direction/count
humidity, pressure, temperature, wind_speed, wind_direction, count = 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
final.loc[:, :] = final_tmp[:, :]
Problem: len(longitude_aq) is approx. 320k and len(longitude_meo) is 7millions. Which brings this code to somewhere close to 2100 billions iterations...
I have to iterate over one file (the longitude_aq one) and then compute some average value, iterating over the second file (the longitude_meo one) given some features extracted from the first file.
It doesn't seem like I can proceed any differently.
Possible solution: parallel programming. My university allows me access to their HPC. They have several nodes + several GPUs accessible (list here for the GPUs and here for the CPUs)
Target: Having no experience in CUDA programming with Python, I am wondering what would be the easiest way to transform my code into something runnable by the HPC, so that the computation time drops drastically.

Sorry, the read might get hard to read, but reality is cruel and many enthusiasts might easily spoil man*months of "coding" efforts into a principally a-priori lost war. Better carefully re-assess all the a-priori known CONS / PROS of any re-engineering plans, before going to blind spend a single man*day into a principally wrong direction.Most probably I would not post it here, if I were not exposed to a Project, where top-level academicians have spent dozens of man*years, yes, more than a year with a team of 12+, for "producing" a processing taking ~ 26 [hr], which was reproducible in less than ~ 15 [min] ( and way cheaper in HPC/GPU infrastructure costs ), if designed using the proper ( hardware-performance non-devastating ) design methods...
It doesn't seem like I can proceed any differently.
Well, actually pretty hard to tell, if not impossible :
Your post seems to assume a few cardinal things that may pretty avoid getting any real benefit from moving the above sketched idea onto an indeed professional HPC / GPU infastructure.
Possible solution: parallel programming
A way easier to say it / type it, than to actually do it.
A-wish-to-run-in-true-[PARALLEL] process scheduling remains just a wish and ( believe me, or Gene Amdahl, or other C/S veterans or not ) indeed hard process re-design is required, if your code ought get any remarkably better than in a pure-[SERIAL] code-execution flow ( as posted above )
1 ) a pure-[SERIAL] nature of the fileIO can( almost )kill the game :
the not posted part about pure-[SERIAL] file-accesses ( two files with data-points ) ... any fileIO is by nature the most expensive resource and except smart re-engineering a pure-[SERIAL] ( at best a one-stop-cost, but still ) (re-)reading in a sequential manner, so do not expect any Giant-Leap anywhere far from this in the re-engineered code. This will be always slowest and always expensive phase.
BONUS:While this may seem as the least sexy item in the inventory list of the parallel-computing, pycuda, distributed-computing, hpc, parallelism-amdahl or whatever the slang brings next, the rudimentary truth is that making an HPC-computing indeed fast and resources efficient, both the inputs ( yes, the static files ) and the computing strategy are typically optimised for stream-processing and to best also enjoy a non-broken ( collision avoided ) data-locality, if peak performance is to get achieved. Any inefficiency in these two domains can, not just add-on, but actually FACTOR the computing expenses ( so DIVIDE performance ) and differences may easily get into several orders of magnitude ( from [ns] -> [us] -> [ms] -> [s] -> [min] -> [hr] -> [day] -> [week], you name them all ... )
2 ) cost / benefits may get you PAY-WAY-MORE-THAN-YOU-GET
This part is indeed your worst enemy : if a lumpsum of your efforts is higher than a sum of net benefits, GPU will not add any added value at all, or not enough, so as to cover your add-on costs.
Why?
GPU engines are SIMD devices, that are great for using a latency-masking over a vast area of a repetitively the very same block of SMX-instructions, whih needs a certain "weight"-of-"nice"-mathematics to happen locally, if they are to show any processing speedup over other problem implementation strategies - GPU devices ( not the gamers' ones, but the HPC ones, which not all cards in the class are, are they? ) deliver best for indeed small areas of data-locality ( micro-kernel matrix operations, having a very dense, best very small SMX-local "RAM" footprint of such a dense kernel << ~ 100 [kB] as of 2018/Q2 ).
Your "computing"-part of the code has ZERO-RE-USE of any single data-element that was ( rather expensively ) fetched from an original static storage, so almost all the benefits, that the GPU / SMX / SIMD artillery has been invented for is NOT USED AT ALL and you receive a NEGATIVE net-benefit from trying to load that sort of code onto such a heterogeneous ( NUMA complicated ) distributed computing ( yes, each GPU-device is a rather "far", "expensive" ( unless your code will harness it's SMX-resources up until almost smoke comes out of the GPU-silicon ... ) and "remote" asynchronously operated distributed-computing node, inside your global computing strategy ) system.
Any first branching of the GPU code will be devastatingly expensive in SIMD-execution costs of doing so, thus your heavily if-ed code is syntactically fair, but performance-wise an almost killer of the game:
for i in range( 0,
len( longitude_aq )
): #______________________________________ITERATOR #1 ( SEQ-of-I-s )
currentAq = aq[i, :] # .SET
center = Coordinates( latitude_aq[i], # .SET
longitude_aq[i]
) # |
# +-------------> # EASY2VECTORISE in [i]
for j in range( 0,
len( longitude_meo )
): #- - - - - - - - - - - - - - - - - ITERATOR #2 ( SEQ-of-J-s )
currentMeo = meo[j, :] # .SET
grid_point = Coordinates( latitude_meo[j], # .SET
longitude_meo[j]
) # |
# +-------> # EASY2VECTORISE in [j]
if is_in_circle( center,
RADIUS,
grid_point
): # /\/\/\/\/\/\/\/\/\/\/\/\/\/ IF-ed SIMD-KILLER #1
if ( currentAq[TIME_AQ]
== currentMeo[TIME_MEO]
): # /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\ IF-ed SIMD-KILLER #2
humidity += currentMeo[ HUMIDITY_MEO] # BEST PERF.
pressure += currentMeo[ PRESSURE_MEO] # IF SMART
temperature += currentMeo[ TEMPERATURE_MEO] # CURATED
wind_speed += currentMeo[ WIND_SPEED_MEO] # AS NON
wind_direction += currentMeo[WIND_DIRECTION_MEO] # ATOMICS
count += 1.0
if count != 0.0: # !!!!!!!!!!!!!!!!!! THIS NEVER HAPPENS
# EXCEPT WHEN ZERO DATA-POINTS WERE AVAILABLE FOR THE i-TH ZONE,
# FILE DID NOT CONTAIN ANY SUCH,
# WHICH IS FAIR,
# BUT SUCH A BLOCK OUGHT NEVER HAVE STARTED ANY COMPUTING AT ALL
# IF ASPIRING FOR INDEED BEING LOADED
# ONTO AN HPC-GRADE COMPUTING INFRASTRUCTURE ( SPONSORED OR NOT )
#
final_tmp[i, HUMIDITY_FINAL] = humidity / count
final_tmp[i, PRESSURE_FINAL] = pressure / count
final_tmp[i, TEMPERATURE_FINAL] = temperature / count
final_tmp[i, WIND_SPEED_FINAL] = wind_speed / count
final_tmp[i, WIND_DIRECTION_FINAL] = wind_direction / count
If we omit both the iterators over the domain of all-[i,j]-s and the if-ed crossroads, the actual "useful"-part of the computing does a very shallow amount of mathematics -- the job contains a few SLOC-s, where principally independent values are summed ( best having avoided any collision of adding operation, so could be very cheaply operated independently each to the other ( best with well ahead pre-fetched constants ) in less than a few [ns] YES, your computing payload does not require anything more than just a few units [ns] to execute.
The problem is in smart-engineering of the data-flow ( I like to call that a DATA-HYDRAULICS ( how to make a further incompressible flow of DATA into the { CPU | GPU | APU | *** }-processor registers, so as to get 'em processes ) )
All the rest is easy. A smart solution of the HPC-grade DATA-HYDRAULICS typically is not.
No language, no framework will help you in this automatically. Some can release some part of the solution-engineering from your "manual" efforts, some cannot, some can even spoil a possible computing performance, due to "cheap" shortcuts in their internal design decision and compromises made, that do not benefit the same target you have - The Performance.
The Best next step?
A ) Try to better understand the limits of computing infrastructures you expect to use for your extensive ( but not intensive ( yes, just a few SLOC's per [i,j] ), which HPC-supervisors do not like to see flowing onto their operated expensive HPC-resources ).
B ) If in troubles with time + headcount + financial resouces to re-engineer the top-down DATA-HYDRAULICS solution, best re-factor your code so as to get at least into the vectorised, numpy / numba ( not always will numba get remarkably farther than an already smart numpy-vectorised code, but a quantitative test will tell the facts per inciden, not in general )
C ) If your computing-problem is expected to get re-run more often, definitely assess a re-designed pipeline from the early pre-processing of the data-storage ( the slowest part of the processing ), where a stream-based pre-processing of principally static values is possible, which could impact the resulting DATA-HYDRAULICS flow ( performance ) with a pre-computed + smart-aligned values the most. The block of a few ADD-s down the lane will not get improved beyond a few [ns], as reported above, but the slow-flow can jump orders of magnitude faster, if re-arranged into a smarted flow, harnessing all available, yet "just"-[CONCURRENT]-ly operated resources ( any attempt to try to arrange a True-[PARALLEL] scheduling is a pure nonsense here, as the task is principally by no means a [PARALLEL] scheduling problem, but a stream of pure-[SERIAL] (re-)processing of data-points, where a smart, yet "just"-[CONCURRENT] processing re-arrangement may help scale-down the resulting duration of the process ).
BONUS:If interested in deeper reasoning about the achievable performance gains from going into N-CPU operated computing graphs, feel free to learn more about re-formulated Amdahl's Law and related issues, as posted in further details here.

Massively parallel search operation with Dask, Distributed

I created a demo problem when testing out auto-scalling Dask Distributed implementation on Kubernetes and AWS and I I'm not sure I'm tackling the problem correctly.
My scenario is given a md5 hash of a string (representing a password) find the original string. I hit three main problems.
A) the parameter space is massive and trying to create a dask bag with 2.8211099e+12 members caused memory issues (hence the 'explode' function you'll see in the sample code below).
B) Clean exit on early find. I think using take(1, npartitions=-1) will achieve this but I wasn't sure. Originally I raised an exception raise Exception("%s is your answer' % test_str) which worked but felt "dirty"
C) Given this is long running and sometimes workers or AWS boxes die, how would it be best to store progress?
Example code:
import distributed
import math
import dask.bag as db
import hashlib
import dask
import os
if os.environ.get('SCHED_URL', False):
sched_url = os.environ['SCHED_URL']
client = distributed.Client(sched_url)
versions = client.get_versions(True)
dask.set_options(get=client.get)
difficulty = 'easy'
settings = {
'hard': (hashlib.md5('welcome1'.encode('utf-8')).hexdigest(),'abcdefghijklmnopqrstuvwxyz1234567890', 8),
'mid-hard': (hashlib.md5('032abgh'.encode('utf-8')).hexdigest(),'abcdefghijklmnop1234567890', 7),
'mid': (hashlib.md5('b08acd'.encode('utf-8')).hexdigest(),'0123456789abcdef', 6),
'easy': (hashlib.md5('0812'.encode('utf-8')).hexdigest(),'0123456789', 4)
}
hashed_pw, keyspace, max_guess_length = settings[difficulty]
def is_pw(guess):
return hashlib.md5(guess.encode('utf-8')).hexdigest() == hashed_pw
def guess(n):
guess = ''
size = len(keyspace)
while n>0 :
n -= 1
guess += keyspace[n % size];
n = math.floor(n / size);
return guess
def make_exploder(num_partitions, max_val):
"""Creates a function that maps a int to a range based on the number maximum value aimed for
and the number of partitions that are expected.
Used in this code used with map and flattent to take a short list
i.e 1->1e6 to a large one 1->1e20 in dask rather than on the host machine."""
steps = math.ceil(max_val / num_partitions)
def explode(partition):
return range(partition * steps, partition * steps + steps)
return explode
max_val = len(keyspace) ** max_guess_length # How many possiable password permutation
partitions = math.floor(max_val / 100)
partitions = partitions if partitions < 100000 else 100000 # split in to a maximum of 10000 partitions. Too many partitions caused issues, memory I think.
exploder = make_exploder(partitions, max_val) # Sort of the opposite of a reduce. make_exploder(10, 100)(3) => [30, 31, ..., 39]. Expands the problem back in to the full problem space.
print("max val: %s, partitions:%s" % (max_val, partitions))
search = db.from_sequence(range(partitions), npartitions=partitions).map(exploder).flatten().filter(lambda i: i <= max_val).map(guess).filter(is_pw)
search.take(1,npartitions=-1)
I find 'easy' works well locally, 'mid-hard' works well on our 6 to 8 * m4.2xlarge AWS cluster. But so far haven't got hard to work.

A) the parameter space is massive and trying to create a dask bag with 2.8211099e+12 members caused memory issues (hence the 'explode' function you'll see in the sample code below).
This depends strongly on how you arrange your elements into a bag. If each element is in its own partition then yes, this will certainly kill everything. 1e12 partitions is very expensive. I recommend keeping the number of partitions in the thousands or tens of thousands.
B) Clean exit on early find. I think using take(1, npartitions=-1) will achieve this but I wasn't sure. Originally I raised an exception raise Exception("%s is your answer' % test_str) which worked but felt "dirty"
If you want this then I recommend not using dask.bag, but instead using the concurrent.futures interface and in particular the as_completed iterator.
C) Given this is long running and sometimes workers or AWS boxes die, how would it be best to store progress?
Dask should be resilient to this as long as you can guarantee that the scheduler survives. If you use the concurrent futures interface rather than dask bag then you can also track intermediate results on the client process.

Map Reduce Simple score aggregating for common queries

One of my mapper produces some logs distributed in files like part-0, part-1, part-2 etc. Now each of these have some queries and some associated data for that query:
part-0
q score
1 ben 10 4.01
horse shoe 5.96
...
part-1
1 ben 10 3.23
horse shoe 2.98
....
and so on for part-2,3 etc.
Now the same query q i.e. "1 ben 10" above resides in part-1, part-2 etc.
Now I have to write a map reduce phase where in I can collect the same queries and aggregate (add up) their scores.
My mapper function can be an identity and in the reduce I will be accomplishing this task.
Output would be:
q aggScore
1 ben 10 7.24
horse shoe 8.96
...
Seems to be a simple task but I am not able to think of as to how can I proceed with this (Read a lot but not really able to proceed). I can think in terms of generic Algorithm problem wherein first I will collect the common queries and than add up their scores.
Any help with some hints of pythonic solution or Algorithm (map reduce) would really be appreciated.

Here is the MapReduce solution:
Map input: Each input file (part-0, part-1, part-2, ...) can be input to individual (separate) map task.
foreach input line in the input file,
Mapper emits <q,aggScore>. If there are multiple scores for a query in a single file, Map will sum them all up, otherwise if we know that each query will appear in each file just once, map can be an identity function emitting <q,aggScore> for each input line as is.
Reducer input is in the form <q,list<aggScore1,aggScore2,...> The Reducer operation is similar to well-known MapReduce example of wordcount. If you are using Hadoop, you can use the following method for Reducer.
public void reduce(Text q, Iterable<IntWritable> aggScore, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : aggScore) {
sum += val.get();
}
context.write(q, new IntWritable(sum));
}
The method will sum all aggScores for a particularq and give you the desired output. The python code for the reducer should look something like this (Here q is the key and the list of aggScores is the values) :
def reduce(self, key, values, output, reporter):
sum = 0
while values.hasNext():
sum += values.next().get()
output.collect(key, IntWritable(sum))

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.