The following simple Spark program takes 4 minutes to run. I don't know what's wrong with this code.
First, I generate a very small RDD:
D = spark.sparkContext.parallelize([(0,[1,2,3]),(1,[2,3]),(2,[0,3]),(3,[1])]).cache()
Then I generate a vector
P1 = spark.sparkContext.parallelize(list(zip(list(range(4)),[1/4]*4))).cache()
Then I define a function to do the map step:
def MyFun(x):
    L0 = len(x[2])
    L = []
    for i in x[2]:
        L.append((i, x[1] / L0))
    return L
Then I execute the following code
P0 = P1
D0 = D.join(P1).map(lambda x: [x[0],x[1][1],x[1][0]]).cache()
C0 = D0.flatMap(lambda x: MyFun(x)).cache()
P1 = C0.reduceByKey(lambda x,y:x+y).mapValues(lambda x:x*1.2+3.4).sortByKey().cache()
Diff = P1.join(P0).map(lambda x: abs(x[1][0]-x[1][1])).sum()
Given that my data is so small, I can't figure out why this piece of code runs so slowly...
I have a few suggestions to help you speed up this job.
Cache only when needed
Caching forces Spark to materialize the RDD: every partition has to be computed and stored (in memory, spilling to disk depending on the storage level). Caching every intermediate step therefore adds overhead instead of speeding the job up.
I would suggest caching only P1.
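For example, a minimal sketch of the same loop body with caching limited to the rank vector (all names taken from the question):

P0 = P1
D0 = D.join(P1).map(lambda x: [x[0], x[1][1], x[1][0]])
C0 = D0.flatMap(MyFun)
P1 = C0.reduceByKey(lambda x, y: x + y).mapValues(lambda x: x * 1.2 + 3.4).sortByKey().cache()
Diff = P1.join(P0).map(lambda x: abs(x[1][0] - x[1][1])).sum()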
Use DataFrames to allow Spark to help you
I also strongly suggest using the DataFrame API; Spark can then apply optimizations for you, such as predicate pushdown.
Last but not least, custom functions (UDFs) are expensive as well. If you use DataFrames, try to stick to the built-in functions from the org.apache.spark.sql.functions module (pyspark.sql.functions in Python).
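As a rough sketch only (the toy data, the column names node/neighbors/rank and the 1.2/3.4 constants are simply carried over from the question), the join-and-aggregate step could be expressed with built-in functions like this:

from pyspark.sql import functions as F

links = spark.createDataFrame([(0, [1, 2, 3]), (1, [2, 3]), (2, [0, 3]), (3, [1])], ["node", "neighbors"])
ranks = spark.createDataFrame([(i, 1 / 4) for i in range(4)], ["node", "rank"])

# contribution of each node to each of its neighbors (replaces MyFun)
contribs = (links.join(ranks, "node")
            .withColumn("contrib", F.col("rank") / F.size("neighbors"))
            .select(F.explode("neighbors").alias("node"), "contrib"))

# aggregate the contributions and apply the same affine update as the RDD version
new_ranks = contribs.groupBy("node").agg((F.sum("contrib") * 1.2 + 3.4).alias("rank"))

Everything here is a built-in expression, so Catalyst can optimize the whole plan and no Python UDF is involved.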
Profile the code with Spark UI
I also suggest profiling the job via the Spark UI; with data this small, the problem may not be in your code at all but in the nodes themselves.
In my case I have several files in S3 and a custom function that reads each one of them and processes it using all available threads. To simplify the example I just generate a DataFrame df, and I assume that my function is tsfresh.extract_features, which uses multiprocessing.
Generate Data
import pandas as pd
from tsfresh import extract_features
from tsfresh.examples.robot_execution_failures import download_robot_execution_failures, \
load_robot_execution_failures
download_robot_execution_failures()
ts, y = load_robot_execution_failures()
df = []
for i in range(5):
    tts = ts.copy()
    tts["id"] += 88 * i
    df.append(tts)
df = pd.concat(df, ignore_index=True)
Function
def fun(df, n_jobs):
    extracted_features = extract_features(df,
                                          column_id="id",
                                          column_sort="time",
                                          n_jobs=n_jobs)
    return extracted_features
Cluster
import dask
from dask.distributed import Client, progress
from dask import compute, delayed
from dask_cloudprovider import FargateCluster
my_vpc = # your vpc
my_subnets = # your subnets
cpu = 2
ram = 4
cluster = FargateCluster(n_workers=1,
image='rpanai/feats-worker:2020-08-24',
vpc=my_vpc,
subnets=my_subnets,
worker_cpu=int(cpu * 1024),
worker_mem=int(ram * 1024),
cloudwatch_logs_group="my_log_group",
task_role_policies=['arn:aws:iam::aws:policy/AmazonS3FullAccess'],
scheduler_timeout='20 minutes'
)
cluster.adapt(minimum=1,
maximum=4)
client = Client(cluster)
client
Using all worker threads (FAIL)
to_process = [delayed(fun)(df, cpu) for i in range(10)]
out = compute(to_process)
AssertionError: daemonic processes are not allowed to have children
Using only one thread (OK)
In this case it works fine but I'm wasting resources.
to_process = [delayed(fun)(df, 0) for i in range(10)]
out = compute(to_process)
Question
I know that for this particular function I could eventually write a custom distributor using multithreading and a few other tricks, but I'd like to distribute a job where on every worker I can take advantage of all resources without having to worry too much.
Update
The function was just an example; in reality there is some cleaning before the actual feature extraction, and afterwards the result is saved to S3.
def fun(filename, bucket_name, filename_out, n_jobs):
    # read the input from S3
    df = pd.read_parquet(f"s3://{bucket_name}/{filename}")
    # do some cleaning
    extracted_features = extract_features(df,
                                          column_id="id",
                                          column_sort="time",
                                          n_jobs=n_jobs)
    extracted_features.to_parquet(f"s3://{bucket_name}/{filename_out}")
I can help answer your specific question for tsfresh, but if tsfresh was just a simple toy example, this might not be what you want.
For tsfresh, you would typically not mix tsfresh's multiprocessing with dask, but let dask do all the handling. This means you start with a single dask DataFrame (in your test case you could just convert the pandas DataFrame into a dask one; for your real use case you can read directly from S3, see the documentation), and then distribute the feature extraction over the dask DataFrame. The nice thing about the feature extraction is that it works independently on every time series, so we can generate a single job for every time series.
The current version of tsfresh (0.16.0) has a small helper function that will do this for you: see here.
In the next version, it might even be possible to just run extract_features on the dask dataframe directly.
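If you prefer to wire it up by hand instead of using the helper, here is a rough sketch of the one-job-per-time-series idea (df is the pandas DataFrame from the question; n_jobs=0 switches off tsfresh's own multiprocessing so that dask is the only parallel layer):

import pandas as pd
import dask
from dask import delayed
from tsfresh import extract_features

# one dask task per time series; the scheduler spreads them over the workers
tasks = [
    delayed(extract_features)(group, column_id="id", column_sort="time", n_jobs=0)
    for _, group in df.groupby("id")
]
features = pd.concat(dask.compute(*tasks))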
I am not sure if this helps to solve your more general question. In my opinion, in most cases you do not want to mix dask's distribution with "local" multicore computation; just let dask handle everything. If you are on a dask cluster, you might not even know how many cores each machine will have (or you might only get a single one per job).
This means that if your job can be split into N tasks, each of which would start M sub-jobs, you just hand N x M jobs to dask and let it figure out the rest (including data locality).
I am running a backtest for a trading strategy, defined as a class. I am trying to select the best combination of parameters to feed into the model, so I am running multiple backtests over a given period, trying out different combinations. The idea is to be able to select the first generation of a population to feed into a genetic algorithm. Seems like the perfect job for multiprocessing!
So I tried a bunch of things to see what works faster. I opened 10 Spyder consoles (yes, I tried it) and ran a single combination of parameters for each console (all running at the same time).
The sample code used for each single Spyder console:
class MyStrategy:
    def __init__(self, day, parameters):
        # my strategy that runs on a single day
        ...

backtesting = []
for day in days:
    # single_parameter_combi is fixed per console
    backtesting_day = MyStrategy(day, single_parameter_combi)
    backtesting.append(backtesting_day)
I then tried the multiprocessing way, using Pool.
The sample code used in multiprocessing:
class MyStrategy:
    def __init__(self, day, parameters):
        # my strategy that runs on a single day
        ...

def single_run_backtesting(single_parameter_combi):
    backtesting = []
    for day in days:
        backtesting_day = MyStrategy(day, single_parameter_combi)
        backtesting.append(backtesting_day)
    return backtesting

def backtest_many(list_of_parameter_combinations):
    p = multiprocessing.Pool()
    result = p.map(single_run_backtesting, list_of_parameter_combinations)
    p.close()
    p.join()
    return result
if __name__ == '__main__':
    parameter_combis = [...]  # a list of parameter combinations, 10 different ones in this case
    result = backtest_many(parameter_combis)
I have also tried the following: opening 5 Spyder consoles and running 2 instances of the class in a for loop, as below, and a single Spyder console with 10 instances of the class.
class MyStrategy:
    def __init__(self, day, parameters):
        # my strategy that runs on a single day
        ...

parameter_combis = [...]  # a list of parameter combinations
backtest_dict = {k: [] for k in range(len(parameter_combis))}  # a dictionary of empty lists
for day in days:
    for j, single_parameter_combi in enumerate(parameter_combis):
        backtesting_day = MyStrategy(day, single_parameter_combi)
        backtest_dict[j].append(backtesting_day)
To my great surprise, multiprocessing takes around 25 minutes to go through a single day, about the same as a single Spyder console with 10 instances of the class in the for loop, and magically it takes only 15 minutes when I run 10 Spyder consoles at the same time. How do I process this information? It doesn't really make sense to me. I am running a 12-CPU machine on Windows 10.
Consider that I am planning to run things on AWS with a 96-core machine, with something like 100 combinations of parameters that cross in a genetic algorithm which should run something like 20-30 generations (a full backtesting is 2 business months = 44 days).
My question is: what am I missing??? Most importantly, is this just a difference in scale?
I know that for example if you define a simple squaring function and run it serially for 100 times, multiprocessing is actually slower than a for loop. You start seeing the advantage around 10000 times, see for example this: https://github.com/vprusso/youtube_tutorials/blob/master/multiprocessing_and_threading/multiprocessing/multiprocessing_pool.py
Will I see a difference in performance when I go up to 100 combinations with multiprocessing, and is there any way of knowing in advance whether this will be the case? Am I writing the code properly? Other ideas? Do you think it would speed things up significantly if I used multiprocessing one step "above", over the many days within a single parameter combination?
To expand upon my comment "Try p.imap_unordered().":
p.map() ensures that you get the results in the same order as they appear in the parameter list. To achieve this, some of the workers necessarily remain idle for some time.
For your use case – essentially a grid search of parameter combinations – you really don't need to have them in the same order, you just want to end up with the best option. (Additionally, quoth the documentation, "it may cause high memory usage for very long iterables. Consider using imap() or imap_unordered() with explicit chunksize option for better efficiency.")
p.imap_unordered(), by contrast, doesn't really care – it just queues things up and workers work on them as they free up.
It's also worth experimenting with the chunksize parameter – quoting the imap() documentation, "For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1." (since you spend less time queueing and synchronizing things).
Finally, for your particular use case, you might want to consider having the master process generate an infinite amount of parameter combinations using a generator function, and breaking off the loop once you find a good enough solution or enough time passes.
A simple-ish function to do this and a contrived problem (finding two random numbers 0..1 to maximize their sum) follows. Just remember to return the original parameter set from the worker function too, otherwise you won't have access to it! :)
import random
import multiprocessing
import time
def find_best(*, param_iterable, worker_func, metric_func, max_time, chunksize=10):
    best_result = None
    best_metric = None
    start_time = time.time()
    n_results = 0
    with multiprocessing.Pool() as p:
        for result in p.imap_unordered(worker_func, param_iterable, chunksize=chunksize):
            n_results += 1
            elapsed_time = time.time() - start_time
            metric = metric_func(result)
            if best_metric is None or metric > best_metric:
                print(f'{elapsed_time}: Found new best solution, metric {metric}')
                best_metric = metric
                best_result = result
            if elapsed_time >= max_time:
                print(f'{elapsed_time}: Max time reached.')
                break
    final_time = time.time() - start_time
    print(f'Searched {n_results} results in {final_time} s.')
    return best_result
# ------------
def generate_parameter():
    return {'a': random.random(), 'b': random.random()}

def generate_parameters():
    while True:
        yield generate_parameter()

def my_worker(parameters):
    return {
        'parameters': parameters,  # remember to return this too!
        'value': parameters['a'] + parameters['b'],  # our maximizable metric
    }

def my_metric(result):
    return result['value']

def main():
    result = find_best(
        param_iterable=generate_parameters(),
        worker_func=my_worker,
        metric_func=my_metric,
        max_time=5,
    )
    print(f'Best result: {result}')

if __name__ == '__main__':
    main()
An example run:
~/Desktop $ python3 so59357979.py
0.022627830505371094: Found new best solution, metric 0.5126700311039976
0.022940874099731445: Found new best solution, metric 0.9464256914062249
0.022969961166381836: Found new best solution, metric 1.2946600313637404
0.02298712730407715: Found new best solution, metric 1.6255217652861256
0.023016929626464844: Found new best solution, metric 1.7041449687571075
0.02303481101989746: Found new best solution, metric 1.8898109980050104
0.030200958251953125: Found new best solution, metric 1.9031436071918972
0.030324935913085938: Found new best solution, metric 1.9321951916206537
0.03880715370178223: Found new best solution, metric 1.9410837287942249
0.03970479965209961: Found new best solution, metric 1.9649277383314245
0.07829880714416504: Found new best solution, metric 1.9926667738329622
0.6105098724365234: Found new best solution, metric 1.997217792614364
5.000051021575928: Max time reached.
Searched 621931 results in 5.07216 s.
Best result: {'parameters': {'a': 0.997483, 'b': 0.999734}, 'value': 1.997217}
(By the way, this is nearly 6 times slower when chunksize=1.)
I've implemented a genetic search algorithm and tried to parallelise it, but getting terrible performance (worse than single threaded). I suspect this is due to communication overhead.
I have provided pseudo-code below, but in essence the genetic algorithm creates a large pool of "Chromosome" objects, then runs many iterations of:
1. Score each individual chromosome based on how it performs in a 'world'. The world remains static across iterations.
2. Randomly select a new population based on the scores calculated in the previous step.
3. Go to step 1 for n iterations.
The scoring algorithm (step 1) is the major bottleneck, hence it seemed natural to distribute out the processing of this code.
I have run into a couple of issues I hoped I could get help with:
How can I link the calculated score with the object that was passed to the scoring function by map(), i.e. link each Future holding a score back to a Chromosome? I've done this in a very clunky way by having the calculate_scores() method return the object, but in reality all I need is to send a float back if there is a better way to maintain the link.
The parallel processing of the scoring function is working okay, though takes a long time for map() to iterate through all the objects. However, the subsequent calls to draw_chromosome_from_pool() run very slowly compared to the single-threaded version to the point that I've not yet seen it complete. I have no idea what is causing this as the method always completes quickly in the single-threaded version. Is there some IPC going on to pull the chromosomes back to the local process, even after all the futures have completed? Is the local process de-prioritised in some way?
I am worried that the overall iterative nature of building/rebuilding the pool each cycle is going to cause an enormous amount of data transmission to the workers. The question at the root of this concern: what and when does Dask actually send data back and forth to the worker pool. i.e. when does Environment() get distributed out vs. Chromosome(), and how/when do results come back? I've read the docs but either haven't found the right detail, or am too stupid to understand.
Idealistically, I think (but open to correction) what I want is a distributed architecture where each worker holds the Environment() data locally on a 'permanent' basis, then Chromosome() instance data is distributed for scoring with little duplicated back/forth of unchanged Chromosome() data between iterations.
Very long post, so if you have taken the time to read this, thank you already!
import numpy as np
from dask.distributed import Client, as_completed

class Chromosome(object):  # Small size: several hundred bytes per instance
    def get_score(self):
        # Returns a float
        ...

    def set_score(self, i):
        # Stores a float
        ...

class Environment(object):  # Large size: 20-50Mb per instance, but only one instance
    def calculate_scores(self, chromosome):
        # Slow calculation using attributes from chromosome and instance data
        chromosome.set_score(x)
        return chromosome

class Evolver(object):
    def draw_chromosome_from_pool(self, max_score):
        while True:
            individual = np.random.choice(self.chromosome_pool)
            selection_chance = np.random.uniform()
            if selection_chance < individual.get_score() / max_score:
                return individual

    def run_evolution(self):
        self.dask_client = Client()
        self.chromosome_pool = list()
        for i in range(10000):
            self.chromosome_pool.append(Chromosome())

        world_data = LoadWorldData()  # Returns a pandas DataFrame
        self.world = Environment(world_data)

        iterations = 1000
        for i in range(iterations):
            highest_score = 0
            futures = self.dask_client.map(self.world.calculate_scores, self.chromosome_pool)
            for future in as_completed(futures):
                c = future.result()
                highest_score = max(highest_score, c.get_score())
            new_pool = set()
            while len(new_pool) < self.pool_size:
                mother = self.draw_chromosome_from_pool(highest_score)
                # do stuff to build a new pool
Yes, each time you call the line
futures = self.dask_client.map(self.world.calculate_scores, self.chromosome_pool)
you are serialising self.world, which is large. You could do this just once before the loop with
future_world = client.scatter(self.world, broadcast=True)
and then in the loop
futures = self.dask_client.map(lambda ch: Environment.calculate_scores(future_world, ch), self.chromosome_pool)
will use the copies already on the workers (or a simple function that does the same). The point is that future_world is just a pointer to stuff already distributed, but dask takes care of this for you.
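For example, a minimal sketch of such a helper (score_with_world is a made-up name here; the second iterable just repeats the scattered future so that every call resolves to the same worker-local copy of the environment):

def score_with_world(chromosome, world):
    # Runs on a worker; dask resolves `world` to the scattered Environment copy.
    return world.calculate_scores(chromosome)

futures = self.dask_client.map(score_with_world,
                               self.chromosome_pool,
                               [future_world] * len(self.chromosome_pool))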
On the issue of which chromosome is which: using as_completed breaks the order in which you submitted them to map, but that is not actually needed in your code. You could have used wait to process everything once all the work is done, or simply iterate over the future.result()s (which will wait for each task to finish); then you retain the ordering of chromosome_pool.
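A small order-preserving sketch along those lines, reusing the hypothetical score_with_world helper from above (gather blocks until all tasks are done and returns results in the same order as the inputs to map):

futures = self.dask_client.map(score_with_world,
                               self.chromosome_pool,
                               [future_world] * len(self.chromosome_pool))
scored_pool = self.dask_client.gather(futures)  # same order as self.chromosome_pool
highest_score = max(c.get_score() for c in scored_pool)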
I have a Python script, but it takes more than 20 hours to run to completion.
Since my code is pretty big, I will post a simplified one.
The first part of the code:
flag = 1
mydic = {}
for i in mylist:
    mydic[flag] = myfunction(i)
    flag += 1
mylist has more than 700 entries, and each call to myfunction runs for around 20 seconds.
So I was wondering whether I can use parallel programming to split the iteration into two groups and run them simultaneously. Is that possible, and would it take half the time?
The second part of the code:
mymatrix = []
for n1 in range(0, flag):
    mat = []
    for n2 in range(0, flag):
        if n1 >= n2:
            mat.append(0)
        else:
            res = myfunction2(mydic[n1], mydic[n2])
            mat.append(res)
    mymatrix.append(mat)
So, if mylist has 700 entries, I want to create a 700x700 upper triangular matrix. But myfunction2() takes around 30 seconds each time. I don't know if I can use parallel programming here too.
I cannot simplify myfunction() and myfunction2(), since they call an external API and return the results.
Do you have any suggestions on how to make this faster?
Based on your comments, I think it's very likely that the 30 seconds are mostly spent in the external API calls. I would add some timing code to test which portions of your code are actually responsible for the slowness.
If it is the external API calls, there are some easy fixes. The calls block, so you'll get a speedup if you can move to a parallel model (though 30 s of blocking sounds huge to me).
I think it would be easiest to create a quick "task list" by having the output of 2 loops be a matrix of arguments to pass into a function. Then I'd pipe them into Celery to run the tasks. That should give you a decent speedup with a minimal amount of work.
You would probably save a lot more time with the threading or multiprocessing modules to run the tasks (or sections), or even by writing it all in Twisted Python - but that usually takes longer than a simple Celery function.
The one caveat with the Celery approach is that you'll be dispatching a lot of work - so you'll have to have some functionality to poll for results. That could be a while loop that just sleeps(10) and repeats itself until celery has a result for every task. If you do it in Twisted, you can access/track results on finish. I've never had to do something like this with multiprocessing, so don't know how that would fit in.
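Outside of Celery, here is a minimal sketch of the parallel model for the first loop (assuming myfunction is dominated by blocking API calls, so a thread pool is sufficient and avoids pickling issues; mylist and myfunction are the names from the question):

from multiprocessing.pool import ThreadPool

with ThreadPool(processes=8) as pool:        # tune the worker count to the API's rate limits
    results = pool.map(myfunction, mylist)   # results keep the order of mylist
mydic = dict(enumerate(results, start=1))    # same keys (1..len(mylist)) as the original loop

The same pattern works for the second loop if you first build the list of (n1, n2) index pairs and map a small wrapper over it.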
How about using a generator for the second part instead of one of the for loops?
def fn():
    for n1 in range(0, flag):
        yield n1

generate = fn()
mymatrix = []
while True:
    try:
        a = next(generate)
    except StopIteration:
        break
    mat = []
    for n2 in range(0, flag):
        if a >= n2:
            mat.append(0)
        else:
            mat.append(myfunction2(mydic[a], mydic[n2]))
    mymatrix.append(mat)
I have a costly calculation to do for fitting some experimental data. The fitting function is a sum over eigenmodes, each of them containing a specific surface integral. It is rather slow if you do it the classical way, so I thought about threading it. I'm using Python, by the way.
The function I want to calculate is something like
def fit_func(params, Mmin, Mmax):
    values = np.zeros(1000)
    for m in range(Mmin, Mmax):
        # Fancy calculation for each mode
        # some calculation with all modes, adding them up into 'values'
        ...
    return values
How can I split this up? I did something like
data1 = thread.start_new_thread(fit_func, (params,0,13))
data2 = thread.start_new_thread(fit_func, (params,13,25))
but then the sum of data1 and data2 is not the same as fit_func(params, 0, 25)...
Try out multiprocessing. It effectively creates separate Python processes behind a thread-like interface. However, profile your computation first to make sure it really is the problem, and not something else like I/O. Starting processes is very slow, so keep them around for a while if you are planning to reuse them.
You can also use numpy for those functions. Its routines are written in C, so they're very fast. Check them both out and see what fits best. I would go for the numpy solution myself...
Use a multiprocessing Pool:
import multiprocessing as mp
p = mp.Pool(10)
res = p.map(your_function, range(Mmin, Mmax))
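If the per-mode results have to be summed as in fit_func, here is a hedged sketch of the same idea with functools.partial - single_mode is a hypothetical helper holding the body of the per-mode calculation, not something from the question:

import multiprocessing as mp
import numpy as np
from functools import partial

def single_mode(params, m):
    # placeholder for the real per-mode surface integral
    return np.full(1000, params["amplitude"] / (m + 1))

if __name__ == "__main__":
    params = {"amplitude": 1.0}   # example parameters
    Mmin, Mmax = 0, 25
    with mp.Pool(10) as p:
        per_mode = p.map(partial(single_mode, params), range(Mmin, Mmax))
    values = np.sum(per_mode, axis=0)   # equivalent of fit_func(params, Mmin, Mmax)

Summing the per-mode arrays afterwards reproduces fit_func(params, Mmin, Mmax), since each mode's contribution is computed independently.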