I'm learning Dask and I want to generate random strings, but this only works if the import statements are inside the function f.
This works:
import dask
from dask.distributed import Client, progress

c = Client(host='scheduler')

def f():
    from random import choices
    from string import ascii_letters
    rand_str = lambda n: ''.join(choices(population=list(ascii_letters), k=n))
    return rand_str(5)

xs = []
for i in range(3):
    x = dask.delayed(f)()
    xs.append(x)

res = c.compute(xs)
print([r.result() for r in res])
This prints something like ['myvDi', 'rZnYO', 'MyzaG']. This is good, as the strings are random.
This, however, doesn't work:
from random import choices
from string import ascii_letters

import dask
from dask.distributed import Client, progress

c = Client(host='scheduler')

def f():
    rand_str = lambda n: ''.join(choices(population=list(ascii_letters), k=n))
    return rand_str(5)

xs = []
for i in range(3):
    x = dask.delayed(f)()
    xs.append(x)

res = c.compute(xs)
print([r.result() for r in res])
This prints something like ['tySQP', 'tySQP', 'tySQP'], which is bad because all the random strings are the same.
So I'm wondering how I should distribute larger, non-trivial code. My goal is to be able to pass arbitrary JSON to a dask.delayed function and have that function perform analysis using other modules, like Google's OR-Tools.
Any suggestions?
Python's random module is a bit odd.
It creates some internal state when it is first imported and reuses that state for every random number it generates. Unfortunately, that state is hard to serialize and move between processes, and when the same state gets shipped to every worker along with your function, each worker produces the same "random" values.
Your solution of importing random within your function is what I do as well: each worker then sets up its own random state when the function actually runs.
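For the larger goal, the same pattern scales up: keep the heavy imports inside the task function (or make sure every worker has the modules installed), and pass plain, picklable data such as a dict parsed from JSON. Here is a minimal sketch along those lines; the solve name, the payload keys, and the toy inputs are my own inventions for illustration, and the OR-Tools import is only indicated in a comment:

import json
import dask
from dask.distributed import Client

c = Client(host='scheduler')

def solve(payload):
    # heavy or optional imports stay inside the task so each worker
    # resolves them in its own process
    from random import choices
    from string import ascii_letters
    # e.g. `from ortools.linear_solver import pywraplp` would also go here
    job_id = ''.join(choices(population=list(ascii_letters), k=5))
    # the real analysis on `payload` would go here; this just echoes a summary
    return {'job_id': job_id, 'n_items': len(payload.get('items', []))}

payloads = [json.loads(s) for s in ['{"items": [1, 2, 3]}', '{"items": [4, 5]}']]
futures = c.compute([dask.delayed(solve)(p) for p in payloads])
print([f.result() for f in futures])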
Related
I have been using parfor in MATLAB to run parallel for loops for quite some time. I need to do something similar in Python but I cannot find any simple solution. This is my code:
t = list(range(1, 3, 1))
G = list(range(0, 3, 2))
results = pandas.DataFrame(columns=['tau', 'p_value', 'G', 't_i'], index=range(0, len(G)*len(t)))
counter = 0
for iteration_G in list(range(0, len(G))):
    for iteration_t in list(range(0, len(t))):
        matrix_1, matrix_2 = bunch of code
        tau, p_value = scipy.stats.kendalltau(matrix_1, matrix_2)
        results['tau'][counter] = tau
        results['p_value'][counter] = p_value
        results['G'][counter] = G[iteration_G]
        results['t_i'][counter] = G[iteration_t]
        counter = counter + 1
I would like to use the parfor equivalent in the first loop.
I'm not familiar with parfor, but you can use the joblib package to run functions in parallel.
In this simple example there is a function that prints its argument, and we use Parallel to execute it multiple times in parallel:
import multiprocessing
from joblib import Parallel, delayed

# function that you want to run in parallel
def foo(i):
    print(i)

# define the number of cores (this is how many processes will run)
num_cores = multiprocessing.cpu_count()

# execute the function in parallel - `return_list` is a list of the results of the function
# in this case it will just be a list of Nones
return_list = Parallel(n_jobs=num_cores)(delayed(foo)(i) for i in range(20))
If this doesn't work for what you want to do, you can try numba - it might be a bit more difficult to set up, but in theory you can just add @njit(parallel=True) as a decorator to your function and numba will try to parallelise it for you.
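Applied to your loop, a sketch could look like the one below. This is only an outline: compute_matrices is a hypothetical stand-in for your "bunch of code", and I flattened the two nested loops into one list of (G, t) pairs so joblib can hand each pair to a worker.

import itertools
import multiprocessing

import pandas as pd
import scipy.stats
from joblib import Parallel, delayed

t = list(range(1, 3, 1))
G = list(range(0, 3, 2))

def one_cell(g_value, t_value):
    # placeholder for your "bunch of code" that builds the two matrices
    matrix_1, matrix_2 = compute_matrices(g_value, t_value)
    tau, p_value = scipy.stats.kendalltau(matrix_1, matrix_2)
    return {'tau': tau, 'p_value': p_value, 'G': g_value, 't_i': t_value}

num_cores = multiprocessing.cpu_count()
rows = Parallel(n_jobs=num_cores)(
    delayed(one_cell)(g, ti) for g, ti in itertools.product(G, t)
)
results = pd.DataFrame(rows, columns=['tau', 'p_value', 'G', 't_i'])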
I found a solution using the parfor package. It is still a bit more complicated than MATLAB's parfor, but it's pretty close to what I am used to.
from parfor import parfor

t = list(range(1, 16, 1))
G = list(range(0, 62, 2))

for iteration_t in list(range(0, len(t))):
    @parfor(list(range(0, len(G))))
    def fun(iteration_G):
        result = pandas.DataFrame(columns=['tau', 'p_value'], index=range(0, 1))
        matrix_1, matrix_2 = bunch of code
        tau, p_value = scipy.stats.kendalltau(matrix_1, matrix_2)
        result['tau'] = tau
        result['p_value'] = p_value
        fun = numpy.array([tau, p_value])
        return fun
Update: it's working after updating my Spyder to 5.0.5. Thanks everyone!
I am trying to speed up a loop using multiprocessing. The code below aims to generate 10000 random vectors.
My idea is to split the task across 5 processes and store the output in result. However, it returns an empty list when I run the code.
But if I remove result = add_one(result) from the randomize_data function, the code runs perfectly. So the error must come from using functions from other modules (Testing.test) inside multiprocessing.
Here is the add_one function from Testing.test:
def add_one(x):
    return x + 1
How can I use a function from another module inside a process? Thank you.
import multiprocessing
import numpy as np
import pandas as pd

def randomize_data(mean, cov, n_init, proc_num, return_dict):
    result = pd.DataFrame()
    for _ in range(n_init):
        temp = np.random.multivariate_normal(mean, cov)
        result = result.append(pd.Series(temp), ignore_index=True)
    result = add_one(result)
    return_dict[proc_num] = result

if __name__ == "__main__":
    from Testing.test import add_one

    mean = np.arange(0, 1, 0.1)
    cov = np.identity(len(mean))

    manager = multiprocessing.Manager()
    return_dict = manager.dict()
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=randomize_data, args=(mean, cov, 2000, i, return_dict))
        jobs.append(p)
        p.start()

    for proc in jobs:
        proc.join()

    result = return_dict.values()
The issue here is fairly simple:
You imported add_one inside the if __name__ == "__main__": block, so the name is only bound when the script runs as the main program. With the spawn start method (the default on Windows), each worker process re-imports your module, the __main__ guard does not run there, and add_one is undefined inside randomize_data; the workers fail before writing anything to return_dict, which is why you get an empty result.
Move this import statement up to the other ones at the top of your file, and your code should work:
import multiprocessing
import numpy as np
import pandas as pd
from Testing.test import add_one
I have two lists. List X contains 1000 words. List Y contains 500 words. I am trying to find similar words for List X with respect to Y.
I am using Spacy's similarity function.
The problem I am facing is that the for-loop part of the execution takes a long time. From what I have read, multithreading in Python only gives an illusion of concurrency (because of the GIL) and so does not give a real performance increase for CPU-bound work. So I think multiprocessing is the way to go, but I am new to it, hence this request for help.
How do I speed up the execution of the for-loop part with multiprocessing in Python?
The following is my code.
from operator import itemgetter

import en_vectors_web_lg
nlp = en_vectors_web_lg.load()

ListX = ['HSBC', 'JP Morgan',......]   # 500 words list
ListY = ['Currency', 'Blockchain'.......]   # 1000 words list

s_words = []
for token1 in ListY:
    list_to_sort = []
    for token2 in ListX:
        list_to_sort.append((token1, token2, nlp(str(token1)).similarity(nlp(str(token2)))))
    sorted_list = sorted(list_to_sort, key=itemgetter(2), reverse=True)[0][:2]
    s_words.append(sorted_list)
You can try this:
import en_vectors_web_lg
nlp = en_vectors_web_lg.load()

def compare_function(token1, token2, nlp):
    return token1, token2, nlp(str(token1)).similarity(nlp(str(token2)))

from multiprocessing import Pool
import itertools

tokenlist = [(a, b, nlp) for a, b in itertools.product(ListX, ListY)]

p = Pool(8)
# starmap unpacks each (token1, token2, nlp) tuple into the three arguments
results = p.starmap(compare_function, tokenlist)
If you are on Windows, put the Pool work under a main guard:
if __name__ == '__main__':
    results = p.starmap(compare_function, tokenlist)
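A follow-up thought of my own, not part of the answer above: pickling the loaded nlp model into every task tuple is expensive, since the whole vector model gets serialized for each pair. A common alternative is to load the model once per worker process with a Pool initializer and only send the two strings; a minimal sketch, with shortened word lists standing in for yours:

from multiprocessing import Pool
import itertools

nlp = None  # filled in inside each worker by init_worker

def init_worker():
    # runs once per worker process, so the model is loaded once per process
    global nlp
    import en_vectors_web_lg
    nlp = en_vectors_web_lg.load()

def compare_pair(token1, token2):
    return token1, token2, nlp(str(token1)).similarity(nlp(str(token2)))

if __name__ == '__main__':
    ListX = ['HSBC', 'JP Morgan']        # your 500-word list
    ListY = ['Currency', 'Blockchain']   # your 1000-word list
    pairs = list(itertools.product(ListX, ListY))
    with Pool(processes=8, initializer=init_worker) as p:
        results = p.starmap(compare_pair, pairs)
    print(results[:3])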
I am new to Python and have tried several approaches to multiprocessing without seeing much benefit.
I have a task that involves three methods, x, y and z. What I have tried so far is:
def foo():
    for line in text_file:            # iterate over the lines in a text file
        x1 = call_method_x(line)      # result from method x, say x1
        y1 = call_method_y(x1)        # method y uses x1
        for i in range(4):
            multiprocessing.Process(target=call_method_z, args=(y1,)).start()  # method z uses y1
I used multiprocessing here on method_z as it is the most CPU-intensive.
I also tried it another way:
def foo():
    method_x()
    method_y()
    method_z()

def main():
    import concurrent.futures
    with concurrent.futures.ProcessPoolExecutor() as executor:
        executor.map(foo())
Which one seems more appropriate? I checked the execution time, but there was not much of a difference. The thing is that method_x(), then method_y(), then method_z() have to run in that order, because each uses the output of the previous one. Both of these ways work, but there is no significant gain from multiprocessing with either of them.
Please let me know if I am missing something here.
You can use multiprocessing.Pool from the standard library, something like:
from multiprocessing import Pool

with open(<path-to-file>) as f:
    data = f.readlines()

def method_x(line):
    # do something
    pass

def method_y(line):
    x1 = method_x(line)
    # do something

def method_z(line):
    y1 = method_y(line)
    # do something

def call_home():
    p = Pool(6)
    p.map(method_z, data)
First you read all the lines into the variable data. Then you create a pool of 6 processes and let each line be processed by any of those 6 processes.
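To make the sequencing explicit, here is a sketch of my own (the method bodies and the input.txt file name are made up): keep x, y and z sequential inside one worker function, since each step needs the previous result, and parallelise across lines, since the lines are independent of each other.

import concurrent.futures

def method_x(line):
    return line.strip()                  # placeholder work

def method_y(x1):
    return x1.upper()                    # placeholder work, uses x1

def method_z(y1):
    return sum(ord(c) for c in y1)       # placeholder CPU-heavy work, uses y1

def process_line(line):
    # the three methods stay sequential because each needs the previous result
    x1 = method_x(line)
    y1 = method_y(x1)
    return method_z(y1)

if __name__ == "__main__":
    with open("input.txt") as f:         # hypothetical file name
        lines = f.readlines()
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = list(executor.map(process_line, lines))
    print(results[:5])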
I am trying to define several functions of, let's say, one variable.
First I defined x variable:
from sympy import *
x=var('x')
I want to define a series of functions using Lambda, something like this:
f0=Lambda(x,x)
f1=Lambda(x,x**2)
....
fn=....
How can I define this?
Thanks
It took me a while to understand what you're after. It seems that you most likely want a loop, though you don't say so explicitly. So I would suggest something like this:
from sympy import *

x = symbols('x')

f = []
for i in range(1, 11):   # generate 1 to 10
    f.append(Lambda(x, x**i))

# then you use them like this
print(f[0](2))   # indexes are 0-based
print(f[1](2))
# ... up to f[9]
Anyway, your question isn't really clear about what the progression should be.
EDIT: As for generating random functions, here is an example that generates a polynomial of growing order with a random subset of the lower-order terms:
from random import randint
from sympy import *

x = symbols('x')

f = []
for i in range(1, 11):
    fun = x**i
    for j in range(i):
        fun += randint(0, 1) * x**j
    f.append(Lambda(x, fun))

# then you use them like this
# note I am using Python 2.7; if you use 3.x, modify to suit your needs
print(map(str, f))    # show them all

def call(me):
    return me(2)

print(map(call, f))
Your random procedure may well be different, as there are infinitely many ways to randomize. Note that the result differs each time you run the creation loop; use a random seed if you need the same generation between runs. Once created, the functions are stable within a process.
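For example, a minimal sketch of pinning the seed (the seed value 42 is arbitrary, and this assumes the same loop as above):

from random import randint, seed
from sympy import symbols, Lambda

x = symbols('x')

seed(42)   # arbitrary seed; fixing it makes the generated polynomials reproducible
f = []
for i in range(1, 11):
    fun = x**i
    for j in range(i):
        fun += randint(0, 1) * x**j
    f.append(Lambda(x, fun))

print([str(fn) for fn in f])   # same list every run with the same seed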