I have got three python scripts, scheduler one, multithreading one and another one is for multiprocessing. I am trying to execute the multiprocessing script with the help of scheduler. When I use multithreading instead of multiprocessing the program works. I can't figure out what is the issue. Below are the scripts for both multithreading and multiprocessing. I am doing all this in Windows using Python3.6. I just replace multiprocessing.py with multithreading.py in the scheduler.py and program works.
scheduler.py
def job():
try :
exec(open('multiprocessing.py').read())
except Exception as e:
print(e)
schedule.every(100).seconds.do(job)
while True:
schedule.run_pending()
multiprocessing.py
for i in range(cores):
start = i*singleLength
processName = 'Process-'+str(i+1)
if (len(linesList) / 2 == 0 ) and (i == cores - 1):
end = singleLength + (i*singleLength)
values = (linesList[start:end],processName)
elif (len(linesList) / 2 != 0) and (i == cores - 1):
#semiListLength = len(linesList[start:])
values = (linesList[start:],processName)
else:
end = singleLength + (i*singleLength)
#semiListLength = len(linesList[start:end])
values = (linesList[start:end],processName)
processList.append(multiprocessing.Process(target = FTMain.parsing, args = values))
for i in range(cores):
processList[i].start()
for i in range(cores):
processList[i].join()
multithreading.py
for i in range(cores):
start = i*singleLength
processName = 'Thread-'+str(i+1)
if (len(linesList) / 2 == 0 ) and (i == cores - 1):
end = singleLength + (i*singleLength)
#semiListLength = len(linesList[start:end])
values = (linesList[start:end],processName)
elif (len(linesList) / 2 != 0) and (i == cores - 1):
#semiListLength = len(linesList[start:])
values = (linesList[start:],processName)
else:
end = singleLength + (i*singleLength)
#semiListLength = len(linesList[start:end])
values = (linesList[start:end],processName)
processList.append(threading.Thread(target = FTMainTemp.parsing, args = values))
for i in range(cores):
processList[i].start()
for i in range(cores):
processList[i].join()
Below is the error I am getting when running the multiprocessing.py
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
Related
Currently i am trying to accelerate my simulation. I have already tried it with threading and it worked. Now i want to try it with parallel processes to compare both ways. That for i use futures.ProcessPoolExecutor. When i start my script the simulationtime gets printed (its very low) but my program does not work as it should. Normally it should generate several files but they are not generated. Furthermore there is no error message. I already did some research on it in books and on the internet but i can´t figure out the problem.
Here is my code:
def main(setting):
cfg_path = generate(settings[setting])
run_simulation(cfg_path)
if __name__ == '__main__':
settings = get_wrapper_input_json("Szenarioanalyse.json")
typ = "processes"
start = time.perf_counter()
if typ == "threads":
with futures.ThreadPoolExecutor(cpu_count()-1) as e:
e.map(main,settings)
elif typ == "processes":
with futures.ProcessPoolExecutor(cpu_count()-1) as e:
e.map(main,settings)
else:
for setting in settings:
main(setting)
print("Simulationtime: "+str(time.perf_counter()-1))
I have solved the problem:
settings = get_wrapper_input_json("Szenarioanalyse.json") #get the settings
parameters = {"Threads":True,"Processes":False,"Serial":False}
def simulate(setting):
cfg_path = generate(settings[setting])
run_simulation(cfg_path)
if __name__ == '__main__':
for key,value in parameters.items():
if key == "Threads" and value == True:
start_threads = time.perf_counter()
with futures.ThreadPoolExecutor(10) as e:
e.map(simulate,settings)
print("Simulationtime "+key+": "+str(time.perf_counter()-start_threads))
elif key == "Processes" and value == True:
start_processes = time.perf_counter()
pool = multiprocessing.Pool(multiprocessing.cpu_count())
pool.map(simulate,settings)
print("Simulationtime "+key+": "+str(time.perf_counter()-start_processes))
elif key == "Serial" and value == True:
start_serial = time.perf_counter()
for setting in settings:
simulate(setting)
print("Simulationtime "+key+": "+str(time.perf_counter()-start_serial))
#save_dataframe("Szenarioanalyse.json")
#file_management()
I'm using this code as a template (KILLING IT section)
https://stackoverflow.com/a/36962624/9274778
So I've solved this for now... changed the code to the following
import random
from time import sleep
def worker(i,ListOfData):
print "%d started" % i
#MyCalculations with ListOfData
x = ListOfData * Calcs
if x > 0.95:
return ListOfDataRow, True
else:
return ListOfDataRow, False
callback running only in main
def quit(arg):
if arg[1] == True:
p.terminate() # kill all pool workers
if __name__ == "__main__":
import multiprocessing as mp
Loops = len(ListOfData) / 25
Start = 0
End = 25
pool = mp.Pool()
for y in range(0,Loops)
results = [pool.apply(worker, args=(i,ListOfData[x]),callback = quit)
for y in range(0,len(ListofData))]
for c in results:
if c[1] == True
break
Start = Start + 25
End = End +25
So I chunk my data frame (assume for now my ListOfData is always divisible by 25) and send it off to the multiprocessing. I've found for my PC performance groups of 25 works best. If the 1st set doesn't return a TRUE, then I go to the next chunk.
I couldn't use the async method as it ran all at different times and sometimes I'd get a TRUE back that was further down the list (not what I wanted).
I've been struggling with this program for some time.
I wrote this program to learn how to run the same program in different CPU's at the same time and to reduce processing time.
By itself, the program is not complex, just detect if there are some screenshots in different directories, but as I said before my final goal is not to detect those files, but to learn multiprocessing.
To be easier to test, instead of reading files or reading user inputs, I just set those variables so you won't have to be typing those things. I'll be faster to test this way.
Here's the code:
from multiprocessing import Array, Lock, Pool
def DetectSCREENSHOT(file, counter, total):
# To check if this function is ever accessed
print 'Entered in DetectSCREENSHOT function'
if file.startswith('Screenshot') == False:
print ' (%s %%) ERROR: the image is not a screenshot ' %(round((float(counter)/total)*100, 1))
return 0
else:
print ' (%s %%) SUCCESS: the image is a screenshot' %(round((float(counter)/total)*100, 1))
return 1
def DetectPNG(file_list, counter_success_errors, lock):
# To check if this function is ever accessed
print 'Entered in DetectPNG function'
for i in range(len(file_list)):
file = file_list[i]
# If the file is an .png image:
if file.endswith('.png'):
lock.acquire()
counter_success_errors[0] = counter_success_errors[0] + 1
lock.release()
if DetectSCREENSHOT(file, counter_success_errors[0], len(file_list)) == 1:
lock.acquire()
counter_success_errors[1] = counter_success_errors[1] + 1
lock.release()
else:
lock.acquire()
counter_success_errors[2] = counter_success_errors[2] + 1
lock.release()
def Main():
# file_list = # List of lists of files in different directories
file_list = [['A.png', 'B.png', 'C.txt', 'Screenshot_1.png'], ['D.txt', 'Screenshot_2.png', 'F.png', 'Screenshot_3.png']]
# Array where the first cell is a counter, the second one, the number of successes and the third one, the number of errors.
counter_success_errors = Array('i', 3)
lock = Lock()
counter_success_errors[0] = 0
counter_success_errors[1] = 0
counter_success_errors[2] = 0
# Number of CPUS's to be used (will set the number of different processes to be run)
# CPUs = raw_input('Number of CPUs >> ')
CPUs = 2
pool = Pool(processes=CPUs)
for i in range(CPUs):
pool.apply_async(DetectPNG, [file_list[i], counter_success_errors, lock])
pool.close()
pool.join()
success = int(counter_success_errors[1])
errors = int(counter_success_errors[2])
print(' %s %s %s successfully. %s %s.' %('Detected', success, 'screenshot' if success == 1 else 'screenshots', errors, 'error' if errors == 1 else 'errors'))
#################################################
if __name__ == "__main__":
Main()
When I execute it, I don't get any errors, but it looks like it doesn't even access to both DetectPNG and DetectSCREENSHOT functions.
What's broken in the code?
Output: Detected 0 screenshots successfully. 0 errors.
Background
A small server which waits for different types of jobs which are represented
as Python functions (async_func and async_func2 in the sample code below).
Each job gets submitted to a Pool with apply_async and takes a different amount of time, i.e. I cannot be sure that a job which was submitted first, also finishes first
I can check whether the job was finished with .get(timeout=0.1)
Question
How I can check whether the job is still waiting in the queue or is already running?
Is using a Queue the correct way or is there a more simple way?
Code
import multiprocessing
import random
import time
def async_func(x):
iterations = 0
x = (x + 0.1) % 1
while (x / 10.0) - random.random() < 0:
iterations += 1
time.sleep(0.01)
return iterations
def async_func2(x):
return(async_func(x + 0.5))
if __name__ == "__main__":
results = dict()
status = dict()
finished_processes = 0
worker_pool = multiprocessing.Pool(4)
jobs = 10
for i in range(jobs):
if i % 2 == 0:
results[i] = worker_pool.apply_async(async_func, (i,))
else:
results[i] = worker_pool.apply_async(async_func2, (i,))
status[i] = 'submitted'
while finished_processes < jobs:
for i in range(jobs):
if status[i] != 'finished':
try:
print('{0}: iterations needed = {1}'.format(i, results[i].get(timeout=0.1)))
status[i] = 'finished'
finished_processes += 1
except:
# how to distinguish between "running but no result yet" and "waiting to run"
status[i] = 'unknown'
Just send the status dict, to the function, since dicts are mutable all you need to do is change a bit your functions:
def async_func2(status, x):
status[x] = 'Started'
return(async_func(x + 0.5))
Of course you can change the status to pending just before calling your apply_async
I am having a problem when multithreading and using queues in python 2.7. I want the code with threads to take about half as long as the one without, but I think I'm doing something wrong. I am using a simple looping technique for the fibonacci sequence to best show the problem.
Here is the code without threads and queues. It printed 19.9190001488 seconds as its execution time.
import time
start_time = time.time()
def fibonacci(priority, num):
if num == 1 or num == 2:
return 1
a = 1
b = 1
for i in range(num-2):
c = a + b
b = a
a = c
return c
print fibonacci(0, 200000)
print fibonacci(1, 100)
print fibonacci(2, 200000)
print fibonacci(3, 2)
print("%s seconds" % (time.time() - start_time))
Here is the code with threads and queues. It printed 21.7269999981 seconds as its execution time.
import time
start_time = time.time()
from Queue import *
from threading import *
numbers = [200000,100,200000,2]
q = PriorityQueue()
threads = []
def fibonacci(priority, num):
if num == 1 or num == 2:
q.put((priority, 1))
return
a = 1
b = 1
for i in range(num-2):
c = a + b
b = a
a = c
q.put((priority, c))
return
for i in range(4):
priority = i
num = numbers[i]
t = Thread(target = fibonacci, args = (priority, num))
threads.append(t)
#print threads
for t in threads:
t.start()
for t in threads:
t.join()
while not q.empty():
ans = q.get()
q.task_done()
print ans[1]
print("%s seconds" % (time.time() - start_time))
What I thought would happen is the multithreaded code takes half as long as the code without threads. Essentially I thought that all the threads work at the same time, so the 2 threads calculating the fibonacci number at 200,000 would finish at the same time, so execution is about twice as fast as the code without threads. Apparently that's not what happened. Am I doing something wrong? I just want to execute all threads at the same time, print in the order that they started and the thread that takes the longest time is pretty much the execution time.
EDIT:
I updated my code to use processes, but now the results aren't being printed. Only an execution time of 0.163000106812 seconds is showing. Here is the new code:
import time
start_time = time.time()
from Queue import *
from multiprocessing import *
numbers = [200000,100,200000,2]
q = PriorityQueue()
processes = []
def fibonacci(priority, num):
if num == 1 or num == 2:
q.put((priority, 1))
return
a = 1
b = 1
for i in range(num-2):
c = a + b
b = a
a = c
q.put((priority, c))
return
for i in range(4):
priority = i
num = numbers[i]
p = Process(target = fibonacci, args = (priority, num))
processes.append(p)
#print processes
for p in processes:
p.start()
for p in processes:
p.join()
while not q.empty():
ans = q.get()
q.task_done()
print ans[1]
print("%s seconds" % (time.time() - start_time))
You've run in one of the basic limiting factors of the CPython implementation, the Global Interpreter Lock or GIL. Effectively this serializes your program, your threads will take turns executing. One thread will own the GIL, while the other threads will wait for the GIL to come free.
One solution would to be use separate processes. Each process would have its own GIL so would execute in parallel. Probably the easiest way to do this is to use Python's multiprocessing module as replacement for the threading module.