I'm confused by this Python multiprocessing problem

import multiprocessing
import time
import os

def whoami(name):
    print("I'm %s, in process %s" % (name, os.getpid()))

def loopy(name):
    whoami(name)
    start = 1
    stop = 1000000
    for num in range(start, stop):
        print("\tNumber %s of %s. Honk!!!" % (num, stop))
        time.sleep(1)

if __name__ == "__main__":
    whoami("main")
    p = multiprocessing.Process(target=loopy, args=("loopy",))
    p.start()
    time.sleep(5)
    p.terminate()
This code is from the book Introducing Python, 2nd edition, by Bill Lubanovic. According to the book, when I run this code, this result should appear:
I'm main, in process xxxx
I'm loopy, in process ----
Number 1 of 1000000. Honk!
Number 2 of 1000000. Honk!
Number 3 of 1000000. Honk!
Number 4 of 1000000. Honk!
Number 5 of 1000000. Honk!
However, in my case only 'Process finished with exit code 0' is printed. I want to know what is wrong with this code.
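One thing worth checking (my suggestion, not from the book): if the child's stdout is buffered, or terminate() kills the child before anything is flushed, its output may never appear. A minimal variation to try, flushing each print and joining the child before exit:

import multiprocessing
import time
import os

def loopy(name):
    print("I'm %s, in process %s" % (name, os.getpid()), flush=True)
    for num in range(1, 1000000):
        print("\tNumber %s of 1000000. Honk!!!" % num, flush=True)
        time.sleep(1)

if __name__ == "__main__":
    p = multiprocessing.Process(target=loopy, args=("loopy",))
    p.start()
    time.sleep(5)   # give the child time to print a few lines
    p.terminate()
    p.join()        # reap the child so nothing is lost on exit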

Related

how to extract a variable in the middle of process to another process multiprocessing python

I made a program in Python that runs 2 processes simultaneously, but I want to take a variable from the middle of the first process to the other process, like passing a variable from function to function. Here is code that works, but the output is not what I expected.
import multiprocessing
import time

t = 0

def countup():
    global t
    while t < 25:
        t = t + 1
        time.sleep(1)
        print("count", t)

def what():
    global t
    globals().update()  # has no effect: each process gets its own copy of the globals
    while True:
        time.sleep(3)
        print("the count reached ", t)

if __name__ == "__main__":
    p1 = multiprocessing.Process(target=countup, args=())
    p2 = multiprocessing.Process(target=what, args=())
    p1.start()
    p2.start()
    p1.join()
    p2.join()
The output is shown as this:
count 1
count 2
the count reached 0
count 3
count 4
count 5
the count reached 0
count 6
count 7
count 8
the count reached 0
...
The variable "t" is not updating, even though both processes are working at the same time. The expected results were supposed to be like this:
count 1
count 2
the count reached 2
count 3
count 4
count 5
the count reached 5
count 6
count 7
count 8
the count reached 6
...
Am I missing something? Is there something wrong in my code?
I solved it: according to @deets, using queues or pipes to exchange variables is the key.
from multiprocessing import Process, Queue

def f(q):
    q.put([42, None, 'hello'])

if __name__ == '__main__':
    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    print(q.get())  # prints "[42, None, 'hello']"
    p.join()
Amazing that I could use this method for synchronized count printing between 2 different processes that work simultaneously... just wow.
Here is the code:
import multiprocessing
from multiprocessing import Process, Queue
import time

t = 0

def countup(q):
    global t
    while t < 25:
        t = t + 1
        time.sleep(1)
        print("count", t)
        q.put(t)

if __name__ == "__main__":
    q = Queue()
    p = multiprocessing.Process(target=countup, args=(q,))
    p.start()
    while (q.get() < 15):
        time.sleep(2)
        print("the count reached ", q.get())  # note: each q.get() consumes one item
    p.join()
The result is this:
count 1
count 2
the count reached 2
count 3
count 4
count 5
the count reached 4
count 6
the count reached 6
count 7
count 8
...
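For a single shared counter, another standard option (a sketch of mine, not from the thread) is multiprocessing.Value, which keeps one integer in shared memory so the reader sees updates without a queue:

from multiprocessing import Process, Value
import time

def countup(counter):
    for _ in range(25):
        time.sleep(1)
        with counter.get_lock():  # guard the read-modify-write
            counter.value += 1
        print("count", counter.value)

if __name__ == "__main__":
    counter = Value('i', 0)  # 'i' means a C int, initialized to 0
    p = Process(target=countup, args=(counter,))
    p.start()
    while counter.value < 25:
        time.sleep(3)
        print("the count reached", counter.value)
    p.join()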

Parallelizing through Multi-threading and Multi-processing taking significantly more time than serial

I'm trying to learn how to do parallel programming in Python. I wrote a simple int square function and then ran it in serial, multi-thread, and multi-process:
import time
import multiprocessing, threading
import random

def calc_square(numbers):
    sq = 0
    for n in numbers:
        sq = n*n

def splita(list, n):
    a = [[] for i in range(n)]
    counter = 0
    for i in range(0, len(list)):
        a[counter].append(list[i])
        if len(a[counter]) == len(list)/n:
            counter = counter + 1
            continue
    return a

if __name__ == "__main__":
    random.seed(1)
    arr = [random.randint(1, 11) for i in xrange(1000000)]
    print "init completed"
    start_time2 = time.time()
    calc_square(arr)
    end_time2 = time.time()
    print "serial: " + str(end_time2 - start_time2)
    newarr = splita(arr, 8)
    print 'split complete'
    start_time = time.time()
    for i in range(8):
        t1 = threading.Thread(target=calc_square, args=(newarr[i],))
        t1.start()
        t1.join()  # joining inside the loop runs the threads one at a time
    end_time = time.time()
    print "mt: " + str(end_time - start_time)
    start_time = time.time()
    for i in range(8):
        p1 = multiprocessing.Process(target=calc_square, args=(newarr[i],))
        p1.start()
        p1.join()  # likewise, each process finishes before the next starts
    end_time = time.time()
    print "mp: " + str(end_time - start_time)
Output:
init completed
serial: 0.0640001296997
split complete
mt: 0.0599999427795
mp: 2.97099995613
However, as you can see, something weird happened and mt is taking the same time as serial and mp is actually taking significantly longer (almost 50 times longer).
What am I doing wrong? Could someone push me in the right direction to learn parallel programming in python?
Edit 01
Looking at the comments, I see that a function that doesn't return anything can seem pointless. The reason I'm even trying this is because previously I tried the following add function:
def addi(numbers):
    sq = 0
    for n in numbers:
        sq = sq + n
    return sq
I tried returning the sum of each part to a serial adder, so at least I could see some performance improvement over a pure serial implementation. However, I couldn't figure out how to store and use the returned value, and that's why I'm trying something even simpler: just dividing up the array and running a simple function on it.
Thanks!
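For the record, collecting return values from worker processes is what multiprocessing.Pool.map is for; here is a minimal sketch (mine, not from the question) that sums the chunks in parallel and combines the partial results:

import random
from multiprocessing import Pool

def addi(numbers):
    sq = 0
    for n in numbers:
        sq = sq + n
    return sq

if __name__ == "__main__":
    random.seed(1)
    arr = [random.randint(1, 11) for i in range(1000000)]
    chunks = [arr[i::8] for i in range(8)]  # 8 roughly equal slices
    pool = Pool(8)
    partial_sums = pool.map(addi, chunks)   # one result per chunk, in order
    pool.close()
    pool.join()
    print(sum(partial_sums))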
I think that multiprocessing takes quite a long time to create and start each process. I have changed the program to make arr 10 times the size and changed the way that the processes are started, and there is a slight speed-up:
(Also note: Python 3.)
import time
import multiprocessing, threading
from multiprocessing import Queue
import random

def calc_square_q(numbers, q):
    while q.empty():
        pass
    return calc_square(numbers)

if __name__ == "__main__":
    random.seed(1)  # note how big arr is now vvvvvvv
    arr = [random.randint(1, 11) for i in range(10000000)]
    print("init completed")
    # ...
    # other stuff as before
    # ...
    processes = []
    q = Queue()
    for arrs in newarr:
        processes.append(multiprocessing.Process(target=calc_square_q, args=(arrs, q)))
    print('start processes')
    for p in processes:
        p.start()  # even tho' each process is started it waits...
    print('join processes')
    q.put(None)  # ... for q to become not empty.
    start_time = time.time()
    for p in processes:
        p.join()
    end_time = time.time()
    print("mp: " + str(end_time - start_time))
Also notice above how I create and start the processes in two different loops, and then finally join with the processes in a third loop.
Output:
init completed
serial: 0.53214430809021
split complete
start threads
mt: 0.5551605224609375
start processes
join processes
mp: 0.2800724506378174
Another factor of 10 increase in size of arr:
init completed
serial: 5.8455305099487305
split complete
start threads
mt: 5.411392450332642
start processes
join processes
mp: 1.9705185890197754
And yes, I've also tried this in Python 2.7, although threads seemed slower.

Python Multiprocessing reading input iterator all at once

Using Python 3.4.3, I have a generator function foo that yields data to be processed in parallel. Passing this function to multiprocessing.Pool.map with n processes, I expected it to be consumed n items at a time:
from multiprocessing import Pool
import time

now = time.time

def foo(n):
    for i in range(n):
        print("%f get %d" % (now(), i))
        yield i

def bar(i):
    print("%f start %d" % (now(), i))
    time.sleep(1)
    print("%f end %d" % (now(), i))

pool = Pool(2)
pool.map(bar, foo(6))
pool.close()
pool.join()
Unfortunately, the generator is exhausted immediately; all 6 items are pulled before the first task finishes. The output is this:
1440713274.290760 get 0
1440713274.290827 get 1
1440713274.290839 get 2
1440713274.290849 get 3
1440713274.290858 get 4
1440713274.290867 get 5
1440713274.291526 start 0
1440713274.291654 start 1
1440713275.292680 end 0
1440713275.292803 end 1
1440713275.293056 start 2
1440713275.293129 start 3
1440713276.294106 end 2
1440713276.294182 end 3
1440713276.294344 start 4
1440713276.294390 start 5
1440713277.294803 end 4
1440713277.294859 end 5
But I had hoped to get something more like:
1440714272.612041 get 0
1440714272.612078 get 1
1440714272.612090 start 0
1440714272.612100 start 1
1440714273.613174 end 0
1440714273.613247 end 1
1440714273.613264 get 2
1440714273.613276 get 3
1440714273.613287 start 2
1440714273.613298 start 3
1440714274.614357 end 2
1440714274.614423 end 3
1440714274.614432 get 4
1440714274.614437 get 5
1440714274.614443 start 4
1440714274.614448 start 5
1440714275.615475 end 4
1440714275.615549 end 5
(Reason is that foo is going to read a large amount of data into memory.)
I got the same results with pool.imap(bar, foo(6), 2) and
for i in foo(6):
    pool.apply_async(bar, args=(i,))
What is the easiest way to make this work?
I faced a similar problem, where I needed to read a large amount of data and process parts of it in parallel. I solved it by subclassing multiprocessing.Process and using queues. I think you will benefit from reading about embarrassingly parallel problems. I have given sample code below:
import multiprocessing
import time
import logging

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)-8s %(message)s',
                    datefmt='%m-%d %H:%M:%S')

# Producer class
class foo(multiprocessing.Process):
    def __init__(self, n, queues):
        super(foo, self).__init__()
        self.n = n
        self.queues = queues

    def run(self):
        logging.info('Starting foo producer')
        for i in range(self.n):
            logging.info('foo: Sending "%d" to a consumer' % (i))
            self.queues[i % len(self.queues)].put(i)
            time.sleep(1)  # Unnecessary sleep to demonstrate order of events
        for q in self.queues:
            q.put('end')
        logging.info('Ending foo producer')
        return

# Consumer class
class bar(multiprocessing.Process):
    def __init__(self, idx, queue):
        super(bar, self).__init__()
        self.idx = idx
        self.queue = queue

    def run(self):
        logging.info("Starting bar %d consumer" % (self.idx))
        while True:
            fooput = self.queue.get()
            if type(fooput) == str and fooput == 'end':
                break
            logging.info('bar %d: Got "%d" from foo' % (self.idx, fooput))
            time.sleep(2)  # Unnecessary sleep to demonstrate order of events
        logging.info("Ending bar %d consumer" % (self.idx))
        return

if __name__ == '__main__':
    # make queues to put data read by foo
    count_queues = 2
    queues = []
    for i in range(count_queues):
        q = multiprocessing.Queue(2)
        # Give queue size according to your buffer requirements
        queues.append(q)
    # make reader for reading data. Let's call this object Producer
    foo_object = foo(6, queues)
    # make receivers for the data. Let's call these Consumers
    # Each consumer is assigned a queue
    bar_objects = []
    for idx, q in enumerate(queues):
        bar_object = bar(idx, q)
        bar_objects.append(bar_object)
    # start the consumer processes
    for bar_object in bar_objects:
        bar_object.start()
    # start the producer process
    foo_object.start()
    # Join all started processes
    for bar_object in bar_objects:
        bar_object.join()
    foo_object.join()
The best I can come up with myself is this:
pool_size = 2
pool = Pool(pool_size)
count = 0
for i in foo(6):
    count += 1
    if count % pool_size == 0:
        pool.apply(bar, args=(i,))  # blocking call every pool_size-th item
    else:
        pool.apply_async(bar, args=(i,))
pool.close()
pool.join()
for pool_size=2 it outputs:
1440798963.389791 get 0
1440798963.490108 get 1
1440798963.490683 start 0
1440798963.595587 start 1
1440798964.491828 end 0
1440798964.596687 end 1
1440798964.597137 get 2
1440798964.697373 get 3
1440798964.697629 start 2
1440798964.798024 start 3
1440798965.698719 end 2
1440798965.799108 end 3
1440798965.799419 get 4
1440798965.899689 get 5
1440798965.899984 start 4
1440798966.001016 start 5
1440798966.901050 end 4
1440798967.002097 end 5
for pool_size=3 it outputs:
1440799101.917546 get 0
1440799102.018438 start 0
1440799102.017869 get 1
1440799102.118868 get 2
1440799102.119903 start 1
1440799102.219616 start 2
1440799103.019600 end 0
1440799103.121066 end 1
1440799103.220746 end 2
1440799103.221124 get 3
1440799103.321402 get 4
1440799103.321664 start 3
1440799103.422589 get 5
1440799103.422824 start 4
1440799103.523286 start 5
1440799104.322934 end 3
1440799104.423878 end 4
1440799104.524350 end 5
However, it would take 3 new items from the iterator as soon as the apply finishes. If the processing takes variable time, this won't work as well.
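Another pattern worth sketching (mine, not from the thread; bounded_map is a hypothetical helper): throttle apply_async with a semaphore that is released from the result callback, so the generator is only advanced when a worker slot frees up:

import threading
from multiprocessing import Pool

def bounded_map(pool, func, iterable, limit):
    # Allow at most `limit` items in flight; the iterator is advanced
    # only when an earlier task completes and releases a slot.
    sem = threading.Semaphore(limit)
    results = []
    for item in iterable:
        sem.acquire()  # blocks until a worker finishes an earlier item
        results.append(pool.apply_async(
            func, (item,),
            callback=lambda _: sem.release(),
            error_callback=lambda _: sem.release()))
    return [r.get() for r in results]

Usage would be bounded_map(pool, bar, foo(6), 2) with bar and foo as in the question; unlike the apply/apply_async mix above, items are handed out one at a time as workers become free, even when processing times vary.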

running multiple processes simultaneously

I am attempting to create a program in Python that runs multiple instances (15) of a function simultaneously over different processors. I have been researching this, and have the below program set up using the Process tool from multiprocessing.
Unfortunately, the program executes each instance of the function sequentially (it seems to wait for one to finish before moving onto the next part of the loop).
from __future__ import print_function
from multiprocessing import Process
import sys
import os
import re

for i in range(1, 16):
    exec("path%d = 0" % (i))
    exec("file%d = open('%d-path', 'a', 1)" % (i, i))

def stat(first, last):
    for j in range(1, 40000):
        input_string = "water" + str(j) + ".xyz.geocard"
        if os.path.exists('./%s' % input_string) == True:
            exec("out%d = open('output%d', 'a', 1)" % (first, first))
            exec('print("Processing file %s...", file=out%d)' % (input_string, first))
            with open('./%s' % input_string, 'r') as file:
                for line in file:
                    for i in range(first, last):
                        search_string = " " + str(i) + " path:"
                        for result in re.finditer(r'%s' % search_string, line):
                            exec("path%d += 1" % i)
            for i in range(first, last):
                exec("print(path%d, file=file%d)" % (i, i))

processes = []
for m in range(1, 16):
    n = m + 1
    p = Process(target=stat, args=(m, n))
    p.start()
    processes.append(p)

for p in processes:
    p.join()
I am reasonably new to programming, and have no experience with parallelization - any help would be greatly appreciated.
I have included the entire program above, replacing "Some Function" with the actual function, to demonstrate that this is not a timing issue. The program can take days to cycle through all 40,000 files (each of which is quite large).
I think what is happening is that you are not doing enough in some_function to observe work happening in parallel. It spawns a process, and it completes before the next one gets spawned. If you introduce a random sleep time into some_function, you'll see that they are in fact running in parallel.
from multiprocessing import Process
import random
import time

def some_function(first, last):
    time.sleep(random.randint(1, 3))
    print first, last

processes = []
for m in range(1, 16):
    n = m + 1
    p = Process(target=some_function, args=(m, n))
    p.start()
    processes.append(p)

for p in processes:
    p.join()
Output
2 3
3 4
5 6
12 13
13 14
14 15
15 16
1 2
4 5
6 7
9 10
8 9
7 8
11 12
10 11
Are you sure? I just tried it and it worked for me; the results are out of order on every execution, so they're being executed concurrently.
Have a look at your function. It takes "first" and "last", so is its execution time smaller for lower values? If so, the smaller-numbered processes would finish first and the output would come back in order, which could make a parallel run look sequential.
ps ux | grep python | grep -v grep | wc -l
> 16
If you execute the code repeatedly (e.g. using a bash script) you can see that every process is starting up. If you want to confirm this, import os and have the function print out os.getpid() so you can see that each has a different process ID.
So yeah, double-check your results, because it seems to me like you've written it concurrently just fine!
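For example, a minimal check along those lines (a sketch):

from multiprocessing import Process
import os

def some_function(first, last):
    # Each worker should report a different process ID.
    print os.getpid(), first, last

if __name__ == "__main__":
    processes = [Process(target=some_function, args=(m, m + 1)) for m in range(1, 16)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()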
This code below can run 10 processes in parallel, printing the numbers from 0 to 99. The
if __name__ == "__main__":
guard is needed to run processes on Windows:
from multiprocessing import Process

def test():
    for i in range(0, 100):
        print(i)

if __name__ == "__main__": # Here
    process_list = []
    for _ in range(0, 10):
        process = Process(target=test)
        process_list.append(process)
    for process in process_list:
        process.start()
    for process in process_list:
        process.join()
And this code below is the shorthand list-comprehension version of the above, again running 10 processes in parallel printing the numbers from 0 to 99:
from multiprocessing import Process

def test():
    [print(i) for i in range(0, 100)]

if __name__ == "__main__":
    process_list = [Process(target=test) for _ in range(0, 10)]
    [process.start() for process in process_list]
    [process.join() for process in process_list]
This is the result:
...
99
79
67
71
67
89
81
99
80
68
...

Confusion regarding the output of threads in python

I am currently working with Python v2.7 on Windows 8.
My programme is using threads. I am providing a name to these threads during their creation. The first thread is named First-Thread and the second one is named Second-Thread. The threads execute a method named getData() that does the following:
makes the current thread sleep for some time
calls compareValues()
retrieves the information from compareValues() and adds it to a list called myList
The compareValues() does the following:
generates a random number
checks whether it is less than 5, or greater than or equal to 5, and yields the result along with the current thread's name
I save the results of these threads to a list named myList and then finally print it.
Problem: Why do I never see the Second-Thread in myList? I don't understand this behavior. Please try to execute this code to see the output in order to understand my problem.
Code:
import time
from random import randrange
import threading

myList = []

def getData(i):
    print "Sleep for %d" % i
    time.sleep(i)
    data = compareValues()
    for d in list(data):
        myList.append(d)

def compareValues():
    number = randrange(10)
    if number >= 5:
        yield "%s: Greater than or equal to 5: %d " % (t.name, number)
    else:
        yield "%s: Less than 5: %d " % (t.name, number)

threadList = []

wait = randrange(10) + 1
t = threading.Thread(name='First-Thread', target=getData, args=(wait,))
threadList.append(t)
t.start()

wait = randrange(3) + 1
t = threading.Thread(name='Second-Thread', target=getData, args=(wait,))
threadList.append(t)
t.start()

for t in threadList:
    t.join()

print
print "The final list"
print myList
Sample output:
Sleep for 4Sleep for 1
The final list
['First-Thread: Greater than or equal to 5: 7 ', 'First-Thread: Greater than or equal to 5: 8 ']
Thank you for your time.
def compareValues():
    number = randrange(10)
    if number >= 5:
        yield "%s: Greater than or equal to 5: %d " % (t.name, number)
    else:
        yield "%s: Less than 5: %d " % (t.name, number)
In the body of compareValues, the code refers to t.name. By the time compareValues() gets called by the threads, t, which is looked up according to the LEGB rule and found in the global scope, references the first thread, because the for loop's t.join() is waiting on the first thread. t.name thus has the value First-Thread.
To get the current thread name, use threading.current_thread().name:
def compareValues():
    number = randrange(10)
    name = threading.current_thread().name
    if number >= 5:
        yield "%s: Greater than or equal to 5: %d " % (name, number)
    else:
        yield "%s: Less than 5: %d " % (name, number)
Then you will get output like
Sleep for 4
Sleep for 2
The final list
['Second-Thread: Less than 5: 3 ', 'First-Thread: Greater than or equal to 5: 5 ']
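An alternative that avoids the global lookup entirely (a sketch of mine, not from the answer) is to pass the name into the generator explicitly:

def compareValues(name):
    number = randrange(10)
    if number >= 5:
        yield "%s: Greater than or equal to 5: %d " % (name, number)
    else:
        yield "%s: Less than 5: %d " % (name, number)

def getData(i):
    print "Sleep for %d" % i
    time.sleep(i)
    # Pass the current thread's name down instead of reading the global t.
    for d in compareValues(threading.current_thread().name):
        myList.append(d)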
