I am supposed to modify this code, without changing the main function, to stop it from deadlocking. It is deadlocking because the locks end up waiting for each other, but I cannot figure out how to stop it. My professor's lecture talks about os.fork, which I can't use since I am on Windows.
I was looking into the Pool class in multiprocessing, but I can't see how to implement that without changing the main function. I am pretty sure I am supposed to use subprocess, but again, she didn't include any information about it and I can't find a relevant example online.
import threading

x = 0

def task(lock1, lock2, count):
    global x
    for i in range(count):
        lock1.acquire()
        lock2.acquire()
        # Assume that a thread can update the x value
        # only after both locks have been acquired.
        x += 1
        print(x)
        lock2.release()
        lock1.release()
# Do not modify the main method
def main():
    global x
    count = 1000
    lock1 = threading.Lock()
    lock2 = threading.Lock()
    T1 = threading.Thread(target=task, args=(lock1, lock2, count))
    T2 = threading.Thread(target=task, args=(lock2, lock1, count))
    T1.start()
    T2.start()
    T1.join()
    T2.join()
    print(f"x = {x}")

main()
Edit: Changing task to this seems to have fixed it, although I do not think it was done the way she wanted...
def task(lock1, lock2, count):
    global x
    for i in range(count):
        lock1.acquire(False)
        lock2.acquire(False)
        # Assume that a thread can update the x value
        # only after both locks have been acquired.
        x += 1
        print(x)
        if lock2.locked():
            lock2.release()
        if lock1.locked():
            lock1.release()
Your threads need to acquire the locks in a consistent order. You can do this by locking the one with the lower id() value first:
def task(lock1, lock2, count):
    global x
    if id(lock1) > id(lock2):
        lock1, lock2 = lock2, lock1
    for i in range(count):
        lock1.acquire()
        lock2.acquire()
        # Assume that a thread can update the x value
        # only after both locks have been acquired.
        x += 1
        print(x)
        lock2.release()
        lock1.release()
With a consistent lock order, it's impossible for two threads to each be holding a lock the other needs.
(multiprocessing, subprocess, and os.fork are all unhelpful here. They would just add more issues.)
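If you want to reuse the ordering idea elsewhere, it generalizes nicely. Here is a minimal sketch (my own addition, not part of the assignment or the answer above) of a hypothetical ordered_locks context manager that acquires any number of locks in a consistent id() order and releases them in reverse:

import threading
from contextlib import contextmanager

@contextmanager
def ordered_locks(*locks):
    # Sort by id() so every caller acquires the locks in the same global order,
    # which rules out the circular wait that causes the deadlock.
    ordered = sorted(locks, key=id)
    for lock in ordered:
        lock.acquire()
    try:
        yield
    finally:
        for lock in reversed(ordered):
            lock.release()

# Inside task() you would then write something like:
#     with ordered_locks(lock1, lock2):
#         x += 1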
Sorry if this is a stupid question, but I'm having trouble understanding how managers work in Python.
Let's say I have a manager that contains a dictionary to be shared across all processes. I want to have just one process writing to the dictionary at a time, while many others read from the dictionary.
Can this happen concurrently, with no synchronization primitives, or will something break if reads/writes happen at the same time?
What if I want to have multiple processes writing to the dictionary at once - is that allowed, or will it break? (I know it could cause race conditions, but could it error out?)
Additionally, does a manager process each read and write transaction in a queue-like fashion, one at a time, or does it do them all at once?
https://docs.python.org/3/library/multiprocessing.html#sharing-state-between-processes
It depends on how you write to the dictionary, i.e. whether the operation is atomic or not:
my_dict[some_key] = 9 # this is atomic
my_dict[some_key] += 1 # this is not atomic
So creating a new key or updating an existing key, as in the first line of code above, is an atomic operation. But the second line of code is really multiple operations, equivalent to:
temp = my_dict[some_key]
temp = temp + 1
my_dict[some_key] = temp
So if two processes were executing my_dict[some_key] += 1 in parallel, they could both read the same value in temp = my_dict[some_key], increment temp to the same new value, and the net effect would be that the dictionary value only gets incremented once. This can be demonstrated as follows:
from multiprocessing import Pool, Manager, Lock

def init_pool(the_lock):
    global lock
    lock = the_lock

def worker1(d):
    for _ in range(1000):
        with lock:
            d['x'] += 1

def worker2(d):
    for _ in range(1000):
        d['y'] += 1

if __name__ == '__main__':
    lock = Lock()
    with Manager() as manager, \
         Pool(4, initializer=init_pool, initargs=(lock,)) as pool:
        d = manager.dict()
        d['x'] = 0
        d['y'] = 0
        # worker1 will serialize with a lock
        pool.apply_async(worker1, args=(d,))
        pool.apply_async(worker1, args=(d,))
        # worker2 will not serialize with a lock:
        pool.apply_async(worker2, args=(d,))
        pool.apply_async(worker2, args=(d,))
        # wait for the 4 tasks to complete:
        pool.close()
        pool.join()
        print(d)
Prints:
{'x': 2000, 'y': 1162}
Update
As far as serialization goes:
The BaseManager creates a server which, by default, uses a socket on Linux and a named pipe on Windows. So essentially every method you execute against a managed dictionary, for example, is pretty much a remote method call implemented with message passing. This also means that the server could be running on a different computer altogether. But these method calls are not serialized; the object methods themselves must be thread-safe, because each method call is run in a new thread.
The following is an example of creating our own managed type and having the server listen for requests, possibly from a different computer (although in this example the client is running on the same computer). The client calls increment on the managed object 1000 times across two threads, but the method implementation is not done under a lock, so the resulting value of self.x when we are all done is not 1000. Also, when we retrieve the value of x twice concurrently via the method get_x, we see that both invocations start more or less at the same time:
from multiprocessing.managers import BaseManager
from multiprocessing.pool import ThreadPool
from threading import Event, Thread, get_ident
import time

class MathManager(BaseManager):
    pass

class MathClass:
    def __init__(self, x=0):
        self.x = x

    def increment(self, y):
        temp = self.x
        time.sleep(.01)
        self.x = temp + 1

    def get_x(self):
        print(f'get_x started by thread {get_ident()}', time.time())
        time.sleep(2)
        return self.x

    def set_x(self, value):
        self.x = value

def server(event1, event2):
    MathManager.register('Math', MathClass)
    manager = MathManager(address=('localhost', 5000), authkey=b'abracadabra')
    manager.start()
    event1.set()  # show we are started
    print('Math server running; waiting for shutdown...')
    event2.wait()  # wait for shutdown
    print("Math server shutting down.")
    manager.shutdown()

def client():
    MathManager.register('Math')
    manager = MathManager(address=('localhost', 5000), authkey=b'abracadabra')
    manager.connect()
    math = manager.Math()
    pool = ThreadPool(2)
    pool.map(math.increment, [1] * 1000)
    results = [pool.apply_async(math.get_x) for _ in range(2)]
    for result in results:
        print(result.get())

def main():
    event1 = Event()
    event2 = Event()
    t = Thread(target=server, args=(event1, event2))
    t.start()
    event1.wait()  # server started
    client()  # now we can run client
    event2.set()
    t.join()

# Required for Windows:
if __name__ == '__main__':
    main()
Prints:
Math server running; waiting for shutdown...
get_x started by thread 43052 1629375415.2502146
get_x started by thread 71260 1629375415.2502146
502
502
Math server shutting down.
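To make the managed object safe under those concurrent method calls, the usual approach is to have the class protect its own shared state with a lock. A minimal sketch (my addition, not from the original example) of a hypothetical thread-safe variant of MathClass:

import threading
import time

class SafeMathClass:
    # Each method call arriving through the manager runs in its own thread,
    # so the instance guards self.x itself.
    def __init__(self, x=0):
        self.x = x
        self._lock = threading.Lock()

    def increment(self, y):
        with self._lock:
            temp = self.x
            time.sleep(.01)
            self.x = temp + 1

    def get_x(self):
        with self._lock:
            return self.x

Registering SafeMathClass with the manager in place of MathClass would make all 1000 increments count, at the cost of serializing them.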
I am making a simple project to learn about threading and this is my code:
import time
import threading

x = 0

def printfunction():
    while x == 0:
        print("process running")

def timer(delay):
    while True:
        time.sleep(delay)
        break
    x = 1
    return x

t1 = threading.Thread(target=timer, args=[3])
t2 = threading.Thread(target=printfunction)
t1.start()
t2.start()
t1.join()
t2.join()
It is supposed to just print "process running" to the console for three seconds, but it never stops printing. The console shows me no errors, and I have tried shortening the time to see if I wasn't waiting long enough, but it still doesn't work. Then I tried deleting the t1.join() and t2.join(), but I still have no luck and the program continues running.
What am I doing wrong?
Add
global x
to the top of timer(). As is, because timer() assigns to x, x is considered to be local to timer(), and its x = 1 has no effect on the module-level variable also named x. The global x remains 0 forever, so the while x == 0: in printfunction() always succeeds. It really has nothing to do with threading :-)
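For concreteness, here is what the corrected function would look like (my rendering of the fix described above):

def timer(delay):
    global x          # make the assignment below target the module-level x
    while True:
        time.sleep(delay)
        break
    x = 1             # now the `while x == 0` loop in printfunction() will exit
    return x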
I want to be able to run multiple threads without actually adding a new line for every thread I want to run. In the code below I cannot dynamically add more accountIDs, or increase the number of threads, just by changing thread_count.
For example, this is my code now:
import threading

def get_page_list(account, thread_count):
    # (stub) returns the account's pages split into thread_count sub-lists
    return list_of_pages_split_by_threads

def pull_data(page_list, account_id):
    data = api(page_list, account_id)
    return data

if __name__ == "__main__":
    accountIDs = [100]
    # number of threads to make:
    thread_count = 3
    # Returns a list of pages, i.e.: [[1,2,3],[4,5,6],[7,8,9,10]]
    page_lists = get_page_list(accountIDs[0], thread_count)
    t1 = threading.Thread(target=pull_data, args=(page_lists[0], accountIDs[0]))
    t2 = threading.Thread(target=pull_data, args=(page_lists[1], accountIDs[0]))
    t3 = threading.Thread(target=pull_data, args=(page_lists[2], accountIDs[0]))
    t1.start()
    t2.start()
    t3.start()
    t1.join()
    t2.join()
    t3.join()
This is where I want to get to:
Any time I want to add an additional thread (if the server can handle it) or add additional accountIDs, I don't have to reproduce the code.
I.e. (this example is what I would like to do, but the code below doesn't work; it tries to finish a whole list of pages before moving on to the next thread):
if __name__ == "__main__":
    accountIDs = [100, 101, 103]
    thread_count = 3
    for account in accountIDs:
        page_lists = get_page_list(account, thread_count)
        for pg_list in page_lists:
            t1 = threading.Thread(target=pull_data, args=(pg_list, account))
            t1.start()
            t1.join()
One way of doing it is using a Pool and a Queue.
The pool will keep working while there are items in the queue, without holding up the main thread.
Choose one of these imports:
import multiprocessing as mp (for process-based parallelization)
import multiprocessing.dummy as mp (for thread-based parallelization)
Creating the workers, pool and queue:
the_queue = mp.Queue()  # store the account ids and page lists here

def worker_main(queue):
    while waiting == True:
        while not queue.empty():
            account, pageList = queue.get(True)  # get an item from the queue
            pull_data(pageList, account)

waiting = True

the_pool = mp.Pool(num_parallel_workers, worker_main, (the_queue,))
# don't forget the comma in (the_queue,) above

accountIDs = [100, 101, 103]
thread_count = 3
for account in accountIDs:
    list_of_page_lists = get_page_list(account, thread_count)
    for pg_list in list_of_page_lists:
        the_queue.put((account, pg_list))

....

waiting = False  # while you don't do this, the pool will probably never end.
                 # not sure if it's a good practice, but you might want to have
                 # the pool hanging there for a while to receive more items

the_pool.close()
the_pool.join()
Another option is to fill the queue first, create the pool second, use the worker only while there are items in the queue.
Then if more data arrives, you create another queue, another pool:
import multiprocessing.dummy as mp
# if you are not using dummy, you will probably need a queue for the results too,
# as the processes will not access the vars from the main thread
# something like worker_main(input_queue, output_queue):
# and pull_data(pageList, account, output_queue)
# and mp.Pool(num_parallel_workers, worker_main, (in_queue, out_queue))
# and you get the results from the output queue after pool.join()

the_queue = mp.Queue()  # store the account ids and page lists here

def worker_main(queue):
    while not queue.empty():
        account, pageList = queue.get(True)  # get an item from the queue
        pull_data(pageList, account)

accountIDs = [100, 101, 103]
thread_count = 3
for account in accountIDs:
    list_of_page_lists = get_page_list(account, thread_count)
    for pg_list in list_of_page_lists:
        the_queue.put((account, pg_list))

the_pool = mp.Pool(num_parallel_workers, worker_main, (the_queue,))
# don't forget the comma in (the_queue,) above

the_pool.close()
the_pool.join()

del the_queue
del the_pool
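A simpler route (my suggestion, not part of the answer above) is to let the pool's starmap do the distribution itself, so adding accounts or changing the thread count is just a matter of changing the input list and the pool size. A minimal Python 3 sketch, assuming get_page_list and pull_data exist with the signatures shown in the question:

import multiprocessing.dummy as mp  # thread-based pool

if __name__ == "__main__":
    accountIDs = [100, 101, 103]
    thread_count = 3

    # Build one (page_list, account) task per chunk of pages.
    tasks = []
    for account in accountIDs:
        for pg_list in get_page_list(account, thread_count):
            tasks.append((pg_list, account))

    # starmap unpacks each tuple into pull_data(page_list, account_id)
    # and runs up to thread_count tasks concurrently.
    with mp.Pool(thread_count) as pool:
        results = pool.starmap(pull_data, tasks)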
I couldn't get MP (multiprocessing) to work correctly, so I did this instead and it seems to work great. But MP is probably the better way to tackle this problem.
# Just keeps track of the threads
threads = []
# Generates a thread for whatever thread_count = N is set to
for thread in range(thread_count):
    # get_page_list returned the pages stored in page_lists;
    # this ensures each thread gets a unique list.
    page_list = page_lists[thread]
    # actual function for each thread to work on
    t = threading.Thread(target=pull_data, args=(page_list, account))
    # puts all threads into a list
    threads.append(t)
    # starts the thread up
    t.start()
# After all threads are complete, back to the main thread. Technically this is not needed.
for t in threads:
    t.join()
I also didn't understand why you would "need" .join(); there's a great answer here:
what is the use of join() in python threading
The provided code has two threads calling the function increament() to increment the value of a global variable x. I have designed a Semaphore class for synchronization. Each thread is expected to add 1000000 to x, summing up to 2000000. But the actual output never reaches 2000000; it ends up between 1800000 and 1950000. Why don't all the increments take effect?
import threading as th

x = 0

class Semaphore:
    def __init__(self):
        self.__s = 1

    def wait(self):
        while self.__s == 0:
            pass
        self.__s -= 1

    def signal(self):
        self.__s += 1

def increament(s):
    global x
    s.wait()
    x += 1
    s.signal()
def task1(s):
    for _ in range(1000000):
        increament(s)

def task2(s):
    for _ in range(1000000):
        increament(s)

def main():
    s = Semaphore()
    t1 = th.Thread(target=task1, name="t1", args=(s,))
    t2 = th.Thread(target=task2, name="t1", args=(s,))
    t1.start()
    t2.start()
    # Checking synchronization
    for _ in range(10):
        print("Value of X: %d" % x)
    # waiting for termination of threads
    t2.join()
    t1.join()

if __name__ == "__main__":
    main()
    print("X = %d" % x)  # Final Output
Output:
Value of X: 5939
Value of X: 14150
Value of X: 25036
Value of X: 50490
Value of X: 54136
Value of X: 57674
Value of X: 69994
Value of X: 84912
Value of X: 94284
Value of X: 105895
X = 1801436
The threads are working fine and they're completing correctly. It's your shared variable that's the problem.
In general, using a global variable as a container for shared memory between two threads is a bad way to go about it.
Check out this answer to see why.
I made the following changes to your code: 'z' is now the shared variable, while 'x' and 'y' are data private to each thread.
x = 0
y = 0
z = 0

def increament1(s):
    global x, z
    s.wait()
    x += 1
    z += 1
    s.signal()

def increament2(s):
    global y, z
    s.wait()
    y += 1
    z += 1
    s.signal()

def task1(s):
    for somei in range(1000000):
        increament1(s)

def task2(s):
    for somej in range(1000000):
        increament2(s)
This is the output I got:
X = 1000000
Y = 1000000
Z = 1961404
As you can see, there's nothing wrong with the threads themselves; they complete their execution. But the shared data Z is a little wonky: Z will change randomly each time you run the script. Hence, as you can see, using global variables as shared memory is a bad idea.
A much better option is to use a synchronization tool that Python supports out of the box, such as the Queue class from the standard library. It's a multi-producer, multi-consumer message queue and helps with shared data like the data you're using now.
Let me show you how it can be done with Queue:
import threading as th
from queue import Queue  # "Queue" (capitalized) in Python 2

def task1(q):
    for somei in range(1000000):
        q.put(q.get() + 1)

def task2(q):
    for somei in range(1000000):
        q.put(q.get() + 1)

def main():
    queue = Queue()
    queue.put(0)
    t1 = th.Thread(target=task1, name="t1", args=(queue,))
    t2 = th.Thread(target=task2, name="t2", args=(queue,))
    t1.start()
    t2.start()
    # Checking synchronization
    t1.join()
    t2.join()
    return queue.get()

if __name__ == "__main__":
    print("Queue = %d" % main())  # Final Output
You don't even need to create a semaphore here as the Queue will automatically take care of synchronization.
The output of this final program is this:
Queue = 2000000
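For completeness (my addition, not from the answer above): the reason the hand-rolled Semaphore loses increments is that its wait() check and decrement are not atomic, so both threads can slip past the while loop at the same time. Swapping it for the standard threading.Lock gives the mutual exclusion the original code was after. A minimal sketch:

import threading as th

x = 0
lock = th.Lock()

def increament(lock):
    global x
    with lock:      # acquire/release is atomic, unlike the busy-wait Semaphore
        x += 1

def task(lock):
    for _ in range(1000000):
        increament(lock)

if __name__ == "__main__":
    t1 = th.Thread(target=task, args=(lock,))
    t2 = th.Thread(target=task, args=(lock,))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    print("X = %d" % x)  # reliably prints X = 2000000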
I am reading up on multithreading in Python and came up with this simple test:
(BTW, this implementation might be very bad; I just wrote it down quickly for testing purposes. But if there is something terribly wrong, I would be thankful if you could point it out.)
#!/usr/bin/python2.7

import threading
import timeit

lst = range(0, 100000)
lstres = []
lstlock = threading.Lock()
lstreslock = threading.Lock()

def add_five(x):
    return x + 5

def worker_thread(args):
    print "started"
    while len(lst) > 0:
        lstlock.acquire()
        try:
            x = lst.pop(0)
        except IndexError:
            lstlock.release()
            return
        lstlock.release()
        x = add_five(x)
        lstreslock.acquire()
        lstres.append(x)
        lstreslock.release()

def test():
    try:
        t1 = threading.Thread(target=worker_thread, args=(1,))
        #t2 = threading.Thread(target=worker_thread, args=(2,))
        #t3 = threading.Thread(target=worker_thread, args=(3,))
        #t4 = threading.Thread(target=worker_thread, args=(4,))
        t1.start()
        #t2.start()
        #t3.start()
        #t4.start()
        t1.join()
        #t2.join()
        #t3.join()
        #t4.join()
    except:
        print "Error"
    print len(lstres)

if __name__ == "__main__":
    t = timeit.Timer(test)
    print t.timeit(2)
Despite the terrible example, I see the following: one thread is faster than four.
With one thread I get 13.46 seconds, and with four threads, 25.47 seconds.
Is access to the list by four threads a bottleneck, thus causing the slower times, or did I do something wrong?
In your case, the Global Interpreter Lock isn't actually the problem.
Threading doesn't make things faster by default. In your case, the code is CPU-bound. No thread is ever waiting for I/O (which would allow another to use the CPU). If you have code which needs 100% of the CPU, then threading will only make it faster if a lot of the code is independent, which yours isn't: most of your code is holding locks, so no other thread can proceed.
Which brings us to the cause of the slowdown: switching threads and fighting over locks costs time. That's what eats the extra 12 seconds in your case.
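To illustrate the lock-contention point, here is a sketch of my own (Python 3, not from the answer above): if each worker takes a whole chunk of the input up front and touches the shared result list only once at the end, the threads spend almost no time fighting over locks, even though the GIL still prevents the CPU work itself from running in parallel.

import threading

lst = list(range(0, 100000))
lstres = []
lstreslock = threading.Lock()

def add_five(x):
    return x + 5

def worker_chunk(chunk):
    # Do all the CPU work on a private list...
    local = [add_five(x) for x in chunk]
    # ...and grab the shared lock exactly once to publish the results.
    with lstreslock:
        lstres.extend(local)

def test(num_threads=4):
    chunk_size = len(lst) // num_threads
    chunks = [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]
    threads = [threading.Thread(target=worker_chunk, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(len(lstres))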