I have multiple Python files in different folders that work together to make my program function. They consist of a main.py file that creates a new thread for each file and then starts it with the necessary parameters. This works great while the parameters are static, but if a variable changes in main.py, it doesn't get changed in the other files. I also can't import main.py into otherfile.py to get the new variable, since main.py is in the parent directory.
I have created an example below. What should happen is that the main.py file creates a new thread and calls otherfile.py with set params. After 5 seconds, the variable in main.py changes and so should the var in otherfile (so it starts printing the number 5 instead of 10), but I haven't found a solution to update them in otherfile.py
The folder structure is as follows:
|-main.py
|-other
  |-otherfile.py
Here is the code in both files:
main.py
from time import sleep
from threading import Thread

var = 10

def newthread():
    from other.otherfile import loop
    nt = Thread(target=loop(var))
    nt.daemon = True
    nt.start()

newthread()
sleep(5)
var = 5  # change the var, otherfile.py should start printing it now (doesn't)
otherfile.py
from time import sleep

def loop(var):
    while True:
        sleep(1)
        print(var)
In Python, there are two types of objects:
Immutable objects can’t be changed.
Mutable objects can be changed.
An int is immutable, so you must use a mutable object such as a list or dict instead.
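# Note: loop() in otherfile.py should now print var[0] rather than var.
from time import sleep
from threading import Thread

var = [10]

def newthread():
    from other.otherfile import loop
    nt = Thread(target=loop, args=(var,), daemon=True)
    nt.start()

newthread()
sleep(5)
var[0] = 5  # mutate the shared list in place
sleep(5)  # keep the main thread alive so the daemon thread can print the new value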
This happens because of how objects are passed into functions in Python. You'll hear that everything is passed by reference in Python, but since integers are immutable, when you change the value of var, you're actually binding the name to a new object, and your thread still holds a reference to the integer with a value of 10.
To get around this, I wrote a simple wrapper class for an integer:
class IntegerHolder():
    def __init__(self, n):
        self.value = n

    def set_value(self, n):
        self.value = n

    def get_value(self):
        return self.value
Then, instead of var = 10, I did i = IntegerHolder(10), and after the sleep(5) call, I simply did i.set_value(5), which updates the wrapper object. The thread still has the same reference to the IntegerHolder object i, and when i.get_value() is called in the thread, it will return 5, as required.
You can also do this with a Python list, since lists are objects — it's just that this implementation makes it clearer what's going on. You'd just do var = [10] and do var[0] = 5, which would work since your thread should still keep a reference to the same list object as the main thread.
Two more errors:
Instead of Thread(target=loop(var)), you need to do Thread(target=loop, args=(i,)). This is because target is supposed to be a callable object, which is basically a function. Writing loop(var) calls the function right there in the main thread, before the Thread constructor even runs (its return value would become target), and since loop never returns, the thread never actually gets created. You can verify this with your favorite Python debugger, or print statements.
Setting nt.daemon = True allows main.py to exit before the thread finishes. This means that as soon as i.set_value(5) is called, the main program reaches its end and exits, and Python stops daemon threads abruptly at shutdown, so the thread never gets a chance to print the new value. Deleting that line fixes things (nt.daemon is False by default), because the interpreter then waits for the thread before exiting. It's probably safer, though, to do an explicit nt.join() call in the main thread, which waits for the thread to finish execution.
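Putting the points above together, a minimal single-file sketch might look like the following. It is only an illustration: the `ticks` and `seen` parameters are hypothetical additions so that the loop terminates and its observations can be inspected, and the sleeps are shortened compared with the original question.

```python
from threading import Thread
from time import sleep


class IntegerHolder:
    """Mutable wrapper so both threads share one object."""

    def __init__(self, n):
        self.value = n


def loop(holder, ticks, seen):
    # Read the current value on every tick instead of capturing it once.
    for _ in range(ticks):
        sleep(0.1)
        seen.append(holder.value)


def run_demo():
    holder = IntegerHolder(10)
    seen = []
    # Pass the function itself plus its arguments; do not call it here.
    worker = Thread(target=loop, args=(holder, 6, seen))
    worker.start()
    sleep(0.35)        # let a few ticks observe the old value
    holder.value = 5   # mutate the shared object instead of rebinding a name
    worker.join()      # wait for the worker so the program does not exit early
    return seen


if __name__ == "__main__":
    print(run_demo())
```

The worker is deliberately not a daemon thread here, and `join()` makes the main thread wait, so the later ticks get a chance to observe the new value.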
And one warning, because programming wouldn't be complete without warnings:
Whenever different threads try to access a value, if AT LEAST ONE thread is modifying the value, this can cause a race condition. This means that all accesses at that point should be wrapped in a lock/mutex to prevent this. The Python (3.7.4) docs have more info about this.
Let me know if you have any more questions!
Related
Well met!
I'm trying to use pyautogui to run some simple checks, I'm attempting to make the main process detect a visual input, then start a sub process that continually updates a shared variable with the Y position of a different image as it moves through the screen until it disappears.
Unfortunately I'm barely a programmer, so I keep getting stuck on the execution and wanted to ask for help. This is the code I wrote:
import pyautogui
import time
import importlib
foobar = importlib.import_module("cv2")
foocat = importlib.import_module("concurrent")
import numpy
import concurrent.futures

with concurrent.futures.ProcessPoolExecutor() as executor:
    CheckingInput = executor.submit(CheckPositionInput)
    CheckingImage = executor.submit(CheckPositionImage)
print(XMark, YMark)
print(time.time() - startingtime)

def CheckPositionInput():
    Checked = False
    global XImage, YImage, XMark, YMark, AreaOnScreen
    while not Checked:
        print('Searching')
        if pyautogui.locateCenterOnScreen('Area.png', confidence=0.8) != None:
            Checked = True
            AreaOnScreen = True
            XMark, YMark = pyautogui.locateCenterOnScreen('Area.png', confidence=0.8)

def CheckPositionImage():
    global XImage, YImage, XMark, YMark, AreaOnScreen
    print('start')
    while not AreaOnScreen:
        print('Waiting')
    while AreaOnScreen:
        if pyautogui.locateCenterOnScreen('Image.png', confidence=0.6) != None:
            XMark, YMark = pyautogui.locateCenterOnScreen('Image.png', confidence=0.6)
            print(YMark)
            print('Checking')
The problems I've run into go from the while loop in CheckPositionImage closing and dying after a single loop, to the while loop in CheckPositionImage getting stuck and stopping the check position process, and that no matter what I try I can't manage to update the crucial Ymark variable properly outside the process.
It's important to understand that global variables are not read/write sharable across multiple processes. A child process can possibly inherit such a variable value (depends on what platform you are running) and read its value, but once a process assigns a new value to that variable, this change is not reflected back to any other process. This is because every process runs in its own address space and can only modify its copy of a global variable. You would need to use instead shared memory variables created by the main process and passed to its child processes. But let's ignore this fundamental problem for now and assume that global variables were sharable.
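As a rough illustration of the shared memory variables mentioned above, here is a minimal sketch using `multiprocessing.Value`. The names `worker`, `shared_y` and `run_demo` are hypothetical, and the value 42 simply stands in for a Y position found by the child process.

```python
import multiprocessing as mp


def worker(shared_y):
    # The write goes to shared memory, so the parent process sees it.
    with shared_y.get_lock():
        shared_y.value = 42


def run_demo():
    shared_y = mp.Value("i", 0)  # "i" is a C int, initial value 0
    child = mp.Process(target=worker, args=(shared_y,))
    child.start()
    child.join()
    return shared_y.value


if __name__ == "__main__":
    print(run_demo())
```

The shared object is created by the parent and passed to the child explicitly as an argument, rather than being reached through a global name.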
If I follow your code correctly, this is what you appear to be doing:
1. The main process submits two tasks to a multiprocessing pool to be processed by worker functions CheckPositionInput and CheckPositionImage and then waits for both tasks to complete to print out global variables XMark and YMark, presumably set by the CheckPositionImage function.
2. CheckPositionImage is effectively doing nothing until CheckPositionInput sets global variable AreaOnScreen to True, which only occurs after the call pyautogui.locateCenterOnScreen('Area.png', confidence=0.8) returns a value that is not None. When this occurs, Checked is set to True and your loop terminates, effectively terminating the task.
3. When variable AreaOnScreen is set to True (in step 2 above), function CheckPositionImage finally enters into a loop calling pyautogui.locateCenterOnScreen('Image.png', confidence=0.6). When this function returns a value that is not None, a couple of print statements are issued and the loop is re-iterated.
To the extent that my analysis is correct, I have a few comments:
This CheckPositionImage task never ends, since variable AreaOnScreen is never reset to False and no return or break statement is issued in the loop. I assume this is an oversight and that once we are returned a non-None value from our call to pyautogui.locateCenterOnScreen, we should return. My assumption is based on the fact that without this termination occurring, the main process's block beginning with concurrent.futures.ProcessPoolExecutor() as executor: will never complete (there is an implicit wait for all submitted tasks to complete) and you will therefore never fall through to the subsequent print statements.
You never initialize variable startingtime.
Function CheckPositionInput sets global variables XMark and YMark, whose values are never referenced by either the main process or function pyautogui.locateCenterOnScreen('Image.png', confidence=0.6). What is the point in calling this function a second time with identical arguments to set these variables that are never read?
You have processes running, but the actual processing is essentially sequential: the main process does nothing until both child processes it has created end, and one child process does nothing useful until the other child process sets a flag when it's terminating. I see, therefore, no reason for using multiprocessing at all. Your code could simply be (note that I have renamed variables and functions according to Python's PEP8 coding conventions):
import pyautogui
import time
# What is the purpose of these next 3 commented-out statements?
#import importlib
#foobar = importlib.import_module("cv2")
#foocat = importlib.import_module("concurrent")

def search_and_check():
    print('Searching...')
    while True:
        if pyautogui.locateCenterOnScreen('Area.png', confidence=0.8) is not None:
            # What is the purpose of this second call, which I have commented out?
            # Note that the values set, i.e. xMark and yMark, are never referenced.
            #xMark, yMark = pyautogui.locateCenterOnScreen('Area.png', confidence=0.8)
            break
    print('Checking...')
    while True:
        result = pyautogui.locateCenterOnScreen('Image.png', confidence=0.6)
        if result is not None:
            return result

starting_time = time.time()
xMark, yMark = search_and_check()
print(xMark, yMark)
print(time.time() - starting_time)
Could/should the two different calls to pyautogui.locateCenterOnScreen be done in parallel?
I'm trying to use a multiprocessing.Queue to communicate something when an object in another process which has this queue as an attribute is deleted. While doing so I noticed that Queue.put() with block=False (or equivalently Queue.put_nowait()) blocks anyway under certain circumstances and boiled it down to the following minimal reproducible example:
import multiprocessing as mp

class A:
    def __init__(self):
        self.q = mp.Queue()
        # uncommenting this makes it work fine:
        # self.q.put("test")

    def __del__(self):
        print("before putting CLOSE")
        self.q.put_nowait("CLOSE")
        print("after putting CLOSE")

a = A()
# uncommenting this also makes it work fine:
# del a
Running this code, the output gets as far as
before putting CLOSE
and then it freezes indefinitely.
I'm at a loss as to what's going on here and why this happens. Queue.put_nowait() seems to block if and only if it's called from an object's destructor (__del__()) and it hasn't had data put into it before and the destructor was called due to the object going out of scope rather than due to explicit del. The same exact thing happens if the queue is a global variable rather than an attribute of A, as long as put_nowait() is called within A's destructor.
Interestingly, aborting the script with Ctrl+C doesn't result in the usual exception output (KeyboardInterrupt etc.) but quits without any output.
Am I missing something obvious here? Why does this happen?
I have written a class in python 2.7 (under linux) that uses multiple processes to manipulate a database asynchronously. I encountered a very strange blocking behaviour when using multiprocessing.Queue.put() and multiprocessing.Queue.get() which I can't explain.
Here is a simplified version of what I do:
import time
from multiprocessing import Process, Queue

class MyDB(object):
    def __init__(self):
        self.inqueue = Queue()
        p1 = Process(target = self._worker_process, kwargs={"inqueue": self.inqueue})
        p1.daemon = True
        started = False
        while not started:
            try:
                p1.start()
                started = True
            except:
                time.sleep(1)
        #Sometimes I start a same second process but it makes no difference to my problem
        p2 = Process(target = self._worker_process, kwargs={"inqueue": self.inqueue})
        #blahblah... (same as above)

    #staticmethod
    def _worker_process(inqueue):
        while True:
            #--------------this blocks despite data having arrived------------
            op = inqueue.get(block = True)
            #do something with specified operation
            #---------------problem area end--------------------
            print "if this text gets printed, the problem was solved"

    def delete_parallel(self, key, rawkey = False):
        someid = ...blahblah
        #--------------this section blocked when I was posting the question but for unknown reasons it's fine now
        self.inqueue.put({"optype": "delete", "kwargs": {"key":key, "rawkey":rawkey}, "callid": someid}, block = True)
        #--------------problem area end----------------
        print "if you see this text, there was no blocking or block was released"
If I run the code above inside a test (in which I call delete_parallel on the MyDB object) then everything works, but if I run it in context of my entire application (importing other stuff, inclusive pygtk) strange things happen:
For some reason self.inqueue.get blocks and never releases despite self.inqueue having the data in its buffer. When I instead call self.inqueue.get(block = False, timeout = 1) then the call finishes by raising Queue.Empty, despite the queue containing data. qsize() returns 1 (suggests that data is there) while empty() returns True (suggests that there is no data).
Now clearly there must be something somewhere else in my application that renders self.inqueue unusable by causing acquisition of some internal semaphore. However, I don't know what to look for. Eclipse debugging becomes useless once a blocking semaphore is reached.
Edit 8 (cleaning up and summarizing my previous edits) Last time I had a similar problem, it turned out that pygtk was hijacking the global interpreter lock, but I solved it by calling gobject.threads_init() before I called anything else. Could this issue be related?
When I introduce a print "successful reception" after the get() method and execute my application in terminal, the same behaviour happens at first. When I then terminate by pressing CTRL+D I suddenly get the string "successful reception" inbetween messages. This looks to me like some other process/thread is terminated and releases the lock that blocks the process that is stuck at get().
Since the process that was stuck terminates later, I still see the message. What kind of process could externally mess with a Queue like that? self.inqueue is only accessed inside my class.
Right now it seems to come down to this queue which won't return anything despite the data being there:
the get() method seems to get stuck when it attempts to receive the actual data from some internal pipe. The last line before my debugger hangs is:
res = self._recv()
which is inside of multiprocessing.queues.get()
Tracking this internal python stuff further I find the assignments
self._recv = self._reader.recv and self._reader, self._writer = Pipe(duplex=False).
Edit 9
I'm currently trying to hunt down the import that causes it. My application is quite complex with hundreds of classes and each class importing a lot of other classes, so it's a pretty painful process. I have found a first candidate class which Uses 3 different MyDB instances when I track all its imports (but doesn't access MyDB.inqueue at any time as far as I can tell). The strange thing is, it's basically just a wrapper and the wrapped class works just fine when imported on its own. This also means that it uses MyDB without freezing. As soon as I import the wrapper (which imports that class), I have the blocking issue.
I started rewriting the wrapper by gradually reusing the old code. I'm testing each time I introduce a couple of new lines until I will hopefully see which line will cause the problem to return.
multiprocessing.Queue uses an internal feeder thread to move data into the underlying pipe. If you are using GTK, it can interfere with such threads unless threading support has been initialized, so you will need to call gobject.threads_init().
It should be noted that qsize() only returns an approximate size of the queue. The real size may be anywhere between 0 and the value returned by qsize().
How does the flow of apply_async work between calling the iterable (?) function and the callback function?
Setup: I am reading some lines of all the files inside a 2000-file directory, some with millions of lines, some with only a few. Some header/formatting/date data is extracted to characterize each file. This is done on a 16-CPU machine, so it made sense to multiprocess it.
Currently, the expected result is being sent to a list (ahlala) so I can print it out; later, this will be written to *.csv. This is a simplified version of my code, originally based off this extremely helpful post.
import multiprocessing as mp
import os

def dirwalker(directory):
    ahlala = []

    # X() reads files and grabs lines, calls helper function to calculate
    # info, and returns stuff to the callback function
    def X(f):
        fileinfo = Z(arr_of_lines)
        return fileinfo

    # Y() reads other types of files and does the same thing
    def Y(f):
        fileinfo = Z(arr_of_lines)
        return fileinfo

    # results() is the callback function
    def results(r):
        ahlala.extend(r) # or .append, haven't yet decided

    # helper function
    def Z(arr):
        return fileinfo # to X() or Y()!

    for _,_,files in os.walk(directory):
        pool = mp.Pool(mp.cpu_count())
        for f in files:
            if (filetype(f) == filetypeX):
                pool.apply_async(X, args=(f,), callback=results)
            elif (filetype(f) == filetypeY):
                pool.apply_async(Y, args=(f,), callback=results)
        pool.close(); pool.join()
    return ahlala
Note, the code works if I put all of Z(), the helper function, into either X(), Y(), or results(), but is this either repetitive or possibly slower than possible? I know that the callback function is called for every function call, but when is the callback function called? Is it after pool.apply_async()...finishes all the jobs for the processes? Shouldn't it be faster if these helper functions were called within the scope (?) of the first function pool.apply_async() takes (in this case, X())? If not, should I just put the helper function in results()?
Other related ideas: Are daemon processes why nothing shows up? I am also very confused about how to queue things, and whether this is the problem. This seems like a place to start learning it, but can queuing be safely ignored when using apply_async, or only at a noticeable cost in time?
You're asking about a whole bunch of different things here, so I'll try to cover it all as best I can:
The function you pass to callback will be executed in the main process (not the worker) as soon as the worker process returns its result. It is executed in a thread that the Pool object creates internally. That thread consumes objects from a result_queue, which is used to get the results from all the worker processes. After the thread pulls the result off the queue, it executes the callback. While your callback is executing, no other results can be pulled from the queue, so it's important that the callback finishes quickly. With your example, as soon as one of the calls to X or Y you make via apply_async completes, the result will be placed into the result_queue by the worker process, and then the result-handling thread will pull the result off of the result_queue, and your callback will be executed.
Second, I suspect the reason you're not seeing anything happen with your example code is because all of your worker function calls are failing. If a worker function fails, callback will never be executed. The failure won't be reported at all unless you try to fetch the result from the AsyncResult object returned by the call to apply_async. However, since you're not saving any of those objects, you'll never know the failures occurred. If I were you, I'd try using pool.apply while you're testing so that you see errors as soon as they occur.
The reason the workers are probably failing (at least in the example code you provided) is that X and Y are defined as functions inside another function. multiprocessing passes functions and objects to worker processes by pickling them in the main process and unpickling them in the worker processes. Functions defined inside other functions are not picklable, which means multiprocessing won't be able to send them to the worker processes. To fix this, define both functions at the top level of your module, rather than embedding them inside the dirwalker function.
You should definitely continue to call Z from X and Y, not in results. That way, Z can be run concurrently across all your worker processes, rather than having to be run one call at a time in your main process. And remember, your callback function is supposed to be as quick as possible, so you don't hold up processing results. Executing Z in there would slow things down.
Here's some simple example code that's similar to what you're doing, that hopefully gives you an idea of what your code should look like:
import multiprocessing as mp
import os

# X() reads files and grabs lines, calls helper function to calculate
# info, and returns stuff to the callback function
def X(f):
    fileinfo = Z(f)
    return fileinfo

# Y() reads other types of files and does the same thing
def Y(f):
    fileinfo = Z(f)
    return fileinfo

# helper function
def Z(arr):
    return arr + "zzz"

def dirwalker(directory):
    ahlala = []

    # results() is the callback function
    def results(r):
        ahlala.append(r)  # each result is a single string, so append it

    for _,_,files in os.walk(directory):
        pool = mp.Pool(mp.cpu_count())
        for f in files:
            if len(f) > 5: # Just an arbitrary thing to split up the list with
                # In Python 3, there's an error_callback you can use to handle
                # errors (error_callback=handle_error). It's not available in
                # Python 2.7 though :(
                pool.apply_async(X, args=(f,), callback=results)
            else:
                pool.apply_async(Y, args=(f,), callback=results)
        pool.close()
        pool.join()
    return ahlala

if __name__ == "__main__":
    print(dirwalker("/usr/bin"))
Output:
['ftpzzz', 'findhyphzzz', 'gcc-nm-4.8zzz', 'google-chromezzz' ... # lots more here ]
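To make the point about silent failures concrete, here is a small sketch of keeping the AsyncResult objects and calling get() on them, which re-raises any exception from the worker. The `parse` function and the file names are hypothetical stand-ins, and the empty name only simulates a file that cannot be processed.

```python
import multiprocessing as mp


def parse(name):
    # Hypothetical worker: fails on empty names to simulate a bad file.
    if not name:
        raise ValueError("empty file name")
    return name.upper()


def run_demo(names):
    results, errors = [], []
    with mp.Pool(2) as pool:
        # Keep every AsyncResult so failures can be inspected afterwards.
        pending = [pool.apply_async(parse, args=(n,)) for n in names]
        for item in pending:
            try:
                results.append(item.get(timeout=30))
            except ValueError as exc:
                errors.append(str(exc))
    return results, errors


if __name__ == "__main__":
    print(run_demo(["a.txt", "", "b.txt"]))
```

Without the get() calls, the failing task would simply produce no callback and no visible error.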
Edit:
You can create a dict object that's shared between your parent and child processes using the multiprocessing.Manager class:
pool = mp.Pool(mp.cpu_count())
m = mp.Manager()
helper_dict = m.dict()
for f in files:
    if len(f) > 5:
        pool.apply_async(X, args=(f, helper_dict), callback=results)
    else:
        pool.apply_async(Y, args=(f, helper_dict), callback=results)
Then make X and Y take a second argument called helper_dict (or whatever name you want), and you're all set.
The caveat is that this works by creating a server process that contains a normal dict, and all your other processes talk to that one dict via a Proxy object. So every time you read or write to the dict, you're doing IPC. This makes it a lot slower than a real dict.
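A self-contained version of the Manager approach might look like the sketch below. The worker `record_length` and the sample names are hypothetical; the worker just stores the length of each name so that there is something to read back in the parent.

```python
import multiprocessing as mp


def record_length(name, shared):
    # Each assignment is a round trip to the manager's server process.
    shared[name] = len(name)


def run_demo(names):
    with mp.Manager() as manager:
        shared = manager.dict()
        with mp.Pool(2) as pool:
            pending = [
                pool.apply_async(record_length, args=(n, shared)) for n in names
            ]
            for item in pending:
                item.get(timeout=30)  # re-raises any worker error
        return dict(shared)  # copy out before the manager shuts down


if __name__ == "__main__":
    print(run_demo(["ab", "cde"]))
```

Copying the proxy into a plain dict at the end keeps the data usable after the manager's server process has exited.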
I read in the Python documentation that Queue.Queue() is a safe way of passing variables between different threads. I didn't really know that there was a safety issue with multithreading. For my application, I need to develop multiple objects with variables that can be accessed from multiple different threads. Right now I just have the threads accessing the object variables directly. I won't show my code here because there's way too much of it, but here is an example to demonstrate what I'm doing.
from threading import Thread
import time
import random

class switch:
    def __init__(self, id):
        self.id = id
        self.is_on = False

    def toggle(self):
        self.is_on = not self.is_on

switches = []
for i in range(5):
    switches.append(switch(i))

def record_switch():
    switch_record = {}
    while True:
        time.sleep(10)
        current = {}
        current['time'] = time.ctime()
        for i in switches:
            current[i.id] = i.is_on
        switch_record.update(current)

def toggle_switch():
    while True:
        time.sleep(random.random()*100)
        for i in switches:
            i.toggle()

# pass the functions themselves; calling them here would never return
toggle = Thread(target=toggle_switch, args=())
record = Thread(target=record_switch, args=())
toggle.start()
record.start()
So as I understand, the queue object can be used only to put and get values, which clearly won't work for me. Is what I have here "safe"? If not, how can I program this so that I can safely access a variable from multiple different threads?
Whenever you have threads modifying a value other threads can see, then you are going to have safety issues. The worry is that a thread will try to modify a value when another thread is in the middle of modifying it, which has risky and undefined behavior. So no, your switch-toggling code is not safe.
The important thing to know is that changing the value of a variable is not guaranteed to be atomic. If an action is atomic, it means that action will always happen in one uninterrupted step. (This differs very slightly from the database definition.) Changing a variable value, especially a list value, can often times take multiple steps on the processor level. When you are working with threads, all of those steps are not guaranteed to happen all at once, before another thread starts working. It's entirely possible that thread A will be halfway through changing variable x when thread B suddenly takes over. Then if thread B tries to read variable x, it's not going to find a correct value. Even worse, if thread B tries to modify variable x while thread A is halfway through doing the same thing, bad things can happen. Whenever you have a variable whose value can change somehow, all accesses to it need to be made thread-safe.
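One way to see the "multiple steps" for yourself is to look at the bytecode. The sketch below is specific to CPython, the exact instruction names differ between versions, and the helper `opnames` is a hypothetical name; it only shows that a single `counter += 1` statement compiles to separate load, add and store instructions, between which another thread may be scheduled.

```python
import dis

counter = 0


def increment():
    global counter
    counter += 1


def opnames(func):
    # Names of the bytecode instructions the function compiles to.
    return [ins.opname for ins in dis.get_instructions(func)]


if __name__ == "__main__":
    print(opnames(increment))
```

If two threads both load the old value before either stores the new one, one of the increments is lost, which is exactly the kind of problem a lock prevents.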
If you're modifying variables instead of passing messages, you should be using a Lock object.
In your case, you'd have a global Lock object at the top:
from threading import Lock
switch_lock = Lock()
Then you would surround the critical piece of code with the acquire and release functions.
for i in switches:
    switch_lock.acquire()
    current[i.id] = i.is_on
    switch_lock.release()

for i in switches:
    switch_lock.acquire()
    i.toggle()
    switch_lock.release()
Only one thread may ever acquire a lock at a time (this kind of lock, anyway). When any of the other threads try, they'll be blocked and wait for the lock to become free again. So by putting locks around critical sections of code, you make it impossible for more than one thread to look at, or modify, a given switch at any time. You can put this around any bit of code you want to be kept exclusive to one thread at a time.
EDIT: as martineau pointed out, locks are integrated well with the with statement, if you're using a version of Python that has it. This has the added benefit of automatically unlocking if an exception happens. So instead of the above acquire and release system, you can just do this:
for i in switches:
    with switch_lock:
        i.toggle()
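For completeness, here is a runnable sketch of the switch example with a lock. It makes a few assumptions that are not in the original: the loops are bounded by a hypothetical `rounds` parameter so the program ends, the sleeps are very short, and the lock is held for a whole pass over the switches rather than per switch, so that a recorded snapshot never mixes toggled and untoggled switches.

```python
import threading
import time


class Switch:
    def __init__(self, switch_id):
        self.id = switch_id
        self.is_on = False

    def toggle(self):
        self.is_on = not self.is_on


switches = [Switch(i) for i in range(5)]
switch_lock = threading.Lock()


def toggle_all(rounds):
    for _ in range(rounds):
        with switch_lock:  # hold the lock for the whole pass
            for s in switches:
                s.toggle()
        time.sleep(0.001)


def record(rounds, snapshots):
    for _ in range(rounds):
        with switch_lock:  # never observe a half-toggled set of switches
            snapshots.append([s.is_on for s in switches])
        time.sleep(0.001)


def run_demo(rounds=50):
    snapshots = []
    workers = [
        threading.Thread(target=toggle_all, args=(rounds,)),
        threading.Thread(target=record, args=(rounds, snapshots)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return snapshots


if __name__ == "__main__":
    print(len(run_demo()))
```

Locking per switch, as in the snippets above, protects each individual switch; locking per pass additionally keeps each snapshot consistent across all switches, at the cost of holding the lock slightly longer.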