I have a problem with values being passed from thread to thread. I want every thread, once started, to have its own thread-only copy of a variable. For example, if I create a requests.session, I don't want Thread 1 and Thread 2 to share the same session.
import requests
import threading
def functionName():
    s = requests.session()
    r = s.get("")  # get a random site
    # do some things

if __name__ == "__main__":
    t = threading.Thread(target=functionName)
    tt = threading.Thread(target=functionName)
    t.start()
    tt.start()
If I replace #do some things with real actions and save all the results to a file, it looks like the two threads merged and worked in a single session, even though I want the two sessions to be separate, one per thread.
From your description of the problem, and the fact that r and s are already local to each thread (as @Solomon Slow pointed out in a comment), I suspect the problem is with how you're collecting the results from each thread.
Since you haven't provided an MCVE, I made up something to show one way it can be done. In it, the results of each thread are stored in a shared global dictionary named merged. As you can see from the output, the two threads did not interfere with one another.
from ast import literal_eval
import requests
import threading
from random import randint
def functionName(thread_name, shared, lock):
    s = requests.Session()
    sessioncookie = str(randint(100000000, 123456789))
    s.get('https://httpbin.org/cookies/set/sessioncookie/' + sessioncookie)
    r = s.get('https://httpbin.org/cookies')
    r_as_dict = literal_eval(r.text)
    print('r_as_dict:', r_as_dict)
    # Store this thread's result in the shared dictionary.
    with lock:
        shared[thread_name] = r_as_dict['cookies']['sessioncookie']
if __name__ == '__main__':
    merged = {}
    mlock = threading.Lock()  # Controls concurrent access to the "merged" dict.
    t = threading.Thread(target=functionName, args=('thread1', merged, mlock))
    tt = threading.Thread(target=functionName, args=('thread2', merged, mlock))
    t.start()
    tt.start()
    t.join()
    tt.join()
    print(merged)
Sample output:
r_as_dict: {'cookies': {'sessioncookie': '111147840'}}
r_as_dict: {'cookies': {'sessioncookie': '119511820'}}
{'thread1': '111147840', 'thread2': '119511820'}
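As an aside: if what you actually want is per-thread values without passing them through args, the standard library's threading.local() provides exactly that. Here is a minimal sketch (the URL is just an example):

import threading
import requests

thread_data = threading.local()  # each thread sees its own attribute set

def make_request(url):
    # Lazily create one Session per thread; other threads never see it.
    if not hasattr(thread_data, 'session'):
        thread_data.session = requests.Session()
    return thread_data.session.get(url)

threads = [threading.Thread(target=make_request,
                            args=('https://httpbin.org/cookies',))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()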
I'm trying to store the results of multiple API requests using a multiprocessing queue, as the API can't handle more than 5 connections at once.
I found part of a solution in How to use multiprocessing with requests module?
import multiprocessing
import queue

def worker(input_queue, stop_event):
    while not stop_event.is_set():
        try:
            # Check if any request has arrived in the input queue. If not,
            # loop back and try again.
            request = input_queue.get(True, 1)
        except queue.Empty:
            continue
        print('Started working on:', request)
        api_request_function(request)  # make the request using a function I wrote
        print('Stopped working on:', request)
        # Mark the item done only after processing it, so that
        # input_queue.join() really waits for the work to finish.
        input_queue.task_done()
def master(api_requests):
    input_queue = multiprocessing.JoinableQueue()
    stop_event = multiprocessing.Event()
    workers = []
    # Create workers.
    for i in range(3):
        p = multiprocessing.Process(target=worker,
                                    args=(input_queue, stop_event))
        workers.append(p)
        p.start()
    # Distribute work.
    for request in api_requests:
        input_queue.put(request)
    # Wait for the queue to be consumed.
    input_queue.join()
    # Ask the workers to quit.
    stop_event.set()
    # Wait for workers to quit.
    for w in workers:
        w.join()
    print('Done')
I've looked at the documentation for threading and pooling, but I'm missing a step. The above runs, and all requests get a 200 status code, which is great. But how do I store the results of the requests for later use?
I believe you have to make a queue. The code can be a little tricky; you need to read up on the multiprocessing module. In general, with multiprocessing, all the variables are copied for each worker, so you can't do something like appending to a global variable: it would literally be copied into each worker, and the original would be untouched. There are a few functions that already automatically incorporate workers, queues, and return values. Personally, I try to write my functions to work with mp.map, like below:
import multiprocessing

def worker(x):
    # do stuff with x here; the return value is collected by map()
    return 'thing'

if __name__ == '__main__':
    output = multiprocessing.Pool().map(worker, [1, 2, 3, 4, 5])
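And if you do want the queue-based version, here is a minimal sketch that passes each worker a second, output queue and drains it in the parent. It uses a None sentinel for shutdown instead of your Event, and api_request_function is stubbed out; the request payloads are hypothetical:

import multiprocessing

def api_request_function(request):
    # Stand-in for the real API call from the question.
    return 'response for ' + request

def worker(input_queue, output_queue):
    # Take requests until the parent sends the None sentinel.
    for request in iter(input_queue.get, None):
        result = api_request_function(request)
        output_queue.put((request, result))

if __name__ == '__main__':
    input_queue = multiprocessing.Queue()
    output_queue = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker,
                                     args=(input_queue, output_queue))
             for _ in range(3)]
    for p in procs:
        p.start()
    api_requests = ['req1', 'req2', 'req3']  # hypothetical payloads
    for request in api_requests:
        input_queue.put(request)
    for _ in procs:
        input_queue.put(None)  # one sentinel per worker
    results = dict(output_queue.get() for _ in api_requests)
    for p in procs:
        p.join()
    print(results)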
Currently, I am trying to spawn a process in a Python program which in turn creates threads that continuously update variables in the process's address space. So far I have come up with the code below, which runs, but the updates to the variable do not seem to propagate to the process level. I would have expected that defining a variable in the process's address space and using global in the thread (which shares the process's address space) would allow the thread to manipulate the variable and make the changes visible to the process.
Below is a minimal example of the problem:
import multiprocessing
import threading
import time
import random
def process1():
    lst = {}
    url = "url"
    thrd = threading.Thread(target=urlCaller, args=(url,))
    print("process alive")
    thrd.start()
    while True:
        # the process does some CPU-intense calculation
        print(lst)
        time.sleep(2)

def urlCaller(url):
    global lst
    while True:
        # the thread continuously pulls data from an API;
        # this is I/O-heavy and therefore done by a thread
        lst = {random.randint(1, 9), random.randint(20, 30)}
        print(lst)
        time.sleep(2)

if __name__ == "__main__":
    # The guard is needed for multiprocessing's spawn start method (Windows).
    prcss = multiprocessing.Process(target=process1)
    prcss.start()
The process always prints an empty dict, while the thread, as expected, prints a set with two integers. I would expect the process to print the set with two integers as well.
(Note: I am using Spyder as my IDE, and somehow output is only printed to the console when I run this code on Linux/Ubuntu; nothing is printed when I run the exact same code in Spyder on Windows.)
I am aware that the use of global variables is not always a good solution but I think it serves the purpose well in this case.
You might wonder why I want to create a thread within a process. Basically, I need to run the same complex calculation on different data sets that constantly change. Hence, I need multiple processes (one for each data set) to optimize the utilization of my CPU and use threads within the processes to make the I/O process most efficient. The data depreciates very fast, therefore, I cannot just store it in a database or file, which would of course simplify the communication process between data producer (thread) and data consumer (process).
You are defining a local variable lst inside the function process1, so whatever urlCaller does is irrelevant: it cannot change the local variable of a different function. urlCaller is assigning a global variable, but process1 can never see it, because it is shadowed by the local variable you defined.
You need to remove lst = {} from that function and find another way to return a value, or declare the variable global there too:
def process1():
    global lst
    lst = {}
    url = "url"
    thrd = threading.Thread(target=urlCaller, args=(url,))
    print("process alive")
    thrd.start()
    while True:
        # the process does some CPU-intense calculation
        print(lst)
        time.sleep(2)
I'd use something like concurrent.futures instead of the threading module directly.
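For instance, a minimal sketch of the question's I/O loop with a ThreadPoolExecutor (url_caller here is a stand-in for the real API pull):

import concurrent.futures
import random
import time

def url_caller(url):
    # Placeholder for the real I/O-bound API pull.
    time.sleep(2)
    return {random.randint(1, 9), random.randint(20, 30)}

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(url_caller, "url") for _ in range(2)]
    for future in concurrent.futures.as_completed(futures):
        print(future.result())  # each thread's result, no globals needed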
Thanks to the previous answer, I figured out that it's best to implement a process class and define the thread functions as methods on that class. The threads can then access and manipulate a shared instance variable, without needing thread.join() or terminating a thread.
Below is a minimal example in which two concurrent threads provide data to a parent process.
import multiprocessing
import threading
import time
import random
class process1(multiprocessing.Process):
    lst = {}
    url = "url"

    def __init__(self, url):
        super(process1, self).__init__()
        self.url = url

    def urlCallerInt(self, url):
        while True:
            self.lst = {random.randint(1, 9), random.randint(20, 30)}
            time.sleep(2)

    def urlCallerABC(self, url):
        while True:
            self.lst = {"Ab", "cD"}
            time.sleep(5)

    def run(self):
        t1 = threading.Thread(target=self.urlCallerInt, args=(self.url,))
        t2 = threading.Thread(target=self.urlCallerABC, args=(self.url,))
        t1.start()
        t2.start()
        while True:
            print(self.lst)
            time.sleep(1)

if __name__ == "__main__":
    p1 = process1("url")
    p1.start()
I want to read in a stream of float numbers, do a simple calculation, and append the value to a global list. Can you tell me where I am going wrong? The list is not being appended to.
from random import random
from time import sleep
def process(x):
from random import random
sleep(random()*2)
t = x * 2
processed_queue.append(t)
print(processed_queue)
return t
if __name__ == "__main__":
from distributed import Client
from queue import Queue
client = Client()
processed_queue = []
input_q = Queue()
remote_q = client.scatter(input_q)
processed_q = client.map(process, remote_q)
result_q = client.gather(processed_q)
for i in [random() for x in range(100)]:
sleep(random())
input_q.put(i)
print(i)
print(processed_queue)
print(result_q.qsize())
While queue.Queue and multiprocessing.Queue can be used to send data between threads and processes, this kind of programming by side effect is generally not the model encouraged by dask.
You can pass data to functions executed by the cluster and get their return values in real time using client.submit, so what are the queues doing for you that you cannot do otherwise? In addition, there are some dask constructs, such as shared variables, that could perhaps do this, but (again) that is rarely used and, I think, unlikely to be the right paradigm for you.
As for the specific reason the code is not working for you: Client() creates at least one separate process for the scheduler and one for a worker with one or more threads (check your task manager, top, or another system-monitoring tool). The queue.Queue is process-local, so each process sees its own empty queue: items added in the main process are never seen by the workers, and vice versa.
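For illustration, a minimal sketch of the submit/gather style (the process function mirrors the one in your code, minus the shared list):

from random import random
from time import sleep

from distributed import Client

def process(x):
    # The work runs on a dask worker; just return the result.
    sleep(random() * 2)
    return x * 2

if __name__ == "__main__":
    client = Client()
    data = [random() for _ in range(100)]
    futures = [client.submit(process, x) for x in data]
    results = client.gather(futures)  # return values, in submission order
    print(results[:5])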
I have a system designed to take data via a socket and store it in a dictionary that serves as a database. All my other modules (GUI, analysis, write_to_log_file, etc.) then access the database and do what they need to do with the dictionary, e.g. make widgets or copy the dictionary to a log file. But since all these things happen at different rates, I chose to put each module on its own thread so I can control the frequency.
In the main run function there's something like this:
from threading import Thread
import data_collector
import write_to_log_file

def main():
    db = {}
    receive_data_thread = Thread(target=data_collector.main, args=(db,))
    receive_data_thread.start()  # writes to the dictionary at 50 Hz

    log_data_thread = Thread(target=write_to_log_file.main, args=(db,))
    log_data_thread.start()  # reads the dictionary at 1 Hz
But it seems that the two modules aren't working on the same dictionary instance, because log_data_thread just prints out an empty dictionary even while data_collector shows the data it has inserted into the dictionary.
There is only one writer to the dictionary, so I don't have to worry about threads stepping on each other's toes; I just need to figure out a way for all the modules to read the current database as it's being written.
Rather than using a built-in dict, you could look at using a Manager object from the multiprocessing library:
from multiprocessing import Manager
from threading import Thread

def do_this(d):
    d["this"] = "done"

def do_that(d):
    d["that"] = "done"

if __name__ == "__main__":
    # Manager() starts a server process, so keep setup under the main guard.
    manager = Manager()
    d = manager.dict()

    thread0 = Thread(target=do_this, args=(d,))
    thread1 = Thread(target=do_that, args=(d,))
    thread0.start()
    thread1.start()
    thread0.join()
    thread1.join()
    print(d)
This gives you a standard-library thread-safe synchronised dictionary which should be easy to swap in to your current implementation without changing the design.
Use a queue.Queue to pass values from the reader threads to a single writer thread. Pass the Queue instance to each data_collector.main function; they can all call the Queue's put method.
Meanwhile, write_to_log_file.main should be passed the same Queue instance, and it can call the Queue's get method.
As items are pulled off the Queue, they can be added to the dict.
See also: Alex Martelli, on why Queue.Queue is the secret sauce of CPython multithreading.
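For illustration, a minimal sketch of that pattern in Python 3 (the collector and logger bodies are stand-ins for data_collector.main and write_to_log_file.main):

import queue
import threading
import time

def data_collector(q):
    # Stand-in for data_collector.main: push (key, value) updates.
    for i in range(100):
        q.put(('reading-%d' % i, i))
        time.sleep(0.02)

def write_to_log_file(q, db):
    # Single writer: drain the queue into the dict.
    while True:
        key, value = q.get()
        db[key] = value
        q.task_done()

if __name__ == '__main__':
    db = {}
    q = queue.Queue()
    threading.Thread(target=write_to_log_file, args=(q, db), daemon=True).start()
    collector = threading.Thread(target=data_collector, args=(q,))
    collector.start()
    collector.join()
    q.join()        # wait until every queued update has landed in db
    print(len(db))  # 100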
This should not be a problem; I assume you are using the threading module. I would have to know more about what data_collector and write_to_log_file are doing to figure out why they are not working.
You could technically even have more than one thread writing and it would not be a problem, because the GIL would take care of all the locking needed. Granted, you will never get more than one CPU's worth of work out of it.
Here is a simple example:
import threading, time

def addItem(d):
    c = 0
    while True:
        d[c] = "test-%d" % (c,)
        c += 1
        time.sleep(1)

def checkItems(d):
    clen = len(d)
    while True:
        if clen < len(d):
            print("dict changed", d)
            clen = len(d)
        time.sleep(.5)

DICT = {}
t1 = threading.Thread(target=addItem, args=(DICT,))
t1.daemon = True
t2 = threading.Thread(target=checkItems, args=(DICT,))
t2.daemon = True
t1.start()
t2.start()

while True:
    time.sleep(1000)
Sorry, I figured out my problem, and I feel dumb. The modules were working on the same dictionary, but my logger wasn't wrapped in a while True, so it executed once, terminated its thread, and thus my dictionary was only logged to disk once. So I made write_to_log_file.main(db) write constantly at 1 Hz and set log_data_thread.daemon = True, so that once the writer thread (which won't be a daemon thread) exits, the program will quit. Thanks for all the input about best practices for this type of system.
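For the record, a minimal sketch of what that fix can look like (the file name and JSON serialization are just illustrative):

import json
import time

def main(db):
    # write_to_log_file.main: loop forever at 1 Hz instead of running once.
    while True:
        snapshot = dict(db)  # copy so the collector can keep mutating db
        with open('db_log.json', 'w') as f:
            json.dump(snapshot, f)
        time.sleep(1)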
Currently, I have a list of URLs to grab content from, and I'm doing it serially. I would like to change this to grabbing them in parallel. This is pseudocode. My question is: is the design sound? I understand that .start() starts the thread; however, my database is not updated. Do I need to use q.get()? Thanks.
import threading
import queue

q = queue.Queue()

def do_database(url):
    """ grab url then input to database """
    webdata = grab_url(url)
    try:
        insert_data_into_database(webdata)
    except:
        ....
    else:
        < do I need to do anything with the queue after each db operation is done? >

def put_queue(q, url):
    q.put(do_database(url))

for myfiles in currentdir:
    url = myfiles + some_other_string
    t = threading.Thread(target=put_queue, args=(q, url))
    t.daemon = True
    t.start()
It's odd that you're putting stuff into q but never taking anything out of q. What is the purpose of q? In addition, since do_database() doesn't return anything, sure looks like the only thing q.put(do_database(url)) does is put None into q.
The usual way these things work, a description of work to do is added to a queue, and then a fixed number of threads take turns pulling things off the queue. You probably don't want to create an unbounded number of threads ;-)
Here's a pretty complete - but untested - sketch:
import threading
import queue

NUM_THREADS = 5  # whatever

q = queue.Queue()
END_OF_DATA = object()  # a unique object

class Worker(threading.Thread):
    def run(self):
        while True:
            url = q.get()
            if url is END_OF_DATA:
                break
            webdata = grab_url(url)
            try:
                # Does your database support concurrent updates
                # from multiple threads? If not, put this in a
                # "with some_global_mutex:" block.
                insert_data_into_database(webdata)
            except:
                pass  # ....

threads = [Worker() for _ in range(NUM_THREADS)]
for t in threads:
    t.start()

for myfiles in currentdir:
    url = myfiles + some_other_string
    q.put(url)

# Give each thread an END_OF_DATA marker.
for _ in range(NUM_THREADS):
    q.put(END_OF_DATA)

# Shut down cleanly. `daemon` is way overused.
for t in threads:
    t.join()
You should do this with asynchronous programming rather than threads. Threading in Python is problematic (see: Global Interpreter Lock), and in any case you're not trying to achieve multicore performance here; you just need a way to multiplex potentially long-running I/O. For that you can use a single thread and an event-driven library such as Twisted.
Twisted comes with HTTP functionality, so you can issue many concurrent requests and react (by populating your database) as the results come in. Be aware that this model of programming may take a little getting used to, but it will give you good performance if the number of requests you're making is not astronomical (i.e., if you can get it all done on one machine, which seems to be your intention).
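For illustration, a rough sketch using getPage, the simple HTTP client from that era of Twisted (newer code would use Agent or treq; the store callback is a stand-in for your database insert):

from twisted.internet import defer, reactor
from twisted.web.client import getPage

def store(webdata, url):
    # Placeholder for insert_data_into_database(webdata).
    print('fetched %s: %d bytes' % (url, len(webdata)))

def fetch_all(urls):
    deferreds = []
    for url in urls:
        d = getPage(url)           # returns a Deferred immediately
        d.addCallback(store, url)  # runs when the response arrives
        deferreds.append(d)
    # Stop the reactor once every request has succeeded or failed.
    defer.DeferredList(deferreds, consumeErrors=True) \
        .addCallback(lambda _: reactor.stop())

fetch_all(['http://example.com/a', 'http://example.com/b'])
reactor.run()  # one thread multiplexing all the requests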
For the DB, you have to commit before your changes become effective. But committing on every insert is not optimal; committing after bulk changes gives much better performance.
As for parallelism, Python isn't born for it. For your use case, I suppose using Python with gevent would be a painless solution.
Here is a much more efficient pseudo implementation FYI:
import gevent
from gevent.monkey import patch_all
patch_all()  # to use with urllib, etc.
from gevent.queue import Queue

def web_worker(q, url):
    grab_something
    q.put(result)  # gevent queues use put(), not push()

def db_worker(q):
    buf = []
    while True:
        buf.append(q.get())
        if len(buf) > 20:
            insert_stuff_in_buf_to_db
            db_commit
            buf = []

def run(urls):
    q = Queue()
    gevent.spawn(db_worker, q)
    workers = [gevent.spawn(web_worker, q, url) for url in urls]
    gevent.joinall(workers)  # don't exit before the fetches finish

run(urls)
Plus, since this implementation is entirely single-threaded, you can safely manipulate shared data between workers, such as the queue, the DB connection, and global variables.