Update a Global list in Dask

Update a Global list in Dask - python

I want to read in a stream of float numbers, do some simple calculation and append the value into a global list. Can you tell where I get it wrong? The list is not appending.
from random import random
from time import sleep
def process(x):
from random import random
sleep(random()*2)
t = x * 2
processed_queue.append(t)
print(processed_queue)
return t
if __name__ == "__main__":
from distributed import Client
from queue import Queue
client = Client()
processed_queue = []
input_q = Queue()
remote_q = client.scatter(input_q)
processed_q = client.map(process, remote_q)
result_q = client.gather(processed_q)
for i in [random() for x in range(100)]:
sleep(random())
input_q.put(i)
print(i)
print(processed_queue)
print(result_q.qsize())

Whilest queue.Queue and multiprocessing.Queue can be used to send data between threads and processes, generally this kind of programming-by-side-effect is not the model encouraged by dask.
You are able to pass data to functions executed by the cluster and get their return values in real time using client.submit, what are the queues doing for you that you cannot do otherwise? In addition, there are some dask constructs such as shared variables that maybe could do this, but (again) that is rarely used and I think unlikely the right paradigm for you.
For the specific reason that the code is not working for you: Client() creates at least one separate process for the scheduler and one for a worker with one or more threads (see your task-manager, top, or other system-watching tool). The queue.Queue is process-local, so each process will see the empty queue and add to it, but that information is not seen in the main process, and actions on the input queue are not seen in the workers.

Related

Separate variables for each thread

I have a problem with values being passed from threads to threads. I just want that every thread once started has the variable with the thread-only values. For example, if I have a requests.session, I don't want that the session for Thread 1 and 2 are the same.
import requests
import threading
def functionName():
s=requests.session()
r=s.get("") #get a random site
#do some things
if __name__== "__main__":
t=threading.Thread(target=functionName)
tt=threading.Thread(target=functionName)
t.start()
tt.start()
If I add other actions instead of #do some things and save the whole results in a file, it looks like the two threads got merged and worked in an unique session, even if I want the 2 sessions being separate for each Thread.

From your description of the problem and the fact that r and s are already local to each thread (as #Solomon Slow pointed-out in a comment), I suspect the problem is with how you're obtaining the results from each thread.
Since you haven't provided a MCVE, I made up something to show one way that can be done. In it, the results of each thread are stored in a shared global dictionary named merged. As you can see from the output, the two threads did not interfere with one another.
from ast import literal_eval
import requests
import threading
from random import randint
def functionName(thread_name, shared, lock):
s = requests.Session()
sessioncookie = str(randint(100000000, 123456789))
s.get('https://httpbin.org/cookies/set/sessioncookie/' + sessioncookie)
r = s.get('https://httpbin.org/cookies')
r_as_dict = literal_eval(r.text)
print('r_as_dict:', r_as_dict)
# Store result in shared dictionary.
with lock:
shared[thread_name] = r_as_dict['cookies']['sessioncookie']
if __name__ == '__main__':
merged = {}
mlock = threading.Lock() # Control concurrent access to "merged" dict.
t=threading.Thread(target=functionName, args=('thread1', merged, mlock))
tt=threading.Thread(target=functionName, args=('thread2', merged, mlock))
t.start()
tt.start()
t.join()
tt.join()
print(merged)
Sample output:
r_as_dict: {'cookies': {'sessioncookie': '111147840'}}
r_as_dict: {'cookies': {'sessioncookie': '119511820'}}
{'thread1': '111147840', 'thread2': '119511820'}

What is the easiest way to make maximum cpu usage for nested for-loops?

I have code that makes unique combinations of elements. There are 6 types, and there are about 100 of each. So there are 100^6 combinations. Each combination has to be calculated, checked for relevance and then either be discarded or saved.
The relevant bit of the code looks like this:
def modconffactory():
for transmitter in totaltransmitterdict.values():
for reciever in totalrecieverdict.values():
for processor in totalprocessordict.values():
for holoarray in totalholoarraydict.values():
for databus in totaldatabusdict.values():
for multiplexer in totalmultiplexerdict.values():
newconfiguration = [transmitter, reciever, processor, holoarray, databus, multiplexer]
data_I_need = dosomethingwith(newconfiguration)
saveforlateruse_if_useful(data_I_need)
Now this takes a long time and that is fine, but now I realize this process (making the configurations and then calculations for later use) is only using 1 of my 8 processor cores at a time.
I've been reading up about multithreading and multiprocessing, but I only see examples of different processes, not how to multithread one process. In my code I call two functions: 'dosomethingwith()' and 'saveforlateruse_if_useful()'. I could make those into separate processes and have those run concurrently to the for-loops, right?
But what about the for-loops themselves? Can I speed up that one process? Because that is where the time consumption is. (<-- This is my main question)
Is there a cheat? for instance compiling to C and then the os multithreads automatically?

I only see examples of different processes, not how to multithread one process
There is multithreading in Python, but it is very ineffective because of GIL (Global Interpreter Lock). So if you want to use all of your processor cores, if you want concurrency, you have no other choice than use multiple processes, which can be done with multiprocessing module (well, you also could use another language without such problems)
Approximate example of multiprocessing usage for your case:
import multiprocessing
WORKERS_NUMBER = 8
def modconffactoryProcess(generator, step, offset, conn):
"""
Function to be invoked by every worker process.
generator: iterable object, the very top one of all you are iterating over,
in your case, totalrecieverdict.values()
We are passing a whole iterable object to every worker, they all will iterate
over it. To ensure they will not waste time by doing the same things
concurrently, we will assume this: each worker will process only each stepTH
item, starting with offsetTH one. step must be equal to the WORKERS_NUMBER,
and offset must be a unique number for each worker, varying from 0 to
WORKERS_NUMBER - 1
conn: a multiprocessing.Connection object, allowing the worker to communicate
with the main process
"""
for i, transmitter in enumerate(generator):
if i % step == offset:
for reciever in totalrecieverdict.values():
for processor in totalprocessordict.values():
for holoarray in totalholoarraydict.values():
for databus in totaldatabusdict.values():
for multiplexer in totalmultiplexerdict.values():
newconfiguration = [transmitter, reciever, processor, holoarray, databus, multiplexer]
data_I_need = dosomethingwith(newconfiguration)
saveforlateruse_if_useful(data_I_need)
conn.send('done')
def modconffactory():
"""
Function to launch all the worker processes and wait until they all complete
their tasks
"""
processes = []
generator = totaltransmitterdict.values()
for i in range(WORKERS_NUMBER):
conn, childConn = multiprocessing.Pipe()
process = multiprocessing.Process(target=modconffactoryProcess, args=(generator, WORKERS_NUMBER, i, childConn))
process.start()
processes.append((process, conn))
# Here we have created, started and saved to a list all the worker processes
working = True
finishedProcessesNumber = 0
try:
while working:
for process, conn in processes:
if conn.poll(): # Check if any messages have arrived from a worker
message = conn.recv()
if message == 'done':
finishedProcessesNumber += 1
if finishedProcessesNumber == WORKERS_NUMBER:
working = False
except KeyboardInterrupt:
print('Aborted')
You can adjust WORKERS_NUMBER to your needs.
Same with multiprocessing.Pool:
import multiprocessing
WORKERS_NUMBER = 8
def modconffactoryProcess(transmitter):
for reciever in totalrecieverdict.values():
for processor in totalprocessordict.values():
for holoarray in totalholoarraydict.values():
for databus in totaldatabusdict.values():
for multiplexer in totalmultiplexerdict.values():
newconfiguration = [transmitter, reciever, processor, holoarray, databus, multiplexer]
data_I_need = dosomethingwith(newconfiguration)
saveforlateruse_if_useful(data_I_need)
def modconffactory():
pool = multiprocessing.Pool(WORKERS_NUMBER)
pool.map(modconffactoryProcess, totaltransmitterdict.values())
You probably would like to use .map_async instead of .map
Both snippets do the same, but I would say in the first one you have more control over the program.
I suppose the second one is the easiest, though :)
But the first one should give you the idea of what is happening in the second one
multiprocessing docs: https://docs.python.org/3/library/multiprocessing.html

you can run your function in this way:
from multiprocessing import Pool
def f(x):
return x*x
if __name__ == '__main__':
p = Pool(5)
print(p.map(f, [1, 2, 3]))
https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers

parallel writing to list in python

I got multiple parallel processes writing into one list in python. My code is:
global_list = []
class MyThread(threading.Thread):
...
def run(self):
results = self.calculate_results()
global_list.extend(results)
def total_results():
for param in params:
t = MyThread(param)
t.start()
while threading.active_count() > 1:
pass
return total_results
I don't like this aproach as it has:
An overall global variable -> What would be the way to have a local variable for the `total_results function?
The way I check when the list is returned seems somewhat clumsy, what would be the standard way?

Is your computation CPU-intensive? If so you should look at the multiprocessing module which is included with Python and offers a fairly easy to use Pool class into which you can feed compute tasks and later get all the results. If you need a lot of CPU time this will be faster anyway, because Python doesn't do threading all that well: only a single interpreter thread can run at a time in one process. Multiprocessing sidesteps that (and offers the Pool abstraction which makes your job easier). Oh, and if you really want to stick with threads, multiprocessing has a ThreadPool too.

1 - Use a class variable shared between all Worker's instances to append your results
from threading import Thread
class Worker(Thread):
results = []
...
def run(self):
results = self.calculate_results()
Worker.results.extend(results) # extending a list is thread safe
2 - Use join() to wait untill all the threads are done and let them have some computational time
def total_results(params):
# create all workers
workers = [Worker(p) for p in params]
# start all workers
[w.start() for w in workers]
# wait for all of them to finish
[w.join() for w in workers]
#get the result
return Worker.results

How to make a multithreaded system work on same dictionary in python

I have a system designed to take data via a socket and store that into a dictionary to serve as a database. Then all my other modules (GUI, analysis, write_to_log_file, etc) will access the database and do what they need to do with the dictionary e.g make widgets/copy the dictionary to a log file. But since all these things happen at a different rate, I chose to have each module on their own thread so I can control the frequency.
In the main run function there's something like this:
from threading import Thread
import data_collector
import write_to_log_file
def main():
db = {}
receive_data_thread = Thread(target=data_collector.main, arg=(db,))
recieve_data_thread.start() # writes to dictionary # 50 Hz
log_data_thread = Thread(target=write_to_log_file.main, arg(db,))
log_data_thread.start() # reads dictionary # 1 Hz
But it seems that both modules aren't working on the same dictionary instance because the log_data_thread just prints out the empty dictionary even when the data_collector shows the data it's inserted into the dictionary.
There's only one writer to the dictionary so I don't have to worry about threads stepping on each others toes, I just need to figure out a way for all the modules to read the current database as it's being written.

Rather than using a builtin dict, you could look at using a Manager object from the multiprocessing library:
from multiprocessing import Manager
from threading import Thread
from time import sleep
manager = Manager()
d = manager.dict()
def do_this(d):
d["this"] = "done"
def do_that(d):
d["that"] ="done"
thread0 = Thread(target=do_this,args=(d,))
thread1 = Thread(target=do_that,args=(d,))
thread0.start()
thread1.start()
thread0.join()
thread1.join()
print d
This gives you a standard-library thread-safe synchronised dictionary which should be easy to swap in to your current implementation without changing the design.

Use a Queue.Queue to pass values from the reader threads to a single writer thread. Pass the Queue instance to each data_collector.main function. They can all call the Queue's put method.
Meanwhile the write_to_log_file.main should also be passed the same Queue instance, and it can call the Queue's get method.
As items are pulled out of the Queue, they can be added to the dict.
See also: Alex Martelli, on why Queue.Queue is the secret sauce of CPython multithreading.

This should not be a problem. I also assume you are using the threading module. I would have to know more about what the data_collector and write_to_log_file are doing to figure out why they are not working.
You could technically even have more then 1 thread writing and it would not be a problem because the GIL would take care of all the locking needed. Granted you will never get more then one cpus worth of work out of it.
Here is a simple Example:
import threading, time
def addItem(d):
c = 0
while True:
d[c]="test-%d"%(c)
c+=1
time.sleep(1)
def checkItems(d):
clen = len(d)
while True:
if clen < len(d):
print "dict changed", d
clen = len(d)
time.sleep(.5)
DICT = {}
t1 = threading.Thread(target=addItem, args=(DICT,))
t1.daemon = True
t2 = threading.Thread(target=checkItems, args=(DICT,))
t2.daemon = True
t1.start()
t2.start()
while True:
time.sleep(1000)

Sorry, I figured out my problem, and I'm dumb. The modules were working on the same dictionary, but my logger wasn't wrapped around a while True so it just executed once and terminated the thread and thus my dictionary was only logged to disk once. So I made write_to_log_file.main(db) constantly write at 1Hz forever and set log_data_thread.deamon = True so that once the writer thread (which won't be a daemon thread) exits, it'll quit. Thanks for all the input about best practices on this type of system.

Assistance with Python multithreading

Currently, i have a list of url to grab contents from and is doing it serially. I would like to change it to grabbing them in parallel. This is a psuedocode. I will like to ask is the design sound? I understand that .start() starts the thread, however, my database is not updated. Do i need to use q.get() ? thanks
import threading
import Queue
q = Queue.Queue()
def do_database(url):
""" grab url then input to database """
webdata = grab_url(url)
try:
insert_data_into_database(webdata)
except:
....
else:
< do I need to do anything with the queue after each db operation is done?>
def put_queue(q, url ):
q.put( do_database(url) )
for myfiles in currentdir:
url = myfiles + some_other_string
t=threading.Thread(target=put_queue,args=(q,url))
t.daemon=True
t.start()

It's odd that you're putting stuff into q but never taking anything out of q. What is the purpose of q? In addition, since do_database() doesn't return anything, sure looks like the only thing q.put(do_database(url)) does is put None into q.
The usual way these things work, a description of work to do is added to a queue, and then a fixed number of threads take turns pulling things off the queue. You probably don't want to create an unbounded number of threads ;-)
Here's a pretty complete - but untested - sketch:
import threading
import Queue
NUM_THREADS = 5 # whatever
q = Queue.Queue()
END_OF_DATA = object() # a unique object
class Worker(threading.Thread):
def run(self):
while True:
url = q.get()
if url is END_OF_DATA:
break
webdata = grab_url(url)
try:
# Does your database support concurrent updates
# from multiple threads? If not, need to put
# this in a "with some_global_mutex:" block.
insert_data_into_database(webdata)
except:
#....
threads = [Worker() for _ in range(NUM_THREADS)]
for t in threads:
t.start()
for myfiles in currentdir:
url = myfiles + some_other_string
q.put(url)
# Give each thread an END_OF_DATA marker.
for _ in range(NUM_THREADS):
q.put(END_OF_DATA)
# Shut down cleanly. `daemon` is way overused.
for t in threads:
t.join()

You should do this with asynchronous programming rather than threads. Threading in Python is problematic (see: Global Interpreter Lock), and anyway you're not trying to achieve multicore performance here. You just need a way to multiplex potentially long-running I/O. For that you can use a single thread and an event-driven library such as Twisted.
Twisted comes with HTTP functionality, so you can issue many concurrent requests and react (by populating your database) when results come in. Be aware that this model of programming may take a little getting used to, but it will give you good performance if the number of requests you're making is not astronomical (i.e. if you can get it all done on one machine, which it seems is your intention).

For DB, You have to commit before your changes become effective. But, commit for every insert is not optimal. Commit after bulk changes gives much better performance.
For parallel, Python isn't born for this. For your use-case, i suppose using python with gevent would be a painless solution.
Here is a much more efficient pseudo implementation FYI:
import gevent
from gevent.monkey import patch_all
patch_all() # to use with urllib, etc
from gevent.queue import Queue
def web_worker(q, url):
grab_something
q.push(result)
def db_worker(q):
buf = []
while True:
buf.append(q.get())
if len(buf) > 20:
insert_stuff_in_buf_to_db
db_commit
buf = []
def run(urls):
q = Queue()
gevent.spawn(db_worker, q)
for url in urls:
gevent.spawn(web_worker, q, url)
run(urls)
plus, since this implementation is totally single threaded, you can safely manipulate shared data between workers like queue, db connection, global variables etc.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Update a Global list in Dask - python

Related

Separate variables for each thread

What is the easiest way to make maximum cpu usage for nested for-loops?

parallel writing to list in python

How to make a multithreaded system work on same dictionary in python

Assistance with Python multithreading

Categories

Resources