I'm trying to write a 'market data engine' of sorts.
So far I have a queue of threads: each thread uses urllib to fetch the Google Finance page and re to pull the stock details out of it. Each thread polls its page every few seconds.
From here, how can I persist the data in a way another class can just poll it, without the problem of 2 processes accessing the same resource at the same time? For example, if I get my threads to write to a dict that's constantly being updated, will I have trouble reading that same hash from another function?
You are correct that a standard dict is not guaranteed to be thread-safe and may cause you problems.
A very nice way to handle this is to use the Queue class in the queue module in the standard library. It is thread-safe. Have the worker threads send updates to the main thread via the queue and have the main thread alone update the dictionary.
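For illustration, here is a minimal sketch of that pattern; fetch_quote is a hypothetical stand-in for your urllib-plus-regex scraping code:

import threading
import time
from queue import Queue  # on Python 2: from Queue import Queue

updates = Queue()

def poll(symbol):
    while True:
        price = fetch_quote(symbol)   # hypothetical: your urllib + re code
        updates.put((symbol, price))  # Queue.put is thread-safe
        time.sleep(5)

for symbol in ("GOOG", "AAPL"):
    t = threading.Thread(target=poll, args=(symbol,))
    t.daemon = True  # don't keep the process alive just for the pollers
    t.start()

quotes = {}  # only the main thread ever touches this dict
while True:
    symbol, price = updates.get()  # blocks until a worker sends an update
    quotes[symbol] = price

Since only the main thread writes the dict, no two threads ever touch it concurrently.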
You could also have the threads update a database, but that may or may not be overkill for what you're doing.
You might also want to take a look at something like eventlet. In fact, they have a web crawler example on their front page.
I've got the following problem:
I have two different classes; let's call them the interface and worker. The interface is supposed to accept requests from outside, and multiplexes them to several workers.
Contrary to almost every example I have found, I have several peculiarities:
The workers are not supposed to be recreated for every request.
The workers are different; a request for workers[0] cannot be answered by workers[1]. This multiplexing is done in interface.
I have a number of function-like calls which are difficult to model via events or simple queues.
There are a few different kinds of requests, which would make one queue per request type difficult to manage.
For example, assume that each worker is storing a single integer number (let's say the number of calls this worker received). In non-parallel processing, I'd use something like this:
class interface(object):
    workers = None  # set somewhere else

    def get_worker_calls(self, worker_id):
        return self.workers[worker_id].get_calls()

class worker(object):
    calls = 0

    def get_calls(self):
        self.calls += 1
        return self.calls
This, obviously, doesn't work. What does?
Or, maybe more relevantly, I don't have experience with multiprocessing. Is there a design paradigm I'm missing that would easily solve the above?
Thanks!
For reference, I have considered several approaches, and I was unable to find a good one:
Use one request queue and one answer queue. I've discarded this idea since it would either block the interface for the answer-time of the current worker (making it scale badly), or would require me to send around extra information.
Use of one request queue. Each message contains a pipe to return the answer to that request. After fixing the issue with being unable to send pipes via pipes, I ran into problems with the pipe closing unless I sent both ends over the connection.
Use of one request queue. Each message contains a queue to return the answer to that request. This fails since I cannot send queues via queues; the reduction trick doesn't work either.
The above also applies to the respective Manager-generated objects.
Multiprocessing means you have two or more separate processes running. There is no way to directly access memory from one process in another (unlike with multithreading).
Your best shot is to use some kind of external queue mechanism; you can start with Celery or RQ. RQ is simpler, but Celery has built-in monitoring.
But you have to know that this will work only if Celery/RQ are able to "pack" (pickle) the needed functions/classes and send them to another process. Therefore you have to use module-level functions (defined at the top of the file, not belonging to any class).
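As an illustration, here is a minimal RQ sketch (assuming a Redis server on localhost); note that count_words is defined at module level so a worker process can import it by name:

from redis import Redis
from rq import Queue

def count_words(text):  # module-level: workers import it, not a class method
    return len(text.split())

q = Queue(connection=Redis())
job = q.enqueue(count_words, "a few words to count")
# ... later, once an `rq worker` process has run the job:
print(job.result)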
You can always implement it yourself; Redis is very simple to start with, and ZeroMQ and RabbitMQ are also good.
The Beaver library is a good example of how to deal with multiprocessing in Python using a ZeroMQ queue.
Is it OK to run certain pieces of code asynchronously in a Django web app? If so, how?
For example:
I have a search algorithm that returns hundreds or thousands of results. I want to enter into the database that these items were the result of the search, so I can see what users are searching most. I don't want the client to have to wait an extra hundred or thousand more database inserts. Is there a way I can do this asynchronously? Is there any danger in doing so? Is there a better way to achieve this?
As far as Django is concerned, yes.
The bigger concern is your web server and whether it plays nicely with threading. For instance, gunicorn's sync workers are single-threaded, but there are other worker types, such as the greenlet-based gevent and eventlet workers; I'm not sure how well those play with threads.
Combining threading and multiprocessing can be an issue if you're forking from threads:
Status of mixing multiprocessing and threading in Python
http://bugs.python.org/issue6721
That being said, I know of popular performance-analytics utilities that use threads to report metrics, so it seems to be an accepted practice.
In sum, it seems safest to use the threading.Thread object from the standard library, so long as whatever you do in it doesn't fork (i.e., doesn't use Python's multiprocessing library).
https://docs.python.org/2/library/threading.html
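For illustration, here is a minimal sketch of that approach in a Django view; SearchHit, run_search, and the template name are hypothetical stand-ins for your model and search code:

import threading
from django.shortcuts import render

def _log_results(query, result_ids):
    # hypothetical model; bulk_create turns thousands of INSERTs into one query
    SearchHit.objects.bulk_create(
        [SearchHit(query=query, result_id=rid) for rid in result_ids])

def search_view(request):
    query = request.GET.get("q", "")
    results = run_search(query)  # hypothetical: your search algorithm
    t = threading.Thread(target=_log_results,
                         args=(query, [r.id for r in results]))
    t.daemon = True  # don't block process shutdown on the logging thread
    t.start()
    return render(request, "results.html", {"results": results})

Note that a daemon thread can be killed mid-insert if the process exits; if that matters, a task queue (see the Celery answer below) is more robust.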
Offloading requests from the main thread is a common practice, as the end goal is to return a result to the client (browser) as quickly as possible.
As I am sure you are aware, HTTP is blocking - so until you return a response, the client cannot do anything (it is blocked, in a waiting state).
The de facto way of offloading requests is through Celery, which is a task-queuing system.
I highly recommend reading the Celery introduction, but in summary here is what happens:
You mark certain pieces of code as "tasks". These are usually functions that you want to run asynchronously.
Celery manages workers - you can think of them as threads - that will run these tasks.
To communicate with the workers, a message queue is required; RabbitMQ is the one most often recommended.
Once you have all the components running (it takes but a few minutes); your workflow goes like this:
In your view, when you want to offload some work, you call the task function with .delay() instead of calling it directly. This triggers a worker to start executing the function in the background.
Your view then returns a response immediately.
You can then check for the result of the task, and take appropriate actions based on what needs to be done. There are ways to track progress as well.
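Put together, a minimal sketch of the pattern might look like this; record_search, SearchHit, and run_search are hypothetical, and a configured Celery app is assumed:

# tasks.py
from celery import shared_task

@shared_task
def record_search(query, result_ids):
    # hypothetical model; runs in a worker process, not in the request cycle
    SearchHit.objects.bulk_create(
        [SearchHit(query=query, result_id=rid) for rid in result_ids])

# views.py
def search_view(request):
    query = request.GET.get("q", "")
    results = run_search(query)  # hypothetical: the search algorithm
    record_search.delay(query, [r.id for r in results])  # returns at once
    # ... render and return the response as usual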
It is also good practice to include caching - so that you are not executing expensive tasks unnecessarily. For example, you might choose to offload a request to do some analytics on search keywords that will be placed in a report.
Once the report is generated, I would cache the results (if applicable) so that the same report can be displayed if requested later - rather than be generated again.
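A minimal sketch of that caching pattern using Django's cache framework; generate_report and the key format are hypothetical:

from django.core.cache import cache

def get_report(keyword):
    key = "report:" + keyword
    report = cache.get(key)
    if report is None:                     # not cached (or expired)
        report = generate_report(keyword)  # hypothetical expensive work
        cache.set(key, report, 60 * 60)    # reuse the result for an hour
    return report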
I'm using the Python threading library. Works fine (subject to the Global Interpreter Lock, of course).
Now I have a conundrum. I have two separate sources of concurrency: either two Queues, or a Queue and a Condition. How can I wait for whichever one is ready first? (They have to be separate objects, since they are owned by different modular parts of my application.)
Windows has the WaitForMultipleObjects function; is there something similar for Python concurrency primitives?
There is no existing function that I know of that does what you ask. However, there is threading.enumerate(), which returns a list of all the threads currently alive, no matter their source. Once you have that list, you could iterate over it looking for the condition you want. To make a thread a daemon, call its setDaemon(True) method before the thread is started.
I can't say for sure that this is your answer. I don't have as much experience as you apparently do, but I looked this up in a book I have, The Python Standard Library by Example by Doug Hellmann. He has 23 pages on managing concurrent operations in the section on threading, and enumerate seemed to be something that would help.
You could create a new synchronization object, say a threading.Event called ready_event, plus one watcher thread for each sync object you want to watch. Each watcher thread waits for its sync object to become ready and, when it does, signals that via ready_event. After you have created and started the watcher threads, you can simply wait on ready_event.
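A minimal sketch of that idea, with a queue and a condition standing in for the two independently owned sources:

import threading
from queue import Queue  # on Python 2: from Queue import Queue

q = Queue()                   # owned by one part of the application
cond = threading.Condition()  # owned by another part
ready_event = threading.Event()

def watch_queue():
    q.get()            # blocks until something is enqueued
    ready_event.set()  # signal "one of the sources is ready"

def watch_condition():
    with cond:
        cond.wait()    # blocks until the condition is notified
    ready_event.set()

threading.Thread(target=watch_queue).start()
threading.Thread(target=watch_condition).start()

ready_event.wait()     # returns as soon as either source fires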
I have a Django view which needs to retrieve search results from multiple web services, blend the results together, and render them. I've never done any multithreading in Django before. What is a modern, efficient, safe way of doing this?
I don't know anything about it yet, but gevent seems like a reasonable option. Should I use that? Does it play well with Django? Should I look elsewhere?
Not sure about gevent. The simplest way is to use threads[*]. Here's a simple example of how to use threads in Python:
# std lib modules. "Batteries included" FTW.
import threading
import time

thread_result = -1

def ThreadWork():
    global thread_result
    thread_result = 1 + 1
    time.sleep(5)  # phew, I'm tired after all that addition!

my_thread = threading.Thread(target=ThreadWork)
my_thread.start()  # This will call ThreadWork in the background.

# In the meantime, you can do other stuff:
y = 2 * 5  # Completely independent calculation.

my_thread.join()  # Wait for the thread to finish doing its thing.
                  # This should take about 5 seconds,
                  # due to time.sleep being called.

print "thread_result * y =", thread_result * y
You can start multiple threads, have each make different web service calls, and join on all of those threads. Once all those join calls have returned, the results are in, and you'll be able to blend them.
More advanced tips: you should call join with a timeout; otherwise, your users might be left waiting indefinitely for your app to send them a response. Even better would be to make those web service calls before the request arrives at your app; otherwise, the responsiveness of your app is at the mercy of the services it relies on.
A caveat about threading in general: be careful with data that can be accessed by two (or more) different threads. Access to the same data needs to be "synchronized". The most popular synchronization device is a lock, but there is a plethora of others; threading.Lock implements one. If you're not careful about synchronization, you're likely to write a "race condition" into your app. Such bugs are notoriously difficult to debug, because they cannot be reliably reproduced.
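A tiny illustration of the lock idea (a sketch, separate from the example above): without the with block below, two threads could read the same value of counter and lose an increment.

import threading

counter = 0
counter_lock = threading.Lock()

def increment():
    global counter
    with counter_lock:  # only one thread may run this block at a time
        counter += 1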
In my simple example, thread_result was shared between my_thread and the main thread. I didn't need any locks, because the main thread did not access thread_result until my_thread terminated. If I hadn't called my_thread.join, the result would sometimes be -10 instead of 20. Go ahead and try it yourself.
[*] Python doesn't have true threading in the sense that concurrent threads do not execute simultaneously, even if you have idle cores. However, you still get concurrent execution: when one thread is blocked, other threads can execute.
I just solved this problem nicely using futures (the concurrent.futures module), available in Python 3.2 and backported to earlier versions, including 2.x.
In my case I was retrieving results from an internal service and collating them:
import os
import urllib2
from concurrent import futures  # the `futures` backport package on 2.x
from django.core.urlresolvers import reverse

def _getInfo(request, key):
    return urllib2.urlopen(
        'http://{0[SERVER_NAME]}:{0[SERVER_PORT]}'.format(request.META) +
        reverse('my.internal.view', args=(key,)),
        timeout=30)

# … inside the view; `request` and `myListOfItems` come from its context:
with futures.ThreadPoolExecutor(
        max_workers=os.sysconf('SC_NPROCESSORS_ONLN')) as executor:
    futureCalls = dict((key, executor.submit(_getInfo, request, key))
                       for key in myListOfItems)

for key in myListOfItems:
    curInfo = futureCalls[key]
    if curInfo.exception() is not None:
        print("exception calling for info: {0}".format(curInfo.exception()))
    else:
        pass  # handle curInfo.result() …
gevent will not help you process the task faster. It is just more efficient than threads when it comes to resource footprint. When running gevent with Django (usually via gunicorn), your web app will be able to handle more concurrent connections than a normal Django WSGI app.
But: I think this has nothing to do with your problem. What you want to do is handle a huge task in one Django view, which is usually not a good idea. I personally advise against using threads or gevent's greenlets for this in Django. I see the point for standalone Python scripts or daemons or other tools, but not for the web; it mostly results in instability and a larger resource footprint. Instead, I agree with the comments of dokkaebi and Andrew Gorcester. The two comments differ somewhat, though, since it really depends on what your task is about.
If you can split your task into many smaller tasks, you could create multiple views handling these subtasks. These views could return something like JSON and be consumed via AJAX from your frontend. This way you can build up the content of the page as it "comes in", and the user does not need to wait until the whole page has loaded.
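A minimal sketch of one such subtask view, where run_subtask is a hypothetical stand-in for one small piece of the big task:

import json
from django.http import HttpResponse

def subtask_view(request, part):
    data = run_subtask(part)  # hypothetical: one small piece of the big task
    return HttpResponse(json.dumps({"part": part, "data": data}),
                        content_type="application/json")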
If your task is one huge chunk, you are better off with a task queue handler. Celery comes to mind. If Celery is overkill, you can use ZeroMQ. This basically works as Andrew described above: you schedule the task for processing and poll the backend from your frontend page until the task is finished (usually also via AJAX). You could also use something like long polling here.
I'm making a Python script that needs to do three things simultaneously.
What is a good way to achieve this? Given what I've heard about the GIL, I'm not so keen on using threads anymore.
Two of the things the script needs to do will be heavily active; they will have lots of work to do. The third thing needs to report to the user over a socket, when asked, on the status of the other two (so it will be like a tiny server).
Now my question is: what would be a good way to achieve this? I don't want to have three different scripts, and due to the GIL I think that with threads I won't get much performance and may make things worse.
Is there a fork() for Python, like in C, so my script can fork two processes to do their jobs, with the main process reporting to the user? And how can the forked processes communicate with the main process?
Later edit: to be more precise, one thread should fetch email from an IMAP server and store it in a database, another thread should take messages from the DB that need to be sent and send them, and the main thread should be a tiny HTTP server that accepts a single URL and shows the status of those two threads in JSON format. So are threads OK? Will the work actually be done simultaneously, or will there be performance issues due to the GIL?
I think you could use the multiprocessing package, which has an API similar to the threading package and will allow you to get better performance by using multiple cores of a single machine.
To see the performance gain of multiprocessing over threading, compare the average running time of the same program written both ways; for CPU-bound work, multiprocessing typically comes out well ahead.
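A minimal sketch of the layout being suggested; fetch_mail and send_mail are hypothetical stand-ins for the two heavy jobs, and a Manager dict carries status back to the main process:

import multiprocessing

def fetch_mail(status):
    # hypothetical: pull messages from IMAP and store them in the DB
    status["fetched"] = status.get("fetched", 0) + 1

def send_mail(status):
    # hypothetical: read pending messages from the DB and send them
    status["sent"] = status.get("sent", 0) + 1

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    status = manager.dict()  # shared, process-safe status store
    workers = [multiprocessing.Process(target=fetch_mail, args=(status,)),
               multiprocessing.Process(target=send_mail, args=(status,))]
    for w in workers:
        w.start()
    # the main process can now serve dict(status) as JSON over HTTP
    for w in workers:
        w.join()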
The GIL really only matters if you need true parallelism, that is, spreading the load over several cores/processors. If that is the case, and it kind of sounds like it from your description, use multiprocessing.
If you just need to do three things "simultaneously" in the sense that you need to wait in the background for things to happen, then threads are just fine. That's what threads are for in the first place. 8-I)