I'm not sure of the correct terminology to use, but my example should clear it up.
I want to listen to a Reddit comment stream.
This stream receives a comment in real time as it is posted to Reddit (/r/AskReddit and /r/worldnews), so I don't have to poll the server.
However, this function blocks, so I need to run it in several threads.
Here's what I have so far:
#!/usr/bin/python3
from multiprocessing.dummy import Pool  # a thread pool, despite the module name
import praw

def process_item(stream):
    # Display each comment as it arrives
    for comment in stream:
        print(comment.permalink)

def get_streams(reddit):
    # Listen for comments from these two subreddits:
    streams = [
        reddit.subreddit('AskReddit').stream.comments(skip_existing=True),
        reddit.subreddit('worldnews').stream.comments(skip_existing=True)
    ]
    pool = Pool(4)
    print('waiting for comments...')
    results = pool.map(process_item, streams)
    # But I want to do tons of other things down here or in `main()`.
    # The code will never reach down here because it's always listening for comments.
The only workaround I can see is to put my entire program logic into process_item(), but that seems really stupid.
I think I want process_item() to keep adding comments to a list in the background, so I can process those comments as I see fit without getting stuck inside process_item(). In other words, while the program is busy doing other things, a list of jobs should keep queuing up for it to handle.
Possible? If so, could you give me some tips as to the pattern?
I'm brand new to threading.
Read more about the pub/sub pattern.
If you want threads, use the threading module. multiprocessing creates OS processes, and processes and threads are different things; note that multiprocessing.dummy, which you are importing, is actually a thin wrapper around threading. When processing the data with threads, keep the GIL in mind.
To do that, you can:
start some threads that read data from the streams and put each message onto a thread-safe queue
start some other threads that read from that queue and process your data
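For example, here is a minimal sketch of that pattern applied to the question's PRAW code, with the main thread doing the processing instead of a second set of threads. It assumes a praw.Reddit instance configured from a praw.ini site named 'bot'; everything else comes from the original code.
import queue
import threading
import praw

comments = queue.Queue()        # thread-safe FIFO shared with the main thread

def process_item(stream):
    # Blocks forever, so it runs in its own thread and only enqueues work
    for comment in stream:
        comments.put(comment)

def get_streams(reddit):
    streams = [
        reddit.subreddit('AskReddit').stream.comments(skip_existing=True),
        reddit.subreddit('worldnews').stream.comments(skip_existing=True),
    ]
    for stream in streams:
        threading.Thread(target=process_item, args=(stream,), daemon=True).start()

reddit = praw.Reddit('bot')     # assumes credentials in a praw.ini site named 'bot'
get_streams(reddit)
while True:
    comment = comments.get()    # or comments.get_nowait() between other work
    print(comment.permalink)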
I'm trying to make a threaded cgi webserver similar to this; however, I'm stuck on how to set local data in the handler for a different thread. Is it possible to set threading.local data, such as a dict, for a thread other than the handler? To be more specific, I want to have the request parameters, headers, etc. available from a cgi file that was started with subprocess.run. The bottom of the do_GET in this file on GitHub is what I use now, but that can only serve one client at a time. I want to replace this part because I want multiple connections/threads at once, and I need different data in each connection/thread.
Is there a way to edit/set threading.local data from a different thread? Or if there is a better way to achieve what I am trying to do, please let me know. If you know that this is definitely impossible, say so.
Thanks in advance!
Without seeing your test code or knowing what you've tried so far, I can't tell you exactly what you need to succeed. That said, I can tell you that trying to edit information in a threading.local() object from another thread is not the cleanest path to take.
Generally, the best way to send calls to other threads is through threading.Event() objects. Usually, a thread listens to an Event() object and does an action based on that. In this case, I could see having a handler set an event in the case of a GET request.
Then, in the thread that is writing the cgi file, have a function that, when the Event() object is set, records the data you need and unsets the Event() object.
So, in pseudo-code:
import threading

evt = threading.Event()

def noteTaker(evt):
    while True:
        evt.wait()                  # returns once the event is set
        modifyDataYouNeed()         # placeholder from the original pseudo-code
        f = open('notes.txt', 'w')  # 'notes.txt' stands in for wherever you record it
        f.write(dataYouNeed)        # whatever modifyDataYouNeed() produced
        f.close()
        evt.clear()

def do_GET(evt):
    print("so, a query hit your webserver")
    evt.set()
    print("and noteTaker was just notified")
So, while I couldn't answer your question directly, I hope this helps some on how threads communicate and will help you infer what you need :)
The threading documentation (as I'm sure you've read already, but for the sake of diligence) is here.
I apologize if this isn't the correct way to word it, but I'm not sure where to start. If this question needs to be reworded, I will definitely do that.
I have just finished writing a piece of code that is collecting data from a variety of servers. It is currently running, and I would like to be able to start writing other pieces of code that can access the data being collected. Obviously I can do this by dumping the data into files, and have my data analysis code read the files stored on disk. However, for some forms of my analysis I would like to have something closer to real time data. Is there a way for me to access the class from my data collection piece of code without explicitly instantiating it? I mean, can I set up one piece of code to start the data collection, and then write other pieces of code later that are able to access the data collection class without stopping and restarting the data collection piece of code?
I hope that makes sense. I realize the data can just be stored to disk, and I could do things like just have my data analysis code search directories for changes. However, I am just curious to know if something like this can be done.
This seems like a Producer-Consumer problem.
The producer's job is to generate a piece of data, put it into the buffer, and start again. At the same time, the consumer is consuming the data (i.e., removing it from the buffer) one piece at a time.
The catch here is "at the same time": the producer and consumer need to run concurrently. Hence we need separate threads for the Producer and Consumer.
I am taking the code from the above link; you should go through it for extra details.
from threading import Thread
import time
import random
from queue import Queue

queue = Queue(10)

class ProducerThread(Thread):
    def run(self):
        nums = range(5)
        global queue
        while True:
            num = random.choice(nums)
            queue.put(num)       # blocks if the queue is full
            print("Produced", num)
            time.sleep(random.random())

class ConsumerThread(Thread):
    def run(self):
        global queue
        while True:
            num = queue.get()    # blocks if the queue is empty
            queue.task_done()
            print("Consumed", num)
            time.sleep(random.random())

ProducerThread().start()
ConsumerThread().start()
Explanation:
We are using a Queue instance (hereafter queue). Queue has a Condition, and that condition has its own lock. You don't need to bother with the Condition and Lock if you use Queue.
The producer uses put(), available on queue, to insert data into the queue. put() has the logic to acquire the lock before inserting the data. put() also checks whether the queue is full; if it is, it calls wait() internally, so the producer starts waiting.
The consumer uses get(). get() acquires the lock before removing data from the queue, and it checks whether the queue is empty; if it is, it puts the consumer in a waiting state.
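One detail the explanation skips: the consumer calls task_done() after each get(). That only matters if something calls queue.join(), which blocks until every item that was put() has been marked done. A minimal sketch of that pairing (the worker and item names are just illustrative):
from queue import Queue
from threading import Thread

q = Queue()

def worker():
    while True:
        item = q.get()
        print("processed", item)
        q.task_done()          # tells q.join() this item is finished

Thread(target=worker, daemon=True).start()
for i in range(5):
    q.put(i)
q.join()                       # returns once all five items are processed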
I need to do some three-step web scraping in Python. I have a couple base pages that I scrape initially, and I need to get a few select links off those pages and retrieve the pages they point to, and repeat that one more time. The trick is I would like to do this all asynchronously, so that every request is fired off as soon as possible, and the whole application isn't blocked on a single request. How would I do this?
Up until this point, I've been doing one-step scraping with eventlet, like this:
import eventlet
from eventlet.green import urllib2

urls = ['http://example.com', '...']

def scrape_page(url):
    """Gets the data from the web page."""
    body = urllib2.urlopen(url).read()
    # Do something with body to produce data
    return data

pool = eventlet.GreenPool()
for data in pool.imap(scrape_page, urls):
    # Handle the data...
    pass
However, if I extend this technique and include a nested GreenPool.imap loop, it blocks until all the requests in that group are done, meaning the application can't start more requests as needed.
I know I could do this with Twisted or another asynchronous server, but I don't need such a huge library and I would rather use something lightweight. I'm open to suggestions, though.
Here is an idea... but forgive me since I don't know eventlet. I can only provide a rough concept.
Consider your "step 1" pool the producers. Create a queue and have your step 1 workers place any new urls they find into the queue.
Create another pool of workers. Have these workers pull from the queue for urls and process them. If during their process they discover another url, put that into the queue. They will keep feeding themselves with subsequent work.
Technically this approach would make it easily recursive beyond 1,2,3+ steps. As long as they find new urls and put them in the queue, the work keeps happening.
Better yet, start out with the original urls in the queue, and just create a single pool that puts new urls to that same queue. Only one pool needed.
Post note
Funnily enough, after I posted this answer and went to look for the eventlet 'queue' equivalent, I immediately found an example showing exactly what I just described:
http://eventlet.net/doc/examples.html#producer-consumer-web-crawler
In that example there is a producer and a fetch method. The producer pulls urls from the queue and spawns threads to fetch them; fetch then puts any new urls back into the queue, and they keep feeding each other.
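For what it's worth, here is a rough sketch of the single-pool version under the same assumptions: find_links is a hypothetical parser you'd supply, and the seed url is arbitrary; the rest uses eventlet's documented GreenPool and Queue.
import eventlet
from eventlet.green import urllib2
from eventlet.queue import Empty

q = eventlet.Queue()            # urls waiting to be fetched
pool = eventlet.GreenPool()
seen = set()                    # don't fetch the same url twice

def fetch(url):
    body = urllib2.urlopen(url).read()
    for link in find_links(body):       # find_links is a hypothetical parser
        if link not in seen:
            seen.add(link)
            q.put(link)                 # workers feed the queue themselves

q.put('http://example.com')             # seed with the original urls
while not q.empty() or pool.running():
    try:
        url = q.get(timeout=1)
    except Empty:
        continue                        # workers may still be producing urls
    pool.spawn_n(fetch, url)
pool.waitall()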
I'm using the threading module to control threads that send data through sockets and whatnot; however, I can't find a suitable solution for passing data into a thread to work with. I've tried things such as overriding threading.Thread.run(), but can't seem to get it working. If anyone has any suggestions, I'd be happy to try anything :)
Thanks!
You are thinking about this backwards. Forget about the fact that it happens to be a thread that's sending the data through the sockets. The data doesn't need to get to the thread, it needs to get to the logic that sends data on the socket.
For example, you can have a queue that holds things that need to be sent through the socket. The socket write code pulls messages from the queue and sends them out the socket. The other code puts messages on this queue. The code that needs to send messages to the socket shouldn't know or care that there happens to be a thread that does the sending.
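A minimal sketch of that arrangement (the host, port, and shutdown sentinel are assumptions for illustration):
import queue
import socket
import threading

outgoing = queue.Queue()       # any code can put messages here

def socket_writer(sock):
    # The writer thread is the only code that touches the socket
    while True:
        msg = outgoing.get()
        if msg is None:        # sentinel to shut the writer down
            break
        sock.sendall(msg)

sock = socket.create_connection(('example.com', 12345))
threading.Thread(target=socket_writer, args=(sock,), daemon=True).start()

# Other code sends data without knowing a thread is involved
outgoing.put(b'hello')
outgoing.put(None)             # done sending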
Use message queues for this. Python has the Queue module for passing data between threads, but if you use a third-party library like 0MQ (http://www.zeromq.org) instead, then you can split the threads into separate processes and it will work the same way.
Multiprocessing is easier to do than threading, but if you have to use threading, avoid locking and sharing data as much as you can. Instead use a prewritten module like Queue to limit the ways in which subtle bugs can arise.
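As a rough sketch of what that looks like with 0MQ's Python bindings (pyzmq), using a PUSH/PULL pair: the address is an arbitrary example, and the two halves could live in one process today and in two separate programs tomorrow without changing the code.
import zmq

ctx = zmq.Context()

# producer side: could be a thread now, a separate process later
push = ctx.socket(zmq.PUSH)
push.bind('tcp://127.0.0.1:5555')

# consumer side
pull = ctx.socket(zmq.PULL)
pull.connect('tcp://127.0.0.1:5555')

push.send(b'hello')            # queued by zmq until a peer is ready
print(pull.recv())             # b'hello'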
I'm trying to write a 'market data engine' of sorts.
So far I have a queue of threads; each thread uses urllib to fetch Google Finance and re to pull the stock details out of the page. Each thread polls the page every few seconds.
From here, how can I persist the data in a way another class can just poll it, without the problem of 2 processes accessing the same resource at the same time? For example, if I get my threads to write to a dict that's constantly being updated, will I have trouble reading that same hash from another function?
You are correct that using a standard dict is not thread-safe (see here) and may cause you problems.
A very nice way to handle this is to use the Queue class in the queue module in the standard library. It is thread-safe. Have the worker threads send updates to the main thread via the queue and have the main thread alone update the dictionary.
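A minimal sketch of that single-writer pattern (fetch_price and the ticker symbols are placeholders for your own scraping code):
import queue
import threading
import time

updates = queue.Queue()
prices = {}                          # only the main thread writes to this dict

def poll_quote(symbol):
    while True:
        price = fetch_price(symbol)  # fetch_price stands in for your urllib/re code
        updates.put((symbol, price))
        time.sleep(5)

for sym in ('GOOG', 'AAPL'):
    threading.Thread(target=poll_quote, args=(sym,), daemon=True).start()

while True:
    symbol, price = updates.get()    # blocks until a worker reports
    prices[symbol] = price           # safe: a single writer, no locks needed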
You could also have the threads update a database, but that may or may not be overkill for what you're doing.
You might also want to take a look at something like eventlet. In fact, they have a web crawler example on their front page.