pop unique items from queue and use in multiple threads


I don't understand how to have each thread pop a unique item off the queue and run the function concurrently. As currently written, the code runs linearly: on each pass of the loop, both threads are given the same item popped from the queue.
How can I have the loop pass a unique item from the queue to each thread?
import sys
import subprocess
import threading
import Queue
def pinger(host):
    subprocess.Popen(['ping', '-c 1', '-W 1', host])
ping_hosts = ['google.com', 'yahoo.com', 'disney.com', 'myspace.com', 'www.pingler.com', 'www.pingmylink.com',
              'www.pingoat.net', 'www.blogsearch.google.com', 'pingmyblog.com', 'www.twingly.com', 'www.weblogs.com', 'auto-ping.com']
ping_hosts = ['google.com', 'yahoo.com', 'disney.com']
def Main():
    q = Queue.Queue()
    for item in ping_hosts:
        q.put(item)
    while q.qsize() != 0:
        host = q.get()
        t1 = threading.Thread(target=pinger, args=(host,))
        t2 = threading.Thread(target=pinger, args=(host,))
        t1.start()
        t2.start()
        t1.join()
        t2.join()
    print "Main Completed"
if __name__ == '__main__':
    Main()

Your worker threads ping the same host because you pass the same host value to both of them. But the main problem with your code is that you consume the queue in the main (single) thread. You put tasks in the queue (main thread), then get them (main thread) and start two worker threads that do the ping (thread 1, thread 2), then join them (back in the main thread again), and repeat that while there are tasks in the queue.
Also, Popen actually starts a new process to do the ping, so you might not even need the threads if the ping is all you're interested in. Just:
for host in ping_hosts:
    subprocess.Popen(['ping', '-c 1', '-W 1', host])
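As a rough illustration of why no threads are needed for that, here is a minimal Python 3 sketch that starts several child processes first and only then waits for them. The helper name run_all and the placeholder commands (each child just prints a label instead of running ping) are assumptions made for the example, not part of the original code.

```python
import subprocess
import sys


def run_all(commands):
    """Start every command first, then collect each one's output."""
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.PIPE, universal_newlines=True)
        for cmd in commands
    ]
    # All children are already running; communicate() waits for each in turn.
    return [(proc.communicate()[0], proc.returncode) for proc in procs]


# Placeholder commands: each child just prints its own label.
commands = [[sys.executable, "-c", "print('child %d')" % n] for n in range(3)]
for output, code in run_all(commands):
    print(output.strip(), code)
```

Because Popen returns immediately, the children overlap in time even though the parent is single-threaded; the parent only blocks when it collects the results.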
But assuming your question is about threading, not doing the ping...
When using a queue, usually, you put tasks (data) in it in one thread, called the "producer", and get and process tasks (data) in another thread, called the "consumer", running simultaneously. Of course there may be several producers and several consumers.
So, in your case what you want is: put the hosts in the queue (main thread), then start worker threads that get the data from the queue (thread 1, thread 2) and process it (same thread 1, thread 2), running simultaneously.
Try it like this:
import subprocess
import threading
import Queue
def ping(host):
    """Pings a host."""
    print("Pinging '%s'" % host)
    subprocess.Popen(['ping', '-c 1', '-W 1', host])
def pinger_thread(thread_name, task_queue):
    """Gets the data from the queue and processes it."""
    while task_queue.qsize() != 0:
        host = task_queue.get()
        print("%s: pinging '%s'" % (thread_name, host))
        ping(host)
def main():
    ping_hosts = ['google.com', 'yahoo.com', 'disney.com']
    # Fill the queue
    q = Queue.Queue()
    for item in ping_hosts:
        q.put(item)
    # Start the threads, that will consume data from the queue
    t1 = threading.Thread(target=pinger_thread, args=("Thread 1", q))
    t2 = threading.Thread(target=pinger_thread, args=("Thread 2", q))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    print("Main Completed")
if __name__ == '__main__':
    main()
The output:
python2 test.py
Thread 1: pinging 'google.com'
Pinging 'google.com'
Thread 2: pinging 'yahoo.com'
Pinging 'yahoo.com'
Thread 1: pinging 'disney.com'
Pinging 'disney.com'
Main Completed
PING google.com (74.125.232.231) 56(84) bytes of data.
PING yahoo.com (206.190.36.45) 56(84) bytes of data.
PING disney.com (199.181.131.249) 56(84) bytes of data.
...
--- google.com ping statistics ---
...
--- disney.com ping statistics ---
...
--- yahoo.com ping statistics ---
...
I removed some of the ping command output as irrelevant. What's important is that the hosts are taken from the queue in order and shared between the two threads, with each host handled by exactly one thread.
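One caveat about the worker loop above: it checks qsize() and then calls get(), and another thread can take the last item between those two steps, leaving this thread blocked in get() forever. A possible variant, sketched here for Python 3 (the queue module rather than Python 2's Queue), uses get_nowait() and treats queue.Empty as the signal to stop. The check_host function is a hypothetical stand-in for the real ping so the sketch runs without network access.

```python
import queue
import threading


def check_host(host):
    """Hypothetical stand-in for the real ping; returns a fake result."""
    return "checked " + host


def worker(task_queue, results, results_lock):
    """Take hosts until the queue is empty, one unique host per get."""
    while True:
        try:
            host = task_queue.get_nowait()  # never blocks, unlike get()
        except queue.Empty:
            return  # nothing left, so this thread finishes
        outcome = check_host(host)
        with results_lock:
            results.append((threading.current_thread().name, host, outcome))


def run_pool(hosts, num_threads=2):
    task_queue = queue.Queue()
    for host in hosts:
        task_queue.put(host)
    results = []
    results_lock = threading.Lock()
    threads = [
        threading.Thread(target=worker,
                         args=(task_queue, results, results_lock),
                         name="Thread %d" % (i + 1))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


for name, host, outcome in run_pool(["google.com", "yahoo.com", "disney.com"]):
    print("%s: %s" % (name, outcome))
```

This only works because the queue is completely filled before the workers start; if a producer keeps adding items while the workers run, an empty queue no longer means the work is finished.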

Related

How does Python threadsafe queue work after calling get()

In the example from documentation:
import threading, queue
q = queue.Queue()
def worker():
    while True:
        item = q.get()
        print(f'Working on {item}')
        print(f'Finished {item}')
        q.task_done()
# turn-on the worker thread
threading.Thread(target=worker, daemon=True).start()
# send thirty task requests to the worker
for item in range(30):
    q.put(item)
print('All task requests sent\n', end='')
# block until all tasks are done
q.join()
print('All work completed')
After the worker thread gets an item from the queue (which I assume is protected by some lock, so that checking and modifying the queue is atomic), it prints. How are the prints also atomic across all the worker threads, so that we won't see intermingled output?

Your threads print to stdout, which is a shared global object. One possible solution is to use a threading.Lock or threading.Semaphore to guard stdout. For example:
import threading, queue
print_semaphore = threading.Semaphore()
q = queue.Queue()
def worker():
    while True:
        item = q.get()
        with print_semaphore:
            print(f'Working on {item}')
            print(f'Finished {item}')
        q.task_done()
# turn-on the worker thread
threading.Thread(target=worker, daemon=True).start()
# send 99 task requests to the worker
for item in range(99):
    q.put(item)
with print_semaphore:
    print('All task requests sent\n', end='')
# block until all tasks are done
q.join()
with print_semaphore:
    print('All work completed')
Another solution would be to introduce another queue, and have your threads put messages to that queue, instead of printing to stdout directly.
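A minimal sketch of that second approach might look like the following. The names printer, STOP, and the sink parameter are assumptions made for illustration: workers put whole lines on a message queue, and a single printer thread is the only one that writes them out.

```python
import queue
import threading

STOP = object()  # sentinel telling the printer thread to finish


def printer(messages, sink):
    """Only this thread writes output, so lines never interleave."""
    while True:
        msg = messages.get()
        if msg is STOP:
            return
        sink(msg)


def worker(tasks, messages):
    while True:
        item = tasks.get()
        messages.put("Working on %s" % item)
        messages.put("Finished %s" % item)
        tasks.task_done()


def run(num_items, sink=print, num_workers=3):
    tasks = queue.Queue()
    messages = queue.Queue()
    printer_thread = threading.Thread(target=printer, args=(messages, sink))
    printer_thread.start()
    for _ in range(num_workers):
        threading.Thread(target=worker, args=(tasks, messages),
                         daemon=True).start()
    for item in range(num_items):
        tasks.put(item)
    tasks.join()        # every task has been marked done
    messages.put(STOP)  # queued after all worker messages
    printer_thread.join()
```

Calling run(30) prints 60 whole lines. Lines from different workers can still alternate with each other, but no single line is ever split, because only one thread ever calls the sink.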

Python threading - with limited threads, iterate over n number of Items

Here is an example from the IBM Python threading tutorial (http://www.ibm.com/developerworks/aix/library/au-threadingpython/):
#!/usr/bin/env python
import Queue
import threading
import urllib2
import time
hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
         "http://ibm.com", "http://apple.com"]
queue = Queue.Queue()
class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue
    def run(self):
        while True:
            #grabs host from queue
            host = self.queue.get()
            #grabs urls of hosts and prints first 1024 bytes of page
            url = urllib2.urlopen(host)
            print url.read(1024)
            #signals to queue job is done
            self.queue.task_done()
start = time.time()
def main():
    #spawn a pool of threads, and pass them queue instance
    for i in range(5):
        t = ThreadUrl(queue)
        t.setDaemon(True)
        t.start()
    #populate queue with data
    for host in hosts:
        queue.put(host)
    #wait on the queue until everything has been processed
    queue.join()
main()
print "Elapsed Time: %s" % (time.time() - start)
The example here works perfectly. I am looking for a slight modification. Here the number of URLs is known (5, for example), and range(5) is used in the for loop to iterate over the URLs and process them.
What if I want to use only 5 threads to process 1000 URLs? When a thread completes a URL, that URL should be removed from the queue and a new URL added to the queue, and all of this should happen using the same thread.
I can check:
if self.queue.task_done():
    return host
This is the only way I can see to check whether the URL was processed successfully. Once it returns, I should remove the URL from the queue and add a new URL to the queue. How can I implement this using a queue?
Thanks,

That code will already do what you describe. If you put 1000 items into the queue instead of 5, they will be processed by those same 5 threads: each one will take an item from the queue, process it, then take a new one as long as there are items left in the queue. Note that get() already removes the item from the queue, and task_done() returns None, so it cannot be used as a success check.
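To make that concrete, here is a small Python 3 sketch in which 5 worker threads share 1000 queued items and a counter records how many each thread handled. The process function is a hypothetical stand-in for urlopen() so that nothing touches the network, and the example.com URLs in the usage note are placeholders.

```python
import queue
import threading
from collections import Counter


def process(url):
    """Hypothetical stand-in for urlopen(); no network access here."""
    return len(url)


def worker(tasks, handled, lock):
    while True:
        url = tasks.get()  # removes the item; no other thread will see it
        try:
            process(url)
            with lock:
                handled[threading.current_thread().name] += 1
        finally:
            tasks.task_done()  # always balance the get(), even on errors


def run(urls, num_threads=5):
    tasks = queue.Queue()
    handled = Counter()
    lock = threading.Lock()
    for i in range(num_threads):
        threading.Thread(target=worker, args=(tasks, handled, lock),
                         name="worker-%d" % i, daemon=True).start()
    for url in urls:
        tasks.put(url)
    tasks.join()  # returns once all items have been processed
    return handled
```

For example, run(["http://example.com/%d" % i for i in range(1000)]) returns per-thread counts that add up to 1000. How the items are split between the threads varies from run to run.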

Threading in python using queue

I wanted to use threading in Python to download a lot of web pages, and went through the following code from a website, which uses queues.
It uses an infinite while loop. Does each thread run continuously, without ending, until all of them are complete? Am I missing something?
#!/usr/bin/env python
import Queue
import threading
import urllib2
import time
hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
         "http://ibm.com", "http://apple.com"]
queue = Queue.Queue()
class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue
    def run(self):
        while True:
            #grabs host from queue
            host = self.queue.get()
            #grabs urls of hosts and prints first 1024 bytes of page
            url = urllib2.urlopen(host)
            print url.read(1024)
            #signals to queue job is done
            self.queue.task_done()
start = time.time()
def main():
    #spawn a pool of threads, and pass them queue instance
    for i in range(5):
        t = ThreadUrl(queue)
        t.setDaemon(True)
        t.start()
    #populate queue with data
    for host in hosts:
        queue.put(host)
    #wait on the queue until everything has been processed
    queue.join()
main()
print "Elapsed Time: %s" % (time.time() - start)

Setting the threads to be daemon threads causes them to exit when the main thread is done. But yes, you are correct that your threads will run continuously for as long as there is something in the queue; otherwise they will block.
The Queue documentation explains this detail.
The Python threading documentation explains the daemon part as well:
The entire Python program exits when no alive non-daemon threads are left.
So, when the queue is emptied, queue.join() returns, the main thread finishes, and the daemon threads die as the interpreter exits.
EDIT: Correction on default behavior for Queue

Your script works fine for me, so I assume you are asking what is going on so you can understand it better. Yes, your subclass puts each thread in an infinite loop, waiting for something to be put in the queue. When something is found, the thread grabs it and does its thing. Then, the critical part: it notifies the queue that it's done by calling queue.task_done(), and resumes waiting for another item in the queue.
While all this is going on in the worker threads, the main thread is waiting (join) until all the tasks in the queue are done, which is when the threads have called queue.task_done() the same number of times as items were put in the queue. At that point the main thread finishes and exits. Since these are daemon threads, they close down too.
This is cool stuff, threads and queues. It's one of the really good parts of Python. You will hear all kinds of stuff about how threading in Python is screwed up by the GIL and such. But if you know where to use threads (like in this case, with network I/O), they will really speed things up for you. The general rule is: if you are I/O bound, try and test threads; if you are CPU bound, threads are probably not a good idea, so maybe try processes instead.
good luck,
Mike
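If you would rather have the workers leave their loop explicitly instead of relying on the daemon flag, one common alternative is to send each worker a sentinel value. The sketch below is written for Python 3 and is only an illustration: the STOP sentinel, the doubling "work", and the function names are assumptions, not part of the original script.

```python
import queue
import threading

STOP = None  # sentinel value; assumes None is never a real task


def worker(tasks, done):
    while True:
        item = tasks.get()
        if item is STOP:
            tasks.task_done()
            return             # leave the loop instead of waiting forever
        done.append(item * 2)  # placeholder work; list.append is atomic in CPython
        tasks.task_done()


def run(items, num_threads=5):
    tasks = queue.Queue()
    done = []
    threads = [threading.Thread(target=worker, args=(tasks, done))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for item in items:
        tasks.put(item)
    for _ in threads:
        tasks.put(STOP)  # one sentinel per worker
    tasks.join()         # returns once task_done() has matched every put()
    for t in threads:
        t.join()         # the threads have really exited; no daemon flag needed
    return done
```

Here tasks.join() plays the same role as queue.join() in the script above, while the extra t.join() calls confirm that every thread has actually returned from its loop.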

I don't think Queue is necessary in this case. Using only Thread:
import threading, urllib2, time
hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
         "http://ibm.com", "http://apple.com"]
class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, host):
        threading.Thread.__init__(self)
        self.host = host
    def run(self):
        #grabs urls of hosts and prints first 1024 bytes of page
        url = urllib2.urlopen(self.host)
        print url.read(1024)
start = time.time()
def main():
    #spawn one thread per host
    threads = [ThreadUrl(host) for host in hosts]
    for t in threads:
        t.start()
    #wait for every download so the elapsed time below is meaningful
    for t in threads:
        t.join()
main()
print "Elapsed Time: %s" % (time.time() - start)

Making sure a worker process always terminate in zeroMQ

I am implementing a pipeline pattern with zeroMQ using the Python bindings.
Tasks are fanned out to workers, which listen for new tasks in an infinite loop like this:
while True:
    socks = dict(self.poller.poll())
    if self.receiver in socks and socks[self.receiver] == zmq.POLLIN:
        msg = self.receiver.recv_unicode(encoding='utf-8')
        self.process(msg)
    if self.hear in socks and socks[self.hear] == zmq.POLLIN:
        msg = self.hear.recv()
        print self.pid,":", msg
        sys.exit(0)
They exit when they get a message from the sink node confirming that it has received all the expected results.
However, a worker may miss such a message and never finish. What is the best way to make sure workers always finish when they have no way of knowing (other than through the message already mentioned) that there are no further tasks to process?
Here is the testing code I wrote for checking the workers' status:
#-*- coding:utf-8 -*-
"""
Test module containing tests for all modules of pypln
"""
import unittest
from servers.ventilator import Ventilator
from subprocess import Popen, PIPE
import time
class testWorkerModules(unittest.TestCase):
    def setUp(self):
        self.nw = 4
        #spawn 4 workers
        self.ws = [Popen(['python', 'workers/dummy_worker.py'], stdout=None) for i in range(self.nw)]
        #spawn a sink
        self.sink = Popen(['python', 'sinks/dummy_sink.py'], stdout=None)
        #start a ventilator
        self.V = Ventilator()
        # wait for workers and sinks to connect
        time.sleep(1)
    def test_send_unicode(self):
        '''
        Pushing unicode strings through workers to sinks.
        '''
        self.V.push_load([u'são joão' for i in xrange(80)])
        time.sleep(1)
        #[p.wait() for p in self.ws]#wait for the workers to terminate
        wsr = [p.poll() for p in self.ws]
        while None in wsr:
            print wsr, [p.pid for p in self.ws if p.poll() == None] #these are the unfinished workers
            time.sleep(0.5)
            wsr = [p.poll() for p in self.ws]
        self.sink.wait()
        self.sink = self.sink.returncode
        self.assertEqual([0]*self.nw, wsr)
        self.assertEqual(0, self.sink)
if __name__ == '__main__':
    unittest.main()

All the messaging stuff eventually ends up with heartbeats. If you (as a worker, a sink, or whatever) discover that a component you need to work with is dead, you can basically either try to connect somewhere else or kill yourself. So if you, as a worker, discover that the sink is no longer there, just exit. This also means that you may exit even though the sink is still there but the connection is broken. I am not sure you can do more than that, other than perhaps setting all the timeouts more reasonably.
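The heartbeat idea can be sketched without any zeroMQ code at all. In the Python 3 sketch below, HeartbeatMonitor, worker_loop, and the "HEARTBEAT" and "FINISHED" message strings are all hypothetical names chosen for illustration. The receive callable is assumed to return None when its own timeout elapses; with pyzmq that would presumably be built on a poll call with a timeout instead of the blocking poll() in the question.

```python
import time


class HeartbeatMonitor:
    """Tracks when a peer was last heard from (hypothetical helper)."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen = clock()

    def beat(self):
        """Call whenever any message from the peer arrives."""
        self.last_seen = self.clock()

    def expired(self):
        return self.clock() - self.last_seen > self.timeout


def worker_loop(receive, process, monitor):
    """receive() returns a message, or None if its own timeout elapsed."""
    while not monitor.expired():
        msg = receive()
        if msg is None:
            continue  # nothing arrived; re-check the deadline
        monitor.beat()
        if msg == "FINISHED":
            return "told to stop"
        if msg != "HEARTBEAT":
            process(msg)
    return "peer silent, giving up"
```

The design point is that the worker has two ways out of the loop: the explicit finish message, and silence lasting longer than the timeout. A missed finish message therefore delays the exit instead of preventing it.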

Pinging first available host in network subnets

I've written a small script in Python that pings all subnets of my school's wireless network and prints out the IP addresses and hostnames of computers that are connected to each subnet of the network. My current setup is that I'm relying on creating threads to handle each of the ping requests.
from threading import Thread
import subprocess
from Queue import Queue
import time
import socket
#wraps system ping command
def ping(i, q):
    """Pings address"""
    while True:
        ip = q.get()
        #print "Thread %s: Pinging %s" % (i, ip)
        result = subprocess.call("ping -n 1 %s" % ip, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        #Avoid flooding the network with ping requests
        time.sleep(3)
        if result == 0:
            try:
                hostname=socket.gethostbyaddr(ip)
                print "%s (%s): alive" % (ip,hostname[0])
            except:
                print "%s: alive"%ip
        q.task_done()
num_threads = 100
queue = Queue()
addresses=[]
#Append all possible IP addresses from all subnets on wireless network
for i in range(1,255):
    for j in range(1,254):
        addresses.append('128.119.%s.%s'%(str(i),str(j)))
#Spawn thread pool
for i in range(num_threads):
    worker = Thread(target=ping, args=(i, queue))
    worker.setDaemon(True)
    worker.start()
#Place work in queue
for ip in addresses:
    queue.put(ip)
#Wait until worker threads are done to exit
queue.join()
However, I want to modify my script so that it only seeks out the first available host in each subnet. That is, suppose I have the subnet 128.119.177.0/24 and the first available host is 128.119.177.20. I want my script to stop pinging the remaining hosts in 128.119.177.0/24 once I successfully contact 128.119.177.20, and I want to repeat that for every subnet on my network (128.119.0.1 - 128.119.255.254). Given my current setup, what would be the best way to make this change? I was thinking of using something like a list of Queues (where each Queue holds 255 IP addresses for one of the subnets) and having one thread process each queue (unless there is a limit on how many threads I can spawn in Python on Windows).
EDIT: I have played around with nmap (and Angry IP scanner) for this task, but I was interested in pursuing writing my own script.

The simplest thing would be to have a thread work through a whole subnet and stop scanning that subnet as soon as it finds a host.
UNTESTED
from threading import Thread
import subprocess
from Queue import Queue
import time
import socket
#wraps system ping command
def ping(i, q):
    """Pings address"""
    while True:
        subnet = q.get()
        # each IP address in subnet
        for ip in (subnet+str(x) for x in range(1,254)):
            #print "Thread %s: Pinging %s" % (i, ip)
            result = subprocess.call("ping -n 1 %s" % ip, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
            #Avoid flooding the network with ping requests
            time.sleep(3)
            if result == 0:
                try:
                    hostname=socket.gethostbyaddr(ip)
                    print "%s (%s): alive" % (ip,hostname[0])
                except:
                    print "%s: alive"%ip
                #first host found, skip the rest of this subnet
                break
        q.task_done()
num_threads = 100
queue = Queue()
#Put all possible subnets on wireless network into a queue
for i in range(1,255):
    queue.put('128.119.%s.'%i)
#Spawn thread pool
for i in range(num_threads):
    worker = Thread(target=ping, args=(i, queue))
    worker.setDaemon(True)
    worker.start()
#Wait until worker threads are done to exit
queue.join()

Since you know how many threads you had at the beginning of the run, you could periodically check the current number of running threads to see whether nowThreadCount < startThreadCount. If so, terminate the current thread.
PS: The easiest way would be to just clear the queue object, but I can't find a way to do that in the docs.
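For comparison, the "one queue item per subnet" idea can be sketched in Python 3 with the liveness check passed in as a function, so the logic can be tried without sending any pings. The names first_alive and scan, and the is_alive parameter, are assumptions for illustration; in a real script is_alive would wrap the ping call.

```python
import queue
import threading


def first_alive(subnet_prefix, is_alive):
    """Return the first responding address in the subnet, or None."""
    for last_octet in range(1, 255):
        ip = "%s%d" % (subnet_prefix, last_octet)
        if is_alive(ip):
            return ip  # stop scanning this subnet at the first hit
    return None


def worker(subnets, found, lock, is_alive):
    while True:
        prefix = subnets.get()
        try:
            ip = first_alive(prefix, is_alive)
            if ip is not None:
                with lock:
                    found[prefix] = ip
        finally:
            subnets.task_done()


def scan(prefixes, is_alive, num_threads=10):
    subnets = queue.Queue()
    found = {}
    lock = threading.Lock()
    for _ in range(num_threads):
        threading.Thread(target=worker,
                         args=(subnets, found, lock, is_alive),
                         daemon=True).start()
    for prefix in prefixes:
        subnets.put(prefix)
    subnets.join()
    return found
```

Because each subnet is a single unit of work, no thread has to be cancelled and the queue never has to be cleared: returning from first_alive is enough to skip the rest of that subnet.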
