I have a program in place that follows logic like this:
At the start of every hour, multiple directories receive a file that is continuously fed data. I'm developing a simple program that can read all the files simultaneously, and I've abstracted the tailing/reading part into a function, let's call it 'tail' for now. The external program feeding the data doesn't always run smoothly: sometimes a file will come in late, sometimes the next hour hits while the program is still feeding data into the stale file. I can't afford to lose any data. My solution looks something like this, using multiprocessing.Pool, with pseudocode in parts:
import os
import sys
import glob
import time
import multiprocessing
from datetime import datetime, timedelta

def process_data(logfile):
    num_retries = 5
    while num_retries > 0:
        if os.path.isfile(logfile):
            for record in tail(logfile):
                do_something(record)
        else:
            num_retries -= 1
            time.sleep(30)

def tail(logfile):
    logfile = open(logfile, 'r')
    logfile.seek(0, 2)
    wait_time = 0
    while True:
        line = logfile.readline()
        if line:
            wait_time = 0
            yield line
        else:
            if wait_time >= 360:
                break
            wait_time += 1
            time.sleep(1)
            continue
if __name__ == '__main__':
    start_time = sys.argv[1]
    next_hour = None
    while True:
        logdirs = glob.glob("/opt/logs/plog*")
        current_date = datetime.now()
        current_hour = current_date.strftime('%H')
        current_format = datetime.now().strftime("%Y%m%d%H")
        logfiles = [logdir + '/some/custom/path/tofile.log' for logdir in logdirs]
        if not next_hour:
            next_hour = current_date + timedelta(hours=1)
        if current_hour == next_hour.strftime('%H') or current_hour == start_time:
            start_time = None
            pool = multiprocessing.Pool()
            pool.map(process_data, logfiles)
            pool.close()
            pool.join()
            next_hour = current_date + timedelta(hours=1)
        time.sleep(30)
Here's what I'm observing when I have logging implemented at the process level:
all files in each directory are getting read appropriately
when the next hour hits, there's a delay of 360s (6 minutes) before the next set of files get read
so if hour 4 ends, a new pool doesn't get created for hour 5 until processes for hour 4 finish
What I'm looking for: I'd like to keep using multiprocessing, but I can't figure out why the code inside the main while loop doesn't continue until the previous Pool of processes finishes. I have tried the hourly logic in other examples without multiprocessing and it works fine. I'm led to believe this has to do with the Pool class, and I'm hoping for advice on how to create a new Pool for the new hour and begin processing the new files even while the previous Pool is still active, even if that means a lot of processes.
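For what it's worth, here is a minimal sketch (my own illustration, not the asker's code) of one way around the blocking behaviour: pool.map blocks the calling process until every worker finishes, which is exactly the stall described above. Swapping it for map_async lets the hourly loop keep running while the previous batch is still tailing. The paths and the process_data body are placeholders taken from the question.

import glob
import multiprocessing
import time
from datetime import datetime, timedelta

def process_data(logfile):
    ...  # tail/read the file as in the question

if __name__ == '__main__':
    active = []        # (pool, async_result) pairs for batches still running
    next_hour = None
    while True:
        now = datetime.now()
        if next_hour is None or now >= next_hour:
            logfiles = [d + '/some/custom/path/tofile.log'
                        for d in glob.glob("/opt/logs/plog*")]
            pool = multiprocessing.Pool()
            result = pool.map_async(process_data, logfiles)  # returns immediately
            pool.close()                                     # no further tasks for this pool
            active.append((pool, result))
            next_hour = now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
        # reap batches that have finished, keep the rest running
        still_running = []
        for pool, result in active:
            if result.ready():
                pool.join()
            else:
                still_running.append((pool, result))
        active = still_running
        time.sleep(30)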
Related
Here I am retrieving different webpages, counting the frequency of letters, and getting the final dictionary as output. In this case I go through 20 webpages (for i in range(1000,1020)), so I ran a loop and created 20 threads. It serves the purpose and reduces a lot of time compared to a single thread. What if I wanted to retrieve 1000 webpages: should I loop to create 1000 separate threads, or is there a way to give a chunk of webpages to each thread? Is there any limitation on the number of threads I can create?
import json
import time
import urllib.request
from threading import Thread

finished_tasks = 0

def count_letters(url, frequency):
    response = urllib.request.urlopen(url)
    txt = str(response.read())
    for l in txt:
        letter = l.lower()
        if letter in frequency:
            frequency[letter] += 1
    global finished_tasks
    finished_tasks += 1

def main():
    frequency = {}
    for c in 'abcdefghijklmnopqrstuvwxyz':
        frequency[c] = 0
    start_time = time.time()
    for i in range(1000, 1020):
        p = Thread(target=count_letters, args=('https://www.rfc-editor.org/rfc/rfc' + str(i) + '.txt', frequency))
        p.start()
    while finished_tasks < 20:
        time.sleep(0.5)
    end_time = time.time()
    print(end_time - start_time)
    print(frequency)

main()
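A minimal sketch of the chunking idea the question asks about (my own illustration, not part of the original code): instead of one thread per page, a fixed-size ThreadPoolExecutor reuses a bounded number of threads and hands pages out as workers become free, which scales to 1000 URLs without creating 1000 threads. The URL pattern is borrowed from the question.

# Sketch only: a bounded thread pool instead of one thread per URL.
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url):
    with urllib.request.urlopen(url) as response:
        return response.read()

urls = ['https://www.rfc-editor.org/rfc/rfc%d.txt' % i for i in range(1000, 2000)]

# 20 worker threads service all 1000 URLs; executor.map hands the next URL
# to whichever thread finishes first.
with ThreadPoolExecutor(max_workers=20) as executor:
    for url, body in zip(urls, executor.map(fetch, urls)):
        print(url, len(body))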
I have data, which is in a text file. Each line is a computation to do. This file has around 100 000 000 lines.
First I load everything into RAM, then I have a method that performs the computation and gives the following result:
def process(data_line):
    # do computation
    return result
Then I call it like this with packets of 2000 lines, saving the results to disk:
POOL_SIZE = 15  # nbcore - 1
PACKET_SIZE = 2000

pool = Pool(processes=POOL_SIZE)
data_lines = util.load_data_lines(to_be_computed_filename)
number_of_packets = int(number_of_lines / PACKET_SIZE)

for i in range(number_of_packets):
    lines_packet = data_lines[:PACKET_SIZE]
    data_lines = data_lines[PACKET_SIZE:]
    results = pool.map(process, lines_packet)
    save_computed_data_to_disk(to_be_computed_filename, results)

# process the last packet, which is smaller
results.extend(pool.map(process, data_lines))
save_computed_data_to_disk(to_be_computed_filename, results)
print("Done")
The problem is that while I am writing to disk, my CPU is computing nothing, even though I have 8 cores. Looking at the task manager, it seems that quite a lot of CPU time is lost.
I have to write to disk after having completed my computation because the results are 1000 times larger than the input.
Anyway, I have to write to the disk at some point; if time is not lost here, it will be lost later.
What could I do to allow one core to write to disk, while still computing with the others? Switch to C?
At this rate I can process 100 million lines in 75 h, but I have 12 billion lines to process, so any improvement is welcome.
example of timings:
Processing packet 2/15 953 of C:/processing/drop_zone\to_be_processed_txt_files\t_to_compute_303620.txt
Launching task and waiting for it to finish...
Task completed, Continuing
Packet was processed in 11.534576654434204 seconds
We are currently going at a rate of 0.002306915330886841 sec/words
Which is 433.47928145051293 words per seconds
Saving in temporary file
Printing writing 5000 computed line to disk took 0.04400920867919922 seconds
saving word to resume from : 06 20 25 00 00
Estimated time for processing the remaining packets is : 51:19:25
Note: SharedMemory requires Python >= 3.8, since that is where it first appeared.
Start 3 kinds of processes: Reader, Processor(s), Writer.
Have Reader process read the file incrementally, sharing the result via shared_memory and Queue.
Have the Processor(s) consume the Queue, read the data from shared_memory, and return the result(s) via another Queue, again backed by shared_memory.
Have the Writer process consume the second Queue, writing to the destination file.
Have them all communicate through, say, some Events or a DictProxy, with the MainProcess, which acts as the orchestrator.
Example:
import time
import random
import hashlib
import multiprocessing as MP
from queue import Queue, Empty
# noinspection PyCompatibility
from multiprocessing.shared_memory import SharedMemory
from typing import Dict, List

def readerfunc(
        shm_arr: List[SharedMemory], q_out: Queue, procr_ready: Dict[str, bool]
):
    numshm = len(shm_arr)
    for batch in range(1, 6):
        print(f"Reading batch #{batch}")
        for shm in shm_arr:
            #### Simulated Reading ####
            for j in range(0, shm.size):
                shm.buf[j] = random.randint(0, 255)
            #### ####
            q_out.put((batch, shm))
        # Need to sync here because we're reusing the same SharedMemory,
        # so gotta wait until all processors are done before sending the
        # next batch
        while not q_out.empty() or not all(procr_ready.values()):
            time.sleep(1.0)

def processorfunc(
        q_in: Queue, q_out: Queue, suicide: type(MP.Event()), procr_ready: Dict[str, bool]
):
    pname = MP.current_process().name
    procr_ready[pname] = False
    while True:
        time.sleep(1.0)
        procr_ready[pname] = True
        if q_in.empty() and suicide.is_set():
            break
        try:
            batch, shm = q_in.get_nowait()
        except Empty:
            continue
        print(pname, "got batch", batch)
        procr_ready[pname] = False
        #### Simulated Processing ####
        h = hashlib.blake2b(shm.buf, digest_size=4, person=b"processor")
        time.sleep(random.uniform(5.0, 7.0))
        #### ####
        q_out.put((pname, h.hexdigest()))

def writerfunc(q_in: Queue, suicide: type(MP.Event())):
    while True:
        time.sleep(1.0)
        if q_in.empty() and suicide.is_set():
            break
        try:
            pname, digest = q_in.get_nowait()
        except Empty:
            continue
        print("Writing", pname, digest)
        #### Simulated Writing ####
        time.sleep(random.uniform(3.0, 6.0))
        #### ####
        print("Writing", pname, digest, "done")

def main():
    shm_arr = [
        SharedMemory(create=True, size=1024)
        for _ in range(0, 5)
    ]
    q_read = MP.Queue()
    q_write = MP.Queue()
    procr_ready = MP.Manager().dict()
    poison = MP.Event()
    poison.clear()
    reader = MP.Process(target=readerfunc, args=(shm_arr, q_read, procr_ready))
    procrs = []
    for n in range(0, 3):
        p = MP.Process(
            target=processorfunc, name=f"Proc{n}", args=(q_read, q_write, poison, procr_ready)
        )
        procrs.append(p)
    writer = MP.Process(target=writerfunc, args=(q_write, poison))
    reader.start()
    [p.start() for p in procrs]
    writer.start()
    reader.join()
    print("Reader has ended")
    while not all(procr_ready.values()):
        time.sleep(5.0)
    poison.set()
    [p.join() for p in procrs]
    print("Processors have ended")
    writer.join()
    print("Writer has ended")
    [shm.close() for shm in shm_arr]
    [shm.unlink() for shm in shm_arr]

if __name__ == '__main__':
    main()
You say you have 8 cores, yet you have:
POOL_SIZE = 15 #nbcore - 1
Assuming you want to leave one processor free (presumably for the main process?), why wouldn't this number be 7? But why do you even want to leave a processor free? You are making successive calls to map, and while the main process is waiting for those calls to return it requires essentially no CPU. That is why, if you do not specify a pool size when you instantiate your pool, it defaults to the number of CPUs you have and not that number minus one. I will have more to say about this below.
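As a quick, illustrative check (my own snippet, not from the question): the pool's default size is the machine's CPU count.

import multiprocessing

if __name__ == '__main__':
    # With no `processes` argument, Pool() starts cpu_count() worker processes.
    print(multiprocessing.cpu_count())       # e.g. 8 on an 8-core machine
    with multiprocessing.Pool() as pool:     # defaults to cpu_count() workers
        print(pool.map(abs, [-1, -2, -3]))   # trivial job just to exercise the pool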
Since you have a very large, in-memory list, it is possible that you are wasting cycles in your loop by rewriting this list on each iteration. Instead, you can just take a slice of the list and pass that as the iterable argument to map:
POOL_SIZE = 15  # ????
PACKET_SIZE = 2000

data_lines = util.load_data_lines(to_be_computed_filename)
number_of_packets, remainder = divmod(number_of_lines, PACKET_SIZE)

with Pool(processes=POOL_SIZE) as pool:
    offset = 0
    for i in range(number_of_packets):
        results = pool.map(process, data_lines[offset:offset+PACKET_SIZE])
        offset += PACKET_SIZE
        save_computed_data_to_disk(to_be_computed_filename, results)
    if remainder:
        results = pool.map(process, data_lines[offset:offset+remainder])
        save_computed_data_to_disk(to_be_computed_filename, results)
print("Done")
Between each call to map, the main process is writing out the results to to_be_computed_filename. In the meantime, every process in your pool is sitting idle. The saving should be given to another process (actually a thread running under the main process):
import multiprocessing
import queue
import threading

POOL_SIZE = 15  # ????
PACKET_SIZE = 2000

data_lines = util.load_data_lines(to_be_computed_filename)
number_of_packets, remainder = divmod(number_of_lines, PACKET_SIZE)

def save_data(q):
    while True:
        results = q.get()
        if results is None:
            return  # signal to terminate
        save_computed_data_to_disk(to_be_computed_filename, results)

q = queue.Queue()
t = threading.Thread(target=save_data, args=(q,))
t.start()

with Pool(processes=POOL_SIZE) as pool:
    offset = 0
    for i in range(number_of_packets):
        results = pool.map(process, data_lines[offset:offset+PACKET_SIZE])
        offset += PACKET_SIZE
        q.put(results)
    if remainder:
        results = pool.map(process, data_lines[offset:offset+remainder])
        q.put(results)
q.put(None)
t.join()  # wait for thread to terminate
print("Done")
I've chosen to run save_data in a thread of the main process. It could also be another process, in which case you would need to use a multiprocessing.Queue instance. But I figured the main thread is mostly waiting for map to complete, so there would not be much competition for the GIL. If you do not leave a processor free for the saving thread, save_data may end up doing most of the saving only after all of the results have been created; you would need to experiment a bit with this. A process-based variant is sketched below.
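For completeness, here is a minimal sketch of that process-based variant (my own illustration under the answer's assumptions; process, util, to_be_computed_filename and save_computed_data_to_disk are the question's placeholder names):

# Sketch: save_data as a separate process fed via a multiprocessing.Queue.
import multiprocessing
from multiprocessing import Pool

POOL_SIZE = 7        # assumed: 8 cores, with one left for the writer process
PACKET_SIZE = 2000

def save_data(q):
    while True:
        results = q.get()
        if results is None:   # sentinel: no more results
            return
        save_computed_data_to_disk(to_be_computed_filename, results)

if __name__ == '__main__':
    data_lines = util.load_data_lines(to_be_computed_filename)
    q = multiprocessing.Queue()
    writer = multiprocessing.Process(target=save_data, args=(q,))
    writer.start()
    with Pool(processes=POOL_SIZE) as pool:
        for offset in range(0, len(data_lines), PACKET_SIZE):
            q.put(pool.map(process, data_lines[offset:offset + PACKET_SIZE]))
    q.put(None)      # tell the writer to finish
    writer.join()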
Ideally, I would also modify the reading of the input file so that it does not have to be read entirely into memory first, but is instead read line by line, yielding 2000-line chunks and submitting those as jobs for map to process:
import multiprocessing
import queue
import threading

POOL_SIZE = 15  # ????
PACKET_SIZE = 2000

def save_data(q):
    while True:
        results = q.get()
        if results is None:
            return  # signal to terminate
        save_computed_data_to_disk(to_be_computed_filename, results)

def read_data():
    """
    yield lists of PACKET_SIZE lines
    """
    lines = []
    with open(some_file, 'r') as f:
        for line in iter(f.readline, ''):
            lines.append(line)
            if len(lines) == PACKET_SIZE:
                yield lines
                lines = []
    if lines:
        yield lines

q = queue.Queue()
t = threading.Thread(target=save_data, args=(q,))
t.start()

with Pool(processes=POOL_SIZE) as pool:
    for l in read_data():
        results = pool.map(process, l)
        q.put(results)
q.put(None)
t.join()  # wait for thread to terminate
print("Done")
I made two assumptions: the writing is I/O-bound, not CPU-bound, meaning that throwing more cores at writing would not improve performance; and the process function contains some heavy computation.
I would approach it differently:
Split up the large list into a list of lists
Feed those chunks to the processes
Store the total result
Here is the example code:
import multiprocessing as mp

data_lines = [0] * 10000  # read it from file
size = 2000

# Split the list into a list of lists (with chunk size `size`)
work = [data_lines[i:i + size] for i in range(0, len(data_lines), size)]

def process(data):
    result = len(data)  # do something fancy
    return result

with mp.Pool() as p:
    result = p.map(process, work)

save_computed_data_to_disk(file_name, result)
As a side note: you may also want to have a look at numpy or pandas (depending on the data), because it sounds like you want to go in that direction.
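For instance, a hedged sketch of what that might look like with pandas (my own illustration; the file name, separator and computation are placeholders, not the asker's data): read_csv with chunksize streams the file in DataFrame-sized pieces instead of loading 100 million lines at once.

import pandas as pd

# Placeholder names: adjust the path, separator and computation to the real data.
for chunk in pd.read_csv("to_be_computed.txt", header=None, chunksize=2000):
    results = chunk[0].map(len)                 # stand-in for the real computation
    results.to_csv("computed.txt", mode="a", index=False, header=False)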
The first thing that comes to mind is to run the saving function in a thread. This removes the bottleneck of waiting on the disk write. Like so:
from concurrent.futures import ThreadPoolExecutor, ALL_COMPLETED
import concurrent.futures

executor = ThreadPoolExecutor(max_workers=2)
saving_futures = []

future = executor.submit(save_computed_data_to_disk, to_be_computed_filename, results)
saving_futures.append(future)
...
concurrent.futures.wait(saving_futures, return_when=ALL_COMPLETED)  # wait until everything is saved to disk after processing
print("Done")
I am adapting the Python script in this project (expanded below) so that it updates a JSON file's elements instead of the InitialState streamer. However, with the multiple threads that the script opens, it is impossible to cleanly write the data from each thread back to the file, as it would be read, changed, and written back in all threads at the same time. Since there can only be one file, no version will ever be accurate, because the last thread to write would overwrite all the others.
Question: How can I update the states in the JSON file from each thread (simultaneously) without it affecting the other threads' writing operations or locking up the file?
The JSON file contains the occupants' status that I would like to manipulate with the Python script:
{
"janeHome": "false",
"johnHome": "false",
"jennyHome": "false",
"jamesHome": "false"
}
This is the python script:
import subprocess
import json
from time import sleep
from threading import Thread

# Edit these for how many people/devices you want to track
occupant = ["Jane", "John", "Jenny", "James"]

# MAC addresses for our phones
address = ["11:22:33:44:55:66", "77:88:99:00:11:22", "33:44:55:66:77:88", "99:00:11:22:33:44"]

# Sleep once right when this script is called to give the Pi enough time
# to connect to the network
sleep(60)

# Some arrays to help minimize streaming and account for devices
# disappearing from the network when asleep
firstRun = [1] * len(occupant)
presentSent = [0] * len(occupant)
notPresentSent = [0] * len(occupant)
counter = [0] * len(occupant)

# Function that checks for device presence
def whosHere(i):
    # 30 second pause to allow main thread to finish arp-scan and populate output
    sleep(30)
    # Loop through checking for devices and counting if they're not present
    while True:
        # Exits thread if Keyboard Interrupt occurs
        if stop == True:
            print("Exiting Thread")
            exit()
        else:
            pass
        # If a listed device address is present print
        if address[i] in output:
            print(occupant[i] + "'s device is connected")
            if presentSent[i] == 0:
                # TODO: UPDATE THIS OCCUPANT'S STATUS TO TRUE
                # Reset counters so another stream isn't sent if the device
                # is still present
                firstRun[i] = 0
                presentSent[i] = 1
                notPresentSent[i] = 0
                counter[i] = 0
                sleep(900)
            else:
                # If a stream's already been sent, just wait for 15 minutes
                counter[i] = 0
                sleep(900)
        # If a listed device address is not present, print and stream
        else:
            print(occupant[i] + "'s device is not connected")
            # Only consider a device offline if its counter has reached 30
            # This is the same as 15 minutes passing
            if counter[i] == 30 or firstRun[i] == 1:
                firstRun[i] = 0
                if notPresentSent[i] == 0:
                    # TODO: UPDATE THIS OCCUPANT'S STATUS TO FALSE
                    # Reset counters so another stream isn't sent if the device
                    # is still present
                    notPresentSent[i] = 1
                    presentSent[i] = 0
                    counter[i] = 0
                else:
                    # If a stream's already been sent, wait 30 seconds
                    counter[i] = 0
                    sleep(30)
            # Count how many 30 second intervals have happened since the device
            # disappeared from the network
            else:
                counter[i] = counter[i] + 1
                print(occupant[i] + "'s counter at " + str(counter[i]))
                sleep(30)

# Main thread
try:
    # Initialize a variable to trigger threads to exit when True
    global stop
    stop = False
    # Start the thread(s)
    # It will start as many threads as there are values in the occupant array
    for i in range(len(occupant)):
        t = Thread(target=whosHere, args=(i,))
        t.start()
    while True:
        # Make output global so the threads can see it
        global output
        # Reads existing JSON file into buffer
        with open("data.json", "r") as jsonFile:
            data = json.load(jsonFile)
            jsonFile.close()
        # Assign list of devices on the network to "output"
        output = subprocess.check_output("arp-scan -interface en1 --localnet -l", shell=True)
        temp = data["janeHome"]
        data["janeHome"] = # RETURNED STATE
        data["johnHome"] = # RETURNED STATE
        data["jennyHome"] = # RETURNED STATE
        data["jamesHome"] = # RETURNED STATE
        with open("data.json", "w") as jsonFile:
            json.dump(data, jsonFile)
            jsonFile.close()
        # Wait 30 seconds between scans
        sleep(30)
except KeyboardInterrupt:
    # On a keyboard interrupt signal threads to exit
    stop = True
    exit()
I think we can all agree that the best idea would be to return the data from each thread to the main thread and write it to the file in one place. But here is where it gets confusing: with each thread checking for a different person, how can the state be passed back to main for writing?
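One hedged sketch of that idea (my own illustration, not part of the project): each presence thread pushes (key, state) updates onto a queue, and only the main thread touches data.json, so there is never a concurrent write. The names mirror the script above; the detection logic is elided.

# Sketch: threads report state changes via a Queue; only main writes the JSON.
import json
import queue
from threading import Thread
from time import sleep

updates = queue.Queue()          # (key, bool) messages from the worker threads

def whosHere(i, key):
    # ... detection logic as in the script above ...
    # instead of writing the file, report the new state:
    updates.put((key, True))     # or (key, False) when the device disappears

keys = ["janeHome", "johnHome", "jennyHome", "jamesHome"]
for i, key in enumerate(keys):
    Thread(target=whosHere, args=(i, key), daemon=True).start()

while True:
    # Drain everything the threads have reported since the last pass,
    # then rewrite the file once, from the main thread only.
    pending = []
    while not updates.empty():
        pending.append(updates.get())
    if pending:
        with open("data.json", "r") as f:
            data = json.load(f)
        for key, state in pending:
            data[key] = state
        with open("data.json", "w") as f:
            json.dump(data, f)
    sleep(30)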
I'm using the following function to run a scraping job that starts at 9 AM and is expected to stop at 4 PM. But it stops well before that, contrary to what I expect.
Let's say a scrape cycle starts and finishes after T seconds. If T < 1 sec, it sleeps for (1-T) sec, then repeats the cycle. The loop starts at 9 AM and is supposed to stop at 4 PM.
With nosleep = False it doesn't terminate immediately; instead it runs for some time and eventually stops writing any new data to the CSV, though the function doesn't exit.
If I remove the sleeping for (1-T) sec by setting nosleep = True, then it runs fine.
Please let me know what is going wrong with this fragment.
Edit: the same code works fine when ported to Python 3.4, i.e. it doesn't stop scraping when nosleep = False. Earlier I was executing it in Python 2.7.
def getOCdata(ed='25FEB2016', start=540*60, stop=960*60, waitFlag=True, nosleep=True):
    ptime = datetime.now()
    while True:
        if (ptime.hour*60*60 + ptime.minute*60 + ptime.second) > stop:
            break
        else:
            try:
                ptime = datetime.now()
                scrapedata = pd.DataFrame(scrapefn(ed))
                scrapedata = scrapedata.ix[:, 1:21]
                scrapedata['timestamp'] = ptime
                scrapedata.to_csv(datafile, index=False, mode='a', header=False)
                if nosleep == False:
                    end = datetime.now()
                    tdiff = end - ptime
                    time.sleep(1 - (tdiff.seconds + tdiff.microseconds*1.0/10**6))
            except:
                continue
    return 0
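As an aside, here is a minimal sketch of a more defensive way to compute that pause (my own illustration, not the asker's code): time.sleep() raises an exception when given a negative value, and a cycle that takes longer than one second makes 1 - T negative; using total_seconds() and clamping at zero sidesteps that case entirely.

# Sketch: clamp the pause so a slow cycle never produces a negative sleep.
import time
from datetime import datetime

cycle_start = datetime.now()
# ... one scrape cycle would run here ...
elapsed = (datetime.now() - cycle_start).total_seconds()
time.sleep(max(0.0, 1.0 - elapsed))   # never negative, even for slow cycles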
So I have been trying to parallelize some internet connections in Python. I have been using the multiprocessing module so I can get around the Global Interpreter Lock, but it seems that the system only gives Python one open connection port, or at least it only allows one connection to happen at a time. Here is an example of what I mean.
*Note that this is running on a Linux server
from multiprocessing import Process, Queue
import urllib
import random

# Generate 10,000 random urls to test and put them in the queue
queue = Queue()
for each in range(10000):
    rand_num = random.randint(1000, 10000)
    url = ('http://www.' + str(rand_num) + '.com')
    queue.put(url)

# Main function for checking to see if generated url is active
def check(q):
    while True:
        try:
            url = q.get(False)
            try:
                request = urllib.urlopen(url)
                del request
                print url + ' is an active url!'
            except:
                print url + ' is not an active url!'
        except:
            if q.empty():
                break

# Then start all the threads (50)
for thread in range(50):
    task = Process(target=check, args=(queue,))
    task.start()
So if you run this you will notice that it starts 50 instances of the function but only runs one at a time. You may think that the 'Global Interpreter Lock' is doing this but it isn't. Try changing the function to a mathematical function instead of a network request and you will see that all fifty threads run simultaneously.
So will I have to work with sockets? Or is there something I can do that will give python access to more ports? Or is there something I am not seeing? Let me know what you think! Thanks!
*Edit
So I wrote this script to test things better with the requests library. It seems as though I had not tested it very well with this before. (I had mainly used urllib and urllib2)
from multiprocessing import Process, Queue
from threading import Thread
from Queue import Queue as Q
import requests
import time

# A main timestamp
main_time = time.time()

# Generate 100 urls to test and put them in the queue
queue = Queue()
for each in range(100):
    url = ('http://www.' + str(each) + '.com')
    queue.put(url)

# Timer queue
time_queue = Queue()

# Main function for checking to see if generated url is active
def check(q, t_q):  # args are queue and time_queue
    while True:
        try:
            url = q.get(False)
            # Make a timestamp
            t = time.time()
            try:
                request = requests.head(url, timeout=5)
                t = time.time() - t
                t_q.put(t)
                del request
            except:
                t = time.time() - t
                t_q.put(t)
        except:
            break

# Then start all the threads (20)
thread_list = []
for thread in range(20):
    task = Process(target=check, args=(queue, time_queue))
    task.start()
    thread_list.append(task)

# Join all the threads so the main process doesn't quit
for each in thread_list:
    each.join()

main_time_end = time.time()

# Put the timerQueue into a list to get the average
time_queue_list = []
while True:
    try:
        time_queue_list.append(time_queue.get(False))
    except:
        break

# Results of the time
average_response = sum(time_queue_list) / float(len(time_queue_list))
total_time = main_time_end - main_time
line = "Multiprocessing: Average response time: %s sec. -- Total time: %s sec." % (average_response, total_time)
print line

# A main timestamp
main_time = time.time()

# Generate 100 urls to test and put them in the queue
queue = Q()
for each in range(100):
    url = ('http://www.' + str(each) + '.com')
    queue.put(url)

# Timer queue
time_queue = Queue()

# Main function for checking to see if generated url is active
def check(q, t_q):  # args are queue and time_queue
    while True:
        try:
            url = q.get(False)
            # Make a timestamp
            t = time.time()
            try:
                request = requests.head(url, timeout=5)
                t = time.time() - t
                t_q.put(t)
                del request
            except:
                t = time.time() - t
                t_q.put(t)
        except:
            break

# Then start all the threads (20)
thread_list = []
for thread in range(20):
    task = Thread(target=check, args=(queue, time_queue))
    task.start()
    thread_list.append(task)

# Join all the threads so the main process doesn't quit
for each in thread_list:
    each.join()

main_time_end = time.time()

# Put the timerQueue into a list to get the average
time_queue_list = []
while True:
    try:
        time_queue_list.append(time_queue.get(False))
    except:
        break

# Results of the time
average_response = sum(time_queue_list) / float(len(time_queue_list))
total_time = main_time_end - main_time
line = "Standard Threading: Average response time: %s sec. -- Total time: %s sec." % (average_response, total_time)
print line

# Do the same thing all over again but this time do each url one at a time
# A main timestamp
main_time = time.time()

# Generate 100 urls and test them
timer_list = []
for each in range(100):
    url = ('http://www.' + str(each) + '.com')
    t = time.time()
    try:
        request = requests.head(url, timeout=5)
        timer_list.append(time.time() - t)
    except:
        timer_list.append(time.time() - t)

main_time_end = time.time()

# Results of the time
average_response = sum(timer_list) / float(len(timer_list))
total_time = main_time_end - main_time
line = "Not using threads: Average response time: %s sec. -- Total time: %s sec." % (average_response, total_time)
print line
As you can see, it multithreads very well. In fact, most of my tests show that the threading module is actually faster than the multiprocessing module. (I don't understand why!) Here are some of my results.
Multiprocessing: Average response time: 2.40511314869 sec. -- Total time: 25.6876308918 sec.
Standard Threading: Average response time: 2.2179402256 sec. -- Total time: 24.2941861153 sec.
Not using threads: Average response time: 2.1740363431 sec. -- Total time: 217.404567957 sec.
This was done on my home network, the response time on my server is much faster. I think my question has been answered indirectly, since I was having my problems on a much more complex script. All of the suggestions helped me optimize it very well. Thanks to everyone!
it starts 50 instances on the function but only runs one at a time
You have misinterpreted the results of htop. Only a few, if any, copies of python will be runnable at any given instant. Most of them will be blocked, waiting for network I/O.
The processes are, in fact, running parallel.
Try changing the function to a mathematical function instead of a network request and you will see that all fifty threads run simultaneously.
Changing the task to a mathematical function merely illustrates the difference between CPU-bound (e.g. math) and IO-bound (e.g. urlopen) processes. The former is always runnable, the latter is rarely runnable.
it only prints one at a time. If it was actually running multiple processes it would print many out at once.
It prints one at a time because you are writing lines to a terminal. Because the lines are indistinguishable, you wouldn't be able to tell if they are written all by one thread, or each by a separate thread in turn.
First of all, using multiprocessing to parallelize network I/O is overkill. Using the built-in threading module or a lightweight greenlet library like gevent is a much better option with less overhead. The GIL has nothing to do with blocking I/O calls, so you don't have to worry about it at all.
Secondly, an easy way to see if your subprocesses/threads/greenlets are running in parallel if you are monitoring stdout is to print out something at the very beginning of the function, right after the subprocesses/threads/greenlets are spawned. For example, modify your check() function like so
def check(q):
    print 'Start checking urls!'
    while True:
        ...
If your code is correct, you should see many Start checking urls! lines printed out before any of the url + ' is [not] an active url!' printed out. It works on my machine, so it looks like your code is correct.
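As a rough sketch of the gevent route mentioned above (my own illustration, not part of the answer; the URL pattern comes from the question): monkey-patching makes the blocking socket calls cooperative, so many greenlets can wait on the network concurrently.

# Sketch: greenlet-based URL checking with gevent (Python 3 syntax).
from gevent import monkey
monkey.patch_all()          # make socket operations cooperative

import random
import urllib.request
import gevent

def check(url):
    try:
        urllib.request.urlopen(url, timeout=5)
        print(url + ' is an active url!')
    except Exception:
        print(url + ' is not an active url!')

urls = ['http://www.%d.com' % random.randint(1000, 10000) for _ in range(200)]
jobs = [gevent.spawn(check, u) for u in urls]
gevent.joinall(jobs)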
It appears that your issue is actually with the serial behavior of gethostbyname(3). This is discussed in this SO thread.
Try this code that uses the Twisted asynchronous I/O library:
import random
import sys

from twisted.internet import reactor
from twisted.internet import defer
from twisted.internet.task import cooperate
from twisted.web import client

SIMULTANEOUS_CONNECTIONS = 25

# Generate 10,000 random urls to test and put them in the queue
pages = []
for each in range(10000):
    rand_num = random.randint(1000, 10000)
    url = ('http://www.' + str(rand_num) + '.com')
    pages.append(url)

# Main function for checking to see if generated url is active
def check(page):
    def successback(data, page):
        print "{} is an active URL!".format(page)

    def errback(err, page):
        print "{} is not an active URL!; errmsg:{}".format(page, err.value)

    d = client.getPage(page, timeout=3)  # timeout in seconds
    d.addCallback(successback, page)
    d.addErrback(errback, page)
    return d

def generate_checks(pages):
    for i in xrange(0, len(pages)):
        page = pages[i]
        #print "Page no. {}".format(i)
        yield check(page)

def work(pages):
    print "started work(): {}".format(len(pages))
    batch_size = len(pages) / SIMULTANEOUS_CONNECTIONS
    for i in xrange(0, len(pages), batch_size):
        task = cooperate(generate_checks(pages[i:i+batch_size]))

print "starting..."
reactor.callWhenRunning(work, pages)
reactor.run()