Multiprocessing is stuck at the map function - python

I am currently making a http request that has over 3,000,000 records by using the pagination method. Sometimes the call fails because of a 104 server error, so I retry and it works on the second or third time.
Because there are so many requests, I am using the multiprocess function in python to speed this along. I'm using a ubuntu 16 machine, python3.5 and an 8 core machine. The odd thing here is that all the files get written, and the process "Finishes" i.e reaches the end of the range (regardless of the size so 1 million or 2 million or 3 million) but it wont pass the pool line. So my tmux sessions just says "Working on date (lastrecordnumber)" I need that to occur so I can send an email to let me know the task has finished.
I've tried pool.map(); pool.aysnc(); pool.map_async(), they all seem to have the same issue.
import http.client
from multiprocessing import Pool
from functools import partial
def get_raw_data(auth, url_conn, skip):
headers = {'authorization': "Basic {}".format(auth)}
sucess = None
loop = 0
while not sucess:
try:
conn = http.client.HTTPSConnection(url_conn)
conn.request("GET", "SOME_API&$skip={}".format(skip), headers=headers)
res = conn.getresponse()
data = res.read()
raw_data = json.loads(data.decode("utf-8"))
sucess = 'yes'
except Exception as e:
print('stuck in loop {} {} {}'.format(skip, loop, e))
loop += 1
with open('{}.json'.format(skip), 'w') as outfile:
json.dump(raw_data, outfile)
def process_skips(skip):
print('Working on date {}'.format(skip))
get_raw_data(skip)
if __name__ == '__main__':
print("We started at {}".format(dt.datetime.now()))
n = range(0,3597351,5000)
n = list(n)
pool = Pool(8)
pool.map_async(process_skips, n)
pool.close()
pool.join()

Using pool as a context manager using with which takes care of closing/joining the processes and seems to be the preferred method in the docs.
if __name__ == '__main__':
print("We started at {}".format(dt.datetime.now()))
n = list(range(0,3597351,5000))
with Pool(8) as pool:
pool.map_async(process_skips, n)
If your main process is working and writing your file correctly, that should make your processes close out correctly.

Related

Effective Python Multiprocessing in a list for loop

I'm trying to multiprocess an action inside a for x in y loop. Basically, the concept of the script is to do a request to a site, load up a json file containing a list of URLs. Once fetched, another function is called to parse an URL individually. What i've been trying to do is to multiprocess this task with multiprocess.Process() in order to speed up the process since there is lots of URLs to parse. However, my approach doesn't speed up the process at all, it actually goes at the same speed than with no multiprocessing. It seems like gets blocked when using proc.join().
This is a code i've been working on:
import json
import requests
import multiprocessing
def ExtractData(id):
print("Processing ", id)
result = requests.get('http://example-index.com/' + id')
result = result.text.split('\n')[:-1]
for entry in result:
data = json.loads(entry)['url']
print("data is:", data)
def ParseJsonAndCall():
url = "https://example-site.com/info.json"
data = json.loads(requests.get(url).text)
t = []
for results in data:
print("Processing ", results['url'])
p = multiprocessing.Process(target=ExtractData, args=(results['id'],))
t.append(p)
p.start()
for proc in threads:
proc.join()
ParseJsonAndCall()
Any help would be greatly appreciated!
A Pool may help.
import multiprocessing as mp
def ParseJsonAndCall():
url = "https://example-site.com/info.json"
data = json.loads(requests.get(url).text)
collect_results = []
with mp.Pool(processes=mp.cpu_count()) as pool:
for results in data:
res = pool.apply_async(ExtractData, [results['id'],])
collect_results.append(res)
for res in collect_results:
res.get()
Although the print statement in ExtractData() might cause a race condition.

Python multiprocessing only working on one core with networking

I am making a program to generate a .onion url and then test the url to see if it exists. The issue is I have set it up to use multiprocessing and it only uses one core, even when multiple processes are running.
My code is
import requests
import time
import sys
import threading
import random
import string
import multiprocessing
proxies = {
'http': 'socks5h://127.0.0.1:9150',
'https': 'socks5h://127.0.0.1:9150'
}
try:
print("Your hidden ip is: " + requests.get("https://api.ipify.org/?", proxies=proxies, timeout=10).text)
print("Checks at intervals to check if the commection is still established (pings google.com)")
except Exception as e:
print("Make sure tor is running or check that a socks proxy is running at 127.0.0.1:9150!")
sys.exit(0)
def checkSite(site):
if random.randint(0, 100) == 5:
site = "https://google.com"
try:
data = requests.get(site, proxies=proxies, timeout=10)
if data.status_code == 200:
print('Website exists {}'.format(site))
sys.exit(0)
print("hello")
else:
return
#print('Website does not exist {}'.format(site))
except Exception:
#print('Website does not exist {}'.format(site))
return
def genSite():
site = ""
x= string.ascii_lowercase + "234562"
for i in range(0, 16):
site = site + x[random.randint(0, len(x) - 1)]
return ("http://" + site + ".onion")
def findSite():
while True:
checkSite(genSite())
if __name__ == "__main__":
processes = []
for _ in range(int(input("Enter number of processes: "))):
p = multiprocessing.Process(target=findSite) # create a new Process
processes.append(p)
p.start()
for process in processes:
process.join()
I have tried this with running a factorial function instead and this will utilize all cores, however it will only use one core when running the findSite function. I don't know weather this is an issue with networking and threads should be used or if my code is wrong. I have also tried to join the processes directly after they are started and put the join statement in different places however this has not worked.

How many network ports does Linux allow python to use?

So I have been trying to multi-thread some internet connections in python. I have been using the multiprocessing module so I can get around the "Global Interpreter Lock". But it seems that the system only gives one open connection port to python, Or at least it only allows one connection to happen at once. Here is an example of what I am saying.
*Note that this is running on a linux server
from multiprocessing import Process, Queue
import urllib
import random
# Generate 10,000 random urls to test and put them in the queue
queue = Queue()
for each in range(10000):
rand_num = random.randint(1000,10000)
url = ('http://www.' + str(rand_num) + '.com')
queue.put(url)
# Main funtion for checking to see if generated url is active
def check(q):
while True:
try:
url = q.get(False)
try:
request = urllib.urlopen(url)
del request
print url + ' is an active url!'
except:
print url + ' is not an active url!'
except:
if q.empty():
break
# Then start all the threads (50)
for thread in range(50):
task = Process(target=check, args=(queue,))
task.start()
So if you run this you will notice that it starts 50 instances on the function but only runs one at a time. You may think that the 'Global Interpreter Lock' is doing this but it isn't. Try changing the function to a mathematical function instead of a network request and you will see that all fifty threads run simultaneously.
So will I have to work with sockets? Or is there something I can do that will give python access to more ports? Or is there something I am not seeing? Let me know what you think! Thanks!
*Edit
So I wrote this script to test things better with the requests library. It seems as though I had not tested it very well with this before. (I had mainly used urllib and urllib2)
from multiprocessing import Process, Queue
from threading import Thread
from Queue import Queue as Q
import requests
import time
# A main timestamp
main_time = time.time()
# Generate 100 urls to test and put them in the queue
queue = Queue()
for each in range(100):
url = ('http://www.' + str(each) + '.com')
queue.put(url)
# Timer queue
time_queue = Queue()
# Main funtion for checking to see if generated url is active
def check(q, t_q): # args are queue and time_queue
while True:
try:
url = q.get(False)
# Make a timestamp
t = time.time()
try:
request = requests.head(url, timeout=5)
t = time.time() - t
t_q.put(t)
del request
except:
t = time.time() - t
t_q.put(t)
except:
break
# Then start all the threads (20)
thread_list = []
for thread in range(20):
task = Process(target=check, args=(queue, time_queue))
task.start()
thread_list.append(task)
# Join all the threads so the main process don't quit
for each in thread_list:
each.join()
main_time_end = time.time()
# Put the timerQueue into a list to get the average
time_queue_list = []
while True:
try:
time_queue_list.append(time_queue.get(False))
except:
break
# Results of the time
average_response = sum(time_queue_list) / float(len(time_queue_list))
total_time = main_time_end - main_time
line = "Multiprocessing: Average response time: %s sec. -- Total time: %s sec." % (average_response, total_time)
print line
# A main timestamp
main_time = time.time()
# Generate 100 urls to test and put them in the queue
queue = Q()
for each in range(100):
url = ('http://www.' + str(each) + '.com')
queue.put(url)
# Timer queue
time_queue = Queue()
# Main funtion for checking to see if generated url is active
def check(q, t_q): # args are queue and time_queue
while True:
try:
url = q.get(False)
# Make a timestamp
t = time.time()
try:
request = requests.head(url, timeout=5)
t = time.time() - t
t_q.put(t)
del request
except:
t = time.time() - t
t_q.put(t)
except:
break
# Then start all the threads (20)
thread_list = []
for thread in range(20):
task = Thread(target=check, args=(queue, time_queue))
task.start()
thread_list.append(task)
# Join all the threads so the main process don't quit
for each in thread_list:
each.join()
main_time_end = time.time()
# Put the timerQueue into a list to get the average
time_queue_list = []
while True:
try:
time_queue_list.append(time_queue.get(False))
except:
break
# Results of the time
average_response = sum(time_queue_list) / float(len(time_queue_list))
total_time = main_time_end - main_time
line = "Standard Threading: Average response time: %s sec. -- Total time: %s sec." % (average_response, total_time)
print line
# Do the same thing all over again but this time do each url at a time
# A main timestamp
main_time = time.time()
# Generate 100 urls and test them
timer_list = []
for each in range(100):
url = ('http://www.' + str(each) + '.com')
t = time.time()
try:
request = requests.head(url, timeout=5)
timer_list.append(time.time() - t)
except:
timer_list.append(time.time() - t)
main_time_end = time.time()
# Results of the time
average_response = sum(timer_list) / float(len(timer_list))
total_time = main_time_end - main_time
line = "Not using threads: Average response time: %s sec. -- Total time: %s sec." % (average_response, total_time)
print line
As you can see, it is multithreading very well. Actually, most of my tests show that the threading module is actually faster than the multiprocessing module. (I don't understand why!) Here are some of my results.
Multiprocessing: Average response time: 2.40511314869 sec. -- Total time: 25.6876308918 sec.
Standard Threading: Average response time: 2.2179402256 sec. -- Total time: 24.2941861153 sec.
Not using threads: Average response time: 2.1740363431 sec. -- Total time: 217.404567957 sec.
This was done on my home network, the response time on my server is much faster. I think my question has been answered indirectly, since I was having my problems on a much more complex script. All of the suggestions helped me optimize it very well. Thanks to everyone!
it starts 50 instances on the function but only runs one at a time
You have misinterpreted the results of htop. Only a few, if any, copies of python will be runnable at any specific instance. Most of them will be blocked waiting for network I/O.
The processes are, in fact, running parallel.
Try changing the function to a mathematical function instead of a network request and you will see that all fifty threads run simultaneously.
Changing the task to a mathematical function merely illustrates the difference between CPU-bound (e.g. math) and IO-bound (e.g. urlopen) processes. The former is always runnable, the latter is rarely runnable.
it only prints one at a time. If it was actually running multiple processes it would print many out at once.
It prints one at a time because you are writing lines to a terminal. Because the lines are indistinguishable, you wouldn't be able to tell if they are written all by one thread, or each by a separate thread in turn.
First of all, using multiprocessing to parallelize network I/O is an overkill. Using the built-in threading or a lightweight greenlet library like gevent are a much better option with less overhead. The GIL has nothing to do with blocking IO calls, so you don't have to worry about that at all.
Secondly, an easy way to see if your subprocesses/threads/greenlets are running in parallel if you are monitoring stdout is to print out something at the very beginning of the function, right after the subprocesses/threads/greenlets are spawned. For example, modify your check() function like so
def check(q):
print 'Start checking urls!'
while True:
...
If your code is correct, you should see many Start checking urls! lines printed out before any of the url + ' is [not] an active url!' printed out. It works on my machine, so it looks like your code is correct.
It appears that your issue is actually with the serial behavior of gethostbyname(3). This is discussed in this SO thread.
Try this code that uses the Twisted asynchronous I/O library:
import random
import sys
from twisted.internet import reactor
from twisted.internet import defer
from twisted.internet.task import cooperate
from twisted.web import client
SIMULTANEOUS_CONNECTIONS = 25
# Generate 10,000 random urls to test and put them in the queue
pages = []
for each in range(10000):
rand_num = random.randint(1000,10000)
url = ('http://www.' + str(rand_num) + '.com')
pages.append(url)
# Main function for checking to see if generated url is active
def check(page):
def successback(data, page):
print "{} is an active URL!".format(page)
def errback(err, page):
print "{} is not an active URL!; errmsg:{}".format(page, err.value)
d = client.getPage(page, timeout=3) # timeout in seconds
d.addCallback(successback, page)
d.addErrback(errback, page)
return d
def generate_checks(pages):
for i in xrange(0, len(pages)):
page = pages[i]
#print "Page no. {}".format(i)
yield check(page)
def work(pages):
print "started work(): {}".format(len(pages))
batch_size = len(pages) / SIMULTANEOUS_CONNECTIONS
for i in xrange(0, len(pages), batch_size):
task = cooperate(generate_checks(pages[i:i+batch_size]))
print "starting..."
reactor.callWhenRunning(work, pages)
reactor.run()

From synchronous to asynchronous "Processing" when working with list chunks

I am using multiprocessing module via class Process to do some not cpu-bound tasks, e.g. I/O, or web requests. If the tasks take too long the CPU reaches 100% of usage (all threads are waiting the data to return). I suspect asynchronous execution solution but I have never done something like this. The code I am using is something like the following where I have a huge list and each process works on a chunk.
Could you please make a suggestion in this direction?
Thanks in advance!!
import multiprocessing
def getData(urlsChunk, myQueue):
for url in urlsChunk:
fp = urllib.urlopen(url)
try:
data = fp.read()
myQueue.put(data)
finally:
fp.close()
return myQueue
manager = multiprocessing.Manager()
HUGEQ = manager.Queue()
urls = ['a huge list of url items']
chunksize = int(math.ceil(len(urls) / float(nprocs)))
for i in range(nprocs):
p = Process(
target = getData, # This is my worker
args=(urls[chunksize * i:chunksize * (i + 1)],
MYQUEUE
)
)
processes.append(p)
p.start()
for p in processes:
p.join()
while True:
try:
MYQUEUEelem = MYQUEUE.get(block=False)
except Empty:
break
else:
'do something with the MYQUEUEelem'
Using multiprocessing.Pool, your code can be simplified:
import multiprocessing
def getData(url):
fp = urllib.urlopen(url)
try:
return fp.read()
finally:
fp.close()
if __name__ == '__main__': # should protect the "entry point" of the program
urls = ['a huge list of url items']
pool = multiprocessing.Pool()
for result in pool.imap(getData, urls, chunksize=10):
# do something with the result

Learning python and threading. I think my code runs infinitely. Help me find bugs?

So I've started learning python now, and I absolutely am in love with it.
I'm building a small scale facebook data scraper. Basically, it will use the Graph API and scrape the first names of the specified number of users. It works fine in a single thread (or no thread I guess).
I used online tutorials to come up with the following multithreaded version (updated code):
import requests
import json
import time
import threading
import Queue
GraphURL = 'http://graph.facebook.com/'
first_names = {} # will store first names and their counts
queue = Queue.Queue()
def getOneUser(url):
http_response = requests.get(url) # open the request URL
if http_response.status_code == 200:
data = http_response.text.encode('utf-8', 'ignore') # Get the text of response, and encode it
json_obj = json.loads(data) # load it as a json object
# name = json_obj['name']
return json_obj['first_name']
# last = json_obj['last_name']
return None
class ThreadGet(threading.Thread):
""" Threaded name scraper """
def __init__(self, queue):
threading.Thread.__init__(self)
self.queue = queue
def run(self):
while True:
#print 'thread started\n'
url = GraphURL + str(self.queue.get())
first = getOneUser(url) # get one user's first name
if first is not None:
if first_names.has_key(first): # if name has been encountered before
first_names[first] = first_names[first] + 1 # increment the count
else:
first_names[first] = 1 # add the new name
self.queue.task_done()
#print 'thread ended\n'
def main():
start = time.time()
for i in range(6):
t = ThreadGet(queue)
t.setDaemon(True)
t.start()
for i in range(100):
queue.put(i)
queue.join()
for name in first_names.keys():
print name + ': ' + str(first_names[name])
print '----------------------------------------------------------------'
print '================================================================'
# Print top first names
for key in first_names.keys():
if first_names[key] > 2:
print key + ': ' + str(first_names[key])
print 'It took ' + str(time.time()-start) + 's'
main()
To be honest, I don't understand some of the parts of the code but I get the main idea. The output is nothing. I mean the shell has nothing in it, so I believe it keeps on running.
So what I am doing is filling queue with integers that are the user id's on fb. Then each ID is used to build the api call URL. getOneUser returns the name of one user at a time. That task (ID) is marked as 'done' and it moves on.
What is wrong with the code above?
Your usage of first_names is not thread-safe. You could add a lock to protect the increment. Otherwise the code should work. You might be hitting some facebook api limit i.e., you should limit your request rate.
You could simplify your code by using a thread pool and counting the names in the main thread:
#!/usr/bin/env python
import json
import urllib2
from collections import Counter
from multiprocessing.dummy import Pool # use threads
def get_name(url):
try:
return json.load(urllib2.urlopen(url))['first_name']
except Exception:
return None # error
urls = ('http://graph.facebook.com/%d' % i for i in xrange(100))
p = Pool(5) # 5 concurrent connections
first_names = Counter(p.imap_unordered(get_name, urls))
print first_names.most_common()
To see what errors you get, you could add logging:
#!/usr/bin/env python
import json
import logging
import urllib2
from collections import Counter
from multiprocessing.dummy import Pool # use threads
logging.basicConfig(level=logging.DEBUG,
format="%(asctime)s %(threadName)s %(message)s")
def get_name(url):
try:
name = json.load(urllib2.urlopen(url))['first_name']
except Exception as e:
logging.debug('error: %s url: %s', e, url)
return None # error
else:
logging.debug('done url: %s', url)
return name
urls = ('http://graph.facebook.com/%d' % i for i in xrange(100))
p = Pool(5) # 5 concurrent connections
first_names = Counter(p.imap_unordered(get_name, urls))
print first_names.most_common()
A simple way to limit number of requests per given time period is to use a semaphore:
#!/usr/bin/env python
import json
import logging
import time
import urllib2
from collections import Counter
from multiprocessing.dummy import Pool # use threads
from threading import _BoundedSemaphore as BoundedSemaphore, Timer
logging.basicConfig(level=logging.DEBUG,
format="%(asctime)s %(threadName)s %(message)s")
class RatedSemaphore(BoundedSemaphore):
"""Limit to 1 request per `period / value` seconds (over long run)."""
def __init__(self, value=1, period=1):
BoundedSemaphore.__init__(self, value)
t = Timer(period, self._add_token_loop,
kwargs=dict(time_delta=float(period) / value))
t.daemon = True
t.start()
def _add_token_loop(self, time_delta):
"""Add token every time_delta seconds."""
while True:
try:
BoundedSemaphore.release(self)
except ValueError: # ignore if already max possible value
pass
time.sleep(time_delta) # ignore EINTR
def release(self):
pass # do nothing (only time-based release() is allowed)
def get_name(gid, rate_limit=RatedSemaphore(value=100, period=600)):
url = 'http://graph.facebook.com/%d' % gid
try:
with rate_limit:
name = json.load(urllib2.urlopen(url))['first_name']
except Exception as e:
logging.debug('error: %s url: %s', e, url)
return None # error
else:
logging.debug('done url: %s', url)
return name
p = Pool(5) # 5 concurrent connections
first_names = Counter(p.imap_unordered(get_name, xrange(200)))
print first_names.most_common()
After the initial burst, it should make a single request every 6 seconds.
Consider using batch requests.
Your original run function only processed one item from the queue. In all you've only removed 5 items from the queue.
Usually run functions look like
run(self):
while True:
doUsefulWork()
i.e. they have a loop which causes the recurring work to be done.
[Edit] OP edited code to include this change.
Some other useful things to try:
Add a print statement into the run function: you'll find that it is only called 5 times.
Remove the queue.join() call, this is what is causing the module to block, then you will be able to probe the state of the queue.
put the entire body of run into a function. Verify that you can use that function in a single threaded manner to get the desired results, then
try it with just a single worker thread, then finally go for
multiple worker threads.

Categories