multithreaded urllib2 freezes on nose framework - python

I have some Python code that uses nose_parameterized, as below:

from nose_parameterized import parameterized
from multiprocessing.pool import ThreadPool
import urllib2

def make_http_call(url, req_type):
    opener = urllib2.build_opener()  # <=== this line causes it to freeze
    return 1

pool = ThreadPool(processes=4)
results = []
urls = ['a', 'b', 'c', 'd']
for url in urls:
    results.append(pool.apply_async(make_http_call, (url, 'html')))

d = {'add': []}
for ind, res in enumerate(results):
    d['add'].append((res.get(), 2 + ind, 3 + ind))

@parameterized(d['add'])
def test_add(a, b, c):
    assert a + b == c
This is a dummy version of the code. Basically, I need to load the test parameters with HTTP request responses, and since there are lots of URLs, I want to multithread the requests.
As soon as I add the urllib2.build_opener call, the code freezes under nose (but still works fine when run with plain python).
I've also tried urllib2.urlopen; same problem.
Any ideas whether there is a 'proper' (debuggable) way to work around this?

You can use nose's built-in multiprocess plugin for that, something like:
from nose_parameterized import parameterized
import urllib2

urls = ['http://www.google.com', 'http://www.yahoo.com']

@parameterized(urls)
def test_add(url):
    a = urllib2.urlopen(url).read()
    b = 2 + urls.index(url)
    c = 3 + urls.index(url)
    assert a + str(b) == str(c)
and run it with nosetests --processes=2. This lets you distribute your test run among a set of worker processes that run tests in parallel, as you intended. Behind the scenes, the multiprocessing module is used.
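If you want to keep the original three-argument test, the same idea carries over: do the HTTP call inside the test body so that each worker process opens its own connection. A minimal sketch, assuming the expected-value tuples can be built without any network access at import time:

from nose_parameterized import parameterized
import urllib2

# (url, b, c) tuples built without any network calls at import time
params = [('http://www.google.com', 2, 3), ('http://www.yahoo.com', 3, 4)]

@parameterized(params)
def test_add(url, b, c):
    # the request happens inside the test, so each nose worker process
    # opens its own connection instead of sharing a module-level ThreadPool
    body = urllib2.urlopen(url).read()
    assert body is not None
    assert b + 1 == c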

Related

Using Magic mock to test Github Api

I am basically using MagicMock and a context manager to test my code. I was able to mock my get_urls function successfully, but I am having trouble mocking out my access_all_repos_pr() function, which returns data for PRs newer than 7 days. Can anyone help me out with how to mock that data?
Here is the test code for my get_urls():
import unittest
from mock import MagicMock, patch
from contextlib2 import ExitStack
from GithubAPIpackage.GithubAPI import get_urls

class Test_GithubApi(unittest.TestCase):
    def test_get_urls_returns_valid_urls(self):
        with ExitStack() as stack:
            mock_get_urls = stack.enter_context(
                patch("GithubAPIpackage.GithubAPI._fetch_url")
            )
            fake_data = {"current_user_repositories_url": "http://FAKEURL.com"}
            mock_get_urls.return_value = fake_data
            print(type(fake_data))
            result = get_urls()
            self.assertEqual(result, "http://FAKEURL.com")
I want to mock out the response for the function access_all_repo_pr. Can anyone tell me exactly what I need to do to create a mock for it? Do I need to refactor my code in some way? (I'm relatively new to Python.)
What I am trying is:
class Test_GithubApi_newer_than_7_days(unittest.TestCase):
    def test_access_all_repo_pr_returns_valid_response(self):
        with ExitStack() as stack:
            mock_access_all_repo_pr = stack.enter_context(
                patch("GithubAPIpackage.GithubAPI._fetch_url")
            )
            fake_data = {"current_user_repositories_url": "http://myfakeurl.com"}
            mock_access_all_repo_pr.return_value = fake_data
            result = access_all_repo_pr()
            self.assertEqual(result, "")
Since you are using requests under the hood, may I suggest using responses for your testing? Not trying to skirt the question, but in my experience, I have found this to be the path of least resistance when it comes to writing tests that deal with the requests module. The tests end up being a lot cleaner, safer, and easier to write.
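A minimal sketch of that approach, assuming access_all_repo_pr() lives in GithubAPIpackage.GithubAPI and ultimately calls the GitHub pulls endpoint via requests (the URL and JSON payload here are made up for illustration):

import unittest
import responses

from GithubAPIpackage.GithubAPI import access_all_repo_pr  # function under test

class Test_GithubApi_newer_than_7_days(unittest.TestCase):
    @responses.activate
    def test_access_all_repo_pr_returns_valid_response(self):
        # register a canned HTTP response; any requests call to this URL
        # is intercepted and never hits the network
        responses.add(
            responses.GET,
            "https://api.github.com/repos/someuser/somerepo/pulls",  # hypothetical URL
            json=[{"title": "my PR", "created_at": "2020-01-01T00:00:00Z"}],
            status=200,
        )
        result = access_all_repo_pr()
        self.assertIsNotNone(result)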

Flask, uWSGI and threads produce unreproducible result

I have an API based on Flask + uWSGI (running in a Docker container).
Its input is a text and its output is a simple JSON object with data about that text (something like {"score": 1}).
I request this API in 5 threads. In my uwsgi.ini I have processes = 8 and threads = 2.
Recently I noticed that some results are not reproducible, although the source code of the API didn't change and there is nothing in it that would produce random answers.
So I took the same set of queries and fed it to my API, first in the original order, then in reverse. About 1% of the responses differed!
When I did the same locally (without Docker, in just one thread), the results were identical. So my hypothesis is that Flask somehow mixes up responses to different threads from time to time.
Has anyone dealt with this? I found only https://github.com/getsentry/raven-python/issues/923, but if that's the case then the problem still remains unsolved as far as I understand...
So here are the relevant code pieces:
uwsgi.ini
[uwsgi]
socket = :8000
processes = 8
threads = 2
master = true
module = web:app
requests
import json
from multiprocessing import Pool
import requests

def fetch_score(query):
    r = requests.post("http://url/api/score", data=query)
    score = json.loads(r.text)
    query["score"] = score["score"]
    return query

def calculateParallel(arr_of_data):
    pool = Pool(processes=5)
    results = pool.map(fetch_score, arr_of_data)
    pool.close()
    pool.join()
    return results

results = calculateParallel(final_data)
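As a purely hypothetical illustration of the kind of thread-safety bug that can produce this symptom (none of these names come from the real API), a view that stashes the request payload in module-level state is racy under threads = 2:

from flask import Flask, request, jsonify

app = Flask(__name__)

_current_text = None  # module-level state shared by both uWSGI threads

@app.route("/api/score", methods=["POST"])
def score():
    global _current_text
    _current_text = request.get_data(as_text=True)
    # ... some processing; the other thread may overwrite _current_text here ...
    result = len(_current_text)  # stand-in for the real scoring logic
    return jsonify({"score": result})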

Python URLLib does not work with PyQt + Multiprocessing

A simple piece of code like this:

import urllib2
import requests
from PyQt4 import QtCore
import multiprocessing
import time

data = (
    ['a', '2'],
)

def mp_worker((inputs, the_time)):
    r = requests.get('http://www.gpsbasecamp.com/national-parks')
    request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
    response = urllib2.urlopen(request)

def mp_handler():
    p = multiprocessing.Pool(2)
    p.map(mp_worker, data)

if __name__ == '__main__':
    mp_handler()
Basically, if I import PyQt4 and I make a urllib request (I believe urllib is used by almost all web-extraction libraries such as BeautifulSoup, Requests, or PyQuery), it crashes with a cryptic log on my Mac.
This is exactly right. It always fails on Mac; I have wasted days trying to fix this, and honestly there is no fix as of now. The best workaround is to use threads instead of processes, and it will work like a charm.
By the way -
r = requests.get('http://www.gpsbasecamp.com/national-parks')
and
request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
do one and the same thing. Why are you doing it twice?
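Following the thread-instead-of-process suggestion above, one way to keep the Pool API while switching to threads is multiprocessing.dummy; a sketch based on the question's code:

import urllib2
from multiprocessing.dummy import Pool  # thread-based Pool, same interface

data = (
    ['a', '2'],
)

def mp_worker((inputs, the_time)):
    # same fetch as before, but executed in a worker thread,
    # so no fork happens after PyQt4 / _scproxy have been loaded
    response = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
    return response.getcode()

def mp_handler():
    p = Pool(2)
    print p.map(mp_worker, data)

if __name__ == '__main__':
    mp_handler()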
This may be due to _scproxy.get_proxies() not being fork-safe on macOS.
It is raised here: https://bugs.python.org/issue33725#msg329926
_scproxy has been known to be problematic for some time, see for instance Issue31818. That issue also gives a simple workaround: setting urllib's "no_proxy" environment variable to "*" will prevent the calls to the System Configuration framework.
This is something urllib may be attempting to do, causing the failure under multiprocessing.
The workaround is to set the environment variable no_proxy to *.
E.g. export no_proxy=*
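If exporting the variable in a shell isn't convenient, the same workaround can be applied from Python itself, as long as it happens before any worker processes are forked; a sketch:

import os

# Tell urllib to skip the macOS System Configuration proxy lookup
# (_scproxy) entirely; must be set before multiprocessing forks workers.
os.environ["no_proxy"] = "*"

import multiprocessing
import urllib2

def worker(_):
    return urllib2.urlopen("http://www.gpsbasecamp.com/national-parks").getcode()

if __name__ == "__main__":
    pool = multiprocessing.Pool(2)
    print pool.map(worker, range(2))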

Feedparser - TypeError('a float is required',)

I am using guv and feedparser to parse multiple feeds simultaneously. The following is my code:
import guv
import feedparser

guv.monkey_patch(time=True, socket=True)

def parse_feed(_feed):
    return feedparser.parse(_feed)

def main():
    urls = ["http://feeds.bbci.co.uk/news/rss.xml"]
    pool = guv.GreenPool()
    results = pool.starmap(parse_feed, zip(urls))
    for resp in results:
        print(str(resp))
However, I get the following output:
{'bozo_exception': TypeError('a float is required',), 'bozo': 1, 'feed': {}, 'entries': []}
I have a similar problem using Eventlet, but not with the native Python 3 threading library.
I'm not able to install the guv module locally, so I can't test your code verbatim, but if I use eventlet.greenpool.GreenPool instead everything works fine:
import feedparser
import eventlet.greenpool

def parse_feed(_feed):
    print 'PARSE:', _feed
    return feedparser.parse(_feed)

def main():
    urls = ["http://feeds.bbci.co.uk/news/rss.xml"]
    pool = eventlet.greenpool.GreenPool()
    results = pool.starmap(parse_feed, zip(urls))
    for resp in results:
        print resp

main()
I also see correct behavior with itertools.starmap(). Is it possible you're seeing some sort of transient error?

Requests with multiple connections

I use the Python Requests library to download a big file, e.g.:
r = requests.get("http://bigfile.com/bigfile.bin")
content = r.content
The big file downloads at about 30 KB per second, which is a bit slow. Every connection to the bigfile server is throttled, so I would like to make multiple connections.
Is there a way to make multiple connections at the same time to download one file?
You can use the HTTP Range header to fetch just part of the file (already covered for Python here).
Just start several threads, fetch a different range with each, and you're done ;)
import threading
import urllib2

url = 'http://www.python.org/'
chunk_size = 1 << 20  # bytes fetched per request

def download(url, start):
    req = urllib2.Request(url)
    req.headers['Range'] = 'bytes=%s-%s' % (start, start + chunk_size - 1)
    f = urllib2.urlopen(req)
    parts[start] = f.read()

threads = []
parts = {}
# Initialize threads, each fetching its own byte range
for i in range(0, 10):
    t = threading.Thread(target=download, args=(url, i * chunk_size))
    t.start()
    threads.append(t)
# Join threads back (order doesn't matter, you just want them all)
for i in threads:
    i.join()
# Sort parts and you're done
result = ''.join(parts[i] for i in sorted(parts.keys()))
Also note that not every server supports the Range header (in particular, servers where a PHP script is responsible for serving the data often don't implement it).
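One way to check up front whether ranged requests are worth attempting is to look at the Accept-Ranges and Content-Length headers of a HEAD response; a small sketch using requests (the URL is just the placeholder from the question):

import requests

def supports_ranges(url):
    # Servers that honour ranged requests usually advertise it via
    # "Accept-Ranges: bytes"; Content-Length tells us how to split the file.
    head = requests.head(url, allow_redirects=True)
    accepts = head.headers.get("Accept-Ranges", "none").lower()
    length = int(head.headers.get("Content-Length", 0))
    return accepts == "bytes" and length > 0, length

ok, total = supports_ranges("http://bigfile.com/bigfile.bin")  # placeholder URL
print(ok, total)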
Here's a Python script that saves a given URL to a file and uses multiple threads to download it:
#!/usr/bin/env python
import sys
from functools import partial
from itertools import count, izip
from multiprocessing.dummy import Pool  # use threads
from urllib2 import HTTPError, Request, urlopen

def download_chunk(url, byterange):
    req = Request(url, headers=dict(Range='bytes=%d-%d' % byterange))
    try:
        return urlopen(req).read()
    except HTTPError as e:
        return b'' if e.code == 416 else None  # treat range error as EOF
    except EnvironmentError:
        return None

def main():
    url, filename = sys.argv[1:]
    pool = Pool(4)  # define number of concurrent connections
    chunksize = 1 << 16
    ranges = izip(count(0, chunksize), count(chunksize - 1, chunksize))
    with open(filename, 'wb') as file:
        for s in pool.imap(partial(download_chunk, url), ranges):
            if not s:
                break  # error or EOF
            file.write(s)
            if len(s) != chunksize:
                break  # EOF (servers with no Range support end up here)

if __name__ == "__main__":
    main()
The end of file is detected if the server returns an empty body or a 416 HTTP code, or if the response size is not exactly chunksize.
It supports servers that don't understand the Range header (everything is downloaded in a single request in this case; to support large files, change download_chunk() to save to a temporary file and return the filename to be read in the main thread instead of the file content itself; a sketch of that change follows below).
It lets you independently change the number of concurrent connections (the pool size) and the number of bytes requested in a single HTTP request.
To use multiple processes instead of threads, change the import:
from multiprocessing.pool import Pool # use processes (other code unchanged)
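A rough sketch of the temporary-file variant mentioned above (an assumption-laden illustration, not the answer's code: main() would also have to be adapted to read each returned file back and delete it):

import os
import shutil
import tempfile
from urllib2 import HTTPError, Request, urlopen

def download_chunk(url, byterange):
    # Same request as before, but stream the body into a temporary file and
    # return its path, so a huge single-request fallback never sits in memory.
    req = Request(url, headers=dict(Range='bytes=%d-%d' % byterange))
    try:
        response = urlopen(req)
    except HTTPError as e:
        return '' if e.code == 416 else None  # treat range error as EOF
    except EnvironmentError:
        return None
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, 'wb') as tmp:
        shutil.copyfileobj(response, tmp)
    return path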
This solution requires the Linux utility aria2c, but it has the advantage of easily resuming downloads.
It also assumes that all the files you want to download are listed in the HTTP directory listing for the location MY_HTTP_LOC. I tested this script against an instance of the lighttpd/1.4.26 HTTP server, but you can easily modify it to work with other setups.
#!/usr/bin/python
import os
import urllib
import re
import subprocess

MY_HTTP_LOC = "http://AAA.BBB.CCC.DDD/"

# retrieve webpage source code
f = urllib.urlopen(MY_HTTP_LOC)
page = f.read()
f.close()

# extract relevant URL segments from source code
rgxp = '(\<td\ class="n"\>\<a\ href=")([0-9a-zA-Z\(\)\-\_\.]+)(")'
results = re.findall(rgxp, str(page))
files = []
for match in results:
    files.append(match[1])

# download files using aria2c
for afile in files:
    if os.path.exists(afile) and not os.path.exists(afile + '.aria2'):
        print 'Skipping already-retrieved file: ' + afile
    else:
        print 'Downloading file: ' + afile
        subprocess.Popen(["aria2c", "-x", "16", "-s", "20", MY_HTTP_LOC + str(afile)]).wait()
You could use a module called pySmartDL for this. It uses multiple threads, can do a lot more, and also shows a download progress bar by default.
For more info, check this answer.
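A minimal sketch of what using pySmartDL looks like, as far as I recall its basic API (check the project's documentation for the details; the URL is the placeholder from the question):

from pySmartDL import SmartDL

url = "http://bigfile.com/bigfile.bin"  # placeholder URL from the question
dest = "./bigfile.bin"

obj = SmartDL(url, dest)  # splits the download across several threads
obj.start()               # blocks until finished, printing a progress bar
print(obj.get_dest())     # path of the downloaded file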
