Python urllib does not work with PyQt + Multiprocessing

Some simple code, such as this:
import urllib2
import requests
from PyQt4 import QtCore
import multiprocessing
import time

data = (
    ['a', '2'],
)

def mp_worker((inputs, the_time)):
    r = requests.get('http://www.gpsbasecamp.com/national-parks')
    request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
    response = urllib2.urlopen(request)

def mp_handler():
    p = multiprocessing.Pool(2)
    p.map(mp_worker, data)

if __name__ == '__main__':
    mp_handler()
Basically, if I import PyQt4 and then make a urllib request from a multiprocessing worker, it crashes with a cryptic log on my Mac. (I believe urllib is used under the hood by almost all web-extraction libraries such as BeautifulSoup, Requests or PyQuery.)

This is exactly right. It always fails on Mac; I wasted days on end trying to fix this, and honestly there is no fix as of now. The best way is to use a Thread instead of a Process, and it will work like a charm.
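For illustration, here is a minimal sketch of the threaded equivalent of the code above, using ThreadPool from multiprocessing.pool as a drop-in replacement for Pool (the names are carried over from the question; the return value is my addition):
import urllib2
from multiprocessing.pool import ThreadPool  # threads, not forked processes

data = (
    ['a', '2'],
)

def mp_worker((inputs, the_time)):
    # Same request as in the question; threads avoid forking entirely,
    # which sidesteps the Mac-specific crash.
    request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
    response = urllib2.urlopen(request)
    return response.getcode()

if __name__ == '__main__':
    p = ThreadPool(2)
    print p.map(mp_worker, data)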
By the way -
r = requests.get('http://www.gpsbasecamp.com/national-parks')
and
request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
do one and the same thing. Why are you doing it twice?

This may be due to _scproxy.get_proxies() not being fork-safe on Mac.
This is raised here https://bugs.python.org/issue33725#msg329926
_scproxy has been known to be problematic for some time, see for instance Issue31818. That issue also gives a simple workaround: setting urllib's "no_proxy" environment variable to "*" will prevent the calls to the System Configuration framework.
urllib may be attempting exactly this proxy lookup, which causes the failure under multiprocessing.
There is a workaround, and that is to set the environment variable no_proxy to *
E.g. export no_proxy=*
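If you would rather not depend on the shell, the same workaround can be applied from inside the script before any workers are forked. A minimal sketch, reusing the URL and worker shape from the question:
import os
os.environ['no_proxy'] = '*'  # set before multiprocessing forks any workers

import urllib2
import multiprocessing

def mp_worker((inputs, the_time)):
    # With no_proxy set, urllib2 skips the System Configuration proxy
    # lookup that the linked bug reports identify as fork-unsafe on Mac.
    response = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
    return response.getcode()

if __name__ == '__main__':
    p = multiprocessing.Pool(2)
    print p.map(mp_worker, (['a', '2'],))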

Related

Python - Put a list in a threading module

I want to put a list into my threading script, but I am facing a problem.
Contents of list file (example):
http://google.com
http://yahoo.com
http://bing.com
http://python.org
My script:
import codecs
import threading
import sys
import requests
from time import time as timer
from timeout import timeout
import time

try:
    with codecs.open(sys.argv[1], mode='r', encoding='ascii', errors='ignore') as iiz:
        iiz = iiz.read().splitlines()
except IOError:
    pass

oz = list(iiz)

def nnn(url):
    hzz = {'param1': sys.argv[2], 'param2': sys.argv[3]}
    po = requests.post(url, data=hzz)
    if po:
        print("ok \n")

if __name__ == '__main__':
    threads = []
    for i in range(1):
        t = threading.Thread(target=nnn, args=(oz,))
        threads.append(t)
        t.start()
Can you please elaborate on exactly what you're trying to achieve?
I'm guessing that you're trying to request urls to load into a web browser or the terminal...
Also, you shouldn't need to wrap the urls in list(), because splitlines() already returns a list when you read the file. In other words, the contents of iiz are already in list form.
Personally, I haven't worked much with the modules you're using (apart from time), but I'll try my best to help you and hopefully other users will try and help you too.
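For what it's worth, the usual pattern here is one thread per URL rather than handing the whole list to a single thread. A minimal sketch reusing nnn and oz from the question (the join loop is my addition):
import threading

if __name__ == '__main__':
    threads = []
    for url in oz:  # one thread per URL, not the whole list at once
        t = threading.Thread(target=nnn, args=(url,))
        threads.append(t)
        t.start()
    for t in threads:  # wait for all requests to finish
        t.join()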

multithreaded urllib2 freezes on nose framework

I have Python code that uses nose_parameterized, as below:
from nose_parameterized import parameterized
from multiprocessing.pool import ThreadPool
import urllib2

def make_http_call(url, req_type):
    opener = urllib2.build_opener()  # <=== this line causes it to freeze
    return 1

pool = ThreadPool(processes=4)
results = []
urls = ['a', 'b', 'c', 'd']
for url in urls:
    results.append(pool.apply_async(make_http_call, (url, 'html')))
d = {'add': []}
for ind, res in enumerate(results):
    d['add'].append((res.get(), 2+ind, 3+ind))

@parameterized(d['add'])
def test_add(a, b, c):
    assert a+b == c
This is the dummy version of the code. Basically, I need to load test parameters with http request responses and since there are lots of urls, I want to multithread them.
As soon as I add urllib2.build_opener, it freezes up under nose (but still works fine when run directly with python).
Also, I've tried urllib2.urlopen; same problem.
Any ideas whether there is a 'proper' (debuggable) way to work around this?
You can use nose's built-in multiprocess plugin for that, something like:
from nose_parameterized import parameterized
import urllib2

urls = ['http://www.google.com', 'http://www.yahoo.com']

@parameterized(urls)
def test_add(url):
    a = urllib2.urlopen(url).read()
    b = 2 + urls.index(url)
    c = 3 + urls.index(url)
    assert a+str(b) == str(c)
and run it with nosetests --processes=2. This enables you to distribute your test run among a set of worker processes that run tests in parallel, as you intended. Behind the scenes, the multiprocessing module is used.

Python Requests library throws exceptions in logging

The Python requests library appears to have some rather strange quirks when it comes to its logging behaviour.
Using the latest Python 2.7.8, I have the following code:
import requests
import logging

logging.basicConfig(
    filename='mylog.txt',
    format='%(asctime)-19.19s|%(task)-36s|%(levelname)s:%(name)s: %(message)s',
    level=eval('logging.%s' % 'DEBUG'))

logger = logging.getLogger(__name__)
logger.info('myprogram starting up...', extra={'task': ''})  # so far so good
...
(omitted code)
...
payload = {'id': 'abc', 'status': 'ok'}
# At this point the program continues but throws an exception.
requests.get('http://localhost:9100/notify', params=payload)
print 'Task is complete! NotifyURL was hit! - Exiting'
My program seems to exit normally, however inside the log file it creates (mylog.txt) I always find the following exception:
KeyError: 'task'
Logged from file connectionpool.py, line 362
If I remove this:
requests.get('http://localhost:9100/notify', params=payload)
then the exception is gone.
What exactly am I doing wrong here and how may I fix this?
I am using requests v2.4.3.
The problem is your custom logging format, where you expect a %(task).
Requests (or rather the bundled urllib3) does not include the task parameter when logging, as it has no way of knowing that you expect it.
As indicated in t-8ch's answer, the logger is being used internally by the requests library, and this library doesn't know anything about your parameters. A possible solution is to install a custom filter on the library's logger (in this case, one of its modules):
class TaskAddingFilter(logging.Filter):
    def __init__(self):
        logging.Filter.__init__(self)
    def filter(self, record):
        # Supply the 'task' attribute the format string expects, so records
        # coming from the library no longer raise KeyError.
        if not hasattr(record, 'task'):
            record.task = ''
        return True  # keep the record
# ...
requestsLogger = logging.getLogger('requests.packages.urllib3.connectionpool')
requestsLogger.addFilter(TaskAddingFilter())
Potentially, you need to add such filtering to all loggers from requests, which are:
requests.packages.urllib3.util
requests.packages.urllib3.connectionpool
requests.packages
requests.packages.urllib3
requests.packages.urllib3.util.retry
requests
requests.packages.urllib3.poolmanager
In my version, you can find them using the logging.Logger.manager.loggerDict attribute. So, you could do something like this:
for name, logger in logging.Logger.manager.loggerDict.iteritems():
    logger = logging.getLogger(name)  # because of lazy initialization
    if name.startswith('requests.'):
        logger.addFilter(TaskAddingFilter())
The TaskAddingFilter can be a bit smarter of course, e.g. adding a particular task entry depending on where you are in your code. I've added only the simplest solution for the code you've provided - the exception doesn't occur anymore - but the wide range of possibilities seems obvious ;)
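An alternative worth noting (my addition, not from the answers above): since basicConfig() installs a single handler on the root logger, you can attach the filter to that handler once, and it will then apply to every record that reaches the log file, whichever logger produced it:
import logging

# Assumes the TaskAddingFilter from above and that basicConfig() has already run.
for handler in logging.getLogger().handlers:
    handler.addFilter(TaskAddingFilter())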

Regarding memory consumption in python with urllib2 module

PS: I have a similar question with Requests HTTP library here.
I am using Python v2.7 on Windows 7. I am using the urllib2 module. I have two code snippets. One file is named myServer.py. The server class has two methods, getName(self, code) and getValue(self).
The other script, named testServer.py, simply calls the methods from the server class to retrieve the values and prints them. The server class retrieves the values from a server on my local network, so unfortunately I can't give you access for testing the code.
Problem: When I execute my testServer.py file, I observe in Task Manager that the memory consumption keeps increasing. Why is it increasing, and how can I avoid it? If I comment out the following line
print serverObj.getName(1234)
in testServer.py, then there is no increase in memory consumption.
I am sure that the problem is in getName(self, code) of the server class, but unfortunately I couldn't figure out what the problem is.
Code: Please find the code snippets below:
# This is the myServer.py file
import urllib2
import json
import random

class server():
    def __init__(self):
        url1 = 'https://10.0.0.1/'
        username = 'user'
        password = 'passw0rd'
        passwrdmgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
        passwrdmgr.add_password(None, url1, username, password)
        authhandler = urllib2.HTTPBasicAuthHandler(passwrdmgr)
        opener = urllib2.build_opener(authhandler)
        urllib2.install_opener(opener)

    def getName(self, code):
        code = str(code)
        url = 'https://10.0.0.1/' + code
        response = urllib2.urlopen(url)
        data = response.read()
        name = str(data).strip()
        return name

    def getValue(self):
        value = random.randrange(0, 11)
        return value
The following is the testServer.py snippet
from myServer import server
import time

serverObj = server()
while True:
    time.sleep(1)
    print serverObj.getName(1234)
    print serverObj.getValue()
Thank you for your time!
This question is quite similar to my other question, so I think the answer is also quite similar. The answer can be found here: https://stackoverflow.com/a/23172330/2382792
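One piece of hygiene that often helps in loops like this (my suggestion, not taken from the linked answer): close the response object once it has been read, so the socket and its buffers are released promptly instead of lingering until garbage collection:
def getName(self, code):
    url = 'https://10.0.0.1/' + str(code)
    response = urllib2.urlopen(url)
    try:
        data = response.read()
    finally:
        response.close()  # release the socket and buffers promptly
    return str(data).strip()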

Trouble finding library that supports http PUT in Python 2.7

I need to perform HTTP PUT operations from Python. Which libraries have been proven to support this? More specifically, I need to perform PUT on key pairs, not file upload.
I have been trying to work with the restful_lib.py, but I get invalid results from the API that I am testing. (I know the results are invalid because I can fire off the same request with curl from the command line and it works.)
After attending Pycon 2011 I came away with the impression that pycurl might be my solution, so I have been trying to implement that. I have two issues here. First, pycurl renames "PUT" as "UPLOAD" which seems to imply that it is focused on file uploads rather than key pairs. Second when I try to use it I never seem to get a return from the .perform() step.
Here is my current code:
import pycurl
import urllib

url = 'https://xxxxxx.com/xxx-rest'
UAM = pycurl.Curl()

def on_receive(data):
    print data

arglist = [
    ('username', 'testEmailAdd@test.com'),
    ('email', 'testEmailAdd@test.com'),
    ('username', 'testUserName'),
    ('givenName', 'testFirstName'),
    ('surname', 'testLastName')]
encodedarg = urllib.urlencode(arglist)
path2 = url + "/user/" + "99b47002-56e5-4fe2-9802-9a760c9fb966"
UAM.setopt(pycurl.URL, path2)
UAM.setopt(pycurl.POSTFIELDS, encodedarg)
UAM.setopt(pycurl.SSL_VERIFYPEER, 0)
UAM.setopt(pycurl.UPLOAD, 1)  # Set to "PUT"
UAM.setopt(pycurl.CONNECTTIMEOUT, 1)
UAM.setopt(pycurl.TIMEOUT, 2)
UAM.setopt(pycurl.WRITEFUNCTION, on_receive)
print "about to perform"
print UAM.perform()
httplib should manage this.
http://docs.python.org/library/httplib.html
There's an example on this page http://effbot.org/librarybook/httplib.htm
urllib and urllib2 are also suggested.
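For completeness, urllib2 itself can issue a PUT even though it has no verb option: overriding get_method on the Request object is a well-known trick. A minimal sketch, with a placeholder URL and key pair:
import urllib
import urllib2

data = urllib.urlencode([('givenName', 'testFirstName')])  # placeholder key pair
request = urllib2.Request('https://example.com/uam-rest/user/some-uuid', data)
request.add_header('Content-Type', 'application/x-www-form-urlencoded')
request.get_method = lambda: 'PUT'  # urllib2 would otherwise default to POST
response = urllib2.urlopen(request)
print response.read()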
Thank you all for your assistance. I think I have found an answer.
My code now looks like this:
import urllib
import httplib
import lxml
from lxml import etree

url = 'xxxx.com'
UAM = httplib.HTTPSConnection(url)
arglist = [
    ('username', 'testEmailAdd@test.com'),
    ('email', 'testEmailAdd@test.com'),
    ('username', 'testUserName'),
    ('givenName', 'testFirstName'),
    ('surname', 'testLastName')]
encodedarg = urllib.urlencode(arglist)
uuid = "99b47002-56e5-4fe2-9802-9a760c9fb966"
path = "/uam-rest/user/" + uuid
UAM.putrequest("PUT", path)
UAM.putheader('content-type', 'application/x-www-form-urlencoded')
UAM.putheader('accepts', 'application/com.internap.ca.uam.ama-v1+xml')
UAM.putheader("Content-Length", str(len(encodedarg)))
UAM.endheaders()
UAM.send(encodedarg)
response = UAM.getresponse()
html = etree.HTML(response.read())
result = etree.tostring(html, pretty_print=True, method="html")
print result
Updated: Now I am getting valid responses, so this seems to be my solution. (The pretty print at the end isn't working, but I don't really care; it is only there while I am building the function.)
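As an aside (my addition, not from the original thread): the requests library, which appears elsewhere on this page, reduces the same key-pair PUT to a single call:
import requests

payload = {'username': 'testUserName', 'givenName': 'testFirstName'}  # sample key pairs
response = requests.put('https://example.com/uam-rest/user/some-uuid', data=payload)
print response.status_code
print response.text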
