Passing a list as a url value to urlopen - python

Motivation
Motivated by this problem - the OP was using urlopen() and accidentally passed a sys.argv list instead of a string as a url. This error message was thrown:
AttributeError: 'list' object has no attribute 'timeout'
Because of the way urlopen was written, the error message and the traceback are not very informative and may be difficult to understand, especially for a Python newcomer:
Traceback (most recent call last):
  File "test.py", line 15, in <module>
    get_category_links(sys.argv)
  File "test.py", line 10, in get_category_links
    response = urlopen(url)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 420, in open
    req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
Problem
Here is the shortened code I'm working with:
try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen

import sys

def get_category_links(url):
    response = urlopen(url)
    # do smth with response
    print(response)

get_category_links(sys.argv)
I'm trying to work out whether this kind of error can be caught statically, either with smart IDEs like PyCharm, static code analysis tools like flake8 or pylint, or with language features like type annotations.
But I'm failing to detect the problem:
it is probably too specific for flake8 and pylint to catch - they don't warn about it
PyCharm does not warn about sys.argv being passed into urlopen, even though, if you "jump to source" of sys.argv, it is defined as:
argv = [] # real value of type <class 'list'> skipped
if I annotate the function parameter as a string and pass sys.argv, there are still no warnings:
def get_category_links(url: str) -> None:
    response = urlopen(url)
    # do smth with response

get_category_links(sys.argv)
Question
Is it possible to catch this problem statically (without actually executing the code)?

Instead of keeping it editor-specific, you can use mypy to analyze your code. That way the check runs in every dev environment, not just for those who use PyCharm.
from urllib.request import urlopen
import sys

def get_category_links(url: str) -> None:
    response = urlopen(url)
    # do smth with response

get_category_links(sys.argv)
response = urlopen(sys.argv)
The issues pointed out by mypy for the above code:
error: Argument 1 to "get_category_links" has incompatible type List[str]; expected "str"
error: Argument 1 to "urlopen" has incompatible type List[str]; expected "Union[str, Request]"
Mypy can infer the type of sys.argv here because of its definition in its typeshed stub file. Right now some standard library modules are still missing from typeshed, so you will have to either contribute them or ignore the related errors until they get added :-).
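For reference, the runtime fix that also satisfies mypy is to pass a single argument string rather than the whole argv list; a minimal sketch:

import sys
from urllib.request import urlopen

def get_category_links(url: str) -> None:
    response = urlopen(url)
    # do smth with response
    print(response)

# sys.argv[1] is a single str, so both mypy and urlopen are happy
get_category_links(sys.argv[1])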
When to run mypy?
To catch such errors, you can run mypy on the annotated files alongside your tests in your CI tool. Running it on every file in the project may take some time; for a small project it is your choice.
Add a pre-commit hook that runs mypy on staged files and points out issues right away (this could be a little annoying to the dev if it takes a while); a rough sketch of such a hook follows.
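A hypothetical sketch of that hook, saved as .git/hooks/pre-commit and made executable; it assumes mypy is on the PATH:

#!/usr/bin/env python
import subprocess
import sys

# list staged added/modified files
staged = subprocess.check_output(
    ['git', 'diff', '--cached', '--name-only', '--diff-filter=AM']
).decode().splitlines()
py_files = [f for f in staged if f.endswith('.py')]

# run mypy only on staged Python files; a non-zero exit blocks the commit
if py_files and subprocess.call(['mypy'] + py_files) != 0:
    sys.exit(1)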

First, check whether url is a string or not; if it is a string but not a valid URL, urlopen raises ValueError, which you can catch:
import sys
from urllib2 import urlopen

def get_category_links(url):
    if not isinstance(url, str):  # check whether url is a string
        print "Please give a string url"
        return
    try:
        response = urlopen(url)
        # do smth with response
        print response
    except ValueError:  # url is a string but not a valid URL
        print "Bad URL"

get_category_links(sys.argv)

Related

Python - Web Scraping exercise - Attribute Error

I am learning how to scrape web information. Below is a snippet of the actual code solution plus output from DataCamp.
On DataCamp this works perfectly fine, but when I try to run it in Spyder (on my own MacBook), it doesn't work.
This is because on DataCamp the URL has already been pre-loaded into a variable named 'response', whereas in Spyder the variable needs to be defined again.
So I first defined the response variable as response = requests.get('https://www.datacamp.com/courses/all') so that the code points to DataCamp's website.
My code looks like:
from scrapy.selector import Selector
import requests
response = requests.get('https://www.datacamp.com/courses/all')
this_url = response.url
this_title = response.xpath('/html/head/title/text()').extract_first()
print_url_title( this_url, this_title )
When I run this on Spyder, I got an error message
Traceback (most recent call last):
File "<ipython-input-30-6a8340fd3a71>", line 11, in <module>
this_title = response.xpath('/html/head/title/text()').extract_first()
AttributeError: 'Response' object has no attribute 'xpath'
Could someone please guide me? I would really like to know how to get this code working on Spyder.. thank you very much.
The value returned by requests.get('https://www.datacamp.com/courses/all') is a Response object, and this object has no attribute xpath, hence the error: AttributeError: 'Response' object has no attribute 'xpath'.
I assume response in your tutorial source was assigned to another object (most likely the one returned by etree.HTML), not the value returned by requests.get(url).
You can however do this:
from lxml import etree  # import etree
import requests

response = requests.get('https://www.datacamp.com/courses/all')  # get the Response object
tree = etree.HTML(response.text)  # parse the page's source from the Response object
result = tree.xpath('/html/head/title/text()')  # extract the title text
print(response.url)  # url
print(result)  # findings
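Alternatively, since the question already imports scrapy's Selector, you can wrap the downloaded HTML in a Selector to get the .xpath() method the DataCamp exercise relied on; a sketch, assuming scrapy is installed locally:

from scrapy.selector import Selector  # the import the question already uses
import requests

response = requests.get('https://www.datacamp.com/courses/all')
# wrap the raw HTML in a Selector so .xpath() is available
sel = Selector(text=response.text)
this_title = sel.xpath('/html/head/title/text()').extract_first()
print(response.url)
print(this_title)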

python - TypeError: a bytes-like object is required, not 'str' (Learn Python the Hard Way)

I am a newbie to Python and am reading Zed Shaw's Learn Python the Hard Way, but I hit a problem in ex51, which is about automated testing of getting input in the browser.
The automated testing code is the same as on this website:
https://github.com/CreaturePhil/Learn-Python-the-Hard-Way/tree/master/ex51
However, the problem is:
ERROR: tests.app_tests.test_index
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\lenovo\Anaconda3\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\Users\lenovo\projects\gothonweb\tests\app_tests.py", line 13, in test_index
    assert_response(resp,contains="Nobody")
  File "C:\Users\lenovo\projects\gothonweb\tests\tools.py", line 8, in assert_response
    assert contains in resp.data,"Response does not contain %r"%contains
TypeError: a bytes-like object is required, not 'str'
----------------------------------------------------------------------
Ran 1 test in 2.965s

FAILED (errors=1)
So I got a TypeError: a bytes-like object is required, not 'str'.
I have tried several fixes but still cannot resolve it; could someone help me?
Here are some of my codes:
tests\app_tests.py:
from nose.tools import *
from bin.app import app
from tests.tools import assert_response

def test_index():
    # check that we get a 404 on the / URL
    resp = app.request("/")
    assert_response(resp, status="404")

    # test our first GET request to /hello
    resp = app.request("/hello")
    assert_response(resp)

    # make sure default values work for the form
    resp = app.request("/hello", method="POST")
    assert_response(resp, contains="Nobody")

    # test that we get expected values
    data = {'name': 'Zed', 'greet': 'Hola'}
    resp = app.request("/hello", method="POST", data=data)
    assert_response(resp, contains="Zed")
tests\tools.py:
from nose.tools import *
import re

def assert_response(resp, contains=None, matches=None, headers=None, status="200"):
    assert status in resp.status, "Expected response %r not in %r" % (status, resp.status)
    if status == "200":
        assert resp.data, "Response data is empty."
    if contains:
        assert contains in resp.data, "Response does not contain %r" % contains
    if matches:
        reg = re.compile(matches)
        # note: compiled patterns have a .match method, not .matches
        assert reg.match(resp.data), "Response does not match %r" % matches
    if headers:
        assert_equal(resp.headers, headers)
bin\app.py:
import web

urls = (
    '/hello', 'index'
)

app = web.application(urls, globals())
render = web.template.render('templates/', base="layout")

class index(object):
    def GET(self):
        return render.hello_form()

    def POST(self):
        form = web.input(name="Nobody", greet="Hello")
        greeting = "%s,%s" % (form.greet, form.name)
        return render.index(greeting=greeting)

if __name__ == "__main__":
    app.run()
I know that the author advises installing Python 2.7. However, I installed the latest edition of Anaconda and use Python 3.6. I tried to create a Python 2.7 environment, but the same error still occurred.
I don't know why it happens or how to fix it; could someone help me?
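For what it's worth, under Python 3 resp.data is a bytes object while contains is a str, so the in test mixes the two types. A minimal sketch of one way to adjust assert_response in tests\tools.py, encoding the expected string before the membership test (the rest of the function is unchanged):

def assert_response(resp, contains=None, matches=None, headers=None, status="200"):
    assert status in resp.status, "Expected response %r not in %r" % (status, resp.status)
    if status == "200":
        assert resp.data, "Response data is empty."
    if contains:
        # encode the str so both operands of `in` are bytes under Python 3
        assert contains.encode() in resp.data, "Response does not contain %r" % contains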

python add http:// to url open sys args

Very new to Python. I'm trying to take in a command-line argument, which is a website, and assign it to a variable. I use the line below to do this.
sitefile = ur.urlopen(sys.argv[1])
The problem is it only works when formatted perfectly in the command line as 'python filename.py http://mywebsite.com/contacts.html'. I want to be able to drop the http://. I've tried the following 2 ways:
sitefile = ur.urlopen(sys.argv[1].startswith('http://'))
gets error message: AttributeError: 'bool' object has no attribute 'timeout'
sitefile = 'http://' + (ur.urlopen(sys.argv[1]))
gets error message: ValueError: unknown url type: 'mywebsite.com/contacts.html'. It appears to just ignore the first half of this concat; in fact urlopen(sys.argv[1]) is evaluated before the concatenation ever happens, so the 'http://' is never prepended.
What's ur? Please give the complete code.
sys.argv[1].startswith('http://') returns a bool object; to drop the http:// you should use sys.argv[1].replace('http://', '').
I think the code should like the following lines:
#!/usr/bin/env python
# encoding: utf-8

import sys
import urllib2

if len(sys.argv) != 2:
    print('usage: {} <url>'.format(sys.argv[0]))
    sys.exit()

url = sys.argv[1]
req = urllib2.urlopen(url)
print(req.read())
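That said, the goal is to be able to omit the scheme on the command line, so a small guard that prepends 'http://' when it is missing may be closer to what the asker wants; a minimal sketch (Python 2, matching the answer above):

import sys
import urllib2

url = sys.argv[1]
if not url.startswith('http://'):
    url = 'http://' + url  # prepend the scheme only when it is missing
sitefile = urllib2.urlopen(url)  # open the normalized url
print(sitefile.read())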

How do I search for text in a page using regular expressions in Python?

I'm trying to create a simple module for phenny, a simple IRC bot framework in Python. The module is supposed to go to http://www.isup.me/websitetheuserrequested to check whether a website is up or down. I assumed I could use a regex for the module, seeing as other built-in modules use them too, so I tried creating this simple script, although I don't think I did it right.
import re, urllib
import web

isupuri = 'http://www.isup.me/%s'
check = re.compile(r'(?ims)<span class="body">.*?</span>')

def isup(phenny, input):
    global isupuri
    global cleanup
    bytes = web.get(isupuri)
    quote = check.findall(bytes)
    result = re.sub(r'<[^>]*?>', '', str(quote[0]))
    phenny.say(result)

isup.commands = ['isup']
isup.priority = 'low'
isup.example = '.isup google.com'
It imports the required web packages (I think) and defines the URL string and the text to look for within the page. I really don't know what I did in those four lines; I kinda just ripped the code off another phenny module.
Here is an example of a quotes module that grabs a random quote from some webpage, I kinda tried to use that as a base: http://pastebin.com/vs5ypHZy
Does anyone know what I am doing wrong? If something needs clarified I can tell you, I don't think I explained this enough.
Here is the error I get:
Traceback (most recent call last):
  File "C:\phenny\bot.py", line 189, in call
    try: func(phenny, input)
  File "C:\phenny\modules\isup.py", line 18, in isup
    result = re.sub(r'<[^>]*?>', '', str(quote[0]))
IndexError: list index out of range
try this (from http://docs.python.org/release/2.6.7/library/httplib.html#examples):
import httplib

conn = httplib.HTTPConnection("www.python.org")
conn.request("HEAD", "/index.html")
res = conn.getresponse()
if res.status >= 200 and res.status < 300:
    print "up"
else:
    print "down"
You will also need to add code to follow redirects before checking the response status.
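A rough sketch of what that redirect handling could look like; the function name and hop cap are made up for illustration, and it assumes absolute Location headers and HTTP-only URLs:

import httplib
from urlparse import urlparse

def is_up(url, max_redirects=5):
    # follow Location headers until we reach a non-redirect response
    for _ in range(max_redirects):
        parts = urlparse(url)
        conn = httplib.HTTPConnection(parts.netloc)
        conn.request("HEAD", parts.path or "/")
        res = conn.getresponse()
        if res.status in (301, 302, 303, 307):
            url = res.getheader('location')  # assumed absolute
            continue
        return 200 <= res.status < 300
    return False  # too many redirects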
edit
Alternative that does not need to handle redirects but uses exceptions for logic:
import urllib2

request = urllib2.Request('http://google.com')
request.get_method = lambda: 'HEAD'

try:
    response = urllib2.urlopen(request)
    print "up"
    print response.code
except urllib2.URLError, e:
    # failure
    print "down"
    print e
You should do your own tests and choose the best one.
The error means your regexp wasn't found anywhere on the page (the list quote has no element 0).
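One likely reason the pattern matches nothing: the %s placeholder in isupuri is never filled in, so the bot fetches the literal URL http://www.isup.me/%s. A sketch of filling it, reusing the imports and globals from the question's module and assuming (as in other phenny modules) that input.group(2) holds the text after the command:

def isup(phenny, input):
    site = input.group(2)  # assumed: the argument after ".isup"
    bytes = web.get(isupuri % site)  # fill the %s with the requested site
    quote = check.findall(bytes)
    if not quote:  # guard against the IndexError from the traceback
        phenny.say("Couldn't parse isup.me's response")
        return
    result = re.sub(r'<[^>]*?>', '', str(quote[0]))
    phenny.say(result)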

lxml.etree.iterparse closes input file handler?

filterous is using iterparse to parse a simple XML StringIO object in a unit test. However, when trying to access the StringIO object afterwards, Python exits with a "ValueError: I/O operation on closed file" message. According to the iterparse documentation, "Starting with lxml 2.3, the .close() method will also be called in the error case," but I get no error message or Exception from iterparse. My IO-foo is obviously not up to speed, so does anyone have suggestions?
The command and (hopefully) relevant code:
$ python2.6 setup.py test
setup.py:
from setuptools import setup
from filterous import filterous as package

setup(
    ...
    test_suite = 'tests.tests',
tests/tests.py:
from cStringIO import StringIO
import unittest

from filterous import filterous

XML = '''<posts tag="" total="3" ...'''

class TestSearch(unittest.TestCase):
    def setUp(self):
        self.xml = StringIO(XML)
        self.result = StringIO()
    ...
    def test_empty_tag_not(self):
        """Empty tag; should get N results."""
        filterous.search(
            self.xml,
            self.result,
            {'ntag': [u'']},
            ['href'],
            False)
        self.assertEqual(
            len(self.result.getvalue().splitlines()),
            self.xml.getvalue().count('<post '))
filterous/filterous.py:
from lxml import etree
...

def search(file_pointer, out, terms, includes, human_readable = True):
    ...
    context = etree.iterparse(file_pointer, tag='posts')
Traceback:
ERROR: Empty tag; should get N results.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/victor/dev/filterous/tests/tests.py", line 149, in test_empty_tag_not
    self.xml.getvalue().count('<post '))
ValueError: I/O operation on closed file
PS: The tests all ran fine on 2010-07-27.
Seems to work fine with StringIO, try using that instead of cStringIO. No idea why it's getting closed.
Docs-fu is the problem. What you quoted "Starting with lxml 2.3, the .close() method will also be called in the error case," is nothing to do with iterparse. It appears on your linked page before the section on iterparse. It is part of the docs for the target parser interface. It is referring to the close() method of the target (output!) object, nothing to do with your StringIO. In any case, you also seem to have ignored that little word also. Before 2.3, lxml closed the target object only if the parse was successful. Now it also closes it upon error.
Why do you want to "access" the StringIO object after parsing has finished?
Update: By trying to access the StringIO object afterwards, do you mean all those self.xml.getvalue() calls in your tests? [Show the ferschlugginer traceback in your question so we don't need to guess!] If that's causing the problem (it does count as an IO operation), forget getvalue() ... if it were to work, wouldn't it return the (unconventionally named) (invariant) XML?
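Following that point, since the XML string is invariant, the simplest dodge is to count on the module-level constant instead of re-reading the buffer that iterparse may have closed; a sketch against the test above:

def test_empty_tag_not(self):
    """Empty tag; should get N results."""
    filterous.search(self.xml, self.result, {'ntag': [u'']}, ['href'], False)
    self.assertEqual(
        len(self.result.getvalue().splitlines()),
        XML.count('<post '))  # count on the constant, not the consumed buffer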