I've written the following Python code that visits each URL in an array and extracts specific information from the page: a web scraper of sorts. This one takes an array of Reddit threads and outputs the score of each thread. The program almost never runs to completion. Usually I get through 5 or so iterations before receiving the error message below. Could someone please help me get to the bottom of this?
import urllib2
from bs4 import BeautifulSoup
urls = ['http://www.reddit.com/r/videos/comments/1i12o2/soap_precursor_to_a_lot_of_other_hilarious_shows/', 'http://www.reddit.com/r/videos/comments/1i12nx/kid_reporter_interviews_ryan_reynolds/', 'http://www.reddit.com/r/videos/comments/1i12ml/just_my_two_boys_going_full_derp_shocking_plot/']
for x in urls:
    f = urllib2.urlopen(x)
    data = f.read()
    soup = BeautifulSoup(data)
    span = soup.find('span', attrs={'class':'number'})
    print '{}:{}'.format(x, span.text)
The error message I am getting is:
Traceback (most recent call last):
File "C:/Users/jlazarus/Documents/YouTubeparse2.py", line 7, in <module>
f = urllib2.urlopen(x)
File "C:\Python27\lib\urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 410, in open
response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 448, in error
return self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 382, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 429: Unknown
Wrap the request in a try/except block to catch the error. This is what you want if you just want to skip past the failing URL and carry on with the next one.
import urllib2
from bs4 import BeautifulSoup
urls = ['http://www.reddit.com/r/videos/comments/1i12o2/soap_precursor_to_a_lot_of_other_hilarious_shows/', 'http://www.reddit.com/r/videos/comments/1i12nx/kid_reporter_interviews_ryan_reynolds/', 'http://www.reddit.com/r/videos/comments/1i12ml/just_my_two_boys_going_full_derp_shocking_plot/']
for x in urls:
    try:
        f = urllib2.urlopen(x)
        data = f.read()
        soup = BeautifulSoup(data)
        span = soup.find('span', attrs={'class':'number'})
        print '{}:{}'.format(x, span.text)
    # HTTPError lives in the urllib2 module, so it must be qualified
    except urllib2.HTTPError:
        print("HTTP Error, continuing")
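HTTP 429 means "Too Many Requests", so skipping the error drops the threads that were rate limited. An alternative is to wait and try again. The following is a minimal sketch in Python 3 (where `urllib.request` replaces `urllib2`), assuming the server may send a `Retry-After` header given in seconds. The names `fetch_politely`, `max_tries`, and `default_wait`, and the injectable `opener` and `sleep` parameters, are hypothetical and exist only for illustration.

```python
import time
import urllib.error
import urllib.request


def fetch_politely(url, opener=urllib.request.urlopen, sleep=time.sleep,
                   max_tries=3, default_wait=2.0):
    """Fetch url, waiting and retrying when the server answers 429."""
    for attempt in range(max_tries):
        try:
            return opener(url).read()
        except urllib.error.HTTPError as e:
            if e.code != 429 or attempt == max_tries - 1:
                raise  # not a rate limit, or out of tries
            # Retry-After may be missing or a date string; in both
            # cases fall back to the (assumed) default wait.
            retry_after = e.headers.get("Retry-After") if e.headers else None
            try:
                wait = float(retry_after)
            except (TypeError, ValueError):
                wait = default_wait
            sleep(wait)
```

In the loop above, `data = fetch_politely(x)` would take the place of the `urlopen`/`read` pair. Pausing briefly between requests may also help avoid the 429 in the first place.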
I'm trying to use the Pastebin API (docs: https://pastebin.com/doc_api) from Python, using the urllib library: https://docs.python.org/3/library/urllib.html.
import urllib.request
import urllib.parse
def main():
    def pastebinner():
        site = 'https://pastebin.com/api/api_post.php'
        dev_key =
        code = "12345678910, test"
        our_data = urllib.parse.urlencode({"api_dev_key": dev_key, "api_option": "paste", "api_paste_code": code})
        our_data = our_data.encode()
        resp = urllib.request.urlopen(site, our_data)
        print(resp.read())
    pastebinner()
if __name__ == "__main__":
    main()
Here's the error I get:
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.1520.0_x64__qbz5n2kfra8p0\lib\urllib\request.py", line 214, in urlopen
return opener.open(url, data, timeout)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.1520.0_x64__qbz5n2kfra8p0\lib\urllib\request.py", line 523, in open
response = meth(req, response)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.1520.0_x64__qbz5n2kfra8p0\lib\urllib\request.py", line 632, in http_response
response = self.parent.error(
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.1520.0_x64__qbz5n2kfra8p0\lib\urllib\request.py", line 561, in error
return self._call_chain(*args)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.1520.0_x64__qbz5n2kfra8p0\lib\urllib\request.py", line 494, in _call_chain
result = func(*args)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.1520.0_x64__qbz5n2kfra8p0\lib\urllib\request.py", line 641, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 422: Unprocessable entity
Any ideas regarding the reason for getting this error?
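The Pastebin API expects an HTTP POST. Note that urllib.request.urlopen defaults to GET only when no data is passed; once you supply data, as in urllib.request.urlopen(site, our_data), the request is already sent as a POST. You can still make the method explicit with a Request object, as below, but if the 422 persists, check that the submitted fields match what the API docs require.
Please note that the code below is untested.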
import urllib.request
import urllib.parse
def main():
    def pastebinner():
        site = 'https://pastebin.com/api/api_post.php'
        dev_key = 'APIKEYGOESHERE'
        code = "12345678910, test"
        our_data = urllib.parse.urlencode({"api_dev_key": dev_key, "api_option": "paste", "api_paste_code": code})
        our_data = our_data.encode()
        request = urllib.request.Request(site, method='POST')
        resp = urllib.request.urlopen(request, our_data)
        print(resp.read())
    pastebinner()
if __name__ == "__main__":
    main()
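One way to check which HTTP method urllib will use, without touching the network, is `Request.get_method()`. The sketch below reuses the field names from the code above; the key is a placeholder, not a real one.

```python
import urllib.parse
import urllib.request

site = "https://pastebin.com/api/api_post.php"
fields = {
    "api_dev_key": "APIKEYGOESHERE",  # placeholder, not a real key
    "api_option": "paste",
    "api_paste_code": "12345678910, test",
}
body = urllib.parse.urlencode(fields).encode()

# No request is sent here; we only build Request objects and inspect them.
without_body = urllib.request.Request(site)
with_body = urllib.request.Request(site, data=body)
explicit = urllib.request.Request(site, data=body, method="POST")

print(without_body.get_method())  # GET
print(with_body.get_method())     # POST
print(explicit.get_method())      # POST
```

A request that carries data is sent as a POST even when `method` is not given, so setting `method='POST'` mainly documents the intent.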
The error is very unhelpful. I mean, why not return a teapot response instead?
Leaving this here in case anyone else runs into this issue. I'm not 100% sure about this and will test later, but don't use urllib2; use httplib2 instead. I believe that will fix your problem.
I am attempting to write a Python program that downloads a series of product pictures from a site. The site stores its images under a fixed URL format, https://www.sitename.com/XYZabcde, where XYZ is three letters that identify the brand of the product and abcde is a five-digit number between 00000 and 30000.
Here is my code:
import urllib.request
def down(i, inp):
    full_path = 'images/image-{}.jpg'.format(i)
    url = "https://www.sitename.com/{}{}.jpg".format(inp,i)
    urllib.request.urlretrieve(url, full_path)
    print("saved")
    return None
inp = input("brand :" )
i = 20100
while i <= 20105:
    x = str(i)
    y = x.zfill(5)
    z = "https://www.sitename.com/{}{}.jpg".format(inp,y)
    print(z)
    down(y, inp)
    i += 1
With this code I can successfully download a series of pictures that I know exist. For example, brand RVL from 20100 to 20105 successfully downloads those six pictures.
However, when I widen the while loop to include links that I don't know will return an image, I get this error:
Traceback (most recent call last):
File "c:/Users/euan/Desktop/university/programming/Python/parser/test - Copy.py", line 20, in <module>
down(y, inp)
File "c:/Users/euan/Desktop/university/programming/Python/parser/test - Copy.py", line 6, in down
urllib.request.urlretrieve(url, full_path)
File "C:\Users\euan\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 247, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
File "C:\Users\euan\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\euan\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\Users\euan\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 640, in http_response
response = self.parent.error(
File "C:\Users\euan\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 569, in error
return self._call_chain(*args)
File "C:\Users\euan\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 502, in _call_chain
result = func(*args)
File "C:\Users\euan\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
What can I do to check for and avoid any URL that would yield this result?
You cannot as such know in advance which URLs you don't have access to, but you can surround the download with a try-except:
import urllib.request, urllib.error
...
def down(i, inp):
    full_path = 'images/image-{}.jpg'.format(i)
    url = "https://www.sitename.com/{}{}.jpg".format(inp,i)
    try:
        urllib.request.urlretrieve(url, full_path)
        print("saved")
    except urllib.error.HTTPError as e:
        print("failed:", e)
    return None
In that case it will just print e.g. "failed: HTTP Error 403: Forbidden" whenever a URL cannot be fetched, and the program will continue.
I'm trying to download the HTML of a page (http://www.guangxindai.com in this case) but I'm getting back an error 403. Here is my code:
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
f = opener.open("http://www.guangxindai.com")
f.read()
but I get an error response:
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
f = opener.open("http://www.guangxindai.com")
File "C:\Python33\lib\urllib\request.py", line 475, in open
response = meth(req, response)
File "C:\Python33\lib\urllib\request.py", line 587, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python33\lib\urllib\request.py", line 513, in error
return self._call_chain(*args)
File "C:\Python33\lib\urllib\request.py", line 447, in _call_chain
result = func(*args)
File "C:\Python33\lib\urllib\request.py", line 595, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
I have tried different request headers, but I still cannot get a correct response. I can view the site in a browser, so this seems strange to me. I guess the site uses some method to block web spiders. Does anyone know what is happening? How can I get the HTML of the page correctly?
I was having the same problem as you, and I found the answer in this link.
The answer provided by Stefano Sanfilippo is quite simple and worked for me:
from urllib.request import Request, urlopen
url_request = Request("http://www.guangxindai.com",
headers={"User-Agent": "Mozilla/5.0"})
webpage = urlopen(url_request).read()
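A possible reason the header matters: by default urllib identifies itself as `Python-urllib/<version>`, which some servers may reject. The sketch below only inspects the headers locally and sends nothing; whether a given site filters on this header is an assumption, not something the traceback confirms.

```python
import urllib.request

# The default opener announces itself as Python-urllib/<version>.
opener = urllib.request.build_opener()
print(opener.addheaders)

# A Request built with an explicit header carries that value instead.
# urllib stores header names capitalized, hence "User-agent" on lookup.
req = urllib.request.Request("http://www.guangxindai.com",
                             headers={"User-Agent": "Mozilla/5.0"})
print(req.get_header("User-agent"))
```

If a browser-like User-Agent still yields a 403, the server is probably checking something else (other headers, cookies, or the client address), and the header alone will not help.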
If your aim is just to read the HTML of the page, you can use the following code. It worked for me on Python 2.7:
import urllib
f = urllib.urlopen("http://www.guangxindai.com")
f.read()
I am learning API testing in Python using the urllib2 module. I tried to execute the code below, but it throws the following error. Can anybody help me? Thanks in advance.
Code:
url = "http://localhost:8000/HPFlights_REST/FlightOrders/"
data = {"Class" : "Business","CustomerName" :"Bhavani","DepartureDate" : "2015-10-12","FlightNumber" : "1304","NumberOfTickets": "3"}
encoded_data = urllib.urlencode(data)
'''print encoded_data
print urllib2.urlopen(url, encoded_data).read()'''
request = urllib2.Request(url, encoded_data)
print request.get_method()
request.add_data(encoded_data)
response = urllib2.urlopen(request)
Error:
Traceback (most recent call last):
File "C:/Users/kanakadurga/PycharmProjects/untitled/API.py", line 44, in <module>
createFlightOrder()
File "C:/Users/kanakadurga/PycharmProjects/untitled/API.py", line 39, in createFlightOrder
response = urllib2.urlopen(request)
File "C:\Python27\lib\urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 437, in open
response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 550, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 475, in error
return self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 409, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 558, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 400: Bad Request
Process finished with exit code 1
It looks like you are trying to post data to the server.
Judging from the URL, my guess is that the server expects the data in JSON format.
If that is the case, you can do:
import json
import urllib2
url = "http://localhost:8000/HPFlights_REST/FlightOrders/"
data = {"Class": "Business", "CustomerName": "Bhavani", "DepartureDate": "2015-10-12", "FlightNumber": "1304", "NumberOfTickets": "3"}
encoded_data = json.dumps(data)
request = urllib2.Request(url, encoded_data, {'Content-Type': 'application/json'})
f = urllib2.urlopen(request) # issue the request
response = f.read() # read the response
f.close()
... # your next operations follow
The point is that you need to encode the data correctly (json) and also set the proper content-type header in the HTTP post request, which the server probably checks.
Otherwise, the default content-type would be application/x-www-form-urlencoded, as if the data came from a form.
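For readers on Python 3, the same idea can be sketched with `urllib.request`. This assumes, as the answer does, that the server accepts JSON; the block only builds and inspects the request, and the final call that would contact the server is left commented out.

```python
import json
import urllib.request

url = "http://localhost:8000/HPFlights_REST/FlightOrders/"
data = {"Class": "Business", "CustomerName": "Bhavani",
        "DepartureDate": "2015-10-12", "FlightNumber": "1304",
        "NumberOfTickets": "3"}

# In Python 3 the request body must be bytes, not str.
body = json.dumps(data).encode("utf-8")
request = urllib.request.Request(
    url, data=body, headers={"Content-Type": "application/json"})

print(request.get_method())                # POST, because data is set
print(request.get_header("Content-type"))  # application/json

# Uncomment when the server is running:
# response = urllib.request.urlopen(request).read()
```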
I have some code which is very similar to code used here:
https://github.com/jeysonmc/python-google-speech-scripts/blob/master/stt_google.py
Here is my code:
f = open(filename, 'rb')
speech = f.read()
f.close()
LANG_CODE = 'en-US' # Language to use
GOOGLE_SPEECH_URL = 'https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&pfilter=2&lang=%s&maxresults=6' % (LANG_CODE)
f = open(filename, 'rb')
flac_cont = f.read()
f.close()
hrs = {"User-Agent": "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7",
'Content-type': 'audio/x-flac; rate=16000'}
req = urllib2.Request(GOOGLE_SPEECH_URL, data=flac_cont, headers=hrs)
print "Sending request to Google TTS"
p = urllib2.urlopen(req)
response = p.read()
print "response", response
res = eval(response)['hypotheses']
It seems to get stuck on the urllib2.urlopen(req) line. It gives back this error:
Traceback (most recent call last):
File "google-speech.py", line 443, in <module>
GoogleSpeech.text_from_speech(filename)
File "google-speech.py", line 274, in text_from_speech
p = urllib2.urlopen(req)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 400: Bad Request
I'm not sure what the issue could be.
EDIT: Added the end of my backtrace, which was missing earlier
If the error happens randomly, you can use a graceful retry algorithm, such as the one implemented here:
https://wiki.python.org/moin/PythonDecoratorLibrary#Retry
The idea is that if, for example, the URL is currently not reachable, you don't keep retrying blindly. Instead you increase the interval between retries to give the target time to recover, and eventually give up if the URL cannot be opened at all.
If the error happens every time, you have a different problem and should post the complete stack trace.
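The retry idea described above can be sketched as a small helper. This is not the decorator from the linked wiki page; `retry_with_backoff` and its parameters (`tries`, `first_delay`, `factor`, and the injectable `sleep`) are hypothetical names chosen for illustration.

```python
import time


def retry_with_backoff(func, tries=4, first_delay=1.0, factor=2.0,
                       sleep=time.sleep, exceptions=(Exception,)):
    """Call func(); on failure wait, multiply the delay, and try again."""
    delay = first_delay
    for attempt in range(1, tries + 1):
        try:
            return func()
        except exceptions:
            if attempt == tries:
                raise  # give up: let the last error propagate
            sleep(delay)
            delay *= factor  # wait longer before each further attempt
```

For the code in the question this might be called as `retry_with_backoff(lambda: urllib2.urlopen(req))`. Note that an HTTP 400 that occurs on every attempt will simply be raised after the last try, which is the point: a permanent error should not be retried forever.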
This is what I do to overcome this problem:
while True:
    try:
        p = urllib2.urlopen(req)
        break
    except Exception as e:
        print(e, 'Trying again...')