I am trying to make some image filters for my API. Some URLs work while most do not, and I wanted to know why and how to fix it. I have looked through another Stack Overflow post but have not had much luck, as I don't know what the problem is.
Here is an example of a working URL
And one that does not work
Edit: Here is another URL that does not work
Here is the API I am trying to make
Here is my code:

import urllib.request
from io import BytesIO
from PIL import Image

def generate_image_Wanted(imageUrl):
    # Download the avatar into an in-memory buffer.
    with urllib.request.urlopen(imageUrl) as url:
        f = BytesIO(url.read())
    im1 = Image.open("images/wanted.jpg")  # wanted-poster template
    im2 = Image.open(f)
    im2 = im2.resize((300, 285))
    img = im1.copy()
    img.paste(im2, (85, 230))
    d = BytesIO()
    img.save(d, "PNG")
    d.seek(0)  # rewind so the caller can read the buffer
    return d
Here is my error
Traceback (most recent call last):
File "c:\Users\micha\OneDrive\Desktop\MicsAPI\test.py", line 23, in <module>
generate_image_Wanted("https://cdn.discordapp.com/avatars/902240397273743361/9d7ce93e7510f47da2d8ba97ec32fc33.png")
File "c:\Users\micha\OneDrive\Desktop\MicsAPI\test.py", line 11, in generate_image_Wanted
with urllib.request.urlopen(imageUrl) as url:
File "C:\Users\micha\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 214, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\micha\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 523, in open
response = meth(req, response)
File "C:\Users\micha\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 632, in http_response
response = self.parent.error(
File "C:\Users\micha\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 561, in error
return self._call_chain(*args)
File "C:\Users\micha\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 494, in _call_chain
result = func(*args)
File "C:\Users\micha\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 641, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
Thank you for looking at this and have a good day.
Sites that you can't scrape often have server-side protection against known bots and spiders, and they block requests coming from urllib's default User-Agent.
You need to provide some headers; see the Python requests library for more on this.
Working example:

import urllib.request

# A browser-like User-Agent is enough to get past the block.
hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}
url = "https://cdn.discordapp.com/avatars/902240397273743361/9d7ce93e7510f47da2d8ba97ec32fc33.png"
req = urllib.request.Request(url, headers=hdr)
response = urllib.request.urlopen(req)
response.read()
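If you fold that header fix back into the function from the question, it could look something like this (a sketch based on the question's own code; the template path, sizes, and paste coordinates are copied from it):

import urllib.request
from io import BytesIO
from PIL import Image

def generate_image_Wanted(imageUrl):
    # A browser-like User-Agent keeps the CDN from answering 403.
    hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}
    req = urllib.request.Request(imageUrl, headers=hdr)
    with urllib.request.urlopen(req) as resp:
        avatar = Image.open(BytesIO(resp.read()))
    poster = Image.open("images/wanted.jpg")
    avatar = avatar.resize((300, 285))
    img = poster.copy()
    img.paste(avatar, (85, 230))
    out = BytesIO()
    img.save(out, "PNG")
    out.seek(0)
    return out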
Related
I'm trying to map this website, but I ran into a problem while trying to fully crawl it: I'm getting a 404 error even though the URL exists.
Here is my code:
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

csvFile = open("C:/Users/Pichau/codigo/govbr/brasil/govfederal/govbr/arquivos/teste.txt", 'wt')
paginas = set()

def getLinks(pageUrl):
    global paginas
    html = urlopen("https://www.gov.br/pt-br/" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    writer = csv.writer(csvFile)
    for link in bsObj.findAll("a"):
        if 'href' in link.attrs:
            if link.attrs['href'] not in paginas:
                # new page found
                newPage = link.attrs['href']
                print(newPage)
                paginas.add(newPage)
                getLinks(newPage)
                csvRow = []
                csvRow.append(newPage)
                writer.writerow(csvRow)

getLinks("")
csvFile.close()
And this is the error message I got after I ran that code:
#wrapper
/
#main-navigation
#nolivesearchGadget
#tile-busca-input
#portal-footer
http://brasil.gov.br
Traceback (most recent call last):
File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 26, in <module>
getLinks("")
File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 20, in getLinks
getLinks(newPage)
File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 20, in getLinks
getLinks(newPage)
File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 20, in getLinks
getLinks(newPage)
[Previous line repeated 4 more times]
File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 10, in getLinks
html = urlopen("https://www.gov.br/pt-br/"+pageUrl)
File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 214, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 523, in open
response = meth(req, response)
File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 632, in http_response
response = self.parent.error(
File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 561, in error
return self._call_chain(*args)
File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 494, in _call_chain
result = func(*args)
File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 641, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
I've tried it with only the main link, and it works fine, but as soon as I add the pageUrl variable to the URL, it gives me this error. How can I fix it?
From what I can see, you're right: the page is there... for us people on browsers. What I assume is happening is some basic anti-bot mechanism that bans uncommon User-Agents; in other words, it only lets browsers view the page. However, since the User-Agent is a header that we control, we can change it so the server won't throw the 404 error.
I can't type out the code for it at the moment, but you will need to pair this StackOverflow answer describing how to change a header in urllib with some code that sets the "User-Agent" header to a value like Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36, which I've taken from here.
After you've changed the User-Agent header, you should be able to download the page successfully.
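A rough sketch of that combination (untested; openPage is a hypothetical helper you would call from getLinks in place of the bare urlopen):

from urllib.request import Request, urlopen

def openPage(pageUrl):
    # A browser User-Agent string, so the server serves the page instead of a 404.
    ua = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
          '(KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36')
    req = Request("https://www.gov.br/pt-br/" + pageUrl, headers={"User-Agent": ua})
    return urlopen(req)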
I'm trying to download the HTML of a page (http://www.guangxindai.com in this case), but I'm getting back a 403 error. Here is my code:
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
f = opener.open("http://www.guangxindai.com")
f.read()
But I get an error response:
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
f = opener.open("http://www.guangxindai.com")
File "C:\Python33\lib\urllib\request.py", line 475, in open
response = meth(req, response)
File "C:\Python33\lib\urllib\request.py", line 587, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python33\lib\urllib\request.py", line 513, in error
return self._call_chain(*args)
File "C:\Python33\lib\urllib\request.py", line 447, in _call_chain
result = func(*args)
File "C:\Python33\lib\urllib\request.py", line 595, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
I have tried different request headers, but I still cannot get a correct response. I can view the page through a browser, which seems strange to me. I guess the site uses some method to block web spiders. Does anyone know what is happening? How can I get the HTML of the page correctly?
I was having the same problem as you, and I found the answer in this link.
The answer provided by Stefano Sanfilippo is quite simple and worked for me:
from urllib.request import Request, urlopen

url_request = Request("http://www.guangxindai.com",
                      headers={"User-Agent": "Mozilla/5.0"})
webpage = urlopen(url_request).read()
If your aim is to read the HTML of the page, you can use the following code. It worked for me on Python 2.7:
import urllib
f = urllib.urlopen("http://www.guangxindai.com")
f.read()
I have a very basic script to download a website using Python urllib2.
It had been working brilliantly for the past six months, and then this morning it stopped working.
#!/usr/bin/python
import urllib2

# Authenticated proxy; DOMAIN, USER, PASS, PROXY and PORT are placeholders.
proxy_support = urllib2.ProxyHandler({'http': 'http://DOMAIN\USER:PASS@PROXY:PORT/'})
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)

translink = open('/tmp/trains.html', 'w')
response = urllib2.urlopen('http://translink.com.au')
html = response.read()
translink.write(html)
translink.close()
I am now getting the following error
Traceback (most recent call last):
File "./gettrains.py", line 7, in <module>
response = urllib2.urlopen('http://translink.com.au')
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 407, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 520, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 445, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 379, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 528, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 502: Proxy Error ( The HTTP message includes an unsupported header or an unsupported combination of headers. )
I am new to Python; any help would be very much appreciated.
Cheers
#!/usr/bin/python
import requests

# domain, user, pass, proxy and port are placeholders for your real proxy details.
proxies = {
    "http": "http://domain\user:pass@proxy:port",
    "https": "http://domain\user:pass@proxy:port",
}

html = requests.get("http://translink.com.au", proxies=proxies)
translink = open('/tmp/trains.html', 'w')
translink.write(html.content)
translink.close()
Try changing the header. For example:
opener = urllib2.build_opener(proxy_support)
opener.addheaders = ([('User-Agent' , 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)')])
urllib2.install_opener(opener)
I had the same problem a few days ago. My proxy didn't accept the default header User-Agent: Python-urllib/2.7.
To simplify things a little, I would avoid setting up the proxy from within Python and simply let your OS manage it for you. You can do this by setting an environment variable (like export http_proxy="your_proxy" on Linux). Then grab the file directly through Python, which you can do with urllib2 or requests; you may also consider the wget module.
It's entirely possible that your proxy changed and now forwards the requests with headers that are no longer acceptable to your final destination. In that case there's very little you can do.
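A sketch of the environment-variable approach (assuming a Linux shell; the proxy address is a placeholder):

# In the shell, before running the script:
#   export http_proxy="http://user:pass@proxy:port"
import urllib2

# urllib2 picks up the http_proxy environment variable automatically,
# so no ProxyHandler is needed in the script itself.
response = urllib2.urlopen('http://translink.com.au')
with open('/tmp/trains.html', 'w') as translink:
    translink.write(response.read())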
I have some code which is very similar to code used here:
https://github.com/jeysonmc/python-google-speech-scripts/blob/master/stt_google.py
Here is my code:
import urllib2

LANG_CODE = 'en-US'  # Language to use
GOOGLE_SPEECH_URL = 'https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&pfilter=2&lang=%s&maxresults=6' % (LANG_CODE)

# Read the FLAC audio that will be sent as the request body.
f = open(filename, 'rb')
flac_cont = f.read()
f.close()

hrs = {"User-Agent": "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7",
       'Content-type': 'audio/x-flac; rate=16000'}

req = urllib2.Request(GOOGLE_SPEECH_URL, data=flac_cont, headers=hrs)
print "Sending request to Google TTS"
p = urllib2.urlopen(req)
response = p.read()
print "response", response
res = eval(response)['hypotheses']
It seems to get stuck on the urllib2.urlopen(req) line. It gives back this error:
Traceback (most recent call last):
File "google-speech.py", line 443, in <module>
GoogleSpeech.text_from_speech(filename)
File "google-speech.py", line 274, in text_from_speech
p = urllib2.urlopen(req)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 400: Bad Request
I'm not sure what the issue could be.
EDIT: Added the end of my backtrace, which was missing earlier.
If the error happens randomly, you can use a graceful retry algorithm, such as the one implemented here:
https://wiki.python.org/moin/PythonDecoratorLibrary#Retry
The idea is that if, for example, the URL is currently not reachable, you don't keep retrying blindly; you increase the retry interval to give the target location time to recover, and eventually give up if the URL cannot be opened at all.
If the error happens every time, you have a different problem and should post the complete stack trace.
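A minimal sketch of that kind of backoff retry (the attempt count and delays are arbitrary placeholders):

import time
import urllib2

def urlopen_with_retry(req, tries=4, delay=2, backoff=2):
    # Retry the request, doubling the wait after each failure.
    for attempt in range(tries):
        try:
            return urllib2.urlopen(req)
        except urllib2.URLError:  # HTTPError is a subclass of URLError
            if attempt == tries - 1:
                raise  # give up after the final attempt
            time.sleep(delay)
            delay *= backoff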
This is what I do to overcome this problem:
while True:
    try:
        p = urllib2.urlopen(req)
        break
    except Exception as e:
        print(e, 'Trying again...')
I am trying to pull information from a site every 5 seconds, but it doesn't seem to be working and I get errors every time I run it.
Code below:
import urllib2, threading

def readpage():
    data = urllib2.urlopen('http://forums.zybez.net/runescape-2007-prices').read()
    for line in data:
        if 'forums.zybez.net/runescape-2007-prices/player/' in line:
            a = line.split('/runescape-2007-prices/player/'[1])
            print(a.split('">')[0])

t = threading.Timer(5.0, readpage)
t.start()
I get these errors:
Exception in thread Thread-1:
Traceback (most recent call last):
File "C:\Python27\lib\threading.py", line 808, in __bootstrap_inner
self.run()
File "C:\Python27\lib\threading.py", line 1080, in run
self.function(*self.args, **self.kwargs)
File "C:\Users\Jordan\Desktop\username.py", line 3, in readpage
data = urllib2.urlopen('http://forums.zybez.net/runescape-2007-prices').read()
File "C:\Python27\lib\urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 410, in open
response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 448, in error
return self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 382, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 403: Forbidden
Help would be appreciated, thanks!
The site is rejecting the default User-Agent reported by urllib2. You can change it for all requests in the script using install_opener.
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Linux i686; rv:5.0) Gecko/20100101 Firefox/5.0')]
urllib2.install_opener(opener)
You'll also need to split the data returned by the site in order to read it line by line:
urllib2.urlopen('http://forums.zybez.net/runescape-2007-prices').read().splitlines()
and change
line.split('/runescape-2007-prices/player/'[1])
to
line.split('/runescape-2007-prices/player/')[1]
Working:

import urllib2, threading

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Linux i686; rv:5.0) Gecko/20100101 Firefox/5.0')]
urllib2.install_opener(opener)

def readpage():
    data = urllib2.urlopen('http://forums.zybez.net/runescape-2007-prices').read().splitlines()
    for line in data:
        if 'forums.zybez.net/runescape-2007-prices/player/' in line:
            a = line.split('/runescape-2007-prices/player/')[1]
            print(a.split('">')[0])

t = threading.Timer(5.0, readpage)
t.start()
Did you try opening that URL without the thread? The error code says 403: Forbidden; maybe you need authentication for that web page.
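If it really were an authentication issue, urllib2's basic-auth handler would be the tool. A hypothetical sketch (the username and password are placeholders, and this only applies if the server actually uses HTTP basic auth):

import urllib2

password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, 'http://forums.zybez.net/', 'username', 'password')
opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(password_mgr))
data = opener.open('http://forums.zybez.net/runescape-2007-prices').read()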
This has nothing to do with Python: the server is denying your requests to that URL.
I suspect that either the URL is incorrect or you've hit some kind of rate limiting and are being blocked.
EDIT: how to make it work
The site is blocking Python's User-Agent. Try this:
import urllib2, threading

def readpage():
    headers = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request('http://forums.zybez.net/runescape-2007-prices', None, headers)
    data = urllib2.urlopen(req).read().splitlines()
    for line in data:
        if 'forums.zybez.net/runescape-2007-prices/player/' in line:
            a = line.split('/runescape-2007-prices/player/')[1]
            print(a.split('">')[0])