Wikipedia | ERROR: The requested URL could not be retrieved - python

I am trying to get a page from wikipedia. I have already added a 'User-Agent' header to my request. However, when I open the page using urllib2.urlopen I get the following page as a result:
ERROR: The requested URL could not be retrieved
ERROR
The requested URL could not be retrieved
While trying to retrieve the URL the following error was encountered:
Access Denied.
Access control configuration prevents your request from being allowed at this time. Please contact your service provider if you feel this is incorrect.
Here is the code I use to open the page:
def get_site(request_user_link, request):  # request_user_link is the request for the URL entered by the user
    # request is the request generated by the current page - used to get HTTP_USER_AGENT
    # tag for WIKIPEDIA and other sites
    request_user_link.add_header('User-Agent', str(request.META['HTTP_USER_AGENT']))
    try:
        response = urllib2.urlopen(request_user_link)
    except urllib2.HTTPError, err:
        logger.error('HTTPError = ' + str(err.code))
        response = None
    except urllib2.URLError, err:
        logger.error('URLError = ' + str(err.reason))
        response = None
    except httplib.HTTPException, err:
        logger.error('HTTPException')
        response = None
    except Exception:
        import traceback
        logger.error('generic exception: ' + traceback.format_exc())
        response = None
    return response
I pass the value of the HTTP_USER_AGENT from the current user object as the "User-Agent" header for the request I send to wikipedia.
If there are any other headers I need to add to this request, please let me know. Otherwise, please advise an alternate solution.
EDIT: Please note that I was able to get the page successfully yesterday after I added the 'User-Agent' header. Today, I seem to be getting this Error page.

Wikipedia is not very forgiving if you violate their crawling rules. Since you first exposed your IP with the standard urllib2 user-agent, you were flagged in the logs. When the logs were 'processed', your IP was banned. This should be easy to test by running your script from another IP. Be careful, since Wikipedia is also known to block IP ranges.
IP bans are usually temporary, but if you have multiple offenses they can become permanent.
Wikipedia also autobans known proxy servers. I suspect that they themselves parse anonymous proxy sites like proxy-list.org and commercial proxy sites like hidemyass.com for the IPs.
Wikipedia does this, of course, to protect the content from vandalism and spam. Please respect the rules.
If possible, I suggest using a local copy of Wikipedia on your own servers. That copy you can hammer to your heart's content.
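If you do keep fetching from the live site, Wikipedia's crawling guidelines generally ask for a descriptive User-Agent that identifies your tool and gives a contact address rather than a spoofed browser string. A minimal sketch, where the agent string, contact address, and article URL are all placeholders:
import urllib2

opener = urllib2.build_opener()
# Descriptive User-Agent with contact details (placeholder values)
opener.addheaders = [('User-Agent', 'MySiteFetcher/0.1 (contact: you@example.com)')]
response = opener.open('http://en.wikipedia.org/wiki/Python_(programming_language)')
html = response.read()
response.close()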

I wrote a script that reads from Wikipedia; this is a simplified version.
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')] #wikipedia needs this
resource = opener.open(URL)
data = resource.read()
resource.close()
#data is your website.
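For anyone on Python 3, a rough equivalent using urllib.request (URL is the same placeholder for the page you want to fetch):
import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]  # same header as above
with opener.open(URL) as resource:  # URL is the page you want to fetch
    data = resource.read()
# data is your website.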

Related

Python : 403 client error : Forbidden for url

Scraping had been working without any problem, but access was suddenly denied.
The error code is as below.
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://www.investing.com/equities/alibaba
Most of the solutions on Google say to set a User-Agent in the headers to avoid robot detection. That part was already applied to the code and the scraping had been working fine, but from one day onward it started being rejected.
So I tried fakeuseragent and random user agents to send a random User-Agent with each request, but it was always rejected.
Second, to check whether it was an IP-based block, I used a VPN to switch IPs, but the request was still denied.
The last thing I found was regarding Referer Control. So I installed the Referer Control extension, but I can't find any information on how to use it.
My code is below.
url = "https://www.investing.com/equities/alibaba"
ua = generate_user_agent()
print(ua)
headers = {"User-agent":ua}
res = requests.get(url,headers=headers)
print("Response Code :", res.status_code)
res.raise_for_status()
print("Start")
Any help will be appreciated.
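For what it's worth, a Referer header can be sent the same way as the User-Agent. A minimal sketch only; the header values are illustrative and there is no guarantee this particular site will accept them:
import requests

url = "https://www.investing.com/equities/alibaba"
headers = {
    "User-Agent": "Mozilla/5.0",               # illustrative browser-style agent
    "Referer": "https://www.investing.com/",   # illustrative referer value
}
res = requests.get(url, headers=headers)
print("Response Code :", res.status_code)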

Python Request not allowing redirects

I am using the Python requests library to scrape robots.txt data from a list of URLs:
for url in urls:
    url = urllib.parse.urljoin(url, "robots.txt")
    try:
        r = requests.get(url, headers=headers, allow_redirects=False)
        r.raise_for_status()
        extract_robots(r)
    except (exceptions.RequestException, exceptions.HTTPError, exceptions.Timeout) as err:
        handle_exeption(err)
In my list of URLs, I have this webpage: https://reward.ff.garena.com. When I request https://reward.ff.garena.com/robots.txt, I am redirected straight to https://reward.ff.garena.com/en. However, I specified allow_redirects=False in my request parameters precisely because I don't want redirects.
How can I skip this kind of redirect and make sure I only have domain/robots.txt data calling my extract_robots(data) method?
Do you know for sure that there is a robots.txt at that location?
I note that if I request https://reward.ff.garena.com/NOSUCHFILE.txt I get the same result as for robots.txt.
The allow_redirects=False only stops requests from automatically following 302/location= responses - i.e. it doesn’t stop the server you’re trying to access from returning a redirect as the response to the request you’re making.
If you get this type of response, I guess it indicates the file you requested isn't available, or that some other error is preventing you from accessing it. In the general case of file access this might indicate a need for authentication, but for robots.txt that shouldn't be the problem; the simplest assumption is that the robots.txt isn't there.
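As a rough illustration of that point (a sketch, not the poster's code; the timeout and the handling of the missing file are assumptions): with allow_redirects=False the 3xx response comes back as the response object itself, so you can inspect it and treat a redirect as "no robots.txt here".
import requests

def fetch_robots_txt(url, headers=None):
    # The redirect response is returned as-is instead of being followed.
    r = requests.get(url, headers=headers, allow_redirects=False, timeout=10)
    if r.is_redirect:
        # e.g. redirected to /en - treat it as "no robots.txt at this location"
        print("Redirected to", r.headers.get("Location"), "- skipping")
        return None
    r.raise_for_status()
    return r.text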

urllib.error.HTTPError: HTTP Error 403: Forbidden with urllib.request

I am trying to read an image from a URL and download it onto my machine with Python. I used the example from this blog post https://www.geeksforgeeks.org/how-to-open-an-image-from-the-url-in-pil/, which fetches https://media.geeksforgeeks.org/wp-content/uploads/20210318103632/gfg-300x300.png. However, when I try my own example it just doesn't seem to work; I've tried the HTTP version and it still gives me the 403 error. Does anyone know what the cause could be?
import urllib.request
urllib.request.urlretrieve(
    "http://image.prntscr.com/image/ynfpUXgaRmGPwj5YdZJmaw.png",
    "gfg.png")
Output:
urllib.error.HTTPError: HTTP Error 403: Forbidden
The server at prntscr.com is actively rejecting your request. There are many reasons why that could be. Some sites will check the user agent of the caller to see whether it comes from a browser. In my case, I used httpie to test whether the server would allow me to download through a non-browser app. It worked. So then I simply made up a User-Agent header to see if the problem was just the lack of one.
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'MyApp/1.0')]
urllib.request.install_opener(opener)
urllib.request.urlretrieve(
    "http://image.prntscr.com/image/ynfpUXgaRmGPwj5YdZJmaw.png",
    "gfg.png")
It worked! Now I don't know what logic the server uses. For instance, I tried a standard Mozilla/5.0 and that did not work. You won't always encounter this issue (most sites are pretty lax in what they allow as long as you are reasonable), but when you do, try playing with the User-Agent. If nothing works, try using the same user-agent string as your browser.
I had the same problem and it was due to an expired URL. I checked the response text and I was getting "URL signature expired" which is a message you wouldn't normally see unless you checked the response text.
This means some URLs just expire, usually for security purposes. Try to get the URL again and update the URL in your script. If there isn't a new URL for the content you're trying to scrape, then unfortunately you can't scrape for it.
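As a sketch of how to check the response text in that situation with urllib (the exact message will vary by server; the URL is the one from the question above):
import urllib.request
import urllib.error

url = "http://image.prntscr.com/image/ynfpUXgaRmGPwj5YdZJmaw.png"
try:
    urllib.request.urlretrieve(url, "gfg.png")
except urllib.error.HTTPError as err:
    # HTTPError doubles as a response object, so the body can be read
    # to look for hints such as "URL signature expired".
    body = err.read().decode("utf-8", errors="replace")
    print(err.code, body[:200])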

urllib2 gets an HTTP code of 404 on a site where Firefox gets code 200

I'm trying to scrape data off an internal website with urllib2. When I run
try:
    resp = urllib2.urlopen(urlBase)
    data = resp.read()
except HTTPError as e1:
    print("HTTP Error %d trying to reach %s" % (e1.code, urlBase))
except URLError as e2:
    print("URLError %d" % e2.code)
    print(e2.read())
I get an HTTPError with e1.code of 404. If I navigate to the site on Firefox and use the developer tools I see an HTTP code of 200. Does anyone know what the problem could be?
Edit 1 Before I call this, I also install an empty proxy handler so urllib2 doesn't try to use the proxy settings set by my shell:
handler = urllib2.ProxyHandler({})
opener = urllib2.build_opener(handler)
urllib2.install_opener(opener)
Edit 2 FWIW the URL I'm navigating to is an Apache index and not an HTML document. However, the status code as read by Firefox still says HTTP/1.1 Status 200.
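One way to narrow this down (a debugging sketch, reusing urlBase from the snippet above, not part of the original post) is to dump the failing response itself, since urllib2's HTTPError doubles as a response object whose headers and body you can compare with what Firefox shows:
import urllib2

try:
    resp = urllib2.urlopen(urlBase)
except urllib2.HTTPError as e:
    print(e.code)           # status code the server actually returned
    print(e.info())         # response headers
    print(e.read()[:500])   # start of the error body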
This sometimes happens to me after I've been using an HTTP proxy like Charles. In my case, the fix is simply opening and closing the HTTP proxy.
It turns out a function inside the try block that I stripped out of the snippet was trying to access another page, and that is what triggered the 404 error.

urllib2 basic authentication oddities

I'm slamming my head against the wall with this one. I've been trying every example and reading every last bit I can find online about basic HTTP authorization with urllib2, but I cannot figure out what is causing my specific error.
Adding to the frustration is that the code works for one page, and yet not for another.
Logging into www.mysite.com/adm goes absolutely smoothly; it authenticates with no problem. Yet if I change the address to 'http://mysite.com/adm/items.php?n=201105&c=200' I receive this error:
<h4 align="center" class="teal">Add/Edit Items</h4>
<p><strong>Client:</strong> </p><p><strong>Event:</strong> </p><p class="error">Not enough information to complete this task</p>
<p class="error">This is a fatal error so I am exiting now.</p>
Searching Google has led to zero information on this error.
The adm page is a frameset; I'm not sure if that's relevant at all.
Here is the current code:
import urllib2, urllib
import sys
import re
import base64
from urlparse import urlparse
theurl = 'http://xxxxxmedia.com/adm/items.php?n=201105&c=200'
username = 'XXXX'
password = 'XXXX'
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, theurl,username,password)
authhandler = urllib2.HTTPBasicAuthHandler(passman)
opener = urllib2.build_opener(authhandler)
urllib2.install_opener(opener)
pagehandle = urllib2.urlopen(theurl)
url = 'http://xxxxxxxmedia.com/adm/items.php?n=201105&c=200'
values = {'AvAudioCD': 1,
          'AvAudioCDDiscount': 00, 'AvAudioCDPrice': 50,
          'ProductName': 'python test', 'frmSubmit': 'Submit'}
#opener2 = urllib2.build_opener(urllib2.HTTPCookieProcessor())
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
This is just one of the many versions I've tried. I've followed every example from the urllib2 Missing Manual but still receive the same error.
Can anyone point to what I'm doing wrong?
I ran into a similar problem today. I was using basic authentication on the website I am developing and I couldn't authenticate any users.
Here are a few things you can use to debug your problem:
I used slumber.in and httplib2 for testing purposes. I ran both from an IPython shell to see what responses I was receiving.
Slumber actually uses httplib2 beneath the covers, so they acted similarly. I used tcpdump and later tcpflow (which shows the information in a much more readable form) to see what was really being sent and received. If you want a GUI, see Wireshark or its alternatives.
I tested my website with curl and when I used curl with my username/password it worked correctly and showed the requested page. But slumber and httplib2 were still not working.
I tested my website against browserspy.dk to see what the differences were. The important thing is that browserspy's site works with basic authentication and my website did not, so I could compare the two. I had read in a lot of places that you need to send HTTP 401 Unauthorized so that the browser or tool you are using will send the username/password you provided. But what I didn't know was that you also need the WWW-Authenticate field in the response header. That was the missing piece.
What made this whole situation odd was that while testing I would see httplib2 send basic authentication headers with most of the requests (tcpflow would show that). It turns out that the library does not send username/password authentication on the first request. If "Status 401" AND "WWW-Authenticate" are in the response, then the credentials are sent on the second request and on all requests to this domain from then on.
So to sum up, your application may be correct, but you might not be returning the standard headers and status code for the client to send credentials. Use your debug tools to find which is which. Also, there's a debug mode for httplib2: just set httplib2.debuglevel=1 so that debug information is printed on standard output. This is much more helpful than using tcpdump because it works at a higher level.
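For example, a minimal httplib2 sketch with debugging turned on (the URL and credentials are placeholders): on the first request httplib2 sends no credentials, and only retries with them after seeing a 401 response that carries a WWW-Authenticate header.
import httplib2

httplib2.debuglevel = 1  # print the request/response exchange to stdout

h = httplib2.Http()
h.add_credentials('myuser', 'mypassword')  # placeholder credentials
resp, content = h.request('http://example.com/protected/page')
print(resp.status)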
Hope this helps someone.
About a year ago, I went through the same process and documented how I solved the problem - the direct and simple way to authenticate, and the standard one. Choose what you deem fit.
HTTP Authentication in Python
There is also an explained description in the urllib2 Missing Manual.
From the HTML you posted, I still think that you authenticate successfully but encounter an error afterwards, in the processing of your POST request. I tried your URL and, failing authentication, I get a standard 401 page.
In any case, I suggest you try running your code again and performing the same operation manually in Firefox, only this time with Wireshark capturing the exchange. You can grab the full text of the HTTP request and response in both cases and compare the differences. In most cases that will lead you to the source of the error you get.
I also found that the passman stuff doesn't work (sometimes?). Adding the base64 user/pass header as per this answer https://stackoverflow.com/a/18592800/623159 did work for me. I am accessing a Jenkins URL like this: http:///job//lastCompletedBuild/testReport/api/python
This works for me:
import urllib2
import base64
baseurl="http://jenkinsurl"
username=...
password=...
url="%s/job/jobname/lastCompletedBuild/testReport/api/python" % baseurl
base64string = base64.encodestring('%s:%s' % (username, password)).replace('\n', '')
request = urllib2.Request(url)
request.add_header("Authorization", "Basic %s" % base64string)
result = urllib2.urlopen(request)
data = result.read()
This doesn't work for me, error 403 each time:
import urllib2
baseurl="http://jenkinsurl"
username=...
password=...
url="%s/job/jobname/lastCompletedBuild/testReport/api/python" % baseurl  # built the same way as in the working example above
##urllib2.HTTPError: HTTP Error 403: Forbidden
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, url, username,password)
urllib2.install_opener(urllib2.build_opener(urllib2.HTTPBasicAuthHandler(passman)))
req = urllib2.Request(url)
result = urllib2.urlopen(req)
data = result.read()
