I was scraping without any problem, but access is suddenly being denied.
The error is as follows.
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://www.investing.com/equities/alibaba
Most of the solutions on Google say to set a User-Agent header to avoid robot detection. That was already applied in my code and the scraping worked fine, but from one day onward the requests started being rejected.
So I tried fake-useragent and random user agents to send a different User-Agent each time, but the request was still rejected.
Second, to check whether it was an IP-based block, I used a VPN to switch my IP, but I was still denied.
The last thing I found was about Referer control, so I installed the Referer Control extension, but I can't find any information on how to apply it here.
My code is below.
url = "https://www.investing.com/equities/alibaba"
ua = generate_user_agent()
print(ua)
headers = {"User-agent":ua}
res = requests.get(url,headers=headers)
print("Response Code :", res.status_code)
res.raise_for_status()
print("Start")
Any help will be appreciated.
Related
I am trying to scrape some actors' images from Wikipedia using Python and I constantly get the following error: HTTPError: 403 Client Error: Forbidden. Please comply with the User-Agent policy: https://meta.wikimedia.org/wiki/User-Agent_policy for url: https://upload.wikimedia.org/wikipedia/commons/0/0a/Christian_Bale-7837.jpg
The code I am using is as follows:
import requests

url = 'https://en.wikipedia.org/wiki/'
headers = {'User-Agent': 'copied user agent that came out when I googled it'}
response = requests.get(url, headers)
I can post the entire code if the problem is clearly not in the code above.
I have been googling this for the past 30 minutes. Wikipedia has a documentation page about the User-Agent policy and I followed their steps, but it still did not work.
It should be requests.get(url, headers=headers). Passed positionally, the second argument of requests.get is params, not headers, so your User-Agent header is never actually sent.
That said, don't just put in some random string you pulled from Google as the user agent. That's impolite and might get you banned if you generate significant traffic. Indicate who you are and what tool you are using, as asked in the user-agent policy. Something like NicoBot/0.1 (your#email.address) would work.
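As a concrete sketch (the bot name and contact address are the placeholders from above, which you would replace with your own):

import requests

# Descriptive User-Agent per the Wikimedia policy: tool name/version plus contact info.
headers = {'User-Agent': 'NicoBot/0.1 (your#email.address)'}
url = 'https://upload.wikimedia.org/wikipedia/commons/0/0a/Christian_Bale-7837.jpg'

response = requests.get(url, headers=headers)  # headers passed as a keyword argument
response.raise_for_status()
with open('Christian_Bale-7837.jpg', 'wb') as f:
    f.write(response.content)  # save the image bytes to disk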
I am trying to read an image URL from the internet and download the image to my machine with Python. I used the example from this blog post https://www.geeksforgeeks.org/how-to-open-an-image-from-the-url-in-pil/, which uses https://media.geeksforgeeks.org/wp-content/uploads/20210318103632/gfg-300x300.png, and that works. However, when I try my own example it just doesn't work; I've tried the HTTP version as well and it still gives me the 403 error. Does anyone know what the cause could be?
import urllib.request

urllib.request.urlretrieve(
    "http://image.prntscr.com/image/ynfpUXgaRmGPwj5YdZJmaw.png",
    "gfg.png")
Output:
urllib.error.HTTPError: HTTP Error 403: Forbidden
The server at prntscr.com is actively rejecting your request. There are many reasons why that could be. Some sites check the caller's user agent, so I tested whether that was the case here. In my case, I used httpie to check whether the server would let me download through a non-browser app. It worked. So I simply made up a User-Agent header to see if the lack of one was the only problem.
import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'MyApp/1.0')]
urllib.request.install_opener(opener)
urllib.request.urlretrieve(
    "http://image.prntscr.com/image/ynfpUXgaRmGPwj5YdZJmaw.png",
    "gfg.png")
It worked! Now, I don't know what logic the server uses. For instance, I tried a standard Mozilla/5.0 and that did not work. You won't always encounter this issue (most sites are pretty lax in what they allow as long as you are reasonable), but when you do, try playing with the user-agent. If nothing works, try using the same user-agent string as your browser.
I had the same problem and it was due to an expired URL. I checked the response text and was getting "URL signature expired", a message you wouldn't normally notice unless you inspected the response body.
This means some URLs simply expire, usually for security purposes. Try getting the URL again and update it in your script. If there isn't a new URL for the content you're trying to scrape, then unfortunately you can't scrape it.
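If it helps anyone hitting this, here is a small sketch of how the response body can be inspected when a 403 comes back (the URL is just the one from the question):

import requests

url = "http://image.prntscr.com/image/ynfpUXgaRmGPwj5YdZJmaw.png"
response = requests.get(url, headers={'User-Agent': 'MyApp/1.0'})
print(response.status_code)
# Messages like "URL signature expired" only show up in the body,
# not in the exception raised by raise_for_status().
print(response.text[:500])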
I have an application deployed on a private server.
ip = "100.10.1.1"
I want to read the source code / web content of that page.
In the browser, when I visit the page, it shows me "Your connection is not private".
After I proceed past the unsafe-connection warning, it takes me to the page.
Below is the code I am using to read the HTML content, but it is not giving me the correct HTML even though the response shows 200 OK.
I tried ignoring the SSL certificate with the code below, but that didn't help.
import requests

url = "http://100.10.1.1"
requests.packages.urllib3.disable_warnings()  # silence the InsecureRequestWarning
response = requests.get(url, verify=False)
print(response.text)
Can someone share some ideas on how to proceed, or am I doing anything wrong here?
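For reference, this is what I understand a variant targeting https:// directly would look like, in case the browser warning means the page is actually served over HTTPS with a self-signed certificate (that part is just my assumption):

import requests
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Assumption: the server serves the page over HTTPS with an untrusted certificate.
url = "https://100.10.1.1"
response = requests.get(url, verify=False, allow_redirects=True)
print(response.status_code)
print(response.url)         # shows whether we were redirected somewhere else
print(response.text[:500])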
The task I want to complete is very simple: make an HTTP GET request using Python.
Below is the code I used:
import requests

url = 'http://www.costcobusinessdelivery.com/AjaxWarehouseBrowseLookupView?storeId=11301&catalogId=11701&langId=-1&parentGeoNode=10112'
requests.get(url)
Then I got:
<Response [401]>
I am new to Python; can someone help? Thanks!
Update:
Based on the comments, it seems the code is okay, but I still get the 401 response. I suspect my company's network has some restrictions, yet I can access the URL and get a valid response through a browser. Is there a way to get past my company's firewall/proxy, or to pretend that I am using a browser in Python? Thanks again!
If your browser is accessing the web via a proxy server, look it up in your browser settings and use it in Python.
import requests

r = requests.get(url,
                 proxies={"http": "http://61.233.25.166:80"})
Your proxy server will have a different address.
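As a small sketch (the proxy address below is a placeholder to replace with the one from your browser's settings), you can pass proxies for both schemes, or rely on the HTTP_PROXY / HTTPS_PROXY environment variables, which requests reads by default:

import requests

url = 'http://www.costcobusinessdelivery.com/AjaxWarehouseBrowseLookupView?storeId=11301&catalogId=11701&langId=-1&parentGeoNode=10112'

# Placeholder proxy address; substitute your company's proxy.
proxies = {
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080",
}
r = requests.get(url, proxies=proxies)
print(r.status_code)

# Alternatively, export HTTP_PROXY / HTTPS_PROXY in the environment before
# running the script; requests picks those up automatically.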
I am trying to get a page from Wikipedia. I have already added a 'User-Agent' header to my request. However, when I open the page using urllib2.urlopen, I get the following page as a result:
ERROR: The requested URL could not be retrieved
While trying to retrieve the URL the following error was encountered:
Access Denied.
Access control configuration prevents your request from being allowed at this time. Please contact your service provider if you feel this is incorrect.
Here is the code I use to open the page:
import urllib2
import httplib

# logger is assumed to be defined elsewhere in this module (via the logging package).

def get_site(request_user_link, request):
    # request_user_link is the urllib2 Request for the URL entered by the user.
    # request is the request generated by the current page - used to get the
    # HTTP_USER_AGENT tag for Wikipedia and other sites.
    request_user_link.add_header('User-Agent', str(request.META['HTTP_USER_AGENT']))
    try:
        response = urllib2.urlopen(request_user_link)
    except urllib2.HTTPError, err:
        logger.error('HTTPError = ' + str(err.code))
        response = None
    except urllib2.URLError, err:
        logger.error('URLError = ' + str(err.reason))
        response = None
    except httplib.HTTPException, err:
        logger.error('HTTPException')
        response = None
    except Exception:
        import traceback
        logger.error('generic exception' + traceback.format_exc())
        response = None
    return response
I pass the value of the HTTP_USER_AGENT from the current user object as the "User-Agent" header for the request I send to Wikipedia.
If there are any other headers I need to add to this request, please let me know. Otherwise, please advise an alternative solution.
EDIT: Please note that I was able to get the page successfully yesterday, after I added the 'User-Agent' header. Today, I seem to be getting this error page.
Wikipedia is not very forgiving if you violate their crawling rules. Since you first exposed your IP with the standard urllib2 user-agent, you were flagged in the logs. When the logs were processed, your IP was banned. This is easy to test by running your script from another IP. Be careful, since Wikipedia is also known to block IP ranges.
IP bans are usually temporary, but with multiple offenses they can become permanent.
Wikipedia also auto-bans known proxy servers. I suspect that they themselves parse anonymous-proxy sites like proxy-list.org and commercial proxy sites like hidemyass.com for the IPs.
Wikipedia does this, of course, to protect the content from vandalism and spam. Please respect the rules.
If possible, I suggest using a local copy of Wikipedia on your own servers; that copy you can hammer to your heart's content.
I wrote a script that reads from Wikipedia; this is a simplified version.
import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]  # Wikipedia rejects the default urllib2 user-agent
resource = opener.open(URL)  # URL is the address of the Wikipedia page you want
data = resource.read()
resource.close()
# data now holds the page HTML
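If you are on Python 3, where urllib2 no longer exists, a rough equivalent would be the sketch below (the URL is just an example, and the User-Agent string is a placeholder you should replace with something descriptive, as the Wikimedia policy asks):

import urllib.request

# Example page; substitute the URL you actually want.
URL = 'https://en.wikipedia.org/wiki/Main_Page'

# Placeholder User-Agent; a descriptive tool name plus contact info is preferred.
req = urllib.request.Request(URL, headers={'User-Agent': 'MyTool/0.1 (contact email)'})
with urllib.request.urlopen(req) as resource:
    data = resource.read()  # data is the page HTML as bytes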