I am using Python 3.7 with the requests 2.23.0 library and trying to scrape a website, but I get the following error message:
('Connection aborted.', TimeoutError(10060, 'A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond', None, 10060, None))
I tried setting a User-Agent header, but had no luck. I also tried specifying a timeout, and I'm still facing the same problem.
The website works fine when I access it through the browser.
The same code works fine with some other websites.
Any kind of help is really appreciated.
I am able to catch the exception, but I want to avoid it and actually access the website.
Here is the code (just as simple as trying to access the website):
from requests import get

try:
    agent = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
    url = "the url I'm trying to access"
    html = get(url, headers=agent)
except Exception as error:
    print("Error", error)
Could it be something with the security of the website? I'd like to find a way to work around it.
I could not comment due to low reputation, so I'm posting this as an answer.
I think you will find your answer in the link below:
Python3 error
I used Selenium with the user-agent option and was able to access the website:
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36'
options.add_argument('user-agent={0}'.format(user_agent))
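For context, here is a minimal sketch of how that option might be wired into a Chrome driver; the ChromeOptions setup, driver creation, and placeholder URL are assumptions, not part of the original answer:
from selenium import webdriver

user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36'

# Build Chrome options and spoof the user-agent string (assumed setup)
options = webdriver.ChromeOptions()
options.add_argument('user-agent={0}'.format(user_agent))

# Start the browser and fetch the page (placeholder URL)
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
print(driver.page_source[:200])
driver.quit()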
Many thanks
I have this code
from requests.sessions import Session
url = "https://www.yell.com/s/launderettes-birmingham.html"
s = Session()
headers = {
    'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
}
r = s.get(url, headers=headers)
print(r.status_code)
but I get a 403 status code instead of 200.
I can scrape this data with Selenium, but is there a way to scrape it with requests?
If you modify your code like so:
print(r.text)
print(r.status_code)
you will see that the reason you are getting the 403 error code is that yell uses a Cloudflare browser check.
As the check runs JavaScript, there is no reliable way to get past it with the requests module alone.
Since you mentioned you are going to use Selenium, make sure to use the undetected-chromedriver package; a sketch is below.
Also, rotate your IP address to avoid getting blocked.
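A minimal sketch of that approach, assuming the undetected-chromedriver package is installed (pip install undetected-chromedriver) and using the URL from the question:
import undetected_chromedriver as uc

# Start a Chrome instance patched to avoid common bot-detection checks
driver = uc.Chrome()
driver.get("https://www.yell.com/s/launderettes-birmingham.html")

# The Cloudflare check runs in the real browser; the rendered HTML is then available
html = driver.page_source
print(html[:200])
driver.quit()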
I am attempting to run my code on an AWS EC2 (Ubuntu) instance. The code works perfectly fine on my local machine, but it doesn't seem to be able to connect to the website from inside the server.
I'm assuming it has something to do with the headers. I have installed Firefox and Chrome on the server, but that doesn't seem to help.
Any ideas on how to fix this problem would be appreciated.
import requests
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'}
# Making a get request
response = requests.get("https://us.louisvuitton.com/eng-us/products/pocket-organizer-monogram-other-nvprod2380073v", headers=HEADERS)  # hangs here; can't make the request on the server
# print response
print(response.status_code)
Output:
It doesn't give me one; the script just stays blank until I KeyboardInterrupt it.
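Not a fix, but one way to make the hang visible instead of waiting forever is to pass a timeout so the call fails fast; a diagnostic sketch under that assumption:
import requests

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'}

try:
    # timeout=(connect, read): give up quickly instead of hanging indefinitely
    response = requests.get(
        "https://us.louisvuitton.com/eng-us/products/pocket-organizer-monogram-other-nvprod2380073v",
        headers=HEADERS,
        timeout=(5, 15),
    )
    print(response.status_code)
except requests.exceptions.RequestException as exc:
    print("Request failed:", exc)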
I want to collect some post titles from Reddit for analysis. Through constant debugging of my code, I was able to get some post titles. Suddenly I got a 403 Forbidden when attempting to use PRAW to collect posts. The online explanation is: "Accessing the page or resource you were trying to reach is absolutely forbidden. In other words, a 403 error means that you don't have access to whatever you're trying to view."
Please tell me what I should do. Thanks.
Try adding some headers and using a time delay (a delay sketch follows the code below):
url="https://www.reddit.com"
my_headers=["Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html",
"Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31"
]
def get_content(url,headers):
randdom_header=random.choice(headers)
req=urllib.Request(url)
req.add_header("User-Agent",randdom_header)
req.add_header("Host","www.reddit.com")
req.add_header("Referer","https://www.reddit.com")
req.add_header("GET",url)
content=urllib.urlopen(req).read()
return content
print (get_content(url,my_headers))
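For the time-delay part, a minimal sketch reusing get_content from above (the two-second pause and the loop count are arbitrary choices):
import time

# Space the requests out so they look less like a burst from a script
for _ in range(3):
    print(get_content(url, my_headers)[:100])
    time.sleep(2)  # pause between requests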
So, I used to run into this problem a lot, and no matter where I googled, I found no "fix", per se. The easiest way to solve the problem is to just catch the error, and then do something with it. See below.
First, you need to import the right "error" (exception):
from prawcore.exceptions import Forbidden
Then, you can try something that may raise our Forbidden error. Do this with try:. Our code should look something like this:
from prawcore.exceptions import Forbidden
try:
    comment.reply("I can comment here!")
Here, we try to reply to some (in our case) imaginary comment object. If we are allowed to comment, i.e. we aren't banned from Reddit or from this sub, then the reply will successfully "send". If that isn't the case, we will receive that dreadful prawcore.exceptions.Forbidden: received 403 HTTP response.
To fix this, we simply need to catch our error with an except clause. This would look like except Forbidden:. We can then do something if we are forbidden from a certain action. Our final code looks like this:
# importing our error
from prawcore.exceptions import Forbidden

# trying to comment (we may be banned)
try:
    comment.reply("I can comment here!")
# doing something if we can't comment
except Forbidden:
    print(f"We've been banned on r/{comment.subreddit}!")
I hope this helped and I'd love to elaborate further!
Just as an addition to get to the root of the issue for some of you, I ran into this issue because the subreddit I was trying to connect to had gone private.
So my suggestion would be to first check whether the subreddit you are connecting to allows access to the account the bot is set up with.
I think the first answer gives a great response on how to handle it programmatically.
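A rough sketch of such a check, assuming an authenticated PRAW Reddit instance named reddit and a hypothetical subreddit name; accessing an attribute forces PRAW to fetch the lazy subreddit object, which raises Forbidden when the account can't see it:
from prawcore.exceptions import Forbidden

def can_access(reddit, name):
    # Attribute access triggers the API fetch on the lazy subreddit object;
    # a private subreddit this account can't see raises Forbidden here.
    try:
        reddit.subreddit(name).title
        return True
    except Forbidden:
        return False

print(can_access(reddit, "some_private_subreddit"))  # hypothetical name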
I'm trying to scrape the Home Depot website using Python and requests. Selenium Webdriver works fine, but takes way too much time, as the goal is to make a time-sensitive price comparison tool between local paint shops and power tool shops.
When I send a request to any other website, it works like normal. If I use any browser to navigate to the site manually, it also works fine (with or without session/cookie data). I tried adding randomized headers to the request, but that does not seem to help. From what I can see, it's not an issue of sending too many requests per time period (considering that Selenium and manual browsing still work at any time), so I am confident this specific issue is NOT caused by rate limiting.
my code:
from random import choice
import requests
import traceback

list_desktopagents = ['Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36']

def random_headers():
    # Pick a random desktop User-Agent plus a browser-like Accept header
    return {'User-Agent': choice(list_desktopagents),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}

response = requests.get(
    'https://www.homedepot.com/p/BEHR-1-gal-White-Alkyd-Semi-Gloss-Enamel-Alkyd-Interior-Exterior-Paint-390001/300831629',
    headers=random_headers(),
    timeout=10)
my error:
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.homedepot.com', port=443): Read timed out. (read timeout=10)
Does anyone have a suggestion on what else I could do to successfully receive my response? I would prefer to use requests, but anything that runs fast, unlike Selenium, will be suitable. I understand that I'm being blocked; my question is not so much 'what's happening to stop me from scraping?', but rather, 'what can I do to further humanize my scraper so it allows me to continue?'
The error is coming from the User-Agent. The reason Selenium works and requests does not is that Selenium uses a web driver to make the request, so it looks more human, while requests is much easier to detect as a script. From Home Depot's robots.txt page it doesn't look like product pages are allowed to be scraped. That said, I used this code and got a response:
headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get('https://www.homedepot.com/p/BEHR-1-gal-White-Alkyd-Semi-Gloss-Enamel-Alkyd-Interior-Exterior-Paint-390001/300831629', headers=headers)
print(response.content)
By using these user agents you can "trick" the site into thinking you are an actual person, which is what the web driver with Selenium does.
I'm trying to search using BeautifulSoup with Anaconda for Python 3.6.
I am trying to scrape accuweather.com to find the weather in Tel Aviv.
This is my code:
from bs4 import BeautifulSoup
import requests

data = requests.get("https://www.accuweather.com/he/il/tel-aviv/215854/weather-forecast/215854")
soup = BeautifulSoup(data.text, "html.parser")
soup.find('div', {'class': 'info'})
I get this error:
raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', OSError("(10060, 'WSAETIMEDOUT')",))
What can I do and what does this error mean?
What does this error mean
Googling for "errno 10060" yields quite a few results. Basically, it's a low-level network error (it's not HTTP specific; you can have the same issue for any kind of network connection), whose canonical description is
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
In other words, your system failed to connect to the host. This can happen for many reasons, either temporary (like your internet connection being down) or not (like a proxy, if you are behind one, blocking access to this host), or quite simply, as is the case here, the host blocking your requests.
The first thing to do when you get such an error is to check your internet connection, then try to get the URL in your browser. If you can get it in your browser, then it's most often the host blocking you, usually based on your client's "User-Agent" header (the client here is requests). Specifying a "standard" user-agent header, as explained in newbie's answer, should solve the problem (and it does in this case, or at least it did for me).
NB: to set the user agent:
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}
data = requests.get("https://www.accuweather.com/he/il/tel-aviv/215854/weather-forecast/215854", headers=headers)
The problem does not come from the code, but from the website.
If you add a User-Agent field to the request headers, the request will look like it comes from a browser.
Example:
from bs4 import BeautifulSoup
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}
data = requests.get("https://www.accuweather.com/he/il/tel-aviv/215854/weather-forecast/215854", headers=headers)