I am trying to scrape websites and sometimes get the error below. It is concerning because it appears at random, and when I retry, the request succeeds.
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.somewebsite.com', port=443): Read timed out. (read timeout=None)
My code looks like the following
import time
import requests
from bs4 import BeautifulSoup
from random_user_agent.user_agent import UserAgent
from random_user_agent.params import SoftwareName, OperatingSystem
software_names = [SoftwareName.CHROME.value]
operating_systems = [OperatingSystem.WINDOWS.value, OperatingSystem.LINUX.value]
user_agent_rotator = UserAgent(software_names=software_names, operating_systems=operating_systems, limit=100)
pages_to_scrape = ['https://www.somewebsite1.com/page', 'https://www.somewebsite2.com/page242']
for url in pages_to_scrape:
    time.sleep(2)  # pause between requests
    page = requests.get(url, headers={'User-Agent': user_agent_rotator.get_random_user_agent()})
    soup = BeautifulSoup(page.content, "html.parser")
    # scrape info
As you can see from my code, I even use time.sleep to pause my script for a couple of seconds before requesting another page. I also use a random User-Agent. I am not sure whether I can do anything else to make sure I never get the read timeout error.
I also came across this, but it seems to suggest adding extra values to the headers, and I am not sure that is a generic solution because the values may have to be specific to each website. I also read in another SO post that we should base64 the request and retry. It went over my head, as I have no idea how to do that and the person did not provide an example.
Any advice from those with experience in scraping would be highly appreciated.
Well, I've verified your issue. Basically, that site is using the AkamaiGHost firewall.
curl -s -o /dev/null -D - https://www.uniqlo.com/us/en/men/t-shirts
The firewall will block your requests if they lack a valid User-Agent. The User-Agent should be stable; you don't need to change it on each request. You will also need to use requests.Session() to persist the session, so that the TCP layer does not drop the packets. I was able to send 1k requests within a second and didn't get blocked. I even checked whether the bootstrap would block the request when I parsed the HTML source, but it didn't at all.
Note that I ran all my tests using Google DNS, which avoids latency in my threads; such latency can lead the firewall to drop the requests and classify them as a DDoS attack. One more point: DO NOT USE timeout=None, as that will cause the request to wait forever for a response. In the back end, the firewall automatically detects any TCP listener left in a pending state, drops it, and blocks the origin IP, which is you. That is based on the configured time :)
import requests
from concurrent.futures.thread import ThreadPoolExecutor
from bs4 import BeautifulSoup
def Test(num):
    print(f"Thread# {num}")
    with requests.session() as req:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0'}
        r = req.get(
            "https://www.uniqlo.com/us/en/men/t-shirts", headers=headers)
        soup = BeautifulSoup(r.text, 'html.parser')
        if r.status_code == 200:
            return soup.title.text
        else:
            return f"Thread# {num} Failed"
with ThreadPoolExecutor(max_workers=20) as executor:
    futures = executor.map(Test, range(1, 31))
    for future in futures:
        print(future)
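To make the two recommendations above concrete (reuse one session and always pass an explicit timeout), here is a minimal sketch. The retry counts, the timeout values, and the helper names make_session and fetch are assumptions for illustration and are not part of the original answer; tune them for the site you are scraping.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Assumed values for illustration; adjust for the target site.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0"
TIMEOUT = (5, 30)  # (connect timeout, read timeout) in seconds


def make_session(total_retries=3, backoff_factor=1.0):
    """Return a Session that reuses connections and retries transient failures."""
    session = requests.Session()
    session.headers.update({"User-Agent": USER_AGENT})
    retry = Retry(
        total=total_retries,
        connect=total_retries,
        read=total_retries,
        backoff_factor=backoff_factor,  # wait longer between successive retries
        status_forcelist=(429, 500, 502, 503, 504),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


def fetch(session, url):
    """GET a page with an explicit timeout; raises on failures and HTTP errors."""
    response = session.get(url, timeout=TIMEOUT)
    response.raise_for_status()
    return response.text
```

A caller would create the session once with make_session() and pass it to fetch() for every page. If the retries are exhausted, the call raises a subclass of requests.exceptions.RequestException instead of waiting indefinitely.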
ReadTimeout exceptions are commonly caused by the following:
Making too many requests in a given time period
Making too many requests at the same time
Using too much bandwidth, either on your end or theirs
It looks like you are making 1 request every 2 seconds. For some websites this is fine; others could call this a denial-of-service attack. Google, for example, will slow down or block requests that occur too frequently.
Some sites will also limit requests if you don't provide the right information in the headers, or if they think you're a bot.
To solve this try the following:
Increase the time between requests. For Google, 30-45 seconds works for me if I am not using an API
Decrease the number of concurrent requests.
Have a look at the network requests that occur when you visit the site in your browser, and try to mimic them.
Use a package like selenium to make your activity look less like a bot.
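The first two suggestions (more time between requests, fewer concurrent requests) can be combined in one small helper. The sketch below uses only the standard library; MinIntervalLimiter and polite_map are hypothetical names, and the 5-second interval and 2 workers are placeholder values rather than recommendations for any particular site.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class MinIntervalLimiter:
    """Make callers wait so that requests start at least `interval` seconds apart."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        """Block until this caller's slot arrives; return the delay that was applied."""
        with self._lock:
            now = self._clock()
            delay = max(0.0, self._next_allowed - now)
            # Reserve the next slot before releasing the lock.
            self._next_allowed = max(now, self._next_allowed) + self.interval
        if delay:
            self._sleep(delay)
        return delay


def polite_map(fetch, urls, interval=5.0, max_workers=2):
    """Run fetch(url) for each URL with few workers and spaced-out start times."""
    limiter = MinIntervalLimiter(interval)

    def task(url):
        limiter.wait()
        return fetch(url)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, urls))
```

Here fetch is whatever function performs the request, for example a wrapper around requests.get with a timeout. The limiter controls how often requests start, while max_workers caps how many are in flight at once.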
I am trying to scrape data from CME, but the code seems to freeze at the requests.get() call.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.cmegroup.com/markets/interest-rates/us-treasury/2-year-us-treasury-note.settlements.html'
page = requests.get(URL)
It seems that they are checking the User-Agent header.
The User-Agent request header is a characteristic string that lets servers and network peers identify the application, operating system, vendor, and/or version of the requesting user agent.
Not a specific one, so just give them your favorite agent:
requests.get(URL, headers={'user-agent':'SALT'}).text
For more about the User-Agent header, check the docs.
So I'm trying to scrape https://craft.co/tesla
When I visit from the browser, it opens correctly. However, when I use scrapy, it fetches the site but when I view the response,
view(response)
It shows the Cloudflare page instead of the actual site.
How do I get around this?
Cloudflare changes its techniques periodically, but you can use a simple Python module, cloudscraper, to bypass Cloudflare's anti-bot page.
The module can be useful if you wish to scrape or crawl a website protected with Cloudflare. Cloudflare's anti-bot page currently just checks if the client supports Javascript, though they may add additional techniques in the future.
Due to Cloudflare continually changing and hardening their protection page, cloudscraper requires a JavaScript Engine/interpreter to solve Javascript challenges. This allows the script to easily impersonate a regular web browser without explicitly deobfuscating and parsing Cloudflare's Javascript.
Any script using cloudscraper will sleep for ~5 seconds for the first visit to any site with Cloudflare anti-bots enabled, though no delay will occur after the first request.
[ https://pypi.python.org/pypi/cloudscraper/ ]
Please check this python module.
The simplest way to use cloudscraper is by calling create_scraper().
import cloudscraper
scraper = cloudscraper.create_scraper() # returns a CloudScraper instance
# Or: scraper = cloudscraper.CloudScraper() # CloudScraper inherits from requests.Session
print(scraper.get("http://somesite.com").text) # => "<!DOCTYPE html><html><head>..."
Any requests made from this session object to websites protected by Cloudflare anti-bot will be handled automatically. Websites not using Cloudflare will be treated normally. You don't need to configure or call anything further, and you can effectively treat all websites as if they're not protected with anything.
You use cloudscraper exactly the same way you use Requests. cloudscraper works identically to a Requests Session object; instead of calling requests.get() or requests.post(), you call scraper.get() or scraper.post().
Use requests-HTML. You can use this code to avoid being blocked:
# pip install requests-html
from requests_html import HTMLSession
url = 'your url come here'
s = HTMLSession()
s.headers['user-agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
r = s.get(url)
r.html.render(timeout=8000)
print(r.status_code)
print(r.content)
I'm trying to scrape data from Mexico's Central Bank website but have hit a wall. In terms of actions, I first need to follow a link on an initial URL. Once that page has loaded, I need to select two dropdown values and then click a submit button. If all goes well, I will be taken to a new URL where a set of links to PDFs is available.
The original url is:
"http://www.banxico.org.mx/mercados/valores-gubernamentales-secto.html"
The nested URL (the one with the dropdowns) is:
"http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces?BMXC_claseIns=GUB&BMXC_lang=es_MX"
The inputs (arbitrary) are, say: '07/03/2019' and '14/03/2019'.
Using BeautifulSoup and requests, I feel like I got as far as filling in the dropdown values, but I failed to click the button and reach the final URL with the list of links.
My code follows below:
from bs4 import BeautifulSoup
import requests
pagem=requests.get("http://www.banxico.org.mx/mercados/valores-gubernamentales-secto.html")
soupm = BeautifulSoup(pagem.content,"lxml")
lst=soupm.find_all('a', href=True)
url=lst[-1]['href']
page = requests.get(url)
soup = BeautifulSoup(page.content,"lxml")
xin= soup.find("select",{"id":"_id0:selectOneFechaIni"})
xfn= soup.find("select",{"id":"_id0:selectOneFechaFin"})
ino=list(xin.stripped_strings)
fino=list(xfn.stripped_strings)
headers = {'Referer': url}
data = {'_id0:selectOneFechaIni':'07/03/2019', '_id0:selectOneFechaFin':'14/03/2019',"_id0:accion":"_id0:accion"}
respo=requests.post(url,data,headers=headers)
print(respo.url)
In the code, respo.url is equal to url, so the code fails. Can anybody please help me identify where the problem is? I'm a newbie to scraping, so it might be obvious; apologies in advance. I'd appreciate any help. Thanks!
Last time I checked, you cannot submit a form via clicking buttons with BeautifulSoup and Python. There are typically two approaches I often see:
Reverse engineer the form
If the form makes AJAX calls (e.g. makes a request behind the scenes, common for SPAs written in React or Angular), then the best approach is to use the network requests tab in Chrome or another browser to understand what the endpoint is and what the payload is. Once you have those answers, you can make a POST request with the requests library to that endpoint with data=your_payload_dictionary (e.g. manually do what the form is doing behind the scenes). Read this post for a more elaborate tutorial.
Use a headless browser
If the website is written in something like ASP.NET or a similar MVC framework, then the best approach is to use a headless browser to fill out a form and click submit. A popular framework for this is Selenium. This simulates a normal browser. Read this post for a more elaborate tutorial.
Judging by a cursory look at the page you're working on, I recommend approach #2.
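If you do try the first approach, the payload usually has to include every field the form itself would send, hidden inputs included, and not only the visible dropdown values. A minimal sketch of collecting those fields with the standard library's HTMLParser follows. The class name FormFieldCollector and the helper build_payload are hypothetical, and the sketch deliberately ignores details such as unchecked checkboxes or several forms on one page.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs from <input> tags and selected <option> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._select_name = None  # name of the <select> currently being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""
        elif tag == "select":
            self._select_name = attrs.get("name")
        elif tag == "option" and self._select_name and "selected" in attrs:
            self.fields[self._select_name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "select":
            self._select_name = None


def build_payload(html, overrides):
    """Start from the form's own fields (hidden ones included), then apply overrides."""
    collector = FormFieldCollector()
    collector.feed(html)
    payload = dict(collector.fields)
    payload.update(overrides)
    return payload
```

The resulting dictionary could then be posted from the same requests.Session that fetched the form page, for example session.post(form_url, data=payload, timeout=30), so that any session cookie travels with it. Whether the server accepts such a request still depends on the site.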
The page you have to scrape is:
http://www.banxico.org.mx/valores/PresentaDetalleSectorizacionGubHist.faces
Add the date to query and the JSESSIONID from your cookies to the payload, and put Referer, User-Agent, and the other usual headers in the request headers.
Example:
import requests
import pandas as pd
cl = requests.session()
url = "http://www.banxico.org.mx/valores/PresentaDetalleSectorizacionGubHist.faces"
payload = {
    "JSESSIONID": "cWQD8qxoNJy_fecAaN2k8N0IQ6bkQ7f3AtzPx4bWL6wcAmO0T809!-1120047000",
    "fechaAConsultar": "21/03/2019"
}
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded",
    "Referer": "http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces;jsessionid=cWQD8qxoNJy_fecAaN2k8N0IQ6bkQ7f3AtzPx4bWL6wcAmO0T809!-1120047000"
}
response = cl.post(url, data=payload, headers=headers)
tables = pd.read_html(response.text)
When just clicking through the pages it looks like there's some sort of cookie/session stuff going on that might be difficult to take into account when using requests.
(Example: http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces;jsessionid=8AkD5D0IDxiiwQzX6KqkB2WIYRjIQb2TIERO1lbP35ClUgzmBNkc!-1120047000)
It might be easier to code this up using selenium since that will automate the browser (and take care of all the headers and whatnot). You'll still have access to the html to be able to scrape what you need. You can probably reuse a lot of what you're doing as well in selenium.
I am required to retrieve 8000 answers from a website for research purposes (auto-filling a form and submitting it 8000 times). I wrote the script below, but when I run it, Python stops working after about 20 submissions and I'm unable to get what I need. Could you please help me find the problem with my script?
from mechanize import ParseResponse, urlopen, urljoin
import urllib2
from urllib2 import Request, urlopen, URLError
import mechanize
import time
URL = "url of the website"
br = mechanize.Browser() # Creates a browser
br.set_handle_robots(False) # ignore robots
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
def fetch(val):
    br.open(URL) # Open the login page
    br.select_form(nr=0) # Find the login form
    br['subject']='question'
    br['value'] =val
    br.set_all_readonly(False)
    resp = br.submit()
    data = resp.read()
    br.reload()
    x=data.find("the answer is:")
    if x!=-1:
        ur=data[x:x+100]
        print ur
val_list =val_list # This list is available and contains 8000 different values
for i in range(0,8000):
    fetch(val_list[i])
Having used mechanize in the past for similar data scraping, I'd say you're almost certainly being rate-limited by the website, as Erbureth mentioned. Websites usually have a way to monitor connections and filter out exactly the type of thing you're attempting, and for good reason.
Putting aside for a moment whatever the purpose of your script may be and moving to your question of why it doesn't work: at the very least, I would put some delays in there so you're not accessing the site repeatedly in such a short time span. Put a few seconds of pause between calls, and maybe it will work. (Although then you'll have to let it run for hours.)
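As a rough illustration of the pause-between-calls idea, here is a sketch written for Python 3 (the script in the question is Python 2). The function name run_paced, the 3 to 6 second pause, the retry count, and the choice to retry on OSError are all assumptions; submit stands for whatever function performs one form submission and returns its result.

```python
import random
import time


def run_paced(values, submit, done=None, pause=(3.0, 6.0),
              max_attempts=3, retry_on=(OSError,), sleep=time.sleep):
    """Submit each value once, pausing between calls and backing off after failures.

    `done` maps already-processed values to their results, so a rerun after a
    crash can skip them instead of starting from scratch.
    """
    results = {} if done is None else done
    for value in values:
        if value in results:
            continue  # finished in an earlier run
        for attempt in range(max_attempts):
            try:
                results[value] = submit(value)
                break
            except retry_on:
                if attempt == max_attempts - 1:
                    raise  # out of attempts: let the caller see the error
                # Wait noticeably longer after each failed attempt.
                sleep(pause[1] * 2 ** (attempt + 1))
        sleep(random.uniform(*pause))  # normal pause between submissions
    return results
```

With 8000 values and a pause of a few seconds each, a full run takes several hours, which is why keeping the results dictionary (and saving it to disk periodically) is worth the small extra effort.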
I want to download a webpage using Python for a web scraping task. The problem is that the website requires cookies to be enabled; otherwise it serves a different version of the page. I implemented a solution that solves the problem, but it is inefficient in my opinion. I need your help to improve it!
This is how I handle it now:
import requests
import cookielib
cj = cookielib.CookieJar()
user_agent = {'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}
#first request to get the cookies
requests.get('https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&',headers=user_agent, timeout=2, cookies = cj)
# second request reusing cookies served first time
r = requests.get('https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&',headers=user_agent, timeout=2, cookies = cj)
html_text = r.text
Basically, I create a CookieJar object and then send two consecutive requests for the same URL. The first time, the site serves me the bad page but, as compensation, gives me cookies. The second request reuses those cookies and I get the right page.
The question is: is it possible to use just one request and still get the right, cookie-enabled version of the page?
I tried sending a HEAD request the first time instead of a GET to minimize traffic, but in that case cookies aren't served. Googling for it didn't give me the answer either.
So, I would like to understand how to do this efficiently. Any ideas?
You need to make the request to get the cookie, so no, you cannot obtain the cookie and reuse it without making two separate requests. If by "cookie-enabled" you mean the version that recognizes your script as having cookies, then it all depends on the server and you could try:
hardcoding the cookies before making the first request,
requesting the smallest possible page (one with the smallest possible response that still sets cookies) to obtain the first cookie,
trying to find some workaround (maybe adding some GET argument will fool the site into believing you have cookies, but you would need to find it for this specific site).
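For the first option, a cookie can be placed in a requests.Session before any request is sent. The sketch below only builds the request, without sending it, to show that the cookie would be attached. The cookie name, value, and URL are hypothetical; the real ones would have to be copied from what the site sets in a browser, and the server may still reject a hardcoded value.

```python
import requests

# Hypothetical cookie name, value, and URL, for illustration only.
session = requests.Session()
session.cookies.set("cookies_enabled", "1", domain="example.com", path="/")

request = requests.Request("GET", "https://example.com/jobs")
prepared = session.prepare_request(request)  # builds the request without sending it

# The Cookie header that would accompany the very first request.
print(prepared.headers.get("Cookie"))
```

Calling session.get(url, timeout=...) on the same session would send that cookie on the first real request and keep any cookies the server sets afterwards.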
I think the winner here might be to use the requests Session object, which takes care of the cookies for you.
That would look something like this:
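import requests
user_agent = {'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}
s = requests.Session()  # Session() takes no arguments in current versions of requests
s.headers.update(user_agent)  # headers set here are sent with every request
# the timeout is passed per request; cookies are stored on the session automatically
r = s.get('https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&', timeout=2)
html_text = r.text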
Try that and see if that works?