How to get past a 503 error while scraping - python

I am trying to scrape ETFs from the website https://www.etf.com/channels. However, no matter what I try, it returns a 503 error. I've tried different user agents and headers, but it still won't let me access it. Sometimes when I open the website in a browser, a page pops up that "checks if the connection is secure", so I assume they have measures in place to stop scraping. I've seen others ask the same question, and the answer always says to add a user agent, but that didn't work for this site.
Scrapy
import scrapy

class BrandETFs(scrapy.Spider):
    name = "etfs"
    start_urls = ['https://www.etf.com/channels']
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.5",
        "Connection": "keep-alive",
        "Host": "www.etf.com",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "cross-site",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0"
    }
    custom_settings = {'DOWNLOAD_DELAY': 0.3, "CONCURRENT_REQUESTS": 4}

    def start_requests(self):
        url = self.start_urls[0]
        yield scrapy.Request(url=url)

    def parse(self, response):
        test = response.css('div.discovery-slat')
        yield {
            "test": test
        }
Requests
import requests
url = 'https://www.etf.com/channels'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
    'Referer': 'https://google.com',
    'Origin': 'https://www.etf.com'
}
r = requests.post(url, headers=headers)
r.raise_for_status()
Is there any way to get around these blocks and access the website?

Status 503 (Service Unavailable) is often seen in such cases; you are probably right in assuming that they have taken measures against scraping.
For the sake of completeness: their Terms of Service (No. 7g) prohibit what you are attempting:
[...] You agree that you will not [...]
Use automated means, including spiders, robots, crawlers [...]
Technical point of view
The User-Agent header is just one of many things to consider when you try to hide the fact that your requests are automated.
Since you see a page that seems to verify that you are (still or again) a human, it is likely that they have figured out what is going on and are keeping an eye on your IP. It might not be blacklisted (yet), because they notice changes whenever you try to access the page.
How did they find out? Based on your question and code, I guess it's your unchanged IP in combination with:
Request rate: You have sent too many requests too quickly, i.e. faster than they would expect from a human.
Periodic requests: Static delays between requests, so they see very regular timing on their side.
There are several other aspects that might or might not be monitored. However, using proxies (i.e. changing IP addresses) would be a step in the right direction.
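The two timing signals named above (a high request rate and perfectly regular gaps) can be softened by randomizing the pause between requests. The following is only a minimal sketch of that idea, not a guaranteed way past any protection; `jittered_delay` and `delays_for` are hypothetical helper names, and the base delay and spread are assumed values. Scrapy's `RANDOMIZE_DOWNLOAD_DELAY` setting applies a similar idea to `DOWNLOAD_DELAY`.

```python
import random


def jittered_delay(base_seconds, spread=0.5, rng=random):
    """Return a delay drawn uniformly from base*(1-spread) to base*(1+spread)."""
    if base_seconds < 0 or not 0 <= spread <= 1:
        raise ValueError("base_seconds must be >= 0 and spread within [0, 1]")
    low = base_seconds * (1 - spread)
    high = base_seconds * (1 + spread)
    return rng.uniform(low, high)


def delays_for(n_requests, base_seconds, spread=0.5, seed=None):
    """Pre-compute one randomized delay per request (seed only for repeatability)."""
    rng = random.Random(seed)
    return [jittered_delay(base_seconds, spread, rng) for _ in range(n_requests)]


if __name__ == "__main__":
    # With a 2 s base and 0.5 spread, every pause falls between 1 s and 3 s.
    for pause in delays_for(5, base_seconds=2.0):
        print(f"would sleep {pause:.2f} s before the next request")
```

In a plain `requests` loop, each value would be passed to `time.sleep()` before the next call; the IP-related signals mentioned above are not addressed by this at all.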

Related

Error status code 403 even with headers, Python Requests

I am sending a request to a URL. I copied the request as cURL and converted it with a curl-to-Python tool, so all the headers are included, but the request does not work: I receive status code 403 when printing it and error code 1020 in the HTML output. The code is:
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    # 'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
}
response = requests.get('https://v2.gcchmc.org/book-appointment/', headers=headers)
print(response.status_code)
print(response.cookies.get_dict())
with open("test.html", 'w') as f:
    f.write(response.text)
I also get cookies, but not the desired response. I know I can do it with Selenium, but I want to know the reason behind this. Thanks in advance.
Note:
I have installed requests and all of its dependencies at the same versions as on my computer, and it is still not working and throws a 403 error.
The site is protected by Cloudflare, which aims to block, among other things, unauthorized data scraping. From "What is data scraping?":
The process of web scraping is fairly simple, though the implementation can be complex. Web scraping occurs in 3 steps:
First the piece of code used to pull the information, which we call a scraper bot, sends an HTTP GET request to a specific website.
When the website responds, the scraper parses the HTML document for a specific pattern of data.
Once the data is extracted, it is converted into whatever specific format the scraper bot’s author designed.
You can use urllib instead of requests; it seems to be able to deal with Cloudflare:
import urllib.request
req = urllib.request.Request('https://v2.gcchmc.org/book-appointment/')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0')
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8')
req.add_header('Accept-Language', 'en-US,en;q=0.5')
r = urllib.request.urlopen(req).read().decode('utf-8')
with open("test.html", 'w', encoding="utf-8") as f:
    f.write(r)
It works on my machine, so I am not sure what the problem is.
However, when a request does not work, I often check whether it works using Playwright. Playwright drives a real browser and thus mimics your actual browser when visiting the page. It can be installed using pip install playwright. When you run it for the first time, it may raise an error telling you to install the browser drivers; just follow the instructions to do so.
With playwright you can try the following:
from playwright.sync_api import sync_playwright
url = 'https://v2.gcchmc.org/book-appointment/'
ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/69.0.3497.100 Safari/537.36"
)
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page(user_agent=ua)
    page.goto(url)
    page.wait_for_timeout(1000)
    html = page.content()
print(html)
A downside of Playwright is that it requires installing Chromium (or another browser). This may complicate deployment, as the browser cannot simply be added to requirements.txt, and a container image is required.
Try running Burp Suite's Proxy to see all the headers and other data like cookies. Then you could mimic the request with the Python module. That's what I always do.
Good luck!
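The "capture the headers, then mimic the request" approach described above can be sketched with the standard library alone. This is only an illustration under assumptions: `parse_raw_headers` and `build_request` are hypothetical helper names, and the raw text is a made-up example of what a proxy or the browser's developer tools might show. The request is built but not sent.

```python
import urllib.request


def parse_raw_headers(raw):
    """Turn 'Name: value' lines, as copied from a proxy, into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank lines and the request line (e.g. "GET / HTTP/1.1")
        name, value = line.split(":", 1)
        if not name.strip():
            continue  # skip HTTP/2 pseudo-headers such as ":authority"
        headers[name.strip()] = value.strip()
    return headers


def build_request(url, raw_headers):
    """Build (but do not send) a request carrying the captured headers."""
    return urllib.request.Request(url, headers=parse_raw_headers(raw_headers))


if __name__ == "__main__":
    captured = """GET /book-appointment/ HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0
Accept-Language: en-US,en;q=0.5
"""
    req = build_request("https://v2.gcchmc.org/book-appointment/", captured)
    print(req.header_items())
```

Passing the resulting request to `urllib.request.urlopen` would send it; whether the server accepts it still depends on everything else it checks.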
Had the same problem recently.
Using the JavaScript Fetch API with Selenium-Profiles worked for me.
Example JS:
fetch('http://example.com/movies.json')
    .then((response) => response.json())
    .then((data) => console.log(data));
Example Python with Selenium-Profiles:
headers = {
    "accept": "application/json",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": profile["cdp"]["useragent"]["acceptLanguage"],
    "content-type": "application/json",
    # "cookie": cookie_str,  # optional
    "sec-ch-ua": "'Google Chrome';v='107', 'Chromium';v='107', 'Not=A?Brand';v='24'",
    "sec-ch-ua-mobile": "?0",  # "?1" for mobile
    "sec-ch-ua-platform": "'" + profile['cdp']['useragent']['userAgentMetadata']['platform'] + "'",
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "user-agent": profile['cdp']['useragent']['userAgent']
}
answer = driver.requests.fetch("https://www.example.com/",
                               options={
                                   "body": json.dumps(post_data),
                                   "headers": headers,
                                   "method": "POST",
                                   "mode": "same-origin"
                               })
I don't know why this occurs, but I assume Cloudflare and others are able to detect whether a request is made with JavaScript.

Trying to access variables while scraping website

I'm currently trying to scrape a website (URL in the code below). However, when I pull out the section of HTML I want to work with, all I get is the variable name of the information I'm looking for. The actual value of the variable is present when I manually inspect the page's HTML, so I assume that what I see when scraping is the page referencing the variable from elsewhere.
I'm hoping someone can help me access this information. I have tried scraping the website's HTML using Selenium, but I seem to just get back the same HTML that I get when using requests (maybe I'm doing it incorrectly and somebody can show me the correct approach).
This is a refined version of my code:
import scrapy  # For scraping
from scrapy import Selector
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
           'Accept-Language': 'en-GB,en;q=0.5',
           'Referer': 'https://google.com',
           'DNT': '1'}
url = 'https://groceries.aldi.ie/en-GB/drinks/beer-ciders'
html = requests.get(url, headers=headers).content
sel = Selector(text=html)
x = sel.xpath('//*[@id="vueSearchResults"]/div//span')  # /div[2]/div/div[4]/div/div/span/span
print(x.extract()[8])
This then returns the following:
<span>{{Product.ListPrice}}</span>
From this I want to get the actual value of 'Product.ListPrice'. I'd appreciate it if someone could point me in the right direction for accessing this variable's value, or show me a way to scrape the website's HTML as seen by a user browsing the page.
** It was recommended that I send a POST request to this API: 'https://groceries.aldi.ie/api/product/calculatePrices', passing request headers and a payload, but I'm not entirely sure how to pull this off (I'm new to this aspect of scraping). If someone could provide an example of how to do this, I'd greatly appreciate it!
Thanks!
This is how you can replicate the POST request through Scrapy.
Code
import scrapy
import json

class Test(scrapy.Spider):
    name = 'test'
    headers = {
        "authority": "groceries.aldi.ie",
        "pragma": "no-cache",
        "cache-control": "no-cache",
        "sec-ch-ua": "\" Not;A Brand\";v=\"99\", \"Google Chrome\";v=\"91\", \"Chromium\";v=\"91\"",
        "accept-language": "en-GB",
        "sec-ch-ua-mobile": "?0",
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36",
        "websiteid": "a763fb4a-0224-4ca8-bdaa-a33a4b47a026",
        "content-type": "application/json",
        "accept": "application/json, text/javascript, */*; q=0.01",
        "x-requested-with": "XMLHttpRequest",
        "origin": "https://groceries.aldi.ie",
        "sec-fetch-site": "same-origin",
        "sec-fetch-mode": "cors",
        "sec-fetch-dest": "empty",
        "referer": "https://groceries.aldi.ie/en-GB/drinks/beer-ciders?sortDirection=asc&page=1"
    }
    body = '{"products":["4088600284026","5391528370382","5391528372836","5391528372850","5391528372874","4088600298696","4088600103709","4088600388700","5035766046028","5000213021934","4088600012551","4088600325934","4088600300153","25389111","4072700001171","4088600012537","4088600012544","4088600013138","4088600013145","4088600103525","4088600103532","4088600103570","4088600103600","4088600135182","4088600141848","4088600142050","4088600158105","4088600217024","4088600217208","4088600241302","4088600249292","4088600249308","4088600280615","4088600281445","4088600283043","4088600284088","4088600295688","4088600295800","4088600295817","4088600303925"]}'

    def start_requests(self):
        url = 'https://groceries.aldi.ie/api/product/calculatePrices'
        yield scrapy.Request(url=url, method='POST', headers=self.headers, body=self.body)

    def parse(self, response):
        data = json.loads(response.body)
        for i in data.get('ProductPrices'):
            print('Listing price is', i.get('ListPrice'), '\t', 'Unit Price is', i.get('UnitPrice'))
Output (truncated)
Listing price is €4.19 Unit Price is €4.23
Listing price is €1.99 Unit Price is €3.98
Listing price is €1.99 Unit Price is €3.98
Listing price is €1.99 Unit Price is €3.98
Listing price is €1.99 Unit Price is €3.98
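The hand-written JSON string in the spider above can also be generated from a list of product ids, and the response parsing can be isolated into a small function. The following is a minimal sketch under assumptions: `build_price_payload` and `extract_prices` are hypothetical helper names, and the sample response contains only the fields the answer's `parse` method reads (`ProductPrices`, `ListPrice`, `UnitPrice`); the real API response may carry more.

```python
import json


def build_price_payload(product_ids):
    """Serialize product ids into the {"products": [...]} body used above."""
    return json.dumps({"products": [str(pid) for pid in product_ids]})


def extract_prices(response_text):
    """Return (ListPrice, UnitPrice) pairs from a calculatePrices-style response."""
    data = json.loads(response_text)
    return [
        (item.get("ListPrice"), item.get("UnitPrice"))
        for item in data.get("ProductPrices", [])
    ]


if __name__ == "__main__":
    print(build_price_payload(["4088600284026", "5391528370382"]))
    # Assumed response shape, reduced to the fields read in parse() above.
    sample = json.dumps(
        {"ProductPrices": [{"ListPrice": "€4.19", "UnitPrice": "€4.23"}]}
    )
    for list_price, unit_price in extract_prices(sample):
        print("Listing price is", list_price, "\t", "Unit Price is", unit_price)
```

In the spider, `body = build_price_payload(ids)` would replace the long literal, which makes it easier to feed in ids collected from the listing page.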

Python - WebScraping using Request module-URL throws an error -403- forbidden

I'm trying to get the data from https://www.ecfr.gov/cgi-bin/ECFR?page=browse
using the requests module in Python.
Somehow I'm getting HTTP 403 Forbidden.
import requests
url = "https://www.ecfr.gov/cgi-bin/ECFR?page=browse"
header = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "max-age=0",
    "Host": "httpbin.org",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-5ef3288f-10e678d0e55c0670c0807730"
}
r = requests.get(url, headers=header)
I have also sent the request with the user agent and all the header parameters that I see in the developer tools.
I have tried free proxies, rotating user agent headers, cookies and everything I can get my hands on, but somehow the website is still able to tell that the request does not come from a real browser.
In the HTML response, I see that the website is asking me to complete a captcha.
Is there any way I can skip that?
Inspecting the HTTP requests, I found a Cloudflare trace in the server response.
Cloudflare (ScrapeShield) is well known for its scrape protection and security levels. Read more here.
Is there any way I can skip that?
There are 2 ways out:
Apply (plug in) a captcha-solving service. That is not easy if you use only plain Python code.
Leverage browser automation, making ScrapeShield think that a real user is browsing the website. It takes much more resources and time (incl. development time). See a scrape speed comparison table of headless Chromium automation vs bare HTTP requests.
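The header inspection mentioned at the start of this answer can be automated before deciding which of the two routes to take. The following is a rough sketch, not a complete detector: `classify_response` is a hypothetical helper name, it relies only on the `Server: cloudflare` and `CF-RAY` response headers that Cloudflare adds, and it treats 403 and 503 as "blocked" simply because those are the statuses reported in the questions in this thread.

```python
def classify_response(status_code, headers):
    """Roughly classify a response from its status code and headers.

    Returns "not-cloudflare", "cloudflare-ok" or "cloudflare-blocked".
    """
    # Header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    behind_cloudflare = (
        lowered.get("server", "").lower() == "cloudflare" or "cf-ray" in lowered
    )
    if not behind_cloudflare:
        return "not-cloudflare"
    # Assumption: 403 and 503 are the block statuses seen in this thread.
    if status_code in (403, 503):
        return "cloudflare-blocked"
    return "cloudflare-ok"


if __name__ == "__main__":
    # Made-up header values for illustration only.
    print(classify_response(403, {"Server": "cloudflare", "CF-RAY": "example-ray-id"}))
    print(classify_response(200, {"Server": "nginx"}))
```

With requests, the call would look like `classify_response(r.status_code, r.headers)`; a "cloudflare-blocked" result suggests that changing headers alone is unlikely to help.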

Why does it take much longer time for worldcat REST API to return response while using urllib2 python library than while using browser?

I am using the worldcat Python package, which uses the WorldCat open REST API and fetches book data using a search query and other parameters. Basically it does this:
self.response = urllib2.urlopen(_query_url).read()
where _query_url is the URL made of the base URL and some parameters such as the search string, the number of records per page, etc. Using the timeit package, I found that every call to the API took 18-20 seconds.
However, if I make that request from the browser, it takes just 3-4 seconds. What is causing the delay in the Python library? Is it normal? How can I make API requests faster in Python?
My guess is to use custom headers; they might have some kind of protection in place.
Try:
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:37.0) Gecko/20100101 Firefox/37.0",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Connection": "keep-alive",
    "Content-Type": "text/html;charset=utf-8"
}
r = urllib2.Request(_query_url, None, headers)
self.response = urllib2.urlopen(r).read()
Anyway, I love using Python 'requests'; why not give it a try? It is simple and stable, apart from some SSL key problems, but that's another story.
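To check whether custom headers actually change the latency, both variants can be timed the same way. The following is a minimal sketch, assuming Python 3 (where `urllib2` became `urllib.request`); `time_call` is a hypothetical helper name, and the commented usage lines reuse `_query_url` and `headers` from the snippets above.

```python
import time


def time_call(fn, repeats=3):
    """Call fn() `repeats` times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without network access.
    print(f"{time_call(lambda: time.sleep(0.05)):.3f} s")

    # Hypothetical usage against the API, comparing both variants:
    # import urllib.request
    # plain = lambda: urllib.request.urlopen(_query_url).read()
    # with_headers = lambda: urllib.request.urlopen(
    #     urllib.request.Request(_query_url, None, headers)).read()
    # print(time_call(plain), time_call(with_headers))
```

Taking the fastest of several runs reduces the influence of one-off network hiccups, which matters when the difference under investigation is in the range of seconds.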

Python Requests Create Cookie Failure

I'm attempting to post the data of a pop-out form to a local web site. To do this I'm emulating the request headers, data and cookie information provided by the site. (Note: I am largely redacting my email and password from the code (for obvious reasons), but all other code will remain the same.)
I have tried multiple permutations of the cookie, headers, request, data, etc. Additionally, I have verified the cookie and the expected headers and data in a network inspector. I am able to easily set a cookie using requests' sample code. I cannot explain why my code won't work on a live site, and I'd be very grateful for any assistance. Please see the following code for further details.
import requests
import robobrowser
import json
br = robobrowser.RoboBrowser(user_agent="Windows Chrome",history=True)
url = "http://posting.cityweekly.net/gyrobase/API/Login/CookieV2"
data ={"passwordChallengeResponse":"....._SYGwbDLkSyU5gYKGg",
"email": "<email>%40bu.edu",
"ttl":"129600",
"sessionOnly": "1"
}
headers = {
"Origin": "http://posting.cityweekly.net",
"Accept-Encoding": "gzip, deflate",
"Accept-Language": "en-US,en;q=0.8,ru;q=0.6",
"User-Agent": "Windows Chrome", #"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36",
"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
"Referer": "http://posting.cityweekly.net/utah/Events/AddEvent",
"X-Requested-With": "XMLHttpRequest",
"Connection": "keep-alive",
"Cache-Control": "max-age=0",
"Host":"posting.cityweekly.net"
}
cookie = {"Cookie": "__utma=25975215.1299783561.1416894918.1416894918.1416897574.2; __utmc=25975215; __utmz=25975215.1416894918.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __qca=P0-2083194243-1416894918675; __gads=ID=e3b24038c9228b00:T=1416894918:S=ALNI_MY7ewizuxK0oISnqPJWlLDAeKFMmw; _cb_ls=1; _chartbeat2=D6vh2H_ZbNJDycc-t.1416894962025.1416897589974.1; __utmb=25975215.3.10.1416897574; __utmt=1"}
r = br.session.get(url, data=json.dumps(data), cookies=cookie, headers=headers)
print r.headers
print [item for item in r.cookies.__dict__.items()]
Note that I print the cookies object and that the cookies attribute (a dictionary) is empty.
You need to perform a POST to log in to the site. Once you do that, I believe the cookies will have the correct values (not 100% sure on that...). This post clarifies how to properly set cookies.
Note: I don't think you need to do the additional import of requests unless you're using it outside of RoboBrowser.
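The POST suggested above can be sketched with the standard library. This is only an illustration under assumptions: `build_login_request` and `make_cookie_opener` are hypothetical helper names, and the field values are placeholders. The question's code sends `json.dumps(data)` with a GET while declaring a form-encoded Content-Type; here the fields are form-encoded with `urlencode`, which also handles the `@` in the email, so the value is passed unescaped.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_login_request(url, form_fields, extra_headers=None):
    """Build (but do not send) a form-encoded POST request."""
    body = urllib.parse.urlencode(form_fields).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"}
    headers.update(extra_headers or {})
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def make_cookie_opener():
    """Return an opener that stores cookies set by the server, plus its jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


if __name__ == "__main__":
    # Placeholder credentials; the real field values are elided in the question.
    fields = {"email": "user@example.com", "ttl": "129600", "sessionOnly": "1"}
    req = build_login_request(
        "http://posting.cityweekly.net/gyrobase/API/Login/CookieV2", fields
    )
    print(req.get_method(), req.data)
    opener, jar = make_cookie_opener()
    # opener.open(req) would send the POST; cookies from the response land in jar.
```

With requests or RoboBrowser, the equivalent would be `session.post(url, data=fields)`, passing the dict directly so that the library does the encoding.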
