I'm currently trying to scrape a website (url in code below), however when I pull out the section of html I'm looking to work with, all I get is the variable name of the information I'm looking for. The actual value for the variable is present when I manually inspect the page's html but I assume when I scrape the page that all I see is the website referencing the variable from elsewhere.
I'm hoping someone can help me try to access this information. I have tried just scraping the website's html using selenium, however I seem just get back the same html that I scrape when using requests (maybe I'm doing it incorrectly and somebody can show me the correct approach).
This is a refined version of my code:
import scrapy #For scraping
from scrapy import Selector
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
'Accept-Language': 'en-GB,en;q=0.5',
'Referer': 'https://google.com',
'DNT': '1'}
url = 'https://groceries.aldi.ie/en-GB/drinks/beer-ciders'
html = requests.get(url, headers=headers).content
sel = Selector(text=html)
x = sel.xpath('//*[#id="vueSearchResults"]/div//span')#/div[2]/div/div[4]/div/div/span/span
print((x.extract())[8])
This then returns the following:
<span>{{Product.ListPrice}}</span>
From which I want to get the actual value of 'Product.ListPrice'. I'd appreciate it if some can point me in the right direction as to accessing this variable's information, or a way to scrape the website's html - as seen by a user traversing the webpage.
** It was recommended to me to send a POST request through this API: 'https://groceries.aldi.ie/api/product/calculatePrices' along with passing request headers and payload, but I'm not entirely sure how to pull this off (I'm new to this aspect of scraping), if someone could provide me an example of how to carry this out I'd greatly appreciate it!
Thanks!
This is how you can replicate the POST request through Scrapy.
Code
import scrapy
import json
class Test(scrapy.Spider):
name = 'test'
headers = {
"authority": "groceries.aldi.ie",
"pragma": "no-cache",
"cache-control": "no-cache",
"sec-ch-ua": "\" Not;A Brand\";v=\"99\", \"Google Chrome\";v=\"91\", \"Chromium\";v=\"91\"",
"accept-language": "en-GB",
"sec-ch-ua-mobile": "?0",
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36",
"websiteid": "a763fb4a-0224-4ca8-bdaa-a33a4b47a026",
"content-type": "application/json",
"accept": "application/json, text/javascript, */*; q=0.01",
"x-requested-with": "XMLHttpRequest",
"origin": "https://groceries.aldi.ie",
"sec-fetch-site": "same-origin",
"sec-fetch-mode": "cors",
"sec-fetch-dest": "empty",
"referer": "https://groceries.aldi.ie/en-GB/drinks/beer-ciders?sortDirection=asc&page=1"
}
body = '{"products":["4088600284026","5391528370382","5391528372836","5391528372850","5391528372874","4088600298696","4088600103709","4088600388700","5035766046028","5000213021934","4088600012551","4088600325934","4088600300153","25389111","4072700001171","4088600012537","4088600012544","4088600013138","4088600013145","4088600103525","4088600103532","4088600103570","4088600103600","4088600135182","4088600141848","4088600142050","4088600158105","4088600217024","4088600217208","4088600241302","4088600249292","4088600249308","4088600280615","4088600281445","4088600283043","4088600284088","4088600295688","4088600295800","4088600295817","4088600303925"]}'
def start_requests(self):
url = 'https://groceries.aldi.ie/api/product/calculatePrices'
yield scrapy.Request(url=url,method='POST', headers=self.headers,body=self.body)
def parse(self,response):
data = json.loads(response.body)
for i in data.get('ProductPrices'):
print('Listing price is', i.get('ListPrice'),'\t','Unit Price is',i.get('UnitPrice') )
Output(Truncated)
Listing price is €4.19 Unit Price is €4.23
Listing price is €1.99 Unit Price is €3.98
Listing price is €1.99 Unit Price is €3.98
Listing price is €1.99 Unit Price is €3.98
Listing price is €1.99 Unit Price is €3.98
Related
I am sending a request to some url. I Copied the curl url to get the code from curl to python tool. So all the headers are included, but my request is not working and I recieve status code 403 on printing and error code 1020 in the html output. The code is
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
# 'Accept-Encoding': 'gzip, deflate, br',
'DNT': '1',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Sec-Fetch-User': '?1',
}
response = requests.get('https://v2.gcchmc.org/book-appointment/', headers=headers)
print(response.status_code)
print(response.cookies.get_dict())
with open("test.html",'w') as f:
f.write(response.text)
I also get cookies but not getting the desired response. I know I can do it with selenium but I want to know the reason behind this. Thanks in advance.
Note:
I have installed all the libraries installed with request with same version as computer and still not working and throwing 403 error
The site is protected by cloudflare which aims to block, among other things, unauthorized data scraping. From What is data scraping?
The process of web scraping is fairly simple, though the
implementation can be complex. Web scraping occurs in 3 steps:
First the piece of code used to pull the information, which we call a scraper bot, sends an HTTP GET request to a specific website.
When the website responds, the scraper parses the HTML document for a specific pattern of data.
Once the data is extracted, it is converted into whatever specific format the scraper bot’s author designed.
You can use urllib instead of requests, it seems to be able to deal with cloudflare
req = urllib.request.Request('https://v2.gcchmc.org/book-appointment/')
req.add_headers('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0')
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8')
req.add_header('Accept-Language', 'en-US,en;q=0.5')
r = urllib.request.urlopen(req).read().decode('utf-8')
with open("test.html", 'w', encoding="utf-8") as f:
f.write(r)
It works on my machine, so I am not sure what the problem is.
However, when I want send a request which does not work, I often try if it works using playwright. Playwright uses a browser driver and thus mimics your actual browser when visiting the page. It can be installed using pip install playwright. When you try it for the first time it may give an error which tells you to install the drivers, just follow the instruction to do so.
With playwright you can try the following:
from playwright.sync_api import sync_playwright
url = 'https://v2.gcchmc.org/book-appointment/'
ua = (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/69.0.3497.100 Safari/537.36"
)
with sync_playwright() as p:
browser = p.chromium.launch(headless=False)
page = browser.new_page(user_agent=ua)
page.goto(url)
page.wait_for_timeout(1000)
html = page.content()
print(html)
A downside of playwright is that it requires the installation of the chromium (or other) browsers. This is a downside as it may complicate deployment, as the browser can not simply be added to requirements.txt, and a container image is required.
Try running Burp Suite's Proxy to see all the headers and other data like cookies. Then you could mimic the request with the Python module. That's what I always do.
Good luck!
Had the same problem recently.
Using the javascript fetch-api with Selenium-Profiles worked for me.
example js:
fetch('http://example.com/movies.json')
.then((response) => response.json())
.then((data) => console.log(data));o
Example Python with Selenium-Profiles:
headers = {
"accept": "application/json",
"accept-encoding": "gzip, deflate, br",
"accept-language": profile["cdp"]["useragent"]["acceptLanguage"],
"content-type": "application/json",
# "cookie": cookie_str, # optional
"sec-ch-ua": "'Google Chrome';v='107', 'Chromium';v='107', 'Not=A?Brand';v='24'",
"sec-ch-ua-mobile": "?0", # "?1" for mobile
"sec-ch-ua-platform": "'" + profile['cdp']['useragent']['userAgentMetadata']['platform'] + "'",
"sec-fetch-dest": "empty",
"sec-fetch-mode": "cors",
"user-agent": profile['cdp']['useragent']['userAgent']
}
answer = driver.requests.fetch("https://www.example.com/",
options={
"body": json.dumps(post_data),
"headers": headers,
"method":"POST",
"mode":"same-origin"
})
I don't know why this occurs, but I assume cloudfare and others are able to detect, whether a request is made with javascript.
I am trying to scrape ETFs from the website https://www.etf.com/channels. However no matter what I try it returns a 503 error when trying to access it. I've tried using different user agents as well as headers but it still wouldn't let me access it. Sometimes when I try to access the website by browser a page pops up that "checks if the connection is secure" So I assume they have things in place to stop scraping. I've seen others ask the same question and the answer always says to add a user agent but that didn't work for this site.
Scrapy
class BrandETFs(scrapy.Spider):
name = "etfs"
start_urls = ['https://www.etf.com/channels']
headers = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.5",
"Connection": "keep-alive",
"Host": "www.etf.com",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "cross-site",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0"
}
custom_settings = {'DOWNLOAD_DELAY': 0.3, "CONCURRENT_REQUESTS": 4}
def start_requests(self):
url = self.start_urls[0]
yield scrapy.Request(url=url)
def parse(self, response):
test = response.css('div.discovery-slat')
yield {
"test": test
}
Requests
import requests
url = 'https://www.etf.com/channels'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
'Referer': 'https://google.com',
'Origin': 'https://www.etf.com'
}
r = requests.post(url, headers=headers)
r.raise_for_status()
Is there anyway to get around these blocks and access the website?
Status 503 - Service Unavailable is often seen in such cases, you are probably right with your assumption that they have taken measures against scraping.
For the sake of completeness, they prohibit what you are attempting in their Terms of Service (No. 7g):
[...] You agree that you will not [...]
Use automated means, including spiders, robots, crawlers [...]
Technical point of view
The User-Agent in the header is just one of many things that you should consider when you try to hide the fact that you automated the requests you are sending.
Since you see a page that seems to verify that you are still/again a human, it is likely that they have figured out what is going on and
have an eye on your IP. It might not be blacklisted (yet) because they notice changes whenever you try to access the page.
How did they find out? Based on your question and code, I guess it's just your IP that did not change in combination with
Request rate: You have sent (too many) requests too quickly, i.e. faster than they consider a human to do this.
Periodic requests: Static delays between requests, so they see pretty regular timing on their side.
There are several other aspects that might or might not be monitored. However, using proxies (i.e. changing IP addresses) would be a step in the right direction.
I'm trying to read specific information(name, price, etc. ...) from an Amazon webpage.
For that I'm using "BeautifulSoup" & "requests" as suggested in most tutorials. My code can load the page and find the item I'm looking for but fails to actually get it. I checked the webpage the item definetly exists.
Here is my code:
#import time
import requests
#import urllib.request
from bs4 import BeautifulSoup
URL = ('https://www.amazon.de/dp/B008JCUXNK/?coliid=I9G2T92PZXG06&colid=3ESRXLK53S0NY&psc=1&ref_=lv_ov_lig_dp_it')
# user agent = browser information (get via google search "my user agent")
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0'}
page = requests.get(URL, headers=headers)# webpage
soup = BeautifulSoup(page.content, 'html.parser')# webpage as html
title = soup.find(id="productTitle")
print(title)
title is always "NONE" so calling get_Text will cause an error.
Can anybody tell me what's wrong?
Found a way to get past the captcha.
The request needs to contain a better header.
Example:
import datetime
import requests
KEY = "YOUR_KEY_HERE"
date = datetime.datetime.now().strftime("%Y%m%d")
BASE_REQUEST = ('https://www.amazon.de/Philips-Haartrockner-ThermoProtect-Technologie-HP8230/dp/B00BCQIIMS?pf_rd_r=T1T8Z7QTQTGYM8F7KRN5&pf_rd_p=c832d309-197e-4c59-8cad-735a8deab917&pd_rd_r=20c6ed33-d548-47d7-a262-c53afe32df96&pd_rd_w=63hR3&pd_rd_wg=TYwZH&ref_=pd_gw_crs_zg_bs_84230031')
headers = {
'dnt': '1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'referer': 'https://www.amazon.com/',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}
payload = {
"api-key": KEY,
"begin_date": date,
"end_date": date,
"q": "Donald Trump"
}
r = requests.get(BASE_REQUEST, headers=headers)
print(r.status_code)
if r.status_code == 200:
print('success')
For information on status codes just google html status codes.
Hope this helps anyone with similar problems
Cheers!
Your code is 100% correct, but I've tried your code and checked value of page.content. It contains captcha. Looks like Amazon don't want you to scrape their site.
You can read about your case here: https://www.reddit.com/r/learnpython/comments/bf21fn/how_to_prevent_captcha_while_scraping_amazon/.
But I also recommend to read Amazon's Terms And Conditions https://www.amazon.com/gp/help/customer/display.html/ref=hp_551434_conditions to be sure if you can legally scrape it.
I am trying to webscrape Crunch Base to find the total funding amount for certain companies. Here is a link to an example.
At first, I tried just using beautiful soup but I keep getting an error saying:
Access to this page has been denied because we believe you are using automation tools to browse the\nwebsite.
So then I looked up how to fake a browser visit and I changed my code, but I still get the same error. What am I doing wrong??
import requests
from bs4 import BeautifulSoup as BS
url = 'https://www.crunchbase.com/organization/incube-labs'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
print(response.content)
All in all your code looks great! It appears that the website you are trying to scrape requires a more complex header than the one you have. The following code should solve your issue:
import requests
from bs4 import BeautifulSoup as BS
url = 'https://www.crunchbase.com/organization/incube-labs'
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "en-US,en;q=0.5", "Accept-Encoding": "gzip, deflate", "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}
response = requests.get(url, headers=headers)
print(response.content)
I'm attempting to post the data of a pop-out form to a local web site. To do this I'm emulating the requests header and data and cookie information provided by the site. (Note: I am largely redacting my email and password from the code (for obvious reasons), but all other code will remain the same.)
I have tried mulitple permutations of the cookie, header, requests, data, etc. Additionally, I have verified in a network inspector the cookie and expected headers and data. I am able to easily set a cookie using requests' sample code. I cannot explain why my code won't work on a live site, and I'd be very grateful for any assistance. Please see the following code for further details.
import requests
import robobrowser
import json
br = robobrowser.RoboBrowser(user_agent="Windows Chrome",history=True)
url = "http://posting.cityweekly.net/gyrobase/API/Login/CookieV2"
data ={"passwordChallengeResponse":"....._SYGwbDLkSyU5gYKGg",
"email": "<email>%40bu.edu",
"ttl":"129600",
"sessionOnly": "1"
}
headers = {
"Origin": "http://posting.cityweekly.net",
"Accept-Encoding": "gzip, deflate",
"Accept-Language": "en-US,en;q=0.8,ru;q=0.6",
"User-Agent": "Windows Chrome", #"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36",
"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
"Referer": "http://posting.cityweekly.net/utah/Events/AddEvent",
"X-Requested-With": "XMLHttpRequest",
"Connection": "keep-alive",
"Cache-Control": "max-age=0",
"Host":"posting.cityweekly.net"
}
cookie = {"Cookie": "__utma=25975215.1299783561.1416894918.1416894918.1416897574.2; __utmc=25975215; __utmz=25975215.1416894918.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __qca=P0-2083194243-1416894918675; __gads=ID=e3b24038c9228b00:T=1416894918:S=ALNI_MY7ewizuxK0oISnqPJWlLDAeKFMmw; _cb_ls=1; _chartbeat2=D6vh2H_ZbNJDycc-t.1416894962025.1416897589974.1; __utmb=25975215.3.10.1416897574; __utmt=1"}
r = br.session.get(url, data=json.dumps(data), cookies=cookie, headers=headers)
print r.headers
print [item for item in r.cookies.__dict__.items()]
Note that I print the cookies object and that the cookies attribute (a dictionary) is empty.
You need to perform a POST to login to the site. Once you do that, I believe the cookies will then have the correct values, (not 100% on that...). This post clarifies how to properly set cookies.
Note: I don't think you need to do the additional import of requests unless you're using it outside of RoboBrowser.