I want to build a basic scraper that scrapes Amazon reviews using BeautifulSoup, and this is the code I have so far (FYI, I have both the requests and BeautifulSoup libraries imported):
def spider(max_pages):
    pages = 1
    while pages <= max_pages:
        review_list = []
        print("Page Number")
        print(pages)
        header = {'User-Agent': 'Mozilla/5.0 (Linux; Android 9; KFTRWI) AppleWebKit/537.36 (KHTML, like Gecko) Silk/102.2.1 like Chrome/102.0.5005.125 Safari/537.36'}  # user agent so the site knows we aren't a bot
        url = "https://www.amazon.ca/Beats-Fit-Pro-Cancelling-Built/product-reviews/B09JL41N9C/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=" \
              + str(pages)  # url of the product
        response = requests.get(url, headers=header)  # send a GET request to the url
        html = response.text  # HTML of the response
        soup = BeautifulSoup(html, features="lxml")  # parse the HTML with BeautifulSoup
        review = soup.find_all('div', {'data-hook': 'review'})
        print(len(review))
        for i in review:
            print('ok')
            re = {
                'review_title': i.find('a', {'data-hook': 'review-title'}).text.strip(),
                'rating': float(i.find('i', class_=['a-icon', 'a-icon-star']).text.replace('out of 5 stars', '').strip()),
                'review_body': i.find('span', {'data-hook': 'review-body'}).text
            }
            print(re)
            review_list.append(re)
        print(len(review_list))
        pages += 1

spider(1)
When I ran this code, I first thought the problem was in the for loop, since it never executes. Looking deeper, I printed the number of reviews found (it should be 10, since there are usually 10 reviews on one page), but the output was 0. So the problem isn't the for loop itself, but how I'm trying to scrape the reviews. I double-checked all the HTML tags, and they seem to be the correct ones for the reviews I'm trying to find, so now I'm confused as to why I'm not scraping any reviews. Any help would be highly appreciated.
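One way to narrow this down is to inspect what Amazon actually sent back before parsing it. Here is a minimal diagnostic sketch; the "captcha" marker is an assumption about what Amazon's bot-check page contains, not something taken from your output:

response = requests.get(url, headers=header)
print(response.status_code)  # anything other than 200 means the request itself was rejected
# Amazon often serves a robot-check page to suspected scrapers instead of the real content
if "captcha" in response.text.lower():
    print("Got a bot-check page instead of the reviews page")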
Related
I've been bouncing around a ton of similar questions, but nothing seems to fix the issue. With some help, I've set this script up to scrape the href tags from a different URL, and I'm now trying to take the href links in the "Result" column from the URL below.
The script doesn't seem to be working like it did for other sites.
The table is an HTML element, but no matter how I tweak my script, I can't retrieve anything except a blank result.
Could someone explain to me why this is the case? I've watched many YouTube videos trying to understand, but this just doesn't make sense to me.
import requests
from bs4 import BeautifulSoup

profiles = []
urls = [
    'https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100'
]

for url in urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    for profile in soup.find_all('a'):
        profile = profile.get('href')
        profiles.append(profile)

print(profiles)
The following code works:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17'}
r = requests.get('https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
for x in soup.select('a'):
    print(x.get('href'))
The main issue in that case is that you don't send a user-agent. Some sites, regardless of whether it is a good idea, use this as a basis for deciding that you are a bot, and then serve you no content or only specific content. So, at a minimum, provide that information with your request:

req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
Also take a closer look at your selection. Assuming you want to get the team links only, you should adjust it; I used CSS selectors:
for profile in soup.select('table a[href^="/team/"]'):
You also need to concatenate the base URL to the extracted values:
profile = 'https://stats.ncaa.org'+profile.get('href')
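If you prefer not to concatenate strings by hand, urllib.parse.urljoin from the standard library gives the same result here and also copes with hrefs that are already absolute; a small sketch:

from urllib.parse import urljoin

profile = urljoin('https://stats.ncaa.org', profile.get('href'))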
Example
from bs4 import BeautifulSoup
import requests

profiles = []
urls = ['https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100']

for url in urls:
    req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(req.text, 'html.parser')
    for profile in soup.select('table a[href^="/team/"]'):
        profile = 'https://stats.ncaa.org' + profile.get('href')
        profiles.append(profile)

print(profiles)
Hi, so I want to scrape domain names and their prices, but it returns nothing and I don't know why.
import requests
from bs4 import BeautifulSoup

url = 'https://www.brandbucket.com/styles/6-letter-domain-names?page=1'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
names = soup.findAll('div', {'class': "domainCardDetail"})
print(names)
Try the following approach to get the domain names and their prices from that site. The script currently parses content from the first page only. If you wish to get content from other pages, change the page number in page=1, which is located within link, accordingly.
import requests
from bs4 import BeautifulSoup

link = 'https://www.brandbucket.com/styles/6-letter-domain-names?page=1'
url = 'https://www.brandbucket.com/amp/ga'

payload = {
    '__amp_source_origin': 'https://www.brandbucket.com'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text, 'html.parser')
    payload['dp'] = soup.select_one("amp-iframe")['src'].split("list=")[1].split("&")[0]
    resp = s.get(url, params=payload)
    for key, val in resp.json()['triggers'].items():
        if not key.startswith('domain'):
            continue
        container = val['extraUrlParams']
        print(container['pr1nm'], container['pr1pr'])
The output looks like this (truncated):
estita.com 2035
rendro.com 1675
rocamo.com 3115
wzrdry.com 4315
prutti.com 2395
bymodo.com 3495
ethlax.com 2035
intezi.com 2035
podoxa.com 2430
rorror.com 3190
zemoxa.com 2195
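If you want more than the first page, here is a hedged sketch that loops over page numbers by rewriting the page parameter in link, assuming the amp-iframe trick from the script above works the same on every page:

import requests
from bs4 import BeautifulSoup

base_link = 'https://www.brandbucket.com/styles/6-letter-domain-names?page={}'
url = 'https://www.brandbucket.com/amp/ga'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    for page in range(1, 4):  # first three pages; adjust the range as needed
        res = s.get(base_link.format(page))
        soup = BeautifulSoup(res.text, 'html.parser')
        payload = {
            '__amp_source_origin': 'https://www.brandbucket.com',
            'dp': soup.select_one("amp-iframe")['src'].split("list=")[1].split("&")[0],
        }
        resp = s.get(url, params=payload)
        for key, val in resp.json()['triggers'].items():
            if not key.startswith('domain'):
                continue
            container = val['extraUrlParams']
            print(container['pr1nm'], container['pr1pr'])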
Check the status code of the response. When I tested, the web server returned a 403, and because of that there is no "domainCardDetail" div in the response. The reason is that the website is protected by Cloudflare.
There are advanced ways to bypass this. The following solution is very simple if you do not need to scrape at scale; otherwise, you may want to use cloudscraper, Selenium, or another method that enables JavaScript on the website:
1. Open the developer console.
2. Go to "Network" and make sure the ticks are set as in this picture: https://i.stack.imgur.com/v0KTv.png
3. Refresh the page.
4. Copy the JSON result and parse it in Python: https://i.stack.imgur.com/odX5S.png
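A minimal sketch of that last parsing step, assuming you saved the copied JSON to a local file named response.json and that it has the same triggers structure as the response in the answer above:

import json

# hypothetical file holding the JSON copied from the Network tab
with open("response.json") as f:
    data = json.load(f)

# same structure as in the earlier answer: domain entries live under 'triggers'
for key, val in data["triggers"].items():
    if not key.startswith("domain"):
        continue
    container = val["extraUrlParams"]
    print(container["pr1nm"], container["pr1pr"])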
I'm trying to scrape an Amazon offer page. The code block below prints the titles of all results on the first page.
import requests
from bs4 import BeautifulSoup as BS

offer_url = 'https://www.amazon.de/gp/promotion/A2M8IJS74E2LMU?ref=deals_deals_deals-grid_slot-15_8454_dt_dcell_img_14_f3724fb9#'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
}

response = requests.get(offer_url, headers=headers)
html = response.text
soup = BS(html, 'html.parser')  # specify a parser explicitly to avoid the bs4 warning
links = soup.find_all('a', class_='a-size-base a-link-normal fw-line3-break a-text-normal')

for link in links:
    title = link.text.strip()  # remove surrounding spaces and line breaks etc.
    print(title)
So far, so good. Now, how do I access the second page? Clicking the "Next page" button ("Weiter" in German) at the bottom of the page does not add a page argument like ?page=2 to the URL through which I could access the next page.
The questions I have are: How do I access the content of the next page in the same way I access the first page? Is there a POST request involved, and if so, how do I figure out its params/data? How would I use requests to mimic pressing the Next Page button and get the respective page results?
The offer is scheduled to last until March 21st, 2021. Until then, the link provided in the code should be valid.
Maybe it's just a few lines of code, e.g. a tweak in my request. Thanks in advance! Have a wonderful day!
Edit:
Trying to fetch the second page using the following script only yields the results of the first page.
params = {"page":2}
html = requests.post('https://www.amazon.de/gp/promotion/A2M8IJS74E2LMU', data=params, headers=headers).text
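In general, the request behind such a button can be found in the browser's DevTools: open the Network tab, click "Weiter", and look at the request that fires. Once identified, it can be replayed with requests. A generic sketch with a hypothetical endpoint and payload, since the actual request details aren't shown here:

import requests

# hypothetical values: copy the real URL, payload and headers from DevTools
xhr_url = 'https://www.amazon.de/gp/promotion/A2M8IJS74E2LMU'  # "Request URL" in DevTools
payload = {'page_param_from_devtools': '2'}                    # "Form Data" / "Request Payload"
headers = {
    'User-Agent': 'Mozilla/5.0',
    'X-Requested-With': 'XMLHttpRequest',  # commonly present on XHR calls
}

response = requests.post(xhr_url, data=payload, headers=headers)
print(response.status_code)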
I have been using the following code to parse the web page at https://www.blogforacure.com/members.php. The code is expected to return the links to all of the members on the given page.
from bs4 import BeautifulSoup
import urllib

r = urllib.urlopen('https://www.blogforacure.com/members.php').read()
soup = BeautifulSoup(r, 'lxml')

headers = soup.find_all('h3')
print(len(headers))
for header in headers:
    a = header.find('a')
    print(a.attrs['href'])
But I only get the first 10 links from the above page. Even when printing the prettified soup I see only the first 10 links.
The results are loaded dynamically by making AJAX requests to the https://www.blogforacure.com/site/ajax/scrollergetentries.php endpoint. Simulate them in your code with requests, maintaining a web-scraping session:
from bs4 import BeautifulSoup
import requests

url = "https://www.blogforacure.com/site/ajax/scrollergetentries.php"

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}
    session.get("https://www.blogforacure.com/members.php")

    page = 0
    members = []
    while True:
        # get page
        response = session.post(url, data={
            "p": str(page),
            "id": "#scrollbox1"
        })
        html = response.json()['html']

        # parse html
        soup = BeautifulSoup(html, "html.parser")
        page_members = [member.get_text() for member in soup.select(".memberentry h3 a")]
        print(page, page_members)
        members.extend(page_members)

        page += 1
It prints the current page number and the list of members on each page, accumulating the member names into a members list. I'm not posting what it prints since it contains names.
Note that I've intentionally left the loop endless; please figure out the exit condition yourself. It may be when response.json() throws an error.
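One possible exit condition, as a hedged sketch: stop as soon as a response is no longer valid JSON, or a page comes back empty. Inside the while loop this would replace the matching lines:

try:
    html = response.json()['html']
except ValueError:  # the endpoint stopped returning JSON, so we're done
    break

soup = BeautifulSoup(html, "html.parser")
page_members = [member.get_text() for member in soup.select(".memberentry h3 a")]
if not page_members:  # an empty page also means there are no more members
    break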
I am scraping some information from this url: https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon#description-tab
Everything was fine until I got to the description. I tried and tried to scrape it, but I have failed so far.
It seems like I can't reach that information. Here is my code:
import urllib
from bs4 import BeautifulSoup

html = urllib.urlopen("https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon")
tree = BeautifulSoup(html, "lxml")
description = tree.find('div', {'id': 'description_section', 'class': 'description-section'})
Does anyone have any suggestions?
You would need to make an additional request to get the description. Here is a complete working example using requests + BeautifulSoup:
import requests
from bs4 import BeautifulSoup

url = "https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon/"

with requests.Session() as session:
    session.headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
    }

    # get the token
    response = session.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    token = soup.find("meta", {"name": "csrf-token"})["content"]

    # get the description
    description_url = url + "description"
    response = session.get(description_url, headers={"X-CSRF-Token": token, "X-Requested-With": "XMLHttpRequest"})
    soup = BeautifulSoup(response.content, "html.parser")
    description = soup.find('div', {'id': 'description_section', 'class': 'description-section'})
    print(description.get_text(strip=True))
I use the XML package for web scraping, and I can't get the description section with it either, as you described with BeautifulSoup. However, if you just want to scrape this one page, you can download it manually and then parse the file:
library(XML)
page = htmlTreeParse("Lunar Lion - the first ever university-led mission to the Moon _ RocketHub.html",
                     useInternal = TRUE, encoding = "utf8")
unlist(xpathApply(page, '//div[@id="description_section"]', xmlValue))
I tried downloading the page with R as well, and I can't find the description_section there either:
url="https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon"
download.file(url,"page.html",mode="w")
Maybe we have to add some options to the download.file function. I hope some HTML experts can help.
I found out how to scrape it with R:
library("rvest")
url = "https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon/description"
url %>%
  read_html() %>%
  html_nodes(xpath = '//div[@id="description_section"]') %>%
  html_text()