I am looking to scrape the audience review score (the "NN% liked" figure Google shows in its knowledge panel) for a specific movie.
I have been trying to scrape this value using BS4; however, I cannot seem to find this rating anywhere when scraping the URL. The critic ratings (IMDb, RT, Meta) can be located, but the Google user score is not there.
I used SerpApi to double-check this, and the value doesn't show up there either.
I am assuming this means that the Google user score may be updated by some other script and is not retrievable by these means. Could there possibly be another method of retrieving this data?
Looking through the response using Python shows no result either.
html = requests.get('https://www.google.com/search?q=shawshank+redemption&hl=en')
soup = BeautifulSoup(html.text, 'lxml')
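One way to make that check concrete is a quick search of the raw response (a small sketch of my own; it assumes the score appears as text like "96% liked", which the answer below relies on):

import requests

# fetch the results page and look for the score text directly in the raw HTML
html = requests.get('https://www.google.com/search?q=shawshank+redemption&hl=en')
print('% liked' in html.text)  # prints whether the "% liked" text is present at all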
Try:
import re
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0"
}

url = "https://www.google.com/search?q=shawshank+redemption&hl=en"
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

liked = soup.find(
    lambda tag: tag.name == "span" and re.match(r"\d+% liked", tag.text)
)

print(liked.text.split()[0])
Prints:
96%
So I want to build a basic scraper that scrapes Amazon reviews using BeautifulSoup, and this is the code I have so far (FYI, I have both the requests and BeautifulSoup libraries imported):
def spider(max_pages):
    pages = 1
    while pages <= max_pages:
        review_list = []
        print("Page Number")
        print(pages)
        header = {'User-Agent': 'Mozilla/5.0 (Linux; Android 9; KFTRWI) AppleWebKit/537.36 (KHTML, like Gecko) Silk/102.2.1 like Chrome/102.0.5005.125 Safari/537.36'}  # user agent so site knows we aren't a bot
        url = "https://www.amazon.ca/Beats-Fit-Pro-Cancelling-Built/product-reviews/B09JL41N9C/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=" \
              + str(pages)  # url of product
        response = requests.get(url, headers=header)  # passing a GET request to url
        html = response.text  # get HTML of response
        soup = BeautifulSoup(html, features="lxml")  # parses the HTML using BeautifulSoup
        review = soup.find_all('div', {'data-hook': 'review'})
        print(len(review))
        for i in review:
            print('ok')
            re = {
                'review_title': i.find('a', {'data-hook': 'review-title'}).text.strip(),
                'rating': float(i.find('i', class_=['a-icon', 'a-icon-star']).text.replace('out of 5 stars', '').strip()),
                'review_body': i.find('span', {'data-hook': 'review-body'}).text
            }
            print(re)
            review_list.append(re)
        print(len(review_list))
        pages += 1

spider(1)
When I run this code, I thought the problem might lie in the for loop, since it doesn't run for some reason. Taking a deeper look, I tried printing out the number of reviews (I believe it should be 10, since there are usually 10 reviews on one page), but the output ended up being 0, which leads me to believe that the problem isn't with the for loop but rather with how I'm trying to scrape the reviews. I double-checked all the HTML tags, and they seem to be the correct ones for the reviews I'm trying to find, so now I'm confused as to why I'm not scraping any reviews. If anyone could help me it would be highly appreciated.
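One way to narrow this down (a hedged sketch, not part of the original question, reusing the url and header from the code above) is to inspect what Amazon actually returned; when a request is flagged as automated, Amazon often serves a robot-check page that contains no data-hook="review" markup at all:

import requests

# url and header as defined inside spider() above
response = requests.get(url, headers=header)
print(response.status_code)                   # can still be 200 even for a robot-check page
print('data-hook="review"' in response.text)  # False means no review markup came back at all
print('captcha' in response.text.lower())     # often True when Amazon serves its robot check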
I've been bouncing around a ton of similar questions, but nothing seems to fix the issue... I've set this up (with help) to scrape the HREF tags from a different URL.
I'm now trying to take the HREF links in the "Result" column from the URL used in the code below.
The script doesn't seem to be working like it did for other sites.
The table is an HTML element, but no matter how I tweak my script, I can't retrieve anything except a blank result.
Could someone explain to me why this is the case? I'm watching many YouTube videos trying to understand, but this just doesn't make sense to me.
import requests
from bs4 import BeautifulSoup

profiles = []
urls = [
    'https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100'
]

for url in urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')

    for profile in soup.find_all('a'):
        profile = profile.get('href')
        profiles.append(profile)

print(profiles)
The following code works:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17'}
r = requests.get('https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
for x in soup.select('a'):
    print(x.get('href'))
The main issue in that case is that you don't send a user-agent. Some sites, regardless of whether it is a good idea, use this to decide that you are a bot and serve no content, or only limited content.
So the minimum is to provide that information with your request:
req = requests.get(url,headers={'User-Agent': 'Mozilla/5.0'})
Also take a closer look at your selection. Assuming you only want the team links, you should adjust it; I used CSS selectors:
for profile in soup.select('table a[href^="/team/"]'):
You also need to concatenate the base URL to the extracted values:
profile = 'https://stats.ncaa.org'+profile.get('href')
Example
from bs4 import BeautifulSoup
import requests

profiles = []
urls = ['https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100']

for url in urls:
    req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(req.text, 'html.parser')

    for profile in soup.select('table a[href^="/team/"]'):
        profile = 'https://stats.ncaa.org' + profile.get('href')
        profiles.append(profile)

print(profiles)
I've seen this before, but I've never seen anything related to Google. When something is searched on Google, all of the links and titles are put in h3 tags. However, if I try to use Beautiful Soup, none of the h3 tags appear and it seems like a lot of the tags are missing. I don't think this is a JavaScript issue. Is there anything I'm missing?
link = "http://google.com/search?q=" + input
soup = BeautifulSoup(link, "lxml")
for item in soup.find_all("h3"):
    print(item)
Edit: code
According to your code, you get the empty result because you didn't send any request, for example via the requests module, as others mentioned in their answers. You just passed the URL string straight into BeautifulSoup, which doesn't know what to do with it.
Also, there's no user-agent (headers) specified, and Google could eventually block the request. What is my user-agent
Code (CSS selectors reference):
import requests, lxml
from bs4 import BeautifulSoup

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "python memes",
    "hl": "en"
}

soup = BeautifulSoup(requests.get('https://www.google.com/search', headers=headers, params=params).text, 'lxml')

# container with all titles and links; iterate over each result
for result in soup.select('.yuRUbf'):
    title = result.select_one('.DKV0Md').text
    url = result.a['href']
    print(f'{title}, {url}\n')
---------
'''
35 Funny And Best Python Programming Memes - CodeItBro, https://www.codeitbro.com/funny-python-programming-memes/
ML Memes (#python.memes_) • Instagram photos and videos, https://www.instagram.com/python.memes_/?hl=en
...
'''
Alternatively, you can do the same thing using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you only need to iterate over a JSON string rather than scraping everything from scratch, and you don't have to worry about bypassing blocks from Google or figuring out how to scrape data rendered by JavaScript (e.g. Google Maps) or how to extract images from Google Images, since it's already done for the end user.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google",
    "q": "python memes",
    "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    title = result['title']
    url = result['link']
    print(f'{title}, {url}\n')
---------
'''
35 Funny And Best Python Programming Memes - CodeItBro, https://www.codeitbro.com/funny-python-programming-memes/
ML Memes (#python.memes_) • Instagram photos and videos, https://www.instagram.com/python.memes_/?hl=en
...
'''
P.S. - I wrote a more detailed blog post about how to scrape Google Organic Results.
Disclaimer, I work for SerpApi.
Use requests to get the data, and narrow down your search in order to print only the titles:
from bs4 import BeautifulSoup
import requests
link = "http://google.com/search?q=" + input("Enter search")
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:32.0) Gecko/20100101 Firefox/32.0'}
r = requests.get(link, headers=headers)
soup = BeautifulSoup(r.text,'html5lib')
headings = soup.find_all('h3', class_ = 'LC20lb DKV0Md')
for heading in headings:
    print(heading.text)
Output:
Enter search>? beautifulsoup
Beautiful Soup Documentation — Beautiful Soup 4.9.0 ...
Beautiful Soup: We called him Tortoise because he taught us.
Beautiful Soup documentation - Crummy
beautifulsoup4 · PyPI
Intro to Beautiful Soup | Programming Historian
Beautiful Soup (HTML parser) - Wikipedia
Implementing Web Scraping in Python with BeautifulSoup ...
Tutorial: Web Scraping with Python Using Beautiful Soup
Beautiful Soup - Quick Guide - Tutorialspoint
You need to first get the source code of the webpage using the requests module and then pass it to the BeautifulSoup constructor (also, don't use input as a variable name):
import requests
from bs4 import BeautifulSoup
Input = input("Enter search string: ")
link = "http://google.com/search?q=" + Input
html = requests.get(link).content
soup = BeautifulSoup(html, "lxml")
for item in soup.find_all("h3"):
    print(item.text)
You can make the input fail-safe too. In the Google search URL you should replace spaces with + and literal + characters with %2B (encode the existing + signs first, so the + signs that stand for spaces are kept):
import requests
from bs4 import BeautifulSoup

Input = input("Enter search string: ")
Input = Input.replace("+", "%2B").replace(" ", "+")  # encode '+' first, then turn spaces into '+'
link = "http://google.com/search?q=" + Input
html = requests.get(link).content
soup = BeautifulSoup(html, "lxml")

for item in soup.find_all("h3"):
    print(item.text)
If you don't have the requests module installed, you can install it with pip install requests.
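Alternatively (a sketch, not part of the original answer), you can let the standard library handle the encoding with urllib.parse.quote_plus, which converts spaces to + and literal + signs to %2B:

from urllib.parse import quote_plus

import requests
from bs4 import BeautifulSoup

query = input("Enter search string: ")
link = "http://google.com/search?q=" + quote_plus(query)  # e.g. "c++ tips" -> "c%2B%2B+tips"
html = requests.get(link).content
soup = BeautifulSoup(html, "lxml")

for item in soup.find_all("h3"):
    print(item.text)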
I am scraping some information from this url: https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon#description-tab
Everything was fine until I tried to scrape the description.
I have tried and tried, but I have failed so far.
It seems like I can't reach that information. Here is my code:
html = urllib.urlopen("https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon")
tree=BeautifulSoup(html, "lxml")
description=tree.find('div',{'id':'description_section','class':'description-section'})
Does any of you have a suggestion?
You would need to make an additional request to get the description. Here is a complete working example using requests + BeautifulSoup:
import requests
from bs4 import BeautifulSoup

url = "https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon/"

with requests.Session() as session:
    session.headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
    }

    # get the token
    response = session.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    token = soup.find("meta", {"name": "csrf-token"})["content"]

    # get the description
    description_url = url + "description"
    response = session.get(description_url, headers={"X-CSRF-Token": token, "X-Requested-With": "XMLHttpRequest"})
    soup = BeautifulSoup(response.content, "html.parser")

    description = soup.find('div', {'id': 'description_section', 'class': 'description-section'})
    print(description.get_text(strip=True))
I use the XML package for web scraping, and I can't get the description section as you described with BeautifulSoup either.
However, if you just want to scrape this one page, you can download it manually. Then:
page = htmlTreeParse("Lunar Lion - the first ever university-led mission to the Moon _ RocketHub.html",
                     useInternal = TRUE, encoding = "utf8")
unlist(xpathApply(page, '//div[@id="description_section"]', xmlValue))
I tried the R code to download, and I can't find the description_section either.
url="https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon"
download.file(url,"page.html",mode="w")
Maybe we have to add some options to the download.file function. I hope some HTML experts can help.
I found out how to scrape it with R:
library("rvest")
url="https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon/description"
url %>%
html() %>%
html_nodes(xpath='//div[#id="description_section"]', xmlValue) %>%
html_text()
I am trying to scrape the PDF links from the search results from Google Scholar. I have tried to set a page counter based on the change in URL, but after the first eight output links, I am getting repetitive links as output.
#!/usr/bin/env python
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
from bs4 import BeautifulSoup
import urllib2
import requests

# modifying the url as per page
urlCounter = 0
while urlCounter <= 30:
    urlPart1 = "http://scholar.google.com/scholar?start="
    urlPart2 = "&q=%22entity+resolution%22&hl=en&as_sdt=0,4"
    url = urlPart1 + str(urlCounter) + urlPart2
    page = urllib2.Request(url, None, {"User-Agent": "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"})
    resp = urllib2.urlopen(page)
    html = resp.read()
    soup = BeautifulSoup(html)
    urlCounter = urlCounter + 10

    recordCount = 0
    while recordCount <= 9:
        recordPart1 = "gs_ggsW"
        finRecord = recordPart1 + str(recordCount)
        recordCount = recordCount + 1

        # printing the links
        for link in soup.find_all('div', id=finRecord):
            linkstring = str(link)
            soup1 = BeautifulSoup(linkstring)
            for link in soup1.find_all('a'):
                print(link.get('href'))
Change the following line in your code:
finRecord = recordPart1 + str(recordCount)
To
finRecord = recordPart1 + str(recordCount+urlCounter-10)
The real problem: the div ids on the first page are gs_ggsW[0-9], but the ids on the second page are gs_ggsW[10-19], so BeautifulSoup finds no links on the 2nd page.
Python's variable scoping may confuse people coming from other languages, like Java. After the for loop below has executed, the variable link still exists, so link ends up referring to the last link on the 1st page.
for link in soup1.find_all('a'):
    print(link.get('href'))
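To illustrate that point (a minimal sketch, not from the original answer), the loop target simply keeps its last value once the loop finishes:

# Python does not scope the loop variable to the for block
for x in [1, 2, 3]:
    pass

print(x)  # prints 3: x still refers to the last item after the loop ends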
Updates:
Google may not provide PDF download links for some papers, so you can't use the id to match the link of each paper. You can use CSS selectors to match all the links together.
soup = BeautifulSoup(html)
urlCounter = urlCounter + 10

for link in soup.select('div.gs_ttss a'):
    print(link.get('href'))
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser.
Code and example in the online IDE to extract PDF's:
from bs4 import BeautifulSoup
import requests, lxml
params = {
    "q": "entity resolution",  # search query
    "hl": "en"                 # language
}

# https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582",
}

html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

for pdf_link in soup.select(".gs_or_ggsm a"):
    pdf_file_link = pdf_link["href"]
    print(pdf_file_link)
# output from the first page:
'''
https://linqs.github.io/linqs-website/assets/resources/getoor-vldb12-slides.pdf
http://ilpubs.stanford.edu:8090/859/1/2008-7.pdf
https://drum.lib.umd.edu/bitstream/handle/1903/4241/umi-umd-4070.pdf;sequence=1
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.169.9535&rep=rep1&type=pdf
https://arxiv.org/pdf/1208.1927
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.77.6875&rep=rep1&type=pdf
http://da.qcri.org/ntang/pubs/vldb18-deeper.pdf
'''
Alternatively, you can achieve the same thing by using Google Scholar Organic Results API from SerpApi. It's a paid API with a free plan.
The main difference is that you only need to grab the data from structured JSON instead of figuring out how to extract the data from HTML or how to bypass blocks from search engines.
Code to integrate:
from serpapi import GoogleSearch
params = {
    "api_key": "YOUR_API_KEY",   # SerpApi API key
    "engine": "google_scholar",  # Google Scholar organic results
    "q": "entity resolution",    # search query
    "hl": "en"                   # language
}

search = GoogleSearch(params)
results = search.get_dict()

for pdfs in results["organic_results"]:
    for link in pdfs.get("resources", []):
        pdf_link = link["link"]
        print(pdf_link)
# output:
'''
https://linqs.github.io/linqs-website/assets/resources/getoor-vldb12-slides.pdf
http://ilpubs.stanford.edu:8090/859/1/2008-7.pdf
https://drum.lib.umd.edu/bitstream/handle/1903/4241/umi-umd-4070.pdf;sequence=1
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.169.9535&rep=rep1&type=pdf
https://arxiv.org/pdf/1208.1927
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.77.6875&rep=rep1&type=pdf
http://da.qcri.org/ntang/pubs/vldb18-deeper.pdf
'''
If you want to scrape more data from organic results, there's a dedicated Scrape Google Scholar with Python blog post of mine.
Disclaimer, I work for SerpApi.