scraping rotten tomatoes 12 years a slave movie - python

I'm trying to scrape the number of audience ratings from this page
https://www.rottentomatoes.com/m/12_years_a_slave (which is 100,000+) using Python and Selenium.
I tried all kinds of Selenium locators, but every time I get a NoSuchElementException.
Here is my code:
import selenium
from selenium import webdriver
driver = webdriver.Chrome('path.exe')
url = 'https://www.rottentomatoes.com/m/12_years_a_slave'
driver.get(url)
def scrape_dom(element):
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
    return shadow_root
host = driver.find_element_by_tag_name('score-board')
root_1 = scrape_dom(host)
views = root_1.find_element_by_link_text(
    '/m/12_years_a_slave/reviews?type=user&intcmp=rt-' + \
    'scorecard_audience-score-reviews')
I also tried XPath and CSS selectors, but I always get the same error. Can you tell me what's wrong with my code?

A simple CSS selector works here.
from selenium import webdriver
driver = webdriver.Chrome()
url = 'https://www.rottentomatoes.com/m/12_years_a_slave'
driver.get(url)
print(driver.find_element_by_css_selector('a[slot=audience-count]').text)
I get 100,000+ Ratings printed out to my console.

See if this XPath works:
driver.find_element_by_xpath(".//a[@data-qa='audience-rating-count']").text
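In XPath, attributes are addressed with `@`, so an attribute test takes the form `[@name='value']`. As a minimal sketch of that syntax only, the snippet below runs the same expression against a small hand-written fragment using the standard library's ElementTree. The markup in `SAMPLE` is a made-up stand-in, not the real Rotten Tomatoes page, and ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for the real page markup.
SAMPLE = """
<div>
  <a data-qa="tomatometer-review-count">300 Reviews</a>
  <a data-qa="audience-rating-count">100,000+ Ratings</a>
</div>
""".strip()

root = ET.fromstring(SAMPLE)

# '@' selects an attribute; the predicate keeps only <a> elements whose
# data-qa attribute equals the given value.
link = root.find(".//a[@data-qa='audience-rating-count']")
audience_text = link.text
print(audience_text)
```

The same expression string can be passed to Selenium's XPath locator; whether it matches on the live page depends on the markup the site serves at the time.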

You don't need Selenium; you can use requests and bs4. You can also use a faster CSS class selector rather than the slower attribute selectors given in the other answers so far.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.rottentomatoes.com/m/12_years_a_slave')
soup = bs(r.content, 'lxml')
soup.select_one('.scoreboard__link--audience').text

Related

How to Get the Webpage Title of Chrome?

Using python, is there a way to obtain the title of the current active tab in Chrome?
If it is impossible, getting the list of titles of all tabs also works for my purpose.
Thanks.
There are multiple ways of getting the title of the current tab using Python.
You could use BeautifulSoup:
from urllib.request import urlopen
from bs4 import BeautifulSoup
soup = BeautifulSoup(urlopen("https://www.google.com"), "html.parser")
print(soup.title.string)
Or, if you're using Selenium:
from selenium import webdriver
driver = webdriver.Chrome(executable_path="C:\\chromedriver.exe")
driver.maximize_window()
driver.get(YOUR_URL)
print(driver.title)
print(driver.current_url)
driver.refresh()
driver.close()

Missing Elements from HTML File Using BeautifulSoup

I'm very new to the web-scraping world, and I'm trying to scrape the names of shoes from a website. When I use inspect on the website, there's a div tag that has basically the entire webpage inside it, but when I print out the html code, the div tag is completely empty! Here's my current code:
from bs4 import BeautifulSoup
import requests
import time
def findShoeNames():
    html_file = requests.get('https://www.goat.com/sneakers/brand/air-jordan').text
    soup = BeautifulSoup(html_file, 'lxml')
    print(soup)
if __name__ == "__main__":
    findShoeNames()
When I call my function and print(soup), the div tag looks like this:
<div id="root"></div>
But as previously mentioned, when I hit inspect on the website, this div tag has basically the entire webpage inside it. So I'm unable to scrape any data from the website.
Please help! Thanks
The website uses JavaScript to load its content, so you should use Selenium and chromedriver.
1. Install selenium.
2. Install chromedriver (unzip it and copy it into your Python folder).
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "https://www.goat.com/sneakers/brand/air-jordan"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(chrome_options=options)
driver.get(url)
time.sleep(1)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'lxml')
print(soup.prettify())
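Before reaching for Selenium, it can help to confirm that the page really is rendered client-side by checking whether the container is empty in the raw HTML. The following is a rough sketch using only the standard library. The `RootProbe` class, the `is_empty_shell` helper, and both sample strings are illustrative assumptions, not part of the original answer or of any library.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class RootProbe(HTMLParser):
    """Counts tags and text found inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False    # True once the target element was seen
        self.depth = 0        # > 0 while inside the target element
        self.children = 0     # tags seen inside the target
        self.text_chars = 0   # non-whitespace characters inside the target

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.children += 1
            if tag not in VOID_TAGS:
                self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text_chars += len(data.strip())


def is_empty_shell(html, target_id="root"):
    """True if the target element exists but holds no tags and no text."""
    probe = RootProbe(target_id)
    probe.feed(html)
    return probe.found and probe.children == 0 and probe.text_chars == 0


# Hypothetical responses: what requests might return vs. what a browser shows.
static_html = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
rendered_html = '<html><body><div id="root"><ul><li>Air Jordan 1</li></ul></div></body></html>'

print(is_empty_shell(static_html))    # the shell the server sends
print(is_empty_shell(rendered_html))  # the DOM after scripts have run
```

If the check reports an empty shell for the text returned by requests, the content is most likely filled in by scripts, which is the situation described in the question.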

BeautifulSoup sports scraper gives back empty list

I am trying to scrape the results of tennis matches from this website using Python's BeautifulSoup. I have tried a lot of things but I always get back an empty list. Is there an obvious mistake I am making? There are multiple instances of this class on the website when I inspect it, but it does not seem to find it.
import requests
from bs4 import BeautifulSoup
url = 'https://www.flashscore.com/tennis/atp-singles/french-open/results/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
match_container = soup.find_all('div', class_='event__match event__match--static event__match--last event__match--twoLine')
print(match_container)
The results table is loaded via JavaScript, so BeautifulSoup does not find it: the table does not exist yet in the HTML that requests returns. To solve this problem you'll need to use Selenium together with chromedriver.
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
wd = webdriver.Chrome('<PATH_TO_CHROMEDRIVER>',chrome_options=chrome_options)
# load page via selenium
wd.get("https://www.flashscore.com/tennis/atp-singles/french-open/results/")
# wait up to 5 seconds for the results table to load
table = WebDriverWait(wd, 5).until(EC.presence_of_element_located((By.ID, 'live-table')))
# parse content of the grid
soup = BeautifulSoup(table.get_attribute('innerHTML'), 'lxml')
# access grid cells, your logic should be here
for tag in soup.find_all('div', class_='event__match event__match--static event__match--last event__match--twoLine'):
    print(tag)
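The explicit wait above works by polling: the condition is checked repeatedly until it returns something truthy or the timeout expires. As a conceptual sketch only, here is the same idea in plain Python without a browser. The `wait_until` helper and the `table_present` stand-in are hypothetical names for illustration; this is not Selenium's implementation.

```python
import time


def wait_until(condition, timeout=5.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Stand-in for a page whose table appears a short while after loading.
ready_at = time.monotonic() + 0.2


def table_present():
    return "live-table" if time.monotonic() >= ready_at else None


found = wait_until(table_present, timeout=2.0, poll=0.05)
print(found)
```

Compared with a fixed `time.sleep`, polling returns as soon as the element shows up and fails loudly when it never does.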
The score data is pulled into the page dynamically, and you're only getting the initial HTML with requests.
As user70 suggested in the comments, the way to do this is to use a tool like Selenium first so you get all the dynamic content you see in your web browser's inspection tool.
There are a few guides online showing how this works; you could start with this one:
https://medium.com/ymedialabs-innovation/web-scraping-using-beautiful-soup-and-selenium-for-dynamic-page-2f8ad15efe25

HTML Scraping when there are no html tags

I'm trying to get the elevation data, and start and end pass times from this website.
So far I have looked at the source code and been unable to use Beautiful Soup to get what I want, as the source code doesn't have any tags around the information I am interested in. That information is contained in calls to a function named spStart and its corresponding arguments. I had a go at using Selenium to obtain the JavaScript-processed code, but I ended up getting the same as the source code on the page, and now I'm stuck.
Here is my attempt at using selenium:
import datetime
import time
from bs4 import BeautifulSoup
import re
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import selenium.webdriver.chrome.service as service
from lxml import html
try:
    # Launching Chrome in headless mode to access the "inspect element" code
    service = service.Service('/correct_path/chromedriver.exe')
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--disable-gpu")
    driver = webdriver.Chrome(chrome_options=chrome_options, executable_path=r'/correct_path/chromedriver.exe')
    driver.get("https://www.n2yo.com/passes/?s=39090&a=1")
    print("Chrome Browser Initialized in Headless Mode")
    soup = BeautifulSoup(driver.execute_script("return document.documentElement.innerHTML;"), "lxml")
    print(soup)
except KeyboardInterrupt:
    driver.quit()
    print("Driver Exited")
When I run this code it gives me the html that I see when using the "view source" option in chrome. I was under the impression that by using selenium to get the source this way, I would be seeing what is available when using the "inspect element" option on the same page in chrome.
Would someone mind explaining where I'm going wrong and suggesting a feasible approach to get the data I want, possibly with an explained example? I'd really appreciate it.
Thanks for your time.
No, they are not the same. Inspect Element inspects the DOM. The page source is effectively the original seed for the DOM, but the DOM can change dynamically, usually through JS code,
sometimes quite dramatically. You will also notice that Inspect Element shows shadow elements, which the page source does not.
To see how dramatic the difference can be, visit chrome://settings/, click Inspect element, then look at View page source and compare.
You should target the element after it has loaded and pass it as arguments[0], rather than taking the entire page via document:
html_of_interest=driver.execute_script('return arguments[0].innerHTML',element)
sel_soup=BeautifulSoup(html_of_interest, 'lxml')
This has 2 practical cases:
1. The element is not yet loaded in the DOM and you need to wait for it:
browser.get("url")
sleep(experimental)  # get() usually returns only after the page has loaded, but sometimes JS keeps running after the load event
try:
    element = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'your_id_of_interest')))
    print("element is ready, do the thing!")
    html_of_interest = browser.execute_script('return arguments[0].innerHTML', element)
    sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
except TimeoutException:
    print("Something's wrong!")
2. The element is in a shadow root and you need to expand the shadow root first. This is probably not your situation, but I mention it here for future reference. Example:
import selenium
from selenium import webdriver
driver = webdriver.Chrome()
from bs4 import BeautifulSoup
def expand_shadow_element(element):
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
    return shadow_root
driver.get("chrome://settings")
root1 = driver.find_element_by_tag_name('settings-ui')
html_of_interest = driver.execute_script('return arguments[0].innerHTML', root1)
sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
sel_soup  # empty: the shadow root is not expanded yet
shadow_root1 = expand_shadow_element(root1)
html_of_interest=driver.execute_script('return arguments[0].innerHTML',shadow_root1)
sel_soup=BeautifulSoup(html_of_interest, 'html.parser')
sel_soup
I don't know which data on that page you are interested in. However, if it is the tabular data you are after, the script below is worth trying:
from selenium.webdriver import Chrome
from contextlib import closing
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
URL = "https://www.n2yo.com/passes/?s=39090&a=1"
chrome_options = Options()
chrome_options.add_argument("--headless")
with closing(Chrome(chrome_options=chrome_options)) as driver:
    driver.get(URL)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for items in soup.select("#passestable tr"):
        data = [item.text for item in items.select("th,td")]
        print(data)
Partial output:
['Start ', 'Max altitude', 'End ', 'All passes']
['Date, Local time', 'Az', 'Local time', 'Az', 'El', 'Local time', 'Mag ', 'Info']
['20-Feb 19:17', 'N13°', '19:25', 'E76°', '81°', '19:32', 'S191°', '-', 'Map and details']
['21-Feb 06:24', 'SSE151°', '06:31', 'E79°', '43°', '06:38', 'N358°', '-', 'Map and details']
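Since the question says the values sit in the arguments of `spStart` calls inside the page source rather than in tags, a regular expression over the raw text is another option to try. This is a sketch under heavy assumptions: the `SAMPLE_SOURCE` text is invented for illustration, and the real page's argument count, order, and quoting are not known here.

```python
import re

# Invented sample; the real page's argument list will differ.
SAMPLE_SOURCE = """
<script>
spStart(1550690220, 13, 81, 1550691120);
spStart(1550730240, 151, 43, 1550731080);
</script>
"""

# Capture everything between the parentheses of each spStart(...) call.
CALL_RE = re.compile(r"spStart\(([^)]*)\)")


def extract_calls(source):
    """Return one list of stripped argument strings per spStart call."""
    return [
        [arg.strip() for arg in raw_args.split(",")]
        for raw_args in CALL_RE.findall(source)
    ]


passes = extract_calls(SAMPLE_SOURCE)
for args in passes:
    print(args)
```

The naive split on commas breaks if an argument is a quoted string that itself contains a comma or a closing parenthesis, so the pattern would need tightening against the actual source.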

Python Scraping JavaScript using Selenium and Beautiful Soup

I'm trying to scrape a JavaScript-enabled page using BS and Selenium.
I have the following code so far. It still doesn't seem to pick up the JavaScript-generated content (it returns an empty result). In this case I'm trying to scrape the Facebook comments at the bottom. (Inspect element shows the class as postText.)
Thanks for the help!
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import BeautifulSoup
browser = webdriver.Firefox()
browser.get('http://techcrunch.com/2012/05/15/facebook-lightbox/')
html_source = browser.page_source
browser.quit()
soup = BeautifulSoup.BeautifulSoup(html_source)
comments = soup("div", {"class":"postText"})
print comments
There are some mistakes in your code that are fixed below. However, the class "postText" must exist elsewhere, since it is not defined in the original source code.
My revised version of your code was tested and is working on multiple websites.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
browser.get('http://techcrunch.com/2012/05/15/facebook-lightbox/')
html_source = browser.page_source
browser.quit()
soup = BeautifulSoup(html_source,'html.parser')
#class "postText" is not defined in the source code
comments = soup.findAll('div',{'class':'postText'})
print(comments)
