How to Get the Webpage Title of Chrome? - python

Using python, is there a way to obtain the title of the current active tab in Chrome?
If it is impossible, getting the list of titles of all tabs also works for my purpose.
Thanks.

There are multiple ways of getting the title of the current tab using Python.
You could use BeautifulSoup:
# Python 3: urllib2 became urllib.request, and BeautifulSoup lives in bs4
from urllib.request import urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen("https://www.google.com"), "html.parser")
print(soup.title.string)
Or, if you're using Selenium:
from selenium import webdriver

# Point executable_path at your local chromedriver binary
driver = webdriver.Chrome(executable_path="C:\\chromedriver.exe")
driver.maximize_window()
driver.get(YOUR_URL)  # replace YOUR_URL with the page you want
print(driver.title)        # title of the current tab
print(driver.current_url)  # URL of the current tab
driver.refresh()
driver.close()
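Note that both snippets above load a page themselves rather than reading an already-open Chrome window. If you really need the titles of tabs that are already open, one sketch (assuming Chrome was launched with --remote-debugging-port=9222) is to query the DevTools HTTP endpoint:
import json
import urllib.request

# Assumes Chrome was started with: chrome --remote-debugging-port=9222
# The /json endpoint lists every open DevTools target (tabs, extensions, ...)
with urllib.request.urlopen("http://localhost:9222/json") as resp:
    targets = json.load(resp)

for target in targets:
    if target.get("type") == "page":  # keep only regular browser tabs
        print(target["title"])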

Related

Selenium browser is getting an enable cookies page, not the page I am sending it to

I am trying to scrape a JS website with Selenium. When Beautiful Soup reads what Selenium retrieved, I get an HTML page that says: "Cookies must be enabled in order to view this page."
If anyone could help me past this stumbling block I would appreciate it. Here is my code:
# import libraries and specify URL
import lxml as lxml
import pandas as pd
from bs4 import BeautifulSoup
import html5lib
from selenium import webdriver
import urllib.request
import csv
url = "https://racing.hkjc.com/racing/information/English/Racing/LocalResults.aspx?RaceDate=2020/06/09"
#new chrome session
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--incognito")
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(executable_path='/Users/susanwhite/PycharmProjects/Horse Racing/chromedriver', chrome_options=chrome_options)
# Wait for the page to fully load
driver.implicitly_wait(time_to_wait=10)
# Load the web page
driver.get(url)
cookies = driver.get_cookies()
# Parse HTML code and grab tables with Beautiful Soup
soup = BeautifulSoup(driver.page_source, 'html5lib')
print(soup)
Try removing this line: chrome_options.add_argument("--incognito"). There's no need for it, as Selenium starts each session with a fresh profile and doesn't save cookies or any other information from websites.
Removing the line below solved it for me, though headless mode is then disabled and the browser window becomes visible.
chrome_options.add_argument("--headless")
Your issue might also be with the specific website you're accessing. I had the same problem, and after poking around it looks like something in the way the HKJC website loads makes Selenium think the page has finished loading prematurely. I was able to get good page_source objects by putting a time.sleep(30) after the get statement, so my code looks like:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import time
options = Options()
options.headless = True
driver = webdriver.Firefox(options=options, executable_path=r'C:\Python\webdrivers\geckodriver.exe')
driver.get("https://racing.hkjc.com/racing/information/English/Racing/LocalResults.aspx?RaceDate=2023/01/01&RaceNo=1")
time.sleep(30)
html = driver.page_source
with open('Date_2023-01-01_Race1.html', 'wb') as f:
    f.write(html.encode('utf-8'))
You might not have to sleep that long; manually loading the pages takes 20+ seconds for me because I have slow internet over VPNs. It also works headless for me, as above.
You do have to make sure the Firefox geckodriver is the latest (at least according to other posts; I only tried this over ~2 days, so not long enough for my installed Firefox and geckodriver to get out of sync).
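If you'd rather not hard-code a 30-second sleep, an explicit wait on an element you expect in the loaded results usually finishes as soon as the page is actually ready. A sketch (the div.performance selector is only an assumption about the HKJC markup, not something I verified):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://racing.hkjc.com/racing/information/English/Racing/"
           "LocalResults.aspx?RaceDate=2023/01/01&RaceNo=1")

# Poll until the element shows up, or raise TimeoutException after 30 s.
# 'div.performance' is a guess at the results container, not a verified selector.
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.performance"))
)
html = driver.page_source
driver.quit()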

Headless chrome and html parser string

I'm currently using selenium and BeautifulSoup to scrape a website, but I'm running into two major issues. First, I can't get Chrome to launch in headless mode: it reports multiple unexpected ends of input (photo of said errors). Second, I keep getting an error on the line that contains "html.parser" saying that a 'str' object is not callable. Any advice on these issues would be greatly appreciated, thank you.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import urllib.request
import lxml
import html5lib
import time
from bs4 import BeautifulSoup
#config options
options = Options()
options.headless = True
# Set the URL you want to webscrape from
url = 'https://tokcount.com/?user=mrsam993'
# Connect to the URL
browser = webdriver.Chrome(options=options, executable_path='D:\chromedriver') #chrome_options=options
browser.get(url)
# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(browser.page_source(), "html.parser")
browser.quit()
# for i in range(10):
links = soup.findAll('span', class_= 'odometer-value')
print(links)
As for headless mode, you need to set it up this way:
from selenium import webdriver
options = webdriver.ChromeOptions()
...
page_source is not a method, so you need to remove the parentheses:
browser.page_source
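Putting both fixes together, the script might look like this (a sketch assembled from the question's own code; the driver path is the asker's):
from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.headless = True  # run Chrome without opening a window

browser = webdriver.Chrome(options=options, executable_path=r'D:\chromedriver')
browser.get('https://tokcount.com/?user=mrsam993')

# page_source is a property, not a method: no parentheses
soup = BeautifulSoup(browser.page_source, 'html.parser')
browser.quit()

links = soup.find_all('span', class_='odometer-value')
print(links)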

scraping rotten tomatoes 12 years a slave movie

I'm trying to scrape the number of audience ratings from this page
https://www.rottentomatoes.com/m/12_years_a_slave (which is 100,000+) using python selenium.
I tried all kinds of Selenium locators, but every time I get a NoSuchElementException error.
Here is my code :
import selenium
from selenium import webdriver
driver = webdriver.Chrome('path.exe')
url = 'https://www.rottentomatoes.com/m/12_years_a_slave'
driver.get(url)
def scrape_dom(element):
    shadow_root = driver.execute_script(
        'return arguments[0].shadowRoot', element)
    return shadow_root

host = driver.find_element_by_tag_name('score-board')
root_1 = scrape_dom(host)
views = root_1.find_element_by_link_text(
    '/m/12_years_a_slave/reviews?type=user&intcmp=rt-'
    'scorecard_audience-score-reviews')
I also tried XPath and CSS selectors, but I always get an error. Can you tell me what's wrong with my code?
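One bug worth pointing out before the answers below: find_element_by_link_text matches a link's visible text, not its href, so passing a URL path to it can never match. Inside the shadow root you would need an attribute locator instead, something like this sketch (the href fragment is an assumption about the page markup):
# Link-text locators match visible text, so locate the <a> by a CSS
# attribute selector on its href instead (shadow roots accept CSS, not XPath).
views = root_1.find_element_by_css_selector('a[href*="audience-score"]')
print(views.text)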
A simple CSS selector works here.
from selenium import webdriver
driver = webdriver.Chrome()
url = 'https://www.rottentomatoes.com/m/12_years_a_slave'
driver.get(url)
print(driver.find_element_by_css_selector('a[slot=audience-count]').text)
I get 100,000+ Ratings printed out to my console.
See if this XPath works:
driver.find_element_by_xpath(".//a[@data-qa='audience-rating-count']").text
You don't need Selenium. You can use requests and bs4. Also, you can use a faster CSS class selector rather than the slower attribute selectors given in the other answers so far.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.rottentomatoes.com/m/12_years_a_slave')
soup = bs(r.content, 'lxml')
print(soup.select_one('.scoreboard__link--audience').text)

Selenium not getting HTML of PDF links

I'm trying to download the PDF slides off this website using Python and selenium, but I think the links to the slides only appear after a script loads. I tried waiting for the JavaScript to load, but it's still not finding anything. Any ideas?
import os, sys, time, random
import requests
from selenium import webdriver
from bs4 import BeautifulSoup
url = 'https://mila.umontreal.ca/en/cours/deep-learning-summer-school-2017/slides'
browser = webdriver.Chrome()
browser.get(url)
browser.implicitly_wait(3)
html = browser.page_source
links = browser.find_elements_by_class_name('flip-entry')
print(links)
browser.quit()
The reason is that there are no links on the main page; you are getting links inside an iframe. The iframe points to https://drive.google.com/embeddedfolderview?hl=fr&id=0ByUKRdiCDK7-c0k1TWlLM1U1RXc#list
You can either browse that URL directly in your code instead of the main page, or you can switch to the frame:
browser.switch_to_frame(browser.find_element_by_class_name("iframe-class"))
links = browser.find_elements_by_css_selector('.flip-entry a')
for link in links:
    print(link.get_attribute("href"))
from selenium import webdriver

url = 'https://mila.umontreal.ca/en/cours/deep-learning-summer-school-2017/slides'
browser = webdriver.Chrome()
browser.get(url)
browser.switch_to_frame(browser.find_element_by_class_name('iframe-class'))
# '.flip-entry a' is a CSS selector, so use the CSS selector locator here
links = browser.find_elements_by_css_selector('.flip-entry a')
for link in links:
    print(link.get_attribute("href"))
browser.quit()
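As an aside, switch_to_frame and the find_element_by_* helpers have since been removed from Selenium; on Selenium 4 the equivalent call would be:
from selenium.webdriver.common.by import By

browser.switch_to.frame(browser.find_element(By.CLASS_NAME, 'iframe-class'))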

Splinter or Selenium: Can we get current html page after clicking a button?

I'm trying to crawl the website "http://everydayhealth.com". However, I found that the page is dynamically rendered: when I click the "More" button, new news items are shown, but using splinter to click the button doesn't make "browser.html" update to the current HTML content. Is there a way to get the newest HTML source, using either splinter or selenium? My splinter code is as follows:
import requests
from bs4 import BeautifulSoup
from splinter import Browser
browser = Browser()
browser.visit('http://everydayhealth.com')
browser.click_link_by_text("More")
print(browser.html)
Based on @Louis's answer, I rewrote the program as follows:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Firefox()
driver.get("http://www.everydayhealth.com")
more_xpath = '//a[@class="btn-more"]'
more_btn = WebDriverWait(driver, 10).until(lambda driver: driver.find_element_by_xpath(more_xpath))
more_btn.click()
more_news_xpath = '(//a[@href="http://www.everydayhealth.com/recipe-rehab/5-herbs-and-spices-to-intensify-flavor.aspx"])[2]'
WebDriverWait(driver, 5).until(lambda driver: driver.find_element_by_xpath(more_news_xpath))
print(driver.execute_script("return document.documentElement.outerHTML;"))
driver.quit()
However, in the output text, I still couldn't find the text from the updated page. For example, when I search for "Is Milk Your Friend or Foe?", it still returns nothing. What's the problem?
With Selenium, assuming that driver is your initialized WebDriver object, this will give you the HTML that corresponds to the state of the DOM at the time you make the call:
driver.execute_script("return document.documentElement.outerHTML;")
The return value is a string so you could do:
print(driver.execute_script("return document.documentElement.outerHTML;"))
When I use Selenium for tasks like this, I know browser.page_source does get updated.
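If the new stories still don't show up in the source, the click may simply not have finished loading them when you grab the HTML. One sketch (the btn-more XPath is taken from the question) is to wait until the number of links on the page grows before reading page_source:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get("http://www.everydayhealth.com")

before = len(driver.find_elements_by_css_selector("a"))
driver.find_element_by_xpath('//a[@class="btn-more"]').click()

# Wait up to 10 s for more links to exist than before the click.
WebDriverWait(driver, 10).until(
    lambda d: len(d.find_elements_by_css_selector("a")) > before
)
print(driver.page_source)  # reflects the updated DOM
driver.quit()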
