driver.page_source is not defined - python

I'm trying to scrape a client-side-rendered web page using Selenium.
I started by creating a virtual environment and installing the required dependencies. Then I downloaded the Chrome Driver for my Chrome version and pasted it in the project's folder.
import os
import time
from bs4 import BeautifulSoup
from selenium import webdriver
driver_path = os.path.abspath('') + '/chromedriver'
driver = webdriver.Chrome(executable_path = driver_path)
print(' > Getting web page...')
url = 'https://www.someurl.com'
driver.get(url)
print(' > Sleeping...')
time.sleep(10)
print(' > Done. Html below:')
page_html = driver.page_source
print(page_source)
The browser opens and the page loads. But after the program wakes up I get NameError: name 'page_source' is not defined. Any clues about what I might be doing wrong?
One thing that got me concerned is that I'm using 64-bit Windows, but the only driver available on Chrome's webpage was 32-bit. Anyways, it seems that this isn't a problem since the browser and the page are rendered correctly by the script.

There's a typo in your print call. Use
print(page_html)
instead of
print(page_source)
You assigned driver.page_source to page_html, so a standalone name page_source is never defined in your code.

Related

Selenium browser is getting an enable cookies page, not the page I am sending it to

I am trying to scrape a js website with selenium. When beautiful soup reads what selenium retrieved I get an html page that says: "Cookies must be enabled in order to view this page."
If anyone could help me past this stumbling block I would appreciate it. Here is my code:
# import libraries and specify URL
import lxml as lxml
import pandas as pd
from bs4 import BeautifulSoup
import html5lib
from selenium import webdriver
import urllib.request
import csv
url = "https://racing.hkjc.com/racing/information/English/Racing/LocalResults.aspx?RaceDate=2020/06/09"
#new chrome session
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--incognito")
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(executable_path= '/Users/susanwhite/PycharmProjects/Horse Racing/chromedriver', chrome_options=chrome_options)
# Wait for the page to fully load
driver.implicitly_wait(time_to_wait=10)
# Load the web page
driver.get(url)
cookies = driver.get_cookies()
# Parse HTML code and grab tables with Beautiful Soup
soup = BeautifulSoup(driver.page_source, 'html5lib')
print(soup)
Try removing this line: chrome_options.add_argument("--incognito"). It isn't needed: by default Selenium starts Chrome with a fresh temporary profile, so cookies and other site data don't carry over between sessions.
Removing the line below solved it for me, although that disables headless mode, so the browser window becomes visible.
chrome_options.add_argument("--headless")
Your issue might also be specific to the website you're accessing. I had the same problem, and after poking around it looks like something in the way the HKJC website loads makes Selenium treat the page as finished loading prematurely. I was able to get good page_source results by adding time.sleep(30) after the get call, so my code looks like this:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import time
options = Options()
options.headless = True
driver = webdriver.Firefox(options=options, executable_path=r'C:\Python\webdrivers\geckodriver.exe')
driver.get("https://racing.hkjc.com/racing/information/English/Racing/LocalResults.aspx?RaceDate=2023/01/01&RaceNo=1")
time.sleep(30)
html = driver.page_source
with open('Date_2023-01-01_Race1.html', 'wb') as f:
    f.write(html.encode('utf-8'))  # the with block closes the file on exit
You might not have to sleep that long. I found manually loading the pages takes 20+ seconds for me because I have slow internet over VPNs. It also works headless for me as above.
You do have to make sure the Firefox geckodriver is up to date (at least according to other posts; I only tried this over about 2 days, not long enough for my installed Firefox and geckodriver to get out of sync).
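As an alternative to a fixed sleep, one option is to poll page_source until some text you expect on the finished page shows up. This is only a minimal sketch, not the method used in the answer above: wait_for_source and the marker string are hypothetical names, and the function assumes nothing about the driver beyond a page_source attribute.

```python
import time


def wait_for_source(driver, marker, timeout=30.0, poll=0.5):
    """Poll driver.page_source until `marker` appears or `timeout` seconds pass.

    Returns the page source on success; raises TimeoutError otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source
        if marker in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} not found within {timeout} seconds")
        time.sleep(poll)
```

You would call it right after driver.get(...), passing a piece of text that only appears once the real content has rendered. Selenium's own WebDriverWait class covers the same idea with more options, so this helper is mainly useful for seeing what such a wait does.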

How do I wait until a webpage is loaded before opening another tab

I've made this little Python script to automate opening the websites I need in the morning. Take a look:
# Required modules
import webbrowser
# Open the websites
webbrowser.get('firefox').open_new_tab('https://www.netflix.com')
webbrowser.get('firefox').open_new_tab('https://www.facebook.com')
webbrowser.get('firefox').open_new_tab('https://www.udemy.com')
And I don't know how to wait until the webpage is loaded before opening the next one (in another tab), any help?
You could take the approach described in "How to wait for the page to fully load using webbrowser method?" and check for a certain element in the page manually.
Another option is to call time.sleep(5) after opening each tab, which pauses for 5 seconds before the next line of code runs:
import webbrowser
from time import sleep
links = ['https://www.netflix.com', 'https://www.facebook.com', 'https://www.udemy.com']
for link in links:
    webbrowser.get('firefox').open_new_tab(link)
    sleep(5)
Selenium Implementation:
Note: This implementation opens your URLs in multiple windows rather than in a single window with multiple tabs.
I will be using the chrome driver which you can install at https://chromedriver.chromium.org/downloads
import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_experimental_option("detach", True) #this is just to keep the windows open even after the script is done running.
urls = ['https://www.netflix.com', 'https://www.facebook.com', 'https://www.udemy.com']
def open_url(url):
    driver = webdriver.Chrome(executable_path=os.path.abspath('chromedriver'), chrome_options=chrome_options)
    # I've assumed the chromedriver is installed in the same directory as the script. If not, mention the path to the chromedriver executable here.
    driver.get(url)
for url in urls:
    open_url(url)
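If you would rather have one window with several tabs, Selenium 4 offers driver.switch_to.new_window("tab"). The sketch below is an assumption-laden variant, not part of the answer above: open_in_tabs is a hypothetical helper, and it expects an already created Selenium 4 driver to be passed in.

```python
def open_in_tabs(driver, urls):
    """Open each URL in its own tab of a single browser window.

    With the default page load strategy, driver.get() blocks until the page
    has finished loading, so each new tab is opened only after the previous
    page has loaded.
    """
    handles = []
    for index, url in enumerate(urls):
        if index > 0:
            # Selenium 4: open a new tab and switch focus to it.
            driver.switch_to.new_window("tab")
        driver.get(url)
        handles.append(driver.current_window_handle)
    return handles
```

Called as open_in_tabs(driver, urls), it returns the window handle of each tab, which you can later pass to driver.switch_to.window(...) to go back to a specific tab.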

How to reload a html page using python

I want to reload an HTML page which I created locally on my computer using Python. I have tried using this:
from selenium import webdriver
import time
driver = webdriver.Firefox()
driver.get('C://User/Desktop/total.html')
while True:
    time.sleep(20)
    driver.refresh()
driver.quit()
but it was throwing FileNotFoundError. Any idea on how to do it? Thanks.
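The traceback isn't shown, so the cause here is uncertain; the error may also come from Selenium not finding the geckodriver executable. Separately, the string passed to driver.get is not a file:// URL, which is how a browser addresses local files. A minimal sketch of building one with pathlib follows; local_file_url is a hypothetical helper name, and it checks that the file exists before handing anything to the browser.

```python
from pathlib import Path


def local_file_url(path):
    """Return a file:// URL for a local file, failing early if it is missing."""
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_file():
        raise FileNotFoundError(f"No such file: {resolved}")
    return resolved.as_uri()
```

The result can then be passed to the driver, for example driver.get(local_file_url("total.html")) when the script runs from the folder that contains the page.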

Python selenium webdriver gets nothing but the browser normally shows the webpage

Source Code:
from selenium import webdriver
browser = webdriver.Safari()
html_doc = browser.get("http://www.google.com")
#html_doc is empty but the Safari window shows the page normally
#Allow Remote Automation is enabled
This is my first time using Selenium. At first it worked normally and html_doc received the content, but the problem appeared several hours later, and restarting neither Python nor the computer helped. Thanks for any suggestions!
browser.get doesn't return anything (it returns None), which is why html_doc is empty. If you want the page source, read the page_source attribute after calling get:
browser.get("http://www.google.com")
html_doc = browser.page_source

Starting a browser in selenium

I want to run tests with selenium. IE gives me a modal error after bringing up IE 8 with this text "This is the initial start page for the WebDriver server" :
from selenium import webdriver
import time
browser = webdriver.Ie() # Get local session of IE
browser.get("http://www.google.com") # Load page
time.sleep(5)
browser.close()
So I tried Chrome.
from selenium import webdriver
browser = webdriver.Chrome()
browser.get("http://www.google.com")
time.sleep(5)
browser.close()
and Selenium errors for not having the right path to the chrome.exe application. Chrome is installed as expected... C:\Users\%USERNAME%\AppData\Local\Google\Chrome\Application\chrome.exe
A little help here would be greatly appreciated.
Have you downloaded ChromeDriver?
To get set up, first download the appropriate prebuilt server. Make sure the server can be located on your PATH or specify its location via the webdriver.chrome.driver system property.
Then, when you run
import time
from selenium import webdriver
browser = webdriver.Chrome()
browser.get("http://www.google.com")
time.sleep(5)
browser.close()
It should work.
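To check the "located on your PATH" part before starting the browser, a small standard-library sketch like the one below can help. find_driver is a hypothetical helper name, and chromedriver is assumed to be the executable's name on your system.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path is None:
        print("chromedriver was not found on PATH")
    else:
        print(f"chromedriver found at {path}")
```

If it reports that the driver is missing, either add the driver's folder to PATH or pass the driver location to Selenium explicitly.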
