Headless chrome and html parser string

Headless chrome and html parser string - python

I'm currently using selenium and BeautifulSoup to scrape a website but I'm running into two major issues, first of all, I can't get Chrome to launch in headless mode and it says there are multiple unexpected ends of inputs (photo of said errors). The other problem I have is that I keep getting an error on the line that contains "html.parser" saying that a 'str' is not a callable object. Any advice on these issues would be greatly appreciated thank you.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import urllib.request
import lxml
import html5lib
import time
from bs4 import BeautifulSoup
#config options
options = Options()
options.headless = True
# Set the URL you want to webscrape from
url = 'https://tokcount.com/?user=mrsam993'
# Connect to the URL
browser = webdriver.Chrome(options=options, executable_path='D:\chromedriver') #chrome_options=options
browser.get(url)
# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(browser.page_source(), "html.parser")
browser.quit()
# for i in range(10):
links = soup.findAll('span', class_= 'odometer-value')
print(links)

As for the headless you need to call this way:
from selenium import webdriver
options = webdriver.ChromeOptions()
...
the page_source is not a method. So you need to remove the brackets:
browser.page_source

Related

Selenium browser is getting an enable cookies page, not the page I am sending it to

I am trying to scrape a js website with selenium. When beautiful soup reads what selenium retrieved I get an html page that says: "Cookies must be enabled in order to view this page."
If anyone could help me past this stumbling block I would appreciate it. Here is my code:
# import libraries and specify URL
import lxml as lxml
import pandas as pd
from bs4 import BeautifulSoup
import html5lib
from selenium import webdriver
import urllib.request
import csv
url = "https://racing.hkjc.com/racing/information/English/Racing/LocalResults.aspx?RaceDate=2020/06/09"
#new chrome session
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--incognito")
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(executable_path= '/Users/susanwhite/PycharmProjects/Horse
Racing/chromedriver', chrome_options=chrome_options)
# Wait for the page to fully load
driver.implicitly_wait(time_to_wait=10)
# Load the web page
driver.get(url)
cookies = driver.get_cookies()
# Parse HTML code and grab tables with Beautiful Soup
soup = BeautifulSoup(driver.page_source, 'html5lib')
print(soup)

Try removing this line: chrome_options.add_argument("--incognito"). There's no need for it, as Selenium naturally doesn't save cookies or any other information from websites.

Removing below code solved it for me, but headless mode will be disabled and the browser window will be visible.
chrome_options.add_argument("--headless")

Your issues might also be with the specific website you're accessing. I had the same issue, and after poking around with it, it looks like something in the way the HKJC website loads, selenium thinks the page is finished loading prematurely. I was able to get good page_source objects out of fetching the page by putting in a time.sleep(30) after the get statement, so my code looks like:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import time
options = Options()
options.headless = True
driver = webdriver.Firefox(options=options, executable_path=r'C:\Python\webdrivers\geckodriver.exe')
driver.get("https://racing.hkjc.com/racing/information/English/Racing/LocalResults.aspx?RaceDate=2023/01/01&RaceNo=1")
time.sleep(30)
html = driver.page_source
with open('Date_2023-01-01_Race1.html', 'wb') as f:
f.write(html.encode('utf-8'))
f.close()
You might not have to sleep that long. I found manually loading the pages takes 20+ seconds for me because I have slow internet over VPNs. It also works headless for me as above.
You do have to make sure the Firefox geckodriver is the latest (at least according to other posts, as I only tried this over ~2 days, so not long enough for my installed Firefox and geckodriver to get out of sync)

Missing Elements from HTML File Using BeautifulSoup

I'm very new to the web-scraping world, and I'm trying to scrape the names of shoes from a website. When I use inspect on the website, there's a div tag that has basically the entire webpage inside it, but when I print out the html code, the div tag is completely empty! Here's my current code:
from bs4 import BeautifulSoup
import requests
import time
def findShoeNames():
html_file = requests.get('https://www.goat.com/sneakers/brand/air-jordan').text
soup = BeautifulSoup(html_file, 'lxml')
print(soup)
if __name__ == "__main__":
findShoeNames()
When I call my function and print(soup), the div tag looks like this:
<div id="root"></div>
But as previously mentioned, when I hit inspect on the website, this div tag has basically the entire webpage inside it. So I'm unable to scrape any data from the website.
Please help! Thanks

website use js to load. so you should use selenium and chromedriver.
install selenium
install chromedriver from here (unzip and copy your python folder)
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "https://www.goat.com/sneakers/brand/air-jordan"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(chrome_options=options)
driver.get(url)
time.sleep(1)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'lxml')
print(soup.prettify)

driver.page_source isn't taking entire html codes

I am trying to make 'Google Patent Crawler' by using python.
I used modules like Requests, BS4, and Selenium, But I am totally stuck on one thing.
The problem is my code is not parsing entire html sources. something is missing after parsing.
I found the error from 'driver.page_source'
It is not parsing all html.
So I wanna ask about another good way to solve it.
Thank you.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_driver_path = 'C:/Users/Kay/Documents/Python/Driver/chromedriver.exe'
driver = webdriver.Chrome(options=chrome_options, executable_path=chrome_driver_path)
URL = 'https://patents.google.com/?q=engine'
driver.get(URL)
html = driver.page_source
gp_soup = BeautifulSoup(html, 'html5lib')

web scraping w/age verification

Hello I want to web scrape data from a site with an age verification pop-up using python 3.x and beautifulsoup. I can't get to the underlying text and images without clicking "yes" for "are you over 21". Thanks for any support.
EDIT: Thanks, with some help from a comment I see that I can use the cookies but am not sure how to manage/store/call cookies with the requests package.
So with some help from another user I am using selenium package so that it will work also in case it's a graphical overlay (I think?). Having trouble getting it to work with the gecko driver but will keep trying! Thanks for all the advice again, everyone.
EDIT 3: OK I have made progress and I can get the browser window to open, using the gecko driver!~ Unfortunately it doesn't like that link specification so I'm posting again. The link to click "yes" on the age verification is buried on that page as something called mlink...
EDIT 4: Made some progress, updated code is below. I managed to find the element in the XML code, now I just need to manage to click the link.
#
import time
import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
driver = webdriver.Firefox(executable_path=r'/Users/jeff/Documents/geckodriver') # Optional argument, if not specified will search path.
driver.get('https://www.shopharborside.com/oakland/#/shop/412');
url = 'https://www.shopharborside.com/oakland/#/shop/412'
driver.get(url)
#
driver.find_element_by_class_name('hhc_modal-body').click(Yes)
#wait.1.second
time.sleep(1)
pagesource = driver.page_source
soup = BeautifulSoup(pagesource)
#you.can.now.enjoy.soup
print(soup.prettify())
Edit new: Stuck again, here is the current code. I seem to have isolated the element "mBtnYes" but I get an error when running the code :
ElementClickInterceptedException: Message: Element is not clickable at point (625,278.5500030517578) because another element obscures it
import time
import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
driver = webdriver.Firefox(executable_path=r'/Users/jeff/Documents/geckodriver') # Optional argument, if not specified will search path.
driver.get('https://www.shopharborside.com/oakland/#/shop/412');
url = 'https://www.shopharborside.com/oakland/#/shop/412'
driver.get(url)
#
driver.find_element_by_id('myBtnYes').click()
#wait.1.second
time.sleep(1)
pagesource = driver.page_source
soup = BeautifulSoup(pagesource)
#you.can.now.enjoy.soup
print(soup.prettify())

if your aim is to click the verification get to selenium:
ps install selenium && get geckodriver(firefox) or chromedriver(chrome)
#Mossein~King(hi i'm here to help)
import time
import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.firefox.options import Options
from BeautifulSoup import BeautifulSoup
#this.is.for.headless.This.will.save.you.a.bunch.of.research.time(Trust.me)
options = Options()
options.add_argument("--headless")
driver = webdriver.Firefox(firefox_options=options)
#for.graphical(you.need.gecko.driver.for.firefox)
# driver = webdriver.Firefox()
url = 'your-url'
driver.get(url)
#get.the.link.to.clicking
#exaple if<a class='MosseinKing'>
driver.find_element_by_xpath("//a[#class='MosseinKing']").click()
#wait.1.secong.in.case.of.transitions
time.sleep(1)
pagesource = driver.page_source
soup = BeautifulSoup(pagesource)
#you.can.now.enjoy.soup
print soup.prettify()

Python Scraping JavaScript using Selenium and Beautiful Soup

I'm trying to scrape a JavaScript enables page using BS and Selenium.
I have the following code so far. It still doesn't somehow detect the JavaScript (and returns a null value). In this case I'm trying to scrape the Facebook comments in the bottom. (Inspect element shows the class as postText)
Thanks for the help!
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import BeautifulSoup
browser = webdriver.Firefox()
browser.get('http://techcrunch.com/2012/05/15/facebook-lightbox/')
html_source = browser.page_source
browser.quit()
soup = BeautifulSoup.BeautifulSoup(html_source)
comments = soup("div", {"class":"postText"})
print comments

There are some mistakes in your code that are fixed below. However, the class "postText" must exist elsewhere, since it is not defined in the original source code.
My revised version of your code was tested and is working on multiple websites.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
browser.get('http://techcrunch.com/2012/05/15/facebook-lightbox/')
html_source = browser.page_source
browser.quit()
soup = BeautifulSoup(html_source,'html.parser')
#class "postText" is not defined in the source code
comments = soup.findAll('div',{'class':'postText'})
print comments

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Headless chrome and html parser string - python

As for the headless you need to call this way: from selenium import webdriver options = webdriver.ChromeOptions() ... the page_source is not a method. So you need to remove the brackets: browser.page_source

Related

Selenium browser is getting an enable cookies page, not the page I am sending it to

Missing Elements from HTML File Using BeautifulSoup

driver.page_source isn't taking entire html codes

web scraping w/age verification

Python Scraping JavaScript using Selenium and Beautiful Soup

Categories

Resources