driver.page_source isn't returning the entire HTML source - Python

I am trying to build a Google Patent crawler in Python.
I used modules like Requests, BS4, and Selenium, but I am totally stuck on one thing.
The problem is that my code is not getting the entire HTML source; something is missing after parsing.
I traced the problem to driver.page_source, which does not contain all of the HTML.
So I want to ask about a better way to solve this.
Thank you.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_driver_path = 'C:/Users/Kay/Documents/Python/Driver/chromedriver.exe'
driver = webdriver.Chrome(options=chrome_options, executable_path=chrome_driver_path)
URL = 'https://patents.google.com/?q=engine'
driver.get(URL)
html = driver.page_source
gp_soup = BeautifulSoup(html, 'html5lib')
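A note on the likely cause: Google Patents renders its results with JavaScript after the initial load, so page_source taken immediately after get() can miss them. Below is a minimal sketch of one common fix, waiting for a result element before reading the source; the 'search-result-item' tag name is an assumption about the page's markup and should be verified in the browser inspector:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

chrome_options = Options()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)
driver.get('https://patents.google.com/?q=engine')

# Wait up to 15 s for at least one result element to exist before
# grabbing the source. 'search-result-item' is a guess at the custom
# element Google Patents uses for each hit -- check it in DevTools.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.TAG_NAME, 'search-result-item'))
)
gp_soup = BeautifulSoup(driver.page_source, 'html5lib')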

Related

Selenium browser is getting an "enable cookies" page, not the page I am sending it to

I am trying to scrape a JS website with Selenium. When Beautiful Soup reads what Selenium retrieved, I get an HTML page that says: "Cookies must be enabled in order to view this page."
If anyone could help me past this stumbling block I would appreciate it. Here is my code:
# import libraries and specify URL
import lxml as lxml
import pandas as pd
from bs4 import BeautifulSoup
import html5lib
from selenium import webdriver
import urllib.request
import csv
url = "https://racing.hkjc.com/racing/information/English/Racing/LocalResults.aspx?RaceDate=2020/06/09"
#new chrome session
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--incognito")
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(executable_path='/Users/susanwhite/PycharmProjects/Horse Racing/chromedriver', chrome_options=chrome_options)
# Wait for the page to fully load
driver.implicitly_wait(time_to_wait=10)
# Load the web page
driver.get(url)
cookies = driver.get_cookies()
# Parse HTML code and grab tables with Beautiful Soup
soup = BeautifulSoup(driver.page_source, 'html5lib')
print(soup)
Try removing this line: chrome_options.add_argument("--incognito"). There's no need for it, as Selenium naturally doesn't save cookies or any other information from websites.
Removing the line below solved it for me, but headless mode will be disabled and the browser window will be visible.
chrome_options.add_argument("--headless")
Your issue might also be with the specific website you're accessing. I had the same problem, and after poking around it looks like, because of the way the HKJC website loads, Selenium thinks the page has finished loading prematurely. I was able to get a good page_source by putting a time.sleep(30) after the get statement, so my code looks like:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import time
options = Options()
options.headless = True
driver = webdriver.Firefox(options=options, executable_path=r'C:\Python\webdrivers\geckodriver.exe')
driver.get("https://racing.hkjc.com/racing/information/English/Racing/LocalResults.aspx?RaceDate=2023/01/01&RaceNo=1")
time.sleep(30)
html = driver.page_source
with open('Date_2023-01-01_Race1.html', 'wb') as f:
    f.write(html.encode('utf-8'))
You might not have to sleep that long; I found that manually loading the pages takes 20+ seconds for me, because I have slow internet over VPNs. It also works headless for me, as above.
You do have to make sure the Firefox geckodriver is the latest (at least according to other posts; I only tried this over ~2 days, so not long enough for my installed Firefox and geckodriver to get out of sync).
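If the fixed 30-second sleep feels wasteful, an explicit wait can return as soon as the content you need actually exists. A minimal sketch of that approach; the By.CLASS_NAME value 'performance' is only a placeholder guess, so inspect the page and substitute whatever element actually wraps the race results:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.headless = True
driver = webdriver.Firefox(options=options, executable_path=r'C:\Python\webdrivers\geckodriver.exe')
driver.get("https://racing.hkjc.com/racing/information/English/Racing/LocalResults.aspx?RaceDate=2023/01/01&RaceNo=1")

# Wait up to 30 s, but return as soon as the results markup exists.
# 'performance' is a guessed class name -- verify it in the inspector.
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'performance'))
)
html = driver.page_source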

Missing Elements from HTML File Using BeautifulSoup

I'm very new to the web-scraping world, and I'm trying to scrape the names of shoes from a website. When I use inspect on the website, there's a div tag that has basically the entire webpage inside it, but when I print out the html code, the div tag is completely empty! Here's my current code:
from bs4 import BeautifulSoup
import requests
import time
def findShoeNames():
    html_file = requests.get('https://www.goat.com/sneakers/brand/air-jordan').text
    soup = BeautifulSoup(html_file, 'lxml')
    print(soup)

if __name__ == "__main__":
    findShoeNames()
When I call my function and print(soup), the div tag looks like this:
<div id="root"></div>
But as previously mentioned, when I hit inspect on the website, this div tag has basically the entire webpage inside it. So I'm unable to scrape any data from the website.
Please help! Thanks
The website uses JavaScript to load its content, so you should use Selenium and ChromeDriver.
Install Selenium (pip install selenium).
Install ChromeDriver from here (unzip it and copy it into your Python folder).
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "https://www.goat.com/sneakers/brand/air-jordan"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(1)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'lxml')
print(soup.prettify())
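A hypothetical follow-up for pulling the shoe names out of that soup; the 'grid-cell-name' class below is only a guess at GOAT's rendered markup, not confirmed against the live site, so check the inspector and adjust:
# The class name here is an assumption -- inspect the rendered page.
names = [tag.get_text(strip=True)
         for tag in soup.find_all('div', class_='grid-cell-name')]
print(names)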

Headless chrome and html parser string

I'm currently using Selenium and BeautifulSoup to scrape a website, but I'm running into two major issues. First, I can't get Chrome to launch in headless mode, and it says there are multiple unexpected ends of input (photo of said errors). The other problem is that I keep getting an error on the line that contains "html.parser", saying that a 'str' object is not callable. Any advice on these issues would be greatly appreciated, thank you.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import urllib.request
import lxml
import html5lib
import time
from bs4 import BeautifulSoup
#config options
options = Options()
options.headless = True
# Set the URL you want to webscrape from
url = 'https://tokcount.com/?user=mrsam993'
# Connect to the URL
browser = webdriver.Chrome(options=options, executable_path='D:\chromedriver') #chrome_options=options
browser.get(url)
# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(browser.page_source(), "html.parser")
browser.quit()
# for i in range(10):
links = soup.findAll('span', class_= 'odometer-value')
print(links)
As for headless mode, you need to set up the options this way:
from selenium import webdriver
options = webdriver.ChromeOptions()
...
And page_source is a property, not a method, so you need to remove the brackets:
browser.page_source
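Putting both fixes together, a sketch of the corrected scraping portion (paths and selectors carried over from the question) might look like this:
from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.headless = True
browser = webdriver.Chrome(options=options, executable_path=r'D:\chromedriver')
browser.get('https://tokcount.com/?user=mrsam993')

# page_source is a property: no parentheses.
soup = BeautifulSoup(browser.page_source, 'html.parser')
browser.quit()

links = soup.findAll('span', class_='odometer-value')
print(links)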

lxml ignore part of HTML tag

Okay, so I've got HTML code like this one:
<span class="lista_td_calendar" rel="1617096300">finished</span>
And I would like to fetch it using lxml. There are many spans of this class, each with a different rel attribute, and I've written something like this:
from lxml import html
import requests
page = requests.get(link)
tree = html.fromstring(page.content)
series = tree.xpath('//span[@class="lista_td_calendar"]/text()')
print(series)
Though it doesn't fetch anything. Is there any way to make it independent of the rel attribute?
The problem was that the value I was trying to reach is generated by JavaScript, so it's unreachable through the requests module. Using Selenium solved the problem:
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)
driver.get('https://blackclover.wbijam.pl/pierwsza_seria-170.html')
elements = driver.find_elements_by_class_name('lista_td_calendar')
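From there, the visible text and the rel values can be read straight off the matched elements, for example:
# Print the rel attribute and text of each matched span.
for el in elements:
    print(el.get_attribute('rel'), el.text)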

BeautifulSoup sports scraper gives back empty list

I am trying to scrape the results of tennis matches from this website using Python's BeautifulSoup. I have tried a lot of things, but I always get back an empty list. Is there an obvious mistake I am making? There are multiple instances of this class on the website when I inspect it, but BeautifulSoup does not seem to find it.
import requests
from bs4 import BeautifulSoup
url = 'https://www.flashscore.com/tennis/atp-singles/french-open/results/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
match_container = soup.find_all('div', class_='event__match event__match--static event__match--last event__match--twoLine')
print(match_container)
The results table is loaded via JavaScript, and BeautifulSoup does not find it because it's not loaded yet at the moment of parsing. To solve this problem you'll need to use Selenium. Here is the link for chromedriver.
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
wd = webdriver.Chrome('<PATH_TO_CHROMEDRIVER>', options=chrome_options)
# load page via selenium
wd.get("https://www.flashscore.com/tennis/atp-singles/french-open/results/")
# wait 5 seconds until results table will be loaded
table = WebDriverWait(wd, 5).until(EC.presence_of_element_located((By.ID, 'live-table')))
# parse content of the grid
soup = BeautifulSoup(table.get_attribute('innerHTML'), 'lxml')
# access grid cells, your logic should be here
for tag in soup.find_all('div', class_='event__match event__match--static event__match--last event__match--twoLine'):
    print(tag)
The score data is pulled into the page dynamically, and you're only getting the initial HTML with requests.
As user70 suggested in the comments, the way to do this is to use a tool like Selenium first, so you get all the dynamic content you see in your web browser's inspection tool.
There are a few guides online showing how this works - you could start with this one maybe:
https://medium.com/ymedialabs-innovation/web-scraping-using-beautiful-soup-and-selenium-for-dynamic-page-2f8ad15efe25
