BeautifulSoup sports scraper gives back empty list - python

I am trying to scrape tennis match results from this website using Python's BeautifulSoup. I have tried a lot of things, but I always get back an empty list. Is there an obvious mistake I am making? When I inspect the page there are multiple elements with this class, but find_all does not seem to find any of them.
import requests
from bs4 import BeautifulSoup
url = 'https://www.flashscore.com/tennis/atp-singles/french-open/results/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
match_container = soup.find_all('div', class_='event__match event__match--static event__match--last event__match--twoLine')
print(match_container)

The results table is loaded via JavaScript, so BeautifulSoup does not find it: it is not yet in the HTML at the moment requests fetches the page. To solve this problem you'll need to use Selenium together with a ChromeDriver executable.
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
wd = webdriver.Chrome('<PATH_TO_CHROMEDRIVER>', options=chrome_options)

# load the page via selenium
wd.get("https://www.flashscore.com/tennis/atp-singles/french-open/results/")

# wait up to 5 seconds for the results table to be loaded
table = WebDriverWait(wd, 5).until(EC.presence_of_element_located((By.ID, 'live-table')))

# parse the content of the grid
soup = BeautifulSoup(table.get_attribute('innerHTML'), 'lxml')

# access the grid cells; your logic should go here
for tag in soup.find_all('div', class_='event__match event__match--static event__match--last event__match--twoLine'):
    print(tag)

The score data is pulled into the page dynamically, and you're only getting the initial HTML with requests.
As user70 suggested in the comments, the way to do this is to use a tool like Selenium, which renders the page so you get all the dynamic content you see in your web browser's inspection tool.
There are a few guides online showing how this works - you could start with this one:
https://medium.com/ymedialabs-innovation/web-scraping-using-beautiful-soup-and-selenium-for-dynamic-page-2f8ad15efe25
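A quick way to confirm this kind of problem before reaching for Selenium: fetch the page with plain requests and search the raw text for the class name you expect. A minimal sketch; if the check prints False, the content is rendered client-side:
import requests

url = 'https://www.flashscore.com/tennis/atp-singles/french-open/results/'
page = requests.get(url)

# if the class name never appears in the raw HTML, the matches are
# injected later by JavaScript and requests alone cannot see them
print('event__match' in page.text)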

Related

Why is HTML returned by requests different from the real page HTML?

I'm trying to scrape some web pages to get data to work with. One of the pages I want to scrape is https://www.etoro.com/people/sparkliang/portfolio, and the problem comes when I fetch it using:
import requests
h = requests.get('https://www.etoro.com/people/sparkliang/portfolio')
h.content
This gives me HTML that is completely different from what I see in the browser: it contains a lot of extra meta tags, and the elements and text I am searching for are missing.
For example, imagine I want to scrape:
<p ng-if=":: item.IsStock" class="i-portfolio-table-hat-fullname ng-binding ng-scope">Shopify Inc.</p>
I use a command like this:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://www.etoro.com/people/sparkliang/portfolio').text
print(html_text)
soup = BeautifulSoup(html_text,'lxml')
job = soup.find('p', class_='i-portfolio-table-hat-fullname ng-binding ng-scope').text
This should return Shopify Inc.
But it doesn't, because the HTML I get from the web page with the requests library is completely different from the original.
I want to know how to get the rendered HTML code from the web page.
If you Ctrl-F for a keyword like Shopify Inc, it isn't even present in the HTML that the requests library returns.
This happens because the page uses JavaScript to create the DOM elements dynamically, so you won't be able to get them using requests alone. Instead you should use Selenium with a webdriver and wait for the elements to be created before scraping.
You can download the ChromeDriver executable from the ChromeDriver downloads page. If you place it in the same folder as your script you can run:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import os

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--window-size=1920x1080")
chrome_options.add_argument("--headless")

chrome_driver = os.getcwd() + "\\chromedriver.exe"  # CHANGE THIS IF NOT IN THE SAME FOLDER
driver = webdriver.Chrome(options=chrome_options, executable_path=chrome_driver)

url = 'https://www.etoro.com/people/sparkliang/portfolio'
driver.get(url)

# wait up to 20 seconds for the portfolio names to be created by the page's JavaScript
jobs = WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'p.i-portfolio-table-hat-fullname'))
)
for job in jobs:
    print(job.text)
Here we use Selenium with WebDriverWait and EC to ensure that all the elements exist before we try to scrape the info we're looking for.
Output:
Facebook
Apple
Walt Disney
Alibaba
JD.com
Mastercard
...

driver.page_source isn't returning the entire HTML

I am trying to build a Google Patents crawler in Python.
I used modules like requests, bs4, and Selenium, but I am totally stuck on one thing.
The problem is that my code does not parse the entire HTML source: something is missing after parsing.
I traced the problem to driver.page_source, which is not returning all of the HTML.
So I want to ask about another, better way to solve this.
Thank you.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_driver_path = 'C:/Users/Kay/Documents/Python/Driver/chromedriver.exe'
driver = webdriver.Chrome(options=chrome_options, executable_path=chrome_driver_path)

URL = 'https://patents.google.com/?q=engine'
driver.get(URL)

# page_source is read immediately, before the JavaScript has rendered the results
html = driver.page_source
gp_soup = BeautifulSoup(html, 'html5lib')
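The likely cause is the same as in the threads above: driver.page_source is read before the page's JavaScript finishes rendering the results. A minimal sketch of the usual fix, an explicit wait; note that 'search-result-item' is a guess at Google Patents' markup, so substitute whatever element you see in your browser's inspector:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

chrome_options = Options()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)  # add executable_path= if chromedriver isn't on PATH

driver.get('https://patents.google.com/?q=engine')

# wait up to 10 seconds for at least one result element to be rendered;
# 'search-result-item' is an assumed element name - check the inspector
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'search-result-item'))
)
gp_soup = BeautifulSoup(driver.page_source, 'html5lib')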

BeautifulSoup4 doesn't find elements properly

I am using requests and bs4 to extract the first preview from the link http://duckduckgo.com/?q=who+is+harry+potter
However, when I try to use bs4's find method to find the div with the class 'result__snippet', it returns None. But when I saved the whole webpage to my hard disk, opened it directly, and parsed it with bs4, soup.find('div', class_='result__snippet').get_text() returned the perfect output.
Any help?
The website you link to appears to use JavaScript to build the search results, so the page you retrieve with requests doesn't actually contain the search results yet.
If you look at the content of the page that you've retrieved (print(soup.text)) you can see that they suggest that if you don't have JavaScript enabled to use http://duckduckgo.com/html/?q=who+is+harry+potter.
Scraping this URL should provide you with the content that you are looking for.
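A hedged sketch of that approach, assuming the JavaScript-free endpoint serves the same result__snippet class as the saved page:
import requests
from bs4 import BeautifulSoup

# the JavaScript-free endpoint suggested by the page itself
url = 'http://duckduckgo.com/html/?q=who+is+harry+potter'
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(resp.text, 'html.parser')

snippet = soup.find('div', class_='result__snippet')
if snippet is not None:
    print(snippet.get_text())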
Another way to do this is to use Selenium in combination with BeautifulSoup. Try the following; it works.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup as bs
from fake_useragent import UserAgent

url = 'https://duckduckgo.com/?q=who+is+harry+potter&ia=web'
delay = 10  # seconds to wait for the results to render

# randomize the user agent to look less like a bot
profile = webdriver.FirefoxProfile()
ua1 = UserAgent()
profile.set_preference('general.useragent.override', str(ua1.random))
driver = webdriver.Firefox(profile)

driver.get(url)
while True:
    try:
        WebDriverWait(driver, delay).until(
            EC.presence_of_element_located((By.CLASS_NAME, 'result__snippet'))
        )
        print('Page is ready!')
        break
    except TimeoutException:
        print('Loading took too much time!')

html = driver.execute_script('return document.body.innerHTML')
driver.close()
b_html = bs(html, 'html.parser')
x = b_html.find_all('div', class_='result__snippet')[0].get_text()
print(x)
Output:
Harry Potter is a series of fantasy novels written by British author J. K. Rowling. The novels chronicle the life of a young wizard, Harry Potter, ...

Source data doesn't match actual content when scraping dynamic content with Beautiful Soup + Selenium

I'm trying to teach myself how to scrape data and found a nice dynamic website to test this on (releases.com in this case).
Since it's dynamic, I figured I'd have to use selenium to fetch its data.
However, the retrieved page source still only contains the initial HTML and its JS, not the actual elements shown in the browser.
Why does that happen?
I'm assuming it's because I'm fetching the page source, but what other option is there?
My code looks like this:
from bs4 import BeautifulSoup as soup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium import webdriver
import chromedriver_binary
import time

# constants
my_url = "https://www.releases.com/l/Games/2018/1/"

# start the WebDriver and load the page
wd = webdriver.Chrome()
wd.get(my_url)

# wait for the elements to appear
time.sleep(10)

# and grab the page HTML source
html_page = wd.page_source
wd.quit()

# make soup
pageSoup = soup(html_page, "html.parser")

# get data
print(pageSoup.text)
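The pattern from the earlier threads applies here as well: a fixed time.sleep(10) is fragile, and an explicit wait on a concrete element is more reliable before grabbing page_source. A minimal sketch; the CSS selector is a hypothetical placeholder, since I don't know releases.com's actual markup:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wd = webdriver.Chrome()
try:
    wd.get("https://www.releases.com/l/Games/2018/1/")
    # block until at least one release entry exists; '.calendar-item' is
    # a hypothetical selector - replace it with one from the inspector
    WebDriverWait(wd, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".calendar-item"))
    )
    page_soup = BeautifulSoup(wd.page_source, "html.parser")
    print(page_soup.text)
finally:
    wd.quit()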

Parsing a website with BeautifulSoup and Selenium

Trying to compare average temperatures to actual temperatures by scraping them from: https://usclimatedata.com/climate/binghamton/new-york/united-states/usny0124
I can successfully gather the webpage's source code, but I am having trouble parsing it to get only the values for the high temps, low temps, rainfall, and the averages under the "History" tab. I can't seem to address the right class/id; the only result I get is None.
This is what I have so far, with the last line being an attempt to get the high temps only:
from lxml import html
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://usclimatedata.com/climate/binghamton/new-york/unitedstates/usny0124"
browser = webdriver.Chrome()
browser.get(url)
soup = BeautifulSoup(browser.page_source, "lxml")
data = soup.find("table", {'class': "align_right_climate_table_data_td_temperature_red"})
First of all, these are two different classes - align_right and temperature_red - and you've joined them together with climate_table_data_td for some reason. And the elements that have these two classes are td elements, not table elements.
In any case, to get the climate table, it looks like you should be looking for the div element having id="climate_table":
climate_table = soup.find(id="climate_table")
Another important thing to note is that there is potential for "timing" issues here - when you read the driver.page_source value, the climate information might not be there yet. This is usually handled by adding an Explicit Wait after navigating to the page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

url = "https://usclimatedata.com/climate/binghamton/new-york/united-states/usny0124"
browser = webdriver.Chrome()
try:
    browser.get(url)
    # wait for the climate data to be loaded
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID, "climate_table")))
    soup = BeautifulSoup(browser.page_source, "lxml")
    climate_table = soup.find(id="climate_table")
    print(climate_table.prettify())
finally:
    browser.quit()
Note the addition of the try/finally that would safely close the browser in case of an error - that would also help to avoid "hanging" browser windows.
And look into pandas.read_html(), which can read your climate information table into a DataFrame auto-magically.
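A hedged sketch of that idea: capture browser.page_source inside the try block (before browser.quit() runs) and hand it to read_html(). Which index in the resulting list holds the climate table is an assumption, so inspect them all:
import pandas as pd

# page_source must be captured while the browser is still open
html = browser.page_source
tables = pd.read_html(html)  # parses every <table> into a DataFrame
print(len(tables))           # see how many tables the page contains
print(tables[0].head())      # index 0 is an assumption - check each one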
