Can't identify a table when scraping

Beginner question: I'm trying to scrape data from a table, but I can't locate it. I've tried selecting it by class and by id, but I get 0 results either way. The code and output are below.
# Import necessary packages
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
# Site URL
url="https://fbref.com/en/comps/9/stats/Premier-League-Stats"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse HTML code for the entire site
soup = BeautifulSoup(html_content, "lxml")
#print(soup.prettify()) # print the parsed data of html
gdp = soup.find_all("table", attrs={"id": "stats_standard"})
print("Number of tables on site: ",len(gdp))
Output - 'Number of tables on site: 0'

I suggest using Selenium for this kind of scraping. It drives a real browser, so anything that JavaScript adds to the page is available in driver.page_source.
This code should work for you:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
option = Options()
option.add_argument('--headless')
url = 'https://fbref.com/en/comps/9/stats/Premier-League-Stats'
driver = webdriver.Chrome(options=option)
driver.get(url)
bs = BeautifulSoup(driver.page_source, 'html.parser')
gdp = bs.find_all('table', {'id': 'stats_standard'})
driver.quit()
print("Number of tables on site: ",len(gdp))
Output
Number of tables on site: 1
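If launching a browser feels too heavy, one thing worth checking first is whether the table is present in the raw HTML but wrapped in an HTML comment, in which case find_all will not see it. This is an assumption about how the site serves its markup, not something confirmed in this thread. The sketch below uses a made-up SAMPLE_HTML string and a hypothetical helper, find_commented_tables, to show the idea; it makes no network request.

```python
from bs4 import BeautifulSoup, Comment

# Made-up markup for illustration: the table sits inside an HTML comment.
SAMPLE_HTML = """
<html><body>
<div id="all_stats_standard">
<!--
<table id="stats_standard"><tr><th>Player</th></tr><tr><td>Example</td></tr></table>
-->
</div>
</body></html>
"""


def find_commented_tables(html, table_id):
    """Hypothetical helper: look for tables with table_id inside HTML comments."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    # Comment nodes are strings in the tree, so filter strings by type.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        # Parse the comment body as its own small HTML document.
        inner = BeautifulSoup(comment, "html.parser")
        found.extend(inner.find_all("table", attrs={"id": table_id}))
    return found


# A plain search skips the commented-out markup entirely.
direct = BeautifulSoup(SAMPLE_HTML, "html.parser").find_all(
    "table", attrs={"id": "stats_standard"}
)
hidden = find_commented_tables(SAMPLE_HTML, "stats_standard")
print("Direct search:", len(direct))
print("Inside comments:", len(hidden))
```

If the second count is non-zero for the real page source, the table can be parsed from requests output alone; if both are zero, the content is probably built by JavaScript and a browser-based approach is the safer route.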

Can you find the table(s) without using attrs={"id": "stats_standard"}?
I checked and could not find any table whose ID is stats_standard (though there is one with ID stats_standard_sh, for example), so you may be using the wrong ID.

Related

HTML parsing with BeautifulSoup in Python unknown error

I know that this code works for other websites that end in .com.
However, I noticed that it doesn't work when I try to parse websites that end in .kr.
Can somebody help me find out why this is happening, and suggest an alternative way to parse these types of websites?
My code is below.
import requests
from bs4 import BeautifulSoup
URL = 'https://everytime.kr/#nN4K1XC0weHnnM9VB5Qe'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='container')
print(results)
The URL here is a link to my timetable. I need to parse this website so that I can easily collect the information for the subjects and data relevant to the subject (duration, location, professor's name, etc.).
Thanks

The website serves its content dynamically, so the response that requests gets back does not contain the data you want. You can use Selenium instead.
Example:
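from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import time
# Selenium 4 takes the driver path through a Service object
# (the executable_path argument was removed).
service = Service(r'C:\Program Files\ChromeDriver\chromedriver.exe')
driver = webdriver.Chrome(service=service)
url = 'https://everytime.kr/#nN4K1XC0weHnnM9VB5Qe'
driver.get(url)
# Give the page's JavaScript time to render the content
time.sleep(5)
soup = BeautifulSoup(driver.page_source, 'html.parser')
results = soup.find(id='container')
print(results)
driver.close()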
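A side note on the URL itself: the part after # is a fragment, which the client keeps to itself and does not send to the server. A plain requests.get on this URL therefore asks for https://everytime.kr/ regardless of the identifier after the #, and only JavaScript running in a browser can act on it. A quick way to see what is actually requested, using the standard library's urldefrag:

```python
from urllib.parse import urldefrag

url = 'https://everytime.kr/#nN4K1XC0weHnnM9VB5Qe'

# Split the URL into the part that goes over the network and the fragment
# that stays on the client side.
requested_url, fragment = urldefrag(url)

print("Sent to the server:", requested_url)
print("Kept by the client:", fragment)
```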

Indexing multiple tables in BeautifulSoup

This is the page I want to parse: https://fbref.com/en/comps/9/gca/Premier-League-Stats
It has two tables. I am trying to get information from the second table, but the first table is displayed every time I run this code.
from bs4 import BeautifulSoup
import requests
source = requests.get('https://fbref.com/en/comps/9/gca/Premier-League-Stats').text
soup = BeautifulSoup(source, 'lxml')
stattable = soup.find('table', class_= 'min_width sortable stats_table min_width shade_zero')[1]
print(stattable)
min_width sortable stats_table min_width shade_zero is the class of the 'second' table.
It neither raises an error nor returns anything; the result is null.

Since the second table is dynamically generated, why not combine selenium, BeautifulSoup, and pandas to get what you want?
For example:
import time
from io import StringIO
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# Default Options run Chrome with a visible window (not headless)
options = Options()
driver = webdriver.Chrome(options=options)
driver.get("https://fbref.com/en/comps/9/gca/Premier-League-Stats")
# Wait for the JavaScript-rendered table to appear
time.sleep(2)
soup = BeautifulSoup(driver.page_source, "html.parser").find("div", {"id": "div_stats_gca"})
driver.close()
# Recent pandas versions expect a file-like object rather than a raw HTML string
df = pd.read_html(StringIO(str(soup)), skiprows=[0, 1])
df = pd.concat(df)
df.to_csv("data.csv", index=False)
This spits out a .csv file that, well, looks like that table you want. :)

The HTML you see when you use Inspect Element is generated by JavaScript, so the same classes are not present in the raw HTML that your script receives.
I disabled JavaScript for this site and the table was no longer visible.
You can try something like Selenium. There is good information in this question.
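Separately from the JavaScript issue, note that find returns a single element (or None), so putting [1] after it does not select the second table. To pick a table by position, collect all matches with find_all and index the resulting list. A minimal sketch on a made-up SAMPLE_HTML string (the ids first and second are placeholders, not taken from the real page):

```python
from bs4 import BeautifulSoup

# Made-up markup: two tables that share the same class.
SAMPLE_HTML = """
<table id="first" class="stats_table"><tr><td>A</td></tr></table>
<table id="second" class="stats_table"><tr><td>B</td></tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() stops at the first match.
first = soup.find("table", class_="stats_table")

# find_all() returns every match in document order, so it can be indexed.
tables = soup.find_all("table", class_="stats_table")
second = tables[1] if len(tables) > 1 else None

print(first["id"])
if second is not None:
    print(second["id"])
```

The length check matters on the real page: if the second table is missing from the HTML that was fetched, tables[1] would raise an IndexError.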

Beautifulsoup can not find table containing specific class

from bs4 import BeautifulSoup
import requests
url="https://www.calculator.net/currency-calculator.html"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse the html content
soup = BeautifulSoup(html_content,'html5lib')
print(soup.prettify()) # print the parsed data of html
conv_table = soup.find("table", attrs={"class":"cinfoT "})
conv_data = conv_table.tbody.find_all("tr")
I wrote the script above to get the table listed on this website.
When I run it, conv_table comes back as a NoneType object.
If you visit the website, I want to extract the second, bigger table, whose class name contains "cinfoT ". I have also checked that there are blank spaces in the class name.
Please help me out.
Thanks in advance.

This happens because the data is loaded by JavaScript; requests only gives you the plain HTML file. Try Selenium:
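from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
DRIVER_PATH="Your selenium chrome driver path"
url = 'https://www.calculator.net/currency-calculator.html'
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(DRIVER_PATH))
driver.get(url)
html_soup = BeautifulSoup(driver.page_source, 'html.parser')
# BeautifulSoup splits the class attribute on whitespace, so match the bare class name
table = html_soup.find_all('table', class_='cinfoT')
driver.quit()
print(table[0].tbody)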
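One more thing worth ruling out, whichever way the HTML is fetched, is the trailing space in "cinfoT ". BeautifulSoup stores class as a list of whitespace-separated names, so a search string that includes the space has nothing to match against. The sketch below demonstrates this on a made-up one-table SAMPLE_HTML string rather than the real page:

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics a class attribute with a trailing space.
SAMPLE_HTML = '<table class="cinfoT "><tbody><tr><td>USD</td></tr></tbody></table>'

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The parser splits the attribute on whitespace and drops the blank part.
stored_classes = soup.find("table")["class"]

# Searching with the trailing space finds nothing...
with_space = soup.find("table", attrs={"class": "cinfoT "})
# ...while the bare class name matches.
without_space = soup.find("table", attrs={"class": "cinfoT"})

print(stored_classes)
print(with_space is None)
print(without_space is not None)
```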

How to get some data in real time from a website using python?

I want to fetch some data from this website:
https://web.sensibull.com/optionchain?expiry=2020-03-26&tradingsymbol=NIFTY
I am using beautifulsoup library to fetch this data, and have tried the following code:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
url = 'https://web.sensibull.com/optionchain?expiry=2020-03-26&tradingsymbol=NIFTY'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
b = soup.find("div", {"class": "style__AtmIVWrapper-idZNMX kUMMRI"})
print(b)
But it prints None.
Although there is only one element with this class in the full HTML, I also tried this:
for b in soup.find_all('div', attrs={'class':'style__AtmIVWrapper-idZNMX kUMMRI'}):
    print(b.get_text())
    print(len(b))
But it doesn't work.
I also tried soup.find("div"),
but the required div tag does not show up in the output, maybe because of the nested divs.
I am unable to fetch this data and proceed with my work. Please help.

If you are looking for code, this might help:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
webpage = 'https://web.sensibull.com/optionchain?expiry=2020-03-26&tradingsymbol=NIFTY'
# Selenium 4: pass the driver path via Service and locate elements with find_element(By.XPATH, ...)
driver = webdriver.Chrome(service=Service('Your/path/to/chromedriver.exe'))
driver.get(webpage)
# Wait for the page's JavaScript to render the option chain
time.sleep(10)
nifty_fut = driver.find_element(By.XPATH, '//*[@id="app"]/div/div[4]/div[2]/div[3]/div/div/div[2]/div[1]/div[1]/div/button/span[1]/div[1]')
print(nifty_fut.text)
atm_iv = driver.find_element(By.XPATH, '//*[@id="app"]/div/div[4]/div[2]/div[3]/div/div/div[2]/div[1]/div[2]')
print(atm_iv.text)
driver.quit()

It could be a syntax problem. Try soup.find_all("div", class_="style__AtmIVWrapper-idZNMX kUMMRI") or just soup.find("div", class_="style__AtmIVWrapper-idZNMX kUMMRI").
If you are interested in web scraping with bs4, take a look at the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find
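For what it is worth, class_="..." and attrs={"class": "..."} are two spellings of the same search, so switching between them should not change the result when the element is missing from the fetched HTML. Where a multi-class string does behave differently is ordering: it matches only the exact attribute value, while a CSS selector matches the classes in any order. A small sketch on a made-up SAMPLE_HTML string (the text "ATM IV 21.5" is a placeholder, not real data):

```python
from bs4 import BeautifulSoup

# Made-up markup using the class names from the question.
SAMPLE_HTML = '<div class="style__AtmIVWrapper-idZNMX kUMMRI">ATM IV 21.5</div>'

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Both spellings run the same search and return the same element.
by_keyword = soup.find("div", class_="style__AtmIVWrapper-idZNMX kUMMRI")
by_attrs = soup.find("div", attrs={"class": "style__AtmIVWrapper-idZNMX kUMMRI"})

# A multi-class string must equal the attribute value exactly,
# so swapping the order finds nothing.
reordered = soup.find("div", class_="kUMMRI style__AtmIVWrapper-idZNMX")

# A CSS selector does not care about the order of the classes.
by_selector = soup.select_one("div.kUMMRI.style__AtmIVWrapper-idZNMX")

print(by_keyword is by_attrs)
print(reordered is None)
print(by_selector.get_text())
```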

Python BeautifulSoup - trouble parsing table from webpage

I'd like to parse the table data from the following pricing page and create a dataframe with all of the table values (vCPU, Memory, Storage, Price). However, with the following code, I can't seem to find the table on the page. Can someone help me figure out how to parse out the values?
When I use pd.read_html, I get an error saying that no tables are found.
import pandas as pd
from bs4 import BeautifulSoup
import requests
import csv
url = "https://aws.amazon.com/ec2/pricing/on-demand/"
r = requests.get(url)
html_content = r.text
soup = BeautifulSoup(html_content, 'html.parser')
data=[]
tables = soup.find_all('table')
df = pd.read_html(url)

If you're having trouble because of dynamic content, a good workaround is Selenium. It simulates the browser experience, so you don't have to worry about managing cookies and other problems that come with dynamic web content. I was able to scrape the page with the following:
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from time import sleep
driver = webdriver.Firefox()
driver.get('https://aws.amazon.com/ec2/pricing/on-demand/')
sleep(3)
html = driver.page_source
soup = BeautifulSoup(html,'lxml')
driver.close()
data=[]
tables = soup.find_all('table')
print(tables)
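Once a table element is in hand, turning it into a dataframe takes a few more lines. The sketch below assumes a simple table with one header row of th cells and data rows of td cells; table_to_dataframe is a hypothetical helper, and the rows in SAMPLE_HTML are placeholder values rather than real prices. The real page's markup may be nested differently and need adjusting.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up table using the column names from the question; values are placeholders.
SAMPLE_HTML = """
<table>
  <tr><th>vCPU</th><th>Memory</th><th>Storage</th><th>Price</th></tr>
  <tr><td>1</td><td>2 GiB</td><td>EBS Only</td><td>0.01</td></tr>
  <tr><td>2</td><td>4 GiB</td><td>EBS Only</td><td>0.02</td></tr>
</table>
"""


def table_to_dataframe(table):
    """Hypothetical helper: convert a simple <table> Tag into a DataFrame."""
    rows = table.find_all("tr")
    # The first row holds the column names.
    header = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]
    # Every later row becomes one record of cell texts.
    records = [
        [cell.get_text(strip=True) for cell in row.find_all("td")]
        for row in rows[1:]
    ]
    return pd.DataFrame(records, columns=header)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
df = table_to_dataframe(soup.find("table"))
print(df)
```

All values come out as strings, so numeric columns such as Price would still need a conversion step (for example with pd.to_numeric) before any arithmetic.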
