I'm trying to scrape the "team per game stats" table from this website using this code:
from urllib.request import urlopen as uo
from bs4 import BeautifulSoup as BS
import pandas as pd
url = 'https://www.basketball-reference.com/leagues/NBA_2020.html'
html = uo(url)
soup = BS(html, 'html.parser')
soup.findAll('tr')
headers = [th.getText() for th in soup.findAll('tr')]
headers = headers[1:]
print(headers)
rows = soup.findAll('tr')[1:]
team_stats = [[td.getText() for td in rows[i].findAll('td')]
for i in range(len(rows))]
stats = pd.DataFrame(team_stats, columns=headers)
But it returns this error:
AssertionError: 71 columns passed, passed data had 212 columns
The problem is that the data is hidden in a commented section of the HTML. In your browser, JavaScript un-comments the table and renders it, but requesting the page with requests or urllib yields only the raw HTML, where the table is still commented out and therefore invisible to BeautifulSoup's normal tag search.
So be aware that you have to examine the page with "View page source" rather than the rendered DOM in "Inspect Element" when you look for the proper tags to target with BeautifulSoup.
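A quick way to confirm this is to fetch the raw HTML and check that the table markup sits inside a comment - a sketch, where 'team-stats-per_game' is an assumed table id taken from the section anchor used in the answer below:
import requests

url = 'https://www.basketball-reference.com/leagues/NBA_2020.html'
raw = requests.get(url).text
idx = raw.find('team-stats-per_game')  # assumed table id
# True if the nearest '<!--' before the table has no matching '-->' yet,
# i.e. the table markup is still inside an HTML comment
print(raw[:idx].rfind('<!--') > raw[:idx].rfind('-->'))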
Try this:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.basketball-reference.com/leagues/NBA_2020.html'
html = requests.get(url)
section_start = '<span class="section_anchor" id="team-stats-per_game_link" data-label="Team Per Game Stats">'
block_start = html.text.split(section_start)[1].split("<!--")[1]
block = block_start.split("-->")[0]
soup = BeautifulSoup(block, 'html.parser')  # specify a parser to avoid the bs4 warning
data = [tr.get_text(",") for tr in soup.findAll('tr')]  # one comma-joined string per row
header = data[0]
header = [x.strip() for x in header.split(",") if x.strip() != ""]
data = [x.split(",") for x in data[1:]]
pd.DataFrame(data, columns=header)
Explanation: you first locate the commented section by splitting the raw HTML just before the section anchor, extract the comment's contents as text, convert that to soup, and then parse as usual.
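An alternative sketch that avoids string splitting: BeautifulSoup exposes comments as Comment nodes, so you can collect every commented block on the page and hand the relevant one straight to pandas (again, 'team-stats-per_game' is an assumption about the table id):
import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment

url = 'https://www.basketball-reference.com/leagues/NBA_2020.html'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# every HTML comment on the page
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for c in comments:
    if 'team-stats-per_game' in c:
        stats = pd.read_html(str(c))[0]  # parse the commented-out table directly
        print(stats.head())
        break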
Related
I'm learning web scraping and am trying to crawl data from the link below. Is there a way for me to extract the link from each of the tds as well?
The website link: http://eecs.qmul.ac.uk/postgraduate/programmes/
Here's what I did so far.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://eecs.qmul.ac.uk/postgraduate/programmes/"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
table_list = []
rows = soup.find_all('tr')
# For every row in the table, find each cell element and add it to the list
for row in rows:
    row_td = row.find_all('td')
    row_cells = str(row_td)
    row_cleantext = BeautifulSoup(row_cells, "lxml").get_text()
    table_list.append(row_cleantext)
print(table_list)
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://eecs.qmul.ac.uk/postgraduate/programmes/"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
main_data=soup.find_all("td")
You can iterate over main_data to get each specific td tag, then find the a tag inside it and use .get("href") to extract the link. If a cell contains no a tag, find returns None and the attribute access raises AttributeError, so wrap it in try/except:
for data in main_data:
    try:
        link = data.find("a").get("href")
        print(link)
    except AttributeError:
        pass
For understanding only:
main_data = soup.find_all("td")
for data in main_data:
    try:
        link = data.find("a")
        print(link.text)
        print(link.get("href"))
    except AttributeError:
        pass
Output:
H60C
https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/advanced-electronic-and-electrical-engineering-msc/
H60A
https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/advanced-electronic-and-electrical-engineering-msc/
..
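One precaution (a sketch, not strictly needed here since the output above shows absolute URLs): if a site emits relative hrefs, resolve them against the page URL with urllib.parse.urljoin:
from urllib.parse import urljoin

for data in main_data:
    try:
        link = urljoin(url, data.find("a").get("href"))  # resolves relative hrefs
        print(link)
    except AttributeError:
        pass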
For creating a table you can use the pandas module:
main_data = soup.find_all("td")
dict1 = {}
for data in main_data:
    try:
        link = data.find("a")
        dict1[link.text] = link.get("href")
    except AttributeError:
        pass

import pandas as pd
df = pd.DataFrame(dict1.items(), columns=["Text", "Link"])
Output:
Text Link
0 H60C https://www.qmul.ac.uk/postgraduate/taught/cou...
1 H60A https://www.qmul.ac.uk/postgraduate/taught/cou...
2 I4U2 https://www.qmul.ac.uk/postgraduate/taught/cou...
..
Getting the table from the website:
import pandas as pd
data=pd.read_html("http://eecs.qmul.ac.uk/postgraduate/programmes/")
df=data[0]
df
Output
Postgraduate degree programmes Part-time(2 year) Full-time(1 year)
0 Advanced Electronic and Electrical Engineering H60C H60A
1 Artificial Intelligence I4U2 I4U1
.....
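If you also want those links attached to the read_html table, here is a sketch combining the two results above (dict1 from the earlier loop; column names taken from the output shown):
# map each course code to its href; codes with no link become NaN
df["Part-time link"] = df["Part-time(2 year)"].map(dict1)
df["Full-time link"] = df["Full-time(1 year)"].map(dict1)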
https://www.worldometers.info/coronavirus/#countries is the website I'm using, and I'm trying to pull the table with the "All" tab selected from the HTML into my Jupyter notebook. The problem is that if I use class = 'table' it pulls all the continent tabs first and then the "All" table, which messes up how my data gets pulled in when I try looking at rows.
import requests
import lxml.html as lh
import pandas as pd
import csv
from bs4 import BeautifulSoup
url = 'https://www.worldometers.info/coronavirus/#countries'
page = requests.get(url)
print(page.status_code) #Checking the http response status code. Should be 200
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
all_tables=soup.find_all("table")
right_table = soup.find('table',{'class':'table'})
col_headers = [th.getText() for th in right_table.findAll('th')]
data = [[td.getText() for td in right_table.findAll('td')] for tr in right_table()]
When I try to combine the col_headers and data it says I have 13 columns passed while the data had 2990 columns. Any guidance would be appreciated.
You have "flattened" the table - created a list of all <td>s. What you need to do is to create a nested list:
data = [[td.text for td in tr.find_all("td")] for tr in right_table.find_all("tr")]
df = pd.DataFrame(data, columns=col_headers)
print(df.shape)  # (231, 13)
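Optionally, you can drop rows that had no <td> cells before building the DataFrame (the header row, for instance, contains only <th> cells and comes back as an empty inner list) - a small cleanup sketch:
data = [row for row in data if row]  # keep only rows that actually had <td> cells
df = pd.DataFrame(data, columns=col_headers)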
I've been trying to get only one value from a table on a website. I've been following a tutorial but I am currently stuck. My goal is to extract the name of the country from the table and the number of total cases of that specific country and print it on the screen. For example:
China: 80,761 Total cases
I'm using Python 3.7.
This is my code so far:
import requests
from bs4 import BeautifulSoup
url='https://www.worldometers.info/coronavirus/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.findAll('table',{'id':'main_table_countries'})
If you have <table> tags, just go with pandas' .read_html(). It uses BeautifulSoup under the hood, and you can then slice and dice the dataframe as you please:
import pandas as pd
url='https://www.worldometers.info/coronavirus/'
df = pd.read_html(url)[0]
print (df.iloc[:,:2])
To do it with BeautifulSoup is straightforward. First grab the <table> tag. Within the <table> tag, get all the <tr> tags (rows). Then iterate through each row to get all the <td> tags (the data). The data you want are in index positions 0 and 1, so just print those out.
import requests
from bs4 import BeautifulSoup
url='https://www.worldometers.info/coronavirus/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find('table',{'id':'main_table_countries'})
rows = table.find_all('tr')
for row in rows:
    data = row.find_all('td')
    if data != []:
        print(data[0].text, data[1].text)
ADDITIONAL:
import pandas as pd
country = 'China'
url='https://www.worldometers.info/coronavirus/'
df = pd.read_html(url)[0]
print (df[df['Country,Other'] == country].iloc[:,:2])
OR
import requests
from bs4 import BeautifulSoup
import re
country = 'China'
url='https://www.worldometers.info/coronavirus/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find('table',{'id':'main_table_countries'})
anchor = table.find('a', text=re.compile(country))
# climb from the link up to its table row, then take the second cell (total cases)
data = anchor.find_parent('tr').find_all('td')[1].text
print(anchor.text, data)
You can get the target info this way:
for t in table[0].find_all('tr'):
    target = t.find_all('td')
    if len(target) > 0:
        print(target[0].text, target[1].text)
Output:
China 80,761
Italy 9,172
Iran 8,042
etc.
How can I scrape the yahoo earnings calendar to pull out the dates?
This is for python 3.
from bs4 import BeautifulSoup as soup
import urllib
url = 'https://finance.yahoo.com/calendar/earnings?day=2019-06-13&symbol=ibm'
response = urllib.request.urlopen(url)
html = response.read()
page_soup = soup(html,'lxml')
table = page_soup.find('p')
print(table)
the output is "None"
Beautiful Soup has several find functions that you can use to inspect the DOM; please refer to the documentation.
from bs4 import BeautifulSoup as soup
import urllib.request
url = 'https://finance.yahoo.com/calendar/earnings?day=2019-06-13&symbol=ibm'
response = urllib.request.urlopen(url)
html = response.read()
page_soup = soup(html,'lxml')
table = page_soup.find_all('td')
Dates = []
for something in table:
    try:
        if something['aria-label'] == "Earnings Date":
            Dates.append(something.text)
    except KeyError:  # cell has no aria-label attribute
        pass

print(Dates)
Might be off-topic but since you want to get a table from a webpage, you might consider using pandas which works with two lines:
import pandas as pd
earnings = pd.read_html('https://finance.yahoo.com/calendar/earnings?day=2019-06-13&symbol=ibm')[0]
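read_html returns a list of DataFrames, one per <table> on the page, and [0] takes the first. Assuming the column header matches the aria-label seen on the page (an assumption, verify against the actual output), the dates are then simply:
print(earnings['Earnings Date'])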
Here are two succinct ways
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://finance.yahoo.com/calendar/earnings?day=2019-06-13&symbol=ibm&guccounter=1')
soup = bs(r.content, 'lxml')
# using attribute = value selector
dates = [td.text for td in soup.select('[aria-label="Earnings Date"]')]

# using nth-of-type to get the column
dates = [td.text for td in soup.select('#cal-res-table td:nth-of-type(3)')]
I am trying to download the data on this website
https://coinmunity.co/
...in order to manipulate it later in Python or Pandas.
I have tried to load it directly into Pandas via Requests, but it did not work, using this code:
res = requests.get("https://coinmunity.co/")
soup = BeautifulSoup(res.content, 'lxml')
table = soup.find_all('table')[0]
dfm = pd.read_html(str(table), header = 0)
dfm = dfm[0].dropna(axis=0, thresh=4)
dfm.head()
In most of the things I tried, I could only get to the info in the headers, which seems to be the only table the code can see on this page.
Seeing that this did not work, I tried to do the same scraping with Requests and BeautifulSoup, but it did not work either. This is my code:
import requests
from bs4 import BeautifulSoup
res = requests.get("https://coinmunity.co/")
soup = BeautifulSoup(res.content, 'lxml')
#table = soup.find_all('table')[0]
#table = soup.find_all('div', {'class':'inner-container'})
#table = soup.find_all('tbody', {'class':'_ngcontent-c0'})
#table = soup.find_all('table')[0].findAll('tr')
#table = soup.find_all('table')[0].find('tbody')#.find_all('tbody _ngcontent-c3=""')
table = soup.find_all('p', {'class':'stats change positiveSubscribers'})
You can see in the lines commented, all the things I have tried, but nothing worked.
Is there any way to easily download that table to use it in Pandas/Python, in the tidiest, easiest and quickest possible way?
Thank you
Since the content is loaded dynamically after the initial request is made, you won't be able to scrape this data with requests alone. Here's what I would do instead:
from selenium import webdriver
import pandas as pd
import time
from bs4 import BeautifulSoup
driver = webdriver.Firefox()
driver.implicitly_wait(10)
driver.get("https://coinmunity.co/")
html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, 'lxml')
results = []
for row in soup.find_all('tr')[2:]:
    data = row.find_all('td')
    name = data[1].find('a').text
    value = data[2].find('p').text
    # get the rest of the data you need about each coin here,
    # then add it to the dictionary that you append to results
    results.append({'name': name, 'value': value})
df = pd.DataFrame(results)
df.head()
name value
0 NULS 14,005
1 VEN 84,486
2 EDO 20,052
3 CLUB 1,996
4 HSR 8,433
You will need to make sure that geckodriver is installed and that it is in your PATH. I just scraped the name of each coin and the value but getting the rest of the information should be easy.
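If the implicit wait ever proves flaky, an explicit wait for the rows themselves is a more targeted alternative - a sketch, assuming the rows are plain <tr> elements as in the loop above:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://coinmunity.co/")
# block for up to 10 seconds until at least one table row exists in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'tr'))
)
html = driver.page_source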