I've been trying to get only one value from a table on a website. I've been following a tutorial but I am currently stuck. My goal is to extract the name of the country from the table and the number of total cases of that specific country and print it on the screen. For example:
China: 80,761 Total cases
I'm using Python 3.7.
This is my code so far:
import requests
from bs4 import BeautifulSoup
url='https://www.worldometers.info/coronavirus/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.findAll('table',{'id':'main_table_countries'})
If the data is in a <table> tag, just go with pandas' .read_html(). It parses the HTML for you (using lxml or BeautifulSoup under the hood), and then you can slice and dice the DataFrame as you please:
import pandas as pd
url='https://www.worldometers.info/coronavirus/'
df = pd.read_html(url)[0]
print (df.iloc[:,:2])
To do it with BeautifulSoup directly: first grab the <table> tag. Within the <table> tag, get all the <tr> tags (rows). Then iterate through each row to get all the <td> tags (the data). The data you want are at index positions 0 and 1, so just print those out.
import requests
from bs4 import BeautifulSoup
url='https://www.worldometers.info/coronavirus/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find('table',{'id':'main_table_countries'})
rows = table.find_all('tr')
for row in rows:
    data = row.find_all('td')
    if data != []:
        print (data[0].text, data[1].text)
ADDITIONAL:
import pandas as pd
country = 'China'
url='https://www.worldometers.info/coronavirus/'
df = pd.read_html(url)[0]
print (df[df['Country,Other'] == country].iloc[:,:2])
OR
import requests
from bs4 import BeautifulSoup
import re
country = 'China'
url='https://www.worldometers.info/coronavirus/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find('table',{'id':'main_table_countries'})
link = table.find('a', text=re.compile(country))
if link is not None:
    # walk up from the <a> to its <tr>; the second cell holds the total cases
    data = link.find_parent('tr').find_all('td')[1].text
    print (link.text, data)
You can get the target info this way:
for t in table[0].find_all('tr'):
    target = t.find_all('td')
    if len(target)>0:
        print(target[0].text, target[1].text)
Output:
China 80,761
Italy 9,172
Iran 8,042
etc.
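The question asked for a single country in the form "China: 80,761 Total cases". A minimal sketch of that, built on the same row loop. The inline HTML is a small stand-in for the real page (its markup may differ), and the helper name total_cases is made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's table; the real markup may differ.
SAMPLE_HTML = """
<table id="main_table_countries">
  <thead><tr><th>Country,Other</th><th>Total Cases</th></tr></thead>
  <tbody>
    <tr><td> China </td><td>80,761 </td></tr>
    <tr><td> Italy </td><td>9,172 </td></tr>
  </tbody>
</table>
"""


def total_cases(html, country):
    """Return 'Country: N Total cases' for the first matching row, else None."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": "main_table_countries"})
    if table is None:
        return None
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # Header rows only contain <th> cells, so they give an empty list here.
        if len(cells) >= 2 and cells[0].get_text(strip=True) == country:
            name = cells[0].get_text(strip=True)
            cases = cells[1].get_text(strip=True)
            return f"{name}: {cases} Total cases"
    return None


print(total_cases(SAMPLE_HTML, "China"))
```

Using get_text(strip=True) removes the padding whitespace around the cell text, which would otherwise break an exact comparison against the country name.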
I'm learning web scraping and am trying to web crawl data from the below link. Is there a way for me to crawl the link from each of the td as well?
The website link: http://eecs.qmul.ac.uk/postgraduate/programmes/
Here's what I did so far.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://eecs.qmul.ac.uk/postgraduate/programmes/"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
table_list = []
rows = soup.find_all('tr')
# For every row in the table, find each cell element and add it to the list
for row in rows:
    row_td = row.find_all('td')
    row_cells = str(row_td)
    row_cleantext = BeautifulSoup(row_cells, "lxml").get_text()
    table_list.append((row_cleantext))
print(table_list)
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://eecs.qmul.ac.uk/postgraduate/programmes/"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
main_data=soup.find_all("td")
You can find main_data and iterate over it to get each td tag, then find the a tag inside it and use .get to extract the href. If a td has no a tag, find returns None and calling .get on it raises AttributeError, so use try-except to handle that case:
for data in main_data:
    try:
        link=data.find("a").get("href")
        print(link)
    except AttributeError:
        pass
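An alternative to catching AttributeError is to check for None explicitly. A minimal sketch on an inline sample (the HTML and the example.com links are placeholders, not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical sample row; the first cell has no link, the other two do.
SAMPLE_HTML = """
<table>
  <tr><td>Artificial Intelligence</td>
      <td><a href="https://example.com/i4u2">I4U2</a></td>
      <td><a href="https://example.com/i4u1">I4U1</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
links = []
for cell in soup.find_all("td"):
    anchor = cell.find("a")  # None when the cell contains no <a> tag
    if anchor is not None and anchor.has_attr("href"):
        links.append((anchor.get_text(strip=True), anchor["href"]))

print(links)
```

The explicit check skips only cells without a link, whereas a try-except around the whole body could also hide an unrelated AttributeError.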
For understanding only:
main_data=soup.find_all("td")
for data in main_data:
    try:
        link=data.find("a")
        print(link.text)
        print(link.get("href"))
    except AttributeError:
        pass
Output:
H60C
https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/advanced-electronic-and-electrical-engineering-msc/
H60A
https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/advanced-electronic-and-electrical-engineering-msc/
..
To create a table, you can use the pandas module:
main_data=soup.find_all("td")
dict1={}
for data in main_data:
    try:
        link=data.find("a")
        dict1[link.text]=link.get("href")
    except AttributeError:
        pass
import pandas as pd
df=pd.DataFrame(dict1.items(),columns=["Text","Link"])
Output:
Text Link
0 H60C https://www.qmul.ac.uk/postgraduate/taught/cou...
1 H60A https://www.qmul.ac.uk/postgraduate/taught/cou...
2 I4U2 https://www.qmul.ac.uk/postgraduate/taught/cou...
..
Getting the table from the website:
import pandas as pd
data=pd.read_html("http://eecs.qmul.ac.uk/postgraduate/programmes/")
df=data[0]
df
Output
Postgraduate degree programmes Part-time(2 year) Full-time(1 year)
0 Advanced Electronic and Electrical Engineering H60C H60A
1 Artificial Intelligence I4U2 I4U1
.....
I am using beautifulsoup to scrape a website but need help with this as I am new to python and beautifulsoup
How do I get VET from the following
"[[VET]]"
This is my code so far
import bs4 as bs
import urllib.request
import pandas as pd
#This is the Home page of the website
source = urllib.request.urlopen('file:///C:/Users/Aiden/Downloads/stocks/Stock%20Premarket%20Trading%20Activity%20_%20Biggest%20Movers%20Before%20the%20Market%20Opens.html').read().decode('utf-8')
soup = bs.BeautifulSoup(source,'lxml')
#find the Div and put all info into varTable
table = soup.find('table',{"id":"decliners_tbl"}).tbody
#find all Rows in table and puts into varTableRows
tableRows = table.find_all('tr')
print ("There is ",len(tableRows),"Rows in the Table")
print(tableRows)
columns = [tableRows[1].find_all('td')]
print(columns)
a = [tableRows[1].find_all("a")]
print(a)
So my output from print(a) is "[[<a class="mplink popup_link" href="https://marketchameleon.com/Overview/VET/">VET</a>]]"
and I want to extract VET out
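You can use a.text or a.get_text() on a single tag. In your code, a is a list wrapped in another list, so index into it first (a[0][0].text).
If you have multiple elements, apply it to each one with a list comprehension.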
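A minimal sketch of the list-comprehension approach, using an inline row that mirrors the markup shown in the question (the surrounding table is an assumption):

```python
from bs4 import BeautifulSoup

# Hypothetical row built around the <a> tag printed in the question.
SAMPLE_HTML = (
    '<table><tr><td>'
    '<a class="mplink popup_link" '
    'href="https://marketchameleon.com/Overview/VET/">VET</a>'
    '</td></tr></table>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
row = soup.find("tr")

# find_all() already returns a list-like ResultSet, so no extra [] is needed.
anchors = row.find_all("a")
tickers = [a.get_text(strip=True) for a in anchors]

print(tickers)
print(tickers[0])
```

Dropping the extra square brackets around find_all() avoids the nested "[[...]]" structure in the first place.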
Thank you for all the replies. I was able to work it out using the following code:
source = urllib.request.urlopen('file:///C:/Users/Aiden/Downloads/stocks/Stock%20Premarket%20Trading%20Activity%20_%20Biggest%20Movers%20Before%20the%20Market%20Opens.html').read().decode('utf-8')
soup = bs.BeautifulSoup(source,'html.parser')
table = soup.find("table",id="decliners_tbl")
for decliners in table.find_all("tbody"):
    rows = decliners.find_all("tr")
    for row in rows:
        ticker = row.find("a").text
        volume = row.findAll("td", class_="rightcell")[3].text
        print(ticker, volume)
I'm trying to scrape the "team per game stats" table from this website using this code:
from urllib.request import urlopen as uo
from bs4 import BeautifulSoup as BS
import pandas as pd
url = 'https://www.basketball-reference.com/leagues/NBA_2020.html'
html = uo(url)
soup = BS(html, 'html.parser')
soup.findAll('tr')
headers = [th.getText() for th in soup.findAll('tr')]
headers = headers[1:]
print(headers)
rows = soup.findAll('tr')[1:]
team_stats = [[td.getText() for td in rows[i].findAll('td')]
              for i in range(len(rows))]
stats = pd.DataFrame(team_stats, columns=headers)
But it returns this error:
AssertionError: 71 columns passed, passed data had 212 columns
The problem is that the data is hidden in a commented section of the HTML. The table you want to extract is rendered with Javascript in your browser. Requesting the page with requests or urllib just yields the raw HTML.
So be aware that you have to examine the source code of the page with "View page source" rather than the rendered page with "Inspect Element" if you search for the proper tags to find with BeautifulSoup.
Try this:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.basketball-reference.com/leagues/NBA_2020.html'
html = requests.get(url)
section_start = '<span class="section_anchor" id="team-stats-per_game_link" data-label="Team Per Game Stats">'
block_start = html.text.split(section_start)[1].split("<!--")[1]
block = block_start.split("-->")[0]
soup = BeautifulSoup(block, 'html.parser')
data = [th.get_text(",") for th in soup.findAll('tr')]
header = data[0]
header = [x.strip() for x in header.split(",") if x.strip() !=""]
data = [x.split(",") for x in data[1:]]
pd.DataFrame(data, columns=header)
Explanation: You first need to find the commented section by simply splitting the raw HTML just before the section. You extract the section as text, convert to soup and then parse.
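As an alternative to splitting the raw text, BeautifulSoup exposes HTML comments as Comment nodes, which can be re-parsed on their own. A minimal sketch on an inline sample; the element ids and the table contents here are placeholders, not taken from the real page:

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical page fragment: the table is wrapped in an HTML comment.
SAMPLE_HTML = """
<div id="all_team-stats-per_game">
<!--
<table id="team-stats-per_game">
  <tr><th>Rk</th><th>Team</th><th>G</th></tr>
  <tr><td>1</td><td>Team A</td><td>10</td></tr>
  <tr><td>2</td><td>Team B</td><td>12</td></tr>
</table>
-->
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

table = None
# Comments are string nodes of type Comment; re-parse each and look for the table.
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    inner = BeautifulSoup(comment, "html.parser")
    table = inner.find("table", id="team-stats-per_game")
    if table is not None:
        break

# Collect header and data cells row by row.
rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")]
header, data = rows[0], rows[1:]
print(header)
print(data)
```

The resulting header and data lists can be passed to pd.DataFrame(data, columns=header) as in the answer above. On a real page, table may still be None after the loop, so a check before using it would be sensible.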
As of now, I'm only getting ['1'] as the output of what's being printed with my current code below. I want to grab 1-54 on the Team Batting table in the Rk column on the website https://www.baseball-reference.com/teams/NYY/2019.shtml.
How would I go about modifying colNum so it can print the 1-54 in the Rk column? I'm pointing out the colNum line because I feel the issue lies there but I could be wrong.
import pandas as pd
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.baseball-reference.com/teams/NYY/2019.shtml')
soup = BeautifulSoup(page.content, 'html.parser') # parse as HTML page, this is the source code of the page
week = soup.find(class_='table_outer_container')
items = week.find("thead").get_text() # grabs table headers
th = week.find("th").get_text() # grabs Rk only.
tbody = week.find("tbody")
tr = tbody.find("tr")
thtwo = tr.find("th").get_text()
colNum = [thtwo for thtwo in thtwo]
print(colNum)
Your mistake was in the last few lines as you mentioned. If I understood right, you wanted a list of all the values in the "Rk" column. In order to get all the rows, you have to use the find_all() function. I tweaked your code a little bit in order to get the text of the first field in each row in the following lines:
import pandas as pd
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.baseball-reference.com/teams/NYY/2019.shtml')
soup = BeautifulSoup(page.content, 'html.parser') # parse as HTML page, this is the source code of the page
week = soup.find(class_='table_outer_container')
items = week.find("thead").get_text()
th = week.find("th").get_text()
tbody = week.find("tbody")
tr = tbody.find_all("tr")
colnum = [row.find("th").get_text() for row in tr]
print(colnum)
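To see why the original code printed only ['1'], here is a small sketch on an inline table (the rows are made up for illustration). find() returns just the first match, and iterating over its text walks through the characters of that one string:

```python
from bs4 import BeautifulSoup

# Hypothetical table body with the rank in a <th> at the start of each row.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th>1</th><td>C</td></tr>
    <tr><th>2</th><td>1B</td></tr>
    <tr><th>10</th><td>DH</td></tr>
  </tbody>
</table>
"""

tbody = BeautifulSoup(SAMPLE_HTML, "html.parser").find("tbody")

# find() stops at the first <tr>, so this is the string "1".
first_only = tbody.find("tr").find("th").get_text()
chars = [ch for ch in first_only]  # iterates over the characters of "1"

# find_all() returns every <tr>, so each row contributes its own <th> text.
all_ranks = [row.find("th").get_text() for row in tbody.find_all("tr")]

print(chars)
print(all_ranks)
```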
How can I scrape the yahoo earnings calendar to pull out the dates?
This is for python 3.
from bs4 import BeautifulSoup as soup
import urllib
url = 'https://finance.yahoo.com/calendar/earnings?day=2019-06-13&symbol=ibm'
response = urllib.request.urlopen(url)
html = response.read()
page_soup = soup(html,'lxml')
table = page_soup.find('p')
print(table)
The output is "None".
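Beautiful Soup has several find functions that you can use to search the DOM; please refer to the documentation. For example, find_all('td') returns every cell, and you can then filter on the aria-label attribute: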
from bs4 import BeautifulSoup as soup
import urllib.request
url = 'https://finance.yahoo.com/calendar/earnings?day=2019-06-13&symbol=ibm'
response = urllib.request.urlopen(url)
html = response.read()
page_soup = soup(html,'lxml')
table = page_soup.find_all('td')
Dates = []
for something in table:
    try:
        if something['aria-label'] == "Earnings Date":
            Dates.append(something.text)
    except KeyError:
        # this <td> has no aria-label attribute; skip it
        pass
print(Dates)
Might be off-topic but since you want to get a table from a webpage, you might consider using pandas which works with two lines:
import pandas as pd
earnings = pd.read_html('https://finance.yahoo.com/calendar/earnings?day=2019-06-13&symbol=ibm')[0]
Here are two succinct ways:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://finance.yahoo.com/calendar/earnings?day=2019-06-13&symbol=ibm&guccounter=1')
soup = bs(r.content, 'lxml')
# using attribute = value selector
dates = [td.text for td in soup.select('[aria-label="Earnings Date"]')]
#using nth-of-type to get column
dates = [td.text for td in soup.select('#cal-res-table td:nth-of-type(3)')]
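Both selectors can be tried offline on a small sample. This is only a sketch: the table below is a stand-in with placeholder cell values (DATE-1, DATE-2), and the real page's markup may differ or change over time:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the calendar table; cell values are placeholders.
SAMPLE_HTML = """
<table id="cal-res-table">
  <tbody>
    <tr><td aria-label="Symbol">IBM</td>
        <td aria-label="Company">Company A</td>
        <td aria-label="Earnings Date">DATE-1</td></tr>
    <tr><td aria-label="Symbol">IBM</td>
        <td aria-label="Company">Company A</td>
        <td aria-label="Earnings Date">DATE-2</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Attribute = value selector: matches any element with this aria-label.
by_attribute = [td.get_text(strip=True)
                for td in soup.select('[aria-label="Earnings Date"]')]

# Positional selector: the third <td> in each row of the table.
by_position = [td.get_text(strip=True)
               for td in soup.select('#cal-res-table td:nth-of-type(3)')]

print(by_attribute)
print(by_position)
```

The attribute selector keeps working if the column order changes, while the positional selector keeps working if the labels change, so which one is more robust depends on the page.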