Fetch html table row data using python 3.6 beautiful soup - python

I have the HTML table below and want to fetch the data in its first row, i.e. "Revenues ($M) $135,987". How can I achieve this using Python and BeautifulSoup?
<table data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0">
<thead data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0">
<tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0">
<th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.0" width="200">
</th>
<th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.1:$th-$ millions">
$ millions
</th>
<th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.1:$th-% change">
% change
</th>
</tr>
</thead>
<tbody data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1">
<tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M)">
<td class="title" data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).0">
Revenues ($M)
</td>
<td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1">
$135,987
</td>
<td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).2">
27.1%
</td>
</tr>
Script to extract the data directly from the source page:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://fortune.com/fortune500/amazon-com/')
soup = bs(r.content, 'html.parser')
result = soup.find('div', {'class': 'small-12 columns'})
table = result.find_all('table')[0] # Grab the first table
print(table.find('td', {'data-reactid': '.romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1'}).text)

Select the td whose 'data-reactid' attribute has the value '.romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1' and read its text.
from bs4 import BeautifulSoup
html = """<table data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0">
<thead data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0">
<tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0">
<th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.0" width="200">
</th>
<th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.1:$th-$ millions">
$ millions
</th>
<th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.1:$th-% change">
% change
</th>
</tr>
</thead>
<tbody data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1">
<tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M)">
<td class="title" data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).0">
Revenues ($M)
</td>
<td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1">
$135,987
</td>
<td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).2">
27.1%
</td>
</tr>
<tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Profits ($M)">
<td class="title" data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Profits ($M).0">
Profits ($M)
</td>
<td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Profits ($M).1">
$2,371.0
</td>
<td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Profits ($M).2">
297.8%
</td>
</tr>
</tbody>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('td', {'data-reactid': '.romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1'}).text)
Outputs:
$135,987
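The data-reactid values look auto-generated, so matching on one exact value could break if the page is regenerated. As a minimal sketch of an alternative (the trimmed HTML and the helper name row_values are assumptions made for illustration, not part of the original answer), the row can be located by the text of its title cell instead:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question, with the data-reactid attributes removed.
html = """
<table>
<tbody>
<tr><td class="title">Revenues ($M)</td><td>$135,987</td><td>27.1%</td></tr>
<tr><td class="title">Profits ($M)</td><td>$2,371.0</td><td>297.8%</td></tr>
</tbody>
</table>
"""


def row_values(soup, label):
    """Return the text of the cells that follow the title cell matching label."""
    for title in soup.find_all("td", class_="title"):
        if title.get_text(strip=True) == label:
            return [td.get_text(strip=True) for td in title.find_next_siblings("td")]
    return None  # no row with that label


soup = BeautifulSoup(html, "html.parser")
revenues = row_values(soup, "Revenues ($M)")
print(revenues[0])
```

The helper returns None when no title cell matches, so a caller can tell a missing row apart from an empty one.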
Updated in response to a comment:
The page is rendered with JavaScript, so you can use Selenium to render it.
First install Selenium:
sudo pip3 install selenium
Then get a driver from https://sites.google.com/a/chromium.org/chromedriver/downloads. You can use a headless version of Chrome ("Chrome Canary") if you are on Windows or Mac.
import bs4 as bs
from selenium import webdriver
browser = webdriver.Chrome()
url = "http://fortune.com/fortune500/amazon-com/"
browser.get(url)
html_source = browser.page_source
browser.quit()
soup = bs.BeautifulSoup(html_source, "html.parser")
# print (soup)
tds = soup.find_all('td')
print(tds[1].text)
Or, for other non-Selenium methods, see my answer to "Scraping Google Finance (BeautifulSoup)".
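The Selenium approach can also be run without a visible browser window. The following is only a sketch under assumptions: the function names fetch_rendered_html and second_cell_text are made up for illustration, a matching ChromeDriver is assumed to be installed, and the "--headless" argument is the commonly used Chrome switch. The parsing step is kept separate so it can be tried on a static string without a browser.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url):
    """Load url in headless Chrome and return the rendered page source."""
    # Imported here so the parsing helper below works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    browser = webdriver.Chrome(options=options)
    try:
        browser.get(url)
        return browser.page_source
    finally:
        browser.quit()  # always release the browser, even if loading fails


def second_cell_text(html_source):
    """Return the text of the second td on the page, or None if there is none."""
    tds = BeautifulSoup(html_source, "html.parser").find_all("td")
    return tds[1].get_text(strip=True) if len(tds) > 1 else None


# Static stand-in for the rendered page, so this runs without a browser.
sample = "<table><tr><td>Revenues ($M)</td><td>$135,987</td><td>27.1%</td></tr></table>"
value = second_cell_text(sample)
print(value)
```

With a driver available, the two steps would be combined as second_cell_text(fetch_rendered_html(url)).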

Related

BS4 get the TH data within the table

I am trying to read data from a website which has a table like this:
<table border="0" width = "100%">
<tr>
<th width="33%">Product</th>
<th width="34%">ID</th>
<th width="33%">Serial</th>
</tr>
<tr>
<td align='center'>
<a target="_TOP" href="Link1.html">ProductName</a>
<br>
<a href='Link2.html' TARGET='_TOP'><img src='https://?uid=1'></a>
</td>
<td align='center'>
<a target="_TOP" href="Link2.html">ProductID</a>
<br>
<a href='Link2.html' TARGET='_TOP'><img src='https://?uid=2'></a>
</td>
<td align='center'>
<a target="_TOP" href="Link3.html">ProductSerial</a>
<br>
<a href='Link2.html' TARGET='_TOP'><img src='https://?uid=3'></a>
</td>
</tr>
</table>
and all I want from this table is the ProductID, which is the content inside the a tag.
The problem is that I am trying to use BS4 to find the tag and read its contents, but how do I point BS4 to it accurately?
I have tried:
import bs4
with open("src/file.html", 'r') as inf:
    html = inf.read()
soup = bs4.BeautifulSoup(html, features="html.parser")
for container in soup.find_all("table", {"td": ""}):
    print(container)
But it doesn't work. Is there any way to achieve this, i.e. to read the content inside the a tag?
You can use the :nth-of-type CSS selector:
print(soup.select_one("td:nth-of-type(2) a:nth-of-type(1)").text)
Output:
ProductID
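As a self-contained check of the selector above, here is a small sketch that runs it against the table from the question and compares it with an index-based lookup. The variable names are illustrative only, and the index-based variant is an alternative, not something the answer proposed.

```python
from bs4 import BeautifulSoup

html = """
<table border="0" width="100%">
<tr>
<th width="33%">Product</th>
<th width="34%">ID</th>
<th width="33%">Serial</th>
</tr>
<tr>
<td align='center'>
<a target="_TOP" href="Link1.html">ProductName</a>
<br>
<a href='Link2.html' TARGET='_TOP'><img src='https://?uid=1'></a>
</td>
<td align='center'>
<a target="_TOP" href="Link2.html">ProductID</a>
<br>
<a href='Link2.html' TARGET='_TOP'><img src='https://?uid=2'></a>
</td>
<td align='center'>
<a target="_TOP" href="Link3.html">ProductSerial</a>
<br>
<a href='Link2.html' TARGET='_TOP'><img src='https://?uid=3'></a>
</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS route: the second td among its siblings, then the first a inside it.
by_selector = soup.select_one("td:nth-of-type(2) a:nth-of-type(1)").text

# Index route: the second td in document order, then its first a.
by_index = soup.find_all("td")[1].find("a").get_text()

print(by_selector, by_index)
```

Both routes depend on the column position, so they would need adjusting if the table layout changed.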

Beautiful Soup Not Finding Basic HTML Data

I'm trying to extract data from a page using BeautifulSoup. I obtain my HTML data (type: bs4.element.ResultSet), and it contains multiple lines such as the following, which I would like to compile into a list:
<td class="va-infobox-label" colspan="1" style="" title="">Weight</td>
But when I run a line such as one of those shown below...
labels = soup.find_all("va-infobox-label")
labels = soup.find(colspan="1", style="")
...I get an attribute error. Alternatively running...
labels = soup.find_all("va-infobox-label")
...returns a syntax error
What command or tool should I be using if not find to obtain all lines containing va-infobox-label? My end goal is to obtain a list of labels from this HTML, one of which will be 'weight' as per my example (title="">Weight<).
If you need to replicate the error:
import requests
from bs4 import BeautifulSoup
as_val_url = 'https://escapefromtarkov.gamepedia.com/AS_VAL'
as_val_page = requests.get(as_val_url)
as_val_soup = BeautifulSoup(as_val_page.content, 'html.parser')
soup = as_val_soup.find_all(id="va-infobox0-content")
labels = soup.find_all("va-infobox-label")
If a glance at the HTML would help you, a public 'beautified' copy of it is present in my pastebin. My example is from line 36.
You can use soup.select to search via CSS selectors, or soup.find_all, as shown below:
from bs4 import BeautifulSoup
from io import StringIO
data = '''
<tr>
<td class="va-infobox-label" colspan="1" style="" title="">Slot</td>
<td class="va-infobox-spacing-h"></td>
<td class="va-infobox-content" colspan="1" style="" title="">Primary</td>
</tr>
<tr class="va-infobox-spacing">
<td class="va-infobox-spacing-v" colspan="3"></td>
</tr>
<tr>
<td class="va-infobox-label" colspan="1" style="" title="">Weight</td>
<td class="va-infobox-spacing-h"></td>
<td class="va-infobox-content" colspan="1" style="" title="">2.587 kg</td>
</tr>
<tr class="va-infobox-spacing">
<td class="va-infobox-spacing-v" colspan="3"></td>
</tr>
<tr>
<td class="va-infobox-label" colspan="1" style="" title="">Grid size</td>
<td class="va-infobox-spacing-h"></td>
<td class="va-infobox-content" colspan="1" style="" title="">5x2</td>
</tr>
'''
f = StringIO(data)
soup = BeautifulSoup(f, 'html.parser')
for e in soup.find_all('td', {'class': 'va-infobox-label'}):
    print('find_all', e)
for e in soup.select('.va-infobox-label'):
    print('select', e)
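The attribute error in the question most likely comes from calling find_all on the result of an earlier find_all, which is a ResultSet (a list of tags) rather than a single tag. The sketch below illustrates this on a reduced sample. It assumes the element with id "va-infobox0-content" is a table wrapping the rows, which the question does not show, and it collects the label texts into a list as the question asked.

```python
from bs4 import BeautifulSoup

# Reduced sample; the wrapping table element is an assumption for illustration.
html = """
<table id="va-infobox0-content">
<tr><td class="va-infobox-label">Slot</td><td class="va-infobox-content">Primary</td></tr>
<tr><td class="va-infobox-label">Weight</td><td class="va-infobox-content">2.587 kg</td></tr>
<tr><td class="va-infobox-label">Grid size</td><td class="va-infobox-content">5x2</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all returns a ResultSet, which has no find_all method of its own.
result_set = soup.find_all(id="va-infobox0-content")
raised_attribute_error = False
try:
    result_set.find_all("td")
except AttributeError:
    raised_attribute_error = True

# find returns a single Tag, which can be searched further.
container = soup.find(id="va-infobox0-content")
labels = [td.get_text(strip=True)
          for td in container.find_all("td", class_="va-infobox-label")]
print(labels)
```

Note that the class is passed as class_ because class is a reserved word in Python.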

Scrape Table Using Scrapy

apologies for the long post -
I have a table that I am trying to dig into using scrapy, but can't quite figure out how to dig into this table deep enough.
This is the table:
<table class="detail-table" border="0" cellspacing="0">
<tbody>
<tr id="trAnimalID">
...
</tr>
<tr id="trSpecies">
...
</tr>
<tr id="trBreed">
...
</tr>
<tr id="trAge">
...
<tr id="trSex">
...
</tr>
<tr id="trSize">
...
</tr>
<tr id="trColor">
...
</tr>
<tr id="trDeclawed">
...
</tr>
<tr id="trHousetrained">
...
</tr>
<tr id="trLocation">
...
</tr>
<tr id="trIntakeDate">
<td class="detail-label" align="right">
<b>Intake Date</b>
</td>
<td class="detail-value">
<span id="lblIntakeDate">3/31/2020</span>
</td>
</tr>
<tr id="trStage">
<td class="detail-label" align="right">
<b>Stage</b>
</td>
<td class="detail-value">
<span id="lblStage">Reserved</span>
</td>
</tr>
</tbody></table>
I can dig into it using the scrapy shell command:
text = response.xpath('//*[@class="detail-table"]//tr')[10].extract()
I am getting back this:
'<tr id="trIntakeDate">\r\n\t
<td class="detail-label" align="right">\r\n
<b>Intake Date</b>\r\n
</td>\r\n\t
<td class="detail-value">\r\n
<span id="lblIntakeDate">3/31/2020</span>\xa0\r\n
</td>\r\n
</tr>'
I can't quite figure out how to just get the value for lblIntakeDate. I just need 3/31/2020. Additionally, I'd like to run this as a lambda, and can't quite figure out how to get the execute function to dump out a json file like I can using command line. Any ideas?
Try this:
//table[@class='detail-table']/tbody//tr/td/span[@id='lblIntakeDate']/text()
You can test it at https://www.online-toolz.com/tools/xpath-tester-online.php
And please remove redundant characters such as
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = ''  # target URL (left empty in the original)
html = urlopen(url)
bs = BeautifulSoup(html.read(), 'html.parser')
for i in bs.find_all('a'):
    print(i.get_text())
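An XPath that targets the lblIntakeDate span can be checked outside Scrapy. This is a minimal sketch using lxml (which also appears elsewhere on this page) on a reduced copy of the table; the reduced markup and the variable names are assumptions for illustration. In Scrapy itself, the same expression would typically be passed to response.xpath(...) and read with .get().

```python
from lxml import html

# Reduced copy of the table from the question; &nbsp; stands in for the \xa0 seen in the output.
snippet = """
<table class="detail-table" border="0" cellspacing="0">
<tbody>
<tr id="trIntakeDate">
<td class="detail-label" align="right"><b>Intake Date</b></td>
<td class="detail-value"><span id="lblIntakeDate">3/31/2020</span>&nbsp;</td>
</tr>
<tr id="trStage">
<td class="detail-label" align="right"><b>Stage</b></td>
<td class="detail-value"><span id="lblStage">Reserved</span></td>
</tr>
</tbody></table>
"""

tree = html.fromstring(snippet)

# Select only the text node of the span, not the whole row.
values = tree.xpath('//table[@class="detail-table"]//span[@id="lblIntakeDate"]/text()')
intake_date = values[0].strip() if values else None
print(intake_date)
```

Selecting the span by id avoids depending on the row index ([10] in the question), which shifts whenever a row is added or removed.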

Web scraping in Python - how to capture all <a> elements

I'm using beautifulsoup4 to scrape data from the lyrics.com website, specifically this link: https://www.lyrics.com/album/1447935.
From this block, I'm trying to extract both <a> elements:
[<table class="tdata">
<colgroup>
<col style="width: 50px;"/>
<col style="width: 430px;"/>
<col style="width: 80px;"/>
<col style="width: 80px;"/>
</colgroup>
<thead>
<tr>
<th>#</th>
<th>Song</th>
<th>Duration</th>
<th> </th>
</tr>
</thead>
<tbody>
<tr>
<td class="tal qx">1</td>
<td class="tal qx">
<strong>
Make You Feel My Love
</strong>
</td>
<td class="tal qx">3:32</td>
<td class="tal vam rt">
</td></tr><tr><td class="tal qx">2</td>
<td class="tal qx">
<strong>
Painting Pictures
</strong>
</td>
<td class="tal qx">3:33</td>
<td class="tal vam rt"> </td>
</tr>
</tbody>
</table>]
This is my code:
url = "http://www.lyrics.com" + album_url
page = r.get(url)
soup = bs(page.content, "html.parser")
songs = [a.get('href') for a in (table.find('a') for table in soup.findAll('table')) if a]
However, it's only returning the first <a>:
['/lyric/15183453/Make+You+Feel+My+Love']
What could be wrong?
Edit: Thank you all for the answers! I upvoted but I don't have enough rep for it to show
This will work:
songs = [song['href'] for song in soup.select('table a')]
Output:
['/lyric/15183453/Make+You+Feel+My+Love', '/lyric/15183454/Painting+Pictures']
Was able to make it work with:
for a in soup.findAll('a'):
    if a.parent.name == 'strong':
        if a.parent.parent.name == 'td':
            print(a["href"])
Still not sure why the other method doesn't work, though, since I've used it elsewhere in my program with no issues.
The other solutions work fine; however, I prefer using good old selectors:
from bs4 import BeautifulSoup as bs
import requests as req
page = req.get('https://www.lyrics.com/album/1447935')
soup = bs(page.content, 'html.parser')
links = soup.select('table.tdata a[href]')
print(links)
This will print
[Make You Feel My Love, Painting Pictures]
If you aren't familiar with selectors: this matches table elements that have the class tdata and then selects every a element inside them that has an href attribute.
Looks like you want table.findAll instead of table.find.
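To illustrate why the original comprehension returned a single link: find returns only the first match inside each table, and the page has one table, so only one a element comes back. A small sketch follows; the HTML is a reduced reconstruction that reuses the two href values from the output shown earlier, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Reduced reconstruction of the song table; hrefs taken from the output quoted in the thread.
html = """
<table class="tdata"><tbody>
<tr><td class="tal qx"><strong>
<a href="/lyric/15183453/Make+You+Feel+My+Love">Make You Feel My Love</a>
</strong></td></tr>
<tr><td class="tal qx"><strong>
<a href="/lyric/15183454/Painting+Pictures">Painting Pictures</a>
</strong></td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Original approach: one find() per table, so at most one link per table.
first_only = [a.get("href")
              for a in (table.find("a") for table in soup.find_all("table"))
              if a]

# find_all() per table collects every link in every table.
all_links = [a["href"]
             for table in soup.find_all("table")
             for a in table.find_all("a", href=True)]

print(first_only)
print(all_links)
```

The same approach would have appeared to work elsewhere if each table there held only one link.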

scrapy xpath return empty list while response is not empty

I'm using scrapy to scrape information from 2 tables on the website
First I scrape the tables. It turns out that staffs and students are empty even though the response is not empty. I can also find the table tag in the page source. Can anyone spot the problem?
import scrapy
from universities.items import UniversitiesItem
class UniversityOfSouthCarolinaColumbia(scrapy.Spider):
    name = 'uscc'
    allowed_domains = ['sc.edu']
    start_urls = ['http://www.sc.edu/about/directory/?name=']
    def parse(self, response):
        for ln in ['Zhao']:
            query = response.url + ln
            yield scrapy.Request(query, callback=self.parse_item)
    @staticmethod
    def parse_item(response):
        staffs = response.xpath('//table[@id="directorystaff"]/tbody/tr[@role="row"]')
        students = response.xpath('//table[@id="directorystudent"]/tbody/tr[@role="row"]')
        print('--------------------------')
        print('staffs', staffs)
        print('==========================')
        print('students', students)
It's a really cool question. I investigated it and concluded that the response does not contain the attribute you are filtering on. I think the browser modifies the page source with a script that adds the attribute to the tags.
In the response, the tr tags do not have a 'role' attribute.
Please see:
<table class="display" id="directorystaff" width="100%">
<thead>
<tr>
<th style="text-align: left">Name</th>
<th style="text-align: left">Email</th>
<th style="text-align: left">Phone</th>
<th style="text-align: left">Department</th>
<th style="text-align: left">Office Address</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">Zhao, Xia </td>
<td style="text-align: left"> </td>
<td style="text-align: left">(803) 777-8436 </td>
<td style="text-align: left">Chemistry </td>
<td style="text-align: left">537 </td>
</tr>
<tr>
<td style="text-align: left">Zhao, Xing </td>
<td style="text-align: left"> </td>
<td style="text-align: left"> </td>
<td style="text-align: left">Mechanical Engineering </td>
<td style="text-align: left"> </td>
</tr>
(The original answer showed two screenshots here: the raw response page, and the same page as rendered in the browser.)
So, if you want the list of staff, I recommend the following XPath:
//table[@id="directorystaff"]/tbody/tr/td
And for students, I recommend the following XPath:
//table[@id="directorystudent"]/tbody/tr/td
If you want something else, you can modify the XPath query.
Here is an example:
import requests
from lxml import html
x = requests.get("https://www.sc.edu/about/directory/?name=Zhao")
ht = html.fromstring(x.text)
element = ht.xpath('//table[@id="directorystaff"]/tbody/tr/td')
for el in element:
    print(el.text)
And output:
>>Zhao, Xia  
>> 
>>(803) 777-8436 
>>Chemistry 
>>537 
>>Zhao, Xing  
>> 
>> 
>>Mechanical Engineering 
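Selecting td directly yields a flat list of cells. If one record per person is more convenient, the rows can be kept together. The following is a sketch on a static copy of the response excerpt quoted in the answer (the markup is simplified and the variable names are assumptions); it also shows that filtering on role="row" matches nothing in the raw HTML.

```python
from lxml import html

# Simplified static copy of the response excerpt shown in the answer.
raw = """
<table class="display" id="directorystaff" width="100%">
<thead>
<tr><th>Name</th><th>Email</th><th>Phone</th><th>Department</th><th>Office Address</th></tr>
</thead>
<tbody>
<tr><td>Zhao, Xia </td><td> </td><td>(803) 777-8436 </td><td>Chemistry </td><td>537 </td></tr>
<tr><td>Zhao, Xing </td><td> </td><td> </td><td>Mechanical Engineering </td><td> </td></tr>
</tbody>
</table>
"""
tree = html.fromstring(raw)

# The role attribute is absent from the raw response, so this matches no rows.
with_role = tree.xpath('//table[@id="directorystaff"]/tbody/tr[@role="row"]')

# Without the role filter, every body row is matched.
rows = tree.xpath('//table[@id="directorystaff"]/tbody/tr')
headers = [th.text_content().strip()
           for th in tree.xpath('//table[@id="directorystaff"]/thead/tr/th')]

# One dict per row, keyed by the column headers.
staff = [dict(zip(headers, (td.text_content().strip() for td in tr.xpath('./td'))))
         for tr in rows]
print(staff)
```

Empty cells come back as empty strings, which keeps every record aligned with the header row.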
