Beautiful Soup Not Finding Basic HTML Data - python

I'm trying to extract data from a page using BeautifulSoup. I obtain my HTML data (type: bs4.element.ResultSet) and it contains mutliple lines such as the following, which I would like to compile into a list:
<td class="va-infobox-label" colspan="1" style="" title="">Weight</td>
But when I run a line such as one of those shown below...
labels = soup.find_all("va-infobox-label")
labels = soup.find(colspan="1", style="")
...I get an attribute error. Alternatively running...
labels = soup.find_all("va-infobox-label")
...returns a syntax error
What command or tool should I be using if not find to obtain all lines containing va-infobox-label? My end goal is to obtain a list of labels from this HTML, one of which will be 'weight' as per my example (title="">Weight<).
If you need to replicate the error:
import requests
from bs4 import BeautifulSoup
as_val_url = 'https://escapefromtarkov.gamepedia.com/AS_VAL'
as_val_page = requests.get(as_val_url)
as_val_soup = BeautifulSoup(as_val_page.content, 'html.parser')
soup = as_val_soup.find_all(id="va-infobox0-content")
labels = soup.find_all("va-infobox-label")
If a glance at the HTML would help you, a public 'beautified' copy of it is present in my pastebin. My example is from line 36.

You can use soup.select to search via CSS selectors or soup.find_all as below
from bs4 import BeautifulSoup
from io import StringIO
data = '''
<tr>
<td class="va-infobox-label" colspan="1" style="" title="">Slot</td>
<td class="va-infobox-spacing-h"></td>
<td class="va-infobox-content" colspan="1" style="" title="">Primary</td>
</tr>
<tr class="va-infobox-spacing">
<td class="va-infobox-spacing-v" colspan="3"></td>
</tr>
<tr>
<td class="va-infobox-label" colspan="1" style="" title="">Weight</td>
<td class="va-infobox-spacing-h"></td>
<td class="va-infobox-content" colspan="1" style="" title="">2.587 kg</td>
</tr>
<tr class="va-infobox-spacing">
<td class="va-infobox-spacing-v" colspan="3"></td>
</tr>
<tr>
<td class="va-infobox-label" colspan="1" style="" title="">Grid size</td>
<td class="va-infobox-spacing-h"></td>
<td class="va-infobox-content" colspan="1" style="" title="">5x2</td>
</tr>
'''
f = StringIO(data)
soup = BeautifulSoup(f, 'html.parser')
for e in soup.find_all('td', {'class': 'va-infobox-label'}):
print('find_all', e)
for e in soup.select('.va-infobox-label'):
print('select', e)

Related

BS4 get the TH data within the table

I am trying to read data from a website which has a table like this:
<table border="0" width = "100%">
<tr>
<th width="33%">Product</th>
<th width="34%">ID</th>
<th width="33%">Serial</th>
</tr>
<tr>
<td align='center'>
<a target="_TOP" href="Link1.html">ProductName</a>
<br>
<a href='Link2.html' TARGET='_TOP'><img src='https://?uid=1'></a>
</td>
<td align='center'>
<a target="_TOP" href="Link2.html">ProductID</a>
<br>
<a href='Link2.html' TARGET='_TOP'><img src='https://?uid=2'></a>
</td>
<td align='center'>
<a target="_TOP" href="Link3.html">ProductSerial</a>
<br>
<a href='Link2.html' TARGET='_TOP'><img src='https://?uid=3'></a>
</td>
</tr>
</table>
and all I want from this table is the ProductID which is content inside of the tag.
The problem is, I am trying to use BS4 for this, to find the TAG, and read inside of it, but how to accurately point BS4 to it?
I have tried:
with open("src/file.html", 'r') as inf:
html = inf.read()
soup = bs4.BeautifulSoup(html, features="html.parser")
for container in soup.find_all("table", {"td": ""}):
print(container)
But doesn't work..Is there Any way to achieve this? To read the content inside of the a tag?
You can use the :nth-of-type CSS selector:
print(soup.select_one("td:nth-of-type(2) a:nth-of-type(1)").text)
Output:
ProductID

How to extract data with beautifulsoup with similar attributes

I'm trying to scrape a saved html page of results and copy the entries for each and iterate through the document. However I can't figure out how to narrow down the element to start. The data I want to grab is in the "td" tags below each of the following "tr" tags:
<tr bgcolor="#d7d7d7">
<td valign="top" nowrap="">
Submittal<br>20190919-5000
<!-- ParentAccession= -->
<br>
</td>
<td valign="top">
09/18/2019<br>
09/19/2019
</td>
<td valign="top" nowrap="">
ER19-2760-000<br>ER19-2762-000<br>ER19-2763-000<br>ER19-2764-000<br>ER1 9-2765-000<br>ER19-2766-000<br>ER19-2768-000<br><br>
</td>
<td valign="top">
(doc-less) Motion to Intervene of Snohomish County Public Utility District No. 1 under ER19-2760, et. al..<br>Availability: Public<br>
</td>
<td valign="top">
<classtype>Intervention /<br> Motion/Notice of Intervention</classtype>
</td>
<td valign="top">
<table valign="top">
<input type="HIDDEN" name="ext" value="TXT"><tbody><tr><td valign="top"> <input type="checkbox" name="subcheck" value="V:14800341:12904817:15359058:TXT"></td><td> Text</td><td> & nbsp; 0K</td></tr><input type="HIDDEN" name="ext" value="PDF"><tr><td valign="top"> <input type="checkbox" name="subcheck" value="V:14800341:12904822:15359063:PDF"></td><td> FERC Generated PDF</td><td> 11K</td></tr>
</tbody></table>
</td>
The next tag is: with the same structure as the one above. These alternate so the results are in different colors on the results page.
I need to go through all of the subsequent td tags and grab the data but they aren't differentiated by a class or anything I can zero in on. The code I wrote grabs the entire contents of the td tags text and appends it but I need to treat each td tag as a separate item and then do the same for the next entry etc.
By setting the td[0] value I start at the first td tag but I don't think this is the correct approach.
from bs4 import BeautifulSoup
import urllib
import re
soup = BeautifulSoup(open("/Users/Desktop/FERC/uploads/ferris_9-19-2019-9-19-2019.electric.submittal.html"), "html.parser")
data = []
for td in soup.findAll(bgcolor=["#d7d7d7", "White"]):
values = [td[0].text.strip() for td in td.findAll('td')]
data.append(values)
print(data)

Fetch html table row data using python 3.6 beautiful soup

I have below html table and want to fetch table data i.e "Revenues ($M) $135,987" which exist in first row of table. How to achieve this using python beautifulsoup.
<table data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0">
<thead data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0">
<tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0">
<th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.0" width="200">
</th>
<th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.1:$th-$ millions">
$ millions
</th>
<th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.1:$th-% change">
% change
</th>
</tr>
</thead>
<tbody data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1">
<tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M)">
<td class="title" data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).0">
Revenues ($M)
</td>
<td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1">
$135,987
</td>
<td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).2">
27.1%
</td>
</tr>
Script to extract data from direct source:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://fortune.com/fortune500/amazon-com/')
soup = bs(r.content, 'html.parser')
result = soup.find('div', {'class': 'small-12 columns'})
table = result.find_all('table')[0] # Grab the first table
print(table.find('td', {'data-reactid': '.romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1'}).text)
Select the 'data-reactid' with the value '.romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1'} and read it's text.
from bs4 import BeautifulSoup
html = """<table data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0">
<thead data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0">
<tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0">
<th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.0" width="200">
</th>
<th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.1:$th-$ millions">
$ millions
</th>
<th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.1:$th-% change">
% change
</th>
</tr>
</thead>
<tbody data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1">
<tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M)">
<td class="title" data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).0">
Revenues ($M)
</td>
<td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1">
$135,987
</td>
<td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).2">
27.1%
</td>
</tr>
<tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Profits ($M)">
<td class="title" data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Profits ($M).0">
Profits ($M)
</td>
<td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Profits ($M).1">
$2,371.0
</td>
<td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Profits ($M).2">
297.8%
</td>
</tr>
</tbody>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('td', {'data-reactid': '.romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1'}).text)
Outputs:
$135,987
Updated in response to comment:
the page is rendered with JavaScript you can use Selenium to render it:
First install Selenium:
sudo pip3 install selenium
Then get a driver https://sites.google.com/a/chromium.org/chromedriver/downloads you can use a headless version of chrome "Chrome Canary" if you are on Windows or Mac.
import bs4 as bs
from selenium import webdriver
browser = webdriver.Chrome()
url = "http://fortune.com/fortune500/amazon-com/"
browser.get(url)
html_source = browser.page_source
browser.quit()
soup = bs.BeautifulSoup(html_source, "html.parser")
# print (soup)
tds = soup.find_all('td')
print(tds[1].text)
Or for other non-selenium methods see my answer to Scraping Google Finance (BeautifulSoup)

How to extract elements from html with BeautifulSoup

I am beginning to learn python and would like to try to use BeautifulSoup to extract the elements in the below html.
This html is taken from a voice recording system that logs the time and date in local time, UTC, call duration, called number, name, calling number, name, etc
There are usually hundreds of these entries.
What I am attempting to do is extract the elements and print them in one line to a comma delimited format in order to compare with call detail records from call manager. This will help to verify that all calls were recorded and not missed.
I believe BeautifulSoup is the right tool to do this.
Could someone point me in the right direction?
<tbody>
<tr class="formRowLight">
<td class="formRowLight" >24/10/16<br>16:24:47</td>
<td class="formRowLight" >24/10/16 07:24:47</td>
<td class="formRowLight" >00:45</td>
<td class="formRowLight" >31301</td>
<td class="formRowLight" >Joe Smith</td>
<td class="formRowLight" >31111</td>
<td class="formRowLight" >Jane Doe</td>
<td class="formRowLight" >N/A</td>
<td class="formRowLight" >1432875648934</td>
<td align="center" class"formRowLight"> </td>
<tr class="formRowLight">
<td class="formRowLight" >24/10/16<br>17:33:02</td>
<td class="formRowLight" >24/10/16 08:33:02</td>
<td class="formRowLight" >00:58</td>
<td class="formRowLight" >35664</td>
<td class="formRowLight" >Billy Bob</td>
<td class="formRowLight" >227045665</td>
<td class="formRowLight" >James Dean</td>
<td class="formRowLight" >N/A</td>
<td class="formRowLight" >9934959586849</td>
<td align="center" class"formRowLight"> </td>
</tr>
</tbody>
The pandas.read_html() would make things much easier - it would convert your tabular data from the HTML table into a dataframe which, if needed, you can later dump into CSV.
Here is a sample code to get you started:
import pandas as pd
data = """
<table>
<thead>
<tr>
<th>Date</th>
<th>Name</th>
<th>ID</th>
</tr>
</thead>
<tbody>
<tr class="formRowLight">
<td class="formRowLight">24/10/16<br>16:24:47</td>
<td class="formRowLight">Joe Smith</td>
<td class="formRowLight">1432875648934</td>
</tr>
<tr class="formRowLight">
<td class="formRowLight">24/10/16<br>17:33:02</td>
<td class="formRowLight">Billy Bob</td>
<td class="formRowLight">9934959586849</td>
</tr>
</tbody>
</table>"""
df = pd.read_html(data)[0]
print(df.to_csv(index=False))
Prints:
Date,Name,ID
24/10/1616:24:47,Joe Smith,1432875648934
24/10/1617:33:02,Billy Bob,9934959586849
FYI, read_html() actually uses BeautifulSoup to parse HTML under-the-hood.
import BeautifulSoup
import urllib2
import requests
request = urllib2.Request(your url)
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
mylist = []
div = soup.findAll('tr', {"class":"formRowLight"})
for line in div:
text= video.findNext('td',{"class":"formRowLight"}).text
mylist.append(text)
print mylist
But you need to edit this code a litt to prevent any duplicated content.
Yes, BeautifulSoup is a good tool to reach for in this problem. Something to get you started would be as follows:
from bs4 import BeautifulSoup
with open("my_log.html") as log_file:
html = log_file.read()
soup = BeautifulSoup(html)
#normally you specify a parser too `(html, 'lxml')` for example
#without specifying a parser, it will warn you and select one automatically
table_rows = soup.find_all("tr") #get list of all <tr> tags
for row in table_rows:
table_cells = row.find_all("td") #get list all <td> tags in row
joined_text = ",".join(cell.get_text() for cell in table_cells)
print(joined_text)
However, pandas's read_html may make this a bit more seamless, as mentioned in another answer to this question. Arguably pandas may be a better hammer to hit this nail with, but learning to use BeautifulSoup for this will also give you the skills to scrape all kinds of HTML in the future.
First get list of html strings, To get that follow this Convert BeautifulSoup4 HTML Table to a list of lists, iterating over each Tag elements
Then perform following operation in that,
This will fetch you all values of elements you desire !
for element in html_list:
output = soup.select(element)[0].text
print("%s ," % output)
This will give you what you desires,
Hope that helps !

Beautifulsoup Unable to Find Classes with Hyphens in Their Name

I am using BeautifulSoup4 on a MacOSX running Python 2.7.8. I am having difficulty extracting information from the following html code
<tbody tabindex="0" class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650">
<tr id="yui-rec0" class="yui-dt-first yui-dt-even">
<td headers="yui-dt0-th-rank" class="rank yui-dt0-col-rank"></td>
</tr>
<tr id="yui-rec1" class="yui-dt-odd">...</tr>
<tr id="yui-rec2" class="yui-dt-even">...</tr>
</tbody>
I can't seem to grab the table or any of it's contents because BS and/or python doesn't seem to recognize values with hyphens. So the usual code, something like
Table = soup.find('tbody',{'class':'yui-dt-data'})
or
Row2 = Table.find('tr',{'id':'yui-rec2'})
just returns an empty object (not NONE, simply empty). I'm not new to BS4 or Python and I've extracted information from this site before, but the class names are different now than when I previously did it. Now everything has hyphens. Is there any way to get Python to recognize the hyphen or a workaround?
I need to have my code be general so that I can run it across numerous pages that all have the same class name. Unfortunately, the id attribute in <tbody> is unique to that particular table, so I can't use that to identify this table across webpages.
Any help would be appreciated. Thanks in advance.
The following code:
from bs4 import BeautifulSoup
htmlstring = """ <tbody tabindex="0" class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650">
<tr id="yui-rec0" class="yui-dt-first yui-dt-even">
<tr id="yui-rec1" class="yui-dt-odd">
<tr id="yui-rec2" class="yui-dt-even">"""
soup = BeautifulSoup(htmlstring)
Table = soup.find('tbody', attrs={'class': 'yui-dt-data'})
print("Table:\n")
print(Table)
tr = Table.find('tr', attrs={'class': 'yui-dt-odd'})
print("tr:\n")
print(tr)
outputs:
Table:
<tbody class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650" tabindex="0">
<tr class="yui-dt-first yui-dt-even" id="yui-rec0">
<tr class="yui-dt-odd" id="yui-rec1">
<tr class="yui-dt-even" id="yui-rec2"></tr></tr></tr></tbody>
tr:
<tr class="yui-dt-odd" id="yui-rec1">
<tr class="yui-dt-even" id="yui-rec2"></tr></tr>
Even though the html you supplied isn't by itself valid, it seems that BS is making a guess about how it should be, because soup.prettify() yields
<tbody class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650" tabindex="0">
<tr class="yui-dt-first yui-dt-even" id="yui-rec0">
<tr class="yui-dt-odd" id="yui-rec1">
<tr class="yui-dt-even" id="yui-rec2">
</tr>
</tr>
</tr>
</tbody>
Though I'm guessing those tr's aren't supposed to be nested.
Could you try running that exact code and seeing what the output is?
For people trying to find a solution to find a tag with hyphen in its attributes, there is an answer in the document
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments
This segment of code will cause error
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
you should do this
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
Just use select. bs4 4.7.1
import requests
from bs4 import BeautifulSoup as bs
html = '''
<tbody tabindex="0" class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650">
<tr id="yui-rec0" class="yui-dt-first yui-dt-even">
<td headers="yui-dt0-th-rank" class="rank yui-dt0-col-rank"></td>
</tr>
<tr id="yui-rec1" class="yui-dt-odd">...</tr>
<tr id="yui-rec2" class="yui-dt-even">...</tr>
</tbody>
'''
soup = bs(html, 'lxml')
soup.select('.yui-dt-data')

Categories