Beautifulsoup Unable to Find Classes with Hyphens in Their Name - python

I am using BeautifulSoup4 on a MacOSX running Python 2.7.8. I am having difficulty extracting information from the following html code
<tbody tabindex="0" class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650">
<tr id="yui-rec0" class="yui-dt-first yui-dt-even">
<td headers="yui-dt0-th-rank" class="rank yui-dt0-col-rank"></td>
</tr>
<tr id="yui-rec1" class="yui-dt-odd">...</tr>
<tr id="yui-rec2" class="yui-dt-even">...</tr>
</tbody>
I can't seem to grab the table or any of it's contents because BS and/or python doesn't seem to recognize values with hyphens. So the usual code, something like
Table = soup.find('tbody',{'class':'yui-dt-data'})
or
Row2 = Table.find('tr',{'id':'yui-rec2'})
just returns an empty object (not NONE, simply empty). I'm not new to BS4 or Python and I've extracted information from this site before, but the class names are different now than when I previously did it. Now everything has hyphens. Is there any way to get Python to recognize the hyphen or a workaround?
I need to have my code be general so that I can run it across numerous pages that all have the same class name. Unfortunately, the id attribute in <tbody> is unique to that particular table, so I can't use that to identify this table across webpages.
Any help would be appreciated. Thanks in advance.

The following code:
from bs4 import BeautifulSoup
htmlstring = """ <tbody tabindex="0" class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650">
<tr id="yui-rec0" class="yui-dt-first yui-dt-even">
<tr id="yui-rec1" class="yui-dt-odd">
<tr id="yui-rec2" class="yui-dt-even">"""
soup = BeautifulSoup(htmlstring)
Table = soup.find('tbody', attrs={'class': 'yui-dt-data'})
print("Table:\n")
print(Table)
tr = Table.find('tr', attrs={'class': 'yui-dt-odd'})
print("tr:\n")
print(tr)
outputs:
Table:
<tbody class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650" tabindex="0">
<tr class="yui-dt-first yui-dt-even" id="yui-rec0">
<tr class="yui-dt-odd" id="yui-rec1">
<tr class="yui-dt-even" id="yui-rec2"></tr></tr></tr></tbody>
tr:
<tr class="yui-dt-odd" id="yui-rec1">
<tr class="yui-dt-even" id="yui-rec2"></tr></tr>
Even though the html you supplied isn't by itself valid, it seems that BS is making a guess about how it should be, because soup.prettify() yields
<tbody class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650" tabindex="0">
<tr class="yui-dt-first yui-dt-even" id="yui-rec0">
<tr class="yui-dt-odd" id="yui-rec1">
<tr class="yui-dt-even" id="yui-rec2">
</tr>
</tr>
</tr>
</tbody>
Though I'm guessing those tr's aren't supposed to be nested.
Could you try running that exact code and seeing what the output is?

For people trying to find a solution to find a tag with hyphen in its attributes, there is an answer in the document
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments
This segment of code will cause error
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
you should do this
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]

Just use select. bs4 4.7.1
import requests
from bs4 import BeautifulSoup as bs
html = '''
<tbody tabindex="0" class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650">
<tr id="yui-rec0" class="yui-dt-first yui-dt-even">
<td headers="yui-dt0-th-rank" class="rank yui-dt0-col-rank"></td>
</tr>
<tr id="yui-rec1" class="yui-dt-odd">...</tr>
<tr id="yui-rec2" class="yui-dt-even">...</tr>
</tbody>
'''
soup = bs(html, 'lxml')
soup.select('.yui-dt-data')

Related

Beautiful soup: why not printing inside the for-loop in my code?

from bs4 import BeautifulSoup
import numpy as np
import requests
from selenium import webdriver
from nltk.tokenize import sent_tokenize,word_tokenize
html = webdriver.Firefox(executable_path=r'D:\geckodriver.exe')
html.get("https://www.tsa.gov/coronavirus/passenger-throughput")
def TSA_travel_numbers(html):
print('NASEEF')
soup = BeautifulSoup(html,'lxml')
print('naseef2')
for i,rows in enumerate(soup.find_all('tr',class_='view-content')):
print('naseef3')
for texts in soup.find('td',header = 'view-field-2021-throughput-table-column'):
print('naseef4')
number = texts.text
if number is None:
continue
print('Naseef')
TSA_travel_numbers(html.page_source)
As you can see NASEEF and naseef2 gets printed into the console, but not naseef3 and naseef4, and no error to this code, it runs fine, I don't know what is happening here, anyone please point me what is really happening here?
In other words it is not going inside the for loops specified in that function.
please help me, and sorry for your time and advance thanks!
Your page does not contain <tr> tags with a class of view-content, so find_all is correctly returning no results. If you remove the class restriction, you get many results:
>>> soup.find_all('tr', limit=2)
[<tr>
<th class="views-align-center views-field views-field-field-today-date views-align-center" id="view-field-today-date-table-column" scope="col">Date</th>
<th class="views-align-center views-field views-field-field-2021-throughput views-align-center" id="view-field-2021-throughput-table-column" scope="col">2021 Traveler Throughput </th>
<th class="views-align-center views-field views-field-field-2020-throughput views-align-center" id="view-field-2020-throughput-table-column" scope="col">2020 Traveler Throughput </th>
<th class="views-align-center views-field views-field-field-2019-throughput views-align-center" id="view-field-2019-throughput-table-column" scope="col">2019 Traveler Throughput </th>
</tr>, <tr>
<td class="views-field views-field-field-today-date views-align-center" headers="view-field-today-date-table-column">5/9/2021 </td>
<td class="views-field views-field-field-2021-throughput views-align-center" headers="view-field-2021-throughput-table-column">1,707,805 </td>
<td class="views-field views-field-field-2020-throughput views-align-center" headers="view-field-2020-throughput-table-column">200,815 </td>
<td class="views-field views-field-field-2019-throughput views-align-center" headers="view-field-2019-throughput-table-column">2,419,114 </td>
</tr>]
Once you change that, the inner loop is looking for <td> tags with a header of view-field-2021-throughput-table-column. There are no such tags in the page either, but there are those which have a headers field with that name.
This line is also wrong:
number = texts.text
...because texts is a NavigableString and does not have the text attribute.
Additionally, the word naseef is not really clear as to what it means, so it's better to replace that with more descriptive strings. Finally, you don't really need the Selenium connection or the tokenizer, so for the purposes of this example we can leave those out. The resulting code looks like this:
from bs4 import BeautifulSoup
import numpy as np
import requests
html = requests.get("https://www.tsa.gov/coronavirus/passenger-throughput").text
def TSA_travel_numbers(html):
print('Entering parsing function')
soup = BeautifulSoup(html,'lxml')
print('Parsed HTML to soup')
for i,rows in enumerate(soup.find_all('tr')):
print('Found <tr> tag number', i)
for texts in soup.find('td',headers = 'view-field-2021-throughput-table-column'):
print('found <td> tag with headers')
number = texts
if number is None:
continue
print('Value is', number)
TSA_travel_numbers(html)
Its output looks like:
Entering parsing function
Parsed HTML to soup
Found <tr> tag number 0
found <td> tag with headers
Value is 1,707,805
Found <tr> tag number 1
found <td> tag with headers
Value is 1,707,805
Found <tr> tag number 2
found <td> tag with headers
...

Beautiful Soup Not Finding Basic HTML Data

I'm trying to extract data from a page using BeautifulSoup. I obtain my HTML data (type: bs4.element.ResultSet) and it contains mutliple lines such as the following, which I would like to compile into a list:
<td class="va-infobox-label" colspan="1" style="" title="">Weight</td>
But when I run a line such as one of those shown below...
labels = soup.find_all("va-infobox-label")
labels = soup.find(colspan="1", style="")
...I get an attribute error. Alternatively running...
labels = soup.find_all("va-infobox-label")
...returns a syntax error
What command or tool should I be using if not find to obtain all lines containing va-infobox-label? My end goal is to obtain a list of labels from this HTML, one of which will be 'weight' as per my example (title="">Weight<).
If you need to replicate the error:
import requests
from bs4 import BeautifulSoup
as_val_url = 'https://escapefromtarkov.gamepedia.com/AS_VAL'
as_val_page = requests.get(as_val_url)
as_val_soup = BeautifulSoup(as_val_page.content, 'html.parser')
soup = as_val_soup.find_all(id="va-infobox0-content")
labels = soup.find_all("va-infobox-label")
If a glance at the HTML would help you, a public 'beautified' copy of it is present in my pastebin. My example is from line 36.
You can use soup.select to search via CSS selectors or soup.find_all as below
from bs4 import BeautifulSoup
from io import StringIO
data = '''
<tr>
<td class="va-infobox-label" colspan="1" style="" title="">Slot</td>
<td class="va-infobox-spacing-h"></td>
<td class="va-infobox-content" colspan="1" style="" title="">Primary</td>
</tr>
<tr class="va-infobox-spacing">
<td class="va-infobox-spacing-v" colspan="3"></td>
</tr>
<tr>
<td class="va-infobox-label" colspan="1" style="" title="">Weight</td>
<td class="va-infobox-spacing-h"></td>
<td class="va-infobox-content" colspan="1" style="" title="">2.587 kg</td>
</tr>
<tr class="va-infobox-spacing">
<td class="va-infobox-spacing-v" colspan="3"></td>
</tr>
<tr>
<td class="va-infobox-label" colspan="1" style="" title="">Grid size</td>
<td class="va-infobox-spacing-h"></td>
<td class="va-infobox-content" colspan="1" style="" title="">5x2</td>
</tr>
'''
f = StringIO(data)
soup = BeautifulSoup(f, 'html.parser')
for e in soup.find_all('td', {'class': 'va-infobox-label'}):
print('find_all', e)
for e in soup.select('.va-infobox-label'):
print('select', e)

How to extract HTML table following a specific heading?

I am using BeautifulSoup to parse HTML files. I have a HTML file similar to this:
<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key A</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key B</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
<h3>THE GOOD STUFF</h3>
<table class="foo">
<tr>
<td>Key C</td>
</tr>
<tr>
<td>I WANT THIS STRING</td>
</tr>
</table>
<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key A</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
I want to extract the string "I WANT THIS STRING". The perfect solution would be to get the first table following the h3 heading called "THE GOOD STUFF". I have no idea how to do this with BeautifulSoup - I only know how to extract a table with a specific class, or a table nested within some particular tag, but not following a particular tag.
I think a fallback solution could make use of the string "Key C", assuming it's unique (it almost certainly is) and appears in only that one table, but I'd feel better with going for the specific h3 heading.
Following the logic of #Zroq's answer on another question, this code will give you the table following your defined header ("THE GOOD STUFF"). Please note I just put all your html in the variable called "html".
from bs4 import BeautifulSoup, NavigableString, Tag
soup=BeautifulSoup(html, "lxml")
for header in soup.find_all('h3', text=re.compile('THE GOOD STUFF')):
nextNode = header
while True:
nextNode = nextNode.nextSibling
if nextNode is None:
break
if isinstance(nextNode, Tag):
if nextNode.name == "h3":
break
print(nextNode)
Output:
<table class="foo">
<tr>
<td>Key C</td>
</tr>
<tr>
<td>I WANT THIS STRING</td>
</tr>
</table>
Cheers!
The docs explain that if you don't want to use find_all, you can do this:
for sibling in soup.a.next_siblings:
print(repr(sibling))
I am sure there are many ways to this more efficiently, but here is what I can think about right now:
from bs4 import BeautifulSoup
import os
os.chdir('/Users/Downloads/')
html_data = open("/Users/Downloads/train.html",'r').read()
soup = BeautifulSoup(html_data, 'html.parser')
all_td = soup.find_all("td")
flag = 'no_print'
for td in all_td:
if flag == 'print':
print(td.text)
break
if td.text == 'Key C':
flag = 'print'
Output:
I WANT THIS STRING

How to extract elements from html with BeautifulSoup

I am beginning to learn python and would like to try to use BeautifulSoup to extract the elements in the below html.
This html is taken from a voice recording system that logs the time and date in local time, UTC, call duration, called number, name, calling number, name, etc
There are usually hundreds of these entries.
What I am attempting to do is extract the elements and print them in one line to a comma delimited format in order to compare with call detail records from call manager. This will help to verify that all calls were recorded and not missed.
I believe BeautifulSoup is the right tool to do this.
Could someone point me in the right direction?
<tbody>
<tr class="formRowLight">
<td class="formRowLight" >24/10/16<br>16:24:47</td>
<td class="formRowLight" >24/10/16 07:24:47</td>
<td class="formRowLight" >00:45</td>
<td class="formRowLight" >31301</td>
<td class="formRowLight" >Joe Smith</td>
<td class="formRowLight" >31111</td>
<td class="formRowLight" >Jane Doe</td>
<td class="formRowLight" >N/A</td>
<td class="formRowLight" >1432875648934</td>
<td align="center" class"formRowLight"> </td>
<tr class="formRowLight">
<td class="formRowLight" >24/10/16<br>17:33:02</td>
<td class="formRowLight" >24/10/16 08:33:02</td>
<td class="formRowLight" >00:58</td>
<td class="formRowLight" >35664</td>
<td class="formRowLight" >Billy Bob</td>
<td class="formRowLight" >227045665</td>
<td class="formRowLight" >James Dean</td>
<td class="formRowLight" >N/A</td>
<td class="formRowLight" >9934959586849</td>
<td align="center" class"formRowLight"> </td>
</tr>
</tbody>
The pandas.read_html() would make things much easier - it would convert your tabular data from the HTML table into a dataframe which, if needed, you can later dump into CSV.
Here is a sample code to get you started:
import pandas as pd
data = """
<table>
<thead>
<tr>
<th>Date</th>
<th>Name</th>
<th>ID</th>
</tr>
</thead>
<tbody>
<tr class="formRowLight">
<td class="formRowLight">24/10/16<br>16:24:47</td>
<td class="formRowLight">Joe Smith</td>
<td class="formRowLight">1432875648934</td>
</tr>
<tr class="formRowLight">
<td class="formRowLight">24/10/16<br>17:33:02</td>
<td class="formRowLight">Billy Bob</td>
<td class="formRowLight">9934959586849</td>
</tr>
</tbody>
</table>"""
df = pd.read_html(data)[0]
print(df.to_csv(index=False))
Prints:
Date,Name,ID
24/10/1616:24:47,Joe Smith,1432875648934
24/10/1617:33:02,Billy Bob,9934959586849
FYI, read_html() actually uses BeautifulSoup to parse HTML under-the-hood.
import BeautifulSoup
import urllib2
import requests
request = urllib2.Request(your url)
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
mylist = []
div = soup.findAll('tr', {"class":"formRowLight"})
for line in div:
text= video.findNext('td',{"class":"formRowLight"}).text
mylist.append(text)
print mylist
But you need to edit this code a litt to prevent any duplicated content.
Yes, BeautifulSoup is a good tool to reach for in this problem. Something to get you started would be as follows:
from bs4 import BeautifulSoup
with open("my_log.html") as log_file:
html = log_file.read()
soup = BeautifulSoup(html)
#normally you specify a parser too `(html, 'lxml')` for example
#without specifying a parser, it will warn you and select one automatically
table_rows = soup.find_all("tr") #get list of all <tr> tags
for row in table_rows:
table_cells = row.find_all("td") #get list all <td> tags in row
joined_text = ",".join(cell.get_text() for cell in table_cells)
print(joined_text)
However, pandas's read_html may make this a bit more seamless, as mentioned in another answer to this question. Arguably pandas may be a better hammer to hit this nail with, but learning to use BeautifulSoup for this will also give you the skills to scrape all kinds of HTML in the future.
First get list of html strings, To get that follow this Convert BeautifulSoup4 HTML Table to a list of lists, iterating over each Tag elements
Then perform following operation in that,
This will fetch you all values of elements you desire !
for element in html_list:
output = soup.select(element)[0].text
print("%s ," % output)
This will give you what you desires,
Hope that helps !

Can't scrape HTML table using BeautifulSoup

I'm trying to scrape data off a table on a web page using Python, BeautifulSoup, Requests, as well as Selenium to log into the site.
Here's the table I'm looking to get data for...
<div class="sastrupp-class">
<table>
<tbody>
<tr>
<td class="key">Thing I dont want 1</td>
<td class="value money">$1.23</td>
<td class="key">Thing I dont want 2</td>
<td class="value">99,999,999</td>
<td class="key">Target</td>
<td class="money value">$1.23</td>
<td class="key">Thing I dont want 3</td>
<td class="money value">$1.23</td>
<td class="key">Thing I dont want 4</td>
<td class="value percentage">1.23%</td>
<td class="key">Thing I dont want 5</td>
<td class="money value">$1.23</td>
</tr>
</tbody>
</table>
</div>
I can find the "sastrupp-class" fine, but I don't know how to look through it and get to the part of the table I want.
I figured I could just look for the class that I'm searching for like this...
output = soup.find('td', {'class':'key'})
print(output)
but that doesn't return anything.
Important to note:
< td>s inside the table have the same class name as the one that I want. If I can't separate them out, I'm ok with that although I'd rather just return the one I want.
2.There are other < div>s with class="sastrupp-class" on the site.
I'm obviously a beginner at this so let me know if I can help you help me.
Any help/pointers would be appreciated.
1) First of, to get your 'Target' you need find_all, not find. Then, considering you know exactly in which position your target will be (in the example you gave it is index=2) the solution could be reached like this:
from bs4 import BeautifulSoup
html = """(YOUR HTML)"""
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('div', {'class': 'sastrupp-class'})
all_keys = table.find_all('td', {'class': 'key'})
my_key = all_keys[2]
print my_key.text # prints 'Target'
2)
There are other < div>s with class="sastrupp-class" on the site
Again, you need to select the one you need using find_all and then selecting the correct index.
Example HTML:
<body>
<div class="sastrupp-class"> Don't need this</div>
<div class="sastrupp-class"> Don't need this</div>
<div class="sastrupp-class"> Don't need this</div>
<div class="sastrupp-class"> Target</div>
</body>
To extract the target, you can just:
all_divs = soup.find_all('div', {'class':'sastrupp-class'})
target = all_divs[3] # assuming you know exactly which index to look for

Categories