exclude tags with beautifulsoup - python

I am trying to get the contents of an HTML table with BeautifulSoup.
When I reach a cell, I need only the values that are not inside the <strike> tag:
<td>
<strike>$0.45</strike><br/>
$0.41
</td>
So in the case above I would like to return only $0.41. I am using data.get_text(), but I do not know how to filter out the $0.45.
Any ideas on how to do it?

Another method you can use is extract().
From the documentation:
PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.
You can use it like this (added one more <td> tag to show how it can be used in a loop):
html = '''
<td>
<strike>
$0.45
</strike>
<br/>
$0.41
</td>
<td>
<strike>
$0.12
</strike>
<br/>
$0.14
</td>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for td in soup.find_all('td'):
    td.strike.extract()  # remove the <strike> tag (and its text) from the tree
    print(td.text.strip())
Output:
$0.41
$0.14
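Note that td.strike is None for a cell without a <strike> child, so td.strike.extract() would raise AttributeError there. A minimal sketch that tolerates such cells, assuming decompose() (which removes the tag and discards it) is acceptable in place of extract(); the helper name visible_prices is made up for this illustration:

```python
from bs4 import BeautifulSoup

html = """
<table><tr>
<td><strike>$0.45</strike><br/>$0.41</td>
<td>$0.14</td>
</tr></table>
"""

def visible_prices(markup):
    """Return the text of each <td> with any <strike> content removed."""
    soup = BeautifulSoup(markup, "html.parser")
    prices = []
    for td in soup.find_all("td"):
        # find_all returns an empty list when there is no <strike>,
        # so cells without a struck-through price are left untouched.
        for strike in td.find_all("strike"):
            strike.decompose()
        prices.append(td.get_text(strip=True))
    return prices

print(visible_prices(html))
```

The second cell has no <strike> at all and is returned unchanged.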

You can look at all NavigableString children of the td tag and ignore all other elements:
import bs4
# keep only the text nodes that are direct children of the <td>
textData = ''.join(x for x in soup.find('td').children
                   if isinstance(x, bs4.element.NavigableString)).strip()
# '$0.41'
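The same idea can probably be written without the isinstance check. This is a sketch, assuming a BeautifulSoup version that accepts the string= argument (4.4.0 or later); the variable name direct_text is only illustrative:

```python
from bs4 import BeautifulSoup

html = "<td><strike>$0.45</strike><br/>\n$0.41\n</td>"
soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")

# string=True matches text nodes; recursive=False limits the search to
# direct children, so the text inside <strike> is skipped.
direct_text = "".join(td.find_all(string=True, recursive=False)).strip()
print(direct_text)
```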

You can do the same in several ways. Here is one such way:
from bs4 import BeautifulSoup
content="""
<td>
<strike>$0.45</strike><br/>
$0.41
</td>
"""
soup = BeautifulSoup(content,"lxml")
item = soup.find("td").contents[-1].strip()
print(item)
Output:
$0.41

You can do this in the following way
from bs4 import BeautifulSoup
h = '''
<td>
<strike>$0.45</strike><br/>
$0.41
</td>
'''
soup = BeautifulSoup(h, 'lxml')
a = soup.find('td').get_text()
print(a.split('\n')[2].strip())
Split the text on newlines, take the third piece, and strip the surrounding whitespace.
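Indexing the split result by position depends on the exact line breaks in the markup. A sketch of a less layout-dependent variant, assuming the value you want is always the last text node in the cell, uses stripped_strings:

```python
from bs4 import BeautifulSoup

h = """
<td>
<strike>$0.45</strike><br/>
$0.41
</td>
"""
soup = BeautifulSoup(h, "html.parser")

# stripped_strings yields every non-empty text node with surrounding
# whitespace removed, in document order.
parts = list(soup.find("td").stripped_strings)
print(parts)
print(parts[-1])
```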


Extract <td> Elements using BS4

I am trying to go through a website and extract some information using Chromedriver. The problem I have when I use BeautifulSoup is that I can't find a way to extract a table inside an element with a given class.
The way I am trying to extract the information looks like this:
results = soup.find_all("div", class_="widget widgetLarge fpPerfglissanteclassique")
Is there a way to change this line so that it returns only the information in the <td>...</td> elements found inside that class?
Thanks for your answers in advance!
Your results variable is a ResultSet (a list of Tag objects). You can iterate through it and call find and find_all on the individual items.
Like this:
from bs4 import BeautifulSoup
html = """
<div class="widget widgetLarge fpPerfglissanteclassique">
<td>item 1</td>
<td>item 2</td>
<td>item 3</td>
</div>
<div class="widget widgetLarge fpPerfglissanteclassique">
<td>item 4</td>
<td>item 5</td>
<td>item 6</td>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
results = soup.find_all("div", class_="widget widgetLarge fpPerfglissanteclassique")
for result in results:
    table_results = result.find_all("td")
    print(table_results)
Result:
[<td>item 1</td>, <td>item 2</td>, <td>item 3</td>]
[<td>item 4</td>, <td>item 5</td>, <td>item 6</td>]
If the table is inside this class, this example shows how to get the data from it:
from bs4 import BeautifulSoup
html = """
<div class="widget widgetLarge fpPerfglissanteclassique">
<table>
<tr>
<td>1</td><td>2</td><td>3</td>
</tr>
<tr>
<td>4</td><td>5</td><td>6</td>
</tr>
</table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
results = soup.find_all(
"div", class_="widget widgetLarge fpPerfglissanteclassique"
)
for result in results:  # <-- iterate every result
    for row in result.find_all("tr"):  # <-- find all rows
        cell_data = []
        for cell in row.find_all("td"):  # <-- find all cells inside row
            cell_data.append(cell.text)
        print(*cell_data)
Prints:
1 2 3
4 5 6
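Passing the full string "widget widgetLarge fpPerfglissanteclassique" to class_ matches the class attribute's exact value, so it stops matching if the site reorders the classes. A sketch of an alternative, assuming CSS selectors are acceptable here; the sample markup deliberately lists the classes in a different order:

```python
from bs4 import BeautifulSoup

html = """
<div class="widgetLarge widget fpPerfglissanteclassique">
<table>
<tr><td>1</td><td>2</td><td>3</td></tr>
<tr><td>4</td><td>5</td><td>6</td></tr>
</table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Each ".name" in the selector requires that class to be present,
# regardless of the order in which the classes appear in the attribute.
rows = [
    [td.get_text(strip=True) for td in tr.select("td")]
    for tr in soup.select("div.widget.widgetLarge.fpPerfglissanteclassique tr")
]
print(rows)
```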

Get contents of a particular row

I want to locate 'td' where text is 'xyz' so that I can find other attributes in the row. I only have 'xyz' with me and want to get other elements in that row.
.
.
.
<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>
.
.
.
I can get 'xyz' easily by using
required = soup.find('a', text = 'xyz')
print(required[0].text)
but I'm not able to locate 'td' so that I can use find_next_siblings() to get other columns.
Expected output:
xyz
address
phone number
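With bs4 4.7.1 or later, combine the :has and :contains pseudo-classes to retrieve the row and the tds within it. (Newer soupsieve releases spell the latter :-soup-contains and deprecate :contains.)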
This bit targets the right a tag if present by its text
a:contains("xyz")
You then retrieve the parent row (tr) having this a tag
tr:has(a:contains("xyz"))
Finally, use a descendant combinator and a td type selector to get all the tds within that row. A list comprehension collects their text into a list.
from bs4 import BeautifulSoup as bs
html = '''
<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>
'''
soup = bs(html, 'lxml')
items = [item.text.strip() for item in soup.select('tr:has(a:contains("xyz")) td')]
print(items)
If you have a modern BeautifulSoup, you can use the CSS selector :contains, then traverse back up with the find_parent() method.
from bs4 import BeautifulSoup
s = '''
<tr>
<td>Other1</td>
<td>Other1</td>
<td>Other1</td>
</tr>
<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>'''
soup = BeautifulSoup(s, 'lxml')
for td in soup.select_one('a:contains(xyz)').find_parent('tr').select('td'):
    print(td.text.strip())
Prints:
xyz
address
phone number
Replace your code with this:
from bs4 import BeautifulSoup
html = '''<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>'''
soup = BeautifulSoup(html, 'lxml')
required = soup.find('a', text = 'xyz')
print(required.text)
td = required.parent
siblingsArray = td.find_next_siblings()
for sibling in siblingsArray:
    print(sibling.text)
Output:
xyz
address
phone number
Here parent returns the immediate parent tag, and find_next_siblings() returns a list of the sibling tags that follow it.
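The parent-and-siblings approach assumes the link is always found and sits directly inside its td. A small sketch that handles a missing link, assuming find_parent('tr') is an acceptable way to reach the row; the helper name row_cells_for_link is hypothetical:

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr><td>Other1</td><td>Other2</td><td>Other3</td></tr>
<tr><td><a>xyz</a></td><td>address</td><td>phone number</td></tr>
</table>
"""

def row_cells_for_link(soup, link_text):
    """Return the cell texts of the row whose <a> has exactly link_text."""
    link = soup.find("a", string=link_text)
    if link is None:
        return []  # no matching link anywhere in the document
    row = link.find_parent("tr")
    if row is None:
        return []  # the link is not inside a table row
    return [td.get_text(strip=True) for td in row.find_all("td")]

soup = BeautifulSoup(html, "html.parser")
print(row_cells_for_link(soup, "xyz"))
print(row_cells_for_link(soup, "missing"))
```

Returning an empty list for a missing link is a design choice for this sketch; raising an exception may suit your code better.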
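BeautifulSoup itself does not support XPath, but if you are driving the page with Selenium you can use find_elements_by_xpath(), or find_elements(By.XPATH, ...) in Selenium 4.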
https://www.softwaretestingmaterial.com/how-to-locate-element-by-xpath-locator/

Parse the DOM like Javascript using BeautifulSoup

I have a sample HTML in a variable html_doc like this :
html_doc = """<table class="sample">
<tbody>
<tr class="title"><td colspan="2">Info</td></tr>
<tr>
<td class="light">Time</td>
<td>01/01/1970, 00:00:00</td>
</tr>
<tr>
<td class="highlight">URL</td>
<td>https://test.com</td>
</tr>
</tbody>
</table>"""
Using JavaScript it's pretty straightforward to parse the DOM. But how can I grab ONLY the URL (https://test.com) and the Time (01/01/1970, 00:00:00) into two different variables from the <td> tags above, given that those cells have no class name?
My test.py file
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'html.parser')
test = soup.find_all("td")
print(test)
You already got all td elements. You can iterate through all of them:
for td in soup.find_all('td'):
    if td.text.startswith('http'):
        print(td, td.text)
# <td>https://test.com</td> https://test.com
If you want, you can be a bit less explicit by searching for the td element with the "highlight" class and taking its next sibling, but this is more error-prone if the DOM changes:
for td in soup.find_all('td', {'class': 'highlight'}):
    print(td.find_next_sibling())
# <td>https://test.com</td>
You can try using regular expressions to get the URL and the time:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html_doc,'html.parser')
test = soup.find_all("td")
for tag in test:
    # re.match anchors at the start of the cell text
    urls = re.match(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', tag.text)
    time = re.match(r'[0-9/:, ]+', tag.text)
    if urls is not None:
        print(urls.group(0))
    if time is not None:
        print(time.group(0))
Output
01/01/1970, 00:00:00
https://test.com
This is a very specific solution. If you need a general approach, Hari Krishnan's solution with a few tweaks might be more suitable.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'html.parser')
tds = []
for td in soup.find_all('td', {'class': ['highlight', 'light']}):
    tds.append(td.find_next_sibling().string)
time, link = tds
Building on @DeepSpace's answer:
import re
from bs4 import BeautifulSoup
html_doc = """<table class="sample">
<tbody>
<tr class="title"><td colspan="2">Info</td></tr>
<tr>
<td class="light">Time</td>
<td>01/01/1970, 00:00:00</td>
</tr>
<tr>
<td class="highlight">URL</td>
<td>https://test.com</td>
</tr>
</tbody>
</table>"""
datepattern = re.compile(r"\d{2}/\d{2}/\d{4}, \d{2}:\d{2}:\d{2}")
soup = BeautifulSoup(html_doc,'html.parser')
for td in soup.find_all('td'):
    if td.text.startswith('http'):
        link = td.text
    elif datepattern.search(td.text):
        time = td.text
print(link, time)
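When the table consists of label/value rows like this one, it may be simpler to read every row into a dictionary than to pattern-match each value. A minimal sketch, assuming each data row has exactly two cells (label, then value) and that the markup is well formed; the name info is only illustrative:

```python
from bs4 import BeautifulSoup

html_doc = """<table class="sample">
<tbody>
<tr class="title"><td colspan="2">Info</td></tr>
<tr><td class="light">Time</td><td>01/01/1970, 00:00:00</td></tr>
<tr><td class="highlight">URL</td><td>https://test.com</td></tr>
</tbody>
</table>"""

soup = BeautifulSoup(html_doc, "html.parser")

info = {}
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    # Only rows with exactly two cells are treated as label/value pairs;
    # the single-cell title row is skipped.
    if len(cells) == 2:
        label, value = (cell.get_text(strip=True) for cell in cells)
        info[label] = value

print(info["Time"])
print(info["URL"])
```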

Unable to grab a text from soup

I have an HTML as follows:
<table class="stocksTable" summary="株価詳細">
<tr>
<th class="symbol"><h1>(株)みずほフィナンシャルグループ</h1></th>
<td class="stoksPrice realTimChange">
<div class="realTimChangeMod">
</div>
</td>
td class="stoksPrice">191.1</td>
<td class="change"><span class="yjSt">前日比</span><span class="icoUpGreen yjMSt">+2.5(+1.33%)</span></td>
</tr>
</table>
I tried to extract 191.1 from a line containing td class="stoksPrice">191.1</td>.
soup = BeautifulSoup(html)
res = soup.find_all('stoksPrice')
print (res)
But the result is [].
How can I find it?
There seem to be two issues:
First, your usage of find_all is invalid. As written, you're searching for a tag named stoksPrice, which is wrong, as your tags are table, tr, td, div, span. You need to change that to:
>>> res = soup.find_all(class_='stoksPrice')
to search for tags with that class.
Second, your HTML is malformed. The line with stoksPrice is:
</td>
td class="stoksPrice">191.1</td>
it should have been:
</td>
<td class="stoksPrice">191.1</td>
(Note the < before the td.)
I'm not sure whether that was a copy error when posting to Stack Overflow or the HTML is malformed at the source, but if it is the latter, it is not going to be easy to parse.
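To make the first point concrete, here is a small sketch contrasting the two calls on a well-formed fragment. The one-cell table is a simplified stand-in for the question's markup:

```python
from bs4 import BeautifulSoup

html = '<table><tr><td class="stoksPrice">191.1</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("stoksPrice")          # looks for <stoksPrice> tags
by_class = soup.find_all(class_="stoksPrice")  # looks at the class attribute

print(by_name)
print([tag.get_text() for tag in by_class])
```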
Since there are multiple tags having the same class, you can use CSS selectors to get an exact match.
html = '''<table class="stocksTable" summary="株価詳細">
<tr>
<th class="symbol"><h1>(株)みずほフィナンシャルグループ</h1></th>
<td class="stoksPrice realTimChange">
<div class="realTimChangeMod">
</div>
</td>
<td class="stoksPrice">191.1</td>
<td class="change"><span class="yjSt">前日比</span><span class="icoUpGreen yjMSt">+2.5(+1.33%)</span></td>
</tr>
</table>'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('td[class="stoksPrice"]').text)
# 191.1
Or, you could use lambda and find to get the same.
print(soup.find(lambda t: t.name == 'td' and t['class'] == ['stoksPrice']).text)
# 191.1
Note: BeautifulSoup converts multi-valued class attributes into lists. So the classes of the two td tags look like ['stoksPrice', 'realTimChange'] and ['stoksPrice'].
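A short sketch of why the exact match matters, using a cut-down version of the table: searching with class_ matches any tag that has the class among its values, while the attribute selector compares the whole attribute string. The variable names loose and exact are only illustrative:

```python
from bs4 import BeautifulSoup

html = """
<table><tr>
<td class="stoksPrice realTimChange"></td>
<td class="stoksPrice">191.1</td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches a tag if ANY of its classes equals the value.
loose = soup.find_all("td", class_="stoksPrice")
print([td["class"] for td in loose])

# The attribute selector compares the whole attribute string instead.
exact = soup.select('td[class="stoksPrice"]')
print([td.get_text() for td in exact])
```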
Here is one way to do it using find_all (findAll is the older alias).
Because all the other stoksPrice cells are empty, the only one that remains is the one with the price.
You can use a try/except clause to check whether the text is a floating-point number.
If it is not, the loop continues iterating; if it is, the value is printed.
res = soup.find_all("td", {"class": "stoksPrice"})
for r in res:
    try:
        t = float(r.text)
        print(t)
    except ValueError:
        pass
Output:
191.1

Segregating text from bold tags within td tags using beautifulsoup

I'm trying to retrieve what is in-between the td tags without any tags in the following:
<td class="formSummaryPosition"><b>1</b>/9</td>
This is what I have written so far
o = []
for race in table:
    for pos in race.findAll("td", {"class":"Position"}):
        o.append(pos.contents)
I understand that .contents will give me the following:
[[<b>1</b>, u'/9'], [<b>4</b>, u'/11'], [<b>2</b>, u'/8'], ...]
Ultimately I would like to have:
o = [[1/9],[4/11],[2/8]...]
I would appreciate if anyone had any idea on how to achieve this most efficiently?
Cheers
Use the get_text() method on an element. From the documentation:
If you only want the text part of a document or tag, you can use the
get_text() method. It returns all the text in a document or beneath a
tag, as a single Unicode string.
>>> from bs4 import BeautifulSoup
>>> data = """
... <table>
... <tr>
... <td class="formSummaryPosition"><b>1</b>/9</td>
... <td class="formSummaryPosition"><b>4</b>/11</td>
... <td class="formSummaryPosition"><b>2</b>/8</td>
... </tr>
... </table>
... """
>>> soup = BeautifulSoup(data, 'html.parser')
>>> print([td.get_text() for td in soup.find_all('td', class_='formSummaryPosition')])
['1/9', '4/11', '2/8']
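If you need numbers rather than strings, the text can be split on the slash. A minimal sketch, assuming every cell has the form "position/field size"; the names position and runners are guesses at what the two numbers mean:

```python
from bs4 import BeautifulSoup

data = """
<table><tr>
<td class="formSummaryPosition"><b>1</b>/9</td>
<td class="formSummaryPosition"><b>4</b>/11</td>
<td class="formSummaryPosition"><b>2</b>/8</td>
</tr></table>
"""
soup = BeautifulSoup(data, "html.parser")

results = []
for td in soup.find_all("td", class_="formSummaryPosition"):
    # get_text() yields e.g. "1/9"; split it into its two numbers.
    position, runners = td.get_text(strip=True).split("/")
    results.append((int(position), int(runners)))

print(results)
```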
