bs4 list comprehension with 'if' statement

I'm trying to add the text of each table row to my list if the table row contains text. I want to do this using list comprehension.
Here's what I tried
listt2 = [s.span.text for s in soup.find_all('tr') if s.span.text]
Here's the error
listt2 = [s.span.text for s in soup.find_all('tr') if s.span.text]
AttributeError: 'NoneType' object has no attribute 'text'
Here's 1 'tr' that contains a span tag:
<tr>
<td colspan="2" class="cell--section-end cell--link cell--link__icon">
<a data-analytics="[Competitions] - German Bundesliga" href="/football/german-bundesliga/event/26301018" class="cell--link__link cell-text">
<i class="i accordion__title-icon--green accordion__title-icon--right" data-char=""></i> <b class="cell-text__line cell-text__line--icon">
<span class="competitions-team-name js-ev-desc">1. FC Köln v 1899 Hoffenheim</span>
</b>
</a>
</td>
</tr>
Here's Another that doesn't:
<tr>
<td colspan="5" class="group-header">
Sat 14:30 </td>
</tr>
Please note there are many more tr tags on this page

If you want to get only <tr> tags which contain <span> tag, you can use this list comprehension:
listt2 = [s.span.text for s in soup.select('tr:has(span)') if s.span.text]
EDIT:
from bs4 import BeautifulSoup
html_doc = '''<tr>
<td colspan="2" class="cell--section-end cell--link cell--link__icon">
<a data-analytics="[Competitions] - German Bundesliga" href="/football/german-bundesliga/event/26301018" class="cell--link__link cell-text">
<i class="i accordion__title-icon--green accordion__title-icon--right" data-char=""></i> <b class="cell-text__line cell-text__line--icon">
<span class="competitions-team-name js-ev-desc">1. FC Köln v 1899 Hoffenheim</span>
</b>
</a>
</td>
</tr>'''
soup = BeautifulSoup(html_doc, 'html.parser')
listt2 = [s.span.text for s in soup.select('tr:has(span)') if s.span.text]
print(listt2)
Prints:
['1. FC Köln v 1899 Hoffenheim']

You just need to check that span is not None before looking for span.text.
listt2 = [s.span.text for s in soup.find_all('tr') if s.span is not None and s.span.text]
Because of short-circuiting, s.span.text is never evaluated when s.span is None: the left operand is already False, so False and <anything> is False without looking at the right side.
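To see the guard in action, here is a minimal sketch. The two-row HTML string is a simplified stand-in I made up for the page, not the real markup; it only needs one row with a span and one without.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page: one row with a <span>, one without.
html_doc = """
<table>
<tr><td><span>1. FC Köln v 1899 Hoffenheim</span></td></tr>
<tr><td class="group-header">Sat 14:30</td></tr>
</table>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# s.span is None for the second row, so `and` stops there and
# s.span.text is never touched for that row.
guarded = [s.span.text for s in soup.find_all("tr")
           if s.span is not None and s.span.text]

# Same result by letting the CSS selector skip rows without a <span>.
selected = [s.span.text for s in soup.select("tr:has(span)") if s.span.text]

print(guarded)
print(selected)
```

Both lists should contain only the match name; the group-header row is skipped in each case.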

Related

Python HTML Regex

The correct output for the txt file below should be: PlayerA 29.2 PlayerB 32.2
I have a txt file filled with HTML that looks like the sample below.
I'm trying to use a Python 2.6 regular expression to collect all the player names and ratings.
The first player name appears on line 4, and its rating (29.2) appears on line 16.
The next player name appears on line 22, and its rating on line 35.
And so on...
fileout = open('C:\Python26\hotcold.txt')
read_file = fileout.readlines()
source = str(read_file)
expression = re.findall(r"(LS=113>.+?", source)
print expression
I was trying to write an expression that would find all the names and ratings, but it isn't working.
<tr class="stats">
<td class="stats" colspan="1" valign="top">
<a href="index.php?c=playerview&P=245&LS=113">
PlayerA
</a>
</td>
<td class="stats" colspan="1" valign="top">
<b>
4
</b>
,
<b>
8
</b>
</td>
<td class="stats" colspan="1" valign="top">
29.2
</td>
<tr class="stats">
<td class="stats" colspan="1" valign="top">
<a href="index.php?c=playerview&P=245&LS=113">
PlayerB
</a>
</td>
<td class="stats" colspan="1" valign="top">
<b>
4
</b>
,
<b>
8
</b>
</td>
<td class="stats" colspan="1" valign="top">
32.2
</td>

I would recommend using Beautiful Soup to parse the HTML and get the values you are after.
Use the following code:
from bs4 import BeautifulSoup
with open('sample.html', 'r') as html_doc:
    soup = BeautifulSoup(html_doc, 'html.parser')
for row in soup.find_all('tr', 'stats'):
    row_tds = row.find_all_next('td')
    print('{0} {1}'.format(
        row_tds[0].find('a').string.strip() if row_tds[0].find('a').string else 'None',
        row_tds[2].string.strip() if row_tds[2].string else 'None')
    )
output:
$ python testparse.py
PlayerA 29.2
PlayerB 32.2
Works.
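For anyone who wants to try this without a file on disk, here is a self-contained sketch along the same lines. It assumes a cleaned-up sample in which every <tr> is closed (the original sample is not), and it takes the first and last cell of each row rather than fixed indexes; both choices are mine, not part of the answer above.

```python
from bs4 import BeautifulSoup

# Assumption: a tidied copy of the sample where each row is closed.
html = """
<table>
<tr class="stats">
<td class="stats"><a href="index.php?c=playerview&amp;P=245&amp;LS=113">
PlayerA
</a></td>
<td class="stats"><b>4</b>,<b>8</b></td>
<td class="stats">
29.2
</td>
</tr>
<tr class="stats">
<td class="stats"><a href="index.php?c=playerview&amp;P=245&amp;LS=113">
PlayerB
</a></td>
<td class="stats"><b>4</b>,<b>8</b></td>
<td class="stats">
32.2
</td>
</tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

results = []
for row in soup.find_all("tr", class_="stats"):
    cells = row.find_all("td")               # only the cells inside this row
    name = cells[0].get_text(strip=True)     # first cell holds the player link
    rating = cells[-1].get_text(strip=True)  # last cell holds the rating
    results.append((name, rating))

for name, rating in results:
    print(name, rating)
```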

I would also suggest using a proper HTML parser instead of relying on regex. BeautifulSoup is a very good and easy-to-use library, but it is not the only option.
Is your sample missing the closing </tr> tags between the rows?
Edit: using the OP's sample as source.
Anyhow, here is lxml.html with a simple XPath that hopefully gets what you expected:
In [1]: import lxml.html
# sample.html is the same as in OP sample
In [2]: tree = lxml.html.parse("sample.html")
In [3]: root = tree.getroot()
In [4]: players = root.xpath('.//td[@class="stats"]/a/text()')
In [5]: stats = root.xpath('//td[@class="stats" and normalize-space(text())]/text()')
In [6]: print players, stats
['\nPlayerA\n', '\nPlayerB\n'] ['\n29.2\n', '\n32.2\n']
In [7]: for player, stat in zip(players, stats):
   ...:     print player.strip(), stat.strip()
   ...:
PlayerA 29.2
PlayerB 32.2

Using beautiful soup to pull text from multiple <tr>'s

The goal is to output a dictionary of course names and their grade from this:
<tr>
<td class="course">Modern Europe & the World - Dewey</td>
<td class="percent">
92%
</td>
<td style="display: none;"><img alt="Email" src="/images/email.png?1395938788" /></td>
</tr>
to this:
{Modern Europe & the World - Dewey: 92%, the next course name: grade, ...}
I know how to find just the percent tag or just the a href tag but I'm unsure how to get the text and compile it into a dictionary so it's more usable. Thanks!

Since each tr contains a sequence of td elements containing the information you want, you just need to use find_all() to collect them into a list, and then extract the information you want:
from bs4 import BeautifulSoup
soup = BeautifulSoup("""
<tr>
<td class="course">Modern Europe & the World - Dewey</td>
<td class="percent">
92%
</td>
<td style="display: none;"><img alt="Email" src="/images/email.png?1395938788" /></td>
</tr>
""")
grades = {}
for tr in soup.find_all("tr"):
    td_text = [td.text.strip() for td in tr.find_all("td")]
    grades[td_text[0]] = td_text[1]
Result:
>>> grades
{u'Modern Europe & the World - Dewey': u'92%'}

Try this: for each tr element, look for the children you need (the td with class course and the td with class percent). If both exist, add the pair to the grades dict.
>>> from bs4 import BeautifulSoup
>>> html = """
... <tr>
... <td class="course">Modern Europe & the World - Dewey</td>
... <td class="percent">
... 92%
... </td>
... <td style="display: none;"><img alt="Email" src="/images/email.png?1395938788" /></td>
... </tr>
... """
>>>
>>> soup = BeautifulSoup(html)
>>> grades = {}
>>> for tr in soup.find_all('tr'):
...     td_course = tr.find("td", {"class" : "course"})
...     td_percent = tr.find("td", {"class" : "percent"})
...     if td_course and td_percent:
...         grades[td_course.text.strip()] = td_percent.text.strip()
...
>>>
>>> grades
{u'Modern Europe & the World - Dewey': u'92%'}
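The same idea can be wrapped in a small function so it runs over a whole table. This is only a sketch: the function name grades_from_html, the second course row, and the row without the expected cells are made up for illustration.

```python
from bs4 import BeautifulSoup

def grades_from_html(html):
    """Map course name -> grade for every row that has both cells."""
    soup = BeautifulSoup(html, "html.parser")
    grades = {}
    for tr in soup.find_all("tr"):
        course = tr.find("td", class_="course")
        percent = tr.find("td", class_="percent")
        # Skip header or spacer rows that lack either cell.
        if course is not None and percent is not None:
            grades[course.get_text(strip=True)] = percent.get_text(strip=True)
    return grades

# Hypothetical sample: the first row is from the question, the rest is invented.
sample = """
<table>
<tr><td class="course">Modern Europe &amp; the World - Dewey</td><td class="percent"> 92% </td></tr>
<tr><td class="course">Hypothetical Course</td><td class="percent"> 88% </td></tr>
<tr><td colspan="2">row without the expected cells</td></tr>
</table>
"""
print(grades_from_html(sample))
```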

Deleting all content between brackets from a string using python

I am using beautiful soup to grab data from an html page, and when I grab the data, I am left with this:
<tr>
<td class="main rank">1</td>
<td class="main company"><a href="/colleges/williams-college/">
<img alt="" src="http://i.forbesimg.com/media/lists/colleges/williams-college_50x50.jpg">
<h3>Williams College</h3></img></a></td>
<td class="main">Massachusetts</td>
<td class="main">$61,850</td>
<td class="main">2,124</td>
</tr>
This is the beautifulsoup command I am using to get this:
html = open('collegelist.html')
test = BeautifulSoup(html)
soup = test.find_all('tr')
I now want to manipulate this text so that it outputs
1
Williams College
Massachusetts
$61,850
2,124
and I am having difficulty doing so for the entire document, where I have about 700 of these entries. Any advice would be appreciated.

Just get the .text (or use get_text()) for every tr in the loop:
soup = BeautifulSoup(open('collegelist.html'))
for tr in soup.find_all('tr'):
    print tr.text # or tr.get_text()
For the HTML you've provided it prints:
1
Williams College
Massachusetts
$61,850
2,124
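If you would rather have each field as a separate, already-stripped value (for example to write 700 rows to CSV later), stripped_strings is one option. A minimal sketch, assuming a trimmed copy of the row from the question with the img tag left out:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the row from the question (img tag omitted for brevity).
html = """
<table>
<tr>
<td class="main rank">1</td>
<td class="main company"><a href="/colleges/williams-college/"><h3>Williams College</h3></a></td>
<td class="main">Massachusetts</td>
<td class="main">$61,850</td>
<td class="main">2,124</td>
</tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# stripped_strings yields every non-empty text node with whitespace trimmed.
rows = [list(tr.stripped_strings) for tr in soup.find_all("tr")]

for fields in rows:
    print("\n".join(fields))
```

Each entry of rows is then a plain list of strings, one per cell that contains text.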

Use get_text():
soup = BeautifulSoup(html)
"".join([x.get_text() for x in soup.find_all('tr')])

Using beautifulsoup to get multiple tags and attributes data [closed]

I want to use BeautifulSoup to get multiple tags and attributes from the following HTML:
1) div id="home_1039509"
2) div id="guest_1039509"
3) id="odds_3_1039509"
4) id="gs_1039509"
5) id="hs_1039509"
6) id="time_1039509"
HTML:
<tr align="center" height="15" id="tr_1039509" bgcolor="#F7F3F7" index="0">
<td width="10">
<img src="images/lclose.gif" onclick="hidematch(0)" style="cursor:pointer;">
</td>
<td width="63" bgcolor="#d15023">
<font color="#ffffff">U18<br>
<span id="t_1039509">14:05</span>
</font>
</td>
<td width="115" style="text-align:left;">
<div id="home_1039509">
U18()
</div>
<div class="oddsAns">
[
A
-
B
-
</div>
<div id="guest_1039509">
U18
</div>
</td>
<td width="30">
<div id="gs_1039509" class="score">2</div>
<div id="time_1039509">
42
<img src="images/in.gif" border="0">
</div>
<div id="hs_1039509" class="score">1</div></td>
<td width="90" id="odds_1_1039509" title=""></td>
<td width="90" id="odds_4_1039509" title=""></td>
<td width="90" id="odds_3_1039509" title="">
<a class="sb" href="javascript:" onclick="ChangeDetail3(1039509,'3')">0.94</a>
<img src="images/t3.gif">
<br>
<a class="pk" href="javascript:" onclick="ChangeDetail3(1039509,'3')">2.5/3</a>
<br>
0.86
</td>
<td width="90" id="odds_31_1039509" title="nothing"></td>
</tr>
Code:
rows = table.findAll("tr", {"id" : re.compile('tr_*\d')})
for tr in rows:
cols = tr.findAll("span", {"id" : re.compile('t_*\d')}) &
cols = tr.findAll("div", {"id" : re.compile('home_*\d')}) &
cols = tr.findAll("span", {"id" : re.compile('guest_*\d')}) &
cols = tr.findAll("span", {"id" : re.compile('guest_*\d')}) &
cols = tr.findAll("span", {"id" : re.compile('odds_3_*\d')}) &
cols = tr.findAll("span", {"id" : re.compile('hs_*\d')})
for td in cols:
t = td.find(text=True)
if t:
text = t + ';' # concat
print text,
print

You can pass a function and check if id starts with home_, guest_ etc:
from bs4 import BeautifulSoup
f = lambda x: x and x.startswith(('home_', 'guest_', 'odds_', 'gs_', 'hs_', 'time_'))
soup = BeautifulSoup(open('test.html'))
print [element.get_text(strip=True) for element in soup.find_all(id=f)]
prints:
[u'U18()', u'U18', u'2', u'42', u'1', u'', u'', u'0.942.5/30.86', u'']
Note that startswith() accepts a tuple of prefixes and returns True if the string starts with any of them.
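The tuple form of startswith() and the role of the "x and" guard can be checked without any HTML at all. A small sketch; the id values are made up, modelled on the ones in the question:

```python
# Hypothetical id values, modelled on the ones in the question.
# None stands for a tag that has no id attribute at all.
ids = ["home_1039509", "guest_1039509", "tr_1039509", None, "gs_1039509"]

prefixes = ("home_", "guest_", "gs_", "hs_", "time_")

# `x and ...` short-circuits on None, so startswith() is never called on it.
wanted = [x for x in ids if x and x.startswith(prefixes)]

print(wanted)  # ['home_1039509', 'guest_1039509', 'gs_1039509']
```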

You can get the list of cols like this:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
soup.find_all(["div", "span"], id=re.compile(r'^(?:home|guest|odds_3|gs|hs|time)_\d+'))
The regex above is just an example. Note that the alternatives go in a group (...), not in square brackets, which would make a character class.
In your case it can be:
# "td" is included because the odds_* ids sit on <td> elements in your HTML
cols = tr.find_all(["div", "span", "td"], id=re.compile(r'^(?:home|guest|odds|gs|hs|time)_\d+'))
for tag in cols:
    # find(text=True) returns only the first text node;
    # find_all(text=True) collects all of them
    t = tag.find_all(text=True)
    if t:
        # find_all returns a list, so the pieces need to be joined
        text = ''.join(t).strip() + ';'
        print(text)

How can I iterate through tags with different identifiers with BeautifulSoup in Python

This is probably an easy question, but I'd like to iterate through the tags with id = dgrdAcquired_hyplnkacquired_0, dgrdAcquired_hyplnkacquired_1, etc.
Is there any easier way to do this than the code I have below? The trouble is that the number of these tags will be different for each webpage I pull up. I'm not sure how to get the text in these tags when each webpage might have a different number of tags.
html = """
<tr>
<td colspan="3"><table class="datagrid" cellspacing="0" cellpadding="3" rules="rows" id="dgrdAcquired" width="100%">
<tr class="datagridH">
<th scope="col"><font face="Arial" color="Blue" size="2"><b>Name (RSSD ID)</b></font></th><th scope="col"><font face="Arial" color="Blue" size="2"><b>Acquisition Date</b></font></th><th scope="col"><font face="Arial" color="Blue" size="2"><b>Description</b></font></th>
</tr><tr class="datagridI">
<td nowrap="nowrap"><font face="Arial" size="2">
<a id="dgrdAcquired_hyplnkacquired_0" href="InstitutionProfile.aspx?parID_RSSD=3557617&parDT_END=20110429">FIRST CHOICE COMMUNITY BANK (3557617)</a>
</font></td><td><font face="Arial" size="2">
<span id="dgrdAcquired_lbldtAcquired_0">2011-04-30</span>
</font></td><td><font face="Arial" size="2">
<span id="dgrdAcquired_lblAcquiredDescText_0">The acquired institution failed and disposition was arranged of by a regulatory agency. Assets were distributed to the acquiring institution.</span>
</font></td>
</tr><tr class="datagridAI">
<td nowrap="nowrap"><font face="Arial" size="2">
<a id="dgrdAcquired_hyplnkacquired_1" href="InstitutionProfile.aspx?parID_RSSD=104038&parDT_END=20110429">PARK AVENUE BANK, THE (104038)</a>
</font></td>
"""
soup = BeautifulSoup(html)
firm1 = soup.find('a', { "id" : "dgrdAcquired_hyplnkacquired_0"})
data1 = ''.join(firm1.findAll(text=True))
print data1
firm2 = soup.find('a', { "id" : "dgrdAcquired_hyplnkacquired_1"})
data2 = ''.join(firm2.findAll(text=True))
print data2

I would do the following, assuming that if there are n such tags, they are numbered 0 to n-1 with no gaps:
soup = BeautifulSoup(html)
i = 0
data = []
while True:
    firm1 = soup.find('a', { "id" : "dgrdAcquired_hyplnkacquired_%s" % i})
    if not firm1:
        break
    data.append(''.join(firm1.findAll(text=True)))
    print data[-1]
    i += 1
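If you would rather not count at all, a CSS attribute-prefix selector picks up however many of these links the page has. This is a sketch with a trimmed-down copy of the markup (the href values are shortened to "#" by me), and it assumes a bs4 version where select() supports the [id^=...] form:

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the markup; hrefs shortened for brevity.
html = """
<a id="dgrdAcquired_hyplnkacquired_0" href="#">FIRST CHOICE COMMUNITY BANK (3557617)</a>
<span id="dgrdAcquired_lbldtAcquired_0">2011-04-30</span>
<a id="dgrdAcquired_hyplnkacquired_1" href="#">PARK AVENUE BANK, THE (104038)</a>
"""
soup = BeautifulSoup(html, "html.parser")

# [id^="..."] matches elements whose id starts with the given prefix.
links = soup.select('a[id^="dgrdAcquired_hyplnkacquired_"]')
names = [a.get_text(strip=True) for a in links]

for name in names:
    print(name)
```

Unlike the counting loop, this does not care whether the numbering has gaps.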

Regex is probably overkill in this particular case.
Nonetheless here's another option:
import re
soup.find_all('a', id=re.compile(r'dgrdAcquired_hyplnkacquired_\d+'))
Please note: s/find_all/findAll/g if using BS3.
Result (a bit of whitespace removed for purposes of display):
[<a href="InstitutionProfile.aspx?parID_RSSD=3557617&parDT_END=20110429"
id="dgrdAcquired_hyplnkacquired_0">FIRST CHOICE COMMUNITY BANK (3557617)</a>,
<a href="InstitutionProfile.aspx?parID_RSSD=104038&parDT_END=20110429"
id="dgrdAcquired_hyplnkacquired_1">PARK AVENUE BANK, THE (104038)</a>]
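A note on square brackets in such a pattern: they define a character class, meaning one character from the set, not a literal prefix. The sketch below uses only the standard re module; the last two ids are there for contrast and "other_7" is made up. As far as I know, bs4 applies a compiled pattern with search(), so the example does the same.

```python
import re

# Character class: ONE character out of the listed set, followed by digits.
char_class = re.compile(r'[dgrdAcquired_hyplnkacquired_]\d+')
# Literal prefix, anchored at both ends.
literal = re.compile(r'^dgrdAcquired_hyplnkacquired_\d+$')

ids = [
    "dgrdAcquired_hyplnkacquired_0",
    "dgrdAcquired_lbldtAcquired_0",  # a different element from the same grid
    "other_7",                       # made-up id, unrelated to the grid
]

loose = [i for i in ids if char_class.search(i)]
strict = [i for i in ids if literal.search(i)]

print(loose)   # all three: "_" followed by a digit is enough to match
print(strict)  # only the hyplnkacquired id
```

With the character class, the filter on the tag name 'a' is what actually narrows the result, not the pattern.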
