Using beautiful soup to pull text from multiple <tr>'s - python

The goal is to output a dictionary of course names and their grade from this:
<tr>
<td class="course">Modern Europe & the World - Dewey</td>
<td class="percent">
92%
</td>
<td style="display: none;"><img alt="Email" src="/images/email.png?1395938788" /></td>
</tr>
to this:
{Modern Europe & the World - Dewey: 92%, the next course name: grade...etc}
I know how to find just the percent tag or just the a href tag but I'm unsure how to get the text and compile it into a dictionary so it's more usable. Thanks!

Since each tr contains a sequence of td elements containing the information you want, you just need to use find_all() to collect them into a list, and then extract the information you want:
from bs4 import BeautifulSoup
soup = BeautifulSoup("""
<tr>
<td class="course">Modern Europe & the World - Dewey</td>
<td class="percent">
92%
</td>
<td style="display: none;"><img alt="Email" src="/images/email.png?1395938788" /></td>
</tr>
""")
grades = {}
for tr in soup.find_all("tr"):
    td_text = [td.text.strip() for td in tr.find_all("td")]
    grades[td_text[0]] = td_text[1]
Result:
>>> grades
{u'Modern Europe & the World - Dewey': u'92%'}
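A minimal sketch of the same idea wrapped in a reusable function, assuming Python 3 and the built-in html.parser backend. The name parse_grades and the extra sample rows are hypothetical and only there for illustration. Rows with fewer than two cells are skipped so that header or spacer rows do not raise an IndexError:

```python
from bs4 import BeautifulSoup

def parse_grades(html):
    """Map the first cell of each row to the second cell (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    grades = {}
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 2:  # skip rows without both a name and a grade
            grades[cells[0]] = cells[1]
    return grades

# Illustrative input: the second and third rows are made up for this sketch.
sample = """
<table>
<tr><td class="course">Modern Europe &amp; the World - Dewey</td><td class="percent">
92%
</td></tr>
<tr><td class="course">Algebra II - Example</td><td class="percent">88%</td></tr>
<tr><td colspan="2">spacer row</td></tr>
</table>
"""
grades = parse_grades(sample)
print(grades)
```

Relying on cell position is the simplest option. If the column order can change between pages, looking the cells up by class is the safer choice.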

Try this: for each tr element, look for the child cells you need (the ones with the course and percent classes). If both exist, add an entry to the grades dict:
>>> from bs4 import BeautifulSoup
>>> html = """
... <tr>
... <td class="course">Modern Europe & the World - Dewey</td>
... <td class="percent">
... 92%
... </td>
... <td style="display: none;"><img alt="Email" src="/images/email.png?1395938788" /></td>
... </tr>
... """
>>>
>>> soup = BeautifulSoup(html)
>>> grades = {}
>>> for tr in soup.find_all('tr'):
...     td_course = tr.find("td", {"class" : "course"})
...     td_percent = tr.find("td", {"class" : "percent"})
...     if td_course and td_percent:
...         grades[td_course.text.strip()] = td_percent.text.strip()
...
>>>
>>> grades
{u'Modern Europe & the World - Dewey': u'92%'}

Related

bs4 list comprehension with 'if' statement

I'm trying to add the text of each table row to my list if the table row contains text. I want to do this using list comprehension.
Here's what I tried
listt2 = [s.span.text for s in soup.find_all('tr') if s.span.text]
Here's the error
listt2 = [s.span.text for s in soup.find_all('tr') if s.span.text]
AttributeError: 'NoneType' object has no attribute 'text'
Here's 1 'tr' that contains a span tag:
<tr>
<td colspan="2" class="cell--section-end cell--link cell--link__icon">
<a data-analytics="[Competitions] - German Bundesliga" href="/football/german-bundesliga/event/26301018" class="cell--link__link cell-text">
<i class="i accordion__title-icon--green accordion__title-icon--right" data-char=""></i> <b class="cell-text__line cell-text__line--icon">
<span class="competitions-team-name js-ev-desc">1. FC Köln v 1899 Hoffenheim</span>
</b>
</a>
</td>
<tr>
Here's Another that doesn't:
<tr>
<td colspan="5" class="group-header">
Sat 14:30 </td>
</tr>
Please note there are many more tr tags on this page
If you want to get only <tr> tags which contain <span> tag, you can use this list comprehension:
listt2 = [s.span.text for s in soup.select('tr:has(span)') if s.span.text]
EDIT:
from bs4 import BeautifulSoup
html_doc = '''<tr>
<td colspan="2" class="cell--section-end cell--link cell--link__icon">
<a data-analytics="[Competitions] - German Bundesliga" href="/football/german-bundesliga/event/26301018" class="cell--link__link cell-text">
<i class="i accordion__title-icon--green accordion__title-icon--right" data-char=""></i> <b class="cell-text__line cell-text__line--icon">
<span class="competitions-team-name js-ev-desc">1. FC Köln v 1899 Hoffenheim</span>
</b>
</a>
</td>
<tr>'''
soup = BeautifulSoup(html_doc, 'html.parser')
listt2 = [s.span.text for s in soup.select('tr:has(span)') if s.span.text]
print(listt2)
Prints:
['1. FC Köln v 1899 Hoffenheim']
You just need to check that span is not None before looking for span.text.
listt2 = [s.span.text for s in soup.find_all('tr') if s.span is not None and s.span.text]
Because and short-circuits, s.span.text is never evaluated when s.span is None: the left-hand side is already False, so the whole expression is False regardless of the right-hand side.
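The short-circuit behaviour can be checked without any HTML at all. Here is a small sketch that uses SimpleNamespace objects as stand-ins for parsed rows (they are not bs4 classes, just hypothetical placeholders with a span attribute):

```python
from types import SimpleNamespace

# Stand-ins for parsed rows: one has a span with text, one has no span
# (span is None), and one has a span whose text is empty.
rows = [
    SimpleNamespace(span=SimpleNamespace(text="1. FC Köln v 1899 Hoffenheim")),
    SimpleNamespace(span=None),
    SimpleNamespace(span=SimpleNamespace(text="")),
]

# The left operand is checked first; when it is False, Python never
# evaluates row.span.text, so no AttributeError is raised.
texts = [row.span.text for row in rows if row.span is not None and row.span.text]
print(texts)
```

Swapping the two operands of and would bring the AttributeError back, because row.span.text would then be evaluated before the None check.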

Python HTML Regex

correct output using below txt file should be: PlayerA 29.2 PlayerB 32.2
I have a txt file filled with html that looks like below,
I'm trying to use a python 2.6 regular expression to collect all the playernames and ratings.
The first time the playername appears is on line 4, the rating appears on line 16.(29.2)
Then the next player name appears on line 22, the rating on line 35.
and so on...
fileout = open('C:\Python26\hotcold.txt')
read_file = fileout.readlines()
source = str(read_file)
expression = re.findall(r"(LS=113>.+?", source)
print expression
I was trying to write an expression that would find all the names and ratings, but it isn't working.
<tr class="stats">
<td class="stats" colspan="1" valign="top">
<a href="index.php?c=playerview&P=245&LS=113">
PlayerA
</a>
</td>
<td class="stats" colspan="1" valign="top">
<b>
4
</b>
,
<b>
8
</b>
</td>
<td class="stats" colspan="1" valign="top">
29.2
</td>
<tr class="stats">
<td class="stats" colspan="1" valign="top">
<a href="index.php?c=playerview&P=245&LS=113">
PlayerB
</a>
</td>
<td class="stats" colspan="1" valign="top">
<b>
4
</b>
,
<b>
8
</b>
</td>
<td class="stats" colspan="1" valign="top">
32.2
</td>
I would recommend using Beautiful Soup to parse the HTML and get the values you are after.
Use the following code:
from bs4 import BeautifulSoup
with open('sample.html', 'r') as html_doc:
    soup = BeautifulSoup(html_doc, 'html.parser')
    for row in soup.find_all('tr', 'stats'):
        row_tds = row.find_all_next('td')
        print('{0} {1}'.format(
            row_tds[0].find('a').string.strip() if row_tds[0].find('a').string else 'None',
            row_tds[2].string.strip() if row_tds[2].string else 'None')
        )
output:
$ python testparse.py
PlayerA 29.2
PlayerB 32.2
Works.
Alternatively, I would suggest using a proper HTML parser instead of relying on regex, although BeautifulSoup is actually a very good and easy-to-use library.
Is your sample missing the closing </tr> tags after each group of <td> cells?
Edit: using OP sample as source
Anyhow, and using lxml.html with simple xpath to get hopefully what you expected:
In [1]: import lxml.html
# sample.html is the same as in OP sample
In [2]: tree = lxml.html.parse("sample.html")
In [3]: root = tree.getroot()
In [4]: players = root.xpath('.//td[@class="stats"]/a/text()')
In [5]: stats = root.xpath('//td[@class="stats" and normalize-space(text())]/text()')
In [6]: print players, stats
['\nPlayerA\n', '\nPlayerB\n'] ['\n29.2\n', '\n32.2\n']
In [7]: for player, stat in zip(players, stats):
   ...:     print player.strip(), stat.strip()
...:
PlayerA 29.2
PlayerB 32.2
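For reference, a minimal bs4 sketch that collects the names and ratings into a dict instead of printing them. It assumes Python 3 and that every player occupies exactly three td cells with class stats (name, record, rating), as in the sample from the question. The variable names and the trimmed HTML are only illustrative:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample, including the unclosed <tr> tags.
html = """
<tr class="stats">
<td class="stats"><a href="index.php?c=playerview&P=245&LS=113">
PlayerA
</a></td>
<td class="stats"><b>4</b>,<b>8</b></td>
<td class="stats">
29.2
</td>
<tr class="stats">
<td class="stats"><a href="index.php?c=playerview&P=245&LS=113">
PlayerB
</a></td>
<td class="stats"><b>4</b>,<b>8</b></td>
<td class="stats">
32.2
</td>
"""

soup = BeautifulSoup(html, "html.parser")
# All stats cells in document order, whether or not the rows were closed.
cells = soup.find_all("td", class_="stats")

ratings = {}
# Assumption: cells come in groups of three (name, record, rating).
for i in range(0, len(cells) - 2, 3):
    name = cells[i].get_text(strip=True)
    ratings[name] = float(cells[i + 2].get_text(strip=True))

print(ratings)
```

Grouping the flat list of cells avoids depending on how the parser nests the unclosed rows, at the cost of breaking if a row ever has a different number of cells.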

How to extract info from varying table entries: Text vs. DIV vs. SPAN

I am new to python and searched the internet to find an answer to my problem, but so far I failed...
The problem: My aim is to extract data from websites, more specifically from the tables in these websites. The relevant snippet of the page source is stored in "data" in my Python example here:
from bs4 import BeautifulSoup
data = '''<table class="ds-table">
<tr>
<td class="data-label">year of birth:</td>
<td class="data-value">1994</td>
</tr>
<tr>
<td class="data-label">reporting period:</td>
<td class="data-value">
<span class="editable" id="c-scope_beginning_date">
? </span>
-
<span class="editable" id="c-scope_ending_date">
? </span>
</td>
</tr>
<tr>
<td class="data-label">reporting cycle:</td>
<td class="data-value">
<span class="editable" id="c-periodicity">
- </span>
</td>
</tr>
<tr>
<td class="data-label">grade:</td>
<td class="data-value">1.3, upper 10% of class</td>
</tr>
<tr>
<td class="data-label">status:</td>
<td class="data-value"></td>
</tr>
</table>
<table class="ds-table">
<tr>
<td class="data-label">economics:</td>
<td class="data-value"><span class="positive-value"></span></td>
</tr>
<tr>
<td class="data-label">statistics:</td>
<td class="data-value"><span class="negative-value"></span></td>
</tr>
<tr>
<td class="data-label">social:</td>
<td class="data-value"><div id="music_id" class="trigger"><span class="negative-value"></span></div></td>
</tr>
<tr>
<td class="data-label">misc:</td>
<td class="data-value">
<div id="c_assurance" class="">
<span class="positive-value"></span> </div>
</td>
</tr>
<tr>
<td class="data-label">recommendation:</td>
<td class="data-value">
<span class="negative-value"></span> </td>
</tr>
</table>'''
soup = BeautifulSoup(data)
For the class="data-label" so far I successfully implemented...
box_cdl = []
for i, cdl in enumerate(soup.findAll('td', attrs={'class': 'data-label'})):
    box_cdl.append(cdl.contents[0])
print box_cdl
...which extracts the text from the columns, in the (for me satisfying) output:
[u'year of birth:',
u'reporting period:',
u'reporting cycle:',
u'grade:',
u'status:',
u'economics:',
u'statistics:',
u'social:',
u'misc:',
u'recommendation:']
Where I get stuck is the part for class="data-value" with the div- and span-fields and that some of the relevant information is hidden in the span-class. Moreover, the amount of the tr-rows can change from website to website, e.g. "status" comes after "reporting cycle" (instead of "grade").
However, when I do...
box_cdv = []
for j, cdv in enumerate(soup.findAll('td', attrs={'class': 'data-value'})):
    box_cdv.append(cdv.contents[0])
print box_cdv
...I get the error:
Traceback (most recent call last):
File "<ipython-input-53-7d5c095cf647>", line 3, in <module>
box_cdv.append(cdv.contents[0])
IndexError: list index out of range
What I would like to get instead is something like this (corresponding to the above "data"-example):
[u'1994',
u'? - ?',
u'-',
u'1.3, upper 10% of class',
u'',
u'positive-value',
u'negative-value',
u'negative-value',
u'positive-value',
u'negative-value']
The Question: how can I extract this information and collect the relevant data from each tr-row, given that the adequate extraction-code depends on the type of the category (year of birth, reporting period, ..., recommendation)?
Or, asking differently: what code extracts me, depending on the category (year of birth, reporting period, ..., recommendation), the corresponding value (1994, ..., negative-value)?
Since the amount and the type of the table-entries can differ between websites, a simple "on the i-th entry do the following" procedure is not applicable. The thing I am looking for I think is something like "if you find the text "recommendation:", then extract the class-type from the span-field", I guess. But unfortunately I do not have any clue how to translate that into python-language.
Any help is highly appreciated.
You get that error because one of the tags doesn't have any children, so its contents list is empty and indexing it with [0] raises an IndexError.
You can approach this in the following way:
1) Search for the data-label tags;
2) Find the next td sibling;
3) Check if the sibling has text;
3 a) If so, create a dict entry with the data-label text as the key and the sibling text as its value;
3 b) If not, check if the sibling contains a span with a class containing -value, and use that class name as the value;
4) Parse the data.
Example:
soup = BeautifulSoup(data, 'lxml')
result = {}
for tag in soup.find_all("td", { "class" : "data-label" }):
    NextSibling = tag.find_next("td", { "class" : "data-value" }).get_text(strip = True)
    if not NextSibling and len(tag.find_next("td").select('span[class*=-value]')) > 0:
        NextSibling = tag.find_next("td").select('span[class*=-value]')[0]["class"][0]
    result[tag.get_text(strip = True)] = NextSibling
print (result)
Result:
{
'year of birth:': '1994',
'reporting period:': '?-?',
'reporting cycle:': '-',
'grade:': '1.3, upper 10% of class',
'status:': '',
'economics:': 'positive-value',
'statistics:': 'negative-value',
'social:': 'negative-value',
'misc:': 'positive-value',
'recommendation:': 'negative-value'
}
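The same steps can also be sketched as a small function. This is only an illustration under a few assumptions: Python 3, the built-in html.parser instead of lxml, and whitespace collapsed so that the reporting period comes out as "? - ?" as requested in the question. The names extract_values and VALUE_CLASS are hypothetical, and the sample is a shortened version of the table above:

```python
import re
from bs4 import BeautifulSoup

VALUE_CLASS = re.compile(r"-value$")  # matches e.g. positive-value, negative-value

def extract_values(html):
    """Return {label: value} for each data-label / data-value pair (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for label in soup.find_all("td", class_="data-label"):
        cell = label.find_next_sibling("td", class_="data-value")
        if cell is None:
            continue
        # Collapse runs of whitespace so "?\n-\n?" becomes "? - ?".
        value = " ".join(cell.get_text().split())
        if not value:
            # No visible text: fall back to the class name of a *-value span, if any.
            span = cell.find("span", class_=VALUE_CLASS)
            if span is not None:
                value = next(c for c in span["class"] if VALUE_CLASS.search(c))
        result[label.get_text(strip=True)] = value
    return result

sample = """
<table class="ds-table">
<tr><td class="data-label">year of birth:</td><td class="data-value">1994</td></tr>
<tr><td class="data-label">reporting period:</td><td class="data-value">
<span class="editable">? </span>
-
<span class="editable">? </span></td></tr>
<tr><td class="data-label">status:</td><td class="data-value"></td></tr>
<tr><td class="data-label">social:</td><td class="data-value"><div class="trigger"><span class="negative-value"></span></div></td></tr>
</table>
"""
values = extract_values(sample)
print(values)
```

Because the lookup is driven by the label cell rather than by position, it does not matter how many rows a page has or in which order they appear.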

Deleting all content between brackets from a string using python

I am using beautiful soup to grab data from an html page, and when I grab the data, I am left with this:
<tr>
<td class="main rank">1</td>
<td class="main company"><a href="/colleges/williams-college/">
<img alt="" src="http://i.forbesimg.com/media/lists/colleges/williams-college_50x50.jpg">
<h3>Williams College</h3></img></a></td>
<td class="main">Massachusetts</td>
<td class="main">$61,850</td>
<td class="main">2,124</td>
</tr>
This is the beautifulsoup command I am using to get this:
html = open('collegelist.html')
test = BeautifulSoup(html)
soup = test.find_all('tr')
I now want to manipulate this text so that it outputs
1
Williams College
Massachusetts
$61,850
2,124
and I am having difficulty doing so for the entire document, where I have about 700 of these entries. Any advice would be appreciated.
Just get the .text (or use get_text()) for every tr in the loop:
soup = BeautifulSoup(open('collegelist.html'))
for tr in soup.find_all('tr'):
    print tr.text # or tr.get_text()
For the HTML you've provided it prints:
1
Williams College
Massachusetts
$61,850
2,124
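If the blank lines and surrounding whitespace that .text leaves behind get in the way, one option is the stripped_strings generator, which yields each text fragment already stripped. A minimal sketch, assuming Python 3 and the built-in html.parser; the sample is a trimmed copy of the row from the question (the img tag is left out):

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr>
<td class="main rank">1</td>
<td class="main company"><a href="/colleges/williams-college/">
<h3>Williams College</h3></a></td>
<td class="main">Massachusetts</td>
<td class="main">$61,850</td>
<td class="main">2,124</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# One list of clean strings per row; whitespace-only fragments are dropped.
rows = [list(tr.stripped_strings) for tr in soup.find_all("tr")]

for row in rows:
    print("\n".join(row))
```

Keeping one list per row also makes it easy to write the roughly 700 entries out as CSV later, instead of handling one long string.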
Use get_text() on each row and join the results:
soup = BeautifulSoup(html)
"".join([x.get_text() for x in soup.find_all('tr')])

Using Python and Beautifulsoup how do I select the desired table in a div?

I would like to be able to select the table containing the "Accounts Payable" text but I'm not getting anywhere with what I'm trying and I'm pretty much guessing using findall. Can someone show me how I would do this?
For example this is what I start with:
<div>
<tr>
<td class="lft lm">Accounts Payable
</td>
<td class="r">222.82</td>
<td class="r">92.54</td>
<td class="r">100.34</td>
<td class="r rm">99.95</td>
</tr>
<tr>
<td class="lft lm">Accrued Expenses
</td>
<td class="r">36.49</td>
<td class="r">33.39</td>
<td class="r">31.39</td>
<td class="r rm">36.47</td>
</tr>
</div>
And this is what I would like to get as a result:
<tr>
<td class="lft lm">Accounts Payable
</td>
<td class="r">222.82</td>
<td class="r">92.54</td>
<td class="r">100.34</td>
<td class="r rm">99.95</td>
</tr>
You can select the td elements with class lft lm and then examine the element.string to determine if you have the "Accounts Payable" td:
import sys
from BeautifulSoup import BeautifulSoup
# where so_soup.txt is your html
f = open ("so_soup.txt", "r")
data = f.readlines ()
f.close ()
soup = BeautifulSoup ("".join (data))
cells = soup.findAll('td', {"class" : "lft lm"})
for cell in cells:
    # You can compare cell.string against "Accounts Payable"
    print (cell.string)
If you would like to examine the following siblings for Accounts Payable for instance, you could use the following:
if (cell.string.strip () == "Accounts Payable"):
    sibling = cell.findNextSibling ()
    while (sibling):
        print ("\t" + sibling.string)
        sibling = sibling.findNextSibling ()
Update for Edit
If you would like to print out the original HTML, just for the siblings that follow the Accounts Payable element, this is the code for that:
lines = ["<tr>"]
for cell in cells:
    lines.append (cell.prettify().decode('ascii'))
    if (cell.string.strip () == "Accounts Payable"):
        sibling = cell.findNextSibling ()
        while (sibling):
            lines.append (sibling.prettify().decode('ascii'))
            sibling = sibling.findNextSibling ()
lines.append ("</tr>")
f = open ("so_soup_out.txt", "wt")
f.writelines (lines)
f.close ()
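With the newer bs4 package the same lookup can be sketched more compactly by finding the label cell and walking up to its parent row. This assumes Python 3 and the built-in html.parser, which keeps the tr elements even though they sit directly inside a div (other parsers may treat table rows outside a table differently). The helper name find_row is hypothetical:

```python
from bs4 import BeautifulSoup

html = """
<div>
<tr>
<td class="lft lm">Accounts Payable
</td>
<td class="r">222.82</td>
<td class="r">92.54</td>
<td class="r">100.34</td>
<td class="r rm">99.95</td>
</tr>
<tr>
<td class="lft lm">Accrued Expenses
</td>
<td class="r">36.49</td>
<td class="r">33.39</td>
<td class="r">31.39</td>
<td class="r rm">36.47</td>
</tr>
</div>
"""

def find_row(soup, label):
    """Return the <tr> whose label cell matches, or None (hypothetical helper)."""
    for cell in soup.find_all("td", class_="lft"):
        if cell.get_text(strip=True) == label:
            return cell.find_parent("tr")
    return None

soup = BeautifulSoup(html, "html.parser")
row = find_row(soup, "Accounts Payable")

# The matched row can be printed as HTML or reduced to its numeric cells.
print(row)
values = [td.get_text(strip=True) for td in row.find_all("td", class_="r")]
print(values)
```

Returning the whole tr tag gives back the original markup for that row, which is what the question asks for, and the numeric cells can still be pulled out of it afterwards.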
