I am using BeautifulSoup to get data off a website. I can find the data I want, but when I print it, it comes out as "-1". The actual value in the field is 32.27. Here is the code I'm using:
import requests
from BeautifulSoup import BeautifulSoup
import csv

symbols = {'451020'}

with open('industry_pe.csv', "ab") as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    writer.writerow(['Industry', 'PE'])
    for s in symbols:
        try:
            url = 'https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/industries.jhtml?tab=learn&industry='
            full = url + s
            response = requests.get(full)
            html = response.content
            soup = BeautifulSoup(html)
            for PE in soup.find("div", {"class": "sec-fundamentals"}):
                print PE
                #IndPE = PE.find("td")
                #print IndPE
When I print PE, it returns this:
<h2>
Industry Fundamentals
<span>AS OF 03/08/2018</span>
</h2>
<table summary="" class="data-tbl">
<colgroup>
<col class="col1" />
<col class="col2" />
</colgroup>
<thead>
<tr>
<th scope="col"></th>
<th scope="col"></th>
</tr>
</thead>
<tbody>
<tr>
<th scope="row" class="align-left"><a href="javascript:void(0);" onclick="javascript:openPopup('https://www.fidelity.com//webcontent/ap010098-etf-content/18.01/help/research/learn_er_glossary_3.shtml#priceearningsratio',420,450);return false;">P/E (Last Year GAAP Actual)</a></th>
<td>
32.27
</td>
</tr>
<tr>
<th scope="row" class="align-left"><a href="javascript:void(0);" onclick="javascript:openPopup('https://www.fidelity.com//webcontent/ap010098-etf-content/18.01/help/research/learn_er_glossary_3.shtml#priceearningsratio',420,450);return false;">P/E (This Year's Estimate)</a>.....
I want to get the value 32.27 from the 'td', but when I use the code I have commented out to get and print 'td', it gives me this:
-1
None
-1
<td>
32.27
</td>
-1
Any ideas?
The find() method returns the first matching tag. Iterating over the contents of a tag gives you all of its children one by one, including NavigableString objects; on a NavigableString, find() is the string method str.find(), which returns -1 when there is no match. That is where your -1 output comes from.
So, to get the <td> tags in the table, first find the table and store it in a variable, then iterate over all the td tags using find_all('td').
table = soup.find("div", {"class": "sec-fundamentals"})
for row in table.find_all('td'):
    print(row.text.strip())
Partial Output:
32.27
34.80
$122.24B
$3.41
14.14%
15.88%
If you want only the first value, you can use this:
table = soup.find("div", {"class": "sec-fundamentals"})
value = table.find('td').text.strip()
print(value)
# 32.27
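Tying this back to the CSV file the question opens, here is a minimal Python 3 sketch. The inline HTML is a made-up stand-in for the page's sec-fundamentals block, and "w" mode with newline="" replaces the Python 2 "ab" mode:

```python
import csv
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's sec-fundamentals block.
html = """
<div class="sec-fundamentals">
  <table>
    <tr><th>P/E (Last Year GAAP Actual)</th><td> 32.27 </td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("div", {"class": "sec-fundamentals"})
value = table.find("td").text.strip()

# Write one (Industry, PE) row, as the question intended.
with open("industry_pe.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Industry", "PE"])
    writer.writerow(["451020", value])
```

The same loop structure as the question's `for s in symbols:` can wrap the parsing step to handle multiple industry codes.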
Related
I have a file test_input.htm with a table:
<table>
<thead>
<tr>
<th>Acronym</th>
<th>Full Term</th>
<th>Definition</th>
<th>Product </th>
</tr>
</thead>
<tbody>
<tr>
<td>a1</td>
<td>term</td>
<td>
<p>texttext.</p>
<p>Source: PRISMA-GLO</p>
</td>
<td>
<p>PRISMA</p>
<p>SDDS-NG</p>
</td>
</tr>
<tr>
<td>a2</td>
<td>term</td>
<td>
<p>texttext.</p>
<p>Source: PRISMA-GLO</p>
</td>
<td>
<p>PRISMA</p>
</td>
</tr>
<tr>
<td>a3</td>
<td>term</td>
<td>
<p>texttext.</p>
<p>Source: PRISMA-GLO</p>
</td>
<td>
<p>SDDS-NG</p>
</td>
</tr>
<tr>
<td>a4</td>
<td>term</td>
<td>
<p>texttext.</p>
<p>Source: SD-GLO</p>
</td>
<td>
<p>SDDS-NG</p>
</td>
</tr>
</tbody>
</table>
I would like to write only those table rows to a file test_output.htm that contain the keyword PRISMA in column 4 (Product).
The following script gives me all table rows that contain the keyword PRISMA in any of the 4 columns:
from bs4 import BeautifulSoup

file_input = open('test_input.htm')
results = BeautifulSoup(file_input.read(), 'html.parser')
inhalte = results.find_all('tr')

with open('test_output.htm', 'a') as f:
    data = [[td.findChildren(text=True) for td in inhalte]]
    for line in inhalte:  # if you see a line in the table
        if line.get_text().find('PRISMA') > -1:  # and you find the specific string
            f.write("%s\n" % str(line))
I really tried hard but could not figure out how to restrict the search to column 4.
The following did not work:
data = [[td.findChildren(text=True) for td in tr.findAll('td')[4]] for tr in inhalte]
I would really appreciate if someone could help me find the solution.
Be more specific in selecting the elements you expect; for example, use CSS selectors to achieve your task. The following line selects only those tr elements whose fourth td contains PRISMA:
soup.select('table tr:has(td:nth-of-type(4):-soup-contains("PRISMA"))')
Example
from bs4 import BeautifulSoup

file_input = open('test_input.htm')
soup = BeautifulSoup(file_input.read(), 'html.parser')

with open('test_output.htm', 'a') as f:
    for line in soup.select('table tr:has(td:nth-of-type(4):-soup-contains("PRISMA"))'):
        f.write("%s\n" % str(line))
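If your soupsieve version is too old for :-soup-contains() (it is a fairly recent addition), the same filter can be written in plain Python by indexing the fourth td. A sketch against a trimmed stand-in of the input table:

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for test_input.htm: only row a1 has PRISMA in column 4.
html = """
<table>
  <tr><td>a1</td><td>term</td><td>text</td><td><p>PRISMA</p><p>SDDS-NG</p></td></tr>
  <tr><td>a3</td><td>term</td><td>text</td><td><p>SDDS-NG</p></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only rows whose fourth <td> mentions PRISMA.
matching = [
    tr for tr in soup.find_all("tr")
    if len(tr.find_all("td")) >= 4 and "PRISMA" in tr.find_all("td")[3].get_text()
]
for tr in matching:
    print(tr)
```

Note the index [3]: find_all('td')[3] is the fourth cell, which is why the attempt with [4] in the question came up empty.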
I'd like to get the content of the table on the following website into a pandas dataframe: https://projects.fivethirtyeight.com/soccer-predictions/premier-league/
I'm quite new to BS, but I believe that what I want would be something like:
import requests
from bs4 import BeautifulSoup
r = requests.get(url = "https://projects.fivethirtyeight.com/soccer-predictions/ligue-1/")
soup = BeautifulSoup(r.text, "html.parser")
#print(soup.prettify())
print(soup.find("div", {"class":"forecast-table"}))
Unfortunately, this returns None. Any help and guidance would be amazing!
I believe that the bit I need to get is somewhere in here (not really sure though):
<div id="forecast-table-wrapper">
<table class="forecast-table" id="forecast-table">
<thead>
<tr class="desktop">
<th class="top nosort">
</th>
<th class="top bordered-right rating nosort drop-6" colspan="3">
Team rating
</th>
<th class="top nosort rating2" colspan="1">
</th>
<th class="top bordered-right nosort drop-1" colspan="5">
avg. simulated season
</th>
<th class="top bordered-right nosort show-1 drop-3" colspan="2">
avg. simulated season
</th>
<th class="top bordered nosort" colspan="4">
end-of-season probabilities
</th>
</tr>
<tr class="sep">
<th colspan="11">
</th>
</tr>
Since you're using pandas anyway, you can use its built-in table processing. Note that read_html returns a list of dataframes (one per matching table), so take the first element:
import pandas

tables = pandas.read_html('https://projects.fivethirtyeight.com/soccer-predictions/premier-league/',
                          attrs={'class': 'forecast-table'}, header=1)
df = tables[0]
That's because you are searching for a div, but it's a table, so it should be:
print(soup.find("table", {"class":"forecast-table"}))
import requests
from bs4 import BeautifulSoup

r = requests.get('https://projects.fivethirtyeight.com/soccer-predictions/ligue-1/')
soup = BeautifulSoup(r.content, 'html.parser')
table = soup.find_all('table', attrs={'class': 'forecast-table'})
for i in table:
    tr = i.find_all('tr')
    for l in tr:
        print(l.text)
Output:
Team ratingavg. simulated seasonavg. simulated seasonend-of-season probabilities
teamspioff.def.WDLgoal diff.proj. pts.pts.relegatedrel.qualify for UCLmake UCLwin Ligue 1win league
PSG24 pts90.03.00.530.74.52.9+7897<1%>99%97%
Lyon14 pts76.32.10.719.69.19.3+2768<1%60%2%
Marseille13 pts71.12.00.918.38.311.4+1663<1%40%<1%
Lille19 pts63.71.70.916.78.612.6+9591%24%<1%
St Étienne15 pts62.71.60.914.710.912.4-1553%14%<1%
Montpellier16 pts64.01.50.713.912.411.7+2543%12%<1%
Nice11 pts62.01.60.913.510.014.5-7507%7%<1%
Monaco6 pts65.91.80.913.010.714.2+0508%7%<1%
Rennes8 pts63.41.60.813.010.514.5-3499%6%<1%
Bordeaux14 pts59.21.50.913.09.915.0-6498%5%<1%
Strasbourg12 pts59.21.51.012.610.814.6-2499%5%<1%
Angers11 pts60.41.50.912.610.215.2-54810%4%<1%
Toulouse13 pts58.21.50.911.912.014.1-104811%4%<1%
Dijon FCO10 pts57.71.61.112.28.517.3-124517%2%<1%
Caen10 pts55.61.41.010.812.414.8-104518%3%<1%
Nîmes10 pts54.91.51.110.711.615.6-134420%2%<1%
Reims10 pts55.31.30.910.312.315.4-144321%2%<1%
Nantes6 pts59.01.50.910.410.916.7-144225%1%<1%
Guingamp5 pts57.31.51.010.39.817.9-194130%<1%<1%
Amiens10 pts53.01.31.010.49.018.6-164031%<1%<1%
I have a simple HTML table to parse, but somehow BeautifulSoup only gives me results from the last row. I'm wondering if anyone could take a look and see what's wrong. I already created the rows object from the HTML table:
<table class='participants-table'>
<thead>
<tr>
<th data-field="name" class="sort-direction-toggle name">Name</th>
<th data-field="type" class="sort-direction-toggle type active-sort asc">Type</th>
<th data-field="sector" class="sort-direction-toggle sector">Sector</th>
<th data-field="country" class="sort-direction-toggle country">Country</th>
<th data-field="joined_on" class="sort-direction-toggle joined-on">Joined On</th>
</tr>
</thead>
<tbody>
<tr>
<th class='name'>Grontmij</th>
<td class='type'>Company</td>
<td class='sector'>General Industrials</td>
<td class='country'>Netherlands</td>
<td class='joined-on'>2000-09-20</td>
</tr>
<tr>
<th class='name'>Groupe Bial</th>
<td class='type'>Company</td>
<td class='sector'>Pharmaceuticals & Biotechnology</td>
<td class='country'>Portugal</td>
<td class='joined-on'>2004-02-19</td>
</tr>
</tbody>
</table>
I use the following codes to get the rows:
table=soup.find_all("table", class_="participants-table")
table1=table[0]
rows=table1.find_all('tr')
rows=rows[1:]
This gets:
rows=[<tr>
<th class="name">Grontmij</th>
<td class="type">Company</td>
<td class="sector">General Industrials</td>
<td class="country">Netherlands</td>
<td class="joined-on">2000-09-20</td>
</tr>, <tr>
<th class="name">Groupe Bial</th>
<td class="type">Company</td>
<td class="sector">Pharmaceuticals & Biotechnology</td>
<td class="country">Portugal</td>
<td class="joined-on">2004-02-19</td>
</tr>]
As expected, it looks like. However, if I continue:
for row in rows:
cells = row.find_all('th')
I'm only able to get the last entry!
cells=[<th class="name">Groupe Bial</th>]
What is going on? This is my first time using beautifulsoup, and what I'd like to do is to export this table into CSV. Any help is greatly appreciated! Thanks
You need to extend a list if you want all the th tags in one place. You keep reassigning cells = row.find_all('th'), so when you print cells outside the loop you only see what it was last assigned to, i.e. the th in the last tr:
cells = []
for row in rows:
    cells.extend(row.find_all('th'))
Also since there is only one table you can just use find:
soup = BeautifulSoup(html)
table = soup.find("table", class_="participants-table")
If you want to skip the thead row you can use a css selector:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
rows = soup.select("table.participants-table thead ~ tr")
cells = [tr.th for tr in rows]
print(cells)
cells will give you:
[<th class="name">Grontmij</th>, <th class="name">Groupe Bial</th>]
To write the whole table to csv:
import csv

soup = BeautifulSoup(html, "html.parser")
rows = soup.select("table.participants-table tr")

with open("data.csv", "w") as out:
    wr = csv.writer(out)
    wr.writerow([th.text for th in rows[0].find_all("th")] + ["URL"])
    for row in rows[1:]:
        wr.writerow([tag.text for tag in row.find_all()] + [row.th.a["href"]])
which for your sample will give you:
Name,Type,Sector,Country,Joined On,URL
Grontmij,Company,General Industrials,Netherlands,2000-09-20,/what-is-gc/participants/4479-Grontmij
Groupe Bial,Company,Pharmaceuticals & Biotechnology,Portugal,2004-02-19,/what-is-gc/participants/4492-Groupe-Bial
I have a question about selecting a list of tags (or single tags) using a condition on one of the attributes of it's children. Specifically, given the HTML code:
<tbody>
<tr class="" data-row="0">
<tr class="" data-row="1">
<tr class="" data-row="2">
<td align="right" csk="13">13</td>
<td align="left" csk="Jones,Andre">Andre Jones
</td>
<tr class="" data-row="3">
<td align="right" csk="7">7</td>
<td align="left" csk="Jones,DeAndre">DeAndre Jones
</td>
<tr class="" data-row="4">
<tr class="" data-row="5">
I have a unicode variable coming from an external loop, and I am trying to look through each row in the table to extract the <tr> tags with Player == Table.tr.a.text and to identify duplicate player names in Table. So, for instance, if there is more than one player with Player = 'Andre Jones', the MyRow object should return all <tr> tags that contain that player's name, while if there is only one row with Player = 'Andre Jones', then MyRow should just contain the single <tr> element whose anchor text equals 'Andre Jones'. I've been trying things like
Table = soup.find('tbody')
MyRow = Table.find_all(lambda X: X.name=='tr' and Player == X.text)
But this returns [] for MyRow. If I use
MyRow = Table.find_all(lambda X: X.name=='tr' and Player in X.text)
This will pick up any <tr> that has Player as a substring of X.text. In the example code above, it extracts both the <tr> tag with Table.tr.td.a.text == 'Andre Jones' and the one with Table.tr.td.a.text == 'DeAndre Jones'. Any help would be appreciated.
You could do this easily with XPath and lxml:
import lxml.html
root = lxml.html.fromstring('''...''')
td = root.xpath('//tr[.//a[text() = "FooName"]]')
The BeautifulSoup "equivalent" would be something like:
rows = soup.find('tbody').find_all('tr')
td = next(row for row in rows if row.find('a', text='FooName'))
Or if you think about it backwards:
td = soup.find('a', text='FooName').find_parent('tr')
Whatever you desire. :)
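Since the question also asks for every row matching a given player name (to catch duplicates), the backwards approach generalises with find_all() plus find_parent(). A sketch against a made-up stand-in table:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in: "Andre Jones" appears twice, "DeAndre Jones" once.
html = """
<tbody>
  <tr data-row="2"><td>13</td><td><a href="#">Andre Jones</a></td></tr>
  <tr data-row="3"><td>7</td><td><a href="#">DeAndre Jones</a></td></tr>
  <tr data-row="4"><td>9</td><td><a href="#">Andre Jones</a></td></tr>
</tbody>
"""

soup = BeautifulSoup(html, "html.parser")
player = "Andre Jones"

# string= matches the anchor's text exactly, so "DeAndre Jones" is excluded,
# unlike the substring test with `Player in X.text`.
rows = [a.find_parent("tr") for a in soup.find_all("a", string=player)]
for tr in rows:
    print(tr)
```

This returns one <tr> per exact-name match, which is precisely the duplicate-detection behaviour the question describes.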
Solution1
Logic: find the first tag whose tag name is tr and contains 'FooName' in this tag's text including its children.
# Exact Match (text is unicode, turn into str)
print Table.find(lambda tag: tag.name=='tr' and 'FooName' == tag.text.encode('utf-8'))
# Fuzzy Match
# print Table.find(lambda tag: tag.name=='tr' and 'FooName' in tag.text)
Output:
<tr class="" data-row="2">
<td align="right" csk="3">3</td>
<td align="left" csk="Wentz,Parker">
FooName
</td>
</tr>
Solution2
Logic: find the element whose text contains FooName (the anchor tag in this case), then go up the tree and search its parents (including ancestors) for the first tag named tr.
# Exact Match
print Table.find(text='FooName').find_parent('tr')
# Fuzzy Match
# import re
# print Table.find(text=re.compile('FooName')).find_parent('tr')
Output
<tr class="" data-row="2">
<td align="right" csk="3">3</td>
<td align="left" csk="Wentz,Parker">
FooName
</td>
</tr>
I'm trying to extract some fields from the output at the end of this question with the following code:
import lxml.html as LH

doc = LH.fromstring(html2)
tds = (td.text_content() for td in doc.xpath("//td[not(*)]"))
for a, b, c in zip(*[tds]*3):
    print (a, b, c)
What I expect is to extract only the fields notificationNodeName, packageName, and notificationEnabled. The main problem is that I want to put the result into a database, so I need the values grouped per record. Instead, the code above returns:
('JDBCAdapter', 'JDBCAdapter', 'Package:Notif')
('Package', 'yes', 'Package_2:Notif')
('Package_2', 'yes')
What I need:
('Package:Notif','Package', 'yes')
('Package_2:Notif','Package_2', 'yes')
An ugly solution that I found was:
doc = LH.fromstring(html2)
tds = (td.text_content() for td in doc.xpath("//td"))
for td, val in zip(*[tds]*2):
    if td == 'notificationNodeName':
        notificationNodeName = val
    elif td == 'packageName':
        packageName = val
    elif td == 'notificationEnabled':
        notificationEnabled = val
        print (notificationNodeName, packageName, notificationEnabled)
It works, but it doesn't seem right to me; I'm sure there is a better way to do it.
Original HTML Output:
<tbody><tr>
<td valign="top"><b>adapterTypeName</b></td>
<td>JDBCAdapter</td>
</tr>
<tr>
<td valign="top"><b>adapterTypeNameList</b></td>
<td>
<table>
<tbody><tr>
<td>JDBCAdapter</td>
</tr>
</tbody></table>
</td>
</tr>
<tr>
<td valign="top"><b>notificationDataList</b></td>
<td>
<table>
<tbody><tr>
<td><table bgcolor="#dddddd" border="1">
<tbody><tr>
<td valign="top"><b>notificationNodeName</b></td>
<td>package:Notif</td>
</tr>
<tr>
<td valign="top"><b>packageName</b></td>
<td>Package</td>
</tr>
<tr>
<td valign="top"><b>notificationEnabled</b></td>
<td>unsched</td>
</tr>
</tbody></table>
</td>
</tr>
<tr>
<td><table bgcolor="#dddddd" border="1">
<tbody><tr>
<td valign="top"><b>notificationNodeName</b></td>
<td>Package_2:notif</td>
</tr>
<tr>
<td valign="top"><b>packageName</b></td>
<td>package_2</td>
</tr>
<tr>
<td valign="top"><b>notificationEnabled</b></td>
<td>yes</td>
</tr>
and continues with more non-relevant, repetitive data.
I would recommend the excellent lxml and its cssselect functionality for most HTML parsing.
You can then select each field you are interested in thusly:
from lxml import html

root = html.parse(open('your/file.html')).getroot()

sibling_content = lambda x: [b.getparent().getnext().text_content() for b in
                             root.cssselect("td b:contains('{0}')".format(x))]

fields = ['notificationNodeName', 'packageName', 'notificationEnabled']
for item in zip(*[sibling_content(field) for field in fields]):
    print item
I would also recommend lxml - it's the de facto standard for parsing XML or HTML with Python.
As an alternative to David's approach, here's a solution using xpaths:
from lxml import html
from lxml import etree
html_file = open('test.html', 'r')
root = html.parse(html_file).getroot()
# Strip those annoying <b> tags for easier xpaths
etree.strip_tags(root,'b')
data_list = root.xpath("//td[text()='notificationDataList']/following-sibling::*")[0]
node_names = data_list.xpath("//td[text()='notificationNodeName']/following-sibling::*/text()")
package_names = data_list.xpath("//td[text()='packageName']/following-sibling::*/text()")
enableds = data_list.xpath("//td[text()='notificationEnabled']/following-sibling::*/text()")
print zip(node_names, package_names, enableds)
Output:
[('package:Notif', 'Package', 'unsched'),
('Package_2:notif', 'package_2', 'yes')]
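For completeness, the same label/value pairing can be sketched with BeautifulSoup, assuming (as in the HTML above) that each <b> label's value sits in the td that follows its parent cell. The inline snippet below is a trimmed stand-in for the original output:

```python
from bs4 import BeautifulSoup

# Trimmed stand-in: two records of label/value rows, as in the question's HTML.
html = """
<table><tbody>
  <tr><td><b>notificationNodeName</b></td><td>package:Notif</td></tr>
  <tr><td><b>packageName</b></td><td>Package</td></tr>
  <tr><td><b>notificationEnabled</b></td><td>unsched</td></tr>
  <tr><td><b>notificationNodeName</b></td><td>Package_2:notif</td></tr>
  <tr><td><b>packageName</b></td><td>package_2</td></tr>
  <tr><td><b>notificationEnabled</b></td><td>yes</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
fields = ["notificationNodeName", "packageName", "notificationEnabled"]

# For each label, collect the text of the <td> that follows its parent cell,
# then zip the columns back into per-record tuples.
columns = [
    [b.find_parent("td").find_next_sibling("td").get_text(strip=True)
     for b in soup.find_all("b", string=field)]
    for field in fields
]
records = list(zip(*columns))
print(records)
```

Like the two lxml answers, this assumes the labels repeat in the same order per record; zip truncates silently if one label is missing, so validate the column lengths before inserting into a database.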