Beautiful Soup HTML Extraction - python

I am struggling to get the data I want, and I am sure it's very simple if you know how to use BS. I have been trying to get this right for hours, to no avail, even after reading the docs.
Currently my code outputs this in python:
[<td>0.32%</td>, <td><span class="neg color ">>-0.01</span></td>, <td>0.29%</td>, <td>0.38%</td>, <td><span class="neu">0.00</span></td>]
How would I isolate the content of only those td tags that do not contain span tags?
i.e. I would like to see 0.32%, 0.29%, 0.38% only.
Thank you.
import urllib2
from bs4 import BeautifulSoup
fturl = 'http://markets.ft.com/research/Markets/Bonds'
ftcontent = urllib2.urlopen(fturl).read()
soup = BeautifulSoup(ftcontent, 'html.parser')
ftdata = soup.find(name="div", attrs={'class':'wsodModuleContent'}).find_all(name="td", attrs={'class':''})

Is this solution OK for you:
html_txt = """<td>0.32%</td>, <td><span class="neg color">
>-0.01</span></td>, <td>0.29%</td>, <td>0.38%</td>,
<td><span class="neu">0.00</span></td>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_txt, 'html.parser')
print [tag.text for tag in soup.find_all('td') if tag.text.strip().endswith("%")]
output is:
[u'0.32%', u'0.29%', u'0.38%']
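Another way (a sketch, assuming Python 3) is to keep only the td tags that contain no child tags at all, rather than checking for a trailing %:

```python
from bs4 import BeautifulSoup

html_txt = """<td>0.32%</td> <td><span class="neg color">-0.01</span></td>
<td>0.29%</td> <td>0.38%</td> <td><span class="neu">0.00</span></td>"""

soup = BeautifulSoup(html_txt, "html.parser")
# a td with no nested tags returns None from find()
values = [td.text for td in soup.find_all("td") if td.find() is None]
print(values)  # ['0.32%', '0.29%', '0.38%']
```

This keeps the percentage cells even if one of them doesn't end with "%".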

Related

Get the content of multiple classes when scraping a website

The problem that I am facing is simple. When I try to get some data from a website, there are two elements with the same class name, but each contains a table with different information. The code that I have only outputs the content of the very first one. It looks like this:
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find("tr", {"class": "table3"})
print(results.prettify())
How can I get the code to put out either the content of both tables or only the content of the second one?
Thanks for your answers in advance!
You can use .find_all() and index [1] to get the second result. Example:
from bs4 import BeautifulSoup
txt = """
<tr class="table3"> I don't want this </tr>
<tr class="table3"> I want this! </tr>
"""
soup = BeautifulSoup(txt, "html.parser")
results = soup.find_all("tr", class_="table3")
print(results[1]) # <-- get only second one
Prints:
<tr class="table3"> I want this! </tr>
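If you want the content of both tables rather than just the second, you can loop over every match (a sketch under the same assumptions):

```python
from bs4 import BeautifulSoup

txt = """
<tr class="table3"> I don't want this </tr>
<tr class="table3"> I want this! </tr>
"""
soup = BeautifulSoup(txt, "html.parser")
# collect the text of every matching element instead of indexing one
texts = [tr.get_text(strip=True) for tr in soup.find_all("tr", class_="table3")]
print(texts)  # ["I don't want this", 'I want this!']
```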

BeautifulSoup: Can't find Tag with text it contains

I have troubles finding a tag using the text it contains on the following page:
Link to web page
I am trying to find the Bloomberg and Reuters codes using the following code.
Using a CSS selector I tried:
css_selector = 'tr:has(> td:contains("Bloomberg Code"))'
my_tag: Tag = my_soup.select_one(css_selector)
Using find I tried:
my_tag = my_soup.find(lambda t: t.Tag == 'td' and re.findall('Bloomberg Code', t.text, flags=re.I))
They both return a massive amount of HTML code, which does start with the tag "tr" but doesn't match what I was expecting, which is:
<tr>
<td style="padding-top:5px">- Bloomberg Code : </td>
<td style="padding-left:10px;padding-top:5px" align="left"> FLTR:ID</td>
</tr>
I think the issue might be that BeautifulSoup sees it as a navigable string, but when I check the type of the result found for my_tag it says: class 'bs4.element.Tag'.
Thanks for the help
Best
You need a User-Agent header, and you want the adjacent sibling td of the td which contains the search term.
from bs4 import BeautifulSoup as bs
import requests

search_strings = ['Bloomberg Code :', ' Reuters Code :']
r = requests.get('https://www.marketscreener.com/FLUTTER-ENTERTAINMENT-PLC-59029817/company/', headers={'User-Agent': 'Mozilla/5.0'})
soup = bs(r.content, 'lxml')
for search_string in search_strings:
    node = soup.select_one(f'td:contains("{search_string}") + td')
    if node is None:
        print(f'{search_string} not found')
    else:
        print(node.text)
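As for the find attempt: `t.Tag` looks up a child tag named "Tag" (usually None) rather than the element's name; the attribute you want is `t.name`. A sketch against the HTML expected in the question:

```python
from bs4 import BeautifulSoup

html = """
<tr>
<td style="padding-top:5px">- Bloomberg Code : </td>
<td style="padding-left:10px;padding-top:5px" align="left"> FLTR:ID</td>
</tr>
"""
soup = BeautifulSoup(html, "html.parser")
# match on t.name, then step to the adjacent sibling cell
label = soup.find(lambda t: t.name == "td" and "Bloomberg Code" in t.get_text())
code = label.find_next_sibling("td").get_text(strip=True)
print(code)  # FLTR:ID
```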

BeautifulSoup and single quotes in attributes

I am trying to read an Html page and get some information from it.
In one of the lines, the information I need is inside an Image's alt attribute. like so:
<img src='logo.jpg' alt='info i need'>
The problem is that, when parsing this, BeautifulSoup surrounds the contents of alt with double quotes instead of keeping the single quotes already present.
Because of this, the result is something like this:
<img alt="\'info" i="" need="" src="\'logo.jpg\'"/>
Currently, my code consists in this:
name = row.find("td", {"class": "logo"}).find("img")["alt"]
Which should return "info i need" but is currently returning "\'info"
What can I be doing wrong?
Is there any settings that I need to change in order to beautifulsoup to parse this correctly?
Edit:
my code looks something like this (I used the standard html parser too, but no difference there):
import sys
import urllib.request
import time
from html.parser import HTMLParser
from bs4 import BeautifulSoup

def main():
    url = 'https://myhtml.html'
    with urllib.request.urlopen(url) as page:
        text = str(page.read())
        html = BeautifulSoup(page.read(), "lxml")
        table = html.find("table", {"id": "info_table"})
        rows = table.find_all("tr")
        for row in rows:
            if row.find("th") is not None:
                continue
            info = row.find("td", {"class": "logo"}).find("img")["alt"]
            print(info)

if __name__ == '__main__':
    main()
and the html:
<div class="table_container">
<table class="info_table" id="info_table">
<tr>
<th class="logo">Important infos</th>
<th class="useless">Other infos</th>
</tr>
<tr >
<td class="logo"><img src='Logo.jpg' alt='info i need'><br></td>
<td class="useless">
<nobr>useless info</nobr>
</td>
</tr>
<tr >
<td class="logo"><img src='Logo2.jpg' alt='info i need too'><br></td>
<td class="useless">
<nobr>useless info</nobr>
</td>
</tr>
Sorry, I am unable to add a comment.
I have tested your case and for me the output seems correct.
HTML:
<html>
<body>
<td class="logo">
<img src='logo.jpg' alt='info i need'>
</td>
</body>
</html>
Python:
from bs4 import BeautifulSoup

with open("myhtml.html", "r") as html:
    soup = BeautifulSoup(html, 'html.parser')
    name = soup.find("td", {"class": "logo"}).find("img")["alt"]
    print(name)
Returns:
info i need
I think your problem is an encoding problem while writing the file back to HTML. Please provide the full code and further information:
the html
your python code
Update:
I've tested your code, and it is not working at all :/
After reworking it, I was able to get the required output.
import sys
import urllib.request
import time
from html.parser import HTMLParser
from bs4 import BeautifulSoup

def main():
    url = 'https://code.mytesturl.net'
    with urllib.request.urlopen(url) as page:
        soup = BeautifulSoup(page, "html.parser")
        name = soup.find("td", {"class": "logo"}).find("img")["alt"]
        print(name)

if __name__ == '__main__':
    main()
Possible problems:
Maybe your parser should be html.parser
Your Python / bs4 versions?
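A likely culprit in the original snippet (an assumption on my part) is that page.read() is called twice: the first call consumes the response, so BeautifulSoup receives empty bytes. Single-quoted attributes themselves parse fine. A minimal sketch with an in-memory stand-in for the response:

```python
from io import BytesIO
from bs4 import BeautifulSoup

# stand-in for the HTTP response body
page = BytesIO(b"<td class='logo'><img src='Logo.jpg' alt='info i need'></td>")

text = page.read()   # first read consumes the stream
empty = page.read()  # second read returns b'' -- BeautifulSoup would get nothing
print(empty)         # b''

# parse the bytes that were actually read
soup = BeautifulSoup(text, "html.parser")
print(soup.find("img")["alt"])  # info i need
```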

Can't get item with python beautifulsoup

I'm trying to learn how to webscrape with beautifulsoup + python, and I want to grab the name of the cinematographer from https://letterboxd.com/film/donnie-darko/, but I can't figure out how to isolate the text. The HTML for what I want is written below; what I want to output is "Steven Poster":
<h3><span>Cinematography</span></h3>
<div class="text-sluglist">
<p>
Steven Poster
</p>
</div>
Within my code I've done soup.find(text="Cinematography"), and a mixture of different things like trying to find item or get_text from within the a and p tags, but ...
I would use a regex to parse the soup object for a link that contains "cinematography".
import re
import requests
from bs4 import BeautifulSoup
r = requests.get('https://letterboxd.com/film/donnie-darko/')
soup = BeautifulSoup(r.text, 'lxml')
cinematographer = soup(href=re.compile(r'/cinematography/'))[0].text
print(cinematographer)
# outputs "Steven Poster"
You can do the same without using regex as well:
import requests
from bs4 import BeautifulSoup
res = requests.get('https://letterboxd.com/film/donnie-darko/')
soup = BeautifulSoup(res.text,'lxml')
item = soup.select("[href*='cinematography']")[0].text
print(item)
Output:
Steven Poster
Use a CSS partial-match attribute selector via select_one (find() does not accept CSS selectors):
soup.select_one('a[href*="cinematography"]').text
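You can also navigate from the heading itself, which avoids depending on the link at all (a sketch against the markup from the question; the href shown is a made-up example):

```python
from bs4 import BeautifulSoup

html = """
<h3><span>Cinematography</span></h3>
<div class="text-sluglist">
<p><a href="/cinematography/steven-poster/">Steven Poster</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
# find the heading by its text, then walk to the adjacent div
heading = soup.find("span", string="Cinematography").find_parent("h3")
name = heading.find_next_sibling("div").get_text(strip=True)
print(name)  # Steven Poster
```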

Beautiful Soup scrapes only half table

I am trying to learn how to use Beautiful Soup and I have a problem when scraping a table from Wikipedia.
from bs4 import BeautifulSoup
import urllib2
wiki = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
page = urllib2.urlopen(wiki)
soup = BeautifulSoup(page, 'lxml')
print soup
It seems like I can't get the full Wikipedia table; the last entry I get with this code is Omnicom Group, and it stops before the closing /tr in the source code. If you check the original link, the last entry of the table is Zoetis, so it stops about halfway through.
Everything seems ok in the Wikipedia source code... Any idea of what I might be doing wrong?
Try this, and read the docs for more: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
from bs4 import BeautifulSoup
from urllib.request import urlopen
wiki = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
page = urlopen(wiki)
soup = BeautifulSoup(page, 'lxml')
result = soup.find("table", class_="wikitable")
print(result)
This should be the last <tr> in your result:
<tr>
<td><a class="external text" href="https://www.nyse.com/quote/XNYS:ZTS" rel="nofollow">ZTS</a></td>
<td>Zoetis</td>
<td><a class="external text" href="http://www.sec.gov/cgi-bin/browse-edgar?CIK=ZTS&action=getcompany" rel="nofollow">reports</a></td>
<td>Health Care</td>
<td>Pharmaceuticals</td>
<td>Florham Park, New Jersey</td>
<td>2013-06-21</td>
<td>0001555280</td>
</tr>
You will also need to install requests with pip install requests, and I used
python==3.4.3
beautifulsoup4==4.4.1
This is my working answer. It should work for you without even installing lxml.
I used Python 2.7
from bs4 import BeautifulSoup
import urllib2
wiki = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
page = urllib2.urlopen(wiki)
soup = BeautifulSoup(page, "html.parser")
print soup.table
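One quick way to check whether the whole table was parsed is to count the rows yourself (a sketch, assuming Python 3, with a tiny stand-in for the Wikipedia table):

```python
from bs4 import BeautifulSoup

# minimal stand-in for the Wikipedia constituents table
html = """
<table class="wikitable">
<tr><th>Symbol</th><th>Security</th></tr>
<tr><td>OMC</td><td>Omnicom Group</td></tr>
<tr><td>ZTS</td><td>Zoetis</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in soup.find("table", class_="wikitable").find_all("tr")]
print(len(rows), rows[-1])  # 3 ['ZTS', 'Zoetis']
```

If the last row printed is not the one you expect (Zoetis, in this case), the truncation happened at parse time, not in your extraction code.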
