Beautiful Soup - Get all text, but preserve link html? - python

I have to process a large archive of extremely messy HTML full of extraneous tables, spans and inline styles into markdown.
I am trying to use Beautiful Soup to accomplish this task, and my goal is basically the output of the get_text() function, except to preserve anchor tags with the href intact.
As an example, I would like to convert:
<td>
<font><span>Hello</span><span>World</span></font><br>
<span>Foo Bar <span>Baz</span></span><br>
<span>Example Link: <a href="https://google.com">Google</a></span>
</td>
Into:
Hello World
Foo Bar Baz
Example Link: <a href="https://google.com">Google</a>
My thought process so far was to simply grab all the tags and unwrap them all if they aren't anchors, but this causes the text to be repeated several times as soup.find_all(True) returns recursively nested tags as individual elements:
#!/usr/bin/env python
from bs4 import BeautifulSoup
example_html = '<td><font><span>Hello</span><span>World</span></font><br><span>Foo Bar <span>Baz</span></span><br><span>Example Link: <a href="https://google.com">Google</a></span></td>'
soup = BeautifulSoup(example_html, 'lxml')
# find_all(True) returns every tag in the tree, including nested ones
tags = soup.find_all(True)
for tag in tags:
    if tag.name == 'a':
        print("<a href='{}'>{}</a>".format(tag['href'], tag.get_text()))
    else:
        print(tag.get_text())
Which returns multiple fragments/duplicates as the parser moves down the tree:
HelloWorldFoo Bar BazExample Link: Google
HelloWorldFoo Bar BazExample Link: Google
HelloWorldFoo Bar BazExample Link: Google
HelloWorld
Hello
World
Foo Bar Baz
Baz
Example Link: Google
<a href='https://google.com'>Google</a>

One way to tackle this problem is to add special handling for <a> elements when collecting the text of an element.
You can do this by overriding the _all_strings() method so that it yields the string representation of each <a> descendant and skips the navigable string inside it. Note that _all_strings() is a private method, so its signature may differ between Beautiful Soup releases. Something along these lines:
from bs4 import BeautifulSoup, NavigableString, CData, Tag
class MyBeautifulSoup(BeautifulSoup):
    def _all_strings(self, strip=False, types=(NavigableString, CData)):
        for descendant in self.descendants:
            # return "a" string representation if we encounter it
            if isinstance(descendant, Tag) and descendant.name == 'a':
                yield str(descendant)
            # skip an inner text node inside "a"
            if isinstance(descendant, NavigableString) and descendant.parent.name == 'a':
                continue
            # default behavior
            if (
                (types is None and not isinstance(descendant, NavigableString))
                or
                (types is not None and type(descendant) not in types)):
                continue
            if strip:
                descendant = descendant.strip()
                if len(descendant) == 0:
                    continue
            yield descendant
Demo:
In [1]: data = """
...: <td>
...: <font><span>Hello</span><span>World</span></font><br>
...: <span>Foo Bar <span>Baz</span></span><br>
...: <span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span>
...: </td>
...: """
In [2]: soup = MyBeautifulSoup(data, "lxml")
In [3]: print(soup.get_text())
HelloWorld
Foo Bar Baz
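Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a>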
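Because _all_strings() is private, a variation that avoids subclassing may be easier to maintain. The following is only a minimal sketch, not the method from the answer above: it unwraps every tag except <a>, turns <br> into a newline, and keeps only the href attribute on anchors. The function name text_with_links and the use of html.parser are assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def text_with_links(html):
    """Return the text of `html`, keeping <a> tags (href only) as markup."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) returns a list built up front, so the tree can be
    # modified while iterating over it.
    for tag in soup.find_all(True):
        if tag.name == "a":
            # Drop styling attributes such as target/style; keep only href.
            tag.attrs = {k: v for k, v in tag.attrs.items() if k == "href"}
        elif tag.name == "br":
            tag.replace_with("\n")
        else:
            # Remove the tag itself but leave its children in place.
            tag.unwrap()
    return str(soup)


example_html = (
    '<td><font><span>Hello</span><span>World</span></font><br>'
    '<span>Foo Bar <span>Baz</span></span><br>'
    '<span>Example Link: <a href="https://google.com" target="_blank">Google</a></span></td>'
)
result = text_with_links(example_html)
print(result)
```

Adjacent spans such as Hello and World are joined without a space, just as get_text() would join them, so any extra spacing has to be added separately.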

To consider only direct children, set recursive=False; then process each 'td' and extract its text and anchor link individually.
#!/usr/bin/env python
from bs4 import BeautifulSoup
example_html = '<td><font><span>Some Example Text</span></font><br><span>Another Example Text</span><br><span>Example Link: <a href="https://google.com">Google</a></span></td>'
soup = BeautifulSoup(example_html, 'lxml')
# only the direct children of the soup object, not nested tags
tags = soup.find_all(recursive=False)
for tag in tags:
    print(tag.text)
    print(tag.find('a'))
If you want the text printed on separate lines you will have to process the spans individually.
for tag in tags:
    spans = tag.find_all('span')
    for span in spans:
        print(span.text)
    print(tag.find('a'))

Related

How do I use find_all or select more precisely in this case?

When I run the following code after importing stuff:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
Fighter1Main = []
page = 1
for i in range(1,3):
    url = Request(f"https://www.sherdog.com/events/a-{page}", headers={'User-Agent': 'Mozilla/5.0'})
    response = urlopen(url).read()
    soup = BeautifulSoup(response, "html.parser")
    for test2 in soup.find_all(class_="fighter left_side"):
        test3 = test2.find_all(itemprop="url")
        Fighter1Main.append(test3)
    page = page + 1
I get:
[[<a href="/fighter/Todd-Medina-61" itemprop="url">
<img alt="Todd 'El Tiburon' Medina" itemprop="image" src="/image_crop/200/300/_images/fighter/20140801074225_IMG_5098.JPG" title="Todd 'El Tiburon' Medina">
</img></a>], [<a href="/fighter/Ricco-Rodriguez-8" itemprop="url">
<img alt="Ricco 'Suave' Rodriguez" itemprop="image" src="/image_crop/200/300/_images/fighter/20141225125221_1MG_9472.JPG" title="Ricco 'Suave' Rodriguez">
</img></a>]]
But I was expecting:
<a href="/fighter/Todd-Medina-61" itemprop="url">
<a href="/fighter/Ricco-Rodriguez-8" itemprop="url">
This is the type of webpage in question https://www.sherdog.com/events/a-1
I also tried using a CSS selector with select() and got the same result.
for test2 in soup.select('.fighter.left_side [itemprop="url"]'):
    Fighter1Main.append(test2)
I thought I was using it correctly but I'm not sure how else to narrow it down to what I want.
If your issue is that you're getting a list of lists and you just want a flat list, then you should do it like this:
for test2 in soup.find_all(class_="fighter left_side"):
    Fighter1Main += [t for t in test2.find_all(itemprop="url")]
but since you weren't happy with the output from for test2 in soup.select('.fighter.left_side [itemprop="url"]'): Fighter1Main.append(test2) either, and judging from your title, I'm guessing that isn't the problem here.
If you want to filter out any tags that have a nested tag inside them, you can add :not(:has(*)) to your selector:
for test2 in soup.select('.fighter.left_side *[itemprop="url"]:not(:has(*))'):
    Fighter1Main.append(test2)
However, you can expect an empty list if you do this because [as far as I can tell] every tag matched by .fighter.left_side *[itemprop="url"] has an img tag nested within.
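To see what :not(:has(*)) does in isolation, here is a small sketch on made-up markup (the two links and their hrefs are hypothetical, not taken from the real page). One link wraps an image and the other holds only text, so only the second should survive the filter.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link wraps an <img>, the other holds only text.
sample = (
    '<div class="fighter left_side">'
    '<a href="/fighter/with-image" itemprop="url"><img src="x.jpg"/></a>'
    '<a href="/fighter/text-only" itemprop="url">Text only</a>'
    '</div>'
)
soup = BeautifulSoup(sample, "html.parser")

all_links = soup.select('.fighter.left_side [itemprop="url"]')
# :has(*) matches elements that contain at least one descendant element,
# so :not(:has(*)) keeps only the links without nested tags.
leaf_links = soup.select('.fighter.left_side [itemprop="url"]:not(:has(*))')

print(len(all_links), [a["href"] for a in leaf_links])
```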
If you really want something like your expected output, you'll have to either alter the soup or build it up yourself.
You can either remove everything inside the Tags with itemprop="url" [original soup object will be altered]:
for test2 in soup.select('.fighter.left_side *[itemprop="url"]'):
    test2.clear()
    Fighter1Main.append(test2)
Or you could form new html tags with only the href [if there is any] and itemprop attributes [original soup object will remain unaltered, but you'll be parsing and extracting again for each item]:
soup = BeautifulSoup(response, "html.parser")
Fighter1Main += [BeautifulSoup(
    f'<{n}{h} itemprop="url"></{n}>', "html.parser"
).find(n) for n, h in [(
    t.name, '' if t.get("href") is None else f' href="{t.get("href")}"'
) for t in soup.select('.fighter.left_side *[itemprop="url"]')]]
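A third possibility, sketched here as an assumption rather than as part of the answer above, is to build the stripped-down tags with soup.new_tag() instead of re-parsing a string for each item. The sample markup is a shortened, hypothetical version of the page structure, and bare_links is just an illustrative name.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the real page markup.
sample = (
    '<div class="fighter left_side">'
    '<a href="/fighter/Todd-Medina-61" itemprop="url">'
    '<img alt="Todd" itemprop="image" src="photo.jpg"/></a>'
    '</div>'
)
soup = BeautifulSoup(sample, "html.parser")

bare_links = []
for tag in soup.select('.fighter.left_side [itemprop="url"]'):
    # new_tag() creates a detached tag, so the original tree is untouched.
    bare = soup.new_tag(tag.name)
    for key in ("href", "itemprop"):
        if tag.has_attr(key):
            bare[key] = tag[key]
    bare_links.append(bare)

print(bare_links)
```

If only the URLs are needed, collecting tag["href"] directly would be simpler than keeping tag objects at all.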

How To List All The Attributes Within A Tag Using Beautiful Soup

Let's say I have an HTML file with this in it:
<div class="parentclass">
<img alt="puppydog" class="" src="https://google.com/puppydawgimage.png" title="puppy dog" />
</div>
My Python script looks like this:
from bs4 import BeautifulSoup
html = BeautifulSoup(html, "html.parser")
stuff = html.find_all("div", class_="parentclass")
parentclassstuff = []
for x in stuff:
    #This adds all of the tags specific to the criteria of "div" and "parentclass".
    parentclassstuff.append(x)
print(parentclassstuff[0])
'''This outputs:
<div class="parentclass">
<img alt="puppydog" class="" src="https://google.com/puppydawgimage.png" title="puppy dog"/></div>'''
What I want to do is access the individual attributes of parentclassstuff[0] like this:
parentclassstuff[0]["img alt"] = "puppydog"
parentclassstuff[0]["src"] = "https://google.com/puppydawgimage.png"
etc...
However, I can't find a way to do that. I have tried the parentclassstuff[0].get(attribute) method, among others. The output of parentclassstuff[0] is not a string but a "class 'bs4.element.Tag'". Worst-case scenario, I could just convert it to a string and slice out the data I need, but I would think Beautiful Soup provides an easy way to access this data.
I figured it out. The variable "stuff" in my example is a list of all the tags with the "parentclass" class, and the attributes I wanted belong to a child tag inside each of them (what I call a sub-class below). So I needed to access the child tags as follows (note: I technically did not need the "x in stuff" loop in my example, because "stuff" already behaves like a list even though it is not technically a "list" type):
from bs4 import BeautifulSoup
html = BeautifulSoup(html, "html.parser")
stuff = html.find_all("div", class_="parentclass")
#An empty list to put my attributes in (each tag's attributes will be in a dict format).
attributes = []
for x in range(0, len(stuff)):
    #This accesses the child tag (the "subclass") by tag name:
    subclass = stuff[x].find("img")
    #.attrs is the dict of all attributes on the tag.
    attribute = subclass.attrs
    attributes.append(attribute)
#The output will be something like this (Beautiful Soup stores "class" as a list rather than a string):
#attributes = [{"alt": "puppydog", "class": [...], "src": "https://google.com/puppydawgimage.png", "title": "puppy dog"}]
#If there is more than one matching div, the list will hold one dict per div, in order.
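For reference, here is a self-contained sketch of the same idea using the HTML from the question. It is an illustration only: it assumes html.parser, and the variable names markup and img are made up. It shows the three usual ways to read attributes from a Tag.

```python
from bs4 import BeautifulSoup

markup = """
<div class="parentclass">
<img alt="puppydog" class="" src="https://google.com/puppydawgimage.png" title="puppy dog" />
</div>
"""
soup = BeautifulSoup(markup, "html.parser")
img = soup.find("div", class_="parentclass").find("img")

print(sorted(img.attrs))   # names of all attributes present on the tag
print(img["alt"])          # item access raises KeyError if the attribute is missing
print(img.get("width"))    # .get() returns None if the attribute is missing
```

Note that the attributes live on the img tag, not on the div, which is why looking them up on parentclassstuff[0] itself did not work.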

python beautifulsoup4 how to get span text in div tag

This is the HTML code:
<div aria-label="RM 6,000 a month" class="salary-snippet"><span>RM 6,000 a month</span></div>
I used it like this:
divs = soup.find_all('div', class_='job_seen_beacon')
for item in divs:
    print(item.find('div', class_='salary-snippet'))
I got a list of results such as:
<div aria-label="RM 3,500 to RM 8,000 a month" class="salary-snippet"><span>RM 3,500 - RM 8,000 a month</span></div>
If I use
print(item.find('div', class_='salary-snippet').text.strip())
it raises the error
AttributeError: 'NoneType' object has no attribute 'text'
So how can I get only the span text? It's my first time web scraping.
Maybe this is what you are looking for.
First, select all the <div> tags with the class salary-snippet, as this is the parent of the <span> tag you are looking for. Use .find_all() for this.
Then iterate over the selected <div> tags and find the <span> in each <div>.
Based on your question, I assume that not all of these <div> tags have a <span> tag. In that case, you can print the text only if the <div> contains a <span> tag. See below:
# Find all the divs
d = soup.find_all('div', class_='salary-snippet')
# Iterate over the <div> tags
for item in d:
    # Find the <span> in each item. If it does not exist, x will be None
    x = item.find('span')
    # Print only if x is not None
    if x:
        print(x.text.strip())
Here is the complete code.
from bs4 import BeautifulSoup
s = """<div aria-label="RM 6,000 a month" class="salary-snippet"><span>RM 6,000 a month</span></div>"""
soup = BeautifulSoup(s, 'lxml')
d = soup.find_all('div', class_='salary-snippet')
for item in d:
    x = item.find('span')
    if x:
        print(x.text.strip())
RM 6,000 a month
I believe the line should be:
print(item.find('div', {'class':'salary-snippet'}).text.strip())
Alternatively, if there is only the span you can simply use:
item.find("span").text.strip()
Since you used the .find_all() method, you might want to make sure that every div returned by
soup.find_all('div', class_='job_seen_beacon')
contains the element you are looking for, as this error arises if even one of them doesn't.
For example:
divs = soup.find_all('div', class_='job_seen_beacon')
for item in divs:
    try:
        print(item.find('div', {'class':'salary-snippet'}).text.strip())
    except AttributeError:
        print("Item Not available")
This tries to get the text and, if that fails, prints a placeholder message so you can identify which items are missing the element you are searching for.
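An explicit None check can replace the try/except. The sketch below is an illustration on made-up markup: the second job card, which has no salary, is hypothetical, and salaries is just an illustrative name. It records None for cards that lack a salary snippet.

```python
from bs4 import BeautifulSoup

# Two hypothetical job cards; only the first one lists a salary.
html = """
<div class="job_seen_beacon"><div class="salary-snippet"><span>RM 6,000 a month</span></div></div>
<div class="job_seen_beacon"><p>No salary listed</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

salaries = []
for card in soup.find_all("div", class_="job_seen_beacon"):
    # select_one() returns None when nothing matches, so test before use.
    span = card.select_one("div.salary-snippet span")
    salaries.append(span.get_text(strip=True) if span else None)

print(salaries)
```

Keeping one entry per card, even when the salary is missing, makes it easier to line the salaries up with other fields scraped from the same cards.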

Scraping elements without an id or class from a web page using Python BeautifulSoup

I know how to scrape data from a web page if the element has an id or class.
e.g. Here, soup is a BeautifulSoup object.
for item in soup.findAll('a',{"class":"class_name"}):
    title = item.string
    print(title+"\n")
How can we do this if the element has no id or class? For example, a paragraph element without an id or class.
Or in an even worse case, what happens if we need to just scrape some plain text like the following?
<body>
<p>YO!</p>
hello world!!
</body>
How can I print only hello world!! in the above page source, for example?
It has no id or class.
If you want to locate an element specifically having no defined id and class attributes:
soup.find("p", class_=False, id=False)
To locate the "text" nodes like hello world!! in your example, you can either get it by the text itself - via a partial match, or a regular expression match:
import re
soup.find(text=re.compile("^hello")) # find text starting with "hello"
soup.find(text="hello world!!") # find text with an exact "hello world!!" text
soup.find(text=lambda text: text and "!!" in text) # find text having "!!" inside it
Or, you can locate that preceding p element and get the next text node:
soup.find("p", class_=False, id=False).find_next_sibling(text=True)
soup.find("p", text="YO!").find_next_sibling(text=True)
However, if you just want the text directly inside the body tag, but not the text inside any of its child tags, there is another option.
You could use tag.find_all() to get all of the tags inside it, and then use tag.extract() to remove them. You'll be left with a body tag that contains only the text.
For example:
>>> soup = BeautifulSoup('''\
... <body>
... <p>YO!</p>
... hello world!!
... </body>
... ''')
>>> print(soup.get_text())
YO!
hello world!!
>>> print(soup.find('body').get_text())
YO!
hello world!!
>>> for tag in soup.find('body').find_all():
... tag.extract()
...
...
<p>YO!</p>
>>> print(soup.find('body').get_text())
hello world!!
>>> print(soup.find('body').get_text(strip=True))
hello world!!
>>>
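If the tree should stay intact, the direct text nodes can also be read without extracting anything. This is a minimal sketch under the assumption that only text sitting directly inside body is wanted; direct_text is an illustrative name.

```python
from bs4 import BeautifulSoup, NavigableString

markup = "<body>\n<p>YO!</p>\nhello world!!\n</body>"
soup = BeautifulSoup(markup, "html.parser")
body = soup.find("body")

# .children yields only direct children; keep the non-empty text nodes.
direct_text = [
    child.strip()
    for child in body.children
    if isinstance(child, NavigableString) and child.strip()
]
print(direct_text)
```

Unlike the extract() approach, the p tag is still in the soup afterwards, so the same object can be reused for other lookups.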

Using BeautifulSoup to extract an li element based on a string contained within

I have been attempting to use BeautifulSoup to retrieve any <li> element that contains the word Ottawa in any form. The problem is that Ottawa is never within a tag of its own, such as <p>. So I want to print only the li elements that contain Ottawa.
The HTML formatting is like this:
<html>
<body>
<blockquote>
<ul><li><b>name</b>
(National: Ottawa, ON)
<blockquote> some description </blockquote></li>
<li><b>name</b>
(National: Vancouver, BC)
<blockquote> some description </blockquote></li>
<li><b>name</b>
(Local: Ottawa, ON)
<blockquote> some description </blockquote></li>
</ul>
</blockquote>
</body>
</html>
My code is as follows:
from bs4 import BeautifulSoup
import re
import urllib2,sys
url = "http://www.charityvillage.ca/cv/nonpr/nonpr1.html"
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
re1='.*?'
re2='(Ottawa)'
ottawa = soup.findAll(text=re.compile(re1+re2,re.IGNORECASE|re.DOTALL))
search = soup.findAll('li')
The results of the above code finds Ottawa correctly, and when using it to find the li elements, it does find the li elements but it gives me every single one on the page.
I understand that the two searches are not currently combined, but trying search = soup.findAll('li', text=re.compile(re1+re2,re.IGNORECASE|re.DOTALL)) results in []
My end goal is basically to get every <li> element that contains any mention of Ottawa and give me the entire <li> element with the name, description, link, etc.
Use the text attribute to filter the results of the findAll:
elems = [elem for elem in soup.findAll('li') if 'Ottawa' in str(elem.text)]
Alternatively, search the text nodes directly and look for a preceding sibling tag that carries an href:
from bs4 import BeautifulSoup
import re
from urllib.request import urlopen
url = "http://www.charityvillage.ca/cv/nonpr/nonpr1.html"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
for item in soup.find_all(text=re.compile(r'\(.+: Ottawa', re.IGNORECASE)):
    # has_attr() replaces the deprecated has_key()
    link = item.find_previous_sibling(lambda tag: tag.has_attr('href'))
    if link is None:
        continue
    print('{} [{}]: {}'.format(link.text, item.strip(), link['href']))
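The empty result in the question is probably explained by how a tag name and a text filter combine: the pattern is compared against each tag's .string, which is None when a tag has several children, as these li elements do. The sketch below is an offline illustration of the filtering approach on the sample HTML from the question, condensed onto fewer lines. It assumes html.parser, and the names pattern and matches are made up.

```python
import re
from bs4 import BeautifulSoup

markup = """
<ul>
<li><b>name</b> (National: Ottawa, ON) <blockquote> some description </blockquote></li>
<li><b>name</b> (National: Vancouver, BC) <blockquote> some description </blockquote></li>
<li><b>name</b> (Local: Ottawa, ON) <blockquote> some description </blockquote></li>
</ul>
"""
soup = BeautifulSoup(markup, "html.parser")
pattern = re.compile(r"ottawa", re.IGNORECASE)

# get_text() joins the text of all descendants, so the match does not
# depend on Ottawa sitting in a tag of its own.
matches = [li for li in soup.find_all("li") if pattern.search(li.get_text())]

for li in matches:
    print(li)
```

Each entry in matches is the whole li element, so the name, description, and any links inside it remain available.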
