Python + BeautifulSoup: How to get wrapper out of HTML based on text?

I would like to get the wrapper element of a given text. For example, in this HTML:
…
<div class="target">chicken</div>
<div class="not-target">apple</div>
…
Based on the text "chicken", I would like to get back <div class="target">chicken</div>.
Currently, I have the following to fetch the HTML:
import requests
from bs4 import BeautifulSoup
req = requests.get(url).text
soup = BeautifulSoup(req, 'html.parser')
At the moment I just call soup.find_all('div', …) and loop through every div to find the wrapper I am looking for.
Without looping through every div, what would be the proper and most efficient way to fetch the wrapper in the HTML based on a given text?
Thank you in advance, and I will be sure to accept/upvote the answer!

# coding: utf-8
html_doc = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title> Last chicken leg on stock! Only 500$ !!! </title>
</head>
<body>
<div id="layer1" class="class1">
<div id="layer2" class="class2">
<div id="layer3" class="class3">
<div id="layer4" class="class4">
<div id="layer5" class="class5">
<p>My chicken has <span style="color:blue">ONE</span> leg :P</p>
<div id="layer6" class="class6">
<div id="layer7" class="class7">
<div id="chicken_surname" class="chicken">eat me</div>
<div id="layer8" class="class8">
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</body>
</html>"""
from bs4 import BeautifulSoup as BS
import re
soup = BS(html_doc, "lxml")
# (tag -> text) direction is pretty obvious that way
tag = soup.find('div', class_="chicken")
tag2 = soup.find('div', {'id':"chicken_surname"})
print('\n###### by_cls:')
print(tag)
print('\n###### by_id:')
print(tag2)
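# but it can be tricky when you need to find a tag by substring
tag_by_str = soup.find(string="eat me")  # exact match: found
tag_by_sub = soup.find(string="eat")  # exact match only, so a substring returns None
tag_by_resub = soup.find(string=re.compile("eat"))  # regex search matches substrings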
print('\n###### tag_by_str:')
print(tag_by_str)
print('\n###### tag_by_sub:')
print(tag_by_sub)
print('\n###### tag_by_resub:')
print(tag_by_resub)
# there is more than one way to access the underlying strings
# the two behave differently - see the results
tag = soup.find('p')
print('\n###### .text attr:')
print(tag.text, type(tag.text))
print('\n###### .strings generator:')
for s in tag.strings:  # .strings is a generator
    print(s, type(s))
# note that the .strings generator yields bs4.element.NavigableString elements,
# so we can use them to navigate, for example by accessing their parents:
print('\n###### NavigableString parents:')
for s in tag.strings:
    print(s.parent)
# or even grandparents :)
print('\n###### grandparents:')
for s in tag.strings:
    print(s.parent.parent)
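To tie this back to the question's example, here is a minimal sketch, assuming the two-div markup from the question and the built-in html.parser. Once the matching string is found, its .parent is the wrapper; passing a tag name together with string= should give the same element in one call, as long as the text is the tag's only content.

```python
from bs4 import BeautifulSoup

# Markup copied from the question; the parser choice is an assumption.
html = '<div class="target">chicken</div><div class="not-target">apple</div>'
soup = BeautifulSoup(html, "html.parser")

# Option 1: find the text node, then step up to its wrapper.
text_node = soup.find(string="chicken")
wrapper = text_node.parent if text_node is not None else None

# Option 2: ask for a <div> whose .string equals the text.
direct = soup.find("div", string="chicken")

print(wrapper)  # <div class="target">chicken</div>
```

Neither call needs an explicit loop in your own code, although BeautifulSoup still walks the tree internally.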

Related

Parsing invalid HTML and retrieving tags' text to replace it

I need to iterate over invalid HTML and obtain the text value of every tag so I can change it.
from bs4 import BeautifulSoup
html_doc = """
<div class="oxy-toggle toggle-7042 toggle-7042-expanded" data-oxy-toggle-active-class="toggle-7042-expanded" data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
<div class="oxy-expand-collapse-icon" href="#"></div>
<div class="oxy-toggle-content">
<h3 class="ct-headline" id="headline-213-142"><span class="ct-span" id="span-225-142">Sklizeň jahod 2019</span></h3> </div>
</div><div class="ct-text-block" id="text_block-214-142"><span class="ct-span" id="span-230-142"><p>Začátek sklizně: <strong>Zahájeno</strong><br>
Otevřeno: <strong>6 h – do otrhání</strong>, denně</p>
</span></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
for tag in soup.find_all():
    print(tag.name)
    if tag.string:
        tag.string.replace_with("1")
print(soup)
The result is
<div class="oxy-toggle toggle-7042 toggle-7042-expanded" data-oxy-toggle-active-class="toggle-7042-expanded" data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
<div class="oxy-expand-collapse-icon" href="#"></div>
<div class="oxy-toggle-content">
<h3 class="ct-headline" id="headline-213-142"><span class="ct-span" id="span-225-142">1</span></h3> </div>
</div><div class="ct-text-block" id="text_block-214-142"><span class="ct-span" id="span-230-142"><p>Začátek sklizně: <strong>1</strong><br/>
Otevřeno: <strong>1</strong>, denně</p>
</span></div>
I know how to replace the text, but bs4 won't find the text directly inside the paragraph tag. The texts "Začátek sklizně:", "Otevřeno:" and ", denně" are not found, so I cannot replace them.
I tried different parsers such as lxml and html5lib, but it makes no difference.
I tried Python's HTML library, but that only supports iterating over HTML, not changing it.
Calling .string on a tag returns a NavigableString only when the tag has a single string child (or a single child tag that itself has one). If the tag has no children or more than one child, .string returns None.
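A small sketch of that rule, using a hypothetical snippet modelled on the paragraph from the question (the English text and the empty div are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <p> has three children (text, <strong>, text).
soup = BeautifulSoup(
    "<p>Start: <strong>Open</strong>, daily</p><div></div>", "html.parser"
)

single = soup.strong.string    # exactly one string child
mixed = soup.p.string          # more than one child
empty = soup.div.string        # no children at all
pieces = list(soup.p.strings)  # every string under <p>, in document order

print(single, mixed, empty)  # Open None None
print(pieces)                # ['Start: ', 'Open', ', daily']
```

This is why the loop in the question only ever reaches the text inside span and strong: the p tag itself reports None.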
The scenario is not quite clear to me, but here is one last approach based on your comment:
I need generic code to iterate any html and find all texts so I can work with them.
for tag in soup.find_all(string=True):
    tag.replace_with('1')
Example
from bs4 import BeautifulSoup
html_doc = """<div class="oxy-toggle toggle-7042 toggle-7042-expanded" data-oxy-toggle-active-class="toggle-7042-expanded" data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
<div class="oxy-expand-collapse-icon" href="#"></div>
<div class="oxy-toggle-content">
<h3 class="ct-headline" id="headline-213-142"><span class="ct-span" id="span-225-142">Sklizeň jahod 2019</span></h3> </div>
</div><div class="ct-text-block" id="text_block-214-142"><span class="ct-span" id="span-230-142"><p>Začátek sklizně: <strong>Zahájeno</strong><br>
Otevřeno: <strong>6 h – do otrhání</strong>, denně</p>
</span></div>"""
soup = BeautifulSoup(html_doc, 'html.parser')
for tag in soup.find_all(string=True):
    tag.replace_with('1')
print(soup)
Output
<div class="oxy-toggle toggle-7042 toggle-7042-expanded" data-oxy-toggle-active-class="toggle-7042-expanded" data-oxy-toggle-initial-state="closed" id="_toggle-212-142">1<div class="oxy-expand-collapse-icon" href="#"></div>1<div class="oxy-toggle-content">1<h3 class="ct-headline" id="headline-213-142"><span class="ct-span" id="span-225-142">1</span></h3>1</div>1</div><div class="ct-text-block" id="text_block-214-142"><span class="ct-span" id="span-230-142"><p>1<strong>1</strong><br/>1<strong>1</strong>1</p>1</span></div>
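Note that the output above also turns whitespace-only nodes between tags into "1". If that is not wanted, one possible variation is to skip strings that are empty after stripping. The sketch below assumes a shortened, hypothetical snippet and uses upper-casing as a stand-in for whatever change you actually need:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup; the newlines are whitespace-only text nodes.
html = "<div>\n<p>Start: <strong>Open</strong>, daily</p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list, so replacing nodes while looping is safe.
for node in soup.find_all(string=True):
    if node.strip():  # skip whitespace-only strings
        node.replace_with(node.upper())

print(soup)
```

The layout whitespace survives untouched, and only the visible text is rewritten.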

Scraping <span>flow text</span> with BeautifulSoup and urllib

I am working on scraping the data from a website using BeautifulSoup. For whatever reason, I cannot seem to find a way to get the text between span elements to print. Here is what I am running.
data = """ <div class="grouping">
<div class="a1 left" style="width:20px;">Text</div>
<div class="a2 left" style="width:30px;"><span
id="target_0">Data1</span>
</div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2
</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3
</span></div>
</div>
"""
My ultimate goal is to be able to print a list ["Text", "Data1", "Data2"] for each entry. But right now I am having trouble getting Python and urllib to produce any text between the <span> tags. Here is what I am running:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'http://target.com'
html = urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
Search_List = [0, 4, 5]  # list of target IDs to scrape
for i in Search_List:
    root = 'target_' + str(i)
    taggr = soup.find("span", {"id": root})
    print(taggr, ", ", taggr.text)
When I use urllib it produces this:
<span id="target_0"></span>,
<span id="target_4"></span>,
<span id="target_5"></span>,
However, I also downloaded the html file, and when I parse the downloaded file it produces this output (the one that I want):
<span id="target_0">Data1</span>, Data1
<span id="target_4">Data1</span>, Data1
<span id="target_5">Data1</span>, Data1
Can anyone explain to me why urllib doesn't produce the outcome?
Use this code:
...
soup = BeautifulSoup(html, 'html.parser')
your_data = list()
for line in soup.find_all('span', attrs={'id': 'target_0'}):
    your_data.append(line.text)
...
Similarly, add any other attributes you need to extract data from, then write the your_data list to a CSV file. I hope this helps; if it doesn't work out, let me know.
You can use the following approach to create your lists based on the source HTML you have shown:
from bs4 import BeautifulSoup
data = """
<div class="grouping">
<div class="a1 left" style="width:20px;">Text0</div>
<div class="a2 left" style="width:30px;"><span id="target_0">Data1</span></div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
<div class="grouping">
<div class="a1 left" style="width:20px;">Text2</div>
<div class="a2 left" style="width:30px;"><span id="target_2">Data1</span></div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
<div class="grouping">
<div class="a1 left" style="width:20px;">Text4</div>
<div class="a2 left" style="width:30px;"><span id="target_4">Data1</span></div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
"""
soup = BeautifulSoup(data, "lxml")
search_ids = [0, 4, 5]  # list of target IDs to scrape
for i in search_ids:
    span = soup.find("span", id='target_{}'.format(i))
    if span:
        grouping = span.parent.parent
        print(list(grouping.stripped_strings)[:-1])  # -1 to remove "Data3"
The example has been slightly modified to show it finding IDs 0 and 4. This would display the following output:
['Text0', 'Data1', 'Data2']
['Text4', 'Data1', 'Data2']
Note: if the HTML you get back from your URL differs from what you see when viewing the source in your browser (i.e. the data you want is missing completely), then you will need a solution such as selenium to drive a browser and extract the HTML. In that case the HTML is probably being generated locally via JavaScript, and urllib does not have a JavaScript engine.
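One way to check that diagnosis before reaching for a browser is to test whether the fetched markup contains the span but no text. The helper below is a hypothetical sketch (span_text is not part of BeautifulSoup), and the two sample strings stand in for the fetched and the saved page:

```python
from bs4 import BeautifulSoup


def span_text(html, span_id):
    """Return the stripped text of the span, or None if it is missing or empty."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", id=span_id)
    if span is None:
        return None
    return span.get_text(strip=True) or None


# Stand-ins for the two cases described in the question.
fetched = '<div><span id="target_0"></span></div>'      # what urllib returned
saved = '<div><span id="target_0">Data1</span></div>'   # the downloaded file

print(span_text(fetched, "target_0"))  # None
print(span_text(saved, "target_0"))    # Data1
```

If the helper returns None for the fetched page but a value for the saved file, the text is most likely filled in by a script after the page loads.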

Python + selenium: extract variable quantity of paragraphs between titles

Fellows, given the HTML below, how can I extract the <p> paragraphs that belong to each <h3> title?
<!DOCTYPE html>
<html>
<body>
...
<div class="main-div">
<h3>Title 1</h3>
<p></p>
<h3>Title 2</h3>
<p></p>
<p></p>
<p></p>
<h3>Title 3</h3>
<p></p>
<p></p>
...
</div>
</body>
As you can see, both the <h3> and <p> tags are children of the <div> tag, but they have no class or id that would let me identify them and say that "Title 1" has one paragraph, "Title 2" has three paragraphs, "Title 3" has two paragraphs, and so on. I can't see a way to tie the paragraphs to their title...
I'm trying to do it using Python 2.7 + selenium, but I'm not sure I'm working with the right tools. Maybe you can suggest a solution or a different combination, like BeautifulSoup, urllib2...
Any suggestion/direction would be much appreciated!
UPDATE
After the brilliant solution pointed out by @JustMe, I came up with the solution below. I hope it helps someone else, or that someone can make it more Pythonic. I'm coming from the C/C++/Java/Perl world, so I keep hitting the wall :)
import bs4
page = """
<!DOCTYPE html>
<html>
<body>
...
<div class="maincontent-block">
<h3>Title 1</h3>
<p>1</p>
<p>2</p>
<p>3</p>
<h3>Title 2</h3>
<p>2</p>
<p>3</p>
<p>4</p>
<h3>Title 3</h3>
<p>7</p>
<p>9</p>
...
</div>
</body>
"""
page = bs4.BeautifulSoup(page, "html.parser")
div = page.find('div', {'class': "maincontent-block"})
mydict = {}
# write to the dictionary: map each <h3> title to the <p> texts that follow it
for tag in div.findChildren():
    if tag.name == "h3":
        arr = []
        mydict[tag.string] = arr
        for nt in tag.findAllNext():
            if nt.name == "p":
                arr.append(nt.string)
            elif nt.name == "h3":
                # the next title starts a new group
                break
# read from the dictionary, sorted by title
for k in sorted(mydict):
    print(k)
    for v in mydict[k]:
        print(v)
This is easy to do with BeautifulSoup:
import bs4
page = """
<!DOCTYPE html>
<html>
<body>
...
<div class="main-div">
<h3>Title 1</h3>
<p></p>
<h3>Title 2</h3>
<p></p>
<p></p>
<p></p>
<h3>Title 3</h3>
<p></p>
<p></p>
...
</div>
</body>
"""
page = bs4.BeautifulSoup(page, "html.parser")
h3_tag = page.div.find("h3")
print(h3_tag.string)
>>> Title 1
h3_tag.find_next_siblings("p")
>>> [<p></p>, <p></p>, <p></p>, <p></p>, <p></p>, <p></p>]
len(h3_tag.find_next_siblings("p")) // 2
>>> 3
OK, since you want a separate paragraph count for each title, I came up with this crude thing:
h_counters = []
count = -1
for child in page.div.findChildren():
    if "<h3>" in str(child):
        h_counters.append(count)
        count = 0
    else:
        count += 1
h_counters.append(count)
h_counters = h_counters[1:]
print(h_counters)
>> [1, 3, 2]
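If the paragraph texts are needed and not only the counts, a single pass over the direct children can group them by title. This is a minimal sketch, assuming the flat h3/p layout from the question; the paragraph texts a to f are placeholders:

```python
from bs4 import BeautifulSoup

# Placeholder content in the same flat layout as the question.
page = """<div class="main-div">
<h3>Title 1</h3><p>a</p>
<h3>Title 2</h3><p>b</p><p>c</p><p>d</p>
<h3>Title 3</h3><p>e</p><p>f</p>
</div>"""

soup = BeautifulSoup(page, "html.parser")
container = soup.find("div", class_="main-div")

groups = {}
current = None
# recursive=False restricts the search to direct children of the div.
for child in container.find_all(["h3", "p"], recursive=False):
    if child.name == "h3":
        current = child.get_text(strip=True)
        groups[current] = []
    elif current is not None:
        groups[current].append(child.get_text(strip=True))

print(groups)
print([len(v) for v in groups.values()])  # [1, 3, 2]
```

Each paragraph is attached to the most recent title seen, so no look-ahead with findAllNext is needed.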

How can I return all h1's with the class fluid

The code below should generate a list of all the h1 elements that have the class fluid, but it returns an empty list. I cannot find the mistake. Can anyone help me?
allh1 = soup.findAll('h1')
classes = [h1.get('class') for h1 in allh1]
fluid_list = []
for item in classes:
    if item == 'fluid':
        fluid_list.append(item)
print(fluid_list)
Your code doesn't work because classes is a list of lists: each entry holds the class list for one h1 (or None if it has no class), so comparing an entry to the string 'fluid' is never true:
from bs4 import BeautifulSoup
html_doc = """
<html>
<head>
</head>
<body>
<h1>header 1</h1>
<h1 class="fluid">header 2</h1>
<h1>header 3</h1>
<h1 class="fluid static">header 4</h1>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, "html.parser")
allh1= soup.findAll('h1')
classes = [ h1.get('class') for h1 in allh1]
print(classes)
[None, ['fluid'], None, ['fluid', 'static']]
If you're using Beautiful Soup 4.1.2+, you can use class_, however:
fluid_list = soup.find_all('h1', class_='fluid')
print(fluid_list)
[<h1 class="fluid">header 2</h1>, <h1 class="fluid static">header 4</h1>]
This returns the h1 elements themselves, which I assume is what you want.
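For comparison, here is a sketch of two other ways to get the same elements, assuming the four headers from the example above: a membership test on the class list, which is the smallest change to the original loop, and a CSS selector via select().

```python
from bs4 import BeautifulSoup

# The four headers from the example above, without the surrounding page.
html_doc = (
    "<h1>header 1</h1>"
    '<h1 class="fluid">header 2</h1>'
    "<h1>header 3</h1>"
    '<h1 class="fluid static">header 4</h1>'
)
soup = BeautifulSoup(html_doc, "html.parser")

# Membership test on the class list; the default [] covers h1s with no class.
fluid_loop = [h1 for h1 in soup.find_all("h1") if "fluid" in h1.get("class", [])]

# CSS selector: h1 elements whose class list includes "fluid".
fluid_css = soup.select("h1.fluid")

print([h1.get_text() for h1 in fluid_loop])  # ['header 2', 'header 4']
```

Both variants also match "fluid static", because the test is on list membership and not on the whole attribute value.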

BeautifulSoup create a <img /> tag

I need to create a <img /> tag.
With the code I wrote, BeautifulSoup creates the image tag like this:
soup = BeautifulSoup(text, "html5")
tag = Tag(soup, name='img')
tag.attrs = {'src': '/some/url/here'}
text = soup.renderContents()
print(text)
Output: <img src="/some/url/here"></img>
How can I get this instead: <img src="/some/url/here" />
It can of course be done with a regex or similar chemistry, but I was wondering whether there is a standard way to produce tags like this.
Don't use Tag() to create new elements. Use the soup.new_tag() method:
soup = BeautifulSoup(text, "html5")
new_tag = soup.new_tag('img', src='/some/url/here')
some_element.append(new_tag)
The soup.new_tag() method will pass along the correct builder to the Tag() object, and it is the builder that is responsible for recognising <img/> as an empty tag.
Demo:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<div></div>', "html5")
>>> new_tag = soup.new_tag('img', src='/some/url/here')
>>> new_tag
<img src="/some/url/here"/>
>>> soup.div.append(new_tag)
>>> print(soup.prettify())
<html>
<head>
</head>
<body>
<div>
<img src="/some/url/here"/>
</div>
</body>
</html>
In BS4 you can also do this:
img = BeautifulSoup('<img src="/some/url/here" />', 'lxml').img
print(img)
print(type(img))
which will output:
<img src="/some/url/here"/>
<class 'bs4.element.Tag'>
