Beautiful Soup to Find and Regex Replace text 'not within <a></a> ' - python

I am using Beautiful Soup to parse an HTML document to find all text that is not contained inside any anchor (<a>) elements.
I came up with this code, which finds all links that have an href, but not the other way around.
How can I modify this code to get only the plain text using Beautiful Soup, so that I can do some find and replace and modify the soup?
for a in soup.findAll('a', href=True):
    print a['href']
EDIT:
Example:
<html><body>
<div> test1 </div>
<div><br></div>
<div>test2</div>
<div><br></div><div><br></div>
<div>
This should be identified
Identify me 1
Identify me 2
<p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
</div>
</body></html>
Output:
This should be identified
Identify me 1
Identify me 2
This paragraph should be identified.
I am doing this operation to find text not within <a></a>, then find "Identify" and replace it with "Replaced".
So the final output will be like this:
<html><body>
<div> test1 </div>
<div><br></div>
<div>test2</div>
<div><br></div><div><br></div>
<div>
This should be identified
Replaced me 1
Replaced me 2
<p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
</div>
</body></html>
Thanks for your time!

If I understand you correctly, you want to get the text that is inside an a element that has an href attribute. If you want to get the text of the element, you can use the .text attribute.
>>> soup = BeautifulSoup.BeautifulSoup()
>>> soup.feed('<a href="http://something.com">this is some text</a>')
>>> soup.findAll('a', href=True)[0]['href']
u'http://something.com'
>>> soup.findAll('a', href=True)[0].text
u'this is some text'
Edit
This finds all the text elements with "identified" in them:
>>> soup = BeautifulSoup.BeautifulSoup()
>>> soup.feed(yourhtml)
>>> [txt for txt in soup.findAll(text=True) if 'identified' in txt.lower()]
[u'\n This should be identified \n\n Identify me 1 \n\n Identify me 2 \n ', u' identified ']
The returned objects are of type BeautifulSoup.NavigableString. If you want to check if the parent is an a element you can do txt.parent.name == 'a'.
Another edit:
Here's another example with a regex and a replacement.
import BeautifulSoup
import re
soup = BeautifulSoup.BeautifulSoup()
html = '''
<html><body>
<div> test1 </div>
<div><br></div>
<div>test2</div>
<div><br></div><div><br></div>
<div>
This should be identified
Identify me 1
Identify me 2
<p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
</div>
</body></html>
'''
soup.feed(html)
for txt in soup.findAll(text=True):
    if re.search('identi', txt, re.I) and txt.parent.name != 'a':
        newtext = re.sub(r'identi(\w+)', r'replace\1', txt.lower())
        txt.replaceWith(newtext)
print(soup)
<html><body>
<div> test1 </div>
<div><br /></div>
<div>test2</div>
<div><br /></div><div><br /></div>
<div>
this should be replacefied
replacefy me 1
replacefy me 2
<p id="firstpara" align="center"> This paragraph should be<b> replacefied </b>.</p>
</div>
</body></html>
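The example above uses the old BeautifulSoup (version 3) module. For reference, roughly the same logic with the modern bs4 package might look like this; this is a sketch, reusing the html string from above and the built-in 'html.parser':
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # html is the same markup string as above
for txt in soup.find_all(text=True):
    # only replace in text nodes that are not inside an <a> tag
    if txt.parent.name != 'a' and re.search('identi', txt, re.I):
        txt.replace_with(re.sub(r'identi(\w+)', r'replace\1', txt.lower()))
print(soup)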

Related

Extract all text between two specific empty divs

I have HTML that looks like the snippet shown below. I want the text between the innermost two empty divs with class names "start" and "end" respectively. In the code below, that means the text between the 2nd <div class="start"></div> and the first <div class="end"></div>. In between these two empty divs there can be multiple divs with any number of tags, and I want the text in those divs. I tried accessing the text in multiple ways using div.attrs['class'] and the find_next_siblings methods, but it did not work. How do I go about this?
many <divs> </divs> and other tags
<div class ="start"> </div>
<div> bla bla bla </div>
<div class ="start"> </div>
<div> <i> <a> <span> <p> Text I want </p></span></a></i> </div>
<div> <p> Text I want </p> <p> Text I want </p> </div>
<div class ="end"></div>
<div> bla bla bla </div>
<div class ="end"></div>
many <divs> </divs> and other tags
Here is one way to get the text you want:
from bs4 import BeautifulSoup as bs
html = '''
many <divs> </divs> and other tags
<div class ="start"> </div>
<div> bla bla bla </div>
<div class ="start"> </div>
<div> <i> <a> <span> <p> Text I want </p></span></a></i> </div>
<div> <p> Text I want </p> <p> Text I want </p> </div>
<div class ="end"></div>
<div> bla bla bla </div>
<div class ="end"></div>
many <divs> </divs> and other tags
'''
soup = bs(html, 'html.parser')
start_item = soup.select('div[class="start"]')[-1]
for x in start_item.find_next_siblings():
    x_class = x.get('class')[0] if x.get('class') else None
    if x_class != 'end':
        print('Wanted text:', x.text)
    else:
        print('reached the end')
        break
Result in terminal:
Wanted text: Text I want
Wanted text: Text I want Text I want
reached the end
See BeautifulSoup documentation here.
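If you'd rather collect the wanted text into a single string instead of printing each piece, a small variation on the same loop (same assumptions: start from the last "start" div, stop at the first "end" div) could be:
texts = []
for x in start_item.find_next_siblings():
    # stop as soon as the first <div class="end"> is reached
    if x.get('class') and x.get('class')[0] == 'end':
        break
    texts.append(x.get_text(' ', strip=True))
print(' '.join(texts))  # Text I want Text I want Text I want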
To get the tags between the last class ="start" and first class ="end" tags, you can use either .select with CSS selectors or .find_all with the lambda function
from bs4 import BeautifulSoup
pasted_html = '''many<divs></divs>and other tags<div class="start"></div><div>bla bla bla</div><div class="start"></div><div><i><a><span><p>Text I want</p></span></a></i></div><div><p>Text I want</p><p>Text I want</p></div><div class="end"></div><div>bla bla bla</div><div class="end"></div>many<divs></divs>and other tags'''
soup = BeautifulSoup(pasted_html, 'html5lib')
Parsing with html5lib is more reliable if using .select, but you can use a different parser if you go with .find.
Please note that this will not return anything unless the last .start comes before the first .end.
Using .select
s, e = 'div.start', 'div.end'
mTags = soup.select(f'{s}:not(:has(~ {s})) ~ *:not({e}):not({e} ~ *):has(~ {e})')
should give you the same ResultSet as when you use .find_all
mTags = soup.find_all(
    lambda t: t.find_previous_sibling('div', {'class': 'start'}) and
              not t.find_next_sibling('div', {'class': 'start'}) and
              t.find_next_sibling('div', {'class': 'end'}) and
              not t.find_previous_sibling('div', {'class': 'end'})
)
(I prefer .select just because the code is shorter.)
To extract the text, you can either join the texts from each tag in mTags
mText = ' '.join([t.get_text(' ').strip() for t in mTags])
# mText = "Text I want Text I want Text I want"
or you can join the htmls and parse again before using .get_text (less efficient)
mText = BeautifulSoup(
    '\n'.join([t.prettify().strip() for t in mTags])
).get_text(' ').strip()
# mText = "Text I want\n \n \n \n \n \n \n \n Text I want\n \n \n Text I want"
If you want to minimize whitespace you can do something like
mText = ' '.join(w for w in mText.split() if w)
then mText should be "Text I want Text I want Text I want" no matter which of the above approaches were used.

How to exclude inner tags with beautifulsoup

Hey, I'm currently trying to parse a website and I'm almost done, but there's a little problem. I want to exclude inner tags from this HTML code:
<span class="moto-color5_5">
<strong>Text 1 </strong>
<span style="font-size:8px;">Text 2</span>
</span>
I tried using
...find("span", "moto-color5_5") but this returns
Text 1 Text 2
instead of only returning Text 1
Any suggestions?
sincerely :)
Excluding inner tags would also exclude Text 1 because it's in an inner tag <strong>.
You can however just find strong inside of your current soup:
html = """<span class="moto-color5_5">
<strong>Text 1 </strong>
<span style="font-size:8px;">Text 2</span>
</span>
"""
soup = BeautifulSoup(html)
result = soup.find("span", "moto-color5_5").find('strong')
print(result.text) # Text 1
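Another option, if the outer span might contain more children than just the <strong>, is to detach the nested span and then read the remaining text; a quick sketch, assuming the nested span is not needed afterwards:
outer = soup.find("span", "moto-color5_5")
inner = outer.find("span")            # the nested <span style="font-size:8px;">
if inner is not None:
    inner.extract()                   # detach it so its text is excluded
print(outer.get_text(strip=True))     # Text 1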

How to extract the text inside a tag with BeautifulSoup in Python?

Supposing I have an html string like this:
<html>
<div id="d1">
Text 1
</div>
<div id="d2">
Text 2
a url
Text 2 continue
</div>
<div id="d3">
Text 3
</div>
</html>
I want to extract the content of d2 that is NOT wrapped by other tags, skipping a url. In other words I want to get such result:
Text 2
Text 2 continue
Is there a way to do it with BeautifulSoup?
I tried this, but it is not correct:
soup = BeautifulSoup(html_doc, 'html.parser')
s = soup.find(id='d2').text
print(s)
Try with .find_all(text=True, recursive=False):
from bs4 import BeautifulSoup
div_test="""
<html>
<div id="d1">
Text 1
</div>
<div id="d2">
Text 2
a url
Text 2 continue
</div>
<div id="d3">
Text 3
</div>
</html>
"""
soup = BeautifulSoup(div_test, 'lxml')
s = soup.find(id='d2').find_all(text=True, recursive=False)
print(s)
print([e.strip() for e in s]) #remove space
it will return a list with only text:
[u'\n Text 2\n ', u'\n Text 2 continue\n ']
[u'Text 2', u'Text 2 continue']
You can get only the NavigableString objects with a simple generator expression.
import bs4

tag = soup.find(id='d2')
s = ''.join(e for e in tag if type(e) is bs4.element.NavigableString)
Alternatively, you can use the decompose method to delete all the child nodes, then get all the remaining text with .text.
tag = soup.find(id='d2')
for e in tag.find_all():
    e.decompose()
s = tag.text
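Keep in mind that decompose() mutates the soup, so the child tags are gone afterwards. If the original tree still needs to stay intact, one way is to work on a detached copy of the tag (bs4 Tag objects support copy.copy), for example:
import copy

tag_copy = copy.copy(soup.find(id='d2'))  # detached copy; the original soup is untouched
for e in tag_copy.find_all():
    e.decompose()
s = tag_copy.text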

Unable to fetch <div> tag values in python

The required value is present within the div tag:
<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>
I am using the below code to fetch the value "Rs. 350":
soup.select('div.search-page-text')
But in the output I get "None". Could you please help me resolve this issue?
An element with both a sub-element and string content can be accessed using stripped_strings:
from bs4 import BeautifulSoup
h = """<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>"""
soup = BeautifulSoup(h)
for s in soup.select("div.search-page-text")[0].stripped_strings:
    print(s)
Output:
Cost for 2:
Rs. 350
The problem is that this includes both the string content of the span and of the div. But if you know that the div first contains the span with text, you could get the interesting string as
list(soup.select("div.search-page-text")[0].stripped_strings)[1]
If you know you only ever want the string that is the immediate text of the <div> tag and not the <span> child element, you could do this.
from bs4 import BeautifulSoup
txt = '''<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>'''
soup = BeautifulSoup(txt)
for div in soup.find_all("div", { "class" : "search-page-text" }):
    print ''.join(div.find_all(text=True, recursive=False)).strip()
    #print div.find_all(text=True, recursive=False)[1].strip()
One of the lines returned by div.find_all is just a newline. That could be handled in a variety of ways. I chose to join and strip it rather than rely on the text being at a certain index (see commented line) in the resultant list.
Python 3
For python 3 the print line should be
print (''.join(div.find_all(text=True, recursive=False)).strip())
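Another way that avoids relying on the position of the text is to keep only the NavigableString children of the div; a minimal sketch with bs4:
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup(txt, 'html.parser')
div = soup.find('div', class_='search-page-text')
# join the direct string children, skipping whitespace-only nodes
direct_text = ' '.join(s.strip() for s in div if isinstance(s, NavigableString) and s.strip())
print(direct_text)  # Rs. 350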

How to print the text inside of a child tag and the href of a grandchild element with a single BeautifulSoup Object?

I have a document which contains several div.inventory siblings.
<div class="inventory">
<span class="item-number">123</span>
<span class="cost">
$1.23
</span>
</div>
I would like to iterate over them to print the item number and link of the item.
123 http://linktoitem
456 http://linktoitem2
789 http://linktoitem3
How do I parse these two values after selecting the div.inventory element?
import requests
from bs4 import BeautifulSoup
htmlSource = requests.get(url).text
soup = BeautifulSoup(htmlSource)
matches = soup.select('div.inventory')
for match in matches:
    #prints 123
    #prints http://linktoitem
Also - what is the difference between the select function and find* functions?
You can find both items using find() relying on the class attributes:
soup = BeautifulSoup(data)
for inventory in soup.select('div.inventory'):
    number = inventory.find('span', class_='item-number').text
    link = inventory.find('span', class_='cost').a.get('href')
    print number, link
Example:
from bs4 import BeautifulSoup
data = """
<body>
<div class="inventory">
<span class="item-number">123</span>
<span class="cost">
$1.23
</span>
</div>
<div class="inventory">
<span class="item-number">456</span>
<span class="cost">
$1.23
</span>
</div>
<div class="inventory">
<span class="item-number">789</span>
<span class="cost">
$1.23
</span>
</div>
</body>
"""
soup = BeautifulSoup(data)
for inventory in soup.select('div.inventory'):
number = inventory.find('span', class_='item-number').text
link = inventory.find('span', class_='cost').a.get('href')
print number, link
Prints:
123 http://linktoitem
456 http://linktoitem2
789 http://linktoitem3
Note the use of select() - this method allows you to use CSS selectors for searching over the page. Also note the use of the class_ argument - the underscore is important since class is a reserved keyword in Python.
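To make that difference concrete, these two calls should return the same tags for the document above; .select() takes a CSS selector string, while .find_all() takes a tag name plus keyword filters:
via_select = soup.select('div.inventory')
via_find_all = soup.find_all('div', class_='inventory')
assert [str(t) for t in via_select] == [str(t) for t in via_find_all]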
