I have HTML that looks like the snippet shown below. I want the text between the innermost two empty divs, with class names "start" and "end" respectively. In the code below, that means the text between the 2nd <div class ="start"> </div> and the first <div class ="end"> </div>. Between these two empty divs there can be multiple divs with any number of tags, and I want the text in those divs. I tried accessing the text in multiple ways using div.attrs['class'] and the find_next_siblings method, but it did not work. How do I go about this?
many <divs> </divs> and other tags
<div class ="start"> </div>
<div> bla bla bla </div>
<div class ="start"> </div>
<div> <i> <a> <span> <p> Text I want </p></span></a></i> </div>
<div> <p> Text I want </p> <p> Text I want </p> </div>
<div class ="end"></div>
<div> bla bla bla </div>
<div class ="end"></div>
many <divs> </divs> and other tags
Here is one way to get the text you want:
from bs4 import BeautifulSoup as bs
html = '''
many <divs> </divs> and other tags
<div class ="start"> </div>
<div> bla bla bla </div>
<div class ="start"> </div>
<div> <i> <a> <span> <p> Text I want </p></span></a></i> </div>
<div> <p> Text I want </p> <p> Text I want </p> </div>
<div class ="end"></div>
<div> bla bla bla </div>
<div class ="end"></div>
many <divs> </divs> and other tags
'''
soup = bs(html, 'html.parser')
start_item = soup.select('div[class="start"]')[-1]
for x in start_item.find_next_siblings():
    x_class = x.get('class')[0] if x.get('class') else None
    if x_class != 'end':
        print('Wanted text:', x.text)
    else:
        print('reached the end')
        break
Result in terminal:
Wanted text: Text I want
Wanted text: Text I want Text I want
reached the end
See the BeautifulSoup documentation for details on these methods.
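If you would rather collect the text into one string than print it as you go, a small variation on the same loop (a sketch, reusing the soup and start_item from above) could look like this:
wanted = []
for x in start_item.find_next_siblings():
    x_class = x.get('class')[0] if x.get('class') else None
    if x_class == 'end':
        break
    # normalize the whitespace left behind by the nested tags
    wanted.append(' '.join(x.get_text(' ').split()))

print(' '.join(wanted))  # Text I want Text I want Text I want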
To get the tags between the last class="start" div and the first class="end" div, you can use either .select with CSS selectors or .find_all with a lambda function
from bs4 import BeautifulSoup
pasted_html = '''many<divs></divs>and other tags<div class="start"></div><div>bla bla bla</div><div class="start"></div><div><i><a><span><p>Text I want</p></span></a></i></div><div><p>Text I want</p><p>Text I want</p></div><div class="end"></div><div>bla bla bla</div><div class="end"></div>many<divs></divs>and other tags'''
soup = BeautifulSoup(pasted_html, 'html5lib')
Parsing with html5lib is more reliable if using .select, but you can use a different parser if you go with .find_all.
Please note that this will not return anything unless the last .start comes before the first .end.
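If you want to verify that up front, a quick check could look something like this (a sketch, reusing the soup parsed above):
# sketch: True only if the last div.start is followed, as a sibling, by the first div.end
starts, ends = soup.select('div.start'), soup.select('div.end')
ordered_ok = bool(starts and ends and any(sib is ends[0] for sib in starts[-1].next_siblings))
print(ordered_ok)  # True for the pasted_html above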
Using .select
s, e = 'div.start', 'div.end'
mTags = soup.select(f'{s}:not(:has(~ {s})) ~ *:not({e}):not({e} ~ *):has(~ {e})')
should give you the same ResultSet as when you use .find_all
mTags = soup.find_all(
    lambda t: t.find_previous_sibling('div', {'class': 'start'}) and
    not t.find_next_sibling('div', {'class': 'start'}) and
    t.find_next_sibling('div', {'class': 'end'}) and
    not t.find_previous_sibling('div', {'class': 'end'})
)
(I prefer .select just because the code is shorter.)
To extract the text, you can either join the texts from each tag in mTags
mText = ' '.join([t.get_text(' ').strip() for t in mTags])
# mText = "Text I want Text I want Text I want"
or you can join the HTML of each tag and parse again before using .get_text (less efficient)
mText = BeautifulSoup(
    '\n'.join([t.prettify().strip() for t in mTags]),
    'html.parser'
).get_text(' ').strip()
# mText = "Text I want\n \n \n \n \n \n \n \n Text I want\n \n \n Text I want"
If you want to minimize whitespace you can do something like
mText = ' '.join(w for w in mText.split() if w)
then mText should be "Text I want Text I want Text I want" no matter which of the above approaches were used.
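If you need this in more than one place, it can be wrapped up as a small helper. This is just a sketch built from the .select approach above (the selector string is the same, so it needs the same html5lib/soupsieve setup):
def text_between(soup, s='div.start', e='div.end'):
    # tags after the last s-div and before the first e-div (same selector as above)
    tags = soup.select(f'{s}:not(:has(~ {s})) ~ *:not({e}):not({e} ~ *):has(~ {e})')
    # join the texts and collapse whitespace
    return ' '.join(w for t in tags for w in t.get_text(' ').split())

print(text_between(soup))  # Text I want Text I want Text I want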
Related
Let's say I have the following HTML snippet:
<p class="toSelect">
"some text here..."
<br>
"Other text here..."
</p>
Any suggestions on how to get the first text between the <p> tag and its <br> child tag using BeautifulSoup in Python?
You can select the <p class="toSelect"> element and then use .find_next with text=True:
from bs4 import BeautifulSoup
html_doc = """
<p class="toSelect">
"some text here..."
<br>
"Other text here..."
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")
text = soup.select_one(".toSelect").find_next(text=True)
print(text)
Prints:
"some text here..."
You can use .contents of <p> to get all the text and children of <p> and select the first item from the list.
from bs4 import BeautifulSoup
s = '''<p class="toSelect">
"some text here..."
<br>
"Other text here..."
</p>'''
soup = BeautifulSoup(s, 'html.parser')
x = soup.find('p')
print(x.contents[0].strip())
Output:
"some text here..."
You can read more about .contents in the BeautifulSoup docs.
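To see why index 0 is the right item, you can print the whole .contents list; it is a mix of the text nodes and the <br> tag, roughly (same x as above):
print(x.contents)
# ['\n"some text here..."\n', <br/>, '\n"Other text here..."\n']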
I have an XML document with multiple div/span classes and I'm struggling to extract a text value.
<div class="line">
<span class="html-tag">
"This is a Heading that I dont want"
</span>
<span>This is the text I want</span>
So far I have written this:
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
spans = soup.find_all('span', attrs={'class': 'html-tag'})[29]
print(spans.text)
This unfortunately only prints out the "This is a Heading that I dont want" value, e.g.
This is the heading I dont want
Number [29] in the code is the position where the text I need will always appear.
I'm unsure how to retrieve the span value I need.
Please can you assist. Thanks
You can search by <div class="line"> and then select the second <span>.
For example:
from bs4 import BeautifulSoup

txt = '''
# line 1
<div class="line">
<span class="html-tag">
"This is a Heading that I dont want"
</span>
<span>This is the text I dont want</span>
</div>
# line 2
<div class="line">
<span class="html-tag">
"This is a Heading that I dont want"
</span>
<span>This is the text I dont want</span>
</div>
# line 3
<div class="line">
<span class="html-tag">
"This is a Heading that I dont want"
</span>
<span>This is the text I want</span> <--- this is what I want
</div>'''
soup = BeautifulSoup(txt, 'html.parser')
s = soup.select('div.line')[2].select('span')[1] # select 3rd line 2nd span
print(s.text)
Prints:
This is the text I want
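If the line position is not fixed and you want the second span of every div.line, you can loop over them instead (a sketch, same soup as above):
for line in soup.select('div.line'):
    spans = line.select('span')
    if len(spans) > 1:
        print(spans[1].text)  # second span of each line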
Suppose I have an HTML string like this:
<html>
<div id="d1">
Text 1
</div>
<div id="d2">
Text 2
<a>a url</a>
Text 2 continue
</div>
<div id="d3">
Text 3
</div>
</html>
I want to extract the content of d2 that is NOT wrapped by other tags, skipping the "a url" link. In other words, I want to get this result:
Text 2
Text 2 continue
Is there a way to do it with BeautifulSoup?
I tried this, but it is not correct:
soup = BeautifulSoup(html_doc, 'html.parser')
s = soup.find(id='d2').text
print(s)
Try with .find_all(text=True, recursive=False):
from bs4 import BeautifulSoup
div_test="""
<html>
<div id="d1">
Text 1
</div>
<div id="d2">
Text 2
<a>a url</a>
Text 2 continue
</div>
<div id="d3">
Text 3
</div>
</html>
"""
soup = BeautifulSoup(div_test, 'lxml')
s = soup.find(id='d2').find_all(text=True, recursive=False)
print(s)
print([e.strip() for e in s]) #remove space
It will return a list containing only the text:
[u'\n Text 2\n ', u'\n Text 2 continue\n ']
[u'Text 2', u'Text 2 continue']
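If you would rather have a single string than a list, you can join the stripped pieces:
print(' '.join(e.strip() for e in s if e.strip()))  # Text 2 Text 2 continue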
You can get only the NavigableString objects with a simple list comprehension.
import bs4

tag = soup.find(id='d2')
s = ''.join(e for e in tag if type(e) is bs4.element.NavigableString)
Alternatively you can use the decompose method to delete all the child nodes, then get the remaining text with .text.
tag = soup.find(id='d2')
for e in tag.find_all():
    e.decompose()
s = tag.text
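Keep in mind that decompose permanently removes those child tags from the tree, so if you still need the full soup afterwards you can work on a copy of the tag instead (a sketch using the standard library copy module):
import copy

tag_copy = copy.copy(soup.find(id='d2'))
for e in tag_copy.find_all():
    e.decompose()
s = tag_copy.text  # the original soup is left untouched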
I'm trying to extract tuples from a URL and I've managed to extract string text and tuples using re.search(pattern_str, text_str). However, I got stuck when I tried to extract a list of tuples using re.findall(pattern_str, text_str).
The text looks like:
<li>
<a href="11111">
some text 111
<span class="some-class">
#11111
</span>
</a>
</li><li>
<a href="22222">
some text 222
<span class="some-class">
#22222
</span>
</a>
</li><li>
<a href="33333">
some text 333
<span class="some-class">
#33333
</span>
</a>
... # repeating
...
...
and I'm using the following pattern & code to extract the tuples:
import re

text_above = "..."  # this is the text above
pat_str = '<a href="(\d+)">\n(.+)\n<span class'
pat = re.compile(pat_str)
# following line is supposed to return the numbers from the 2nd line
# and the string from the 3rd line for each repeating sequence
list_of_tuples = re.findall(pat, text_above)
for t in list_of_tuples:
    # supposed to print "11111 -> blah blah 111"
    print(t[0], '->', t[1])
Maybe I'm trying something weird and impossible, and maybe it's better to extract the data using primitive string manipulation... but is there a solution?
Your regex does not take into account the whitespace (indentation) between \n and <span. (Nor the whitespace at the start of the line you want to capture, but that's not as much of a problem.) To fix it, you could add some \s*:
pat_str = '<a href="(\d+)">\n\s*(.+)\n\s*<span class'
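With that change the pattern should return the tuples you were after; a quick check (a sketch that assumes the snippet from the question is stored in text_above, as in your code):
import re

pat = re.compile(r'<a href="(\d+)">\n\s*(.+)\n\s*<span class')
for number, text in pat.findall(text_above):
    print(number, '->', text)  # e.g. 11111 -> some text 111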
As suggested in the comments, use an HTML parser like BeautifulSoup:
from bs4 import BeautifulSoup
h = """<li>
<a href="11111">
some text 111
<span class="some-class">
#11111
</span>
</a>
</li><li>
<a href="22222">
some text 222
<span class="some-class">
#22222
</span>
</a>
</li><li>
<a href="33333">
some text 333
<span class="some-class">
#33333
</span>
</a>"""
soup = BeautifulSoup(h, 'html.parser')
You can get the href and the previous_sibling to the span:
print([(a["href"].strip(), a.span.previous_sibling.strip()) for a in soup.find_all("a")])
[('11111', u'some text 111'), ('22222', u'some text 222'), ('33333', u'some text 333')]
Or the href and the first content from the anchor:
print([(a["href"].strip(), a.contents[0].strip()) for a in soup.find_all("a")])
Or with .find(text=True) to get only the tag's first text node rather than the text from its children.
[(a["href"].strip(), a.find(text=True).strip()) for a in soup.find_all("a")]
Also if you just want the anchors inside the list tags, you can specifically parse those:
[(a["href"].strip(), a.contents[0].strip()) for a in soup.select("li a")]
I am using Beautiful Soup to parse HTML and find all text that is not contained inside any anchor elements.
I came up with this code, which finds all links with an href, but not the other way around.
How can I modify this code to get only the plain text using Beautiful Soup, so that I can do some find-and-replace and modify the soup?
for a in soup.findAll('a', href=True):
    print a['href']
EDIT:
Example:
<html><body>
<div> test1 </div>
<div><br></div>
<div>test2</div>
<div><br></div><div><br></div>
<div>
This should be identified
Identify me 1
Identify me 2
<p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
</div>
</body></html>
Output:
This should be identified
Identify me 1
Identify me 2
This paragraph should be identified.
I am doing this to find text not within <a></a>: then find "Identify" and replace it with "Replaced".
So the final output will be like this:
<html><body>
<div> test1 </div>
<div><br></div>
<div>test2</div>
<div><br></div><div><br></div>
<div>
This should be identified
Replaced me 1
Replaced me 2
<p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
</div>
</body></html>
Thanks for your time !
If I understand you correctly, you want to get the text that is inside an a element that contains an href attribute. To get the text of the element, you can use the .text attribute.
>>> soup = BeautifulSoup.BeautifulSoup()
>>> soup.feed('<a href="http://something.com">this is some text</a>')
>>> soup.findAll('a', href=True)[0]['href']
u'http://something.com'
>>> soup.findAll('a', href=True)[0].text
u'this is some text'
Edit
This finds all the text elements with "identified" in them:
>>> soup = BeautifulSoup.BeautifulSoup()
>>> soup.feed(yourhtml)
>>> [txt for txt in soup.findAll(text=True) if 'identified' in txt.lower()]
[u'\n This should be identified \n\n Identify me 1 \n\n Identify me 2 \n ', u' identified ']
The returned objects are of type BeautifulSoup.NavigableString. If you want to check whether the parent is an a element, you can do txt.parent.name == 'a'.
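For example, to keep only the text nodes that are not inside an a element:
>>> [txt for txt in soup.findAll(text=True) if txt.parent.name != 'a']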
Another edit:
Here's another example with a regex and a replacement.
import BeautifulSoup
import re
soup = BeautifulSoup.BeautifulSoup()
html = '''
<html><body>
<div> test1 </div>
<div><br></div>
<div>test2</div>
<div><br></div><div><br></div>
<div>
This should be identified
Identify me 1
Identify me 2
<p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
</div>
</body></html>
'''
soup.feed(html)
for txt in soup.findAll(text=True):
    if re.search('identi', txt, re.I) and txt.parent.name != 'a':
        newtext = re.sub(r'identi(\w+)', r'replace\1', txt.lower())
        txt.replaceWith(newtext)
print(soup)
<html><body>
<div> test1 </div>
<div><br /></div>
<div>test2</div>
<div><br /></div><div><br /></div>
<div>
this should be replacefied
replacefy me 1
replacefy me 2
<p id="firstpara" align="center"> This paragraph should be<b> replacefied </b>.</p>
</div>
</body></html>
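For what it's worth, the same idea also works with the newer bs4 package; only the import and a couple of method names change. A rough sketch, using the same html string as above:
from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, 'html.parser')
for txt in soup.find_all(text=True):
    # only touch text nodes that are not inside an anchor
    if re.search('identi', txt, re.I) and txt.parent.name != 'a':
        txt.replace_with(re.sub(r'identi(\w+)', r'replace\1', txt.lower()))
print(soup)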