How to remove all tags from text in Python - python

I want to extract only the text from a tag. Unfortunately, I can't get just the text; the result always includes the links.
Is it possible to remove all of the <img> and <a href> tags from my text?
<div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div>
I just want to recover this: its a good day, and ignore the content of the <a href> tag in my <div> tag.
Currently I perform the extraction via beautifulsoup.find('div').

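Try this. It keeps only the first line of the div's text, so it works as long as a newline separates the wanted text from the link: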
import requests
from bs4 import BeautifulSoup
#response = requests.get('your url')
html = BeautifulSoup('''<div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a>
</div>''', 'html.parser')
soup = html.find_all(class_='xxx')
print(soup[0].text.split('\n')[0])

Import re and use re.sub. Note that the pattern is greedy, so it removes everything from the first < to the last > in each string:
import re
s1 = '<div class="xxx" data-handler="xxx">its a good day'
s2 = '<a class="link" href="https://" title="text">https:// link</a></div>'
s1 = re.sub(r'\<[^()]*\>', '', s1)
s2 = re.sub(r'\<[^()]*\>', '', s2)
Output
>>> s1
'its a good day'
>>> s2
''
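For comparison, here is a minimal sketch of a per-tag pattern, assuming the whole fragment is held in one string. The names TAG_RE and strip_tags are made up for this illustration. It removes each tag separately but keeps the link text, which is one reason a real HTML parser is usually the safer choice for this question.

```python
import re

# Matches one tag at a time: "<", then anything that is not ">", then ">".
TAG_RE = re.compile(r"<[^>]+>")

def strip_tags(fragment):
    """Remove the markup of every tag but keep all text, including link text."""
    return TAG_RE.sub("", fragment)

s = ('<div class="xxx" data-handler="xxx">its a good day '
     '<a class="link" href="https://" title="text">https:// link</a></div>')
result = strip_tags(s)
print(result)
```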

EDIT
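Based on your comment that all text before the <a> should be captured, and not only the first string in the element, select all previous_siblings and keep the NavigableString instances. previous_siblings yields the nodes from nearest to farthest, so reverse them to restore document order:
' '.join(
    reversed([s for s in soup.select_one('.xxx a').previous_siblings if isinstance(s, NavigableString)])
)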
Example
from bs4 import Tag, NavigableString, BeautifulSoup
html='''
<div class="xxx" data-handler="xxx"><br>New wallpaper <br>Find over 100+ of <a class="link" href="https://" title="text">https:// link</a></div>
'''
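soup = BeautifulSoup(html, 'html.parser')
# previous_siblings runs backwards from the <a>, so reverse to get document order
' '.join(
    reversed([s for s in soup.select_one('.xxx a').previous_siblings if isinstance(s, NavigableString)])
)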
To get just the text and not the child tags of an element, you could use:
.find(text=True)
If the pattern is always the same and the text is the first part of the element's content:
.contents[0]
Example
from bs4 import BeautifulSoup
html='''
<div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div>
'''
soup = BeautifulSoup(html, 'html.parser')
soup.div.find(text=True).strip()
Output
its a good day
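Another option, sketched here on the assumption that the wanted text always sits directly inside the div, is to ask only for the direct text children. string=True keeps text nodes only and recursive=False stops the search from descending into the <a>. The variable name own_text is just for this illustration.

```python
from bs4 import BeautifulSoup

html = '''<div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div>'''

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="xxx")

# Only direct text children of the div are returned; text nested in <a> is skipped.
own_text = " ".join(
    s.strip() for s in div.find_all(string=True, recursive=False) if s.strip()
)
print(own_text)
```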

So basically you want to drop any text inside the <a> tags and keep all the other text within the div.
from bs4 import BeautifulSoup
html1='''
<div class="xxx" data-handler="xxx"><br>New wallpaper <br>Find over 100+ of <a class="link" href="https://" title="text">https:// link </a></div>
'''
html2 = ''' <div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div> '''
html3 = ''' <div class="xxx" data-handler="xxx"><br>New wallpaper <br>Find over 100+ of <a class="link" href="https://" title="text">https:// link </a><div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div></div> '''
soup = BeautifulSoup(html3,'html.parser')
for t in soup.find_all('a', href=True):
    t.decompose()
test = soup.find('div',class_='xxx').getText().strip()
print(test)
output:
#for html1: New wallpaper Find over 100+ of
#for html2: its a good day
#for html3: New wallpaper Find over 100+ of its a good day
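Since the question also mentions <img> tags, the same idea can be wrapped in a small helper. This is only a sketch: the function name text_without and its parameters are hypothetical, and it uses extract() rather than decompose() so that a tag nested inside an already removed tag is still handled safely.

```python
from bs4 import BeautifulSoup

def text_without(html, container_selector, unwanted=("a", "img")):
    """Return the text of the first matching container, ignoring unwanted tags."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.select_one(container_selector)
    if container is None:
        return ""
    # find_all accepts a list of tag names; extract() detaches each match.
    for tag in container.find_all(list(unwanted)):
        tag.extract()
    # Join the remaining text pieces with single spaces.
    return container.get_text(" ", strip=True)

sample = ('<div class="xxx"><br>New wallpaper <br>Find over 100+ of '
          '<img src="x.png"><a class="link" href="https://">https:// link</a></div>')
cleaned = text_without(sample, "div.xxx")
print(cleaned)
```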

Related

BeautifulSoup - extracting text from multiple span elements w/o classes

So that's how HTML looks:
<p class="details">
<span>detail1</span>
<span class="number">1</span>
<span>detail2</span>
<span>detail3</span>
</p>
I need to extract detail2 & detail3.
But with this piece of code I only get detail1.
info = data.find("p", class_ = "details").span.text
How do I extract the needed items?
Thanks in advance!
Select your elements more specifically. In your case, select all <span> siblings that follow the <span> with class number:
soup.select('span.number ~ span')
Example
from bs4 import BeautifulSoup
html='''<p class="details">
<span>detail1</span>
<span class="number">1</span>
<span>detail2</span>
<span>detail3</span>
</p>'''
soup = BeautifulSoup(html, 'html.parser')
[t.text for t in soup.select('span.number ~ span')]
Output
['detail2', 'detail3']
You can find all <span>s and do normal indexing:
from bs4 import BeautifulSoup
html_doc = """\
<p class="details">
<span>detail1</span>
<span class="number">1</span>
<span>detail2</span>
<span>detail3</span>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")
spans = soup.find("p", class_="details").find_all("span")
for s in spans[-2:]:
    print(s.text)
Prints:
detail2
detail3
Or CSS selectors:
spans = soup.select(".details span:nth-last-of-type(-n+2)")
for s in spans:
    print(s.text)
Prints:
detail2
detail3
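The sibling relationship can also be expressed without CSS, using find_next_siblings from the span with class number. This is a minimal sketch of that variant; the variable names are chosen only for the example.

```python
from bs4 import BeautifulSoup

html = '''<p class="details">
<span>detail1</span>
<span class="number">1</span>
<span>detail2</span>
<span>detail3</span>
</p>'''

soup = BeautifulSoup(html, "html.parser")
number = soup.find("span", class_="number")

# Collect every <span> that comes after the "number" span at the same level.
details = [s.get_text(strip=True) for s in number.find_next_siblings("span")]
print(details)
```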

How to parse the drop-down list and get all the links for the PDFs using Beautiful Soup in Python?

I'm trying to scrape the PDF links from the drop-down on this website. I want to scrape just the Guideline Values (CVC) drop-down. The following is the code that I used, but it did not succeed:
import requests
from bs4 import BeautifulSoup
req_ses = requests.Session()
igr_get_base_response = req_ses.get("https://igr.karnataka.gov.in/english#")
soup = BeautifulSoup(igr_get_base_response.text, 'html.parser')
def matches_block(tag):
    return matches_dropdown(tag) and tag.find(matches_text) != None
def matches_dropdown(tag):
    return tag.name == 'li' and tag.has_attr('class') and 'dropdown-toggle' in tag['class']
def matches_text(tag):
    return tag.name == 'a' and tag.get_text()
for li in soup.find_all(matches_block):
    for ul in li.find_all('ul', class_='dropdown-toggle'):
        for a in ul.find_all('a'):
            if a.has_attr('href'):
                print(a['href'])
Any suggestion would be a great help!
Edit: Adding part of HTML below:
<div class="collapse navbar-collapse">
<ul class="nav navbar-nav">
<li class="">
<i class="fa fa-home"> </i>
</li>
<li>
<a class="dropdown-toggle" data-toggle="dropdown" title="RTI Act">RTI Act <b class="caret"></b></a>
<ul class="dropdown-menu multi-level">
<!-- <li> -->
<li class="">
<a href=" https://igr.karnataka.gov.in/page/RTI+Act/Yadagiri+./en " title="Yadagiri .">Yadagiri .
</a>
</li>
<!-- </li> -->
<!-- <li>
I have tried to get the links of all the PDF files that you need.
I have selected the <a> tags whose href matches with the pattern - see patt in code. This pattern is common to all the PDF files that you need.
Now you have all the links to the PDF files in links list.
from bs4 import BeautifulSoup
import requests
url = 'https://igr.karnataka.gov.in/english#'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
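a = soup.find('a', attrs={'title': 'Guidelines Value (CVC)'})
# Calling a tag is shorthand for find_all(), so this lists every tag under the parent element
lst = a.parent()
links = []
patt = 'https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/'
for i in lst:
    temp = i.find('a')
    # .get avoids a KeyError for <a> tags that have no href
    if temp and patt in temp.get('href', ''):
        links.append(temp['href'].strip())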
First I find the ul tag that holds all the data, then I call find_all on it for the <a> tags that have target set to _blank. Those are the tags whose href points to a .pdf file, so only the PDF links are extracted:
from bs4 import BeautifulSoup
import requests
res=requests.get("https://igr.karnataka.gov.in/english#")
soup=BeautifulSoup(res.text,"lxml")
ul_tag=soup.find("ul",class_="nav navbar-nav")
a_tag=ul_tag.find_all("a",attrs={"target":"_blank"})
for i in a_tag:
    print(i.get_text(strip=True))
    print(i.get("href").strip())
Output:
SRO Chikkaballapur
https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/chikkaballapur sro.pdf
SRO Gudibande
https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/gudibande sro.pdf
SRO Shidlaghatta
https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/shidlagatta sro.pdf
SRO Bagepalli
....
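If the target attribute turns out not to be a reliable marker, the links could instead be filtered on the file extension. The sketch below runs on a made-up HTML fragment (the example.invalid URLs are placeholders, not the real site) and strips each href before testing it, because the hrefs on the page carry surrounding spaces.

```python
from bs4 import BeautifulSoup

# Made-up sample; the real page structure may differ.
sample = '''<ul class="nav navbar-nav">
<li><a href=" https://example.invalid/page/about ">About</a></li>
<li><a target="_blank" href=" https://example.invalid/storage/pdf-files/a sro.pdf ">SRO A</a></li>
</ul>'''

soup = BeautifulSoup(sample, "html.parser")

# Keep only hrefs that end in .pdf once surrounding whitespace is removed.
pdf_links = [
    a["href"].strip()
    for a in soup.select("a[href]")
    if a["href"].strip().lower().endswith(".pdf")
]
print(pdf_links)
```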
So, I used the following approach to complete the part mentioned above:
def make_sqlite_dict_from_parsed_row(district_value, sro_value, pdf_file_link):
    sqlite_dict = {
        "district_value": district_value,
        "sro_value": sro_value,
        "pdf_file_link": pdf_file_link.strip().replace(' ', '%20'),
        "status": "PENDING"
    }
    # get_hash and IGR_SQLITE_HSH_TUP are not shown in this answer
    sqlite_dict['hsh'] = get_hash(sqlite_dict, IGR_SQLITE_HSH_TUP)
    return sqlite_dict
li_element_list = home_response_soup.find_all('li', {'class': 'dropdown-submenu'})
parsed_row_list = []
for ele in li_element_list:
    district_value = ele.find('a', {'class': 'dropdown-toggle'}).get_text().strip()
    sro_pdf_a_tags = ele.find_all('a', attrs={'target': '_blank'})
    if sro_pdf_a_tags:
        for sro_a_tag in sro_pdf_a_tags:
            sqlite_dict = make_sqlite_dict_from_parsed_row(
                district_value,
                sro_a_tag.get_text(strip=True),
                sro_a_tag.get('href')
            )
            parsed_row_list.append(sqlite_dict)
    else:
        print("District: ", district_value, "'s pdf is corrupted")
This gives the proper PDF link, the SRO name and the district name for each row.
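The manual replace(' ', '%20') handles spaces only. As a hedged alternative, urllib.parse.quote from the standard library can percent-encode the link while leaving the URL delimiters alone. The helper name encode_link is hypothetical, and the safe list is an assumption about which characters should stay untouched.

```python
from urllib.parse import quote

def encode_link(href):
    """Strip whitespace and percent-encode unsafe characters such as spaces."""
    # "%" is in the safe list so links that are already encoded are not encoded twice.
    return quote(href.strip(), safe=":/?#[]@!$&'()*+,;=%")

link = encode_link(
    " https://igr.karnataka.gov.in/storage/pdf-files/Guidelines Value/gudibande sro.pdf "
)
print(link)
```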

Extract content of div tag except other tags using BeautifulSoup

I have the HTML content below, in which the div tag looks like this:
<div class="block">aaa
<p> bbb</p>
<p> ccc</p>
</div>
From the above I want to extract only the text "aaa" and not the content of the other tags.
When I do,
soup.find('div', {"class": "block"})
it gives me all the content as text and I want to avoid the contents of p tag.
Is there a method available in BeautifulSoup to do this?
Check the type of each element. You could try:
from bs4 import BeautifulSoup
from bs4 import element
s = '''
<div class="block">aaa
<p> bbb</p>
<p> ccc</p>
<h1>ddd</h1>
</div>
'''
soup = BeautifulSoup(s, "lxml")
for e in soup.find('div', {"class": "block"}):
    if type(e) == element.NavigableString and e.strip():
        print(e.strip())
# aaa
And this will ignore all text in sub tags.
You can remove the p tags from that div, which effectively gives you the aaa text.
Here's how:
from bs4 import BeautifulSoup
sample = """<div class="block">aaa
<p> bbb</p>
<p> ccc</p>
</div>
"""
s = BeautifulSoup(sample, "html.parser")
excluded = [i.extract() for i in s.find("div", class_="block").find_all("p")]
print(s.text.strip())
Output:
aaa
You can use find_next(), which returns the first match found:
from bs4 import BeautifulSoup
html = '''
<div class="block">aaa
<p> bbb</p>
<p> ccc</p>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
print(soup.find('div', {"class": "block"}).find_next(text=True))
Output:
aaa
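One caveat with find_next is that it walks the document in order, so it returns text from inside a child tag if that tag comes first. The sketch below uses a rearranged, made-up variant of the sample to show the difference from checking only the direct children.

```python
from bs4 import BeautifulSoup, NavigableString

# Variant of the sample where a <p> comes before the div's own text.
html = '<div class="block"><p>bbb</p>aaa<p>ccc</p></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="block")

# find_next follows document order, so it lands on the text inside the first <p>.
first = div.find_next(string=True)

# Iterating over the direct children only keeps the div's own text.
direct = [c.strip() for c in div.children
          if isinstance(c, NavigableString) and c.strip()]
print(first, direct)
```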

How to extract text based on a condition in python

I have my soup data like below.
<a href="/title/tt0110912/" title="Quentin Tarantino">
Pulp Fiction
</a>
<a href="/title/tt0137523/" title="David Fincher">
Fight Club
</a>
<a href="blablabla" title="Yet to Release">
Yet to Release
</a>
<a href="something" title="Movies">
Coming soon
</a>
I need the text data from those a tags on a condition, maybe href=/title/*wildcharacter*
My code looks somewhat like this.
titles = []
for a in soup.find_all("a",href=True):
    if a.text:
        titles.append(a.text.replace('\n'," "))
print(titles)
But with this condition, I get the text from all the a tags. I need only the text where href has "/title/***".
I guess you want it like this:
from bs4 import BeautifulSoup
html = '''<a href="/title/tt0110912/" title="Quentin Tarantino">
Pulp Fiction
</a>
<a href="/title/tt0137523/" title="David Fincher">
Fight Club
</a>
<a href="blablabla" title="Yet to Release">
Yet to Release
</a>
<a href="something" title="Movies">
Coming soon
</a>
'''
soup = BeautifulSoup(html, 'html.parser')
titles = []
for a in soup.select('a[href*="/title/"]'):
    if a.text:
        titles.append(a.text.replace('\n'," "))
print(titles)
Output:
[' Pulp Fiction ', ' Fight Club ']
You can use a regular expression to search for the contents of an attribute (in this case href).
For more details please refer to this answer: https://stackoverflow.com/a/47091570/1426630
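A minimal sketch of the regular-expression approach, assuming the href should start with /title/: Beautiful Soup accepts a compiled pattern as an attribute filter and matches it with search(). The variable names are chosen only for this example.

```python
import re
from bs4 import BeautifulSoup

html = '''<a href="/title/tt0110912/" title="Quentin Tarantino">
Pulp Fiction
</a>
<a href="/title/tt0137523/" title="David Fincher">
Fight Club
</a>
<a href="blablabla" title="Yet to Release">
Yet to Release
</a>'''

soup = BeautifulSoup(html, "html.parser")

# Only <a> tags whose href matches the pattern are returned.
pattern = re.compile(r"^/title/")
titles = [a.get_text(strip=True) for a in soup.find_all("a", href=pattern)]
print(titles)
```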
1.) To get all <a> tags, where the href= begins with "/title/", you can use CSS selector a[href^="/title/"].
2.) To strip all text inside the tag, you can use .get_text() with parameter strip=True
soup = BeautifulSoup(html_text, 'html.parser')
out = [a.get_text(strip=True) for a in soup.select('a[href^="/title/"]')]
print(out)
Prints:
['Pulp Fiction', 'Fight Club']

Beautiful Soup and searching in results

These are my first steps with python, please bear with me.
Basically I want to parse a Table of Contents from a single Dokuwiki page with Beautiful Soup. The TOC looks like this:
<div id="dw__toc">
<h3 class="toggle">Table of Contents</h3>
<div>
<ul class="toc">
<li class="level1"><div class="li">#</div>
<ul class="toc">
<li class="level2"><div class="li">One</div></li>
<li class="level2"><div class="li">Two</div></li>
<li class="level2"><div class="li">Three</div></li>
I would like to be able to search in the content of the a-tags and if a result is found return its content and also return the href-link. So if I search for "one" the result should be
One
#link1
What I have done so far:
#!/usr/bin/python2
from BeautifulSoup import BeautifulSoup
import urllib2
#Grab and open URL, create BeautifulSoup object
url = "http://www.somewiki.at/wiki/doku.php"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
#Grab Table of Contents
grab_toc = soup.find('div', {"id":"dw__toc"})
#Look for all divs with class: li
ftext = grab_toc.findAll('div', {"class":"li"})
#Look for links
links = grab_toc.findAll('a',href=True)
#Iterate
for everytext in ftext:
    text = ''.join(everytext.findAll(text=True))
    data = text.strip()
    print data
for everylink in links:
    print everylink['href']
This prints out the data I want, but I'm lost on how to rewrite it so that I can search within the result and return only the search term. I tried something like
if data == 'searchterm':
    print data
    break
else:
    print 'Nothing found'
But this is kind of a weak search. Is there a nicer way to do this? In my example the Beautiful Soup result set is changed into a list. Is it better to search in the result set in the first place, and if so, how?
Instead of searching through the links one-by-one, have BeautifulSoup search for you, using a regular expression:
import re
matching_link = grab_toc.find('a', text=re.compile('one', re.IGNORECASE))
This would find the first a link in the table of contents with the 3 characters one in the text somewhere. Then just print the link and text:
print matching_link.string
print matching_link['href']
Short demo based on your sample:
>>> from bs4 import BeautifulSoup
>>> import re
>>> soup = BeautifulSoup('''\
... <div id="dw__toc">
... <h3 class="toggle">Table of Contents</h3>
... <div>
...
... <ul class="toc">
... <li class="level1"><div class="li">#</div>
... <ul class="toc">
... <li class="level2"><div class="li">One</div></li>
... <li class="level2"><div class="li">Two</div></li>
... <li class="level2"><div class="li">Three</div></li>
... </ul></ul>''')
>>> matching_link = soup.find('a', text=re.compile('one', re.IGNORECASE))
>>> print matching_link.string
One
>>> print matching_link['href']
#link1
In BeautifulSoup version 3, the above .find() call returns the contained NavigableString object instead. To get back to the parent a element, use the .parent attribute:
matching_link = grab_toc.find('a', text=re.compile('one', re.IGNORECASE)).parent
print matching_link.string
print matching_link['href']
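For Python 3 with bs4, the same search could be wrapped in a function that returns every match rather than only the first. This is a sketch under assumptions: the sample markup is a shortened stand-in for the wiki page, the href value #link2 and the function name search_toc are invented for the illustration, and the search term is escaped so it is matched literally.

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for the table of contents; "#link2" is a made-up value.
html = '''<div id="dw__toc">
<ul class="toc">
<li class="level2"><div class="li"><a href="#link1">One</a></div></li>
<li class="level2"><div class="li"><a href="#link2">Two</a></div></li>
</ul></div>'''

def search_toc(soup, term):
    """Return (text, href) pairs for TOC links whose text contains term."""
    toc = soup.find("div", id="dw__toc")
    if toc is None:
        return []
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return [(a.get_text(strip=True), a["href"])
            for a in toc.find_all("a", href=True, string=pattern)]

soup = BeautifulSoup(html, "html.parser")
matches = search_toc(soup, "one")
print(matches)
```

An empty list signals that nothing was found, which avoids the separate 'Nothing found' branch.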
