How to extract text based on a condition in Python

I have my soup data like below.
<a href="/title/tt0110912/" title="Quentin Tarantino">
Pulp Fiction
</a>
<a href="/title/tt0137523/" title="David Fincher">
Fight Club
</a>
<a href="blablabla" title="Yet to Release">
Yet to Release
</a>
<a href="something" title="Movies">
Coming soon
</a>
I need the text from those a tags on a condition, something like href=/title/*wildcard*.
My code looks somewhat like this:
titles = []
for a in soup.find_all("a", href=True):
    if a.text:
        titles.append(a.text.replace('\n', " "))
print(titles)
But with this condition, I get text from all the a tags. I need only the text where the href contains "/title/***".

I guess you want it like this:
from bs4 import BeautifulSoup
html = '''<a href="/title/tt0110912/" title="Quentin Tarantino">
Pulp Fiction
</a>
<a href="/title/tt0137523/" title="David Fincher">
Fight Club
</a>
<a href="blablabla" title="Yet to Release">
Yet to Release
</a>
<a href="something" title="Movies">
Coming soon
</a>
'''
soup = BeautifulSoup(html, 'html.parser')
titles = []
for a in soup.select('a[href*="/title/"]'):
    if a.text:
        titles.append(a.text.replace('\n', " "))
print(titles)
Output:
[' Pulp Fiction ', ' Fight Club ']

You can use a regular expression to search for the contents of an attribute (in this case href).
For more details please refer to this answer: https://stackoverflow.com/a/47091570/1426630
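A minimal sketch of that approach, assuming the same html string as in the question: find_all() accepts a compiled pattern for the href= argument, so only matching anchors are returned.
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# href=re.compile(...) keeps only anchors whose href starts with /title/
titles = [a.get_text(strip=True) for a in soup.find_all('a', href=re.compile(r'^/title/'))]
print(titles)
# ['Pulp Fiction', 'Fight Club']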

1.) To get all <a> tags where the href begins with "/title/", you can use the CSS selector a[href^="/title/"].
2.) To strip the surrounding whitespace from the text inside the tag, you can use .get_text() with the parameter strip=True:
soup = BeautifulSoup(html_text, 'html.parser')
out = [a.get_text(strip=True) for a in soup.select('a[href^="/title/"]')]
print(out)
Prints:
['Pulp Fiction', 'Fight Club']

Related

How to remove all tags from text in Python

I want to extract data from a tag and simply retrieve the text. Unfortunately, I can't extract just the text; I always get the links as well.
Is it possible to remove all of the <img> and <a href> tags from my text?
<div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div>
I just want to recover "its a good day" and ignore the content of the <a href> tag inside my <div> tag.
Currently I perform the extraction via beautifulsoup.find('div').
Try to do this
import requests
from bs4 import BeautifulSoup
#response = requests.get('your url')
html = BeautifulSoup('''<div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a>
</div>''', 'html.parser')
soup = html.find_all(class_='xxx')
print(soup[0].text.split('\n')[0])
Let's import re and use re.sub:
import re
s1 = '<div class="xxx" data-handler="xxx">its a good day'
s2 = '<a class="link" href="https://" title="text">https:// link</a></div>'
s1 = re.sub(r'\<[^()]*\>', '', s1)
s2 = re.sub(r'\<[^()]*\>', '', s2)
Output
>>> s1
'its a good day'
>>> s2
''
EDIT
Based on your comment that all the text before the <a> should be captured, and not only the first piece in the element, select all previous_siblings and check for NavigableString:
' '.join(
    [s for s in soup.select_one('.xxx a').previous_siblings if isinstance(s, NavigableString)][::-1]
)
Example
from bs4 import Tag, NavigableString, BeautifulSoup
html='''
<div class="xxx" data-handler="xxx"><br>New wallpaper <br>Find over 100+ of <a class="link" href="https://" title="text">https:// link</a></div>
'''
soup = BeautifulSoup(html, 'html.parser')
# previous_siblings walks backwards, so reverse the list to restore document order
print(' '.join(
    [s for s in soup.select_one('.xxx a').previous_siblings if isinstance(s, NavigableString)][::-1]
))
To focus just on the text and not the child tags of an element, you could use:
.find(text=True)
In case the pattern is always the same and the text is the first part of the content in the element:
.contents[0]
Example
from bs4 import BeautifulSoup
html='''
<div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div>
'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.div.find(text=True).strip())
Output
its a good day
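A short sketch of the .contents[0] variant mentioned above, run on the same html and assuming the text really is the first child node of the <div>:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# .contents[0] is the first child node; here that is the plain text before the <a> tag
print(soup.div.contents[0].strip())
# its a good day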
So basically you don't want any text inside the <a> tags, but you do want the text from all the other tags.
from bs4 import BeautifulSoup
html1='''
<div class="xxx" data-handler="xxx"><br>New wallpaper <br>Find over 100+ of <a class="link" href="https://" title="text">https:// link </a></div>
'''
html2 = ''' <div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div> '''
html3 = ''' <div class="xxx" data-handler="xxx"><br>New wallpaper <br>Find over 100+ of <a class="link" href="https://" title="text">https:// link </a><div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div></div> '''
soup = BeautifulSoup(html3, 'html.parser')
for t in soup.find_all('a', href=True):
    t.decompose()
test = soup.find('div', class_='xxx').getText().strip()
print(test)
output:
#for html1: New wallpaper Find over 100+ of
#for html2: its a good day
#for html3: New wallpaper Find over 100+ of its a good day

Web scraping from the span element

I am on a scraping project and I am looking to scrape the following.
<div class="spec-subcat attributes-religion">
<span class="h5">Faith:</span>
<span>Christian</span>
<span>Islam</span>
</div>
I want to extract only "Christian, Islam" as the output (without the 'Faith:').
This is my try:
faithdiv = soup.find('div', class_='spec-subcat attributes-religion')
faith = faithdiv.find('span').text.strip()
How can I get this done?
There are several ways you can fix this; I would suggest the following: find all <span> tags in the <div> that do not have class="h5":
soup.select('div.spec-subcat.attributes-religion span:not(.h5)')
Example
from bs4 import BeautifulSoup
html_text = '''
<div class="spec-subcat attributes-religion">
<span class="h5">Faith:</span>
<span>Christian</span>
<span>Islam</span>
</div>
'''
soup = BeautifulSoup(html_text, 'lxml')
print(', '.join([x.get_text() for x in soup.select('div.spec-subcat.attributes-religion span:not(.h5)')]))
Output
Christian, Islam

How to extract text from site using beautiful soup and python

I have this HTML tag that I am trying to scrape:
<span class="title NSNTitle">
<small class="text-primary"><strong>
ID 1040-KK-143-6964, 1040001436964
</strong></small>
<br>
<small class="text-primary">
MODIFICATION KIT,
</small>
</span>
I use this code
page_soup = soup(page_html, "html.parser")
FSGcontainer = page_soup.find("h1", {"class": "nopad-top"}).find_all("small", {"class": "text-primary"})
for subcontainer in FSGcontainer:
    FSGsubcard = subcontainer
    if FSGsubcard is not None:
        Nomenclature = FSGsubcard.text
        print(Nomenclature)
and I get this output
NSN 1040-KK-143-6964, 1005009927288
MODIFICATION KIT,
What I really want is the text "MODIFICATION KIT,".
How can I capture just the text and not the IDs?
Use select_one together with a CSS selector that selects the second small element:
nomenclature = page_soup.find(
    "h1", {"class": "nopad-top"}
).select_one(
    'small:nth-of-type(2)'
).text.strip()
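As a minimal, self-contained sketch (assuming the <span class="title NSNTitle"> snippet from the question stands alone; on the real page it sits inside an <h1 class="nopad-top">, which the code above relies on):
from bs4 import BeautifulSoup
html = '''<span class="title NSNTitle">
<small class="text-primary"><strong>ID 1040-KK-143-6964, 1040001436964</strong></small>
<br>
<small class="text-primary">MODIFICATION KIT,</small>
</span>'''
soup = BeautifulSoup(html, 'html.parser')
# nth-of-type(2) picks the second <small> among its siblings
print(soup.select_one('small.text-primary:nth-of-type(2)').get_text(strip=True))
# MODIFICATION KIT,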
Try this. It will let you fetch the specific items you want.
for item in soup.find_all(class_="title"):
    text_item = item.find_all(class_="text-primary")[1].text
    print(text_item)
Result:
MODIFICATION KIT

Why does BeautifulSoup work the second time parsing, but not the first

This is the ResultSet of running soup[0].find_all('div', {'class':'font-160 line-110'}):
[<div class="font-160 line-110" data-container=".snippet-container" data-html="true" data-placement="top" data-template='<div class="tooltip infowin-tooltip" role="tooltip"><div class="tooltip-arrow"><div class="tooltip-arrow-inner"></div></div><div class="tooltip-inner" style="text-align: left"></div></div>' data-toggle="tooltip" title="XIAMEN [CN]">
<a class="no-underline group-ib color-inherit"
href="/en/ais/details/ports/959">
<span class="text-default">CN</span><span class="text-default text-darker">XMN
</span>
</a>
</div>]
In an attempt to pull out XIAMEN [CN] after title, I could not use a[0].find('div')['title'] (where a is the above BeautifulSoup ResultSet). However, if I copy and paste that HTML as a new string, say,
b = '''<div class="font-160 line-110" data-container=".snippet container" data-html="true" data-placement="top" data-template='<div class="tooltip infowin-tooltip" role="tooltip"><div class="tooltip-arrow"><div class="tooltip-arrow-inner"></div></div><div class="tooltip-inner" style="text-align: left"></div></div>' data-toggle="tooltip" title="XIAMEN [CN]">'''
Then do:
>>> soup = BeautifulSoup(b, 'html.parser')
>>> soup.find('div')['title']
'XIAMEN [CN]'  # contents of title
Why do I have to reSoup the Soup? Why doesn't this work on my first search?
Edit, origin of soup:
I have a list of urls that I'm going though via grequests. One of the things I'm looking for is that title that contains XIAMEN [CN].
So soup was created when I did
soup = []
for i in range(2):  # number of pages parsed
    rawSoup = BeautifulSoup(response[i].content, 'html.parser')
    souporigin = rawSoup.find_all('div', {'class': 'bg-default bg-white no-snippet-hide'})
    soup.append(souporigin)
The urls are
[
'http://www.marinetraffic.com/en/ais/details/ships/shipid:564352/imo:9643752/mmsi:511228000/vessel:DE%20MI',
'http://www.marinetraffic.com/en/ais/details/ships/shipid:3780155/imo:9712395/mmsi:477588800/vessel:SITC%20GUANGXI?cb=2267'
]
I found out the problem occurred when I set up my BeautifulSoup. I created a list of partial search results and then had to iterate over the list to search it again. I fixed this by just searching for what I wanted in one line:
I changed:
soup = []
for i in range(2):  # number of pages parsed
    rawSoup = BeautifulSoup(response[i].content, 'html.parser')
    souporigin = rawSoup.find_all('div', {'class': 'bg-default bg-white no-snippet-hide'})
    soup.append(souporigin)
to:
a = soup.find("div", class_='font-160 line-110')["title"]
And run this search as soon as I create my soup, which removes a lot of redundancy in the code; I had been creating lists of ResultSets and having to use find on them for new fields.
You used the wrong selection.
The selection soup[0].find_all('div', {'class': 'font-160 line-110'}) finds the <div>, and you can even see the <div> when you print it. But when you add .find(), it starts searching inside that <div>, so .find('div') tries to find a new div inside the current div.
You need
a[0]['title']
When you create a new soup, the main/root element is not div but [document], and div is its child (div is inside the main "tag"), so you can use find('div').
>>> a[0].name
'div'
>>> soup = BeautifulSoup(b, 'html.parser')
>>> soup.name
'[document]'
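A short runnable demonstration of that difference, assuming b is the snippet string quoted in the question:
from bs4 import BeautifulSoup
soup = BeautifulSoup(b, 'html.parser')
a = soup.find_all('div', {'class': 'font-160 line-110'})
print(a[0]['title'])     # XIAMEN [CN] - the attribute is read directly from the matched <div>
print(a[0].find('div'))  # None - there is no nested <div> inside, which is why ['title'] failed on it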

How to extract the contents of a div tag containing particular text using BeautifulSoup

I am new to BeautifulSoup and am looking to extract text from a list inside a div tag. This is the code:
<div class="contentBlurb">Description Synonyms
<ul class="definitionList">
<li>Awaiting bone marrow transplant</li>
</ul>
</div>
I would like to extract the text "Awaiting bone marrow transplant". This is the code I use now, which gives me an empty list:
for link in soup.findAll('div', text=re.compile('Description Synonyms ')):
    print(link)
Sorry for not adding this. I do have other divs with the same class name; I am interested only in the description synonyms. The other div is listed below:
<div class="contentBlurb">Applicable To
<ul class="definitionList">
<li>Patient waiting for organ availability</li>
</ul>
</div>
soup.find(text='...') doesn't work if there's other text or tags inside that tag.
Try:
[i.find('ul', {'class': "definitionList"}).find('li').text
for i in soup.find_all('div', {'class': "contentBlurb"})
if 'Description Synonyms' in str(i.text)][0]
You can do this:
# coding: utf-8
from bs4 import BeautifulSoup
html = """
<div class="contentBlurb">Description Synonyms
<ul class="definitionList">
<li>Awaiting bone marrow transplant</li>
</ul>
</div>
<div class="contentBlurb">Applicable To
<ul class="definitionList">
<li>Patient waiting for organ availability</li>
</ul>
</div>
"""
souped = BeautifulSoup(html, 'html.parser')
matching_divs = [div for div in souped.find_all(
'div', {'class': 'contentBlurb'}) if 'Description Synonyms' in div.getText()]
li_elements = []
matching_uls = []
for mdiv in matching_divs:
    matching_uls.extend(mdiv.findAll('ul', {'class': 'definitionList'}))
for muls in matching_uls:
    li_elements.extend(muls.findAll('li'))
for li in li_elements:
    print(li.getText())
EDIT: Updated for matching particular div.
Try this; change the string in the if clause to the one you need. The snippet below prints the <li> text when the div's text contains "Description Synonyms"; you can adapt it to your requirement.
val = soup.find('div', {'class': 'contentBlurb'}).text
if "Description Synonyms" in val:
    print(soup.find('div', {'class': 'contentBlurb'}).find('ul', {'class': 'definitionList'}).find('li').text)
