How to extract text from site using beautiful soup and python

I have this HTML snippet that I am trying to scrape:
<span class="title NSNTitle">
<small class="text-primary"><strong>
ID 1040-KK-143-6964, 1040001436964
</strong></small>
<br>
<small class="text-primary">
MODIFICATION KIT,
</small>
</span>
I use this code:
page_soup = soup(page_html, "html.parser")
FSGcontainer = page_soup.find("h1", {"class": "nopad-top"}).find_all("small", {"class": "text-primary"})
for subcontainer in FSGcontainer:
    FSGsubcard = subcontainer
    if FSGsubcard is not None:
        Nomenclature = FSGsubcard.text
        print(Nomenclature)
and I get this output
NSN 1040-KK-143-6964, 1005009927288
MODIFICATION KIT,
What I really want is just the text "MODIFICATION KIT,".
How can I capture only that text and not the IDs?

Use select_one together with a CSS selector that selects the second small element.
nomenclature = (
    page_soup.find("h1", {"class": "nopad-top"})
    .select_one("small:nth-of-type(2)")
    .text.strip()
)

Try this. It will let you fetch the specific items you want.
for item in soup.find_all(class_="title"):
    text_item = item.find_all(class_="text-primary")[1].text
    print(text_item)
Result:
MODIFICATION KIT
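
For anyone who wants to check this against the snippet from the question, here is a minimal sketch. It assumes the wrapper is the span with class NSNTitle shown in the question (the h1 with class nopad-top used in the question's code is not part of the posted snippet), and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# HTML copied from the question.
page_html = """
<span class="title NSNTitle">
<small class="text-primary"><strong>
ID 1040-KK-143-6964, 1040001436964
</strong></small>
<br>
<small class="text-primary">
MODIFICATION KIT,
</small>
</span>
"""

page_soup = BeautifulSoup(page_html, "html.parser")

# Pick the second <small> among its siblings and trim the surrounding whitespace.
second_small = page_soup.select_one("span.NSNTitle small:nth-of-type(2)")
nomenclature = second_small.get_text(strip=True)
print(nomenclature)
```

If the page can contain entries without a second small element, select_one returns None, so a None check before calling get_text would be needed.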

Related

How to extract text based on a condition in python

I have my soup data like below.
<a href="/title/tt0110912/" title="Quentin Tarantino">
Pulp Fiction
</a>
<a href="/title/tt0137523/" title="David Fincher">
Fight Club
</a>
<a href="blablabla" title="Yet to Release">
Yet to Release
</a>
<a href="something" title="Movies">
Coming soon
</a>
I need the text data from those a tags on a condition, maybe href=/title/*wildcharacter*
My code looks something like this:
titles = []
for a in soup.find_all("a", href=True):
    if a.text:
        titles.append(a.text.replace('\n', " "))
print(titles)
But with this condition I get the text from all the a tags. I need only the text where href contains "/title/***".

I guess you want it like this:
from bs4 import BeautifulSoup
html = '''<a href="/title/tt0110912/" title="Quentin Tarantino">
Pulp Fiction
</a>
<a href="/title/tt0137523/" title="David Fincher">
Fight Club
</a>
<a href="blablabla" title="Yet to Release">
Yet to Release
</a>
<a href="something" title="Movies">
Coming soon
</a>
'''
soup = BeautifulSoup(html, 'html.parser')
titles = []
# The attribute selector already restricts the match to hrefs containing "/title/".
for a in soup.select('a[href*="/title/"]'):
    if a.text:
        titles.append(a.text.replace('\n', " "))
print(titles)
Output:
[' Pulp Fiction ', ' Fight Club ']

You can use a regular expression to search for the contents of an attribute (in this case href).
For more details please refer to this answer: https://stackoverflow.com/a/47091570/1426630
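
A rough sketch of the regular-expression approach, using a shortened copy of the HTML from the question. The variable names are my own, and the pattern is just one way to express "href starts with /title/".

```python
import re
from bs4 import BeautifulSoup

html = """
<a href="/title/tt0110912/" title="Quentin Tarantino">
Pulp Fiction
</a>
<a href="blablabla" title="Yet to Release">
Yet to Release
</a>
"""

soup = BeautifulSoup(html, "html.parser")

# A compiled pattern used as an attribute filter is applied with search(),
# so the ^ anchor is what requires href to start with /title/.
title_links = soup.find_all("a", href=re.compile(r"^/title/"))
titles = [a.get_text(strip=True) for a in title_links]
print(titles)
```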

1.) To get all <a> tags, where the href= begins with "/title/", you can use CSS selector a[href^="/title/"].
2.) To strip all text inside the tag, you can use .get_text() with parameter strip=True
soup = BeautifulSoup(html_text, 'html.parser')
out = [a.get_text(strip=True) for a in soup.select('a[href^="/title/"]')]
print(out)
Prints:
['Pulp Fiction', 'Fight Club']

Extracting li element and assigning it to variable with beautiful soup

Given the following element
<ul class="listing-key-specs ">
<li>2004 (54 reg)</li>
<li>Hatchback</li>
<li>90,274 miles</li>
<li>Manual</li>
<li>1.2L</li>
<li>60 bhp</li>
<li>Petrol</li>
</ul>
How do I extract each li element and assign it to a variable with beautiful soup?
Currently, my code looks like this:
detail = car.find('ul', {'class': 'listing-key-specs'}).get_text(strip=True)
and it produces the following output:
2005 (05 reg)Saloon66,038 milesManual1.8L118 bhpPetrol
Please refer to the following question for more context: "None" returned during scraping.

Find the ul, then collect the text of each li:
from bs4 import BeautifulSoup
html_doc="""
<ul class="listing-key-specs ">
<li>2004 (54 reg)</li>
<li>Hatchback</li>
<li>90,274 miles</li>
<li>Manual</li>
<li>1.2L</li>
<li>60 bhp</li>
<li>Petrol</li>
</ul>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
lst = [_.get_text(strip=True) for _ in soup.find('ul', {'class': 'listing-key-specs'}).find_all('li')]
print(lst)

Currently, you are calling get_text() on the ul tag, which simply returns all its contents as one string. So
<div>
<p>Hello </p>
<p>World </p>
</div>
would become Hello World.
To extract each li and store the results as separate elements, find the ul first and then call find_all() on it, like this:
tag_list = car.find('ul', class_='listing-key-specs').find_all('li')
my_list = [i.get_text() for i in tag_list]
This gives you a list with the text of every li inside the ul with class 'listing-key-specs' (the class is on the ul, not on the li tags). Now you're free to assign variables, e.g. carType = my_list[1]
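
As a small illustration of the "assign it to a variable" part, here is a sketch using the HTML from the question. The variable names (year, body_type and so on) are made up for the example, and the unpacking assumes every listing has exactly these seven items in this order, which may not hold on the real site.

```python
from bs4 import BeautifulSoup

html_doc = """
<ul class="listing-key-specs ">
<li>2004 (54 reg)</li>
<li>Hatchback</li>
<li>90,274 miles</li>
<li>Manual</li>
<li>1.2L</li>
<li>60 bhp</li>
<li>Petrol</li>
</ul>
"""

car = BeautifulSoup(html_doc, "html.parser")

# One string per <li> inside the ul with class listing-key-specs.
specs = [li.get_text(strip=True) for li in car.select("ul.listing-key-specs li")]

# Tuple unpacking raises ValueError if the number of items differs from seven.
year, body_type, mileage, gearbox, engine, power, fuel = specs
print(body_type, fuel)
```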

Why does BeautifulSoup work the second time parsing, but not the first

This is the ResultSet of running soup[0].find_all('div', {'class':'font-160 line-110'}):
[<div class="font-160 line-110" data-container=".snippet-container" data-html="true" data-placement="top" data-template='<div class="tooltip infowin-tooltip" role="tooltip"><div class="tooltip-arrow"><div class="tooltip-arrow-inner"></div></div><div class="tooltip-inner" style="text-align: left"></div></div>' data-toggle="tooltip" title="XIAMEN [CN]">
<a class="no-underline group-ib color-inherit"
href="/en/ais/details/ports/959">
<span class="text-default">CN</span><span class="text-default text-darker">XMN
</span>
</a>
</div>]
In an attempt to pull out XIAMEN [CN] from the title attribute, I could not use a[0].find('div')['title'] (where a is the above BeautifulSoup ResultSet). However, if I copy and paste that HTML as a new string, say,
b = '''<div class="font-160 line-110" data-container=".snippet container" data-html="true" data-placement="top" data-template='<div class="tooltip infowin-tooltip" role="tooltip"><div class="tooltip-arrow"><div class="tooltip-arrow-inner"></div></div><div class="tooltip-inner" style="text-align: left"></div></div>' data-toggle="tooltip" title="XIAMEN [CN]">'''
Then do:
>>> soup = BeautifulSoup(b, 'html.parser')
>>> soup.find('div')['title']
'XIAMEN [CN]'
Why do I have to re-parse the soup? Why doesn't this work on my first search?
Edit, origin of soup:
I have a list of URLs that I'm going through via grequests. One of the things I'm looking for is the title that contains XIAMEN [CN].
So soup was created when I did
soup = []
for i in range(2):  # number of pages parsed
    rawSoup = BeautifulSoup(response[i].content, 'html.parser')
    souporigin = rawSoup.find_all('div', {'class': 'bg-default bg-white no- snippet-hide'})
    soup.append(souporigin)
The urls are
[
'http://www.marinetraffic.com/en/ais/details/ships/shipid:564352/imo:9643752/mmsi:511228000/vessel:DE%20MI',
'http://www.marinetraffic.com/en/ais/details/ships/shipid:3780155/imo:9712395/mmsi:477588800/vessel:SITC%20GUANGXI?cb=2267'
]

I found that the problem came from how I set up my BeautifulSoup objects. I was building a list of partial search results and then had to iterate over that list to search it again. I fixed this by searching for what I wanted in one line.
I changed:
soup = []
for i in range(2):  # number of pages parsed
    rawSoup = BeautifulSoup(response[i].content, 'html.parser')
    souporigin = rawSoup.find_all('div', {'class': 'bg-default bg-white no- snippet-hide'})
    soup.append(souporigin)
to:
a = soup.find("div", class_='font-160 line-110')["title"]
I now run this search as soon as I create my soup, which removes a lot of redundancy in the code. I had been creating lists of ResultSets and then calling find on them for each new field.

You are selecting the wrong element.
soup[0].find_all('div', {'class':'font-160 line-110'}) already finds the <div>, and you can see it when you print the result. When you then call .find() on it, the search starts inside that <div>, so .find('div') looks for another div nested within the current one.
You need
a[0]['title']
When you create a new soup, the root element is not the div but [document], and the div is its child, so find('div') works there.
>>> a[0].name
div
>>> soup = BeautifulSoup(b, 'html.parser')
>>> soup.name
[document]
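
To see the difference in one place, here is a minimal sketch with a cut-down version of the div from the question (most attributes are omitted, so the HTML here is an assumption rather than the real page).

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the div shown in the question.
html = """
<div class="font-160 line-110" title="XIAMEN [CN]">
<a href="/en/ais/details/ports/959"><span class="text-default">CN</span></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
a = soup.find_all("div", {"class": "font-160 line-110"})

# a[0] is already the <div>. find() only searches its descendants,
# and there is no nested <div>, so this is None.
nested = a[0].find("div")
print(nested)

# Reading the attribute straight from the tag works.
title = a[0]["title"]
print(title)
```

Indexing None with ['title'] is what raises the TypeError in the original attempt.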

Focusing in on specific results while scraping Twitter with Python and Beautiful Soup 4?

This is a follow up to my post Using Python to Scrape Nested Divs and Spans in Twitter?.
I'm not using the Twitter API because it doesn't look at the tweets by hashtag this far back. Complete code and output are below after examples.
I want to scrape specific data from each tweet. name and handle are retrieving exactly what I'm looking for, but I'm having trouble narrowing down the rest of the elements.
As an example:
link = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
url = link[0]
Retrieves this:
<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015">
<span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1443518016" data-time-ms="1443518016000">29 Sep 2015</span></a>
For url, I only need the href value from the first line.
Similarly, the retweets and favorites commands return large chunks of html, when all I really need is the numerical value that is displayed for each one.
How can I narrow down the results to the required data for the url, retweetcount and favcount outputs?
I am planning to have this cycle through all the tweets once I get it working, in case that has an influence on your suggestions.
Complete Code:
from bs4 import BeautifulSoup
import requests
import sys
url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
r = requests.get(url, headers=headers)
data = r.text.encode('utf-8')
soup = BeautifulSoup(data, "html.parser")
name = soup('strong', {'class': 'fullname js-action-profile-name show-popup-with-id'})
username = name[0].contents[0]
handle = soup('span', {'class': 'username js-action-profile-name'})
userhandle = handle[0].contents[1].contents[0]
link = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
url = link[0]
messagetext = soup('p', {'class': 'TweetTextSize js-tweet-text tweet-text'})
message = messagetext[0]
retweets = soup('button', {'class': 'ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet'})
retweetcount = retweets[0]
favorites = soup('button', {'class': 'ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite'})
favcount = favorites[0]
print (username, "\n", "#", userhandle, "\n", "\n", url, "\n", "\n", message, "\n", "\n", retweetcount, "\n", "\n", favcount) #extra linebreaks for ease of reading
Complete Output:
Michael Peel
#Mikepeeljourno
<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015"><span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1443518016" data-time-ms="1443518016000">29 Sep 2015</span></a>
<p class="TweetTextSize js-tweet-text tweet-text" data-aria-label-part="0" lang="en"><a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/FT?src=hash"><s>#</s><b>FT</b></a> Case closed: <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Thailand?src=hash"><s>#</s><b>Thailand</b></a> police chief proclaims <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Bangkokbombing?src=hash"><s>#</s><b><strong>Bangkokbombing</strong></b></a> solved ahead of his retirement this week -even as questions over case grow</p>
<button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet" data-modal="ProfileTweet-retweet" type="button">
<div class="IconContainer js-tooltip" title="Undo retweet">
<span class="Icon Icon--retweet"></span>
<span class="u-hiddenVisually">Retweeted</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">4</span>
</span>
</div>
</button>
<button class="ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite" type="button">
<div class="IconContainer js-tooltip" title="Undo like">
<div class="HeartAnimationContainer">
<div class="HeartAnimation"></div>
</div>
<span class="u-hiddenVisually">Liked</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">2</span>
</span>
</div>
</button>
It was suggested that BeautifulSoup - extracting attribute values may have an answer to this question there. However, I think the question and its answers do not have sufficient context or explanation to be helpful in more complex situations. The link to the relevant part of the Beautiful Soup Documentation is helpful though, http://www.crummy.com/software/BeautifulSoup/documentation.html#The%20attributes%20of%20Tags

Use the dictionary-like access to the Tag's attributes.
For example, to get the href attribute value:
links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
url = links[0]["href"]
Or, if you need to get the href values for every link found:
links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
urls = [link["href"] for link in links]
As a side note, you don't need to specify the complete class value to locate elements. class is a special multi-valued attribute and you can just use one of the classes (if this is enough to narrow down the search for the desired elements). For example, instead of:
soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
You may use:
soup('a', {'class': 'tweet-timestamp'})
Or, a CSS selector:
soup.select("a.tweet-timestamp")
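
A short sketch of the single-class lookup, run against a trimmed copy of the link from the question (the inner span is dropped here for brevity, which is my simplification).

```python
from bs4 import BeautifulSoup

html = """
<a class="tweet-timestamp js-permalink js-nav js-tooltip"
   href="/Mikepeeljourno/status/648787700980408320"
   title="2:13 AM - 29 Sep 2015">29 Sep 2015</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Matching on one of the classes is enough; both calls find the same link.
by_class = soup.find_all("a", class_="tweet-timestamp")
by_selector = soup.select("a.tweet-timestamp")

# Dictionary-style access returns the attribute value.
urls = [link["href"] for link in by_class]
print(urls)
```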

Alecxe already explained how to use the 'href' key to get the value.
So I'm going to answer the other part of your question:
"Similarly, the retweets and favorites commands return large chunks of html, when all I really need is the numerical value that is displayed for each one."
.contents returns a list of all the children. Since each button you find has several children, you can walk down to the one you're interested in through the parsed contents lists:
retweetcount = retweets[0].contents[3].contents[1].contents[1].string
This will return the value 4.
If you want a more readable approach, try this:
retweetcount = retweets[0].find_all('span', class_='ProfileTweet-actionCountForPresentation')[0].string
favcount = favorites[0].find_all('span', {'class': 'ProfileTweet-actionCountForPresentation'})[0].string
This returns 4 and 2 respectively.
This works because we take the tag element out of the ResultSet returned by soup()/find_all() (using [0]) and then search all its descendants again with find_all().
Now you can loop over each tweet and extract this information rather easily.
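
If it helps, the count lookup can be wrapped in a small function so it is easy to reuse inside a loop. This is only a sketch: action_count is a hypothetical helper name, the HTML is a trimmed copy of the retweet button from the question, and returning 0 when no number is shown is my assumption about what you would want.

```python
from bs4 import BeautifulSoup

html = """
<button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet" type="button">
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">4</span>
</span>
</div>
</button>
"""

soup = BeautifulSoup(html, "html.parser")


def action_count(button):
    """Return the number displayed inside a button, or 0 if none is shown."""
    span = button.select_one("span.ProfileTweet-actionCountForPresentation")
    text = span.get_text(strip=True) if span else ""
    return int(text) if text.isdigit() else 0


retweet_button = soup.find("button", class_="js-actionRetweet")
retweetcount = action_count(retweet_button)
print(retweetcount)
```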

How to extracts contents of a div tag containing a particular text using BeautifulSoup

I am new to BeautifulSoup and am looking to extract text from a list inside a div tag. This is the HTML:
<div class="contentBlurb">Description Synonyms
<ul class="definitionList">
<li>Awaiting bone marrow transplant</li>
</ul>
</div>
I would like to extract the text "Awaiting bone marrow transplant". This is the code I use now, which gives me an empty list:
for link in soup.findAll('div', text=re.compile('Description Synonyms ')):
    print(link)
Sorry for not adding this earlier: I do have other divs with the same class name. I am interested only in the Description Synonyms one. The other div is listed below.
<div class="contentBlurb">Applicable To
<ul class="definitionList">
<li>Patient waiting for organ availability</li>
</ul>
</div>

soup.find(text='...') doesn't work if there's other text or tags inside that tag.
Try:
[i.find('ul', {'class': "definitionList"}).find('li').text
 for i in soup.find_all('div', {'class': "contentBlurb"})
 if 'Description Synonyms' in i.text][0]
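
Another option is to search for the text node on its own and then climb up to its div, since the text filter only fails when it is combined with a tag that has several children. A minimal sketch, assuming a Beautiful Soup 4 release that accepts the string= argument (older releases call it text=); the variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

html = """
<div class="contentBlurb">Description Synonyms
<ul class="definitionList">
<li>Awaiting bone marrow transplant</li>
</ul>
</div>
<div class="contentBlurb">Applicable To
<ul class="definitionList">
<li>Patient waiting for organ availability</li>
</ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Match the text node itself, then move up to the enclosing div.
label = soup.find(string=re.compile("Description Synonyms"))
blurb = label.find_parent("div", class_="contentBlurb")

# Collect the li texts from that div only.
items = [li.get_text(strip=True) for li in blurb.select("ul.definitionList li")]
print(items)
```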

You can do this:
# coding: utf-8
from bs4 import BeautifulSoup
html = """
<div class="contentBlurb">Description Synonyms
<ul class="definitionList">
<li>Awaiting bone marrow transplant</li>
</ul>
</div>
<div class="contentBlurb">Applicable To
<ul class="definitionList">
<li>Patient waiting for organ availability</li>
</ul>
</div>
"""
# Name the parser explicitly so the result does not depend on what is installed.
souped = BeautifulSoup(html, 'html.parser')
# Keep only the divs whose text mentions the heading we care about.
matching_divs = [div for div in souped.find_all(
    'div', {'class': 'contentBlurb'}) if 'Description Synonyms' in div.getText()]
li_elements = []
matching_uls = []
for mdiv in matching_divs:
    matching_uls.extend(mdiv.findAll('ul', {'class': 'definitionList'}))
for muls in matching_uls:
    li_elements.extend(muls.findAll('li'))
for li in li_elements:
    print(li.getText())
EDIT: Updated for matching particular div.

Try this, changing the string in the if clause as required. The snippet below prints the li text if the text of the first contentBlurb div contains Description Synonyms:
val = soup.find('div', {'class': 'contentBlurb'}).text
if "Description Synonyms" in val:
    print(soup.find('div', {'class': 'contentBlurb'}).find('ul', {'class': 'definitionList'}).find('li').text)
