Extract text from a class in HTML using CSS language - python

I have the following piece of HTML:
soup = <span class="posting-location go-to-posting">
Santa Gertrudes ,
<span> Tatuapé, São Paulo</span>
</span>
I know that, to access the "Tatuapé, São Paulo", I can use
soup.select_one('span')
However, how do I select "Santa Gertrudes , "?

I'm using BeautifulSoup to parse the HTML you provided.
Then I navigate the soup through the nested spans. Once I have the target element, I read its text.
soup.span.span.text
or
This finds all spans and selects the second one.
soup.find_all('span')[1]
I have this additional code before calling either of those.
from bs4 import BeautifulSoup
html = '<span class="posting-location go-to-posting">Santa Gertrudes , <span> Tatuapé, São Paulo</span></span>'
soup = BeautifulSoup(html, 'html.parser')
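Both expressions above return the text of the inner span. For the text that belongs to the outer span itself ("Santa Gertrudes , "), one option is to ask for the first text node that is a direct child of that span. This is a minimal sketch, assuming the markup looks exactly like the snippet in the question; the names outer and own_text are only illustrative.

```python
from bs4 import BeautifulSoup

html = ('<span class="posting-location go-to-posting">Santa Gertrudes , '
        '<span> Tatuapé, São Paulo</span></span>')
soup = BeautifulSoup(html, 'html.parser')

# Select the outer span by one of its classes.
outer = soup.select_one('span.posting-location')

# recursive=False restricts the search to direct children, so the text
# inside the nested <span> is not considered.
own_text = outer.find(string=True, recursive=False)

print(own_text.strip())  # Santa Gertrudes ,
```

If the outer span could start with a tag instead of text, own_text may be a different text node or None, so a real scraper would want to check for that.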

Related

Scraping <span> text</span> with BeautifulSoup and urllib

I want to scrape 2015 from below HTML:
I use the below code but am only able to scrape "Annee"
soup.find('span', {'class':'optionLabel'}).get_text()
Can someone please help?
I am a new learner.
Simply find the next span, which holds the text you want to scrape:
soup.find('span', {'class':'optionLabel'}).find_next('span').get_text()
Or use a CSS selector with the adjacent sibling combinator:
soup.select_one('span.optionLabel + span').get_text()
Example
html = '''
<span class="optionLabel"><button>Année</button></span> :
<span>2015</span>'''
from bs4 import BeautifulSoup
# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('span', {'class': 'optionLabel'}).find_next('span').get_text())
Output
2015

How Can I Get Information From An A Tag Between Two Span Tags in BeautifulSoup Using Python?

I am trying to get information from the <a> tag in between these two span tags
<span class="mentioned">
<a class="mentioned-123" onclick="information('123');" href="#28669">>>28669</a>
</span>
For example I would like to be able to get the value of the href in it. How can I do this?
You can look for the mentioned-123 class and then access the href with:
soup = BeautifulSoup(html, "html.parser")
print(soup.find("a", class_="mentioned-123")["href"])
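The snippet above assumes an html variable that already holds the page source. A self-contained sketch, using the fragment from the question as the input and a CSS selector as an alternative way to reach the same tag, could look like this (the variable names link and href are only illustrative):

```python
from bs4 import BeautifulSoup

html = """
<span class="mentioned">
<a class="mentioned-123" onclick="information('123');" href="#28669">>>28669</a>
</span>
"""
soup = BeautifulSoup(html, "html.parser")

# "span.mentioned > a" matches an <a> that is a direct child of the span.
link = soup.select_one("span.mentioned > a")

# Indexing raises KeyError if the attribute is missing;
# link.get("href") would return None instead.
href = link["href"]
print(href)  # #28669
```

Selecting through the parent span avoids depending on the numeric suffix in the mentioned-123 class, which may differ from link to link.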

How to scrape aria-label text in python?

I want to scrape the list of player names from a website, but the names are stored in labels. I don't know how to scrape the text of a label.
Here is the link
https://athletics.baruch.cuny.edu/sports/mens-swimming-and-diving/roster
For example, the page contains the HTML below.
How can I scrape the text from the labels?
<div class="sidearm-roster-player-image column">
<a data-bind="click: function() { return true; }, clickBubble: false" href="/sports/mens-swimming-and-diving/roster/gregory-becker/3555" aria-label="Gregory Becker - View Full Bio" title="View Full Bio">
<img class="lazyload" data-src="/images/2018/10/19/GREGORY_BECKER.jpg?width=80" alt="GREGORY BECKER">
</a>
</div>
You can use the .get() method in BeautifulSoup. First select your element into elem (or any other variable) using a selector or find/find_all. Then try:
print(elem.get('aria-label'))
Below is code that will help you extract the name from each a tag:
from bs4 import BeautifulSoup
with open("<path-to-html-file>") as fp:
    soup = BeautifulSoup(fp, 'html.parser')  # parse the HTML
tags = soup.find_all('a')  # get all the a tags
for tag in tags:
    print(tag.get('aria-label'))  # None for tags without an aria-label
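To end up with just the names, the label text still has to be trimmed. The sketch below runs on a shortened copy of the HTML from the question (the data-bind attribute is dropped for brevity) and assumes every label of interest ends with " - View Full Bio", as the one shown does; other pages may use a different suffix.

```python
from bs4 import BeautifulSoup

html = """
<div class="sidearm-roster-player-image column">
<a href="/sports/mens-swimming-and-diving/roster/gregory-becker/3555" aria-label="Gregory Becker - View Full Bio" title="View Full Bio">
<img class="lazyload" data-src="/images/2018/10/19/GREGORY_BECKER.jpg?width=80" alt="GREGORY BECKER">
</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

suffix = " - View Full Bio"  # assumed from the single example in the question
names = []
# a[aria-label] matches only <a> tags that actually carry the attribute.
for a in soup.select("a[aria-label]"):
    label = a["aria-label"]
    if label.endswith(suffix):
        names.append(label[:-len(suffix)])

print(names)  # ['Gregory Becker']
```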

How to find element based on text ignore child tags in beautifulsoup

I am looking for a solution using Python and BeautifulSoup to find an element based on the inside text. For example:
<div> <b>Ignore this text</b>Find based on this text </div>
How can I find this div? Thanks for your help!
You can use .find with the text argument and then use findParent to get the parent element.
Ex:
from bs4 import BeautifulSoup
s="""<div> <b>Ignore this text</b>Find based on this text </div>"""
soup = BeautifulSoup(s, 'html.parser')
t = soup.find(text="Find based on this text ")
print(t.findParent())
Output:
<div> <b>Ignore this text</b>Find based on this text </div>
Try this. It is similar to the example above, but it removes the b child tags before reading the text of the div:
from bs4 import BeautifulSoup
html="""
<div> <b>Ignore this text</b>Find based on this text </div>
"""
soup = BeautifulSoup(html, 'lxml')
s = soup.find('div')
for child in s.find_all('b'):
    child.decompose()
print(s.get_text())
Output
Find based on this text
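The first answer only matches when the string is identical to the text node, trailing space included. A more tolerant variant is to pass a function to find and test each div's direct text nodes for a substring. This is only a sketch; has_direct_text and target are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup

s = """<div> <b>Ignore this text</b>Find based on this text </div>"""
soup = BeautifulSoup(s, 'html.parser')

target = "Find based on this text"

def has_direct_text(tag):
    # Look only at text nodes that are direct children of a <div>,
    # so text inside child tags such as <b> is ignored.
    return tag.name == 'div' and any(
        target in text
        for text in tag.find_all(string=True, recursive=False)
    )

div = soup.find(has_direct_text)
print(div)
```

Because the check uses a substring test, surrounding whitespace in the document does not have to be reproduced exactly.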

Beautiful Soup - Cannot find the tags

The page is: http://item.taobao.com/item.htm?id=13015989524
you can see its source code.
In its source code the following code exists
<a href="http://item.taobao.com/item.htm?id=13015989524" target="_blank">
But when I use BeautifulSoup to read the source code and execute the following
soup.findAll('a', href="http://item.taobao.com/item.htm?id=13015989524")
It returns an empty list, []. Why does it return []?
As far as I can see, the <a> tag you are trying to find is inside a <textarea> tag. BS does not parse the contents of <textarea> as HTML, and rightly so since <textarea> should not contain HTML. In short, that page is doing something sketchy.
If you really need to get that, you might "cheat" and parse the contents of <textarea> again and search within them:
# Python 2 with BeautifulSoup 3
import urllib
from BeautifulSoup import BeautifulSoup as BS
soup = BS(urllib.urlopen("http://item.taobao.com/item.htm?id=13015989524"))
a = []
for textarea in soup.findAll("textarea"):
    textsoup = BS(textarea.text)  # parse the contents as html
    a.extend(textsoup.findAll("a", attrs={"href":"http://item.taobao.com/item.htm?id=13015989524"}))
for tag in a:
    print tag
# outputs
# <a href="http://item.taobao.com/item.htm?id=13015989524" target="_blank"><img ...
# <a href="http://item.taobao.com/item.htm?id=13015989524" title="901 ...
Use a dictionary to store the attribute:
soup.findAll('a', {
    'href': "http://item.taobao.com/item.htm?id=13015989524"
})
