Get the span class name using BeautifulSoup - python

I am using BeautifulSoup to scrape a website. The retrieved resultset looks like this:
<td><span class="I_Want_This_Class_Name"></span><span class="other_name">Text Is Here</span></td>
From here, I want to retrieve the class name "I_Want_This_Class_Name". I can get the "Text Is Here" part no problem, but the class name itself is proving to be difficult.
Is there a way to do this using BeautifulSoup resultset or do I need to convert to a dictionary?
Thank you

from bs4 import BeautifulSoup
doc = '''<td><span class="I_Want_This_Class_Name"></span><span class="other_name">Text Is Here</span></td>
'''
soup = BeautifulSoup(doc, 'html.parser')
res = soup.find('td')
out = {}
for each in res:
    if each.has_attr('class'):
        out[each['class'][0]] = each.text
print(out)
output will be like:
{'I_Want_This_Class_Name': '', 'other_name': 'Text Is Here'}

If you are trying to get the class name for this one result, then I would use the select method on your soup object, calling the class key:
foo_class = soup.select('td>span.I_Want_This_Class_Name')[0]['class'][0]
Note here that the select method does return a list, hence the indexing before the key.
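If you do not know the class name up front, a minimal sketch that reuses the soup object from above (and assumes the first span inside the td is the one you want):
first_span = soup.find('td').find('span')
class_name = first_span['class'][0]
print(class_name)  # I_Want_This_Class_Name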

Related

How do I have nested find_all statements in BeautifulSoup (Python)?

I started off by pulling the page with Selenium and I believe I passed the part of the page I needed to BeautifulSoup correctly using this code:
soup = BeautifulSoup(driver.find_element("xpath", '//*[@id="version_table"]/tbody').get_attribute('outerHTML'))
Now I can parse using BeautifulSoup
query = soup.find_all("tr", class_=lambda x: x != "hidden*")
print (query)
My problem is that I need to dig deeper than just this one query. For example, I would like to nest this one inside of the first (so the first needs to be true, and then this one):
query2 = soup.find_all("tr", id = "version_new_*")
print (query2)
Logically speaking, this is what I'm trying to do (but I get SyntaxError: invalid syntax):
query = soup.find_all(("tr", class_=lambda x: x != "hidden*") and ("tr", id = "version_new_*"))
print (query)
How do I accomplish this?
As mentioned, without a concrete example it is hard to give a precise answer; however, you could use a CSS selector:
soup.select('tr[id^="version_new_"]:not(.hidden)')
Example
from bs4 import BeautifulSoup
html = '''
<tr id="version_new_1" class="hidden"></tr>
<tr id="version_new_2"></tr>
<tr id="version_new_3" class="hidden"></tr>
<tr id="version_new_4"></tr>
'''
soup = BeautifulSoup(html, 'html.parser')
soup.select('tr[id^="version_new_"]:not(.hidden)')
Output
Will be a ResultSet you could iterate to scrape what you need.
[<tr id="version_new_2"></tr>, <tr id="version_new_4"></tr>]
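For example, a small sketch of that iteration, printing each matching row's id:
for row in soup.select('tr[id^="version_new_"]:not(.hidden)'):
    print(row.get('id'))  # version_new_2, version_new_4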
Regarding query = soup.find_all(...) and print(query): find_all returns a ResultSet, which you can simply loop over:
for query in soup.find_all(...):
    print(query)
You can use a lambda function (along with a regex) to apply more advanced conditions to every element:
import re

query = soup.find_all(
    lambda tag:
        tag.name == 'tr'
        and 'id' in tag.attrs and re.search(r'^version_new_', tag.attrs['id'])
        and 'hidden' not in tag.get('class', [])
)
print(list(query))
For every element in the html, we are checking...
If the tag is a table row (tr)
If the tag has an id and that id starts with version_new_
If the tag's class list (if any) does not contain hidden

How to get the href value of a link with bs4?

I need help extracting the URLs of all the 2020/21 matches from this website and scraping them: https://www.espncricinfo.com/series/ipl-2020-21-1210595/match-results
I am sending a request to this link.
The section of the HTML that I want to retrieve is this part:
Here's the code that I am using:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import urllib.parse
website = 'https://www.espncricinfo.com/series/ipl-2020-21-1210595/match-results'
response = requests.get(website)
soup = BeautifulSoup(response.content,'html.parser')
match_result = soup.find_all('a',{'class':'match-info-link-FIXTURES'});
soup.get('href')
url_part_1 = 'https://www.espncricinfo.com/'
url_part_2 = []
for item in match_result:
    url_part_2.append(item.get('href'))
url_joined = []
for link_2 in url_part_2:
    url_joined.append(urllib.parse.urljoin(url_part_1,link_2))
first_link = url_joined[0]
match_url = soup.find_all('div',{'class':'link-container border-bottom'});
soup.get('href')
url_part_3 = 'https://www.espncricinfo.com/'
url_part_4 = []
for item in match_result:
    url_part_4.append(item.get('href'))
print(url_part_4)
You don't need a second find_all('a', {'class': 'match-info-link-FIXTURES'}) call: in for item in match_result: you already have the tags that carry the hrefs.
You can get the href with item.get('href').
You can do:
url_part_1 = 'https://www.espncricinfo.com/'
url_part_2 = []
for item in match_result:
    url_part_2.append(item.get('href'))
The result will look something like:
['/series/ipl-2020-21-1210595/delhi-capitals-vs-mumbai-indians-final-1237181/full-scorecard',
'/series/ipl-2020-21-1210595/delhi-capitals-vs-sunrisers-hyderabad-qualifier-2-1237180/full-scorecard',
'/series/ipl-2020-21-1210595/royal-challengers-bangalore-vs-sunrisers-hyderabad-eliminator-1237178/full-scorecard',
'/series/ipl-2020-21-1210595/delhi-capitals-vs-mumbai-indians-qualifier-1-1237177/full-scorecard',
'/series/ipl-2020-21-1210595/sunrisers-hyderabad-vs-mumbai-indians-56th-match-1216495/full-scorecard',
...
]
From the official docs:
It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_.
Try
soup.find_all("a", class_="match-info-link-FIXTURES")

Which element to use in Selenium?

I want to find "Moderat" in <p class="text-spread-level">Moderat</p>
I have tried with id, name, xpath and link text.
Would you like to try this?
from bs4 import BeautifulSoup
import requests
sentences = []
res = requests.get(url) # assign your url in variable
soup = BeautifulSoup(res.text, "lxml")
tag_list = soup.select("p.text-spread-level")
for tag in tag_list:
    sentences.append(tag.text)
print(sentences)
Find the element by class name and get the text.
el=driver.find_element_by_class_name('text-spread-level')
val=el.text
print(val)
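Note that find_element_by_class_name was deprecated and later removed in Selenium 4; a sketch of the equivalent call there:
from selenium.webdriver.common.by import By

el = driver.find_element(By.CLASS_NAME, 'text-spread-level')
print(el.text)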

Check if a specific class present in HTML using beautifulsoup Python

I am writing a script and want to check if a particular class is present in html or not.
from bs4 import BeautifulSoup
import requests
def makesoup(u):
    page = requests.get(u)
    html = BeautifulSoup(page.content, "lxml")
    return html

html = makesoup('https://www.yelp.com/biz/soco-urban-lofts-dallas')
print("3 star", html.has_attr("i-stars i-stars--large-3 rating-very-large"))  # it's returning False
res = html.find('i-stars i-stars--large-3 rating-very-large')  # it's returning None
Please guide me on how I can resolve this issue. If I could somehow get the title (title="3.0 star rating"), that would also work for me. Screenshot of console HTML:
<div class="i-stars i-stars--large-3 rating-very-large" title="3.0 star rating">
<img class="offscreen" height="303" src="https://s3-media1.fl.yelpcdn.com/assets/srv0/yelp_design_web/8a6fc2d74183/assets/img/stars/stars.png" width="84" alt="3.0 star rating">
</div>
has_attr is a method that checks whether an element has a given attribute name. class is the attribute; i-stars i-stars--large-3 rating-very-large is its value, which is why has_attr returned False.
find expects a tag name and optional attribute filters, not a CSS selector or a bare class string, which is why it returned None. Since you are looking for a div with all of these classes, use a CSS selector with select_one: html.select_one('div.i-stars.i-stars--large-3.rating-very-large').
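A minimal sketch of that lookup, also pulling out the title mentioned in the question (assuming the Yelp page still serves the markup shown above):
res = html.select_one('div.i-stars.i-stars--large-3.rating-very-large')
if res is not None:
    print(res.get('title'))  # e.g. "3.0 star rating"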
Was having similar problems getting the exact classes. The tag's attributes can be read back as a dictionary, with the class value as a list, as follows.
html = '<div class="i-stars i-stars--large-3 rating-very-large" title="3.0 star rating">'
soup = BeautifulSoup(html, 'html.parser')
find = soup.div
classes = find.attrs['class']
c1 = find.attrs['class'][0]
print (classes, c1)
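To then check whether a specific class is present, a membership test on that list is enough (a small sketch reusing classes from above):
print('i-stars--large-3' in classes)  # True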
from bs4 import BeautifulSoup
import requests
def makesoup(u):
    page = requests.get(u)
    html = BeautifulSoup(page.content, "lxml")
    return html

html = makesoup('https://www.yelp.com/biz/soco-urban-lofts-dallas')
res = html.find(class_='i-stars i-stars--large-3 rating-very-large')
if res:
    print("3 star", 'whatever you want print')
out:
3 star whatever you want print

Can print but not return html table: "TypeError: ResultSet object is not an iterator"

Python newbie here. Python 2.7 with beautifulsoup 3.2.1.
I'm trying to scrape a table from a simple page. I can easily get it to print, but I can't get it to return to my view function.
The following works:
@app.route('/process')
def process():
    queryURL = 'http://example.com'
    br.open(queryURL)
    html = br.response().read()
    soup = BeautifulSoup(html)
    table = soup.find("table")
    print table
    return 'All good'
I can also return html successfully. But when I try to return table instead of return 'All good' I get the following error:
TypeError: ResultSet object is not an iterator
I also tried:
br.open(queryURL)
html = br.response().read()
soup = BeautifulSoup(html)
table = soup.find("table")
out = []
for row in table.findAll('tr'):
    colvals = [col.text for col in row.findAll('td')]
    out.append('\t'.join(colvals))
return table
With no success. Any suggestions?
You're trying to return an object rather than its text, so return table.text should be what you are looking for. Full modified code:
def process():
    queryURL = 'http://example.com'
    br.open(queryURL)
    html = br.response().read()
    soup = BeautifulSoup(html)
    table = soup.find("table")
    return table.text
EDIT:
Since I understand now that you want the HTML code that forms the site instead of the values, you can do something like this example I made:
import urllib
url = urllib.urlopen('http://www.xpn.org/events/concert-calendar')
htmldata = url.readlines()
url.close()
for tag in htmldata:
    if '<th' in tag:
        print tag
    if '<tr' in tag:
        print tag
    if '<thead' in tag:
        print tag
    if '<tbody' in tag:
        print tag
    if '<td' in tag:
        print tag
The reason I didn't reach for BeautifulSoup here is that BeautifulSoup is more for parsing the HTML and printing it in a nice-looking manner. You can just do what I did: have a for loop go through the HTML code and, if a tag is in the line, print it.
If you want to store the output in a list to use later, you would do something like:
htmlCodeList = []
for tag in htmldata:
    if '<th' in tag:
        htmlCodeList.append(tag)
    if '<tr' in tag:
        htmlCodeList.append(tag)
    if '<thead' in tag:
        htmlCodeList.append(tag)
    if '<tbody' in tag:
        htmlCodeList.append(tag)
    if '<td' in tag:
        htmlCodeList.append(tag)
This saves each matching HTML line as a new element of the list, so the first matching tag would be at index 0, the next at index 1, and so on.
After @Heinst pointed out that I was trying to return an Object and not a string, I also found a more elegant solution to convert the BeautifulSoup Object into a string and return it:
return str(table)
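For reference, a small sketch of the difference between the return styles discussed above (visible text vs. markup), assuming soup is the parsed page from the question:
table = soup.find("table")
print(table.text)        # just the visible cell text
print(str(table))        # the full <table>...</table> markup
print(table.prettify())  # the same markup, indented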
