Fetch href from class - selenium python

I tried extracting the href from:
<a lang="en" class="new class" href="/abc/stack.com"
tabindex="-1" data-type="itemTitles"><span><mark>Scott</mark>, CC042<br></span></a>
using elems = driver.find_elements_by_css_selector(".new class [href]"), but it doesn't seem to work.
I also tried the approach from "Python Selenium - get href value", but it returned an empty list.
I want to extract the href values of all elements with class="new class", as shown above, and append them to a list.
Thanks!!

Use .get_attribute('href').
By CSS selector:
elems = driver.find_elements_by_css_selector('.new.class')
for elem in elems:
    print(elem.get_attribute('href'))
Or by XPath:
elems = driver.find_elements_by_xpath('//a[@class="new class"]')

Just change it to
elems = driver.find_elements_by_css_selector(".new.class[href]")
OR
elems = driver.find_elements_by_css_selector("[class='new class'][href]")
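The difference between the failing selector and the two working ones can be checked without a browser. The sketch below is only a stand-in, not the Selenium call itself: it assumes Beautiful Soup is installed and parses the question's snippet with its select() method, which accepts the same CSS syntax. Beautiful Soup returns the raw attribute text, whereas a browser may resolve the href to an absolute URL.

```python
from bs4 import BeautifulSoup

# The anchor tag from the question, parsed offline (no browser involved).
html = '''<a lang="en" class="new class" href="/abc/stack.com"
tabindex="-1" data-type="itemTitles"><span><mark>Scott</mark>, CC042<br></span></a>'''
soup = BeautifulSoup(html, "html.parser")

# ".new class [href]" is read as: an element with class "new", containing a
# <class> element, containing an element that has an href. Nothing matches.
broken = soup.select(".new class [href]")

# ".new.class[href]" is one compound selector: both classes and an href
# on the same element.
compound = soup.select(".new.class[href]")

# Matching the full class attribute string works too.
exact = soup.select("[class='new class'][href]")

hrefs = [a["href"] for a in compound]
print(broken, hrefs)
```

The space is the important detail: in CSS it means "descendant of", so a class attribute containing a space is really two classes that have to be chained with dots.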

Related

python beautifulsoup4 how to get span text in div tag

This is the HTML code:
<div aria-label="RM 6,000 a month" class="salary-snippet"><span>RM 6,000 a month</span></div>
I used it like this:
divs = soup.find_all('div', class_='job_seen_beacon')
for item in divs:
    print(item.find('div', class_='salary-snippet'))
and got a list of results such as
<div aria-label="RM 3,500 to RM 8,000 a month" class="salary-snippet"><span>RM 3,500 - RM 8,000 a month</span></div>
If I use
print(item.find('div', class_='salary-snippet').text.strip())
it raises the error
AttributeError: 'NoneType' object has no attribute 'text'
So how can I get only the span text? It's my first time web scraping.
Maybe this is what you are looking for.
First, select all the <div> tags with class salary-snippet using .find_all(), since this is the parent of the <span> tag you are looking for.
Then iterate over the selected <div> tags and find the <span> in each one.
Based on your question, I assume that not all of these <div> tags contain a <span>. In that case, print the text only if the <div> contains a <span> tag. See below:
# Find all the divs
d = soup.find_all('div', class_='salary-snippet')
# Iterate over the <div> tags
for item in d:
    # Find the <span> in each item; x is None if it does not exist
    x = item.find('span')
    # Print only if x is not None
    if x:
        print(x.text.strip())
Here is the complete code.
from bs4 import BeautifulSoup
s = """<div aria-label="RM 6,000 a month" class="salary-snippet"><span>RM 6,000 a month</span></div>"""
soup = BeautifulSoup(s, 'lxml')
d = soup.find_all('div', class_='salary-snippet')
for item in d:
    x = item.find('span')
    if x:
        print(x.text.strip())
Output:
RM 6,000 a month
I believe the line should be:
print(item.find('div', {'class':'salary-snippet'}).text.strip())
Alternatively, if there is only the span you can simply use:
item.find("span").text.strip()
Since you used .find_all(), you might want to ensure that every div returned by
soup.find_all('div', class_='job_seen_beacon')
contains the element you are looking for, as this error can arise if even one of them doesn't.
For example:
divs = soup.find_all('div', class_='job_seen_beacon')
for item in divs:
    try:
        print(item.find('div', {'class':'salary-snippet'}).text.strip())
    except AttributeError:
        print("Item Not available")
This tries to get the text and, if that fails, prints a placeholder instead, so you can see which items failed and work out why; perhaps they don't contain the element you are searching for.
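The same guard can also be written as an explicit None check instead of try/except. This is a minimal sketch, assuming Beautiful Soup is installed; the two job cards in the HTML are made up for illustration, with the second one deliberately missing a salary.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the second job card has no salary-snippet div.
html = """
<div class="job_seen_beacon"><div class="salary-snippet"><span>RM 6,000 a month</span></div></div>
<div class="job_seen_beacon"><p>No salary listed</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

salaries = []
for card in soup.find_all("div", class_="job_seen_beacon"):
    snippet = card.find("div", class_="salary-snippet")
    # find() returns None when nothing matches, so test before reading text.
    salaries.append(snippet.get_text(strip=True) if snippet else None)

print(salaries)
```

Keeping None in the list preserves one entry per card, which helps if the salaries are later lined up with titles or companies scraped from the same cards.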

python bs4 extract from attribute inside button class

So I'm trying to get the value of an attribute using BeautifulSoup4.
replay_url_data = matchdatatr[1].findAll("button",{"class":"replay_button_super"})
This is how I get all my data into the object.
Typing replay_url_data into the console returns:
<button class="replay_button_super" data-client-version="0.0.2.21" data-rel="spectatePopup" data-spectate-encryptionkey="bPPxpLIDmi0hRfU2U8B9Li1VJfTTx6pZ" data-spectate-endpoint="replays.cosmicradiance.com:80" data-spectate-gameid="4339075348" data-spectate-link="/api/spectate/UEJPNkN4MkIwUkZERUJ0MWUyZ3dDTmxGT25kanlUN2V6YnpuZUQ0bVlyMWRReGNDRXprZ1lQVnRnSkNHMG04Y2hUdVhxQm9abHFsQ2VBaTRaYVFPdnc9PQ==" data-spectate-platform="Modigu1" data-width="640"><i class="fa fa-play"></i>Replay</button>
What I want is to get the value of data-spectate-link.
I have tried every Google result I found on similar topics, but nothing worked.
replay_url_split = replay_url_data[0].findAll("button",{"class":"data-spectate-link"})
This returns "[]" empty.
replay_url_data[0].find('data-spectate-platform')
This returns same result empty
replay_url_data[0].find('button',attrs={'class' : 'data-spectate-link'})
And this one also comes back empty.
After 3 hours of searching on Google, nothing has helped so far and I'm getting desperate. I'm still new to Python and HTML, so excuse my stupidity.
To get an attribute, use .attrs["data-spectate-link"] or index the tag directly with ["data-spectate-link"].
Example
from bs4 import BeautifulSoup as BS
text = '<button class="replay_button_super" data-client-version="0.0.2.21" data-rel="spectatePopup" data-spectate-encryptionkey="bPPxpLIDmi0hRfU2U8B9Li1VJfTTx6pZ" data-spectate-endpoint="replays.cosmicradiance.com:80" data-spectate-gameid="4339075348" data-spectate-link="/api/spectate/UEJPNkN4MkIwUkZERUJ0MWUyZ3dDTmxGT25kanlUN2V6YnpuZUQ0bVlyMWRReGNDRXprZ1lQVnRnSkNHMG04Y2hUdVhxQm9abHFsQ2VBaTRaYVFPdnc9PQ==" data-spectate-platform="Modigu1" data-width="640"><i class="fa fa-play"></i>Replay</button>'
soup = BS(text, 'html.parser')
all_buttons = soup.findAll("button", {"class": "replay_button_super"})
one_button = all_buttons[0]
value = one_button["data-spectate-link"]
print(value)
value = one_button.attrs["data-spectate-link"]
print(value)
BTW: If you want to search for buttons that have the attribute data-spectate-link, then you have to search with
{"data-spectate-link": True}
not {"class": "data-spectate-link"}
Example
from bs4 import BeautifulSoup as BS
text = '''<button>Other button</button>
<button>Other button</button>
<button>Other button</button>
<button class="replay_button_super" data-client-version="0.0.2.21" data-rel="spectatePopup" data-spectate-encryptionkey="bPPxpLIDmi0hRfU2U8B9Li1VJfTTx6pZ" data-spectate-endpoint="replays.cosmicradiance.com:80" data-spectate-gameid="4339075348" data-spectate-link="/api/spectate/UEJPNkN4MkIwUkZERUJ0MWUyZ3dDTmxGT25kanlUN2V6YnpuZUQ0bVlyMWRReGNDRXprZ1lQVnRnSkNHMG04Y2hUdVhxQm9abHFsQ2VBaTRaYVFPdnc9PQ==" data-spectate-platform="Modigu1" data-width="640"><i class="fa fa-play"></i>Replay</button>
<button>Other button</button>
<button>Other button</button>'''
soup = BS(text, 'html.parser')
all_buttons = soup.findAll("button", {"data-spectate-link": True})
one_button = all_buttons[0]
value = one_button["data-spectate-link"]
print(value)
soup.button['data-spectate-link']
is what you want.
soup.button selects the first <button> tag in the soup. Then ['data-spectate-link'] reads the value of that attribute from the tag.
docs here
data-spectate-link is an attribute. To get its value, use element['data-spectate-link'].
You can use either findAll() or the CSS selector method select():
replay_url_data = matchdatatr[1].findAll("button", attrs={"class": "replay_button_super", "data-spectate-link": True})
print(replay_url_data[0]['data-spectate-link'])
Or with a CSS selector:
replay_url_data = soup.select("button.replay_button_super[data-spectate-link]")
print(replay_url_data[0]['data-spectate-link'])
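When some buttons might lack the attribute, tag.get() is a gentler alternative to indexing, since indexing a missing attribute raises KeyError. A minimal sketch, assuming Beautiful Soup is installed; the shortened link value and the second button are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second button has no data-spectate-link attribute.
html = (
    '<button class="replay_button_super" '
    'data-spectate-link="/api/spectate/example">Replay</button>'
    '<button class="replay_button_super">No link</button>'
)
soup = BeautifulSoup(html, "html.parser")

buttons = soup.find_all("button", class_="replay_button_super")

# .get() returns None for a missing attribute instead of raising KeyError.
links = [button.get("data-spectate-link") for button in buttons]

# Filtering on the attribute up front skips buttons without it entirely.
with_link = soup.select("button.replay_button_super[data-spectate-link]")

print(links, len(with_link))
```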

(Beautiful Soup) Get data inside a button tag

I am trying to scrape an imageId from inside a button tag and want the result:
"25511e1fd64e99acd991a22d6c2d6b6c".
When I try:
drawing_url = drawing_url.find_all('button', class_='inspectBut')['onclick']
it doesn't work and raises an error:
TypeError: list indices must be integers or slices, not str
Input =
for article in soup.find_all('div', class_='dojoxGridRow'):
    drawing_url = article.find('td', class_='dojoxGridCell', idx='3')
    drawing_url = drawing_url.find_all('button', class_='inspectBut')
    if drawing_url:
        for e in drawing_url:
            print(e)
Output =
<button class="inspectBut" href="#"
onclick="window.open('getImg?imageId=25511e1fd64e99acd991a22d6c2d6b6c&
timestamp=1552011572288','_blank', 'toolbar=0,
menubar=0, modal=yes, scrollbars=1, resizable=1,
height='+$(window).height()+', width='+$(window).width())"
title="Open Image" type="button">
</button>
...
...
Try this one.
import re
# for all the buttons
btn_onclick_list = [a.get('onclick') for a in soup.find_all('button')]
for click in btn_onclick_list:
    # raw string so that \w reaches the regex engine unchanged
    a = re.findall(r"imageId=(\w+)", click)[0]
    print(a)
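The loop above breaks in two cases: a.get('onclick') returns None for a button without that attribute, which makes re.findall() raise TypeError, and indexing [0] raises IndexError when the pattern is not found. The sketch below wraps the same regex in a small helper that returns None in both cases. The helper name extract_image_id is made up for illustration, and the onclick string is a shortened copy of the one in the question.

```python
import re

# Same pattern as above, compiled once.
IMAGE_ID = re.compile(r"imageId=(\w+)")


def extract_image_id(onclick_value):
    """Return the imageId found in an onclick string, or None if absent."""
    if not onclick_value:
        # Covers buttons where .get('onclick') returned None.
        return None
    match = IMAGE_ID.search(onclick_value)
    return match.group(1) if match else None


# Shortened copy of the onclick value shown in the question.
onclick = (
    "window.open('getImg?imageId=25511e1fd64e99acd991a22d6c2d6b6c&"
    "timestamp=1552011572288','_blank', 'toolbar=0')"
)
image_id = extract_image_id(onclick)
print(image_id)
```

Because \w does not match "&", the capture stops right before the timestamp parameter.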
You first need to check whether the attribute is present or not.
tag.attrs returns a dictionary of the attributes present on the current tag.
Consider the following code.
Code:
from bs4 import BeautifulSoup
a="""
<td>
<button class='hi' onclick="This Data">
<button class='hi' onclick="This Second">
</td>"""
soup = BeautifulSoup(a,'lxml')
print([btn['onclick'] for btn in soup.find_all('button',class_='hi') if 'onclick' in btn.attrs])
Output:
['This Data', 'This Second']
or you can simply do this
[btn['onclick'] for btn in soup.find_all('button', attrs={'class' : 'hi', 'onclick' : True})]
You should be searching for
button_list = soup.find_all('button', {'class': 'inspectBut'})
That will give you a list of buttons, and you can then get the onclick value of each one with
[button['onclick'] for button in button_list]
You will still need to do some parsing to pull the imageId out of that string, but I hope this gets you on the right track.
Your mistake was indexing the list returned by find_all() with a string. Index each button instead; the getImg?imageId=... URL you want sits inside its onclick attribute.

Extracting li element and assigning it to variable with beautiful soup

Given the following element
<ul class="listing-key-specs ">
<li>2004 (54 reg)</li>
<li>Hatchback</li>
<li>90,274 miles</li>
<li>Manual</li>
<li>1.2L</li>
<li>60 bhp</li>
<li>Petrol</li>
</ul>
How do I extract each li element and assign it to a variable with beautiful soup?
Currently, my code looks like this:
detail = car.find('ul', {'class': 'listing-key-specs'}).get_text(strip=True)
and it produces the following output:
2005 (05 reg)Saloon66,038 milesManual1.8L118 bhpPetrol
Please refer to the following question for more context: "None" returned during scraping.
Check online DEMO
from bs4 import BeautifulSoup
html_doc="""
<ul class="listing-key-specs ">
<li>2004 (54 reg)</li>
<li>Hatchback</li>
<li>90,274 miles</li>
<li>Manual</li>
<li>1.2L</li>
<li>60 bhp</li>
<li>Petrol</li>
</ul>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
lst = [_.get_text(strip=True) for _ in soup.find('ul', {'class': 'listing-key-specs'}).find_all('li')]
print(lst)
Currently, you are calling get_text() on the ul tag, which simply returns all its contents as one string. So
<div>
<p>Hello </p>
<p>World </p>
</div>
would become Hello World.
To extract each matching sub-tag and store them as separate elements, use find_all(), like this:
tag_list = car.find('ul', class_='listing-key-specs').find_all('li')
my_list = [i.get_text() for i in tag_list]
This will give you a list with the text of every li tag inside the ul with class 'listing-key-specs'. Now you're free to assign variables, e.g. carType = my_list[1]
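If every listing has the same seven entries in the same order, the list can be unpacked straight into named variables. This is a minimal sketch under that assumption, with Beautiful Soup installed; the variable names are made up for illustration, and the unpacking raises ValueError for a listing with a different number of items.

```python
from bs4 import BeautifulSoup

html_doc = """
<ul class="listing-key-specs ">
<li>2004 (54 reg)</li>
<li>Hatchback</li>
<li>90,274 miles</li>
<li>Manual</li>
<li>1.2L</li>
<li>60 bhp</li>
<li>Petrol</li>
</ul>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# One string per <li> under the ul with class listing-key-specs.
specs = [li.get_text(strip=True) for li in soup.select("ul.listing-key-specs li")]

# Assumes exactly seven items in this order (hypothetical variable names).
year, body, mileage, gearbox, engine, power, fuel = specs
print(body, fuel)
```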

Python print scraped data with beautifulsoup without tags

<div class="number" title="Player number">1211</div>
<div class="shirt" title="sName">Ronaldo 1211</div>
I'm scraping a website. I've managed to print out the <div>. Here is my code:
web = urllib2.urlopen("WEBSITE")
soupit = BeautifulSoup(web, 'html.parser')
scrapeme = soupit.findAll("div", { "class" : "number" })
print scrapeme
prints out :
<div class="id" title="Player number">1211</div>
I want it to print just the 1211. How can I do it?
The get_text() method of any Beautiful Soup tag does exactly that. Since findAll() returns a list, call it on an element of the list:
print(scrapeme[0].get_text())
Once you have your list of elements, scrapeme, you can loop through each element in the list and print its text attribute using:
for element in scrapeme:
    print(element.text)
Since in your example the scrape generates a list scrapeme containing only one element, the output in this case will just be:
1211
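For reference, here is a Python 3 version of the same idea that collects the text into a list instead of printing inside the loop. It is a minimal sketch, assuming Beautiful Soup is installed, and it parses the two divs from the question directly rather than fetching the page.

```python
from bs4 import BeautifulSoup

# The two divs from the question, used in place of a downloaded page.
html = (
    '<div class="number" title="Player number">1211</div>\n'
    '<div class="shirt" title="sName">Ronaldo 1211</div>'
)
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list-like ResultSet; get_text() is called per tag.
numbers = [div.get_text(strip=True) for div in soup.find_all("div", class_="number")]
print(numbers)
```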
