python bs4 extract from attribute inside button class - python

I'm trying to get the value of an attribute using BeautifulSoup4.
replay_url_data = matchdatatr[1].findAll("button",{"class":"replay_button_super"})
This is how I collect the matching buttons.
Typing replay_url_data into the console returns:
<button class="replay_button_super" data-client-version="0.0.2.21" data-rel="spectatePopup" data-spectate-encryptionkey="bPPxpLIDmi0hRfU2U8B9Li1VJfTTx6pZ" data-spectate-endpoint="replays.cosmicradiance.com:80" data-spectate-gameid="4339075348" data-spectate-link="/api/spectate/UEJPNkN4MkIwUkZERUJ0MWUyZ3dDTmxGT25kanlUN2V6YnpuZUQ0bVlyMWRReGNDRXprZ1lQVnRnSkNHMG04Y2hUdVhxQm9abHFsQ2VBaTRaYVFPdnc9PQ==" data-spectate-platform="Modigu1" data-width="640"><i class="fa fa-play"></i>Replay</button>
What I want is the value of data-spectate-link.
I have tried every Google result I could find on similar topics, but nothing worked.
replay_url_split = replay_url_data[0].findAll("button",{"class":"data-spectate-link"})
This returns an empty list ("[]").
replay_url_data[0].find('data-spectate-platform')
This also finds nothing.
replay_url_data[0].find('button',attrs={'class' : 'data-spectate-link'})
And this one finds nothing either.
After 3 hours of searching on Google, nothing has helped so far and I'm getting desperate. I'm still new to Python and HTML, so please excuse my mistakes.

To get an attribute, use .attrs["data-spectate-link"] or index the tag directly with ["data-spectate-link"].
Example:
from bs4 import BeautifulSoup as BS
text = '<button class="replay_button_super" data-client-version="0.0.2.21" data-rel="spectatePopup" data-spectate-encryptionkey="bPPxpLIDmi0hRfU2U8B9Li1VJfTTx6pZ" data-spectate-endpoint="replays.cosmicradiance.com:80" data-spectate-gameid="4339075348" data-spectate-link="/api/spectate/UEJPNkN4MkIwUkZERUJ0MWUyZ3dDTmxGT25kanlUN2V6YnpuZUQ0bVlyMWRReGNDRXprZ1lQVnRnSkNHMG04Y2hUdVhxQm9abHFsQ2VBaTRaYVFPdnc9PQ==" data-spectate-platform="Modigu1" data-width="640"><i class="fa fa-play"></i>Replay</button>'
soup = BS(text, 'html.parser')
all_buttons = soup.findAll("button", {"class": "replay_button_super"})
one_button = all_buttons[0]
value = one_button["data-spectate-link"]
print(value)
value = one_button.attrs["data-spectate-link"]
print(value)
BTW: If you want to search for buttons that have a data-spectate-link attribute, the filter has to be
{"data-spectate-link": True}
not {"class": "data-spectate-link"}, which looks for a class with that name.
Example:
from bs4 import BeautifulSoup as BS
text = '''<button>Other button</button>
<button>Other button</button>
<button>Other button</button>
<button class="replay_button_super" data-client-version="0.0.2.21" data-rel="spectatePopup" data-spectate-encryptionkey="bPPxpLIDmi0hRfU2U8B9Li1VJfTTx6pZ" data-spectate-endpoint="replays.cosmicradiance.com:80" data-spectate-gameid="4339075348" data-spectate-link="/api/spectate/UEJPNkN4MkIwUkZERUJ0MWUyZ3dDTmxGT25kanlUN2V6YnpuZUQ0bVlyMWRReGNDRXprZ1lQVnRnSkNHMG04Y2hUdVhxQm9abHFsQ2VBaTRaYVFPdnc9PQ==" data-spectate-platform="Modigu1" data-width="640"><i class="fa fa-play"></i>Replay</button>
<button>Other button</button>
<button>Other button</button>'''
soup = BS(text, 'html.parser')
all_buttons = soup.findAll("button", {"data-spectate-link": True})
one_button = all_buttons[0]
value = one_button["data-spectate-link"]
print(value)

soup.button['data-spectate-link']
is what you want.
soup.button selects the first button tag in the soup. Then ['data-spectate-link'] reads that attribute from the tag, much like a dictionary lookup.
See the Beautiful Soup documentation on tag attributes.
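A minimal sketch of the shortcut described above, assuming a small made-up document with two buttons (the markup and the "/api/spectate/abc" value are illustrative, not taken from the question). It shows that soup.button only ever gives you the first button, so searching by attribute is safer when other buttons come first.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two buttons, only the second carries the data attribute.
html = """
<button>Other button</button>
<button class="replay_button_super" data-spectate-link="/api/spectate/abc">Replay</button>
"""
soup = BeautifulSoup(html, "html.parser")

# soup.button is a shortcut for soup.find("button"): it returns the first match only.
first = soup.button
same_tag = first is soup.find("button")

# The first button has no data-spectate-link, so first["data-spectate-link"]
# would raise KeyError here.
first_has_link = first.has_attr("data-spectate-link")

# Filtering on the attribute's presence finds the right button regardless of position.
target = soup.find("button", attrs={"data-spectate-link": True})
link = target["data-spectate-link"]
print(same_tag, first_has_link, link)
```

So the shortcut is fine when the button you want is the first one on the page, but not in general.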

data-spectate-link is an attribute. To get its value you need to use element['data-spectate-link'].
You can use either findAll() or the CSS selector method select():
replay_url_data = matchdatatr[1].findAll("button", attrs={"class": "replay_button_super", "data-spectate-link": True})
print(replay_url_data[0]['data-spectate-link'])
Or with a CSS selector:
replay_url_data = soup.select("button.replay_button_super[data-spectate-link]")
print(replay_url_data[0]['data-spectate-link'])
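Here is a self-contained sketch of both variants side by side, since the snippets above depend on variables from the question. The markup and the link value are made up for illustration. Both searches should select the same button, and select_one() is a convenient form when you only need the first match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same class on both buttons, but only one has the attribute.
html = """
<button class="replay_button_super">No link here</button>
<button class="replay_button_super" data-spectate-link="/api/spectate/abc">Replay</button>
"""
soup = BeautifulSoup(html, "html.parser")

# Variant 1: find_all with an attribute filter (True means "attribute is present").
via_find_all = soup.find_all(
    "button",
    attrs={"class": "replay_button_super", "data-spectate-link": True},
)

# Variant 2: the equivalent CSS selector.
via_select = soup.select("button.replay_button_super[data-spectate-link]")

links_a = [b["data-spectate-link"] for b in via_find_all]
links_b = [b["data-spectate-link"] for b in via_select]

# select_one returns the first match (or None) instead of a list.
first = soup.select_one("button[data-spectate-link]")
print(links_a, links_b, first["data-spectate-link"])
```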

Related

How to get the value of an element within a tag?

<div class="player__AAtt">
<div>
<play-js data-account="1234567890" data-id="32667_32797">
</play-js>
</div>
</div>
I want to get the values of data-account and data-id, which are attributes of the play-js tag.
I tried the following, but I couldn't get the values:
elemname = driver.find_elements_by_xpath('xpath.../div/play-js')
With JavaScript, I was able to read them with the code below.
var elems = document.querySelectorAll('.player__AAtt play-js');
console.log(elems[0].dataset.account)
console.log(elems[0].dataset.id)
How can I get the value of an attribute of a tag rather than the tag itself?
You can use the .get_attribute() method. Note that find_elements_by_xpath returns a list, so pick an element from it first:
elems = driver.find_elements_by_xpath('xpath.../div/play-js')
elems[0].get_attribute("data-account")
In Python, BeautifulSoup is commonly used for parsing HTML pages. The code below collects the attribute values of every play-js element in the provided HTML.
from bs4 import BeautifulSoup
res_page = """
<div class="player__AAtt">
<div>
<play-js data-account="1234567890" data-id="32667_32797">
</play-js>
</div>
</div>
"""
soup = BeautifulSoup(res_page, 'html.parser')
output_list = soup.find_all('play-js')
data_account_list = [data_account['data-account'] for data_account in output_list]
data_id_list = [data_id['data-id'] for data_id in output_list]
print(data_account_list)
print(data_id_list)
The output is:
['1234567890']
['32667_32797']
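If you prefer something closer to the JavaScript querySelectorAll call in the question, a CSS selector should work as well. This is a sketch under the assumption that the page source is already available as a string; the records name is just illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="player__AAtt">
  <div>
    <play-js data-account="1234567890" data-id="32667_32797"></play-js>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Same selector as in the JavaScript snippet: any play-js inside .player__AAtt
players = soup.select(".player__AAtt play-js")

# .get() returns None instead of raising if an attribute is missing.
records = [
    {"account": p.get("data-account"), "id": p.get("data-id")}
    for p in players
]
print(records)
```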

Using BeautifulSoup, how to select a tag without its children?

The html is as follows:
<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>
I'm trying to get all the divs and cast them into strings:
divs = [str(i) for i in soup.find_all('div')]
However, they'll have their children too:
>>> ["<div name='tag-i-want'><span>I don't want this</span></div>"]
What I'd like it to be is:
>>> ["<div name='tag-i-want'></div>"]
I figured there is unwrap() which would return this, but it modifies the soup as well; I'd like the soup to remain untouched.
With clear() you remove a tag's contents.
To leave the soup unaltered, you can either make a copy with the copy module or use a DIY approach. Here is an example with copy:
from bs4 import BeautifulSoup
import copy
html = """<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>"""
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div')
div_only = copy.copy(div)
div_only.clear()
print(div_only)
print(soup.find_all('span') != [])
Output
<div name="tag-i-want"></div>
True
Remark: a DIY approach, without copy.
Either use the Tag class:
from bs4 import BeautifulSoup, Tag
...
div_only = Tag(name='div', attrs=div.attrs)
or build the string yourself:
div_only = '<div {}></div>'.format(' '.join(map(lambda p: f'{p[0]}="{p[1]}"', div.attrs.items())))
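The Tag-based variant can be sketched as a complete script like this. The helper name shell_of is made up for illustration, and I'm using html.parser only to avoid the lxml dependency. It builds a fresh, childless tag from each div's name and attributes, so the original soup should stay untouched.

```python
from bs4 import BeautifulSoup, Tag

html = """<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>"""
soup = BeautifulSoup(html, "html.parser")


def shell_of(tag):
    """Return a new childless Tag with the same name and a copy of the attributes."""
    # dict(...) makes a shallow copy so later edits don't touch the original tag's dict.
    return Tag(name=tag.name, attrs=dict(tag.attrs))


shells = [str(shell_of(d)) for d in soup.find_all("div")]
print(shells)

# The span is still in the soup because nothing in the tree was modified.
span_still_there = soup.find("span") is not None
print(span_still_there)
```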
@cards pointed me in the right direction with copy(). This is what I ended up using:
from bs4 import BeautifulSoup
import copy
html = """<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>"""
soup = BeautifulSoup(html, 'lxml')
def remove_children(tag):
    tag.clear()
    return tag
divs = [str(remove_children(copy.copy(i))) for i in soup.find_all('div')]

Get values from CSS span element with constantly changing values

I am trying to scrape a website that seems to use different values each time a particular span element appears. For example, the first few times the span element appears, it could be:
<span title="PM XX">PM XX</span>
<span title="Star Charterist">Star Charterist</span>
<span title="Elephant Trainer">Elephant Trainer</span>
I have tried the following, but I keep getting empty lists:
site = BeautifulSoup(link.text, "html.parser")
jobs_a = site.find_all("span title")
or
jobs_a = site.find_all("span", attrs="title")
or
jobs_a = site.find_all("span", attrs="title*")
Any suggestions?
I prefer using a CSS selector.
from bs4 import BeautifulSoup
data = '''\
<span title="PM XX">PM XX</span>
<span title="Star Charterist">Star Charterist</span>
<span title="Elephant Trainer">Elephant Trainer</span>
'''
soup = BeautifulSoup(data, 'html.parser')
for s in soup.select('span[title]'):
    print(f"{s.text=}\t{s.attrs['title']=}")
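If you would rather stay with find_all(), passing title=True should do the same job: it matches any span that has a title attribute, whatever its value. A small sketch follows; the middle span without a title is added only to show that it gets skipped.

```python
from bs4 import BeautifulSoup

data = """
<span title="PM XX">PM XX</span>
<span>No title here</span>
<span title="Elephant Trainer">Elephant Trainer</span>
"""
soup = BeautifulSoup(data, "html.parser")

# title=True means "the attribute exists"; attrs={"title": True} is equivalent.
titled = soup.find_all("span", title=True)
titles = [s["title"] for s in titled]
print(titles)
```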

(Beautiful Soup) Get data inside a button tag

I am trying to scrape an imageId from inside a button tag. The result I want is:
"25511e1fd64e99acd991a22d6c2d6b6c".
When I try:
drawing_url = drawing_url.find_all('button', class_='inspectBut')['onclick']
it doesn't work and gives this error:
TypeError: list indices must be integers or slices, not str
Input =
for article in soup.find_all('div', class_='dojoxGridRow'):
    drawing_url = article.find('td', class_='dojoxGridCell', idx='3')
    drawing_url = drawing_url.find_all('button', class_='inspectBut')
    if drawing_url:
        for e in drawing_url:
            print(e)
Output =
<button class="inspectBut" href="#"
onclick="window.open('getImg?imageId=25511e1fd64e99acd991a22d6c2d6b6c&
timestamp=1552011572288','_blank', 'toolbar=0,
menubar=0, modal=yes, scrollbars=1, resizable=1,
height='+$(window).height()+', width='+$(window).width())"
title="Open Image" type="button">
</button>
...
...
Try this one.
import re
# for all the buttons that actually have an onclick attribute
btn_onclick_list = [a['onclick'] for a in soup.find_all('button', onclick=True)]
for click in btn_onclick_list:
    # take the first imageId found in each handler
    a = re.findall(r"imageId=(\w+)", click)[0]
    print(a)
You first need to check whether the attribute is present.
tag.attrs returns a dictionary of the attributes present on the current tag.
Consider the following code.
Code:
from bs4 import BeautifulSoup
a="""
<td>
<button class='hi' onclick="This Data">
<button class='hi' onclick="This Second">
</td>"""
soup = BeautifulSoup(a,'lxml')
print([btn['onclick'] for btn in soup.find_all('button',class_='hi') if 'onclick' in btn.attrs])
Output:
['This Data', 'This Second']
Or you can simply do this:
[btn['onclick'] for btn in soup.find_all('button', attrs={'class' : 'hi', 'onclick' : True})]
You should be searching for
button_list = soup.find_all('button', {'class': 'inspectBut'})
That gives you a list of buttons, and you can then read the onclick attribute, which holds the URL, with
[button['onclick'] for button in button_list]
You will still need to do some parsing to pull the imageId out of that string, but I hope this gets you on the right track.
Your mistake was indexing the list returned by find_all directly; the attribute has to be read from each button in that list.
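The remaining parsing step might look roughly like this. It is a sketch: the onclick value is a shortened version of the one in the question, and the second button without a handler is made up to show that it gets skipped. The pattern name IMAGE_ID is also just illustrative.

```python
import re
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the markup from the question.
html = """
<button class="inspectBut" href="#"
 onclick="window.open('getImg?imageId=25511e1fd64e99acd991a22d6c2d6b6c&amp;timestamp=1552011572288','_blank')"
 title="Open Image" type="button"></button>
<button class="inspectBut" title="No handler" type="button"></button>
"""
soup = BeautifulSoup(html, "html.parser")

# \w+ stops at the "&" that separates imageId from the next query parameter.
IMAGE_ID = re.compile(r"imageId=(\w+)")

image_ids = []
# onclick=True skips buttons that have no onclick attribute at all.
for button in soup.find_all("button", class_="inspectBut", onclick=True):
    match = IMAGE_ID.search(button["onclick"])
    if match:  # the handler might not contain an imageId
        image_ids.append(match.group(1))

print(image_ids)
```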

How to find an ID in a div class with multiple values BS4 Python

I am trying to find an ID in a div class which has multiple values using BS4. The HTML is:
<div class="size ">
<a class="selectSize"
id="25746"
data-ean="884751585616"
ata-test="170"
data-test1=""
data-test2="1061364-41"
data-test3-original="41"
data-test4-eu="41"
data-test5-uk="7"
data-test6-us="8"
data-test-cm="26"
</div>
</div>
I want to find data-test5-uk. My current code is:
soup = bs(size.text,"html.parser")
sizes = soup.find_all("div",{"class":"size"})
size = sizes[0]["data-test5-uk"]
size.text comes from a GET request to the site that serves this HTML. However, it returns:
size = sizes[0]["data-test5-uk"]
File "C:\Users\ninja_000\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\element.py", line 1011, in __getitem__
return self.attrs[key]
KeyError: 'data-test5-uk'
Help is appreciated!
Explanation first, then the solution.
.find_all('tag') finds all instances of that tag, which we can then loop through.
.find('tag') finds only the first instance.
We can read the value of an attribute with either ['attr'] or .get('attr'). They return the same value when the attribute exists; if it is missing, ['attr'] raises KeyError while .get('attr') returns None.
from bs4 import BeautifulSoup
html = '''<div class="size ">
<a class="selectSize"
id="25746"
data-ean="884751585616"
ata-test="170"
data-test1=""
data-test2="1061364-41"
data-test3-original="41"
data-test4-eu="41"
data-test5-uk="7"
data-test6-us="8"
data-test-cm="26"
</div>'''
soup = BeautifulSoup(html, 'lxml')
# match on the class name itself; the trailing space in class="size " is not part of it
one_div = soup.find('div', class_='size')
print(one_div.find('a')['data-test5-uk'])
# your code didn't work because you were indexing the div, not the a tag
# .find('a') steps into the tag that actually carries the attribute
# for multiple divs
for each in soup.find_all('div', class_='size'):
    # we loop through each instance and do the same
    datauk = each.find('a')['data-test5-uk']
    print('data-test5-uk:', datauk)
Output:
data-test5-uk: 7
Additional
Why did your ['arg'] lookup fail? You tried to read ["data-test5-uk"] from the div. <div class="size "> has no such attribute; its only attribute is class="size ".
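To make the difference concrete, here is a small sketch using a cleaned-up version of the question's markup. The a tag is properly closed and trimmed to a few attributes here, which is an assumption about what the real page looks like. It shows the lookup failing on the div and succeeding on the a tag inside it.

```python
from bs4 import BeautifulSoup

# Cleaned-up, shortened markup (assumption: the real <a> tag is well-formed).
html = """
<div class="size ">
  <a class="selectSize" id="25746" data-ean="884751585616"
     data-test5-uk="7" data-test6-us="8"></a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="size")

# The div itself does not carry the attribute: .get() returns None ...
on_div = div.get("data-test5-uk")

# ... and indexing raises KeyError, which is the error from the question.
try:
    div["data-test5-uk"]
    raised = False
except KeyError:
    raised = True

# The attribute lives on the <a> inside the div.
link = div.find("a", class_="selectSize")
uk_size = link["data-test5-uk"]

# The same lookup as a single CSS selector.
via_css = soup.select_one("div.size a.selectSize")["data-test5-uk"]
print(on_div, raised, uk_size, via_css)
```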
