i got the code below
h = """<div class="SB-kickOffInfo">
<div class="SB-kickOff">
<div class="SB-kickOff" data-eventdatetime='05/17/2022 18:45:00'></div>
</div>
</div>"""
soup = BeautifulSoup(h)
#print(soup)
kick_off = soup.find(class_="SB-kickOffInfo").get('data-eventdatetime')
print(kick_off)
i want to extract the date but fro the code above am getting None, what should i change to extract the date?
Issue here is that the selected element do not have this attribute you are looking for directly, it is one of its children:
soup.find(class_="SB-kickOffInfo").find(attrs={"data-eventdatetime": True}).get('data-eventdatetime')
Here also a solution with css selectors:
soup.select_one('.SB-kickOffInfo [data-eventdatetime]').get('data-eventdatetime')
Output:
05/17/2022 18:45:00
Related
'''
<div class="kt-post-card__body>
<div class="kt-post-card__title">Example_1</div>
<div class="kt-post-card__description">Example_2</div>
<div class="kt-post-card__bottom">
<span class="kt-post-card__bottom-description kt-text-truncate" title="Example_3">Example_4</span>
</div>
</div>
'''
according to picture I attached, I want to extract all "kt-post-card__body" attrs and then from each one of them, extract:
("kt-post-card__title", "kt-post-card__description")
like a list.
I tried this:
ads = soup.find_all('div',{'class':'kt-post-card__body'})
but with ads[0].div I only access to "kt-post-card__title" while "kt-post-card__body" has three other sub tags like: "kt-post-card__description" and "kt-post-card__bottom" ... , why is that?
Cause your question is not that clear - To extract the classes:
for e in soup.select('.kt-post-card__body'):
print([c for t in e.find_all() for c in t.get('class')])
Output:
['kt-post-card__title', 'kt-post-card__description', 'kt-post-card__bottom', 'kt-post-card__bottom-description', 'kt-text-truncate']
To get the texts you also have to iterate your ResultSet and could access each elements text to fill your list or use stripped_strings.
Example
from bs4 import BeautifulSoup
html_doc='''
<div class="kt-post-card__body">
<div class="kt-post-card__title">Example_1</div>
<div class="kt-post-card__description">Example_2</div>
<div class="kt-post-card__bottom">
<span class="kt-post-card__bottom-description kt-text-truncate" title="Example_3">Example_4</span>
</div>
</div>
'''
soup = BeautifulSoup(html_doc)
for e in soup.select('.kt-post-card__body'):
data = [
e.select_one('.kt-post-card__title').text,
e.select_one('.kt-post-card__description').text
]
print(data)
Output:
['Example_1', 'Example_2']
or
print(list(e.stripped_strings))
Output:
['Example_1', 'Example_2', 'Example_4']
Try this:
ads = soup.find_all('div',{'class':'kt-post-card__body'})
ads[0]
I think you're getting only the first div because you called ads[0].div
<div class="player__AAtt">
<div>
<play-js data-account="1234567890" data-id="32667_32797">
</play-js>
</div>
</div>
I want to get the values of data-account and data-id which are elements within play-js tag.
elemname = driver.find_elements_by_xpath('xpath.../div/play-js')
I tried like below, but I couldn't get the value.
With javascript, I was able to import it with the code below.
var elems = document.querySelectorAll('.player__AAtt play-js');
console.log(elems[0].dataset.account)
console.log(elems[0].dataset.dataid)
How can I get the value of an element within a tag rather than the tag itself?
You can use the .get_attribute() method:
elemname = driver.find_elements_by_xpath('xpath.../div/play-js')
elemname.get_attribute("data-account")
In python we use BeautifulSoup mostly for parsing the html page. Here the code that will help you to get the value of all the play-js element in the provided html file.
from bs4 import BeautifulSoup
res_page = """
<div class="player__AAtt">
<div>
<play-js data-account="1234567890" data-id="32667_32797">
</play-js>
</div>
</div>
"""
soup = BeautifulSoup(res_page, 'html.parser')
output_list = soup.find_all('play-js')
data_account_list = [data_account['data-account'] for data_account in output_list]
data_id_list = [data_id['data-id'] for data_id in output_list]
print(data_account_list)
print(data_id_list)
The output is:
['1234567890']
['32667_32797']
I am trying to find an ID in a div class which has multiple values using BS4 the HTML is
<div class="size ">
<a class="selectSize"
id="25746"
data-ean="884751585616"
ata-test="170"
data-test1=""
data-test2="1061364-41"
data-test3-original="41"
data-test4-eu="41"
data-test5-uk="7"
data-test6-us="8"
data-test-cm="26"
</div>
</div>
I want to find data-test5-uk, my current code is soup =
bs(size.text,"html.parser")
sizes = soup.find_all("div",{"class":"size"})
size = sizes[0]["data-test5-uk"]
size.text is from a get request to the site with the html, however it returns
size = sizes[0]["data-test5-uk"]
File "C:\Users\ninja_000\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\element.py", line 1011, in __getitem__
return self.attrs[key]
KeyError: 'data-test5-uk'
Help is appreciated!
Explanation and then the solution.
.find_all('tag') is used to find all instances of that tag and we can later loop through them.
.find('tag') is used to find the ONLY first instance.
We can either extract the content of the argument with ['arg'] or ..get('arg') it is the SAME.
from bs4 import BeautifulSoup
html = '''<div class="size ">
<a class="selectSize"
id="25746"
data-ean="884751585616"
ata-test="170"
data-test1=""
data-test2="1061364-41"
data-test3-original="41"
data-test4-eu="41"
data-test5-uk="7"
data-test6-us="8"
data-test-cm="26"
</div>'''
soup = BeautifulSoup(html, 'lxml')
one_div = soup.find('div', class_='size ')
print( one_div.find('a')['data-test5-uk'])
# your code didn't work because you weren't in the a tag
# we have found the tag that contains the tag .find('a')['data-test5-uk']
# for multiple divs
for each in soup.find_all('div', class_='size '):
# we loop through each instance and do the same
datauk = each.find('a')['data-test5-uk']
print('data-test5-uk:', datauk)
Output:
data-test5-uk: 7
Additional
Why did your ['arg']? - You've tried to extract the ["data-test5-uk"] of the div. <div class="size "> the div has no arguments like that except one class="size "
Given the following element
<ul class="listing-key-specs ">
<li>2004 (54 reg)</li>
<li>Hatchback</li>
<li>90,274 miles</li>
<li>Manual</li>
<li>1.2L</li>
<li>60 bhp</li>
<li>Petrol</li>
</ul>
How do I extract each li element and assign it to a variable with beautiful soup?
Currently, my code looks like this:
detail = car.find('ul', {'class': 'listing-key-specs'}).get_text(strip=True)
and it produces the following output:
2005 (05 reg)Saloon66,038 milesManual1.8L118 bhpPetrol
Please refer to the following question for more context: "None" returned during scraping.
Check online DEMO
from bs4 import BeautifulSoup
html_doc="""
<ul class="listing-key-specs ">
<li>2004 (54 reg)</li>
<li>Hatchback</li>
<li>90,274 miles</li>
<li>Manual</li>
<li>1.2L</li>
<li>60 bhp</li>
<li>Petrol</li>
</ul>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
lst = [_.get_text(strip=True) for _ in soup.find('ul', {'class': 'listing-key-specs'}).find_all('li')]
print(lst)
Currently, you are calling get_text() on the ul tag, which simply returns all its contents as one string. So
<div>
<p>Hello </p>
<p>World </p>
</div>
would become Hello World.
To extract each matching sub tag and store them as seperate elements, use car.find_all(), like this.
tag_list = car.find_all('li', class_='listing-key-specs')
my_list = [i.get_text() for i in tag_list]
This will give you a list of all li tags inside the class 'listing-key-specs'. Now you're free to assign variables, eg. carType = my_list[1]
This is the ResultSet of running soup[0].find_all('div', {'class':'font-160 line-110'}):
[<div class="font-160 line-110" data-container=".snippet-container" data-html="true" data-placement="top" data-template='<div class="tooltip infowin-tooltip" role="tooltip"><div class="tooltip-arrow"><div class="tooltip-arrow-inner"></div></div><div class="tooltip-inner" style="text-align: left"></div></div>' data-toggle="tooltip" title="XIAMEN [CN]">
<a class="no-underline group-ib color-inherit"
href="/en/ais/details/ports/959">
<span class="text-default">CN</span><span class="text-default text-darker">XMN
</span>
</a>
</div>]
In an attempt to pull out XIAMEN [CN] after title I could not use a[0].find('div')['title] (where a is the above BeautifulSoup ResultSet). However, if I copy and paste that HTML as a new string, say,
b = '''<div class="font-160 line-110" data-container=".snippet container" data-html="true" data-placement="top" data-template='<div class="tooltip infowin-tooltip" role="tooltip"><div class="tooltip-arrow"><div class="tooltip-arrow-inner"></div></div><div class="tooltip-inner" style="text-align: left"></div></div>' data-toggle="tooltip" title="XIAMEN [CN]">'''
Then do:
>>soup = BeautifulSoup(b, 'html.parser')
>>soup.find('div')['title']
>>XIAMEN [CN] #prints contents of title
Why do I have to reSoup the Soup? Why doesn't this work on my first search?
Edit, origin of soup:
I have a list of urls that I'm going though via grequests. One of the things I'm looking for is that title that contains XIAMEN [CN].
So soup was created when I did
soup = []
with i in range(2) #number of pages parsed
rawSoup = BeautifulSoup(response[i].content, 'html.parser')
souporigin = rawSoup.find_all('div', {'class': 'bg-default bg-white no- snippet-hide'})
soup.append(souporigin)
The urls are
[
'http://www.marinetraffic.com/en/ais/details/ships/shipid:564352/imo:9643752/mmsi:511228000/vessel:DE%20MI',
'http://www.marinetraffic.com/en/ais/details/ships/shipid:3780155/imo:9712395/mmsi:477588800/vessel:SITC%20GUANGXI?cb=2267'
]
I found out the problem occurred when I set up my BeautifulSoup. I created a list of partial search results then had to iterate over the list to research it. I fixed this by just searching for what I wanted in on line:
I changed:
soup = []
with i in range(2) #number of pages parsed
rawSoup = BeautifulSoup(response[i].content, 'html.parser')
souporigin = rawSoup.find_all('div', {'class': 'bg-default bg-white no- snippet-hide'})
soup.append(souporigin)
to:
a = soup.find("div", class_='font-160 line-110')["title"]
And run this search as soon as I create my soup which removes a lot of redundancies in the code-- I had been creating lists of ResultSets and having to use find on them for new fields.
You use wrong selection.
Selection soup[0].find_all('div', {'class':'font-160 line-110'}) finds <div> and you can even see <div> when you print it. But when you add .find() it starts searching inside <div> - so .find('div') tries to find new div in current div
You need
a[0]['title']
When you create new soup then main/root element is not div but [document] and div is its child (div is inside main "tag") so you can use find('div').
>>> a[0].name
div
>>> soup = BeautifulSoup(b, 'html.parser')
>>> soup.name
[document]