I have some HTML that has the following structure:
<div class="article">
<h1 class="header">Birth Date between 1919-01-01 and 2019-01-01, Oscar-Winning, Oscar-Nominated, Males (Sorted by Popularity Ascending) </h1>
<br class="clear"/>
<div class="desc">
<span>1-100 of 716 names.</span> // I WANT THIS ELEMENT
<span class="ghost">|</span> <a class="lister-page-next next-page" href="/search/name?birth_date=1919-01-01,2019-01-01&groups=oscar_winner,oscar_nominee&gender=male&count=100&start=101&ref_=rlm">Next ยป</a>
</div>
<br class="clear"/>
</div>
Now I am trying to get a specific element out of this html with bs4. I tried to do:
webSoup = BeautifulSoup(html, 'html.parser')
nextUrl = webSoup.findChildren()[2][0]
but this gives me the following error:
return self.attrs[key]
KeyError: 0
So, to summarize my question:
How do I get a specific child at a certain index from an html document with bs4?
If you want the first match for span following class desc then you can use a css child combinator to pair the parent class with child element tag:
webSoup.select_one('.desc > span')
You can also choose to specify the parent must be a div
div.desc > span
If there is more than one match then use webSoup.select and then index into the list returned.
You can use:
nextUrl = webSoup.findChildren()[3].findChildren()[0]
print(nextUrl)
Related
i got the code below
h = """<div class="SB-kickOffInfo">
<div class="SB-kickOff">
<div class="SB-kickOff" data-eventdatetime='05/17/2022 18:45:00'></div>
</div>
</div>"""
soup = BeautifulSoup(h)
#print(soup)
kick_off = soup.find(class_="SB-kickOffInfo").get('data-eventdatetime')
print(kick_off)
i want to extract the date but fro the code above am getting None, what should i change to extract the date?
Issue here is that the selected element do not have this attribute you are looking for directly, it is one of its children:
soup.find(class_="SB-kickOffInfo").find(attrs={"data-eventdatetime": True}).get('data-eventdatetime')
Here also a solution with css selectors:
soup.select_one('.SB-kickOffInfo [data-eventdatetime]').get('data-eventdatetime')
Output:
05/17/2022 18:45:00
Question:
Can I group found elements by a div class they're in and store them in lists in a list.
Is that possible?
*So I did some further testing and as said. It seems like that even if you store one div in a variable and when trying to search in that stored div it searches the whole site content.
from selenium import webdriver
driver = webdriver.Chrome()
result_text = []
# Let's say this is the class of the different divs, I want to group it by
#class='a-fixed-right-grid a-spacing-top-medium'
# These are the texts from all divs around the page that I'm looking for but I can't say which one belongs in witch div
elements = driver.find_elements_by_xpath("//a[contains(#href, '/gp/product/')]")
for element in elements:
result_text.append(element.text)
print(result_text )
Current Result:
I'm already getting all the information I'm looking for from different divs around the page but I want it to be "grouped" by the topmost div.
['Text11', 'Text12', 'Text2', 'Text31', 'Text32']
Result I want to achieve:
The
text is grouped by the #class='a-fixed-right-grid a-spacing-top-medium'
[['Text11', 'Text12'], ['Text2'], ['Text31', 'Text32']]
HTML: (looks something like this)
class="a-text-center a-fixed-left-grid-col a-col-left" is the first one that wraps the group from there on we can use any div to group it. At least I think that.
</div>
</div>
</div>
</div>
<div class="a-fixed-right-grid a-spacing-top-medium"><div class="a-fixed-right-grid-inner a-grid-vertical-align a-grid-top">
<div class="a-fixed-right-grid-col a-col-left" style="padding-right:3.2%;float:left;">
<div class="a-row">
<div class="a-fixed-left-grid a-spacing-base"><div class="a-fixed-left-grid-inner" style="padding-left:100px">
<div class="a-text-center a-fixed-left-grid-col a-col-left" style="width:100px;margin-left:-100px;float:left;">
<div class="item-view-left-col-inner">
<a class="a-link-normal" href="/gp/product/B07YCW79/ref=ppx_yo_dt_b_asin_image_o0_s00?ie=UTF8&psc=1">
<img alt="" src="https://images-eu.ssl-images-amazon.com/images/I/41rcskoL._SY90_.jpg" aria-hidden="true" onload="if (typeof uet == 'function') { uet('cf'); uet('af'); }" class="yo-critical-feature" height="90" width="90" title="Same as the text I'm looking for" data-a-hires="https://images-eu.ssl-images-amazon.com/images/I/41rsxooL._SY180_.jpg">
</a>
</div>
</div>
<div class="a-fixed-left-grid-col a-col-right" style="padding-left:1.5%;float:left;">
<div class="a-row">
<a class="a-link-normal" href="/gp/product/B07YCR79/ref=ppx_yo_dt_b_asin_title_o00_s0?ie=UTF8&psc=1">
Text I'm looking for
</a>
</div>
<div class="a-row">
I don't have the link to test it on but this might work for you:
from selenium import webdriver
driver = webdriver.Chrome()
result_text = [[a.text for a in div.find_elements_by_xpath("//a[contains(#href, '/gp/product/')]")]
for div in driver.find_elements_by_class_name('a-fixed-right-grid')]
print(result_text)
EDIT: added alternative function:
# if that doesn't work try:
def get_results(selenium_driver, div_class, a_xpath):
div_list = []
for div in selenium_driver.find_elements_by_class_name(div_class):
a_list = []
for a in div.find_elements_by_xpath(a_xpath):
a_list.append(a.text)
div_list.append(a_list)
return div_list
get_results(driver,
div_class='a-fixed-right-grid'
a_xpath="//a[contains(#href, '/gp/product/')]")
If that doesn't work then maybe the xpath is returning EVERY matching element every time despite being called from the div, or another element has that same class name farther up the document
I am trying to find an ID in a div class which has multiple values using BS4 the HTML is
<div class="size ">
<a class="selectSize"
id="25746"
data-ean="884751585616"
ata-test="170"
data-test1=""
data-test2="1061364-41"
data-test3-original="41"
data-test4-eu="41"
data-test5-uk="7"
data-test6-us="8"
data-test-cm="26"
</div>
</div>
I want to find data-test5-uk, my current code is soup =
bs(size.text,"html.parser")
sizes = soup.find_all("div",{"class":"size"})
size = sizes[0]["data-test5-uk"]
size.text is from a get request to the site with the html, however it returns
size = sizes[0]["data-test5-uk"]
File "C:\Users\ninja_000\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\element.py", line 1011, in __getitem__
return self.attrs[key]
KeyError: 'data-test5-uk'
Help is appreciated!
Explanation and then the solution.
.find_all('tag') is used to find all instances of that tag and we can later loop through them.
.find('tag') is used to find the ONLY first instance.
We can either extract the content of the argument with ['arg'] or ..get('arg') it is the SAME.
from bs4 import BeautifulSoup
html = '''<div class="size ">
<a class="selectSize"
id="25746"
data-ean="884751585616"
ata-test="170"
data-test1=""
data-test2="1061364-41"
data-test3-original="41"
data-test4-eu="41"
data-test5-uk="7"
data-test6-us="8"
data-test-cm="26"
</div>'''
soup = BeautifulSoup(html, 'lxml')
one_div = soup.find('div', class_='size ')
print( one_div.find('a')['data-test5-uk'])
# your code didn't work because you weren't in the a tag
# we have found the tag that contains the tag .find('a')['data-test5-uk']
# for multiple divs
for each in soup.find_all('div', class_='size '):
# we loop through each instance and do the same
datauk = each.find('a')['data-test5-uk']
print('data-test5-uk:', datauk)
Output:
data-test5-uk: 7
Additional
Why did your ['arg']? - You've tried to extract the ["data-test5-uk"] of the div. <div class="size "> the div has no arguments like that except one class="size "
Okay so I'm currently using python beautifulsoup to output a specific line from a html file, since the html contains multiple of the same div class, it'll output every div containing the same class, example of this
CONTENT:
<div class=border>aaaa</a>
<div class=border>example</a>
<div class=border>runrunrun</a>
OUTPUT:
<div class=border>aaaa</a>
<div class=border>example</a>
<div class=border>runrunrun</a>
Now I only want #2 of div class border,
<div class=border>example</a>
now if i view source within chrome, it'll show content in number lines, so line 1 will contain
<div class=border>aaaa</a>
& line 2 will contain
<div class=border>example</a>
is it possible to output via numbered line using beautiful soup?
find_all returns a list, so you can index it with [1] to get the second element.
from bs4 import BeautifulSoup
html_doc = """<div class=border>aaaa</a>
<div class=border>example</a>
<div class=border>runrunrun</a>"""
soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all(class_="border")[1]
returns
<div class="border">example</div>
If you have the list with say 200 elements generated by soup.find_all... If the list is called div_list, you could just do an index loop (you want index 1,4,7 etc...)
count = 1
while True:
try:
print(div_list[count])
count+=3
except:
# happens because of index error
break
Or even shorter:
count = 1
while count<= len(div_list):
print(div_list[count])
count+=3
The required value is present within the div tag:
<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>
I am using the below code to fetch the value "Rs. 350":
soup.select('div.search-page-text'):
But in the output i get "None". Could you pls help me resolve this issue?
An element with both a sub-element and string content can be accessed using strippe_strings:
from bs4 import BeautifulSoup
h = """<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>"""
soup = BeautifulSoup(h)
for s in soup.select("div.search-page-text")[0].stripped_strings:
print(s)
Output:
Cost for 2:
Rs. 350
The problem is that this includes both the strong content of the span and the div. But if you know that the div first contains the span with text, you could get the intersting string as
list(soup.select("div.search-page-text")[0].stripped_strings)[1]
If you know you only ever want the string that is the immediate text of the <div> tag and not the <span> child element, you could do this.
from bs4 import BeautifulSoup
txt = '''<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>'''
soup = BeautifulSoup(txt)
for div in soup.find_all("div", { "class" : "search-page-text" }):
print ''.join(div.find_all(text=True, recursive=False)).strip()
#print div.find_all(text=True, recursive=False)[1].strip()
One of the lines returned by div.find_all is just a newline. That could be handled in a variety of ways. I chose to join and strip it rather than rely on the text being at a certain index (see commented line) in the resultant list.
Python 3
For python 3 the print line should be
print (''.join(div.find_all(text=True, recursive=False)).strip())