I have the following HTML code:
<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
<div class="row">aaa</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
...
</fieldset>
</div>
I'm trying to display only the text of all rows in the fieldset whose legend is BBB (in this example: bbb, bbb, bbb).
So far I have written the code below, but it isn't pretty, and I don't know how to find all of the rows:
bs = BeautifulSoup(request.text, 'html.parser')
if bs.find('legend', text='BBB'):
    value = bs.find('legend').next_element.next_element.next_element.get_text().strip()
    print(value)
Is there a simple way to do this? The div class name is always the same; only the legend text varies.
I added a <legend>CCC</legend> block so that you can see the approach scales.
html = """<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
<div class="row">aaa</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>CCC</legend>
<div class="row">ccc</div>
<div class="row">ccc</div>
<div class="row">ccc</div>
...
</fieldset>
</div>"""
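from bs4 import BeautifulSoup

bs = BeautifulSoup(html, "html.parser")
after_tag = bs.find("legend", text="BBB").parent  # The legend's parent is its <fieldset>.
divs = after_tag.find_all("div", {"class": "row"})  # Finds all row divs inside that fieldset.
for div in divs:
    print(div.text)
Output: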
bbb
bbb
bbb
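Since only the legend text varies, the same idea can be wrapped in a small helper. The following is a minimal sketch, not the answer's original code: the function name rows_for_legend and the shortened sample markup are assumptions made for illustration. It uses the string= argument (the newer spelling of text= in Beautiful Soup) and returns an empty list when no matching legend exists.

```python
from bs4 import BeautifulSoup

# Shortened sample markup modelled on the question (assumption: same structure).
html = """
<div class="1"><fieldset>
  <legend>AAA</legend>
  <div class="row">aaa</div>
  <div class="row">aaa</div>
</fieldset></div>
<div class="1"><fieldset>
  <legend>BBB</legend>
  <div class="row">bbb</div>
  <div class="row">bbb</div>
  <div class="row">bbb</div>
</fieldset></div>
"""


def rows_for_legend(markup, legend_text):
    """Return the text of every div.row in the fieldset whose legend matches."""
    soup = BeautifulSoup(markup, "html.parser")
    legend = soup.find("legend", string=legend_text)
    if legend is None:
        return []  # No fieldset carries this legend.
    fieldset = legend.find_parent("fieldset")
    if fieldset is None:
        return []  # Legend exists but is not inside a fieldset.
    return [row.get_text(strip=True)
            for row in fieldset.find_all("div", class_="row")]


print(rows_for_legend(html, "BBB"))
print(rows_for_legend(html, "ZZZ"))
```

Guarding against a missing legend avoids the AttributeError that calling .parent on None would raise.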
from bs4 import BeautifulSoup
html = """
<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
<div class="row">aaa</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
...
</fieldset>
</div>
"""
soup = BeautifulSoup(html, features='html.parser')
elements = soup.select('div > fieldset')[1]
tuple_obj = ()
for row in elements.select('div.row'):
    tuple_obj = tuple_obj + (row.text,)
print(tuple_obj)
The tuple object prints out:
('bbb', 'bbb', 'bbb')
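Selecting the fieldset by position ([1]) only works while the order of the blocks stays fixed. As an alternative sketch, one could build a lookup keyed by legend text; the name rows_by_legend and the shortened markup are assumptions for illustration, not part of the answer above.

```python
from bs4 import BeautifulSoup

# Shortened sample markup modelled on the question (assumption: same structure).
html = """
<div class="1"><fieldset>
  <legend>AAA</legend>
  <div class="row">aaa</div>
  <div class="row">aaa</div>
  <div class="row">aaa</div>
</fieldset></div>
<div class="1"><fieldset>
  <legend>BBB</legend>
  <div class="row">bbb</div>
  <div class="row">bbb</div>
  <div class="row">bbb</div>
</fieldset></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Map each legend's text to a tuple of the row texts in the same fieldset.
rows_by_legend = {}
for fieldset in soup.select("fieldset"):
    legend = fieldset.find("legend")
    if legend is None:
        continue  # Skip fieldsets without a legend.
    rows_by_legend[legend.get_text(strip=True)] = tuple(
        row.get_text(strip=True) for row in fieldset.select("div.row")
    )

print(rows_by_legend["BBB"])
```

With the lookup in place, any legend can be queried by name without knowing where its block appears in the page.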
I have been trying to figure out how to scrape a buy/sell website, and I have reached the point where I can find everything I need in the HTML, but the class names contain seemingly random suffixes, such as:
<div aria-label="Adidas NMD x Bape" class="styled__Wrapper-sc-1kpvi4z-0 eDiSuB" to="/annons/skane/adidas_nmd_x_bape/87267675">
<article class="styled__Article-sc-1kpvi4z-1 hbWRzz">
<div class="styled__ImageWrapper-sc-1kpvi4z-4 kxhCJn">
<div class="ListImage__Wrapper-sc-1rp77jc-0 cvipJS"><img alt="Adidas NMD x Bape" class="ListImage__StyledImg-sc-1rp77jc-1 iwClwW" sizes="
(min-width: 768px) 180px,
120px
" src="https://cdn.blocket.com/pictures/1692451915.jpg?type=gallery_big" srcset="
https://cdn.blocket.com/pictures/1692451915.jpg?type=thumb 120w,
https://cdn.blocket.com/pictures/1692451915.jpg?type=gallery_big 180w,
https://cdn.blocket.com/pictures/1692451915.jpg?type=mob_iphone_vi_normal 240w,
https://cdn.blocket.com/pictures/1692451915.jpg?type=store_presentation 360w,
https://cdn.blocket.com/pictures/1692451915.jpg?type=mob_iphone_vi_normal_retina 540w,
" /></div>
</div>
<div class="styled__Content-sc-1kpvi4z-2 dwtNsH">
<div class="styled__LocationTimeWrapper-sc-1kpvi4z-17 dvvNDw">
<div class="styled__SubjectSymbol-sc-1kpvi4z-11 cbBbUz"></div>
<p class="styled__TopInfoWrapper-sc-1kpvi4z-22 kEcJNb"><a class="Link-sc-139ww1j-0 TopInfoLink__StyledLink-lzfj8j-0 bjnLor" href="/annonser/hela_sverige/personligt/klader_skor?cg=4080&q=bape&st=s">Kläder & skor</a> · <a class="Link-sc-139ww1j-0 TopInfoLink__StyledLink-lzfj8j-0 bjnLor" href="/annonser/skane/personligt/klader_skor?cg=4080&q=bape&r=23&st=s">Skåne</a></p>
<p class="styled__Time-sc-1kpvi4z-18 bGSnhf">Idag 14:06</p>
</div>
<div class="styled__SubjectWrapper-sc-1kpvi4z-10 kZyTSM">
<h2 class="TextSubHeading__TextSubHeadingWrapper-sc-1ilszdp-0 jIvScq styled__StyledTitle-sc-1kpvi4z-6 bSElwy"><a class="Link-sc-139ww1j-0 styled__StyledTitleLink-sc-1kpvi4z-7 edlhAW" href="/annons/skane/adidas_nmd_x_bape/87267675">Adidas NMD x Bape</a></h2></div>
<div class="styled__ParamsWrapper-sc-1kpvi4z-13 cRZIFG"></div>
<div class="styled__SalesInfo-sc-1kpvi4z-20 bbHjGJ">
<div class="TextSubHeading__TextSubHeadingWrapper-sc-1ilszdp-0 jIvScq Price__Wrapper-sc-1v2maoc-0 heunWX"><span>3 000 kr<div class="TextCallout2__TextCallout2Wrapper-sc-19qvftl-0 eERYUj Price__StyledVatPrice-sc-1v2maoc-1 hMWxAJ"></div></span></div>
</div>
</div>
</article>
</div>
I can see all the values I am looking for, such as:
Adidas NMD x Bape
3 000 kr
Skåne
/annons/skane/adidas_nmd_x_bape/87267675
https://cdn.blocket.com/pictures/1692451915.jpg
I have a fair amount of experience with Beautiful Soup and basic scraping, but this is beyond what I know. What tips can you give me for scraping the values I am looking for?
Update:
test = eachPart.select_one('h2[class^="TextSubHeading__TextSubHeadingWrapper"] >a').text
print(test)
print(eachPart.select_one('[aria-label="{}"] img[alt="{}"]'.format(test, test))['src'])
print(eachPart.select_one('h2[class^="TextSubHeading__TextSubHeadingWrapper"] >a')['href'])
print(eachPart.select_one('div[class^="TextSubHeading__TextSubHeadingWrapper"] >span').text)
for link in eachPart.select('p[class^="styled__TopInfoWrapper"] a')[1:]:
    print(link.text)
Identify the parent tag first, then find the child tags inside it.
CSS selectors are the most convenient way to do this.
from bs4 import BeautifulSoup
html='''<div aria-label="Adidas NMD x Bape" class="styled__Wrapper-sc-1kpvi4z-0 eDiSuB" to="/annons/skane/adidas_nmd_x_bape/87267675">
<article class="styled__Article-sc-1kpvi4z-1 hbWRzz">
<div class="styled__ImageWrapper-sc-1kpvi4z-4 kxhCJn">
<div class="ListImage__Wrapper-sc-1rp77jc-0 cvipJS"><img alt="Adidas NMD x Bape" class="ListImage__StyledImg-sc-1rp77jc-1 iwClwW" sizes="
(min-width: 768px) 180px,
120px
" src="https://cdn.blocket.com/pictures/1692451915.jpg?type=gallery_big" srcset="
https://cdn.blocket.com/pictures/1692451915.jpg?type=thumb 120w,
https://cdn.blocket.com/pictures/1692451915.jpg?type=gallery_big 180w,
https://cdn.blocket.com/pictures/1692451915.jpg?type=mob_iphone_vi_normal 240w,
https://cdn.blocket.com/pictures/1692451915.jpg?type=store_presentation 360w,
https://cdn.blocket.com/pictures/1692451915.jpg?type=mob_iphone_vi_normal_retina 540w,
" /></div>
</div>
<div class="styled__Content-sc-1kpvi4z-2 dwtNsH">
<div class="styled__LocationTimeWrapper-sc-1kpvi4z-17 dvvNDw">
<div class="styled__SubjectSymbol-sc-1kpvi4z-11 cbBbUz"></div>
<p class="styled__TopInfoWrapper-sc-1kpvi4z-22 kEcJNb"><a class="Link-sc-139ww1j-0 TopInfoLink__StyledLink-lzfj8j-0 bjnLor" href="/annonser/hela_sverige/personligt/klader_skor?cg=4080&q=bape&st=s">Kläder & skor</a> · <a class="Link-sc-139ww1j-0 TopInfoLink__StyledLink-lzfj8j-0 bjnLor" href="/annonser/skane/personligt/klader_skor?cg=4080&q=bape&r=23&st=s">Skåne</a></p>
<p class="styled__Time-sc-1kpvi4z-18 bGSnhf">Idag 14:06</p>
</div>
<div class="styled__SubjectWrapper-sc-1kpvi4z-10 kZyTSM">
<h2 class="TextSubHeading__TextSubHeadingWrapper-sc-1ilszdp-0 jIvScq styled__StyledTitle-sc-1kpvi4z-6 bSElwy"><a class="Link-sc-139ww1j-0 styled__StyledTitleLink-sc-1kpvi4z-7 edlhAW" href="/annons/skane/adidas_nmd_x_bape/87267675">Adidas NMD x Bape</a></h2></div>
<div class="styled__ParamsWrapper-sc-1kpvi4z-13 cRZIFG"></div>
<div class="styled__SalesInfo-sc-1kpvi4z-20 bbHjGJ">
<div class="TextSubHeading__TextSubHeadingWrapper-sc-1ilszdp-0 jIvScq Price__Wrapper-sc-1v2maoc-0 heunWX"><span>3 000 kr<div class="TextCallout2__TextCallout2Wrapper-sc-19qvftl-0 eERYUj Price__StyledVatPrice-sc-1v2maoc-1 hMWxAJ"></div></span></div>
</div>
</div>
</article>
</div>'''
soup=BeautifulSoup(html,"html.parser")
print(soup.select_one('[aria-label="Adidas NMD x Bape"] img[alt="Adidas NMD x Bape"]')['src'])
print(soup.select_one('[aria-label="Adidas NMD x Bape"] h2[class^="TextSubHeading__TextSubHeadingWrapper"] >a').text)
print(soup.select_one('[aria-label="Adidas NMD x Bape"] h2[class^="TextSubHeading__TextSubHeadingWrapper"] >a')['href'])
print(soup.select_one('[aria-label="Adidas NMD x Bape"] div[class^="TextSubHeading__TextSubHeadingWrapper"] >span').text)
Output:
https://cdn.blocket.com/pictures/1692451915.jpg?type=gallery_big
Adidas NMD x Bape
/annons/skane/adidas_nmd_x_bape/87267675
3 000 kr
EDIT: to avoid hard-coding the aria-label value, match on the class-name prefix instead:
from bs4 import BeautifulSoup
html='''<div aria-label="Adidas NMD x Bape" class="styled__Wrapper-sc-1kpvi4z-0 eDiSuB" to="/annons/skane/adidas_nmd_x_bape/87267675">
<article class="styled__Article-sc-1kpvi4z-1 hbWRzz">
<div class="styled__ImageWrapper-sc-1kpvi4z-4 kxhCJn">
<div class="ListImage__Wrapper-sc-1rp77jc-0 cvipJS"><img alt="Adidas NMD x Bape" class="ListImage__StyledImg-sc-1rp77jc-1 iwClwW" sizes="
(min-width: 768px) 180px,
120px
" src="https://cdn.blocket.com/pictures/1692451915.jpg?type=gallery_big" srcset="
https://cdn.blocket.com/pictures/1692451915.jpg?type=thumb 120w,
https://cdn.blocket.com/pictures/1692451915.jpg?type=gallery_big 180w,
https://cdn.blocket.com/pictures/1692451915.jpg?type=mob_iphone_vi_normal 240w,
https://cdn.blocket.com/pictures/1692451915.jpg?type=store_presentation 360w,
https://cdn.blocket.com/pictures/1692451915.jpg?type=mob_iphone_vi_normal_retina 540w,
" /></div>
</div>
<div class="styled__Content-sc-1kpvi4z-2 dwtNsH">
<div class="styled__LocationTimeWrapper-sc-1kpvi4z-17 dvvNDw">
<div class="styled__SubjectSymbol-sc-1kpvi4z-11 cbBbUz"></div>
<p class="styled__TopInfoWrapper-sc-1kpvi4z-22 kEcJNb"><a class="Link-sc-139ww1j-0 TopInfoLink__StyledLink-lzfj8j-0 bjnLor" href="/annonser/hela_sverige/personligt/klader_skor?cg=4080&q=bape&st=s">Kläder & skor</a> · <a class="Link-sc-139ww1j-0 TopInfoLink__StyledLink-lzfj8j-0 bjnLor" href="/annonser/skane/personligt/klader_skor?cg=4080&q=bape&r=23&st=s">Skåne</a></p>
<p class="styled__Time-sc-1kpvi4z-18 bGSnhf">Idag 14:06</p>
</div>
<div class="styled__SubjectWrapper-sc-1kpvi4z-10 kZyTSM">
<h2 class="TextSubHeading__TextSubHeadingWrapper-sc-1ilszdp-0 jIvScq styled__StyledTitle-sc-1kpvi4z-6 bSElwy"><a class="Link-sc-139ww1j-0 styled__StyledTitleLink-sc-1kpvi4z-7 edlhAW" href="/annons/skane/adidas_nmd_x_bape/87267675">Adidas NMD x Bape</a></h2></div>
<div class="styled__ParamsWrapper-sc-1kpvi4z-13 cRZIFG"></div>
<div class="styled__SalesInfo-sc-1kpvi4z-20 bbHjGJ">
<div class="TextSubHeading__TextSubHeadingWrapper-sc-1ilszdp-0 jIvScq Price__Wrapper-sc-1v2maoc-0 heunWX"><span>3 000 kr<div class="TextCallout2__TextCallout2Wrapper-sc-19qvftl-0 eERYUj Price__StyledVatPrice-sc-1v2maoc-1 hMWxAJ"></div></span></div>
</div>
</div>
</article>
</div>'''
soup=BeautifulSoup(html,"html.parser")
print(soup.select_one('[class^="styled__Wrapper-sc-"] img[class^="ListImage__StyledImg-sc-"]')['src'])
print(soup.select_one('[class^="styled__Wrapper-sc-"] h2[class^="TextSubHeading__TextSubHeadingWrapper"] >a').text)
print(soup.select_one('[class^="styled__Wrapper-sc-"] h2[class^="TextSubHeading__TextSubHeadingWrapper"] >a')['href'])
print(soup.select_one('[class^="styled__Wrapper-sc-"] div[class^="TextSubHeading__TextSubHeadingWrapper"] >span').text)
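To handle a page with many ads, the same prefix selectors can be applied per card rather than to the whole document. The sketch below is one possible way to do that, under stated assumptions: the markup is a trimmed-down version of the card in the question (link query strings removed), parse_card is a hypothetical helper name, and every card is assumed to contain all five elements. The price is matched with a substring selector ([class*=...]) on Price__Wrapper, which is a choice made here rather than something taken from the answer.

```python
from bs4 import BeautifulSoup

# Trimmed-down version of the ad card from the question (assumption: the
# real page repeats this structure once per ad).
html = '''
<div aria-label="Adidas NMD x Bape" class="styled__Wrapper-sc-1kpvi4z-0 eDiSuB">
  <article class="styled__Article-sc-1kpvi4z-1 hbWRzz">
    <img alt="Adidas NMD x Bape" class="ListImage__StyledImg-sc-1rp77jc-1 iwClwW"
         src="https://cdn.blocket.com/pictures/1692451915.jpg?type=gallery_big" />
    <p class="styled__TopInfoWrapper-sc-1kpvi4z-22 kEcJNb">
      <a class="Link-sc-139ww1j-0" href="/annonser/hela_sverige/personligt/klader_skor">Kläder &amp; skor</a> ·
      <a class="Link-sc-139ww1j-0" href="/annonser/skane/personligt/klader_skor">Skåne</a>
    </p>
    <h2 class="TextSubHeading__TextSubHeadingWrapper-sc-1ilszdp-0 jIvScq">
      <a href="/annons/skane/adidas_nmd_x_bape/87267675">Adidas NMD x Bape</a>
    </h2>
    <div class="TextSubHeading__TextSubHeadingWrapper-sc-1ilszdp-0 jIvScq Price__Wrapper-sc-1v2maoc-0 heunWX">
      <span>3 000 kr</span>
    </div>
  </article>
</div>
'''


def parse_card(card):
    """Extract the fields of one ad card using class-name prefixes only."""
    title_link = card.select_one('h2[class^="TextSubHeading__"] > a')
    image = card.select_one('img[class^="ListImage__StyledImg"]')
    price = card.select_one('div[class*="Price__Wrapper"] > span')
    info_links = card.select('p[class^="styled__TopInfoWrapper"] a')
    return {
        "title": title_link.get_text(strip=True),
        "url": title_link["href"],
        "price": price.get_text(strip=True),
        # The last link in the top-info line is the region.
        "location": info_links[-1].get_text(strip=True),
        # Drop the "?type=..." query to get the bare image URL.
        "image": image["src"].split("?")[0],
    }


soup = BeautifulSoup(html, "html.parser")
ads = [parse_card(card)
       for card in soup.select('div[class^="styled__Wrapper-sc-"]')]
for ad in ads:
    print(ad)
```

On a real page some cards may lack a price or an image, so each select_one result would need a None check before it is used.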
I am writing a scraper for Etsy. When I scrape the span that holds the review count, I get the right output. However, when I scrape the span that holds the price, I only get None values, and I don't understand why. Any help would be appreciated!
# HTML parsing
page_soup = soup(page_html, "html.parser")
# grab each listing card
divs = page_soup.find_all("div", {"class": "v2-listing-card__shop"})
for i in divs:
    shop = i.p.text
    reviews = i.find("span", {"class" : "text-body-smaller text-gray-lighter display-inline-block vertical-align-middle icon-b-1"})
    prices = i.find("span", {"class" : "currency-value"})
    print(shop)
    print(reviews.text)
    print(prices)
Here are the two span elements as on the website:
<div class="v2-listing-card__info">
<p class="text-gray text-truncate mb-xs-0 text-body">
Blush Watercolor Flowers & Leaves with Different Shades Clipart Separate Elements Hand Painted Commercial Use | S15 Fairy Tale
</p>
<div class="v2-listing-card__shop">
<p class="text-gray-lighter text-body-smaller display-inline-block mr-xs-1">PatishopArt</p>
<div class="v2-listing-card__rating icon-t-2">
<div class="stars-svg stars-smaller ">
<input name="initial-rating" type="hidden" value="5"/>
<input name="rating" type="hidden" value="5"/>
<span class="screen-reader-only">5 out of 5 stars</span>
<div aria-hidden="true" class="rating lit rating-first icon-b-2" data-rating="1">
<span class="etsy-icon stars-svg-star" title="Disappointed"><svg aria-hidden="true" focusable="false" viewbox="3 3 18 18" xmlns="http://www.w3.org/2000/svg"><path d="M19.985,10.36a0.5,0.5,0,0,0-.477-0.352H14.157L12.488,4.366a0.5,0.5,0,0,0-.962,0l-1.67,5.642H4.5a0.5,0.5,0,0,0-.279.911L8.53,13.991l-1.5,5.328a0.5,0.5,0,0,0,.741.6l4.231-2.935,4.215,2.935a0.5,0.5,0,0,0,.743-0.6l-1.484-5.328,4.306-3.074A0.5,0.5,0,0,0,19.985,10.36Z"></path></svg></span>
<div class="rating lit" data-rating="2">
<span class="etsy-icon stars-svg-star" title="Not a fan"><svg aria-hidden="true" focusable="false" viewbox="3 3 18 18" xmlns="http://www.w3.org/2000/svg"><path d="M19.985,10.36a0.5,0.5,0,0,0-.477-0.352H14.157L12.488,4.366a0.5,0.5,0,0,0-.962,0l-1.67,5.642H4.5a0.5,0.5,0,0,0-.279.911L8.53,13.991l-1.5,5.328a0.5,0.5,0,0,0,.741.6l4.231-2.935,4.215,2.935a0.5,0.5,0,0,0,.743-0.6l-1.484-5.328,4.306-3.074A0.5,0.5,0,0,0,19.985,10.36Z"></path></svg></span>
<div class="rating lit" data-rating="3">
<span class="etsy-icon stars-svg-star" title="It's okay"><svg aria-hidden="true" focusable="false" viewbox="3 3 18 18" xmlns="http://www.w3.org/2000/svg"><path d="M19.985,10.36a0.5,0.5,0,0,0-.477-0.352H14.157L12.488,4.366a0.5,0.5,0,0,0-.962,0l-1.67,5.642H4.5a0.5,0.5,0,0,0-.279.911L8.53,13.991l-1.5,5.328a0.5,0.5,0,0,0,.741.6l4.231-2.935,4.215,2.935a0.5,0.5,0,0,0,.743-0.6l-1.484-5.328,4.306-3.074A0.5,0.5,0,0,0,19.985,10.36Z"></path></svg></span>
<div class="rating lit" data-rating="4">
<span class="etsy-icon stars-svg-star" title="Like it"><svg aria-hidden="true" focusable="false" viewbox="3 3 18 18" xmlns="http://www.w3.org/2000/svg"><path d="M19.985,10.36a0.5,0.5,0,0,0-.477-0.352H14.157L12.488,4.366a0.5,0.5,0,0,0-.962,0l-1.67,5.642H4.5a0.5,0.5,0,0,0-.279.911L8.53,13.991l-1.5,5.328a0.5,0.5,0,0,0,.741.6l4.231-2.935,4.215,2.935a0.5,0.5,0,0,0,.743-0.6l-1.484-5.328,4.306-3.074A0.5,0.5,0,0,0,19.985,10.36Z"></path></svg></span>
<div class="rating lit" data-rating="5">
<span class="etsy-icon stars-svg-star" title="Love it"><svg aria-hidden="true" focusable="false" viewbox="3 3 18 18" xmlns="http://www.w3.org/2000/svg"><path d="M19.985,10.36a0.5,0.5,0,0,0-.477-0.352H14.157L12.488,4.366a0.5,0.5,0,0,0-.962,0l-1.67,5.642H4.5a0.5,0.5,0,0,0-.279.911L8.53,13.991l-1.5,5.328a0.5,0.5,0,0,0,.741.6l4.231-2.935,4.215,2.935a0.5,0.5,0,0,0,.743-0.6l-1.484-5.328,4.306-3.074A0.5,0.5,0,0,0,19.985,10.36Z"></path></svg></span>
</div>
</div>
</div>
</div>
</div>
</div>
<span class="text-body-smaller text-gray-lighter display-inline-block vertical-align-middle icon-b-1">(110)</span>
</div>
</div>
<p class="n-listing-card__price text-gray strong mt-xs-0">
<span class="currency-symbol">$</span><span class="currency-value">6.60</span>
</p>
<!-- This shows Free shipping on its own line , we only show it if it wasn't shown above -->
</div>
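You are only searching inside the divs with class v2-listing-card__shop, but the price span is outside of those divs. It sits in the <p class="n-listing-card__price"> element, which is a sibling of the shop div inside the enclosing v2-listing-card__info div, so i.find(...) returns None for it.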
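One way to act on this is to iterate over the enclosing v2-listing-card__info divs, which contain both the shop block and the price. The following is a minimal sketch under stated assumptions: page_html is a simplified stand-in for one listing card from the question, and the listings list and its dictionary keys are names chosen here for illustration.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one listing card from the question.
page_html = """
<div class="v2-listing-card__info">
  <div class="v2-listing-card__shop">
    <p class="text-gray-lighter text-body-smaller display-inline-block mr-xs-1">PatishopArt</p>
    <div class="v2-listing-card__rating icon-t-2">
      <span class="screen-reader-only">5 out of 5 stars</span>
      <span class="text-body-smaller text-gray-lighter display-inline-block vertical-align-middle icon-b-1">(110)</span>
    </div>
  </div>
  <p class="n-listing-card__price text-gray strong mt-xs-0">
    <span class="currency-symbol">$</span><span class="currency-value">6.60</span>
  </p>
</div>
"""

page_soup = BeautifulSoup(page_html, "html.parser")

listings = []
# Loop over the outer info div so that the price is within reach.
for card in page_soup.find_all("div", class_="v2-listing-card__info"):
    shop = card.select_one("div.v2-listing-card__shop > p")
    reviews = card.select_one("div.v2-listing-card__shop span.icon-b-1")
    price = card.find("span", class_="currency-value")
    listings.append({
        # Fall back to None when an element is missing from a card.
        "shop": shop.get_text(strip=True) if shop else None,
        "reviews": reviews.get_text(strip=True) if reviews else None,
        "price": price.get_text(strip=True) if price else None,
    })

print(listings)
```

The None fallbacks matter on real pages, where a listing without reviews would otherwise raise an AttributeError on .text.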