Python - How to use soup with random class characters - python

I have been trying to scrape a buy/sell site, and I have found everything I need in the HTML, but the class names contain random-looking characters, such as:
<div aria-label="Adidas NMD x Bape" class="styled__Wrapper-sc-1kpvi4z-0 eDiSuB" to="/annons/skane/adidas_nmd_x_bape/87267675">
<article class="styled__Article-sc-1kpvi4z-1 hbWRzz">
<div class="styled__ImageWrapper-sc-1kpvi4z-4 kxhCJn">
<div class="ListImage__Wrapper-sc-1rp77jc-0 cvipJS"><img alt="Adidas NMD x Bape" class="ListImage__StyledImg-sc-1rp77jc-1 iwClwW" sizes="
(min-width: 768px) 180px,
120px
" src="https://cdn.blocket.com/pictures/1692451915.jpg?type=gallery_big" srcset="
https://cdn.blocket.com/pictures/1692451915.jpg?type=thumb 120w,
https://cdn.blocket.com/pictures/1692451915.jpg?type=gallery_big 180w,
https://cdn.blocket.com/pictures/1692451915.jpg?type=mob_iphone_vi_normal 240w,
https://cdn.blocket.com/pictures/1692451915.jpg?type=store_presentation 360w,
https://cdn.blocket.com/pictures/1692451915.jpg?type=mob_iphone_vi_normal_retina 540w,
" /></div>
</div>
<div class="styled__Content-sc-1kpvi4z-2 dwtNsH">
<div class="styled__LocationTimeWrapper-sc-1kpvi4z-17 dvvNDw">
<div class="styled__SubjectSymbol-sc-1kpvi4z-11 cbBbUz"></div>
<p class="styled__TopInfoWrapper-sc-1kpvi4z-22 kEcJNb"><a class="Link-sc-139ww1j-0 TopInfoLink__StyledLink-lzfj8j-0 bjnLor" href="/annonser/hela_sverige/personligt/klader_skor?cg=4080&q=bape&st=s">Kläder & skor</a> · <a class="Link-sc-139ww1j-0 TopInfoLink__StyledLink-lzfj8j-0 bjnLor" href="/annonser/skane/personligt/klader_skor?cg=4080&q=bape&r=23&st=s">Skåne</a></p>
<p class="styled__Time-sc-1kpvi4z-18 bGSnhf">Idag 14:06</p>
</div>
<div class="styled__SubjectWrapper-sc-1kpvi4z-10 kZyTSM">
<h2 class="TextSubHeading__TextSubHeadingWrapper-sc-1ilszdp-0 jIvScq styled__StyledTitle-sc-1kpvi4z-6 bSElwy"><a class="Link-sc-139ww1j-0 styled__StyledTitleLink-sc-1kpvi4z-7 edlhAW" href="/annons/skane/adidas_nmd_x_bape/87267675">Adidas NMD x Bape</a></h2></div>
<div class="styled__ParamsWrapper-sc-1kpvi4z-13 cRZIFG"></div>
<div class="styled__SalesInfo-sc-1kpvi4z-20 bbHjGJ">
<div class="TextSubHeading__TextSubHeadingWrapper-sc-1ilszdp-0 jIvScq Price__Wrapper-sc-1v2maoc-0 heunWX"><span>3 000 kr<div class="TextCallout2__TextCallout2Wrapper-sc-19qvftl-0 eERYUj Price__StyledVatPrice-sc-1v2maoc-1 hMWxAJ"></div></span></div>
</div>
</div>
</article>
</div>
I can see all the values I am looking for, such as:
Adidas NMD x Bape
3 000 kr
Skåne
/annons/skane/adidas_nmd_x_bape/87267675
https://cdn.blocket.com/pictures/1692451915.jpg
I have a fair amount of experience with BeautifulSoup and basic scraping, but this is more advanced than what I am used to. What tips can you give me for scraping the values I am looking for?
Update:
test = eachPart.select_one('h2[class^="TextSubHeading__TextSubHeadingWrapper"] >a').text
print(test)
print(eachPart.select_one('[aria-label="{}"] img[alt="{}"]'.format(test, test))['src'])
print(eachPart.select_one('h2[class^="TextSubHeading__TextSubHeadingWrapper"] >a')['href'])
print(eachPart.select_one('div[class^="TextSubHeading__TextSubHeadingWrapper"] >span').text)
for test in eachPart.select('p[class^="styled__TopInfoWrapper"] a')[1:]:
    print(test.text)

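First identify the parent tag that wraps each listing, then find the child tags inside it.
CSS selectors are convenient for this: an attribute selector such as [class^="..."] matches on the stable start of the class attribute and ignores the random suffix.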
from bs4 import BeautifulSoup
html='''<div aria-label="Adidas NMD x Bape" class="styled__Wrapper-sc-1kpvi4z-0 eDiSuB" to="/annons/skane/adidas_nmd_x_bape/87267675">
<article class="styled__Article-sc-1kpvi4z-1 hbWRzz">
<div class="styled__ImageWrapper-sc-1kpvi4z-4 kxhCJn">
<div class="ListImage__Wrapper-sc-1rp77jc-0 cvipJS"><img alt="Adidas NMD x Bape" class="ListImage__StyledImg-sc-1rp77jc-1 iwClwW" sizes="
(min-width: 768px) 180px,
120px
" src="https://cdn.blocket.com/pictures/1692451915.jpg?type=gallery_big" srcset="
https://cdn.blocket.com/pictures/1692451915.jpg?type=thumb 120w,
https://cdn.blocket.com/pictures/1692451915.jpg?type=gallery_big 180w,
https://cdn.blocket.com/pictures/1692451915.jpg?type=mob_iphone_vi_normal 240w,
https://cdn.blocket.com/pictures/1692451915.jpg?type=store_presentation 360w,
https://cdn.blocket.com/pictures/1692451915.jpg?type=mob_iphone_vi_normal_retina 540w,
" /></div>
</div>
<div class="styled__Content-sc-1kpvi4z-2 dwtNsH">
<div class="styled__LocationTimeWrapper-sc-1kpvi4z-17 dvvNDw">
<div class="styled__SubjectSymbol-sc-1kpvi4z-11 cbBbUz"></div>
<p class="styled__TopInfoWrapper-sc-1kpvi4z-22 kEcJNb"><a class="Link-sc-139ww1j-0 TopInfoLink__StyledLink-lzfj8j-0 bjnLor" href="/annonser/hela_sverige/personligt/klader_skor?cg=4080&q=bape&st=s">Kläder & skor</a> · <a class="Link-sc-139ww1j-0 TopInfoLink__StyledLink-lzfj8j-0 bjnLor" href="/annonser/skane/personligt/klader_skor?cg=4080&q=bape&r=23&st=s">Skåne</a></p>
<p class="styled__Time-sc-1kpvi4z-18 bGSnhf">Idag 14:06</p>
</div>
<div class="styled__SubjectWrapper-sc-1kpvi4z-10 kZyTSM">
<h2 class="TextSubHeading__TextSubHeadingWrapper-sc-1ilszdp-0 jIvScq styled__StyledTitle-sc-1kpvi4z-6 bSElwy"><a class="Link-sc-139ww1j-0 styled__StyledTitleLink-sc-1kpvi4z-7 edlhAW" href="/annons/skane/adidas_nmd_x_bape/87267675">Adidas NMD x Bape</a></h2></div>
<div class="styled__ParamsWrapper-sc-1kpvi4z-13 cRZIFG"></div>
<div class="styled__SalesInfo-sc-1kpvi4z-20 bbHjGJ">
<div class="TextSubHeading__TextSubHeadingWrapper-sc-1ilszdp-0 jIvScq Price__Wrapper-sc-1v2maoc-0 heunWX"><span>3 000 kr<div class="TextCallout2__TextCallout2Wrapper-sc-19qvftl-0 eERYUj Price__StyledVatPrice-sc-1v2maoc-1 hMWxAJ"></div></span></div>
</div>
</div>
</article>
</div>'''
soup=BeautifulSoup(html,"html.parser")
print(soup.select_one('[aria-label="Adidas NMD x Bape"] img[alt="Adidas NMD x Bape"]')['src'])
print(soup.select_one('[aria-label="Adidas NMD x Bape"] h2[class^="TextSubHeading__TextSubHeadingWrapper"] >a').text)
print(soup.select_one('[aria-label="Adidas NMD x Bape"] h2[class^="TextSubHeading__TextSubHeadingWrapper"] >a')['href'])
print(soup.select_one('[aria-label="Adidas NMD x Bape"] div[class^="TextSubHeading__TextSubHeadingWrapper"] >span').text)
Output:
https://cdn.blocket.com/pictures/1692451915.jpg?type=gallery_big
Adidas NMD x Bape
/annons/skane/adidas_nmd_x_bape/87267675
3 000 kr
EDIT: the same approach, but matching on class-name prefixes instead of the aria-label value:
from bs4 import BeautifulSoup
html='''<div aria-label="Adidas NMD x Bape" class="styled__Wrapper-sc-1kpvi4z-0 eDiSuB" to="/annons/skane/adidas_nmd_x_bape/87267675">
<article class="styled__Article-sc-1kpvi4z-1 hbWRzz">
<div class="styled__ImageWrapper-sc-1kpvi4z-4 kxhCJn">
<div class="ListImage__Wrapper-sc-1rp77jc-0 cvipJS"><img alt="Adidas NMD x Bape" class="ListImage__StyledImg-sc-1rp77jc-1 iwClwW" sizes="
(min-width: 768px) 180px,
120px
" src="https://cdn.blocket.com/pictures/1692451915.jpg?type=gallery_big" srcset="
https://cdn.blocket.com/pictures/1692451915.jpg?type=thumb 120w,
https://cdn.blocket.com/pictures/1692451915.jpg?type=gallery_big 180w,
https://cdn.blocket.com/pictures/1692451915.jpg?type=mob_iphone_vi_normal 240w,
https://cdn.blocket.com/pictures/1692451915.jpg?type=store_presentation 360w,
https://cdn.blocket.com/pictures/1692451915.jpg?type=mob_iphone_vi_normal_retina 540w,
" /></div>
</div>
<div class="styled__Content-sc-1kpvi4z-2 dwtNsH">
<div class="styled__LocationTimeWrapper-sc-1kpvi4z-17 dvvNDw">
<div class="styled__SubjectSymbol-sc-1kpvi4z-11 cbBbUz"></div>
<p class="styled__TopInfoWrapper-sc-1kpvi4z-22 kEcJNb"><a class="Link-sc-139ww1j-0 TopInfoLink__StyledLink-lzfj8j-0 bjnLor" href="/annonser/hela_sverige/personligt/klader_skor?cg=4080&q=bape&st=s">Kläder & skor</a> · <a class="Link-sc-139ww1j-0 TopInfoLink__StyledLink-lzfj8j-0 bjnLor" href="/annonser/skane/personligt/klader_skor?cg=4080&q=bape&r=23&st=s">Skåne</a></p>
<p class="styled__Time-sc-1kpvi4z-18 bGSnhf">Idag 14:06</p>
</div>
<div class="styled__SubjectWrapper-sc-1kpvi4z-10 kZyTSM">
<h2 class="TextSubHeading__TextSubHeadingWrapper-sc-1ilszdp-0 jIvScq styled__StyledTitle-sc-1kpvi4z-6 bSElwy"><a class="Link-sc-139ww1j-0 styled__StyledTitleLink-sc-1kpvi4z-7 edlhAW" href="/annons/skane/adidas_nmd_x_bape/87267675">Adidas NMD x Bape</a></h2></div>
<div class="styled__ParamsWrapper-sc-1kpvi4z-13 cRZIFG"></div>
<div class="styled__SalesInfo-sc-1kpvi4z-20 bbHjGJ">
<div class="TextSubHeading__TextSubHeadingWrapper-sc-1ilszdp-0 jIvScq Price__Wrapper-sc-1v2maoc-0 heunWX"><span>3 000 kr<div class="TextCallout2__TextCallout2Wrapper-sc-19qvftl-0 eERYUj Price__StyledVatPrice-sc-1v2maoc-1 hMWxAJ"></div></span></div>
</div>
</div>
</article>
</div>'''
soup=BeautifulSoup(html,"html.parser")
print(soup.select_one('[class^="styled__Wrapper-sc-"] img[class^="ListImage__StyledImg-sc-"]')['src'])
print(soup.select_one('[class^="styled__Wrapper-sc-"] h2[class^="TextSubHeading__TextSubHeadingWrapper"] >a').text)
print(soup.select_one('[class^="styled__Wrapper-sc-"] h2[class^="TextSubHeading__TextSubHeadingWrapper"] >a')['href'])
print(soup.select_one('[class^="styled__Wrapper-sc-"] div[class^="TextSubHeading__TextSubHeadingWrapper"] >span').text)
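
When a results page contains many listings, the parent-then-children idea above can be applied per card. The following is a minimal sketch, not the site's actual structure: the markup is a trimmed copy of the question's HTML plus a second, entirely hypothetical card (example.com URLs and made-up class suffixes), and parse_cards is a name introduced here for illustration. It uses [class*="Price__Wrapper"] for the price because that class is not the first one in the attribute.

```python
from bs4 import BeautifulSoup

# Trimmed markup: the first card follows the question's HTML, the second
# card is hypothetical and only shows that differing suffixes still match.
html = '''
<div aria-label="Adidas NMD x Bape" class="styled__Wrapper-sc-1kpvi4z-0 eDiSuB">
  <img alt="Adidas NMD x Bape" class="ListImage__StyledImg-sc-1rp77jc-1 iwClwW"
       src="https://cdn.blocket.com/pictures/1692451915.jpg?type=gallery_big">
  <h2 class="TextSubHeading__TextSubHeadingWrapper-sc-1ilszdp-0 jIvScq">
    <a href="/annons/skane/adidas_nmd_x_bape/87267675">Adidas NMD x Bape</a>
  </h2>
  <div class="TextSubHeading__TextSubHeadingWrapper-sc-1ilszdp-0 jIvScq Price__Wrapper-sc-1v2maoc-0 heunWX">
    <span>3 000 kr</span>
  </div>
</div>
<div aria-label="Example item" class="styled__Wrapper-sc-zzzzzzz-0 aBcDeF">
  <img alt="Example item" class="ListImage__StyledImg-sc-yyyyyyy-1 gHiJkL"
       src="https://example.com/item.jpg">
  <h2 class="TextSubHeading__TextSubHeadingWrapper-sc-xxxxxxx-0 mNoPqR">
    <a href="/annons/example/1">Example item</a>
  </h2>
  <div class="TextSubHeading__TextSubHeadingWrapper-sc-xxxxxxx-0 mNoPqR Price__Wrapper-sc-wwwwwww-0 sTuVwX">
    <span>100 kr</span>
  </div>
</div>
'''


def parse_cards(markup):
    """Return one dict per listing card found in the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    # Parent first: every card's class attribute starts with this prefix.
    for card in soup.select('div[class^="styled__Wrapper-sc-"]'):
        # Children second: all lookups are scoped to the current card.
        link = card.select_one('h2[class^="TextSubHeading__"] > a')
        price = card.select_one('div[class*="Price__Wrapper"] > span')
        img = card.select_one('img[class^="ListImage__StyledImg"]')
        rows.append({
            "title": link.get_text(strip=True),
            "url": link["href"],
            "price": price.get_text(strip=True),
            "image": img["src"],
        })
    return rows


for row in parse_cards(html):
    print(row)
```

Scoping each select_one call to the card keeps values from different listings from being mixed up, which is the main reason to locate the parent first.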

Related

I need to scrape two columns with Python

<div class="content">
<div class="container">
<div class="row pt-2">
<div class="col pe-1">
<div class="grid-cell p-2">
<a href="united-states_florida/company/met-west-commercial-lender/tom-mchugh-975">
Tom Mchugh
</a>
</div>
</div>
<div class="col ps-1">
<div class="grid-cell p-2">
Company:
<span>
<a href="united-states_florida/company/met-west-commercial-lender">
Met West Commercial Lender
</a>
</span>
</div>
</div>
</div>
My result is not in the shape I want. I want it to look like the following table:
Column A      Column B
Tom Mchugh    Met West Commercial Lender
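There are several possible approaches. Here is a compact one, assuming the scraped values sit in a single column df.Name with names and companies alternating:
import pandas as pd
y = df.Name.values
# even positions become column A, odd positions become column B
df = pd.DataFrame({'A' : y[::2], 'B' : y[1::2]})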
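
To make the slicing concrete, here is a small self-contained sketch. It assumes the scraped values arrived in one column called Name with person and company alternating; the second pair of values is hypothetical and only there to show more than one row.

```python
import pandas as pd

# Assumed input shape: a single column where names and companies alternate.
# The last two values are made up for illustration.
df = pd.DataFrame({
    "Name": [
        "Tom Mchugh",
        "Met West Commercial Lender",
        "Jane Doe",
        "Example Lending Co",
    ]
})

y = df["Name"].values

# y[::2] takes positions 0, 2, ... and y[1::2] takes positions 1, 3, ...
wide = pd.DataFrame({"A": y[::2], "B": y[1::2]})
print(wide)
```

This only works when the column has an even number of entries, since both slices must have the same length.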

Access child inside an element list by XPath Sales Navigator

I am trying to pull the name and position of random people from Sales Navigator. Each person shows up as a card that contains all the information. I obtain a list of the cards but then I want to get for each one the Name and Title. I have tried using the code below to get the information from a card, the HTML of one result is below.
So far, my attempts always return an error indicating that the element could not be found. How could I solve this?
def testeo(driver):
    lista = driver.find_elements_by_xpath("//*[contains(@class,'pv5 ph2 search-results__result-item')]")
    nombres = []
    for i in range(0, len(lista)):
        nombres.append((lista[i].find_element_by_xpath(".//*[contains(@class,'result-lockup__name')]").text,
                        lista[i].find_element_by_xpath(".//*[contains(@class,'t-14 t-bold')]").text))
<li class="pv5 ph2 search-results__result-item" data-scroll-into-view="urn:li:fs_salesProfile:(ACwAAAJ-Ab0Bu4JpScPs9SE2b8R_LP9L9vU9nM8,NAME_SEARCH,fH_T)">
<div class="pt5 absolute search-results__select-container">
<input id="search-result-ember6830" class="small-input ember-checkbox ember-view" type="checkbox">
<label class="m0" for="search-result-ember6830">
<span class="a11y-text">
Select Jean Jongejan
</span>
</label>
</div>
<div style="" id="ember6866" class="flex full-width deferred-area ember-view"> <div class="search-results__result-container full-width pl2">
<div id="ember6981" class="ember-view"> <div id="ember6982" class="ember-view">
<article>
<h3 class="a11y-text">
Profile result – Jean Jongejan
</h3>
<section class="result-lockup">
<h4 class="a11y-text">
Profile result lockup – Jean Jongejan
</h4>
<div class="result-lockup__profile-info flex flex-column">
<div class="horizontal-person-entity-lockup-4 result-lockup__entity ml6">
<figure>
<a href="/sales/people/ACwAAAJ-Ab0Bu4JpScPs9SE2b8R_LP9L9vU9nM8,NAME_SEARCH,fH_T?_ntb=ErSmZYlWS8KlI9CD0cB6Yg%3D%3D" id="ember6985" class="result-lockup__icon-link ember-view">
<div class="presence-entity--size-4 relative mr2">
<img src="" loading="lazy" alt="Go to Jean Jongejan’s profile" id="ember6986" class="max-width max-height circle-entity-4 lazy-image ghost-person loaded ember-view">
<div class="presence-indicator presence-indicator--size-4 hidden presence-entity__indicator presence-entity__indicator--size-4" title="Reachable">
<span class="a11y-text">
Jean Jongejan is reachable
</span>
</div>
</div>
</a> </figure>
<dl>
<dt class="result-lockup__name">
<a href="/sales/people/ACwAAAJ-Ab0Bu4JpScPs9SE2b8R_LP9L9vU9nM8,NAME_SEARCH,fH_T?_ntb=ErSmZYlWS8KlI9CD0cB6Yg%3D%3D" id="ember6989" class="ember-view"> Jean Jongejan
</a> </dt>
<dd class="inline-flex vertical-align-middle">
<ul class="ml1 flex align-items-center list-style-none">
<li class="mr1">
<span class="a11y-text">
3rd degree contact
</span>
<span class="label-16dp block" aria-hidden="true">
3rd
</span>
</li>
<!----><!----><!----> </ul>
</dd>
<dd class="result-lockup__highlight-keyword">
<span class="t-14 t-bold">EXT Key Account Management & Consultancy</span>
<span>
at
<span data-entity-hovercard-id="urn:li:fs_salesCompany:36314" class="result-lockup__position-company">
<a href="/sales/company/36314?_ntb=Z6Rvdg6sRMiPD6xsYlUuFQ%3D%3D" id="ember6991" class="Sans-14px-black-75%-bold ember-view"> <span aria-hidden="true">
Marimekko
</span>
<span class="a11y-text">
Go to Marimekko account page
</span>
</a> <button aria-expanded="false" aria-label="See more about Marimekko" class="entity-hovercard__a11y-trigger p0 b0" data-entity-hovercard-id="urn:li:fs_salesCompany:36314" data-entity-hovercard-trigger="click"></button>
</span>
</span>
</dd>
<dd>
<span class="t-12 t-black--light">
3 years 11 months in role and company
</span>
</dd>
<dd>
<ul class="mv1 t-12 t-black--light result-lockup__misc-list">
<li class="result-lockup__misc-item">Breda, North Brabant, Netherlands</li>
</ul>
</dd>
</dl>
</div>
<!----> </div>
<div class="result-lockup__actions flex">
<ul class="result-lockup__common-actions">
<li class="result-lockup__action-item mb3">
<div class="display-flex">
<div id="ember6993" class="ember-view"> <div id="ember6995" class="save-to-list-dropdown artdeco-dropdown artdeco-dropdown--placement-bottom artdeco-dropdown--justification-right ember-view"><button aria-expanded="false" id="ember6996" class="save-to-list-dropdown__trigger ph4 artdeco-button artdeco-button--secondary artdeco-button--pro artdeco-button--1 m-type--message artdeco-dropdown__trigger artdeco-dropdown__trigger--placement-bottom ember-view" type="button" tabindex="0"> Save
<!----></button><div tabindex="-1" aria-hidden="true" id="ember6997" class="save-to-list-dropdown__content-container artdeco-dropdown__content artdeco-dropdown--is-dropdown-element artdeco-dropdown__content--has-arrow artdeco-dropdown__content--arrow-right artdeco-dropdown__content--justification-right artdeco-dropdown__content--placement-bottom ember-view"><div class="artdeco-dropdown__content-inner">
<!---->
</div>
</div></div>
<div id="ember6998" class="ember-view">
<!---->
<!----></div>
</div> <div class="relative">
<div id="ember6999" class="ember-view">
<div id="ember7000" class="artdeco-dropdown artdeco-dropdown--placement-bottom artdeco-dropdown--justification-right ember-view"><button aria-expanded="false" id="ember7001" class="artdeco-dropdown__trigger result-lockup__action-button m-type--more artdeco-dropdown__trigger--non-button artdeco-dropdown__trigger--placement-bottom ember-view" type="button" tabindex="0"> <span class="a11y-text">See more actions for this result</span>
<li-icon aria-hidden="true" type="ellipsis-horizontal-icon" class="artdeco-button artdeco-button--tertiary artdeco-button--1 artdeco-button--muted p0" size="medium"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" data-supported-dps="24x24" fill="currentColor" width="24" height="24" focusable="false">
<path d="M2 10h4v4H2v-4zm8 4h4v-4h-4v4zm8-4v4h4v-4h-4z"></path>
</svg></li-icon>
<!----></button><div tabindex="-1" aria-hidden="true" id="ember7002" class="artdeco-dropdown__content result-lockup__dropdown-more artdeco-dropdown--is-dropdown-element artdeco-dropdown__content--has-arrow artdeco-dropdown__content--arrow-right artdeco-dropdown__content--justification-right artdeco-dropdown__content--placement-bottom ember-view"><!----></div></div>
<div id="ember7003" class="ember-view"><!----></div>
<!----></div> </div>
</div>
</li>
<!----> </ul>
</div>
</section>
<section class="result-context relative pt1">
<h4 class="a11y-text">Profile result context – Jean Jongejan</h4>
<!---->
<!---->
<!----> </section>
</article>
</div>
</div> </div>
</div>
</li>
Can you try this XPath? It selects the dt element that holds the name:
//*[name()='dt'][@class='result-lockup__name']
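
The relative lookups can be checked without a browser. The sketch below uses lxml instead of Selenium (a substitution made here so that it runs standalone) on a heavily trimmed copy of the card from the question; the same XPath strings could then be tried with the Selenium calls above. Note the leading dot in each inner expression, which keeps the search inside the current card.

```python
import lxml.html

# Trimmed version of one result card from the question.
html = '''
<ul>
<li class="pv5 ph2 search-results__result-item">
  <dl>
    <dt class="result-lockup__name">
      <a href="#" class="ember-view"> Jean Jongejan
      </a>
    </dt>
    <dd class="result-lockup__highlight-keyword">
      <span class="t-14 t-bold">EXT Key Account Management &amp; Consultancy</span>
    </dd>
  </dl>
</li>
</ul>
'''

tree = lxml.html.fromstring(html)

# Step 1: collect the cards.
cards = tree.xpath(".//li[contains(@class, 'search-results__result-item')]")

# Step 2: query each card with XPath expressions relative to that card.
people = []
for card in cards:
    name = card.xpath("string(.//dt[@class='result-lockup__name'])").strip()
    title = card.xpath("string(.//span[contains(@class, 't-14 t-bold')])").strip()
    people.append((name, title))

print(people)
```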

Unable to scrape h1 class with python/beautiful soup

I am trying to scrape a title from an h1 class, but I keep getting "None"
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('h1', {'class': 'prod-name'})
print(title)
I've also tried using this way:
name_div = soup.find_all('div', {'class': 'col-md-12 col-sm-12 col-xs-12'})[0]
name = name_div.find('h1').text
print(name)
in which case I get: "IndexError: list index out of range"
Can anybody help me out?
This is the source code:
<div class="row attachDetails __web-inspector-hidebefore-shortcut__">
<div class="row">
<div class="col-md-12 col-sm-12 col-xs-12">
<div class="brand-desc">POLO RALPH LAUREN</div>
<h1 class="prod-name">ARAN CREWNECK SWEATER</h1>
<div class="panel-group" id="accordion">
<div class="borders-overview">
<div class="panel-heading">
<h4 class="panel-title">
<label class="overview-label collapsed" data-angle="overview-label" data-toggle="collapse" data-parent="#accordion" href="#collapse1">
<a class="fa fa-angle-up pull-right"></a>
<a class="over-view">OVERVIEW</a>
<span class="color-disp over-view">COLOR: FAWN GREY HEATHER</span>
<span class="style-num over-view">MATERIAL# : 710766783002
</span></label>
</h4>
</div>
<div id="collapse1" class="panel-collapse collapse">
<div class="short-desc-section"></div>
</div>
</div>
<div class="border-details">
<div class="panel-heading">
<h4 class="panel-title">
<label class="prod-details collapsed" data-angle="prod-details" data-toggle="collapse" data-parent="#accordion" href="#collapse2">
<a class="detail-link">Details</a>
<a class="fa fa-angle-up pull-right"></a>
</label>
</h4>
</div>
<div id="collapse2" class="long-desc panel-collapse collapse">
<div><ol><li>STANDARD FIT</li><li>COTTON</li></ol></div>
<ol>
<div><li><b>Board:</b> S196SC23</li></div>
<!--***********************************************************************************************************-->
</ol>
</div>
</div>
</div>
</div>
</div>
</div>
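
As a sanity check, the sketch below runs the same find call against a trimmed copy of the markup shown above. It only demonstrates that the lookup matches this HTML. If the live request still yields None, the document returned to requests presumably differs from the source seen in the browser; that is an assumption, and inspecting page.content would be the way to confirm it.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the source code from the question.
html = '''<div class="row">
<div class="col-md-12 col-sm-12 col-xs-12">
<div class="brand-desc">POLO RALPH LAUREN</div>
<h1 class="prod-name">ARAN CREWNECK SWEATER</h1>
</div>
</div>'''

soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1", {"class": "prod-name"})

# Guard against None so a missing element does not raise AttributeError.
name = title.get_text(strip=True) if title is not None else None
print(name)
```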

Beautifulsoup: Get a range of divs

I just found out about how to process webpages in python using BeautifulSoup.
There's a list of div from which I want to get those in a specific range. The range is defined by two div that have a h2 child.
How would I do that? Thank you for your support!
EDIT: I added an actual representation of my html code below instead of a previous "simplified" version that was missing tags.
The new code shows a root div with class foo-bar-details.
Nested inside it are 9 div tags, two of which have a nested h2 tag. All of those 9 div tags contain img elements deeply nested within. What I need is each img element of the divs that lie between the two divs containing an h2 element.
An expected outcome if applied to the html code below would be:
<img src="../../images/123456_thumb.jpg" alt="Image 123456" title="Image 123456">
<img src="../../images/67890_thumb.JPG" alt="Image 67890 " title="Image 67890">
This is the html code:
<div class="foo-bar-details">
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>fsuhfsdf </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><span class="icon-help"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/39826_thumb.JPG" alt="Image 39826" title="Image 39826 ">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>JHFDFD </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><span class="icon-help"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/223234_thumb.JPG" alt="Image 223234" title="Image 223234 ">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>sdfsdf </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><span class="icon-help"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/223823_thumb.JPG" alt="Image 223823" title="Image 223823 ">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="element-header mystic-bg padding-y-10 padding-x-20" id="elem-4">
<h2 class="h3 margin-bottom-5">
Foo
</h2>
<ul class="list-inline margin-0">
<li> Foo feature </li>
...
</ul>
</div>
<div id="info-panel-header" class="padding-y-10 padding-x-40">
<div class="row">
<div class="col-se-6 element-info">
<div class="col-se-12">
<div class="row">
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/123456_thumb.jpg" alt="Image 123456" title="Image 123456">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="padding-y-10 padding-x-40 gray-wild-sand-bg" id="sec-feat-4-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Foo strin: </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Barbar</strong><span class="icon-help"></span>
</p>
</div>
</div>
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Mine: </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
TEST<span class="icon-help"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/67890_thumb.JPG" alt="Image 67890 " title="Image 67890">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="element-header mystic-bg padding-y-10 padding-x-20" id="elem-5">
<h2 class="h3 margin-bottom-5">
Bar
</h2>
<ul class="list-inline margin-0">
<li> Bar feature </li>
...
</ul>
</div>
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>fsuhfsdf </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><span class="icon-help"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/39826_thumb.JPG" alt="Image 39826" title="Image 39826 ">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>fsuhfsdf </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><span class="icon-help"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/209876_thumb.JPG" alt="Image 209876" title="Image 209876 ">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
</div>
Here is a solution involving lxml.html:
We extract all divs between the first and last divs which contain an h2 tag:
import lxml.html
# HTML file saved as "file.html"
file_name = "file.html"
with open(file_name, 'r') as f:
    tree = lxml.html.fromstring(f.read())
# all_div = tree.findall('div')
all_div = tree.find_class('foo-bar-details')[0].findall('div')
start, stop = None, None
for k, div in enumerate(all_div):
    if div.findall('h2') and start is None:
        print("Range starts at %d" % k)
        start = k
        continue
    if div.findall('h2') and start is not None:
        print("Range stops at %d" % k)
        stop = k + 1  # add one because the slice end is exclusive
        continue
# div_list = all_div[start:stop]
img_list = [_.xpath('.//img') for _ in all_div[start:stop]]
print(img_list)
# [[], [<Element img at 0x20b58d73f40>], [<Element img at 0x20b58d73f90>], []]
# Or
img_list = [_.xpath('.//img/@src') for _ in all_div[start:stop]]
print(img_list)
# [[], ['../../images/123456_thumb.jpg'], ['../../images/67890_thumb.JPG'], []]
Another solution involving SimplifiedDoc:
from simplified_scrapy.simplified_doc import SimplifiedDoc
html ='''
<div class="foo-bar-details">
<div class="element-header mystic-bg padding-y-10 padding-x-20" id="elem-4">
<h2 class="h3 margin-bottom-5">
Foo
</h2>
<ul class="list-inline margin-0">
<li> Foo feature </li>
...
</ul>
</div>
<div id="info-panel-header" class="padding-y-10 padding-x-40">Test 1</div>
<div class="padding-y-10 padding-x-40 gray-wild-sand-bg" id="foo-feat-4-1">Test 2</div>
<div class="padding-y-10 padding-x-40 " id="foo-feat-4-2">Test 3</div>
<div class="padding-y-10 padding-x-40 gray-wild-sand-bg" id="foo-feat-4-3">Test 4</div>
<div class="element-header mystic-bg padding-y-10 padding-x-20" id="elem-5">
<h2 class="h3 margin-bottom-5">
Bar
</h2>
<ul class="list-inline margin-0">
<li> Bar feature </li>
...
</ul>
</div>
</div>
'''
doc = SimplifiedDoc(html)
divs = doc.select('div.foo-bar-details').divs.contains('<h2')
print ([div.id for div in divs])
divs = doc.select('div.foo-bar-details').divs.notContains('<h2')
print ([div.id for div in divs])
Result:
['elem-4', 'elem-5']
['info-panel-header', 'foo-feat-4-1', 'foo-feat-4-2', 'foo-feat-4-3']
The SimplifiedDoc library does not rely on third-party libraries, which makes it lightweight and fast, and well suited to beginners.
If I understand you correctly, you want to find the <img> tags and the corresponding <h2> that each image belongs to.
This example (txt variable contains the HTML snippet from your question):
from bs4 import BeautifulSoup
soup = BeautifulSoup(txt, 'html.parser')
out = {}
for img in soup.select('div:has(h2) ~ div img'):
    out.setdefault(img.find_previous('h2').get_text(strip=True), []).append(img['src'])
from pprint import pprint
pprint(out)
Prints:
{'Bar': ['../../images/39826_thumb.JPG', '../../images/209876_thumb.JPG'],
'Foo': ['../../images/123456_thumb.jpg', '../../images/67890_thumb.JPG']}
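
If only the images between the first and second h2 divs are wanted, as in the question's expected outcome, a plain loop over the direct children is one option. This is a minimal sketch on simplified, partly hypothetical markup (the ids and the first image are made up), not a drop-in for the real page.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the question's HTML; nesting is reduced.
html = '''
<div class="foo-bar-details">
  <div id="before"><div class="row"><img src="../../images/39826_thumb.JPG"></div></div>
  <div id="elem-4"><h2>Foo</h2></div>
  <div id="info-panel-header"><div class="row"><img src="../../images/123456_thumb.jpg"></div></div>
  <div id="sec-feat-4-1"><div class="row"><img src="../../images/67890_thumb.JPG"></div></div>
  <div id="elem-5"><h2>Bar</h2></div>
  <div id="after"><div class="row"><img src="../../images/209876_thumb.JPG"></div></div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
root = soup.select_one("div.foo-bar-details")

srcs = []
inside = False
# recursive=False restricts the loop to the direct child divs of the root.
for div in root.find_all("div", recursive=False):
    if div.find("h2") is not None:
        if inside:
            break          # second h2 div reached: the range is over
        inside = True      # first h2 div reached: the range starts after it
        continue
    if inside:
        srcs.extend(img["src"] for img in div.find_all("img"))

print(srcs)
```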

How to find desired data within multiple div in beautifulsoup

This is the HTML code.
I am trying to select data within multiple div tags:
<div class="details-wrapper apps-secondary-color">
<div class="details-section metadata">
<div class="details-section-heading">
<div class="details-section-contents">
<div class="meta-info">
<div class="title">Updated</div>
<div class="content" itemprop="datePublished">March 7, 2016</div>
</div>
<div class="meta-info">
<div class="meta-info">
<div class="meta-info">
<div class="meta-info">
<div class="meta-info contains-text-link">
<div class="meta-info">
<div class="meta-info">
<div class="meta-info">
<div class="meta-info contains-text-link">
<div class="meta-info">
<div class="meta-info meta-info-wide">
<div class="details-sharing-section">
</div>
<div class="details-section-divider"></div>
</div>
</div>
</div>
I want to select March 7, 2016.
How can I select this in BeautifulSoup?
You can use soup.find('div', {'itemprop': 'datePublished'}) to select the div element with itemprop datePublished.
Demo
from bs4 import BeautifulSoup
content = '''<div class="details-wrapper apps-secondary-color">
<div class="details-section metadata">
<div class="details-section-heading">
<div class="details-section-contents">
<div class="meta-info">
<div class="title">Updated</div>
<div class="content" itemprop="datePublished">March 7, 2016</div>
</div>
<div class="meta-info">
<div class="meta-info">
<div class="meta-info">
<div class="meta-info">
<div class="meta-info contains-text-link">
<div class="meta-info">
<div class="meta-info">
<div class="meta-info">
<div class="meta-info contains-text-link">
<div class="meta-info">
<div class="meta-info meta-info-wide">
<div class="details-sharing-section">
</div>
<div class="details-section-divider"></div>
</div>
</div>
</div>'''
soup = BeautifulSoup(content, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
date = soup.find('div', {'itemprop':'datePublished'})
print(date.text)
Output
March 7, 2016
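
The same element can also be reached with a CSS attribute selector, and the text can be turned into a date object if that is useful downstream. This is a sketch on a trimmed copy of the markup; the date format string is an assumption based on the single value shown and relies on English month names.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Trimmed copy of the relevant part of the question's HTML.
content = '''<div class="meta-info">
<div class="title">Updated</div>
<div class="content" itemprop="datePublished">March 7, 2016</div>
</div>'''

soup = BeautifulSoup(content, "html.parser")

# Attribute selector: any div whose itemprop equals datePublished.
node = soup.select_one('div[itemprop="datePublished"]')
raw = node.get_text(strip=True)

# Assumed format: full month name, day, four-digit year.
published = datetime.strptime(raw, "%B %d, %Y").date()

print(raw)
print(published.isoformat())
```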
