How to find desired data within multiple div in beautifulsoup - python

this is the html code
i am trying to select data within multiple div tags
<div class="details-wrapper apps-secondary-color">
<div class="details-section metadata">
<div class="details-section-heading">
<div class="details-section-contents">
<div class="meta-info">
<div class="title">Updated</div>
<div class="content" itemprop="datePublished">March 7, 2016</div>
</div>
<div class="meta-info">
<div class="meta-info">
<div class="meta-info">
<div class="meta-info">
<div class="meta-info contains-text-link">
<div class="meta-info">
<div class="meta-info">
<div class="meta-info">
<div class="meta-info contains-text-link">
<div class="meta-info">
<div class="meta-info meta-info-wide">
<div class="details-sharing-section">
</div>
<div class="details-section-divider"></div>
</div>
</div>
</div>
i want to select March 7,2016
how can i select this in beautifulsoup

You can use soup.find('div', {'itemprop': 'datePublished'}) to select the div element with itemprop datePublished.
Demo
from bs4 import BeautifulSoup
content = '''<div class="details-wrapper apps-secondary-color">
<div class="details-section metadata">
<div class="details-section-heading">
<div class="details-section-contents">
<div class="meta-info">
<div class="title">Updated</div>
<div class="content" itemprop="datePublished">March 7, 2016</div>
</div>
<div class="meta-info">
<div class="meta-info">
<div class="meta-info">
<div class="meta-info">
<div class="meta-info contains-text-link">
<div class="meta-info">
<div class="meta-info">
<div class="meta-info">
<div class="meta-info contains-text-link">
<div class="meta-info">
<div class="meta-info meta-info-wide">
<div class="details-sharing-section">
</div>
<div class="details-section-divider"></div>
</div>
</div>
</div>'''
soup = BeautifulSoup(content)
date = soup.find('div', {'itemprop':'datePublished'})
print(date.text)
Output
March 7, 2016

Related

Beautifulsoup get element with the same class

I'm having trouble parsing HTML elements with "class" attribute using Beautifulsoup.
The html code is like this :
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
I nead to get data (XPANDER 1.5L GLX, MT, 1499, Gasoline)
I try with script detail.find(class_='item-content') just only get XPANDER 1.5L GLX
please help
Use .find_all() or .select():
from bs4 import BeautifulSoup
html_doc = """
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
items = [
item.get_text(strip=True) for item in soup.find_all(class_="item-content")
]
print(*items)
Prints:
XPANDER 1.5L GLX MT 1499 cc Bensin
Or:
items = [item.get_text(strip=True) for item in soup.select(".item-content")]
You can try this
soup = BeautifulSoup(html, "html.parser")
items = [item.text for item in soup.find_all("div", {"class": "item-content"})]
find_all retreives all occurences

Unable to scrape h1 class with python/beautiful soup

I am trying to scrape a title from an h1 class, but I keep getting "None"
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('h1', {'class': 'prod-name'})
print(title)
I've also tried using this way:
name_div = soup.find_all('div', {'class': 'col-md-12 col-sm-12 col-xs-12'})[0]
name = name_div.find('h1').text
print(name)
in which case I get: "IndexError: list index out of range"
Can anybody help me out?
This is the source code:
<div class="row attachDetails __web-inspector-hidebefore-shortcut__">
<div class="row">
<div class="col-md-12 col-sm-12 col-xs-12">
<div class="brand-desc">POLO RALPH LAUREN</div>
<h1 class="prod-name">ARAN CREWNECK SWEATER</h1>
<div class="panel-group" id="accordion">
<div class="borders-overview">
<div class="panel-heading">
<h4 class="panel-title">
<label class="overview-label collapsed" data-angle="overview-label" data-toggle="collapse" data-parent="#accordion" href="#collapse1">
<a class="fa fa-angle-up pull-right"></a>
<a class="over-view">OVERVIEW</a>
<span class="color-disp over-view">COLOR: FAWN GREY HEATHER</span>
<span class="style-num over-view">MATERIAL# : 710766783002
</span></label>
</h4>
</div>
<div id="collapse1" class="panel-collapse collapse">
<div class="short-desc-section"></div>
</div>
</div>
<div class="border-details">
<div class="panel-heading">
<h4 class="panel-title">
<label class="prod-details collapsed" data-angle="prod-details" data-toggle="collapse" data-parent="#accordion" href="#collapse2">
<a class="detail-link">Details</a>
<a class="fa fa-angle-up pull-right"></a>
</label>
</h4>
</div>
<div id="collapse2" class="long-desc panel-collapse collapse">
<div><ol><li>STANDARD FIT</li><li>COTTON</li></ol></div>
<ol>
<div><li><b>Board:</b> S196SC23</li></div>
<!--***********************************************************************************************************-->
</ol>
</div>
</div>
</div>
</div>
</div>
</div>

Parsing all elements which have tag before

I have following html code:
<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
<div class="row">aaa</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
...
</fieldset>
</div>
I'm trying to display only the text inside all rows, where parent tag is legend BBB (in this example - bbb,bbb,bbb).
Currently I've created the code below, but it doesn't look pretty, and I don't know how to find all rows:
bs = BeautifulSoup(request.txt, 'html.parser')
if(bs.find('legend', text='BBB')):
value = parser.find('legend').next_element.next_element.next_element.get_text().strip()
print(value)
Is there any simply way to do this? div class name is the same, just "legend" is variable.
Added a <legend>CCC</legend> so that you may see it scales.
html = """<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
<div class="row">aaa</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>CCC</legend>
<div class="row">ccc</div>
<div class="row">ccc</div>
<div class="row">ccc</div>
...
</fieldset>
</div>"""
after_tag = bs.find("legend", text="BBB").parent # Grabs parent div <fieldset>.
divs = after_tag.find_all("div", {"class": "row"}) # Finds all div inside parent.
for div in divs:
print(div.text)
bbb
bbb
bbb
from bs4 import BeautifulSoup
html = """
<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
<div class="row">aaa</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
...
</fieldset>
</div>
"""
soup = BeautifulSoup(html, features='html.parser')
elements = soup.select('div > fieldset')[1]
tuple_obj = ()
for row in elements.select('div.row'):
tuple_obj = tuple_obj + (row.text,)
print(tuple_obj)
the tuple object prints out
('bbb', 'bbb', 'bbb')

Beautifulsoup: Get a range of divs

I just found out about how to process webpages in python using BeautifulSoup.
There's a list of div from which I want to get those in a specific range. The range is defined by two div that have a h2 child.
How would I do that? Thank you for your support!
EDIT: I added an actual representation of my html code below instead of a previous "simplified" version that was missing tags.
The new code shows a root div with class foo-bar-details.
Nested are 9 div tags. Two of which have a nested h2 tag. All of those 9 div tags contain img elements deeply nested within. What I need is each img element of those divs that are between the ones containing the h2 element.
An expected outcome if applied to the html code below would be:
<img src="../../images/123456_thumb.jpg" alt="Image 123456" title="Image 123456">
<img src="../../images/67890_thumb.JPG" alt="Image 67890 " title="Image 67890">
This is the html code:
<div class="foo-bar-details">
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>fsuhfsdf </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><span class="icon-help"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/39826_thumb.JPG" alt="Image 39826" title="Image 39826 ">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>JHFDFD </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><span class="icon-help"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/223234_thumb.JPG" alt="Image 223234" title="Image 223234 ">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>sdfsdf </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><span class="icon-help"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/223823_thumb.JPG" alt="Image 223823" title="Image 223823 ">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="element-header mystic-bg padding-y-10 padding-x-20" id="elem-4">
<h2 class="h3 margin-bottom-5">
Foo
</h2>
<ul class="list-inline margin-0">
<li> Foo feature </li>
...
</ul>
</div>
<div id="info-panel-header" class="padding-y-10 padding-x-40">
<div class="row">
<div class="col-se-6 element-info">
<div class="col-se-12">
<div class="row">
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/123456_thumb.jpg" alt="Image 123456" title="Image 123456">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="padding-y-10 padding-x-40 gray-wild-sand-bg" id="sec-feat-4-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Foo strin: </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Barbar</strong><span class="icon-help"></span>
</p>
</div>
</div>
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Mine: </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
TEST<span class="icon-help"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/67890_thumb.JPG" alt="Image 67890 " title="Image 67890">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="element-header mystic-bg padding-y-10 padding-x-20" id="elem-5">
<h2 class="h3 margin-bottom-5">
Bar
</h2>
<ul class="list-inline margin-0">
<li> Bar feature </li>
...
</ul>
</div>
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>fsuhfsdf </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><span class="icon-help"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/39826_thumb.JPG" alt="Image 39826" title="Image 39826 ">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>fsuhfsdf </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><span class="icon-help"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/209876_thumb.JPG" alt="Image 209876" title="Image 209876 ">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
</div>
Here is a solution involving lxml.html:
We extract all divs between the first and last divs which contain an h2 tag:
import lxml.html
# HTML file saved as "file.html"
file_name = "file.html"
with open(file_name, 'r') as f:
tree = lxml.html.fromstring(f.read())
# all_div = tree.findall('div')
all_div = tree.find_class('foo-bar-details')[0].findall('div')
start, stop = None, None
for k, div in enumerate(all_div):
if div.findall('h2') and start is None:
print("Range starts at %d" % k)
start = k
continue
if div.findall('h2') and start is not None:
print("Range stops at %d" % k)
stop = k + 1 # add one as range stops at k - 1
continue
# div_list = all_div[start:stop]
img_list = [_.xpath('.//img') for _ in all_div[start:stop]]
print(img_list)
# [[], [<Element img at 0x20b58d73f40>], [<Element img at 0x20b58d73f90>], []]
# Or
img_list = [_.xpath('.//img/#src') for _ in all_div[start:stop]]
print(img_list)
# [[], ['../../images/123456_thumb.jpg'], ['../../images/67890_thumb.JPG'], []]
Another solution involving SimplifiedDoc:
from simplified_scrapy.simplified_doc import SimplifiedDoc
html ='''
<div class="foo-bar-details">
<div class="element-header mystic-bg padding-y-10 padding-x-20" id="elem-4">
<h2 class="h3 margin-bottom-5">
Foo
</h2>
<ul class="list-inline margin-0">
<li> Foo feature </li>
...
</ul>
</div>
<div id="info-panel-header" class="padding-y-10 padding-x-40">Test 1</div>
<div class="padding-y-10 padding-x-40 gray-wild-sand-bg" id="foo-feat-4-1">Test 2</div>
<div class="padding-y-10 padding-x-40 " id="foo-feat-4-2">Test 3</div>
<div class="padding-y-10 padding-x-40 gray-wild-sand-bg" id="foo-feat-4-3">Test 4</div>
<div class="element-header mystic-bg padding-y-10 padding-x-20" id="elem-5">
<h2 class="h3 margin-bottom-5">
Bar
</h2>
<ul class="list-inline margin-0">
<li> Bar feature </li>
...
</ul>
</div>
</div>
'''
doc = SimplifiedDoc(html)
divs = doc.select('div.foo-bar-details').divs.contains('<h2')
print ([div.id for div in divs])
divs = doc.select('div.foo-bar-details').divs.notContains('<h2')
print ([div.id for div in divs])
Result:
['elem-4', 'elem-5']
['info-panel-header', 'foo-feat-4-1', 'foo-feat-4-2', 'foo-feat-4-3']
Simplifieddoc library does not rely on the third-party library, which is lighter and faster, perfect for beginners.
Here are more examples here
If I understand you correctly, you want to find <img> tags and corresponding <h2> to which the images belong to.
This example (txt variable contains the HTML snippet from your question):
from bs4 import BeautifulSoup
soup = BeautifulSoup(txt, 'html.parser')
out = {}
for img in soup.select('div:has(h2) ~ div img'):
out.setdefault(img.find_previous('h2').get_text(strip=True), []).append(img['src'])
from pprint import pprint
pprint(out)
Prints:
{'Bar': ['../../images/39826_thumb.JPG', '../../images/209876_thumb.JPG'],
'Foo': ['../../images/123456_thumb.jpg', '../../images/67890_thumb.JPG']}

Selenium Python cannot get elements

Here is the HTML code:
<i>
<div id="Calendar">
<div class="Title">booking</div>
<div class="calendarHolder">
<div class="month">
<div class="left"></div>
<span class="monthName">APRIL 2018</span>
<div class="right"></div>
</div>
<div class="dayHolder">
<div class="day holiday"><div class="dCell">SUN</div></div>
<div class="day"><div class="dCell">MON</div></div>
<div class="day"><div class="dCell">TUE</div></div>
<div class="day"><div class="dCell">WED</div></div>
<div class="day"><div class="dCell">THU</div></div>
<div class="day"><div class="dCell">FRI</div></div>
<div class="day last"><div class="dCell">SAT</div></div>
<div class="clearboth"></div>
</div>
<div class="dateHolder">
<div class="date"><div class="dCell">1</div></div>
<div class="date"><div class="dCell">2</div></div>
<div class="date"><div class="dCell">3</div></div>
<div class="date"><div class="dCell">4</div></div>
<div class="date"><div class="dCell">5</div></div>
<div class="date"><div class="dCell">6</div></div>
<div class="date last"><div class="dCell">7</div></div>
<div class="date"><div class="dCell">8</div></div>
<div class="date"><div class="dCell">9</div></div>
<div class="date"><div class="dCell">10</div></div>
<div class="date"><div class="dCell">11</div></div>
<div class="date"><div class="dCell">12</div></div>
<div class="date"><div class="dCell">13</div></div>
<div class="date last"><div class="dCell">14</div></div>
<div class="date time_day stin"><div class="dCell">15</div>
<input type="hidden" name="pid" value="5000">
</div>
<div class="date"><div class="dCell">16</div></div>
</div>
</i>
There is a calendar which I need to click on an available date to perform further actions.
What I actually need is to click on the one with the class name "date time_day stin".
I have simply try the exact Xpath and also the following but it return error:
No such element: Unable to locate element:
{"method":"class name","selector":"date.time_night.stin"}
driver.find_element_by_class_name('date.time_day.stin').click()
It return the same error by:
Dates =
driver.find_element_by_css_selector("div#date.time_day.stin>div").text
print (Dates)
Then I have tried different things of whether can I get the text of each div to find out what's the problem.
The only things I can get among all the find_element(s) trial are texts "SUN" to "SAT" with Class name "dCell" by:
lst = []
driver.find_elements_by_css_selector("div#startBookingBlock div.dCell")
for i in calendar:
lst.append(i.text)
print (lst)
But it still didn't return the others dates by the same class names, so.
I revamp it as the following and it returns "[]":
lst = []
calendar = driver.find_elements_by_css_selector("div#dateHolder div.dCell")
for i in calendar:
lst.append(i.text)
print (lst)
Then I tried to write more specifically as follow:
calendar = driver.find_elements_by_xpath("//div[#id='startBookingBlock' and #class='dateHolder' and #class='date' and #class='dCell']")
print (calendar)
However, it still prints "[]".
It can't seem to get anything under the "dateHolder" class but I just cannot figure out why would this happen, can anyone suggest? Thanks!
I think your HTML code is broken -- it's missing one </input> after the input tag and two </div>s at the end.
Try running your code on this one (I just added those three):
<div id="Calendar">
<div class="Title">booking</div>
<div class="calendarHolder">
<div class="month">
<div class="left"></div>
<span class="monthName">APRIL 2018</span>
<div class="right"></div>
</div>
<div class="dayHolder">
<div class="day holiday"><div class="dCell">SUN</div></div>
<div class="day"><div class="dCell">MON</div></div>
<div class="day"><div class="dCell">TUE</div></div>
<div class="day"><div class="dCell">WED</div></div>
<div class="day"><div class="dCell">THU</div></div>
<div class="day"><div class="dCell">FRI</div></div>
<div class="day last"><div class="dCell">SAT</div></div>
<div class="clearboth"></div>
</div>
<div class="dateHolder">
<div class="date"><div class="dCell">1</div></div>
<div class="date"><div class="dCell">2</div></div>
<div class="date"><div class="dCell">3</div></div>
<div class="date"><div class="dCell">4</div></div>
<div class="date"><div class="dCell">5</div></div>
<div class="date"><div class="dCell">6</div></div>
<div class="date last"><div class="dCell">7</div></div>
<div class="date"><div class="dCell">8</div></div>
<div class="date"><div class="dCell">9</div></div>
<div class="date"><div class="dCell">10</div></div>
<div class="date"><div class="dCell">11</div></div>
<div class="date"><div class="dCell">12</div></div>
<div class="date"><div class="dCell">13</div></div>
<div class="date last"><div class="dCell">14</div></div>
<div class="date time_day stin"><div class="dCell">15</div>
<input type="hidden" name="pid" value="5000"></input>
</div>
<div class="date"><div class="dCell">16</div></div>
</div></div></div>
The XML that works for me (in an online XPath tester) is as follows:
div[#id="Calendar"]/div[#class="calendarHolder"]/div[#class="dateHolder"]/div[#class="date time_day stin"]
Does this give the element you want?

Categories