BeautifulSoup SoupStrainer to strain items from a specific container only - Python

Is it possible to make a BeautifulSoup SoupStrainer that strains all 'order-cards' from 'container-01' only (without the 'order-cards' from other containers)?
Below is the sample HTML:
<div class="items-container" container-id="container-01">
<div class="order-card">order_01
<div class="item-card">item1</div>
<div class="item-card">item2</div>
<div class="item-card">item3</div>
<div class="item-card">item4</div>
</div>
<div class="order-card">order_02
<div class="item-card">itemA</div>
<div class="item-card">itemB</div>
<div class="item-card">itemC</div>
<div class="item-card">itemD</div>
</div>
<div class="order-card">order_03
<div class="item-card">itemW</div>
<div class="item-card">itemX</div>
<div class="item-card">itemY</div>
<div class="item-card">itemZ</div>
<div class="item-card">item</div>
</div>
</div>
<div class="items-container" container-id="container-02">
<div class="order-card">order_53
<div class="item-card">item_7</div>
<div class="item-card">item_8</div>
</div>
</div>
<div class="items-container" container-id="container-03">
<div class="order-card">order_13
<div class="item-card">item_16</div>
<div class="item-card">item_17</div>
<div class="item-card">item_18</div>
</div>
</div>
What I have so far is the code below, which strains ALL 'order-cards' from ALL containers.
The goal is for 'page_soup' to contain only the 'order-card' items that are in 'container-01'.
A later loop then uses that 'page_soup' to iterate through each 'order-card' and get the details from each 'item-card'.
Rephrased: the goal is to get the details from each 'item-card' that is in 'container-01' only; there is no need to parse any container other than 'container-01'.
only_item_cells = SoupStrainer('div', attrs={"class":"order-card"})
page_soup = BeautifulSoup(page_html, 'html.parser', parse_only=only_item_cells)
Following that is a loop that gets the details from ALL the 'item-cards' in ALL containers. That is NOT what I want, as the output includes items from containers other than 'container-01'.
Running Python 3.8.8, on Anaconda, Win64

Use the appropriate attribute as you have indicated:
only_item_cells = SoupStrainer('div', attrs= {"container-id": "container-01"})
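Because parse_only keeps each matching element together with its descendants, the 'order-card' and 'item-card' divs inside 'container-01' stay available for the later loop. A minimal sketch of the full flow, assuming page_html holds the HTML above as in the question (the loop is an addition, not part of the original answer):
from bs4 import BeautifulSoup, SoupStrainer

# Keep only the container whose container-id is "container-01";
# everything nested inside it is kept as well.
only_container_01 = SoupStrainer('div', attrs={"container-id": "container-01"})
page_soup = BeautifulSoup(page_html, 'html.parser', parse_only=only_container_01)

# Only the order-cards inside container-01 are present now.
for order_card in page_soup.find_all('div', class_='order-card'):
    for item_card in order_card.find_all('div', class_='item-card'):
        print(item_card.get_text(strip=True))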

Related

Is there a method to detect a common form in HTML code?

I have a lot of HTML pages that are formatted differently, but the content that interests me is the same. For example:
Page_1.html :
<div class = "block_person">
<div class="persons"><span>Jules Rodrigez</span></div>
<div class="contents"><h1>Jules Rodrigez is a programmer specialized in machine learning</h1></div>
</div>
<div class = "block_person">
<div class="persons"><span>James Alfonso</span></div>
<div class="contents"><h1>James is a singer</h1></div>
</div>
page_2.html :
<div class="many_speakers" >
<div class="speakers"><h1>Jules Rodrigez</h1></div>
<div class="summary"><span>Jules Rodrigez is a programmer specialized in data science</span></div>
</div>
<div class="many_speakers" >
<div class="speakers"><h1>Peka Yaya</h1></div>
<div class="summary"><span>Peka is a professor</span></div>
</div>
<div class="many_speakers" >
<div class="speakers"><h1>Cristiano dimaria</h1></div>
<div class="summary"><span>Cristiano is a football player</span></div>
</div>
From an HTML page (page_1 or page_2), I want to get a list of objects like the following.
From page_1.html:
[{"person":"Jules Rodrigez","content":"Jules Rodrigez is a programmer specialized in machine learning"},{"person":"James Alfonso","content":"James is a singer"}]
The problem is that each page is formatted with a different structure: how can we detect, in an HTML page, that a block is repeated several times and therefore contains the requested information? For example, in page_1.html the block that is repeated several times is:
<div class = "block_person">
<div class="persons"><span>Jules Rodrigez</span></div>
<div class="contents"><h1>Jules Rodrigez is a programmer specialized in machine learning</h1></div>
</div>
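One possible heuristic for the question above (not from the thread): treat the group of sibling elements that most often repeat the same (tag, class) signature under one parent as the "record" block, then turn each block into a record. A rough sketch; the helper name find_repeated_blocks is hypothetical:
from collections import Counter
from bs4 import BeautifulSoup

def find_repeated_blocks(html, min_repeats=2):
    # Return the largest group of sibling tags that share one (tag, class) signature.
    soup = BeautifulSoup(html, "html.parser")
    best = []
    for parent in [soup, *soup.find_all(True)]:
        children = parent.find_all(True, recursive=False)
        signatures = Counter(
            (child.name, tuple(child.get("class", []))) for child in children
        )
        if not signatures:
            continue
        signature, count = signatures.most_common(1)[0]
        if count >= min_repeats and count > len(best):
            best = [
                child for child in children
                if (child.name, tuple(child.get("class", []))) == signature
            ]
    return best

# e.g. one record per repeated block, built from the text of its direct children:
# records = [
#     [child.get_text(strip=True) for child in block.find_all(True, recursive=False)]
#     for block in find_repeated_blocks(open("page_1.html").read())
# ]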

How to extract the data from encoded HTML class using python

How can I retrieve the encoded div class of a webpage (the title HTML tag) using Python?
Here is my sample HTML code.
You need to use requests to make a request (it will automatically decode the page, in most cases), and beautifulsoup to extract the data from the HTML.
Update after OP clarifications: the CSS classes are not dynamically updated, they stay the same (that's what I noticed). Since they're the same, you can:
grab a container with all needed data (a CSS selector that wraps the needed data):
for result in soup.select(".pSzOP-AhqUyc-qWD73c.GNzUNc span"):
    # ...
use regex to filter (find) the needed data via re.findall() with a capture group (.*): only the captured part of the match is returned, and .* means "capture everything":
if re.findall(r"^Telephone\s?:\s?(.*)", result.text):
    # ...
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. On that note, there's a dedicated web scraping with CSS selectors blog post of mine.
Code and example in the online IDE:
import requests, re
from bs4 import BeautifulSoup
html = requests.get("https://sites.google.com/a/arden.solihull.sch.uk/futures/home")
soup = BeautifulSoup(html.text, "html.parser")
# all regular expressions for this task
# https://regex101.com/r/cxdxgq/1
for result in soup.select(".pSzOP-AhqUyc-qWD73c.GNzUNc span"):
if re.findall(r"^Careers\s?.*\s?:\s?(.*)", result.text):
name = "".join(re.findall(r"^Careers\s?.*\s?:\s?(.*)", result.text.strip()))
print(name)
if re.findall(r"^Telephone\s?:\s?(.*)", result.text):
telephone = "".join(re.findall(r"^Telephone\s?:\s?(.*)", result.text.strip()))
print(telephone)
if re.findall(r"^Email\s?:\s?(.*)", result.text):
email = "".join(re.findall(r"^Email\s?:\s?(.*)", result.text.strip()))
print(email)
# to scrape the role you can do the same thing with regex. Test on regex101.com
'''
Mrs A. Fallis
01564 773348
afallis@arden.solihull.sch.uk
Mr S. Brady
01564 7733478
sbrady@arden.solihull.sch.uk
'''
First solution, without OP clarifications (shows only the extraction part since you haven't provided a website URL):
from bs4 import BeautifulSoup
html = """
<div class="L581yb VICjCf" hjdwnd-ahquyc-r6poud="" jndksc="" l6ctce-pszop"="" l6ctce-purzt="" tabindex=" == $0
<div class=">
</div>
<div class="hJDwNd-AhqUyc-WNfPc purZT-AhqUyC-I15mzb PSzOP-AhqUyc-qWD73c JNdks <div class=" jndksc-smkayb"="">
<div class="" f570id"="" jsaction="zXBUYD: ZTPCnb; 2QF9Uc: Qxe3nd;
jsname=" jscontroller="SGWD4d">
>
<div class="oKdM2C KzvoMe">
<div class="hJDwNd-AhqUyc-WNFPC PSzOP-AhqUyc- qWD73c jXK9ad D2fZ2 Oj CsFc whaque GNzUNC" id="h.7f5e93de0cf8a767_49">
<div class="]XK9ad-SmkAyb">
<div class="ty]Ctd mGzaTb baZpAe">
<div class="GV3q8e aP9Z7e" id="h.p_9livxd801krd">
</div>
<h3 class="CDt4ke zfr3Q OmQG5e" dir="ltr" id="h.p_9livxd801krd" tabindex="-1">
.
</h3>
<div class="GV3q8e aP9z7e" id="h.p JrEgQYpyORCF">
</div>
<h3 class="CDt 4Ke zfr3Q OmQG5e" dir="ltr" id="h.p_JrEgQYPYORCF" tabindex="-1">
<div class="CjVfdc" jsaction="touchstart:UrsOsc; click:Kjs
qPd; focusout:QZoaz; mouseover:yOpDld; mouseout:dq0hvd;fvlRjc:jbFSO
d;CrflRd:SzACGe;" jscontroller="Ae65rd">
<div class="PPHIP rviiZ" jsname="haAclf">
.
</div>
<span style="font-family: 'Oswald'; font-weight: 500;">
Telephone : 01564 773348
</span>
</div>
</h3>
<div class="GV3q8e aP9z7e" id="h.p_sylefz-BOSBX">
</div>
><h3 id="h.p_sylefz-BOSBX" dir="ltr" class="CDt 4Ke zfr3Q OmQG5e"
</div>
</div>
</div>
</div>
</div>
</div>
"""
# pass the HTML to a BeautifulSoup object and use html.parser as the HTML parser
soup = BeautifulSoup(html, "html.parser")
# grab a phone number (only first occurrence will be extracted)
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
print(soup.select_one('.CjVfdc span').text.strip())
# Telephone : 01564 773348
# extract <div> element with .L581yb class. returns a list()
print(soup.select('.L581yb'))
'''
[<div class="L581yb VICjCf" hjdwnd-ahquyc-r6poud="" jndksc="" l6ctce-pszop"="" l6ctce-purzt="" tabindex=" == $0
<div class=">
</div>]
'''
# extract <div> element with .hJDwNd-AhqUyc-WNfPc class. returns a list()
print(soup.select('.hJDwNd-AhqUyc-WNfPc'))
'''
[<div class="hJDwNd-AhqUyc-WNfPc purZT-AhqUyC-I15mzb PSzOP-AhqUyc-qWD73c JNdks <div class=" jndksc-smkayb"="">
<div class="" f570id"="" jsaction="zXBUYD: ZTPCnb; 2QF9Uc: Qxe3nd;
jsname=" jscontroller="SGWD4d">
>
<div class="oKdM2C KzvoMe">
<div class="hJDwNd-AhqUyc-WNFPC PSzOP-AhqUyc- qWD73c jXK9ad D2fZ2 Oj CsFc whaque GNzUNC" id="h.7f5e93de0cf8a767_49">
<div class="]XK9ad-SmkAyb">
<div class="ty]Ctd mGzaTb baZpAe">
<div class="GV3q8e aP9Z7e" id="h.p_9livxd801krd">
</div>
<h3 class="CDt4ke zfr3Q OmQG5e" dir="ltr" id="h.p_9livxd801krd" tabindex="-1">
.
</h3>
<div class="GV3q8e aP9z7e" id="h.p JrEgQYpyORCF">
</div>
<h3 class="CDt 4Ke zfr3Q OmQG5e" dir="ltr" id="h.p_JrEgQYPYORCF" tabindex="-1">
<div class="CjVfdc" jsaction="touchstart:UrsOsc; click:Kjs
qPd; focusout:QZoaz; mouseover:yOpDld; mouseout:dq0hvd;fvlRjc:jbFSO
d;CrflRd:SzACGe;" jscontroller="Ae65rd">
<div class="PPHIP rviiZ" jsname="haAclf">
.
</div>
<span style="font-family: 'Oswald'; font-weight: 500;">
Telephone : 01564 773348
</span>
</div>
</h3>
<div class="GV3q8e aP9z7e" id="h.p_sylefz-BOSBX">
</div>
><h3 id="h.p_sylefz-BOSBX" dir="ltr" class="CDt 4Ke zfr3Q OmQG5e"
</div>
</div>
</div>
</div>
</div>
</div>]
'''

How to get the text of the next tag? (Beautiful Soup)

The HTML code is:
<div class="card border p-3">
<span class="small text-muted">Contact<br></span>
<div>Steven Cantrell</div>
<div class="small">Department of Justice</div>
<div class="small">Federal Bureau of Investigation</div>
<!---->
<!---->
<!---->
<div class="small">skcantrell#fbi.gov</div>
<div class="small">256-313-8835</div>
</div>
I want to get the output inside the <div> tag, i.e. Steven Cantrell.
I need a way to get the contents of the tag that comes after a given tag. In this case, the given tag is 'span', {'class': 'small text-muted'}.
What I tried is :
rfq_name = soup.find('span',{'class':'small text-muted'})
print(rfq_name.next)
But this printed Contact instead of the name.
You're nearly there, just change your print to: print(rfq_name.find_next('div').text)
Find the element that has the text "Contact". Then use .find_next() to get the next <div> tag.
from bs4 import BeautifulSoup
html = '''<div class="card border p-3">
<span class="small text-muted">Contact<br></span>
<div>Steven Cantrell</div>
<div class="small">Department of Justice</div>
<div class="small">Federal Bureau of Investigation</div>
<!---->
<!---->
<!---->
<div class="small">skcantrell#fbi.gov</div>
<div class="small">256-313-8835</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
contact = soup.find(text='Contact').find_next('div').text
Output:
print(contact)
Steven Cantrell
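As a side note (not from the original answer), newer BeautifulSoup releases prefer the string= argument over the older text= keyword for text matching; the same lookup can be written as:
contact = soup.find(string='Contact').find_next('div').text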

How to use xpath to get text from similar class?

(1)
</div>
<div class="n_cont5" id="nct7">
<div class="nc_tit">说明书:</div>
<div class="nc5" id="smsdiv">
正在查询请稍候......
</div>
</div>
<div class="n_page">上一篇第<span class="cur">2</span>篇下一篇共<span>53</span>篇转到第
<input type="text" name="pages" id="pages"
onkeydown="return SubmitKeyClick(this,event)"
onkeyup="value=value.replace(/[^\d]/g,'')"
onbeforepaste="clipboardData.setData('text',clipboardData.getData('text').replace(/[^\d]/g,''))"/>
篇</div>
</div>
(2)
<a href="javascript:noAction()" title="PDF下载"
onclick="pdfDownloadDetail('Unexamined_patent_for_invention/2016/20160330/CN105452223A/PDF_PID/CN112014000037041CN00001054522230APDFZH20160330CN00F.PDF,CN201480037041.6')" href="javascript:noAction()">PDF下载</a>
</dd>
</dl>
</div>
</li>
<li>打印</li>
<li><a class="icon7" href="javascript:noAction();" class="zidongfanyi"
onclick="translateToEn('CN201480037041.6', 'FMZL_EN,SYXX_EN')">中译英</a></li>
</ul>
<div class="clear"></div>
</div>
<div class="clear"></div>
</div>
<div class="n_page">上一篇第<span class="cur">2</span>篇下一篇共<span>53</span>篇转到第
<input type="text" name="pages" id="pages"
onkeydown="return SubmitKeyClick(this,event)"
onkeyup="value=value.replace(/[^\d]/g,'')"
onbeforepaste="clipboardData.setData('text',clipboardData.getData('text').replace(/[^\d]/g,''))"/>
篇</div>
Here are two very similar blocks of HTML content. I want to get the number "53" from the first one. I used the code below, which doesn't work. I also tried going through the div class, but that failed too. How can I get the number "53" from the first block?
html.xpath('//a[contains(text(),"下一篇")]/span/text()')
Why it didn't work: the span holding 53 is a sibling (not a child) of the a element.
html.xpath("//a[contains(text(),'下一篇')]/following-sibling::span/text()")
To complete @super.single430's answer, here's an alternative (if encoding issues occur during the parsing process):
//span[@class="cur"]/following-sibling::span/text()

Follow a sibling in Selenium/Python

I want to enter a text in a text area. The HTML code is as follows:
<li class="order-unavailable string-type-key string-block clear-fix status- require_changes expanded working autogrowed activity-opened" data-string_status="require_changes" data-master_unit_count="22" data-string_id="2394473">
<div class="key-area clear-fix">
<div class="key-area-container-one clear-fix">
<div class="key-area-container-two">
<div class="col-50 col-left">
<div class="string-controls">
<a class="control-expand-toggle selected" href="#"></a>
<a class="control-activity-toggle " href="#">2</a>
<input class="control-select-string" type="checkbox">
</div>
<div class="master-content">
</div>
<div class="col-50 col-right slave-side-container">
</div>
</div>
</div>
<div class="activity-area clear-fix">
<div class="col-50 col-left">
<div class="col-50 col-right">
<div class="comment-area-inner">
<h3>Add comment</h3>
<div class="comment-container">
<textarea class="comment-content" name="comment_content"></textarea>
</div>
<div class="col-right">
<div class="clear"></div>
<strong>Notification settings</strong>
<p>The people you select will get an email when you post this comment. They'll also be notified by email every time a new comment is added.</p>
<div class="notification-settings">
</div>
</div>
</div>
The textarea component name is comment-content
The xpath of the textarea is:
/html/body/div/section/ol/li[16]/div[2]/div[2]/div/div/textarea
This is the code I am using:
driver.find_element_by_xpath("*//div[#title=\"NOTIFICATION_HOMEPAGE_REDIRECT_CHANGED_SITE\"]
/following-sibling::div[2]/div[2]/div/div/textarea").send_keys("Test comment")
Can someone help me with how to frame the sibling tag?
div[2]/div[2]/div/div/textarea
The tag before the following-sibling keyword is correct.
Choose the textarea and enter something:
driver.find_element_by_xpath("//textarea[@class='comment-content']").send_keys('Test Comment')
For XPath, you can use the Firepath plugin for Firefox.
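For completeness, a minimal sketch using the newer Selenium 4 locator API; the driver setup and URL are placeholders, not part of the original answer:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://example.com/orders")  # placeholder URL, not from the question
# Locate the comment textarea by its class and type into it.
comment_box = driver.find_element(By.XPATH, "//textarea[@class='comment-content']")
comment_box.send_keys("Test comment")
driver.quit()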
