I'm struggling to scrape a few pages. The problem appears when the page structure contains a lot of nested divs.
Here is the page's HTML:
<div>
<section class="ui-accordion-header ui-state-default ui-corner-all ui-accordion-icons" role="tab" id="ui-id-1" aria-controls="ui-id-2" aria-selected="false" aria-expanded="false" tabindex="0"><span class="ui-accordion-header-icon ui-icon ui-icon-triangle-1-e"></span>
<div class="detail-avocat">
<div class="nom-avocat">Me <span class="avocat_name">NAME </span></div>
<div class="type-avocat">Avocat postulant au Tribunal Judiciaire</div>
</div>
<div class="more-info">Plus d'informations</div>
</section>
<div class="ui-accordion-content ui-helper-reset ui-widget-content ui-corner-bottom" style="display: none;" id="ui-id-2" aria-labelledby="ui-id-1" role="tabpanel" aria-hidden="true">
<div class="details">
<div class="detail-avocat-row ">
<div class="detail-avocat-content overflow-h">
<span>Structure :</span>
<div>
<p>Cabinet individuel NAME</p>
</div>
</div>
</div>
<div class="detail-avocat-row ">
<div class="detail-avocat-content overflow-h">
<span>Adresse :</span>
<div>
<p>21 rue Belle Isle 57000 VILLE</p>
</div>
</div>
</div>
<div class="detail-avocat-row ">
<div class="detail-avocat-content overflow-h">
<span>Mail :</span>
<div>
<p>cabinet#mail.fr</p>
</div>
</div>
</div>
<div class="detail-avocat-row">
<div class="detail-avocat-content overflow-h">
<span>Tél :</span>
<div>
<p>Telnum</p>
</div>
</div>
</div>
<div class="detail-avocat-row">
<div class="detail-avocat-content overflow-h">
<span>Fax :</span>
<div>
<p> </p>
</div>
</div>
</div>
<div class="contact-avocat"> Contacter </div>
</div>
</div>
</div>
And here is my Python code:
divtel = self.driver.find_elements(by=By.XPATH,
    value='//div[@class="detail-avocat-content overflow-h"]/div/p')
for p in divtel:
    print(p.text)
It doesn't print anything. On other, similar pages it prints the text, but in this case it doesn't, although there is text in the nested span and div/p. Do you know why?
How can I resolve this problem, please?
Thank you.
The .text property only returns text when the web element containing it is visible on the page. If the web element is hidden (here the panel has style="display: none;"), you have to use .get_attribute('innerText'), .get_attribute('textContent') or .get_attribute('innerHTML') instead (textContent returns the raw text of the element and its descendants, innerHTML returns the markup). So, for example, change
print(p.text)
to
print(p.get_attribute('innerText'))
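If the same scraper has to cope with both visible and hidden elements, the two lookups can be combined. The following is only a minimal sketch: the helper name element_text is made up for illustration, and it assumes the object passed in behaves like a Selenium WebElement (a .text property and a .get_attribute() method).

```python
def element_text(element):
    """Return the text of a Selenium-like element, even when it is hidden.

    `element_text` is a hypothetical helper, not part of Selenium.
    """
    # .text is empty for elements that are not displayed.
    text = element.text
    if text:
        return text
    # Fall back to the DOM property, which ignores visibility.
    raw = element.get_attribute("textContent") or ""
    # Collapse the whitespace that textContent preserves from the markup.
    return " ".join(raw.split())


# Usage with the loop from the question (needs a live driver):
# for p in divtel:
#     print(element_text(p))
```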
Related
I need to scrape the text that appears before and after a script tag.
HTML:
<div class="card-body">
<div class="d-flex flex-row flex-wrap signal-row">
<div class="signal-title">EUR/USD signal</div>
<div class="ms-auto signal-value signal-color xh-highlight">
<span class="timeago fw-normal small" datetime="1656687480000" timeago-id="10">1 day ago</span>
</div>
</div>
<div class="d-flex flex-row flex-wrap signal-row">
<div class="signal-title">
From
</div>
<div class="ms-auto signal-value signal-color xh-highlight">
<span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
<script class="">w(hhmm(1656687480));</script>20:28
</div>
</div>
<div class="d-flex flex-row flex-wrap signal-row">
<div class="signal-title">
Till
</div>
<div class="ms-auto signal-value signal-color xh-highlight">
<span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
<script class="">w(hhmm(1656698280));</script>23:28
</div>
</div>
<div class="signal-row signal-status signal-color">
Filled
</div>
<div class="d-flex flex-row flex-wrap signal-row">
<div class="signal-title">
Sold at
</div>
<div class="ms-auto signal-value signal-color user-select-all">
<script>f('OCKGMP');</script>1.0407
</div>
</div>
<div class="d-flex flex-row flex-wrap signal-row">
<div class="signal-title">
Bought at
</div>
<div class="ms-auto signal-value signal-color user-select-all">
<script>f('OCKGML');</script>1.0408
</div>
I need to extract UTC and +05:30, plus the other values that sit next to script tags in the HTML, e.g.: <span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
I tried next_sibling, but it returns nothing. I also tried etree with XPath, but that returns nothing either.
Using lxml etree:
dom = etree.HTML(str(soup))
t = dom.xpath("//div[@class='ms-auto signal-value signal-color']/span/script/following-sibling::text()")
for i in t:
    print(i.text)
Using next_siblings:
l = soup.find('script').next_siblings
Expected Output :
UTC +05:30
20:28
Simply call .text or the get_text() method on your element; the contents of the script tag will be ignored.
soup.select_one('.card-body span').parent.get_text(' ', strip=True)
Note: this assumes the HTML is generated dynamically, so the markup you actually receive may differ from what is shown in your question.
Example
It selects every <span> under .card-body and iterates over the ResultSet, printing the text of each span's parent.
from bs4 import BeautifulSoup
html='''
<div class="card-body">
<div>
<div class="ms-auto signal-value signal-color xh-highlight">
<span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
<script class="">w(hhmm(1656687480));</script>20:28
</div>
</div>
<div class="d-flex flex-row flex-wrap signal-row">
<div class="signal-title">
Till
</div>
<div class="ms-auto signal-value signal-color xh-highlight">
<span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
<script class="">w(hhmm(1656698280));</script>23:28
</div>
</div>
'''
# Name the parser explicitly to avoid the "no parser specified" warning.
soup = BeautifulSoup(html, 'html.parser')
for e in soup.select('.card-body span'):
    print(e.parent.get_text(' ', strip=True))
Output
UTC +05:30 20:28
UTC +05:30 23:28
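To make it clearer why the script tag drops out, here is a rough sketch of the same idea using only the standard library's html.parser instead of BeautifulSoup. It is a swapped-in technique for illustration, not what get_text() does internally; the class name TextSkippingScripts and the function visible_text are made up.

```python
from html.parser import HTMLParser


class TextSkippingScripts(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._script_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._script_depth += 1

    def handle_endtag(self, tag):
        if tag == "script" and self._script_depth:
            self._script_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside a script element.
        if not self._script_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    parser = TextSkippingScripts()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


snippet = '''<div class="ms-auto signal-value signal-color xh-highlight">
<span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
<script class="">w(hhmm(1656687480));</script>20:28
</div>'''
print(visible_text(snippet))  # UTC +05:30 20:28
```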
The HTML is below. If the span value is less than 20%, I want to remove everything from the span up to and including its <div class="action"> ancestor, but nothing above that.
For example:
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 5% </span>
</div>
</div>
</div>
</div>
From the HTML above, only this code should be removed:
<div class="action">
<div class="content">
<span class="content-name"> 5% </span>
</div>
</div>
So what should be left is:
<div class="item">
<div class="info">
</div>
</div>
This is my current Python code:
items = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[@class='content-name']")))
for item in items:
    percentage_text = re.findall(r"\d+", item.text)[0]
    if int(percentage_text) <= 20:
        driver.execute_script("arguments[0].remove();", item)
But it only removes the span, not its parents.
Here is the full HTML. I think it needs JavaScript to remove the elements, but I am very new to JavaScript. I researched for more than two hours and still can't find a solution. Thank you very much.
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 5% </span>
</div>
</div>
</div>
</div>
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 95% </span>
</div>
</div>
</div>
</div>
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 32% </span>
</div>
</div>
</div>
</div>
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 15% </span>
</div>
</div>
</div>
</div>
Go up to the parent of the span's parent (span, then div.content, then div.action) and remove that:
driver.execute_script("arguments[0].parentElement.parentElement.remove();", item)
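Counting parentElement hops breaks if the nesting depth changes. A possible alternative, sketched below, is the standard DOM method Element.closest(), which walks up to the nearest ancestor matching a selector. The helper names parse_percentage and remove_low_items are made up for illustration, and the sketch uses a strict "less than" comparison as in the question's wording (the question's code uses <= 20).

```python
import re

# JavaScript run in the browser: remove the nearest div.action ancestor.
REMOVE_ACTION_JS = "arguments[0].closest('div.action').remove();"


def parse_percentage(text):
    """Return the first integer found in text, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def remove_low_items(driver, spans, threshold=20):
    """Remove the div.action ancestor of every span whose value is below threshold.

    `driver` is assumed to offer execute_script() and each span
    get_attribute(), as Selenium's WebDriver and WebElement do.
    Returns the number of removed blocks.
    """
    removed = 0
    for span in spans:
        value = parse_percentage(span.get_attribute("textContent") or "")
        if value is not None and value < threshold:
            driver.execute_script(REMOVE_ACTION_JS, span)
            removed += 1
    return removed


# Usage with the `items` list from the question (needs a live driver):
# remove_low_items(driver, items)
```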
I have trouble clicking this type of button with Selenium in Python, because the name I search by, "5dbnhpbwuny6rmr65h86", and the button are in different divs.
Complete HTML code: https://codeshare.io/a39b3g.
Example HTML code:
<div class="o_kanban_view o_kanban_dashboard o_pos_kanban o_cannot_create o_kanban_ungrouped" style="display: flex;"><div class="o_kanban_record">
<div class="o_kanban_card_header">
<div class="o_kanban_card_header_title">
<div class="o_primary">5dbnhpbwuny6rmr65h86</div>
<div class="o_secondary">Unused</div>
</div>
<div class="o_kanban_manage_button_section">
<a class="o_kanban_manage_toggle_button" href="#">Más <i class="fa fa-caret-down"></i></a>
</div>
</div>
<div class="container o_kanban_card_content o_visible">
<div class="row">
<div class="col-xs-6 o_kanban_primary_left">
<button class="btn btn-default oe_kanban_action oe_kanban_action_button" data-name="open_session_cb" data-type="object" type="button">New Session
</button>
</div>
<div class="col-xs-6 o_kanban_primary_right">
</div>
</div>
</div><div class="container o_kanban_card_manage_pane o_invisible">
<div class="row">
<div class="col-xs-6 o_kanban_card_manage_section o_kanban_manage_view">
<div class="o_kanban_card_manage_title">
<span>Ver</span>
</div>
<div>
<a data-name="341" data-type="action" href="#" class=" oe_kanban_action oe_kanban_action_a">Sesiones</a>
</div>
<div>
<a data-name="342" data-type="action" href="#" class=" oe_kanban_action oe_kanban_action_a">Pedidos de ventas</a>
</div>
</div>
<div class="col-xs-6 o_kanban_card_manage_section o_kanban_manage_new">
<div class="o_kanban_card_manage_title">
<span>Informes</span>
</div>
<div>
<a data-name="343" data-type="action" href="#" class=" oe_kanban_action oe_kanban_action_a">Pedidos</a>
</div>
</div>
</div>
<div class="o_kanban_card_manage_settings row">
<div class="col-xs-12 text-right">
<a data-type="edit" href="#" class=" oe_kanban_action oe_kanban_action_a">Configuración</a>
</div>
</div>
</div>
</div><div class="o_kanban_record o_kanban_ghost"></div><div class="o_kanban_record o_kanban_ghost"></div><div class="o_kanban_record o_kanban_ghost"></div><div class="o_kanban_record o_kanban_ghost"></div><div class="o_kanban_record o_kanban_ghost"></div><div class="o_kanban_record o_kanban_ghost"></div></div>
I came up with something like this, but it is not the right solution:
for div in driver.find_elements_by_xpath("//div[@class='o_kanban_record']"):
    if div.find_elements_by_xpath("//div[contains(text() , '5dbnhpbwuny6rmr65h86')]") != []:
        div.find_elements_by_xpath("//button[contains(text() , 'New Session')]").click()
Thanks!
To click() the New Session button for any given name, e.g. iuijg6bzr2xs9gsueq2i or 5dbnhpbwuny6rmr65h86, you can wrap the lookup in a function and pass the name in as a string.
The final solution, which covers both states of the button (open_session_cb or open_ui), is:
driver.find_elements_by_xpath("//div[@class='o_primary' and contains(text(), '%s')]/parent::div[*]/parent::div[*]/parent::div[*]/descendant::button[@data-name='open_session_cb']" % (shop))[0].click()
OR
driver.find_elements_by_xpath("//div[@class='o_primary' and contains(text(), '%s')]/parent::div[*]/parent::div[*]/parent::div[*]/descendant::button[@data-name='open_ui']" % (shop))[0].click()
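The underlying idea is to scope the button search to the card that holds the name. In XPath, an expression starting with // always searches the whole document, even when called on an element; a relative search has to start with .// instead. The sketch below illustrates that scoping with the standard library's xml.etree.ElementTree on a simplified, made-up version of the markup, as a stand-in for a live Selenium session. The function name find_button_for and the sample button labels are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the kanban markup; the button labels are invented
# so the two cards can be told apart.
SAMPLE = """
<div class="o_kanban_view">
  <div class="o_kanban_record">
    <div class="o_primary">iuijg6bzr2xs9gsueq2i</div>
    <button data-name="open_session_cb">New Session A</button>
  </div>
  <div class="o_kanban_record">
    <div class="o_primary">5dbnhpbwuny6rmr65h86</div>
    <button data-name="open_session_cb">New Session B</button>
  </div>
</div>
"""


def find_button_for(root, shop):
    """Return the button inside the card whose title equals shop, or None."""
    for record in root.findall(".//div[@class='o_kanban_record']"):
        # The leading "." keeps both lookups inside this one card.
        title = record.find(".//div[@class='o_primary']")
        if title is not None and (title.text or "").strip() == shop:
            return record.find(".//button[@data-name='open_session_cb']")
    return None


root = ET.fromstring(SAMPLE)
button = find_button_for(root, "5dbnhpbwuny6rmr65h86")
print(button.text)  # New Session B
```

With Selenium the same pattern would mean calling the element's own find method with an XPath that starts with .// rather than //.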
Get all divs and buttons:
divs = driver.find_elements_by_css_selector(".o_primary")
buttons = driver.find_elements_by_css_selector(".btn.btn-default.oe_kanban_action.oe_kanban_action_button")
Go through the list of div elements, find the one you need, and click the corresponding button:
for div, button in zip(divs, buttons):
    if div.text == "5dbnhpbwuny6rmr65h86":
        button.click()
Have you heard of Splinter? It is an abstraction layer on top of Selenium that lets you find elements by text: https://splinter.readthedocs.io/en/latest/finding.html
driver.find_by_text('5dbnhpbwuny6rmr65h86')
find_by_text returns a list of elements, so it should be:
element = driver.find_by_text('5dbnhpbwuny6rmr65h86').first
element.find_by_xpath("//button[contains(text() , 'New Session')]").first.click()
Note: untested.
I want to crawl the HTML as it appears in the Elements view of Chrome Developer Tools, not the raw page source.
I fetched the HTML in Python.
- This is the Python code I tried:
#-*- coding: utf-8 -*-
import requests
url = 'http://m.entertain.naver.com/comment/list?page=2&gno=news117%2C0002600716&sort=newest&aid=0002600716&oid=117'
response = requests.get(url)
print(response.content)
But only the page source was crawled.
- This is the page source returned by the Python code above:
...
<div id="social-comment">
<ul class="cmt_lst">
<li class="ld"><span>로딩중입니다.</span></li>
</ul>
</div>
...
I want to crawl the Elements view using Python.
- This is the Elements view in Chrome Developer Tools (the HTML I want to crawl):
...
<div id="social-comment">
<div class="sc_cmt_wrp" queryid="C1431107741291317890" style="display: block;">
<div id="tabArea">
<ul class="cmt_tab">
<li style="width:50%" class="on">
댓글 <span class="_count">22</span>
</li>
<li>댓글쓰기</li>
</ul> </div>
<div id="sortOptionArea" style="display: block;">
<div class="cmt_choice">
<input type="radio" name="scmt-sort" id="newest" class="_scmt_sort(newest) _nclicks(rpt.rct)">
<label for="newest" title="최신순" class="_scmt_sort(newest) _nclicks(rpt.rct) on" onclick="javascript:;">최신순</label>
<input type="radio" name="scmt-sort" id="oldest" class="_scmt_sort(oldest) _nclicks(rpt.old)">
<label for="oldest" title="과거순" class="_scmt_sort(oldest) _nclicks(rpt.old)" onclick="javascript:;">과거순</label>
<input type="radio" name="scmt-sort" id="likability" class="_scmt_sort(likability) _nclicks(rpt.rcm)">
<label for="likability" title="호감순" class="_scmt_sort(likability) _nclicks(rpt.rcm)" onclick="javascript:;">호감순</label>
<input type="radio" name="scmt-sort" id="replycount" class="_scmt_sort(replycount) _nclicks(rpt.rpl)">
<label for="replycount" title="답글많은순" class="_scmt_sort(replycount) _nclicks(rpt.rpl)" onclick="javascript:;">답글많은순</label>
</div> </div>
<div id="noticeArea"></div>
<div id="commentListPaginationArea"> <div class="_noComments _refreshable"></div> <div class="_commentClosed _refreshable"></div> <div id="commentItemArea" class="_refreshable"><ul class="cmt_lst"> <li id="scmt-item-1149361" class=""> ldo1**** <p> 서인영 아직도 존예...하....자야는데..</p> <div class="func"> <span class="time">2015.04.28 오후 11:49</span> <span class="mobile">모바일에서 작성</span> | 신고 </div> <div class="btn_area2"> <div> 답글 <strong>0</strong> </div> <div> 10 7 </div> </div> </li> <li id="scmt-item-1149360" class=""> dbgh**** <p> 1빠 댓글수채우기.</p> <div class="func"> <span class="time">2015.04.28 오후 11:49</span> <span class="mobile">모바일에서 작성</span> | 신고 </div> <div class="btn_area2"> <div> 답글 <strong>0</strong> </div> <div> 1 5 </div> </div> </li> </ul></div> <div id="paginationArea" style="display: block;"> <div class="cmt_pg"> <span class="cmt_pg_prev">이전</span> <span class="cmt_pg_btn uc_vh scmt-page-prev-off" style="display: none;"><span class="cmt_pg_prev">이전</span></span> <em class="cmt_pg_pg _pageInfo">21 - 22 <span class="u_vc">페이지 </span><span class="cmt_pg_total">/ <span class="u_vc">총 </span>22<span class="u_vc"> 페이지</span></span></em> <span class="cmt_pg_btn uc_vh scmt-page-next-off" style="display: inline-block;"><span class="cmt_pg_next">다음</span></span> <span class="cmt_pg_next">다음</span> </div></div> </div></div></div>
...
I don't know JavaScript. If possible, please tell me an easy way to do this (Python, or wget and curl on Linux).
I tried the following to locate the element, but I get a "No element found" message when I run my script.
Method 1 tried:
self.driver.find_element_by_xpath("//button[text()='Adopt and Initial']").click()
Method 2 tried:
self.driver.find_element_by_css_selector(".btn-primary.btn.left.item-alt").click()
HTML of the button (updated HTML for the element; this is a DocuSign dialog):
<div class="dialog is-signature-mode" tabindex="0">
<header class="dialog-header">
<h1 class="dialog-title">
<span class="item-alt" data-group="tagType" data-group-item="signature">Adopt Your Signature</span>
<span class="item-alt" data-group="tagType" data-group-item="initials" data-selected="">Adopt Your Initials</span>
</h1>
<nav class="icons">
<a class="close" data-action="cancelAdoptSignature">
<i class="icon-close"></i>
</a>
</nav>
</header>
<section class="dialog-body">
<article id="adopt">
<header class="ds-title p">
Confirm your name, initials, and signature.
</header>
<div class="full-name">
<div class="wrapper">
<label for="full-name">Full Name</label> <span class="error hidden">Name required</span>
<br>
<div class="text-input-wrapper">
<input id="full-name" disabled="" value="QAAuto 01Dec2014_10.41.03" name="fullname" type="text" class="required text-input" maxlength="50">
</div>
</div>
</div>
<div class="initials">
<div class="wrapper">
<label for="initials">Initials</label> <span class="error hidden">Initials required</span>
<br>
<div class="text-input-wrapper">
<input id="initials" disabled="" value="Q0" name="initials" type="text" class="required text-input" maxlength="50">
</div>
</div>
</div>
<div class="clear-float"></div>
</article>
<header class="tab-nav">
<ul>
<li>Select Style</li>
<li>Draw</li>
</ul>
</header>
<article id="select-style" class="tab-panel panel-select-style selected">
<h4 class="normal">Preview <span class="error"></span></h4>
<div class="signature-preview">
<div class="signature"><img alt="" src="https://demo.docusign.net/Signing/image.aspx?ti=56b2faad38e7427a99defd1dfaa258ce&insession=1&i=asig150&force=154&s=QAAuto+01Dec2014_10.41.03&f=7_DocuSign&nochrome=0" height="75px" class="signature-img left">
<img alt="" src="https://demo.docusign.net/Signing/image.aspx?ti=56b2faad38e7427a99defd1dfaa258ce&insession=1&i=ainit150&force=155&s=Q0&n=QAAuto+01Dec2014_10.41.03&f=7_DocuSign&nochrome=0" height="75px" class="initials-img right">
<div class="clear-float"></div></div>
<a class="change-style">
Change Style
</a>
<div class="clear-float"></div>
</div>
</article>
<article id="draw" class="tab-panel panel-draw">
<h4 class="normal">
<span class="item-alt-inline" data-group="tagType" data-group-item="signature">Draw your signature</span>
<span class="item-alt-inline" data-group="tagType" data-group-item="initials" data-selected="">Draw your initials</span>
<span class="error"></span>
</h4>
<a class="clear" data-ds="clear">Clear</a>
<div class="signature-draw signature">
<div class="canvas-wrapper"><canvas class="canvas" width="0" height="0"></canvas><canvas class="canvas" width="0" height="0"></canvas></div></div>
</article>
<p class="legalese">By clicking Adopt and Sign, I agree that the signature and initials will be the electronic representation of my signature and initials for all purposes when I (or my agent) use them on documents, including legally binding contracts - just the same as a pen-and-paper signature or initial.</p>
<hr>
<button class="btn-primary btn left item-alt" data-group="tagType" data-group-item="signature" type="button" data-ds="submit" value="initials">Adopt and Sign</button>
<button class="btn-primary btn left item-alt" data-group="tagType" data-group-item="initials" type="button" data-ds="submit" value="initials" data-selected="">Adopt and Initial</button>
<button class="close left btn btn-default" type="button" data-action="cancelAdoptSignature">Cancel</button>
<div class="clear-float"></div>
<div class="styles"></div></section>
</div>
Please try the XPath below and let me know what happens:
assert "Adopt and Initial" in self.driver.page_source
self.driver.find_element_by_xpath("//button[contains(text(),'Adopt and Initial')]").click()
Make sure the assertion runs before the click. If the assertion fails, the element is not present in the current frame, so you have to switch to the correct frame before performing the action.
Let me know if you try this.
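If the button does turn out to live inside a frame, the switching step could look roughly like the sketch below. It is untested against DocuSign and rests on several assumptions: the helper name click_in_any_frame is made up, the frames are plain iframe elements one level deep, and the driver follows the Selenium 4 style find_elements(by, value) call, where the locator strategies are the strings "xpath" and "tag name" (the values of By.XPATH and By.TAG_NAME).

```python
def click_in_any_frame(driver, xpath):
    """Click the first element matching xpath in the top document or in one
    of its direct iframes. Returns True if something was clicked.

    Hypothetical helper; `driver` is assumed to be a Selenium WebDriver.
    """
    # Start from the top-level document.
    driver.switch_to.default_content()
    matches = driver.find_elements("xpath", xpath)
    if matches:
        matches[0].click()
        return True

    # Otherwise try each iframe of the top-level document in turn.
    for frame in driver.find_elements("tag name", "iframe"):
        driver.switch_to.frame(frame)
        matches = driver.find_elements("xpath", xpath)
        if matches:
            # The driver stays inside this frame after the click.
            matches[0].click()
            return True
        driver.switch_to.default_content()
    return False


# Usage (needs a live driver):
# click_in_any_frame(self.driver, "//button[contains(text(),'Adopt and Initial')]")
```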