I have the following HTML and I'm trying to get the "clients" for each specific "date",
but I only get the clients from the first following element:
<div class="info">
<div class="left-wrap"><span class="date">DATE-1</span></div>
</div>
<div class="clients-list">
<div>
<span class="client" >client1</span>
<span class="client" >client2</span>
<span class="client" >client3</span>
</div>
</div>
<div class="clients-list">
<div>
<span class="client" >client4</span>
<span class="client" >client5</span>
<span class="client" >client6</span>
</div>
</div>
<div class="info">
<div class="left-wrap"><span class="date" >DATE-2</span></div>
</div>
<div class="clients-list">
<div>
<span class="client" >client7</span>
<span class="client" >client8</span>
</div>
</div>
<div class="clients-list">
<div>
<span class="client" >client9</span>
<span class="client" >client10</span>
</div>
</div>
<div class="clients-list">
<div>
<span class="client" >client11</span>
<span class="client" >client12</span>
</div>
</div>
I'm using the following code:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
dates = soup.find_all(class_='date')
for date in dates:
    print(date.text)
    for item in date.find_next(class_='clients-list').find_all(class_='client'):
        print(item.text)
The output I get is:
DATE-1
client1
client2
client3
DATE-2
client7
client8
I tried with find_all_next, but got the same output.
This is a bit tricky, but it produces the output you want. Use find_next_siblings():
from bs4 import BeautifulSoup
html='''<div class="info">
<div class="left-wrap"><span class="date">DATE-1</span></div>
</div>
<div class="clients-list">
<div>
<span class="client" >client1</span>
<span class="client" >client2</span>
<span class="client" >client3</span>
</div>
</div>
<div class="clients-list">
<div>
<span class="client" >client4</span>
<span class="client" >client5</span>
<span class="client" >client6</span>
</div>
</div>
<div class="info">
<div class="left-wrap"><span class="date" >DATE-2</span></div>
</div>
<div class="clients-list">
<div>
<span class="client" >client7</span>
<span class="client" >client8</span>
</div>
</div>
<div class="clients-list">
<div>
<span class="client" >client9</span>
<span class="client" >client10</span>
</div>
</div>
<div class="clients-list">
<div>
<span class="client" >client11</span>
<span class="client" >client12</span>
</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
dates = soup.find_all(class_='date')
for date in dates:
    print(date.text)
    # date.parent.parent is the enclosing <div class="info">
    for item in date.parent.parent.find_next_siblings(class_='clients-list'):
        # keep only the lists whose nearest preceding "info" block holds this date
        if item.find_previous_sibling(class_='info').find_next(class_='date').text == date.text:
            for client in item.find_all(class_='client'):
                print(client.text)
Output:
DATE-1
client1
client2
client3
client4
client5
client6
DATE-2
client7
client8
client9
client10
client11
client12
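The same grouping can also be done in a single pass per date, stopping at the next "info" block instead of re-checking every later list. This is only a sketch under the assumption that the "info" and "clients-list" divs are direct siblings, as in the question; the shortened HTML sample and the name clients_by_date are illustrative, not from the original post.

```python
from bs4 import BeautifulSoup

# Shortened sample with the same sibling layout as the question's HTML
html = '''
<div class="info"><div class="left-wrap"><span class="date">DATE-1</span></div></div>
<div class="clients-list"><div><span class="client">client1</span></div></div>
<div class="clients-list"><div><span class="client">client2</span></div></div>
<div class="info"><div class="left-wrap"><span class="date">DATE-2</span></div></div>
<div class="clients-list"><div><span class="client">client3</span></div></div>
'''

soup = BeautifulSoup(html, 'html.parser')
clients_by_date = {}
for info in soup.find_all('div', class_='info'):
    date = info.find(class_='date').get_text(strip=True)
    clients = clients_by_date.setdefault(date, [])
    # Walk the following sibling tags until the next "info" block starts
    for sibling in info.find_next_siblings():
        classes = sibling.get('class', [])
        if 'info' in classes:
            break
        if 'clients-list' in classes:
            clients.extend(c.get_text(strip=True)
                           for c in sibling.find_all(class_='client'))

print(clients_by_date)
```

Collecting into a dict keeps each date's clients together, which may be handier than printing if the data is needed later.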
I'm struggling to scrape a few pages. It happens when the structure of the page involves a lot of nested divs.
Here is the page's HTML:
<div>
<section class="ui-accordion-header ui-state-default ui-corner-all ui-accordion-icons" role="tab" id="ui-id-1" aria-controls="ui-id-2" aria-selected="false" aria-expanded="false" tabindex="0"><span class="ui-accordion-header-icon ui-icon ui-icon-triangle-1-e"></span>
<div class="detail-avocat">
<div class="nom-avocat">Me <span class="avocat_name">NAME </span></div>
<div class="type-avocat">Avocat postulant au Tribunal Judiciaire</div>
</div>
<div class="more-info">Plus d'informations</div>
</section>
<div class="ui-accordion-content ui-helper-reset ui-widget-content ui-corner-bottom" style="display: none;" id="ui-id-2" aria-labelledby="ui-id-1" role="tabpanel" aria-hidden="true">
<div class="details">
<div class="detail-avocat-row ">
<div class="detail-avocat-content overflow-h">
<span>Structure :</span>
<div>
<p>Cabinet individuel NAME</p>
</div>
</div>
</div>
<div class="detail-avocat-row ">
<div class="detail-avocat-content overflow-h">
<span>Adresse :</span>
<div>
<p>21 rue Belle Isle 57000 VILLE</p>
</div>
</div>
</div>
<div class="detail-avocat-row ">
<div class="detail-avocat-content overflow-h">
<span>Mail :</span>
<div>
<p>cabinet@mail.fr</p>
</div>
</div>
</div>
<div class="detail-avocat-row">
<div class="detail-avocat-content overflow-h">
<span>Tél :</span>
<div>
<p>Telnum</p>
</div>
</div>
</div>
<div class="detail-avocat-row">
<div class="detail-avocat-content overflow-h">
<span>Fax :</span>
<div>
<p> </p>
</div>
</div>
</div>
<div class="contact-avocat"> Contacter </div>
</div>
</div>
</div>
And here is my Python code:
divtel = self.driver.find_elements(by=By.XPATH,
                                   value='//div[@class="detail-avocat-content overflow-h"]/div/p')
for p in divtel:
    print(p.text)
It doesn't print anything. On other similar pages it prints the text, but in this case it doesn't, although there is text in the nested span and div/p. Do you know why?
How can I resolve this, please?
Thank you.
The .text property only returns text when the web element containing it is visible on the page. Here the accordion panel has style="display: none;", so its content is hidden. For a hidden element you have to use .get_attribute('innerText'), .get_attribute('textContent') or .get_attribute('innerHTML') instead (the three differ in what they return; innerHTML, for instance, includes the markup). So for example change
print(p.text)
to
print(p.get_attribute('innerText'))
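If the same loop has to handle both visible and hidden elements, the two lookups can be combined in a small helper. This is a minimal sketch; element_text is a hypothetical name, and it only assumes the element exposes .text and .get_attribute() the way a Selenium WebElement does.

```python
def element_text(element):
    """Return an element's text whether or not it is currently displayed."""
    visible = element.text
    if visible.strip():
        return visible.strip()
    # textContent is the node's raw text and does not depend on CSS visibility
    hidden = element.get_attribute('textContent')
    return (hidden or '').strip()
```

In the question's loop this would be used as print(element_text(p)) in place of print(p.text).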
The HTML is below. If the span value is less than 20%, I want to remove everything from the span up to and including its <div class="action"> ancestor, and nothing above that.
So for example:
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 5% </span>
</div>
</div>
</div>
</div>
From the above HTML, only this code should be removed:
<div class="action">
<div class="content">
<span class="content-name"> 5% </span>
</div>
</div>
So what should be left is:
<div class="item">
<div class="info">
</div>
</div>
This is my current Python code:
import re
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

items = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[@class='content-name']")))
for item in items:
    percentage_text = re.findall(r"\d+", item.text)[0]
    if int(percentage_text) <= 20:
        driver.execute_script("arguments[0].remove();", item)
But it only removes the span and not its ancestors.
Here is the full HTML. I think it needs JavaScript to remove the elements, but I am very new to JavaScript; I researched for more than two hours and still can't find a solution. Thank you very much.
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 5% </span>
</div>
</div>
</div>
</div>
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 95% </span>
</div>
</div>
</div>
</div>
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 32% </span>
</div>
</div>
</div>
</div>
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 15% </span>
</div>
</div>
</div>
</div>
Go up to the parent of the parent (the span's parent is <div class="content">, and its parent is <div class="action">):
driver.execute_script("arguments[0].parentElement.parentElement.remove();", item)
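If the number of levels between the span and <div class="action"> might vary, one variant is to let the browser find the ancestor with the DOM method Element.closest(). The following is a minimal sketch, not the answer's method: parse_percentage, remove_low_items and REMOVE_ACTION_JS are hypothetical names, and driver and items are assumed to be the Selenium driver and the list of spans from the question.

```python
import re

# JavaScript run in the browser: climb from the span to the nearest
# ancestor matching div.action and remove that whole subtree.
REMOVE_ACTION_JS = (
    "const box = arguments[0].closest('div.action');"
    "if (box) { box.remove(); }"
)


def parse_percentage(text):
    """Return the first integer found in text, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def remove_low_items(driver, items, threshold=20):
    """Remove the div.action ancestor of every item at or below threshold."""
    removed = 0
    for item in items:
        value = parse_percentage(item.text)
        if value is not None and value <= threshold:
            driver.execute_script(REMOVE_ACTION_JS, item)
            removed += 1
    return removed
```

The comparison uses <= to match the question's code; change it to < if exactly 20% should be kept.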
I have the HTML below. I want to extract all the times from the page and store them in a list variable. How can I do that? Please help.
<div class=panchang-box-secondary-header>
<div class="list-wrapper pl-2">
<div class="list-style-thumbnail list-layout-horizontal">
<div class="list-item-outer py-2">
<div class="d-flex w-100 align-items-center">
<span class="icon-sprite icon-sprite-sunrise"></span>
<div class=flex-grow-1>
<span class="d-block t-sm">सूर्योदय</span>
<span class="d-block b">5:31 AM</span>
</div>
</div>
</div>
<div class="list-item-outer py-2">
<div class="d-flex w-100 align-items-center">
<span class="icon-sprite icon-sprite-sunset"></span>
<div class=flex-grow-1>
<span class="d-block t-sm">सूर्यास्त</span>
<span class="d-block b">7:24 PM</span>
</div>
</div>
</div>
<div class="list-item-outer py-2">
<div class="d-flex w-100 align-items-center">
<span class="icon-sprite icon-sprite-moonrise"></span>
<div class=flex-grow-1>
<span class="d-block t-sm">चन्द्रोदय</span>
<span class="d-block b">10:05 PM</span>
</div>
</div>
</div>
<div class="list-item-outer py-2">
<div class="d-flex w-100 align-items-center">
<span class="icon-sprite icon-sprite-moonset"></span>
<div class=flex-grow-1>
<span class="d-block t-sm">चन्द्रास्त</span>
<span class="d-block b">9:12 AM</span>
</div>
</div>
</div>
Try using this:
from bs4 import BeautifulSoup
a = '''<div class=panchang-box-secondary-header>
<div class="list-wrapper pl-2">
<div class="list-style-thumbnail list-layout-horizontal">
<div class="list-item-outer py-2">
<div class="d-flex w-100 align-items-center">
<span class="icon-sprite icon-sprite-sunrise"></span>
<div class=flex-grow-1>
<span class="d-block t-sm">सूर्योदय</span>
<span class="d-block b">5:31 AM</span>
</div>
</div>
</div>
<div class="list-item-outer py-2">
<div class="d-flex w-100 align-items-center">
<span class="icon-sprite icon-sprite-sunset"></span>
<div class=flex-grow-1>
<span class="d-block t-sm">सूर्यास्त</span>
<span class="d-block b">7:24 PM</span>
</div>
</div>
</div>
<div class="list-item-outer py-2">
<div class="d-flex w-100 align-items-center">
<span class="icon-sprite icon-sprite-moonrise"></span>
<div class=flex-grow-1>
<span class="d-block t-sm">चन्द्रोदय</span>
<span class="d-block b">10:05 PM</span>
</div>
</div>
</div>
<div class="list-item-outer py-2">
<div class="d-flex w-100 align-items-center">
<span class="icon-sprite icon-sprite-moonset"></span>
<div class=flex-grow-1>
<span class="d-block t-sm">चन्द्रास्त</span>
<span class="d-block b">9:12 AM</span>
</div>
</div>
</div>'''
soup = BeautifulSoup(a, 'html.parser')
time_tags = soup.select('.d-block.b')
times = [tag.text for tag in time_tags]
print(times)
Output:
['5:31 AM', '7:24 PM', '10:05 PM', '9:12 AM']
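Just select the elements with class "d-block b" and put their text wherever you want.
times = [tag.text for tag in soup.find_all(class_="d-block b")]
This builds a list of all the times in the page source and stores it in the variable times. (find_all returns a list of tags, so .text has to be read from each tag, not from the result itself.)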
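If it helps to know which time is which, each time can be paired with the event named in the row's icon class. This is only a sketch: it assumes every row keeps the "list-item-outer" wrapper and an "icon-sprite-<event>" class as in the question's HTML, and the names PREFIX and times_by_event are illustrative.

```python
from bs4 import BeautifulSoup

# Two rows copied from the question's HTML
html = '''<div class="list-item-outer py-2">
<div class="d-flex w-100 align-items-center">
<span class="icon-sprite icon-sprite-sunrise"></span>
<div class=flex-grow-1>
<span class="d-block t-sm">सूर्योदय</span>
<span class="d-block b">5:31 AM</span>
</div>
</div>
</div>
<div class="list-item-outer py-2">
<div class="d-flex w-100 align-items-center">
<span class="icon-sprite icon-sprite-sunset"></span>
<div class=flex-grow-1>
<span class="d-block t-sm">सूर्यास्त</span>
<span class="d-block b">7:24 PM</span>
</div>
</div>
</div>'''

PREFIX = 'icon-sprite-'
soup = BeautifulSoup(html, 'html.parser')
times_by_event = {}
for row in soup.select('.list-item-outer'):
    icon = row.select_one('.icon-sprite')
    time_tag = row.select_one('.d-block.b')
    if icon is None or time_tag is None:
        continue
    # e.g. ['icon-sprite', 'icon-sprite-sunrise'] -> ['sunrise']
    names = [c[len(PREFIX):] for c in icon.get('class', []) if c.startswith(PREFIX)]
    if names:
        times_by_event[names[0]] = time_tag.get_text(strip=True)

print(times_by_event)
```

A plain list of times is still available as list(times_by_event.values()).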
Here is the outer HTML:
<div class="container">
<div class="header">
<div id="logo"></div>
<div class="feedback_wrap">
<span id="rating-ask" style="display: inline-block;">
<span class="ext_text">Rate us</span>
<br>
<span class="stars">
<a class="star" data-pos="1">
<span class="empty" style="display: inline;"></span>
<span class="filled" style="display: none;"></span>
</a>
<a class="star" data-pos="2">
<span class="empty" style="display: inline;"></span>
<span class="filled" style="display: none;"></span>
</a>
<a class="star" data-pos="3">
<span class="empty" style="display: inline;"></span>
<span class="filled" style="display: none;"></span>
</a>
<a class="star" data-pos="4">
<span class="empty" style="display: inline;"></span>
<span class="filled" style="display: none;"></span>
</a>
<a class="star" data-pos="5">
<span class="empty" style="display: inline;"></span>
<span class="filled" style="display: none;"></span>
</a>
</span>
</span>
</div>
<div class="device_toggler_wrap">
<div class="device_toggler_wrap2">
<span class="phone_icon"></span>
<div class="device_toggler">
<label class="switch">
<input type="checkbox" id="deviceswitch">
<span class="slider round"></span>
</label>
</div>
<span class="tablet_icon"></span>
</div>
</div>
<div class="device_color_switcher">
<div class="device_color_switcher_row">
<div class="color color_1" data-color="#FDD8D5"></div>
<div class="color color_2" data-color="#F1D9BF"></div>
<div class="color color_3" data-color="#E6E7E9"></div>
</div>
<div class="device_color_switcher_row">
<div class="color color_4" data-color="#35383D"></div>
<div class="color color_5" data-color="#000000"></div>
<div class="color color_6" data-color="#873239"></div>
</div>
<div class="device_color_switcher_row">
<div class="color color_7" data-color="#54E9DD"></div>
<div class="color color_8" data-color="#28B3D4"></div>
<div class="color color_9" data-color="#A7A7A7"></div>
</div>
</div>
</div>
<div class="phone_wrap">
<div class="phone" style="border-color: rgb(230, 231, 233); width: 340px; max-height: 650px; border-width: 28px 7px 50px;">
<div class="arrow_left">
<span class="ext_icon"></span>
</div>
<div class="arrow_right">
<span class="ext_icon"></span>
</div>
<div class="label">
<span class="text" style="color: rgb(0, 0, 0);"></span>
</div>
<div class="insta_loading" style="display: none;">
<i class="icon"></i>
<span class="text">Loading...</span>
</div>
<div id="iframe_wrap" style="width: 340px; opacity: 1;"><iframe frameborder="0"></iframe></div>
<div class="circle"></div>
</div>
</div>
</div>
Here is the inner iframe:
https://gist.github.com/ishandutta2007/67c1698b34e58634b9a855051b67fdfd
This is my effort:
iframe = browser.find_element_by_tag_name("iframe")
browser.switch_to.frame(iframe)
try:
    browser.find_element_by_css_selector("form > input").send_keys("/my/path/to/sample.jpg")
except Exception as e:
    print(e)
No exception is thrown, but nothing happens either.
I want to crawl the HTML data as shown in the Elements view of Chrome Developer Tools, not the Page view.
I crawled the HTML data in Python.
- The following is the Python code I tried.
#-*- coding: utf-8 -*-
import requests
url = 'http://m.entertain.naver.com/comment/list?page=2&gno=news117%2C0002600716&sort=newest&aid=0002600716&oid=117'
response = requests.get(url)
print(response.content)
But only the Page view was crawled.
- The following is the Page view crawled with the above Python code.
...
<div id="social-comment">
<ul class="cmt_lst">
<li class="ld"><span>로딩중입니다.</span></li>
</ul>
</div>
...
I want to crawl the Elements view using Python.
- The following is the Elements view in Chrome Developer Tools (this is the HTML data I want to crawl).
...
<div id="social-comment">
<div class="sc_cmt_wrp" queryid="C1431107741291317890" style="display: block;">
<div id="tabArea">
<ul class="cmt_tab">
<li style="width:50%" class="on">
댓글 <span class="_count">22</span>
</li>
<li>댓글쓰기</li>
</ul> </div>
<div id="sortOptionArea" style="display: block;">
<div class="cmt_choice">
<input type="radio" name="scmt-sort" id="newest" class="_scmt_sort(newest) _nclicks(rpt.rct)">
<label for="newest" title="최신순" class="_scmt_sort(newest) _nclicks(rpt.rct) on" onclick="javascript:;">최신순</label>
<input type="radio" name="scmt-sort" id="oldest" class="_scmt_sort(oldest) _nclicks(rpt.old)">
<label for="oldest" title="과거순" class="_scmt_sort(oldest) _nclicks(rpt.old)" onclick="javascript:;">과거순</label>
<input type="radio" name="scmt-sort" id="likability" class="_scmt_sort(likability) _nclicks(rpt.rcm)">
<label for="likability" title="호감순" class="_scmt_sort(likability) _nclicks(rpt.rcm)" onclick="javascript:;">호감순</label>
<input type="radio" name="scmt-sort" id="replycount" class="_scmt_sort(replycount) _nclicks(rpt.rpl)">
<label for="replycount" title="답글많은순" class="_scmt_sort(replycount) _nclicks(rpt.rpl)" onclick="javascript:;">답글많은순</label>
</div> </div>
<div id="noticeArea"></div>
<div id="commentListPaginationArea"> <div class="_noComments _refreshable"></div> <div class="_commentClosed _refreshable"></div> <div id="commentItemArea" class="_refreshable"><ul class="cmt_lst"> <li id="scmt-item-1149361" class=""> ldo1**** <p> 서인영 아직도 존예...하....자야는데..</p> <div class="func"> <span class="time">2015.04.28 오후 11:49</span> <span class="mobile">모바일에서 작성</span> | 신고 </div> <div class="btn_area2"> <div> 답글 <strong>0</strong> </div> <div> 10 7 </div> </div> </li> <li id="scmt-item-1149360" class=""> dbgh**** <p> 1빠 댓글수채우기.</p> <div class="func"> <span class="time">2015.04.28 오후 11:49</span> <span class="mobile">모바일에서 작성</span> | 신고 </div> <div class="btn_area2"> <div> 답글 <strong>0</strong> </div> <div> 1 5 </div> </div> </li> </ul></div> <div id="paginationArea" style="display: block;"> <div class="cmt_pg"> <span class="cmt_pg_prev">이전</span> <span class="cmt_pg_btn uc_vh scmt-page-prev-off" style="display: none;"><span class="cmt_pg_prev">이전</span></span> <em class="cmt_pg_pg _pageInfo">21 - 22 <span class="u_vc">페이지 </span><span class="cmt_pg_total">/ <span class="u_vc">총 </span>22<span class="u_vc"> 페이지</span></span></em> <span class="cmt_pg_btn uc_vh scmt-page-next-off" style="display: inline-block;"><span class="cmt_pg_next">다음</span></span> <span class="cmt_pg_next">다음</span> </div></div> </div></div></div>
...
I don't know JavaScript. If possible, please tell me an easy way (Python, or wget and curl on Linux).