I want to crawl the web page HTML (all elements) in Python

I want to crawl the comments on a news article (http://m.entertain.naver.com/comment/list?page=2&gno=news117%2C0002600716&sort=newest&aid=0002600716&oid=117).
But crawling with requests failed.
The result contains only a placeholder <div id="social-comment"> (its text means "Loading.") with no comments in it.
I want to crawl the full HTML of the page (all elements).
Here is the Python code I tried:
#-*- coding: utf-8 -*-
import requests
url = 'http://m.entertain.naver.com/comment/list?page=2&gno=news117%2C0002600716&sort=newest&aid=0002600716&oid=117'
response = requests.get(url)
print(response.content)
Failed html:
...
<div id="social-comment">
<ul class="cmt_lst">
<li class="ld"><span>로딩중입니다.</span></li>
</ul>
</div>
...
Expected html:
...
<div id="social-comment">
<div class="sc_cmt_wrp" queryid="C1431107741291317890" style="display: block;">
<div id="tabArea">
<ul class="cmt_tab">
<li style="width:50%" class="on">
댓글 <span class="_count">22</span>
</li>
<li>댓글쓰기</li>
</ul> </div>
<div id="sortOptionArea" style="display: block;">
<div class="cmt_choice">
<input type="radio" name="scmt-sort" id="newest" class="_scmt_sort(newest) _nclicks(rpt.rct)">
<label for="newest" title="최신순" class="_scmt_sort(newest) _nclicks(rpt.rct) on" onclick="javascript:;">최신순</label>
<input type="radio" name="scmt-sort" id="oldest" class="_scmt_sort(oldest) _nclicks(rpt.old)">
<label for="oldest" title="과거순" class="_scmt_sort(oldest) _nclicks(rpt.old)" onclick="javascript:;">과거순</label>
<input type="radio" name="scmt-sort" id="likability" class="_scmt_sort(likability) _nclicks(rpt.rcm)">
<label for="likability" title="호감순" class="_scmt_sort(likability) _nclicks(rpt.rcm)" onclick="javascript:;">호감순</label>
<input type="radio" name="scmt-sort" id="replycount" class="_scmt_sort(replycount) _nclicks(rpt.rpl)">
<label for="replycount" title="답글많은순" class="_scmt_sort(replycount) _nclicks(rpt.rpl)" onclick="javascript:;">답글많은순</label>
</div> </div>
<div id="noticeArea"></div>
<div id="commentListPaginationArea"> <div class="_noComments _refreshable"></div> <div class="_commentClosed _refreshable"></div> <div id="commentItemArea" class="_refreshable"><ul class="cmt_lst"> <li id="scmt-item-1149361" class=""> ldo1**** <p> 서인영 아직도 존예...하....자야는데..</p> <div class="func"> <span class="time">2015.04.28 오후 11:49</span> <span class="mobile">모바일에서 작성</span> | 신고 </div> <div class="btn_area2"> <div> 답글 <strong>0</strong> </div> <div> 10 7 </div> </div> </li> <li id="scmt-item-1149360" class=""> dbgh**** <p> 1빠 댓글수채우기.</p> <div class="func"> <span class="time">2015.04.28 오후 11:49</span> <span class="mobile">모바일에서 작성</span> | 신고 </div> <div class="btn_area2"> <div> 답글 <strong>0</strong> </div> <div> 1 5 </div> </div> </li> </ul></div> <div id="paginationArea" style="display: block;"> <div class="cmt_pg"> <span class="cmt_pg_prev">이전</span> <span class="cmt_pg_btn uc_vh scmt-page-prev-off" style="display: none;"><span class="cmt_pg_prev">이전</span></span> <em class="cmt_pg_pg _pageInfo">21 - 22 <span class="u_vc">페이지 </span><span class="cmt_pg_total">/ <span class="u_vc">총 </span>22<span class="u_vc"> 페이지</span></span></em> <span class="cmt_pg_btn uc_vh scmt-page-next-off" style="display: inline-block;"><span class="cmt_pg_next">다음</span></span> <span class="cmt_pg_next">다음</span> </div></div> </div></div></div>
...

The page you're trying to read loads its comments with an XHR request after the initial page load. So unless you use a tool that emulates a full browser (executing JavaScript and loading external resources), the HTML you download won't contain the comments.
The comments are loaded by a POST request sent to http://m.entertain.naver.com/api/comment/list.json
That returns a JSON object with all the data you're looking for.
Because it is a POST request, it expects form data to be sent along with it. In my testing, the minimal information you need to provide appears to be:
gno : news117,0002600716
pageSize : 2000 (20 is the default. 2,000 will probably return all of the comments in most cases, but adapt it as you see fit.)
sort : newest
Sent urlencoded, which is how form data normally travels in the body of a POST request, a request looks like gno=news117%2C0002600716&page=2&sort=newest&pageSize=20&serviceId=news (this example also carries page and serviceId, and uses the default pageSize of 20).
In Python, the POST can be made with a library such as requests (already used in the question), and the returned JSON can be parsed with the standard json module.
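As a rough sketch of the POST described above: the endpoint and the three form fields come from the answer, while the helper names `build_payload` and `fetch_comments` are made up for illustration. Whether the endpoint still responds, and what the returned JSON looks like, is not something this sketch can promise. The site may also require `page` or `serviceId`, which appear in the encoded example but not in the minimal list.

```python
from urllib.parse import urlencode

# Endpoint named in the answer; it may no longer exist or may have changed.
API_URL = "http://m.entertain.naver.com/api/comment/list.json"


def build_payload(gno, page_size=20, sort="newest"):
    """Return the form fields the answer lists as the minimum required."""
    return {"gno": gno, "pageSize": page_size, "sort": sort}


def fetch_comments(gno, page_size=20, sort="newest"):
    """POST the form fields and return the decoded JSON body.

    The structure of the returned object is not documented here,
    so inspect it before relying on any particular key.
    """
    import requests  # third-party; imported lazily so build_payload works without it

    response = requests.post(
        API_URL,
        data=build_payload(gno, page_size, sort),  # data= sends the fields urlencoded
        timeout=10,
    )
    response.raise_for_status()
    return response.json()


# The urlencoded form of the payload, as it would travel in the request body.
encoded = urlencode(build_payload("news117,0002600716"))
```

Calling `fetch_comments("news117,0002600716", page_size=2000)` would then correspond to the larger page size suggested above.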

Related

How to fetch a specific value from HTML using BeautifulSoup

I am trying to extract the percent value from an HTML file. I used the two methods below in a Jupyter notebook and got the expected result with both. But when I try to replicate the same thing in PyCharm, neither method gives the expected result: the first method picks up a different word, and the second returns None.
spans = soup.select_one('span').text
print("spans:", spans)
>> spans 99%
spans = soup.find("span", {"class": "rc_late"}).text
print("spans", spans)
>> spans 99%
Here is a snippet of the HTML. Is there any way to fetch that value (99%)?
<div class="content">
<h1>Latest report:
<span class="rc_late">99%</span>
</h1>
<aside id="help_panel_wrapper">
<input id="help_panel_state" type="checkbox">
<label for="help_panel_state">
<img id="keyboard_icon" src="keybd_closed.png" alt="Show/hide keyboard shortcuts">
</label>
<div id="help_panel">
<p class="legend">Shortcuts on this page</p>
<div class="keyhelp">
<p>
<kbd>n</kbd>
<kbd>s</kbd>
<kbd>m</kbd>
<kbd>x</kbd>
<kbd>c</kbd>
change column sorting
</p>
<p>
<kbd>[</kbd>
<kbd>]</kbd>
prev/next file
</p>
<p>
<kbd>?</kbd> show/hide this help
</p>
</div>
</div>
</aside>
<form id="filter_container">
<input id="filter" type="text" value="" placeholder="filter...">
</form>
<p class="text">
created at 2023-01-22 12:01 +0000
</p>
</div>
Here is one way to do it, given your example:
from bs4 import BeautifulSoup as bs
html = '''
<div class="content">
<h1>Latest report:
<span class="rc_late">99%</span>
</h1>
<aside id="help_panel_wrapper">
<input id="help_panel_state" type="checkbox">
<label for="help_panel_state">
<img id="keyboard_icon" src="keybd_closed.png" alt="Show/hide keyboard shortcuts">
</label>
<div id="help_panel">
<p class="legend">Shortcuts on this page</p>
<div class="keyhelp">
<p>
<kbd>n</kbd>
<kbd>s</kbd>
<kbd>m</kbd>
<kbd>x</kbd>
<kbd>c</kbd>
change column sorting
</p>
<p>
<kbd>[</kbd>
<kbd>]</kbd>
prev/next file
</p>
<p>
<kbd>?</kbd> show/hide this help
</p>
</div>
</div>
</aside>
<form id="filter_container">
<input id="filter" type="text" value="" placeholder="filter...">
</form>
<p class="text">
created at 2023-01-22 12:01 +0000
</p>
</div>
'''
soup = bs(html, 'html.parser')
data_we_want = soup.select_one('span[class="rc_late"]').text
print(data_we_want)
Result in terminal:
99%
You can find BeautifulSoup documentation here.

Add Up and Down Votes in bootstrap

I am creating a website where I want to implement up and down voting buttons. I am using Flask and Bootstrap for that. Can anyone tell me how to add up and down votes without using jQuery?
Bootstrap version is 4, so no glyphicons are available. Any help?
<div class="card">
<div class="card-body">
{{{{Need those buttons here}}}}
<div class="row" style="margin-left: 5px;">
{{posts[post]["user"]}}
<em> {{posts[post]["title"]}} </em>
</div>
<div class="row" style="margin-left: 5px;">
/user311
<a target="_blank" href={{ posts[post]["replies"]["link"] }}>{{ posts[post]["replies"]["head"] }} </a>
</div>
</div>
</div>
Just link to a Font Awesome CDN and then add this to your card-body div:
<div>
<i class="fa fa-caret-up fa-2x"></i>
</div>
<div>
<i class="fa fa-caret-down fa-2x"></i>
</div>
N.B. Change fa-2x according to the size you want for the icon.
jsfiddle: https://jsfiddle.net/9j0ggd5g/5/

Logging in to a website using Python's Requests module doesn't work on this site but works on other sites

The following code works fine on Twitter and GitHub, but it doesn't work on the main site I need it for. Can someone please tell me what went wrong?
I don't get any error; instead, it scrapes the login page back rather than logging me in.
import requests
session = requests.Session()
params = {'j_username': '**********', 'j_password': '************'}
r = requests.post("https://connect.data.com/loginProcess", params)
print("Cookie is set to:")
print(r.cookies.get_dict())
print("-----------")
print("Going to profile page...")
r = requests.get("https://connect.data.com/home")
print(r.text)
I am new to this, so I can't figure out what went wrong. I tried many answers before asking this question, but none seems to work for this site.
The login URL is /login, but the form is processed by /loginProcess, which is why I post to /loginProcess. However, /loginProcess does not print out my cookie, while /login does.
The site's form looks like this:
<form id="command" name="LoginForm" action="https://connect.data.com/loginProcess" method="post">
<div>
<div class="login-container fields float-left">
<div class="content">
<div class="first">
<span class="title">Login</span>
</div>
<div class="middle">
<input id="j_username" name="j_username" type="email" class="text" placeholder="Email" maxlength="128" tabindex="1">
</div>
<div class="middle">
<input id="j_password" name="j_password" type="password" class="text" placeholder="Password" autocomplete="off" tabindex="2" maxlength="128">
</div>
<div class="middle">
<label for="_spring_security_remember_me" class="general-checkbox-label">
<input name="_spring_security_remember_me" tabindex="3" value="on" id="_spring_security_remember_me" class="checkbox margin-0px" type="checkbox">
<span>Keep me logged in </span>
</label>
</div>
<div class="last">
<button id="login_btn" type="submit" class="button-standard button-primary" tabindex="4">
<span class="button-standard-text">Login</span>
</button>
<a class="link" href="https://connect.data.com/forgotpassword" onclick="var x=".tl(";s_objectID="https://connect.data.com/forgotpassword_1";return this.s_oc?this.s_oc(e):true">Forgot Password?</a>
</div>
</div>
</div>
<div class="login-container marketing-message float-right">
<div class="content">
<h2>
Don't have an account?
Sign up now - it's FREE!
</h2>
<div class="float-left">
<img class="login-bicons" src="./Data.com Connect Business Contact Directory of Business Contacts and Company Information_files/clear.cache.gif">
</div>
<div class="float-left margin-left-20px">
<p>
<span id="pageTitle" style="color: #146791">Find</span>
business card information & B2B professionals.
</p>
<p>
<span id="pageTitle" style="color: #87c540">Research</span>
people & companies.
</p>
<p class="margin-top-5px">
<span id="pageTitle" style="color: #f37521">Get</span>
millions of contacts.
</p>
</div>
<div class="clear"></div>
</div>
</div>
<div class="clear nonlogged-container-foot">
<span class="new-to-ddc">New to Data.com?</span>
<a class="sign-up" href="https://connect.data.com/registration/signup" onclick="var x=".tl(";s_objectID="https://connect.data.com/registration/signup_1";return this.s_oc?this.s_oc(e):true">Sign up for an account...</a>
</div>
</div>
<input type="hidden" name="CSRF_TOKEN" id="CSRF_TOKEN" value="ce56932e08fc97bc29c5f6535b572664a66000513143584504b0ee7c66ef9659"></form>
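Two details in the material above may be worth checking. The snippet creates a `requests.Session()` but then calls `requests.post` and `requests.get` directly, so cookies are not carried between the two calls. The form also contains a hidden `CSRF_TOKEN` field that the posted parameters leave out. The sketch below is only one possible approach under those assumptions: the helper names are hypothetical, the `/login` and `/loginProcess` URLs are taken from the question, and there is no guarantee the site accepts a login made this way.

```python
from html.parser import HTMLParser


class HiddenInputs(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> fields."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def hidden_fields(html):
    """Return a dict of the hidden form fields found in the given HTML."""
    parser = HiddenInputs()
    parser.feed(html)
    return parser.fields


def login(username, password):
    """Fetch the login page, then post credentials plus hidden fields in one session."""
    import requests  # third-party; imported lazily so hidden_fields works without it

    session = requests.Session()
    login_page = session.get("https://connect.data.com/login")
    payload = hidden_fields(login_page.text)  # expected to include CSRF_TOKEN
    payload.update({"j_username": username, "j_password": password})
    session.post("https://connect.data.com/loginProcess", data=payload)
    return session  # reuse this session for later requests such as /home
```

Subsequent pages would then be requested through the returned session, for example `login(user, pw).get("https://connect.data.com/home")`, rather than through the module-level `requests.get`.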

Can't crawl the HTML data (Elements view, not page view)?

I want to crawl the HTML data as shown in the Elements view of Chrome Developer Tools, not the page view.
I crawled the HTML data in Python.
- The following is the Python code I tried.
#-*- coding: utf-8 -*-
import requests
url = 'http://m.entertain.naver.com/comment/list?page=2&gno=news117%2C0002600716&sort=newest&aid=0002600716&oid=117'
response = requests.get(url)
print(response.content)
But only the page view was crawled.
- The following is the page view crawled by the Python code above.
...
<div id="social-comment">
<ul class="cmt_lst">
<li class="ld"><span>로딩중입니다.</span></li>
</ul>
</div>
...
I want to crawl the Elements view using Python.
- The following is the Elements view in Chrome Developer Tools (this is the HTML data I want to crawl).
...
<div id="social-comment">
<div class="sc_cmt_wrp" queryid="C1431107741291317890" style="display: block;">
<div id="tabArea">
<ul class="cmt_tab">
<li style="width:50%" class="on">
댓글 <span class="_count">22</span>
</li>
<li>댓글쓰기</li>
</ul> </div>
<div id="sortOptionArea" style="display: block;">
<div class="cmt_choice">
<input type="radio" name="scmt-sort" id="newest" class="_scmt_sort(newest) _nclicks(rpt.rct)">
<label for="newest" title="최신순" class="_scmt_sort(newest) _nclicks(rpt.rct) on" onclick="javascript:;">최신순</label>
<input type="radio" name="scmt-sort" id="oldest" class="_scmt_sort(oldest) _nclicks(rpt.old)">
<label for="oldest" title="과거순" class="_scmt_sort(oldest) _nclicks(rpt.old)" onclick="javascript:;">과거순</label>
<input type="radio" name="scmt-sort" id="likability" class="_scmt_sort(likability) _nclicks(rpt.rcm)">
<label for="likability" title="호감순" class="_scmt_sort(likability) _nclicks(rpt.rcm)" onclick="javascript:;">호감순</label>
<input type="radio" name="scmt-sort" id="replycount" class="_scmt_sort(replycount) _nclicks(rpt.rpl)">
<label for="replycount" title="답글많은순" class="_scmt_sort(replycount) _nclicks(rpt.rpl)" onclick="javascript:;">답글많은순</label>
</div> </div>
<div id="noticeArea"></div>
<div id="commentListPaginationArea"> <div class="_noComments _refreshable"></div> <div class="_commentClosed _refreshable"></div> <div id="commentItemArea" class="_refreshable"><ul class="cmt_lst"> <li id="scmt-item-1149361" class=""> ldo1**** <p> 서인영 아직도 존예...하....자야는데..</p> <div class="func"> <span class="time">2015.04.28 오후 11:49</span> <span class="mobile">모바일에서 작성</span> | 신고 </div> <div class="btn_area2"> <div> 답글 <strong>0</strong> </div> <div> 10 7 </div> </div> </li> <li id="scmt-item-1149360" class=""> dbgh**** <p> 1빠 댓글수채우기.</p> <div class="func"> <span class="time">2015.04.28 오후 11:49</span> <span class="mobile">모바일에서 작성</span> | 신고 </div> <div class="btn_area2"> <div> 답글 <strong>0</strong> </div> <div> 1 5 </div> </div> </li> </ul></div> <div id="paginationArea" style="display: block;"> <div class="cmt_pg"> <span class="cmt_pg_prev">이전</span> <span class="cmt_pg_btn uc_vh scmt-page-prev-off" style="display: none;"><span class="cmt_pg_prev">이전</span></span> <em class="cmt_pg_pg _pageInfo">21 - 22 <span class="u_vc">페이지 </span><span class="cmt_pg_total">/ <span class="u_vc">총 </span>22<span class="u_vc"> 페이지</span></span></em> <span class="cmt_pg_btn uc_vh scmt-page-next-off" style="display: inline-block;"><span class="cmt_pg_next">다음</span></span> <span class="cmt_pg_next">다음</span> </div></div> </div></div></div>
...
I don't know JavaScript. If possible, please tell me an easy way (Python, or wget and curl on Linux).

How to identify button class element?

I tried the following to identify elements but I am getting "No element found" message when I run my scripts.
Method1 tried:
self.driver.find_element_by_xpath("//button[text()='Adopt and Initial']").click()
Method2 tried:
self.driver.find_element_by_css_selector(".btn-primary.btn.left.item-alt").click()
HTML of the button:
Updated HTML for the element. This is for DocuSign.
<div class="dialog is-signature-mode" tabindex="0">
<header class="dialog-header">
<h1 class="dialog-title">
<span class="item-alt" data-group="tagType" data-group-item="signature">Adopt Your Signature</span>
<span class="item-alt" data-group="tagType" data-group-item="initials" data-selected="">Adopt Your Initials</span>
</h1>
<nav class="icons">
<a class="close" data-action="cancelAdoptSignature">
<i class="icon-close"></i>
</a>
</nav>
</header>
<section class="dialog-body">
<article id="adopt">
<header class="ds-title p">
Confirm your name, initials, and signature.
</header>
<div class="full-name">
<div class="wrapper">
<label for="full-name">Full Name</label> <span class="error hidden">Name required</span>
<br>
<div class="text-input-wrapper">
<input id="full-name" disabled="" value="QAAuto 01Dec2014_10.41.03" name="fullname" type="text" class="required text-input" maxlength="50">
</div>
</div>
</div>
<div class="initials">
<div class="wrapper">
<label for="initials">Initials</label> <span class="error hidden">Initials required</span>
<br>
<div class="text-input-wrapper">
<input id="initials" disabled="" value="Q0" name="initials" type="text" class="required text-input" maxlength="50">
</div>
</div>
</div>
<div class="clear-float"></div>
</article>
<header class="tab-nav">
<ul>
<li>Select Style</li>
<li>Draw</li>
</ul>
</header>
<article id="select-style" class="tab-panel panel-select-style selected">
<h4 class="normal">Preview <span class="error"></span></h4>
<div class="signature-preview">
<div class="signature"><img alt="" src="https://demo.docusign.net/Signing/image.aspx?ti=56b2faad38e7427a99defd1dfaa258ce&insession=1&i=asig150&force=154&s=QAAuto+01Dec2014_10.41.03&f=7_DocuSign&nochrome=0" height="75px" class="signature-img left">
<img alt="" src="https://demo.docusign.net/Signing/image.aspx?ti=56b2faad38e7427a99defd1dfaa258ce&insession=1&i=ainit150&force=155&s=Q0&n=QAAuto+01Dec2014_10.41.03&f=7_DocuSign&nochrome=0" height="75px" class="initials-img right">
<div class="clear-float"></div></div>
<a class="change-style">
Change Style
</a>
<div class="clear-float"></div>
</div>
</article>
<article id="draw" class="tab-panel panel-draw">
<h4 class="normal">
<span class="item-alt-inline" data-group="tagType" data-group-item="signature">Draw your signature</span>
<span class="item-alt-inline" data-group="tagType" data-group-item="initials" data-selected="">Draw your initials</span>
<span class="error"></span>
</h4>
<a class="clear" data-ds="clear">Clear</a>
<div class="signature-draw signature">
<div class="canvas-wrapper"><canvas class="canvas" width="0" height="0"></canvas><canvas class="canvas" width="0" height="0"></canvas></div></div>
</article>
<p class="legalese">By clicking Adopt and Sign, I agree that the signature and initials will be the electronic representation of my signature and initials for all purposes when I (or my agent) use them on documents, including legally binding contracts - just the same as a pen-and-paper signature or initial.</p>
<hr>
<button class="btn-primary btn left item-alt" data-group="tagType" data-group-item="signature" type="button" data-ds="submit" value="initials">Adopt and Sign</button>
<button class="btn-primary btn left item-alt" data-group="tagType" data-group-item="initials" type="button" data-ds="submit" value="initials" data-selected="">Adopt and Initial</button>
<button class="close left btn btn-default" type="button" data-action="cancelAdoptSignature">Cancel</button>
<div class="clear-float"></div>
<div class="styles"></div></section>
</div>
Please try the XPath below and let me know what happens:
assert "Adopt and Initial" in self.driver.page_source
self.driver.find_element_by_xpath("//button[contains(text(),'Adopt and Initial')]").click()
Make sure the assertion comes before the click command: if the assertion fails, the element is not present in the current frame, so you will have to switch to the correct frame before performing the action.
Let me know if you try this.
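The frame switch mentioned above could be sketched roughly as follows. This assumes a Selenium WebDriver whose `find_elements(by, value)` accepts the locator strategy as a plain string (`"xpath"`, `"tag name"`) and which exposes `switch_to.frame` and `switch_to.default_content`. The helper name `click_in_any_frame` is made up for illustration, and the sketch only looks one level deep, so nested iframes are not handled.

```python
def click_in_any_frame(driver, xpath):
    """Click the first element matching xpath in the top document or a top-level iframe.

    Returns True if an element was clicked, False if nothing matched anywhere.
    When the match is inside an iframe, the driver stays switched to that frame.
    """
    driver.switch_to.default_content()

    # Try the top-level document first.
    matches = driver.find_elements("xpath", xpath)
    if matches:
        matches[0].click()
        return True

    # Then try each top-level iframe in turn.
    for frame in driver.find_elements("tag name", "iframe"):
        driver.switch_to.frame(frame)
        matches = driver.find_elements("xpath", xpath)
        if matches:
            matches[0].click()
            return True
        driver.switch_to.default_content()  # go back before trying the next frame

    return False
```

With the button from the question this would be called as `click_in_any_frame(self.driver, "//button[contains(text(),'Adopt and Initial')]")`. A `False` result suggests the button is not yet rendered or sits in a nested frame.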
