How to fetch a specific value from an HTML file using BeautifulSoup

I am trying to extract the percent value from an HTML file. I used the two methods below in a Jupyter Notebook and got the expected result with both. But when I try to replicate this in PyCharm, neither method gives the expected result: the first method picks up a different word, and the second returns None.
spans = soup.select_one('span').text
print("spans:", spans)
>> spans 99%
spans = soup.find("span", {"class": "rc_late"}).text
print("spans", spans)
>> spans 99%
Here is a snippet of the HTML. Is there any way to fetch that value (99%)?
<div class="content">
<h1>Latest report:
<span class="rc_late">99%</span>
</h1>
<aside id="help_panel_wrapper">
<input id="help_panel_state" type="checkbox">
<label for="help_panel_state">
<img id="keyboard_icon" src="keybd_closed.png" alt="Show/hide keyboard shortcuts">
</label>
<div id="help_panel">
<p class="legend">Shortcuts on this page</p>
<div class="keyhelp">
<p>
<kbd>n</kbd>
<kbd>s</kbd>
<kbd>m</kbd>
<kbd>x</kbd>
<kbd>c</kbd>
change column sorting
</p>
<p>
<kbd>[</kbd>
<kbd>]</kbd>
prev/next file
</p>
<p>
<kbd>?</kbd> show/hide this help
</p>
</div>
</div>
</aside>
<form id="filter_container">
<input id="filter" type="text" value="" placeholder="filter...">
</form>
<p class="text">
created at 2023-01-22 12:01 +0000
</p>
</div>

Here is one way to do it, given your example:
from bs4 import BeautifulSoup as bs
html = '''
<div class="content">
<h1>Latest report:
<span class="rc_late">99%</span>
</h1>
<aside id="help_panel_wrapper">
<input id="help_panel_state" type="checkbox">
<label for="help_panel_state">
<img id="keyboard_icon" src="keybd_closed.png" alt="Show/hide keyboard shortcuts">
</label>
<div id="help_panel">
<p class="legend">Shortcuts on this page</p>
<div class="keyhelp">
<p>
<kbd>n</kbd>
<kbd>s</kbd>
<kbd>m</kbd>
<kbd>x</kbd>
<kbd>c</kbd>
change column sorting
</p>
<p>
<kbd>[</kbd>
<kbd>]</kbd>
prev/next file
</p>
<p>
<kbd>?</kbd> show/hide this help
</p>
</div>
</div>
</aside>
<form id="filter_container">
<input id="filter" type="text" value="" placeholder="filter...">
</form>
<p class="text">
created at 2023-01-22 12:01 +0000
</p>
</div>
'''
soup = bs(html, 'html.parser')
data_we_want = soup.select_one('span[class="rc_late"]').text
print(data_we_want)
Result in terminal:
99%
See the BeautifulSoup documentation for more details.
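Since the second method returned None in PyCharm, it may help to guard the lookup before reading .text. The following is a minimal sketch, not a diagnosis of the PyCharm problem: the helper name extract_percent is hypothetical, and the sample string is a shortened version of the HTML above.

```python
from bs4 import BeautifulSoup


def extract_percent(html):
    """Return the text of <span class="rc_late">, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    # Select by class so that unrelated <span> tags earlier in the file are skipped.
    span = soup.select_one("span.rc_late")
    return span.get_text(strip=True) if span is not None else None


sample = '<div class="content"><h1>Latest report: <span class="rc_late">99%</span></h1></div>'
print(extract_percent(sample))                     # 99%
print(extract_percent("<p>no matching span</p>"))  # None
```

If the helper returns None for the real file, the file being parsed in PyCharm probably differs from the one used in the notebook (for example a different path or working directory), so printing the first few hundred characters of the loaded HTML is a reasonable next check.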

Related

Selenium Python No Name or ID

How do I enter a postcode, e.g. "3000", and click submit with Selenium (Python) when I can't locate the button by name or ID?
Here is the HTML:
<div class="container-fluid" id="basicpage">
<div class="container">
<div class="row">
<div id=right_col class="col-xs-12 col-md-9">
<div class=nothingwrapper>
</div>
<div class=nothingwrapper>
<h1>Postcode Snapshot </h1>
<p> Please type in the postcode for which you would like to download a postcode snapshot:</p>
<form action="postcodesnapshotchoosepostcode.php" method="post">
<input name=sub value=1 type=hidden>
<p><input type=text name=postcode size=5></p><p><input type=submit value='Continue' /></p>
</form>
<p> </p>
</div>
Here is what I have tried
# Select the by name
postcode_box = driver.find_element(By.NAME, "postcode")
# Send information
postcode_box.send_keys('3000')
driver.find_element(By.NAME("sub")).submit()

If you need an XPath that matches on an attribute, you can write it like so:
//input[@value='Continue']

Logging in to a website using Python's Requests module doesn't work on this site but works on other sites

I use the following code on Twitter and GitHub and it works fine, but it doesn't work on the main site I need it for. Can someone please tell me what went wrong?
I don't get any error; instead of logging me in, it just returns the login page.
import requests
session = requests.Session()
params = {'j_username': '**********', 'j_password': '************'}
r = requests.post("https://connect.data.com/loginProcess", params)
print("Cookie is set to:")
print(r.cookies.get_dict())
print("-----------")
print("Going to profile page...")
r = requests.get("https://connect.data.com/home")
print(r.text)
I am new to this, so I can't figure out what went wrong. I tried many answers before asking this question, but none seem to work for this site.
The login URL is /login, but the form is processed at /loginProcess, which is why I post to /loginProcess. However, /loginProcess does not print my cookie, while /login does.
The form on the site looks like this:
<form id="command" name="LoginForm" action="https://connect.data.com/loginProcess" method="post">
<div>
<div class="login-container fields float-left">
<div class="content">
<div class="first">
<span class="title">Login</span>
</div>
<div class="middle">
<input id="j_username" name="j_username" type="email" class="text" placeholder="Email" maxlength="128" tabindex="1">
</div>
<div class="middle">
<input id="j_password" name="j_password" type="password" class="text" placeholder="Password" autocomplete="off" tabindex="2" maxlength="128">
</div>
<div class="middle">
<label for="_spring_security_remember_me" class="general-checkbox-label">
<input name="_spring_security_remember_me" tabindex="3" value="on" id="_spring_security_remember_me" class="checkbox margin-0px" type="checkbox">
<span>Keep me logged in </span>
</label>
</div>
<div class="last">
<button id="login_btn" type="submit" class="button-standard button-primary" tabindex="4">
<span class="button-standard-text">Login</span>
</button>
<a class="link" href="https://connect.data.com/forgotpassword" onclick="var x=".tl(";s_objectID="https://connect.data.com/forgotpassword_1";return this.s_oc?this.s_oc(e):true">Forgot Password?</a>
</div>
</div>
</div>
<div class="login-container marketing-message float-right">
<div class="content">
<h2>
Don't have an account?
Sign up now - it's FREE!
</h2>
<div class="float-left">
<img class="login-bicons" src="./Data.com Connect Business Contact Directory of Business Contacts and Company Information_files/clear.cache.gif">
</div>
<div class="float-left margin-left-20px">
<p>
<span id="pageTitle" style="color: #146791">Find</span>
business card information & B2B professionals.
</p>
<p>
<span id="pageTitle" style="color: #87c540">Research</span>
people & companies.
</p>
<p class="margin-top-5px">
<span id="pageTitle" style="color: #f37521">Get</span>
millions of contacts.
</p>
</div>
<div class="clear"></div>
</div>
</div>
<div class="clear nonlogged-container-foot">
<span class="new-to-ddc">New to Data.com?</span>
<a class="sign-up" href="https://connect.data.com/registration/signup" onclick="var x=".tl(";s_objectID="https://connect.data.com/registration/signup_1";return this.s_oc?this.s_oc(e):true">Sign up for an account...</a>
</div>
</div>
<input type="hidden" name="CSRF_TOKEN" id="CSRF_TOKEN" value="ce56932e08fc97bc29c5f6535b572664a66000513143584504b0ee7c66ef9659"></form>
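Two things stand out in the code above: the Session object is created but never used, so the cookies from the POST are not sent with the later GET, and the form contains a hidden CSRF_TOKEN field that the request omits. The sketch below addresses both. It assumes the token must be read from the /login page and posted back with the credentials, which has not been verified against the site; the helper names build_payload and login are hypothetical.

```python
import requests
from bs4 import BeautifulSoup

LOGIN_PAGE = "https://connect.data.com/login"
LOGIN_POST = "https://connect.data.com/loginProcess"


def build_payload(login_html, username, password):
    """Combine the credentials with the hidden CSRF token from the form."""
    soup = BeautifulSoup(login_html, "html.parser")
    payload = {"j_username": username, "j_password": password}
    token = soup.find("input", {"name": "CSRF_TOKEN"})
    if token is not None:
        payload["CSRF_TOKEN"] = token.get("value", "")
    return payload


def login(username, password):
    """Log in through one Session so cookies persist across requests."""
    session = requests.Session()
    page = session.get(LOGIN_PAGE)  # picks up cookies and the token
    session.post(LOGIN_POST, data=build_payload(page.text, username, password))
    return session


# Offline check of the payload builder on a cut-down copy of the form.
form = '<form><input type="hidden" name="CSRF_TOKEN" value="abc123"></form>'
print(build_payload(form, "user@example.com", "secret"))
```

After session = login(...), later pages should be requested with session.get("https://connect.data.com/home") rather than requests.get, so that the login cookies are included.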

get content of <li> tags with beautifulsoup in python

I want to get the content of the first 3 <li> tags after the <section> tag. I don't know how to work with child tags in BeautifulSoup. I tried to get the stripped text and then split it to get what I want, but I wasn't successful.
This is the HTML code:
<section class="l-map">
<ul>
<li>خیابان شریعتی، روبروی پارک کوروش، کوچه پیروز، پلاک 48 </li>
<li>22855157 22852085</li>
<li>شریعتی:قلهک، سید خندان
</li>
</ul>
<div class="foot">
<a class="dm fancy" href="#contact" id="inline">پیام مستقیم به مدیر</a>
<a class="rm" href="#phonenumber" id="inline">دریافت پیامکی اطلاعات</a>
</div>
<input id="IsMaximumSmsReached" name="IsMaximumSmsReached" value="False" type="hidden">
<div style="display:none">
<div id="phonenumber">
<div class="contact-form number">
<h1>
دریافت پیامکی اطلاعات
<i class="icon contact"></i>
</h1>
<p>
شماره تلفن همراه خود را وارد کنید.
</p>
<form id="sendSMS">
<div class="form-input">
<input id="cellphone" name="cellphone" placeholder="برای مثال. 09121112222" type="text">
</div>
<div class="form-submit">
<button type="submit" href="#" class="submit">ارسال</button>
</div>
<p class="alert-box"></p>
</form>
</div>
</div>
</div>
<div style="display:none">
<div id="contact">
<div class="contact-form">
<h1>
ارسال پیام به مدیریت رستوران
<i class="icon message"></i>
</h1>
<p>
در این بخش شما می توانید به صورت مستقیم به مدیریت رستوران پیام ارسال نمایید.
<br>
پیام خود را در زیر بنویسید و ارسال نمایید.
</p>
<form id="managerMessage">
<div class="form-input">
<input id="MessageSenderName" name="MessageSenderName" placeholder="نام شما (اختیاری)">
<input id="MessageSenderPhone" name="MessageSenderPhone" placeholder="تلفن تماس شما (اختیاری)"><br>
<input id="MessageSenderEmail" name="MessageSenderEmail" placeholder="ایمیل شما (اختیاری)"><br>
<textarea id="MessageToManager" name="MessageToManager" placeholder="پیام"></textarea>
</div>
<div class="form-submit">
<button type="submit" href="#" class="submit">ارسال</button>
</div>
<p class="alert-box"></p>
</form>
</div>
</div>
</div>
</section>
I can only access the whole <section> tag with this line of code:
address = soup.find('section', class_="l-map")
I appreciate any help or comments you can give me :)

You can use the .find_all() method to find all the li elements inside the section, and then get the text of each one with either the .text attribute or the .get_text() method. Example:
>>> for lis in address.find_all('li'):
...     print(lis.get_text())
...
<first li text>
22855157 22852085
<third li text>
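To restrict the result to the first three items, as the question asks, find_all() accepts a limit argument. Here is a small sketch with placeholder item text standing in for the page's real content; the fourth item is added only to show that it is skipped.

```python
from bs4 import BeautifulSoup

html = """
<section class="l-map">
<ul>
<li>first item </li>
<li>22855157 22852085</li>
<li>third item
</li>
<li>fourth item</li>
</ul>
</section>
"""

soup = BeautifulSoup(html, "html.parser")
address = soup.find("section", class_="l-map")

# limit=3 stops the search after three matches; strip=True trims whitespace.
first_three = [li.get_text(strip=True) for li in address.find_all("li", limit=3)]
print(first_three)  # ['first item', '22855157 22852085', 'third item']
```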

I want to crawl the web page html (all elements) in Python

I want to crawl the comments on a news article (http://m.entertain.naver.com/comment/list?page=2&gno=news117%2C0002600716&sort=newest&aid=0002600716&oid=117).
But crawling with requests failed: the result contains an empty <div id="social-comment">.
I want to crawl the full HTML of the web page (all elements).
Here is the Python source I tried:
#-*- coding: utf-8 -*-
import requests
url = 'http://m.entertain.naver.com/comment/list?page=2&gno=news117%2C0002600716&sort=newest&aid=0002600716&oid=117'
response = requests.get(url)
print(response.content)
HTML I got:
...
<div id="social-comment">
<ul class="cmt_lst">
<li class="ld"><span>로딩중입니다.</span></li>
</ul>
</div>
...
Expected HTML:
...
<div id="social-comment">
<div class="sc_cmt_wrp" queryid="C1431107741291317890" style="display: block;">
<div id="tabArea">
<ul class="cmt_tab">
<li style="width:50%" class="on">
댓글 <span class="_count">22</span>
</li>
<li>댓글쓰기</li>
</ul> </div>
<div id="sortOptionArea" style="display: block;">
<div class="cmt_choice">
<input type="radio" name="scmt-sort" id="newest" class="_scmt_sort(newest) _nclicks(rpt.rct)">
<label for="newest" title="최신순" class="_scmt_sort(newest) _nclicks(rpt.rct) on" onclick="javascript:;">최신순</label>
<input type="radio" name="scmt-sort" id="oldest" class="_scmt_sort(oldest) _nclicks(rpt.old)">
<label for="oldest" title="과거순" class="_scmt_sort(oldest) _nclicks(rpt.old)" onclick="javascript:;">과거순</label>
<input type="radio" name="scmt-sort" id="likability" class="_scmt_sort(likability) _nclicks(rpt.rcm)">
<label for="likability" title="호감순" class="_scmt_sort(likability) _nclicks(rpt.rcm)" onclick="javascript:;">호감순</label>
<input type="radio" name="scmt-sort" id="replycount" class="_scmt_sort(replycount) _nclicks(rpt.rpl)">
<label for="replycount" title="답글많은순" class="_scmt_sort(replycount) _nclicks(rpt.rpl)" onclick="javascript:;">답글많은순</label>
</div> </div>
<div id="noticeArea"></div>
<div id="commentListPaginationArea"> <div class="_noComments _refreshable"></div> <div class="_commentClosed _refreshable"></div> <div id="commentItemArea" class="_refreshable"><ul class="cmt_lst"> <li id="scmt-item-1149361" class=""> ldo1**** <p> 서인영 아직도 존예...하....자야는데..</p> <div class="func"> <span class="time">2015.04.28 오후 11:49</span> <span class="mobile">모바일에서 작성</span> | 신고 </div> <div class="btn_area2"> <div> 답글 <strong>0</strong> </div> <div> 10 7 </div> </div> </li> <li id="scmt-item-1149360" class=""> dbgh**** <p> 1빠 댓글수채우기.</p> <div class="func"> <span class="time">2015.04.28 오후 11:49</span> <span class="mobile">모바일에서 작성</span> | 신고 </div> <div class="btn_area2"> <div> 답글 <strong>0</strong> </div> <div> 1 5 </div> </div> </li> </ul></div> <div id="paginationArea" style="display: block;"> <div class="cmt_pg"> <span class="cmt_pg_prev">이전</span> <span class="cmt_pg_btn uc_vh scmt-page-prev-off" style="display: none;"><span class="cmt_pg_prev">이전</span></span> <em class="cmt_pg_pg _pageInfo">21 - 22 <span class="u_vc">페이지 </span><span class="cmt_pg_total">/ <span class="u_vc">총 </span>22<span class="u_vc"> 페이지</span></span></em> <span class="cmt_pg_btn uc_vh scmt-page-next-off" style="display: inline-block;"><span class="cmt_pg_next">다음</span></span> <span class="cmt_pg_next">다음</span> </div></div> </div></div></div>
...

The page you're attempting to read loads comments in an XHR request, after the page loads. Thus, unless you're using some tool that emulates a full browser (executing javascript and loading external resources), you're not loading the comments.
The comments are loaded in a POST request sent to http://m.entertain.naver.com/api/comment/list.json
That returns a JSON object, with all the data you're looking for.
As it is a POST request, it's looking for data that you may send along. In my testing, it appears that the minimal information you need to provide is:
gno : news117,0002600716
pageSize : 2000 (20 is the default. 2,000 will probably give you all of the comments for most cases, but you should adapt as you see fit.)
sort : newest
Encoded as a URL string (which is how data is actually sent in a urlencoded POST request), a request body looks like this, for example: gno=news117%2C0002600716&page=2&sort=newest&pageSize=20&serviceId=news
As to how to go about making POST requests in Python, see here and here.
As to how to parse the JSON that's returned, see here.
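As a rough illustration of the steps above, the request could be made with requests along these lines. The endpoint and field names are taken from this answer; the function names and the timeout are assumptions, and the structure of the returned JSON is not shown because it was not checked.

```python
from urllib.parse import urlencode

import requests

API_URL = "http://m.entertain.naver.com/api/comment/list.json"


def build_payload(gno, page=1, page_size=20, sort="newest"):
    """Collect the form fields mentioned in the answer into a dict."""
    return {
        "gno": gno,
        "page": page,
        "sort": sort,
        "pageSize": page_size,
        "serviceId": "news",
    }


def fetch_comments(gno, **kwargs):
    """POST the form fields and return the decoded JSON response."""
    response = requests.post(API_URL, data=build_payload(gno, **kwargs), timeout=10)
    response.raise_for_status()
    return response.json()


# urlencode shows the body that requests sends for this payload.
payload = build_payload("news117,0002600716", page=2)
print(urlencode(payload))
# gno=news117%2C0002600716&page=2&sort=newest&pageSize=20&serviceId=news
```

Calling fetch_comments("news117,0002600716", page_size=2000) would then request a larger page, as suggested above.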

wrapping html with a python function

I want to be able to wrap a div based on its id. For example, given the following HTML:
<body>
<div id="info">
<div id="a1">
</div>
<div id="a2">
<div id="description">
</div>
<div id="links">
link
</div>
</div>
</div>
</body>
I want to write a Python function that takes a document, an id, and a selector, and wraps the element with the given id in a div with the given class or id selector. For example, let's say the HTML above is in a variable doc:
wrap(doc,'#a2','#wrapped')
will return the following HTML:
<body>
<div id="info">
<div id="a1">
</div>
<div id="wrapped">
<div id="a2">
<div id="description">
</div>
<div id="links">
link
</div>
</div>
</div>
</div>
</body>
I looked at some XML parsers and Python HTMLParser, but I have not found anything that gives me the capability to not only get everything inside a specific tag, but then be able to append strings and easily edit the document. If one does not exist, what would be a good approach to this?

from BeautifulSoup import BeautifulSoup  # legacy BeautifulSoup 3 import

# div1 is to be wrapped with div2
def wrap(doc, div1_id, div2_id):
    pool = BeautifulSoup(doc)
    for div in pool.findAll('div', attrs={'id': div1_id}):
        div.replaceWith('<div id=' + div2_id + '>' + div.prettify() + '</div>')
    return pool.prettify()

wrap(doc, 'a2', 'wrapped')

I recommend BeautifulSoup: it adds a dependency, but also a lot of convenience. The following code achieves the wrapping:
from bs4 import BeautifulSoup
data = '''<body>
<div id="info">
<div id="a1">
</div>
<div id="a2">
<div id="description">
</div>
<div id="links">
<a href="http://example.com">link</a>
</div>
</div>
</div>
</body>'''
soup = BeautifulSoup(data)
div = soup.find('div', attrs={'id': 'a2'})
div.wrap(soup.new_tag('div', id='wrapper'))
Printing soup.prettify() then shows the result:
<html>
<body>
<div id="info">
<div id="a1">
</div>
<div id="wrapper">
<div id="a2">
<div id="description">
</div>
<div id="links">
<a href="http://example.com">
link
</a>
</div>
</div>
</div>
</div>
</body>
</html>
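The same idea can be packaged into the wrap(doc, '#a2', '#wrapped') signature from the question. This is one possible sketch, assuming the wrapper selector is always a simple '#id' or '.class' and that only the first match should be wrapped; it uses html.parser, so no <html> element is added to the output.

```python
from bs4 import BeautifulSoup


def wrap(doc, target_selector, wrapper_selector):
    """Wrap the first element matching target_selector in a new <div>."""
    soup = BeautifulSoup(doc, "html.parser")
    target = soup.select_one(target_selector)
    if target is None:
        return doc  # nothing matched, return the document unchanged

    wrapper = soup.new_tag("div")
    if wrapper_selector.startswith("#"):
        wrapper["id"] = wrapper_selector[1:]
    elif wrapper_selector.startswith("."):
        wrapper["class"] = wrapper_selector[1:]
    else:
        raise ValueError("wrapper_selector must look like '#id' or '.class'")

    target.wrap(wrapper)  # the wrapper takes the target's place in the tree
    return str(soup)


doc = ('<body><div id="info"><div id="a1"></div>'
       '<div id="a2"><div id="links">link</div></div></div></body>')
result = wrap(doc, "#a2", "#wrapped")
print(result)
```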
