How to use xpath to get text from similar class? - python

(1)
</div>
<div class="n_cont5" id="nct7">
<div class="nc_tit">说明书:</div>
<div class="nc5" id="smsdiv">
正在查询请稍候......
</div>
</div>
<div class="n_page">上一篇第<span class="cur">2</span>篇下一篇共<span>53</span>篇转到第
<input type="text" name="pages" id="pages"
onkeydown="return SubmitKeyClick(this,event)"
onkeyup="value=value.replace(/[^\d]/g,'')"
onbeforepaste="clipboardData.setData('text',clipboardData.getData('text').replace(/[^\d]/g,''))"/>
篇</div>
</div>
(2)
<a href="javascript:noAction()" title="PDF下载"
onclick="pdfDownloadDetail('Unexamined_patent_for_invention/2016/20160330/CN105452223A/PDF_PID/CN112014000037041CN00001054522230APDFZH20160330CN00F.PDF,CN201480037041.6')" href="javascript:noAction()">PDF下载</a>
</dd>
</dl>
</div>
</li>
<li>打印</li>
<li><a class="icon7" href="javascript:noAction();" class="zidongfanyi"
onclick="translateToEn('CN201480037041.6', 'FMZL_EN,SYXX_EN')">中译英</a></li>
</ul>
<div class="clear"></div>
</div>
<div class="clear"></div>
</div>
<div class="n_page">上一篇第<span class="cur">2</span>篇下一篇共<span>53</span>篇转到第
<input type="text" name="pages" id="pages"
onkeydown="return SubmitKeyClick(this,event)"
onkeyup="value=value.replace(/[^\d]/g,'')"
onbeforepaste="clipboardData.setData('text',clipboardData.getData('text').replace(/[^\d]/g,''))"/>
篇</div>
Here are two very similar blocks from the HTML content. I want to get the number "53" from the first one. I used the code below, which doesn't work. I also tried going through the div class, but that failed too. How can I get the number "53" from the first block?
html.xpath('//a[contains(text(),"下一篇")]/span/text()')

Why it didn't work: the span holding 53 is a sibling (not a child) of the a element.
To complete @super.single430's answer, here's an alternative (useful if encoding issues occur during parsing):
//span[@class="cur"]/following-sibling::span/text()

html.xpath("//a[contains(text(),'下一篇')]/following-sibling::span/text()")
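To keep only the first of the two pagers, the sibling-based expression can be wrapped in parentheses and indexed. The following is a minimal sketch, assuming lxml is installed and that the pager markup is exactly as pasted in the question (where no a element is visible around 下一篇); the PAGE string is a trimmed, illustrative copy rather than the real page.

```python
from lxml import html as lxml_html

# Trimmed, illustrative copy of the pager markup; it appears twice on the page.
PAGE = """
<div>
  <div class="n_page">上一篇第<span class="cur">2</span>篇下一篇共<span>53</span>篇转到第篇</div>
  <div class="n_page">上一篇第<span class="cur">2</span>篇下一篇共<span>53</span>篇转到第篇</div>
</div>
"""

tree = lxml_html.fromstring(PAGE.strip())

# (…)[1] picks the first pager in document order; the second step walks from
# the "cur" span to the span that follows it as a sibling.
matches = tree.xpath(
    '(//div[@class="n_page"])[1]/span[@class="cur"]/following-sibling::span/text()'
)
total = int(matches[0]) if matches else None
print(total)
```

Without the (…)[1] wrapper the expression returns one match per pager, so indexing the resulting list with [0] would also work.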

Related

Beautifulsoup Strainer to strain items from a specific container only

Is it possible to make a Beautifulsoup Strainer that strains all 'order-cards' from 'container-01' only (without 'order-cards' from other containers)?
Below is the sample HTML:
<div class="items-container" container-id="container-01">
<div class="order-card">order_01
<div class="item-card">item1</div>
<div class="item-card">item2</div>
<div class="item-card">item3</div>
<div class="item-card">item4</div>
</div>
<div class="order-card">order_02
<div class="item-card">itemA</div>
<div class="item-card">itemB</div>
<div class="item-card">itemC</div>
<div class="item-card">itemD</div>
</div>
<div class="order-card">order_03
<div class="item-card">itemW</div>
<div class="item-card">itemX</div>
<div class="item-card">itemY</div>
<div class="item-card">itemZ</div>
<div class="item-card">item</div>
</div>
</div>
<div class="items-container" container-id="container-02">
<div class="order-card">order_53
<div class="item-card">item_7</div>
<div class="item-card">item_8</div>
</div>
</div>
<div class="items-container" container-id="container-03">
<div class="order-card">order_13
<div class="item-card">item_16</div>
<div class="item-card">item_17</div>
<div class="item-card">item_18</div>
</div>
</div>
What I have so far is the code below, which strains ALL 'order-card' elements from ALL containers.
The goal is to get the details from each 'item-card' that is in 'container-01' only, so 'page_soup' should contain only the 'order-card' items from that container.
There is no need to parse any container other than 'container-01'.
only_item_cells = SoupStrainer('div', attrs={"class":"order-card"})
page_soup = BeautifulSoup(page_html, 'html.parser', parse_only=only_item_cells)
Following that is a loop that gets the details from ALL the 'item-card' elements in ALL containers. That is NOT wanted, as the output includes items from containers other than 'container-01'.
Running Python 3.8.8, on Anaconda, Win64
Use the appropriate attribute as you have indicated:
only_item_cells = SoupStrainer('div', attrs= {"container-id": "container-01"})
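As a rough end-to-end check of that strainer, here is a small sketch that assumes bs4 is installed and uses a shortened copy of the sample HTML (PAGE_HTML is illustrative, not the real page). Once a tag matches the strainer, its whole subtree is kept, so the item cards inside 'container-01' survive while the other containers are never parsed.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Shortened, illustrative copy of the sample HTML from the question.
PAGE_HTML = """
<div class="items-container" container-id="container-01">
  <div class="order-card">order_01
    <div class="item-card">item1</div>
    <div class="item-card">item2</div>
  </div>
  <div class="order-card">order_02
    <div class="item-card">itemA</div>
  </div>
</div>
<div class="items-container" container-id="container-02">
  <div class="order-card">order_53
    <div class="item-card">item_7</div>
  </div>
</div>
"""

# Keep only the container whose container-id attribute is "container-01".
only_container_01 = SoupStrainer("div", attrs={"container-id": "container-01"})
page_soup = BeautifulSoup(PAGE_HTML, "html.parser", parse_only=only_container_01)

# Everything left in page_soup belongs to container-01, so a plain find_all is enough.
items = [
    card.get_text(strip=True)
    for card in page_soup.find_all("div", class_="item-card")
]
print(items)
```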

Python Selenium: Login automation, Layer Problem?

I am using Python Selenium to log in through a webpage. I have done that before, and my code worked fine in that case:
driver.find_element_by_name("username").send_keys("user")
driver.find_element_by_name("password").send_keys("password")
driver.find_element_by_xpath("//input[@type='submit' and @value='Login / Register']").click()
But now that I would like to do it again, it doesn't work anymore. I have tried some variations, but none of them worked for me. I also tried to access the element through layers, which I have seen in some older cases:
wait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.ID, "loginmask")))
driver.find_element_by_id("yourName").send_keys('username')
.....
I also tried something like that:
frame = driver.find_element_by_id("mainBody")
driver.switch_to.frame(frame)
driver.find_element_by_id("yourName").send_keys('username')
But somehow the find_element_by_id function isn't working for me, and the id is the only thing I can search on in this case.
(Screenshots of the login page and of the HTML body were attached here.) They show the layers I might have to get through. Maybe you can help me get the element 'yourName'.
HTML body:
<body id="mainBody" style="visibility: visible;">
<div id="treeLeft" style="height: 882px;">
<img id="logoLeft" src="img/ax_01.png" alt="">
<div id="buttonAreaLeft">
<img id="btnLogout" src="img/logoff.gif" title="logout" alt="" style="float:right">
<img id="btnHome" src="img/home.gif" title="startpage" alt="" style="float:left">
<br clear="all">
</div>
<div id="myMenuAccordion"></div>
</div>
<div id="mainLayer" style="left: 230px; width: 1125px; height: 860px; top: 12px;"><h1>AX Controller WEB-Interface</h1>
<div id="loginLayer" style="position:absolute;top:120px;left:120px;">
<h1>PLEASE LOGIN</h1>
<div class="hspacer5"></div>
<div class="hspacer5"></div>
<div id="loginmask" class="form">
<form method="post" action="index.php" onsubmit="venue.login();return false;">
<span class="label block">Your Name:</span><input type="text" id="yourName" maxlength="20" style="width:110px;" autocomplete="off"><br>
<div class="hspacer5"></div>
<span class="label block">Password:</span><input type="password" id="authCode" maxlength="20" style="width:110px;" autocomplete="off"><br>
<div class="hspacer5"></div>
<span class="label block"> </span><input type="submit" id="btnLogin" value="login" style="width:112px;padding:1px 0px;">
</form>
</div>
</div>

Python read forms in webpage

I read some webpage contents in html that has the following form:
<div class="cart">
<div class="cart-title">
<img src="https://ug3.technion.ac.il/rishum/img/regCourses.png" width="50" height="50" alt="My Courses">
המקצועות שלי
</div><div class="entry-spacer"></div><div class="cart-entry">
<div class="course-number">
104134
</div>
<div class="course-name">
אלגברה מודרנית ח
</div>
<div class="course-points">
2.5 נק'
</div>
<div class="entry-group">
קבוצה 11
</div><div class="change-group">
שנה קבוצה ל
<select name="UPG104134" onchange="showWaitAndSubmit('regCart')" class="change-group-options">
<option value=""> </option><option>12</option><option>13</option><option>21</option><option>22</option><option>23</option>
</select>
</div><div class="more-actions">
</div>
<div class="clear"></div></div><div class="entry-spacer"></div><div class="cart-entry">
<div class="course-number">
234118
</div>
<div class="course-name">
ארגון ותכנות המחשב
</div>
<div class="course-points">
3 נק'
</div>
<div class="entry-group">
קבוצה 22
</div><div class="change-group">
שנה קבוצה ל
<select name="UPG234118" onchange="showWaitAndSubmit('regCart')" class="change-group-options">
<option value=""> </option><option>11</option><option>12</option><option>13</option><option>14</option><option>21</option>
</select>
</div><div class="more-actions">
</div>
<div class="clear"></div></div><div>
Now the question is: how can I read the course numbers (shown in blue in my image)?
Here's an example of how course number appears in the webpage:
<div class="course-number">
104134
</div>
and I want to read: 104134 in this example
First, I'd advise using BeautifulSoup for parsing the HTML and then, off the top of my head, you should dig in for those div tags with that class name like this.
import requests
from bs4 import BeautifulSoup

r = requests.get(<your-target>)
soup = BeautifulSoup(r.text, 'lxml')
# each course-number div holds the number as plain text, so read the div's own text
numbers = [i.get_text(strip=True) for i in soup.find_all('div', attrs={"class": "course-number"})]
I haven't tested this, but even if it doesn't work as-is, it should point you toward a solution. Check BeautifulSoup's documentation for more information.
Note that the divs in your sample contain only text. If you instead reach into a child tag (for example i.a.text) and a div lacks that tag, the expression raises an AttributeError, so if you don't trust the site's structure to stay the same, use a regular for loop with a try-except or handle the missing tag some other way.
Beware that the previous method will obtain all div tags with class course-number. You may want only a subset of those, so you should either apply more filtering or traverse the HTML tree first until you get to the root of your target content.
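One way to apply that extra filtering is to locate the cart container first and search only inside it. This is a sketch under assumptions: bs4 is installed, HTML is a shortened copy of the markup from the question, and the trailing "other" div is a hypothetical element added only to show that it gets excluded.

```python
from bs4 import BeautifulSoup

# Shortened copy of the cart markup; the "other" div is hypothetical.
HTML = """
<div class="cart">
  <div class="cart-entry">
    <div class="course-number">
      104134
    </div>
  </div>
  <div class="cart-entry">
    <div class="course-number">
      234118
    </div>
  </div>
</div>
<div class="other"><div class="course-number">999999</div></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Narrow the search to the cart first, then collect the numbers inside it.
cart = soup.find("div", class_="cart")
course_numbers = []
if cart is not None:
    for div in cart.find_all("div", class_="course-number"):
        # strip=True removes the newlines that surround each number
        course_numbers.append(div.get_text(strip=True))
print(course_numbers)
```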

Follow a sibling in Selenium/Python

I want to enter a text in a text area. The HTML code is as follows:
<li class="order-unavailable string-type-key string-block clear-fix status- require_changes expanded working autogrowed activity-opened" data-string_status="require_changes" data-master_unit_count="22" data-string_id="2394473">
<div class="key-area clear-fix">
<div class="key-area-container-one clear-fix">
<div class="key-area-container-two">
<div class="col-50 col-left">
<div class="string-controls">
<a class="control-expand-toggle selected" href="#"></a>
<a class="control-activity-toggle " href="#">2</a>
<input class="control-select-string" type="checkbox">
</div>
<div class="master-content">
</div>
<div class="col-50 col-right slave-side-container">
</div>
</div>
</div>
<div class="activity-area clear-fix">
<div class="col-50 col-left">
<div class="col-50 col-right">
<div class="comment-area-inner">
<h3>Add comment</h3>
<div class="comment-container">
<textarea class="comment-content" name="comment_content"></textarea>
</div>
<div class="col-right">
<div class="clear"></div>
<strong>Notification settings</strong>
<p>The people you select will get an email when you post this comment. They'll also be notified by email every time a new comment is added.</p>
<div class="notification-settings">
</div>
</div>
</div>
The textarea's class is comment-content (its name attribute is comment_content).
The xpath of the textarea is:
/html/body/div/section/ol/li[16]/div[2]/div[2]/div/div/textarea
This is the code I am using:
driver.find_element_by_xpath(
    "*//div[@title='NOTIFICATION_HOMEPAGE_REDIRECT_CHANGED_SITE']"
    "/following-sibling::div[2]/div[2]/div/div/textarea"
).send_keys("Test comment")
Can someone help me frame the sibling part of the path?
div[2]/div[2]/div/div/textarea
The tag before the following-sibling keyword is correct.
Select the textarea directly and type into it:
driver.find_element_by_xpath("//textarea[@class='comment-content']").send_keys('Test Comment')
To build and test XPath expressions, you can use the FirePath plugin for Firefox.
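If several list items on the page each contain such a textarea, the class alone may match more than one element. A locator can be checked offline before handing it to Selenium. The sketch below assumes lxml is installed, uses it purely as a stand-in for the browser, and runs on a reduced copy of the markup; the second li and its id are hypothetical.

```python
from lxml import html as lxml_html

# Reduced copy of the markup; the second <li> and its id are hypothetical.
SNIPPET = """
<ol>
  <li data-string_id="2394473">
    <div class="activity-area clear-fix">
      <div class="comment-container">
        <textarea class="comment-content" name="comment_content"></textarea>
      </div>
    </div>
  </li>
  <li data-string_id="1111111">
    <div class="comment-container">
      <textarea class="comment-content" name="comment_content"></textarea>
    </div>
  </li>
</ol>
"""

tree = lxml_html.fromstring(SNIPPET.strip())

# Anchor on the list item's data-string_id, then descend to its textarea.
XPATH = '//li[@data-string_id="2394473"]//textarea[@class="comment-content"]'
found = tree.xpath(XPATH)
print(len(found), found[0].get("name"))
```

The same XPATH string can then be passed to driver.find_element_by_xpath(XPATH), as in the code above.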

Parsing Html data using LXML

<div id="descriptionmodule" class="module toggle-wrap">
<div class="mod-header">
<h3 class="toggle-title">Description</h3>
</div>
<div id="issue-description" class="mod-content">
<p>qqqqqqqqqqqqq,<br/>
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq<br/>
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.</p>
<p>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq</p>
<p>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.</p>
<ul class="alternate" type="square">
<li>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq</li>
</ul>
I want only the q's. I tried this:
doc = lh.fromstring(resp.read())
for el in doc.cssselect('div.mod-content'):
    print(el.text_content())
This gives me the q's, but it also gives me other details on the page with class mod-content.
How do I get only the q's?
I am using lxml.
<div id="peoplemodule" class="module toggle-wrap">
<div class="mod-header">
<h3 class="toggle-title">People</h3>
</div>
<div class="mod-content">
<ul class="item-details" id="peopledetails">
<li class="people-details">
<dl>
<dt>Assignee:</dt>
<dd id="Assign-Val">
<a class="user-hover" rel="605794069" id="issue_summary_assignee_605794069" href="--------------"> AAAAAAAAAAAAA </a>
</dd>
</dl>
<dl>
<dt>Reporter:</dt>
<dd id="Report-Val">
<a class="user-hover" rel="700843051" id="issue_summary_reporter_700843051" href="-------------------------">BBBBBBBBBBBBBB</a>
</dd>
</dl>
<dl><dt> </dt><dd> </dd></dl>
<dl>
<dt title="Multiple Assignees">Multiple Assignees:</dt>
<dd id="customfield_10020-val"> <div class="shorten" id="customfield_10020-field">
<span class="tinylink"> <a class="user-hover" rel="604810609" id="multiuser_cf_604810609" href="------------------">FFFFFFFFFFFFFF</a></span>, <span class="tinylink"> <a class="user-hover" rel="600548483" id="multiuser_cf_600548483" href="------------------------------------">EEEEEEEEEEEEEEEEE</a></span> </div>
</dd>
</dl>
</li>
</ul>
<div id="watchers-val">
<span class="icon icon-watch-off"></span><span class="action-text">Watch</span>
(<span id="watcher-data">1</span>)
</div>
</div>
</div>
First off: if you are parsing HTML, there is a high chance that humans have edited it by hand and that it won't validate correctly. This is the case for the example you posted (a couple of </div> tags are missing...). Consider using BeautifulSoup instead, which is specifically designed to accommodate this kind of error.
That said, if your question is just about how to extract the "textual part of the HTML", in other words how to convert HTML → plain text (as opposed to "extracting only the text contained in specific HTML containers"), this is a minimal working example:
from lxml import etree
content = '''<div id="descriptionmodule" class="module toggle-wrap">
<div class="mod-header">
<h3 class="toggle-title">Description</h3>
</div>
<div id="issue-description" class="mod-content">
<p>qqqqqqqqqqqqq,<br/>
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq<br/>
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.</p>
<p>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq</p>
<p>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.</p>
<ul class="alternate" type="square">
<li>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq</li>
</ul></div></div>'''
tree = etree.fromstring(content)
for bit in tree.xpath('//text()'):
    if bit.strip():  # you can insert any kind of test here
        print(bit.strip())  # strip so the surrounding newlines are not printed
It outputs:
Description
qqqqqqqqqqqqq,
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
HTH!
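To restrict the output to the description block only, which is what the question asks for, the XPath can target the div by its id, since issue-description is unique while the mod-content class is shared with the People module. This is a minimal sketch, assuming lxml is installed; CONTENT is a shortened, illustrative copy of the two modules.

```python
from lxml import html as lxml_html

# Shortened, illustrative copy of the two modules from the question.
CONTENT = """
<div>
  <div id="descriptionmodule" class="module toggle-wrap">
    <div class="mod-header"><h3 class="toggle-title">Description</h3></div>
    <div id="issue-description" class="mod-content">
      <p>qqq,<br/>qqqq</p>
      <ul class="alternate" type="square"><li>qqqqq</li></ul>
    </div>
  </div>
  <div id="peoplemodule" class="module toggle-wrap">
    <div class="mod-content"><dl><dt>Assignee:</dt></dl></div>
  </div>
</div>
"""

tree = lxml_html.fromstring(CONTENT.strip())

# Select text nodes under the element with id="issue-description" only;
# the People module shares the class but not the id.
bits = tree.xpath('//div[@id="issue-description"]//text()')
lines = [bit.strip() for bit in bits if bit.strip()]
print(lines)
```

lxml.html is used here instead of etree.fromstring because it tolerates markup that is not well-formed XML.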
