How to use XPath to get the content of the same field? - python

I am a beginner with XPath and cannot get it to match the content correctly. Here is my question:
How can I use XPath to get the dates '2010.09.07' (after 申请日:) and '2009.09.03'? There are actually 10 identical items (g_item) under the g_list hierarchy; I have listed only two of them here. I tried copying the XPath from Chrome, but it doesn't work.
I also tried to use a regex as below; however, it only matches the first one. Is there a way to return the dates of all items?
Thanks!
s.find(string=('申请日:')).find_next().text.replace('\n', '').strip()
<div class="g_list">
<div class="g_item">
<div class="g_tit">
<ul>
<li class="g_li0">
<input id="CN201010274593.21" name="recordno" type="checkbox" value="CN201010274593.2" pnm="CN102403785B" sysid="B58C6C20BB7D5998B03811E0866F5981" appid="201010274593.2" sectionName="FMSQ" onclick="checkall()"/></li>
<input id="tifPath1" name="tifPath" type="hidden" tifvalue="BOOKS/SD/2014/20140716/201010274593.2,12,CN201010274593.2" xmlvalue="FMSQ,CN201010274593.2,2014.07.16" pdfvalue="Granted_patent_for_invention/2014/20140716/CN102403785B/PDF_PID/CN102010000274593CN00001024037850BPDFZH20140716CN008.PDF,CN201010274593.2" pdfvalue2="CN102403785B,2014.07.16"/>
<li class="g_li" onclick="viewDetail(0)" style="cursor:pointer" name='patti' title="电源管理装置及其电源管理方法">
1.电源管理装置及其电源管理方法</li>
<li class="g_li1">发明授权 </li>
<li class="g_li2 cor3">无效</li>
<li class="g_li3">下载</li>
</ul>
<div class="clear"></div>
</div>
<div class="g_cont">
<div class="g_cont_left">
<table cellpadding="0" cellspacing="0" border="0">
<tr>
<td><span>申请号:</span> CN201010274593.2 </td>
<td><span>申请日:</span> 2010.09.07 </td>
</tr>
<tr>
<td><span>公开(公告)号:</span> CN102403785B </td>
<td><span>公开(公告)日:</span> 2014.07.16 </td>
</tr>
<tr>
<td><span>同日申请:
</td>
<td><span>分案原申请号:
</td>
</tr>
<tr>
<td colspan="2" style="width:610px;word-break:break-all;"><span>申请(专利权)人:</span> 鸿富锦精密工业(深圳)有限公司;鸿海精密工业股份有限公司 </td>
</tr>
<tr>
<td colspan="2" style="width:610px;word-break:break-all;"><span>分类号:</span> H02J13/00(2006.01) </td>
</tr>
<tr>
<td colspan="2" style="width:610px;word-break:break-all;"><span>优先权:</span></td>
</tr>
<tr>
<td colspan="2"><span>摘要:</span><span name="patab" style="font-weight:normal"></span>
<a name="abmtlink" href="javascript:return false;" style="color:blue">机器翻译</a></td>
</tr>
</table>
</div>
<div class="g_cont_rig" id="pic1">
<img name="tifpath" src="http://pic.cnipr.com/XmlData/SQ\20140716\201010274593.2/201010274593.gif" class="imgstyle"/>
</div>
<div class="clear"></div>
</div>
</div>
<div class="g_item">
<div class="g_tit">
<ul>
<li class="g_li0">
<input id="CN200910171675.12" name="recordno" type="checkbox" value="CN200910171675.1" pnm="CN102006581B" sysid="E7025BBD105585DF6CE4193E52ECC322" appid="200910171675.1" sectionName="FMSQ" onclick="checkall()"/></li>
<input id="tifPath2" name="tifPath" type="hidden" tifvalue="BOOKS/SD/2013/20130911/200910171675.1,21,CN200910171675.1" xmlvalue="FMSQ,CN200910171675.1,2013.09.11" pdfvalue="Granted_patent_for_invention/2013/20130911/CN102006581B/PDF_PID/CN102009000171675CN00001020065810BPDFZH20130911CN008.PDF,CN200910171675.1" pdfvalue2="CN102006581B,2013.09.11"/>
<li class="g_li" onclick="viewDetail(1)" style="cursor:pointer" name='patti' title="IP地址强制续约的方法及装置">
2.IP地址强制续约的方法及装置</li>
<li class="g_li1">发明授权 </li>
<li class="g_li2 cor3">无效</li>
<li class="g_li3">下载</li>
</ul>
<div class="clear"></div>
</div>
<div class="g_cont">
<div class="g_cont_left">
<table cellpadding="0" cellspacing="0" border="0">
<tr>
<td><span>申请号:</span> CN200910171675.1 </td>
<td><span>申请日:</span> 2009.09.03 </td>
</tr>
<tr>
<td><span>公开(公告)号:</span> CN102006581B </td>
<td><span>公开(公告)日:</span> 2013.09.11 </td>
</tr>
<tr>
<td><span>同日申请:
</td>
<td><span>分案原申请号:
</td>
</tr>
<tr>
<td colspan="2" style="width:610px;word-break:break-all;"><span>申请(专利权)人:</span> 中兴通讯股份有限公司 </td>
</tr>
<tr>
<td colspan="2" style="width:610px;word-break:break-all;"><span>分类号:</span> H04W8/08(2009.01);H04W36/14(2009.01);H04W84/12(2009.01);H04L29/12(2006.01) </td>
</tr>
<tr>
<td colspan="2" style="width:610px;word-break:break-all;"><span>优先权:</span></td>
</tr>
<tr>
<td colspan="2"><span>摘要:</span><span name="patab" style="font-weight:normal"></span>
<a name="abmtlink" href="javascript:return false;" style="color:blue">机器翻译</a></td>
</tr>
</table>
</div>
<div class="g_cont_rig" id="pic2">
<img name="tifpath" src="http://pic.cnipr.com/XmlData/SQ/20130911/200910171675.1/200910171675.gif" class="imgstyle"/>
</div>
<div class="clear"></div>
</div>
</div>
</div>
</div>

Your HTML has multiple markup validation errors; you can check for them using the W3C validator. However, if you fix the following errors, it is possible to parse the string using lxml.
Unclosed element span.
From line 43, column 37; to line 43, column 42
Unclosed element span.
From line 46, column 37; to line 46, column 42
Unclosed element span.
From line 119, column 37; to line 119, column 42
Unclosed element span.
From line 122, column 37; to line 122, column 42
Stray end tag div.
From line 159, column 5; to line 159, column 10
from lxml import etree
pageHTML = """
<div class="g_list">
<div class="g_item">
...
...
"""
root = etree.fromstring(pageHTML)
dateList = root.xpath("//*[@class='g_cont_left']/table/tr[1]/td[2]/text()")
print(dateList)
#[' 2010.09.07 ', ' 2009.09.03 ']
If you still want to use a regex (which I would advise against, given the whole discussion about applying regular expressions to HTML, XML, etc. instead of using a parser specific to that grammar), you can define a capturing group that allows only digits and a literal . (([\d\.]+)), surrounded by the exact text you expect to find.
import re
pageHTML = """
<div class="g_list">
<div class="g_item">
...
...
"""
date_Regex = re.findall(r"申请日:\s*</span>\s*([\d\.]+)\s*</td>", pageHTML)
print(date_Regex)
# ['2010.09.07', '2009.09.03']
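The BeautifulSoup attempt from the question can also be extended to return every date instead of only the first. The following is a minimal sketch, assuming bs4 is installed and using a trimmed-down copy of the question's markup; the variable names (html, dates) are illustrative. It finds every span whose text contains 申请日 and reads the text node that follows it.

```python
import re
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page in the question (two g_item blocks).
html = """
<div class="g_list">
  <div class="g_item"><table><tr>
    <td><span>申请号:</span> CN201010274593.2 </td>
    <td><span>申请日:</span> 2010.09.07 </td>
  </tr></table></div>
  <div class="g_item"><table><tr>
    <td><span>申请号:</span> CN200910171675.1 </td>
    <td><span>申请日:</span> 2009.09.03 </td>
  </tr></table></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

dates = []
for span in soup.find_all("span", string=re.compile("申请日")):
    # The date is the bare text node that directly follows the label span.
    dates.append(span.next_sibling.strip())

print(dates)
```

Matching on the label text rather than on the row and column position keeps the lookup working if the table layout shifts.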

Related

HTML table to database

At this point, my table looks as follows:
<table border="0" cellpadding="0" cellspacing="0" class="ms-formtable" id="formTbl" style="margin-top: 8px;" width="100%">
<tbody>
<tr>
<td class="ms-formlabel" nowrap="true" valign="top" width="165px">
<h3 class="ms-standardheader">
<a name="SPBookmark_FileLeafRef">
</a>
Name
</h3>
</td>
<td class="ms-formbody" id="SPFieldFile" valign="top" width="450px">
<a href="http://google.com" onclick="DispDocItemEx(this, 'FALSE', 'FALSE', 'FALSE', '');">
X
</a>
</td>
</tr>
<tr>
<td class="ms-formlabel" nowrap="true" valign="top" width="165px">
<h3 class="ms-standardheader">
<a name="SPBookmark_Owner">
</a>
Name#
</h3>
</td>
<td class="ms-formbody" id="SPFieldChoice" valign="top" width="450px">
Z
</td>
</tr>
<tr>
<td class="ms-formlabel" nowrap="true" valign="top" width="165px">
<h3 class="ms-standardheader">
<a name="SPBookmark_DirectiveRank">
</a>
Age
</h3>
</td>
<td class="ms-formbody" id="SPFieldChoice" valign="top" width="450px">
52
</td>
</tr>
<tr>
<td class="ms-formlabel" nowrap="true" valign="top" width="165px">
<h3 class="ms-standardheader">
<a name="SPBookmark_Number">
</a>
number
</h3>
</td>
<td class="ms-formbody" id="SPFieldText" valign="top" width="450px">
1
</td>
</tr>
<tr>
<td class="ms-formlabel" nowrap="true" valign="top" width="165px">
<h3 class="ms-standardheader">
<a name="SPBookmark_Title">
</a>
Name of File
</h3>
</td>
<td class="ms-formbody" id="SPFieldText" valign="top" width="450px">
Funny Names
</td>
</tr>
<tr>
<td class="ms-formlabel" nowrap="true" valign="top" width="165px">
<h3 class="ms-standardheader">
<a name="SPBookmark_EffectiveFrom">
</a>
date
</h3>
</td>
<td class="ms-formbody" id="SPFieldDateTime" valign="top" width="450px">
1.1.2022
</td>
</tr>
</tbody>
</table>
I basically need to open an HTML file, select the table with id "formTbl", and then either create JSON with values such as {Firsttd: Secondtd, "Name": "Test", "Date": "Blank"} or insert the data into a database, with the first td going into table A and the second td into table B (each tr has two td elements: the first is the column name and the second is the value). Is there any way to do this? I've tried Python, where the JSON I got so far looks like [["","Name","","Test",""],["","Age","","12",""]], and in C# I've tried HTMLAgilityPack, but it wasn't working.
Here is a solution with jQuery.
<html>
<body>
<table id="example-table">
<tr>
<th>Name</th>
<th>Name#</th>
<th>Age</th>
<th>Number</th>
<th>Name of file</th>
<th>Date</th>
</tr>
<tr>
<td>X</td>
<td>Z</td>
<td>52</td>
<td>1</td>
<td>Name of file</td>
<td>2021-22-10</td>
</tr>
</table>
<textarea rows="10" cols="50" id="jsonTextArea">
</textarea>
</body>
</html>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/table-to-json@1.0.0/lib/jquery.tabletojson.min.js"></script>
<script type="text/javascript">
var tableToJson = $('#example-table').tableToJSON();
var sendingData = JSON.stringify (tableToJson);
$('#jsonTextArea').val(sendingData);
// Send JSON data to backend
$.post('http://localhost/test.php', {sendingData}, function(data, textStatus, xhr) {
var backendResponse = data;
console.log(backendResponse);
});
</script>
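Since the question also mentions Python, here is a rough sketch of the label/value pairing done server-side with BeautifulSoup. It assumes bs4 is installed and uses a shortened copy of the formTbl markup; the SQLite table name form_values and its single-table layout are hypothetical, chosen only to show the insert step.

```python
import json
import sqlite3
from bs4 import BeautifulSoup

# Shortened stand-in for the formTbl table from the question.
html = """
<table id="formTbl">
  <tbody>
    <tr>
      <td class="ms-formlabel"><h3 class="ms-standardheader"><a name="SPBookmark_FileLeafRef"></a>
        Name
      </h3></td>
      <td class="ms-formbody"><a href="http://google.com">X</a></td>
    </tr>
    <tr>
      <td class="ms-formlabel"><h3 class="ms-standardheader"><a name="SPBookmark_DirectiveRank"></a>
        Age
      </h3></td>
      <td class="ms-formbody">
        52
      </td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="formTbl")

record = {}
for tr in table.find_all("tr"):
    cells = tr.find_all("td", recursive=False)
    if len(cells) == 2:
        # First cell holds the column name, second cell holds the value.
        label = cells[0].get_text(strip=True)
        value = cells[1].get_text(strip=True)
        record[label] = value

as_json = json.dumps(record)
print(as_json)

# Hypothetical storage step: one row per label/value pair in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE form_values (label TEXT, value TEXT)")
conn.executemany("INSERT INTO form_values VALUES (?, ?)", list(record.items()))
conn.commit()
```

Calling get_text(strip=True) on each cell avoids the empty strings seen in the question's own JSON attempt, which likely came from whitespace-only text nodes.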

BeautifulSoup: How to parse an un-id'ed list of TDs in table

Using bs4 I'm able to use soup.find_all() to find each of the <tr>s for the table. HTML is below.
However, how do I efficiently access specific columns within each <tr>? Say I only want the 1st, 3rd and 5th column.
In other words, is there a way to do something similar to "date = row.td[1]" or "price_low = row.td[3]" etc?
Thanks.
<tr class="cmc-table-row" style="display:table-row">
<td class="cmc-table__cell cmc-table__cell--sticky cmc-table__cell--left">
<div class="">Dec 23, 2019</div>
</td>
<td class="cmc-table__cell cmc-table__cell--right">
<div class="">7,508.90</div>
</td>
<td class="cmc-table__cell cmc-table__cell--right">
<div class="">7,656.18</div>
</td>
<td class="cmc-table__cell cmc-table__cell--right">
<div class="">7,326.19</div>
</td>
<td class="cmc-table__cell cmc-table__cell--right">
<div class="">7,355.63</div>
</td>
<td class="cmc-table__cell cmc-table__cell--right">
<div class="">27,831,788,041</div>
</td>
<td class="cmc-table__cell cmc-table__cell--right">
<div class="">133,275,709,111</div>
</td>
</tr>
from bs4 import BeautifulSoup
html = """<tr class="cmc-table-row" style="display:table-row">
<td class="cmc-table__cell cmc-table__cell--sticky cmc-table__cell--left">
<div class="">Dec 23, 2019</div>
</td>
<td class="cmc-table__cell cmc-table__cell--right">
<div class="">7,508.90</div>
</td>
<td class="cmc-table__cell cmc-table__cell--right">
<div class="">7,656.18</div>
</td>
<td class="cmc-table__cell cmc-table__cell--right">
<div class="">7,326.19</div>
</td>
<td class="cmc-table__cell cmc-table__cell--right">
<div class="">7,355.63</div>
</td>
<td class="cmc-table__cell cmc-table__cell--right">
<div class="">27,831,788,041</div>
</td>
<td class="cmc-table__cell cmc-table__cell--right">
<div class="">133,275,709,111</div>
</td>
</tr>
"""
soup = BeautifulSoup(html, 'html.parser')
for item in soup.findAll("div", {'class': ''})[0:5:2]:
    print(item.text)
Output:
Dec 23, 2019
7,656.18
7,355.63
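An alternative that is closer to the "row.td[1]" idea in the question is to collect each row's td elements into a list and index into it. This is a sketch under the assumption that every row has the same seven cells; the names third and fifth are placeholders, since the question does not say what each column holds.

```python
from bs4 import BeautifulSoup

# Shortened copy of the row from the question, wrapped in a table.
html = """<table>
<tr class="cmc-table-row">
  <td><div>Dec 23, 2019</div></td>
  <td><div>7,508.90</div></td>
  <td><div>7,656.18</div></td>
  <td><div>7,326.19</div></td>
  <td><div>7,355.63</div></td>
  <td><div>27,831,788,041</div></td>
  <td><div>133,275,709,111</div></td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

records = []
for row in soup.find_all("tr", class_="cmc-table-row"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    # Zero-based positions 0, 2 and 4 are the 1st, 3rd and 5th columns.
    date, third, fifth = cells[0], cells[2], cells[4]
    records.append((date, third, fifth))

print(records)
```

Indexing per row keeps the columns of one row together, whereas slicing a flat list of all div elements only works when the document contains a single row.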

How to find a specific tag by text with BeautifulSoup in Python

<table class="person show-interviews interviews-loaded" application="43352812" current-interview-stage-id="373822" candidate_hiring_plan="52607">
<tbody><tr class="basic-info clickable candidate">
<td class="photo-column" href="/people/34284587?application_id=43352812&src=search">
<img class="person-photo" width="40" height="40" alt="Candidate Profile Picture" src="https://gravatar.com/avatar/b6d305a017cc572d47807d9e6812bef1.png?s=40&d=https%3A%2F%2Fcdn.greenhouse.io%2Fassets%2Fsilhouette-7fdf9a27e7e8acd6f7cad72986479543.png">
</td>
<td class="person-info-column" href="/people/34284587?application_id=43352812&src=search">
<p class="name">
Chew Bacca
<img class="email-candidate-icon" title="Email Chew" width="16" modal_path="/people/34284587/email_candidate_modal?application_id=43352812" src="https://cdn.greenhouse.io/assets/icons/email-fd1e71440bb47a93b13bccdbffa4d311.png" alt="Email">
</p>
</td>
<td class="job-info-column" href="/people/34284587?application_id=43352812&src=search">
<p class="job">Consulting Engineer </p>
<div class="status">
<a class="toggle-interviews" href="#">1 interview to schedule for Face to Face</a>
</div>
</td>
<td class="interview-kit-column" nofollow="true">
<div class="interview-kit-wrapper">
<span class="interview-kit-icon"></span><br>
<a modal_path="/people/34284587/applications/43352812/submit_feedback_options" class="submit-feedback-link" href="#">interview kit</a>
</div>
<label class="bulk-checkbox-wrapper">
<input class="bulk-checkbox" type="checkbox">
</label>
</td>
</tr>
<tr class="availability">
<td colspan="3" class="details name">
<div class="header">
<div class="left-col">
<span class="title closed no-expand">Availability</span>
<span class="state">
<div class="dropdown">
<button name="button" type="submit" id="quick_action_304014813" class="link-like-button" data-toggle="dropdown" aria-has-popup="true" aria-expanded="false">Not Requested</button>
<ul class="dropdown-menu" aria-labelledby="quick_action_304014813">
<li data-type="state" data-url="/people/availability/304014813/state" data-state="not_requested" class="dropdown-item" data-current-state="true">Not Requested</li>
<li data-type="state" data-url="/people/availability/304014813/state" data-state="requested" class="dropdown-item">Requested</li>
<li data-type="state" data-url="/people/availability/304014813/state" data-state="received" class="dropdown-item">Received</li>
<li data-type="state" data-url="/people/availability/304014813/state" data-state="confirmation_sent" class="dropdown-item">Confirmation Sent</li>
<li data-type="action" data-url="/people/availability/edit_modal/304014813?force=true" data-action="edit_availability" class="dropdown-item action-item">ENTER AVAILABILITY MANUALLY</li>
<li data-type="action" data-url="/people/availability/cofirm_modal/304014813?force=true" data-action="send_confirmation" class="dropdown-item action-item">SEND INTERVIEW CONFIRMATION</li>
</ul>
</div>
<span class="action-time"></span>
</span>
</div>
<span class="action">
<button name="button" type="submit" class="link-like-button availability-modal-open" modal_path="/people/availability/request_modal/304014813" data-modal-path="/people/availability/request_modal/304014813">Request Availability</button>
</span>
</div>
<div class="body">
<div class="times-container">
<div class="times proposed">
<div class="title">Suggested Times:</div>
<ul>
</ul>
</div>
<div class="times candidate">
<div class="title">
Chew is available at these times:
</div>
Not yet responded <button name="button" type="button" modal_path="/people/availability/edit_modal/304014813" class="link-like-button availability-edit-modal-open">Edit</button>
</div>
</div>
</div>
</td>
<td class="interview-kit-column"></td>
</tr>
<tr class="interview spicy" application_id="43352812" step_id="553192" stage_id="" style="">
<td colspan="2" rowspan="1" class="name" href="/guides/553364/people/34284587?application_id=43352812" title="View Interview Kit">
<span class="interview-kit-icon small"></span>Cultural Fit Interview
</td>
<td class="details">
<div class="wrapper">
<div class="interview-info">
Skipped <span href="/interviews/49710750/unskip" class="unskip-link">Unskip</span>
</div>
</div>
</td>
<td class="interview-kit-column">
</td>
</tr>
<tr class="interview spicy" application_id="43352812" step_id="553193" stage_id="" style="">
<td colspan="2" rowspan="1" class="name" href="/guides/553365/people/34284587?application_id=43352812" title="View Interview Kit">
<span class="interview-kit-icon small"></span>Peer Panel Interview
</td>
<td class="details">
<div class="wrapper">
<div class="interview-info">
Skipped <span href="/interviews/49710751/unskip" class="unskip-link">Unskip</span>
</div>
</div>
</td>
<td class="interview-kit-column">
</td>
</tr>
<tr class="interview spicy" application_id="43352812" step_id="553194" stage_id="" style="">
<td colspan="2" rowspan="1" class="name" href="/guides/553366/people/34284587?application_id=43352812" title="View Interview Kit">
<span class="interview-kit-icon small"></span>Case Study
</td>
<td class="details">
<div class="wrapper">
<div class="interview-info">
Skipped <span href="/interviews/49710752/unskip" class="unskip-link">Unskip</span>
</div>
</div>
</td>
<td class="interview-kit-column">
</td>
</tr>
<tr class="interview spicy" application_id="43352812" step_id="553195" stage_id="" style="">
<td colspan="2" rowspan="1" class="name" href="/guides/553367/people/34284587?application_id=43352812" title="View Interview Kit">
<span class="interview-kit-icon small"></span>Executive Interview
</td>
<td class="details">
<div class="wrapper">
<div class="interview-info">
Skipped <span href="/interviews/49710753/unskip" class="unskip-link">Unskip</span>
</div>
</div>
</td>
<td class="interview-kit-column">
</td>
</tr>
<tr class="interview spicy" application_id="43352812" step_id="4883928" stage_id="" style="">
<td colspan="2" rowspan="1" class="name" href="/guides/4884061/people/34284587?application_id=43352812" title="View Interview Kit">
<span class="interview-kit-icon small"></span>Challenge
</td>
<td class="details schedulable removable" modal_path="/interviews/schedule?application_id=43352812&interview_kit_id=4884061" modal_title="Consulting Engineer (Austin, New York City, Palo Alto)" nofollow="true" title="Schedule Interview">
<div class="wrapper">
<span href="/interviews/49710754/skip" class="x" title="Skip this interview"></span>
<span class="to-be-scheduled-icon"></span>
<div class="interview-info">
Schedule Interview
<div class="integration-buttons">
</div>
</div>
</div>
</td>
<td class="interview-kit-column">
</td>
</tr>
<tr class="interview spicy" application_id="43352812" step_id="4883933" stage_id="" style="">
<td colspan="2" rowspan="1" class="name" href="/guides/4884066/people/34284587?application_id=43352812" title="View Interview Kit">
<span class="interview-kit-icon small"></span>Personality Assessment
</td>
<td class="details">
<div class="wrapper">
<div class="interview-info">
Skipped <span href="/interviews/49710755/unskip" class="unskip-link">Unskip</span>
</div>
</div>
</td>
<td class="interview-kit-column">
</td>
</tr>
</tbody></table>
<table class="person show-interviews interviews-loaded" application="31024648" current-interview-stage-id="373842" candidate_hiring_plan="52610">
<tbody><tr class="basic-info clickable candidate">
<td class="photo-column" href="/people/5879170?application_id=31024648&src=search">
<img class="person-photo" width="30" height="40" alt="Candidate Profile Picture" src="https://prod-heroku.s3.amazonaws.com/people/photos/005/879/170/resized/imgres.jpg?AWSAccessKeyId=AKIAIK36UTOKQ5F2YNMQ&Expires=1495711223&Signature=GuPHCM1nw%2B2tC%2F44rHejCRvnsx0%3D">
</td>
<td class="person-info-column" href="/people/5879170?application_id=31024648&src=search">
<p class="name">
Jessica Alba
<span class="alert" title="Jessica Alba has been in Phone Interview for more than 14 days">Alert</span>
</p>
<p class="title">New York University</p>
</td>
<td class="job-info-column" href="/people/5879170?application_id=31024648&src=search">
<p class="job">Enterprise Account Executive (North America)</p>
<div class="status">
<a class="toggle-interviews" href="#">1 interview to schedule for Phone Interview</a>
</div>
</td>
<td class="interview-kit-column" nofollow="true">
<div class="interview-kit-wrapper">
<span class="interview-kit-icon"></span><br>
<a modal_path="/people/5879170/applications/31024648/submit_feedback_options" class="submit-feedback-link" href="#">interview kit</a>
</div>
<label class="bulk-checkbox-wrapper">
<input class="bulk-checkbox" type="checkbox">
</label>
</td>
</tr>
<tr class="availability">
<td colspan="3" class="details name">
<div class="header">
<div class="left-col">
<span class="title closed no-expand">Availability</span>
<span class="state">
<div class="dropdown">
<button name="button" type="submit" id="quick_action_210624650" class="link-like-button" data-toggle="dropdown" aria-has-popup="true" aria-expanded="false">Not Requested</button>
<ul class="dropdown-menu" aria-labelledby="quick_action_210624650">
<li data-type="state" data-url="/people/availability/210624650/state" data-state="not_requested" class="dropdown-item" data-current-state="true">Not Requested</li>
<li data-type="state" data-url="/people/availability/210624650/state" data-state="requested" class="dropdown-item">Requested</li>
<li data-type="state" data-url="/people/availability/210624650/state" data-state="received" class="dropdown-item">Received</li>
<li data-type="state" data-url="/people/availability/210624650/state" data-state="confirmation_sent" class="dropdown-item">Confirmation Sent</li>
<li data-type="action" data-url="/people/availability/edit_modal/210624650?force=true" data-action="edit_availability" class="dropdown-item action-item">ENTER AVAILABILITY MANUALLY</li>
<li data-type="action" data-url="/people/availability/cofirm_modal/210624650?force=true" data-action="send_confirmation" class="dropdown-item action-item">SEND INTERVIEW CONFIRMATION</li>
</ul>
</div>
<span class="action-time"></span>
</span>
</div>
<span class="action">
<button name="button" type="submit" class="link-like-button availability-modal-open" modal_path="/people/availability/request_modal/210624650" data-modal-path="/people/availability/request_modal/210624650">Request Availability</button>
</span>
</div>
<div class="body">
<div class="times-container">
<div class="times proposed">
<div class="title">Suggested Times:</div>
<ul>
</ul>
</div>
<div class="times candidate">
<div class="title">
Jessica is available at these times:
</div>
Not yet responded <button name="button" type="button" modal_path="/people/availability/edit_modal/210624650" class="link-like-button availability-edit-modal-open">Edit</button>
</div>
</div>
</div>
</td>
<td class="interview-kit-column"></td>
</tr>
<tr class="interview spicy" application_id="31024648" step_id="553218" stage_id="" style="">
<td colspan="2" rowspan="1" class="name" href="/guides/553390/people/5879170?application_id=31024648" title="View Interview Kit">
<span class="interview-kit-icon small"></span>Technical Phone Interview
</td>
<td class="details schedulable removable" modal_path="/interviews/schedule?application_id=31024648&interview_kit_id=553390" modal_title="Enterprise Account Executive (North America)" nofollow="true" title="Schedule Interview">
<div class="wrapper">
<span href="/interviews/23067896/skip" class="x" title="Skip this interview"></span>
<span class="to-be-scheduled-icon"></span>
<div class="interview-info">
Schedule Interview
<div class="integration-buttons">
</div>
</div>
</div>
</td>
<td class="interview-kit-column">
</td>
</tr>
</tbody></table>
There are multiple tables with the class person show-interviews interviews-loaded. I want to extract the class of the table whose text matches or contains Challenge, and ignore the other tables. This is what I have tried so far:
with open('Page_Source.html') as page_source:
    soup=BeautifulSoup(page_source,'html.parser')
for table in soup.findAll('table',{'class':'person show-interviews interviews-loaded'}):
    name=table.find('p',{'class':'name'}).find('a').text
    #print name
    #print table['application']
    #print table['current-interview-stage-id']
    job_title=table.find('p',{'class':'job'}).text
    #print job_title
    next_interview_details=table.find('a',{'class':'toggle-interviews'}).text
    #print next_interview_details
    for tr in table.findAll('tr',{'class':'interview spicy'}):
        i=tr.find('td',text='Challenge')
        print(i)
You can filter the desired table(s) by applying a filtering function where you check for Challenge substring to be present in the table's "text":
for table in soup.find_all(lambda tag: tag.name == 'table' and 'Challenge' in tag.get_text()):
    print(table.get('class'))
Prints:
['person', 'show-interviews', 'interviews-loaded']
Ask BeautifulSoup to give you the list of tables. Then look at each table, asking whether it contains 'Challenge'. If it does then display the class attribute for that table.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('temp.htm').read(),'lxml')
>>> tables = soup.findAll('table')
>>> for table in tables:
...     if 'Challenge' in table.text:
...         table.attrs['class']
...
['person', 'show-interviews', 'interviews-loaded']
EDIT: Response to comment. I haven't written the code as a filter this time because I wanted to make the logic more apparent.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('temp.htm').read(),'lxml')
>>> tables = soup.findAll('table')
>>> for table in tables:
...     '----->', table.attrs['class']
...     target_tds = [_.parent for _ in table.findAll('span', attrs={'class': 'interview-kit-icon small'})]
...     for target_td in target_tds:
...         target_td.text.strip(), 'Skipped' in target_td.fetchNextSiblings()[0].text
...
('----->', ['person', 'show-interviews', 'interviews-loaded'])
('Cultural Fit Interview', True)
('Peer Panel Interview', True)
('Case Study', True)
('Executive Interview', True)
('Challenge', False)
('Personality Assessment', True)
('----->', ['person', 'show-interviews', 'interviews-loaded'])
('Technical Phone Interview', False)
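The tr.find('td', text='Challenge') call in the question returns None because each name cell contains a span plus text, so the cell's .string is None and cannot equal 'Challenge'. Below is a minimal sketch of one way around that, assuming bs4 is installed and using a cut-down copy of the two tables; the helper name is_challenge_cell is made up for illustration. It matches the cell on its stripped full text and then climbs to the enclosing table.

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the two candidate tables in the question.
html = """
<table class="person show-interviews interviews-loaded" application="43352812">
  <tr class="interview spicy">
    <td class="name"><span class="interview-kit-icon small"></span>Case Study
    </td>
  </tr>
  <tr class="interview spicy">
    <td class="name"><span class="interview-kit-icon small"></span>Challenge
    </td>
  </tr>
</table>
<table class="person show-interviews interviews-loaded" application="31024648">
  <tr class="interview spicy">
    <td class="name"><span class="interview-kit-icon small"></span>Technical Phone Interview
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def is_challenge_cell(tag):
    # Compare the cell's full stripped text, because td.string is None here.
    return tag.name == "td" and tag.get_text(strip=True) == "Challenge"


matches = []
for td in soup.find_all(is_challenge_cell):
    table = td.find_parent("table")
    matches.append((table["application"], table["class"]))

print(matches)
```

Because both tables share the same class list, the application attribute is included here to show which table actually matched.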

Selenium Python For loop through HTML table I want to iterate the 1st row not all rows

I am using a for loop to iterate over a HTML table. I only need to iterate the 1st row, not all of the rows.
My code snippet is below of my method:
def is_historical_tasks_have_any_error_of_the_completed_process(self):
    try:
        table_id = WebDriverWait(self.driver, 20).until(EC.presence_of_element_located((By.ID, 'operations_monitoring_tab_historical_tasks_ct_fields_body')))
        rows = table_id.find_elements(By.TAG_NAME, "tr")
    except NoSuchElementException as e:
        return False
    count = 1
    for row in rows:
        col_project_name = row.find_elements(By.TAG_NAME, "td")[4] # This is the project name column
        col_name = row.find_elements(By.TAG_NAME, "td")[10]
        col_status = row.find_elements(By.TAG_NAME, "td")[18]
        col_last_start_time = row.find_elements(By.TAG_NAME, "td")[19]
        col_last_end_time = row.find_elements(By.TAG_NAME, "td")[20]
        col_notes = row.find_elements(By.TAG_NAME, "td")[21]
        print(col_project_name.text)
        print(col_name.text)
        print(col_status.text)
        print(col_last_start_time.text)
        print(col_last_end_time.text)
        print(col_notes.text)
        count = count + 1
        if col_notes.text == "": # If column Notes is not empty there was an error during the process, return false
            return True
        else:
            print("Error in process")
            print(col_notes.text)
            return False
        if count >= 1: # we want only the 1st row from the historical tasks. Break out of the for loop when count is greater than 1
            break
    return False
The HTML is (I have removed some cols otherwise it will be too long to paste):
<table id="operations_monitoring_tab_historical_tasks_ct_fields_body" cellspacing="0" style="table-layout: fixed; width: 100%; margin-bottom: 17px;">
<colgroup>
<tbody>
<tr class="GPI5XK1CEM" __gwt_subrow="0" __gwt_row="0">
<td class="GPI5XK1CDM GPI5XK1CFM GPI5XK1CGM">
<div __gwt_cell="cell-gwt-uid-3952" style="outline-style:none;">1</div>
</td>
<td class="GPI5XK1CDM GPI5XK1CFM">
<div __gwt_cell="cell-gwt-uid-3953" style="outline-style:none;" tabindex="0">
<input type="radio" name="rb444241113">
</div>
</td>
<td class="GPI5XK1CDM GPI5XK1CFM">
<div __gwt_cell="cell-gwt-uid-3954" style="outline-style:none;">
<span class="" title="1" style="white-space:nowrap;overflow:hidden;text-overflow:ellipsis;empty-cells:show;display:block;padding-right: 1px;">1</span>
</div>
</td>
<td class="GPI5XK1CDM GPI5XK1CFM">
<div __gwt_cell="cell-gwt-uid-3955" style="outline-style:none;">
<span class="" title="53" style="white-space:nowrap;overflow:hidden;text-overflow:ellipsis;empty-cells:show;display:block;padding-right: 1px;">53</span>
</div>
</td>
<td class="GPI5XK1CDM GPI5XK1CFM">
<div __gwt_cell="cell-gwt-uid-3956" style="outline-style:none;">
<span class="" title="LADemo" style="white-space:nowrap;overflow:hidden;text-overflow:ellipsis;empty-cells:show;display:block;padding-right: 1px;">LADemo</span>
</div>
</td>
<td class="GPI5XK1CDM GPI5XK1CFM">
<div __gwt_cell="cell-gwt-uid-3957" style="outline-style:none;">
<span class="" title="Generate Stats" style="white-space:nowrap;overflow:hidden;text-overflow:ellipsis;empty-cells:show;display:block;padding-right: 1px;">Generate Stats</span>
</div>
<td class="GPI5XK1CDM GPI5XK1CFM">
<div __gwt_cell="cell-gwt-uid-3959" style="outline-style:none;">
<span class="" title="Stats" style="white-space:nowrap;overflow:hidden;text-overflow:ellipsis;empty-cells:show;display:block;padding-right: 1px;">Stats</span>
</div>
</td>
<td class="GPI5XK1CDM GPI5XK1CFM">
<div __gwt_cell="cell-gwt-uid-3960" style="outline-style:none;">
<span class="" title="11" style="white-space:nowrap;overflow:hidden;text-overflow:ellipsis;empty-cells:show;display:block;padding-right: 1px;">11</span>
</div>
</td>
<td class="GPI5XK1CDM GPI5XK1CFM">
<div __gwt_cell="cell-gwt-uid-3962" style="outline-style:none;">
<span class="" title="Possible match stats" style="white-space:nowrap;overflow:hidden;text-overflow:ellipsis;empty-cells:show;display:block;padding-right: 1px;">Possible match stats</span>
</div>
</td>
<td class="GPI5XK1CDM GPI5XK1CFM">
<div __gwt_cell="cell-gwt-uid-3972" style="outline-style:none;">
<span class="" title="2015-10-28 16:10:40" style="white-space:nowrap;overflow:hidden;text-overflow:ellipsis;empty-cells:show;display:block;padding-right: 1px;">2015-10-28 16:10:40</span>
</div>
</td>
<td class="GPI5XK1CDM GPI5XK1CFM">
<div __gwt_cell="cell-gwt-uid-3973" style="outline-style:none;">
<span class="" title="" style="white-space:nowrap;overflow:hidden;text-overflow:ellipsis;empty-cells:show;display:block;padding-right: 1px;"></span>
</div>
</td>
<td class="GPI5XK1CDM GPI5XK1CFM GPI5XK1CAN">
<div __gwt_cell="cell-gwt-uid-3974" style="outline-style:none;">
<span class="" title="" style="white-space:nowrap;overflow:hidden;text-overflow:ellipsis;empty-cells:show;display:block;padding-right: 1px;"></span>
</div>
</td>
</tr>
<tr class="GPI5XK1CDN" __gwt_subrow="0" __gwt_row="1">
<tr class="GPI5XK1CEM" __gwt_subrow="0" __gwt_row="2">
</tbody>
</table>
At the line if count >= 1 it says code is not reachable.
Where do I put this if count >=1?
Or what is the correct way to iterate only the 1st row of the HTML table?
I was trying to use a count to keep track of the rows in the for loop.
Just use find_element instead of find_elements:
first_row = table_id.find_element(By.TAG_NAME, "tr")
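The "code is not reachable" warning appears because both branches of the preceding if/else return, so nothing after them in the loop body can ever run. The sketch below shows the first-row-only logic without a loop or counter. It uses plain lists of cell text as a stand-in for Selenium elements; the function name, the NOTES_COLUMN constant, and the sample rows are illustrative assumptions, not part of the original code.

```python
NOTES_COLUMN = 21  # index of the Notes column, as in the question


def first_row_has_no_error(rows):
    """Return True when the first row's Notes cell is empty."""
    if not rows:
        # No rows at all: mirror the question's fallback of returning False.
        return False
    first_row = rows[0]  # only the first row is inspected
    return first_row[NOTES_COLUMN] == ""


# Stand-in data: each row is a list of 22 cell texts.
clean_row = [""] * 22
failed_row = [""] * 21 + ["Connection timed out"]

print(first_row_has_no_error([clean_row, failed_row]))
print(first_row_has_no_error([failed_row, clean_row]))
```

With Selenium the same shape applies: take rows[0] (or the single element returned by find_element) and read its cells, rather than breaking out of a for loop.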

Beautiful Soup Table, stop getting info

Hey everyone, I have some HTML that I am parsing. Here it is:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<table class="dayinner">
<tr class="lun">
<td class="mealname" colspan="3">LUNCH</td>
</tr>
<tr class="lun">
<td class="station"> Deli</td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000010000047598_35356" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox" /> <span class="ul" onclick=
"nf('0000047598_35356');" onmouseout="pcls(this);"
onmouseover="ws(this);">Made to Order Deli Core</span>
</div>
</td>
<td class="price"></td>
</tr>
<tr class="lun">
<td class="station"> </td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000020000046033_63436" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox" /> <span class="ul" onclick=
"nf('0000046033_63436');" onmouseout="pcls(this);"
onmouseover="ws(this);">Chicken Caesar Wrap</span>
</div>
</td>
<td class="price"></td>
</tr>
<tr class="lun">
<td colspan="3" style="height:3px;"></td>
</tr>
<tr class="lun">
<td colspan="3" style="background-color:#c0c0c0; height:1px;"></td>
</tr>
<tr class="lun">
<td class="station"> Dessert</td>
<td class="station"> </td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000020000046033_63436" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox" /> <span class="ul" onclick=
"nf('0000046033_63436');" onmouseout="pcls(this);"
onmouseover="ws(this);">Chicken Caesar Wrap</span>
</div>
</td>
</tr>
</table>
</body>
</html>
Here is the code I have. I want just the items under the deli section, and normally I won't know how many there are. Is there a way to do this?
soup = BeautifulSoup(open("upperMenu.html"))
title = soup.find('td', class_='station').text.strip()
spans = soup.find_all('span', class_='ul')[:2]
But this only works if there are two items. How can I make it work if the number of items is unknown?
Thanks in advance
You can use the text argument of the find_all function to do this in two steps:
1. Find all the rows whose station column contains the substring Deli.
2. Loop through every such row and find the spans within that row whose class is ul.
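import re
from bs4 import BeautifulSoup

# `text` is assumed to hold the HTML source shown in the question
soup = BeautifulSoup(text, 'html.parser')
tds_deli = soup.find_all(name='td', attrs={'class':'station'}, text=re.compile('Deli'))
for td in tds_deli:
    try:
        # the parent of the station cell is its table row
        tr = td.find_parent()
        spans = tr.find_all('span', {'class':'ul'})
        for span in spans:
            # do something
            print(span.text)
        print('------------one row -------------')
    except Exception:
        pass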
Sample Output in this case:
Made to Order Deli Core
------------one row -------------
Not sure if I am understanding the problem correctly but I think my code might help you get started.
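If the goal is every item under Deli, including rows whose station cell is blank, one option is to walk the rows in order and remember the most recent non-blank station. The following is a sketch, assuming bs4 is installed and assuming that a blank station cell means "same station as the row above"; the HTML is a shortened copy of the question's table, and current_station and deli_items are illustrative names.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the menu table in the question.
html = """
<table class="dayinner">
  <tr class="lun"><td class="mealname" colspan="3">LUNCH</td></tr>
  <tr class="lun">
    <td class="station"> Deli</td>
    <td class="menuitem"><span class="ul">Made to Order Deli Core</span></td>
  </tr>
  <tr class="lun">
    <td class="station"> </td>
    <td class="menuitem"><span class="ul">Chicken Caesar Wrap</span></td>
  </tr>
  <tr class="lun"><td colspan="3" style="height:3px;"></td></tr>
  <tr class="lun">
    <td class="station"> Dessert</td>
    <td class="menuitem"><span class="ul">Chicken Caesar Wrap</span></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

current_station = None
deli_items = []
for tr in soup.find_all("tr", class_="lun"):
    station_td = tr.find("td", class_="station")
    if station_td is not None:
        name = station_td.get_text(strip=True)
        if name:
            # A non-blank station cell starts a new section.
            current_station = name
    if current_station == "Deli":
        for span in tr.find_all("span", class_="ul"):
            deli_items.append(span.get_text(strip=True))

print(deli_items)
```

This collects however many rows sit between the Deli label and the next named station, so the [:2] slice from the question is no longer needed.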
