Parsing Html data using LXML

<div id="descriptionmodule" class="module toggle-wrap">
<div class="mod-header">
<h3 class="toggle-title">Description</h3>
</div>
<div id="issue-description" class="mod-content">
<p>qqqqqqqqqqqqq,<br/>
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq<br/>
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.</p>
<p>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq</p>
<p>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.</p>
<ul class="alternate" type="square">
<li>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq</li>
</ul>
I want only the q's. I tried this:
doc = lh.fromstring(resp.read())
for el in doc.cssselect('div.mod-content'):
    print(el.text_content())
This gives me the q's, but it also gives me other details on the page that have the class mod-content.
How do I get only the q's?
I am using lxml.
<div id="peoplemodule" class="module toggle-wrap">
<div class="mod-header">
<h3 class="toggle-title">People</h3>
</div>
<div class="mod-content">
<ul class="item-details" id="peopledetails">
<li class="people-details">
<dl>
<dt>Assignee:</dt>
<dd id="Assign-Val">
<a class="user-hover" rel="605794069" id="issue_summary_assignee_605794069" href="--------------"> AAAAAAAAAAAAA a>
</dd>
</dl>
<dl>
<dt>Reporter:</dt>
<dd id="Report-Val">
<a class="user-hover" rel="700843051" id="issue_summary_reporter_700843051" href="-------------------------">BBBBBBBBBBBBBB</a>
</dd>
</dl>
<dl><dt> </dt><dd> </dd></dl>
<dl>
<dt title="Multiple Assignees">Multiple Assignees:</dt>
<dd id="customfield_10020-val"> <div class="shorten" id="customfield_10020-field">
<span class="tinylink"> <a class="user-hover" rel="604810609" id="multiuser_cf_604810609" href------------------">FFFFFFFFFFFFFF</a></span>, <span class="tinylink"> <a class="user-hover" rel="600548483" id="multiuser_cf_600548483" href="------------------------------------">EEEEEEEEEEEEEEEEE</a></span> </div>
</dd>
</dl>
</li>
</ul>
<div id="watchers-val">
<span class="icon icon-watch-off"></span><span class="action-text">Watch</span>
(<span id="watcher-data">1</span>)
</div>
</div>
</div>

First off: if you are parsing HTML, there is a good chance that humans have edited it by hand and that it won't validate. That is the case for the example you posted (a couple of </div> tags are missing). Consider using BeautifulSoup instead, which is specifically designed to cope with this kind of error.
That said, if your question is just about how to extract the textual part of the HTML, in other words how to convert HTML → plain text (as opposed to extracting only the text contained in specific HTML containers), this is a minimal working example:
from lxml import etree
content = '''<div id="descriptionmodule" class="module toggle-wrap">
<div class="mod-header">
<h3 class="toggle-title">Description</h3>
</div>
<div id="issue-description" class="mod-content">
<p>qqqqqqqqqqqqq,<br/>
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq<br/>
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.</p>
<p>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq</p>
<p>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.</p>
<ul class="alternate" type="square">
<li>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq</li>
</ul></div></div>'''
tree = etree.fromstring(content)
for bit in tree.xpath('//text()'):
    if bit.strip():  # you can insert any kind of test here
        print(bit.strip())
It outputs:
Description
qqqqqqqqqqqqq,
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
HTH!
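To restrict the output to the description block only, one option is to select the div by its id rather than by the shared mod-content class. This is a minimal sketch, assuming the id issue-description from the question is unique on the page; the shortened sample markup and the variable names are made up for illustration:

```python
import lxml.html as lh

# Shortened stand-in for the page in the question (illustrative only).
page = '''
<div id="descriptionmodule" class="module toggle-wrap">
  <div id="issue-description" class="mod-content">
    <p>qqq,<br/>qqq</p>
    <ul><li>qqq</li></ul>
  </div>
</div>
<div id="peoplemodule" class="module toggle-wrap">
  <div class="mod-content"><dl><dt>Assignee:</dt><dd>AAA</dd></dl></div>
</div>
'''

doc = lh.fromstring(page)

# An id should be unique per page, so this matches only the description block,
# not the other div.mod-content elements.
matches = doc.xpath('//div[@id="issue-description"]')
description = matches[0].text_content().strip() if matches else ""
print(description)
```

The same idea should work with cssselect('#issue-description') if the cssselect package is installed.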

Related

How to extract the data from encoded HTML class using python

How can I retrieve the encoded div class of a webpage (title HTML tag) using Python?
Here is my sample HTML code.

You need requests to fetch the page (in most cases it decodes the response automatically) and BeautifulSoup to extract the data from the HTML.
Update after the OP's clarifications: the CSS classes do not change dynamically; they stayed the same whenever I checked. Since they are stable, you can:
Grab a container that wraps all of the needed data, via a CSS selector:
for result in soup.select(".pSzOP-AhqUyc-qWD73c.GNzUNc span"):
    # ...
Use a regex to filter the needed data via re.findall() with a capture group (.*): only the part matched inside the parentheses is captured and returned, and .* means "capture everything".
    if re.findall(r"^Telephone\s?:\s?(.*)", result.text):
        # ...
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. On that note, there's a dedicated web scraping with CSS selectors blog post of mine.
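The capture-group step can be tried on its own, without fetching any page. This is a small sketch using only the standard re module; the sample strings are made up to resemble the span texts and are not taken from the real site:

```python
import re

# Sample strings modelled on the span texts in the answer (illustrative only).
texts = ["Telephone : 01564 773348", "Email : someone@example.com", "Other text"]

pattern = re.compile(r"^Telephone\s?:\s?(.*)")

phones = []
for text in texts:
    match = pattern.search(text.strip())
    if match:
        # group(1) is whatever the (.*) capture group matched
        phones.append(match.group(1))

print(phones)
```

Compiling the pattern once and reading group(1) avoids running the same re.findall() twice per element.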
Code and example in the online IDE:
import requests, re
from bs4 import BeautifulSoup
html = requests.get("https://sites.google.com/a/arden.solihull.sch.uk/futures/home")
soup = BeautifulSoup(html.text, "html.parser")
# all regular expressions for this task
# https://regex101.com/r/cxdxgq/1
for result in soup.select(".pSzOP-AhqUyc-qWD73c.GNzUNc span"):
    if re.findall(r"^Careers\s?.*\s?:\s?(.*)", result.text):
        name = "".join(re.findall(r"^Careers\s?.*\s?:\s?(.*)", result.text.strip()))
        print(name)
    if re.findall(r"^Telephone\s?:\s?(.*)", result.text):
        telephone = "".join(re.findall(r"^Telephone\s?:\s?(.*)", result.text.strip()))
        print(telephone)
    if re.findall(r"^Email\s?:\s?(.*)", result.text):
        email = "".join(re.findall(r"^Email\s?:\s?(.*)", result.text.strip()))
        print(email)
    # to scrape the role you can do the same thing with regex. Test on regex101.com
'''
Mrs A. Fallis
01564 773348
afallis@arden.solihull.sch.uk
Mr S. Brady
01564 7733478
sbrady@arden.solihull.sch.uk
'''
First solution, written before the OP's clarifications (it shows only the extraction part, since you hadn't provided a website URL):
from bs4 import BeautifulSoup
html = """
<div class="L581yb VICjCf" hjdwnd-ahquyc-r6poud="" jndksc="" l6ctce-pszop"="" l6ctce-purzt="" tabindex=" == $0
<div class=">
</div>
<div class="hJDwNd-AhqUyc-WNfPc purZT-AhqUyC-I15mzb PSzOP-AhqUyc-qWD73c JNdks <div class=" jndksc-smkayb"="">
<div class="" f570id"="" jsaction="zXBUYD: ZTPCnb; 2QF9Uc: Qxe3nd;
jsname=" jscontroller="SGWD4d">
>
<div class="oKdM2C KzvoMe">
<div class="hJDwNd-AhqUyc-WNFPC PSzOP-AhqUyc- qWD73c jXK9ad D2fZ2 Oj CsFc whaque GNzUNC" id="h.7f5e93de0cf8a767_49">
<div class="]XK9ad-SmkAyb">
<div class="ty]Ctd mGzaTb baZpAe">
<div class="GV3q8e aP9Z7e" id="h.p_9livxd801krd">
</div>
<h3 class="CDt4ke zfr3Q OmQG5e" dir="ltr" id="h.p_9livxd801krd" tabindex="-1">
.
</h3>
<div class="GV3q8e aP9z7e" id="h.p JrEgQYpyORCF">
</div>
<h3 class="CDt 4Ke zfr3Q OmQG5e" dir="ltr" id="h.p_JrEgQYPYORCF" tabindex="-1">
<div class="CjVfdc" jsaction="touchstart:UrsOsc; click:Kjs
qPd; focusout:QZoaz; mouseover:yOpDld; mouseout:dq0hvd;fvlRjc:jbFSO
d;CrflRd:SzACGe;" jscontroller="Ae65rd">
<div class="PPHIP rviiZ" jsname="haAclf">
.
</div>
<span style="font-family: 'Oswald'; font-weight: 500;">
Telephone : 01564 773348
</span>
</div>
</h3>
<div class="GV3q8e aP9z7e" id="h.p_sylefz-BOSBX">
</div>
><h3 id="h.p_sylefz-BOSBX" dir="ltr" class="CDt 4Ke zfr3Q OmQG5e"
</div>
</div>
</div>
</div>
</div>
</div>
"""
# pass HTML to BeautifulSoup object and assign a html.parser as a HTML parser
soup = BeautifulSoup(html, "html.parser")
# grab a phone number (only first occurrence will be extracted)
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
print(soup.select_one('.CjVfdc span').text.strip())
# Telephone : 01564 773348
# extract <div> element with .L581yb class. returns a list()
print(soup.select('.L581yb'))
'''
[<div class="L581yb VICjCf" hjdwnd-ahquyc-r6poud="" jndksc="" l6ctce-pszop"="" l6ctce-purzt="" tabindex=" == $0
<div class=">
</div>]
'''
# extract <div> element with .hJDwNd-AhqUyc-WNfPc class. returns a list()
print(soup.select('.hJDwNd-AhqUyc-WNfPc'))
'''
[<div class="hJDwNd-AhqUyc-WNfPc purZT-AhqUyC-I15mzb PSzOP-AhqUyc-qWD73c JNdks <div class=" jndksc-smkayb"="">
<div class="" f570id"="" jsaction="zXBUYD: ZTPCnb; 2QF9Uc: Qxe3nd;
jsname=" jscontroller="SGWD4d">
>
<div class="oKdM2C KzvoMe">
<div class="hJDwNd-AhqUyc-WNFPC PSzOP-AhqUyc- qWD73c jXK9ad D2fZ2 Oj CsFc whaque GNzUNC" id="h.7f5e93de0cf8a767_49">
<div class="]XK9ad-SmkAyb">
<div class="ty]Ctd mGzaTb baZpAe">
<div class="GV3q8e aP9Z7e" id="h.p_9livxd801krd">
</div>
<h3 class="CDt4ke zfr3Q OmQG5e" dir="ltr" id="h.p_9livxd801krd" tabindex="-1">
.
</h3>
<div class="GV3q8e aP9z7e" id="h.p JrEgQYpyORCF">
</div>
<h3 class="CDt 4Ke zfr3Q OmQG5e" dir="ltr" id="h.p_JrEgQYPYORCF" tabindex="-1">
<div class="CjVfdc" jsaction="touchstart:UrsOsc; click:Kjs
qPd; focusout:QZoaz; mouseover:yOpDld; mouseout:dq0hvd;fvlRjc:jbFSO
d;CrflRd:SzACGe;" jscontroller="Ae65rd">
<div class="PPHIP rviiZ" jsname="haAclf">
.
</div>
<span style="font-family: 'Oswald'; font-weight: 500;">
Telephone : 01564 773348
</span>
</div>
</h3>
<div class="GV3q8e aP9z7e" id="h.p_sylefz-BOSBX">
</div>
><h3 id="h.p_sylefz-BOSBX" dir="ltr" class="CDt 4Ke zfr3Q OmQG5e"
</div>
</div>
</div>
</div>
</div>
</div>]
'''

How to use xpath to get text from similar class?

(1)
</div>
<div class="n_cont5" id="nct7">
<div class="nc_tit">说明书：</div>
<div class="nc5" id="smsdiv">
正在查询请稍候......
</div>
</div>
<div class="n_page">上一篇第<span class="cur">2</span>篇下一篇共<span>53</span>篇转到第
<input type="text" name="pages" id="pages"
onkeydown="return SubmitKeyClick(this,event)"
onkeyup="value=value.replace(/[^\d]/g,'')"
onbeforepaste="clipboardData.setData('text',clipboardData.getData('text').replace(/[^\d]/g,''))"/>
篇</div>
</div>
(2)
<a href="javascript:noAction()" title="PDF下载"
onclick="pdfDownloadDetail('Unexamined_patent_for_invention/2016/20160330/CN105452223A/PDF_PID/CN112014000037041CN00001054522230APDFZH20160330CN00F.PDF,CN201480037041.6')" href="javascript:noAction()">PDF下载</a>
</dd>
</dl>
</div>
</li>
<li>打印</li>
<li><a class="icon7" href="javascript:noAction();" class="zidongfanyi"
onclick="translateToEn('CN201480037041.6', 'FMZL_EN,SYXX_EN')">中译英</a></li>
</ul>
<div class="clear"></div>
</div>
<div class="clear"></div>
</div>
<div class="n_page">上一篇第<span class="cur">2</span>篇下一篇共<span>53</span>篇转到第
<input type="text" name="pages" id="pages"
onkeydown="return SubmitKeyClick(this,event)"
onkeyup="value=value.replace(/[^\d]/g,'')"
onbeforepaste="clipboardData.setData('text',clipboardData.getData('text').replace(/[^\d]/g,''))"/>
篇</div>
Here are two very similar fragments of the HTML. I want to get the number "53" from the first one. I tried the code below, which doesn't work. I also tried going through the div class, but that failed too. How can I get the number "53" from the first fragment?
html.xpath('//a[contains(text(),"下一篇")]/span/text()')

Why it didn't work: the span holding 53 is a sibling (not a child) of the a element.
To complement @super.single430's answer, here's an alternative (useful if encoding issues occur during parsing):
//span[@class="cur"]/following-sibling::span/text()

html.xpath("//a[contains(text(),'下一篇')]/following-sibling::span/text()")
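The sibling axis can be checked against a trimmed copy of the pager markup. This is a sketch only: the snippet below keeps just the n_page div as shown in the question (without any a elements) and repeats it twice to mimic the duplicate on the page; the wrapper div and variable names are assumptions for illustration:

```python
import lxml.html as lh

# Trimmed copy of the pager markup from the question; it appears twice on the page.
snippet = '''
<div>
  <div class="n_page">上一篇第<span class="cur">2</span>篇下一篇共<span>53</span>篇</div>
  <div class="n_page">上一篇第<span class="cur">2</span>篇下一篇共<span>53</span>篇</div>
</div>
'''

doc = lh.fromstring(snippet)

# The total is the span that follows span.cur inside the same div.
totals = doc.xpath('//span[@class="cur"]/following-sibling::span/text()')
print(totals)

# Wrapping the expression in (...)[1] keeps only the first match in document order.
first_total = doc.xpath('(//span[@class="cur"]/following-sibling::span)[1]/text()')
print(first_total)
```

Because the pager is repeated, the unrestricted expression returns the number once per copy; the (...)[1] form is one way to keep only the first.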

Trip Advisor Scraping 'moreLink'

I've been building a web scraper in BS4 and have gotten stuck. I am using Trip Advisor as a test for other data I will be going after, but am not able to isolate the tag of the 'entire' reviews. Here is an example:
https://www.tripadvisor.com/Restaurant_Review-g56010-d470148-Reviews-Chez_Nous-Humble_Texas.html
Notice in the first review, there is an icon below "the wine list is...". I am able to easily isolate the partial reviews, but have not been able to figure out a way to get BS4 to pull the reviews after a simulated 'More' click. I'm trying to figure out what tool(s) are needed for this? Do I need to use selenium instead?
The original element looks like this:
<span class="partnerRvw">
<span class="taLnk hvrIE6 tr475091998 moreLink ulBlueLinks" onclick=" ta.util.cookie.setPIDCookie(4444); ta.call('ta.servlet.Reviews.expandReviews', {type: 'dummy'}, ta.id('review_475091998'), 'review_475091998', '1', 4444);
">
More </span>
<span class="ui_icon caret-down"></span>
</span>
Looking at the HTML after you click the More link, you will find a dynamically added div that contains the information I need (see below):
<div class="review dyn_full_review inlineReviewUpdate provider0 first newFlag" style="display: block;">
<a name="UR475091998" class=""></a>
<div id="UR475091998" class="extended provider0 first newFlag">
<div class="col1of2">
<div class="member_info">
<div id="UID_6875524F623CC948F4F9CA95BB4A9567-SRC_475091998" class="memberOverlayLink" onmouseover="requireCallIfReady('members/memberOverlay', 'initMemberOverlay', event, this, this.id, 'Reviews', 'user_name_photo');" data-anchorwidth="90">
<div class="avatar profile_6875524F623CC948F4F9CA95BB4A9567 ">
<a onclick="">
<img src="https://media-cdn.tripadvisor.com/media/photo-l/0d/97/43/bf/joannecarpenter.jpg" class="avatar potentialFacebookAvatar avatarGUID:6875524F623CC948F4F9CA95BB4A9567" width="74" height="74">
</a>
</div>
<div class="username mo">
<span class="expand_inline scrname mbrName_6875524F623CC948F4F9CA95BB4A9567" onclick="ta.trackEventOnPage('Reviews', 'show_reviewer_info_window', 'user_name_name_click')">joannecarpenter</span>
</div>
</div>
<div class="location">
Humble, Texas
</div>
</div>
<div class="memberBadging g10n">
<div id="UID_6875524F623CC948F4F9CA95BB4A9567-CONT" class="no_cpu" onclick="ta.util.cookie.setPIDCookie('15984'); requireCallIfReady('members/memberOverlay', 'initMemberOverlay', event, this, this.id, 'Reviews', 'review_count');" data-anchorwidth="90">
<div class="levelBadge badge lvl_02">
Level <span><img src="https://static.tacdn.com/img2/badges/20px/lvl_02.png" alt="" class="icon" width="20" height="20/"></span> Contributor </div>
<div class="reviewerBadge badge">
<img src="https://static.tacdn.com/img2/badges/20px/rev_03.png" alt="" class="icon" width="20" height="20">
<span class="badgeText">6 reviews</span> </div>
<div class="contributionReviewBadge badge">
<img src="https://static.tacdn.com/img2/badges/20px/Foodie.png" alt="" class="icon" width="20" height="20">
<span class="badgeText">6 restaurant reviews</span>
</div>
</div>
</div>
</div>
<div class="col2of2">
<div class="innerBubble">
<div class="quote">“<span class="noQuotes">Dinner</span>”</div>
<div class="rating reviewItemInline">
<span class="rate sprite-rating_s rating_s"> <img class="sprite-rating_s_fill rating_s_fill s50" width="70" src="https://static.tacdn.com/img2/x.gif" alt="5 of 5 bubbles">
</span>
<span class="ratingDate relativeDate" title="April 12, 2017">Reviewed 3 days ago
<span class="new redesigned">NEW</span> </span>
<a class="viaMobile" href="/apps" target="_blank" onclick="ta.util.cookie.setPIDCookie(24687)">
<span class="ui_icon mobile-phone"></span>
via mobile
</a>
</div>
<div class="entry">
<p>
Our favorite restaurant in Houston. Definitely the best and friendliest service! The food is not only served with a flair, it is absolutely delicious. My favorite is the Lamb. It is the best! Also the duck moose, fois gras, the crispy salad and the French onion soup are all spectacular! This is a must try restaurant! The wine list is fantastic. Just ask Daniel for suggestions. He not only knows his wines; he loves what he does! We Love this place!
</p>
</div>
<div class="rating-list">
<div class="recommend">
<span class="recommend-titleInline noRatings">Visited April 2017</span>
</div>
</div>
<div class="expanded lessLink">
<span class="taLnk collapse ulBlueLinks no_cpu ">
Less
</span>
<span class="textArrow_more ui_icon caret-up"></span>
</div>
<div id="helpfulq475091998_expanded" class="helpful redesigned white_btn_container ">
<span class="isHelpful">Helpful?</span> <div class="tgt_helpfulq475091998 rnd_white_thank_btn" onclick="ta.call('ta.servlet.Reviews.helpfulVoteHandlerOb', event, this, 'LeJIVqd4EVIpECri1GII2t6mbqgqguuuxizSxiniaqgeVtIJpEJCIQQoqnQQeVsSVuqHyo3KUKqHMdkKUdvqHxfqHfGVzCQQoqnQQZiptqH5paHcVQQoqnQQrVxEJtxiGIac6XoXmqoTpcdkoKAUAAv0tEn1dkoKAUAAv0zH1o3KUK0pSM13vkooXdqn3XmffAdvqndqnAfbAo77dbAo3k0npEEeJIV1K0EJIVqiJcpV1U0Ii9VC1rZlU3XozxbZZxE2crHN2TDUJiqnkiuzsVEOxdkXqi7TxXpUgyR2xXvOfROwaqILkrzz9MvzCxMva7xEkq8xXNq8ymxbAq8AzzrhhzCxbx2vdNvEn2fnwEfq8alzCeqi53ZrgnMrHhshTtowGpNSmq89IwiVb7crUJxdevaCnJEqI33qiE5JGErJExXKx5ooItGCy5wnCTx2VA7RvxEsO3'); ta.trackEventOnPage('HELPFUL_VOTE_TEST', 'helpfulvotegiven_v2');">
<img src="https://static.tacdn.com/img2/icons/icon_thumb_white.png" class="helpful_thumbs_up white">
<img src="https://static.tacdn.com/img2/icons/icon_thumb_green.png" class="helpful_thumbs_up green">
<span class="helpful_text">Thank joannecarpenter</span> </div>
</div>
<div class="tooltips vertically_centered">
<div class="reportProblem">
<span id="ReportIAP_475091998" class="problem collapsed taLnk" onclick="ta.trackEventOnPage('Report_IAP', 'Report_Button_Clicked', 'member'); ta.call('ta.servlet.Reviews.iapFlyout', event, this, '475091998')" onmouseover="if (!this.getAttribute('data-first')) {ta.trackEventOnPage('Reviews', 'report_problem', 'hover_over_flag'); this.setAttribute('data-first', 1)} uiOverlay(event, this)" data-tooltip="" data-position="above" data-content="Problem with this review?">
<img src="https://static.tacdn.com/img2/icons/gray_flag.png" width="13" height="14" alt="">
<span class="reportTxt">Report</span> </span>
</div>
</div>
<div class="userLinks">
<div class="sameGeoActivity">
<a href="/members-citypage/joannecarpenter/g56010" target="_blank" onclick="ta.setEvtCookie('Reviews','more_reviews_by_user','',0,this.href); ta.util.cookie.setPIDCookie(19160)">
See all 5 reviews by joannecarpenter for Humble </a>
</div>
<div class="askQuestion">
<span class="taLnk ulBlueLinks" onclick="ta.trackEventOnPage('answers_review','ask_user_intercept_click' ); ta.load('ta-answers', (function() {require('answers/misc').askReviewerIntercept(this, '470148', 'joannecarpenter', '6875524F623CC948F4F9CA95BB4A9567', 'en', '475091998','Chez Nous', 39151)}).bind(this), true);">Ask joannecarpenter about Chez Nous</span>
</div>
</div>
<div class="note">
This review is the subjective opinion of a TripAdvisor member and not of TripAdvisor LLC. </div>
<div class="duplicateReviewsInline">
<div class="previous">joannecarpenter has 1 more review of Chez Nous</div> <ul class="dupReviews">
<li class="dupReviewItem">
<div class="reviewTitle">
“Joanne Carpenter”
</div>
<div class="rating">
<span class="rate sprite-rating_ss rating_ss"> <img class="sprite-rating_ss_fill rating_ss_fill ss50" width="50" src="https://static.tacdn.com/img2/x.gif" alt="5 of 5 bubbles">
</span>
<span class="date">Reviewed January 18, 2017</span>
</div>
</li>
</ul>
</div>
</div>
</div>
</div>
<div class="large">
</div>
<div class="ad iab_inlineBanner">
<div id="gpt-ad-468x60" class="adInner gptAd"></div>
</div>
</div>
Is there a way for BS4 to handle this for me?

Here's a simple example to get you started:
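from selenium import webdriver
# PhantomJS support was removed in Selenium 4; on newer versions use a headless Chrome or Firefox driver instead
driver = webdriver.PhantomJS()
url = "https://www.tripadvisor.com/Restaurant_Review-g56010-d470148-Reviews-Chez_Nous-Humble_Texas.html"
driver.get(url)
# Selenium 3 API; on Selenium 4 use driver.find_element(By.CLASS_NAME, "taLnk")
elem = driver.find_element_by_class_name("taLnk")
...
You can find more info about the methods here: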
http://selenium-python.readthedocs.io/

In all likelihood you will need to examine a few more of these pages, to identify variations in the HTML code. For the sample you have offered, and given that you are able to obtain it by simulating a press, the following code works to select the paragraph that you seem to want.
from bs4 import BeautifulSoup
HTML = open('temp.htm').read()
soup = BeautifulSoup(HTML, 'lxml')
para = soup.select('.entry > p')
print (para[0].text)
Result:
Our favorite restaurant in Houston. Definitely the best and friendliest service! The food is not only served with a flair, it is absolutely delicious. My favorite is the Lamb. It is the best! Also the duck moose, fois gras, the crispy salad and the French onion soup are all spectacular! This is a must try restaurant! The wine list is fantastic. Just ask Daniel for suggestions. He not only knows his wines; he loves what he does! We Love this place!
Note that there are newlines before and after the paragraph.
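If the surrounding newlines are unwanted, they can be stripped while extracting the text. This is a sketch on a cut-down, made-up version of the expanded review markup, using the built-in html.parser so that no extra parser is needed:

```python
from bs4 import BeautifulSoup

# Cut-down version of the expanded review markup (illustrative only).
html = '''
<div class="entry">
<p>
Our favorite restaurant in Houston. The wine list is fantastic.
</p>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
para = soup.select_one(".entry > p")

# get_text(strip=True) drops the leading and trailing newlines.
review = para.get_text(strip=True) if para else ""
print(review)
```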

How to use the xpath to parse the director part from the html with python 3

I want to extract the directors' names (such as tom) from the following HTML using XPath in Python 3. This is only part of the page; the whole HTML is at http://movie.walkerplus.com/list/2015/12/.
Please help me solve this issue.
Thanks in advance!
<title> ufffff</title>
<div class="hiragana">2015<br>Dec 1st</br></div>
<div class="movies">
<div class="movie">
<h3>007</h3>
<dl class="directorList">
<dt>director</dt>
<dd>
bruce
</dd>
</dl>
</div>
</div>
<div class="movies">
<div class="movie">
<h3>wind love</h3>
<dl class="directorList">
<dt>director</dt>
<dd>
tom
</dd>
</dl>
<div class="movies">
<div class="movie">
<h3>river war</h3>
<dl class="directorList">
<dt>director</dt>
<dd>
July
</dd>
</dl>
</div>
</div>
<div class="mwb">
<div class="hiraganaLocalNavi">
<ul class="page_12">
<li class="text">o</li>
<li><a class="m01" href="/list/2015/01/">1月</a></li>
<li><a class="m02" href="/list/2015/02/">2月</a></li>
<li><a class="m03" href="/list/2015/03/">3月</a></li>
<li><a class="m04" href="/list/2015/04/">4月</a></li>
<li><a class="m05" href="/list/2015/05/">5月</a></li>
<li><a class="m06" href="/list/2015/06/">6月</a></li>
<li><a class="m07" href="/list/2015/07/">7月</a></li>
<li><a class="m08" href="/list/2015/08/">8月</a></li>
<li><a class="m09" href="/list/2015/09/">9月</a></li>
<li><a class="m10" href="/list/2015/10/">10月</a></li>
<li><a class="m11" href="/list/2015/11/">11月</a></li>
<li><a class="m12" href="/list/2015/12/">12月</a></li>
</ul>
</div>
</div>
..................

Definitely use lxml for this instead. Like this:
from io import StringIO
from lxml import etree
f = StringIO(your_html_text)
tree = etree.parse(f, etree.HTMLParser())  # the HTML parser tolerates missing or unclosed tags
what_you_are_looking_for = tree.xpath("//*[contains(concat(' ', @class, ' '), ' movies ')]")
This is a very robust way of getting the data you want, and it tolerates messy real-life HTML (missing tags, data moving around, etc.) much better than a regular expression.
You can read more about it here. Cheers!
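To go from the movies containers to the director names themselves, the XPath can target the dd inside each directorList. This is a sketch on a reduced sample modelled on the question's markup (one closing div is left out on purpose to mimic the broken page); it assumes the class attribute is exactly directorList:

```python
import lxml.html as lh

# Reduced sample based on the question's markup; one closing tag is left out on purpose.
page = '''
<div class="movies"><div class="movie"><h3>007</h3>
<dl class="directorList"><dt>director</dt><dd>
bruce
</dd></dl></div></div>
<div class="movies"><div class="movie"><h3>wind love</h3>
<dl class="directorList"><dt>director</dt><dd>
tom
</dd></dl></div>
'''

doc = lh.fromstring(page)

# Text of every <dd> inside a directorList, with surrounding whitespace removed.
directors = [
    dd.text_content().strip()
    for dd in doc.xpath('//dl[@class="directorList"]/dd')
]
print(directors)
```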

Read the link provided by alecxe; you are running into that issue.
You have spaces in your raw string that do not occur in the sample HTML.
Quotes inside the pattern string need to be escaped or replaced by '.'.
You need to set the re.M flag for multiline strings, and '.' does not match newlines by default (that takes re.S).
Regex and HTML are a match destined for madness.
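The flag behaviour mentioned above is easy to check in isolation. This is a small sketch with the standard re module on a made-up three-line fragment; it only illustrates the flags and is not a recommendation to parse HTML with regex:

```python
import re

# Small stand-in for a multi-line HTML fragment (illustrative only).
html = "<dd>\ntom\n</dd>"

# Without flags, '.' stops at a newline, so nothing matches.
no_flag = re.findall(r"<dd>(.*?)</dd>", html)

# re.S (DOTALL) lets '.' cross newlines.
dotall = [m.strip() for m in re.findall(r"<dd>(.*?)</dd>", html, re.S)]

# re.M only changes where ^ and $ match: at every line start and end.
line_match = re.findall(r"^tom$", html, re.M)

print(no_flag, dotall, line_match)
```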

Follow a sibling in Selenium/Python

I want to enter a text in a text area. The HTML code is as follows:
<li class="order-unavailable string-type-key string-block clear-fix status- require_changes expanded working autogrowed activity-opened" data-string_status="require_changes" data-master_unit_count="22" data-string_id="2394473">
<div class="key-area clear-fix">
<div class="key-area-container-one clear-fix">
<div class="key-area-container-two">
<div class="col-50 col-left">
<div class="string-controls">
<a class="control-expand-toggle selected" href="#"></a>
<a class="control-activity-toggle " href="#">2</a>
<input class="control-select-string" type="checkbox">
</div>
<div class="master-content">
</div>
<div class="col-50 col-right slave-side-container">
</div>
</div>
</div>
<div class="activity-area clear-fix">
<div class="col-50 col-left">
<div class="col-50 col-right">
<div class="comment-area-inner">
<h3>Add comment</h3>
<div class="comment-container">
<textarea class="comment-content" name="comment_content"></textarea>
</div>
<div class="col-right">
<div class="clear"></div>
<strong>Notification settings</strong>
<p>The people you select will get an email when you post this comment. They'll also be notified by email every time a new comment is added.</p>
<div class="notification-settings">
</div>
</div>
</div>
The textarea's class is comment-content (its name attribute is comment_content).
The XPath of the textarea is:
/html/body/div/section/ol/li[16]/div[2]/div[2]/div/div/textarea
This is the code I am using:
driver.find_element_by_xpath("*//div[@title=\"NOTIFICATION_HOMEPAGE_REDIRECT_CHANGED_SITE\"]/following-sibling::div[2]/div[2]/div/div/textarea").send_keys("Test comment")
Can someone help me frame the part after the sibling axis?
div[2]/div[2]/div/div/textarea
The part before the following-sibling keyword is correct.

Select the textarea directly and type into it:
driver.find_element_by_xpath("//textarea[@class='comment-content']").send_keys("Test Comment")
To work out XPath expressions, you can use the FirePath plugin for Firefox.
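One way to sanity-check such an XPath without starting a browser is to run it through lxml against a saved copy of the markup. This is a sketch on a cut-down, made-up version of the question's HTML; lxml is only standing in for the browser here, and Selenium would evaluate the same expression against the live page:

```python
import lxml.html as lh

# Cut-down version of the markup from the question (illustrative only).
page = '''
<div class="activity-area clear-fix">
  <div class="comment-area-inner">
    <h3>Add comment</h3>
    <div class="comment-container">
      <textarea class="comment-content" name="comment_content"></textarea>
    </div>
  </div>
</div>
'''

doc = lh.fromstring(page)

# Same expression as passed to Selenium in the answer.
hits = doc.xpath("//textarea[@class='comment-content']")
names = [el.get("name") for el in hits]
print(names)
```

If the expression matches more than one textarea on the real page, it would need to be narrowed, for example by anchoring it to the enclosing li.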
