Extracting text from specific nested nodes with attributes - python

I am trying to write an XPath that selects the <h3>, <ul> and <p> tags under div[@class="content"], but only those <p> tags that match p[position() > 1 and position() < last() - 1].
So far I have this:
//div[@class="content"]/*[self::h3 or self::ul or self::p[position() > 1 and position() < last() - 1]]//text()
But it doesn't work.
Here's HTML: https://gist.github.com/umrashrf/5167711

OK, your XML was not well-formed, so I fixed that first.
<?xml version="1.0" encoding="UTF-8"?>
<div class="content">
<h1/>
<h2>
<p>Certified Nursing Assistant - Full Time</p>
Job Summary</h2>
<p>Responsible for providing personal care and assistance for residents in long
term care facility.</p>
<h2>
</h2>
<h3>Essential Functions:</h3>
<ul>
<li>
<span style="line-height: 1.5;">Responsible</span> for providing
personal care and assistance to residents </li>
<li>Assist residents in and out of bed, dressing, feeding, grooming and
personal hygiene. </li>
<li>Provide basic treatments as required and directed by nursing staff.
</li>
<li>Responsible for observing and reporting changes in residents' physical
and emotional conditions to charge nurse. </li>
</ul>
<h3>Qualifications: </h3>
<p>Education:</p>
<ul>
<li>High school diploma or equivalent </li>
<li>Successful completion of state approved certified nursing assistance
course </li>
</ul>
<p>Experience:</p>
<ul>
<li>Previous health care related experience preferred </li>
</ul>
<a id="ctl00_ctl01_namelink" class="btn" href="employment-application.aspx?
positionid=34">Apply Online</a>
<br/>
<br/>
<h2>
Apply in Person</h2>
<p>
To apply in persion please stop by Shenandoah Medical Center to pick up a job
application.</p>
<h2>
Apply by Mail</h2>
<p>
To apply by mail, download and print <a target="_blank" href="/filesimages/Careers/SMC
Employment Application.pdf">
this form</a>. Please fill out the application and then mail to:<br/>
<br/>
<strong>Shenandoah Medical Center, Human Resources<br/>
</strong>300 Pershing Avenue<br/>
Shenandoah, IA 51601</p>
</div>
Now, if I understand your question correctly, you want all h3, ul and p tags that are child nodes of div[@class="content"], where each selected child node must satisfy the condition [position() > 1 and position() < last() - 1]. For this I think a single XPath union will do (in each branch, position() and last() count only the sibling elements of that name):
//div[@class="content"]/h3[position() > 1 and position() < last() - 1] |
//div[@class="content"]/p[position() > 1 and position() < last() - 1] |
//div[@class="content"]/ul[position() > 1 and position() < last() - 1]
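
For comparison, here is a minimal sketch of the reading in the original question: keep every h3 and ul, but only the p elements in positions 2 through last() - 2. It uses the standard library's xml.etree.ElementTree, whose XPath subset (as far as I know) does not accept position() comparisons, so a list slice stands in for the predicate. The SAMPLE markup and the select_text name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page in the question.
SAMPLE = """<body><div class="content">
<p>first</p>
<h3>Heading</h3>
<ul><li>item <span>one</span></li></ul>
<p>second</p>
<p>third</p>
<p>fourth</p>
<p>fifth</p>
</div></body>"""


def select_text(root):
    """Return text from h3/ul children plus the filtered p children."""
    div = root.find(".//div[@class='content']")
    paragraphs = div.findall("p")
    # position() > 1 and position() < last() - 1 keeps positions 2..last-2,
    # which is the zero-based slice [1:-2].
    kept = paragraphs[1:-2]
    texts = []
    for child in div:  # document order
        if child.tag in ("h3", "ul") or (child.tag == "p" and child in kept):
            texts.extend(t.strip() for t in child.itertext() if t.strip())
    return texts


print(select_text(ET.fromstring(SAMPLE)))
```

Filtering in Python keeps the nodes in document order, which a union of three separate paths also does in XPath 1.0 engines that return node-sets in document order.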

Related

Can anyone suggest XPath expressions for these fields?

<div class="the_content">
<p><strong>Niki Jones Agency, Inc</strong></p>
<p>Ms. Niki Jones</p>
<p>39 Front Street</p>
<p>Port Jervis</p>
<p>NY, 12771</p>
<p>(845) 856-1266</p>
<p>njones@nikijones.com</p>
<p>www.Nikijones.com</p>
<p>20 Years in the PR & Marketing business : Graphic design, Publications, Websites design& Development, Digital Ads, Campaigns, Social Media, Direct Mail, Website security,ADA compliance 508</p>
<div class="apss-social-share apss-theme-1 clearfix">
<div class="the_content">
<p><strong>JMB Electric Supply, LLC</strong></p>
<p>Joanne M. Barish</p>
<p>17 Belmont Street</p>
<p>White Plains, New York 10605</p>
<p>Tel: (914) 260-1895</p>
<p>Fax: 914-722-3277</p>
<p>Email: jmbelec@optionline.net</p>
<p>Website: http://jmbelec.net/</p>
<p>Description: Master distributor of Electronic and Magnetic Low Voltage Transformers & Ballasts selling throughout the United States, as well as internationally.</p>
<div class="apss-social-share apss-theme-1 clearfix">
Step 1: Use the XPath below to get all the div elements.
//div[@class='the_content']
Step 2: For each div element, use an XPath relative to that div.
For the company name:
./p/strong
For the other p tags:
./p[2]
Here 2 selects the 2nd p tag, so you can use 3, 4, 5... for the remaining fields.
Please see https://www.w3schools.com/xml/xml_xpath.asp
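
The two steps above can be sketched roughly as follows. This uses the standard library's xml.etree.ElementTree on a shortened, well-formed copy of the markup (the real page may need an HTML parser instead), and the field names "contact" and "street" are my own labels, not something from the question.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the markup in the question.
SAMPLE = """<root>
<div class="the_content">
<p><strong>Niki Jones Agency, Inc</strong></p>
<p>Ms. Niki Jones</p>
<p>39 Front Street</p>
</div>
<div class="the_content">
<p><strong>JMB Electric Supply, LLC</strong></p>
<p>Joanne M. Barish</p>
<p>17 Belmont Street</p>
</div>
</root>"""


def extract_records(root):
    """Step 1: find every div; step 2: read fields relative to each div."""
    records = []
    for div in root.findall(".//div[@class='the_content']"):
        records.append({
            "company": div.findtext("p/strong"),
            "contact": div.findtext("p[2]"),  # 2nd p tag inside this div
            "street": div.findtext("p[3]"),   # 3rd p tag inside this div
        })
    return records


for record in extract_records(ET.fromstring(SAMPLE)):
    print(record)
```

Positional lookups like p[2] only work while every listing keeps the same field order; the second listing in the question has extra "Tel:" and "Fax:" lines, so real data would likely need label-based matching as well.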

python xpath extract text outside tag based on the span text

I want to extract the text outside the tag and match it with the text inside the span.
This is the code:
<div class="info">
<p>
<i class="icon-trending-up"></i>
<span>Rank:</span>
600
</p>
<p>
<i class="icon-play"></i>
<span>Total Videos:</span>
36
</p>
<p>
<i class="icon-bar-chart"></i>
<span>Video Views:</span>
1,815,767
</p>
<hr>
<p>
<i class="icon-user-plus"></i>
<span>Followers:</span>
732
</p>
</div>
I want to extract something like this in separate items.
item['rank'] = rank
Rank: 600
item['videos'] = videos
Total Videos: 36
item['views'] = views
Video Views: 1,815,767
I do not want the <p> tag below <hr>
This is what I have tried so far:
rank = response.xpath("//div[@class='info']//hr/preceding-sibling::p//text()='Videos:'").extract()
This is the result:
[u'0']
OR
rank = response.xpath("//div[@class='info']//hr/preceding-sibling::p/span[contains(text(), 'Videos:')]/text()|//hr/preceding-sibling::p//text()[not(parent::span)]").extract()
This is the result:
[u' 600', u'Total Videos:', u' 36', u' 1,815,767']
Basically, I want to extract the number based on the span text, with every <p> tag separated into its own item.
Thank you
UPDATE
I can't use anything like p[1], p[2], etc., because those <p> tags may swap places, or there might be only two of them on other pages. The <span> text will remain the same.
What about:
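item["rank"] = response.xpath('//span[.="Rank:"]/following-sibling::text()[1]').extract_first()
item["videos"] = response.xpath('//span[.="Total Videos:"]/following-sibling::text()[1]').extract_first()
item["views"] = response.xpath('//span[.="Video Views:"]/following-sibling::text()[1]').extract_first()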
This should work. It looks a bit clumsy because it has to deal with the nested elements.
item['rank'] = ''.join(s.strip() for s in response.xpath('//div//span[contains(., "Rank:")]/ancestor::p/text()').extract())
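
Neither snippet above excludes the <p> that comes after the <hr>. Here is a minimal sketch of one way to do that, keyed on the span text rather than on position. It swaps Scrapy for the standard library's xml.etree.ElementTree so it runs on its own, and it assumes a well-formed copy of the markup (note the <hr/>); the values_before_hr name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the markup in the question (<hr> written as <hr/>).
SAMPLE = """<div class="info">
<p>
<i class="icon-trending-up"></i>
<span>Rank:</span>
600
</p>
<p>
<i class="icon-play"></i>
<span>Total Videos:</span>
36
</p>
<p>
<i class="icon-bar-chart"></i>
<span>Video Views:</span>
1,815,767
</p>
<hr/>
<p>
<i class="icon-user-plus"></i>
<span>Followers:</span>
732
</p>
</div>"""


def values_before_hr(info_div):
    """Map each span label to the text that follows it, stopping at <hr>."""
    values = {}
    for child in info_div:
        if child.tag == "hr":
            break  # ignore every <p> after the rule
        span = child.find("span")
        if child.tag == "p" and span is not None:
            # span.tail is the text node right after </span>
            values[span.text.strip()] = (span.tail or "").strip()
    return values


values = values_before_hr(ET.fromstring(SAMPLE))
item = {
    "rank": values.get("Rank:"),
    "videos": values.get("Total Videos:"),
    "views": values.get("Video Views:"),
}
print(item)
```

Because the lookup is by label, it keeps working if the <p> tags swap places or some are missing; a missing label simply yields None.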

Get value from multiple children that have the same parent name and child name in Selenium using Python

I want to fetch the value only for "Publisher". Both parent elements have the same class name, and I can't figure out how to do it.
<div class="block-record-info">
<div class="title3">Publisher</div>
<p class="FR_field">
<value>INFORMS, 5521 RESEARCH PARK DR, SUITE 200, CATONSVILLE, MD 21228 USA</value>
</p>
</div>
<div class="block-record-info">
<div class="title3">Categories / Classification</div>
<p class="FR_field">
<span class="FR_label">Research Areas:</span>
Computer Science; Operations Research & Management Science
</p>
The code I used:
valuexpath1 = '//div[@class="block-record-info"]'
valueElement1 = driver.find_element_by_xpath(valuexpath1)
valuexpath2 = '//*'
valueElement2 = valueElement1.find_element_by_xpath(valuexpath2)
valueValue2 = valueElement2.text
print(valueValue2)
It is giving me the values of both "Categories / Classification" and "Publisher", but I want only Publisher.
As the first div doesn't have a span element, you can try excluding any div that contains p/span:
valuexpath1 = '//div[@class="block-record-info" and not(p/span)]'
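
An alternative that does not depend on the absence of a span is to match the block by its title text. Below is a minimal sketch of that idea using the standard library's xml.etree.ElementTree instead of Selenium (which needs a live browser), on a well-formed, shortened copy of the markup. The field_value helper is hypothetical, and the XPath in the comment is an untested suggestion for the Selenium side.

```python
import xml.etree.ElementTree as ET

# Well-formed, shortened stand-in for the markup in the question.
SAMPLE = """<root>
<div class="block-record-info">
<div class="title3">Publisher</div>
<p class="FR_field">
<value>INFORMS, 5521 RESEARCH PARK DR, SUITE 200, CATONSVILLE, MD 21228 USA</value>
</p>
</div>
<div class="block-record-info">
<div class="title3">Categories / Classification</div>
<p class="FR_field">
<span class="FR_label">Research Areas:</span>
Computer Science
</p>
</div>
</root>"""

# Untested XPath 1.0 equivalent that could be tried with Selenium:
# //div[@class="block-record-info"][div[@class="title3"][normalize-space()="Publisher"]]/p


def field_value(root, title):
    """Return the text of the <p> in the block whose title3 div equals title."""
    for block in root.findall(".//div[@class='block-record-info']"):
        heading = block.findtext("div[@class='title3']", default="")
        if heading.strip() == title:
            return "".join(block.find("p").itertext()).strip()
    return None


print(field_value(ET.fromstring(SAMPLE), "Publisher"))
```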

questions about scraping a review website

I am currently crawling a review site with Beautiful Soup.
The review page contains reviews from different students,
and each student evaluates the school on several aspects.
Therefore, the structure of the page generally looks like this:
student A - title A:
    aspect 1
        comment on aspect 1
    aspect 2
        comment on aspect 2
    aspect 3
        comment on aspect 3
student B - title B:
    aspect 1
        comment on aspect 1
    aspect 2
        comment on aspect 2
    aspect 4
        comment on aspect 4
Some students only commented on particular aspects. The aspects they didn't comment on are not shown on the website.
Here is the code for each review:
<!-- mod-reviewTop -->
<div class="mod-reviewTop">
<!-- mod-reviewTop-inner -->
<div class="mod-reviewTop-inner">
<dl>
<dd>
<div class="mod-reviewTitle" itemprop="summary">
title 1 : It was ok.
</div>
</dd>
</dl>
<!-- /mod-reviewItem -->
</div>
<!-- /mod-reviewTop -->
<!-- mod-reviewBottom -->
<div class="mod-reviewBottom">
<!-- mod-reviewList-list -->
<div class="mod-reviewList-list js-review-detail" itemprop="description">
<!-- js-mod-reviewList-list -->
<div class="js-mod-reviewList-list">
<ul>
<li>
<div class="mod-reviewTitle3">
Total Evaluation
</div>
<div class="mod-reviewList-txt">
We can freely choose the course we want, and thus a lot of different knowledge can be learned.
</div>
</li>
<li>
<div class="mod-reviewTitle3">
Course
</div>
<div class="mod-reviewList-txt">
the courses are good.
</div>
</li>
<li>
<div class="mod-reviewTitle3">
Lab
</div>
<div class="mod-reviewList-txt">
we don’t join lab in the first 2 year.
</div>
</li>
</ul>
</div>
<!-- /js-mod-reviewList-list -->
</div>
<!-- /mod-reviewList-list -->
</div>
<!-- /mod-reviewBottom -->
You can see that even though the aspect titles differ, they all start with 'div class="mod-reviewTitle3"', and the comments all start with 'div class="mod-reviewList-txt"'.
My question is: how do I write good code to store this information in a data set like this?
title | aspect1 comment | aspect2 comment
A | good | very nice
I have tried the code below, but extracting the aspect comments in each block doesn't work well:
datatest = soup.find_all("div", {"class":"mod-reviewTop"})
datatest1 = soup.find_all("div", {"class":"mod-reviewBottom"})
for item in datatest:
    a = item.select('.mod-reviewTitle')
    c = item.select('.mod-reviewTitle3')
    d = item.select('.mod-reviewList-txt')
    g = item.select('.js-mod-reviewList-list')
    f= item.select('.mod-reviewItem')
    for i in range(len(a)):
        f1= f[i].text[7]
        f2= f[i].text[17]
        f3= f[i].text[26]
        f4= f[i].text[37]
        f5= f[i].text[46]
        f6= f[i].text[55]
        f7= f[i].text[63]
        print a[i].text
        print f1, f2, f3, f4, f5, f6, f7
for item in datatest1:
    for k in range(len(g)):
        print g[k].text
        print e[k].text
        print k
I regard this as a programming problem.
I have tried loops, but they didn't work well.
If you can give me some references or explain how the structure should work logically, please leave me a comment. Thanks.
Tips:
You should attach aspects and comments to their corresponding titles, which means storing them together in a proper data structure, like this (just one possible way):
[
    (title1, [
        (aspect1, comment1),
        (aspect2, comment2),
        ...
    ]),
    (title2, [
        (aspect1, comment1),
        (aspect2, comment2),
        ...
    ]),
    ...
]
So when retrieving the data you want, organize the operations with nested for loops: once you find an aspect, get its corresponding comment right away and store the two together. Avoid finding all aspects first and then all comments.
Code
Here is a demo.
# soup is the BeautifulSoup object built in the question (Python 3 syntax).
blocks = soup.find_all("div", {"class": "mod-reviewTop"})
contents = soup.find_all("div", {"class": "mod-reviewBottom"})
data = []
# Pair each review header with the review body at the same index.
for block, content in zip(blocks, contents):
    title = block.find("div", {"class": "mod-reviewTitle"}).get_text(strip=True)
    aspects = []
    # Each <li> holds one aspect heading and its comment.
    for aspect_block in content.find_all("li"):
        aspect = aspect_block.find("div", {"class": "mod-reviewTitle3"}).get_text(strip=True)
        comment = aspect_block.find("div", {"class": "mod-reviewList-txt"}).get_text(strip=True)
        aspects.append((aspect, comment))
    data.append((title, aspects))
print(data)
# One line per review: the title, then "|aspect<TAB>comment" for each aspect.
with open("output.txt", "w", encoding="utf-8") as file:
    for title, aspects in data:
        file.write(title)
        for aspect, comment in aspects:
            file.write("|" + aspect + "\t" + comment)
        file.write("\n")
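
Since the question asks for a table with one column per aspect, the nested list can also be pivoted into rows. A minimal sketch, assuming data has the (title, [(aspect, comment), ...]) shape described above; the sample values and the to_csv name are made up, and aspects a student skipped are left blank.

```python
import csv
import io

# Hypothetical sample in the (title, [(aspect, comment), ...]) shape.
data = [
    ("title A", [("Course", "good"), ("Lab", "very nice")]),
    ("title B", [("Course", "ok")]),  # this student skipped "Lab"
]


def to_csv(reviews):
    """Return CSV text with one row per review and one column per aspect."""
    # Collect aspect names in first-seen order to use as column headers.
    columns = []
    for _, aspects in reviews:
        for aspect, _ in aspects:
            if aspect not in columns:
                columns.append(aspect)
    buffer = io.StringIO()
    # restval fills in aspects that a review does not mention.
    writer = csv.DictWriter(buffer, fieldnames=["title"] + columns, restval="")
    writer.writeheader()
    for title, aspects in reviews:
        row = {"title": title}
        row.update(aspects)  # dict.update accepts (key, value) pairs
        writer.writerow(row)
    return buffer.getvalue()


print(to_csv(data))
```

This sketch assumes no aspect is itself named "title"; if one is, the title column would need a different header.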

Retrieve bbc weather data with identical span class and nested spans

I am trying to pull data from BBC Weather with a view to using it in a home automation dashboard.
I can pull the HTML fine, and I can pull one set of temperatures, but it only pulls the first.
</li>
<li class="daily__day-tab day-20150418 ">
<a data-ajax-href="/weather/en/2646504/daily/2015-04-18?day=3" href="/weather/2646504?day=3" rel="nofollow">
<div class="daily__day-header">
<h3 class="daily__day-date">
<span aria-label="Saturday" class="day-name">Sat</span>
</h3>
</div>
<span class="weather-type-image weather-type-image-40" title="Sunny"><img alt="Sunny" src="http://static.bbci.co.uk/weather/0.5.327/images/icons/tab_sprites/40px/1.png"/></span>
<span class="max-temp max-temp-value"> <span class="units-values temperature-units-values"><span class="units-value temperature-value temperature-value-unit-c" data-unit="c">13<span class="unit">°C</span></span><span class="unit-types-separator"> </span><span class="units-value temperature-value temperature-value-unit-f" data-unit="f">55<span class="unit">°F</span></span></span></span>
<span class="min-temp min-temp-value"> <span class="units-values temperature-units-values"><span class="units-value temperature-value temperature-value-unit-c" data-unit="c">5<span class="unit">°C</span></span><span class="unit-types-separator"> </span><span class="units-value temperature-value temperature-value-unit-f" data-unit="f">41<span class="unit">°F</span></span></span></span>
<span class="wind wind-speed windrose-icon windrose-icon--average windrose-icon-40 windrose-icon-40--average wind-direction-ene" data-tooltip-kph="31 km/h, East North Easterly" data-tooltip-mph="19 mph, East North Easterly" title="19 mph, East North Easterly">
<span class="speed"> <span class="wind-speed__description wind-speed__description--average">Wind Speed</span>
<span class="units-values windspeed-units-values"><span class="units-value windspeed-value windspeed-value-unit-kph" data-unit="kph">31 <span class="unit">km/h</span></span><span class="unit-types-separator"> </span><span class="units-value windspeed-value windspeed-value-unit-mph" data-unit="mph">19 <span class="unit">mph</span></span></span></span>
<span class="description blq-hide">East North Easterly</span>
</span>
This is my code which isn’t working
import urllib2
import pprint
from bs4 import BeautifulSoup
htmlFile=urllib2.urlopen('http://www.bbc.co.uk/weather/2646504?day=1')
htmlData = htmlFile.read()
soup = BeautifulSoup(htmlData)
table=soup.find("div","daily-window")
temperatures=[str(tem.contents[0]) for tem in table.find_all("span",class_="units-value temperature-value temperature-value-unit-c")]
mintemp=[str(min.contents[0]) for min in table.find_("span",class_="min-temp min-temp-value")]
maxtemp=[str(min.contents[0]) for min in table.find_all("span",class_="max-temp max-temp-value")]
windspeeds=[str(speed.contents[0]) for speed in table.find_all("span",class_="units-value windspeed-value windspeed-value-unit-mph")]
pprint.pprint(zip(temperatures,temp2,windspeeds))
Your min and max temp extraction is wrong. You find the whole min temp span (which includes both the °C and °F values), so taking the first item of its contents gives you a blank string.
Also, the class that identifies the min temp tag (min-temp min-temp-value) is not the same as the class of the Celsius value inside it (temperature-value-unit-c). So I suggest you use a CSS selector.
E.g., finding all of your min temp spans could be:
table.select('span.min-temp.min-temp-value span.temperature-value-unit-c')
This selects all spans with class temperature-value-unit-c that are descendants of spans with classes min-temp and min-temp-value.
The other lists, such as max temp and wind speed, can be built the same way.
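
To illustrate the nested lookup without fetching the live page, here is a minimal sketch that uses the standard library's xml.etree.ElementTree in place of BeautifulSoup. It assumes a trimmed, well-formed copy of one day's markup and matches the outer span by its exact class string, which is stricter than the CSS selector above; the celsius helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for one day tab from the question.
SAMPLE = """<li class="daily__day-tab">
<span class="max-temp max-temp-value"> <span class="units-values">
<span class="units-value temperature-value-unit-c" data-unit="c">13<span class="unit">°C</span></span>
<span class="units-value temperature-value-unit-f" data-unit="f">55<span class="unit">°F</span></span>
</span></span>
<span class="min-temp min-temp-value"> <span class="units-values">
<span class="units-value temperature-value-unit-c" data-unit="c">5<span class="unit">°C</span></span>
<span class="units-value temperature-value-unit-f" data-unit="f">41<span class="unit">°F</span></span>
</span></span>
</li>"""


def celsius(day, outer_class):
    """Find the outer min/max span, then the Celsius span nested inside it."""
    outer = day.find(".//span[@class='%s']" % outer_class)
    inner = outer.find(".//span[@data-unit='c']")
    # inner.text is the number that precedes the nested unit span.
    return int(inner.text)


day = ET.fromstring(SAMPLE)
print(celsius(day, "max-temp max-temp-value"), celsius(day, "min-temp min-temp-value"))
```

Reading the text of the inner Celsius span, rather than the first child of the outer span, is what avoids the blank string described above.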
