Can anyone suggest me to take xpath of these fields

Can anyone suggest me to take xpath of these fields - python

<div class="the_content">
<p><strong>Niki Jones Agency, Inc</strong></p>
<p>Ms. Niki Jones</p>
<p>39 Front Street</p>
<p>Port Jervis</p>
<p>NY, 12771</p>
<p>(845) 856-1266</p>
<p>njones#nikijones.com</p>
<p>www.Nikijones.com</p>
<p>20 Years in the PR & Marketing business : Graphic design, Publications, Websites design& Development, Digital Ads, Campaigns, Social Media, Direct Mail, Website security,ADA compliance 508</p>
<div class="apss-social-share apss-theme-1 clearfix">
<div class="the_content">
<p><strong>JMB Electric Supply, LLC</strong></p>
<p>Joanne M. Barish</p>
<p>17 Belmont Street</p>
<p>White Plains, New York 10605</p>
<p>Tel: (914) 260-1895</p>
<p>Fax: 914-722-3277</p>
<p>Email: jmbelec#optionline.net</p>
<p>Website: http://jmbelec.net/</p>
<p>Description: Master distributor of Electronic and Magnetic Low Voltage Transformers & Ballasts selling throughout the United States, as well as internationally.</p>
<div class="apss-social-share apss-theme-1 clearfix">

step 1: Use below xpath to get all div elements.
/div[#class='the_content']
Step 2: for each div elements,
For company name
/p/strong
For other p tags:
/p[2]
2 is the 2nd p tag. So, you can use 3,4,5...
Please see https://www.w3schools.com/xml/xml_xpath.asp

Related

Web scraping a hidden table using Python

I am trying to scrape the "Traits" table from this website https://www.ebi.ac.uk/gwas/genes/SAMD12 (actually, the URL can change according to my necessity, but the structure will be the same).
The problem is that my knowledge is quite limited in web scraping, and I can't get this table using the basic BeautifulSoup workflow I've seen up to here.
Here's my code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ebi.ac.uk/gwas/genes/SAMD12'
page = requests.get(url)
I'm looking for the "efotrait-table":
efotrait = soup.find('div', id='efotrait-table-loading')
print(efotrait.prettify())
<div class="row" id="efotrait-table-loading" style="margin-top:20px">
<div class="panel panel-default" id="efotrait_panel">
<div class="panel-heading background-color-primary-accent">
<h3 class="panel-title">
<span class="efotrait_label">
Traits
</span>
<span class="efotrait_count badge available-data-btn-badge">
</span>
</h3>
<span class="pull-right">
<span class="clickable" onclick="toggleSidebar('#efotrait_panel span.clickable')" style="margin-left:25px">
<span class="glyphicon glyphicon-chevron-up">
</span>
</span>
</span>
</div>
<div class="panel-body">
<table class="table table-striped borderless" data-export-types="['csv']" data-filter-control="true" data-flat="true" data-icons="icons" data-search="true" data-show-columns="true" data-show-export="true" data-show-multi-sort="false" data-sort-name="numberAssociations" data-sort-order="desc" id="efotrait-table">
</table>
</div>
</div>
</div>
Specifically, this one:
soup.select('table#efotrait-table')[0]
<table class="table table-striped borderless" data-export-types="['csv']" data-filter-control="true" data-flat="true" data-icons="icons" data-search="true" data-show-columns="true" data-show-export="true" data-show-multi-sort="false" data-sort-name="numberAssociations" data-sort-order="desc" id="efotrait-table">
</table>
As you can see, the table's content doesn't show up. In the website, there's an option for saving the table as csv. It would be awesome if I get this downloadable link somehow. But when I click in the link in order to copy it, I get "javascript:void(0)" instead. I've not studied javascript, should I?
The table is hidden, and even if it's not, I would need to interactively select more rows per page to get the whole table (and the URL doesn't change, so I can't get the table either).
I would like to know a way to get access to this table programmatically (unstructured info), then the minors about organizing the table will be fine. Any clues for how doing that (or what I should study) will be greatly appreciated.
Thanks in advance

Desired data is available within API call.
import requests
data = {
"q": "ensemblMappedGenes: \"SAMD12\" OR association_ensemblMappedGenes: \"SAMD12\"",
"max": "99999",
"group.limit": "99999",
"group.field": "resourcename",
"facet.field": "resourcename",
"hl.fl": "shortForm,efoLink",
"hl.snippets": "100",
"fl": "accessionId,ancestralGroups,ancestryLinks,associationCount,association_rsId,authorAscii_s,author_s,authorsList,betaDirection,betaNum,betaUnit,catalogPublishDate,chromLocation,chromosomeName,chromosomePosition,context,countriesOfRecruitment,currentSnp,efoLink,ensemblMappedGenes,fullPvalueSet,genotypingTechnologies,id,initialSampleDescription,label,labelda,mappedLabel,mappedUri,merged,multiSnpHaplotype,numberOfIndividuals,orPerCopyNum,orcid_s,pValueExponent,pValueMantissa,parent,positionLinks,publication,publicationDate,publicationLink,pubmedId,qualifier,range,region,replicateSampleDescription,reportedGene,resourcename,riskFrequency,rsId,shortForm,snpInteraction,strongestAllele,studyId,synonym,title,traitName,traitName_s,traitUri,platform",
"raw": "fq:resourcename:association or resourcename:study"
}
def main(url):
r = requests.post(url, data=data).json()
print(r)
main("https://www.ebi.ac.uk/gwas/api/search/advancefilter")
You can follow the r.keys() and load your desired data by access the dict.
But here's a quick load (Lazy Code):
import requests
import re
import pandas as pd
data = {
"q": "ensemblMappedGenes: \"SAMD12\" OR association_ensemblMappedGenes: \"SAMD12\"",
"max": "99999",
"group.limit": "99999",
"group.field": "resourcename",
"facet.field": "resourcename",
"hl.fl": "shortForm,efoLink",
"hl.snippets": "100",
"fl": "accessionId,ancestralGroups,ancestryLinks,associationCount,association_rsId,authorAscii_s,author_s,authorsList,betaDirection,betaNum,betaUnit,catalogPublishDate,chromLocation,chromosomeName,chromosomePosition,context,countriesOfRecruitment,currentSnp,efoLink,ensemblMappedGenes,fullPvalueSet,genotypingTechnologies,id,initialSampleDescription,label,labelda,mappedLabel,mappedUri,merged,multiSnpHaplotype,numberOfIndividuals,orPerCopyNum,orcid_s,pValueExponent,pValueMantissa,parent,positionLinks,publication,publicationDate,publicationLink,pubmedId,qualifier,range,region,replicateSampleDescription,reportedGene,resourcename,riskFrequency,rsId,shortForm,snpInteraction,strongestAllele,studyId,synonym,title,traitName,traitName_s,traitUri,platform",
"raw": "fq:resourcename:association or resourcename:study"
}
def main(url):
r = requests.post(url, data=data)
match = {item.group(2, 1) for item in re.finditer(
r'traitName_s":\"(.*?)\".*?mappedLabel":\["(.*?)\"', r.text)}
df = pd.DataFrame.from_dict(match)
print(df)
main("https://www.ebi.ac.uk/gwas/api/search/advancefilter")
Output:
0 heel bone mineral density Heel bone mineral density
1 interleukin-8 measurement Chronic obstructive pulmonary disease-related ...
2 self reported educational attainment Educational attainment (years of education)
3 waist-hip ratio Waist-hip ratio
4 eye morphology measurement Eye morphology
5 CC16 measurement Chronic obstructive pulmonary disease-related ...
6 age-related hearing impairment Age-related hearing impairment (SNP x SNP inte...
7 eosinophil percentage of leukocytes Eosinophil percentage of white cells
8 coronary artery calcification Coronary artery calcified atherosclerotic plaq...
9 multiple sclerosis Multiple sclerosis
10 mathematical ability Highest math class taken (MTAG)
11 risk-taking behaviour General risk tolerance (MTAG)
12 coronary artery calcification Coronary artery calcified atherosclerotic plaq...
13 self reported educational attainment Educational attainment (MTAG)
14 pancreatitis Pancreatitis
15 hair colour measurement Hair color
16 breast carcinoma Breast cancer specific mortality in breast cancer
17 eosinophil count Eosinophil counts
18 self rated health Self-rated health
19 bone density Bone mineral density

How to store items separated by <br> into separate arrays using Python BeautifulSoup?

Trying to scrape climbing gym data. I'm using BeautifulSoup.
I want to store arrays of the gym name, location, phone number, link, and description.
Here is sample html:
div class="city">Alberta</div>
<p><b>Camp He Ho Ha Climbing Gym</b><br>
Seba Beach, Alberta, TOE 2BO Canada<br>
(780) 429-3277<br>
<a rel='nofollow' target='_blank' href='http://camphehoha.com/summer-camp/camp-life/'>Camp He Ho Ha Climbing Gym</a><br>
<span class='rt'></span> The Summit is Camp He Ho Ha's 40' climbing gym and ropes course. Facility is available for rent, with safety equipment, orientation to the course and staffing provided.</p>
<div class="city">Calgary</div>
<p><b>Bolder Climbing Community</b><br>
5508 1st Street SE, Calgary, Alberta, Canada<br>
403 988-8140<br>
<a rel='nofollow' target='_blank' href='http://www.bolderclimbing.com/'>Bolder Climbing Community</a><br>
<span class='rt'></span> Calgary's first bouldering specific climbing centre.</p>
I can easily move between each climbing gym because they are separated by <p> but the individual items I'm interested in are separated by <br>. How do I store these items into separate arrays?

You can do something like this. Basically, find the <br> tag, then the content right before it.
html = '''div class="city">Alberta</div>
<p><b>Camp He Ho Ha Climbing Gym</b><br>
Seba Beach, Alberta, TOE 2BO Canada<br>
(780) 429-3277<br>
<a rel='nofollow' target='_blank' href='http://camphehoha.com/summer-camp/camp-life/'>Camp He Ho Ha Climbing Gym</a><br>
<span class='rt'></span> The Summit is Camp He Ho Ha's 40' climbing gym and ropes course. Facility is available for rent, with safety equipment, orientation to the course and staffing provided.</p>
<div class="city">Calgary</div>
<p><b>Bolder Climbing Community</b><br>
5508 1st Street SE, Calgary, Alberta, Canada<br>
403 988-8140<br>
<a rel='nofollow' target='_blank' href='http://www.bolderclimbing.com/'>Bolder Climbing Community</a><br>
<span class='rt'></span> Calgary's first bouldering specific climbing centre.</p>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
final_content = []
ps = soup.find_all('p')
for p in ps:
content = []
breaks = p.find_all('br')
for br in breaks:
try:
b = br.previousSibling.strip()
content.append(b)
except:
continue
final_content.append(content)
Output:
print (final_content)
[['Seba Beach, Alberta, TOE 2BO Canada', '(780) 429-3277'], ['5508 1st Street SE, Calgary, Alberta, Canada', '403 988-8140']]

Can't scrape particular items from some elements

What to do when there is no container or group to select to parse the required items (which are common in each group) looping through it? I'm willing to parse the text, date and author from the pasted elements. The three results I am after do not belong to any particular group or container so I can't find the right way to get them creating a loop.
Here are the elements:
html = '''
<div class="view-content">
<p class="text-large experts-more-h">
We Have No Idea if Universal Preschool Actually Helps Kids
</p>
<p class="text-sans">
By David J. Armor. Washington Post. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2014-10-21T09:34:00-04:00">October 21, 2014</span>.
</p>
<p class="text-large experts-more-h">
At Last, Parent Resistance to Collective Standardized Tests
</p>
<p class="text-sans">
By Nat Hentoff. Cato.org. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2014-01-15T09:57:00-05:00">January 15, 2014</span>.
</p>
<p class="text-sans">
By Darcy Ann Olsen and Eric Olsen. Cato.org. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="1999-04-15T00:00:00-04:00">April 15, 1999</span>.
</p>
<p class="text-large experts-more-h">
Day Care: Parents versus Professional Advocates
</p>
<p class="text-sans">
By Darcy Ann Olsen. Cato.org. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="1998-06-01T00:00:00-04:00">June 1, 1998</span>.
</p>
</div>
'''
If you run my script, you can see that the scraped results is only the first one:
from lxml.html import fromstring
tree = fromstring(html)
post= tree.cssselect(".text-large a")[0].text
date = tree.cssselect(".date-display-single")[0].text
author = tree.cssselect(".text-sans")[0].text.strip()
print(post+'\n', date+'\n', author)
Result:
We Have No Idea if Universal Preschool Actually Helps Kids
October 21, 2014
By David J. Armor. Washington Post.
If you run this one, you will see that this script is able to parse all the results I'm after:
from lxml.html import fromstring
tree = fromstring(html)
count = tree.cssselect(".text-large a")
for item in range(len(count)):
post= tree.cssselect(".text-large a")[item].text
date = tree.cssselect(".date-display-single")[item].text
author = tree.cssselect(".text-sans")[item].text.strip()
print(post+'\n', date+'\n', author)
Results:
We Have No Idea if Universal Preschool Actually Helps Kids
October 21, 2014
By David J. Armor. Washington Post.
At Last, Parent Resistance to Collective Standardized Tests
January 15, 2014
By Nat Hentoff. Cato.org.
Day Care: Parents versus Professional Advocates
April 15, 1999
By Darcy Ann Olsen and Eric Olsen. Cato.org.
However, what i did with my second script is not at all pythonic and it will give wrong results if any data is missing. So, how to select a group or container, loop through it and parse all of them? Thanks in advance.

If one of text nodes (post, date, author) is missing, tree.cssselect(selector)[index].text should return you a NoneType object which you cannot handle as a string. To avoid this you can implement
post= tree.cssselect(".text-large a")[item].text or " "
You can also try below XPath solution:
container = tree.cssselect(".text-large")
for item in container:
post = item.xpath('./a')[0].text or " "
date = item.xpath('./following-sibling::p/span[#class="date-display-single"]')[0].text or " "
author = item.xpath('./following-sibling::p[#class="text-sans"]')[0].text.strip() or " "
print(post+'\n', date+'\n', author)

Python 2.7.10 Trying to print text from website using Beautiful Soup 4

I want my output to be like:
count:0 - Bournemouth and Watford to go head-to-head for Abdisalam Ibrahim
Olympiacos midfielder Abdisalam Ibrahim is a target for Premier League new-boys Bournemouth and Watford.The former Manchester City man is keen to leave Greece this summer, and his potential availability has alerted Eddie Howe and Quique Sanchez Flores.Lorient of Ligue 1 and La Liga's Rayo Vallacano are also interested in the 24-year-old.
Count:1 - Andre-Pierre Gignac set for Mexico move
Former West Brom target Andre-Pierre Gignac is to complete a move to Mexican side Tigres.The France international is a free agent after leaving Marseille and is set to undergo a medical later today.West Ham, Stoke, Newcastle, West Brom and Dynamo Moscow all showed interest in the 30-year-old although Tony Pulis is understood to have cooled his interest after watching Gignac against Monaco towards the end of last season.
My Program:
from bs4 import BeautifulSoup
import urllib2
response = urllib2.urlopen('http://www.dailymail.co.uk/sport/football/article-3129389/Transfer-News-LIVE-Manchester-United-Arsenal-Liverpool-Real-Madrid-Barcelona-latest-plus-rest-Europe.html')
html = response.read()
soup = BeautifulSoup(html)
count=0
for tag in soup.find_all("div", {"id":"lc-commentary-posts"}):
divTaginb = tag.find_all("div", {"class":"lc-title-container"})
divTaginp = tag.find_all("div",{"class":"lc-post-body"})
for tag1 in divTaginb:
h4Tag = tag1.find_all("b")
for tag2 in h4Tag:
print "count:%d - "%count,
print tag2.text
print '\n'
tagp = divTaginp[count].find_all('p')
for p in tagp:
print p
print '\n'
count +=1
My output:
Count:0 - ....
...
count:37 - ICYMI: Hamburg target Celtic star Stefan Johansen as part of summer
rebuilding process
<p><strong>STEPHEN MCGOWAN:</strong>┬áBundesliga giants Hamburg have been linked
with a move for CelticΓÇÖs PFA Scotland player of the year Stefan Johansen.</p>
<p>German newspapers claim the Norwegian features on a three-man shortlist of po
tential signings for HSV as part of their summer rebuilding process.</p>
<p>Hamburg scouts are reported to have watched Johansen during Friday nightΓÇÖs
scoreless Euro 2016 qualifier draw with Azerbaijan.</p>
<p><a href="http://www.dailymail.co.uk/sport/football/article-3128854/Hamburg-ta
rget-Celtic-star-Stefan-Johansen-summer-rebuilding-process.html"><strong>CLICK H
ERE for more</strong></a></p>
count:38 - ICYMI: Sevilla agree deal with Chelsea to sign out-of-contract midfi
elder Gael Kakuta
<p>Sevilla have agreed a deal with Premier League champions Chelsea to sign out-
of-contract winger Gael Kakuta.</p>
<p>The French winger, who spent last season on loan in the Primera Division with
Rayo Vallecano, will arrive in Seville on Thursday to undergo a medical with th
e back-to-back Europa League winners.</p>
<p>A statement published on Sevilla's official website confirmed the 23-year-old
's transfer would go through if 'everything goes well' in the Andalusian city.</
p>
<p><strong><a href="http://www.dailymail.co.uk/sport/football/article-3128756/Se
villa-agree-deal-Chelsea-sign-Gael-Kakuta-contract-winger-aims-resurrect-career-
Europa-League-winners.html">CLICK HERE for more</a></strong></p>
count:39 - Good morning everybody!
<p>And welcome to <em>Sportsmail's</em> coverage of all the potential movers and
shakers ahead of the forthcoming summer transfer window.</p>
<p>Whatever deals will be rumoured, agreed or confirmed today┬áyou can read all
about them here.</p>
DailyMail Website looks like this:
<div id="lc-commentary-posts"><div id="lc-id-39" class="lc-commentary-post cleared">
<div class="lc-icons">
<img src="http://i.mol.im/i/furniture/live_commentary/football_icons/teams/60x60_bournemouth.png" class="lc-icon">
<img src="http://i.mol.im/i/furniture/live_commentary/football_icons/teams/60x60_watford.png" class="lc-icon">
<div class="lc-post-time">18:03 </div>
</div>
<div class="lc-title-container">
<h4>
<b>Bournemouth and Watford to go head-to-head for Abdisalam Ibrahim</b>
</h4>
</div>
<div class="lc-post-body">
<p><strong>SAMI MOKBEL: </strong>Olympiacos midfielder Abdisalam Ibrahim is a target for Premier League new-boys Bournemouth and Watford.</p>
<p class="mol-para-with-font">The former Manchester City man is keen to leave Greece this summer, and his potential availability has alerted Eddie Howe and Quique Sanchez Flores.</p>
<p class="mol-para-with-font"><font>Lorient of Ligue 1 and La Liga's Rayo Vallacano are also interested in the 24-year-old.</font></p>
</div>
<img class="lc-post-image" src="http://i.dailymail.co.uk/i/pix/2015/06/18/18/1434647000147_lc_galleryImage_TEL_AVIV_ISRAEL_JUNE_11_A.JPG">
<b class="lc-image-caption">Abdisalam Ibrahim could return to England</b>
<div class="lc-clear"></div>
<ul class="lc-social">
<li class="lc-facebook"><span onclick="window.LiveCommentary.socialShare(postToFB, '39', 'facebook')"></span></li>
<li class="lc-twitter"><span onclick="window.LiveCommentary.socialShare(postToTWTTR, '39', 'twitter', window.twitterVia)"></span></li>
</ul>
</div>
<div id="lc-id-38" class="lc-commentary-post cleared">
<div class="lc-icons">
<img src="http://i.mol.im/i/furniture/live_commentary/football_icons/teams/60x60_west_brom.png" class="lc-icon">
<img src="http://i.mol.im/i/furniture/live_commentary/flags/60x60_mexico.png" class="lc-icon">
<div class="lc-post-time">16:54 </div>
</div>
<div class="lc-title-container">
<span><b>Andre-Pierre Gignac set for Mexico move</b></span>
</div>
<div class="lc-post-body">
<p>Former West Brom target Andre-Pierre Gignac is to complete a move to Mexican side Tigres.</p>
<p id="ext-gen225">The France international is a free agent after leaving Marseille and is set to undergo a medical later today.</p>
<p>West Ham, Stoke, Newcastle, West Brom and Dynamo Moscow all showed interest in the 30-year-old although Tony Pulis is understood to have cooled his interest after watching Gignac against Monaco towards the end of last season.</p>
</div>
<img class="lc-post-image" src="http://i.dailymail.co.uk/i/pix/2015/06/18/16/1434642784396_lc_galleryImage__FILES_A_file_picture_tak.JPG">
<b class="lc-image-caption">Andre-Pierre Gignac is to complete a move to Mexican side Tigres</b>
<div class="lc-clear"></div>
<ul class="lc-social">
<li class="lc-facebook"><span onclick="window.LiveCommentary.socialShare(postToFB, '38', 'facebook')"></span></li>
<li class="lc-twitter"><span onclick="window.LiveCommentary.socialShare(postToTWTTR, '38', 'twitter', window.twitterVia)"></span></li>
</ul>
</div>
Now my target is <div class="lc-title-container"> inside this <b></b>.Which I am getting easily. But when I am targeting <div class="lc-post-body"> inside this all <p></p>. I am not able to get only required text.
I tried p.text and p.strip() but still I am not able to solve my problem.
Error while using p.text
count:19 - City's pursuit of Sterling, Wilshere and Fabian Delph show a need fo
r English quality
MIKE KEEGAN: Colonial explorer Cecil Rhodes is famously reported to have once sa
id that to be an Englishman 'is to have won first prize in the lottery of life'.
Back in the 19th century, the vicar's son was no doubt preaching about the expan
ding Empire and his own experiences in Africa.
Traceback (most recent call last):
File "app.py", line 24, in <module>
print p.text
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position
160: character maps to <undefined>
And while i am using p.strip() I am not getting any output.
Is there any good way to do it. Help me get the best way. I am trying this thing from morning and now its night.
I dont want to use any encoder or decoder if possible
dammit = UnicodeDammit(html) print(dammit.unicode_markup)

Here's my code. You should go though it. I was to lazy to add specific fields for the dataset and instead just combined everything.
from bs4 import BeautifulSoup, element
import urllib2
response = urllib2.urlopen('http://www.dailymail.co.uk/sport/football/article-3129389/Transfer-News-LIVE-Manchester-United-Arsenal-Liverpool-Real-Madrid-Barcelona-latest-plus-rest-Europe.html')
html = response.read()
soup = BeautifulSoup(html)
count=0
article_dataset = {}
# Try to make your variables express what your trying to do.
# Collect article posts
article_post_tags = soup.find_all("div", {"id":"lc-commentary-posts"})
# Set up the aricle_dataset with the artilce name as it's key
for article_post_tag in article_post_tags:
container_tags = article_post_tag.find_all("div", {"class":"lc-title-container"})
body_tags = article_post_tag.find_all("div",{"class":"lc-post-body"})
# Find the article name, and initialize an empty dict as the value
for count, container in enumerate(container_tags):
# We know there is only 1 <b> tag in our container,
# so use find() instead of find_all()
article_name_tag = container.find('b')
# Our primary key is the article name, the corrosponding value is the body_tag.
article_dataset[article_name_tag.text] = {'body_tag':body_tags[count]}
for article_name, details in article_dataset.items():
content = []
content_line_tags = details['body_tag'].find_all('p')
# Go through each tag and collect the text
for content_tag in content_line_tags:
for data in content_tag.contents: # gather strings in our tags
if type(data) == element.NavigableString:
data = unicode(data)
else:
data = data.text
content += [data]
# combine the content
content = '\n'.join(content)
# Add the content to our data
article_dataset[article_name]['content'] = content
# remove the body_tag from our aricle data_set
for name, details in article_dataset.items():
del details['body_tag']
print
print
print 'Artilce Name: ' + name
print 'Player: ' + details['content'].split('\n')[0]
print 'Article Summary: ' + details['content']
print

Extracting text form specific nested nodes with attributes

I am trying to write XPath that will select <h3>, <ul> and <p> tags under div[#class="content"] but with p[position() > 1 and position() < last() - 1]
So far I have this....
//div[#class="content"]/*[self::h3 or self::ul or self::p[position() > 1 and position() < last() - 1]]//text()
But it doesn't work.
Here's HTML: https://gist.github.com/umrashrf/5167711

OK your XML was not well formed so I fixed this first.
<?xml version="1.0" encoding="UTF-8"?>
<div class="content">
<h1/>
<h2>
<p>Certified Nursing Assistant - Full Time</p>
Job Summary</h2>
<p>Responsible for providing personal care and assistance for residents in long
term care facility.</p>
<h2>
</h2>
<h3>Essential Functions:</h3>
<ul>
<li>
<span style="line-height: 1.5;">Responsible</span> for providing
personal care and assistance to residents </li>
<li>Assist residents in and out of bed, dressing, feeding, grooming and
personal hygiene. </li>
<li>Provide basic treatments as required and directed by nursing staff.
</li>
<li>Responsible for observing and reporting changes in residents' physical
and emotional conditions to charge nurse. </li>
</ul>
<h3>Qualifications: </h3>
<p>Education:</p>
<ul>
<li>High school diploma or equivalent </li>
<li>Successful completion of state approved certified nursing assistance
course </li>
</ul>
<p>Experience:</p>
<ul>
<li>Previous health care related experience preferred </li>
</ul>
<a id="ctl00_ctl01_namelink" class="btn" href="employment-application.aspx?
positionid=34">Apply Online</a>
<br/>
<br/>
<h2>
Apply in Person</h2>
<p>
To apply in persion please stop by Shenandoah Medical Center to pick up a job
application.</p>
<h2>
Apply by Mail</h2>
<p>
To apply by mail, download and print <a target="_blank" href="/filesimages/Careers/SMC
Employment Application.pdf">
this form</a>. Please fill out the application and then mail to:<br/>
<br/>
<strong>Shenandoah Medical Center, Human Resources<br/>
</strong>300 Pershing Avenue<br/>
Shenandoah, IA 51601</p>
</div>
Now if I understand your question correctly, you are looking to find all h3, ul and p tags, which are child nodes of div[#class="content"] and each selected child node must satisfy the condition [position() > 1 and position() < last() - 1]. For this I think this single XPATH will do:
//div[#class="content"]/h3[position() > 1 and position() < last() - 1] |
//div[#class="content"]/p[position() > 1 and position() < last() - 1] |
//div[#class="content"]/ul[position() > 1 and position() < last() - 1]

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Can anyone suggest me to take xpath of these fields - python

step 1: Use below xpath to get all div elements. /div[#class='the_content'] Step 2: for each div elements, For company name /p/strong For other p tags: /p[2] 2 is the 2nd p tag. So, you can use 3,4,5... Please see https://www.w3schools.com/xml/xml_xpath.asp

Related

Web scraping a hidden table using Python

How to store items separated by <br> into separate arrays using Python BeautifulSoup?

Can't scrape particular items from some elements

Python 2.7.10 Trying to print text from website using Beautiful Soup 4

Extracting text form specific nested nodes with attributes

Categories

Resources