What do you do when there is no container or group to select so you can loop through it and parse the required items from each group? I want to parse the text, date and author from the pasted elements. The three results I am after do not belong to any particular group or container, so I can't find the right way to get them with a loop.
Here are the elements:
html = '''
<div class="view-content">
<p class="text-large experts-more-h">
We Have No Idea if Universal Preschool Actually Helps Kids
</p>
<p class="text-sans">
By David J. Armor. Washington Post. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2014-10-21T09:34:00-04:00">October 21, 2014</span>.
</p>
<p class="text-large experts-more-h">
At Last, Parent Resistance to Collective Standardized Tests
</p>
<p class="text-sans">
By Nat Hentoff. Cato.org. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2014-01-15T09:57:00-05:00">January 15, 2014</span>.
</p>
<p class="text-sans">
By Darcy Ann Olsen and Eric Olsen. Cato.org. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="1999-04-15T00:00:00-04:00">April 15, 1999</span>.
</p>
<p class="text-large experts-more-h">
Day Care: Parents versus Professional Advocates
</p>
<p class="text-sans">
By Darcy Ann Olsen. Cato.org. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="1998-06-01T00:00:00-04:00">June 1, 1998</span>.
</p>
</div>
'''
If you run my script, you can see that it scrapes only the first result:
from lxml.html import fromstring
tree = fromstring(html)
post = tree.cssselect(".text-large")[0].text.strip()
date = tree.cssselect(".date-display-single")[0].text
author = tree.cssselect(".text-sans")[0].text.strip()
print(post + '\n', date + '\n', author)
Result:
We Have No Idea if Universal Preschool Actually Helps Kids
October 21, 2014
By David J. Armor. Washington Post.
If you run this one, you will see that this script is able to parse all the results I'm after:
from lxml.html import fromstring
tree = fromstring(html)
count = tree.cssselect(".text-large")
for item in range(len(count)):
    post = tree.cssselect(".text-large")[item].text.strip()
    date = tree.cssselect(".date-display-single")[item].text
    author = tree.cssselect(".text-sans")[item].text.strip()
    print(post + '\n', date + '\n', author)
Results:
We Have No Idea if Universal Preschool Actually Helps Kids
October 21, 2014
By David J. Armor. Washington Post.
At Last, Parent Resistance to Collective Standardized Tests
January 15, 2014
By Nat Hentoff. Cato.org.
Day Care: Parents versus Professional Advocates
April 15, 1999
By Darcy Ann Olsen and Eric Olsen. Cato.org.
However, what I did in my second script is not at all pythonic, and it will give wrong results if any data is missing (notice that the third title above is paired with the date and author of the orphan entry that has no title). So, how do I select a group or container, loop through it, and parse all of them? Thanks in advance.
If one of the text nodes (post, date, author) is missing, tree.cssselect(selector)[index].text returns None, which you cannot handle as a string. To avoid this you can write
post = tree.cssselect(".text-large")[item].text or " "
You can also try the XPath solution below; because it looks up the date and author relative to each title, it keeps the right items paired even when one entry lacks a title:
container = tree.cssselect(".text-large")
for item in container:
    post = (item.text or " ").strip()
    date = item.xpath('./following-sibling::p/span[@class="date-display-single"]')[0].text or " "
    author = (item.xpath('./following-sibling::p[@class="text-sans"]')[0].text or " ").strip()
    print(post + '\n', date + '\n', author)
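If you want to be even more defensive about missing nodes, here is a minimal sketch along the same lines (assuming the HTML pasted above) that guards every lookup, so an entry with no date or author paragraph yields an empty string instead of an IndexError:
for title in tree.cssselect(".text-large"):
    post = (title.text or "").strip()
    # [1] keeps only the first matching sibling, so each title is
    # paired with its own .text-sans paragraph and never the orphan one
    sans = title.xpath('./following-sibling::p[@class="text-sans"][1]')
    author = (sans[0].text or "").strip() if sans else ""
    span = sans[0].xpath('./span[@class="date-display-single"]') if sans else []
    date = (span[0].text or "").strip() if span else ""
    print(post + '\n', date + '\n', author)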
Related
In my (downloaded) HTML files, executives are mentioned at the top of every file (like "Dror Ben Asher" in the code below):
<DIV id=article_participants class="content_part hid">
<P>Redhill Biopharma Ltd. (NASDAQ:<A title="" href="http://seekingalpha.com/symbol/rdhl" symbolSlug="RDHL">RDHL</A>)</P>
<P>Q4 2014 <SPAN class=transcript-search-span style="BACKGROUND-COLOR: yellow">Earnings</SPAN> Conference <SPAN class=transcript-search-span style="BACKGROUND-COLOR: #f38686">Call</SPAN></P>
<P>February 26, 2015 9:00 AM ET</P>
<P><STRONG>Executives</STRONG></P>
<P>Dror Ben Asher - CEO</P>
<P>Ori Shilo - Deputy CEO, Finance and Operations</P>
<P>Guy Goldberg - Chief Business Officer</P>
Further along in the HTML, each executive's name recurs multiple times, and after the name follows a text element I want to parse.
Example
<P>
<STRONG> Dror Ben Asher </STRONG>
</P>
<P>Yeah, in terms of production in first quarter, we’re going to be lower than we had forecasted mainly due to our grade. We’ve had a couple of higher grade stopes in our Seabee complex that we’ve had some significant problems in terms of ground failures and dilution effects. In addition, not helping out, we’ve had some equipment downtime on some of our smaller silt development, so the combination of those two issues are affecting us.
</p>
For now I have code (see below) which identifies one executive, "Dror Ben Asher", and grabs all the text that occurs after the name in the P elements. But I would like this to work for all executives, and for multiple HTML files where different executives are mentioned (different companies).
import textwrap
import os
from bs4 import BeautifulSoup

directory = 'C:/Research syntheses - Meta analysis/SeekingAlpha/out'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
            print('{:<30} {:<70}'.format('Name', 'Answer'))
            print('-' * 101)
            for answer in soup.select('p:contains("Question-and-Answer Session") ~ strong:contains("Dror Ben Asher") + p'):
                txt = answer.get_text(strip=True)
                s = answer.find_next_sibling()
                while s:
                    if s.name == 'strong' or s.find('strong'):
                        break
                    if s.name == 'p':
                        txt += ' ' + s.get_text(strip=True)
                    s = s.find_next_sibling()
                txt = ('\n' + ' ' * 31).join(textwrap.wrap(txt))
                print('{:<30} {:<70}'.format('Dror Ben Asher - CEO', txt), file=open("output.txt", "a"))
Does anyone have a suggestion to tackle this challenge?
If I understand your question correctly, you could put the code in a function, to which you could pass the name you need as an argument and use that variable to construct your search strings.
For example:
def func(name_to_find):
    # some code
    for answer in soup.select('p:contains("Question-and-Answer Session") ~ strong:contains("{n}") + p'.format(n=name_to_find)):
        # some other code
and call it like so:
func('Dror Ben Asher')
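To cover multiple files where different executives are mentioned, you could also collect the names from the participants block at the top of each file instead of hard-coding them. A rough sketch, assuming the names always follow the <STRONG>Executives</STRONG> marker in the "Name - Title" format shown above:
names = []
marker = soup.find('strong', string='Executives')
if marker:
    for p in marker.parent.find_next_siblings('p'):
        if p.find('strong'):  # stop at the next section header (e.g. Analysts)
            break
        names.append(p.get_text(strip=True).split(' - ')[0])
for name in names:
    func(name)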
<div class="the_content">
<p><strong>Niki Jones Agency, Inc</strong></p>
<p>Ms. Niki Jones</p>
<p>39 Front Street</p>
<p>Port Jervis</p>
<p>NY, 12771</p>
<p>(845) 856-1266</p>
<p>njones@nikijones.com</p>
<p>www.Nikijones.com</p>
<p>20 Years in the PR & Marketing business : Graphic design, Publications, Websites design& Development, Digital Ads, Campaigns, Social Media, Direct Mail, Website security,ADA compliance 508</p>
<div class="apss-social-share apss-theme-1 clearfix">
<div class="the_content">
<p><strong>JMB Electric Supply, LLC</strong></p>
<p>Joanne M. Barish</p>
<p>17 Belmont Street</p>
<p>White Plains, New York 10605</p>
<p>Tel: (914) 260-1895</p>
<p>Fax: 914-722-3277</p>
<p>Email: jmbelec@optionline.net</p>
<p>Website: http://jmbelec.net/</p>
<p>Description: Master distributor of Electronic and Magnetic Low Voltage Transformers & Ballasts selling throughout the United States, as well as internationally.</p>
<div class="apss-social-share apss-theme-1 clearfix">
Step 1: Use the XPath below to get all the div elements.
//div[@class='the_content']
Step 2: For each div element, use relative XPaths:
For the company name:
./p/strong
For the other p tags:
./p[2]
Here 2 means the 2nd p tag, so you can use 3, 4, 5, and so on.
Please see https://www.w3schools.com/xml/xml_xpath.asp
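Put together with lxml, that might look like the sketch below. Here page_source is a stand-in for the HTML shown above, and it assumes each the_content div is properly closed in the real page:
from lxml import html

tree = html.fromstring(page_source)
for block in tree.xpath("//div[@class='the_content']"):
    company = block.xpath("./p/strong/text()")[0]
    # every <p> after the first holds one contact/detail line
    details = [t.strip() for t in block.xpath("./p[position() > 1]/text()")]
    print(company, details)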
I'm trying to detect strings that contain the HTML tag <p><strong class="title"> </strong></p> with certain words ("shared" OR "amenities") inside the tag, and append the word "shared" to every comma-separated substring that appears after that tag. Is there an easy way to accomplish this?
Example input:
</strong></p> swimming pool, barbecue <hr /> <p><strong class="title">SHARED CLUB AMENITIES</strong></p> beach, tennis courts <hr /> <p><strong class="title">
Example output:
swimming pool, barbecue, beach shared, tennis courts shared
You can use a few different libraries for this; common choices are Beautiful Soup or lxml. I prefer lxml because, like regex, most languages have XPath implementations, so it feels like I'll get more mileage out of the investment.
from lxml import html
stuff = '</strong></p> swimming pool, barbecue <hr /> <p><strong class="title">SHARED CLUB AMENITIES</strong></p> beach, tennis courts <hr /> <p><strong class="title">'
stuff = html.fromstring(stuff)
ptag = stuff.xpath('//p/*[contains(text(),"AMENITIES") or contains(text(), "SHARED")]//text()')
print(ptag)
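Note that this only locates the marker element; the amenity lists themselves sit in the text around the tags (lxml exposes them as the .text and .tail of the neighbouring elements), so you would still need to collect that text and rewrite it to produce your example output.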
I got this to work using the code below. Any comments and suggestions welcome!
import pandas as pd
from bs4 import BeautifulSoup

html_to_parse = '</strong></p> swimming pool, barbecue <hr /> <p><strong class="title">SHARED CLUB AMENITIES</strong></p> beach, tennis courts <hr /> <p><strong class="title">'
soup = BeautifulSoup(html_to_parse, 'html.parser')
# find the first <strong class="title"> marker
shared_indicator = soup.find('strong', 'title').get_text()

non_shared_amenities = html_to_parse.split(shared_indicator, 1)[0]
non_shared_amenities = (BeautifulSoup(non_shared_amenities, 'html.parser')
                        .get_text()
                        .strip()
                        )

shared_amenities = html_to_parse.split(shared_indicator, 1)[1]
shared_amenities_array = (pd.Series(BeautifulSoup(shared_amenities, 'html.parser')
                                    .get_text()
                                    .split(','))
                          .replace("[^A-Za-z0-9'`]+", " ", regex=True)
                          .str.strip()
                          .apply(lambda x: "{}{}".format(x, ' shared'))
                          )
shared_amenities_tagged = ", ".join(shared_amenities_array)
print(non_shared_amenities + ', ' + shared_amenities_tagged)
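With the example input above, the final string is: swimming pool, barbecue, beach shared, tennis courts shared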
def get_description(link):
    redditFile = urllib2.urlopen(link)
    redditHtml = redditFile.read()
    redditFile.close()
    soup = BeautifulSoup(redditHtml)
    desc = soup.find('div', attrs={'class': 'op_gd14 FL'}).text
    return desc
This is the code which gives me the text from this HTML:
<div class="op_gd14 FL">
<p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) of the Company will be held on September 30, 2015.Source : BSE<br><br>
Read all announcements in Prestige Estate </p><p> </p>
</div>
This result is fine for me; I just want to exclude the content of
Read all announcements in Prestige Estate
from the result (desc in my script) if it is present, and ignore it if it is not. How can I do this?
You can use extract() to remove unnecessary tags from the find() result:
descItem = soup.find('div', attrs={'class': 'op_gd14 FL'}) # get the DIV
[s.extract() for s in descItem('a')] # remove <a> tags
return descItem.get_text() # return the text
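Folded back into your function, that looks roughly like this (a sketch; the None check is an extra assumption for pages where the div is absent):
def get_description(link):
    redditFile = urllib2.urlopen(link)
    redditHtml = redditFile.read()
    redditFile.close()
    soup = BeautifulSoup(redditHtml)
    descItem = soup.find('div', attrs={'class': 'op_gd14 FL'})
    if descItem is None:
        return ''
    for a in descItem('a'):  # drop links such as "Read all announcements..."
        a.extract()
    return descItem.get_text()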
Alternatively, make some changes to the last line and add the re module. Note that this assumes desc holds the div's raw HTML (the str of the tag) rather than its .text, as the output below shows:
...
return re.sub(r'<a(.*)</a>', '', desc)
Output:
'<div class="op_gd14 FL">\n <p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) of the Company will be held on September 30, 2015.Source : BSE<br><br> \n </p><p>
I want my output to be like:
count:0 - Bournemouth and Watford to go head-to-head for Abdisalam Ibrahim
Olympiacos midfielder Abdisalam Ibrahim is a target for Premier League new-boys Bournemouth and Watford.The former Manchester City man is keen to leave Greece this summer, and his potential availability has alerted Eddie Howe and Quique Sanchez Flores.Lorient of Ligue 1 and La Liga's Rayo Vallacano are also interested in the 24-year-old.
count:1 - Andre-Pierre Gignac set for Mexico move
Former West Brom target Andre-Pierre Gignac is to complete a move to Mexican side Tigres.The France international is a free agent after leaving Marseille and is set to undergo a medical later today.West Ham, Stoke, Newcastle, West Brom and Dynamo Moscow all showed interest in the 30-year-old although Tony Pulis is understood to have cooled his interest after watching Gignac against Monaco towards the end of last season.
My Program:
from bs4 import BeautifulSoup
import urllib2

response = urllib2.urlopen('http://www.dailymail.co.uk/sport/football/article-3129389/Transfer-News-LIVE-Manchester-United-Arsenal-Liverpool-Real-Madrid-Barcelona-latest-plus-rest-Europe.html')
html = response.read()
soup = BeautifulSoup(html)
count = 0
for tag in soup.find_all("div", {"id":"lc-commentary-posts"}):
    divTaginb = tag.find_all("div", {"class":"lc-title-container"})
    divTaginp = tag.find_all("div", {"class":"lc-post-body"})
    for tag1 in divTaginb:
        h4Tag = tag1.find_all("b")
        for tag2 in h4Tag:
            print "count:%d - " % count,
            print tag2.text
            print '\n'
            tagp = divTaginp[count].find_all('p')
            for p in tagp:
                print p
            print '\n'
            count += 1
My output:
count:0 - ....
...
count:37 - ICYMI: Hamburg target Celtic star Stefan Johansen as part of summer rebuilding process
<p><strong>STEPHEN MCGOWAN:</strong> Bundesliga giants Hamburg have been linked with a move for Celtic’s PFA Scotland player of the year Stefan Johansen.</p>
<p>German newspapers claim the Norwegian features on a three-man shortlist of potential signings for HSV as part of their summer rebuilding process.</p>
<p>Hamburg scouts are reported to have watched Johansen during Friday night’s scoreless Euro 2016 qualifier draw with Azerbaijan.</p>
<p><a href="http://www.dailymail.co.uk/sport/football/article-3128854/Hamburg-target-Celtic-star-Stefan-Johansen-summer-rebuilding-process.html"><strong>CLICK HERE for more</strong></a></p>
count:38 - ICYMI: Sevilla agree deal with Chelsea to sign out-of-contract midfielder Gael Kakuta
<p>Sevilla have agreed a deal with Premier League champions Chelsea to sign out-of-contract winger Gael Kakuta.</p>
<p>The French winger, who spent last season on loan in the Primera Division with Rayo Vallecano, will arrive in Seville on Thursday to undergo a medical with the back-to-back Europa League winners.</p>
<p>A statement published on Sevilla's official website confirmed the 23-year-old's transfer would go through if 'everything goes well' in the Andalusian city.</p>
<p><strong><a href="http://www.dailymail.co.uk/sport/football/article-3128756/Sevilla-agree-deal-Chelsea-sign-Gael-Kakuta-contract-winger-aims-resurrect-career-Europa-League-winners.html">CLICK HERE for more</a></strong></p>
count:39 - Good morning everybody!
<p>And welcome to <em>Sportsmail's</em> coverage of all the potential movers and shakers ahead of the forthcoming summer transfer window.</p>
<p>Whatever deals will be rumoured, agreed or confirmed today you can read all about them here.</p>
DailyMail Website looks like this:
<div id="lc-commentary-posts"><div id="lc-id-39" class="lc-commentary-post cleared">
<div class="lc-icons">
<img src="http://i.mol.im/i/furniture/live_commentary/football_icons/teams/60x60_bournemouth.png" class="lc-icon">
<img src="http://i.mol.im/i/furniture/live_commentary/football_icons/teams/60x60_watford.png" class="lc-icon">
<div class="lc-post-time">18:03 </div>
</div>
<div class="lc-title-container">
<h4>
<b>Bournemouth and Watford to go head-to-head for Abdisalam Ibrahim</b>
</h4>
</div>
<div class="lc-post-body">
<p><strong>SAMI MOKBEL: </strong>Olympiacos midfielder Abdisalam Ibrahim is a target for Premier League new-boys Bournemouth and Watford.</p>
<p class="mol-para-with-font">The former Manchester City man is keen to leave Greece this summer, and his potential availability has alerted Eddie Howe and Quique Sanchez Flores.</p>
<p class="mol-para-with-font"><font>Lorient of Ligue 1 and La Liga's Rayo Vallacano are also interested in the 24-year-old.</font></p>
</div>
<img class="lc-post-image" src="http://i.dailymail.co.uk/i/pix/2015/06/18/18/1434647000147_lc_galleryImage_TEL_AVIV_ISRAEL_JUNE_11_A.JPG">
<b class="lc-image-caption">Abdisalam Ibrahim could return to England</b>
<div class="lc-clear"></div>
<ul class="lc-social">
<li class="lc-facebook"><span onclick="window.LiveCommentary.socialShare(postToFB, '39', 'facebook')"></span></li>
<li class="lc-twitter"><span onclick="window.LiveCommentary.socialShare(postToTWTTR, '39', 'twitter', window.twitterVia)"></span></li>
</ul>
</div>
<div id="lc-id-38" class="lc-commentary-post cleared">
<div class="lc-icons">
<img src="http://i.mol.im/i/furniture/live_commentary/football_icons/teams/60x60_west_brom.png" class="lc-icon">
<img src="http://i.mol.im/i/furniture/live_commentary/flags/60x60_mexico.png" class="lc-icon">
<div class="lc-post-time">16:54 </div>
</div>
<div class="lc-title-container">
<span><b>Andre-Pierre Gignac set for Mexico move</b></span>
</div>
<div class="lc-post-body">
<p>Former West Brom target Andre-Pierre Gignac is to complete a move to Mexican side Tigres.</p>
<p id="ext-gen225">The France international is a free agent after leaving Marseille and is set to undergo a medical later today.</p>
<p>West Ham, Stoke, Newcastle, West Brom and Dynamo Moscow all showed interest in the 30-year-old although Tony Pulis is understood to have cooled his interest after watching Gignac against Monaco towards the end of last season.</p>
</div>
<img class="lc-post-image" src="http://i.dailymail.co.uk/i/pix/2015/06/18/16/1434642784396_lc_galleryImage__FILES_A_file_picture_tak.JPG">
<b class="lc-image-caption">Andre-Pierre Gignac is to complete a move to Mexican side Tigres</b>
<div class="lc-clear"></div>
<ul class="lc-social">
<li class="lc-facebook"><span onclick="window.LiveCommentary.socialShare(postToFB, '38', 'facebook')"></span></li>
<li class="lc-twitter"><span onclick="window.LiveCommentary.socialShare(postToTWTTR, '38', 'twitter', window.twitterVia)"></span></li>
</ul>
</div>
Now my first target is the <b></b> inside <div class="lc-title-container">, which I am getting easily. But when I target all the <p></p> inside <div class="lc-post-body">, I am not able to get only the required text.
I tried p.text and p.strip(), but I still can't solve my problem.
Error while using p.text:
count:19 - City's pursuit of Sterling, Wilshere and Fabian Delph show a need for English quality
MIKE KEEGAN: Colonial explorer Cecil Rhodes is famously reported to have once said that to be an Englishman 'is to have won first prize in the lottery of life'. Back in the 19th century, the vicar's son was no doubt preaching about the expanding Empire and his own experiences in Africa.
Traceback (most recent call last):
File "app.py", line 24, in <module>
print p.text
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position 160: character maps to <undefined>
And when I use p.strip() I don't get any output at all.
Is there a good way to do this? Help me find the best way; I have been trying since morning and now it's night.
I don't want to use an encoder or decoder if possible.
from bs4 import UnicodeDammit
dammit = UnicodeDammit(html)
print(dammit.unicode_markup)
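For what it's worth, the traceback shows the failure happens while printing to a cp437 Windows console, not while parsing, so even with clean unicode you may still need to encode on output. A minimal sketch (the 'replace' error handler is just one option):
print p.text.encode('cp437', 'replace')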
Here's my code; you should go through it. I was too lazy to add specific fields to the dataset and instead just combined everything.
from bs4 import BeautifulSoup, element
import urllib2

response = urllib2.urlopen('http://www.dailymail.co.uk/sport/football/article-3129389/Transfer-News-LIVE-Manchester-United-Arsenal-Liverpool-Real-Madrid-Barcelona-latest-plus-rest-Europe.html')
html = response.read()
soup = BeautifulSoup(html)
article_dataset = {}

# Try to make your variables express what you're trying to do.
# Collect article posts
article_post_tags = soup.find_all("div", {"id":"lc-commentary-posts"})

# Set up the article_dataset with the article name as its key
for article_post_tag in article_post_tags:
    container_tags = article_post_tag.find_all("div", {"class":"lc-title-container"})
    body_tags = article_post_tag.find_all("div", {"class":"lc-post-body"})

    # Find the article name, and initialize an empty dict as the value
    for count, container in enumerate(container_tags):
        # We know there is only 1 <b> tag in our container,
        # so use find() instead of find_all()
        article_name_tag = container.find('b')

        # Our primary key is the article name, the corresponding value is the body_tag.
        article_dataset[article_name_tag.text] = {'body_tag': body_tags[count]}

for article_name, details in article_dataset.items():
    content = []
    content_line_tags = details['body_tag'].find_all('p')
    # Go through each tag and collect the text
    for content_tag in content_line_tags:
        for data in content_tag.contents:  # gather strings in our tags
            if type(data) == element.NavigableString:
                data = unicode(data)
            else:
                data = data.text
            content += [data]
    # Combine the content
    content = '\n'.join(content)
    # Add the content to our data
    article_dataset[article_name]['content'] = content

# Remove the body_tag from our article dataset and print a summary
for name, details in article_dataset.items():
    del details['body_tag']
    print
    print
    print 'Article Name: ' + name
    print 'Player: ' + details['content'].split('\n')[0]
    print 'Article Summary: ' + details['content']
    print