Parsing HTML elements - Python

At the top of each of my (downloaded) HTML files, executives are mentioned (like "Dror Ben Asher" in the code below):
<DIV id=article_participants class="content_part hid">
<P>Redhill Biopharma Ltd. (NASDAQ:<A title="" href="http://seekingalpha.com/symbol/rdhl" symbolSlug="RDHL">RDHL</A>)</P>
<P>Q4 2014 <SPAN class=transcript-search-span style="BACKGROUND-COLOR: yellow">Earnings</SPAN> Conference <SPAN class=transcript-search-span style="BACKGROUND-COLOR: #f38686">Call</SPAN></P>
<P>February 26, 2015 9:00 AM ET</P>
<P><STRONG>Executives</STRONG></P>
<P>Dror Ben Asher - CEO</P>
<P>Ori Shilo - Deputy CEO, Finance and Operations</P>
<P>Guy Goldberg - Chief Business Officer</P>
Further along the HTML, these executives' names recur multiple times, and after each name follows a text element I want to parse.
Example:
<P>
<STRONG> Dror Ben Asher </STRONG>
</P>
<P>Yeah, in terms of production in first quarter, we’re going to be lower than we had forecasted mainly due to our grade. We’ve had a couple of higher grade stopes in our Seabee complex that we’ve had some significant problems in terms of ground failures and dilution effects. In addition, not helping out, we’ve had some equipment downtime on some of our smaller silt development, so the combination of those two issues are affecting us.
</p>
For now I have code (see below) which identifies one executive, "Dror Ben Asher", and grabs all the text which occurs after it in the P element. But I would like this to work for all executives and for multiple HTML files where different executives are mentioned (different companies).
import textwrap
import os
from bs4 import BeautifulSoup

directory = 'C:/Research syntheses - Meta analysis/SeekingAlpha/out'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        print('{:<30} {:<70}'.format('Name', 'Answer'))
        print('-' * 101)
        for answer in soup.select('p:contains("Question-and-Answer Session") ~ strong:contains("Dror Ben Asher") + p'):
            txt = answer.get_text(strip=True)
            s = answer.find_next_sibling()
            while s:
                if s.name == 'strong' or s.find('strong'):
                    break
                if s.name == 'p':
                    txt += ' ' + s.get_text(strip=True)
                s = s.find_next_sibling()
            txt = ('\n' + ' ' * 31).join(textwrap.wrap(txt))
            print('{:<30} {:<70}'.format('Dror Ben Asher - CEO', txt), file=open("output.txt", "a"))
Does anyone have a suggestion to tackle this challenge?

If I understand your question correctly, you could put the code in a function, pass the name you need as an argument, and use that variable to construct your search strings.
For example:
def func(name_to_find):
    # some code
    for answer in soup.select('p:contains("Question-and-Answer Session") ~ strong:contains("{n}") + p'.format(n=name_to_find)):
        # some other code
and call it like so:
func('Dror Ben Asher')
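Building on that, here is a minimal sketch (untested against the real files, and assuming every transcript lists executives under a <STRONG>Executives</STRONG> header followed by "Name - Title" lines, as in the sample above) that first collects the names from the participants block and then runs the parameterized search for each one:

def executives_from(soup):
    # Locate the <strong>Executives</strong> header inside the participants block.
    header = soup.find('strong', string=lambda t: t and 'Executives' in t)
    names = []
    if header is None:
        return names
    # The <p> lines that follow look like "Name - Title"; the next <strong>
    # (e.g. an Analysts header) ends the list.
    for p in header.find_parent('p').find_next_siblings('p'):
        if p.find('strong'):
            break
        text = p.get_text(strip=True)
        if ' - ' in text:
            names.append(text.split(' - ')[0].strip())
    return names

for name in executives_from(soup):
    func(name)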

Related

How to use loop 'find next sibling' until reaching a certain tag when web scraping with beautifulsoup in python?

The webpage I'm attempting to scrape has a section where the html tags are nested like so:
<div>
<h3>
<p>
<p>
<h3>
<p>
<p>
<p>
My code is able to navigate to the correct tag, but I am struggling to split the text by the h3 headings, since each h3 is a sibling of the p tags, not a parent. I am either able to print just the tags or print all the text within the tag without splitting it into sections.
I've tried using for loops, but I don't think that is the right approach when searching within siblings. I think looping an if statement to determine whether find_next_sibling().name == 'h3' might work, but I've been unable to iterate this without nesting a large number of if statements.
Can anyone please advise on what approach I should take? Please see my full code below - the treaty files section works fine.
from bs4 import BeautifulSoup
import requests

url = 'https://www.gov.uk/government/publications/albania-tax-treaties'
get_url = requests.get(url)
url_html = get_url.content
soup = BeautifulSoup(url_html, 'lxml')

treaty_files = soup.find_all('div', class_='attachment-details')
for treaty_file in treaty_files:
    file_name = treaty_file.h3.a.text
    file_url = treaty_file.h3.a['href']
    #print(f"Treaty Name: {file_name}")
    #print(f"Treaty URL: {file_url}")
    #print()

#Attempt 1
treaty_details = soup.find('div', class_='govspeak').find_all('h3')
for treaty_content in treaty_details:
    content = treaty_content.find_next_siblings()
    for x in content:
        test = x
        a = test
        #print(a)

#Attempt 2
treaty_details = soup.find('div', class_='govspeak').find_all('h3')
for treaty_content in treaty_details:
    content = treaty_content.find_next_sibling()
    while content.name != 'h3':
        print(f"Text: {content.text}")
        content = content.find_next_sibling()
        if content.name == 'h3':
            break
One possible solution is to leverage the pandas.Series.groupby function to group the sections together:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.gov.uk/government/publications/albania-tax-treaties"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

govspeak = soup.select_one(".govspeak")
s = pd.Series(govspeak.find_all(recursive=False))

for _, g in s.groupby(s.apply(lambda x: x.name).eq("h3").cumsum()):
    title = g.iloc[0].text
    text = "\n".join(row.text for row in g.iloc[1:])
    print(title)
    print("-" * 120)
    print(text)
    print()
    print()
Prints:
2021 UK-Albania Synthesised text of the Multilateral Instrument and the 2013 Double Taxation Agreement — in force
------------------------------------------------------------------------------------------------------------------------
The 2013 UK-Albania Double Taxation Agreement has been modified by the Multilateral Instrument (MLI).
The modifications made by the Multilateral Instrument entered into force in:
the UK on 1 October 2018
Albania on 1 January 2021
They are effective in the UK from:
1 January 2021 for taxes withheld at source
1 April 2022 for Corporation Tax
6 April 2022 for Income Tax and Capital Gains Tax
They are effective in Albania from 1 July 2021.
2013 UK-Albania Double Taxation Agreement — in force
------------------------------------------------------------------------------------------------------------------------
The agreement entered into force on 30 December 2013.
It is effective in the UK from:
1 April 2014 for Corporation Tax
6 April 2014 for Income Tax and Capital Gains Tax
It is effective in Albania from 1 January 2014 for Income Tax and Capital Gains Tax.
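If you would rather avoid the pandas dependency, the same grouping can be done with a plain loop over the direct children of .govspeak, starting a new section at each h3. A minimal sketch, reusing the govspeak variable from the snippet above:

sections = []
for child in govspeak.find_all(recursive=False):
    if child.name == "h3":
        # each <h3> opens a new (title, paragraphs) section
        sections.append((child.get_text(strip=True), []))
    elif sections:
        sections[-1][1].append(child.get_text(strip=True))

for title, paragraphs in sections:
    print(title)
    print("-" * 120)
    print("\n".join(paragraphs))
    print()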

BeautifulSoup - Scraping a comment when the ID field changes

I'm scraping baseball game data for a number of seasons. Here's an example of the data.
https://www.baseball-reference.com/boxes/ANA/ANA201806180.shtml
For this question, I'm specifically looking for a way to pull out the comments that contain the umpire and game data. Note that these HTML files are now stored locally, so I'm trying to iterate through a folder. In the source code it looks like this:
<div class="section_wrapper setup_commented commented" id="all_342042674">
<div class="section_heading">
<span class="section_anchor" id="342042674_link" data-label="Other Info"></span>
<h2>Other Info</h2> <div class="section_heading_text">
<ul>
</ul>
</div>
</div><div class="placeholder"></div>
<!--
<div class="section_content" id="div_342042674">
<div><strong>Umpires:</strong> HP - Greg Gibson, 1B - Jerry Layne, 2B - Jordan Baker, 3B - Vic Carapazza.</div><div><strong>Time of Game:</strong> 3:21.</div>
<div><strong>Attendance:</strong> 33,809.</div>
<div><strong>Start Time Weather:</strong> 70° F, Wind 6mph out to Centerfield, Night, No Precipitation.</div>
</div>
-->
</div>
As you can see it's inside a comment. The real challenge is that the ID value changes between venues and seasons. I'm parsing through 10 years of data. Can someone tell me how to pull the comment text when the ID actually changes?
Here's my code:
# import libraries and files
from bs4 import BeautifulSoup, Comment
import csv
import os

print

# Setup Games list for append
games = []

path = r"D:\My Web Sites\baseball 2\www.baseball-reference.com\boxes\ANA"
for filename in os.listdir(path):
    if filename.endswith(".html"):
        fullpath = os.path.join(path, filename)
        print 'Processing {:}...'.format(fullpath)

        # Get Page, Make Soup
        soup = BeautifulSoup(open(fullpath), 'lxml')

        # Setting up game object to append to list
        game = {}

        # Get Description
        # Note: Skip every other child because of 'Navigable Strings' from BS.
        divs = soup.findAll('div', {'scorebox_meta'})
        for div in divs:
            for idx, child in enumerate(div.children):
                if idx == 1:
                    game['date'] = child.text
                elif idx == 3:
                    game['start_time'] = child.text.split(':', 1)[1].strip()
                elif idx == 7:
                    game['venue'] = child.text.split(':', 1)[1].strip()
                elif idx == 9:
                    game['duration'] = child.text.split(':', 1)[1].strip()

        # Get Player Data from tables
        for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
            data = BeautifulSoup(comment, "lxml")
            for items in data.select("table tr"):
                player_data = [' '.join(item.text.split()) for item in items.select("th,td")]
                print(player_data)

        print '======================================================='

        # Get Umpire Data

        # Append game data to full list
        games.append(game)

print
print 'Results'
print '*' * 80

# Print the games harvested to the console
for idx, game in enumerate(games):
    print str(idx) + ': ' + str(game)

# Write to CSV
csvfile = "C:/Users/Benny/Desktop/anatest.csv"
with open(csvfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    writer.writerows(game)
Thanks so much,
Benny
I used the re module to extract the comment section:
from bs4 import BeautifulSoup
import re
data = """<div class="section_wrapper setup_commented commented" id="all_342042674">
<div class="section_heading">
<span class="section_anchor" id="342042674_link" data-label="Other Info"></span>
<h2>Other Info</h2> <div class="section_heading_text">
<ul>
</ul>
</div>
</div><div class="placeholder"></div>
<!--
<div class="section_content" id="div_342042674">
<div><strong>Umpires:</strong> HP - Greg Gibson, 1B - Jerry Layne, 2B - Jordan Baker, 3B - Vic Carapazza.</div>
<div><strong>Time of Game:</strong> 3:21.</div>
<div><strong>Attendance:</strong> 33,809.</div>
<div><strong>Start Time Weather:</strong> 70° F, Wind 6mph out to Centerfield, Night, No Precipitation.</div>
</div>
-->
</div>"""
soup = BeautifulSoup(re.search(r'(?<=<!--)(.*?)(?=-->)', data, flags=re.DOTALL)[0], 'lxml')
umpires, time_of_game, attendance, start_time_weather = soup.select('div.section_content > div')
print('ID: ', soup.find('div', class_="section_content")['id'])
print('umpires: ', umpires.text)
print('time of game: ', time_of_game.text)
print('attendance: ', attendance.text)
print('start_time_weather: ', start_time_weather.text)
Output:
ID: div_342042674
umpires: Umpires: HP - Greg Gibson, 1B - Jerry Layne, 2B - Jordan Baker, 3B - Vic Carapazza.
time of game: Time of Game: 3:21.
attendance: Attendance: 33,809.
start_time_weather: Start Time Weather: 70° F, Wind 6mph out to Centerfield, Night, No Precipitation.
If you strip those comment markers <!-- and --> out of the HTML, you can easily access the content. Here is how you can go about it:
import requests
from bs4 import BeautifulSoup
url = "https://www.baseball-reference.com/boxes/ANA/ANA201806180.shtml"
res = requests.get(url)
content = res.text.replace("<!--","").replace("-->","")
soup = BeautifulSoup(content,"lxml")
umpire, gametime, attendance, weather = soup.find_all(class_="section_content")[2]("strong")
print(f'{umpire.next_sibling}\n{gametime.next_sibling}\n{attendance.next_sibling}\n{weather.next_sibling}\n')
Output:
HP - Greg Gibson, 1B - Jerry Layne, 2B - Jordan Baker, 3B - Vic Carapazza.
3:21.
33,809.
70° F, Wind 6mph out to Centerfield, Night, No Precipitation.
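Since the files are stored locally and the id changes per game, another option is to select the commented block by its content rather than its id. A sketch building on the question's own comment loop, assuming the 'Other Info' section always contains the literal text 'Umpires:' as in the sample:

from bs4 import BeautifulSoup, Comment

def other_info_divs(soup):
    # Scan every comment node and parse the one holding the umpire/game data.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        if 'Umpires:' in comment:
            inner = BeautifulSoup(comment, 'lxml')
            return inner.select('div.section_content > div')
    return []

# e.g. inside the per-file loop from the question:
# for div in other_info_divs(soup):
#     print div.get_text(strip=True)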

Can't scrape particular items from some elements

What should you do when there is no container or group to select and loop through in order to parse the required items (which are common to each group)? I want to parse the text, date and author from the elements pasted below. The three results I am after do not belong to any particular group or container, so I can't find the right way to get them by creating a loop.
Here are the elements:
html = '''
<div class="view-content">
<p class="text-large experts-more-h">
<a>We Have No Idea if Universal Preschool Actually Helps Kids</a>
</p>
<p class="text-sans">
By David J. Armor. Washington Post. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2014-10-21T09:34:00-04:00">October 21, 2014</span>.
</p>
<p class="text-large experts-more-h">
<a>At Last, Parent Resistance to Collective Standardized Tests</a>
</p>
<p class="text-sans">
By Nat Hentoff. Cato.org. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2014-01-15T09:57:00-05:00">January 15, 2014</span>.
</p>
<p class="text-sans">
By Darcy Ann Olsen and Eric Olsen. Cato.org. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="1999-04-15T00:00:00-04:00">April 15, 1999</span>.
</p>
<p class="text-large experts-more-h">
<a>Day Care: Parents versus Professional Advocates</a>
</p>
<p class="text-sans">
By Darcy Ann Olsen. Cato.org. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="1998-06-01T00:00:00-04:00">June 1, 1998</span>.
</p>
</div>
'''
If you run my script, you can see that the scraped result is only the first one:
from lxml.html import fromstring

tree = fromstring(html)
post = tree.cssselect(".text-large a")[0].text
date = tree.cssselect(".date-display-single")[0].text
author = tree.cssselect(".text-sans")[0].text.strip()
print(post + '\n', date + '\n', author)
Result:
We Have No Idea if Universal Preschool Actually Helps Kids
October 21, 2014
By David J. Armor. Washington Post.
If you run this one, you will see that this script is able to parse all the results I'm after:
from lxml.html import fromstring

tree = fromstring(html)
count = tree.cssselect(".text-large a")
for item in range(len(count)):
    post = tree.cssselect(".text-large a")[item].text
    date = tree.cssselect(".date-display-single")[item].text
    author = tree.cssselect(".text-sans")[item].text.strip()
    print(post + '\n', date + '\n', author)
Results:
We Have No Idea if Universal Preschool Actually Helps Kids
October 21, 2014
By David J. Armor. Washington Post.
At Last, Parent Resistance to Collective Standardized Tests
January 15, 2014
By Nat Hentoff. Cato.org.
Day Care: Parents versus Professional Advocates
April 15, 1999
By Darcy Ann Olsen and Eric Olsen. Cato.org.
However, what I did in my second script is not at all pythonic, and it will give wrong results if any data is missing. So, how can I select a group or container, loop through it and parse all of them? Thanks in advance.
If one of the text nodes (post, date, author) is missing, tree.cssselect(selector)[index].text returns a NoneType object, which you cannot handle as a string. To avoid this you can write
post = tree.cssselect(".text-large a")[item].text or " "
You can also try the XPath solution below:
container = tree.cssselect(".text-large")
for item in container:
    post = item.xpath('./a')[0].text or " "
    date = item.xpath('./following-sibling::p/span[@class="date-display-single"]')[0].text or " "
    author = (item.xpath('./following-sibling::p[@class="text-sans"]')[0].text or " ").strip()
    print(post + '\n', date + '\n', author)
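The reason this XPath version holds up when data is missing: each date and author lookup is anchored to its own title element via the following-sibling axis, so the values are taken relative to that title rather than by a shared index, and a missing entry no longer shifts every later result the way the parallel [item] indexing in the second script does.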

Python 2.7.10 Trying to print text from website using Beautiful Soup 4

I want my output to be like:
count:0 - Bournemouth and Watford to go head-to-head for Abdisalam Ibrahim
Olympiacos midfielder Abdisalam Ibrahim is a target for Premier League new-boys Bournemouth and Watford.The former Manchester City man is keen to leave Greece this summer, and his potential availability has alerted Eddie Howe and Quique Sanchez Flores.Lorient of Ligue 1 and La Liga's Rayo Vallacano are also interested in the 24-year-old.
Count:1 - Andre-Pierre Gignac set for Mexico move
Former West Brom target Andre-Pierre Gignac is to complete a move to Mexican side Tigres.The France international is a free agent after leaving Marseille and is set to undergo a medical later today.West Ham, Stoke, Newcastle, West Brom and Dynamo Moscow all showed interest in the 30-year-old although Tony Pulis is understood to have cooled his interest after watching Gignac against Monaco towards the end of last season.
My Program:
from bs4 import BeautifulSoup
import urllib2

response = urllib2.urlopen('http://www.dailymail.co.uk/sport/football/article-3129389/Transfer-News-LIVE-Manchester-United-Arsenal-Liverpool-Real-Madrid-Barcelona-latest-plus-rest-Europe.html')
html = response.read()
soup = BeautifulSoup(html)
count = 0
for tag in soup.find_all("div", {"id": "lc-commentary-posts"}):
    divTaginb = tag.find_all("div", {"class": "lc-title-container"})
    divTaginp = tag.find_all("div", {"class": "lc-post-body"})
    for tag1 in divTaginb:
        h4Tag = tag1.find_all("b")
        for tag2 in h4Tag:
            print "count:%d - " % count,
            print tag2.text
            print '\n'
            tagp = divTaginp[count].find_all('p')
            for p in tagp:
                print p
            print '\n'
            count += 1
My output:
Count:0 - ....
...
count:37 - ICYMI: Hamburg target Celtic star Stefan Johansen as part of summer rebuilding process
<p><strong>STEPHEN MCGOWAN:</strong> Bundesliga giants Hamburg have been linked with a move for CelticΓÇÖs PFA Scotland player of the year Stefan Johansen.</p>
<p>German newspapers claim the Norwegian features on a three-man shortlist of potential signings for HSV as part of their summer rebuilding process.</p>
<p>Hamburg scouts are reported to have watched Johansen during Friday nightΓÇÖs scoreless Euro 2016 qualifier draw with Azerbaijan.</p>
<p><a href="http://www.dailymail.co.uk/sport/football/article-3128854/Hamburg-target-Celtic-star-Stefan-Johansen-summer-rebuilding-process.html"><strong>CLICK HERE for more</strong></a></p>
count:38 - ICYMI: Sevilla agree deal with Chelsea to sign out-of-contract midfielder Gael Kakuta
<p>Sevilla have agreed a deal with Premier League champions Chelsea to sign out-of-contract winger Gael Kakuta.</p>
<p>The French winger, who spent last season on loan in the Primera Division with Rayo Vallecano, will arrive in Seville on Thursday to undergo a medical with the back-to-back Europa League winners.</p>
<p>A statement published on Sevilla's official website confirmed the 23-year-old's transfer would go through if 'everything goes well' in the Andalusian city.</p>
<p><strong><a href="http://www.dailymail.co.uk/sport/football/article-3128756/Sevilla-agree-deal-Chelsea-sign-Gael-Kakuta-contract-winger-aims-resurrect-career-Europa-League-winners.html">CLICK HERE for more</a></strong></p>
count:39 - Good morning everybody!
<p>And welcome to <em>Sportsmail's</em> coverage of all the potential movers and shakers ahead of the forthcoming summer transfer window.</p>
<p>Whatever deals will be rumoured, agreed or confirmed today you can read all about them here.</p>
DailyMail Website looks like this:
<div id="lc-commentary-posts"><div id="lc-id-39" class="lc-commentary-post cleared">
<div class="lc-icons">
<img src="http://i.mol.im/i/furniture/live_commentary/football_icons/teams/60x60_bournemouth.png" class="lc-icon">
<img src="http://i.mol.im/i/furniture/live_commentary/football_icons/teams/60x60_watford.png" class="lc-icon">
<div class="lc-post-time">18:03 </div>
</div>
<div class="lc-title-container">
<h4>
<b>Bournemouth and Watford to go head-to-head for Abdisalam Ibrahim</b>
</h4>
</div>
<div class="lc-post-body">
<p><strong>SAMI MOKBEL: </strong>Olympiacos midfielder Abdisalam Ibrahim is a target for Premier League new-boys Bournemouth and Watford.</p>
<p class="mol-para-with-font">The former Manchester City man is keen to leave Greece this summer, and his potential availability has alerted Eddie Howe and Quique Sanchez Flores.</p>
<p class="mol-para-with-font"><font>Lorient of Ligue 1 and La Liga's Rayo Vallacano are also interested in the 24-year-old.</font></p>
</div>
<img class="lc-post-image" src="http://i.dailymail.co.uk/i/pix/2015/06/18/18/1434647000147_lc_galleryImage_TEL_AVIV_ISRAEL_JUNE_11_A.JPG">
<b class="lc-image-caption">Abdisalam Ibrahim could return to England</b>
<div class="lc-clear"></div>
<ul class="lc-social">
<li class="lc-facebook"><span onclick="window.LiveCommentary.socialShare(postToFB, '39', 'facebook')"></span></li>
<li class="lc-twitter"><span onclick="window.LiveCommentary.socialShare(postToTWTTR, '39', 'twitter', window.twitterVia)"></span></li>
</ul>
</div>
<div id="lc-id-38" class="lc-commentary-post cleared">
<div class="lc-icons">
<img src="http://i.mol.im/i/furniture/live_commentary/football_icons/teams/60x60_west_brom.png" class="lc-icon">
<img src="http://i.mol.im/i/furniture/live_commentary/flags/60x60_mexico.png" class="lc-icon">
<div class="lc-post-time">16:54 </div>
</div>
<div class="lc-title-container">
<span><b>Andre-Pierre Gignac set for Mexico move</b></span>
</div>
<div class="lc-post-body">
<p>Former West Brom target Andre-Pierre Gignac is to complete a move to Mexican side Tigres.</p>
<p id="ext-gen225">The France international is a free agent after leaving Marseille and is set to undergo a medical later today.</p>
<p>West Ham, Stoke, Newcastle, West Brom and Dynamo Moscow all showed interest in the 30-year-old although Tony Pulis is understood to have cooled his interest after watching Gignac against Monaco towards the end of last season.</p>
</div>
<img class="lc-post-image" src="http://i.dailymail.co.uk/i/pix/2015/06/18/16/1434642784396_lc_galleryImage__FILES_A_file_picture_tak.JPG">
<b class="lc-image-caption">Andre-Pierre Gignac is to complete a move to Mexican side Tigres</b>
<div class="lc-clear"></div>
<ul class="lc-social">
<li class="lc-facebook"><span onclick="window.LiveCommentary.socialShare(postToFB, '38', 'facebook')"></span></li>
<li class="lc-twitter"><span onclick="window.LiveCommentary.socialShare(postToTWTTR, '38', 'twitter', window.twitterVia)"></span></li>
</ul>
</div>
Now my target is <div class="lc-title-container">, and the <b></b> inside it, which I am getting easily. But when I target <div class="lc-post-body"> and all the <p></p> inside it, I am not able to get only the required text.
I tried p.text and p.strip(), but I am still not able to solve my problem.
Error while using p.text:
count:19 - City's pursuit of Sterling, Wilshere and Fabian Delph show a need for English quality
MIKE KEEGAN: Colonial explorer Cecil Rhodes is famously reported to have once said that to be an Englishman 'is to have won first prize in the lottery of life'.
Back in the 19th century, the vicar's son was no doubt preaching about the expanding Empire and his own experiences in Africa.
Traceback (most recent call last):
  File "app.py", line 24, in <module>
    print p.text
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position 160: character maps to <undefined>
And while I am using p.strip() I am not getting any output.
Is there any good way to do it? Help me find the best way; I have been trying this since morning and now it's night.
I don't want to use any encoder or decoder if possible, like:
dammit = UnicodeDammit(html)
print(dammit.unicode_markup)
Here's my code. You should go through it. I was too lazy to add specific fields for the dataset and instead just combined everything.
from bs4 import BeautifulSoup, element
import urllib2

response = urllib2.urlopen('http://www.dailymail.co.uk/sport/football/article-3129389/Transfer-News-LIVE-Manchester-United-Arsenal-Liverpool-Real-Madrid-Barcelona-latest-plus-rest-Europe.html')
html = response.read()
soup = BeautifulSoup(html)

count = 0
article_dataset = {}

# Try to make your variables express what you're trying to do.
# Collect article posts
article_post_tags = soup.find_all("div", {"id": "lc-commentary-posts"})

# Set up the article_dataset with the article name as its key
for article_post_tag in article_post_tags:
    container_tags = article_post_tag.find_all("div", {"class": "lc-title-container"})
    body_tags = article_post_tag.find_all("div", {"class": "lc-post-body"})

    # Find the article name, and initialize an empty dict as the value
    for count, container in enumerate(container_tags):
        # We know there is only 1 <b> tag in our container,
        # so use find() instead of find_all()
        article_name_tag = container.find('b')
        # Our primary key is the article name, the corresponding value is the body_tag.
        article_dataset[article_name_tag.text] = {'body_tag': body_tags[count]}

for article_name, details in article_dataset.items():
    content = []
    content_line_tags = details['body_tag'].find_all('p')
    # Go through each tag and collect the text
    for content_tag in content_line_tags:
        for data in content_tag.contents:  # gather strings in our tags
            if type(data) == element.NavigableString:
                data = unicode(data)
            else:
                data = data.text
            content += [data]
    # combine the content
    content = '\n'.join(content)
    # Add the content to our data
    article_dataset[article_name]['content'] = content

# remove the body_tag from our article dataset
for name, details in article_dataset.items():
    del details['body_tag']
    print
    print
    print 'Article Name: ' + name
    print 'Player: ' + details['content'].split('\n')[0]
    print 'Article Summary: ' + details['content']
    print
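One side note on the UnicodeEncodeError from the question: it comes from printing non-ASCII text to a cp437 Windows console, not from the parsing itself. A minimal workaround sketch in the question's Python 2 style, reusing the tagp variable from the question's loop:

import sys

for p in tagp:
    text = p.get_text()  # text of the <p> only, markup stripped
    # encode for whatever the console supports, replacing unprintable characters
    print text.encode(sys.stdout.encoding or 'utf-8', 'replace')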

Get text from <a> element?

I would like to get the school name, "Perkins College..." from this link using beautifulSoup.
The code I use returns nothing.
school = soup.find('a','profiles-show-school-name-sm-link')
print 'school: ', school
print 'school.text: ', school.text
output:
school: <a class="profiles-show-school-name-sm-link" href="/profiles/show/online-degrees/stephen-f-austin-state-university/perkins-college-of-education-undergraduate/395/5401">
<img border="0" src="/images/profiles/243x60/4613/degrees/undergraduate-certificate-in-hospitality-administration.png"/>
</a>
school.text:
Suggestions for a BeautifulSoup implementation to extract school name (not URL)? Thx!
school = soup.find('a','profiles-show-school-name-sm-link')
url = school['href']
Assuming the school is always in the same spot in the url:
for i in range(5):
    url = url[url.find("/")+1:]
schoolname = url[:url.find("/")]
print " ".join(schoolname.split("-")).title()
Yields:
Perkins College Of Education Undergraduate
Getting the university (starting again from the original href):
url = school['href']
for i in range(4):
    url = url[url.find("/")+1:]
university = url[:url.find("/")]
print " ".join(university.split("-")).title()
Yields:
Stephen F Austin State University
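A slightly more direct variant of the same idea, assuming the href always follows the /profiles/show/online-degrees/<university>/<school>/... layout shown above, is to split the path once and index the segments:

parts = school['href'].strip('/').split('/')
# ['profiles', 'show', 'online-degrees', '<university>', '<school>', '395', '5401']
print " ".join(parts[3].split("-")).title()  # Stephen F Austin State University
print " ".join(parts[4].split("-")).title()  # Perkins College Of Education Undergraduate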
