Get text from <a> element? - python

I would like to get the school name, "Perkins College...", from this link using BeautifulSoup.
The code I use returns nothing:
school = soup.find('a','profiles-show-school-name-sm-link')
print 'school: ', school
print 'school.text: ', school.text
output:
school: <a class="profiles-show-school-name-sm-link" href="/profiles/show/online-degrees/stephen-f-austin-state-university/perkins-college-of-education-undergraduate/395/5401">
<img border="0" src="/images/profiles/243x60/4613/degrees/undergraduate-certificate-in-hospitality-administration.png"/>
</a>
school.text:
Any suggestions for a BeautifulSoup implementation to extract the school name (not the URL)? Thanks!

school = soup.find('a', 'profiles-show-school-name-sm-link')
url = school['href']
Assuming the school is always in the same spot in the url:
for i in range(5):
    url = url[url.find("/")+1:]
schoolname = url[:url.find("/")]
print " ".join(schoolname.split("-")).title()
Yields:
Perkins College Of Education Undergraduate
Getting the University
for i in range(4):
    url = url[url.find("/")+1:]
university = url[:url.find("/")]
print " ".join(university.split("-")).title()
Yields:
Stephen F Austin State University
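Hand-slicing the path like this is easy to get wrong; a slightly sturdier sketch (Python 3 syntax, and assuming the university and school slugs always sit at positions 3 and 4 of the path) splits the href once with the standard library:

```python
from urllib.parse import urlparse

href = ("/profiles/show/online-degrees/stephen-f-austin-state-university/"
        "perkins-college-of-education-undergraduate/395/5401")
# Split the path into its segments and pick the slugs by position
parts = urlparse(href).path.strip("/").split("/")
university = parts[3].replace("-", " ").title()
school = parts[4].replace("-", " ").title()
print(university)  # Stephen F Austin State University
print(school)      # Perkins College Of Education Undergraduate
```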


Can't locate and capture a few fields out of some unstructured html

I'm trying to scoop out four fields from a webpage using the BeautifulSoup library. It's hard to identify the fields individually, and that is why I'm seeking help.
Sometimes both emails are present, but that is not always the case. I used indexing to capture the emails for this example, but surely that is the worst way to go. Moreover, with the following attempt I can only parse the caption of the email, not the email address itself.
I've tried with (minimum working example):
from bs4 import BeautifulSoup
html = """
<p>
<strong>
Robert Romanoff
</strong>
<br/>
146 West 29th Street, Suite 11W
<br/>
New York, New York 10001
<br/>
Telephone: (718) 527-1577
<br/>
Fax: (718) 276-8501
<br/>
Email:
<a href="mailto:robert#absol.com">
robert#absol.com
</a>
<br/>
Additional Contact: William Locantro
<br/>
Email:
<a href="mailto:bill#absol.com">
bill#absol.com
</a>
</p>
"""
soup = BeautifulSoup(html,"lxml")
container = soup.select_one("p")
contact_name = container.strong.text.strip()
contact_email = [i for i in container.strings if "Email" in i][0].strip()
additional_contact = [i.strip() for i in container.strings if "Additional Contact" in i.strip()][0].strip('Additional Contact:')
additional_email = [i for i in container.strings if "Email" in i][1].strip()
print(contact_name,contact_email,additional_contact,additional_email)
Current output:
Robert Romanoff Email: William Locantro Email:
Expected output:
Robert Romanoff robert#absol.com William Locantro bill#absol.com
For more complex HTML/XML parsing you should take a look at XPath, which allows very powerful selector rules.
In Python it's available in the parsel package.
from parsel import Selector
html = '...'
sel = Selector(html)
name = sel.xpath('//strong[1]/text()').get().strip()
email = sel.xpath("//text()[re:test(., 'Email')]/following-sibling::a/text()").get().strip()
name_additional = sel.xpath("//text()[re:test(., 'Additional Contact')]").re("Additional Contact: (.+)")[0]
email_additional = sel.xpath("//text()[re:test(., 'Additional Contact')]/following-sibling::a/text()").get().strip()
print(name, email, name_additional, email_additional)
# Robert Romanoff robert#absol.com William Locantro bill#absol.com
You can do it like this:
Select the <div> that has the data you need.
Create a list of the data present inside the selected <div>.
Iterate over the list and extract the data you require.
Here is the code:
from bs4 import BeautifulSoup
import requests
url = 'http://www.nyeca.org/find-a-contractor-by-name/'
r = requests.get(url)
soup = BeautifulSoup(r.text,"lxml")
d = soup.find_all('div', class_='sabai-directory-body')
for i in d:
    x = i.text.strip().split('\n')
    data = [x[0].strip()]
    for item in x:
        if item.startswith('Email'):
            data.append(item.split(':')[1].strip())
        elif item.startswith('Additional'):
            data.append(item.split(':')[1].strip())
    print(data)
Gives a list of the contractor details and also additional details (if any).
['Ron Singh', 'rsingh#atechelectric.com']
['George Pacacha', 'Office#agvelectricalservices.com']
['Andrew Drazic', 'ADrazic#atjelectrical.com']
['Albert Barbato', 'Abarbato#abelectriccorp.com']
['Ralph Sica', 'Ralph.Sica#abm.com', 'Henry Kissinger', 'Henry.Kissinger#abm.com']
['Robert Romanoff', 'robert#absoluteelectric.com', 'William Locantro', 'bill#absoluteelectric.com']
Here is a solution you can try:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
names_ = [
    soup.select_one("p > strong").text.strip(),
    soup.find(text=re.compile("Additional Contact:")).replace('Additional Contact:', '').strip()
]
email_ = [i.strip() for i in soup.find_all(text=re.compile("absol"))]
print(" ".join(i + " " + j for i, j in zip(names_, email_)))
Robert Romanoff robert#absol.com William Locantro bill#absol.com
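Another BeautifulSoup-only approach worth sketching: pair each "Email" label with the <a> tag that follows it, so the addresses come out in document order without positional indexing (trimmed-down HTML from the question; the obfuscated "#" addresses are kept as-is):

```python
import re
from bs4 import BeautifulSoup

html = """
<p>
<strong>Robert Romanoff</strong>
<br/> Email:
<a href="mailto:robert#absol.com">robert#absol.com</a>
<br/> Additional Contact: William Locantro
<br/> Email:
<a href="mailto:bill#absol.com">bill#absol.com</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")
# For every text node containing "Email", grab the next <a> tag's text
emails = [label.find_next("a").get_text(strip=True)
          for label in soup.find_all(string=re.compile("Email"))]
print(emails)  # ['robert#absol.com', 'bill#absol.com']
```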

Python Regex: Data scrape <li>data<ol><li>data</li></ol></li>

I have an HTML file that I read with Python, and I would like to customize the output while printing.
First I have to print the country name, then the names of the players who belong to that country.
My HTML file looks like this:
<ul>
<li>
Australia
<ol>
<li>Steve Smith</li>
<li>David Warner</li>
<li>Aaron Finch</li>
</ol>
</li>
<li>
Bangladesh
<ol>
<li>Shakib Al Hasan</li>
<li>Tamim Iqbal</li>
<li>Mushfiqur Rahim</li>
</ol>
</li>
<li>
England
<ol>
<li>Ben Stokes</li>
<li>Joe Root</li>
<li>Eoin Morgan</li>
</ol>
</li>
Now I want to scrape this data from my HTML file:
Australia - Steve Smith, David Warner, Aaron Finch
Bangladesh - Shakib Al Hasan, Tamim Iqbal, Mushfiqur Rahim
England - Ben Stokes, Joe Root, Eoin Morgan
But I can only scrape the players' names. This is my code:
import re

file_name = "team.html"
mode = "r"
with open(file_name, mode) as fp:
    team = fp.read()
pat = re.compile(r'<li>(.*?)</li>')
result = pat.findall(team)
res = ", ".join([str(player) for player in result])
print(res)
Also, I don't use any package like bs4. I would like to solve this issue using regex.
Here is a solution using regex:
import re

file_name = "team.html"
mode = "r"
with open(file_name, mode) as fp:
    team = fp.read()
regex = re.compile(r'<li>\s+(?P<country>[A-Za-z ]+)|<li>(?P<name>[A-Za-z ]+)</li>')
country_team_rel = {}
country = None
for result in regex.findall(team):
    if result[0]:
        country = result[0]
        country_team_rel[country] = []
    else:
        country_team_rel[country].append(result[1])
# Or, if you would like to print:
buffer = []
for result in regex.findall(team):
    if result[0]:
        if buffer:
            print(", ".join(buffer))
            buffer = []
        print(result[0] + " - ", end='')
    else:
        buffer.append(result[1])
print(", ".join(buffer))
As already suggested, BeautifulSoup is the right tool for this task:
import bs4

file_name = "team.html"
mode = "r"
with open(file_name, mode) as fp:
    team = fp.read()
soup = bs4.BeautifulSoup(team, "html.parser")
country = None
for i in soup.findAll('li'):
    if '\n' in i.text:
        if country:
            print(country, '-', ', '.join(players))
        country = i.text.splitlines()[1].strip()
        players = []
    else:
        players.append(i.text)
print(country, '-', ', '.join(players))
It could be a mistake to use regex in this case (I am not 100% sure).
You should use Beautiful Soup, or another HTML parser.
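If third-party packages really are off the table, the standard library's html.parser is still sturdier than regex. A minimal sketch (the TeamParser class and the one-country sample input are mine, for illustration):

```python
from html.parser import HTMLParser

class TeamParser(HTMLParser):
    """Collects country -> players from the <ul>/<ol> layout in the question."""
    def __init__(self):
        super().__init__()
        self.in_ol = False
        self.country = None
        self.teams = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'ol':
            self.in_ol = True

    def handle_endtag(self, tag):
        if tag == 'ol':
            self.in_ol = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_ol:                # inside <ol>: a player name
            self.teams[self.country].append(text)
        else:                         # outside <ol>: a country name
            self.country = text
            self.teams[text] = []

parser = TeamParser()
parser.feed("<ul><li>Australia<ol><li>Steve Smith</li>"
            "<li>David Warner</li></ol></li></ul>")
for country, players in parser.teams.items():
    print(country, '-', ', '.join(players))  # Australia - Steve Smith, David Warner
```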

Parsing html elements

At the top of every one of my (downloaded) HTML files, executives are mentioned (like "Dror Ben Asher" in the code below):
<DIV id=article_participants class="content_part hid">
<P>Redhill Biopharma Ltd. (NASDAQ:<A title="" href="http://seekingalpha.com/symbol/rdhl" symbolSlug="RDHL">RDHL</A>)</P>
<P>Q4 2014 <SPAN class=transcript-search-span style="BACKGROUND-COLOR: yellow">Earnings</SPAN> Conference <SPAN class=transcript-search-span style="BACKGROUND-COLOR: #f38686">Call</SPAN></P>
<P>February 26, 2015 9:00 AM ET</P>
<P><STRONG>Executives</STRONG></P>
<P>Dror Ben Asher - CEO</P>
<P>Ori Shilo - Deputy CEO, Finance and Operations</P>
<P>Guy Goldberg - Chief Business Officer</P>
Further along in the HTML, these executives' names recur multiple times, and after each name follows a text element I want to parse.
Example:
<P>
<STRONG> Dror Ben Asher </STRONG>
</P>
<P>Yeah, in terms of production in first quarter, we’re going to be lower than we had forecasted mainly due to our grade. We’ve had a couple of higher grade stopes in our Seabee complex that we’ve had some significant problems in terms of ground failures and dilution effects. In addition, not helping out, we’ve had some equipment downtime on some of our smaller silt development, so the combination of those two issues are affecting us.
</p>
For now I have code (see below) which identifies one executive, "Dror Ben Asher", and grabs all the text that occurs after it in the P element. But I would like this to work for all executives and for multiple HTML files where different executives are mentioned (different companies).
import textwrap
import os
from bs4 import BeautifulSoup

directory = 'C:/Research syntheses - Meta analysis/SeekingAlpha/out'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        print('{:<30} {:<70}'.format('Name', 'Answer'))
        print('-' * 101)
        for answer in soup.select('p:contains("Question-and-Answer Session") ~ strong:contains("Dror Ben Asher") + p'):
            txt = answer.get_text(strip=True)
            s = answer.find_next_sibling()
            while s:
                if s.name == 'strong' or s.find('strong'):
                    break
                if s.name == 'p':
                    txt += ' ' + s.get_text(strip=True)
                s = s.find_next_sibling()
            txt = ('\n' + ' '*31).join(textwrap.wrap(txt))
            print('{:<30} {:<70}'.format('Dror Ben Asher - CEO', txt), file=open("output.txt", "a"))
Does anyone have a suggestion to tackle this challenge?
If I understand your question correctly, you could put the code in a function to which you pass the name you need as an argument, and use that variable to construct your search strings.
For example:
def func(name_to_find):
    # some code
    for answer in soup.select('p:contains("Question-and-Answer Session") ~ strong:contains("{n}") + p'.format(n=name_to_find)):
        # some other code
and call it like so:
func('Dror Ben Asher')
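To feed that function every executive automatically, the names could first be collected from the participants block at the top of each file. A sketch, reusing the div id and the "Name - Title" line format shown in the question:

```python
from bs4 import BeautifulSoup

html = """
<div id="article_participants" class="content_part hid">
<p><strong>Executives</strong></p>
<p>Dror Ben Asher - CEO</p>
<p>Ori Shilo - Deputy CEO, Finance and Operations</p>
<p>Guy Goldberg - Chief Business Officer</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
executives = []
for p in soup.select("#article_participants p"):
    text = p.get_text(strip=True)
    if " - " in text:                 # lines shaped like "Name - Title"
        executives.append(text.split(" - ")[0])
print(executives)  # ['Dror Ben Asher', 'Ori Shilo', 'Guy Goldberg']
```

Each collected name could then be passed to the function in turn.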

Python-Requests Scraping YouTube description with BS4 issue

I'm trying to get both the text and the links as shown in the picture. But I can only get the text (via siblings) and the links separately, afterwards. I need them to come together like in the image. I tried using br.next_element but it doesn't grab the a-links. What am I missing?
import requests
from bs4 import BeautifulSoup

url_id = 'aM7aW0G58CI'
s = requests.Session()
r = s.get('https://www.youtube.com/watch?v=' + url_id)
html = r.text
soup = BeautifulSoup(html, 'lxml')
for i in soup.find_all('p', id='eow-description'):
    for br in i.find_all('br'):
        next_sib = br.next_sibling
        print(next_sib)
for i in soup.find_all('p', id='eow-description'):
    for a in i.find_all('a'):
        print(a.text)
This is the output that I am getting. I'm not getting what the screenshot below shows.
OutPut:
Special shout to
Wanna support what we do? Livestream at 2PM PT!:
It Wasn’t Me, I Swear!:
TheDeFrancoFam Vlog:
————————————
CATCH UP ON THIS WEEK’S SHOWS:
<br/>
Why People Are Freaking Out About The Trump NFL Boycott and Anthony Weiner Going to Jail…:
WOW! Dirty Advertising Exposed And Major Backlash Following Unexpected Compromise…:
Why Trump's "HUGE Failure" Is A Massive Loss For His Enemies and A Shocking Change To Women's Rights:
DISGUSTING! The Horrible Truth About Belle Gibson Exposed, Controversial Video Blows Up, and More:
<br/>
————————————
GET SOME GEAR:
————————————
FACEBOOK:
TWITTER:
INSTAGRAM:
SNAPCHAT: TheDeFrancoFam
REDDIT:
ITUNES:
GOOGLE PLAY:
————————————
Edited by:
James Girardier -
Jason Mayer -
<br/>
Produced by:
Amanda Morones -
<br/>
Motion Graphics Artist:
Brian Borst -
<br/>
P.O. BOX
Attn: Philip DeFranco
16350 Ventura Blvd
Ste D #542
Encino, CA 91436
http://DKPhil.com
http://DeFrancoElite.com
https://youtu.be/fFxDbYE06zU
https://youtu.be/kR7DquGe4vY
https://youtu.be/qdWUQGHtyPk
https://youtu.be/CWlUs1-7KN4
https://youtu.be/kUWt-oipvOY
https://youtu.be/XVsTh4zxKNo
https://teespring.com/stores/defranco...
http://on.fb.me/mqpRW7
http://Twitter.com/PhillyD
https://instagram.com/phillydefranco/
https://www.reddit.com/r/DeFranco
http://DeFrancoMistakes.com
http://mistakeswithdefranco.com
https://twitter.com/jamesgirardier
https://www.instagram.com/jayjaymay/
https://twitter.com/MandaOhDang
https://twitter.com/brianjborst
Using children and checking the tag name (child.name), I made this:
import requests
from bs4 import BeautifulSoup

url_id = 'aM7aW0G58CI'
s = requests.Session()
r = s.get('https://www.youtube.com/watch?v=' + url_id)
soup = BeautifulSoup(r.text, 'lxml')
# to concatenate text after <br>
br = ''
for p in soup.find_all('p', id='eow-description'):
    for child in p.children:
        if child.name == 'a':
            #print(' a:', child.text)
            print(br, child.text)
            br = ''  # reset br
        elif child.name == 'br':
            if child.next_sibling.name != 'br':  # skip <br/> ?
                #print('br:', child.next_sibling)
                br += str(child.next_sibling)
        #else:
        #    print(child.name, child)
I get:
Special shout to http://DKPhil.com
Wanna support what we do? Livestream at 2PM PT!: http://DeFrancoElite.com
It Wasn’t Me, I Swear!: https://youtu.be/fFxDbYE06zU
TheDeFrancoFam Vlog: https://youtu.be/kR7DquGe4vY
———————————— CATCH UP ON THIS WEEK’S SHOWS: Why People Are Freaking Out About The Trump NFL Boycott and Anthony Weiner Going to Jail…: https://youtu.be/qdWUQGHtyPk
WOW! Dirty Advertising Exposed And Major Backlash Following Unexpected Compromise…: https://youtu.be/CWlUs1-7KN4
Why Trump's "HUGE Failure" Is A Massive Loss For His Enemies and A Shocking Change To Women's Rights: https://youtu.be/kUWt-oipvOY
DISGUSTING! The Horrible Truth About Belle Gibson Exposed, Controversial Video Blows Up, and More: https://youtu.be/XVsTh4zxKNo
————————————GET SOME GEAR: https://teespring.com/stores/defranco...
————————————FACEBOOK: http://on.fb.me/mqpRW7
TWITTER: http://Twitter.com/PhillyD
INSTAGRAM: https://instagram.com/phillydefranco/
SNAPCHAT: TheDeFrancoFamREDDIT: https://www.reddit.com/r/DeFranco
ITUNES: http://DeFrancoMistakes.com
GOOGLE PLAY: http://mistakeswithdefranco.com
————————————Edited by:James Girardier - https://twitter.com/jamesgirardier
Jason Mayer - https://www.instagram.com/jayjaymay/
Produced by:Amanda Morones - https://twitter.com/MandaOhDang
Motion Graphics Artist:Brian Borst - https://twitter.com/brianjborst
EDIT: you may have to use
        else:
            print(child.name, child)
to get the P.O. Box address.
I found a really simple way:
for p in soup.find_all('p', id='eow-description'):
    print(p.get_text('\n'))
The only issue now is that some of the links are truncated with "...".
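One way around the truncated link text, sketched on a small stand-in snippet (not live YouTube HTML): replace each <a> tag with its href attribute before extracting the text, so the full URL survives:

```python
from bs4 import BeautifulSoup

html = ('<p id="eow-description">TWITTER: '
        '<a href="http://Twitter.com/PhillyD">http://Twitter.com/Phil...</a></p>')
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p", id="eow-description")
for a in p.find_all("a"):
    a.replace_with(a.get("href", ""))   # swap the tag for its full URL
text = p.get_text()
print(text)  # TWITTER: http://Twitter.com/PhillyD
```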
You can also play around with the youtube-dl Python module to get the description of a YouTube video that way as well.
I have found this way:
import pafy
url='https://www.youtube.com/watch?v=aM7aW0G58CI'
vid=pafy.new(url)
print(vid.description)
With this method, you will get the content exactly as shown in the YouTube video description.

How to scrape data from imdb business page?

I am making a project that requires data from the IMDb business page. I'm using Python. The data is stored between two tags like this:
Budget
$220,000,000 (estimated)
I want the numeric amount but have not been successful so far. Any suggestions?
Take a look at Beautiful Soup; it's a useful library for scraping. If you look at the source, the "Budget" is inside an h4 element, and the value is next in the DOM. This may not be the best example, but it works for your case:
import urllib
from bs4 import BeautifulSoup

page = urllib.urlopen('http://www.imdb.com/title/tt0118715/?ref_=fn_al_nm_1a')
soup = BeautifulSoup(page.read())
for h4 in soup.find_all('h4'):
    if "Budget:" in h4:
        print h4.next_sibling.strip()
        # $15,000,000
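The same next_sibling idea on a minimal stand-in snippet (Python 3 syntax; not live IMDb markup, so the class names here are illustrative):

```python
from bs4 import BeautifulSoup

html = ('<div class="txt-block"><h4 class="inline">Budget:</h4>'
        ' $220,000,000\n<span>(estimated)</span></div>')
soup = BeautifulSoup(html, "html.parser")
budget = None
for h4 in soup.find_all("h4"):
    if "Budget:" in h4.get_text():
        budget = h4.next_sibling.strip()  # the text node right after the <h4>
print(budget)  # $220,000,000
```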
This is a whole bunch of code (you can find what you need in here).
The Python script below will give you: 1) the list of top box-office movies from IMDb, and 2) the list of cast members for each of them.
from lxml.html import parse

def imdb_bo(no_of_movies=5):
    bo_url = 'http://www.imdb.com/chart/'
    bo_page = parse(bo_url).getroot()
    bo_table = bo_page.cssselect('table.chart')
    bo_total = len(bo_table[0][2])
    if no_of_movies <= bo_total:
        count = no_of_movies
    else:
        count = bo_total
    movies = {}
    for i in range(0, count):
        mo = {}
        mo['url'] = 'http://www.imdb.com' + bo_page.cssselect('td.titleColumn')[i][0].get('href')
        mo['title'] = bo_page.cssselect('td.titleColumn')[i][0].text_content().strip()
        mo['year'] = bo_page.cssselect('td.titleColumn')[i][1].text_content().strip(" ()")
        mo['weekend'] = bo_page.cssselect('td.ratingColumn')[i*2].text_content().strip()
        mo['gross'] = bo_page.cssselect('td.ratingColumn')[(i*2)+1][0].text_content().strip()
        mo['weeks'] = bo_page.cssselect('td.weeksColumn')[i].text_content().strip()
        m_page = parse(mo['url']).getroot()
        m_casttable = m_page.cssselect('table.cast_list')
        flag = 0
        mo['cast'] = []
        for cast in m_casttable[0]:
            if flag == 0:
                flag = 1
            else:
                m_starname = cast[1][0][0].text_content().strip()
                mo['cast'].append(m_starname)
        movies[i] = mo
    return movies

if __name__ == '__main__':
    no_of_movies = raw_input("Enter no. of Box office movies to display:")
    bo_movies = imdb_bo(int(no_of_movies))
    for k, v in bo_movies.iteritems():
        print '#'+str(k+1)+' '+v['title']+' ('+v['year']+')'
        print 'URL: '+v['url']
        print 'Weekend: '+v['weekend']
        print 'Gross: '+v['gross']
        print 'Weeks: '+v['weeks']
        print 'Cast: '+', '.join(v['cast'])
        print '\n'
Output (run in terminal):
parag#parag-innovate:~/python$ python imdb_bo_scraper.py
Enter no. of Box office movies to display:3
#1 Cinderella (2015)
URL: http://www.imdb.com/title/tt1661199?ref_=cht_bo_1
Weekend: $67.88M
Gross: $67.88M
Weeks: 1
Cast: Cate Blanchett, Lily James, Richard Madden, Helena Bonham Carter, Nonso Anozie, Stellan Skarsgård, Sophie McShera, Holliday Grainger, Derek Jacobi, Ben Chaplin, Hayley Atwell, Rob Brydon, Jana Perez, Alex Macqueen, Tom Edden
#2 Run All Night (2015)
URL: http://www.imdb.com/title/tt2199571?ref_=cht_bo_2
Weekend: $11.01M
Gross: $11.01M
Weeks: 1
Cast: Liam Neeson, Ed Harris, Joel Kinnaman, Boyd Holbrook, Bruce McGill, Genesis Rodriguez, Vincent D'Onofrio, Lois Smith, Common, Beau Knapp, Patricia Kalember, Daniel Stewart Sherman, James Martinez, Radivoje Bukvic, Tony Naumovski
#3 Kingsman: The Secret Service (2014)
URL: http://www.imdb.com/title/tt2802144?ref_=cht_bo_3
Weekend: $6.21M
Gross: $107.39M
Weeks: 5
Cast: Adrian Quinton, Colin Firth, Mark Strong, Jonno Davies, Jack Davenport, Alex Nikolov, Samantha Womack, Mark Hamill, Velibor Topic, Sofia Boutella, Samuel L. Jackson, Michael Caine, Taron Egerton, Geoff Bell, Jordan Long
Well, you asked for Python and you asked for a scraping solution.
But there is no need for Python and no need to scrape anything, because the budget figures are available in the business.list text file at http://www.imdb.com/interfaces
Try IMDbPY and its documentation. To install, just pip install imdbpy
from imdb import IMDb
ia = IMDb()
movie = ia.search_movie('The Untouchables')[0]
ia.update(movie)
#Lots of info for the movie from IMDB
movie.keys()
Though I'm not sure where specifically to find the budget info.
