I am trying to scrape this page and all of the other pages like it. I have been using BeautifulSoup (I have also tried lxml, but ran into installation issues). I am using the following code:
import urllib2
from bs4 import BeautifulSoup

value = "http://www.presidency.ucsb.edu/ws/index.php?pid=99556"
desiredTag = "span"
r = urllib2.urlopen(value)
data = BeautifulSoup(r.read(), 'html5lib')
displayText = data.find_all(desiredTag)
print displayText
displayText = " ".join(str(tag) for tag in displayText)
displayText = BeautifulSoup(displayText, 'html5lib')
For some reason this isn't pulling back the <span class="displaytext">. I have also tried desiredTag as p.
Am I missing something?
You are definitely experiencing the differences between the parsers used by BeautifulSoup. html.parser and lxml both worked for me:
data = BeautifulSoup(urllib2.urlopen(value), 'html.parser')
Proof:
>>> import urllib2
>>> from bs4 import BeautifulSoup
>>>
>>> url = "http://www.presidency.ucsb.edu/ws/index.php?pid=99556"
>>>
>>> data = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
>>> data.find("span", class_="displaytext").text
u'PARTICIPANTS:Former Speaker of the House Newt Gingrich (GA);
...
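And since lxml worked as well, the same lookup can be run with it; only the parser argument changes (this assumes lxml is installed, which the question mentions was an issue):

data = BeautifulSoup(urllib2.urlopen(url), 'lxml')
print data.find("span", class_="displaytext").text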
I am trying to scrape the links from the "box score" buttons on this page. Each button's link is supposed to look like this:
http://www.espn.com/nfl/boxscore?gameId=400874795
I tried to use this code to see if I could access the buttons, but I cannot:
from bs4 import BeautifulSoup
import requests
url = 'http://www.espn.com/nfl/scoreboard/_/year/2016/seasontype/1/week/2'
advanced = url
r = requests.get(advanced)
data = r.text
soup = BeautifulSoup(data,"html.parser")
for link in soup.find_all('a'):
    print link
As wpercy mentions in his comment, you can't do this using requests. As a suggestion, you should use selenium together with Chromedriver/PhantomJS to handle the JavaScript:
from selenium import webdriver
from bs4 import BeautifulSoup
url = "http://www.espn.com/nfl/scoreboard/_/year/2016/seasontype/1/week/2"
browser = webdriver.Chrome()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html,'html.parser')
boxList = soup.findAll('a',{'name':'&lpos=nfl:scoreboard:boxscore'})
Each score button's a tag has the attribute name="&lpos=nfl:scoreboard:boxscore", so we first use .findAll, and then a simple list comprehension can extract each href attribute:
>>> links = [box['href'] for box in boxList]
>>> links
['/nfl/boxscore?gameId=400874795', '/nfl/boxscore?gameId=400874854', '/nfl/boxscore?gameId=400874753', '/nfl/boxscore?gameId=400874757', '/nfl/boxscore?gameId=400874772', '/nfl/boxscore?gameId=400874777', '/nfl/boxscore?gameId=400874767', '/nfl/boxscore?gameId=400874812', '/nfl/boxscore?gameId=400874761', '/nfl/boxscore?gameId=400874764', '/nfl/boxscore?gameId=400874781', '/nfl/boxscore?gameId=400874796', '/nfl/boxscore?gameId=400874750', '/nfl/boxscore?gameId=400873867', '/nfl/boxscore?gameId=400874775', '/nfl/boxscore?gameId=400874798']
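The hrefs are relative, so if you need full URLs like the one shown in the question, you can join them against the ESPN host; a small sketch using urlparse.urljoin (Python 2):

>>> from urlparse import urljoin
>>> fullLinks = [urljoin('http://www.espn.com', link) for link in links]
>>> fullLinks[0]
'http://www.espn.com/nfl/boxscore?gameId=400874795'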
Here is the solution I came up with; it scrapes all the links present on the URL you provided in your answer. You can check it out:
from bs4 import BeautifulSoup
import urllib

url = 'http://www.espn.com/nfl/scoreboard/_/year/2016/seasontype/1/week/2'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
tags = soup('a')  # calling the soup is shorthand for soup.find_all('a')
for i, tag in enumerate(tags):
    print i
    ans = tag.get('href', None)
    print ans
    print "\n"
The answer from Gopal Chitalia didn't work for me, so I decided to post a working one (for Python 3.6.5):
from bs4 import BeautifulSoup
# in Python 3, urllib.request must be imported explicitly; a bare
# "import urllib" does not make urllib.request.urlopen available
import urllib.request

url = 'http://www.espn.com/nfl/scoreboard/_/year/2016/seasontype/1/week/2'
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
tags = soup('a')  # shorthand for soup.find_all('a')
for i, tag in enumerate(tags):
    print(i)
    ans = tag.get('href', None)
    print(ans)
    print("\n")
I'm a newbie at web scraping. Here is what I do:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://chgk.tvigra.ru/letopis/?2016/2016_spr#27mar")
soup = BeautifulSoup(html, "html.parser")
res = soup.find_all('a', {'href': re.compile("r'\b?20\b'")})
print (res)
and get
[]
My goal is this fragment
<script language="javascript" type="text/javascript">
cont = new Array();
count = new Array();
for (i=1979; i <=2015; i++){count[i]=0};
cont[1979] = "<li><a href='?1979_1#24jan'>24 января</a>" +
..............
cont[2016] = "<li><a href='?2016/2016_spr#cur'>Весенняя серия</a>" +
"<li><a href='?2016/2016_sum#cur'>Летняя серия</a>" +
"<li><a href='?2016/2016_aut#cur'>Осенняя серия</a>" +
"<li><a href='?2016/2016_win#cur'>Зимняя серия</a>";
And I am trying to get results like this:
'?2016/2016_spr#cur'
'?2016/2016_sum#cur'
'?2016/2016_aut#cur'
'?2016/2016_win#cur'
I want the links from 2000 up to now (that is the reason for the '20' in "r'\b?20\b'"). Can you help me, please?
Preliminaries:
>>> import requests
>>> import bs4
>>> page = requests.get('http://chgk.tvigra.ru/letopis/?2016/2016_spr#27mar').content
>>> soup = bs4.BeautifulSoup(page, 'lxml')
Having done this it might seem that the most straightforward way of identifying the script element might be to use this:
>>> scripts = soup.findAll('script', text=bs4.re.compile('cont = new Array();'))
However, scripts proves to be an empty list. (Likely because the parentheses in the pattern are regex metacharacters: '()' forms an empty group, so the literal text 'cont = new Array();' never matches. Escaping them, as in bs4.re.compile(r'cont = new Array\(\);'), should fix it.)
The basic approach works if I choose a different target within the script, but it would appear that it's unsafe to depend on the exact formatting of the contents of a JavaScript script element.
>>> scripts = soup.find_all(string=bs4.re.compile('i=1979'))
>>> len(scripts)
1
Still, this might be good enough for you. Just note that the script has a change function at the end that will need to be discarded.
A safer approach might be to look for the containing table element, then the second td element within that and finally the script within that.
>>> table = soup.find_all('table', class_='common_table')
>>> tds = table[0].findAll('td')[1]
>>> script = tds.find('script')
Again, you will need to discard function change.
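From there, a regular expression over the script's text can pull out just the href fragments from 2000 onward; a rough sketch (the pattern assumes the single-quote style shown in the page's script):

>>> import re
>>> hrefs = re.findall(r"href='(\?20[^']+)'", script.get_text())

This should include the four 2016 fragments ('?2016/2016_spr#cur' and so on) along with the earlier years starting with 20.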
You can use get('attribute') and then filter the results if needed:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://chgk.tvigra.ru/letopis/?2016/2016_spr#27mar")
soup = BeautifulSoup(html, "html.parser")
res = [link.get('href') for link in soup.find_all('a')]
print (res)
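If you then want only the fragment links from 2000 onward, you can filter that list; note this only finds anchors present in the static HTML, not ones the JavaScript builds at runtime:

links_2000_on = [h for h in res if h and h.startswith('?20')]
print (links_2000_on)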
The big goal is to find specific house bills.
With this code I am trying to select the link: /legislation?q=%7B%22congress%22%3A%22113%22%2C%22chamber%22%3A%22House%22%7D to narrow down my search to house bills.
from bs4 import BeautifulSoup
import urllib2
import re

soup = BeautifulSoup(urllib2.urlopen("https://beta.congress.gov/legislation"))

for link in soup.find_all('a'):
    soup_links = link.get('href')

r1 = re.compile(r'/legislation(\?\S+congress\S+chamber\S+House\S+)')
print r1.findall(soup_links)
When I do this I get an empty list instead of the link.
It isn't my regular expression, because the following works:
r2 = re.compile(r'\S+congress\S+chamber\S+House\S+')
newstring = '/legislation?q=%7B%22congress%22%3A%22113%22%2C%22chamber%22%3A%22House%22%7D'
print r2.findall(newstring)
You are re-assigning a new value to soup_links on each iteration; in the end, only the last href attribute is assigned.
BeautifulSoup can do the searching for you:
soup = BeautifulSoup(urllib2.urlopen("https://beta.congress.gov/legislation"))
r1 = re.compile(r'/legislation(\?\S+congress\S+chamber\S+House\S+)')
soup_links = [l['href'] for l in soup.find_all('a', href=r1)]
print soup_links
This produces the one matching link:
>>> from bs4 import BeautifulSoup
>>> import urllib2
>>> soup = BeautifulSoup(urllib2.urlopen("https://beta.congress.gov/legislation"))
>>> r1 = re.compile(r'/legislation(\?\S+congress\S+chamber\S+House\S+)')
>>> [l['href'] for l in soup.find_all('a', href=r1)]
['/legislation?q=%7B%22congress%22%3A%22113%22%2C%22chamber%22%3A%22House%22%7D']
If you only expect one link to match, use soup.find() instead of soup.find_all():
soup = BeautifulSoup(urllib2.urlopen("https://beta.congress.gov/legislation"))
r1 = re.compile(r'/legislation(\?\S+congress\S+chamber\S+House\S+)')
soup_link = soup.find('a', href=r1)
print soup_link['href']
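As a sanity check, the matched href is just a URL-encoded JSON query, so you can decode it to see the filter it represents; a quick sketch with urllib.unquote (Python 2):

import urllib
print urllib.unquote(soup_link['href'])
# prints: /legislation?q={"congress":"113","chamber":"House"}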
I am parsing a website using BS4, and from its source code I want to print the text "+26.67%":
<font color="green"><b><nobr>+26.67%</nobr></b></font>
I have been messing around with the .find_all() method (http://www.crummy.com/software/BeautifulSoup/bs4/doc/) to no avail. What would be the correct way to search the source code and print just the text?
My code:
import requests
from bs4 import BeautifulSoup
set_url = "*insert web address here*"
set_response = requests.get(set_url)
set_data = set_response.text
soup = BeautifulSoup(set_data)
e = soup.find("nobr")
print(e.text)
A small example:
>>> s="""<font color="green"><b><nobr>+26.67%</nobr></b></font>"""
>>> print s
<font color="green"><b><nobr>+26.67%</nobr></b></font>
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s)
>>> e = soup.find("nobr")
>>> e.text #or e.get_text()
u'+26.67%'
find returns the first Tag; find_all returns a ResultSet:
>>> type(e)
<class 'bs4.element.Tag'>
>>> es = soup.find_all("nobr")
>>> type(es)
<class 'bs4.element.ResultSet'>
>>> for e in es:
... print e.get_text()
...
+26.67%
If you want the specified nobr under b and font, it can be:
>>> soup.find("font",{'color':'green'}).find("b").find("nobr").get_text()
u'+26.67%'
Note that chained .find calls may raise an exception if an earlier .find returns None, so pay attention.
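A defensive version of the same chain, checking each step before descending, might look like this:

font = soup.find("font", {'color': 'green'})
if font is not None:
    b = font.find("b")
    nobr = b.find("nobr") if b is not None else None
    if nobr is not None:
        print nobr.get_text()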
Use a CSS selector:
>>> s = """<font color="green"><b><nobr>+26.67%</nobr></b></font>"""
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s)
>>> soup.select('font[color="green"] > b > nobr')
[<nobr>+26.67%</nobr>]
Add or remove properties or element names from the selector string to make the match more or less precise.
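To print just the text from the first match, index into the list that select returns:

>>> soup.select('font[color="green"] > b > nobr')[0].get_text()
u'+26.67%'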
Here is my solution:
s = """<font color="green"><b><nobr>+26.67%</nobr></b></font>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(s)
a = soup.select('font')
print a[0].text
You can fetch the text without using the requests library. The following is the edit I made to your code, and it gives your expected result:
from bs4 import BeautifulSoup
html_snippet="""<font color="green"><b><nobr>+26.67%</nobr></b></font>"""
soup = BeautifulSoup(html_snippet)
e = soup.find("nobr")
print(e.text)
The result was
+26.67%
Good luck!
I have tried hard to get the link (i.e. /d/Hinchinbrook+25691+Masjid-Bilal) from the "result" below using BeautifulSoup in Python. Please help.
result:
<div class="subtitleLink"><a href="/d/Hinchinbrook+25691+Masjid-Bilal"><b>Masjid Bilal</b></a></div>
code:
url1 = "http://www.salatomatic.com/c/Sydney+168"
content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1)
results = soup.findAll("div", {"class" : "subtitleLink"})
for result in results:
    print result
    br = result.find('a')
    pos = br.get_text()
    print pos
import urllib2
from bs4 import BeautifulSoup
url1 = "http://www.salatomatic.com/c/Sydney+168"
content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1)
for link in soup.findAll('a'):
    print link.get('href')
This should work if you want all the links. Let me know if it doesn't.
The get_text method returns only the string components of a tag. To get the link here, reference it as an attribute. For this specific instance, you can change br.get_text() to br['href'] to get your desired result.
...
>>> br = result.find('a')
>>> pos = br['href']
>>> print pos
/d/Hinchinbrook+25691+Masjid-Bilal
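Putting both together, a sketch of the loop from the question that prints each name alongside its link, skipping any div whose anchor is missing:

for result in results:
    br = result.find('a')
    if br is not None:
        print br.get_text(), br['href']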