Scraping visible text - python

I am an absolute newbie at web scraping, and right now I want to extract the visible text from a web page. I found this piece of code online:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.espncricinfo.com/"
web_page = urllib2.urlopen(url)
soup = BeautifulSoup(url , "lxml")
print (soup.prettify())
Running the above code, I get the following result:
/usr/local/lib/python2.7/site-packages/bs4/__init__.py:282: UserWarning: "http://www.espncricinfo.com/" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
' that document to Beautiful Soup.' % decoded_markup
<html>
<body>
<p>
http://www.espncricinfo.com/
</p>
</body>
</html>
Is there any way I could get a more concrete result, and what is going wrong with the code? Sorry for being clueless.

Try passing the HTML document, not the URL, to BeautifulSoup:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.espncricinfo.com/"
web_page = urllib2.urlopen(url)
soup = BeautifulSoup(web_page , 'html.parser')
print (soup.prettify().encode('utf-8'))

soup = BeautifulSoup(web_page, "lxml")
You should pass a file-like object (or the HTML itself) to BeautifulSoup, not the URL.
The URL is opened by urllib2.urlopen(url), and the response is stored in web_page.
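Since the goal is the visible text rather than the prettified markup, here is a minimal Python 3 sketch of one way to get it. The SAMPLE_HTML string and the visible_text helper are made up for illustration; in practice the string would be the document you downloaded. It assumes that dropping script, style, and head metadata tags is a good enough approximation of "visible".

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the downloaded page (an assumption for this sketch).
SAMPLE_HTML = """
<html><head><title>Scores</title><style>p {color: red;}</style>
<script>var x = 1;</script></head>
<body><h1>Live scores</h1><p>India won by 5 wickets</p></body></html>
"""

def visible_text(html):
    """Return the text of an HTML document without script/style/metadata content."""
    soup = BeautifulSoup(html, "html.parser")
    # These tags hold content the browser does not render in the page body.
    for tag in soup(["script", "style", "title", "meta", "noscript"]):
        tag.decompose()
    # Join the remaining strings and collapse runs of whitespace.
    return " ".join(soup.get_text(separator=" ").split())

text = visible_text(SAMPLE_HTML)
print(text)
```

Elements hidden with CSS would still be included, so treat this as a starting point rather than a complete answer.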

Related

HtmlAgilityPack and BeautifulSoup missing tags

I am scraping this site: https://finance.yahoo.com/quote/MSFT/press-releases.
In the browser, there are 20+ articles. However, when I pull the site's HTML down and load it into HtmlAgilityPack, only the first three articles appear.
let client = new WebClient()
let uri = "https://finance.yahoo.com/quote/MSFT/press-releases"
let response = client.DownloadString(uri)
let doc = HtmlDocument()
doc.LoadHtml(response)
This works:
let node = doc.DocumentNode.SelectSingleNode("//*[@id=\"summaryPressStream-0-Stream\"]/ul/li[1]")
node.InnerText
This does not work:
let node = doc.DocumentNode.SelectSingleNode("//*[@id=\"summaryPressStream-0-Stream\"]/ul/li[10]")
node.InnerText
Is it because there are some janky li tags on the Yahoo site? Is it a limitation of HtmlAgilityPack?
I also did the same script in Python using BeautifulSoup and have the same problem:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "https://finance.yahoo.com/quote/MSFT/press-releases?"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a', href=True):
    print(link['href'])
Thanks
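One way to see what the server actually sent is to collect the links from the static markup and count them. Neither BeautifulSoup nor HtmlAgilityPack runs JavaScript, so if the count is lower than what the browser shows, the remaining items may be added by scripts after the page loads. The sketch below uses a made-up SAMPLE_HTML string and a hypothetical collect_links helper; it does not contact Yahoo.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Made-up markup standing in for the downloaded page (an assumption for this sketch).
SAMPLE_HTML = """
<ul>
  <li><a href="/news/first-article.html">First</a></li>
  <li><a href="https://example.com/second">Second</a></li>
  <li><a>No href here</a></li>
</ul>
"""

def collect_links(html, base_url):
    """Return every href found in the markup, resolved against base_url."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips <a> tags that have no href attribute at all.
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

links = collect_links(SAMPLE_HTML, "https://finance.yahoo.com/quote/MSFT/press-releases")
print(len(links), links)
```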

Web scraping with BeautifulSoup and unfound content

I'm trying to make a basic web scraper using BeautifulSoup in Python. However my target page is making it difficult.
When I make the request, I get a response with the HTML. However in the body, it only displays 1 div as:
'<div id="miniwidget" style="width:100%; height:100%;"></div>'
I've navigated through the website's HTML in Google Chrome, but I'm new enough to this that I don't understand why the page doesn't generate all of the content within that div.
How would I go about making a request that would generate the rest of the HTML?
Here's what I've written:
from bs4 import BeautifulSoup
from urllib.request import urlopen
def Call_Webpage(url):
    html = urlopen(url)
    bsObj = BeautifulSoup(html, features="html.parser")
    soup = bsObj.body.findAll('div')
    print(soup)
Response:
<div id="miniwidget" style="width:100%; height:100%;"></div>
It depends on what you want to find on that page.
For example, to inspect meta tags you would use soup.find_all('meta'), and so on.
You could also do the following (domain_url, headers, and timeout are values you define yourself):
request = urllib.request.Request(domain_url, None, headers)
result = urllib.request.urlopen(request,timeout=timeout)
resulttext = result.read()
to get the whole page as raw bytes, which you can then decode to text.
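The request snippet above leaves its variables undefined, so here is a self-contained sketch of the same idea. The build_request and fetch_text helpers, the User-Agent string, and the 10 second timeout are all assumptions chosen for illustration. Only the request object is built here; fetch_text is defined but not called.

```python
import urllib.request

def build_request(url, user_agent="Mozilla/5.0 (compatible; example-scraper)"):
    """Create a Request that carries a custom User-Agent header."""
    return urllib.request.Request(url, data=None, headers={"User-Agent": user_agent})

def fetch_text(url, timeout=10):
    """Download the page and decode it, falling back to UTF-8 if no charset is declared."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")

# Building the request does not touch the network.
request = build_request("https://example.com/")
print(request.full_url, request.get_header("User-agent"))
```

This still returns only the HTML the server sends, so a div that is filled in by JavaScript will stay empty in the result.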

Beautiful Soup python .get full information from html

I am trying to get the number of views of my post on Telegram via BeautifulSoup. For example I want to take it from my channel post number 956: https://t.me/dayygesstt/956
<span class="tgme_widget_message_views">3.1K</span>
So "3.1K" is what I need.
import requests
from bs4 import BeautifulSoup
def get_html(url):
    r = requests.get(url,'lxml')
    return r.text
url='https://t.me/dayygesstt/956'
html=get_html(url)
soup=BeautifulSoup(html, )
x = soup.findAll("div", {"class": "tgme_page tgme_page_post"})
for i in x :
    r=i.findAll("div", {"class": "tgme_page_widget"})
    print(r)
and it prints:
[<div class="tgme_page_widget" id="widget">
<script async="" data-telegram-post="dayygesstt/956" data-width="100%" src="https://telegram.org/js/telegram-widget.js?4"></script>
</div>]
I tried different stuff but I can't get more info. Please help me, what am I doing wrong? How to get information properly?
You can use the URL that loads the iframe in your script. Then you get just the widget without the cruft. For this take the original URL and append a query string "embed=1".
import requests
from bs4 import BeautifulSoup
url = 'https://t.me/dayygesstt/956?embed=1'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
views = soup.find("span", {"class": "tgme_widget_message_views"})
print(views.text)
I think you need to tell BeautifulSoup which parser to use so that it parses the HTML correctly, so this line:
soup=BeautifulSoup(html, )
needs to be this:
soup=BeautifulSoup(html, 'html.parser')
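The span lookup from the answer can be tried offline on a small string. In this sketch only the span element comes from the question; the wrapping div and the extract_views helper are made up for illustration. Returning None when the span is missing avoids an AttributeError on pages that do not contain it.

```python
from bs4 import BeautifulSoup

# Only the <span> is taken from the question; the wrapper div is an assumption.
SAMPLE_HTML = (
    '<div class="tgme_widget_message_info">'
    '<span class="tgme_widget_message_views">3.1K</span>'
    '</div>'
)

def extract_views(html):
    """Return the view counter text, or None if the span is not in the markup."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", class_="tgme_widget_message_views")
    return span.get_text(strip=True) if span is not None else None

print(extract_views(SAMPLE_HTML))
print(extract_views("<div></div>"))
```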

Beautiful Soup not returning everything in HTML file?

HTML noob here, so I could be misunderstanding something about the HTML document, so bear with me.
I'm using Beautiful Soup to parse web data in Python. Here is my code:
import urllib
import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup.BeautifulSoup(page)
indicateGameDone = str(soup.find("div", {"class": "nbaModTopStatus"}))
print indicateGameDone
Now, if you look at the website, the HTML code has the line <p class="nbaLiveStatTxSm"> FINAL </p> (inspect the 'Final' text on the left side of the container for the first ATL-WAS game on the page to see it for yourself). But when I run the code above, it doesn't return the 'FINAL' that is seen on the webpage; instead, the nbaLiveStatTxSm element is empty.
On my machine, this is the output when I print indicateGameDone:
<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p><p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p><p class="nbaFnlStatTxSm"></p></div>
Does anyone know why this is happening?
EDIT: Clarification: the problem isn't retrieving the text within the tag. The problem is that when I take the HTML code from the website and print it out in Python, something that I saw when I inspected the element in the browser is missing from the printed output.
You can use this logic to extract any text.
This code allows you to extract any data between any tags.
Output - FINAL
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url)
soup = BeautifulSoup(page)
indicateGameDone = soup.find("div", {"class": "nbaFnlStatTx"})
for p in indicateGameDone:
    p_text = soup.find("p", {"class": "nbaFnlStatTxSm"})
    print(p_text.getText())
    break
It looks like your problem is not with BeautifulSoup but instead with urllib.
Try running the following commands
>>> import urllib
>>> url = "http://www.nba.com/gameline/20160323/"
>>> page = urllib.urlopen(url).read()
>>> page.find('<div class="nbaModTopStatus">')
44230
Which is no surprise considering that Beautiful Soup was able to find the div itself. However when we look a little deeper into what urllib is actually collecting we can see that the <p class="nbaFnlStatTxSm"> is empty by running
>>> page[44230:45000]
'<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p><p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p><p class="nbaFnlStatTxSm"></p></div><div id="nbaGLBroadcast"><img src="/.element/img/3.0/sect/gameline/broadcasters/lp.png"></div><div class="nbaTeamsRow"><div class="nbaModTopTeamScr nbaModTopTeamAw"><h5 class="nbaModTopTeamName awayteam">ATL</h5><img src="http://i.cdn.turner.com/nba/nba/.element/img/2.0/sect/gameline/teams/ATL.gif" width="34" height="22" title="Atlanta Hawks"><h4 class="nbaModTopTeamNum win"></h4></div><div class="nbaModTopTeamScr nbaModTopTeamHm"><h5 class="nbaModTopTeamName hometeam">WAS</h5><img src="http://i.cdn.turner.com/nba/nba/.element/img/2.0/sect/gameline/teams/WAS.gif" width="34" '
You can see that the tag is empty, so your problem is the data that's being passed to Beautiful Soup, not the package itself.
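The same check can be done with Beautiful Soup itself by listing each paragraph's class next to its text. This sketch parses the div exactly as it appears in the output above; the paragraph_texts helper is a made-up name for illustration.

```python
from bs4 import BeautifulSoup

# The status div as it appears in the fetched page output shown above.
STATUS_HTML = (
    '<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p>'
    '<p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p>'
    '<p class="nbaFnlStatTxSm"></p></div>'
)

def paragraph_texts(html):
    """Return (first class name, text) for every <p>, so empty tags are easy to spot."""
    soup = BeautifulSoup(html, "html.parser")
    # The class attribute is multi-valued, so Beautiful Soup returns it as a list.
    return [(p["class"][0], p.get_text()) for p in soup.find_all("p")]

for name, text in paragraph_texts(STATUS_HTML):
    print(name, repr(text))
```

Tags that print an empty string were already empty in the downloaded markup, which points at the fetched data rather than the parser.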
Changed the import of BeautifulSoup to the proper syntax for the current version of BeautifulSoup (bs4).
Corrected the way you were constructing the BeautifulSoup object.
Fixed your find statement, then used the .text attribute to get the text of the HTML element you're after.
With the minor modifications listed above, your code runs for me:
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page)
indicateGameDone = soup.find("div", {"class": "nbaModTopStatus"})
print indicateGameDone.text ## "LiveFinal "
To address the comments:
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page)
indicateGameDone = soup.find("p", {"class": "nbaFnlStatTx"})
print indicateGameDone.text

python beautifulsoup get html tag content

How can I get the content of an HTML tag with BeautifulSoup? For example, the content of the <title> tag?
I tried:
from bs4 import BeautifulSoup
url ='http://www.websiteaddress.com'
soup = BeautifulSoup(url)
result = soup.findAll('title')
for each in result:
print(each.get_text())
But nothing happened. I'm using python3.
You need to fetch the website data first. You can do this with the urllib.request module. Note that HTML documents only have one title so there is no need to use find_all() and a loop.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url ='http://www.websiteaddress.com'
data = urlopen(url)
soup = BeautifulSoup(data, 'html.parser')
result = soup.find('title')
print(result.get_text())
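To try the title lookup without a network request, the same call can be run on an HTML string. The page_title helper and the sample strings below are made up for illustration. Since find() returns None when no title tag exists, the sketch checks for that before reading the text.

```python
from bs4 import BeautifulSoup

def page_title(html):
    """Return the stripped <title> text, or None if the document has no title."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("title")
    return title.get_text(strip=True) if title is not None else None

print(page_title("<html><head><title> Example Domain </title></head></html>"))
print(page_title("<html><body>No title</body></html>"))
```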
