I am currently learning Python and have tried to pick up web scraping. I have been using example code that I got from some tutorials, but I have encountered a problem with one of the sites I was looking at. The following code was supposed to return the title of the website:
import urllib
import re

urls = ["http://www.libyaherald.com"]
i = 0
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)
while i < len(urls):
    htmlfile = urllib.urlopen(urls[i])
    htmltext = htmlfile.read()
    titles = re.findall(pattern, htmltext)
    print titles
    i += 1
The title for the Libya Herald website returned an error. I checked the source code for Libya Herald and the DOCTYPE is <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">.
Does the doc type have something to do with me not being able to scrape from it?
As @Puciek said, scraping HTML with regex is going to be very difficult. I would recommend that you start using a package; one that is very easy to install and use is BeautifulSoup.
Once you install it you can try this simple example:
from bs4 import BeautifulSoup
import requests
html = requests.get('http://www.libyaherald.com').text
bs = BeautifulSoup(html, 'html.parser')  # naming a parser explicitly avoids a warning
title = bs.find('title').text
print title
For serious Python web scraping I strongly suggest Scrapy.
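If you go that route, a minimal sketch of a Scrapy spider that yields the page title might look like this (the spider name and the file name title_spider.py are made up for illustration; run it with scrapy runspider title_spider.py):

import scrapy

class TitleSpider(scrapy.Spider):
    name = 'title'
    start_urls = ['http://www.libyaherald.com']

    def parse(self, response):
        # Scrapy's CSS selectors do the parsing, no regex needed
        yield {'title': response.css('title::text').get()}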
And as far as I know, when it comes to HTML parsing, regex is not a recommended approach. Try BeautifulSoup (BS4) like Pizza guy said :)
I want to be able to pull all URLs from the following webpage using Python: https://yeezysupply.com/pages/all. I tried some other suggestions I found, but they didn't seem to work with this particular website; I would end up not finding any URLs at all.
import urllib
import lxml.html

connection = urllib.urlopen('https://yeezysupply.com/pages/all')
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/@href'):
    print link
Perhaps it would be useful for you to make use of modules specifically designed for this. Here's a quick-and-dirty script that gets the relative links on the page:
#!/usr/bin/python3
import requests, bs4

res = requests.get('https://yeezysupply.com/pages/all')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
links = soup.find_all('a')
for link in links:
    print(link.attrs['href'])
It generates output like this:
/pages/jewelry
/pages/clothing
/pages/footwear
/pages/all
/cart
/products/womens-boucle-dress-bleach/?back=%2Fpages%2Fall
/products/double-sleeve-sweatshirt-bleach/?back=%2Fpages%2Fall
/products/boxy-fit-zip-up-hoodie-light-sand/?back=%2Fpages%2Fall
/products/womens-boucle-skirt-cream/?back=%2Fpages%2Fall
etc...
Is this what you are looking for? Requests and Beautiful Soup are amazing tools for scraping.
There are no links in the page source; they are inserted using JavaScript after the page is loaded into the browser.
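If the links really are injected by JavaScript, a plain HTTP fetch will never see them; you need something that actually runs the page in a browser. A rough sketch with Selenium (assuming Chrome and a matching chromedriver are installed, which is not part of the original code):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get('https://yeezysupply.com/pages/all')
# the DOM here includes elements added by JavaScript after page load
for a in driver.find_elements(By.TAG_NAME, 'a'):
    print(a.get_attribute('href'))
driver.quit()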
import requests
from lxml import html

page = requests.get('http://www.cnn.com')
html_content = html.fromstring(page.content)
for i in html_content.iterchildren():
    print i
news_stories = html_content.xpath('//h2[@data-analytics]/a/span/text()')
news_links = html_content.xpath('//h2[@data-analytics]/a/@href')
I am trying to run this code to understand how web scraping in Python works. I want to scrape the top news stories and their links from CNN. When I run this in the Python shell, the output I get for news_stories and news_links is:
[]
My question is: where am I going wrong with this, and is there a better way to achieve what I am trying to do?
In your code, html_content is returning only the page address and not the actual content of the page.
html_content = html.fromstring(page.content)
You can try printing the following to see the complete HTML code for that page:
import requests
from lxml import html
page = requests.get('http://www.cnn.com')
print page.text
Even if you do get the content somehow, you may receive a gzipped response from the server (see: Get html using Python requests?).
I would highly recommend using the httplib2 library together with BeautifulSoup to scrape news stories from CNN. It is really handy to use and gets you what you want. You can see another Stack Overflow post on this (retrieve links from web page using python and BeautifulSoup).
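A rough sketch of that combination, assuming the same CNN front page as above (the exact selectors you will need may differ from the generic link extraction shown here):

import httplib2
from bs4 import BeautifulSoup

http = httplib2.Http()
response, content = http.request('http://www.cnn.com')  # content is the raw page bytes
soup = BeautifulSoup(content, 'html.parser')
for link in soup.find_all('a', href=True):
    print(link['href'])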
I hope that helps you.
I have to test a bunch of URLs to see whether those webpages have the respective translated content or not. Is there any way to return the language of the content of a webpage using Python? For example, if the page is in Chinese, it should return "Chinese".
I checked it with the langdetect module, but I was not able to get the results I desire. These URLs are in web XML format, and the content is shown under <releasehigh>.
Here is a simple example demonstrating the use of BeautifulSoup to extract the HTML body text and langdetect for language detection:
from bs4 import BeautifulSoup
from langdetect import detect
with open("foo.html", "rb") as f:
soup = BeautifulSoup(f, "lxml")
[s.decompose() for s in soup("script")] # remove <script> elements
body_text = soup.body.get_text()
print(detect(body_text))
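One caveat: langdetect returns language codes such as 'en' or 'zh-cn' rather than names like "Chinese", so you may need to map the code to a language name yourself.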
You can extract a chunk of content and then use a Python language-detection library such as langdetect or guess-language.
Maybe you have a header like this one:
<HTML xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr" lang="fr">
If that's the case, you can see from lang="fr" that this is a French web page. If it's not, guessing the language of a text is not trivial.
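A sketch combining both ideas, fetching the page with requests (the URL is a placeholder) and falling back to langdetect when no lang attribute is declared (which attributes are actually present depends on the page):

import requests
from bs4 import BeautifulSoup
from langdetect import detect

html = requests.get('http://example.com').text  # placeholder URL
soup = BeautifulSoup(html, 'html.parser')
# prefer the declared language if the page provides one
lang = soup.html.get('lang') or soup.html.get('xml:lang')
if lang is None:
    # no declaration, so guess from the visible text instead
    lang = detect(soup.get_text())
print(lang)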
You can use BeautifulSoup to extract the language from HTML source code.
<html class="no-js" lang="cs">
Extract the lang attribute from the source code:
from bs4 import BeautifulSoup
import requests
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
print(soup.html["lang"])
I am having a weird problem with my code:
from bs4 import BeautifulSoup
from bs4.diagnose import diagnose
import requests

def get_text(url):
    data = ""
    p = requests.get(url).content
    soup = BeautifulSoup(p)
    paragraphs = soup.select("p.story-body-text.story-content")
    data = p
    text = ""
    for paragraph in paragraphs:
        text += paragraph.text
    text = text.encode('ascii', 'ignore')
    return str(text)
Basically, what my code should be doing is getting the HTML using "requests" and then using BS4 to find all the "p.story-body-text.story-content" elements, which contain the actual article content.
It works great on some articles such as:
http://www.nytimes.com/2014/03/10/world/asia/malaysia-airlines-flight.html?ref=world&_r=0
and
http://www.nytimes.com/2014/04/13/world/asia/coalition-building-season-in-india.html?
However, it will not work on these links:
http://www.nytimes.com/2014/04/06/world/middleeast/break-in-syrian-war-brings-brittle-calm.html?_r=0#
and
http://www.nytimes.com/2014/02/23/magazine/instagram-travel-diary.html?nav
I think it is a problem with the "requests" library, because it does not fetch the correct HTML.
Any ideas?
Edit: pastebin link http://pastebin.com/n3svnKTQ
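As a debugging step (a suggestion, not something from the original post), you can feed what requests actually fetched to the bs4 diagnose helper the code already imports, and send a browser-like User-Agent in case the server returns different HTML to non-browser clients:

import requests
from bs4.diagnose import diagnose

url = 'http://www.nytimes.com/2014/02/23/magazine/instagram-travel-diary.html?nav'
headers = {'User-Agent': 'Mozilla/5.0'}  # browser-like header; the value is illustrative
p = requests.get(url, headers=headers).content
diagnose(p)  # prints how each installed parser interprets the document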
I am trying to parse an HTML page with BeautifulSoup, but it appears that BeautifulSoup doesn't like the HTML of that page at all. When I run the code below, the method prettify() returns only the script block of the page (see below). Does anybody have an idea why this happens?
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1"
html = "".join(urllib2.urlopen(url).readlines())
print "-- HTML ------------------------------------------"
print html
print "-- BeautifulSoup ---------------------------------"
print BeautifulSoup(html).prettify()
This is the output produced by BeautifulSoup.
-- BeautifulSoup ---------------------------------
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<script language="JavaScript">
<!--
function highlight(img) {
document[img].src = "/marketing/sony/images/en/" + img + "_on.gif";
}
function unhighlight(img) {
document[img].src = "/marketing/sony/images/en/" + img + "_off.gif";
}
//-->
</script>
Thanks!
UPDATE: I am using the following version, which appears to be the latest.
__author__ = "Leonard Richardson (leonardr#segfault.org)"
__version__ = "3.1.0.1"
__copyright__ = "Copyright (c) 2004-2009 Leonard Richardson"
__license__ = "New-style BSD"
Try version 3.0.7a, as Łukasz suggested. BeautifulSoup 3.1 was designed to be compatible with Python 3.0, so they had to change the parser from SGMLParser to HTMLParser, which seems more vulnerable to bad HTML.
From the changelog for BeautifulSoup 3.1:
"Beautiful Soup is now based on HTMLParser rather than SGMLParser, which is gone in Python 3. There's some bad HTML that SGMLParser handled but HTMLParser doesn't"
Try lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup, so it might work better for you. It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
Ian Bicking agrees.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.
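For illustration, a minimal lxml.html version of the original snippet (same URL as the question; lxml's own parser tends to recover from broken markup rather than bailing out):

import urllib2
import lxml.html

url = "http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1"
doc = lxml.html.fromstring(urllib2.urlopen(url).read())
print doc.findtext('.//title')  # lxml builds the full tree, not just the script block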
BeautifulSoup isn't magic: if the incoming HTML is too horrible then it isn't going to work.
In this case, the incoming HTML is exactly that: too broken for BeautifulSoup to figure out what to do. For instance it contains markup like:
SCRIPT type=""javascript""
(Notice the double quoting.)
The BeautifulSoup docs contain a section on what you can do if BeautifulSoup can't parse your markup. You'll need to investigate those alternatives.
Samj: If I get things like
HTMLParser.HTMLParseError: bad end tag: u"</scr' + 'ipt>"
I just remove the culprit from the markup before I serve it to BeautifulSoup and all is dandy:
html = urllib2.urlopen(url).read()
html = html.replace("</scr' + 'ipt>","")
soup = BeautifulSoup(html)
I had problems parsing the following code too:
<script>
function show_ads() {
document.write("<div><sc"+"ript type='text/javascript'src='http://pagead2.googlesyndication.com/pagead/show_ads.js'></scr"+"ipt></div>");
}
</script>
HTMLParseError: bad end tag: u'', at line 26, column 127
Sam
I tested this script on BeautifulSoup version '3.0.7a' and it returns what appears to be correct output. I don't know what changed between '3.0.7a' and '3.1.0.1' but give it a try.
>>> import urllib
>>> from BeautifulSoup import BeautifulSoup
>>> page = urllib.urlopen('http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1')
>>> soup = BeautifulSoup(page)
>>> soup.prettify()
In my case, executing the above statements returns the entire HTML page.