I am trying to scrape all the titles off of this RSS Feed:
http://www.quora.com/Python-programming-language-1/rss
Here is my code:
import urllib2
import re
content = urllib2.urlopen('http://www.quora.com/Python-programming-language-1/rss').read()
allTitles = re.compile('<title>(.*)</title>')
list = re.findall(allTitles,content)
for e in range(0, 2):
    print list[e]
However, instead of getting a list of titles as the output, I am getting a bunch of raw markup from the RSS source. What am I doing wrong?
You should use the non-greedy qualifier (?) in your expression:
#allTitles = re.compile('<title>(.*)</title>')
allTitles = re.compile('<title>(.*?)</title>')
Without the ?, everything up to the last </title> ends up in the (.*) group.
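For example, here is the difference on a small string:
>>> import re
>>> sample = '<title>First</title>middle<title>Second</title>'
>>> re.findall('<title>(.*)</title>', sample)
['First</title>middle<title>Second']
>>> re.findall('<title>(.*?)</title>', sample)
['First', 'Second']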
As already mentioned, your code lacks a non-greedy qualifier in the regexp and can be fixed by adding one. But I strongly recommend switching from regular expressions to tools better suited to XML parsing, such as lxml, BeautifulSoup, or a specialised RSS-parsing module such as feedparser.
For example, see how your task can be done with lxml:
>>> import lxml.etree
>>> rss = lxml.etree.fromstring(content)
>>> titles = rss.findall('.//title')
>>> print '\n'.join(title.text for title in titles[:2])
Questions About Python (programming language) on Quora
Could someone explain for me the following Python function that uses @wraps from functools?
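If you go the feedparser route instead, the equivalent is roughly this (a sketch; it assumes feedparser is installed, though parse() and the entries/title attributes are its standard API):
>>> import feedparser
>>> feed = feedparser.parse('http://www.quora.com/Python-programming-language-1/rss')
>>> for entry in feed.entries[:2]:
...     print entry.title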
Related
I would like to find all URLs in a string. I found various solutions on StackOverflow that vary depending on the content of the string.
For example, supposing my string contained HTML, this answer recommends using either BeautifulSoup or lxml.
On the other hand, if my string contained only a plain URL without HTML tags, this answer recommends using a regular expression.
I wasn't able to find a good solution given that my string contains both an HTML-encoded URL and a plain URL. Here is some example code:
import lxml.html
example_data = """<a href="http://www.some-random-domain.com/abc123/def.html">Click Me!</a>
http://www.another-random-domain.com/xyz.html"""
dom = lxml.html.fromstring(example_data)
for link in dom.xpath('//a/@href'):
    print "Found Link: ", link
As expected, this results in:
Found Link: http://www.some-random-domain.com/abc123/def.html
I also tried the twitter-text-python library that @YannisP mentioned, but it doesn't seem to extract both URLs:
>>> from ttp.ttp import Parser
>>> p = Parser()
>>> r = p.parse(example_data)
>>> r.urls
['http://www.another-random-domain.com/xyz.html']
What is the best approach for extracting both kinds of URLs from a string containing a mix of HTML-encoded and plain data? Is there a good module that already does this? Or am I forced to combine regex with BeautifulSoup/lxml?
I upvoted because it triggered my curiosity. There seems to be a library called twitter-text-python that parses Twitter posts to detect both URLs and hrefs. Otherwise, I would go with the combination of regex + lxml.
You could use a regex to find all URLs:
import re
urls = re.findall(r"(https?://[\w/$\-_.+!*'()]+)", example_data)
The character class covers alphanumerics, '/', and the other characters allowed in a URL.
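Applied to the example_data string above, it picks up both the href URL and the plain one:
>>> import re
>>> re.findall(r"(https?://[\w/$\-_.+!*'()]+)", example_data)
['http://www.some-random-domain.com/abc123/def.html', 'http://www.another-random-domain.com/xyz.html']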
Based on the answer by @YannisP, I was able to come up with this solution:
import lxml.html
from ttp.ttp import Parser
def extract_urls(data):
    urls = set()
    # First extract HTML-encoded URLs
    dom = lxml.html.fromstring(data)
    for link in dom.xpath('//a/@href'):
        urls.add(link)
    # Next, extract URLs from plain text
    parser = Parser()
    results = parser.parse(data)
    for url in results.urls:
        urls.add(url)
    return list(urls)
This results in:
>>> example_data
'<a href="http://www.some-random-domain.com/abc123/def.html">Click Me!</a>\nhttp://www.another-random-domain.com/xyz.html'
>>> urls = extract_urls(example_data)
>>> print urls
['http://www.another-random-domain.com/xyz.html', 'http://www.some-random-domain.com/abc123/def.html']
I'm not sure how well this will work on other URLs, but it seems to work for what I need it to do.
I am new to Python (using 2.7.3). I was trying to do web scraping with Python, but I am not getting the expected output:
import urllib
import re
regex='<title>(.+?)<\title>'
pattern=re.compile(regex)
dummy="fsdfsdf<title>Test<\title>dsf"
html=urllib.urlopen('http://www.google.com')
text=html.read()
print pattern.findall(text)
print pattern.findall(dummy)
The second print statement works fine, but the first one, which should print Google, gives an empty list instead.
Try changing:
regex='<title>(.+?)<\title>'
to
regex='<title>(.+?)</title>'
You mistyped the slash:
regex='<title>(.+?)<\title>'
should be:
regex='<title>(.+?)</title>'
HTML uses a forward slash in closing tags.
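You can verify the corrected pattern in a REPL:
>>> import re
>>> re.findall('<title>(.+?)</title>', '<title>Google</title>')
['Google']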
That said, don't use regular expressions to parse HTML. Matching HTML with such expressions gets too complicated, too fast.
Use an HTML parser instead; Python has several to choose from. I recommend BeautifulSoup, a popular third-party library.
BeautifulSoup example:
from bs4 import BeautifulSoup
response = urllib.urlopen(url)
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
title = soup.find('title').text
I am trying to extract the JavaScript from google.com using a regular expression.
Program
import urllib
import re
gdoc = urllib.urlopen('http://google.com').read()
scriptlis = re.findall(r'<script>(.*?)</script>', gdoc)
print scriptlis
Output:
['']
Can anyone tell me how to extract JavaScript from an HTML document using a regular expression only?
This works:
import urllib
import re
gdoc = urllib.urlopen('http://google.com').read()
scriptlis = re.findall('(?si)<script>(.*?)</script>', gdoc)
print scriptlis
The key here is (?si). The "s" sets the "dotall" flag (same as re.DOTALL), which makes the dot match newlines too. That was actually the root of your problem: the scripts on google.com span multiple lines, so the regex can't match them unless you tell it to include newlines in (.*?).
The "i" sets the "ignorecase" flag (same as re.IGNORECASE), which lets the pattern match tags regardless of case. Now, this isn't strictly necessary here because Google's markup is consistent, but if you had sloppier markup with tags like <SCRIPT>...</SCRIPT>, you would need this flag.
If you don't have an issue with third-party libraries, requests together with BeautifulSoup makes for a great combination:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://www.google.com')
p = bs(r.content)
p.find_all('script')
I think the problem is that the text between <script> and </script> spans several lines, so you could try something like this:
rg = re.compile('<script>(.*)</script>', re.DOTALL)
result = re.findall(rg, gdoc)
What you probably could try to do is
scriptlis = re.findall(r'<script\s*([^>]*)\s*>(.*?)</script', gdoc, re.I|re.S)
Because most script tags are of type:
<script language="javascript" src="foo"></script>
or
<script language="javascript">alert("foo")</script>
and some even are <SCRIPT></SCRIPT>
Neither of these matches your regex. My regex grabs the attributes in group 1 and any inline code in group 2. It will also match tags inside HTML comments, but it is about the best you can do without BeautifulSoup et al.
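For example, running it against both of those forms returns an (attributes, inline code) tuple per tag:
>>> import re
>>> samples = '<script language="javascript" src="foo"></script><SCRIPT>alert("foo")</SCRIPT>'
>>> re.findall(r'<script\s*([^>]*)\s*>(.*?)</script', samples, re.I|re.S)
[('language="javascript" src="foo"', ''), ('', 'alert("foo")')]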
I'd like to grab all the index words and their definitions from here. Is it possible to scrape web content with Python?
Firebug exploration shows that the following URL returns my desired contents, including both the index and its definition, for 'a'.
http://pali.hum.ku.dk/cgi-bin/cpd/pali?acti=xart&arid=14179&sphra=undefined
What modules should I use? Is there a tutorial available?
I do not know how many words are indexed in the dictionary. I'm an absolute beginner in programming.
You should use urllib2 for getting the URL contents and BeautifulSoup for parsing the HTML/XML.
Example - retrieving all questions from the StackOverflow.com main page:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://stackoverflow.com")
soup = BeautifulSoup(page)
for incident in soup('h3'):
    print [i.decode('utf8') for i in incident.contents]
    print
This code sample was adapted from the BeautifulSoup documentation.
You can get data from the web using the built-in urllib or urllib2, but the parsing itself is the most important part. May I suggest the wonderful BeautifulSoup? It can deal with just about anything.
http://www.crummy.com/software/BeautifulSoup/
The documentation is built like a tutorial. Sorta:
http://www.crummy.com/software/BeautifulSoup/documentation.html
In your case, you probably need to use wildcards to see all entries in the dictionary. You can do something like this:
import urllib2
def getArticles(query, start_index, count):
    xml = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
                          'acti=xsea&tsearch=%s&rfield=entr&recf=%d&recc=%d' %
                          (query, start_index, count))
    # TODO:
    # Parse the XML here (using BeautifulSoup or an XML parser like Python's
    # own xml.etree). We should at least have the name and ID for each
    # article: article = (article_name, article_id)
    return articles  # a list of (article_name, article_id) tuples parsed from the XML

def getArticleContent(article_id):
    xml = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
                          'acti=xart&arid=%d&sphra=undefined' % article_id)
    # TODO: parse xml
    return parsed_article
Now you can loop over things. For instance, to get all articles starting with 'ana', use the wildcard 'ana*' and loop until you get no results:
query = 'ana*'
article_dict = {}
i = 0
while True:
    new_articles = getArticles(query, i, 100)
    if len(new_articles) == 0:
        break
    i += 100
    for article_name, article_id in new_articles:
        article_dict[article_name] = getArticleContent(article_id)
Once done, you'll have a dictionary of the content of all articles, referenced by names. I omitted the parsing itself, but it's quite simple in this case, since everything is XML. You might not even need to use BeautifulSoup (even though it's still handy and easy to use for XML).
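In case it helps, here is a rough sketch of that parsing step using the standard library. Note that the element and attribute names ('article', 'name', 'id') are hypothetical; inspect the XML the server actually returns and adjust accordingly:
import xml.etree.ElementTree as ET

def parse_article_list(xml_string):
    # Hypothetical structure: <results><article id="123"><name>...</name></article></results>
    root = ET.fromstring(xml_string)
    return [(node.findtext('name'), int(node.get('id')))
            for node in root.findall('.//article')]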
A word of warning though:
You should check the site's usage policy (and maybe robots.txt) before trying to heavily scrape articles. If you're just getting a few articles for yourself, they may not care (though the dictionary's copyright owner, if it's not public domain, might), but if you're going to scrape the entire dictionary, that's some heavy usage.
I'm using pyparsing to parse HTML. I'm grabbing all embed tags, but in some cases there's an a tag directly following that I also want to grab if it's available.
example:
import pyparsing
target = pyparsing.makeHTMLTags("embed")[0]
target.setParseAction(pyparsing.withAttribute(src=pyparsing.withAttribute.ANY_VALUE))
target.ignore(pyparsing.htmlComment)
result = target.searchString(""".....
<object....><embed>.....</embed></object><br />blah
""")
I haven't been able to find any character offset in the result objects, otherwise I could just grab a slice of the original input string and work from there.
EDIT:
Someone asked why I don't use BeautifulSoup. That's a good question, let me show you why I chose not to use it with a code sample:
import BeautifulSoup
import urllib
import re
import socket
socket.setdefaulttimeout(3)
# get some random blogs
xml = urllib.urlopen('http://rpc.weblogs.com/shortChanges.xml').read()
success, failure = 0.0, 0.0
for url in re.compile(r'\burl="([^"]+)"').findall(xml)[:30]:
print url
try:
BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())
except IOError:
pass
except Exception, e:
print e
failure += 1
else:
success += 1
print failure / (failure + success)
When I try this, BeautifulSoup fails with parse errors 20-30% of the time. These aren't rare edge cases. pyparsing is slow and cumbersome but it hasn't blown up no matter what I throw at it. If I can be enlightened as to a better way to use BeautifulSoup then I would be really interested in knowing that.
If there is an optional <a> tag that you want to capture when it follows an <embed> tag, then add it to your search pattern:
embedTag = pyparsing.makeHTMLTags("embed")[0]
aTag = pyparsing.makeHTMLTags("a")[0]
target = embedTag + pyparsing.Optional(aTag)
result = target.searchString(""".....
<object....><embed>.....</embed></object><br />blah
""")
print result.dump()
If you want to capture the character location of an expression within your parser, insert one of these, with a results name:
loc = pyparsing.Empty().setParseAction(lambda s,locn,toks: locn)
target = (loc("beforeEmbed") + embedTag + loc("afterEmbed") +
          pyparsing.Optional(aTag))
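As a sketch of how you might use those offsets afterwards (assuming html_source holds the original page text), each match then carries the character positions, so you can slice the input directly:
for match in target.searchString(html_source):
    # beforeEmbed/afterEmbed are character offsets into html_source
    print html_source[match.beforeEmbed:match.afterEmbed]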
Why would you write your own HTML parser? The standard library includes HTMLParser, and BeautifulSoup can handle any job HTMLParser can't.
Why don't you prefer using a normal regex? Or is it because it's a bad habit to parse HTML that way? :D
re.findall(r"<object.*?</object>(?:<br /><a.*?</a>)?", a, re.DOTALL)
I was able to run your BeautifulSoup code and received no errors. I'm running BeautifulSoup 3.0.7a
Please use BeautifulSoup 3.0.7a; 3.1.0.1 has bugs that prevent it from working at all in some cases (such as yours).