Regex to extract all URLs from a page - python

This question has been asked a few times on SO but I couldn't get any of the answers to work correctly. I need to extract all the URLs in a page, both in href links and in the plain text. I don't need the individual groups of the regex; I need a list of strings, i.e. the URLs in the page. Could someone point me to a good working example?
I'd like to do this using Regexs and not BeautifulSoup, etc.
Thank you.

HTML is not a regular language, and thus cannot be parsed by regular expressions.
It's possible to make reasonable guesses using regular expressions, and/or to recognize a restricted subset of URIs, but that way lies madness (lengthy debugging processes, inaccurate results).
That said, if you're willing to go that path, see John Gruber's regex for the purpose:
import re

def extract_urls(your_text):
    # John Gruber's liberal URL pattern (note that Python's re module has no
    # POSIX [:punct:] class, so that part is treated as literal characters).
    url_re = re.compile(r'\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))')
    for match in url_re.finditer(your_text):
        yield match.group(0)
This can be used as follows:
>>> for uri in extract_urls('http://foo.bar/baz irc://freenode.org/bash'):
...     print(uri)
http://foo.bar/
irc://freenode.org

I know you can use the DOM object in PHP to parse an HTML document. I'm not familiar with python but this might help: http://docs.python.org/library/xml.dom.html
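If you'd rather stay in the standard library, here is a minimal sketch of the same DOM-ish idea using html.parser, which copes better with real-world HTML than xml.dom (which expects well-formed XML):

from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)

collector = HrefCollector()
collector.feed('<p>See <a href="http://example.com/">example</a>.</p>')
print(collector.hrefs)  # ['http://example.com/']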

Related

How can I exclude the list of web-pages from google-search results?

"minus" sign doesn't fit because the list consists of ~2000 entries.
I'm just beginner in python, so, please explain as to 5-year old, if possible
Thank you much in advance!
Presumably you are fetching the Google search results from a Python program. So you can exclude the web pages in your list in your Python program as you read the results, instead of trying to make Google do it for you. You can use a functional programming technique like calling filter for this.
Ideally you would do this by comparing the URLs of the links, but if you were willing to sacrifice accuracy you could do it by comparing the titles of the links instead, if you only had titles in your list and not URLs. But URLs are definitely better for this purpose.
So you could parse the Google search results using a library like Beautiful Soup, extract the URLs of the links, and filter out (using filter) the ones that were equal to any of the URLs on your list (you could define a function using def, for checking whether a given URL is on your list). You'll have to be careful though because sometimes Google search result links go via a Google website which redirects to the real URL, for ranking purposes.
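As a rough sketch of that approach (the exclusion list here is a made-up placeholder, and the real markup of Google's result pages changes over time, so treat the parsing as illustrative only):

from bs4 import BeautifulSoup

# URLs you want to exclude -- placeholder values for illustration.
excluded = {'http://example.com/spam', 'http://example.org/junk'}

def keep(url):
    """Return True if the URL is not on the exclusion list."""
    return url not in excluded

def filter_results(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Collect the href of every link in the results page.
    urls = [a.get('href') for a in soup.find_all('a') if a.get('href')]
    # filter() keeps only the URLs for which keep() returns True.
    return list(filter(keep, urls))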

Scraping news article data and formatting the results in Word

For a list of article URLs I need to scrape the title, author, date, publication, and body of the article. Then each article needs to appear in Word, formatted according to a template (bold title, publication in italics, a table of contents at the top with hyperlinks, etc.).
I've used some of these tools in the past, and I would recommend two things for cleaning up HTML and getting at the text:
html2text: a library that extracts text from HTML
Regular expressions: Python's re module, which you can use to delete HTML expressions
(Be careful with regexes; it's possible to miss some data or some strings in certain cases.)
For Word I would recommend this:
python-docx: a library for working with MS Word documents from Python.
PS: This is just a brief summary. You would find a lot of results if you just used the SO search.
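Assuming the Word library meant above is python-docx, a rough sketch of how the pieces could fit together (the URL handling and the heading are placeholders; a real template would still need the title, author, date and so on):

import html2text                  # pip install html2text
from docx import Document         # pip install python-docx
from urllib.request import urlopen

def article_to_docx(url, out_path='article.docx'):
    html = urlopen(url).read().decode('utf-8', errors='replace')

    # Convert the HTML body to plain text.
    converter = html2text.HTML2Text()
    converter.ignore_links = True
    text = converter.handle(html)

    # Drop the text into a Word document.
    doc = Document()
    doc.add_heading(url, level=1)   # placeholder: put the real title here
    doc.add_paragraph(text)
    doc.save(out_path)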

Python - Parsing a string for URLs and extracting them

I know that with urllib you can parse a string and check if it's a valid URL. But how would one go about checking if a sentence contains a URL within it, and then extracting that URL? I've seen some huge regular expressions out there, but I would rather not use something that I can't really comprehend.
So basically I have an input string, and I need to find and extract all the URLs within that string.
What's a clean way of going about this?
You can search for "words" containing : and then pass them to urlparse (renamed to urllib.parse in Python 3.0 and newer) to check if they are valid URLs.
Example:
possible_urls = re.findall(r'\S+:\S+', text)
If you want to restrict yourself only to URLs starting with http:// or https:// (or anything else you want to allow) you can also do that with regular expressions, for example:
possible_urls = re.findall(r'https?://\S+', text)
You may also want to use some heuristics to determine where the URL starts and stops, because people sometimes add punctuation after a URL, producing a new URL that is valid but not the one intended, for example:
Have you seen the new look for http://example.com/? It's a total ripoff of http://example.org/!
Here the punctuation after the URL is not intended to be part of the URL. You can see from the automatically added links in the above text that StackOverflow implements such heuristics.
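Putting the regex and urlparse together, here is a small sketch that finds candidates, trims common trailing punctuation, and then sanity-checks each result (the punctuation set is just a guess at what people typically tack onto the end of a sentence):

import re
from urllib.parse import urlparse   # the urlparse module in Python 2

def extract_urls(text):
    urls = []
    for candidate in re.findall(r'https?://\S+', text):
        # Heuristic: strip punctuation that usually ends the sentence,
        # not the URL itself.
        candidate = candidate.rstrip('.,!?;:')
        parts = urlparse(candidate)
        if parts.scheme and parts.netloc:
            urls.append(candidate)
    return urls

print(extract_urls("Have you seen http://example.com/? "
                   "It's a ripoff of http://example.org/!"))
# ['http://example.com/', 'http://example.org/']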
Plucking a URL out of "the wild" is a tricky endeavor (to do correctly). Jeff Atwood wrote a blog post on this subject: The Problem With URLs. John Gruber has also addressed the issue: An Improved Liberal, Accurate Regex Pattern for Matching URLs. And I have written some code which attempts to tackle the problem as well: URL Linkification (HTTP/FTP) (for PHP/JavaScript). (Note that my regex is particularly complex because it is designed to be applied to HTML markup and attempts to skip URLs which are already linkified, i.e. already wrapped in an anchor like Link!)
Second, when it comes to validating a URI/URL, the document you want to look at is RFC 3986. I've been working on an article dealing with this very subject: Regular Expression URI Validation. You may want to take a look at that as well.
But when you get down to it, this is not a trivial task!

How would I look for all URLs on a web page and then save them to individual variables with urllib2 in Python?

How would I look for all URLs on a web page and then save them to individual variables with urllib2 in Python?
Parse the HTML with an HTML parser, find all <a> tags (e.g. using Beautiful Soup's findAll() method), and check their href attributes.
If, however, you want to find all URLs in the page even if they aren't hyperlinks, then you can use a regular expression which could be anything from simple to ridiculously insane.
You can't do it with urllib2 alone. What you are looking for is parsing the URLs in a web page.
You get your first page using urllib2, read its contents, and then pass it through a parser like Beautiful Soup, or, as the other poster explained, you can use a regex to search the contents of the page too.
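A short sketch of that combination, written for Python 3 (urllib.request plays the role urllib2 plays in Python 2, and find_all is the bs4 spelling of findAll):

from urllib.request import urlopen
from bs4 import BeautifulSoup

def links_in_page(url):
    html = urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    # Every <a> tag with an href attribute is a candidate URL.
    return [a['href'] for a in soup.find_all('a', href=True)]

for link in links_in_page('http://example.com/'):
    print(link)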
You could download the raw HTML with urllib2 and then simply search through it. There might be easier ways, but you could do this:
1. Download the source code.
2. Use string methods to split it into a list of chunks.
3. Check the first 7 characters of each chunk.
4. If the first 7 characters are http://, write that chunk to a variable.
Why do you need separate variables though? Wouldn't it be easier to save them all to a list, using list.append(URL_YOU_JUST_FOUND), every time you find another url?
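For what it's worth, a literal sketch of those four steps, collecting the results in a list rather than separate variables (note that this only catches chunks that literally begin with http://, so URLs buried inside attributes will be missed):

from urllib.request import urlopen   # urllib2.urlopen in Python 2

# 1. Download the source code.
source = urlopen('http://example.com/').read().decode('utf-8', errors='replace')

# 2. Split it into a list of whitespace-separated chunks.
chunks = source.split()

# 3./4. Keep every chunk whose first 7 characters are "http://".
found_urls = []
for chunk in chunks:
    if chunk[:7] == 'http://':
        found_urls.append(chunk)

print(found_urls)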

python method to extract content (excluding navigation) from an HTML page

Of course an HTML page can be parsed using any number of python parsers, but I'm surprised that there don't seem to be any public parsing scripts to extract meaningful content (excluding sidebars, navigation, etc.) from a given HTML doc.
I'm guessing it's something like collecting DIV and P elements and then checking them for a minimum amount of text content, but I'm sure a solid implementation would include plenty of things that I haven't thought of.
Try the Beautiful Soup library for Python. It has very simple methods to extract information from an html file.
Trying to generically extract data from webpages would require people to write their pages in a similar way... but there's an almost infinite number of ways to lay out a page that looks identical, let alone all the combinations you can use to convey the same information.
Was there a particular type of information you were trying to extract or some other end goal?
You could try extracting any content in 'div' and 'p' markers and comparing the relative sizes of all the information in the page. The problem then is that people probably group information into collections of 'div's and 'p's (or at least they do if they're writing well-formed HTML!).
Maybe if you formed a tree of how the information is related (nodes would be the 'p' or 'div' or whatever, and each node would contain the associated text), you could do some sort of analysis to identify the smallest 'p' or 'div' that encompasses what appears to be the majority of the information?
[EDIT] Maybe if you can get it into the tree structure I suggested, you could then use a points system similar to SpamAssassin's. Define some rules that attempt to classify the information. Some examples:
+1 points for every 100 words
+1 points for every child element that has > 100 words
-1 points if the section name contains the word 'nav'
-2 points if the section name contains the word 'advert'
If you have a lot of low-scoring rules which add up as you find more relevant-looking sections, I think that could evolve into a fairly powerful and robust technique.
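A toy sketch of such a scoring pass, using Beautiful Soup for the tree and the example weights above (the rules are deliberately simplistic):

from bs4 import BeautifulSoup

def score(element):
    """Apply the example rules above to one element of the tree."""
    points = 0
    text = element.get_text(' ', strip=True)
    points += len(text.split()) // 100                 # +1 per 100 words
    # +1 for every direct child 'div' or 'p' with more than 100 words.
    for child in element.find_all(['div', 'p'], recursive=False):
        if len(child.get_text(' ', strip=True).split()) > 100:
            points += 1
    section_name = ' '.join([element.get('id', '')] +
                            element.get('class', []))
    if 'nav' in section_name:
        points -= 1
    if 'advert' in section_name:
        points -= 2
    return points

def best_section(html):
    soup = BeautifulSoup(html, 'html.parser')
    candidates = soup.find_all(['div', 'p'])
    return max(candidates, key=score) if candidates else None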
[EDIT2] Looking at readability, it seems to be doing pretty much exactly what I just suggested! Maybe it could be improved to try to understand tables better?
Have a look at templatemaker: http://www.holovaty.com/writing/templatemaker/
It's written by one of the founders of Django. Basically you feed it a few example html files and it will generate a "template" that you can then use to extract just the bits that are different (which is usually the meaningful content).
Here's an example from the google code page:
# Import the Template class.
>>> from templatemaker import Template
# Create a Template instance.
>>> t = Template()
# Learn a Sample String.
>>> t.learn('<b>this and that</b>')
# Output the template so far, using the "!" character to mark holes.
# We've only learned a single string, so the template has no holes.
>>> t.as_text('!')
'<b>this and that</b>'
# Learn another string. The True return value means the template gained
# at least one hole.
>>> t.learn('<b>alex and sue</b>')
True
# Sure enough, the template now has some holes.
>>> t.as_text('!')
'<b>! and !</b>'
You might use the boilerpipe Web application to fetch and extract content on the fly.
(This is not specific to Python, as you only need to issue an HTTP GET request to a page on Google AppEngine.)
What is meaningful and what is not depends on the semantics of the page. If the semantics are poor, your code won't "guess" what is meaningful. I use readability, which you linked in the comment, and I see that on many pages I try to read it does not provide any result, let alone a decent one.
If someone puts the content in a table, you're doomed. Try readability on a phpBB forum and you'll see what I mean.
If you want to do it, go with a regexp on <p></p>, or parse the DOM.
Goose is just the library for this task. To quote its README:
Goose will try to extract the following information:
Main text of an article
Main image of article
Any YouTube/Vimeo movies embedded in the article
Meta description
Meta tags
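A quick usage sketch, assuming the maintained goose3 fork (the original python-goose exposes essentially the same interface); the URL is a placeholder:

from goose3 import Goose   # pip install goose3

g = Goose()
article = g.extract(url='http://example.com/some-article')  # placeholder URL

print(article.title)               # main title
print(article.meta_description)    # meta description
print(article.cleaned_text[:200])  # start of the article's main text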