I know that with urllib you can parse a string and check if it's a valid URL. But how would one go about checking if a sentence contains a URL within it, and then extracting that URL? I've seen some huge regular expressions out there, but I would rather not use something that I really can't comprehend.
So basically I have an input string, and I need to find and extract all the URLs within that string.
What's a clean way of going about this?
You can search for "words" containing : and then pass them to urlparse (renamed to urllib.parse in Python 3.0 and newer) to check if they are valid URLs.
Example:
possible_urls = re.findall(r'\S+:\S+', text)
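A minimal sketch of that idea, combining the regex with urlparse to filter out false positives (assuming Python 3; on Python 2 the import would be from urlparse import urlparse):
import re
from urllib.parse import urlparse  # Python 3; on Python 2: from urlparse import urlparse

def find_candidate_urls(text):
    # Grab whitespace-delimited "words" that contain a colon,
    # then keep only the ones urlparse sees as scheme + host.
    candidates = re.findall(r'\S+:\S+', text)
    urls = []
    for word in candidates:
        parsed = urlparse(word)
        if parsed.scheme and parsed.netloc:
            urls.append(word)
    return urls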
If you want to restrict yourself only to URLs starting with http:// or https:// (or anything else you want to allow) you can also do that with regular expressions, for example:
possible_urls = re.findall(r'https?://\S+', text)
You may also want to use some heuristics to determine where the URL starts and stops, because people sometimes add punctuation around URLs, producing matches that are still valid URLs but not the ones intended, for example:
Have you seen the new look for http://example.com/? It's a total ripoff of http://example.org/!
Here the punctuation after the URL is not intended to be part of the URL. You can see from the automatically added links in the above text that StackOverflow implements such heuristics.
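One simple (and admittedly crude) heuristic is to strip common sentence punctuation from the end of each match; a rough sketch:
import re

def find_http_urls(text):
    # Trailing '?', '!', ',', '.' and so on are more likely sentence
    # punctuation than part of the URL, so trim them off each match.
    return [u.rstrip('.,;:!?)') for u in re.findall(r'https?://\S+', text)]

print(find_http_urls("Have you seen the new look for http://example.com/? "
                     "It's a total ripoff of http://example.org/!"))
# ['http://example.com/', 'http://example.org/']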
Plucking a URL out of "the wild" is a tricky endeavor (to do correctly). Jeff Atwood wrote a blog post on this subject: The Problem With URLs. John Gruber has also addressed the issue: An Improved Liberal, Accurate Regex Pattern for Matching URLs. And I have written some code which attempts to tackle the problem too: URL Linkification (HTTP/FTP) (for PHP/JavaScript). (Note that my regex is particularly complex because it is designed to be applied to HTML markup and attempts to skip URLs which are already linkified, i.e. Link!)
Second, when it comes to validating a URI/URL, the document you want to look at is RFC 3986. I've been working on an article dealing with this very subject: Regular Expression URI Validation. You may want to take a look at that as well.
But when you get down to it, this is not a trivial task!
I am trying to use Python's re module to collect all text matches from multiline HTML code.
Here is the html code (I kept it as it is, messy):
<p class="SearchCard">
Hi,
Need AWS devops with 10+ experience
If anyone interested plz revert
</p>
If I use regex='\"SearchCard\">(\n*\s*.*)' I get match up to word Hi, included.
But if I use regex='\"SearchCard\">(\n*\s*.*)*?<' I get everything until end of the html file but I want to capture matched group before first '<' symbol.
Python code is bellow:
re.findall(regex, html)
I know it can be (and probably should be) done with other modules (e.g. BS4, or similar), but I am looking for solution with re.
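For what it's worth, one way to stop the capture at the first '<' with re alone is a negated character class; a minimal sketch, using the snippet above as sample input:
import re

html = '''<p class="SearchCard">
Hi,
Need AWS devops with 10+ experience
If anyone interested plz revert
</p>'''

# [^<]* matches everything (including newlines) up to, but not including,
# the first '<' after "SearchCard">, so no DOTALL flag is needed.
matches = re.findall(r'"SearchCard">([^<]*)', html)
print(matches[0].strip())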
My application is listing some game servers IP addresses.
I want to add a simple search engine that takes a regular expression. I would type ^200. to list only the IP addresses beginning with 200.
The form would redirect me to the results page by sending a GET request like this:
/servers/search/^200./page/1/range/30/
This is the line I'm using in urls.py :
url(r'^servers/search/(?P<search>[a-zA-Z0-9.]+)/page/(?P<page>\d+)/range/(?P<count>\d+)/$', gameservers.views.index)
But it doesn't work the way I expected. No results are shown. I've intentionally made a syntax error to see the local variables. Then I realized that the search variable's value is the following:
^200./page/1/range/30/
How can I fix this? I've thought about moving the search parameter to the end of the URL, but it would be interesting to see if there is a way to stop the value at the next /.
Your regex doesn't match at all: you are not accepting the ^ character. But even if it did, there's no way the full URL could be captured in the search variable, because then the rest of the URL pattern wouldn't match.
However, I wouldn't try to fix this. Trying to capture complicated patterns in the URL itself is usually a mistake. For a search value, it's perfectly acceptable to move that to a GET query parameter, so that your URL would look something like this:
/servers/search/?search=^200.&page=1&range=30
or, if you like, you could still capture the page and range values in the URL, but leave the search value as a query param.
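A rough sketch of what that could look like with the search value as a query parameter (the GameServer model, its ip field, and the template path are assumptions for illustration, not the original code):
# urls.py
url(r'^servers/search/$', gameservers.views.index),

# views.py
import re
from django.shortcuts import render

def index(request):
    search = request.GET.get('search', '')      # e.g. '^200.'
    page = int(request.GET.get('page', '1'))
    count = int(request.GET.get('range', '30'))

    servers = GameServer.objects.all()           # hypothetical model
    if search:
        pattern = re.compile(search)
        servers = [s for s in servers if pattern.search(s.ip)]

    return render(request, 'servers/index.html',
                  {'servers': servers[:count], 'page': page})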
This question has been asked a few times on SO but I couldn't get any of the answers to work correctly. I need to extract all the URLs in a page, both in href links and in the plain text. I don't need the individual groups of the regex. I need a list of strings, i.e. the URLs in the page. Could someone point me to a good working example?
I'd like to do this using regexes and not BeautifulSoup, etc.
Thank you.
HTML is not a regular language, and thus cannot be parsed by regular expressions.
It's possible to make reasonable guesses using regular expressions, and/or to recognize a restricted subset of URIs, but that way lies madness (lengthy debugging processes, inaccurate results).
That said, if you're willing to go down that path, see John Gruber's regex for the purpose:
import re

def extract_urls(your_text):
    # John Gruber's liberal URL-matching pattern. Note: Python's re module has no
    # POSIX [:punct:] class, so that part is read as a literal set of characters.
    url_re = re.compile(r'\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))')
    for match in url_re.finditer(your_text):
        yield match.group(0)
This can be used as follows:
>>> for uri in extract_urls('http://foo.bar/baz irc://freenode.org/bash'):
...     print(uri)
http://foo.bar/
irc://freenode.org
I know you can use the DOM object in PHP to parse an HTML document. I'm not familiar with Python, but this might help: http://docs.python.org/library/xml.dom.html
How would I look for all URLs on a web page and then save them to individual variables with urllib2 in Python?
Parse the HTML with an HTML parser, find all the <a> tags (e.g. using Beautiful Soup's findAll() method), and check their href attributes.
If, however, you want to find all URLs in the page even if they aren't hyperlinks, then you can use a regular expression which could be anything from simple to ridiculously insane.
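A rough sketch of both cases, assuming Beautiful Soup 4 (bs4) is available; the fallback regex for plain-text URLs is deliberately simple:
import re
from bs4 import BeautifulSoup

def page_urls(html):
    soup = BeautifulSoup(html, 'html.parser')
    # URLs that appear as hyperlinks
    urls = [a['href'] for a in soup.find_all('a', href=True)]
    # URLs that appear in the visible text but are not hyperlinks
    urls += re.findall(r'https?://[^\s"<>]+', soup.get_text())
    return urls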
You don't do it with urllib2 alone. What you are looking for is parsing the URLs out of a web page.
You fetch the page using urllib2, read its contents, and then pass it through a parser like BeautifulSoup; or, as the other poster explained, you can use a regex to search the contents of the page.
You could simply download the raw HTML with urllib2 and then search through it. There might be easier ways, but you could do this:
1: Download the source code.
2: Use string methods to split it into a list of whitespace-separated pieces.
3: Check the first 7 characters of each piece.
4: If the first 7 characters are http://, write that piece to a variable.
Why do you need separate variables, though? Wouldn't it be easier to save them all to a list, using list.append(URL_YOU_JUST_FOUND), every time you find another URL?
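A literal sketch of those steps, collecting the hits in a list as suggested. It is crude (URLs buried inside attributes like href="..." won't start a whitespace-separated piece), and urllib2 is Python 2; on Python 3 use urllib.request instead:
import urllib2  # Python 2; on Python 3: from urllib.request import urlopen

# 1: download the source code
html = urllib2.urlopen('http://example.com/').read()

# 2: split it into a list of whitespace-separated pieces
pieces = html.split()

# 3 and 4: keep every piece whose first 7 characters are 'http://'
urls = []
for piece in pieces:
    if piece[:7] == 'http://':
        urls.append(piece)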
Of course an HTML page can be parsed using any number of Python parsers, but I'm surprised that there don't seem to be any public parsing scripts to extract meaningful content (excluding sidebars, navigation, etc.) from a given HTML doc.
I'm guessing it's something like collecting DIV and P elements and then checking them for a minimum amount of text content, but I'm sure a solid implementation would include plenty of things that I haven't thought of.
Try the Beautiful Soup library for Python. It has very simple methods to extract information from an HTML file.
Trying to generically extract data from web pages would require people to write their pages in a similar way... but there's an almost infinite number of ways to lay out a page that looks identical, let alone all the combinations you can use to convey the same information.
Was there a particular type of information you were trying to extract or some other end goal?
You could try extracting any content in 'div' and 'p' elements and comparing the relative sizes of all the information in the page. The problem then is that people probably group information into collections of 'div's and 'p's (or at least they do if they're writing well-formed HTML!).
Maybe if you formed a tree of how the information is related (nodes would be the 'p' or 'div' or whatever, and each node would contain the associated text), you could do some sort of analysis to identify the smallest 'p' or 'div' that encompasses what appears to be the majority of the information?
[EDIT] Maybe if you can get it into the tree structure I suggested, you could then use a points system similar to SpamAssassin's. Define some rules that attempt to classify the information. Some examples:
+1 points for every 100 words
+1 points for every child element that has > 100 words
-1 points if the section name contains the word 'nav'
-2 points if the section name contains the word 'advert'
If you have lots of low-scoring rules which add up when you find more relevant-looking sections, I think that could evolve into a fairly powerful and robust technique.
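A very rough sketch of that points idea, assuming the page has already been parsed with Beautiful Soup; the rules and weights are illustrative only:
from bs4 import BeautifulSoup

def score_section(tag):
    score = 0
    score += len(tag.get_text().split()) // 100        # +1 per 100 words
    for child in tag.find_all(['div', 'p']):
        if len(child.get_text().split()) > 100:         # long child elements
            score += 1
    name = ' '.join(tag.get('class', [])) + ' ' + (tag.get('id') or '')
    if 'nav' in name:
        score -= 1
    if 'advert' in name:
        score -= 2
    return score

def likely_main_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    sections = soup.find_all('div')
    return max(sections, key=score_section) if sections else None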
[EDIT2] Looking at readability, it seems to be doing pretty much exactly what I just suggested! Maybe it could be improved to try and understand tables better?
Have a look at templatemaker: http://www.holovaty.com/writing/templatemaker/
It's written by one of the founders of Django. Basically you feed it a few example html files and it will generate a "template" that you can then use to extract just the bits that are different (which is usually the meaningful content).
Here's an example from the Google Code page:
# Import the Template class.
>>> from templatemaker import Template
# Create a Template instance.
>>> t = Template()
# Learn a Sample String.
>>> t.learn('<b>this and that</b>')
# Output the template so far, using the "!" character to mark holes.
# We've only learned a single string, so the template has no holes.
>>> t.as_text('!')
'<b>this and that</b>'
# Learn another string. The True return value means the template gained
# at least one hole.
>>> t.learn('<b>alex and sue</b>')
True
# Sure enough, the template now has some holes.
>>> t.as_text('!')
'<b>! and !</b>'
You might use the boilerpipe Web application to fetch and extract content on the fly.
(This is not specific to Python, as you only need to issue an HTTP GET request to a page on Google App Engine.)
What is meaningful and what is not depends on the semantics of the page. If the semantics are poor, your code won't be able to "guess" what is meaningful. I use readability, which you linked in the comment, and I see that on many pages I try to read it does not provide any result, let alone a decent one.
If someone puts the content in a table, you're doomed. Try readability on a phpBB forum and you'll see what I mean.
If you want to do it, go with a regexp on <p></p>, or parse the DOM.
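For the regexp-on-<p> route, a minimal sketch (crude: it ignores nesting and HTML entities; the sample string is just for illustration):
import re

html = '<p class="intro">First paragraph.</p><div>navigation</div><p>Second one.</p>'

# Grab the raw inner text of every <p>...</p>, non-greedily, across newlines.
paragraphs = re.findall(r'<p[^>]*>(.*?)</p>', html, re.DOTALL | re.IGNORECASE)
print(paragraphs)   # ['First paragraph.', 'Second one.']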
Goose is just the library for this task. To quote their README:
Goose will try to extract the following information:
Main text of an article
Main image of article
Any Youtube/Vimeo movies embedded in article
Meta Description
Meta tags
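A short usage sketch, roughly following the python-goose README (the URL is a placeholder):
from goose import Goose

g = Goose()
article = g.extract(url='http://example.com/some-article.html')

print(article.title)               # main title
print(article.meta_description)    # meta description
print(article.cleaned_text[:200])  # start of the main article text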