I’m a newbie in Python. I’m learning regexes, but I need help here.
Here comes the HTML source:
http://www.ptop.se
I’m trying to code a tool that only prints out http://ptop.se. Can you help me please?
If you're only looking for one:
import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
    print(match.group(1))
If you have a long string, and want every instance of the pattern in it:
import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))
Where s is the string that you're looking for matches in.
Quick explanation of the regexp bits:
r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)
"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.
Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."
"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.
The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you in your stated goal of learning regexps, which I'd assume this specific html-parsing project is just a part of.
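To see the pattern explained above in action, here is a small sketch. The sample markup in `s` is made up for illustration; it covers a double-quoted, a single-quoted and an unquoted href.

```python
import re

# Hypothetical sample markup: double-quoted, single-quoted and unquoted hrefs.
s = '<a href="http://www.ptop.se">one</a> <a href=\'/two\'>two</a> <a href=/three>three</a>'

# Same pattern as above: optional opening quote, then capture everything
# up to the next quote, space or '>'.
HREF = re.compile(r'href=[\'"]?([^\'" >]+)')

urls = HREF.findall(s)
print(urls)  # ['http://www.ptop.se', '/two', '/three']
```

Because the pattern contains exactly one group, findall returns just the captured URLs rather than the whole match.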
It's pretty easy to do:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_to_parse, 'html.parser')
# href=True keeps only the <a> tags that actually have an href attribute
for tag in soup.find_all('a', href=True):
    print(tag['href'])
Once you've installed BeautifulSoup, anyway.
Don't use regexes, use BeautifulSoup. That, or be so crufty as to spawn it out to, say, w3m/lynx and pull back in what w3m/lynx renders. First is more elegant probably, second just worked a heck of a lot faster on some unoptimized code I wrote a while back.
This should work, although there might be more elegant ways.
import re
url = '<a href="http://www.ptop.se">'
# lookbehind/lookahead keep href=" and the closing quote out of the match
r = re.compile(r'(?<=href=").*?(?=")')
print(r.findall(url))
John Gruber (who wrote Markdown, which is made of regular expressions and is used right here on Stack Overflow) had a go at producing a regular expression that recognises URLs in text:
http://daringfireball.net/2009/11/liberal_regex_for_matching_urls
If you just want to grab the URL (i.e. you’re not really trying to parse the HTML), this might be more lightweight than an HTML parser.
Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
In particular you will want to look at the Python answers: BeautifulSoup, HTMLParser, and lxml.
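As a rough sketch of the parser route using only the standard library: in Python 3 the HTMLParser module lives at html.parser. The class name HrefCollector, the helper extract_hrefs and the sample markup are all made up for this example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Hypothetical helper: feed the markup to the parser and return the hrefs."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Hypothetical sample markup with mixed quoting and an extra attribute.
sample = '<p><a href="http://www.ptop.se">ptop</a> and <a class="x" href=\'/about\'>about</a></p>'
print(extract_hrefs(sample))  # ['http://www.ptop.se', '/about']
```

The parser deals with quoting styles and attribute order itself, which is exactly the part that is awkward to get right with a regex.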
This regex can help you; get the first group with \1 or whatever method your language provides.
href="([^"]*)
example:
<a href="http://www.amghezi.com">amgheziName</a>
result:
http://www.amghezi.com
There's tonnes of them on regexlib
Yes, there are tons of them on regexlib. That only proves that REs should not be used to do that. Use SGMLParser or BeautifulSoup or write a parser, but don't use REs. The ones that seem to work are extremely complicated and still don't cover all cases.
This works pretty well with using optional matches (prints after href=) and gets the link only. Tested on http://pythex.org/
(?:href=['"])([:/.A-z?<_&\s=>0-9;-]+)
Output:
Match 1. /wiki/Main_Page
Match 2. /wiki/Portal:Contents
Match 3. /wiki/Portal:Featured_content
Match 4. /wiki/Portal:Current_events
Match 5. /wiki/Special:Random
Match 6. //donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
You can use this.
<a[^>]+href=["'](.*?)["']
Related
I am trying to find a regular expression that extracts any valid URL (not only http[s]). Unfortunately, each one I have tried outputs weird things. The best results I achieved were with this regex:
\b((?:[a-z][\w\-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]|\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
But I can mark at least the following issues:
http://208.206.41.61/email/email_log.cfm?useremail=3Dtana.jones#enron.com&=refdoc=3D(01-128) is extracted as http://208.206.41.61/email/email_log.cfm?useremail=3Dtana.jones#enron.com&=
http://www.onlinefilefolder.com',AJAXTHRESHOLD should be extracted without AJAXTHRESHOLD
CSS / HTML styling is extracted, for example xmlns:x="urn:schemas-microsoft-com:xslt, ze:12px;color:#666, font-size:12px;color etc
How can I improve this regex to make sure only valid URLs are extracted? I am not only extracting it from the HTML, but also from a plain text. Therefore, using only beautifulsoup is impossible for my use case.
No regex is perfect, but this one might help you:
(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[A-Z0-9+&@#\/%=~_|$])
Flag to enable: insensitive, global, multiline (igm)
Source: http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/
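In Python, a pattern like this might be applied as sketched below. The "insensitive" flag maps to re.IGNORECASE and "global" corresponds to using findall; the pattern has no ^ or $ anchors, so the multiline flag is left out here. The character classes are written with @# as in the regexguru article, the first URL in the sample text comes from the question, and the second one is made up.

```python
import re

# Split over three lines for readability; adjacent raw strings are concatenated.
URL_RE = re.compile(
    r'(?:(?:https?|ftp|file)://|www\.|ftp\.)'
    r'(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*'
    r'(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])',
    re.IGNORECASE,
)

# First URL is from the question; the second is a hypothetical example.
text = "See http://www.onlinefilefolder.com',AJAXTHRESHOLD and www.example.com/page(1)."

found = URL_RE.findall(text)
print(found)  # ['http://www.onlinefilefolder.com', 'www.example.com/page(1)']
```

The last alternative requires the URL to end in a balanced parenthesised part or a non-punctuation character, which is why the trailing quote, comma and full stop are dropped.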
I'm trying to create a re in python that will match this pattern in order to parse MediaWiki Markup:
<ref>*Any_Character_Could_Be_Here</ref>
But I'm totally lost when it comes to regex. Can someone help me, or point me to a tutorial or resource that might be of some help? Thanks!
Assuming that svick is correct that MediaWiki Markup is not valid xml (or html), then you could use re in this circumstance (although I will certainly defer to better solutions):
>>> import re
>>> test_string = '''<ref>*Any_Character_Could_Be_Here</ref>
<ref>other characters could be here</ref>'''
>>> re.findall(r'<ref>.*?</ref>', test_string)
['<ref>*Any_Character_Could_Be_Here</ref>', '<ref>other characters could be here</ref>'] # a list of matching strings
In any case, you will want to familiarize yourself with the re module (whether or not you use a regex to solve this particular problem).
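If only the text between the tags is wanted, a capturing group can be added, and re.DOTALL lets the pattern cross line breaks. This is a sketch on made-up wikitext, not a general MediaWiki parser.

```python
import re

# Hypothetical wikitext; the second reference spans two lines.
wikitext = "Fact one.<ref>Source A</ref> Fact two.<ref>Source B,\npage 12</ref>"

# (.*?) captures lazily, so each match stops at the nearest </ref>;
# DOTALL makes '.' match newlines as well.
REF_RE = re.compile(r'<ref>(.*?)</ref>', re.DOTALL)

refs = REF_RE.findall(wikitext)
print(refs)  # ['Source A', 'Source B,\npage 12']
```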
srhoades28, this will match your pattern.
if re.search(r"<ref>\*[^<]*</ref>", subject):
    print("Successful match")
else:
    print("Match attempt failed")
Note that from your post, it is assumed that the * after <ref> always occurs, and that the only variable part is the text that follows it, in your example "Any_Character_Could_Be_Here".
If this is not the case let me know and I will tweak the expression.
I have a string
<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />
What is the regex to find ABCDXYZ in Python?
Don't use regex to parse HTML. Use BeautifulSoup.
from bs4 import BeautifulSoup as BS
text = '''<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />'''
soup = BS(text, 'html.parser')  # name the parser explicitly
print(soup.find('img').attrs['alt'])
If you're looking for the value of that alt attribute, you can do this:
>>> r = r'alt="(.*?)"'
Then:
>>> m = re.search(r, mystring)
>>> m.group(1)
'ABCDXYZ'
And you can use re.findall if you want to find more than one.
However, this code will be easily fooled by something like this:
<span>Here's some text explaining how to do alt="foo" in an img tag.</span>
On the other hand, it'll also fail to pick up something like this:
<img src='/p_img/411/123411/639469aa9f_123411_100.jpg' alt='ABCDXYZ' />
How do you deal with that? The short answer is: You don't. XML and HTML are not regular languages.
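Both failure modes described above are easy to reproduce. The sketch below runs the same pattern against the two snippets quoted earlier; the variable names are made up.

```python
import re

ALT_RE = re.compile(r'alt="(.*?)"')

# Prose that merely mentions alt="foo"; there is no img tag here at all.
prose = '<span>Here\'s some text explaining how to do alt="foo" in an img tag.</span>'
# A real img tag, but with single-quoted attributes.
single_quoted = "<img src='/p_img/411/123411/639469aa9f_123411_100.jpg' alt='ABCDXYZ' />"

false_positive = ALT_RE.findall(prose)
missed = ALT_RE.findall(single_quoted)

print(false_positive)  # ['foo']
print(missed)          # []
```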
It's worth backing up here to point out that Python's re engine is not actually a true regular expression engine—and, on top of that, it's embedded in a Turing-complete programming language. So obviously it is possible to build an HTML parser around Python and re. This answer shows part of a parser written in perl, where regexes do most of the heavy lifting. But that doesn't mean you should do it this way. You shouldn't be writing a parser in the first place, given that perfectly good ones already exist, and if you did, you shouldn't be forcing yourself to use regexes even when there's an easier way to do what you want. For quick&dirty playing around, regex is fine. For a production program, it's almost always the wrong answer.
One way to convince your boss to let you use a parser is by crafting a suite of tests that are all obviously valid, and that cannot possibly be handled by any regex-based solution short of a full parser. If you can come up with a test that can be parsed, but only using exponential backtracking, and therefore takes 12 hours with regex vs. 0.1 seconds with bs4, even better, but that's a bit trickier…
Of course it's also worth looking for articles online (and SO questions like this and this and the 300 other dups) and picking the best ones to show your boss.
If you really can't convince your boss otherwise, then you're done at this point. Given what's been specified, this works. Given what may or may not actually be intended, nothing short of mind-reading will work. As you find more and more real-life cases that fail, you can hack it up by adding more and more complex alternations and/or context onto the regex itself, or possibly use a series of regexes and post-filters, until finally you get sick of it and find yourself a better job.
First, a disclaimer: You shouldn't be using regular expressions to parse HTML. You can use BeautifulSoup for this.
Next, if you are actually serious about using regular expressions and the above is the exact case you want, then you could do something like:
<a href="[a-zA-Z0-9/]+"><img src="[a-zA-Z0-9/_.]+" alt="([a-zA-Z0-9/]+)" />
and you could access the text via group(1) on the match object. Note that the src character class has to include _ and . for the example path to match.
Could someone tell me what's a better way to clean up bad HTML so BeautifulSoup can handle it? Should one use the massage methods of BeautifulSoup or clean it up using regular expressions?
Thought I should reword my answer.
The built-in massages are good for light damage (extra whitespace, no closing slashes, etc). I would certainly try and get away with these before getting any more involved.
You can pass in your own massages and I would suggest you extend the default set:
import copy, re
myMassage = [(re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))]
myNewMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
myNewMassage.extend(myMassage)
BeautifulSoup(badString, markupMassage=myNewMassage)
# Foo<!--This comment is malformed.-->Bar<br />Baz
You're probably better off doing it this way as it all goes into one parsing pot, gaining BeautifulSoup's optimisations... Although the runtime performance is probably pretty similar.
From the documentation, massage methods are just pairs of (regular expression, replacement function) so I don't think it's really a case of use massaging or regexps.
e.g. to tidy up malformed comments:
(re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))
If you look at the source of the _feed method in BeautifulSoup.py you will see that these are just run in sequence against the markup:
for fix, m in self.markupMassage:
    markup = fix.sub(m, markup)
So whilst you could do some regexp processing of your own before BeautifulSoup gets to see the markup you are probably better combining any additional tidying needed with the default builtin MARKUP_MASSAGE as shown in Oli's answer.
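To make the mechanism concrete, here is a sketch of the same loop using nothing but re. The function apply_massages and the second rule in the list are made up for illustration; only the comment-repair rule comes from the text above.

```python
import re

# Hypothetical massage list: (compiled pattern, replacement function) pairs.
massages = [
    # Repair comments opened with "<!-" instead of "<!--".
    (re.compile(r'<!-([^-])'), lambda match: '<!--' + match.group(1)),
    # Made-up extra rule: put a space before the slash in <br/>.
    (re.compile(r'<br/>'), lambda match: '<br />'),
]


def apply_massages(markup, massages):
    """Run each (pattern, function) pair over the markup in sequence."""
    for fix, m in massages:
        markup = fix.sub(m, markup)
    return markup


bad = 'Foo<!-This comment is malformed.-->Bar<br/>Baz'
cleaned = apply_massages(bad, massages)
print(cleaned)  # Foo<!--This comment is malformed.-->Bar<br />Baz
```

Since each rule sees the output of the previous one, the order of the list matters once rules start to overlap.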
So, here's my question:
I have a crawler that goes and downloads web pages and strips those of URLs (for future crawling). My crawler operates from a whitelist of URLs which are specified in regular expressions, so they're along the lines of:
(http://www.example.com/subdirectory/)(.*?)
...which would allow URLs that followed the pattern to be crawled in the future. The problem I'm having is that I'd like to exclude certain characters in URLs, so that (for example) addresses such as:
(http://www.example.com/subdirectory/)(somepage?param=1&param=5#print)
...in the case above, as an example, I'd like to be able to exclude URLs that feature ?, #, and = (to avoid crawling those pages). I've tried quite a few different approaches, but I can't seem to get it right:
(http://www.example.com/)([^=\?#](.*?))
etc. Any help would be really appreciated!
EDIT: sorry, should've mentioned this is written in Python, and I'm normally fairly proficient at regex (although this has me stumped)
EDIT 2: VoDurden's answer (the accepted one below) almost yields the correct result, all it needs is the $ character at the end of the expression and it works perfectly - example:
(http://www.example.com/)([^=\?#]*)$
(http://www.example.com/)([^=?#]*?)
Should do it, this will allow any URL that does not contain the characters you don't want.
It might, however, be a little hard to extend this approach. A better option is to make the system two-tiered, i.e. one set of matching regexes and one set of blocking regexes. Then only URLs that pass both will be allowed. I think this solution will be a bit more transparent and flexible.
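A minimal sketch of the two-tier idea might look like this. The names ALLOW, BLOCK and is_crawlable are made up, and the patterns are just the example from the question.

```python
import re

# Tier 1: a URL must match at least one whitelist pattern (anchored at the start).
ALLOW = [re.compile(r'http://www\.example\.com/subdirectory/.*')]
# Tier 2: a URL must not contain anything matched by a blocking pattern.
BLOCK = [re.compile(r'[?#=]')]


def is_crawlable(url):
    """Hypothetical check: allowed by some ALLOW pattern and hit by no BLOCK pattern."""
    allowed = any(p.match(url) for p in ALLOW)
    blocked = any(p.search(url) for p in BLOCK)
    return allowed and not blocked


print(is_crawlable('http://www.example.com/subdirectory/index.php'))          # True
print(is_crawlable('http://www.example.com/subdirectory/index.php?param=1'))  # False
print(is_crawlable('http://www.other.com/subdirectory/'))                     # False
```

New exclusions then only touch the BLOCK list, and the whitelist patterns stay as simple prefixes.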
This expression should be what you're looking for:
(http://www.example.com/subdirectory/)([^=?#]*)$
[^=\?#] Will match anything except for the characters you specified.
For Example:
http://www.example.com/subdirectory/ Match
http://www.example.com/subdirectory/index.php Match
http://www.example.com/subdirectory/somepage?param=1&param=5#print No Match
http://www.example.com/subdirectory/index.php?param=1 No Match
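The table above can be checked with a short script. This sketch escapes the dots in the host name, which the expression above leaves unescaped; the variable names are made up.

```python
import re

# Prefix group, then any run of characters other than =, ? and #, up to the end.
PATTERN = re.compile(r'(http://www\.example\.com/subdirectory/)([^=?#]*)$')

urls = [
    'http://www.example.com/subdirectory/',
    'http://www.example.com/subdirectory/index.php',
    'http://www.example.com/subdirectory/somepage?param=1&param=5#print',
    'http://www.example.com/subdirectory/index.php?param=1',
]

# match() anchors at the start of the string; the trailing $ anchors the end.
results = {url: bool(PATTERN.match(url)) for url in urls}

for url, ok in results.items():
    print(url, 'Match' if ok else 'No Match')
```

Without the trailing $, the second group would simply stop before the first ?, = or # and the URL would still count as a match.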
You will need to crawl the pages up to ?param=1&param=5,
because param=1 and param=2 can normally give you completely different web pages.
Pick any WordPress website to confirm that.
Try this one. It will match only the part before the # character:
(http://www.example.com/)([^#]*?)
I'm not sure what you want. If you want to match anything that doesn't contain ?, #, or =, then the regex is
([^=?#]*)
As an alternative there's always the urlparse module (urllib.parse in Python 3), which is designed for parsing URLs.
from urllib.parse import urlparse
urls = [
    'http://www.example.com/subdirectory/',
    'http://www.example.com/subdirectory/index.php',
    'http://www.example.com/subdirectory/somepage?param=1&param=5#print',
    'http://www.example.com/subdirectory/index.php?param=1',
]
for url in urls:
    # keep only the URLs whose query string is empty
    if not urlparse(url).query:
        print(url)
Provides the following:
http://www.example.com/subdirectory/
http://www.example.com/subdirectory/index.php