Regex matching specific HTML string with Python

The pattern is as follows:
page_pattern = 'manual-data-link" href="(.*?)"'
The matching function is as follows, where pattern is one of the predefined patterns like the page_pattern above:
def get_pattern(pattern, string, group_num=1):
    escaped_pattern = re.escape(pattern)
    match = re.match(re.compile(escaped_pattern), string)
    if match:
        return match.group(group_num)
    else:
        return None
The problem is that match is always None, even though I made sure it works correctly with http://pythex.org/. I suspect I'm not compiling/escaping the pattern correctly.
Test string
<a class="rarity-5 set-102 manual-data-link" href="/data/123421" data-id="20886" data-type-id="295636317" >Data</a>

You have three problems.
1) You shouldn't call re.escape in this case. re.escape prevents special characters (like ., *, or ?) from having their special meanings. You want them to have special meanings here.
2) You should use re.search, not re.match. re.match matches only at the beginning of the string; you want to find a match anywhere inside the string.
3) You shouldn't parse HTML with regular expressions. Use a tool designed for the job, like BeautifulSoup.
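If you still want a helper like the one in the question, a minimal sketch applying the first two fixes might look like this:
import re

def get_pattern(pattern, string, group_num=1):
    # Don't escape the pattern: its metacharacters are meant to be special.
    # Use search so the pattern can match anywhere in the string.
    match = re.search(pattern, string)
    if match:
        return match.group(group_num)
    return None
With the test string above, get_pattern(page_pattern, test_string) then returns '/data/123421'.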

re.match tries to match from the beginning of the string. Since the text you're trying to match is in the middle of the string, you need to use re.search instead of re.match.
>>> import re
>>> s = '<a class="rarity-5 set-102 manual-data-link" href="/data/123421" data-id="20886" data-type-id="295636317" >Data</a>'
>>> re.search(r'manual-data-link" href="(.*?)"', s).group(1)
'/data/123421'
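For comparison, re.match returns None on the same string, because the pattern does not occur at the very start:
>>> print(re.match(r'manual-data-link" href="(.*?)"', s))
None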
Use an HTML parser like BeautifulSoup to parse HTML files.
>>> from bs4 import BeautifulSoup
>>> s = '<a class="rarity-5 set-102 manual-data-link" href="/data/123421" data-id="20886" data-type-id="295636317" >Data</a>'
>>> soup = BeautifulSoup(s)
>>> for i in soup.find_all('a', class_=re.compile('.*manual-data-link')):
...     print(i['href'])
...
/data/123421

Related

Regex to extract part of the url

I'm trying to extract part of a URL using regex, ideally in one line that works for both URL types.
I'm trying the following, but I'm not sure how to handle the second URL. I am trying to extract the 4FHP from both.
>>> import re
>>>
>>> a="/url_redirect/4FHP"
>>> b="/url/4FHP/asdfasdfas/"
>>>
>>> re.search('^\/(url_redirect|url)\/(.*)', a).group(2)
'4FHP'
>>> re.search('^\/(url_redirect|url)\/(.*)', b).group(2)
'4FHP/asdfasdfas/'
The following code will extract 4FHP from either string. Notice that I changed .* (match a sequence of any non-newline characters) to [^/]* (match a sequence of any non-/ characters).
re.search('^\/(url_redirect|url)\/([^/]*)', b).group(2)
Your problem is that the * operator is 'greedy', so it grabs everything up to the end of the string, which is why you get '4FHP/asdfasdfas/' in your second example.
You need to stop matching when you see another /; the easiest way is to use a character class that specifically excludes it, e.g. [^/].
You can also use a non-capturing group, (?: <regex> ), so that only the group you're interested in is returned:
re.search('^\/(?:url_redirect|url)\/([^/]*)', b).group(1)
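Applied to both strings from the question, each call returns '4FHP':
>>> re.search(r'^\/(?:url_redirect|url)\/([^/]*)', a).group(1)
'4FHP'
>>> re.search(r'^\/(?:url_redirect|url)\/([^/]*)', b).group(1)
'4FHP'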

How can Python's regular expressions work with patterns that have escaped special characters?

Is there a way to get Python's regular expressions to work with patterns that have escaped special characters? As far as my limited understanding can tell, the following example should work, but the pattern fails to match.
import re
string = r'This a string with ^g\.$s' # A string to search
pattern = r'^g\.$s' # The pattern to use
string = re.escape(string) # Escape special characters
pattern = re.escape(pattern)
print(re.search(pattern, string)) # This prints "None"
Note:
Yes, this question has been asked elsewhere (like here). But as you can see, I'm already implementing the solution described in the answers and it's still not working.
Why on earth are you applying re.escape to the string?! You want to find the "special" characters in that! If you just apply it to the pattern, you'll get a match:
>>> import re
>>> string = r'This a string with ^g\.$s'
>>> pattern = r'^g\.$s'
>>> re.search(re.escape(pattern), re.escape(string)) # nope
>>> re.search(re.escape(pattern), string) # yep
<_sre.SRE_Match object at 0x025089F8>
For bonus points, notice that you just need to re.escape the pattern one more time than the string:
>>> re.search(re.escape(re.escape(pattern)), re.escape(string))
<_sre.SRE_Match object at 0x025D8DE8>

Regex returning extra, unwanted values upon searching for file names in URLS

So, if I have a string "http://www.images.com/place/folder/file_name.gif"
I want a regex that returns:
"file_name.gif"
So far I have this (in python):
re.findall(r'([\w]+\.*?(gif|jpeg|jpg|png))',f)
but it returns
( "file_name.gif" , "gif" )
What am I doing wrong?
In your expression, you have two capture groups. Keep in mind that a set of () is a capture group. You want to combine the extension and the filename in one capture group, so that they are both returned. Try this one:
>>> exp = r'(\w+\.\w+)$'
>>> url = 'http://www.foo.com/hello.html'
>>> re.findall(exp, url)
['hello.html']
This expression is one or more word characters, followed by a ., then one or more word characters.
You can further enhance this by adding your specific extensions in place of the second \w. As long as you keep it in one set of (), you'll get the entire result of the expression as one match.
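For instance, one way to keep a single set of capturing () is to make the extension alternation non-capturing:
>>> exp = r'(\w+\.(?:gif|jpeg|jpg|png))$'
>>> re.findall(exp, 'http://www.images.com/place/folder/file_name.gif')
['file_name.gif']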
There is a basic flaw in that a valid URL like http://www.example.com/this-link.gif won't be captured correctly:
>>> url = 'http://www.example.com/this-link.gif'
>>> re.findall(exp, url)
['link.gif']
That's because \w does not include -, which is valid in a file name. You can mitigate this by adding it to the character class:
>>> exp = r'([\w-]+\.\w+)$'
>>> re.findall(exp, url)
['this-link.gif']
This is rather inelegant in that it doesn't match URLs that have a fragment or a query string.
It will also be easily fooled if your URL doesn't end in a file name:
>>> url = 'http://www.example.com/this-is-a-valid-url'
>>> re.findall(exp, url)
[]
That's because the expression specifically looks for a . in the last path segment. But it will also be tripped up by this:
>>> url = 'http://www.example.com/this.is.a.url.gif'
>>> re.findall(exp, url)
['url.gif']
You could take that and build on it, but as it's difficult to predict the many combinations of possible URL endings beyond the very basic, it is recommended to use the existing tools:
>>> import os
>>> import urlparse
>>> os.path.basename(urlparse.urlsplit(url).path)
'this.is.a.url.gif'
In Python 3, use urllib.parse.
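For example:
>>> from urllib.parse import urlsplit
>>> from os.path import basename
>>> basename(urlsplit('http://www.example.com/this.is.a.url.gif').path)
'this.is.a.url.gif'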

Python match regex always returning None

I have a Python regex whose match method always returns None. I tested it on the pythex site and the pattern seems OK.
Pythex example
But when I try it with the re module, the result is always None:
import re
a = re.match(re.compile("\.aspx\?.*cp="), 'page.aspx?cpm=549&cp=168')
What am I doing wrong?
re.match() only matches at the start of a string. Use re.search() instead:
re.search(r"\.aspx\?.*cp=", 'page.aspx?cpm=549&cp=168')
Demo:
>>> import re
>>> re.search(r"\.aspx\?.*cp=", 'page.aspx?cpm=549&cp=168')
<_sre.SRE_Match object at 0x105d7e440>
>>> re.search(r"\.aspx\?.*cp=", 'page.aspx?cpm=549&cp=168').group(0)
'.aspx?cpm=549&cp='
Note that any re function that takes a pattern also accepts a plain string and will call re.compile() for you (compilation results are cached). You only need re.compile() if you want to store the compiled expression for re-use, at which point you can call pattern.search() on it:
pattern = re.compile(r"\.aspx\?.*cp=")
pattern.search('page.aspx?cpm=549&cp=168')

regular expressions in python, matching words outside of html tags

I am trying to match a phrase using regular expressions, so long as none of the words in that phrase appear within an html tag.
For this example, I am using the following url:
url = "http://www.sidley.com/people/results.aspx?lastname=B"
The regexp that I am using is:
regexp = "Babb(?!([^<]+)?>).+?Jonathan(?!([^<]+)?>).+?C(?!([^<]+)?>)"
page = urllib2.urlopen(url).read()
re.findall(regexp, page, re.DOTALL)
With that regexp, I get the following output:
[('', '', '')]
When I change the regexp to (note the outer parens):
regexp = "(Babb(?!([^<]+)?>).+?Jonathan(?!([^<]+)?>).+?C(?!([^<]+)?>))"
page = urllib2.urlopen(url).read()
re.findall(regexp, page, re.DOTALL)
I get:
[('Babb, Jonathan C', '', '', '')]
I am confused as to why this is.
1) Why am I getting these empty strings as matches?
2) Why for the first regexp, do I not get the actual match?
and finally,
How do I fix this?
Thanks in advance for your help.
The reason you are getting empty strings is that your pattern contains capture groups: when a pattern has groups, re.findall returns a tuple of the groups for each match, and the groups inside your negative lookaheads never take part in the match, so they come back as empty strings. That is also why the first regexp doesn't give you the matched text at all, while wrapping the whole pattern in an outer group does. If you don't want that information, remove some of your parentheses, or use non-capturing parentheses (?:...) for the extraneous pairs.
The final code that I would use (for the whole process) would be:
import re
import urllib2
url = 'http://www.sidley.com/people/results.aspx?lastname=B'
regexp = 'Babb(?!<+?>).+?Jonathan(?!<+?>).+?C(?!<+?>)'
page = urllib2.urlopen(url).read()
re.findall(regexp, page, re.DOTALL)
A breakdown of the regexp:
We select for the first word. Babb
We don't want to match any HTML tags, so we use a negative lookahead, (?! ... ).
Within this, we place a regexp that matches an HTML tag: <+?> (not quite sure why it is this particular expression that works, rather than .+?>).
We select for at least one more character, non-greedily. .+?
We repeat this process for each of the other words (Jonathan and C).
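As a quick sanity check on a small, made-up snippet (the page at the original URL may no longer serve the same markup), the pattern returns the full phrase as a single match:
>>> import re
>>> s = '<td class="name">Babb, Jonathan C</td>'
>>> re.findall('Babb(?!<+?>).+?Jonathan(?!<+?>).+?C(?!<+?>)', s, re.DOTALL)
['Babb, Jonathan C']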
