python and regex - python

#!/usr/bin/python
import re
str = raw_input("String containing email...\t")
match = re.search(r'[\w.-]+#[\w.-]+', str)
if match:
    print match.group()
It's not the most complicated code, and I'm looking for a way to get ALL of the matches, if that's possible.

It sounds like you want re.findall():
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
As far as the actual regular expression for identifying email addresses goes... See this question.
Also, be careful using str as a variable name. This will hide the str built-in.
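A minimal sketch of the difference, reusing the pattern from the question (the sample text is made up, and `text` is used instead of `str` so the built-in isn't shadowed). It is written with print as a function so it also runs under Python 3:

```python
import re

# Made-up sample input; '#' separates user and domain, as in the question's pattern.
text = "first: alice#example.com, second: bob#example.org"
pattern = r'[\w.-]+#[\w.-]+'

first = re.search(pattern, text)         # only the leftmost match (or None)
all_matches = re.findall(pattern, text)  # every non-overlapping match, as a list

print(first.group())
print(all_matches)
```

Since the pattern has no groups, findall returns the full text of each match.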

I guess that re.findall is what you're looking for.

You should try findall():
findall() matches all occurrences of a
pattern, not just the first one as
search() does. For example, if one was
a writer and wanted to find all of the
adverbs in some text, he or she might
use findall()
http://docs.python.org/library/re.html#finding-all-adverbs
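For reference, the adverb example from that section of the docs looks roughly like this (reproduced from memory, so treat the exact sentence as an assumption):

```python
import re

text = "He was carefully disguised but captured quickly by police."

# \w+ly\b: a run of word characters ending in "ly" at a word boundary
adverbs = re.findall(r"\w+ly\b", text)
print(adverbs)  # ['carefully', 'quickly']
```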

raw_input isn't meant to be used the way you used it. Use it only to read input from the console.
Don't override built-ins such as str. Use a meaningful name and assign the whole string to it.
Also, it is often a good idea to compile your pattern into a regex object and match the string against that (illustrated in the code).
I just realized that a complete regex to match an email address exactly as per RFC 822 could fill a page; short of that, this snippet should be useful.
import re
inputstr = "something#exmaple.com, 121#airtelnet.com, ra#g.net, etc etc\t"
mailsrch = re.compile(r'[\w\-][\w\-\.]+#[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
matches = mailsrch.findall(inputstr)
print matches

Related

Can re.findall() return only the part of the regex in parens?

Looping through some data, I want to capture string of numbers that appear as page IDs (with more than one per line.) However, I only want to match number strings as part of a particular URL, but I DON'T want to record the URL, just the number.
URLs are relative, with digits strings of variable length, of the form
/view/123456.htm
Data to be returned here would be '123456'
I am currently using re.findall to identify the right URLs, and then re.sub to extract the number strings.
views = re.findall(r"/view/\d*?.htm", line)
for view in views:
    view = re.sub(r"/view/(\d+).htm", r"\1", view)
    pagelist.append(view)
Is there a way to do something like
views = re.findall(r"/view/(\d*?).htm", r"\1", line) #I know this doesn't work
where the original findall() only returns the part of the match in parens?
Can re.findall() return only the part of the regex in parens?
It not only can, it does:
>>> import re
>>> re.findall(r"/view/(\d*?).htm", "/view/123.htm /view/456.htm")
['123', '456']
Did you not try it? The documentation describes it as well.
You could use a lookbehind and a lookahead assertion to make findall only return the numbers. For example:
>>> re.findall(r"(?<=/view/)\d*?(?=\.htm)", "/view/123.htm /view/456.htm")
['123', '456']
These kinds of assertions can be used to define what should come before and after a match, without including it in the actual match.
Update: Please check Stefan Pochmann's answer. If you are using only a single capturing group, findall() will behave exactly as you requested.
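Applied to the loop in the question, the single-group approach might look like the sketch below. The `lines` list is hypothetical stand-in data. The pattern is also tightened slightly compared to the question's: the dot is escaped so it only matches a literal '.', and \d+ requires at least one digit.

```python
import re

# Hypothetical input, standing in for the question's data.
lines = [
    '<a href="/view/123456.htm">one</a> <a href="/view/42.htm">two</a>',
    'no page links here',
    '<a href="/view/7.htm">three</a>',
]

# One capturing group, so findall returns only the digits.
view_id = re.compile(r"/view/(\d+)\.htm")

pagelist = []
for line in lines:
    # extend() adds every ID found on the line (possibly none).
    pagelist.extend(view_id.findall(line))

print(pagelist)  # ['123456', '42', '7']
```

This removes the need for the separate re.sub pass.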

Extract string using regex in Python

I'm struggling a bit on how to extract (i.e. assign to variable) a string based on a regex. I have the regex worked out -- I tested on regexpal. But I'm lost on how I actually implement that in Python. My regex string is:
http://jenkins.mycompany.com/job/[^\s]+
What I want to do is take string and if there's a pattern in there that matches the regex, put that entire "pattern" into a variable. So for example, given the following string:
There is a problem with http://jenkins.mycompany.com/job/app/4567. We should fix this.
I want to extract http://jenkins.mycompany.com/job/app/4567 and assign it to a variable. I know I'm supposed to use re but I'm not sure if I want re.match or re.search and how to get what I want. Any help or pointers would be greatly appreciated.
import re
p = re.compile('http://jenkins.mycompany.com/job/[^\s]+')
line = 'There is a problem with http://jenkins.mycompany.com/job/app/4567. We should fix this.'
result = p.search(line)
print result.group(0)
Output:
http://jenkins.mycompany.com/job/app/4567.
Alternatively, use the re.findall method and take the first match it returns:
re.findall(pattern_string, input_string)[0] # pick the first match that is found
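One thing to watch with both approaches is the no-match case: search() returns None, and indexing an empty findall() result raises IndexError. A small sketch follows. The helper name `first_job_url` is made up for illustration, and the pattern is a slightly stricter variant of the one in the question, with escaped dots and \S in place of [^\s]:

```python
import re

job_url = re.compile(r'http://jenkins\.mycompany\.com/job/\S+')

def first_job_url(text):
    """Return the first matching URL in text, or None if there is none."""
    match = job_url.search(text)
    return match.group(0) if match else None

found = first_job_url('There is a problem with http://jenkins.mycompany.com/job/app/4567. We should fix this.')
missing = first_job_url('Nothing to see here.')

print(found)    # http://jenkins.mycompany.com/job/app/4567.
print(missing)  # None

# findall(...)[0] fails on the same input instead of returning None.
try:
    job_url.findall('Nothing to see here.')[0]
    findall_failed = False
except IndexError:
    findall_failed = True
```

Note that the trailing period of the sentence is still part of the match, just as in the output shown above, because \S+ accepts any non-whitespace character.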

Backreferencing in Python: findall() method output for HTML string

I am trying to learn some regular expressions in Python. The following does not produce the output I expected:
with open('ex06-11.html') as f:
    a = re.findall("<div[^>]*id\\s*=\\s*([\"\'])header\\1[^>]*>(.*?)</div>", f.read())
# output: [('"', 'Some random text')]
The output I was expecting (same code, but without the backreference):
with open('ex06-11.html') as f:
    print re.findall("<div[^>]*id\\s*=\\s*[\"\']header[\"\'][^>]*>(.*?)</div>", f.read())
# output: ['Some random text']
The question really boils down to: why is there a quotation mark in my first output, but not in my second? I thought that ([abc]) ... \1 == [abc] ... [abc]. Am I incorrect?
From the docs on re.findall:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
If you want the entire match to be returned, remove the capturing groups or change them to non-capturing groups by adding ?: after the opening paren. For example you would change (foo) in your regex to (?:foo).
Of course in this case you need the capturing group for the backreference, so your best bet is to keep your current regex and then use a list comprehension with re.finditer() to get a list of only the second group:
regex = re.compile(r"""<div[^>]*id\s*=\s*(["'])header\1[^>]*>(.*?)</div>""")
with open('ex06-11.html') as f:
    a = [m.group(2) for m in regex.finditer(f.read())]
A couple of side notes: you should really consider using an HTML parser like BeautifulSoup instead of regex. You should also use triple-quoted strings if you need to include single or double quotes within your string, and use raw string literals when writing regular expressions so that you don't need to escape the backslashes.
The behaviour is clearly documented. See re.findall:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found.
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
So, if you have capture groups in your regex pattern, findall returns the captured groups for each match rather than the whole match (group(0)): a list of strings for a single group, or a list of tuples when there is more than one group.
So, either use a non-capturing group, (?:[\"\']), or don't use any group at all, as in your second case.
P.S.: Use raw string literals for your regex pattern, to avoid escaping your backslashes. Also, compile your regex outside the loop, so that it is not re-compiled on every iteration. Use re.compile for that.
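To make the documented behaviour concrete, here is a small sketch showing how the shape of findall's result changes with the number of capturing groups. The sample string is a one-line stand-in for the file contents:

```python
import re

s = "<div id='header'>Some random text</div>"

# No groups: each item is the whole match.
no_groups = re.findall(r"<div[^>]*>.*?</div>", s)

# One group: each item is that group's text.
one_group = re.findall(r"<div[^>]*>(.*?)</div>", s)

# Two groups (the first is needed for the \1 backreference): each item is a tuple.
two_groups = re.findall(r"""<div[^>]*id\s*=\s*(["'])header\1[^>]*>(.*?)</div>""", s)

# Non-capturing (?:...) groups don't show up in the result.
non_capturing = re.findall(r"""<div[^>]*id\s*=\s*(?:["'])header(?:["'])[^>]*>(.*?)</div>""", s)

print(no_groups)      # ["<div id='header'>Some random text</div>"]
print(one_group)      # ['Some random text']
print(two_groups)     # [("'", 'Some random text')]
print(non_capturing)  # ['Some random text']
```

The non-capturing variant gives up the backreference, so it no longer checks that the opening and closing quotes are the same character.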
When I asked this question I was just starting with regular expressions. I have since read the docs completely, and I just wanted to share what I found out.
Firstly, what Rohit and F.J suggested, use raw strings (to make the regex more readable and less error-prone) and compile your regex beforehand using re.compile. To match an HTML string whose id is 'header':
s = "<div id='header'>Some random text</div>"
We would need a regex like:
p = re.compile(r'<div[^>]*id\s*=\s*([\"\'])header\1[^>]*>(.*?)</div>')
In the Python implementation of regex, a capturing group is made by enclosing part of your regex in parentheses (...). Capturing groups capture the span of text that they match. They are also needed for backreferencing. So in my regex above, I have two capturing groups: ([\"\']) and (.*?). The first one is needed to make the backreference \1 possible. The use of backreferences (and the fact that they reference back to a capturing group) has consequences, however. As pointed out in the other answers to this question, when using findall on my pattern p, findall will return matches from all groups and put them in a list of tuples:
print p.findall(s)
# [("'", 'Some random text')]
Since we only want the plain text from between the HTML tags, this is not the output we're looking for.
(Arguably, we could use:
print p.findall(s)[0][1]
# Some random text
But this may be a bit contrived.)
So in order to return only the text from between the HTML tags (captured by the second group), we use the group() method on p.search():
print p.search(s).group(2)
# Some random text
I'm fully aware that all but the most simple HTML should not be handled by regex, and that instead you should use a parser. But this was just a tutorial example for me to grasp the basics of regex in Python.

Python regex - difference between search and find all

I am trying to use python regex on a URL string.
id= 'edu.vt.lib.scholar:http/ejournals/VALib/v48_n4/newsome.html'
>>> re.search('news|ejournals|theses',id).group()
'ejournals'
>>> re.findall('news|ejournals|theses',id)
['ejournals', 'news']
Based on the docs at http://docs.python.org/2/library/re.html#finding-all-adverbs, search() matches the first one and findall() matches all the possible ones in the string.
I am wondering why 'news' is not captured with search even though it is declared first in the pattern.
Did I use the wrong pattern? I want to check whether any of those keywords occur in the string.
You're thinking about it backwards. The regex goes through the target string looking for "news" OR "ejournals" OR "theses" and returns the first one it finds. In this case "ejournals" appears first in the target string.
The re.search() function stops after the first occurrence that satisfies your condition, not the first option in the pattern.
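A short sketch of that point, written for Python 3 and using `url_id` rather than `id` to avoid shadowing the built-in. Reordering the alternatives does not change which keyword search() finds. The order only comes into play when two alternatives can match at the same position, in which case the one listed first wins:

```python
import re

url_id = 'edu.vt.lib.scholar:http/ejournals/VALib/v48_n4/newsome.html'

# The leftmost match in the target string wins, whatever the order of alternatives.
first = re.search('news|ejournals|theses', url_id)
reordered = re.search('theses|ejournals|news', url_id)
print(first.group(), reordered.group())  # ejournals ejournals

# Every match with its position, in the order found.
positions = [(m.group(), m.start()) for m in re.finditer('news|ejournals|theses', url_id)]
print(positions)

# Order matters only among alternatives that can match at the same position.
same_spot = re.search('new|news', 'newsome')
print(same_spot.group())  # new
```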
Be aware that there are some other differences between search and findall which aren't stated here.
For example:
python-regex why findall find nothing, but search works?
id= 'edu.vt.lib.scholar:http/ejournals/VALib/v48_n4/newsome.html'
re.search('news|ejournals|theses',id).group()
'ejournals'
re.search -> searches for the first match in the string and then stops.
re.findall('news|ejournals|theses',id)
['ejournals', 'news']
re.findall -> searches for all occurrences of the pattern in the string and returns them as a list.

How to get a list of character positions in Python?

I'm trying to write a function to sanitize unicode input in a web application, and I'm currently trying to reproduce the PHP function at the end of this page : http://www.iamcal.com/understanding-bidirectional-text/
I'm looking for an equivalent of PHP's preg_match_all in Python. The re function findall returns matches without positions, and search only returns the first match. Is there a function that returns every match along with its position in the text?
With a string abcdefa and the pattern a|c, I want to get something like [('a',0),('c',2),('a',6)]
Thanks :)
Try:
import re
text = 'abcdefa'
pattern = re.compile('a|c')
[(m.group(), m.start()) for m in pattern.finditer(text)]
I don't know of a way to get re.findall to do this for you, but the following should work:
Use re.findall to find all the matching strings.
Use str.index to find the associated index of each string returned by re.findall. However, be careful when you do this: if a string has two identical substrings in distinct locations, then re.findall will return both, but you'll need to tell str.index that you're looking for the second occurrence or the nth occurrence of a string. Otherwise, it will return an index that you already have. The best way I can think of to do this would be to maintain a dictionary that has the strings from the result of re.findall as keys and a list of indices as values.
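A rough sketch of the two steps above. Instead of the dictionary, it handles repeated substrings by passing str.index a start offset just past the previous match, which is a simpler variation on the same idea. The helper name `matches_with_positions` is made up, and the sketch assumes a pattern with no capturing groups, no lookaround or boundary assertions, and no empty matches; re.finditer avoids all of these caveats.

```python
import re

def matches_with_positions(pattern, text):
    """Pair each findall() result with its start index, found via str.index."""
    results = []
    search_from = 0  # never look before the end of the previous match
    for matched in re.findall(pattern, text):
        start = text.index(matched, search_from)
        results.append((matched, start))
        search_from = start + len(matched)
    return results

pairs = matches_with_positions('a|c', 'abcdefa')
print(pairs)  # [('a', 0), ('c', 2), ('a', 6)]
```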
Hope this helps
