Python regex - difference between search and find all - python

I am trying to use python regex on a URL string.
id= 'edu.vt.lib.scholar:http/ejournals/VALib/v48_n4/newsome.html'
>>> re.search('news|ejournals|theses',id).group()
'ejournals'
>>> re.findall('news|ejournals|theses',id)
['ejournals', 'news']
According to the docs at http://docs.python.org/2/library/re.html#finding-all-adverbs, search() returns only the first match, while findall() returns all matches in the string.
I am wondering why 'news' is not returned by search() even though it is listed first in the pattern.
Did I use the wrong pattern? I want to check whether any of those keywords occur in the string.

You're thinking about it backwards. The regex goes through the target string looking for "news" OR "ejournals" OR "theses" and returns the first one it finds. In this case "ejournals" appears first in the target string.

The re.search() function stops after the first occurrence that satisfies your condition, not the first option in the pattern.
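To see this, it can help to print where each keyword actually starts in the target string. The following is only a minimal sketch, not part of the original answers; the names url_id and keyword_pattern are made up for illustration (url_id is used instead of id simply to avoid shadowing the built-in).

```python
import re

url_id = 'edu.vt.lib.scholar:http/ejournals/VALib/v48_n4/newsome.html'
keyword_pattern = re.compile('news|ejournals|theses')

# search() scans left to right and stops at the earliest position where
# any alternative matches, regardless of the order inside the pattern.
first = keyword_pattern.search(url_id)
print(first.group(), first.start())

# finditer() keeps going and reports every non-overlapping match.
positions = [(m.group(), m.start()) for m in keyword_pattern.finditer(url_id)]
print(positions)

# To test only whether any keyword occurs, compare the result with None.
has_keyword = keyword_pattern.search(url_id) is not None
```

'ejournals' starts earlier in the string than the 'news' inside 'newsome', which is why search() reports it first.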

Be aware that there are some other differences between search and findall which aren't stated here.
For example:
python-regex why findall find nothing, but search works?
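One such difference is how capturing groups are handled. As a rough sketch (the pattern and variable names below are invented for illustration and do not come from the linked question): search() gives you a match object whose .group() is the whole match, while findall() returns only the contents of the groups when the pattern contains any.

```python
import re

url_id = 'edu.vt.lib.scholar:http/ejournals/VALib/v48_n4/newsome.html'
volume_pattern = r'v(\d+)_n(\d+)'  # hypothetical pattern with two groups

# group() with no arguments is the entire matched text.
whole = re.search(volume_pattern, url_id).group()

# findall() returns one tuple of group contents per match instead.
groups_only = re.findall(volume_pattern, url_id)

print(whole)        # v48_n4
print(groups_only)  # [('48', '4')]
```

If you want findall() to return whole matches, making the groups non-capturing with (?:...) should do it.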

id = 'edu.vt.lib.scholar:http/ejournals/VALib/v48_n4/newsome.html'
>>> re.search('news|ejournals|theses', id).group()
'ejournals'
re.search -> searches for the first occurrence in the string and then stops.
>>> re.findall('news|ejournals|theses', id)
['ejournals', 'news']
re.findall -> searches for all occurrences in the string and returns them as a list.

Related

If I am using re.finditer(), how can I limit the number of occurrences in python regex?

I am working on a python script that needs to go through a file and find certain paragraphs. I am able to match the pattern successfully using regex, but the same paragraph occurs more than once. I only need the first occurrence of the paragraph to be printed out.
Is there anything that I could add to my regular expression so that it only returns the first occurrence?
This is my regex so far: pattern = re.compile(//#|//\s#).+[\S\s]. Then I did matches = pattern.finditer(file_name), and lastly I looped over the matches and printed each one with print(i.group()). Note: the reason I used finditer() instead of findall() is that I need the result printed as a string rather than a list.
Any guidance as to how I can tweak my current approach to only yield the first matched paragraph would be great!
You might simply use .search rather than .finditer, for example:
import re
text = 'A1B2C3'
pattern = re.compile(r'([0-9])')
found = pattern.search(text).group()
print(found) # 1
print(isinstance(found,str)) # True
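If you would rather keep finditer(), for instance because you may later want the first few matches instead of just one, the iterator can be cut short. This is only a sketch built on the example above; the names first, first_text and first_two are my own.

```python
import re
from itertools import islice

text = 'A1B2C3'
pattern = re.compile(r'[0-9]')

# First match only; the default None avoids StopIteration when nothing matches.
first = next(pattern.finditer(text), None)
first_text = first.group() if first else None

# At most two matches, taken lazily from the iterator.
first_two = [m.group() for m in islice(pattern.finditer(text), 2)]

print(first_text)  # 1
print(first_two)   # ['1', '2']
```

Note that pattern.search(text) returns None when there is no match, so calling .group() on it directly raises AttributeError in that case; checking for None first is safer.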

Regex for search specific substring [duplicate]

I tried this code:
re.findall(r"d.*?c", "dcc")
to search for substrings whose first letter is d and last letter is c.
But I get the output ['dc'].
The correct output should be ['dc', 'dcc'].
What did I do wrong?
What you're looking for isn't possible using any built-in regexp functions that I know of. re.findall() only returns non-overlapping matches. After it matches dc, it looks for another match starting after that. Since the rest of the string is just c, and that doesn't match, it's done, so it just returns ["dc"].
When you use a quantifier like *, you have a choice of making it greedy, or non-greedy -- either it finds the longest or shortest match of the regexp. To do what you want, you need a way of telling it to look for successively longer matches until it can't find anything. There's no simple way to do this. You can use a quantifier with a specific count, but you'd have to loop it in your code:
d.{0}c
d.{1}c
d.{2}c
d.{3}c
...
If you have a regexp with multiple quantified sub-patterns, you'd have to try all combinations of lengths.
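The loop over fixed counts could look roughly like this. It is a sketch under my own assumptions: the function name matches_by_length is hypothetical, the pattern is hard-coded to the d...c shape from the question, and every start position is tried with pattern.match(text, start) so that overlapping hits are not skipped.

```python
import re

def matches_by_length(text):
    """Collect every substring that starts with 'd' and ends with 'c'."""
    found = []
    for n in range(len(text)):
        # d.{0}c, d.{1}c, d.{2}c, ... exactly n characters in between
        pattern = re.compile(r"d.{%d}c" % n)
        for start in range(len(text)):
            m = pattern.match(text, start)  # anchored at this start position
            if m:
                found.append(m.group())
    return found

print(matches_by_length("dcc"))  # ['dc', 'dcc']
```

This is quadratic in the length of the text, so it is only reasonable for short inputs.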
Your two problems are that .* is greedy while .*? is minimal, and that re.findall() only returns non-overlapping matches. Here's a possible solution:
import re

def findall_inner(expr, text):
    explore = list(re.findall(expr, text))
    matches = set()
    while explore:
        word = explore.pop()
        if len(word) >= 2 and word not in matches:
            explore.extend(re.findall(expr, word[1:]))  # try more removing first letter
            explore.extend(re.findall(expr, word[:-1]))  # try more removing last letter
            matches.add(word)
    return list(matches)

found = findall_inner(r"d.*c", "dcc")
print(found)
This is a little bit of overkill, using findall instead of search and using >= 2 instead of > 2, as in this case there can only be one non-overlapping match of d.*c and one-character strings cannot match the pattern. But there is some flexibility in it depending on what other kinds of patterns you might want.
Try this regex:
^d.*c$
Essentially, you are looking for the start of the string to be d and the end of the string to be c.
This is a very important point to understand: a regex engine always returns the leftmost match, even if a "better" match could be found later. When applying a regex to a string, the engine starts at the first character of the string and tries all possible ways of matching the regular expression from there. Only if every possibility has been tried and has failed does the engine continue with the second character in the text. So once findall() has found 'dc', the engine moves past 'dc' and continues from the second 'c'. That is why it cannot also return 'dcc'.
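The non-overlapping behaviour is easy to observe on a smaller input. The sketch below also uses a lookahead, which is a commonly used workaround rather than something proposed in the answers above; note that it yields at most one match per start position, not every possible substring.

```python
import re

text = 'aaaa'

# After matching 'aa' at index 0 the scan resumes at index 2,
# so the 'aa' starting at index 1 is never reported.
plain = re.findall(r'aa', text)

# A lookahead consumes no characters, so the scan advances one position
# at a time and the capturing group records what follows each position.
overlapping = re.findall(r'(?=(aa))', text)

# The question's input has only one position where a 'd' starts,
# so even the lookahead version finds a single result.
per_start = re.findall(r'(?=(d.*?c))', 'dcc')

print(plain)        # ['aa', 'aa']
print(overlapping)  # ['aa', 'aa', 'aa']
print(per_start)    # ['dc']
```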

How to get a list of character positions in Python?

I'm trying to write a function to sanitize unicode input in a web application, and I'm currently trying to reproduce the PHP function at the end of this page : http://www.iamcal.com/understanding-bidirectional-text/
I'm looking for an equivalent of PHP's preg_match_all in python. RE function findall returns matches without positions, and search only returns the first match. Is there any function that would return me every match, along with the associated position in the text ?
With a string abcdefa and the pattern a|c, I want to get something like [('a',0),('c',2),('a',6)]
Thanks :)
Try:
import re
text = 'abcdefa'
pattern = re.compile('a|c')
[(m.group(), m.start()) for m in pattern.finditer(text)]
I don't know of a way to get re.findall to do this for you, but the following should work:
Use re.findall to find all the matching strings.
Use str.index to find the index of each string returned by re.findall. However, be careful when you do this: if the same substring occurs at two distinct locations, re.findall will return both, and you'll need to tell str.index that you're looking for the second (or nth) occurrence. Otherwise, it will return an index that you already have. The best way I can think of to do this is to maintain a dictionary that has the strings returned by re.findall as keys and a list of indices as values.
Hope this helps
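A sketch of that second approach, with one simplification: instead of a dictionary it keeps a running offset, so each str.index call starts looking after the previous match. The function name findall_with_positions is hypothetical, and the sketch assumes a pattern without capturing groups or lookarounds and without empty matches; finditer() remains the more robust choice.

```python
import re

def findall_with_positions(pattern, text):
    results = []
    offset = 0  # where the search for the next occurrence begins
    for piece in re.findall(pattern, text):
        index = text.index(piece, offset)
        results.append((piece, index))
        offset = index + len(piece)  # skip past this occurrence
    return results

print(findall_with_positions('a|c', 'abcdefa'))
# [('a', 0), ('c', 2), ('a', 6)]
```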

python and regex

#!/usr/bin/python
import re
str = raw_input("String containing email...\t")
match = re.search(r'[\w.-]+@[\w.-]+', str)
if match:
    print match.group()
It's not the most complicated code, and I'm looking for a way to get ALL of the matches, if that's possible.
It sounds like you want re.findall():
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
As far as the actual regular expression for identifying email addresses goes... See this question.
Also, be careful using str as a variable name. This will hide the str built-in.
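Put together, that might look like the sketch below. It assumes the separator in the pattern is meant to be @, as in real email addresses, and it keeps the question's deliberately loose pattern rather than attempting full address validation. The function name find_emails and the sample addresses are made up.

```python
import re

def find_emails(text):
    # 'text' rather than 'str', so the built-in stays usable.
    return re.findall(r'[\w.-]+@[\w.-]+', text)

emails = find_emails('alice@example.com, bob@example.org')
none_found = find_emails('no addresses here')

print(emails)      # ['alice@example.com', 'bob@example.org']
print(none_found)  # []
```

Unlike search(), findall() returns an empty list when nothing matches, so there is no None to check for.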
I guess that re.findall is what you're looking for.
You should give findall() a try.
findall() matches all occurrences of a
pattern, not just the first one as
search() does. For example, if one was
a writer and wanted to find all of the
adverbs in some text, he or she might
use findall()
http://docs.python.org/library/re.html#finding-all-adverbs
You don't use raw_input the way you did; just use raw_input to get the input from the console.
Don't override built-ins such as str. Use a meaningful name and assign the whole string value to it.
It is also often a good idea to compile your pattern into a regex object and match the string against that (illustrated in the code).
A complete regex that matches an email address exactly as per RFC 822 could fill a page; short of that, this snippet should be useful.
import re
inputstr = "something@exmaple.com, 121@airtelnet.com, ra@g.net, etc etc\t"
# compile once, then reuse the regex object
mailsrch = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
matches = mailsrch.findall(inputstr)
print(matches)

Finding Regex Pattern after doing re.findall

This is in continuation of my earlier question where I wanted to compile many patterns as one regular expression and after the discussion I did something like this
REGEX_PATTERN = '|'.join(self.error_patterns.keys())
where self.error_patterns.keys() would be patterns like
: error:
: warning:
cc1plus:
undefine reference to
Failure:
and do
error_found = re.findall(REGEX_PATTERN,line)
Now when I run it against a file that might contain one or more of these patterns, how do I know exactly which pattern matched? I can of course inspect the line manually and find out, but I want to know whether, after doing re.findall, I can retrieve the pattern with something like re.group().
Thank you
re.findall will return all portions of text that matched your expression.
If that is not sufficient to identify the pattern unambiguously, you can still do a second re.match/re.search against the individual subpatterns you have join()ed. However, by the time your combined regular expression is applied, the matcher is no longer aware that you composed it from several subpatterns, so it cannot tell you which subpattern matched.
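A sketch of the two-pass idea, under the assumption that the sub-patterns are plain literal strings (hence re.escape); the list sub_patterns and the function identify are hypothetical names, and re.fullmatch requires Python 3.4 or later.

```python
import re

# Hypothetical sub-patterns, treated here as literal strings.
sub_patterns = [': error:', ': warning:', 'cc1plus:', 'Failure:']
combined = re.compile('|'.join(re.escape(p) for p in sub_patterns))

def identify(line):
    """Return (matched_text, sub_pattern) pairs for one line."""
    hits = []
    for matched_text in combined.findall(line):
        # Second pass: test the matched text against each sub-pattern.
        for p in sub_patterns:
            if re.fullmatch(re.escape(p), matched_text):
                hits.append((matched_text, p))
                break
    return hits

print(identify('main.c:3: warning: unused; cc1plus: out of memory'))
```

With literal patterns the second pass is trivial, but the same loop applies when the sub-patterns are real regular expressions (drop re.escape in that case).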
Another, equally unwieldy option would be to enclose each pattern in a group (...). Then re.findall will return a list of tuples with one entry per group: an empty string for each group that did not take part in the match, and the matched text for the one group that did.
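For illustration, here is roughly what that looks like with two of the patterns. In Python's re module, groups that did not take part in a match come back from findall() as empty strings. The sample line and variable names are invented.

```python
import re

grouped = re.compile(r'(: error:)|(: warning:)')
rows = grouped.findall('x: error: y: warning: z')

# The position of the non-empty entry in each tuple is the index of the
# alternative that matched.
which = [next(i for i, g in enumerate(row) if g) for row in rows]

print(rows)   # [(': error:', ''), ('', ': warning:')]
print(which)  # [0, 1]
```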
MatchObject has a lastindex property that contains the index of the last capturing group that participated in the match. If you enclose each pattern in its own capturing group, like this:
(: error:)|(: warning:)
...lastindex will tell you which one matched (assuming you know the order in which the patterns appear in the regex). You'll probably want to use finditer() (which creates an iterator of MatchObjects) instead of findall() (which returns a list of strings). Also, make sure there are no other capturing groups in the regex, since they would throw your indexing out of sync.
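A minimal sketch of the lastindex approach. It assumes the patterns are literal strings (so they are passed through re.escape) and that each alternative is wrapped in exactly one capturing group; the list error_patterns and the function which_patterns are illustrative names, not the asker's actual code.

```python
import re

# Hypothetical list standing in for self.error_patterns.keys()
error_patterns = [': error:', ': warning:', 'cc1plus:', 'Failure:']

# One capturing group per pattern, in a known order.
combined = re.compile('|'.join('(%s)' % re.escape(p) for p in error_patterns))

def which_patterns(line):
    # lastindex is 1-based, so subtract 1 to index into the list.
    return [error_patterns[m.lastindex - 1] for m in combined.finditer(line)]

print(which_patterns('foo.c:10: error: bad; foo.c:12: warning: meh'))
# [': error:', ': warning:']
```

If the patterns are real regular expressions that contain their own groups, the indices shift; converting those inner groups to non-capturing (?:...) groups would keep lastindex aligned.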
