I'm trying to write a function to sanitize Unicode input in a web application, and I'm currently trying to reproduce the PHP function at the end of this page: http://www.iamcal.com/understanding-bidirectional-text/
I'm looking for an equivalent of PHP's preg_match_all in Python. The re function findall returns matches without positions, and search only returns the first match. Is there a function that returns every match along with its position in the text?
With the string abcdefa and the pattern a|c, I want to get something like [('a',0),('c',2),('a',6)]
Thanks :)
Try:
import re
text = 'abcdefa'
pattern = re.compile('a|c')
# finditer yields match objects, which carry both the text and its position
[(m.group(), m.start()) for m in pattern.finditer(text)]
I don't know of a way to get re.findall to do this for you, but the following should work:
Use re.findall to find all the matching strings.
Use str.index to find the associated index of each string returned by re.findall. However, be careful when you do this: if the same substring occurs at two distinct locations, re.findall will return both, but you'll need to tell str.index that you're looking for the second (or the nth) occurrence of the string. Otherwise, it will return an index that you already have. The best way I can think of to do this would be to maintain a dictionary that has the strings from the result of re.findall as keys and a list of indices as values.
Hope this helps
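The str.index idea above can be sketched as follows. This is only a minimal sketch, not a replacement for finditer: the helper name find_with_positions is hypothetical, it uses str.find with a moving start offset in place of the dictionary bookkeeping, and it assumes the pattern has no capturing groups or lookarounds, so that every string returned by re.findall is exactly the matched text.

```python
import re

def find_with_positions(pattern, text):
    """Pair each re.findall result with its index (hypothetical helper)."""
    results = []
    offset = 0  # never search before the end of the previous match
    for matched in re.findall(pattern, text):
        index = text.find(matched, offset)
        results.append((matched, index))
        offset = index + len(matched)
    return results

print(find_with_positions('a|c', 'abcdefa'))
# [('a', 0), ('c', 2), ('a', 6)]
```

The moving offset is what keeps a repeated substring from being reported at an index that was already returned.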
Related
I'm using python but code in any language will do as well for this question.
Suppose I have 2 strings.
sequence ='abcd'
string = 'axyzbdclkd'
In the above example sequence is a subsequence of string
How can I check if sequence is a subsequence of string using regex? Also check the examples here for difference in subsequence and subarray and what I mean by subsequence.
The only thing I could think of is this, but it's far from what I want.
import re
c = re.compile('abcd')
c.match('axyzbdclkd')
Just allow arbitrary strings in between:
c = re.compile('.*a.*b.*c.*d.*')
# .* any character, zero or more times
You can, for an arbitrary sequence construct a regex like:
import re
sequence = 'abcd'
rgx = re.compile('.*'.join(re.escape(x) for x in sequence))
which will, for 'abcd', result in the regex 'a.*b.*c.*d'. You can then use rgx.search(..):
the_string = 'axyzbdclkd'
if rgx.search(the_string):
    # ... the sequence is a subsequence.
    pass
By using re.escape(..) you know for sure that for instance '.' in the original sequence will be translated to '\.' and thus not match any character.
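Wrapped into a function, the approach might look like the sketch below. The name is_subsequence and the re.DOTALL flag are additions for illustration (the flag lets '.*' also skip newlines); neither comes from the answer itself.

```python
import re

def is_subsequence(sequence, string):
    """Return True if the characters of `sequence` occur in `string` in order."""
    # re.escape keeps characters such as '.' or '*' literal.
    # re.DOTALL (an added assumption) lets '.*' cross line breaks too.
    rgx = re.compile('.*'.join(re.escape(ch) for ch in sequence), re.DOTALL)
    return rgx.search(string) is not None

print(is_subsequence('abcd', 'axyzbdclkd'))  # True
print(is_subsequence('a.c', 'abc'))          # False: '.' is matched literally
```

For long sequences that do not match, a pattern with many '.*' gaps may backtrack heavily, so this is probably best kept to short inputs.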
I don't think the solution is as simple as @schwobaseggl claims. Let me show you another sequence from your database: ab1b2cd. Matching the abcd subsequence against it can give 2 results: ab(1b2)cd and a(b1)b(2)cd. So for testing purposes the proposed ^.*a.*b.*c.*d.*$ is ok(ish), but for parsing, ^a(.*)b(.*)cd$ is greedy and will always give the second result. To get the first one you need to make the first group lazy: ^a(.*?)b(.*)cd$. So if you need this for parsing, you should know how many variables to expect. To optimize the regex pattern, parse a few example strings and put gaps with capturing groups only at the positions where you really need them. An advanced version would use the pattern of the actual variable instead of .*, for example ^ab(\d\w\d)cd$ in the first case or ^a(\w\d)b(\d)cd$ in the second.
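To see which split each pattern actually produces on ab1b2cd, the two patterns can be run side by side. A small sketch; the variable names are only for illustration.

```python
import re

text = 'ab1b2cd'

# Greedy first group: it takes as much as it can before the last usable 'b'.
greedy = re.match(r'^a(.*)b(.*)cd$', text)
# Lazy first group: it takes as little as it can, so the first 'b' is used.
lazy = re.match(r'^a(.*?)b(.*)cd$', text)

print(greedy.groups())  # ('b1', '2')
print(lazy.groups())    # ('', '1b2')
```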
I have a python code like
for i in re.finditer('something(.+?)"', html):
I am now trying to find out how many times it is going to loop before entering the loop, in other words the length of the array i.
Could anyone give me alternative but similar code with which I can get the length of the loop?
x = list(re.finditer('something(.+?)"', html))
if len(x):
    ....
for i in x:
    ....
findall is not an adequate replacement since it returns strings, not match objects.
You can't do that with re.finditer, because it returns an iterator that doesn't know it's finished until it gets there (it finds the next match on each iteration). You'll have to use re.findall.
matches = re.findall('something(.+?)"', html)
num_loops = len(matches)
or use @thg435's approach if you do in fact need the match objects.
finditer returns the results as it finds them. There is no way finditer can tell you how many times you will loop in advance.
You need to use something else: either re.findall, or list(re.finditer(...)) if you need the match objects, and then take the length.
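If only the count is needed up front and the matches themselves need not be kept, one option is to consume the iterator once just for counting. A sketch, assuming a made-up html string; note that this scans the input twice if you loop again afterwards.

```python
import re

# Hypothetical sample input; the real `html` comes from the question.
html = 'something1" other something22" something333"'
pattern = 'something(.+?)"'

# Count the matches without storing them.
count = sum(1 for _ in re.finditer(pattern, html))
print(count)  # 3

# A second pass still yields match objects.
for m in re.finditer(pattern, html):
    print(m.group(1))
```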
I am trying to use python regex on a URL string.
id= 'edu.vt.lib.scholar:http/ejournals/VALib/v48_n4/newsome.html'
>>> re.search('news|ejournals|theses',id).group()
'ejournals'
>>> re.findall('news|ejournals|theses',id)
['ejournals', 'news']
Based on the docs at http://docs.python.org/2/library/re.html#finding-all-adverbs, search() matches the first one and findall() matches all the possible ones in the string.
I am wondering why 'news' is not captured with search even though it is declared first in the pattern.
Did I use the wrong pattern? I want to check whether any of those keywords occur in the string.
You're thinking about it backwards. The regex goes through the target string looking for "news" OR "ejournals" OR "theses" and returns the first one it finds. In this case "ejournals" appears first in the target string.
The re.search() function stops after the first occurrence that satisfies your condition, not the first option in the pattern.
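Printing the match positions makes this visible. A short sketch using the string from the question; the variable is called url here only to avoid shadowing the built-in id.

```python
import re

url = 'edu.vt.lib.scholar:http/ejournals/VALib/v48_n4/newsome.html'
pattern = 'news|ejournals|theses'

# search() reports the leftmost match in the string,
# regardless of the order of alternatives in the pattern.
first = re.search(pattern, url)
print(first.group())  # ejournals

# finditer() shows every match together with where it starts.
for m in re.finditer(pattern, url):
    print('%s at %d' % (m.group(), m.start()))
```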
Be aware that there are some other differences between search and findall which aren't stated here.
For example:
python-regex why findall find nothing, but search works?
id= 'edu.vt.lib.scholar:http/ejournals/VALib/v48_n4/newsome.html'
re.search('news|ejournals|theses',id).group()
'ejournals'
re.search -> search for first appearance in string and then exit.
re.findall('news|ejournals|theses',id)
['ejournals', 'news']
re.findall -> search for all occurrences of match in string and return in list form.
I'm trying to run this code:
picture = re.search("#4F9EFF;\"><img src=\"(.+?)\" width=\"120\" height=\"90\"", data)
and when I do print picture.groups(1)
it returns the value, but with parentheses. Why?
Output:
('http://sample.com/img/file.jpg',)
groups() returns a tuple, here containing one element. You can access the string (which is the first group) as element 0 of that tuple, e.g. picture.groups()[0]. The important part is the comma after the string.
BUT
DON'T PARSE HTML WITH REGEX
You should use a proper HTML parser. This will save you innumerable headaches in the future, when your regex fails to match or gets too much. Look into BeautifulSoup or lxml.
Notice the comma before the closing parenthesis? This is a tuple (albeit one with just one element in it).
As the documentation for MatchObject.groups() says:
groups([default])
Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None.
As noted by other posters, you want to use MatchObject.group() instead.
You should be using
picture.group(1)
not groups() in plural if you're only looking for one specific group. groups() always returns a tuple, group() is the one you're looking for.
groups() returns a tuple of all the groups. You want picture.group(1), which returns the string that matched group 1.
As the groups help says, it returns "a tuple containing all the subgroups of the match".
If you want a single group use the group method.
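The difference can be checked on a small input. In this sketch the data string is a made-up sample shaped like the markup in the question. It also shows that the argument to groups() is the default value for groups that did not participate in the match, not a group index.

```python
import re

# Hypothetical sample input shaped like the markup in the question.
data = '#4F9EFF;"><img src="http://sample.com/img/file.jpg" width="120" height="90"'
picture = re.search(r'#4F9EFF;"><img src="(.+?)" width="120" height="90"', data)

print(picture.groups())  # ('http://sample.com/img/file.jpg',)
print(picture.group(1))  # http://sample.com/img/file.jpg

# The argument of groups() only fills in groups that did not participate.
partial = re.match(r'(a)(b)?', 'a')
print(partial.groups(1))  # ('a', 1)
```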
#!/usr/bin/python
import re
str = raw_input("String containing email...\t")
match = re.search(r'[\w.-]+@[\w.-]+', str)
if match:
    print match.group()
It's not the most complicated code, and I'm looking for a way to get ALL of the matches, if possible.
It sounds like you want re.findall():
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
As far as the actual regular expression for identifying email addresses goes... See this question.
Also, be careful using str as a variable name. This will hide the str built-in.
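The note about groups in the findall documentation is easy to trip over, so here is a short sketch of both cases. It assumes the pattern is meant to split on '@'; the sample text and the variable names are made up for illustration.

```python
import re

text = 'write to alice@example.com or bob@example.org today'

# No groups in the pattern: a list of the full matches.
whole = re.findall(r'[\w.-]+@[\w.-]+', text)
print(whole)  # ['alice@example.com', 'bob@example.org']

# Two groups in the pattern: a list of tuples, one item per group.
parts = re.findall(r'([\w.-]+)@([\w.-]+)', text)
print(parts)  # [('alice', 'example.com'), ('bob', 'example.org')]
```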
I guess that re.findall is what you're looking for.
You should give findall() a try (the re module has no find() function).
findall() matches all occurrences of a pattern, not just the first one as search() does. For example, if one was a writer and wanted to find all of the adverbs in some text, he or she might use findall()
http://docs.python.org/library/re.html#finding-all-adverbs
raw_input isn't meant to be used the way you used it. Just use raw_input to get the input from the console.
Don't override built-ins such as str. Use a meaningful name and assign it the whole string value.
It is also often a good idea to compile your pattern into a regex object and match the string against that (illustrated in the code).
A regex that matches an email address exactly as per RFC 822 could fill a page; short of that, this snippet should be useful.
import re
inputstr = "something@exmaple.com, 121@airtelnet.com, ra@g.net, etc etc\t"
mailsrch = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
matches = mailsrch.findall(inputstr)
print matches