How to match the following regex in Python?

How to match the following with regex?
string1 = '1.0) The Ugly Duckling (TUD) (10 Dollars)'
string2 = '1.0) Little 1 Red Riding Hood (9.50 Dollars)'
I am trying the following:
groupsofmatches = re.match('(?P<booknumber>.*)\)([ \t]+)?(?P<item>.*)(\(.*\))?\(.*?((\d+)?(\.\d+)?).*([ \t]+)?Dollars(\))?', string1)
The issue is that the expression works fine when I apply it to string2, but when I apply it to string1 I cannot get the m.group(name) results because of the "(TUD)" part. I want to use a single expression that works for both strings.
I expect:
booknumber = 1.0
item = The Ugly Duckling (TUD)

Your problem is that .* matches greedily, and it may be consuming too much of the string. Printing all of the match groups will make this more obvious:
import re
string1 = '1.0) The Ugly Duckling (TUD) (10 Dollars)'
string2 = '1.0) Little 1 Red Riding Hood (9.50 Dollars)'
result = re.match(r'(.*?)\)([ \t]+)?(?P<item>.*)\(.*?(?P<dollaramount>(\d+)?(\.\d+)?).*([ \t]+)?Dollars(\))?', string1)
print repr(result.groups())
print result.group('item')
print result.group('dollaramount')
Changing them to *? makes the match non-greedy, so each one consumes as little as possible.
Non-greedy matching can be expensive in some RE engines, so you can also write e.g. \([^)]*\) to match a parenthesised part explicitly. If you're not processing a lot of text it probably doesn't matter.
By the way, you should really use raw strings (i.e. r'something') for regexps, to avoid surprising backslash behaviour and to give the reader a clue.
I see you had the group (\(.*\))?, which presumably was meant to cut out the (TUD) part; if you actually want that kept in the title, just leave that group out.
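As a quick check, here is a minimal sketch that reuses the exact pattern from above on both sample strings (the strip() call is only there to drop the trailing space captured before the price parenthesis):
import re
pattern = re.compile(r'(.*?)\)([ \t]+)?(?P<item>.*)\(.*?(?P<dollaramount>(\d+)?(\.\d+)?).*([ \t]+)?Dollars(\))?')
for s in ('1.0) The Ugly Duckling (TUD) (10 Dollars)',
          '1.0) Little 1 Red Riding Hood (9.50 Dollars)'):
    m = pattern.match(s)
    # e.g. 1.0 | The Ugly Duckling (TUD) | 10
    print(m.group(1), '|', m.group('item').strip(), '|', m.group('dollaramount'))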

You could impose some heavier restrictions on your repeated characters:
groupsofmatches = re.match('([^)]*)\)[ \t]*(?P<item>.*)\([^)]*?(?P<dollaramount>(?:\d+)?(?:\.\d+)?)[^)]*\)$', string1)
This will make sure that the numbers are taken from the last set of parentheses.

I would write it as:
num, name, value = re.match(r'(.+?)\) (.*?) \(([\d.]+) Dollars\)', string2).groups()
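Applied to both sample strings from the question, this one-liner should unpack as follows (a small sketch; the loop and variable names are just for illustration):
import re
for s in ('1.0) The Ugly Duckling (TUD) (10 Dollars)',
          '1.0) Little 1 Red Riding Hood (9.50 Dollars)'):
    num, name, value = re.match(r'(.+?)\) (.*?) \(([\d.]+) Dollars\)', s).groups()
    print(num, name, value)
# 1.0 The Ugly Duckling (TUD) 10
# 1.0 Little 1 Red Riding Hood 9.50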

This is how I would do it:
(?P<booknumber>\d+(?:\.\d+)?)\)\s+(?P<item>.*?)\s+\(\d+(?:\.\d+)?\s+Dollars\)
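A small sketch of that pattern in use (groupdict() is just a convenient way to inspect the named groups):
import re
pattern = re.compile(r'(?P<booknumber>\d+(?:\.\d+)?)\)\s+(?P<item>.*?)\s+\(\d+(?:\.\d+)?\s+Dollars\)')
print(pattern.match('1.0) The Ugly Duckling (TUD) (10 Dollars)').groupdict())
# {'booknumber': '1.0', 'item': 'The Ugly Duckling (TUD)'}
print(pattern.match('1.0) Little 1 Red Riding Hood (9.50 Dollars)').groupdict())
# {'booknumber': '1.0', 'item': 'Little 1 Red Riding Hood'}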

I suggest you use the regex pattern
(?P<booknumber>[^)]*)\)\s+(?P<item>.*\S)\s+\((?!.*\()(?P<amount>\S+)\s+Dollars?\)

Related

How to improve the performance of this regular expression?

Consider the regular expression
^(?:\s*(?:[\%\#].*)?\n)*\s*function\s
It is intended to match Octave/MATLAB script files that start with a function definition.
However, the performance of this regular expression is incredibly slow, and I'm not entirely sure why. For example, if I try evaluating it in Python,
>>> import re, time
>>> r = re.compile(r"^(?:\s*(?:[\%\#].*)?\n)*\s*function\s")
>>> t0=time.time(); r.match("\n"*15); print(time.time()-t0)
0.0178489685059
>>> t0=time.time(); r.match("\n"*20); print(time.time()-t0)
0.532235860825
>>> t0=time.time(); r.match("\n"*25); print(time.time()-t0)
17.1298530102
In English, that last line is saying that my regular expression takes 17 seconds to evaluate on a simple string containing 25 newline characters!
What is it about my regex that is making it so slow, and what could I do to fix it?
EDIT: To clarify, I would like my regex to match the following string containing comments:
# Hello world
function abc
including any amount of whitespace, but not
x = 10
function abc
because then the string does not start with "function". Note that comments can start with either "%" or with "#".
Replace your \s with [\t\f ] so it does not also consume newlines. This only needs to be done inside the repeated non-capturing group, which becomes (?:[\t\f ]*(?:[\%\#].*)?\n); a quick check of this change is sketched below.
The problem is that you have three greedy consumers that all match '\n' (\s*, (...\n)* and again \s*).
In your last timing example, before the engine can report failure it has to try every way of dividing the 25 newlines among those consumers. Worse, each repetition of the starred group must consume at least one '\n' for its trailing \n but can swallow more through its leading \s*, so a run of k newlines can be split across the repetitions in roughly 2**(k-1) different ways.
The engine works through essentially all of those combinations, so the running time roughly doubles with every extra newline, which matches the timings above.
By the way, regex101 is a great site to try out regular expressions. It automatically breaks an expression up, explains its parts, and even provides a debugger.
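Here is a rough sketch of that replacement (the timing is only meant to show that the pathological input now fails fast; exact numbers will vary by machine):
import re, time
fixed = re.compile(r"^(?:[\t\f ]*(?:[\%\#].*)?\n)*\s*function\s")
t0 = time.time()
print(fixed.match("\n" * 25))   # None, returned almost instantly
print(time.time() - t0)
print(bool(fixed.match("# Hello world\nfunction abc")))  # True
print(bool(fixed.match("x = 10\nfunction abc")))         # False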
To speed things up you can use this regex:
p = re.compile(r"^\s*function\s", re.MULTILINE)
Since you're not actually capturing the lines starting with # or % anyway, you can use MULTILINE mode and start matching from the line where the function keyword is found.
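A quick sketch of that approach, assuming re.search is used with the compiled pattern; note that with MULTILINE it also matches files where the function line is preceded by non-comment code, so whether that is acceptable depends on how strict the check needs to be:
import re
p = re.compile(r"^\s*function\s", re.MULTILINE)
print(bool(p.search("# Hello world\nfunction abc")))  # True
print(bool(p.search("\n" * 25)))                      # False, and it returns immediately
print(bool(p.search("x = 10\nfunction abc")))         # True, even though the file starts with code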

How do I strip patterns or words from the end of the string backwards?

I have a string like this:
<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>
I would like to strip the first 3 opening and the last 3 closing tags from the string. I do not know the tag names in advance.
I can strip the first 3 opening tags with re.sub(r'<[^<>]+>', '', in_str, 3). How do I strip the closing tags? What should remain is:
<v1>aaa<b>bbb</b>ccc</v1>
I know I could maybe 'do it right', but I actually do not wish to do XML or HTML parsing for my purpose, which is to help me visualize the XML representation of some classes.
Instead, I realized that this problem is interesting. It seems I cannot simply search backwards with a regex, i.e. right to left, because that appears to be unsupported:
If you mean, find the right-most match of several (similar to the
rfind method of a string) then no, it is not directly supported. You
could use re.findall() and choose the last match but if the matches can
overlap this may not give the correct result.
But .rstrip is not good with words, and won't do patterns either.
I looked at Strip HTML from strings in Python but I only wish to strip up to 3 tags.
What approach could be used here? Should I reverse the string (ugly in itself, and because of the '<>'s)? Do tokenization (but why not parse, then)? Or create static closing tags based on the left-to-right matches?
Which strategy to follow to strip the patterns from the end of the string?
The simplest would be to use old-fashioned string splitting with a limit:
in_str.split('>', 3)[-1].rsplit('<', 3)[0]
Demo:
>>> in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar><foo>'
>>> in_str.split('>', 3)[-1].rsplit('<', 3)[0]
'<v1>aaa<b>bbb</b>ccc</v1>'
str.split() and str.rsplit() with a limit will split the string from the start or the end up to the limit times, letting you select the remainder unsplit.
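To see the intermediate steps (a small sketch in the interpreter):
>>> in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
>>> in_str.split('>', 3)
['<foo', '<bar', '<k2', '<v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>']
>>> in_str.split('>', 3)[-1].rsplit('<', 3)
['<v1>aaa<b>bbb</b>ccc</v1>', '/k2>', '/bar>', '/foo>']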
You've already got practically the whole solution. re can't search backwards, but you can reverse the string yourself:
import re
in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
in_str = re.sub(r'<[^<>]+>', '', in_str, 3)
in_str = in_str[::-1]
print in_str
in_str = re.sub(r'>[^<>]+/<', '', in_str, 3)
in_str = in_str[::-1]
print in_str
<v1>aaa<b>bbb</b>ccc</v1>
Note the reversed regex for the reversed string; the string is then reversed back to its original order afterwards.
Of course, as mentioned, this is way easier with a proper parser:
in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
from lxml import etree
ix = etree.fromstring(in_str)
print etree.tostring(ix[0][0][0])
<v1>aaa<b>bbb</b>ccc</v1>
I would look into regular expressions and use re.split() with a suitable pattern:
http://docs.python.org/3/library/re.html?highlight=regex#re.regex.split
Sorry, can't comment, but will give it as an answer.
in_str.split('>', 3)[-1].rsplit('<', 3)[0] will work for the given example
<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>, but not for
<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo><another>test</another>.
You should just be aware of this.
To handle the counterexample above, you would have to track the state (or count) of the tags and check that you are matching the correct pairs.
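One hedged sketch of what that could look like: strip_outer_tags below is a hypothetical helper (not from the answers above) that uses a backreference so an outer tag is only peeled off when the same tag name closes the string:
import re

def strip_outer_tags(s, n=3):
    # Peel off up to n outer tag pairs, but only while the opening tag at the
    # front is matched by the same closing tag at the very end of the string.
    for _ in range(n):
        m = re.match(r'<([^<>/\s]+)[^<>]*>(.*)</\1>$', s, re.DOTALL)
        if not m:
            break
        s = m.group(2)
    return s

print(strip_outer_tags('<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'))
# <v1>aaa<b>bbb</b>ccc</v1>
print(strip_outer_tags('<foo>x</foo><another>test</another>'))
# printed unchanged, because <foo> does not wrap the whole string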

Python regex to match multiple times

I'm trying to match a pattern against strings that could have multiple instances of the pattern. I need every instance separately. re.findall() should do it but I don't know what I'm doing wrong.
pattern = re.compile('/review: (http://url.com/(\d+)\s?)+/', re.IGNORECASE)
match = pattern.findall('this is the message. review: http://url.com/123 http://url.com/456')
I need 'http://url.com/123', 'http://url.com/456' and the two numbers 123 and 456 to be separate elements of the match list.
I have also tried '/review: ((http://url.com/(\d+)\s?)+)/' as the pattern, but no luck.
Use this. You need to place 'review' outside the capturing group to achieve the desired result.
pattern = re.compile(r'(?:review: )?(http://url.com/(\d+))\s?', re.IGNORECASE)
This gives output
>>> match = pattern.findall('this is the message. review: http://url.com/123 http://url.com/456')
>>> match
[('http://url.com/123', '123'), ('http://url.com/456', '456')]
You've got extra /'s in the regex. In Python the pattern is just a string; there are no surrounding slashes as in some other languages. So instead of this:
pattern = re.compile('/review: (http://url.com/(\d+)\s?)+/', re.IGNORECASE)
It should be:
pattern = re.compile('review: (http://url.com/(\d+)\s?)+', re.IGNORECASE)
Also, typically in Python you'd use a "raw" string like this:
pattern = re.compile(r'review: (http://url.com/(\d+)\s?)+', re.IGNORECASE)
The extra r on the front of the string saves you from having to do lots of backslash escaping etc.
Use a two-step approach: first get everything from "review:" to the end of the line, then tokenize that.
import re
msg = 'this is the message. review: http://url.com/123 http://url.com/456'
review_pattern = re.compile(r'.*review: (.*)$')
urls = review_pattern.findall(msg)[0]
url_pattern = re.compile(r'(http://url.com/(\d+))')
url_pattern.findall(urls)
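Assuming the message contains a single "review:" marker, as in the example, the final findall should yield both URLs together with their numbers:
[('http://url.com/123', '123'), ('http://url.com/456', '456')]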

How should I use a non-greedy match in this case?

Assume I have a string which includes some data fields that are separated by "|", like
|1|2|3|4|5|6|7|8|
My purpose is to get the 8th field. This is what I'm doing:
pattern = re.compile(r'^\s+(\|.*?\|){8}')
match = pattern.match(test_line)
if match:
    print match.group(8)
But it looks like it does not match. I know that in this case I need ? for a non-greedy match, but why can I not get the 8th field?
Thanks
Regex might be complicating this problem rather than simplifying it. A simple way to get the eighth item from a |-delimited string is to use split():
a = '|here|is|some|data|separated|by|bars|hooray!|'
print a.split('|')[8]
RETURNS
hooray!
Using regex, one way to get it would be:
import re
a = '|here|is|some|data|separated|by|bars|hooray!|'
pattern = re.compile(r'([^\|]+)')
match = pattern.findall(a)
print match[7]
RETURNS
hooray!
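For completeness: the reason the original attempt cannot give a group(8) is that a repeated group such as (\|.*?\|){8} still defines only one group, and that group keeps only the text of its last repetition. A hedged sketch that instead skips the first seven fields and captures the eighth directly (assuming the line starts with the first |, as in the sample):
import re
line = '|1|2|3|4|5|6|7|8|'
m = re.match(r'(?:\|[^|]*){7}\|([^|]*)', line)
print(m.group(1))
# 8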

How to extract longest of overlapping groups?

How can I extract the longest of several alternatives that start the same way?
For example, from a given string, I want to extract the longest match to either CS or CSI.
I tried "(CS|CSI).*" and it will return CS rather than CSI even if CSI is available.
If I do "(CSI|CS).*" then I do get CSI when it matches, so I guess the solution is to always place the shorter of the overlapping alternatives after the longer one.
Is there a clearer way to express this with re? Somehow it feels confusing that the result depends on the order in which you list the alternatives.
No, that's just how it works, at least in Perl-derived regex flavors like Python, JavaScript, .NET, etc.
http://www.regular-expressions.info/alternation.html
As Alan says, the patterns will be matched in the order you specified them.
If you want to match on the longest of overlapping literal strings, you need the longest one to appear first. But you can organize your strings longest-to-shortest automatically, if you like:
>>> '|'.join(sorted('cs csi miami vice'.split(), key=len, reverse=True))
'miami|vice|csi|cs'
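That joined string can then be dropped straight into a pattern; a minimal sketch (the sample sentence is made up for illustration, and re.escape would be needed if the words contained regex metacharacters):
import re
alternation = '|'.join(sorted('cs csi miami vice'.split(), key=len, reverse=True))
print(re.search(alternation, 'the csi spin-off').group(0))
# csi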
I'm intrigued to know the right way of doing this, but if it helps any, you can always build up your regex like this:
import re
string_to_look_in = "AUHDASOHDCSIAAOSLINDASOI"
string_to_match = "CSIABC"
re_to_use = "(" + "|".join([string_to_match[0:i] for i in range(len(string_to_match),0,-1)]) + ")"
re_result = re.search(re_to_use,string_to_look_in)
print string_to_look_in[re_result.start():re_result.end()]
Similar functionality is present in the vim editor ("sequence of optionally matched atoms"), where e.g. col\%[umn] matches col in color, colum in columbus and the full column.
I am not aware of similar functionality in Python's re,
but you can use nested anonymous groups, each followed by a ? quantifier, for that:
>>> import re
>>> words = ['color', 'columbus', 'column']
>>> rex = re.compile(r'col(?:u(?:m(?:n)?)?)?')
>>> for w in words: print rex.findall(w)
['col']
['colum']
['column']
