Problems using regular expressions with re.search and re.compile in Python

A Stack Overflow user has kindly pointed me to https://pythex.org/, which allows you to build and test regular expressions.
I have successfully been able to write the expression itself, but when it comes to actually using it in Python with the re module I am confused.
What I don't understand is when to use .compile and when to use re.search.
If I search for the text inside of brackets, for example, and there is more than one occurrence, I gather I am supposed to use .group(x), where x is the index of the item you want to return.
Example
pattern = re.compile(r'View All \((\d*)\)')
number = pattern.search(data).group(2);
As I understand it, if data contained the following, the number variable, when printed, would be:
View All (8) View All (16) View All (12)
Result
print number
16
What I don't get is: when there is more than one occurrence of the text you are looking for, how do you loop over them, and how do you get a count of how many there are?
For example, number.count() would return: found 3
for i in number: (this doesn't work because the match is a regular expression object???)
    print i
But what happens when there is only one occurrence of the text you are looking for?
Example
Regular Expression:
pattern = re.compile('[a-zA-Z]\s[a-zA-Z]/[a-zA-Z]/[a-zA-Z]')
email = pattern.search(data).group(1);
Result
data: "email-id":"FisrtName LastName/Australia/ABC"}]</p></body></html>
should return: firstname lastname/Australia/ABC
There may or may not be more than one of these on the page - in which case always using result[0] will not work, as there may be only one instance of the email address on the page.
Now I realise my syntax is obviously wrong, but doing this also gave me the following, so I'm looking for guidance on how to use a regular expression properly in Python once I have built it with https://pythex.org/:
email = pattern.search(data)
print email
<_sre.SRE_Match object at 0x0553B090>

It sounds to me like you're at a stage with Python regex where you need to read a bit of documentation or a full tutorial—rather than trying to acquire knowledge in disconnected pieces.
You have access to exactly the same match whether you compile the regex or not.
Quoting Jan Goyvaerts, author of RegexBuddy and co-author of the Regular Expressions Cookbook:
If you want to use the same regular expression more than once, you
should compile it into a regular expression object. Regular expression
objects are more efficient, and make your code more readable. To
create one, just call re.compile(regex) or re.compile(regex, flags).
The flags are the matching options described above for the re.search()
and re.match() functions.
The regular expression object returned by re.compile() provides all
the functions that the re module also provides directly: search(),
match(), findall(), finditer(), sub() and split(). The difference is
that they use the pattern stored in the regex object, and do not take
the regex as the first parameter. re.compile(regex).search(subject) is
equivalent to re.search(regex, subject).
For multiple matches, you can use findall or finditer (more details on the same page).
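For instance, here is a short sketch using the "View All (N)" sample data from your question; the pattern is the one you already have:
import re

data = "View All (8) View All (16) View All (12)"
pattern = re.compile(r'View All \((\d*)\)')

# findall returns a list of the captured groups, one entry per match
numbers = pattern.findall(data)
print(numbers)       # ['8', '16', '12']
print(len(numbers))  # 3 -- the "found 3" count you were after

# finditer yields one match object per occurrence, so you can loop over them
for m in pattern.finditer(data):
    print(m.group(1))  # prints 8, then 16, then 12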

Related

How to use a regular expression to remove all math expressions in a LaTeX file

Suppose I have a string which consists of part of a LaTeX file. How can I use the Python re module to remove any math expressions in it?
e.g:
text="This is an example $$a \text{$a$}$$. How to remove it? Another random math expression $\mathbb{R}$..."
I would like my function to return ans="This is an example . How to remove it? Another random math expression ...".
Thank you!
Try this Regex:
(\$+)(?:(?!\1)[\s\S])*\1
Explanation:
(\$+) - matches 1+ occurrences of $ and captures it in Group 1
(?:(?!\1)[\s\S])* - matches 0+ occurrences of any character, as long as the text captured in Group 1 does not start at that position
\1 - matches the contents of Group 1 again
Replace each match with a blank string.
As suggested by @torek, we should not match 3 or more consecutive $, hence changing the expression to (\${1,2})(?:(?!\1)[\s\S])*\1
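For example, a minimal sketch of that substitution in Python, using the sample string from the question (written as a raw string so the backslashes survive):
import re

text = r"This is an example $$a \text{$a$}$$. How to remove it? Another random math expression $\mathbb{R}$..."

# (\${1,2}) captures the opening delimiter ($ or $$); the tempered pattern
# then consumes everything up to the next occurrence of that same delimiter.
ans = re.sub(r'(\${1,2})(?:(?!\1)[\s\S])*\1', '', text)
print(ans)
# This is an example . How to remove it? Another random math expression ...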
It's commonly said that regular expressions cannot count, which is kind of a loose way of describing a problem more formally discussed in Count parentheses with regular expression. See that for what this means.
Now, with that in mind, note that LaTeX math expressions can include nested sub-equations, which can include further nested sub-equations, and so on. This is analogous to the problem of detecting whether a closing parenthesis closes an inner parenthesized expression (as in (for instance) this example, where the first one does not) or an outer parenthesis. Therefore, regular expressions are not going to be powerful enough to handle the full general case.
If you're willing to do a less-than-complete job, you can construct a regular expression that finds $...$ and $$...$$. You will need to pay attention to the particular regular expression language available. Python's is essentially the same as Perl's here.
Importantly, these $-matchers will completely miss \begin{equation} ... \end{equation}, \begin{eqnarray} ... \end{eqnarray}, and so on. We've already noted that handling LaTeX expression parsing with a mere regular expression recognizer is inadequate, so if you want to do a good job (setting aside the complexity of lower-level TeX manipulation of token types, where one can change any individual character's category code), you will want a more general parser. You can then tokenize \begin, {, }, and words, and match up the begin/end pairs. You can also tokenize $ and $$ and match those up. Since parsers can count, in exactly the way that regular expressions can't, you can do a much better job this way.
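To make the tokenizing idea concrete, here is a very small illustrative sketch (my own, not a full TeX parser): it pairs \begin{...}/\end{...} with a stack, treats $ and $$ as toggles, and drops everything inside a math region. It ignores plenty of corner cases (\[ and \(, escaped \$, verbatim environments, category-code changes):
import re

TOKEN = re.compile(r'\\begin\{(\w+)\}|\\end\{(\w+)\}|\$\$|\$|[\s\S]')

def strip_math(text):
    out = []
    stack = []     # currently open \begin{...} environments
    dollar = None  # currently open $ or $$ delimiter, if any
    for m in TOKEN.finditer(text):
        tok = m.group(0)
        if m.group(1):                      # \begin{env}: push and drop it
            stack.append(m.group(1))
            continue
        if m.group(2):                      # \end{env}: pop its partner
            if stack and stack[-1] == m.group(2):
                stack.pop()
                continue
        elif tok in ('$', '$$'):
            if dollar is None:
                dollar = tok                # opening delimiter
            elif dollar == tok:
                dollar = None               # matching closing delimiter
            continue                        # delimiters themselves are dropped
        if not stack and dollar is None:
            out.append(tok)                 # keep text outside any math region
    return ''.join(out)

print(strip_math(r"A $x$ B \begin{equation}y=1\end{equation} C"))
# A  B  C   (extra spaces are left where the math used to be)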

How do I extract definitions from an HTML file?

I'm trying to practice with regular expressions by extracting function definitions from Python's standard library built-in functions page. What I have worked out so far is that the definitions are generally printed between <dd><p> and </dd></dl>. When I try
import re
fname = open('functions.html').read()
deflst = re.findall(r'<dd><p>([\D3]+)</dd></dl>', fname)
it doesn't actually stop at </dd></dl>. This is probably something very silly that I'm missing here, but I've been really having a hard time trying to figure this one out.
Regular expressions are evaluated left to right, in a sense. So in your regular expression,
r'<dd><p>([\D3]+)</dd></dl>'
the regex engine will first look for a <dd><p>, then it will look at each of the following characters in turn, checking each for whether it's a nondigit or 3, and if so, add it to the match. It turns out that all the characters in </dd></dl> are in the class "nondigit or 3", so all of them get added to the portion matched by [\D3]+, and the engine dutifully keeps going. It will only stop when it finds a character that is a digit other than 3, and then go on and "notice" the rest of the regex (the </dd></dl>).
To fix this, you can use the reluctant quantifier like so:
r'<dd><p>([\D3]+?)</dd></dl>'
(note the added ?) which means the regex engine should be conservative in how much it adds to the match. Instead of trying to "gobble" as many characters as possible, it will now try to match the [\D3]+? to just one character and then go on and see if the rest of the regex matches, and if not it will try to match [\D3]+? with just two characters, and so on.
Basically, [\D3]+ matches the longest possible string of [\D3]'s that it can while still letting the full regex match, whereas [\D3]+? matches the shortest possible string of [\D3]'s that it can while still letting the full regex match.
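To see the difference concretely, here is a small sketch using a hand-made two-entry snippet in place of the real functions.html page:
import re

html = "<dd><p>first def</dd></dl> filler <dd><p>second def</dd></dl>"

greedy = re.findall(r'<dd><p>([\D3]+)</dd></dl>', html)
lazy = re.findall(r'<dd><p>([\D3]+?)</dd></dl>', html)

print(greedy)  # ['first def</dd></dl> filler <dd><p>second def'] -- one big match
print(lazy)    # ['first def', 'second def'] -- each match stops at the first </dd></dl>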
Of course one shouldn't really be using regular expressions to parse HTML in "the real world", but if you just want to practice regular expressions, this is probably as good a text sample as any.
By default, all quantifiers are greedy, which means they try to match as many characters as possible. You can add ? after a quantifier to make it lazy, so that it matches as few characters as possible. For example, \d+? matches at least one digit, but as few as possible.
Try r'<dd><p>([\D3]+?)</dd></dl>'

Python regular expression expansion

I am really bad with regular expressions, and I am stuck on generating all the possible combinations for a regular expression.
When the regular expression is abc-defghi00[1-24,2]-[1-20,23].walmart.com, it should generate all its possible combinations.
The text before the brackets can be anything, and the pattern inside the brackets is optional.
I need the Python experts' help with the code.
Sample output
Here is the expected output:
abc-defghi001-1.walmart.com
.........
abc-defghi001-20.walmart.com
abc-defghi001-23.walmart.com
..............
abc-defghi002-1.walmart.com
Repeat this from 1-24 and 2.
Regex tried
([a-z]+)(-)([a-z]+)(\[)(\d)(-)(\d+)(,?)(\d?)(\])(-)(\[)(\d)(-)(\d+)(,?)(\d?)(\])(.*)
Let's say we would like to match against abc-defghi001-1.walmart.com. Now, if we write the following regex, it does the job.
s = 'abc-defghi001-1.walmart.com'
re.match ('.*[1-24]-[1-20|23]\.walmart\.com',s)
and the output:
<_sre.SRE_Match object at 0x029462C0>
So, it's found. If you want to match up to 27 in the first bracket, you simply replace it by [1-24|27], or if you want to match 0 to 29, you simply replace it by [1-29]. And of course, you know that you have to write import re before all the above commands.
Edit 1: As far as I understand, you want to generate all instances of a regular expression and store them in a list.
Use the exrex Python library to do so; you can find further information in its documentation. You then have to limit the regex you use so that it describes a finite set of strings (see the sketch after the matching example below).
import re
s = 'abc-defghi001-1.walmart.com'
obj=re.match(r'^\w{3}-\w{6}00(1|2)-([1-20]|23)\.walmart\.com$',s)
print(obj.group())
The above regex will match the template you're looking for, I hope!
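For completeness, here is a hedged sketch of the exrex route mentioned above. The pattern is my own guess at the intended ranges (host numbers 001-024, suffixes 1-20 plus 23); the alternations spell the ranges out so the regex describes a finite set of strings:
# Requires the third-party exrex library (pip install exrex)
import exrex

pattern = r'abc-defghi0(0[1-9]|1[0-9]|2[0-4])-([1-9]|1[0-9]|20|23)\.walmart\.com'

# exrex.generate() yields every string the pattern can match
hosts = list(exrex.generate(pattern))
print(len(hosts))  # 24 host numbers * 21 suffixes = 504 hostnames
print(hosts[0])    # e.g. abc-defghi001-1.walmart.com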

regex to dictionary using group label ?P<>

I'm using the regex module instead of the default re module in Python:
https://code.google.com/p/mrab-regex-hg/wiki/GeneralDetails
I'm trying to do the following
>>> regex.compile('(?P<heavy>heavily|heavy)').search("My laptop is heavy or heavily").groupdict()
{'heavy': 'heavy'}
I expected it to return:
{'heavy': ['heavy', 'heavily']}
regex.findall will match both heavy and heavily, but that doesn't work with the group label.
I have to solve it with regex, so solutions that iterate through the string are not acceptable.
Have you read the Python documentation on regexes?
Relevant portion:
As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match.
This means that your regex:
(?P<heavy>heavily|heavy)
will find the first matching string, which is "heavy", and save that string. It then says "congrats, I'm done!" and finishes scanning.
It saves that string, heavy, into a group (as your regex requests) that is also called heavy. Your groupdict() call then returns this information. So you have a group named heavy with one match, also heavy, which gives you the return result of
{"heavy": "heavy"}
If you want both strings, you need an approach that captures more than one match.
To resolve your issue, there are two solutions.
Use the regex findall method, which will return a list, and then you can turn this list into a dictionary. This is the easier route.
Craft a regex that will actually find both terms and place them into the same group. While doable, this is very convoluted.
I highly recommend you use the findall method instead, if you wish to search for multiple matches.
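A minimal sketch of the findall route, reusing the sentence from your question (the third-party regex module and the standard re module behave the same way here):
import regex

text = "My laptop is heavy or heavily"
matches = regex.findall(r'heavily|heavy', text)  # ['heavy', 'heavily']

# build the dictionary shape you expected by hand
result = {'heavy': matches}
print(result)  # {'heavy': ['heavy', 'heavily']}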

Why is my Python regular expression checking for more than one group taking so long?

This question originates in Django URL resolver, but the problem seems to be a general one.
I want to match URLs built like this:
1,2,3,4,5,6/10,11,12/
The regular expression I'm using is:
^(?P<apples>([0123456789]+,?)+)/(?P<oranges>([0123456789]+,?)+)/$
When I try to match it against a "valid" URL (i.e. one that matches), I get an instant match:
In [11]: print datetime.datetime.now(); re.compile(r"^(?P<apples>([0123456789]+,?)+)/(?P<oranges>([0123456789]+,?)+)/$").search("114,414,415,416,417,418,419,420,113,410,411,412,413/1/"); print datetime.datetime.now()
2011-10-18 14:27:42.087883
Out[11]: <_sre.SRE_Match object at 0x2ab0960>
2011-10-18 14:27:42.088145
However, when I try to match an "invalid" (non-matching) URL, the whole regular expression takes orders of magnitude longer to return nothing:
In [12]: print datetime.datetime.now(); re.compile(r"^(?P<apples>([0123456789]+,?)+)/(?P<oranges>([0123456789]+,?)+)/").search("114,414,415,416,417,418,419,420,113,410,411,412,413/"); print datetime.datetime.now()
2011-10-18 14:29:21.011342
2011-10-18 14:30:00.573270
I assume there is something in the regexp engine that slows down extremely when several groups need to be matched. Is there any workaround for this? Maybe my regexp needs to be fixed?
This is a known deficiency in many regular expression engines, including Python's and Perl's. What is happening is the engine is backtracking and getting an exponential explosion of possible matches to try. Better regular expression engines do not use backtracking for such a simple regular expression.
You can fix it by getting rid of the optional comma. This is what is allowing the engine to look at a string like 123 and decide whether to parse it as (123) or (12)(3) or (1)(23) or (1)(2)(3). That's a lot of matches to try just for three digits, so you can see how it would explode rather quickly for a couple dozen digits.
^(?P<apples>[0-9]+(,[0-9]+)*)/(?P<oranges>[0-9]+(,[0-9]+)*)/$
This will make the regular expression engine always group 123,456 as (123),(456) and never as (12)(3),(4)(56) or something else. Because it will only match in that one way, the backtracking engine won't hit a combinatorial explosion of possible parses. Again, better regular expression engines do not suffer from this flaw.
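As a quick sanity check (a sketch; exact timings will vary), the rewritten pattern fails fast on the input that previously hung, and still matches the valid URL:
import re

fixed = re.compile(r'^(?P<apples>[0-9]+(,[0-9]+)*)/(?P<oranges>[0-9]+(,[0-9]+)*)/$')

# the "invalid" URL from the question: one group of numbers, only one slash
bad = "114,414,415,416,417,418,419,420,113,410,411,412,413/"
print(fixed.search(bad))  # None, returned almost instantly

good = "1,2,3,4,5,6/10,11,12/"
print(fixed.search(good).groupdict())  # {'apples': '1,2,3,4,5,6', 'oranges': '10,11,12'}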
Update: If I were writing it, I would do it this way:
^(?P<apples>[0-9,]+)/(?P<oranges>[0-9,]+)$
This would match a few bogus URLs (like ,/,), but you can always return a 404 after you've parsed and routed it.
try:
    apples = [int(x) for x in apples.split(',')]
except ValueError:
    pass  # return a 404 error here
You could use this regexp:
^(?P<apples>(?:\d+,)*\d+)/(?P<oranges>(?:\d+,)*\d+)/$
\d matches a digit
