Python regular expression expansion - python

I am really bad with regular expressions, and I am stuck trying to generate all the possible combinations for a regular expression.
When the regular expression is abc-defghi00[1-24,2]-[1-20,23].walmart.com, it should generate all its possible combinations.
The text before the brackets can be anything and the pattern inside the brackets is optional.
Need all the python experts to help me with the code.
Sample output
Here is the expected output:
abc-defghi001-1.walmart.com
.........
abc-defghi001-20.walmart.com
abc-defghi001-23.walmart.com
..............
abc-defghi002-1.walmart.com
Repeat this from 1-24 and 2.
Regex tried
([a-z]+)(-)([a-z]+)(\[)(\d)(-)(\d+)(,?)(\d?)(\])(-)(\[)(\d)(-)(\d+)(,?)(\d?)(\])(.*)

Lets say we would like to match against abc-defghi001-1.walmart.com. Now, if we write the following regex, it does the job.
s = 'abc-defghi001-1.walmart.com'
re.match ('.*[1-24]-[1-20|23]\.walmart\.com',s)
and the output:
<_sre.SRE_Match object at 0x029462C0>
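So it's found. Note, however, that square brackets define a character class that matches a single character: [1-24] matches one of 1, 2 or 4, not the numbers 1 to 24, and [1-20|23] matches one of 0, 1, 2, 3 or |. The pattern above matches only because both numbers in s are the single digit 1. To match multi-digit numbers such as 27, use alternation in a group instead, for example (?:[1-9]|1\d|2[0-4]|27) for 1 to 24 or 27. And of course, you have to write import re before all the above commands.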
Edit1: As far as I understand, you want to generate all instances of a regular expression and store them in a list.
Use the exrex python library to do so. You can find further information about it here. Then, you have to limit the regex you use.

import re
s = 'abc-defghi001-1.walmart.com'
obj=re.match(r'^\w{3}-\w{6}00(1|2)-([1-20]|23)\.walmart\.com$',s)
print(obj.group())
The above regex will match the template you're looking for I hope!
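Since the template in the question is not really a regular expression (a regex character class such as [1-24] matches one character, not a number range), another option is to expand the bracket notation directly. The following is a minimal sketch, assuming each bracket holds comma-separated integers or lo-hi ranges; the helper names expand_spec and expand_template are made up for illustration.

```python
import itertools
import re


def expand_spec(spec):
    """Expand a spec such as '1-24,2' into a list of ints without duplicates."""
    values = []
    for part in spec.split(','):
        part = part.strip()
        if '-' in part:
            lo, hi = part.split('-')
            values.extend(range(int(lo), int(hi) + 1))
        else:
            values.append(int(part))
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(values))


def expand_template(template):
    """Return every string described by a template with [..] number specs."""
    # Splitting on a capturing group alternates literal text and bracket contents
    pieces = re.split(r'\[([^\]]*)\]', template)
    literals = pieces[0::2]
    ranges = [expand_spec(p) for p in pieces[1::2]]
    results = []
    for combo in itertools.product(*ranges):
        out = literals[0]
        for value, literal in zip(combo, literals[1:]):
            out += str(value) + literal
        results.append(out)
    return results


hosts = expand_template('abc-defghi00[1-24,2]-[1-20,23].walmart.com')
print(len(hosts))   # 24 values in the first bracket times 21 in the second
print(hosts[0])     # abc-defghi001-1.walmart.com
```

The 2 after the comma in the first bracket is already covered by 1-24, so it is dropped as a duplicate rather than generated twice.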

Related

Exact search of a string that has parenthesis using regex

I am new to regexes.
I have the following string : \n(941)\n364\nShackle\n(941)\nRivet\n105\nTop
Out of this string, I want to extract Rivet and I already have (941) as a string in a variable.
My thought process was like this:
Find all the (941)s
filter the results by checking if the string after (941) is followed by \n, followed by a word, and ending with \n
I made a regex for the 2nd part: \n[\w\s\'\d\-\/\.]+$\n.
The problem I am facing is that because of the parenthesis in (941) the regex is taking 941 as a group. In the 3rd step the regex may be wrong, which I can fix later, but 1st I needed help in finding the 2nd (941) so then I can apply the 3rd step on that.
PS.
I know I can use python string methods like find and then loop over the searches, but I wanted to see if this can be done directly using regex only.
I have tried the following regexes: (?:...), (941){1}, and escaping the parentheses with \ to make them literal, like this: \(941\), with no useful results. Maybe I am using them wrong.
Just wanted to know if it is possible to do this using regex. I thought it might be useful for others too, or a good share for future viewers.
Thanks!
Assuming:
You want to avoid matching only digits;
Want to match a substring made of word-characters (thus including possible digits);
Try to escape the variable and use it in the regular expression through f-string:
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
var1 = '(941)'
var2 = re.escape(var1)
m = re.findall(fr'{var2}\n(?!\d+\n)(\w+)', s)[0]
print(m)
Prints:
Rivet
If you have text in a variable that should be matched exactly, use re.escape() to escape it when substituting into the regexp.
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'
re.findall(rf'(?<=\n{re.escape(num)}\n)[\w\s\'\d\-\/\.]+(?=\n)', s)
This puts (941)\n in a lookbehind, so it's not included in the match. This avoids a problem with the \n at the end of one match overlapping with the \n at the beginning of the next.

How to use regular expression to remove all math expression in latex file

Suppose I have a string which consists of a part of latex file. How can I use python re module to remove any math expression in it?
e.g:
text="This is an example $$a \text{$a$}$$. How to remove it? Another random math expression $\mathbb{R}$..."
I would like my function to return ans="This is an example . How to remove it? Another random math expression ...".
Thank you!
Try this Regex:
(\$+)(?:(?!\1)[\s\S])*\1
Explanation:
(\$+) - matches 1+ occurrences of $ and captures it in Group 1
(?:(?!\1)[\s\S])* - matches 0+ occurrences of any character that does not start with what was captured in Group 1
\1 - matches the contents of Group 1 again
Replace each match with a blank string.
As suggested by @torek, we should not match 3 or more consecutive $, hence changing the expression to (\${1,2})(?:(?!\1)[\s\S])*\1
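As a rough sketch of how this expression could be used from Python, the snippet below applies it with re.sub to the sample text from the question. The names MATH and strip_math are only illustrative, and the sample is written as a raw string so that \text is not read as a tab.

```python
import re

# One or two dollar signs, then anything up to the same delimiter again
MATH = re.compile(r'(\${1,2})(?:(?!\1)[\s\S])*\1')


def strip_math(text):
    """Replace every $...$ or $$...$$ span with an empty string."""
    return MATH.sub('', text)


text = r"This is an example $$a \text{$a$}$$. How to remove it? Another random math expression $\mathbb{R}$..."
ans = strip_math(text)
print(ans)
```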
It's commonly said that regular expressions cannot count, which is kind of a loose way of describing a problem more formally discussed in Count parentheses with regular expression. See that for what this means.
Now, with that in mind, note that LaTeX math expressions can include nested sub-equations, which can include further nested sub-equations, and so on. This is analogous to the problem of detecting whether a closing parenthesis closes an inner parenthesized expression (as in (for instance) this example, where the first one does not) or an outer parenthesis. Therefore, regular expressions are not going to be powerful enough to handle the full general case.
If you're willing to do a less-than-complete job, you can construct a regular expression that finds $...$ and $$...$$. You will need to pay attention to the particular regular expression language available. Python's is essentially the same as Perl's here.
Importantly, these $-matchers will completely miss \begin{equation} ... \end{equation}, \begin{eqnarray} ... \end{eqnarray}, and so on. We've already noted that handling LaTeX expression parsing with a mere regular expression recognizer is inadequate, so if you want to do a good job—while ignoring the complexity of lower-level TeX manipulation of token types, where one can change any individual character's category code —you will want a more general parser. You can then tokenize \begin, {, }, and words, and match up the begin/end pairs. You can also tokenize $ and $$ and match those up. Since parsers can count, in exactly the way that regular expressions can't, you can do a much better job this way.
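To illustrate the idea of matching up begin/end pairs, here is a minimal sketch that tokenizes \begin{...} and \end{...} and keeps a stack of open environments. It is far from a real LaTeX parser: the function name strip_environments and the default list of environment names are assumptions for the example, and it ignores comments, verbatim text and category-code changes.

```python
import re

# Tokens of interest: \begin{name} and \end{name}
TOKEN = re.compile(r'\\(begin|end)\{([^}]*)\}')


def strip_environments(text, names=('equation', 'eqnarray')):
    """Remove the listed environments, including anything nested inside them."""
    kept = []
    stack = []  # environments opened since the outermost math \begin
    pos = 0     # start of the next chunk of text to keep
    for m in TOKEN.finditer(text):
        kind, name = m.groups()
        if kind == 'begin' and (stack or name in names):
            if not stack:
                # Entering an outermost math environment: keep the text before it
                kept.append(text[pos:m.start()])
                pos = m.start()
            stack.append(name)
        elif kind == 'end' and stack and stack[-1] == name:
            stack.pop()
            if not stack:
                # Outermost environment closed: resume keeping text after it
                pos = m.end()
    # Unclosed environments are left in place rather than silently dropped
    kept.append(text[pos:])
    return ''.join(kept)


sample = r'A \begin{equation} x \begin{aligned} y \end{aligned} \end{equation} B'
print(strip_environments(sample))
```

The stack is what lets this handle nesting: the inner \end{aligned} does not end the removal, only the \end{equation} that empties the stack does.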

How do I extract definitions from a html file?

I'm trying to practice with regular expressions by extracting function definitions from Python's standard library built-in functions page. What I do have so far is that the definitions are generally printed between <dd><p> and </dd></dl>. When I try
import re
fhand = open('functions.html').read()
deflst = re.findall(r'<dd><p>([\D3]+)</dd></dl>', fhand)
it doesn't actually stop at </dd></dl>. This is probably something very silly that I'm missing here, but I've been really having a hard time trying to figure this one out.
Regular expressions are evaluated left to right, in a sense. So in your regular expression,
r'<dd><p>([\D3]+)</dd></dl>'
the regex engine will first look for a <dd><p>, then it will look at each of the following characters in turn, checking each for whether it's a nondigit or 3, and if so, add it to the match. It turns out that all the characters in </dd></dl> are in the class "nondigit or 3", so all of them get added to the portion matched by [\D3]+, and the engine dutifully keeps going. It will only stop when it finds a character that is a digit other than 3, and then go on and "notice" the rest of the regex (the </dd></dl>).
To fix this, you can use the reluctant quantifier like so:
r'<dd><p>([\D3]+?)</dd></dl>'
(note the added ?) which means the regex engine should be conservative in how much it adds to the match. Instead of trying to "gobble" as many characters as possible, it will now try to match the [\D3]+? to just one character and then go on and see if the rest of the regex matches, and if not it will try to match [\D3]+? with just two characters, and so on.
Basically, [\D3]+ matches the longest possible string of [\D3]'s that it can while still letting the full regex match, whereas [\D3]+? matches the shortest possible string of [\D3]'s that it can while still letting the full regex match.
Of course one shouldn't really be using regular expressions to parse HTML in "the real world", but if you just want to practice regular expressions, this is probably as good a text sample as any.
By default all quantifiers are greedy which means they want to match as many characters as possible. You can use ? after quantifier to make it lazy which matches as few characters as possible. \d+? matches at least one digit, but as few as possible.
Try r'<dd><p>([\D3]+?)</dd></dl>'
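The difference between the greedy and the lazy quantifier can be seen on a small made-up snippet (the html string below is invented for the example and is not taken from the real documentation page):

```python
import re

html = '<dl><dd><p>first</dd></dl><dl><dd><p>second</dd></dl>'

# Greedy: [\D3]+ runs to the end, then backs up to the LAST </dd></dl>
greedy = re.findall(r'<dd><p>([\D3]+)</dd></dl>', html)

# Lazy: [\D3]+? grows one character at a time and stops at the FIRST </dd></dl>
lazy = re.findall(r'<dd><p>([\D3]+?)</dd></dl>', html)

print(greedy)  # ['first</dd></dl><dl><dd><p>second']
print(lazy)    # ['first', 'second']
```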

problems using regex expressions re.search and re.compile

A stack overflow user has kindly shown me to https://pythex.org/ which allows you to build and test regular expressions.
I have successfully been able to write just the expression but when it comes to actually using it in python with the re. module I am confused.
What I don't understand is when to use .compile and when to use re.search.
If I search for the text inside of brackets, for example, and there is more than one match, I gather I am supposed to use .group[x], where x is the index of the item you want to return.
Example
pattern = re.compile(r'View All \((\d*)\)')
number = pattern.search(data).group(2);
As I understand if I had the following, the number_connections variable, when printed would be
View All (8) View All (16) View All (12)
Result
Print number
16
What I don't get is: when there is more than one occurrence of the text you are looking for, how do you loop over them, and how do you get a count of how many there are?
For example: number.count() would return, found 3
for i in number: (this doesn't work because match is a regular expression object???)
print i
But What happens when there is only one of the text you are looking for in the regular expression?
Example
Regular Expression:
pattern = re.compile('[a-zA-Z]\s[a-zA-Z]/[a-zA-Z]/[a-zA-Z]')
email = pattern.search(data).group(1);
Result
data: "email-id":"FisrtName LastName/Australia/ABC"}]</p></body></html>
should return: firstname lastname/Australia/ABC
There may or may not be more than one of these on the page - in which case always using result[0] will not work, as there may be only one instance of the email address on the page.
Now I realise my syntax is obviously wrong but doing this also gave me the following, so I'm looking for guidance on how to use the regular expression properly once I have built it using https://pythex.org/:
email = pattern.search(data)
print email
<_sre.SRE_Match object at 0x0553B090>
It sounds to me like you're at a stage with Python regex where you need to read a bit of documentation or a full tutorial—rather than trying to acquire knowledge in disconnected pieces.
You have access to exactly the same match whether you compile the regex or not.
Quoting Jan Goyvaerts, author of RegexBuddy and co-author of the Regular Expressions Cookbook:
If you want to use the same regular expression more than once, you
should compile it into a regular expression object. Regular expression
objects are more efficient, and make your code more readable. To
create one, just call re.compile(regex) or re.compile(regex, flags).
The flags are the matching options described above for the re.search()
and re.match() functions.
The regular expression object returned by re.compile() provides all
the functions that the re module also provides directly: search(),
match(), findall(), finditer(), sub() and split(). The difference is
that they use the pattern stored in the regex object, and do not take
the regex as the first parameter. re.compile(regex).search(subject) is
equivalent to re.search(regex, subject).
For multiple matches, you can use findall or finditer (more details on the same page).
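As a short sketch using the "View All" pattern and sample data from the question: findall returns a list of the captured strings, so len() gives the count, while finditer yields one match object per occurrence. search returns only the first match (or None), and group(1) refers to the first pair of capturing parentheses in the pattern, not to the second occurrence in the text.

```python
import re

data = 'View All (8) View All (16) View All (12)'
pattern = re.compile(r'View All \((\d*)\)')

# findall returns the captured group of every match as a list of strings
numbers = pattern.findall(data)
print(len(numbers))  # 3
print(numbers)       # ['8', '16', '12']

# finditer yields match objects, useful when positions or several groups are needed
for match in pattern.finditer(data):
    print(match.group(1), match.start())

# search returns only the first match, or None if there is none
first = pattern.search(data)
if first is not None:
    print(first.group(1))  # 8
```

The same loop works unchanged when there is only one occurrence, since findall then simply returns a one-element list.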

I need help figuring out some Python Regex

I have tried to properly wrap my head around the below but I still have big hole in my reasoning. What is ?::, and could someone explain it properly for me
rule_syntax = re.compile('(\\\\*)'\
'(?:(?::([a-zA-Z_][a-zA-Z_0-9]*)?()(?:#(.*?)#)?)'\
'|(?:<([a-zA-Z_][a-zA-Z_0-9]*)?(?::([a-zA-Z_]*)'\
'(?::((?:\\\\.|[^\\\\>]+)+)?)?)?>))')
There are two tools that you may wish to look into to help with your understanding
Regexper creates a visual representation of regex, here's yours:
Regexpal is a tool that allows you to input a regex and various strings and see what matches, here's yours with some example matches
(?:expr) is just like normal parentheses (expr), except that for purposes of retrieving groups later (backreferences, re.sub, or MatchObject.group), parenthesized groups beginning with ?: are excluded. This can be useful if you need to group a complex expression in parentheses to apply another operator like * to it, but don't want to get it mixed in with groups that you'll actually need to retrieve later.
?:: is simply ?: followed by a literal :.
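A small illustration of the difference, using toy patterns made up for the example rather than the rule_syntax expression itself:

```python
import re

# Capturing group: (ab) is numbered and shows up in groups()
capturing = re.match(r'(ab)+(c)', 'ababc')
print(capturing.groups())      # ('ab', 'c')

# Non-capturing group: (?:ab) still groups for the +, but is not numbered
non_capturing = re.match(r'(?:ab)+(c)', 'ababc')
print(non_capturing.groups())  # ('c',)

# (?:: ...) is a non-capturing group whose content starts with a literal colon
colon = re.match(r'(?::(\w+))', ':name')
print(colon.groups())          # ('name',)
```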
