I am trying to find a way using regex to match words that have 3 unique sets of double letters. So far I have this:
r".*([a-z])\1.*([a-z])\2.*([a-z])\3.*"
But that doesn't account for unique sets for double letters. Thanks in advance =)
Maybe like this? Seems to work for me.
r".*([a-z])\1.*((?=(?!\1))[a-z])\2.*((?=(?!\1))(?=(?!\2))[a-z])\3.*"
(?=expr) is a non-consuming regular expression, and (?!expr) is regex NOT operator.
(?=expr) is a non-consuming expression (a positive lookahead). However, (?!expr) is also a non-consuming expression (a negative lookahead): this time a 'not equals' in place of an 'equals'.
So enclosing the 'not' in the 'equals' adds nothing. It works without that as well. However stacking non-consuming expressions does not always work, and a single non-consuming will do the job anyway by using an 'or' ('|' character).
so
r".*([a-z])\1.*(?!\1)([a-z])\2.*(?!\1|\2)([a-z])\3.*"
Tidied some braces also. I think this is cleaner and will be more reliable between versions.
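For anyone who wants to try the two patterns side by side, here is a minimal Python sketch. The helper name has_three_distinct_doubles and the sample words are my own choices for illustration; only the two patterns come from this thread.

```python
import re

# The question's attempt: three doubled letters, not necessarily distinct.
ANY_THREE = re.compile(r".*([a-z])\1.*([a-z])\2.*([a-z])\3.*")

# The tidied version: each later pair must use a letter not seen in an earlier pair.
DISTINCT_THREE = re.compile(
    r".*([a-z])\1.*(?!\1)([a-z])\2.*(?!\1|\2)([a-z])\3.*"
)

def has_three_distinct_doubles(word):
    """Return True if word contains three doubled letters built from three different letters."""
    return DISTINCT_THREE.match(word) is not None

# Print both verdicts for a few sample words.
for word in ["bookkeeper", "committee", "balloon", "aabbaa"]:
    print(word, bool(ANY_THREE.match(word)), has_three_distinct_doubles(word))
```

The interesting case is "aabbaa": the first pattern accepts it because it only counts pairs, while the second rejects it because the third pair repeats the letter of the first.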
Related
I'm creating a javascript regex to match queries in a search engine string. I am having a problem with alternation. I have the following regex:
.*baidu.com.*[/?].*wd{1}=
I want to be able to match strings that have the string 'word' or 'qw' in addition to 'wd', but everything I try is unsuccessful. I thought I would be able to do something like the following:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
but it does not seem to work.
Replace [wd|word|qw] with (wd|word|qw) or (?:wd|word|qw).
[] denotes character sets, () denotes logical groupings.
Your expression:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
does need a few changes, including [wd|word|qw] to (wd|word|qw) and getting rid of the redundant {1}, like so:
.*baidu.com.*[/?].*(wd|word|qw)=
But you also need to understand that the first part of your expression (.*baidu.com.*[/?].*) will match baidu.com hello what spelling/handle????????? or hbaidu-com/ or even something like lkas----jhdf lkja$##!3hdsfbaidugcomlaksjhdf.[($?lakshf, because the dot (.) matches any character except newlines... to match a literal dot, you have to escape it with a backslash (like \.)
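The question is about JavaScript, but character sets, groups, and escaped dots behave the same way in Python's re module, so here is a rough sketch in Python of the difference. The URLs are made up for illustration, and the pattern drops the leading .* because search() scans the whole string anyway.

```python
import re

# Character set: matches ONE character out of w, d, |, o, r, q.
char_class = re.compile(r"[wd|word|qw]=")

# Non-capturing group with alternation, plus an escaped literal dot.
grouped = re.compile(r"baidu\.com.*[/?].*(?:wd|word|qw)=")

# Hypothetical sample URLs.
urls = [
    "http://www.baidu.com/s?wd=test",
    "http://www.baidu.com/s?word=test",
    "http://www.baidu.com/s?foo=test",
    "http://hbaidu-com/?wd=test",
]
for url in urls:
    print(url, bool(grouped.search(url)))

# The character set happily accepts a lone "o" before "=".
print(bool(char_class.search("http://example.com/?o=1")))
```

With the dot escaped, the "hbaidu-com" URL no longer slips through, which is the point made above about the unescaped dot.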
There are several approaches you could take to match things in a URL, but we could help you more if you tell us what you are trying to do or accomplish - perhaps regex is not the best solution or (EDIT) only part of the best solution?
Suppose I have a string which consists of a part of latex file. How can I use python re module to remove any math expression in it?
e.g:
text=r"This is an example $$a \text{$a$}$$. How to remove it? Another random math expression $\mathbb{R}$..."
I would like my function to return ans="This is an example . How to remove it? Another random math expression ...".
Thank you!
Try this Regex:
(\$+)(?:(?!\1)[\s\S])*\1
Explanation:
(\$+) - matches 1+ occurrences of $ and captures it in Group 1
(?:(?!\1)[\s\S])* - matches 0+ occurrences of any character (including newlines), stopping at the first position where the text captured in Group 1 begins again
\1 - matches the contents of Group 1 again
Replace each match with a blank string.
As suggested by @torek, we should not match 3 or more consecutive $, hence changing the expression to (\${1,2})(?:(?!\1)[\s\S])*\1
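In Python, the replacement step could look something like this. It is a minimal sketch: the function name remove_dollar_math is just a label I picked, and the sample string is written as a raw string so that \text and \mathbb keep their backslashes.

```python
import re

# The pattern from this answer: an opening $ or $$, then any characters up to
# the next occurrence of that same delimiter.
MATH = re.compile(r"(\${1,2})(?:(?!\1)[\s\S])*\1")

def remove_dollar_math(text):
    """Replace every $...$ or $$...$$ span with an empty string."""
    return MATH.sub("", text)

text = r"This is an example $$a \text{$a$}$$. How to remove it? Another random math expression $\mathbb{R}$..."
print(remove_dollar_math(text))
```

This only handles the $ and $$ delimiters, so anything written with \begin{...} ... \end{...} is left untouched.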
It's commonly said that regular expressions cannot count, which is kind of a loose way of describing a problem more formally discussed in Count parentheses with regular expression. See that for what this means.
Now, with that in mind, note that LaTeX math expressions can include nested sub-equations, which can include further nested sub-equations, and so on. This is analogous to the problem of detecting whether a closing parenthesis closes an inner parenthesized expression (as in (for instance) this example, where the first one does not) or an outer parenthesis. Therefore, regular expressions are not going to be powerful enough to handle the full general case.
If you're willing to do a less-than-complete job, you can construct a regular expression that finds $...$ and $$...$$. You will need to pay attention to the particular regular expression language available. Python's is essentially the same as Perl's here.
Importantly, these $-matchers will completely miss \begin{equation} ... \end{equation}, \begin{eqnarray} ... \end{eqnarray}, and so on. We've already noted that handling LaTeX expression parsing with a mere regular expression recognizer is inadequate, so if you want to do a good job (while ignoring the complexity of lower-level TeX manipulation of token types, where one can change any individual character's category code), you will want a more general parser. You can then tokenize \begin, {, }, and words, and match up the begin/end pairs. You can also tokenize $ and $$ and match those up. Since parsers can count, in exactly the way that regular expressions can't, you can do a much better job this way.
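To make the tokenize-and-match idea a bit more concrete, here is a rough Python sketch, not a real LaTeX parser. Everything in it is an assumption for illustration: the MATH_ENVS set, the token pattern, and the function name strip_math are mine, and it ignores category codes, comments, \( ... \), \[ ... \], and sequences such as an escaped backslash followed by a dollar sign.

```python
import re

# Assumption: which environments count as math. Extend as needed.
MATH_ENVS = {"equation", "equation*", "eqnarray", "eqnarray*", "align", "align*"}

# Tokens of interest: \begin{name}, \end{name}, an escaped \$, $$ and $.
TOKEN = re.compile(r"\\begin\{([^}]*)\}|\\end\{([^}]*)\}|\\\$|\$\$|\$")

def strip_math(text):
    """Drop outermost math spans, using a stack to pair up delimiters."""
    kept = []        # pieces of text that lie outside any math
    stack = []       # currently open delimiters, innermost last
    pos = 0          # where the current non-math piece began
    math_start = 0   # where the current outermost math span began
    for m in TOKEN.finditer(text):
        tok, begin_env, end_env = m.group(0), m.group(1), m.group(2)
        if tok == r"\$":
            continue                      # literal dollar sign, not a delimiter
        was_outside = not stack
        if begin_env is not None:
            # Outside math only math environments matter; inside, track them all.
            if stack or begin_env in MATH_ENVS:
                stack.append("end:" + begin_env)
        elif end_env is not None:
            if stack and stack[-1] == "end:" + end_env:
                stack.pop()
        elif stack and stack[-1] == tok:
            stack.pop()                   # matching $ or $$ closes the span
        elif tok == "$$" and stack and stack[-1] == "$":
            pass                          # "$a$$b$": closes one span, opens the next
        else:
            stack.append(tok)             # $ or $$ opens a (possibly nested) span
        if was_outside and stack:
            kept.append(text[pos:m.start()])
            math_start = m.start()
        elif not was_outside and not stack:
            pos = m.end()
    # Unterminated math is kept as-is rather than silently dropped.
    kept.append(text[math_start:] if stack else text[pos:])
    return "".join(kept)

print(strip_math(r"A \begin{equation} x = \begin{array}{c} 1 \end{array} \end{equation} B"))
```

The stack is what lets this handle a nested \begin{array} inside an equation, or a $a$ inside \text{} inside $$...$$, which is exactly the counting that a plain regular expression cannot do.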
I'm trying to practice with regular expressions by extracting function definitions from Python's standard library built-in functions page. What I do have so far is that the definitions are generally printed between <dd><p> and </dd></dl>. When I try
import re
fhand = open('functions.html').read()
deflst = re.findall(r'<dd><p>([\D3]+)</dd></dl>', fhand)
it doesn't actually stop at </dd></dl>. This is probably something very silly that I'm missing here, but I've been really having a hard time trying to figure this one out.
Regular expressions are evaluated left to right, in a sense. So in your regular expression,
r'<dd><p>([\D3]+)</dd></dl>'
the regex engine will first look for a <dd><p>, then it will look at each of the following characters in turn, checking each for whether it's a nondigit or 3, and if so, add it to the match. It turns out that all the characters in </dd></dl> are in the class "nondigit or 3", so all of them get added to the portion matched by [\D3]+, and the engine dutifully keeps going. It will only stop when it finds a character that is a digit other than 3, and then go on and "notice" the rest of the regex (the </dd></dl>).
To fix this, you can use the reluctant quantifier like so:
r'<dd><p>([\D3]+?)</dd></dl>'
(note the added ?) which means the regex engine should be conservative in how much it adds to the match. Instead of trying to "gobble" as many characters as possible, it will now try to match the [\D3]+? to just one character and then go on and see if the rest of the regex matches, and if not it will try to match [\D3]+? with just two characters, and so on.
Basically, [\D3]+ matches the longest possible string of [\D3]'s that it can while still letting the full regex match, whereas [\D3]+? matches the shortest possible string of [\D3]'s that it can while still letting the full regex match.
Of course one shouldn't really be using regular expressions to parse HTML in "the real world", but if you just want to practice regular expressions, this is probably as good a text sample as any.
By default all quantifiers are greedy which means they want to match as many characters as possible. You can use ? after quantifier to make it lazy which matches as few characters as possible. \d+? matches at least one digit, but as few as possible.
Try r'<dd><p>([\D3]+?)</dd></dl>'
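A quick way to see the greedy/lazy difference is to run both patterns on a tiny stand-in for the page. The html string below is a made-up miniature with two definitions and no digits, not the real functions.html.

```python
import re

# Hypothetical miniature of the page: two definitions, no digits in the text.
html = (
    "<dl><dt>abs</dt><dd><p>Return the absolute value.</p></dd></dl>"
    "<dl><dt>all</dt><dd><p>Return True if all items are true.</p></dd></dl>"
)

# Greedy: [\D3]+ runs to the end and backtracks to the LAST </dd></dl>.
greedy = re.findall(r"<dd><p>([\D3]+)</dd></dl>", html)

# Lazy: [\D3]+? grows one character at a time and stops at the FIRST </dd></dl>.
lazy = re.findall(r"<dd><p>([\D3]+?)</dd></dl>", html)

print(len(greedy), len(lazy))
print(lazy)
```

The greedy version returns a single match that swallows both definitions, while the lazy version returns one entry per definition.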
I'm using the following regex to clean up a document that has had apostrophes accidentally replaced with double quotes:
([a-zA-Z]\"[a-zA-Z])
That finds the first pattern match within the string, but not any subsequent ones. I've used the '*' operator after the group, which I understood would return multiple matches of that pattern, but this returns none. I've tested the regex here by adding double quotes to the example string.
Does anyone know what the operator I need is for this example?
Thanks
You might need to turn on global matching, which in Python is done by using re.findall() instead of re.search(). On Regexr, the global flag is enabled like this:
regex flags menu on top right corner http://puu.sh/kgLFC/5958420d09.png
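In Python the difference looks roughly like this. The sample sentence is invented, and the re.sub line with lookarounds is my own suggestion for the actual clean-up rather than something from the question.

```python
import re

text = 'I don"t think it"s Bob"s fault.'
pattern = r'[a-zA-Z]"[a-zA-Z]'

# re.search stops at the first match; re.findall returns all of them.
first = re.search(pattern, text)
every = re.findall(pattern, text)
print(first.group(), every)

# To repair the text, replace only the quote itself. The lookbehind and
# lookahead check the neighbouring letters without consuming them.
fixed = re.sub(r'(?<=[a-zA-Z])"(?=[a-zA-Z])', "'", text)
print(fixed)
```

re.sub replaces every match by default, so no extra flag is needed for the replacement step.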
So I have one variable that has all the code from some file. I need to remove all comments from this file. One of my regexp lines is this
x=re.sub('\/\*.*\*\/','',x,re.M,re.S);
What I want this to be doing is to remove all multi line comments. For some odd reason though, it's skipping two instances of */, and removing everything up to the third instance of */.
I'm pretty sure the reason is this third instance of */ has code after it, while the first two are by themselves on the line. I'm not sure why this matters, but I'm pretty sure that's why.
Any ideas?
.* will always match as many characters as possible. Try (.*?) - most implementations should try to match as few characters as possible then (should work without the brackets but not sure right now). So your whole pattern should look like this: \/\*.*?\*\/ or \/\*(.*?)\*\/
The expression .* is greedy, meaning that it will attempt to match as many characters as possible. Instead, use (.*?) which will stop matching characters as soon as possible.
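Here is a small Python sketch of the two behaviours, on a made-up source string. One more thing worth knowing: the signature is re.sub(pattern, repl, string, count=0, flags=0), so in the question's call re.M is taken as count and re.S as flags. Passing flags by keyword avoids that.

```python
import re

# Hypothetical input with three comments; the last one has code after it.
source = """int a; /* first
comment */
int b; /* second */
int c; /* third */ int d;
"""

# Greedy: one match from the first /* to the last */.
greedy = re.sub(r"/\*.*\*/", "", source, flags=re.S)

# Lazy: each /* pairs with the nearest following */.
lazy = re.sub(r"/\*.*?\*/", "", source, flags=re.S)

print(greedy)
print(lazy)
```

The greedy version deletes "int b;" and "int c;" along with the comments, while the lazy version removes only the three comments.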
The regular expression is "greedy" and, when presented with several stopping points, will take the farthest one. Regex has some patterns to help control this, in particular the
(?<!...)
which matches the following expression only if it is not preceded by a match of the pattern in the parentheses.
(?*...) was not in Python 2.4 but is a good choice if you are using a later version.