Regex match even number of letters - python

I need a regular expression in Python that matches only when a letter occurs an even number of times. For example:
AAA # no match
AA # match
fsfaAAasdf # match
sAfA # match
sdAAewAsA # match
AeAiA # no match
An even number of As SHOULD match.

Try this regular expression:
^[^A]*((AA)+[^A]*)*$
And if the As don’t need to be consecutive:
^[^A]*(A[^A]*A[^A]*)*$
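As a quick check, the second pattern can be run against the examples from the question. This is only a minimal sketch; the helper name has_even_a_count is made up for illustration and is not part of the answer.

```python
import re

# Second pattern from the answer: the As are consumed in pairs, with any
# number of non-A characters before, between and after them.
EVEN_A = re.compile(r'^[^A]*(A[^A]*A[^A]*)*$')

def has_even_a_count(text):
    """Return True if text contains an even number of 'A' (zero included)."""
    return EVEN_A.match(text) is not None

for sample in ["AAA", "AA", "fsfaAAasdf", "sAfA", "sdAAewAsA", "AeAiA"]:
    print(sample, has_even_a_count(sample))
```

Note that a string with no A at all also matches, since zero is even.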

This searches for a block with an odd number of A's. If you find one, the string is bad for you:
(?<!A)A(AA)*(?!A)
If I understand correctly, the Python code should look like:
if re.search(r"(?<!A)A(AA)*(?!A)", "AeAAi"):
    print("fail")
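A small sketch of how this check behaves, assuming the goal is to reject strings that contain a run of consecutive A's of odd length. The name has_odd_run is hypothetical.

```python
import re

# A run of A's with no A directly before or after it, made of one A plus
# any number of AA pairs, i.e. a run of odd length.
ODD_RUN = re.compile(r'(?<!A)A(AA)*(?!A)')

def has_odd_run(text):
    """Return True if text contains a run of consecutive A's of odd length."""
    return ODD_RUN.search(text) is not None

for sample in ["AeAAi", "AA", "AAA", "AAAA", "sAfA"]:
    print(sample, has_odd_run(sample))
```

Because this looks at consecutive runs, it also flags "sAfA" (two isolated A's), which the question lists as a string that should match. It fits the "consecutive" reading of the question only.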

Why work so hard coming up with a hard to read pattern? Just search for all occurrences of the pattern and count how many you find.
len(re.findall("A", "AbcAbcAbcA")) % 2 == 0
That should be instantly understandable by all experienced programmers, whereas a pattern like "(?
Simple is better.
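For a single literal character the same count can be taken without a regex at all. A minimal sketch, with has_even_count as a made-up helper name and str.count standing in for re.findall:

```python
def has_even_count(text, letter="A"):
    """Return True if letter occurs an even number of times in text."""
    # str.count counts non-overlapping occurrences of the substring.
    return text.count(letter) % 2 == 0

for sample in ["AAA", "AA", "fsfaAAasdf", "sAfA", "sdAAewAsA", "AeAiA"]:
    print(sample, has_even_count(sample))
```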

'A*' means match any number of A's. Even 0.
Here's how to match a string with an even number of a's, upper or lower:
re.compile(r'''
^
[^a]*
(
(
a[^a]*
){2}
# if there must be at least 2 (not just 0), change the
# '*' on the following line to '+'
)*
$
''',re.IGNORECASE|re.VERBOSE)
You probably are using a as an example. If you want to match a specific character other than a, replace a with %s and then insert
[...]
$
'''%( other_char, other_char, other_char )
[...]

'*' means 0 or more occurrences
'AA' should do the trick.
The question is if you want the thing to match 'AAA'. In that case you would have to do something like:
r = re.compile('(^|[^A])(AA)+(?!A)',)
r.search(p)
That would match an even (and only an even) number of 'A'.
Now if you want to match 'if there is any even number of subsequent letters', this would do the trick:
re.compile(r'(.)\1')
However, this wouldn't exclude the 'odd' occurrences. But it is not clear from your question if you really want that.
Update:
This works for your test cases:
re.compile('^([^A]*)AA([^A]|AA)*$')

First of all, note that /A*/ matches the empty string.
Secondly, there are some things that you just can't do with regular expressions. This'll be a lot easier if you just walk through the string and count up all occurrences of the letter you're looking for.

A* means match "A" zero or more times.
For an even number of "A", try: (AA)+

It's impossible to count arbitrarily using regular expressions. For example, making sure that you have matching parentheses. To count you need 'memory', which requires something at least as strong as a pushdown automaton, although in this case you can use the regular expression that #Gumbo provided.
The suggestion to use finditer is the best workaround for the general case.

Related

Regex for string that has 5 numbers or IND/5numbers

I am trying to build a regex to match 5 digit numbers or those 5 digit numbers preceded by IND/
10223 match to return 10223
IND/10110 match to return 10110
ID is 11233 match to return 11233
Ref is:10223 match to return 10223
Ref is: th10223 not match
SBI12234 not match
MRF/10234 not match
RBI/10229 not match
I have used the following regex, which selects the 5 digits correctly using the word boundary concept. But I'm not sure how to allow IND and not allow anything else like MRF, etc.:
\b\d{5}\b
If I put (IND)? at the beginning of the regex then it won't help. Any hints?
Use a look behind:
(?<=^IND\/|^ID is |^)\d{5}\b
See live demo.
Because the look behind doesn’t consume any input, the entire match is your target number (ie there’s no need to use a group).
Variable-length lookbehind is not supported by Python; use alternation instead:
(?:(?<=IND/| is[: ])\d{5}|^\d{5})(?!\d)
Demo
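A minimal sketch that runs the alternation pattern against the examples from the question. The helper name extract_number is hypothetical. Both lookbehind alternatives are four characters wide, which is what lets Python's re module accept them.

```python
import re

# Either five digits directly preceded by "IND/" or by " is" plus a colon
# or space, or five digits at the very start; no further digit may follow.
PATTERN = re.compile(r'(?:(?<=IND/| is[: ])\d{5}|^\d{5})(?!\d)')

def extract_number(text):
    """Return the 5-digit number if text matches, otherwise None."""
    m = PATTERN.search(text)
    return m.group() if m else None

samples = ["10223", "IND/10110", "ID is 11233", "Ref is:10223",
           "Ref is: th10223", "SBI12234", "MRF/10234", "RBI/10229"]
for sample in samples:
    print(repr(sample), extract_number(sample))
```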
This should work: (?<=IND/|\s|^)(\d{5})(?=\s|$) .
Try this: (?:IND\/|ID is |^)\b(\d{5})\b
Explanation:
(?: ALLOWED TEXT): A non-capture group with all allowed segments inside. In your example, IND\/ for "IND/", ID is for "ID is ...", and ^ for the beginning of the string (in case of only the number / no text at start: 12345).
\b(\d{5})\b: Your existing pattern w/ capture group for 5-digit number
I feel like this will need some logic to it. The regex can find the 5 digits, but maybe a second regex pattern is needed to find IND, then join them together if need be. Not sure if you are using Python, .NET, or Java, but it should be doable.

Limiting regex length

I'm having an issue in Python creating a regex to get each occurrence that matches a regex.
I have this code that I made that I need help with.
import re
strToSearch = "1A851B 1C331 1A3X1 1N111 1A3 and a whole lot of random other words."
print(re.findall(r'\d{1}[A-Z]{1}\d{3}', strToSearch.upper())) #1C331, 1N111
print(re.findall(r'\d{1}[A-Z]{1}\d{1}[X]\d{1}', strToSearch.upper())) #1A3X1
print(re.findall(r'\d{1}[A-Z]{1}\d{3}[A-Z]{1}', strToSearch.upper())) #1A851B
print(re.findall(r'\d{1}[A-Z]{1}\d{1}', strToSearch.upper())) #1A3
>['1A851', '1C331', '1N111']
>['1A3X1']
>['1A851B']
>['1A8', '1C3', '1A3', '1N1', '1A3']
As you can see, it returns "1A851" in the first one, which I don't want. How do I keep it from showing up in the first regex? One thing to know is that it may appear in the string like " words words 1A851B?", so I need to keep the punctuation from being grabbed.
Also, how can I combine these into one regex? Essentially, my end goal is to run an if statement in Python similar to the pseudocode below.
lstResults = []
strToSearch= " Alot of 1N1X1 people like to eat 3C191 cheese and I'm a 1A831B aka 1A8."
lstResults = re.findall('<REGEX HERE>', strToSearch)
for r in lstResults:
    print(r)
And the desired output would be
1N1X1
3C191
1A831B
1A8
With single regex pattern:
strToSearch= " Alot of 1N1X1 people like to eat 3C191 cheese and I'm a 1A831B aka 1A8."
lstResults = [i[0] for i in re.findall(r'(\d[A-Z]\d{1,3}(X\d|[A-Z])?)', strToSearch)]
print(lstResults)
The output:
['1N1X1', '3C191', '1A831B', '1A8']
You may use word boundaries:
\b\d{1}[A-Z]{1}\d{3}\b
See demo
For the combination, it is unclear by which criterion you consider a word a "random word", but you can use something like this:
[A-Z\d]*\d[A-Z\d]*[A-Z][A-Z\d]*
This is a word that contains at least a digit and at least a non-digit character. See demo.
Or maybe you can use:
\b\d[A-Z\d]*[A-Z][A-Z\d]*
for a word that starts with a digit and contains at least a non-digit character. See demo.
Or if you want to combine exactly those regexes, use:
\b\d[A-Z]\d(X\d|\d{2}[A-Z]?)?\b
See the final demo.
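A sketch of how the combined pattern might be used from Python. One assumption differs from the pattern as written: the group is made non-capturing, because re.findall returns the group contents rather than the whole match when a pattern contains a capturing group. The name find_codes is made up.

```python
import re

# digit, letter, digit, then optionally either "X" + digit or two more
# digits with an optional trailing letter, all between word boundaries.
CODE = re.compile(r'\b\d[A-Z]\d(?:X\d|\d{2}[A-Z]?)?\b')

def find_codes(text):
    """Return every code-like token found in text."""
    return CODE.findall(text)

text = " Alot of 1N1X1 people like to eat 3C191 cheese and I'm a 1A831B aka 1A8."
for code in find_codes(text):
    print(code)
```

The trailing \b is what keeps "1A851" from being reported for "1A851B", and it also leaves punctuation such as "." or "?" out of the match.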
If you want to find "words" where there are both digits and letters mixed, the easiest is to use the word-boundary operator, \b; but notice that you need to use r'' strings / escape the \ in the code (which you would need to do for the \d anyway in future Python versions). To match any sequence of alphanumeric characters separated by word boundary, you could use
r'\b[0-9A-Z]+\b'
However, this wouldn't yet guarantee that there is at least one number and at least one letter. For that we will use positive zero-width lookahead assertion (?= ) which means that the whole regex matches only if the contained pattern matches at that point. We need 2 of them: one ensures that there is at least one digit and one that there is at least one letter:
>>> p = r'\b(?=[0-9A-Z]*[0-9])(?=[0-9A-Z]*[A-Z])[0-9A-Z]+\b'
>>> re.findall(p, '1A A1 32 AA 1A123B')
['1A', 'A1', '1A123B']
This will now match everything including 33333A or AAAAAAAAAA3A for as long as there is at least one digit and one letter. However if the pattern will always start with a digit and always contain a letter, it becomes slightly easier, for example:
>>> p = r'\b\d+[A-Z][0-9A-Z]*\b'
>>> re.findall(p, '1A A1 32 AA 1A123B')
['1A', '1A123B']
i.e. A1 didn't match because it doesn't start with a digit.

Python - match all repetition of a char between groups of chars

I've a little problem with a regex best described below:
Original string is:
{reply_to={message_type=login}|login_id=pippo|user_description=pippo=pluto|version=2013.2.1|utc_offset=7200|login_date=2014-07-03|login_time=09:43:02|error=0}
This is what I would like to obtain:
{reply_to:{message_type:login}|login_id:pippo|user_description:pippo=pluto|version:2013.2.1|utc_offset:7200|login_date:2014-07-03|login_time:09:43:02|error:0}
It happens that if there is an "=" also in the value of the key I cannot substitute it.
What I've tried to do is to match and substitute grouping a set of chars:
re.sub(r'([\{\}\|])=([\{\}\|])',r'\1":"\2',modOutput)
Obviously it doesn't work! Any idea?
This works at least with the given example:
re.sub(r'=([^{|}]*)', r':\1', s)
We're looking for a =, then capturing up to the next delimiter (one of {|}) in order to skip over subsequent = signs.
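Applied to the string from the question, that substitution can be sketched as follows. The wrapper name to_colons is hypothetical.

```python
import re

def to_colons(text):
    """Replace the '=' that separates a key from its value with ':'."""
    # After each matched '=', the group swallows everything up to the next
    # '{', '|' or '}', so any further '=' inside the value is left alone.
    return re.sub(r'=([^{|}]*)', r':\1', text)

s = ("{reply_to={message_type=login}|login_id=pippo|"
     "user_description=pippo=pluto|version=2013.2.1|utc_offset=7200|"
     "login_date=2014-07-03|login_time=09:43:02|error=0}")
print(to_colons(s))
```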

python regexp matching one or more characters in a group, except for particular alternatives

I want to match all combinations of <>=/*+- except for = and =>. How can I do this?
r = re.compile(r'[<>=/*+-]+')
This matches one or more characters in the set but I don't know how to prevent it from matching the = or => patterns. I'd guess it has something to do with negative lookahead or lookbehind but it's hard for me to wrap my head around that.
clarification: I literally want to match all combinations of the characters in <>=/*+- except for = and =>. In other words, I want to find maximal-length consecutive substrings consisting only of these characters -- and if the substring equals = or =>, it should not be considered a match.
I apologize for not clarifying earlier, but it seemed like a simple enough problem statement not to need the extra clarification.
Example cases:
pow pow -> bah bah contains the match ->
a +++->* b // c contains the matches +++->* and //
=> 3 <= 4 = 5 == 6 contains the matches <= and == (remember, = and => are not matches)
a <=> b <#> c contains the matches <=> and < and >
---= =--- contains the matches ---= and =---
edited: Implemented abarnert's suggestions below:
I would split this into two parts:
The first part will return a list of all matches - including the '=>' and '=' that you don't wish to match.
p1 = re.compile(r'[<>=/*+-]+')
The second part will filter these matches out.
all_matches = p1.finditer(your_string)
matches = [match.group() for match in all_matches if match.group() not in ('=', '=>')]
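Put together, the two steps might look like this when run on the example cases from the question. It is a sketch only, and find_operators is a made-up name.

```python
import re

# Step 1: every maximal run of operator characters.
OPERATOR_RUN = re.compile(r'[<>=/*+-]+')
# Step 2: runs that should not count as matches.
EXCLUDED = {'=', '=>'}

def find_operators(text):
    """Return all operator runs in text except a bare '=' or '=>'."""
    return [m.group() for m in OPERATOR_RUN.finditer(text)
            if m.group() not in EXCLUDED]

for line in ["pow pow -> bah bah", "a +++->* b // c",
             "=> 3 <= 4 = 5 == 6", "a <=> b <#> c", "---= =---"]:
    print(repr(line), find_operators(line))
```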
This might work:
pat = re.compile(r'((?!=|=>)[<>=/*+-]+)')
It uses negative look-around syntax, described in detail here:
Regular expression to match a line that doesn't contain a word?
EDIT: The simple look-around above will unfortunately match ">" when fed "=>" so to work around that it can get a little hairy:
pat = re.compile(r'((?!=>|(?!=)>)([<>/*+-]|[<>=/*+-]{2,10}))')
I'm assuming you don't want to match strings longer than 10. This separates the matches into single-character operators (from which we exclude "=") and multi-character operators (where "=" is ok) except for "=>" -- It also excludes an edge case we're not interested in, just the ">" of the rejected "=>"
This is completely unreadable, however, and if it makes it into your code there should be copious comments. Agree with other commenters that a single regex is not suited for this problem.

Isolate the first number after a letter with regular expressions

I am trying to parse a chemical formula that is given to me in unicode in the format C7H19N3.
I wish to isolate the position of the first number after each letter, i.e. 7 is at index 1 and 1 is at index 3. With this I want to insert "sub" in front of the digits.
My first couple of attempts had me looping through, trying to isolate the position of only the first numbers, but to no avail.
I think that regular expressions can accomplish this, though I'm quite lost in them.
My end goal is to output the formula Csub7Hsub19Nsub3 so that my text editor can properly format it.
How about this?
>>> re.sub(r'(\d+)', r'sub\g<1>', "C7H19N3")
'Csub7Hsub19Nsub3'
(\d+) is a capturing group that matches 1 or more digits. \g<1> is a way of referring to the saved group in the substitute string.
Something like this with lookahead and lookbehind:
>>> strs = 'C7H19N3'
>>> re.sub(r'(?<!\d)(?=\d)','sub',strs)
'Csub7Hsub19Nsub3'
This matches the following positions in the string:
C^7H^19N^3 # ^ represents the positions matched by the regex.
Here is one which literally matches the first digit after a letter:
>>> re.sub(r'([A-Z])(\d)', r'\1sub\2', "C7H19N3")
'Csub7Hsub19Nsub3'
It's functionally equivalent but perhaps more expressive of the intent? \1 is a shorter version of \g<1>, and I also used raw string literals (r'\1sub\2' instead of '\1sub\2').
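Since the question also asks for the positions themselves, here is a minimal sketch that returns the index of the first digit of each number and then does the insertion. The function names are made up, and "Cl2" is an input invented here to show where the approaches differ; it does not appear in the question.

```python
import re

def first_digit_positions(formula):
    """Return the index of each digit that is not preceded by another digit."""
    return [m.start() for m in re.finditer(r'(?<!\d)\d', formula)]

def add_sub(formula):
    """Insert 'sub' in front of every number in formula."""
    return re.sub(r'(?<!\d)(?=\d)', 'sub', formula)

print(first_digit_positions("C7H19N3"))
print(add_sub("C7H19N3"))

# The letter-based variant only fires after an uppercase letter, so a
# two-letter symbol such as the "Cl" in "Cl2" is left unchanged by it.
print(re.sub(r'([A-Z])(\d)', r'\1sub\2', "Cl2"))
print(add_sub("Cl2"))
```

So the patterns agree on C7H19N3, but they are not interchangeable for every formula.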
