Regular Expression in Python


I don't know how to match a string with a regular expression when the string has the format below.
[ any symbol 0~n times any number 1~n times] 1~n times.
It looks like phone number matching. The difference is that any symbols and white space can be inserted between the numbers, for example
458###666###2##111####111
OR
(123)))444###555%%6222%%%%
I don't know if I explained the question clearly.
Anyway, thanks for your reply.

I think this represents the pattern you described
^(?:(\D?)\1*\d+)+$
^ matches the start of the string
(\D?)\1* will match an optional non digit (\D), put it into a backreference and match this same character again 0 or more times using \1*
\d+ at least 1 digit
(?:(\D?)\1*\d+)+ the complete non capturing group is repeated 1 or more times
$ matches the end of the string
It will match
458###666###2##111####111
(123)))444###555%%6222%%%%1
(((((((((123)))444###555%%6222%%%%1
But not
s(123)))444###555%%6222%%%%1
(123)))444###555%%6222%%%%
Your statement:
[ any symbol 0~n times any number 1~n times] 1~n times.
does not fit your second example (123)))444###555%%6222%%%%, which does not end with a digit.
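As a quick check of the behaviour listed above, the pattern can be run against the same sample strings in Python. This is only a sketch: the helper name matches and the extra mixed-symbol sample 12#%34 are illustrative additions, not part of the question. The last sample shows that this pattern expects each run of symbols to repeat one and the same character.

```python
import re

# Pattern from the answer: each block is one symbol repeated 0..n times,
# followed by 1..n digits; the block itself repeats 1..n times.
PATTERN = re.compile(r'^(?:(\D?)\1*\d+)+$')

def matches(s):
    """Return True if the whole string fits the pattern (hypothetical helper)."""
    return PATTERN.match(s) is not None

samples = [
    '458###666###2##111####111',
    '(123)))444###555%%6222%%%%1',
    '(((((((((123)))444###555%%6222%%%%1',
    's(123)))444###555%%6222%%%%1',
    '(123)))444###555%%6222%%%%',
    '12#%34',  # assumption: two different symbols in a row
]

for s in samples:
    print(matches(s), s)
```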

If you need to gather all the groups of digits from the string, you can use the \d+ regex:
>>> re.findall(r'\d+', '458###666###2##111####111 OR (123)))444###555%%6222%%%%')
['458', '666', '2', '111', '111', '123', '444', '555', '6222']

[ NOTE, I am ignoring the 'in python', opting instead for a more general 'build regular expressions' answer, in the hope that this will not only provide the desired answer but be something to take away for different RE-related problems ]
First, you want to match any symbol (or possibly any symbol except a digit), 0 or more times. That would be one of .* or [^0-9]* (the first is the 'anything' wildcard, the second is a character class of everything except the digits 0 to 9). The * means 'match zero or more times'.
Second, you want to match one or more digits. That, too, is relatively easy: [0-9]+ (or if you have a sufficiently old and anal RE library, [0-9][0-9]*, but that is highly unlikely to be the case outside a CS exam).
Third, you want to group that and repeat the grouping at least one time.
The general syntax for grouping is to enclose the group in parentheses (except in emacs, where you need \( and \), because a plain parenthesis matches itself literally there). So, something along the lines of ([^0-9]*[0-9]+)+ should do the trick.
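Since the question was about Python, here is a minimal sketch of that final expression using the re module. The helper name fits is made up for illustration, the group is written as non-capturing, and re.fullmatch is used on the assumption that the whole string should fit the format.

```python
import re

# Non-capturing form of ([^0-9]*[0-9]+)+ ; fullmatch anchors both ends.
GENERAL = re.compile(r'(?:[^0-9]*[0-9]+)+')

def fits(s):
    """Return True if the entire string is symbols-then-digits, repeated."""
    return GENERAL.fullmatch(s) is not None

print(fits('458###666###2##111####111'))   # digits at the end: fits
print(fits('(123)))444###555%%6222%%%%'))  # trailing symbols: does not fit
print(fits('12#%34'))                      # mixed symbols are allowed here
```

Unlike a back-reference pattern, this one does not care whether the symbols in a run are all the same character.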

Related

What is a regex expression that can prune down repeating identical characters down to a maximum of two repeats?

I feel I am having the most difficulty explaining this well enough for a search engine to pick up on what I'm looking for. The behavior is essentially this:
string = "aaaaaaaaare yooooooooou okkkkkk"
would become "aare yoou okk", with the maximum number of repeats for any given character being two.
Matching the excess duplicates and then re.sub-ing them seems to me the approach to take, but I can't figure out the regex I need.
The only attempt I feel is even worth posting is this - (\w)\1{3,0}
Which matched only the first instance of a character repeating more than three times - so only one match, and the whole block of repeated characters, not just the ones exceeding the max of 2. Any help is appreciated!

The regexp should be (\w)\1{2,} to match a character followed by at least 2 repetitions. That's 3 or more when you include the initial character.
The replacement is then \1\1 to replace with just two repetitions.
import re
string = "aaaaaaaaare yooooooooou okkkkkk"
new_string = re.sub(r'(\w)\1{2,}', r'\1\1', string)
print(new_string)  # aare yoou okk

You could write
import re
string = "aaaaaaaaare yooooooooou okkkkkk"
rgx = r'(\w)\1*(?=\1\1)'
re.sub(rgx, '', string)
#=> "aare yoou okk"
The regular expression can be broken down as follows.
(\w) # match one word character and save it to capture group 1
\1* # match the content of capture group 1 zero or more times
(?= # begin a positive lookahead
\1\1 # match the content of capture group 1 twice
) # end the positive lookahead
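To see that the two approaches agree, here is a small side-by-side sketch. The function names are invented for this example; the regular expressions are the ones from the two answers.

```python
import re

def squeeze_by_replacement(text):
    # Match a run of 3+ identical word characters and replace it with two copies.
    return re.sub(r'(\w)\1{2,}', r'\1\1', text)

def squeeze_by_lookahead(text):
    # Delete the leading part of a run, stopping while two copies still lie ahead.
    return re.sub(r'(\w)\1*(?=\1\1)', '', text)

sample = "aaaaaaaaare yooooooooou okkkkkk"
print(squeeze_by_replacement(sample))
print(squeeze_by_lookahead(sample))
```

Both should print "aare yoou okk". Note that \w also covers digits and the underscore, so runs of those are squeezed as well.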

Limiting regex length

I'm having an issue in Python creating a regex to get each occurrence that matches.
I have this code that I made that I need help with.
strToSearch= "1A851B 1C331 1A3X1 1N111 1A3 and a whole lot of random other words."
print(re.findall('\d{1}[A-Z]{1}\d{3}', strToSearch.upper())) #1C331, 1N111
print(re.findall('\d{1}[A-Z]{1}\d{1}[X]\d{1}', strToSearch.upper())) #1A3X1
print(re.findall('\d{1}[A-Z]{1}\d{3}[A-Z]{1}', strToSearch.upper())) #1A851B
print(re.findall('\d{1}[A-Z]{1}\d{1}', strToSearch.upper())) #1A3
>['1A851', '1C331', '1N111']
>['1A3X1']
>['1A851B']
>['1A8', '1C3', '1A3', '1N1', '1A3']
As you can see, it returns "1A851" in the first one, which I don't want. How do I keep it from showing up in the first regex? One thing to know is that it may appear in the string like " words words 1A851B?", so I need to keep the punctuation from being grabbed.
Also, how can I combine these into one regex? Essentially, my end goal is to run code in Python similar to the pseudocode below.
lstResults = []
strToSearch= " Alot of 1N1X1 people like to eat 3C191 cheese and I'm a 1A831B aka 1A8."
lstResults = re.findall('<REGEX HERE>', strToSearch)
for r in lstResults:
    print(r)
And the desired output would be
1N1X1
3C191
1A831B
1A8

With single regex pattern:
strToSearch= " Alot of 1N1X1 people like to eat 3C191 cheese and I'm a 1A831B aka 1A8."
lstResults = [i[0] for i in re.findall(r'(\d[A-Z]\d{1,3}(X\d|[A-Z])?)', strToSearch)]
print(lstResults)
The output:
['1N1X1', '3C191', '1A831B', '1A8']

You may use word boundaries:
\b\d{1}[A-Z]{1}\d{3}\b
For the combination, the criterion by which you consider a word a "random word" is unclear, but you can use something like this:
[A-Z\d]*\d[A-Z\d]*[A-Z][A-Z\d]*
This matches a word that contains a digit followed, somewhere later, by a letter.
Or maybe you can use:
\b\d[A-Z\d]*[A-Z][A-Z\d]*
for a word that starts with a digit and contains at least one letter.
Or, if you want to combine exactly those regexes, use:
\b\d[A-Z]\d(X\d|\d{2}[A-Z]?)?\b
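A minimal sketch of the last pattern above applied to the sample sentence from the question. The only change is an assumption on my side: the group is made non-capturing, (?:...), so that re.findall returns whole matches rather than the group's contents. The names CODE, text and codes are illustrative.

```python
import re

# Same pattern as the last one above, with a non-capturing group so that
# re.findall returns the full match instead of the inner group.
CODE = re.compile(r'\b\d[A-Z]\d(?:X\d|\d{2}[A-Z]?)?\b')

text = " Alot of 1N1X1 people like to eat 3C191 cheese and I'm a 1A831B aka 1A8."
codes = CODE.findall(text.upper())
print(codes)  # ['1N1X1', '3C191', '1A831B', '1A8']
```

The word boundaries keep trailing punctuation such as the final "." out of the result and stop a five-character match from being cut out of a longer code.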

If you want to find "words" in which digits and letters are mixed, the easiest way is to use the word-boundary operator, \b. Notice, however, that you need to use r'' strings or escape the \ in the code: in an ordinary string literal \b is a backspace character, and unrecognized escapes such as \d have triggered a DeprecationWarning since Python 3.6 (a SyntaxWarning since 3.12). To match any sequence of alphanumeric characters delimited by word boundaries, you could use
r'\b[0-9A-Z]+\b'
However, this wouldn't yet guarantee that there is at least one digit and at least one letter. For that we will use a positive zero-width lookahead assertion, (?= ), which means that the whole regex matches only if the contained pattern matches at that point. We need two of them: one ensures that there is at least one digit, the other that there is at least one letter:
>>> p = r'\b(?=[0-9A-Z]*[0-9])(?=[0-9A-Z]*[A-Z])[0-9A-Z]+\b'
>>> re.findall(p, '1A A1 32 AA 1A123B')
['1A', 'A1', '1A123B']
This will now match everything, including 33333A or AAAAAAAAAA3A, as long as there is at least one digit and one letter. However, if the pattern will always start with a digit and always contain a letter, it becomes slightly easier, for example:
>>> p = r'\b\d+[A-Z][0-9A-Z]*\b'
>>> re.findall(p, '1A A1 32 AA 1A123B')
['1A', '1A123B']
i.e. A1 didn't match because it doesn't start with a digit.

Python Regular Expression -- not matching digits at end of string

This will be really quick marks for someone...
Here's my string:
Jan 13.BIGGS.04222 ABC DMP 15
I'm looking to match:
the date at the front, in (mmm yy) format
the name in the second field
the digits at the end. There could be between one and three.
Here is what I have so far:
(\w{3} \d{2})\.(\w*)\..*(\d{1,3})$
Through a lot of playing around with http://www.pythonregex.com/ I can get to matching the '5', but not '15'.
What am I doing wrong?

Use .*? to match .* non-greedily:
In [9]: re.search(r'(\w{3} \d{2})\.(\w*)\..*?(\d{1,3})$', text).groups()
Out[9]: ('Jan 13', 'BIGGS', '15')
Without the question mark, .* matches as many characters as possible, including the digit you want to match with \d{1,3}.

As an alternative to what @unutbu has proposed, you can also use the word boundary \b, which matches at the edge of a word:
(\w{3} \d{2})\.(\w*)\..*\b(\d{1,3})$
From the site you referred to:
>>> regex = re.compile(r"(\w{3} \d{2})\.(\w*)\..*\b(\d{1,3})$")
>>> regex.findall('Jan 13.BIGGS.04222 ABC DMP 15')
[(u'Jan 13', u'BIGGS', u'15')]

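The .* before the digits is greedy and matches as much as it can, leaving as few digits as possible for the last group. You need to either make it non-greedy (with ?, as unutbu said) or stop it from swallowing the leading digits by requiring a non-digit right before the group, i.e. .*\D(\d{1,3})$. Simply replacing .* with \D* would not work for this string, because the third field (04222) itself contains digits.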
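The variants discussed in these answers can be compared side by side. This is a sketch under a few assumptions: the names text, head, variants and last_group are illustrative, and the "non-digit" variant reads the \D suggestion as "require one non-digit immediately before the final group".

```python
import re

text = 'Jan 13.BIGGS.04222 ABC DMP 15'
head = r'(\w{3} \d{2})\.(\w*)\.'  # date and name fields, shared by all variants

variants = {
    'greedy':        head + r'.*(\d{1,3})$',    # original attempt
    'lazy':          head + r'.*?(\d{1,3})$',   # non-greedy .*?
    'word boundary': head + r'.*\b(\d{1,3})$',  # \b before the digits
    'non-digit':     head + r'.*\D(\d{1,3})$',  # \D before the digits
}

# Collect what the third group captures under each variant.
last_group = {name: re.search(p, text).group(3) for name, p in variants.items()}
for name, value in last_group.items():
    print(f'{name:14} -> {value}')
```

Only the greedy form should capture the lone '5'. The other three leave both digits for the final group.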

Isolate the first number after a letter with regular expressions

I am trying to parse a chemical formula that is given to me in Unicode in the format C7H19N3.
I wish to isolate the position of the first digit after each letter, i.e. 7 is at index 1 and 1 is at index 3. With this I want to insert "sub" in front of the digits.
My first couple of attempts had me looping through the string, trying to isolate the position of only the first digits, but to no avail.
I think that regular expressions can accomplish this, though I'm quite lost in them.
My end goal is to output the formula Csub7Hsub19Nsub3 so that my text editor can properly format it.

How about this?
>>> re.sub('(\d+)', 'sub\g<1>', "C7H19N3")
'Csub7Hsub19Nsub3'
(\d+) is a capturing group that matches 1 or more digits. \g<1> is a way of referring to the saved group in the substitute string.

Something like this with lookahead and lookbehind:
>>> strs = 'C7H19N3'
>>> re.sub(r'(?<!\d)(?=\d)','sub',strs)
'Csub7Hsub19Nsub3'
This matches the following positions in the string:
C^7H^19N^3 # ^ represents the positions matched by the regex.

Here is one which literally matches the first digit after a letter:
>>> re.sub(r'([A-Z])(\d)', r'\1sub\2', "C7H19N3")
'Csub7Hsub19Nsub3'
It's functionally equivalent here, but perhaps more expressive of the intent. \1 is a shorter version of \g<1>, and I also used raw string literals (r'\1sub\2' instead of '\1sub\2', in which '\1' would be read as the control character \x01).
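The approaches agree on C7H19N3, but they can differ on other inputs. The sketch below wraps two of them in functions (the names are invented here) and also tries MgCl2, a formula with two-letter element symbols. That second input is my own assumption about what might come up, not something the question mentions.

```python
import re

def add_sub_markers(formula):
    # Insert "sub" before every run of digits, whatever precedes it.
    return re.sub(r'(\d+)', r'sub\g<1>', formula)

def add_sub_after_capital(formula):
    # Insert "sub" only between a capital letter and the digit that follows it.
    return re.sub(r'([A-Z])(\d)', r'\1sub\2', formula)

for formula in ('C7H19N3', 'MgCl2'):
    print(formula, add_sub_markers(formula), add_sub_after_capital(formula))
```

If formulas with lowercase letters before a count are possible, the digit-run version (or the lookaround version) is the safer choice, since the capital-letter version leaves MgCl2 unchanged.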

Meaning of regex Python

Is the meaning of this regex: (\d+).*? - group a set of digits, then take whatever comes after (only one occurrence of it at maximum, except a newline)?
Is there a difference in: (\d+) and [\d]+?

Take as many digits as possible (at least 1), then as few characters as possible (except newline). The non-greedy qualifier (?) doesn't really help unless the rest of your pattern follows it; otherwise it will just match as little as possible, which in this case is always 0 characters.
>>> import re
>>> re.match(r'(\d+).*?', '123').group()
'123'
>>> re.match(r'(\d+).*?', '123abc').group()
'123'
The difference between (\d+) and [\d]+ is the fact that the former groups and the latter doesn't. ([\d]+) would however be equivalent.
>>> re.match(r'(\d+)', '123abc').groups()
('123',)
>>> re.match(r'[\d]+', '123abc').groups()
()
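A short sketch of the two points above: a lazy quantifier only grows when something after it forces it to, and where the + sits relative to the parentheses changes what the group holds. The variable names are illustrative.

```python
import re

# A trailing lazy quantifier matches nothing at all...
trailing = re.match(r'(\d+).*?', '123abc').group()
# ...but it expands as far as needed once something follows it.
up_to_c = re.match(r'(\d+).*?c', '123abc').group()
to_end = re.match(r'(\d+).*?$', '123abc').group()

# Quantifier inside the group: the group holds the whole run of digits.
whole_run = re.match(r'(\d+)', '123abc').group(1)
# Quantifier outside the group: the group holds only the last repetition.
last_digit = re.match(r'(\d)+', '123abc').group(1)

print(trailing, up_to_c, to_end, whole_run, last_digit)
```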

(\d+) one or more digits, captured as a group,
.* followed by any characters,
? the lazy modifier, i.e. match as little as possible.

group1 will be at least one number and group0 will contain group1 and maybe other characters but not necessarily.
edit to answer the edited question: AFAIK there should be no difference in the matching between those 2 other than the grouping.
