Limiting regex length - python

I'm having an issue in python creating a regex to get each occurance that matches a regex.
I have this code that I made that I need help with.
strToSearch= "1A851B 1C331 1A3X1 1N111 1A3 and a whole lot of random other words."
print(re.findall('\d{1}[A-Z]{1}\d{3}', strToSearch.upper())) #1C331, 1N111
print(re.findall('\d{1}[A-Z]{1}\d{1}[X]\d{1}', strToSearch.upper())) #1A3X1
print(re.findall('\d{1}[A-Z]{1}\d{3}[A-Z]{1}', strToSearch.upper())) #1A851B
print(re.findall('\d{1}[A-Z]{1}\d{1}', strToSearch.upper())) #1A3
>['1A851', '1C331', '1N111']
>['1A3X1']
>['1A851B']
>['1A8', '1C3', '1A3', '1N1', '1A3']
As you can see it returns "1A851" in the first one, which I don't want it to. How do I keep it from showing in the first regex? Some things for you to know is it may appear in the string like " words words 1A851B?" so I need to keep the punctuation from being grabbed.
Also how can I combine these into one regex. Essentially my end goal is to run an if statement in python similar to the pseudo code below.
lstResults = []
strToSearch= " Alot of 1N1X1 people like to eat 3C191 cheese and I'm a 1A831B aka 1A8."
lstResults = re.findall('<REGEX HERE>', strToSearch)
for r in lstResults:
print(r)
And the desired output would be
1N1X1
3C191
1A831B
1A8

With single regex pattern:
strToSearch= " Alot of 1N1X1 people like to eat 3C191 cheese and I'm a 1A831B aka 1A8."
lstResults = [i[0] for i in re.findall(r'(\d[A-Z]\d{1,3}(X\d|[A-Z])?)', strToSearch)]
print(lstResults)
The output:
['1N1X1', '3C191', '1A831B', '1A8']

Yo may use word boundaries:
\b\d{1}[A-Z]{1}\d{3}\b
See demo
For the combination, it is unclear the criterium according to which you consider a word "random word", but you can use something like this:
[A-Z\d]*\d[A-Z\d]*[A-Z][A-Z\d]*
This is a word that contains at least a digit and at least a non-digit character. See demo.
Or maybe you can use:
\b\d[A-Z\d]*[A-Z][A-Z\d]*
dor a word that starts with a digit and contains at least a non-digit character. See demo.
Or if you want to combine exactly those regex, use.
\b\d[A-Z]\d(X\d|\d{2}[A-Z]?)?\b
See the final demo.

If you want to find "words" where there are both digits and letters mixed, the easiest is to use the word-boundary operator, \b; but notice that you need to use r'' strings / escape the \ in the code (which you would need to do for the \d anyway in future Python versions). To match any sequence of alphanumeric characters separated by word boundary, you could use
r'\b[0-9A-Z]+\b'
However, this wouldn't yet guarantee that there is at least one number and at least one letter. For that we will use positive zero-width lookahead assertion (?= ) which means that the whole regex matches only if the contained pattern matches at that point. We need 2 of them: one ensures that there is at least one digit and one that there is at least one letter:
>>> p = r'\b(?=[0-9A-Z]*[0-9])(?=[0-9A-Z]*[A-Z])[0-9A-Z]+\b'
>>> re.findall(p, '1A A1 32 AA 1A123B')
['1A', 'A1', '1A123B']
This will now match everything including 33333A or AAAAAAAAAA3A for as long as there is at least one digit and one letter. However if the pattern will always start with a digit and always contain a letter, it becomes slightly easier, for example:
>>> p = r'\b\d+[A-Z][0-9A-Z]*\b'
>>> re.findall(p, '1A A1 32 AA 1A123B')
['1A', '1A123B']
i.e. A1 didn't match because it doesn't start with a digit.

Related

What is a regex expression that can prune down repeating identical characters down to a maximum of two repeats?

I feel I am having the most difficulty explaining this well enough for a search engine to pick up on what I'm looking for. The behavior is essentially this:
string = "aaaaaaaaare yooooooooou okkkkkk"
would become "aare yoou okk", with the maximum number of repeats for any given character is two.
Matching the excess duplicates, and then re.sub -ing it seems to me the approach to take, but I can't figure out the regex statement I need.
The only attempt I feel is even worth posting is this - (\w)\1{3,0}
Which matched only the first instance of a character repeating more than three times - so only one match, and the whole block of repeated characters, not just the ones exceeding the max of 2. Any help is appreciated!
The regexp should be (\w)\1{2,} to match a character followed by at least 2 repetitions. That's 3 or more when you include the initial character.
The replacement is then \1\1 to replace with just two repetitions.
string = "aaaaaaaaare yooooooooou okkkkkk"
new_string = re.sub(r'(\w)\1{2,}', r'\1\1', string)
You could write
string = "aaaaaaaaare yooooooooou okkkkkk"
rgx = (\w)\1*(?=\1\1)
re.sub(rgx, '', string)
#=> "aare yoou okk"
Demo
The regular expression can be broken down as follows.
(\w) # match one word character and save it to capture group 1
\1* # match the content of capture group 1 zero or more times
(?= # begin a positive lookahead
\1\1 # match the content of capture group 1 twice
) # end the positive lookahead

Extracting two strings from between two characters. Why doesn't my regex match and how can I improve it?

I'm learning about regular expressions and I to want extract a string from a text that has the following characteristic:
It always begins with the letter C, in either lowercase or
uppercase, which is then followed by a number of hexadecimal
characters (meaning it can contain the letters A to F and numbers
from 1 to 9, with no zeros included).
After those hexadecimal
characters comes a letter P, also either in lowercase or uppercase
And then some more hexadecimal characters (again, excluding 0).
Meaning I want to capture the strings that come in between the letters C and P as well as the string that comes after the letter P and concatenate them into a single string, while discarding the letters C and P
Examples of valid strings would be:
c45AFP2
CAPF
c56Bp26
CA6C22pAAA
For the above examples what I want would be to extract the following, in the same order:
45AF2 # Original string: c45AFP2
AF # Original string: CAPF
56B26 # Original string: c56Bp26
A6C22AAA # Original string: CA6C22pAAA
Examples of invalid strings would be:
BCA6C22pAAA # It doesn't begin with C
c56Bp # There aren't any characters after P
c45AF0P2 # Contains a zero
I'm using python and I want a regex to extract the two strings that come both in between the characters C and P as well as after P
So far I've come up with this:
(?<=\A[cC])[a-fA-F1-9]*(?<=[pP])[a-fA-F1-9]*
A breakdown would be:
(?<=\A[cC]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [cC] and that [cC] must be at the beginning of the string
[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times
(?<=[pP]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [pP]
[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times
But with the above regex I can't match any of the strings!
When I insert a | in between (?<=[cC])[a-fA-F1-9]* and (?<=[pP])[a-fA-F1-9]* it works.
Meaning the below regex works:
(?<=[cC])[a-fA-F1-9]*|(?<=[pP])[a-fA-F1-9]*
I know that | means that it should match at most one of the specified regex expressions. But it's non greedy and it returns the first match that it finds. The remaining expressions aren’t tested, right?
But using | means the string BCA6C22pAAA is a partial match to AAA since it comes after P, even though the first assertion isn't true, since it doesn't begin with a C.
That shouldn't be the case. I want it to only match if all conditions explained in the beginning are true.
Could someone explain to me why my first attempt doesn't produces the result I want? Also, how can I improve my regex?
I still need it to:
Not be a match if the string contains the number 0
Only be a match if ALL conditions are met
Thank you
To match both groups before and after P or p
(?<=^[Cc])[1-9a-fA-F]+(?=[Pp]([1-9a-fA-F]+$))
(?<=^[Cc]) - Positive Lookbehind. Must match a case insensitive C or c at the start of the line
[1-9a-fA-F]+ - Matches hexadecimal characters one or more times
(?=[Pp] - Positive Lookahead for case insensitive p or P
([1-9a-fA-F]+$) - Cature group for one or more hexadecimal characters following the pP
View Demo
Your main problem is you're using a look behind (?<=[pP]) for something ahead, which will never work: You need a look ahead (?=...).
Also, the final quantifier should be + not * because you require at least one trailing character after the p.
The final mistake is that you're not capturing anything, you're only matching, so put what you want to capture inside brackets, which also means you can remove all look arounds.
If you use the case insensitive flag, it makes the regex much smaller and easier to read.
A working regex that captures the 2 hex parts in groups 1 and 2 is:
(?i)^c([a-f1-9]*)p([a-f1-9]+)
See live demo.
Unless you need to use \A, prefer ^ (start of input) over \A (start of all input in multi line scenario) because ^ easier to read and \A won't match every line, which is what many situations and tools expect. I've used ^.

python3: regex, find all substrings that starts with and end with certain string

Let's say that I have a string that looks like this:
a = '1253abcd4567efgh8910ijkl'
I want to find all substrings that starts with a digit, and ends with an alphabet.
I tried,
b = re.findall('\d.*\w',a)
but this gives me,
['1253abcd4567efgh8910ijkl']
I want to have something like,
['1234abcd','4567efgh','8910ijkl']
How can I do this? I'm pretty new to regex method, and would really appreciate it if anyone can show how to do this in different method within regex, and explain what's going on.
\w will match any wordcharacter which consists of numbers, alphabets and the underscore sign. You need to use [a-zA-Z] to capture letters only. See this example.
import re
a = '1253abcd4567efgh8910ijkl'
b = re.findall('(\d+[A-Za-z]+)',a)
Output:
['1253abcd', '4567efgh', '8910ijkl']
\d will match digits. \d+ will match one or more consecutive digits. For e.g.
>>> re.findall('(\d+)',a)
['1253', '4567', '8910']
Similarly [a-zA-Z]+ will match one or more alphabets.
>>> re.findall('([a-zA-Z]+)',a)
['abcd', 'efgh', 'ijkl']
Now put them together to match what you exactly want.
From the Python manual on regular expressions, it tells us that \w:
matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]
So you are actually over capturing what you need. Refine your regular expression a bit:
>>> re.findall(r'(\d+[a-z]+)', a, re.I)
['1253abcd', '4567efgh', '8910ijkl']
The re.I makes your expression case insensitive, so it will match upper and lower case letters as well:
>>> re.findall(r'(\d+[a-z]+)', '12124adbad13434AGDFDF434348888AAA')
['12124adbad']
>>> re.findall(r'(\d+[a-z]+)', '12124adbad13434AGDFDF434348888AAA', re.I)
['12124adbad', '13434AGDFDF', '434348888AAA']
\w matches string with any alphanumeric character. And you have used \w with *. So your code will provide a string which is starting with a digit and contains alphanumeric characters of any length.
Solution:
>>>b=re.findall('\d*[A-Za-z]*', a)
>>>b
['1253abcd', '4567efgh', '8910ijkl', '']
you will get '' (an empty string) at the end of the list to display no match. You can remove it using
b.pop(-1)

regex conditional matching

I am trying to use re.findall to find this pattern:
01-234-5678
regex:
(\b\d{2}(?P<separator>[-:\s]?)\d{2}(?P=separator)\d{3}(?P=separator)\d{3}(?:(?P=separator)\d{4})?,?\.?\b)
however, some cases have shortened to 01-234-5 instead of 01-234-0005 when the last four digits are 3 zeros followed by a non-zero digit.
Since there does't seem to be any uniformity in formatting I had to account for a few different separator characters or possibly none at all. Luckily, I have only noticed this shortening when some separator has been used...
Is it possible to use a regex conditional to check if a separator does exist (not an empty string), then also check for the shortened variation?
So, something like if separator != '': re.findall(r'(\b\d{2}(?P<separator>[-:\s]?)\d{3}(?P=separator)(\d{4}|\d{1})\.?\b)', text)
Or is my only option to include all the possibly incorrect 6 digit patterns then check for a separator with python?
If you want the last group of digits to be "either one or four digits", try:
>>> import re
>>> example = "This has one pattern that you're expecting, 01-234-5678, and another that maybe you aren't: 23:456:7"
>>> pattern = re.compile(r'\b(\d{2}(?P<sep>[-:\s]?)\d{3}(?P=sep)\d(?:\d{3})?)\b')
>>> pattern.findall(example)
[('01-234-5678', '-'), ('23:456:7', ':')]
The last part of the pattern, \d(?:\d{3})?), means one digit, optionally followed by three more (i.e. one or four). Note that you don't need to include the optional full stop or comma, they're already covered by \b.
Given that you don't want to capture the case where there is no separator and the last section is a single digit, you could deal with that case separately:
r'\b(\d{9}|\d{2}(?P<sep>[-:\s])\d{3}(?P=sep)\d(?:\d{3})?)\b'
# ^ exactly nine digits
# ^ or
# ^ sep not optional
See this demo.
It is not clear why you are using word boundaries, but I have not seen your data.
Otherwise you can shorten the entire this to this:
re.compile(r'\d{2}(?P<separator>[-:\s]?)\d{3}(?P=separator)\d{1,4}')
Note that \d{1,4} matched a string with 1, 2, 3 or 4 digits
If there is no separator, e.g. "012340008" will match the regex above as you are using [-:\s]? which matches 0 or 1 times.
HTH

Isolate the first number after a letter with regular expressions

I am trying to parse a chemical formula that is given to me in unicode in the format C7H19N3
I wish to isolate the position of the first number after the letter, I.e 7 is at index 1 and 1 is at index 3. With is this i want to insert "sub" infront of the digits
My first couple attempts had me looping though trying to isolate the position of only the first numbers but to no avail.
I think that Regular expressions can accomplish this, though im quite lost in it.
My end goal is to output the formula Csub7Hsub19Nsub3 so that my text editor can properly format it.
How about this?
>>> re.sub('(\d+)', 'sub\g<1>', "C7H19N3")
'Csub7Hsub19Nsub3'
(\d+) is a capturing group that matches 1 or more digits. \g<1> is a way of referring to the saved group in the substitute string.
Something like this with lookahead and lookbehind:
>>> strs = 'C7H19N3'
>>> re.sub(r'(?<!\d)(?=\d)','sub',strs)
'Csub7Hsub19Nsub3'
This matches the following positions in the string:
C^7H^19N^3 # ^ represents the positions matched by the regex.
Here is one which literally matches the first digit after a letter:
>>> re.sub(r'([A-Z])(\d)', r'\1sub\2', "C7H19N3")
'Csub7Hsub19Nsub3'
It's functionally equivalent but perhaps more expressive of the intent? \1 is a shorter version of \g<1>, and I also used raw string literals (r'\1sub\2' instead of '\1sub\2').

Categories