Why does this regex to find repeated characters fail? - python

I'm trying to build a regex to match any occurrence of two or more repeated alphanumeric characters. The following regex fails:
import re
s = '__commit__'
m = re.search(r'([a-zA-Z0-9])\1\1', s)
But when I change it to this it works:
m = re.search(r'([a-zA-Z0-9])\1+', s)
I'm pretty baffled as to why this is the way it is. Can anyone provide some insight?

Look at this line.
m = re.search(r'([a-zA-Z0-9])\1\1', s)
This pattern uses one capture group followed by two backreferences (a backreference matches the same text that the group already matched). It therefore matches only when at least three identical characters appear consecutively. You can do:
m = re.search(r'([a-zA-Z0-9])\1', s)
This matches when at least two identical characters appear consecutively.
However, the following one is better if you want the match to cover the whole run:
m = re.search(r'([a-zA-Z0-9])\1+', s)
That's because \1+ matches one or more backreferences, so the pattern needs a minimum of two consecutive characters but also consumes any further repeats.
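As a quick check of the behaviour described above, the following sketch runs the three patterns against the string from the question. The variable names and the extra test string 'baaad' are illustrative assumptions, not part of the original post.

```python
import re

s = '__commit__'

# One capture plus two backreferences needs three identical characters in a row.
three_in_a_row = re.search(r'([a-zA-Z0-9])\1\1', s)

# One capture plus one backreference needs only two.
two_in_a_row = re.search(r'([a-zA-Z0-9])\1', s)

# \1+ also needs at least two, but it consumes the whole run.
whole_run = re.search(r'([a-zA-Z0-9])\1+', 'baaad')

print(three_in_a_row)        # None, because 'mm' is only two characters long
print(two_in_a_row.group())  # mm
print(whole_run.group())     # aaa
```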

The \1 is a back-reference to the first capturing group. So the original regex that does not work for you essentially means:
Match 3 consecutive occurrences of the character captured by the group. In this case the group ([a-zA-Z0-9]) captures a single character a-z, A-Z or 0-9. You then have two \1 in your regex, which account for two back-references to that captured character.
In the second regex the back-reference \1 has a + after it, which means match at least one more occurrence of the previously captured character, so a string conforming to this pattern has to contain at least 2 identical consecutive characters.
Hope this helps.

Related

What is a regex expression that can prune down repeating identical characters down to a maximum of two repeats?

I feel I am having the most difficulty explaining this well enough for a search engine to pick up on what I'm looking for. The behavior is essentially this:
string = "aaaaaaaaare yooooooooou okkkkkk"
would become "aare yoou okk", so that the maximum number of repeats for any given character is two.
Matching the excess duplicates and then removing them with re.sub seems to me the approach to take, but I can't figure out the regex I need.
The only attempt I feel is even worth posting is this - (\w)\1{3,0}
Which matched only the first instance of a character repeating more than three times - so only one match, and the whole block of repeated characters, not just the ones exceeding the max of 2. Any help is appreciated!
The regexp should be (\w)\1{2,} to match a character followed by at least 2 repetitions. That's 3 or more when you include the initial character.
The replacement is then \1\1 to replace with just two repetitions.
string = "aaaaaaaaare yooooooooou okkkkkk"
new_string = re.sub(r'(\w)\1{2,}', r'\1\1', string)
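A minimal sketch of the same idea wrapped in a reusable function. The name limit_repeats and its max_repeats parameter are hypothetical, introduced only for illustration, and max_repeats is assumed to be at least 1.

```python
import re

def limit_repeats(text, max_repeats=2):
    """Collapse runs of one word character to at most max_repeats copies."""
    # (\w) captures one character; \1{n,} then requires at least n more copies,
    # so the pattern only fires on runs longer than max_repeats.
    pattern = re.compile(r'(\w)\1{%d,}' % max_repeats)
    # A function replacement avoids building a '\1\1...' template string.
    return pattern.sub(lambda m: m.group(1) * max_repeats, text)

result = limit_repeats("aaaaaaaaare yooooooooou okkkkkk")
print(result)  # aare yoou okk
```

With the default of 2 this behaves like the re.sub call above; passing 1 would collapse every run to a single character.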
You could write
string = "aaaaaaaaare yooooooooou okkkkkk"
rgx = r'(\w)\1*(?=\1\1)'
re.sub(rgx, '', string)
#=> "aare yoou okk"
The regular expression can be broken down as follows.
(\w) # match one word character and save it to capture group 1
\1* # match the content of capture group 1 zero or more times
(?= # begin a positive lookahead
\1\1 # match the content of capture group 1 twice
) # end the positive lookahead
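The breakdown above can be tried out with the following sketch. It assumes the pattern is written as a raw string and compiled with re.compile; the variable name pruned is illustrative.

```python
import re

string = "aaaaaaaaare yooooooooou okkkkkk"

# Match a character plus any further copies, but only while two more copies
# still follow; the lookahead leaves those last two in place.
rgx = re.compile(r'(\w)\1*(?=\1\1)')

# Replacing each match with the empty string deletes only the excess copies.
pruned = rgx.sub('', string)
print(pruned)  # aare yoou okk
```

Because the lookahead consumes nothing, runs of exactly one or two characters are never matched and stay untouched.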

Extracting two strings from between two characters. Why doesn't my regex match and how can I improve it?

I'm learning about regular expressions and I want to extract a string from a text that has the following characteristics:
It always begins with the letter C, in either lowercase or uppercase, which is then followed by a number of hexadecimal characters (meaning it can contain the letters A to F and numbers from 1 to 9, with no zeros included).
After those hexadecimal characters comes a letter P, also either in lowercase or uppercase.
And then some more hexadecimal characters (again, excluding 0).
Meaning I want to capture the strings that come in between the letters C and P as well as the string that comes after the letter P and concatenate them into a single string, while discarding the letters C and P
Examples of valid strings would be:
c45AFP2
CAPF
c56Bp26
CA6C22pAAA
For the above examples what I want would be to extract the following, in the same order:
45AF2 # Original string: c45AFP2
AF # Original string: CAPF
56B26 # Original string: c56Bp26
A6C22AAA # Original string: CA6C22pAAA
Examples of invalid strings would be:
BCA6C22pAAA # It doesn't begin with C
c56Bp # There aren't any characters after P
c45AF0P2 # Contains a zero
I'm using python and I want a regex to extract the two strings that come both in between the characters C and P as well as after P
So far I've come up with this:
(?<=\A[cC])[a-fA-F1-9]*(?<=[pP])[a-fA-F1-9]*
A breakdown would be:
(?<=\A[cC]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [cC] and that [cC] must be at the beginning of the string
[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times
(?<=[pP]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [pP]
[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times
But with the above regex I can't match any of the strings!
When I insert a | in between (?<=[cC])[a-fA-F1-9]* and (?<=[pP])[a-fA-F1-9]* it works.
Meaning the below regex works:
(?<=[cC])[a-fA-F1-9]*|(?<=[pP])[a-fA-F1-9]*
I know that | means that it should match at most one of the specified regex expressions. But it's non greedy and it returns the first match that it finds. The remaining expressions aren’t tested, right?
But using | means the string BCA6C22pAAA is a partial match to AAA since it comes after P, even though the first assertion isn't true, since it doesn't begin with a C.
That shouldn't be the case. I want it to only match if all conditions explained in the beginning are true.
Could someone explain to me why my first attempt doesn't produce the result I want? Also, how can I improve my regex?
I still need it to:
Not be a match if the string contains the number 0
Only be a match if ALL conditions are met
Thank you
To match both groups before and after P or p
(?<=^[Cc])[1-9a-fA-F]+(?=[Pp]([1-9a-fA-F]+$))
(?<=^[Cc]) - Positive lookbehind. The match must be preceded by C or c at the start of the line
[1-9a-fA-F]+ - Matches hexadecimal characters (excluding 0) one or more times
(?=[Pp] - Positive lookahead for p or P
([1-9a-fA-F]+$) - Capture group for one or more hexadecimal characters following the p or P, up to the end of the line
Your main problem is that you're using a lookbehind (?<=[pP]) for something that lies ahead, which will never work: by that point [a-fA-F1-9]* has stopped just before the P without consuming it, so the character behind the current position is never a P. You need a lookahead (?=...).
Also, the final quantifier should be + not * because you require at least one trailing character after the p.
The final mistake is that you're not capturing anything, you're only matching, so put what you want to capture inside parentheses, which also means you can remove all lookarounds.
If you use the case insensitive flag, it makes the regex much smaller and easier to read.
A working regex that captures the 2 hex parts in groups 1 and 2 is:
(?i)^c([a-f1-9]*)p([a-f1-9]+)
Unless you specifically need \A, prefer ^: it is easier to read, and in multiline mode it matches at the start of every line, which is what many situations and tools expect, whereas \A only ever matches at the very start of the input. I've used ^.
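Here is a hedged sketch of how the two captured groups could be concatenated in Python. The function name extract_hex is hypothetical, and using fullmatch instead of ^ is an assumption made here so that a stray character at the end of the string (for example a trailing 0) is rejected as well.

```python
import re

# Same two groups as the regex above; IGNORECASE replaces the inline (?i).
HEX_PAIR = re.compile(r'c([a-f1-9]*)p([a-f1-9]+)', re.IGNORECASE)

def extract_hex(text):
    """Return the hex parts around P joined together, or None if invalid."""
    # fullmatch requires the whole string to fit the pattern.
    m = HEX_PAIR.fullmatch(text)
    if m is None:
        return None
    return m.group(1) + m.group(2)

for candidate in ['c45AFP2', 'CAPF', 'c56Bp26', 'CA6C22pAAA',
                  'BCA6C22pAAA', 'c56Bp', 'c45AF0P2']:
    print(candidate, '->', extract_hex(candidate))
```

The four valid examples from the question yield their concatenated hex parts, and the three invalid ones yield None.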

Extract the string from the document using regex in python

I need to extract a string from a document with the following regex pattern in python.
The string will always start with either "AK" or "BK", followed by numbers, letters, - or / (in any order).
This string pattern can appear anywhere in the document.
document_text="""
This is the organization..this is the address.
AKBN
some information
AK3418CPMP
lot of other information down
BKCPU
"""
I have written following code.
pattern="(?:AK|BK)[A-Za-z0-9-/]+"
res_list=re.findall(pattern,document_text)
but the list I get contains all the strings starting with AK or BK,
something like this:
res_list=['AKBN','BKCPU','AK3418CPMP']
when I just use
res_grp=re.search(pattern,document_text)
res=res_grp.group()
I just get 'AKBN'
it is also matching the words "AKBN", "BKCPU"
along with the required "AK3418CPMP" when I use findall.
I want conditions to be following to extract only 1 string "AK3418CPMP":
string should start with AK or BK
It should be followed by letters and numbers or numbers and letters
It can contain "-" or "/"
How can I extract only "AK3418CPMP"?
You can make sure to match at least a single digit after matching AK or BK, and move the - to the end of the character class so that it cannot be read as a range.
\b[AB]K[A-Za-z/-]*[0-9][A-Za-z0-9/-]*
\b A word boundary to prevent a partial match
[AB]K Match either AK or BK
[A-Za-z/-]* Optionally repeat matching chars A-Za-z / or - without a digit
[0-9] Match at least a single digit
[A-Za-z0-9/-]* Optionally match what is listed in the character class including the digit
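As a rough illustration, the pattern can be applied to the document from the question with findall. The variable names pattern and codes are chosen here for the example.

```python
import re

document_text = """
This is the organization..this is the address.
AKBN
some information
AK3418CPMP
lot of other information down
BKCPU
"""

# Require at least one digit somewhere after the AK/BK prefix.
pattern = re.compile(r'\b[AB]K[A-Za-z/-]*[0-9][A-Za-z0-9/-]*')

codes = pattern.findall(document_text)
print(codes)  # ['AK3418CPMP']
```

AKBN and BKCPU are skipped because neither contains a digit.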
You can keep your regex, and make python do the filtering.
import re
document_text="""
This is the organization..this is the address.
AKBN
some information
AK3418CPMP
lot of other information down
BKCPU
"""
pattern="(?:AK|BK)[A-Za-z0-9-/]+"
res_list=[x for x in
          re.findall(pattern,document_text)
          if re.search(r'\d', x)
          and re.search(r'\w', x)]
print(res_list)
You can include a 'match at least' clause like: ([AB]K[A-Z]{1,}[0-9]{1,})|([AB]K[0-9]{1,}[A-Z]{1,}). This would cover your 1st and 2nd needs. You can customize this regex condition to track the '-' and '/' cases too.
Let's suppose you would like to track cases where the '-' or '/' would separate your substrings :
([AB]K(-|\/){0,1}[A-Z]{1,}(-|\/){0,1}[0-9]{1,})|([AB]K(-|\/){0,1}[0-9]{1,}(-|\/){0,1}[A-Z]{1,})

Extract date from inside a string with Python

I have the following string; the leading letters can differ, and there can be two, three, or four of them.
PR191030.213101.ABD
I want to extract the 191030 and convert that to a valid date.
filename_without_ending.split(".")[0][-6:]
This solution is not valid since the format also might differ from time to time. Sometimes it looks like this:
PZA191030_392001_USB
The only REAL pattern is really the first six numbers.
How do I do this?
Thank you!
You could get the first 6 digits using a pattern and a capturing group
^[A-Z]{2,4}(\d{6})\.
^ Start of string
[A-Z]{2,4} Match 2, 3 or 4 uppercase chars
( Capture group 1
\d{6} Match 6 digits
)\. Close group and match trailing dot
For example
import re
regex = r"^[A-Z]{2,4}(\d{6})\."
test_str = "PR191030.213101.ABD"
matches = re.search(regex, test_str)
if matches:
    print(matches.group(1))
Output
191030
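The question also asks for a valid date. A minimal sketch follows, assuming the six digits are in YYMMDD order (so 191030 would be 30 October 2019); the question does not confirm this. The function name extract_date is hypothetical, and the trailing \. is dropped from the pattern here so that the underscore variant is accepted too.

```python
import re
from datetime import datetime

def extract_date(name):
    """Return the date encoded in the first six digits, or None if absent."""
    # 2 to 4 uppercase letters at the start, then capture six digits.
    m = re.search(r'^[A-Z]{2,4}(\d{6})', name)
    if m is None:
        return None
    # Assumption: digits are YYMMDD. strptime raises ValueError for
    # impossible dates such as month 13.
    return datetime.strptime(m.group(1), '%y%m%d').date()

print(extract_date('PR191030.213101.ABD'))   # 2019-10-30
print(extract_date('PZA191030_392001_USB'))  # 2019-10-30
```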
You can do:
a = 'PR191030.213101.ABD'
int(''.join([c for c in a if c.isdigit()][:6]))
Output:
191030
This can also be done by:
filename_without_ending.split(".")[0][2::]
This slices the first part from the 3rd character to the end, so it only works when the prefix is exactly two letters.
Since the first letters can differ, we have to ignore the letters and extract the digits.
So, using the re module (for regular expressions), apply a regex pattern to the string. It will return the part of the string that matches the pattern.
\d matches a digit [0-9], and the + operator matches one or more of them.
findall() finds all occurrences of the pattern in a given string, while search() finds only the first occurrence.
import re
text="PR191030.213101.ABD"  # avoid shadowing the built-in str
print(re.findall(r"\d+",text)[0])
print(re.search(r"\d+",text).group())

Regex pattern to match substring

Would like to find the following pattern in a string:
word-word-word++ or -word-word-word++
So that it iterates the -word or word- pattern until the end of the substring.
the string is quite large and contains many words with those^ patterns.
The following has been tried:
p = re.compile('(?:\w+\-)*\w+\s+=', re.IGNORECASE)
result = p.match(data)
but it returns None. Does anyone know the answer?
Your regex will only match the first pattern, match() will only find one occurrence, and that only if it is immediately followed by some whitespace and an equals sign.
Also, in your example you implied you wanted three or more words, so here's a version that was changed in the following ways:
match both patterns (note the leading -?)
match only if there are at least three words to the pattern ({2,} instead of +)
match even if there's nothing after the pattern (the \b matches a word boundary. It is not really necessary here, since the preceding \w+ guarantees we are at a word boundary anyway)
returns all matches instead of only the first one.
Here's the code:
#!/usr/bin/python
import re
data=r"foo-bar-baz not-this -this-neither nope double-dash--so-nope -yeah-this-even-at-end-of-string"
p = re.compile(r'-?(?:\w+-){2,}\w+\b', re.IGNORECASE)
print(p.findall(data))
# prints ['foo-bar-baz', '-yeah-this-even-at-end-of-string']
