Making Regex combination of multiple chars and numbers combined - python

How to write a regex that will combine numbers and chars in a string in any order?
For example, If I want to read some kind of invoice number, I have example like this:
example 1: 32ah3af
example 2: 32ahPP-A2ah3af
example 3: 3A63af-3HHx3B-APe5y5-9OPiis
example 4: 3A63af 3HHx3B APe5y5 9OPiis
So each 'block' has a length between 3 and 7 chars (letters or numbers) that can be in any order (letters can be lowercase or uppercase). Each 'block' can start with a letter or with a number.
It can have one "block" or max 4 blocks that are separated with ' ' or -.
I know that I can make separators like: \s or \-, but I have no idea how to make these kind of blocks that have (or do not have) separator.
I tried with something like this:
([0-9]?[A-z]?){3,7}
But it does not work

You could use
^[A-Za-z0-9]{3,7}(?:[ -][A-Za-z0-9]{3,7}){0,3}\b
The pattern matches:
^ Start of string
[A-Za-z0-9]{3,7} Match 3-7 times either a lower or uppercase char a-z or number 0-9
(?: Non capture group
[ -][A-Za-z0-9]{3,7} Match either a space or - and 3-7 times either a lower or uppercase char a-z or number 0-9
){0,3} Close the non capture group and repeat 0-3 times to have a maximum of 4 occurrences
\b A word boundary to prevent a partial match
Note that [A-z] matches more than [A-Za-z]: the range also covers the characters [, \, ], ^, _ and ` that sit between Z and a in ASCII.
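As a quick check, the pattern can be run against the four examples from the question. This is only a sketch: the last two sample strings are hypothetical counter-examples added here, not taken from the question.

```python
import re

# Pattern from the answer above: 1 to 4 blocks of 3-7 alphanumerics,
# separated by a single space or hyphen.
pattern = re.compile(r"^[A-Za-z0-9]{3,7}(?:[ -][A-Za-z0-9]{3,7}){0,3}\b")

samples = [
    "32ah3af",
    "32ahPP-A2ah3af",
    "3A63af-3HHx3B-APe5y5-9OPiis",
    "3A63af 3HHx3B APe5y5 9OPiis",
    "ab",         # hypothetical counter-example: block shorter than 3
    "32ah3afXX",  # hypothetical counter-example: block longer than 7
]

# Map each sample to whether the pattern matches it.
results = {s: bool(pattern.search(s)) for s in samples}
for s, ok in results.items():
    print(f"{s!r}: {ok}")
```

Because the pattern ends in \b rather than $, a string with more than four blocks still matches on its first four blocks. If such input should be rejected, re.fullmatch is one way to do it.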

As long as you only want to capture / search for the invoice ids, the suggestion from Hao Wu is valid:
r'\w{3,7}'
If you can drop the remaining part, then this should be enough.
You can capture the whole string more precisely, here for lines such as example 1:
r'example (\d+): ((\w{3,7}[\- ]?)+)'
Please note how the capturing groups are represented in the result.
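To see what each capturing group holds, here is a small sketch using Python's re module. The input line is assumed to follow the "example N: ..." layout from the question; the variable names are illustrative only.

```python
import re

pattern = re.compile(r'example (\d+): ((\w{3,7}[\- ]?)+)')
line = "example 3: 3A63af-3HHx3B-APe5y5-9OPiis"

m = pattern.search(line)
number = m.group(1)      # the example number
invoice = m.group(2)     # the whole invoice id
last_block = m.group(3)  # a repeated group keeps only its last iteration
print(number, invoice, last_block)
```

Group 3 sits inside the repeated part, so it only ever holds the final block; group 2 is the one that carries the complete id.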

Related

What is a regex expression that can prune down repeating identical characters down to a maximum of two repeats?

I feel I am having the most difficulty explaining this well enough for a search engine to pick up on what I'm looking for. The behavior is essentially this:
string = "aaaaaaaaare yooooooooou okkkkkk"
would become "aare yoou okk", with the maximum number of repeats for any given character being two.
Matching the excess duplicates, and then re.sub -ing it seems to me the approach to take, but I can't figure out the regex statement I need.
The only attempt I feel is even worth posting is this - (\w)\1{3,0}
Which matched only the first instance of a character repeating more than three times - so only one match, and the whole block of repeated characters, not just the ones exceeding the max of 2. Any help is appreciated!
The regexp should be (\w)\1{2,} to match a character followed by at least 2 repetitions. That's 3 or more when you include the initial character.
The replacement is then \1\1 to replace with just two repetitions.
import re
string = "aaaaaaaaare yooooooooou okkkkkk"
new_string = re.sub(r'(\w)\1{2,}', r'\1\1', string)
You could write
import re
string = "aaaaaaaaare yooooooooou okkkkkk"
rgx = r'(\w)\1*(?=\1\1)'
re.sub(rgx, '', string)
#=> "aare yoou okk"
The regular expression can be broken down as follows.
(\w) # match one word character and save it to capture group 1
\1* # match the content of capture group 1 zero or more times
(?= # begin a positive lookahead
\1\1 # match the content of capture group 1 twice
) # end the positive lookahead

Regex to extract first 5 digit+character from last hyphen

I am trying to extract first 5 character+digit from last hyphen.
Here is the example
String -- X008-TGa19-ER751QF7
Output -- X008-TGa19-ER751
String -- X002-KF13-ER782cPU80
Output -- X002-KF13-ER782
My attempt -- I could manage to take element from the last -- (\w+)[^-.]*$
But now how do I take the first 5, and then return the entire value as the output, as shown in the example?
You can optionally repeat a - and 1+ word chars from the start of the string. Then match the last - and match 5 word chars.
^\w+(?:-\w+)*-\w{5}
^ Start of string
\w+ Match 1+ word chars
(?:-\w+)* Optionally repeat - and 1+ word chars
-\w{5} Match - and 5 word chars
import re
regex = r"^\w+(?:-\w+)*-\w{5}"
s = ("X008-TGa19-ER751QF7\n"
"X002-KF13-ER782cPU80")
print(re.findall(regex, s, re.MULTILINE))
Output
['X008-TGa19-ER751', 'X002-KF13-ER782']
Note that \w can also match _.
If there can also be other characters in the string, to get the first 5 letters or digits (excluding _) after the last hyphen, you can match word characters without an underscore using a negated character class, [^\W_].
Repeat that 5 times while asserting that no more hyphens follow to the right.
^.*-[^\W_]{5}(?=[^-]*$)
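A short sketch of how this stricter pattern behaves in Python. The helper name prefix and the third sample string are made up for illustration; only the first two strings come from the question.

```python
import re

# Pattern from the answer above: exactly 5 letters/digits after the last hyphen.
pattern = re.compile(r"^.*-[^\W_]{5}(?=[^-]*$)")

def prefix(s):
    """Return the matched prefix, or None when the pattern does not match."""
    m = pattern.search(s)
    return m.group() if m else None

samples = [
    "X008-TGa19-ER751QF7",
    "X002-KF13-ER782cPU80",
    "X#02-KF13-ER_82cPU80",  # hypothetical input: underscore within the 5 chars
]
results = [prefix(s) for s in samples]
print(results)
```

The third string yields None: the underscore breaks the run of five after the last hyphen, and the lookahead stops the engine from falling back to an earlier hyphen.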
(\w+-\w+-\w{5}) seems to capture what you're asking for.
Example:
https://regex101.com/r/PcPSim/1
If you are open for non-regex solution, you can use this which is based on splitting, slicing and joining the strings:
>>> my_str = "X008-TGa19-ER751QF7"
>>> '-'.join(s[:5] for s in my_str.split('-'))
'X008-TGa19-ER751'
Here I am splitting the string based on hyphen -, slicing the string to get at max five chars per sub-string, and joining it back using str.join() to get the string in your desired format.
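Note that slicing every segment also truncates earlier segments that are longer than five characters. A minimal sketch of an alternative that trims only the part after the last hyphen, using str.rpartition; the function name first_five_after_last_hyphen and the third sample string are made up for illustration.

```python
def first_five_after_last_hyphen(s, n=5):
    """Keep everything up to the last hyphen, plus the next n characters."""
    head, sep, tail = s.rpartition('-')
    return head + sep + tail[:n]

samples = [
    "X008-TGa19-ER751QF7",
    "X002-KF13-ER782cPU80",
    "X008-TGa19xyz-ER751QF7",  # hypothetical: middle segment longer than 5
]
trimmed = [first_five_after_last_hyphen(s) for s in samples]
print(trimmed)
```

For the third string, the split-and-join version would shorten the middle segment to TGa19, while this version leaves it intact.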
^(.*-[^-]{5})[^-]*$
Capture group 1 is what you need
https://regex101.com/r/SYz9i5/1
Explanation
^(.*-[^-]{5})[^-]*$
^ Start of line
( Capture group 1 start
.* Any number of any character
- hyphen
[^-]{5} 5 non-hyphen characters
) Capture group 1 end
[^-]* Any number of non-hyphen characters
$ End of line
Another simpler one is
^(.*-.{5}).*$
This should be quite straight-forward.
This makes use of the greedy behaviour of the first .*, which will try to match as much as possible, so the - will be the last one with at least 5 characters following it.
https://regex101.com/r/CFqgeF/1/

regex to find a pair of adjacent digits with different digits around them

I'm a beginner to regex and I am trying to make an expression to find if there are two of the same digits next to each other, and the digit behind and in front of the pair is different.
For example,
123456678 should match as there is a double 6,
1234566678 should not match as there is no double with different surrounding numbers.
12334566 should match because there are two 3s.
So far I have this, which works only with 1, and only as long as the double is not at the start or end of the string; however, I can deal with that by adding a letter at the start and end.
^.*([^1]11[^1]).*$
I know I can use [0-9] instead of the 1s, but the problem is having them all be the same digit.
Thank you!
I have divided my answer into four sections.
The first section contains my solution to the problem. Readers interested in nothing else may skip the other sections.
The remaining three sections are concerned with identifying the pairs of equal digits that are preceded by a different digit and are followed by a different digit. The first of the three sections matches them; the other two capture them in a group.
I've included the last section because I wanted to share The Greatest Regex Trick Ever with those unfamiliar with it, because I find it so very cool and clever, yet simple. It is documented here. Be forewarned that, to build suspense, the author at that link has included a lengthy preamble before the drum-roll reveal.
Determine if a string contains two consecutive equal digits that are preceded by a different digit and are followed by a different digit
You can test the string as follows:
import re
r = r'(\d)(?!\1)(\d)\2(?!\2)\d'
arr = ["123456678", "1123455a666788"]
for s in arr:
    print(s, bool(re.search(r, s)))
displays
123456678 True
1123455a666788 False
The regex engine performs the following operations.
(\d) : match a digit and save to capture group 1 (preceding digit)
(?!\1) : next character cannot equal content of capture group 1
(\d) : match a digit in capture group 2 (first digit of pair)
\2 : match content of capture group 2 (second digit of pair)
(?!\2) : next character cannot equal content of capture group 2
\d : match a digit
(?!\1) and (?!\2) are negative lookaheads.
Use Python's regex module to match pairs of consecutive digits that have the desired property
You can use the following regular expression with Python’s regex module to obtain the matching pairs of digits.
r'(\d)(?!\1)\K(\d)\2(?=\d)(?!\2)'
The regex engine performs the following operations.
(\d) : match a digit and save to capture group 1 (preceding digit)
(?!\1) : next character cannot equal content of capture group 1
\K : forget everything matched so far and reset start of match
(\d) : match a digit in capture group 2 (first digit of pair)
\2 : match content of capture group 2 (second digit of pair)
(?=\d) : next character must be a digit
(?!\2) : next character cannot equal content of capture group 2
(?=\d) is a positive lookahead. (?=\d)(?!\2) could be replaced with (?!\2|$|\D).
Save pairs of consecutive digits that have the desired property to a capture group
Another way to obtain the matching pairs of digits, which does not require the regex module, is to extract the contents of capture group 2 from matches of the following regular expression.
r'(\d)(?!\1)((\d)\3)(?!\3)(?=\d)'
The following operations are performed.
(\d) : match a digit in capture group 1
(?!\1) : next character does not equal last character
( : begin capture group 2
(\d) : match a digit in capture group 3
\3 : match the content of capture group 3
) : end capture group 2
(?!\3) : next character does not equal last character
(?=\d) : next character is a digit
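A minimal sketch of extracting capture group 2 with the standard re module. The helper name isolated_pairs is an illustrative choice; the inputs are the strings from the question.

```python
import re

pattern = re.compile(r'(\d)(?!\1)((\d)\3)(?!\3)(?=\d)')

def isolated_pairs(s):
    """Collect capture group 2 (the pair itself) from every match."""
    return [m.group(2) for m in pattern.finditer(s)]

pairs_a = isolated_pairs("123456678")
pairs_b = isolated_pairs("1234566678")
pairs_c = isolated_pairs("12334566")
print(pairs_a, pairs_b, pairs_c)
```

re.finditer is used rather than re.findall because findall would return a tuple of all three groups per match. Since matches cannot overlap, a pair that starts right where the previous match ended may need extra handling, which this sketch does not attempt.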
Use The Greatest Regex Trick Ever to identify pairs of consecutive digits that have the desired property
We use the following regular expression to match the string.
r'(\d)(?=\1)|\d(?=(\d)(?!\2))|\d(?=\d(\d)\3)|\d(?=(\d{2})\d)'
When there is a match, we pay no attention to which character was matched, but examine the content of capture group 4 ((\d{2})), as I will explain below.
The first three components of the alternation correspond to the ways that a string of four digits can fail to have the property that the second and third digits are equal, the first and second are unequal and the third and fourth are unequal. They are:
(\d)(?=\1) : assert first and second digits are equal
\d(?=(\d)(?!\2)) : assert second and third digits are not equal
\d(?=\d(\d)\3) : assert third and fourth digits are equal
It follows that if there is a match of a digit and the first three parts of the alternation fail, the last part (\d(?=(\d{2})\d)) must succeed, and the capture group it contains (#4) must contain the two equal digits that have the required properties. (The final \d is needed to assert that the pair of digits of interest is followed by a digit.)
If there is a match how do we determine if the last part of the alternation is the one that is matched?
When this regex matches a digit we have no interest in what digit that was. Instead, we look to capture group 4 ((\d{2})). If that group is empty we conclude that one of the first three components of the alternation matched the digit, meaning that the two digits following the matched digit do not have the properties that they are equal and are unequal to the digits that precede and follow them.
If, however, capture group 4 is not empty, it means that none of the first three parts of the alternation matched the digit, so the last part of the alternation must have matched and the two digits following the matched digit, which are held in capture group 4, have the desired properties.
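The procedure described above can be sketched in Python with the standard re module: iterate over all matches, ignore what was matched, and keep only the non-empty contents of capture group 4. The helper name pairs_via_group_4 is made up for this illustration.

```python
import re

trick = re.compile(
    r'(\d)(?=\1)|\d(?=(\d)(?!\2))|\d(?=\d(\d)\3)|\d(?=(\d{2})\d)'
)

def pairs_via_group_4(s):
    """Ignore what each match consumed; keep only non-empty capture group 4."""
    return [m.group(4) for m in trick.finditer(s) if m.group(4)]

found_a = pairs_via_group_4("123456678")
found_b = pairs_via_group_4("1234566678")
found_c = pairs_via_group_4("12334566")
print(found_a, found_b, found_c)
```

Each match consumes a single digit, and the lookaheads inspect the following characters without consuming them, so every position in the string gets examined.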
With the PyPI regex module, it is much more convenient to use a (*SKIP)(*FAIL) based pattern:
import regex
rx = r'(\d)\1{2,}(*SKIP)(*F)|(\d)\2'
l = ["123456678", "1234566678"]
for s in l:
    print(s, bool(regex.search(rx, s)))
Output:
123456678 True
1234566678 False
Regex details
(\d)\1{2,}(*SKIP)(*F) - a digit and then two or more occurrences of the same digit
| - or
(\d)\2 - a digit and then the same digit.
The point is to match all chunks of identical 3 or more digits and skip them, and then match a chunk of two identical digits.
Inspired by the answer of Wiktor Stribiżew, another variation of using an alternation with re is to check for the existence of the capturing group which contains a positive match for 2 of the same digits not surrounded by the same digit.
In this case, check for group 3.
((\d)\2{2,})|\d(\d)\3(?!\3)\d
( Capture group 1
(\d)\2{2,} Capture group 2, match 1 digit and repeat that same digit 2+ times
) Close group
| Or
\d(\d) Match a digit, capture a digit in group 3
\3(?!\3)\d Match the same digit as in group 3. Match the 4th digit, but it should not be the same as the group 3 digit
For example
import re
pattern = r"((\d)\2{2,})|\d(\d)\3(?!\3)\d"
strings = ["123456678", "12334566", "12345654554888", "1221", "1234566678", "1222", "2221", "66", "122", "221", "111"]
for s in strings:
    match = re.search(pattern, s)
    if match and match.group(3):
        print("Match: " + match.string)
    else:
        print("No match: " + s)
Output
Match: 123456678
Match: 12334566
Match: 12345654554888
Match: 1221
No match: 1234566678
No match: 1222
No match: 2221
No match: 66
No match: 122
No match: 221
No match: 111
If, for example, a string of only 2 or 3 digits is also OK to match, you could check for group 2
(\d)\1{2,}|(\d)\2
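A hedged sketch of the group 2 check. It uses re.finditer instead of re.search, which is a choice made here rather than something the answer specifies, so that a pair appearing after an earlier run of three or more identical digits is still found. The helper name has_exact_pair is illustrative.

```python
import re

pattern = re.compile(r"(\d)\1{2,}|(\d)\2")

def has_exact_pair(s):
    """True if some match came from the second alternative (group 2 is set)."""
    return any(m.group(2) is not None for m in pattern.finditer(s))

samples = ["123456678", "1234566678", "66", "122", "111", "11122"]
flags = {s: has_exact_pair(s) for s in samples}
print(flags)
```

With re.search alone, "11122" would stop at the run 111 (group 2 unset) and report no pair.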
You can also do this in a simple way.
import re
l = ["123456678",
     "1234566678",
     "12334566 "]
for i in l:
    matches = re.findall(r"((.)\2+)", i)
    if any(len(x[0]) != 2 for x in matches):
        print("{}-->{}".format(i, False))
    else:
        print("{}-->{}".format(i, True))
You can customize this based on your rules.
Output:
123456678-->True
1234566678-->False
12334566 -->True

Searching multiple repeating patterns of text using regular expressions

I am trying to search for texts from a document, which have repeating portions and occur multiple times in the document. However, using the regex.match, it shows only the first match from the document and not others.
The patterns which I want to search looks like:
clauses 5.3, 12 & 15
clause 10 C, 10 CA & 10 CC
The following line shows the regular expression which I am using.
regex_crossref_multiple_1=r'(clause|Clause|clauses|Clauses)\s*\d+[.]?\d*\s*[a-zA-Z]*((,|&|and)\s*\d+[.]?\d*\s*[A-Z]*)+'
The code used for matching and the results are shown below:
cross=regex.search(regex_crossref_multiple_1,des)
(des is string containing text)
For printing the results, I am using print(cross.group()).
Result:
clauses 5.3, 12 & 15
However, there are other patterns as well in des which I am not getting in the result.
Please let me know what can be the problem.
The input string (des) can be found at the following link.
https://docs.google.com/document/d/1LPmYaD6VE724OYoXDGPfInvx8WTu5JfrTqTOIv8zAlg/edit?usp=sharing
In case, the contractor completes the work ahead of stipulated date of
completion or justified extended date of completion as determined
under clauses 5.3, 12 & 15, a bonus # 0.5 % (zero point five per cent) of
the tendered value per month computed on per day basis, shall be
payable to the contractor, subject to a maximum limit of 2 % (two
percent) of the tendered value. Provided that justified time for extra
work shall be calculated on pro-rata basis as cost of extra work excluding
amount payable/ paid under clause 10 C, 10 CA & 10 CC X stipulated
period /tendered value. The amount of bonus, if payable, shall be paid
along with final bill after completion of work. Provided always that
provision of the Clause 2A shall be applicable only when so provided in
‘Schedule F’
You could match clause(s) followed by digits with an optional decimal part and optional chars A-Z, and then use a repeating pattern to match each following comma and number.
For the last part of the pattern you can optionally match either a ,, & or and followed by a digit and optional chars A-Z.
\b[Cc]lauses?\s+\d+(?:\.\d+)?(?:\s*[A-Z]+)?(?:,\s+\d+(?:\.\d+)?(?:\s*[A-Z]+)?)*(?:\s+(?:[,&]|and)\s+\d+(?:\.\d+)?(?:\s*[A-Z]+)?)?\b
Explanation
\b Word boundary
[Cc]lauses?\s+\d+(?:\.\d+)? Match clauses followed by digits and optional decimal part
(?:\s*[A-Z]+)? Optionally match whitespace chars and 1+ chars A-Z
(?: Non capture group
,\s+\d+(?:\.\d+)? Match a comma, digits and optional decimal part
(?:\s*[A-Z]+)? Optionally match whitespace chars and 1+ chars A-Z
)* Close group and repeat 0+ times
(?: Non capture group
\s+(?:[,&]|and) Match 1+ whitespace char and either ,, & or and
\s+\d+(?:\.\d+)? Match 1+ whitespace chars, 1+ digits with an optional decimal part
(?:\s*[A-Z]+)? Match optional whitespace chars and 1+ chars A-Z
)? Close group and make optional
\b Word boundary
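Since the question is about getting every occurrence rather than only the first, here is a minimal sketch that applies the pattern with re.findall. The text in des is a shortened excerpt of the passage quoted in the question, assembled here for illustration.

```python
import re

# Same pattern as above, split across lines only for readability.
pattern = re.compile(
    r"\b[Cc]lauses?\s+\d+(?:\.\d+)?(?:\s*[A-Z]+)?"
    r"(?:,\s+\d+(?:\.\d+)?(?:\s*[A-Z]+)?)*"
    r"(?:\s+(?:[,&]|and)\s+\d+(?:\.\d+)?(?:\s*[A-Z]+)?)?\b"
)

# Shortened excerpt of the text from the question.
des = (
    "completion as determined under clauses 5.3, 12 & 15, a bonus shall be "
    "payable. Amount paid under clause 10 C, 10 CA & 10 CC X stipulated "
    "period. Provision of the Clause 2A shall be applicable."
)

# The pattern has no capturing groups, so findall returns whole matches.
matches = pattern.findall(des)
for m in matches:
    print(m)
```

re.search (like regex.search in the question) stops at the first match; findall or finditer walks the whole string.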

How to extract characters of particular length from a given string in python Regex

Hi I have records like,
Eg:
Health Insurance PortabilityNEG Ratio
Health Insurance PortabilityNEGRatio
Health Insurance PortabilityNEG NEGRatio
Here I need to extract NEG as my value, so I wrote regexes in Python like
Portability(.+?) Ratio,
Portability(.+?)Ratio
where the first "NEG" after Portability is the value which I should get. The first and second records give me the correct output, "NEG". But for my third record I get "NEG NEG", which is a wrong value.
I need to get only "NEG" for the third record as well. Should I give the length of the first three characters to take only "NEG"?
If so, Kindly let me know how can I write the regex according to that?
The . means any character at all, and the + symbol means "at least one" but does not specify an upper limit. You want \w{n}, where \w means a word character and n means the number of occurrences.
Also, note that \w includes digits and the underscore, so if you only want letters, you'd better use [a-zA-Z]{3}
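A small sketch of the fixed-length idea applied to the three records from the question, assuming the value is always exactly three letters directly after "Portability":

```python
import re

records = [
    "Health Insurance PortabilityNEG Ratio",
    "Health Insurance PortabilityNEGRatio",
    "Health Insurance PortabilityNEG NEGRatio",
]

# Exactly three letters directly after the literal "Portability".
pattern = re.compile(r"Portability([a-zA-Z]{3})")
values = [pattern.search(r).group(1) for r in records]
print(values)
```

Because the quantifier is {3} rather than +, the capture cannot run on into the second NEG of the third record.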
If you have to extract any 3 chars right after Portability use
re.findall(r"Portability(.{3}).*?Ratio", s)
If these are uppercase letters, replace .{3} with [A-Z]{3}.
Details:
Portability - a literal char sequence
(.{3}) - Capturing group 1: exactly 3 chars (any chars other than line break chars if re.S/re.DOTALL modifier is not used) since {3} is a limiting quantifier matching the number of occurrences defined inside {...}
.*?Ratio - any 0+ chars other than line break chars as few as possible (as *? is a lazy quantifier) up to the first Ratio substring.
The re.findall only returns captured values, so you will only get NEG.
