Regex for string that has 5 numbers or IND/5numbers - python

I am trying to build a regex to match 5 digit numbers or those 5 digit numbers preceded by IND/
10223 match to return 10223
IND/10110 match to return 10110
ID is 11233 match to return 11233
Ref is:10223 match to return 10223
Ref is: th10223 not match
SBI12234 not match
MRF/10234 not match
RBI/10229 not match
I have used the foll. Regex which selects the 5 digit correctly using word boundary concept. But not sure how to allow IND and not allow anything else like MRF, etc:
/b/d{5}/b
If I put (IND)? At beginning of regex then it won't help. Any hints?

Use a look behind:
(?<=^IND\/|^ID is |^)\d{5}\b
See live demo.
Because the look behind doesn’t consume any input, the entire match is your target number (ie there’s no need to use a group).

Variable length lookbehind is not supported by python, use alternation instead:
(?:(?<=IND/| is[: ])\d{5}|^\d{5})(?!\d)
Demo

This should work: (?<=IND/|\s|^)(\d{5})(?=\s|$) .

Try this: (?:IND\/|ID is |^)\b(\d{5})\b
Explanation:
(?: ALLOWED TEXT): A non-capture group with all allowed segments inside. In your example, IND\/ for "IND/", ID is for "ID is ...", and ^ for the beginning of the string (in case of only the number / no text at start: 12345).
\b(\d{5})\b: Your existing pattern w/ capture group for 5-digit number

I feel like this will need some logic to it. The regex can find the 5 digits, but maybe a second regex pattern to find IND, then join them together if need be. Not sure if you are using Python, .Net, or Java, but should be doable

Related

What is a regex expression that can prune down repeating identical characters down to a maximum of two repeats?

I feel I am having the most difficulty explaining this well enough for a search engine to pick up on what I'm looking for. The behavior is essentially this:
string = "aaaaaaaaare yooooooooou okkkkkk"
would become "aare yoou okk", with the maximum number of repeats for any given character is two.
Matching the excess duplicates, and then re.sub -ing it seems to me the approach to take, but I can't figure out the regex statement I need.
The only attempt I feel is even worth posting is this - (\w)\1{3,0}
Which matched only the first instance of a character repeating more than three times - so only one match, and the whole block of repeated characters, not just the ones exceeding the max of 2. Any help is appreciated!
The regexp should be (\w)\1{2,} to match a character followed by at least 2 repetitions. That's 3 or more when you include the initial character.
The replacement is then \1\1 to replace with just two repetitions.
string = "aaaaaaaaare yooooooooou okkkkkk"
new_string = re.sub(r'(\w)\1{2,}', r'\1\1', string)
You could write
string = "aaaaaaaaare yooooooooou okkkkkk"
rgx = (\w)\1*(?=\1\1)
re.sub(rgx, '', string)
#=> "aare yoou okk"
Demo
The regular expression can be broken down as follows.
(\w) # match one word character and save it to capture group 1
\1* # match the content of capture group 1 zero or more times
(?= # begin a positive lookahead
\1\1 # match the content of capture group 1 twice
) # end the positive lookahead

Extract the string from the document using regex in python

I need to extract a string from a document with the following regex pattern in python.
string will always start with either "AK" or "BK"..followed by numbers or letters or - or /(any order)
This string pattern can contain anywhere in the document
document_text="""
This is the organization..this is the address.
AKBN
some information
AK3418CPMP
lot of other information down
BKCPU
"""
I have written following code.
pattern="(?:AK|BK)[A-Za-z0-9-/]+"
res_list=re.findall(pattern,document_text)
but I am getting the list contains AKs and BKs
something like this
res_list=['AKBN','BKCPU','AK3418CPMP']
when I just use
res_grp=re.search(pattern,document_text)
res=res_grp.group(1)
I just get 'AKBN'
it is also matching the words "AKBN", "BKCPU"
along with the required "AK3418CPMP" when I use findall.
I want conditions to be following to extract only 1 string "AK3418CPMP":
string should start with AK or BK
It should followed by letters and numbers or numbers and letters
It can contain "-" or "/"
How can I only extract "AK3418CPMP"
You can make sure to match at least a single digit after matching AK or BK and move the - to the end of the character class or else it would denote a range.
\b[AB]K[A-Za-z/-]*[0-9][A-Za-z0-9/-]*
\b A word boundary to prevent a partial match
[AB]K Match either AK or BK
[A-Za-z/-]* Optionally repeat matching chars A-Za-z / or - without a digit
[0-9] Match at least a single digit
[A-Za-z0-9/-]* Optionally match what is listed in the character class including the digit
Regex demo
You can keep your regex, and make python do the filtering.
import re
document_text="""
This is the organization..this is the address.
AKBN
some information
AK3418CPMP
lot of other information down
BKCPU
"""
pattern="(?:AK|BK)[A-Za-z0-9-/]+"
res_list=[x for x in
re.findall(pattern,document_text)
if re.search(r'\d', x)
and re.search(r'\w', x)]
print(res_list)
You can include a 'match at least' clause like: ([AB]K[A-Z]{1,}[0-9]{1,})|([AB]K[0-9]{1,}[A-Z]{1,}). This would cover your 1st and 2nd needs. You can customize this regex condition to track the '-' and '/' cases too.
Let's suppose you would like to track cases where the '-' or '/' would separate your substrings :
([AB]K(-|\/){0,1}[A-Z]{1,}(-|\/){0,1}[0-9]{1,})|([AB]K(-|\/){0,1}[0-9]{1,}(-|\/){0,1}[A-Z]{1,})

regex - Extract complete word base on match string [duplicate]

This question already has answers here:
re.findall behaves weird
(3 answers)
Closed 2 years ago.
Can some one please help me on this - Here I'm trying extract word from given sentence which contains G,ML,KG,L,ML,PCS along with numbers .
I can able to match the string , but not sure how can I extract the comlpete word
for example my input is "This packet contains 250G Dates" and output should be 250G
another example is "You paid for 2KG Apples" and output should be 2KG
in my regular expression I'm getting only match string not complete word :(
import re
val = 'FUJI ALUMN FOIL CAKE, 240G, CHCLTE'
key_vals = ['G','GM','KG','L','ML','PCS']
re.findall("\d+\.?\d*(\s|G|KG|GM|L|ML|PCS)\s?", val)
This regex will not get you what you want:
re.findall("\d+\.?\d*(\s|G|KG|GM|L|ML|PCS)\s?", val)
Let's break it down:
\d+: one or more digits
\.?: a dot (optional, as indicated by the question mark)
\d*: one or more optional digits
(\s|G|KG|GM|L|ML|PCS): a group of alternatives, but whitespace is an option among others, it should be out of the group: what you probably want is allow optional whitespace between the number and the unit ie: 240G or 240 G
\s?: optional whitespace
A better expression for your purpose could be:
re.findall("\d+\s*(?:G|KG|GM|L|ML|PCS)", val)
That means: one or more digits, followed by optional whitespace and then either of these units: G|KG|GM|L|ML|PCS.
Note the presence of ?: to indicate a non-capturing group. Without it the expression would return G
Try using this Regex:
\d+\s*(G|KG|GM|L|ML|PCS)\s?
It matches every string which starts with at least one digit, is then followed by one the units. Between the digits and the units and behind the units there can also be whitespaces.
Adjust this like you want to :)
Use non-grouping parentheses (?:...) instead of the normal ones. Without grouping parentheses findall returns the string(s) which match the whole pattern.

Extract date from inside a string with Python

I have the following string, while the first letters can differ and can also be sometimes two, sometimes three or four.
PR191030.213101.ABD
I want to extract the 191030 and convert that to a valid date.
filename_without_ending.split(".")[0][-6:]
PZA191030_392001_USB
Sometimes it looks liket his
This solution is not valid since this is also might differ from time to time. The only REAL pattern is really the first six numbers.
How do I do this?
Thank you!
You could get the first 6 digits using a pattern an a capturing group
^[A-Z]{2,4}(\d{6})\.
^ Start of string
[A-Z]{2,4} Match 2, 3 or 4 uppercase chars
( Capture group 1
\d{6} Match 6 digits
)\. Close group and match trailing dot
Regex demo | Python demo
For example
import re
regex = r"^[A-Z]{2,4}(\d{6})\."
test_str = "PR191030.213101.ABD"
matches = re.search(regex, test_str)
if matches:
print(matches.group(1))
Output
191030
You can do:
a = 'PR191030.213101.ABD'
int(''.join([c for c in a if c.isdigit()][:6]))
Output:
191030
This can also be done by:
filename_without_ending.split(".")[0][2::]
This splits the string from the 3rd letter to the end.
Since first letters can differ we have to ignore alphabets and extract digits.
So using re module (for regular expressions) apply regex pattern on string. It will give matching pattern out of string.
'\d' is used to match [0-9]digits and + operator used for matching 1 digit atleast(1/more).
findall() will find all the occurences of matching pattern in a given string while #search() is used to find matching 1st occurence only.
import re
str="PR191030.213101.ABD"
print(re.findall(r"\d+",str)[0])
print(re.search(r"\d+",str).group())

Check in python if self designed pattern matches

I have a pattern which looks like:
abc*_def(##)
and i want to look if this matches for some strings.
E.x. it matches for:
abc1_def23
abc10_def99
but does not match for:
abc9_def9
So the * stands for a number which can have one or more digits.
The # stands for a number with one digit
I want the value in the parenthesis as result
What would be the easiest and simplest solution for this problem?
Replace the * and # through regex expression and then look if they match?
Like this:
pattern = pattern.replace('*', '[0-9]*')
pattern = pattern.replace('#', '[0-9]')
pattern = '^' + pattern + '$'
Or program it myself?
Based on your requirements, I would go for a regex for the simple reason it's already available and tested, so it's easiest as you were asking.
The only "complicated" thing in your requirements is avoiding after def the same digit you have after abc.
This can be done with a negative backreference. The regex you can use is:
\babc(\d+)_def((?!\1)\d{1,2})\b
\b captures word boundaries; if you enclose your regex between two \b
you will restrict your search to words, i.e. text delimited by space,
punctuations etc
abc captures the string abc
\d+ captures one or more digits; if there is an upper limit to the number of digits you want, it has to be \d{1,MAX} where MAX is your maximum number of digits; anyway \d stands for a digit and + indicates 1 or more repetitions
(\d+) is a group: the use of parenthesis defines \d+ as something you want to "remember" inside your regex; it's somehow similar to defining a variable; in this case, (\d+) is your first group since you defined no other groups before it (i.e. to its left)
_def captures the string _def
(?!\1) is the part where you say "I don't want to repeat the first group after _def. \1 represents the first group, while (?!whatever) is a check that results positive is what follows the current position is NOT (the negation is given by !) whatever you want to negate.
Live demo here.
I had the hardest time getting this to work. The trick was the $
#!python2
import re
yourlist = ['abc1_def23', 'abc10_def99', 'abc9_def9', 'abc955_def9', 'abc_def9', 'abc9_def9288', 'abc49_def9234']
for item in yourlist:
if re.search(r'abc[0-9]+_def[0-9][0-9]$', item):
print item, 'is a match'
You could match your pattern like:
abc\d+_def(\d{2})
abc Match literally
\d+ Match 1 or more digits
_ Match underscore
def - Match literally
( Capturing group (Your 2 digits will be in this group)
\d{2} Match 2 digits
) Close capturing group
Then you could for example use search to check for a match and use .group(1) to get the digits between parenthesis.
Demo Python
You could also add word boundaries:
\babc\d+_def(\d{2})\b

Categories