Extract the string from the document using regex in Python

I need to extract a string from a document using a regex pattern in Python.
The string will always start with either "AK" or "BK", followed by numbers, letters, - or / (in any order).
This string can appear anywhere in the document.
document_text="""
This is the organization..this is the address.
AKBN
some information
AK3418CPMP
lot of other information down
BKCPU
"""
I have written following code.
pattern="(?:AK|BK)[A-Za-z0-9-/]+"
res_list=re.findall(pattern,document_text)
but the resulting list contains every string that starts with AK or BK,
something like this:
res_list=['AKBN','BKCPU','AK3418CPMP']
when I just use
res_grp=re.search(pattern,document_text)
res=res_grp.group(0)
I just get 'AKBN'
it is also matching the words "AKBN", "BKCPU"
along with the required "AK3418CPMP" when I use findall.
I want the following conditions, so that only the one string "AK3418CPMP" is extracted:
The string should start with AK or BK
It should be followed by letters and numbers, or numbers and letters
It can contain "-" or "/"
How can I extract only "AK3418CPMP"?

You can require at least a single digit after matching AK or BK. Also move the - to the end of the character class so that it cannot be interpreted as a range.
\b[AB]K[A-Za-z/-]*[0-9][A-Za-z0-9/-]*
\b A word boundary to prevent a partial match
[AB]K Match either AK or BK
[A-Za-z/-]* Optionally repeat matching chars A-Za-z / or - without a digit
[0-9] Match at least a single digit
[A-Za-z0-9/-]* Optionally match what is listed in the character class including the digit
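As a quick check, the pattern above can be run against the sample document from the question. This is a minimal sketch; the variable names are only illustrative.

```python
import re

document_text = """
This is the organization..this is the address.
AKBN
some information
AK3418CPMP
lot of other information down
BKCPU
"""

# AK or BK, then optional non-digit characters, one required digit,
# then any further allowed characters (letters, digits, / or -).
pattern = re.compile(r"\b[AB]K[A-Za-z/-]*[0-9][A-Za-z0-9/-]*")

matches = pattern.findall(document_text)
print(matches)  # ['AK3418CPMP']
```

AKBN and BKCPU are skipped because they contain no digit.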

You can keep your regex, and make python do the filtering.
import re
document_text="""
This is the organization..this is the address.
AKBN
some information
AK3418CPMP
lot of other information down
BKCPU
"""
pattern="(?:AK|BK)[A-Za-z0-9-/]+"
res_list=[x for x in
re.findall(pattern,document_text)
if re.search(r'\d', x)
and re.search(r'\w', x)]
print(res_list)

You can include a 'match at least' clause like: ([AB]K[A-Z]{1,}[0-9]{1,})|([AB]K[0-9]{1,}[A-Z]{1,}). This would cover your 1st and 2nd needs. You can customize this regex condition to track the '-' and '/' cases too.
Let's suppose you would like to track cases where the '-' or '/' would separate your substrings :
([AB]K(-|\/){0,1}[A-Z]{1,}(-|\/){0,1}[0-9]{1,})|([AB]K(-|\/){0,1}[0-9]{1,}(-|\/){0,1}[A-Z]{1,})
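One thing to watch when trying these patterns in Python: they contain capture groups, so re.findall returns a tuple of groups per match rather than the matched text. A small sketch using the first pattern and a shortened version of the sample document (an assumption for brevity):

```python
import re

document_text = "AKBN\nsome information\nAK3418CPMP\nBKCPU\n"

pattern = re.compile(r"([AB]K[A-Z]{1,}[0-9]{1,})|([AB]K[0-9]{1,}[A-Z]{1,})")

# findall yields one tuple per match, with '' for the group that did not take part
as_tuples = pattern.findall(document_text)

# finditer + group(0) yields the whole matched text instead
whole_matches = [m.group(0) for m in pattern.finditer(document_text)]

print(as_tuples)      # [('', 'AK3418CPMP')]
print(whole_matches)  # ['AK3418CPMP']
```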


Regex python - find match items on list that have the same digit between the second character "_" to character "."

I have the following list :
list_paths=imgs/foldeer/img_ABC_21389_1.tif.tif,
imgs/foldeer/img_ABC_15431_10.tif.tif,
imgs/foldeer/img_GHC_561321_2.tif.tif,
imgs_foldeer/img_BCL_871125_21.tif.tif,
...
I want to be able to run a for loop to match the string with a specific number, which is the number between the second occurrence of "_" and the ".tif.tif". For example, when the number is 1, the string to be matched is "imgs/foldeer/img_ABC_21389_1.tif.tif"; for number 2, the matched string will be "imgs/foldeer/img_GHC_561321_2.tif.tif".
For that, I wanted to use regex expression. Based on this answer, I have tested this regex expression on Regex101:
[^\r\n_]+\.[^\r\n_]+\_([0-9])
But this doesn't match anything, and also doesn't make sure that it will take the exact number, so if number is 1, it might also select items with number 10 .
My end goal is to be able to match items in the list that have the requested number between the 2nd occurrence of "_" and the first occurrence of ".tif", using a regex. I am looking for help with the regex expression.
EDIT: The output should be the whole path and not only the number.
Your pattern [^\r\n_]+\.[^\r\n_]+\_([0-9]) does not match anything, because you are matching an underscore \_ (note that you don't have to escape it) after matching a dot, and that does not occur in the example data.
Then you want to match a digit, but the available digits only occur before any of the dots.
In your question, the numbers that you are referring to are after the 3rd occurrence of the _
What you could do to get the path(s) is to make the number a variable for the number you want to find:
^\S*?/(?:[^\s_/]+_){3}\d+\.tif\b[^\s/]*$
Explanation
\S*? Match optional non whitespace characters, as few as possible
/ Match literally
(?:[^\s_/]+_){3} Match 3 times (non consecutive) _
\d+ Match 1+ digits
\.tif\b[^\s/]* Match .tif followed by any char except /
$ End of string
Example using a list comprehension to return all paths for the given number:
import re
number = 10
pattern = rf"^\S*?/(?:[^\s_/]+_){{3}}{number}\.tif\b[^\s/]*$"
list_paths = [
"imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_15431_10.tif.tif",
"imgs/foldeer/img_GHC_561321_2.tif.tif",
"imgs_foldeer/img_BCL_871125_21.tif.tif",
"imgs_foldeer/img_BCL_871125_21.png.tif"
]
res = [lp for lp in list_paths if re.search(pattern, lp)]
print(res)
Output
['imgs/foldeer/img_ABC_15431_10.tif.tif']
I'll show you something working and equally ugly as regex which I hate:
data = ["imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_15431_10.tif.tif",
"imgs/foldeer/img_GHC_561321_2.tif.tif",
"imgs_foldeer/img_BCL_871125_21.tif.tif"]
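numbers = [int(x.rsplit("_", 1)[-1].split(".")[0]) for x in data]
The first split, done from the right, gives e.g. ["imgs/foldeer/img_ABC_21389", "1.tif.tif"]
Extract the last element ("1.tif.tif")
Split again, by the dot this time, and take the first element (that's your number as a string), then cast it to int
Splitting from the right matters because one path has an underscore in its folder name (imgs_foldeer): split("_", 3) would leave "871125_21" there, which int() silently accepts as 87112521.
Please keep in mind it's going to work only for the format you provided; there is no flexibility at all in this solution (on the other hand, regex doesn't give any either)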
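Since the question asks for the whole path rather than the number, the same split idea can be turned into a filter. A minimal sketch, where trailing_number and paths_with_number are hypothetical helper names and the paths are assumed to always end in _<number>.<extension>:

```python
paths = [
    "imgs/foldeer/img_ABC_21389_1.tif.tif",
    "imgs/foldeer/img_ABC_15431_10.tif.tif",
    "imgs/foldeer/img_GHC_561321_2.tif.tif",
    "imgs_foldeer/img_BCL_871125_21.tif.tif",
]


def trailing_number(path):
    # Text after the last underscore, e.g. "10.tif.tif", cut at the first dot
    return int(path.rsplit("_", 1)[-1].split(".")[0])


def paths_with_number(paths, number):
    # Keep only the paths whose trailing number equals the requested one
    return [p for p in paths if trailing_number(p) == number]


print(paths_with_number(paths, 1))   # ['imgs/foldeer/img_ABC_21389_1.tif.tif']
print(paths_with_number(paths, 21))  # ['imgs_foldeer/img_BCL_871125_21.tif.tif']
```

Comparing integers means that 1 does not also select 10 or 21.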
Without regex, if allowed:
s = 'imgs/foldeer/img_ABC_15431_10.tif.tif'
last = s[s.rindex('_')+1:]
print(last.split('.')[0])
Gives:
10
[0-9]+(?=\.tif\.tif)
This regex uses a lookahead to match the run of digits that is immediately followed by .tif.tif (what you're looking for). The + rather than * keeps it from producing empty matches.
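A short sketch of how the lookahead behaves on the sample paths from the question (the list and variable names are illustrative):

```python
import re

paths = [
    "imgs/foldeer/img_ABC_21389_1.tif.tif",
    "imgs/foldeer/img_ABC_15431_10.tif.tif",
    "imgs/foldeer/img_GHC_561321_2.tif.tif",
    "imgs_foldeer/img_BCL_871125_21.tif.tif",
]

# One or more digits, only if ".tif.tif" follows; the extension is not consumed
number_before_ext = re.compile(r"[0-9]+(?=\.tif\.tif)")

numbers = [number_before_ext.search(p).group() for p in paths]
print(numbers)  # ['1', '10', '2', '21']
```

The earlier digit runs (such as 21389) are rejected because an underscore, not .tif.tif, follows them.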
Try this:
import re
s = '''imgs/foldeer/img_ABC_21389_1.tif.tif
imgs/foldeer/img_ABC_15431_10.tif.tif
imgs/foldeer/img_GHC_561321_2.tif.tif
imgs_foldeer/img_BCL_871125_21.tif.tif'''
number = 1
res1 = re.findall(rf".*_{number}\.tif.*", s)
number = 21
res21 = re.findall(rf".*_{number}\.tif.*", s)
print(res1)
print(res21)
Results
['imgs/foldeer/img_ABC_21389_1.tif.tif']
['imgs_foldeer/img_BCL_871125_21.tif.tif']

Extracting two strings from between two characters. Why doesn't my regex match and how can I improve it?

I'm learning about regular expressions and I want to extract a string from a text that has the following characteristics:
It always begins with the letter C, in either lowercase or
uppercase, which is then followed by a number of hexadecimal
characters (meaning it can contain the letters A to F and numbers
from 1 to 9, with no zeros included).
After those hexadecimal
characters comes a letter P, also either in lowercase or uppercase
And then some more hexadecimal characters (again, excluding 0).
Meaning I want to capture the strings that come in between the letters C and P as well as the string that comes after the letter P and concatenate them into a single string, while discarding the letters C and P
Examples of valid strings would be:
c45AFP2
CAPF
c56Bp26
CA6C22pAAA
For the above examples what I want would be to extract the following, in the same order:
45AF2 # Original string: c45AFP2
AF # Original string: CAPF
56B26 # Original string: c56Bp26
A6C22AAA # Original string: CA6C22pAAA
Examples of invalid strings would be:
BCA6C22pAAA # It doesn't begin with C
c56Bp # There aren't any characters after P
c45AF0P2 # Contains a zero
I'm using python and I want a regex to extract the two strings that come both in between the characters C and P as well as after P
So far I've come up with this:
(?<=\A[cC])[a-fA-F1-9]*(?<=[pP])[a-fA-F1-9]*
A breakdown would be:
(?<=\A[cC]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [cC] and that [cC] must be at the beginning of the string
[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times
(?<=[pP]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [pP]
[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times
But with the above regex I can't match any of the strings!
When I insert a | in between (?<=[cC])[a-fA-F1-9]* and (?<=[pP])[a-fA-F1-9]* it works.
Meaning the below regex works:
(?<=[cC])[a-fA-F1-9]*|(?<=[pP])[a-fA-F1-9]*
I know that | means that it should match at most one of the specified regex expressions. But it's non greedy and it returns the first match that it finds. The remaining expressions aren’t tested, right?
But using | means the string BCA6C22pAAA is a partial match to AAA since it comes after P, even though the first assertion isn't true, since it doesn't begin with a C.
That shouldn't be the case. I want it to only match if all conditions explained in the beginning are true.
Could someone explain to me why my first attempt doesn't produce the result I want? Also, how can I improve my regex?
I still need it to:
Not be a match if the string contains the number 0
Only be a match if ALL conditions are met
Thank you
To match both groups before and after P or p
(?<=^[Cc])[1-9a-fA-F]+(?=[Pp]([1-9a-fA-F]+$))
(?<=^[Cc]) - Positive lookbehind. The match must be preceded by a C or c at the start of the line
[1-9a-fA-F]+ - Matches hexadecimal characters (excluding 0) one or more times
(?=[Pp]...) - Positive lookahead for a p or P
([1-9a-fA-F]+$) - Capture group, inside the lookahead, for one or more hexadecimal characters following the p or P up to the end of the line
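Your main problem is that you're using a lookbehind (?<=[pP]) for something that is still ahead: after [a-fA-F1-9]* has matched, the character before the current position is a hex character (or the leading C), never a P, so the assertion always fails. You need to match the P itself or use a lookahead (?=...).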
Also, the final quantifier should be + not * because you require at least one trailing character after the p.
The final mistake is that you're not capturing anything, you're only matching, so put what you want to capture inside brackets, which also means you can remove all look arounds.
If you use the case insensitive flag, it makes the regex much smaller and easier to read.
A working regex that captures the 2 hex parts in groups 1 and 2 is:
(?i)^c([a-f1-9]*)p([a-f1-9]+)
Unless you need to use \A, prefer ^ (start of input, or start of each line in multiline mode) over \A (start of the whole input only), because ^ is easier to read and \A won't match at the start of every line, which is what many situations and tools expect. I've used ^.
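To see the capture-group approach end to end in Python, here is a minimal sketch that concatenates the two groups. It assumes each candidate is tested as a whole string, so re.fullmatch is used in place of the ^ anchor, which also rejects trailing characters. extract_hex is a hypothetical helper name.

```python
import re

# c, hex digits without 0 (group 1), p, at least one more hex digit (group 2)
HEX_PARTS = re.compile(r"c([a-f1-9]*)p([a-f1-9]+)", re.IGNORECASE)


def extract_hex(text):
    # Return the two hex parts joined together, or None if the string is invalid
    m = HEX_PARTS.fullmatch(text)
    return m.group(1) + m.group(2) if m else None


for candidate in ["c45AFP2", "CAPF", "c56Bp26", "CA6C22pAAA",
                  "BCA6C22pAAA", "c56Bp", "c45AF0P2"]:
    print(candidate, "->", extract_hex(candidate))
```

The four valid examples give 45AF2, AF, 56B26 and A6C22AAA; the three invalid ones give None.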

Pandas/regex based approach to match first string from a list of strings

Apologies if this is cross-listed; I searched for a while!
I'm working with some very large, very messy data in Pandas. The variable of interest is a string, and contains one or more instances of business names with(out) typical business suffixes (e.g., LLC, LP, LTD). For example, I might have "ABC LLC XYZ,LLC XYZ, LTD". My goal is to find the first instance of a suffix, matched from a list. I also need to extract everything up to this first match. For the above example, I'd expect to find/extract "ABC LLC". Consider the following data:
sfx = ['LLC','LP','LTD']
dat = pd.DataFrame({'name':['ABC LLC XYZ,LLC XYZ, LTD','IJK LP, ADDRESS']})
So far, I've accomplished this for a single case in a convoluted way that isn't working for me:
one_string = 'ABC LLC XYZ,LLC XYZ, LTD'
indexes=[]
keywords=dict()
for sf in sfx:
    indexes.append(one_string.index(sf,0))
    keywords[one_string.index(sf,0)]=sf
indexes.sort()
print(one_string[0:indexes[0]]+ keywords[indexes[0]])
I'm looking for a more efficient (possibly vectorized) way of doing this for an entire column. In addition, I need to incorporate regex in order to avoid extracting suffixes when the same letter combinations just happen to appear in the text. The regex pattern I need to match might look something like this (LLC appears after space or comma and is at the end of a word):
reg_pattern = r'(?<=[\s,])LLC\b|(?<=[\s,])LP\b|(?<=[\s,])LTD\b'
UPDATE
Straightforward solution by Wiktor. I also realized that once I have extracted what precedes the suffix, I will then need to extract everything that comes after it separately. Throwing the solution into a positive lookbehind didn't work. Very appreciative!
To get the texts that come before and including the keywords, you may use
pattern = r"^(.*?\b(?:{}))(?!\w)".format("|".join(map(re.escape, names)))
and then
df['results'] = df['texts'].str.extract(pattern, expand=False)
Here names is your list of suffixes (sfx in the question); adjust the column names to match your code. The pattern will look like ^(.*?\b(?:LLC|LP|LTD))(?!\w) and will mean:
^ - start of string
(.*?\b(?:LLC|LP|LTD)) - Group 1 (this value will be returned by .str.extract):
.*? - any 0+ chars other than line break chars, as few as possible
\b - a word boundary
(?:LLC|LP|LTD) - one of the alternatives: LLC, LP or LTD
(?!\w) - not followed with a word char: letter, digit or _.
To get all text after a match, you may use
pattern = r"\b(?:{})(?!\w)(.*)".format("|".join(map(re.escape, names)))
Here, the pattern will look like \b(?:LLC|LP|LTD)(?!\w)(.*) and it first matches one of the names as a whole word, and then captures into Group 1 all the rest of the line (matched with (.*) - any 0 or more chars other than line break chars).
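The two patterns can be tried outside pandas with plain re. The sketch below merges them into one pattern with two groups, which is a variation and not the pattern given above, and split_at_first_suffix is a hypothetical helper name.

```python
import re

sfx = ['LLC', 'LP', 'LTD']

# Escape each suffix and join them into one alternation: LLC|LP|LTD
alternatives = "|".join(map(re.escape, sfx))

# Group 1: everything up to and including the first whole-word suffix
# Group 2: the rest of the line after that suffix
pattern = re.compile(rf"^(.*?\b(?:{alternatives}))(?!\w)(.*)")


def split_at_first_suffix(text):
    m = pattern.search(text)
    if m is None:
        return None, None
    return m.group(1), m.group(2)


print(split_at_first_suffix('ABC LLC XYZ,LLC XYZ, LTD'))
print(split_at_first_suffix('IJK LP, ADDRESS'))
```

The first call returns ('ABC LLC', ' XYZ,LLC XYZ, LTD') and the second ('IJK LP', ', ADDRESS'). A string with no whole-word suffix returns (None, None).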

Regex for string that has 5 numbers or IND/5numbers

I am trying to build a regex to match 5 digit numbers or those 5 digit numbers preceded by IND/
10223 match to return 10223
IND/10110 match to return 10110
ID is 11233 match to return 11233
Ref is:10223 match to return 10223
Ref is: th10223 not match
SBI12234 not match
MRF/10234 not match
RBI/10229 not match
I have used the following regex, which selects the 5 digits correctly using the word boundary concept. But I am not sure how to allow IND and not allow anything else like MRF, etc.:
\b\d{5}\b
If I put (IND)? at the beginning of the regex, it doesn't help. Any hints?
Use a look behind:
(?<=^IND\/|^ID is |^)\d{5}\b
Because the look behind doesn’t consume any input, the entire match is your target number (ie there’s no need to use a group).
Variable-length lookbehind is not supported by Python's re module; use alternation instead:
(?:(?<=IND/| is[: ])\d{5}|^\d{5})(?!\d)
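A sketch that runs this pattern over the examples from the question with Python's re module. Both lookbehind alternatives are four characters wide, which is why re accepts them. find_number is a hypothetical helper name.

```python
import re

# Five digits preceded by "IND/", " is:" or " is ", or five digits at the
# very start of the string; in every case not followed by another digit.
FIVE_DIGITS = re.compile(r"(?:(?<=IND/| is[: ])\d{5}|^\d{5})(?!\d)")


def find_number(text):
    m = FIVE_DIGITS.search(text)
    return m.group(0) if m else None


samples = [
    "10223", "IND/10110", "ID is 11233", "Ref is:10223",
    "Ref is: th10223", "SBI12234", "MRF/10234", "RBI/10229",
]
for text in samples:
    print(repr(text), "->", find_number(text))
```

The first four samples return their five-digit number and the last four return None.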
This should work: (?<=IND/|\s|^)(\d{5})(?=\s|$) .
Try this: (?:IND\/|ID is |^)\b(\d{5})\b
Explanation:
(?: ALLOWED TEXT): A non-capture group with all allowed segments inside. In your example, IND\/ for "IND/", ID is for "ID is ...", and ^ for the beginning of the string (in case of only the number / no text at start: 12345).
\b(\d{5})\b: Your existing pattern w/ capture group for 5-digit number
I feel like this will need some logic to it. The regex can find the 5 digits, but maybe a second regex pattern to find IND, then join them together if need be. Not sure if you are using Python, .Net, or Java, but should be doable

Why does this regex to find repeated characters fail?

I'm trying to build a regex to match any occurrence of two or more repeated alphanumeric characters. The following regex fails:
import re
s = '__commit__'
m = re.search(r'([a-zA-Z0-9])\1\1', s)
But when I change it to this it works:
m = re.search(r'([a-zA-Z0-9])\1+', s)
I'm pretty baffled as to why this is the way it is. Can anyone provide some insight?
Look at this line.
m = re.search(r'([a-zA-Z0-9])\1\1', s)
You are using a capture group and two backreferences (a backreference matches the same text the group has already matched). So, it will match only when at least three identical consecutive characters appear. You can do:
m = re.search(r'([a-zA-Z0-9])\1', s)
This will match when at least two identical consecutive characters appear.
However, the following one is much better.
m = re.search(r'([a-zA-Z0-9])\1+', s)
That's because you are now matching one or more backreferences with \1+, that is, a minimum of two consecutive characters.
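The difference is easy to check on the string from the question. A small sketch; the second test string is made up for illustration:

```python
import re

s = "__commit__"

# Needs the same character three times in a row: nothing in "__commit__" qualifies
three_in_a_row = re.search(r"([a-zA-Z0-9])\1\1", s)

# Needs the same character at least twice in a row: finds "mm"
two_or_more = re.search(r"([a-zA-Z0-9])\1+", s)

print(three_in_a_row)        # None
print(two_or_more.group(0))  # mm

# finditer lists every run of a repeated alphanumeric character
runs = [m.group(0) for m in re.finditer(r"([a-zA-Z0-9])\1+", "aaabbc11")]
print(runs)  # ['aaa', 'bb', '11']
```

The underscores are never reported because _ is not in the character class.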
The \1 is a back-reference to the first capturing group. So the original regex that does not work for you essentially means:
Match alphanumeric strings that contain 3 consecutive occurrences of the character matched by the group. In this case the group ([a-zA-Z0-9]) contains a single character a-z or A-Z or 0-9. You then have two \1 in your regex, which account for two back-references to the previously matched character.
In the second regex the back-reference \1 has a + after it, which means match at least one more occurrence of the previously captured character. So a string conforming to this pattern has to contain at least 2 identical consecutive characters.
Hope this helps.
