Regex Name Retrieval - python

I'm attempting to write a simple Regex expression that retrieves names for me based on the presence of a character string at the end of a line.
I've been successful at isolating each of these patterns using pythex in my data set, but I have been unable to match them as a conditional group.
Can someone explain what I am doing wrong?
Data Example
Mark Samson: CA
Sam Smith: US
Dawn Watterton: CA
Neil Shughar: CA
Fennial Fontaine: US
I want to create a regex that uses the end of each line as the condition for the group match, i.e., I want a list of those who live in the US from this dataset. I have used each of these expressions in isolation and each seems to match what I am looking for. What I need is help in making the below a grouped search.
Does anyone have any suggestion?
([US]$)([A-Z][a-z]+)

Something like the following?
(\w+[ \w]*): US
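As a quick check of this suggestion, here is a minimal sketch that runs the pattern over the sample data with re.findall. The names data and us_names are illustrative, and it assumes the records arrive as one multi-line string.

```python
import re

# Sample data from the question, assumed to be a single multi-line string.
data = """Mark Samson: CA
Sam Smith: US
Dawn Watterton: CA
Neil Shughar: CA
Fennial Fontaine: US"""

# With one capturing group, findall returns only the captured names.
us_names = re.findall(r"(\w+[ \w]*): US", data)
print(us_names)  # ['Sam Smith', 'Fennial Fontaine']
```

The pattern is not anchored at the end, so a line ending in ": USA" would also match; appending $ and passing re.MULTILINE would tighten it.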

You say "I have been unable to match them as a conditional group", but you are not using any conditional groups. ([US]$)([A-Z][a-z]+) is an example of a pattern that never matches any string as it matches U or S, then requires an end of string, and then matches an uppercase ASCII letter and one or more ASCII lowercase letters.
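To see why the original pattern cannot match, a small sketch (the lines list is just a subset of the sample data) shows that re.search returns None for every line:

```python
import re

lines = ["Mark Samson: CA", "Sam Smith: US", "Fennial Fontaine: US"]
broken = re.compile(r"([US]$)([A-Z][a-z]+)")

# [US]$ has to sit at the very end of the string, so there is
# nothing left for [A-Z][a-z]+ to consume afterwards.
results = [broken.search(line) for line in lines]
print(results)  # [None, None, None]
```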
You want to match everything from the start of the string up to a colon that is followed by optional whitespace and US at the end of the string.
Hence, use
.+?(?=:\s*US$)
^(.+?):\s*US$
See the regex demo. Details:
.+? - one or more chars other than line break chars as few as possible
(?=:\s*US$) - a positive lookahead that matches a location immediately followed with :, zero or more whitespaces, US string and the end of string.
See a Python demo:
import re
texts = ["Mark Samson: CA", "Sam Smith: US", "Dawn Watterton: CA", "Neil Shughar: CA", "Fennial Fontaine: US"]
for text in texts:
    match = re.search(r".+?(?=:\s*US$)", text)
    if match:
        print(match.group())  # With r"^(.+?):\s*US$" regex, use match.group(1) here
Output:
Sam Smith
Fennial Fontaine
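If the data is held as one multi-line string rather than a list of lines (an assumption, since the question does not say), the second pattern can be applied in a single call by adding re.MULTILINE. A minimal sketch:

```python
import re

data = (
    "Mark Samson: CA\n"
    "Sam Smith: US\n"
    "Dawn Watterton: CA\n"
    "Neil Shughar: CA\n"
    "Fennial Fontaine: US"
)

# re.MULTILINE lets ^ and $ match at the start and end of every line.
names = re.findall(r"^(.+?):\s*US$", data, flags=re.MULTILINE)
print(names)  # ['Sam Smith', 'Fennial Fontaine']
```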

Related

Extract the string from the document using regex in python

I need to extract a string from a document with the following regex pattern in python.
The string will always start with either "AK" or "BK", followed by numbers, letters, - or / (in any order).
This string can appear anywhere in the document.
document_text="""
This is the organization..this is the address.
AKBN
some information
AK3418CPMP
lot of other information down
BKCPU
"""
I have written following code.
pattern="(?:AK|BK)[A-Za-z0-9-/]+"
res_list=re.findall(pattern,document_text)
but I am getting the list contains AKs and BKs
something like this
res_list=['AKBN','AK3418CPMP','BKCPU']
when I just use
res_grp=re.search(pattern,document_text)
res=res_grp.group()
I just get 'AKBN'
it is also matching the words "AKBN", "BKCPU"
along with the required "AK3418CPMP" when I use findall.
I want conditions to be following to extract only 1 string "AK3418CPMP":
string should start with AK or BK
It should be followed by letters and numbers, or numbers and letters
It can contain "-" or "/"
How can I only extract "AK3418CPMP"
You can require at least one digit after AK or BK, and move the - to the end of the character class so that it is unambiguously a literal hyphen rather than a range operator.
\b[AB]K[A-Za-z/-]*[0-9][A-Za-z0-9/-]*
\b A word boundary to prevent a partial match
[AB]K Match either AK or BK
[A-Za-z/-]* Optionally repeat matching chars A-Za-z / or - without a digit
[0-9] Match at least a single digit
[A-Za-z0-9/-]* Optionally match what is listed in the character class including the digit
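A minimal sketch applying this pattern to the document text from the question (variable names follow the question; matches is illustrative):

```python
import re

document_text = """
This is the organization..this is the address.
AKBN
some information
AK3418CPMP
lot of other information down
BKCPU
"""

# The mandatory [0-9] rejects AKBN and BKCPU, which contain no digit.
pattern = r"\b[AB]K[A-Za-z/-]*[0-9][A-Za-z0-9/-]*"
matches = re.findall(pattern, document_text)
print(matches)  # ['AK3418CPMP']
```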
You can keep your regex, and make python do the filtering.
import re
document_text="""
This is the organization..this is the address.
AKBN
some information
AK3418CPMP
lot of other information down
BKCPU
"""
pattern="(?:AK|BK)[A-Za-z0-9-/]+"
res_list=[x for x in
re.findall(pattern,document_text)
if re.search(r'\d', x)
and re.search(r'\w', x)]
print(res_list)
You can include a 'match at least' clause like: ([AB]K[A-Z]{1,}[0-9]{1,})|([AB]K[0-9]{1,}[A-Z]{1,}). This would cover your 1st and 2nd needs. You can customize this regex condition to track the '-' and '/' cases too.
Let's suppose you would like to track cases where the '-' or '/' would separate your substrings :
([AB]K(-|\/){0,1}[A-Z]{1,}(-|\/){0,1}[0-9]{1,})|([AB]K(-|\/){0,1}[0-9]{1,}(-|\/){0,1}[A-Z]{1,})
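The two patterns above can be compared with a short sketch. The candidate "BK-77/XY" is made up purely to exercise the separators, and the names basic and with_sep are illustrative.

```python
import re

basic = re.compile(r"([AB]K[A-Z]{1,}[0-9]{1,})|([AB]K[0-9]{1,}[A-Z]{1,})")
with_sep = re.compile(
    r"([AB]K(-|\/){0,1}[A-Z]{1,}(-|\/){0,1}[0-9]{1,})"
    r"|([AB]K(-|\/){0,1}[0-9]{1,}(-|\/){0,1}[A-Z]{1,})"
)

# "BK-77/XY" is a hypothetical value, not taken from the question.
candidates = ["AKBN", "AK3418CPMP", "BKCPU", "BK-77/XY"]
basic_hits = [c for c in candidates if basic.search(c)]
sep_hits = [c for c in candidates if with_sep.search(c)]
print(basic_hits)  # ['AK3418CPMP']
print(sep_hits)    # ['AK3418CPMP', 'BK-77/XY']
```

Because these patterns contain capturing groups, re.findall would return tuples of groups; re.search (or re.finditer with group(0)) is easier to work with here. Note also that [A-Z] only covers uppercase letters.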

Pandas/regex based approach to match first string from a list of strings

Apologies if this is cross-listed; I searched for a while!
I'm working with some very large, very messy data in Pandas. The variable of interest is a string, and contains one or more instances of business names with(out) typical business suffixes (e.g., LLC, LP, LTD). For example, I might have "ABC LLC XYZ,LLC XYZ, LTD". My goal is to find the first instance of a suffix, matched from a list. I also need to extract everything up to this first match. For the above example, I'd expect to find/extract "ABC LLC". Consider the following data:
sfx = ['LLC','LP','LTD']
dat = pd.DataFrame({'name':['ABC LLC XYZ,LLC XYZ, LTD','IJK LP, ADDRESS']})
So far, I've accomplished this for a single case in a convoluted way that isn't working for me:
one_string = 'ABC LLC XYZ,LLC XYZ, LTD'
indexes=[]
keywords=dict()
for sf in sfx:
    indexes.append(one_string.index(sf,0))
    keywords[one_string.index(sf,0)]=sf
indexes.sort()
print(one_string[0:indexes[0]]+ keywords[indexes[0]])
I'm looking for a more efficient (possibly vectorized) way of doing this for an entire column. In addition, I need to incorporate regex in order to avoid extracting suffixes when the same letter combinations just happen to appear in the text. The regex pattern I need to match might look something like this (LLC appears after space or comma and is at the end of a word):
reg_pattern = r'(?<=[\s\,])LLC\b|(?<=[\s\,])LP\b|(?<=[\s\,])LTD\b'
UPDATE
Straightforward solution by Wiktor. I also realized that once I have extracted what precedes the suffix, I will need to extract everything that comes after it separately. Throwing the solution into a positive lookbehind didn't work. Very appreciative!
To get the texts that come before and including the keywords, you may use
pattern = r"^(.*?\b(?:{}))(?!\w)".format("|".join(map(re.escape, names)))
and then
df['results'] = df['texts'].str.extract(pattern, expand=False)
Adjust the column names to match your code. The pattern will look like ^(.*?\b(?:LLC|LP|LTD))(?!\w) and will mean:
^ - start of string
(.*?\b(?:LLC|LP|LTD)) - Group 1 (this value will be returned by .str.extract):
    .*? - any 0+ chars other than line break chars, as few as possible
    \b - a word boundary
    (?:LLC|LP|LTD) - one of the alternatives: LLC, LP or LTD
(?!\w) - not followed with a word char: letter, digit or _.
To get all text after a match, you may use
pattern = r"\b(?:{})(?!\w)(.*)".format("|".join(map(re.escape, names)))
Here, the pattern will look like \b(?:LLC|LP|LTD)(?!\w)(.*) and it first matches one of the names as a whole word, and then captures into Group 1 all the rest of the line (matched with (.*) - any 0 or more chars other than line break chars).
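Both patterns can be checked without pandas. The following is a minimal sketch using plain re on the question's sample values (the answer itself uses Series.str.extract; the names before_pat, after_pat and results are illustrative):

```python
import re

sfx = ["LLC", "LP", "LTD"]
alternatives = "|".join(map(re.escape, sfx))

# Text up to and including the first suffix, and text after it.
before_pat = re.compile(r"^(.*?\b(?:{}))(?!\w)".format(alternatives))
after_pat = re.compile(r"\b(?:{})(?!\w)(.*)".format(alternatives))

names = ["ABC LLC XYZ,LLC XYZ, LTD", "IJK LP, ADDRESS"]
results = []
for name in names:
    before = before_pat.search(name)
    after = after_pat.search(name)
    results.append((before.group(1), after.group(1)))

print(results)
# [('ABC LLC', ' XYZ,LLC XYZ, LTD'), ('IJK LP', ', ADDRESS')]
```

This sketch assumes every value contains a suffix; for values without one, search returns None and would need a guard before calling group.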

Using Regex to search for a string unless it finds another string first

Hello I'm trying to use regex to search through a markdown file for a date and only get a match if it finds an instance of a specific string before it finds another date.
This is what I have right now and it definitely doesn't work.
(\d{2}\/\d{2}\/\d{2})(string)?(^(\d{2}\/\d{2}\/\d{2}))
So in this instance it would produce a match, since the string appears before the next date:
01/20/20
string
01/21/20
Here it shouldn't match since the string is after the next date:
01/20/20
this isn't the phrase you're looking for
01/21/20
string
Any help on this would be greatly appreciated.
You could match a date-like pattern. Then use a tempered greedy token approach (?:(?!\d{2}\/\d{2}\/\d{2}).)* to match string without matching another date first.
If you have matched the string, use a non-greedy dot .*? to match the first occurrence of the next date.
\d{2}\/\d{2}\/\d{2}(?:(?!\d{2}\/\d{2}\/\d{2}).)*string.*?\d{2}\/\d{2}\/\d{2}
For example (using re.DOTALL to make the dot match a newline)
import re
regex = r"\d{2}\/\d{2}\/\d{2}(?:(?!\d{2}\/\d{2}\/\d{2}).)*string(?:(?!string|\d{2}\/\d{2}\/\d{2}).)*\d{2}\/\d{2}\/\d{2}"
test_str = ("01/20/20\n\n"
    "string\n\n"
    "01/21/20\n\n"
    "01/20/20\n\n"
    "this isn't the phrase you're looking for\n\n"
    "01/21/20\n\n"
    "string")
print(re.findall(regex, test_str, re.DOTALL))
Output
['01/20/20\n\nstring\n\n01/21/20']
If the string cannot occur 2 times between the dates, you might use
\d{2}\/\d{2}\/\d{2}(?:(?!\d{2}\/\d{2}\/\d{2}|string).)*string(?:(?!string|\d{2}\/\d{2}\/\d{2}).)*\d{2}\/\d{2}\/\d{2}
Note that if you don't want the string and the dates to be part of a larger word, you could add word boundaries \b
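A minimal sketch of the difference word boundaries make. The sample texts and the names loose and strict are made up for illustration, and the date pattern is written without the optional backslash before /:

```python
import re

date = r"\d{2}/\d{2}/\d{2}"
loose = re.compile(date + r"(?:(?!" + date + r").)*string.*?" + date, re.DOTALL)
strict = re.compile(
    r"\b" + date + r"\b(?:(?!" + date + r").)*\bstring\b.*?\b" + date + r"\b",
    re.DOTALL,
)

embedded = "01/20/20\nsubstrings\n01/21/20"  # "string" only inside a longer word
whole = "01/20/20\nstring\n01/21/20"         # "string" as a whole word

loose_embedded = bool(loose.search(embedded))
strict_embedded = bool(strict.search(embedded))
strict_whole = bool(strict.search(whole))
print(loose_embedded, strict_embedded, strict_whole)  # True False True
```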
One approach here would be to use a tempered dot to ensure that the regex engine does not cross over the ending date while trying to find the string after the starting date. For example:
import re
inp = """01/20/20
string # <-- this is matched
01/21/20
01/20/20
01/21/20
string""" # <-- this is not matched
matches = re.findall(r'01/20/20(?:(?!\b01/21/20\b).)*?(\bstring\b).*?\b01/21/20\b', inp, flags=re.DOTALL)
print(matches)
This prints string only once, that match being the first occurrence, which legitimately sits in between the starting and ending dates.

extract word and before word and insert between ”_” in regex

I need some help on declaring a regex. My inputs are like the following:
I need to find a word together with the word before it and insert "_" between them, using regex in Python.
Input
s2 = 'Some other medical terms and stuff diagnosis of R45.2 was entered for this patient. Where did Doctor Who go? Then xxx feea fdsfd'
# my regex pattern
re.sub(r"(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,1}diagnosis", r"\1_", s2)
Desired Output:
s2 = 'Some other medical terms and stuff_diagnosis of R45.2 was entered for this patient. Where did Doctor Who go? Then xxx feea fdsfd'
You have no capturing group defined in your regex, but are using \1 placeholder (replacement backreference) to refer to it.
You want to replace 1+ special chars other than - and ' before the word diagnosis, thus you may use
re.sub(r"[^\w'-]+(?=diagnosis)", "_", s2)
Details
[^\w'-]+ - one or more chars other than word chars, ' and -
(?=diagnosis) - a positive lookahead that does not consume the text (does not add to the match value and thus re.sub does not remove this piece of text) but just requires diagnosis text to appear immediately to the right of the current location.
Or
re.sub(r"[^\w'-]+(diagnosis)", r"_\1", s2)
See this regex demo. Here, [^\w'-]+ also matches those special chars, but (diagnosis) is a capturing group whose text can be referred to using the \1 placeholder from the replacement pattern.
NOTE: If you want to make sure diagnosis is matched as a whole word, use \b around it, \bdiagnosis\b (mind the r raw string literal prefix!).
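A minimal sketch comparing the two replacements on a shortened version of the question's string. The input "stuff diagnosis2" is a made-up value used only to show the effect of \b:

```python
import re

s2 = "Some other medical terms and stuff diagnosis of R45.2 was entered for this patient."

# Lookahead variant: only the separator is matched and replaced.
with_lookahead = re.sub(r"[^\w'-]+(?=diagnosis)", "_", s2)
# Capturing-group variant: the word is consumed and put back via \1.
with_group = re.sub(r"[^\w'-]+(diagnosis)", r"_\1", s2)
print(with_lookahead)
# Some other medical terms and stuff_diagnosis of R45.2 was entered for this patient.

# Hypothetical input: with \b, "diagnosis2" does not count as the whole word.
whole_word = re.sub(r"[^\w'-]+(?=\bdiagnosis\b)", "_", "stuff diagnosis2")
print(whole_word)  # stuff diagnosis2
```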

python regex match optional square brackets

I have the following strings:
1 "R J BRUCE & OTHERS V B J & W L A EDWARDS And Ors CA CA19/02 27 February 2003",
2 "H v DIRECTOR OF PROCEEDINGS [2014] NZHC 1031 [16 May 2014]",
3 '''GREGORY LANCASTER AND JOHN HENRY HUNTER V CULLEN INVESTMENTS LIMITED AND
ERIC JOHN WATSON CA CA51/03 26 May 2003'''
I am trying to find a regular expression which matches all of them. I don't know how to match optional square brackets around the date at the end of the string eg [16 May 2014].
casename = re.compile(r'(^[A-Z][A-Za-z\'\(\) ]+\b[v|V]\b[A-Za-z\'\(\) ]+(.*?)[ \[ ]\d+ \w+ \d\d\d\d[\] ])', re.S)
The date regex at the end only matches cases with dates in square bracket but not the ones without.
Thanks to everybody who answered. @Matt Clarkson what I am trying to match is a judicial decision 'handle' in a much larger text. There is a large variation within those handles, but they all start at the beginning of a line, have 'v' for versus between the party names and a date at the end. Mostly the names of the parties are in capitals, but not exclusively. I am trying to have only one match per document and no false positives.
I got all of them to match using this (You'll need to add the case-insensitive flag):
(^[a-z][a-z\'&\(\) ]+\bv\b[a-z&\'\(\) ]+(?:.*?) \[?\d+ \w+ \d{4}\]?)
Explanation:
(                      Begin capture group
^[a-z]                 Match a letter at the start of the string
[a-z\'&\(\) ]+         Match one or more of the characters in this group
\b                     Match a word boundary
v                      Match the character 'v' literally
\b                     Match a word boundary
[a-z&\'\(\) ]+         Match one or more of the characters in this group
(?:                    Begin non-capturing group
.*?                    Match any characters, as few as possible
)                      End non-capturing group
 \[?\d+ \w+ \d{4}\]?   Match a space and a date, optionally surrounded by brackets
)                      End capture group
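A minimal sketch that runs this pattern against the three strings from the question. It assumes the re.S flag from the question is kept alongside the case-insensitive flag: the third string contains a newline, and without re.S the .*? cannot cross it.

```python
import re

handles = [
    "R J BRUCE & OTHERS V B J & W L A EDWARDS And Ors CA CA19/02 27 February 2003",
    "H v DIRECTOR OF PROCEEDINGS [2014] NZHC 1031 [16 May 2014]",
    "GREGORY LANCASTER AND JOHN HENRY HUNTER V CULLEN INVESTMENTS LIMITED AND\n"
    "ERIC JOHN WATSON CA CA51/03 26 May 2003",
]

casename = re.compile(
    r"(^[a-z][a-z\'&\(\) ]+\bv\b[a-z&\'\(\) ]+(?:.*?) \[?\d+ \w+ \d{4}\]?)",
    re.IGNORECASE | re.DOTALL,
)

matched = [casename.search(h) is not None for h in handles]
print(matched)  # [True, True, True]
```

To anchor at the start of each line inside a larger text, re.MULTILINE would also be needed; that is left out here because each handle is tested on its own.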
Square brackets can be made optional like this:
[\[]* - the * allows zero or more opening brackets, which makes the [ optional (\[? would allow at most one).
A few recommendations, if I may:
\d\d\d\d can also be written as \d{4}.
[v|V]: inside [] every character is already an alternative, so the | is unnecessary (it would match a literal |); use [vV].
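These points can be verified with a short sketch (the variable names are illustrative):

```python
import re

# Inside a character class, | is a literal pipe, not alternation.
pipe_in_class = bool(re.fullmatch(r"[v|V]", "|"))
pipe_in_fixed = bool(re.fullmatch(r"[vV]", "|"))
print(pipe_in_class, pipe_in_fixed)  # True False

# \[? and \]? make each bracket optional; \d{4} is shorthand for \d\d\d\d.
date = re.compile(r"\[?\d+ \w+ \d{4}\]?")
bracketed = date.fullmatch("[16 May 2014]") is not None
bare = date.fullmatch("26 May 2003") is not None
print(bracketed, bare)  # True True
```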
Using your regex and input strings, it looks like you will match only the 2nd line (if you get rid of the '^' at the beginning of the regex). I've added inline comments to each section of the regular expression you provided to make it more clear.
Can you indicate what you are trying to capture from each line? Do you want the entire string? Only the word immediately preceding the lone letter 'v'? Do you want the date captured separately?
Depending on the portions that you wish to capture, each section can be broken apart into their respective match groups: regex101.com example. This is a little looser than yours (capturing the entire section between quotation marks instead of only the single word immediately preceding the lone 'v'), and broken apart to help readability (each "group" on its own line).
This example also assumes the newline is intentional, and supports the newline component (warning: it COULD suck up more than you intend, depending on whether the date at the end gets matched or not).
