Python regex matching multiline string


my_str :
PCT Filing Date: 2 December 2015
\nApplicants: Silixa Ltd.
\nChevron U.S.A. Inc. (Incorporated
in USA - California)
\nInventors: Farhadiroushan,
Mahmoud
\nGillies, Arran
Parker, Tom'
My code:
import re
regex = re.compile(r'(Applicants:)( )?(.*)', re.MULTILINE)
print(regex.findall(my_str))
My output:
[('Applicants:', ' ', 'Silixa Ltd.')]
What I need is the text between 'Applicants:' and '\nInventors:':
'Silixa Ltd.' & 'Chevron U.S.A. Inc. (Incorporated
in USA - California)'
Thanks in advance for your help.

Try using re.DOTALL instead:
import re
text='''PCT Filing Date: 2 December 2015
\nApplicants: Silixa Ltd.
\nChevron U.S.A. Inc. (Incorporated
in USA - California)
\nInventors: Farhadiroushan,
Mahmoud
\nGillies, Arran
Parker, Tom'''
regex = re.compile(r'Applicants:(.*?)Inventors:', re.DOTALL)
print(regex.findall(text))
gives me
$ python test.py
[' Silixa Ltd.\n\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n\n']
This works because re.MULTILINE only changes where ^ and $ match; it doesn't let the dot (.) match newlines, whereas re.DOTALL does.

If what you want is the contents between Applicants: and \nInventors:, your regex should reflect that:
>>> regex = re.compile(r'Applicants: (.*)Inventors:', re.S)
>>> print(regex.findall(s))
['Silixa Ltd.\n\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n']
re.S is the "dot matches all" option (an alias for re.DOTALL), so our (.*) will also match newlines. Note that this is different from re.MULTILINE: re.MULTILINE only makes ^ and $ match at the start and end of every line, and doesn't change the fact that . will not match newlines. If . doesn't match newlines, a pattern like (.*) will still stop at the first newline, so it cannot span several lines the way you want.
Also note that if you are not interested in capturing Applicants: or Inventors: themselves, you should not wrap them in (), as in (Applicants:) in your regex, because every pair of parentheses creates a capture group. That's the reason you got a tuple of 3 elements in your output instead of just 1 string.
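To make the difference between the two flags concrete, here is a minimal sketch. The shortened text and the variable names are illustrative assumptions, not the exact string from the question.

```python
import re

# Shortened, illustrative input (an assumption, not the question's exact string)
text = "Applicants: Silixa Ltd.\nChevron U.S.A. Inc.\nInventors: Gillies, Arran"

pattern = r'Applicants: (.*?)\nInventors:'

# With re.MULTILINE alone, '.' still cannot cross a newline, so nothing matches
multiline_only = re.findall(pattern, text, re.MULTILINE)

# re.DOTALL (alias re.S) lets '.' match newlines as well
with_dotall = re.findall(pattern, text, re.DOTALL)

# What re.MULTILINE does change: '^' matches at the start of every line
line_starts = re.findall(r'^\w+:', text, re.MULTILINE)
string_start = re.findall(r'^\w+:', text)

print(multiline_only)
print(with_dotall)
print(line_starts)
print(string_start)
```

The two flags are independent and can be combined with | when a pattern needs both behaviours.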

If you want to match all the text between \nApplicants: and \nInventors:, you can also do it without re.DOTALL, which avoids unnecessary backtracking.
Match Applicants: and capture in group 1 the rest of that same line and all following lines that do not start with Inventors:
Then match a newline followed by Inventors:
^Applicants: (.*(?:\r?\n(?!Inventors:).*)*)\r?\nInventors:
^ Start of the line, because the example code uses re.MULTILINE (or use \b if it does not have to be at the start of a line)
Applicants: Match literally
( Capture group 1
.* Match the rest of the line
(?:\r?\n(?!Inventors:).*)* Match all lines that do not start with Inventors:
) Close group
\r?\nInventors: Match a newline and Inventors:
Example code
import re
text = ("PCT Filing Date: 2 December 2015\n"
"Applicants: Silixa Ltd.\n"
"Chevron U.S.A. Inc. (Incorporated\n"
"in USA - California)\n"
"Inventors: Farhadiroushan,\n"
"Mahmoud\n"
"Gillies, Arran\n"
"Parker, Tom'")
regex = re.compile(r'^Applicants: (.*(?:\r?\n(?!Inventors:).*)*)\r?\nInventors:', re.MULTILINE)
print(regex.findall(text))
Output
['Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)']

Here is a more general approach that parses a string like that into a dict of all the keys and values in it (i.e., any string at the start of a line followed by a : is a key, and the text following that key, up to the next key, is its value):
import re
txt="""\
PCT Filing Date: 2 December 2015
Applicants: Silixa Ltd.
Chevron U.S.A. Inc. (Incorporated
in USA - California)
Inventors: Farhadiroushan,
Mahmoud
Gillies, Arran
Parker, Tom'"""
pat=re.compile(r'(^[^\n:]+):[ \t]*([\s\S]*?(?=(?:^[^\n:]*:)|\Z))', flags=re.M)
data={m.group(1):m.group(2) for m in pat.finditer(txt)}
Result:
>>> data
{'PCT Filing Date': '2 December 2015\n', 'Applicants': 'Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n', 'Inventors': "Farhadiroushan,\nMahmoud\nGillies, Arran\nParker, Tom'"}
>>> data['Applicants']
'Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n'

Related

Extract names between Academic Degree Variances using Regex Python

This code has trouble extracting complete names from between academic degrees, for example Dr. Richard, MM or Dr. Bobby Richard Klaus, MM or Richard, MM. The academic degrees are not only Dr but also Dr., Dra., Prof., Drs, Prof. Dr., M.Ag and ME.
The expected output looks like this:
The Goal Result
Complete Names                  Names
Dr. RICHARD, MM                 Richard
Dra. BOBBY Richard Klaus, MM    Bobby Richard Klaus
Richard, MM                     Richard
but the actual result looks like this:
Actual Result
Complete Names                  Names
Dr. Richard, MM                 Richard
Dra. Bobby Richard Klaus, MM    Richard Klaus
Richard, MM                     Richard, MM
with this code:
import re

def extract_names(text):
    # Normalize separators and fix capitalization
    text = re.sub(r"(_|-)+", " ", text).title()
    # Find the name between whitespace and a comma
    text = re.findall(r"\s[A-Z]\w+(?:\s[A-Z]\w+?)?\s(?:[A-Z]\w+?)?[\s\.\,\;\:]", text)
    text = ' '.join(text[0].split(","))
    return text
Then there is another problem, this error:
     11     text = ' '.join(text[0].split(","))
     12     return text
     13 # def extract_names(text):
IndexError: list index out of range

You can use
ads = r'(?:Dr[sa]?|Prof|M\.Ag|M[EM])\.?'
result = re.sub(fr'^(?:\s*{ads})+\s*|\s*,(?:\s*{ads})+$', '', text, flags=re.I)
The (?:Dr[sa]?|Prof|M\.Ag|M[EM])\.? pattern matches Dr, Drs, Dra, Prof, M.Ag, ME or MM, optionally followed by a dot (.).
The ^(?:\s*{ads})+\s*|\s*,(?:\s*{ads})+$ main pattern matches
^(?:\s*{ads})+\s* - start of string, then one or more sequences of zero or more whitespaces and ads pattern and then zero or more whitespaces
| - or
\s*, - zero or more whitespaces and a comma
(?:\s*{ads})+ - one or more repetitions of zero or more whitespaces and ads pattern
$ - end of string
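A minimal sketch of the substitution applied to the names from the question. The helper name strip_degrees and the constant names are assumptions made for illustration.

```python
import re

# Degree/title alternatives from the answer, each optionally followed by a dot
ADS = r'(?:Dr[sa]?|Prof|M\.Ag|M[EM])\.?'

# Leading titles at the start of the string, or ", <titles>" at the end
PATTERN = re.compile(rf'^(?:\s*{ADS})+\s*|\s*,(?:\s*{ADS})+$', flags=re.I)

def strip_degrees(name):
    """Hypothetical helper: remove leading and trailing degrees from a name."""
    return PATTERN.sub('', name)

for raw in ["Dr. Richard, MM", "Dra. Bobby Richard Klaus, MM", "Richard, MM"]:
    print(strip_degrees(raw))
```

Because re.sub simply finds nothing to replace when no degree is present, this avoids the IndexError that text[0] raises on an empty findall result. One caveat: nothing in the pattern requires a word boundary after a title, so a name that itself begins with one of these letter sequences (for example "Drake") would be clipped; adding \b or requiring the dot is one way to tighten it.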

Returning empty string for missing capture group Python regex

I'm working on parsing string text containing information on university, year, degree field, and whether or not a person graduated. Here are two examples:
ex1 = 'BYU: 1990 Bachelor of Arts Theater (Graduated):BYU: 1990 Bachelor of Science Mathematics (Graduated):UNIVERSITY OF VIRGINIA: 1995 Master of Science Mechanical Engineering (Graduated):MICHIGAN STATE UNIVERSITY: 2008 Master of Fine Arts INDUSTRIAL DESIGN (Graduated)'
ex2 = 'UCSD: 2001 Bachelor of Arts English:UCLA: 2005 Bachelor of Science Economics (Graduated):UCSD 2010 Master of Science Economics'
What I am struggling to accomplish is to have an entry for each school experience regardless of whether specific information is missing. In particular, imagine I wanted to pull whether each degree was finished from ex1 and ex2 above. When I try to use re.findall I end up with something like the following for ex1:
# Code:
re.findall(r'[A-Z ]+: \d+ Bachelor [^:]+\((Graduated)', ex1)
# Output:
['Graduated', 'Graduated']
which is what I want, two entries for two Bachelor's degrees. For ex2, however, one of the Bachelor's degrees was unfinished so the text does not contain "(Graduated)", so the output is the following:
# Code:
re.findall(r'[A-Z ]+: \d+ Bachelor [^:]+\((Graduated)', ex2)
# Output:
['Graduated']
# Desired Output:
['', 'Graduated']
I have tried making the capture group optional or including the colon after graduated and am not making much headway. The example I am using is the "Graduated" information, but in principle the more general question remains if there is an identifiable degree but it is missing one or two pieces of information (like graduation year or university). Ultimately I am just looking to have complete information on each degree, including whether certain pieces of information are missing. Thank you for any help you can provide!

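You can use the ? quantifier to make "Graduated" (and the opening parenthesis) optional, so that each of them matches zero or one time. When an optional capture group does not take part in a match, re.findall returns an empty string for it.
re.findall(r'[A-Z ]+: \d+ Bachelor [^:()]*\(?(Graduated)?', ex2)
Output:
>>> re.findall(r'[A-Z ]+: \d+ Bachelor [^:()]*\(?(Graduated)?', ex2)
['', 'Graduated']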
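The same idea as a self-contained script, run against both example strings from the question. The variable names pattern, result1 and result2 are illustrative assumptions.

```python
import re

ex1 = ('BYU: 1990 Bachelor of Arts Theater (Graduated):'
       'BYU: 1990 Bachelor of Science Mathematics (Graduated):'
       'UNIVERSITY OF VIRGINIA: 1995 Master of Science Mechanical Engineering (Graduated):'
       'MICHIGAN STATE UNIVERSITY: 2008 Master of Fine Arts INDUSTRIAL DESIGN (Graduated)')
ex2 = ('UCSD: 2001 Bachelor of Arts English:'
       'UCLA: 2005 Bachelor of Science Economics (Graduated):'
       'UCSD 2010 Master of Science Economics')

# [^:()]* stops before a colon or parenthesis; both "(" and the group are optional
pattern = re.compile(r'[A-Z ]+: \d+ Bachelor [^:()]*\(?(Graduated)?')

result1 = pattern.findall(ex1)
result2 = pattern.findall(ex2)
print(result1)
print(result2)
```

Every Bachelor entry now yields exactly one item, with an empty string standing in for a missing "(Graduated)".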

How about this?
[re.sub(r'[(:)]', '', t) for t in [re.sub(r'^[^\(]+', '', s) for s in re.findall(r'[A-Z ]+: \d+ Bachelor [^:]+:', ex1)]]
# output ['Graduated', 'Graduated']
[re.sub(r'[(:)]', '', t) for t in [re.sub(r'^[^\(]+', '', s) for s in re.findall(r'[A-Z ]+: \d+ Bachelor [^:]+:', ex2)]]
# output ['', 'Graduated']

Extracting words next to a location or Duration in python

How can I extract the words next to a location or duration? What is the best possible regex in Python to do this?
Example:-
Karthick Kumar, Bangalore who was a great person and lived from 29th March 1980 - 21 Dec 2014.
In the above example I want to extract the words before the location and the words before the duration. The location and duration are not fixed, so what would be the best possible regex for this in Python? Or can we do this using nltk?
Desired output:-
Output-1: Karthick Kumar (Keyword here is Location)
Output-2: who was a great person and lived from (Keyword here is duration)

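I suggest using lookarounds.
In your example, assuming you want the words before Bangalore and before 29th March 1980 - 21 Dec 2014, you can use a lookahead to stop the match just before the duration.
I've used this regex: (.*)(?>Bangalore)(.+)(?=29th March 1980 - 21 Dec 2014) and captured the text in parentheses, which can be accessed by using \1 and \2 (m.group(1) and m.group(2) in Python). Note that (?>...) is an atomic group rather than a lookbehind; Python's re module only supports it from version 3.11, and for a literal like this a non-capturing group (?:...) behaves the same way.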
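A hedged sketch of this approach in Python. It assumes the location and the duration strings are already known (for instance, found by an earlier step), which the question does not guarantee; the variable names are illustrative. A plain literal is used in place of the atomic group so the code also runs on Python versions before 3.11.

```python
import re

text = ("Karthick Kumar, Bangalore who was a great person "
        "and lived from 29th March 1980 - 21 Dec 2014.")

# Assumption: these two values were identified beforehand
location = "Bangalore"
duration = "29th March 1980 - 21 Dec 2014"

# Group 1: text before the location; group 2: text up to (not including) the duration
pattern = re.compile(
    rf'(.*?){re.escape(location)}(.+?)(?={re.escape(duration)})'
)

m = pattern.search(text)
before_location = m.group(1).strip(' ,')
before_duration = m.group(2).strip()
print(before_location)
print(before_duration)
```

re.escape keeps characters such as "-" in the duration from being read as regex syntax. When the location and duration are not known in advance, a regex alone is unlikely to be enough and some entity-recognition step would be needed to find them first.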

Python Regex - Different Results in findall and sub

I am trying to replace occurrences of the word 'brunch' with 'BRUNCH'. I am using a regex which correctly identifies the occurrence, but when I try to use re.sub it replaces more text than re.findall identifies. The regex that I am using is:
reg = re.compile(r'(?:^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)[^.]*(brunch)',re.IGNORECASE)
The string is
str = 'Valid only for dine-in January 2 - March 31, 2015. Excludes brunch, happy hour, holidays, and February 13 - 15, 2015.'
I want it to produce:
'Valid only for dine-in January 2 - March 31, 2015. Excludes BRUNCH, happy hour, holidays, and February 13 - 15, 2015.'
The steps:
>>> reg.findall(str)
['brunch']
>>> reg.sub('BRUNCH', str)
'Valid only for dine-in January 2 - March 31, 2015BRUNCH, happy hour, holidays, and February 13 - 15, 2015.'
Edit:
The final solution that I used was:
reg = re.compile(r'((?:^|\.))(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)([^.]*)(brunch)',re.IGNORECASE)
reg.sub(r'\g<1>\g<2>BRUNCH', str)

For re.sub, use
(^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)([^.]*)(brunch)
and replace with \1\2BRUNCH. See the demo:
https://regex101.com/r/eZ0yP4/16
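As a runnable sketch of this replacement (the variable names reg, s and result are illustrative; s is used instead of str to avoid shadowing the built-in):

```python
import re

s = ('Valid only for dine-in January 2 - March 31, 2015. '
     'Excludes brunch, happy hour, holidays, and February 13 - 15, 2015.')

# Group 1: sentence start or the dot; group 2: text before the word; group 3: the word
reg = re.compile(
    r'(^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)([^.]*)(brunch)',
    re.IGNORECASE,
)

# Put groups 1 and 2 back unchanged and replace only the word itself
result = reg.sub(r'\g<1>\g<2>BRUNCH', s)
print(result)
```

The \g<1> form is equivalent to \1 here; it is just harder to misread when a group reference is followed directly by other characters.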

Through regex:
(^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)([^.]*)brunch
Replace the matched characters by \1\2BRUNCH

Why does it match more than brunch?
Because your regex actually does match more than brunch: the full match also includes the leading dot (if any) and the [^.]* text before the word.
Why doesn't it show in findall?
Because you have wrapped only brunch in parentheses:
>>> reg = re.compile(r'(?:^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)[^.]*(brunch)',re.IGNORECASE)
>>> reg.findall(str)
['brunch']
After wrapping the entire ([^.]*brunch) in parentheses:
>>> reg = re.compile(r'(?:^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)([^.]*brunch)',re.IGNORECASE)
>>> reg.findall(str)
[' Excludes brunch']
When a pattern contains capture groups, re.findall returns only the captured text and ignores the rest of the match, whereas re.sub replaces the entire match.
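The difference is easy to inspect on a match object: group(0) is the whole match that re.sub replaces, while group(1) is what re.findall reports. A small sketch with a shortened, illustrative sentence (an assumption, not the question's exact string):

```python
import re

s = 'Valid through March 31, 2015. Excludes brunch and holidays.'

reg = re.compile(
    r'(?:^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)[^.]*(brunch)',
    re.IGNORECASE,
)

m = reg.search(s)
whole_match = m.group(0)   # everything the pattern consumed
captured = m.group(1)      # only the parenthesised part
found = reg.findall(s)     # findall reports the captured group only
replaced = reg.sub('BRUNCH', s)  # sub replaces the whole match

print(repr(whole_match))
print(repr(captured))
print(found)
print(replaced)
```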

Why doesn't this regular expression work in all cases?

I have a text file containing entries like this:
#markwarner VIRGINIA - Mark Warner
#senatorleahy VERMONT - Patrick Leahy NO
#senatorsanders VERMONT - Bernie Sanders
#orrinhatch UTAH - Orrin Hatch NO
#jimdemint SOUTH CAROLINA - Jim DeMint NO
#senmikelee UTAH -- Mike Lee
#kaybaileyhutch TEXAS - Kay Hutchison
#johncornyn TEXAS - John Cornyn
#senalexander TENNESSEE - Lamar Alexander
I have written the following to remove the 'NO' and the dashes using regular expressions:
import re

with open('testfile.txt') as politicians:
    text = politicians.read()
# Grab the 'no' votes
# Should be 11 entries
regex = re.compile(r'(no\s#[\w+\d+\.]*\s\w+\s?\w+?\s?\W+\s\w+\s?\w+)', re.I)
no = regex.findall(text)
## Make the list a string
newlist = ' '.join(no)
## Replace the dashes in the string with a space
deldash = re.compile(r'\s-*\s')
a = deldash.sub(' ', newlist)
# Delete 'NO' in the string
delno = re.compile(r'NO\s')
b = delno.sub('', a)
# make the string into a list
# problem with #jimdemint SOUTH CAROLINA Jim DeMint
regex2 = re.compile(r'(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+)', re.I)
lst1 = regex2.findall(b)
for i in lst1:
    print(i)
When I run the code, it captures the twitter handle, state and full names other than the surname of Jim DeMint. I have stated that I want to ignore case for the regex.
Any ideas? Why is the expression not capturing this surname?

It misses the surname because his state name contains two words: SOUTH CAROLINA
Change your second regex to this, which should help:
(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+(?:\s\w+)?)
I added
(?:\s\w+)?
which is an optional, non-capturing group matching a whitespace character followed by one or more word characters (letters, digits or underscores).
http://regexr.com?31fv5 shows that it properly matches the input with the NOs and dashes stripped
EDIT:
If you want one master regex to capture and split everything properly after you have removed the NOs and dashes, use
((#[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))
Which you can play with here: http://regexr.com?31fvk
The full match is available in $1, the Twitter handle in $2, the state in $3 and the name in $4.
Each capturing group works as follows:
(#[\w]+?\s)
This matches a # sign followed by at least one word character, but as few as possible, up to a whitespace character.
((?:(?:[\w]+?)\s){1,2})
This matches and captures one or two words, which should be the state. This only works because of the next piece, which MUST have two words.
((?:[\w]+?\s){2})
This matches and captures exactly two words, where a word is defined as as few word characters as possible followed by a whitespace character.
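A sketch of how the master regex could be used from Python, where the groups are read with m.group(n) rather than $n. The cleaned string is a hypothetical, already-stripped input; note the trailing space, which the pattern needs because every word must be followed by whitespace.

```python
import re

# Hypothetical cleaned text: NOs and dashes removed, entries separated by
# single spaces, and a trailing space so the last name still ends in \s
cleaned = "#jimdemint SOUTH CAROLINA Jim DeMint #senmikelee UTAH Mike Lee "

master = re.compile(r'((#[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))')

# Groups 2, 3 and 4 hold the handle, the state and the name
records = [
    (m.group(2).strip(), m.group(3).strip(), m.group(4).strip())
    for m in master.finditer(cleaned)
]
print(records)
```

Since the name group requires exactly two words, this sketch only holds for two-word names and for states of one or two words, as in the sample data.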

To remove the NOs and dashes:
text=re.sub(r' (NO|-+)(?= |$)','',text)
And to capture everything:
re.findall(r'(#\w+) ([A-Z ]+[A-Z]) (.+?(?= #|$))',text)
Or all at once:
re.findall(r'(#\w+) ([A-Z ]+[A-Z])(?: NO| -+)? (.+?(?= #|$))',text)
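As an alternative to the patterns above (a different technique, not the one used in these answers), each line of the file can be parsed on its own, which sidesteps the question of how entries are separated. This is a sketch under the assumption that every line has the form "#handle STATE - Name" with an optional trailing NO; the names LINE and parse_line are hypothetical.

```python
import re

# Sample lines copied from the question
lines = [
    "#markwarner VIRGINIA - Mark Warner",
    "#senatorleahy VERMONT - Patrick Leahy NO",
    "#jimdemint SOUTH CAROLINA - Jim DeMint NO",
    "#senmikelee UTAH -- Mike Lee",
]

# handle, state (upper-case words), one or more dashes, name, optional trailing NO
LINE = re.compile(r'^(#\w+)\s+([A-Z ]+?)\s+-+\s+(.+?)(\s+NO)?$')

def parse_line(line):
    """Return (handle, state, name, voted_no), or None if the line does not fit."""
    m = LINE.match(line.strip())
    if m is None:
        return None
    handle, state, name, no_flag = m.groups()
    return handle, state, name, no_flag is not None

parsed = [parse_line(line) for line in lines]
for record in parsed:
    print(record)
```

Anchoring the state on the dashes that follow it means the number of words in the state or in the name no longer matters.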
