Python Regex - Different Results in findall and sub - python

I am trying to replace occurrences of the word 'brunch' with 'BRUNCH'. I am using a regex which correctly identifies the occurrence, but when I use re.sub it replaces more text than re.findall reports. The regex that I am using is:
re.compile(r'(?:^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)[^.]*(brunch)',re.IGNORECASE)
The string is
str = 'Valid only for dine-in January 2 - March 31, 2015. Excludes brunch, happy hour, holidays, and February 13 - 15, 2015.'
I want it to produce:
'Valid only for dine-in January 2 - March 31, 2015. Excludes BRUNCH, happy hour, holidays, and February 13 - 15, 2015.'
The steps:
>>> reg.findall(str)
['brunch']
>>> reg.sub('BRUNCH',str)
'Valid only for dine-in January 2 - March 31, 2015BRUNCH, happy hour, holidays, and February 13 - 15, 2015.'
Edit:
The final solution that I used was:
reg = re.compile(r'((?:^|\.))(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)([^.]*)(brunch)',re.IGNORECASE)
reg.sub(r'\g<1>\g<2>BRUNCH',str)

For re.sub use
(^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)([^.]*)(brunch)
Replace by \1\2BRUNCH. See demo.
https://regex101.com/r/eZ0yP4/16

Through regex:
(^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)([^.]*)brunch
Replace the matched characters by \1\2BRUNCH

Why does it match more than brunch?
Because your regex actually does match more than brunch.
See the link for how the regex matches.
Why doesn't it show in findall?
Because you have wrapped only the brunch in parentheses.
>>> reg = re.compile(r'(?:^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)[^.]*(brunch)',re.IGNORECASE)
>>> reg.findall(str)
['brunch']
After wrapping the entire ([^.]*brunch) in parentheses:
>>> reg = re.compile(r'(?:^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)([^.]*brunch)',re.IGNORECASE)
>>> reg.findall(str)
[' Excludes brunch']
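When the pattern contains a capture group, re.findall returns only the captured text; anything that is matched but not captured is left out of the result.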
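The difference can be made visible by looking at the match object directly. The following is a minimal sketch using the string and patterns from the question; the variable names (text, whole_match, captured, fixed, result) are chosen here for illustration only.

```python
import re

text = ('Valid only for dine-in January 2 - March 31, 2015. '
        'Excludes brunch, happy hour, holidays, and February 13 - 15, 2015.')

# Original pattern: only "brunch" is captured, but the match starts at the "."
reg = re.compile(
    r'(?:^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)[^.]*(brunch)',
    re.IGNORECASE)

match = reg.search(text)
whole_match = match.group(0)  # the full span that re.sub replaces
captured = match.group(1)     # the only part that re.findall reports

# Capturing the prefix as well lets the replacement put it back.
fixed = re.compile(
    r'(^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)([^.]*)(brunch)',
    re.IGNORECASE)
result = fixed.sub(r'\g<1>\g<2>BRUNCH', text)

print(repr(whole_match))
print(repr(captured))
print(result)
```

re.sub always replaces group 0 (the whole match), so any text that must survive has to be captured and written back in the replacement string.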

Related

Extract data if between substrings else full string

I have string patterns like these:
Beginning through June 18, 2022 at Noon standard time\n
Jan 20, 2022
Beginning through April 26, 2022 at 12:01 a.m. standard time
I want to extract the data part present after the word "through" and before the word "at" using a Python regex.
June 18, 2022
Jan 20, 2022
April 26, 2022
I can extract for the long text using re group.
s ="Beginning through June 18, 2022 at Noon standard time"
re.search(r'(.*through)(.*) (at.*)', s).group(2)
However it will not work for
s ="June 18, 2022"
Can anyone help me on that.
You may use this regex with a capture group:
(?:.* through |^)(.+?)(?: at |$)
RegEx Demo
RegEx Details:
(?:.* through |^): Match anything followed by " through " or start position
(.+?): Match 1+ of any character and capture it in group #1
(?: at |$): Match " at " or end of string
Code:
import re
arr = ['Beginning through June 18, 2022 at Noon standard time',
'Jan 20, 2022',
'Beginning through April 26, 2022 at 12:01 a.m. standard time']
for i in arr:
    print(re.findall(r'(?:.* through |^)(.+?)(?: at |$)', i))
Output:
['June 18, 2022']
['Jan 20, 2022']
['April 26, 2022']
How about playing with optional groups and backtracking.
^(?:.*?through )?(.*?)(?: at.*)?$
See this demo at regex101 or a Python demo at tio.run
Note that if just one of the substrings is present, it will match either from the first substring to the end of the string or from the start of the string to the latter substring. If neither is present, it will match the full string.
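As a rough illustration of the optional-group pattern, here is a small sketch covering the cases described above. The helper name extract_date and the last two sample strings are assumptions made for this example, not part of the original answer.

```python
import re

# Optional prefix up to "through ", lazy capture, optional suffix from " at".
pattern = re.compile(r'^(?:.*?through )?(.*?)(?: at.*)?$')

def extract_date(s):
    """Return the captured part, or None if the pattern does not match."""
    match = pattern.search(s)
    return match.group(1) if match else None

samples = [
    'Beginning through June 18, 2022 at Noon standard time',  # both present
    'Jan 20, 2022',                                           # neither present
    'Beginning through June 18, 2022',                        # only "through"
    'June 18, 2022 at Noon',                                  # only "at"
]
for s in samples:
    print(extract_date(s))
```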
Another idea could be to use PyPI regex which supports branch reset groups.
^(?|.*?through (.+?) at|(.+))
This one extracts the part in between if both are present, else the full string. As far as I know, the regex module is largely compatible with Python's re functions; just use import regex as re instead.
Demo at regex101 or Python demo at tio.run

How to find the words correspond to month and replace it with numerical?

How to find the words that correspond to the month "January, February, March,.. etc." and replace them with numerical "01, 02, 03,.."
I tried the code below
def transformMonths(string):
    rep = [("May", "05"), ("June", "06")]
    for pat, repl in rep:
        s = re.sub(pat, repl, string)
    return s
print( transformMonths('I was born on June 24 and my sister was born on May 17') )
My code provides this result ('I was born on 06 24 and my sister was born on May 17')
However, I want the output to be like this ('I was born on 06 24 and my sister was born on 05 17')
You are performing the replacement on the initial (unmodified) string at each iteration so you end up with only one month name being replaced. You can fix that by assigning string instead of s in the loop (and return string at the end).
Note that your approach does not require a regular expression and could use a simple string replace: string = string.replace(pat,repl).
In both cases, because the replacement does not take into account word boundaries, the function would replace partial words such as:
"Mayor Smith was elected on May 25" --> "05or Smith was elected on 05 25".
You can fix that in your regular expression by adding \b before and after each month name. This will ensure that the month names are only found if they are between word boundaries.
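A minimal sketch of the loop with both fixes applied (reassigning the running string and adding \b around each name) might look like this. The function name transform_months and its two-entry month list are kept small on purpose and are only illustrative.

```python
import re

def transform_months(text):
    """Replace month names with numbers, one pattern at a time."""
    rep = [("May", "05"), ("June", "06")]
    for name, number in rep:
        # Reassign text so each replacement builds on the previous one;
        # \b keeps "May" from matching inside words such as "Mayor".
        text = re.sub(r"\b" + re.escape(name) + r"\b", number, text)
    return text

print(transform_months('I was born on June 24 and my sister was born on May 17'))
print(transform_months('Mayor Smith was elected on May 25'))
```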
The re.sub can perform multiple replacements with varying values if you give it a function instead of a fixed string. So you can build a combined regular expression that will find all the months and replace the words that are found using a dictionary:
import re
def numericMonths(string):
    months = {"January":"01", "February":"02","March":"03", "April":"04",
              "May":"05", "June":"06", "July":"07", "August":"08",
              "September":"09","October":"10", "November":"11","December":"12"}
    pattern = r"\b("+"|".join(months)+r")\b" # all months as distinct words
    return re.sub(pattern,lambda m:months[m.group()],string)
output:
numericMonths('I was born on June 24 and my sister was born on May 17')
'I was born on 06 24 and my sister was born on 05 17'

Python regex matching multiline string

my_str :
PCT Filing Date: 2 December 2015
\nApplicants: Silixa Ltd.
\nChevron U.S.A. Inc. (Incorporated
in USA - California)
\nInventors: Farhadiroushan,
Mahmoud
\nGillies, Arran
Parker, Tom'
my code
regex = re.compile(r'(Applicants:)( )?(.*)', re.MULTILINE)
print(regex.findall(text))
my output :
[('Applicants:', ' ', 'Silixa Ltd.')]
what I need is to get the string between 'Applicants:' and '\nInventors:'
'Silixa Ltd.' & 'Chevron U.S.A. Inc. (Incorporated
in USA - California)'
Thanks in advance for your help
Try using re.DOTALL instead:
import re
text='''PCT Filing Date: 2 December 2015
\nApplicants: Silixa Ltd.
\nChevron U.S.A. Inc. (Incorporated
in USA - California)
\nInventors: Farhadiroushan,
Mahmoud
\nGillies, Arran
Parker, Tom'''
regex = re.compile(r'Applicants:(.*?)Inventors:', re.DOTALL)
print(regex.findall(text))
gives me
$ python test.py
[' Silixa Ltd.\n\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n\n']
The reason this works is that MULTILINE doesn't let the dot (.) match newlines, whereas DOTALL will.
If what you want is the contents between Applicants: and \nInventors:, your regex should reflect that:
>>> regex = re.compile(r'Applicants: (.*)\nInventors:', re.S)
>>> print(regex.findall(s))
['Silixa Ltd.\n\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n']
re.S is the "dot matches all" option, so our (.*) will also match newlines. Note that this is different from re.MULTILINE: re.MULTILINE only makes ^ and $ match at the start and end of every line, but doesn't change the fact that . will not match newlines. If . doesn't match newlines, a match like (.*) will still stop at newlines, not achieving the multiline effect you want.
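To make the distinction concrete, here is a small sketch on a shortened version of the text. The shortened string and the variable names are assumptions for illustration only.

```python
import re

text = ("Applicants: Silixa Ltd.\n"
        "Chevron U.S.A. Inc.\n"
        "Inventors: Parker, Tom")

# Without re.DOTALL, "." stops at the first newline, even with re.MULTILINE.
multiline_only = re.findall(r'Applicants: (.*)', text, re.MULTILINE)

# re.MULTILINE affects "^": it then matches at the start of every line.
line_starts = re.findall(r'^\w+:', text, re.MULTILINE)
string_start = re.findall(r'^\w+:', text)

# re.DOTALL (re.S) lets "." cross newlines, so the capture spans two lines.
dotall = re.findall(r'Applicants: (.*?)\nInventors:', text, re.DOTALL)

print(multiline_only)
print(line_starts)
print(string_start)
print(dotall)
```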
Also note that if you are not interested in Applicants: or Inventors: you may not want to put that between (), as in (Inventors:) in your regex, because the match will try to create a matching group for it. That's the reason you got 3 elements in your output instead of just 1.
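The effect of the extra groups on the shape of the findall result can be sketched as follows. The shortened text and the two rewritten patterns are illustrative assumptions, not taken from the answer.

```python
import re

text = "Applicants: Silixa Ltd.\nChevron U.S.A. Inc."

# Three capture groups: findall returns one tuple of three strings per match.
with_groups = re.findall(r'(Applicants:)( )?(.*)', text)

# One capture group: findall returns a plain list of strings.
single_group = re.findall(r'Applicants: ?(.*)', text)

# Non-capturing groups (?:...) group without adding to the result.
non_capturing = re.findall(r'(?:Applicants:)(?: )?(.*)', text)

print(with_groups)
print(single_group)
print(non_capturing)
```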
If you want to match all the text between \nApplicants: and \nInventors:, you could also get the match without using re.DOTALL preventing unnecessary backtracking.
Match Applicants: and capture in group 1 the rest of that same line and all lines that follow that do not start with Inventors:
Then match Inventors.
^Applicants: (.*(?:\r?\n(?!Inventors:).*)*)\r?\nInventors:
^ Start of string, or start of a line with re.MULTILINE as in the example code below (or use \b if it does not have to be at the start)
Applicants: Match literally
( Capture group 1
.* Match the rest of the line
(?:\r?\n(?!Inventors:).*)* Match all lines that do not start with Inventors:
) Close group
\r?\nInventors: Match a newline and Inventors:
Regex demo | Python demo
Example code
import re
text = ("PCT Filing Date: 2 December 2015\n"
"Applicants: Silixa Ltd.\n"
"Chevron U.S.A. Inc. (Incorporated\n"
"in USA - California)\n"
"Inventors: Farhadiroushan,\n"
"Mahmoud\n"
"Gillies, Arran\n"
"Parker, Tom'")
regex = re.compile(r'^Applicants: (.*(?:\r?\n(?!Inventors:).*)*)\r?\nInventors:', re.MULTILINE)
print(regex.findall(text))
Output
['Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)']
Here is a more general approach to parse a string like that into a dict of all the keys and values in it (ie, any string at the start of a line followed by a : is a key and the string following that key is data):
import re
txt="""\
PCT Filing Date: 2 December 2015
Applicants: Silixa Ltd.
Chevron U.S.A. Inc. (Incorporated
in USA - California)
Inventors: Farhadiroushan,
Mahmoud
Gillies, Arran
Parker, Tom'"""
pat=re.compile(r'(^[^\n:]+):[ \t]*([\s\S]*?(?=(?:^[^\n:]*:)|\Z))', flags=re.M)
data={m.group(1):m.group(2) for m in pat.finditer(txt)}
Result:
>>> data
{'PCT Filing Date': '2 December 2015\n', 'Applicants': 'Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n', 'Inventors': "Farhadiroushan,\nMahmoud\nGillies, Arran\nParker, Tom'"}
>>> data['Applicants']
'Silixa Ltd.\nChevron U.S.A. Inc. (Incorporated\nin USA - California)\n'
Demo of the regex

Python, regex to exclude matches of numbers

To use regex to extract any numbers of length greater than 2, in a string, but also exclude "2016", here is what I have:
import re
string = "Employee ID DF856, Year 2016, Department Finance, Team 2, Location 112 "
print(re.findall(r'\d{3,}', string))
output:
['856', '2016', '112']
I tried to change it to below to exclude "2016" but all failed.
print(re.findall(r'\d{3,}/^(!2016)/', string))
print(re.findall(r"\d{3,}/?!2016/", string))
print(re.findall(r"\d{3,}!'2016'", string))
What is the right way to do it? Thank you.
the question was extended, please see the final comment made by Wiktor Stribiżew for the update.
You may use
import re
s = "Employee ID DF856, Year 2016, Department Finance, Team 2, Location 112 20161 12016 120162"
print(re.findall(r'(?<!\d)(?!2016(?!\d))\d{3,}', s))
See the Python demo and a regex demo.
Details
(?<!\d) - no digit allowed immediately to the left of the current location
(?!2016(?!\d)) - the text immediately to the right of the current location must not be 2016 as a complete number (that is, 2016 not followed by another digit)
\d{3,} - 3 or more digits.
An alternative solution with some code:
import re
s = "Employee ID DF856, Year 2016, Department Finance, Team 2, Location 112 20161 12016 120162"
print([x for x in re.findall(r'\d{3,}', s) if x != "2016"])
Here, we extract any chunks of 3 or more digits (re.findall(r'\d{3,}', s)) and then filter out those equal to 2016.
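As a quick check, the two approaches above can be run side by side on the extended sample string. This is only a sketch; the variable names with_lookarounds and with_filter are made up for the comparison.

```python
import re

s = ("Employee ID DF856, Year 2016, Department Finance, Team 2, "
     "Location 112 20161 12016 120162")

# Lookaround version: reject a run of digits only when it is exactly 2016.
with_lookarounds = re.findall(r'(?<!\d)(?!2016(?!\d))\d{3,}', s)

# Filter version: collect every run of 3+ digits, then drop "2016".
with_filter = [x for x in re.findall(r'\d{3,}', s) if x != "2016"]

print(with_lookarounds)
print(with_filter)
```

Both keep longer numbers that merely contain 2016, such as 20161 and 12016.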
You want to use a negative lookahead. The correct syntax is:
\D(?!2016)(\d{3,})\b
Results in:
In [24]: re.findall(r'\D(?!2016)(\d{3,})\b', string)
Out[24]: ['856', '112']
Or using a negative lookbehind:
In [26]: re.findall(r'\D(\d{3,})(?<!2016)\b', string)
Out[26]: ['856', '112']
Another way to do this can be:
st="Employee ID DF856, Year 2016, Department Finance, Team 2, Location 112 "
re.findall(r"\d{3,}",re.sub("((2)?(016))","",st))
output will be:
['856', '112']
But the accepted answer is, as far as I can see, faster than my suggestion.

Tokenizing with different delimiters

Say I'm reading a file that has a certain structure but is different on every line. For example, 'directory.csv' reads the following:
November 11, Veterans’s Day
November 24, Thanksgiving
December 25, Christmas
I want to split the lines by space, then comma, so I can have the month, the day, and the holiday. I want to use re.split but I don't know how to set up the regular expression format-wise. This is what I have:
fp = open('holidays2011.csv', 'r')
import re
for item in fp :
    month, day, holiday = re.split('; |, ', item)
    print(month, day, holiday)
But when I print, it says I don't have enough items to unpack. But why? I'm splitting at the space and the comma, which gives me 3 items that I named as 3 variables.
You don't need Regular Expressions for this,
with open("Input.txt") as inFile:
    for item in inFile:
        datePart, holiday = item.split(", ", 1)
        month, day = datePart.split()
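To see why the unpacking in the question fails, it can help to inspect what re.split actually returns. The sketch below uses a single sample line instead of a file. The alternative pattern ',? ' with maxsplit=2 is an assumption added here, for readers who still want re.split, and is not taken from the answers.

```python
import re

line = "November 11, Veterans’s Day\n"

# '; |, ' means "semicolon-space OR comma-space"; a lone space is not a
# delimiter, so only the ", " splits and two items come back.
parts = re.split('; |, ', line)

# Plain string methods: split once on ", ", then split the date on whitespace.
date_part, holiday = line.strip().split(", ", 1)
month, day = date_part.split()

# Hypothetical re.split variant: optional comma followed by a space, at most
# two splits so that spaces inside the holiday name are preserved.
via_regex = re.split(r',? ', line.strip(), maxsplit=2)

print(parts)
print(month, day, holiday)
print(via_regex)
```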
Splitting first on space is a bad idea due to the space character in the holiday name. You can use regex grouping to obtain the parts without using re.split (note the parentheses around the parts):
>>> import re
>>> s = """November 11, Veterans’s Day
... November 24, Thanksgiving
... December 25, Christmas"""
>>> for line in s.split('\n'):
...     month, day, holiday = re.match(r'(\w+) (\d+), (.+)', line).groups()
...     print(month)
...     print(day)
...     print(holiday)
...     print('')
...
November
11
Veterans’s Day
November
24
Thanksgiving
December
25
Christmas
