Extracting substring with alternatives using regex in Python - python

I tried looking for previous posts but couldn't find anything that matches exactly what I'm looking for so here goes.
I'm trying to parse through strings in a dataframe and capture a certain substring (year) if a match is found. The formatting can vary a lot and I figured out a non-elegant way to get it done but I wonder if there is a better way.
Strings can look like this:
Random Text 31.12.2020
1.1. -31.12.2020
010120-311220
31.12.2020
1.1.2020-31.12.2020 -
1.1.2019 - 31.12.2019
1.1. . . 31.12.2019 -
1.1.2019 - -31.12.2019
010120-311220 other random words
I'm looking to find the year, currently by finding the last date and its year.
Current regex is .+3112(\d{2,4})|.+31\.12\.(\d{2,4}) where
it would return 20 in group 1 for 010120-311220,
and it would return 2020 in group 2 for 1.1.2020-31.12.2020 -.
The problem is I cannot know beforehand which group the match will belong to, as in the first example group 2 doesn't exist and in the second example group 1 will return None when using re.match(regexPattern, stringOfInterest). Therefore I couldn't access the value by naively using .group(1) on the match object, as sometimes the value would be in .group(2).
The best I've come up with so far is naming the groups with (?P<groupName>\d{2,4}) and checking for Nones:
import re
def getYear(stringOfInterest):
    regexPattern = r'(^|.+)3112(?P<firstMatchType>\d{2,4})|(^|.+)31\.12\.(?P<secondMatchType>\d{2,4})'
    matchObject = re.match(regexPattern, stringOfInterest)
    if matchObject is not None:
        matchDict = matchObject.groupdict()
        # only one of the two named groups takes part in a match
        if matchDict['firstMatchType'] is not None:
            return matchDict['firstMatchType']
        else:
            return matchDict['secondMatchType']
    return None
df['year'] = df['text'].apply(getYear)
And while this works it intuitively seems like a stupid way to do it. Any ideas?

It looks like all your years are from the 21st century. In this case, all you need is
df['year'] = '20' + df['text'].str.extract(r'.*31\.?12\.?(?:\d{2})?(\d{2})', expand=False)
See the regex demo. Details:
.* - any zero or more chars other than line break chars as many as possible
31\.?12\.? - 31, an optional ., 12, and an optional . char
(?:\d{2})? - an optional sequence of two digits
(\d{2}) - Group 1: two last digits of the year.
See a Pandas test:
import pandas as pd
df = pd.DataFrame({'text': ['Random Text 31.12.2020','1.1. -31.12.2020','010120-311220','31.12.2020','1.1.2020-31.12.2020 -','1.1.2019 - 31.12.2019','1.1. . . 31.12.2019 -','1.1.2019 - -31.12.2019','010120-311220 other random words']})
df['year'] = '20' + df['text'].str.extract(r'.*31\.?12\.?(?:\d{2})?(\d{2})', expand=False)
Output:
>>> df
text year
0 Random Text 31.12.2020 2020
1 1.1. -31.12.2020 2020
2 010120-311220 2020
3 31.12.2020 2020
4 1.1.2020-31.12.2020 - 2020
5 1.1.2019 - 31.12.2019 2019
6 1.1. . . 31.12.2019 - 2019
7 1.1.2019 - -31.12.2019 2019
8 010120-311220 other random words 2020

We can try using re.findall here against your input list, with a regex alternation covering both variants:
import re
inp = ["Random Text 31.12.2020", "1.1. -31.12.2020", "010120-311220", "31.12.2020", "1.1.2020-31.12.2020 -", "1.1.2019 - 31.12.2019", "1.1. . . 31.12.2019 -", "1.1.2019 - -31.12.2019", "010120-311220 other random words"]
output = [re.findall(r'\d{1,2}\.\d{1,2}\.(\d{4})|\d{4}(\d{2})', x)[-1] for x in inp]
output = [x[0] if x[0] else x[1] for x in output]
print(output) # ['2020', '2020', '20', '2020', '2020', '2019', '2019', '2019', '20']
The strategy here is to match either of the two date variants. We retain the last match for each input. Then, we use a list comprehension to pick the non-empty value. Note that there are two capture groups, but only one of them takes part in any given match.

Your regex can be factorized a lot by grouping just the alternation of the beginning of the date; this removes the need to check for two groups:
regexPattern = r'(?:^|.+)(?:3112|31\.12\.)(?P<year>\d{2,4})'
Once the group is extracted, it can be normalized into a proper four-digit year:
if matchObject is not None:
    return ('20' + matchObject.group('year'))[-4:]
All in all, we get:
import re
def getYear(stringOfInterest):
    regexPattern = r'(?:^|.+)(?:3112|31\.12\.)(?P<year>\d{2,4})'
    matchObject = re.match(regexPattern, stringOfInterest)
    if matchObject is not None:
        return ('20' + matchObject.group('year'))[-4:]
    return None
df['year'] = df['text'].apply(getYear)
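To check the function above without a dataframe, it can be run directly over a few of the sample strings from the question. This is only a minimal sketch: `get_year` and `YEAR_PATTERN` are illustrative names, and the normalization assumes every year belongs to the 2000s.

```python
import re

# One alternation for the date prefix, one named group for the year.
YEAR_PATTERN = re.compile(r'(?:^|.+)(?:3112|31\.12\.)(?P<year>\d{2,4})')

def get_year(text):
    """Return the year that follows the last 31.12. / 3112 as four digits, or None."""
    match = YEAR_PATTERN.match(text)
    if match is None:
        return None
    # Assumption: two-digit years are in the 2000s, so '20' becomes '2020'.
    return ('20' + match.group('year'))[-4:]

samples = [
    'Random Text 31.12.2020',
    '010120-311220',
    '1.1.2019 - -31.12.2019',
    '010120-311220 other random words',
    'no date here',
]
years = [get_year(s) for s in samples]
print(years)  # ['2020', '2020', '2019', '2020', None]
```

Because `.+` is greedy, the pattern locks onto the last `31.12.` or `3112` in the string, which is what makes the "last date" rule work without a second group.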

This is my approach to your problem; maybe it will be useful:
import re
string = '''
Random Text 31.12.2020
1.1. -31.12.2022
010120-311220
31.12.2020
1.1.2020-31.12.2018 -
1.1.2019 - 31.12.2019
1.1. . . 31.12.2019 -
1.1.2019 - -31.12.2019
010120-311220 other random words'''
pattern = r'\d{1,2}\.\d{1,2}\.(\d{4})|\d{4}(\d{2})'
matches = re.findall(pattern, string)
print("1) ", matches)
# flatten the list of tuples into a single list
match_array = [i for sub in matches for i in sub]
print(match_array)
# drop the empty strings left by whichever group did not match
res = [element for element in match_array if element.strip()]
print(res)

Related

Removing signs and repeating numbers

I want to remove all signs from my dataframe to leave it in either one of the two formats: 100-200 or 200
So the salaries should either have a single hyphen between them if a range of salaries is given, or otherwise be a clean single number.
I have the following data:
import pandas as pd
import re
df = {'salary':['£26,768 - £30,136/annum Attractive benefits package',
'£26,000 - £28,000/annum plus bonus',
'£21,000/annum',
'£26,768 - £30,136/annum Attractive benefits package',
'£33/hour',
'£18,500 - £20,500/annum Inc Bonus - Study Support + Bens',
'£27,500 - £30,000/annum £27,500 to £30,000 + Study',
'£35,000 - £40,000/annum',
'£24,000 - £27,000/annum Study Support (ACCA / CIMA)',
'£19,000 - £24,000/annum Study Support',
'£30,000 - £35,000/annum',
'£44,000 - £66,000/annum + 15% Bonus + Excellent Benefits. L',
'£75 - £90/day £75-£90 Per Day']}
data = pd.DataFrame(df)
Here's what I have tried to remove some of the signs:
salary = []
for i in data.salary:
    space = re.sub(" ", '', i)
    lower = re.sub("[a-z]", '', space)
    upper = re.sub("[A-Z]", '', lower)
    bracket = re.sub("/", '', upper)
    comma = re.sub(",", '', bracket)
    plus = re.sub(r"\+", '', comma)
    percentage = re.sub(r"\%", '', plus)
    dot = re.sub(r"\.", '', percentage)
    bracket1 = re.sub(r"\(", '', dot)
    bracket2 = re.sub(r"\)", '', bracket1)
    salary.append(bracket2)
Which gives me:
'£26768-£30136',
'£26000-£28000',
'£21000',
'£26768-£30136',
'£33',
'£18500-£20500-',
'£27500-£30000£27500£30000',
'£35000-£40000',
'£24000-£27000',
'£19000-£24000',
'£30000-£35000',
'£44000-£6600015',
'£75-£90£75-£90'
However, I have some repeating numbers, essentially I want anything after the first range of values removed, and any sign besides the hyphen between the two numbers.
Expected output:
'26768-30136',
'26000-28000',
'21000',
'26768-30136',
'33',
'18500-20500',
'27500-30000',
'35000-40000',
'24000-27000',
'19000-24000',
'30000-35000',
'44000-66000',
'75-90'
Another way using pandas.Series.str.partition with replace:
data["salary"].str.partition("/")[0].str.replace("[^\d-]+", "", regex=True)
Output:
0 26768-30136
1 26000-28000
2 21000
3 26768-30136
4 33
5 18500-20500
6 27500-30000
7 35000-40000
8 24000-27000
9 19000-24000
10 30000-35000
11 44000-66000
12 75-90
Name: 0, dtype: object
Explanation:
This assumes that you are only interested in the part up to the first /. It extracts everything before the /, then removes anything but digits and hyphens.
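The same two steps can be sketched on plain strings with the standard library, which makes it easier to see what each step contributes. The helper name `clean_salary` is made up for illustration.

```python
import re

def clean_salary(text):
    """Keep the part before the first '/', then drop everything except digits and hyphens."""
    head, _sep, _tail = text.partition('/')
    return re.sub(r'[^\d-]+', '', head)

examples = [
    '£26,768 - £30,136/annum Attractive benefits package',
    '£33/hour',
    '£75 - £90/day £75-£90 Per Day',
]
cleaned = [clean_salary(e) for e in examples]
print(cleaned)  # ['26768-30136', '33', '75-90']
```

Cutting at the first `/` is what discards the repeated ranges and the `15%`-style numbers that appear later in some rows.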
You can use
data['salary'].str.split('/', n=1).str[0].replace('[^\d-]+','', regex=True)
# 0 26768-30136
# 1 26000-28000
# 2 21000
# 3 26768-30136
# 4 33
# 5 18500-20500
# 6 27500-30000
# 7 35000-40000
# 8 24000-27000
# 9 19000-24000
# 10 30000-35000
# 11 44000-66000
# 12 75-90
Here,
.str.split('/', n=1) - splits into two parts at the first / char
.str[0] - gets the first item
.replace('[^\d-]+','', regex=True) - removes all chars other than digits and hyphens.
A more precise solution is to extract the £num(-£num)? pattern and remove all non-digits/hyphens:
data['salary'].str.extract(r'£(\d+(?:,\d+)*(?:\.\d+)?(?:\s*-\s*£\d+(?:,\d+)*(?:\.\d+)?)?)')[0].str.replace(r'[^\d-]+', '', regex=True)
Details:
£ - a literal char
\d+(?:,\d+)*(?:\.\d+)? - one or more digits, followed with zero or more occurrences of a comma and one or more digits and then an optional sequence of a dot and one or more digits
(?:\s*-\s*£\d+(?:,\d+)*(?:\.\d+)?)? - an optional occurrence of a hyphen enclosed with zero or more whitespaces (\s*-\s*), then a £ char, and a number pattern described above.
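As a rough illustration of the pattern described above, here it is applied to single strings with the `re` module instead of pandas. The names `AMOUNT`, `SALARY` and `extract_salary` are hypothetical, introduced only for this sketch.

```python
import re

# Number with optional thousands separators and an optional decimal part.
AMOUNT = r'\d+(?:,\d+)*(?:\.\d+)?'
# A £ amount, optionally followed by "- £amount".
SALARY = re.compile(rf'£({AMOUNT}(?:\s*-\s*£{AMOUNT})?)')

def extract_salary(text):
    """Return the first salary or salary range as digits and hyphens, or None."""
    match = SALARY.search(text)
    if match is None:
        return None
    return re.sub(r'[^\d-]+', '', match.group(1))

print(extract_salary('£27,500 - £30,000/annum £27,500 to £30,000 + Study'))  # 27500-30000
print(extract_salary('£21,000/annum'))  # 21000
print(extract_salary('Salary negotiable'))  # None
```

Unlike splitting on `/`, this does not depend on a slash being present; it only looks at the first `£` amount and an optional second one joined by a hyphen.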
You can do it in only two regex passes.
First extract the monetary amounts with a regex, then remove the thousands separators, and finally join the output by group, keeping only the first two occurrences per original row.
The advantage of this solution is that it really only extracts monetary digits, not other numbers that might be present if the input is not clean.
(data['salary'].str.extractall(r'£([,\d]+)')[0] # extract £123,456 digits
.str.replace(r'\D', '', regex=True) # remove separator
.groupby(level=0).apply(lambda x: '-'.join(x[:2])) # join first two occurrences
)
output:
0 26768-30136
1 26000-28000
2 21000
3 26768-30136
4 33
5 18500-20500
6 27500-30000
7 35000-40000
8 24000-27000
9 19000-24000
10 30000-35000
11 44000-66000
12 75-90
You can use replace with a pattern and optional capture groups to match the data format, and use those groups in the replacement.
import pandas as pd
df = {'salary':['£26,768 - £30,136/annum Attractive benefits package',
'£26,000 - £28,000/annum plus bonus',
'£21,000/annum',
'£26,768 - £30,136/annum Attractive benefits package',
'£33/hour',
'£18,500 - £20,500/annum Inc Bonus - Study Support + Bens',
'£27,500 - £30,000/annum £27,500 to £30,000 + Study',
'£35,000 - £40,000/annum',
'£24,000 - £27,000/annum Study Support (ACCA / CIMA)',
'£19,000 - £24,000/annum Study Support',
'£30,000 - £35,000/annum',
'£44,000 - £66,000/annum + 15% Bonus + Excellent Benefits. L',
'£75 - £90/day £75-£90 Per Day']}
data = pd.DataFrame(df).salary.replace(
r"^£(\d+)(?:,(\d+))?(?:\s*(-)\s*£(\d+)(?:,(\d+))?)?/.*",
r"\1\2\3\4\5", regex=True
)
print(data)
The pattern matches
^ Start of string
£ Match literally
(\d+) Capture 1+ digits in group 1
(?:,(\d+))? Optionally capture 1+ digits in group 2, preceded by a comma, to match the data format
(?: Non-capture group to match as a whole
\s*(-)\s*£ Capture - between optional whitespace chars in group 3 and match £
(\d+)(?:,(\d+))? The same as before, now in group 4 and group 5
)? Close the non-capture group and make it optional
/.* Match / and the rest of the line, so that the replacement drops it
See a regex demo.
Output
0 26768-30136
1 26000-28000
2 21000
3 26768-30136
4 33
5 18500-20500
6 27500-30000
7 35000-40000
8 24000-27000
9 19000-24000
10 30000-35000
11 44000-66000
12 75-90
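The group-based replacement can also be tried on individual strings with `re.sub`. This is a sketch under two assumptions: Python 3.5 or later (where groups that did not participate in the match are substituted as empty strings) and amounts with at most one thousands separator, as in the sample data. `PATTERN` and `rewrite` are illustrative names.

```python
import re

PATTERN = re.compile(r'^£(\d+)(?:,(\d+))?(?:\s*(-)\s*£(\d+)(?:,(\d+))?)?/.*')

def rewrite(text):
    """Rebuild the salary from the captured groups; unmatched groups become ''."""
    return PATTERN.sub(r'\1\2\3\4\5', text)

print(rewrite('£44,000 - £66,000/annum + 15% Bonus + Excellent Benefits. L'))  # 44000-66000
print(rewrite('£21,000/annum'))  # 21000
print(rewrite('£33/hour'))  # 33
print(rewrite('Competitive'))  # Competitive (no match, so the text is left as it is)
```

Note the last case: a row that does not fit the expected format is returned unchanged rather than emptied, so such rows stay visible in the result.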

What Python RegEx can I use to indicate a pattern only in the end of an Excel cell

I am working with a dataset where I am separating the contents of one Excel column into 3 separate columns. A mock version of the data is as follows:
Movie Titles/Category/Rating
Wolf of Wall Street A-13 x 9
Django Unchained IMDB x 8
The EXPL Haunted House FEAR x 7
Silver Lining DC-23 x 8
This is what I want the results to look like:
Title                    Category   Rating
Wolf of Wall Street      A-13       9
Django Unchained         IMDB       8
The EXPL Haunted House   FEAR       7
Silver Lining            DC-23      8
Here is what I used to successfully separate the cells.
For Rating, this worked:
data[['Movie Titles/Category/Rating', 'Rating']] = data['Movie Titles/Category/Rating'].str.split(' x ', expand = True)
However, to separate Category from movie titles, this RegEx doesn't work:
data['Category']=data['Movie Titles/Category/Rating'].str.extract('((\s[A-Z]{1,2}-\d{1,2})|(\s[A-Z]{4}$))', expand = True)
Since the uppercase letter pattern is present in the middle of the third cell as well (EXPL and I only want to separate FEAR into a separate column), the regex pattern '\s[A-Z]{4}$' is not working. Is there a way to indicate in the RegEx pattern that I only want the uppercase text in the end of the table cell to separate (FEAR) and not the middle (EXPL)?
You can use
import pandas as pd
df = pd.DataFrame({'Movie Titles/Category/Rating':['Wolf of Wall Street A-13 x 9','Django Unchained IMDB x 8','The EXPL Haunted House FEAR x 7','Silver Lining DC-23 x 8']})
df2 = df['Movie Titles/Category/Rating'].str.extract(r'^(?P<Movie>.*?)\s+(?P<Category>\S+)\s+x\s+(?P<Rating>\d+)$', expand=True)
See the regex demo.
Details:
^ - start of string
(?P<Movie>.*?) - Group (Column) "Movie": any zero or more chars other than line break chars, as few as possible
\s+ - one or more whitespaces
(?P<Category>\S+) - Group "Category": one or more non-whitespace chars
\s+x\s+ - x enclosed with one or more whitespaces
(?P<Rating>\d+) - Group "Rating": one or more digits
$ - end of string.
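To see how the named groups divide a row, the same pattern can be run on a single string with `re.match`. This is only a sketch; `ROW` and `split_row` are names invented here.

```python
import re

ROW = re.compile(r'^(?P<Movie>.*?)\s+(?P<Category>\S+)\s+x\s+(?P<Rating>\d+)$')

def split_row(text):
    """Return a dict with Movie, Category and Rating, or None if the row does not fit."""
    match = ROW.match(text)
    return match.groupdict() if match else None

print(split_row('The EXPL Haunted House FEAR x 7'))
# {'Movie': 'The EXPL Haunted House', 'Category': 'FEAR', 'Rating': '7'}
```

The category is pinned to the end of the cell by the ` x <digits>` tail and the `$` anchor, so an uppercase word in the middle of the title, such as EXPL, stays in the Movie group.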
Assuming there is always x between Category and Rating, and the Category has no spaces in it, then the following should get what you want:
(.*) (.*) x (\d+)
I think
'((\s[A-Z]{1,2}-\d{1,2})|(\s[A-Z]{4})) x'
would work for you - to indicate that you want the part of the string that comes right before the x. (Assuming that pattern is always true for your data.)

Use regular expression to extract numbers before specific words

Goal
Extract number before word hours, hour, day, or days
How to use | to match the words?
s = '2 Approximately 5.1 hours 100 ays 1 s'
re.findall(r"([\d.+-/]+)\s*[days|hours]", s) # note I do not know whether string s contains hours or days
returns
['5.1', '100', '1']
Since 100 and 1 are not before the exact word hours, they should not show up. Expected
5.1
How to extract the first number from the matched result
s1 = '2 Approximately 10.2 +/- 30hours'
re.findall(r"([\d. +-/]+)\s*hours|\s*hours", s1)
returns
['10.2 +/- 30']
Expected
10.2
Note that the special characters +/-. are optional. When . appears, as in 1.3, the whole 1.3 must be returned, including the dot. But for 1 +/- 0.5, only 1 should be extracted and nothing from the +/- part.
I know I could probably do a split and then take the first number
str(re.findall(r"([\d. +-/]+)\s*hours", s1)[0]).split(" ")[1]
Gives
'10.2'
But some of the results only return one number so a split will cause an error. Should I do this with another step or could this be done in one step?
Please note that these strings s, s1 are values in a dataframe. Therefore, iteration using a function like apply with a lambda will be needed.
In fact, I would use re.findall here:
import re
units = ["hours", "hour", "days", "day"] # the order matters here: put plurals first
regex = r'(?:' + '|'.join(units) + r')'
s = '2 Approximately 5.1 hours 100 ays 1 s'
values = re.findall(r'\b(\d+(?:\.\d+)?)\s+' + regex, s)
print(values) # prints ['5.1']
If you want to also capture the units being used, then make the units alternation capturing, i.e. use:
regex = r'(' + '|'.join(units) + r')'
Then the output would be:
[('5.1', 'hours')]
Code
import re
units = '|'.join(["hours", "hour", "hrs", "days", "day", "minutes", "minute", "min"]) # possible units
number = r'\d+[.,]?\d*' # pattern for number
plus_minus = r'\+\/\-' # plus minus (shown for reference; not used in the pattern below)
cases = fr'({number})(?:[\s\d\-\+\/]*)(?:{units})'
pattern = re.compile(cases)
Tests
print(pattern.findall('2 Approximately 5.1 hours 100 ays 1 s'))
# Output: ['5.1']
print(pattern.findall('2 Approximately 10.2 +/- 30hours'))
# Output: ['10.2']
print(pattern.findall('The mean half-life for Cetuximab is 114 hours (range 75-188 hours).'))
# Output: ['114', '75']
print(pattern.findall('102 +/- 30 hours in individuals with rheumatoid arthritis and 68 hours in healthy adults.'))
# Output: ['102', '68']
print(pattern.findall("102 +/- 30 hrs"))
# Output: ['102']
print(pattern.findall("102-130 hrs"))
# Output: ['102']
print(pattern.findall("102hrs"))
# Output: ['102']
print(pattern.findall("102 hours"))
# Output: ['102']
Explanation
The code above relies on the fact that raw strings (r'...') and f-strings (f'...', introduced by PEP 498) can be combined as:
fr'...'
The cases string:
fr'({number})(?:[\s\d\-\+\/]*)(?:{units})'
Its parts, in sequence:
fr'({number})' - capturing group '(\d+[.,]?\d*)' for integers or floats
r'(?:[\s\d\-\+\/]*)' - non-capturing group for the characters allowed between the number and the units (i.e. space, +, -, digit, /)
fr'(?:{units})' - non-capturing group for units
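Since the question mentions that the strings live in a dataframe and will be processed with apply, the pattern can be wrapped in a small function that returns only the first value. This is a hedged sketch: `first_value` is a hypothetical helper, and treating a comma as a decimal separator is an assumption, not something the question states.

```python
import re

UNITS = '|'.join(["hours", "hour", "hrs", "days", "day", "minutes", "minute", "min"])
NUMBER = r'\d+[.,]?\d*'
PATTERN = re.compile(fr'({NUMBER})(?:[\s\d\-\+\/]*)(?:{UNITS})')

def first_value(text):
    """Return the first number that precedes a unit as a float, or None if there is none."""
    match = PATTERN.search(text)
    if match is None:
        return None
    # Assumption: a comma, if present, is a decimal separator.
    return float(match.group(1).replace(',', '.'))

print(first_value('2 Approximately 5.1 hours 100 ays 1 s'))  # 5.1
print(first_value('2 Approximately 10.2 +/- 30hours'))       # 10.2
print(first_value('no duration given'))                      # None
```

A function of this shape can be passed to a column's apply method; returning None for rows without a match avoids the indexing error that the split-based attempt in the question runs into.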

How to match this pattern using regex in Python

I have a list of names with different notations:
for example:
myList = ['ab2000', 'abc2000_2000', 'AB2000', 'ab2000_1', 'ABC2000_01', 'AB2000_2', 'ABC2000_02', 'AB2000_A1']
The standardized versions of those different notations are, for example:
'ab2000' is 'ABC2000'
'ab2000_1' is 'ABC2000_01'
'AB2000_A1' is 'ABC2000_A1'
What I tried is to separate the different characters of the string using compile.
input:
compiled = re.compile(r'[A-Za-z]+|\d+|\W+')
compiled.findall("AB2000_2000_A1")
output:
characters = ['AB', '2000', '2000', 'A', '1']
Then applying:
characters = list(set(characters))
To finally try to match the values of that list with the main components of the string: an alpha format followed by a digit format followed by an alphanumeric format.
But as you can see in the previous output I can't match 'A1' into a single character using \W+. My desired output is:
characters = ['AB', '2000', '2000', 'A1']
any idea to fix that?
Or any better idea to solve my problem in general? Thank you in advance.
Use the following pattern with optional groups and capturing groups:
r'([A-Z]+)(\d+)(?:_([A-Z\d]+))?(?:_([A-Z\d]+))?'
and re.I flag.
Note that (?:_([A-Z\d]+))? must be written out twice in order to capture both the third and the fourth group. If you instead wrote it once and repeated it with "*", the capturing group would keep only the last repetition, and the earlier one would be lost.
To test it, I ran the following test:
import re
myList = ['ab2000', 'abc2000_2000', 'AB2000', 'ab2000_1', 'ABC2000_01',
          'AB2000_2', 'ABC2000_02', 'AB2000_A1', 'AB2000_2000_A1']
pat = re.compile(r'([A-Z]+)(\d+)(?:_([A-Z\d]+))?(?:_([A-Z\d]+))?', re.I)
for tt in myList:
    print(f'{tt:16} ', end=' ')
    mtch = pat.match(tt)
    if mtch:
        for it in mtch.groups():
            if it is not None:
                print(f'{it:5}', end=' ')
    print()
getting:
ab2000            ab    2000
abc2000_2000      abc   2000  2000
AB2000            AB    2000
ab2000_1          ab    2000  1
ABC2000_01        ABC   2000  01
AB2000_2          AB    2000  2
ABC2000_02        ABC   2000  02
AB2000_A1         AB    2000  A1
AB2000_2000_A1    AB    2000  2000  A1
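The note about repeating the optional group can be checked directly. In this small sketch, `repeated` is a hypothetical variant that writes the suffix group once with `*`, while `explicit` is the pattern from the answer.

```python
import re

# Hypothetical variant: the suffix group is written once and repeated with '*'.
repeated = re.compile(r'([A-Z]+)(\d+)(?:_([A-Z\d]+))*', re.I)
# Pattern from the answer: the optional suffix group is written out twice.
explicit = re.compile(r'([A-Z]+)(\d+)(?:_([A-Z\d]+))?(?:_([A-Z\d]+))?', re.I)

name = 'AB2000_2000_A1'
repeated_groups = repeated.match(name).groups()
explicit_groups = explicit.match(name).groups()
print(repeated_groups)  # ('AB', '2000', 'A1')
print(explicit_groups)  # ('AB', '2000', '2000', 'A1')
```

A capturing group inside a repetition keeps only the text of its final iteration, which is why the middle `2000` disappears in the first result.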

Regex, how to remove all non-alphanumeric except colon in a 12/24 hour timestamp?

I have a string like:
Today, 3:30pm - Group Meeting to discuss "big idea"
How do you construct a regex such that after parsing it would return:
Today 3:30pm Group Meeting to discuss big idea
I would like it to remove all non-alphanumeric characters except for those that appear in a 12 or 24 hour time stamp.
# this: D:DD, DD:DDam/pm 12/24 hr
re = r':(?=..(?<!\d:\d\d))|[^a-zA-Z0-9 ](?<!:)'
A colon must be preceded by at least one digit and followed by at least two digits: then it's a time. All other colons will be considered textual colons.
How it works
: // match a colon
(?=.. // match but not capture two chars
(?<! // start a negative look-behind group (if it matches, the whole fails)
\d:\d\d // time stamp
) // end neg. look behind
) // end non-capture two chars
| // or
[^a-zA-Z0-9 ] // match anything not digits or letters
(?<!:) // that isn't a colon
Then when applied to this silly text:
Today, 3:30pm - Group 1,2,3 Meeting to di4sc::uss3: 2:3:4 "big idea" on 03:33pm or 16:47 is also good
...changes it into:
Today 3:30pm Group 123 Meeting to di4scuss3 234 big idea on 03:33pm or 16:47 is also good
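For reference, this is roughly how the pattern explained above could be applied in Python. It is a sketch rather than part of the original answer: the names `TIME_AWARE` and `strip_punctuation` are made up, and the extra step that collapses runs of spaces is an addition, since removing " - " leaves two spaces behind.

```python
import re

# A colon is removed unless it sits in digit:digit digit; any other
# non-alphanumeric, non-space character is removed as well.
TIME_AWARE = re.compile(r':(?=..(?<!\d:\d\d))|[^a-zA-Z0-9 ](?<!:)')

def strip_punctuation(text):
    cleaned = TIME_AWARE.sub('', text)
    # Added step (not in the original regex): collapse runs of spaces.
    return re.sub(r' {2,}', ' ', cleaned)

print(strip_punctuation('Today, 3:30pm - Group Meeting to discuss "big idea"'))
# Today 3:30pm Group Meeting to discuss big idea
print(strip_punctuation('di4sc::uss3: 2:3:4 at 16:47'))
# di4scuss3 234 at 16:47
```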
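Python.
import string
import time
punct = string.punctuation
s = 'Today, 3:30pm - Group Meeting:am to discuss "big idea" by our madam'
for item in s.split():
    try:
        # leave the token untouched if it parses as a time such as 3:30pm
        time.strptime(item, "%H:%M%p")
    except ValueError:
        # otherwise strip every punctuation character from it
        item = ''.join(i for i in item if i not in punct)
    print(item, end=' ')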
output
$ ./python.py
Today 3:30pm Group Meetingam to discuss big idea by our madam
# change to s='Today, 15:30pm - Group 1,2,3 Meeting to di4sc::uss3: 2:3:4 "big idea" on 03:33pm or 16:47 is also good'
$ ./python.py
Today 15:30pm Group 123 Meeting to di4scuss3 234 big idea on 03:33pm or 1647 is also good
NB: The method should be improved to check for a valid time only when necessary (by imposing conditions), but I will leave it as it is for now.
I assume you'd like to keep spaces as well, and this implementation is in python, but it's PCRE so it should be portable.
import re
x = u'Today, 3:30pm - Group Meeting to discuss "big idea"'
re.sub(r'[^a-zA-Z0-9: ]', '', x)
Output: 'Today 3:30pm  Group Meeting to discuss big idea'
For a slightly cleaner answer (no double spaces):
import re
x = u'Today, 3:30pm - Group Meeting to discuss "big idea"'
tmp = re.sub(r'[^a-zA-Z0-9: ]', '', x)
re.sub(r'[ ]+', ' ', tmp)
Output: 'Today 3:30pm Group Meeting to discuss big idea'
You can try, in Javascript:
var re = /(\W+(?!\d{2}[ap]m))/gi;
var input = 'Today, 3:30pm - Group Meeting to discuss "big idea"';
alert(input.replace(re, " "))
Correct regexp to do that would be:
'(?<!\d):|:(?!\d\d)|[^a-zA-Z0-9 :]'
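A short sketch of how this expression behaves when passed to `re.sub`. The names `NOT_TIME` and `keep_time_colons` are illustrative, and the space-collapsing step is an addition that is not part of the regex itself.

```python
import re

# Remove a colon that is not preceded by a digit, or not followed by two digits,
# and remove every character that is not a letter, digit, space or colon.
NOT_TIME = re.compile(r'(?<!\d):|:(?!\d\d)|[^a-zA-Z0-9 :]')

def keep_time_colons(text):
    cleaned = NOT_TIME.sub('', text)
    # Added step: collapse the double spaces left where " - " was removed.
    return re.sub(r' {2,}', ' ', cleaned)

print(keep_time_colons('Today, 3:30pm - Group Meeting: to discuss "big idea"'))
# Today 3:30pm Group Meeting to discuss big idea
```

The colon after "Meeting" is dropped because no digit precedes it, while the one in 3:30pm survives all three alternatives.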
s="Call me, my dear, at 3:30"
re.sub(r'[^\w :]','',s)
'Call me my dear at 3:30'
