I want to remove all signs from my dataframe to leave it in either one of the two formats: 100-200 or 200
So the salaries should either have a single hyphen between them if a range of salaries if given, otherwise a clean single number.
I have the following data:
import pandas as pd
import re
df = {'salary':['£26,768 - £30,136/annum Attractive benefits package',
'£26,000 - £28,000/annum plus bonus',
'£21,000/annum',
'£26,768 - £30,136/annum Attractive benefits package',
'£33/hour',
'£18,500 - £20,500/annum Inc Bonus - Study Support + Bens',
'£27,500 - £30,000/annum £27,500 to £30,000 + Study',
'£35,000 - £40,000/annum',
'£24,000 - £27,000/annum Study Support (ACCA / CIMA)',
'£19,000 - £24,000/annum Study Support',
'£30,000 - £35,000/annum',
'£44,000 - £66,000/annum + 15% Bonus + Excellent Benefits. L',
'£75 - £90/day £75-£90 Per Day']}
data = pd.DataFrame(df)
Here's what I have tried to remove some of the signs:
salary = []
for i in data.salary:
space = re.sub(" ",'',i)
lower = re.sub("[a-z]",'',space)
upper = re.sub("[A-Z]",'',lower)
bracket = re.sub("/",'',upper)
comma = re.sub(",", '', bracket)
plus = re.sub("\+",'',comma)
percentage = re.sub("\%",'', plus)
dot = re.sub("\.",'', percentage)
bracket1 = re.sub("\(",'',dot)
bracket2 = re.sub("\)",'',bracket1)
salary.append(bracket2)
Which gives me:
'£26768-£30136',
'£26000-£28000',
'£21000',
'£26768-£30136',
'£33',
'£18500-£20500-',
'£27500-£30000£27500£30000',
'£35000-£40000',
'£24000-£27000',
'£19000-£24000',
'£30000-£35000',
'£44000-£6600015',
'£75-£90£75-£90'
However, I have some repeating numbers, essentially I want anything after the first range of values removed, and any sign besides the hyphen between the two numbers.
Expected output:
'26768-30136',
'26000-28000',
'21000',
'26768-30136',
'33',
'18500-20500',
'27500-30000',
'35000-40000',
'24000-27000',
'19000-24000',
'30000-35000',
'44000-66000',
'75-90
Another way using pandas.Series.str.partition with replace:
data["salary"].str.partition("/")[0].str.replace("[^\d-]+", "", regex=True)
Output:
0 26768-30136
1 26000-28000
2 21000
3 26768-30136
4 33
5 18500-20500
6 27500-30000
7 35000-40000
8 24000-27000
9 19000-24000
10 30000-35000
11 44000-66000
12 75-90
Name: 0, dtype: object
Explain:
It assumes that you are only interested in the parts upto /; it extracts everything until /, than removes anything but digits and hypen
You can use
data['salary'].str.split('/', n=1).str[0].replace('[^\d-]+','', regex=True)
# 0 26768-30136
# 1 26000-28000
# 2 21000
# 3 26768-30136
# 4 33
# 5 18500-20500
# 6 27500-30000
# 7 35000-40000
# 8 24000-27000
# 9 19000-24000
# 10 30000-35000
# 11 44000-66000
# 12 75-90
Here,
.str.split('/', n=1) - splits into two parts with the first / char
.str[0] - gets the first item
.replace('[^\d-]+','', regex=True) - removes all chars other than digits and hyphens.
A more precise solution is to extract the £num(-£num)? pattern and remove all non-digits/hyphens:
data['salary'].str.extract(r'£(\d+(?:,\d+)*(?:\.\d+)?(?:\s*-\s*£\d+(?:,\d+)*(?:\.\d+)?)?)')[0].str.replace(r'[^\d-]+', '', regex=True)
Details:
£ - a literal char
\d+(?:,\d+)*(?:\.\d+)? - one or more digits, followed with zero or more occurrences of a comma and one or more digits and then an optional sequence of a dot and one or more digits
(?:\s*-\s*£\d+(?:,\d+)*(?:\.\d+)?)? - an optional occurrence of a hyphen enclosed with zero or more whitespaces (\s*-\s*), then a £ char, and a number pattern described above.
You can do it in only two regex passes.
First extract the monetary amounts with a regex, then remove the thousands separators, finally, join the output by group keeping only the first two occurrences per original row.
The advantage of this solution is that is really only extracts monetary digits, not other possible numbers that would be there if the input is not clean.
(data['salary'].str.extractall(r'£([,\d]+)')[0] # extract £123,456 digits
.str.replace(r'\D', '', regex=True) # remove separator
.groupby(level=0).apply(lambda x: '-'.join(x[:2])) # join first two occurrences
)
output:
0 26768-30136
1 26000-28000
2 21000
3 26768-30136
4 33
5 18500-20500
6 27500-30000
7 35000-40000
8 24000-27000
9 19000-24000
10 30000-35000
11 44000-66000
12 75-90
You can use replace with a pattern and optional capture groups to match the data format, and use those groups in the replacement.
import pandas as pd
df = {'salary':['£26,768 - £30,136/annum Attractive benefits package',
'£26,000 - £28,000/annum plus bonus',
'£21,000/annum',
'£26,768 - £30,136/annum Attractive benefits package',
'£33/hour',
'£18,500 - £20,500/annum Inc Bonus - Study Support + Bens',
'£27,500 - £30,000/annum £27,500 to £30,000 + Study',
'£35,000 - £40,000/annum',
'£24,000 - £27,000/annum Study Support (ACCA / CIMA)',
'£19,000 - £24,000/annum Study Support',
'£30,000 - £35,000/annum',
'£44,000 - £66,000/annum + 15% Bonus + Excellent Benefits. L',
'£75 - £90/day £75-£90 Per Day']}
data = pd.DataFrame(df).salary.replace(
r"^£(\d+)(?:,(\d+))?(?:\s*(-)\s*£(\d+)(?:,(\d+))?)?/.*",
r"\1\2\3\4\5", regex=True
)
print(data)
The pattern matches
^ Start of string
£ Match literally
(\d+) Capture 1+ digits in group 1
(?:,(\d+))?Optionally capture 1+ digits in group 2 that is preceded by a comma to match the data format
(?: Non capture group to match as a whole
\s*(-)\s*£ capture - between optional whitespace chars in group 3 and match £
(\d+)(?:,(\d+))? The same as previous, now in group 4 and group 5
)? Close non capture group and make it optional
See a regex demo.
Output
0 26768-30136
1 26000-28000
2 21000
3 26768-30136
4 33
5 18500-20500
6 27500-30000
7 35000-40000
8 24000-27000
9 19000-24000
10 30000-35000
11 44000-66000
12 75-90
Related
I tried looking for previous posts but couldn't find anything that matches exactly what I'm looking for so here goes.
I'm trying to parse through strings in a dataframe and capture a certain substring (year) if a match is found. The formatting can vary a lot and I figured out a non-elegant way to get it done but I wonder if there is a better way.
Strings can looks like this
Random Text 31.12.2020
1.1. -31.12.2020
010120-311220
31.12.2020
1.1.2020-31.12.2020 -
1.1.2019 - 31.12.2019
1.1. . . 31.12.2019 -
1.1.2019 - -31.12.2019
010120-311220 other random words
I'm looking to find the year, currently by finding the last date and its' year.
Current regex is .+3112(\d{2,4})|.+31\.12\.(\d{2,4}) where
it would return 20 in group 1 for 010120-311220,
and it would return 2020 in group 2 for 1.1.2020-31.12.2020 -.
The problem is I cannot know beforehand which group the match will belong to, as in the first example group 2 doesn't exist and in the second example group 1 will return None when using re.match(regexPattern, stringOfInterest). Therefore I couldn't access the value by naively using .group(1) on the match object, as sometimes the value would be in .group(2).
Best I've come up so far is naming the groups with (?P<groupName>\d{2,4) and checking for Nones
def getYear(stringOfInterest):
regexPattern = '(^|.+)3112(?P<firstMatchType>\d{2,4})|(^|.+)31\.12\.(?P<secondMatchType>\d{2,4})'
matchObject = re.match(regexPattern, stringOfInterest)
if matchObject is not None:
matchDict = matchObject.groupdict()
if matchDict['firstMatchType'] is not None:
return matchDict['firstMatchType']
else:
return matchDict['secondMatchType']
return None
import re
df['year'] = df['text'].apply(getYear)
And while this works it intuitively seems like a stupid way to do it. Any ideas?
It looks like all your years are from the XXIst century. In this case, all you need is
df['year'] = '20' + df['text'].str.extract(r'.*31\.?12\.?(?:\d{2})?(\d{2})', expand=False)
See the regex demo. Details:
.* - any zero or more chars other than line break chars as many as possible
31\.?12\.? - 31, an optional ., 12, and an optional . char
(?:\d{2})? - an optional sequence of two digits
(\d{2}) - Group 1: two last digits of the year.
See a Pandas test:
import pandas as pd
df = pd.DataFrame({'text': ['Random Text 31.12.2020','1.1. -31.12.2020','010120-311220','31.12.2020','1.1.2020-31.12.2020 -','1.1.2019 - 31.12.2019','1.1. . . 31.12.2019 -','1.1.2019 - -31.12.2019','010120-311220 other random words']})
df['year'] = '20' + df['text'].str.extract(r'.*31\.?12\.?(?:\d{2})?(\d{2})', expand=False)
Output:
>>> df
text year
0 Random Text 31.12.2020 2020
1 1.1. -31.12.2020 2020
2 010120-311220 2020
3 31.12.2020 2020
4 1.1.2020-31.12.2020 - 2020
5 1.1.2019 - 31.12.2019 2019
6 1.1. . . 31.12.2019 - 2019
7 1.1.2019 - -31.12.2019 2019
8 010120-311220 other random words 2020
We can try using re.findall here against your input list, with a regex alternation covering both variants:
inp = ["Random Text 31.12.2020", "1.1. -31.12.2020", "010120-311220", "31.12.2020", "1.1.2020-31.12.2020 -", "1.1.2019 - 31.12.2019", "1.1. . . 31.12.2019 -", "1.1.2019 - -31.12.2019", "010120-311220 other random words"]
output = [re.findall(r'\d{1,2}\.\d{1,2}\.(\d{4})|\d{4}(\d{2})', x)[-1] for x in inp]
output = [x[0] if x[0] else x[1] for x in output]
print(output) # ['2020', '2020', '20', '2020', '2020', '2019', '2019', '2019', '20']
The strategy here is to match either of the two date variants. We retain the last match for each input. Then, we use a list comprehension to find the non empty value. Note that there are two capture groups, so only one will ever match.
Your regex can be factorized a lot by grouping just the alternation of the beginning of the date; this removes the need to check for two groups:
regexPattern = r'(?:^|.+)(?:3112|31\.12\.)(?P<year>\d{2,4})'
Once the group is extracted, it can be normalized into a proper four-digit year:
if matchObject is not None:
return ('20' + matchObject.group('year'))[-4:]
All in all, we get:
import re
def getYear(stringOfInterest):
regexPattern = r'(?:^|.+)(?:3112|31\.12\.)(?P<year>\d{2,4})'
matchObject = re.match(regexPattern, stringOfInterest)
if matchObject is not None:
return ('20' + matchObject.group('year'))[-4:]
return None
df['year'] = df['text'].apply(getYear)
this is my approach to your problem, maybe it would be useful
import re
string = '''
Random Text 31.12.2020
1.1. -31.12.2022
010120-311220
31.12.2020
1.1.2020-31.12.2018 -
1.1.2019 - 31.12.2019
1.1. . . 31.12.2019 -
1.1.2019 - -31.12.2019
010120-311220 other random words'''
pattern = r'\d{1,2}\.\d{1,2}\.(\d{4})|\d{4}(\d{2})'
matches = re.findall(pattern, string)
print("1) ", matches)
# convert tuple to list
match_array = [i for sub in matches for i in sub]
print(match_array)
#Remove multiple empty spaces from string List
res = [element for element in match_array if element.strip()]
print(res)
I really struggle with regex, and I'm hoping for some help.
I have columns that look like this
import pandas as pd
data = {'Location': ['Building A, 100 First St City, State', 'Fire Station # 100, 2 Apple Row, City, State Zip', 'Church , 134 Baker Rd City, State']}
df = pd.DataFrame(data)
Location
0 Building A, 100 First St City, State
1 Fire Station # 100, 2 Apple Row, City, State Zip
2 Church , 134 Baker Rd City, State
I would like to get it to the code chunk below by splitting anytime there is a comma followed by space and then a number. However, I'm running into an issue where I'm removing the number.
Location Name Address
0 Building A 100 First St City, State
1 Fire Station # 100 2 Apple Row, City, State, Zip
2 Church 134 Baker Rd City, State
This is the code I've been using
df['Location Name']= df['Location'].str.split('.,\s\d', expand=True)[0]
df['Address']= df['Location'].str.split('.,\s\d', expand=True)[1]
You can use Series.str.extract:
df[['Location Name','Address']] = df['Location'].str.extract(r'^(.*?),\s(\d.*)', expand=True)
The ^(.*?),\s(\d.*) regex matches
^ - start of string
(.*?) - Group 1 ('Location Name'): any zero or more chars other than line break chars as few as possible
,\s - comma and whitespace
(\d.*) - Group 1 ('Address'): digit and the rest of the line.
See the regex demo.
Another simple solution to your problem is to use a positive lookahead. You want to check if there is a number ahead of your pattern, while not including the number in the match. Here's an example of a regex that solves your problem:
\s?,\s(?=\d)
Here, we optionally remove a trailing whitespace, then match a comma followed by whitespace.
The (?= ) is a positive lookahead, in this case we check for a following digit. If that's matched, the split will remove the comma and whitespace only.
Goal
Extract number before word hours, hour, day, or days
How to use | to match the words?
s = '2 Approximately 5.1 hours 100 ays 1 s'
re.findall(r"([\d.+-/]+)\s*[days|hours]", s) # note I do not know whether string s contains hours or days
return
['5.1', '100', '1']
Since 100 and 1 are not before the exact word hours, they should not show up. Expected
5.1
How to extract the first number from the matched result
s1 = '2 Approximately 10.2 +/- 30hours'
re.findall(r"([\d. +-/]+)\s*hours|\s*hours", s)
return
['10.2 +/- 30']
Expect
10.2
Note that special characters +/-. is optional. When . appears such as 1.3, 1.3 will need to show up with the .. But when 1 +/- 0.5 happens, 1 will need to be extracted and none of the +/- should be extracted.
I know I could probably do a split and then take the first number
str(re.findall(r"([\d. +-/]+)\s*hours", s1)[0]).split(" ")[1]
Gives
'10.2'
But some of the results only return one number so a split will cause an error. Should I do this with another step or could this be done in one step?
Please note that these strings s1, s2 are the values in a dataframe. Therefore, iteration using function like apply and lambda will be needed.
In fact, I would use re.findall here:
units = ["hours", "hour", "days", "day"] # the order matters here: put plurals first
regex = r'(?:' + '|'.join(units) + r')'
s = '2 Approximately 5.1 hours 100 ays 1 s'
values = re.findall(r'\b(\d+(?:\.\d+)?)\s+' + regex, s)
print(values) # prints [('5.1')]
If you want to also capture the units being used, then make the units alternation capturing, i.e. use:
regex = r'(' + '|'.join(units) + r')'
Then the output would be:
[('5.1', 'hours')]
Code
import re
units = '|'.join(["hours", "hour", "hrs", "days", "day", "minutes", "minute", "min"]) # possible units
number = '\d+[.,]?\d*' # pattern for number
plus_minus = '\+\/\-' # plus minus
cases = fr'({number})(?:[\s\d\-\+\/]*)(?:{units})'
pattern = re.compile(cases)
Tests
print(pattern.findall('2 Approximately 5.1 hours 100 ays 1 s'))
# Output: [5.1]
print(pattern.findall('2 Approximately 10.2 +/- 30hours'))
# Output: ['10.2']
print(pattern.findall('The mean half-life for Cetuximab is 114 hours (range 75-188 hours).'))
# Output: ['114', '75']
print(pattern.findall('102 +/- 30 hours in individuals with rheumatoid arthritis and 68 hours in healthy adults.'))
# Output: ['102', '68']
print(pattern.findall("102 +/- 30 hrs"))
# Output: ['102']
print(pattern.findall("102-130 hrs"))
# Output: ['102']
print(pattern.findall("102hrs"))
# Output: ['102']
print(pattern.findall("102 hours"))
# Output: ['102']
Explanation
Above uses the convenience that raw strings (r'...') and string interpolation f'...' can be combined to:
fr'...'
per PEP 498
The cases strings:
fr'({number})(?:[\s\d\-\+\/]*)(?:{units})'
Parts are sequence:
fr'({number})' - capturing group '(\d+[.,]?\d*)' for integers or floats
r'(?:[\s\d-+/]*)' - non capturing group for allowable characters between number and units (i.e. space, +, -, digit, /)
fr'(?:{units})' - non-capturing group for units
I have a data frame like as shown below
df = pd.DataFrame({'person_id': [11,11,11,11,11,11,11,11,11,11],
'text':['inJECTable 1234 Eprex DOSE 4000 units on NONd',
'department 6789 DOSE 8000 units on DIALYSIS days - IV Interm',
'inJECTable 4321 Eprex DOSE - 3 times/wk on NONdialysis day',
'insulin MixTARD 30/70 - inJECTable 46 units',
'insulin ISOPHANE -- InsulaTARD Vial - inJECTable 56 units SC SubCutaneous',
'1-alfacalcidol DOSE 1 mcg - 3 times a week - IV Intermittent',
'jevity liquid - FEEDS PO Jevity - 237 mL - 1 times per day',
'1-alfacalcidol DOSE 1 mcg - 3 times per week - IV Intermittent',
'1-supported DOSE 1 mcg - 1 time/day - IV Intermittent',
'1-testpackage DOSE 1 mcg - 1 time a day - IV Intermittent']})
I would like to remove the words/strings which follow patterns such as 46 units, 3 times a week, 3 times per week, 1 time/day etc.
I was reading about positive and negative look ahead and behind.
So, was trying something like below
[^([0-9\s]*(?=units))] #to remove terms like `46 units` from the string
[^[0-9\s]*(?=times)(times a day)] # don't know how to make this work for all time variants
time variants ex: 3 times a day, 3 time/wk, 3 times per day, 3 times a month, 3 times/month etc.
Basically, I expect my output to be something like below (remove terms like xx units, xx time a day, xx times per week, xx time/day, xx time/wk, xx time/week, xx times per week, etc)
You can consider a pattern like
\s*\d+\s*(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?))
See the regex demo
NOTE: the \d+ matches one or more digits. If you need to match any number, please consider using other patterns for a number in the format you expect, see regular expression for finding decimal/float numbers?, for example.
Pattern details
\s* - zero or more whitespace chars
\d+ - one or more digits
\s* - zero or more whitespaces
(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?)) - a non-capturing group matching:
units? - unit or units
| - or
times? - time or times
(?:\s+(?:a|per)\s+|\s*/\s*) - a or per enclosed with 1+ whitespaces, or / enclosed with 0+ whitespaces
(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?) - d or day, or wk or week, or month, or y/yea/yr
If you need to match whole words only, use word boundaries, \b:
\s*\b\d+\s*(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?))\b
In Pandas, use
df['text'] = df['text'].str.replace(r'\s*\b\d+\s*(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?))\b', '')
Need to clean up a csv import, which gives me a range of times (in string form). Code is at bottom; I currently use regular expressions and replace() on the df to convert other chars. Just not sure how to:
select the current 24 hour format numbers and add :00
how to select the 12 hour format numbers and make them 24 hour.
Input (from csv import):
break_notes
0 15-18
1 18.30-19.00
2 4PM-5PM
3 3-4
4 4-4.10PM
5 15 - 17
6 11 - 13
So far I have got it to look like (remove spaces, AM/PM, replace dot with colon):
break_notes
0 15-18
1 18:30-19:00
2 4-5
3 3-4
4 4-4:10
5 15-17
6 11-13
However, I would like it to look like this ('HH:MM-HH:MM' format):
break_notes
0 15:00-18:00
1 18:30-19:00
2 16:00-17:00
3 15:00-16:00
4 16:00-16:10
5 15:00-17:00
6 11:00-13:00
My code is:
data = pd.read_csv('test.csv')
data.break_notes = data.break_notes.str.replace(r'([P].|[ ])', '').str.strip()
data.break_notes = data.break_notes.str.replace(r'([.])', ':').str.strip()
Here is the converter function that you need based on your requested input data. convert_entry takes complete value entry, splits it on a dash, and passes its result to convert_single, since both halfs of one entry can be converted individually. After each conversion, it merges them with a dash.
convert_single uses regex to search for important parts in the time string.
It starts with a some numbers \d+ (representing the hours), then optionally a dot or a colon and some more number [.:]?(\d+)? (representing the minutes). And after that optionally AM or PM (AM|PM)? (only PM is relevant in this case)
import re
def convert_single(s):
m = re.search(pattern="(\d+)[.:]?(\d+)?(AM|PM)?", string=s)
hours = m.group(1)
minutes = m.group(2) or "00"
if m.group(3) == "PM":
hours = str(int(hours) + 12)
return hours.zfill(2) + ":" + minutes.zfill(2)
def convert_entry(value):
start, end = value.split("-")
start = convert_single(start)
end = convert_single(end)
return "-".join((start, end))
values = ["15-18", "18.30-19.00", "4PM-5PM", "3-4", "4-4.10PM", "15 - 17", "11 - 13"]
for value in values:
cvalue = convert_entry(value)
print(cvalue)