I want to remove all signs from my dataframe to leave it in one of two formats: 100-200 or 200.
So the salaries should have a single hyphen between them if a range of salaries is given, otherwise be a clean single number.
I have the following data:
import pandas as pd
import re
df = {'salary':['£26,768 - £30,136/annum Attractive benefits package',
'£26,000 - £28,000/annum plus bonus',
'£21,000/annum',
'£26,768 - £30,136/annum Attractive benefits package',
'£33/hour',
'£18,500 - £20,500/annum Inc Bonus - Study Support + Bens',
'£27,500 - £30,000/annum £27,500 to £30,000 + Study',
'£35,000 - £40,000/annum',
'£24,000 - £27,000/annum Study Support (ACCA / CIMA)',
'£19,000 - £24,000/annum Study Support',
'£30,000 - £35,000/annum',
'£44,000 - £66,000/annum + 15% Bonus + Excellent Benefits. L',
'£75 - £90/day £75-£90 Per Day']}
data = pd.DataFrame(df)
Here's what I have tried to remove some of the signs:
salary = []
for i in data.salary:
    space = re.sub(" ", '', i)
    lower = re.sub("[a-z]", '', space)
    upper = re.sub("[A-Z]", '', lower)
    bracket = re.sub("/", '', upper)
    comma = re.sub(",", '', bracket)
    plus = re.sub("\+", '', comma)
    percentage = re.sub("\%", '', plus)
    dot = re.sub("\.", '', percentage)
    bracket1 = re.sub("\(", '', dot)
    bracket2 = re.sub("\)", '', bracket1)
    salary.append(bracket2)
Which gives me:
'£26768-£30136',
'£26000-£28000',
'£21000',
'£26768-£30136',
'£33',
'£18500-£20500-',
'£27500-£30000£27500£30000',
'£35000-£40000',
'£24000-£27000',
'£19000-£24000',
'£30000-£35000',
'£44000-£6600015',
'£75-£90£75-£90'
However, I have some repeating numbers. Essentially, I want anything after the first range of values removed, along with any sign besides the hyphen between the two numbers.
Expected output:
'26768-30136',
'26000-28000',
'21000',
'26768-30136',
'33',
'18500-20500',
'27500-30000',
'35000-40000',
'24000-27000',
'19000-24000',
'30000-35000',
'44000-66000',
'75-90'
Another way using pandas.Series.str.partition with replace:
data["salary"].str.partition("/")[0].str.replace("[^\d-]+", "", regex=True)
Output:
0 26768-30136
1 26000-28000
2 21000
3 26768-30136
4 33
5 18500-20500
6 27500-30000
7 35000-40000
8 24000-27000
9 19000-24000
10 30000-35000
11 44000-66000
12 75-90
Name: 0, dtype: object
Explain:
It assumes that you are only interested in the part up to /; it extracts everything until /, then removes anything but digits and hyphens.
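For intuition, here is a minimal sketch (using one sample value from the question and plain re) of what the two steps do before they are applied to the whole column:
import re

s = '£26,768 - £30,136/annum Attractive benefits package'
left = s.partition('/')[0]             # '£26,768 - £30,136'
print(re.sub(r'[^\d-]+', '', left))    # 26768-30136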
You can use
data['salary'].str.split('/', n=1).str[0].replace('[^\d-]+','', regex=True)
# 0 26768-30136
# 1 26000-28000
# 2 21000
# 3 26768-30136
# 4 33
# 5 18500-20500
# 6 27500-30000
# 7 35000-40000
# 8 24000-27000
# 9 19000-24000
# 10 30000-35000
# 11 44000-66000
# 12 75-90
Here,
.str.split('/', n=1) - splits into two parts at the first / char
.str[0] - gets the first item
.replace('[^\d-]+','', regex=True) - removes all chars other than digits and hyphens.
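As a rough illustration (assuming the data frame from the question), the intermediate results look like this:
step1 = data['salary'].str.split('/', n=1)        # e.g. ['£26,768 - £30,136', 'annum Attractive benefits package']
step2 = step1.str[0]                              # e.g. '£26,768 - £30,136'
step3 = step2.replace(r'[^\d-]+', '', regex=True) # e.g. '26768-30136'
print(step3.head())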
A more precise solution is to extract the £num(-£num)? pattern and remove all non-digits/hyphens:
data['salary'].str.extract(r'£(\d+(?:,\d+)*(?:\.\d+)?(?:\s*-\s*£\d+(?:,\d+)*(?:\.\d+)?)?)')[0].str.replace(r'[^\d-]+', '', regex=True)
Details:
£ - a literal char
\d+(?:,\d+)*(?:\.\d+)? - one or more digits, followed by zero or more occurrences of a comma and one or more digits, and then an optional sequence of a dot and one or more digits
(?:\s*-\s*£\d+(?:,\d+)*(?:\.\d+)?)? - an optional occurrence of a hyphen enclosed in zero or more whitespaces (\s*-\s*), then a £ char, and then the number pattern described above.
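A quick check with plain re on one of the trickier sample rows shows why this is more precise: only the first £num(-£num)? occurrence is extracted, so the repeated amounts after /annum are ignored:
import re

pat = r'£(\d+(?:,\d+)*(?:\.\d+)?(?:\s*-\s*£\d+(?:,\d+)*(?:\.\d+)?)?)'
s = '£27,500 - £30,000/annum £27,500 to £30,000 + Study'
m = re.search(pat, s)
print(m.group(1))                          # 27,500 - £30,000
print(re.sub(r'[^\d-]+', '', m.group(1)))  # 27500-30000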
You can do it in only two regex passes.
First extract the monetary amounts with a regex, then remove the thousands separators, and finally join the output by group, keeping only the first two occurrences per original row.
The advantage of this solution is that it really only extracts monetary digits, not other possible numbers that would be there if the input is not clean.
(data['salary'].str.extractall(r'£([,\d]+)')[0]          # extract £123,456 digits
    .str.replace(r'\D', '', regex=True)                  # remove separator
    .groupby(level=0).apply(lambda x: '-'.join(x[:2]))   # join first two occurrences
)
output:
0 26768-30136
1 26000-28000
2 21000
3 26768-30136
4 33
5 18500-20500
6 27500-30000
7 35000-40000
8 24000-27000
9 19000-24000
10 30000-35000
11 44000-66000
12 75-90
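To see why keeping only the first two occurrences matters, note that the intermediate extractall result (assuming the same data frame) has one row per £-amount found, indexed by (row, match). Row 12 contains four amounts, for example:
amounts = data['salary'].str.extractall(r'£([,\d]+)')[0]
print(amounts.loc[12])   # four matches for '£75 - £90/day £75-£90 Per Day': 75, 90, 75, 90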
You can use replace with a pattern and optional capture groups to match the data format, and use those groups in the replacement.
import pandas as pd
df = {'salary':['£26,768 - £30,136/annum Attractive benefits package',
'£26,000 - £28,000/annum plus bonus',
'£21,000/annum',
'£26,768 - £30,136/annum Attractive benefits package',
'£33/hour',
'£18,500 - £20,500/annum Inc Bonus - Study Support + Bens',
'£27,500 - £30,000/annum £27,500 to £30,000 + Study',
'£35,000 - £40,000/annum',
'£24,000 - £27,000/annum Study Support (ACCA / CIMA)',
'£19,000 - £24,000/annum Study Support',
'£30,000 - £35,000/annum',
'£44,000 - £66,000/annum + 15% Bonus + Excellent Benefits. L',
'£75 - £90/day £75-£90 Per Day']}
data = pd.DataFrame(df).salary.replace(
    r"^£(\d+)(?:,(\d+))?(?:\s*(-)\s*£(\d+)(?:,(\d+))?)?/.*",
    r"\1\2\3\4\5", regex=True
)
print(data)
The pattern matches
^ Start of string
£ Match literally
(\d+) Capture 1+ digits in group 1
(?:,(\d+))? Optionally capture 1+ digits in group 2, preceded by a comma, to match the data format
(?: Non capture group to match as a whole
\s*(-)\s*£ capture - between optional whitespace chars in group 3 and match £
(\d+)(?:,(\d+))? The same as previous, now in group 4 and group 5
)? Close non capture group and make it optional
See a regex demo.
Output
0 26768-30136
1 26000-28000
2 21000
3 26768-30136
4 33
5 18500-20500
6 27500-30000
7 35000-40000
8 24000-27000
9 19000-24000
10 30000-35000
11 44000-66000
12 75-90
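As a small sanity check with plain re (pattern and replacement taken from the answer above; note that in Python 3.5+ unmatched optional groups expand to an empty string, which is what makes the single-number case work):
import re

pat = r"^£(\d+)(?:,(\d+))?(?:\s*(-)\s*£(\d+)(?:,(\d+))?)?/.*"
print(re.sub(pat, r"\1\2\3\4\5", '£18,500 - £20,500/annum Inc Bonus - Study Support + Bens'))  # 18500-20500
print(re.sub(pat, r"\1\2\3\4\5", '£21,000/annum'))                                             # 21000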
So the problem is: I have written a script that compares values in DataFrames using fuzzywuzzy
def check_match_principal_name(state):
    for i in range(len(ALL_SCHOOLS['Principal Name'])):
        for a in range(len(TOP100['Principal'])):
            matchADD = fuzz.token_sort_ratio(ALL_SCHOOLS['Principal Name'][i], TOP100['Principal'][a])
            if matchADD > 90:
                print(ALL_SCHOOLS['Principal Name'][i] + ' ' + TOP100['Principal'][a])
                matchPRI.append(i)
                matchPRI100.append(a)
                print(ALL_SCHOOLS['Principal Name'][i])
                print(TOP100['Principal'][a])
    for i in matchPRI:
        ALL_SCHOOLS.loc[i, 'MatchPRI'] = 1
    for i in matchPRI100:
        TOP100.loc[i, 'MatchPRI'] = 1
    ALL_SCHOOLS.to_excel(f'/Users/Giova/PycharmProjects/Schools/Final_final/{state}1.xlsx')
    TOP100.to_excel(f'/Users/Giova/PycharmProjects/Schools/Final_final/top-100/{state}1.xlsx')
    matchPRI.clear()
    matchPRI100.clear()
It works, I don't get any exceptions etc., but for example in the script above fuzz.token_sort_ratio(ALL_SCHOOLS['Principal Name'][i], TOP100['Principal'][a])
returns Kimberly Beukema - Ms. Kimberly Beukema = 91
while in a second script like this:
from fuzzywuzzy import fuzz
match= fuzz.partial_token_sort_ratio('Kimberly Beukema',' Ms. Kimberly Beukema')
print(match)
it returns match = 100
and I don't understand why the value changes.
Both token_sort_ratio and partial_token_sort_ratio preprocess the two strings by default. This means they lowercase the strings, remove non-alphanumeric characters and trim whitespace. So in your case they convert:
'Kimberly Beukema'
' Ms. Kimberly Beukema'
to
'kimberly beukema'
'ms kimberly beukema'
In the next step they both sort the words in the two strings:
'beukema kimberly'
'beukema kimberly ms'
Afterwards they compare the two strings. For this comparison token_sort_ratio uses ratio, while partial_token_sort_ratio uses partial_ratio.
In ratio 3 deletions are required to convert 'beukema kimberly ms' to 'beukema kimberly'. Since the strings have a combined length of 35 the resulting ratio is round(100 * (1 - 3 / 35)) = 91.
In partial_ratio the ratio of the optimal alignment of the two strings is calculated. In your case 'beukema kimberly' is a substring of 'beukema kimberly ms', so the ratio between 'beukema kimberly' and 'beukema kimberly' is calculated which is round(100 * (1 - 0 / 32)) = 100.
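You can reproduce both scores on the preprocessed, sorted strings directly, for example:
from fuzzywuzzy import fuzz

print(fuzz.ratio('beukema kimberly', 'beukema kimberly ms'))          # 91
print(fuzz.partial_ratio('beukema kimberly', 'beukema kimberly ms'))  # 100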
I have a data frame as shown below
df = pd.DataFrame({'person_id': [11,11,11,11,11,11,11,11,11,11],
'text':['inJECTable 1234 Eprex DOSE 4000 units on NONd',
'department 6789 DOSE 8000 units on DIALYSIS days - IV Interm',
'inJECTable 4321 Eprex DOSE - 3 times/wk on NONdialysis day',
'insulin MixTARD 30/70 - inJECTable 46 units',
'insulin ISOPHANE -- InsulaTARD Vial - inJECTable 56 units SC SubCutaneous',
'1-alfacalcidol DOSE 1 mcg - 3 times a week - IV Intermittent',
'jevity liquid - FEEDS PO Jevity - 237 mL - 1 times per day',
'1-alfacalcidol DOSE 1 mcg - 3 times per week - IV Intermittent',
'1-supported DOSE 1 mcg - 1 time/day - IV Intermittent',
'1-testpackage DOSE 1 mcg - 1 time a day - IV Intermittent']})
I would like to remove the words/strings which follow patterns such as 46 units, 3 times a week, 3 times per week, 1 time/day etc.
I was reading about positive and negative look ahead and behind.
So, was trying something like below
[^([0-9\s]*(?=units))] #to remove terms like `46 units` from the string
[^[0-9\s]*(?=times)(times a day)] # don't know how to make this work for all time variants
time variants ex: 3 times a day, 3 time/wk, 3 times per day, 3 times a month, 3 times/month etc.
Basically, I expect my output to be something like below (remove terms like xx units, xx time a day, xx times per week, xx time/day, xx time/wk, xx time/week, etc)
You can consider a pattern like
\s*\d+\s*(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?))
See the regex demo
NOTE: the \d+ matches one or more digits. If you need to match any number, please consider using other patterns for a number in the format you expect, see regular expression for finding decimal/float numbers?, for example.
Pattern details
\s* - zero or more whitespace chars
\d+ - one or more digits
\s* - zero or more whitespaces
(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?)) - a non-capturing group matching:
units? - unit or units
| - or
times? - time or times
(?:\s+(?:a|per)\s+|\s*/\s*) - a or per enclosed with 1+ whitespaces, or / enclosed with 0+ whitespaces
(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?) - d or day, or wk or week, or month, or y, yr, yea or year
If you need to match whole words only, use word boundaries, \b:
\s*\b\d+\s*(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?))\b
In Pandas, use
df['text'] = df['text'].str.replace(r'\s*\b\d+\s*(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?))\b', '', regex=True)
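As a quick check of the pattern (same regex as above, plain re) on two of the sample strings from the question:
import re

pat = r'\s*\b\d+\s*(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?))\b'
print(re.sub(pat, '', 'insulin MixTARD 30/70 - inJECTable 46 units'))
# insulin MixTARD 30/70 - inJECTable
print(re.sub(pat, '', '1-alfacalcidol DOSE 1 mcg - 3 times per week - IV Intermittent'))
# 1-alfacalcidol DOSE 1 mcg - - IV Intermittent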
Need to clean up a csv import, which gives me a range of times (in string form). Code is at bottom; I currently use regular expressions and replace() on the df to convert other chars. Just not sure how to:
select the current 24 hour format numbers and add :00
select the 12 hour format numbers and make them 24 hour.
Input (from csv import):
break_notes
0 15-18
1 18.30-19.00
2 4PM-5PM
3 3-4
4 4-4.10PM
5 15 - 17
6 11 - 13
So far I have got it to look like this (removed spaces and AM/PM, replaced dots with colons):
break_notes
0 15-18
1 18:30-19:00
2 4-5
3 3-4
4 4-4:10
5 15-17
6 11-13
However, I would like it to look like this ('HH:MM-HH:MM' format):
break_notes
0 15:00-18:00
1 18:30-19:00
2 16:00-17:00
3 15:00-16:00
4 16:00-16:10
5 15:00-17:00
6 11:00-13:00
My code is:
data = pd.read_csv('test.csv')
data.break_notes = data.break_notes.str.replace(r'([P].|[ ])', '').str.strip()
data.break_notes = data.break_notes.str.replace(r'([.])', ':').str.strip()
Here is the converter function you need based on your requested input data. convert_entry takes a complete value entry, splits it on the dash, and passes each half to convert_single, since both halves of an entry can be converted individually. After the conversion, it joins them with a dash again.
convert_single uses regex to search for important parts in the time string.
It starts with some digits \d+ (representing the hours), then optionally a dot or a colon and some more digits [.:]?(\d+)? (representing the minutes), and after that optionally AM or PM (AM|PM)? (only PM is relevant in this case).
import re
def convert_single(s):
    m = re.search(pattern=r"(\d+)[.:]?(\d+)?(AM|PM)?", string=s)
    hours = m.group(1)
    minutes = m.group(2) or "00"
    if m.group(3) == "PM":
        hours = str(int(hours) + 12)
    return hours.zfill(2) + ":" + minutes.zfill(2)

def convert_entry(value):
    start, end = value.split("-")
    start = convert_single(start)
    end = convert_single(end)
    return "-".join((start, end))

values = ["15-18", "18.30-19.00", "4PM-5PM", "3-4", "4-4.10PM", "15 - 17", "11 - 13"]
for value in values:
    cvalue = convert_entry(value)
    print(cvalue)
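To apply this to the column instead of the standalone list (assuming the data frame and column name from the question):
data['break_notes'] = data['break_notes'].apply(convert_entry)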
I am trying to do some data analysis and there are some numbers that I want to analyze, the problem being that those numbers are in different string formats. These are the different formats:
"25,000,000 USD" or
"9 500 USD" or
"50,000 ETH"
It is basically always a number first, separated by either commas or blank spaces, followed by the currency. Depending on the currency, I want to calculate the amount in USD afterwards.
I have been looking up regex patterns for the last hour and could not find anything that solves my problem. I definitely made some progress and implemented different expressions, but none worked 100%. It's always missing something, as you will see below.
for i, row_value in df2['hardcap'].iteritems():
    try:
        q = df2['hardcap'][i]
        c = re.findall(r'[a-zA-Z]+', q)
        if c[0] == "USD":
            d = re.findall(r'^(\d?\d?\d(,\d\d\d)*|\d)', q)
            # Do something with the number
        elif c[0] == "EUR":
            d = re.findall(r'^(\d?\d?\d(,\d\d\d)*|\d)', q)
            # Do something with the number
        elif c[0] == "ETH":
            d = re.findall(r'^(\d?\d?\d(,\d\d\d)*|\d)', q)
            # Do something with the number
        print(d[0])
    except Exception:
        pass
So I am iterating through my dataframe column and first, I find out which currency the number is related to, either "USD", "EUR" or "ETH", which I save in c. This part already works. After that, I want to extract the number in a form that can be converted to an integer so I can do calculations with it.
Right now, the line
d = re.findall(r'^(\d?\d?\d(,\d\d\d)*|\d)', q)
returns something like this in d[0]:
('100,000,000', ',000') if the number was 100,000,000 and
('270', '') if the number was 270 000 000
What I would like to get in the best case would be something like:
100000000
and
270000000, but any way to extract the whole numbers would suffice
I'd appreciate any bump in the right direction as I don't have much experience with regex and feel stuck right now.
import re
s = '25,000,000 USD 9 500 USD 50,000 ETH'
for g in re.findall(r'(.*?)([A-Z]{3})', s):
    print(int(''.join(re.findall(r'\d', g[0]))), g[1])
Prints:
25000000 USD
9500 USD
50000 ETH
Optimized solution with re.search + re.sub functions:
import re
# equivalent for your df2['hardcap'] column values
hardcap = ["25,000,000 USD", "9 500 USD", "50,000 ETH"]
pat = re.compile(r'^(\d[\s,\d]*\d) ([A-Z]{3})')
for v in hardcap:
    m = pat.search(v)
    if m:  # if value is in the needed format
        amount, currency = m.group(1), m.group(2)
        amount = int(re.sub(r'\D*', '', amount))
        print(amount, currency)
Sample output:
25000000 USD
9500 USD
50000 ETH
import re
s = '25,000,000 USD 9 500 USD 50,000 ETH'
matches = re.findall(r'(\d[\d, ]*) ([A-Z]{3})', s)
l = [(int(match[0].replace(',', '').replace(' ', '')), match[1]) for match in matches]
print(l)
[(25000000, 'USD'), (9500, 'USD'), (50000, 'ETH')]