I have a string that represents a price. The price may come with a currency symbol (any entry of currency_list). I am trying to remove the currency symbol and return only the price.
So far I can do this for a prefix or suffix currency symbol using the code below, and that part works.
I want to add one validation: if the symbol is neither a prefix nor a suffix but sits in the middle, as in "200$434", the function should report an invalid format. I do not understand how to implement this.
currency_list = ['USD', 'UNITED STATES DOLLAR', '$', 'EUR', 'EURO', '€', 'GBP','BRITISH POUND', '£']
Normally input string can be
"$1212212"
"1212212EURO"
"1212212"
"1212212 BRITISH POUND"
I need help validating values like "1212$343" or "1212212EURO323.23".
Code:
for symb in currency_list:
    if symb in amount:
        data = amount.replace(symb, '')
After going through multiple blog posts, I found this answer, which gets the job done.
def validateCurrency(amount):
    new_amount = None
    for cur in currency_list:
        # only accept the symbol as a prefix or a suffix
        if amount.startswith(cur) or amount.endswith(cur):
            new_amount = amount.replace(cur, "", 1)
    if new_amount is None:
        return "Currency is not a valid string."
    return f"Price after removing symbol is {new_amount}"

# print(validateCurrency('$1212212'))
You can use regex to achieve your purpose.
import re
currency_list = ['USD', 'UNITED STATES DOLLAR', '$', 'EUR', 'EURO', '€', 'GBP', 'BRITISH POUND', '£']
p = re.compile(r'([\D]*)([\d]+\.?[\d]+)(.*)')
def verify_or_get_amount(amount):
    first, mid, last = [i.strip() for i in p.search(amount).groups()]
    if (first and first not in currency_list) or (last and last not in currency_list):
        print('invalid:', amount)
    else:
        amount = mid
        print('amount:', amount)
    return mid

for i in ['EURO123', 'EURO 123', 'EURO 123.', 'EURO .12', 'EURO 12.12', '$1212212', '1212212EURO', '1212212', '1212212 BRITISH POUND', '1212$343']:
    verify_or_get_amount(i)
Using regex:
import re
# r'\$' is a raw string, so the backslash reaches the regex engine unchanged
currency_list = ['USD', 'UNITED STATES DOLLAR', r'\$', 'EUR', 'EURO', '€', 'GBP', 'BRITISH POUND', '£']
currencies = '|'.join(currency_list)
c = re.compile(rf'^({currencies})? *(\d+(\.\d+)?) *({currencies})?$')
for i in ['$1212212', '1212212EURO', '1212212', '1212212 BRITISH POUND', '1212$343']:
    match_obj = c.match(i)
    if match_obj:
        print(match_obj.group(2))
    else:
        print('not found')
output :
1212212
1212212
1212212
1212212
not found
Explanation:
To see the actual pattern, run print(c.pattern), which gives:
^(USD|UNITED STATES DOLLAR|\$|EUR|EURO|€|GBP|BRITISH POUND|£)? *(\d+(\.\d+)?) *(USD|UNITED STATES DOLLAR|\$|EUR|EURO|€|GBP|BRITISH POUND|£)?$
I've escaped $ in currency_list, because an unescaped $ is a regex anchor.
currencies = '|'.join(currency_list) builds the alternation of possible prefixes or suffixes.
(\d+(\.\d+)?) matches the price and accepts floats as well (you can omit the (\.\d+)? part if you only need integers).
The " *" parts of the regex allow optional spaces between the number and the currency, as in "1212212 BRITISH POUND".
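As a variation on the pattern above, the escaping can be left to re.escape so that currency_list stays a plain list of symbols. This is a minimal sketch, not the answer's exact code: the helper name strip_currency is hypothetical, sorting the alternatives longest-first is an added assumption so that EURO is tried before EUR, and returning None for an invalid format is just one possible convention.

```python
import re

currency_list = ['USD', 'UNITED STATES DOLLAR', '$', 'EUR', 'EURO', '€', 'GBP', 'BRITISH POUND', '£']

# re.escape takes care of '$' (and any other regex metacharacter).
# Longest alternatives first, so 'EURO' is preferred over 'EUR'.
_currencies = '|'.join(re.escape(c) for c in sorted(currency_list, key=len, reverse=True))
_price_re = re.compile(rf'(?:{_currencies})? *(\d+(?:\.\d+)?) *(?:{_currencies})?')


def strip_currency(amount):
    """Return the numeric part of amount, or None if the format is invalid."""
    # fullmatch requires the whole string to fit the pattern,
    # so a symbol in the middle of the digits is rejected
    match = _price_re.fullmatch(amount)
    return match.group(1) if match else None


for sample in ['$1212212', '1212212 BRITISH POUND', '1212$343', '1212212EURO323.23']:
    print(sample, '->', strip_currency(sample))
```

Using fullmatch instead of explicit ^ and $ anchors is a design choice of this sketch; both reject strings with trailing content.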
I am assuming you want a currency validation function:
import re

def validateCurrency(input):
    input_length = len(input)
    if input.isdigit():
        return False
    # first (non-digits, digits) or (digits, non-digits) pair found in the string
    split = [re.findall(r'(\D+?)(\d+)|(\d+?)(\D+)', input)[0]]
    total_length = 0
    for i in split[0]:
        if i in currency_list:
            total_length += len(i)
        if str(i).isdigit():
            total_length += len(i)
    # valid only if currency symbol plus digits account for the whole string
    if total_length == input_length:
        return True
    else:
        return False
I am trying to write a function that, given a dictionary of restaurants, allows you to randomly pick a restaurant based on your choice of one of the values. For example, if you say you want a bakery then it will only give you bakeries.
I have only worked on the code for choosing the type of restaurant so far, and I am struggling with how to generate a random list. So I am checking for a value and, if it has it, would want to add the key to a list. Does anyone have any suggestions on how to do this?
import random
Restaurants ={
"Eureka": ["American", "$$", "Lunch", "Dinner"],
"Le_Pain": ["Bakery", "$$", "Breakfast", "Lunch", "Dinner"],
"Creme_Bakery": ["Bakery", "$", "Snack"]
}
list=[]
def simple_chooser():
    print('Would you like to lock a category or randomize? (randomize, type, price, or meal)')
    start= input()
    if start=="randomize":
        return #completely random
    elif start=="type":
        print("American, Bakery, Pie, Ice_Cream, Bagels, Asian, Chocolate, Italian, Pizza, Thai, Mexican, Japanese, Acai, Mediterranean, or Boba/Coffee?")
        type=input()
        for lst in Restaurants.values():
            for x in lst:
                if x==type:
                    list.append(x)
                    return(random.choice(list))
To return completely random restaurant suggestions you need to create a list of all the types first and then you can choose one and return the names of the restaurants.
import random
Restaurants ={
"Eureka": ["American", "$$", "Lunch", "Dinner"],
"Le_Pain": ["Bakery", "$$", "Breakfast", "Lunch", "Dinner"],
"Creme_Bakery": ["Bakery", "$", "Snack"]
}
types = ['American', 'Bakery', 'Pie', 'Ice_Cream', 'Bagels', 'Asian', 'Chocolate', 'Italian',
'Pizza', 'Thai', 'Mexican', 'Japanese', 'Acai', 'Mediterranean','Boba/Coffee']
def simple_chooser():
    l=[]
    print('Would you like to lock a category or randomize? (randomize, type, price, or meal)')
    start= input()
    if start=="randomize":
        type_random = random.choice(types)
        for k,v in Restaurants.items():
            if v[0] == type_random:
                l.append(k)
    elif start=="type":
        print("American, Bakery, Pie, Ice_Cream, Bagels, Asian, Chocolate, Italian, Pizza, Thai, Mexican, Japanese, Acai, Mediterranean, or Boba/Coffee?")
        type_chosen=input()
        for k,v in Restaurants.items():
            if v[0] == type_chosen:
                l.append(k)
    return(random.choice(l))
Also, you don't need to return in if-else statements. Once you have your list of Restaurants you can randomly choose a restaurant and return it.
In your loop you don't have the restaurant name, because you iterate over the values only. You would need something like
for name, props in Restaurants.items():
    if props[0] == type:
        list.append(name)
return (random.choice(list)) # wait for the whole list to be constructed
With better naming (don't use type and list, which shadow builtins):
def simple_chooser():
    start = input('Would you like to lock a category or randomize? (randomize, type, price, or meal)')
    if start == "randomize":
        return # completely random
    elif start == "type":
        restaurant_type = input("American, Bakery, Pie, Ice_Cream, Bagels, Asian, Chocolate, Italian, "
                                "Pizza, Thai, Mexican, Japanese, Acai, Mediterranean, or Boba/Coffee?")
        matching_names = [name for name, props in Restaurants.items() if props[0] == restaurant_type]
        return random.choice(matching_names)
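The comprehension above only looks at props[0], the restaurant type. For the price and meal options mentioned in the prompt, a membership test on the whole property list is one possible extension. A minimal sketch, assuming the Restaurants dict from the question; the helper names matching_restaurants and pick_restaurant are hypothetical:

```python
import random

Restaurants = {
    "Eureka": ["American", "$$", "Lunch", "Dinner"],
    "Le_Pain": ["Bakery", "$$", "Breakfast", "Lunch", "Dinner"],
    "Creme_Bakery": ["Bakery", "$", "Snack"],
}


def matching_restaurants(wanted, restaurants=Restaurants):
    """Names of all restaurants whose property list contains wanted."""
    return [name for name, props in restaurants.items() if wanted in props]


def pick_restaurant(wanted=None, restaurants=Restaurants):
    """Pick a random restaurant, optionally restricted to a type, price or meal."""
    names = list(restaurants) if wanted is None else matching_restaurants(wanted, restaurants)
    # random.choice raises IndexError on an empty list, so "no match" is handled explicitly
    return random.choice(names) if names else None


print(matching_restaurants("Bakery"))
print(pick_restaurant("Dinner"))
```

Because the test is list membership rather than substring matching, "$" does not accidentally match "$$".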
You make processing difficult because of the design of your data structures.
Here's an idea which should be easily adapted to future needs.
import random
from operator import contains, eq
Restaurants = [
{'name': 'Eureka', 'type': 'American', 'price': '$$', 'meal': ('Dinner',)},
{'name': 'Le_Pain', 'type': 'Bakery', 'price': '$$', 'meal': ('Lunch', 'Dinner')},
{'name': 'Creme_Bakery', 'type': 'Bakery', 'price': '$', 'meal': ('Snack',)}
]
def get_attr(k):
    s = set()
    for r in Restaurants:
        if isinstance(r[k], tuple):
            for t in r[k]:
                s.add(t)
        else:
            s.add(r[k])
    return s

def choose_restaurant():
    categories = ', '.join(Restaurants[0])
    while True:
        choice = input(f'Select by category ({categories}) or choose random: ')
        if choice == 'random':
            return random.choice(Restaurants)
        if choice in Restaurants[0]:
            choices = get_attr(choice)
            if (v := input(f'Select value for {choice} from ({", ".join(choices)}): ')) in choices:
                op = contains if isinstance(Restaurants[0][choice], tuple) else eq
                return [r for r in Restaurants if op(r[choice], v)]
        print('Invalid selection\x07')

print(choose_restaurant())
Restaurants is now a list of dictionaries which is easy to extend. You just need to make sure that each new restaurant has the same structure (keys). Also note that the 'meal' value is a tuple even if there's a single value
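With this list-of-dictionaries layout, filtering on any key becomes a small generic helper. The following is a sketch under the assumption that a tuple value means "any of these"; the function name filter_restaurants is hypothetical and not part of the answer above:

```python
Restaurants = [
    {'name': 'Eureka', 'type': 'American', 'price': '$$', 'meal': ('Dinner',)},
    {'name': 'Le_Pain', 'type': 'Bakery', 'price': '$$', 'meal': ('Lunch', 'Dinner')},
    {'name': 'Creme_Bakery', 'type': 'Bakery', 'price': '$', 'meal': ('Snack',)},
]


def filter_restaurants(key, value, restaurants=Restaurants):
    """Return the names of restaurants whose key field equals or contains value."""
    result = []
    for r in restaurants:
        field = r[key]
        # tuples are treated as "any of these", plain strings must match exactly
        matches = value in field if isinstance(field, tuple) else field == value
        if matches:
            result.append(r['name'])
    return result


print(filter_restaurants('type', 'Bakery'))
print(filter_restaurants('meal', 'Dinner'))
```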
Below is my example code:
from fuzzywuzzy import fuzz
import json
from itertools import zip_longest
with open("synonyms.json", "r") as f:
    synonyms = json.load(f)
vendor_data = ["i7 processor", "solid state", "Corei5 :1135G7 (11th Generation)", "hard drive",
               "ddr 8gb", "something1", "something2", "something3", "HT (100W) DDR4-2400"]
buyer_data = ["i7 processor 12 generation", "corei7:latest technology"]
vendor = []
buyer = []
for item,value in synonyms.items():
    for k,k2 in zip_longest(vendor_data,buyer_data):
        for v in value:
            if fuzz.token_set_ratio(k,v) > 70:
                if item in k:
                    vendor.append(k)
                else:
                    vendor.append(item+" "+k)
            else:
                #didnt get only "something" strings here !
                if fuzz.token_set_ratio(k2,v) > 70:
                    if item in k2:
                        buyer.append(k2)
                    else:
                        buyer.append(item+" "+k2)
vendor = list(set(vendor))
buyer = list(set(buyer))
vendor,buyer
Note: "something" string can be anything like "battery" or "display"etc
synonyms json
{
    "processor": ["corei5", "core", "corei7", "i5", "i7", "ryzen5", "i5 processor", "i7 processor", "processor i5", "processor i7", "core generation", "core gen"],
    "ram": ["DDR4", "memory", "DDR3", "DDR", "DDR 8gb", "DDR 8 gb", "DDR 16gb", "DDR 16 gb", "DDR 32gb", "DDR 32 gb", "DDR4-"],
    "ssd": ["solid state drive", "solid drive"],
    "hdd": ["Hard Drive"]
}
What do I need?
I want to add all "something" strings to the vendor list dynamically.
NOTE: the "something" string can be anything in the future.
I want to add to the vendor list every string that did not reach a fuzz score above 70 for any synonym. Basically, I want to keep the left-out data as well.
for example like below:
current output
['processor Corei5 :1135G7 (11th Generation)',
'i7 processor',
'ram HT (100W) DDR4-2400',
'ram ddr 8gb',
'hdd hard drive',
'ssd solid state']
expected output below
['processor Corei5 :1135G7 (11th Generation)',
'i7 processor',
'ram HT (100W) DDR4-2400',
'ram ddr 8gb',
'hdd hard drive',
'ssd solid state',
'something1',
'something2',
'something3'] # the "something" strings need to be added to the vendor list dynamically.
What silly mistake am I making? Thank you.
Here's my attempt:
from fuzzywuzzy import process, fuzz
synonyms = {'processor': ['corei5', 'core', 'corei7', 'i5', 'i7', 'ryzen5', 'i5 processor', 'i7 processor', 'processor i5', 'processor i7', 'core generation', 'core gen'], 'ram': ['DDR4', 'memory', 'DDR3', 'DDR', 'DDR 8gb', 'DDR 8 gb', 'DDR 16gb', 'DDR 16 gb', 'DDR 32gb', 'DDR 32 gb', 'DDR4-'], 'ssd': ['solid state drive', 'solid drive'], 'hdd': ['Hard Drive']}
vendor_data = ['i7 processor', 'solid state', 'Corei5 :1135G7 (11th Generation)', 'hard drive', 'ddr 8gb', 'something1', 'something2', 'something3', 'HT (100W) DDR4-2400']
buyer_data = ['i7 processor 12 generation', 'corei7:latest technology']
def find_synonym(s: str, min_score: int = 60):
    results = process.extractBests(s, choices=synonyms, score_cutoff=min_score)
    if not results:
        return None
    return results[0][-1]

def process_data(l: list, min_score: int = 60):
    matches = []
    no_matches = []
    for item in l:
        syn = find_synonym(item, min_score=min_score)
        if syn is not None:
            new_item = f'{syn} {item}' if syn not in item else item
            matches.append(new_item)
        elif any(fuzz.partial_ratio(s, item) >= min_score for s in synonyms.keys()):
            # one of the synonyms is already in the item string
            matches.append(item)
        else:
            no_matches.append(item)
    return matches, no_matches
For process_data(vendor_data) we get:
(['i7 processor',
'ssd solid state',
'processor Corei5 :1135G7 (11th Generation)',
'hdd hard drive',
'ram ddr 8gb',
'ram HT (100W) DDR4-2400'],
['something1', 'something2', 'something3'])
And for process_data(buyer_data):
(['i7 processor 12 generation', 'processor corei7:latest technology'], [])
I had to lower the cut-off score to 60 to also get results for ddr 8gb. The process_data function returns 2 lists: One with matches with words from the synonyms dict and one with items without matches. If you want exactly the output you listed in your question, just concatenate the two lists like this:
matches, no_matches = process_data(vendor_data)
matches + no_matches # ['i7 processor', 'ssd solid state', 'processor Corei5 :1135G7 (11th Generation)', 'hdd hard drive', 'ram ddr 8gb', 'ram HT (100W) DDR4-2400', 'something1', 'something2', 'something3']
I have tried to come up with a decent answer (certainly not the cleanest one)
import json
from itertools import zip_longest
from fuzzywuzzy import fuzz
synonyms = open("synonyms.json", "r")
synonyms = json.loads(synonyms.read())
vendor_data = ["i7 processor", "solid state", "Corei5 :1135G7 (11thGeneration)", "hard drive", "ddr 8gb", "something1",
"something2",
"something3", "HT (100W) DDR4-2400"]
buyer_data = ["i7 processor 12 generation", "corei7:latest technology"]
vendor = []
buyer = []
for k, k2 in zip_longest(vendor_data, buyer_data):
    has_matched = False
    for item, value in synonyms.items():
        for v in value:
            if fuzz.token_set_ratio(k, v) > 70:
                if item in k:
                    vendor.append(k)
                else:
                    vendor.append(item + " " + k)
                if has_matched or k2 is None:
                    break
                else:
                    has_matched = True
            if fuzz.token_set_ratio(k2, v) > 70:
                if item in k2:
                    buyer.append(k2)
                else:
                    buyer.append(item + " " + k2)
                if has_matched or k is None:
                    break
                else:
                    has_matched = True
        else:
            continue  # match not found
        break  # match is found
    else:  # only evaluates on normal loop end
        # Only something strings
        # do something with the new input values
        continue
vendor = list(set(vendor))
buyer = list(set(buyer))
I hope you can achieve what you want with this code. Check the docs if you don't know what a for-else loop does. TL;DR: the else clause executes when the loop terminates normally (not with a break). Note that I put the synonyms loop inside the data loop. This is because we can't know for certain which synonym group an entry belongs to; also, sometimes the vendor data entry is a processor while the buyer data entry is memory. Also note that I have assumed an item can't match more than once. If it could, you would need a more advanced check (for example, keep a counter and break when it reaches 2).
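To illustrate the for-else behaviour described above, here is a small standalone sketch. The function first_match and its sample data are hypothetical, and it uses a plain case-insensitive substring test instead of fuzzy matching:

```python
def first_match(spec, synonyms):
    """Return the first category with a synonym contained in spec, else 'unmatched'."""
    for category, words in synonyms.items():
        if any(word.lower() in spec.lower() for word in words):
            result = category
            break  # leaving the loop via break skips the else clause
    else:
        # runs only when the loop finished without hitting break
        result = "unmatched"
    return result


sample_synonyms = {"ram": ["DDR"], "hdd": ["Hard Drive"]}
print(first_match("ddr 8gb", sample_synonyms))
print(first_match("something1", sample_synonyms))
```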
EDIT:
I took another look at the question and came up with maybe a better answer:
v_dict = dict()
for spec in vendor_data[:]:
    for item, choices in synonyms.items():
        if process.extractOne(spec, choices)[1] > 70:  # don't forget to import process from fuzzywuzzy
            v_dict[spec] = item
            break
    else:
        v_dict[spec] = "Something new"
This code matches the strings to the correct type. For example: {'i7 processor': 'processor', 'solid state': 'ssd', 'Corei5 :1135G7 (11thGeneration)': 'processor', 'hard drive': 'ssd', 'ddr 8gb': 'ram', 'something1': 'Something new', 'something2': 'Something new', 'something3': 'Something new', 'HT (100W) DDR4-2400': 'ram'}. You can replace "Something new" with whatever you like. You could also do: v_dict[spec] = 0 (on a match) and v_dict[spec] = 1 (on no match). You could then sort the dict ->
it = iter(v_dict.values())
print(sorted(v_dict.keys(), key=lambda x: next(it)))
Which would give the wanted results (more or less), all the recognised items will be first, and then all the unrecognised items. You could do some more advanced sorting on this dict if you want. I think this code gives you enough flexibility to reach your goal.
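The iterator trick above relies on sorted calling the key function exactly once per key, in dictionary order. A possibly more direct way to get the same ordering is to use the dictionary lookup itself as the sort key. A minimal sketch with made-up sample data, using 0 for a match and 1 for no match as suggested above:

```python
# 0 = recognised, 1 = unrecognised (sample values, not real matching results)
v_dict = {
    "i7 processor": 0,
    "something1": 1,
    "solid state": 0,
    "something2": 1,
}

# sorted is stable, so keys with equal values keep their insertion order
ordered = sorted(v_dict, key=v_dict.get)
print(ordered)  # ['i7 processor', 'solid state', 'something1', 'something2']
```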
If I understand correctly, what you are trying to do is match keywords specified by a customer and/or vendor against a predefined database of keywords you have.
First, I would highly recommend using a reversed mapping of the synonyms, so it's faster to lookup, especially when the dataset will grow.
Second, considering the fuzzywuzzy API, it looks like you simply want the best match, so extractOne is a solid choice for that.
Now, extractOne returns the best match and a score:
>>> process.extractOne("cowboys", choices)
("Dallas Cowboys", 90)
I would split the algorithm into two:
A generic part that simply gets the best match, which should always exist (even if it's not a great one)
A filter, where you could adjust the sensitivity of the algorithm, based on different criteria of your application. This sensitivity threshold should set the minimal match quality. If you're below this threshold, just use "untagged" for the category for example.
Here is the final code, which I think is very simple and easy to understand and expand:
import json
from fuzzywuzzy import process
def load_synonyms():
    with open('synonyms.json') as fin:
        synonyms = json.load(fin)
    # Reversing the map makes it much easier to lookup
    reversed_synonyms = {}
    for key, values in synonyms.items():
        for value in values:
            reversed_synonyms[value] = key
    return reversed_synonyms

def load_vendor_data():
    return [
        "i7 processor",
        "solid state",
        "Corei5 :1135G7 (11thGeneration)",
        "hard drive",
        "ddr 8gb",
        "something1",
        "something2",
        "something3",
        "HT (100W) DDR4-2400"
    ]

def load_customer_data():
    return [
        "i7 processor 12 generation",
        "corei7:latest technology"
    ]

def get_tag(keyword, synonyms):
    THRESHOLD = 80
    DEFAULT = 'general'
    tag, score = process.extractOne(keyword, synonyms.keys())
    return synonyms[tag] if score > THRESHOLD else DEFAULT

def main():
    synonyms = load_synonyms()
    customer_data = load_customer_data()
    vendor_data = load_vendor_data()
    data = customer_data + vendor_data
    tags_dict = { keyword: get_tag(keyword, synonyms) for keyword in data }
    print(json.dumps(tags_dict, indent=4))

if __name__ == '__main__':
    main()
When running with the specified inputs, the output is:
{
"i7 processor 12 generation": "processor",
"corei7:latest technology": "processor",
"i7 processor": "processor",
"solid state": "ssd",
"Corei5 :1135G7 (11thGeneration)": "processor",
"hard drive": "hdd",
"ddr 8gb": "ram",
"something1": "general",
"something2": "general",
"something3": "general",
"HT (100W) DDR4-2400": "ram"
}
I have an old string and a modified one, plus a dictionary holding values taken from the old string. I am trying to check whether each value in the dictionary is still present, unchanged, in the new string. If yes, nothing happens. If the value has changed, the value in the dictionary is replaced by the modified value. If the value in the dictionary is not present in the new string, the dictionary value is set to None.
Code
import re
db_tag_old = {"art":"art", "organizer":"james", "month":"December", "season":"summer"}
old = 'The art is performed by james. _______ Season is summer _____ time. It is December.'
new = 'The art is performed by ______ Mathew. Season is ______ autmn time. __ __ _________'
db_tag_new = {}
final_db_tag = {}
symbol = '_'
needle = f'{re.escape(symbol)}+'
position = [(match.start(),match.end()) for match in re.finditer(needle, old)]
for key,value in db_tag_old.items():
    position_old = [(match.start(),match.end()) for match in re.finditer(value.lower(), old)]
    position_new = [(match.start(),match.end()) for match in re.finditer(value.lower(), new)]
    if position_old == position_new and [] not in (position_old, position_new):
        db_tag_new.update({key:value})
        continue
    else:
        new_value = new[position[0][0]:position[0][1]]
        db_tag_new.update({key:new_value})
final_db_tag.update({"old":db_tag_old,"new":db_tag_new})
print(final_db_tag)
Output Obtained
{'old': {'art': 'art', 'organizer': 'james', 'month': 'December', 'season': 'summer'}, 'new': {'art': 'art', 'organizer': 'Mathew.', 'month': 'Mathew.', 'season': 'Mathew.'}}
Here, under the dictionary key "new", month and season have wrong values.
Expected Output
{'old': {'art': 'art', 'organizer': 'james', 'month': 'December', 'season': 'summer'}, 'new': {'art': 'art', 'organizer': 'Mathew.', 'month': 'None', 'season': 'autmn'}}
How can this be corrected?
It's not really clear to me, what the rule is to replace old with new text. The following code produces the wanted result, but I'm not sure whether this approach is as universal as needed:
import re
db_tag_old = {"art":"art", "organizer":"james.", "month":"December", "season":"summer"}
old = 'The art is performed by james. _______ Season is summer _____ time. It is December.'
new = 'The art is performed by ______ Mathew. Season is ______ autmn time. __ __ _________'
db_tag_new = {}
# pre-definition for dict-entries we won't find:
for key, val in db_tag_old.items():
    db_tag_new[key] = "None"
owords = old.split()
nwords = new.split()
for (i, nw) in enumerate(nwords):
    # the "art"-case:
    for key, ow in db_tag_old.items():
        if nw == ow:
            db_tag_new[key] = ow
    # "organizer" / "season" cases:
    if re.match(r'^_+$', nw):
        for key, ow in db_tag_old.items():
            if ow == owords[i] and re.match(r'^_+$', owords[i+1]):
                db_tag_new[key] = nwords[i+1]
print("old: ", db_tag_old)
print("new: ", db_tag_new)
I have a column that has a lot of doctor specialties. I want to clean it up and created a function below:
def specialty(x):
    if x.str.contains('Urolog'):
        return 'Urology'
    elif x.str.contains('Nurse'):
        return 'Nurse Practioner'
    elif x.str.contains('Oncology'):
        return 'Oncology'
    elif x.str.contains('Physician'):
        return 'Physician Assistant'
    elif x.str.contains('Family Medicine'):
        return 'Family Medicine'
    elif x.str.contains('Anesthes'):
        return 'Anesthesiology'
    else:
        return 'Other'
df['desc_clean'] = df['desc'].apply(specialty)
However, I get an error: TypeError: 'function' object is not subscriptable
There are too many values to use a manual mapping, so I wanted to use str.contains. Is there a better way to do this?
EDIT: Sample DF
{'person_id': {39063: 33081476009,
50538: 33033519093,
56075: 33170508793,
36593: 33061707789,
51656: 33047685345,
95512: 33022026049,
40286: 33038034707,
3887: 33076466195,
40161: 33052807819,
52905: 33190526939,
35418: 33008425164,
35934: 33015737122,
3389: 33055125864,
136: 33139641318,
105460: 33113871389,
52568: 33075745388,
24725: 33052090907,
34838: 33205449839,
31908: 33183672635,
36115: 33006692696},
'final_desc': {39063: 'None',
50538: 'Urology',
56075: 'Anesthesiology',
36593: 'None',
51656: 'Urology',
95512: 'None',
40286: 'Anesthesiology',
3887: 'Specialist',
40161: 'None',
52905: 'Anesthesiology',
35418: 'Urology',
35934: 'None',
3389: 'Ophthalmology',
136: 'Rheumatology',
105460: 'None',
52568: 'Urology',
24725: 'Family Medicine',
34838: 'None',
31908: 'Nurse Practitioner',
36115: 'None'}}
To do this, we can define a mapping between substrings and labels, then iterate through it and set the column's values, keeping track of the rows we've changed. At the end, any rows we never matched get set to 'Other'.
mapping = {'Urolog': 'Urology',
'Nurse': 'Nurse Practioner',
'Oncology': 'Oncology',
'Physician': 'Physician Assistant',
'Family Medicine': 'Family Medicine',
'Anesthes': 'Anesthesiology'}
def specialty(column):
    column = column.copy()
    matches = pd.Series(False, index=column.index)
    for k,v in mapping.items():
        match = column.str.contains(k)
        column[match] = v
        matches[match] = True
    column[~matches] = 'Other'
    return column

specialty(df['final_desc'])
39063 Other
50538 Urology
56075 Anesthesiology
36593 Other
51656 Urology
95512 Other
40286 Anesthesiology
3887 Other
40161 Other
52905 Anesthesiology
35418 Urology
35934 Other
3389 Other
136 Other
105460 Other
52568 Urology
24725 Family Medicine
34838 Other
31908 Nurse Practioner
36115 Other
Name: final_desc, dtype: object
The x received by the specialty function is the string itself, so there is no x.str. Since it is a string, you can use 'in' to check, as below. I modified some of the data to see the result.
Tip: You should use a dictionary or list rather than the elif chain.
Code:
import pandas as pd
import numpy as np
def specialty(x):
    # 'Urolog' in x tests whether the keyword occurs inside the description
    if 'Urolog' in x:
        return 'Urology'
    elif 'Nurse' in x:
        return 'Nurse Practioner'
    elif 'Oncology' in x:
        return 'Oncology'
    elif 'Physician' in x:
        return 'Physician Assistant'
    elif 'Family Medicine' in x:
        return 'Family Medicine'
    elif 'Anesthes' in x:
        return 'Anesthesiology'
    else:
        return 'Other'

df = pd.DataFrame(data={'person_id': {39063: 33081476009, 50538: 33033519093, 56075: 33170508793, 36593: 33061707789, 51656: 33047685345, 95512: 33022026049, 40286: 33038034707, 3887: 33076466195, 40161: 33052807819, 52905: 33190526939, 35418: 33008425164, 35934: 33015737122, 3389: 33055125864, 136: 33139641318, 105460: 33113871389, 52568: 33075745388, 24725: 33052090907, 34838: 33205449839, 31908: 33183672635, 36115: 33006692696},
'final_desc': {39063: 'None', 50538: 'Urolog', 56075: 'Anesthes', 36593: 'None', 51656: 'Urology', 95512: 'None', 40286: 'Anesthes', 3887: 'Specialist', 40161: 'None', 52905: 'Anesthesiology', 35418: 'Urology', 35934: 'None', 3389: 'Ophthalmology', 136: 'Rheumatology', 105460: 'None', 52568: 'Urology', 24725: 'Family Medicine', 34838: 'None', 31908: 'Nurse', 36115: 'None'}})
df['desc_clean'] = df['final_desc'].apply(specialty)
print(df)
Output:
person_id final_desc desc_clean
39063 33081476009 None Other
50538 33033519093 Urolog Urology
56075 33170508793 Anesthes Anesthesiology
36593 33061707789 None Other
51656 33047685345 Urology Urology
95512 33022026049 None Other
40286 33038034707 Anesthes Anesthesiology
3887 33076466195 Specialist Other
40161 33052807819 None Other
52905 33190526939 Anesthesiology Anesthesiology
35418 33008425164 Urology Urology
35934 33015737122 None Other
3389 33055125864 Ophthalmology Other
136 33139641318 Rheumatology Other
105460 33113871389 None Other
52568 33075745388 Urology Urology
24725 33052090907 Family Medicine Family Medicine
34838 33205449839 None Other
31908 33183672635 Nurse Nurse Practioner
36115 33006692696 None Other
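The tip about replacing the elif chain can be sketched as follows. This is one possible shape, with keyword-to-label pairs modelled on the question; the names SPECIALTY_KEYWORDS and specialty_from_mapping are hypothetical. The first keyword found in the text wins, so the order of the dictionary matters:

```python
# keyword (substring to look for) -> cleaned-up label; illustrative values
SPECIALTY_KEYWORDS = {
    'Urolog': 'Urology',
    'Nurse': 'Nurse Practitioner',
    'Oncology': 'Oncology',
    'Physician': 'Physician Assistant',
    'Family Medicine': 'Family Medicine',
    'Anesthes': 'Anesthesiology',
}


def specialty_from_mapping(text, mapping=SPECIALTY_KEYWORDS, default='Other'):
    """Return the label of the first keyword contained in text, else default."""
    for keyword, label in mapping.items():
        if keyword in text:
            return label
    return default


for desc in ['Urology', 'Anesthesiologist', 'Pediatric Nurse', 'None']:
    print(desc, '->', specialty_from_mapping(desc))
```

The function takes a single string, so it can be passed to apply in the same way as the elif version, for example df['final_desc'].apply(specialty_from_mapping).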
You can use a library like fuzzywuzzy for fuzzy string matching. The benefit of this approach is it is more flexible than some rule set, as demonstrated below.
This solution generates the max score of substrings and candidate categories, returning the one that matches best. If it's below the threshold it will return the default value ("None"):
from fuzzywuzzy import fuzz
CATEGORIES = [
'Urology',
'Nurse Practioner',
'Oncology',
'Physician Assistant',
'Family Medicine',
'Anesthesiology',
'Specialist',
]
def best_match(
    text,
    categories=CATEGORIES,
    default="None",
    threshold=65
):
    matches = {fuzz.partial_ratio(cat, text): cat for cat in categories}
    best_score = max(matches)
    best_match = matches[best_score]
    if best_score >= threshold:
        return best_match
    else:
        return default
df["final_desc"] = df.desc.apply(best_match)
result:
person_id final_desc desc
52568 33075745388 Urology urologist
36593 33061707789 Nurse Practioner nruse practition
136 33139641318 Specialist oncology specialist
50538 33033519093 Physician Assistant physicians assistant
3389 33055125864 Family Medicine fam. medicine
51656 33047685345 Anesthesiology anesthesiology
35418 33008425164 Anesthesiology anesthesiologist
52905 33190526939 Nurse Practioner Nurses practitioner
36115 33006692696 Specialist Occupational specialist
31908 33183672635 Oncology Oncologist
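If installing fuzzywuzzy is not an option, a rough stand-in can be built on difflib from the standard library. Note that this swaps in a different similarity measure (the SequenceMatcher ratio on the whole string, not fuzzywuzzy's partial_ratio), so scores and results will differ from the table above. The function name closest_category and the cutoff of 0.6 are assumptions of this sketch:

```python
import difflib

CATEGORIES = [
    'Urology',
    'Nurse Practitioner',
    'Oncology',
    'Physician Assistant',
    'Family Medicine',
    'Anesthesiology',
    'Specialist',
]


def closest_category(text, categories=CATEGORIES, default="None", cutoff=0.6):
    """Return the category most similar to text, or default if none reaches cutoff."""
    # compare case-insensitively, but return the original spelling
    lowered = {cat.lower(): cat for cat in categories}
    hits = difflib.get_close_matches(text.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else default


for desc in ['urologist', 'Urology', 'zzzz']:
    print(desc, '->', closest_category(desc))
```

Because the whole strings are compared, long descriptions that merely contain a category name may fall below the cutoff; lowering cutoff trades precision for recall.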
You can select the matching rows directly using the index:
ix = df[df.desc.str.contains('Urolog')].index
df.loc[ix, 'desc_clean'] = "Urology"
So iterating over several specialties would be something like:
dict_specialties = {"Urolog": "Urology"}
for key, val in dict_specialties.items():
    ix = df[df.desc.str.contains(key)].index
    df.loc[ix, 'desc_clean'] = val
I am experiencing a strange faulty behaviour, where a dictionary is only appended once and I can not add more key value pairs to it.
My code reads in a multi-line string and extracts substrings via split(), to be added to a dictionary. I make use of conditional statements. Strangely only the key:value pairs under the first conditional statement are added.
Therefore I can not complete the dictionary.
How can I solve this issue?
Minimal code:
#I hope the '\n' is sufficient or use '\r\n'
example = "Name: Bugs Bunny\nDOB: 01/04/1900\nAddress: 111 Jokes Drive, Hollywood Hills, CA 11111, United States"
def format(data):
    dic = {}
    for line in data.splitlines():
        #print('Line:', line)
        if ':' in line:
            info = line.split(': ', 1)[1].rstrip() #does not work with files
            #print('Info: ', info)
            if ' Name:' in info: #middle name problems! /maiden name
                dic['F_NAME'] = info.split(' ', 1)[0].rstrip()
                dic['L_NAME'] = info.split(' ', 1)[1].rstrip()
            elif 'DOB' in info: #overhang
                dic['DD'] = info.split('/', 2)[0].rstrip()
                dic['MM'] = info.split('/', 2)[1].rstrip()
                dic['YY'] = info.split('/', 2)[2].rstrip()
            elif 'Address' in info:
                dic['STREET'] = info.split(', ', 2)[0].rstrip()
                dic['CITY'] = info.split(', ', 2)[1].rstrip()
                dic['ZIP'] = info.split(', ', 2)[2].rstrip()
    return dic

if __name__ == '__main__':
    x = format(example)
    for v, k in x.iteritems():
        print v, k
Your code doesn't work, at all. You split off the name before the colon and discard it, looking only at the value after the colon, stored in info. That value never contains the names you are looking for; Name, DOB and Address all are part of the line before the :.
Python lets you assign to multiple names at once; make use of this when splitting:
def format(data):
    dic = {}
    for line in data.splitlines():
        if ':' not in line:
            continue
        name, _, value = line.partition(':')
        name = name.strip()
        if name == 'Name':
            dic['F_NAME'], dic['L_NAME'] = value.split(None, 1)  # strips whitespace for us
        elif name == 'DOB':
            dic['DD'], dic['MM'], dic['YY'] = (v.strip() for v in value.split('/', 2))
        elif name == 'Address':
            dic['STREET'], dic['CITY'], dic['ZIP'] = (v.strip() for v in value.split(', ', 2))
    return dic
I used str.partition() here rather than limit str.split() to just one split; it is slightly faster that way.
For your sample input this produces:
>>> format(example)
{'CITY': 'Hollywood Hills', 'ZIP': 'CA 11111, United States', 'L_NAME': 'Bunny', 'F_NAME': 'Bugs', 'YY': '1900', 'MM': '04', 'STREET': '111 Jokes Drive', 'DD': '01'}
>>> from pprint import pprint
>>> pprint(format(example))
{'CITY': 'Hollywood Hills',
'DD': '01',
'F_NAME': 'Bugs',
'L_NAME': 'Bunny',
'MM': '04',
'STREET': '111 Jokes Drive',
'YY': '1900',
'ZIP': 'CA 11111, United States'}
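As a short illustration of the str.partition() call used in the answer above: it always returns three parts, splits only at the first separator, and does not fail when the separator is missing. The sample strings below are made up:

```python
# Normal case: field name, separator, value
name, sep, value = "Address: 111 Jokes Drive, Hollywood Hills".partition(':')
print(repr(name), repr(sep), repr(value.strip()))

# Only the first colon splits the line, later colons stay in the value
time_name, _, time_value = "Time: 12:30".partition(':')
print(repr(time_name), repr(time_value.strip()))

# Separator missing: the whole string comes back as the first part,
# and both the separator and the rest are empty strings
missing = "no separator here".partition(':')
print(missing)
```

With str.split(':', 1) the last case would return a one-element list, so unpacking into two names would raise ValueError; partition avoids that.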