Simplifying a list into categories - python

I am a new Python developer and was wondering if someone can help me with this. I have a dataset that has one column that describes a company type. I noticed that the column has, for example, both "surgical" and "surgery" listed. It also has "eyewear", "eyeglasses" and "optometry" listed. So instead of having a huge list in this column, I want to simplify the categories: if a value contains "eye", "glasses" or "opto", then just change it to "eyewear". My initial code looks like this:
def map_company(row):
    company = row['SIC_Desc']
    if company in 'Surgical':
        return 'Surgical'
    elif company in ['Eye', 'glasses', 'opthal', 'spectacles', 'optometers']:
        return 'Eyewear'
    elif company in ['Cotton', 'Bandages', 'gauze', 'tape']:
        return 'First Aid'
    elif company in ['Dental', 'Denture']:
        return 'Dental'
    elif company in ['Wheelchairs', 'Walkers', 'braces', 'crutches', 'ortho']:
        return 'Mobility equipments'
    else:
        return 'Other'
df['SIC_Desc'] = df.apply(map_company,axis=1)
This is not correct though because it is changing every item into "Other," so clearly my syntax is wrong. Can someone please help me simplify this column that I am trying to relabel?
Thank you

It is hard to answer without having the exact content of your dataset, but I can see one mistake: the membership test is the wrong way around. Writing company in [...] is true only when the whole description equals one of the list items. You want to check whether one of the words occurs in your company description, so it should look like this:
if any(test in company for test in ['Eye', 'glasses', 'opthal', 'spectacles', 'optometers']):
However, you might have a case issue here, so I would recommend:
company = row['SIC_Desc'].lower()
if any(test.lower() in company for test in ['Eye', 'glasses', 'opthal', 'spectacles', 'optometers']):
    return 'Eyewear'
You will also need to make sure company is a string and 'SIC_Desc' is a correct column name.
In the end your function will look like this:
def is_match(company,names):
    return any(name in company for name in names)
def map_company(row):
    company = row['SIC_Desc'].lower()
    if 'surgical' in company:
        return 'Surgical'
    elif is_match(company,['eye','glasses','opthal','spectacles','optometers']):
        return 'Eyewear'
    elif is_match(company,['cotton', 'bandages', 'gauze', 'tape']):
        return 'First Aid'
    else:
        return 'Other'
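To see why the original test sent every row to "Other", it may help to compare the different meanings of in side by side. This is a minimal sketch; the sample description is made up and is not taken from the asker's data:

```python
# Hypothetical description, standing in for one value of the SIC_Desc column.
company = "Ophthalmic goods and eyeglasses"

keywords = ["eye", "glasses", "opthal", "spectacles", "optometers"]

# List membership: True only if the whole string equals one of the list items.
whole_string_match = company in keywords

# Substring test: True if any keyword occurs somewhere inside the string.
substring_match = any(word in company.lower() for word in keywords)

# String on the right-hand side: asks whether `company` is a substring of
# 'Surgical', which is the reverse of what the original code intended.
reversed_test = company in "Surgical"

print(whole_string_match, substring_match, reversed_test)
```

Only the substring test finds a match here, which is the behaviour the question describes wanting.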

Here is an option using a reversed dictionary.
Code
import pandas as pd
# Sample DataFrame
s = pd.Series(["gauze", "opthal", "tape", "surgical", "eye", "spectacles",
               "glasses", "optometers", "bandages", "cotton", "glue"])
df = pd.DataFrame({"SIC_Desc": s})
df
LOOKUP = {
    "Eyewear": ["eye", "glasses", "opthal", "spectacles", "optometers"],
    "First Aid": ["cotton", "bandages", "gauze", "tape"],
    "Surgical": ["surgical"],
    "Dental": ["dental", "denture"],
    "Mobility": ["wheelchairs", "walkers", "braces", "crutches", "ortho"],
}
REVERSE_LOOKUP = {v: k for k, lst in LOOKUP.items() for v in lst}
def map_company(row):
    company = row["SIC_Desc"].lower()
    return REVERSE_LOOKUP.get(company, "Other")
df["SIC_Desc"] = df.apply(map_company, axis=1)
df
Details
We define a LOOKUP dictionary with (key, value) pairs of expected output and associated words, respectively. Note, the values are lowercase to simplify searching. Then we use a reversed dictionary to automatically invert the key value pairs and improve the search performance, e.g.:
>>> REVERSE_LOOKUP
{'bandages': 'First Aid',
'cotton': 'First Aid',
'eye': 'Eyewear',
'gauze': 'First Aid',
...}
Notice these reference dictionaries are created outside the mapping function to avoid rebuilding dictionaries for every call to map_company(). Finally the mapping function quickly returns the desired output using the reversed dictionary by calling .get(), a method that returns the default argument "Other" if no entry is found.
See @Flynsee's insightful answer for an explanation of what is happening in your code. The code here is cleaner compared to a bevy of conditional statements.
Benefits
Since we have used dictionaries, each lookup should be fast: O(1) on average, compared with the O(n) scan that in performs on a list. Moreover, the main LOOKUP dictionary is easy to extend, because new entries do not require writing more conditional statements.
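One caveat: REVERSE_LOOKUP.get() only matches when the whole cell equals a keyword. If the descriptions are longer phrases, as the question suggests, a substring scan over the same LOOKUP table is one possible variant. This is a sketch rather than part of the answer above; the function name categorize and the sample strings are assumptions for illustration:

```python
LOOKUP = {
    "Eyewear": ["eye", "glasses", "opthal", "spectacles", "optometers"],
    "First Aid": ["cotton", "bandages", "gauze", "tape"],
    "Surgical": ["surgical"],
    "Dental": ["dental", "denture"],
    "Mobility": ["wheelchairs", "walkers", "braces", "crutches", "ortho"],
}


def categorize(description, lookup=LOOKUP, default="Other"):
    """Return the first category whose keyword occurs inside description."""
    text = str(description).lower()
    for category, keywords in lookup.items():
        if any(keyword in text for keyword in keywords):
            return category
    return default


# Hypothetical sample descriptions.
print(categorize("Surgical appliances and supplies"))
print(categorize("Eyeglasses, retail"))
print(categorize("Adhesive glue"))
```

With pandas this could be applied as df["SIC_Desc"].map(categorize). The trade-off is that every row scans the keywords, so the O(1) lookup is lost, and short keywords such as "tape" can match inside unrelated words.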

Related

Case insensitive Full Name dictionary search

I am creating a dictionary with "Full Name": "Birthday" for numerous people as an exercise.
The program should ask
"Whose birthday do you want to look up?"
I will input a name, say "Benjamin Franklin"
And it will return his birthday: 1706/01/17.
Alright, the problem I am encountering is name capitalization.
How can I input "benjamin franklin" and still find "Benjamin Franklin" in my dictionary? I am familiar with the .lower() and .upper() methods, but I have not been able to apply them correctly. Is that the right way to approach this problem?
Here is what I have
bday_dict = {"Person1": "YYYY/MM/DD1",
             "Person2": "YYYY/MM/DD2",
             "Benjamin Franklin": "1706/01/17"}
def get_name(dict_name):
    name = input("Whose birthday do you want to look up? > ")
    return name
def find_bday(name):
    print(bday_dict[name])
find_bday(get_name(bday_dict))
The best way to do this is to keep the keys in your dictionary lowercase. If you can't do that for whatever reason, have a dictionary from lowercase to the real key, and then keep the original dictionary.
Otherwise, Kraigolas's solution works well, but it is O(N) whereas hashmaps are supposed to be constant-time, and thus for really large dictionaries the other answer will not scale.
So, when you are setting your keys, do bday_dict[name.lower()] = value, and then you can query with bday_dict[user_input.lower()], where user_input is whatever string the user typed.
Alternatively:
bday_dict = {"John": 1}
name_dict = {"john": "John"}
def access(x):
    return bday_dict[name_dict[x.lower()]]
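The second dictionary does not have to be written by hand. A small sketch, assuming Python 3 and using a hypothetical helper name lookup_bday: it derives name_dict from the existing keys, so the original capitalization stays available for display.

```python
bday_dict = {"Person1": "YYYY/MM/DD1",
             "Benjamin Franklin": "1706/01/17"}

# Map each lowercased name to the key as it is actually stored.
name_dict = {name.lower(): name for name in bday_dict}


def lookup_bday(name):
    """Return (stored_name, birthday), or None if the name is unknown."""
    real_name = name_dict.get(name.strip().lower())
    if real_name is None:
        return None
    return real_name, bday_dict[real_name]


print(lookup_bday("benjamin franklin"))
print(lookup_bday("nobody"))
```

Both lookups are plain dictionary accesses, so the constant-time behaviour is kept. The name_dict mapping has to be rebuilt or updated whenever a name is added to bday_dict.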
Probably the most straightforward way I can think of to solve this is the following:
def get_birthday(name):
    global bday_dict
    for person, bday in bday_dict.items():
        if name.lower() == person.lower():
            return bday
    return "This person is not in bday_dict"
Here, you just iterate through the entire dictionary using the person's name paired with their birthday, and if we don't find them, just return a message saying we don't have their birthday.
If you know that all names will capitalize the first letter of each word, you can just use:
name = ' '.join([word.capitalize() for word in name.split()])
then you can just search for that. This is not always the case. For example, for "Leonardo da Vinci" this will not work, so the original answer is probably the most reliable way to do this.
One final way to do this would be to just store the names as lowercase from the beginning in your dictionary, but this might not be practical when you want to draw a name from the dictionary as well.
Depending what your exercise allows, I would put the names in the dictionary as all lowercase or uppercase. So:
bday_dict = {"person1": "YYYY/MM/DD1",
             "person2": "YYYY/MM/DD2",
             "benjamin franklin": "1706/01/17"}
And then look up the entered name in the dictionary like this:
def find_bday(name):
    print(bday_dict[name.lower()])
You may also want to do a check that the name is in the dictionary beforehand to avoid an error:
def find_bday(name):
    bday = bday_dict.get(name.lower(), None)
    if bday:
        print(bday)
    else:
        print("No result for {}.".format(name))

How to conditionally modify string values in dataframe column - Python/Pandas

I have a dataframe in which one column ('entity') contains various names of countries and non-state entities. I need to clean the column because the string values (provided by manual data entry) are all lower-case (china instead of China). I can't just perform the .title() operation on the column since there are string values that I want left untouched (e.g., al Something should not be turned into Al Something).
I'm having trouble creating a function to help me with this problem and could use some guidance from the community. In the past I've used dictionaries to help map/replace incorrect strings with correct strings, and I can still revert to that way of doing things, but I thought creating this function might be more straightforward and efficient, and I also wanted to challenge myself. But no changes occur in the entity column when I execute the function. Thanks in advance!
myString = ['al Group1', 'al Group2']
entities = df['entity']
def title_fix(entities):
    new_titles = []
    for entity in entities:
        if entity in myString:
            new_titles.append(myString)
        else:
            new_title.append(entity.title())
    return new_title
title_fix(df)
The entities in the line entities = df['entity'] is not the same variable as the entities in the line def title_fix(entities):. This second entities variable is the argument to the function title_fix, and it exists only within the function. It takes on whatever argument you pass into your call to title_fix, which is df.
Try this instead of your function:
# A list of entity names to leave alone (must exactly match character-for-character)
myString = ['al Group1', 'al Group2']
# Apply title case to every entity NOT in myString
df['entity'] = df['entity'].apply(lambda x: x if x in myString else x.title())
# Print the modified DataFrame
df
Note that this solution requires that each string in myString exactly matches the target string in df['entity'], otherwise the target string will not be replaced.
Your code had several bugs, such as spelling and indentation. Fixed code:
myString = ['al Group1', 'al Group2']
entities = df['entity']
def title_fix(entities):
    new_titles = []
    for entity in entities:
        if entity in myString:
            new_titles.append(entity)
        else:
            new_titles.append(entity.title())
    return new_titles
df['entity'] = title_fix(entities)
However, what you want to achieve can be done in a one-liner. I came up with 3 solutions. I don't know pandas that well and I have no idea about the performance differences between these solutions, but here they are.
ignored makes a little bit more sense than myString so I'll use it.
ignored = ['al Group1', 'al Group2']
First solution:
df['entity'] = df['entity'].apply(lambda x: x.title() if x not in ignored else x)
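Second (this is chained assignment, which pandas warns about and which may not modify df, so the third form is the safer one):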
df.entity[~df.entity.isin(ignored)] = df.entity.str.title()
Third:
df.loc[~df.entity.isin(ignored), 'entity'] = df.entity.str.title()

Add missing dictionary key/value via raw_input

import collections
header_dict = {'account number':'ACCOUNT_name','accountID':'ACCOUNT_name','name':'client','first name':'client','tax id':'tin'}
#header_dict = collections.defaultdict(lambda: 'tin') # attempted use of defaultdict...destroys my dictionary
given_header = ['account number','name','tax id']#,'tax identification number']#,'social security number'
#given_header = ['account number','name','tax identification number']...non working header layout
fileLayout = [header_dict[ting] for ting in given_header if ting] #create if else..if ting exists, add to list...else if not in list, add to dictionary
def getLayout(ting):
    global given_header
    global fileLayout
    return given_header[fileLayout.index(ting)]
print getLayout('ACCOUNT_name')
print getLayout('client')
print getLayout('tin')
rows = zip((getLayout('ACCOUNT_name'),getLayout('client'),getLayout('tin')))
print rows
I am working with many files of random, mixed up layouts/column orders. I have a set template for my db table of 'ACCOUNT_name','client','tin' that I want the files to be ordered in. I have created a dictionary of the possible header/column names I might find in other files as keys and my set header names as values. So, for example, if I wanted to see where to put the column 'account number' from one of my given files, I would type header_dict['account number'].
This would give me the corresponding column from my template, 'ACCOUNT_name'. This works great...I also added another feature. Instead of having to type 'account number'..I made a list comprehension that looks up each value by key.
This list I just created with the 'fileLayout' list comprehension essentially transforms my given file's header into my desired names: ['ACCOUNT_name','client','tin']
That makes life a lot easier...I know that I want to look up 'ACCOUNT_name', or 'client'. Next I run a function 'getLayout' that returns the index of the desired columns I am searching...So if I want to see where my desired column 'ACCOUNT_name' is in the file, I just run the function which is called like this...
getLayout('ACCOUNT_name')
Now at this point, I can easily print the columns to my order...with:
rows = zip((getLayout('ACCOUNT_name'),getLayout('client'),getLayout('tin')))
print rows
The above code gives me [('account number',), ('name',), ('tax id',)], which is exactly what I want...
But what if there is a new header I am not used to? Let's use the same example code above but change the list 'given_header' to this:
given_header = ['account number','name','tax identification number']
As expected, I get a KeyError: 'tax identification number'. I know I can use defaultdict, but when I try to use it with the default value 'tin', I end up overwriting my entire dictionary. What I would ultimately like to do is this:
I would like to create an else within my list comprehension that lets me enter dictionary entries via standard input if they don't exist. In other words, since 'tax identification number' does not exist as a key, add it to my dict and give it the value 'tin' via raw_input. Has anyone ever done or tried anything like this? If you have any suggestions, I am all ears. I'm struggling with this issue...
The way I would want to go about this is in the list comprehension..
fileLayout = [header_dict[ting] for ting in given_header if ting else raw_input('add missing key value pair to dictionary')] # or do something of the sort.
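The conditional in the line above is not valid where it is placed: in a comprehension, an if/else expression has to come before the for. One possible approach is a small helper that asks for the missing value and stores it. This sketch assumes Python 3, where raw_input is called input. The helper name resolve_header and its ask parameter are made up for illustration:

```python
header_dict = {'account number': 'ACCOUNT_name', 'accountID': 'ACCOUNT_name',
               'name': 'client', 'first name': 'client', 'tax id': 'tin'}


def resolve_header(header, mapping, ask=input):
    """Return mapping[header], prompting once and storing the answer if missing."""
    if header not in mapping:
        mapping[header] = ask("No mapping for %r, enter template column: " % header)
    return mapping[header]


given_header = ['account number', 'name', 'tax identification number']

# A stand-in for keyboard input so the example runs unattended;
# drop the `ask` argument to prompt the user instead.
fileLayout = [resolve_header(h, header_dict, ask=lambda prompt: 'tin')
              for h in given_header]
print(fileLayout)
```

Because the new pair is written back into header_dict, the same unknown header is only asked about once, and the existing entries are left untouched.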

Sorting the catalog results by multiple fields

I need to sort the catalog results by multiple fields.
In my case, first sort by year, then by month. The year and month field are included in my custom content type (item_publication_year and item_publication_month respectively).
However, I'm not getting the results that I want. The year and month are not ordered at all. They should appear in descending order i.e. 2006, 2005, 2004 etc.
Below is my code:
def queryItemRepository(self):
    """
    Perform a search returning items matching the criteria
    """
    query = {}
    portal_catalog = getToolByName(self, 'portal_catalog')
    folder_path = '/'.join( self.context.getPhysicalPath() )
    query['portal_type'] = "MyContentType"
    query['path'] = {'query' : folder_path, 'depth' : 2 }
    results = portal_catalog.searchResults(query)
    # convert the results to a python list so we can use the sort function
    results = list(results)
    results.sort(lambda x, y : cmp((y['item_publication_year'], y['item_publication_year']),
                                   (x['item_publication_month'], x['item_publication_month'])
                                   ))
    return results
Anyone care to help?
A better bet is to use the key parameter for sorting:
results.sort(key=lambda b: (b.item_publication_year, b.item_publication_month))
You can also use the built-in sorted() function instead of calling list() and then sort(). It returns a new sorted list, and it is the same amount of work for Python either way:
results = portal_catalog.searchResults(query)
results = sorted(results, key=lambda b: (b.item_publication_year, b.item_publication_month))
Naturally, both item_publication_year and item_publication_month need to be present in the catalog metadata.
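The question asks for newest first, which the key-based sort gives once reverse=True is added. A minimal sketch with plain stand-in objects instead of catalog brains; the Brain namedtuple and its sample values are assumptions for illustration:

```python
from collections import namedtuple

# Stand-in for catalog results; real brains expose these attribute names
# only if both fields are stored as catalog metadata.
Brain = namedtuple("Brain", ["item_publication_year", "item_publication_month"])

results = [Brain(2005, 3), Brain(2006, 1), Brain(2005, 11), Brain(2006, 7)]

# Sort by year, then month, both descending.
newest_first = sorted(
    results,
    key=lambda b: (b.item_publication_year, b.item_publication_month),
    reverse=True,
)

print([(b.item_publication_year, b.item_publication_month) for b in newest_first])
```

This assumes both fields hold numbers. Month names, or numbers stored as strings, would need converting inside the key function first.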
You can also get multi-field sorting straight from the catalog search by using advanced query; see also its official docs.

python list of dicts how to merge key:value where values are same?

Python newb here looking for some assistance...
For a variable number of dicts in a python list like:
list_dicts = [
    {'id':'001', 'name':'jim', 'item':'pencil', 'price':'0.99'},
    {'id':'002', 'name':'mary', 'item':'book', 'price':'15.49'},
    {'id':'002', 'name':'mary', 'item':'tape', 'price':'7.99'},
    {'id':'003', 'name':'john', 'item':'pen', 'price':'3.49'},
    {'id':'003', 'name':'john', 'item':'stapler', 'price':'9.49'},
    {'id':'003', 'name':'john', 'item':'scissors', 'price':'12.99'},
]
I'm trying to find the best way to group dicts where the value of key "id" is equal, then add/merge any unique key:value and create a new list of dicts like:
list_dicts2 = [
    {'id':'001', 'name':'jim', 'item1':'pencil', 'price1':'0.99'},
    {'id':'002', 'name':'mary', 'item1':'book', 'price1':'15.49', 'item2':'tape', 'price2':'7.99'},
    {'id':'003', 'name':'john', 'item1':'pen', 'price1':'3.49', 'item2':'stapler', 'price2':'9.49', 'item3':'scissors', 'price3':'12.99'},
]
So far, I've figured out how to group the dicts in the list with:
myList = itertools.groupby(list_dicts, operator.itemgetter('id'))
But I'm struggling with how to build the new list of dicts to:
1) Add the extra keys and values to the first dict instance that has the same "id"
2) Set the new name for "item" and "price" keys (e.g. "item1", "item2", "item3"). This seems clunky to me, is there a better way?
3) Loop over each "id" match to build up a string for later output
I've chosen to return a new list of dicts only because of the convenience of passing a dict to a templating function where setting variables by a descriptive key is helpful (there are many vars). If there is a cleaner more concise way to accomplish this, I'd be curious to learn. Again, I'm pretty new to Python and in working with data structures like this.
Try to avoid complex nested data structures. I believe people tend to grok them only while they are intensively using the data structure. After the program is finished, or is set aside for a while, the data structure quickly becomes mystifying.
Objects can be used to retain or even add richness to the data structure in a saner, more organized way. For instance, it appears the item and price always go together. So the two pieces of data might as well be paired in an object:
class Item(object):
    def __init__(self,name,price):
        self.name=name
        self.price=price
Similarly, a person seems to have an id and name and a set of possessions:
class Person(object):
    def __init__(self,id,name,*items):
        self.id=id
        self.name=name
        self.items=set(items)
If you buy into the idea of using classes like these, then your list_dicts could become
list_people = [
    Person('001','jim',Item('pencil',0.99)),
    Person('002','mary',Item('book',15.49)),
    Person('002','mary',Item('tape',7.99)),
    Person('003','john',Item('pen',3.49)),
    Person('003','john',Item('stapler',9.49)),
    Person('003','john',Item('scissors',12.99)),
]
Then, to merge the people based on id, you could use Python's reduce function, along with take_items, which takes (merges) the items from one person and gives them to another:
def take_items(person,other):
    '''
    person takes other's items.
    Note however, that although person may be altered, other remains the same --
    other does not lose its items.
    '''
    person.items.update(other.items)
    return person
Putting it all together:
import itertools
import operator
from functools import reduce  # reduce is a builtin only in Python 2
class Item(object):
    def __init__(self,name,price):
        self.name=name
        self.price=price
    def __str__(self):
        return '{0} {1}'.format(self.name,self.price)
class Person(object):
    def __init__(self,id,name,*items):
        self.id=id
        self.name=name
        self.items=set(items)
    def __str__(self):
        # list() so the item strings are shown on Python 3 as well
        return '{0} {1}: {2}'.format(self.id,self.name,list(map(str,self.items)))
list_people = [
    Person('001','jim',Item('pencil',0.99)),
    Person('002','mary',Item('book',15.49)),
    Person('002','mary',Item('tape',7.99)),
    Person('003','john',Item('pen',3.49)),
    Person('003','john',Item('stapler',9.49)),
    Person('003','john',Item('scissors',12.99)),
]
def take_items(person,other):
    '''
    person takes other's items.
    Note however, that although person may be altered, other remains the same --
    other does not lose its items.
    '''
    person.items.update(other.items)
    return person
# groupby only merges adjacent entries, so list_people must be ordered by id
list_people2 = [reduce(take_items,g)
                for k,g in itertools.groupby(list_people, lambda person: person.id)]
for person in list_people2:
    print(person)
This looks very much like a homework problem.
As the above poster mentioned, there are a few more appropriate data structures for this kind of data, some variant on the following might be reasonable:
[ ('001', 'jim', [('pencil', '0.99')]),
  ('002', 'mary', [('book', '15.49'), ('tape', '7.99')]),
  ('003', 'john', [('pen', '3.49'), ('stapler', '9.49'), ('scissors', '12.99')])]
This can be made with the relatively simple:
list2 = []
for id,iter in itertools.groupby(list_dicts,operator.itemgetter('id')):
    idList = list(iter)
    list2.append((id,idList[0]['name'],[(z['item'],z['price']) for z in idList]))
The interesting thing about this question is the difficulty in extracting 'name' when using groupby, without iterating past the item.
To get back to the original goal though, you could use code like this (as the OP suggested):
list3 = []
for id,name,itemList in list2:
    newitem = dict({'id':id,'name':name})
    for index,items in enumerate(itemList):
        newitem['item'+str(index+1)] = items[0]
        newitem['price'+str(index+1)] = items[1]
    list3.append(newitem)
I imagine it would be easier to combine the items in list_dicts into something that looks more like this:
list_dicts2 = [{'id':1, 'name':'jim', 'items':[{'itemname':'pencil','price':'0.99'}]}, {'id':2, 'name':'mary', 'items':[{'itemname':'book','price':'15.49'}, {'itemname':'tape','price':'7.99'}]}]
You could also use a list of tuples for 'items' or perhaps a named tuple.
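A sketch of building that nested shape with itertools.groupby, using a shortened copy of the question's data. The key names 'items' and 'itemname' follow the structure suggested above; the variable names are arbitrary, and the ids are kept as the strings from the question rather than converted to integers:

```python
import itertools
import operator

list_dicts = [
    {'id': '001', 'name': 'jim', 'item': 'pencil', 'price': '0.99'},
    {'id': '002', 'name': 'mary', 'item': 'book', 'price': '15.49'},
    {'id': '002', 'name': 'mary', 'item': 'tape', 'price': '7.99'},
    {'id': '003', 'name': 'john', 'item': 'pen', 'price': '3.49'},
]

by_id = operator.itemgetter('id')
nested = []
# groupby only merges adjacent rows, so sort by the same key first.
for person_id, rows in itertools.groupby(sorted(list_dicts, key=by_id), key=by_id):
    rows = list(rows)
    nested.append({
        'id': person_id,
        'name': rows[0]['name'],
        'items': [{'itemname': r['item'], 'price': r['price']} for r in rows],
    })

print(nested)
```

A template can then loop over person['items'] directly, which avoids inventing numbered keys such as item1 and item2.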
