How to conditionally modify string values in dataframe column - Python/Pandas

I have a dataframe in which one column ('entity') contains various names of countries and non-state entities. I need to clean the column because the string values (provided by manual data entry) are all lower-case (china instead of China). I can't just perform the .title() operation on the column, since there are string values that I want left untouched (e.g., al Something should not be turned into Al Something).
I'm having trouble creating a function to help me with this problem and could use some guidance from the community. In the past I've used dictionaries to map/replace incorrect strings with correct strings, and I can still revert to that approach, but I thought creating this function might be more straightforward and efficient, and I wanted to challenge myself. However, no changes occur in the entity column when I execute the function. Thanks in advance!
myString = ['al Group1', 'al Group2']
entities = df['entity']
def title_fix(entities):
    new_titles = []
    for entity in entities:
        if entity in myString:
            new_titles.append(myString)
        else:
            new_title.append(entity.title())
    return new_title
title_fix(df)

The entities in the line entities = df['entity'] is not the same variable as the entities in the line def title_fix(entities):. This second entities variable is the argument to the function title_fix, and it exists only within the function. It takes on whatever argument you pass into your call to title_fix, which is df.
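A minimal sketch of that scoping rule, using made-up names (count_items and the sample lists are illustrative, not from the question): the parameter shadows the module-level variable of the same name, so the function only ever sees what the caller passes in.

```python
entities = ["china", "india"]          # module-level variable


def count_items(entities):             # parameter with the same name
    # Inside the function, `entities` refers to the argument only.
    return len(entities)


result = count_items(["a", "b", "c"])  # the module-level list is ignored
print(result)
```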
Try this instead of your function:
# A list of entity names to leave alone (must exactly match character-for-character)
myString = ['al Group1', 'al Group2']
# Apply title case to every entity NOT in myString
df['entity'] = df['entity'].apply(lambda x: x if x in myString else x.title())
# Print the modified DataFrame
df
Note that this solution requires that each string in myString exactly matches the target string in df['entity'], otherwise the target string will not be replaced.

Your code had several bugs, such as spelling and indentation. Fixed code:
myString = ['al Group1', 'al Group2']
entities = df['entity']
def title_fix(entities):
    new_titles = []
    for entity in entities:
        if entity in myString:
            new_titles.append(entity)
        else:
            new_titles.append(entity.title())
    return new_titles
df['entity'] = title_fix(entities)
However, what you want to achieve can be done in a one-liner. I came up with 3 solutions. I don't know pandas that well and I have no idea about the performance differences between these solutions, but here they are.
ignored makes a little bit more sense than myString so I'll use it.
ignored = ['al Group1', 'al Group2']
First solution:
df['entity'] = df['entity'].apply(lambda x: x.title() if x not in ignored else x)
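Second (this is chained assignment, which pandas discourages and which may not update df, so prefer the third form):
df.entity[~df.entity.isin(ignored)] = df.entity.str.title()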
Third:
df.loc[~df.entity.isin(ignored), 'entity'] = df.entity.str.title()
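For reference, the per-value rule that all three solutions share can be sketched as a plain function, which makes it easy to check without a DataFrame. The helper name title_unless_ignored and the sample values are assumptions for illustration; the set is only there to make the membership test cheaper than a list lookup.

```python
IGNORED = {"al Group1", "al Group2"}


def title_unless_ignored(value, ignored=IGNORED):
    """Return value unchanged if it is in ignored, otherwise title-cased."""
    return value if value in ignored else value.title()


sample = ["china", "al Group1", "south korea"]
cleaned = [title_unless_ignored(v) for v in sample]
print(cleaned)
```

With pandas, the same function could be passed to apply, e.g. df['entity'] = df['entity'].apply(title_unless_ignored).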

Related

Transliterate sentence written in 2 different scripts to a single script

I am able to convert Hindi text written in English (Latin) letters back to Hindi script:
import codecs, string
from indic_transliteration import sanscript
from indic_transliteration.sanscript import SchemeMap, SCHEMES, transliterate
def is_hindi(character):
    maxchar = max(character)
    if u'\u0900' <= maxchar <= u'\u097f':
        return character
    else:
        print(transliterate(character, sanscript.ITRANS, sanscript.DEVANAGARI))
character = 'bakrya'
is_hindi(character)
Output:
बक्र्य
But if I try to do something like this, I don't get any conversion:
character = 'Bakrya विकणे आहे'
is_hindi(character)
Output:
Bakrya विकणे आहे
Expected Output:
बक्र्य विकणे आहे
I also tried the library Polyglot but I am getting similar results with it.
Preface: I know nothing of Devanagari, so you will have to bear with me.
First, consider your function. It can return two things, character or None (print just outputs something, it doesn't actually return a value). That makes your first output example originate from the print function, not Python evaluating your last statement.
Then, when you consider your second test string, it will see that there's some Devanagari text and just return the string back. What you have to do, if this transliteration works as I think it does, is to apply this function to every word in your text.
I modified your function to:
def is_hindi(character):
    maxchar = max(character)
    if u'\u0900' <= maxchar <= u'\u097f':
        return character
    else:
        return transliterate(character, sanscript.ITRANS, sanscript.DEVANAGARI)
and modified your call to
' '.join(map(is_hindi, character.split()))
Let me explain, from right to left. First, I split your test string into the separate words with .split(). Then, I map (i.e., apply the function to every element) the new is_hindi function to this new list. Last, I join the separate words with a space to return your converted string.
Output:
'बक्र्य विकणे आहे'
If I may suggest, I would place this splitting/mapping functionality into another function, to make things easier to apply.
Edit: I had to modify your test string from 'Bakrya विकणे आहे' to 'bakrya विकणे आहे' because B wasn't being converted. This can be fixed in a generic text with character.lower().
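A rough sketch of the suggested wrapper function follows. The names contains_devanagari, convert_mixed, and fake_converter are hypothetical; the converter is passed in as an argument so that the sketch runs without any third-party package, and the lower() call follows the edit note above.

```python
def contains_devanagari(word):
    """True if any character falls in the Devanagari block U+0900..U+097F."""
    return any("\u0900" <= ch <= "\u097f" for ch in word)


def convert_mixed(text, to_devanagari):
    """Apply to_devanagari to every word that has no Devanagari characters."""
    words = text.split()
    converted = [
        w if contains_devanagari(w) else to_devanagari(w.lower())
        for w in words
    ]
    return " ".join(converted)


# Stand-in converter so the sketch runs on its own (not a real transliterator).
def fake_converter(word):
    return "<" + word + ">"


print(convert_mixed("Bakrya विकणे आहे", fake_converter))
```

With the library from the question, the second argument could presumably be lambda w: transliterate(w, sanscript.ITRANS, sanscript.DEVANAGARI).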

.replace only replaces last character in list

title = 'Example####+||'
blacklisted_chars = ['#','|','#','+']
for i in blacklisted_chars:
    convert = title.replace(i, '')
print(convert)
# Example####||
I want to remove all blacklisted characters in a list by replacing them with ''. However, when the code is run, only the final blacklisted character is replaced in the printed result.
How can I make sure all the characters are replaced, so that only 'Example' is printed?
Strings are immutable in Python. You assign a new string with
convert = title.replace(i, '')
title remains unchanged after this statement. convert is an entirely new string that is missing i.
On the next iteration, you replace a different value of i, but still from the original title. So in the end it looks like you only ran
convert = title.replace('+', '')
You have two very similar options, depending on whether you want to keep the original title around or not.
If you do, make another reference to it, and keep updating that reference with the results, so that each successive iteration builds on the result of the previous removal:
convert = title
for i in blacklisted_chars:
    convert = convert.replace(i, '')
print(convert)
If you don't care to retain the original title, use that name directly:
for i in blacklisted_chars:
    title = title.replace(i, '')
print(title)
You can achieve a similar result without an explicit loop using re.sub (after import re):
convert = re.sub('[#|#+]', '', title)
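As a sketch of the loop-free variants (variable names here are illustrative): re.escape guards against characters that have a special meaning in a regular expression, and str.translate with a table built by str.maketrans removes characters without using regular expressions at all.

```python
import re

title = "Example####+||"
blacklisted_chars = ["#", "|", "#", "+"]

# Build a character class from the list; re.escape protects special characters.
pattern = "[" + re.escape("".join(blacklisted_chars)) + "]"
via_regex = re.sub(pattern, "", title)

# The third argument of str.maketrans lists characters to delete.
table = str.maketrans("", "", "".join(blacklisted_chars))
via_translate = title.translate(table)

print(via_regex, via_translate)
```

Both variants leave the original title untouched, since each returns a new string.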
Try this:
title = 'Example####+||'
blacklisted_chars = ['#','|','#','+']
for i in blacklisted_chars:
    title = title.replace(i, '')
print(title)
Explanation: Since you were storing the result of title.replace in the convert variable, every iteration it was being overwritten. What you need is to apply replace to the result of the previous iteration, which can be the variable with the original string or another variable containing a copy of it if you want to keep the original value unchanged.
P.S.: strings are iterables so you can also achieve the same results with something like this:
blacklisted_chars = '#|#+'

Conditionally modify multiple variables

Not quite sure what the correct title should be.
I have a function with 2 inputs: def color_matching(color_old, color_new). This function should check the strings in both arguments and assign a new string to either one if there is a hit.
def color_matching(color_old, color_new):
    if ('<color: none' in color_old):
        color_old = "NoHighlightColor"
    elif ('<color: none' in color_new):
        color_new = "NoHighlightColor"
And so forth. The problem is that each of the arguments can be matched to 1 of 14 different categories ("NoHighlightColor" being one of them). I'm sure there is a better way to do this than repeating the if statement 28 times for each mapping but I'm drawing a blank.
You can at first parse your input arguments, if for example it's something like that:
old_color='<color: none attr:ham>'
you can parse it to get only the value of the relevant attribute you need:
_old_color=old_color.split(':')[1].split()[0]
That way _old_color='none'
Then you can use a dictionary such as {'none':'NoHighlightColor'}; let's call it colors_dict:
old_color=colors_dict.get(_old_color, old_color)
That way, if _old_color exists as a key in the dictionary, old_color will get the value of that key; otherwise, old_color will remain unchanged.
So your final code should look similar to this:
def color_matching(color_old, color_new):
    """Assuming you've predefined colors_dict."""
    # Parse both arguments to get the color names
    _old_color = color_old.split(':')[1].split()[0]
    _new_color = color_new.split(':')[1].split()[0]
    # Check whether the first one is a hit
    _result_color = colors_dict.get(_old_color, None)
    # If it was a hit (not None), assign it to the first argument
    if _result_color:
        color_old = _result_color
    else:
        color_new = colors_dict.get(_new_color, color_new)
    # Return the results; reassigning parameters has no effect on the caller
    return color_old, color_new
You can replace conditionals with a data structure:
def match(color):
    matches = {'<color: none': 'NoHighlightColor', ... }
    for substring, ret in matches.items():
        if substring in color:
            return ret
But you seem to have a problem that requires a proper parser for the format you are trying to recognize.
You might build one from simple string operations like "<color:none jaja:a>".split(':')
You could maybe hack one with a massive regex.
Or use a powerful parser generated by a library like this one
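A self-contained sketch of the data-structure approach, applied to both arguments, might look like the following. Only the 'none' entry comes from the question; the 'yellow' entry and the names COLOR_CATEGORIES and match_color are made up for illustration, and the real table would hold all 14 categories.

```python
# Hypothetical mapping: only the 'none' entry comes from the question.
COLOR_CATEGORIES = {
    "none": "NoHighlightColor",
    "yellow": "YellowHighlight",   # assumed example entry
}


def match_color(color, categories=COLOR_CATEGORIES):
    """Return the category for a '<color: NAME ...' string, or color unchanged."""
    for name, category in categories.items():
        if "<color: " + name in color:
            return category
    return color


def color_matching(color_old, color_new):
    # The same lookup serves both arguments, so no branch is duplicated.
    return match_color(color_old), match_color(color_new)


print(color_matching("<color: none attr:ham>", "<color: yellow>"))
```

Returning a tuple lets the caller rebind both names in one step, e.g. color_old, color_new = color_matching(color_old, color_new).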

Python: Removing characters from a string and then returning it

For example, given a list of strings prices = ["US$200", "CA$80", "GA$500"],
I am trying to only return ["US", "CA", "GA"].
Here is my code - what am I doing wrong?
def get_country_codes(prices):
    prices = ""
    list = prices.split()
    list.remove("$")
    "".join(list)
    return list
Since each of the strings in the prices argument has the form '[country_code]$[number]', you can split each of them on '$' and take the first part.
Here's an example of how you can do this:
def get_country_codes(prices):
    return [p.split('$')[0] for p in prices]
So get_country_codes(['US$200', 'CA$80', 'GA$500']) returns ['US', 'CA', 'GA'].
Also, as a side note, I would recommend against naming a variable list, as this shadows the built-in name list, which is the list type itself.
There are multiple problems with your code, and you have to fix all of them to make it work:
def get_country_codes(prices):
    prices = ""
Whatever value your caller passed in, you're throwing that away and replacing it with "". You don't want to do that, so just get rid of that last line.
list = prices.split()
You really shouldn't be calling this list list. Also, split with no argument splits on spaces, so what you get may not be what you want:
>>> "US$200, CA$80, GA$500".split()
['US$200,', 'CA$80,', 'GA$500']
I suppose you can get away with having those stray commas, since you're just going to throw them away. But it's better to split with your actual separators, the ', '. So, let's change that line:
prices = prices.split(", ")
list.remove("$")
This removes every value in the list that's equal to the string "$". There are no such values, so it does nothing.
More generally, you don't want to throw away any of the strings in the list. Instead, you want to replace the strings, with strings that are truncated at the $. So, you need a loop:
countries = []
for price in prices:
    country, dollar, price = price.partition('$')
    countries.append(country)
If you're familiar with list comprehensions, you can rewrite this as a one-liner:
countries = [price.partition('$')[0] for price in prices]
"".join(list)
This just creates a new string and then throws it away. You have to assign it to something if you want to use it, like this:
result = "".join(countries)
But… do you really want to join anything here? It sounds like you want the result to be a list of strings, ['US', 'CA', 'GA'], not one big string 'USCAGA', right? So, just get rid of this line.
return list
Just change the variable name to countries and you're done.
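Putting those steps together, a sketch of the finished function might look like this (assuming prices is already a list of strings, as in the question, rather than one comma-separated string):

```python
def get_country_codes(prices):
    """Return the part of each price string that precedes the first '$'."""
    countries = []
    for price in prices:
        # partition returns (before, separator, after); keep only `before`.
        country, _, _ = price.partition("$")
        countries.append(country)
    return countries


codes = get_country_codes(["US$200", "CA$80", "GA$500"])
print(codes)
```

If a string contains no '$', partition returns the whole string as the first element, so such entries pass through unchanged.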
Since your data is structured so that the first two characters are the country code, you can use simple string slicing.
def get_country_codes(prices):
    return [p[:2] for p in prices]
You call the function passing the prices parameter, but your first line reinitializes it to an empty string:
prices = ''
I would also suggest using the '$' character as the split character, like:
list = prices.split('$')
Try something like this, which works for a single price string (apply it to each element if prices is a list):
def get_country_codes(prices):
    parts = prices.split('$')
    return parts[0]

Simplifying a list into categories

I am a new Python developer and was wondering if someone can help me with this. I have a dataset that has one column that describes a company type. I noticed that the column has, for example, surgical and surgery listed. It also has eyewear, eyeglasses and optometry listed. So instead of having a huge list in this column, I want to simplify the categories: if a value contains "eye," "glasses" or "opto", then just change it to "eyewear." My initial code looks like this:
def map_company(row):
    company = row['SIC_Desc']
    if company in 'Surgical':
        return 'Surgical'
    elif company in ['Eye', 'glasses', 'opthal', 'spectacles', 'optometers']:
        return 'Eyewear'
    elif company in ['Cotton', 'Bandages', 'gauze', 'tape']:
        return 'First Aid'
    elif company in ['Dental', 'Denture']:
        return 'Dental'
    elif company in ['Wheelchairs', 'Walkers', 'braces', 'crutches', 'ortho']:
        return 'Mobility equipments'
    else:
        return 'Other'
df['SIC_Desc'] = df.apply(map_company,axis=1)
This is not correct though because it is changing every item into "Other," so clearly my syntax is wrong. Can someone please help me simplify this column that I am trying to relabel?
Thank you
It is hard to answer without having the exact content of your data set, but I can see one mistake. According to your description, it seems you are looking at this the wrong way. You want one of the words to be in your company description, so it should look like that:
if any(test in company for test in ['Eye', 'glasses', 'opthal', 'spectacles', 'optometers']):
However, you might have a case issue here, so I would recommend:
company = row['SIC_Desc'].lower()
if any(test.lower() in company for test in ['Eye', 'glasses', 'opthal', 'spectacles', 'optometers']):
    return 'Eyewear'
You will also need to make sure company is a string and 'SIC_Desc' is a correct column name.
In the end your function will look like that:
def is_match(company,names):
    return any(name in company for name in names)
def map_company(row):
    company = row['SIC_Desc'].lower()
    if 'surgical' in company:
        return 'Surgical'
    elif is_match(company,['eye','glasses','opthal','spectacles','optometers']):
        return 'Eyewear'
    elif is_match(company,['cotton', 'bandages', 'gauze', 'tape']):
        return 'First Aid'
    else:
        return 'Other'
Here is an option using a reversed dictionary.
Code
import pandas as pd
# Sample DataFrame
s = pd.Series(["gauze", "opthal", "tape", "surgical", "eye", "spectacles",
               "glasses", "optometers", "bandages", "cotton", "glue"])
df = pd.DataFrame({"SIC_Desc": s})
df
LOOKUP = {
    "Eyewear": ["eye", "glasses", "opthal", "spectacles", "optometers"],
    "First Aid": ["cotton", "bandages", "gauze", "tape"],
    "Surgical": ["surgical"],
    "Dental": ["dental", "denture"],
    "Mobility": ["wheelchairs", "walkers", "braces", "crutches", "ortho"],
}
REVERSE_LOOKUP = {v:k for k, lst in LOOKUP.items() for v in lst}
def map_company(row):
    company = row["SIC_Desc"].lower()
    return REVERSE_LOOKUP.get(company, "Other")
df["SIC_Desc"] = df.apply(map_company, axis=1)
df
Details
We define a LOOKUP dictionary with (key, value) pairs of expected output and associated words, respectively. Note, the values are lowercase to simplify searching. Then we use a reversed dictionary to automatically invert the key value pairs and improve the search performance, e.g.:
>>> REVERSE_LOOKUP
{'bandages': 'First Aid',
'cotton': 'First Aid',
'eye': 'Eyewear',
'gauze': 'First Aid',
...}
Notice these reference dictionaries are created outside the mapping function to avoid rebuilding dictionaries for every call to map_company(). Finally the mapping function quickly returns the desired output using the reversed dictionary by calling .get(), a method that returns the default argument "Other" if no entry is found.
See @Flynsee's insightful answer for an explanation of what is happening in your code. The code here is cleaner compared to a bevy of conditional statements.
Benefits
Since we have used dictionaries, the search time should be relatively fast: O(1), compared to O(n) when using in on a list. Moreover, the main LOOKUP dictionary is easy to extend, with no need to write additional conditional statements for new entries.
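Note that the reversed dictionary matches whole cell values exactly. If the descriptions are longer phrases, as the question suggests, a substring variant of the same LOOKUP idea could be sketched as follows. The function name categorize and the sample descriptions are made up for illustration, and short keywords such as "tape" can match unintended words, so treat this as a starting point only.

```python
LOOKUP = {
    "Eyewear": ["eye", "glasses", "opthal", "spectacles", "optometers"],
    "First Aid": ["cotton", "bandages", "gauze", "tape"],
    "Surgical": ["surgical"],
    "Dental": ["dental", "denture"],
    "Mobility": ["wheelchairs", "walkers", "braces", "crutches", "ortho"],
}


def categorize(description, lookup=LOOKUP, default="Other"):
    """Return the first category with a keyword found inside description."""
    text = description.lower()
    for category, keywords in lookup.items():
        if any(keyword in text for keyword in keywords):
            return category
    return default


# Made-up sample descriptions for illustration.
samples = ["Surgical instruments", "Eyeglasses and frames",
           "Gauze pads", "Office glue"]
labels = [categorize(s) for s in samples]
print(labels)
```

With pandas, the function could be applied to the column directly, e.g. df["SIC_Desc"] = df["SIC_Desc"].map(categorize). This trades the O(1) lookup for a scan over all keywords, which is usually acceptable for a handful of categories.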
