I have a dataframe df which contains company names that I need to format neatly. The names are already in titlecase:
Company Name
0 Visa Inc
1 Msci Inc
2 Coca Cola Inc
3 Pnc Bank
4 Aig Corp
5 Td Ameritrade
6 Uber Inc
7 Costco Inc
8 New York Times
Since many of the companies go by an acronym or an abbreviation (rows 1, 3, 4, 5), I want only the first string in those company names to be uppercase, like so:
Company Name
0 Visa Inc
1 MSCI Inc
2 Coca Cola Inc
3 PNC Bank
4 AIG Corp
5 TD Ameritrade
6 Uber Inc
7 Costco Inc
8 New York Times
I know I can't get 100% accurate replacement, but I believe I can get close by uppercasing only the first string if:
it's 4 or fewer characters
and the first string is not a word in the dictionary
How can I achieve this with something like: df['Company Name'] = df['Company Name'].replace()?
You can use the enchant module to find out whether a word is in the dictionary, though you are still going to have some off results (e.g. Uber).
Here is the code I came up with:
import enchant
import pandas as pd

def main():
    d = enchant.Dict("en_US")
    list_of_companies = ['Msci Inc',
                         'Coca Cola Inc',
                         'Pnc Bank',
                         'Aig Corp',
                         'Td Ameritrade',
                         'Uber Inc',
                         'Costco Inc',
                         'New York Times']
    dataframe = pd.DataFrame(list_of_companies, columns=['Company Name'])
    for index, row in dataframe.iterrows():
        words = row['Company Name'].split()
        if not d.check(words[0]):
            # write back with .at -- mutating the row returned by iterrows is unreliable
            dataframe.at[index, 'Company Name'] = words[0].upper() + ' ' + ' '.join(words[1:])
    print(dataframe)

if __name__ == '__main__':
    main()
Output for this was:
Company Name
0 MSCI Inc
1 Coca Cola Inc
2 PNC Bank
3 AIG Corp
4 TD Ameritrade
5 UBER Inc
6 Costco Inc
7 New York Times
Here's a working solution which makes use of an English word list. It's not accurate for Td and Uber, but as you said, this will be hard to get 100% accurate.
url = 'https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt'
words = set(pd.read_csv(url, header=None)[0])
w1 = df['Company Name'].str.split()
m1 = ~w1.str[0].str.lower().isin(words)  # first word is not an English word
m2 = w1.str[0].str.len().le(4)           # first word is <= 4 characters
df.loc[m1 & m2, 'Company Name'] = w1.str[0].str.upper() + ' ' + w1.str[1:].str.join(' ')
Company Name
0 Visa Inc
1 MSCI Inc
2 Coca Cola Inc
3 PNC Bank
4 AIG Corp
5 Td Ameritrade
6 UBER Inc
7 Costco Inc
8 New York Times
Note: I also tried this with the nltk package, but apparently the nltk.corpus.words corpus is far from a complete list of English words.
You can first separate the first word from the rest of the name, then filter those first words based on your logic:
company_list = ['Visa']
s = df['Company Name'].str.extract(r'^(\S+)(.*)')
mask = s[0].str.len().le(4) & (~s[0].isin(company_list))
df['Company Name'] = s[0].mask(mask, s[0].str.upper()) + s[1]
Output (notice that NEW in New York gets changed as well):
Company Name
0 Visa Inc
1 MSCI Inc
2 COCA Cola Inc
3 PNC Bank
4 AIG Corp
5 TD Ameritrade
6 UBER Inc
7 Costco Inc
8 NEW York Times
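For reference, Series.mask replaces values where the condition is True, which is what flips only the masked names to uppercase. A minimal sketch with two made-up names:

```python
import pandas as pd

s = pd.Series(['Pnc', 'Visa'])
m = s.str.len().le(3)  # True for 'Pnc', False for 'Visa'

# mask replaces entries where m is True with the corresponding
# value from the second argument, leaving the rest untouched
print(s.mask(m, s.str.upper()).tolist())  # ['PNC', 'Visa']
```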
This gets the first word from the string and uppercases it only for those company names that are included in the include list:
import pandas as pd
company_names = ['Msci Inc', 'Coca Cola Ins', 'Pnc Bank', 'Visa Inc']
include = ['Msci', 'Pnc']
df = pd.DataFrame(company_names, columns=['Company Name'])
df['Company Name'] = df['Company Name'].apply(
    lambda x: x.split()[0].upper() + x[len(x.split()[0]):] if x.split()[0] in include else x)
df['Company Name']
df['Company Name']
Output:
0 MSCI Inc
1 Coca Cola Ins
2 PNC Bank
3 Visa Inc
Name: Company Name, dtype: object
A manual workaround could be appending words like "uber" to the word list:
from nltk.corpus import words
dict_words = words.words()
dict_words.append('uber')
Then create the new column:
df.apply(lambda x: x['Company Name'].replace(x['Company Name'].split(" ")[0].strip(),
                                             x['Company Name'].split(" ")[0].strip().upper())
         if len(x['Company Name'].split(" ")[0].strip()) <= 4
            and x['Company Name'].split(" ")[0].strip().lower() not in dict_words
         else x['Company Name'], axis=1)
Output:
0 Visa Inc
1 Msci Inc
2 Coca Cola Inc
3 PNC Bank
4 AIG Corp
5 TD Ameritrade
6 Uber Inc
7 Costco Inc
8 New York Times
Download the nltk word corpus by running:
import nltk
nltk.download()
Demo:
from nltk.corpus import words
"new" in words.words()
Output:
False
I'm reading from a sqlite3 db into a df:
id symbol name
0 1 QCLR Global X Funds Global X NASDAQ 100 Collar 95-1...
1 2 LCW Learn CW Investment Corporation
2 3 BUG Global X Funds Global X Cybersecurity ETF
3 4 LDOS Leidos Holdings, Inc.
4 5 LDP COHEN & STEERS LIMITED DURATION PREFERRED AND ...
... ... ... ...
10999 11000 ERIC Ericsson American Depositary Shares
11000 11001 EDI Virtus Stone Harbor Emerging Markets Total Inc...
11001 11002 EVX VanEck Environmental Services ETF
11002 11003 QCLN First Trust NASDAQ Clean Edge Green Energy Ind...
11003 11004 DTB DTE Energy Company 2020 Series G 4.375% Junior...
[11004 rows x 3 columns]
Then I have a symbols.csv file which I want to use to filter the above df:
AKAM
AKRO
Here's how I've tried to do it:
origin_symbols = pd.read_sql_query("SELECT id, symbol, name from stock", conn)
mikey_symbols = pd.read_csv("symbols.csv")
df = origin_symbols[origin_symbols['symbol'].isin(mikey_symbols)]
But for some reason I only get the first line returned from the csv:
id symbol name
6475 6476 AKAM Akamai Technologies, Inc. Common Stock
Where am I going wrong here?
You need to convert the csv file to a Series. Here a column name is added and then that column is selected (e.g. by position):
mikey_symbols = pd.read_csv("symbols.csv", names=['tmp']).iloc[:, 0]
#or by column name
#mikey_symbols = pd.read_csv("symbols.csv", names=['tmp'])['tmp']
Then remove possible trailing spaces in both with Series.str.strip:
df = origin_symbols[origin_symbols['symbol'].str.strip().isin(mikey_symbols.str.strip())]
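As a side note, the reason only AKAM matched is worth demonstrating. A minimal sketch using an in-memory stand-in for symbols.csv: without header=None or names=, read_csv consumes the first symbol as the column header, and iterating a DataFrame yields its column names, which is exactly what isin received.

```python
import io
import pandas as pd

# Stand-in for symbols.csv (hypothetical, two symbols)
csv_data = "AKAM\nAKRO\n"

# Without header=None, the first line becomes the column header,
# so 'AKAM' is a column name rather than data.
bad = pd.read_csv(io.StringIO(csv_data))
print(list(bad.columns))  # ['AKAM']

# Iterating a DataFrame yields its column names -- this is what
# Series.isin iterated over, so only 'AKAM' could ever match.
print(list(bad))          # ['AKAM']

# With names= (or header=None) both symbols survive as data:
good = pd.read_csv(io.StringIO(csv_data), names=['tmp'])['tmp']
print(good.tolist())      # ['AKAM', 'AKRO']
```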
This is not the best approach, but it's what I have so far.
I have this example df:
df = pd.DataFrame({
'City': ['I lived Los Angeles', 'I visited London and Toronto','the best one is Toronto', 'business hub is in New York',' Mexico city is stunning']
})
df
gives:
City
0 I lived Los Angeles
1 I visited London and Toronto
2 the best one is Toronto
3 business hub is in New York
4 Mexico city is stunning
I am trying to match (case-insensitively) city names from a nested dict and create a new column per country with int values for statistical purposes.
So, here is my nested dict as a reference for countries and cities:
country = { 'US': ['New York','Los Angeles','San Diego'],
'CA': ['Montreal','Toronto','Manitoba'],
'UK': ['London','Liverpool','Manchester']
}
and I created a function that should look for the city in the df, match it against the dict, and then create a column with the country name:
def get_country(x):
    count = 0
    for k, v in country.items():
        for y in v:
            if y.lower() in x:
                df[k] = count + 1
            else:
                return None
Then I applied it to df:
df.City.apply(lambda x: get_country(x.lower()))
I got the following output:
City US
0 I lived Los Angeles 1
1 I visited London and Toronto 1
2 the best one is Toronto 1
3 business hub is in New York 1
4 Mexico city is stunning 1
Expected output:
City US CA UK
0 I lived Los Angeles 1 0 0
1 I visited London and Toronto 0 1 1
2 the best one is Toronto 0 1 0
3 business hub is in New York 1 0 0
4 Mexico city is stunning 0 0 0
Here is a solution based on your function. I changed the variable names to be more readable and easier to follow.
df = pd.DataFrame({
    'City': ['I lived Los Angeles',
             'I visited London and Toronto',
             'the best one is Toronto',
             'business hub is in New York',
             ' Mexico city is stunning']
})

country_cities = {
    'US': ['New York', 'Los Angeles', 'San Diego'],
    'CA': ['Montreal', 'Toronto', 'Manitoba'],
    'UK': ['London', 'Liverpool', 'Manchester']
}

def get_country(text):
    text = text.lower()
    country_counts = dict.fromkeys(country_cities, 0)
    for country, cities in country_cities.items():
        for city in cities:
            if city.lower() in text:
                country_counts[country] += 1
    return pd.Series(country_counts)

df = df.join(df.City.apply(get_country))
Output:
City US CA UK
0 I lived Los Angeles 1 0 0
1 I visited London and Toronto 0 1 1
2 the best one is Toronto 0 1 0
3 business hub is in New York 1 0 0
4 Mexico city is stunning 0 0 0
Solution based on Series.str.count
A simpler solution is using Series.str.count to count the occurrences of the regex pattern city1|city2|etc for each country (the pattern matches city1 or city2 or etc). Using the same setup as above:
country_patterns = {country: '|'.join(cities) for country, cities in country_cities.items()}
for country, pat in country_patterns.items():
df[country] = df['City'].str.count(pat)
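One caveat with str.count, shown with a made-up sentence: it counts every occurrence, so a city mentioned twice yields 2. If you want a 0/1 indicator instead, clip the counts:

```python
import pandas as pd

s = pd.Series(['Toronto loves Toronto'])
pat = 'Montreal|Toronto|Manitoba'

# str.count counts every non-overlapping match of the pattern
print(s.str.count(pat).tolist())                # [2]

# clip caps the counts at 1 to get an indicator instead
print(s.str.count(pat).clip(upper=1).tolist())  # [1]
```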
Why doesn't your solution work?
if y.lower() in x:
    df[k] = count + 1
else:
    return None
The reason your function doesn't produce the right output is that you return None as soon as a city is not found in the text: the remaining countries and cities are never checked, because the return statement immediately exits the function.
What is also happening is that only US cities are checked, and the line df[k] = count + 1 (in this case k = 'US') creates an entire column named k filled with the value 1. It does not create a single value for that row; it creates or modifies the full column. When using apply you want to change a single row or value (the input of the function), so don't modify the main DataFrame directly inside the function.
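A minimal sketch of that pitfall, using a hypothetical three-row frame: assigning a scalar to df[k] inside the applied function fills the whole column on every call.

```python
import pandas as pd

df = pd.DataFrame({'City': ['a', 'b', 'c']})

def bad(x):
    # This does NOT set a single cell: it creates/overwrites
    # the entire 'US' column with the scalar 1.
    df['US'] = 1
    return None

df['City'].apply(bad)
print(df['US'].tolist())  # [1, 1, 1]
```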
You can achieve this result using a lambda function to check if any city for each country is contained in the string, after first lower-casing the city names in country:
cl = { k : list(map(str.lower, v)) for k, v in country.items() }
for ctry, cities in cl.items():
df[ctry] = df['City'].apply(lambda s:any(c in s.lower() for c in cities)).astype(int)
Output:
City US CA UK
0 I lived Los Angeles 1 0 0
1 I visited London and Toronto 0 1 1
2 the best one is Toronto 0 1 0
3 business hub is in New York 1 0 0
4 Mexico city is stunning 0 0 0
Can anyone help me get the name of the CurrencySymbol which has the highest count?
filt = df['Country'] == 'India'
df.loc[filt]['CurrencySymbol'].value_counts()
INR    4918
USD     133
AED      16
AUD      11
EUR       8
AMD       7
AFN       5
CAD       2
RON       2
AWG       2
JPY       2
ARS       2
AOA       2
When I tried this:
df.loc[filt]['CurrencySymbol'].value_counts().max()
it returns 4918, but I want it to return INR.
Because value_counts already returns the counts in descending order, just take the first element of its index:
df.loc[filt]['CurrencySymbol'].value_counts().index[0]
Eg, for this trivial example:
df = pd.DataFrame([{'str': 'free peoples of middle earth'}])
df['str'].value_counts().index[0]
>>> 'free peoples of middle earth'
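If you'd rather not rely on the ordering, Series.idxmax on the counts returns the index label of the largest value directly. A small sketch with made-up data:

```python
import pandas as pd

s = pd.Series(['INR', 'INR', 'USD', 'INR', 'AED'])

# idxmax returns the index label of the maximum count,
# i.e. the most frequent value
print(s.value_counts().idxmax())  # INR
```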
I'm trying to create a simplified name column. I have a brand name column and a list of strings, as shown below. If a brand name contains any string from the list, then create a simplified brand name with the matched string removed. Brand names that do not contain any strings from the list are carried over to the simplified column unchanged.
l = ['co', 'ltd', 'company']
df:
Brand
Nike
Adidas co
Apple company
Intel
Google ltd
Walmart co
Burger King
Desired df:
Brand Simplified
Nike Nike
Adidas co Adidas
Apple company Apple
Intel Intel
Google ltd Google
Walmart co Walmart
Burger King Burger King
Thanks in advance! Any help is appreciated!!
How about using this to remove the substrings and any leftover leading whitespace:
list_substring = ['ltd', 'company', 'co'] # 'company' will be evaluated first before 'co'
df['Simplified'] = df['Brand'].str.replace('|'.join(list_substring), '', regex=True).str.lstrip()
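One caveat worth noting: without word boundaries, the pattern also matches inside other words. A hypothetical brand shows the effect:

```python
import re

pattern = '|'.join(['ltd', 'company', 'co'])

# 'co' also matches inside 'Costco', mangling the name
print(re.sub(pattern, '', 'Costco co').strip())  # Cost
```

The word-boundary version further down (\b...\b) avoids this.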
In [28]: df
Out[28]:
Brand
0 Nike
1 Adidas co
2 Apple company
3 Intel
4 Google ltd
5 Walmart co
6 Burger King
In [30]: df["Simplified"] = df.Brand.apply(lambda x: x.split()[0] if x.split()[-1] in l else x)
In [31]: df
Out[31]:
Brand Simplified
0 Nike Nike
1 Adidas co Adidas
2 Apple company Apple
3 Intel Intel
4 Google ltd Google
5 Walmart co Walmart
6 Burger King Burger King
Using str.replace
Ex:
l = ['co', 'ltd', 'company']
df = pd.DataFrame({'Brand': ['Nike', 'Adidas co', 'Apple company', 'Intel', 'Google ltd', 'Walmart co', 'Burger King']})
df['Simplified'] = df['Brand'].str.replace(r"\b(" + "|".join(l) + r")\b", "", regex=True).str.strip()
#or df['Brand'].str.replace(r"\b(" + "|".join(l) + r")\b$", "", regex=True).str.strip()  # to remove only at the END of the string
print(df)
Output:
Brand Simplified
0 Nike Nike
1 Adidas co Adidas
2 Apple company Apple
3 Intel Intel
4 Google ltd Google
5 Walmart co Walmart
6 Burger King Burger King
df = {"Brand": ["Nike", "Adidas co", "Apple company", "Google ltd", "Burger King"]}
df = pd.DataFrame(df)
list_items = ['ltd', 'company', 'co'] # 'company' will be evaluated first before 'co'
df['Simplified'] = [' '.join(w) for w in df['Brand'].str.split().apply(lambda x: [i for i in x if i not in list_items])]
I have a dataframe that contains misspelled words and abbreviations, like this.
input:
df = pd.DataFrame(['swtch', 'cola', 'FBI',
'smsng', 'BCA', 'MIB'], columns=['misspelled'])
output:
misspelled
0 swtch
1 cola
2 FBI
3 smsng
4 BCA
5 MIB
I need to correct the misspelled words and the abbreviations.
I have tried creating a dictionary such as:
input:
dicts = pd.DataFrame(['coca cola', 'Federal Bureau of Investigation',
'samsung', 'Bank Central Asia', 'switch', 'Men In Black'], columns=['words'])
output:
words
0 coca cola
1 Federal Bureau of Investigation
2 samsung
3 Bank Central Asia
4 switch
5 Men In Black
and applying this code:
import difflib
import numpy as np

x = [next(iter(m), np.nan) for m in map(lambda w: difflib.get_close_matches(w, dicts.words), df.misspelled)]
df['fix'] = x
print(df)
The output shows I have succeeded in correcting the misspellings but not the abbreviations:
misspelled fix
0 swtch switch
1 cola coca cola
2 FBI NaN
3 smsng samsung
4 BCA NaN
5 MIB NaN
Please help.
How about following a two-pronged approach: first correct the misspellings, and then expand the abbreviations.
df = pd.DataFrame(['swtch', 'cola', 'FBI', 'smsng', 'BCA', 'MIB'], columns=['misspelled'])
abbreviations = {
'FBI': 'Federal Bureau of Investigation',
'BCA': 'Bank Central Asia',
'MIB': 'Men In Black',
'cola': 'Coca Cola'
}
from spellchecker import SpellChecker

spell = SpellChecker()
df['fixed'] = df['misspelled'].apply(spell.correction).replace(abbreviations)
Result:
misspelled fixed
0 swtch switch
1 cola Coca Cola
2 FBI Federal Bureau of Investigation
3 smsng among
4 BCA Bank Central Asia
5 MIB Men In Black
I use pyspellchecker here, but you can go with any spell-checking library. It corrected smsng to among, but that is a caveat of automatic spelling correction; different libraries may give different results.
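If you'd rather stay with difflib, here is a sketch of the same two-pronged idea (the word list and abbreviation map below are assumptions mirroring the question): route all-caps tokens through the abbreviation map, and fuzzy-match everything else.

```python
import difflib

# Hypothetical reference data, mirroring the question's dicts
dictionary = ['coca cola', 'samsung', 'switch']
abbreviations = {'FBI': 'Federal Bureau of Investigation',
                 'BCA': 'Bank Central Asia',
                 'MIB': 'Men In Black'}

def fix(word):
    # All-caps tokens are treated as abbreviations and looked up directly;
    # anything else is fuzzy-matched against the word list.
    if word.isupper():
        return abbreviations.get(word, word)
    matches = difflib.get_close_matches(word, dictionary)
    return matches[0] if matches else word

print([fix(w) for w in ['swtch', 'FBI', 'smsng']])
# ['switch', 'Federal Bureau of Investigation', 'samsung']
```

This sidesteps the spell checker's among mistake, at the cost of maintaining the abbreviation map by hand.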