Getting a list of suffixes from the company names - Python

I have a data frame df with a column named Company. A few examples of the company names are: ABC Inc., XYZ Gmbh, PQR Ltd, JKL Limited, etc. I want a list of all the suffixes (Inc., Gmbh, Ltd, Limited, etc.). Please note that the suffix length varies. There might be companies without any suffix, for example: Apple. I need a complete list of all suffixes from all the company names, keeping only unique suffixes in the list.
How do I accomplish this task?

try this:
In [36]: df
Out[36]:
Company
0 Google
1 Apple Inc
2 Microsoft Inc
3 ABC Inc.
4 XYZ Gmbh
5 PQR Ltd
6 JKL Limited
In [37]: df.Company.str.extract(r'\s+([^\s]+$)', expand=False).dropna().unique()
Out[37]: array(['Inc', 'Inc.', 'Gmbh', 'Ltd', 'Limited'], dtype=object)
or ignoring punctuation:
In [38]: import string
In [39]: df.Company.str.replace('[' + string.punctuation + ']+', '', regex=True)
Out[39]:
0 Google
1 Apple Inc
2 Microsoft Inc
3 ABC Inc
4 XYZ Gmbh
5 PQR Ltd
6 JKL Limited
Name: Company, dtype: object
In [40]: df.Company.str.replace('[' + string.punctuation + ']+', '', regex=True).str.extract(r'\s+([^\s]+$)', expand=False).dropna().unique()
Out[40]: array(['Inc', 'Gmbh', 'Ltd', 'Limited'], dtype=object)
export result into Excel file:
data = df.Company.str.replace('[' + string.punctuation + ']+', '', regex=True).str.extract(r'\s+([^\s]+$)', expand=False).dropna().unique()
res = pd.DataFrame(data, columns=['Comp_suffix'])
res.to_excel(r'/path/to/file.xlsx', index=False)

You can use the cleanco Python library for that; it has a list of all possible suffixes inside. E.g., it contains all the examples you provided (Inc, Gmbh, Ltd, Limited).
So you can take the suffixes from the library and use them as a dictionary to search in your data, e.g.:
import pandas as pd
company_names = pd.Series(["Apple", "ABS LLC", "Animusoft Corp", "A GMBH"])
suffixes = ["llc", "corp", "abc"] # take from cleanco source code
found = [any(company_names.map(lambda x: x.lower().endswith(' ' + suffix))) for suffix in suffixes]
suffixes_found = [suffix for (suffix, suffix_found) in zip(suffixes, found) if suffix_found]
print(suffixes_found)  # outputs ['llc', 'corp']
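If you'd rather let the library do the stripping itself, here is a hedged sketch; it assumes a cleanco 2.x-style API where a basename helper strips known suffixes (older releases exposed a different interface), so check the version you have installed:
# assumes cleanco >= 2.x exposing `basename`; verify against your installed version
from cleanco import basename

names = ["Apple", "ABS LLC", "Animusoft Corp", "A GMBH"]

# whatever basename() strips from the end of a name is its suffix
suffixes = {name[len(basename(name)):].strip() for name in names}
suffixes.discard("")  # names without a suffix contribute an empty string
print(suffixes)  # expected: something like {'LLC', 'Corp', 'GMBH'}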

So you want the last word of the company name, assuming the name is more than one word long?
set(name_list[-1] for name_list in map(str.split, company_names) if len(name_list) > 1)
The [-1] gets the last word. str.split splits on spaces. I've never used pandas, so getting company_names might be the hard part of this.
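For the pandas part, pulling the column out as a plain list is one line; a minimal sketch assuming the df from the question:
import pandas as pd

df = pd.DataFrame({"Company": ["Google", "Apple Inc", "ABC Inc.", "XYZ Gmbh"]})
company_names = df["Company"].tolist()

# last word of every multi-word name, de-duplicated
suffixes = set(parts[-1] for parts in map(str.split, company_names) if len(parts) > 1)
print(suffixes)  # {'Inc', 'Inc.', 'Gmbh'}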

This only adds a suffix when the company name has more than one word, as you required.
company_names = ["Apple", "ABS LLC", "Animusoft Corp"]
suffixes = [name.split()[-1] for name in company_names if len(name.split()) > 1]
Note that this doesn't yet take the uniqueness requirement into account. It also doesn't handle a company named something like "Be Smart", where "Smart" is part of the name rather than a suffix. The following, however, takes care of uniqueness:
company_names = ["Apple", "ABS LLC", "Animusoft Corp", "BBC Corp"]
suffixes = []
for name in company_names:
    if len(name.split()) > 1 and name.split()[-1] not in suffixes:
        suffixes.append(name.split()[-1])
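To also address the "Be Smart" caveat, one option is to validate the last word against a whitelist of known legal suffixes; the list below is a made-up stand-in (in practice you might take it from cleanco, as suggested in another answer):
# assumed whitelist -- in practice, source this from cleanco or similar
KNOWN_SUFFIXES = {"llc", "corp", "inc", "ltd", "limited", "gmbh"}

company_names = ["Apple", "ABS LLC", "Animusoft Corp", "Be Smart"]
suffixes = []
for name in company_names:
    words = name.split()
    if len(words) > 1 and words[-1].lower() in KNOWN_SUFFIXES and words[-1] not in suffixes:
        suffixes.append(words[-1])
print(suffixes)  # ['LLC', 'Corp'] -- 'Smart' is filtered out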

Related

See if object from one dataframe appears in other dataframe, when one has numbers added (e.g. string, string1)

I have two dataframes with actor names (dtype object) that look like the following:
df = pd.DataFrame({'Actors': ['Christian Bale', 'Ben Kingsley', 'Halley Bailey', 'Aaron Paul', ...]})
df2 = pd.read_csv(...)  # its 'id' column holds values like ['Halley Bailey - 1998', 'Coco Jones – 1998', ...]
Normally I would use the following code to find whether one item is present in another dataframe, but due to the numbers in the second dataframe I get 0 matches. Is there any smart way around this?
df.assign(indf=df.Actors.isin(df_actor_list.id).astype(int))
The code above obviously did not work.
You can extract the actor names from df2['id'] and check if df['Actors'] is in it:
df.assign(indf=df['Actors'].isin(df2['id'].str.extract(r'(.*)(?=\s[-–])',
                                                       expand=False)).astype(int))
output:
Actors indf
0 Christian Bale 0
1 Ben Kingsley 0
2 Halley Bailey 1
3 Aaron Paul 0
Another, more generic, approach relying on a regex:
import re
regex = '|'.join(map(re.escape, df['Actors']))
# 'Christian\\ Bale|Ben\\ Kingsley|Halley\\ Bailey|Aaron\\ Paul'
actors = df2['id'].str.extract(f'({regex})', expand=False).dropna()
df.assign(indf=df['Actors'].isin(actors).astype(int))
used inputs:
df = pd.DataFrame({'Actors': ['Christian Bale', 'Ben Kingsley', 'Halley Bailey', 'Aaron Paul']})
df2 = pd.DataFrame({'id': ['Halley Bailey - 1998', 'Coco Jones – 1998']})

How to split two joined first names into two different words in Python

I am trying to split misspelled first names. Most of them are joined together. I was wondering if there is any way to separate two first names that are joined into two different words.
For example, if the misspelled name is trujillohernandez then to be separated to trujillo hernandez.
I am trying to create a function that can do this for a whole column with thousands of misspelled names like the example above. However, I haven't been successful. Spell-checker libraries do not work, given that these are Hispanic first names.
I would be really grateful if you can help to develop some sort of function to make it happen.
As noted in the comments above, not having a list of possible names will cause problems. However, perhaps not perfect, but to offer something, try...
Given a dataframe example like...
Name
0 sofíagomez
1 isabelladelgado
2 luisvazquez
3 juanhernandez
4 valentinatrujillo
5 camilagutierrez
6 joséramos
7 carlossantana
Code (Python):
import pandas as pd
import requests
# longest list of hispanic surnames I could find in a table
url = r'https://namecensus.com/data/hispanic.html'
# download the table into a frame and clean up the header
page = requests.get(url)
table = pd.read_html(page.text.replace('<br />',' '))
df = table[0]
df.columns = df.iloc[0]
df = df[1:]
# move the frame of surnames to a list
last_names = df['Last name / Surname'].tolist()
last_names = [each_string.lower() for each_string in last_names]
# create a test dataframe of joined firstnames and lastnames
data = {'Name' : ['sofíagomez', 'isabelladelgado', 'luisvazquez', 'juanhernandez', 'valentinatrujillo', 'camilagutierrez', 'joséramos', 'carlossantana']}
df = pd.DataFrame(data, columns=['Name'])
# create new columns for the matched names
lastname = '({})'.format('|'.join(last_names))
df['Firstname'] = df.Name.str.replace(str(lastname)+'$', '', regex=True).fillna('--not found--')
df['Lastname'] = df.Name.str.extract(str(lastname)+'$', expand=False).fillna('--not found--')
# output the dataframe
print('\n\n')
print(df)
Outputs:
Name Firstname Lastname
0 sofíagomez sofía gomez
1 isabelladelgado isabella delgado
2 luisvazquez luis vazquez
3 juanhernandez juan hernandez
4 valentinatrujillo valentina trujillo
5 camilagutierrez camila gutierrez
6 joséramos josé ramos
7 carlossantana carlos santana
Further cleanup may be required, but perhaps this gets the majority of names split; one such cleanup is sketched below.
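One likely source of leftover misses is accents: the scraped surname list may well be unaccented while names like joséramos carry diacritics (an assumption about that particular source). A sketch of normalizing both sides with the standard library before matching:
import unicodedata

def strip_accents(s: str) -> str:
    # decompose characters and drop combining marks, so 'josé' matches 'jose'
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

# apply to both sides before building the regex, e.g.:
# last_names = [strip_accents(n) for n in last_names]
# df['Name'] = df['Name'].map(strip_accents)
print(strip_accents('joséramos'))  # joseramos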

Downloading key ratios for various tickers with the Python library FundamentalAnalysis

I am trying to download key financial ratios from Yahoo Finance via the FundamentalAnalysis library. It's pretty easy for a single ticker, but I have a df with tickers and names:
Ticker Company
0 A Agilent Technologies Inc.
1 AA ALCOA CORPORATION
2 AAC AAC Holdings Inc
3 AAL AMERICAN AIRLINES GROUP INC
4 AAME Atlantic American Corp.
I then tried to use a for-loop to download the ratios for every ticker with fa.ratios().
for i in range(3):
    i = 0
    i = i + 1
    Ratios = fa.ratios(tickers["Ticker"][i])
So basically it shall download all ratios for one ticker, then the second, and so on. I also tried to change the df into a list, but that didn't work either. If I put them into a list manually like:
Symbol = ["TSLA" , "AAPL" , "MSFT"]
it works somehow. But as I want to work with data from 1000+ tickers, I don't want to type all of them into a list manually.
Maybe this question has already been answered elsewhere, in that case sorry, but I've not been able to find a thread that helps me. Any ideas?
You can get symbols using
symbols = df['Ticker'].to_list()
and then you can use a for-loop without range():
ratios = dict()
for s in symbols:
    ratios[s] = fa.ratios(s)
print(ratios)
Because some symbols may not return ratios, you should use try/except.
Minimal working example; I use io.StringIO only to simulate a file.
import FundamentalAnalysis as fa
import pandas as pd
import io
text='''Ticker Company
A Agilent Technologies Inc.
AA ALCOA CORPORATION
AAC AAC Holdings Inc
AAL AMERICAN AIRLINES GROUP INC
AAME Atlantic American Corp.'''
df = pd.read_csv(io.StringIO(text), sep='\s{2,}')
symbols = df['Ticker'].to_list()
#symbols = ["TSLA" , "AAPL" , "MSFT"]
print(symbols)
ratios = dict()
for s in symbols:
try:
ratios[s] = fa.ratios(s)
except Exception as ex:
print(s, ex)
for s, ratio in ratios.items():
print(s, ratio)
EDIT: it seems fa.ratios() returns DataFrames, and if you keep them in a list you can concatenate all the DataFrames into one:
ratios = list()  # list instead of dictionary
for s in symbols:
    try:
        ratios.append(fa.ratios(s))  # append to list
    except Exception as ex:
        print(s, ex)

df = pd.concat(ratios, axis=1)  # convert the list of DataFrames to one DataFrame
print(df.columns)
print(df)
print(df.columns)
print(df)
Doc: pandas.concat()

How to smartly match two data frames using Python (using pandas or other means)?

I have one pandas dataframe composed of the names of the world's cities as well as the countries to which those cities belong,
city.head(3)
city country
0 Qal eh-ye Now Afghanistan
1 Chaghcharan Afghanistan
2 Lashkar Gah Afghanistan
and another data frame consisting of addresses of the world's universities, which is shown below:
df.head(3)
university
0 Inst Huizhou, Huihzhou 516001, Guangdong, Peop...
1 Guangxi Acad Sci, Nanning 530004, Guangxi, Peo...
2 Shenzhen VisuCA Key Lab SIAT, Shenzhen, People...
The city names appear at irregular positions across the rows. I would like to match the city names with the addresses of the world's universities; that is, I would like to know which city each university is located in, with the matched city name shown in the same row as the university's address.
I've tried the following, and it doesn't work because the positions of the cities are irregular across the rows.
df['university'].str.split(',').str[0]
I would suggest using apply:
city_list = city['city'].tolist()

def match_city(row):
    for c in city_list:
        if c in row['university']:
            return c
    return 'None'

df['city'] = df.apply(match_city, axis=1)
I assume the university address data is clean enough. If you want to do more advanced matching, you can adjust the match_city function; one possible adjustment is sketched below.
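For instance, here is a sketch of one such adjustment using difflib from the standard library, so that slightly misspelled cities (like "Huihzhou" in the sample data) can still match; the match_city_fuzzy name and the 0.85 cutoff are my own choices:
import difflib
import re

def match_city_fuzzy(row, city_list, cutoff=0.85):
    # compare each comma-separated address segment (digits removed)
    # against the list of known city names
    for part in row['university'].split(','):
        token = re.sub(r'\d+', '', part).strip()
        hits = difflib.get_close_matches(token, city_list, n=1, cutoff=cutoff)
        if hits:
            return hits[0]
    return 'None'

# df['city'] = df.apply(match_city_fuzzy, axis=1, city_list=city['city'].tolist())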
In order to deal with the inconsistent structure of your strings, a good solution is to use regular expressions. I mocked up some data based on your description and created a function to capture the city from the strings.
In my solution I used numpy to output NaN values when there wasn't a match, but you could easily just make it a blank string. I also included a test case where the input was blank in order to display the NaN result.
import re
import numpy as np
import pandas as pd

data = ["Inst Huizhou, Huihzhou 516001, Guangdong, People's Republic of China",
        "Guangxi Acad Sci, Nanning 530004, Guangxi, People's Republic of China",
        "Shenzhen VisuCA Key Lab SIAT, Shenzhen, People's Republic of China",
        "New York University, New York, New York 10012, United States of America",
        ""]
df = pd.DataFrame(data, columns=['university'])

def extract_city(row):
    # the city is the second comma-separated field; strip any digits from it
    match = re.match(r'^[^,]*,([^,]*),', row)
    if match:
        city = re.sub(r'\d+', '', match.group(1)).strip()
    else:
        city = np.nan
    return city

df.university.apply(extract_city)
Here's the output:
0 Huihzhou
1 Nanning
2 Shenzhen
3 New York
4 NaN
Name: university, dtype: object
My suggestion is to first do some pre-processing that reduces each address to city-level information (it doesn't need to be exact, but try your best, e.g. removing numbers), and then merge the dataframes based on text similarity; a pre-processing sketch follows.
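A sketch of that pre-processing step (drop digits, normalize case and whitespace) before any similarity comparison:
import pandas as pd

df = pd.DataFrame({'university': [
    "Inst Huizhou, Huihzhou 516001, Guangdong, People's Republic of China"]})

# reduce each address toward city-level text: no digits, one case, tidy spaces
df['addr_clean'] = (df['university']
                    .str.replace(r'\d+', '', regex=True)
                    .str.lower()
                    .str.replace(r'\s+', ' ', regex=True)
                    .str.strip())
print(df['addr_clean'][0])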
You may consider text similarity measures like Levenshtein distance or Jaro-Winkler, which are commonly used to match words.
Here is an example of a text-similarity implementation (Damerau-Levenshtein):
class DLDistance:
    def __init__(self, s1):
        self.s1 = s1
        self.d = {}
        self.lenstr1 = len(self.s1)
        for i in range(-1, self.lenstr1 + 1):
            self.d[(i, -1)] = i + 1

    def distance(self, s2):
        lenstr2 = len(s2)
        for j in range(-1, lenstr2 + 1):
            self.d[(-1, j)] = j + 1
        for i in range(self.lenstr1):
            for j in range(lenstr2):
                cost = 0 if self.s1[i] == s2[j] else 1
                self.d[(i, j)] = min(
                    self.d[(i - 1, j)] + 1,         # deletion
                    self.d[(i, j - 1)] + 1,         # insertion
                    self.d[(i - 1, j - 1)] + cost,  # substitution
                )
                if i and j and self.s1[i] == s2[j - 1] and self.s1[i - 1] == s2[j]:
                    # transposition
                    self.d[(i, j)] = min(self.d[(i, j)], self.d[(i - 2, j - 2)] + cost)
        return self.d[(self.lenstr1 - 1, lenstr2 - 1)]

if __name__ == '__main__':
    base = 'abs'
    cmpstrs = ['abs', 'sdfbasz', 'asdf', 'hfghfg']
    dl = DLDistance(base)
    for s in cmpstrs:
        print("damerau_levenshtein")
        print(dl.distance(s))
Even so, this has a high computational complexity, since it computes the distance N*M times, where N is the number of rows in the first dataframe and M the number of rows in the second. (To reduce the computational cost, you can prune the comparison set, e.g. by only comparing rows that share the same first character; see the sketch below.)
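A small sketch of that pruning idea: bucket the known cities by first character, and only run the distance computation against the matching bucket.
from collections import defaultdict

cities = ['Nanning', 'Shenzhen', 'New York', 'Huizhou']

buckets = defaultdict(list)
for c in cities:
    buckets[c[0].lower()].append(c)

def candidates(token):
    # only cities sharing the token's first character get a distance
    # computation, cutting the N*M comparisons down substantially
    return buckets.get(token[:1].lower(), [])

print(candidates('Nanjing'))  # ['Nanning', 'New York']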
levenshtein distance: https://en.wikipedia.org/wiki/Levenshtein_distance
jaro-winkler: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
I think one simple idea would be to create a mapping from any word, or sequence of words, of an address to the full address that word is part of, on the assumption that one of those address words is the city. In a second step we match this against the set of known cities you have, and anything that is not a known city gets discarded.
A mapping from each single word to its address is as simple as:
def address_to_dict(address):
    return {word: address for word in address.split(",")}
And we can easily extend this to include the set of bi-grams, tri-grams, ..., so that universities encoded in several words are also collected; a sketch follows below. See a discussion here: Elegant N-gram Generation in Python
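A minimal sketch of that extension over the comma-separated segments (up to trigrams here, an arbitrary choice; the function names are mine):
def ngrams(tokens, n):
    return [", ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def address_to_dict_ngrams(address, max_n=3):
    tokens = [t.strip() for t in address.split(",")]
    # map every 1..max_n-gram of segments back to the full address
    return {gram: address
            for n in range(1, max_n + 1)
            for gram in ngrams(tokens, n)}

addr = "New York University, New York, New York 10012, USA"
print(sorted(address_to_dict_ngrams(addr))[:3])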
We can then apply this to every address we have to obtain one grand mapping from any word to the full address:
word_to_address_mapping = pd.DataFrame(df.university.apply(address_to_dict).tolist()).stack()
word_to_address_mapping = pd.DataFrame(word_to_address_mapping, columns=["address"])
word_to_address_mapping.index = word_to_address_mapping.index.droplevel(level=0)
word_to_address_mapping
This yields a stacked frame mapping each token to its full address.
All you have to do then is join this with the actual city list you have: this will automatically discard any entry in word_to_address_mapping which is not a known city, and provide a mapping between university addresses and their cities.
# the outer join here should ensure that several universities in the
# same city do not overwrite each other
pd.merge(left=word_to_address_mapping, right=city,
         left_index=True, right_on="city",
         how="outer")
Partial matches are prevented in the function below. Country information is also considered while matching cities. To use this function, the university dataframe needs to be split into lists, such that every address is split into a list of strings.
In [22]: def get_city(univ_name_split):
   ....:     # find the country in the university address
   ....:     country = None
   ....:     for name in univ_name_split:
   ....:         if name in city['country'].values:
   ....:             country = name
   ....:             break  # stop once found, so a later token can't reset it
   ....:     if country:
   ....:         cities = city[city.country == country].city.values
   ....:     else:
   ....:         cities = city['city'].values
   ....:     # find the city in the university address
   ....:     for name in univ_name_split:
   ....:         if name in cities:
   ....:             return name
   ....:     return None
   ....:
....:
In [1]: import pandas as pd
In [2]: city = pd.read_csv('city.csv')
In [3]: df = pd.read_csv('university.csv')
In [4]: # splitting university name and address
In [5]: df_split = df['university'].str.split(',')
In [6]: df_split = df_split.apply(lambda x:[i.strip() for i in x])
In [10]: df
Out[10]:
university
0 Kongu Engineering College, Perundurai, Erode, ...
1 Anna University - Guindy, Chennai, India
2 Birla Institute of Technology and Science, Pil...
In [11]: df_split
Out[11]:
0 [Kongu Engineering College, Perundurai, Erode,...
1 [Anna University - Guindy, Chennai, India]
2 [Birla Institute of Technology and Science, Pi...
Name: university, dtype: object
In [12]: city
Out[12]:
city country
0 Bangalore India
1 Chennai India
2 Coimbatore India
3 Delhi India
4 Erode India
# This function is a shorter version of the one above
In [14]: def get_city(univ_name_split):
   ....:     for name in univ_name_split:
   ....:         if name in city['city'].values:
   ....:             return name
   ....:     return None
   ....:
....:
In [15]: df['city'] = df_split.apply(get_city)
In [16]: df
Out[16]:
university city
0 Kongu Engineering College, Perundurai, Erode, ... Erode
1 Anna University - Guindy, Chennai, India Chennai
2 Birla Institute of Technology and Science, Pil... None
I've created a tiny library for my projects, especially for fuzzy joins. It might not be the fastest solution, but it may help; feel free to use it.
Link to my GitHub repo

Using pandas, how to replace the last word of a string with an empty string without distorting the rest of the string?

I can't share the actual data, so I am using an example.
Suppose I have a list of suffixes -
Suffix_List = ["Ltd.", "Inc.", "Limited", "Corp.", "AG"]
I have a data frame with a column containing company names. I want to replace the suffixes of the company names with an empty string without distorting the rest of the name. For example, say the company name is "CAGE AG": "AG" should be removed only as the suffix, not from inside the name, so the result should be just "CAGE". Also, a suffix should only be removed if it is present in the Suffix_List.
Right now I am using -
for suffix in Suffix_List:
    df['company_name'] = df['company_name'].str.replace(suffix, "")
But this distorts the actual company name too.
Sample company names could be: CAGE AG, Wage Limited, Tage Ltd., Sage Inc
You can use regex to substitute out the suffix:
In [10]: import re
In [11]: re.sub(r"\s?(" + "|".join(map(re.escape, Suffix_List)) + ")$", "", "CAGE AG")
Out[11]: 'CAGE'
This checks whether any (|) of the suffixes ends ($) the string; re.escape ensures the literal dot in suffixes like "Ltd." isn't treated as a regex wildcard.
On the Series/column you can use str.replace:
In [21]: df = pd.DataFrame([["CAGE AG"], ["Stack Exchange Inc."]], columns=["company"])
In [22]: df
Out[22]:
               company
0              CAGE AG
1  Stack Exchange Inc.
In [23]: df["company"] = df["company"].str.replace("\s?(" + "|".join(Suffix_List) + ")$", "")
In [24]: df
Out[24]:
company
0 CAGE
1 Stack Exchange
