I'm reading from a sqlite3 db into a df:
id symbol name
0 1 QCLR Global X Funds Global X NASDAQ 100 Collar 95-1...
1 2 LCW Learn CW Investment Corporation
2 3 BUG Global X Funds Global X Cybersecurity ETF
3 4 LDOS Leidos Holdings, Inc.
4 5 LDP COHEN & STEERS LIMITED DURATION PREFERRED AND ...
... ... ... ...
10999 11000 ERIC Ericsson American Depositary Shares
11000 11001 EDI Virtus Stone Harbor Emerging Markets Total Inc...
11001 11002 EVX VanEck Environmental Services ETF
11002 11003 QCLN First Trust NASDAQ Clean Edge Green Energy Ind...
11003 11004 DTB DTE Energy Company 2020 Series G 4.375% Junior...
[11004 rows x 3 columns]
Then I have a symbols.csv file which I want to use to filter the above df:
AKAM
AKRO
Here's how I've tried to do it:
origin_symbols = pd.read_sql_query("SELECT id, symbol, name from stock", conn)
mikey_symbols = pd.read_csv("symbols.csv")
df = origin_symbols[origin_symbols['symbol'].isin(mikey_symbols)]
But for some reason I only get the first line returned from the csv:
id symbol name
6475 6476 AKAM Akamai Technologies, Inc. Common Stock
1 df
Where am I going wrong here?
You need convert csv file to Series, here is added column name and for Series select it (e.g. by position):
mikey_symbols = pd.read_csv("symbols.csv", names=['tmp']).iloc[:, 0]
#or by column name
#mikey_symbols = pd.read_csv("symbols.csv", names=['tmp'])['tmp']
And then remove possible traling spaces in both by Series.str.strip:
df = origin_symbols[origin_symbols['symbol'].str.strip().isin(mikey_symbols.str.strip())]
Related
Very new to this, would appreciate any advice on the following:
I have a dataset 'Projects' showing list of institutions with project IDs:
project_id institution_name
0 somali national university
1 aarhus university
2 bath spa
3 aa school of architecture
4 actionaid uk
I would like to fuzzy match merge this with the following dataset of 'Universities' and their country codes:
institution_name country_code
a tan kapuja buddhista foiskola HU
aa school of architecture UK
bath spa university UK
aalto-yliopisto FI
aarhus universitet DK
And get back this:
project_id institution_name Match organisation country_code
0 somali national university [] NaN NaN
1 aarhus university [(91)] aarhus universitet DK
2 bath spa [(90)] bath spa university UK
3 aa school of architecture [(100)] aa school of architecture UK
4 actionaid uk [] NaN NaN
Using rapidfuzz:
import pandas as pd
import numpy as np
from rapidfuzz import process, utils as fuzz_utils
def fuzzy_merge(baseFrame, compareFrame, baseKey, compareKey, threshold=90, limit=1, how='left'):
# baseFrame: the left table to join
# compareFrame: the right table to join
# baseKey: key column of the left table
# compareKey: key column of the right table
# threshold: how close the matches should be to return a match, based on Levenshtein distance
# limit: the amount of matches that will get returned, these are sorted high to low
# return: dataframe with boths keys and matches
s_mapping = {x: fuzz_utils.default_process(x) for x in compareFrame[compareKey]}
m1 = baseFrame[baseKey].apply(lambda x: process.extract(
fuzz_utils.default_process(x), s_mapping, limit=limit, score_cutoff=threshold, processor=None
))
baseFrame['Match'] = m1
m2 = baseFrame['Match'].apply(lambda x: ', '.join(i[2] for i in x))
baseFrame['organisation'] = m2
return baseFrame.merge(compareFrame, on=baseKey, how=how)
Merged = fuzzy_merge(Projects, Universities, 'institution_name', 'institution_name')
Merged
I got this (with some extra text in the match column but won't go into that now). It's nearly what I want, but the country code only matches up when it's a 100% match:
project_id institution_name Match organisation country_code
0 somali national university [] NaN NaN
1 aarhus university [(91)] aarhus universitet NaN
2 bath spa [(90)] bath spa university NaN
3 aa school of architecture [(100)] aa school of architecture UK
4 actionaid uk [] NaN NaN
I reckon this is an issue with how I'm comparing my basekey to the compareframe to create my merged dataset. I can't sort out how to return it on 'organisation' instead though - attempts to plug it in result in varying errors.
Never mind, figured it out - I didn't account for the empty cells! Replacing them with NaN worked out perfectly.
def fuzzy_merge(baseFrame, compareFrame, baseKey, compareKey, threshold=90, limit=1, how='left'):
s_mapping = {x: fuzz_utils.default_process(x) for x in compareFrame[compareKey]}
m1 = baseFrame[baseKey].apply(lambda x: process.extract(
fuzz_utils.default_process(x), s_mapping, limit=limit, score_cutoff=threshold, processor=None
))
baseFrame['Match'] = m1
m2 = baseFrame['Match'].apply(lambda x: ', '.join(i[2] for i in x))
baseFrame['organisations'] = m2.replace("",np.nan)
return baseFrame.merge(compareFrame, left_on='organisations', right_on=compareKey, how=how)
I'm running some analysis on bank statements (csv's). Some items like McDonalds each have their own row (due to having different addresses).
I'm trying to combine these rows by a common phrase. So for this example the obvious phrase, or string, would be "McDonalds". I think it'll be an if statement.
Also, the column has a dtype of "object". Will I have to convert it to string format?
Here is an example output of the result of printingtotali = df.Item.value_counts() from my code.
Ideally I'd want that line to output McDonalds as just a single row.
In the csv they are 2 separate rows.
foo 14
Restaurant Boulder CO 8
McDonalds Boulder CO 5
McDonalds Denver CO 5
Here's what the column data consists of
'Sukiya Greenwood Vil CO' 'Sei 34179 Denver CO' 'Chambers Place Liquors 303-3731100 CO' "Mcdonald's F26593 Fort Collins CO" 'Suh Sushi Korean Bbq Fort Collins CO' 'Conoco - Sei 26927 Fort Collins CO'
OK. I think I ginned up something that can be helpful. Realize that the task of inferring categories or names from text strings can be huge, depending on how detailed you want to get. You can dive into regex or other learning models. People make careers of it! Obviously, your bank is doing some of this as they categorize things when you get a year-end summary.
Anyhow, here is a simple way to generate some categories and use them as a basis for the grouping that you want to do.
import pandas as pd
item=['McDonalds Denver', 'Sonoco', 'ATM Fee', 'Sonoco, Ft. Collins', 'McDonalds, Boulder', 'Arco Boulder']
txn = [12.44, 4.00, 3.00, 14.99, 19.10, 52.99]
df = pd.DataFrame([item, txn]).T
df.columns = ['item_orig', 'charge']
print(df)
# let's add an extra column to catch the conversions...
df['item'] = pd.Series(dtype=str)
# we'll use the "contains" function in pandas as a simple converter... quick demo
temp = df.loc[df['item_orig'].str.contains('McDonalds')]
print('\nitems that containt the string "McDonalds"')
print(temp)
# let's build a simple conversion table in a dictionary
conversions = { 'McDonalds': 'McDonalds - any',
'Sonoco': 'gas',
'Arco': 'gas'}
# let's loop over the orig items and put conversions into the new column
# (there is probably a faster way to do this, but for data with < 100K rows, who cares.)
for key in conversions:
df['item'].loc[df['item_orig'].str.contains(key)] = conversions[key]
# see how we did...
print('converted...')
print(df)
# now move over anything that was NOT converted
# in this example, this is just the ATM Fee item...
df['item'].loc[df['item'].isnull()] = df['item_orig']
# now we have decent labels to support grouping!
print('\n\n *** sum of charges by group ***')
print(df.groupby('item')['charge'].sum())
Yields:
item_orig charge
0 McDonalds Denver 12.44
1 Sonoco 4
2 ATM Fee 3
3 Sonoco, Ft. Collins 14.99
4 McDonalds, Boulder 19.1
5 Arco Boulder 52.99
items that containt the string "McDonalds"
item_orig charge item
0 McDonalds Denver 12.44 NaN
4 McDonalds, Boulder 19.1 NaN
converted...
item_orig charge item
0 McDonalds Denver 12.44 McDonalds - any
1 Sonoco 4 gas
2 ATM Fee 3 NaN
3 Sonoco, Ft. Collins 14.99 gas
4 McDonalds, Boulder 19.1 McDonalds - any
5 Arco Boulder 52.99 gas
*** sum of charges by group ***
item
ATM Fee 3.00
McDonalds - any 31.54
gas 71.98
Name: charge, dtype: float64
I have a dataframe df which contains company names that I need to neatly-format. The names are already in titlecase:
Company Name
0 Visa Inc
1 Msci Inc
2 Coca Cola Inc
3 Pnc Bank
4 Aig Corp
5 Td Ameritrade
6 Uber Inc
7 Costco Inc
8 New York Times
Since many of the companies go by an acronym or an abbreviation (rows 1, 3, 4, 5), I want only the first string in those company names to be uppercase, like so:
Company Name
0 Visa Inc
1 MSCI Inc
2 Coca Cola Inc
3 PNC Bank
4 AIG Corp
5 TD Ameritrade
6 Uber Inc
7 Costco Inc
8 New York Times
I know I can't get 100% accurate replacement, but I believe I can get close by uppercasing only the first string if:
it's 4 or fewer characters
and the first string is not a word in the dictionary
How can I achieve this with something like: df['Company Name'] = df['Company Name'].replace()?
So you can actually use the enchant module to find out if it is a dictionary word or not. Given you are still going to have some off results I.E. Uber.
Here is the code I came up with, sorry for the terrible names of variables and what not.
import enchant
import pandas as pd
def main():
d = enchant.Dict("en_US")
listofcompanys = ['Msci Inc',
'Coca Cola Inc',
'Pnc Bank',
'Aig Corp',
'Td Ameritrade',
'Uber Inc',
'Costco Inc',
'New York Times']
dataframe = pd.DataFrame(listofcompanys, columns=['Company Name'])
for index, name in dataframe.iterrows():
first_word = name['Company Name'].split()
is_word = d.check(first_word[0])
if not is_word:
name['Company Name'] = first_word[0].upper() + ' ' + first_word[1]
print(dataframe)
if __name__ == '__main__':
main()
Output for this was:
Company Name
0 MSCI Inc
1 Coca Cola Inc
2 PNC Bank
3 AIG Corp
4 TD Ameritrade
5 UBER Inc
6 Costco Inc
7 New York Times
Here's a working solution, which makes use of a english word list. Only it's not accurate for td and uber, but like you said, this will be hard to get 100% accurate.
url = 'https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt'
words = set(pd.read_csv(url, header=None)[0])
w1 = df['Company Name'].str.split()
m1 = ~w1.str[0].str.lower().isin(words) # is not an english word
m2 = w1.str[0].str.len().le(4) # first word is < 4 characters
df.loc[m1 & m2, 'Company Name'] = w1.str[0].str.upper() + ' ' + w1.str[1:].str.join(' ')
Company Name
0 Visa Inc
1 MSCI Inc
2 Coca Cola Inc
3 PNC Bank
4 AIG Corp
5 Td Ameritrade
6 UBER Inc
7 Costco Inc
8 New York Times
Note: I also tried this with nltk package, but apparently, the nltk.corpus.words module is by far not complete with the english words.
You can first separate the first words and the other parts. Then filter those first words based on your logic:
company_list = ['Visa']
s = df['Company Name'].str.extract('^(\S+)(.*)')
mask = s[0].str.len().le(4) & (~s[0].isin(company_list))
df['Company Name'] = s[0].mask(mask, s[0].str.upper()) + s[1]
Output (notice that NEW in New York gets changed as well):
Company Name
0 Visa Inc
1 MSCI Inc
2 COCA Cola Inc
3 PNC Bank
4 AIG Corp
5 TD Ameritrade
6 UBER Inc
7 Costco Inc
8 NEW York Times
This will get you first word from the string and make it upper only for those company names that are included in the include list:
import pandas as pd
import numpy as np
company_name = {'Visa Inc', 'Msci Inc', 'Coca Cola Ins', 'Pnc Bank'}
include = ['Msci', 'Pnc']
df = pd.DataFrame(company_name)
df.rename(columns={0: 'Company Name'}, inplace=True)
df['Company Name'] = df['Company Name'].apply(lambda x: x.split()[0].upper() + ' ' + x[len(x.split()[0].upper()):] if x.split()[0].strip() in include else x)
df['Company Name']
Output:
0 MSCI Inc
1 Coca Cola Ins
2 PNC Bank
3 Visa Inc
Name: Company Name, dtype: object
A manual workaround could be appending words like "uber"
from nltk.corpus import words
dict_words = words.words()
dict_words.append('uber')
create a new column
df.apply(lambda x : x['Company Name'].replace(x['Company Name'].split(" ")[0].strip(), x['Company Name'].split(" ")[0].strip().upper())
if len(x['Company Name'].split(" ")[0].strip()) <= 4 and x['Company Name'].split(" ")[0].strip().lower() not in dict_words
else x['Company Name'],axis=1)
Output:
0 Visa Inc
1 Msci Inc
2 Coca Cola Inc
3 PNC Bank
4 AIG Corp
5 TD Ameritrade
6 Uber Inc
7 Costco Inc
8 New York Times
Download the nltk package version by runnning:
import nltk
nltk.download()
Demo:
from nltk.corpus import words
"new" in words.words()
Output:
False
I have a dataframe which looks like -
ML_ENTITY_NAME EDT_ENTITY_NAME
1 ABC BANK HABIB METROPOLITAN BANK
2 ABC BANK HABIB METROPOLITIAN BANK
3 BANK OF AMERICA HSBC BANK MALAYSIA BHD
4 BANK OF AMERICA HSBC BANK MALAYSIA SDN BHD
5 BANK OF NEW ZEALAND HUA NAN COMMERCIAL BANK
6 BANK OF NEW ZEALAND HUA NAN COMMERCIAL BANK LTD
7 CITIBANK N.A. CHINA GUANGFA BANK CO LTD
8 CITIBANK N.A. CHINA GUANGFA BANK CO.,LTD
9 SECURITY BANK CORP. SECURITY BANK CORP
10 SIAM COMMERCIAL BANK THE SIAM COMMERCIAL BANK PCL
11 TEMU ANZ BANK SAMOA LTD
I have written a levenshtein function which loooks like -
def fm(s1, s2):
score = Levenshtein.distance(s1,s2)
if score == 0.0:
score = 1.0
else:
score = 1 - (score / len(s1))
return score
I wanted to write a code that if the levenstein score of two EDT_ENTITY_NAME values is greater than .75 then we drop the one value having less length and retain the one having more length.Also the ML_ENTITY_NAME for comparison should be same.
My final output should looks like -
ML_ENTITY_NAME EDT_ENTITY_NAME
1 ABC BANK HABIB METROPOLITIAN BANK
2 BANK OF AMERICA HSBC BANK MALAYSIA SDN BHD
3 BANK OF NEW ZEALAND HUA NAN COMMERCIAL BANK LTD
4 CITIBANK N.A. CHINA GUANGFA BANK CO.,LTD
5 SECURITY BANK CORP. SECURITY BANK CORP
6 SIAM COMMERCIAL BANK THE SIAM COMMERCIAL BANK PCL
7 TEMU ANZ BANK SAMOA LTD
Currently my approach is to sort the df and iterate over the loop and check if ML_ENTITY_NAME values are same then calculate the levenshtein for EDT_ENTITY_NAME. i have added a new column delete and I'm updating the delete column to 1 if the above conditions satifies and the length one ML_ENTITY_NAME is smaller than other ML_ENTITY_NAME.
my code looks like -
df.sort_values(by=['ML_ENTITY_NAME','EDT_ENTITY_NAME'],inplace=True)
df['delete']=0
for row1 in df.itertuples():
for row2 in df.itertuples():
if (str(row1.ML_ENTITY_NAME) == str(row2.ML_ENTITY_NAME)) and (1>fm(str(row1.EDT_ENTITY_NAME),str(row2.EDT_ENTITY_NAME))>.74):
if(len(row1.EDT_ENTITY_NAME)>len(row2.EDT_ENTITY_NAME)):
df.loc[row2.Index,row2[2]]=1
print(df)
currently it's giving wrong output.
can someone help me with some answers/hints/suggestions?
I believe you need:
#cross join by ML_ENTITY_NAME column
df1 = df.merge(df, on='ML_ENTITY_NAME', how='outer')
#remove same values per rows (distance 1)
df1 = df1[df1['EDT_ENTITY_NAME_x'] != df1['EDT_ENTITY_NAME_y']]
#apply function and compare
m1 = df1.apply(lambda x: fm(x['EDT_ENTITY_NAME_x'], x['EDT_ENTITY_NAME_y']), axis=1) > .75
m2 = df1['EDT_ENTITY_NAME_x'].str.len() > df1['EDT_ENTITY_NAME_y'].str.len()
#filtering
df2 = df1.loc[m1 & m2, ['ML_ENTITY_NAME','EDT_ENTITY_NAME_x']]
#remove `_x`
df2.columns = df2.columns.str.replace('_x$', '')
#add unique rows per ML_ENTITY_NAME
df2 = df2.append(df[~df['ML_ENTITY_NAME'].duplicated(keep=False)]).reset_index(drop=True)
print (df2)
ML_ENTITY_NAME EDT_ENTITY_NAME
0 ABC BANK HABIB METROPOLITIAN BANK
1 BANK OF AMERICA HSBC BANK MALAYSIA SDN BHD
2 BANK OF NEW ZEALAND HUA NAN COMMERCIAL BANK LTD
3 CITIBANK N.A. CHINA GUANGFA BANK CO.,LTD
4 SECURITY BANK CORP. SECURITY BANK CORP
5 SIAM COMMERCIAL BANK THE SIAM COMMERCIAL BANK PCL
6 TEMU ANZ BANK SAMOA LTD
Could you specify what exactly is wrong about the output you are getting? The only deviation from your goal I see in code is that you only set the delete flag to 1 for row pairs with 0.74 < fm(...) < 1, while it should be rather 0.75 < fm(...).
As a side note, sorting is redundant in your code, since you end up comparing every possible pair of rows anyways. What you possibly had in mind when implementing the sorting was going through each consecutive pair of rows, which would improve the complexity of your code from O(n2) to O(n).
Another side note is that you don't need the if statement in your fm function: statement score = 1 - score / len(s1) would cover both cases.
I have two dataframes as shown below.
Company Name BOD Position Ethnicity DOB Age Gender Degree ( Specialazation) Remark
0 Big Lots Inc. David J. Campisi Director, President and Chief Executive Offic... American 1956 61 Male Graduate NaN
1 Big Lots Inc. Philip E. Mallott Chairman of the Board American 1958 59 Male MBA, Finace NaN
2 Big Lots Inc. James R. Chambers Independent Director American 1958 59 Male MBA NaN
3 Momentive Performance Materials Inc Mahesh Balakrishnan director Asian 1983 34 Male BA Economics NaN
Company Name Net Sale Gross Profit Remark
0 Big Lots Inc. 5.2B 2.1B NaN
1 Momentive Performance Materials Inc 544M 146m NaN
2 Markel Corporation 5.61B 2.06B NaN
3 Noble Energy, Inc. 3.49B 2.41B NaN
4 Leidos Holding, Inc. 7.04B 852M NaN
I want to create a new dataframe with these two, so that in 2nd dataframe, I have new columns with count of ethinicities from each companies, such as American -2 Mexican -5 and so on, so that later on, i can calculate diversity score.
the variables in the output dataframe is like,
Company Name Net Sale Gross Profit Remark American Mexican German .....
Big Lots Inc. 5.2B 2.1B NaN 2 0 5 ....
First get counts per groups by groupby with size and unstack, last join to second DataFrame:
df1 = pd.DataFrame({'Company Name':list('aabcac'),
'Ethnicity':['American'] * 3 + ['Mexican'] * 3})
df1 = df1.groupby(['Company Name', 'Ethnicity']).size().unstack(fill_value=0)
#slowier alternative
#df1 = pd.crosstab(df1['Company Name'], df1['Ethnicity'])
print (df1)
Ethnicity American Mexican
Company Name
a 2 1
b 1 0
c 0 2
df2 = pd.DataFrame({'Company Name':list('abc')})
print (df2)
Company Name
0 a
1 b
2 c
df3 = df2.join(df1, on=['Company Name'])
print (df3)
Company Name American Mexican
0 a 2 1
1 b 1 0
2 c 0 2
EDIT: You need replace unit by 0 and convert to floats:
print (df)
Name sale
0 A 100M
1 B 200M
2 C 5M
3 D 40M
4 E 10B
5 F 2B
d = {'M': '0'*6, 'B': '0'*9}
df['a'] = df['sale'].replace(d, regex=True).astype(float).sort_values(ascending=False)
print (df)
Name sale a
0 A 100M 1.000000e+08
1 B 200M 2.000000e+08
2 C 5M 5.000000e+06
3 D 40M 4.000000e+07
4 E 10B 1.000000e+10
5 F 2B 2.000000e+09