I am trying to scrape data, write it to a pandas Series, and then enter a while loop for the remaining pages of the website, appending to the original Series (located outside of the while loop) after each iteration. I'm not sure why this isn't working. Here's where I'm stuck:
import requests
import pandas as pd
from bs4 import BeautifulSoup

current_url = 'https://www.yellowpages.com/search?search_terms=hvac&geo_location_terms=97080'

def get_data_run(current_url):
    company_names1 = get_company_name(current_url)
    print(company_names1) #1
    page = 1
    max_page = 3
    company_names1 = paginate(current_url, page, max_page, company_names1)
    print(company_names1) #2

def paginate(current_url, page, max_page, company_names1):
    while (page <= max_page):
        new_url = current_url + f"&page={page}"
        print(new_url)
        company_names = get_company_name(new_url)
        company_names1.append(company_names)
        print(company_names) #3
        print(company_names1) #4
        page += 1
        if page == max_page:
            return company_names1

def get_company_name(url):
    company_names = []
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'lxml')
    box = list(soup.findAll("div", {"class": "result"}))
    for i in range(len(box)):
        try:
            company_names.append(box[i].find("a", {"class": "business-name"}).text.strip())
        except Exception:
            company_names.append("null")
        else:
            continue
    company_names = pd.Series(company_names, dtype='string')
    return company_names

get_data_run(current_url)
I've labeled the different prints. Every time company_names1 is printed (#1, #2, #4) it shows the same Series of companies, even after appending company_names inside the while loop. The thing I can't understand is that when I print company_names (#3) it prints the next page of company names. I don't understand why it's not appending inside the while loop, and then why it's not returning successfully out of the function and printing the combined Series in the #2 print. Thanks!
UPDATE:
Here is some sample output:
when I print #3:
(pyfinance) justinbenfit@MacBook-Pro-3 yellowpages_scrape % /usr/local/anaconda3/envs/pyfinance/bin/python /Users/justinbenfit/Desktop/yellowpages_scrape/test.py
0 Honke Heating & Air Conditioning
1 Climate Kings Heating & Ac
2 Mike's Truck & Auto Service
3 One Hour Heating & Air Conditioning
4 Morgan Heating & Cooling Inc
5 Rnr Heating Venting & Air Conditioning
6 Universal HVAC Inc
7 Mr Furnace
8 Affordable Excellence Heating
9 Green Air Products
10 David Eugene Neketin
11 Century Heating & Air Cond
12 Appliance Wizard
13 Precision Energy Solutions Inc.
14 Portland Heating & Air Conditioning Co
15 Mhc
16 American Pride Heating and Cooling, LLC
17 Tri Star Western
18 Comfort Zone Heat & Air Inc
19 Don's Air-Care Inc
20 Chuck's Heating & Cooling
21 Mt. Hood Heating Cooling & Refrigeration
22 Chuck's Heating & Cooling
23 Mr. Furnace
24 America's Same Day Service
25 Arctic Commercial Refrigeration LLC
26 Apex Refrigeration
27 Ben's Heating & Air Conditioning LLC
28 David's Appliance Inc
29 Wolcott Heating & Cooling
dtype: string
0 Air-Trix
1 Johnstone Supply
2 Buss Heating & Cooling Inc
3 The Heat Exchange
4 Hoodview Heating & Air Conditioning
5 Loomis Heating Cooling & Refrigeration
6 All About Air Heating & Cooling
7 Hanson Heating
8 Sparks Heating & Cooling
9 Interior Comfort Systems
10 P D X Heating & Cooling
11 Apcom Power Inc
12 Area Heating Inc
13 Four Seasons Heating Air Conditioning & Servic...
14 Perfect Climate Inc
15 Combustion Consultants Inc
16 Classic Heat Source, Inc.
17 Multnomah Heating, Inc
18 Apollo Plumbing, Heating & Air Conditioning - OR
19 Art's Furnace & Air Cond
20 Kurchel Heating
21 P & O Construction Inc
22 Systems Management NW
23 Bridgetown Heating
24 Amana Heating & Air Conditioning Systems
25 QualitySmith
26 Wilbert Jr, Wilson
27 Faith Heating & Air Conditioning Inc
28 Northwest Commercial Heating & Air Conditionin...
29 Heat Master Corp
dtype: string
when I print #1, #2, and #4
0 Honke Heating & Air Conditioning
1 Climate Kings Heating & Ac
2 Mike's Truck & Auto Service
3 One Hour Heating & Air Conditioning
4 Morgan Heating & Cooling Inc
5 Rnr Heating Venting & Air Conditioning
6 Universal HVAC Inc
7 Mr Furnace
8 Affordable Excellence Heating
9 Green Air Products
10 David Eugene Neketin
11 Century Heating & Air Cond
12 Appliance Wizard
13 Precision Energy Solutions Inc.
14 Portland Heating & Air Conditioning Co
15 Mhc
16 American Pride Heating and Cooling, LLC
17 Tri Star Western
18 Comfort Zone Heat & Air Inc
19 Don's Air-Care Inc
20 Chuck's Heating & Cooling
21 Chuck's Heating & Cooling
22 Mr. Furnace
23 Mt. Hood Heating Cooling & Refrigeration
24 America's Same Day Service
25 Arctic Commercial Refrigeration LLC
26 Apex Refrigeration
27 Ben's Heating & Air Conditioning LLC
28 David's Appliance Inc
29 Wolcott Heating & Cooling
dtype: string
The problem is that you're treating a pd.Series like a list. A list is mutated in place by append, whereas Series.append returns a new Series and leaves the original unchanged. Appending data to a list works like this:
lst = [1,2,3]
lst.append(4)
print(lst)
# [1, 2, 3, 4]
The object changes without having to explicitly assign it. If you do the same with Series, this happens:
series = pd.Series([1,2,3])
series.append(pd.Series([4]))
print(series)
The output is:
0 1
1 2
2 3
dtype: int64
So, to update a Series, you have to assign the result back, either replacing the original object or creating a new one. Without the assignment, the appended result is simply discarded:
series = pd.Series([1,2,3])
series = series.append(pd.Series([4]))
print(series)
Output:
0 1
1 2
2 3
0 4
dtype: int64
In the case of your code, the problem lies in the paginate function; you should change this line:
company_names1.append(company_names)
to:
company_names1 = company_names1.append(company_names)
And everything should work.
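Note that Series.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions the same reassignment pattern uses pd.concat instead; a minimal sketch:
import pandas as pd

series = pd.Series([1, 2, 3])
# pd.concat also returns a new object, so the assignment is still required
series = pd.concat([series, pd.Series([4])], ignore_index=True)
print(series)  # 1, 2, 3, 4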
I need help matching phrases in the data given below, where I need to match phrases from TextA against TextB.
The following code did not help me do it. How can I address this? I have hundreds of them to match.
#sorting jumbled phrases
def sorts(string_value):
    sorted_string = sorted(string_value.split())
    sorted_string = ' '.join(sorted_string)
    return sorted_string

#Removing punctuations in string
punc = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''
def punt(test_str):
    for ele in test_str:
        if ele in punc:
            test_str = test_str.replace(ele, "")
    return test_str

#matching strings
def lets_match(x):
    for text1 in TextA:
        for text2 in TextB:
            try:
                if sorts(punt(x[text1.casefold()])) == sorts(punt(x[text2.casefold()])):
                    return True
            except Exception:
                continue
    return False

df['result'] = df.apply(lets_match, axis=1)
Even after implementing string sorting, punctuation removal, and case folding, I am still getting those strings as not matching. Am I missing something here? Can someone help me achieve it?
Actually you can use difflib to match two texts; here's what you can try:
from difflib import SequenceMatcher

def similar(a, b):
    a = str(a).lower()
    b = str(b).lower()
    return SequenceMatcher(None, a, b).ratio()

def lets_match(d):
    print(d[0], " --- ", d[1])
    result = similar(d[0], d[1])
    print(result)
    if result > 0.6:
        return True
    else:
        return False

df["result"] = df.apply(lets_match, axis=1)
You can play with the result > 0.6 threshold.
For more information about difflib you can check its documentation. There are other sequence matchers too, like textdistance, but I found this one easy, so I tried it.
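To get a feel for the threshold, here is a quick demo of similar() on a few of the sample pairs (the scores are approximate):
print(similar("CNOC LIMITED", "CNOOC LIMITED"))  # roughly 0.96, well above 0.6
print(similar("MURAN", "MURMAN"))                # roughly 0.91
print(similar("MURAN", "GLORY PACK LTD."))       # far below the threshold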
Are there any issues with using the fuzzy match lib? The implementation is pretty straightforward and works well given the above data is relatively similar. I've performed the below without preprocessing.
import pandas as pd
""" Install the libs below via terminal:
$pip install fuzzywuzzy
$pip install python-Levenshtein
"""
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
#creating the data frames
text_a = ['AKIL KUMAR SINGH','OUSMANI DJIBO','PETER HRYB','CNOC LIMITED','POLY NOVA INDUSTRIES LTD','SAM GAWED JR','ADAN GENERAL LLC','CHINA MOBLE LIMITED','CASTAR CO., LTD.','MURAN','OLD SAROOP FOR CAR SEAT COVERS','CNP HEALTHCARE, LLC','GLORY PACK LTD','AUNCO VENTURES','INTERNATIONAL COMPANY','SAMEERA HEAT AND ENERGY FUND']
text_b = ['Singh, Akil Kumar','DJIBO, Ousmani Illiassou','HRYB, Peter','CNOOC LIMITED','POLYNOVA INDUSTRIES LTD.','GAWED, SAM','ADAN GENERAL TRADING FZE','CHINA MOBILE LIMITED','CASTAR GROUP CO., LTD.','MURMAN','Old Saroop for Car Seat Covers','CNP HEATHCARE, LLC','GLORY PACK LTD.','AUNCO VENTURE','INTL COMPANY','SAMEERA HEAT AND ENERGY PROPERTY FUND']
df_text_a = pd.DataFrame(text_a, columns=['text_a'])
df_text_b = pd.DataFrame(text_b, columns=['text_b'])
def lets_match(txt: str, chklist: list) -> tuple:
    return process.extractOne(txt, chklist, scorer=fuzz.token_set_ratio)
#match Text_A against Text_B
result_txt_ab = df_text_a.apply(lambda x: lets_match(str(x), text_b), axis=1, result_type='expand')
result_txt_ab.rename(columns={0:'Return Match', 1:'Match Value'}, inplace=True)
df_text_a[result_txt_ab.columns]=result_txt_ab
df_text_a
text_a Return Match Match Value
0 AKIL KUMAR SINGH Singh, Akil Kumar 100
1 OUSMANI DJIBO DJIBO, Ousmani Illiassou 72
2 PETER HRYB HRYB, Peter 100
3 CNOC LIMITED CNOOC LIMITED 70
4 POLY NOVA INDUSTRIES LTD POLYNOVA INDUSTRIES LTD. 76
5 SAM GAWED JR GAWED, SAM 100
6 ADAN GENERAL LLC ADAN GENERAL TRADING FZE 67
7 CHINA MOBLE LIMITED CHINA MOBILE LIMITED 79
8 CASTAR CO., LTD. CASTAR GROUP CO., LTD. 81
9 MURAN SAMEERA HEAT AND ENERGY PROPERTY FUND 41
10 OLD SAROOP FOR CAR SEAT COVERS Old Saroop for Car Seat Covers 100
11 CNP HEALTHCARE, LLC CNP HEATHCARE, LLC 58
12 GLORY PACK LTD GLORY PACK LTD. 100
13 AUNCO VENTURES AUNCO VENTURE 56
14 INTERNATIONAL COMPANY INTL COMPANY 74
15 SAMEERA HEAT AND ENERGY FUND SAMEERA HEAT AND ENERGY PROPERTY FUND 86
#match Text_B against Text_A
result_txt_ba= df_text_b.apply(lambda x: lets_match(str(x), text_a), axis=1, result_type='expand')
result_txt_ba.rename(columns={0:'Return Match', 1:'Match Value'}, inplace=True)
df_text_b[result_txt_ba.columns]=result_txt_ba
df_text_b
text_b Return Match Match Value
0 Singh, Akil Kumar AKIL KUMAR SINGH 100
1 DJIBO, Ousmani Illiassou OUSMANI DJIBO 100
2 HRYB, Peter PETER HRYB 100
3 CNOOC LIMITED CNOC LIMITED 74
4 POLYNOVA INDUSTRIES LTD. POLY NOVA INDUSTRIES LTD 74
5 GAWED, SAM SAM GAWED JR 86
6 ADAN GENERAL TRADING FZE ADAN GENERAL LLC 86
7 CHINA MOBILE LIMITED CHINA MOBLE LIMITED 81
8 CASTAR GROUP CO., LTD. CASTAR CO., LTD. 100
9 MURMAN ADAN GENERAL LLC 33
10 Old Saroop for Car Seat Covers OLD SAROOP FOR CAR SEAT COVERS 100
11 CNP HEATHCARE, LLC CNP HEALTHCARE, LLC 56
12 GLORY PACK LTD. GLORY PACK LTD 100
13 AUNCO VENTURE AUNCO VENTURES 53
14 INTL COMPANY INTERNATIONAL COMPANY 50
15 SAMEERA HEAT AND ENERGY PROPERTY FUND SAMEERA HEAT AND ENERGY FUND 100
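For intuition on the scorer choice: token_set_ratio tokenizes, lowercases, and ignores token order, which is why the reordered 'Singh, Akil Kumar' still scores 100 while a plain character-level ratio would not:
from fuzzywuzzy import fuzz

print(fuzz.token_set_ratio('AKIL KUMAR SINGH', 'Singh, Akil Kumar'))  # 100
print(fuzz.ratio('AKIL KUMAR SINGH', 'Singh, Akil Kumar'))            # noticeably lower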
I think you can't do this without some notion of string distance; what you can do is use, for example, record linkage.
I won't go into details, but I'll show you an example of usage on this case.
import pandas as pd
import recordlinkage as rl
from recordlinkage.preprocessing import clean
# creating first dataframe
df_text_a = pd.DataFrame({
    "Text A": [
        "AKIL KUMAR SINGH",
        "OUSMANI DJIBO",
        "PETER HRYB",
        "CNOC LIMITED",
        "POLY NOVA INDUSTRIES LTD",
        "SAM GAWED JR",
        "ADAN GENERAL LLC",
        "CHINA MOBLE LIMITED",
        "CASTAR CO., LTD.",
        "MURAN",
        "OLD SAROOP FOR CAR SEAT COVERS",
        "CNP HEALTHCARE, LLC",
        "GLORY PACK LTD",
        "AUNCO VENTURES",
        "INTERNATIONAL COMPANY",
        "SAMEERA HEAT AND ENERGY FUND",
    ]
})
# creating second dataframe
df_text_b = pd.DataFrame({
    "Text B": [
        "Singh, Akil Kumar",
        "DJIBO, Ousmani Illiassou",
        "HRYB, Peter",
        "CNOOC LIMITED",
        "POLYNOVA INDUSTRIES LTD. ",
        "GAWED, SAM",
        "ADAN GENERAL TRADING FZE",
        "CHINA MOBILE LIMITED",
        "CASTAR GROUP CO., LTD.",
        "MURMAN ",
        "Old Saroop for Car Seat Covers",
        "CNP HEATHCARE, LLC",
        "GLORY PACK LTD.",
        "AUNCO VENTURE",
        "INTL COMPANY",
        "SAMEERA HEAT AND ENERGY PROPERTY FUND",
    ]
})
# preprocessing is very important for the results; you have to find what fits your problem well.
cleaned_a = pd.DataFrame(clean(df_text_a["Text A"], lowercase=True))
cleaned_b = pd.DataFrame(clean(df_text_b["Text B"], lowercase=True))
# creating an index which will be used for comparison; there are various types of indexing, see the documentation.
indexer = rl.Index()
indexer.full()
# generating all possible pairs
pairs = indexer.index(cleaned_a, cleaned_b)
# starting evaluation phase
compare = rl.Compare(n_jobs=-1)
compare.string("Text A", "Text B", method='jarowinkler', label = 'text')
matches = compare.compute(pairs, cleaned_a, cleaned_b)
matches is now a DataFrame with a MultiIndex; what you want to do next is, for each value of the first index level, find the max over the second level. That will give you the results you need.
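A minimal sketch of that last step, assuming the single comparison column is labeled 'text' as above:
# for each record of cleaned_a, keep the cleaned_b candidate with the top score
best_pairs = matches.groupby(level=0)['text'].idxmax()
best_matches = matches.loc[best_pairs]
print(best_matches)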
Results can be improved working on distance, indexing and/or preprocessing.
I have a dataframe which looks like -
ML_ENTITY_NAME EDT_ENTITY_NAME
1 ABC BANK HABIB METROPOLITAN BANK
2 ABC BANK HABIB METROPOLITIAN BANK
3 BANK OF AMERICA HSBC BANK MALAYSIA BHD
4 BANK OF AMERICA HSBC BANK MALAYSIA SDN BHD
5 BANK OF NEW ZEALAND HUA NAN COMMERCIAL BANK
6 BANK OF NEW ZEALAND HUA NAN COMMERCIAL BANK LTD
7 CITIBANK N.A. CHINA GUANGFA BANK CO LTD
8 CITIBANK N.A. CHINA GUANGFA BANK CO.,LTD
9 SECURITY BANK CORP. SECURITY BANK CORP
10 SIAM COMMERCIAL BANK THE SIAM COMMERCIAL BANK PCL
11 TEMU ANZ BANK SAMOA LTD
I have written a Levenshtein function which looks like -
import Levenshtein

def fm(s1, s2):
    score = Levenshtein.distance(s1, s2)
    if score == 0.0:
        score = 1.0
    else:
        score = 1 - (score / len(s1))
    return score
I want to write code so that if the Levenshtein score of two EDT_ENTITY_NAME values is greater than .75, we drop the value with the shorter length and retain the longer one. The ML_ENTITY_NAME for the comparison should also be the same.
My final output should look like -
ML_ENTITY_NAME EDT_ENTITY_NAME
1 ABC BANK HABIB METROPOLITIAN BANK
2 BANK OF AMERICA HSBC BANK MALAYSIA SDN BHD
3 BANK OF NEW ZEALAND HUA NAN COMMERCIAL BANK LTD
4 CITIBANK N.A. CHINA GUANGFA BANK CO.,LTD
5 SECURITY BANK CORP. SECURITY BANK CORP
6 SIAM COMMERCIAL BANK THE SIAM COMMERCIAL BANK PCL
7 TEMU ANZ BANK SAMOA LTD
Currently my approach is to sort the df, iterate over it in a loop, and if the ML_ENTITY_NAME values are the same, calculate the Levenshtein score for EDT_ENTITY_NAME. I have added a new column delete, and I update the delete column to 1 if the above conditions are satisfied and the length of one EDT_ENTITY_NAME is smaller than the other.
My code looks like -
df.sort_values(by=['ML_ENTITY_NAME', 'EDT_ENTITY_NAME'], inplace=True)
df['delete'] = 0
for row1 in df.itertuples():
    for row2 in df.itertuples():
        if (str(row1.ML_ENTITY_NAME) == str(row2.ML_ENTITY_NAME)) and (1 > fm(str(row1.EDT_ENTITY_NAME), str(row2.EDT_ENTITY_NAME)) > .74):
            if len(row1.EDT_ENTITY_NAME) > len(row2.EDT_ENTITY_NAME):
                df.loc[row2.Index, row2[2]] = 1
print(df)
Currently it's giving wrong output.
Can someone help me with some answers/hints/suggestions?
I believe you need:
#cross join by ML_ENTITY_NAME column
df1 = df.merge(df, on='ML_ENTITY_NAME', how='outer')
#remove same values per rows (distance 1)
df1 = df1[df1['EDT_ENTITY_NAME_x'] != df1['EDT_ENTITY_NAME_y']]
#apply function and compare
m1 = df1.apply(lambda x: fm(x['EDT_ENTITY_NAME_x'], x['EDT_ENTITY_NAME_y']), axis=1) > .75
m2 = df1['EDT_ENTITY_NAME_x'].str.len() > df1['EDT_ENTITY_NAME_y'].str.len()
#filtering
df2 = df1.loc[m1 & m2, ['ML_ENTITY_NAME','EDT_ENTITY_NAME_x']]
#remove `_x`
df2.columns = df2.columns.str.replace('_x$', '', regex=True)
#add unique rows per ML_ENTITY_NAME
df2 = df2.append(df[~df['ML_ENTITY_NAME'].duplicated(keep=False)]).reset_index(drop=True)
print (df2)
ML_ENTITY_NAME EDT_ENTITY_NAME
0 ABC BANK HABIB METROPOLITIAN BANK
1 BANK OF AMERICA HSBC BANK MALAYSIA SDN BHD
2 BANK OF NEW ZEALAND HUA NAN COMMERCIAL BANK LTD
3 CITIBANK N.A. CHINA GUANGFA BANK CO.,LTD
4 SECURITY BANK CORP. SECURITY BANK CORP
5 SIAM COMMERCIAL BANK THE SIAM COMMERCIAL BANK PCL
6 TEMU ANZ BANK SAMOA LTD
Could you specify what exactly is wrong about the output you are getting? The only deviation from your goal I see in the code is that you only set the delete flag to 1 for row pairs with 0.74 < fm(...) < 1, while it should rather be 0.75 < fm(...).
As a side note, sorting is redundant in your code, since you end up comparing every possible pair of rows anyway. What you possibly had in mind when implementing the sorting was going through each consecutive pair of rows, which would improve the complexity of your code from O(n²) to O(n).
Another side note is that you don't need the if statement in your fm function: the statement score = 1 - score / len(s1) covers both cases.
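For illustration, the simplified function from that side note would look like this (assuming the same Levenshtein import as in the question):
def fm(s1, s2):
    # identical strings give distance 0, hence a score of exactly 1.0
    return 1 - Levenshtein.distance(s1, s2) / len(s1)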
I have two dataframes as shown below.
Company Name BOD Position Ethnicity DOB Age Gender Degree ( Specialazation) Remark
0 Big Lots Inc. David J. Campisi Director, President and Chief Executive Offic... American 1956 61 Male Graduate NaN
1 Big Lots Inc. Philip E. Mallott Chairman of the Board American 1958 59 Male MBA, Finace NaN
2 Big Lots Inc. James R. Chambers Independent Director American 1958 59 Male MBA NaN
3 Momentive Performance Materials Inc Mahesh Balakrishnan director Asian 1983 34 Male BA Economics NaN
Company Name Net Sale Gross Profit Remark
0 Big Lots Inc. 5.2B 2.1B NaN
1 Momentive Performance Materials Inc 544M 146m NaN
2 Markel Corporation 5.61B 2.06B NaN
3 Noble Energy, Inc. 3.49B 2.41B NaN
4 Leidos Holding, Inc. 7.04B 852M NaN
I want to create a new dataframe from these two, so that the 2nd dataframe gets new columns with the count of each ethnicity per company, such as American - 2, Mexican - 5, and so on, so that later on I can calculate a diversity score.
The columns of the output dataframe would look like:
Company Name Net Sale Gross Profit Remark American Mexican German .....
Big Lots Inc. 5.2B 2.1B NaN 2 0 5 ....
First get counts per group by groupby with size and unstack, then join to the second DataFrame:
df1 = pd.DataFrame({'Company Name':list('aabcac'),
'Ethnicity':['American'] * 3 + ['Mexican'] * 3})
df1 = df1.groupby(['Company Name', 'Ethnicity']).size().unstack(fill_value=0)
#slower alternative
#df1 = pd.crosstab(df1['Company Name'], df1['Ethnicity'])
print (df1)
Ethnicity American Mexican
Company Name
a 2 1
b 1 0
c 0 2
df2 = pd.DataFrame({'Company Name':list('abc')})
print (df2)
Company Name
0 a
1 b
2 c
df3 = df2.join(df1, on=['Company Name'])
print (df3)
Company Name American Mexican
0 a 2 1
1 b 1 0
2 c 0 2
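One caveat worth noting: companies present only in the second dataframe (like Markel Corporation in the question's data) would get NaN counts after the join, so filling them with 0 keeps the diversity-score math numeric:
# companies with no rows in df1 come back as NaN after the join
df3 = df2.join(df1, on='Company Name').fillna(0)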
EDIT: You need to replace the unit suffix with zeros and convert to floats:
print (df)
Name sale
0 A 100M
1 B 200M
2 C 5M
3 D 40M
4 E 10B
5 F 2B
d = {'M': '0'*6, 'B': '0'*9}
df['a'] = df['sale'].replace(d, regex=True).astype(float)
print (df)
Name sale a
0 A 100M 1.000000e+08
1 B 200M 2.000000e+08
2 C 5M 5.000000e+06
3 D 40M 4.000000e+07
4 E 10B 1.000000e+10
5 F 2B 2.000000e+09
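Note that the zero-padding trick assumes whole-number amounts; a value like '5.2B' would become '5.2000000000', which parses as 5.2 rather than 5.2e9. For decimal values, a multiplier map is a safer sketch (assuming every value ends in M or B):
# map each suffix to a multiplier and scale the numeric part
mult = {'M': 1e6, 'B': 1e9}
df['a'] = df['sale'].apply(lambda s: float(s[:-1]) * mult[s[-1]])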
I am in the learning phase of analyzing data using Python and have stumbled upon a doubt.
Consider the following data set:
print (df)
CITY OCCUPATION
0 BANGALORE MECHANICAL ENGINEER
1 BANGALORE COMPUTER SCIENCE ENGINEER
2 BANGALORE MECHANICAL ENGINEER
3 BANGALORE COMPUTER SCIENCE ENGINEER
4 BANGALORE COMPUTER SCIENCE ENGINEER
5 MUMBAI ACTOR
6 MUMBAI ACTOR
7 MUMBAI SHARE BROKER
8 MUMBAI SHARE BROKER
9 MUMBAI ACTOR
10 CHENNAI RETIRED
11 CHENNAI LAND DEVELOPER
12 CHENNAI MECHANICAL ENGINEER
13 CHENNAI MECHANICAL ENGINEER
14 CHENNAI MECHANICAL ENGINEER
15 DELHI PHYSICIAN
16 DELHI PHYSICIAN
17 DELHI JOURNALIST
18 DELHI JOURNALIST
19 DELHI ACTOR
20 PUNE MANAGER
21 PUNE MANAGER
22 PUNE MANAGER
How do I get the most common occupation for each city using pandas?
e.g.:
CITY OCCUPATION
----------------
BANGALORE - COMPUTER SCIENCE ENGINEER
-----------------------------------
MUMBAI - ACTOR
------------
First solution is groupby with Counter and most_common:
For DELHI the count is 2 for both JOURNALIST and PHYSICIAN, hence the difference between the outputs of the two solutions.
from collections import Counter

df1 = (df.groupby('CITY').OCCUPATION
         .apply(lambda x: Counter(x).most_common(1)[0][0])
         .reset_index())
print (df1)
CITY OCCUPATION
0 BANGALORE COMPUTER SCIENCE ENGINEER
1 CHENNAI MECHANICAL ENGINEER
2 DELHI PHYSICIAN
3 MUMBAI ACTOR
4 PUNE MANAGER
Another solution with groupby, size and nlargest:
df1 = (df.groupby(['CITY', 'OCCUPATION'])
         .size()
         .groupby(level=0)
         .nlargest(1)
         .reset_index(level=0, drop=True)
         .reset_index(name='a')
         .drop('a', axis=1))
print (df1)
CITY OCCUPATION
0 BANGALORE COMPUTER SCIENCE ENGINEER
1 CHENNAI MECHANICAL ENGINEER
2 DELHI JOURNALIST
3 MUMBAI ACTOR
4 PUNE MANAGER
EDIT:
For debugging, here is a custom function that does the same as the lambda function:
from collections import Counter

def f(x):
    #print Series
    print (x)
    #count values by Counter
    print (Counter(x).most_common())
    #get first top value - a list with one tuple
    print (Counter(x).most_common(1))
    #select the list item by indexing [0] - output is a tuple
    print (Counter(x).most_common(1)[0])
    #select the first value of the tuple by another [0]
    #for selecting the count use [1] instead of [0]
    print (Counter(x).most_common(1)[0][0])
    return Counter(x).most_common(1)[0][0]

df1 = df.groupby('CITY').OCCUPATION.apply(f).reset_index()
print (df1)
print (df1)
CITY OCCUPATION
0 BANGALORE COMPUTER SCIENCE ENGINEER
1 CHENNAI MECHANICAL ENGINEER
2 DELHI JOURNALIST
3 MUMBAI ACTOR
4 PUNE MANAGER
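As a further alternative, value_counts can replace Counter, since it sorts counts in descending order; for DELHI's tie it may pick either occupation, like the first solution:
# value_counts sorts descending, so index[0] is the most frequent occupation
df1 = (df.groupby('CITY').OCCUPATION
         .apply(lambda x: x.value_counts().index[0])
         .reset_index())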