I am trying to scrape data, write it to a pandas Series, and then enter a while loop for the remaining pages of the website, appending to the original Series (located outside of the while loop) after each iteration. I'm not sure why this isn't working. Here's where I'm stuck:
import requests
import pandas as pd
from bs4 import BeautifulSoup

current_url = 'https://www.yellowpages.com/search?search_terms=hvac&geo_location_terms=97080'

def get_data_run(current_url):
    company_names1 = get_company_name(current_url)
    print(company_names1) #1
    page = 1
    max_page = 3
    company_names1 = paginate(current_url, page, max_page, company_names1)
    print(company_names1) #2

def paginate(current_url, page, max_page, company_names1):
    while (page <= max_page):
        new_url = current_url + f"&page={page}"
        print(new_url)
        company_names = get_company_name(new_url)
        company_names1.append(company_names)
        print(company_names) #3
        print(company_names1) #4
        page += 1
        if page == max_page:
            return company_names1

def get_company_name(url):
    company_names = []
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'lxml')
    box = list(soup.findAll("div", {"class": "result"}))
    for i in range(len(box)):
        try:
            company_names.append(box[i].find("a", {"class": "business-name"}).text.strip())
        except Exception:
            company_names.append("null")
        else:
            continue
    company_names = pd.Series(company_names, dtype='string')
    return company_names

get_data_run(current_url)
I've labeled the different prints. Every time company_names1 is printed (#1, #2, #4) it shows the same Series of companies, even after appending company_names inside the while loop. The thing I can't understand is that when I print company_names (#3) it prints the next page of company names. I don't understand why it's not appending inside the while loop, and then why it's not returning successfully out of the function and printing the combined Series in the #2 print. Thanks!
UPDATE:
Here is some sample output:
when I print #3:
(pyfinance) justinbenfit@MacBook-Pro-3 yellowpages_scrape % /usr/local/anaconda3/envs/pyfinance/bin/python /Users/justinbenfit/Desktop/yellowpages_scrape/test.py
0 Honke Heating & Air Conditioning
1 Climate Kings Heating & Ac
2 Mike's Truck & Auto Service
3 One Hour Heating & Air Conditioning
4 Morgan Heating & Cooling Inc
5 Rnr Heating Venting & Air Conditioning
6 Universal HVAC Inc
7 Mr Furnace
8 Affordable Excellence Heating
9 Green Air Products
10 David Eugene Neketin
11 Century Heating & Air Cond
12 Appliance Wizard
13 Precision Energy Solutions Inc.
14 Portland Heating & Air Conditioning Co
15 Mhc
16 American Pride Heating and Cooling, LLC
17 Tri Star Western
18 Comfort Zone Heat & Air Inc
19 Don's Air-Care Inc
20 Chuck's Heating & Cooling
21 Mt. Hood Heating Cooling & Refrigeration
22 Chuck's Heating & Cooling
23 Mr. Furnace
24 America's Same Day Service
25 Arctic Commercial Refrigeration LLC
26 Apex Refrigeration
27 Ben's Heating & Air Conditioning LLC
28 David's Appliance Inc
29 Wolcott Heating & Cooling
dtype: string
0 Air-Trix
1 Johnstone Supply
2 Buss Heating & Cooling Inc
3 The Heat Exchange
4 Hoodview Heating & Air Conditioning
5 Loomis Heating Cooling & Refrigeration
6 All About Air Heating & Cooling
7 Hanson Heating
8 Sparks Heating & Cooling
9 Interior Comfort Systems
10 P D X Heating & Cooling
11 Apcom Power Inc
12 Area Heating Inc
13 Four Seasons Heating Air Conditioning & Servic...
14 Perfect Climate Inc
15 Combustion Consultants Inc
16 Classic Heat Source, Inc.
17 Multnomah Heating, Inc
18 Apollo Plumbing, Heating & Air Conditioning - OR
19 Art's Furnace & Air Cond
20 Kurchel Heating
21 P & O Construction Inc
22 Systems Management NW
23 Bridgetown Heating
24 Amana Heating & Air Conditioning Systems
25 QualitySmith
26 Wilbert Jr, Wilson
27 Faith Heating & Air Conditioning Inc
28 Northwest Commercial Heating & Air Conditionin...
29 Heat Master Corp
dtype: string
when I print #1, #2, and #4
0 Honke Heating & Air Conditioning
1 Climate Kings Heating & Ac
2 Mike's Truck & Auto Service
3 One Hour Heating & Air Conditioning
4 Morgan Heating & Cooling Inc
5 Rnr Heating Venting & Air Conditioning
6 Universal HVAC Inc
7 Mr Furnace
8 Affordable Excellence Heating
9 Green Air Products
10 David Eugene Neketin
11 Century Heating & Air Cond
12 Appliance Wizard
13 Precision Energy Solutions Inc.
14 Portland Heating & Air Conditioning Co
15 Mhc
16 American Pride Heating and Cooling, LLC
17 Tri Star Western
18 Comfort Zone Heat & Air Inc
19 Don's Air-Care Inc
20 Chuck's Heating & Cooling
21 Chuck's Heating & Cooling
22 Mr. Furnace
23 Mt. Hood Heating Cooling & Refrigeration
24 America's Same Day Service
25 Arctic Commercial Refrigeration LLC
26 Apex Refrigeration
27 Ben's Heating & Air Conditioning LLC
28 David's Appliance Inc
29 Wolcott Heating & Cooling
dtype: string
The problem is that you're treating a pd.Series like a list. A list is mutated in place by append, whereas Series.append returns a new Series and leaves the original unchanged. Appending data to a list works like this:
lst = [1,2,3]
lst.append(4)
print(lst)
# [1, 2, 3, 4]
The object changes without having to explicitly assign it. If you do the same with Series, this happens:
series = pd.Series([1,2,3])
series.append(pd.Series([4]))
print(series)
The output is:
0 1
1 2
2 3
dtype: int64
So, to update a Series, you have to assign the result back, either replacing the original object or creating a new one. Without the assignment, the appended result is simply discarded:
series = pd.Series([1,2,3])
series = series.append(pd.Series([4]))
print(series)
Output:
0 1
1 2
2 3
0 4
dtype: int64
In the case of your code, the problem lies in the paginate function; you should change this line:
company_names1.append(company_names)
to:
company_names1 = company_names1.append(company_names)
And everything should work.
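Note that Series.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions the same reassignment pattern uses pd.concat instead; a minimal sketch:
import pandas as pd

series = pd.Series([1, 2, 3])
# pd.concat also returns a new object, so the assignment is still required
series = pd.concat([series, pd.Series([4])], ignore_index=True)
print(series)  # 1, 2, 3, 4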
I need help matching phrases in the data given below, where I need to match phrases from TextA against TextB.
The following code did not help me do it. How can I address this? I have hundreds of them to match.
#sorting jumbled phrases
def sorts(string_value):
    sorted_string = sorted(string_value.split())
    sorted_string = ' '.join(sorted_string)
    return sorted_string

#Removing punctuations in string
punc = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''
def punt(test_str):
    for ele in test_str:
        if ele in punc:
            test_str = test_str.replace(ele, "")
    return test_str

#matching strings
def lets_match(x):
    for text1 in TextA:
        for text2 in TextB:
            try:
                if sorts(punt(x[text1.casefold()])) == sorts(punt(x[text2.casefold()])):
                    return True
            except Exception:
                continue
    return False

df['result'] = df.apply(lets_match, axis=1)
Even after implementing string sorting, punctuation removal, and case folding, I am still getting those strings as not matching. Am I missing something here? Can someone help me achieve it?
Actually you can use difflib to match two texts; here's what you can try:
from difflib import SequenceMatcher

def similar(a, b):
    a = str(a).lower()
    b = str(b).lower()
    return SequenceMatcher(None, a, b).ratio()

def lets_match(d):
    print(d[0], " --- ", d[1])
    result = similar(d[0], d[1])
    print(result)
    if result > 0.6:
        return True
    else:
        return False

df["result"] = df.apply(lets_match, axis=1)
You can play with the result > 0.6 threshold.
For more information about difflib you can check its documentation. There are other sequence matchers too, like textdistance, but I found this one easy, so I tried it.
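To get a feel for the threshold, here is a quick demo of similar() on a few of the sample pairs (the scores are approximate):
print(similar("CNOC LIMITED", "CNOOC LIMITED"))  # roughly 0.96, well above 0.6
print(similar("MURAN", "MURMAN"))                # roughly 0.91
print(similar("MURAN", "GLORY PACK LTD."))       # far below the threshold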
Are there any issues with using the fuzzy match lib? The implementation is pretty straightforward and works well given the above data is relatively similar. I've performed the below without preprocessing.
import pandas as pd
""" Install the libs below via terminal:
$pip install fuzzywuzzy
$pip install python-Levenshtein
"""
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
#creating the data frames
text_a = ['AKIL KUMAR SINGH','OUSMANI DJIBO','PETER HRYB','CNOC LIMITED','POLY NOVA INDUSTRIES LTD','SAM GAWED JR','ADAN GENERAL LLC','CHINA MOBLE LIMITED','CASTAR CO., LTD.','MURAN','OLD SAROOP FOR CAR SEAT COVERS','CNP HEALTHCARE, LLC','GLORY PACK LTD','AUNCO VENTURES','INTERNATIONAL COMPANY','SAMEERA HEAT AND ENERGY FUND']
text_b = ['Singh, Akil Kumar','DJIBO, Ousmani Illiassou','HRYB, Peter','CNOOC LIMITED','POLYNOVA INDUSTRIES LTD.','GAWED, SAM','ADAN GENERAL TRADING FZE','CHINA MOBILE LIMITED','CASTAR GROUP CO., LTD.','MURMAN','Old Saroop for Car Seat Covers','CNP HEATHCARE, LLC','GLORY PACK LTD.','AUNCO VENTURE','INTL COMPANY','SAMEERA HEAT AND ENERGY PROPERTY FUND']
df_text_a = pd.DataFrame(text_a, columns=['text_a'])
df_text_b = pd.DataFrame(text_b, columns=['text_b'])
def lets_match(txt: str, chklist: list) -> tuple:
    return process.extractOne(txt, chklist, scorer=fuzz.token_set_ratio)
#match Text_A against Text_B
result_txt_ab = df_text_a.apply(lambda x: lets_match(str(x), text_b), axis=1, result_type='expand')
result_txt_ab.rename(columns={0:'Return Match', 1:'Match Value'}, inplace=True)
df_text_a[result_txt_ab.columns]=result_txt_ab
df_text_a
text_a Return Match Match Value
0 AKIL KUMAR SINGH Singh, Akil Kumar 100
1 OUSMANI DJIBO DJIBO, Ousmani Illiassou 72
2 PETER HRYB HRYB, Peter 100
3 CNOC LIMITED CNOOC LIMITED 70
4 POLY NOVA INDUSTRIES LTD POLYNOVA INDUSTRIES LTD. 76
5 SAM GAWED JR GAWED, SAM 100
6 ADAN GENERAL LLC ADAN GENERAL TRADING FZE 67
7 CHINA MOBLE LIMITED CHINA MOBILE LIMITED 79
8 CASTAR CO., LTD. CASTAR GROUP CO., LTD. 81
9 MURAN SAMEERA HEAT AND ENERGY PROPERTY FUND 41
10 OLD SAROOP FOR CAR SEAT COVERS Old Saroop for Car Seat Covers 100
11 CNP HEALTHCARE, LLC CNP HEATHCARE, LLC 58
12 GLORY PACK LTD GLORY PACK LTD. 100
13 AUNCO VENTURES AUNCO VENTURE 56
14 INTERNATIONAL COMPANY INTL COMPANY 74
15 SAMEERA HEAT AND ENERGY FUND SAMEERA HEAT AND ENERGY PROPERTY FUND 86
#match Text_B against Text_A
result_txt_ba= df_text_b.apply(lambda x: lets_match(str(x), text_a), axis=1, result_type='expand')
result_txt_ba.rename(columns={0:'Return Match', 1:'Match Value'}, inplace=True)
df_text_b[result_txt_ba.columns]=result_txt_ba
df_text_b
text_b Return Match Match Value
0 Singh, Akil Kumar AKIL KUMAR SINGH 100
1 DJIBO, Ousmani Illiassou OUSMANI DJIBO 100
2 HRYB, Peter PETER HRYB 100
3 CNOOC LIMITED CNOC LIMITED 74
4 POLYNOVA INDUSTRIES LTD. POLY NOVA INDUSTRIES LTD 74
5 GAWED, SAM SAM GAWED JR 86
6 ADAN GENERAL TRADING FZE ADAN GENERAL LLC 86
7 CHINA MOBILE LIMITED CHINA MOBLE LIMITED 81
8 CASTAR GROUP CO., LTD. CASTAR CO., LTD. 100
9 MURMAN ADAN GENERAL LLC 33
10 Old Saroop for Car Seat Covers OLD SAROOP FOR CAR SEAT COVERS 100
11 CNP HEATHCARE, LLC CNP HEALTHCARE, LLC 56
12 GLORY PACK LTD. GLORY PACK LTD 100
13 AUNCO VENTURE AUNCO VENTURES 53
14 INTL COMPANY INTERNATIONAL COMPANY 50
15 SAMEERA HEAT AND ENERGY PROPERTY FUND SAMEERA HEAT AND ENERGY FUND 100
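For intuition on the scorer choice: token_set_ratio tokenizes, lowercases, and ignores token order, which is why the reordered 'Singh, Akil Kumar' still scores 100 while a plain character-level ratio would not:
from fuzzywuzzy import fuzz

print(fuzz.token_set_ratio('AKIL KUMAR SINGH', 'Singh, Akil Kumar'))  # 100
print(fuzz.ratio('AKIL KUMAR SINGH', 'Singh, Akil Kumar'))            # noticeably lower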
I think you can't do this without some notion of string distance; what you can do is use, for example, record linkage.
I won't go into details, but I'll show you an example of usage on this case.
import pandas as pd
import recordlinkage as rl
from recordlinkage.preprocessing import clean
# creating first dataframe
df_text_a = pd.DataFrame({
    "Text A": [
        "AKIL KUMAR SINGH",
        "OUSMANI DJIBO",
        "PETER HRYB",
        "CNOC LIMITED",
        "POLY NOVA INDUSTRIES LTD",
        "SAM GAWED JR",
        "ADAN GENERAL LLC",
        "CHINA MOBLE LIMITED",
        "CASTAR CO., LTD.",
        "MURAN",
        "OLD SAROOP FOR CAR SEAT COVERS",
        "CNP HEALTHCARE, LLC",
        "GLORY PACK LTD",
        "AUNCO VENTURES",
        "INTERNATIONAL COMPANY",
        "SAMEERA HEAT AND ENERGY FUND",
    ]
})
# creating second dataframe
df_text_b = pd.DataFrame({
    "Text B": [
        "Singh, Akil Kumar",
        "DJIBO, Ousmani Illiassou",
        "HRYB, Peter",
        "CNOOC LIMITED",
        "POLYNOVA INDUSTRIES LTD. ",
        "GAWED, SAM",
        "ADAN GENERAL TRADING FZE",
        "CHINA MOBILE LIMITED",
        "CASTAR GROUP CO., LTD.",
        "MURMAN ",
        "Old Saroop for Car Seat Covers",
        "CNP HEATHCARE, LLC",
        "GLORY PACK LTD.",
        "AUNCO VENTURE",
        "INTL COMPANY",
        "SAMEERA HEAT AND ENERGY PROPERTY FUND",
    ]
})
# preprocessing is very important for the results; you have to find what fits your problem well.
cleaned_a = pd.DataFrame(clean(df_text_a["Text A"], lowercase=True))
cleaned_b = pd.DataFrame(clean(df_text_b["Text B"], lowercase=True))
# creating an index which will be used for comparison; there are various types of indexing, see the documentation.
indexer = rl.Index()
indexer.full()
# generating all possible pairs
pairs = indexer.index(cleaned_a, cleaned_b)
# starting evaluation phase
compare = rl.Compare(n_jobs=-1)
compare.string("Text A", "Text B", method='jarowinkler', label = 'text')
matches = compare.compute(pairs, cleaned_a, cleaned_b)
matches is now a DataFrame with a MultiIndex; what you want to do next is, for each value of the first index level, find the max over the second level. That will give you the results you need.
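A minimal sketch of that last step, assuming the single comparison column is labeled 'text' as above:
# for each record of cleaned_a, keep the cleaned_b candidate with the top score
best_pairs = matches.groupby(level=0)['text'].idxmax()
best_matches = matches.loc[best_pairs]
print(best_matches)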
Results can be improved working on distance, indexing and/or preprocessing.
I have a dataframe which looks like -
ML_ENTITY_NAME EDT_ENTITY_NAME
1 ABC BANK HABIB METROPOLITAN BANK
2 ABC BANK HABIB METROPOLITIAN BANK
3 BANK OF AMERICA HSBC BANK MALAYSIA BHD
4 BANK OF AMERICA HSBC BANK MALAYSIA SDN BHD
5 BANK OF NEW ZEALAND HUA NAN COMMERCIAL BANK
6 BANK OF NEW ZEALAND HUA NAN COMMERCIAL BANK LTD
7 CITIBANK N.A. CHINA GUANGFA BANK CO LTD
8 CITIBANK N.A. CHINA GUANGFA BANK CO.,LTD
9 SECURITY BANK CORP. SECURITY BANK CORP
10 SIAM COMMERCIAL BANK THE SIAM COMMERCIAL BANK PCL
11 TEMU ANZ BANK SAMOA LTD
I have written a Levenshtein function which looks like -
import Levenshtein

def fm(s1, s2):
    score = Levenshtein.distance(s1, s2)
    if score == 0.0:
        score = 1.0
    else:
        score = 1 - (score / len(s1))
    return score
I want to write code so that if the Levenshtein score of two EDT_ENTITY_NAME values is greater than .75, we drop the value with the shorter length and retain the longer one. The ML_ENTITY_NAME for the comparison should also be the same.
My final output should look like -
ML_ENTITY_NAME EDT_ENTITY_NAME
1 ABC BANK HABIB METROPOLITIAN BANK
2 BANK OF AMERICA HSBC BANK MALAYSIA SDN BHD
3 BANK OF NEW ZEALAND HUA NAN COMMERCIAL BANK LTD
4 CITIBANK N.A. CHINA GUANGFA BANK CO.,LTD
5 SECURITY BANK CORP. SECURITY BANK CORP
6 SIAM COMMERCIAL BANK THE SIAM COMMERCIAL BANK PCL
7 TEMU ANZ BANK SAMOA LTD
Currently my approach is to sort the df, iterate over it in a loop, and if the ML_ENTITY_NAME values are the same, calculate the Levenshtein score for EDT_ENTITY_NAME. I have added a new column delete, and I update the delete column to 1 if the above conditions are satisfied and the length of one EDT_ENTITY_NAME is smaller than the other.
My code looks like -
df.sort_values(by=['ML_ENTITY_NAME', 'EDT_ENTITY_NAME'], inplace=True)
df['delete'] = 0
for row1 in df.itertuples():
    for row2 in df.itertuples():
        if (str(row1.ML_ENTITY_NAME) == str(row2.ML_ENTITY_NAME)) and (1 > fm(str(row1.EDT_ENTITY_NAME), str(row2.EDT_ENTITY_NAME)) > .74):
            if len(row1.EDT_ENTITY_NAME) > len(row2.EDT_ENTITY_NAME):
                df.loc[row2.Index, row2[2]] = 1
print(df)
Currently it's giving wrong output.
Can someone help me with some answers/hints/suggestions?
I believe you need:
#cross join by ML_ENTITY_NAME column
df1 = df.merge(df, on='ML_ENTITY_NAME', how='outer')
#remove same values per rows (distance 1)
df1 = df1[df1['EDT_ENTITY_NAME_x'] != df1['EDT_ENTITY_NAME_y']]
#apply function and compare
m1 = df1.apply(lambda x: fm(x['EDT_ENTITY_NAME_x'], x['EDT_ENTITY_NAME_y']), axis=1) > .75
m2 = df1['EDT_ENTITY_NAME_x'].str.len() > df1['EDT_ENTITY_NAME_y'].str.len()
#filtering
df2 = df1.loc[m1 & m2, ['ML_ENTITY_NAME','EDT_ENTITY_NAME_x']]
#remove `_x`
df2.columns = df2.columns.str.replace('_x$', '', regex=True)
#add unique rows per ML_ENTITY_NAME
df2 = df2.append(df[~df['ML_ENTITY_NAME'].duplicated(keep=False)]).reset_index(drop=True)
print (df2)
ML_ENTITY_NAME EDT_ENTITY_NAME
0 ABC BANK HABIB METROPOLITIAN BANK
1 BANK OF AMERICA HSBC BANK MALAYSIA SDN BHD
2 BANK OF NEW ZEALAND HUA NAN COMMERCIAL BANK LTD
3 CITIBANK N.A. CHINA GUANGFA BANK CO.,LTD
4 SECURITY BANK CORP. SECURITY BANK CORP
5 SIAM COMMERCIAL BANK THE SIAM COMMERCIAL BANK PCL
6 TEMU ANZ BANK SAMOA LTD
Could you specify what exactly is wrong about the output you are getting? The only deviation from your goal I see in the code is that you only set the delete flag to 1 for row pairs with 0.74 < fm(...) < 1, while it should rather be 0.75 < fm(...).
As a side note, sorting is redundant in your code, since you end up comparing every possible pair of rows anyway. What you possibly had in mind when implementing the sorting was going through each consecutive pair of rows, which would improve the complexity of your code from O(n²) to O(n).
Another side note is that you don't need the if statement in your fm function: the statement score = 1 - score / len(s1) covers both cases.
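For illustration, the simplified function from that side note would look like this (assuming the same Levenshtein import as in the question):
def fm(s1, s2):
    # identical strings give distance 0, hence a score of exactly 1.0
    return 1 - Levenshtein.distance(s1, s2) / len(s1)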
I have two dataframes as shown below.
Company Name BOD Position Ethnicity DOB Age Gender Degree ( Specialazation) Remark
0 Big Lots Inc. David J. Campisi Director, President and Chief Executive Offic... American 1956 61 Male Graduate NaN
1 Big Lots Inc. Philip E. Mallott Chairman of the Board American 1958 59 Male MBA, Finace NaN
2 Big Lots Inc. James R. Chambers Independent Director American 1958 59 Male MBA NaN
3 Momentive Performance Materials Inc Mahesh Balakrishnan director Asian 1983 34 Male BA Economics NaN
Company Name Net Sale Gross Profit Remark
0 Big Lots Inc. 5.2B 2.1B NaN
1 Momentive Performance Materials Inc 544M 146m NaN
2 Markel Corporation 5.61B 2.06B NaN
3 Noble Energy, Inc. 3.49B 2.41B NaN
4 Leidos Holding, Inc. 7.04B 852M NaN
I want to create a new dataframe from these two, so that the 2nd dataframe gets new columns with the count of each ethnicity per company, such as American - 2, Mexican - 5, and so on, so that later on I can calculate a diversity score.
The columns of the output dataframe would look like:
Company Name Net Sale Gross Profit Remark American Mexican German .....
Big Lots Inc. 5.2B 2.1B NaN 2 0 5 ....
First get counts per group by groupby with size and unstack, then join to the second DataFrame:
df1 = pd.DataFrame({'Company Name':list('aabcac'),
'Ethnicity':['American'] * 3 + ['Mexican'] * 3})
df1 = df1.groupby(['Company Name', 'Ethnicity']).size().unstack(fill_value=0)
#slower alternative
#df1 = pd.crosstab(df1['Company Name'], df1['Ethnicity'])
print (df1)
Ethnicity American Mexican
Company Name
a 2 1
b 1 0
c 0 2
df2 = pd.DataFrame({'Company Name':list('abc')})
print (df2)
Company Name
0 a
1 b
2 c
df3 = df2.join(df1, on=['Company Name'])
print (df3)
Company Name American Mexican
0 a 2 1
1 b 1 0
2 c 0 2
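One caveat worth noting: companies present only in the second dataframe (like Markel Corporation in the question's data) would get NaN counts after the join, so filling them with 0 keeps the diversity-score math numeric:
# companies with no rows in df1 come back as NaN after the join
df3 = df2.join(df1, on='Company Name').fillna(0)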
EDIT: You need to replace the unit suffix with zeros and convert to floats:
print (df)
Name sale
0 A 100M
1 B 200M
2 C 5M
3 D 40M
4 E 10B
5 F 2B
d = {'M': '0'*6, 'B': '0'*9}
df['a'] = df['sale'].replace(d, regex=True).astype(float)
print (df)
Name sale a
0 A 100M 1.000000e+08
1 B 200M 2.000000e+08
2 C 5M 5.000000e+06
3 D 40M 4.000000e+07
4 E 10B 1.000000e+10
5 F 2B 2.000000e+09
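Note that the zero-padding trick assumes whole-number amounts; a value like '5.2B' would become '5.2000000000', which parses as 5.2 rather than 5.2e9. For decimal values, a multiplier map is a safer sketch (assuming every value ends in M or B):
# map each suffix to a multiplier and scale the numeric part
mult = {'M': 1e6, 'B': 1e9}
df['a'] = df['sale'].apply(lambda s: float(s[:-1]) * mult[s[-1]])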
I am in the learning phase of analyzing data using Python and have stumbled upon a doubt.
Consider the following data set:
print (df)
CITY OCCUPATION
0 BANGALORE MECHANICAL ENGINEER
1 BANGALORE COMPUTER SCIENCE ENGINEER
2 BANGALORE MECHANICAL ENGINEER
3 BANGALORE COMPUTER SCIENCE ENGINEER
4 BANGALORE COMPUTER SCIENCE ENGINEER
5 MUMBAI ACTOR
6 MUMBAI ACTOR
7 MUMBAI SHARE BROKER
8 MUMBAI SHARE BROKER
9 MUMBAI ACTOR
10 CHENNAI RETIRED
11 CHENNAI LAND DEVELOPER
12 CHENNAI MECHANICAL ENGINEER
13 CHENNAI MECHANICAL ENGINEER
14 CHENNAI MECHANICAL ENGINEER
15 DELHI PHYSICIAN
16 DELHI PHYSICIAN
17 DELHI JOURNALIST
18 DELHI JOURNALIST
19 DELHI ACTOR
20 PUNE MANAGER
21 PUNE MANAGER
22 PUNE MANAGER
How do I get the most common occupation for each city using pandas?
e.g.:
CITY OCCUPATION
----------------
BANGALORE - COMPUTER SCIENCE ENGINEER
-----------------------------------
MUMBAI - ACTOR
------------
First solution is groupby with Counter and most_common:
For DELHI the count is 2 for both JOURNALIST and PHYSICIAN, hence the difference between the outputs of the two solutions.
from collections import Counter

df1 = (df.groupby('CITY').OCCUPATION
         .apply(lambda x: Counter(x).most_common(1)[0][0])
         .reset_index())
print (df1)
CITY OCCUPATION
0 BANGALORE COMPUTER SCIENCE ENGINEER
1 CHENNAI MECHANICAL ENGINEER
2 DELHI PHYSICIAN
3 MUMBAI ACTOR
4 PUNE MANAGER
Another solution with groupby, size and nlargest:
df1 = (df.groupby(['CITY', 'OCCUPATION'])
         .size()
         .groupby(level=0)
         .nlargest(1)
         .reset_index(level=0, drop=True)
         .reset_index(name='a')
         .drop('a', axis=1))
print (df1)
CITY OCCUPATION
0 BANGALORE COMPUTER SCIENCE ENGINEER
1 CHENNAI MECHANICAL ENGINEER
2 DELHI JOURNALIST
3 MUMBAI ACTOR
4 PUNE MANAGER
EDIT:
For debugging, here is a custom function that does the same as the lambda function:
from collections import Counter

def f(x):
    #print Series
    print (x)
    #count values by Counter
    print (Counter(x).most_common())
    #get first top value - a list with one tuple
    print (Counter(x).most_common(1))
    #select the list item by indexing [0] - output is a tuple
    print (Counter(x).most_common(1)[0])
    #select the first value of the tuple by another [0]
    #for selecting the count use [1] instead of [0]
    print (Counter(x).most_common(1)[0][0])
    return Counter(x).most_common(1)[0][0]

df1 = df.groupby('CITY').OCCUPATION.apply(f).reset_index()
print (df1)
print (df1)
CITY OCCUPATION
0 BANGALORE COMPUTER SCIENCE ENGINEER
1 CHENNAI MECHANICAL ENGINEER
2 DELHI JOURNALIST
3 MUMBAI ACTOR
4 PUNE MANAGER
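As a further alternative, value_counts can replace Counter, since it sorts counts in descending order; for DELHI's tie it may pick either occupation, like the first solution:
# value_counts sorts descending, so index[0] is the most frequent occupation
df1 = (df.groupby('CITY').OCCUPATION
         .apply(lambda x: x.value_counts().index[0])
         .reset_index())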