I'm running some analysis on bank statements (CSVs). Some items, like McDonalds, are split across several rows (due to having different addresses).
I'm trying to combine these rows by a common phrase. So for this example the obvious phrase, or string, would be "McDonalds". I think it'll be an if statement.
Also, the column has a dtype of "object". Will I have to convert it to string format?
Here is an example of the output from printing totali = df.Item.value_counts() in my code.
Ideally I'd want that line to output McDonalds as just a single row.
In the csv they are 2 separate rows.
foo 14
Restaurant Boulder CO 8
McDonalds Boulder CO 5
McDonalds Denver CO 5
Here's what the column data consists of
'Sukiya Greenwood Vil CO' 'Sei 34179 Denver CO' 'Chambers Place Liquors 303-3731100 CO' "Mcdonald's F26593 Fort Collins CO" 'Suh Sushi Korean Bbq Fort Collins CO' 'Conoco - Sei 26927 Fort Collins CO'
OK. I think I ginned up something that can be helpful. Realize that the task of inferring categories or names from text strings can be huge, depending on how detailed you want to get. You can dive into regex or other learning models. People make careers of it! Obviously, your bank is doing some of this as they categorize things when you get a year-end summary.
Anyhow, here is a simple way to generate some categories and use them as a basis for the grouping that you want to do.
import pandas as pd
item=['McDonalds Denver', 'Sonoco', 'ATM Fee', 'Sonoco, Ft. Collins', 'McDonalds, Boulder', 'Arco Boulder']
txn = [12.44, 4.00, 3.00, 14.99, 19.10, 52.99]
df = pd.DataFrame([item, txn]).T
df.columns = ['item_orig', 'charge']
print(df)
# let's add an extra column to catch the conversions...
df['item'] = pd.Series(dtype=str)
# we'll use the "contains" function in pandas as a simple converter... quick demo
temp = df.loc[df['item_orig'].str.contains('McDonalds')]
print('\nitems that contain the string "McDonalds"')
print(temp)
# let's build a simple conversion table in a dictionary
conversions = { 'McDonalds': 'McDonalds - any',
'Sonoco': 'gas',
'Arco': 'gas'}
# let's loop over the orig items and put conversions into the new column
# (there is probably a faster way to do this, but for data with < 100K rows, who cares.)
for key in conversions:
    # use .loc with a boolean mask to avoid chained-assignment warnings
    df.loc[df['item_orig'].str.contains(key), 'item'] = conversions[key]
# see how we did...
print('converted...')
print(df)
# now move over anything that was NOT converted
# in this example, this is just the ATM Fee item...
df.loc[df['item'].isnull(), 'item'] = df['item_orig']
# now we have decent labels to support grouping!
print('\n\n *** sum of charges by group ***')
print(df.groupby('item')['charge'].sum())
Yields:
item_orig charge
0 McDonalds Denver 12.44
1 Sonoco 4
2 ATM Fee 3
3 Sonoco, Ft. Collins 14.99
4 McDonalds, Boulder 19.1
5 Arco Boulder 52.99
items that contain the string "McDonalds"
item_orig charge item
0 McDonalds Denver 12.44 NaN
4 McDonalds, Boulder 19.1 NaN
converted...
item_orig charge item
0 McDonalds Denver 12.44 McDonalds - any
1 Sonoco 4 gas
2 ATM Fee 3 NaN
3 Sonoco, Ft. Collins 14.99 gas
4 McDonalds, Boulder 19.1 McDonalds - any
5 Arco Boulder 52.99 gas
*** sum of charges by group ***
item
ATM Fee 3.00
McDonalds - any 31.54
gas 71.98
Name: charge, dtype: float64
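On the dtype question in the post: an "object" column that holds Python strings is fine as-is; the .str accessor works on it directly, so no conversion is needed. A tiny check of that claim (a sketch using two of the sample strings from the question):

import pandas as pd

s = pd.Series(["Mcdonald's F26593 Fort Collins CO", 'Conoco - Sei 26927 Fort Collins CO'])
print(s.dtype)                                 # object
print(s.str.contains('Mcdonald', case=False))  # works without astype(str)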
I'm reading from a sqlite3 db into a df:
id symbol name
0 1 QCLR Global X Funds Global X NASDAQ 100 Collar 95-1...
1 2 LCW Learn CW Investment Corporation
2 3 BUG Global X Funds Global X Cybersecurity ETF
3 4 LDOS Leidos Holdings, Inc.
4 5 LDP COHEN & STEERS LIMITED DURATION PREFERRED AND ...
... ... ... ...
10999 11000 ERIC Ericsson American Depositary Shares
11000 11001 EDI Virtus Stone Harbor Emerging Markets Total Inc...
11001 11002 EVX VanEck Environmental Services ETF
11002 11003 QCLN First Trust NASDAQ Clean Edge Green Energy Ind...
11003 11004 DTB DTE Energy Company 2020 Series G 4.375% Junior...
[11004 rows x 3 columns]
Then I have a symbols.csv file which I want to use to filter the above df:
AKAM
AKRO
Here's how I've tried to do it:
origin_symbols = pd.read_sql_query("SELECT id, symbol, name from stock", conn)
mikey_symbols = pd.read_csv("symbols.csv")
df = origin_symbols[origin_symbols['symbol'].isin(mikey_symbols)]
But for some reason I only get the first line returned from the csv:
id symbol name
6475 6476 AKAM Akamai Technologies, Inc. Common Stock
Where am I going wrong here?
You need to convert the CSV file to a Series. Here a column name is added on read, and then that column is selected (e.g. by position):
mikey_symbols = pd.read_csv("symbols.csv", names=['tmp']).iloc[:, 0]
#or by column name
#mikey_symbols = pd.read_csv("symbols.csv", names=['tmp'])['tmp']
And then remove possible trailing spaces in both by Series.str.strip:
df = origin_symbols[origin_symbols['symbol'].str.strip().isin(mikey_symbols.str.strip())]
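A likely explanation for the original symptom (an assumption, but consistent with the output shown): read_csv without names= treats the first symbol as the header, and iterating over a DataFrame yields its column labels, so isin was effectively comparing against ['AKAM'] only. A quick way to see this:

import pandas as pd

mikey_symbols = pd.DataFrame({'AKAM': ['AKRO']})  # what read_csv produced: 'AKAM' became the header
print(list(mikey_symbols))                        # ['AKAM'] -- these labels are what isin iterated over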
Need help in matching phrases in the data given below, where I need to match phrases from both TextA and TextB.
The following code did not help me do it. How can I address this? I have hundreds of them to match.
#sorting jumbled phrases
def sorts(string_value):
    sorted_string = sorted(string_value.split())
    sorted_string = ' '.join(sorted_string)
    return sorted_string

#Removing punctuations in string
punc = '''!()-[]{};:'"\,<>./?##$%^&*_~'''
def punt(test_str):
    for ele in test_str:
        if ele in punc:
            test_str = test_str.replace(ele, "")
    return(test_str)

#matching strings
def lets_match(x):
    for text1 in TextA:
        for text2 in TextB:
            try:
                if sorts(punt(x[text1.casefold()])) == sorts(punt(x[text2.casefold()])):
                    return True
            except:
                continue
    return False

df['result'] = df.apply(lets_match, axis=1)
Even after implementing string sorting, removing punctuation, and handling case sensitivity, I am still getting those strings as not matching. Am I missing something here? Can someone help me achieve this?
Actually you can use difflib to match two texts; here's what you can try:
from difflib import SequenceMatcher

def similar(a, b):
    a = str(a).lower()
    b = str(b).lower()
    return SequenceMatcher(None, a, b).ratio()

def lets_match(d):
    print(d[0], " --- ", d[1])
    result = similar(d[0], d[1])
    print(result)
    if result > 0.6:
        return True
    else:
        return False

df["result"] = df.apply(lets_match, axis=1)
You can play with the result > 0.6 threshold.
For more information about difflib you can check its documentation. There are other sequence matchers as well, like textdistance, but I found this one easy, so I tried it.
Are there any issues with using a fuzzy matching library? The implementation is pretty straightforward and works well given that the above data is relatively similar. I've performed the below without preprocessing.
import pandas as pd
""" Install the libs below via terminal:
$pip install fuzzywuzzy
$pip install python-Levenshtein
"""
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
#creating the data frames
text_a = ['AKIL KUMAR SINGH','OUSMANI DJIBO','PETER HRYB','CNOC LIMITED','POLY NOVA INDUSTRIES LTD','SAM GAWED JR','ADAN GENERAL LLC','CHINA MOBLE LIMITED','CASTAR CO., LTD.','MURAN','OLD SAROOP FOR CAR SEAT COVERS','CNP HEALTHCARE, LLC','GLORY PACK LTD','AUNCO VENTURES','INTERNATIONAL COMPANY','SAMEERA HEAT AND ENERGY FUND']
text_b = ['Singh, Akil Kumar','DJIBO, Ousmani Illiassou','HRYB, Peter','CNOOC LIMITED','POLYNOVA INDUSTRIES LTD.','GAWED, SAM','ADAN GENERAL TRADING FZE','CHINA MOBILE LIMITED','CASTAR GROUP CO., LTD.','MURMAN','Old Saroop for Car Seat Covers','CNP HEATHCARE, LLC','GLORY PACK LTD.','AUNCO VENTURE','INTL COMPANY','SAMEERA HEAT AND ENERGY PROPERTY FUND']
df_text_a = pd.DataFrame(text_a, columns=['text_a'])
df_text_b = pd.DataFrame(text_b, columns=['text_b'])
def lets_match(txt: str, chklist: list) -> tuple:
    return process.extractOne(txt, chklist, scorer=fuzz.token_set_ratio)
#match Text_A against Text_B
result_txt_ab = df_text_a.apply(lambda x: lets_match(str(x), text_b), axis=1, result_type='expand')
result_txt_ab.rename(columns={0:'Return Match', 1:'Match Value'}, inplace=True)
df_text_a[result_txt_ab.columns]=result_txt_ab
df_text_a
text_a Return Match Match Value
0 AKIL KUMAR SINGH Singh, Akil Kumar 100
1 OUSMANI DJIBO DJIBO, Ousmani Illiassou 72
2 PETER HRYB HRYB, Peter 100
3 CNOC LIMITED CNOOC LIMITED 70
4 POLY NOVA INDUSTRIES LTD POLYNOVA INDUSTRIES LTD. 76
5 SAM GAWED JR GAWED, SAM 100
6 ADAN GENERAL LLC ADAN GENERAL TRADING FZE 67
7 CHINA MOBLE LIMITED CHINA MOBILE LIMITED 79
8 CASTAR CO., LTD. CASTAR GROUP CO., LTD. 81
9 MURAN SAMEERA HEAT AND ENERGY PROPERTY FUND 41
10 OLD SAROOP FOR CAR SEAT COVERS Old Saroop for Car Seat Covers 100
11 CNP HEALTHCARE, LLC CNP HEATHCARE, LLC 58
12 GLORY PACK LTD GLORY PACK LTD. 100
13 AUNCO VENTURES AUNCO VENTURE 56
14 INTERNATIONAL COMPANY INTL COMPANY 74
15 SAMEERA HEAT AND ENERGY FUND SAMEERA HEAT AND ENERGY PROPERTY FUND 86
#match Text_B against Text_A
result_txt_ba= df_text_b.apply(lambda x: lets_match(str(x), text_a), axis=1, result_type='expand')
result_txt_ba.rename(columns={0:'Return Match', 1:'Match Value'}, inplace=True)
df_text_b[result_txt_ba.columns]=result_txt_ba
df_text_b
text_b Return Match Match Value
0 Singh, Akil Kumar AKIL KUMAR SINGH 100
1 DJIBO, Ousmani Illiassou OUSMANI DJIBO 100
2 HRYB, Peter PETER HRYB 100
3 CNOOC LIMITED CNOC LIMITED 74
4 POLYNOVA INDUSTRIES LTD. POLY NOVA INDUSTRIES LTD 74
5 GAWED, SAM SAM GAWED JR 86
6 ADAN GENERAL TRADING FZE ADAN GENERAL LLC 86
7 CHINA MOBILE LIMITED CHINA MOBLE LIMITED 81
8 CASTAR GROUP CO., LTD. CASTAR CO., LTD. 100
9 MURMAN ADAN GENERAL LLC 33
10 Old Saroop for Car Seat Covers OLD SAROOP FOR CAR SEAT COVERS 100
11 CNP HEATHCARE, LLC CNP HEALTHCARE, LLC 56
12 GLORY PACK LTD. GLORY PACK LTD 100
13 AUNCO VENTURE AUNCO VENTURES 53
14 INTL COMPANY INTERNATIONAL COMPANY 50
15 SAMEERA HEAT AND ENERGY PROPERTY FUND SAMEERA HEAT AND ENERGY FUND 100
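One practical tweak worth considering (a suggestion beyond the answer above): process.extractOne also accepts a score_cutoff, so weak best-matches such as MURAN/MURMAN can be dropped (returned as None) instead of being paired with an unrelated name:

def lets_match_cutoff(txt: str, chklist: list):
    # returns None when even the best candidate scores below the cutoff
    return process.extractOne(txt, chklist, scorer=fuzz.token_set_ratio, score_cutoff=80)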
I don't think you can do it without some notion of string distance. What you can do is use, for example, record linkage.
I will not get into details, but I'll show you an example of its usage on this case.
import pandas as pd
import recordlinkage as rl
from recordlinkage.preprocessing import clean
# creating first dataframe
df_text_a = pd.DataFrame({
"Text A":[
"AKIL KUMAR SINGH",
"OUSMANI DJIBO",
"PETER HRYB",
"CNOC LIMITED",
"POLY NOVA INDUSTRIES LTD",
"SAM GAWED JR",
"ADAN GENERAL LLC",
"CHINA MOBLE LIMITED",
"CASTAR CO., LTD.",
"MURAN",
"OLD SAROOP FOR CAR SEAT COVERS",
"CNP HEALTHCARE, LLC",
"GLORY PACK LTD",
"AUNCO VENTURES",
"INTERNATIONAL COMPANY",
"SAMEERA HEAT AND ENERGY FUND"]
}
)
# creating second dataframe
df_text_b = pd.DataFrame({
"Text B":[
"Singh, Akil Kumar",
"DJIBO, Ousmani Illiassou",
"HRYB, Peter",
"CNOOC LIMITED",
"POLYNOVA INDUSTRIES LTD. ",
"GAWED, SAM",
"ADAN GENERAL TRADING FZE",
"CHINA MOBILE LIMITED",
"CASTAR GROUP CO., LTD.",
"MURMAN ",
"Old Saroop for Car Seat Covers",
"CNP HEATHCARE, LLC",
"GLORY PACK LTD.",
"AUNCO VENTURE",
"INTL COMPANY",
"SAMEERA HEAT AND ENERGY PROPERTY FUND"
]
}
)
# preprocessing is very important for the results; you have to find what fits your problem well.
cleaned_a = pd.DataFrame(clean(df_text_a["Text A"], lowercase=True))
cleaned_b = pd.DataFrame(clean(df_text_b["Text B"], lowercase=True))
# creating an index which will be used for comparison; there are various types of indexing, see the documentation.
indexer = rl.Index()
indexer.full()
# generating all possible pairs
pairs = indexer.index(cleaned_a, cleaned_b)
# starting evaluation phase
compare = rl.Compare(n_jobs=-1)
compare.string("Text A", "Text B", method='jarowinkler', label = 'text')
matches = compare.compute(pairs, cleaned_a, cleaned_b)
matches is now a DataFrame with a MultiIndex. What you want to do next is, for each value of the first index level, take the max over the second level; that gives you the results you need.
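A minimal sketch of that last step (an assumption on my part, using the 'text' score column defined above):

# for each row of cleaned_a (first index level), keep the cleaned_b row with the highest score
best_idx = matches['text'].groupby(level=0).idxmax()
best_matches = matches.loc[list(best_idx)]
print(best_matches)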
Results can be improved working on distance, indexing and/or preprocessing.
I have a dataframe with names field as:
print(df)
names
--------------------------------
0 U.S.A.
1 United States of America
2 USA
4 US America
5 Kenyan Footbal League
6 Kenyan Football League
7 Kenya Football League Assoc.
8 Kenya Footbal League Association
9 Tata Motors
10 Tat Motor
11 Tata Motors Ltd.
12 Tata Motor Limited
13 REL
14 Reliance Limited
15 Reliance Co.
Now I want to club all these similar kind of names into one category such that the final dataframe looks something like this:
print(df)
names group_name
---------------------------------------------
0 U.S.A. USA
1 United States of America USA
2 USA USA
4 US America USA
5 Kenyan Footbal League Kenya Football League
6 Kenyan Football League Kenya Football League
7 Kenya Football League Assoc. Kenya Football League
8 Kenya Footbal League Association Kenya Football League
9 Tata Motors Tata Motors
10 Tat Motor Tata Motors
11 Tata Motors Ltd. Tata Motors
12 Tata Motor Limited Tata Motors
13 REL Reliance
14 Reliance Limited. Reliance
15 Reliance Co. Reliance
Now this is just 16 records, so it's easy to look up all the possible names and anomalies in their names and create a dictionary for mapping. But in reality I have a data-frame with about 5800 unique names (NOTE: 'USA' and 'U.S.A.' are counted as different entities when stating the count of unique names). So is there any programmatic approach to tackle such a scenario?
I tried running fuzzy matching using the difflib and fuzzywuzzy libraries, but even their final results are not concrete. Oftentimes difflib would match up based on words like 'limited', 'association', etc., even though they referred to two different names with just 'association' or 'limited' as the common word between them.
Any help is appreciated.
EDIT:
Even if I create a list of stop-words with words like 'association', 'limited', 'corporations', 'group', etc., there is a chance of missing these stop words when they are mentioned differently. For instance, if 'association' and 'limited' are written as 'assoc.', 'ltd' and 'ltd.', there is a chance I'll miss adding some of these to the stop-word list.
I have already tried topic modelling with LDA and NMF; the results were pretty similar to what I had achieved earlier using the difflib and fuzzywuzzy libraries. And yes, I did all the preprocessing (converting to lower case, lemmatization, extra whitespace handling) before any of these approaches.
Late answer, after focusing on it for an hour. You can use difflib.SequenceMatcher and filter where the ratio is greater than 0.6, plus a fair chunk of code as well... I also simply remove the last word of each candidate in the modified names column and take the longest remaining string, which apparently gives your desired result. Here it is...
import difflib
df2 = df.copy()
df2.loc[df2.names.str.contains('America'), 'names'] = 'US'
df2['names'] = df2.names.str.replace('.', '', regex=False).str.lstrip()
df2.loc[df2.names.str.contains('REL'), 'names'] = 'Reliance'
df['group_name'] = df2.names.apply(lambda x: max(sorted([i.rsplit(None, 1)[0] for i in df2.names.tolist() if difflib.SequenceMatcher(None, x, i).ratio() > 0.6]), key=len))
print(df)
Output:
names group_name
0 U.S.A. USA
1 United States of America USA
2 USA USA
3 US America USA
4 Kenyan Footbal League Kenya Football League
5 Kenyan Football League Kenya Football League
6 Kenya Football League Assoc. Kenya Football League
7 Kenya Footbal League Association Kenya Football League
8 Tata Motors Tata Motors
9 Tat Motor Tata Motors
10 Tata Motors Ltd. Tata Motors
11 Tata Motor Limited Tata Motors
12 REL Reliance
13 Reliance Limited Reliance
14 Reliance Co. Reliance
That's my best effort at the code.
To my knowledge, I don't think you can get fully accurate results, but there are some things that would help you clean your data:
First, lowercase the strings using .lower()
Strip the strings to remove extra spaces using .strip()
Tokenize the strings
Apply stemming and lemmatization to your data
You should research sentence similarity; multiple libraries exist in Python, such as gensim and nltk (a small sketch follows after the links below):
https://radimrehurek.com/gensim/tutorial.html
https://spacy.io/
https://www.nltk.org/
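As a minimal sketch of the cleaning steps above (an assumption on my part, using nltk's tokenizer and WordNet lemmatizer, both of which need a one-time corpus download):

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# one-time setup: nltk.download('punkt'); nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

def normalize(name):
    # lowercase, strip extra spaces, tokenize, drop punctuation tokens, lemmatize
    tokens = word_tokenize(name.lower().strip())
    return ' '.join(lemmatizer.lemmatize(t) for t in tokens if t.isalnum())

print(normalize('Kenya Football League Assoc.'))

Normalized strings like these can then be fed into whichever similarity measure you pick from the libraries above.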
I also created a very basic document similarity project; you can check it on GitHub:
https://github.com/tawabshakeel/Document-similarity-NLP-
I hope all these things help you solve your problem.
I want to group multiple categories in a pandas variable using numpy.where and a dictionary.
Currently I am doing this using just numpy.where, which bloats my code a lot when I have many categories. I want to create a map using a dictionary and then use that map in numpy.where.
Sample Data frame:
dataF = pd.DataFrame({'TITLE':['CEO','CHIEF EXECUTIVE','EXECUTIVE OFFICER','FOUNDER',
'CHIEF OP','TECH OFFICER','CHIEF TECH','VICE PRES','PRESIDENT','PRESIDANTE','OWNER','CO OWNER',
'DIRECTOR','MANAGER',np.nan]})
dataF
TITLE
0 CEO
1 CHIEF EXECUTIVE
2 EXECUTIVE OFFICER
3 FOUNDER
4 CHIEF OP
5 TECH OFFICER
6 CHIEF TECH
7 VICE PRES
8 PRESIDENT
9 PRESIDANTE
10 OWNER
11 CO OWNER
12 DIRECTOR
13 MANAGER
14 NaN
Numpy operation
dataF['TITLE_GRP'] = np.where(dataF['TITLE'].isna(),'NOTAVAILABLE',
np.where(dataF['TITLE'].str.contains('CEO|CHIEF EXECUTIVE|EXECUTIVE OFFICER|FOUN'),'CEO_FOUNDER',
np.where(dataF['TITLE'].str.contains('CHIEF|OFFICER|^CFO$|^COO$|^CIO$|^CTO$|^CMO$'),'OTHER_OFFICERS',
np.where(dataF['TITLE'].str.contains('VICE|VP'),'VP',
np.where(dataF['TITLE'].str.contains('PRESIDENT|PRES'),'PRESIDENT',
np.where(dataF['TITLE'].str.contains('OWNER'),'OWNER_CO_OWN',
np.where(dataF['TITLE'].str.contains('MANAGER|GM|MGR|MNGR|DIR|HEAD|CHAIR'),'DIR_MGR_HEAD'
,dataF['TITLE'])))))))
Transformed Data
TITLE TITLE_GRP
0 CEO CEO_FOUNDER
1 CHIEF EXECUTIVE CEO_FOUNDER
2 EXECUTIVE OFFICER CEO_FOUNDER
3 FOUNDER CEO_FOUNDER
4 CHIEF OP OTHER_OFFICERS
5 TECH OFFICER OTHER_OFFICERS
6 CHIEF TECH OTHER_OFFICERS
7 VICE PRES VP
8 PRESIDENT PRESIDENT
9 PRESIDANTE PRESIDENT
10 OWNER OWNER_CO_OWN
11 CO OWNER OWNER_CO_OWN
12 DIRECTOR DIR_MGR_HEAD
13 MANAGER DIR_MGR_HEAD
14 NaN NOTAVAILABLE
What I want to do is create some mapping like below:
TITLE_REPLACE = {'CEO_FOUNDER':'CEO|CHIEF EXECUTIVE|EXECUTIVE OFFICER|FOUN',
'OTHER_OFFICERS':'CHIEF|OFFICER|^CFO$|^COO$|^CIO$|^CTO$|^CMO$',
'VP':'VICE|VP',
'PRESIDENT':'PRESIDENT|PRES',
'OWNER_CO_OWN':'OWNER',
'DIR_MGR_HEAD':'MANAGER|GM|MGR|MNGR|DIR|HEAD|CHAIR'}
And then feed it to some function which applies the stepwise numpy operation and gives me the same result as above.
I am doing this because I have to parameterize my code in such a way that all parameters for data manipulation will be provided from a JSON file.
I was trying pandas.replace since it accepts a dictionary, but it doesn't preserve the hierarchical structure of the nested np.where; it is also not able to replace the whole title, as it just replaces the substring when it finds a match.
In case you are able to provide a solution for the above, I would also like to know how to solve the following two other scenarios:
This scenario uses an .isin operation instead of regex:
dataF['INDUSTRY'] = np.where(dataF['INDUSTRY'].isin(['AEROSPACE','AGRICULTURE/MINING','EDUCATION','ENERGY']),'AER_AGR_MIN_EDU_ENER',
np.where(dataF['INDUSTRY'].isin(['TRAVEL','INSURANCE','GOVERNMENT','FINANCIAL SERVICES','AUTO','PHARMACEUTICALS']),'TRA_INS_GOVT_FIN_AUT_PHAR',
np.where(dataF['INDUSTRY'].isin(['BUSINESS GOODS/SERVICES','CHEMICALS ','TELECOM','TRANSPORTATION']),'BS_CHEM_TELE_TRANSP',
np.where(dataF['INDUSTRY'].isin(['CONSUMER GOODS','ENTERTAINMENT','FOOD AND BEVERAGE','HEALTHCARE','INDUSTRIAL/MANUFACTURING','TECHNOLOGY']),'CG_ENTER_FB_HLTH_IND_TECH',
np.where(dataF['INDUSTRY'].isin(['ADVERTISING','ASSOCIATION','CONSULTING/ACCOUNTING','PUBLISHING/MEDIA','TECHNOLOGY']),'ADV_ASS_CONS_ACC_PUBL_MED_TECH',
np.where(dataF['INDUSTRY'].isin(['RESTAURANT','SOFTWARE']),'REST_SOFT',
'NOTAVAILABLE'))))))
This scenario uses a .between operation:
dataF['annual_revn'] = np.where(dataF['annual_revn'].between(1000000,10000000),'1_10_MILLION',
np.where(dataF['annual_revn'].between(10000000,15000000),'10_15_MILLION',
np.where(dataF['annual_revn'].between(15000000,20000000),'15_20_MILLION',
np.where(dataF['annual_revn'].between(20000000,50000000),'20_50_MILLION',
np.where(dataF['annual_revn'].between(50000000,1000000000),'50_1000_MILLION',
'NOTAVAILABLE_OUTLIER')))))
The below method works, but it isn't particularly elegant, and it may not be that fast.
import pandas as pd
import numpy as np
import re
dataF = pd.DataFrame({'TITLE':['CEO','CHIEF EXECUTIVE','EXECUTIVE OFFICER','FOUNDER',
'CHIEF OP','TECH OFFICER','CHIEF TECH','VICE PRES','PRESIDENT','PRESIDANTE','OWNER','CO OWNER',
'DIRECTOR','MANAGER',np.nan]})
TITLE_REPLACE = {'CEO_FOUNDER':'CEO|CHIEF EXECUTIVE|EXECUTIVE OFFICER|FOUN',
'OTHER_OFFICERS':'CHIEF|OFFICER|^CFO$|^COO$|^CIO$|^CTO$|^CMO$',
'VP':'VICE|VP',
'PRESIDENT':'PRESIDENT|PRES',
'OWNER_CO_OWN':'OWNER',
'DIR_MGR_HEAD':'MANAGER|GM|MGR|MNGR|DIR|HEAD|CHAIR'}
# Swap the keys and values from the raw data, and split regex by '|'
reverse_replace = {}
for key, value in TITLE_REPLACE.items():
    for value_single in value.split('|'):
        reverse_replace[value_single] = key

def mapping_func(x):
    if x is not np.nan:
        for key, value in reverse_replace.items():
            if re.compile(key).search(x):
                return value
    return 'NOTAVAILABLE'

dataF['TITLE_GRP'] = dataF['TITLE'].apply(mapping_func)
TITLE TITLE_GRP
0 CEO CEO_FOUNDER
1 CHIEF EXECUTIVE CEO_FOUNDER
2 EXECUTIVE OFFICER CEO_FOUNDER
3 FOUNDER CEO_FOUNDER
4 CHIEF OP OTHER_OFFICERS
5 TECH OFFICER OTHER_OFFICERS
6 CHIEF TECH OTHER_OFFICERS
7 VICE PRES VP
8 PRESIDENT PRESIDENT
9 PRESIDANTE PRESIDENT
10 OWNER OWNER_CO_OWN
11 CO OWNER OWNER_CO_OWN
12 DIRECTOR DIR_MGR_HEAD
13 MANAGER DIR_MGR_HEAD
14 NaN NOTAVAILABLE
For your additional scenarios, it may make sense to construct a df with the industry mapping data and then do a df.merge to determine the grouping from the industry; a sketch of that idea (and of pd.cut for the revenue bands) is below.
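A minimal sketch of that merge idea for the .isin scenario, plus pd.cut for the .between scenario (the small ind_df and the two-group mapping here are just illustrative assumptions):

import pandas as pd

INDUSTRY_REPLACE = {
    'AER_AGR_MIN_EDU_ENER': ['AEROSPACE', 'AGRICULTURE/MINING', 'EDUCATION', 'ENERGY'],
    'REST_SOFT': ['RESTAURANT', 'SOFTWARE'],
}
# flatten the dict into a lookup table: one row per (INDUSTRY, INDUSTRY_GRP) pair
mapping_df = pd.DataFrame(
    [(ind, grp) for grp, inds in INDUSTRY_REPLACE.items() for ind in inds],
    columns=['INDUSTRY', 'INDUSTRY_GRP'])

ind_df = pd.DataFrame({'INDUSTRY': ['ENERGY', 'SOFTWARE', 'UNKNOWN SECTOR'],
                       'annual_revn': [2_500_000, 18_000_000, 900_000]})
# a left merge keeps unmatched industries as NaN, which can then be filled
ind_df = ind_df.merge(mapping_df, on='INDUSTRY', how='left')
ind_df['INDUSTRY_GRP'] = ind_df['INDUSTRY_GRP'].fillna('NOTAVAILABLE')

# for the .between scenario, pd.cut expresses the revenue bands declaratively
bins = [1_000_000, 10_000_000, 15_000_000, 20_000_000, 50_000_000, 1_000_000_000]
labels = ['1_10_MILLION', '10_15_MILLION', '15_20_MILLION', '20_50_MILLION', '50_1000_MILLION']
ind_df['revn_grp'] = pd.cut(ind_df['annual_revn'], bins=bins, labels=labels).astype(str)
ind_df['revn_grp'] = ind_df['revn_grp'].replace('nan', 'NOTAVAILABLE_OUTLIER')
print(ind_df)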
So I've been given a large text file with an assortment of vehicle data. The first thing I did was iterate over it and add it to a list so that each line is an element. I then made it so that the line could be indexed into. See the code below.
def main():
    f = open("/usr/local/doc/FEguide.txt", "r")
    full = list()
    companies = list()
    for line in f:
        line = line.split(",")
        full.append(line[1:])
A line from the opened file is in this format (the 1: slice is to omit the useless first element in the text line, not shown below).
Now I need to make it so that the user can enter a search term for an Automaker or Type (i.e. Standard SUV). My hunch is that I need to make a list of just automakers (which could be done with some kind of slice) and then a list of all the types, and then have this return the entire line when it matches. I'm having trouble actually implementing that.
You could use zip and dict.
Assuming you have this file:
General Motors,Chevrolet,K1500 TAHOE 4WD,2900
General Motors,Chevrolet,TRAVERSE AWD,2750
Chrysler Group LLC,Dodge,Durango AWD,2750
Chrysler Group LLC,Dodge,Durango AWD,3400
Ford Motor Company,Ford,Expedition 4WD,3100
Ford Motor Company,Ford,EXPLORER AWD,275
First define what your header should look like:
...cars.py...
cars_list = []
# header list (keys must match the format string used below)
headers = ['company', 'line', 'type', 'annual cost']
with open('/home/ajava/tst.txt') as file:
    # you should maybe check that both zipped lists have the same size!!
    for line in file:
        zipped_list = zip(headers, line.strip().split(','))
        # create a dictionary of zipped tuples and append it to cars_list
        cars_list.append(dict(zipped_list))

# printing results
print("\t".join(headers))
for item in cars_list:
    print("{company}\t{line}\t{type}\t{annual cost}".format(**item))
Your output should look like this:
company line type annual cost
General Motors Chevrolet K1500 TAHOE 4WD 2900
General Motors Chevrolet TRAVERSE AWD 2750
Chrysler Group LLC Dodge Durango AWD 2750
Chrysler Group LLC Dodge Durango AWD 3400
Ford Motor Company Ford Expedition 4WD 3100
Ford Motor Company Ford EXPLORER AWD 275
This is of course just a simple example of how you could do that with no extra libs.
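To get to the lookup the question actually asks about, a small sketch (an assumption, reusing the cars_list built above) could filter on either field:

def find_cars(cars, company=None, car_type=None):
    # return every record whose company or type contains the search term (case-insensitive)
    results = []
    for car in cars:
        if company and company.lower() in car['company'].lower():
            results.append(car)
        elif car_type and car_type.lower() in car['type'].lower():
            results.append(car)
    return results

print(find_cars(cars_list, company='Ford'))
print(find_cars(cars_list, car_type='Durango'))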
I think there is no need to reinvent the wheel, unless this is a programming assignment. I'll leave the command-line interactions up to you, but the basic functionality is here using pandas:
import pandas as pd

df = pd.read_csv('FEguide.txt')

print('----------------------------')
print('All companies sorted:')
print(df.sort_values('Company').Company)
print('----------------------------')
print('All Dodge models:')
print(df[df['Line'] == 'Dodge'])
print('----------------------------')
print('Mean MPG and annual cost per company')
print(df.groupby('Company').mean(numeric_only=True))
print('----------------------------')
print('Mean MPG and annual cost per car type')
print(df.groupby('Type').mean(numeric_only=True))
Output:
----------------------------
All companies sorted:
2 Chrysler Group LLC
3 Chrysler Group LLC
4 Ford Motor Company
5 Ford Motor Company
1 General Motors
0 General Motors
Name: Company, dtype: object
----------------------------
All Dodge models:
Company Line Type MPG Annual Cost Category
2 Chrysler Group LLC Dodge Durango AWD 19 2750 Standard SUV 4WD
3 Chrysler Group LLC Dodge Durango AWD 16 3400 Standard SUV 4WD
----------------------------
Mean MPG and annual cost per company
MPG Annual Cost
Company
Chrysler Group LLC 17.5 3075
Ford Motor Company 18.0 2925
General Motors 18.5 2825
----------------------------
Mean MPG and annual cost per car type
MPG Annual Cost
Type
Durango AWD 17.5 3075
EXPLORER AWD 19.0 2750
Expedition 4WD 17.0 3100
K1500 TAHOE 4WD 18.0 2900
TRAVERSE AWD 19.0 2750