We have a table (in a .doc document) that looks as follows:
item_number  item_code  description                                   unit  QUANTITY  BID        AMOUNT
1            074016     CONSTRUCTION SITE MANAGEMENT                  LS    LUMP SUM  24,826.49  24,826.49
2            074017     PREPARE WATER POLLUTION CONTROL PROGRAM      LS    30        125.38     3,761.40
3            840521     4" THERMOPLASTIC TRAFFIC STRIPE (BROKEN 6-1)  SQFT  LUMP SUM  .19        32.30
I imported the text from the .doc file and am now using regex to extract the table. When imported, the table text looks as follows:
1 074016 CONSTRUCTION SITE MANAGEMENT LS LUMP SUM 24,826.49 24,826.49
2 074017 PREPARE WATER POLLUTION CONTROL LS LUMP SUM 708.63 708.63
PROGRAM
3 074038 TEMPORARY DRAINAGE INLET PROTECTION EA 30 125.38 3,761.40
4 074041 STREET SWEEPING LS LUMP SUM 10,379.25 10,379.25
5 120090 CONSTRUCTION AREA SIGNS LS LUMP SUM 9,880.75 9,880.75
6 120100 TRAFFIC CONTROL SYSTEM LS LUMP SUM 10,932.61 10,932.61
7 152440 ADJUST MANHOLE TO GRADE EA 110 453.42 49,876.20
8 153103 COLD PLANE ASPHALT CONCRETE PAVEMENT SQYD 143,000 1.37 195,910.00
I am trying to create a pattern that captures the different variable values in different regex groups. Right now, the pattern I have is ^(\s{6}|\s{7})(\d+)\s+(\d+)\s+(([A-Z.]{2}[^\n\d]*[A-Z)]\s{2})). But it also captures LS and LUMP SUM in the description group.
Code:
# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword
import itertools
text = '''...
074016 CONSTRUCTION SITE MANAGEMENT LS LUMP SUM 24,826.49 24,826.49
074017 PREPARE WATER POLLUTION CONTROL LS LUMP SUM 708.63 708.63
PROGRAM
074038 TEMPORARY DRAINAGE INLET PROTECTION EA 30 125.38 3,761.40
074041 STREET SWEEPING LS LUMP SUM 10,379.25 10,379.25
120090 CONSTRUCTION AREA SIGNS LS LUMP SUM 9,880.75 9,880.75
120100 TRAFFIC CONTROL SYSTEM LS LUMP SUM 10,932.61 10,932.61
152440 ADJUST MANHOLE TO GRADE EA 110 453.42 49,876.20
153103 COLD PLANE ASPHALT CONCRETE PAVEMENT SQYD 143,000 1.37 195,910.00
015299 LEAD COMPLIANCE PLAN (STRIPE REMOVAL) LS LUMP SUM 828.25 828.25
374002 ASPHALTIC EMULSION (FOG SEAL COAT) TON 18 1,013.60 18,244.80
390095 REPLACE ASPHALT CONCRETE SURFACING CY 160 277.89 44,462.40
390137 RUBBERIZED HOT MIX ASPHALT (GAP GRADED) TON 9,650 101.05 975,132.50
394050 RUMBLE STRIP STA 180 26.38 4,748.40
015300 REPLACE AIR MARKER EA 50 139.27 6,963.50
840504 4" THERMOPLASTIC TRAFFIC STRIPE LF 146,000 .36 52,560.00'
# creating a list: each item is a row from the dataset
text = re.split(r'(?ms)\n\s+\d+', text)
Is there a way to capture different variables in different groups? Any help would be appreciated. Thank you!
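One possible direction, as a sketch: give each column its own named group, assuming units are 2-4 uppercase letters and quantities are either LUMP SUM or a comma-grouped number (wrapped descriptions such as the stray PROGRAM line would still need to be merged onto their parent row first):
import re

# hypothetical pattern: one named group per column; the optional
# (?:\([A-Z]\)\s+)? also tolerates single-letter flags such as (F)
row_re = re.compile(
    r'^\s*(?P<item>\d+)\s+(?:\([A-Z]\)\s+)?(?P<code>\d{6})\s+'
    r'(?P<desc>.*?)\s+(?P<unit>[A-Z]{2,4})\s+'
    r'(?P<qty>LUMP SUM|[\d,]+)\s+(?P<bid>[\d,.]+)\s+(?P<amount>[\d,.]+)\s*$',
    re.MULTILINE)

m = row_re.match('7 152440 ADJUST MANHOLE TO GRADE EA 110 453.42 49,876.20')
print(m.groupdict())
# {'item': '7', 'code': '152440', 'desc': 'ADJUST MANHOLE TO GRADE',
#  'unit': 'EA', 'qty': '110', 'bid': '453.42', 'amount': '49,876.20'}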
Edit 1:
The code misses datasets that look as follows:
1 074016 CONSTRUCTION SITE MANAGEMENT LS LUMP SUM 240.00 240.00
2 074019 PREPARE STORM WATER POLLUTION LS LUMP SUM 2,300.00 2,300.00
PREVENTION PLAN
3 074038 TEMPORARY DRAINAGE INLET PROTECTION EA 12 240.00 2,880.00
4 074041 STREET SWEEPING LS LUMP SUM 1,700.00 1,700.00
5 074042 TEMPORARY CONCRETE WASHOUT (PORTABLE) LS LUMP SUM 370.00 370.00
6 120090 CONSTRUCTION AREA SIGNS LS LUMP SUM 7,100.00 7,100.00
7 120100 TRAFFIC CONTROL SYSTEM LS LUMP SUM 35,900.00 35,900.00
8 120165 CHANNELIZER (SURFACE MOUNTED) EA 40 20.00 800.00
9 128650 PORTABLE CHANGEABLE MESSAGE SIGN EA 4 2,200.00 8,800.00
10 129000 TEMPORARY RAILING (TYPE K) LF 960 27.50 26,400.00
11 129100 TEMPORARY CRASH CUSHION MODULE EA 56 127.00 7,112.00
12 150662 REMOVE METAL BEAM GUARD RAILING LF 1,390 3.00 4,170.00
13 153210 REMOVE CONCRETE CY 2 660.00 1,320.00
14 015310 REMOVE BRIDGE APPROACH GUARD RAILING LF 200 6.30 1,260.00
15 156585 REMOVE CRASH CUSHION EA 1 300.00 300.00
16 160101 CLEARING AND GRUBBING LS LUMP SUM 2,500.00 2,500.00
17 190110 LEAD COMPLIANCE PLAN LS LUMP SUM 850.00 850.00
18 (F) 510502 MINOR CONCRETE (MINOR STRUCTURE) CY 4 2,740.00 10,960.00
19 820118 GUARD RAILING DELINEATOR EA 12 15.00 180.00
20 839303 SINGLE THRIE BEAM BARRIER (STEEL POST) LF 3,630 22.00 79,860.00
I suspect this could be due to the (F) there. Is there a way to tackle this? Thank you so much!
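One way to tackle it, as a sketch, is to strip such single-letter flags before matching; the Edit 2 code below does something similar with str.replace:
import re

# drop single-letter flags such as (F) or (S); longer parentheticals in
# descriptions like (PORTABLE) or (TYPE K) are untouched
text = re.sub(r'\([A-Z]\)\s*', '', text)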
Edit 2:
#import sys
#sys.modules[__name__].__dict__.clear()

# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword
import itertools
from io import StringIO

# setting directory
os.chdir('/Users/aartimalik/Documents/GitHub/revenue_procurement/pdfs/bidsummaries-doc-test')
text = textract.process('/Users/aartimalik/Documents/GitHub/revenue_procurement/pdfs/bidsummaries-doc/081204R0.doc_133.doc')
text = text.decode("utf-8")

# splitting by contract number
nob = text.split('BID OPENING DATE')
del nob[0]

# create a dataframe to store the data
# dff = pd.DataFrame(columns = ['contract_number', 'item_number', 'item_code', 'description', 'unit', 'QUANTITY', 'BID', 'AMOUNT'])
dff = pd.DataFrame(columns = ['0', '1', '2', '3', '4', '5', '6', '7'])

# file level loop starts here
dataframes = dict()
for i in range(len(nob)):
    try:
        txt = nob[i]
        # contract number
        cn1 = re.search(r'CONTRACT NUMBER\s+(.........)', txt)
        cn2 = re.search(r'CONTRACT NUMBER\n+(.*)', txt)
        if cn1 is not None:
            cn = cn1.group(1)
        elif cn2 is not None:
            cn = cn2.group(1)
        else:
            cn = "Not captured"
        # getting the contract proposal of low bidder table
        hi = re.split(r'(?ms)C O N T R A C T\s+P R O P O S A L\s+O F\s+L O W\s+B I D D E R(.*?)S U M M A R Y', txt)
        hi = hi[1]
        # splitting again
        hi = re.split(r'(?ms)---(\s+\n.*?\s*)TOTAL', hi)
        hi = hi[1]
        # str.replace returns a new string, so assign the result back
        hi = hi.replace('(F)', ' ')
        hi = hi.replace('(S)', ' ')
        df = pd.read_fwf(StringIO(hi), header=None, dtype=str)
        # name the columns; the ffill/groupby below needs 'item_number'
        df.columns = ['item_number', 'item_code', 'description', 'unit', 'QUANTITY', 'BID', 'AMOUNT']
        df['item_number'] = df['item_number'].ffill()
        df = df.fillna('').groupby('item_number').agg(lambda x: ' '.join(x).strip()).reset_index()
        df['contract_number'] = cn
        i = df.columns.get_loc('item_number')
        dataset = pd.read_csv('dataset.csv')
        dataframes['ok'] = dataset
        # DataFrame.append was removed in pandas 2.0; use pd.concat instead
        dff = pd.concat([dff, df], ignore_index=True)
    except Exception as e:
        print(e)
        print('Error in contract number: ' + cn)
# print(dff)
To correctly join the wrapped rows of the table, you can try the next example:
import pandas as pd
from io import StringIO
txt = '''\
1 074016 CONSTRUCTION SITE MANAGEMENT LS LUMP SUM 24,826.49 24,826.49
2 074017 PREPARE WATER POLLUTION CONTROL LS LUMP SUM 708.63 708.63
         PROGRAM
3 074038 TEMPORARY DRAINAGE INLET PROTECTION EA 30 125.38 3,761.40
4 074041 STREET SWEEPING LS LUMP SUM 10,379.25 10,379.25
5 120090 CONSTRUCTION AREA SIGNS LS LUMP SUM 9,880.75 9,880.75
6 120100 TRAFFIC CONTROL SYSTEM LS LUMP SUM 10,932.61 10,932.61
7 152440 ADJUST MANHOLE TO GRADE EA 110 453.42 49,876.20
8 153103 COLD PLANE ASPHALT CONCRETE PAVEMENT SQYD 143,000 1.37 195,910.00'''
df = pd.read_fwf(StringIO(txt), header=None, dtype=str)
df.columns = ['item_number', 'item_code', 'description', 'unit', 'QUANTITY', 'BID', 'AMOUNT']
df['item_number'] = df['item_number'].ffill()
df = df.fillna('').groupby('item_number').agg(lambda x: ' '.join(x).strip()).reset_index()
print(df)
Prints:
item_number item_code description unit QUANTITY BID AMOUNT
0 1 074016 CONSTRUCTION SITE MANAGEMENT LS LUMP SUM 24,826.49 24,826.49
1 2 074017 PREPARE WATER POLLUTION CONTROL PROGRAM LS LUMP SUM 708.63 708.63
2 3 074038 TEMPORARY DRAINAGE INLET PROTECTION EA 30 125.38 3,761.40
3 4 074041 STREET SWEEPING LS LUMP SUM 10,379.25 10,379.25
4 5 120090 CONSTRUCTION AREA SIGNS LS LUMP SUM 9,880.75 9,880.75
5 6 120100 TRAFFIC CONTROL SYSTEM LS LUMP SUM 10,932.61 10,932.61
6 7 152440 ADJUST MANHOLE TO GRADE EA 110 453.42 49,876.20
7 8 153103 COLD PLANE ASPHALT CONCRETE PAVEMENT SQYD 143,000 1.37 195,910.00
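For context: the wrapped fragment PROGRAM sits under the description column, so read_fwf parses it as a row whose item_number is NaN; forward-filling item_number and then joining the strings within each group stitches it back onto the description of row 2.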
I'm reading from a sqlite3 db into a df:
id symbol name
0 1 QCLR Global X Funds Global X NASDAQ 100 Collar 95-1...
1 2 LCW Learn CW Investment Corporation
2 3 BUG Global X Funds Global X Cybersecurity ETF
3 4 LDOS Leidos Holdings, Inc.
4 5 LDP COHEN & STEERS LIMITED DURATION PREFERRED AND ...
... ... ... ...
10999 11000 ERIC Ericsson American Depositary Shares
11000 11001 EDI Virtus Stone Harbor Emerging Markets Total Inc...
11001 11002 EVX VanEck Environmental Services ETF
11002 11003 QCLN First Trust NASDAQ Clean Edge Green Energy Ind...
11003 11004 DTB DTE Energy Company 2020 Series G 4.375% Junior...
[11004 rows x 3 columns]
Then I have a symbols.csv file which I want to use to filter the above df:
AKAM
AKRO
Here's how I've tried to do it:
origin_symbols = pd.read_sql_query("SELECT id, symbol, name from stock", conn)
mikey_symbols = pd.read_csv("symbols.csv")
df = origin_symbols[origin_symbols['symbol'].isin(mikey_symbols)]
But for some reason I only get the first line returned from the csv:
id symbol name
6475 6476 AKAM Akamai Technologies, Inc. Common Stock
Where am I going wrong here?
You need to convert the csv file to a Series. pd.read_csv returns a DataFrame, and passing a DataFrame to isin iterates its column labels, so your test effectively ran against just the header value AKAM. Here a column name is added, and that column is then selected as a Series (e.g. by position):
mikey_symbols = pd.read_csv("symbols.csv", names=['tmp']).iloc[:, 0]
#or by column name
#mikey_symbols = pd.read_csv("symbols.csv", names=['tmp'])['tmp']
Then remove possible trailing spaces in both with Series.str.strip:
df = origin_symbols[origin_symbols['symbol'].str.strip().isin(mikey_symbols.str.strip())]
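An equivalent sketch with header=None, which also keeps the first line as data; the column is then selected by its integer position:
mikey_symbols = pd.read_csv("symbols.csv", header=None)[0]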
I have a dataframe like this
names = ["Patient 1", "Patient 2", "Patient 3", "Patient 4", "Patient 5", "Patient 6", "Patient 7"]
categories = ["Internal medicine, Gastroenterology", "Internal medicine, General Med, Endocrinology", "Pediatrics, Medical genetics, Laboratory medicine", "Internal medicine", "Endocrinology", "Pediatrics", "General Med, Laboratory medicine"]
zippedList = list(zip(names, categories))
df = pd.DataFrame(zippedList, columns=['names', 'categories'])
yielding:
print(df)
names categories
0 Patient 1 Internal medicine, Gastroenterology
1 Patient 2 Internal medicine, General Med, Endocrinology
2 Patient 3 Pediatrics, Medical genetics, Laboratory medicine
3 Patient 4 Internal medicine
4 Patient 5 Endocrinology
5 Patient 6 Pediatrics
6 Patient 7 General Med, Laboratory medicine
(The real data-frame has >1000 rows)
and counting the categories yields:
print(df['categories'].str.split(", ").explode().value_counts())
Internal medicine 3
General Med 2
Endocrinology 2
Laboratory medicine 2
Pediatrics 2
Gastroenterology 1
Medical genetics 1
I would like to draw a random sub-sample of n rows so that each medical category is proportionally represented. E.g. 3 of 13 (~23%) category assignments are "Internal medicine", so ~23% of the sub-sample should have this category. This wouldn't be too hard if each patient had 1 category, but unfortunately they can have multiple (e.g. patient 3 even has 3 categories). How can I do this?
The fact that your patients have many categories doesn't affect the subsampling process. When you take n rows out of nrows (which is len(df)), subsampling maintains the category weights, up to the random fluctuation in how strongly each class is represented in your subset, and that fluctuation converges to 0 as n gets higher.
Typically,
n = 2000
df2 = df.sample(n).copy(deep = True)
print(df2['categories'].str.split(", ").explode().value_counts())
should work the way you want.
I also read you have around 1000 categories. Do not forget to preprocess them before use, as some of them could disappear from your subsample.
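A quick sanity check of that claim could look like this (a sketch; note that df.sample(n) requires n <= len(df) unless replace=True is passed):
# compare category shares in the full data and in the subsample
full_share = df['categories'].str.split(', ').explode().value_counts(normalize=True)
sub_share = df2['categories'].str.split(', ').explode().value_counts(normalize=True)
print(pd.concat([full_share, sub_share], axis=1, keys=['full', 'sample']))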
I'm running some analysis on bank statements (csv's). Some items like McDonalds appear in several rows (due to having different addresses).
I'm trying to combine these rows by a common phrase. So for this example the obvious phrase, or string, would be "McDonalds". I think it'll be an if statement.
Also, the column has a dtype of "object". Will I have to convert it to string format?
Here is an example output from printing totali = df.Item.value_counts() in my code.
Ideally I'd want that line to output McDonalds as just a single row.
In the csv they are 2 separate rows.
foo 14
Restaurant Boulder CO 8
McDonalds Boulder CO 5
McDonalds Denver CO 5
Here's what the column data consists of
'Sukiya Greenwood Vil CO' 'Sei 34179 Denver CO' 'Chambers Place Liquors 303-3731100 CO' "Mcdonald's F26593 Fort Collins CO" 'Suh Sushi Korean Bbq Fort Collins CO' 'Conoco - Sei 26927 Fort Collins CO'
OK. I think I ginned up something that can be helpful. Realize that the task of inferring categories or names from text strings can be huge, depending on how detailed you want to get. You can dive into regex or other learning models. People make careers of it! Obviously, your bank is doing some of this as they categorize things when you get a year-end summary.
Anyhow, here is a simple way to generate some categories and use them as a basis for the grouping that you want to do.
import pandas as pd

item = ['McDonalds Denver', 'Sonoco', 'ATM Fee', 'Sonoco, Ft. Collins', 'McDonalds, Boulder', 'Arco Boulder']
txn = [12.44, 4.00, 3.00, 14.99, 19.10, 52.99]
df = pd.DataFrame([item, txn]).T
df.columns = ['item_orig', 'charge']
print(df)
# let's add an extra column to catch the conversions...
df['item'] = pd.Series(dtype=str)
# we'll use the "contains" function in pandas as a simple converter... quick demo
temp = df.loc[df['item_orig'].str.contains('McDonalds')]
print('\nitems that contain the string "McDonalds"')
print(temp)
# let's build a simple conversion table in a dictionary
conversions = {'McDonalds': 'McDonalds - any',
               'Sonoco': 'gas',
               'Arco': 'gas'}
# let's loop over the orig items and put conversions into the new column
# (there is probably a faster way to do this, but for data with < 100K rows, who cares.)
# note: df.loc[mask, 'item'] instead of df['item'].loc[mask] avoids
# chained-indexing assignment, which may not write through in newer pandas
for key in conversions:
    df.loc[df['item_orig'].str.contains(key), 'item'] = conversions[key]
# see how we did...
print('converted...')
print(df)
# now move over anything that was NOT converted
# in this example, this is just the ATM Fee item...
df.loc[df['item'].isnull(), 'item'] = df['item_orig']
# now we have decent labels to support grouping!
print('\n\n *** sum of charges by group ***')
print(df.groupby('item')['charge'].sum())
Yields:
item_orig charge
0 McDonalds Denver 12.44
1 Sonoco 4
2 ATM Fee 3
3 Sonoco, Ft. Collins 14.99
4 McDonalds, Boulder 19.1
5 Arco Boulder 52.99
items that contain the string "McDonalds"
item_orig charge item
0 McDonalds Denver 12.44 NaN
4 McDonalds, Boulder 19.1 NaN
converted...
item_orig charge item
0 McDonalds Denver 12.44 McDonalds - any
1 Sonoco 4 gas
2 ATM Fee 3 NaN
3 Sonoco, Ft. Collins 14.99 gas
4 McDonalds, Boulder 19.1 McDonalds - any
5 Arco Boulder 52.99 gas
*** sum of charges by group ***
item
ATM Fee 3.00
McDonalds - any 31.54
gas 71.98
Name: charge, dtype: float64
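(As for the dtype question above: the pandas .str accessor works on object-dtype columns holding strings, so no explicit conversion to a string dtype is needed.)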
I want to group multiple categories in a pandas variable using numpy.where and a dictionary.
Currently I am doing this with nested numpy.where calls, which bloats my code when I have a lot of categories. I want to create a map using a dictionary and then use that map in numpy.where.
Sample Data frame:
import numpy as np
import pandas as pd

dataF = pd.DataFrame({'TITLE': ['CEO','CHIEF EXECUTIVE','EXECUTIVE OFFICER','FOUNDER',
                                'CHIEF OP','TECH OFFICER','CHIEF TECH','VICE PRES','PRESIDENT','PRESIDANTE','OWNER','CO OWNER',
                                'DIRECTOR','MANAGER',np.nan]})
dataF
TITLE
0 CEO
1 CHIEF EXECUTIVE
2 EXECUTIVE OFFICER
3 FOUNDER
4 CHIEF OP
5 TECH OFFICER
6 CHIEF TECH
7 VICE PRES
8 PRESIDENT
9 PRESIDANTE
10 OWNER
11 CO OWNER
12 DIRECTOR
13 MANAGER
14 NaN
Numpy operation
dataF['TITLE_GRP'] = np.where(dataF['TITLE'].isna(), 'NOTAVAILABLE',
    np.where(dataF['TITLE'].str.contains('CEO|CHIEF EXECUTIVE|EXECUTIVE OFFICER|FOUN'), 'CEO_FOUNDER',
    np.where(dataF['TITLE'].str.contains('CHIEF|OFFICER|^CFO$|^COO$|^CIO$|^CTO$|^CMO$'), 'OTHER_OFFICERS',
    np.where(dataF['TITLE'].str.contains('VICE|VP'), 'VP',
    np.where(dataF['TITLE'].str.contains('PRESIDENT|PRES'), 'PRESIDENT',
    np.where(dataF['TITLE'].str.contains('OWNER'), 'OWNER_CO_OWN',
    np.where(dataF['TITLE'].str.contains('MANAGER|GM|MGR|MNGR|DIR|HEAD|CHAIR'), 'DIR_MGR_HEAD',
    dataF['TITLE'])))))))
Transformed Data
TITLE TITLE_GRP
0 CEO CEO_FOUNDER
1 CHIEF EXECUTIVE CEO_FOUNDER
2 EXECUTIVE OFFICER CEO_FOUNDER
3 FOUNDER CEO_FOUNDER
4 CHIEF OP OTHER_OFFICERS
5 TECH OFFICER OTHER_OFFICERS
6 CHIEF TECH OTHER_OFFICERS
7 VICE PRES VP
8 PRESIDENT PRESIDENT
9 PRESIDANTE PRESIDENT
10 OWNER OWNER_CO_OWN
11 CO OWNER OWNER_CO_OWN
12 DIRECTOR DIR_MGR_HEAD
13 MANAGER DIR_MGR_HEAD
14 NaN NOTAVAILABLE
What I want to do is create some mapping like below:
TITLE_REPLACE = {'CEO_FOUNDER':'CEO|CHIEF EXECUTIVE|EXECUTIVE OFFICER|FOUN',
'OTHER_OFFICERS':'CHIEF|OFFICER|^CFO$|^COO$|^CIO$|^CTO$|^CMO$',
'VP':'VICE|VP',
'PRESIDENT':'PRESIDENT|PRES',
'OWNER_CO_OWN':'OWNER',
'DIR_MGR_HEAD':'MANAGER|GM|MGR|MNGR|DIR|HEAD|CHAIR'}
And then feed it to some function that applies the stepwise numpy operation and gives me the same result as above.
I am doing this because I have to parameterize my code in such a way that all parameters for data manipulation are provided from a JSON file.
I was trying pandas.replace, as it has dictionary capability, but it doesn't preserve the hierarchical structure of the nested np.where; it is also not able to replace the whole title, as it just replaces the substring when it finds a match.
If you are able to provide a solution for the above, I would also like to know how to solve the following 2 other scenarios.
This scenario uses an .isin operation instead of regex:
dataF['INDUSTRY'] = np.where(dataF['INDUSTRY'].isin(['AEROSPACE','AGRICULTURE/MINING','EDUCATION','ENERGY']),'AER_AGR_MIN_EDU_ENER',
    np.where(dataF['INDUSTRY'].isin(['TRAVEL','INSURANCE','GOVERNMENT','FINANCIAL SERVICES','AUTO','PHARMACEUTICALS']),'TRA_INS_GOVT_FIN_AUT_PHAR',
    np.where(dataF['INDUSTRY'].isin(['BUSINESS GOODS/SERVICES','CHEMICALS ','TELECOM','TRANSPORTATION']),'BS_CHEM_TELE_TRANSP',
    np.where(dataF['INDUSTRY'].isin(['CONSUMER GOODS','ENTERTAINMENT','FOOD AND BEVERAGE','HEALTHCARE','INDUSTRIAL/MANUFACTURING','TECHNOLOGY']),'CG_ENTER_FB_HLTH_IND_TECH',
    np.where(dataF['INDUSTRY'].isin(['ADVERTISING','ASSOCIATION','CONSULTING/ACCOUNTING','PUBLISHING/MEDIA','TECHNOLOGY']),'ADV_ASS_CONS_ACC_PUBL_MED_TECH',
    np.where(dataF['INDUSTRY'].isin(['RESTAURANT','SOFTWARE']),'REST_SOFT',
    'NOTAVAILABLE'))))))
This scenario uses a .between operation:
dataF['annual_revn'] = np.where(dataF['annual_revn'].between(1000000,10000000),'1_10_MILLION',
    np.where(dataF['annual_revn'].between(10000000,15000000),'10_15_MILLION',
    np.where(dataF['annual_revn'].between(15000000,20000000),'15_20_MILLION',
    np.where(dataF['annual_revn'].between(20000000,50000000),'20_50_MILLION',
    np.where(dataF['annual_revn'].between(50000000,1000000000),'50_1000_MILLION',
    'NOTAVAILABLE_OUTLIER')))))
The below method works, but it isn't particularly elegant, and it may not be that fast.
import pandas as pd
import numpy as np
import re
dataF = pd.DataFrame({'TITLE':['CEO','CHIEF EXECUTIVE','EXECUTIVE OFFICER','FOUNDER',
'CHIEF OP','TECH OFFICER','CHIEF TECH','VICE PRES','PRESIDENT','PRESIDANTE','OWNER','CO OWNER',
'DIRECTOR','MANAGER',np.nan]})
TITLE_REPLACE = {'CEO_FOUNDER':'CEO|CHIEF EXECUTIVE|EXECUTIVE OFFICER|FOUN',
'OTHER_OFFICERS':'CHIEF|OFFICER|^CFO$|^COO$|^CIO$|^CTO$|^CMO$',
'VP':'VICE|VP',
'PRESIDENT':'PRESIDENT|PRES',
'OWNER_CO_OWN':'OWNER',
'DIR_MGR_HEAD':'MANAGER|GM|MGR|MNGR|DIR|HEAD|CHAIR'}
# Swap the keys and values from the raw data, and split regex by '|'
reverse_replace = {}
for key, value in TITLE_REPLACE.items():
    for value_single in value.split('|'):
        reverse_replace[value_single] = key

def mapping_func(x):
    if x is not np.nan:
        # first match wins, so dict insertion order preserves the hierarchy
        for key, value in reverse_replace.items():
            if re.compile(key).search(x):
                return value
    return 'NOTAVAILABLE'

dataF['TITLE_GRP'] = dataF['TITLE'].apply(mapping_func)
TITLE TITLE_GRP
0 CEO CEO_FOUNDER
1 CHIEF EXECUTIVE CEO_FOUNDER
2 EXECUTIVE OFFICER CEO_FOUNDER
3 FOUNDER CEO_FOUNDER
4 CHIEF OP OTHER_OFFICERS
5 TECH OFFICER OTHER_OFFICERS
6 CHIEF TECH OTHER_OFFICERS
7 VICE PRES VP
8 PRESIDENT PRESIDENT
9 PRESIDANTE PRESIDENT
10 OWNER OWNER_CO_OWN
11 CO OWNER OWNER_CO_OWN
12 DIRECTOR DIR_MGR_HEAD
13 MANAGER DIR_MGR_HEAD
14 NaN NOTAVAILABLE
For your additional scenarios, it may make sense to construct a df with the industry mapping data, then do df.merge to determine the grouping from the industry.
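Alternatively, a hedged sketch of both extra scenarios (the sample frame and column names here are illustrative, not from the original post): the .isin case can be driven from a dictionary by inverting it into a value-to-group lookup for Series.map, and the .between case maps naturally onto pd.cut. Note that .between is inclusive on both ends while pd.cut bins are right-inclusive only:
import pandas as pd

# hypothetical sample frame standing in for the question's columns
dataF = pd.DataFrame({'INDUSTRY': ['AEROSPACE', 'SOFTWARE', 'TELECOM'],
                      'annual_revn': [2_000_000, 12_000_000, 500]})

# scenario 2 (.isin): invert {group: [values]} into {value: group}, then map
# (only two of the question's six groups shown)
INDUSTRY_REPLACE = {
    'AER_AGR_MIN_EDU_ENER': ['AEROSPACE', 'AGRICULTURE/MINING', 'EDUCATION', 'ENERGY'],
    'REST_SOFT': ['RESTAURANT', 'SOFTWARE'],
}
lookup = {v: grp for grp, vals in INDUSTRY_REPLACE.items() for v in vals}
dataF['INDUSTRY_GRP'] = dataF['INDUSTRY'].map(lookup).fillna('NOTAVAILABLE')

# scenario 3 (.between): pd.cut buckets values by bin edges with labels
bins = [1_000_000, 10_000_000, 15_000_000, 20_000_000, 50_000_000, 1_000_000_000]
labels = ['1_10_MILLION', '10_15_MILLION', '15_20_MILLION', '20_50_MILLION', '50_1000_MILLION']
cut = pd.cut(dataF['annual_revn'], bins=bins, labels=labels)
dataF['annual_revn_grp'] = cut.cat.add_categories(['NOTAVAILABLE_OUTLIER']).fillna('NOTAVAILABLE_OUTLIER')
print(dataF)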
I am new to Python and I am trying to merge two datasets for my research together:
df1 has the column names: companyname, ticker, and Dscode,
df2 has companyname, ticker, grouptcode, and Dscode.
I want to merge the grouptcode from df2 into df1; however, the companyname is slightly different, though very similar, between the two dataframes.
For each ticker, there is an associated Dscode. However, multiple companies have the same ticker, and therefore the same Dscode.
Problem
I am only interested in merging the grouptcode for the associated ticker and Dscode where the companyname matches (the names at times differ slightly, and this is the part I cannot get past). The code I have been using is below.
Code
import pandas as pd
import os
# set working directory
path = "/Users/name/Desktop/Python"
os.chdir(path)
os.getcwd() # returns the working directory
# read in excel file
file = "/Users/name/Desktop/Python/Excel/DSROE.xlsx"
x1 = pd.ExcelFile(file)
print(x1.sheet_names)
df1 = x1.parse('Sheet1')
df1.head()
df1.tail()
file2 = "/Users/name/Desktop/Python/Excel/tcode2.xlsx"
x2 = pd.ExcelFile(file2)
print(x2.sheet_names)
df2 = x2.parse('Sheet1')
df2['companyname'] = df2['companyname'].str.upper() ## make column uppercase
df2.head()
df2.tail()
df2 = df2.dropna()
x3 = pd.merge(df1, df2, how='outer') # merge
Data
df1
Dscode ticker companyname
65286 8933TC 3pl 3P LEARNING LIMITED
79291 9401FP a2m A2 MILK COMPANY LIMITED
1925 14424Q aac AUSTRALIAN AGRICULTURAL COMPANY LIMITED
39902 675493 aad ARDENT LEISURE GROUP
1400 133915 aba AUSWIDE BANK LIMITED
74565 922472 abc ADELAIDE BRIGHTON LIMITED
7350 26502C abp ABACUS PROPERTY GROUP
39202 675142 ada ADACEL TECHNOLOGIES LIMITED
80866 9661AD adh ADAIRS
80341 9522QV afg AUSTRALIAN FINANCE GROUP LIMITED
45327 691938 agg ANGLOGOLD ASHANTI LIMITED
2625 14880E agi AINSWORTH GAME TECHNOLOGY LIMITED
75090 923040 agl AGL ENERGY LIMITED
19251 29897X ago ATLAS IRON LIMITED
64409 890588 agy ARGOSY MINERALS LIMITED
24151 31511D ahg AUTOMOTIVE HOLDINGS GROUP LIMITED
64934 8917JD ahy ASALEO CARE LIMITED
42877 691152 aia AUCKLAND INTERNATIONAL AIRPORT LIMITED
61433 88013C ajd ASIA PACIFIC DATA CENTRE GROUP
44452 691704 ajl AJ LUCAS GROUP LIMITED
700 13288C ajm ALTURA MINING LIMITED
19601 29929D akp AUDIO PIXELS HOLDINGS LIMITED
79816 951404 alk ALKANE RESOURCES LIMITED
56008 865613 all ARISTOCRAT LEISURE LIMITED
51807 771351 alq ALS LIMITED
44277 691685 alu ALTIUM LIMITED
42702 68625C alx ATLAS ARTERIA GROUP
30101 41162F ama AMA GROUP LIMITED
67386 902201 amc AMCOR LIMITED
33426 50431L ami AURELIA METALS LIMITED
df2
     companyname                                grouptcode  ticker  Dscode
524  3P LEARNING LIMITED..                      tpn1        3pl     8933TC
1    THE A2 MILK COMPANY LIMITED                a2m1        a2m     9401FP
2    AUSTRALIAN AGRICULTURAL COMPANY LIMITED.   aac2        aac     14424Q
3    AAPC LIMITED.                              aad1        aad     675493
6    ADVANCE BANK AUSTRALIA LIMITED             aba1        aba     133915
7    ADELAIDE BRIGHTON CEMENT HOLDINGS LIMITED  abc1        abc     922472
8    ABACUS PROPERTY GROUP                      abp1        abp     26502C
9    ADACEL TECHNOLOGIES LIMITED                ada1        ada     675142
288  ADA CORPORATION LIMITED                    khs1        ada     675142
10   AERODATA HOLDINGS LIMITED                  adh1        adh     9661AD
11   ADAMS (HERBERT) HOLDINGS LIMITED           adh2        adh     9661AD
12   ADAIRS LIMITED                             adh3        adh     9661AD
431  ALLCO FINANCE GROUP LIMITED                rcd1        afg     9522QV
13   AUSTRALIAN FINANCE GROUP LTD               afg1        afg     9522QV
14   ANGLOGOLD ASHANTI LIMITED                  agg1        agg     691938
15   APGAR INDUSTRIES LIMITED                   agi1        agi     14880E
16   AINSWORTH GAME TECHNOLOGY LIMITED          agi2        agi     14880E
17   AUSTRALIAN GAS LIGHT COMPANY (THE)         agl1        agl     923040
18   ATLAS IRON LIMITED                         ago1        ago     29897X
393  ACM GOLD LIMITED                           pgo2        ago     29897X
19   AUSTRALIAN GYPSUM INDUSTRIES LIMITED       agy1        agy     890588
142  ARGOSY MINERALS INC                        cio1        agy     890588
21   ARCHAEAN GOLD NL                           ahg1        ahg     31511D
22   AUSTRALIAN HYDROCARBONS N.L.               ahy1        ahy     8917JD
23   ASALEO CARE LIMITED                        ahy2        ahy     8917JD
24   AUCKLAND INTERNATIONAL AIRPORT LIMITED     aia1        aia     691152
25   ASIA PACIFIC DATA CENTRE GROUP             ajd1        ajd     88013C
26   AJ LUCAS GROUP LIMITED                     ajl1        ajl     691704
27   AJAX MCPHERSON'S LIMITED                   ajm1        ajm     13288C
29   ALKANE EXPLORATION (TERRIGAL) N.L.         alk1        alk     951404
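Since the names only nearly match, one hedged sketch of a way to attach grouptcode uses difflib from the standard library (the helper name and cutoff value are illustrative, not from the original post):
import difflib

# for each row of df1, look within the same ticker in df2 for the closest
# companyname and copy over that row's grouptcode
def attach_grouptcode(df1, df2, cutoff=0.6):
    out = df1.copy()
    out['grouptcode'] = None
    for idx, row in out.iterrows():
        cand = df2[df2['ticker'] == row['ticker']]
        hit = difflib.get_close_matches(row['companyname'],
                                        cand['companyname'].tolist(),
                                        n=1, cutoff=cutoff)
        if hit:
            out.at[idx, 'grouptcode'] = cand.loc[
                cand['companyname'] == hit[0], 'grouptcode'].iloc[0]
    return out

x3 = attach_grouptcode(df1, df2)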