I'm trying to update a DataFrame by concatenating new observations. I believe my result is right, but the system keeps coming back with "ValueError: Can only compare identically-labeled DataFrame objects". Can anyone tell me why there's a ValueError when I think I got the right result?
Here is the question:
Assume the data frame Employee is as below:
Department Title Year Education Sex
Name
Bob IT analyst 1 Bachelor M
Sam Trade associate 3 PHD M
Peter HR VP 8 Master M
Jake IT analyst 2 Master M
and another data frame new_observations is:
Department Education Sex Title Year
Name
Mary IT F VP 9.0
Amy ? PHD F associate 5.0
Jennifer Trade Master F associate NaN
John HR Master M analyst 2.0
Judy HR Bachelor F analyst 2.0
Update Employee with these new observations.
Here is my code:
import pandas as pd
Employee = pd.DataFrame({"Name": ["Bob", "Sam", "Peter", "Jake"],
                         "Education": ["Bachelor", "PHD", "Master", "Master"],
                         "Sex": ["M", "M", "M", "M"],
                         "Year": [1, 3, 8, 2],
                         "Department": ["IT", "Trade", "HR", "IT"],
                         "Title": ["analyst", "associate", "VP", "analyst"]})
Employee = Employee.set_index('Name')
new_observations = pd.DataFrame({
    "Name": ["Mary", "Amy", "Jennifer", "John", "Judy"],
    "Department": ["IT", "?", "Trade", "HR", "HR"],
    "Education": ["", "PHD", "Master", "Master", "Bachelor"],
    "Sex": ["F", "F", "F", "M", "F"],
    "Title": ["VP", "associate", "associate", "analyst", "analyst"],
    "Year": [9.0, 5.0, "NaN", 2.0, 2.0]},
    columns=["Name", "Department", "Education", "Sex", "Title", "Year"])
new_observations = new_observations.set_index('Name')
Employee = Employee.append(new_observations, sort=False)
Here is my result:
[screenshot of the result]
I also tried
Employee = pd.concat([Employee, new_observations], axis = 1, sort=False)
Use pd.concat with axis=0, which is the default, so you don't need to include axis (DataFrame.append is deprecated in recent pandas, so pd.concat is preferred anyway):
pd.concat([Employee, new_observations], sort=False)
Output:
Education Sex Year Department Title
Name
Bob Bachelor M 1 IT analyst
Sam PHD M 3 Trade associate
Peter Master M 8 HR VP
Jake Master M 2 IT analyst
Mary F 9 IT VP
Amy PHD F 5 ? associate
Jennifer Master F NaN Trade associate
John Master M 2 HR analyst
Judy Bachelor F 2 HR analyst
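As an aside, note that "NaN" in your Year list is a Python string, not a real missing value, so after concatenating, the Year column has object dtype. A minimal sketch of one way to coerce it, assuming you assign the concat result back to Employee and want real NaNs:
import pandas as pd

Employee = pd.concat([Employee, new_observations], sort=False)
# coerce the string "NaN" (and anything else non-numeric) to a real NaN
Employee["Year"] = pd.to_numeric(Employee["Year"], errors="coerce")
print(Employee["Year"].dtype)  # float64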
Related
I have two dataframes, df1 and df2, with different row counts but the same columns. The ID column is common across both dataframes. I want to write the differences to a text file. For example:
df1:
ID Name Age Profession sex
1 Tom 20 engineer M
2 nick 21 doctor M
3 krishi 19 lawyer F
4 jacky 18 dentist F
df2:
ID Name Age Profession sex
1 Tom 20 plumber M
2 nick 21 doctor M
3 krishi 23 Analyst F
4 jacky 18 dentist F
The resultant text file should look like:
ID : 1
Profession_old Profession_new
engineer plumber
ID : 3
Age_old Age_new Profession_old Profession_new
19 23 lawyer Analyst
You can use compare and a loop:
df3 = df1.set_index('ID').compare(df2.set_index('ID'))
df3.columns = (df3.rename({'self': 'old', 'other': 'new'}, level=1, axis=1)
.columns.map('_'.join)
)
for id, row in df3.iterrows():
    print(f'ID : {id}')
    print(row.dropna().to_frame().T.to_string(index=False))
    print()
output:
ID : 1
Profession_old Profession_new
engineer plumber
ID : 3
Age_old Age_new Profession_old Profession_new
19.0 23.0 lawyer Analyst
NB: print is used here for the demo; to write to a file instead, open it in write mode and move the loop inside:
with open('file.txt', 'w') as f:
    for id, row in df3.iterrows():
        f.write(f'ID : {id}\n')
        f.write(row.dropna().to_frame().T.to_string(index=False))
        f.write('\n\n')
You could also directly use df3:
Age_old Age_new Profession_old Profession_new
ID
1 NaN NaN engineer plumber
3 19.0 23.0 lawyer Analyst
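If the per-ID blocks are not required, here is a minimal sketch that writes df3 in one go (the file name is just an example):
# dump the whole comparison table at once
with open('file.txt', 'w') as f:
    f.write(df3.to_string())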
I have a csv file that has to be converted to a pandas data frame, but in some rows the columns 'name' and 'business' contain multiple names and businesses that should be in separate rows. These need to be split up while keeping the data from the other columns the same for each row that is split.
Here is the example data:
input:

software  name                                 business
abc       Andrew Johnson, Steve Martin         Outsourcing/Offshoring, 201-500 employees,Health, Wellness and Fitness, 5001-10,000 employees
xyz       Jack Jones, Rick Paul, Johnny Jones  Banking, 1001-5000 employees,Construction, 51-200 employees,Consumer Goods, 10,001+ employees
def       Tom D., Connie J., Ricky B.          Unspecified, Unspecified, Self-employed

output I need:

software  name            business
abc       Andrew Johnson  Outsourcing/Offshoring, 201-500 employees
abc       Steve Martin    Health, Wellness and Fitness, 5001-10,000 employees
xyz       Jack Jones      Banking, 1001-5000 employees
xyz       Rick Paul       Construction, 51-200 employees
xyz       Johnny Jones    Consumer Goods, 10,001+ employees
def       Tom D.          Unspecified
def       Connie J.       Unspecified
def       Ricky B.        Self-employed
There are additional columns similar to 'name' and 'business' that contain multiple pieces of information that need to be split up just like 'name' and 'business'. Cells that contain multiple pieces of information are in sequence (ordered).
Here's the code I have so far. It creates new rows, but it only splits up the contents of the name column, which leaves the business column and a few other columns still needing to be split up along with the contents of the name column.
name2 = df.name.str.split(',', expand=True).stack()
df = df.join(pd.Series(index=name2.index.droplevel(1), data=name2.values, name='name2'))
records = df.to_dict('records')
for row in records:
    new_segment = {}
    new_segment['name'] = str(row['name2'])
    #df['name'] = str(row['name2'])
    for col, content in new_segment.items():
        row[col] = content
df = pd.DataFrame.from_dict(records)
df = df.drop(columns='name2')
Here's an alternative solution I was trying as well but it gives me an error too:
import glob
import io
from pathlib import Path

review_path = r'data/base_data'
review_files = glob.glob(review_path + "/test_data.csv")
review_df_list = []
for review_file in review_files:
    df = pd.read_csv(io.StringIO(review_file), sep='\t')
    print(df.head())
    df["business"] = (df["business"].str.extractall(r"(?:[\s,]*)(.*?(?:Unspecified|employees|Self-employed))")
                      .groupby(level=0).agg(list))
    df["name"] = df["name"].str.split(r"\s*,\s*")
    print(df.explode(["name", "business"]))
outPutPath = Path('data/base_data/test_data.csv')
df.to_csv(outPutPath, index=False)
Error Message for alternative solution:
Read:data/base_data/review_base.csv
Success!
Empty DataFrame
Columns: [data/base_data/test_data.csv]
Index: []
Try:
df["business"] = (
df["business"]
.str.extractall(r"(?:[\s,]*)(.*?(?:Unspecified|employees|Self-employed))")
.groupby(level=0)
.agg(list)
)
df["name"] = df["name"].str.split(r"\s*,\s*")
print(df.explode(["name", "business"]))
Prints:
software name business
0 abc Andrew Johnson Outsourcing/Offshoring, 201-500 employees
0 abc Steve Martin Health, Wellness and Fitness, 5001-10,000 employees
1 xyz Jack Jones Banking, 1001-5000 employees
1 xyz Rick Paul Construction, 51-200 employees
1 xyz Johnny Jones Consumer Goods, 10,001+ employees
2 def Tom D. Unspecified
2 def Connie J. Unspecified
2 def Ricky B. Self-employed
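Regarding the Empty DataFrame error in your alternative attempt: io.StringIO(review_file) hands the path string itself to read_csv as if it were file contents, which is why the only "column" is the file name. A sketch of the likely fix, assuming the files really are tab-separated:
import glob
import pandas as pd

review_path = r'data/base_data'
for review_file in glob.glob(review_path + "/test_data.csv"):
    df = pd.read_csv(review_file, sep='\t')  # pass the path itself, not io.StringIO(path)
    print(df.head())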
I want to group multiple categories in a pandas variable using numpy.where and a dictionary.
Currently I am doing this using just numpy.where, which bloats my code a lot when I have many categories. I want to create a map using a dictionary and then use that map in numpy.where.
Sample Data frame:
import numpy as np
import pandas as pd

dataF = pd.DataFrame({'TITLE': ['CEO', 'CHIEF EXECUTIVE', 'EXECUTIVE OFFICER', 'FOUNDER',
                                'CHIEF OP', 'TECH OFFICER', 'CHIEF TECH', 'VICE PRES', 'PRESIDENT',
                                'PRESIDANTE', 'OWNER', 'CO OWNER', 'DIRECTOR', 'MANAGER', np.nan]})
dataF
TITLE
0 CEO
1 CHIEF EXECUTIVE
2 EXECUTIVE OFFICER
3 FOUNDER
4 CHIEF OP
5 TECH OFFICER
6 CHIEF TECH
7 VICE PRES
8 PRESIDENT
9 PRESIDANTE
10 OWNER
11 CO OWNER
12 DIRECTOR
13 MANAGER
14 NaN
Numpy operation
dataF['TITLE_GRP'] = np.where(dataF['TITLE'].isna(), 'NOTAVAILABLE',
                     np.where(dataF['TITLE'].str.contains('CEO|CHIEF EXECUTIVE|EXECUTIVE OFFICER|FOUN'), 'CEO_FOUNDER',
                     np.where(dataF['TITLE'].str.contains('CHIEF|OFFICER|^CFO$|^COO$|^CIO$|^CTO$|^CMO$'), 'OTHER_OFFICERS',
                     np.where(dataF['TITLE'].str.contains('VICE|VP'), 'VP',
                     np.where(dataF['TITLE'].str.contains('PRESIDENT|PRES'), 'PRESIDENT',
                     np.where(dataF['TITLE'].str.contains('OWNER'), 'OWNER_CO_OWN',
                     np.where(dataF['TITLE'].str.contains('MANAGER|GM|MGR|MNGR|DIR|HEAD|CHAIR'), 'DIR_MGR_HEAD',
                              dataF['TITLE'])))))))
Transformed Data
TITLE TITLE_GRP
0 CEO CEO_FOUNDER
1 CHIEF EXECUTIVE CEO_FOUNDER
2 EXECUTIVE OFFICER CEO_FOUNDER
3 FOUNDER CEO_FOUNDER
4 CHIEF OP OTHER_OFFICERS
5 TECH OFFICER OTHER_OFFICERS
6 CHIEF TECH OTHER_OFFICERS
7 VICE PRES VP
8 PRESIDENT PRESIDENT
9 PRESIDANTE PRESIDENT
10 OWNER OWNER_CO_OWN
11 CO OWNER OWNER_CO_OWN
12 DIRECTOR DIR_MGR_HEAD
13 MANAGER DIR_MGR_HEAD
14 NaN NOTAVAILABLE
What I want to do is create some mapping like below:
TITLE_REPLACE = {'CEO_FOUNDER': 'CEO|CHIEF EXECUTIVE|EXECUTIVE OFFICER|FOUN',
                 'OTHER_OFFICERS': 'CHIEF|OFFICER|^CFO$|^COO$|^CIO$|^CTO$|^CMO$',
                 'VP': 'VICE|VP',
                 'PRESIDENT': 'PRESIDENT|PRES',
                 'OWNER_CO_OWN': 'OWNER',
                 'DIR_MGR_HEAD': 'MANAGER|GM|MGR|MNGR|DIR|HEAD|CHAIR'}
And then feed it to some function which applies the stepwise numpy operation and gives me the same result as above.
I am doing this because I have to parameterize my code in such a way that all parameters for data manipulation are provided from a JSON file.
I was trying pandas.replace, since it accepts a dictionary, but it doesn't preserve the hierarchical structure of the nested np.where; it's also not able to replace the whole title, as it only replaces the substring when it finds a match.
In case you are able to provide a solution for the above, I would also like to know how to solve the following two other scenarios.
This scenario uses an .isin operation instead of regex:
dataF['INDUSTRY'] = np.where(dataF['INDUSTRY'].isin(['AEROSPACE', 'AGRICULTURE/MINING', 'EDUCATION', 'ENERGY']), 'AER_AGR_MIN_EDU_ENER',
                    np.where(dataF['INDUSTRY'].isin(['TRAVEL', 'INSURANCE', 'GOVERNMENT', 'FINANCIAL SERVICES', 'AUTO', 'PHARMACEUTICALS']), 'TRA_INS_GOVT_FIN_AUT_PHAR',
                    np.where(dataF['INDUSTRY'].isin(['BUSINESS GOODS/SERVICES', 'CHEMICALS ', 'TELECOM', 'TRANSPORTATION']), 'BS_CHEM_TELE_TRANSP',
                    np.where(dataF['INDUSTRY'].isin(['CONSUMER GOODS', 'ENTERTAINMENT', 'FOOD AND BEVERAGE', 'HEALTHCARE', 'INDUSTRIAL/MANUFACTURING', 'TECHNOLOGY']), 'CG_ENTER_FB_HLTH_IND_TECH',
                    np.where(dataF['INDUSTRY'].isin(['ADVERTISING', 'ASSOCIATION', 'CONSULTING/ACCOUNTING', 'PUBLISHING/MEDIA', 'TECHNOLOGY']), 'ADV_ASS_CONS_ACC_PUBL_MED_TECH',
                    np.where(dataF['INDUSTRY'].isin(['RESTAURANT', 'SOFTWARE']), 'REST_SOFT',
                             'NOTAVAILABLE'))))))
This scenario uses a .between operation:
dataF['annual_revn'] = np.where(dataF['annual_revn'].between(1000000, 10000000), '1_10_MILLION',
                       np.where(dataF['annual_revn'].between(10000000, 15000000), '10_15_MILLION',
                       np.where(dataF['annual_revn'].between(15000000, 20000000), '15_20_MILLION',
                       np.where(dataF['annual_revn'].between(20000000, 50000000), '20_50_MILLION',
                       np.where(dataF['annual_revn'].between(50000000, 1000000000), '50_1000_MILLION',
                                'NOTAVAILABLE_OUTLIER')))))
The below method works, but it isn't particularly elegant, and it may not be that fast.
import pandas as pd
import numpy as np
import re
dataF = pd.DataFrame({'TITLE': ['CEO', 'CHIEF EXECUTIVE', 'EXECUTIVE OFFICER', 'FOUNDER',
                                'CHIEF OP', 'TECH OFFICER', 'CHIEF TECH', 'VICE PRES', 'PRESIDENT',
                                'PRESIDANTE', 'OWNER', 'CO OWNER', 'DIRECTOR', 'MANAGER', np.nan]})
TITLE_REPLACE = {'CEO_FOUNDER': 'CEO|CHIEF EXECUTIVE|EXECUTIVE OFFICER|FOUN',
                 'OTHER_OFFICERS': 'CHIEF|OFFICER|^CFO$|^COO$|^CIO$|^CTO$|^CMO$',
                 'VP': 'VICE|VP',
                 'PRESIDENT': 'PRESIDENT|PRES',
                 'OWNER_CO_OWN': 'OWNER',
                 'DIR_MGR_HEAD': 'MANAGER|GM|MGR|MNGR|DIR|HEAD|CHAIR'}
# Swap the keys and values from the raw data, and split each regex on '|'
reverse_replace = {}
for key, value in TITLE_REPLACE.items():
    for value_single in value.split('|'):
        reverse_replace[value_single] = key

def mapping_func(x):
    if x is not np.nan:
        for key, value in reverse_replace.items():
            if re.compile(key).search(x):
                return value
    return 'NOTAVAILABLE'

dataF['TITLE_GRP'] = dataF['TITLE'].apply(mapping_func)
TITLE TITLE_GRP
0 CEO CEO_FOUNDER
1 CHIEF EXECUTIVE CEO_FOUNDER
2 EXECUTIVE OFFICER CEO_FOUNDER
3 FOUNDER CEO_FOUNDER
4 CHIEF OP OTHER_OFFICERS
5 TECH OFFICER OTHER_OFFICERS
6 CHIEF TECH OTHER_OFFICERS
7 VICE PRES VP
8 PRESIDENT PRESIDENT
9 PRESIDANTE PRESIDENT
10 OWNER OWNER_CO_OWN
11 CO OWNER OWNER_CO_OWN
12 DIRECTOR DIR_MGR_HEAD
13 MANAGER DIR_MGR_HEAD
14 NaN NOTAVAILABLE
For your additional scenarios, it may make sense to construct a df with the industry mapping data, then do df.merge to determine the grouping from the industry.
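For instance, here is a minimal sketch of both extra scenarios, assuming dataF also has the INDUSTRY and annual_revn columns from your question (the industry dict below is trimmed to two groups for brevity). The .isin case becomes a reversed-dict Series.map, and the contiguous .between bins fit pd.cut:
import numpy as np
import pandas as pd

# .isin scenario: reverse the {group: [values]} mapping, then map.
# Caveat: a value listed under two groups (e.g. 'TECHNOLOGY' in your question)
# ends up in the later one here, whereas nested np.where keeps the first match.
INDUSTRY_REPLACE = {'AER_AGR_MIN_EDU_ENER': ['AEROSPACE', 'AGRICULTURE/MINING', 'EDUCATION', 'ENERGY'],
                    'REST_SOFT': ['RESTAURANT', 'SOFTWARE']}  # trimmed for the example
reverse = {v: k for k, vals in INDUSTRY_REPLACE.items() for v in vals}
dataF['INDUSTRY'] = dataF['INDUSTRY'].map(reverse).fillna('NOTAVAILABLE')

# .between scenario: contiguous numeric bins map naturally to pd.cut
# (note: .between is inclusive on both ends, while cut bins are right-inclusive)
bins = [1e6, 1e7, 1.5e7, 2e7, 5e7, 1e9]
labels = ['1_10_MILLION', '10_15_MILLION', '15_20_MILLION', '20_50_MILLION', '50_1000_MILLION']
dataF['annual_revn'] = (pd.cut(dataF['annual_revn'], bins=bins, labels=labels)
                        .astype(object)
                        .fillna('NOTAVAILABLE_OUTLIER'))
Both the dict and the bin edges could then live in your JSON parameter file.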
I have a large df called data which looks like:
Identifier Surname First names(s) Date change Work Pattern Region
0 12233.0 Smith Bob FT NW
1 54213.0 Jones Sally 15/04/15 FT NW
2 12237.0 Evans Steve 26/08/14 FT SE
3 10610.0 Cooper Amy 16/08/12 FT SE
I have another dataframe called updates. In this example the dataframe has updated information for data for a couple of records and looks like:
Identifier Surname First names(s) Date change
0 12233.0 Smith Bob 05/09/14
1 10610.0 Cooper Amy 16/08/12
I'm trying to find a way to update data with the updates df so the resulting dataframe looks like:
Identifier Surname First names(s) Date change Work Pattern Region
0 12233.0 Smith Bob 15/09/14 FT NW
1 54213.0 Jones Sally 15/04/15 FT NW
2 12237.0 Evans Steve 26/08/14 FT SE
3 10610.0 Cooper Amy 16/08/12 FT SE
As you can see the Date change field for Bob in the data df has been updated with the Date change from the updates df.
What can I try next?
A while back, I was dealing with that too. The straight-up .update was giving me issues (sorry, I can't remember the exact issue I had; I think it was that .update relies on the indexes matching, and they didn't match in my two separate dataframes, so I wanted to use certain columns as my index to update on). But I made a function to deal with it. This might be way more than what's needed, but try it and see if it'll work.
I'm also assuming the date you want to take from the updates dataframe should be 15/09/14, not 05/09/14, so I used that in my sample data below.
Also, I'm assuming Identifier is a unique key. If not, you'll need to include multiple columns as your unique key.
import pandas as pd

data = pd.DataFrame([[12233.0, 'Smith', 'Bob', '', 'FT', 'NW'],
                     [54213.0, 'Jones', 'Sally', '15/04/15', 'FT', 'NW'],
                     [12237.0, 'Evans', 'Steve', '26/08/14', 'FT', 'SE'],
                     [10610.0, 'Cooper', 'Amy', '16/08/12', 'FT', 'SE']],
                    columns=['Identifier', 'Surname', 'First names(s)', 'Date change', 'Work Pattern', 'Region'])

updates = pd.DataFrame([[12233.0, 'Smith', 'Bob', '15/09/14'],
                        [10610.0, 'Cooper', 'Amy', '16/08/12']],
                       columns=['Identifier', 'Surname', 'First names(s)', 'Date change'])

def update(df1, df2, keys_list):
    df1 = df1.set_index(keys_list)
    df2 = df2.set_index(keys_list)
    # Index.get_duplicates() was removed from pandas; use duplicated() instead
    dup_idx1 = df1.index[df1.index.duplicated()]
    dup_idx2 = df2.index[df2.index.duplicated()]
    if len(dup_idx1) > 0 or len(dup_idx2) > 0:
        print('\n' + '#'*50 + '\nError! Duplicate indices:')
        for element in dup_idx1:
            print('df1: %s' % (element,))
        for element in dup_idx2:
            print('df2: %s' % (element,))
        print('#'*50 + '\n\n')
    df1.update(df2, overwrite=True)
    df1.reset_index(inplace=True)
    df2.reset_index(inplace=True)
    return df1

# the 3rd input is a list, in case you need multiple columns as your unique key
df = update(data, updates, ['Identifier'])
Output:
print (data)
Identifier Surname First names(s) Date change Work Pattern Region
0 12233.0 Smith Bob FT NW
1 54213.0 Jones Sally 15/04/15 FT NW
2 12237.0 Evans Steve 26/08/14 FT SE
3 10610.0 Cooper Amy 16/08/12 FT SE
print (updates)
Identifier Surname First names(s) Date change
0 12233.0 Smith Bob 15/09/14
1 10610.0 Cooper Amy 16/08/12
df = update(data, updates, ['Identifier'])
print (df)
Identifier Surname First names(s) Date change Work Pattern Region
0 12233.0 Smith Bob 15/09/14 FT NW
1 54213.0 Jones Sally 15/04/15 FT NW
2 12237.0 Evans Steve 26/08/14 FT SE
3 10610.0 Cooper Amy 16/08/12 FT SE
Using DataFrame.update.
First set index:
data.set_index('Identifier', inplace=True)
updates.set_index('Identifier', inplace=True)
Then update:
data.update(updates)
print(data)
Surname First names(s) Date change Work Pattern Region
Identifier
12233.0 Smith Bob 15/09/14 FT NW
54213.0 Jones Sally 15/04/15 FT NW
12237.0 Evans Steve 26/08/14 FT SE
10610.0 Cooper Amy 16/08/12 FT SE
If you need multiple columns to create a unique index you can just set them with a list. For example:
data.set_index(['Identifier', 'Surname'], inplace=True)
updates.set_index(['Identifier', 'Surname'], inplace=True)
data.update(updates)
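One caveat worth knowing: DataFrame.update modifies the frame in place, aligns on the index, and by default only copies non-NA values from the other frame. A small illustrative sketch (the frames here are made up):
import numpy as np
import pandas as pd

base = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'c'])
new = pd.DataFrame({'x': [10, np.nan]}, index=['a', 'b'])

base.update(new)  # in place; the NaN for 'b' does NOT overwrite the old value
print(base)       # 'a' becomes 10, 'b' stays 2, 'c' stays 3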
I have two dataframes as shown below.
Company Name BOD Position Ethnicity DOB Age Gender Degree ( Specialazation) Remark
0 Big Lots Inc. David J. Campisi Director, President and Chief Executive Offic... American 1956 61 Male Graduate NaN
1 Big Lots Inc. Philip E. Mallott Chairman of the Board American 1958 59 Male MBA, Finace NaN
2 Big Lots Inc. James R. Chambers Independent Director American 1958 59 Male MBA NaN
3 Momentive Performance Materials Inc Mahesh Balakrishnan director Asian 1983 34 Male BA Economics NaN
Company Name Net Sale Gross Profit Remark
0 Big Lots Inc. 5.2B 2.1B NaN
1 Momentive Performance Materials Inc 544M 146m NaN
2 Markel Corporation 5.61B 2.06B NaN
3 Noble Energy, Inc. 3.49B 2.41B NaN
4 Leidos Holding, Inc. 7.04B 852M NaN
I want to create a new dataframe from these two, so that the 2nd dataframe gets new columns with the count of each ethnicity per company, such as American: 2, Mexican: 5, and so on, so that later on I can calculate a diversity score.
The variables in the output dataframe would look like:
Company Name Net Sale Gross Profit Remark American Mexican German .....
Big Lots Inc. 5.2B 2.1B NaN 2 0 5 ....
First get counts per group with groupby and size, reshape with unstack, and finally join to the second DataFrame:
df1 = pd.DataFrame({'Company Name': list('aabcac'),
                    'Ethnicity': ['American'] * 3 + ['Mexican'] * 3})

df1 = df1.groupby(['Company Name', 'Ethnicity']).size().unstack(fill_value=0)
#slower alternative
#df1 = pd.crosstab(df1['Company Name'], df1['Ethnicity'])
print (df1)
Ethnicity American Mexican
Company Name
a 2 1
b 1 0
c 0 2
df2 = pd.DataFrame({'Company Name':list('abc')})
print (df2)
Company Name
0 a
1 b
2 c
df3 = df2.join(df1, on=['Company Name'])
print (df3)
Company Name American Mexican
0 a 2 1
1 b 1 0
2 c 0 2
EDIT: You need to replace the unit suffix with zeros and convert to floats:
print (df)
Name sale
0 A 100M
1 B 200M
2 C 5M
3 D 40M
4 E 10B
5 F 2B
d = {'M': '0'*6, 'B': '0'*9}
# expand the unit suffix to zeros, then cast; sorting before assignment
# would be undone by index alignment, so it is omitted here
df['a'] = df['sale'].replace(d, regex=True).astype(float)
print (df)
Name sale a
0 A 100M 1.000000e+08
1 B 200M 2.000000e+08
2 C 5M 5.000000e+06
3 D 40M 4.000000e+07
4 E 10B 1.000000e+10
5 F 2B 2.000000e+09
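Note that the zero-padding trick assumes whole-number amounts like 100M; a value such as 5.2B from your data would become the string '5.2000000000', i.e. the float 5.2 rather than 5.2 billion. A sketch of a multiplier-based alternative (column name taken from your second frame; the lowercase 'm' in 146m is assumed to mean millions):
import pandas as pd

df2 = pd.DataFrame({'Net Sale': ['5.2B', '544M', '146m']})

mult = {'M': 1e6, 'B': 1e9}
parts = df2['Net Sale'].str.upper().str.extract(r'([\d.]+)\s*([MB])')
df2['Net Sale num'] = parts[0].astype(float) * parts[1].map(mult)
print(df2)  # 5.2B -> 5.2e9, 544M -> 5.44e8, 146m -> 1.46e8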