I want to create two new columns in job_transitions_sample.csv and add the wage data from wage_data_sample.csv for both Title 1 and Title 2:
job_transitions_sample.csv:
Title 1 Title 2 Count
0 administrative assistant office manager 20
1 accountant cashier 1
2 accountant financial analyst 22
4 accountant senior accountant 23
6 accounting clerk bookkeeper 11
7 accounts payable clerk accounts receivable clerk 8
8 administrative assistant accounting clerk 8
9 administrative assistant administrative clerk 12
...
wage_data_sample.csv:
title wage
0 cashier 17.00
1 sandwich artist 18.50
2 dishwasher 20.00
3 babysitter 20.00
4 barista 21.50
5 housekeeper 21.50
6 retail sales associate 23.00
7 bartender 23.50
8 cleaner 23.50
9 line cook 23.50
10 pizza cook 23.50
...
I want the end result to look like this:
Title 1 Title 2 Count Wage of Title 1 Wage of Title 2
0 administrative assistant office manager 20 NaN NaN
1 accountant cashier 1 NaN 17.00
2 accountant financial analyst 22 NaN NaN
...
I'm thinking of using dictionaries and then iterating over every column, but is there a more elegant built-in solution? This is my code so far:
wage_data = pd.read_csv('wage_data_sample.csv')
dict = dict(zip(wage_data.title, wage_data.wage))
Use Series.map with the dictionary d. (You can't keep dict as the variable name, because it shadows the Python built-in.)
df = pd.read_csv('job_transitions_sample.csv')
wage_data = pd.read_csv('wage_data_sample.csv')
d = dict(zip(wage_data.title, wage_data.wage))
df['Wage of Title 1'] = df['Title 1'].map(d)
df['Wage of Title 2'] = df['Title 2'].map(d)
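As a runnable sketch on invented miniature frames (the real data comes from the two CSVs), titles missing from the dictionary come out as NaN, matching the expected output:

```python
import pandas as pd

# Invented miniature stand-ins for the two CSV files
df = pd.DataFrame({'Title 1': ['accountant', 'barista'],
                   'Title 2': ['cashier', 'line cook'],
                   'Count': [1, 5]})
wage_data = pd.DataFrame({'title': ['cashier', 'barista', 'line cook'],
                          'wage': [17.00, 21.50, 23.50]})

d = dict(zip(wage_data.title, wage_data.wage))
df['Wage of Title 1'] = df['Title 1'].map(d)  # 'accountant' is absent -> NaN
df['Wage of Title 2'] = df['Title 2'].map(d)
print(df)
```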
You can try two merges on the two different Title columns, one after the other.
For example, let be
df1 : job_transitions_sample.csv
df2 : wage_data_sample.csv
df1.merge(df2, left_on='Title 1', right_on='title',suffixes=('', 'Wage of')).merge(df2, left_on='Title 2', right_on='title',suffixes=('', 'Wage of'))
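One caveat: the default inner merge drops any row whose title has no wage, whereas the expected output keeps such rows with NaN. A hedged sketch using how='left' on toy frames:

```python
import pandas as pd

# Toy frames standing in for the two CSVs
df1 = pd.DataFrame({'Title 1': ['accountant'], 'Title 2': ['cashier'], 'Count': [1]})
df2 = pd.DataFrame({'title': ['cashier'], 'wage': [17.00]})

out = (df1.merge(df2, how='left', left_on='Title 1', right_on='title')
          .rename(columns={'wage': 'Wage of Title 1'})
          .drop(columns='title')
          .merge(df2, how='left', left_on='Title 2', right_on='title')
          .rename(columns={'wage': 'Wage of Title 2'})
          .drop(columns='title'))
print(out)
```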
Related
I have two dataframes, df1 and df2, with different row sizes but the same columns. The ID column is common across both dataframes. I want to write the differences to a text file. For example:
df1:
ID Name Age Profession sex
1 Tom 20 engineer M
2 nick 21 doctor M
3 krishi 19 lawyer F
4 jacky 18 dentist F
df2:
ID Name Age Profession sex
1 Tom 20 plumber M
2 nick 21 doctor M
3 krishi 23 Analyst F
4 jacky 18 dentist F
The resultant text file should look like:
ID : 1
Profession_old Profession_new
engineer plumber
ID : 3
Age_old Age_new Profession_old Profession_new
19 23 lawyer Analyst
You can use compare and a loop:
df3 = df1.set_index('ID').compare(df2.set_index('ID'))
df3.columns = (df3.rename({'self': 'old', 'other': 'new'}, level=1, axis=1)
                  .columns.map('_'.join))
for id, row in df3.iterrows():
    print(f'ID : {id}')
    print(row.dropna().to_frame().T.to_string(index=False))
    print()
output:
ID : 1
Profession_old Profession_new
engineer plumber
ID : 3
Age_old Age_new Profession_old Profession_new
19.0 23.0 lawyer Analyst
NB. print is used here for the demo; to write to a file instead, open it in write mode and move the loop body inside:
with open('file.txt', 'w') as f:
    for id, row in df3.iterrows():
        f.write(f'ID : {id}\n')
        f.write(row.dropna().to_frame().T.to_string(index=False))
        f.write('\n\n')
You could also directly use df3:
Age_old Age_new Profession_old Profession_new
ID
1 NaN NaN engineer plumber
3 19.0 23.0 lawyer Analyst
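Put together, the whole flow can be run end to end on the question's data (the output file name diff.txt is an assumption):

```python
import pandas as pd

# The question's two frames, rebuilt inline
df1 = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'Name': ['Tom', 'nick', 'krishi', 'jacky'],
                    'Age': [20, 21, 19, 18],
                    'Profession': ['engineer', 'doctor', 'lawyer', 'dentist'],
                    'sex': list('MMFF')})
df2 = df1.copy()
df2.loc[df2.ID == 1, 'Profession'] = 'plumber'
df2.loc[df2.ID == 3, 'Age'] = 23
df2.loc[df2.ID == 3, 'Profession'] = 'Analyst'

# cell-level diff, then flatten the (column, self/other) MultiIndex
df3 = df1.set_index('ID').compare(df2.set_index('ID'))
df3.columns = df3.columns.map(lambda c: f'{c[0]}_{"old" if c[1] == "self" else "new"}')

with open('diff.txt', 'w') as f:
    for id_, row in df3.iterrows():
        f.write(f'ID : {id_}\n')
        f.write(row.dropna().to_frame().T.to_string(index=False) + '\n\n')
```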
I have two dataframes.
data_df
Investor Company Name CUSIP Symbol
0 Hank FaCeBoOk 30303M102
1 Dale Fraud Co 88160R101
2 Bill Netflix Inc 64110L106
3 Kahn 64110L106
4 Peggy Amazon 23135106
5 Rusty Costco 22160K105
6 Bobby BankAmericard 92826C839
7 Minh Placeholder 92826C839
8 Chappy Other 29786A106
and cusips_df:
Company Name Symbol CUSIP
0 Facebook FB 30303M102
1 Tesla TSLA 88160R101
2 Netflix Inc NFLX 64110L106
3 Amazon AMZN 23135106
4 Costco COST 22160K105
5 Visa V 92826C839
6 Mega-Lo Mart MLM 543535F63
7 Strickland Propane SPPN 453HGR001
8 Etsy ETSY 29786A106
I am matching the two dataframes on CUSIP, and then updating the Company Name and Symbol in data_df with those values from cusips_df.
data_df = data_df.set_index('CUSIP')
cusips_df = cusips_df.set_index('CUSIP')
data_df.update(cusips_df)
data_df = data_df.reset_index()
print(data_df)
But when I do, the CUSIP column gets moved to position 0, rather than stay in position 2:
CUSIP Investor Company Name Symbol
0 30303M102 Hank Facebook FB
1 88160R101 Dale Tesla TSLA
2 64110L106 Bill Netflix Inc NFLX
3 64110L106 Kahn Netflix Inc NFLX
4 23135106 Peggy Amazon AMZN
5 22160K105 Rusty Costco COST
6 92826C839 Bobby Visa V
7 92826C839 Minh Visa V
8 29786A106 Chappy Etsy ETSY
I know I can simply reorder the dataframe columns, but is there a more pythonic way of doing this so that the order of the columns in data_df stays the same?
We can build a CUSIP-keyed mapping and update the column in place, which leaves the column order untouched:
d = dict(zip(cusips_df.CUSIP, cusips_df.Symbol))
data_df.Symbol.update(data_df.CUSIP.map(d))
The same pattern with a CUSIP-to-Company Name dictionary updates the Company Name column.
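A runnable sketch of the same idea on invented miniature frames. Assigning map(...).fillna(...) back to the column avoids relying on in-place Series.update through a column view (which can silently stop working under copy-on-write) and still preserves column positions:

```python
import pandas as pd

# Invented miniature frames standing in for data_df and cusips_df
data_df = pd.DataFrame({'Investor': ['Hank', 'Dale'],
                        'Company Name': ['FaCeBoOk', 'Fraud Co'],
                        'CUSIP': ['30303M102', '88160R101'],
                        'Symbol': ['', '']})
cusips_df = pd.DataFrame({'Company Name': ['Facebook', 'Tesla'],
                          'Symbol': ['FB', 'TSLA'],
                          'CUSIP': ['30303M102', '88160R101']})

# For each column to refresh, map via CUSIP and keep the old value
# where no match exists; assigning back keeps the column order unchanged.
for col in ['Company Name', 'Symbol']:
    d = dict(zip(cusips_df.CUSIP, cusips_df[col]))
    data_df[col] = data_df['CUSIP'].map(d).fillna(data_df[col])
print(data_df)
```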
My pandas Data frame df could produce result as below:
grouped = df[(df['X'] == 'venture') & (df['company_code'].isin(['TDS','XYZ','UVW']))].groupby(['company_code','sector'])['X_sector'].count()
The output of this is as follows:
company_code sector
TDS Meta 404
Electrical 333
Mechanical 533
Agri 453
XYZ Sports 331
Electrical 354
Movies 375
Manufacturing 355
UVW Sports 505
Robotics 345
Movies 56
Health 3263
Manufacturing 456
Others 524
Name: X_sector, dtype: int64
What I want to get is the top three sectors within the company codes.
What is the way to do it?
You will have to chain a groupby here. Consider this example:
import pandas as pd
import numpy as np
np.random.seed(111)
names = ['Robert Baratheon', 'Jon Snow', 'Daenerys Targaryen',
         'Theon Greyjoy', 'Tyrion Lannister']
df = pd.DataFrame({
    'season': np.random.randint(1, 7, size=100),
    'actor': np.random.choice(names, size=100),
    'appearance': 1
})
s = df.groupby(['season','actor'])['appearance'].count()
print(s.sort_values(ascending=False).groupby('season').head(1)) # <-- head(3) for 3 values
Returns:
season actor
4 Daenerys Targaryen 7
6 Robert Baratheon 6
3 Robert Baratheon 6
5 Jon Snow 5
2 Theon Greyjoy 5
1 Jon Snow 4
Where s is (clipped at 4)
season actor
1 Daenerys Targaryen 2
Jon Snow 4
Robert Baratheon 2
Theon Greyjoy 3
Tyrion Lannister 4
2 Daenerys Targaryen 4
Jon Snow 3
Robert Baratheon 1
Theon Greyjoy 5
Tyrion Lannister 3
3 Daenerys Targaryen 2
Jon Snow 1
Robert Baratheon 6
Theon Greyjoy 3
Tyrion Lannister 3
4 ...
Why make it complicated when a simple chain will do (using the question's company_code column):
Z = (df.groupby('company_code')['sector'].value_counts()
       .groupby(level=0).head(3)
       .sort_values(ascending=False)
       .to_frame('counts').reset_index())
Z
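A minimal runnable sketch of the sort-then-head-per-group pattern, on toy counts invented for illustration:

```python
import pandas as pd

# Toy stand-in for the question's grouped counts
df = pd.DataFrame({
    'company_code': ['TDS'] * 4 + ['XYZ'] * 4,
    'sector': ['Meta', 'Electrical', 'Mechanical', 'Agri',
               'Sports', 'Electrical', 'Movies', 'Manufacturing'],
    'count': [404, 333, 533, 453, 331, 354, 375, 355],
})
s = df.set_index(['company_code', 'sector'])['count']

# sort once, then keep the first three rows of each company group
top3 = s.sort_values(ascending=False).groupby(level=0).head(3)
print(top3)
```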
I am new to Python and I am trying to merge two datasets for my research:
df1 has the columns companyname, ticker, and Dscode.
df2 has companyname, ticker, grouptcode, and Dscode.
I want to merge the grouptcode from df2 into df1; however, the companyname is slightly different, though very similar, between the two dataframes.
For each ticker there is an associated Dscode. However, multiple companies share the same ticker, and therefore the same Dscode.
Problem
I am only interested in merging the grouptcode for the associated ticker and Dscode that matches the companyname (which at times is slightly different; this part is what I cannot get past). The code I have been using is below.
Code
import pandas as pd
import os
# set working directory
path = "/Users/name/Desktop/Python"
os.chdir(path)
os.getcwd() # Prints the working directory
# read in excel file
file = "/Users/name/Desktop/Python/Excel/DSROE.xlsx"
x1 = pd.ExcelFile(file)
print(x1.sheet_names)
df1 = x1.parse('Sheet1')
df1.head()
df1.tail()
file2 = "/Users/name/Desktop/Python/Excel/tcode2.xlsx"
x2 = pd.ExcelFile(file2)
print(x2.sheet_names)
df2 = x2.parse('Sheet1')
df2['companyname'] = df2['companyname'].str.upper() ## make column uppercase
df2.head()
df2.tail()
df2 = df2.dropna()
x3 = pd.merge(df1, df2,how = 'outer') # merge
Data
df1
Dscode ticker companyname
65286 8933TC 3pl 3P LEARNING LIMITED
79291 9401FP a2m A2 MILK COMPANY LIMITED
1925 14424Q aac AUSTRALIAN AGRICULTURAL COMPANY LIMITED
39902 675493 aad ARDENT LEISURE GROUP
1400 133915 aba AUSWIDE BANK LIMITED
74565 922472 abc ADELAIDE BRIGHTON LIMITED
7350 26502C abp ABACUS PROPERTY GROUP
39202 675142 ada ADACEL TECHNOLOGIES LIMITED
80866 9661AD adh ADAIRS
80341 9522QV afg AUSTRALIAN FINANCE GROUP LIMITED
45327 691938 agg ANGLOGOLD ASHANTI LIMITED
2625 14880E agi AINSWORTH GAME TECHNOLOGY LIMITED
75090 923040 agl AGL ENERGY LIMITED
19251 29897X ago ATLAS IRON LIMITED
64409 890588 agy ARGOSY MINERALS LIMITED
24151 31511D ahg AUTOMOTIVE HOLDINGS GROUP LIMITED
64934 8917JD ahy ASALEO CARE LIMITED
42877 691152 aia AUCKLAND INTERNATIONAL AIRPORT LIMITED
61433 88013C ajd ASIA PACIFIC DATA CENTRE GROUP
44452 691704 ajl AJ LUCAS GROUP LIMITED
700 13288C ajm ALTURA MINING LIMITED
19601 29929D akp AUDIO PIXELS HOLDINGS LIMITED
79816 951404 alk ALKANE RESOURCES LIMITED
56008 865613 all ARISTOCRAT LEISURE LIMITED
51807 771351 alq ALS LIMITED
44277 691685 alu ALTIUM LIMITED
42702 68625C alx ATLAS ARTERIA GROUP
30101 41162F ama AMA GROUP LIMITED
67386 902201 amc AMCOR LIMITED
33426 50431L ami AURELIA METALS LIMITED
df2
companyname grouptcode ticker Dscode
524 3P LEARNING LIMITED.. tpn1 3pl 8933TC
1 THE A2 MILK COMPANY LIMITED a2m1 a2m 9401FP
2 AUSTRALIAN AGRICULTURAL COMPANY LIMITED. aac2 aac 14424Q
3 AAPC LIMITED. aad1 aad 675493
6 ADVANCE BANK AUSTRALIA LIMITED aba1 aba 133915
7 ADELAIDE BRIGHTON CEMENT HOLDINGS LIMITED abc1 abc 922472
8 ABACUS PROPERTY GROUP abp1 abp 26502C
9 ADACEL TECHNOLOGIES LIMITED ada1 ada 675142
288 ADA CORPORATION LIMITED khs1 ada 675142
10 AERODATA HOLDINGS LIMITED adh1 adh 9661AD
11 ADAMS (HERBERT) HOLDINGS LIMITED adh2 adh 9661AD
12 ADAIRS LIMITED adh3 adh 9661AD
431 ALLCO FINANCE GROUP LIMITED rcd1 afg 9522QV
13 AUSTRALIAN FINANCE GROUP LTD afg1 afg 9522QV
14 ANGLOGOLD ASHANTI LIMITED agg1 agg 691938
15 APGAR INDUSTRIES LIMITED agi1 agi 14880E
16 AINSWORTH GAME TECHNOLOGY LIMITED agi2 agi 14880E
17 AUSTRALIAN GAS LIGHT COMPANY (THE) agl1 agl 923040
18 ATLAS IRON LIMITED ago1 ago 29897X
393 ACM GOLD LIMITED pgo2 ago 29897X
19 AUSTRALIAN GYPSUM INDUSTRIES LIMITED agy1 agy 890588
142 ARGOSY MINERALS INC cio1 agy 890588
21 ARCHAEAN GOLD NL ahg1 ahg 31511D
22 AUSTRALIAN HYDROCARBONS N.L. ahy1 ahy 8917JD
23 ASALEO CARE LIMITED ahy2 ahy 8917JD
24 AUCKLAND INTERNATIONAL AIRPORT LIMITED aia1 aia 691152
25 ASIA PACIFIC DATA CENTRE GROUP ajd1 ajd 88013C
26 AJ LUCAS GROUP LIMITED ajl1 ajl 691704
27 AJAX MCPHERSON'S LIMITED ajm1 ajm 13288C
29 ALKANE EXPLORATION (TERRIGAL) N.L. alk1 alk 951404
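The thread leaves the fuzzy-name part open. As a hedged sketch (the 0.8 threshold and frame contents are assumptions, not from the question), one approach is to merge exactly on ticker and Dscode to get candidate pairs, score each pair's names with difflib, and keep only the best close match per row:

```python
import difflib

import pandas as pd

# Toy stand-ins; the real frames come from the two Excel files
df1 = pd.DataFrame({'Dscode': ['675142'], 'ticker': ['ada'],
                    'companyname': ['ADACEL TECHNOLOGIES LIMITED']})
df2 = pd.DataFrame({'companyname': ['ADACEL TECHNOLOGIES LIMITED',
                                    'ADA CORPORATION LIMITED'],
                    'grouptcode': ['ada1', 'khs1'],
                    'ticker': ['ada', 'ada'],
                    'Dscode': ['675142', '675142']})

# 1) candidate pairs that agree on ticker and Dscode
cand = df1.merge(df2, on=['ticker', 'Dscode'], suffixes=('_1', '_2'))

# 2) similarity of the two company names for each candidate pair
cand['sim'] = [difflib.SequenceMatcher(None, a, b).ratio()
               for a, b in zip(cand['companyname_1'], cand['companyname_2'])]

# 3) keep the most similar match per df1 row, above an arbitrary threshold
best = (cand.sort_values('sim', ascending=False)
            .drop_duplicates(['ticker', 'Dscode', 'companyname_1'])
            .query('sim > 0.8'))
print(best[['companyname_1', 'grouptcode', 'sim']])
```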
I have two dataframes as shown below.
Company Name BOD Position Ethnicity DOB Age Gender Degree ( Specialazation) Remark
0 Big Lots Inc. David J. Campisi Director, President and Chief Executive Offic... American 1956 61 Male Graduate NaN
1 Big Lots Inc. Philip E. Mallott Chairman of the Board American 1958 59 Male MBA, Finace NaN
2 Big Lots Inc. James R. Chambers Independent Director American 1958 59 Male MBA NaN
3 Momentive Performance Materials Inc Mahesh Balakrishnan director Asian 1983 34 Male BA Economics NaN
Company Name Net Sale Gross Profit Remark
0 Big Lots Inc. 5.2B 2.1B NaN
1 Momentive Performance Materials Inc 544M 146m NaN
2 Markel Corporation 5.61B 2.06B NaN
3 Noble Energy, Inc. 3.49B 2.41B NaN
4 Leidos Holding, Inc. 7.04B 852M NaN
I want to combine these two dataframes so that the second one gains a column per ethnicity holding the count of board members of that ethnicity at each company, such as American: 2, Mexican: 5 and so on, so that later on I can calculate a diversity score.
The output dataframe would look like:
Company Name Net Sale Gross Profit Remark American Mexican German .....
Big Lots Inc. 5.2B 2.1B NaN 2 0 5 ....
First get counts per group with groupby + size and unstack, then join the result to the second DataFrame:
df1 = pd.DataFrame({'Company Name':list('aabcac'),
'Ethnicity':['American'] * 3 + ['Mexican'] * 3})
df1 = df1.groupby(['Company Name', 'Ethnicity']).size().unstack(fill_value=0)
# slower alternative
# df1 = pd.crosstab(df1['Company Name'], df1['Ethnicity'])
print (df1)
Ethnicity American Mexican
Company Name
a 2 1
b 1 0
c 0 2
df2 = pd.DataFrame({'Company Name':list('abc')})
print (df2)
Company Name
0 a
1 b
2 c
df3 = df2.join(df1, on=['Company Name'])
print (df3)
Company Name American Mexican
0 a 2 1
1 b 1 0
2 c 0 2
EDIT: To make the sale column sortable, you need to replace the unit suffix with zeros and convert to floats:
print (df)
Name sale
0 A 100M
1 B 200M
2 C 5M
3 D 40M
4 E 10B
5 F 2B
d = {'M': '0'*6, 'B': '0'*9}
df['a'] = df['sale'].replace(d, regex=True).astype(float).sort_values(ascending=False)
print (df)
Name sale a
0 A 100M 1.000000e+08
1 B 200M 2.000000e+08
2 C 5M 5.000000e+06
3 D 40M 4.000000e+07
4 E 10B 1.000000e+10
5 F 2B 2.000000e+09
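One caveat with the zero-padding trick: it only works for whole numbers. A value like '5.2B' (as in the question's data) becomes '5.2000000000', i.e. 5.2 rather than 5.2 billion. A multiplier-based sketch (column names borrowed from the EDIT) handles decimals:

```python
import pandas as pd

df = pd.DataFrame({'Name': list('ABC'), 'sale': ['5.2B', '544M', '100M']})

mult = {'M': 1e6, 'B': 1e9}

def parse_sale(s):
    # split '5.2B' into the numeric part and the unit suffix
    return float(s[:-1]) * mult[s[-1]]

df['a'] = df['sale'].map(parse_sale)
print(df.sort_values('a', ascending=False))
```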