I have a table which has Primary Sector as one column with different entries. I need to add one more column, Major Sector, which is to be picked up from a mapping table. How can this task be achieved?
Sample Data
Primary Sector Major Sector
Skating
Painting
Engineer
Running
Gardening
Administrator
tennis
Reading
Cricket
Accountant
Mapping Table
Job Hobby Sports
Skating 0 0 1
Painting 0 1 0
Engineer 1 0 0
Running 0 0 1
Gardening 0 1 0
Administrator 1 0 0
tennis 0 0 1
Reading 0 1 0
Cricket 0 0 1
Accountant 1 0 0
Use Series.map with df2.idxmax(axis=1), which returns, for each row of the mapping table, the name of the column holding the maximum value:
df1['Major Sector'] = df1['Primary Sector'].map(df2.idxmax(axis=1))
print(df1)
Primary Sector Major Sector
0 Skating Sports
1 Painting Hobby
2 Engineer Job
3 Running Sports
4 Gardening Hobby
5 Administrator Job
6 tennis Sports
7 Reading Hobby
8 Cricket Sports
9 Accountant Job
print(df2.idxmax(axis=1))
Skating Sports
Painting Hobby
Engineer Job
Running Sports
Gardening Hobby
Administrator Job
tennis Sports
Reading Hobby
Cricket Sports
Accountant Job
dtype: object
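For reference, a minimal, self-contained sketch of the setup above (the DataFrame constructors are my assumption from the sample data; the key point is that the mapping table must be indexed by the sector names so idxmax produces a label-to-column mapping):

import pandas as pd

sectors = ['Skating', 'Painting', 'Engineer', 'Running', 'Gardening',
           'Administrator', 'tennis', 'Reading', 'Cricket', 'Accountant']

df1 = pd.DataFrame({'Primary Sector': sectors})

# mapping table indexed by sector name, as in the question
df2 = pd.DataFrame({'Job':    [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
                    'Hobby':  [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
                    'Sports': [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]},
                   index=sectors)

# idxmax(axis=1) gives, for each row, the column holding the maximum value
df1['Major Sector'] = df1['Primary Sector'].map(df2.idxmax(axis=1))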
I want to create two new columns in job_transitions_sample.csv and add the wage data from wage_data_sample.csv for both Title 1 and Title 2:
job_transitions_sample.csv:
Title 1 Title 2 Count
0 administrative assistant office manager 20
1 accountant cashier 1
2 accountant financial analyst 22
4 accountant senior accountant 23
6 accounting clerk bookkeeper 11
7 accounts payable clerk accounts receivable clerk 8
8 administrative assistant accounting clerk 8
9 administrative assistant administrative clerk 12
...
wage_data_sample.csv
title wage
0 cashier 17.00
1 sandwich artist 18.50
2 dishwasher 20.00
3 babysitter 20.00
4 barista 21.50
5 housekeeper 21.50
6 retail sales associate 23.00
7 bartender 23.50
8 cleaner 23.50
9 line cook 23.50
10 pizza cook 23.50
...
I want the end result to look like this:
Title 1 Title 2 Count Wage of Title 1 Wage of Title 2
0 administrative assistant office manager 20 NaN NaN
1 accountant cashier 1 NaN NaN
2 accountant financial analyst 22 NaN NaN
...
I'm thinking of using dictionaries and then trying to iterate over every column, but is there a more elegant built-in solution? This is my code so far:
wage_data = pd.read_csv('wage_data_sample.csv')
dict = dict(zip(wage_data.title, wage_data.wage))
Use Series.map with the dictionary d. Don't use dict as a variable name, because it shadows the Python built-in:
df = pd.read_csv('job_transitions_sample.csv')
wage_data = pd.read_csv('wage_data_sample.csv')
d = dict(zip(wage_data.title, wage_data.wage))
df['Wage of Title 1'] = df['Title 1'].map(d)
df['Wage of Title 2'] = df['Title 2'].map(d)
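Note that titles missing from wage_data_sample.csv simply map to NaN, which matches the desired output shown above.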
You can try two merges on the two different Title columns, one after the other.
For example, let:
df1: job_transitions_sample.csv
df2: wage_data_sample.csv
df1.merge(df2, left_on='Title 1', right_on='title', suffixes=('', 'Wage of')) \
   .merge(df2, left_on='Title 2', right_on='title', suffixes=('', 'Wage of'))
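Since the desired output keeps rows even when a title has no wage (the NaN values above), a left-merge variant of the same idea might look like this sketch (the renamed column labels are my own choice, not from the original answer):

import pandas as pd

df1 = pd.read_csv('job_transitions_sample.csv')
df2 = pd.read_csv('wage_data_sample.csv')

# left merges keep every transition row; unmatched titles get NaN wages
out = (df1.merge(df2, left_on='Title 1', right_on='title', how='left')
          .merge(df2, left_on='Title 2', right_on='title', how='left',
                 suffixes=(' of Title 1', ' of Title 2'))
          .drop(columns=['title of Title 1', 'title of Title 2'])
          .rename(columns={'wage of Title 1': 'Wage of Title 1',
                           'wage of Title 2': 'Wage of Title 2'}))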
I have a dataframe of restaurants and one column has corresponding cuisines.
The problem is that there are restaurants with multiple cuisines in the same column (up to 8).
Let's say it's something like this:
RestaurantName City Restaurant ID Cuisines
Restaurant A Milan 31333 French, Spanish, Italian
Restaurant B Shanghai 63551 Pizza, Burgers
Restaurant C Dubai 7991 Burgers, Ice Cream
Here's copy-able code as a sample:
rst = pd.DataFrame({'RestaurantName': ['Rest A', 'Rest B', 'Rest C'],
                    'City': ['Milan', 'Shanghai', 'Dubai'],
                    'RestaurantID': [31333, 63551, 7991],
                    'Cuisines': ['French, Spanish, Italian', 'Pizza, Burgers', 'Burgers, Ice Cream']})
I used string split to expand them into 8 different columns and added it to the dataframe.
csnsplit=rst.Cuisines.str.split(", ",expand=True)
rst["Cuisine1"]=csnsplit.loc[:,0]
rst["Cuisine2"]=csnsplit.loc[:,1]
rst["Cuisine3"]=csnsplit.loc[:,2]
rst["Cuisine4"]=csnsplit.loc[:,3]
rst["Cuisine5"]=csnsplit.loc[:,4]
rst["Cuisine6"]=csnsplit.loc[:,5]
rst["Cuisine7"]=csnsplit.loc[:,6]
rst["Cuisine8"]=csnsplit.loc[:,7]
Which leaves me with this:
https://i.stack.imgur.com/AUSDY.png
Now I have no idea how to count individual cuisines, since they're spread across up to 8 different columns, say, if I want to see the top cuisine by city.
I also tried getting dummy columns for all of them, Cuisine1 to Cuisine8. This leaves me with duplicates like Cuisine1_Bakery, Cuisine2_Bakery, and so on. I could hypothetically merge like columns, keeping only the ones with a count of 1, but I have no idea how to do that.
dummies=pd.get_dummies(data=rst,columns=["Cuisine1","Cuisine2","Cuisine3","Cuisine4","Cuisine5","Cuisine6","Cuisine7","Cuisine8"])
print(dummies.columns.tolist())
Which leaves me with all of these columns:
https://i.stack.imgur.com/84spI.png
A third thing I tried was to get the unique values from all 8 columns, which gives me a deduplicated list of each type of cuisine. I could probably add all of these as columns to the dataframe, but I wouldn't know how to fill the rows with a count for each one based on the column name.
AllCsn = np.concatenate((rst.Cuisine1.unique(),
                         rst.Cuisine2.unique(),
                         rst.Cuisine3.unique(),
                         rst.Cuisine4.unique(),
                         rst.Cuisine5.unique(),
                         rst.Cuisine6.unique(),
                         rst.Cuisine7.unique(),
                         rst.Cuisine8.unique()))
AllCsn = np.unique(AllCsn.astype(str))
AllCsn
Which leaves me with this:
https://i.stack.imgur.com/O9OpW.png
I do want to create a model later on where I maybe have a column for each cuisine, and use the "unique" code above to get all the columns, but then I would need to figure out how to do a count based on the column header.
I am new to this, so please bear with me and let me know if I need to provide any more info.
It sounds like you're looking for str.split without expanding, then explode:
rst['Cuisines'] = rst['Cuisines'].str.split(', ')
rst = rst.explode('Cuisines')
Creates a frame like:
RestaurantName City RestaurantID Cuisines
0 Rest A Milan 31333 French
0 Rest A Milan 31333 Spanish
0 Rest A Milan 31333 Italian
1 Rest B Shanghai 63551 Pizza
1 Rest B Shanghai 63551 Burgers
2 Rest C Dubai 7991 Burgers
2 Rest C Dubai 7991 Ice Cream
Then it sounds like either crosstab:
pd.crosstab(rst['City'], rst['Cuisines'])
Cuisines Burgers French Ice Cream Italian Pizza Spanish
City
Dubai 1 0 1 0 0 0
Milan 0 1 0 1 0 1
Shanghai 1 0 0 0 1 0
Or value_counts
rst[['City', 'Cuisines']].value_counts().reset_index(name='counts')
City Cuisines counts
0 Dubai Burgers 1
1 Dubai Ice Cream 1
2 Milan French 1
3 Milan Italian 1
4 Milan Spanish 1
5 Shanghai Burgers 1
6 Shanghai Pizza 1
Max value_count per City via groupby head:
max_counts = (
    rst[['City', 'Cuisines']].value_counts()
       .groupby(level=0).head(1)
       .reset_index(name='counts')
)
max_counts:
City Cuisines counts
0 Dubai Burgers 1
1 Milan French 1
2 Shanghai Burgers 1
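If all you need is the single top cuisine per city, idxmax on the crosstab is a compact alternative (a sketch; ties resolve to whichever column comes first):

ct = pd.crosstab(rst['City'], rst['Cuisines'])
top_cuisine = ct.idxmax(axis=1)  # for each city, the cuisine with the highest count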
I have two dataframes as shown below.
Company Name BOD Position Ethnicity DOB Age Gender Degree ( Specialazation) Remark
0 Big Lots Inc. David J. Campisi Director, President and Chief Executive Offic... American 1956 61 Male Graduate NaN
1 Big Lots Inc. Philip E. Mallott Chairman of the Board American 1958 59 Male MBA, Finace NaN
2 Big Lots Inc. James R. Chambers Independent Director American 1958 59 Male MBA NaN
3 Momentive Performance Materials Inc Mahesh Balakrishnan director Asian 1983 34 Male BA Economics NaN
Company Name Net Sale Gross Profit Remark
0 Big Lots Inc. 5.2B 2.1B NaN
1 Momentive Performance Materials Inc 544M 146m NaN
2 Markel Corporation 5.61B 2.06B NaN
3 Noble Energy, Inc. 3.49B 2.41B NaN
4 Leidos Holding, Inc. 7.04B 852M NaN
I want to create a new dataframe from these two, so that the second dataframe gets new columns with the count of each ethnicity per company, such as American - 2, Mexican - 5, and so on, so that later on I can calculate a diversity score.
The columns in the output dataframe would look like:
Company Name Net Sale Gross Profit Remark American Mexican German .....
Big Lots Inc. 5.2B 2.1B NaN 2 0 5 ....
First get counts per group using groupby with size, reshape with unstack, and finally join to the second DataFrame:
df1 = pd.DataFrame({'Company Name': list('aabcac'),
                    'Ethnicity': ['American'] * 3 + ['Mexican'] * 3})
df1 = df1.groupby(['Company Name', 'Ethnicity']).size().unstack(fill_value=0)
# slower alternative
# df1 = pd.crosstab(df1['Company Name'], df1['Ethnicity'])
print (df1)
Ethnicity American Mexican
Company Name
a 2 1
b 1 0
c 0 2
df2 = pd.DataFrame({'Company Name':list('abc')})
print (df2)
Company Name
0 a
1 b
2 c
df3 = df2.join(df1, on=['Company Name'])
print (df3)
Company Name American Mexican
0 a 2 1
1 b 1 0
2 c 0 2
EDIT: You need to replace the units with zeros and convert to floats:
print (df)
Name sale
0 A 100M
1 B 200M
2 C 5M
3 D 40M
4 E 10B
5 F 2B
d = {'M': '0' * 6, 'B': '0' * 9}
# replace each unit suffix with the corresponding run of zeros, then parse as floats
df['a'] = df['sale'].replace(d, regex=True).astype(float)
print (df)
Name sale a
0 A 100M 1.000000e+08
1 B 200M 2.000000e+08
2 C 5M 5.000000e+06
3 D 40M 4.000000e+07
4 E 10B 1.000000e+10
5 F 2B 2.000000e+09
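One caveat not covered in the answer above: the zero-padding trick only works for integer prefixes. A value like '5.2B' becomes '5.2000000000', which parses as 5.2 rather than 5.2e9. A sketch that also handles decimals by mapping each suffix to a multiplier (parse_amount is a hypothetical helper name, and it assumes every value ends in M or B):

mult = {'M': 1e6, 'B': 1e9}

def parse_amount(s):
    # split '5.2B' into the numeric part and the suffix, then scale
    return float(s[:-1]) * mult[s[-1]]

df['a'] = df['sale'].map(parse_amount)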
Source data is from the book Python for Data Analysis, Chapter 2.
The data for movies is as follows:
movies.head(n=10)
Out[3]:
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
5 6 Heat (1995) Action|Crime|Thriller
6 7 Sabrina (1995) Comedy|Romance
7 8 Tom and Huck (1995) Adventure|Children's
8 9 Sudden Death (1995) Action
9 10 GoldenEye (1995) Action|Adventure|Thriller
The following code has trouble when I use iloc:
import pandas as pd
import numpy as np
from pandas import Series, DataFrame

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('movies.dat', sep='::',
                       engine='python', header=None, names=mnames)
movies.head(n=10)
genre_iter = (set(x.split('|')) for x in movies['genres'])
genres = sorted(set.union(*genre_iter))
dummies = DataFrame(np.zeros((len(movies), len(genres))), columns=genres)
for i, gen in enumerate(movies['genres']):
    # the following line reports an error:
    # TypeError: '['Animation', "Children's", 'Comedy']' is an invalid key
    dummies.iloc[i, dummies.columns.get_loc(gen.split('|'))] = 1
    # while loc runs successfully
    dummies.loc[dummies.index[[i]], gen.split('|')] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
I have some understanding of why Children's raises an error, but why do Animation and Comedy raise errors too? I have tried:
dummies.columns.get_loc('Animation')
and the result is 2.
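As a side note on the literal question (my reading, not part of the answer below): Index.get_loc accepts only a single label, so passing a list raises TypeError no matter which genres it contains. For a list of labels, Index.get_indexer returns their integer positions, which makes the iloc version work:

# get_indexer accepts a list of labels and returns their integer positions
dummies.iloc[i, dummies.columns.get_indexer(gen.split('|'))] = 1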
This is a pretty simple (and fast) answer using string matching that should work fine here, and in any case where the genre names don't overlap. E.g. if you had categories "crime" and "crime thriller", then a crime thriller would be categorized under both crime AND crime thriller rather than just crime thriller. (But see the note below for how you could generalize this.)
for g in genres:
    movies[g] = movies.genres.str.contains(g).astype(np.int8)
(Note that using np.int8 rather than int will save a lot of memory, as int defaults to 64 bits rather than 8.)
Results for movies.head(2):
movie_id title genres Action \
0 1 Toy Story (1995) Animation|Children's|Comedy 0
1 2 Jumanji (1995) Adventure|Children's|Fantasy 0
Adventure Animation Children's Comedy Crime Documentary ... \
0 0 1 1 1 0 0 ...
1 1 0 1 0 0 0 ...
Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller \
0 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0
War Western
0 0 0
1 0 0
The following generalization of the above code may be overkill, but it gives you a more general approach that avoids potential double counting of genre categories (e.g. conflating Crime and Crime Thriller):
# add '|' delimiter to beginning and end of the genres column
movies['genres2'] = '|' + movies['genres'] + '|'
# search for '|Crime|' rather than 'Crime', which is much safer because
# we don't match a category that merely contains 'Crime'; we only match
# 'Crime' exactly
for g in genres:
    movies[g + '2'] = movies.genres2.str.contains(r'\|' + g + r'\|').astype(np.int8)
(If you're better with regular expressions than I am, you wouldn't need to add the '|' at the beginning and end ;-)
Try
dummies = movies.genres.str.get_dummies()
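str.get_dummies splits on '|' by default, which happens to match the genre delimiter here. Spelling the separator out and joining the result back might look like this sketch:

dummies = movies['genres'].str.get_dummies(sep='|')
movies_windic = movies.join(dummies.add_prefix('Genre_'))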
I have a dataframe topic_data that contains the output of an LDA topic model:
topic_data.head(15)
topic word score
0 0 Automobile 0.063986
1 0 Vehicle 0.017457
2 0 Horsepower 0.015675
3 0 Engine 0.014857
4 0 Bicycle 0.013919
5 1 Sport 0.032938
6 1 Association_football 0.025324
7 1 Basketball 0.020949
8 1 Baseball 0.016935
9 1 National_Football_League 0.016597
10 2 Japan 0.051454
11 2 Beer 0.032839
12 2 Alcohol 0.027909
13 2 Drink 0.019494
14 2 Vodka 0.017908
This shows the top 5 terms for each topic, and the score (weight) for each. What I'm trying to do is reformat so that the index is the rank of the term, the columns are the topic IDs, and the values are formatted strings generated from the word and score columns (something along the lines of "%s (%.02f)" % (word,score)). That means the new dataframe should look something like this:
Topic 0 1 ...
Rank
0 Automobile (0.06) Sport (0.03) ...
1 Vehicle (0.017) Association_football (0.03) ...
... ... ... ...
What's the right way of going about this? I assume it involves a combination of index-setting, unstacking, and ranking, but I'm not sure of the right approach.
It would be something like this; note that Rank has to be generated first (here via groupby rank):
df['Rank'] = df.groupby('topic')['score'].rank(ascending=False, method='first').astype(int) - 1
df['New_str'] = df['word'] + df['score'].apply(' ({0:.2f})'.format)
df2 = df.sort_values(['Rank', 'score'])[['New_str', 'topic', 'Rank']]
print(df2.pivot(index='Rank', columns='topic', values='New_str'))
topic 0 1 2
Rank
0 Automobile (0.06) Sport (0.03) Japan (0.05)
1 Vehicle (0.02) Association_football (0.03) Beer (0.03)
2 Horsepower (0.02) Basketball (0.02) Alcohol (0.03)
3 Engine (0.01) Baseball (0.02) Drink (0.02)
4 Bicycle (0.01) National_Football_League (0.02) Vodka (0.02)
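Since the sample rows already come ordered by descending score within each topic, groupby().cumcount() is a shorter way to get the same 0-based rank (a sketch; it assumes that within-topic ordering holds):

# cumcount numbers rows 0, 1, 2, ... within each topic group,
# which equals the rank when rows are pre-sorted by descending score
df['Rank'] = df.groupby('topic').cumcount()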