Generate columns of top ranked values in Pandas - python

I have a dataframe topic_data that contains the output of an LDA topic model:
topic_data.head(15)
topic word score
0 0 Automobile 0.063986
1 0 Vehicle 0.017457
2 0 Horsepower 0.015675
3 0 Engine 0.014857
4 0 Bicycle 0.013919
5 1 Sport 0.032938
6 1 Association_football 0.025324
7 1 Basketball 0.020949
8 1 Baseball 0.016935
9 1 National_Football_League 0.016597
10 2 Japan 0.051454
11 2 Beer 0.032839
12 2 Alcohol 0.027909
13 2 Drink 0.019494
14 2 Vodka 0.017908
This shows the top 5 terms for each topic, and the score (weight) for each. What I'm trying to do is reformat so that the index is the rank of the term, the columns are the topic IDs, and the values are formatted strings generated from the word and score columns (something along the lines of "%s (%.02f)" % (word,score)). That means the new dataframe should look something like this:
Topic 0 1 ...
Rank
0 Automobile (0.06) Sport (0.03) ...
1 Vehicle (0.02) Association_football (0.03) ...
... ... ... ...
What's the right way of going about this? I assume it involves a combination of index-setting, unstacking, and ranking, but I'm not sure of the right approach.

It would be something like this; note that Rank has to be generated first:
In [140]:
# rank terms within each topic by descending score (0 = highest score)
df['Rank'] = df.groupby('topic')['score'].rank(ascending=False, method='first').astype(int) - 1
df['New_str'] = df['word'] + df['score'].apply(' ({0:.2f})'.format)
print(df.pivot(index='Rank', columns='topic', values='New_str'))
topic 0 1 2
Rank
0 Automobile (0.06) Sport (0.03) Japan (0.05)
1 Vehicle (0.02) Association_football (0.03) Beer (0.03)
2 Horsepower (0.02) Basketball (0.02) Alcohol (0.03)
3 Engine (0.01) Baseball (0.02) Drink (0.02)
4 Bicycle (0.01) National_Football_League (0.02) Vodka (0.02)
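The question guesses at a combination of index-setting, unstacking, and ranking. A minimal sketch of that route, assuming a shortened copy of topic_data with three terms per topic (the names ordered, label, and table are made up for illustration):

```python
import pandas as pd

# Shortened sample of the question's data: three terms for topics 0 and 1.
topic_data = pd.DataFrame({
    "topic": [0, 0, 0, 1, 1, 1],
    "word": ["Automobile", "Vehicle", "Horsepower",
             "Sport", "Association_football", "Basketball"],
    "score": [0.063986, 0.017457, 0.015675, 0.032938, 0.025324, 0.020949],
})

# Sort so that the highest score comes first within each topic.
ordered = topic_data.sort_values(["topic", "score"], ascending=[True, False])

# cumcount() numbers the rows of each topic 0, 1, 2, ... in that order,
# which serves as the rank; label is the formatted "word (score)" string.
ordered = ordered.assign(
    Rank=ordered.groupby("topic").cumcount(),
    label=ordered["word"] + ordered["score"].map(" ({:.2f})".format),
)

# Index by (Rank, topic), then move topic into the columns.
table = ordered.set_index(["Rank", "topic"])["label"].unstack("topic")
print(table)
```

Sorting before cumcount() means the input does not have to arrive pre-sorted by score.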

Related

create two columns based on a function with apply()

I have a dataset containing football data of the premier league as such:
HomeTeam AwayTeam FTHG FTAG
0 Liverpool Norwich 4 1
1 West Ham Man City 0 5
2 Bournemouth Sheffield United 1 1
3 Burnley Southampton 3 0
... ... ... ... ...
where "FTHG" and "FTAG" are full-time home team goals and away team goals.
I need to write a function that calculates the final Premier League table given the results (in the form of a data frame). What I wrote is this function:
def calcScore(row):
    if PL_df.iloc[row]['FTHG'] > PL_df.iloc[row]['FTAG']:
        x = 3
        y = 0
    elif PL_df.iloc[row]['FTHG'] < PL_df.iloc[row]['FTAG']:
        x = 0
        y = 3
    elif PL_df.iloc[row]['FTHG'] == PL_df.iloc[row]['FTAG']:
        x = 1
        y = 1
    return x,y
This works. For example, for the first row it gives this output:
in[1]: calcScore(0)
out[1]: (3,0)
Now I need to create two columns, HP and AP, that contain the number of points awarded to the home and away teams respectively, using apply(). But I can't think of a way to do that.
I hope I was clear enough. Thank you in advance.
No need for a function (and also faster than apply):
win_or_draws = df['FTHG'] > df['FTAG'], df['FTHG'] == df['FTAG']
df['HP'] = np.select(win_or_draws, (3, 1), 0)
df['AP'] = np.select(win_or_draws, (0, 1), 3)
Output:
HomeTeam AwayTeam FTHG FTAG HP AP
0 Liverpool Norwich 4 1 3 0
1 West Ham Man City 0 5 0 3
2 Bournemouth Sheffield United 1 1 1 1
3 Burnley Southampton 3 0 3 0
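If apply() is specifically required, as the question asks, one possible sketch is to have the function take the row itself rather than a row number, and to expand the returned pair into two columns. The function name calc_points and the intermediate points frame are made up for illustration; the data is the four rows shown in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "HomeTeam": ["Liverpool", "West Ham", "Bournemouth", "Burnley"],
    "AwayTeam": ["Norwich", "Man City", "Sheffield United", "Southampton"],
    "FTHG": [4, 0, 1, 3],
    "FTAG": [1, 5, 1, 0],
})

def calc_points(row):
    """Return (home points, away points) for one match row."""
    if row["FTHG"] > row["FTAG"]:
        return 3, 0
    if row["FTHG"] < row["FTAG"]:
        return 0, 3
    return 1, 1

# axis=1 passes each row to the function; result_type="expand" turns the
# returned tuple into separate columns (labelled 0 and 1 by default).
points = df.apply(calc_points, axis=1, result_type="expand")
points.columns = ["HP", "AP"]
df = df.join(points)
print(df)
```

This calls a Python function once per row, so on large frames the np.select version above is likely to be noticeably faster.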

Mapping values across dataframes to create a new one

I have two dataframes. The first represents the nutritional information of certain ingredients with ingredients as rows and the columns as the nutritional categories.
Item Brand and style Quantity Calories Total Fat ... Carbs Sugar Protein Fiber Sodium
0 Brown rice xxx xxxxxxxx xxxxx, long grain 150g 570 4.5 ... 1170 0 12 6 0
1 Whole wheat bread xxxxxxxx, whole grains 2 slices 220 4 ... 42 6 8 6 320
2 Whole wheat cereal xxx xxxxxxxx xxxxx, wheat squares 60g 220 1 ... 47 0 7 5 5
The second represents the type and quantity of ingredients of meals with the meals as rows and the ingredients as columns.
Meal Brown rice Whole wheat bread Whole wheat cereal ... Marinara sauce American cheese Olive oil Salt
0 Standard breakfast 0 0 1 ... 0 0 0 0
1 Standard lunch 0 2 0 ... 0 0 0 0
2 Standard dinner 0 0 0 ... 0 0 1 1
I am trying to create another dataframe such that the meals are rows and the nutritional categories are at the top, representing the entire nutritional value of the meal based on the number of ingredients.
For example, if a standard lunch consists of 2 slices of bread (150 calories each slice), 1 serving of peanut butter (100 calories), and 1 serving of jelly (50 calories), then I would like the dataframe to be like:
Meal Calories Total fat ...
Standard lunch 450 xxx
Standard dinner xxx xxx
...
450 comes from (2*150 + 100 + 50).
The function template could be:
def create_meal_category_dataframe(ingredients_df, meals_df):
    ingredients = meals_df.columns[1:]
    meals = meals_df['Meal']
    # return meal_cat_df
I extracted lists of the meal and ingredient names, but I'm not sure if they're useful here. Thanks.
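One way to read this problem is as a matrix product: a meals-by-ingredients table of quantities multiplied by an ingredients-by-nutrients table of values. A minimal sketch under that reading, using made-up sample data that mirrors the lunch example (the Protein figures are invented), and assuming each nutritional value is per serving and each meal entry is a serving count:

```python
import pandas as pd

# Hypothetical sample data; only the Calories values follow the lunch example.
ingredients_df = pd.DataFrame({
    "Item": ["Bread", "Peanut butter", "Jelly"],
    "Quantity": ["1 slice", "1 serving", "1 serving"],
    "Calories": [150, 100, 50],
    "Protein": [4, 4, 0],
})
meals_df = pd.DataFrame({
    "Meal": ["Standard lunch", "Snack"],
    "Bread": [2, 0],
    "Peanut butter": [1, 1],
    "Jelly": [1, 0],
})

def create_meal_category_dataframe(ingredients_df, meals_df):
    # Rows: meals, columns: ingredients, values: number of servings.
    quantities = meals_df.set_index("Meal")
    # Rows: ingredients, columns: numeric nutrients only
    # (text columns such as Quantity are dropped).
    nutrition = ingredients_df.set_index("Item").select_dtypes("number")
    # Put the ingredient rows in the same order as the meal columns.
    nutrition = nutrition.reindex(quantities.columns)
    # (meals x ingredients) . (ingredients x nutrients) -> meals x nutrients
    return quantities.dot(nutrition)

meal_cat_df = create_meal_category_dataframe(ingredients_df, meals_df)
print(meal_cat_df)
```

An ingredient that appears in meals_df but not in ingredients_df becomes a row of NaN after reindex, so the affected totals come out as NaN rather than being silently undercounted.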

Merge two pandas dataframes to create a new dataframe with a specific operation

I have two dataframes as shown below.
Company Name BOD Position Ethnicity DOB Age Gender Degree ( Specialazation) Remark
0 Big Lots Inc. David J. Campisi Director, President and Chief Executive Offic... American 1956 61 Male Graduate NaN
1 Big Lots Inc. Philip E. Mallott Chairman of the Board American 1958 59 Male MBA, Finace NaN
2 Big Lots Inc. James R. Chambers Independent Director American 1958 59 Male MBA NaN
3 Momentive Performance Materials Inc Mahesh Balakrishnan director Asian 1983 34 Male BA Economics NaN
Company Name Net Sale Gross Profit Remark
0 Big Lots Inc. 5.2B 2.1B NaN
1 Momentive Performance Materials Inc 544M 146m NaN
2 Markel Corporation 5.61B 2.06B NaN
3 Noble Energy, Inc. 3.49B 2.41B NaN
4 Leidos Holding, Inc. 7.04B 852M NaN
I want to create a new dataframe from these two, so that the second dataframe gains new columns holding the count of each ethnicity per company (for example American: 2, Mexican: 5, and so on), so that I can later calculate a diversity score.
The columns in the output dataframe should look like this:
Company Name Net Sale Gross Profit Remark American Mexican German .....
Big Lots Inc. 5.2B 2.1B NaN 2 0 5 ....
First get counts per group using groupby with size and unstack, then join the result to the second DataFrame:
df1 = pd.DataFrame({'Company Name': list('aabcac'),
                    'Ethnicity': ['American'] * 3 + ['Mexican'] * 3})
df1 = df1.groupby(['Company Name', 'Ethnicity']).size().unstack(fill_value=0)
# slower alternative
# df1 = pd.crosstab(df1['Company Name'], df1['Ethnicity'])
print (df1)
Ethnicity American Mexican
Company Name
a 2 1
b 1 0
c 0 2
df2 = pd.DataFrame({'Company Name':list('abc')})
print (df2)
Company Name
0 a
1 b
2 c
df3 = df2.join(df1, on=['Company Name'])
print (df3)
Company Name American Mexican
0 a 2 1
1 b 1 0
2 c 0 2
EDIT: You need to replace each unit suffix with the matching number of zeros and convert to floats:
print (df)
Name sale
0 A 100M
1 B 200M
2 C 5M
3 D 40M
4 E 10B
5 F 2B
d = {'M': '0'*6, 'B': '0'*9}
df['a'] = df['sale'].replace(d, regex=True).astype(float)
print (df)
Name sale a
0 A 100M 1.000000e+08
1 B 200M 2.000000e+08
2 C 5M 5.000000e+06
3 D 40M 4.000000e+07
4 E 10B 1.000000e+10
5 F 2B 2.000000e+09
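Note that padding with zeros only gives the right number for whole values: '5.2B' would become '5.2000000000', which parses as 5.2. The question's data contains decimals such as 5.2B and a lowercase suffix (146m), so a multiplier-based conversion may be safer. A minimal sketch, where parse_amount is a made-up helper and the only suffixes assumed are M and B:

```python
import pandas as pd

def parse_amount(text):
    """Convert strings such as '5.2B' or '146m' to a float."""
    multipliers = {"M": 1e6, "B": 1e9}
    text = text.strip()
    suffix = text[-1].upper()  # accept lowercase suffixes too
    if suffix in multipliers:
        return float(text[:-1]) * multipliers[suffix]
    return float(text)  # no recognised suffix: plain number

sales = pd.Series(["5.2B", "544M", "146m", "852M"])
values = sales.map(parse_amount)
print(values)
```

Any other suffix, or a missing value, would raise an error here and would need handling of its own.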

TypeError when using iloc to create dummy variables

Source data is from the book Python_for_Data_Analysis, chp 2.
The data for movies is as follows and can also be found here:
movies.head(n=10)
Out[3]:
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
5 6 Heat (1995) Action|Crime|Thriller
6 7 Sabrina (1995) Comedy|Romance
7 8 Tom and Huck (1995) Adventure|Children's
8 9 Sudden Death (1995) Action
9 10 GoldenEye (1995) Action|Adventure|Thriller
The following code has trouble when I use iloc:
import pandas as pd
import numpy as np
from pandas import Series,DataFrame
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('movies.dat', sep='::',
                       engine='python', header=None, names=mnames)
movies.head(n=10)
genre_iter = (set(x.split('|')) for x in movies['genres'])
genres = sorted(set.union(*genre_iter))
dummies = DataFrame(np.zeros((len(movies), len(genres))), columns=genres)
for i, gen in enumerate(movies['genres']):
    # the following line raises an error:
    # TypeError: '['Animation', "Children's", 'Comedy']' is an invalid key
    dummies.iloc[i,dummies.columns.get_loc(gen.split('|'))] = 1
    # while loc runs successfully
    dummies.loc[dummies.index[[i]],gen.split('|')] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
I have some idea of why Children's might cause an error, but why do Animation and Comedy cause one too? I have tried:
dummies.columns.get_loc('Animation')
and the result is 2.
This is a pretty simple (and fast) answer using string matching that should work fine here and in any case where the genre names don't overlap. For example, if you had categories "crime" and "crime thriller", then a crime thriller would be categorized under both crime AND crime thriller rather than just crime thriller. (But see the note below for how you could generalize this.)
for g in genres:
    movies[g] = movies.genres.str.contains(g).astype(np.int8)
(Note that using np.int8 rather than int will save a lot of memory, as int defaults to 64 bits rather than 8.)
Results for movies.head(2):
movie_id title genres Action \
0 1 Toy Story (1995) Animation|Children's|Comedy 0
1 2 Jumanji (1995) Adventure|Children's|Fantasy 0
Adventure Animation Children's Comedy Crime Documentary ... \
0 0 1 1 1 0 0 ...
1 1 0 1 0 0 0 ...
Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller \
0 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0
War Western
0 0 0
1 0 0
The following generalization of the above code may be overkill, but it gives you a more general approach that should avoid potential double counting of genre categories (e.g. equating Crime and Crime Thriller):
# add '|' delimiter to beginning and end of the genres column
movies['genres2'] = '|' + movies['genres'] + '|'
# search for '|Crime|' rather than 'Crime', which is much safer because
# we don't match a category which merely contains 'Crime'; we
# only match 'Crime' exactly
for g in genres:
    movies[g+'2'] = movies.genres2.str.contains(r'\|' + g + r'\|').astype(np.int8)
(If you're better with regular expressions than me you wouldn't need to add the '|' at the beginning and end ;-)
Try
dummies = movies.genres.str.get_dummies()
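As for why the iloc line fails: Index.get_loc looks up a single label, so passing the whole list returned by gen.split('|') is rejected as one invalid key, whichever genres it contains. Index.get_indexer accepts a list of labels and returns their integer positions, which iloc can use. A minimal sketch with three rows from the question standing in for movies.dat:

```python
import numpy as np
import pandas as pd

# Small stand-in for the movies table (three rows from the question).
movies = pd.DataFrame({
    "genres": [
        "Animation|Children's|Comedy",
        "Adventure|Children's|Fantasy",
        "Comedy|Romance",
    ]
})

genres = sorted(set.union(*(set(g.split("|")) for g in movies["genres"])))
dummies = pd.DataFrame(np.zeros((len(movies), len(genres))), columns=genres)

for i, gen in enumerate(movies["genres"]):
    # get_indexer maps a list of labels to an array of integer positions
    # (-1 for a label that is not in the index).
    cols = dummies.columns.get_indexer(gen.split("|"))
    dummies.iloc[i, cols] = 1

print(dummies)
```

For this particular task, str.get_dummies() as suggested above does the same job in one call, since '|' is its default separator.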

Pandas: Concatenate two dataframes with different column names

I have two data frames:
df1 =
actorID actorName
0 annie_potts Annie Potts
1 bill_farmer Bill Farmer
2 don_rickles Don Rickles
3 erik_von_detten Erik von Detten
4 greg-berg Greg Berg
df2 =
directorID directorName
0 john_lasseter John Lasseter
1 joe_johnston Joe Johnston
2 donald_petrie Donald Petrie
3 forest_whitaker Forest Whitaker
4 charles_shyer Charles Shyer
What I ideally want is a concatenation of these two dataframes, like pd.concat((df1, df2)):
actorID-directorID actorName-directorName
0 annie_potts Annie Potts
1 bill_farmer Bill Farmer
2 don_rickles Don Rickles
3 erik_von_detten Erik von Detten
4 greg-berg Greg Berg
5 john_lasseter John Lasseter
6 joe_johnston Joe Johnston
7 donald_petrie Donald Petrie
8 forest_whitaker Forest Whitaker
9 charles_shyer Charles Shyer
However, I want an easy way to specify that df1.actorName and df2.directorName should be joined together, and likewise actorID and directorID. How can I do this?
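pd.concat aligns on column names, so one possible approach is to rename both frames to a shared set of names first. A minimal sketch with two rows from each frame; the shared names ID and Name and the mapping dict are arbitrary choices made for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({
    "actorID": ["annie_potts", "bill_farmer"],
    "actorName": ["Annie Potts", "Bill Farmer"],
})
df2 = pd.DataFrame({
    "directorID": ["john_lasseter", "joe_johnston"],
    "directorName": ["John Lasseter", "Joe Johnston"],
})

# State explicitly which source columns belong together.
mapping = {
    "actorID": "ID", "directorID": "ID",
    "actorName": "Name", "directorName": "Name",
}

# rename ignores keys that are absent from a frame, so one dict serves both.
combined = pd.concat(
    [df1.rename(columns=mapping), df2.rename(columns=mapping)],
    ignore_index=True,  # renumber rows 0..n-1 instead of repeating 0, 1
)
print(combined)
```

The combined names from the question (actorID-directorID, actorName-directorName) could be used as the mapping values instead if those labels are preferred.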
