I'm trying to group my dataframe into a tree-like format so that it is sorted hierarchically.
DC is the first column, followed by Retailer, store count, product description, case volume, and velocity, in that order. The number of rows for each retailer should be counted into a new column, "StoreCt", positioned after "Retailer".
The problem I'm running into is that the store counts are being duplicated.
Here is the dataframe I have
Retailer  DC   Product  Cs  Volume  Velocity
joe       ABC  bars     Cs  Cost    Velocity
joe       DFC  drinks1  Cs  Cost    Velocity
joe       DFC  drinks2  Cs  Cost    Velocity
randy     ABC  bars     Cs  Cost    Velocity
peter     DFC  drinks2  Cs  Cost    Velocity
john      XYZ  drinks   Cs  Cost    Velocity
joe       XYZ  snacks   Cs  Cost    Velocity
joe       DFC  bars2    Cs  Cost    Velocity
This is the result I want. The values in the Cs, Volume, and Velocity columns need to stay unchanged:
DC   Retailer  StoreCt  Product  Cs  Volume  Velocity
ABC  joe       1        bars     Cs  Cost    Velocity
     randy     1        bars     Cs  Cost    Velocity
DFC  joe       3        drinks1  Cs  Cost    Velocity
                        drinks2  Cs  Cost    Velocity
                        bars2    Cs  Cost    Velocity
     peter     1        drinks2  Cs  Cost    Velocity
XYZ  joe       1        snacks   Cs  Cost    Velocity
     john      1        drinks   Cs  Cost    Velocity
This is my code to get the store count, but I can't figure out how to add it to the dataframe without duplicating the values:
store_count = df.groupby("Retailer").size().to_frame("StoreCt")
store_count
Use transform to broadcast the result to all rows:
df['StoreCt'] = df.groupby(['DC', 'Retailer']).transform('size')
print(df)
# Output:
Retailer DC Product Cs Volume Velocity StoreCt
0 joe ABC bars Cs Cost Velocity 1
1 joe DFC drinks1 Cs Cost Velocity 3
2 joe DFC drinks2 Cs Cost Velocity 3
3 randy ABC bars Cs Cost Velocity 1
4 peter DFC drinks2 Cs Cost Velocity 1
5 john XYZ drinks Cs Cost Velocity 1
6 joe XYZ snacks Cs Cost Velocity 1
7 joe DFC bars2 Cs Cost Velocity 3
To get the desired output, reorder the columns and sort by DC and Retailer:
cols = ['DC', 'Retailer', 'StoreCt', 'Product', 'Cs', 'Volume', 'Velocity']
df = df[cols].sort_values(['DC', 'Retailer'], ignore_index=True)
print(df)
# Output
DC Retailer StoreCt Product Cs Volume Velocity
0 ABC joe 1 bars Cs Cost Velocity
1 ABC randy 1 bars Cs Cost Velocity
2 DFC joe 3 drinks1 Cs Cost Velocity
3 DFC joe 3 drinks2 Cs Cost Velocity
4 DFC joe 3 bars2 Cs Cost Velocity
5 DFC peter 1 drinks2 Cs Cost Velocity
6 XYZ joe 1 snacks Cs Cost Velocity
7 XYZ john 1 drinks Cs Cost Velocity
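The output above still repeats DC, Retailer, and StoreCt on every row. A minimal sketch of one way to get the tree-like display from the question, assuming it is acceptable to move those three columns into the index (pandas prints repeated index labels only once). The sample frame below reuses the question's data with the Cs, Volume, and Velocity columns left out for brevity, and the name `tree` is just illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Retailer": ["joe", "joe", "joe", "randy", "peter", "john", "joe", "joe"],
    "DC": ["ABC", "DFC", "DFC", "ABC", "DFC", "XYZ", "XYZ", "DFC"],
    "Product": ["bars", "drinks1", "drinks2", "bars",
                "drinks2", "drinks", "snacks", "bars2"],
})

# Count rows per (DC, Retailer) pair and broadcast the count to every row.
df["StoreCt"] = df.groupby(["DC", "Retailer"])["Product"].transform("size")

# Sort, then move the grouping columns into the index so that repeated
# labels are displayed only once when the frame is printed.
tree = (
    df.sort_values(["DC", "Retailer"])
      .set_index(["DC", "Retailer", "StoreCt"])
)
print(tree)
```

The underlying values are unchanged; only the printed layout is hierarchical, so `tree.reset_index()` gives back the flat frame.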
I have a dataset containing football data of the premier league as such:
HomeTeam AwayTeam FTHG FTAG
0 Liverpool Norwich 4 1
1 West Ham Man City 0 5
2 Bournemouth Sheffield United 1 1
3 Burnley Southampton 3 0
... ... ... ... ...
where "FTHG" and "FTAG" are full-time home team goals and away team goals.
I need to write a function that calculates the final Premier League table given the results (in the form of a data frame). What I wrote is this function:
def calcScore(row):
    if PL_df.iloc[row]['FTHG'] > PL_df.iloc[row]['FTAG']:
        x = 3
        y = 0
    elif PL_df.iloc[row]['FTHG'] < PL_df.iloc[row]['FTAG']:
        x = 0
        y = 3
    elif PL_df.iloc[row]['FTHG'] == PL_df.iloc[row]['FTAG']:
        x = 1
        y = 1
    return x,y
This works; for example, for the first row it gives this output:
in[1]: calcScore(0)
out[1]: (3,0)
Now I need to create two columns, HP and AP, that contain the number of points awarded to the home and away teams respectively, using apply(). But I can't think of a way to do that.
I hope I was clear enough. Thank you in advance.
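No need for a function (this is also faster than apply):
import numpy as np

# first condition: home win; second condition: draw; anything else is an away win
win_or_draws = df['FTHG'] > df['FTAG'], df['FTHG'] == df['FTAG']
df['HP'] = np.select(win_or_draws, (3, 1), 0)
df['AP'] = np.select(win_or_draws, (0, 1), 3)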
Output:
HomeTeam AwayTeam FTHG FTAG HP AP
0 Liverpool Norwich 4 1 3 0
1 West Ham Man City 0 5 0 3
2 Bournemouth Sheffield United 1 1 1 1
3 Burnley Southampton 3 0 3 0
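The question ultimately asks for the final league table, so here is a rough sketch of how the HP and AP columns could be aggregated into points per team. It assumes the table is simply total points (no goal-difference tie-breaker), and the names `conditions`, `home`, `away`, and `table` are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "HomeTeam": ["Liverpool", "West Ham", "Bournemouth", "Burnley"],
    "AwayTeam": ["Norwich", "Man City", "Sheffield United", "Southampton"],
    "FTHG": [4, 0, 1, 3],
    "FTAG": [1, 5, 1, 0],
})

# Home win, then draw; every other row is an away win (the default).
conditions = [df["FTHG"] > df["FTAG"], df["FTHG"] == df["FTAG"]]
df["HP"] = np.select(conditions, [3, 1], default=0)
df["AP"] = np.select(conditions, [0, 1], default=3)

# Sum the points each team earned at home and away, then combine them.
home = df.groupby("HomeTeam")["HP"].sum()
away = df.groupby("AwayTeam")["AP"].sum()
table = home.add(away, fill_value=0).astype(int).sort_values(ascending=False)
print(table)
```

`fill_value=0` matters here because a team that has only played at home (or only away) is missing from one of the two sums.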
The df before:
unnamed:0 unnamed:1 unnamed:2
0 Megan 30000 Botany
1 Ann 24000 Psychology
2 John 24000 Police
3 Mary 45000 Genetics
4 Jay 60000 Data Science
I want the df to look like this:
t0 t1 t2
0 Megan 30000 Botany
1 Ann 24000 Psychology
2 John 24000 Police
3 Mary 45000 Genetics
4 Jay 60000 Data Science
I have tried to rename the unnamed columns:
testfile.columns = testfile.columns.str.replace('Unnamed.*', 't')
testfile = testfile.rename(columns=lambda x: x+'x')
This will number the columns from 0 up to the number of columns you have:
testfile.columns = ['t{}'.format(i) for i in range(testfile.shape[1])]
You can use this to reset the column names and add a prefix to them:
df = df.T.reset_index(drop=True).T.add_prefix('t')
t0 t1 t2
0 Megan 30000 Botany
1 Ann 24000 Psychology
2 John 24000 Police
3 Mary 45000 Genetics
4 Jay 60000 Data Science
Try rename with a string split:
df = df.rename(lambda x: 't'+x.split(':')[-1], axis=1)
Out[502]:
t0 t1 t2
0 Megan 30000 Botany
1 Ann 24000 Psychology
2 John 24000 Police
3 Mary 45000 Genetics
4 Jay 60000 Data Science
If you don't care about the digit in unnamed:X and just want an incrementing number after t, you can use np.arange and np.char.add to construct the names:
np.char.add('t', np.arange(df.shape[1]).astype(str))
array(['t0', 't1', 't2'], dtype='<U12')
Assign it directly to the columns:
df.columns = np.char.add('t', np.arange(df.shape[1]).astype(str))
Your column numbers are already increasing, so you just need t instead of unnamed: as the prefix:
df.columns = df.columns.str.replace('unnamed:', 't')
Try this:
df.rename(lambda x: x.replace('unnamed:', 't'), axis=1)
output:
t0 t1 t2
0 Megan 30000 Botany
1 Ann 24000 Psychology
2 John 24000 Police
3 Mary 45000 Genetics
4 Jay 60000 Data Science
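The answers above fall into two groups: those that renumber the columns by position and those that keep the existing number and only swap the prefix. A small sketch of where they differ, assuming a frame whose middle column has been dropped (the sample data and the names `by_position` and `by_prefix` are only for illustration):

```python
import pandas as pd

testfile = pd.DataFrame(
    [["Megan", "Botany"], ["Ann", "Psychology"]],
    columns=["unnamed:0", "unnamed:2"],  # "unnamed:1" was dropped earlier
)

# Positional renaming ignores the old names and always yields t0, t1, ...
by_position = testfile.copy()
by_position.columns = [f"t{i}" for i in range(by_position.shape[1])]

# Prefix replacement keeps whatever number the column already had.
by_prefix = testfile.rename(columns=lambda name: name.replace("unnamed:", "t"))

print(list(by_position.columns))
print(list(by_prefix.columns))
```

Which one is appropriate depends on whether the numbers should reflect the current position or the original column.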
I have two dataframes as shown below.
Company Name BOD Position Ethnicity DOB Age Gender Degree ( Specialazation) Remark
0 Big Lots Inc. David J. Campisi Director, President and Chief Executive Offic... American 1956 61 Male Graduate NaN
1 Big Lots Inc. Philip E. Mallott Chairman of the Board American 1958 59 Male MBA, Finace NaN
2 Big Lots Inc. James R. Chambers Independent Director American 1958 59 Male MBA NaN
3 Momentive Performance Materials Inc Mahesh Balakrishnan director Asian 1983 34 Male BA Economics NaN
Company Name Net Sale Gross Profit Remark
0 Big Lots Inc. 5.2B 2.1B NaN
1 Momentive Performance Materials Inc 544M 146m NaN
2 Markel Corporation 5.61B 2.06B NaN
3 Noble Energy, Inc. 3.49B 2.41B NaN
4 Leidos Holding, Inc. 7.04B 852M NaN
I want to create a new dataframe from these two, so that the 2nd dataframe gets new columns with the count of each ethnicity per company, such as American - 2, Mexican - 5, and so on, so that I can calculate a diversity score later.
The columns of the output dataframe should look like this:
Company Name Net Sale Gross Profit Remark American Mexican German .....
Big Lots Inc. 5.2B 2.1B NaN 2 0 5 ....
First get the counts per group with groupby, size, and unstack, then join the result to the second DataFrame:
df1 = pd.DataFrame({'Company Name':list('aabcac'),
                    'Ethnicity':['American'] * 3 + ['Mexican'] * 3})
df1 = df1.groupby(['Company Name', 'Ethnicity']).size().unstack(fill_value=0)
# slower alternative
#df1 = pd.crosstab(df1['Company Name'], df1['Ethnicity'])
print (df1)
Ethnicity American Mexican
Company Name
a 2 1
b 1 0
c 0 2
df2 = pd.DataFrame({'Company Name':list('abc')})
print (df2)
Company Name
0 a
1 b
2 c
df3 = df2.join(df1, on=['Company Name'])
print (df3)
Company Name American Mexican
0 a 2 1
1 b 1 0
2 c 0 2
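In the question's data, the second dataframe lists companies that have no rows in the first one (for example Markel Corporation), and a left join leaves NaN in the count columns for those. A minimal sketch of handling that, using a cut-down version of the question's frames; the names `board`, `financials`, `counts`, and `result` are illustrative:

```python
import pandas as pd

board = pd.DataFrame({
    "Company Name": ["Big Lots Inc."] * 3
                    + ["Momentive Performance Materials Inc"],
    "Ethnicity": ["American", "American", "American", "Asian"],
})
financials = pd.DataFrame({
    "Company Name": ["Big Lots Inc.",
                     "Momentive Performance Materials Inc",
                     "Markel Corporation"],
    "Net Sale": ["5.2B", "544M", "5.61B"],
})

# One row per company, one column per ethnicity, values are counts.
counts = pd.crosstab(board["Company Name"], board["Ethnicity"])

# Left join on the company name; unmatched companies get NaN counts.
result = financials.join(counts, on="Company Name")

# Treat "no board rows" as a count of zero and restore integer values.
ethnicity_cols = list(counts.columns)
result[ethnicity_cols] = result[ethnicity_cols].fillna(0).astype(int)
print(result)
```

Whether zero is the right fill value depends on whether a missing company really has no directors or is simply absent from the first dataframe.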
EDIT: You need to replace each unit suffix with the corresponding number of zeros and convert to float:
print (df)
Name sale
0 A 100M
1 B 200M
2 C 5M
3 D 40M
4 E 10B
5 F 2B
d = {'M': '0'*6, 'B': '0'*9}
df['a'] = df['sale'].replace(d, regex=True).astype(float).sort_values(ascending=False)
print (df)
Name sale a
0 A 100M 1.000000e+08
1 B 200M 2.000000e+08
2 C 5M 5.000000e+06
3 D 40M 4.000000e+07
4 E 10B 1.000000e+10
5 F 2B 2.000000e+09
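Appending zeros works for whole numbers like the ones above, but the question's data also contains decimal and lowercase values such as 5.2B and 146m, where "5.2" followed by nine zeros is still just 5.2. A hedged sketch of an alternative that multiplies instead, assuming every value is a number followed by a single M or B in either case (the column name `sale_value` and the `multipliers` mapping are made up for this example):

```python
import pandas as pd

df = pd.DataFrame({"sale": ["100M", "5.2B", "146m", "852M"]})

# Assumed unit scale: M = million, B = billion.
multipliers = {"M": 1e6, "B": 1e9}

# Split each value into its numeric part and its one-letter unit.
parts = df["sale"].str.extract(r"^(?P<number>[\d.]+)(?P<unit>[A-Za-z])$")

# Convert the number to float and scale it by the unit's multiplier.
df["sale_value"] = (
    parts["number"].astype(float)
    * parts["unit"].str.upper().map(multipliers)
)
print(df)
```

Values that do not match the pattern end up as NaN, which makes unexpected formats easy to spot.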
Let's say I have this dataframe:
Name Salary Field
0 Megan 30000 Botany
1 Ann 24000 Psychology
2 John 24000 Police
3 Mary 45000 Genetics
4 Jay 60000 Data Science
I want to add some 0-indexed numbers above the column names, but I'd also want to keep the column names. I want to reach this form:
0 1 2
Name Salary Field
0 Megan 30000 Botany
1 Ann 24000 Psychology
2 John 24000 Police
3 Mary 45000 Genetics
4 Jay 60000 Data Science
How can I do that with pandas and Python?
You don't need to create a new dataframe:
df.columns = pd.MultiIndex.from_tuples(list(enumerate(df)))
As expected:
# 0 1 2
# Name Salary Field
# 0 Megan 30000 Botany
# 1 Ann 24000 Psychology
# 2 John 24000 Police
# 3 Mary 45000 Genetics
# 4 Jay 60000 Data Science
I hope this helps.
Here is a quick way:
new_df = pd.DataFrame(df.values,
                      columns=[list(range(df.shape[1])), df.columns]
                      )
I'm sure there is a more elegant way.
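One thing to keep in mind with rebuilding the frame from df.values is that a frame with mixed column types is converted to an object array first, so the numeric dtype of Salary is lost. A minimal sketch that only replaces the column index, using pd.MultiIndex.from_arrays on a shortened copy of the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Megan", "Ann"],
    "Salary": [30000, 24000],
    "Field": ["Botany", "Psychology"],
})

# Top level: 0-based positions; second level: the original column names.
df.columns = pd.MultiIndex.from_arrays(
    [list(range(df.shape[1])), list(df.columns)]
)
print(df)

# The data itself is untouched, so Salary is still an integer column.
print(df[(1, "Salary")].dtype)
```

Columns are then addressed with tuples such as `df[(1, "Salary")]`, or by the top level alone with `df[1]`.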
I have a dataframe topic_data that contains the output of an LDA topic model:
topic_data.head(15)
topic word score
0 0 Automobile 0.063986
1 0 Vehicle 0.017457
2 0 Horsepower 0.015675
3 0 Engine 0.014857
4 0 Bicycle 0.013919
5 1 Sport 0.032938
6 1 Association_football 0.025324
7 1 Basketball 0.020949
8 1 Baseball 0.016935
9 1 National_Football_League 0.016597
10 2 Japan 0.051454
11 2 Beer 0.032839
12 2 Alcohol 0.027909
13 2 Drink 0.019494
14 2 Vodka 0.017908
This shows the top 5 terms for each topic, and the score (weight) for each. What I'm trying to do is reformat so that the index is the rank of the term, the columns are the topic IDs, and the values are formatted strings generated from the word and score columns (something along the lines of "%s (%.02f)" % (word,score)). That means the new dataframe should look something like this:
Topic 0 1 ...
Rank
0 Automobile (0.06) Sport (0.03) ...
1 Vehicle (0.017) Association_football (0.03) ...
... ... ... ...
What's the right way of going about this? I assume it involves a combination of index-setting, unstacking, and ranking, but I'm not sure of the right approach.
It would be something like this; note that Rank has to be generated first:
In [140]:
# rank the terms within each topic by descending score (0 = highest)
df['Rank'] = df.groupby('topic')['score'].rank(ascending=False, method='first').astype(int) - 1
df['New_str'] = df['word'] + df['score'].apply(' ({0:.2f})'.format)
df2 = df.sort_values(['Rank', 'score'])[['New_str', 'topic', 'Rank']]
print(df2.pivot(index='Rank', values='New_str', columns='topic'))
topic 0 1 2
Rank
0 Automobile (0.06) Sport (0.03) Japan (0.05)
1 Vehicle (0.02) Association_football (0.03) Beer (0.03)
2 Horsepower (0.02) Basketball (0.02) Alcohol (0.03)
3 Engine (0.01) Baseball (0.02) Drink (0.02)
4 Bicycle (0.01) National_Football_League (0.02) Vodka (0.02)
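For reference, the same reshaping can be sketched without an explicit ranking function by sorting and numbering the rows within each topic with cumcount. This is only one possible variant, shown on a two-topic subset of the question's data; the names `ordered`, `label`, and `table` are illustrative:

```python
import pandas as pd

topic_data = pd.DataFrame({
    "topic": [0, 0, 1, 1],
    "word": ["Automobile", "Vehicle", "Sport", "Association_football"],
    "score": [0.063986, 0.017457, 0.032938, 0.025324],
})

# Highest score first within each topic, then number the rows 0, 1, 2, ...
ordered = topic_data.sort_values(["topic", "score"], ascending=[True, False])
ordered["Rank"] = ordered.groupby("topic").cumcount()

# Build the "word (score)" label with two decimal places.
ordered["label"] = ordered["word"] + ordered["score"].map(" ({:.2f})".format)

# Ranks become the index, topic IDs the columns.
table = ordered.pivot(index="Rank", columns="topic", values="label")
print(table)
```

If topics have different numbers of terms, the shorter ones are padded with NaN in the pivoted result.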