Ranking with multiple columns in a DataFrame - python

I have a dataframe with 3 columns:

Alpha  Bravo  Charlie
   20     30       40
   50     10       20
   40     60       10
I wish to create 3 new columns with rankings such that, within each row, the highest of the 3 values gets rank 3 and the lowest gets rank 1:

AlphaRank  BravoRank  CharlieRank
        1          2            3
        3          1            2
        2          3            1
I understand there is a DataFrame.rank function, but I have only seen examples for 1 column, not 3.
I tried this, with issues:
for newrank in ['Alpha', 'Bravo', 'Charlie']:
    ranksys = df[newrank]
    ranksystem = newrank + 'Rank'
    df[ranksystem] = ranksys.rank(axis=1).astype(int)

I think you need rank + astype:
cols = ['Alpha', 'Bravo', 'Charlie']
df[cols] = df[cols].rank().astype(int)
print (df)
Alpha Bravo Charlie
0 1 2 3
1 3 1 2
2 2 3 1
NumPy alternative with numpy.argsort (ranks need a double argsort; a single argsort gives sort positions, not ranks):
df[cols] = pd.DataFrame(df[cols].values.argsort(axis=0).argsort(axis=0) + 1, index=df.index, columns=df.columns)
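
If the goal is the new *Rank columns from the question, ranked across the three columns within each row (as the attempted loop suggests), here is a minimal sketch using rank(axis=1) on the DataFrame rather than on a single Series; on this sample the result happens to match the column-wise ranking above:

import pandas as pd

df = pd.DataFrame({'Alpha': [20, 50, 40],
                   'Bravo': [30, 10, 60],
                   'Charlie': [40, 20, 10]})
cols = ['Alpha', 'Bravo', 'Charlie']

# rank across the three columns within each row; the highest value gets 3
ranks = df[cols].rank(axis=1).astype(int)
ranks.columns = [c + 'Rank' for c in cols]
df = df.join(ranks)
print(df)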

Pandas: return the occurrences of the most frequent value for each group (possibly without apply)

Let's assume the input dataset:
test1 = [[0,7,50], [0,3,51], [0,3,45], [1,5,50],[1,0,50],[2,6,50]]
df_test = pd.DataFrame(test1, columns=['A','B','C'])
that corresponds to:
A B C
0 0 7 50
1 0 3 51
2 0 3 45
3 1 5 50
4 1 0 50
5 2 6 50
I would like to obtain a dataset grouped by 'A', together with the most common value of 'B' in each group, and the number of occurrences of that value:
A most_freq freq
0 3 2
1 5 1
2 6 1
I can obtain the first 2 columns with:
grouped = df_test.groupby("A")
out_df = pd.DataFrame(index=grouped.groups.keys())
out_df['most_freq'] = df_test.groupby('A')['B'].apply(lambda x: x.value_counts().idxmax())
but I am having problems with the last column.
Also: is there a faster way that doesn't involve apply? This solution doesn't scale well with larger inputs (I also tried dask).
Thanks a lot!
Use SeriesGroupBy.value_counts, which sorts counts in descending order by default, then add DataFrame.drop_duplicates to keep the top value per group after Series.reset_index:
df = (df_test.groupby('A')['B']
             .value_counts()
             .rename_axis(['A','most_freq'])
             .reset_index(name='freq')
             .drop_duplicates('A'))
print (df)
A most_freq freq
0 0 3 2
2 1 0 1
4 2 6 1
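A fully runnable sketch of the same approach; note that when counts tie within a group (as for A=1 here), which value is kept depends on the sort order and is not guaranteed. The final reset_index(drop=True) is an optional extra for a clean 0-based index:

import pandas as pd

test1 = [[0, 7, 50], [0, 3, 51], [0, 3, 45], [1, 5, 50], [1, 0, 50], [2, 6, 50]]
df_test = pd.DataFrame(test1, columns=['A', 'B', 'C'])

out = (df_test.groupby('A')['B']
              .value_counts()                 # counts sorted descending within each group
              .rename_axis(['A', 'most_freq'])
              .reset_index(name='freq')
              .drop_duplicates('A')           # first row per group = most frequent value
              .reset_index(drop=True))
print(out)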

Sum up values in a different number of columns for each row

I have a data frame including the number of tickets sold in different price buckets for each flight.
For each record/row, I want to use the value in one column as an end position for the iloc function, to sum up values across a specific number of columns.
That is, for each row, I want to sum up values from column index 5 up to the value in ['iloc_value'].
I tried df.iloc[:, 5:df['iloc_value']].sum(axis=1) but it did not work.
sample data:

   A  B  C  D  iloc_value
0  1  2  3  2           1
1  1  3  4  2           2
2  4  6  3  2           1
for each row, I want to sum up a number of columns based on the value in ['iloc_value']
for example,
for row 0, I want the total to be 1+2
for row 1, I want the total to be 1+3+4
for row 2, I want the total to be 4+6
EDIT:
I quickly got the results this way:
First define a function that can do it for one row:
def sum_till_iloc_value(row):
    # positional slice: columns up to and including position iloc_value
    return sum(row[:row['iloc_value'] + 1])

Then apply it to all rows to generate your output:

df['sum'] = df.apply(sum_till_iloc_value, axis=1)

   A  B  C  D  iloc_value  sum
0  1  2  3  2           1    3
1  1  3  4  2           2    8
2  4  6  3  2           1   10
PREVIOUSLY:
Assuming you have information that looks like:
df_flights = pd.DataFrame({'flight':['f1', 'f2', 'f3'], 'business':[2,3,4], 'economy':[6,7,8]})
df_flights
flight business economy
0 f1 2 6
1 f2 3 7
2 f3 4 8
you can sum the columns you want as below:
df_flights['seat_count'] = df_flights['business'] + df_flights['economy']
This will create a new column that you can later select:
df_flights[['flight', 'seat_count']]
flight seat_count
0 f1 8
1 f2 10
2 f3 12
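A small variant of the same idea (a sketch): selecting the columns as a list and using sum(axis=1) scales better than chaining + when there are many buckets:

df_flights['seat_count'] = df_flights[['business', 'economy']].sum(axis=1)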
Here's a way to do that in a fully vectorized way: melting the dataframe, summing only the relevant columns, and getting the total back into the dataframe:
# map the value columns (all but the last, iloc_value) to their integer positions
d = {col: pos for pos, col in enumerate(df.columns[:-1])}
temp_df = df.rename(columns=d)
# long format: one row per (original row index, column position, value)
temp_df = temp_df.reset_index().melt(id_vars=["index", "iloc_value"])
# keep only the positions up to each row's iloc_value (inclusive)
temp_df = temp_df[temp_df.variable <= temp_df.iloc_value]
df["total"] = temp_df.groupby("index").value.sum()
The output is:
A B C D iloc_value total
0 1 2 3 2 1 3
1 1 3 4 2 2 8
2 4 6 3 2 1 10
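
Another fully vectorized option (a sketch, assuming the value columns are A-D and iloc_value is the inclusive end position): take a row-wise cumulative sum and pick each row's entry at its iloc_value:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 4], 'B': [2, 3, 6], 'C': [3, 4, 3],
                   'D': [2, 2, 2], 'iloc_value': [1, 2, 1]})

csum = df[['A', 'B', 'C', 'D']].to_numpy().cumsum(axis=1)
df['total'] = csum[np.arange(len(df)), df['iloc_value'].to_numpy()]
print(df)   # totals: 3, 8, 10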

How to find the maximum value of a column with pandas?

I have a table with 40 columns and 1500 rows. I want to find the maximum value among the 30th-32nd columns (3 columns), and return that maximum together with its row index in the dataframe. How can it be done?
I tried:
print(Max_kVA_df.iloc[30:33].max())
but this selects rows 30-32, not columns.
Hi, you can refer to this example:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': [4, 5, 6, 7, 8],
                   'col3': [2, 3, 4, 5, 7]})
print(df)

# mention the range of columns you want; in your case change 0:3 to 30:33
# (the right bound, 33, is excluded)
ser = df.iloc[:, 0:3].max()   # per-column maxima
print(ser.max())              # overall maximum
Output
8
Select values by position and use np.max:
Sample, for the maximum over the first 5 rows:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(10, 3)), columns=list('ABC'))
print (df)
A B C
0 2 2 6
1 1 3 9
2 6 1 0
3 1 9 0
4 0 9 3
print (df.iloc[0:5])
A B C
0 2 2 6
1 1 3 9
2 6 1 0
3 1 9 0
4 0 9 3
print (np.max(df.iloc[0:5].max()))
9
Or use iloc to slice the columns by position (30:33 selects columns 30-32 on the real data):
print(Max_kVA_df.iloc[:, 30:33].max().max())
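To also recover the row index where the overall maximum occurs, which the question asks for, here is a minimal sketch assuming positional columns 30-32:

sub = Max_kVA_df.iloc[:, 30:33]   # columns 30-32 by position
row_max = sub.max(axis=1)         # per-row maximum over the 3 columns
print(row_max.max())              # overall maximum value
print(row_max.idxmax())           # row index where it occurs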

create pandas pivottable with a long multiindex

I have a dataframe df with the shape (4573,64) that I'm trying to pivot. The last column is an 'id' with two possible string values 'old' and 'new'. I would like to set the first 63 columns as index and then have the 'id' column across the top with values being the count of 'old' or 'new' for each index row.
I've created a list named cols out of the column labels that I want as the index.
I tried the following:
df.pivot(index=cols, columns='id')['id']
this gives an error: 'all arrays must be same length'
also tried the following to see if I can get sum but no luck either:
pd.pivot_table(df,index=cols,values=['id'],aggfunc=np.sum)
Any ideas greatly appreciated.
I found a thread online about a possible bug in pandas 0.23.0 where pandas.pivot_table() will not accept a multiindex as long as it contains NaNs (link to GitHub in the comments). My workaround was to do
df.fillna('empty', inplace=True)
then the solution below:
df1 = pd.pivot_table(df, index=cols, columns='id', aggfunc='size', fill_value=0)
as proposed by jezrael works as intended, hence the accepted answer.
I believe you need to convert the column names to a list and then aggregate size with unstack:
df = pd.DataFrame({'B': [4, 4, 4, 5, 5, 4],
                   'C': [1, 1, 9, 4, 2, 3],
                   'D': [1, 1, 5, 7, 1, 0],
                   'E': [0, 0, 6, 9, 2, 4],
                   'id': list('aaabbb')})
print (df)
B C D E id
0 4 1 1 0 a
1 4 1 1 0 a
2 4 9 5 6 a
3 5 4 7 9 b
4 5 2 1 2 b
5 4 3 0 4 b
cols = df.columns[:-1].tolist()   # every column except 'id'
df1 = df.groupby(cols + ['id']).size().unstack(fill_value=0)
print (df1)
id a b
B C D E
4 1 1 0 2 0
3 0 4 0 1
9 5 6 1 0
5 2 1 2 0 1
4 7 9 0 1
Solution with pivot_table:
df1 = pd.pivot_table(df, index=cols,columns='id',aggfunc='size', fill_value=0)
print (df1)
id a b
B C D E
4 1 1 0 2 0
3 0 4 0 1
9 5 6 1 0
5 2 1 2 0 1
4 7 9 0 1
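
An equivalent one-liner (a sketch) is pd.crosstab, which counts the combinations directly and fills missing pairs with 0:

df1 = pd.crosstab(index=[df[c] for c in cols], columns=df['id'])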

Calculating CAGR row by row in a pandas data frame?

I am working with company data. I have a data set of around 1,900 companies (index) and 30 variables per company (columns). These variables always come in groups of three (three periods). It basically looks like this:
df = pd.DataFrame({'id': ['1', '2', '3', '7'],
                   'revenue_0': [7, 2, 5, 4],
                   'revenue_1': [5, 6, 3, 1],
                   'revenue_2': [1, 9, 4, 8],
                   'profit_0': [3, 6, 4, 4],
                   'profit_1': [4, 6, 9, 1],
                   'profit_2': [5, 5, 9, 8]})
I am trying to compute the compound annual growth rate (CAGR) for e.g. revenue for each company (id), such that revenue_cagr = ((revenue_2/revenue_0)^(1/3))-1
I would like to pass a function to a set of columns row by row - at least, that is my idea.
def CAGR(start_value, end_value, periods):
    # note: Python's power operator is **; ^ is bitwise XOR
    return ((end_value / start_value) ** (1 / periods)) - 1
Is it possible to apply this function row by row for a set of columns (maybe with for i, row in df.iterrows(): or df.apply())? Or is there a smarter way to do this?
Update
The desired outcome, exemplified with the revenue_cagr and profit_cagr columns, should look as follows:
df = pd.DataFrame({'id': ['1', '2', '3', '7'],
                   'revenue_0': [7, 2, 5, 4],
                   'revenue_1': [5, 6, 3, 1],
                   'revenue_2': [1, 9, 4, 8],
                   'profit_0': [3, 6, 4, 4],
                   'profit_1': [4, 6, 9, 1],
                   'profit_2': [5, 5, 9, 8],
                   'revenue_cagr': [-0.48, 0.65, -0.07, 0.26],
                   'profit_cagr': [0.19, -0.06, 0.31, 0.26]})
You can use set_index + str.rsplit to split the column names into (metric, period) pairs first:
df1 = df.set_index('id')
df1.columns = df1.columns.str.rsplit('_', expand=True, n=1)
print (df1)
profit revenue
0 1 2 0 1 2
id
1 3 4 5 7 5 1
2 6 6 5 2 6 9
3 4 9 9 5 3 4
7 4 1 8 4 1 8
Then use xs to select the '2' and '0' period levels, divide them with div, and chain pow, sub and add_suffix:
df1 = (df1.xs('2', axis=1, level=1)
          .div(df1.xs('0', axis=1, level=1))
          .pow(1./3)
          .sub(1)
          .add_suffix('_cagr'))
print (df1)
profit_cagr revenue_cagr
id
1 0.185631 -0.477242
2 -0.058964 0.650964
3 0.310371 -0.071682
7 0.259921 0.259921
Last, join back to the original:
df = df.join(df1, on='id')
print (df)
id profit_0 profit_1 profit_2 revenue_0 revenue_1 revenue_2 \
0 1 3 4 5 7 5 1
1 2 6 6 5 2 6 9
2 3 4 9 9 5 3 4
3 7 4 1 8 4 1 8
profit_cagr revenue_cagr
0 0.185631 -0.477242
1 -0.058964 0.650964
2 0.310371 -0.071682
3 0.259921 0.259921
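
A more direct variant of the same computation (a sketch, applied to the df from the question) skips the column split entirely:

for metric in ['revenue', 'profit']:
    # CAGR per the question's formula: (end / start) ** (1/periods) - 1
    df[metric + '_cagr'] = (df[metric + '_2'] / df[metric + '_0']) ** (1. / 3) - 1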
