Python dataframe rank each column based on row values - python

I have a data frame. I want to rank each column based on its row value
Ex:
xdf = pd.DataFrame({'A':[10,20,30],'B':[5,30,20],'C':[15,3,8]})
xdf =
A B C
0 10 5 15
1 20 30 3
2 30 20 8
Expected result:
xdf =
A B C Rk_1 Rk_2 Rk_3
0 10 5 15 C A B
1 20 30 3 B A C
2 30 20 8 A B C
OR
xdf =
A B C A_Rk B_Rk C_Rk
0 10 5 15 2 3 1
1 20 30 3 2 1 3
2 30 20 8 1 2 3
Why I need this:
I want to track the trend of each column and how it is changing. I would like to show this in a plot, maybe a bar plot showing how many times A got rank 1, 2, 3, etc.
My approach:
xdf[['Rk_1','Rk_2','Rk_3']] = ""
for i in range(len(xdf)):
    xdf.loc[i,['Rk_1','Rk_2','Rk_3']] = dict(sorted(dict(xdf[['A','B','C']].loc[i]).items(),reverse=True,key=lambda item:item[1])).keys()
Present output:
A B C Rk_1 Rk_2 Rk_3
0 10 5 15 C A B
1 20 30 3 B A C
2 30 20 8 A B C
I am iterating over the rows, converting each row into a dictionary keyed by column, sorting it by value, and then extracting the keys (the column names). Is there a better approach? My actual data frame has 10000 rows and 12 columns to be ranked. I just ran this and it took around 2 minutes.

You should be able to get your desired dataframe by using:
ranked = xdf.join(xdf.rank(ascending=False, method='first', axis=1), rsuffix='_rank')
This'll give you:
A B C A_rank B_rank C_rank
0 10 5 15 2.0 3.0 1.0
1 20 30 3 2.0 1.0 3.0
2 30 20 8 1.0 2.0 3.0
Then do whatever you need to do plotting wise.
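For the bar plot mentioned in the question, one possible next step is to count how often each column receives each rank. The following is a minimal sketch under that assumption; the names ranks, labels and counts are illustrative and do not come from the original answer. It also shows one vectorised way to get the Rk_1, Rk_2, Rk_3 layout without a Python loop.

```python
import numpy as np
import pandas as pd

xdf = pd.DataFrame({'A': [10, 20, 30], 'B': [5, 30, 20], 'C': [15, 3, 8]})

# Rank 1 = largest value in the row; ties go to the leftmost column.
ranks = xdf.rank(ascending=False, method='first', axis=1).astype(int)

# Column labels ordered by rank (the Rk_1, Rk_2, Rk_3 layout from the question).
# A stable sort on the negated values keeps ties in left-to-right order.
order = np.argsort(-xdf.to_numpy(), axis=1, kind='stable')
labels = pd.DataFrame(xdf.columns.to_numpy()[order],
                      index=xdf.index,
                      columns=[f'Rk_{k + 1}' for k in range(xdf.shape[1])])

# How many times each column received each rank (rows: rank, columns: A, B, C).
counts = (ranks.apply(pd.Series.value_counts)
               .reindex(range(1, xdf.shape[1] + 1))
               .fillna(0)
               .astype(int))

print(labels)
print(counts)
```

If matplotlib is installed, counts.T.plot.bar() should draw one group of bars per column, which may be close to the plot described in the question.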

Related

Python Dataframe - Get max value between specific number vs. column value

When I have a below df, I want to get a column 'C' which has max value between specific value '15' and column 'A' within the condition "B == 't'"
testdf = pd.DataFrame({"A":[20, 16, 7, 3, 8],"B":['t','t','t','t','f']})
testdf
A B
0 20 t
1 16 t
2 7 t
3 3 t
4 8 f
I tried this:
testdf.loc[testdf['B']=='t', 'C'] = max(15,(testdf.loc[testdf['B']=='t','A']))
And desired output is:
A B C
0 20 t 20
1 16 t 16
2 7 t 15
3 3 t 15
4 8 f 8
Could you help me to get the output? Thank you!
Use np.where with clip:
testdf['C'] = np.where(testdf['B'].eq('t'),
                       testdf['A'].clip(15), testdf['A'])
Or similarly with series.where:
testdf['C'] = (testdf['A'].clip(15)
.where(testdf['B'].eq('t'), testdf['A'])
)
output:
A B C
0 20 t 20
1 16 t 16
2 7 t 15
3 3 t 15
4 8 f 8
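The attempt in the question fails because the built-in max compares 15 with a whole Series, and a Series has no single truth value, so pandas raises a ValueError. As a hedged alternative to clip, np.maximum does the comparison element by element. The variable name is_t below is illustrative only.

```python
import numpy as np
import pandas as pd

testdf = pd.DataFrame({"A": [20, 16, 7, 3, 8],
                       "B": ['t', 't', 't', 't', 'f']})

is_t = testdf['B'].eq('t')

# np.maximum works element-wise, unlike the built-in max().
# Series.mask replaces values only where the condition is True.
testdf['C'] = testdf['A'].mask(is_t, np.maximum(testdf['A'], 15))

print(testdf)
```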
You could also use the update method:
testdf['C'] = testdf['A']
A B C
0 20 t 20
1 16 t 16
2 7 t 7
3 3 t 3
4 8 f 8
values = testdf.A[testdf.B.eq('t')].clip(15)
values
Out[16]:
0 20
1 16
2 15
3 15
Name: A, dtype: int64
testdf.update(values.rename('C'))
A B C
0 20 t 20.0
1 16 t 16.0
2 7 t 15.0
3 3 t 15.0
4 8 f 8.0
To apply any function to the individual values of a column you can use
df['column'] = df['column'].apply(lambda x: anyFunc(x))
Here x receives the values of the column one by one and passes each to the function, where you can manipulate it and return the result.
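Applied to this question, the apply approach could look like the sketch below. Because the condition depends on column B as well, it uses DataFrame.apply with axis=1 so the function sees a whole row; the helper name floor_if_t is hypothetical. This is usually slower than the vectorised answers on large frames.

```python
import pandas as pd

testdf = pd.DataFrame({"A": [20, 16, 7, 3, 8],
                       "B": ['t', 't', 't', 't', 'f']})

def floor_if_t(row, floor=15):
    # Hypothetical helper: raise A to at least `floor` only when B == 't'.
    return max(floor, row['A']) if row['B'] == 't' else row['A']

# axis=1 passes one row at a time to the function.
testdf['C'] = testdf.apply(floor_if_t, axis=1)

print(testdf)
```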

Adding and multiplying values of a dataframe in Python

I have a dataset with multiple columns and rows. The rows are supposed to be summed up based on the unique value in a column. I tried .groupby but I want to retain the whole dataset and not just summed up columns based on one unique column. I further need to multiply these individual columns (values) with another column.
For example:
id A B C D E
11 2 1 2 4 100
11 2 2 1 1 100
12 1 3 2 2 200
13 3 1 1 4 190
14 Nan 1 2 2 300
I would like to sum up columns B, C & D based on the unique id and then multiply the result by column A and E in a new column F. I do not want to sum up the values of column A & E
I would like the resultant dataframe to be something like this, which also deals with NaN and while calculating skips the NaN value and moves onto further calculation:
id A B C D E F
11 2 3 3 5 100 9000
12 1 3 2 2 200 2400
13 3 1 1 4 190 2280
14 Nan 1 2 2 300 1200
If the above is unachievable then I would like something as, where the rows are same but the calculation is what I have stated above based on the same id:
id A B C D E F
11 2 3 3 5 100 9000
11 2 2 1 1 100 9000
12 1 3 2 2 200 2400
13 3 1 1 4 190 2280
14 Nan 1 2 2 300 1200
My earlier logic was to apply groupby on columns B, C and D and then multiply, but that is not working for me. If the above dataframes are unachievable, please let me know how I can perform this calculation and then merge/join the results with the original file with just the E column.
You must first sum the columns B, C and D vertically for each common id, then take the horizontal product:
result = df.groupby('id').agg({'A': 'first', 'B':'sum', 'C': 'sum', 'D': 'sum',
'E': 'first'})
result['F'] = result.fillna(1).astype('int64').agg('prod', axis=1)
It gives:
A B C D E F
id
11 2.0 3 3 5 100 9000
12 1.0 3 2 2 200 2400
13 3.0 1 1 4 190 2280
14 NaN 1 2 2 300 1200
Beware: id is the index here - use reset_index if you want it to be a normal column.
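For the second layout in the question, where every original row is kept and F is repeated per id, groupby(...).transform('sum') is one option, since it returns the group sums aligned to the original rows. This is a sketch assuming the 'Nan' entry in column A is a real NaN; the name sums is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [11, 11, 12, 13, 14],
                   'A': [2, 2, 1, 3, np.nan],
                   'B': [1, 2, 3, 1, 1],
                   'C': [2, 1, 2, 1, 2],
                   'D': [4, 1, 2, 4, 2],
                   'E': [100, 100, 200, 190, 300]})

# Per-id sums of B, C and D, broadcast back onto every original row.
sums = df.groupby('id')[['B', 'C', 'D']].transform('sum')

# Multiply the summed columns together, then by A and E.
# A missing A is treated as 1 so that it is skipped in the product.
df['F'] = (sums.prod(axis=1) * df['A'].fillna(1) * df['E']).astype('int64')

print(df)
```

Note that B, C and D keep their original per-row values here; only F uses the per-id sums.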

How to impute missing values based on other variables

I have a dataframe like below:
df = pd.DataFrame({'one' : pd.Series(['a', 'b', 'c', 'd','aa','bb',np.nan,'b','c',np.nan, np.nan] ),
'two' : pd.Series([10, 20, 30, 40,50,60,10,20,30,40,50])} )
The first column holds the variables and the second column holds their values. The value of a variable is constant and never changes.
For example, the value of 'a' is 10: whenever 'a' is present, the corresponding value is 10.
Some entries are missing in the first column, e.g. NaN with 10 should be a, and NaN with 40 should be d. The real dataframe contains 200 variables.
The values are not continuous; they are discrete and unsortable.
How can we impute the missing values in this case?
Expected output should be :
Please help me on this.
Regards,
Venkat.
I think in general it would be better to group and fill. We use DataFrame.groupby:
df.groupby('two').apply(lambda x: x.ffill().bfill())
It can be done without using groupby but you have to sort by both columns:
df.sort_values(['two','one']).ffill().sort_index()
Below I show you how the method proposed in another answer may fail:
Here is an example:
df=pd.DataFrame({'one':['a',np.nan,'c','d',np.nan,'c','b','b',np.nan,'a'],'two':[10,20,30,40,10,30,20,20,30,10]})
print(df)
one two
0 a 10
1 NaN 20
2 c 30
3 d 40
4 NaN 10
5 c 30
6 b 20
7 b 20
8 NaN 30
9 a 10
df.sort_values(['two']).fillna(method='ffill').sort_index()
one two
0 a 10
1 a 20
2 c 30
3 d 40
4 a 10
5 c 30
6 b 20
7 b 20
8 c 30
9 a 10
As you can see, the method proposed in another answer fails here (see row 1). This happens because a NaN can be the first entry for a particular value of column 'two', in which case it is filled with the value from the preceding group.
This doesn't happen if we group first:
df.groupby('two').apply(lambda x: x.ffill().bfill())
one two
0 a 10
1 b 20
2 c 30
3 d 40
4 a 10
5 c 30
6 b 20
7 b 20
8 c 30
9 a 10
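Depending on the pandas version, groupby(...).apply may return a result with the group key added to the index, or warn about operating on the grouping column. A variant that may be more robust is to fill only column 'one' with transform, which always returns values aligned to the original index. This is a sketch, not the answer's original code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': ['a', np.nan, 'c', 'd', np.nan, 'c', 'b', 'b', np.nan, 'a'],
                   'two': [10, 20, 30, 40, 10, 30, 20, 20, 30, 10]})

# Within each value of 'two', fill forward and then backward,
# so a NaN at the start of a group is also filled.
df['one'] = df.groupby('two')['one'].transform(lambda s: s.ffill().bfill())

print(df)
```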
As I said, we can also use DataFrame.sort_values, but we need to sort by both columns. I recommend this method.
df.sort_values(['two','one']).ffill().sort_index()
one two
0 a 10
1 b 20
2 c 30
3 d 40
4 a 10
5 c 30
6 b 20
7 b 20
8 c 30
9 a 10
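Since the question states that each variable always has the same value, another possible approach is to build a value-to-variable lookup from the complete rows and map it onto the missing ones. This is a sketch under the assumption that the mapping is one-to-one; the name lookup is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': ['a', 'b', 'c', 'd', 'aa', 'bb', np.nan, 'b', 'c', np.nan, np.nan],
                   'two': [10, 20, 30, 40, 50, 60, 10, 20, 30, 40, 50]})

# One known variable name per value, taken from rows where 'one' is present.
lookup = (df.dropna(subset=['one'])
            .drop_duplicates('two')
            .set_index('two')['one'])

# Fill only the missing names, using the value in 'two' as the key.
df['one'] = df['one'].fillna(df['two'].map(lookup))

print(df)
```

A value that never appears with a name stays NaN, because map has nothing to look up for it.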
Here it is:
df.ffill(inplace=True)
output:
one two
0 a 10
1 b 20
2 c 30
3 d 40
4 aa 50
5 bb 60
6 a 10
7 b 20
8 c 30
9 d 40
10 aa 50
Try this:
df = df.sort_values(['two']).fillna(method='ffill').sort_index()
Which will give you
one two
0 a 10
1 b 20
2 c 30
3 d 40
4 aa 50
5 bb 60
6 a 10
7 b 20
8 c 30
9 d 40
10 aa 50

Refer to next index in pandas

If I had a simple pandas DataFrame like this:
frame = pd.DataFrame(np.arange(12).reshape((3,4)), columns=list('abcd'), index=list('123'))
I want to find the max value in each row, use its column to look up the value in the next row of that column, and add this value to a new column.
So the above DataFrame looks like this (with d2 changed to 3):
a b c d
1 1 2 3 4
2 5 6 7 3
3 9 10 11 12
So, conceptually the first row should be scanned, 4 is identified as the largest number, then 3 is found as the number within the same column but in the next index. Similarly for the row 2, 7 is the largest number, and 11 is the next number in that column. So 3 and 11 should get added to a new column like this:
a b c d Next
1 1 2 3 4 NaN
2 5 6 7 3 3
3 9 10 11 12 11
I started by making a function like this, but it only finds the max values.
f = lambda x: x.max()
max = frame.apply(f, axis='columns')
frame['Next'] = max
Based on your edit, you can use np.argmax (df here is the frame from the question):
i = np.arange(len(df))
j = pd.Series(np.argmax(df.values, axis=1))
df['next'] = df.shift(-1).values[i, j]
a b c d next
1 1 2 3 4 3.0
2 5 6 7 3 11.0
3 9 10 11 12 NaN
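A self-contained version of the same idea, using the table from the question as data, might look like this. The names rows and cols are illustrative. The positions of the row maxima are computed before the new column is added so that it cannot affect the result.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1, 2, 3, 4],
                      [5, 6, 7, 3],
                      [9, 10, 11, 12]],
                     columns=list('abcd'), index=list('123'))

rows = np.arange(len(frame))             # positional row numbers
cols = frame.to_numpy().argmax(axis=1)   # column position of each row's max

# shift(-1) moves every row up by one, so position [i, j] holds the value
# from the next row in the same column; the last row has no successor.
frame['Next'] = frame.shift(-1).to_numpy()[rows, cols]

print(frame)
```

If the values should instead sit on the following row, as in the table in the question, frame['Next'].shift(1) would move them down by one.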

Pandas - remove row similar to other row

I need to remove all rows from a pandas.DataFrame, which satisfy an unusual condition.
If a row has NaN in column "C" and there is another row that is otherwise exactly the same, I want to remove the NaN row.
Given a table:
A B C D
1 2 NaN 3
1 2 50 3
10 20 NaN 30
5 6 7 8
I need to remove the first row, since it has NaN in column C and there is an otherwise identical row (the second) with a real value in column C.
However, the 3rd row must stay, because no other row has the same A, B and D values.
How do you perform this using pandas? Thank you!
You can achieve this using drop_duplicates.
Initial DataFrame:
df=pd.DataFrame(columns=['a','b','c','d'], data=[[1,2,None,3],[1,2,50,3],[10,20,None,30],[5,6,7,8]])
df
a b c d
0 1 2 NaN 3
1 1 2 50 3
2 10 20 NaN 30
3 5 6 7 8
Then you can sort the DataFrame by column C. This moves the NaNs to the bottom of the column:
df = df.sort_values(['c'])
df
a b c d
3 5 6 7 8
1 1 2 50 3
0 1 2 NaN 3
2 10 20 NaN 30
Then remove duplicates, comparing on all columns except C and keeping the first row found:
df1 = df.drop_duplicates(['a','b','d'], keep='first')
a b c d
3 5 6 7 8
1 1 2 50 3
2 10 20 NaN 30
But it will be valid only if NaNs are in column C.
You can try filling the NaNs (bfill, then ffill) along with drop_duplicates:
df.bfill().ffill().drop_duplicates(subset=['A', 'B', 'D'], keep = 'last')
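This also handles the case where the A, B and D values are the same but C has non-NaN values in both rows. Note that the fill is applied to every row, so the NaN in row 2 is overwritten with the 7 from the row below it.
You get
A B C D
1 1 2 50.0 3
2 10 20 7.0 30
3 5 6 7.0 8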
This feels right to me
notdups = ~df.duplicated(df.columns.difference(['C']), keep=False)
notnans = df.C.notnull()
df[notdups | notnans]
A B C D
1 1 2 50.0 3
2 10 20 NaN 30
3 5 6 7.0 8
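With the approach above, two rows that share A, B and D and both have NaN in C would both be dropped. If that matters, a slightly stricter rule is to drop a NaN row only when its group actually contains a real C value. The sketch below assumes that rule is what is wanted; key_cols and group_has_value are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 10, 5],
                   'B': [2, 2, 20, 6],
                   'C': [np.nan, 50, np.nan, 7],
                   'D': [3, 3, 30, 8]})

key_cols = ['A', 'B', 'D']

# True for every row whose (A, B, D) group contains at least one real C value.
group_has_value = (df['C'].notna()
                          .groupby([df[c] for c in key_cols])
                          .transform('any'))

# Drop a row only when its own C is NaN and a same-key row has a real C.
result = df[~(df['C'].isna() & group_has_value)]

print(result)
```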
