Fill a column based on multiple columns' values - python

I have the following dataframe:
df
   A   B   C  D
   1   2  NA  3
   2   3  NA  1
   3  NA   1  2
A, B, C, and D are answers to a question. Respondents ranked the answers from 1 to 3, which means no row can contain the same rank twice. I am trying to build a new summary of the top 3, something like:
  1st  2nd  3rd
    A    B    D
    D    A    B
    C    D    A
This format will make it easier for me to draw conclusions such as "these are the top 3rd-choice answers".
I didn't find any way to do this. Could you help me, please?
Thank you very much!
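
For reproducibility, here is a minimal sketch that reconstructs the example frame, assuming the NA cells are missing values (np.nan):
import numpy as np
import pandas as pd

# hypothetical reconstruction of the question's frame; NA is taken to be np.nan
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [2, 3, np.nan],
                   'C': [np.nan, np.nan, 1],
                   'D': [3, 1, 2]})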

One way is using argsort and indexing the column names:
pd.DataFrame(df.columns.to_numpy()[df.values.argsort()[:, :-1]],
             columns=['1st', '2nd', '3rd'])

  1st 2nd 3rd
0   A   B   D
1   D   A   B
2   C   D   A
This works because argsort places NaN last in each row, so slicing off the final position with [:, :-1] drops the unranked column.

Another way is to use stack()/pivot():
(df.stack().astype(int)
   .reset_index(name='val')
   .pivot(index='level_0', columns='val', values='level_1')
)
Output:
val      1  2  3
level_0
0        A  B  D
1        D  A  B
2        C  D  A
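To match the asker's 1st/2nd/3rd headers, one more hedged step can relabel the pivoted columns (out is just a hypothetical name for the pivot result above):
out = (df.stack().astype(int)
         .reset_index(name='val')
         .pivot(index='level_0', columns='val', values='level_1'))
out = out.rename(columns={1: '1st', 2: '2nd', 3: '3rd'})
out.index.name = None    # drop the leftover 'level_0' label
out.columns.name = None  # drop the leftover 'val' label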

pandas dataframe: compare value in one column with previous value

I have a pandas dataframe to which I want to add a column (col_new), whose values depend on a comparison of values in an existing column (col_exist).
The existing column (dtype object) contains As and Bs.
The new column should count, starting at 1:
If an A follows an A, the count should rise by one.
If an A follows a B, the count should rise by one.
If a B follows an A, the count should not rise.
If a B follows a B, the count should not rise.
col_exist  col_new
        A        1
        A        2
        A        3
        B        3
        A        4
        B        4
        B        4
        A        5
        B        5
I am completely new to programming, so thank you in advance for your answer.
Use eq and cumsum:
df['col_new'] = df['col_exist'].eq('A').cumsum()
Output:
  col_exist  col_new
0         A        1
1         A        2
2         A        3
3         B        3
4         A        4
5         B        4
6         B        4
7         A        5
8         B        5
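A short sketch of why this works, assuming the df from the question: eq('A') flags exactly the rows where the count should rise, and cumsum keeps a running total of those flags.
mask = df['col_exist'].eq('A')  # True on every 'A' row, False on every 'B' row
df['col_new'] = mask.cumsum()   # running count of True values: 1, 2, 3, 3, 4, ...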

groupby + sum/mean/etc., then have the grouped values go back to the original ungrouped indexes in the original dataframe?

I am grouping by values and then merging back into the original table later. I was wondering if there was any way to avoid doing this.
Like I have a table:
a  b  v
A  A  9
A  B  3
A  A  2
B  B  4
B  B  3
I want to get:
a  b   v
A  A  11
A  B   3
A  A  11
B  B   7
B  B   7
where the new v is the summed value of the old v when grouped by a and b, without the rows collapsing to unique pairs after the grouping.
Right now I am grouping and then joining with code that looks like this:
test = df.groupby(['a', 'b'])['v'].sum()
test.name = 'new_name'
df = df.join(test, on=['a', 'b'], how='left')
Which seems a little contrived, and I was wondering if there was a way to avoid even having to join.
Try with transform:
df['v'] = df.groupby(['a', 'b'])['v'].transform('sum')
df
   a  b   v
0  A  A  11
1  A  B   3
2  A  A  11
3  B  B   7
4  B  B   7
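If you want to keep the original v as well, a hedged variant assigns the transform to a new column instead of overwriting (v_sum is a hypothetical name):
df['v_sum'] = df.groupby(['a', 'b'])['v'].transform('sum')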

Taking last values from a list in dataframe? [duplicate]

This question already has an answer here:
Pandas find last non NAN value
(1 answer)
Closed 3 years ago.
If I had an asymmetric dataframe like:
   1    2    3
0  a    b    c
1  d    e  NaN
2  f  NaN  NaN
3  g    h  NaN
or a series like:
0 [a, b, c]
1 [d, e]
2 [f]
3 [g, h]
and I required the last value from each row to create another series like:
0 c
1 e
2 f
3 h
What would be the best way about this? Thank you!
You can use ffill to propagate the last valid value across each row and then take the last column; in the example provided:
df.ffill(axis=1).iloc[:, -1]
0    c
1    e
2    f
3    h
Use DataFrame.stack, groupby and last:
df.stack().groupby(level=0).last()
0    c
1    e
2    f
3    h
dtype: object
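For the list-valued series variant in the question, positional .str indexing is one more option, since .str getitem also applies element-wise to lists; a minimal sketch:
import pandas as pd

s = pd.Series([['a', 'b', 'c'], ['d', 'e'], ['f'], ['g', 'h']])
s.str[-1]  # last element of each list -> c, e, f, h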

df.groupby() modification HELP needed

This is my table:
   A  B  C  E
0  1  1  5  4
1  1  1  1  1
2  3  3  8  2
Now, I want to group all rows by columns A and B. Column C should be summed, and for column E I want to use the value from the row where C is max.
I did the first part, grouping by A and B and summing C, with:
df = df.groupby(['A', 'B'])['C'].sum()
But at this point, I am not sure how to tell it that column E should take the value where C is max.
The end result should look like this:
   A  B  C  E
0  1  1  6  4
1  3  3  8  2
Can somebody help me with this last piece?
Thanks!
Using groupby with agg after sorting by C.
In general, if you are applying different functions to different columns, DataFrameGroupBy.agg allows you to pass a dictionary specifying which operation is applied to each column:
df.sort_values('C').groupby(['A', 'B'], sort=False).agg({'C': 'sum', 'E': 'last'})
     C  E
A B
1 1  6  4
3 3  8  2
By sorting by column C first, and not sorting as part of groupby, we can select the last value of E per group, which will align with the maximum value of C for each group.
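To match the asker's expected layout, with A and B as ordinary columns rather than the index, a reset_index can be chained on (out is a hypothetical name):
out = (df.sort_values('C')
         .groupby(['A', 'B'], sort=False)
         .agg({'C': 'sum', 'E': 'last'})
         .reset_index())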

Count the intersections between duplicates using Pandas

I have a DataFrame that looks like this:
Symbols  Count
      A      3
      A      1
      A      2
      A      4
      B      1
      B      3
      B      9
      C      2
      C      1
      C      3
What I want to do using Pandas is to identify duplicate values in the "Count" column and count the number of times the Symbols intersect with each other on those duplicates.
By this I mean: if a Count value appears for two different Symbols, those Symbols are recorded as having one intersection between them, since they share the same Count value.
Something like this:
Symbol  Symbol  Number of Intersections
     A       B                        2
     B       A                        2
     C       A                        3
.....
I'm sure there is a Pythonic Pandas way of doing this, but it's not coming to mind.
Let's use merge to do a self-merge, then query and groupby:
df_selfmerge = df.merge(df, on='Count', how='inner').query('Symbols_x != Symbols_y')

(df_selfmerge.groupby(['Symbols_x', 'Symbols_y'])['Count']
             .count()
             .reset_index()
             .rename(columns={'Symbols_x': 'Symbol',
                              'Symbols_y': 'Symbol',
                              'Count': 'Number of Intersections'}))
EDIT: Using size() is safer in case of NaN values:
(df_selfmerge.groupby(['Symbols_x', 'Symbols_y'])['Count']
             .size()
             .reset_index(name='Number of Intersections')
             .rename(columns={'Symbols_x': 'Symbol',
                              'Symbols_y': 'Symbol'}))
Output:
  Symbol Symbol  Number of Intersections
0      A      B                        2
1      A      C                        3
2      B      A                        2
3      B      C                        2
4      C      A                        3
5      C      B                        2
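If each unordered pair should appear only once, a hedged variant keeps a single direction before grouping:
(df.merge(df, on='Count')
   .query('Symbols_x < Symbols_y')
   .groupby(['Symbols_x', 'Symbols_y'])
   .size()
   .reset_index(name='Number of Intersections'))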
