Maximum in two columns dataframe python pandas

I have two columns in a dataframe: one contains strings (country names) and the other contains integers related to each country. How do I find which country has the biggest value using python pandas?

Setup
import pandas as pd

df = pd.DataFrame(dict(Num=[*map(int, '352741845')], Country=[*'ABCDEFGHI']))
df
Num Country
0 3 A
1 5 B
2 2 C
3 7 D
4 4 E
5 1 F
6 8 G
7 4 H
8 5 I
idxmax
df.loc[[df.Num.idxmax()]]
Num Country
6 8 G
nlargest
df.nlargest(1, columns=['Num'])
Num Country
6 8 G
sort_values and tail
df.sort_values('Num').tail(1)
Num Country
6 8 G
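To answer the question directly (the country name rather than the whole row), any of the three approaches reduces to a scalar lookup; a minimal sketch using the same df:
# idxmax gives the index label of the maximum; select Country there
df.loc[df.Num.idxmax(), 'Country']
# 'G'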

Related

In pandas groupby, use a user-defined function, apply it to multiple columns and assign the results to new pandas columns

I have the following data set:
> dt
a b group
1: 1 5 a
2: 2 6 a
3: 3 7 b
4: 4 8 b
I have the following function:
def bigSum(a, b):
    return a.min() + b.max()
I want to apply this function to the a and b columns in groupby mode (by group) and assign the result to a new column c of the data frame. My desired result is
> dt
a b group c
1: 1 5 a 7
2: 2 6 a 7
3: 3 7 b 11
4: 4 8 b 11
For instance, if I were using R data.table, I would do the following:
dt[, c := bigSum(a,b), by = group]
and it would work exactly as I expect. I am interested in whether there is something similar in pandas.
In pandas we have transform
g = df.groupby('group')
df['out'] = g.a.transform('min') + g.b.transform('max')
df
Out[282]:
a b group out
1 1 5 a 7
2 2 6 a 7
3 3 7 b 11
4 4 8 b 11
Update
df['new'] = df.groupby('group').apply(lambda x : bigSum(x['a'],x['b'])).reindex(df.group).values
df
Out[287]:
a b group out new
1 1 5 a 7 7
2 2 6 a 7 7
3 3 7 b 11 11
4 4 8 b 11 11
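A closer analogue of the data.table idiom, as a sketch, is to compute bigSum once per group and map the per-group result back through the group column:
# one scalar per group, then broadcast to the matching rows
per_group = df.groupby('group').apply(lambda g: bigSum(g['a'], g['b']))
df['c'] = df['group'].map(per_group)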

Add all column values, repeated, from one data frame to the other in pandas

Having two data frames:
df1 = pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
a b
0 1 4
1 2 5
2 3 6
df2 = pd.DataFrame({'c':[7],'d':[8]})
c d
0 7 8
The goal is to add all df2 column values to df1, repeated for every row, to create the following result. It is assumed that both data frames do not share any column names.
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
If the column names are strings, it is possible to use DataFrame.assign, unpacking the Series created by selecting the first row of df2:
df = df1.assign(**df2.iloc[0])
print (df)
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
Another idea is to repeat the values along df1.index with DataFrame.reindex and use DataFrame.join (this works here because the first index value of df2 is the same as the first value of df1.index):
df = df1.join(df2.reindex(df1.index, method='ffill'))
print (df)
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
If there are no missing values in the original df, it is possible to forward fill the missing values in a last step, but the joined columns are cast to floats (thanks @Dishin H Goyan):
df = df1.join(df2).ffill()
print (df)
a b c d
0 1 4 7.0 8.0
1 2 5 7.0 8.0
2 3 6 7.0 8.0
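On pandas 1.2+, another option (a sketch, assuming the default RangeIndex of df1) is a cross merge, which avoids the dtype change because the single row of df2 is paired with every row of df1:
df = df1.merge(df2, how='cross')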

How to sort the data with respect to the final output?

I want to group my dataframe by two columns and then sort the aggregated results within the groups.
In [167]:df
count job source
0 2 sales A
1 4 sales B
2 6 sales C
3 3 sales D
4 7 sales E
5 5 market A
6 3 market B
7 2 market C
8 4 market D
9 1 market E
df.groupby(['job','source']).agg({'count':sum})
Out[168]:
job source count
market A 5
B 3
C 2
D 4
E 1
sales A 2
B 4
C 6
D 3
E 7
I would now like to sort the count column in descending order within each of the groups, and then take only the top three rows, to get something like:
job source count
market A 5
D 4
B 3
sales E 7
C 6
B 4
I want to further sort the result by job, so that if the sum of count for sales is larger, the data is printed as
job source count
sales E 7
C 6
B 4
market A 5
D 4
B 3
I am unable to get the top rows for each job in that order.
IIUC, we can do a further groupby and use nlargest(3) to get the top n values.
Then we can create an ordered list of the groups and use it to build a categorical column for sorting.
s = df.groupby(['job','source']).agg({'count':sum}).groupby(level=0)['count']\
    .nlargest(3).reset_index(0, drop=True).to_frame()
# see which of your indices is higher and create a sorting list.
sorter = s.groupby(level=0)['count'].sum().sort_values(ascending=False).index
#Index(['sales', 'market'], dtype='object', name='job')
s['sort'] = pd.Categorical(s.index.get_level_values(0),sorter)
df2 = s.sort_values('sort').drop('sort',axis=1)
print(df2)
count
job source
sales E 7
C 6
B 4
market A 5
D 4
B 3
You could use sort_values as mentioned in another similar answer, sorting after aggregation, and then group by job again to get the top N per job, like:
>>> df
count job source
0 2 sales A
1 4 sales B
2 6 sales C
3 3 sales D
4 7 sales E
5 5 market A
6 3 market B
7 2 market C
8 4 market D
9 1 market E
>>> agg = df.groupby(['job','source']).agg({'count':sum})
>>> agg
count
job source
market A 5
B 3
C 2
D 4
E 1
sales A 2
B 4
C 6
D 3
E 7
>>> agg.reset_index().sort_values(['job', 'count'], ascending=False).set_index(['job', 'source']).groupby('job').head(3)
count
job source
sales E 7
C 6
B 4
market A 5
D 4
B 3
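Note that sorting by job in descending order only happens to match the desired output here because 'sales' sorts after 'market' alphabetically and also has the larger total. To order the groups by their total count explicitly, a sketch building on the agg frame from above:
# total count per job, largest first
totals = agg.groupby(level='job')['count'].sum().sort_values(ascending=False)
# top 3 rows per job, then reorder the groups by their totals
top3 = agg.groupby(level='job')['count'].nlargest(3).reset_index(0, drop=True).to_frame()
top3 = top3.loc[totals.index]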

Pandas jumbling up data after selecting columns

I have a (2.3m x 33) size dataframe. As I always do when selecting columns to keep, I use
colsToKeep = ['A','B','C','D','E','F','G','H','I']
df = df[colsToKeep]
However, this time the data under these columns becomes completely jumbled up when the code runs. Entries for column A might end up under column D, for example, seemingly at random.
Has anybody experienced this kind of behavior before? There is nothing out of the ordinary about the data and the df is totally fine before running these lines. Code run before problem begins:
import pandas as pd

with open('file.dat', 'r') as f:
    df = pd.DataFrame(l.rstrip().split() for l in f)
# rename columns with the first row
df.columns = df.iloc[0]
# drop the first row, which is now duplicated in the header
df = df.iloc[1:]
# remove the 33 NaN columns that appeared
df = df.loc[:, df.columns.notnull()]
colsToKeep = ['A','B','C','D','E','F','G','H','I']
df = df[colsToKeep]
Data suddenly goes from being nicely formatted such as:
A B C D E F G H I
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
to something more random like:
A B C D E F G H I
7 9 3 4 5 1 2 8 6
3 2 9 2 1 6 7 8 4
2 1 3 6 5 4 7 9 8
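One thing worth checking is the manual parsing step itself: if the header row taken from df.iloc[0] contains duplicate labels, label-based selection such as df[colsToKeep] can return columns in unexpected ways. As a hedged alternative, letting pandas parse the whitespace-delimited file directly sidesteps the manual header handling (same file name as in the question):
import pandas as pd

# the parser consumes the header row and infers dtypes itself
df = pd.read_csv('file.dat', sep=r'\s+')
colsToKeep = ['A','B','C','D','E','F','G','H','I']
df = df[colsToKeep]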

Refer to next index in pandas

If I had a simple pandas DataFrame like this:
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(1, 13).reshape((3, 4)), columns=list('abcd'), index=list('123'))
I want to find the max value in each row, use it to find the value in the next row of the same column, and add that value to a new column.
So the above DataFrame looks like this (with d2 changed to 3):
a b c d
1 1 2 3 4
2 5 6 7 3
3 9 10 11 12
So, conceptually, the first row is scanned, 4 is identified as the largest number, and then 3 is found as the number in the same column but at the next index. Similarly for row 2, 7 is the largest number, and 11 is the next number in that column. So 3 and 11 should get added to a new column like this:
a b c d Next
1 1 2 3 4 NaN
2 5 6 7 3 3
3 9 10 11 12 11
I started by making a function like this, but it only finds the max values.
f = lambda x: x.max()
row_max = frame.apply(f, axis='columns')  # only the row maxima, not the "next" values
frame['Next'] = row_max
Based on your edit, you can use np.argmax:
i = np.arange(len(frame))                     # row positions
j = np.argmax(frame.values, axis=1)           # column position of each row's max
frame['next'] = frame.shift(-1).values[i, j]  # value one row below, same column
a b c d next
1 1 2 3 4 3.0
2 5 6 7 3 11.0
3 9 10 11 12 NaN
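An equivalent, more label-based sketch of the same idea, using idxmax instead of positional argmax:
shifted = frame.shift(-1)        # next row's values; the last row becomes NaN
max_col = frame.idxmax(axis=1)   # column label of each row's maximum
frame['next'] = [shifted.at[ix, c] for ix, c in max_col.items()]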
