Suppose I have a data frame with 3 columns: A, B, C. I want to group by column A and, for each unique A, find the row with the maximum entry in C, so that I can store that row's A, B, and C values in a dictionary elsewhere.
What's the best way to do this without using iterrows?
# generate sample data
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, (10, 3)))
df.columns = ['A', 'B', 'C']

# sort by C, group by A, take last row of each group
df.sort_values('C').groupby('A').nth(-1)
Here's another method. If df is the DataFrame, you can write df.groupby('A').apply(lambda d: d.loc[d['C'].idxmax()]).
For example,
In [96]: df
Out[96]:
A B C
0 1 0 3
1 3 0 4
2 0 4 5
3 2 4 0
4 3 1 1
5 1 6 2
6 3 6 0
7 4 0 1
8 2 3 4
9 0 5 0
10 7 6 5
11 3 1 2
In [97]: g = df.groupby('A').apply(lambda d: d['C'].idxmax())
In [98]: g
Out[98]:
A
0 2
1 0
2 8
3 1
4 7
7 10
dtype: int64
In [99]: df.loc[g.values]
Out[99]:
A B C
2 0 4 5
0 1 0 3
8 2 3 4
1 3 0 4
7 4 0 1
10 7 6 5
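On current pandas, where Series.argmax is positional and .ix is gone, an equivalent approach uses idxmax with .loc directly, with no apply at all. A minimal sketch (the sample data here is random, not the exact frame above):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 10, (10, 3)), columns=['A', 'B', 'C'])

# idxmax gives the row label of the C-maximum within each A group;
# .loc then selects those rows in one shot.
best = df.loc[df.groupby('A')['C'].idxmax()]
```

This yields one row per unique A, each carrying that group's maximal C.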
I have a DataFrame like this:
a = ['a','b','c','a','b','c','a','b','c']
b = [0,1,2,3,4,5,6,7,8]
df = pd.DataFrame({'key':a,'values':b})
key values
0 a 0
1 b 1
2 c 2
3 a 3
4 b 4
5 c 5
6 a 6
7 b 7
8 c 8
I want to move the values in the "values" column to new columns where they have the same "key".
So result:
key values0 values1 values2
0 a 0 3 6
1 b 1 4 7
2 c 2 5 8
Following the question "How can I pivot a dataframe?", I've tried:
a = df.pivot_table(index='key', values='values', aggfunc=list).squeeze()
pd.DataFrame(a.tolist(), index=a.index)
Which gives
0 1 2
key
a 0 3 6
b 1 4 7
c 2 5 8
But I don't want the index to be 'key', I want the index to stay the same.
You can use reset_index.
a = df.pivot_table(index='key',values='values',aggfunc=list).squeeze()
out = pd.DataFrame(a.tolist(),index=a.index).add_prefix('values').reset_index()
print(out)
# Output
key values0 values1 values2
0 a 0 3 6
1 b 1 4 7
2 c 2 5 8
Another way to do it:
out = (df.pivot_table('values', 'key', df.index // 3)
         .add_prefix('values').reset_index())
print(out)
# Output
key values0 values1 values2
0 a 0 3 6
1 b 1 4 7
2 c 2 5 8
df["id"] = df.groupby("key").cumcount()
df.pivot(columns="id", index="key").reset_index()
# key values
# id 0 1 2
# 0 a 0 3 6
# 1 b 1 4 7
# 2 c 2 5 8
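The pivot in this last variant leaves a two-level column index ('values' over the id numbers). If you want the flat values0/values1/values2 names from the desired output, one way (a sketch, simply concatenating the two levels of each column tuple) is:

```python
import pandas as pd

a = ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c']
b = [0, 1, 2, 3, 4, 5, 6, 7, 8]
df = pd.DataFrame({'key': a, 'values': b})

df['id'] = df.groupby('key').cumcount()
out = df.pivot(columns='id', index='key').reset_index()

# Collapse the two-level columns: ('key', '') -> 'key',
# ('values', 0) -> 'values0', and so on.
out.columns = [f'{name}{i}' for name, i in out.columns]
```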
I would like to obtain the 'Value' column below, from the original df:
A B C Column_To_Use
0 2 3 4 A
1 5 6 7 C
2 8 0 9 B
A B C Column_To_Use Value
0 2 3 4 A 2
1 5 6 7 C 7
2 8 0 9 B 0
Use DataFrame.lookup (available in pandas versions before 2.0):
df['Value'] = df.lookup(df.index, df['Column_To_Use'])
print (df)
A B C Column_To_Use Value
0 2 3 4 A 2
1 5 6 7 C 7
2 8 0 9 B 0
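DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer versions the same result can be produced with plain NumPy fancy indexing; a sketch using the sample frame from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 5, 8], 'B': [3, 6, 0], 'C': [4, 7, 9],
                   'Column_To_Use': ['A', 'C', 'B']})

# Map each row's chosen column name to its position among the value
# columns, then index the underlying array by (row, column) pairs.
value_cols = ['A', 'B', 'C']
col_pos = pd.Index(value_cols).get_indexer(df['Column_To_Use'])
df['Value'] = df[value_cols].to_numpy()[np.arange(len(df)), col_pos]
```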
I have a data frame like this:
df1 = pd.DataFrame({'a': [1,2],
'b': [3,4],
'c': [6,5]})
df1
Out[150]:
a b c
0 1 3 6
1 2 4 5
Now I want to create a df that repeats each row based on the difference between columns c and b, plus 1. The difference for the first row is 6 - 3 = 3, so I want that row repeated 3 + 1 = 4 times; for the second row the difference is 5 - 4 = 1, so I want it repeated 1 + 1 = 2 times. A column d is also added that counts from b up to c within each original row (so for the first row it runs 3 -> 6). So I want to get this df:
a b c d
0 1 3 6 3
0 1 3 6 4
0 1 3 6 5
0 1 3 6 6
1 2 4 5 4
1 2 4 5 5
Do it with reindex + repeat, then assign the new column d using groupby cumcount:
df1.reindex(df1.index.repeat(df1.eval('c - b').add(1))) \
   .assign(d=lambda x: x.c - x.groupby('a').cumcount(ascending=False))
Out[572]:
a b c d
0 1 3 6 3
0 1 3 6 4
0 1 3 6 5
0 1 3 6 6
1 2 4 5 4
1 2 4 5 5
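If the cumcount arithmetic feels indirect, the d column can also be built explicitly with NumPy, since it is just the run b..c for each original row. A sketch:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [6, 5]})

# Repeat each row c - b + 1 times, then concatenate one arange(b, c + 1)
# per original row to form d.
reps = df1['c'] - df1['b'] + 1
out = df1.loc[df1.index.repeat(reps)].copy()
out['d'] = np.concatenate([np.arange(b, c + 1)
                           for b, c in zip(df1['b'], df1['c'])])
```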
I would like to merge two DataFrames while creating a multilevel column naming scheme denoting which dataframe the rows came from. For example:
In [98]: A=pd.DataFrame(np.arange(9.).reshape(3,3),columns=list('abc'))
In [99]: A
Out[99]:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
In [100]: B=A.copy()
If I use pd.merge(), then I get
In [104]: pd.merge(A,B,left_index=True,right_index=True)
Out[104]:
a_x b_x c_x a_y b_y c_y
0 0 1 2 0 1 2
1 3 4 5 3 4 5
2 6 7 8 6 7 8
Which is what I expect with that statement, what I would like (but I don't know how to get!) is:
In [104]: <<one or more statements>>
Out[104]:
A B
a b c a b c
0 0 1 2 0 1 2
1 3 4 5 3 4 5
2 6 7 8 6 7 8
Can this be done without changing the original pd.DataFrame calls? I am reading the data in the DataFrames from .csv files, which might be my problem.
The first form (passing a dict) may order A and B arbitrarily (just the order of A and B, not the columns); the second (passing a list with keys) preserves the ordering. IMHO this is pandonic!
In [5]: pd.concat(dict(A=A, B=B), axis=1)
Out[5]:
A B
a b c a b c
0 0 1 2 0 1 2
1 3 4 5 3 4 5
2 6 7 8 6 7 8
In [6]: pd.concat([A, B], keys=['A', 'B'], axis=1)
Out[6]:
A B
a b c a b c
0 0 1 2 0 1 2
1 3 4 5 3 4 5
2 6 7 8 6 7 8
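As a side note, concat also accepts a names argument if you want the new column level itself to carry a name (here 'source' is an arbitrary label I made up; the second entry leaves the original column level unnamed):

```python
import numpy as np
import pandas as pd

A = pd.DataFrame(np.arange(9.).reshape(3, 3), columns=list('abc'))
B = A.copy()

# names labels the levels of the resulting MultiIndex columns.
out = pd.concat([A, B], keys=['A', 'B'], axis=1, names=['source', None])
```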
Here's one way, which does change A and B:
In [10]: from itertools import cycle
In [11]: A.columns = pd.MultiIndex.from_tuples(zip(cycle('A'), A.columns))
In [12]: A
Out[12]:
A
a b c
0 0 1 2
1 3 4 5
2 6 7 8
In [13]: B.columns = pd.MultiIndex.from_tuples(zip(cycle('B'), B.columns))
In [14]: A.join(B)
Out[14]:
A B
a b c a b c
0 0 1 2 0 1 2
1 3 4 5 3 4 5
2 6 7 8 6 7 8
I actually think this would be a good alternative behaviour, rather than suffixes...
I have the following python pandas data frame:
df = pd.DataFrame({
    'A': [1,1,1,1,2,2,2,3,3,4,4,4],
    'B': [5,5,6,7,5,6,6,7,7,6,7,7],
    'C': [1,1,1,1,1,1,1,1,1,1,1,1]
})
df
A B C
0 1 5 1
1 1 5 1
2 1 6 1
3 1 7 1
4 2 5 1
5 2 6 1
6 2 6 1
7 3 7 1
8 3 7 1
9 4 6 1
10 4 7 1
11 4 7 1
I would like another column storing the sum of the C values over rows with the same (A, B) pair. That is, something like:
A B C D
0 1 5 1 2
1 1 5 1 2
2 1 6 1 1
3 1 7 1 1
4 2 5 1 1
5 2 6 1 2
6 2 6 1 2
7 3 7 1 2
8 3 7 1 2
9 4 6 1 1
10 4 7 1 2
11 4 7 1 2
I have tried with pandas groupby and it kind of works:
res = {}
for a, group_by_A in df.groupby('A'):
    group_by_B = group_by_A.groupby('B', as_index=False)
    res[a] = group_by_B['C'].sum()
but I don't know how to get the results from res back into df in an orderly fashion. I'd be very happy with any advice on this. Thank you.
Here's one way (it feels like this should work in one go with an apply, but I can't get it to).
In [11]: g = df.groupby(['A', 'B'])
In [12]: df1 = df.set_index(['A', 'B'])
The groupby size function is the one you want; we have to match it against 'A' and 'B' as the index:
In [13]: df1['D'] = g.size() # unfortunately this doesn't play nice with as_index=False
# Same would work with g['C'].sum()
In [14]: df1.reset_index()
Out[14]:
A B C D
0 1 5 1 2
1 1 5 1 2
2 1 6 1 1
3 1 7 1 1
4 2 5 1 1
5 2 6 1 2
6 2 6 1 2
7 3 7 1 2
8 3 7 1 2
9 4 6 1 1
10 4 7 1 2
11 4 7 1 2
You could also do a one liner using transform applied to the groupby:
df['D'] = df.groupby(['A','B'])['C'].transform('sum')
You could also do a one liner using merge as follows:
df = df.merge(pd.DataFrame({'D':df.groupby(['A', 'B'])['C'].size()}), left_on=['A', 'B'], right_index=True)
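Since C is all ones in this example, summing C coincides with counting rows. If what you actually want is the count of rows per (A, B) pair, transform('size') gives the same D without depending on C's values at all:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
    'B': [5, 5, 6, 7, 5, 6, 6, 7, 7, 6, 7, 7],
    'C': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
})

# Broadcast the per-(A, B) group size back onto every member row.
df['D'] = df.groupby(['A', 'B'])['C'].transform('size')
```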
You can use this method:
columns = ['col1','col2',...]
df.groupby('col')[columns].sum()
If you want, you can also chain .sort_values(by='colx', ascending=True/False) after .sum() to sort the final output by a specific column (colx) in ascending or descending order.
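For instance, with hypothetical columns col, col1, and col2 (names made up for illustration), sorting the grouped sums descending by col1 looks like:

```python
import pandas as pd

df = pd.DataFrame({'col':  ['x', 'x', 'y'],
                   'col1': [1, 2, 5],
                   'col2': [4, 5, 6]})

# Sum both columns per group, then order the result by col1, descending.
out = (df.groupby('col')[['col1', 'col2']]
         .sum()
         .sort_values(by='col1', ascending=False))
```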