Let's say I have a dataframe (I'll just use a simple example) that looks like this:
import pandas as pd
df = {'Col1':[3,4,2,6,5,7,3,4,9,7,1,3],
'Col2':['B','B','B','B','A','A','A','A','C','C','C','C',],
'Col3':[1,1,2,2,1,1,2,2,1,1,2,2]}
df = pd.DataFrame(df)
Which gives a dataframe like so:
Col1 Col2 Col3
0 3 B 1
1 4 B 1
2 2 B 2
3 6 B 2
4 5 A 1
5 7 A 1
6 3 A 2
7 4 A 2
8 9 C 1
9 7 C 1
10 1 C 2
11 3 C 2
What I want to do is several steps:
1) For each unique value in Col2, and for each unique value in Col3, average Col1. So a desired output would be:
Avg Col2 Col3
1 3.5 B 1
2 4 B 2
3 6 A 1
4 3.5 A 2
5 8 C 1
6 2 C 2
2) Now, for each unique value in Col3, I want the highest average and the corresponding value in Col2. So
Best Avg Col2 Col3
1 8 C 1
2 4 B 2
My attempt has been using df.groupby(['Col3','Col2'], as_index = False).agg({'Col1':'mean'}).groupby(['Col3']).agg({'Col1':'max'})
This gives me the highest average for each Col3 value, but not the corresponding Col2 label. Thank you for any help you can give!
After your first groupby, use sort_values + drop_duplicates:
g1=df.groupby(['Col3','Col2'], as_index = False).agg({'Col1':'mean'})
g1.sort_values('Col1').drop_duplicates('Col3',keep='last')
Out[569]:
Col3 Col2 Col1
4 2 B 4.0
2 1 C 8.0
Or, in case there are ties for the maximum mean:
g1[g1.Col1==g1.groupby('Col3').Col1.transform('max')]
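For reference, the two steps can also be written as one chain over the sample data from the question (a minimal sketch; it keeps every tied pair if two means share the maximum):
(df.groupby(['Col3', 'Col2'], as_index=False)['Col1'].mean()
   .pipe(lambda g: g[g['Col1'] == g.groupby('Col3')['Col1'].transform('max')]))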
Do the following (I modified your code slightly,
to make it a bit shorter):
df2 = df.groupby(['Col3','Col2'], as_index = False).mean()
When you print the result, for your input, you will get:
Col3 Col2 Col1
0 1 A 6.0
1 1 B 3.5
2 1 C 8.0
3 2 A 3.5
4 2 B 4.0
5 2 C 2.0
Then run:
res = df2.iloc[df2.groupby('Col3').Col1.idxmax()]
When you print the result, you will get:
Col3 Col2 Col1
2 1 C 8.0
4 2 B 4.0
As you can see, idxmax gives the index of the row with the maximal element for each group, and this result can be used as the argument of iloc.
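A side note: iloc happens to work here only because as_index=False left df2 with a default RangeIndex, so positions and index labels coincide. Since idxmax returns index labels, loc is the safer spelling in general. A minimal end-to-end sketch over the sample data from the question:
df2 = df.groupby(['Col3', 'Col2'], as_index=False)['Col1'].mean()
# idxmax returns the index label of each group's maximum; loc selects by label
res = df2.loc[df2.groupby('Col3')['Col1'].idxmax()]
print(res)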
I am creating a tool to automate some tasks. These tasks generate two DataFrames, but when concatenating them the columns are messed up as follows:
col2 col4 col3 col1
0 A 2 0 a
1 A 1 1 B
2 B 9 9 c
3 NaN 8 4 D
4 D 7 2 e
5 C 4 3 F
But I need to rearrange them so that they look like this:
col1 col2 col3 col4
0 a A 0 2
1 B A 1 1
2 c B 9 9
3 D NaN 4 8
4 e D 2 7
5 F C 3 4
Can someone help me?
I tried sort_values, but it didn't work, and I can't find another way to solve the problem.
Use the following code:
df.sort_index(axis=1)
You can do:
df = df[sorted(df.columns.tolist())].copy()
Or select the columns explicitly:
df = df[['col1', 'col2', 'col3', 'col4']]
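These all produce the same column order for the frame above; a small self-contained sketch using sort_index(axis=1) and the sorted-columns selection:
import pandas as pd

df = pd.DataFrame({'col2': ['A', 'A', 'B', None, 'D', 'C'],
                   'col4': [2, 1, 9, 8, 7, 4],
                   'col3': [0, 1, 9, 4, 2, 3],
                   'col1': ['a', 'B', 'c', 'D', 'e', 'F']})

print(df.sort_index(axis=1))   # columns reordered to col1, col2, col3, col4
print(df[sorted(df.columns)])  # equivalent explicit selection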
Assume I have a data frame such as:
import pandas as pd
df = pd.DataFrame({'visitor':['A','B','C','D','E'],
'col1':[1,2,3,4,5],
'col2':[1,2,4,7,8],
'col3':[4,2,3,6,1]})
visitor  col1  col2  col3
A        1     1     4
B        2     2     2
C        3     4     3
D        4     7     6
E        5     8     1
For each row/visitor: (1) First, if there are any identical values, I would like to keep the first value in each row and replace the rest of the identical values in that row with NULL, like so:
visitor  col1  col2  col3
A        1     NULL  4
B        2     NULL  NULL
C        3     4     NULL
D        4     7     6
E        5     8     1
Then (2) keep only the rows/visitors with more than one value, like so:
Final Data Frame
visitor  col1  col2  col3
A        1     NULL  4
C        3     4     NULL
D        4     7     6
E        5     8     1
Any suggestions? Many thanks!
We can use Series.duplicated along the columns axis to identify the duplicates, then mask them using where and keep only the rows where the count of non-duplicated values is greater than 1:
s = df.set_index('visitor')
m = ~s.apply(pd.Series.duplicated, axis=1)
s.where(m)[m.sum(1).gt(1)]
col1 col2 col3
visitor
A 1 NaN 4.0
C 3 4.0 NaN
D 4 7.0 6.0
E 5 8.0 1.0
Let us try mask with pd.Series.duplicated, then dropna with thresh:
out = df.mask(df.apply(pd.Series.duplicated, axis=1)).dropna(thresh=df.shape[1] - 1)
Out[321]:
visitor col1 col2 col3
0 A 1 NaN 4.0
2 C 3 4.0 NaN
3 D 4 7.0 6.0
4 E 5 8.0 1.0
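For clarity, here is the second approach as a self-contained sketch; the key point is that thresh counts non-NaN cells per row, so with four columns thresh=3 drops any visitor left with fewer than two numeric values after masking:
import pandas as pd

df = pd.DataFrame({'visitor': ['A', 'B', 'C', 'D', 'E'],
                   'col1': [1, 2, 3, 4, 5],
                   'col2': [1, 2, 4, 7, 8],
                   'col3': [4, 2, 3, 6, 1]})

# NaN-out repeated values within each row, keeping the first occurrence
masked = df.mask(df.apply(pd.Series.duplicated, axis=1))
# a row survives only if it keeps at least df.shape[1] - 1 = 3 non-NaN cells
out = masked.dropna(thresh=df.shape[1] - 1)
print(out)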
Suppose I have the following dataframes:
df1 = pd.DataFrame({'col1':['a','b','c','d'],'col2':[1,2,3,4]})
df2 = pd.DataFrame({'col3':['a','x','a','c','b']})
I wonder how I can look up values from df1, create a new column in df2 with the corresponding col2 values, and impute 0 where there is no match. The result should look like the following:
col3 col4
0 a 1
1 x 0
2 a 1
3 c 3
4 b 2
Use Series.map with Series.fillna:
df2['col2'] = df2['col3'].map(df1.set_index('col1')['col2']).fillna(0).astype(int)
print (df2)
col3 col2
0 a 1
1 x 0
2 a 1
3 c 3
4 b 2
Or use DataFrame.merge, which is better if you need to append multiple columns:
df = df2.merge(df1.rename(columns={'col1':'col3'}), how='left').fillna(0)
print (df)
col3 col2
0 a 1.0
1 x 0.0
2 a 1.0
3 c 3.0
4 b 2.0
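To illustrate the "multiple columns" point, suppose df1 also carried a second value column (col5 below is made up purely for illustration); the same merge appends both columns and fills unmatched rows with 0:
import pandas as pd

df1 = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'],
                    'col2': [1, 2, 3, 4],
                    'col5': [10, 20, 30, 40]})   # hypothetical extra column
df2 = pd.DataFrame({'col3': ['a', 'x', 'a', 'c', 'b']})

out = df2.merge(df1.rename(columns={'col1': 'col3'}), how='left').fillna(0)
print(out)   # col2 and col5 are both appended; the unmatched 'x' row gets 0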
I want to replace all rows that have "A" in the name column with a single row from another df.
I have this:
data={"col1":[2,3,4,5,7],
"col2":[4,2,4,6,4],
"col3":[7,6,9,11,2],
"col4":[14,11,22,8,5],
"name":["A","A","V","A","B"],
"n_roll":[8,2,1,3,9]}
df=pd.DataFrame.from_dict(data)
df
This is my single row (the other df):
data2={"col1":[0]
,"col2":[1]
,"col3":[5]
,"col4":[6]
}
df2=pd.DataFrame.from_dict(data2)
df2
This is how I want it to look:
data={"col1":[0,0,4,0,7],
"col2":[1,1,4,1,4],
"col3":[5,5,9,5,2],
"col4":[6,6,22,6,5],
"name":["A","A","V","A","B"],
"n_roll":[8,2,1,3,9]}
df=pd.DataFrame.from_dict(data)
df
I tried df.loc[df["name"]=="A"][df2.columns] = df2, but it did not work.
We can try mask + combine_first
df = df.mask(df['name'].eq('A'), df2.loc[0], axis=1).combine_first(df)
df
col1 col2 col3 col4 name n_roll
0 0 1 5 6 A 8.0
1 0 1 5 6 A 2.0
2 4 4 9 22 V 1.0
3 0 1 5 6 A 3.0
4 7 4 2 5 B 9.0
df.loc[df["name"]=="A"][df2.columns]=df2 is chained indexing and is not expected to work. For details, see the pandas documentation on chained indexing (returning a view versus a copy).
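Roughly what happens with the chained version, shown as a sketch: the first indexing step can return a temporary copy, and the assignment then lands on that copy instead of on df.
tmp = df.loc[df['name'] == 'A']   # step 1: may be a copy of the selected rows
tmp[df2.columns] = df2            # step 2: modifies tmp, not the original df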
You can also use boolean indexing like this:
df.loc[df['name']=='A', df2.columns] = df2.values
Output:
col1 col2 col3 col4 name n_roll
0 0 1 5 6 A 8
1 0 1 5 6 A 2
2 4 4 9 22 V 1
3 0 1 5 6 A 3
4 7 4 2 5 B 9
I need help with formatting my tables. This is a simpler version and I will explain it with an example. If I have a table as follows:
Col1 Col2
A 8
B 2
C 3
A 4
B 5
C 6
A 7
B 1
C 9
I want the rows arranged so that the group containing the highest Col2 value comes first. In this case that is 9 from account C, so all account C rows come first, sorted by Col2 in descending order. The next highest value belongs to account A, so all account A rows follow, again sorted by Col2 in descending order.
The final table should look something like this:
Col1 Col2
C 9
C 6
C 3
A 8
A 7
A 4
B 5
B 2
B 1
What would be the best way to do this? Any ideas?
You can create a helper key for sort_values with groupby + transform:
df['helperkey'] = df.groupby('Col1').Col2.transform('max')
df.sort_values(['helperkey', 'Col2'], ascending=[False, False]).drop(columns='helperkey')
Out[102]:
Col1 Col2
8 C 9
5 C 6
2 C 3
0 A 8
6 A 7
3 A 4
4 B 5
1 B 2
7 B 1
There may be a better way, but you could figure out the order, set column Col1 to be an ordered categorical, and sort by Col1 and Col2, in ascending and descending order respectively:
order = df.groupby('Col1').max().sort_values('Col2', ascending=False).index
df['Col1'] = pd.Categorical(df['Col1'], categories=order, ordered=True)
df.sort_values(['Col1', 'Col2'], ascending=[True,False])
Col1 Col2
8 C 9
5 C 6
2 C 3
0 A 8
6 A 7
3 A 4
4 B 5
1 B 2
7 B 1
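A further variation, as a sketch assuming pandas >= 1.1 (where sort_values accepts a key callable): pre-sort by Col2, then stably sort by each group's maximum so rows within a group keep their Col2 order.
group_max = df.groupby('Col1')['Col2'].max()
out = (df.sort_values('Col2', ascending=False)
         .sort_values('Col1', ascending=False, kind='stable',
                      key=lambda s: s.map(group_max)))
print(out)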