I need help with formatting my tables. This is a simpler version and I will explain it with an example. If I have a table as follows:
Col1 Col2
A 8
B 2
C 3
A 4
B 5
C 6
A 7
B 1
C 9
I want the rows arranged so that the account holding the highest Col2 value comes first. In this case that is 9, from account C, so all account C rows come first, sorted by Col2 in descending order. The next-highest maximum belongs to account A, so all account A rows follow, again sorted by Col2 in descending order.
The final table should look something like this:
Col1 Col2
C 9
C 6
C 3
A 8
A 7
A 4
B 5
B 2
B 1
What would be the best way to do this? Any ideas?
You may need to create a helper key with groupby + transform, and then pass it to sort_values:
df['helperkey']=df.groupby('Col1').Col2.transform('max')
df.sort_values(['helperkey','Col2'],ascending=[False,False]).drop(columns='helperkey')
Out[102]:
Col1 Col2
8 C 9
5 C 6
2 C 3
0 A 8
6 A 7
3 A 4
4 B 5
1 B 2
7 B 1
There may be a better way, but you could figure out the order, set column Col1 to be an ordered categorical, and sort by Col1 and Col2, in ascending and descending order respectively:
order = df.groupby('Col1').max().sort_values('Col2', ascending=False).index
df['Col1'] = pd.Categorical(df['Col1'], categories=order, ordered=True)
df.sort_values(['Col1', 'Col2'], ascending=[True,False])
Col1 Col2
8 C 9
5 C 6
2 C 3
0 A 8
6 A 7
3 A 4
4 B 5
1 B 2
7 B 1
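As a self-contained sketch of the helper-key idea above, the key can be attached with assign so that the original frame is left unmodified. The column name _grp_max is hypothetical, and using Col1 as a tie-breaker is an assumption added here so that two groups sharing the same maximum do not interleave:

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'Col2': [8, 2, 3, 4, 5, 6, 7, 1, 9],
})

result = (
    # temporary key: the maximum Col2 of the group each row belongs to
    df.assign(_grp_max=df.groupby('Col1')['Col2'].transform('max'))
      # groups by descending maximum, then rows by descending Col2;
      # Col1 in between keeps groups with equal maxima apart (assumption)
      .sort_values(['_grp_max', 'Col1', 'Col2'], ascending=[False, True, False])
      .drop(columns='_grp_max')
)
print(result)
```

Because assign returns a new frame, df keeps its original two columns and row order.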
I am creating a tool to automate some tasks. These tasks generate two DataFrames, but when concatenating them the columns are messed up as follows:
col2 col4 col3 col1
0 A 2 0 a
1 A 1 1 B
2 B 9 9 c
3 NaN 8 4 D
4 D 7 2 e
5 C 4 3 F
But I need to rearrange them so that they look like this:
col1 col2 col3 col4
0 a A 0 2
1 B A 1 1
2 c B 9 9
3 D NaN 4 8
4 e D 2 7
5 F C 3 4
Can someone help me?
I tried with sort_values, but it didn't work, and I can't find anywhere another way to try to solve the problem.
Use the following code:
df.sort_index(axis=1)
You can do either of the following:
df = df[sorted(df.columns.tolist())].copy()  # alphabetical column order
df = df[['col1', 'col2', 'col3', 'col4']]  # explicit column order
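A minimal sketch of why sort_values did not help here: it orders rows by their values, whereas sort_index(axis=1) orders the column labels. The small frame below is a cut-down stand-in for the one in the question, and reindex(columns=...) is shown as one more way to state the order explicitly:

```python
import pandas as pd

# stand-in for the concatenated frame with shuffled columns
df = pd.DataFrame({
    'col2': ['A', 'A', 'B'],
    'col4': [2, 1, 9],
    'col3': [0, 1, 9],
    'col1': ['a', 'B', 'c'],
})

# sort the column labels; returns a new frame, df itself is unchanged
alphabetical = df.sort_index(axis=1)

# spell out the desired order explicitly
explicit = df.reindex(columns=['col1', 'col2', 'col3', 'col4'])

print(alphabetical.columns.tolist())
print(explicit.columns.tolist())
```

Label sorting is lexicographic, so a column named col10 would land before col2; with names like that, the explicit list is the safer choice.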
I want to replace all rows that have "A" in the name column
with a single row from another DataFrame.
I have this:
data={"col1":[2,3,4,5,7],
"col2":[4,2,4,6,4],
"col3":[7,6,9,11,2],
"col4":[14,11,22,8,5],
"name":["A","A","V","A","B"],
"n_roll":[8,2,1,3,9]}
df=pd.DataFrame.from_dict(data)
df
This is my single row (the other DataFrame):
data2={"col1":[0]
,"col2":[1]
,"col3":[5]
,"col4":[6]
}
df2=pd.DataFrame.from_dict(data2)
df2
This is how I want the result to look:
data={"col1":[0,0,4,0,7],
"col2":[1,1,4,1,4],
"col3":[5,5,9,5,2],
"col4":[6,6,22,6,5],
"name":["A","A","V","A","B"],
"n_roll":[8,2,1,3,9]}
df=pd.DataFrame.from_dict(data)
df
I tried df.loc[df["name"]=="A"][df2.columns]=df2
but it did not work.
We can try mask + combine_first
df = df.mask(df['name'].eq('A'), df2.loc[0], axis=1).combine_first(df)
df
col1 col2 col3 col4 name n_roll
0 0 1 5 6 A 8.0
1 0 1 5 6 A 2.0
2 4 4 9 22 V 1.0
3 0 1 5 6 A 3.0
4 7 4 2 5 B 9.0
df.loc[df["name"]=="A"][df2.columns]=df2 is chained indexing: the assignment is applied to a temporary object rather than to df, so it is not expected to work. For details, see the doc.
You can also use boolean indexing like this:
df.loc[df['name']=='A', df2.columns] = df2.values
Output:
col1 col2 col3 col4 name n_roll
0 0 1 5 6 A 8
1 0 1 5 6 A 2
2 4 4 9 22 V 1
3 0 1 5 6 A 3
4 7 4 2 5 B 9
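The boolean-indexing answer can be tried end to end with the data from the question. This is a sketch rather than the only way to do it: the variable name mask is introduced here for readability, and to_numpy() is used in place of .values:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [2, 3, 4, 5, 7],
    'col2': [4, 2, 4, 6, 4],
    'col3': [7, 6, 9, 11, 2],
    'col4': [14, 11, 22, 8, 5],
    'name': ['A', 'A', 'V', 'A', 'B'],
    'n_roll': [8, 2, 1, 3, 9],
})
df2 = pd.DataFrame({'col1': [0], 'col2': [1], 'col3': [5], 'col4': [6]})

# rows to overwrite
mask = df['name'].eq('A')

# one .loc call selects rows and columns together, so the assignment
# reaches df itself; the (1, 4) array is broadcast over the matching rows
df.loc[mask, df2.columns] = df2.to_numpy()
print(df)
```

Passing the raw array instead of df2 itself sidesteps label alignment against df2's single row label 0.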
I have a data frame like this,
df
col1 col2
1 A
2 A
3 B
4 C
5 C
6 C
7 B
8 B
9 A
Here A, B and C each occur in consecutive runs. I want to keep the value only in the first row of each run; the other rows of the same run should become NaN.
The final data frame I am looking for will look like,
df
col1 col2
1 A
2 NA
3 B
4 C
5 NA
6 NA
7 B
8 NA
9 A
I can do it with a for loop and comparisons, but that will be slow. I am looking for a pythonic way to do it, maybe some pandas shortcut.
Compare each value with the previous one using Series.shift, then set the repeats to missing values with Series.where or numpy.where:
df['col2'] = df['col2'].where(df['col2'].ne(df['col2'].shift()))
#alternative
#df['col2'] = np.where(df['col2'].ne(df['col2'].shift()), df['col2'], np.nan)
Or by DataFrame.loc with inverted condition by ~:
df.loc[~df['col2'].ne(df['col2'].shift()), 'col2'] = np.nan
Or, thanks to Daniel Mesejo, use eq (equivalent to ==) instead of the inverted ne:
df.loc[df['col2'].eq(df['col2'].shift()), 'col2'] = np.nan
print (df)
col1 col2
0 1 A
1 2 NaN
2 3 B
3 4 C
4 5 NaN
5 6 NaN
6 7 B
7 8 NaN
8 9 A
Detail:
print (df['col2'].ne(df['col2'].shift()))
0 True
1 False
2 True
3 True
4 False
5 False
6 True
7 False
8 True
Name: col2, dtype: bool
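The same comparison can be run as a standalone sketch on the data from the question. The names is_start and run_id are illustrative, and the cumulative sum is an extra step not in the answer, included only to show how each run can be numbered from the same boolean mask:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': range(1, 10),
    'col2': list('AABCCCBBA'),
})

# True where the value differs from the row above, i.e. a run starts
is_start = df['col2'].ne(df['col2'].shift())

# running count of run starts gives every row the number of its run
run_id = is_start.cumsum()

# keep the value at run starts, NaN elsewhere
masked = df['col2'].where(is_start)

print(pd.DataFrame({'col2': masked, 'run_id': run_id}))
```

The first row always counts as a start, because shift() puts a missing value above it and ne treats that as different.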
Let's say I have a dataframe (I'll just use a simple example) that looks like this:
import pandas as pd
df = {'Col1':[3,4,2,6,5,7,3,4,9,7,1,3],
'Col2':['B','B','B','B','A','A','A','A','C','C','C','C',],
'Col3':[1,1,2,2,1,1,2,2,1,1,2,2]}
df = pd.DataFrame(df)
Which gives a dataframe like so:
Col1 Col2 Col3
0 3 B 1
1 4 B 1
2 2 B 2
3 6 B 2
4 5 A 1
5 7 A 1
6 3 A 2
7 4 A 2
8 9 C 1
9 7 C 1
10 1 C 2
11 3 C 2
What I want to do is several steps:
1) For each unique value in Col2, and for each unique value in Col3, average Col1. So a desired output would be:
Avg Col2 Col3
1 3.5 B 1
2 4 B 2
3 6 A 1
4 3.5 A 2
5 8 C 1
6 2 C 2
2) Now, for each unique value in Col3, I want the highest average and the corresponding value in Col2. So
Best Avg Col2 Col3
1 8 C 1
2 4 B 2
My attempt has been using df.groupby(['Col3','Col2'], as_index = False).agg({'Col1':'mean'}).groupby(['Col3']).agg({'Col1':'max'})
This gives me the highest average for each Col3 value, but not the corresponding Col2 label. Thank you for any help you can give!
After your first groupby, use sort_values + drop_duplicates:
g1=df.groupby(['Col3','Col2'], as_index = False).agg({'Col1':'mean'})
g1.sort_values('Col1').drop_duplicates('Col3',keep='last')
Out[569]:
Col3 Col2 Col1
4 2 B 4.0
2 1 C 8.0
Or, in case several Col2 values tie for the maximum mean within a Col3 group and you want to keep all of them:
g1[g1.Col1==g1.groupby('Col3').Col1.transform('max')]
Do the following (I modified your code slightly,
to make it a bit shorter):
df2 = df.groupby(['Col3','Col2'], as_index = False).mean()
When you print the result, for your input, you will get:
Col3 Col2 Col1
0 1 A 6.0
1 1 B 3.5
2 1 C 8.0
3 2 A 3.5
4 2 B 4.0
5 2 C 2.0
Then run:
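res = df2.loc[df2.groupby('Col3').Col1.idxmax()]
When you print the result, you will get:
Col3 Col2 Col1
2 1 C 8.0
4 2 B 4.0
As you can see:
idxmax gives the index label of the row with the maximal element (for each
group),
you can pass this result to loc, which selects by label (iloc happens to work here too, but only because df2 has the default integer index, where labels and positions coincide).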
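Put together as one runnable sketch with the question's data, the two steps look roughly like this. The names means, best and ties are illustrative; ties shows the transform('max') variant that keeps every row tied for the top mean:

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': [3, 4, 2, 6, 5, 7, 3, 4, 9, 7, 1, 3],
    'Col2': ['B', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'C', 'C', 'C', 'C'],
    'Col3': [1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2],
})

# step 1: mean of Col1 for every (Col3, Col2) pair
means = df.groupby(['Col3', 'Col2'], as_index=False)['Col1'].mean()

# step 2: idxmax returns one index label per Col3 group; loc selects by label
best = means.loc[means.groupby('Col3')['Col1'].idxmax()]

# variant: keep all rows whose mean equals the group maximum
ties = means[means['Col1'] == means.groupby('Col3')['Col1'].transform('max')]

print(best)
```

With this data there are no ties, so best and ties contain the same two rows.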
Given a dataframe df like this:
Col1 Col2
Key
A 4 10
B 7 10
C 3 9
My desired data frame is
A B C
Col1 4 7 3
Col2 10 10 9
Where Col1 and Col2 are the indices.
How would I specify this? I've tried:
In [419]: mydf.T.reset_index(drop=True)
Out[419]:
Key A B C
0 4 7 3
1 10 10 9
But for some reason, the Key remains. I'm not sure what it is, and I'm not sure how to get rid of it. I've also tried mydf.T.reset_index().set_index('index') but it is very unsightly.
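"Key" is the name of the original index; after transposing it becomes the name of the columns axis, which reset_index does not touch. We can use DataFrame.rename_axis() to clear it: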
In [24]: df.T.rename_axis(None, axis=1)
Out[24]:
A B C
Col1 4 7 3
Col2 10 10 9
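A small sketch that rebuilds the frame from the question (assuming "Key" is the name of its index, which is what the printed layout suggests) and shows where the label ends up after transposing:

```python
import pandas as pd

df = pd.DataFrame(
    {'Col1': [4, 7, 3], 'Col2': [10, 10, 9]},
    index=pd.Index(['A', 'B', 'C'], name='Key'),
)

transposed = df.T
# the index name travels with the labels and now names the columns axis
print(transposed.columns.name)

# clear the name of the columns axis; the data is untouched
out = transposed.rename_axis(None, axis=1)
print(out)
```

Setting transposed.columns.name = None has the same effect, but modifies the frame in place instead of returning a new one.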