I have a dataframe df which looks like this:
col1 col2 col3
A 45 4
A 3 5
B 2 5
I want to make a separate dataframe, df2, which only has the rows where col1 in df equals A. Hence it should look like:
col1 col2 col3
A 45 4
A 3 5
I just use df2 = df.loc[df['col1'] == 'A']. However, this returns the error: ValueError: Cannot index with multidimensional key. Any idea what is going wrong here?
What you tried works for me. You can try this:
df2 = df[df.col1 == 'A']
Output
col1 col2 col3
0 A 45 4
1 A 3 5
Edit
Tested on pandas version
pd.__version__
'1.2.4'
Just try:
df.query("col1=='A'")
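For completeness, a minimal, self-contained sketch (data taken from the question) showing that the boolean mask and query approaches return the same two rows:
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B'],
                   'col2': [45, 3, 2],
                   'col3': [4, 5, 5]})

# Boolean mask: keep rows where col1 equals 'A'
df2 = df[df['col1'] == 'A']

# Same result via query()
df2_query = df.query("col1 == 'A'")

print(df2)
print(df2_query)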
I have a dataframe which has data like:
col1 col2 col3
1 3 bob
2 1 alice
3 3 bob
4 3 rose
What I want to do is keep the rows where col2 is duplicated, but discard any row whose col3 value has already appeared among those duplicates. Put another way: duplicates of col2, but only where col3's values are different. So in the above example, what I would end up with is:
col1 col2 col3
1 3 bob
4 3 rose
Alice wouldn't be in the output because there's no second occurrence of col2's '1' - it isn't a duplicate. The second entry of Bob (3 3 bob) wouldn't be in the output because, while col2's '3' is a duplicate, col3's 'bob' is already in the result set (1 3 bob). (I am aware of the keep= parameter for changing whether the first or last occurrence is kept, but I'm ignoring it for simplicity.)
Any thoughts? Thank you.
Use a combination of .duplicated(), .drop_duplicates() and the loc accessor
df.loc[df[df['col2'].duplicated(False)].col3.drop_duplicates(keep='first').index,:]
col1 col2 col3
0 1 3 bob
3 4 3 rose
How it works
# Filter all rows duplicated in col2 using duplicated(False), i.e. keep=False
df[df['col2'].duplicated(False)]
# Drop duplicates in col3, retaining the first, using .drop_duplicates(keep='first')
df[df['col2'].duplicated(False)].col3.drop_duplicates(keep='first')
# Extract the index
df[df['col2'].duplicated(False)].col3.drop_duplicates(keep='first').index
# Finally filter the original frame with the loc accessor, selecting all columns
df.loc[df[df['col2'].duplicated(False)].col3.drop_duplicates(keep='first').index, :]
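A runnable version of the same steps, rebuilding the example frame from the question (variable names are illustrative):
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [3, 1, 3, 3],
                   'col3': ['bob', 'alice', 'bob', 'rose']})

# Rows whose col2 value occurs more than once
dups = df[df['col2'].duplicated(keep=False)]

# Within those rows, keep only the first occurrence of each col3 value
first_col3 = dups['col3'].drop_duplicates(keep='first')

# Slice the original frame with the surviving index labels
print(df.loc[first_col3.index, :])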
Try:
df.loc[df.drop_duplicates(['col2', 'col3'])
.duplicated(['col2'], keep=False).loc[lambda x: x].index]
Output:
col1 col2 col3
0 1 3 bob
3 4 3 rose
Details:
Inside df.loc, the indexes are found by:
first, drop_duplicates to get rid of duplicate records of col2 and col3;
then, duplicated with keep=False to return True for all records with a duplicate 'col2';
lastly, loc with a lambda to boolean-select only those True indexes.
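As a self-contained sketch on the question's data, the same chain broken into named steps (variable names are illustrative):
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [3, 1, 3, 3],
                   'col3': ['bob', 'alice', 'bob', 'rose']})

# Drop exact duplicates over (col2, col3) so each pair appears only once
deduped = df.drop_duplicates(['col2', 'col3'])

# Mark rows whose col2 still occurs more than once after the dedup
mask = deduped.duplicated(['col2'], keep=False)

# Boolean-select the True rows and slice the original frame by their index
print(df.loc[mask.loc[lambda x: x].index])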
I have a data frame like this,
df
col1 col2 col3
1    ab   4
     hn
     pr
2    ff   3
3    ty   3
     rt
4    ym   6
Now I want to create one data frame from the above: wherever both col1 and col3 are empty (''), append (concatenate) that row's col2 to the row above where both col1 and col3 values are present.
So the final data frame will look like,
df
col1 col2 col3
1 abhnpr 4
2 ff 3
3 tyrt 3
4 ym 6
I could do this using a for loop, comparing each row with the next, but the execution time would be high, so I am looking for a shortcut (a pythonic way) to do the same task more efficiently.
Replace the empty strings with missing values and forward fill them, then aggregate col2 with a string join via GroupBy.agg, and finally restore the original column order with DataFrame.reindex:
import numpy as np

c = ['col1','col3']
df[c] = df[c].replace('', np.nan).ffill()
df = df.groupby(c)['col2'].agg(''.join).reset_index().reindex(df.columns, axis=1)
print (df)
col1 col2 col3
0 1 abhnpr 4
1 2 ff 3
2 3 tyrt 3
3 4 ym 6
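A self-contained version of the same steps (imports included; the empty strings stand in for the blank cells shown in the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['1', '', '', '2', '3', '', '4'],
                   'col2': ['ab', 'hn', 'pr', 'ff', 'ty', 'rt', 'ym'],
                   'col3': ['4', '', '', '3', '3', '', '6']})

c = ['col1', 'col3']
# Empty strings become NaN so forward fill can propagate the values above
df[c] = df[c].replace('', np.nan).ffill()

# Join the col2 pieces per (col1, col3) group and restore the column order
df = df.groupby(c)['col2'].agg(''.join).reset_index().reindex(df.columns, axis=1)
print(df)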
I have a dataframe like this:
df1:
col1 col2
P 1
Q 3
M 2
I have another dataframe:
df2:
col1 col2
Q 1
M 3
P 9
I want to sort the col1 of df2 based on the order of col1 of df1. So the final dataframe will look like:
df3:
col1 col2
P 9
Q 1
M 3
How to do it using pandas or any other effective method ?
You could set col1 as the index in df2 using set_index and index the dataframe with df1.col1 using .loc:
df2.set_index('col1').loc[df1.col1].reset_index()
col1 col2
0 P 9
1 Q 1
2 M 3
Or as #jpp suggests you can also use .reindex instead of .loc:
df2.set_index('col1').reindex(df1.col1).reset_index()
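Put together on the question's data, the reindex variant looks like this:
import pandas as pd

df1 = pd.DataFrame({'col1': ['P', 'Q', 'M'], 'col2': [1, 3, 2]})
df2 = pd.DataFrame({'col1': ['Q', 'M', 'P'], 'col2': [1, 3, 9]})

# Use col1 as the index, pull rows out in df1's col1 order,
# then move col1 back to a regular column
df3 = df2.set_index('col1').reindex(df1.col1).reset_index()
print(df3)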
I have a pandas data frame with 4 columns, say 'col1', 'col2', 'col3' and 'col4'. Now I want to group by col1 and col2 and take an aggregate like below:
Count(col3)/(Count(unique col4)) As result_col
How do I do this? I am using MySQL with pandas.
I have tried many things from the internet but haven't found an exact solution, which is why I am asking here. Please give a reason for any downvote so I can improve the question.
It seems you need to aggregate by size and nunique and then divide the output columns:
import pandas as pd

df = pd.DataFrame({'col1':[1,1,1],
                   'col2':[4,4,6],
                   'col3':[7,7,9],
                   'col4':[3,3,5]})
print (df)
col1 col2 col3 col4
0 1 4 7 3
1 1 4 7 3
2 1 6 9 5
df1 = df.groupby(['col1','col2']).agg({'col3':'size','col4':'nunique'})
df1['result_col'] = df1['col3'].div(df1['col4'])
print (df1)
           col4  col3  result_col
col1 col2
1    4        1     2         2.0
     6        1     1         1.0
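Continuing from the df built above, on pandas 0.25+ the same aggregation can be written with named aggregation, which gives flat, explicit column names (a sketch, not part of the original answer):
df1 = (df.groupby(['col1', 'col2'])
         .agg(col3_size=('col3', 'size'), col4_nunique=('col4', 'nunique')))
df1['result_col'] = df1['col3_size'].div(df1['col4_nunique'])
print(df1)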
I have this dataframe
col1 col2 col3
0 2 A 1
1 1 A 100
2 3 B 12
3 4 B 2
I want to select the row with the highest col1 value among all rows with A in col2, then the one among all rows with B, etc. This is the desired output:
col1 col2 col3
0 2 A 1
3 4 B 2
I know I need some kind of groupby('col2'), but I don't know what to use after that.
Is that what you want?
In [16]: df.groupby('col2').max().reset_index()
Out[16]:
col2 col1
0 A 2
1 B 4
Use groupby('col2'), then idxmax to get the index of the max col1 value within each group. Finally, use these index values to slice the original dataframe.
df.loc[df.groupby('col2').col1.idxmax()]
Notice that the index values of the original dataframe are preserved.
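A self-contained sketch on the question's data, contrasting the two answers (idxmax keeps the whole original rows, while a plain max() aggregates every column independently, so on current pandas it would also report the per-group maximum of col3):
import pandas as pd

df = pd.DataFrame({'col1': [2, 1, 3, 4],
                   'col2': ['A', 'A', 'B', 'B'],
                   'col3': [1, 100, 12, 2]})

# Row with the largest col1 per col2 group; original index and col3 preserved
print(df.loc[df.groupby('col2').col1.idxmax()])

# Aggregating with max() instead takes the maximum of each column per group
print(df.groupby('col2').max().reset_index())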