I have a data frame like this,
df

col1 col2 col3
   1   ab    4
       hn
       pr
   2   ff    3
   3   ty    3
       rt
   4   ym    6
Now I want to build one data frame from the above: whenever both col1 and col3 are empty (''), concatenate that row's col2 onto the nearest row above where both col1 and col3 are present.
So the final data frame will look like,
df

col1   col2 col3
   1 abhnpr    4
   2     ff    3
   3   tyrt    3
   4     ym    6
I could do this with a for loop, comparing each row with the one above it, but the execution time would be long, so I am looking for a shortcut (pythonic way) to do the same task efficiently.
Replace the empty strings with missing values and forward fill them, then aggregate col2 with a join in GroupBy.agg, and finally restore the column order with DataFrame.reindex:
import numpy as np

c = ['col1','col3']
df[c] = df[c].replace('', np.nan).ffill()
df = df.groupby(c)['col2'].agg(''.join).reset_index().reindex(df.columns, axis=1)
print (df)
  col1    col2 col3
0    1  abhnpr    4
1    2      ff    3
2    3    tyrt    3
3    4      ym    6
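As a self-contained check, the whole pipeline can be run by rebuilding the sample frame from the question (the empty strings mark the continuation rows):

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame({
    'col1': ['1', '', '', '2', '3', '', '4'],
    'col2': ['ab', 'hn', 'pr', 'ff', 'ty', 'rt', 'ym'],
    'col3': ['4', '', '', '3', '3', '', '6'],
})

c = ['col1', 'col3']
# Turn '' into NaN so ffill can propagate the last seen key values downward
df[c] = df[c].replace('', np.nan).ffill()
# Concatenate col2 within each (col1, col3) group, then restore the column order
out = df.groupby(c)['col2'].agg(''.join).reset_index().reindex(df.columns, axis=1)
print(out)
```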
I have a dataframe df which looks like this:
col1 col2 col3
   A   45    4
   A    3    5
   B    2    5
I want to make a separate dataframe, df2, which only has the rows where col1 in df equals 'A'. Hence it should look like:
col1 col2 col3
   A   45    4
   A    3    5
I just use df2 = df.loc[df['col1'] == 'A']. However this returns the error: ValueError: Cannot index with multidimensional key. Any idea what is going wrong here?
What you tried works for me; alternatively, you can try this:
df2 = df[df.col1 == 'A']
Output
  col1  col2  col3
0    A    45     4
1    A     3     5
Edit
Tested on pandas version
pd.__version__
'1.2.4'
Just try:
df.query("col1=='A'")
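The ValueError in the question usually means the key passed to .loc is not one-dimensional (for example a one-column DataFrame instead of a Series). A minimal sketch showing that the boolean-mask form and the query form select the same rows:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B'],
                   'col2': [45, 3, 2],
                   'col3': [4, 5, 5]})

mask = df['col1'] == 'A'          # 1-D boolean Series, one flag per row
df2 = df.loc[mask]                # boolean indexing keeps the True rows
df3 = df.query("col1 == 'A'")     # same selection, expressed as a string

print(df2.equals(df3))  # True
```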
I have a dataframe like this:
df1:
col1 col2
   P    1
   Q    3
   M    2
I have another dataframe:
df2:
col1 col2
   Q    1
   M    3
   P    9
I want to sort col1 of df2 based on the order of col1 in df1. So the final dataframe will look like:
df3:
col1 col2
   P    9
   Q    1
   M    3
How can I do this using pandas or any other efficient method?
You could set col1 as the index in df2 using set_index and then index the dataframe with df1.col1 using .loc:
df2.set_index('col1').loc[df1.col1].reset_index()
  col1  col2
0    P     9
1    Q     1
2    M     3
Or, as @jpp suggests, you can also use .reindex instead of .loc:
df2.set_index('col1').reindex(df1.col1).reset_index()
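The .reindex variant is the safer of the two: with current pandas, .loc raises a KeyError if any label from df1.col1 is missing from df2, while .reindex fills such rows with NaN. A runnable sketch using the question's data:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['P', 'Q', 'M'], 'col2': [1, 3, 2]})
df2 = pd.DataFrame({'col1': ['Q', 'M', 'P'], 'col2': [1, 3, 9]})

# Reorder df2's rows to follow the col1 order of df1
df3 = df2.set_index('col1').reindex(df1['col1']).reset_index()
print(df3)
```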
This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I have the following two dataframes, and I need to calculate the value column in df2 based on df1.
df1
col1    col2 col3 value
Chicago    M   26    54
NY         M   20    21
...
df2
col1 col2 col3 value
NY      M   20     ?  (should be 21 based on the dataframe above)
I am doing a loop like the one below, which is slow:
for index, row in df2.iterrows():
    df1[(df1['col1'] == row['col1'])
        & (df1['col2'] == row['col2'])
        & (df1['col3'] == row['col3'])]['value'].values[0]
How can I do this more efficiently?
You need merge with a left join on the columns you want to compare:
print (df2)
  col1 col2  col3 value
0   LA    M    20    20
1   NY    M    20     ?
df = pd.merge(df2, df1, on=['col1','col2','col3'], how='left', suffixes=('','_'))
This creates a new column value_ with the matched values. Then use fillna to replace the NaNs with the original values, and finally drop the helper column value_:
print (df)
  col1 col2  col3 value  value_
0   LA    M    20    20     NaN
1   NY    M    20     ?    21.0
df['value'] = df['value_'].fillna(df['value'])
df = df.drop('value_', axis=1)
print (df)
  col1 col2  col3 value
0   LA    M    20    20
1   NY    M    20    21
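The steps above can be run end to end as a self-contained sketch (the sample rows are reconstructed from the question, with None standing in for the '?' placeholder):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['Chicago', 'NY'],
                    'col2': ['M', 'M'],
                    'col3': [26, 20],
                    'value': [54, 21]})
df2 = pd.DataFrame({'col1': ['LA', 'NY'],
                    'col2': ['M', 'M'],
                    'col3': [20, 20],
                    'value': [20, None]})

# Left-join on the three key columns; df1's value arrives as 'value_'
df = df2.merge(df1, on=['col1', 'col2', 'col3'], how='left', suffixes=('', '_'))
# Prefer the looked-up value, falling back to the original where there is no match
df['value'] = df['value_'].fillna(df['value'])
df = df.drop('value_', axis=1)
print(df)
```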
I have a pandas data frame with 4 columns, say 'col1', 'col2', 'col3' and 'col4'. Now I want to group by col1 and col2 and take an aggregate like the one below:
Count(col3)/(Count(unique col4)) As result_col
How do I do this? I am using MySql with pandas.
I have tried many things from the internet but haven't found an exact solution, which is why I am posting here. Please give a reason for any downvote so I can improve my question.
It seems you need to aggregate by size and nunique and then divide the output columns:
import pandas as pd

df = pd.DataFrame({'col1':[1,1,1],
                   'col2':[4,4,6],
                   'col3':[7,7,9],
                   'col4':[3,3,5]})
print (df)
   col1  col2  col3  col4
0     1     4     7     3
1     1     4     7     3
2     1     6     9     5
df1 = df.groupby(['col1','col2']).agg({'col3':'size','col4':'nunique'})
df1['result_col'] = df1['col3'].div(df1['col4'])
print (df1)
           col4  col3  result_col
col1 col2
1    4        1     2         2.0
     6        1     1         1.0
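With pandas 0.25+, named aggregation does the same thing while letting you pick the output column names up front (the names n_rows and n_unique below are illustrative choices, not from the question):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 1],
                   'col2': [4, 4, 6],
                   'col3': [7, 7, 9],
                   'col4': [3, 3, 5]})

# Named aggregation: output_name=(input_column, aggregation_function)
df1 = (df.groupby(['col1', 'col2'])
         .agg(n_rows=('col3', 'size'), n_unique=('col4', 'nunique')))
df1['result_col'] = df1['n_rows'] / df1['n_unique']
print(df1.reset_index())
```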
I have this dataframe
   col1 col2  col3
0     2    A     1
1     1    A   100
2     3    B    12
3     4    B     2
I want to select the row with the highest col1 value among all rows with A, then the one among all rows with B, etc. This is the desired output:
   col1 col2  col3
0     2    A     1
3     4    B     2
I know I need some kind of groupby('col2'), but I don't know what to use after that.
Is that what you want?
In [16]: df.groupby('col2').max().reset_index()
Out[16]:
  col2  col1
0    A     2
1    B     4
Use groupby('col2'), then idxmax to get the index of the max value within each group. Finally, use these index values to slice the original dataframe:
df.loc[df.groupby('col2').col1.idxmax()]
Notice that the index values of the original dataframe are preserved.
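A runnable sketch of the idxmax approach on the question's data, showing that the whole matching row (including col3) is kept and the original index labels survive:

```python
import pandas as pd

df = pd.DataFrame({'col1': [2, 1, 3, 4],
                   'col2': ['A', 'A', 'B', 'B'],
                   'col3': [1, 100, 12, 2]})

# Index label of the row with the largest col1 inside each col2 group
idx = df.groupby('col2')['col1'].idxmax()
result = df.loc[idx]
print(result)
```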