Insert 2 Blank Rows In DF by Group - python

I basically want the solution from this question, but with 2 blank rows inserted instead of 1:
Insert Blank Row In Python Data frame when value in column changes?
I've messed around with that solution, but I don't understand the code well enough to alter it correctly.

You can do:
num_empty_rows = 2
df = (df.groupby('Col1', as_index=False)
        .apply(lambda g: g.append(
            pd.DataFrame(data=[[''] * len(df.columns)] * num_empty_rows,
                         columns=df.columns)))
        .reset_index(drop=True)
        .iloc[:-num_empty_rows])
As you can see, a DataFrame of num_empty_rows blank rows is appended to each group, and then reset_index rebuilds a clean integer index. The final iloc[:-num_empty_rows] is optional; it removes the trailing blank rows left after the last group.
Example input:
df = pd.DataFrame({'Col1': ['A', 'A', 'A', 'B', 'C'],
                   'Col2': ['s', 's', 'b', 'b', 'l'],
                   'Col3': ['b', 'j', 'd', 'a', 'k'],
                   'Col4': ['d', 'k', 'q', 'd', 'p']})
Output:
  Col1 Col2 Col3 Col4
0    A    s    b    d
1    A    s    j    k
2    A    b    d    q
3
4
5    B    b    a    d
6
7
8    C    l    k    p
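Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. A minimal sketch of the same idea using pd.concat instead, assuming the same df and num_empty_rows as above:

import pandas as pd

num_empty_rows = 2
# One reusable block of blank rows with the same columns as df.
blank = pd.DataFrame([[''] * len(df.columns)] * num_empty_rows,
                     columns=df.columns)

# Append the blank block to each group, rebuild the index, and drop
# the trailing blanks left after the last group.
df = (df.groupby('Col1', as_index=False)
        .apply(lambda g: pd.concat([g, blank]))
        .reset_index(drop=True)
        .iloc[:-num_empty_rows])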

Related

Add column to pandas dataframe from a reversed dictionary

I have a dataframe (pandas) and a dictionary with keys and values as list. The values in lists are unique across all the keys. I want to add a new column to my dataframe based on values of the dictionary having keys in it. E.g. suppose I have a dataframe like this
import pandas as pd

df = {'a': 1, 'b': 2, 'c': 2, 'd': 4, 'e': 7}
df = pd.DataFrame.from_dict(df, orient='index', columns=['col2'])
df = df.reset_index().rename(columns={'index': 'col1'})
df
  col1  col2
0    a     1
1    b     2
2    c     2
3    d     4
4    e     7
Now I also have dictionary like this
my_dict = {'x':['a', 'c'], 'y':['b'], 'z':['d', 'e']}
I want the output like this
  col1  col2 col3
0    a     1    x
1    b     2    y
2    c     2    x
3    d     4    z
4    e     7    z
Presently I am doing this by reversing the dictionary first, like this:
my_dict_rev = {value:key for key in my_dict for value in my_dict[key]}
df['col3']= df['col1'].map(my_dict_rev)
df
But I am sure that there must be some direct method.
I know this is an old question but here are two other ways to do the same job. First convert my_dict to a Series object, then explode it. Then reverse the mapping and use map:
tmp = pd.Series(my_dict).explode()
df['col3'] = df['col1'].map(pd.Series(tmp.index, tmp))
Another option starts similarly, but uses merge instead of map:
df = df.merge(pd.Series(my_dict, name='col1')
                .explode()
                .rename_axis('col3')
                .reset_index())
Output:
  col1  col2 col3
0    a     1    x
1    b     2    y
2    c     2    x
3    d     4    z
4    e     7    z
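To see why the merge works, here is the exploded helper frame it joins against (a sketch; the variable name long_form is hypothetical):

long_form = (pd.Series(my_dict, name='col1')
               .explode()
               .rename_axis('col3')
               .reset_index())
#   col3 col1
# 0    x    a
# 1    x    c
# 2    y    b
# 3    z    d
# 4    z    e
# Merging on the shared 'col1' column attaches 'col3' to each row of df.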

Determining if a Pandas dataframe row has multiple specific values

I have a Pandas data frame represented by the one below:
  A  B  C  D
  1  1  1  3
  1  1  1  2
  2  3  4  5
I need to iterate through this data frame, looking for rows where the values in columns A, B, and C match; when they do, compare the values in column D for those rows and delete the row with the smaller value. So the example above would look like this afterwards:
  A  B  C  D
  1  1  1  3
  2  3  4  5
I've written the following code, but something isn't right and it's causing an error. It also looks more complicated than it may need to be, so I am wondering if there is a better, more concise way to write this.
for col, row in df.iterrows():
    df1 = df.copy()
    df1.drop(col, inplace=True)
    for col1, row1 in df1.iterrows():
        if df[0].iloc[col] == df1[0].iloc[col1] & df[1].iloc[col] == df1[1].iloc[col1] & \
           df[2].iloc[col] == df1[2].iloc[col1] & df1[3].iloc[col1] > df[3].iloc[col]:
            df.drop(col, inplace=True)
Here is one solution:
df[~((df[['A', 'B', 'C']].duplicated(keep=False)) &
     (df.groupby(['A', 'B', 'C'])['D'].transform(min) == df['D']))]
Explanation:
df[['A', 'B', 'C']].duplicated(keep=False)
returns a mask for rows with duplicated values of ['A', 'B', 'C'] columns
df.groupby(['A', 'B', 'C'])['D'].transform(min)==df['D']
returns a mask for rows that have the minimum value for ['D'] column, for each group of ['A', 'B', 'C']
The combination of these masks selects the rows that are both duplicated on ['A', 'B', 'C'] and hold the minimum 'D' of their group; with ~ we keep every row except those.
Result for the provided input:
   A  B  C  D
0  1  1  1  3
2  2  3  4  5
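For illustration, the intermediate masks evaluate as follows on the sample frame (a sketch; dup_mask and min_mask are hypothetical names):

dup_mask = df[['A', 'B', 'C']].duplicated(keep=False)
# [True, True, False]  -- rows 0 and 1 share the same A, B, C

min_mask = df.groupby(['A', 'B', 'C'])['D'].transform('min') == df['D']
# [False, True, True]  -- rows holding their group's minimum D

df[~(dup_mask & min_mask)]  # drops only row 1 (duplicated AND minimal)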
You can group by all the columns that must be equal (groupby(['A', 'B', 'C'])) and then, whenever a group has more than one unique record, exclude the row with the minimum value of D. The helper below returns the boolean mask of rows to retain:
def func(x):
    if len(x.unique()) != 1:
        return x != x.min()
    else:
        return x == x

df[df.groupby(['A', 'B', 'C'])['D'].apply(lambda x: func(x))]
   A  B  C  D
0  1  1  1  3
2  2  3  4  5
If only the row with the maximum group value of D has to be retained, you can use:
df[df.groupby(['A', 'B', 'C'])['D'].apply(lambda x: x == x.max())]
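An equivalent filter can be written with transform, which always returns a mask aligned to the original index (a sketch):

# Keep each group's maximum-D row without going through apply.
df[df['D'] == df.groupby(['A', 'B', 'C'])['D'].transform('max')]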

pandas drop rows that share similar value in other column

I have a DataFrame df = pd.DataFrame({'col1': ["a", "b", "c", "d", "e"], 'col2': [1, 3, 3, 2, 6]}) that looks like:
Input:
  col1  col2
0    a     1
1    b     3
2    c     3
3    d     2
4    e     6
I would like to remove rows from "col1" that share a common value in "col2". The expected output would look something like...
Output:
  col1  col2
0    a     1
3    d     2
4    e     6
What would be the process of doing this?
Using this short one-liner should do the trick:
df.drop_duplicates(subset=['col2'], keep=False)
Explanation:
We use drop_duplicates to (obviously) drop the duplicates, with subset=['col2'] so that duplicates are determined by col2 alone, as you requested. To drop all occurrences, rather than keeping, say, the first of each duplicate, we pass keep=False.
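For contrast, the default keep='first' would retain one row from each duplicate group (a sketch):

df.drop_duplicates(subset=['col2'])               # keeps row 1 ('b', 3), drops row 2
df.drop_duplicates(subset=['col2'], keep=False)   # drops both rows where col2 == 3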
This will also do the trick:
import pandas as pd
from collections import Counter

df = pd.DataFrame({'col1': ["a", "b", "c", "d", "e"], 'col2': [1, 3, 3, 2, 6]})

c = Counter(df['col2'])                    # count occurrences of each col2 value
ls = [k for k, v in c.items() if v == 1]   # values that appear exactly once
_fltr = df['col2'].isin(ls)                # mask of rows with a unique col2
df.loc[_fltr, :]
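The same filter can be written without Counter, staying inside pandas (a sketch):

counts = df['col2'].value_counts()   # 3 -> 2, 1 -> 1, 2 -> 1, 6 -> 1
df[df['col2'].map(counts) == 1]      # keep rows whose col2 occurs exactly once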

how to groupby and join multiple rows from multiple columns at a time?

I want to know how to groupby a single column and join multiple column strings each row.
Here's an example dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['a', 'a', 'b', 'b'],
                            [1, 1, 2, 2],
                            ['k', 'l', 'm', 'n']]).T,
                  columns=['a', 'b', 'c'])
print(df)
   a  b  c
0  a  1  k
1  a  1  l
2  b  2  m
3  b  2  n
I've tried something like,
df.groupby(['b', 'a'])['c'].apply(','.join).reset_index()
   b  a    c
0  1  a  k,l
1  2  b  m,n
But that is not my required output.
Desired output:
   a    b    c
0  1  a,a  k,l
1  2  b,b  m,n
How can I achieve this? I need a scalable solution because I'm dealing with millions of rows.
I think you need to group by column b only and then, if necessary, pass the list of columns to aggregate to GroupBy.agg:
df1 = df.groupby('b')[['a', 'c']].agg(','.join).reset_index()
# alternative if you want to join all columns except b
# df1 = df.groupby('b').agg(','.join).reset_index()
print(df1)
   b    a    c
0  1  a,a  k,l
1  2  b,b  m,n
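One caveat: ','.join only works on string columns. In this example every column is already a string (the np.array constructor casts the integers), but with mixed dtypes you would cast first (a sketch):

# Cast everything to str so ','.join works on formerly numeric columns too.
df1 = df.astype(str).groupby('b')[['a', 'c']].agg(','.join).reset_index()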

Insert Blank Row In Python Data frame when value in column changes?

I have a dataframe and I'd like to insert a blank row as a separator whenever the value in the first column changes.
For example:
Column 1 Col2 Col3 Col4
A        s    b    d
A        s    j    k
A        b    d    q
B        b    a    d
C        l    k    p
becomes:
Column 1 Col2 Col3 Col4
A        s    b    d
A        s    j    k
A        b    d    q

B        b    a    d

C        l    k    p
because the value in Column 1 changed
The only way that I figured out how to do this is using VBA as indicated by the correctly marked answer here:
How to automatically insert a blank row after a group of data
But I need to do this in Python.
Any help would be really appreciated!
Create a helper DataFrame indexed at the positions of the last row of each group plus .5, join it to the original with concat, sort by index with sort_index, rebuild a default index with reset_index, and finally remove the trailing blank row with iloc[:-1]:
mask = df['Column 1'].ne(df['Column 1'].shift(-1))
df1 = pd.DataFrame('', index=mask.index[mask] + .5, columns=df.columns)
df = pd.concat([df, df1]).sort_index().reset_index(drop=True).iloc[:-1]
print(df)
  Column 1 Col2 Col3 Col4
0        A    s    b    d
1        A    s    j    k
2        A    b    d    q
3
4        B    b    a    d
5
6        C    l    k    p
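To make the half-index trick concrete, here is what the intermediate values look like on the sample frame (a sketch):

mask = df['Column 1'].ne(df['Column 1'].shift(-1))
# mask: [False, False, True, True, True] -- True on the last row of each group,
# so the blank frame df1 gets fractional index [2.5, 3.5, 4.5]. After
# sort_index the blanks land between the groups, and iloc[:-1] drops the
# one that follows the final row.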
