How to groupby and update values in pandas?

I have a pandas DataFrame that looks similar to the following...
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'col1': ['A','C','B','A','B','C','A'],
...     'col2': [np.nan, 1., np.nan, 1., 1., np.nan, np.nan],
...     'col3': [0, 1, 9, 4, 2, 3, 5],
... })
>>> df
col1 col2 col3
0 A NaN 0
1 C 1.0 1
2 B NaN 9
3 A 1.0 4
4 B 1.0 2
5 C NaN 3
6 A NaN 5
What I would like to do is group the rows by the value of col1 and then fill any NaN values in col2 with incrementing values, starting from one above the highest existing value in that group.
So that my expected results would look like the following...
>>> df
col1 col2 col3
0 A 1.0 4
1 A 2.0 0
2 A 3.0 5
3 B 1.0 2
4 B 2.0 9
5 C 1.0 1
6 C 2.0 3
I believe I can use something like groupby on col1, though I'm unsure how to increment the values in col2 based on the highest existing value of each group. I've tried the following, but instead of incrementing the values it sets them all to 1.0 and adds an additional column...
>>> df1 = df.groupby(['col1'], as_index=False).agg({'col2': 'min'})
>>> df = pd.merge(df1, df, how='left', left_on=['col1'], right_on=['col1'])
>>> df
col1 col2_x col2_y col3
0 A 1.0 NaN 0
1 A 1.0 1.0 4
2 A 1.0 NaN 5
3 B 1.0 NaN 9
4 B 1.0 1.0 2
5 C 1.0 1.0 1
6 C 1.0 NaN 3

Use GroupBy.cumcount only for the rows with missing values, add the per-group maximum obtained with GroupBy.transform and max, and finally replace the remaining positions with the original values using fillna:
df = pd.DataFrame({
    'col1': ['A','C','B','A','B','B','B'],
    'col2': [np.nan, 1., np.nan, 1., 3., np.nan, 0],
    'col3': [0, 1, 9, 4, 2, 3, 4],
})
print (df)
col1 col2 col3
0 A NaN 0
1 C 1.0 1
2 B NaN 9
3 A 1.0 4
4 B 3.0 2
5 B NaN 3
6 B 0.0 4
df = df.sort_values(['col1','col2'], na_position='last')
# per-group maximum of the existing values
s = df.groupby('col1')['col2'].transform('max')
# number the NaN rows within each group (1, 2, ...), add the group
# maximum, then restore the original non-NaN values
df['new'] = (df[df['col2'].isna()]
                .groupby('col1')
                .cumcount()
                .add(1)
                .add(s)
                .fillna(df['col2'])
                .astype(int))
print (df)
col1 col2 col3 new
3 A 1.0 4 1
0 A NaN 0 2
6 B 0.0 4 0
4 B 3.0 2 3
2 B NaN 9 4
5 B NaN 3 5
1 C 1.0 1 1

Another way, filling each group's NaN values with max + 1, max + 2, ... in order of appearance:
df['col2_new'] = df.groupby('col1')['col2'].transform(lambda s: s.fillna(s.max() + s.isna().cumsum()))
df = df.sort_values(['col1', 'col2'])
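Applied to the original frame from the question, the same cumcount recipe reproduces the expected output; a minimal self-contained sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['A','C','B','A','B','C','A'],
    'col2': [np.nan, 1., np.nan, 1., 1., np.nan, np.nan],
    'col3': [0, 1, 9, 4, 2, 3, 5],
})

df = df.sort_values(['col1', 'col2'], na_position='last')
# per-group maximum of the existing values
s = df.groupby('col1')['col2'].transform('max')
# number the NaN rows per group, offset by the group maximum,
# and keep the original non-NaN values everywhere else
df['col2'] = (df[df['col2'].isna()]
                 .groupby('col1')
                 .cumcount()
                 .add(1)
                 .add(s)
                 .fillna(df['col2']))
df = df.reset_index(drop=True)
print(df)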

Related

How to replace column values based on other columns in pandas?

Assume I have a data frame such as
import pandas as pd
df = pd.DataFrame({'visitor': ['A','B','C','D','E'],
                   'col1': [1,2,3,4,5],
                   'col2': [1,2,4,7,8],
                   'col3': [4,2,3,6,1]})
visitor  col1  col2  col3
A        1     1     4
B        2     2     2
C        3     4     3
D        4     7     6
E        5     8     1
For each row/visitor, (1) first, if there are any identical values in a row, I would like to keep the first one and replace the rest of the identical values in that row with NULL, such as
visitor  col1  col2  col3
A        1     NULL  4
B        2     NULL  NULL
C        3     4     NULL
D        4     7     6
E        5     8     1
Then (2) keep only the rows/visitors with more than one remaining value, such as
Final Data Frame
visitor  col1  col2  col3
A        1     NULL  4
C        3     4     NULL
D        4     7     6
E        5     8     1
Any suggestions? Many thanks.
We can use Series.duplicated along the columns axis to identify the duplicates, then mask the duplicates using where, and finally keep only the rows where the count of non-duplicated values is greater than 1:
s = df.set_index('visitor')
# True for the first occurrence in each row, False for later repeats
m = ~s.apply(pd.Series.duplicated, axis=1)
# NaN-out the repeats, then keep rows with more than one surviving value
s.where(m)[m.sum(axis=1).gt(1)]
col1 col2 col3
visitor
A 1 NaN 4.0
C 3 4.0 NaN
D 4 7.0 6.0
E 5 8.0 1.0
Let us try mask with pd.Series.duplicated, then dropna with thresh (the visitor column is never NaN, so thresh=df.shape[1]-1 keeps rows with at least two surviving values):
out = df.mask(df.apply(pd.Series.duplicated, axis=1)).dropna(thresh=df.shape[1]-1)
Out[321]:
visitor col1 col2 col3
0 A 1 NaN 4.0
2 C 3 4.0 NaN
3 D 4 7.0 6.0
4 E 5 8.0 1.0
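Both approaches build on pd.Series.duplicated, which by default (keep='first') only marks occurrences after the first one in the row, which is exactly the "keep the 1st value" requirement; a tiny sketch of that building block on a single row:
import pandas as pd

row = pd.Series([2, 2, 2], index=['col1', 'col2', 'col3'])
print(row.duplicated())             # col1 False, col2 True, col3 True
print(row.duplicated(keep=False))   # all True: marks every occurrence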

How to impute entire missing values in pandas dataframe with mode/mean?

I know the code for filling each column separately, as below
data['Native Country'].fillna(data['Native Country'].mode()[0], inplace=True)
But I am working on a dataset with 50 rows, and there are 20 categorical columns which need to be imputed.
Is there a single line of code for imputing the entire data set?
Use DataFrame.fillna with DataFrame.mode, and select the first row, because mode returns every value that ties for the maximum number of occurrences:
data = pd.DataFrame({
    'A': list('abcdef'),
    'col1': [4,5,4,5,5,4],
    'col2': [np.nan,8,3,3,2,3],
    'col3': [3,3,5,5,np.nan,np.nan],
    'E': [5,3,6,9,2,4],
    'F': list('aaabbb')
})
cols = ['col1','col2','col3']
print (data[cols].mode())
col1 col2 col3
0 4 3.0 3.0
1 5 NaN 5.0
data[cols] = data[cols].fillna(data[cols].mode().iloc[0])
print (data)
A col1 col2 col3 E F
0 a 4 3.0 3.0 5 a
1 b 5 8.0 3.0 3 a
2 c 4 3.0 5.0 6 a
3 d 5 3.0 5.0 9 b
4 e 5 2.0 3.0 2 b
5 f 4 3.0 3.0 4 b
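The question title also mentions the mean; if numeric columns should get the mean instead, one possible sketch (the frame and column names here are made up) builds a single row of fill values, taking the mean for numeric columns and the first mode for everything else:
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'country': ['US', np.nan, 'US', 'IN'],  # categorical -> mode
    'age': [20.0, np.nan, 30.0, 40.0],      # numeric -> mean
})

fill_values = data.mode().iloc[0]            # first mode per column
num_cols = data.select_dtypes(include='number').columns
fill_values[num_cols] = data[num_cols].mean()  # overwrite numeric with mean
data = data.fillna(fill_values)
print(data)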

Pandas divide two dataframe with different sizes

I have a dataframe df1 as:
col1 col2 Val1 Val2
A g 4 6
A d 3 8
B h 5 10
B p 7 14
I have another dataframe df2 as:
col1 Val1 Val2
A 2 3
B 1 4
I want to divide df1 by df2, matching on col1, so that the A row of df2 divides both A rows of df1 (Val1 by Val1 and Val2 by Val2).
My final output of df1.div(df2) should be as follows:
col1 col2 Val1 Val2
A g 2 2
A d 1.5 2
B h 5 2.5
B p 7 3.5
Convert col1 and col2 to a MultiIndex, convert col1 in the second DataFrame to the index, and then use DataFrame.div:
df = df1.set_index(['col1', 'col2']).div(df2.set_index('col1')).reset_index()
#alternative with specify level of index
#df = df1.set_index(['col1', 'col2']).div(df2.set_index('col1'), level=0).reset_index()
print (df)
col1 col2 Val1 Val2
0 A g 2.0 2.000000
1 A d 1.5 2.666667
2 B h 5.0 2.500000
3 B p 7.0 3.500000
I think there is a slight mistake in your example: for column Val2, 2nd row, 8/3 should be 2.67. So the final output of df1.div(df2) should be:
col1 col2 Val1 Val2
0 A g 2.0 2.000000
1 A d 1.5 2.666667
2 B h 5.0 2.500000
3 B p 7.0 3.500000
Anyway, here is a possible solution. First, construct the two DataFrames:
import pandas as pd
df1 = pd.DataFrame(data={'col1': ['A','A','B','B'], 'col2': ['g','d','h','p'],
                         'Val1': [4,3,5,7], 'Val2': [6,8,10,14]},
                   columns=['col1','col2','Val1','Val2'])
df2 = pd.DataFrame(data={'col1': ['A','B'], 'Val1': [2,1], 'Val2': [3,4]},
                   columns=['col1','Val1','Val2'])
print (df1)
print (df2)
Output:
>>>
col1 col2 Val1 Val2
0 A g 4 6
1 A d 3 8
2 B h 5 10
3 B p 7 14
col1 Val1 Val2
0 A 2 3
1 B 1 4
Now we can just do an INNER JOIN of df1 and df2 on col1. If you are not familiar with SQL joins, have a look at this: sql-join. We can do the join in pandas using the merge() method:
## join df1, df2
merged_df = pd.merge(left=df1, right=df2, how='inner', on='col1')
print (merged_df)
Output:
>>>
col1 col2 Val1_x Val2_x Val1_y Val2_y
0 A g 4 6 2 3
1 A d 3 8 2 3
2 B h 5 10 1 4
3 B p 7 14 1 4
Now that we have the corresponding columns of df1 and df2 side by side, we can simply compute the division and drop the redundant columns:
# Val1 = Val1_x/Val1_y, Val2 = Val2_x/Val2_y
merged_df['Val1'] = merged_df['Val1_x']/merged_df['Val1_y']
merged_df['Val2'] = merged_df['Val2_x']/merged_df['Val2_y']
# delete the cols: Val1_x,Val1_y,Val2_x,Val2_y
merged_df.drop(columns=['Val1_x', 'Val1_y', 'Val2_x', 'Val2_y'], inplace=True)
print (merged_df)
Final Output:
col1 col2 Val1 Val2
0 A g 2.0 2.000000
1 A d 1.5 2.666667
2 B h 5.0 2.500000
3 B p 7.0 3.500000
I hope this solves your question :)
You can use the pandas.merge() function to execute a database-like join between dataframes, then use the result to divide column values:
# merge against col1 so we get a merged index
merged = pd.merge(df1[["col1"]], df2)
df1[["Val1", "Val2"]] = df1[["Val1", "Val2"]].div(merged[["Val1", "Val2"]])
This produces:
col1 col2 Val1 Val2
0 A g 2.0 2.000000
1 A d 1.5 2.666667
2 B h 5.0 2.500000
3 B p 7.0 3.500000
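Yet another possible sketch (not from the answers above): align df2's rows to df1's col1 values with reindex and divide positionally:
import pandas as pd

df1 = pd.DataFrame({'col1': ['A','A','B','B'], 'col2': ['g','d','h','p'],
                    'Val1': [4,3,5,7], 'Val2': [6,8,10,14]})
df2 = pd.DataFrame({'col1': ['A','B'], 'Val1': [2,1], 'Val2': [3,4]})

# repeat each df2 row once per matching df1 row, in df1's row order
denom = df2.set_index('col1').reindex(df1['col1'])
# divide positionally (to_numpy sidesteps index alignment)
df1[['Val1', 'Val2']] = df1[['Val1', 'Val2']].to_numpy() / denom[['Val1', 'Val2']].to_numpy()
print(df1)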

How can I compute a shifted expanding mean per group

I want an expanding mean of col2 per group from groupby('col1'), but I want the mean to not include the row itself (just the rows above it)
dummy = pd.DataFrame({"col1": ['a','a','a','b','b','b','c','c'], "col2": [1, 2, 3, 4, 5, 6, 7, 8]}, index=list(range(8)))
print(dummy)
dummy['one_liner'] = dummy.groupby('col1').col2.shift().expanding().mean().reset_index(level=0, drop=True)
dummy['two_liner'] = dummy.groupby('col1').col2.shift()
dummy['two_liner'] = dummy.groupby('col1').two_liner.expanding().mean().reset_index(level=0, drop=True)
print(dummy)
---------------------------
here is result of first print statement:
col1 col2
0 a 1
1 a 2
2 a 3
3 b 4
4 b 5
5 b 6
6 c 7
7 c 8
here is result of the second print:
col1 col2 one_liner two_liner
0 a 1 NaN NaN
1 a 2 1.000000 1.0
2 a 3 1.500000 1.5
3 b 4 1.500000 NaN
4 b 5 2.333333 4.0
5 b 6 3.000000 4.5
6 c 7 3.000000 NaN
7 c 8 3.800000 7.0
I would have thought their results would be identical.
two_liner is the expected result; one_liner mixes numbers in between groups.
It took a long time to figure out this solution. Can anyone explain the logic? Why does one_liner not give the expected results?
You are looking for expanding().mean() and shift() applied within each group of the groupby():
groups = dummy.groupby('col1')
dummy['one_liner'] = groups.col2.apply(lambda x: x.expanding().mean().shift())
dummy['two_liner'] = groups.one_liner.apply(lambda x: x.expanding().mean().shift())
Output:
col1 col2 one_liner two_liner
0 a 1 NaN NaN
1 a 2 1.0 NaN
2 a 3 1.5 1.0
3 b 4 NaN NaN
4 b 5 4.0 NaN
5 b 6 4.5 4.0
6 c 7 NaN NaN
7 c 8 7.0 NaN
Explanation:
(dummy.groupby('col1').col2.shift() # this shifts col2 within the groups
.expanding().mean() # this ignores the grouping and expands over the whole series
.reset_index(level=0, drop=True) # this is not really important
)
So the above chained command is equivalent to
s1 = dummy.groupby('col1').col2.shift()
s2 = s1.expanding().mean()
s3 = s2.reset_index(level=0, drop=True)
As you can see, only s1 takes the grouping by col1 into account; the expanding mean in s2 runs over the whole series, which is why one_liner leaks values across group boundaries.
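For reference, a compact way to compute the per-group shifted expanding mean directly (a sketch assuming the dummy frame above; it reproduces the two_liner column):
import pandas as pd

dummy = pd.DataFrame({"col1": ['a','a','a','b','b','b','c','c'],
                      "col2": [1, 2, 3, 4, 5, 6, 7, 8]})

# shift and expanding mean both run inside each group, so nothing
# leaks across group boundaries
dummy['shifted_expanding_mean'] = (
    dummy.groupby('col1')['col2']
         .transform(lambda s: s.shift().expanding().mean())
)
print(dummy)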

Create a column based on first row of each sorted group in pandas

I want to group a dataframe by two columns, sort each group by another column (col3 is a date in my dataset), and then create a new column holding the value from the first row of each sorted group.
dataframe:
col1 col2 col3
A 0 2.0
A 0 1.0
A 0 3.0
A 1 3.0
A 1 4.0
B 0 3.0
B 0 1.0
B 1 0.0
B 1 1.0
output:
col1 col2 col3 col4
A 0 2.0 1.0
A 0 1.0 1.0
A 0 3.0 1.0
A 1 3.0 3.0
A 1 4.0 3.0
B 0 3.0 1.0
B 0 1.0 1.0
B 1 0.0 0.0
B 1 1.0 0.0
I tried this :
active_users = active_users.groupby(['col1', 'col2']).apply(lambda x: x.sort_values('col3')).transform('first')
and got this error:
TypeError: first() missing 1 required positional argument: 'offset'
You can create such a column with:
df['col4'] = df.groupby(['col1', 'col2'])['col3'].transform('min')
Since the first value of the sorted items is the minimum, it is usually better to take the minimum directly than to sort and take the first item: the minimum can be computed in O(n), whereas for sorting it depends on the algorithm (lazy sorting algorithms can sometimes yield the first element in O(n) as well), but 'min' makes it clearer what you aim to do.
For the given sample dataframe we will then obtain:
>>> df = pd.DataFrame({'col1': ['A']*5 + ['B']*4, 'col2': [0,0,0,1,1,0,0,1,1], 'col3': [2,1,3,3,4,3,1,0,1.0]})
>>> df
col1 col2 col3
0 A 0 2.0
1 A 0 1.0
2 A 0 3.0
3 A 1 3.0
4 A 1 4.0
5 B 0 3.0
6 B 0 1.0
7 B 1 0.0
8 B 1 1.0
>>> df['col4'] = df.groupby(['col1', 'col2'])['col3'].transform('min')
>>> df
col1 col2 col3 col4
0 A 0 2.0 1.0
1 A 0 1.0 1.0
2 A 0 3.0 1.0
3 A 1 3.0 3.0
4 A 1 4.0 3.0
5 B 0 3.0 1.0
6 B 0 1.0 1.0
7 B 1 0.0 0.0
8 B 1 1.0 0.0
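If you instead need a value from the first row other than the sort key itself (say, another column of the earliest-dated row), a sort followed by transform('first') works; a sketch assuming the df above plus a hypothetical extra column col_other:
# 'col_other' is a hypothetical column; its value from the earliest-col3
# row is broadcast to every row of the group
df = df.sort_values('col3')
df['col4'] = df.groupby(['col1', 'col2'])['col_other'].transform('first')
df = df.sort_index()  # restore the original row order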
