I know the code for filling each column separately, like below:
data['Native Country'].fillna(data['Native Country'].mode(), inplace=True)
But I am working on a dataset with 50 rows, and there are 20 categorical columns which need to be imputed.
Is there a single line of code for imputing the entire dataset?
Use DataFrame.fillna with DataFrame.mode and select the first row, because mode returns all values that tie for the maximum number of occurrences:
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'A': list('abcdef'),
    'col1': [4, 5, 4, 5, 5, 4],
    'col2': [np.nan, 8, 3, 3, 2, 3],
    'col3': [3, 3, 5, 5, np.nan, np.nan],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb'),
})
cols = ['col1','col2','col3']
print (data[cols].mode())
col1 col2 col3
0 4 3.0 3.0
1 5 NaN 5.0
data[cols] = data[cols].fillna(data[cols].mode().iloc[0])
print (data)
A col1 col2 col3 E F
0 a 4 3.0 3.0 5 a
1 b 5 8.0 3.0 3 a
2 c 4 3.0 5.0 6 a
3 d 5 3.0 5.0 9 b
4 e 5 2.0 3.0 2 b
5 f 4 3.0 3.0 4 b
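To answer the original question directly - the column list does not have to be typed out by hand. A minimal sketch (the dataset and column names below are made up) that selects every object-dtype column and imputes all of them in one statement:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: two categorical columns with gaps, one numeric column
data = pd.DataFrame({
    'Native Country': ['US', np.nan, 'US', 'IN', np.nan],
    'Workclass': ['Private', 'Gov', np.nan, 'Private', 'Private'],
    'Age': [25, 30, np.nan, 40, 35],
})

# Pick the categorical (object-dtype) columns and fill each with its own mode;
# .iloc[0] takes the first mode when several values tie for most frequent
cat_cols = data.select_dtypes(include='object').columns
data[cat_cols] = data[cat_cols].fillna(data[cat_cols].mode().iloc[0])
```

Numeric columns (Age here) are left untouched, so a different strategy (mean, median) can be applied to them separately.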
Assume I have a data frame such as:
import pandas as pd

df = pd.DataFrame({'visitor': ['A', 'B', 'C', 'D', 'E'],
                   'col1': [1, 2, 3, 4, 5],
                   'col2': [1, 2, 4, 7, 8],
                   'col3': [4, 2, 3, 6, 1]})
visitor col1 col2 col3
A       1    1    4
B       2    2    2
C       3    4    3
D       4    7    6
E       5    8    1
For each row/visitor: (1) first, if there are any identical values, I would like to keep the first value of each row and replace the remaining identical values in the same row with NULL, such as
visitor col1 col2  col3
A       1    NULL  4
B       2    NULL  NULL
C       3    4     NULL
D       4    7     6
E       5    8     1
Then (2) keep only the rows/visitors with more than one remaining value, such as
Final Data Frame
visitor col1 col2  col3
A       1    NULL  4
C       3    4     NULL
D       4    7     6
E       5    8     1
Any suggestions? Many thanks!
We can use pd.Series.duplicated along the columns axis to identify the duplicates, then mask the duplicates using where and keep the rows where the count of non-duplicated values is greater than 1:
s = df.set_index('visitor')
m = ~s.apply(pd.Series.duplicated, axis=1)
s.where(m)[m.sum(axis=1).gt(1)]
col1 col2 col3
visitor
A 1 NaN 4.0
C 3 4.0 NaN
D 4 7.0 6.0
E 5 8.0 1.0
Let us try mask with pd.Series.duplicated, then dropna with thresh
out = df.mask(df.apply(pd.Series.duplicated, axis=1)).dropna(thresh=df.shape[1] - 1)
print(out)
visitor col1 col2 col3
0 A 1 NaN 4.0
2 C 3 4.0 NaN
3 D 4 7.0 6.0
4 E 5 8.0 1.0
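Both answers hinge on applying pd.Series.duplicated row-wise; a small sketch of what that intermediate mask looks like on the question's frame (the first occurrence in each row stays False, repeats become True):

```python
import pandas as pd

df = pd.DataFrame({'visitor': ['A', 'B', 'C', 'D', 'E'],
                   'col1': [1, 2, 3, 4, 5],
                   'col2': [1, 2, 4, 7, 8],
                   'col3': [4, 2, 3, 6, 1]})

# True marks a value already seen earlier in the same row
dup = df.set_index('visitor').apply(pd.Series.duplicated, axis=1)
print(dup)
```

Masking with this frame (via where or mask) is what turns the repeats into NaN, and summing its negation per row counts the surviving values.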
I have the following Dataset:
col value
0 A 1
1 A NaN
2 B NaN
3 B NaN
4 B NaN
5 B 1
6 C 3
7 C NaN
8 C NaN
9 D 5
10 E 6
There is only one value set per group; the rest is NaN. What I want to do now is fill the NaN with the value of the group. If a group has no NaNs, I just want to ignore it.
Outcome should look like this:
col value
0 A 1
1 A 1
2 B 1
3 B 1
4 B 1
5 B 1
6 C 3
7 C 3
8 C 3
9 D 5
10 E 6
What I've tried so far is the following:
df["value"] = df.groupby('col')["value"].transform(lambda x: x.fillna(x.mean()))
However, this method is not only super slow, it also doesn't give me the desired result.
Anybody have an idea?
It depends on the data - if every group has at least one non-missing value, you can sort and then fill with GroupBy.ffill; this also works when some groups contain only NaNs:
df = df.sort_values(['col','value'])
df["value"] = df.groupby('col')["value"].ffill()
# if there is always exactly one non-missing value per group; fails if some group is all NaNs
# df["value"] = df["value"].ffill()
print (df)
col value
0 A 1.0
1 A 1.0
5 B 1.0
2 B 1.0
3 B 1.0
4 B 1.0
6 C 3.0
7 C 3.0
8 C 3.0
9 D 5.0
10 E 6.0
Or, if there are multiple values and you need to replace by the mean, improve performance by changing your solution to GroupBy.transform, with only the mean passed to Series.fillna:
df["value"] = df["value"].fillna(df.groupby('col')["value"].transform('mean'))
print (df)
col value
0 A 1.0
1 A 1.0
5 B 1.0
2 B 1.0
3 B 1.0
4 B 1.0
6 C 3.0
7 C 3.0
8 C 3.0
9 D 5.0
10 E 6.0
You can use ffill, which is the same as fillna() with method='ffill' (see the docs):
df["value"] = df["value"].ffill()
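Note the plain ffill only works because the data was sorted so the known value comes first in each group. If the single known value can sit anywhere in a group, a sketch that propagates it in both directions per group (no pre-sort needed) could look like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'E'],
                   'value': [1, np.nan, np.nan, np.nan, np.nan, 1, 3, np.nan, np.nan, 5, 6]})

# Forward- and back-fill within each group so the single known value
# reaches every row of its group regardless of position
df['value'] = df.groupby('col')['value'].transform(lambda s: s.ffill().bfill())
```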
I want the expanding mean of col2 grouped by col1, but I want the mean to not include the row itself (just the rows above it).
dummy = pd.DataFrame({"col1": ["a", "a", "a", "b", "b", "b", "c", "c"],
                      "col2": [1, 2, 3, 4, 5, 6, 7, 8]}, index=list(range(8)))
print(dummy)
dummy['one_liner'] = dummy.groupby('col1').col2.shift().expanding().mean().reset_index(level=0, drop=True)
dummy['two_liner'] = dummy.groupby('col1').col2.shift()
dummy['two_liner'] = dummy.groupby('col1').two_liner.expanding().mean().reset_index(level=0, drop=True)
print(dummy)
---------------------------
Here is the result of the first print statement:
col1 col2
0 a 1
1 a 2
2 a 3
3 b 4
4 b 5
5 b 6
6 c 7
7 c 8
Here is the result of the second print:
col1 col2 one_liner two_liner
0 a 1 NaN NaN
1 a 2 1.000000 1.0
2 a 3 1.500000 1.5
3 b 4 1.500000 NaN
4 b 5 2.333333 4.0
5 b 6 3.000000 4.5
6 c 7 3.000000 NaN
7 c 8 3.800000 7.0
I would have thought their results would be identical. two_liner is the expected result; one_liner mixes numbers between groups.
It took a long time to figure out this solution - can anyone explain the logic? Why does one_liner not give the expected results?
You are looking for expanding().mean() and shift() within the groupby():
groups = df.groupby('col1')
df['one_liner'] = groups.col2.apply(lambda x: x.expanding().mean().shift())
df['two_liner'] = groups.one_liner.apply(lambda x: x.expanding().mean().shift())
Output:
col1 col2 one_liner two_liner
0 a 1 NaN NaN
1 a 2 1.0 NaN
2 a 3 1.5 1.0
3 b 4 NaN NaN
4 b 5 4.0 NaN
5 b 6 4.5 4.0
6 c 7 NaN NaN
7 c 8 7.0 NaN
Explanation:
(dummy.groupby('col1').col2.shift() # this shifts col2 within the groups
.expanding().mean() # this ignores the grouping and expanding on the whole series
.reset_index(level=0, drop=True) # this is not really important
)
So that the above chained command is equivalent to
s1 = dummy.groupby('col1').col2.shift()
s2 = s1.expanding().mean()
s3 = s2.reset_index(level=0, drop=True)
As you can see, only s1 considers the grouping by col1.
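The leakage is easy to see by computing s1 and s2 by hand (numeric col2 assumed). After the group-wise shift, the expanding mean runs over the whole column, so group a's values flow into group b's averages:

```python
import pandas as pd

dummy = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c'],
                      'col2': [1, 2, 3, 4, 5, 6, 7, 8]})

s1 = dummy.groupby('col1').col2.shift()  # [NaN, 1, 2, NaN, 4, 5, NaN, 7] - shift respects groups
s2 = s1.expanding().mean()               # expanding mean over the WHOLE series, groups ignored

# Row 4 (second 'b' row) averages 1, 2 and 4 - the 1 and 2 leaked in from group 'a'
print(s2.tolist())
```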
This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 3 years ago.
I have a big dataset of 2,500,000 rows with the following format:
Merkmal == Feature
Auspraegung_Code == Code for the following column
Auspraegung_Text == Actual kind in the Feature
Anzahl == Number of kinds of this Feature
The rest is not of interest / is self-explanatory.
My issue is that I'd like to reshape this DataFrame so that the Auspraegung_Text entries become columns, holding their amounts (the Anzahl column) for each Gitter_ID in each row.
Currently what I do is this:
df_result = pd.DataFrame()
for i, ids in enumerate(Gitter_ids):
    auspraegungen = df["Auspraegung_Text"][df["Gitter_ID_100m_neu"] == ids]
    auspraegung_amounts = df["Anzahl"][df["Gitter_ID_100m_neu"] == ids]
    df_result.loc[i, "Cell_id"] = ids
    for auspraegung, amount in zip(auspraegungen, auspraegung_amounts):
        df_result.loc[i, auspraegung] = amount
The result DataFrame should look like this:
The code above works, but is very, very slow. How can I optimize the process?
The data used in this problem is census data from Germany.
Try using pandas.pivot_table:
(with dummy data)
>>> x=[[1,2,3, "A"], [3,4,2, "B"], [32, 2,34, "C"], [1,2,5, "B"], [241,24,2, "C"], [214, 2,3,"B"]]
>>> df=pd.DataFrame(data=x, columns=["col1", "col2", "col3", "cat"])
>>> df
col1 col2 col3 cat
0 1 2 3 A
1 3 4 2 B
2 32 2 34 C
3 1 2 5 B
4 241 24 2 C
5 214 2 3 B
>>> pd.pivot_table(df, values=["col1", "col2", "col3"], columns=["cat"])
cat A B C
col1 1.0 72.666667 136.5
col2 2.0 2.666667 13.0
col3 3.0 3.333333 18.0
>>> pd.pivot_table(df, values=["col1", "col2"], index="col3", columns=["cat"])
col1 col2
cat A B C A B C
col3
2 NaN 3.0 241.0 NaN 4.0 24.0
3 1.0 214.0 NaN 2.0 2.0 NaN
5 NaN 1.0 NaN NaN 2.0 NaN
34 NaN NaN 32.0 NaN NaN 2.0
>>> pd.pivot_table(df, values=["col1"], index=["col3", "col2"], columns=["cat"]).reset_index()
col3 col2 col1
cat A B C
0 2 4 NaN 3.0 NaN
1 2 24 NaN NaN 241.0
2 3 2 1.0 214.0 NaN
3 5 2 NaN 1.0 NaN
4 34 2 NaN NaN 32.0
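Applied to the question's own columns (names taken from the post; the sample data below is invented), the reshape would look something like this:

```python
import pandas as pd

# Invented sample in the shape described in the question
df = pd.DataFrame({
    'Gitter_ID_100m_neu': ['g1', 'g1', 'g2', 'g2'],
    'Auspraegung_Text': ['maennlich', 'weiblich', 'maennlich', 'weiblich'],
    'Anzahl': [10, 12, 7, 9],
})

# One row per grid cell, one column per Auspraegung_Text, holding Anzahl
out = pd.pivot_table(df, index='Gitter_ID_100m_neu',
                     columns='Auspraegung_Text', values='Anzahl',
                     aggfunc='sum').reset_index()
print(out)
```

This replaces the double loop entirely, which is where the speedup comes from.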
I have a pandas DataFrame that looks similar to the following...
>>> df = pd.DataFrame({
... 'col1':['A','C','B','A','B','C','A'],
... 'col2':[np.nan,1.,np.nan,1.,1.,np.nan,np.nan],
... 'col3':[0,1,9,4,2,3,5],
... })
>>> df
col1 col2 col3
0 A NaN 0
1 C 1.0 1
2 B NaN 9
3 A 1.0 4
4 B 1.0 2
5 C NaN 3
6 A NaN 5
What I would like to do is group the rows by the value of col1 and then fill any NaN values in col2 so that they keep counting upward from the highest existing value of that group.
So that my expected results would look like the following...
>>> df
col1 col2 col3
0 A 1.0 4
1 A 2.0 0
2 A 3.0 5
3 B 1.0 2
4 B 2.0 9
5 C 1.0 1
6 C 2.0 3
I believe I can use something like groupby on col1, though I'm unsure how to increment the value in col2 based on the group's last highest value. I've tried the following, but instead of incrementing the values in col2 it sets them all to 1.0 and adds an extra column...
>>> df1 = df.groupby(['col1'], as_index=False).agg({'col2': 'min'})
>>> df = pd.merge(df1, df, how='left', left_on=['col1'], right_on=['col1'])
>>> df
col1 col2_x col2_y col3
0 A 1.0 NaN 0
1 A 1.0 1.0 1
2 A 1.0 NaN 5
3 B 1.0 NaN 9
4 B 1.0 1.0 4
5 C 1.0 1.0 2
6 C 1.0 NaN 3
Use GroupBy.cumcount only for the rows with missing values, add the per-group maximum obtained with GroupBy.transform('max'), and finally restore the original non-missing values with fillna:
df = pd.DataFrame({
'col1':['A','C','B','A','B','B','B'],
'col2':[np.nan,1.,np.nan,1.,3.,np.nan, 0],
'col3':[0,1,9,4,2,3,4],
})
print (df)
col1 col2 col3
0 A NaN 0
1 C 1.0 1
2 B NaN 9
3 A 1.0 4
4 B 3.0 2
5 B NaN 3
6 B 0.0 4
df = df.sort_values(['col1','col2'], na_position='last')
s = df.groupby('col1')['col2'].transform('max')
df['new'] = (df[df['col2'].isna()]
.groupby('col1')
.cumcount()
.add(1)
.add(s)
.fillna(df['col2']).astype(int))
print (df)
col1 col2 col3 new
3 A 1.0 4 1
0 A NaN 0 2
6 B 0.0 4 0
4 B 3.0 2 3
2 B NaN 9 4
5 B NaN 3 5
1 C 1.0 1 1
Another way:
df['col2_new'] = df.groupby('col1')['col2'].apply(lambda x: x.replace(np.nan, x.value_counts().index[0]+1))
df = df.sort_values('col1')
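As a sanity check, the cumcount approach from the first answer applied to the question's original frame reproduces the expected numbering (a sketch; the row order differs from the question's display until re-sorted):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'C', 'B', 'A', 'B', 'C', 'A'],
    'col2': [np.nan, 1., np.nan, 1., 1., np.nan, np.nan],
    'col3': [0, 1, 9, 4, 2, 3, 5],
})

# Known values first, then number the NaN rows past each group's maximum
df = df.sort_values(['col1', 'col2'], na_position='last')
s = df.groupby('col1')['col2'].transform('max')
df['col2'] = (df[df['col2'].isna()]
                .groupby('col1')
                .cumcount()
                .add(1)
                .add(s)
                .fillna(df['col2']))
print(df)
```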