Fillna in pandas with respect to similar line [duplicate] - python

This should be straightforward, but the closest thing I've found is this post:
pandas: Filling missing values within a group, and I still can't solve my problem....
Suppose I have the following dataframe
df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
and I'd like to fill in "NaN" with mean value in each "name" group, i.e.
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
I'm not sure where to go after:
grouped = df.groupby('name').mean()
Thanks a bunch.

One way would be to use transform:
>>> df
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
>>> df
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3

fillna + groupby + transform + mean
This seems intuitive:
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to #DSM's solution, but avoids the need to define an anonymous lambda function.

#DSM has IMO the right answer, but I'd like to share my generalization and optimization of the question: Multiple columns to group-by and having multiple value columns:
df = pd.DataFrame(
{
'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
'name': ['A','A', 'B','B','B','B', 'C','C','C'],
'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
}
)
... gives ...
category name other_value value
0 X A 10.0 1.0
1 X A NaN NaN
2 X B NaN NaN
3 X B 20.0 2.0
4 X B 30.0 3.0
5 X B 10.0 1.0
6 Y C 30.0 3.0
7 Y C NaN NaN
8 Y C 30.0 3.0
In this generalized case we would like to group by category and name, and impute only on value.
This can be solved as follows:
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation only be run on that particular column. You could add it to the end, but then you will run it for all columns only to throw out all but one measure column at the end. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to do this.
Performance test by increasing the dataset by doing ...
big_df = None
for _ in range(10000):
if big_df is None:
big_df = df.copy()
else:
big_df = pd.concat([big_df, df])
df = big_df
... confirms that this increases the speed proportional to how many columns you don't have to impute:
import pandas as pd
from datetime import datetime
def generate_data():
...
t = datetime.now()
df = generate_data()
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
print(datetime.now()-t)
# 0:00:00.016012
t = datetime.now()
df = generate_data()
df["value"] = df.groupby(['category', 'name'])\
.transform(lambda x: x.fillna(x.mean()))['value']
print(datetime.now()-t)
# 0:00:00.030022
On a final note you can generalize even further if you want to impute more than one column, but not all:
df[['value', 'other_value']] = df.groupby(['category', 'name'])['value', 'other_value']\
.transform(lambda x: x.fillna(x.mean()))

Shortcut:
Groupby + Apply + Lambda + Fillna + Mean
>>> df['value1']=df.groupby('name')['value'].apply(lambda x:x.fillna(x.mean()))
>>> df.isnull().sum().sum()
0
This solution still works if you want to group by multiple columns to replace missing values.
>>> df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, np.nan,np.nan, 4, 3],
'name': ['A','A', 'B','B','B','B', 'C','C','C'],'class':list('ppqqrrsss')})
>>> df['value']=df.groupby(['name','class'])['value'].apply(lambda x:x.fillna(x.mean()))
>>> df
value name class
0 1.0 A p
1 1.0 A p
2 2.0 B q
3 2.0 B q
4 3.0 B r
5 3.0 B r
6 3.5 C s
7 4.0 C s
8 3.0 C s

I'd do it this way
df.loc[df.value.isnull(), 'value'] = df.groupby('group').value.transform('mean')

The featured high ranked answer only works for a pandas Dataframe with only two columns. If you have a more columns case use instead:
df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
lambda x: x.fillna(x.mean()))

To summarize all above concerning the efficiency of the possible solution
I have a dataset with 97 906 rows and 48 columns.
I want to fill in 4 columns with the median of each group.
The column I want to group has 26 200 groups.
The first solution
start = time.time()
x = df_merged[continuous_variables].fillna(df_merged.groupby('domain_userid')[continuous_variables].transform('median'))
print(time.time() - start)
0.10429811477661133 seconds
The second solution
start = time.time()
for col in continuous_variables:
df_merged.loc[df_merged[col].isnull(), col] = df_merged.groupby('domain_userid')[col].transform('median')
print(time.time() - start)
0.5098445415496826 seconds
The next solution I only performed on a subset since it was running too long.
start = time.time()
for col in continuous_variables:
x = df_merged.head(10000).groupby('domain_userid')[col].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
11.685635566711426 seconds
The following solution follows the same logic as above.
start = time.time()
x = df_merged.head(10000).groupby('domain_userid')[continuous_variables].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
42.630549907684326 seconds
So it's quite important to choose the right method.
Bear in mind that I noticed once a column was not a numeric the times were going up exponentially (makes sense as I was computing the median).

def groupMeanValue(group):
group['value'] = group['value'].fillna(group['value'].mean())
return group
dft = df.groupby("name").transform(groupMeanValue)

I know that is an old question. But I am quite surprised by the unanimity of apply/lambda answers here.
Generally speaking, that is the second worst thing to do after iterating rows, from timing point of view.
What I would do here is
df.loc[df['value'].isna(), 'value'] = df.groupby('name')['value'].transform('mean')
Or using fillna
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
I've checked with timeit (because, again, unanimity for apply/lambda based solution made me doubt my instinct). And that is indeed 2.5 faster than the most upvoted solutions.

To fill all the numeric null values with the mean grouped by "name"
num_cols = df.select_dtypes(exclude='object').columns
df[num_cols] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))

df.fillna(df.groupby(['name'], as_index=False).mean(), inplace=True)

You can also use "dataframe or table_name".apply(lambda x: x.fillna(x.mean())).

Related

Fill NA values in dataframe by mapping a double indexed groupby object [duplicate]

This should be straightforward, but the closest thing I've found is this post:
pandas: Filling missing values within a group, and I still can't solve my problem....
Suppose I have the following dataframe
df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
and I'd like to fill in "NaN" with mean value in each "name" group, i.e.
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
I'm not sure where to go after:
grouped = df.groupby('name').mean()
Thanks a bunch.
One way would be to use transform:
>>> df
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
>>> df
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
fillna + groupby + transform + mean
This seems intuitive:
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to #DSM's solution, but avoids the need to define an anonymous lambda function.
#DSM has IMO the right answer, but I'd like to share my generalization and optimization of the question: Multiple columns to group-by and having multiple value columns:
df = pd.DataFrame(
{
'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
'name': ['A','A', 'B','B','B','B', 'C','C','C'],
'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
}
)
... gives ...
category name other_value value
0 X A 10.0 1.0
1 X A NaN NaN
2 X B NaN NaN
3 X B 20.0 2.0
4 X B 30.0 3.0
5 X B 10.0 1.0
6 Y C 30.0 3.0
7 Y C NaN NaN
8 Y C 30.0 3.0
In this generalized case we would like to group by category and name, and impute only on value.
This can be solved as follows:
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation only be run on that particular column. You could add it to the end, but then you will run it for all columns only to throw out all but one measure column at the end. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to do this.
Performance test by increasing the dataset by doing ...
big_df = None
for _ in range(10000):
if big_df is None:
big_df = df.copy()
else:
big_df = pd.concat([big_df, df])
df = big_df
... confirms that this increases the speed proportional to how many columns you don't have to impute:
import pandas as pd
from datetime import datetime
def generate_data():
...
t = datetime.now()
df = generate_data()
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
print(datetime.now()-t)
# 0:00:00.016012
t = datetime.now()
df = generate_data()
df["value"] = df.groupby(['category', 'name'])\
.transform(lambda x: x.fillna(x.mean()))['value']
print(datetime.now()-t)
# 0:00:00.030022
On a final note you can generalize even further if you want to impute more than one column, but not all:
df[['value', 'other_value']] = df.groupby(['category', 'name'])['value', 'other_value']\
.transform(lambda x: x.fillna(x.mean()))
Shortcut:
Groupby + Apply + Lambda + Fillna + Mean
>>> df['value1']=df.groupby('name')['value'].apply(lambda x:x.fillna(x.mean()))
>>> df.isnull().sum().sum()
0
This solution still works if you want to group by multiple columns to replace missing values.
>>> df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, np.nan,np.nan, 4, 3],
'name': ['A','A', 'B','B','B','B', 'C','C','C'],'class':list('ppqqrrsss')})
>>> df['value']=df.groupby(['name','class'])['value'].apply(lambda x:x.fillna(x.mean()))
>>> df
value name class
0 1.0 A p
1 1.0 A p
2 2.0 B q
3 2.0 B q
4 3.0 B r
5 3.0 B r
6 3.5 C s
7 4.0 C s
8 3.0 C s
I'd do it this way
df.loc[df.value.isnull(), 'value'] = df.groupby('group').value.transform('mean')
The featured high ranked answer only works for a pandas Dataframe with only two columns. If you have a more columns case use instead:
df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
lambda x: x.fillna(x.mean()))
To summarize all above concerning the efficiency of the possible solution
I have a dataset with 97 906 rows and 48 columns.
I want to fill in 4 columns with the median of each group.
The column I want to group has 26 200 groups.
The first solution
start = time.time()
x = df_merged[continuous_variables].fillna(df_merged.groupby('domain_userid')[continuous_variables].transform('median'))
print(time.time() - start)
0.10429811477661133 seconds
The second solution
start = time.time()
for col in continuous_variables:
df_merged.loc[df_merged[col].isnull(), col] = df_merged.groupby('domain_userid')[col].transform('median')
print(time.time() - start)
0.5098445415496826 seconds
The next solution I only performed on a subset since it was running too long.
start = time.time()
for col in continuous_variables:
x = df_merged.head(10000).groupby('domain_userid')[col].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
11.685635566711426 seconds
The following solution follows the same logic as above.
start = time.time()
x = df_merged.head(10000).groupby('domain_userid')[continuous_variables].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
42.630549907684326 seconds
So it's quite important to choose the right method.
Bear in mind that I noticed once a column was not a numeric the times were going up exponentially (makes sense as I was computing the median).
def groupMeanValue(group):
group['value'] = group['value'].fillna(group['value'].mean())
return group
dft = df.groupby("name").transform(groupMeanValue)
I know that is an old question. But I am quite surprised by the unanimity of apply/lambda answers here.
Generally speaking, that is the second worst thing to do after iterating rows, from timing point of view.
What I would do here is
df.loc[df['value'].isna(), 'value'] = df.groupby('name')['value'].transform('mean')
Or using fillna
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
I've checked with timeit (because, again, unanimity for apply/lambda based solution made me doubt my instinct). And that is indeed 2.5 faster than the most upvoted solutions.
To fill all the numeric null values with the mean grouped by "name"
num_cols = df.select_dtypes(exclude='object').columns
df[num_cols] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
df.fillna(df.groupby(['name'], as_index=False).mean(), inplace=True)
You can also use "dataframe or table_name".apply(lambda x: x.fillna(x.mean())).

Summing multiple columns using a regular expression to select which columns to sum

I would like to perform the following:"
test = pd.DataFrame({'A1':[1,1,1,1],
'A2':[1,2,2,1],
'A3':[1,1,1,1],
'B1':[1,1,1,1],
'B2':[pd.NA, 1,1,1]})
result = pd.DataFrame({'A': test.filter(regex='A').sum(axis=1),
'B': test.filter(regex='B').sum(axis=1)})
I was wondering whether there is a better method to do this, when we have more columns and more "regex"-matches.
Use dict comprehension instead multiple repeat code like:
L = ['A','B']
df = pd.DataFrame({x: test.filter(regex=x).sum(axis=1) for x in L})
Or if possible simplify solution by select only first letters use:
df = test.groupby(lambda x: x[0], axis=1).sum()
print (df)
A B
0 3 1.0
1 4 2.0
2 4 2.0
3 3 2.0
If regexes should ne joined by | and gt all columns substrings use:
vals = test.columns.str.extract('(A|B)', expand=False)
print (vals)
Index(['A', 'A', 'A', 'B', 'B'], dtype='object')
df = test.groupby(vals, axis=1).sum()
print (df)
A B
0 3 1.0
1 4 2.0
2 4 2.0
3 3 2.0

Find the mean of columns with matching column names

I have a dataframe similar to the following but with thousands of rows and columns:
x y ghb_00hr_rep1 ghb_00hr_rep2 ghb_00hr_rep3 ghl_06hr_rep1 ghl_06hr_rep2
x y 2 3 2 1 3
x y 5 7 6 2 1
I would like my output to look like this:
ghb_00hr hl_06hr
2.3 2
6 1.5
My goal is to find the average of the matching columns. I have come up with this: temp = df.groupby(name, axis=1).agg('mean') But I am not sure how to define 'name' as the matching columns.
My previous strategy was the following:
name = pd.Series(['_'.join(i.split('_')[:-1])
for i in df.columns[3:]],
index = df.columns[3:]
)
temp = df.groupby(name, axis=1).agg('mean')
avg = pd.concat([df.iloc[:, :3], temp],
axis=1
)
However the number of 'replicates' ranges from 1-4 so grouping by index location isn't an option.
Not sure if there is a better way to do this or if I am on the right track.
An option is to groupby level=0:
(df.set_index(['name','x','y'])
.groupby(level=0, axis=1)
.mean().reset_index()
)
Output:
name x y ghb_00hr ghl_06hr
0 gene1 x y 2.333333 2.0
1 gene2 x y 6.000000 1.5
Update: for the modified question:
d = df.filter(like='gh')
# or d = df.iloc[:, 2:]
# depending on your columns of interest
names = d.columns.str.rsplit('_', n=1).str[0]
d.groupby(names, axis=1).mean()
Output:
ghb_00hr ghl_06hr
0 2.333333 2.0
1 6.000000 1.5
You can convert df.columns to set then iterate:
df = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=['a', 'a', 'a', 'b', 'b', 'b'])
for column in set(df.columns):
print(column, df[common_name].mean(axis=1))
will outputs
a 0 2.0
dtype: float64
b 0 5.0
dtype: float64
Use sorted if the order matters:
for column in sorted(set(df.columns)):
From here you can get the output in pretty much any format you want.

How to use groupby max in own groupby function?

I have the following df
d = {'CAT':['C1','C2','C1','C2'],'A': [10, 20,30,40], 'B': [3, 4,10,3]}
df1 = pd.DataFrame(data=d)
I am trying to include a new column obtained by dividing 'A' by the highest 'B' it is category ('CAT'). That is, I want to divide 10 by 10, 20 by 4, 10 by 10 and 40 by 4 to obtain the following df
d = {'CAT':['C1','C2','C1','C2'],'A': [10, 20,30,40], 'B': [3, 4,10,3], 'C':[1,5,3,10]}
Any suggestions?
I find it easy to do without having to condition/groupby on CAT
d = {'A': [10, 20,30,40], 'B': [3, 4,10,3]}
df1 = pd.DataFrame(data=d)
df1 = df1.apply(lambda x:x.A/max(df1['B']),axis=1)
but with 'CAT' I am having a hard time.
You could do this in one line; I only broke it into separate lines for more clarity. transform allows replication of the groupby accross the entire dataframe; with that we can get the results for column C :
grouping = df1.groupby("CAT").B.transform("max")
df1['C'] = df1.A.div(grouping)
df1
CAT A B C
0 C1 10 3 1.0
1 C2 20 4 5.0
2 C1 30 10 3.0
3 C2 40 3 10.0
you're pretty much most of the way there with using apply. Depending on how big your actual dataset it, using apply could work out as inefficient, but ignoring that, you can solve your problem by the 'max' function on a filter of the dataframe rather than the df itself.
Or, just to get to the code:
df1['calculation'] = df1.apply(lambda row: row['A'] / max(df1[df1['CAT'] == row['CAT']]['B']), axis=1)

Match rows between dataframes and preserve order

I work in python and pandas.
Let's suppose that I have a dataframe like that (INPUT):
A B C
0 2 8 6
1 5 2 5
2 3 4 9
3 5 1 1
I want to process it to finally get a new dataframe which looks like that (EXPECTED OUTPUT):
A B C
0 2 7 NaN
1 5 1 1
2 3 3 NaN
3 5 0 NaN
To manage this I do the following:
columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)
df_2 = df_1
df_2['B'] -= 1
df_2['C'] = np.nan
df_2 looks like that for now:
A B C
0 2 7 NaN
1 5 1 NaN
2 3 3 NaN
3 5 0 NaN
Now I want to do a matching/merging between df_1 and df_2 with using as keys the columns A and B.
I tried with isin() to do this:
df_temp = df_1[df_1[['A', 'B']].isin(df_2[['A', 'B']])]
df_2.iloc[df_temp.index] = df_temp
but it gives me back the same df_2 as before without matching the common row 5 1 1 for A, B, C respectively:
A B C
0 2 7 NaN
1 5 1 NaN
2 3 3 NaN
3 5 0 NaN
How can I do this properly?
By the way, just to be clear, the matching should not be done like
1st row of df1 - 1st row of df1
2nd row of df1 - 2nd row of df2
3rd row of df1 - 3rd row of df2
...
But it has to be done as:
any row of df1 - any row of df2
based on the specified columns as keys.
I think that this is why isin() above at my code does not work since it does the filtering/matching in the former way.
On the other hand, .merge() can do the matching in the latter way but it does not preserve the order of the rows in the way I want and it is pretty tricky or inefficient to fix that.
Finally, keep in mind that with my actual dataframes way more than only 2 columns (e.g. 15) will be used as keys for the matching so it is better that you come up with something concise even for bigger dataframes.
P.S.
See my answer below.
Here's my suggestion using a lambda function in apply. Should be easily scalable to more columns to compare (just adjust cols_to_compare accordingly). By the way, when generating df_2, be sure to copy df_1, otherwise changes in df_2 will carry over to df_1 as well.
So generating the data first:
columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)
df_2 = df_1.copy() # Be sure to create a copy here
df_2['B'] -= 1
df_2['C'] = np.nan
an now we 'scan' df_1 for the rows of interest:
cols_to_compare = ['A', 'B']
df_2['C'] = df_2.apply(lambda x: 1 if any((df_1.loc[:, cols_to_compare].values[:]==x[cols_to_compare].values).all(1)) else np.nan, axis=1)
What is does is check whether the values in the current row are also like this in any row in the concerning columns of df_1.
The output is:
A B C
0 2 7 NaN
1 5 1 1.0
2 3 3 NaN
3 5 0 NaN
Someone (I do not remember his username) suggested the following (which I think works) and then he deleted his post for some reason (??!):
df_2=df_2.set_index(['A','B'])
temp = df_1.set_index(['A','B'])
df_2.update(temp)
df_2.reset_index(inplace=True)
You can accomplish this using two for loops:
for row in df_2.iterrows():
for row2 in df_1.iterrows():
if [row[1]['A'],row[1]['B']] == [row2[1]['A'],row2[1]['B']]:
df_2['C'].iloc[row[0]] = row2[1]['C']
Just modify your below line:
df_temp = df_1[df_1[['A', 'B']].isin(df_2[['A', 'B']])]
with:
df_1[df_1['A'].isin(df_2['A']) & df_1['B'].isin(df_2['B'])]
It works fine!!

Categories