Combine if statement with apply in python - python

New to python. I am trying to figure out the best way to create a column based on other columns. Ideally, the code would be as such.
df['new'] = np.where(df['Country'] == 'CA', df['x'], df['y'])
I do not think this works because it thinks that I am calling the entire column. I tried to do the same thing with apply but was having trouble with syntax.
df['my_col'] = df.apply(
lambda row:
if row.country == 'CA':
row.my_col == row.x
else:
row.my_col == row.y
I feel like there must be an easier way.

Any of these three approaches (np.where, apply, mask) seems to work:
df['where'] = np.where(df.country=='CA', df.x, df.y)
df['apply'] = df.apply(lambda row: row.x if row.country == 'CA' else row.y, axis=1)
mask = df.country=='CA'
df.loc[mask, 'mask'] = df.loc[mask, 'x']
df.loc[~mask, 'mask'] = df.loc[~mask, 'y']
Full test code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'country':['CA','US','CA','UK','CA'], 'x':[1,2,3,4,5], 'y':[6,7,8,9,10]})
print(df)
df['where'] = np.where(df.country=='CA', df.x, df.y)
df['apply'] = df.apply(lambda row: row.x if row.country == 'CA' else row.y, axis=1)
mask = df.country=='CA'
df.loc[mask, 'mask'] = df.loc[mask, 'x']
df.loc[~mask, 'mask'] = df.loc[~mask, 'y']
print(df)
Input:
country x y
0 CA 1 6
1 US 2 7
2 CA 3 8
3 UK 4 9
4 CA 5 10
Output
country x y where apply mask
0 CA 1 6 1 1 1.0
1 US 2 7 7 7 7.0
2 CA 3 8 3 3 3.0
3 UK 4 9 9 9 9.0
4 CA 5 10 5 5 5.0

This might also work for you
data = {
'Country' : ['CA', 'NY', 'NC', 'CA'],
'x' : ['x_column', 'x_column', 'x_column', 'x_column'],
'y' : ['y_column', 'y_column', 'y_column', 'y_column']
}
df = pd.DataFrame(data)
condition_list = [df['Country'] == 'CA']
choice_list = [df['x']]
df['new'] = np.select(condition_list, choice_list, df['y'])
df
Your np.where() looked fine though so I would double check that your columns are labeled correctly.

Related

Fill NA values in dataframe by mapping a double indexed groupby object [duplicate]

This should be straightforward, but the closest thing I've found is this post:
pandas: Filling missing values within a group, and I still can't solve my problem....
Suppose I have the following dataframe
df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
and I'd like to fill in "NaN" with mean value in each "name" group, i.e.
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
I'm not sure where to go after:
grouped = df.groupby('name').mean()
Thanks a bunch.
One way would be to use transform:
>>> df
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
>>> df
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
fillna + groupby + transform + mean
This seems intuitive:
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to #DSM's solution, but avoids the need to define an anonymous lambda function.
#DSM has IMO the right answer, but I'd like to share my generalization and optimization of the question: Multiple columns to group-by and having multiple value columns:
df = pd.DataFrame(
{
'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
'name': ['A','A', 'B','B','B','B', 'C','C','C'],
'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
}
)
... gives ...
category name other_value value
0 X A 10.0 1.0
1 X A NaN NaN
2 X B NaN NaN
3 X B 20.0 2.0
4 X B 30.0 3.0
5 X B 10.0 1.0
6 Y C 30.0 3.0
7 Y C NaN NaN
8 Y C 30.0 3.0
In this generalized case we would like to group by category and name, and impute only on value.
This can be solved as follows:
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation only be run on that particular column. You could add it to the end, but then you will run it for all columns only to throw out all but one measure column at the end. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to do this.
Performance test by increasing the dataset by doing ...
big_df = None
for _ in range(10000):
if big_df is None:
big_df = df.copy()
else:
big_df = pd.concat([big_df, df])
df = big_df
... confirms that this increases the speed proportional to how many columns you don't have to impute:
import pandas as pd
from datetime import datetime
def generate_data():
...
t = datetime.now()
df = generate_data()
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
print(datetime.now()-t)
# 0:00:00.016012
t = datetime.now()
df = generate_data()
df["value"] = df.groupby(['category', 'name'])\
.transform(lambda x: x.fillna(x.mean()))['value']
print(datetime.now()-t)
# 0:00:00.030022
On a final note you can generalize even further if you want to impute more than one column, but not all:
df[['value', 'other_value']] = df.groupby(['category', 'name'])['value', 'other_value']\
.transform(lambda x: x.fillna(x.mean()))
Shortcut:
Groupby + Apply + Lambda + Fillna + Mean
>>> df['value1']=df.groupby('name')['value'].apply(lambda x:x.fillna(x.mean()))
>>> df.isnull().sum().sum()
0
This solution still works if you want to group by multiple columns to replace missing values.
>>> df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, np.nan,np.nan, 4, 3],
'name': ['A','A', 'B','B','B','B', 'C','C','C'],'class':list('ppqqrrsss')})
>>> df['value']=df.groupby(['name','class'])['value'].apply(lambda x:x.fillna(x.mean()))
>>> df
value name class
0 1.0 A p
1 1.0 A p
2 2.0 B q
3 2.0 B q
4 3.0 B r
5 3.0 B r
6 3.5 C s
7 4.0 C s
8 3.0 C s
I'd do it this way
df.loc[df.value.isnull(), 'value'] = df.groupby('group').value.transform('mean')
The featured high ranked answer only works for a pandas Dataframe with only two columns. If you have a more columns case use instead:
df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
lambda x: x.fillna(x.mean()))
To summarize all above concerning the efficiency of the possible solution
I have a dataset with 97 906 rows and 48 columns.
I want to fill in 4 columns with the median of each group.
The column I want to group has 26 200 groups.
The first solution
start = time.time()
x = df_merged[continuous_variables].fillna(df_merged.groupby('domain_userid')[continuous_variables].transform('median'))
print(time.time() - start)
0.10429811477661133 seconds
The second solution
start = time.time()
for col in continuous_variables:
df_merged.loc[df_merged[col].isnull(), col] = df_merged.groupby('domain_userid')[col].transform('median')
print(time.time() - start)
0.5098445415496826 seconds
The next solution I only performed on a subset since it was running too long.
start = time.time()
for col in continuous_variables:
x = df_merged.head(10000).groupby('domain_userid')[col].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
11.685635566711426 seconds
The following solution follows the same logic as above.
start = time.time()
x = df_merged.head(10000).groupby('domain_userid')[continuous_variables].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
42.630549907684326 seconds
So it's quite important to choose the right method.
Bear in mind that I noticed once a column was not a numeric the times were going up exponentially (makes sense as I was computing the median).
def groupMeanValue(group):
group['value'] = group['value'].fillna(group['value'].mean())
return group
dft = df.groupby("name").transform(groupMeanValue)
I know that is an old question. But I am quite surprised by the unanimity of apply/lambda answers here.
Generally speaking, that is the second worst thing to do after iterating rows, from timing point of view.
What I would do here is
df.loc[df['value'].isna(), 'value'] = df.groupby('name')['value'].transform('mean')
Or using fillna
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
I've checked with timeit (because, again, unanimity for apply/lambda based solution made me doubt my instinct). And that is indeed 2.5 faster than the most upvoted solutions.
To fill all the numeric null values with the mean grouped by "name"
num_cols = df.select_dtypes(exclude='object').columns
df[num_cols] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
df.fillna(df.groupby(['name'], as_index=False).mean(), inplace=True)
You can also use "dataframe or table_name".apply(lambda x: x.fillna(x.mean())).

Sum columns based off conditionals - pandas

I'm aiming to sum specific columns in a df where a condition is met. Where Group == Group_A, I want to sum A_4','B_4. However, where Group == Group_B, I want to pass the sum of A_1','B_1 to the same column. I need to pass the function at the same time otherwise I end up with nan values.
df = pd.DataFrame({
'Group_A' : ['Red','Red','Red','Red','Red',],
'Group_B' : ['Blue','Blue','Blue','Blue','Blue',],
'Group' : ['Blue','Blue','Blue','Red','Blue',],
'A_1' : [7,6,8,0,4],
'B_1' : [6,7,11,1,4],
'A_4' : [1,1,1,6,4],
'B_4' : [3,3,3,9,5],
})
df['Sum'] = df.loc[df['Group'] == df['Group_A'],['A_4','B_4']].sum(axis=1)
df['Sum'] = df.loc[df['Group'] == df['Group_B'],['A_1','B_1']].sum(axis=1)
intended output:
Group_A Group_B Group A_1 B_1 A_4 B_4 Sum
0 Red Blue Blue 7 6 1 3 13.0
1 Red Blue Blue 6 7 1 3 13.0
2 Red Blue Blue 8 11 1 3 19.0
3 Red Blue Red 0 1 6 9 15.0
4 Red Blue Blue 4 4 4 5 8.0
Update:
Could subtraction replace the sum method without drawing an error?
sub1 = df.loc[df['Group'] == df['Group_A'], df['A_4'].sub(df['B_4'])]
sub2 = df.loc[df['Group'] == df['Group_B'], df['A_1'].sub(df['B_1'])]
df['Sub'] = Sum1.append(Sum2)
Another option, save the sum to series, and then update the dataframe:
Sum1 = df.loc[df['Group'] == df['Group_A'],['A_4','B_4']].sum(axis=1)
Sum2 = df.loc[df['Group'] == df['Group_B'],['A_1','B_1']].sum(axis=1)
df['Sum']=Sum1.append(Sum2)
as mentioned in comments, if you want to subtract, let's say Bs from As, then you can do something like this:
Sum1 = df.loc[df['Group'] == df['Group_A'],['A_4','B_4']].apply(lambda x: x['A_4'] - x['B_4'],axis=1)
Sum2 = df.loc[df['Group'] == df['Group_B'],['A_1','B_1']].apply(lambda x: x['A_1'] - x['B_1'],axis=1)
df['Sum']=Sum1.append(Sum2)
you can use np.select here which is usually much faster:
import numpy as np
condlist = [
df.Group.eq(df.Group_A),
df.Group.eq(df.Group_B)
]
choicelist = [
df.filter(like='_4').sum(1),
df.filter(like='_1').sum(1)
]
df['sum'] = np.select(condlist, choicelist)
NOTE: Replace filter with required columns if required. And if you've simple else-if condition np.where will also work.
np.where() is tailor made for such scenarios
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Group_A' : ['Red','Red','Red','Red','Red',],
'Group_B' : ['Blue','Blue','Blue','Blue','Blue',],
'Group' : ['Blue','Blue','Blue','Red','Blue',],
'A_1' : [7,6,8,0,4],
'B_1' : [6,7,11,1,4],
'A_4' : [1,1,1,6,4],
'B_4' : [3,3,3,9,5],
})
df['Sum'] = np.where(df["Group"]==df["Group_A"],df["A_4"]+df["B_4"], df["A_1"]+df["B_1"])

Python : Remove all data in a column of a dataframe and keep the last value in the first row

Let's say that I have a simple Dataframe.
import pandas as pd
data1 = [12,34,'fsdf',678,'','','dfs','','']
df1 = pd.DataFrame(data1, columns= ['Data'])
print(df1)
Data
0 12
1 34
2 fsdf
3 678
4
5
6 dfs
7
8
I want to delete all the data except the last value found in the column that I want to keep in the first row. It can be an column with thousands of rows. So I would like the result :
Data
0 dfs
1
2
3
4
5
6
7
8
And I have to keep the shape of this dataframe, so not removing rows.
What are the simplest functions to do that efficiently ?
Thank you
Get index of last not empty string value and pass to first value of column:
s = df1.loc[df1['Data'].iloc[::-1].ne('').idxmax(), 'Data']
print (s)
dfs
df1['Data'] = ''
df1.loc[0, 'Data'] = s
print (df1)
Data
0 dfs
1
2
3
4
5
6
7
8
If empty strings are missing values:
data1 = [12,34,'fsdf',678,np.nan,np.nan,'dfs',np.nan,np.nan]
df1 = pd.DataFrame(data1, columns= ['Data'])
print(df1)
Data
0 12
1 34
2 fsdf
3 678
4 NaN
5 NaN
6 dfs
7 NaN
8 NaN
s = df1.loc[df1['Data'].iloc[::-1].notna().idxmax(), 'Data']
print (s)
dfs
df1['Data'] = ''
df1.loc[0, 'Data'] = s
print (df1)
Data
0 dfs
1
2
3
4
5
6
7
8
A simple pandas condition check like this can help,
df1['Data'] = [df1.loc[df1['Data'].ne(""), "Data"].iloc[-1]] + [''] * (len(df1) - 1)
You can replace '' with NaN using df.replace, now use df.last_valid_index
val = df1.loc[df1.replace('', np.nan).last_valid_index(), 'Data']
# Below two lines taken from #jezrael's answer
df1.loc[0, 'Data'] = val
df1.loc[1:, 'Data'] = ''
Or
You can use np.full with fill_value set to np.nan here.
val = df1.loc[df1.replace("", np.nan).last_valid_index(), "Data"]
df1 = pd.DataFrame(np.full(df1.shape, np.nan),
index=df.index,
columns=df1.columns)
df1.loc[0, "Data"] = val

Fillna in pandas with respect to similar line [duplicate]

This should be straightforward, but the closest thing I've found is this post:
pandas: Filling missing values within a group, and I still can't solve my problem....
Suppose I have the following dataframe
df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
and I'd like to fill in "NaN" with mean value in each "name" group, i.e.
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
I'm not sure where to go after:
grouped = df.groupby('name').mean()
Thanks a bunch.
One way would be to use transform:
>>> df
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
>>> df
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
fillna + groupby + transform + mean
This seems intuitive:
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to #DSM's solution, but avoids the need to define an anonymous lambda function.
#DSM has IMO the right answer, but I'd like to share my generalization and optimization of the question: Multiple columns to group-by and having multiple value columns:
df = pd.DataFrame(
{
'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
'name': ['A','A', 'B','B','B','B', 'C','C','C'],
'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
}
)
... gives ...
category name other_value value
0 X A 10.0 1.0
1 X A NaN NaN
2 X B NaN NaN
3 X B 20.0 2.0
4 X B 30.0 3.0
5 X B 10.0 1.0
6 Y C 30.0 3.0
7 Y C NaN NaN
8 Y C 30.0 3.0
In this generalized case we would like to group by category and name, and impute only on value.
This can be solved as follows:
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation only be run on that particular column. You could add it to the end, but then you will run it for all columns only to throw out all but one measure column at the end. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to do this.
Performance test by increasing the dataset by doing ...
big_df = None
for _ in range(10000):
if big_df is None:
big_df = df.copy()
else:
big_df = pd.concat([big_df, df])
df = big_df
... confirms that this increases the speed proportional to how many columns you don't have to impute:
import pandas as pd
from datetime import datetime
def generate_data():
...
t = datetime.now()
df = generate_data()
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
print(datetime.now()-t)
# 0:00:00.016012
t = datetime.now()
df = generate_data()
df["value"] = df.groupby(['category', 'name'])\
.transform(lambda x: x.fillna(x.mean()))['value']
print(datetime.now()-t)
# 0:00:00.030022
On a final note you can generalize even further if you want to impute more than one column, but not all:
df[['value', 'other_value']] = df.groupby(['category', 'name'])['value', 'other_value']\
.transform(lambda x: x.fillna(x.mean()))
Shortcut:
Groupby + Apply + Lambda + Fillna + Mean
>>> df['value1']=df.groupby('name')['value'].apply(lambda x:x.fillna(x.mean()))
>>> df.isnull().sum().sum()
0
This solution still works if you want to group by multiple columns to replace missing values.
>>> df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, np.nan,np.nan, 4, 3],
'name': ['A','A', 'B','B','B','B', 'C','C','C'],'class':list('ppqqrrsss')})
>>> df['value']=df.groupby(['name','class'])['value'].apply(lambda x:x.fillna(x.mean()))
>>> df
value name class
0 1.0 A p
1 1.0 A p
2 2.0 B q
3 2.0 B q
4 3.0 B r
5 3.0 B r
6 3.5 C s
7 4.0 C s
8 3.0 C s
I'd do it this way
df.loc[df.value.isnull(), 'value'] = df.groupby('group').value.transform('mean')
The featured high ranked answer only works for a pandas Dataframe with only two columns. If you have a more columns case use instead:
df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
lambda x: x.fillna(x.mean()))
To summarize all above concerning the efficiency of the possible solution
I have a dataset with 97 906 rows and 48 columns.
I want to fill in 4 columns with the median of each group.
The column I want to group has 26 200 groups.
The first solution
start = time.time()
x = df_merged[continuous_variables].fillna(df_merged.groupby('domain_userid')[continuous_variables].transform('median'))
print(time.time() - start)
0.10429811477661133 seconds
The second solution
start = time.time()
for col in continuous_variables:
df_merged.loc[df_merged[col].isnull(), col] = df_merged.groupby('domain_userid')[col].transform('median')
print(time.time() - start)
0.5098445415496826 seconds
The next solution I only performed on a subset since it was running too long.
start = time.time()
for col in continuous_variables:
x = df_merged.head(10000).groupby('domain_userid')[col].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
11.685635566711426 seconds
The following solution follows the same logic as above.
start = time.time()
x = df_merged.head(10000).groupby('domain_userid')[continuous_variables].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
42.630549907684326 seconds
So it's quite important to choose the right method.
Bear in mind that I noticed once a column was not a numeric the times were going up exponentially (makes sense as I was computing the median).
def groupMeanValue(group):
group['value'] = group['value'].fillna(group['value'].mean())
return group
dft = df.groupby("name").transform(groupMeanValue)
I know that is an old question. But I am quite surprised by the unanimity of apply/lambda answers here.
Generally speaking, that is the second worst thing to do after iterating rows, from timing point of view.
What I would do here is
df.loc[df['value'].isna(), 'value'] = df.groupby('name')['value'].transform('mean')
Or using fillna
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
I've checked with timeit (because, again, unanimity for apply/lambda based solution made me doubt my instinct). And that is indeed 2.5 faster than the most upvoted solutions.
To fill all the numeric null values with the mean grouped by "name"
num_cols = df.select_dtypes(exclude='object').columns
df[num_cols] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
df.fillna(df.groupby(['name'], as_index=False).mean(), inplace=True)
You can also use "dataframe or table_name".apply(lambda x: x.fillna(x.mean())).

Vectorizing a multiplication and dict mapping on a Pandas DataFrame without iterating?

I have a Pandas DataFrame, df:
import pandas as pd
import numpy as np
import math
df = pd.DataFrame({'A':[1,2,2,4,np.nan],'B':[1,2,3,4,5]})
and a dict, mask:
mask = {1:32,2:64,3:100,4:200}
I want my end result to be a DataFrame like this:
A B C
1 1 32
2 2 64
2 3 96
4 4 400
nan nan nan
Right now I am doing this, which seems innefficient:
for idx, row in df.iterrows():
if not math.isnan(row['A']):
if row['A'] != 1:
df.loc[idx, 'C'] = row['B'] * mask[row['A'] - 1]
else:
df.loc[idx, 'C'] = row['B'] * mask[row['A']]
Is there an easy way to vectorize this?
This should work:
df['C'] = df.B * (df.A - (df.A != 1)).map(mask)
Timing
10,000 rows
# Initialize each run with
df = pd.DataFrame({'A':[1,2,2,4,np.nan],'B':[1,2,3,4,5]})
df = pd.concat([df for _ in range(2000)])
100,000 rows
# Initialize each run with
df = pd.DataFrame({'A':[1,2,2,4,np.nan],'B':[1,2,3,4,5]})
df = pd.concat([df for _ in range(20000)])
Here is an option using apply, and the get method for dictionary which returns None if the key is not in the dictionary:
df['C'] = df.apply(lambda r: mask.get(r.A) if r.A == 1 else mask.get(r.A - 1), axis = 1) * df.B
df
# A B C
#0 1 1 32
#1 2 2 64
#2 2 3 96
#3 4 4 400
#4 NaN 5 NaN

Categories