Python: apply a function to a dataframe by row - python

I don't understand how the "row" argument should be used when creating a function that has other arguments.
I want to create a function which calculates a new column for my dataframe "file".
This works great :
def imputation(row):
    if (row['hour_y']==0) & (row['outlier_idx']==True) :
        val=file['HYDRO'].mean()
    else :
        val=row['HYDRO']
    return val
file['minute_corr'] = file.apply(imputation, axis=1)
But this does not work (I added an argument) :
def imputation(row,variable):
    if (row['hour_y']==0) & (row['outlier_idx']==True) :
        val=file[variable].mean()
    else :
        val=row[variable]
    return val
file['minute_corr'] = file.apply(imputation(,'HYDRO'), axis=1)

Try this vectorized approach:
import numpy as np
file['minute_corr'] = np.where((file['hour_y']==0) & file['outlier_idx'],
                               file['HYDRO'].mean(),
                               file['HYDRO'])

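You can also keep apply and write the logic inline with a lambda. Note that apply still calls the function once per row, so it is not parallel and is usually slower than a vectorized version.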
file['minute_corr'] = file.apply(lambda row: (file['HYDRO'].mean() if (row['hour_y']==0) & (row['outlier_idx']==True) else row['HYDRO'] ), axis=1)

The apply method can take positional and keyword arguments:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html
For the last line, try:
file['minute_corr'] = file.apply(imputation,args=('HYDRO',), axis=1)
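To make the args suggestion concrete, here is a minimal sketch. The sample values are made up for illustration; only the column names and the imputation function come from the question. It shows both ways pandas forwards extra arguments: positionally through args, and as keyword arguments.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
file = pd.DataFrame({
    "hour_y": [0, 0, 1, 2],
    "outlier_idx": [True, False, True, False],
    "HYDRO": [10.0, 20.0, 30.0, 40.0],
})

def imputation(row, variable):
    # Replace flagged midnight outliers with the column mean; keep other values.
    if row["hour_y"] == 0 and row["outlier_idx"]:
        return file[variable].mean()
    return row[variable]

# Extra positional arguments go in `args`; they are passed after the row.
by_args = file.apply(imputation, args=("HYDRO",), axis=1)

# Extra keyword arguments are forwarded to the function as well.
by_kwargs = file.apply(imputation, axis=1, variable="HYDRO")

print(by_args.tolist())
```

Note the trailing comma in ("HYDRO",): without it the parentheses do not create a tuple.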

Related

Python return statement failing to return a list to be written into a pandas DF

For the life of me I cannot figure out why this function is not returning anything. Any insight will be greatly appreciated!
Basically, I create a list of string variables that I want to store in a Pandas DF. I use the DF to pull the values to plug into the function via the .apply() method, but the function yields None for every row in my DF.
def add_combinations_to_directory(comb_tuples, person_id):
    meta_list = []
    for comb in comb_tuples:
        concat_name = generate_normalized_name(comb)
        metaphone_tuple = doublemetaphone(concat_name)
        meta_list.append(metaphone_tuple[0])
        if metaphone_tuple[1] != '':
            meta_list.append(metaphone_tuple[1])
        if metaphone_tuple[0] in __lookup_dict[0]:
            __lookup_dict[0][metaphone_tuple[0]].append(person_id)
        else:
            __lookup_dict[0][metaphone_tuple[0]] = [person_id]
        if metaphone_tuple[1] in __lookup_dict[1]:
            __lookup_dict[1][metaphone_tuple[1]].append(person_id)
        else:
            __lookup_dict[1][metaphone_tuple[1]] = [person_id]
    print(meta_list)
    return meta_list
def add_person_to_lookup_directory(person_id, name_tuple):
    add_combinations_to_directory(name_tuple, person_id)
def create_meta_names(x, id):
    add_person_to_lookup_directory(id, x)
other['Meta_names'] = other.apply(lambda x: create_meta_names(x['Owners'], x['place_id']), axis=1)
Figured it out! It was a problem with the nested functions. The return value from add_combinations_to_directory was handed back to add_person_to_lookup_directory, which did not return it in turn, so it never reached the dataframe.
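A stripped-down sketch of that fix, with hypothetical stand-in functions (innermost, middle_without_return and middle_with_return are not from the original code, and the sample data is invented): every wrapper in the chain has to return the value it receives, otherwise apply only sees None.

```python
import pandas as pd

def innermost(names, person_id):
    # Stand-in for add_combinations_to_directory: builds and returns a list.
    return [f"{person_id}:{name.upper()}" for name in names]

def middle_without_return(person_id, names):
    innermost(names, person_id)          # result is discarded, so this returns None

def middle_with_return(person_id, names):
    return innermost(names, person_id)   # result is passed back to the caller

# Hypothetical data using the column names from the question.
other = pd.DataFrame({
    "Owners": [("ann", "bob"), ("cy",)],
    "place_id": [1, 2],
})

broken = other.apply(lambda x: middle_without_return(x["place_id"], x["Owners"]), axis=1)
fixed = other.apply(lambda x: middle_with_return(x["place_id"], x["Owners"]), axis=1)

print(broken.tolist())
print(fixed.tolist())
```

In the original code this would mean adding return to both add_person_to_lookup_directory and create_meta_names.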

Pandas apply function, receiving KeyError 'Column Name'

My dataset has a column called age and I'm trying to count the null values.
I know it can be easily achieved by doing something like len(df) - df['age'].count(). However, I'm playing around with functions and would like to apply a function to calculate the null count.
Here is what I have:
def age_is_null(df):
    age_col = df['age']
    null = df[age_col].isnull()
    age_null = df[null]
    return len(age_null)
count = df.apply(age_is_null)
print (count)
When I do that, I received an error: KeyError: 'age'.
Can someone tell me why I'm getting that error and what I should change in the code to make it work?
You need DataFrame.pipe, or pass the DataFrame to the function directly:
# the function can be simplified
def age_is_null(df):
    return df['age'].isnull().sum()
count = df.pipe(age_is_null)
print (count)
count = age_is_null(df)
print (count)
The error means that DataFrame.apply iterates over the columns, so each call receives a single column as a Series, and selecting the column age from it fails. You can see what is passed in by printing it:
def func(x):
    print (x)
df.apply(func)
EDIT: To select the column, use the column name:
def age_is_null(df):
    age_col = 'age'  # <- here
    null = df[age_col].isnull()
    age_null = df[null]
    return len(age_null)
Or use the selected column to build the mask:
def age_is_null(df):
    age_col = df['age']
    null = age_col.isnull()  # <- here
    age_null = df[null]
    return len(age_null)
Instead of writing a function, you can try this (shape[0] is the number of rows where age is null):
df[df["age"].isnull()].shape[0]
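You need to pass the dataframe df itself when calling age_is_null, rather than going through apply, which hands the function one column at a time. That's why the age column is not recognised. Note that the mask inside the function also has to be built from the column, as shown in the other answer.
count = age_is_null(df)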
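Putting the answers in this thread together, here is a small self-contained sketch. The sample data is made up; age_is_null is the simplified version from the first answer. It compares a direct call, DataFrame.pipe, and DataFrame.apply, which fails for the reason described above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two missing ages.
df = pd.DataFrame({"age": [25, np.nan, 40, np.nan], "name": ["a", "b", "c", "d"]})

def age_is_null(frame):
    # Count missing values in the 'age' column of a whole dataframe.
    return frame["age"].isnull().sum()

direct = age_is_null(df)          # call the function on the dataframe
piped = df.pipe(age_is_null)      # same call, written as a method chain

# DataFrame.apply hands the function one column (a Series) at a time,
# so looking up 'age' inside that Series raises KeyError.
try:
    df.apply(age_is_null)
    apply_failed = False
except KeyError:
    apply_failed = True

print(direct, piped, apply_failed)
```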

Rename Columns in Pandas Using Lambda Function Rather Than a Function

I'm trying to rename column headings in my dataframe in pandas using .rename().
Basically, the headings are :
column 1: "Country name[9]"
column 2: "Official state name[5]"
#etc.
I need to remove [number].
I can do that with a function:
def column(string):
    for x, v in enumerate(string):
        if v == '[':
            return string[:x]
But I wanted to know how to convert this to a lambda function so that I can use
df.rename(columns = lambda x: do same as function)
I've never used lambda functions before so I'm not sure of the syntax to get it to work correctly.
First you would have to create a function which always returns the new or the old value - never None.
def column(name):
    if '[' in name:
        return name[:name.index('[')] # new - with change
    else:
        return name # old - without change
and then you can use it as
df.rename(columns=lambda name: column(name))
or even simpler
df.rename(columns=column)
Or you can convert your function to a real lambda
df.rename(columns=(lambda name: name[:name.index('[')] if '[' in name else name) )
but sometimes it is more readable to keep def column(name) and use columns=column. And not all constructions can be used in a lambda - e.g. you can't assign a value to a variable with an assignment statement (I don't know if you can use the new operator := ("walrus") from Python 3.8).
Minimal working code
import pandas as pd
data = {
    'Country name[9]': [1,2,3],
    'Official state name[5]': [4,5,6],
    'Other': [7,8,9],
}
df = pd.DataFrame(data)
def column(name):
    if '[' in name:
        return name[:name.index('[')]
    else:
        return name
print(df.columns)
df = df.rename(columns=column)
# or
df = df.rename(columns=(lambda name: name[:name.index('[')] if '[' in name else name) )
print(df.columns)

Method Chaining in Pandas: str.replace not working

I would like to read in an Excel file and, using method chaining, convert the column names to lower case and replace any white space with _. The following code runs fine
def supp_read(number):
    filename = f"supplemental-table{number}.xlsx"
    df = (pd.read_excel(filename,skiprows=5)
          .rename(columns = str.lower))
    return df
But the code below does not
def supp_read(number):
    filename = f"supplemental-table{number}.xlsx"
    df = (pd.read_excel(filename,skiprows=5)
          .rename(columns = str.lower)
          .rename(columns = str.replace(old=" ",new="_")))
    return df
After adding the str.replace line I get the following error: No value for argument 'self' in unbound method call. Can someone shed some light on what I can do to fix this error and why the above does not work?
In addition, when I use str.lower() I get the same error. Why does str.lower work but not str.lower()?
Here's a different syntax which I frequently use:
def supp_read(number):
    filename = f"supplemental-table{number}.xlsx"
    df = pd.read_excel(filename,skiprows=5)
    df.columns = df.columns.str.lower().str.replace(" ", "_")
    return df
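On the "why" part of the question: rename(columns=...) expects a mapping or a callable that it can call once per column label. str.lower is the function object itself, so pandas can call it with each label. str.lower() and str.replace(old=" ", new="_") call the method immediately, with no string to operate on, which is what the error complains about. If you want to stay with method chaining, a lambda can supply the extra arguments. A minimal sketch, using a small made-up frame in place of the Excel file:

```python
import pandas as pd

# Hypothetical frame standing in for the Excel sheet in the question.
raw = pd.DataFrame({"Gene Name": [1], "P Value": [0.5]})

df = (raw
      .rename(columns=str.lower)                            # function object, called once per label
      .rename(columns=lambda name: name.replace(" ", "_"))  # lambda supplies the extra arguments
      )

print(list(df.columns))
```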

passing a variable into apply() in pandas

I am having trouble getting the syntax right for applying a function to a dataframe. I am trying to create a new column in a dataframe by joining the strings in two other columns, passing in a separator. I get the error
TypeError: ("apply_join() missing 1 required positional argument: 'sep'", 'occurred at index cases')
If I add sep to the apply_join() function call, that also fails:
File "unite.py", line 37, in unite
tibble_extra = df[cols].apply(apply_join, sep)
NameError: name 'sep' is not defined
import pandas as pd
from io import StringIO
tibble3_csv = """country,year,cases,population
Afghanistan,1999,745,19987071
Afghanistan,2000,2666,20595360
Brazil,1999,37737,172006362
Brazil,2000,80488,174504898
China,1999,212258,1272915272
China,2000,213766,1280428583"""
with StringIO(tibble3_csv) as fp:
    tibble3 = pd.read_csv(fp)
print(tibble3)
def str_join_elements(x, sep=""):
    assert type(sep) is str
    return sep.join((str(xi) for xi in x))
def unite(df, cols, new_var, combine=str_join_elements):
    def apply_join(x, sep):
        joinstr = str_join(x, sep)
        return pd.Series({new_var[i]:s for i, s in enumerate(joinstr)})
    fixed_vars = df.columns.difference(cols)
    tibble = df[fixed_vars].copy()
    tibble_extra = df[cols].apply(apply_join)
    return pd.concat([tibble, tibble_extra], axis=1)
table3_again = unite(tibble3, ['cases', 'population'], 'rate', combine=lambda x: str_join_elements(x, "/"))
print(table3_again)
Use a lambda when you have multiple parameters, i.e.
df[cols].apply(lambda x: apply_join(x,sep),axis=1)
Or pass the parameters with the help of the args parameter, i.e.
df[cols].apply(apply_join,args=(sep,),axis=1)
You can just add it to the apply call as a keyword argument:
tibble_extra = df[cols].apply(apply_join, sep=...)
Also, you should specify the axis. It may work without it, but it's a good habit that prevents errors. axis=1 ('columns') passes each row to the function; axis=0 ('index', the default) passes each column:
tibble_extra = df[cols].apply(apply_join, sep=..., axis=1)
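For reference, here is a simplified sketch of how unite could look once sep is actually a parameter and is forwarded through apply. This is not the asker's exact design: the nested apply_join and the combine argument are dropped, sep gets a default of "/" (both are assumptions made for illustration), and only two rows of the sample data are used.

```python
import pandas as pd

# Two rows taken from the question's sample data.
tibble3 = pd.DataFrame({
    "country": ["Afghanistan", "Brazil"],
    "year": [1999, 1999],
    "cases": [745, 37737],
    "population": [19987071, 172006362],
})

def str_join_elements(x, sep=""):
    # Join the values of one row into a single string.
    return sep.join(str(xi) for xi in x)

def unite(df, cols, new_var, sep="/"):
    # Columns that are not being united are kept as they are.
    fixed_vars = df.columns.difference(cols)
    tibble = df[fixed_vars].copy()
    # `sep` is forwarded by apply as a keyword argument; axis=1 passes rows.
    tibble[new_var] = df[cols].apply(str_join_elements, axis=1, sep=sep)
    return tibble

table3_again = unite(tibble3, ["cases", "population"], "rate")
print(table3_again)
```

The NameError in the question came from sep not being defined anywhere in unite, so whichever way you pass it to apply, it first has to exist as a parameter or local variable.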
