passing a variable into apply() in pandas - python

I am having trouble getting the syntax right for applying a function to a dataframe. I am trying to create a new column in a dataframe by joining the strings in two other columns, passing in a separator. I get the error
TypeError: ("apply_join() missing 1 required positional argument: 'sep'", 'occurred at index cases')
If I add sep to the apply_join() function call, that also fails:
File "unite.py", line 37, in unite
tibble_extra = df[cols].apply(apply_join, sep)
NameError: name 'sep' is not defined
import pandas as pd
from io import StringIO

tibble3_csv = """country,year,cases,population
Afghanistan,1999,745,19987071
Afghanistan,2000,2666,20595360
Brazil,1999,37737,172006362
Brazil,2000,80488,174504898
China,1999,212258,1272915272
China,2000,213766,1280428583"""

with StringIO(tibble3_csv) as fp:
    tibble3 = pd.read_csv(fp)
print(tibble3)

def str_join_elements(x, sep=""):
    assert type(sep) is str
    return sep.join(str(xi) for xi in x)

def unite(df, cols, new_var, combine=str_join_elements):
    def apply_join(x, sep):
        joinstr = str_join(x, sep)
        return pd.Series({new_var[i]: s for i, s in enumerate(joinstr)})
    fixed_vars = df.columns.difference(cols)
    tibble = df[fixed_vars].copy()
    tibble_extra = df[cols].apply(apply_join)  # fails here: apply_join never receives sep
    return pd.concat([tibble, tibble_extra], axis=1)

table3_again = unite(tibble3, ['cases', 'population'], 'rate',
                     combine=lambda x: str_join_elements(x, "/"))
print(table3_again)

Use a lambda when you need to pass extra parameters, i.e.
df[cols].apply(lambda x: apply_join(x, sep), axis=1)
Or pass the parameters through the args parameter, i.e.
df[cols].apply(apply_join, args=[sep], axis=1)
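For instance, a minimal self-contained demo of both forms (the frame and the join_row helper are made up for illustration):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

def join_row(row, sep):
    return sep.join(str(v) for v in row)

# both calls produce the same result: '1/3' and '2/4'
print(df.apply(lambda row: join_row(row, '/'), axis=1))
print(df.apply(join_row, args=('/',), axis=1))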

You just add it into the apply statement:
tibble_extra = df[cols].apply(apply_join, sep=...)
Also, you should specify the axis. It may work without it, but it's a good habit that prevents errors:
tibble_extra = df[cols].apply(apply_join, sep=..., axis=1)  # axis=1 applies the function to each row; axis=0 (the default) to each column
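Putting the pieces together, here is a sketch of unite that passes the separator through; it applies combine directly to each row instead of the inner apply_join, which is an assumption about the intent:

def unite(df, cols, new_var, combine=str_join_elements):
    fixed_vars = df.columns.difference(cols)
    tibble = df[fixed_vars].copy()
    # combine receives each row of the selected columns as a Series
    tibble[new_var] = df[cols].apply(combine, axis=1)
    return tibble

table3_again = unite(tibble3, ['cases', 'population'], 'rate',
                     combine=lambda x: str_join_elements(x, "/"))
print(table3_again)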

Related

TypeError: Invalid argument, not a string or column

I am trying to do some source-to-target testing in PySpark. The first part is a count check on each column, using a Lean Six Sigma threshold of fewer than 3 discrepancies per 1,000,000 rows. When I run this, though, the if statement throws:
TypeError: Invalid argument, not a string or column: -276244 of type <class 'int'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
Could anyone help?
import pyspark.sql.functions as f
from pyspark.sql.types import *

good_fields = []
bad_fields = {}
count_issues = {}
columns = list(spark.sql('show columns from tu_historical').toPandas()['col_name'])

for col in columns:
    print(col)
    df = spark.sql(f'select pid,fnum,{col} from historical_clean')
    df1 = spark.sql(f'select pid,fnum,{col} from historical1')
    # count issue testing
    if abs(df1.count() - df.count()) > df1.count() * .000003:
        count_issues[col] = df1.count() - df.count()
    test_df = df.join(df1, (df.num == df1.file) & (df1.pid == df.pid), 'left').filter(df1[col] != df[col])
It seems like your columns list has a funky value in it.
You may want to get the column names this way instead:
columns = spark.sql('select * from tu_historical limit 0').columns
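A quick diagnostic sketch (assuming a live spark session and the question's table name) to see what the 'show columns' query returned that the schema itself does not contain:

raw = list(spark.sql('show columns from tu_historical').toPandas()['col_name'])
clean = spark.sql('select * from tu_historical limit 0').columns
print(set(raw) - set(clean))  # anything printed here is a suspect entry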

Method Chaining in Pandas: str.replace not working

I would like to read in an Excel file and, using method chaining, convert the column names to lower case and replace any whitespace with _. The following code runs fine:
def supp_read(number):
    filename = f"supplemental-table{number}.xlsx"
    df = (pd.read_excel(filename, skiprows=5)
          .rename(columns=str.lower))
    return df
But the code below does not
def supp_read(number):
    filename = f"supplemental-table{number}.xlsx"
    df = (pd.read_excel(filename, skiprows=5)
          .rename(columns=str.lower)
          .rename(columns=str.replace(old=" ", new="_")))
    return df
After adding the str.replace line I get the following error: No value for argument 'self' in unbound method call. Can someone shed some light on what I can do to fix this error and why the above does not work?
In addition, when I use str.lower() I get the same error. Why does str.lower work but not str.lower()?
Here's a different syntax which I frequently use:
def supp_read(number):
    filename = f"supplemental-table{number}.xlsx"
    df = pd.read_excel(filename, skiprows=5)
    # note the second .str: Index.str.replace does substring replacement
    df.columns = df.columns.str.lower().str.replace(" ", "_")
    return df
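If you want to keep everything in a single method chain, a lambda sidesteps the unbound-method problem (a sketch reusing the question's filenames):

def supp_read(number):
    filename = f"supplemental-table{number}.xlsx"
    return (pd.read_excel(filename, skiprows=5)
            .rename(columns=lambda c: c.lower().replace(" ", "_")))

This also explains the str.lower vs str.lower() puzzle: rename wants a callable it can apply to each label. str.lower is such a callable, whereas str.lower() (and str.replace(old=" ", new="_")) tries to call the method immediately, with no string for it to operate on, hence the complaint about 'self'.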

How can I modify a Pandas DataFrame by reference?

I'm trying to write a Python function that does One-Hot encoding in-place but I'm having trouble finding a way to do a concat operation in-place at the end. It appears to make a copy of my DataFrame for the concat output and I am unable to assign this to my DataFrame that I passed by reference.
How can this be done?
def one_hot_encode(df, col: str):
    """One-Hot encode inplace. Includes NAN.

    Keyword arguments:
    df (DataFrame) -- the DataFrame object to modify
    col (str) -- the column name to encode
    """
    insert_loc = df.columns.get_loc(col)
    insert_data = pd.get_dummies(df[col], prefix=col + '_', dummy_na=True)
    df.drop(col, axis=1, inplace=True)
    df[:] = pd.concat([df.iloc[:, :insert_loc], insert_data, df.iloc[:, insert_loc:]], axis=1)  # doesn't take effect outside the function
I don't think you can pass function arguments by reference in Python (see: How do I pass a variable by reference?).
Instead, what you can do is return the modified df from your function and assign the result back to the original df:
def one_hot_encode(df, col: str):
    ...
    return df
...
df = one_hot_encode(df, col)
To make the change take effect outside the function, we have to mutate the object that was passed in rather than rebind its name (inside the function) to a new object.
To assign the new columns, you can use
df[insert_data.columns] = insert_data
instead of the concat.
That doesn't take advantage of your careful insert order, though.
To retain your order, we can reindex the data frame:
df.reindex(columns=cols)
where cols is the combined list of columns in order:
cols = cols[:insert_loc] + list(insert_data.columns) + cols[insert_loc:]
(Caveat: reindex returns a new frame rather than reordering in place, so on its own this does not change the caller's column order; the insert-based version below avoids the problem.)
Putting it all together,

import pandas as pd

def one_hot_encode(df, col: str):
    """One-Hot encode inplace. Includes NAN.

    Keyword arguments:
    df (DataFrame) -- the DataFrame object to modify
    col (str) -- the column name to encode
    """
    cols = list(df.columns)
    insert_loc = df.columns.get_loc(col)
    insert_data = pd.get_dummies(df[col], prefix=col + '_', dummy_na=True)
    cols = cols[:insert_loc] + list(insert_data.columns) + cols[insert_loc:]
    df[insert_data.columns] = insert_data
    df.reindex(columns=cols)  # returns a copy; the reordering is not kept in place
    df.drop(col, axis=1, inplace=True)

import seaborn
diamonds = seaborn.load_dataset("diamonds")
col = "color"
one_hot_encode(diamonds, "color")
assert "color" not in diamonds.columns
assert len([c for c in diamonds.columns if c.startswith("color")]) == 8
df.insert is in-place, but it can only insert one column at a time. It might not be worth it just for the ordering.
def one_hot_encode2(df, col: str):
    """One-Hot encode inplace. Includes NAN.

    Keyword arguments:
    df (DataFrame) -- the DataFrame object to modify
    col (str) -- the column name to encode
    """
    insert_loc = df.columns.get_loc(col)
    insert_data = pd.get_dummies(df[col], prefix=col + '_', dummy_na=True)
    for offset, newcol in enumerate(insert_data.columns):
        # insert expects a 1-D value, hence the single-bracket selection
        df.insert(loc=insert_loc + offset, column=newcol, value=insert_data[newcol])
    df.drop(col, axis=1, inplace=True)

import seaborn
diamonds = seaborn.load_dataset("diamonds")
col = "color"
one_hot_encode2(diamonds, "color")
assert "color" not in diamonds.columns
assert len([c for c in diamonds.columns if c.startswith("color")]) == 8
assert [i for i, c in enumerate(diamonds.columns) if c.startswith("color")][0] == 2
A function's variables are scoped to that function. Simply include a return statement at the end of the function, and calling it will return your modified dataframe. Also, when assigning the new (dummy) columns, use df rather than df[:], since you are changing the dimensions of the original dataframe.
def one_hot_encode(df, col: str):
    insert_loc = df.columns.get_loc(col)
    insert_data = pd.get_dummies(df[col], prefix=col + '_', dummy_na=True)
    df.drop(col, axis=1, inplace=True)
    df = pd.concat([df.iloc[:, :insert_loc], insert_data, df.iloc[:, insert_loc:]], axis=1)
    return df
Now to see the modified dataframe, call the function and assign it to a new/existing dataframe as below
df = one_hot_encode(df, '<any column name>')

Python: apply a function on a dataframe by row

I don't understand how the "row" argument should be used when creating a function, when the function has other arguments.
I want to create a function which calculates a new column for my dataframe "file".
This works great :
def imputation(row):
    if (row['hour_y'] == 0) & (row['outlier_idx'] == True):
        val = file['HYDRO'].mean()
    else:
        val = row['HYDRO']
    return val

file['minute_corr'] = file.apply(imputation, axis=1)
But this does not work (I added an argument) :
def imputation(row, variable):
    if (row['hour_y'] == 0) & (row['outlier_idx'] == True):
        val = file[variable].mean()
    else:
        val = row[variable]
    return val

file['minute_corr'] = file.apply(imputation(,'HYDRO'), axis=1)
Try this vectorized approach:
file['minute_corr'] = np.where((file['hour_y'] == 0) & file['outlier_idx'],
                               file['HYDRO'].mean(),
                               file['HYDRO'])
The same logic can also be written with apply:
file['minute_corr'] = file.apply(lambda row: file['HYDRO'].mean() if (row['hour_y'] == 0) & (row['outlier_idx'] == True) else row['HYDRO'], axis=1)
The apply method can take positional and keyword arguments:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html
For the last line, try:
file['minute_corr'] = file.apply(imputation, args=('HYDRO',), axis=1)
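Keyword arguments other than apply's own are forwarded to the function as well, so this equivalent form also works:

file['minute_corr'] = file.apply(imputation, variable='HYDRO', axis=1)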

Dataframe datetime column as argument to function

When passing multiple columns of a dataframe as arguments to a function, the datetime column does not want to work with the formatting function below. I can manage with the inline solution shown ... but it would be nice to know the reason ... Should I, for example, have used a different date-like data type? Thanks (p.s. Pandas = great)
import pandas as pd
import numpy as np
import datetime as dt

def fmtfn(arg_dttm, arg_int):
    retstr = arg_dttm.strftime(':%Y-%m-%d') + '{:0>3}'.format(arg_int)
    # bombs with: 'numpy.datetime64' object has no attribute 'strftime'
    # retstr = '{:%Y-%m-%d}~{:0>3}'.format(arg_dttm, arg_int)
    # bombs with: invalid format specifier
    return retstr

def fmtfn2(arg_dtstr, arg_int):
    retstr = '{}~{:0>3}'.format(arg_dtstr, arg_int)
    return retstr

# The source data.
# I want to add a 3rd column newhg that carries e.g. 2017-06-25~066
# i.e. a concatenation of the other two columns.
df1 = pd.DataFrame({'mydt': ['2017-05-07', '2017-06-25', '2015-08-25'],
                    'myint': [66, 201, 100]})
df1['mydt'] = pd.to_datetime(df1['mydt'], errors='raise')

# THIS WORKS (without calling a function)
print('\nInline solution')
df1['newhg'] = df1[['mydt', 'myint']].apply(lambda x: '{:%Y-%m-%d}~{:0>3}'.format(x[0], x[1]), axis=1)
print(df1)

# THIS WORKS
print('\nConvert to string first')
df1['mydt2'] = df1['mydt'].apply(lambda x: x.strftime('%Y-%m-%d'))
df1['newhg'] = np.vectorize(fmtfn2)(df1['mydt2'], df1['myint'])
print(df1)

# Bombs in the function - see above
print('\nPass a datetime')
df1['newhg'] = np.vectorize(fmtfn)(df1['mydt'], df1['myint'])
print(df1)
You could also have used the built-in pandas accessors, which make it a bit easier to read:
df1['newhg'] = df1.mydt.dt.strftime('%Y-%m-%d') + '~' + df1.myint.astype(str).str.zfill(3)
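As for the reason: np.vectorize hands your function plain numpy.datetime64 scalars, which have no strftime method (exactly what the error says), while the .dt accessor above stays within pandas types. If you want to keep the function-based approach, one workaround (a sketch) is to convert back to a pandas Timestamp inside the function:

def fmtfn(arg_dttm, arg_int):
    # pd.Timestamp restores the datetime interface that numpy.datetime64 lacks
    return '{:%Y-%m-%d}~{:0>3}'.format(pd.Timestamp(arg_dttm), arg_int)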
