Passing Pandas dataframe columns as function arguments - python

I have a dataframe called df. I need to pass columns as arguments to a function.
Outside the function, this code works:
df.colname.fillna(method='ffill')
If I use the same line inside a function and pass df.colname as the argument (i.e. colname = df.colname), it does not work; the line is ignored:
def Funct(colname):
    colname.fillna(method='ffill')
This works (colname = df.colname):
def Funct(colname):
    colname[1:] = colname[1:].fillna(method='ffill')
What's happening?
Does the function change the dataframe object to an array? Does this make the code inefficient, and is there a better way of doing this?
(Note: This is part of a larger function which I am paraphrasing here for simplicity)

fillna() by default does not update the dataframe in place; instead it returns a new object that you are expected to assign back, such as
def Funct(colname):
    return colname.fillna(method='ffill')

df = Funct(df)
It's worth mentioning that fillna() does have an inplace= argument, which you could set to True if you need to update in place.
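As a side note, fillna(method='ffill') is deprecated in recent pandas releases in favor of .ffill(). A minimal sketch of the assign-it-back pattern (the column name here is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'colname': [1.0, None, None, 4.0]})

def ffill_col(col):
    # Return the filled Series; .ffill() is the modern spelling
    # of fillna(method='ffill')
    return col.ffill()

# Assign the result back -- nothing in df changes otherwise
df['colname'] = ffill_col(df['colname'])
```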

Related

Python Dataframe Modification Function Inconsistent?

I have these 4 functions that I use to modify a dataframe (without returning anything, as per my intention).
The first 3 functions work perfectly fine: the dataframe gets modified according to each function. But the 4th function (drop_na) doesn't seem to work.
It's supposed to drop all rows with NA in the specified column, but nothing happens and no error is thrown when I run it. Any ideas why this happens and how to fix it (without a return, if possible)?
Thanks!
def composite_key(dframe, new_key, key1, key2):
    dframe[new_key] = dframe[key1] + "-" + dframe[key2].astype(str)

def drop_col(dframe, colnames):
    dframe = dframe.drop_duplicates(subset=colnames, keep='first')

def split_column(dframe, arg: list):
    dframe[arg[0]] = dframe[arg[1]].str.split(',', n=-1, expand=True).loc[:, :(len(arg[0])-1)]

def drop_na(dframe, colname):
    dframe = dframe.loc[dframe[colname].notna()]
Usually, to drop NA values for a specific column, you use the subset argument:
df.dropna(subset=['col_name'])
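As for why drop_na appears to do nothing: dframe = dframe.loc[...] only rebinds the local name dframe inside the function; the caller's DataFrame is never touched. If you want to keep the no-return style, one option (a sketch, with hypothetical column names) is to mutate the object in place:

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, None, 3.0], 'b': ['x', 'y', 'z']})

def drop_na(dframe, colname):
    # inplace=True mutates the object the caller passed in,
    # so no return or reassignment is needed
    dframe.dropna(subset=[colname], inplace=True)

drop_na(df, 'a')
```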

How do I apply a function over a column?

I have created a function I would like to apply over a given dataframe column. Is there an apply function so that I can create a new column and apply my created function?
Example code:
dat = pd.DataFrame({'title': ['cat', 'dog', 'lion','turtle']})
Manual method that works:
print(calc_similarity(chosen_article,str(df['title'][1]),model_word2vec))
print(calc_similarity(chosen_article,str(df['title'][2]),model_word2vec))
Attempt to apply over dataframe column:
dat['similarity']= calc_similarity(chosen_article, str(df['title']), model_word2vec)
The issue I have been running into is that the function outputs the same result over the entirety of the newly created column.
I have tried apply() as follows:
dat['similarity'] = dat['title'].apply(lambda x: calc_similarity(chosen_article, str(x), model_word2vec))
and
dat['similarity'] = dat['title'].astype(str).apply(lambda x: calc_similarity(chosen_article, x, model_word2vec))
Both result in a ZeroDivisionError, which I don't understand since I am not passing empty strings.
Function being used:
def calc_similarity(input1, input2, vectors):
    s1words = set(vocab_check(vectors, input1.split()))
    s2words = set(vocab_check(vectors, input2.split()))
    output = vectors.n_similarity(s1words, s2words)
    return output
It sounds like you are having difficulty applying a function while passing additional keyword arguments. Here's how you can execute that:
# By default, apply passes each value in the column as the first argument.
# You can specify the remaining kwargs in the apply method, though:
df['similarity'] = df['title'].apply(
    calc_similarity,
    input2=chosen_article,
    vectors=model_word2vec
)
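Since calc_similarity, chosen_article and model_word2vec come from the question and aren't reproducible here, here is a minimal stand-in showing how apply() feeds each cell to the first parameter and forwards the extra keyword arguments to every call:

```python
import pandas as pd

dat = pd.DataFrame({'title': ['cat', 'dog', 'lion', 'turtle']})

# Stand-in for calc_similarity: the first parameter receives the cell value
def calc_len_ratio(input1, input2):
    return len(input1) / len(input2)

# input2 is forwarded unchanged to every call; input1 varies per row
dat['ratio'] = dat['title'].apply(calc_len_ratio, input2='elephant')
```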

Custom Aggregate Function in Python

I have been struggling with a problem with a custom aggregate function in Pandas that I have not been able to figure out. Let's consider the following data frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'value': np.arange(1, 5), 'weights':np.arange(1, 5)})
Now, if I want to calculate the average of the value column using agg in Pandas, it would be:
df.agg({'value': 'mean'})
which results in a scalar value of 2.5.
However, if I define the following custom mean function:
def my_mean(vec):
    return np.mean(vec)
and use it in the following code:
df.agg({'value': my_mean})
I get back the whole column rather than a single value.
So, the question here is: what should I do to get the same result as the default mean aggregate function? One more thing to note: if I use the mean method in the custom function (shown below), it works just fine. However, I would like to know how to use the np.mean function in my custom function. Any help would be much appreciated!
def my_mean2(vec):
    return vec.mean()
When you pass a callable as the aggregate function, if that callable is not one of the predefined callables like np.mean, np.sum, etc., pandas treats it as a transform and acts like df.apply().
The way around it is to let pandas know that your callable expects a vector of values. A crude way to do that is something like:
def my_mean(vals):
    print(type(vals))
    try:
        vals.shape
    except AttributeError:
        raise TypeError()
    return np.mean(vals)

>>> df.agg({'value': my_mean})
<class 'int'>
<class 'pandas.core.series.Series'>
value    2.5
dtype: float64
You see, at first pandas tries to call the function on each row (like df.apply), but my_mean raises a TypeError, and on the second attempt pandas passes the whole column as a Series object. Comment out the try...except part and you'll see my_mean called on each row with an int argument.
More on the first part:
my_mean1 = np.mean
my_mean2 = lambda *args, **kwargs: np.mean(*args, **kwargs)

df.agg({'value': my_mean1})
df.agg({'value': my_mean2})
Although my_mean2 and np.mean are essentially the same, since my_mean2 is np.mean evaluates to False, it goes down the df.apply route, while my_mean1 works as expected.
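A slightly tidier version of the same trick: raise unless the whole column was passed, so pandas falls back to the Series path. (The exact dispatch details vary across pandas versions, but either code path ends up reducing the full column.)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.arange(1, 5)})

def my_mean(vals):
    # Refuse scalars so pandas retries with the whole Series
    if not isinstance(vals, pd.Series):
        raise TypeError('expected a whole column')
    return np.mean(vals)

result = df.agg({'value': my_mean})
```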

Find if a column in dataframe has neither nan nor none

I have gone through all the posts on the website and am not able to find a solution to my problem.
I have a dataframe with 15 columns; some of them contain None or NaN values. I need help writing the if-else condition.
If the value in the column is neither None nor NaN, I need to format the datetime column. Current code is below:
for index, row in df_with_job_name.iterrows():
    start_time = df_with_job_name.loc[index, 'startTime']
    if not df_with_job_name.isna(df_with_job_name.loc[index, 'startTime']):
        start_time_formatted = datetime(*map(int, re.split('[^\d]', start_time)[:-1]))
The error that I am getting is
if not df_with_job_name.isna(df_with_job_name.loc[index,'startTime']) :
TypeError: isna() takes exactly 1 argument (2 given)
A direct way to take care of missing/invalid values is probably:
def is_valid(val):
    if val is None:
        return False
    try:
        return not math.isnan(val)
    except TypeError:
        return True
and of course you'll have to import math.
Also, note that isna is invoked without any arguments and returns a dataframe of boolean values (see link). You can iterate through both dataframes to determine whether each value is valid.
isna takes your entire data frame as the instance argument (that's self, if you're already familiar with classes) and returns a data frame of Boolean values, True wherever the value is missing. You tried to pass the individual value you're checking as a second argument, but isna doesn't work that way; it takes empty parentheses in the call.
You have a couple of options. One is to follow the individual checking tactics here. The other is to make the map of the entire data frame and use that:
null_map_df = df_with_job_name.isna()

for index, row in df_with_job_name.iterrows():
    if not null_map_df.loc[index, row]:
        start_time = df_with_job_name.loc[index, 'startTime']
        start_time_formatted = datetime(*map(int, re.split('[^\d]', start_time)[:-1]))
Please check my use of row and column indices; the index, row handling doesn't look right. Also, you should be able to apply an any() operation to an entire row at once.
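For this particular case you may not need iterrows() at all: notna() handles the None/NaN check for the whole column in one vectorized step, and pd.to_datetime can replace the manual re.split parsing. A sketch, using the question's column name and a hypothetical output column:

```python
import pandas as pd

df = pd.DataFrame({'startTime': ['2021-01-02 03:04:05', None, float('nan')]})

# notna() is True where the value is neither None nor NaN
mask = df['startTime'].notna()

# Parse only the valid rows; the others stay missing
df.loc[mask, 'startTimeParsed'] = pd.to_datetime(df.loc[mask, 'startTime'])
```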

Not deleting rows as part of function - python

Please keep in mind I am coming from an R background (and am quite a novice as well).
I am trying to create a UDF to format a data.frame df in Python according to some defined rules. The first part deletes the first 4 rows of the data.frame and the second assigns my desired column names. My function looks like this:
def dfFormatF(x):
    # Remove first 4 rows
    x = x.iloc[4:]
    # Assign column headers
    x.columns = ['Name1', 'Name2', 'Name3']

dfFormatF(df)
When I run it like this, it doesn't work (it neither drops the first rows nor renames the columns). When I remove the x = x.iloc[4:], the second part x.columns = ['Name1', 'Name2', 'Name3'] works properly and the columns are renamed. Additionally, if I run the removal outside the function, such as:
def dfFormatF(x):
    # Assign column headers
    x.columns = ['Name1', 'Name2', 'Name3']

df = df.iloc[4:]
dfFormatF(df)
before I call my function, I get the full expected result (first the removal of the first rows and then the desired column naming).
Any ideas as to why it is not working as part of the function, but it does outside of it?
Any help is much appreciated.
Thanks in advance.
The issue here is that the changes happen only inside the scope of dfFormatF(). Once you exit that function, all changes are lost, because you neither return the result nor assign it to something in the module-level scope. It's worth taking a step back to understand this in a general sense (it is not a Pandas-specific thing).
Instead, pass your DF to the function, make the transformations you want to that DF, return the result, and then assign that result back to the name you passed to the function.
Note: this is a big thing in Pandas. What we emulate here is the inplace=True functionality. There are lots of things you can do to DataFrames, and if you don't use inplace=True, those changes will be lost. If you stick with the default inplace=False, then you must assign the result back to a variable (with the same or a different name, up to you).
import pandas as pd

starting_df = pd.DataFrame(range(10), columns=['test'])

def dfFormatF(x):
    # Remove first 4 rows
    x = x.iloc[4:]
    # Assign column headers
    x.columns = ['Name1']
    print('Inside the function')
    print(x.head())
    return x

dfFormatF(starting_df)
print('Outside the function')
print(starting_df)  # Note, unchanged

# Take 2
starting_df = dfFormatF(starting_df)
print('Reassigning changes back')
print(starting_df.head())
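If you specifically want the R-style "modify without returning" behavior, here is a sketch that mutates the caller's object instead of rebinding the local name (drop(inplace=True) stands in for the iloc slice, since slicing always creates a new object):

```python
import pandas as pd

df = pd.DataFrame({'test': range(10)})

def dfFormatF_inplace(x):
    # Dropping rows by index label mutates the caller's object
    x.drop(index=x.index[:4], inplace=True)
    # Attribute assignment also mutates in place
    x.columns = ['Name1']

dfFormatF_inplace(df)
```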
