How do I apply a function over a column? - python

I have created a function I would like to apply over a given dataframe column. Is there an apply function so that I can create a new column and apply my created function?
Example code:
dat = pd.DataFrame({'title': ['cat', 'dog', 'lion','turtle']})
Manual method that works:
print(calc_similarity(chosen_article,str(df['title'][1]),model_word2vec))
print(calc_similarity(chosen_article,str(df['title'][2]),model_word2vec))
Attempt to apply over dataframe column:
dat['similarity']= calc_similarity(chosen_article, str(df['title']), model_word2vec)
The issue I have been running into is that the function outputs the same result over the entirety of the newly created column.
I have tried apply() as follows:
dat['similarity'] = dat['title'].apply(lambda x: calc_similarity(chosen_article, str(x), model_word2vec))
and
dat['similarity'] = dat['title'].astype(str).apply(lambda x: calc_similarity(chosen_article, x, model_word2vec))
Both result in a ZeroDivisionError, which I don't understand since I am not passing empty strings.
Function being used:
def calc_similarity(input1, input2, vectors):
    s1words = set(vocab_check(vectors, input1.split()))
    s2words = set(vocab_check(vectors, input2.split()))
    output = vectors.n_similarity(s1words, s2words)
    return output

It sounds like you are having difficulty applying a function while passing additional keyword arguments. Here's how you can execute that:
# By default, function will use values for first arg.
# You can specify kwargs in the apply method though
df['similarity'] = df['title'].apply(
    calc_similarity,
    input2=chosen_article,
    vectors=model_word2vec
)
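A minimal sketch of the same pattern that runs without a word2vec model. The function toy_similarity (a Jaccard overlap of word sets) and the value of chosen_article are hypothetical stand-ins, since the real calc_similarity depends on a model and a vocab_check helper that are not shown. It also illustrates why str(df['title']) produces one identical score for every row: it stringifies the whole Series once. The guard for empty word sets reflects one plausible cause of the ZeroDivisionError (every word being filtered out before the similarity is computed), which is an assumption rather than a confirmed diagnosis.

```python
import pandas as pd

def toy_similarity(input1, input2):
    """Hypothetical stand-in for calc_similarity: Jaccard overlap of word sets."""
    s1words = set(input1.split())
    s2words = set(input2.split())
    if not s1words or not s2words:
        # Guard: an empty word set would otherwise divide by zero below.
        return float("nan")
    return len(s1words & s2words) / len(s1words | s2words)

dat = pd.DataFrame({"title": ["cat", "dog", "lion", "turtle"]})
chosen_article = "cat and dog"  # made-up value for illustration

# str(dat["title"]) is ONE string describing the whole Series,
# so assigning this result to a column repeats the same score on every row.
same_everywhere = toy_similarity(chosen_article, str(dat["title"]))

# apply() calls the function once per element; the element becomes the first
# argument and extra keyword arguments are passed through unchanged.
dat["similarity"] = dat["title"].apply(toy_similarity, input2=chosen_article)
```

Each title is passed as input1, so this only matches the original call order if the similarity measure is symmetric, as it is here.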

Related

Function to create a DF

I want to create a DF from another DF using a function like this:
def create_df_region(df,region):
    df = pd.DataFrame(index=df_reduced.index)
    df['Cons'] = df_reduced['ind_{region}'.format()].value
The problem is that ind_{} can take values like ind_s, ind_n, ind_no, and I want to pass these values when creating the DF, because n means north, s means south and so on.
Then, to create the df:
df_south = create_df_region(df_reduced, s)
where s means south, because in df_reduced I have columns such as ind_s and ind_n.
How can I do this? The way I am trying above is not working.
You need to return the newly created dataframe at the end of the function,
use .values instead of .value and use f-string for retrieving the source column name, as follows:
def create_df_region(df, region):
    df = pd.DataFrame(index=df_reduced.index)
    df['Cons'] = df_reduced[f'ind_{region}'].values # use .values instead of .value
    return df
Also, when you call the function, you need to pass a string 's' instead of the variable name s as follows:
df_south = create_df_region(df_reduced, 's')
Use f'ind_{region}' instead of .format():
def create_df_region(df_reduced, region):
    df = pd.DataFrame(index=df_reduced.index)
    df['Cons'] = df_reduced[f'ind_{region}'].values
    return df
*I've also changed the first parameter of the function from df to df_reduced so that it makes sense.
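Putting the fixes together, here is a small runnable sketch. The contents of df_reduced are made up for illustration, and the local name out is just a choice to avoid shadowing the parameter.

```python
import pandas as pd

def create_df_region(df_reduced, region):
    """Build a one-column frame from the 'ind_<region>' column of df_reduced."""
    out = pd.DataFrame(index=df_reduced.index)
    # The f-string builds the source column name, e.g. 'ind_s' for region='s'.
    out["Cons"] = df_reduced[f"ind_{region}"].values
    return out

# Made-up data for illustration only.
df_reduced = pd.DataFrame(
    {"ind_s": [1.0, 2.0], "ind_n": [3.0, 4.0]},
    index=["a", "b"],
)

# The region is passed as a string, not as a bare name.
df_south = create_df_region(df_reduced, "s")
df_north = create_df_region(df_reduced, "n")
```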

Passing Pandas dataframe columns as function arguments

I have a dataframe called df. I need to pass columns as arguments to a function.
Outside the function, this code works :
df.colname.fillna(method='ffill')
If I put the same line inside a function and pass df.colname as the argument (i.e. colname = df.colname), it does not work. The line is ignored:
def Funct (colname):
    colname.fillna(method='ffill')
This works (colname = df.colname):
def Funct (colname):
    colname [1:] = colname[1:].fillna(method='ffill')
What's happening?
Does the function change the dataframe object to an array? Does this make the code inefficient, and is there a better way of doing this?
(Note: This is part of a larger function which I am paraphrasing here for simplicity)
fillna() does not update the dataframe in place by default; it returns a new object and expects you to assign it back, such as
def Funct(colname):
    return colname.fillna(method='ffill')
df['colname'] = Funct(df['colname'])
It's worth mentioning that fillna() does have an argument inplace= which you could set to True if you need to update in place.
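A short sketch of the difference between discarding the result and assigning it back. The data and the name forward_fill are made up for illustration. It uses Series.ffill(), because recent pandas releases deprecate the method= argument of fillna() in favour of ffill() and bfill().

```python
import numpy as np
import pandas as pd

def forward_fill(col):
    # ffill() returns a new Series; the caller decides where to store it.
    return col.ffill()

df = pd.DataFrame({"colname": [1.0, np.nan, 3.0, np.nan]})

# Calling the function and discarding the result leaves df untouched.
forward_fill(df["colname"])
still_missing = int(df["colname"].isna().sum())

# Assigning the result back is what actually updates the column.
df["colname"] = forward_fill(df["colname"])
```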

Why are my variables not accessible after a function?

I can't figure out why my function isn't applying its changes to the variables after I execute it, or why the variables aren't accessible after the function. I'm providing a dataframe and telling the function which column to compare. I want the function to add the matching values to the original dataframe and create a separate dataframe in which I can see just the matches. When I run the code I can see the dataframe and the matching dataframe after running the function, but when I try to use the matching dataframe afterwards, Python doesn't recognize the variable as defined, and the original dataframe isn't modified when I look at it again. I've tried declaring them both as global variables at the beginning of the function, but that didn't work either.
def scorer_tester_function(dataframe, score_type, source, compare, limit_num):
    match = []
    match_index = []
    similarity = []
    org_index = []
    match_df = pd.DataFrame()
    for i in zip(source.index, source):
        position = list(source.index)
        print(str(position.index(i[0])) + " of " + str(len(position)))
        if pd.isnull(i[1]):
            org_index.append(i[0])
            match.append(np.nan)
            similarity.append(np.nan)
            match_index.append(np.nan)
        else:
            ratio = process.extract(i[1], compare, limit=limit_num,
                                    scorer=scorer_dict[score_type])
            org_index.append(i[0])
            match.append(ratio[0][0])
            similarity.append(ratio[0][1])
            match_index.append(ratio[0][2])
    match_df['org_index'] = pd.Series(org_index)
    match_df['match'] = pd.Series(match)
    match_df['match_index'] = pd.Series(match_index)
    match_df['match_score'] = pd.Series(similarity)
    match_df.set_index('org_index', inplace=True)
    dataframe = pd.concat([dataframe, match_df], axis=1)
    return match_df, dataframe
I'm calling the function like this:
scorer_tester_function(df_ven, 'WR', df_ven['Name 1'].sample(2), df_emp['Name 2'], 1)
My expectation is that I can access match_df and df_ven and would be able to see and further manipulate these variables, but after the call the original dataframe df_ven is unchanged and match_df raises a "variable not defined" error.
return doesn't inject local variables into the caller's scope; it makes the function call evaluate to their values.
If you write
a, b = scorer_tester_function(df_ven, 'WR', df_ven['Name 1'].sample(2), df_emp['Name 2'], 1)
then a will have the value of match_df from inside the function and b will have the value of dataframe, but the names match_df and dataframe go out of scope after the function returns; they do not exist outside of it.
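A stripped-down sketch of the same behaviour. The function add_match_column and its data are hypothetical; they only mirror the shape of the original function. Rebinding dataframe inside the function does not change the caller's df_ven, and the returned objects only become available once the call's result is assigned to names.

```python
import pandas as pd

def add_match_column(dataframe, matches):
    # These names exist only inside the function.
    match_df = pd.DataFrame({"match": matches}, index=dataframe.index)
    # pd.concat builds a NEW frame; this rebinds the local name only.
    dataframe = pd.concat([dataframe, match_df], axis=1)
    return match_df, dataframe

df_ven = pd.DataFrame({"Name 1": ["Acme", "Globex"]})

# Unpack the returned tuple into names in the caller's scope.
match_df, df_ven_matched = add_match_column(df_ven, ["ACME Inc", "Globex Corp"])
```

To update the original name instead of creating df_ven_matched, assign the second return value back to df_ven.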

Pandas Series.apply - use arguments from another Series?

I have the following statement:
>>> df['result'] = df['value'].apply(myfunc, args=(x,y,z))
The Python function myfunc was written before I started using Pandas and is set up to take single values. The arguments x and z are fixed and can easily be passed as a variable or literal, but I have a column in my DataFrame that represents the y parameter, so I'm looking for a way to use that row's value for each row (they differ from row to row).
i.e. df['y'] is a series of values that I'd like to send in to myfunc
My workaround is as follows:
values = list(df['value'])
y = list(df['y'])
df['result'] = pd.Series([myfunc(values[i],x,y[i],z) for i in range(0,len(values))])
Any better approaches?
EDIT
Using functools.partial has a gotcha that I was able to work out. If your call does not stick to keyword arguments, it falls back to positional binding, and you may run into the 'myfunc() got multiple values for...' error.
I modified the answer from coldspeed:
# Function myfunc takes named arguments arg1, arg2, arg3 and arg4
# The values for arg2 and arg4 don't change so I'll set them when
# defining the partial (assume x and z have values set)
myfunc_p = partial(myfunc, arg2=x, arg4=z)
df['result'] = [myfunc_p(arg1=w, arg3=y) for w, y in zip(df['value'], df['y'])]
You could also apply over the rows with a lambda like so:
df['result'] = df.apply(lambda row: myfunc(row['value'], y=row['y'], x=x, z=z), axis=1)
I think what you're doing is fine. I'd maybe make a couple of improvements:
from functools import partial
myfunc_p = partial(myfunc, x=x, z=z)
df['result'] = [myfunc_p(v, y=y) for v, y in zip(df['value'], df['y'])]
You don't need to wrap the list in a pd.Series call, and you can clean up your function call by fixing two of the arguments with functools.partial.
There's also the other option using np.vectorize (disclaimer, this does not actually vectorize the function, just hides the loop) for more concise code, but in most cases the list comprehension should be faster.
myfunc_v = np.vectorize(partial(myfunc, x=x, z=z))
df['result'] = myfunc_v(df['value'], y=df['y'])
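A runnable sketch of the partial-plus-zip approach. The body of myfunc and its signature (value, x, y, z) are assumptions made up for illustration, chosen to match the args=(x, y, z) order in the question. Passing y by keyword avoids the "got multiple values" error, because a second positional argument would be bound to x, which the partial has already fixed.

```python
from functools import partial

import pandas as pd

def myfunc(value, x, y, z):
    # Hypothetical stand-in for the real single-value function.
    return value * x + y - z

x, z = 10, 1  # fixed arguments
df = pd.DataFrame({"value": [1, 2, 3], "y": [5, 6, 7]})

# Fix the constant arguments once.
myfunc_p = partial(myfunc, x=x, z=z)

# Walk both columns in lockstep and pass the per-row argument by keyword.
df["result"] = [myfunc_p(v, y=y) for v, y in zip(df["value"], df["y"])]

# The positional form collides with the x fixed by the partial.
try:
    myfunc_p(1, 5)
    positional_call_failed = False
except TypeError:
    positional_call_failed = True
```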

Creating Dataframe column using apply() in Python

Say you have a Table T with Columns A and B with numerical values. I want to create a new column C that gives me the ratio of A/B. I know the easy way to do this.
T['C']=T['A']/T['B']
But I want to try using the apply() function on a new copy of Table T. I have the following function to do this for any table.
def ratio(T):
    X=T.copy()
    def ratio(a,b):
        return a/b
    X['C']=X['C'].apply(ratio,'A','B')
    return X
I get the error KeyError: 'C'. How do I properly get 'C' to exist in order to apply the function?
You could simplify this with lambda:
X = T.copy()
X['C'] = T.apply(lambda row: row.A/row.B, axis=1)
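Wrapped back into a function, the row-wise version might look like the sketch below. The name add_ratio and the sample values are made up for illustration. The KeyError in the question comes from reading X['C'] before it exists; with axis=1, apply() is called on the whole frame and each row is handed to the lambda, so columns A and B can both be read there.

```python
import pandas as pd

def add_ratio(T):
    """Return a copy of T with a new column C = A / B, computed row by row."""
    X = T.copy()
    # axis=1 passes each row as a Series, so both columns are available.
    X["C"] = X.apply(lambda row: row["A"] / row["B"], axis=1)
    return X

# Made-up data for illustration only.
T = pd.DataFrame({"A": [1.0, 4.0, 9.0], "B": [2.0, 2.0, 3.0]})
X = add_ratio(T)
```

For plain arithmetic like this, the direct form T['A'] / T['B'] gives the same result and is usually faster than a row-wise apply().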
