Please keep in mind that I am coming from an R background (and am quite a novice as well).
I am trying to create a UDF (user-defined function) to format a DataFrame df in Python, according to some defined rules. The first part deletes the first 4 rows of the DataFrame and the second adds my desired column names. My function looks like this:
def dfFormatF(x):
    # Remove the first 4 rows
    x = x.iloc[4:]
    # Assign column headers
    x.columns = ['Name1', 'Name2', 'Name3']

dfFormatF(df)
When I run it like this, it's not working (it neither drops the first rows nor renames the columns). When I remove the x = x.iloc[4:], the second part, x.columns = ['Name1', 'Name2', 'Name3'], works properly and the columns are renamed. Additionally, if I run the removal outside the function, such as:
def dfFormatF(x):
    # Assign column headers
    x.columns = ['Name1', 'Name2', 'Name3']

df = df.iloc[4:]
dfFormatF(df)
before I call my function, I get the full expected result (first the removal of the first rows and then the desired column naming).
Any ideas as to why it is not working as part of the function, but it does work outside of it?
Any help is much appreciated.
Thanks in advance.
The issue here is that the changes only exist inside the scope of dfFormatF(). Once you exit that function, all changes are lost because you do not return the result and you do not assign the result to anything in the module-level scope. It's worth taking a step back to understand this in a general sense (it is not a pandas-specific thing).
Instead, pass your DF to the function, make the transformations you want to that DF, return the result, and then assign that result back to the name you passed to the function.
Note: this is a big thing in pandas. What we emulate here is the inplace=True functionality. There are lots of things you can do to DataFrames, and if you don't use inplace=True, those changes will be lost. If you stick with the default inplace=False, then you must assign the result back to a variable (with the same or a different name, up to you).
import pandas as pd

starting_df = pd.DataFrame(range(10), columns=['test'])

def dfFormatF(x):
    # Remove the first 4 rows
    x = x.iloc[4:]
    # Assign column headers
    x.columns = ['Name1']
    print('Inside the function')
    print(x.head())
    return x

dfFormatF(starting_df)
print('Outside the function')
print(starting_df)  # Note, unchanged

# Take 2
starting_df = dfFormatF(starting_df)
print('Reassigning changes back')
print(starting_df.head())
Related
I'm having difficulty applying my knowledge of defining functions with def to my own function.
I want to create a function where I can filter my data frame based on 1. the columns I'd like to drop (plus their axis) and 2. using .dropna.
I've used it on one of my data frames like this:
total_adj_gross = ((gross.drop(columns = ['genre','rating', 'total_gross'], axis = 1)).dropna())
I've also used it on another data frame like this:
vill = (characters.drop(columns = ['hero','song'], axis = 1)).dropna(axis = 0)
Can I make a function using def so I can easily do this to any data frame?
If so, would I go about it like this?
def filtered_df(data, col_name, N=1):
    frame = data.drop(columns = [col_name], axis = N)
    frame.dropna(axis = N)
    return frame
I can already feel that the above function would go wrong, because what if I have different N's, like in my vill object?
BTW, I am a very new beginner, as you can tell -- I haven't been exposed to any complex functions. Any help would be appreciated!
Update (since I don't know how to format code in comments):
Thank you all for your help in creating my function,
but now how do I insert this into my code?
Do I have to make a script (.py) and then call my function?
Can I test it within my actual code?
Right now, if I just copy and paste any code in and fill in the column name, I get an error saying the specific column "is not found in the axis".
Based on what you want to achieve, you don't need to pass any axis parameter: drop(columns=...) already works column-wise, and dropna() defaults to axis=0. Also, you want to pass a list of columns as a parameter so the function can drop several different columns. And finally, dropna() is not in-place by default; you have to store the returned value in a frame, like you did on the line above.
Your function should look like this:
def filtered_df(data, col_names):
    frame = data.drop(columns = col_names)
    result = frame.dropna()
    return result
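For example, applied to the two data frames from your question (assuming gross and characters are defined as before), the calls would be:

total_adj_gross = filtered_df(gross, ['genre', 'rating', 'total_gross'])
vill = filtered_df(characters, ['hero', 'song'])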
Overall, the code looks good. I'd suggest 3 minor changes:
Pass column names as a list; do not convert them to a list within the function.
Pass 2 variables for working with the axis. From what I see in your example, your axis values change between drop and dropna. I'm not sure you need that, but if you want 2 different axis values for drop() and dropna(), then use 2 different variables, for example drop_axis and dropna_axis.
Assign the modified frame in a single-line operation.
So the code would look something like this:
def filtered_df(data, col_name, drop_axis=1, dropna_axis=0):
    frame = data.drop(columns = col_name, axis = drop_axis).dropna(axis = dropna_axis)
    return frame
Your call to it can look like:
modified_df = filtered_df(data, ["x_col","y_col"], 0, 0)
I have created a function that creates a pandas dataframe where I have created a new column that combines the first/middle/last name of an employee. I am then calling the function based on the python index(EmployeeID). I am able to run this function successfully for one employee. I am having trouble updating the function to be able to run multiple EmployeeIDs at once. Let's say I wanted to run 3 employee IDs through the function. How would I update this function to allow for that?
def getFullName(EmpID):
    df = pd.read_excel('Employees.xls', 'Sheet0', index_col='EmployeeID',
                       usecols=['EmployeeID', 'FirstName', 'MiddleName', 'LastName'], na_values=[""])
    X = df[["FirstName", "MiddleName", "LastName"]]
    df['EmployeeName'] = X.fillna('').apply(lambda x: x.LastName + ", " + x.FirstName + " " + str(x.MiddleName), axis=1)
    if EmpID in df.index:
        rec = df.loc[EmpID, 'EmployeeName']
        print(rec)
    else:
        print("UNKNOWN")
In general, if you want an argument to be able to consist of one or more records, you can use a list or tuple to represent it.
In practice for this example, because Python is dynamically typed and because the .loc function of pandas DataFrames can also take a list of values as an argument, you don't have to change anything. Just pass a list of employee IDs as EmpID.
Without knowing what the EmpIDs look like, it is hard to give an example.
But you can try it out, by calling your function with
getFullName(EmpID)
and with
getFullName([EmpID, EmpID])
The first call should print the record once and the second should print the record twice. You can replace EmpID with any working ID (see df.index).
The documentation I linked above has some minimal examples to play around with.
PS: There is a bit of danger in passing a list to .loc. If you pass an EmpID that does not exist, pandas will currently only give a warning (in a future version it will raise a KeyError). For any unknown EmpID it will create a new row in the result with NaNs as values. From the documentation example:
df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])
df.loc[['viper', 'sidewinder']]
Will return
            max_speed  shield
viper               4       5
sidewinder          7       8
Calling it with missing indices:
print(df.loc[['viper', 'does not exist']])
Will produce
                max_speed  shield
viper                 4.0     5.0
does not exist        NaN     NaN
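If you want to avoid those NaN rows for missing labels, one option is to intersect the requested labels with the index first (a minimal sketch):

valid = df.index.intersection(['viper', 'does not exist'])
print(df.loc[valid])  # only the rows that actually exist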
You could add in an array of EmpIDs.
empID_list = [empID01, empID02, empID03]
Then you would need to use a for loop:
for empID in empID_list:
    doStuff()

Or you can just use your function as the function in the for loop:

for empID in empID_list:
    getFullName(empID)
Let's say you have this list of employee IDs:
empIDs = [empID1, empID2, empID3]
You need to then pass this list as an argument instead of a single employee ID.
def getFullName(empIDs):
    df = pd.read_excel('Employees.xls', 'Sheet0', index_col='EmployeeID',
                       usecols=['EmployeeID', 'FirstName', 'MiddleName', 'LastName'], na_values=[""])
    X = df[["FirstName", "MiddleName", "LastName"]]
    df['EmployeeName'] = X.fillna('').apply(lambda x: x.LastName + ", " + x.FirstName + " " + str(x.MiddleName), axis=1)
    for EmpID in empIDs:
        if EmpID in df.index:
            rec = df.loc[EmpID, 'EmployeeName']
            print(rec)
        else:
            print("UNKNOWN")
One way or another, the if EmpID in df.index: will need to be rewritten. I suggest you pass a list called employee_ids as the input, then do the following (the first two lines wrap a single ID in a list; they are only needed if you still want to be able to pass a single ID):
if not isinstance(employee_ids, list):
    employee_ids = [employee_ids]  # this ensures you can still pass single IDs

rec = df.reindex(employee_ids).EmployeeName.dropna()
In the old days, df.loc would accept missing labels and just not return anything, but in recent versions it raises an error. reindex will give you a row for every ID in employee_ids, with NaN as the value if the ID wasn't in the index. We therefore select the column EmployeeName and then drop the missing values with dropna.
Now, the only thing left to do is handle the output. The resulting Series has a (boolean) attribute called empty, which can be used to check whether any IDs were found. Otherwise we'll want to print the values of rec, which is a Series.
Thus:
def getFullName(employee_ids):
    df = pd.read_excel('Employees.xls', 'Sheet0', index_col='EmployeeID',
                       usecols=['EmployeeID', 'FirstName', 'MiddleName', 'LastName'], na_values=[""])
    X = df[["FirstName", "MiddleName", "LastName"]]
    df['EmployeeName'] = X.fillna('').apply(lambda x: x.LastName + ", " + x.FirstName + " " + str(x.MiddleName), axis=1)
    if not isinstance(employee_ids, list):
        employee_ids = [employee_ids]  # this ensures you can still pass single IDs
    rec = df.reindex(employee_ids).EmployeeName.dropna()
    if rec.empty:
        print("UNKNOWN")
    else:
        print(rec.values)
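With this version you can call it either way (the IDs are hypothetical):

getFullName('E001')            # a single ID still works
getFullName(['E001', 'E999'])  # unknown IDs are dropped from the output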
(as an aside, you may like to know that a python convention is to use snake_case for function and variable names and CamelCase for class names)
We have a project with multiple *.py scripts containing functions that receive and return pandas DataFrame variables as arguments.
But this makes me wonder: what is the behavior in memory of a DataFrame variable when it is passed as an argument to, or returned from, those functions?
Does modifying the df variable alter the parent/main/global variable as well?
Consider the following example:
import pandas as pd

def add_Col(df):
    df["New Column"] = 10 * 3

def mod_Col(df):
    df["Existing Column"] = df["Existing Column"] ** 2

data = [0, 1, 2, 3]
df = pd.DataFrame(data, columns=["Existing Column"])
add_Col(df)
mod_Col(df)
df
When df is displayed at the end, will the new column show up? What about the change made to "Existing Column" when calling mod_Col?
Did invoking the add_Col function create a copy of df, or only a pointer?
What is the best practice when passing DataFrames into functions? Because if they are large enough, I am sure creating copies will have both performance and memory implications, right?
It depends. DataFrames are mutable objects, so like lists, they can be modified within a function, without needing to return the object.
On the other hand, the vast majority of pandas operations will return a new object so modifications would not change the underlying DataFrame. For instance, below you can see that changing values with .loc will modify the original, but if you were to multiply the entire DataFrame (which returns a new object) the original remains unchanged.
If you had a function with a combination of both kinds of operations, you could modify your DataFrame up to the point at which you return a new object (see the sketch after the two examples below).
Changes the original
df = pd.DataFrame([1, 2, 4])

def mutate_data(df):
    df.loc[1, 0] = 7

mutate_data(df)
print(df)
#   0
#0  1
#1  7
#2  4
Will not change original
df = pd.DataFrame([1, 2, 4])

def mutate_data(df):
    df = df * 2

mutate_data(df)
print(df)
#   0
#0  1
#1  2
#2  4
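As a sketch of the combination mentioned above (names are illustrative): the .loc assignment mutates the caller's object, but once the local name is rebound, later changes are invisible outside.

df = pd.DataFrame([1, 2, 4])

def mutate_then_rebind(df):
    df.loc[1, 0] = 7   # mutates the caller's DataFrame
    df = df * 2        # rebinds the local name to a new object
    df.loc[2, 0] = 99  # only affects the new local object

mutate_then_rebind(df)
print(df)  # row 1 shows 7; the *2 and the 99 are not visible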
What should you do?
If the purpose of a function is to modify a DataFrame, like in a pipeline, then you should create a function that takes a DataFrame and returns the DataFrame.
def add_column(df):
    df['new_column'] = 7
    return df

df = add_column(df)  # note: df appears on both the lhs and the rhs
In this scenario it doesn't matter if the function changes or creates a new object, because we intend to modify the original anyway.
However, that may have unintended consequences if you plan to write to a new object:

df1 = add_column(df)  # df1 is a new name, but the function still modifies df as well!
A safe alternative that would require no knowledge of the underlying source code would be to force your function to copy at the top. Thus in that scope changes to df do not impact the original df outside of the function.
def add_column_maintain_original(df):
    df = df.copy()
    df['new_column'] = 7
    return df
Another possibility is to pass a copy to the function:
df1 = add_column(df.copy())
Yes, the function will indeed change the DataFrame itself without creating a copy of it. You should be careful with this, because you might end up having columns changed without noticing.
In my opinion, the best practice depends on the use case, and using .copy() will indeed have an impact on your memory.
If, for instance, you are creating a pipeline with some DataFrame as input, you do not want to change the input DataFrame itself. If you are just processing a DataFrame and splitting the processing into different functions, you can write the functions the way you did.
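A minimal sketch of that pipeline-safe pattern (the column name reuses the example from the question):

def process(df):
    df = df.copy()  # protect the caller's DataFrame
    df['Existing Column'] = df['Existing Column'] ** 2
    return df

clean = process(df)  # df outside the function is untouched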
I am creating a function. One input of this function will be a pandas DataFrame, and one of its tasks is to do some operation with two variables of this DataFrame. These two variables are not fixed, and I want the freedom to determine them using parameters as inputs of the function fun.
For example, suppose that at some moment the variables I want to use are 'var1' and 'var2' (but at another time I may want to use two other variables). Suppose that these variables take the values 1, 2, 3, 4 and I want to reduce df with var1 == 1 and var2 == 1. My function is like this:
def fun(df, var=['input_var1', 'input_var2'], val=1):
    df = df.rename(columns={var[1]: 'aux_var1', var[2]: 'aux_var2'})
    # Other operations
    df = df.loc[(df.aux_var1 == val) & (df.aux_var2 == val)]
    # end of operations
    # recover the original names
    df = df.rename(columns={'aux_var1': var[1], 'aux_var2': var[2]})
    return df
When I use the function fun, I get the error:
fun(df, var = ['var1','var2'], val = 1)
IndexError: list index out of range
Actually, I want to do other, more complex operations, and I didn't describe them so as not to extend the question. Perhaps the simple example above has a solution that does not need to rename the variables, but maybe that solution won't work with the operations I really want to do. So first, I would like to correct the error in renaming the variables. If you want to give a more elegant solution that doesn't need renaming, I appreciate that too, but I will be very grateful if, besides the elegant solution, you also offer me the solution for the renaming.
Python lists are zero-indexed, i.e. the first element has index 0.
Just change the lines:
df = df.rename(columns={var[1]: 'aux_var1', var[2]: 'aux_var2'})
df = df.rename(columns={'aux_var1': var[1], 'aux_var2': var[2]})

to

df = df.rename(columns={var[0]: 'aux_var1', var[1]: 'aux_var2'})
df = df.rename(columns={'aux_var1': var[0], 'aux_var2': var[1]})

respectively.
In this case you are accessing var[2], but a 2-element list in Python has elements 0 and 1. Element 2 does not exist, and therefore accessing it is out of range.
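A quick illustration in the REPL:

>>> var = ['var1', 'var2']
>>> var[0]
'var1'
>>> var[2]  # raises IndexError: list index out of range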
As has been mentioned in other answers, the error you are receiving is due to the 0-indexing of Python lists, i.e. if you wish to access the first element of the list var, you do that by taking index 0 instead of index 1: var[0].
However, on the topic of renaming: you can perform the filtering of a pandas DataFrame without any column renaming. I can see that you are accessing the column as an attribute of the DataFrame; you can achieve the same via the __getitem__ method, which is more commonly used with square brackets, e.g. df[var[0]].
If you wish to have more generality over your function without any renaming happening, I can suggest this:
from functools import reduce
import pandas as pd

def fun(df, var, val):
    # AND together one boolean mask per requested column
    _sub = reduce(
        lambda x, y: x & (df[y] == val),
        var,
        pd.Series(True, index=df.index)
    )
    return df[_sub]
This will work with any number of input column variables. I hope this serves as an inspiration for the more complicated operations you intend to do.
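For the example from the question, the call would be:

reduced = fun(df, ['var1', 'var2'], 1)  # keeps rows where var1 == 1 and var2 == 1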
I just discovered the assign method for pandas dataframes, and it looks nice and very similar to dplyr's mutate in R. However, I've always gotten by by just initializing a new column 'on the fly'. Is there a reason why assign is better?
For instance (based on the example in the pandas documentation), to create a new column in a dataframe, I could just do this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
df['ln_A'] = np.log(df['A'])
but the pandas.DataFrame.assign documentation recommends doing this:
df.assign(ln_A = lambda x: np.log(x.A))
# or
newcol = np.log(df['A'])
df.assign(ln_A=newcol)
Both methods return the same dataframe. In fact, the first method (my 'on the fly' assignment) is significantly faster (0.202 seconds for 1000 iterations) than the .assign method (0.353 seconds for 1000 iterations).
So is there a reason I should stop using my old method in favour of df.assign?
The difference concerns whether you wish to modify an existing frame, or create a new frame while maintaining the original frame as it was.
In particular, DataFrame.assign returns a new object that has a copy of the original data with the requested changes ... the original frame remains unchanged.
In your particular case:
>>> df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
Now suppose you wish to create a new frame in which A is everywhere 1 without destroying df. Then you could use .assign
>>> new_df = df.assign(A=1)
If you do not wish to maintain the original values, then clearly df["A"] = 1 will be more appropriate. This also explains the speed difference: by necessity, .assign must copy the data, while [...] does not.
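A quick check of that behaviour (a minimal sketch):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
new_df = df.assign(A=1)
print(df['A'].tolist())      # still 1..10, unchanged
print(new_df['A'].unique())  # [1]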
The premise of assign is that it returns:
A new DataFrame with the new columns in addition to all the existing columns.
Also, you cannot do anything in place to change the original dataframe:
The callable must not change input DataFrame (though pandas doesn't check it).
On the other hand, df['ln_A'] = np.log(df['A']) will do things in place.
So is there a reason I should stop using my old method in favour of df.assign?
I think you can try df.assign, but if you do memory-intensive work, it is better to stick with what you did before, or to use operations with inplace=True.
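That said, one common reason to prefer assign (a general pattern, not specific to this question) is method chaining, since it returns the DataFrame and lets you build a pipeline in a single expression:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
result = (
    df.assign(ln_A=lambda x: np.log(x.A))
      .query('ln_A > 1')
      .reset_index(drop=True)
)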