Basic python defining function - python

I'm having difficulty applying my knowledge of defining functions with def to my own function.
I want to create a function that filters my data frame based on 1. the columns I'd like to drop (plus their axis) and 2. applying .dropna.
I've used it on one of my data frames like this:
total_adj_gross = ((gross.drop(columns = ['genre','rating', 'total_gross'], axis = 1)).dropna())
I've also used it on another data frame like this:
vill = (characters.drop(columns = ['hero','song'], axis = 1)).dropna(axis = 0)
Can I make a function using def so I can easily do this to any data frame?
If so, would I go about it like this?
def filtered_df(data, col_name, N=1):
    frame = data.drop(columns = [col_name], axis = N)
    frame.dropna(axis = N)
    return frame
I can already tell this function could go wrong, because what if I have different N's, like in my vill object?
BTW, I am a very new beginner, as you can tell -- I haven't been exposed to any complex functions. Any help would be appreciated!
Update (since I don't know how to format code in comments):
Thank you all for your help in creating my function
but now, how do I use this in my code?
Do I have to make a script (.py) and then call my function?
Can I test it within my actual code?
Right now, if I just copy and paste the code in and fill in the column name, I get an error saying that the specific column "is not found in the axis".

Based on what you want to achieve, you don't need to pass any axis parameter: drop(columns=...) already works on columns, and dropna() defaults to axis=0 (rows). Also, you want to pass a list of column names as a parameter so you can drop several columns at once. And finally, dropna() is not in place by default: you have to store the returned value in a variable, like you did on the line above.
Your function should look like this:
def filtered_df(data, col_names):
    frame = data.drop(columns = col_names, axis = 1)
    result = frame.dropna()
    return result
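To address the update: you don't need a separate .py script. Define the function in the same script or notebook, above the point where you call it, and pass only column names that actually exist in that DataFrame (the "is not found in the axis" error means one of the names you passed isn't a column of that frame). A minimal sketch, reusing the gross and characters frames from the question:
def filtered_df(data, col_names):
    frame = data.drop(columns = col_names)  # drop the unwanted columns
    return frame.dropna()                   # then drop any rows containing NaN

total_adj_gross = filtered_df(gross, ['genre', 'rating', 'total_gross'])
vill = filtered_df(characters, ['hero', 'song'])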

Overall, the code looks good. I'd suggest 3 minor changes:
Pass the column names as a list; don't convert them to a list within the function.
Pass 2 variables for the axes. From what I see in your examples, your axis value changes between drop and dropna. I'm not sure whether you need that, but if you want 2 different axis values for drop() and dropna(), then use 2 different variables, probably drop_axis and dropna_axis.
Assign the modified frame / chain the operations in a single line.
So, the code would look something like this:
def filtered_df(data, col_name, drop_axis=1, dropna_axis=0):
    frame = data.drop(columns = col_name, axis = drop_axis).dropna(axis = dropna_axis)
    return frame
Your call to it can look like:
modified_df = filtered_df(data, ["x_col","y_col"], 0, 0)
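Since the defaults already cover the common case (drop columns, then drop NaN rows), you can usually omit the axis arguments entirely; assuming the characters frame from the question, that call would be:
vill = filtered_df(characters, ["hero", "song"])  # uses drop_axis=1, dropna_axis=0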

Edit all cells of column in Pandas DataFrame conditionally using original value of each cell

I have to edit all cells of one column (here, the column named "Links") in a Pandas DataFrame conditionally, using the original value of each cell.
I know how to modify each cell of a column, but I don't know how to edit a cell using its original value and make the modification conditionally.
I have a simple sample DataFrame:
I am interested in the last column, "Links".
If the cell ends with .html, I need to change it to:
<A> <original value> </A>
for example:
/l/sf-49ers/456346aaa.html
If it is a number, I need to make it:
some-domain.info/number
for example:
some-domain.info/343
If it is text (string):
I need to put it in B tags:
for example:
"Baltimore Rayens"
If it is None I need to replace it with text "No link specified".
I have used this syntax:
def change_links(df):
    conditions = (..........)
    values = [.......................]
    df['Links'] = np.select(conditions, values)
    return df
but this does not work for me.
If I understand you correctly, you need to use the apply function:
def change_links(link):
    if 'html' in link:
        newLink=0 #here change your link
    elif link[-1].isdigit():
        newLink='some-domain.info/'+str(link) #or do what you need
    else:
        newLink=1 #here add your tags
    return newLink

df['newLink'] = df['Links'].apply(lambda x: change_links(x))
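For completeness, here is a sketch that fills in those placeholders with the four rules from the question (.html link, plain number, plain text, and None); the exact output strings are assumptions based on the examples given:
def change_links(link):
    # if missing values are NaN rather than None, test pd.isna(link) instead
    if link is None:
        return 'No link specified'
    if link.endswith('.html'):
        return '<A> ' + link + ' </A>'
    if link.isdigit():
        return 'some-domain.info/' + link
    return '<B> ' + link + ' </B>'

df['Links'] = df['Links'].apply(change_links)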
There's a formatting problem in your question, but you can achieve what you want with the apply() function:
def myFunction(value):
    # For example for an int
    new_value = "some-domain.info/" + str(value)
    return new_value

df['Links'] = df['Links'].apply(myFunction)
I could complete the answer with more information.
Since your question is a bit messy, I am answering based on what I have understood. You have two options in front of you:
apply
map
In either case, you simply need to do something like the following:
def myCustomFunc(valueOfRow):
    # you need to change the value inside this function.
    return valueOfRow + "/more-link"
df["Link"] = df["Link"].apply(myCustomFunc)
If you are interested in map, you can use the map function instead of apply in the above cell.
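Since the question's own attempt used np.select, here is a sketch of that route too. It assumes the 'Links' column holds strings or missing values, and the output formats follow the examples in the question:
import numpy as np

links = df['Links']
conditions = [
    links.isna(),                              # no link at all
    links.str.endswith('.html', na=False),     # an .html path
    links.str.fullmatch(r'\d+', na=False),     # a plain number
]
values = [
    'No link specified',
    '<A> ' + links.fillna('') + ' </A>',
    'some-domain.info/' + links.fillna(''),
]
# anything left over is treated as plain text and wrapped in B tags
df['Links'] = np.select(conditions, values, default='<B> ' + links.fillna('') + ' </B>')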

Select all rows in Python pandas

I have a function that aims at printing the sum along a column of a pandas DataFrame, after filtering on some rows to be defined, as well as the percentage this quantity makes up of the same sum without any filter:
def my_function(df, filter_to_apply, col):
    my_sum = np.sum(df[filter_to_apply][col])
    print(my_sum)
    print(my_sum/np.sum(df[col]))
Now I am wondering if there is any way to have a filter_to_apply that actually doesn't do any filter (i.e. keeps all rows), to keep using my function (that is actually a bit more complex and convenient) even when I don't want any filter.
So, some filter_f1 that would give df[filter_f1] == df and could be combined with other filters: filter_f1 & filter_f2.
One possible answer is: df.index.isin(df.index) but I am wondering if there is anything easier to understand (e.g. I tried to use just True but it didn't work).
A Python slice object that covers everything, i.e. slice(None), selects all indexes of an indexable object (it is what the bare : becomes under the hood; note that slice(-1) would be df[:-1] and drop the last row). So df[slice(None)] selects all rows in the DataFrame. You can store that in a variable as an initial value which you can further refine in your logic:
filter_to_apply = slice(None) # initialize to select all rows
... # logic that may set `filter_to_apply` to something more restrictive
my_function(df, filter_to_apply, col)
This is a way to select all rows by position:
df.iloc[range(0, len(df))]
and this also works:
df[:]
But I haven't figured out a way to pass : as an argument.
There's a function called loc in pandas that filters rows. You could do something like this:
df2 = df.loc[<Filter here>]
#Filter can be something like df['price']>500 or df['name'] == 'Brian'
#basically something that for each row returns a boolean
total = df2['ColumnToSum'].sum()
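Another option that matches the wish to combine filters with &: start from an all-True boolean mask. On its own it selects every row, and it composes with real conditions like any other mask. A minimal sketch (the price column is made up for illustration):
import pandas as pd

mask_all = pd.Series(True, index=df.index)   # df[mask_all] returns every row

price_filter = df['price'] > 500             # hypothetical real filter
my_function(df, mask_all, 'price')                 # no filtering
my_function(df, mask_all & price_filter, 'price')  # filtered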

How to assign objects to a column using the .between_time function

I am trying to label a data frame by on-peak, mid-peak, off-peak, etc. I managed to select the values I want to assign 'Mid-Peak' to with df['Peak'][df['func'] == 'Winter_Weekend']. However, when I include .between_time I get the error SyntaxError: can't assign to function call. I am not sure how to fix this. My goal is for the code to work like the line below. Do I need another function, or do I need to change the syntax? Thank you for the help.
df['Peak'][df['func'] == 'Winter_Weekend'].between_time('16:00','21:00', include_end=False) = 'Mid-Peak'
In general, you can't assign a result to a function call, so you need a different syntax. You could try:
selection = df[df['func'] == 'Winter_Weekend'].between_time('16:00','21:00', include_end=False)
selection["Peak"] = "Mid-Peak"
But this doesn't update your original df, only the rows copied into selection.
To update the original dataframe, one way is to use loc to select both rows and a column, and .index to apply the between_time selection to the original dataframe:
ww = df["func"] == "Winter_Weekend"
df.loc[df[ww].between_time('16:00', '21:00', include_end=False).index, "Peak"] = "Mid-Peak"
I would recommend leveraging np.where() here. Since np.where needs a boolean condition aligned with the whole frame, build the condition from the between_time index first:
ww_times = df[df['func'] == 'Winter_Weekend'].between_time('16:00', '21:00', include_end=False).index
df['Peak'] = np.where(df.index.isin(ww_times), 'Mid-Peak', df['Peak'])
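Both answers above assume the DataFrame has a DatetimeIndex, which is what between_time operates on. A minimal setup sketch with made-up data (newer pandas versions spell the last argument inclusive='left' instead of include_end=False):
import pandas as pd

df = pd.DataFrame(
    {'func': ['Winter_Weekend', 'Winter_Weekday'] * 12, 'Peak': 'Off-Peak'},
    index=pd.date_range('2021-01-02 00:00', periods=24, freq='H'),
)
# rows whose clock time falls in [16:00, 21:00)
df.between_time('16:00', '21:00', include_end=False)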

Not deleting rows as part of function - python

Please keep in mind I am coming from an R background (and am quite a novice as well).
I am trying to create a UDF to format a data.frame df in Python, according to some defined rules. The first part deletes the first 4 rows of the data.frame and the second adds my desired column names. My function looks like this:
def dfFormatF(x):
    #Remove 4 first lines
    x = x.iloc[4:]
    #Assign column headers
    x.columns = ['Name1', 'Name2', 'Name3']

dfFormatF(df)
When I run it like this, it's not working (neither dropping the first rows nor renaming). When I remove the x = x.iloc[4:], the second part x.columns = ['Name1', 'Name2', 'Name3'] works properly and the column names are renamed. Additionally, if I run the removal outside the function, such as:
def dfFormatF(x):
    #Assign column headers
    x.columns = ['Name1', 'Name2', 'Name3']

df = df.iloc[4:]
dfFormatF(df)
before I call my function, I get the full expected result (first the removal of the first rows and then the desired column naming).
Any ideas as to why it is not working as part of the function, but does work outside of it?
Any help is much appreciated.
Thanks in advance.
The issue here is that the changes only exist inside the scope of dfFormatF(). The assignment x = x.iloc[4:] rebinds the local name x to a new object, so the caller's DataFrame is untouched (whereas x.columns = [...] mutates the object itself, which is why the renaming alone appeared to work). Once you exit the function, the sliced result is lost because you do not return it and you do not assign it to something in the module-level scope. It's worth taking a step back to understand this in a general sense (this is not a Pandas-specific thing).
Instead, pass your DF to the function, make the transformations you want to that DF, return the result, and then assign that result back to the name you passed to the function.
Note: this is a big thing in Pandas. What we emulate here is the inplace=True functionality. There are lots of things you can do to DataFrames, and if you don't use inplace=True then those changes will be lost. If you stick with the default inplace=False, then you must assign the result back to a variable (with the same or a different name, up to you).
import pandas as pd
starting_df = pd.DataFrame(range(10), columns=['test'])
def dfFormatF(x):
    #Remove 4 first lines
    x = x.iloc[4:]
    #Assign column headers
    x.columns = ['Name1']
    print('Inside the function')
    print(x.head())
    return x

dfFormatF(starting_df)
print('Outside the function')
print(starting_df) # Note, unchanged

# Take 2
starting_df = dfFormatF(starting_df)
print('Reassigning changes back')
print(starting_df.head())

How to self-reference column in pandas Data Frame?

In Python's Pandas, I am using the Data Frame as such:
drinks = pandas.read_csv(data_url)
Where data_url is a string URL to a CSV file
When indexing the frame for all "light drinkers", where a light drinker is defined by 1 drink, the following is written:
drinks.light_drinker[drinks.light_drinker == 1]
Is there a more DRY-like way to self-reference the "parent"? I.e. something like:
drinks.light_drinker[self == 1]
You can now use query or assign depending on what you need:
drinks.query('light_drinker == 1')
or to mutate the df:
df.assign(strong_drinker = lambda x: x.light_drinker + 100)
Old answer
Not at the moment, but an enhancement along the lines of your idea is being discussed here. For simple cases, where() might be enough. The new API might look like this:
df.set(new_column=lambda self: self.light_drinker*2)
In the most current version of pandas, .where() also accepts a callable!
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html?highlight=where#pandas.DataFrame.where
So, the following is now possible:
drinks.light_drinker.where(lambda x: x == 1)
which is particularly useful in method-chains. However, this will return only the Series (not the DataFrame filtered based on the values in the light_drinker column). This is consistent with your question, but I will elaborate for the other case.
To get a filtered DataFrame, use:
drinks.where(lambda x: x.light_drinker == 1)
Note that this will keep the shape of the calling DataFrame (meaning you will have rows where all entries are NaN, because the condition failed for the light_drinker value at that index).
If you don't want to preserve the shape of the DataFrame (i.e you wish to drop the NaN rows), use:
drinks.query('light_drinker == 1')
Note that the items in DataFrame.index and DataFrame.columns are placed in the query namespace by default, meaning that you don't have to reference the self.
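A small illustration of that namespace behavior, with a hypothetical threshold variable to show that ordinary Python variables are reachable via @:
threshold = 1
drinks.query('light_drinker == @threshold')   # same as drinks.query('light_drinker == 1')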
I don't know of any way to reference parent objects like self or this in Pandas, but perhaps another way of doing what you want, which could be considered more DRY, is where():
drinks.where(drinks.light_drinker == 1, inplace=True)
