Considering a function (apply_this_function) that will be applied to a dataframe:
# our dataset
data = {"addresses": ['Newport Beach, California', 'New York City', 'London, England', 10001, 'Sydney, Au']}
# create a throw-away dataframe
df_throwaway = df.copy()
def apply_this_function(passed_row):
    passed_row['new_col'] = True
    passed_row['added'] = datetime.datetime.now()
    return passed_row
df_throwaway.apply(apply_this_function, axis=1) # axis=1 is important to use the row itself
In df_throwaway.apply(...), where does the function get its "passed_row" parameter from? In other words, what value is this function receiving? My assumption is that, by the structure of apply(), the function takes values from row i starting at 1?
I am referring to the information obtained here
When you apply a function to a DataFrame with axis=1, then
this function is called for each row from the source DataFrame
and by convention its parameter is called row.
In your case this function returns (from each call) the original
row (actually a Series object), with 2 new elements added.
Then apply method collects these rows, concatenates them
and the result is a DataFrame with 2 new columns.
You wrote takes values from row i starting at 1. I would change it to
takes values from each row.
Writing starting at 1 can lead to misunderstandings, since when your
DataFrame has a default index, its values start from 0 (not from 1).
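To see what apply actually hands to the function, here is a minimal sketch on a made-up two-row frame. The `seen` list and the `df_demo` name are only there for illustration; they record the type and the index label of each object passed in.

```python
import pandas as pd

# Hypothetical two-row frame, used only for illustration
df_demo = pd.DataFrame({"addresses": ["London, England", "Sydney, Au"]})

seen = []  # (type name, index label) of every object passed to the function

def apply_this_function(row):
    # `row` is a pandas Series; row.name holds the row's index label
    seen.append((type(row).__name__, row.name))
    row["new_col"] = True
    return row

result = df_demo.apply(apply_this_function, axis=1)
print(seen)                  # every entry is a Series, labels come from the index (0, 1)
print(list(result.columns))  # the original column plus the new one
```

With a default index the labels recorded in `seen` are 0 and 1, which is why "starting at 1" would be misleading.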
In addition, I would like to propose 2 corrections to your code:
Create your DataFrame passing data (your code sample does not
contain creation of df):
df_throwaway = pd.DataFrame(data)
Define your function as:
def apply_this_function(row):
    row['new_col'] = True
    row['added'] = pd.Timestamp.now()
    return row
i.e.:
name the parameter as just row (everybody knows that this row
has been passed by apply method),
instead of datetime.datetime.now() use pd.Timestamp.now()
i.e. a native pandasonic type and its method.
I have a dataframe that consist of two columns:
df = pd.DataFrame({"Country":['Taiwan', 'Malaysia', 'Taiwan', 'Taiwan', 'Malaysia'], 'Rating':[10, 9, 0, 5, 7]})
I made the first function to return the difference of rating between the two countries.
def difference_of_rating_average(dataframe, column_name="Country"):
    taiwan = []
    malaysia = []
    for index, row in taiwan_and_malaysia.iterrows():
        if row[column_name] == "Taiwan":
            taiwan.append(row["Stars"])
        else:
            malaysia.append(row["Stars"])
    return abs((sum(taiwan)/len(taiwan)) - (sum(malaysia)/len(malaysia)))
Then I made the second function, which takes a dataframe as its input parameter. It adds a column named "Shuffle" that contains the shuffled values of the Country column of the input dataframe. In the end, the second function is expected to return the result of the first function, called with the dataframe and the "Shuffle" column.
def one_simulated_difference(table):
    table1 = pd.concat([table["Country"], table["Stars"]], axis=1, keys=['Country', 'Stars'])
    shuffled_labels = table1["Country"].sample(frac=1).values
    shuffled_table = table1
    shuffled_table["Shuffle"] = shuffled_labels
    return difference_of_rating_average(table1, "Shuffle")
However, when I run the second function, I got an error:
KeyError: 'Shuffle'
Which probably means the first function doesn't recognize the "Shuffle" column created by the second function. I have checked the name, including upper and lower case, and it all looks fine.
What is the problem in this code?
Did you mean to do
for index , row in dataframe.iterrows():
rather than
for index , row in taiwan_and_malaysia.iterrows():
It looks like you are trying to iterate through a different table than the one you provide as an argument.
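A sketch of both functions with that change applied. It assumes the rating column is called "Rating", as in the sample frame (the question's functions refer to "Stars", which the sample frame does not contain); the `rating_col` parameter is an addition made here for illustration.

```python
import pandas as pd

def difference_of_rating_average(dataframe, column_name="Country", rating_col="Rating"):
    taiwan = []
    malaysia = []
    # iterate over the frame that was passed in, not over a global table
    for index, row in dataframe.iterrows():
        if row[column_name] == "Taiwan":
            taiwan.append(row[rating_col])
        else:
            malaysia.append(row[rating_col])
    return abs(sum(taiwan) / len(taiwan) - sum(malaysia) / len(malaysia))

def one_simulated_difference(table):
    shuffled = table.copy()  # keep the caller's frame free of the extra column
    shuffled["Shuffle"] = table["Country"].sample(frac=1).values
    return difference_of_rating_average(shuffled, "Shuffle")

df = pd.DataFrame({"Country": ["Taiwan", "Malaysia", "Taiwan", "Taiwan", "Malaysia"],
                   "Rating": [10, 9, 0, 5, 7]})

observed = difference_of_rating_average(df)
print(observed)                      # |(10 + 0 + 5)/3 - (9 + 7)/2| = 3.0
print(one_simulated_difference(df))  # varies from run to run
```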
We have a project where we have multiple *.py scripts with functions that receive and return pandas dataframe variable(s) as arguments.
But this makes me wonder: what is the behavior in memory of the dataframe variables when they are passed as arguments to, or returned from, those functions?
Does modifying the df variable alter the parent/main/global variable as well?
Consider the following example:
import pandas as pd
def add_Col(df):
    df["New Column"] = 10 * 3

def mod_Col(df):
    df["Existing Column"] = df["Existing Column"] ** 2

data = [0, 1, 2, 3]
df = pd.DataFrame(data, columns=["Existing Column"])
add_Col(df)
mod_Col(df)
df
When df is displayed at the end: will the new column show up? What about the change made to "Existing Column" when calling mod_Col?
Did invoking the add_Col function create a copy of df or only a pointer?
What is the best practice when passing dataframes into functions? If they are large enough, I am sure creating copies will have both performance and memory implications, right?
It depends. DataFrames are mutable objects, so like lists, they can be modified within a function, without needing to return the object.
On the other hand, the vast majority of pandas operations will return a new object so modifications would not change the underlying DataFrame. For instance, below you can see that changing values with .loc will modify the original, but if you were to multiply the entire DataFrame (which returns a new object) the original remains unchanged.
If a function combines both types of changes, it modifies your original DataFrame only up to the point where an operation returns a new object; everything after that affects the new object instead.
Changes the original
df = pd.DataFrame([1,2,4])
def mutate_data(df):
    df.loc[1,0] = 7
mutate_data(df)
print(df)
# 0
#0 1
#1 7
#2 4
Will not change original
df = pd.DataFrame([1,2,4])
def mutate_data(df):
    df = df*2
mutate_data(df)
print(df)
# 0
#0 1
#1 2
#2 4
What should you do?
If the purpose of a function is to modify a DataFrame, like in a pipeline, then you should create a function that takes a DataFrame and returns the DataFrame.
def add_column(df):
    df['new_column'] = 7
    return df
df = add_column(df)
#┃ ┃
#┗ on lhs & rhs ┛
In this scenario it doesn't matter if the function changes or creates a new object, because we intend to modify the original anyway.
However, that may have unintended consequences if you plan to assign the result to a new object:
df1 = add_column(df)
# | |
# New Obj Function still modifies this though!
A safe alternative that would require no knowledge of the underlying source code would be to force your function to copy at the top. Thus in that scope changes to df do not impact the original df outside of the function.
def add_column_maintain_original(df):
    df = df.copy()
    df['new_column'] = 7
    return df
Another possibility is to pass a copy to the function:
df1 = add_column(df.copy())
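One way to check which of the two cases you are in is to compare object identity with `is`. The sketch below redefines the two functions shown above so that it runs on its own; the frame contents and the names `original`, `safe` and `same` are made up.

```python
import pandas as pd

def add_column(df):
    df["new_column"] = 7   # in place: mutates the caller's object
    return df

def add_column_maintain_original(df):
    df = df.copy()         # rebinds the local name to a copy
    df["new_column"] = 7
    return df

original = pd.DataFrame({"a": [1, 2, 4]})

safe = add_column_maintain_original(original)
print(safe is original)                   # False: a new object was returned
print("new_column" in original.columns)   # False: original untouched

same = add_column(original)
print(same is original)                   # True: the same object was handed back
print("new_column" in original.columns)   # True: original was modified
```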
Yes, the function will indeed change the data frame itself without creating a copy of it. You should be careful with this, because you might end up with columns changed without noticing.
In my opinion the best practice depends on the use case, and using .copy() will indeed have an impact on your memory.
If, for instance, you are creating a pipeline with some dataframe as input, you do not want to change the input dataframe itself. If, on the other hand, you are just processing a dataframe and splitting the processing across different functions, you can write the functions the way you did.
I have a quite specific question about how ".loc" works on the backend when it is applied directly to a dataframe (e.g. df.loc[]), as opposed to being used inside a defined function that is then applied with "df.apply()".
Here is the MultiIndex dataframe structure I am working with.
[My DataFrame 1]
#Sample Function
def sample(df):
    for i in df:
        val = df.loc['deep_impressions'] > 0
        return val.sum()

df.apply(sample, axis=1)
The above code uses .loc without row/column indication by simply passing the outer column label and, when applied to the DataFrame, returns the correct output, which is the sum of the 2 columns under the "deep_impressions" outer column index.
However, when applying the same logic not using a defined method, I must explicitly state that all rows, and only "deep_impressions" columns are to be summed.
df.loc[:,'deep_impressions'] > 0
df.sum(axis=1)
df
Why doesn't Python require me to explicitly state .loc[:, "deep_impressions"] when it is used in a defined function? How does it work on the backend?
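One likely explanation, offered as a sketch rather than something stated in the thread: with axis=1, apply passes each row as a Series whose index is the frame's column MultiIndex. A Series has only one label axis, so .loc['deep_impressions'] on it already selects along what were the columns. The frame below is invented because the original screenshot is not available, and the inner labels 'a' and 'b' are placeholders.

```python
import pandas as pd

# Hypothetical frame with two-level column labels
cols = pd.MultiIndex.from_tuples([
    ("deep_impressions", "a"),
    ("deep_impressions", "b"),
    ("clicks", "a"),
])
df = pd.DataFrame([[1, 0, 5], [2, 3, 0], [0, 0, 1]], columns=cols)

def sample(row):
    # `row` is a Series indexed by the column MultiIndex,
    # so the outer label alone is enough for .loc
    return (row.loc["deep_impressions"] > 0).sum()

via_apply = df.apply(sample, axis=1)

# Direct equivalent on the frame: rows and columns must both be addressed
direct = (df.loc[:, "deep_impressions"] > 0).sum(axis=1)

print(via_apply.tolist())
print(direct.tolist())
```

Both approaches count, per row, how many "deep_impressions" entries are greater than zero.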
I would like to know how I could iterate through each column of a dataframe to perform some calculations and store the result in another dataframe.
df_empty = []
m = daily.ix[:,-1] #Columns= stocks & Rows= daily returns
stocks = daily.ix[:,:-1]
for col in range(len(stocks.columns)):
    s = daily.ix[:,col]
    covmat = np.cov(s,m)
    beta = covmat[0,1]/covmat[1,1]
    return (beta)
    print(beta)
In the above example, I first want to calculate a covariance matrix between "s" (the columns representing stocks daily returns and for which I want to iterate through one by one) and "m" (the market daily return which is my reference column/the last column of my dataframe). Then I want to calculate the beta for each covariance pair stock/market.
I'm not sure why return(beta) gives me a single numerical result for one stock while print(beta) prints the beta for all stocks.
I'd like to find a way to create a dataframe with all these betas.
beta_df = df_empty.append(beta)
I have tried the above code but it returns 'none' as if it could not append the outcome.
Thank you for your help
The return statement within your for-loop ends the loop itself the first time the return is encountered. Moreover, you are not saving the beta value anywhere because the for-loop itself does not return a value in python (it only has side effects).
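Note also that list.append modifies the list in place and returns None, which is why beta_df came out as None. Below is a sketch of the loop with the return removed and each beta stored under its column label. The random data stands in for the question's daily frame, the column names are placeholders, and the last column is taken as the market.

```python
import numpy as np
import pandas as pd

# Dummy returns; column names are placeholders, last column is the market
rng = np.random.default_rng(0)
daily = pd.DataFrame(rng.normal(size=(100, 4)),
                     columns=["stock_a", "stock_b", "stock_c", "market"])

m = daily.iloc[:, -1]        # market returns (reference column)
stocks = daily.iloc[:, :-1]  # every column except the last

betas = {}
for name in stocks.columns:
    covmat = np.cov(stocks[name], m)
    betas[name] = covmat[0, 1] / covmat[1, 1]  # cov(stock, market) / var(market)

# one row per stock, a single column named "beta"
beta_df = pd.Series(betas, name="beta").to_frame()
print(beta_df)
```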
Apart from that, you may choose a more pandas-like approach using apply on the data frame which basically iterates over the columns of the data frame and passes each column to a supplied function as the first parameter while returning the result of the function call. Here is a minimal working example with some dummy data:
import pandas as pd
import numpy as np
# create some dummy data
daily = pd.DataFrame(np.random.randint(100, size=(100, 5)))
# define reference column
cov_column = daily.iloc[:, -1]
# setup computation function
def compute(column):
    covmat = np.cov(column, cov_column)
    return covmat[0,1]/covmat[1,1]
# use apply to iterate over columns
result = daily.iloc[:, :-1].apply(compute)
# show output
print(result)
0 -0.125382
1 0.024777
2 0.011324
3 -0.017622
dtype: float64
If I have a csv file where one column holds ordereddicts, how do I create a new column that extracts a single element from each ordereddict, using python (3.+) / pandas (.18)?
Here's an example. My column, attributes, has billingPostalCodes hidden in ordereddicts. All I care about is creating a column with the billingPostalCodes.
Here's what my data looks like now:
import pandas as pd
from datetime import datetime
import csv
from collections import OrderedDict
df = pd.read_csv('sf_account_sites.csv')
print(df)
yields:
id attributes
1 OrderedDict([(u'attributes', OrderedDict([(u'type', u'Account'), (u'url', u'/services/data/v29.0/sobjects/Account/001d000001tKZmWAAW')])), (u'BillingPostalCode', u'85020')])
2 OrderedDict([(u'attributes', OrderedDict([(u'type', u'Account'), (u'url', u'/services/data/v29.0/sobjects/Account/001d000001tKZmWAAW')])), (u'BillingPostalCode', u'55555')])
...
I know on an individual level if I do this:
dict = OrderedDict([(u'attributes', OrderedDict([(u'type', u'Account'), (u'url', u'/services/data/v29.0/sobjects/Account/001d000001tKZmWAAW')])), (u'BillingPostalCode', u'85020')])
print(dict['BillingPostalCode'])
I'll get 85020 back as a result.
What do I have to get it to look like this?
id zip_codes
1 85020
2 55555
...
Do I have to use an apply function? A for loop? I've tried a lot of different things but I can't get anything to work on the dataframe.
Thanks in advance, and let me know if I need to be more specific.
This took me a while to work out, but the problem is resolved by doing the following:
df.apply(lambda row: row["attributes"]["BillingPostalCode"], axis = 1)
The trick here is to note that axis = 1 forces pandas to iterate through every row, rather than each column (which is the default setting, as seen in the docs).
DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None,
args=(), **kwds)
Applies function along input axis of DataFrame.
Objects passed to functions are Series objects having index either the
DataFrame’s index (axis=0) or the columns (axis=1). Return type
depends on whether passed function aggregates, or the reduce argument
if the DataFrame is empty.
Parameters:
func : function Function to apply to each column/row
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
0 or ‘index’: apply function to each column
1 or ‘columns’: apply function to each row
From there, it is a simple matter to first extract the relevant column - in this case attributes - and then from there extract only the BillingPostalCode.
You'll need to format the resulting DataFrame to have the correct column names.
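A sketch of the whole step, assuming each cell of attributes already holds an OrderedDict object. The frame is built by hand from the two sample rows, and `make_record` is a helper invented for this example; values read straight from a CSV file arrive as strings and would need parsing first, which is not shown here.

```python
from collections import OrderedDict
import pandas as pd

def make_record(zip_code):
    # hypothetical helper mirroring the structure shown in the question
    return OrderedDict([
        ("attributes", OrderedDict([
            ("type", "Account"),
            ("url", "/services/data/v29.0/sobjects/Account/001d000001tKZmWAAW"),
        ])),
        ("BillingPostalCode", zip_code),
    ])

df = pd.DataFrame({
    "id": [1, 2],
    "attributes": [make_record("85020"), make_record("55555")],
})

# axis=1: the lambda receives one row at a time
df["zip_codes"] = df.apply(lambda row: row["attributes"]["BillingPostalCode"], axis=1)

result = df[["id", "zip_codes"]]
print(result)
```

Assigning the output of apply to a named column, as done here, takes care of the column naming mentioned above.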