I am posting this and hoping I will get a convincing answer.
df is my dataframe. I want to know what is being passed to min_max by the apply function. When I print row inside min_max, I don't get a dataframe like the one I see when I print df outside it.
import numpy as np
import pandas as pd

def min_max(row):
    print(row)
    print()
    data = row[['POPESTIMATE2010',
                'POPESTIMATE2011',
                'POPESTIMATE2012',
                'POPESTIMATE2013',
                'POPESTIMATE2014',
                'POPESTIMATE2015']]
    return pd.Series({'min': np.min(data), 'max': np.max(data)})

df.apply(min_max, axis=1)
df.apply simply calls the function you provide, in your case min_max, once for each object along the given axis. Per the documentation of apply, axis=1 applies the function to each row and axis=0 applies it to each column.
Thus, in your case, min_max is invoked once per row of the dataframe, and each row is passed as a Series whose index is the column names, not as a dataframe.
To elaborate:
import pdb
import pandas as pd

def print_funt(row):
    pdb.set_trace()  # pauses on every call so that row can be inspected
    print(row)

df = pd.DataFrame({'Temp1': [62, 62, 50, 62, 50, 62, 62],
                   'Temp2': [66, 66, 69, 66, 69, 66, 66],
                   'Temp3': [52, 62, 52, 62, 52, 62, 52],
                   'Target': [0.24, 0.28, 0.25, 0.28, 0.25, 0.28, 0.24]})
print(df)
df.apply(print_funt, axis=1)
output of apply function at first iteration
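If you would rather not step through with the debugger, here is a minimal sketch that records what apply hands to the function. The helper inspect_row and the shortened frame are my own illustration, not part of the original post.

```python
import pandas as pd

df = pd.DataFrame({'Temp1': [62, 62, 50],
                   'Temp2': [66, 66, 69],
                   'Target': [0.24, 0.28, 0.25]})

seen = []  # one entry per call: (type name, index labels, row label)

def inspect_row(row):
    # hypothetical helper: record what apply passed in, then compute a value
    seen.append((type(row).__name__, list(row.index), row.name))
    return row['Temp1'] + row['Temp2']

# axis=1 calls inspect_row once per row
totals = df.apply(inspect_row, axis=1)

print(seen[0])
print(totals.tolist())
```

Each call receives a Series whose index is the column names and whose name is the row label, which is why printing row inside the function does not look like the dataframe.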
Related
I have a Python class that takes a geopandas Series or Dataframe to initialize (I am working with geopandas specifically, but I imagine the solution is the same as for pandas). This class has attributes/methods that use the various columns in the series/dataframe. Outside of this, I have a dataframe with many rows. I would like to iterate through this dataframe (ideally in an efficient/parallel manner, as the rows are independent of each other), call a method of the class for each row (i.e. each Series), and append the results as a column to the dataframe. But I am having trouble with this. With the standard list comprehension/pandas apply() methods, I can call like this e.g.:
gdf1['function_return_col'] = list(map((lambda f: my_function(f)), gdf2['date']))
But if said function (or in my case, class) needs the entire gdf, and I call like this:
gdf1['function_return_col'] = list(map((lambda f: my_function(f)), gdf2))
It does not work because 'my_function()' takes a dataframe or series, while what is being sent to it is the column names (strings) of gdf2.
How can I apply a function to all rows in a dataframe if said function takes an entire dataframe/series and not just select column(s)? In my specific case, since it's a method in a class, I would like to do this, or something similar to call this method on all rows in a dataframe:
gdf1['function_return_col'] = list(map((lambda f: my_class(f).my_method()), gdf2))
Or am I just thinking of this in the entirely wrong way?
Have you tried the pandas dataframe method called "apply"?
Here is an example of using it along both the column axis and the row axis.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})
# axis=0: the function receives each column as a Series
df1 = df.apply(np.sum, axis=0)
print(df1)
# axis=1: the function receives each row as a Series
df1 = df.apply(np.sum, axis=1)
print(df1)
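Applied to the question, a rough sketch could look like the following. RowWrapper is a hypothetical stand-in for my_class, the width/height columns are invented, and plain pandas is used here on the assumption that geopandas behaves the same way for apply.

```python
import pandas as pd

class RowWrapper:
    """Hypothetical stand-in for my_class: initialized from one row (a Series)."""

    def __init__(self, row):
        self.row = row

    def my_method(self):
        # uses more than one column of the same row
        return self.row['width'] * self.row['height']

gdf2 = pd.DataFrame({'width': [1, 2, 3], 'height': [10, 20, 30]})

# Iterating a dataframe directly yields its column labels,
# which is what map() was sending to the function
labels = list(gdf2)

# With axis=1, each row reaches the lambda as a Series
gdf2['function_return_col'] = gdf2.apply(
    lambda row: RowWrapper(row).my_method(), axis=1)

print(labels)
print(gdf2)
```

The result is aligned on the index, so it can be assigned to a column of another dataframe such as gdf1 as long as the two share the same index.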
I created a simple function to clean up a certain column in df row by row:
def replace(df):
    for index, row in df.iterrows():
        row['ALARM_TEXT'] = row['ALARM_TEXT'].replace('\'','')
    return df
But the input df has not been changed after I call the function. Is there something wrong with it?
iterrows typically hands you a copy of each row, so assigning to row does not write back to df. We usually do
df['ALARM_TEXT'] = df['ALARM_TEXT'].str.replace('\'','')
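A small sketch of the difference, using made-up data. The ALARM_ID column is my addition: with mixed column dtypes the rows produced by iterrows are copies, which I assume matches the asker's frame.

```python
import pandas as pd

df = pd.DataFrame({'ALARM_ID': [1, 2],
                   'ALARM_TEXT': ["pump 'A' failed", "valve 'B' stuck"]})

# Writing to the row from iterrows(): the row is a copy here,
# so the dataframe itself is left untouched
for index, row in df.iterrows():
    row['ALARM_TEXT'] = row['ALARM_TEXT'].replace("'", '')
unchanged = df['ALARM_TEXT'].tolist()

# Column-wise replacement, assigned back to the dataframe
df['ALARM_TEXT'] = df['ALARM_TEXT'].str.replace("'", '', regex=False)
cleaned = df['ALARM_TEXT'].tolist()

print(unchanged)
print(cleaned)
```

Passing regex=False is optional for a plain quote character; it just makes the literal match explicit.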
What is a more elegant way of implementing the code below?
I want to apply a function, my_function, to a dataframe where each row of the dataframe contains the parameters of the function. Then I want to write the output of the function back to the dataframe row.
results = pd.DataFrame()
for row in input_panel.iterrows():
    (index, row_contents) = row
    row_contents['target'] = my_function(*list(row_contents))
    results = pd.concat([results, row_contents])
We'll iterate through the values and build a DataFrame at the end.
results = pd.DataFrame([my_function(*x) for x in input_panel.values.tolist()])
The less recommended method is using DataFrame.apply with axis=1, so that each row (rather than each column) is unpacked into the function:
results = input_panel.apply(lambda x: my_function(*x), axis=1)
The only advantage of apply is less typing.
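To write the output back next to the inputs, as the question asks, something along these lines should work. The two-argument my_function and the sample frame are placeholders I made up for illustration.

```python
import pandas as pd

def my_function(a, b):
    # hypothetical stand-in for the real function
    return a + 2 * b

input_panel = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# Build the whole column in one pass and attach it once
results = input_panel.copy()
results['target'] = [my_function(*x) for x in input_panel.to_numpy().tolist()]

# Same values via apply; axis=1 unpacks each row into the function
via_apply = input_panel.apply(lambda x: my_function(*x), axis=1)

print(results)
```

Attaching the column once avoids growing a dataframe with pd.concat inside the loop.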
I have a problem that has been annoying me for some time now. I have written a function that should, based on the row values of a dataframe, create a new dataframe filled with values based on a condition in the function. My function looks like this:
def intI():
    df_ = pd.DataFrame()
    df_ = df_.fillna(0)
    for index, row in Anno.iterrows():
        genes=row['AR_Genes'].split(',')
        df=pd.DataFrame()
        if 'intI1' in genes:
            df['Year']=row['Year']
            df['Integrase']= 1
            df_=df_.append(df)
        elif 'intI2' in genes:
            df['Year']=row['Year']
            df['Integrase']= 1
            df_=df_.append(df)
        else:
            df['Year']=row['Year']
            df['Integrase']= 0
            df_=df_.append(df)
    return df_
When I call it like this, Newdf=Anno['AR_Genes'].apply(intI()), I get the following error:
TypeError: 'DataFrame' object is not callable
I really do not understand why it does not work. I have done similar things before, but there seems to be a difference that I do not get. Can anybody explain what is wrong here?
*******************EDIT*****************************
Anno in the function is the dataframe that the function shall be run on. Its AR_Genes column contains a comma-separated string, for example a,b,c,ad,c
apply takes a function (a callable) and calls it on each element of a Series, or on each row/column of a DataFrame. The error occurs because intI() is evaluated first, so what you pass to apply is the DataFrame that intI returns rather than a function.
Why do you use .fillna(0) on a newly created, empty DataFrame?
Wouldn't this work? Newdf = intI()
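If the goal is one Integrase flag per row, a possible rewrite is to give apply a function that handles a single gene string. This is only a sketch: has_integrase and the sample data are my own, and I am assuming the Year and AR_Genes columns from the question.

```python
import pandas as pd

Anno = pd.DataFrame({'Year': [2001, 2002, 2003],
                     'AR_Genes': ['intI1,sul1', 'tetA,blaTEM', 'aadA,intI2']})

def has_integrase(gene_string):
    # hypothetical helper: 1 if either integrase gene is listed, else 0
    genes = gene_string.split(',')
    return int('intI1' in genes or 'intI2' in genes)

# Pass the function itself (no parentheses); apply calls it per element
Newdf = pd.DataFrame({'Year': Anno['Year'],
                      'Integrase': Anno['AR_Genes'].apply(has_integrase)})

print(Newdf)
```

This also sidesteps DataFrame.append, which was removed in pandas 2.0.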
I have an apply function that operates on each row in my dataframe. The result of that apply function is a new value. This new value is intended to go in a new column for that row.
So, after applying this function to all of the rows in the dataframe, there will be an entirely new column in that dataframe.
How do I do this in pandas?
Two ways primarily:
df['new_column'] = df.apply(my_fxn, axis=1)
or
df = df.assign(new_column=df.apply(my_fxn, axis=1))
If you need to use other arguments, you can pass them to the apply function, but sometimes it's easier (for me) to just use a lambda:
df['new_column'] = df.apply(lambda row: my_fxn(row, global_dict), axis=1)
Additionally, if your function can operate on arrays in a vectorized fashion, you could just do:
df['new_column'] = my_fxn(df['col1'], df['col2'])
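Put together on a toy frame, the variants above might look like this. The body of my_fxn, the column names and the contents of global_dict are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
global_dict = {'offset': 100}  # hypothetical extra argument

def my_fxn(row, extra):
    # hypothetical row-wise function: works on a Series or on whole columns
    return row['col1'] * row['col2'] + extra['offset']

# Extra argument supplied through a lambda
df['new_column'] = df.apply(lambda row: my_fxn(row, global_dict), axis=1)

# Extra argument supplied through apply's args parameter
df['new_column_args'] = df.apply(my_fxn, axis=1, args=(global_dict,))

# Vectorized: operate on whole columns at once, no apply needed
df['new_column_vec'] = df['col1'] * df['col2'] + global_dict['offset']

print(df)
```

All three columns hold the same values; the vectorized form is usually the fastest when the function can be written that way.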