Using pandas apply() function on a dataframe to create a new dataframe - python

I have a problem that has been annoying me for some time now. I have written a function that should, based on the row values of a dataframe, create a new dataframe filled with values determined by a condition in the function. My function looks like this:
def intI():
    df_ = pd.DataFrame()
    df_ = df_.fillna(0)
    for index, row in Anno.iterrows():
        genes = row['AR_Genes'].split(',')
        df = pd.DataFrame()
        if 'intI1' in genes:
            df['Year'] = row['Year']
            df['Integrase'] = 1
            df_ = df_.append(df)
        elif 'intI2' in genes:
            df['Year'] = row['Year']
            df['Integrase'] = 1
            df_ = df_.append(df)
        else:
            df['Year'] = row['Year']
            df['Integrase'] = 0
            df_ = df_.append(df)
    return df_
When I call it like this: Newdf=Anno['AR_Genes'].apply(intI()), I get the following error:
TypeError: 'DataFrame' object is not callable
I really do not understand why it does not work. I have done similar things before, but there seems to be a difference that I do not get. Can anybody explain what is wrong here?
*******************EDIT*****************************
Anno in the function is the dataframe that the function shall be run on. Its AR_Genes column contains a string, for example a,b,c,ad,c

Series.apply (like DataFrame.apply) expects a function, which it then calls on each value (or on each row/column for a DataFrame). The error occurs because intI() is called first, and the DataFrame it returns is what gets passed to apply, which then tries to call it.
Why do you use .fillna(0) on a newly created, empty DataFrame?
Wouldn't this work? Newdf = intI()
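A minimal sketch of one way to build the new dataframe without a row loop, assuming Anno has a Year column and a comma-separated AR_Genes column as described in the question. The sample rows, the gene names other than intI1/intI2, and the helper name has_integrase are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data shaped like the question's Anno dataframe.
Anno = pd.DataFrame({
    "Year": [2001, 2002, 2003],
    "AR_Genes": ["intI1,blaTEM", "sul1,intI2", "tetA,aadA"],
})

def has_integrase(genes: str) -> int:
    """Return 1 if the comma-separated string lists intI1 or intI2, else 0."""
    names = genes.split(",")
    return int("intI1" in names or "intI2" in names)

# Pass the function itself (no parentheses) so apply can call it per value.
Newdf = pd.DataFrame({
    "Year": Anno["Year"],
    "Integrase": Anno["AR_Genes"].apply(has_integrase),
})
print(Newdf)
```

Note that apply receives has_integrase, not has_integrase(); calling the function yourself is what produced the "not callable" error in the question.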

Related

Pandas Dataframe: Function doesn't preserve my custom column order when returning df

I followed this code from user Lala la (https://stackoverflow.com/a/55803252/19896454) to put 3 columns at the front and leave the rest unchanged. It works well inside the function, but when the function returns the dataframe, the column order is lost.
My desperate solution was to put the code on the main program...
Other functions in my code are able to return modified versions of the dataframe with no problem.
Any ideas what is happening?
Thanks!
def define_columns_order(df):
    cols_to_order = ['LINE_ID', 'PARENT.CATEGORY', 'CATEGORY']
    new_columns = cols_to_order + (df.columns.drop(cols_to_order).tolist())
    df = df[new_columns]
    return df
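Try using return df.reindex(new_columns, axis=1). Also keep in mind that most DataFrame operations are not in place unless you specify inplace=True, and that df = df[new_columns] inside the function only rebinds the local name. You therefore need to explicitly assign the result returned by your function to your df variable, e.g. df = define_columns_order(df).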
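A small sketch of the difference between discarding and assigning the return value. The column names come from the question; the sample dataframe and the extra "value" column are assumptions for illustration.

```python
import pandas as pd

def define_columns_order(df):
    cols_to_order = ["LINE_ID", "PARENT.CATEGORY", "CATEGORY"]
    # Remaining columns keep their existing relative order.
    new_columns = cols_to_order + df.columns.drop(cols_to_order).tolist()
    return df.reindex(new_columns, axis=1)

# Hypothetical input with the three key columns out of order.
df = pd.DataFrame({
    "value": [1],
    "CATEGORY": ["a"],
    "LINE_ID": [10],
    "PARENT.CATEGORY": ["p"],
})

define_columns_order(df)              # result discarded: df is unchanged
original_order = df.columns.tolist()

df = define_columns_order(df)         # result assigned: df is reordered
reordered = df.columns.tolist()
print(original_order, reordered)
```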

Apply function to a list of columns in a loop and output dataframe

I have a function that is set up like this:
def function(df, apply_col, static_col):
    # do something
    return df
#calling the function
df = function(df, 'col_a', 'st_col')
This works fine. But, I want to apply this function to a list of columns by name. I tried something like this:
col_list = ['col_a', 'col_b', 'col_c'....'col_n']
for i in list:
    df = function(df, i, 'st_col')
I get a TypeError: 'list' object is not callable
I would like to apply this function to a dataframe in a loop with the static column staying the same and returning a resulting dataframe with all the columns having the function applied to them. Any ideas are appreciated!
I think your issue is with the line for i in list:. list is the name of Python's built-in type, not your variable, so the line should read for i in col_list:
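A minimal sketch of the corrected loop. The body of function is not shown in the question, so the multiplication below is a placeholder assumption, as are the sample dataframe and its values.

```python
import pandas as pd

def function(df, apply_col, static_col):
    # Placeholder operation (an assumption): scale apply_col by static_col.
    df = df.copy()
    df[apply_col] = df[apply_col] * df[static_col]
    return df

# Hypothetical data using the column names from the question.
df = pd.DataFrame({"col_a": [1, 2], "col_b": [3, 4], "st_col": [10, 10]})
col_list = ["col_a", "col_b"]

# Iterate over col_list (the variable), not list (the built-in type).
for col in col_list:
    df = function(df, col, "st_col")
print(df)
```

Reassigning df on each pass means every call sees the result of the previous one, so the final dataframe has the function applied to all listed columns.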

Applying function to dataframe column?

I have the following function (one-hot encoding function that takes a column as an input). I basically want to apply it to a column in my dataframe, but can't seem to understand what's going wrong.
def dummies(dataframe, col):
    dataframe[col] = pd.Categorical(dataframe[col])
    pd.concat([dataframe, pd.get_dummies(dataframe[col], prefix='c')], axis=1)

df1 = df['X'].apply(dummies)
Guessing something is wrong with how I'm calling it?
You need to make sure you're returning a value from the function; currently you are not. Also, when you apply a function to a column, each value in the column is passed into the function one at a time, so your function has the wrong signature for that. Typically you'd do it like this:
def function1(value):
    new_value = value * 2  # some operation
    return new_value
then:
df['X'].apply(function1)
Currently your function is set up to take the entire df and the name of a column, so it will likely work if you call it like this:
df1 = dummies(df, 'X')
but you still need to add a return statement.
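A sketch of the function with the return statement added, called directly instead of through apply. The sample dataframe is an assumption, and the pd.Categorical conversion from the question is left out here because get_dummies also accepts a plain string column.

```python
import pandas as pd

def dummies(dataframe, col):
    # One-hot encode the column and append the indicator columns.
    encoded = pd.get_dummies(dataframe[col], prefix="c")
    return pd.concat([dataframe, encoded], axis=1)

# Hypothetical data with a categorical column X.
df = pd.DataFrame({"X": ["a", "b", "a"], "Y": [1, 2, 3]})
df1 = dummies(df, "X")
print(df1)
```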
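If you want to apply a function to that one column, you don't need to make a new dataframe. Note that dummies as written takes a dataframe and a column name, so it cannot be called with a single value; for a function that takes one value, such as function1 above, this is the syntax. Please read the docs.
df['X'] = df['X'].apply(lambda x: function1(x))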

How to use df.loc and if conditions in python pandas to delete a row

I wanted to use the if condition and df.loc[..] to compare two values in the same column.
If the previous value is higher than the next one, I want to delete the complete row.
This is what I tried, with my example:
import pandas as pd
data = [('cycle',[1,1,2,2,3,3,4,4]),
('A',[0.1,0.5,0.2,0.6,0.15,0.43,0.13,0.59]),
('B',[ 500, 600, 510,580,512,575,499,598]),
('time',[0.0,0.2,0.5,0.4,0.6,0.7,0.5,0.8]),]
# DataFrame.from_items was removed in pandas 1.0; dict(data) keeps the same column order
df = pd.DataFrame(dict(data))
df = df.drop(df.loc[i,'time']<df.loc[i-1,'time'].index)
print(df)
and I got the following error :
TypeError: '<' not supported between instances of 'numpy.ndarray' and 'str'
Help is very much appreciated.
Try this:
df.drop(df.loc[df.time< df.time.shift()].index, inplace=True)
One problem is you are applying .index on the second df, before the comparison. You might try something like this:
df = df.drop((df.loc[i,'time'] < df.loc[i-1,'time']).index)
Try using pd.DataFrame.shift
Using shift:
df[df.time > df.time.shift()]
df.time.shift() returns the original series with its values moved down by one position, so you can compare it to the original series element-wise. Each value is compared with the one immediately before it. The first value has no predecessor, so it is compared with NaN, and that comparison always evaluates to False. You can set the fill_value parameter to determine the behavior for that first row:
df[df.time > df.time.shift(fill_value=0)]
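A sketch of the shift-based filter on the data from the question. This version drops only the rows whose time is lower than the previous row's time, so the first row is kept; the names went_backwards and filtered are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "cycle": [1, 1, 2, 2, 3, 3, 4, 4],
    "A": [0.1, 0.5, 0.2, 0.6, 0.15, 0.43, 0.13, 0.59],
    "B": [500, 600, 510, 580, 512, 575, 499, 598],
    "time": [0.0, 0.2, 0.5, 0.4, 0.6, 0.7, 0.5, 0.8],
})

# True where a row's time is lower than the time in the row before it.
# The first row is compared with NaN, which yields False, so it is kept.
went_backwards = df["time"] < df["time"].shift()

filtered = df[~went_backwards]
print(filtered)
```

Each row is compared with its predecessor in the original dataframe, not with the last row that survived the filter.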

New column using apply function on other columns in dataframe

I have a dataframe where three of the columns are coordinates of data ('H_x', 'H_y' and 'H_z'). I want to calculate radius-vector of the data and add it as a new column in my dataframe. But I have some kind of problem with pandas apply function.
My code is:
def radvec(x, y, z):
    rv = np.sqrt(x**2 + y**2 + z**2)
    return rv
halo_field['rh_field']=halo_field.apply(lambda row: radvec(row['H_x'], row['H_y'], row['H_z']), axis=1)
The error I'm getting is:
group_sh.py:78: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
halo_field['rh_field']=halo_field.apply(lambda row: radvec(row['H_x'], row['H_y'], row['H_z']), axis=1)
I get the column that I want, but I'm still confused by this warning message.
I'm aware there are similar questions here, but I couldn't find how to solve my problem. I'm fairly new to python. Can you help?
Edit: halo_field is a slice of another dataframe:
halo_field = halo_res[halo_res.N_subs==1]
The problem is you're working with a slice, which can be ambiguous:
halo_field = halo_res[halo_res.N_subs==1]
You have two options:
Work on a copy
You can explicitly copy your dataframe to avoid the warning and ensure your original dataframe is unaffected:
halo_field = halo_res[halo_res.N_subs==1].copy()
halo_field['rh_field'] = halo_field.apply(...)
Work on the original dataframe conditionally
Use pd.DataFrame.loc with a Boolean mask to update your original dataframe:
mask = halo_res['N_subs'] == 1
halo_res.loc[mask, 'rh_field'] = halo_res.loc[mask].apply(...)
Don't use apply
As a side note, in either scenario you can avoid apply for your function. For example:
halo_field['rh_field'] = (halo_field[['H_x', 'H_y', 'H_z']]**2).sum(1)**0.5
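A sketch combining the explicit copy with the vectorised calculation. The column names come from the question; the sample values in halo_res are assumptions chosen so the radii are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's halo_res dataframe.
halo_res = pd.DataFrame({
    "N_subs": [1, 2, 1],
    "H_x": [3.0, 1.0, 0.0],
    "H_y": [4.0, 1.0, 0.0],
    "H_z": [0.0, 1.0, 2.0],
})

# .copy() makes halo_field an independent dataframe, so assigning a
# new column to it does not trigger SettingWithCopyWarning.
halo_field = halo_res[halo_res["N_subs"] == 1].copy()

# Vectorised radius: sqrt(x^2 + y^2 + z^2) computed for all rows at once.
coords = halo_field[["H_x", "H_y", "H_z"]]
halo_field["rh_field"] = np.sqrt((coords ** 2).sum(axis=1))
print(halo_field)
```

Because halo_field is a copy, halo_res does not gain the rh_field column; use the .loc mask approach if the original dataframe should be updated.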
