Apply function to two columns of a Pandas dataframe - python

I'm finding several answers to this question, but none that seem to address or solve the error that pops up when I apply them. Per e.g. this answer I have a dataframe df and a function my_func(string_1,string_2) and I'm attempting to create a new column with the following:
df['new_column'] = df.apply(lambda x: my_func(x['old_col_1'],x['old_col_2']),axis=1)
I'm getting an error originating inside my_func telling me that old_col_1 is type float and not a string as expected. In particular, the first line of my_func is old_col_1 = old_col_1.lower(), and the error is
AttributeError: 'float' object has no attribute 'lower'
By including debug statements using dataframe printouts I've verified old_col_1 and old_col_2 are indeed both strings. If I explicitly cast them to strings when passing as arguments, then my_func behaves as you would expect if it were being fed numeric data cast as strings, though the column values are decidedly not numeric.
Per this answer I've even explicitly ensured these columns are not being "intelligently" cast incorrectly when creating the dataframe:
df = pd.read_excel(file_name, sheetname,header=0,converters={'old_col_1':str,'old_col_2':str})
The function my_func works very well when it's called on its own. All this is making me suspect that the indices or some other numeric data from the dataframe is being passed, and not (exclusively) the column values.
Other implementations seem to give the same problem. For instance,
df['new_column'] = np.vectorize(my_func)(df['old_col_1'],df['old_col_2'])
produces the same error. Variations (e.g. using df['old_col_1'].to_numpy() or df['old_col_1'].values in place of df['old_col_1']) don't change this.

Is it possible that you have np.nan/None/null values in your columns? A missing value is stored as np.nan, which is a float, so calling .lower() on it fails. If so, you might be getting an error similar to the one this data produces:
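import numpy as np
import pandas as pd

data = {
    'Column1': ['1', '2', np.nan, '3']
}
df = pd.DataFrame(data)
# Raises AttributeError: 'float' object has no attribute 'lower' (np.nan is a float)
df['Column1'] = df['Column1'].apply(lambda x: x.lower())
df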
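One way to confirm this and work around it is sketched below. The data and the body of my_func are hypothetical stand-ins (the original function was not posted), and returning NaN for incomplete rows is only one possible policy. Note that str(np.nan) is 'nan', which would be consistent with the "numeric data cast as strings" behaviour described in the question.

```python
import numpy as np
import pandas as pd

def my_func(string_1, string_2):
    # Hypothetical stand-in for the asker's function.
    return string_1.lower() + "_" + string_2.lower()

# Made-up data with one missing value in old_col_1.
df = pd.DataFrame({
    "old_col_1": ["Foo", np.nan, "Baz"],
    "old_col_2": ["A", "B", "C"],
})

# Flag rows where either input is missing; these are the rows that break .lower().
missing = df[["old_col_1", "old_col_2"]].isna().any(axis=1)
print(df.loc[missing])

def safe_func(row):
    # Skip rows with missing inputs instead of calling string methods on NaN.
    if pd.isna(row["old_col_1"]) or pd.isna(row["old_col_2"]):
        return np.nan
    return my_func(row["old_col_1"], row["old_col_2"])

df["new_column"] = df.apply(safe_func, axis=1)
print(df)
```

If the missing rows should be dropped or filled instead, df.dropna(subset=[...]) or df.fillna('') before the apply call would be the alternatives to consider.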

Related

Why does Pandas cut the zero in the first position?

I load data into a dataframe:
dfzips = pd.read_excel(filename_zips, dtype='object')
The dataframe has a column with the value 00590.
After loading the dataframe, I get 590 instead.
I have tried dtype='object', but it does not help.
Have you tried using str instead of object?
If you use str (string), it keeps the zeros at the beginning.
It can also help to specify the name of the column you would like to read as str or object (written without quotation marks).
dfzips = pd.read_excel(filename_zips,dtype=str)
dtype also accepts a dict mapping, where the keys are the column names and the values are the data types to apply. You didn't specify the column name, so I just used "column_1":
dfzips = pd.read_excel(filename_zips,dtype={"column_1":str})
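The effect of the dtype argument can be seen in a small sketch. An in-memory CSV stands in for the spreadsheet so the example runs without a file (read_csv accepts dtype in the same way as read_excel), and the column names "zip" and "city" are made up. For a real Excel file there is one caveat: if the cell is stored as a number and only displayed with leading zeros, the stored value is already 590, so padding with str.zfill may still be needed.

```python
import io
import pandas as pd

# In-memory CSV standing in for the spreadsheet; "zip" is a made-up column name.
raw = "zip,city\n00590,Springfield\n12345,Shelbyville\n"

# Default type inference parses the column as integers and drops the leading zeros.
inferred = pd.read_csv(io.StringIO(raw))

# Reading the column as str keeps the text exactly as written.
as_text = pd.read_csv(io.StringIO(raw), dtype={"zip": str})

print(inferred["zip"].tolist())
print(as_text["zip"].tolist())

# Fallback when the zeros are already gone: pad back to a fixed width of 5.
padded = inferred["zip"].astype(str).str.zfill(5)
print(padded.tolist())
```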
This is a well-known issue in the pandas library.
Try the code below:
dfzips = pd.read_excel(filename_zips, dtype={'name of the column': str})
Or try passing converters when reading the Excel file. For example:
df = pd.read_excel(file, dtype='string', header=headers,
                   converters={'name of the column': decimal_converter})
where decimal_converter is a function such as:
def decimal_converter(value):
    # Note: float() discards leading zeros ('00590' -> '590.0');
    # return str(value) instead if the zeros must be kept.
    try:
        return str(float(value))
    except ValueError:
        return value
The general form is converters={'column name': function}.
You can modify the converter function according to your requirements.
Try the solutions above; I hope they work. Good day.

DeprecationWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version warning

I am appending a new row to an existing pandas dataframe as follows:
df= df.append(pd.Series(), ignore_index=True)
This is resulting in the subject DeprecationWarning.
The existing df has a mix of string, float and datetime.date datatypes (8 columns total).
Is there a way to explicitly specify the column types in df.append?
I have looked here and here but I still have no solution. Please advise if there is a better way to append a row to the end of an existing dataframe without triggering this warning.
You can try this:
Type_new = pd.Series([],dtype=pd.StringDtype())
This creates an empty Series with an explicit string dtype.
You can add dtype to your code:
pd.Series(dtype='float64')
df = df.append(pd.Series(dtype = 'object'), ignore_index=True)
If the accepted solution still results in:
'ValueError: No objects to concatenate'
try this solution from FutureWarning: on `df['col'].apply(p.Series)`:
(lambda x: pd.Series(x, dtype="float"))
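DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the row has to be added another way. The sketch below shows two options under stated assumptions: the two-column frame is made up, and the blank-row variant assumes a default integer index, which is what ignore_index=True would have produced.

```python
import pandas as pd

# Made-up frame standing in for the asker's 8-column dataframe.
df = pd.DataFrame({"name": ["a", "b"], "value": [1.5, 2.5]})

# Blank row: extend a default RangeIndex by one label; the new row is all missing.
with_blank = df.reset_index(drop=True).reindex(range(len(df) + 1))

# Row with data: wrap it in a one-row DataFrame and concatenate.
new_row = pd.DataFrame([{"name": "c", "value": 3.5}])
with_values = pd.concat([df, new_row], ignore_index=True)

print(with_blank.dtypes)
print(with_values)
```

Integer columns would become float in the blank-row variant, since NaN cannot be stored in a plain int64 column.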

Clarification about Replace and Dtypes with Pandas

This is a strange problem. I have not been able to produce an MVE.
I have two datasets in pandas. They contain some Series that can take three values: "Yes", "No", NaN. These Series currently have dtype object.
I want to remove the NaNs from them, and to prepare them to be used by ML algorithms, so I do this:
final_df1 = d1.dropna(how='any').replace({"Yes":1, "No":0})
final_df2 = d2.dropna(how='any').replace({"Yes":1, "No":0})
In final_df1 the Dtype of the Series I mentioned before becomes automatically int64 after dropping the NaN values and replacing the values. In final_df2, this does not happen. They contain the same values (0 and 1) so I really do not understand this.
In order to create a Minimum Viable Example, I tried to:
- Isolate the Series, do the transformation on them one by one and check the results
- Take only a small portion of the Dataframes
- Save the DFs on disk and work on them from another script to recreate the problem
But, in any of those attempts, the result was different. Either both DFs ended up with Series having Object Dtype, or both with Int64 Dtype.
For me, this is important, because later on I need the dummies of those DFs, and if some Int series are Object series on the other DF, the columns will not match. This problem is easy to solve, I just need to cast explicitly, but still I have one doubt and I would need to confirm it:
If I replace the content of an Object Series (without NaNs) with numbers, is there a random possibility of this Series being cast to Int64?
I see this as the only explanation for what I am facing
Thanks in advance. If you find any way to clarify my question, please edit or comment
EDIT 1: Screenshots from Spyder
This is the code. I am printing in console the most essential relevant data: Dtype, values and number of Nulls
This is the output before the Drop/Replace. I could have printed something nicer to read, but the idea is simple: before the Drop/Replace they both have null values, they both have "Yes" and "No" values, and they are both object type Series.
Aaaaand this is after the Drop/Replace. As you can see, they both have no nulls now, they both have 1/0, but one of them is an object Series, the other is an int64 Series.
I really do not understand: they were the same type before!
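Here is a sample to reproduce it. col_1 contains the string '0', which replace leaves untouched, so the column still mixes integers with a string and stays object, while col_2 ends up holding only integers and becomes int64.
If you change the '0' in col_1 to 0, the dtype will change.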
import pandas as pd
import numpy as np
data = {'col_1': ['Yes', 'No', np.nan, '0'], 'col_2': [np.nan, 'Yes', 'Yes', 'No']}
df=pd.DataFrame.from_dict(data)
d1=df[['col_1']]
d2=df[['col_2']]
print(d1.dtypes)
print(d2.dtypes)
final_df1 = d1.dropna(how='any').replace({"Yes":1, "No":0})
final_df2 = d2.dropna(how='any').replace({"Yes":1, "No":0})
print(final_df1.dtypes)
print(final_df2.dtypes)
You can also convert the datatypes in the final_df definition:
final_df1 = d1.dropna(how='any').replace({"Yes":1, "No":0}).astype(int)
final_df2 = d2.dropna(how='any').replace({"Yes":1, "No":0}).astype(int)
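To find out which values keep a column at object dtype, and to get the same dtype regardless of what the data contains, something like the following may help. It is a sketch under assumptions: it uses Series.map and the nullable Int64 dtype in place of replace, and it treats any value outside the mapping as missing, which may or may not be the right policy for your data.

```python
import numpy as np
import pandas as pd

# Same values as col_1 in the sample above, with NaN already dropped.
s = pd.Series(["Yes", "No", np.nan, "0"], dtype=object, name="col_1").dropna()

mapping = {"Yes": 1, "No": 0}

# Values outside the mapping are the ones that keep the column at object dtype.
unexpected = sorted(set(s) - set(mapping))
print(unexpected)

# map() turns unmapped values into NaN; Int64 is pandas' nullable integer dtype,
# so the result has the same dtype whether or not stray values are present.
encoded = s.map(mapping).astype("Int64")
print(encoded)
```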

Excel column converter does not work with a specific column

I tried to write a program that lets the user enter a column, sorts that column, and replaces cell values with other entered information, but I get errors that I assume are syntax errors.
I searched but could not find any solution.
import pandas as pd
data = pd.read_csv('List')
df = pd.DataFrame(data, columns = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O'])
findL = ['example']
replaceL = ['convert']
col = 'C'
df[col] = df[col].replace(findL, replaceL)
TypeError: Cannot compare types 'ndarray(dtype=float64)' and 'str'
It seems that your df[col] and findL/replaceL do not have the same datatype. Try running df[col] = df[col].astype(str) before you run df[col] = df[col].replace(findL, replaceL), and it should work.
If the column(s) you are dealing with contain blank entries, set the na_filter parameter of the .read_csv() method to False.
That way, blank/empty entries are read as empty strings rather than NaN, so the non-empty entries in the column stay strings as well.
Calling .replace() on such a column will not raise a TypeError, because strings are compared with strings and not 'ndarray(dtype=float64)' with 'str'.
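The na_filter behaviour can be checked with a small sketch. An in-memory CSV with made-up columns "A" and "C" stands in for the asker's file, and the find/replace lists mirror the ones in the question.

```python
import io
import pandas as pd

# In-memory CSV with one blank cell in column C; the contents are made up.
raw = "A,C\n1,example\n2,\n3,other\n"

# Default parsing turns the blank cell into NaN.
default = pd.read_csv(io.StringIO(raw))
print(default["C"].isna().tolist())

# With na_filter=False the blank cell is kept as an empty string.
df = pd.read_csv(io.StringIO(raw), na_filter=False)

find_l = ["example"]
replace_l = ["convert"]
df["C"] = df["C"].replace(find_l, replace_l)
print(df["C"].tolist())
```

Keep in mind that na_filter=False applies to the whole file, so blank cells in other columns are kept as empty strings too.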

Using pandas apply() function on a dataframe to create a new dataframe

I have a problem annoying me for some time now. I have written a function that should, based on the row values of a dataframe, create a new dataframe filled with values based on a condition in the function. My function looks like this:
def intI():
    df_ = pd.DataFrame()
    df_ = df_.fillna(0)
    for index, row in Anno.iterrows():
        genes=row['AR_Genes'].split(',')
        df=pd.DataFrame()
        if 'intI1' in genes:
            df['Year']=row['Year']
            df['Integrase']= 1
            df_=df_.append(df)
        elif 'intI2' in genes:
            df['Year']=row['Year']
            df['Integrase']= 1
            df_=df_.append(df)
        else:
            df['Year']=row['Year']
            df['Integrase']= 0
            df_=df_.append(df)
    return df_
When I call it like this, Newdf=Anno['AR_Genes'].apply(intI()), I get the following error:
TypeError: 'DataFrame' object is not callable
I really do not understand why it does not work. I have done similar things before, but there seems to be a difference that I do not get. Can anybody explain what is wrong here?
*******************EDIT*****************************
Anno in the function is the dataframe that the function shall be run on. Its AR_Genes column contains a string, for example a,b,c,ad,c
DataFrame.apply takes a function and applies it to every row or column of the DataFrame. The error occurs because intI() is called immediately, so its return value (a DataFrame) is what gets passed to apply, and apply then tries to call that DataFrame as if it were a function.
Why do you use .fillna(0) on a newly created, empty DataFrame?
Wouldn't this work? Newdf = intI()
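If the goal is one output row per input row, the loop can be avoided by passing a per-value function to apply (without parentheses, so the function itself is passed and not its result). This is a sketch, not the asker's code: the anno frame and the helper name has_integrase are made up. It also sidesteps a second problem in the loop: assigning a scalar to a column of an empty DataFrame creates the column but no rows, so each append would add nothing.

```python
import pandas as pd

# Hypothetical stand-in for the asker's Anno dataframe.
anno = pd.DataFrame({
    "Year": [2001, 2005, 2010],
    "AR_Genes": ["intI1,sul1", "tetA,blaTEM", "aadA,intI2"],
})

def has_integrase(gene_string):
    # 1 if either integrase gene appears in the comma-separated list, else 0.
    genes = gene_string.split(",")
    return int("intI1" in genes or "intI2" in genes)

# Pass the function itself (no parentheses) so apply can call it once per value.
new_df = pd.DataFrame({
    "Year": anno["Year"],
    "Integrase": anno["AR_Genes"].apply(has_integrase),
})
print(new_df)
```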
