Handling missing values in a function applied to pandas dataframe columns - python

I'm trying to apply a function to my 'age' and 'area' columns in order to get the results shown in the 'wanted' column.
Unfortunately this function gives me errors. I know that there are other methods in pandas, like iloc, but I would like to understand this particular situation.
import numpy as np
import pandas as pd

raw_data = {'age': [-1, np.nan, 10, 300, 20],
            'area': ['N', 'S', 'W', np.nan, np.nan],
            'wanted': ['A', np.nan, 'A', np.nan, np.nan]}
df = pd.DataFrame(raw_data, columns=['age', 'area', 'wanted'])
df

def my_funct(df):
    if df["age"].isnull():
        return np.nan
    elif df["area"].notnull():
        return 'A'
    else:
        return np.nan

df["target"] = df.apply(lambda df: my_funct(df), axis=1)

The problem is that when apply passes a row to your function, df["age"] gives you a scalar float, which doesn't have a method called isnull(). To check whether a scalar is null, you can use the pd.isna function; the same goes for pd.notna in place of notnull().
def my_funct(df):
    if pd.isna(df["age"]):
        return np.nan
    elif pd.notna(df["area"]):
        return 'A'
    else:
        return np.nan

df["target"] = df.apply(lambda x: my_funct(x), axis=1)

Remove less than character '<' and return half the numeric component styled to show changes

I need to clean up some data. For items in a dataframe of the format '<x' I want to return x/2, so if the cell contents is '<10' it should be replaced with 5, and if it is '<0.006' it should be replaced with 0.003, etc. I want changed cells to be formatted red and bold. I have the following code, which operates in two steps; each step does (almost) what I want, but I get TypeError: 'float' object is not iterable when I try to chain them using:
fixed_df = df.style.apply(color_less_than, axis=None).applymap(lessthan)
Note that the actual dataset may be thousands of rows and will contain mixed data types. Dummy data and code:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['<10', '20', 'foo', '<30', '40'],
                   'B': ['baz', '<dlkj', 'bar', 'foo', '<5']})

def color_less_than(x):
    c1 = 'color: red; font-weight: bold'
    c2 = ''
    df1 = pd.DataFrame(c2, index=x.index, columns=x.columns)
    for col in x.columns:
        mask = x[col].str.startswith("<")
        #display(mask)
        df1.loc[mask, col] = c1
    return df1

def lessthan(x):
    if isinstance(x, np.generic):
        return x.item()
    elif type(x) is int:
        return x
    elif type(x) is float:
        return x
    elif type(x) is str and x[0] == "<":
        try:
            return float(x[1:]) / 2
        except ValueError:
            return x
    elif type(x) is str and len(x) < 10:
        try:
            return float(x)
        except ValueError:
            return x
    else:
        return x

coloured = df.style.apply(color_less_than, axis=None)
halved = df.applymap(lessthan)
display(coloured)
display(halved)
Note that the df item '<dlkj' does not display at all after applying color_less_than, and I don't know why; I want it to be returned unformatted, since it should not be changed (it's a string and can't be 'halved'). I have been trying to use the boolean mask to do both the calculation and the formatting, but I can't get it to work.
This code will loop through the entire dataset and change any value of the form '<' followed by an integer or float to that number divided by 2. It will then also check whether the value is a string such as 'dlkj'. The line that adds the colour/bold style to the cell might need testing; I did not attempt to run it.
for col in df:
    for value in df[col].values:
        if '<' in value:
            num = value.split('<')[1]
            try:
                df[col] = df[col].replace([value], int(num) / 2)
            except ValueError:
                try:
                    df[col] = df[col].replace([value], float(num) / 2)
                except ValueError:
                    print(num)  # <-- should be your '<dlkj' value
                    # not sure if this line of code will work or not, wasn't able to test it
                    #df.style.set_properties(subset=df[col][value], **{'color': 'red', 'font-weight': 'bold'})
Without the style mapping, the desired output DF can be reached like so:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['<10', '20', 'foo', '<30', '40'],
                   'B': ['baz', '<dlkj', 'bar', 'foo', '<5']})

for col in df.columns:
    mask = df[col].str.match('<[0-9]+$|<[0-9]+[.][0-9]+$')
    tmp = pd.to_numeric(df[col].str.slice(1), errors='coerce')
    df[col] = np.where(mask, tmp / 2, df[col])

print(df)
#       A      B
# 0   5.0    baz
# 1    20  <dlkj
# 2   foo    bar
# 3  15.0    foo
# 4    40    2.5
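To also get the red/bold styling the question asks for, the same boolean mask can drive both the value replacement and the formatting. A sketch along those lines (the variable names halved and styled are mine, not from the original posts):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['<10', '20', 'foo', '<30', '40'],
                   'B': ['baz', '<dlkj', 'bar', 'foo', '<5']})

# One boolean mask marking every '<number' cell that will be changed.
mask = df.apply(lambda col: col.str.match('<[0-9]+$|<[0-9]+[.][0-9]+$'))

halved = df.copy()
for col in df.columns:
    tmp = pd.to_numeric(df[col].str.slice(1), errors='coerce')
    halved[col] = np.where(mask[col], tmp / 2, df[col])

# Style the already-transformed frame with the same mask. With axis=None
# the function receives the whole DataFrame and must return a
# same-shaped frame of CSS strings.
styled = halved.style.apply(
    lambda _: pd.DataFrame(np.where(mask, 'color: red; font-weight: bold', ''),
                           index=df.index, columns=df.columns),
    axis=None)
display(styled)

Because the mask is computed once on the original strings, '<dlkj' is neither halved nor highlighted.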

Pandas / Python: Groupby.apply() with function dictionary

I'm trying to implement something like this:
import pandas as pd

def RR(x):
    x['A'] = x['A'] + 1
    return x

def Locked(x):
    x['A'] = x['A'] + 2
    return x

func_mapper = {"RR": RR, "Locked": Locked}
df = pd.DataFrame({'A': [1, 1], 'LookupVal': ['RR', 'Locked'], 'ID': [1, 2]})
df = df.groupby("ID").apply(lambda x: func_mapper[x.LookupVal.first()](x))
The output for column A would then be 2 and 3,
where x.LookupVal is a column of strings (it will have the same value within each groupby("ID")) that I want to pass as the key for the dictionary lookup.
Any suggestions on how to implement this?
Thanks!
Series.first is not what you think it is: it is meant for time-series data and requires an offset parameter. You are probably confusing it with groupby's first.
You can use iloc[0] to get the first value instead:
df.groupby("ID").apply(lambda x: func_mapper[x.LookupVal.iloc[0]](x))
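Putting it together with the definitions from the question, the dispatch then works as intended:

# iloc[0] returns the group's first LookupVal as a plain string,
# which func_mapper can use as a key.
out = df.groupby("ID").apply(lambda x: func_mapper[x.LookupVal.iloc[0]](x))
print(out['A'].tolist())  # [2, 3]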

Setting the values of a pandas df column based on ranges of values of another df column

I have a df that looks like this:
df = pd.DataFrame({'a':[-3,-2,-1,0,1,2,3], 'b': [1,2,3,4,5,6,7]})
I would like to create a column 'c' that looks at the values of 'a' to determine which operation to perform on 'b' and displays the result in the new column 'c'.
I have a solution that uses iterrows; however, my real df is large and iterrows is inefficient.
What I would like to do is do this operation in a vectorized form.
My 'slow' solution is:
df['c'] = 0
for index, row in df.iterrows():
    if row['a'] <= -2:
        df.loc[index, 'c'] = row['b'] * np.sqrt(row['b'] * row['a'])
    if row['a'] > -2 and row['a'] < 2:
        df.loc[index, 'c'] = np.log(row['b'])
    if row['a'] >= 2:
        df.loc[index, 'c'] = row['b']**3
Use np.select. It's a vectorized operation.
conditions = [
    df['a'] <= -2,
    (df['a'] > -2) & (df['a'] < 2),
    df['a'] >= 2,
]
values = [
    df['b'] * np.sqrt(df['b'] * df['a']),
    np.log(df['b']),
    df['b']**3,
]
df['c'] = np.select(conditions, values, default=0)
Note that np.select evaluates every entry of values eagerly, so the sqrt is computed for all rows (and will emit a RuntimeWarning where b * a is negative) even though only the rows matching each condition end up in 'c'.
You can use .apply across multiple columns of a pandas DataFrame (specifying axis=1) with a lambda function to get the job done. Not sure if the speed is OK. See this example:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [-3, -2, -1, 0, 1, 2, 3], 'b': [1, 2, 3, 4, 5, 6, 7]})

def func(a_, b_):
    if a_ <= -2:
        return b_ * (b_ * a_)**0.5
    elif a_ < 2:
        return np.log(b_)
    else:
        return b_**3.

df['c'] = df[['a', 'b']].apply(lambda x: func(x['a'], x['b']), axis=1)
One method is to index by conditions and then operate on just those rows. Something like this:
df['c'] = np.nan
indices = [
    df['a'] <= -2,
    (df['a'] > -2) & (df['a'] < 2),
    df['a'] >= 2,
]
ops = [
    lambda x: x['b'] * np.sqrt(x['b'] * x['a']),
    lambda x: np.log(x['b']),
    lambda x: x['b']**3,
]
for ix, op in zip(indices, ops):
    df.loc[ix, 'c'] = op(df)
def my_func(x):
    if x['a'] <= -2:
        return x['b'] * np.sqrt(x['b'] * x['a'])
    # write other conditions as needed

df['c'] = df.apply(lambda x: my_func(x), 1)
The df.apply function iterates over each row of the dataframe and applies the function passed (i.e. the lambda function). The second argument is axis, which is set to 1; this means it iterates over rows, and each row's values are passed into the lambda function. By default axis is 0, in which case it iterates over columns.
Lastly, you need to return a value, which will be set as the column 'c' value.
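As a small illustration of the axis argument (df2 is my own toy frame, not from the question):

import pandas as pd

df2 = pd.DataFrame({'a': [-3, 0, 3], 'b': [1, 4, 7]})

# axis=1: the function receives one row at a time, as a Series
# indexed by the column names.
print(df2.apply(lambda row: row['a'] + row['b'], axis=1).tolist())  # [-2, 4, 10]

# axis=0 (the default): the function receives one whole column at a time.
print(df2.apply(lambda col: col.max(), axis=0).tolist())  # [3, 7]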

Creating a Pandas column from a nested dictionary

I have a nested dictionary called datastore containing keys m, n, o and finally 'target_a', 'target_b', or 'target_c' (these contain the values). Additionally, I have a pandas dataframe df, which contains a number of columns. Three of these columns, 'r', 's', and 't', contain values that can be used as keys to find the values in the dictionary.
With the code below, I have attempted to do this using a lambda function; however, it requires calling the function three times, which seems pretty inefficient! Is there a better way of doing this? Any help would be much appreciated.
def find_targets(m, n, o):
    if m == 0:
        return [1.5, 1.5, 1.5]
    else:
        a = datastore[m][n][o]['target_a']
        b = datastore[m][n][o]['target_b']
        c = datastore[m][n][o]['target_c']
        return [a, b, c]

df['a'] = df.apply(lambda x: find_targets(x['r'], x['s'], x['t'])[0], axis=1)
df['b'] = df.apply(lambda x: find_targets(x['r'], x['s'], x['t'])[1], axis=1)
df['c'] = df.apply(lambda x: find_targets(x['r'], x['s'], x['t'])[2], axis=1)
You can have your apply return a pd.Series and then do the assignment in one pass using df.merge.
Here's an example that modifies your function to return a pd.Series, but other solutions work as well, e.g. keeping your finder function as you defined it and converting its result to a Series inside the lambda expression.
def find_targets(m, n, o):
    if m == 0:
        return pd.Series({'a': 1.5, 'b': 1.5, 'c': 1.5})
    else:
        a = datastore[m][n][o]['target_a']
        b = datastore[m][n][o]['target_b']
        c = datastore[m][n][o]['target_c']
        return pd.Series({'a': a, 'b': b, 'c': c})

df.merge(df.apply(lambda x: find_targets(x['r'], x['s'], x['t']), axis=1),
         left_index=True, right_index=True)
If you make find_targets return a dictionary and convert it to a pandas.Series in your lambda, apply will create the columns for you and return a dataframe with the columns you want.
def find_targets(m, n, o):
    if m == 0:
        return {'a': 1.5, 'b': 1.5, 'c': 1.5}
    else:
        targets = {}
        targets['a'] = datastore[m][n][o]['target_a']
        targets['b'] = datastore[m][n][o]['target_b']
        targets['c'] = datastore[m][n][o]['target_c']
        return targets

abc_df = df.apply(lambda x: pd.Series(find_targets(x['r'], x['s'], x['t'])), axis=1)
df = pd.concat((df, abc_df), axis=1)
If you can't change the find_targets function you could still zip it with the keys you need:
abc_dict = dict(zip('abc', old_find_targets(...)))
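Another option, if find_targets keeps returning a plain list, is apply's result_type='expand', which spreads the returned list across columns in a single pass. A sketch, assuming the df and find_targets from the question:

# Each returned [a, b, c] list is expanded into three columns,
# assigned here to 'a', 'b' and 'c' in one apply call.
df[['a', 'b', 'c']] = df.apply(lambda x: find_targets(x['r'], x['s'], x['t']),
                               axis=1, result_type='expand')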

Error when merging pandas df. TypeError: ("object of type 'float' has no len()", 'occurred at index D')

The code below effectively merges all values in a pandas df row that come before any 4-letter string. This only applies to rows directly underneath 'X' in Col A.
df = pd.DataFrame({
    'A': ['X', 'Foo', 'No', '', 'X', 'Big', 'No'],
    'B': ['', 'Bar', 'Merge', '', '', 'Cat', 'Merge'],
    'C': ['', 'Fubu', 'XXXX', '', '', 'BgCt', 'YYYY'],
})

maskX = df.iloc[:, 0].apply(lambda x: x == 'X')
maskX.index += 1
maskX = pd.concat([pd.Series([False]), maskX])
maskX = maskX.drop(len(maskX) - 1)

mask = (df.iloc[:, 1:].applymap(len) == 4).cumsum(1) == 0
for i, v in maskX.items():
    mask.iloc[i, :] = mask.iloc[i, :].apply(lambda x: x and v)

df.A[maskX] = df.A + df.iloc[:, 1:][mask].fillna('').apply(lambda x: x.sum(), 1)
df.iloc[:, 1:] = df.iloc[:, 1:][~mask].fillna('')
This works fine unless there are values other than strings in the df. If I include floats or integers, it returns an error for that column, e.g.
df = pd.DataFrame({
    'A': ['X', 'Foo', 'No', '', 'X', 'Big', 'No'],
    'B': ['', 'Bar', 'Merge', '', '', 'Cat', 'Merge'],
    'C': ['', 'Fubu', 'XXXX', '', '', 'BgCt', 'YYYY'],
    'D': ['', '', 1.0, 2.0, 3.0, '', ''],
})
TypeError: ("object of type 'float' has no len()", 'occurred at index D')
I'm not quite sure why, because the merge only occurs on the rows beneath 'X' in Col A, none of which contain floats.
applymap applies the function len to each element of the dataframe. Since floating-point numbers do not have length, the function cannot be applied to them. If you still want to know their "length," convert them to strings:
df.iloc[:, 1:].astype(str).applymap(len)
However, be advised that the function str is not guaranteed to produce a particular string representation of a float. For example, len(str(5.0000)) is 3, not 6, as you might expect.
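Alternatively, if the goal is to keep the length-based mask working on mixed columns, a sketch that applies len only to actual strings (treating everything else as length 0, which is my assumption, not part of the original code) would be:

# Non-string cells get length 0, so they can never match the
# 4-character condition used to build the merge mask.
cell_len = df.iloc[:, 1:].applymap(lambda v: len(v) if isinstance(v, str) else 0)
mask = (cell_len == 4).cumsum(1) == 0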
