This code snippet works well:
df['art_kennz'] = df.apply(lambda x:myFunction(x.art_kennz), axis=1)
However, here I have hard-coded the column name art_kennz in both places: df['art_kennz'] and x.art_kennz. Now I want to modify the script so that I have a list of column names and df.apply runs for all of those columns. So I tried this:
cols_with_spaces = ['art_kennz', 'fk_wg_sch']
for col_name in cols_with_spaces:
    df[col_name] = df.apply(lambda x: myFunction(x.col_name), axis=1)
but this raises an error:
AttributeError: 'Series' object has no attribute 'col_name'
because of x.col_name. Here, col_name is supposed to be the element from the for loop. What would be the correct syntax for this?
Try:
for col_name in cols_with_spaces:
    df[col_name] = df.apply(lambda x: myFunction(x[col_name]), axis=1)
Explanation: You can access an element of the Series with attribute syntax, e.g. x.art_kennz, but only when the name is written out literally. Since col_name is a variable holding the name as a string, x.col_name looks for an attribute literally called col_name, so bracket syntax, x[col_name], is the correct way.
In x.art_kennz the column name is written literally, but in the for loop the name is held in a variable, and attribute access (x.col_name) does not look up the variable's value.
Try this (this approach iterates row by row):
for col_name in cols_with_spaces:
    df[col_name] = df.apply(lambda row: myFunction(row[col_name]), axis=1)
If you want to process the data column by column instead, you can try this:
for col_name in cols_with_spaces:
    df[col_name] = df[col_name].apply(myFunction)
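The two variants above can be compared side by side in a minimal sketch. The function my_function and the sample data are hypothetical stand-ins (the original myFunction is not shown in the thread); here it is assumed to strip whitespace, which the list name cols_with_spaces hints at.

```python
import pandas as pd


def my_function(value):
    # Hypothetical stand-in for myFunction: strip surrounding whitespace.
    return value.strip()


# Made-up sample data using the column names from the question.
df = pd.DataFrame({
    "art_kennz": [" a ", "b  "],
    "fk_wg_sch": ["  x", " y "],
    "other": [" keep ", " me "],
})
cols_with_spaces = ["art_kennz", "fk_wg_sch"]

# Row-wise: apply passes each row as a Series, indexed with brackets.
row_wise = df.copy()
for col_name in cols_with_spaces:
    row_wise[col_name] = row_wise.apply(
        lambda row: my_function(row[col_name]), axis=1
    )

# Column-wise: Series.apply calls the function once per value.
col_wise = df.copy()
for col_name in cols_with_spaces:
    col_wise[col_name] = col_wise[col_name].apply(my_function)

print(row_wise.equals(col_wise))
```

Both loops should give the same result; the column-wise form avoids building a Series for every row, so it is usually the cheaper of the two.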
Related
I'm working on a project where I would like to use 2 lambda functions to find a match in another column. I created a dummy df with the following code:
df = pd.DataFrame(np.random.randint(0,10,size=(100, 4)), columns=list('ABCD'))
Now I would like to find column A matches in column B.
df['match'] = df.apply(lambda x: x['B'].find(x['A']), axis=1).ge(0)
Now I would like to add an extra check where I'm also checking if column C values appear in column D:
df['match'] = df.apply(lambda x: x['D'].find(x['C']), axis=1).ge(0)
I'm looking for a way to combine these 2 lines of code into a one-liner, for example with an '&' operator.
You can use the and operator inside a single lambda instead, comparing each find result with 0:
df['match'] = df.apply(lambda x: x['B'].find(x['A']) >= 0 and x['D'].find(x['C']) >= 0, axis=1)
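As a minimal sketch of combining the two checks: the example below assumes the columns hold strings, since .find is a str method and would not work on the random integers in the dummy frame. The sample values are made up for illustration.

```python
import pandas as pd

# Made-up string data; .find and the in operator need str values.
df = pd.DataFrame({
    "A": ["ab", "zz", "c"],
    "B": ["xabx", "abc", "abc"],
    "C": ["1", "2", "9"],
    "D": ["123", "456", "789"],
})

# Variant 1: one lambda, both substring checks joined with and.
df["match"] = df.apply(
    lambda x: x["A"] in x["B"] and x["C"] in x["D"],
    axis=1,
)

# Variant 2: two boolean Series combined element-wise with &.
a_in_b = df.apply(lambda x: x["B"].find(x["A"]), axis=1).ge(0)
c_in_d = df.apply(lambda x: x["D"].find(x["C"]), axis=1).ge(0)
df["match_and"] = a_in_b & c_in_d

print(df)
```

Inside a lambda the values are plain Python objects, so and is appropriate there; & is the element-wise operator for whole boolean Series.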
I have loop logic using iterrows, but the performance is bad:
result = []
for index, row in df_test.iterrows():
    result.append(product_recommendation_model.predict(df_test.iloc[[index]]))
submission = pd.DataFrame({'ID': df_test['ID'],
                           'Result': result
                           })
display(submission)
I would like to rewrite it using apply with a lambda, but I have no idea how to pass each row as a full data frame.
a = df_test.apply(lambda x: product_recommendation_model.predict(df_test.iloc[[x]]) ,axis=1)
Can anyone help me please? Thanks.
I think this will work for you:
df_new = df_test.apply(lambda row: pd.Series([row['ID'], product_recommendation_model.predict(row)]), axis=1)
df_new.columns = ['ID','Result']
Note: row contains all column values of that row. If you want to pass only one column value to predict, you can pass row[column_name] instead.
In the end, I got it running with the code below:
df_test.apply(lambda i: product_recommendation_model.predict(i.to_frame().T), axis=1)
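A minimal sketch of the row-by-row call next to a single batched call. DummyModel is a hypothetical stand-in for product_recommendation_model, which is not shown in the thread; it assumes, as many scikit-learn style estimators do, that predict accepts a 2-D frame and returns one prediction per row. If the real model behaves that way, the loop may not be needed at all.

```python
import pandas as pd


class DummyModel:
    """Hypothetical stand-in for product_recommendation_model."""

    def predict(self, frame):
        # Assumed contract: 2-D input, one prediction per input row.
        return (frame["f1"] + frame["f2"]).to_numpy()


# Made-up test data; f1 and f2 are illustrative feature names.
df_test = pd.DataFrame({"ID": [10, 11, 12], "f1": [1, 2, 3], "f2": [4, 5, 6]})
model = DummyModel()

# Row by row: turn each row Series back into a one-row DataFrame.
row_by_row = df_test.apply(
    lambda row: model.predict(row.to_frame().T)[0], axis=1
)

# Batched: a single predict call on the whole frame.
batched = model.predict(df_test)

submission = pd.DataFrame({"ID": df_test["ID"], "Result": batched})
print(submission)
```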
I'm trying to pre-process some data for machine learning purposes. I'm currently trying to clean up some NaN values and replace them with 'unknown' plus a prefix or suffix based on the column name.
The reason for this is that when I use one-hot encoding, I can't have multiple columns with the same name being fed into xgboost.
So what I have is the following
df = df.apply(lambda x: x.replace(np.nan, 'unknown'))
And I'd like to replace all instances of NaN in the df with 'unknown_columname'. Is there any easy or simple way to do this?
Try df = df.apply(lambda x: x.replace(np.nan, f'unknown_{x.name}')).
You can also use df = df.apply(lambda x: x.fillna(f'unknown_{x.name}')).
First, let's build a mapping from each column name to the value to fill in whenever that column has a missing value:
s = {col_name: f'unknown_{col_name}' for col_name in df.columns}
Then we can simply replace each NaN with the right value from s:
cols = df.columns.values
for col_name in cols:
    df[col_name] = df[col_name].fillna(s[col_name])
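DataFrame.fillna also accepts a dict that maps column names to fill values, so the per-column replacement can be done in one call. A minimal sketch with made-up data (the column names color and shape are illustrative, and unique column names are assumed):

```python
import numpy as np
import pandas as pd

# Made-up categorical data with missing values.
df = pd.DataFrame({
    "color": ["red", np.nan, "blue"],
    "shape": [np.nan, "round", np.nan],
})

# One fill value per column, built from the column name.
fill_values = {col: f"unknown_{col}" for col in df.columns}

# fillna with a dict fills each column with its own value.
df = df.fillna(fill_values)
print(df)
```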
I want the following to happen:
for every column in df, check if its type is numeric; if not, use a label encoder to map str/obj values to numeric classes (e.g. 0, 1, 2, 3...).
I am trying to do it in the following way:
for col in df:
    if not np.issubdtype(df[col].dtype, np.number):
        df[col] = LabelEncoder().fit_transform(df[col])
I see a few problems here.
First - column names can repeat and thus df[col] returns more than one column, which is not what I want.
Second - df[col].dtype throws error:
AttributeError: 'DataFrame' object has no attribute 'dtype'
which I assume arises from issue #1, i.e. multiple columns are returned. But I am not confident.
Third - would assigning df[col] = LabelEncoder().fit_transform(df[col]) lead to a column substitution in df or should I do some esoteric df partitioning and concatenation?
Thank you
Since LabelEncoder supports only one column at a time, iteration over columns is your only option. You can make this a little more concise using select_dtypes to select the columns, and then df.apply to apply the LabelEncoder to each column.
cols = df.select_dtypes(exclude=[np.number]).columns
df[cols] = df[cols].apply(lambda x: LabelEncoder().fit_transform(x))
Alternatively, you could build a mask by selecting object dtypes only (a little more flaky but easily extensible):
m = df.dtypes == object
# m = [not np.issubdtype(d, np.number) for d in df.dtypes]
df.loc[:, m] = df.loc[:, m].apply(lambda x: LabelEncoder().fit_transform(x))
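The question also mentions repeated column names, in which case df[col] returns a DataFrame rather than a Series. One way around that is to walk the columns by position. The sketch below is a pandas-only alternative that swaps LabelEncoder for pd.factorize with sort=True (integer codes in sorted order of the unique values); encode_non_numeric is a hypothetical helper name and the data is made up.

```python
import pandas as pd


def encode_non_numeric(df):
    # Hypothetical helper: encode every non-numeric column as integer codes,
    # addressing columns by position so duplicate names are not a problem.
    encoded = []
    for i in range(df.shape[1]):
        col = df.iloc[:, i]
        if not pd.api.types.is_numeric_dtype(col):
            # Swapped-in technique: pd.factorize instead of LabelEncoder.
            codes, _ = pd.factorize(col, sort=True)
            col = pd.Series(codes, index=df.index, name=col.name)
        encoded.append(col)
    # Reassemble the frame; column names (including duplicates) are kept.
    return pd.concat(encoded, axis=1)


# Made-up frame with a duplicated column name "c".
df = pd.DataFrame({"c1": ["x", "y", "x"], "n": [1, 2, 3], "c2": ["b", "a", "b"]})
df.columns = ["c", "n", "c"]

out = encode_non_numeric(df)
print(out)
```

Building a new frame, rather than assigning into the old one column by column, sidesteps the question of whether an in-place assignment changes the column dtype.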
I have been checking each value of each row and if all of them are null, I delete the row with something like this:
df = pandas.concat([df[:2], df[3:]])
But, I am thinking there's got to be a better way to do this. I have been trying to use a mask or doing something like this:
rows_to_keep = df.apply(
    lambda row: not all(pandas.isnull(val) for val in row),
    axis=1)
I also tried something like this (suggested in another Stack Overflow question):
pandas.DataFrame.dropna()
but I don't see any difference in my printed dataframe.
dropna returns a new DataFrame; you probably just want:
df = df.dropna()
or
df.dropna(inplace=True)
If you have a more complicated mask, rows_to_keep, you can do:
df = df[rows_to_keep]
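Since the goal in the question is to drop only rows in which every value is null, dropna's how parameter is worth knowing: the default, how='any', drops a row if any value is missing, while how='all' drops it only when all values are missing. A minimal sketch with made-up data, next to the equivalent boolean mask:

```python
import numpy as np
import pandas as pd

# Made-up data: row 1 is entirely null, rows 2 and 3 are partly null.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", None, None, "w"],
})

# Drop only the rows where every value is missing.
cleaned = df.dropna(how="all")

# Equivalent boolean mask: keep rows with at least one non-null value.
rows_to_keep = df.notna().any(axis=1)
masked = df[rows_to_keep]

print(cleaned)
```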