I'm working on a project where I would like to use two lambda functions to find a match in another column. I created a dummy df with the code below:
df = pd.DataFrame(np.random.randint(0, 10, size=(100, 4)), columns=list('ABCD')).astype(str)  # cast to str so .find() works
Now I would like to find column A matches in column B.
df['match'] = df.apply(lambda x: x['B'].find(x['A']), axis=1).ge(0)
Now I would like to add an extra check where I'm also checking if column C values appear in column D:
df['match'] = df.apply(lambda x: x['D'].find(x['C']), axis=1).ge(0)
I'm searching for a way to combine these 2 lines of code into a one-liner, for example with an 'and'/'&' operator.
You can combine both checks with the and operator inside the lambda; the trailing .ge(0) is no longer needed, since the lambda already returns a boolean:
df['match'] = df.apply(lambda x: x['B'].find(x['A']) >= 0 and x['D'].find(x['C']) >= 0, axis=1)
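A minimal runnable sketch of the combined check, assuming the columns hold strings (the names and values below are invented for illustration):

```python
import pandas as pd

# Toy frame with string columns; cast with .astype(str) first if yours hold integers.
df = pd.DataFrame({
    "A": ["cat", "dog"],
    "B": ["a cat sat", "no match"],
    "C": ["1", "9"],
    "D": ["123", "456"],
})

# Both substring checks combined with `and` inside a single lambda.
df["match"] = df.apply(
    lambda x: x["B"].find(x["A"]) >= 0 and x["D"].find(x["C"]) >= 0,
    axis=1,
)
print(df["match"].tolist())  # [True, False]
```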
This code snippet works well:
df['art_kennz'] = df.apply(lambda x:myFunction(x.art_kennz), axis=1)
However, here I have hard-coded the column name art_kennz in both places: df['art_kennz'] and x.art_kennz. Now, I want to modify the script so that I have a list of column names and df.apply runs for all of those columns. So I tried this:
cols_with_spaces = ['art_kennz', 'fk_wg_sch']
for col_name in cols_with_spaces:
    df[col_name] = df.apply(lambda x: myFunction(x.col_name), axis=1)
but this gives an error that:
AttributeError: 'Series' object has no attribute 'col_name'
because of x.col_name. Here, col_name is supposed to be the element from the for loop. What would be the correct syntax for this?
Try:
for col_name in cols_with_spaces:
    df[col_name] = df.apply(lambda x: myFunction(x[col_name]), axis=1)
Explanation: You can access the Series using attribute syntax, e.g. x.art_kennz, but since col_name is a variable holding the attribute name as a string, bracket syntax is the correct way.
x.art_kennz works because the attribute name is written out literally; inside the for loop, col_name is a variable, and attribute access cannot go through a variable.
Try this (this approach iterates row by row):
for col_name in cols_with_spaces:
    df[col_name] = df.apply(lambda row: myFunction(row[col_name]), axis=1)
If you want to iterate columns by columns you can try this:
for col_name in cols_with_spaces:
    df[col_name] = df[col_name].apply(myFunction)
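A minimal sketch of the column-wise version; myFunction isn't shown in the question, so here it's a hypothetical stand-in that removes spaces:

```python
import pandas as pd

# Hypothetical stand-in for myFunction (the real one isn't shown in the question).
def myFunction(value):
    return value.replace(" ", "")

df = pd.DataFrame({
    "art_kennz": ["a b", "c d"],
    "fk_wg_sch": ["e f", "g h"],
    "other": [1, 2],  # untouched column
})

cols_with_spaces = ["art_kennz", "fk_wg_sch"]
for col_name in cols_with_spaces:
    # column-wise apply avoids iterating row by row
    df[col_name] = df[col_name].apply(myFunction)

print(df)
```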
I'm trying to clean a file using pandas chaining, and I have come to a point where I only need to clean one column and leave the rest as is. Is there a way to accomplish this with pandas chaining, using apply or pipe?
I have tried the following, which works, but I only want to replace the dash in one specific column and leave the rest as is, since the dash in other columns is appropriate.
df = (dataFrame
.dropna()
.pipe(lambda x: x.replace("-", "", regex=True))
)
I have also tried this, which doesn't work since it only returns the seatnumber column.
df = (dataFrame
.dropna()
.pipe(lambda x: x['seatnumber'].replace("-", "", regex=True))
)
Thanks in advance.
One way is to assign a column with the same name of the column of interest:
df = (dataFrame
.dropna()
.assign(seatnumber=lambda x: x.seatnumber.replace("-", "", regex=True))
)
where the dataframe at that point is passed to the lambda as x.
You can also pass a nested dict to replace, which scopes the replacement to one column:
df = (dataFrame
.dropna().replace({"seatnumber" : {"-":""}}, regex=True)
)
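Both variants side by side on a small made-up frame (the column values are invented; only seatnumber should lose its dashes):

```python
import pandas as pd

dataFrame = pd.DataFrame({
    "seatnumber": ["A-1", "B-2"],
    "flight": ["LH-440", "BA-212"],  # dashes here must survive
})

# assign: rebuild only the seatnumber column inside the chain
df1 = (dataFrame
       .dropna()
       .assign(seatnumber=lambda x: x.seatnumber.replace("-", "", regex=True)))

# nested dict: scope the replacement to one column
df2 = (dataFrame
       .dropna()
       .replace({"seatnumber": {"-": ""}}, regex=True))

print(df1["seatnumber"].tolist())  # ['A1', 'B2']
```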
I want to split the values of the columns "words" and "frequency" into multiple rows of the dataframe df.
Problem: https://i.stack.imgur.com/7i1p6.png
I use the following piece of code to manipulate the data:
df = (df.set_index(["document"]).apply(lambda x: x.str.split(",").explode()).reset_index())
The problem I have identified is that the values in column "words" and "frequency" are in brackets e.g. (word1, word2, word3, wordn). The output after execution of the code is NaN.
The following solution is sought:
Solution: https://i.stack.imgur.com/XQqo1.png
You were close! The problem is the surrounding brackets rather than the indices. For a CSV file looking like:
"document","words","frequency"
"document 1","(cat,dog,bird)","(12,34,354)"
"document 2","(berlin,new_york,paris)","(1,13,254)"
import pandas as pd
df = pd.read_csv(csv_file)
df2 = df.set_index("document").apply(lambda x: x.str.split(",").explode())
df3 = df2.apply(lambda x: x.str.replace("(", "", regex=False))
df4 = df3.apply(lambda x: x.str.replace(")", "", regex=False)).reset_index()
print(df4)
Maybe you can also do it in a single pass, with one named function instead of three lambdas.
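For instance, a single-pass sketch that strips the brackets before splitting (built on the same CSV columns as above, inlined here so the example is self-contained):

```python
import pandas as pd

# Same data as the CSV above, built inline for a self-contained example.
df = pd.DataFrame({
    "document": ["document 1", "document 2"],
    "words": ["(cat,dog,bird)", "(berlin,new_york,paris)"],
    "frequency": ["(12,34,354)", "(1,13,254)"],
})

# Strip the surrounding brackets, split on commas, and explode in one apply.
out = (df.set_index("document")
         .apply(lambda x: x.str.strip("()").str.split(",").explode())
         .reset_index())
print(out)
```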
I'm trying to pre-process some data for machine learning purposes. I'm currently trying to clean up some NaN values and replace them with 'unknown' and a prefix or suffix which is based on the column name.
The reason for this is that when I use one-hot encoding, I can't have multiple columns with the same name being fed into xgboost.
So what I have is the following
df = df.apply(lambda x: x.replace(np.nan, 'unknown'))
And I'd like to replace all instances of NaN in the df with 'unknown_columname'. Is there any easy or simple way to do this?
Try df = df.apply(lambda x: x.replace(np.nan, f'unknown_{x.name}')).
You can also use df = df.apply(lambda x: x.fillna(f'unknown_{x.name}')).
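A quick sketch of the fillna variant on a toy frame (column names and values invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "color": ["red", np.nan],
    "size": [np.nan, "large"],
})

# Inside apply, x.name is the column name, so each column gets its own fill value.
df = df.apply(lambda x: x.fillna(f"unknown_{x.name}"))
print(df)
```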
First let's build one fill value per column, e.g. 'unknown_A' for column A:
s = np.char.add('unknown_', df.columns.values.astype(str))
Then we can replace each NaN with the right value from s:
for col_name, fill_value in zip(df.columns, s):
    df[col_name] = df[col_name].fillna(fill_value)
I have a dataframe like the one below, and I would like to test, for each row, whether the value in column Number appears in that row's List of Numbers column.
Eventually I would expect to get the Result column shown below:
Is there any better way to get the expected result in Python Pandas?
Thanks!
Use a list comprehension with the in operator:
df['Result'] = [b in a for a, b in df[['List of Numbers','Number']].values]
Similar idea with zip:
df['Result'] = [b in a for a, b in zip(df['List of Numbers'],df['Number'])]
Or solution with DataFrame.apply:
df['Result'] = df.apply(lambda x: x['Number'] in x['List of Numbers'], axis=1)
EDIT: Assign any of the solutions above to mask instead of df['Result'] and filter by boolean indexing:
mask = df.apply(lambda x: x['Number'] in x['List of Numbers'], axis=1)
df1 = df[mask]
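A minimal sketch of the zip version on invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "Number": [3, 7],
    "List of Numbers": [[1, 2, 3], [4, 5, 6]],
})

# For each row, test membership of Number in List of Numbers.
df["Result"] = [b in a for a, b in zip(df["List of Numbers"], df["Number"])]
print(df["Result"].tolist())  # [True, False]
```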