Pandas: how to match multiple patterns (OR) with np.where - python

I would like to know if it is possible with np.where in pandas to match multiple patterns with a kind of 'OR' argument.
For example, I am trying to create a new column in my DataFrame called 'kind' and, for each row, fill it with "test" if the value in another column called 'label' matches any of the listed patterns, and with "control" otherwise.
I'm using this:
df['kind'] = np.where(df['label'] == 'B85_C', 'test', 'control')
And it works well with one pattern. What I'm looking for is something like this:
df['kind'] = np.where(df['label'] == 'B85_C' OR 'B85_N', 'test', 'control')
Any ideas how to do that, or is there an alternative?
Thanks

You can either use the bitwise OR operator |:
(df['label'] == 'B85_C') | (df['label'] == 'B85_N')
or you can use the isin method:
df['label'].isin(['B85_C', 'B85_N'])
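Either mask can be passed straight to np.where. A minimal sketch on made-up data (the labels are illustrative, not from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'label': ['B85_C', 'B85_N', 'X10_A']})

# isin builds a boolean mask, True wherever 'label' matches any listed value
df['kind'] = np.where(df['label'].isin(['B85_C', 'B85_N']), 'test', 'control')
print(df)
#    label     kind
# 0  B85_C     test
# 1  B85_N     test
# 2  X10_A  control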

Related

Trying to Pass Pandas DataFrame to a Function and Return a Modified DataFrame

I'm trying to pass different pandas dataframes to a function that does some string modification (usually a str.replace operation on columns, based on mapping tables stored in CSV files) and returns the modified dataframes. I'm encountering errors, especially with handling the dataframe as a parameter.
The mapping table in CSV is structured as follows:
From(Str)   To(Str)  Regex(True/False)
A           A2
B           B2
CD (.*) FG  CD FG    True
My code looks something like this:
def apply_mapping_table(p_df, p_df_col_name, p_mt_name):
    df_mt = pd.read_csv(p_mt_name)
    for index in range(df_mt.shape[0]):
        # If regex is true
        if df_mt.iloc[index][2] is True:
            # perform regex replacing
            df_p[p_df_col_name] = df_p[p_df_col_name].replace(to_replace=df_mt.iloc[index][0], value=df_mt.iloc[index][1], regex=True)
        else:
            # perform normal string replacing
            p_df[p_df_col_name] = p_df[p_df_col_name].replace(df_mt.iloc[index][0], df_mt.iloc[index][1])
    return df_p
df_new1 = apply_mapping_table(df_old1, 'Target_Column1', 'MappingTable1.csv')
df_new2 = apply_mapping_table(df_old2, 'Target_Column2', 'MappingTable2.csv')
I'm getting 'IndexError: single positional indexer is out-of-bounds' for 'df_mt.iloc[index][2]' and haven't gotten to the part where the actual replacement happens. Any suggestions to make it work, or even a better way to do the dataframe string replacements based on mapping tables?
You can use the .iterrows() function to iterate through the lookup table rows. Generally .iterrows() is slow, but because the lookup table should be a small, manageable table it will be completely fine here. Note also that your snippet mixes df_p and p_df; it should be p_df throughout.
You can adapt your given function as in the following snippet:
def apply_mapping_table(p_df, p_df_col_name, p_mt_name):
    df_mt = pd.read_csv(p_mt_name)
    for _, row in df_mt.iterrows():
        # If regex is true (blank CSV cells read back as NaN, which is truthy, so guard with notna)
        if pd.notna(row['Regex(True/False)']) and row['Regex(True/False)']:
            # perform regex replacing
            p_df[p_df_col_name] = p_df[p_df_col_name].replace(to_replace=row['From(Str)'], value=row['To(Str)'], regex=True)
        else:
            # perform normal string replacing
            p_df[p_df_col_name] = p_df[p_df_col_name].replace(row['From(Str)'], row['To(Str)'])
    return p_df

LOC function on pandas column

I wrote the code below:
check = ['jonge man']
data.loc[
    data['Zoekterm'].str.contains(
        f"{'|'.join(check)}"
    ), "Zoekterm_new",
    data['Zoekterm']
]
I get a "Too many indexers" error.
What did I do wrong?
Use DataFrame.loc with the column name as the second argument:
data.loc[data['Zoekterm'].str.contains(f"{'|'.join(check)}"), "Zoekterm_new"]
If you need to assign values, add = - so Zoekterm_new gets the data from Zoekterm where the condition matches, else NaN:
data.loc[data['Zoekterm'].str.contains(f"{'|'.join(check)}"), "Zoekterm_new"] = data['Zoekterm']
which works like:
data["Zoekterm_new"] = np.where(data['Zoekterm'].str.contains(f"{'|'.join(check)}"), data['Zoekterm'], np.NaN)

Add new column to Pandas dataframe using conditional values from another column

I would like to add a new column, retailer_relationship, to my dataframe.
I would like each row value of this new column to be 'TRUE' if the retailer column value starts with any of the items in my list, and 'FALSE' otherwise.
What I've tried:
list_of_relationships = ("retailer1", "retailer2", "retailer3")
for i in df.index:
    for relationship in list_of_relationships:
        if df.iloc[i]['retailer'].str.startswith(relationship):
            df.at[i, 'retailer_relationship'] = "TRUE"
        else:
            df.at[i, 'retailer_relationship'] = "FALSE"
You can use a regular expression combining the ^ special character, which anchors the match to the beginning of the string, with an alternation over the elements of list_of_relationships, since startswith does not accept regexes:
import re
regex = re.compile('^(?:' + '|'.join(list_of_relationships) + ')')
df['retailer_relationship'] = df['retailer'].str.contains(regex).map({True: 'TRUE', False: 'FALSE'})
Since you want the literal strings 'TRUE' and 'FALSE', map converts the booleans to strings. Note the non-capturing group (?:...): without it, ^ would only anchor the first alternative.
An alternative method that is slightly faster, though that is unlikely to matter here:
df['retailer_relationship'] = df['retailer'].str.contains(regex).transform(str).str.upper()
See if this works for you (note that isin checks for exact matches rather than prefixes). It would help to share a sample of your df or dummy data representing it.
df['retailer_relationship'] = False
df.loc[df['retailer'].isin(retailer_relationship), 'retailer_relationship'] = True
You can still use startswith in pandas, which accepts a tuple of prefixes:
df['retailer_relationship'] = df['retailer'].str.startswith(tuple(retailer_relationship))
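For completeness, a minimal sketch on made-up data that ties the tuple form of startswith to the literal 'TRUE'/'FALSE' strings the question asks for (frame and values are illustrative):
import pandas as pd

df = pd.DataFrame({'retailer': ['retailer1-store', 'retailer2', 'other-shop']})
list_of_relationships = ("retailer1", "retailer2", "retailer3")

# startswith accepts a tuple of prefixes and returns a boolean Series
mask = df['retailer'].str.startswith(list_of_relationships)
df['retailer_relationship'] = mask.map({True: 'TRUE', False: 'FALSE'})
#           retailer retailer_relationship
# 0  retailer1-store                  TRUE
# 1        retailer2                  TRUE
# 2       other-shop                 FALSE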

How to search for specific text within a Pandas dataframe column?

I want to identify all rows of my Pandas csv file where a specific column, in this case the 'Notes' column, contains the word 'exercise'. Once the rows containing the 'exercise' keyword in the 'Notes' column are identified, I want to create a new column called 'ExerciseDay' that has a 1 if the condition was met and a 0 if it was not. The difficulty is that the 'Notes' column can contain long string values (e.g. 'Excercise, Morning Workout,Alcohol Consumed, Coffee Consumed') and I still want 'exercise' to be identified even when it is inside a longer string.
I tried the function below to identify all text that contains the word 'exercise' in the 'Notes' column. No rows are selected when I use it, and I know that is likely because of the * operator, but I want to show the logic. There is probably a much more efficient way to do this, but I am still relatively new to programming and Python.
def IdentifyExercise(row):
    if row['Notes'] == '*exercise*':
        return 1
    elif row['Notes'] != '*exercise*':
        return 0

JoinedTables['ExerciseDay'] = JoinedTables.apply(lambda row: IdentifyExercise(row), axis=1)
Convert the boolean Series created by str.contains to int with astype:
JoinedTables['ExerciseDay'] = JoinedTables['Notes'].str.contains('exercise').astype(int)
For a case-insensitive match:
JoinedTables['ExerciseDay'] = JoinedTables['Notes'].str.contains('exercise', case=False).astype(int)
You can also use np.where:
JoinedTables['ExerciseDay'] = \
np.where(JoinedTables['Notes'].str.contains('exercise'), 1, 0)
Another way would be:
JoinedTables['ExerciseDay'] = [1 if "exercise" in x else 0 for x in JoinedTables['Notes']]
(Probably not the fastest solution)
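One caveat, as an aside not raised in the answers above: if 'Notes' contains missing values, str.contains returns NaN for them and the cast with astype(int) raises; the na parameter of str.contains covers that:
# treat rows with missing Notes as 0 so the cast to int is safe
JoinedTables['ExerciseDay'] = JoinedTables['Notes'].str.contains('exercise', case=False, na=False).astype(int)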

Pandas Python: take a subset of df by row labels while using re.IGNORECASE

I have df which looks like this:
print df_raw
Name              exp1
Name
UnweightedBase    1364
Base              1349
BFC_q5a1        34.18%
BFC_q5a2         2.93%
BFC_q5a3         1.86%
BFC_q5a4         1.93%
BFC_q5a5         0.84%
I want to build a subset from the dataframe above using row labels; however, I would like to use re.IGNORECASE and I'm not sure how.
without re.IGNORECASE the code looks like this:
subset_df = df_raw.loc[df_raw.index.isin(['BFC_q5a4', 'BFC_q5a5'])]
How can I change my code to make use of re.IGNORECASE for the code below:
subset_df = df_raw.loc[df_raw.index.isin(['bFc_q5A4', 'BfC_Q5a5'])]
note - I don't want to use str.lower or str.upper to do this.
Thanks!
I don't know of a neat way to search index labels case-insensitively (df.filter is useful but unfortunately does not appear to be able to ignore case).
To get around this, you could use the Series method pd.Series.str.contains, which can ignore case:
subset_df = df[pd.Series(df.index).str.contains(regex, case=False).values]
The index is turned into a Series and then regex matching is applied. regex in this case could be something like 'bFc_q5A4|BfC_Q5a5'. Case is ignored (case=False).
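Putting it together on the frame from the question (the regex value is just an illustration):
import pandas as pd

regex = 'bFc_q5A4|BfC_Q5a5'
# build a boolean array from the index and use it to subset the frame
subset_df = df_raw[pd.Series(df_raw.index).str.contains(regex, case=False).values]
# expected to keep the BFC_q5a4 and BFC_q5a5 rows despite the mixed case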
