Add new column to Pandas dataframe using conditional values from another column - python

I would like to add a new column, retailer_relationship, to my dataframe.
I would like each row value of this new column to be 'TRUE' if the retailer column value starts with any of the items in the list list_of_relationships, and 'FALSE' otherwise.
What I've tried:
list_of_relationships = ("retailer1", "retailer2", "retailer3")

for i in df.index:
    for relationship in list_of_relationships:
        if df.iloc[i]['retailer'].str.startswith(relationship):
            df.at[i, 'retailer_relationship'] = "TRUE"
        else:
            df.at[i, 'retailer_relationship'] = "FALSE"

You can use a regular expression that combines the ^ special character, which anchors the match to the beginning of the string, with an alternation over every element of list_of_relationships, since str.contains accepts regexes while startswith does not:
import re

# Wrap the alternatives in a non-capturing group so the ^ anchor applies to all of them,
# not only to the first one.
regex = re.compile('^(?:' + '|'.join(list_of_relationships) + ')')
df['retailer_relationship'] = df['retailer'].str.contains(regex).map({True: 'TRUE', False: 'FALSE'})
Since you want the literal strings 'TRUE' and 'FALSE', we can then use map to convert the booleans to strings.
An alternative method that is slightly faster, though I don't think that'll matter:
df['retailer_relationship'] = df['retailer'].str.contains(regex).transform(str).str.upper()

See if this works for you. It would help to share a sample of your df or some dummy data representing it. Note that isin checks for exact matches rather than prefixes:
df['retailer_relationship'] = False
df.loc[df['retailer'].isin(list_of_relationships), 'retailer_relationship'] = True

You can still use startswith in pandas:
df['retailer_relationship'] = df['retailer'].str.startswith(tuple(list_of_relationships))
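For illustration, here is a minimal, self-contained sketch of the startswith approach on made-up data (the sample values and the extra .map step that produces the literal 'TRUE'/'FALSE' strings the question asks for are my own additions):
import pandas as pd

# Hypothetical sample data, just to show the resulting column.
df = pd.DataFrame({'retailer': ['retailer1-store', 'retailer2', 'other-shop']})
list_of_relationships = ("retailer1", "retailer2", "retailer3")

df['retailer_relationship'] = (
    df['retailer']
    .str.startswith(tuple(list_of_relationships))
    .map({True: 'TRUE', False: 'FALSE'})  # literal strings rather than booleans
)
print(df)
# The new column is ['TRUE', 'TRUE', 'FALSE'] for the three sample rows.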

Related

Trying to Pass Pandas DataFrame to a Function and Return a Modified DataFrame

I'm trying to pass different pandas dataframes to a function that does some string modification (usually str.replace operation on columns based on mapping tables stored in CSV files) and return the modified dataframes. And I'm encountering errors especially with handling the dataframe as a parameter.
The mapping table in CSV is structured as follows:
From(Str)      To(Str)   Regex(True/False)
A              A2
B              B2
CD (.*) FG     CD FG     True
My code looks something like this:
def apply_mapping_table(p_df, p_df_col_name, p_mt_name):
    df_mt = pd.read_csv(p_mt_name)
    for index in range(df_mt.shape[0]):
        # If regex is true
        if df_mt.iloc[index][2] is True:
            # perform regex replacing
            df_p[p_df_col_name] = df_p[p_df_col_name].replace(to_replace=df_mt.iloc[index][0], value=df_mt.iloc[index][1], regex=True)
        else:
            # perform normal string replacing
            p_df[p_df_col_name] = p_df[p_df_col_name].replace(df_mt.iloc[index][0], df_mt.iloc[index][1])
    return df_p
df_new1 = apply_mapping_table1(df_old1, 'Target_Column1', 'MappingTable1.csv')
df_new2 = apply_mapping_table2(df_old2, 'Target_Column2', 'MappingTable2.csv')
I'm getting 'IndexError: single positional indexer is out-of-bounds' for 'df_mt.iloc[index][2]' and haven't even gotten to the part where the actual replacement happens. Any suggestions to make it work, or even a better way to do dataframe string replacements based on mapping tables?
You can use the .iterrows() function to iterate through the lookup table rows. Generally .iterrows() is slow, but because the lookup table should be a small, manageable table, it will be completely fine here.
You can adapt your given function as in the following snippet:
def apply_mapping_table(p_df, p_df_col_name, p_mt_name):
    df_mt = pd.read_csv(p_mt_name)
    for _, row in df_mt.iterrows():
        # If regex is true
        if row['Regex(True/False)']:
            # perform regex replacing
            p_df[p_df_col_name] = p_df[p_df_col_name].replace(to_replace=row['From(Str)'], value=row['To(Str)'], regex=True)
        else:
            # perform normal string replacing
            p_df[p_df_col_name] = p_df[p_df_col_name].replace(row['From(Str)'], row['To(Str)'])
    return p_df
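For completeness, a small usage sketch under assumed names (the toy dataframe, the 'MappingTable1.csv' contents, and the 'Target_Column1' column name are hypothetical, only meant to show how the function is called):
import pandas as pd

# Hypothetical mapping table written to disk, mirroring the structure shown above.
pd.DataFrame({
    'From(Str)': ['A', 'CD (.*) FG'],
    'To(Str)': ['A2', 'CD FG'],
    'Regex(True/False)': [False, True],
}).to_csv('MappingTable1.csv', index=False)

df_old1 = pd.DataFrame({'Target_Column1': ['A', 'CD xyz FG', 'keep me']})
df_new1 = apply_mapping_table(df_old1, 'Target_Column1', 'MappingTable1.csv')
print(df_new1)
# Target_Column1 becomes ['A2', 'CD FG', 'keep me'].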

Conditional Strip / Replace based on length of string

I need to remove the space from a Dataframe of UK postcodes, but only those that contain seven characters.
       Client Postcode        lat      long
4              CF1 1DA  51.479690 -3.182190
42640          CF951AF  51.481196 -3.171039
Is it possible to add a len() element to:
df['Client Postcode'] = df1['Client Postcode'].str.replace(" ","")
Here are two ways to conditionally change or create a new column:
First, numpy.where -
this function lets you return value x or y depending on a condition. In your case, return either the original postcode or the postcode without " " depending on the number of characters.
import numpy as np

condition = df1['Client Postcode'].str.len() == 7
df1['Client Postcode Clean'] = np.where(condition, df1['Client Postcode'].str.replace(" ", ""), df1['Client Postcode'])
You can use this method to either create a new column (like I did above) or change the original column.
Another way would be to use pandas slicing. You can use the loc accessor to find the rows you want to change and overwrite them.
condition = df1['Client Postcode'].str.len()==7
df1.loc[condition, 'Client Postcode'] = df1.loc[condition, 'Client Postcode'].str.replace(" ", "")
This method is harder to use to create a new column as it will return NaNs for rows that do not satisfy the condition.
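As a side note, a similar effect can be had with Series.where, which keeps the value where the condition holds and falls back to a second series otherwise; this is my own sketch, not part of the answer above:
# Keep the de-spaced value where the length condition holds, otherwise keep the original.
no_space = df1['Client Postcode'].str.replace(" ", "", regex=False)
condition = df1['Client Postcode'].str.len() == 7
df1['Client Postcode Clean'] = no_space.where(condition, df1['Client Postcode'])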
Just to offer up one more alternative, one could iterate through the dataframe and scrub the postcode as in the following snippet.
import pandas as pd

df = pd.DataFrame([['CF1 1DA', 51.479690, -3.182190],
                   ['CF951AF', 51.481196, -3.171039]],
                  columns=['Client Postcode', 'Lat.', 'Long.'])

for i in df.index:
    # Only strip the space from postcodes that are exactly seven characters long.
    if len(df.at[i, 'Client Postcode']) == 7:
        df.at[i, 'Client Postcode'] = df.at[i, 'Client Postcode'].replace(" ", "")

print(df)
Hope that helps.
Regards.

Remove all characters except alphabet in column rows

Let's say I have a dataset, and in some columns of this dataset I have lists. The first key problem is that there are many columns with such lists, where the strings can be separated by ';' or ';;', and the string itself may start with whitespace or even with ';'.
For some cases of this problem I implemented this function:
g = [';', '']
f = []
for index, row in data_a.iterrows():
    for x in row['column_1']:
        if (x in g):
            norm = row['column_1'].split(x)
            f.append(norm)
            print(norm)
        else:
It mostly worked, but the problem is that it returned duplicated rows and could not handle other separators.
Another problem is using dummies after I changed the way column values are stored:
column_values = data_a['column_1']
data_a.insert(loc=0, column='new_column_8', value=column_values)
dummies_new_win = pd.get_dummies(data_a['column_1'].apply(pd.Series).stack()).sum(level=0)
Instead of getting 40 columns in my case, I get 50 or 60, because I am not able to write a function that removes everything except alphabetic characters from the lists. I would like to understand how to implement such a function, because the same string meaning can be written in different ways:
name-Jack or name(Jack)
Desired output would look like this:
nameJack nameJack
I'm not sure if I understood you correctly, but to remove all non-alphanumeric characters you can use a simple regex.
Example:
import re
n = '-s;a-d'
re.sub(r'\W+', '', n)
Output: 'sad'
You can use str.replace for pandas Series.
import pandas as pd

df = pd.DataFrame({'names': ['name-Jack', 'name(Jack)']})
df
#         names
# 0   name-Jack
# 1  name(Jack)

# Pass regex=True so \W+ is treated as a pattern (the default changed in newer pandas).
df['names'] = df['names'].str.replace(r'\W+', '', regex=True)
df
#       names
# 0  nameJack
# 1  nameJack

How to search for specific text within a Pandas dataframe column?

I want to identify all rows in my Pandas csv file where a specific column, in this case the 'Notes' column, contains the word 'exercise'. Once the rows containing the 'exercise' keyword in the 'Notes' column are identified, I want to create a new column called 'ExerciseDay' that has a 1 if the 'exercise' condition was met or a 0 if it was not. I am having trouble because the 'Notes' column can contain long string values (i.e. 'Exercise, Morning Workout, Alcohol Consumed, Coffee Consumed') and I still want it to identify 'exercise' even if it is within a longer string.
I tried the function below to identify all text that contains the word 'exercise' in the 'Notes' column. No rows are selected when I use this function, and I know it is likely because of the * operator, but I want to show the logic. There is probably a much more efficient way to do this, but I am still relatively new to programming and Python.
def IdentifyExercise(row):
    if row['Notes'] == '*exercise*':
        return 1
    elif row['Notes'] != '*exercise*':
        return 0

JoinedTables['ExerciseDay'] = JoinedTables.apply(lambda row: IdentifyExercise(row), axis=1)
Convert the boolean Series created by str.contains to int with astype:
JoinedTables['ExerciseDay'] = JoinedTables['Notes'].str.contains('exercise').astype(int)
For a case-insensitive match:
JoinedTables['ExerciseDay'] = JoinedTables['Notes'].str.contains('exercise', case=False).astype(int)
You can also use np.where:
JoinedTables['ExerciseDay'] = \
np.where(JoinedTables['Notes'].str.contains('exercise'), 1, 0)
Another way would be:
JoinedTables['ExerciseDay'] =[1 if "exercise" in x else 0 for x in JoinedTables['Notes']]
(Probably not the fastest solution)
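To make the comparison concrete, here is a minimal sketch on made-up data (the sample notes are hypothetical):
import pandas as pd

JoinedTables = pd.DataFrame({'Notes': ['Exercise, Morning Workout, Coffee Consumed',
                                       'Alcohol Consumed',
                                       'exercise only']})
# Case-insensitive substring match, converted to 1/0.
JoinedTables['ExerciseDay'] = JoinedTables['Notes'].str.contains('exercise', case=False).astype(int)
print(JoinedTables)
# The resulting ExerciseDay column is [1, 0, 1].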

Pandas: how to match multiple pattern (OR) with np.where

I would like to know if it is possible, with np.where in pandas, to match multiple patterns with a kind of 'OR' argument.
For example, I am trying to create a new column in my DataFrame called 'kind' and, for each row, to fill it with "test" if the value in another column called 'label' matches any of the listed patterns, and otherwise fill it with "control".
I'm using this:
df['kind'] = np.where(df['label'] == 'B85_C', 'test', 'control')
And it works well with one pattern.
What I'm looking for is something like this:
df['kind'] = np.where(df['label'] == 'B85_C' OR 'B85_N', 'test', 'control')
Any ideas how to do that, or are there alternatives?
Thanks
You can either use the bitwise or:
(df['label'] == 'B85_C') | (df['label'] == 'B85_N')
or you can use the isin method:
df['label'].isin(['B85_C', 'B85_N'])
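Either condition can then be dropped straight into np.where; a short sketch (assuming numpy is imported as np):
import numpy as np

# isin builds the boolean mask for the 'OR' over several labels in one go.
df['kind'] = np.where(df['label'].isin(['B85_C', 'B85_N']), 'test', 'control')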