replacing multiple strings in a column with a list - python

I have a column named "New_category" and I have to replace multiple strings into the column with the word "brandies". I have created the following list:
brandies = ["apricot brandies", "cherry brandies", "peach brandies"]
I want to write a formula, which will replace "apricot brandies", "cherry brandies", "peach brandies" in the column with the word "brandies".
This is what i have tried but it does not work.
iowa['New_category'] = iowa['New_Category'].str.replace('brandies','Brandies')

When using str.replace use a pattern that captures the whole string.
iowa['New_category'] = iowa['New_category'].str.replace('^.*brandies.*$', 'Brandies')

Related

dataframe select a word on the text

i have a datafame in input and i want to extract in column "localisation" the word in this list ["SECTION 11","CÔNE","BELLY"] and i have to create new column "word"
in the dataframe. If the word of the list exist in the the column "localisation" i fill the word in the column created "word".otherwise i put the full text in the column "word"
This my dataframe
I create new column "word"
I selected the line containing the words from the list
I fill in the "word" column with the keywords found from the list
["SECTION 11","CÔNE","BELLY"]
df["temp"]=df["localisation"].str.extract("Localisation[\s]*:.*\n([^_\n]{3,})\n[^\n]*\n")
df["word"]=df["temp"].str.extract("(SECTION 11|CÔNE|BELLY)")
df["temp"]=df["localisation"].str.extract("Localisation[\s]*:.*\n([^_\n]{3,})\n[^\n]*\n")
df["word"]=df["temp"].str.extract("(SECTION 11|CÔNE|BELLY)")
my problem I can not put the full text if the word of the list is not found in the column "localization". I have null values ​​in the lines or I have to put the full text
You need to use .fillna with the df["localisation"] as argument:
df["word"]=df["localisation"].str.extract(r"\b(SECTION 11|CÔNE|BELLY)\b", expand=False).fillna(df["localisation"])
Note also that I suggested r"\b(SECTION 11|CÔNE|BELLY)\b", a regex with word boundaries to only match your alternatives as whole words. Note that word boundaries \b are Unicode-aware in Python re that is used behind the scenes in Pandas.
If you do not need the whole word search, you may keep using r"SECTION 11|CÔNE|BELLY".

np.where for string containing specific word

I want to write code that adds value in a new column depending on the condition. For example, I have df, it has sentences as values. I want to add a new column, called "Marketing" if the string in the certain column contains the word "Marketing". Example:
df['Marketing'] =np.where(df['Funding_%'] == 'Marketing', "marketing", 'rest')
The problem is that expression above works only with an exact match, not if a string contains the word.
Data Sample (with error where all rows are marked as marketing):
Thanks
You need to match the string with the substring so use .str.contains()
Your example does string matching with another string that's why getting the issue. You need to find whther the substring exist in the string or not so use that.
df['Marketing'] =np.where(df['Funding_%'].str.contains('Marketing', regex= True), "marketing", 'rest')

Splitting the column values based on a delimiter (Pandas)

I have a panda dataframe with a column name - AA_IDs. The column name values has a special character "-#" in few rows. I need to determine three things:
Position of these special characters or delimiters
find the string before the special character
Find the string after the special character
E.g. AFB001 9183Daily-#789876A
Answer would be before the delimiter - AFB001 9183Daily and after the delimiter - 789876A
Just use apply function with split -
df['AA_IDs'].apply(lambda x: x.split('-#'))
This should give you a series with a list for each row as [AFB001 9183Daily, 789876A]
This would be significantly faster than using regex, and not to mention the readability.
So lets say the dataframe is called df and the column with the text is A.
You can use
import re # Import regex
pattern = r'<your regex>'
df['one'] = df.A.str.extract(pattern)
This creates a new column containing the extracted text. You just need to create a regex to extract what you want from your string(s). I highly recommend regex101 to help you construct your regex.
Hope this helps!

How to Check and Filter out Row in Dataframe if "." exists in df Cell

I tried to reference this SF answer: How to check if character exists in DataFrame cell
It gave a seemingly good solution but it doesn't appear to work for a period character "." Which of course is the character I'm trying to filter out on.
df_intials = df['Name'].str.contains('.')
Is there something specific about filterting through a dataframe that every value in the column has a "."?
When I convert to a list, and write a simple function to append strings with the character "." to it works correctly.
pd.Series.str.contains uses regex expressions as default, so you can either use the escape character backslack or parameter regex=False:
Try
df_intials = df['Name'].str.contains('\.')
or
df_intials = df['Name'].str.contains('.', regex=False)

How To Remove \x95 chars from text - Pandas?

I am having trouble to remove a space at the beginning of string in pandas dataframe cells. If you look at the dataframe cells, it seems like that there is a space at the start of string however it prints "\x95 12345" when you output one of cells which has set of chars at the beginning, so as you can see it is not a normal space char but it is rather "\x95"
I already tried to use strip() - But it didn't help.
Dataframe was produced after the use of str.split(pat=',').tolist() expression which basically split the strings into different cells by ',' so now my strings have this char added.
Assuming col1 is your first column name:
import re
df.col1 = df.col1.apply(lambda x: re.sub(r'\x95',"",x))

Categories