np.where for string containing specific word - python

I want to write code that adds a value in a new column depending on a condition. For example, I have a DataFrame df whose values are sentences. I want to add a new column called "Marketing" that is filled in when the string in a certain column contains the word "Marketing". Example:
df['Marketing'] = np.where(df['Funding_%'] == 'Marketing', "marketing", 'rest')
The problem is that the expression above works only for an exact match, not when a string merely contains the word.
Data Sample (with error where all rows are marked as marketing):
Thanks

You need to test whether the string contains the substring, so use .str.contains().
Your example compares each whole string with another string for equality, which is why you get the issue. You need to find whether the substring exists in the string, so use .str.contains() for that:
df['Marketing'] = np.where(df['Funding_%'].str.contains('Marketing', regex=True), "marketing", 'rest')
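If the column may hold missing values or differently cased text, the same idea can be sketched as follows. The sample rows are made up for illustration; only the column name 'Funding_%' comes from the question. The case=False and na=False arguments are optional choices here, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column name comes from the question.
df = pd.DataFrame(
    {"Funding_%": ["Marketing and sales", "R&D", "digital marketing", None]}
)

# case=False ignores letter case; na=False counts missing values as "no match"
# so the mask is purely boolean.
mask = df["Funding_%"].str.contains("marketing", case=False, na=False)
df["Marketing"] = np.where(mask, "marketing", "rest")
print(df)
```

Without na=False, a missing value would propagate into the mask as a missing entry instead of False.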

Related

dataframe select a word on the text

I have a DataFrame as input, and I want to extract from the column "localisation" any word from the list ["SECTION 11","CÔNE","BELLY"] into a new column "word".
If a word from the list exists in the column "localisation", I put that word in the new column "word"; otherwise I put the full text in the column "word".
This is my DataFrame.
I create the new column "word", select the rows containing the words from the list, and fill the "word" column with the keywords found from the list
["SECTION 11","CÔNE","BELLY"]:
df["temp"]=df["localisation"].str.extract(r"Localisation[\s]*:.*\n([^_\n]{3,})\n[^\n]*\n")
df["word"]=df["temp"].str.extract(r"(SECTION 11|CÔNE|BELLY)")
My problem is that I cannot put the full text when no word from the list is found in the column "localisation". I get null values in the rows where I need the full text.
You need to use .fillna with the df["localisation"] as argument:
df["word"]=df["localisation"].str.extract(r"\b(SECTION 11|CÔNE|BELLY)\b", expand=False).fillna(df["localisation"])
Note also that I suggested r"\b(SECTION 11|CÔNE|BELLY)\b", a regex with word boundaries, so that your alternatives match only as whole words. Word boundaries \b are Unicode-aware in Python's re module, which Pandas uses behind the scenes.
If you do not need the whole word search, you may keep using r"SECTION 11|CÔNE|BELLY".
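A minimal, self-contained sketch of the extract-then-fillna approach is shown here. The three sample rows are invented for illustration; the column names and the pattern come from the question and answer.

```python
import pandas as pd

# Hypothetical rows; the real "localisation" texts are not shown in the question.
df = pd.DataFrame(
    {"localisation": ["panel near SECTION 11 left", "CÔNE avant", "unknown zone"]}
)

pattern = r"\b(SECTION 11|CÔNE|BELLY)\b"

# expand=False returns a Series; rows without a match are NaN,
# and fillna replaces them with the full original text.
df["word"] = (
    df["localisation"].str.extract(pattern, expand=False).fillna(df["localisation"])
)
print(df)
```

Rows with a keyword get just that keyword, and the last row keeps its full text.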

Splitting the column values based on a delimiter (Pandas)

I have a pandas DataFrame with a column named AA_IDs. The column values contain the special character sequence "-#" in a few rows. I need to determine three things:
The position of this delimiter
The string before the delimiter
The string after the delimiter
E.g. AFB001 9183Daily-#789876A
The answer would be: before the delimiter, AFB001 9183Daily; after the delimiter, 789876A
Just use the apply function with split:
df['AA_IDs'].apply(lambda x: x.split('-#'))
This should give you a Series with a list for each row, such as ['AFB001 9183Daily', '789876A'].
This is likely to be faster than using a regex, and it is also more readable.
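The question also asks for the delimiter's position, which split alone does not give. One possible sketch uses the vectorized .str.find and .str.partition accessors instead of apply; the second sample row and the new column names (pos, before, after) are hypothetical.

```python
import pandas as pd

# First value is from the question; the second (no delimiter) is made up.
df = pd.DataFrame({"AA_IDs": ["AFB001 9183Daily-#789876A", "XYZ123"]})

# Zero-based position of the delimiter, or -1 when it is absent.
df["pos"] = df["AA_IDs"].str.find("-#")

# partition splits at the first occurrence into three columns:
# 0 = text before, 1 = the delimiter itself, 2 = text after.
parts = df["AA_IDs"].str.partition("-#")
df["before"] = parts[0]
df["after"] = parts[2]
print(df)
```

For rows without the delimiter, partition leaves the whole string in the "before" part and an empty string in the "after" part.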
So let's say the DataFrame is called df and the column with the text is A.
You can use:
import re # Import regex
pattern = r'<your regex>'
df['one'] = df.A.str.extract(pattern)
This creates a new column containing the extracted text. You just need to write a regex that extracts what you want from your string(s); note that str.extract requires the pattern to contain at least one capture group. I highly recommend regex101 to help you construct your regex.
Hope this helps!

How to Check and Filter out Row in Dataframe if "." exists in df Cell

I tried to follow this SF answer: How to check if character exists in DataFrame cell
It gave a seemingly good solution, but it doesn't appear to work for the period character ".", which of course is the character I'm trying to filter on.
df_intials = df['Name'].str.contains('.')
Is there something specific about filtering a DataFrame that makes every value in the column appear to contain a "."?
When I convert the column to a list and write a simple function that appends the strings containing "." to a new list, it works correctly.
pd.Series.str.contains treats the pattern as a regular expression by default, and in a regex "." matches any character. You can either escape it with a backslash or pass the parameter regex=False:
Try
df_intials = df['Name'].str.contains(r'\.')
or
df_intials = df['Name'].str.contains('.', regex=False)
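To actually filter the rows, the boolean result can be used as a mask. The sketch below assumes a made-up "Name" column with three sample values.

```python
import pandas as pd

# Hypothetical names for illustration.
df = pd.DataFrame({"Name": ["J. Smith", "Ann Lee", "A.B. Jones"]})

# regex=False makes "." a literal period instead of "any character".
has_dot = df["Name"].str.contains(".", regex=False)

with_dot = df[has_dot]      # rows whose Name contains a period
without_dot = df[~has_dot]  # rows with the period-containing names filtered out
print(without_dot)
```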

Pattern match and create new string based on number at end of string

I have data like the sample below: string values with a field name in parentheses that ends in a number. I would like to group the rows by the name inside the parentheses and use the trailing number to determine the order in which each original value appears in a new string joined with "/". An output example is below; any suggestions are welcome.
Sample Data:
SampleDf=pd.DataFrame([['sum(field1)'],['count(field2)'],['Sum(value1)'],['Max(field3)']],columns=['ReportField'])
Sample Output:
OutputDf=pd.DataFrame([['sum(field1)/count(field2)/Max(field3)'],['Sum(value1)']],columns=['ratio'])
Following the approach from the previous question, you could extract the string and the number from the parenthesis as separate columns, sort by the number, group by the string and then aggregate the original field by joining them with /:
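# Named aggregation is used here because passing a renaming dict to
# SeriesGroupBy.agg raises an error in pandas 1.0 and later.
(pd.concat([
    SampleDf,
    SampleDf.ReportField.str.extract(r"\((.*?)(\d+)\)", expand=True)
], axis=1).sort_values(1)
 .groupby(0).agg(ratio=("ReportField", "/".join)).reset_index(drop=True))

#                                    ratio
# 0  sum(field1)/count(field2)/Max(field3)
# 1                            Sum(value1)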
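One caveat: the extracted number is a string, so sorting on it places "10" before "2". A variant sketch that converts the number to an integer before sorting is given below. The extra row min(field10) and the group names "name" and "order" are hypothetical additions for illustration, and named aggregation is used for the join.

```python
import pandas as pd

# Sample data from the question plus one made-up row with a two-digit number.
SampleDf = pd.DataFrame(
    [["sum(field1)"], ["count(field2)"], ["Sum(value1)"], ["Max(field3)"], ["min(field10)"]],
    columns=["ReportField"],
)

# Named groups become the column names of the extracted DataFrame.
parts = SampleDf["ReportField"].str.extract(r"\((?P<name>.*?)(?P<order>\d+)\)")
parts["order"] = parts["order"].astype(int)  # numeric sort, not string sort

OutputDf = (
    SampleDf.join(parts)
    .sort_values("order", kind="stable")  # stable sort keeps ties in input order
    .groupby("name")
    .agg(ratio=("ReportField", "/".join))
    .reset_index(drop=True)
)
print(OutputDf)
```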

replacing multiple strings in a column with a list

I have a column named "New_category", and I have to replace several strings in that column with the word "brandies". I have created the following list:
brandies = ["apricot brandies", "cherry brandies", "peach brandies"]
I want to write an expression that replaces "apricot brandies", "cherry brandies" and "peach brandies" in the column with the word "brandies".
This is what I have tried, but it does not work.
iowa['New_category'] = iowa['New_Category'].str.replace('brandies','Brandies')
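When using str.replace, use a pattern that captures the whole string, and pass regex=True so that the pattern is treated as a regular expression (recent pandas versions treat it as a literal string by default):
iowa['New_category'] = iowa['New_category'].str.replace(r'^.*brandies.*$', 'Brandies', regex=True)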
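Since the question already has the exact values in a list, another option is Series.replace, which swaps whole-cell matches without any regex. This is an alternative sketch, not the answer's method; the sample rows are invented, and the replacement word is the lowercase "brandies" the question asks for.

```python
import pandas as pd

# Hypothetical sample of the column; "vodka" is an unrelated value added for contrast.
iowa = pd.DataFrame(
    {"New_category": ["apricot brandies", "cherry brandies", "vodka", "peach brandies"]}
)
brandies = ["apricot brandies", "cherry brandies", "peach brandies"]

# Series.replace matches entire cell values, so only exact list entries change.
iowa["New_category"] = iowa["New_category"].replace(brandies, "brandies")
print(iowa)
```

Unlike the regex approach, this leaves alone any value that merely contains the word but is not in the list.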
