dataframe: select a word in the text - python

I have a dataframe as input and I want to extract, from the column "localisation", any word from the list ["SECTION 11","CÔNE","BELLY"], and I have to create a new column "word"
in the dataframe. If a word from the list exists in the column "localisation", I fill the new "word" column with that word; otherwise I put the full text in the "word" column.
This is my dataframe. My approach:
I create a new column "word"
I select the lines containing the words from the list
I fill the "word" column with the keywords found in the list
["SECTION 11","CÔNE","BELLY"]
df["temp"]=df["localisation"].str.extract("Localisation[\s]*:.*\n([^_\n]{3,})\n[^\n]*\n")
df["word"]=df["temp"].str.extract("(SECTION 11|CÔNE|BELLY)")
df["temp"]=df["localisation"].str.extract("Localisation[\s]*:.*\n([^_\n]{3,})\n[^\n]*\n")
df["word"]=df["temp"].str.extract("(SECTION 11|CÔNE|BELLY)")
My problem: I cannot put the full text when no word from the list is found in the column "localisation". I get null values in those lines instead of the full text.

You need to use .fillna with df["localisation"] as the argument:
df["word"] = df["localisation"].str.extract(r"\b(SECTION 11|CÔNE|BELLY)\b", expand=False).fillna(df["localisation"])
Note also that I suggested r"\b(SECTION 11|CÔNE|BELLY)\b", a regex with word boundaries, to match your alternatives only as whole words. Word boundaries \b are Unicode-aware in Python's re module, which Pandas uses behind the scenes.
If you do not need whole-word matching, you may keep using r"SECTION 11|CÔNE|BELLY".
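For illustration, here is a minimal sketch with made-up sample data (the "localisation" values below are hypothetical) showing how the extracted matches and the fillna fallback combine:

import pandas as pd

# Hypothetical sample data for illustration only
df = pd.DataFrame({"localisation": ["défaut SECTION 11", "CÔNE avant", "zone inconnue"]})

# Rows matching a listed word get that word; the rest keep the full text
df["word"] = (
    df["localisation"]
    .str.extract(r"\b(SECTION 11|CÔNE|BELLY)\b", expand=False)
    .fillna(df["localisation"])
)
print(df)
# localisation: "défaut SECTION 11" -> word: "SECTION 11"
# localisation: "CÔNE avant"        -> word: "CÔNE"
# localisation: "zone inconnue"     -> word: "zone inconnue"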

Related

Delete rows based on the multiple words stored in a list as fast as possible

I have a dataframe containing a column named Keyword. There are around 1M keywords. I want to delete all the rows where the Keyword contains any of the words I stored in a list.
Here are some of the words stored in the list:
excluded_words = ['nz','ca']
I have tried the following code:
df[~df['Keyword'].str.contains('|'.join(excluded_words), regex=True)]
This code is blazing fast and does its job, but with a small issue.
It deletes any keyword that contains "ca" anywhere, even inside another word. I want to delete only those keywords where "ca" is a separate word.
For example let's say we have two below Keywords
cast iron sump pump
sump pump repair service near ca
The first keyword shouldn't be deleted, as "ca" is just part of the word "cast", not a word by itself, whereas the second keyword should definitely be deleted, as "ca" appears there as a separate word.
How can I modify the code to deal with this? Thank you in advance.
You can surround each word to exclude with r'\b', a raw Python string containing the regular expression special sequence for a word boundary (see the re docs):
excluded_words = ['nz', 'ca']
excluded_words = [r'\b' + x + r'\b' for x in excluded_words]
df[~df['Keyword'].str.contains('|'.join(excluded_words), regex=True)]
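If the excluded words could ever contain regex metacharacters (dots, parentheses, and so on), a slightly safer variant, sketched here under that assumption, escapes each word before joining:

import re

excluded_words = ['nz', 'ca']
# re.escape guards against metacharacters inside the words themselves
pattern = '|'.join(r'\b' + re.escape(w) + r'\b' for w in excluded_words)
df = df[~df['Keyword'].str.contains(pattern, regex=True)]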

np.where for string containing specific word

I want to write code that adds a value in a new column depending on a condition. For example, I have a df with sentences as values. I want to add a new column, called "Marketing", filled in when the string in a certain column contains the word "Marketing". Example:
df['Marketing'] = np.where(df['Funding_%'] == 'Marketing', "marketing", 'rest')
The problem is that the expression above works only on an exact match, not when the string merely contains the word.
Data Sample (with error where all rows are marked as marketing):
Thanks
You need to check whether the string contains the substring, so use .str.contains().
Your example compares the whole string for equality with another string, which is why you get the issue; you need to test whether the substring exists within the string instead.
df['Marketing'] = np.where(df['Funding_%'].str.contains('Marketing', regex=True), "marketing", 'rest')
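One caveat worth noting: if 'Funding_%' can contain missing values, str.contains returns NaN for them, which np.where treats as truthy. A minimal sketch (with hypothetical data) passing na=False to treat missing values as non-matches:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Funding_%': ['Marketing budget Q3', 'Engineering', None]})
# na=False maps missing values to "no match" instead of NaN
df['Marketing'] = np.where(
    df['Funding_%'].str.contains('Marketing', na=False), 'marketing', 'rest')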

Splitting the column values based on a delimiter (Pandas)

I have a pandas dataframe with a column named AA_IDs. The column's values contain a special character sequence "-#" in a few rows. I need to determine three things:
The position of this special character or delimiter
The string before the special character
The string after the special character
E.g. for AFB001 9183Daily-#789876A, the answer would be AFB001 9183Daily before the delimiter and 789876A after the delimiter.
Just use the apply function with split:
df['AA_IDs'].apply(lambda x: x.split('-#'))
This should give you a Series with a list for each row, e.g. ['AFB001 9183Daily', '789876A'].
This should be significantly faster than using regex, not to mention more readable.
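Since the question asks for the parts before and after the delimiter as separate values, here is a variant sketch using the vectorised str.split with expand=True (the 'before' and 'after' column names are made up for illustration):

# expand=True returns the pieces as separate columns;
# rows without '-#' get None in the second column
parts = df['AA_IDs'].str.split('-#', n=1, expand=True)
df['before'] = parts[0]
df['after'] = parts[1]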
So let's say the dataframe is called df and the column with the text is A.
You can use
import re # Import regex
pattern = r'<your regex>'
df['one'] = df.A.str.extract(pattern)
This creates a new column containing the extracted text. You just need to create a regex to extract what you want from your string(s). I highly recommend regex101 to help you construct your regex.
Hope this helps!
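For this particular delimiter, one concrete pattern could capture the two pieces around '-#' in a single extract; a sketch with illustrative group names:

# named groups become the column names of the extracted frame
extracted = df['AA_IDs'].str.extract(r'^(?P<before>.*?)-#(?P<after>.*)$')
df = df.join(extracted)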

Compare strings in dataframe column according to levenshtein distance with words in a list

After using PyTesseract I have a dataframe with words (they are in Greek, but that does not matter).
I have also created a list (words_list) that is my custom dictionary, containing specific words on the topic I am examining.
What I want to do is compare every word in df["no_punctuation"] with every word in the list, and:
If the Levenshtein distance between the pair of words is lower than 4, I want to replace the word in the dataframe with the corresponding word from the list
Otherwise, I want to leave the cell empty
Essentially, it is a step towards my own spellchecker; however, I cannot make it work so far.
An image of the dataframe and the list is attached.
What I have tried so far is this :
for j in range(0, len(df2)):
    for word in words_list:
        if enchant.utils.levenshtein(df2["no_punctuation"][j], word) < 4:
            df2["new"][j] = word
        else:
            df2["new"][j] = ""
And as shown in the dataframe image, it corrects only the word "generali" and leaves all the other cells empty. However, there are many other cells that should be filled too.
I have also tried the line below; however, it produces only empty cells.
df2['new'] = df2["no_punctuation"].apply(lambda x: "" if (enchant.utils.levenshtein(text, word) >= 4 for word in words_list) else word)
I think I am close, but something is still wrong. Any ideas?
The reason for the empty cells is the else branch you provided: for every comparison with a Levenshtein distance >= 4, the empty string is written, overwriting any earlier match.
Removing the else branch will definitely solve your problem.
Also, initialise the new column once, outside the loop:
df2["new"] = ""
for j in range(len(df2)):
    for word in words_list:
        if enchant.utils.levenshtein(df2["no_punctuation"][j], word) < 4:
            df2.loc[j, "new"] = word

replacing multiple strings in a column with a list

I have a column named "New_category" and I have to replace multiple strings in the column with the word "brandies". I have created the following list:
brandies = ["apricot brandies", "cherry brandies", "peach brandies"]
I want to write an expression which will replace "apricot brandies", "cherry brandies", and "peach brandies" in the column with the word "brandies".
This is what I have tried, but it does not work:
iowa['New_category'] = iowa['New_Category'].str.replace('brandies','Brandies')
When using str.replace, use a pattern that captures the whole string, and pass regex=True so the pattern is treated as a regular expression:
iowa['New_category'] = iowa['New_category'].str.replace('^.*brandies.*$', 'Brandies', regex=True)
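Alternatively, a sketch that builds the pattern directly from the list in the question (re.escape is used defensively in case a phrase ever contains regex metacharacters), replacing only the listed phrases rather than the whole string:

import re

brandies = ["apricot brandies", "cherry brandies", "peach brandies"]
# one alternation matching any listed phrase exactly
pattern = '|'.join(map(re.escape, brandies))
iowa['New_category'] = iowa['New_category'].str.replace(pattern, 'brandies', regex=True)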
