Include one word and exclude another in pandas - python

Background: I have the following dataframe
import pandas as pd
d = {'text': ["paid", "paid and volunteer", "other phrase"]}
df = pd.DataFrame(data=d)
df['text'].apply(str)
Output:
text
0 paid
1 paid and volunteer
2 other phrase
Goal:
1) check each row to determine if paid is present and return a boolean (True if paid appears anywhere in the text column, False if paid is not present). But I would like to exclude the word volunteer: if volunteer is present, the result should be False.
2) create a new column with the results
Desired Output:
text result
0 paid True
1 paid and volunteer False
2 other phrase False
Problem: I am using the following code
df['result'] = df['text'].astype(str).str.contains('paid') #but not volunteer
I checked How to negate specific word in regex?; it shows how to exclude a word, but I am not sure how to incorporate that into my code.
Question:
How do I alter my code to achieve 1) and 2) of my goal?

Using lambda:
df['result'] = df['text'].apply(lambda row: ('paid' in row) and ('volunteer' not in row))

You can use a logical and to check for both conditions.
(df.text.str.contains('paid')) & (~df.text.str.contains('volunteer'))
Out[14]:
0 True
1 False
2 False
Name: text, dtype: bool
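If you would rather do it in a single expression, the negation technique from the linked question can be folded into one regex with a negative lookahead. A minimal sketch, assuming plain substring matching is what you want:
# Anchored at the start: assert 'volunteer' appears nowhere, then require 'paid'
df['result'] = df['text'].astype(str).str.contains(r'^(?!.*volunteer).*paid')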

Related

Python check if a dataframe column contains a keyword

I have a value in my df: "NEFT fm Principal Focused Multi".
I have a list of keywords like NEFT, neft. So, if the above string in the df contains the word NEFT, I want to add a new column and mark it as True.
Sample Input
I was trying
for index, row in ldf_first_page.iterrows():
    row.str.lower().isin(llst_transaction_keywords)
But it is trying to check the exact keyword, and I am looking for something like contains.
Your question is not entirely clear, but as I understand it you want to add a column for every keyword and set the value to True only in the rows that contain that keyword? I would do the following.
First, read the data:
import re
import pandas as pd
from io import StringIO
example_data = """
Date,Narration,Other
31/03/21, blabla, 1
12/2/20, here neft in, 2
10/03/19, NEFT is in, 3
9/04/18, ok, 4
11/08/21, oki cha , 5
"""
df = pd.read_csv(StringIO(example_data))
Now loop over the keywords and see if there are matches. Note that I have imported the re module to allow for IGNORECASE:
transaction_keywords = ["neft", "cha"]
for keyword in transaction_keywords:
    mask = df["Narration"].str.contains(keyword, regex=True, flags=re.IGNORECASE)
    if mask.any():
        df[keyword] = mask
The result looks like this:
Date Narration Other neft cha
0 31/03/21 blabla 1 False False
1 12/2/20 here neft in 2 True False
2 10/03/19 NEFT is in 3 True False
3 9/04/18 ok 4 False False
4 11/08/21 oki cha 5 False True
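If a single True/False column is enough rather than one column per keyword, a joined pattern avoids the Python-level loop. A sketch reusing the data above, with re.escape guarding keywords that might contain regex metacharacters:
pattern = '|'.join(re.escape(k) for k in transaction_keywords)
df['has_keyword'] = df['Narration'].str.contains(pattern, regex=True, flags=re.IGNORECASE)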
Try:
for index, row in ldf_first_page.iterrows():
    if any(keyword.lower() in str(row["Narration"]).lower() for keyword in llst_transaction_keywords):
        return True  # assumes this loop runs inside a function
Or try:
df["new_col"] = np.where(df["Narration"].str.lower().str.contains("neft"), True, False)
You can try using any():
for index, row in ldf_first_page.iterrows():
    row.astype(str).str.lower().str.contains('|'.join(llst_transaction_keywords)).any()

Check if there is a substring that matches a string from a list

Bit of a beginner question here; I currently have a pandas df with one column containing various different strings. I have some more columns which are currently empty. Example of first few rows below;
Risk,Cost,Productivity,Security
"Unforeseen cost due to CCTV failures",,,
"Unexpected drop in Productivity",,,
I've also created a set of lists as follows;
Cost = ['Cost']
Productivity = ['Productivity']
Security = ['Security','CCTV','Camera']
Essentially, I want to go through each column and check whether the string in the "Risk" column on the same row contains a substring that matches one of the strings in the list. The ideal output would be as follows;
Risk,Cost,Productivity,Security
"Unforeseen cost due to security issues",TRUE,FALSE,TRUE
"Unexpected drop in Productivity",FALSE,TRUE,FALSE
I've tried a few different methods so far, such as
any(Cost in Risk for Cost in Costs)
However, I'm not sure if there's a way to stop the any() check being case sensitive, and I'm not sure how to apply this to a whole column. I did try
df['Cost'] = any(Cost in df['Risk'] for Cost in Costs)
but that returned a column full of "FALSE". Any nudge in the right direction would be hugely appreciated! Thank you
We can create a regex pattern corresponding to each of the lists Cost, Productivity and Security, then use str.contains to test for occurrences of each pattern in the strings of column Risk:
for c in ('Cost', 'Productivity', 'Security'):
    df[c] = df['Risk'].str.contains(fr"(?i)\b(?:{'|'.join(locals()[c])})\b")
Risk Cost Productivity Security
0 Unforeseen cost due to CCTV failures True False True
1 Unexpected drop in Productivity False True False
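If you prefer not to reach into locals(), the same idea works with an explicit dict keyed by column name. This variant is our own sketch (it also adds re.escape in case a keyword ever contains a regex metacharacter):
import re
keyword_lists = {'Cost': Cost, 'Productivity': Productivity, 'Security': Security}
for col, words in keyword_lists.items():
    df[col] = df['Risk'].str.contains(fr"(?i)\b(?:{'|'.join(map(re.escape, words))})\b")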
Firstly, create/define a function (Search is not defined in the answer; it is assumed here to be the list of the three keyword lists):
Search = [Cost, Productivity, Security]
def check():
    res = []
    for x in Search:
        res.append(df['Risk'].str.split(' ', expand=True).isin(x).any(axis=1))
    return pd.DataFrame(res).T
Finally:
df[['Cost','Productivity','Security']]=check()
Output of df:
Risk Cost Productivity Security
0 Unforeseen cost due to CCTV failures False False True
1 Unexpected drop in Productivity False True False
I would make everything lowercase to get all matches regardless of case, then turn both the sentence and the words to check into sets and check if there are any matches:
import pandas as pd
from io import StringIO
txt = '''Risk,Cost,Productivity,Security
"Unforeseen cost due to CCTV failures",,,
"Unexpected drop in Productivity",,,'''
df = pd.read_csv(
    StringIO(txt),
    sep=',',
    index_col=None,
    header=0
)
df['Risk'] = df['Risk'].str.lower()
df.columns = [item.lower() for item in df.columns]
print(df)
key_dict = {
'cost': set([item.lower() for item in ['Cost']]),
'productivity': set([item.lower() for item in ['Productivity']]),
'security': set([item.lower() for item in ['Security','CCTV','Camera']])
}
for idx in df.index:
    word_set = set(df.loc[idx, 'risk'].split())
    print(word_set)
    for col in key_dict:
        if len(word_set & key_dict[col]) > 0:
            df.loc[idx, col] = True
        else:
            df.loc[idx, col] = False
risk cost productivity security
0 unforeseen cost due to cctv failures True False True
1 unexpected drop in productivity False True False
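The same set intersection can be written without the explicit index loop; a short sketch using apply on the lowercased column (bool() of a non-empty intersection is True):
for col, words in key_dict.items():
    df[col] = df['risk'].apply(lambda s: bool(set(s.split()) & words))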

How to replace the entry of a column with different name by recognizing a pattern?

I have a column, let's say 'Match Place', whose entries look like 'MANU # POR', 'MANU vs. UTA', 'MANU # IND', 'MANU vs. GRE', etc. So each entry has three parts: the first is MANU (the first country code), the second is '#' or 'vs.', and the third is the second country code. What I want is: if '#' appears in an entry, replace the whole entry with 'away'; if 'vs.' appears, replace it with 'home'. So 'MANU # POR' should become 'away' and 'MANU vs. GRE' should become 'home'.
I wrote some code to do this using for/if/else, but it takes far too long to run (my dataframe has 30697 rows). Is there a faster way? My code is below; please help.
for i in range(len(df)):
    if pd.notna(df['home/away'][i]):  # assuming the original is_na() was a not-null check
        temp = df['home/away'][i].split()
        if temp[1] == '#':
            df['home/away'][i] = 'away'
        else:
            df['home/away'][i] = 'home'
You can use np.select to assign multiple conditions:
s=df['Match Place'].str.split().str[1] #select the middle element
c1,c2=s.eq('#'),s.eq('vs.') #assign conditions
np.select([c1,c2],['away','home']) #assign this to the desired column
#array(['away', 'home', 'away', 'home'], dtype='<U11')
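Put together on a small frame (column names assumed from the question; the default value for unmatched rows is our own choice):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Match Place': ['MANU # POR', 'MANU vs. UTA', 'MANU # IND', 'MANU vs. GRE']})
s = df['Match Place'].str.split().str[1]
df['home/away'] = np.select([s.eq('#'), s.eq('vs.')], ['away', 'home'], default='unknown')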
Use np.where with str.contains to check whether the substring exists:
import numpy as np
df = pd.DataFrame(data={"col1":["manu vs. abc","manu # pro"]})
df['type'] = np.where(df['col1'].str.contains("#"),"away","home")
col1 type
0 manu vs. abc home
1 manu # pro away
You can use .str.contains(..) [pandas-doc] to check if the string contains an #, and then use .map(..) [pandas-doc] to fill in values accordingly. For example:
>>> df
match
0 MANU # POR
1 MANU vs. UTA
2 MANU # IND
3 MANU vs. GRE
>>> df['match'].str.contains('#').map({False: 'home', True: 'away'})
0 away
1 home
2 away
3 home
Name: match, dtype: object
A fun usage of replace (more info in the link):
df['match'].replace({'#':0},regex=True).astype(bool).map({False: 'away', True: 'home'})
0 away
1 home
2 away
3 home
Name: match, dtype: object
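The trick works because the replacement value 0 is not a string: replace swaps the entire cell for 0 wherever the regex matches, so rows containing '#' become 0 (falsy) while the rest stay non-empty strings (truthy), and astype(bool) turns that into False/True for map to relabel. A tiny sketch:
pd.Series(['MANU # POR', 'MANU vs. UTA']).replace({'#': 0}, regex=True)
# 0               0       <- whole cell replaced, falsy
# 1    MANU vs. UTA       <- unchanged non-empty string, truthy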

Create new column returning true/false if names in two columns match using regex

I have a dataframe where I am trying to match the string values of two columns, creating a new column that returns True if the two values match and False if they don't.
I want to use match and regex, removing all non-alphanumeric characters and lowercasing the names before matching:
pattern = re.compile('[^a-zA-Z]')
Name A Name B
0 yGZ,) ygz.
1 (CGI) C.G.I
2 Exto exto.
3 Golden UTF
I was thinking of trying something like this:
dataframe['Name A', 'Name B'].str.match(pattern, flags= re.IGNORECASE)
Name A Name B Result
0 yGZ,) ygz. True
1 (CGI) C.G.I True
2 Exto exto. True
3 Golden UTF False
You can use pd.DataFrame.replace to clean your strings, and then compare using eq. Of course, if you wish to keep a copy of your original df, just assign the returned dataframe to a new variable.
df = df.replace("[^a-zA-Z0-9]", '', regex=True)
Then
df['Result'] = df['Name A'].str.lower().eq(df['Name B'].str.lower())
Outputs
Name A Name B Result
0 yGZ ygz True
1 CGI CGI True
2 Exto exto True
3 Golden UTF False
You can use str.replace to remove punctuation (also see another post of mine, Fast punctuation removal with pandas), then:
u = df.apply(lambda x: x.str.replace(r'[^\w]', '').str.lower())
df['Result'] = u['Name A'] == u['Name B']
df
Name A Name B Result
0 yGZ,) ygz. True
1 (CGI) C.G.I True
2 Exto exto. True
3 Golden UTF False
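Both answers share the same normalize-then-compare shape; spelled out as a helper (the function name is ours, not from either answer):
def normalize(col):
    # strip everything except letters and digits, then lowercase
    return col.str.replace(r'[^a-zA-Z0-9]', '', regex=True).str.lower()

df['Result'] = normalize(df['Name A']) == normalize(df['Name B'])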

Checking if '?' is present anywhere in string data frame python

Background: I have the following dataframe:
import pandas as pd
d = {'text': ["yeah!", "tomorrow? let's go", "today will do"]}
df = pd.DataFrame(data=d)
df['text'].apply(str)
Output:
text
0 yeah!
1 tomorrow? let's go
2 today will do
Goal:
1) check each row to determine if '?' is present and return a boolean (True if ? appears anywhere in the text column, False if no ? is present)
2) create a new column with the results
Desired output:
text result
0 yeah! False
1 tomorrow? let's go True
2 today will do False
Problem:
I am using the code below
df['Result'] = df.text.apply(lambda t: t[-1]) is "?"
Actual Output:
text result
0 yeah! False
1 tomorrow? let's go False
2 today will do False
Question: How do I alter my code to achieve 1) of my goal?
In regex, ? is a special character, so you need to escape it or pass regex=False to contains:
df['result'] = df['text'].astype(str).str.contains(r'\?')
Or:
df['result'] = df['text'].astype(str).str.contains('?', regex=False)
Or:
df['result'] = df['text'].apply(lambda x: '?' in x)
print (df)
text result
0 yeah! False
1 tomorrow? let's go True
2 today will do False
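As an aside, the original attempt returns all False for two reasons: apply(lambda t: t[-1]) only looks at the last character, and is "?" tests object identity against the whole Series rather than comparing values, which is always False. Comparing with == would at least run as intended, though it still misses a ? elsewhere in the string:
# Hypothetical fix for the last-character check only; use str.contains for "anywhere"
df['last_is_q'] = df['text'].str[-1] == '?'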
