Background: I have the following dataframe:
import pandas as pd
d = {'text': ["yeah!", "tomorrow? let's go", "today will do"]}
df = pd.DataFrame(data=d)
df['text'].apply(str)
Output:
text
0 yeah!
1 tomorrow? let's go
2 today will do
Goal:
1) check each row to determine if '?' is present and return a boolean (return True if '?' is anywhere in the text column and False if no '?' is present)
2) create a new column with the results
Desired output:
text result
0 yeah! False
1 tomorrow? let's go True
2 today will do False
Problem:
I am using the code below
df['Result'] = df.text.apply(lambda t: t[-1]) is "?"
Actual Output:
text result
0 yeah! False
1 tomorrow? let's go False
2 today will do False
Question: How do I alter my code to achieve 1) of my goal?
In regex, ? is a special character, so you need to escape it or pass regex=False to contains:
df['result'] = df['text'].astype(str).str.contains(r'\?')
Or:
df['result'] = df['text'].astype(str).str.contains('?', regex=False)
Or:
df['result'] = df['text'].apply(lambda x: '?' in x)
print (df)
text result
0 yeah! False
1 tomorrow? let's go True
2 today will do False
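As an aside, the original attempt returns False for every row because is tests object identity: the whole Series returned by apply is compared with the string "?", which is always False, and that single scalar is assigned to the column. Even with ==, t[-1] only inspects the last character. A quick sketch using the df built above:
print(df['text'].apply(lambda t: t[-1]))         # '!', 'o', 'o' - only the last characters
print(df['text'].apply(lambda t: t[-1]) == "?")  # all False, even for the row containing '?'
df['result'] = df['text'].str.contains('?', regex=False)  # checks the whole string per row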
Related
I have a value in my df: NEFT fm Principal Focused Multi.
I have a list of keywords like NEFT, neft. So, if the above string in the df contains the word NEFT, I want to add a new column and mark it as True.
Sample Input
I was trying
for index, row in ldf_first_page.iterrows():
    row.str.lower().isin(llst_transaction_keywords)
But it is trying to check the exact keyword, and I am looking for something like contains.
Your question is not entirely clear, but as I understand it you want to add a column for every keyword and set the value to True only in the rows that contain that keyword? I would do the following.
First, read the data:
import re
import pandas as pd
from io import StringIO
example_data = """
Date,Narration,Other
31/03/21, blabla, 1
12/2/20, here neft in, 2
10/03/19, NEFT is in, 3
9/04/18, ok, 4
11/08/21, oki cha , 5
"""
df = pd.read_csv(StringIO(example_data))
Now loop over the keywords and see if there are matches. Note that I have imported the re module to allow for IGNORECASE:
transaction_keywords = ["neft", "cha"]
for keyword in transaction_keywords:
    mask = df["Narration"].str.contains(keyword, regex=True, flags=re.IGNORECASE)
    if mask.any():
        df[keyword] = mask
The result looks like this:
Date Narration Other neft cha
0 31/03/21 blabla 1 False False
1 12/2/20 here neft in 2 True False
2 10/03/19 NEFT is in 3 True False
3 9/04/18 ok 4 False False
4 11/08/21 oki cha 5 False True
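If instead you want a single True/False column that flags any of the keywords, one option is to join them into one pattern (a minimal sketch using the same example_data; the column name has_keyword is just an illustration):
pattern = "|".join(re.escape(k) for k in transaction_keywords)
df["has_keyword"] = df["Narration"].str.contains(pattern, flags=re.IGNORECASE, regex=True)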
Try (note that return only works inside a function, and the membership test has to go keyword by keyword):
for index, row in ldf_first_page.iterrows():
    if any(keyword in str(row["Narration"]).lower() for keyword in llst_transaction_keywords):
        ...  # this row contains at least one keyword
Or try:
import numpy as np

df["new_col"] = np.where(df["Narration"].str.lower().str.contains("neft"), True, False)
You can try using any():
for index, row in ldf_first_page.iterrows():
    row.astype(str).str.lower().str.contains('|'.join(llst_transaction_keywords)).any()
I want to compare 2 columns. If they are the same the result should be True, if they are not the same it should be False, like this:
filtering          lemmatization     check
[hello, world]     [hello, world]    True
[grape, durian]    [apple, grape]    False
The output from my code is all False, but the data actually differs from row to row (some rows match and some don't). Why?
You can get my data on GitHub.
import pandas as pd
dc = pd.read_excel('./data clean (spaCy).xlsx')
dc['check'] = dc['filtering'].equals(dc['lemmatization'])
Here is the difference between the columns: in one column the '' quotes around the strings are missing, so a possible solution is to convert both columns to lists and compare them with Series.eq (which works like ==):
import ast
dc = pd.read_excel('data clean (spaCy).xlsx')
# strip the surrounding [] and split by ', '
dc['filtering'] = dc['filtering'].str.strip('[]').str.split(', ')
# the strings are quoted in this column, so ast.literal_eval works
dc['lemmatization'] = dc['lemmatization'].apply(ast.literal_eval)
#compare
dc['check'] = dc['filtering'].eq(dc['lemmatization'])
print (dc.head())
label filtering \
0 2 [ppkm, ya]
1 2 [mohon, informasi, pgs, pasar, turi, ppkm, buk...
2 2 [rumah, ppkm]
3 1 [pangkal, penanganan, pandemi, indonesia, terk...
4 1 [ppkm, mikro, anjing]
lemmatization check
0 [ppkm, ya] True
1 [mohon, informasi, pgs, pasar, turi, ppkm, buk... True
2 [rumah, ppkm] True
3 [pangkal, tangan, pandemi, indonesia, kesan, s... False
4 [ppkm, mikro, anjing] True
The reason for all False is that Series.equals returns a single scalar (here False), which is then broadcast to every row.
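A tiny sketch of the difference, on made-up data:
import pandas as pd

s1 = pd.Series([['a'], ['b']])
s2 = pd.Series([['a'], ['c']])

print(s1.equals(s2))  # False - one scalar for the whole Series
print(s1.eq(s2))      # element-wise: True for row 0, False for row 1
Assigning the result of equals to a column broadcasts that single scalar to every row, which is why the check column came out all False.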
I have a pandas dataframe,
df = pd.DataFrame({"Id": [77000581079,77000458432,77000458433,77000458434,77000691973], "Code": ['FO07930', 'FO73597','FO03177','FO73596','FOZZZZZ']})
I want to check the value of each row in column Code to see if it matches the string FOZZZZZ.
If it does not match, I would like to concatenate the Id value to the Code value.
So the expected output will be:
Id Code
0 77000581079 FO0793077000581079
1 77000458432 FO7359777000458432
2 77000458433 FO0317777000458433
3 77000458434 FO7359677000458434
4 77000691973 FOZZZZZ
I've tried
df['Id'] = df['Id'].astype(str)
for x in df['Id']:
    if x == 'FOZZZZ':
        pass
    else:
        df['Id']+df['Code']
Which I thought would run over each row in column Code to check whether it equals 'FOZZZZ' and, if not, concatenate the columns, but no joy.
# assign the concatenation only on rows where Code is not 'FOZZZZZ'
df.loc[df['Code']!='FOZZZZZ', 'Code'] = df['Code'] + df['Id'].astype(str)
Use pandas.Series.where with eq:
s = df["Code"]
df["Code"] = s.where(s.eq("FOZZZZZ"), s + df["Id"].astype(str))
print(df)
Output:
Code Id
0 FO0793077000581079 77000581079
1 FO7359777000458432 77000458432
2 FO0317777000458433 77000458433
3 FO7359677000458434 77000458434
4 FOZZZZZ 77000691973
Try np.where(condition, value if the condition is True, value if the condition is False). Use .isin to check whether the Code is FOZZZZZ and negate it with ~ to build the boolean mask used as the condition.
import numpy as np

df['Code'] = np.where(~df['Code'].isin(['FOZZZZZ']), df.Id.astype(str)+df.Code, df.Code)
Id Code
0 77000581079 77000581079FO07930
1 77000458432 77000458432FO73597
2 77000458433 77000458433FO03177
3 77000458434 77000458434FO73596
4 77000691973 FOZZZZZ
Or you could try using loc:
df['Code'] = df['Code'] + df['Id'].astype(str)
df.loc[df['Code'].str.contains('FOZZZZZ'), 'Code'] = 'FOZZZZZ'
print(df)
Output:
Code Id
0 FO0793077000581079 77000581079
1 FO7359777000458432 77000458432
2 FO0317777000458433 77000458433
3 FO7359677000458434 77000458434
4 FOZZZZZ 77000691973
I have a dataframe where I am trying to match the string values of two columns to create a new column that returns True if the two values match and False if they don't.
I want to use match and regex to remove all non-alphanumeric characters and compare the names in lowercase.
pattern = re.compile('[^a-zA-Z]')
Name A Name B
0 yGZ,) ygz.
1 (CGI) C.G.I
2 Exto exto.
3 Golden UTF
I was thinking of trying something like this:
dataframe['Name A', 'Name B'].str.match(pattern, flags= re.IGNORECASE)
Name A Name B Result
0 yGZ,) ygz. True
1 (CGI) C.G.I True
2 Exto exto. True
3 Golden UTF False
You can use pd.DataFrame.replace to clean your strings, and then compare using eq. Of course, if you wish to keep a copy of your original df, just assign the returned DataFrame to a new variable ;)
df = df.replace("[^a-zA-Z0-9]", '', regex=True)
Then
df['Result'] = df['Name A'].str.lower().eq(df['Name B'].str.lower())
Outputs
Name A Name B Result
0 yGZ ygz True
1 CGI CGI True
2 Exto exto True
3 Golden UTF False
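If you want to keep the original strings in df, the same idea works on a cleaned copy (a small sketch of the variant mentioned above; the name clean is just illustrative):
clean = df.replace("[^a-zA-Z0-9]", "", regex=True)
df['Result'] = clean['Name A'].str.lower().eq(clean['Name B'].str.lower())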
You can use str.replace to remove punctuation (also see another post of mine, Fast punctuation removal with pandas), then compare. Note that recent pandas versions require regex=True to be passed explicitly for a regex pattern:
u = df.apply(lambda x: x.str.replace(r'[^\w]', '', regex=True).str.lower())
df['Result'] = u['Name A'] == u['Name B']
df
Name A Name B Result
0 yGZ,) ygz. True
1 (CGI) C.G.I True
2 Exto exto. True
3 Golden UTF False
Background: I have the following dataframe
import pandas as pd
d = {'text': ["paid", "paid and volunteer", "other phrase"]}
df = pd.DataFrame(data=d)
df['text'].apply(str)
Output:
text
0 paid
1 paid and volunteer
2 other phrase
Goal:
1) check each row to determine if paid is present and return a boolean (return True if paid is anywhere in the text column and False if paid is not present). But I would like to exclude the word volunteer: if volunteer is present, the result should be False.
2) create a new column with the results
Desired Output:
text result
0 paid true
1 paid and volunteer false
2 other phrase false
Problem: I am using the following code
df['result'] = df['text'].astype(str).str.contains('paid') #but not volunteer
I checked How to negate specific word in regex? and it shows how to exclude a word, but I am not sure how to incorporate it into my code.
Question:
How do I alter my code to achieve 1) and 2) of my goal?
Using lambda:
df['result'] = df['text'].apply(lambda row: ('paid' in row) and ('volunteer' not in row))
You can use a logical and to check for both conditions.
(df.text.str.contains('paid')) & (~df.text.str.contains('volunteer'))
Out[14]:
0 True
1 False
2 False
Name: text, dtype: bool
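If you prefer the single-pattern route from the linked regex question, a negative lookahead also works (just a sketch, one of several options):
df['result'] = df['text'].astype(str).str.contains(r'^(?!.*volunteer).*paid')
This returns True only when paid appears and volunteer does not appear anywhere in the string.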