I have a value in my df: NEFT fm Principal Focused Multi.
I have a list of keywords like NEFT, neft. If the above string in the df contains the word NEFT, I want to add a new column and mark it True.
Sample Input
I was trying
for index, row in ldf_first_page.iterrows():
    row.str.lower().isin(llst_transaction_keywords)
But this checks for an exact keyword match, and I am looking for something like contains.
Your question is not entirely clear, but what I understand is that you want to add a column for every keyword, set to True only at the rows that contain that keyword? I would do the following.
First, read the data:
import re
import pandas as pd
from io import StringIO
example_data = """
Date,Narration,Other
31/03/21, blabla, 1
12/2/20, here neft in, 2
10/03/19, NEFT is in, 3
9/04/18, ok, 4
11/08/21, oki cha , 5
"""
df = pd.read_csv(StringIO(example_data))
Now loop over the keywords and see if there are matches. Note that I have imported the re module to allow for IGNORECASE:
transaction_keywords = ["neft", "cha"]
for keyword in transaction_keywords:
    mask = df["Narration"].str.contains(keyword, regex=True, flags=re.IGNORECASE)
    if mask.any():
        df[keyword] = mask
The result looks like this:
Date Narration Other neft cha
0 31/03/21 blabla 1 False False
1 12/2/20 here neft in 2 True False
2 10/03/19 NEFT is in 3 True False
3 9/04/18 ok 4 False False
4 11/08/21 oki cha 5 False True
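If you instead want a single boolean column marking rows that contain any keyword, the loop above can be collapsed into one alternation pattern. A sketch reusing the example data; the column name has_keyword is my own:

```python
import re
import pandas as pd
from io import StringIO

example_data = """
Date,Narration,Other
31/03/21, blabla, 1
12/2/20, here neft in, 2
10/03/19, NEFT is in, 3
"""
df = pd.read_csv(StringIO(example_data))

transaction_keywords = ["neft", "cha"]
# join the keywords into one regex alternation; re.escape guards against
# keywords that contain regex metacharacters
pattern = "|".join(re.escape(k) for k in transaction_keywords)
df["has_keyword"] = df["Narration"].str.contains(pattern, regex=True, flags=re.IGNORECASE)
print(df["has_keyword"].tolist())  # [False, True, True]
```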
Try:
for index, row in ldf_first_page.iterrows():
    # test each keyword against the row's text, not the list against the row
    if any(keyword in row["Narration"].lower() for keyword in llst_transaction_keywords):
        ldf_first_page.loc[index, "match"] = True
Try:
import numpy as np

df["new_col"] = np.where(df["Narration"].str.lower().str.contains("neft"), True, False)
You can try using any():
for index, row in ldf_first_page.iterrows():
    # True if any cell in the row contains any of the keywords
    row.str.lower().str.contains("|".join(llst_transaction_keywords), na=False).any()
I want to compare 2 columns: True if they are the same, False if not, like this:
filtering         lemmatization    check
[hello, world]    [hello, world]   True
[grape, durian]   [apple, grape]   False
The output from my code is all False, but the data actually differs row by row. Why?
You can get my data on GitHub.
import pandas as pd
dc = pd.read_excel('./data clean (spaCy).xlsx')
dc['check'] = dc['filtering'].equals(dc['lemmatization'])
The difference between the columns is that in one column the '' quotes around the strings are missing. A possible solution is to convert both columns to lists, and for the comparison use Series.eq (which works like ==):
import ast
dc = pd.read_excel('data clean (spaCy).xlsx')
# strip the surrounding [] and split by ', '
dc['filtering'] = dc['filtering'].str.strip('[]').str.split(', ')
# here the strings are quoted, so literal_eval works
dc['lemmatization'] = dc['lemmatization'].apply(ast.literal_eval)
#compare
dc['check'] = dc['filtering'].eq(dc['lemmatization'])
print (dc.head())
label filtering \
0 2 [ppkm, ya]
1 2 [mohon, informasi, pgs, pasar, turi, ppkm, buk...
2 2 [rumah, ppkm]
3 1 [pangkal, penanganan, pandemi, indonesia, terk...
4 1 [ppkm, mikro, anjing]
lemmatization check
0 [ppkm, ya] True
1 [mohon, informasi, pgs, pasar, turi, ppkm, buk... True
2 [rumah, ppkm] True
3 [pangkal, tangan, pandemi, indonesia, kesan, s... False
4 [ppkm, mikro, anjing] True
The reason for all False is that Series.equals returns a single scalar, which here is False.
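A minimal sketch of the difference between the two methods, using made-up two-row Series:

```python
import pandas as pd

s1 = pd.Series([["hello", "world"], ["grape", "durian"]])
s2 = pd.Series([["hello", "world"], ["apple", "grape"]])

# Series.equals compares the whole Series at once and returns one bool
print(s1.equals(s2))        # False
# Series.eq compares row by row and returns a boolean Series
print(s1.eq(s2).tolist())   # [True, False]
```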
I have a list of links such as:
Website
www.uk_nation.co.uk
www.nation_ny.com
www.unitednation.com
www.nation.of.freedom.es
www.freedom.org
and so on.
The above is what the column of my dataframe looks like.
As you can see, they have in common the word 'nation'.
I would like to label/group them and add one column in my dataframe to respond with a boolean (True/false; e.g. columns: Nation? option: True/False).
Website Nation?
www.uk_nation.co.uk True
www.nation_ny.com True
www.unitednation.com True
www.nation.of.freedom.es True
www.freedom.org False
I would need to do this in order to classify websites in an easier (and possibly quicker) way.
Do you have any suggestions on how to do it?
Any help will be welcome.
Try str.contains:
df['Nation?'] = df.Website.str.upper().str.contains('NATION')
0 True
1 True
2 True
3 True
4 False
Name: Website, dtype: bool
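As a variation (my own, not part of the answer above), str.contains also takes a case parameter, so the upper-casing step can be dropped:

```python
import pandas as pd

df = pd.DataFrame({"Website": ["www.uk_nation.co.uk", "www.freedom.org"]})
# case=False makes the substring match case-insensitive directly
df["Nation?"] = df["Website"].str.contains("nation", case=False)
print(df["Nation?"].tolist())  # [True, False]
```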
This is my suggestion:
import pandas as pd
df = pd.DataFrame({'Website': ['www.uk_nation.co.uk',
                               'www.nation_ny.com',
                               'www.unitednation.com',
                               'www.nation.of.freedom.es',
                               'www.freedom.org']})
df['Nation?'] = df['Website'].str.contains("nation")
print(df)
output:
Website Nation?
0 www.uk_nation.co.uk True
1 www.nation_ny.com True
2 www.unitednation.com True
3 www.nation.of.freedom.es True
4 www.freedom.org False
This should do it:
df['Nation?'] = df['Website'].apply(lambda x: 'nation' in x.lower())
EDIT: I have stripped down the file to the bits that are problematic
raw_data = {"link":
['https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLJ.html#cda8700ef5',
'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLH.html#cda8700ef5',
'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLj.html#cda8700ef5',
'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLh.html#cda8700ef5',
'https://www.otodom.pl/oferta/zielony-widok-mieszkanie-3m04-ID43EWU.html#9dca9667c3',
'https://www.otodom.pl/oferta/zielony-widok-mieszkanie-3m04-ID43EWu.html#9dca9667c3',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQM.html#af24036d28',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQJ.html#af24036d28',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQm.html#af24036d28',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQj.html#af24036d28',
'https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWY.html#2d0084b7ea',
'https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWy.html#2d0084b7ea',
'https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152',
'https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152']}
df = pd.DataFrame(raw_data, columns = ["link"])
#duplicate check #1
a = print(df.iloc[12][0])
b = print(df.iloc[13][0])
if a == b:
    print("equal")
#duplicate check #2
df.duplicated()
For the first test I get the following output implying there is a duplicate
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152
equal
For the second test it seems there are no duplicates
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
dtype: bool
Original post:
Trying to identify duplicate values from the "Link" column of attached file:
data file
import pandas as pd
data = pd.read_csv(r"...\consolidated.csv", sep=",")
df = pd.DataFrame(data)
del df['Unnamed: 0']
duplicate_rows = df[df.duplicated(["Link"], keep="first")]
pd.DataFrame(duplicate_rows)
#a = print(df.iloc[42657][15])
#b = print(df.iloc[42676][15])
#if a == b:
# print("equal")
I used the code above, but the answer I keep getting is that there are no duplicates. I checked it through Excel, and there should be seven duplicate instances. I even selected specific cells for a quick check (the part marked with #s), and the values were identified as equal. Yet duplicated does not capture them.
I have been scratching my head for a good hour and still have no idea what I'm missing - help appreciated!
I had the same problem, and converting the columns of the dataframe to str helped, e.g.
df['link'] = df['link'].astype(str)
duplicate_rows = df[df.duplicated(["link"], keep="first")]
First, you don't need df = pd.DataFrame(data), as data = pd.read_csv(r"...\consolidated.csv", sep=",") already returns a DataFrame.
As for the deletion of duplicates, check the drop_duplicates method in the documentation.
Hope this helps.
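For reference, a minimal sketch of drop_duplicates with made-up Link values; keep="first" mirrors the duplicated call in the question:

```python
import pandas as pd

df = pd.DataFrame({"Link": ["a.html", "b.html", "a.html"]})
# keep the first occurrence of each Link and drop later repeats
deduped = df.drop_duplicates(subset=["Link"], keep="first")
print(len(deduped))  # 2
```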
I have a dataframe where I am trying to match the string values of two columns, creating a new column that returns True if the two values match and False if they don't.
I want to use match with a regex, removing all non-alphanumeric characters and lower-casing the names:
pattern = re.compile('[^a-zA-Z]')
Name A Name B
0 yGZ,) ygz.
1 (CGI) C.G.I
2 Exto exto.
3 Golden UTF
I was thinking of trying something like this:
dataframe['Name A', 'Name B'].str.match(pattern, flags= re.IGNORECASE)
Name A Name B Result
0 yGZ,) ygz. True
1 (CGI) C.G.I True
2 Exto exto. True
3 Golden UTF False
You can use pd.DataFrame.replace to clean your strings, and then compare using eq. Of course, if you wish to keep a copy of your original df, just assign the returned data frame to a new variable ;}
df = df.replace("[^a-zA-Z0-9]", '', regex=True)
Then
df['Result'] = df['Name A'].str.lower().eq(df['Name B'].str.lower())
Outputs
Name A Name B Result
0 yGZ ygz True
1 CGI CGI True
2 Exto exto True
3 Golden UTF False
You can use str.replace to remove punctuation (also see another post of mine, Fast punctuation removal with pandas), then
u = df.apply(lambda x: x.str.replace(r'[^\w]', '', regex=True).str.lower())
df['Result'] = u['Name A'] == u['Name B']
df
Name A Name B Result
0 yGZ,) ygz. True
1 (CGI) C.G.I True
2 Exto exto. True
3 Golden UTF False
I have a pandas dataframe with several columns in Japanese.
I would like to run a search that returns rows that contain certain Japanese characters.
ex.
find_str = 'バッグ'
I know I can't just use things like:
df[df.col1.str.contains(find_str)] or df[df.col1 == find_str]
How would I go about this? Like what encoding would I need to use, etc.?
name
0 ヴァラ
1 ALEXANDER WANG(アレキサンダーワン) クラッチバッグ パイソン【中古】
2 ミューズトゥ
3 ミューズトゥ
4 ローディーロック
5 バブーシュカクリスタルGG
I'd run something simple like:
df[df.name.str.contains('ゥ')]
which should return rows 2 and 3 but instead I get an empty result
This works for me:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import pandas as pd
df = pd.read_csv('file.csv', encoding='utf-8')
find_str = u'バッグ'
m = df['name'].str.contains(find_str)
print (m)
0 False
1 True
2 False
3 False
4 False
5 False
Name: name, dtype: bool
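In Python 3 this needs no special handling at all: strings are Unicode by default, so str.contains works on Japanese text directly once the file is read with the right encoding. A minimal sketch with a few of the sample names inlined instead of read from file.csv:

```python
import pandas as pd

# sample rows from the question, inlined for a self-contained example
df = pd.DataFrame({"name": ["ヴァラ", "クラッチバッグ パイソン", "ミューズトゥ"]})
m = df["name"].str.contains("バッグ")
print(m.tolist())  # [False, True, False]
```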