Compare two columns in Pandas row by row - python

I want to compare 2 columns. If they are the same the result should be True, if not False, like this:
filtering        lemmatization    check
[hello, world]   [hello, world]   True
[grape, durian]  [apple, grape]   False
The output from my code is all False, but the data is actually different in some rows. Why?
You can get my data on GitHub:
import pandas as pd
dc = pd.read_excel('./data clean (spaCy).xlsx')
dc['check'] = dc['filtering'].equals(dc['lemmatization'])

Here is the difference between the columns - in one column the '' around the strings are missing. A possible solution is to convert both columns to lists; for the comparison use Series.eq (which works like ==):
import ast
dc = pd.read_excel('data clean (spaCy).xlsx')
#remove the surrounding [] and split by ', '
dc['filtering'] = dc['filtering'].str.strip('[]').str.split(', ')
#here the strings are quoted, so ast.literal_eval works
dc['lemmatization'] = dc['lemmatization'].apply(ast.literal_eval)
#compare
dc['check'] = dc['filtering'].eq(dc['lemmatization'])
print (dc.head())
label filtering \
0 2 [ppkm, ya]
1 2 [mohon, informasi, pgs, pasar, turi, ppkm, buk...
2 2 [rumah, ppkm]
3 1 [pangkal, penanganan, pandemi, indonesia, terk...
4 1 [ppkm, mikro, anjing]
lemmatization check
0 [ppkm, ya] True
1 [mohon, informasi, pgs, pasar, turi, ppkm, buk... True
2 [rumah, ppkm] True
3 [pangkal, tangan, pandemi, indonesia, kesan, s... False
4 [ppkm, mikro, anjing] True
The reason for all False is that Series.equals returns a single scalar (here False), which is then broadcast to every row of the check column.
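The contrast can be seen in a minimal sketch (hypothetical two-row data mirroring the example above):

```python
import pandas as pd

s1 = pd.Series([['hello', 'world'], ['grape', 'durian']])
s2 = pd.Series([['hello', 'world'], ['apple', 'grape']])

# Series.equals compares the whole Series and returns one bool
print(s1.equals(s2))       # False (a single scalar)

# Series.eq compares element-wise and returns a Series
print(s1.eq(s2).tolist())  # [True, False]
```

Assigning the scalar result of equals to a column broadcasts that one value to every row, which is exactly the all-False output in the question.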

Related

Python check if a dataframe cell contains a keyword

I have a value in my df: NEFT fm Principal Focused Multi.
I have a list of keywords like NEFT and neft. If the above string in the df contains the word NEFT, I want to add a new column and mark it as True.
I was trying
for index, row in ldf_first_page.iterrows():
    row.str.lower().isin(llst_transaction_keywords)
But this checks for the exact keyword, and I am looking for something like contains.
Your question is not entirely clear, but as I understand it you want to add a column for every keyword and set the value to True only in the rows that contain that keyword. I would do the following.
First read the data
import re
import pandas as pd
from io import StringIO
example_data = """
Date,Narration,Other
31/03/21, blabla, 1
12/2/20, here neft in, 2
10/03/19, NEFT is in, 3
9/04/18, ok, 4
11/08/21, oki cha , 5
"""
df = pd.read_csv(StringIO(example_data))
Now loop over the keywords and see if there are matches. Note that I have imported the re module to allow for IGNORECASE:
transaction_keywords = ["neft", "cha"]
for keyword in transaction_keywords:
mask = df["Narration"].str.contains(keyword, regex=True, flags=re.IGNORECASE)
if mask.any():
df[keyword] = mask
The result looks like this:
Date Narration Other neft cha
0 31/03/21 blabla 1 False False
1 12/2/20 here neft in 2 True False
2 10/03/19 NEFT is in 3 True False
3 9/04/18 ok 4 False False
4 11/08/21 oki cha 5 False True
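One caveat not covered in the answer: with regex=True, a keyword containing regex metacharacters (., +, parentheses) is interpreted as a pattern, not a literal. Wrapping keywords in re.escape is a safe habit. A sketch with a hypothetical keyword:

```python
import re
import pandas as pd

s = pd.Series(["fee 1+1 applied", "plain narration"])
keyword = "1+1"  # hypothetical keyword containing the metacharacter "+"

# Unescaped, "+" means "one or more", so the literal "1+1" is not found
print(s.str.contains(keyword, regex=True).tolist())             # [False, False]

# Escaped, the keyword is matched as a literal string
print(s.str.contains(re.escape(keyword), regex=True).tolist())  # [True, False]
```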
Try:
for index, row in ldf_first_page.iterrows():
    if any(keyword in str(row["Narration"]) for keyword in llst_transaction_keywords):
        print(index, True)
Or try:
import numpy as np
df["new_col"] = np.where(df["Narration"].str.lower().str.contains("neft"), True, False)
You can try combining isin() with any():
for index, row in ldf_first_page.iterrows():
    row.str.lower().isin(llst_transaction_keywords).any()
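A note on why the isin attempts in this thread find nothing: isin tests whole-cell equality, while str.contains tests substring membership. A minimal sketch with hypothetical data:

```python
import pandas as pd

s = pd.Series(["here neft in", "ok"])
keywords = ["neft", "cha"]

# isin checks whole-cell equality, so a cell that merely *contains*
# a keyword is not matched
print(s.isin(keywords).tolist())                    # [False, False]

# str.contains does substring matching; join keywords into one pattern
print(s.str.contains("|".join(keywords)).tolist())  # [True, False]
```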

Pandas DataFrame - duplicated() does not identify duplicate values

EDIT: I have stripped down the file to the bits that are problematic
raw_data = {"link":
['https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLJ.html#cda8700ef5',
'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLH.html#cda8700ef5',
'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLj.html#cda8700ef5',
'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLh.html#cda8700ef5',
'https://www.otodom.pl/oferta/zielony-widok-mieszkanie-3m04-ID43EWU.html#9dca9667c3',
'https://www.otodom.pl/oferta/zielony-widok-mieszkanie-3m04-ID43EWu.html#9dca9667c3',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQM.html#af24036d28',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQJ.html#af24036d28',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQm.html#af24036d28',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQj.html#af24036d28',
'https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWY.html#2d0084b7ea',
'https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWy.html#2d0084b7ea',
'https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152',
'https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152']}
df = pd.DataFrame(raw_data, columns = ["link"])
#duplicate check #1
a = print(df.iloc[12][0])
b = print(df.iloc[13][0])
if a == b:
    print("equal")
#duplicate check #2
df.duplicated()
For the first test I get the following output implying there is a duplicate
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152
equal
For the second test it seems there are no duplicates
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
dtype: bool
Original post:
Trying to identify duplicate values from the "Link" column of attached file:
data file
import pandas as pd
data = pd.read_csv(r"...\consolidated.csv", sep=",")
df = pd.DataFrame(data)
del df['Unnamed: 0']
duplicate_rows = df[df.duplicated(["Link"], keep="first")]
pd.DataFrame(duplicate_rows)
#a = print(df.iloc[42657][15])
#b = print(df.iloc[42676][15])
#if a == b:
# print("equal")
I used the code above, but the answer I keep getting is that there are no duplicates. I checked it in Excel and there should be seven duplicate instances. I even selected specific cells to do a quick check (the part marked with #s), and those values were identified as equal - yet duplicated() does not capture them.
I have been scratching my head for a good hour and still have no idea what I'm missing - help appreciated!
I had the same problem, and converting the columns of the dataframe to "str" helped, e.g.
df['link'] = df['link'].astype(str)
duplicate_rows = df[df.duplicated(["link"], keep="first")]
First, you don't need df = pd.DataFrame(data), as pd.read_csv(r"...\consolidated.csv", sep=",") already returns a DataFrame.
As for deleting the duplicates, check the drop_duplicates method in the documentation.
Hope this helps.
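A side note on the stripped-down example: duplicated() is actually behaving correctly, because the pairs of links differ in case (e.g. ID43q4X vs ID43q4x), and the manual check printed "equal" only because print returns None, so both a and b are None. A sketch with hypothetical shortened links:

```python
# Assigning the result of print() stores None, not the printed string
a = print("https://example.test/ID43q4X")  # hypothetical shortened link
b = print("https://example.test/ID43q4x")

print(a is None and b is None)   # True: both variables are None, so a == b
print("ID43q4X" == "ID43q4x")    # False: string comparison is case-sensitive
```

Comparing the strings directly (a = df.iloc[12][0] without print) would have shown they are not equal.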

Create new column returning true/false if names in two columns match using regex

I have a dataframe where I am trying to match the string values of two columns and create a new column that returns True if the two values match and False if they don't.
I want to use match and regex, removing all non-alphanumeric characters and lowercasing the names before comparing:
pattern = re.compile('[^a-zA-Z]')
Name A Name B
0 yGZ,) ygz.
1 (CGI) C.G.I
2 Exto exto.
3 Golden UTF
I was thinking of trying something like this:
dataframe['Name A', 'Name B'].str.match(pattern, flags= re.IGNORECASE)
Name A Name B Result
0 yGZ,) ygz. True
1 (CGI) C.G.I True
2 Exto exto. True
3 Golden UTF False
You can use pd.DataFrame.replace to clean your strings, and then compare using eq. Of course, if you wish to keep a copy of your original df, just assign the returned data frame to a new variable ;)
df = df.replace("[^a-zA-Z0-9]", '', regex=True)
Then
df['Result'] = df['Name A'].str.lower().eq(df['Name B'].str.lower())
Outputs
Name A Name B Result
0 yGZ ygz True
1 CGI CGI True
2 Exto exto True
3 Golden UTF False
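Put together, the replace-then-eq approach runs like this (a self-contained sketch using the sample data above):

```python
import pandas as pd

df = pd.DataFrame({"Name A": ["yGZ,)", "(CGI)", "Exto", "Golden"],
                   "Name B": ["ygz.", "C.G.I", "exto.", "UTF"]})

# Strip everything that is not a letter or digit, then compare lowercased
cleaned = df.replace("[^a-zA-Z0-9]", "", regex=True)
df["Result"] = cleaned["Name A"].str.lower().eq(cleaned["Name B"].str.lower())
print(df["Result"].tolist())  # [True, True, True, False]
```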
You can use str.replace to remove punctuation (also see another post of mine, Fast punctuation removal with pandas), then
u = df.apply(lambda x: x.str.replace(r'[^\w]', '', regex=True).str.lower())
df['Result'] = u['Name A'] == u['Name B']
df
Name A Name B Result
0 yGZ,) ygz. True
1 (CGI) C.G.I True
2 Exto exto. True
3 Golden UTF False

Conditional String matching in pandas

I have the following dataframe a
a = pd.DataFrame([[1, 'bayern'], [2, 'bayern_leverkusen'], [3, 'Chelsea'],
                  [4, 'manunited'], [5, 'westhamunited'], [6, 'mancity']],
                 columns=['no', 'club'])
I would like to compare every value in club against all the other values in club and keep only those rows where there is a match of 4 or more consecutive characters.
For example, bayern and bayern_leverkusen should be kept because they share the substring bayern. Similarly, manunited and westhamunited should be kept because they share the substring united.
mancity should not be kept because the matching substring man is only 3 characters long.
Expected Output:
no club
0 1 bayern
1 2 bayern_leverkusen
3 4 manunited
4 5 westhamunited
import itertools
import pandas as pd

selector = pd.Series(False, index=a.index)
for first_index, second_index in itertools.combinations(a.index, 2):
    club1 = a['club'][first_index]
    club2 = a['club'][second_index]
    # check every 4-character substring of club1 against club2
    for start in range(len(club1) - 3):
        if club1[start:start + 4] in club2:
            selector[first_index] = True
            selector[second_index] = True
            break
new_df = a.loc[selector]

Return rows that match certain Japanese characters in a Series

I have a pandas dataframe with several columns in Japanese.
I would like to run a search that returns rows that contain certain Japanese characters.
ex.
find_str = 'バッグ'
I know I can't just use things like:
df[df.col1.str.contains(find_str)] or df[df.col1 == find_str]
How would I go about this? Like what encoding would I need to use, etc.?
name
0 ヴァラ
1 ALEXANDER WANG(アレキサンダーワン) クラッチバッグ パイソン【中古】
2 ミューズトゥ
3 ミューズトゥ
4 ローディーロック
5 バブーシュカクリスタルGG
I'd run something simple like:
df[df.name.str.contains('ゥ')]
which should return rows 2 and 3 but instead I get an empty result
This works for me:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import pandas as pd
df = pd.read_csv('file.csv', encoding='utf-8')
find_str = u'バッグ'
m = df['name'].str.contains(find_str)
print (m)
0 False
1 True
2 False
3 False
4 False
5 False
Name: name, dtype: bool
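For what it's worth, in Python 3, where str is Unicode, str.contains works on Japanese text with no special encoding handling. A minimal sketch with the sample data above:

```python
import pandas as pd

df = pd.DataFrame({"name": [
    "ヴァラ",
    "ALEXANDER WANG(アレキサンダーワン) クラッチバッグ パイソン【中古】",
    "ミューズトゥ",
    "ミューズトゥ",
    "ローディーロック",
    "バブーシュカクリスタルGG",
]})

# Plain substring search on Unicode strings works directly
print(df[df["name"].str.contains("ゥ")].index.tolist())  # [2, 3]
print(df["name"].str.contains("バッグ").tolist()[1])     # True
```

An empty result usually points to the file having been read with the wrong encoding (hence the encoding='utf-8' in the answer), not to a limitation of str.contains itself.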
