String manipulation for classification - python

I have a list of links such as:
Website
www.uk_nation.co.uk
www.nation_ny.com
www.unitednation.com
www.nation.of.freedom.es
www.freedom.org
and so on.
The above is what the column of my dataframe looks like.
As you can see, they have in common the word 'nation'.
I would like to label/group them by adding a column to my dataframe that holds a boolean (e.g. column: Nation?, values: True/False).
Website Nation?
www.uk_nation.co.uk True
www.nation_ny.com True
www.unitednation.com True
www.nation.of.freedom.es True
www.freedom.org False
I need to do this in order to classify websites in an easier (and possibly quicker) way.
Do you have any suggestions on how to do it?
Any help will be welcome.

Try str.contains:
df['Nation'] = df.Website.str.upper().str.contains('NATION')
0 True
1 True
2 True
3 True
4 False
Name: Website, dtype: bool
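If you would rather skip the str.upper() step, str.contains also takes a case argument; a minimal sketch assuming the same Website column:
# case-insensitive match without upper-casing first
df['Nation'] = df.Website.str.contains('nation', case=False)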

This is my suggestion:
import pandas as pd
df = pd.DataFrame({'Website': ['www.uk_nation.co.uk',
                               'www.nation_ny.com',
                               'www.unitednation.com',
                               'www.nation.of.freedom.es',
                               'www.freedom.org']})
df['Nation?'] = df['Website'].str.contains("nation")
print(df)
output:
Website Nation?
0 www.uk_nation.co.uk True
1 www.nation_ny.com True
2 www.unitednation.com True
3 www.nation.of.freedom.es True
4 www.freedom.org False

This should do it:
df['Nation?'] = df['Website'].apply(lambda x: 'nation' in x.lower())
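A vectorized alternative (a sketch assuming the Website column from the question) that also handles missing values via the na argument:
# str.contains returns booleans directly; na=False counts missing values as no match
df['Nation?'] = df['Website'].str.contains('nation', case=False, na=False)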

Related

Python: check if a dataframe cell contains a keyword

I have a value in my df: 'NEFT fm Principal Focused Multi'.
I have a list of keywords such as NEFT and neft. So, if the above string in the df contains the word NEFT, I want to add a new column and mark it as True.
I was trying:
for index, row in ldf_first_page.iterrows():
    row.str.lower().isin(llst_transaction_keywords)
But this checks for an exact match, and I am looking for something like contains.
Your question is not entirely clear, but as I understand it you want to add a column for every keyword and set the value to True only in the rows that contain that keyword. I would do the following.
First read the data
import re
import pandas as pd
from io import StringIO
example_data = """
Date,Narration,Other
31/03/21, blabla, 1
12/2/20, here neft in, 2
10/03/19, NEFT is in, 3
9/04/18, ok, 4
11/08/21, oki cha , 5
"""
df = pd.read_csv(StringIO(example_data))
Now loop over the keywords and see if there are matches. Note that I have imported the re module to allow for IGNORECASE:
transaction_keywords = ["neft", "cha"]
for keyword in transaction_keywords:
    mask = df["Narration"].str.contains(keyword, regex=True, flags=re.IGNORECASE)
    if mask.any():
        df[keyword] = mask
The result looks like this:
Date Narration Other neft cha
0 31/03/21 blabla 1 False False
1 12/2/20 here neft in 2 True False
2 10/03/19 NEFT is in 3 True False
3 9/04/18 ok 4 False False
4 11/08/21 oki cha 5 False True
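If you instead want a single True/False column that flags rows containing any of the keywords, one possible sketch (reusing example_data and transaction_keywords from above) is a regex alternation:
# build one case-insensitive pattern like 'neft|cha' (re.escape guards against special characters)
pattern = "|".join(re.escape(keyword) for keyword in transaction_keywords)
df["any_keyword"] = df["Narration"].str.contains(pattern, regex=True, flags=re.IGNORECASE)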
Try:
for index, row in ldf_first_page.iterrows():
    row_text = " ".join(row.astype(str)).lower()
    if any(keyword in row_text for keyword in llst_transaction_keywords):
        # a keyword is contained somewhere in this row
        ...
Try:
import numpy as np
df["new_col"] = np.where(df["Narration"].str.lower().str.contains("neft"), True, False)
You can try using the built-in any():
for index, row in ldf_first_page.iterrows():
    matched = any(keyword in " ".join(row.astype(str)).lower() for keyword in llst_transaction_keywords)

How would I groupby and see if all members of the group meet a certain condition?

I want to groupby and see if all members in the group meet a certain condition. Here's a dummy example:
import pandas as pd

x = ['Mike','Mike','Mike','Bob','Bob','Phil']
y = ['Attended','Attended','Attended','Attended','Not attend','Not attend']
df = pd.DataFrame({'name':x,'attendance':y})
And what I want to do is return a 3x2 dataframe that shows for each name, who was always in attendance. It should look like below:
new_df = pd.DataFrame({'name':['Mike','Bob','Phil'],'all_attended':[True,False,False]})
What's the best way to do this?
Thanks so much.
Let's try:
out = (df['attendance'].eq('Attended')
         .groupby(df['name']).all()
         .to_frame('all_attended').reset_index())
print(out)
name all_attended
0 Bob False
1 Mike True
2 Phil False
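If you prefer to keep everything as DataFrame method calls, the same result can be built with assign (a sketch using the df from the question):
# add a temporary boolean column, then check it is True for every row of each name
out = (df.assign(attended=df['attendance'].eq('Attended'))
         .groupby('name', as_index=False)['attended'].all())
print(out)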
One way could be:
df.groupby('name')['attendance'].apply(lambda x: (x == 'Attended').all())
name
Bob False
Mike True
Phil False
Name: attendance, dtype: bool
I would stay away from strings for data that does not need to be a string:
z = [s == 'Attended' for s in y]
df = pd.DataFrame({'name': x, 'attended': z})
Now you can check if all the elements for a given group are True:
>>> df.groupby('name')['attended'].all()
name
Bob False
Mike True
Phil False
Name: attended, dtype: bool
If something can only be a 0 or a 1, using a string introduces the possibility of errors, because someone might type Atended instead of Attended, for example.
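If the attendance column already exists as strings in a DataFrame, the same boolean conversion can be done in a vectorized way; a small sketch based on the name/attendance df from the question:
# convert the string column to booleans once, then aggregate per name
df['attended'] = df['attendance'].eq('Attended')
print(df.groupby('name')['attended'].all())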

Compare two column Pandas row per row

I want to compare two columns row by row: if they are the same the result should be True, otherwise False, like this:
filtering        lemmatization    check
[hello, world]   [hello, world]   True
[grape, durian]  [apple, grape]   False
The output from my code is all False, even though some of the rows actually do match. Why?
You can get my data from GitHub.
import pandas as pd
dc = pd.read_excel('./data clean (spaCy).xlsx')
dc['check'] = dc['filtering'].equals(dc['lemmatization'])
The difference between the columns is that in one column the '' around the strings are missing. A possible solution is to convert both columns to lists; for the comparison use Series.eq (which works like ==):
import ast
dc = pd.read_excel('data clean (spaCy).xlsx')
# remove the surrounding [] and split on ', '
dc['filtering'] = dc['filtering'].str.strip('[]').str.split(', ')
# the elements are quoted strings, so ast.literal_eval works
dc['lemmatization'] = dc['lemmatization'].apply(ast.literal_eval)
# compare element-wise
dc['check'] = dc['filtering'].eq(dc['lemmatization'])
print (dc.head())
label filtering \
0 2 [ppkm, ya]
1 2 [mohon, informasi, pgs, pasar, turi, ppkm, buk...
2 2 [rumah, ppkm]
3 1 [pangkal, penanganan, pandemi, indonesia, terk...
4 1 [ppkm, mikro, anjing]
lemmatization check
0 [ppkm, ya] True
1 [mohon, informasi, pgs, pasar, turi, ppkm, buk... True
2 [rumah, ppkm] True
3 [pangkal, tangan, pandemi, indonesia, kesan, s... False
4 [ppkm, mikro, anjing] True
The reason for the all-False output is that Series.equals returns a single scalar for the whole Series (here False), which is then broadcast to every row.
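A minimal illustration of the difference, using the two rows from the example table above:
import pandas as pd

s1 = pd.Series([['hello', 'world'], ['grape', 'durian']])
s2 = pd.Series([['hello', 'world'], ['apple', 'grape']])
print(s1.equals(s2))  # one scalar for the whole Series: False
print(s1.eq(s2))      # element-wise comparison: True, False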

Pandas DataFrame - duplicated() does not identify duplicate values

EDIT: I have stripped down the file to the bits that are problematic
raw_data = {"link":
['https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLJ.html#cda8700ef5',
'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLH.html#cda8700ef5',
'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLj.html#cda8700ef5',
'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLh.html#cda8700ef5',
'https://www.otodom.pl/oferta/zielony-widok-mieszkanie-3m04-ID43EWU.html#9dca9667c3',
'https://www.otodom.pl/oferta/zielony-widok-mieszkanie-3m04-ID43EWu.html#9dca9667c3',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQM.html#af24036d28',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQJ.html#af24036d28',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQm.html#af24036d28',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQj.html#af24036d28',
'https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWY.html#2d0084b7ea',
'https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWy.html#2d0084b7ea',
'https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152',
'https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152']}
df = pd.DataFrame(raw_data, columns = ["link"])
#duplicate check #1
a = print(df.iloc[12][0])
b = print(df.iloc[13][0])
if a == b:
    print("equal")
#duplicate check #2
df.duplicated()
For the first test I get the following output implying there is a duplicate
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152
equal
For the second test it seems there are no duplicates
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
dtype: bool
Original post:
Trying to identify duplicate values from the "Link" column of attached file:
data file
import pandas as pd
data = pd.read_csv(r"...\consolidated.csv", sep=",")
df = pd.DataFrame(data)
del df['Unnamed: 0']
duplicate_rows = df[df.duplicated(["Link"], keep="first")]
pd.DataFrame(duplicate_rows)
#a = print(df.iloc[42657][15])
#b = print(df.iloc[42676][15])
#if a == b:
# print("equal")
I used the code above, but the answer I keep getting is that there are no duplicates. I checked it through Excel and there should be seven duplicate instances. I even selected specific cells to do a quick check (the part marked with #s), and the values were identified as equal. Yet duplicated() does not capture them.
I have been scratching my head for a good hour, and still no idea what I'm missing - help appreciated!
I had the same problem, and converting the columns of the dataframe to str helped, e.g.:
df['link'] = df['link'].astype(str)
duplicate_rows = df[df.duplicated(["link"], keep="first")]
First, you don't need df = pd.DataFrame(data), as data = pd.read_csv(r"...\consolidated.csv", sep=",") already returns a DataFrame.
As for removing the duplicates, check the drop_duplicates method in the documentation.
Hope this helps.
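For reference, a minimal drop_duplicates sketch, assuming the column is named Link as in the original post:
# keep the first occurrence of each Link and drop the rest
df_no_dupes = df.drop_duplicates(subset=["Link"], keep="first")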

include one word and exclude another python

Background: I have the following dataframe
import pandas as pd
d = {'text': ["paid", "paid and volunteer", "other phrase"]}
df = pd.DataFrame(data=d)
df['text'] = df['text'].apply(str)
Output:
text
0 paid
1 paid and volunteer
2 other phrase
Goal:
1) Check each row to determine if paid is present and return a boolean (True if paid is anywhere in the text column, False if it is not). But I would like to exclude the word volunteer: if volunteer is present, the result should be False.
2) Create a new column with the results.
Desired Output:
text                  result
0  paid                True
1  paid and volunteer  False
2  other phrase        False
Problem: I am using the following code
df['result'] = df['text'].astype(str).str.contains('paid') #but not volunteer
I checked How to negate specific word in regex? and it shows how to exclude a word, but I am not sure how to incorporate that into my code.
Question:
How do I alter my code to achieve 1) and 2) of my goal?
Using a lambda:
df['result'] = df['text'].apply(lambda row: 'paid' in row and 'volunteer' not in row)
You can use a logical and to check for both conditions.
(df.text.str.contains('paid')) & (~df.text.str.contains('volunteer'))
Out[14]:
0 True
1 False
2 False
Name: text, dtype: bool
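If you want a single pattern instead of two contains calls, the negative-lookahead idea from the linked regex question can be adapted; a sketch assuming the same text column:
# one regex: rows that contain 'paid' but not 'volunteer'
df['result'] = df['text'].str.contains(r'^(?!.*volunteer).*paid', regex=True)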
