Pandas DataFrame - duplicated() does not identify duplicate values - python

EDIT: I have stripped down the file to the bits that are problematic
import pandas as pd

raw_data = {"link":
['https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLJ.html#cda8700ef5',
'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLH.html#cda8700ef5',
'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLj.html#cda8700ef5',
'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLh.html#cda8700ef5',
'https://www.otodom.pl/oferta/zielony-widok-mieszkanie-3m04-ID43EWU.html#9dca9667c3',
'https://www.otodom.pl/oferta/zielony-widok-mieszkanie-3m04-ID43EWu.html#9dca9667c3',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQM.html#af24036d28',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQJ.html#af24036d28',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQm.html#af24036d28',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQj.html#af24036d28',
'https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWY.html#2d0084b7ea',
'https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWy.html#2d0084b7ea',
'https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152',
'https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152']}
df = pd.DataFrame(raw_data, columns = ["link"])
#duplicate check #1
a = print(df.iloc[12][0])
b = print(df.iloc[13][0])
if a == b:
    print("equal")
#duplicate check #2
df.duplicated()
For the first test I get the following output, implying there is a duplicate:
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152
equal
For the second test, it seems there are no duplicates:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
dtype: bool
Original post:
Trying to identify duplicate values from the "Link" column of the attached file:
data file
import pandas as pd
data = pd.read_csv(r"...\consolidated.csv", sep=",")
df = pd.DataFrame(data)
del df['Unnamed: 0']
duplicate_rows = df[df.duplicated(["Link"], keep="first")]
pd.DataFrame(duplicate_rows)
#a = print(df.iloc[42657][15])
#b = print(df.iloc[42676][15])
#if a == b:
# print("equal")
I used the code above, but the answer I keep getting is that there are no duplicates. I checked it through Excel and there should be seven duplicate instances. I even selected specific cells to do a quick check (the part marked with #s), and the values were identified as equal. Yet duplicated() does not capture them.
I have been scratching my head for a good hour, and still no idea what I'm missing - help appreciated!

I had the same problem, and converting the columns of the DataFrame to str helped, e.g.:
df['link'] = df['link'].astype(str)
duplicate_rows = df[df.duplicated(["link"], keep="first")]
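This can help when a column ends up holding mixed types (for example numbers read in alongside strings), because duplicated() compares values by type as well. A minimal sketch of the effect:
import pandas as pd
s = pd.Series([1, '1'])
print(s.duplicated().any())              # False: int 1 and str '1' differ
print(s.astype(str).duplicated().any())  # True once both are strings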

First, you don't need df = pd.DataFrame(data), as data = pd.read_csv(r"...\consolidated.csv", sep=",") already returns a DataFrame.
As for the deletion of duplicates, check the drop_duplicates method in the documentation.
Hope this helps.
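For example, a minimal sketch (assuming the column is named Link, as in the code above):
deduped = df.drop_duplicates(subset=["Link"], keep="first")
Note that drop_duplicates returns a new DataFrame rather than modifying df in place, so the result has to be assigned.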

Related

Python - check if a DataFrame cell contains a keyword

I have a value in my df: NEFT fm Principal Focused Multi.
I have a list of keywords like NEFT, neft. So, if the above string in the df contains the word NEFT, I want to add a new column and mark it as True.
I was trying
for index, row in ldf_first_page.iterrows():
    row.str.lower().isin(llst_transaction_keywords)
But it is trying to check the exact keyword, and I am looking for something like contains.
Your question is not entirely clear, but as I understand it, you want to add a column for every keyword and set the value True only at the rows that contain that keyword? I would do the following.
First, read the data:
import re
import pandas as pd
from io import StringIO
example_data = """
Date,Narration,Other
31/03/21, blabla, 1
12/2/20, here neft in, 2
10/03/19, NEFT is in, 3
9/04/18, ok, 4
11/08/21, oki cha , 5
"""
df = pd.read_csv(StringIO(example_data))
Now loop over the keywords and see if there are matches. Note that I have imported the re module to allow for IGNORECASE.
transaction_keywords = ["neft", "cha"]
for keyword in transaction_keywords:
mask = df["Narration"].str.contains(keyword, regex=True, flags=re.IGNORECASE)
if mask.any():
df[keyword] = mask
The result looks like this:
Date Narration Other neft cha
0 31/03/21 blabla 1 False False
1 12/2/20 here neft in 2 True False
2 10/03/19 NEFT is in 3 True False
3 9/04/18 ok 4 False False
4 11/08/21 oki cha 5 False True
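One caveat: str.contains is called with regex=True here, so a keyword containing regex metacharacters (a dot, a plus sign, parentheses) would be interpreted as a pattern rather than literal text. If that can happen, a safer sketch escapes each keyword first:
mask = df["Narration"].str.contains(re.escape(keyword), regex=True, flags=re.IGNORECASE)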
Try:
for index, row in ldf_first_page.iterrows():
    if any(keyword in row["Narration"].lower() for keyword in llst_transaction_keywords):
        ldf_first_page.loc[index, "new_col"] = True
Try (assuming numpy is imported as np):
df["new_col"] = np.where(df["Narration"].str.lower().str.contains("neft"), True, False)
You can try using any():
for index, row in ldf_first_page.iterrows():
    row.astype(str).str.lower().str.contains("|".join(llst_transaction_keywords)).any()

Python - looping through rows and concatenating rows until a certain value is encountered

I am getting myself very confused over a problem I am encountering with a short Python script I am trying to put together. I am trying to iterate through a dataframe, appending rows to a new dataframe, until a certain value is encountered.
import pandas as pd
# This function will take a raw AGS file (saved as a CSV) and convert it to a
# dataframe. It will take the AGS CSV and print the top 5 header lines.
def AGS_raw(file_loc):
    raw_df = pd.read_csv(file_loc)
    #print(raw_df.head())
    return raw_df
import_df = AGS_raw('test.csv')
def AGS_snip(raw_df):
    for i in raw_df.iterrows():
        df_new_row = pd.DataFrame(i)
        cut_df = pd.DataFrame(raw_df)
        if "**PROJ" == True:
            cut_df = cut_df.concat([cut_df,df_new_row],ignore_index=True, sort=False)
        elif "**ABBR" == True:
            break
        print(raw_df)
        return cut_df
I don't need to get into specifics, but the values (**PROJ and **ABBR) in this data occur as single cells at the top of tables. So I want to loop row-wise through the data, appending rows until **ABBR is encountered.
When I call AGS_snip(import_df), nothing happens. Previous incarnations just spat out the whole df, and I'm just confused over the logic of the loops. Any assistance much appreciated.
EDIT: raw text of the CSV
**PROJ,
1,32
1,76
32,56
,
**ABBR,
1,32
1,76
32,56
The reason that "nothing happens" is likely because of the conditions you're using in if and elif.
Neither "**PROJ" == True nor "**ABBR" == True will ever be True, because neither "**PROJ" nor "**ABBR" is equal to True. Your code is equivalent to:
def AGS_snip(raw_df):
    for i in raw_df.iterrows():
        df_new_row = pd.DataFrame(i)
        cut_df = pd.DataFrame(raw_df)
        if False:
            cut_df = cut_df.concat([cut_df,df_new_row],ignore_index=True, sort=False)
        elif False:
            break
        print(raw_df)
        return cut_df
Which is the same as:
def AGS_snip(raw_df):
    for i in raw_df.iterrows():
        df_new_row = pd.DataFrame(i)
        cut_df = pd.DataFrame(raw_df)
        print(raw_df)
        return cut_df
You also always return from inside the loop and df_new_row isn't used for anything, so it's equivalent to:
def AGS_snip(raw_df):
    first_row = next(raw_df.iterrows(), None)
    if first_row:
        cut_df = pd.DataFrame(raw_df)
        print(raw_df)
        return cut_df
Here's how to parse your CSV file into multiple separate dataframes based on a row condition. Each dataframe is stored in a Python dictionary, with titles as keys and dataframes as values.
import pandas as pd
df = pd.read_csv('ags.csv', header=None)
# Drop rows which consist of all NaN (Not a Number) / missing values.
# Reset index order from 0 to the end of dataframe.
df = df.dropna(axis='rows', how='all').reset_index(drop=True)
# Grab indices of rows beginning with "**", and append an "end" index.
idx = df.index[df[0].str.startswith('**')].append(pd.Index([len(df)]))
# Dictionary of { dataframe titles : dataframes }.
dfs = {}
for k in range(len(idx) - 1):
    table_name = df.iloc[idx[k], 0]
    dfs[table_name] = df.iloc[idx[k]+1:idx[k+1]].reset_index(drop=True)

# Print the titles and tables.
for k, v in dfs.items():
    print(k)
    print(v)
# **PROJ
# 0 1
# 0 1 32.0
# 1 1 76.0
# 2 32 56.0
# **ABBR
# 0 1
# 0 1 32.0
# 1 1 76.0
# 2 32 56.0
# Access each dataframe by indexing the dictionary "dfs", for example:
print(dfs['**ABBR'])
# 0 1
# 0 1 32.0
# 1 1 76.0
# 2 32 56.0
# You can rename column names with, for example, this code
# (assigning the result; the inplace=True option of set_axis is deprecated):
dfs['**PROJ'] = dfs['**PROJ'].set_axis(['data1', 'data2'], axis='columns')
print(dfs['**PROJ'])
# data1 data2
# 0 1 32.0
# 1 1 76.0
# 2 32 56.0

Filtering function for pandas - Viewing NaN values within a column

Function I have created:
# Create a function that identifies blank values
def GPID_blank(df, variable):
    df = df.loc[df['GPID'] == variable]
    return df
Test:
variable = ''
test = GPID_blank(df, variable)
test
Goal: Create a function that can filter any dataframe column 'GPID' to see all of the rows where GPID has missing data.
I have tried running variable = 'NaN' and still no luck. However, I know the function works, as if I use a real-life variable "OH82CD85" the function filters my dataset accordingly.
Therefore, why doesn't it filter out the blank cells variable = 'NaN'? I know for my dataset, there are 5 rows with GPID missing data.
Example df:
df = pd.DataFrame({'Client': ['A','B','C'], 'GPID':['BRUNS2','OH82CD85','']})
Client GPID
0 A BRUNS2
1 B OH82CD85
2 C
Sample of GPID column:
0 OH82CD85
1 BW07TI20
2 OW36HW81
3 PE56TA73
4 CT46SX81
5 OD79AU80
6 GF46DB60
7 OL07ST01
8 VP38SM57
9 AH90AE61
10 PG86KO78
11 NaN
12 NaN
13 SO21GR72
14 DY85IN90
15 KW80CV02
16 CM15QP83
17 VC38FP82
18 DA36RX05
19 DD74HD38
You can't use == with NaN. NaN != NaN.
Instead, you can modify your function a little to check if the parameter is NaN using pd.isna() (or np.isnan()):
def GPID_blank(df, variable):
    if pd.isna(variable):
        return df.loc[df['GPID'].isna()]
    else:
        return df.loc[df['GPID'] == variable]
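A usage sketch (assuming numpy is imported as np and the missing entries are real NaN, as in the GPID sample above):
import numpy as np
missing = GPID_blank(df, np.nan)     # rows where GPID is missing
exact = GPID_blank(df, 'OH82CD85')   # rows matching a real GPID value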
You can't really search for NaN values with an equality expression. Also, in your example dataframe, '' is not NaN but a str, and can be matched with an equality expression.
Instead, you need to check when you want to filter for NaN, and filter differently:
def GPID_blank(df, variable):
    if pd.isna(variable):
        df = df.loc[df['GPID'].isna()]
    else:
        df = df.loc[df['GPID'] == variable]
    return df
It's not working because with variable = 'NaN' you're looking for a string whose content is 'NaN', not for missing values.
You can try:
import pandas as pd
def GPID_blank(df):
    # filtered dataframe with NaN values in GPID column
    blanks = df[df['GPID'].isnull()].copy()
    return blanks
filtered_df = GPID_blank(df)

Manipulate string to drop columns in pandas

I'm trying to manipulate a list (type: string) so I can use it to drop some columns from a dataframe.
The list comes from a dataframe where I created a condition to return the columns whose sums of all values are zero:
Selecting the columns with sum = 0
condicao_CO8 = (ex8_centro_oeste.sum(axis=0) == 0)
condicao_CO8 = condicao_CO8[condicao_CO8 == True]
condicao_CO8.to_csv(r'D:\Programas\CO_8.csv')
Importing the dataframe and turning it into a list:
CO8 = pd.read_csv(r'D:\Programas\CO_8.csv', delimiter=',')
CO8.rename(columns={'Unnamed: 0': 'Nulos'}, inplace = True)
CO8.rename(columns={'0': 'Info'}, inplace = True)
CO8.drop(columns = ['Info'], inplace = True)
CO8.columns
Some items from the list:
0 ABI - LETRAS
1 ADMINISTRAÇÃO PÚBLICA
2 AGRONOMIA
3 ALIMENTOS
4 ARQUIVOLOGIA
5 AUTOMAÇÃO INDUSTRIAL
6 BIOMEDICINA
7 BIOTECNOLOGIA
8 CIÊNCIAS - BIOLOGIA
9 CIÊNCIAS - QUÍMICA
10 CIÊNCIAS AGRÁRIAS
11 CIÊNCIAS BIOLÓGICAS
12 CIÊNCIAS BIOLÓGICAS E CONSERVAÇÃO
13 CIÊNCIAS DA COMPUTAÇÃO
14 CIÊNCIAS ECONÔMICAS
My goal is to transform the list so that I can use it to drop those columns.
Transforming this list into this:
"ABI - LETRAS", "ADMINISTRAÇÃO PÚBLICA", "AGRONOMIA", "ALIMENTOS", "ARQUIVOLOGIA", "AUTOMAÇÃO INDUSTRIAL"...
For this I wrote the following code (unsuccessfully):
list_CO8 = ('\",\" '.join(CO8['Nulos'].apply(str.upper).tolist()))
Please, can anyone help me?
I'm trying to get:
list = "ABI - LETRAS", "ADMINISTRAÇÃO PÚBLICA", "AGRONOMIA", "ALIMENTOS", "ARQUIVOLOGIA", "AUTOMAÇÃO INDUSTRIAL"...
To make:
ex8_centro_oeste.drop(columns=[list])
Dataset link: Link for Drive (8kb)
I read your question properly only now; I deleted my previous answer. You want to drop all the columns that sum to zero, if I am not mistaken.
import pandas as pd
import numpy as np

data = pd.read_csv("./centro-oeste.csv")
data.columns[np.sum(data) == 0]  # data is the dataframe
# It outputs a list that you can use.
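To actually drop those columns, a minimal sketch building on the line above (numeric_only=True is my assumption, to guard against non-numeric columns):
sums = data.sum(numeric_only=True)
zero_cols = sums.index[sums == 0]   # columns whose values sum to zero
data = data.drop(columns=zero_cols)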

Python - data frame - cannot remove duplicates

This has been puzzling me for a while. I have the following data set, denoted under raw data, and have run two checks: #1 to identify a sample duplicate, and #2 to remove duplicates with drop_duplicates. The #1 test does identify duplicates, yet #2 does not seem to remove any duplicates.
raw_data = {'link':
['https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLJ.html#cda8700ef5',
'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLH.html#cda8700ef5',
'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLj.html#cda8700ef5',
'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLh.html#cda8700ef5',
'https://www.otodom.pl/oferta/zielony-widok-mieszkanie-3m04-ID43EWU.html#9dca9667c3',
'https://www.otodom.pl/oferta/zielony-widok-mieszkanie-3m04-ID43EWu.html#9dca9667c3',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQM.html#af24036d28',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQJ.html#af24036d28',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQm.html#af24036d28',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQj.html#af24036d28',
'https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWY.html#2d0084b7ea',
'https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWy.html#2d0084b7ea',
'https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152',
'https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152']}
df = pd.DataFrame(raw_data, columns = ["link"])
#duplicate check #1
a = print(df.iloc[12][0])
b = print(df.iloc[13][0])
if a == b:
    print("equal")
#duplicate check #2
df.drop_duplicates(['link'], keep='first')
Output:
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152
equal
link
0 https://www.otodom.pl/oferta/mieszkanie-w-spok...
1 https://www.otodom.pl/oferta/mieszkanie-w-spok...
2 https://www.otodom.pl/oferta/mieszkanie-w-spok...
3 https://www.otodom.pl/oferta/mieszkanie-w-spok...
4 https://www.otodom.pl/oferta/zielony-widok-mie...
5 https://www.otodom.pl/oferta/zielony-widok-mie...
6 https://www.otodom.pl/oferta/nowoczesne-osiedl...
7 https://www.otodom.pl/oferta/nowoczesne-osiedl...
8 https://www.otodom.pl/oferta/nowoczesne-osiedl...
9 https://www.otodom.pl/oferta/nowoczesne-osiedl...
10 https://www.otodom.pl/oferta/mieszkanie-56-m-w...
11 https://www.otodom.pl/oferta/mieszkanie-56-m-w...
12 https://www.otodom.pl/oferta/idealny-2pok-apar...
13 https://www.otodom.pl/oferta/idealny-2pok-apar...
Help would be appreciated with reasoning on why the duplicates do not drop, thanks!
You have to reassign the output of drop_duplicates either to df or to a new variable. It does not happen in-place.
df2 = df.drop_duplicates(['link'], keep='first')
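Alternatively, drop_duplicates also accepts inplace=True if you prefer to modify df directly rather than reassign:
df.drop_duplicates(['link'], keep='first', inplace=True)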
The links provided are not the same:
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152
In one link it is X and in the other it is x.
Also, variables a and b are always None, because print() returns None rather than the printed string, so it prints equal.
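A minimal sketch illustrating both points (the case-normalized dedup line is an assumption about intent, not from the original post):
# print() returns None, so a and b are both None and compare equal:
a = print("x")  # prints x; a is None
b = print("y")  # prints y; b is None
print(a == b)   # True

# To treat links that differ only in letter case as duplicates, normalize first:
df2 = df[~df['link'].str.lower().duplicated(keep='first')]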
