I read a CSV file using pandas. Then I check whether there are any duplicate rows in the data using the code below:
import pandas as pd
df = pd.read_csv("data.csv", na_values=["", " ", "-"])
print(df.shape)
>> (71644, 15)
print(df.drop_duplicates().shape)
>> (31171, 15)
I find that there are some duplicate rows, so I want to see which rows appear more than once:
data_groups = df.groupby(df.columns.tolist())
size = data_groups.size()
size[size > 1]
Doing that I get Series([], dtype: int64).
Furthermore, I can find the duplicate rows by doing the following:
duplicates = df[(df.duplicated() == True)]
print(duplicates.shape)
>> (40473, 15)
So df.drop_duplicates() and df[(df.duplicated() == True)] show that there are duplicate rows but groupby doesn't.
My data consist of strings, integers, floats and nan.
Have I misunderstood something about the functions I mention above, or is something else happening?
Simply add reset_index() to realign the aggregates into a new DataFrame.
Additionally, size() creates an unnamed 0 column which you can use to filter for duplicate rows. Then just take the length of the resulting DataFrame to get a duplicate count that matches the other approaches: drop_duplicates() and duplicated() == True.
data_groups = df.groupby(df.columns.tolist())
size = data_groups.size().reset_index()
size[size[0] > 1] # DATAFRAME OF DUPLICATES
len(size[size[0] > 1]) # NUMBER OF DUPLICATES
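As a side note, DataFrame.groupby drops rows whose grouping keys contain NaN by default, which may well be why the size-based check came back empty on data containing NaN; recent pandas versions accept dropna=False to keep those groups. A minimal sketch on a toy DataFrame (column names assumed for illustration only):
import numpy as np
import pandas as pd
toy = pd.DataFrame({"a": [1, 1, 2, np.nan, np.nan],
                    "b": ["x", "x", "y", "z", "z"]})
# rows whose keys contain NaN are silently dropped unless dropna=False (pandas >= 1.1)
size = toy.groupby(toy.columns.tolist(), dropna=False).size().reset_index(name="n")
print(size[size["n"] > 1])       # the duplicated rows and their counts
print(len(size[size["n"] > 1]))  # number of distinct duplicated rows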
I am pretty new to Python and pandas, and I want to sort through two existing dataframes by certain columns and create a third dataframe that contains only the value matches within a tolerance. In other words, I have df1 and df2, and I want df3 to contain the rows and columns of df2 that are within the tolerance of the values in df1:
Two dataframes:
df1=pd.DataFrame([[0.221,2.233,7.84554,10.222],[0.222,2.000,7.8666,10.000],
[0.220,2.230,7.8500,10.005]],columns=('rt','mz','mz2','abundance'))
[Dataframe 1]
df2=pd.DataFrame([[0.219,2.233,7.84500,10.221],[0.220,2.002,7.8669,10.003],[0.229,2.238,7.8508,10.009]],columns=('rt','mz','mz2','abundance'))
[Dataframe 2]
Expected Output:
df3=pd.DataFrame([[0.219,2.233,7.84500,10.221],[0.220,2.002,7.8669,10.003]],columns=('rt','mz','mz2','abundance'))
[Dataframe 3]
I have tried for loops and filters, but as I am a newbie nothing is really working for me. Here is what I'm trying now:
import pandas as pd
import numpy as np
p=[]
d=np.array(p)
#print(d.dtype)
def count(df2, l, r):
    l = [(df1['rt'] - 0.001)]
    r = [(df1['rt'] + 0.001)]
    for x in df2['rt']:
        # condition check
        if x >= l and x <= r:
            print(x)
            d.append(x)
where p and d are the corresponding dataframe and the array (if it is even necessary to make an array?) that will be populated. I bet the problem lies somewhere in the fact that the function shouldn't contain the for loop.
Ideally, this could work to sort like ~13,000 rows of a dataframe using the 180 column values of another dataframe.
Thank you in advance!
Is this what you're looking for?
min = df1.rt.min()-0.001
max = df1.rt.max()+0.001
df3 = df2[(df2.rt >= min) & (df2.rt <= max)]
>>> df3
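If you need the tolerance applied row by row against every value in df1, rather than the single min/max band above, a rough sketch using numpy broadcasting could look like this (the 0.001 tolerance on 'rt' is assumed from the question; extend with more columns as needed):
import numpy as np
tol = 0.001  # assumed tolerance on 'rt'
# |df2.rt - df1.rt| for every pair of rows, shape (len(df2), len(df1))
diff = np.abs(df2['rt'].to_numpy()[:, None] - df1['rt'].to_numpy()[None, :])
# keep rows of df2 whose 'rt' is within tol of at least one 'rt' in df1
df3 = df2[(diff <= tol).any(axis=1)]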
I have a main dataframe (df) with a Date column (non-index), a column 'VXX_Full' with values, and a 'signal' column.
I want to iterate through the signals column, and whenever it is 1, I want to capture a slice (20 rows before, 40 rows after) of the 'VXX_Full' column and create a new dataframe with all the slices. I would like each column name of the new dataframe to be the row number of the original dataframe.
VXX_signal = pd.DataFrame(np.zeros((60,0)))
counter = 1
for row in df.index:
    if df.loc[row, 'signal'] == 1:
        add_row = df.loc[row - 20:row + 20, 'VXX_Full']
        VXX_signal[counter] = add_row
        counter += 1
VXX_signal
It just doesn't seem to be working. It creates a dataframe, however the values are all NaN. For the first slice, it at least appears to be getting data from the main df, but the data doesn't correspond to the correct location. The following columns (there are 30 signals, so 30 columns are created) in the new df are all NaN.
Thanks in advance!
I'm not sure about your current code - but basically all you need is a list of ranges of indexes. If your index is linear, this would be something like:
indexes = list(df[df.signal==1].index)
ranges = [(i,list(range(i-20,i+21))) for i in indexes] #create tuple (original index,range)
dfs = [df.loc[i[1]].copy().rename(
columns={'VXX_Full':i[0]}).reset_index(drop=True) for i in ranges]
#EDIT: for only the VXX_Full Column:
dfs = [df.loc[i[1]].copy()[['VXX_Full']].copy().rename(
columns={'VXX_Full':i[0]}).reset_index(drop=True) for i in ranges]
#here we take the -20:+20 slice of df, make a separate dataframe, then
#we change 'VXX_Full' to the original index value, and reset the index to give it a 0:40 index.
#The new index will be useful when putting all the columns next to each other.
So we made a list of indexes with signal == 1, turned it into a list of ranges and finally a list of dataframes with reset index.
Now we want to merge it all together:
from functools import reduce
merged_df = reduce(lambda left, right: pd.merge(
left, right, left_index=True, right_index=True), dfs)
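Since every frame in dfs now shares the same 0..40 positional index, a simpler equivalent of the reduce/merge chain, if you prefer it, should be a single concat along the columns:
merged_df = pd.concat(dfs, axis=1)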
I would build the resulting dataframe from a dictionary of lists:
resul = pd.DataFrame({i: df.loc[i-20 if i >= 20 else 0: i+40 if i <= len(df) - 40 else len(df),
                                'VXX_Full'].values for i in df.loc[df.signal == 1].index})
The trick is that .values extracts a numpy array with no associated index.
Beware: above code assumes that the index of the original dataframe is just the row number. Use reset_index first if it is different.
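A minimal illustration (toy data, names assumed) of why the .values trick matters: assigning a Series to a column aligns on index labels, while a bare array is written positionally.
import pandas as pd
s = pd.Series([1, 2, 3], index=[10, 11, 12])  # labels 10..12
out = pd.DataFrame(index=range(3))            # labels 0..2
out['aligned'] = s            # aligned on labels, no overlap -> all NaN
out['positional'] = s.values  # raw array, copied by position
print(out)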
I have a MultiIndex DataFrame and I'm trying to select data in it based on certain criteria; so far so good. The problem is that once I have selected my data using .loc and pd.IndexSlice, the resulting DataFrame, which should logically have fewer rows and fewer elements in the first level of the MultiIndex, keeps exactly the same MultiIndex, with some keys in it referring to empty dataframes.
I've tried creating a completely new DataFrame with a new index, but the structure of my data set is complicated and there is not always the same number of elements in a given level, so it is not easy to create a DataFrame with the right shape in which I can put the data.
import numpy as np
import pandas as pd

np.random.seed(3)  # so my example is reproducible
idx = pd.IndexSlice

iterables = [['A','B','C'], [0,1,2], ['some','rdm','data']]
my_index = pd.MultiIndex.from_product(iterables, names=['first','second','third'])
my_columns = ['col1','col2','col3']
df1 = pd.DataFrame(data=np.random.randint(10, size=(len(my_index), len(my_columns))),
                   index=my_index,
                   columns=my_columns)

#Ok, so let's say I want to keep only the elements in the first level of my index
#(["A","B","C"]) for which the total sum in col3 is less than 35, for some reason
boolean_mask = (df1.groupby(level="first").col3.sum() < 35).tolist()
first_level_to_keep = df1.index.levels[0][boolean_mask].tolist()

#let's select the wanted data and put it in df2
df2 = df1.loc[idx[first_level_to_keep, :, :], :]
So far, everything is as expected
The problem is when I want to access the df2 index. I expected the following:
df2.index.levels[0].tolist() == ['B','C']
to be true. But this is what gives a True statement:
df2.index.levels[0].tolist() == ['A','B','C']
So my question is the following: is there a way to select data and get in return a DataFrame with a MultiIndex reflecting what is actually in it? Because I find it weird to be able to select non-existing data in my df2:
I tried to put some images of the dataframes in question, but I couldn't because I don't have enough "reputation"... sorry about that.
Thank you for your time!
Even if you delete the rows corresponding to a particular value in an index level, that value still exists in the level. You can reset the index and then set those columns back as the index in order to generate a MultiIndex with fresh level values.
df2 = df2.reset_index().set_index(['first','second','third'])
print(df2.index.levels[0].tolist() == ['B','C'])
True
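Alternatively, newer pandas versions provide MultiIndex.remove_unused_levels(), which rebuilds the levels without the reset_index/set_index round trip; a short sketch:
# drop level values that no longer appear in df2's rows (pandas >= 0.20)
df2.index = df2.index.remove_unused_levels()
print(df2.index.levels[0].tolist() == ['B','C'])
True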
I have two CSV files,
a1.csv
city,state,link
Aguila,Arizona,https://www.glendaleaz.com/planning/documents/AppendixAZONING.pdf
AkChin,Arizona,http://www.maricopa-az.gov/zoningcode/wp-content/uploads/2014/05/Zoning-Code-Rewrite-Public-Review-Draft-3-Tracked-Edits-lowres1.pdf
Aguila,Arizona,http://www.co.apache.az.us/planning-and-zoning-division/zoning-ordinances/
a2.csv
city,state,link
Aguila,Arizona,http://www.co.apache.az.us
I want to get the difference.
Here is my attempt:
import pandas as pd
a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
mask = a.isin(b.to_dict(orient='list'))
# Reverse the mask and remove null rows.
# Upside is that index of original rows that
# are now gone are preserved (see result).
c = a[~mask].dropna()
print(c)
Expected Output:
city,state,link
Aguila,Arizona,https://www.glendaleaz.com/planning/documents/AppendixAZONING.pdf
AkChin,Arizona,http://www.maricopa-az.gov/zoningcode/wp-content/uploads/2014/05/Zoning-Code-Rewrite-Public-Review-Draft-3-Tracked-Edits-lowres1.pdf
But I am getting an empty result instead:
Empty DataFrame
Columns: [city, state, link]
Index: []
I want to check based on the first two rows; then, if they are the same, remove them.
You can use pandas to read in the two files, concatenate them, and remove all duplicate rows:
import pandas as pd
a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
ab = pd.concat([a, b], axis=0)
ab = ab.drop_duplicates(keep=False)
Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
First, concatenate the DataFrames, then drop the duplicates while still keeping the first one. Then reset the index to keep it consistent.
import pandas as pd
a = pd.read_csv('a1.csv')
b = pd.read_csv('a2.csv')
c = pd.concat([a,b], axis=0)
c.drop_duplicates(keep='first', inplace=True) # Set keep to False if you don't want any
# of the duplicates at all
c.reset_index(drop=True, inplace=True)
print(c)
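If the goal is specifically the rows of a1.csv that do not appear in a2.csv (rather than dropping duplicates from the combined frame), a hedged alternative is a left merge with the indicator flag:
merged = a.merge(b, how='left', indicator=True)  # joins on all shared columns
c = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(c)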
Introduction
We have the following dataframe which we create from a CSV file.
data = pd.read_csv(path + name, usecols = ['QTS','DSTP','RSTP','DDATE','RDATE','DTIME','RTIME','DCXR','RCXR','FARE'])
I want to delete specific rows from the dataframe. For this purpose I used a list and appended the ids of the rows we want to delete.
indices = []
for index, row in data.iterrows():
    if row['FARE'] >= 2500.00:
        indices.append(index)
From here I am lost. I don't know how to use the indices in the list to delete the rows from the dataframe.
Question
The list containing the row ids must be used in the dataframe to delete rows. Is it possible to do it?
Constraints
We can't use data.drop(index, inplace=True) because it really slows down the process
We cannot use a filter because I have some special constraints.
If you are trying to remove rows that have 'FARE' values greater than or equal to 2500, you can use a mask that keeps those values less than 2500 -
df_out = df.loc[df.FARE.values < 2500] # Or df[df.FARE.values < 2500]
For large datasets, we might want to work with underlying array data and then construct the output dataframe -
df_out = pd.DataFrame(df.values[df.FARE.values < 2500], columns=df.columns)
To use those indices generated from the loopy code in the question -
import numpy as np
df_out = df.loc[np.setdiff1d(df.index, indices)]
Or with masking again -
df_out = df.loc[~df.index.isin(indices)] # or df[~df.index.isin(indices)]
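As a side note, the iterrows() loop that builds indices can itself be replaced by a vectorized expression; a small sketch under the same column name:
indices = df.index[df['FARE'].values >= 2500.0]  # same indices list, no Python loop
df_out = df.loc[~df.index.isin(indices)]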
How about filtering the data using the DataFrame.query() method:
cols = ['QTS','DSTP','RSTP','DDATE','RDATE','DTIME','RTIME','DCXR','RCXR','FARE']
df = pd.read_csv(path + name, usecols=cols).query("FARE < 2500")