I want to de-duplicate rows in pandas based on multiple criteria.
I have 3 columns: name, id, and nick_name.
The first rule is to look for duplicate ids. When ids match, only keep the rows where name and nick_name differ, as long as at least one row is kept for that id.
In other words, if name and nick_name don't match, keep that row. If name and nick_name match, get rid of that row, as long as it isn't the only row that would be left for that id.
Example data:
data = {"name": ["Sam", "Sam", "Joseph", "Joseph", "Joseph", "Philip", "Philip", "James"],
"id": [1,1,2,2,2,3,3,4],
"nick_name": ["Sammie", "Sam", "Joseph", "Joe", "Joey", "Philip", "Philip", "James"]}
df = pd.DataFrame(data)
df
Produces:
name id nick_name
0 Sam 1 Sammie
1 Sam 1 Sam
2 Joseph 2 Joseph
3 Joseph 2 Joe
4 Joseph 2 Joey
5 Philip 3 Philip
6 Philip 3 Philip
7 James 4 James
Based on my rules above, I want a resulting dataframe to produce the following:
name id nick_name
0 Sam 1 Sammie
3 Joseph 2 Joe
4 Joseph 2 Joey
5 Philip 3 Philip
7 James 4 James
We can split this into 3 boolean conditions to filter your initial dataframe by.
# where name and nick_name pairs repeat, keep one of the duplicated rows
# (this mask is True for every occurrence after the first)
con1 = df.duplicated(subset=['name', 'nick_name'], keep='first')
# where ids are duplicated and name is not equal to nick_name
con2 = df.duplicated(subset=['id'], keep=False) & df['name'].ne(df['nick_name'])
# where the id occurs only once, so no duplicate exists
con3 = df.groupby('id')['id'].transform('size').eq(1)
print(df.loc[con1 | con2 | con3])
name id nick_name
0 Sam 1 Sammie
3 Joseph 2 Joe
4 Joseph 2 Joey
6 Philip 3 Philip
7 James 4 James
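Note that this keeps index 6 rather than index 5 for Philip, because duplicated(keep='first') marks the later occurrence as the duplicate. If you want the first of the redundant rows instead, matching the desired output exactly, here is a minimal sketch of an alternative filter (same df as above):
# rows worth keeping outright: name differs from nick_name on a duplicated id
keep_diff = df['name'].ne(df['nick_name']) & df.duplicated('id', keep=False)
# fallback: for ids with no such row, keep only the first row of that id
fallback = ~df['id'].isin(df.loc[keep_diff, 'id']) & ~df.duplicated('id')
print(df[keep_diff | fallback])  # keeps rows 0, 3, 4, 5, 7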
Suppose I have a data frame:
ID person_1 person_2
ID_001 Aaron Ben
ID_003 Kate Ben
ID_001 Aaron Lou
ID_005 Lee Ben
ID_006 Aaron Cassie
ID_001 Tim Ben
ID_003 Ben Mal
For every ID in the column "ID", I want to count the number of unique names associated with that ID.
My desired output:
ID Count
ID_001 4
ID_003 3
ID_005 2
ID_006 2
Code for reproduction:
df = pd.DataFrame({
    'ID': ["ID_001", "ID_003", "ID_001", "ID_005", "ID_006", "ID_001", "ID_003"],
    'person_1': ["Aaron", "Kate", "Aaron", "Lee", "Aaron", "Tim", "Ben"],
    'person_2': ["Ben", "Ben", "Lou", "Ben", "Cassie", "Ben", "Mal"]
})
Use df.melt('ID').groupby('ID')['value'].nunique().
>>> df.melt('ID').groupby('ID')['value'].nunique()
ID
ID_001 4
ID_003 3
ID_005 2
ID_006 2
Name: value, dtype: int64
edit: df.set_index('ID').stack().groupby(level=0).nunique() works too.
Flatten your columns person_1 and person_2, then remove duplicated names, and finally count the unique values per ID:
out = (df.melt('ID')
         .drop_duplicates(['ID', 'value'])
         .value_counts('ID')
         .rename('Count')
         .reset_index())
print(out)
# Output
ID Count
0 ID_001 4
1 ID_003 3
2 ID_005 2
3 ID_006 2
Melt your dataframe together, drop duplicates, then group by ID
and aggregate over the count of the variables. Finally, rename the variable column to Count.
(df.melt(["ID"])
   .drop_duplicates(["ID", "value"])
   .groupby(["ID"])
   .agg({"variable": "count"})
   .reset_index()
   .rename(columns={"variable": "Count"}))
ID Count
ID_001 4
ID_003 3
ID_005 2
ID_006 2
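As a sketch, a more direct variant uses named aggregation, so no drop_duplicates or rename step is needed:
out = (df.melt('ID')
         .groupby('ID', as_index=False)
         .agg(Count=('value', 'nunique')))
print(out)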
I have a dataset with three columns:
Name Customer Value
Johnny Mike 1
Christopher Luke 0
Christopher Mike 0
Carl Marilyn 1
Carl Stephen 1
I need to create a new dataset with two columns: one with the unique values from the Name and Customer columns, and the Value column. Values in the Value column were assigned per Name (multiple rows with the same Name share the same value: Carl has value 1, Christopher has value 0, and Johnny has value 1), so Customer entries should have empty values in the Value column of the new dataset.
My expected output is
All Value
Johnny 1
Christopher 0
Carl 1
Mike
Luke
Marilyn
Stephen
For the unique values in the All column, I take unique().tolist() from both Name and Customer:
name = file['Name'].unique().tolist()
customer = file['Customer'].unique().tolist()
all_with_dupl = name + customer
customers = list(dict.fromkeys(all_with_dupl))
df = pd.DataFrame(columns=['All', 'Value'])
df['All'] = customers
I do not know how to assign the values in the new dataset after creating the list with all names and customers with no duplicates.
Any help would be great.
Split the columns, .drop_duplicates on each piece to remove duplicates, and then concatenate them back together:
pd.concat([
    df.drop(columns='Customer').drop_duplicates().rename(columns={'Name': 'All'}),
    df[['Customer']].rename(columns={'Customer': 'All'}).drop_duplicates()
], ignore_index=True)
All Value
0 Johnny 1.0
1 Christopher 0.0
2 Carl 1.0
3 Mike NaN
4 Luke NaN
5 Marilyn NaN
6 Stephen NaN
Or to split the steps up:
names = df.drop(columns='Customer').drop_duplicates().rename(columns={'Name': 'All'})
customers = df[['Customer']].drop_duplicates().rename(columns={'Customer': 'All'})
out = pd.concat([names, customers], ignore_index=True)
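If you want blanks instead of NaN in Value, as in the expected output, one option is to fill them afterwards (note this turns the column into object dtype):
out['Value'] = out['Value'].fillna('')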
Another way, assuming Name and Customer arrive combined in a single whitespace-separated column 'Name Customer':
d = dict(zip(df['Name Customer'].str.split(r'\s').str[0], df['Value']))  # map each Name to its Value
df['Name Customer'] = df['Name Customer'].str.split(r'\s')
df = df.explode('Name Customer').drop_duplicates('Name Customer', keep='first').assign(Value='')  # explode and drop duplicate names
df['Value'] = df['Name Customer'].map(d).fillna('')  # map the values back, blank where there is none
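For reference, a hypothetical version of the input this approach assumes, with the two name columns pre-joined into a single 'Name Customer' column:
df = pd.DataFrame({
    'Name Customer': ['Johnny Mike', 'Christopher Luke', 'Christopher Mike',
                      'Carl Marilyn', 'Carl Stephen'],
    'Value': [1, 0, 0, 1, 1]
})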
I have a sample dataframe:
name
0 Newyork
1 Los Angeles
2 Ohio
3 Washington DC
4 Kentucky
I also have a second dataframe:
name ratio
0 Newyork 1:2
1 Kentucky 3:7
2 Florida 1:5
3 SF 2:9
How can I replace the values in the name column of df2 with "Not Available" if the name is present in df1?
Desired result:
name ratio
0 Not Available 1:2
1 Not Available 3:7
2 Florida 1:5
3 SF 2:9
Use numpy.where:
import numpy as np

df2['name'] = np.where(df2['name'].isin(df1['name']), 'Not Available', df2['name'])
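A pandas-only alternative, as a sketch, uses Series.mask to overwrite values where the condition holds:
df2['name'] = df2['name'].mask(df2['name'].isin(df1['name']), 'Not Available')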
From a pandas data frame with two string columns, looking like:
d = {'SCHOOL' : ['Yale', 'Yale', 'LBS', 'Harvard','UCLA', 'Harvard', 'HEC'],
'NAME' : ['John', 'Marc', 'Alex', 'Will', 'Will','Miller', 'Tom']}
df = pd.DataFrame(d)
Notice that the relationship of NAME to SCHOOL is n-to-1.
I want to get the last school in case one person has gone to two different schools (see the "Will" case).
So far I have:
df = df.groupby('NAME')['SCHOOL'].unique().reset_index()
This returns:
NAME SCHOOL
0 Alex [LBS]
1 John [Yale]
2 Marc [Yale]
3 Miller [Harvard]
4 Tom [HEC]
5 Will [Harvard, UCLA]
PROBLEMS:
unique() returns both schools, not only the last one.
This line returns the SCHOOL column as an np.array instead of a string, which is very difficult to work with further.
Both problems were solved based on #IanS's comments.
Using last() instead of unique():
df = df.groupby('NAME')['SCHOOL'].last().reset_index()
This returns:
NAME SCHOOL
0 Alex LBS
1 John Yale
2 Marc Yale
3 Miller Harvard
4 Tom HEC
5 Will UCLA
Use drop_duplicates with keep='last', specifying the column to check for duplicates:
df = df.drop_duplicates('NAME', keep='last')
print (df)
NAME SCHOOL
0 John Yale
1 Marc Yale
2 Alex LBS
4 Will UCLA
5 Miller Harvard
6 Tom HEC
If you also need sorting, add sort_values:
df = df.drop_duplicates('NAME', keep='last').sort_values('NAME')
print (df)
NAME SCHOOL
2 Alex LBS
0 John Yale
1 Marc Yale
5 Miller Harvard
6 Tom HEC
4 Will UCLA
I have two pandas data-frames that look like this:
data_frame_1:
index un_id city
1 abc new york
2 def atlanta
3 gei toronto
4 lmn tampa
data_frame_2:
index name un_id
1 frank gei
2 john lmn
3 lisa abc
4 jessica def
I need to match names to cities via the un_id column, either in a new data-frame or an existing one. I am having trouble figuring out how to iterate through one column, grab the un_id, look it up in the other data-frame's un_id column, and then append the needed information back to the original data-frame.
Use pandas merge:
In[14]:df2.merge(df1,on='un_id')
Out[14]:
name un_id city
0 frank gei toronto
1 john lmn tampa
2 lisa abc new york
3 jessica def atlanta
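Note that merge defaults to an inner join, so any row whose un_id has no match in the other frame is dropped. If you want to keep every row of data_frame_2 regardless, pass how='left':
df2.merge(df1, on='un_id', how='left')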