Drop duplicates for rows with interchangeable name values (Pandas, Python)


I have a DataFrame of form
person1, person2, ..., someMetric
John, Steve, ..., 20
Peter, Larry, ..., 12
Steve, John, ..., 20
Rows 0 and 2 are interchangeable duplicates, so I'd want to drop the last row. I can't figure out how to do this in Pandas.
Thanks!

Here's a NumPy based solution -
df[~(np.triu(df.person1.values[:,None] == df.person2.values)).any(0)]
Sample run -
In [123]: df
Out[123]:
person1 person2 someMetric
0 John Steve 20
1 Peter Larry 13
2 Steve John 19
3 Peter Parker 5
4 Larry Peter 7
In [124]: df[~(np.triu(df.person1.values[:,None] == df.person2.values)).any(0)]
Out[124]:
person1 person2 someMetric
0 John Steve 20
1 Peter Larry 13
3 Peter Parker 5
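Note that the one-liner above compares person1 against person2 in one direction only, so a row can be flagged whenever its person2 value appeared earlier as any person1, even if the other name differs. A stricter sketch, assuming the same sample data, requires both names to match in swapped order. The variable names swapped and is_later_swap are illustrative, and the comparison matrix needs O(n²) memory, so this is only suitable for small frames.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "person1": ["John", "Peter", "Steve", "Peter", "Larry"],
    "person2": ["Steve", "Larry", "John", "Parker", "Peter"],
    "someMetric": [20, 13, 19, 5, 7],
})

p1 = df["person1"].to_numpy()
p2 = df["person2"].to_numpy()

# swapped[i, j] is True when row j holds the names of row i in exchanged order.
swapped = (p1[:, None] == p2) & (p2[:, None] == p1)

# Look only at pairs with i < j, so the earlier row of each pair survives.
is_later_swap = np.triu(swapped, k=1).any(axis=0)

result = df[~is_later_swap]
print(result)
```

On the sample data this keeps rows 0, 1 and 3, the same rows as the run above.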

An approach in pandas:
df = pd.DataFrame(
{'person2': {0: 'Steve', 1: 'Larry', 2: 'John', 3: 'Parker', 4: 'Peter'},
'person1': {0: 'John', 1: 'Peter', 2: 'Steve', 3: 'Peter', 4: 'Larry'},
'someMetric': {0: 20, 1: 13, 2: 19, 3: 5, 4: 7}})
print(df)
person1 person2 someMetric
0 John Steve 20
1 Peter Larry 13
2 Steve John 19
3 Peter Parker 5
4 Larry Peter 7
df['ordered-name'] = df.apply(lambda x: '-'.join(sorted([x['person1'],x['person2']])),axis=1)
df = df.drop_duplicates(['ordered-name'])
df.drop(['ordered-name'], axis=1, inplace=True)
print(df)
which gives:
person1 person2 someMetric
0 John Steve 20
1 Peter Larry 13
3 Peter Parker 5
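Joining the names with '-' could in principle collide if a name itself contains a hyphen. As a hedged alternative sketch, the two name columns can be sorted within each row using np.sort and the duplicates detected on the sorted pairs, without building a helper string column. The names ordered and deduped are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "person1": ["John", "Peter", "Steve", "Peter", "Larry"],
    "person2": ["Steve", "Larry", "John", "Parker", "Peter"],
    "someMetric": [20, 13, 19, 5, 7],
})

# Sort the two names within each row, so (Steve, John) becomes (John, Steve).
ordered = pd.DataFrame(
    np.sort(df[["person1", "person2"]].to_numpy(), axis=1),
    index=df.index,
)

# duplicated() marks every repeat after the first occurrence of a sorted pair.
deduped = df[~ordered.duplicated()]
print(deduped)
```

The original person1 and person2 values are left untouched, since the sorted copy is used only to build the mask.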

Related

How to add a suffix to the first N columns in pandas?

I want to add a suffix to only the first N columns, but I can't figure out how.
This is how to add a suffix to all columns:
import pandas as pd
df = pd.DataFrame( {"name" : ["John","Alex","Kate","Martin"], "surname" : ["Smith","Morgan","King","Cole"],
"job": ["Engineer","Dentist","Coach","Teacher"],"Age":[25,20,25,30],
"Id": [1,2,3,4]})
df.add_suffix("_x")
And this is the result:
name_x surname_x job_x Age_x Id_x
0 John Smith Engineer 25 1
1 Alex Morgan Dentist 20 2
2 Kate King Coach 25 3
3 Martin Cole Teacher 30 4
But I want to add the suffix only to the first N columns, say the first 3. Desired output is:
name_x surname_x job_x Age Id
0 John Smith Engineer 25 1
1 Alex Morgan Dentist 20 2
2 Kate King Coach 25 3
3 Martin Cole Teacher 30 4

Work with the indices and take slices to modify a subset of them:
df.columns = (df.columns[:3]+'_x').union(df.columns[3:], sort=False)
print(df)
name_x surname_x job_x Age Id
0 John Smith Engineer 25 1
1 Alex Morgan Dentist 20 2
2 Kate King Coach 25 3
3 Martin Cole Teacher 30 4

This should work:
N = 3
cols = list(df.columns[:N])
new_cols = [c + '_x' for c in cols]
dict_cols = dict(zip(cols, new_cols))
# rename returns a new DataFrame, so assign the result back
df = df.rename(columns=dict_cols)

Set the column labels using a list comprehension:
n = 3
df.columns = [f'{c}_x' if i < n else c for i, c in enumerate(df.columns)]
results in
name_x surname_x job_x Age Id
0 John Smith Engineer 25 1
1 Alex Morgan Dentist 20 2
2 Kate King Coach 25 3
3 Martin Cole Teacher 30 4
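The list-comprehension idea can be wrapped in a small helper, sketched here. The function name suffix_first_n and its parameters are made up for illustration and are not part of pandas. It works on a copy so the original frame keeps its labels.

```python
import pandas as pd


def suffix_first_n(df, n, suffix="_x"):
    """Return a copy of df with suffix appended to its first n column labels."""
    out = df.copy()
    out.columns = [
        f"{col}{suffix}" if pos < n else col
        for pos, col in enumerate(df.columns)
    ]
    return out


df = pd.DataFrame({
    "name": ["John", "Alex", "Kate", "Martin"],
    "surname": ["Smith", "Morgan", "King", "Cole"],
    "job": ["Engineer", "Dentist", "Coach", "Teacher"],
    "Age": [25, 20, 25, 30],
    "Id": [1, 2, 3, 4],
})

renamed = suffix_first_n(df, 3)
print(renamed.columns.tolist())
```

Because the labels are assigned by position, this also behaves predictably when two columns share the same name, which a rename by dictionary would not distinguish.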

How to repeat each dataframe row n number of times (n is different for each row)?

I have a dataframe 'data' and a dictionary 'dict_repeat' which specifies how many times to repeat a row. For example, rows containing name 'Jack' should be repeated 3 times in 'data':
import pandas as pd
data = {'Name': ['Anna', 'Nick', 'Jack'],
'Age': [19, 21, 19]}
dict_repeat = {
'Anna': 1,
'Nick': 2,
'Jack': 3,
}
Initial data:
Name Age
0 Anna 19
1 Nick 21
2 Jack 19
Desired output:
Name Age
0 Anna 19
1 Nick 21
1 Nick 21
2 Jack 19
2 Jack 19
2 Jack 19

You can use Index.repeat:
df = pd.DataFrame(data)
df2 = df.loc[df.index.repeat(df['Name'].map(dict_repeat))]
output:
Name Age
0 Anna 19
1 Nick 21
1 Nick 21
2 Jack 19
2 Jack 19
2 Jack 19
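If a name is missing from dict_repeat, map returns NaN and the repeat call fails. A minimal sketch of one way to handle that, assuming missing names should simply be kept once: the extra row 'Zoe' is invented here purely to show the missing-key case, and reset_index(drop=True) is optional if you prefer a fresh 0..n-1 index over the repeated labels.

```python
import pandas as pd

# 'Zoe' is a made-up row that has no entry in dict_repeat.
df = pd.DataFrame({"Name": ["Anna", "Nick", "Jack", "Zoe"],
                   "Age": [19, 21, 19, 30]})
dict_repeat = {"Anna": 1, "Nick": 2, "Jack": 3}

# Names without an entry map to NaN; treat them as "keep once" (an assumption).
counts = df["Name"].map(dict_repeat).fillna(1).astype(int)

# Repeat each index label counts[i] times, then renumber the rows.
repeated = df.loc[df.index.repeat(counts)].reset_index(drop=True)
print(repeated)
```

Using fillna(0) instead would drop the unmatched rows rather than keep them.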

Write out only the best X customer names and mark the rest as "Other"

I have a problem. I only want the top 3 customers to keep their names; all other customers should be shown as Other. The result should be written to the column name2. The top 3 customers are determined simply by frequency. I already get the counts in v, but how can I say that only the customers contained in v.index[v.gt(2)] should receive their name in name2, and the rest Other?
Dataframe
customerId name
0 1 max
1 1 max
2 2 lisa
3 2 lisa
4 2 lisa
5 2 lisa
6 3 michael
7 3 michael
8 3 michael
9 4 power
10 5 wind
11 5 wind
12 5 wind
13 5 wind
14 5 wind
Code
import pandas as pd
d = {
"customerId": [1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 5],
"name": ['max', 'max', 'lisa', 'lisa', 'lisa', 'lisa', 'michael', 'michael', 'michael', 'power',
'wind', 'wind', 'wind', 'wind', 'wind',]
}
df = pd.DataFrame(data=d)
print(df)
v = df['customerId'].value_counts()
df['name2'] = 'Other'
df.loc[df['customerId']].isin(v.index[v.gt(2)]), 'name2'] = df['name']
Desired output
customerId name name2
0 1 max Other
1 1 max Other
2 2 lisa lisa
3 2 lisa lisa
4 2 lisa lisa
5 2 lisa lisa
6 3 michael michael
7 3 michael michael
8 3 michael michael
9 4 power Other
10 5 wind wind
11 5 wind wind
12 5 wind wind
13 5 wind wind
14 5 wind wind

import numpy as np
v = df['customerId'].value_counts()
df["name2"] = np.where(df['customerId'].isin(v.nlargest(3).index), df["name"], "Other")
customerId name name2
0 1 max Other
1 1 max Other
2 2 lisa lisa
3 2 lisa lisa
4 2 lisa lisa
5 2 lisa lisa
6 3 michael michael
7 3 michael michael
8 3 michael michael
9 4 power Other
10 5 wind wind
11 5 wind wind
12 5 wind wind
13 5 wind wind
14 5 wind wind

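The problem is a misplaced bracket: in your code the indexer is closed right after df['customerId'], so .isin(...) ends up outside of df.loc[...].
Take this as a correction: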
df.loc[df['customerId'].isin(v.index[v.gt(2)]), 'name2'] = df['name']
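The corrected line can be run end to end as in the sketch below, using the data from the question. The names frequent_ids and mask are illustrative. Keep in mind that v.gt(2) selects customers with more than 2 rows, which happens to coincide with the top 3 for this data but is not the same rule in general.

```python
import pandas as pd

df = pd.DataFrame({
    "customerId": [1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 5],
    "name": ["max", "max", "lisa", "lisa", "lisa", "lisa",
             "michael", "michael", "michael", "power",
             "wind", "wind", "wind", "wind", "wind"],
})

# Number of rows per customer.
v = df["customerId"].value_counts()

# Customers that occur more than twice.
frequent_ids = v.index[v.gt(2)]

# Default everyone to "Other", then copy the name for frequent customers only.
df["name2"] = "Other"
mask = df["customerId"].isin(frequent_ids)
df.loc[mask, "name2"] = df.loc[mask, "name"]
print(df)
```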

Drop rows that only relate to one value in other columns pandas

Imagine I have a DataFrame like this:
item name gender
banana tom male
banana kate female
apple kate female
kiwi jim male
apple tom male
banana kimmy female
kiwi kate female
banana tom male
Is there any way to drop the rows of people who relate to (bought) fewer than 2 items? I also don't want to drop duplicates. So the output I want looks like this:
item name gender
banana tom male
banana kate female
apple kate female
apple tom male
kiwi kate female
banana tom male

#sammywemmy's solution:
df.loc[df.groupby('name')['item'].transform('size').ge(2)]
groupby: groups rows with the same name together.
# Get Each Group
print(df.groupby('name').apply(lambda s: s.reset_index()))
index item name gender
name
jim 0 3 kiwi jim male
kate 0 1 banana kate female
1 2 apple kate female
2 6 kiwi kate female
kimmy 0 5 banana kimmy female
tom 0 0 banana tom male
1 4 apple tom male
2 7 banana tom male
transform: gets a value in every row that represents the group size (the number of rows in the group).
# Turn Each Item Into The Number of Rows in The Group
df['group_size'] = df.groupby('name')['item'].transform('size')
print(df)
item name gender group_size
0 banana tom male 3
1 banana kate female 3
2 apple kate female 3
3 kiwi jim male 1
4 apple tom male 3
5 banana kimmy female 1
6 kiwi kate female 3
7 banana tom male 3
This could have been done on any column in this case:
# Turn Each Item Into The Number of Rows in The Group
df['group_size'] = df.groupby('name')['gender'].transform('size')
print(df)
item name gender group_size
0 banana tom male 3
1 banana kate female 3
2 apple kate female 3
3 kiwi jim male 1
4 apple tom male 3
5 banana kimmy female 1
6 kiwi kate female 3
7 banana tom male 3
Notice how now each row has the corresponding group size at the end. tom has 3 instances so every name == tom row has 3 in group_size.
ge: converts the sizes to a Boolean index using the relational operator "greater than or equal".
# Add Condition To determine if the row should be kept or not
df['should_keep'] = df.groupby('name')['item'].transform('size').ge(2)
print(df)
item name gender group_size should_keep
0 banana tom male 3 True
1 banana kate female 3 True
2 apple kate female 3 True
3 kiwi jim male 1 False
4 apple tom male 3 True
5 banana kimmy female 1 False
6 kiwi kate female 3 True
7 banana tom male 3 True
loc: uses the Boolean index to select the desired rows.
print(df.groupby('name')['item'].transform('size').ge(2))
0 True
1 True
2 True
3 False
4 True
5 False
6 True
7 True
Name: item, dtype: bool
loc will include any index that is True; any index that is False will be excluded (indexes 3 and 5 are False, so they will not be included).
All together:
import pandas as pd
df = pd.DataFrame({'item': {0: 'banana', 1: 'banana', 2: 'apple',
3: 'kiwi', 4: 'apple', 5: 'banana',
6: 'kiwi', 7: 'banana'},
'name': {0: 'tom', 1: 'kate', 2: 'kate',
3: 'jim', 4: 'tom', 5: 'kimmy',
6: 'kate', 7: 'tom'},
'gender': {0: 'male', 1: 'female',
2: 'female', 3: 'male',
4: 'male', 5: 'female',
6: 'female', 7: 'male'}})
print(df.loc[df.groupby('name')['name'].transform('size').ge(2)])
item name gender
0 banana tom male
1 banana kate female
2 apple kate female
4 apple tom male
6 kiwi kate female
7 banana tom male
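Two hedged variants of the same idea, assuming the data above. GroupBy.filter expresses the "at least 2 rows" rule directly. If "fewer than 2 items" is instead meant as fewer than 2 different items, transform('nunique') counts distinct items per person. The names by_rows and by_distinct are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "item": ["banana", "banana", "apple", "kiwi",
             "apple", "banana", "kiwi", "banana"],
    "name": ["tom", "kate", "kate", "jim",
             "tom", "kimmy", "kate", "tom"],
    "gender": ["male", "female", "female", "male",
               "male", "female", "female", "male"],
})

# Variant 1: keep people who appear in at least 2 rows (same rule as above).
by_rows = df.groupby("name").filter(lambda g: len(g) >= 2)

# Variant 2: keep people who bought at least 2 *different* items.
by_distinct = df[df.groupby("name")["item"].transform("nunique").ge(2)]

print(by_rows)
print(by_distinct)
```

For this data both variants keep the same rows. They would differ for someone who bought the same item twice and nothing else.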

Pandas data frame spread function or similar?

Here's a pandas df:
df = pd.DataFrame({'First' : ['John', 'Jane', 'Mary'],
'Last' : ['Smith', 'Doe', 'Johnson'],
'Group' : ['A', 'B', 'A'],
'Measure' : [2, 11, 1]})
df
Out[38]:
First Last Group Measure
0 John Smith A 2
1 Jane Doe B 11
2 Mary Johnson A 1
I would like to "spread" the Group variable with the values in Measure.
df_desired
Out[39]:
First Last A B
0 John Smith 2 0
1 Jane Doe 0 11
2 Mary Johnson 1 0
Each level within Group variable becomes its own column populated with the values contained in column Measure. How can I achieve this?

Using pivot_table
df.pivot_table(index=['First','Last'],columns='Group',values='Measure',fill_value=0)
Out[247]:
Group A B
First Last
Jane Doe 0 11
John Smith 2 0
Mary Johnson 1 0
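To get closer to the flat layout in the question, the pivoted result can be reset to ordinary columns, as in this sketch. Two choices here are assumptions rather than part of the answer above: aggfunc="sum" is passed explicitly because pivot_table averages duplicate First/Last/Group combinations by default, and rename_axis(None, axis=1) just removes the leftover "Group" label from the column header.

```python
import pandas as pd

df = pd.DataFrame({
    "First": ["John", "Jane", "Mary"],
    "Last": ["Smith", "Doe", "Johnson"],
    "Group": ["A", "B", "A"],
    "Measure": [2, 11, 1],
})

wide = (
    df.pivot_table(index=["First", "Last"], columns="Group",
                   values="Measure", aggfunc="sum", fill_value=0)
      .reset_index()               # turn First/Last back into regular columns
      .rename_axis(None, axis=1)   # drop the "Group" name on the column axis
)
print(wide)
```

The rows come out sorted by First and Last rather than in the original order.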

If your order doesn't matter, you can do something along these lines:
df.set_index(['First','Last', 'Group']).unstack('Group').fillna(0).reset_index()
First Last Measure
Group A B
0 Jane Doe 0.0 11.0
1 John Smith 2.0 0.0
2 Mary Johnson 1.0 0.0
