Dropping selected rows in Pandas with duplicated columns - python

Suppose I have a dataframe like this:
fname lname email
Joe Aaron
Joe Aaron some#some.com
Bill Smith
Bill Smith
Bill Smith some2#some.com
Is there a terse and convenient way to drop rows where {fname, lname} is duplicated and email is blank?

You should first check whether your "empty" data is NaN or empty strings. If they are a mixture, you may need to adapt the logic below.
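A quick way to tell which you have (a minimal check using the standard isna/eq methods):
print(df['email'].isna().any())  # True if the blanks are real NaN values
print(df['email'].eq('').any())  # True if the blanks are empty strings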
If empty rows are NaN
Using pd.DataFrame.sort_values and pd.DataFrame.drop_duplicates:
df = df.sort_values('email')\
       .drop_duplicates(['fname', 'lname'])
If empty rows are strings
If your empty rows are strings, you need to specify ascending=False when sorting, so the rows with real addresses sort before the empty strings:
df = df.sort_values('email', ascending=False)\
       .drop_duplicates(['fname', 'lname'])
Result
print(df)
fname lname email
4 Bill Smith some2#some.com
1 Joe Aaron some#some.com
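For reference, a self-contained sketch of the NaN case; the frame below mirrors the question's data, with NaN standing in for the blanks:
import numpy as np
import pandas as pd

df = pd.DataFrame({'fname': ['Joe', 'Joe', 'Bill', 'Bill', 'Bill'],
                   'lname': ['Aaron', 'Aaron', 'Smith', 'Smith', 'Smith'],
                   'email': [np.nan, 'some#some.com', np.nan, np.nan, 'some2#some.com']})

# NaN sorts last by default, so the rows with real emails come first
# and survive drop_duplicates (which keeps the first occurrence)
df = df.sort_values('email').drop_duplicates(['fname', 'lname'])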

You can use first with groupby (note the replace of empty strings with np.nan, since first returns the first non-null value for each column):
import numpy as np

df.replace('', np.nan).groupby(['fname', 'lname']).first().reset_index()
Out[20]:
fname lname email
0 Bill Smith some2#some.com
1 Joe Aaron some#some.com

Related

Strip column values if they start with a specific string - pandas

I have a pandas dataframe (sample below).
id name
1 Mr-Mrs-Jon Snow
2 Mr-Mrs-Jane Smith
3 Mr-Mrs-Darth Vader
I'm looking to strip the "Mr-Mrs-" from the dataframe, i.e. the output should be:
id name
1 Jon Snow
2 Jane Smith
3 Darth Vader
I tried using
df['name'] = df['name'].str.lstrip("Mr-Mrs-")
But while doing so, some of the letters of the names in some rows also get stripped out.
I don't want to run a loop and do .loc for every row; is there a better/optimized way to achieve this?
Don't strip; str.lstrip treats its argument as a set of characters to remove, not a literal prefix. Replace using a start-of-string anchor (^) instead:
df['name'] = df['name'].str.replace(r"^Mr-Mrs-", "", regex=True)
Or removeprefix (available since pandas 1.4):
df['name'] = df['name'].str.removeprefix("Mr-Mrs-")
Output:
id name
1 Jon Snow
2 Jane Smith
3 Darth Vader
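To see why lstrip clipped some names: it removes any leading characters drawn from the set {'M', 'r', 's', '-'}, so a name that happens to start with one of those characters loses letters too. A quick illustration, using a made-up name starting with 'M':
print("Mr-Mrs-Mike Ross".lstrip("Mr-Mrs-"))  # -> 'ike Ross'
print("Mr-Mrs-Jon Snow".lstrip("Mr-Mrs-"))   # -> 'Jon Snow'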

Keep values assigned to one column in a new dataframe

I have a dataset with three columns:
Name Customer Value
Johnny Mike 1
Christopher Luke 0
Christopher Mike 0
Carl Marilyn 1
Carl Stephen 1
I need to create a new dataset where I have two columns: one with unique values from Name and Customer columns, and the Value column. Values in the Value column were assigned to Name (this means that multiple rows with same Name have the same value: Carl has value 1, Christopher has value 0, and Johnny has value 1), so Customer elements should have empty values in Value column in the new dataset.
My expected output is
All Value
Johnny 1
Christopher 0
Carl 1
Mike
Luke
Marilyn
Stephen
For unique values in All column I consider unique().to_list() from both Name and Customer:
name = file['Name'].unique().tolist()
customer = file['Customer'].unique().tolist()
all_with_dupl = name + customer
customers=list(dict.fromkeys(all_with_dupl))
df= pd.DataFrame(columns=['All','Value'])
df['All']= customers
I do not know how to assign the values in the new dataset after creating the list with all names and customers with no duplicates.
Any help would be great.
Split the columns, call .drop_duplicates on each piece to remove duplicates, and then append them back together:
(df.drop('Customer', 1)
   .drop_duplicates()
   .rename(columns={'Name': 'All'})
   .append(df[['Customer']].rename(columns={'Customer': 'All'})
             .drop_duplicates(),
           ignore_index=True))
All Value
0 Johnny 1.0
1 Christopher 0.0
2 Carl 1.0
3 Mike NaN
4 Luke NaN
5 Marilyn NaN
6 Stephen NaN
Or to split the steps up:
names = df.drop('Customer', 1).drop_duplicates().rename(columns={'Name': 'All'})
customers = df[['Customer']].drop_duplicates().rename(columns={'Customer': 'All'})
names.append(customers, ignore_index=True)
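Note that DataFrame.append and the positional axis argument to drop were removed in pandas 2.0. On newer versions, the same steps can be written with pd.concat (a sketch, same logic):
names = df.drop(columns='Customer').drop_duplicates().rename(columns={'Name': 'All'})
customers = df[['Customer']].drop_duplicates().rename(columns={'Customer': 'All'})
pd.concat([names, customers], ignore_index=True)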
Another way, assuming the names were read into a single space-separated 'Name Customer' column:
d = dict(zip(df['Name Customer'].str.split(r'\s').str[0], df['Value']))  # create dict mapping Name -> Value
df['Name Customer'] = df['Name Customer'].str.split(r'\s')
df = df.explode('Name Customer').drop_duplicates('Name Customer').assign(Value='')  # explode dataframe and drop duplicates
df['Value'] = df['Name Customer'].map(d).fillna('')  # map values back

Filter rows based on multiple criteria

I have the following dataframe:
name date_one date_two
-----------------------------------------
sue
sue
john
john 13-06-2019
sally 23-04-2019
sally 23-04-2019 25-04-2019
bob 18-05-2019 14-06-2019
bob 18-05-2019 17-06-2019
The data contains duplicate name rows. I need to filter the data based on the following (in this order of priority):
1. For each name, keep the row with the newest date_two. If the name doesn't have any rows with a value for date_two, go to step 2.
2. For each name, keep the row with the newest date_one. If the name doesn't have any rows with a value for date_one, go to step 3.
3. These names have no rows with a date_one or date_two, so just keep the first row for that name.
The above dataframe would be filtered to:
name date_one date_two
-----------------------------------------
sue
john 13-06-2019
sally 23-04-2019 25-04-2019
bob 18-05-2019 17-06-2019
This doesn't need to be done in the most performant way. The dataframe is only a few thousand rows and only needs to be done once. If it needs to be done in multiple (slow) steps that's fine.
Use DataFrameGroupBy.idxmax per group to get the rows with the maximal dates, then filter out already-matched names with Series.isin, and finally join everything together with concat:
df['date_one'] = pd.to_datetime(df['date_one'], dayfirst=True)
df['date_two'] = pd.to_datetime(df['date_two'], dayfirst=True)
#rule1
df1 = df.loc[df.groupby('name')['date_two'].idxmax().dropna()]
#rule2
df2 = df.loc[df.groupby('name')['date_one'].idxmax().dropna()]
df2 = df2[~df2['name'].isin(df1['name'])]
#rule3
df3 = df[~df['name'].isin(df1['name'].append(df2['name']))].drop_duplicates('name')
df = pd.concat([df1, df2, df3]).sort_index()
print (df)
name date_one date_two
0 sue NaT NaT
3 john 2019-06-13 NaT
5 sally 2019-04-23 2019-04-25
7 bob 2019-05-18 2019-06-17
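Two version caveats: Series.append was removed in pandas 2.0, so on newer versions rule 3 needs pd.concat instead, and depending on your pandas version idxmax over all-NaT groups may warn or raise, so verify that step on your data. A concat rewrite of rule 3:
#rule3, pandas 2.0+
seen = pd.concat([df1['name'], df2['name']])
df3 = df[~df['name'].isin(seen)].drop_duplicates('name')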

Concatenate multiple column strings into one column

I have the following dataframe with firstname and surname. I want to create a column fullname.
import numpy as np

df1 = pd.DataFrame({'firstname': ['jack', 'john', 'donald'],
                    'lastname': [np.nan, 'obrien', 'trump']})
print(df1)
firstname lastname
0 jack NaN
1 john obrien
2 donald trump
This works if there are no NaN values:
df1['fullname'] = df1['firstname']+df1['lastname']
However since there are NaNs in my dataframe, I decided to cast to string first. But it causes a problem in the fullname column:
df1['fullname'] = str(df1['firstname'])+str(df1['lastname'])
firstname lastname fullname
0 jack NaN 0 jack\n1 john\n2 donald\nName: f...
1 john obrien 0 jack\n1 john\n2 donald\nName: f...
2 donald trump 0 jack\n1 john\n2 donald\nName: f...
I can write some function that checks for nans and inserts the data into the new frame, but before I do that - is there another fast method to combine these strings into one column?
You need to treat the NaNs using .fillna(). Here, you can fill them with ''.
df1['fullname'] = df1['firstname'] + ' ' +df1['lastname'].fillna('')
Output:
firstname lastname fullname
0 jack NaN jack
1 john obrien john obrien
2 donald trump donald trump
You may also use .add and specify a fill_value
df1.firstname.add(" ").add(df1.lastname, fill_value="")
PS: Chaining too many adds or + is not recommended for strings, but for one or two columns you should be fine
Or, without the space separator:
df1['fullname'] = df1['firstname'] + df1['lastname'].fillna('')
There is also Series.str.cat which can handle NaN and includes the separator.
df1["fullname"] = df1["firstname"].str.cat(df1["lastname"], sep=" ", na_rep="")
firstname lastname fullname
0 jack NaN jack
1 john obrien john obrien
2 donald trump donald trump
What I would do (for the case where more than two columns need to be joined):
df1.stack().groupby(level=0).agg(' '.join)
Out[57]:
0 jack
1 john obrien
2 donald trump
dtype: object
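The same stack trick scales past two columns, since stacking drops the NaNs before the join (this relies on the classic stack behavior; newer pandas versions are changing stack's NaN handling, so check on your version). A sketch with a made-up middlename column:
df2 = pd.DataFrame({'firstname': ['jack', 'john'],
                    'middlename': [np.nan, 'paul'],
                    'lastname': ['reacher', 'obrien']})
df2.stack().groupby(level=0).agg(' '.join)
0        jack reacher
1    john paul obrien
dtype: object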

Pandas - Rename only first dictionary match instead of last match

I am trying to use pandas to rename a column in CSV files. I want to use a dictionary since sometimes columns with the same information can be named differently (e.g. mobile_phone and telephone instead of phone).
I want to rename the first instance of phone. Here is an example to hopefully explain more.
Here is the original in this example:
0 name mobile_phone telephone
1 Bob 12364234234 12364234234
2 Joe 23534235435 43564564563
3 Jill 34573474563 78098080807
Here is what I want it to do:
0 name phone telephone
1 Bob 12364234234 12364234234
2 Joe 23534235435 43564564563
3 Jill 34573474563 78098080807
This is the code I tried:
phone_dict = {
'phone_number': 'phone',
'mobile_phone': 'phone',
'telephone': 'phone',
'phones': 'phone',
}
if 'phone' not in df.columns:
df.rename(columns=dict(phone_dict), inplace=True)
if 'phone' not in df.columns:
raise ValueError("What are these peoples numbers!? (Need 'phone' column)")
I made a dictionary with some possible column names that I want renamed to 'phone'. However, when I run this code it changes the second column instead of the first one that matches a key in the dictionary. I want it to stop after the first matching column it comes across in the CSV.
This is what is happening:
0 name mobile_phone phone
1 Bob 12364234234 12364234234
2 Joe 23534235435 43564564563
3 Jill 34573474563 78098080807
If there is, for example, a third column that matches the dictionary, it also turns into 'phone', which again is not what I want. I am trying to get it to change just the first column it matches.
Here is an example of what happens when I add a third column.
It goes from:
0 name mobile_phone telephone phone_1
1 Bob 12364234234 12364234234 36346346311
2 Joe 23534235435 43564564563 34634634623
3 Jill 34573474563 78098080807 34634654622
To this:
0 name phone phone phone
1 Bob 12364234234 12364234234 36346346311
2 Joe 23534235435 43564564563 34634634623
3 Jill 34573474563 78098080807 34634654622
But I want it to be this:
0 name phone telephone phone_1
1 Bob 12364234234 12364234234 36346346311
2 Joe 23534235435 43564564563 34634634623
3 Jill 34573474563 78098080807 34634654622
Any advice or tips to stop it changing the second dictionary match (or all of them) instead of the first one?
Before I had a bunch of elif statements but I thought a dictionary would be cleaner and easier to read.
You shouldn't expect pd.DataFrame.rename to apply any particular sequential ordering with a dict input. Even if the logic worked, it would be an implementation detail as the docs don't describe the actual process.
Instead, you can use pd.DataFrame.filter to find the first valid column label:
df = df.rename(columns={df.filter(like='phone').columns[0]: 'phone'})
print(df)
0 name phone telephone
0 1 Bob 12364234234 12364234234
1 2 Joe 23534235435 43564564563
2 3 Jill 34573474563 78098080807
If it's possible a valid column may not exist, you can catch IndexError:
try:
df = df.rename(columns={df.filter(like='phones').columns[0]: 'phone'})
except IndexError:
print('No columns including "phones" exist.')
Here's one solution:
df:
Columns: [name, mobile_phone, telephone]
Index: []
Finding the first instance of phone (left to right) in the column index:
a = [('phone' in c) and (i == 0 or 'phone' not in df.columns[i - 1]) for i, c in enumerate(df.columns)]  # True at the start of each run of phone columns; the i == 0 guard avoids wrapping to the last column
Getting the column that needs to be renamed phone:
phonecol = df.columns[a][0]
Renaming the column:
df.rename(columns = {phonecol : 'phone'})
Output:
Columns: [name, phone, telephone]
Index: []
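If you want to keep the dictionary from the question and literally stop at the first match, a plain loop over the columns also works (a minimal sketch reusing the question's phone_dict):
for col in df.columns:
    if col in phone_dict:
        df = df.rename(columns={col: phone_dict[col]})
        break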
