Pandas - Rename only first dictionary match instead of last match - python

I am trying to use pandas to rename a column in CSV files. I want to use a dictionary since sometimes columns with the same information can be named differently (e.g. mobile_phone and telephone instead of phone).
I want to rename the first instance of phone. Here is an example to hopefully explain more.
Here is the original in this example:
0 name mobile_phone telephone
1 Bob 12364234234 12364234234
2 Joe 23534235435 43564564563
3 Jill 34573474563 78098080807
Here is what I want it to do:
0 name phone telephone
1 Bob 12364234234 12364234234
2 Joe 23534235435 43564564563
3 Jill 34573474563 78098080807
This is the code I tried:
phone_dict = {
    'phone_number': 'phone',
    'mobile_phone': 'phone',
    'telephone': 'phone',
    'phones': 'phone',
}
if 'phone' not in df.columns:
    df.rename(columns=phone_dict, inplace=True)
if 'phone' not in df.columns:
    raise ValueError("What are these people's numbers!? (Need 'phone' column)")
I made a dictionary mapping some possible column names to the name I want, 'phone'. However, when I run this code it renames the second matching column instead of the first one that matches a key in the dictionary. I want it to stop after the first column it matches in the CSV.
This is what is happening:
0 name mobile_phone phone
1 Bob 12364234234 12364234234
2 Joe 23534235435 43564564563
3 Jill 34573474563 78098080807
If there is, for example, a third column that matches the dictionary, it also turns into 'phone', which again is not what I want. I am trying to get it to change only the first column it matches.
Here is an example of what happens when I add a third column.
It goes from:
0 name mobile_phone telephone phone_1
1 Bob 12364234234 12364234234 36346346311
2 Joe 23534235435 43564564563 34634634623
3 Jill 34573474563 78098080807 34634654622
To this:
0 name phone phone phone
1 Bob 12364234234 12364234234 36346346311
2 Joe 23534235435 43564564563 34634634623
3 Jill 34573474563 78098080807 34634654622
But I want it to be this:
0 name phone telephone phone_1
1 Bob 12364234234 12364234234 36346346311
2 Joe 23534235435 43564564563 34634634623
3 Jill 34573474563 78098080807 34634654622
Any advice or tips on how to make it rename only the first dictionary match, instead of the second one or all of them?
Before I had a bunch of elif statements but I thought a dictionary would be cleaner and easier to read.

You shouldn't expect pd.DataFrame.rename to apply any particular sequential ordering with a dict input. Even if the logic worked, it would be an implementation detail as the docs don't describe the actual process.
Instead, you can use pd.DataFrame.filter to find the first valid column label:
df = df.rename(columns={df.filter(like='phone').columns[0]: 'phone'})
print(df)
0 name phone telephone
0 1 Bob 12364234234 12364234234
1 2 Joe 23534235435 43564564563
2 3 Jill 34573474563 78098080807
If it's possible a valid column may not exist, you can catch IndexError:
try:
    df = df.rename(columns={df.filter(like='phones').columns[0]: 'phone'})
except IndexError:
    print('No columns including "phones" exist.')
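If you'd rather keep the question's phone_dict, a minimal sketch (assuming the dictionary defined above) walks the columns left to right and renames only the first key match:
# Rename only the first column whose name is a key in phone_dict,
# then stop so later matches keep their original names.
for col in df.columns:
    if col in phone_dict:
        df = df.rename(columns={col: phone_dict[col]})
        break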

Here's one solution:
df:
Columns: [name, mobile_phone, telephone]
Index: []
Finding the first instance of phone (left to right) in the column index:
a = [('phone' in df.columns[i]) and (i == 0 or 'phone' not in df.columns[i - 1])
     for i in range(len(df.columns))]
Getting the column that needs to be renamed phone:
phonecol = df.columns[a][0]
Renaming the column:
df.rename(columns={phonecol: 'phone'})
Output:
Columns: [name, phone, telephone]
Index: []
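A more compact variant of the same first-match idea uses next with a default (a sketch; it assumes any column name containing 'phone' counts as a match):
# First column whose name contains 'phone', or None if no column matches.
phonecol = next((c for c in df.columns if 'phone' in c), None)
if phonecol is not None:
    df = df.rename(columns={phonecol: 'phone'})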

Related

Keep values assigned to one column in a new dataframe

I have a dataset with three columns:
Name Customer Value
Johnny Mike 1
Christopher Luke 0
Christopher Mike 0
Carl Marilyn 1
Carl Stephen 1
I need to create a new dataset with two columns: one holding the unique values from the Name and Customer columns, and the Value column. Values in the Value column were assigned to Name (multiple rows with the same Name have the same value: Carl has value 1, Christopher has value 0, and Johnny has value 1), so Customer entries should have empty values in the Value column in the new dataset.
My expected output is
All Value
Johnny 1
Christopher 0
Carl 1
Mike
Luke
Marilyn
Stephen
For the unique values in the All column I take unique().tolist() from both Name and Customer:
name = file['Name'].unique().tolist()
customer = file['Customer'].unique().tolist()
all_with_dupl = name + customer
customers = list(dict.fromkeys(all_with_dupl))
df = pd.DataFrame(columns=['All', 'Value'])
df['All'] = customers
I do not know how to assign the values in the new dataset after creating the list with all names and customers with no duplicates.
Any help would be great.
Split the columns, use .drop_duplicates on each data frame to remove duplicates, and then append them back together:
(df.drop(columns='Customer')
   .drop_duplicates()
   .rename(columns={'Name': 'All'})
   .append(
       df[['Customer']].rename(columns={'Customer': 'All'})
         .drop_duplicates(),
       ignore_index=True
   ))
All Value
0 Johnny 1.0
1 Christopher 0.0
2 Carl 1.0
3 Mike NaN
4 Luke NaN
5 Marilyn NaN
6 Stephen NaN
Or to split the steps up:
names = df.drop(columns='Customer').drop_duplicates().rename(columns={'Name': 'All'})
customers = df[['Customer']].drop_duplicates().rename(columns={'Customer': 'All'})
names.append(customers, ignore_index=True)
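Note that DataFrame.append was removed in pandas 2.0; a sketch of the same split-and-recombine approach using pd.concat (assuming the three columns Name, Customer, Value from the question):
import pandas as pd

# Same result as the append-based version above.
names = df[['Name', 'Value']].drop_duplicates().rename(columns={'Name': 'All'})
customers = df[['Customer']].drop_duplicates().rename(columns={'Customer': 'All'})
out = pd.concat([names, customers], ignore_index=True)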
Another way:
d = dict(zip(df['Name Customer'].str.split(r'\s').str[0], df['Value']))  # Create dict
df['Name Customer'] = df['Name Customer'].str.split(r'\s')
df = df.explode('Name Customer').drop_duplicates(keep='first').assign(Value='')  # Explode dataframe and drop duplicates
df['Value'] = df['Name Customer'].map(d).fillna('')  # Map values back

Concatenate multiple column strings into one column

I have the following dataframe with firstname and surname. I want to create a column fullname.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'firstname': ['jack', 'john', 'donald'],
                    'lastname': [np.nan, 'obrien', 'trump']})
print(df1)
firstname lastname
0 jack NaN
1 john obrien
2 donald trump
This works if there are no NaN values:
df1['fullname'] = df1['firstname']+df1['lastname']
However, since there are NaNs in my dataframe, I decided to cast to string first. But that causes a problem in the fullname column:
df1['fullname'] = str(df1['firstname'])+str(df1['lastname'])
firstname lastname fullname
0 jack NaN 0 jack\n1 john\n2 donald\nName: f...
1 john obrien 0 jack\n1 john\n2 donald\nName: f...
2 donald trump 0 jack\n1 john\n2 donald\nName: f...
I can write some function that checks for nans and inserts the data into the new frame, but before I do that - is there another fast method to combine these strings into one column?
You need to treat NaNs using .fillna(). Here, you can fill them with '':
df1['fullname'] = df1['firstname'] + ' ' + df1['lastname'].fillna('')
Output:
firstname lastname fullname
0 jack NaN jack
1 john obrien john obrien
2 donald trump donald trump
You may also use .add and specify a fill_value
df1.firstname.add(" ").add(df1.lastname, fill_value="")
PS: Chaining too many adds or + is not recommended for strings, but for one or two columns you should be fine
df1['fullname'] = df1['firstname'] + df1['lastname'].fillna('')
There is also Series.str.cat which can handle NaN and includes the separator.
df1["fullname"] = df1["firstname"].str.cat(df1["lastname"], sep=" ", na_rep="")
firstname lastname fullname
0 jack NaN jack
1 john obrien john obrien
2 donald trump donald trump
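One caveat with na_rep='': rows with a missing lastname keep the trailing separator ('jack ' ends with a space). If that matters, a .str.strip() at the end cleans it up:
# Strip the separator left behind when lastname is missing.
df1["fullname"] = df1["firstname"].str.cat(df1["lastname"], sep=" ", na_rep="").str.strip()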
What I would do (for the case where more than two columns need to be joined; note that stack drops NaN by default, so missing names are skipped):
df1.stack().groupby(level=0).agg(' '.join)
Out[57]:
0 jack
1 john obrien
2 donald trump
dtype: object

Modify series from other series objects

So I have data like this:
Id Title Fname lname email
1 meeting with Jay, Aj Jay kay jk#something.com
1 meeting with Jay, Aj Aj xyz aj#something.com
2 call with Steve Steve Jack st#something.com
2 call with Steve Harvey Ray h#something.com
3 lunch Mike Mil Mike m#something.com
I want to remove the first name & last name of each unique Id from Title.
I tried grouping by Id, which gives Series objects for Title, Fname, lname, etc.:
df.groupby('Id')
I've concatenated Fname with .agg(lambda x: x.sum() if x.dtype == 'float64' else ','.join(x))
and kept it in a concated dataframe.
Likewise, all other columns get aggregated. The question is how to replace values in Title based on this aggregated series.
concated['newTitle'] = [concated.Title.str.replace(e[0]).replace(e[1]).replace(e[1])
                        for e in
                        zip(concated.FName.str.split(','), concated.LName.str.split(','))]
I want something like this, or some other way, by which for each Id I can get newTitle with the names replaced.
output be like:
Id Title
1 Meeting with ,
2 call with
3 lunch
Create a mapper series by joining Fname and lname, then replace:
s = df.groupby('Id')[['Fname', 'lname']].apply(lambda x: '|'.join(x.stack()))
df.set_index('Id')['Title'].replace(s, '', regex=True).drop_duplicates()
Id
1 meeting with ,
2 call with
3 lunch
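For reference, printing the intermediate mapper s built from the sample rows shows one alternation pattern per Id, which replace then applies as regexes:
print(s)
Id
1           Jay|kay|Aj|xyz
2    Steve|Jack|Harvey|Ray
3                 Mike|Mil
dtype: object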

Dropping selected rows in Pandas with duplicated columns

Suppose I have a dataframe like this:
fname lname email
Joe Aaron
Joe Aaron some#some.com
Bill Smith
Bill Smith
Bill Smith some2#some.com
Is there a terse and convenient way to drop rows where {fname, lname} is duplicated and email is blank?
You should first check whether your "empty" data is NaN or empty strings. If they are a mixture, you may need to modify the below logic.
If empty rows are NaN
Using pd.DataFrame.sort_values and pd.DataFrame.drop_duplicates (NaN sorts last by default, and drop_duplicates keeps the first occurrence, so the row with an email survives):
df = df.sort_values('email')\
       .drop_duplicates(['fname', 'lname'])
If empty rows are strings
If your empty rows are strings, you need to specify ascending=False when sorting, since an empty string sorts before any non-empty string:
df = df.sort_values('email', ascending=False)\
       .drop_duplicates(['fname', 'lname'])
Result
print(df)
fname lname email
4 Bill Smith some2#some.com
1 Joe Aaron some#some.com
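If the blanks are a mixture of NaN and empty strings, a sketch that normalizes empty strings to NaN first, after which the NaN branch above applies:
import numpy as np

# Turn empty-string emails into NaN so they sort last.
df['email'] = df['email'].replace('', np.nan)
df = df.sort_values('email').drop_duplicates(['fname', 'lname'])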
You can use first with groupby (notice the replace of empty strings with np.nan, since first returns the first non-null value for each column):
import numpy as np

df.replace('', np.nan).groupby(['fname', 'lname']).first().reset_index()
Out[20]:
fname lname email
0 Bill Smith some2#some.com
1 Joe Aaron some#some.com

Group by pandas data frame unique first values - numpy array returned

Given a pandas data frame with two string columns:
d = {'SCHOOL': ['Yale', 'Yale', 'LBS', 'Harvard', 'UCLA', 'Harvard', 'HEC'],
     'NAME': ['John', 'Marc', 'Alex', 'Will', 'Will', 'Miller', 'Tom']}
df = pd.DataFrame(d)
Notice that the relationship between NAME and SCHOOL is n to 1.
I want to get the last school in case one person has gone to two different schools (see "Will" case).
So far I got:
df = df.groupby('NAME')['SCHOOL'].unique().reset_index()
Return:
NAME SCHOOL
0 Alex [LBS]
1 John [Yale]
2 Marc [Yale]
3 Miller [Harvard]
4 Tom [HEC]
5 Will [Harvard, UCLA]
PROBLEMS:
unique() returns both schools, not only the last school.
This line returns the SCHOOL column as a np.array instead of a string, which makes the df very difficult to work with further.
Both problems were solved based on @IanS's comments.
Using last() instead of unique():
df = df.groupby('NAME')['SCHOOL'].last().reset_index()
Return:
NAME SCHOOL
0 Alex LBS
1 John Yale
2 Marc Yale
3 Miller Harvard
4 Tom HEC
5 Will UCLA
Use drop_duplicates with keep='last', specifying the column to check for duplicates:
df = df.drop_duplicates('NAME', keep='last')
print (df)
NAME SCHOOL
0 John Yale
1 Marc Yale
2 Alex LBS
4 Will UCLA
5 Miller Harvard
6 Tom HEC
Also, if you need sorting, add sort_values:
df = df.drop_duplicates('NAME', keep='last').sort_values('NAME')
print (df)
NAME SCHOOL
2 Alex LBS
0 John Yale
1 Marc Yale
5 Miller Harvard
6 Tom HEC
4 Will UCLA
