I have the following dataframe with firstname and surname. I want to create a column fullname.
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'firstname': ['jack', 'john', 'donald'],
                    'lastname': [np.nan, 'obrien', 'trump']})
print(df1)
firstname lastname
0 jack NaN
1 john obrien
2 donald trump
This works if there are no NaN values:
df1['fullname'] = df1['firstname']+df1['lastname']
However since there are NaNs in my dataframe, I decided to cast to string first. But it causes a problem in the fullname column:
df1['fullname'] = str(df1['firstname'])+str(df1['lastname'])
firstname lastname fullname
0 jack NaN 0 jack\n1 john\n2 donald\nName: f...
1 john obrien 0 jack\n1 john\n2 donald\nName: f...
2 donald trump 0 jack\n1 john\n2 donald\nName: f...
I can write some function that checks for nans and inserts the data into the new frame, but before I do that - is there another fast method to combine these strings into one column?
You need to treat the NaNs using .fillna(). Here, you can fill them with ''.
df1['fullname'] = df1['firstname'] + ' ' + df1['lastname'].fillna('')
Output:
firstname lastname fullname
0 jack NaN jack
1 john obrien john obrien
2 donald trump donald trump
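Note that when lastname is NaN this leaves an invisible trailing space on the result ('jack '). A small variant of the same idea, stripping it afterwards:
# strip the trailing separator left behind when lastname is NaN
df1['fullname'] = (df1['firstname'] + ' ' + df1['lastname'].fillna('')).str.strip()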
You may also use .add and specify a fill_value:
df1.firstname.add(" ").add(df1.lastname, fill_value="")
PS: Chaining too many adds or + is not recommended for strings, but for one or two columns you should be fine:
df1['fullname'] = df1['firstname'] + ' ' + df1['lastname'].fillna('')
There is also Series.str.cat which can handle NaN and includes the separator.
df1["fullname"] = df1["firstname"].str.cat(df1["lastname"], sep=" ", na_rep="")
firstname lastname fullname
0 jack NaN jack
1 john obrien john obrien
2 donald trump donald trump
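If you ever need to join more than two columns, str.cat also accepts a list of Series. A sketch, where middle is a hypothetical extra column:
middle = pd.Series(['m', np.nan, 'j'])  # hypothetical middle-name column
df1['firstname'].str.cat([middle, df1['lastname']], sep=' ', na_rep='')
Note that na_rep='' leaves a double space where a middle value is missing; a final .str.replace(' +', ' ', regex=True) would tidy that up.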
What I would do (for the case where more than two columns need to be joined):
df1.stack().groupby(level=0).agg(' '.join)
Out[57]:
0 jack
1 john obrien
2 donald trump
dtype: object
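A runnable sketch of this with a third column, to show it scaling past two columns (middlename is a hypothetical addition to the question's data):
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'firstname': ['jack', 'john', 'donald'],
                    'middlename': ['m', np.nan, 'j'],
                    'lastname': [np.nan, 'obrien', 'trump']})

# stack() drops the NaNs and yields a MultiIndex of (row, column);
# grouping on level=0 joins the surviving strings per original row
df1['fullname'] = df1.stack().groupby(level=0).agg(' '.join)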
I have two dataframes of a format similar to below:
df1:
  fname    lname   note
0  abby     ross  note1
1   rob  william  note2
2  abby     ross  note3
3  john      doe  note4
4   bob     dole  note5
df2:
  fname    lname   note
0  abby     ross  note6
1   rob  william  note4
I want to merge them, matching on fname and lname, and then update the note column in the first DataFrame with the note column from the second DataFrame.
The result I am trying to achieve would be like this:
  fname    lname   note
0  abby     ross  note6
1   rob  william  note4
2  abby     ross  note6
3  john      doe  note4
4   bob     dole  note5
This is the code I was working with so far:
pd.merge(df1, df2, on=['fname', 'lname'], how='left')
but it just creates a new column with _y appended to it. How can I get it to just update that column?
Any help would be greatly appreciated, thanks!
You can merge and then correct the values:
df_3 = pd.merge(df1, df2, on=['fname', 'lname'], how='left')
df_3['note'] = df_3['note_y']
df_3.loc[df_3['note'].isna(), 'note'] = df_3['note_x']
df_3 = df_3.drop(['note_x', 'note_y'], axis=1)
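A more compact equivalent of the two correction lines, using combine_first (a sketch, not part of the original answer):
# prefer the updated note from df2, falling back to the original
df_3['note'] = df_3['note_y'].combine_first(df_3['note_x'])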
Do what you are doing, assigning the merge result to out_df, then:
# fill nan values in note_y with the original note
out_df['note_y'] = out_df['note_y'].fillna(out_df['note_x'])
# Keep cols you want
out_df = out_df[['fname', 'lname', 'note_y']]
# rename the columns
out_df.columns = ['fname', 'lname', 'note']
I don't like this approach a whole lot, as it won't be very scalable or generalizable. Waiting for a stellar answer to this question.
Try with update:
df1=df1.set_index(['fname','lname'])
df1.update(df2.set_index(['fname','lname']))
df1=df1.reset_index()
df1
Out[55]:
  fname    lname   note
0  abby     ross  note6
1   rob  william  note4
2  abby     ross  note6
3  john      doe  note4
4   bob     dole  note5
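For reference, a self-contained sketch of this approach with the question's exact data. update() aligns on the (fname, lname) index, so matching rows are overwritten in place; note that df1 has a duplicated key (abby ross), and duplicate-index handling can vary across pandas versions:
import pandas as pd

df1 = pd.DataFrame({'fname': ['abby', 'rob', 'abby', 'john', 'bob'],
                    'lname': ['ross', 'william', 'ross', 'doe', 'dole'],
                    'note': ['note1', 'note2', 'note3', 'note4', 'note5']})
df2 = pd.DataFrame({'fname': ['abby', 'rob'],
                    'lname': ['ross', 'william'],
                    'note': ['note6', 'note4']})

df1 = df1.set_index(['fname', 'lname'])
df1.update(df2.set_index(['fname', 'lname']))  # overwrite matching rows
df1 = df1.reset_index()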
In a pandas dataframe string column, I want to grab everything after a certain character, place it at the beginning of the column, and strip the character. What is the most efficient / cleanest way to achieve this?
Input Dataframe:
>>> df = pd.DataFrame({'city':['Bristol, City of', 'Newcastle, City of', 'London']})
>>> df
city
0 Bristol, City of
1 Newcastle, City of
2 London
My desired dataframe output:
city
0 City of Bristol
1 City of Newcastle
2 London
Assuming there are only two pieces to each string at most, you can split, reverse, and join:
df.city.str.split(', ').str[::-1].str.join(' ')
0 City of Bristol
1 City of Newcastle
2 London
Name: city, dtype: object
If a string can contain more than one comma, split on the first one only:
df.city.str.split(', ', n=1).str[::-1].str.join(' ')
0 City of Bristol
1 City of Newcastle
2 London
Name: city, dtype: object
Another option is str.partition:
u = df.city.str.partition(', ')
# strip() removes the stray leading space on rows that had no comma to partition on
(u.iloc[:, -1] + ' ' + u.iloc[:, 0]).str.strip()
0 City of Bristol
1 City of Newcastle
2 London
dtype: object
This always splits on the first comma only.
You can also use a list comprehension, if you need performance:
df.assign(city=[' '.join(s.split(', ', 1)[::-1]) for s in df['city']])
city
0 City of Bristol
1 City of Newcastle
2 London
Why should you care about loopy solutions? For loops are fast when working with string/regex functions (faster than pandas, at least). You can read more at For loops with pandas - When should I care?.
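If performance matters for your data, here is one way to compare the two approaches yourself (sizes and repeat counts are arbitrary):
import timeit
import pandas as pd

df = pd.DataFrame({'city': ['Bristol, City of', 'Newcastle, City of', 'London'] * 10000})

# time the vectorized str accessor against the list comprehension
t_str = timeit.timeit(lambda: df.city.str.split(', ', n=1).str[::-1].str.join(' '),
                      number=10)
t_loop = timeit.timeit(lambda: [' '.join(s.split(', ', 1)[::-1]) for s in df['city']],
                       number=10)
print(f'str accessor: {t_str:.3f}s, list comprehension: {t_loop:.3f}s')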
I work with a dataset that occasionally has values removed before I get my hands on it. When a value is removed, it is generally replaced by NaN or ''. What is the most efficient way to collapse the values to the left?
Specifically, I'm trying to turn this (blank cells are '' or NaN; the exact gap positions here are illustrative):
       1        2        3        4
bill            sjd               meoip
nick   tredsn                     bana
fred   ccrw     aaaa     cretwew  bbbbb
tom                      eomwepo
jill            dew      weaedf
Into this:
       1        2        3        4
bill   sjd      meoip
nick   tredsn   bana
fred   ccrw     aaaa     cretwew  bbbbb
tom    eomwepo
jill   dew      weaedf
The column titles don't matter, the only thing that matters is that there are no leading empty cells and no empty cells between.
I would prefer to do this in a non-iterative fashion, as the df can be quite large.
Try this: if the blanks are '', use mask to turn them into np.nan; if they are already NaN, you need neither mask nor fillna:
df.mask(df == '').apply(lambda x: pd.Series(x.dropna().values), axis=1).fillna('')
Output:
       0        1        2        3
bill   sjd      meoip
nick   tredsn   bana
fred   ccrw     aaaa     cretwew  bbbbb
tom    eomwepo
jill   dew      weaedf
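Note that apply(..., axis=1) still runs a Python-level loop over rows. For a very large frame, a NumPy "justify" sketch avoids that (assuming the blanks are ''):
import numpy as np
import pandas as pd

def push_left(df):
    # shift non-blank values left in each row, vectorized with NumPy
    a = df.mask(df == '').to_numpy(dtype=object)
    blank = pd.isna(a)
    # a stable argsort moves non-blank entries to the front of each row
    # while keeping their original left-to-right order
    order = np.argsort(blank, axis=1, kind='stable')
    out = np.take_along_axis(a, order, axis=1)
    out[np.take_along_axis(blank, order, axis=1)] = ''
    return pd.DataFrame(out, index=df.index, columns=df.columns)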
Suppose I have a dataframe like this:
  fname  lname            email
0   Joe  Aaron
1   Joe  Aaron    some#some.com
2  Bill  Smith
3  Bill  Smith
4  Bill  Smith   some2#some.com
Is there a terse and convenient way to drop rows where {fname, lname} is duplicated and email is blank?
You should first check whether your "empty" data is NaN or empty strings. If they are a mixture, you may need to modify the below logic.
If empty rows are NaN
Using pd.DataFrame.sort_values and pd.DataFrame.drop_duplicates:
df = df.sort_values('email')\
.drop_duplicates(['fname', 'lname'])
If empty rows are strings
If your empty rows are strings, you need to specify ascending=False when sorting:
df = df.sort_values('email', ascending=False)\
.drop_duplicates(['fname', 'lname'])
Result
print(df)
fname lname email
4 Bill Smith some2#some.com
1 Joe Aaron some#some.com
You can use first with groupby (notice the replace('', np.nan), since first returns the first non-null value for each column):
df.replace('',np.nan).groupby(['fname','lname']).first().reset_index()
Out[20]:
fname lname email
0 Bill Smith some2#some.com
1 Joe Aaron some#some.com
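If your blanks are a mixture of NaN and '' (the caveat raised in the first answer), one sketch is to normalize them before applying either recipe:
import numpy as np

# make empty strings look like NaN so all blanks behave alike
df['email'] = df['email'].replace('', np.nan)
df = (df.sort_values('email', na_position='last')
        .drop_duplicates(['fname', 'lname']))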
From a pandas data frame with two string columns, looking like:
d = {'SCHOOL' : ['Yale', 'Yale', 'LBS', 'Harvard','UCLA', 'Harvard', 'HEC'],
'NAME' : ['John', 'Marc', 'Alex', 'Will', 'Will','Miller', 'Tom']}
df = pd.DataFrame(d)
Notice the relationship between NAME and SCHOOL is n to 1.
I want to get the last school in case one person has gone to two different schools (see "Will" case).
So far I got:
df = df.groupby('NAME')['SCHOOL'].unique().reset_index()
Return:
NAME SCHOOL
0 Alex [LBS]
1 John [Yale]
2 Marc [Yale]
3 Miller [Harvard]
4 Tom [HEC]
5 Will [Harvard, UCLA]
PROBLEMS:
unique() returns both schools, not only the last one.
This line returns the SCHOOL column as an np.array instead of a string, which is very difficult to work with further.
Both problems were solved based on #IanS's comments.
Using last() instead of unique():
df = df.groupby('NAME')['SCHOOL'].last().reset_index()
Return:
NAME SCHOOL
0 Alex LBS
1 John Yale
2 Marc Yale
3 Miller Harvard
4 Tom HEC
5 Will UCLA
Use drop_duplicates with keep='last', specifying the column to check for duplicates:
df = df.drop_duplicates('NAME', keep='last')
print (df)
NAME SCHOOL
0 John Yale
1 Marc Yale
2 Alex LBS
4 Will UCLA
5 Miller Harvard
6 Tom HEC
Also, if you need sorting, add sort_values:
df = df.drop_duplicates('NAME', keep='last').sort_values('NAME')
print (df)
NAME SCHOOL
2 Alex LBS
0 John Yale
1 Marc Yale
5 Miller Harvard
6 Tom HEC
4 Will UCLA