Strip column values if they start with a specific string - pandas / Python

I have a pandas dataframe (sample):
id name
1 Mr-Mrs-Jon Snow
2 Mr-Mrs-Jane Smith
3 Mr-Mrs-Darth Vader
I'm looking to strip the "Mr-Mrs-" prefix from the dataframe, i.e. the output should be:
id name
1 Jon Snow
2 Jane Smith
3 Darth Vader
I tried using
df['name'] = df['name'].str.lstrip("Mr-Mrs-")
But when doing so, some letters of the names in some rows also get stripped out.
I don't want to run a loop and do .loc for every row - is there a better/optimized way to achieve this?

Don't strip - str.lstrip treats its argument as a set of characters to remove, not as a prefix, so it also eats leading letters of the names themselves. Instead, replace using a start-of-string anchor (^):
df['name'] = df['name'].str.replace(r"^Mr-Mrs-", "", regex=True)
Or use str.removeprefix (pandas 1.4+):
df['name'] = df['name'].str.removeprefix("Mr-Mrs-")
Output:
id name
1 Jon Snow
2 Jane Smith
3 Darth Vader
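A minimal runnable sketch of the difference (the sample frame mirrors the question):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3],
                   "name": ["Mr-Mrs-Jon Snow", "Mr-Mrs-Jane Smith", "Mr-Mrs-Darth Vader"]})

# str.lstrip("Mr-Mrs-") removes leading characters from the SET {M, r, -, s},
# so a name like "Mrs Smith" would lose its leading letters too:
assert "Mr-Mrs-Mrs Smith".lstrip("Mr-Mrs-") == " Smith"

# The anchored replace removes only an exact leading match:
df["name"] = df["name"].str.replace(r"^Mr-Mrs-", "", regex=True)
```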

Related

Python remove text if same as another column

I want to drop the text in one column of my dataframe if it starts with the same text that is in another column.
Example of dataframe:
name var1
John Smith John Smith Hello world
Mary Jane Mary Jane Python is cool
James Bond My name is James Bond
Peter Pan Nothing happens here
Dataframe that I want:
name var1
John Smith Hello world
Mary Jane Python is cool
James Bond My name is James Bond
Peter Pan Nothing happens here
Something simple as:
df[~df.var1.str.contains(df.var1)]
does not work. How should I write my Python code?
Try using apply with a lambda:
df["var1"] = df.apply(lambda x: x["var1"][len(x["name"]):].strip() if x["name"] == x["var1"][:len(x["name"])] else x["var1"],axis=1)
How about this? (Note: this removes the name wherever it appears in var1, not only as a prefix - for the "My name is James Bond" row it would also strip the name, unlike the desired output - and it can leave a stray space.)
df['var1'] = [df.loc[i, 'var1'].replace(df.loc[i, 'name'], "") for i in df.index]
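For completeness, a prefix-only sketch that avoids the mid-string pitfall (assuming plain Python strings in both columns):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John Smith", "Mary Jane", "James Bond", "Peter Pan"],
    "var1": ["John Smith Hello world", "Mary Jane Python is cool",
             "My name is James Bond", "Nothing happens here"],
})

# Remove the name only when var1 actually starts with it, then trim the
# leftover leading space; all other rows are left untouched.
df["var1"] = [v[len(n):].lstrip() if v.startswith(n) else v
              for n, v in zip(df["name"], df["var1"])]
```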

How to label data in pandas based on a column value that is shared across rows

If there is someone who understands, please help me resolve this. I want to label user data using python pandas. There are two columns in my dataset, namely author and retweeted_screen_name. I want to create a label with this criterion: if a value in the retweeted_screen_name column is shared by more than one row, the label is 1; rows whose value is not shared get 0.
Author RT_Screen_Name Label
Alice John 1
Sandy John 1
Lisa Mario 0
Luna Mark 0
Luna John 1
Luke Anthony 0
df['Label']=0
df.loc[df["RT_Screen_Name"]=="John", ["Label"]] = 1
It is unclear what condition you are using to decide the Label variable, but once your condition is clear you can swap out the conditional statement in this code. If you edit your question to clarify the condition, notify me and I will adjust my answer.
IIUC, try with groupby:
df["Label"] = (df.groupby("RT_Screen_Name")["Author"].transform("count")>1).astype(int)
>>> df
Author RT_Screen_Name Label
0 Alice John 1
1 Sandy John 1
2 Lisa Mario 0
3 Luna Mark 0
4 Luna John 1
5 Luke Anthony 0
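For this particular criterion (a screen name appearing in more than one row), duplicated with keep=False is an equivalent sketch that avoids the groupby:

```python
import pandas as pd

df = pd.DataFrame({
    "Author": ["Alice", "Sandy", "Lisa", "Luna", "Luna", "Luke"],
    "RT_Screen_Name": ["John", "John", "Mario", "Mark", "John", "Anthony"],
})

# duplicated(keep=False) flags every row whose screen name occurs more
# than once anywhere in the column, matching the count > 1 criterion.
df["Label"] = df["RT_Screen_Name"].duplicated(keep=False).astype(int)
```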

Concatenate multiple column strings into one column

I have the following dataframe with firstname and lastname. I want to create a column fullname.
df1 = pd.DataFrame({'firstname': ['jack', 'john', 'donald'],
                    'lastname': [np.nan, 'obrien', 'trump']})  # np is numpy; pd.np is deprecated
print(df1)
firstname lastname
0 jack NaN
1 john obrien
2 donald trump
This works if there are no NaN values:
df1['fullname'] = df1['firstname']+df1['lastname']
However since there are NaNs in my dataframe, I decided to cast to string first. But it causes a problem in the fullname column:
df1['fullname'] = str(df1['firstname'])+str(df1['lastname'])
firstname lastname fullname
0 jack NaN 0 jack\n1 john\n2 donald\nName: f...
1 john obrien 0 jack\n1 john\n2 donald\nName: f...
2 donald trump 0 jack\n1 john\n2 donald\nName: f...
I can write some function that checks for nans and inserts the data into the new frame, but before I do that - is there another fast method to combine these strings into one column?
You need to handle the NaNs using .fillna(). Here, you can fill with ''.
df1['fullname'] = df1['firstname'] + ' ' +df1['lastname'].fillna('')
Output:
firstname lastname fullname
0 jack NaN jack
1 john obrien john obrien
2 donald trump donald trump
You may also use .add and specify a fill_value:
df1.firstname.add(" ").add(df1.lastname, fill_value="")
PS: Chaining too many adds or + is not recommended for strings, but for one or two columns you should be fine
df1['fullname'] = df1['firstname']+df1['lastname'].fillna('')
There is also Series.str.cat, which handles NaN and lets you specify the separator.
df1["fullname"] = df1["firstname"].str.cat(df1["lastname"], sep=" ", na_rep="")
firstname lastname fullname
0 jack NaN jack
1 john obrien john obrien
2 donald trump donald trump
What I will do (for the case where more than two columns need to be joined):
df1.stack().groupby(level=0).agg(' '.join)
Out[57]:
0 jack
1 john obrien
2 donald trump
dtype: object
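Another sketch for the many-column case: fill the blanks, join each row with agg along axis=1, and trim the stray spaces left by empty cells (assuming all columns hold strings or NaN):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"firstname": ["jack", "john", "donald"],
                    "lastname": [np.nan, "obrien", "trump"]})

# Fill NaN with '', join every column in the row, then strip stray spaces
df1["fullname"] = df1.fillna("").agg(" ".join, axis=1).str.strip()
```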

Pandas - Rename only first dictionary match instead of last match

I am trying to use pandas to rename a column in CSV files. I want to use a dictionary, since columns with the same information can be named differently (e.g. mobile_phone and telephone instead of phone).
I want to rename the first instance of phone. Here is an example to hopefully explain more.
Here is the original in this example:
0 name mobile_phone telephone
1 Bob 12364234234 12364234234
2 Joe 23534235435 43564564563
3 Jill 34573474563 78098080807
Here is what I want it to do:
0 name phone telephone
1 Bob 12364234234 12364234234
2 Joe 23534235435 43564564563
3 Jill 34573474563 78098080807
This is the code I tried:
phone_dict = {
    'phone_number': 'phone',
    'mobile_phone': 'phone',
    'telephone': 'phone',
    'phones': 'phone',
}
if 'phone' not in df.columns:
    df.rename(columns=dict(phone_dict), inplace=True)
if 'phone' not in df.columns:
    raise ValueError("What are these peoples numbers!? (Need 'phone' column)")
I made a dictionary of some possible column names that I want renamed to 'phone'. However, when I run this code it renames the second column that matches a key in the dictionary instead of the first. I want it to stop after the first matching column it comes across in the CSV.
This is what is happening:
0 name mobile_phone phone
1 Bob 12364234234 12364234234
2 Joe 23534235435 43564564563
3 Jill 34573474563 78098080807
If there is, for example, a third column that matches the dictionary they turn to 'phone' which is again not what I want. I am trying to get it to just change the first column it matches.
Here is an example of what happens when I add a third column.
It goes from:
0 name mobile_phone telephone phone_1
1 Bob 12364234234 12364234234 36346346311
2 Joe 23534235435 43564564563 34634634623
3 Jill 34573474563 78098080807 34634654622
To this:
0 name phone phone phone
1 Bob 12364234234 12364234234 36346346311
2 Joe 23534235435 43564564563 34634634623
3 Jill 34573474563 78098080807 34634654622
But I want it to be this:
0 name phone telephone phone_1
1 Bob 12364234234 12364234234 36346346311
2 Joe 23534235435 43564564563 34634634623
3 Jill 34573474563 78098080807 34634654622
Any advice or tips to stop it changing the second dictionary match (or all of them) instead of the first one?
Before, I had a bunch of elif statements, but I thought a dictionary would be cleaner and easier to read.
You shouldn't expect pd.DataFrame.rename to apply any particular sequential ordering with a dict input. Even if the logic worked, it would be an implementation detail as the docs don't describe the actual process.
Instead, you can use pd.DataFrame.filter to find the first valid column label:
df = df.rename(columns={df.filter(like='phone').columns[0]: 'phone'})
print(df)
0 name phone telephone
0 1 Bob 12364234234 12364234234
1 2 Joe 23534235435 43564564563
2 3 Jill 34573474563 78098080807
If it's possible a valid column may not exist, you can catch IndexError:
try:
    df = df.rename(columns={df.filter(like='phones').columns[0]: 'phone'})
except IndexError:
    print('No columns including "phones" exists.')
Here's one solution:
df:
Columns: [name, mobile_phone, telephone]
Index: []
Finding the first instance of phone (left to right) in the column index:
a = [('phone' in df.columns[i]) and ('phone' not in df.columns[i - 1]) for i in range(len(df.columns))]
Getting the column that needs to be renamed phone:
phonecol = df.columns[a][0]
Renaming the column:
df.rename(columns = {phonecol : 'phone'})
Output:
Columns: [name, phone, telephone]
Index: []
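A dictionary-driven sketch is also possible: walk the columns left to right with next() and rename only the first one whose name is in the alias set (the alias set below mirrors the question's dictionary keys):

```python
import pandas as pd

# Alias set taken from the question's phone_dict keys
phone_aliases = {"phone_number", "mobile_phone", "telephone", "phones"}

df = pd.DataFrame([["Bob", "12364234234", "12364234234", "36346346311"]],
                  columns=["name", "mobile_phone", "telephone", "phone_1"])

# next() stops at the first alias hit (or returns None if nothing matches),
# so at most one column is renamed.
first_match = next((c for c in df.columns if c in phone_aliases), None)
if first_match is not None and "phone" not in df.columns:
    df = df.rename(columns={first_match: "phone"})
```

Note that "phone_1" is untouched because it is not in the alias set, unlike a substring-based filter(like='phone').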

Dropping selected rows in Pandas with duplicated columns

Suppose I have a dataframe like this:
fname lname email
Joe Aaron
Joe Aaron some@some.com
Bill Smith
Bill Smith
Bill Smith some2@some.com
Is there a terse and convenient way to drop rows where {fname, lname} is duplicated and email is blank?
You should first check whether your "empty" data is NaN or empty strings. If they are a mixture, you may need to modify the below logic.
If empty rows are NaN
Using pd.DataFrame.sort_values and pd.DataFrame.drop_duplicates:
df = df.sort_values('email')\
.drop_duplicates(['fname', 'lname'])
If empty rows are strings
If your empty rows are strings, you need to specify ascending=False when sorting:
df = df.sort_values('email', ascending=False)\
.drop_duplicates(['fname', 'lname'])
Result
print(df)
fname lname email
4 Bill Smith some2@some.com
1 Joe Aaron some@some.com
You can use first with groupby (note the replace of empty strings with np.nan, since first returns the first non-null value for each column):
df.replace('',np.nan).groupby(['fname','lname']).first().reset_index()
Out[20]:
fname lname email
0 Bill Smith some2@some.com
1 Joe Aaron some@some.com
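If the blanks are a mixture of NaN and empty strings, one sketch handles both: normalise '' to NaN so every kind of blank sorts last, then keep the first row per name pair (column names assumed as in the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "fname": ["Joe", "Joe", "Bill", "Bill", "Bill"],
    "lname": ["Aaron", "Aaron", "Smith", "Smith", "Smith"],
    "email": ["", "some@some.com", "", np.nan, "some2@some.com"],
})

# Normalise '' to NaN so both kinds of blank sort last, then keep the
# first (non-blank, if one exists) row per (fname, lname) pair.
out = (df.replace("", np.nan)
         .sort_values("email", na_position="last")
         .drop_duplicates(["fname", "lname"])
         .sort_index())
```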
