I have a pandas DataFrame with a fullnames field. I want to change the logic so that the first name is the first word, the last name is the last word, and everything in between goes into the middle name field.
Note: The full name may contain only two words, in which case the middle name will be null, and there may also be extra spaces between the names.
Current Logic:
fullnames = "Walter John Ross Schmidt"
first, middle, *last = fullnames.split()
print("First = {first}".format(first=first))
print("Middle = {middle}".format(middle=middle))
print("Last = {last}".format(last=" ".join(last)))
Output :
First = Walter
Middle = John
Last = Ross Schmidt
Expected Output :
FirstName = Walter
Middle = John Ross
Last = Schmidt
You can use negative indexing to get the last item in the list for the last name and also use a slice to get all but the first and last for the middle name:
fullnames = "Walter John Ross Schmidt"
first = fullnames.split()[0]
last = fullnames.split()[-1]
middle = " ".join(fullnames.split()[1:-1])
print("First = {first}".format(first=first))
print("Middle = {middle}".format(middle=middle))
print("Last = {last}".format(last=last))
PS if you are working with a data frame you can use:
import pandas as pd
df = pd.DataFrame({'fullnames': ['Walter John Ross Schmidt']})
df = df.assign(**{
    'first': df['fullnames'].str.split().str[0],
    'middle': df['fullnames'].str.split().str[1:-1].str.join(' '),
    'last': df['fullnames'].str.split().str[-1]
})
Output:
                  fullnames   first     middle     last
0  Walter John Ross Schmidt  Walter  John Ross  Schmidt
You can use capture groups in the regex passed to str.extract(), which will let you do this in a single operation:
import re
import pandas as pd
df = pd.DataFrame({
    "name": [
        "Walter John Ross Schmidt",
        "John Quincy Adams"
    ]
})
rx = re.compile(r'^(\w+)\s+(.*?)\s+(\w+)$')
df[['first', 'middle', 'last']] = df['name'].str.extract(pat=rx, expand=True)
This gives you:
                       name   first     middle     last
0  Walter John Ross Schmidt  Walter  John Ross  Schmidt
1         John Quincy Adams    John     Quincy    Adams
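Note that the pattern above requires at least three words, so a two-word name would not match and all three columns would be NaN for that row. A minimal sketch of a variant with an optional middle group follows; the named groups, the sample row with extra spaces, and the use of \S instead of \w are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "name": [
        "Walter John Ross Schmidt",
        "John Quincy Adams",
        "  Jane   Doe ",  # two words with extra spaces (illustrative row)
    ]
})

# first word, optional middle part, last word; surrounding spaces are ignored
pattern = r"^\s*(?P<first>\S+)(?:\s+(?P<middle>\S.*?))?\s+(?P<last>\S+)\s*$"

# named groups become the column names of the extracted frame
parts = df["name"].str.extract(pattern)
df = df.join(parts)
print(df)
```

With this pattern the middle column is left as NaN for two-word names, which matches the "middle name will be null" requirement in the question.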
I would use str.replace and str.extract here:
df["FirstName"] = df["FullName"].str.extract(r'^(\w+)', expand=False)
df["Middle"] = df["FullName"].str.replace(r'^\w+\s+|\s+\w+$', '', regex=True)
df["Last"] = df["FullName"].str.extract(r'(\w+)$', expand=False)
You can use extended unpacking instead; middle is then a list of the words between the first and the last:
first, *middle, last = fullnames.split()
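As a small sketch of how this unpacking could be wrapped up for reuse, the hypothetical helper below (split_name is not from the original post) joins the middle words back together and returns None when there are only two words. It assumes every name has at least two words; a single-word name would raise a ValueError.

```python
def split_name(fullname):
    """Return (first, middle, last); middle is None for two-word names."""
    # split() with no arguments also collapses repeated spaces
    first, *middle, last = fullname.split()
    return first, " ".join(middle) or None, last


print(split_name("Walter John Ross Schmidt"))
print(split_name("  John   Adams "))
```

On a DataFrame, the same helper could be applied row by row with df['fullnames'].apply(split_name), although the vectorized .str methods shown earlier are usually preferable for large frames.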
Related
I have a csv with a few people. I would like to build a function that will either filter based on all parameters given or return the entire dataframe as it is if no arguments are passed.
So given this as the csv:
FirstName LastName City
Matt Fred Austin
Jim Jack NYC
Larry Bob Houston
Matt Spencer NYC
If I call my function find, here is what I would expect to see depending on the arguments passed:
find(first="Matt", last="Fred")
Output: Matt Fred Austin
find()
Output: Full Dataframe
find(last="Spencer")
Output: Matt Spencer NYC
find(city="NYC")
Output: All people living in NYC in dataframe
This is what I have tried:
def find(first=None, last=None, city=None):
    file = pd.read_csv(list)
    searched = file.loc[(file["FirstName"] == first) & (file["LastName"] == last) & (file["City"] == city)]
    return searched
This returns blank if I just pass in a first name and nothing else.
You could do something like this:
import numpy as np
def find(**kwargs):
    assert np.isin(list(kwargs.keys()), df.columns).all()
    return df.loc[df[list(kwargs.keys())].eq(list(kwargs.values())).all(axis=1)]
search = find(FirstName="Matt", LastName="Fred")
print(search)
# FirstName LastName City
#0 Matt Fred Austin
find(LastName="Spencer")
# FirstName LastName City
#3 Matt Spencer NYC
If you want to use "first", "last" and "city":
def find(**kwargs):
    df_index = df.rename(columns={"FirstName": "first",
                                  "LastName": "last",
                                  "City": "city"})
    assert np.isin(list(kwargs.keys()), df_index.columns).all()
    return df.loc[df_index[list(kwargs.keys())]
                  .eq(list(kwargs.values())).all(axis=1)]
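The original attempt returns nothing when only a first name is passed because the omitted arguments are still compared against None, and no row has None in those columns. A minimal sketch that keeps the question's explicit first/last/city signature and only filters on the arguments actually supplied could look like this; passing the frame in as a df parameter and the inline sample data are assumptions for the sake of a self-contained example.

```python
import pandas as pd

people = pd.DataFrame({
    "FirstName": ["Matt", "Jim", "Larry", "Matt"],
    "LastName": ["Fred", "Jack", "Bob", "Spencer"],
    "City": ["Austin", "NYC", "Houston", "NYC"],
})


def find(df, first=None, last=None, city=None):
    criteria = {"FirstName": first, "LastName": last, "City": city}
    # start with every row selected, then narrow down per supplied argument
    mask = pd.Series(True, index=df.index)
    for column, value in criteria.items():
        if value is not None:
            mask &= df[column] == value
    return df[mask]


print(find(people, first="Matt", last="Fred"))
print(find(people, city="NYC"))
print(find(people))  # no arguments: the full frame
```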
Another approach to filtering by column values:
import os
import pandas as pd
csv_path = os.path.abspath('test.csv')
df = pd.read_table(csv_path, sep=r'\s+')
def find_by_attrs(df, **attrs):
    # reject keyword names that are not columns of the frame
    if attrs.keys() - df.columns:
        raise KeyError('Improper column name(s)')
    return df[df[list(attrs)].eq(list(attrs.values())).all(axis=1)]
print(find_by_attrs(df, City="NYC"))
The output:
FirstName LastName City
1 Jim Jack NYC
3 Matt Spencer NYC
I have a sample df:
id  Another header
1   JohnWalter walter
2   AdamSmith Smith
3   Steve Rogers rogers
How can I find whether a name is duplicated within each row and pop it out? Expected output:
id  Name                 poped_out_string  corrected_name
1   JohnWalter walter    walter            John walter
2   AdamSmith Smith      Smith             Adam Smith
3   Steve Rogers rogers  rogers            Steve Rogers
You could try something like below:
import re  # used to split camel-case strings such as "JohnWalter"
# Get the unique items (capitalized, in order) and the duplicates from a list
def uniqueItems(items):
    seen = set()
    uniq = []
    dups = []
    for x in items:
        if x in seen:
            if x not in dups:
                dups.append(x)
        else:
            seen.add(x)
            uniq.append(x.capitalize())
    return {"unique": uniq, "duplicates": dups}
# Split our strings on whitespace and before each capital letter
def splitString(inputString):
    stringProcess = re.sub(r"([A-Z])", r" \1", inputString).split()
    # convert all strings to lower case for easy comparison
    convertToLower = [x.lower() for x in stringProcess]
    return uniqueItems(convertToLower)
# Iterate over rows in data frame
for i, row in df.iterrows():
    split_result = splitString(row['Name'])
    if len(split_result["duplicates"]) > 0:  # If there are duplicates
        df.at[i, "poped_out_string"] = " ".join(split_result["duplicates"])
    if len(split_result["unique"]) > 1:
        df.at[i, "corrected_name"] = " ".join(split_result["unique"])
The general idea is to iterate over each row, split the string in the "Name" column, check for duplicates, and then write those values into the data frame.
import re
import pandas as pd
df = pd.DataFrame(['JohnWalter walter brown', 'winter AdamSmith Smith', 'Steve Rogers rogers'], columns=['Name'])
df
Name
0 JohnWalter walter brown
1 winter AdamSmith Smith
2 Steve Rogers rogers
def remove_dups(string):
    # first find names that start with a lowercase or capital letter, followed by zero or more characters that are neither spaces nor capitals
    names = re.findall('[a-zA-Z][^A-Z ]*', string)
    # collect the non-duplicates in a list (a set can't be used as it doesn't preserve the order of the names)
    new_names = []
    # capitalize and append names if they are not already added
    for name in names:
        if name.capitalize() not in new_names:
            new_names.append(name.capitalize())
    # finally construct the full name and return it
    return ' '.join(new_names)
df.Name.apply(remove_dups)
0 John Walter Brown
1 Winter Adam Smith
2 Steve Rogers
Name: Name, dtype: object
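The function above returns only the corrected name. A minimal sketch that also returns the popped-out duplicates, as the question asked, is shown below. The helper name pop_duplicates is hypothetical, and the sketch assumes that names are compared case-insensitively and that the first occurrence is the one to keep (with its original capitalization).

```python
import re


def pop_duplicates(name):
    """Return (corrected_name, popped_out_string) for one cell."""
    # split on spaces and before capital letters, e.g. "JohnWalter" -> John, Walter
    parts = re.findall(r"[A-Za-z][^A-Z\s]*", name)
    seen = set()
    kept, popped = [], []
    for part in parts:
        key = part.lower()
        if key in seen:
            popped.append(part)  # a repeat of an earlier word
        else:
            seen.add(key)
            kept.append(part)
    return " ".join(kept), " ".join(popped)


for cell in ["JohnWalter walter", "AdamSmith Smith", "Steve Rogers rogers"]:
    print(pop_duplicates(cell))
```

With a DataFrame, the two values could be written to separate columns by applying the helper to the Name column and unpacking the resulting tuples.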
I have two dataframes: one with full names and another with nicknames. The nickname is always a portion of the person's full name, and the data is not sorted or indexed, so I can't just merge the two.
What I want as output is one DataFrame that contains the full name and the associated nickname, found by a simple search: find the nickname inside the name and match it.
Any solutions to this?
df = pd.DataFrame({'fullName': ['Claire Daines', 'Damian Lewis', 'Mandy Patinkin', 'Rupert Friend', 'F. Murray Abraham']})
df2 = pd.DataFrame({'nickName': ['Rupert','Abraham','Patinkin','Daines','Lewis']})
Thanks
Use Series.str.extract with the nicknames joined by | (regex OR), each wrapped in \b for word boundaries:
pat = '|'.join(r"\b{}\b".format(x) for x in df2['nickName'])
df['nickName'] = df['fullName'].str.extract('('+ pat + ')', expand=False)
print (df)
            fullName  nickName
0      Claire Daines    Daines
1       Damian Lewis     Lewis
2     Mandy Patinkin  Patinkin
3      Rupert Friend    Rupert
4  F. Murray Abraham   Abraham
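The pattern above inserts each nickname into the regex verbatim. As a hedged variant, assuming some nicknames might contain regex metacharacters such as a dot or parentheses, each one can be passed through re.escape first. The data is the same as in the question.

```python
import re

import pandas as pd

df = pd.DataFrame({'fullName': ['Claire Daines', 'Damian Lewis', 'Mandy Patinkin',
                                'Rupert Friend', 'F. Murray Abraham']})
df2 = pd.DataFrame({'nickName': ['Rupert', 'Abraham', 'Patinkin', 'Daines', 'Lewis']})

# escape each nickname so it is matched literally, then OR them together
pat = '|'.join(r"\b{}\b".format(re.escape(nick)) for nick in df2['nickName'])

# rows where no nickname is found would get NaN
df['nickName'] = df['fullName'].str.extract('(' + pat + ')', expand=False)
print(df)
```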
I have a list of emails that I want to split into two columns.
df = [Smith, John <jsmith#abc.com>; Moores, Jordan <jmoores#abc.com>;
Manson, Tyler <tmanson#abc.com>; Foster, Ryan <rfoster#abc.com>]
list = df.split(';')
for i in list:
    print(i)
Expected result is to have two columns, one for name and one for email:
Name            Email
Smith, John     jsmith#abc.com
Moores, Jordan  jmoores#abc.com
Manson, Tyler   tmanson#abc.com
Foster, Ryan    rfoster#abc.com
Do NOT use list as a variable name; there's just no reason to. Here is a way to do it, assuming your input is a string:
import pandas as pd
data = "Smith, John <jsmith#abc.com>; Moores, Jordan <jmoores#abc.com>; Manson, Tyler <tmanson#abc.com>; Foster, Ryan <rfoster#abc.com>"
# Do not call things "list": it is not a keyword, but the name shadows Python's built-in list type
l1 = data.split(';')
res = []
for i in l1:
    splt = i.strip().split()
    # the first two words are the name; the last word is the email without its angle brackets
    res.append([" ".join(splt[:2]), splt[-1][1:-1]])
df = pd.DataFrame(res, columns=["Name", "Email"])
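The loop above assumes that every name consists of exactly two words. A minimal regex-based sketch that does not depend on the number of words in the name is shown below; it assumes each entry has the form "name <email>" and that names contain neither "<" nor ";".

```python
import re

data = ("Smith, John <jsmith#abc.com>; Moores, Jordan <jmoores#abc.com>; "
        "Manson, Tyler <tmanson#abc.com>; Foster, Ryan <rfoster#abc.com>")

# group 1: the name (anything up to the "<"), group 2: the text inside <...>
rows = re.findall(r"\s*([^<;]+?)\s*<([^>]+)>", data)

for name, email in rows:
    print(name, email)
```

The resulting list of tuples can be turned into a frame in the same way as before, with pd.DataFrame(rows, columns=["Name", "Email"]).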
I am trying to break a name into parts: the first name is mandatory, then the last name, and if a middle name remains it is added to its own column. To do this I take the first and last words and then remove them from the full name:
df['owner1_first_name'] = df['owner1_name'].str.split().str[0].astype(str, errors='ignore')
df['owner1_last_name'] = df['owner1_name'].str.split().str[-1].str.replace(df['owner1_first_name'], "").astype(str, errors='ignore')
df['owner1_middle_name'] = df['owner1_name'].str.replace(df['owner1_first_name'], "").str.replace(df['owner1_last_name'], "").astype(str, errors='ignore')
The problem is that I am not able to use
.str.replace(df['owner1_name'], "")
as I am getting an error:
"TypeError: 'Series' objects are mutable, thus they cannot be hashed"
Is there any replacement syntax in pandas for what I am trying to achieve?
My desired output, for the full name THOMAS MARY D in column owner1_name, is:
owner1_first_name = THOMAS
owner1_middle_name = MARY
owner1_last_name = D
I think you need mask, which replaces the middle name with an empty string where it is the same as the last name:
df = pd.DataFrame({'owner1_name':['THOMAS MARY D', 'JOE Long', 'MARY Small']})
splitted = df['owner1_name'].str.split()
df['owner1_first_name'] = splitted.str[0]
df['owner1_last_name'] = splitted.str[-1]
df['owner1_middle_name'] = splitted.str[1]
df['owner1_middle_name'] = (df['owner1_middle_name']
                            .mask(df['owner1_middle_name'] == df['owner1_last_name'], ''))
print (df)
     owner1_name owner1_first_name owner1_last_name owner1_middle_name
0  THOMAS MARY D            THOMAS                D               MARY
1       JOE Long               JOE             Long
2     MARY Small              MARY            Small
Which is the same as:
splitted = df['owner1_name'].str.split()
df['owner1_first_name'] = splitted.str[0]
df['owner1_last_name'] = splitted.str[-1]
middle = splitted.str[1]
df['owner1_middle_name'] = middle.mask(middle == df['owner1_last_name'], '')
print (df)
     owner1_name owner1_first_name owner1_last_name owner1_middle_name
0  THOMAS MARY D            THOMAS                D               MARY
1       JOE Long               JOE             Long
2     MARY Small              MARY            Small
EDIT:
To replace row by row, it is possible to use apply with axis=1:
df = pd.DataFrame({'owner1_name':['THOMAS MARY-THOMAS', 'JOE LongJOE', 'MARY Small']})
splitted = df['owner1_name'].str.split()
df['a'] = splitted.str[0]
df['b'] = splitted.str[-1]
df['c'] = df.apply(lambda x: x['b'].replace(x['a'], ''), axis=1)
print (df)
          owner1_name       a            b      c
0  THOMAS MARY-THOMAS  THOMAS  MARY-THOMAS  MARY-
1         JOE LongJOE     JOE      LongJOE   Long
2          MARY Small    MARY        Small  Small
The exact code, in three lines, to achieve what I wanted in my question is:
df['owner1_first_name'] = df['owner1_name'].str.split().str[0]
df['owner1_last_name'] = df.apply(lambda x: x['owner1_name'].split()[-1].replace(x['owner1_first_name'], ''), axis=1)
df['owner1_middle_name'] = df.apply(lambda x: x['owner1_name'].replace(x['owner1_first_name'], '').replace(x['owner1_last_name'], '').strip(), axis=1)
Just change your assignment and use another variable:
split = df['owner1_name'].str.split()
df['owner1_first_name'] = split.str[0]
df['owner1_middle_name'] = split.str[1]
df['owner1_last_name'] = split.str[-1]
splitted = df['Contact_Name'].str.split()
df['First_Name'] = splitted.str[0]
df['Last_Name'] = splitted.str[-1]
df['Middle_Name'] = df['Contact_Name'].loc[df['Contact_Name'].str.split().str.len() == 3].str.split(expand=True)[1]
This might help! The tricky part is inserting the middle name correctly, which the last line does by taking the second word only for names that have exactly three words.
I like to use the expand parameter. It will return a new dataframe with columns named 0, 1, 2. You can rename them in one line:
col_names = ['owner1_first_name', 'owner1_middle_name', 'owner1_last_name']
df.owner1_name.str.split(expand=True).rename(columns=dict(zip(range(len(col_names)), col_names)))
Beware that this code breaks if someone has four names. Better to do it in 2 steps: split(n=1, expand=True) and then rsplit(n=1, expand=True).
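A minimal sketch of the two-step idea follows. Note that it peels off the last word first with rsplit and then the first word with split, which is the same idea in the opposite order, chosen here so that two-word names end up with an empty middle name. The sample rows are illustrative, and the sketch assumes every name has at least two words.

```python
import pandas as pd

df = pd.DataFrame({"owner1_name": ["THOMAS MARY D", "JOE Long",
                                   "Walter John Ross Schmidt"]})

names = df["owner1_name"].str.strip()

# step 1: column 0 is everything before the last word, column 1 is the last word
head_last = names.str.rsplit(n=1, expand=True)

# step 2: split the head into first word (0) and the remaining middle part (1);
# reindex guarantees column 1 exists even if no row has a middle name
first_middle = head_last[0].str.split(n=1, expand=True).reindex(columns=[0, 1])

df["owner1_first_name"] = first_middle[0]
df["owner1_middle_name"] = first_middle[1].fillna("")
df["owner1_last_name"] = head_last[1]
print(df)
```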