Duplicates in a cell str in data frame - python

I have a sample df:
id  Another header
1   JohnWalter walter
2   AdamSmith Smith
3   Steve Rogers rogers
How can I check whether a name part is duplicated in each row, and pop it out? Desired output:
id  Name                 poped_out_string  corrected_name
1   JohnWalter walter    walter            John walter
2   AdamSmith Smith      Smith             Adam Smith
3   Steve Rogers rogers  rogers            Steve Rogers

You could try something like below:
import re  # used to split CamelCase words on capital letters
# Get unique items (capitalized, in order) and duplicates from a list
def uniqueItems(items):
    seen = set()
    uniq = []
    dups = []
    for x in items:
        if x in seen:
            # Already encountered: record it once as a duplicate
            if x not in dups:
                dups.append(x)
        else:
            seen.add(x)
            uniq.append(x.capitalize())
    return {"unique": uniq, "duplicates": dups}
# Split our strings
def splitString(inputString):
    # Insert a space before every capital letter, then split on whitespace
    stringProcess = re.sub(r"([A-Z])", r" \1", inputString).split()
    # Convert all strings to lower case for easy comparison
    convertToLower = [x.lower() for x in stringProcess]
    # Always return the dict, even when the cell holds a single word
    return uniqueItems(convertToLower)
# Iterate over rows in data frame
for i, row in df.iterrows():
    split_result = splitString(row['Name'])
    if len(split_result["duplicates"]) > 0:  # If there are duplicates
        df.at[i, "poped_out_string"] = " ".join(split_result["duplicates"])
    if len(split_result["unique"]) > 1:
        df.at[i, "corrected_name"] = " ".join(split_result["unique"])
The general idea is to iterate over each row, split the string in the "Name" column, check for duplicates, and then write those values into the data frame
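As a rough sketch of the same idea without the data frame, the per-cell logic can live in one function. The name pop_duplicates is made up for this illustration, and it assumes name parts are separated by spaces or by a capital letter. It keeps the first occurrence of each part and pops any later, case-insensitive repeat:

```python
import re

def pop_duplicates(name):
    """Return (popped_words, corrected_name) for one cell value."""
    # Break before each capital letter so "JohnWalter" becomes "John Walter".
    words = re.sub(r"([A-Z])", r" \1", name).split()
    seen = set()
    kept, popped = [], []
    for word in words:
        key = word.lower()  # compare case-insensitively
        if key in seen:
            popped.append(word)  # a repeat: pop it out as written
        else:
            seen.add(key)
            kept.append(word.capitalize())
    return " ".join(popped), " ".join(kept)

for sample in ["JohnWalter walter", "AdamSmith Smith", "Steve Rogers rogers"]:
    print(pop_duplicates(sample))
```

A function like this could then be applied to the "Name" column row by row and its two results written to the new columns.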

import re
import pandas as pd
df = pd.DataFrame(['JohnWalter walter brown', 'winter AdamSmith Smith', 'Steve Rogers rogers'], columns=['Name'])
df
                      Name
0  JohnWalter walter brown
1   winter AdamSmith Smith
2      Steve Rogers rogers
def remove_dups(string):
    # first find names that start with a lower-case or capital letter, followed by
    # any characters other than a space or another capital letter
    names = re.findall('[a-zA-Z][^A-Z ]*', string)
    # then build a new list of non-duplicates (a set can't be used as it doesn't preserve the order of the names)
    new_names = []
    # capitalize and append names if they are not already added
    for name in names:
        if name.capitalize() not in new_names:
            new_names.append(name.capitalize())
    # finally construct the full name and return it
    return ' '.join(new_names)
df.Name.apply(remove_dups)
0    John Walter Brown
1    Winter Adam Smith
2         Steve Rogers
Name: Name, dtype: object
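A possible shortcut for the order-preserving de-duplication step: a plain dict keeps insertion order (guaranteed since Python 3.7), so dict.fromkeys can stand in for the manual list check. This is only a sketch of that variant, using the same regex as above; the function name is hypothetical:

```python
import re

def remove_dups_ordered(string):
    # Same pattern as above: a letter followed by anything except
    # a space or another capital letter.
    names = re.findall(r"[a-zA-Z][^A-Z ]*", string)
    # dict.fromkeys drops repeats while keeping first-seen order.
    unique = dict.fromkeys(name.capitalize() for name in names)
    return " ".join(unique)

print(remove_dups_ordered("JohnWalter walter brown"))
print(remove_dups_ordered("winter AdamSmith Smith"))
```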

Related

Pandas Full Name Split into First , Middle and Last Names

I have a pandas dataframe with a fullnames field. I want to change the logic so that the first name is the first word, the last name is the last word, and everything in between goes into the middle name field.
Note: the full name may contain only two words, in which case the middle name will be null. There may also be extra spaces between the names.
Current Logic:
fullnames = "Walter John Ross Schmidt"
first, middle, *last = fullnames.split()
print("First = {first}".format(first=first))
print("Middle = {middle}".format(middle=middle))
print("Last = {last}".format(last=" ".join(last)))
Output:
First = Walter
Middle = John
Last = Ross Schmidt
Expected Output:
FirstName = Walter
Middle = John Ross
Last = Schmidt
You can use negative indexing to get the last item in the list for the last name and also use a slice to get all but the first and last for the middle name:
fullnames = "Walter John Ross Schmidt"
first = fullnames.split()[0]
last = fullnames.split()[-1]
middle = " ".join(fullnames.split()[1:-1])
print("First = {first}".format(first=first))
print("Middle = {middle}".format(middle=middle))
print("Last = {last}".format(last=last))
PS: if you are working with a data frame, you can use:
import pandas as pd
df = pd.DataFrame({'fullnames': ['Walter John Ross Schmidt']})
df = df.assign(**{
    'first': df['fullnames'].str.split().str[0],
    'middle': df['fullnames'].str.split().str[1:-1].str.join(' '),
    'last': df['fullnames'].str.split().str[-1]
})
Output:
                  fullnames   first     middle     last
0  Walter John Ross Schmidt  Walter  John Ross  Schmidt
You can use capture groups in the regex passed to str.extract(), which will let you do this in a single operation:
import re
import pandas as pd
df = pd.DataFrame({
    "name": [
        "Walter John Ross Schmidt",
        "John Quincy Adams"
    ]
})
rx = re.compile(r'^(\w+)\s+(.*?)\s+(\w+)$')
df[['first', 'middle', 'last']] = df['name'].str.extract(pat=rx, expand=True)
This gives you:
                       name   first     middle     last
0  Walter John Ross Schmidt  Walter  John Ross  Schmidt
1         John Quincy Adams    John     Quincy    Adams
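One thing to watch with the pattern above: it requires two separate runs of whitespace, so a two-word name does not match at all. A possible variant, shown here with the plain re module, makes the middle part optional. The pattern and the helper name extract_parts are suggestions for illustration, not part of the answer above:

```python
import re

# The middle group and its trailing whitespace are wrapped in an optional
# non-capturing group, so two-word names still match.
NAME_RX = re.compile(r"^(\w+)\s+(?:(.*?)\s+)?(\w+)$")

def extract_parts(full_name):
    match = NAME_RX.match(full_name.strip())
    if match is None:
        return None  # e.g. a single-word name
    return match.groups()  # (first, middle or None, last)

print(extract_parts("Walter John Ross Schmidt"))
print(extract_parts("John Adams"))
```

The middle group comes back as None when it is absent, which fits the "middle name will be null" requirement from the question.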
I would use str.replace and str.extract here:
df["FirstName"] = df["FullName"].str.extract(r'^(\w+)')
df["Middle"] = df["FullName"].str.replace(r'^\w+\s+|\s+\w+$', '', regex=True)  # regex=True is required in pandas 2.0+, where the default is a literal match
df["Last"] = df["FullName"].str.extract(r'(\w+)$')
You can use the following line instead.
first, *middle, last = fullnames.split()
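To illustrate how the starred unpacking behaves with two-word names and extra spaces, here is a small sketch. The helper name split_full_name is made up for this example, and it assumes the name has at least two words (a single word would raise ValueError):

```python
def split_full_name(full_name):
    # split() with no arguments collapses runs of whitespace,
    # so extra spaces between names are harmless.
    first, *middle, last = full_name.split()
    # middle is a list; join it, or use None when it is empty.
    return first, " ".join(middle) or None, last

print(split_full_name("Walter John Ross Schmidt"))
print(split_full_name("  John   Adams "))
```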

Pandas dataframe: select list items in a column, then transform string on the items

One of the columns I'm importing into my dataframe is structured as a list. I need to pick out certain values from said list, transform the value and add it to one of two new columns in the dataframe. Before:
Name   Listed_Items
Tom    ["dr_md_coca_cola", "dr_od_water", "potatoes", "grass", "ot_other_stuff"]
Steve  ["dr_od_orange_juice", "potatoes", "grass", "ot_other_stuff", "dr_md_pepsi"]
Phil   ["dr_md_dr_pepper", "potatoes", "grass", "dr_od_coffee", "ot_other_stuff"]
From what I've read I can turn the column into a list
df["listed_items"] = df["listed_items"].apply(eval)
But then I cannot see how to find any list items that start with dr_md, extract the item, remove the starting dr_md, replace any underscores, capitalize the first letter and add that to a new MD column in the row. Then the same again for dr_od. There is only one item in the list that starts with dr_md and one that starts with dr_od in each row. Desired output:
Name   MD         OD
Tom    Coca Cola  Water
Steve  Pepsi      Orange Juice
Phil   Dr Pepper  Coffee
What you need to do is make a function that does the processing for you that you can pass into apply (or in this case, map). Alternatively, you could expand your list column into multiple columns and then process them afterwards, but that will only work if your lists are always in the same order (see panda expand columns with list into multiple columns). Because you only have one input column, you could use map instead of apply.
def process_dr_md(l: list):
    for s in l:
        if s.startswith("dr_md_"):
            # You can process your string further here
            return s[6:]  # strip the "dr_md_" prefix from the matching item
def process_dr_od(l: list):
    for s in l:
        if s.startswith("dr_od_"):
            # You can process your string further here
            return s[6:]  # strip the "dr_od_" prefix from the matching item
df["listed_items"] = df["listed_items"].map(eval)
df["MD"] = df["listed_items"].map(process_dr_md)
df["OD"] = df["listed_items"].map(process_dr_od)
I hope that gets you on your way!
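For the "process your string further" step, a minimal sketch of one way to finish the job. The helper name extract_prefixed is hypothetical, and ast.literal_eval is swapped in for eval here because it only parses literals, which is safer when the column comes from an external file:

```python
import ast

def extract_prefixed(items, prefix):
    """Return the first item starting with prefix, cleaned up for display."""
    for item in items:
        if item.startswith(prefix):
            # Drop the prefix, turn underscores into spaces, capitalize each word.
            return item[len(prefix):].replace("_", " ").title()
    return None  # no item with this prefix

raw = '["dr_md_dr_pepper", "potatoes", "grass", "dr_od_coffee", "ot_other_stuff"]'
items = ast.literal_eval(raw)  # parse the string into a real list
print(extract_prefixed(items, "dr_md_"))
print(extract_prefixed(items, "dr_od_"))
```

A function like this could be passed to map once per prefix, for example with a lambda that fixes the prefix argument.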
Use pivot_table:
df = df.explode('Listed_Items')
df = df[df.Listed_Items.str.contains('dr_')]
df['Type'] = df['Listed_Items'].str.contains('dr_md').map({True: 'MD',
                                                           False: 'OD'})
df.pivot_table(values='Listed_Items',
               columns='Type',
               index='Name',
               aggfunc='first')
Type                MD                  OD
Name
Phil   dr_md_dr_pepper        dr_od_coffee
Steve      dr_md_pepsi  dr_od_orange_juice
Tom    dr_md_coca_cola         dr_od_water
From here it's just a matter of beautifying your dataset as you wish.
I took a slightly different approach from the previous answers.
Given a df of the form:
    Name                                              Items
0    Tom  [dr_md_coca_cola, dr_od_water, potatoes, grass...
1  Steve  [dr_od_orange_juice, potatoes, grass, ot_other...
2   Phil  [dr_md_dr_pepper, potatoes, grass, dr_od_coffe...
and making the following assumptions:
only one item in a list matches the target mask
the target mask always appears at the start of the entry string
I created the following function to parse the list:
import re
def parse_Items(tgt_mask: str, itmList: list) -> str:
    p = re.compile(tgt_mask)
    for itm in itmList:
        if p.match(itm):
            # Keep everything after the matched prefix and replace underscores
            return itm[p.search(itm).span()[1]:].replace('_', ' ')
Then you can modify your original data frame with the following:
df['MD'] = [parse_Items('dr_md_', x) for x in df['Items'].to_list()]
df['OD'] = [parse_Items('dr_od_', x) for x in df['Items'].to_list()]
df.pop('Items')
This produces the following:
    Name         MD            OD
0    Tom  coca cola         water
1  Steve      pepsi  orange juice
2   Phil  dr pepper        coffee
I would normalize the data before putting it in a dataframe:
import pandas as pd
from typing import Dict, List, Tuple
def clean_stuff(text: str):
    # Drop the 6-character prefix ("dr_md_" or "dr_od_") and replace underscores
    clean_text = text[6:].replace('_', ' ')
    return " ".join([
        word.capitalize()
        for word in clean_text.split(" ")
    ])
def get_md_od(stuffs: List[str]) -> Tuple[str, str]:
    md_od = [s for s in stuffs if s.startswith(('dr_md', 'dr_od'))]
    # After sorting, the dr_md item comes before the dr_od item
    md_od = sorted(md_od)
    print(md_od)
    return clean_stuff(md_od[0]), clean_stuff(md_od[1])
dirty_stuffs = [{'Name': 'Tom',
                 'Listed_Items': ["dr_md_coca_cola",
                                  "dr_od_water",
                                  "potatoes",
                                  "grass",
                                  "ot_other_stuff"]},
                {'Name': 'Tom',
                 'Listed_Items': ["dr_md_coca_cola",
                                  "dr_od_water",
                                  "potatoes",
                                  "grass",
                                  "ot_other_stuff"]}
                ]
normalized_stuff: List[Dict[str, str]] = []
for stuff in dirty_stuffs:
    md, od = get_md_od(stuff['Listed_Items'])
    normalized_stuff.append({
        'Name': stuff['Name'],
        'MD': md,
        'OD': od,
    })
df = pd.DataFrame(normalized_stuff)
print(df)

Parse movie text data into a dataframe

I have some .txt data from a movie script that looks like this:
    JOHN
Hi man. How are you?

    TOM
A little hungry but okay.

    JOHN
Let's get breakfast then
I'd like to parse the text and create a dataframe with 2 columns: one for the person (e.g. JOHN and TOM) and a second for the lines (the block of text below each name). The result would look like:
index | person | lines
0 | JOHN | "Hi man. How are you?"
1 | TOM | "A little hungry but okay."
2 | JOHN | "Let's get breakfast then"
I know I'm late to this party, but this will parse an entire script into a dictionary with character names as keys and lists of their dialogue as values. All you need to do then is df = pd.DataFrame({k: pd.Series(v) for k, v in final_dict.items()}), which gives one column per character.
import re
# Grouped regex pattern to capture character and dialogue in a tuple
char_dialogue = re.compile(r"(?m)^\s*\b([A-Z]+)\b\s*\n(.*(?:\n.+)*)")
extract_dialogue = char_dialogue.findall(script)
final_dict = {}
for element in extract_dialogue:
    # Separating the character and dialogue from the tuple
    char = element[0]
    line = element[1]
    # If the char is already a key in the dictionary
    # and line is not empty, append the dialogue to the value list
    if char in final_dict:
        if line != '':
            final_dict[char].append(line)
    else:
        # Else add the character name to the dictionary keys with their first line
        # Drop any lower case matches from group 0
        # Can adjust the len here if you have characters with fewer letters
        if char.isupper() and len(char) > 2:
            final_dict[char] = [line]
# Some final cleaning to drop empty dialogue
final_dict = {k: v for k, v in final_dict.items() if v != ['']}
# More filtering to return only main characters with more than 50
# lines of dialogue
final_dict = {k: v for k, v in final_dict.items() if len(v) > 50}
If each block of dialogue fits on one line, you can split the text into lines:
lines = text.split('\n')
Remove the surrounding spaces:
lines = [x.strip() for x in lines]
And use a slice [start:end:step] to create the dataframe:
df = pd.DataFrame({
    'person': lines[0::3],
    'lines': lines[1::3]
})
Example:
text = '''    JOHN
Hi man. How are you?

    TOM
A little hungry but okay.

    JOHN
Let's get breakfast then'''
lines = text.split('\n')
lines = [x.strip() for x in lines]
import pandas as pd
df = pd.DataFrame({
    'person': lines[0::3],
    'lines': lines[1::3]
})
print(df)
Result:
  person                      lines
0   JOHN       Hi man. How are you?
1    TOM  A little hungry but okay.
2   JOHN   Let's get breakfast then
If a person's text may span several lines, e.g.
    JOHN
Hi man.
How are you?
then it needs more splitting and stripping.
You can do it before creating the DataFrame:
text = '''    JOHN
Hi man.
How are you?

    TOM
A little hungry but okay.

    JOHN
Let's get breakfast then'''
data = []
parts = text.split('\n\n')
for part in parts:
    person, lines = part.split('\n', 1)
    person = person.strip()
    lines = "\n".join(x.strip() for x in lines.split('\n'))
    data.append([person, lines])
import pandas as pd
df = pd.DataFrame(data)
df.columns = ['person', 'lines']
print(df)
Or you can try to do it after creating the DataFrame:
text = '''    JOHN
Hi man.
How are you?

    TOM
A little hungry but okay.

    JOHN
Let's get breakfast then'''
lines = text.split('\n\n')
lines = [x.split('\n', 1) for x in lines]
import pandas as pd
df = pd.DataFrame(lines)
df.columns = ['person', 'lines']
df['person'] = df['person'].str.strip()
df['lines'] = df['lines'].apply(lambda txt: "\n".join(x.strip() for x in txt.split('\n')))
print(df)
Result:
  person                      lines
0   JOHN      Hi man.\nHow are you?
1    TOM  A little hungry but okay.
2   JOHN   Let's get breakfast then
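If the blank lines between blocks are not reliable, another option is to walk the script line by line. This is only a sketch under one loud assumption: a line counts as a speaker name when it is entirely upper case, so a shouted dialogue line such as "STOP!" would be misread as a speaker. The function name parse_script is made up for this example:

```python
def parse_script(text):
    rows = []  # each entry is [person, lines]
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank separator lines
        if line.isupper():
            # Assumption: an all-caps line starts a new speaker block.
            rows.append([line, ""])
        elif rows:
            # Otherwise append the line to the current speaker's dialogue.
            rows[-1][1] = (rows[-1][1] + "\n" + line).strip()
    return rows

text = "    JOHN\nHi man.\nHow are you?\n\n    TOM\nA little hungry but okay.\n\n    JOHN\nLet's get breakfast then"
for person, lines in parse_script(text):
    print(person, repr(lines))
```

The resulting list of pairs can be handed to pd.DataFrame with columns ['person', 'lines'], as in the examples above.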

Seperate list into row and column based on delimiter

I have a list of emails I wanted to split into two columns.
df = [Smith, John <jsmith#abc.com>; Moores, Jordan <jmoores#abc.com>;
Manson, Tyler <tmanson#abc.com>; Foster, Ryan <rfoster#abc.com>]
list = df.split(';')
for i in list
print (i)
Expected result is to have two columns, one for name, and one for email:
Name            Email
Smith, John     jsmith#abc.com
Moores, Jordan  jmoores#abc.com
Manson, Tyler   tmanson#abc.com
Foster, Ryan    rfoster#abc.com
Do NOT use list as a variable name: it shadows the built-in list type. Here is a way to do it, assuming your input is a string:
import pandas as pd
data = "Smith, John <jsmith#abc.com>; Moores, Jordan <jmoores#abc.com>; Manson, Tyler <tmanson#abc.com>; Foster, Ryan <rfoster#abc.com>"
# Do not name a variable "list": it is the name of a built-in type in Python
l1 = data.split(';')
res = []
for i in l1:
    splt = i.strip().split()
    # First two words are the name; the last word is the email without the < > brackets
    res.append([" ".join(splt[:2]), splt[-1][1:-1]])
df = pd.DataFrame(res, columns=["Name", "Email"])
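The loop above assumes every name is exactly two words. A possible variant that does not depend on the word count splits each entry at the "<" instead. The function name split_contacts is hypothetical and this is a sketch, not a drop-in replacement:

```python
data = ("Smith, John <jsmith#abc.com>; Moores, Jordan <jmoores#abc.com>; "
        "Manson, Tyler <tmanson#abc.com>; Foster, Ryan <rfoster#abc.com>")

def split_contacts(text):
    rows = []
    for entry in text.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing ";"
        # Everything before "<" is the name, everything after is the email.
        name, _, email = entry.partition("<")
        rows.append((name.strip(), email.rstrip(">").strip()))
    return rows

rows = split_contacts(data)
for row in rows:
    print(row)
```

The list of (name, email) tuples can be passed to pd.DataFrame with columns=["Name", "Email"] in the same way as res.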

I am trying to split a full name into first, middle and last name in pandas, but I am stuck at replace

I am trying to break the name into parts: first I take the first name and the last name, then I remove both from the full name, so that the first name is always filled, then the last name, and whatever remains goes into the middle name column.
df['owner1_first_name'] = df['owner1_name'].str.split().str[0].astype(str, errors='ignore')
df['owner1_last_name'] = df['owner1_name'].str.split().str[-1].str.replace(df['owner1_first_name'], "").astype(str, errors='ignore')
df['owner1_middle_name'] = df['owner1_name'].str.replace(df['owner1_first_name'], "").str.replace(df['owner1_last_name'], "").astype(str, errors='ignore')
The problem is that I am not able to use
.str.replace(df['owner1_name'], "")
as I am getting an error:
"TypeError: 'Series' objects are mutable, thus they cannot be hashed"
Is there any replacement syntax in pandas for what I am trying to achieve?
My desired output, for the full name THOMAS MARY D in column owner1_name, is:
owner1_first_name = THOMAS
owner1_middle_name = MARY
owner1_last_name = D
I think you need mask, which replaces the middle name with an empty string where it equals the last name:
import pandas as pd
df = pd.DataFrame({'owner1_name':['THOMAS MARY D', 'JOE Long', 'MARY Small']})
splitted = df['owner1_name'].str.split()
df['owner1_first_name'] = splitted.str[0]
df['owner1_last_name'] = splitted.str[-1]
df['owner1_middle_name'] = splitted.str[1]
df['owner1_middle_name'] = (df['owner1_middle_name']
                            .mask(df['owner1_middle_name'] == df['owner1_last_name'], ''))
print (df)
     owner1_name owner1_first_name owner1_last_name owner1_middle_name
0  THOMAS MARY D            THOMAS                D               MARY
1       JOE Long               JOE             Long
2     MARY Small              MARY            Small
Which is the same as:
splitted = df['owner1_name'].str.split()
df['owner1_first_name'] = splitted.str[0]
df['owner1_last_name'] = splitted.str[-1]
middle = splitted.str[1]
df['owner1_middle_name'] = middle.mask(middle == df['owner1_last_name'], '')
print (df)
     owner1_name owner1_first_name owner1_last_name owner1_middle_name
0  THOMAS MARY D            THOMAS                D               MARY
1       JOE Long               JOE             Long
2     MARY Small              MARY            Small
EDIT:
To replace row by row, you can use apply with axis=1:
df = pd.DataFrame({'owner1_name':['THOMAS MARY-THOMAS', 'JOE LongJOE', 'MARY Small']})
splitted = df['owner1_name'].str.split()
df['a'] = splitted.str[0]
df['b'] = splitted.str[-1]
df['c'] = df.apply(lambda x: x['b'].replace(x['a'], ''), axis=1)
print (df)
          owner1_name       a            b      c
0  THOMAS MARY-THOMAS  THOMAS  MARY-THOMAS  MARY-
1         JOE LongJOE     JOE      LongJOE   Long
2          MARY Small    MARY        Small  Small
The exact code, in three statements, to achieve what I wanted in my question is:
df['owner1_first_name'] = df['owner1_name'].str.split().str[0]
df['owner1_last_name'] = df.apply(
    lambda x: x['owner1_name'].split()[-1].replace(x['owner1_first_name'], ''), axis=1)
df['owner1_middle_name'] = df.apply(
    lambda x: x['owner1_name'].replace(x['owner1_first_name'], '').replace(x['owner1_last_name'], ''), axis=1)
Just change your assignment and use another variable (note the .str accessor, since a Series has no split method of its own):
split = df['owner1_name'].str.split()
df['owner1_first_name'] = split.str[0]
df['owner1_middle_name'] = split.str[1]
df['owner1_last_name'] = split.str[-1]
splitted = df['Contact_Name'].str.split()
df['First_Name'] = splitted.str[0]
df['Last_Name'] = splitted.str[-1]
df['Middle_Name'] = df['Contact_Name'].loc[df['Contact_Name'].str.split().str.len() == 3].str.split(expand=True)[1]
This might help! The tricky part is inserting the middle name correctly, which the last line does for names with exactly three words.
I like to use the expand parameter. It will return a new dataframe with columns named 0, 1, 2. You can rename them in one line:
col_names = ['owner1_first_name', 'owner1_middle_name', 'owner1_last_name']
df.owner1_name.str.split(expand=True).rename(columns=dict(zip(range(len(col_names)), col_names)))
Beware that this code breaks if someone has four names. It is better to do it in 2 steps: split(n=1, expand=True) and then rsplit(n=1, expand=True).
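The two-step idea can be sketched on a single string with plain Python, where maxsplit plays the role of n. The function name split_in_two_steps is made up for this illustration, and it assumes the name is not empty:

```python
def split_in_two_steps(full_name):
    # Step 1: peel off the first word.
    head = full_name.split(maxsplit=1)
    first = head[0]
    rest = head[1] if len(head) > 1 else ""
    # Step 2: peel the last word off whatever remains.
    tail = rest.rsplit(maxsplit=1)
    if len(tail) == 2:
        middle, last = tail
    elif len(tail) == 1:
        middle, last = "", tail[0]
    else:
        middle, last = "", ""
    return first, middle, last

print(split_in_two_steps("THOMAS MARY D"))
print(split_in_two_steps("JOE Long"))
print(split_in_two_steps("ANNA MARIA VAN BERG"))
```

Because only the outermost words are peeled off, any number of middle words ends up in the middle part.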
