I have a list of emails that I want to split into two columns.
df = [Smith, John <jsmith#abc.com>; Moores, Jordan <jmoores#abc.com>;
Manson, Tyler <tmanson#abc.com>; Foster, Ryan <rfoster#abc.com>]
list = df.split(';')
for i in list:
    print(i)
Expected result is to have two columns, one for name, and one for email:
Name            Email
Smith, John     jsmith#abc.com
Moores, Jordan  jmoores#abc.com
Manson, Tyler   tmanson#abc.com
Foster, Ryan    rfoster#abc.com
Do NOT use list as a variable name; there's just no reason to. Here is a way to do it, assuming your input is a string:
import pandas as pd

data = "Smith, John <jsmith#abc.com>; Moores, Jordan <jmoores#abc.com>; Manson, Tyler <tmanson#abc.com>; Foster, Ryan <rfoster#abc.com>"
# Do not call things "list": it is not a keyword, but the name shadows Python's built-in list type
l1 = data.split(';')
res = []
for i in l1:
    splt = i.strip().split()
    # The first two tokens form the name; the last token is "<email>", so drop the angle brackets
    res.append([" ".join(splt[:2]), splt[-1][1:-1]])
df = pd.DataFrame(res, columns=["Name", "Email"])
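As an alternative sketch, the same pairs can be pulled out with a regular expression instead of positional slicing, which also copes with names that contain more than two words. The helper name `parse_contacts` and the pattern are my own choices rather than part of the answer above, and the sketch assumes every entry has the form `Name <email>` with entries separated by semicolons.

```python
import re

# Hypothetical pattern: a name (no "<" or ";"), then the email in angle brackets.
ENTRY = re.compile(r"\s*(?P<name>[^<;]+?)\s*<(?P<email>[^>]+)>")

def parse_contacts(text):
    """Return a list of (name, email) tuples found in text."""
    return [(m.group("name"), m.group("email")) for m in ENTRY.finditer(text)]

data = ("Smith, John <jsmith#abc.com>; Moores, Jordan <jmoores#abc.com>; "
        "Manson, Tyler <tmanson#abc.com>; Foster, Ryan <rfoster#abc.com>")
rows = parse_contacts(data)
```

The resulting list of tuples can be passed straight to `pd.DataFrame(rows, columns=["Name", "Email"])`.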
Related
I have a sample df:
id  Name
1   JohnWalter walter
2   AdamSmith Smith
3   Steve Rogers rogers
How can I check each row for a duplicated name and pop it out?
id  Name                 poped_out_string  corrected_name
1   JohnWalter walter    walter            John walter
2   AdamSmith Smith      Smith             Adam Smith
3   Steve Rogers rogers  rogers            Steve Rogers
You could try something like below:
import re  # Used to split CamelCase strings such as "JohnWalter"

# Get the unique items from a list, plus the items that occur more than once
def uniqueItems(input):
    seen = set()
    uniq = []
    dups = []
    for x in input:
        if x in seen:
            # Record each duplicate only once
            if x not in dups:
                dups.append(x)
        else:
            uniq.append(x.capitalize())
            seen.add(x)
    return {"unique": uniq, "duplicates": dups}

# Split a string on whitespace and before each capital letter
def splitString(inputString):
    stringProcess = re.sub(r"([A-Z])", r" \1", inputString).split()
    # Convert all parts to lower case for easy comparison
    convertToLower = [x.lower() for x in stringProcess]
    # Always return the dict so the caller can index it, even for a single word
    return uniqueItems(convertToLower)

# Iterate over rows in the data frame
for i, row in df.iterrows():
    split_result = splitString(row['Name'])
    if len(split_result["duplicates"]) > 0:  # If there are duplicates
        # "poped_out_string" is spelled as in the question; store a string rather than a list
        df.at[i, "poped_out_string"] = " ".join(split_result["duplicates"])
    if len(split_result["unique"]) > 1:
        df.at[i, "corrected_name"] = " ".join(split_result["unique"])
The general idea is to iterate over each row, split the string in the "Name" column, check for duplicates, and then write those values into the data frame.
import re
import pandas as pd

df = pd.DataFrame(['JohnWalter walter brown', 'winter AdamSmith Smith', 'Steve Rogers rogers'], columns=['Name'])
df
                      Name
0  JohnWalter walter brown
1   winter AdamSmith Smith
2      Steve Rogers rogers

def remove_dups(string):
    # Find names that start with a lower- or upper-case letter, followed by any characters other than spaces and capitals
    names = re.findall('[a-zA-Z][^A-Z ]*', string)
    # Collect the non-duplicates in a list (a set can't be used because it doesn't preserve the order of the names)
    new_names = []
    # Capitalize each name and append it if it has not been added already
    for name in names:
        if name.capitalize() not in new_names:
            new_names.append(name.capitalize())
    # Finally, construct the full name and return it
    return ' '.join(new_names)

df.Name.apply(remove_dups)
0 John Walter Brown
1 Winter Adam Smith
2 Steve Rogers
Name: Name, dtype: object
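The question also asks for the popped-out duplicate itself, which `remove_dups` discards. A minimal sketch that returns both values, reusing the same tokenising pattern, might look like the following; the function name `split_dups` is hypothetical.

```python
import re

def split_dups(name):
    """Return (corrected_name, popped_out_string) for one name string."""
    # Same tokenisation as remove_dups: split on spaces and before capitals
    parts = re.findall(r"[a-zA-Z][^A-Z ]*", name)
    kept, popped = [], []
    for part in parts:
        word = part.capitalize()
        if word in kept:
            popped.append(part)  # keep the duplicate as originally written
        else:
            kept.append(word)
    return " ".join(kept), " ".join(popped)

result = split_dups("JohnWalter walter")
```

Applied with `df["Name"].map(split_dups)`, this yields one tuple per row, which can then be unpacked into the two new columns.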
I have a pandas dataframe with a fullnames field. I want to change the logic so that the first and last name fields contain only the first and last word, and everything in between goes into the middle name field.
Note: the full name can contain just two words, in which case the middle name will be null, and there may also be extra spaces between the names.
Current Logic:
fullnames = "Walter John Ross Schmidt"
first, middle, *last = fullnames.split()
print("First = {first}".format(first=first))
print("Middle = {middle}".format(middle=middle))
print("Last = {last}".format(last=" ".join(last)))
Output :
First = Walter
Middle = John
Last = Ross Schmidt
Expected Output :
FirstName = Walter
Middle = John Ross
Last = Schmidt
You can use negative indexing to get the last item in the list for the last name and also use a slice to get all but the first and last for the middle name:
fullnames = "Walter John Ross Schmidt"
first = fullnames.split()[0]
last = fullnames.split()[-1]
middle = " ".join(fullnames.split()[1:-1])
print("First = {first}".format(first=first))
print("Middle = {middle}".format(middle=middle))
print("Last = {last}".format(last=last))
PS if you are working with a data frame you can use:
import pandas as pd

df = pd.DataFrame({'fullnames':['Walter John Ross Schmidt']})
df = df.assign(**{
    'first': df['fullnames'].str.split().str[0],
    'middle': df['fullnames'].str.split().str[1:-1].str.join(' '),
    'last': df['fullnames'].str.split().str[-1]
})
Output:
                  fullnames   first     middle     last
0  Walter John Ross Schmidt  Walter  John Ross  Schmidt
You can use capture groups in the regex passed to str.extract(), which will let you do this in a single operation:
import re
import pandas as pd

df = pd.DataFrame({
    "name": [
        "Walter John Ross Schmidt",
        "John Quincy Adams"
    ]
})
# Three capture groups: first word, everything in between, last word
rx = re.compile(r'^(\w+)\s+(.*?)\s+(\w+)$')
df[['first', 'middle', 'last']] = df['name'].str.extract(pat=rx, expand=True)
This gives you:
                       name   first     middle     last
0  Walter John Ross Schmidt  Walter  John Ross  Schmidt
1         John Quincy Adams    John     Quincy    Adams
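Note that the pattern above requires two separate runs of whitespace, so a two-word name such as "John Adams" does not match at all. A sketch of a variant with an optional middle group is shown below in plain `re`; the names `NAME_RX` and `split_name` are my own, and I am assuming names consist of word characters separated by whitespace.

```python
import re

# The middle group is optional so that two-word names still match.
NAME_RX = re.compile(r"^(\w+)(?:\s+(.*?))?\s+(\w+)$")

def split_name(fullname):
    """Return (first, middle, last); middle is None when there is none."""
    m = NAME_RX.match(fullname.strip())
    return m.groups() if m else (None, None, None)

parts = split_name("Walter John Ross Schmidt")
```

A single-word name still fails to match and comes back as three `None` values. The same pattern string should also be usable with `str.extract`.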
I would use str.replace and str.extract here:
df["FirstName"] = df["FullName"].str.extract(r'^(\w+)')
df["Middle"] = df["FullName"].str.replace(r'^\w+\s+|\s+\w+$', '', regex=True)
df["Last"] = df["FullName"].str.extract(r'(\w+)$')
You can use the following line instead.
first, *middle, last = fullnames.split()
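With starred unpacking, `middle` comes back as a list, so it still needs joining. Here is a small sketch that wraps the line above in a function (the name `split_fullname` is hypothetical) and returns `None` when there is no middle name.

```python
def split_fullname(fullname):
    """Split a full name into (first, middle, last) using starred unpacking."""
    # split() with no argument also collapses repeated whitespace
    first, *middle, last = fullname.split()
    # middle is a list; join it, or use None when it is empty
    return first, " ".join(middle) or None, last

parts = split_fullname("Walter John Ross Schmidt")
```

This raises a `ValueError` for names with fewer than two words, so single-word names would need separate handling.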
One of the columns I'm importing into my dataframe is structured as a list. I need to pick out certain values from said list, transform the value and add it to one of two new columns in the dataframe. Before:
Name   Listed_Items
Tom    ["dr_md_coca_cola", "dr_od_water", "potatoes", "grass", "ot_other_stuff"]
Steve  ["dr_od_orange_juice", "potatoes", "grass", "ot_other_stuff", "dr_md_pepsi"]
Phil   ["dr_md_dr_pepper", "potatoes", "grass", "dr_od_coffee","ot_other_stuff"]
From what I've read, I can turn the column into a list:
df["listed_items"] = df["listed_items"].apply(eval)
But then I cannot see how to find the list item that starts with dr_md, extract it, remove the leading dr_md, replace any underscores, capitalize the first letter and add the result to a new MD column in the row, and then do the same for dr_od. In each row, exactly one item in the list starts with dr_md and one with dr_od. Desired output:
Name   MD         OD
Tom    Coca Cola  Water
Steve  Pepsi      Orange Juice
Phil   Dr Pepper  Coffee
What you need to do is make a function that does the processing for you that you can pass into apply (or in this case, map). Alternatively, you could expand your list column into multiple columns and then process them afterwards, but that will only work if your lists are always in the same order (see panda expand columns with list into multiple columns). Because you only have one input column, you could use map instead of apply.
def process_dr_md(l: list):
    for s in l:
        if s.startswith("dr_md_"):
            # You can process your string further here
            return s[6:]  # slice the matching string, not the list

def process_dr_od(l: list):
    for s in l:
        if s.startswith("dr_od_"):
            # You can process your string further here
            return s[6:]

df["listed_items"] = df["listed_items"].map(eval)
df["MD"] = df["listed_items"].map(process_dr_md)
df["OD"] = df["listed_items"].map(process_dr_od)
I hope that gets you on your way!
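To illustrate the "process your string further" step, here is a sketch of a single helper that takes the prefix as a parameter and does the full clean-up. The name `extract_item` is hypothetical, and I am assuming title case is wanted, since the desired output shows "Coca Cola" rather than "Coca cola". It also uses `ast.literal_eval`, which only parses Python literals and is therefore a safer choice than `eval` for list text read from a file.

```python
import ast

def extract_item(items, prefix):
    """Return the cleaned-up first item starting with prefix, or None."""
    for item in items:
        if item.startswith(prefix):
            # Drop the prefix, turn underscores into spaces, capitalise each word
            return item[len(prefix):].replace("_", " ").title()
    return None

raw = '["dr_md_coca_cola", "dr_od_water", "potatoes", "grass", "ot_other_stuff"]'
items = ast.literal_eval(raw)  # parse the list literal stored as text
md = extract_item(items, "dr_md_")
od = extract_item(items, "dr_od_")
```

On a data frame this could be applied as `df["MD"] = df["listed_items"].map(lambda l: extract_item(l, "dr_md_"))`.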
Use pivot_table:
df = df.explode('Listed_Items')
df = df[df.Listed_Items.str.contains('dr_')]
df['Type'] = df['Listed_Items'].str.contains('dr_md').map({True: 'MD',
                                                           False: 'OD'})
df.pivot_table(values='Listed_Items',
               columns='Type',
               index='Name',
               aggfunc='first')
Type                MD                  OD
Name
Phil   dr_md_dr_pepper        dr_od_coffee
Steve      dr_md_pepsi  dr_od_orange_juice
Tom    dr_md_coca_cola         dr_od_water
From here it's just a matter of beautifying your dataset as you wish.
I took a slightly different approach from the previous answers.
Given a df of the form:
    Name                                              Items
0    Tom  [dr_md_coca_cola, dr_od_water, potatoes, grass...
1  Steve  [dr_od_orange_juice, potatoes, grass, ot_other...
2   Phil  [dr_md_dr_pepper, potatoes, grass, dr_od_coffe...
and making the following assumptions:
only one item in a list matches the target mask
the target mask always appears at the start of the entry string
I created the following function to parse the list:
import re

def parse_Items(tgt_mask: str, itmList: list) -> str:
    p = re.compile(tgt_mask)
    for itm in itmList:
        m = p.match(itm)
        if m:
            # Drop the matched prefix and replace underscores with spaces
            return itm[m.end():].replace('_', ' ')

Then you can modify your original data frame as follows:
df['MD'] = [parse_Items('dr_md_', x) for x in df['Items'].to_list()]
df['OD'] = [parse_Items('dr_od_', x) for x in df['Items'].to_list()]
df.pop('Items')
This produces the following:
    Name         MD            OD
0    Tom  coca cola         water
1  Steve      pepsi  orange juice
2   Phil  dr pepper        coffee
I would normalize the data before putting it in a dataframe:
import pandas as pd
from typing import Dict, List, Tuple

def clean_stuff(text: str):
    # Drop the 6-character prefix ("dr_md_" or "dr_od_") and replace underscores
    clean_text = text[6:].replace('_', ' ')
    return " ".join([
        word.capitalize()
        for word in clean_text.split(" ")
    ])

def get_md_od(stuffs: List[str]) -> Tuple[str, str]:
    md_od = [s for s in stuffs if s.startswith(('dr_md', 'dr_od'))]
    # Sorting puts the dr_md item before the dr_od item
    md_od = sorted(md_od)
    print(md_od)
    return clean_stuff(md_od[0]), clean_stuff(md_od[1])

dirty_stuffs = [{'Name': 'Tom',
                 'Listed_Items': ["dr_md_coca_cola",
                                  "dr_od_water",
                                  "potatoes",
                                  "grass",
                                  "ot_other_stuff"]},
                {'Name': 'Tom',
                 'Listed_Items': ["dr_md_coca_cola",
                                  "dr_od_water",
                                  "potatoes",
                                  "grass",
                                  "ot_other_stuff"]}
                ]

normalized_stuff: List[Dict[str, str]] = []
for stuff in dirty_stuffs:
    md, od = get_md_od(stuff['Listed_Items'])
    normalized_stuff.append({
        'Name': stuff['Name'],
        'MD': md,
        'OD': od,
    })

df = pd.DataFrame(normalized_stuff)
print(df)
Using the existing column name, add a new column first_name to df such that the new column splits the name into multiple words and takes the first word as its first name. For example, if the name is Elon Musk, it is split into two words in the list ['Elon', 'Musk'] and the first word Elon is taken as its first name. If the name has only one word, then the word itself is taken as its first name.
A snippet of the data frame
Name
Alemsah Ozturk
Igor Arinich
Christopher Maloney
DJ Holiday
Brian Tracy
Philip DeFranco
Patrick Collison
Peter Moore
Dr.Darrell Scott
Atul Gawande
Everette Taylor
Elon Musk
Nelly_Mo
This is what I have so far. I am not sure how to extract the name after I tokenize it.
import nltk
first = df.name.apply(lambda x: nltk.word_tokenize(x))
df["first_name"] = ...  # This is where I'm stuck
Try this snippet:
df["first_name"] = df['Name'].map(lambda x: x.split()[0])
# Use [-1] rather than [1] so that single-word names do not raise an IndexError
df["last_name"] = df['Name'].map(lambda x: x.split()[-1])
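For the requirement that a single-word name is its own first name, a named function makes the rule explicit. This is only a sketch with a hypothetical helper `first_name`; it splits on whitespace, so "Nelly_Mo" counts as one word.

```python
def first_name(name):
    """Return the first whitespace-separated word, or the name itself."""
    words = name.split()
    # An empty or all-blank name has no words, so fall back to the input
    return words[0] if words else name

names = ["Elon Musk", "Dr.Darrell Scott", "Nelly_Mo"]
firsts = [first_name(n) for n in names]
```

On the data frame it would be applied as `df["first_name"] = df["Name"].map(first_name)`.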
I am trying to create a new column in this data frame. The data set has multiple records for each PERSON because each record is a different account. The new column values should be a combination of the values for each PERSON in the TYPE column. For example, if John Doe has four accounts, the value next to his name in the new column should be a concatenation of the values in TYPE. An example of the final data frame is below. Thanks in advance.
[Image: example of the final data frame]
You can do this in two lines (first code, then explanation):
Code:
in: name_types = df.pivot_table(index='Name', values='AccountType', aggfunc=set)
out:
                AccountType
Name
Jane Doe                {D}
John Doe          {L, W, D}
Larry Wild           {L, D}
Patti Shortcake      {L, W}
in: df['ClientType'] = df['Name'].apply(lambda x: name_types.loc[x]['AccountType'])
Explanation:
The pivot table collects all the AccountTypes for each individual name and removes duplicates by using set as the aggregate function.
The apply function then iterates through each 'Name' in the main data frame, looks up the AccountType associated with it in name_types, and adds it to the new column ClientType in the main data frame.
And you're done!
Addendum:
If you need the column to be a string instead of a set, use:
in: def to_string(the_set):
        # Sets are unordered, so the order of the characters is not guaranteed
        string = ''
        for item in the_set:
            string += item
        return string
in: df['ClientType'] = df['ClientType'].apply(to_string)
in: df.head()
out:
       Name AccountType ClientType
0  Jane Doe           D          D
1  John Doe           D        LDW
2  John Doe           D        LDW
3  John Doe           L        LDW
4  John Doe           D        LDW
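The two steps in the explanation (collect the distinct types per name, then concatenate them) can also be sketched in plain Python. The helper name `combined_types` and the sample records are illustrative only; sorting the set is my own addition so that the resulting string is the same on every run.

```python
from collections import defaultdict

def combined_types(records):
    """Map each name to its distinct account types, sorted and concatenated."""
    seen = defaultdict(set)
    for name, account_type in records:
        seen[name].add(account_type)  # the set drops repeated types
    return {name: "".join(sorted(types)) for name, types in seen.items()}

# Illustrative (name, account type) pairs, one per account
records = [("Jane Doe", "D"), ("John Doe", "D"), ("John Doe", "L"),
           ("John Doe", "W"), ("Larry Wild", "L"), ("Larry Wild", "D")]
lookup = combined_types(records)
```

With a data frame, the records could come from `zip(df['Name'], df['AccountType'])`, and the result could be written back with `df['ClientType'] = df['Name'].map(lookup)`.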