How to deal with long names in data cleaning?

I have a database of users in which each row holds two names in a single column. I want to separate them into two columns, User1 and User2.
My approach was to split each row on whitespace into multiple columns and then merge the pieces back into the two user columns.
The problem is that some names are longer than others, so after the split the name parts land in different columns from row to row, which makes it hard to merge them correctly.
Users
Maria Melinda Del Valle Justin Howard
Devin Craig Jr. Michael Carter III
Jeanne De Bordeaux Alhamdi
After I split the Users column:
0 1 2 3 4 5 6 7 8
Maria Melinda Del Valle Justin Howard
Devin Craig Jr. Michael Carter III
Jeanne De Bordeaux Alhamdi
The expected result is the following:
User1                    User2
Maria Melinda Del Valle  Justin Howard
Devin Craig Jr.          Michael Carter III
Jeanne De Bordeaux       Alhamdi

You can use:
def f(sr):
    m = sr.isna().cumsum().loc[lambda x: x < 2]
    return sr.dropna().groupby(m).apply(' '.join)

out = df.apply(f, axis=1).rename(columns=lambda x: f'User{x+1}')
Output:
>>> out
User1 User2
0 Maria Melinda Del Valle Justin Howard
1 Devin Craig Jr. Michael Carter III
2 Jeanne De Bordeaux Alhamdi
As suggested by @Barmar, if you know where to put the blank columns in the first split, you already know how to create both columns directly.
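For readers who want to run the approach above end to end, here is a minimal sketch. It assumes the split frame has a NaN cell between the two names in each row and NaN padding at the end; those blank positions are an assumption, since the question does not show them. The function f is the same one used in the answer.

```python
import numpy as np
import pandas as pd

# Assumed layout: one NaN separates the two names, trailing cells are NaN padding.
df = pd.DataFrame([
    ["Maria", "Melinda", "Del", "Valle", np.nan, "Justin", "Howard", np.nan, np.nan],
    ["Devin", "Craig", "Jr.", np.nan, "Michael", "Carter", "III", np.nan, np.nan],
    ["Jeanne", "De", "Bordeaux", np.nan, "Alhamdi", np.nan, np.nan, np.nan, np.nan],
])

def f(sr):
    # Running count of NaNs: 0 before the first gap, 1 after it.
    # Cells where the count reaches 2 or more are excluded from the grouping key.
    m = sr.isna().cumsum().loc[lambda x: x < 2]
    # Join the non-missing name parts within each group.
    return sr.dropna().groupby(m).apply(" ".join)

out = df.apply(f, axis=1).rename(columns=lambda x: f"User{x + 1}")
print(out)
```

The grouping key is simply "how many gaps have been passed so far", so the approach does not depend on how many parts each name has.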

Related

NetworkX graph with some specifications based on two dataframes

I have two dataframes. The first, called df_student, lists the names of the people in a program.
Student-ID  Name
20202456    Luke De Paul
20202713    Emil Smith
20202456    Alexander Müller
20202713    Paul Bernard
20202456    Zoe Michailidis
20202713    Joanna Grimaldi
20202456    Kepler Santos
20202713    Dominic Borg
20202456    Jessica Murphy
20202713    Danielle Dominguez
The second, called df_course, lists for each course the people who reached the best grade; every course includes at least one person from df_student.
Course-ID  Name                                Grade
UNI44      Luke De Paul, Benjamin Harper       17
UNI45      Dominic Borg                        20
UNI61      Luke De Paul, Jonathan MacAllister  20
UNI62      Alexander Müller, Kepler Santos     17
UNI63      Joanna Grimaldi                     19
UNI65      Emil Smith, Filippo Visconti        18
UNI71      Moshe Azerad, Emil Smith            18
UNI72      Luke De Paul, Jessica Murphy        18
UNI73      Luke De Paul, Filippo Visconti      17
UNI74      Matthias Noem, Kepler Santos        19
UNI75      Luke De Paul, Kepler Santos         16
UNI76      Kepler Santos                       17
UNI77      Kepler Santos, Benjamin Harper      17
UNI78      Dominic Borg, Kepler Santos         18
UNI80      Luke De Paul, Gabriel Martin        18
UNI81      Dominic Borg, Alexander Müller      19
UNI82      Luke De Paul, Giancarlo Di Lorenzo  20
UNI83      Emil Smith,Joanna Grimaldi          20
I would like to create a NetworkX graph with a vertex for each student in df_student and for each student in df_course. There should be an unweighted edge between two vertices only if the two students received the best grade in the same course.
This is what I tried:
import networkx as nx
G = nx.Graph()
G.add_edge(student, course)
But when I run this, I get an error saying that the argument is not right, and I don't know how to continue.

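Try:
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd

# Copy each table to the clipboard before the corresponding call
df_students = pd.read_clipboard()
df_course = pd.read_clipboard()

# One column per name; strip the spaces left over after the commas
df_s_t = df_course['Name'].str.split(',', expand=True).apply(lambda s: s.str.strip())

# Rows with two names become edges
df_net = df_s_t[df_s_t.notna().all(axis=1)]
G = nx.from_pandas_edgelist(df_net, 0, 1)

# Add every student, plus the names from single-name courses, as nodes
G.add_nodes_from(pd.concat([df_students['Name'],
                            df_s_t.loc[~df_s_t.notna().all(axis=1), 0]]))

fig, ax = plt.subplots(1, 1, figsize=(15, 15))
nx.draw_networkx(G, ax=ax)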
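If the clipboard step is inconvenient, the same graph can be sketched without it. The example below is only an illustration: it uses a three-row excerpt of the tables from the question, and the loop with itertools.combinations is a swapped-in way of producing the edges rather than the from_pandas_edgelist call used above.

```python
from itertools import combinations

import networkx as nx
import pandas as pd

# Small excerpt of the question's data (illustrative subset only).
df_students = pd.DataFrame({"Name": ["Luke De Paul", "Emil Smith", "Joanna Grimaldi"]})
df_course = pd.DataFrame({
    "Course-ID": ["UNI44", "UNI63", "UNI83"],
    "Name": ["Luke De Paul, Benjamin Harper", "Joanna Grimaldi", "Emil Smith,Joanna Grimaldi"],
    "Grade": [17, 19, 20],
})

G = nx.Graph()
# Every enrolled student is a vertex, even without any edge.
G.add_nodes_from(df_students["Name"])

for cell in df_course["Name"]:
    # Split on commas and strip spaces so "Emil Smith,Joanna Grimaldi" also works.
    names = [n.strip() for n in cell.split(",")]
    G.add_nodes_from(names)
    # Connect every pair of students who share the best grade in this course.
    G.add_edges_from(combinations(names, 2))

print(sorted(G.nodes))
print(sorted(G.edges))
```

Because combinations yields nothing for a single name, courses with one top student add a vertex but no edge.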

Update DataFrame based on matching rows in another DataFrame

Say there is a group of people who can choose an English and / or a Spanish word. Let's say they chose like this:
>>> pandas.DataFrame(dict(person=['mary','james','patricia','robert','jennifer','michael'],english=['water',None,'hello','thanks',None,'green'],spanish=[None,'agua',None,None,'bienvenido','verde']))
person english spanish
0 mary water None
1 james None agua
2 patricia hello None
3 robert thanks None
4 jennifer None bienvenido
5 michael green verde
Say I also have an English-Spanish dictionary (assume no duplicates, i.e. one-to-one relationship):
>>> pandas.DataFrame(dict(english=['hello','bad','green','thanks','welcome','water'],spanish=['hola','malo','verde','gracias','bienvenido','agua']))
english spanish
0 hello hola
1 bad malo
2 green verde
3 thanks gracias
4 welcome bienvenido
5 water agua
How can I fill in any missing words, i.e. update the first DataFrame using the second DataFrame where either english or spanish is None, to arrive at this:
>>> pandas.DataFrame(dict(person=['mary','james','patricia','robert','jennifer','michael'],english=['water','water','hello','thanks','welcome','green'],spanish=['agua','agua','hola','gracias','bienvenido','verde']))
person english spanish
0 mary water agua
1 james water agua
2 patricia hello hola
3 robert thanks gracias
4 jennifer welcome bienvenido
5 michael green verde

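You can combine map with fillna: look up each word in the dictionary DataFrame (df2) and use the result to fill the gaps in the other column: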
df['english'] = df['english'].fillna(df['spanish'].map(df2.set_index('spanish')['english']))
df['spanish'] = df['spanish'].fillna(df['english'].map(df2.set_index('english')['spanish']))
df
Out[200]:
person english spanish
0 mary water agua
1 james water agua
2 patricia hello hola
3 robert thanks gracias
4 jennifer welcome bienvenido
5 michael green verde
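For completeness, a self-contained version of the same technique is sketched below. It rebuilds both frames from the question; the names df, df2, es_to_en and en_to_es are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame(dict(
    person=["mary", "james", "patricia", "robert", "jennifer", "michael"],
    english=["water", None, "hello", "thanks", None, "green"],
    spanish=[None, "agua", None, None, "bienvenido", "verde"],
))
df2 = pd.DataFrame(dict(
    english=["hello", "bad", "green", "thanks", "welcome", "water"],
    spanish=["hola", "malo", "verde", "gracias", "bienvenido", "agua"],
))

# Lookup Series in both directions; this relies on the one-to-one assumption.
es_to_en = df2.set_index("spanish")["english"]
en_to_es = df2.set_index("english")["spanish"]

# Fill missing English words from the Spanish column, then the reverse.
df["english"] = df["english"].fillna(df["spanish"].map(es_to_en))
df["spanish"] = df["spanish"].fillna(df["english"].map(en_to_es))
print(df)
```

The order matters only in that the second line benefits from the English column already being complete.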

Fill subsequent values beneath an existing value in pandas dataframe column

I have a pandas DataFrame df.
I want to fill each empty cell in a column with the value that precedes it, and when I reach a new value, carry that one forward instead.
That way the dept column is complete, and I can merge this dataset with another to link department info to the PIs.
I don't know the best approach. Is there a vectorized way to do this, or would it require looping, maybe with iterrows() or itertuples()?
import numpy as np
import pandas as pd

data = {"dept": ["Emergency Medicine", "", "", "", "Family Practice", "", ""],
        "pi": [np.nan, "Tiger Woods", "Michael Jordan", "Roger Federer", np.nan, "Serena Williams", "Alex Morgan"]
}
df = pd.DataFrame(data=data)
dept pi
0 Emergency Medicine
1 Tiger Woods
2 Michael Jordan
3 Roger Federer
4 Family Practice
5 Serena Williams
6 Alex Morgan
desired_df
dept pi
0 Emergency Medicine
1 Emergency Medicine Tiger Woods
2 Emergency Medicine Michael Jordan
3 Emergency Medicine Roger Federer
4 Family Practice
5 Family Practice Serena Williams
6 Family Practice Alex Morgan

Use where to replace the empty strings with NaN, then ffill to carry the last valid value forward:
# if you have empty strings
mask = df['dept'].ne('')
df['dept'] = df['dept'].where(mask).ffill()
# otherwise, just
# df['dept'] = df['dept'].ffill()
Output:
dept pi
0 Emergency Medicine NaN
1 Emergency Medicine Tiger Woods
2 Emergency Medicine Michael Jordan
3 Emergency Medicine Roger Federer
4 Family Practice NaN
5 Family Practice Serena Williams
6 Family Practice Alex Morgan
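A runnable sketch of the same idea, rebuilt from the question's data so it can be pasted as is:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "dept": ["Emergency Medicine", "", "", "", "Family Practice", "", ""],
    "pi": [np.nan, "Tiger Woods", "Michael Jordan", "Roger Federer",
           np.nan, "Serena Williams", "Alex Morgan"],
})

# where() keeps values where the condition holds and puts NaN elsewhere,
# so empty strings become NaN; ffill() then copies the last valid value down.
df["dept"] = df["dept"].where(df["dept"].ne("")).ffill()
print(df)
```

No explicit loop is needed: both where and ffill operate on the whole column at once.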

Pandas -Split data and create columns when string occurs

I am looking to read in a text file (see below) and then create columns for the English leagues only. Wherever the alias name starts with "England_", I want to create a new column with the alias name as the header and the player names in the rows. Note that the first alias is introduced by "Aliases for" rather than "Alias name" in the text file.
"-----------------------------------------------------------------------------------------------------------"
"- NEW TEAM -"
"-----------------------------------------------------------------------------------------------------------"
Europe Players
17/04/2019
07:59 p.m.
Aliases for England_Premier League
-------------------------------------------------------------------------------
Harry Kane
Mohamed Salah
Kevin De Bruyne
The command completed successfully.
Alias name England_Division 1
Comment Teams
Members
-------------------------------------------------------------------------------
Will Grigg
Jonson Clarke-Harris
Jerry Yates
Ivan Toney
Troy Parrott
The command completed successfully.
Alias name Spanish La Liga
Comment
Members
-------------------------------------------------------------------------------
Lionel Messi
Luis Suarez
Cristiano Ronaldo
Sergio Ramos
The command completed successfully.
Alias name England_Division 2
Comment
Members
-------------------------------------------------------------------------------
Eoin Doyle
Matt Watters
James Vughan
The command completed successfully.
This is how I'm currently reading in the data:
df = pd.read_csv(r'Desktop\SampleData.txt', sep='\n', header=None)
This gives me a pandas DataFrame with one column. I'm fairly new to Python, so I'm wondering how I would get the result below. Should I use a delimiter when reading in the file?
England_Premier League  England_Division 1    England_Division 2
Harry Kane              Will Grigg            Eoin Doyle
Mohamed Salah           Jonson Clarke-Harris  Matt Watters
Kevin De Bruyne         Ivan Toney            James Vughan
                        Troy Parrott

You can use the re module for this task. For example:
import re
import pandas as pd
txt = """
"-----------------------------------------------------------------------------------------------------------"
"- NEW TEAM -"
"-----------------------------------------------------------------------------------------------------------"
Europe Players
17/04/2019
07:59 p.m.
Aliases for England_Premier League
-------------------------------------------------------------------------------
Harry Kane
Mohamed Salah
Kevin De Bruyne
The command completed successfully.
Alias name England_Division 1
Comment Teams
Members
-------------------------------------------------------------------------------
Will Grigg
Jonson Clarke-Harris
Jerry Yates
Ivan Toney
Troy Parrott
The command completed successfully.
Alias name Spanish La Liga
Comment
Members
-------------------------------------------------------------------------------
Lionel Messi
Luis Suarez
Cristiano Ronaldo
Sergio Ramos
The command completed successfully.
Alias name England_Division 2
Comment
Members
-------------------------------------------------------------------------------
Eoin Doyle
Matt Watters
James Vughan
The command completed successfully.
"""
# Captures the alias after "Aliases for" or "Alias name"
r_competitions = re.compile(r"^Alias(?:(?:es for)| name)\s*(.*?)$", flags=re.M)
# Captures everything between a dashed line and "The command ..."
r_names = re.compile(r"^-+$\s*(.*?)\s*The command", flags=re.M | re.S)

dfs = []
for comp, names in zip(r_competitions.findall(txt), r_names.findall(txt)):
    if "England" not in comp:
        continue
    data = []
    for n in names.split("\n"):
        data.append({comp: n})
    dfs.append(pd.DataFrame(data))

print(pd.concat(dfs, axis=1).fillna(""))
Prints:
   England_Premier League  England_Division 1    England_Division 2
0  Harry Kane              Will Grigg            Eoin Doyle
1  Mohamed Salah           Jonson Clarke-Harris  Matt Watters
2  Kevin De Bruyne         Jerry Yates           James Vughan
3                          Ivan Toney
4                          Troy Parrott

How to create a new column with values from comparing two other columns?

I am working on a project that will perform an audit of employees with computer accounts. I want to print one data frame with the two new columns in it. This is different from the Comparing Columns in Dataframes question because I am working with strings. I will also need to do some fuzzy logic but that is further down the line.
The data I receive is in Excel sheets. It comes from two sources that I don't have control over and so I format them to be [First Name, Last Name] and print them to the console to ensure the data I am working with is correct. I convert the .xls to .csv files, format the information and am able to output the two lists of names in a single dataframe with two columns but have not been able to put the values I want in the last two columns. I have used query (which returned True/False, not the names), diff and regex. I assume that I am just using the tools incorrectly.
import pandas as pd
nd = {'col1': ["Abraham Hansen","Demetrius McMahon","Hilary Emerson","Amelia H. Hayden","Abraham Oliver"],
      'col2': ["Abraham Hansen","Abe Oliver","Hillary Emerson","DJ McMahon","Amelia H. Hayden"]}
info = pd.DataFrame(data=nd)
for row in info:
    if info.col1.value not in info.col2:
        info["Need Account"] = info.col1.value
    if info.col2.value not in info.col1:
        info["Delete Account"] = info.col2.value
print(info)
What I would like is a new dataframe with two columns, Need Account and Delete Account, filled with the appropriate values based on the other columns in the dataframe. Currently, I am getting an error that 'Series' has no attribute 'value'.
Here is an example of my expected output:
df_out:
Need Account Delete Account
Demetrius McMahon Abe Oliver
Abraham Oliver Hillary Emerson
Hilary Emerson DJ McMahon
From this list I can see whose nickname showed up and pare the list down from there.

You want to use isin and np.where to conditionally assign the new values:
import numpy as np

info['Need Account'] = np.where(~info['col1'].isin(info['col2']), info['col1'], np.nan)
info['Delete Account'] = np.where(~info['col2'].isin(info['col1']), info['col2'], np.nan)
col1 col2 Need Account Delete Account
0 Abraham Hansen Abraham Hansen NaN NaN
1 Demetrius McMahon Abe Oliver Demetrius McMahon Abe Oliver
2 Hilary Emerson Hillary Emerson Hilary Emerson Hillary Emerson
3 Amelia H. Hayden DJ McMahon NaN DJ McMahon
4 Abraham Oliver Amelia H. Hayden Abraham Oliver NaN
Or, if you want a new dataframe as stated in your question:
need = np.where(~info['col1'].isin(info['col2']), info['col1'], np.nan)
delete = np.where(~info['col2'].isin(info['col1']), info['col2'], np.nan)
newdf = pd.DataFrame({'Need Account': need,
                      'Delete Account': delete})
Need Account Delete Account
0 NaN NaN
1 Demetrius McMahon Abe Oliver
2 Hilary Emerson Hillary Emerson
3 NaN DJ McMahon
4 Abraham Oliver NaN
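If the NaN placeholders are not wanted, a compact two-column frame closer to the question's expected output can be sketched with boolean indexing instead of np.where. This is a variation on the answer above, not part of it; the name df_out follows the question, and the row order follows each source column rather than the order shown in the question.

```python
import pandas as pd

nd = {"col1": ["Abraham Hansen", "Demetrius McMahon", "Hilary Emerson",
               "Amelia H. Hayden", "Abraham Oliver"],
      "col2": ["Abraham Hansen", "Abe Oliver", "Hillary Emerson",
               "DJ McMahon", "Amelia H. Hayden"]}
info = pd.DataFrame(data=nd)

# Names in col1 that never appear in col2, and the reverse.
need = info.loc[~info["col1"].isin(info["col2"]), "col1"].reset_index(drop=True)
delete = info.loc[~info["col2"].isin(info["col1"]), "col2"].reset_index(drop=True)

# Side-by-side concat; a shorter column would be padded with NaN at the end.
df_out = pd.concat([need.rename("Need Account"),
                    delete.rename("Delete Account")], axis=1)
print(df_out)
```

Resetting the index before concatenating is what removes the gaps, since concat aligns the two Series on their index.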

I'm guessing at your expected output based on what your code attempts. Let me know if this is what you are looking for.
nd = {'col1': ["Abraham Hansen","Demetrius McMahon","Hilary Emerson","Amelia H. Hayden","Abraham Oliver"],
'col2': ["Abraham Hansen","Abe Oliver","Hillary Emerson","DJ McMahon","Amelia H. Hayden"],
'Need Account':"",
'Delete Account':""
}
info = pd.DataFrame(data=nd)
print(info)
col1 col2 Need Account Delete Account
0 Abraham Hansen Abraham Hansen
1 Demetrius McMahon Abe Oliver
2 Hilary Emerson Hillary Emerson
3 Amelia H. Hayden DJ McMahon
4 Abraham Oliver Amelia H. Hayden
Don't use loops; use vectorized operations. Note that this compares col1 and col2 row by row:
info.loc[info['col1'] != info['col2'], 'Need Account'] = info['col1']
info.loc[info['col2'] != info['col1'], 'Delete Account'] = info['col2']
print(info)
col1 col2 Need Account Delete Account
0 Abraham Hansen Abraham Hansen
1 Demetrius McMahon Abe Oliver Demetrius McMahon Abe Oliver
2 Hilary Emerson Hillary Emerson Hilary Emerson Hillary Emerson
3 Amelia H. Hayden DJ McMahon Amelia H. Hayden DJ McMahon
4 Abraham Oliver Amelia H. Hayden Abraham Oliver Amelia H. Hayden

IIUC, it doesn't seem like there is much 'structure' to be maintained from your input dataframe, so you could use sets to compare membership in groups directly.
nd = {'col1': ["Abraham Hansen","Demetrius McMahon","Hilary Emerson","Amelia H. Hayden","Abraham Oliver"],
'col2': ["Abraham Hansen","Abe Oliver","Hillary Emerson","DJ McMahon","Amelia H. Hayden"]}
df = pd.DataFrame(data=nd)
col1 = set(df['col1'])
col2 = set(df['col2'])
need = col1 - col2
delete = col2 - col1
print('need = ', need)
print('delete = ', delete)
yields
need = {'Hilary Emerson', 'Demetrius McMahon', 'Abraham Oliver'}
delete = {'Hillary Emerson', 'DJ McMahon', 'Abe Oliver'}
You could then place in a new dataframe:
data = {'need':list(need), 'delete':list(delete)}
new_df = pd.DataFrame.from_dict(data, orient='index').transpose()
(Edited to account for possibility that need and delete are of unequal length.)
