Get first element of tokenized words in a row - python

Using the existing column name, add a new column first_name to df such that the new column splits the name into multiple words and takes the first word as its first name. For example, if the name is Elon Musk, it is split into two words in the list ['Elon', 'Musk'] and the first word Elon is taken as its first name. If the name has only one word, then the word itself is taken as its first name.
A snippet of the data frame
Name
Alemsah Ozturk
Igor Arinich
Christopher Maloney
DJ Holiday
Brian Tracy
Philip DeFranco
Patrick Collison
Peter Moore
Dr.Darrell Scott
Atul Gawande
Everette Taylor
Elon Musk
Nelly_Mo
This is what I have so far. I am not sure how to extract the name after I tokenize it
import nltk
first = df.name.apply(lambda x: nltk.word_tokenize(x))
df["first_name"] = This is where I'm stuck

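Try this snippet. Calling split() with no argument splits on any whitespace, and index 0 always exists, so a one-word name such as Nelly_Mo is returned unchanged:
df["first_name"] = df['Name'].map(lambda x: x.split()[0])
# Index [-1] rather than [1] avoids an IndexError on one-word names
df["last_name"] = df['Name'].map(lambda x: x.split()[-1])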
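The same idea can also be written with the vectorized str accessor instead of a lambda. This is a minimal sketch on a few names taken from the question; the small DataFrame built here is only for illustration.

```python
import pandas as pd

# A few names from the question, including a one-word name
df = pd.DataFrame({"Name": ["Elon Musk", "Dr.Darrell Scott", "Nelly_Mo", "DJ Holiday"]})

# str.split() with no argument splits on any run of whitespace;
# .str[0] takes the first token, which is the whole string for one-word names
df["first_name"] = df["Name"].str.split().str[0]

print(df)
```

Note that "Dr.Darrell" stays one token because it contains no space; handling titles would need an extra rule that the question does not specify.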

Related

How can I find the mean 'vote_average' for each actor?

In my movie data dataframe I have a column named 'cast', which contains a string of all the cast members for that given movie separated by a pipe character.
For example, the movie 'Jurassic World' has "Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vincent D'Onofrio|Nick Robinson" in its cast column.
Some actors appear multiple times in the dataframe for separate movies.
I want to compare each separate cast member against another column called 'vote_average' and find each cast member's mean 'vote_average' across all the movies that they have been in.
I have tried df['cast'].str.cat(sep = '|').split('|') to get a list containing all actors, but not sure where to go from here?
From what I could interpret from your question, you have a DataFrame that looks a bit like this:
import pandas as pd
df = pd.DataFrame({"film": ["Jurassic World", "Jurassic World: Fallen Kingdom"],
"cast": ["Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vincent D'Onofrio|Nick Robinson",
"Chris Pratt|Bryce Dallas Howard|Rafe Spall"],
"vote_average": [5, 4]})
You then split all the actors in cast by "|" to a list of actors:
df['cast'] = df['cast'].apply(lambda x: x.split('|'))
To find the average vote_average for each actor, you can then explode the column so each actor is in a separate row:
df = df.explode('cast')
Then finally, group the actors, and calculate the mean vote_average:
actors_mean_vote_avg = df.groupby('cast')['vote_average'].mean()
actors_mean_vote_avg
#Out:
#cast
#Bryce Dallas Howard 4.5
#Chris Pratt 4.5
#Irrfan Khan 5.0
#Nick Robinson 5.0
#Rafe Spall 4.0
#Vincent D'Onofrio 5.0
#Name: vote_average, dtype: float64
If this is not correct, please can you provide an example of your DataFrame, and an example of the desired output.
Since I didn't have your DataFrame, I invented one from what I understood of your question.
List generator (just to build example film titles for the df):
import numpy as np
import pandas as pd

# Enter 10 for the length so that new_list ends up with 11 titles, one per row of actors below
x = int(input('Insert length (int): '))
y = input('Insert string: ')
new_list = [y + ' ' + str(i) for i in range(x)]
new_list.append('Jurassic World')  # added your film
actors=['Vin Diesel|Shahrukh Khan|Salman Khan|Irrfan Khan',
'Vin Gasoline|Harrison Tesla|Salmon Rosa|Matt Angel|Demi Less',
'Not von Diesel|Ryan Davidson',
'Chris Bratt|Bread Butter|Bruce Wayno|Robinson Crusoe',
'Groot|Watzlav|David Bronzefield|Vin Diesel',
'Jessica Fox|Jamie Rabbit|Harrison Tesla|Salmon Rosa',
'Bryce Dallas Howard|David Bronzefield|Robinson Crusoe',
'Asterix|Garfield|Chris Pratt|Smurfix',
'Almost vin Diesel|Vin Gasoline|Dwayne Paper',
'Vin Gasoline|Jessica Fox|Demi Less',
'Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vincent D`Onofrio|Nick Robinson'] # 11 rows
votes_average = np.random.uniform(low=6, high=9.8, size=(11,))
Here is my df for the answer:
df=pd.DataFrame({'film' : new_list, 'actors': actors, 'imdb' : votes_average})
# First split the cast column into separate columns, named `cast_x`
part=df['actors'].str.split('|',expand=True).rename(columns= lambda x : 'cast_'+str(x))
# Now join them to the main df, creating df_new
df_new=pd.concat([df,part],axis=1)
Now comes a complicated part, but you can try it for yourself, running one method at a time, to see what happens to the df:
group = (df_new.filter(like='cast').stack()
         .reset_index(level=1, drop=True)
         .to_frame('casts')
         .join(df)
         .groupby('casts')
         .agg({'imdb': ['mean', 'size'], 'film': lambda x: list(pd.unique(x))}))
I found it reasonable to use .agg to get more statistics (you can add 'min' and/or 'max' to the list as well).
I wanted to see how many movies each average is based on ('size') and which movies each actor appeared in (the lambda with pd.unique):
group.loc['Vin Gasoline']
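For comparison, the same per-actor statistics can be sketched more compactly with explode and named aggregation. This is not the code from either answer, just one possible shortening; the two-film DataFrame and the output column names (mean_vote, n_films, films) are made up for illustration.

```python
import pandas as pd

# Tiny illustrative frame; the real data would have one row per movie
df = pd.DataFrame({
    "film": ["Jurassic World", "Jurassic World: Fallen Kingdom"],
    "cast": ["Chris Pratt|Bryce Dallas Howard|Irrfan Khan",
             "Chris Pratt|Bryce Dallas Howard|Rafe Spall"],
    "vote_average": [5.0, 4.0],
})

per_actor = (
    df.assign(cast=df["cast"].str.split("|"))   # string -> list of actors
      .explode("cast")                          # one row per (film, actor)
      .groupby("cast")
      .agg(mean_vote=("vote_average", "mean"),  # average rating per actor
           n_films=("film", "nunique"),         # how many films it is based on
           films=("film", list))                # which films those are
)

print(per_actor.loc["Chris Pratt"])
```

Looking up a single actor with per_actor.loc[...] then plays the same role as group.loc['Vin Gasoline'] above.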

Matching a nickname to a name in Pandas

I have two dataframes: one with full names and another with nicknames. The nickname is always a portion of the person's full name, and the data is not sorted or indexed, so I can't just merge the two.
What I want as an output is one data frame that contains the full name and the associated nick name by simple search: find the nickname inside the name and match it.
Any solutions to this?
df = pd.DataFrame({'fullName': ['Claire Daines', 'Damian Lewis', 'Mandy Patinkin', 'Rupert Friend', 'F. Murray Abraham']})
df2 = pd.DataFrame({'nickName': ['Rupert','Abraham','Patinkin','Daines','Lewis']})
Thanks
Use Series.str.extract with the nicknames joined by | into a single regex alternation, wrapping each one in \b so that only whole words match:
pat = '|'.join(r"\b{}\b".format(x) for x in df2['nickName'])
df['nickName'] = df['fullName'].str.extract('('+ pat + ')', expand=False)
print (df)
            fullName  nickName
0      Claire Daines    Daines
1       Damian Lewis     Lewis
2     Mandy Patinkin  Patinkin
3      Rupert Friend    Rupert
4  F. Murray Abraham   Abraham
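If the nicknames might contain characters that are special in a regex (a dot, a parenthesis), the pattern can be built defensively. The sketch below is a variation on the answer above, assuming the same two DataFrames; re.escape and the longest-first ordering are additions, not part of the original answer.

```python
import re
import pandas as pd

df = pd.DataFrame({'fullName': ['Claire Daines', 'Damian Lewis', 'Mandy Patinkin',
                                'Rupert Friend', 'F. Murray Abraham']})
df2 = pd.DataFrame({'nickName': ['Rupert', 'Abraham', 'Patinkin', 'Daines', 'Lewis']})

# Longest nicknames first, so a longer alternative is tried before a shorter one
nicks = sorted(df2['nickName'], key=len, reverse=True)

# re.escape makes every nickname match literally; \b restricts matches to whole words
pat = r"\b(" + "|".join(re.escape(n) for n in nicks) + r")\b"

# expand=False returns a Series; rows with no matching nickname get NaN
df['nickName'] = df['fullName'].str.extract(pat, expand=False)

print(df)
```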

Separate list into row and column based on delimiter

I have a list of emails I wanted to split into two columns.
df = [Smith, John <jsmith#abc.com>; Moores, Jordan <jmoores#abc.com>;
Manson, Tyler <tmanson#abc.com>; Foster, Ryan <rfoster#abc.com>]
list = df.split(';')
for i in list
print (i)
Expected result is to have two columns, one for name, and one for email:
Name            Email
Smith, John     jsmith#abc.com
Moores, Jordan  jmoores#abc.com
Manson, Tyler   tmanson#abc.com
Foster, Ryan    rfoster#abc.com
Do not use list as a variable name: it shadows Python's built-in list type. Here is a way to do it, assuming your input is a string:
import pandas as pd

data = "Smith, John <jsmith#abc.com>; Moores, Jordan <jmoores#abc.com>; Manson, Tyler <tmanson#abc.com>; Foster, Ryan <rfoster#abc.com>"
# "list" is a built-in name in Python (not a keyword), so assigning to it hides the built-in
l1 = data.split(';')
res = []
for i in l1:
    splt = i.strip().split()
    # The first two tokens form the name; the last token is the email with its angle brackets sliced off
    res.append([" ".join(splt[:2]), splt[-1][1:-1]])
df = pd.DataFrame(res, columns=["Name", "Email"])
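The loop above assumes every name has exactly two tokens. As an alternative sketch that does not depend on the token count, the name and the address can be captured with one regular expression; the pattern below is an assumption about the input shape ("name <address>"), not something stated in the question.

```python
import pandas as pd

data = ("Smith, John <jsmith#abc.com>; Moores, Jordan <jmoores#abc.com>; "
        "Manson, Tyler <tmanson#abc.com>; Foster, Ryan <rfoster#abc.com>")

# One entry per person, with surrounding whitespace removed
entries = pd.Series(data.split(";")).str.strip()

# Named groups become the column names of the resulting DataFrame:
# everything before "<" is the name, everything inside "<...>" is the email
df = entries.str.extract(r"^(?P<Name>.*?)\s*<(?P<Email>[^>]+)>$")

print(df)
```

Entries that do not fit the pattern come back as NaN in both columns, which makes malformed input easy to spot.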

How to use pandas python3 to get just Middle Initial from Middle name column of CSV and write to new CSV

I need help. I have a CSV file that contains names (First, Middle, Last)
I would like to know a way to use pandas to convert Middle Name to just a Middle initial, and save First Name, Middle Init, Last Name to a new csv.
Source CSV
First Name,Middle Name,Last Name
Richard,Dale,Leaphart
Jimmy,Waylon,Autry
Willie,Hank,Paisley
Richard,Jason,Timmons
Larry,Josiah,Williams
What I need new CSV to look like:
First Name,Middle Name,Last Name
Richard,D,Leaphart
Jimmy,W,Autry
Willie,H,Paisley
Richard,J,Timmons
Larry,J,Williams
Here is the Python3 code using pandas that I have so far, which reads the CSV and writes it to a new file. I just need some help modifying that one column in each row so that only the first character is saved.
'''
Read CSV file with First Name, Middle Name, Last Name
Write CSV file with First Name, Middle Initial, Last Name
Print before and after in the terminal to show work was done
'''
import pandas
from pathlib import Path, PureWindowsPath
winCsvReadPath = PureWindowsPath("D:\\TestDir\\csv\\test\\original-NameList.csv")
originalCsv = Path(winCsvReadPath)
winCsvWritePath= PureWindowsPath("D:\\TestDir\\csv\\test\\modded-NameList2.csv")
moddedCsv = Path(winCsvWritePath)
df = pandas.read_csv(originalCsv, index_col='First Name')
df.to_csv(moddedCsv)
df2 = pandas.read_csv(moddedCsv, index_col='First Name')
print(df)
print(df2)
Thanks in advance..
You can use the str accessor, which allows you to slice strings like you would in normal Python:
df['Middle Name'] = df['Middle Name'].str[0]
>>> df
  First Name Middle Name Last Name
0    Richard           D  Leaphart
1      Jimmy           W     Autry
2     Willie           H   Paisley
3    Richard           J   Timmons
4      Larry           J  Williams
Or, as another approach, use str.extract.
First, read your csv file with pandas:
>>> df = pd.read_csv("sample.csv", sep=",")
>>> df
  First Name Middle Name Last Name
0    Richard        Dale  Leaphart
1      Jimmy      Waylon     Autry
2     Willie        Hank   Paisley
3    Richard       Jason   Timmons
4      Larry      Josiah  Williams
Second, extract the middle initial from the DataFrame, assuming every name starts with an uppercase letter:
>>> df['Middle Name'] = df['Middle Name'].str.extract(r'([A-Z])', expand=False)
# expand=False returns a Series; with the default expand=True the result is a one-column DataFrame
>>> df
  First Name Middle Name Last Name
0    Richard           D  Leaphart
1      Jimmy           W     Autry
2     Willie           H   Paisley
3    Richard           J   Timmons
4      Larry           J  Williams
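To tie this back to the question's goal of writing a new CSV, here is a minimal end-to-end sketch. It uses in-memory buffers (io.StringIO) in place of the real files so that it runs anywhere; with real data you would pass your own file paths to read_csv and to_csv instead.

```python
import io
import pandas as pd

# Stand-in for the source CSV file (only two rows, for brevity)
source = io.StringIO(
    "First Name,Middle Name,Last Name\n"
    "Richard,Dale,Leaphart\n"
    "Jimmy,Waylon,Autry\n"
)

df = pd.read_csv(source)

# Keep only the first character of each middle name
df["Middle Name"] = df["Middle Name"].str[0]

# Stand-in for the output file; index=False keeps the row numbers out of the CSV
out = io.StringIO()
df.to_csv(out, index=False)
result = out.getvalue()

print(result)
```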

How to remove lines in pandas data frame based on specific character

I got this in my data frame
name : john,
address : Milton Kings,
phone : 43133241
Concern:
customer complaint about the services is so suck
thank you
How can I process the above to remove every line of text in the data frame that contains ":"? My objective is to keep only the following line.
customer complaint about the services is so suck
Kindly help.
One thing you can do is take the text that follows the ':' in each row. You can do this by creating a Series from your data frame column.
Let's say c is your Series.
c=pd.Series(df['column'])
# [-1] keeps the text after the last ':' and does not fail on rows that contain no ':'
s=[c[i].split(':')[-1] for i in range(len(c))]
This separates the text after the colon from the label before it.
Assuming you want to keep the second part of each sentence, you can use the applymap method to solve your problem.
import pandas as pd
#Reproduce the dataframe
l = ["name : john",
     "address : Milton Kings",
     "phone : 43133241",
     "Concern : customer complaint about the services is so suck" ]
df = pd.DataFrame(l)
#split each element of the dataframe on ":" and keep the second part
#(every element must contain ":"; pandas 2.1+ deprecates applymap in favor of DataFrame.map)
df.applymap(lambda x: x.split(":")[1])
input :
0
0 name : john
1 address : Milton Kings
2 phone : 43133241
3 Concern : customer complaint about the services is so suck
output :
0
0 john
1 Milton Kings
2 43133241
3 customer complaint about the services is so suck
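If the goal is instead to drop whole rows that contain a colon, as the question literally asks, a boolean mask is one way to do it. This sketch assumes the text is stored one line per row in a column that is called "text" here purely for illustration.

```python
import pandas as pd

# One line of the original text per row; the column name "text" is hypothetical
df = pd.DataFrame({"text": [
    "name : john,",
    "address : Milton Kings,",
    "phone : 43133241",
    "Concern:",
    "customer complaint about the services is so suck",
    "thank you",
]})

# str.contains builds a True/False mask; regex=False treats ":" as a plain character.
# The ~ operator inverts the mask, so only rows WITHOUT a colon are kept.
kept = df[~df["text"].str.contains(":", regex=False)]

print(kept)
```

With this input the "thank you" row also survives, because it has no colon either; removing it would need an additional rule.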
