I have the following list:
personnages = ['Stanley','Kevin', 'Franck']
I want to use the str.contains function to create a new pandas DataFrame df3:
df3 = df2[df2['speaker'].str.contains('|'.join(personnages))]
However, if a row of the speaker column contains 'Stanley & Kevin', I don't want it in df3.
How can I improve my code to do this?
Here is what I would do:
import pandas as pd

# toy data
df = pd.DataFrame({'speaker':['Stanley & Kevin', 'Everybody',
'Kevin speaks', 'The speaker is Franck', 'Nobody']})
personnages = ['Stanley','Kevin', 'Franck']
pattern = '|'.join(personnages)
s = (df['speaker'].str
       .extractall(f'({pattern})')   # extract every personnage found in each row
       .groupby(level=0)[0]          # group by the original row index
       .nunique().eq(1)              # True where exactly one distinct name was found
     )
df.loc[s.index[s]]
Output:
speaker
2 Kevin speaks
3 The speaker is Franck
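As a follow-up, if any row that matches more than once should be dropped, even when it is the same name repeated, a shorter count-based sketch (reusing the toy df and pattern above) also works; note it differs from the nunique() version for a row like 'Kevin & Kevin':
# keep rows with exactly one match of any of the names
df[df['speaker'].str.count(pattern).eq(1)]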
You'll want to anchor the start and end of the string in your regex; that way it only matches when the value is a single name on its own:
import pandas as pd
speakers = ['Stanley', 'Kevin', 'Frank', 'Kevin & Frank']
df = pd.DataFrame([{'speaker': speaker} for speaker in speakers])
speaker
0 Stanley
1 Kevin
2 Frank
3 Kevin & Frank
r = '|'.join(speakers[:-1]) # gets all but the last one for the sake of example
# the ^ marks start of string, and $ is the end
df[df['speaker'].str.contains(f'^({r})$')]
speaker
0 Stanley
1 Kevin
2 Frank
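The same result can also be had with str.fullmatch (available since pandas 1.1), which anchors the pattern to the whole string for you, so no ^, $, or capture group is needed; a minimal sketch reusing r from above:
df[df['speaker'].str.fullmatch(r)]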
I have two dataframes with actor names (their types are object) that look like the following:
df = pd.DataFrame({'Actors': ['Christian Bale', 'Ben Kingsley', 'Halley Bailey', 'Aaron Paul', ...]})
df2 = pd.DataFrame({'id': ['Halley Bailey - 1998', 'Coco Jones – 1998', ...]})
Normally I would use the following code to check whether an item from one dataframe is present in the other, but because of the numbers appended in the second dataframe I get 0 matches. Is there any smart way of getting around this?
df.assign(indf=df.Actors.isin(df2['id']).astype(int))
Obviously, the code above did not work.
You can extract the actor names from df2['id'] and check if df['Actors'] is in it:
df.assign(indf=df['Actors'].isin(df2['id'].str.extract(r'(.*)(?=\s[-–])',
                                                       expand=False)).astype(int))
Output:
Actors indf
0 Christian Bale 0
1 Ben Kingsley 0
2 Halley Bailey 1
3 Aaron Paul 0
Another, more generic, approach relying on a regex:
import re
regex = '|'.join(map(re.escape, df['Actors']))
# 'Christian\\ Bale|Ben\\ Kingsley|Halley\\ Bailey|Aaron\\ Paul'
actors = df2['id'].str.extract(f'({regex})', expand=False).dropna()
df.assign(indf=df['Actors'].isin(actors).astype(int))
used inputs:
df = pd.DataFrame({'Actors': ['Christian Bale', 'Ben Kingsley', 'Halley Bailey', 'Aaron Paul']})
df2 = pd.DataFrame({'id': ['Halley Bailey - 1998', 'Coco Jones – 1998']})
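If you only need the matching rows rather than a 0/1 flag column, a small follow-up sketch reusing the actors series from the regex approach above:
# keep only the actors whose name was found in df2['id']
df[df['Actors'].isin(actors)]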
I have a df that looks similar to this:
|email|first_name|last_name|id|group_email|
|-|-|-|-|-|
|drew#mail.com|drew|barry|05|san-red-gate-rate#mail.com|
|nate#mail.com|nate|lewis|03|san-blue-gate-factor#mail.com|
|chris#mail.com|chris|ryan|04|san-red-wheels-drive#mail.com|
I parse out the group_code, the substring after the 3rd hyphen. I now want to add this substring back into the dataframe for each entry, so that the df looks like this:
|email|first_name|last_name|id|group_email|group_code|
|-|-|-|-|-|-|
|drew#mail.com|drew|barry|05|san-red-gate-rate#mail.com|rate|
|nate#mail.com|nate|lewis|03|san-blue-gate-factor#mail.com|factor|
|chris#mail.com|chris|ryan|04|san-red-wheels-drive#mail.com|drive|
How can I go about doing this?
Let's try:
# the repeated group keeps only its last repetition, i.e. the text
# between the 3rd hyphen and the '#'
df['group_code'] = (df['group_email'].str.extract('(-[^#]*){3}')[0]
                                     .str.lstrip('-'))
print(df)
email first_name last_name id group_email group_code
0 drew#mail.com drew barry 5 san-red-gate-rate#mail.com rate
1 nate#mail.com nate lewis 3 san-blue-gate-factor#mail.com factor
2 chris#mail.com chris ryan 4 san-red-wheels-drive#mail.com drive
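An alternative sketch without the quantified capture group: split off the domain at the '#' and take the last hyphen-separated token (this assumes the group code itself never contains a hyphen):
df['group_code'] = (df['group_email']
                    .str.split('#').str[0]      # drop the '#mail.com' part
                    .str.split('-').str[-1])    # keep the piece after the last hyphen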
I am trying to split misspelled first names. Most of them are joined together. I was wondering if there is any way to separate two first names that are together into two different words.
For example, if the misspelled name is trujillohernandez then to be separated to trujillo hernandez.
I am trying to create a function that can do this for a whole column with thousands of misspelled names like the example above. However, I haven't been successful. Spell-checker libraries do not work, given that these are Hispanic first names.
I would be really grateful if you can help to develop some sort of function to make it happen.
As noted in the comments above, not having a list of possible names will cause a problem. However, even if it is not perfect, here is something to try...
Given a dataframe example like...
Name
0 sofíagomez
1 isabelladelgado
2 luisvazquez
3 juanhernandez
4 valentinatrujillo
5 camilagutierrez
6 joséramos
7 carlossantana
Code (Python):
import pandas as pd
import requests
# longest list of hispanic surnames I could find in a table
url = r'https://namecensus.com/data/hispanic.html'
# download the table into a frame and clean up the header
page = requests.get(url)
table = pd.read_html(page.text.replace('<br />',' '))
df = table[0]
df.columns = df.iloc[0]
df = df[1:]
# move the frame of surnames to a list
last_names = df['Last name / Surname'].tolist()
last_names = [each_string.lower() for each_string in last_names]
# create a test dataframe of joined firstnames and lastnames
data = {'Name' : ['sofíagomez', 'isabelladelgado', 'luisvazquez', 'juanhernandez', 'valentinatrujillo', 'camilagutierrez', 'joséramos', 'carlossantana']}
df = pd.DataFrame(data, columns=['Name'])
# create new columns for the matched names
lastname = '({})'.format('|'.join(last_names))
df['Firstname'] = df.Name.str.replace(str(lastname)+'$', '', regex=True).fillna('--not found--')
df['Lastname'] = df.Name.str.extract(str(lastname)+'$', expand=False).fillna('--not found--')
# output the dataframe
print('\n\n')
print(df)
Outputs:
Name Firstname Lastname
0 sofíagomez sofía gomez
1 isabelladelgado isabella delgado
2 luisvazquez luis vazquez
3 juanhernandez juan hernandez
4 valentinatrujillo valentina trujillo
5 camilagutierrez camila gutierrez
6 joséramos josé ramos
7 carlossantana carlos santana
Further cleanup may be required but perhaps it gets the majority of names split.
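As a design note, the two columns could also be produced in a single pass with one extract call and two capture groups; a sketch assuming the same last_names list as above (unmatched names come back as NaN in both columns rather than '--not found--'):
# non-greedy first group = everything before the surname,
# second group = one of the known surnames anchored at the end of the string
pattern = '(.*?)({})$'.format('|'.join(last_names))
df[['Firstname', 'Lastname']] = df['Name'].str.extract(pattern)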
I have two dataframes: one with full names and another with nicknames. The nickname is always a portion of the person's full name, and the data is not sorted or indexed, so I can't just merge the two.
What I want as output is one dataframe that contains the full name and the associated nickname, found by a simple search: locate the nickname inside the full name and match it.
Any solutions to this?
df = pd.DataFrame({'fullName': ['Claire Daines', 'Damian Lewis', 'Mandy Patinkin', 'Rupert Friend', 'F. Murray Abraham']})
df2 = pd.DataFrame({'nickName': ['Rupert','Abraham','Patinkin','Daines','Lewis']})
Thanks
Use Series.str.extract with the nicknames joined by | into a regex alternation, each wrapped in \b for word boundaries:
pat = '|'.join(r"\b{}\b".format(x) for x in df2['nickName'])
df['nickName'] = df['fullName'].str.extract('('+ pat + ')', expand=False)
print (df)
fullName nickName
0 Claire Daines Daines
1 Damian Lewis Lewis
2 Mandy Patinkin Patinkin
3 Rupert Friend Rupert
4 F. Murray Abraham Abraham
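If any of the nicknames could contain regex metacharacters (a dot, parentheses, and so on), it is safer to escape them when building the pattern; a small variation on the snippet above:
import re

# escape each nickname so literal dots, parentheses, etc. are not treated as regex syntax
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in df2['nickName'])
df['nickName'] = df['fullName'].str.extract('(' + pat + ')', expand=False)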
I need help. I have a CSV file that contains names (First, Middle, Last).
I would like to know a way to use pandas to convert the Middle Name to just a middle initial, and save First Name, Middle Init, Last Name to a new CSV.
Source CSV
First Name,Middle Name,Last Name
Richard,Dale,Leaphart
Jimmy,Waylon,Autry
Willie,Hank,Paisley
Richard,Jason,Timmons
Larry,Josiah,Williams
What I need the new CSV to look like:
First Name,Middle Name,Last Name
Richard,D,Leaphart
Jimmy,W,Autry
Willie,H,Paisley
Richard,J,Timmons
Larry,J,Williams
Here is the Python 3 code using pandas that I have so far; it reads and writes a new CSV file. I just need some help modifying that one column of each row, saving just the first character.
'''
Read CSV file with First Name, Middle Name, Last Name
Write CSV file with First Name, Middle Initial, Last Name
Print before and after in the terminal to show work was done
'''
import pandas
from pathlib import Path, PureWindowsPath
winCsvReadPath = PureWindowsPath("D:\\TestDir\\csv\\test\\original-NameList.csv")
originalCsv = Path(winCsvReadPath)
winCsvWritePath = PureWindowsPath("D:\\TestDir\\csv\\test\\modded-NameList2.csv")
moddedCsv = Path(winCsvWritePath)
df = pandas.read_csv(originalCsv, index_col='First Name')
df.to_csv(moddedCsv)
df2 = pandas.read_csv(moddedCsv, index_col='First Name')
print(df)
print(df2)
Thanks in advance.
You can use the str accessor, which allows you to slice strings like you would in normal Python:
df['Middle Name'] = df['Middle Name'].str[0]
>>> df
First Name Middle Name Last Name
0 Richard D Leaphart
1 Jimmy W Autry
2 Willie H Paisley
3 Richard J Timmons
4 Larry J Williams
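To produce the new CSV the question asks for, a minimal end-to-end sketch (using the paths from your script, and reading without index_col so all three columns round-trip unchanged):
import pandas

originalCsv = r"D:\TestDir\csv\test\original-NameList.csv"   # path from your script
moddedCsv = r"D:\TestDir\csv\test\modded-NameList2.csv"      # path from your script

df = pandas.read_csv(originalCsv)
df['Middle Name'] = df['Middle Name'].str[0]   # keep only the first character
df.to_csv(moddedCsv, index=False)              # write the new CSV without the index column
print(df)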
Or, just to show another approach, with str.extract.
First, read your CSV file with pandas:
>>> df = pd.read_csv("sample.csv", sep=",")
>>> df
First Name Middle Name Last Name
0 Richard Dale Leaphart
1 Jimmy Waylon Autry
2 Willie Hank Paisley
3 Richard Jason Timmons
4 Larry Josiah Williams
Second, extract the middle initial from the DataFrame, assuming every middle name starts with an uppercase letter:
>>> df['Middle Name'] = df['Middle Name'].str.extract('([A-Z]\w{0})')
# df['Middle Name'] = df['Middle Name'].str.extract('([A-Z]\w{0})', expand=True)
>>> df
First Name Middle Name Last Name
0 Richard D Leaphart
1 Jimmy W Autry
2 Willie H Paisley
3 Richard J Timmons
4 Larry J Williams
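A note on the pattern: \w{0} matches zero word characters, so ([A-Z]\w{0}) is effectively just ([A-Z]); an equivalent, slightly more explicit sketch:
# grab only the first (uppercase) letter of each middle name
df['Middle Name'] = df['Middle Name'].str.extract(r'([A-Z])', expand=False)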