A better/faster way to handle human names in Pandas columns? - python

I am dealing with a large amount of data that includes the standard five columns for human names (prefix, firstname, middlename, lastname, suffix), and I would like to merge them into a separate column as a readable name. My problem is handling blank values, which create spacing problems. Also, I cannot modify the original columns. My current process feels a little insane (but it works!), so I am looking for a more elegant solution.
My current code:
def add_space_prefix(x):
    x = str(x)
    if len(x) > 0:
        return x + ' '
    else:
        return x
def add_space_middle(x):
    x = str(x)
    if len(x) > 0:
        return ' ' + x
    else:
        return x
def add_space_suffix(x):
    x = str(x)
    if len(x) > 0:
        return ', ' + x
    else:
        return x
df["middlename"] = df["middlename"].map(lambda x: add_space_middle(x))
df["prefix"] = df["prefix"].map(lambda x: add_space_prefix(x))
df["suffix"] = df["suffix"].map(lambda x: add_space_suffix(x))
df['fullname'] = (df["prefix"] + df["firstname"] + df["middlename"]
                  + ' ' + df["lastname"] + df['suffix'])
Sample Dataframe
prefix firstname middlename lastname suffix fullname
0 Michael Hobart Jr. Michael Hobart, Jr.
1 Mr. Alan Lilt Mr. Alan Lilt
2 Jon A. Smith III Jon A. Smith, III
3 Joe Miller Joe Miller
4 Mika Jennifer Shabosky Mika Jennifer Shabosky
5 Mrs. Angela Calder Mrs. Angela Calder
6 Boris Al Bert Esq. Boris Al Bert, Esq.
7 Dr. Natasha Chorus Dr. Natasha Chorus
8 Bill Gibbons Bill Gibbons

Option 1
' '.join and pd.Series.str
In this solution we join the entire row with spaces. This may leave spaces at the beginning or end of the string, or two or more spaces in the middle. We handle this by chaining string accessor methods.
df.assign(
    lastname=df.lastname + ','
).apply(' '.join, axis=1).str.replace(r'\s+', ' ', regex=True).str.strip(' ,')
0 Michael Hobart, Jr.
1 Mr. Alan Lilt
2 Jon A. Smith, III
3 Joe Miller
4 Mika Jennifer Shabosky
5 Mrs. Angela Calder
6 Boris Al Bert, Esq.
7 Dr. Natasha Chorus
8 Bill Gibbons
dtype: object
df['fullname'] = df.assign(
    lastname=df.lastname + ','
).apply(' '.join, axis=1).str.replace(r'\s+', ' ', regex=True).str.strip(' ,')
df
prefix firstname middlename lastname suffix fullname
0 Michael Hobart Jr. Michael Hobart, Jr.
1 Mr. Alan Lilt Mr. Alan Lilt
2 Jon A. Smith III Jon A. Smith, III
3 Joe Miller Joe Miller
4 Mika Jennifer Shabosky Mika Jennifer Shabosky
5 Mrs. Angela Calder Mrs. Angela Calder
6 Boris Al Bert Esq. Boris Al Bert, Esq.
7 Dr. Natasha Chorus Dr. Natasha Chorus
8 Bill Gibbons Bill Gibbons
Option 2
list comprehension
In this solution, we perform the same activities as with the first solution, but we bundle the string operations together and within a comprehension.
import re
[re.sub(r'\s+', ' ', ' '.join(s)).strip(' ,')
 for s in df.assign(lastname=df.lastname + ',').values.tolist()]
['Michael Hobart, Jr.',
'Mr. Alan Lilt',
'Jon A. Smith, III',
'Joe Miller',
'Mika Jennifer Shabosky',
'Mrs. Angela Calder',
'Boris Al Bert, Esq.',
'Dr. Natasha Chorus',
'Bill Gibbons']
df['fullname'] = [re.sub(r'\s+', ' ', ' '.join(s)).strip(' ,')
for s in df.assign(lastname=df.lastname + ',').values.tolist()]
df
prefix firstname middlename lastname suffix fullname
0 Michael Hobart Jr. Michael Hobart, Jr.
1 Mr. Alan Lilt Mr. Alan Lilt
2 Jon A. Smith III Jon A. Smith, III
3 Joe Miller Joe Miller
4 Mika Jennifer Shabosky Mika Jennifer Shabosky
5 Mrs. Angela Calder Mrs. Angela Calder
6 Boris Al Bert Esq. Boris Al Bert, Esq.
7 Dr. Natasha Chorus Dr. Natasha Chorus
8 Bill Gibbons Bill Gibbons
Option 3
pd.DataFrame.replace and pd.DataFrame.stack
This one is a bit different in that we replace blanks '' with np.nan so that when we stack, the np.nan values are naturally dropped. This makes joining with ' ' more straightforward.
df.assign(
lastname=df.lastname + ','
).replace('', np.nan).stack().groupby(level=0).apply(' '.join).str.strip(',')
0 Michael Hobart, Jr.
1 Mr. Alan Lilt
2 Jon A. Smith, III
3 Joe Miller
4 Mika Jennifer Shabosky
5 Mrs. Angela Calder
6 Boris Al Bert, Esq.
7 Dr. Natasha Chorus
8 Bill Gibbons
dtype: object
df['fullname'] = df.assign(
lastname=df.lastname + ','
).replace('', np.nan).stack().groupby(level=0).apply(' '.join).str.strip(',')
df
prefix firstname middlename lastname suffix fullname
0 Michael Hobart Jr. Michael Hobart, Jr.
1 Mr. Alan Lilt Mr. Alan Lilt
2 Jon A. Smith III Jon A. Smith, III
3 Joe Miller Joe Miller
4 Mika Jennifer Shabosky Mika Jennifer Shabosky
5 Mrs. Angela Calder Mrs. Angela Calder
6 Boris Al Bert Esq. Boris Al Bert, Esq.
7 Dr. Natasha Chorus Dr. Natasha Chorus
8 Bill Gibbons Bill Gibbons
Timing
Bundling within a comprehension is fastest!
%timeit df.assign(fullname=df.replace('', np.nan).stack().groupby(level=0).apply(' '.join))
%timeit df.assign(fullname=df.apply(' '.join, axis=1).str.replace(r'\s+', ' ', regex=True).str.strip())
%timeit df.assign(fullname=[re.sub(r'\s+', ' ', ' '.join(s)).strip() for s in df.values.tolist()])
100 loops, best of 3: 2.51 ms per loop
1000 loops, best of 3: 979 µs per loop
1000 loops, best of 3: 384 µs per loop
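As a self-contained illustration of the comprehension approach, here is a minimal sketch on a three-row sample. It swaps the regex cleanup for a different technique: empty parts are filtered out before joining, so no stray spaces are produced in the first place. The helper name full_name and the cols list are hypothetical, and the sketch assumes blanks are empty strings rather than NaN.

```python
import pandas as pd

# Small sample with blanks stored as empty strings (an assumption of this sketch).
df = pd.DataFrame({
    "prefix": ["", "Mr.", ""],
    "firstname": ["Michael", "Alan", "Jon"],
    "middlename": ["", "", "A."],
    "lastname": ["Hobart", "Lilt", "Smith"],
    "suffix": ["Jr.", "", "III"],
})

def full_name(prefix, first, middle, last, suffix):
    """Join the non-empty name parts; append the suffix after a comma if present."""
    name = " ".join(part for part in (prefix, first, middle, last) if part)
    return f"{name}, {suffix}" if suffix else name

cols = ["prefix", "firstname", "middlename", "lastname", "suffix"]
# The original columns are only read, never modified.
df["fullname"] = [full_name(*row) for row in df[cols].values.tolist()]
print(df["fullname"].tolist())
```

Selecting the five columns explicitly keeps the comprehension safe to rerun after the fullname column has been added.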

Related

How to insert a variable in xpath within a for loop?

for i in range(length):
    # print(i)
    driver.execute_script("window.history.go(-1)")
    range = driver.find_element_by_xpath("(//a[@class = 'button'])[i]").click()
    content2 = driver.page_source.encode('utf-8').strip()
    soup2 = BeautifulSoup(content2,"html.parser")
    name2 = soup2.find('h1', {'data-qa-target': 'ProviderDisplayName'}).text
    phone2 = soup2.find('a', {'class': 'click-to-call-button-secondary hg-track mobile-click-to-call'}).text
    print(name2, phone2)
Hey guys, I am trying to scrape the first and last name and the telephone number of each person on this website: https://www.healthgrades.com/family-marriage-counseling-directory. I want the button XPath (line 4) to adapt to the variable i. If I manually change i to a number, everything works perfectly fine, but as soon as I put in the variable i it doesn't work. Any help is much appreciated!
Instead of this:
range = driver.find_element_by_xpath("(//a[@class = 'button'])[i]").click()
do this:
range = driver.find_element_by_xpath(f"(//a[@class = 'button'])[{i}]").click()
Update 1 :
driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.implicitly_wait(50)
driver.get("https://www.healthgrades.com/family-marriage-counseling-directory")
for name in driver.find_elements(By.CSS_SELECTOR, "a[data-qa-target='provider-details-provider-name']"):
    print(name.text)
Output :
Noe Gutierrez, MSW
Melissa Huston, LCSW
Gina Kane, LMHC
Dr. Mary Marino, PHD
Emili-Erin Puente, MED
Richard Vogel, LMFT
Lynn Bednarz, LCPC
Nicole Palow, LMHC
Dennis Hart, LPCC
Dr. Robert Meeks, PHD
Jody Davis
Dr. Kim Logan, PHD
Artemis Paschalis, LMHC
Mark Webb, LMFT
Deirdre Holland, LCSW-R
John Paul Dilorenzo, LMHC
Joseph Hayes, LPC
Dr. Maylin Batista, PHD
Ella Gray, LCPC
Cynthia Mack-Ernsdorff, MA
Dr. Edward Muldrow, PHD
Rachel Sievers, LMFT
Dr. Lisa Burton, PHD
Ami Owen, LMFT
Sharon Lorber, LCSW
Heather Rowley, LCMHC
Dr. Bonnie Bryant, PHD
Marilyn Pearlman, LCSW
Charles Washam, BCD
Dr. Liliana Wolf, PHD
Christy Kobe, LCSW
Dana Paine, LPCC
Scott Kohner, LCSW
Elizabeth Krzewski, LMHC
Luisa Contreras, LMFT
Dr. Joel Nunez, PHD
Susanne Sacco, LISW
Lauren Reminger, MA
Thomas Recher, AUD
Kristi Smith, LCSW
Kecia West, LPC
Gregory Douglas, MED
Gina Smith, LCPC
Anne Causey, LPC
Dr. David Greenfield, PHD
Olga Rothschild, LMHC
Dr. Susan Levin, PHD
Ferguson Jennifer, LMHC
Marci Ober, LMFT
Christopher Checke, LMHC
Process finished with exit code 0
Update 2 :
leng = len(driver.find_elements(By.CSS_SELECTOR, "a[data-qa-target='provider-details-provider-name']"))
for i in range(1, leng + 1):  # XPath positions are 1-based
    driver.find_element(By.XPATH, f"(//a[text()='View Profile'])[{i}]").click()
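One detail worth spelling out: XPath positions start at 1, while a Python range starts at 0, so a zero-based loop counter has to be shifted by one before it is interpolated. A minimal sketch of just the string-building step, with no browser involved (the helper name nth_button_xpath is hypothetical):

```python
def nth_button_xpath(i):
    """Build the XPath for the i-th matching link from a zero-based loop index."""
    # XPath positions are 1-based, so shift the zero-based Python index by one.
    return f"(//a[@class = 'button'])[{i + 1}]"

for i in range(3):
    print(nth_button_xpath(i))
```

The resulting string can then be passed to the driver's element lookup in place of the hard-coded expression.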

How to slice pandas column with index list?

I'm trying to extract the first two words from a string in a dataframe
df["Name"]
Name
Anthony Frank Hawk
John Rodney Mullen
Robert Dean Silva Burnquis
Geoffrey Joseph Rowley
To get the index of the second " " (space) I tried this, but find returns NaN instead of the number of characters up to the second space.
df["temp"] = df["Name"].str.find(" ")+1
df["temp"] = df["Status"].str.find(" ", start=df["Status"], end=None)
df["temp"]
0 NaN
1 NaN
2 NaN
3 NaN
The last step is to slice those names. I tried this code, but it doesn't work either.
df["Status"] = df["Status"].str.slice(0,df["temp"])
df["Status"]
0 NaN
1 NaN
2 NaN
3 NaN
Expected result
0 Anthony Frank
1 John Rodney
2 Robert Dean
3 Geoffrey Joseph
If you have a more efficient way to do this, please let me know!
df['temp'] = df.Name.str.rpartition().get(0)
df
Output
Name temp
0 Anthony Frank Hawk Anthony Frank
1 John Rodney Mullen John Rodney
2 Robert Dean Silva Burnquis Robert Dean Silva
3 Geoffrey Joseph Rowley Geoffrey Joseph
EDIT
If only first two elements are required in output.
df['temp'] = df.Name.str.split().str[:2].str.join(' ')
df
OR
df['temp'] = df.Name.str.split().apply(lambda x:' '.join(x[:2]))
df
OR
df['temp'] = df.Name.str.split().apply(lambda x:' '.join([x[0], x[1]]))
df
Output
Name temp
0 Anthony Frank Hawk Anthony Frank
1 John Rodney Mullen John Rodney
2 Robert Dean Silva Burnquis Robert Dean
3 Geoffrey Joseph Rowley Geoffrey Joseph
You can use str.index(substring) instead of str.find; it returns the lowest index at which the substring (such as " ", a space) is found in the string. Then you can split the string at that index and reapply the above to the second string in the resulting list.
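The index-based idea can be sketched in plain Python, applied row by row. This is only an illustration: it uses find rather than index so that names with fewer than two spaces are returned unchanged instead of raising an error, and the helper name first_two_words is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Anthony Frank Hawk",
                            "John Rodney Mullen",
                            "Robert Dean Silva Burnquis",
                            "Geoffrey Joseph Rowley"]})

def first_two_words(name):
    """Cut the string just before its second space, if it has one."""
    first = name.find(" ")
    if first == -1:
        return name  # single word: nothing to cut
    second = name.find(" ", first + 1)  # search again, starting after the first space
    return name if second == -1 else name[:second]

df["temp"] = [first_two_words(n) for n in df["Name"]]
print(df)
```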

Can you make value_counts on a specific interval of characters with pandas?

So, I have a column "Names". If I do:
df['Names'].value_counts()
I get this:
Mr. Richard Vance 1
Mrs. Angela Bell 1
Mr. Stewart Randall 1
Mr. Andrew Ogden 1
Mrs. Maria Berry 1
..
Mrs. Lillian Wallace 1
Mr. William Bailey 1
Mr. Paul Ball 1
Miss Pippa Bond 1
Miss Caroline Gray 1
That's OK... There are lots of DISTINCT names. But what I want is to run value_counts() only on the leading characters up to the first space (i.e. the space that divides, for instance, Miss or Mrs. from Lillian Wallace), so that the output would be, for example:
Mrs. 1000
Mr. 2000
Miss 2000
I just want to know how many distinct variants there are in the Names column so that, in a second stage, I can create another variable (namely gender) based on those variants.
You can use value_counts(dropna=False) on str[0] after a str.split():
df = pd.DataFrame({'Names': ['Mr. Richard Vance','Mrs. Angela Bell','Mr. Stewart Randall','Mr. Andrew Ogden','Mrs. Maria Berry','Mrs. Lillian Wallace','Mr. William Bailey','Mr. Paul Ball','Miss Pippa Bond','Miss Caroline Gray','']})
df.Names.str.split().str[0].value_counts(dropna=False)
# Mr. 5
# Mrs. 3
# Miss 2
# NaN 1
# Name: Names, dtype: int64
If you want to know the unique values and if there's always a space you can do this.
df = pd.DataFrame(['Mr. Richard Vance',
'Mrs. Angela Bell',
'Mr. Stewart Randall',
'Mr. Andrew Ogden',
'Mrs. Maria Berry',
'Mrs. Lillian Wallace',
'Mr. William Bailey',
'Mr. Paul Ball',
'Miss Pippa Bond',
'Miss Caroline Gray'], columns=['names'])
df['names'].str.split(' ').str[0].unique().tolist()
Output is a list:
['Mr.', 'Mrs.', 'Miss']
Here is a solution. You can use regex:
#Dataset
Names
0 Mr. Richard Vance
1 Mrs. Angela Bell
2 Mr. Stewart Randall
3 Mr. Andrew Ogden
4 Mrs. Maria Berry
5 Mrs. Lillian Wallace
df['Names'].str.extract(r'(\w+\.\s)').value_counts()
#Output:
Mr. 3
Mrs. 3
Note: (\w+\.\s) extracts titles that end in a period, such as Mr. and Mrs. (or Dr.), from the names. Titles without a period, such as Miss, are not matched.
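If titles without a trailing period also need to be counted, one possible variation is to make the period optional and anchor the pattern at the start of the string. This is a sketch rather than part of the answer above; it assumes every name starts with a title followed by a space.

```python
import pandas as pd

df = pd.DataFrame({"Names": ["Mr. Richard Vance", "Mrs. Angela Bell",
                             "Mr. Stewart Randall", "Mr. Andrew Ogden",
                             "Mrs. Maria Berry", "Mrs. Lillian Wallace",
                             "Mr. William Bailey", "Mr. Paul Ball",
                             "Miss Pippa Bond", "Miss Caroline Gray"]})

# ^(\w+\.?)\s captures the first word, with or without a trailing period.
# expand=False returns a Series, so value_counts gives one count per title.
titles = df["Names"].str.extract(r"^(\w+\.?)\s", expand=False)
counts = titles.value_counts()
print(counts)
```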

startswith() function help needed in Pandas Dataframe

I have a name column in a DataFrame that contains multiple names.
DataFrame
import pandas as pd
df = pd.DataFrame({'name': ['Brailey, Mr. William Theodore Ronald', 'Roger Marie Bricoux',
                            "Mr. Roderick Robert Crispin",
                            "Cunningham", " Mr. Alfred Fleming"]})
OUTPUT
Name
0 Brailey, Mr. William Theodore Ronald
1 Roger Marie Bricoux
2 Mr. Roderick Robert Crispin
3 Cunningham
4 Mr. Alfred Fleming
I wrote a row classification function: if I pass it a row/name, it should return the output class.
mus = ['Brailey, Mr. William Theodore Ronald', 'Roger Marie Bricoux', 'John Frederick Preston Clarke']
def classify_role(row):
    if row.loc['name'] in mus:
        return 'musician'
Calling a function
is_brailey = df['name'].str.startswith('Brailey')
print(classify_role(df[is_brailey].iloc[0]))
This should show 'musician', but the output shows a different class. I think I have written something wrong in classify_role(). It must be this line:
if row.loc['name'] in mus:
Summary:
I need a solution where, if I pass startswith() the first name of a person who is in mus, it returns musician.
EDIT: If you want to test whether values exist in lists, you can create a dictionary and test membership with Series.isin:
mus = ['Brailey, Mr. William Theodore Ronald', 'Roger Marie Bricoux',
'John Frederick Preston Clarke']
cat1 = ['Mr. Alfred Fleming','Cunningham']
d = {'musician':mus, 'category':cat1}
for k, v in d.items():
    df.loc[df['Name'].isin(v), 'type'] = k
print (df)
Name type
0 Brailey, Mr. William Theodore Ronald musician
1 Roger Marie Bricoux musician
2 Mr. Roderick Robert Crispin NaN
3 Cunningham category
4 Mr. Alfred Fleming category
Your solution should be changed:
mus = ['Brailey, Mr. William Theodore Ronald', 'Roger Marie Bricoux',
'John Frederick Preston Clarke']
def classify_role(row):
    if row in mus:
        return 'musician'
df['type'] = df['Name'].apply(classify_role)
print (df)
Name type
0 Brailey, Mr. William Theodore Ronald musician
1 Roger Marie Bricoux musician
2 Mr. Roderick Robert Crispin None
3 Cunningham None
4 Mr. Alfred Fleming None
You can pass the values as a tuple to Series.str.startswith; the solution can be expanded to match more categories by using a dictionary:
d = {'musician': ['Brailey, Mr. William Theodore Ronald'],
'cat1':['Roger Marie Bricoux', 'Cunningham']}
for k, v in d.items():
    df.loc[df['Name'].str.startswith(tuple(v)), 'type'] = k
print (df)
Name type
0 Brailey, Mr. William Theodore Ronald musician
1 Roger Marie Bricoux cat1
2 Mr. Roderick Robert Crispin NaN
3 Cunningham cat1
4 Mr. Alfred Fleming NaN
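For the lookup described in the question's summary (pass a name prefix, get back 'musician' when a matching row is in mus), the two ideas above can be combined. A minimal sketch, assuming the column is called Name as in the answers; the helper classify_by_prefix is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Brailey, Mr. William Theodore Ronald",
                            "Roger Marie Bricoux",
                            "Mr. Roderick Robert Crispin",
                            "Cunningham",
                            "Mr. Alfred Fleming"]})
mus = ["Brailey, Mr. William Theodore Ronald", "Roger Marie Bricoux",
       "John Frederick Preston Clarke"]

def classify_by_prefix(frame, prefix, musicians):
    """Return 'musician' if any name starting with prefix is in the musicians list."""
    matches = frame.loc[frame["Name"].str.startswith(prefix), "Name"]
    if matches.isin(musicians).any():
        return "musician"
    return None

print(classify_by_prefix(df, "Brailey", mus))
```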

Scraping table by beautiful soup 4

Hello, I am trying to scrape the table at this URL: https://www.espn.com/nfl/stats/player/_/stat/rushing/season/2018/seasontype/2/table/rushing/sort/rushingYards/dir/desc
There are 50 rows in this table. However, if you click Show more (just below the table), more rows appear. My Beautiful Soup code works fine, but it retrieves only the first 50 rows; it does not retrieve the rows that appear after clicking Show more. How can I get all the rows, including the first 50 and those that appear after clicking Show more?
Here is the code:
# Request the target page
rqst = requests.get("https://www.espn.com/nfl/stats/player/_/stat/rushing/season/2018/seasontype/2/table/rushing/sort/rushingYards/dir/desc")
soup = BeautifulSoup(rqst.content,'lxml')
table = soup.find_all('table')
NFL_player_stats = pd.read_html(str(table))
players = NFL_player_stats[0]
players.shape
out[0]: (50,1)
Using DevTools in Firefox I see it gets data (in JSON format) for next page from
https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page=2
If you change value in page= then you can get other pages.
import requests
url = 'https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page='
for page in range(1, 4):
    print('\n---', page, '---\n')
    r = requests.get(url + str(page))
    data = r.json()
    #print(data.keys())
    for item in data['athletes']:
        print(item['athlete']['displayName'])
Result:
--- 1 ---
Ezekiel Elliott
Saquon Barkley
Todd Gurley II
Joe Mixon
Chris Carson
Christian McCaffrey
Derrick Henry
Adrian Peterson
Phillip Lindsay
Nick Chubb
Lamar Miller
James Conner
David Johnson
Jordan Howard
Sony Michel
Marlon Mack
Melvin Gordon
Alvin Kamara
Peyton Barber
Kareem Hunt
Matt Breida
Tevin Coleman
Aaron Jones
Doug Martin
Frank Gore
Gus Edwards
Lamar Jackson
Isaiah Crowell
Mark Ingram II
Kerryon Johnson
Josh Allen
Dalvin Cook
Latavius Murray
Carlos Hyde
Austin Ekeler
Deshaun Watson
Kenyan Drake
Royce Freeman
Dion Lewis
LeSean McCoy
Mike Davis
Josh Adams
Alfred Blue
Cam Newton
Jamaal Williams
Tarik Cohen
Leonard Fournette
Alfred Morris
James White
Mitchell Trubisky
--- 2 ---
Rashaad Penny
LeGarrette Blount
T.J. Yeldon
Alex Collins
C.J. Anderson
Chris Ivory
Marshawn Lynch
Russell Wilson
Blake Bortles
Wendell Smallwood
Marcus Mariota
Bilal Powell
Jordan Wilkins
Kenneth Dixon
Ito Smith
Nyheim Hines
Dak Prescott
Jameis Winston
Elijah McGuire
Patrick Mahomes
Aaron Rodgers
Jeff Wilson Jr.
Zach Zenner
Raheem Mostert
Corey Clement
Jalen Richard
Damien Williams
Jaylen Samuels
Marcus Murphy
Spencer Ware
Cordarrelle Patterson
Malcolm Brown
Giovani Bernard
Chase Edmonds
Justin Jackson
Duke Johnson
Taysom Hill
Kalen Ballage
Ty Montgomery
Rex Burkhead
Jay Ajayi
Devontae Booker
Chris Thompson
Wayne Gallman
DJ Moore
Theo Riddick
Alex Smith
Robert Woods
Brian Hill
Dwayne Washington
--- 3 ---
Ryan Fitzpatrick
Tyreek Hill
Andrew Luck
Ryan Tannehill
Josh Rosen
Sam Darnold
Baker Mayfield
Jeff Driskel
Rod Smith
Matt Ryan
Tyrod Taylor
Kirk Cousins
Cody Kessler
Darren Sproles
Josh Johnson
DeAndre Washington
Trenton Cannon
Javorius Allen
Jared Goff
Julian Edelman
Jacquizz Rodgers
Kapri Bibbs
Andy Dalton
Ben Roethlisberger
Dede Westbrook
Case Keenum
Carson Wentz
Brandon Bolden
Curtis Samuel
Stevan Ridley
Keith Ford
Keenan Allen
John Kelly
Kenjon Barner
Matthew Stafford
Tyler Lockett
C.J. Beathard
Cameron Artis-Payne
Devonta Freeman
Brandin Cooks
Isaiah McKenzie
Colt McCoy
Stefon Diggs
Taylor Gabriel
Jarvis Landry
Tavon Austin
Corey Davis
Emmanuel Sanders
Sammy Watkins
Nathan Peterman
EDIT: get all data as DataFrame
import requests
import pandas as pd
url = 'https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page='
rows = []  # collect one list per player
for page in range(1, 4):
    print('page:', page)
    r = requests.get(url + str(page))
    data = r.json()
    #print(data.keys())
    for item in data['athletes']:
        player_name = item['athlete']['displayName']
        position = item['athlete']['position']['abbreviation']
        gp = item['categories'][0]['totals'][0]
        other_values = item['categories'][2]['totals']
        row = [player_name, position, gp] + other_values
        rows.append(row)  # DataFrame.append was removed in pandas 2.0
# build the DataFrame once, after all pages have been read
df = pd.DataFrame(rows, columns=['NAME', 'POS', 'GP', 'ATT', 'YDS', 'AVG', 'LNG', 'BIG', 'TD', 'YDS/G', 'FUM', 'LST', 'FD'])
print(len(df)) # 150
print(df.head(20))
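The row-building step can be tried without any network access by separating it into a function that takes the already-decoded JSON. In the sketch below the helper rows_from_page is hypothetical, and sample_page is a made-up payload that only mirrors the keys used above (athletes, athlete, displayName, position, abbreviation); it is not real API output.

```python
import pandas as pd

def rows_from_page(data):
    """Turn one decoded JSON page into a list of [name, position] rows."""
    rows = []
    for item in data.get("athletes", []):
        athlete = item["athlete"]
        rows.append([athlete["displayName"], athlete["position"]["abbreviation"]])
    return rows

# Made-up payload for illustration only; the real response has many more fields.
sample_page = {
    "athletes": [
        {"athlete": {"displayName": "Ezekiel Elliott", "position": {"abbreviation": "RB"}}},
        {"athlete": {"displayName": "Lamar Jackson", "position": {"abbreviation": "QB"}}},
    ]
}

players = pd.DataFrame(rows_from_page(sample_page), columns=["NAME", "POS"])
print(players)
```

Because a page without an athletes key yields an empty list, the same function could also serve as a stop condition when looping over an unknown number of pages.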
