Pandas - Split data and create columns when string occurs - Python

I am looking to read in a text file (see below) and create columns for the English leagues only. In other words, wherever the "Alias name" starts with "England_", I want to create a new column with the alias name as the header and the player names in the rows. Note that the first occurrence of the alias is written as "Aliases for" in the text file.
"-----------------------------------------------------------------------------------------------------------"
"- NEW TEAM -"
"-----------------------------------------------------------------------------------------------------------"
Europe Players
17/04/2019
07:59 p.m.
Aliases for England_Premier League
-------------------------------------------------------------------------------
Harry Kane
Mohamed Salah
Kevin De Bruyne
The command completed successfully.
Alias name England_Division 1
Comment Teams
Members
-------------------------------------------------------------------------------
Will Grigg
Jonson Clarke-Harris
Jerry Yates
Ivan Toney
Troy Parrott
The command completed successfully.
Alias name Spanish La Liga
Comment
Members
-------------------------------------------------------------------------------
Lionel Messi
Luis Suarez
Cristiano Ronaldo
Sergio Ramos
The command completed successfully.
Alias name England_Division 2
Comment
Members
-------------------------------------------------------------------------------
Eoin Doyle
Matt Watters
James Vughan
The command completed successfully.
This is my current code for reading in the data:
df = pd.read_csv(r'Desktop\SampleData.txt', sep='\n', header=None)
This gives me a pandas DataFrame with one column. I'm fairly new to Python, so I'm wondering how I would go about getting the result below. Should I use a delimiter when reading in the file?
England_Premier League  England_Division 1    England_Division 2
Harry Kane              Will Griggs           Eoin Doyle
Mohamed Salah           Jonson Clarke-Harris  Matt Watters
Kevin De Bruyne         Ivan Toney            James Vughan
                        Troy Parrott

You can use the re module for this task. For example:
import re
import pandas as pd
txt = """
"-----------------------------------------------------------------------------------------------------------"
"- NEW TEAM -"
"-----------------------------------------------------------------------------------------------------------"
Europe Players
17/04/2019
07:59 p.m.
Aliases for England_Premier League
-------------------------------------------------------------------------------
Harry Kane
Mohamed Salah
Kevin De Bruyne
The command completed successfully.
Alias name England_Division 1
Comment Teams
Members
-------------------------------------------------------------------------------
Will Grigg
Jonson Clarke-Harris
Jerry Yates
Ivan Toney
Troy Parrott
The command completed successfully.
Alias name Spanish La Liga
Comment
Members
-------------------------------------------------------------------------------
Lionel Messi
Luis Suarez
Cristiano Ronaldo
Sergio Ramos
The command completed successfully.
Alias name England_Division 2
Comment
Members
-------------------------------------------------------------------------------
Eoin Doyle
Matt Watters
James Vughan
The command completed successfully.
"""
# the first competition header reads "Aliases for ...", later ones "Alias name ..."
r_competitions = re.compile(r"^Alias(?:(?:es for)| name)\s*(.*?)$", flags=re.M)
# player names sit between a dashed separator line and "The command completed successfully."
r_names = re.compile(r"^-+$\s*(.*?)\s*The command", flags=re.M | re.S)

dfs = []
for comp, names in zip(r_competitions.findall(txt), r_names.findall(txt)):
    if "England" not in comp:
        continue
    data = []
    for n in names.split("\n"):
        data.append({comp: n})
    dfs.append(pd.DataFrame(data))

print(pd.concat(dfs, axis=1).fillna(""))
Prints:
  England_Premier League    England_Division 1 England_Division 2
0             Harry Kane            Will Grigg         Eoin Doyle
1          Mohamed Salah  Jonson Clarke-Harris       Matt Watters
2        Kevin De Bruyne           Jerry Yates       James Vughan
3                                   Ivan Toney
4                                 Troy Parrott
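In practice you would read the whole file into one string instead of hard-coding txt (the path below is taken from the question):

with open(r'Desktop\SampleData.txt') as f:
    txt = f.read()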

Related

How to deal with long names in data cleaning?

I have a users database. I want to separate the names into two columns, so that I have user1 and user2.
The way I was solving this was to split the names into multiple columns and then merge them back into the two user columns.
The issue I run into is that some names are long, so after the split those names take up extra spots in the data frame, which makes it harder to merge properly.
Users
Maria Melinda Del Valle Justin Howard
Devin Craig Jr. Michael Carter III
Jeanne De Bordeaux Alhamdi
After I split the user column, I get:

0       1        2         3        4       5       6  7  8
Maria   Melinda  Del       Valle    Justin  Howard
Devin   Craig    Jr.       Michael  Carter  III
Jeanne  De       Bordeaux  Alhamdi
The expected result is the following:

User1                    User2
Maria Melinda Del Valle  Justin Howard
Devin Craig Jr.          Michael Carter III
Jeanne De Bordeaux       Alhamdi
You can use:
def f(sr):
    # count NaNs cumulatively; keep positions before the second NaN gap
    m = sr.isna().cumsum().loc[lambda x: x < 2]
    # group the non-NaN tokens on either side of the first gap and join them
    return sr.dropna().groupby(m).apply(' '.join)

out = df.apply(f, axis=1).rename(columns=lambda x: f'User{x+1}')
Output:
>>> out
                     User1               User2
0  Maria Melinda Del Valle       Justin Howard
1          Devin Craig Jr.  Michael Carter III
2       Jeanne De Bordeaux             Alhamdi
As suggested by @Barmar, if you know where to put the blank columns in the first split, you should know how to create both columns.
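For reference, here is a minimal reconstruction of the intermediate frame that f expects; the exact NaN positions are an assumption (NaN cells are what mark the gap between user1 and user2):

import numpy as np
import pandas as pd

# hypothetical intermediate frame: NaN separates user1's tokens from user2's
df = pd.DataFrame([
    ['Maria', 'Melinda', 'Del', 'Valle', np.nan, 'Justin', 'Howard', np.nan, np.nan],
    ['Devin', 'Craig', 'Jr.', np.nan, 'Michael', 'Carter', 'III', np.nan, np.nan],
    ['Jeanne', 'De', 'Bordeaux', np.nan, 'Alhamdi', np.nan, np.nan, np.nan, np.nan],
])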

NetworkX graph with some specifications based on two dataframes

I have two dataframes. The first, called df_student, shows the names of the people in a program:

Student-ID  Name
20202456    Luke De Paul
20202713    Emil Smith
20202456    Alexander Müller
20202713    Paul Bernard
20202456    Zoe Michailidis
20202713    Joanna Grimaldi
20202456    Kepler Santos
20202713    Dominic Borg
20202456    Jessica Murphy
20202713    Danielle Dominguez
The other, called df_course, shows the courses in which people achieved the best grade together with at least one person from df_student:

Course-ID  Name                                Grade
UNI44      Luke De Paul, Benjamin Harper       17
UNI45      Dominic Borg                        20
UNI61      Luke De Paul, Jonathan MacAllister  20
UNI62      Alexander Müller, Kepler Santos     17
UNI63      Joanna Grimaldi                     19
UNI65      Emil Smith, Filippo Visconti        18
UNI71      Moshe Azerad, Emil Smith            18
UNI72      Luke De Paul, Jessica Murphy        18
UNI73      Luke De Paul, Filippo Visconti      17
UNI74      Matthias Noem, Kepler Santos        19
UNI75      Luke De Paul, Kepler Santos         16
UNI76      Kepler Santos                       17
UNI77      Kepler Santos, Benjamin Harper      17
UNI78      Dominic Borg, Kepler Santos         18
UNI80      Luke De Paul, Gabriel Martin        18
UNI81      Dominic Borg, Alexander Müller      19
UNI82      Luke De Paul, Giancarlo Di Lorenzo  20
UNI83      Emil Smith,Joanna Grimaldi          20
I would like to create a NetworkX graph with a vertex for each student from df_student and also for each student from df_course, and an unweighted edge between two vertices only if the two students received the best grade in the same course.
Now, what I tried is this:
import networkx as nx
G = nx.Graph()
G.add_edge(student, course)
But when I run it, it says the argument is not right, and so I don't know how to continue.
Try:
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd

df_students = pd.read_clipboard()
df_course = pd.read_clipboard()

# split "name1, name2" into two columns and strip stray spaces around the names
df_s_t = df_course['Name'].str.split(',', expand=True).apply(lambda s: s.str.strip())

# rows with two names become edges
df_net = df_s_t[df_s_t.notna().all(1)]
G = nx.from_pandas_edgelist(df_net, 0, 1)

# add everyone else as (possibly isolated) nodes: all program students
# plus the single-name course rows
G.add_nodes_from(pd.concat([df_students['Name'],
                            df_s_t.loc[~df_s_t.notna().all(1), 0]]))

fig, ax = plt.subplots(1, 1, figsize=(15, 15))
nx.draw_networkx(G)
Output: (plot of the resulting graph)
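A quick way to sanity-check the graph before drawing it:

print(G.number_of_nodes(), G.number_of_edges())
print(sorted(G.neighbors('Luke De Paul')))  # students he shared a best grade with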

Update DataFrame based on matching rows in another DataFrame

Say there is a group of people who can each choose an English and/or a Spanish word. Let's say they chose like this:
>>> pandas.DataFrame(dict(person=['mary','james','patricia','robert','jennifer','michael'],english=['water',None,'hello','thanks',None,'green'],spanish=[None,'agua',None,None,'bienvenido','verde']))
     person english     spanish
0      mary   water        None
1     james    None        agua
2  patricia   hello        None
3    robert  thanks        None
4  jennifer    None  bienvenido
5   michael   green       verde
Say I also have an English-Spanish dictionary (assume no duplicates, i.e. one-to-one relationship):
>>> pandas.DataFrame(dict(english=['hello','bad','green','thanks','welcome','water'],spanish=['hola','malo','verde','gracias','bienvenido','agua']))
   english     spanish
0    hello        hola
1      bad        malo
2    green       verde
3   thanks     gracias
4  welcome  bienvenido
5    water        agua
How can I fill in any missing words, i.e. update the first DataFrame using the second DataFrame where either english or spanish is None, to arrive at this:
>>> pandas.DataFrame(dict(person=['mary','james','patricia','robert','jennifer','michael'],english=['water','water','hello','thanks','welcome','green'],spanish=['agua','agua','hola','gracias','bienvenido','verde']))
     person  english     spanish
0      mary    water        agua
1     james    water        agua
2  patricia    hello        hola
3    robert   thanks     gracias
4  jennifer  welcome  bienvenido
5   michael    green       verde
You can use map together with fillna:
df['english'] = df['english'].fillna(df['spanish'].map(df2.set_index('spanish')['english']))
df['spanish'] = df['spanish'].fillna(df['english'].map(df2.set_index('english')['spanish']))
df
Out[200]:
     person  english     spanish
0      mary    water        agua
1     james    water        agua
2  patricia    hello        hola
3    robert   thanks     gracias
4  jennifer  welcome  bienvenido
5   michael    green       verde
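Equivalently, you can build the two lookup dicts once instead of calling set_index for every fill (a small variation with the same result):

# one-to-one dictionary in both directions
en2es = dict(zip(df2['english'], df2['spanish']))
es2en = {v: k for k, v in en2es.items()}

df['english'] = df['english'].fillna(df['spanish'].map(es2en))
df['spanish'] = df['spanish'].fillna(df['english'].map(en2es))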

How to insert a variable in xpath within a for loop?

for i in range(length):
    # print(i)
    driver.execute_script("window.history.go(-1)")
    range = driver.find_element_by_xpath("(//a[@class = 'button'])[i]").click()
    content2 = driver.page_source.encode('utf-8').strip()
    soup2 = BeautifulSoup(content2, "html.parser")
    name2 = soup2.find('h1', {'data-qa-target': 'ProviderDisplayName'}).text
    phone2 = soup2.find('a', {'class': 'click-to-call-button-secondary hg-track mobile-click-to-call'}).text
    print(name2, phone2)
Hey guys, I am trying to scrape the first and last name and the telephone number for each person on this website: https://www.healthgrades.com/family-marriage-counseling-directory. I want the XPath on the fourth line to adapt to the variable i. If I manually change i to a number, everything works perfectly fine, but as soon as I place the variable i in it, it doesn't work. Any help much appreciated!
Instead of this:
range = driver.find_element_by_xpath("(//a[@class = 'button'])[i]").click()
do this:
range = driver.find_element_by_xpath(f"(//a[@class = 'button'])[{i}]").click()
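The f-string interpolates the Python variable into the XPath string before it is sent to the driver. One more thing to watch: XPath positional predicates are 1-based, so a loop over n buttons should look like this (a sketch, reusing the selector from the question):

# [1] is the first match; [0] never matches anything
for i in range(1, n + 1):
    driver.find_element_by_xpath(f"(//a[@class = 'button'])[{i}]").click()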
Update 1:
driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.implicitly_wait(50)
driver.get("https://www.healthgrades.com/family-marriage-counseling-directory")
for name in driver.find_elements(By.CSS_SELECTOR, "a[data-qa-target='provider-details-provider-name']"):
    print(name.text)
Output:
Noe Gutierrez, MSW
Melissa Huston, LCSW
Gina Kane, LMHC
Dr. Mary Marino, PHD
Emili-Erin Puente, MED
Richard Vogel, LMFT
Lynn Bednarz, LCPC
Nicole Palow, LMHC
Dennis Hart, LPCC
Dr. Robert Meeks, PHD
Jody Davis
Dr. Kim Logan, PHD
Artemis Paschalis, LMHC
Mark Webb, LMFT
Deirdre Holland, LCSW-R
John Paul Dilorenzo, LMHC
Joseph Hayes, LPC
Dr. Maylin Batista, PHD
Ella Gray, LCPC
Cynthia Mack-Ernsdorff, MA
Dr. Edward Muldrow, PHD
Rachel Sievers, LMFT
Dr. Lisa Burton, PHD
Ami Owen, LMFT
Sharon Lorber, LCSW
Heather Rowley, LCMHC
Dr. Bonnie Bryant, PHD
Marilyn Pearlman, LCSW
Charles Washam, BCD
Dr. Liliana Wolf, PHD
Christy Kobe, LCSW
Dana Paine, LPCC
Scott Kohner, LCSW
Elizabeth Krzewski, LMHC
Luisa Contreras, LMFT
Dr. Joel Nunez, PHD
Susanne Sacco, LISW
Lauren Reminger, MA
Thomas Recher, AUD
Kristi Smith, LCSW
Kecia West, LPC
Gregory Douglas, MED
Gina Smith, LCPC
Anne Causey, LPC
Dr. David Greenfield, PHD
Olga Rothschild, LMHC
Dr. Susan Levin, PHD
Ferguson Jennifer, LMHC
Marci Ober, LMFT
Christopher Checke, LMHC
Process finished with exit code 0
Update 2:
leng = len(driver.find_elements(By.CSS_SELECTOR, "a[data-qa-target='provider-details-provider-name']"))
# XPath positions are 1-based, so iterate from 1 to leng inclusive
for i in range(1, leng + 1):
    driver.find_element_by_xpath(f"(//a[text()='View Profile'])[{i}]").click()

Scraping table by beautiful soup 4

Hello, I am trying to scrape the table at this URL: https://www.espn.com/nfl/stats/player/_/stat/rushing/season/2018/seasontype/2/table/rushing/sort/rushingYards/dir/desc
There are 50 rows in this table; however, if you click Show More (just below the table), more rows appear. My Beautiful Soup code works fine, but the problem is that it retrieves only the first 50 rows. It does not retrieve the rows that appear after clicking Show More. How can I get all the rows, including the first 50 and also those that appear after clicking Show More?
Here is the code:
# request the target stats page
rqst = requests.get("https://www.espn.com/nfl/stats/player/_/stat/rushing/season/2018/seasontype/2/table/rushing/sort/rushingYards/dir/desc")
soup = BeautifulSoup(rqst.content, 'lxml')
table = soup.find_all('table')
NFL_player_stats = pd.read_html(str(table))
players = NFL_player_stats[0]
players.shape
out[0]: (50,1)
Using DevTools in Firefox, I can see that the page fetches the data (in JSON format) for the next page from
https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page=2
If you change the value of page= you can get the other pages.
import requests

url = 'https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page='

for page in range(1, 4):
    print('\n---', page, '---\n')
    r = requests.get(url + str(page))
    data = r.json()
    #print(data.keys())
    for item in data['athletes']:
        print(item['athlete']['displayName'])
Result:
--- 1 ---
Ezekiel Elliott
Saquon Barkley
Todd Gurley II
Joe Mixon
Chris Carson
Christian McCaffrey
Derrick Henry
Adrian Peterson
Phillip Lindsay
Nick Chubb
Lamar Miller
James Conner
David Johnson
Jordan Howard
Sony Michel
Marlon Mack
Melvin Gordon
Alvin Kamara
Peyton Barber
Kareem Hunt
Matt Breida
Tevin Coleman
Aaron Jones
Doug Martin
Frank Gore
Gus Edwards
Lamar Jackson
Isaiah Crowell
Mark Ingram II
Kerryon Johnson
Josh Allen
Dalvin Cook
Latavius Murray
Carlos Hyde
Austin Ekeler
Deshaun Watson
Kenyan Drake
Royce Freeman
Dion Lewis
LeSean McCoy
Mike Davis
Josh Adams
Alfred Blue
Cam Newton
Jamaal Williams
Tarik Cohen
Leonard Fournette
Alfred Morris
James White
Mitchell Trubisky
--- 2 ---
Rashaad Penny
LeGarrette Blount
T.J. Yeldon
Alex Collins
C.J. Anderson
Chris Ivory
Marshawn Lynch
Russell Wilson
Blake Bortles
Wendell Smallwood
Marcus Mariota
Bilal Powell
Jordan Wilkins
Kenneth Dixon
Ito Smith
Nyheim Hines
Dak Prescott
Jameis Winston
Elijah McGuire
Patrick Mahomes
Aaron Rodgers
Jeff Wilson Jr.
Zach Zenner
Raheem Mostert
Corey Clement
Jalen Richard
Damien Williams
Jaylen Samuels
Marcus Murphy
Spencer Ware
Cordarrelle Patterson
Malcolm Brown
Giovani Bernard
Chase Edmonds
Justin Jackson
Duke Johnson
Taysom Hill
Kalen Ballage
Ty Montgomery
Rex Burkhead
Jay Ajayi
Devontae Booker
Chris Thompson
Wayne Gallman
DJ Moore
Theo Riddick
Alex Smith
Robert Woods
Brian Hill
Dwayne Washington
--- 3 ---
Ryan Fitzpatrick
Tyreek Hill
Andrew Luck
Ryan Tannehill
Josh Rosen
Sam Darnold
Baker Mayfield
Jeff Driskel
Rod Smith
Matt Ryan
Tyrod Taylor
Kirk Cousins
Cody Kessler
Darren Sproles
Josh Johnson
DeAndre Washington
Trenton Cannon
Javorius Allen
Jared Goff
Julian Edelman
Jacquizz Rodgers
Kapri Bibbs
Andy Dalton
Ben Roethlisberger
Dede Westbrook
Case Keenum
Carson Wentz
Brandon Bolden
Curtis Samuel
Stevan Ridley
Keith Ford
Keenan Allen
John Kelly
Kenjon Barner
Matthew Stafford
Tyler Lockett
C.J. Beathard
Cameron Artis-Payne
Devonta Freeman
Brandin Cooks
Isaiah McKenzie
Colt McCoy
Stefon Diggs
Taylor Gabriel
Jarvis Landry
Tavon Austin
Corey Davis
Emmanuel Sanders
Sammy Watkins
Nathan Peterman
EDIT: get all the data as a DataFrame
import requests
import pandas as pd

url = 'https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page='

df = pd.DataFrame()  # empty DF at start

for page in range(1, 4):
    print('page:', page)
    r = requests.get(url + str(page))
    data = r.json()
    #print(data.keys())
    for item in data['athletes']:
        player_name = item['athlete']['displayName']
        position = item['athlete']['position']['abbreviation']
        gp = item['categories'][0]['totals'][0]
        other_values = item['categories'][2]['totals']
        row = [player_name, position, gp] + other_values
        df = df.append([row])  # append one row

df.columns = ['NAME', 'POS', 'GP', 'ATT', 'YDS', 'AVG', 'LNG', 'BIG', 'TD', 'YDS/G', 'FUM', 'LST', 'FD']
print(len(df))  # 150
print(df.head(20))
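Side note: DataFrame.append copies the whole frame on every call and was removed in pandas 2.0, so on current pandas the same loop is usually written by collecting plain lists and building the frame once:

rows = []  # collect plain lists, build the DataFrame once at the end
for page in range(1, 4):
    r = requests.get(url + str(page))
    for item in r.json()['athletes']:
        gp = item['categories'][0]['totals'][0]
        rows.append([item['athlete']['displayName'],
                     item['athlete']['position']['abbreviation'],
                     gp] + item['categories'][2]['totals'])

df = pd.DataFrame(rows, columns=['NAME', 'POS', 'GP', 'ATT', 'YDS', 'AVG',
                                 'LNG', 'BIG', 'TD', 'YDS/G', 'FUM', 'LST', 'FD'])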
