Splitting a string into a pandas dataframe - python

I have a string, exactly as below. My goal is to split it into a dataframe, but I am having trouble getting it to work. I have tried searching Stack Overflow but got nowhere.
'Position Players Average Form\nGoalkeeper Manuel Neuer 4.17017132535\n Defender Diego Godin 4.14973163459\n Defender Giorgio Chiellini 4.10115207373\n Defender Thiago Silva 3.93318274318\n Defender Andrea Barzagli 3.85132973289\nMidfielder Arjen Robben 4.80556193806\nMidfielder Alexander Meier 4.51037598508\nMidfielder Franck Ribery 4.48063714064\nMidfielder David Silva 3.76028050109\n Forward Cristiano Ronaldo 7.87909462636\n Forward Zlatan Ibrahimovic 6.85401665065'
Is there a way to turn this into a dataframe, in a reproducible way so I could do it with other strings?
My goal dataframe would look like as follows:
Position name Average
Goalkeeper Manuel 4.17017132535
Defender Diego 4.14973163459
Defender Giorgio 4.10115207373
Defender Thiago 3.93318274318
Defender Andrea 3.85132973289
Midfielder Arjen 4.80556193806
Midfielder Alexander 4.51037598508
Midfielder Franck 4.48063714064
Midfielder David 3.76028050109
Forward Cristiano 7.87909462636
Forward Zlatan 6.85401665065
I am new to pandas, so any help would be greatly appreciated.

This is one way.
import pandas as pd
mystr = 'Position Players Average Form\nGoalkeeper Manuel Neuer 4.17017132535\n Defender Diego Godin 4.14973163459\n Defender Giorgio Chiellini 4.10115207373\n Defender Thiago Silva 3.93318274318\n Defender Andrea Barzagli 3.85132973289\nMidfielder Arjen Robben 4.80556193806\nMidfielder Alexander Meier 4.51037598508\nMidfielder Franck Ribery 4.48063714064\nMidfielder David Silva 3.76028050109\n Forward Cristiano Ronaldo 7.87909462636\n Forward Zlatan Ibrahimovic 6.85401665065'
lst = mystr.split()
# regroup the flat token list into rows of 4: position, first name, surname, value
data = [lst[pos:pos+4] for pos in range(0, len(lst), 4)]
df = pd.DataFrame(data[1:], columns=data[0])
print(df)
# Position Players Average Form
# 0 Goalkeeper Manuel Neuer 4.17017132535
# 1 Defender Diego Godin 4.14973163459
# 2 Defender Giorgio Chiellini 4.10115207373
# 3 Defender Thiago Silva 3.93318274318
# 4 Defender Andrea Barzagli 3.85132973289
# 5 Midfielder Arjen Robben 4.80556193806
# 6 Midfielder Alexander Meier 4.51037598508
# 7 Midfielder Franck Ribery 4.48063714064
# 8 Midfielder David Silva 3.76028050109
# 9 Forward Cristiano Ronaldo 7.87909462636
# 10 Forward Zlatan Ibrahimovic 6.85401665065
This method will not be perfect in these instances:
Whitespace in column names, as above. In this case, you will need to redefine column names.
Whitespace in player names. This does not appear to be a problem with the data provided.
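If player names can contain spaces (e.g. a three-part surname), a per-line regex that anchors the number at the end of the line is more robust than a fixed chunk size. A minimal sketch, assuming one leading position token and one trailing float per line (shown on a truncated copy of the string):

```python
import re
import pandas as pd

mystr = ('Position Players Average Form\n'
         'Goalkeeper Manuel Neuer 4.17017132535\n'
         ' Forward Zlatan Ibrahimovic 6.85401665065')

rows = []
for line in mystr.splitlines()[1:]:  # skip the header line
    # (position) (everything in the middle = full name) (trailing number)
    m = re.match(r'\s*(\S+)\s+(.+?)\s+([\d.]+)\s*$', line)
    if m:
        position, name, avg = m.groups()
        rows.append([position, name, float(avg)])

df = pd.DataFrame(rows, columns=['Position', 'Players', 'Average'])
print(df)
```

The non-greedy middle group swallows however many name tokens there are, so multi-word names survive intact.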

Here is how you can work around that.
import pandas as pd
from io import StringIO
data = StringIO('Position Players Average Form\nGoalkeeper Manuel Neuer 4.17017132535\n Defender Diego Godin 4.14973163459\n Defender Giorgio Chiellini 4.10115207373\n Defender Thiago Silva 3.93318274318\n Defender Andrea Barzagli 3.85132973289\nMidfielder Arjen Robben 4.80556193806\nMidfielder Alexander Meier 4.51037598508\nMidfielder Franck Ribery 4.48063714064\nMidfielder David Silva 3.76028050109\n Forward Cristiano Ronaldo 7.87909462636\n Forward Zlatan Ibrahimovic 6.85401665065')
df = pd.read_csv(data, sep=r'\s+')  # whitespace-separated; '\s+' also swallows the stray leading spaces
print(df)
Output:
Position Players Average Form
0 Goalkeeper Manuel Neuer 4.17017132535
1 Defender Diego Godin 4.14973163459
2 Defender Giorgio Chiellini 4.10115207373
3 Defender Thiago Silva 3.93318274318
4 Defender Andrea Barzagli 3.85132973289
5 Midfielder Arjen Robben 4.80556193806
6 Midfielder Alexander Meier 4.51037598508
7 Midfielder Franck Ribery 4.48063714064
8 Midfielder David Silva 3.76028050109
9 Forward Cristiano Ronaldo 7.87909462636
10 Forward Zlatan Ibrahimovic 6.85401665065
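Either way, the question's goal frame (Position, first name, numeric Average) is then one selection away. A minimal sketch on a truncated copy of the string; the column names 'name' and 'surname' here are assumptions matching the goal table:

```python
import pandas as pd

mystr = ('Position Players Average Form\n'
         'Goalkeeper Manuel Neuer 4.17017132535\n'
         ' Defender Diego Godin 4.14973163459')

lst = mystr.split()
# rows of 4 tokens: position, first name, surname, value
data = [lst[i:i + 4] for i in range(0, len(lst), 4)]
df = pd.DataFrame(data[1:], columns=['Position', 'name', 'surname', 'Average'])

# keep only the goal columns and make Average numeric
out = df[['Position', 'name', 'Average']].astype({'Average': float})
print(out)
```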

Related

How to deal with long names in data cleaning?

I have a users database. I want to separate the names into two columns, user1 and user2.
My approach was to split the names into multiple columns and then merge them back into the two user columns.
The issue I run into is that some names are long, so after the split their parts occupy extra columns in the data frame, which makes it harder to merge properly.
Users
Maria Melinda Del Valle Justin Howard
Devin Craig Jr. Michael Carter III
Jeanne De Bordeaux Alhamdi
After I split the user columns:
0       1        2         3        4       5       6  7  8
Maria   Melinda  Del       Valle    Justin  Howard
Devin   Craig    Jr.       Michael  Carter  III
Jeanne  De       Bordeaux  Alhamdi
The expected result is the following:
User1                    User2
Maria Melinda Del Valle  Justin Howard
Devin Craig Jr.          Michael Carter III
Jeanne De Bordeaux       Alhamdi
You can use:
def f(sr):
    # group the non-NaN tokens by how many NaNs precede them (0 -> User1, 1 -> User2)
    m = sr.isna().cumsum().loc[lambda x: x < 2]
    return sr.dropna().groupby(m).apply(' '.join)

# df is the split frame shown above
out = df.apply(f, axis=1).rename(columns=lambda x: f'User{x+1}')
Output:
>>> out
User1 User2
0 Maria Melinda Del Valle Justin Howard
1 Devin Craig Jr. Michael Carter III
2 Jeanne De Bordeaux Alhamdi
As suggested by @Barmar, if you know where the blank columns end up after the split, you already know how to build the two user columns.
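To see the NaN-grouping trick in isolation, here is a self-contained sketch; the input frame is a hypothetical reconstruction of the split result, with NaN marking where one user's name ends:

```python
import pandas as pd

# hypothetical split result: a NaN separates user1's name parts from user2's
df = pd.DataFrame([
    ['Maria', 'Melinda', 'Del', 'Valle', None, 'Justin', 'Howard'],
    ['Devin', 'Craig', 'Jr.', None, 'Michael', 'Carter', 'III'],
    ['Jeanne', 'De', 'Bordeaux', None, 'Alhamdi', None, None],
])

def f(sr):
    # cumulative NaN count: 0 before the first gap (User1), 1 after it (User2)
    m = sr.isna().cumsum().loc[lambda x: x < 2]
    return sr.dropna().groupby(m).apply(' '.join)

out = df.apply(f, axis=1).rename(columns=lambda x: f'User{x + 1}')
print(out)
```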

NetworkX graph with some specifications based on two dataframes

I have two dataframes. The first, called df_student, shows the names of the people in a program.
Student-ID  Name
20202456    Luke De Paul
20202713    Emil Smith
20202456    Alexander Müller
20202713    Paul Bernard
20202456    Zoe Michailidis
20202713    Joanna Grimaldi
20202456    Kepler Santos
20202713    Dominic Borg
20202456    Jessica Murphy
20202713    Danielle Dominguez
And the other, called df_course, shows the people who reached the best grade in a course, always including at least one person from df_student.
Course-ID  Name                                Grade
UNI44      Luke De Paul, Benjamin Harper       17
UNI45      Dominic Borg                        20
UNI61      Luke De Paul, Jonathan MacAllister  20
UNI62      Alexander Müller, Kepler Santos     17
UNI63      Joanna Grimaldi                     19
UNI65      Emil Smith, Filippo Visconti        18
UNI71      Moshe Azerad, Emil Smith            18
UNI72      Luke De Paul, Jessica Murphy        18
UNI73      Luke De Paul, Filippo Visconti      17
UNI74      Matthias Noem, Kepler Santos        19
UNI75      Luke De Paul, Kepler Santos         16
UNI76      Kepler Santos                       17
UNI77      Kepler Santos, Benjamin Harper      17
UNI78      Dominic Borg, Kepler Santos         18
UNI80      Luke De Paul, Gabriel Martin        18
UNI81      Dominic Borg, Alexander Müller      19
UNI82      Luke De Paul, Giancarlo Di Lorenzo  20
UNI83      Emil Smith,Joanna Grimaldi          20
I would like to create a NetworkX graph with a vertex for each student from df_student and each student from df_course. There should be an unweighted edge between two vertices only if those two students received the best grade in the same course.
Now what I tried is this:
import networkx as nx
G = nx.Graph()
G.add_edge(student, course)
But when I run it, it says the argument is not right, so I don't know how to continue.
Try:
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
df_students = pd.read_clipboard()
df_course = pd.read_clipboard()
# split each course's comma-separated names into two columns
df_s_t = df_course['Name'].str.split(',', expand=True)
# rows with two names become edges
df_net = df_s_t[df_s_t.notna().all(1)]
G = nx.from_pandas_edgelist(df_net, 0, 1)
# single-name rows and all enrolled students become (possibly isolated) nodes
G.add_nodes_from(pd.concat([df_students['Name'],
                            df_s_t.loc[~df_s_t.notna().all(1), 0]]))
fig, ax = plt.subplots(1, 1, figsize=(15, 15))
nx.draw_networkx(G)
Output:
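A self-contained sketch of the same idea, using small inline stand-ins for the clipboard dataframes (the subset of students and courses is illustrative):

```python
import networkx as nx
import pandas as pd

# hypothetical subsets of df_student and df_course
df_students = pd.DataFrame({'Name': ['Luke De Paul', 'Emil Smith',
                                     'Kepler Santos', 'Dominic Borg']})
df_course = pd.DataFrame({'Name': ['Luke De Paul, Kepler Santos',
                                   'Dominic Borg',
                                   'Emil Smith, Filippo Visconti']})

# split the comma-separated names and strip the stray spaces the split leaves
df_s_t = df_course['Name'].str.split(',', expand=True).apply(lambda c: c.str.strip())

# two-name rows become edges
pairs = df_s_t[df_s_t.notna().all(axis=1)]
G = nx.from_pandas_edgelist(pairs, 0, 1)

# one-name rows and all enrolled students become (possibly isolated) nodes
G.add_nodes_from(pd.concat([df_students['Name'],
                            df_s_t.loc[~df_s_t.notna().all(axis=1), 0]]))

print(sorted(G.nodes))
```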

Draw a Map of cities in python

I have a ranking of cities across the world in a variable called rank_2000 that looks like this:
Seoul
Tokyo
Paris
New_York_Greater
Shizuoka
Chicago
Minneapolis
Boston
Austin
Munich
Salt_Lake
Greater_Sydney
Houston
Dallas
London
San_Francisco_Greater
Berlin
Seattle
Toronto
Stockholm
Atlanta
Indianapolis
Fukuoka
San_Diego
Phoenix
Frankfurt_am_Main
Stuttgart
Grenoble
Albany
Singapore
Washington_Greater
Helsinki
Nuremberg
Detroit_Greater
TelAviv
Zurich
Hamburg
Pittsburgh
Philadelphia_Greater
Taipei
Los_Angeles_Greater
Miami_Greater
MannheimLudwigshafen
Brussels
Milan
Montreal
Dublin
Sacramento
Ottawa
Vancouver
Malmo
Karlsruhe
Columbus
Dusseldorf
Shenzen
Copenhagen
Milwaukee
Marseille
Greater_Melbourne
Toulouse
Beijing
Dresden
Manchester
Lyon
Vienna
Shanghai
Guangzhou
San_Antonio
Utrecht
New_Delhi
Basel
Oslo
Rome
Barcelona
Madrid
Geneva
Hong_Kong
Valencia
Edinburgh
Amsterdam
Taichung
The_Hague
Bucharest
Muenster
Greater_Adelaide
Chengdu
Greater_Brisbane
Budapest
Manila
Bologna
Quebec
Dubai
Monterrey
Wellington
Shenyang
Tunis
Johannesburg
Auckland
Hangzhou
Athens
Wuhan
Bangalore
Chennai
Istanbul
Cape_Town
Lima
Xian
Bangkok
Penang
Luxembourg
Buenos_Aires
Warsaw
Greater_Perth
Kuala_Lumpur
Santiago
Lisbon
Dalian
Zhengzhou
Prague
Changsha
Chongqing
Ankara
Fuzhou
Jinan
Xiamen
Sao_Paulo
Kunming
Jakarta
Cairo
Curitiba
Riyadh
Rio_de_Janeiro
Mexico_City
Hefei
Almaty
Beirut
Belgrade
Belo_Horizonte
Bogota_DC
Bratislava
Dhaka
Durban
Hanoi
Ho_Chi_Minh_City
Kampala
Karachi
Kuwait_City
Manama
Montevideo
Panama_City
Quito
San_Juan
What I would like to do is a map of the world where those cities are colored according to their position in the ranking above. I am open to other representations (such as bubbles whose size grows with a city's rank position or, if necessary, plotting only a sample of cities taken from the top, middle and bottom of the ranking).
Thank you,
Federico
Your question has two parts: finding the location of each city, and then drawing the cities on the map. Assuming you have the latitude and longitude of each city, here's how you'd tackle the latter part.
I like Folium (https://pypi.org/project/folium/) for drawing maps. Here's an example of how you might draw a circle for each city, with its position in the list used to determine the size of that circle.
import folium
cities = [
    {'name': 'Seoul', 'coords': [37.5639715, 126.9040468]},
    {'name': 'Tokyo', 'coords': [35.5090627, 139.2094007]},
    {'name': 'Paris', 'coords': [48.8588787, 2.2035149]},
    {'name': 'New York', 'coords': [40.6976637, -74.1197631]},
    # etc. etc.
]
m = folium.Map(zoom_start=15)
for counter, city in enumerate(cities):
    circle_size = 5 + counter
    folium.CircleMarker(
        location=city['coords'],
        radius=circle_size,
        popup=city['name'],
        color="crimson",
        fill=True,
        fill_color="crimson",
    ).add_to(m)
m.save('map.html')
Output:
You may need to adjust the circle_size calculation a little to work with the number of cities you want to include.
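One detail worth noting: with enumerate, later (worse-ranked) cities get the larger circles. If you want the top of the ranking to stand out instead, invert the relationship. A sketch of one way to do that (the size bounds are arbitrary choices, and the list is truncated):

```python
rank_2000 = ['Seoul', 'Tokyo', 'Paris', 'New_York_Greater']  # truncated list

# linear size ramp: rank 0 gets `largest`, the last rank gets `smallest`
def circle_size(rank_index, n, smallest=4, largest=20):
    return largest - (largest - smallest) * rank_index / max(n - 1, 1)

sizes = {city: circle_size(i, len(rank_2000)) for i, city in enumerate(rank_2000)}
print(sizes)  # Seoul -> 20.0 ... New_York_Greater -> 4.0
```

The resulting value can be passed straight to CircleMarker's radius parameter.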

Scraping table by beautiful soup 4

Hello, I am trying to scrape the table at this URL: https://www.espn.com/nfl/stats/player/_/stat/rushing/season/2018/seasontype/2/table/rushing/sort/rushingYards/dir/desc
There are 50 rows in this table; however, if you click Show More (just below the table), more rows appear. My Beautiful Soup code works fine, but it retrieves only the first 50 rows. It does not retrieve the rows that appear after clicking Show More. How can I get all the rows, including the first 50 and those that appear after clicking Show More?
Here is the code:
# Request to get the target page
import requests
import pandas as pd
from bs4 import BeautifulSoup
rqst = requests.get("https://www.espn.com/nfl/stats/player/_/stat/rushing/season/2018/seasontype/2/table/rushing/sort/rushingYards/dir/desc")
soup = BeautifulSoup(rqst.content, 'lxml')
table = soup.find_all('table')
NFL_player_stats = pd.read_html(str(table))
players = NFL_player_stats[0]
players.shape
out[0]: (50,1)
Using DevTools in Firefox, I can see that the page fetches the data for the next rows (in JSON format) from
https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page=2
If you change the value in page= then you can get the other pages.
import requests
url = 'https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page='
for page in range(1, 4):
    print('\n---', page, '---\n')
    r = requests.get(url + str(page))
    data = r.json()
    #print(data.keys())
    for item in data['athletes']:
        print(item['athlete']['displayName'])
Result:
--- 1 ---
Ezekiel Elliott
Saquon Barkley
Todd Gurley II
Joe Mixon
Chris Carson
Christian McCaffrey
Derrick Henry
Adrian Peterson
Phillip Lindsay
Nick Chubb
Lamar Miller
James Conner
David Johnson
Jordan Howard
Sony Michel
Marlon Mack
Melvin Gordon
Alvin Kamara
Peyton Barber
Kareem Hunt
Matt Breida
Tevin Coleman
Aaron Jones
Doug Martin
Frank Gore
Gus Edwards
Lamar Jackson
Isaiah Crowell
Mark Ingram II
Kerryon Johnson
Josh Allen
Dalvin Cook
Latavius Murray
Carlos Hyde
Austin Ekeler
Deshaun Watson
Kenyan Drake
Royce Freeman
Dion Lewis
LeSean McCoy
Mike Davis
Josh Adams
Alfred Blue
Cam Newton
Jamaal Williams
Tarik Cohen
Leonard Fournette
Alfred Morris
James White
Mitchell Trubisky
--- 2 ---
Rashaad Penny
LeGarrette Blount
T.J. Yeldon
Alex Collins
C.J. Anderson
Chris Ivory
Marshawn Lynch
Russell Wilson
Blake Bortles
Wendell Smallwood
Marcus Mariota
Bilal Powell
Jordan Wilkins
Kenneth Dixon
Ito Smith
Nyheim Hines
Dak Prescott
Jameis Winston
Elijah McGuire
Patrick Mahomes
Aaron Rodgers
Jeff Wilson Jr.
Zach Zenner
Raheem Mostert
Corey Clement
Jalen Richard
Damien Williams
Jaylen Samuels
Marcus Murphy
Spencer Ware
Cordarrelle Patterson
Malcolm Brown
Giovani Bernard
Chase Edmonds
Justin Jackson
Duke Johnson
Taysom Hill
Kalen Ballage
Ty Montgomery
Rex Burkhead
Jay Ajayi
Devontae Booker
Chris Thompson
Wayne Gallman
DJ Moore
Theo Riddick
Alex Smith
Robert Woods
Brian Hill
Dwayne Washington
--- 3 ---
Ryan Fitzpatrick
Tyreek Hill
Andrew Luck
Ryan Tannehill
Josh Rosen
Sam Darnold
Baker Mayfield
Jeff Driskel
Rod Smith
Matt Ryan
Tyrod Taylor
Kirk Cousins
Cody Kessler
Darren Sproles
Josh Johnson
DeAndre Washington
Trenton Cannon
Javorius Allen
Jared Goff
Julian Edelman
Jacquizz Rodgers
Kapri Bibbs
Andy Dalton
Ben Roethlisberger
Dede Westbrook
Case Keenum
Carson Wentz
Brandon Bolden
Curtis Samuel
Stevan Ridley
Keith Ford
Keenan Allen
John Kelly
Kenjon Barner
Matthew Stafford
Tyler Lockett
C.J. Beathard
Cameron Artis-Payne
Devonta Freeman
Brandin Cooks
Isaiah McKenzie
Colt McCoy
Stefon Diggs
Taylor Gabriel
Jarvis Landry
Tavon Austin
Corey Davis
Emmanuel Sanders
Sammy Watkins
Nathan Peterman
EDIT: get all the data as a DataFrame
import requests
import pandas as pd
url = 'https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page='
rows = []  # collect rows in a list first (df.append() is gone from modern pandas)
for page in range(1, 4):
    print('page:', page)
    r = requests.get(url + str(page))
    data = r.json()
    #print(data.keys())
    for item in data['athletes']:
        player_name = item['athlete']['displayName']
        position = item['athlete']['position']['abbreviation']
        gp = item['categories'][0]['totals'][0]
        other_values = item['categories'][2]['totals']
        rows.append([player_name, position, gp] + other_values)
df = pd.DataFrame(rows, columns=['NAME', 'POS', 'GP', 'ATT', 'YDS', 'AVG', 'LNG', 'BIG', 'TD', 'YDS/G', 'FUM', 'LST', 'FD'])
print(len(df))  # 150
print(df.head(20))
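range(1, 4) hard-codes three pages. If you don't know the page count up front, you can keep requesting pages until the API returns an empty athletes list. A sketch of that loop logic, using a stubbed fetch function here so it is testable offline; swap in requests.get(url + str(page)).json()['athletes'] for the real API:

```python
# stand-in for requests.get(url + str(page)).json()['athletes']
FAKE_PAGES = {1: ['A', 'B'], 2: ['C']}

def fetch_page(page):
    return FAKE_PAGES.get(page, [])

def collect_all(fetch):
    """Request successive pages until an empty one comes back."""
    athletes, page = [], 1
    while True:
        batch = fetch(page)
        if not batch:  # an empty list means we are past the last page
            break
        athletes.extend(batch)
        page += 1
    return athletes

print(collect_all(fetch_page))  # ['A', 'B', 'C']
```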

How to pull only certain fields with BeautifulSoup

I'm trying to print all the fields that have England in them. The current code I have prints all the nationalities into a txt file for me, but I want just the England fields. The page I'm pulling from is https://www.premierleague.com/players
import requests
from bs4 import BeautifulSoup
r=requests.get("https://www.premierleague.com/players")
c=r.content
soup=BeautifulSoup(c, "html.parser")
players = open("playerslist.txt", "w+")
for playerCountry in soup.findAll("span", {"class":"playerCountry"}):
    players.write(playerCountry.text.strip())
    players.write("\n")
Just check whether it's not equal to 'England', and if so, skip to the next item in the list:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.premierleague.com/players")
c = r.content
soup = BeautifulSoup(c, "html.parser")
players = open("playerslist.txt", "w+")
for playerCountry in soup.findAll("span", {"class": "playerCountry"}):
    if playerCountry.text.strip() != 'England':
        continue
    players.write(playerCountry.text.strip())
    players.write("\n")
players.close()
Or, you could use pandas.read_html() and a couple of lines of code. Note that this filter keeps everyone except England (matching the sample output below); use == 'England' instead to keep only the England rows:
import pandas as pd
df = pd.read_html("https://www.premierleague.com/players")[0]
print(df.loc[df['Nationality'] != 'England'])
Prints:
Player Position Nationality
2 Charlie Adam Midfielder Scotland
3 Adrián Goalkeeper Spain
4 Adrien Silva Midfielder Portugal
5 Ibrahim Afellay Midfielder Netherlands
6 Benik Afobe Forward The Democratic Republic Of Congo
7 Sergio Agüero Forward Argentina
9 Soufyan Ahannach Midfielder Netherlands
10 Ahmed Hegazi Defender Egypt
11 Nathan Aké Defender Netherlands
14 Toby Alderweireld Defender Belgium
15 Aleix García Midfielder Spain
17 Ali Gabr Defender Egypt
18 Allan Nyom Defender Cameroon
19 Allan Souza Midfielder Brazil
20 Joe Allen Midfielder Wales
22 Marcos Alonso Defender Spain
23 Paulo Alves Midfielder Portugal
24 Daniel Amartey Midfielder Ghana
25 Jordi Amat Defender Spain
27 Ethan Ampadu Defender Wales
28 Nordin Amrabat Forward Morocco
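If you do want the England-only txt file the original code produced, the pandas route can feed it directly. A sketch with an inline stand-in frame (the player names here are illustrative, not scraped data):

```python
import pandas as pd

# stand-in for pd.read_html("https://www.premierleague.com/players")[0]
df = pd.DataFrame({'Player': ['Charlie Adam', 'Joe Allen', 'Harry Kane'],
                   'Nationality': ['Scotland', 'Wales', 'England']})

# keep only England rows and write the names, one per line
england = df.loc[df['Nationality'] == 'England', 'Player']
england.to_csv('playerslist.txt', index=False, header=False)
print(england.tolist())  # ['Harry Kane']
```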
