I am trying to download data from a website. The downloaded table includes some rows that are not part of the data; they are easy to spot because their first column is not a number.
So I'm getting something like:
GM_Num Date Tm
1 Monday, Apr 3 LAA
2 Tuesday, Apr 4 LAA
... ... ...
Gm# May Tm
where the last row is one that I want to drop. In the actual table, there are multiple rows like this randomly throughout the table.
Here is the code that I have tried so far to drop those rows:
import requests
import pandas as pd
url = 'https://www.baseball-reference.com/teams/LAA/2017-schedule-scores.shtml'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
df.rename(columns={"Gm#": "GM_Num"}, inplace = True)
#Attempts that didn't work:
df[df['GM_Num'].str.isdigit().isnull()]
#df[df.GM_Num.apply(lambda x: x.isnumeric())].set_index('GM_Num', inplace = True)
#df.set_index('GM_Num', inplace = True)
df
Thank you in advance for any help!
Let's cast your 'Gm#' column and drop records in a couple of steps:
df['Gm#'] = pd.to_numeric(df['Gm#'], errors='coerce')
df = df.dropna(subset=['Gm#'])
df
Output:
Gm# Date Unnamed: 2 Tm Unnamed: 4 Opp W/L R RA \
0 1.0 Monday, Apr 3 boxscore LAA # OAK L 2 4
1 2.0 Tuesday, Apr 4 boxscore LAA # OAK W 7 6
2 3.0 Wednesday, Apr 5 boxscore LAA # OAK W 5 0
3 4.0 Thursday, Apr 6 boxscore LAA # OAK L 1 5
4 5.0 Friday, Apr 7 boxscore LAA NaN SEA W 5 1
.. ... ... ... ... ... ... ... .. ..
162 158.0 Wednesday, Sep 27 boxscore LAA # CHW L-wo 4 6
163 159.0 Thursday, Sep 28 boxscore LAA # CHW L 4 5
164 160.0 Friday, Sep 29 boxscore LAA NaN SEA W 6 5
165 161.0 Saturday, Sep 30 boxscore LAA NaN SEA L 4 6
167 162.0 Sunday, Oct 1 boxscore LAA NaN SEA W 6 2
Inn ... Rank GB Win Loss Save Time D/N \
0 NaN ... 3 1.0 Graveman Nolasco Casilla 2:56 N
1 NaN ... 2 1.0 Bailey Dull Bedrosian 3:17 N
2 NaN ... 2 1.0 Ramirez Cotton NaN 3:15 N
3 NaN ... 2 1.0 Triggs Skaggs NaN 2:44 D
4 NaN ... 1 Tied Chavez Gallardo NaN 2:56 N
.. ... ... ... ... ... ... ... ... ..
162 10 ... 2 20.0 Farquhar Parker NaN 3:58 N
163 NaN ... 2 21.0 Infante Chavez Minaya 3:04 N
164 NaN ... 2 21.0 Wood Rzepczynski Parker 3:01 N
165 NaN ... 2 21.0 Lawrence Bedrosian Diaz 3:32 N
167 NaN ... 2 21.0 Bridwell Simmons NaN 2:38 D
Attendance Streak Orig. Scheduled
0 36067 - NaN
1 11225 + NaN
2 13405 ++ NaN
3 13292 - NaN
4 43911 + NaN
.. ... ... ...
162 17012 - NaN
163 19596 -- NaN
164 35106 + NaN
165 38075 - NaN
167 34940 + NaN
[162 rows x 21 columns]
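To try the same two steps without hitting the website, here is a minimal sketch on a small hand-made frame. The sample rows are made up to mimic the repeated header row from the question, and the final cast back to int plus the index reset are optional extras, not part of the answer above.

```python
import pandas as pd

# Hypothetical sample that mimics the scraped table, including a repeated header row
df = pd.DataFrame({
    "Gm#": ["1", "2", "Gm#", "3"],
    "Date": ["Monday, Apr 3", "Tuesday, Apr 4", "May", "Wednesday, Apr 5"],
    "Tm": ["LAA", "LAA", "Tm", "LAA"],
})

# Non-numeric entries become NaN instead of raising an error
df["Gm#"] = pd.to_numeric(df["Gm#"], errors="coerce")

# Drop the rows whose game number could not be parsed, then restore an integer dtype
clean = (
    df.dropna(subset=["Gm#"])
      .astype({"Gm#": int})
      .reset_index(drop=True)
)
print(clean)
```

The first attempt in the question selects nothing because str.isdigit() already returns booleans for string entries, so calling isnull() on the result is False for every such row.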
I want to export the table "Meilleurs buteurs par édition" to a CSV file. I tried the code below, but the CSV file comes out empty, and the output shows the first table on the page rather than the one I need. Can someone help me, please?
from bs4 import BeautifulSoup
import requests
import pandas as pd
URL='https://fr.wikipedia.org/wiki/Liste_des_buteurs_de_la_Coupe_du_monde_de_football'
results=[]
response = requests.get(URL)
soup= BeautifulSoup(response.text, 'html.parser')
#print(soup)
#table= soup.find('table')
table = soup.find("table")
tbody=table.find("tbody")
#table = soup.find("table", {"class":"wikitable sortable jquery-tablesorter"})
#table = soup.find("table", {"class":"wikitable sortable alternance jquery-tablesorter"}).tbody
#print(table)
rows=table.findAll('tr')
columns=[v.text.replace('\n', '') for v in rows[0].find_all('th')]
df=pd.DataFrame(columns=columns)
for i in range (1,len(rows)):
    tds=rows[i].find_all('td')
    if len(tds)==6:
        values = [tds[0].text,tds[1].text,tds[2].text,tds[3].text,tds[4].text,tds[5].text]
    else:
        #for i in range(7):
        #    df = df.append({'columns': i}, ignore_index=True)
        values=[td.text for td in tds]
    df = df.append(pd.Series(values), ignore_index=True)
print(df)
print(columns)
df = pd.DataFrame(columns=['A'])
for i in range(5):
    df = df.append({'A': i}, ignore_index=True)
df = pd.DataFrame({'test': results})
df.to_csv('but.csv', index=False, encoding='utf-8')
Output
Rang Joueur Équipe ... 3 4 5
0 NaN NaN NaN ... 24\n 0,67\n 16\n
1 NaN NaN NaN ... 19\n 0,79\n 15\n
2 NaN NaN NaN ... 13\n 1,08\n 14\n
3 NaN NaN NaN ... 6\n 2,17\n 13\n
4 NaN NaN NaN ... 14\n 0,86\n 12\n
5 NaN NaN NaN ... 5\n 2,2\n 11\n
6 NaN NaN NaN ... 17\n 0,65\n 11\n
7 NaN NaN NaN ... 10\n 1\n 10\n
8 NaN NaN NaN ... 12\n 0,83\n 10\n
9 NaN NaN NaN ... 12\n 0,83\n 10\n
10 NaN NaN NaN ... 13\n 0,77\n 10\n
11 NaN NaN NaN ... 16\n 0,63\n 10\n
12 NaN NaN NaN ... 20\n 0,5\n 10\n
[13 rows x 13 columns]
['Rang', 'Joueur', 'Équipe', 'Détail par édition', 'Matchs', 'Ratio', 'Buts']
The easiest way is to use pandas.read_html:
import pandas as pd
url = "https://fr.wikipedia.org/wiki/Liste_des_buteurs_de_la_Coupe_du_monde_de_football"
df = pd.read_html(url)[1]
df["Ratio"] = df["Buts"] / df["Matchs"]
print(df)
df.to_csv("data.csv", index=False)
Prints:
Édition Joueur Équipe Matchs Ratio Buts
0 1930 Guillermo Stábile Argentine 4 2.000000 8
1 1934 Oldřich Nejedlý Tchécoslovaquie 4 1.250000 5
2 1938 Leônidas Brésil 4 1.750000 7
3 1950 Ademir Brésil 6 1.333333 8
4 1954 Sándor Kocsis Hongrie 5 2.200000 11
5 1958 Just Fontaine France 6 2.166667 13
6 1962 Flórián Albert Hongrie 3 1.333333 4
7 1962 Garrincha Brésil 6 0.666667 4
8 1962 Valentin Ivanov Union soviétique 4 1.000000 4
9 1962 Dražan Jerković Yougoslavie 6 0.666667 4
10 1962 Leonel Sánchez Chili 6 0.666667 4
11 1962 Vavá Brésil 6 0.666667 4
12 1966 Eusébio Portugal 6 1.500000 9
13 1970 Gerd Müller Allemagne de l’Ouest 6 1.666667 10
14 1974 Grzegorz Lato Pologne 7 1.000000 7
15 1978 Mario Kempes Argentine 7 0.857143 6
16 1982 Paolo Rossi Italie 7 0.857143 6
17 1986 Gary Lineker Angleterre 5 1.200000 6
18 1990 Salvatore Schillaci Italie 7 0.857143 6
19 1994 Oleg Salenko Russie 3 2.000000 6
20 1994 Hristo Stoitchkov Bulgarie 7 0.857143 6
21 1998 Davor Šuker Croatie 7 0.857143 6
22 2002 Ronaldo Brésil 7 1.142857 8
23 2006 Miroslav Klose Allemagne 7 0.714286 5
24 2010 Diego Forlán Uruguay 7 0.714286 5
25 2010 Thomas Müller Allemagne 6 0.833333 5
26 2010 Wesley Sneijder Pays-Bas 7 0.714286 5
27 2010 David Villa Espagne 7 0.714286 5
28 2014 James Rodríguez Colombie 5 1.200000 6
29 2018 Harry Kane Angleterre 6 1.000000 6
and saves the table to data.csv.
I would like to replace missing values based on the values of the column Submitted.
Find below what I have:
Year  Country  Submitted  Age12  Age14
2018  CHI      1          267    NaN
2019  CHI      NaN        NaN    NaN
2020  CHI      1          244    203
2018  ALB      1          163    165
2019  ALB      1          NaN    NaN
2020  ALB      1          161    NaN
2018  GER      1          451    381
2019  GER      NaN        NaN    NaN
2020  GER      1          361    321
And this is what I would like to have:
Year  Country  Submitted  Age12  Age14
2018  CHI      1          267    NaN
2019  CHI      NaN        267    NaN
2020  CHI      1          244    203
2018  ALB      1          163    165
2019  ALB      1          NaN    NaN
2020  ALB      1          161    NaN
2018  GER      1          451    381
2019  GER      NaN        451    381
2020  GER      1          361    321
I tried df.fillna(axis=0, method='ffill'),
but this replaces every NaN with the previous value. That is not what I want, because some values should stay NaN when the "Submitted" value is 1.
I would like to fill a value from the previous row only if the respective "Submitted" value is NaN.
Thank you
Try using where together with what you did:
df = df.where(~df.Submitted.isnull(), df.fillna(axis=0, method='ffill'))
This replaces entries only in rows where Submitted is null. Note that it forward-fills every column in those rows, including Submitted itself.
You can do a conditional ffill() using np.where
import numpy as np
(
    df.assign(Age12=np.where(df.Submitted.isna(), df.Age12.ffill(), df.Age12))
      .assign(Age14=np.where(df.Submitted.isna(), df.Age14.ffill(), df.Age14))
)
You can use .filter() to select the related columns and put the columns in the list cols. Then, use .mask() to change the values of the selected columns by forward fill using ffill() when Submitted is NaN, as follows:
cols = df.filter(like='Age').columns
df[cols] = df[cols].mask(df['Submitted'].isna(), df[cols].ffill())
Result:
print(df)
Year Country Submitted Age12 Age14
0 2018 CHI 1.0 267.0 NaN
1 2019 CHI NaN 267.0 NaN
2 2020 CHI 1.0 244.0 203.0
3 2018 ALB 1.0 163.0 165.0
4 2019 ALB 1.0 NaN NaN
5 2020 ALB 1.0 161.0 NaN
6 2018 GER 1.0 451.0 381.0
7 2019 GER NaN 451.0 381.0
8 2020 GER 1.0 361.0 321.0
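For a self-contained version of the mask approach, here is a minimal sketch that rebuilds part of the sample data from the question. Forward-filling within each Country is an added assumption, so that a value never leaks from one country into the next; the answers above fill across the whole frame.

```python
import numpy as np
import pandas as pd

# Sample rows taken from the question (CHI and ALB only)
df = pd.DataFrame({
    "Year": [2018, 2019, 2020, 2018, 2019, 2020],
    "Country": ["CHI", "CHI", "CHI", "ALB", "ALB", "ALB"],
    "Submitted": [1, np.nan, 1, 1, 1, 1],
    "Age12": [267, np.nan, 244, 163, np.nan, 161],
    "Age14": [np.nan, np.nan, 203, 165, np.nan, np.nan],
})

cols = ["Age12", "Age14"]

# Assumption: fill only from earlier rows of the same country
filled = df.groupby("Country")[cols].ffill()

# Take the filled values only in rows where Submitted is missing
df[cols] = df[cols].mask(df["Submitted"].isna(), filled)
print(df)
```

Because only the Age columns are assigned, Submitted stays NaN in the rows that were filled.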
I just used a for loop to check and update the values in the dataframe
import pandas as pd
new_data = [[2018,'CHI',1,267,30], [2019,'CHI','NaN','NaN','NaN'], [2020,'CHI',1,244,203]]
df = pd.DataFrame(new_data, columns = ['Year','Country','Submitted','Age12','Age14'])
prevValue12 = df.iloc[0]['Age12']
prevValue14 = df.iloc[0]['Age14']
for index, row in df.iterrows():
    if(row['Submitted']=='NaN'):
        df.at[index,'Age12']=prevValue12
        df.at[index,'Age14']=prevValue14
    prevValue12 = row['Age12']
    prevValue14 = row['Age14']
print(df)
output
Year Country Submitted Age12 Age14
0 2018 CHI 1 267 30
1 2019 CHI NaN 267 30
2 2020 CHI 1 244 203
I want to merge the following 2 data frames in Pandas but the result isn't containing all the relevant columns:
L1aIn[0:5]
Filename OrbitNumber OrbitMode OrbitModeCounter Year Month Day L1aIn
0 oco2_L1aInDP_35863a_210329_B10206_210330111927.h5 35863 DP a 2021 3 29 1
1 oco2_L1aInDP_35862a_210329_B10206_210330111935.h5 35862 DP a 2021 3 29 1
2 oco2_L1aInDP_35861b_210329_B10206_210330111934.h5 35861 DP b 2021 3 29 1
3 oco2_L1aInLP_35861a_210329_B10206_210330111934.h5 35861 LP a 2021 3 29 1
4 oco2_L1aInSP_35861a_210329_B10206_210330111934.h5 35861 SP a 2021 3 29 1
L2Std[0:5]
Filename OrbitNumber OrbitMode OrbitModeCounter Year Month Day L2Std
0 oco2_L2StdGL_35861a_210329_B10206r_21042704283... 35861 GL a 2021 3 29 1
1 oco2_L2StdXS_35860a_210329_B10206r_21042700342... 35860 XS a 2021 3 29 1
2 oco2_L2StdND_35852a_210329_B10206r_21042622540... 35852 ND a 2021 3 29 1
3 oco2_L2StdGL_35862a_210329_B10206r_21042622403... 35862 GL a 2021 3 29 1
4 oco2_L2StdTG_35856a_210329_B10206r_21042622422... 35856 TG a 2021 3 29 1
>>> df = L1aIn.copy(deep=True)
>>> df.merge(L2Std, how="outer", on=["OrbitNumber","OrbitMode","OrbitModeCounter"])
0 oco2_L1aInDP_35863a_210329_B10206_210330111927.h5 35863 DP a ... NaN NaN NaN NaN
1 oco2_L1aInDP_35862a_210329_B10206_210330111935.h5 35862 DP a ... NaN NaN NaN NaN
2 oco2_L1aInDP_35861b_210329_B10206_210330111934.h5 35861 DP b ... NaN NaN NaN NaN
3 oco2_L1aInLP_35861a_210329_B10206_210330111934.h5 35861 LP a ... NaN NaN NaN NaN
4 oco2_L1aInSP_35861a_210329_B10206_210330111934.h5 35861 SP a ... NaN NaN NaN NaN
5 NaN 35861 GL a ... 2021.0 3.0 29.0 1.0
6 NaN 35860 XS a ... 2021.0 3.0 29.0 1.0
7 NaN 35852 ND a ... 2021.0 3.0 29.0 1.0
8 NaN 35862 GL a ... 2021.0 3.0 29.0 1.0
9 NaN 35856 TG a ... 2021.0 3.0 29.0 1.0
[10 rows x 13 columns]
>>> df.columns
Index(['Filename', 'OrbitNumber', 'OrbitMode', 'OrbitModeCounter', 'Year',
'Month', 'Day', 'L1aIn'],
dtype='object')
I want the resulting merged table to include both the "L1aIn" and "L2Std" columns but as you can see it doesn't and only picks up the original columns from L1aIn.
I'm also puzzled about why it seems to be returning a dataframe object rather than None.
A toy example works fine for me, but the real-life one does not. What circumstances provoke this kind of behavior for merge?
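It seems to me that you just need to assign the output of merge to a variable. merge returns a new DataFrame and leaves df unchanged, which is why df.columns still shows only the original columns: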
merged_df = df.merge(L2Std, how="outer", on=["OrbitNumber","OrbitMode","OrbitModeCounter"])
print(merged_df.columns)
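A minimal sketch of that behaviour on two toy frames. The frame names left and right and their few rows are made up for illustration; only the key column names come from the question.

```python
import pandas as pd

keys = ["OrbitNumber", "OrbitMode", "OrbitModeCounter"]

# Hypothetical toy frames standing in for L1aIn and L2Std
left = pd.DataFrame({
    "OrbitNumber": [35863, 35861],
    "OrbitMode": ["DP", "GL"],
    "OrbitModeCounter": ["a", "a"],
    "L1aIn": [1, 1],
})
right = pd.DataFrame({
    "OrbitNumber": [35861, 35860],
    "OrbitMode": ["GL", "XS"],
    "OrbitModeCounter": ["a", "a"],
    "L2Std": [1, 1],
})

# merge does not modify `left`; the combined table exists only in the return value
merged = left.merge(right, how="outer", on=keys)

print(left.columns.tolist())    # still only the original columns
print(merged.columns.tolist())  # includes both L1aIn and L2Std
```

Rows match only when all three key columns agree, so an outer merge keeps unmatched rows from both sides with NaN in the other side's columns.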
I want to add a new column with the number of times per year that the points were over 700, counting only years after 2014.
import numpy as np
import pandas as pd
ipl_data = {'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
            'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
df.loc[(df['Points'] > 700) & (df['Year'] > 2014), 'High_points'] = df['Points']
#df['Point_per_year_gr_700']=df.groupby(by='Year')['Points'].transform('count')
df['Point_per_year_gr_700']=grouped['Points'].agg(np.size)
The final DataFrame should look like this, but I can't get 'Point_per_year_gr_700' right:
    Year  Points  Point_per_year_gr_700  High_points
0   2014     876                                 NaN
1   2015     789                      3        789.0
2   2014     863                                 NaN
3   2015     673                                 NaN
4   2014     741                                 NaN
5   2015     812                      3        812.0
6   2016     756                      1        756.0
7   2017     788                      1        788.0
8   2016     694                                 NaN
9   2014     701                                 NaN
10  2015     804                      3        804.0
11  2017     690                                 NaN
Use where to mask the DataFrame to NaN where your condition isn't met. You can use this to create the High_points column and also to exclude rows that shouldn't count when you groupby year and find how many rows satisfy High_points each year.
df['High_points'] = df['Points'].where(df['Year'].gt(2014) & df['Points'].gt(700))
df['ppy_gt700'] = (df.where(df['High_points'].notnull())
.groupby('Year')['Year'].transform('size'))
Year Points High_points ppy_gt700
0 2014 876 NaN NaN
1 2015 789 789.0 3.0
2 2014 863 NaN NaN
3 2015 673 NaN NaN
4 2014 741 NaN NaN
5 2015 812 812.0 3.0
6 2016 756 756.0 1.0
7 2017 788 788.0 1.0
8 2016 694 NaN NaN
9 2014 701 NaN NaN
10 2015 804 804.0 3.0
11 2017 690 NaN NaN
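As a runnable variant on the question's data, the sketch below counts with an explicit boolean mask instead of masking the whole DataFrame before grouping. This is a swapped-in technique that should give the same columns; the name is_high is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
    "Points": [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
})

# True for rows that are after 2014 and above 700 points
is_high = df["Year"].gt(2014) & df["Points"].gt(700)

# Keep the points only where the condition holds
df["High_points"] = df["Points"].where(is_high)

# Count qualifying rows per year, then blank out rows that do not qualify
per_year = is_high.astype(int).groupby(df["Year"]).transform("sum")
df["ppy_gt700"] = per_year.where(is_high)
print(df)
```

Grouping the mask by the original Year column avoids grouping on NaN keys, which is what happens when the whole frame is masked first.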
I have a dataframe with information about sales of some products (unit):
unit year month price
0 1 2018 6 100
1 1 2013 4 70
2 2 2015 10 80
3 2 2015 2 110
4 3 2017 4 120
5 3 2002 6 90
6 4 2016 1 55
and I would like to add, for each sale, columns with information about the previous sales and NaN if there is no previous sale.
unit year month price prev_price prev_year prev_month
0 1 2018 6 100 70.0 2013.0 4.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 110.0 2015.0 2.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 90.0 2002.0 6.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN
For the moment I am doing some grouping on the unit, keeping those that have several rows, then extracting the information for these units that are associated with the minimal date. Then joining this table with my original table keeping only the rows that have a different date in the 2 tables that have been merged.
I feel like there is a much simpler way to do this, but I am not sure how.
Use DataFrameGroupBy.shift with add_prefix and join to append new DataFrame to original:
#if real data are not sorted
#df = df.sort_values(['unit','year','month'], ascending=[True, False, False])
df = df.join(df.groupby('unit', sort=False).shift(-1).add_prefix('prev_'))
print (df)
unit year month price prev_year prev_month prev_price
0 1 2018 6 100 2013.0 4.0 70.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 2015.0 2.0 110.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 2002.0 6.0 90.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN
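The same idea as a self-contained sketch on the question's data, with the sort applied explicitly. Selecting the three columns before shift is an addition here to make clear which columns receive the prev_ prefix.

```python
import pandas as pd

df = pd.DataFrame({
    "unit": [1, 1, 2, 2, 3, 3, 4],
    "year": [2018, 2013, 2015, 2015, 2017, 2002, 2016],
    "month": [6, 4, 10, 2, 4, 6, 1],
    "price": [100, 70, 80, 110, 120, 90, 55],
})

# Newest sale first within each unit, so the next row is the previous sale
df = df.sort_values(["unit", "year", "month"], ascending=[True, False, False])

# shift(-1) pulls each group's following row up by one; a unit's last row gets NaN
prev = (
    df.groupby("unit", sort=False)[["year", "month", "price"]]
      .shift(-1)
      .add_prefix("prev_")
)

# join aligns on the index, so each sale gets its own previous-sale columns
df = df.join(prev)
print(df)
```

The shifted columns become floats because NaN is needed for sales that have no predecessor.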