Web scraping python to csv file problems


I want to export the table "Meilleurs buteurs par édition" to a CSV file. I tried the code below, but the CSV file comes out empty, and the printed output is the first table on the page rather than the one I need. Can someone help me, please?
from bs4 import BeautifulSoup
import requests
import pandas as pd
URL='https://fr.wikipedia.org/wiki/Liste_des_buteurs_de_la_Coupe_du_monde_de_football'
results=[]
response = requests.get(URL)
soup= BeautifulSoup(response.text, 'html.parser')
#print(soup)
#table= soup.find('table')
table = soup.find("table")
tbody=table.find("tbody")
#table = soup.find("table", {"class":"wikitable sortable jquery-tablesorter"})
#table = soup.find("table", {"class":"wikitable sortable alternance jquery-tablesorter"}).tbody
#print(table)
rows=table.findAll('tr')
columns=[v.text.replace('\n', '') for v in rows[0].find_all('th')]
df=pd.DataFrame(columns=columns)
for i in range (1,len(rows)):
    tds=rows[i].find_all('td')
    if len(tds)==6:
        values = [tds[0].text,tds[1].text,tds[2].text,tds[3].text,tds[4].text,tds[5].text]
    else:
        #for i in range(7):
        #    df = df.append({'columns': i}, ignore_index=True)
        values=[td.text for td in tds]
    df = df.append(pd.Series(values), ignore_index=True)
print(df)
print(columns)
df = pd.DataFrame(columns=['A'])
for i in range(5):
    df = df.append({'A': i}, ignore_index=True)
df = pd.DataFrame({'test': results})
df.to_csv('but.csv', index=False, encoding='utf-8')
Output
Rang Joueur Équipe ... 3 4 5
0 NaN NaN NaN ... 24\n 0,67\n 16\n
1 NaN NaN NaN ... 19\n 0,79\n 15\n
2 NaN NaN NaN ... 13\n 1,08\n 14\n
3 NaN NaN NaN ... 6\n 2,17\n 13\n
4 NaN NaN NaN ... 14\n 0,86\n 12\n
5 NaN NaN NaN ... 5\n 2,2\n 11\n
6 NaN NaN NaN ... 17\n 0,65\n 11\n
7 NaN NaN NaN ... 10\n 1\n 10\n
8 NaN NaN NaN ... 12\n 0,83\n 10\n
9 NaN NaN NaN ... 12\n 0,83\n 10\n
10 NaN NaN NaN ... 13\n 0,77\n 10\n
11 NaN NaN NaN ... 16\n 0,63\n 10\n
12 NaN NaN NaN ... 20\n 0,5\n 10\n
[13 rows x 13 columns]
['Rang', 'Joueur', 'Équipe', 'Détail par édition', 'Matchs', 'Ratio', 'Buts']

The easiest way is to use pandas.read_html:
import pandas as pd
url = "https://fr.wikipedia.org/wiki/Liste_des_buteurs_de_la_Coupe_du_monde_de_football"
df = pd.read_html(url)[1]
df["Ratio"] = df["Buts"] / df["Matchs"]
print(df)
df.to_csv("data.csv", index=False)
Prints:
Édition Joueur Équipe Matchs Ratio Buts
0 1930 Guillermo Stábile Argentine 4 2.000000 8
1 1934 Oldřich Nejedlý Tchécoslovaquie 4 1.250000 5
2 1938 Leônidas Brésil 4 1.750000 7
3 1950 Ademir Brésil 6 1.333333 8
4 1954 Sándor Kocsis Hongrie 5 2.200000 11
5 1958 Just Fontaine France 6 2.166667 13
6 1962 Flórián Albert Hongrie 3 1.333333 4
7 1962 Garrincha Brésil 6 0.666667 4
8 1962 Valentin Ivanov Union soviétique 4 1.000000 4
9 1962 Dražan Jerković Yougoslavie 6 0.666667 4
10 1962 Leonel Sánchez Chili 6 0.666667 4
11 1962 Vavá Brésil 6 0.666667 4
12 1966 Eusébio Portugal 6 1.500000 9
13 1970 Gerd Müller Allemagne de l’Ouest 6 1.666667 10
14 1974 Grzegorz Lato Pologne 7 1.000000 7
15 1978 Mario Kempes Argentine 7 0.857143 6
16 1982 Paolo Rossi Italie 7 0.857143 6
17 1986 Gary Lineker Angleterre 5 1.200000 6
18 1990 Salvatore Schillaci Italie 7 0.857143 6
19 1994 Oleg Salenko Russie 3 2.000000 6
20 1994 Hristo Stoitchkov Bulgarie 7 0.857143 6
21 1998 Davor Šuker Croatie 7 0.857143 6
22 2002 Ronaldo Brésil 7 1.142857 8
23 2006 Miroslav Klose Allemagne 7 0.714286 5
24 2010 Diego Forlán Uruguay 7 0.714286 5
25 2010 Thomas Müller Allemagne 6 0.833333 5
26 2010 Wesley Sneijder Pays-Bas 7 0.714286 5
27 2010 David Villa Espagne 7 0.714286 5
28 2014 James Rodríguez Colombie 5 1.200000 6
29 2018 Harry Kane Angleterre 6 1.000000 6
and saves the result to data.csv.
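As for why but.csv came out empty in the question: just before to_csv, the code rebuilds df from the still-empty list results (df = pd.DataFrame({'test': results})), so the scraped rows are thrown away. In addition, soup.find("table") returns only the first table on the page, and DataFrame.append was removed in pandas 2.0. Below is a minimal sketch of the usual pattern: collect the rows in a plain list, build the DataFrame once, and write that same frame out. The rows are hard-coded from the output above rather than scraped, and the in-memory buffer stands in for a real file; both are assumptions made to keep the example self-contained.

```python
import io

import pandas as pd

# Rows would normally come from the scraped <tr>/<td> cells; here they are
# hard-coded (values taken from the printed output above) for illustration.
columns = ["Édition", "Joueur", "Équipe", "Matchs", "Buts"]
rows = [
    [1930, "Guillermo Stábile", "Argentine", 4, 8],
    [1934, "Oldřich Nejedlý", "Tchécoslovaquie", 4, 5],
    [1938, "Leônidas", "Brésil", 4, 7],
]

# Build the DataFrame once from the collected rows instead of appending
# row by row, and do not overwrite it before saving.
df = pd.DataFrame(rows, columns=columns)
df["Ratio"] = df["Buts"] / df["Matchs"]

# Write to an in-memory buffer; pass a path such as "but.csv" to write a file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

The key point is that the frame passed to to_csv is the same one that holds the collected rows.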

Related

Python/Pandas outer merge not including all relevant columns

I want to merge the following two data frames in pandas, but the result doesn't contain all the relevant columns:
L1aIn[0:5]
Filename OrbitNumber OrbitMode OrbitModeCounter Year Month Day L1aIn
0 oco2_L1aInDP_35863a_210329_B10206_210330111927.h5 35863 DP a 2021 3 29 1
1 oco2_L1aInDP_35862a_210329_B10206_210330111935.h5 35862 DP a 2021 3 29 1
2 oco2_L1aInDP_35861b_210329_B10206_210330111934.h5 35861 DP b 2021 3 29 1
3 oco2_L1aInLP_35861a_210329_B10206_210330111934.h5 35861 LP a 2021 3 29 1
4 oco2_L1aInSP_35861a_210329_B10206_210330111934.h5 35861 SP a 2021 3 29 1
L2Std[0:5]
Filename OrbitNumber OrbitMode OrbitModeCounter Year Month Day L2Std
0 oco2_L2StdGL_35861a_210329_B10206r_21042704283... 35861 GL a 2021 3 29 1
1 oco2_L2StdXS_35860a_210329_B10206r_21042700342... 35860 XS a 2021 3 29 1
2 oco2_L2StdND_35852a_210329_B10206r_21042622540... 35852 ND a 2021 3 29 1
3 oco2_L2StdGL_35862a_210329_B10206r_21042622403... 35862 GL a 2021 3 29 1
4 oco2_L2StdTG_35856a_210329_B10206r_21042622422... 35856 TG a 2021 3 29 1
>>> df = L1aIn.copy(deep=True)
>>> df.merge(L2Std, how="outer", on=["OrbitNumber","OrbitMode","OrbitModeCounter"])
0 oco2_L1aInDP_35863a_210329_B10206_210330111927.h5 35863 DP a ... NaN NaN NaN NaN
1 oco2_L1aInDP_35862a_210329_B10206_210330111935.h5 35862 DP a ... NaN NaN NaN NaN
2 oco2_L1aInDP_35861b_210329_B10206_210330111934.h5 35861 DP b ... NaN NaN NaN NaN
3 oco2_L1aInLP_35861a_210329_B10206_210330111934.h5 35861 LP a ... NaN NaN NaN NaN
4 oco2_L1aInSP_35861a_210329_B10206_210330111934.h5 35861 SP a ... NaN NaN NaN NaN
5 NaN 35861 GL a ... 2021.0 3.0 29.0 1.0
6 NaN 35860 XS a ... 2021.0 3.0 29.0 1.0
7 NaN 35852 ND a ... 2021.0 3.0 29.0 1.0
8 NaN 35862 GL a ... 2021.0 3.0 29.0 1.0
9 NaN 35856 TG a ... 2021.0 3.0 29.0 1.0
[10 rows x 13 columns]
>>> df.columns
Index(['Filename', 'OrbitNumber', 'OrbitMode', 'OrbitModeCounter', 'Year',
'Month', 'Day', 'L1aIn'],
dtype='object')
I want the resulting merged table to include both the "L1aIn" and "L2Std" columns but as you can see it doesn't and only picks up the original columns from L1aIn.
I'm also puzzled about why it seems to be returning a dataframe object rather than None.
A toy example works fine for me, but the real-life one does not. What circumstances provoke this kind of behavior for merge?

It seems you just need to assign the output of merge to a variable, because merge returns a new DataFrame and leaves df unchanged:
merged_df = df.merge(L2Std, how="outer", on=["OrbitNumber","OrbitMode","OrbitModeCounter"])
print(merged_df.columns)
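To illustrate the point, here is a small sketch with made-up miniature frames (the file names and values are hypothetical stand-ins, not the asker's data). It shows that the merged result carries both flag columns while the original frame is untouched. The suffixes argument is optional; it is used here only to make the overlapping Filename columns easier to tell apart than the default _x and _y.

```python
import pandas as pd

# Hypothetical miniature versions of the two frames from the question.
l1a_in = pd.DataFrame({
    "Filename": ["l1a_35863a.h5", "l1a_35861a.h5"],
    "OrbitNumber": [35863, 35861],
    "OrbitMode": ["DP", "GL"],
    "OrbitModeCounter": ["a", "a"],
    "L1aIn": [1, 1],
})
l2_std = pd.DataFrame({
    "Filename": ["l2_35861a.h5", "l2_35860a.h5"],
    "OrbitNumber": [35861, 35860],
    "OrbitMode": ["GL", "XS"],
    "OrbitModeCounter": ["a", "a"],
    "L2Std": [1, 1],
})

keys = ["OrbitNumber", "OrbitMode", "OrbitModeCounter"]

# merge returns a new DataFrame; the result must be assigned to be kept.
merged = l1a_in.merge(l2_std, how="outer", on=keys,
                      suffixes=("_L1aIn", "_L2Std"))

print(merged.columns.tolist())   # includes both L1aIn and L2Std
print(l1a_in.columns.tolist())   # the original frame is unchanged
```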

Fill up columns in dataframe based on condition

I have a dataframe that looks as follows:
id cyear month datadate fyear
1 1988 3 nan nan
1 1988 4 nan nan
1 1988 5 1988-05-31 1988
1 1988 6 nan nan
1 1988 7 nan nan
1 1988 8 nan nan
1 1988 9 nan nan
1 1988 12 nan nan
1 1989 1 nan nan
1 1989 2 nan nan
1 1989 3 nan nan
1 1989 4 nan nan
1 1989 5 1989-05-31 1989
1 1989 6 nan nan
1 1989 7 nan nan
1 1989 8 nan nan
1 1990 8 nan nan
4 2000 1 nan nan
4 2000 2 nan nan
4 2000 3 nan nan
4 2000 4 nan nan
4 2000 5 nan nan
4 2000 6 nan nan
4 2000 7 nan nan
4 2000 8 nan nan
4 2000 9 nan nan
4 2000 10 nan nan
4 2000 11 nan nan
4 2000 12 2000-12-31 2000
5 2000 11 nan nan
More specifically, I have a dataframe of monthly (month) data on firms (id) per calendar year (cyear). If a row, i.e. a month, is the end of a fiscal year of the firm, the datadate column holds that month's end as a date and the fyear column holds the fiscal year that just ended.
I now want the fyear value to indicate the fiscal year not just in the last month of the company's fiscal year, but in every month within that fiscal year:
id cyear month datadate fyear
1 1988 3 nan 1988
1 1988 4 nan 1988
1 1988 5 1988-05-31 1988
1 1988 6 nan 1989
1 1988 7 nan 1989
1 1988 8 nan 1989
1 1988 9 nan 1989
1 1988 12 nan 1989
1 1989 1 nan 1989
1 1989 2 nan 1989
1 1989 3 nan 1989
1 1989 4 nan 1989
1 1989 5 1989-05-31 1989
1 1989 6 nan 1990
1 1989 7 nan 1990
1 1989 8 nan 1990
1 1990 8 nan 1991
4 2000 1 nan 2000
4 2000 2 nan 2000
4 2000 3 nan 2000
4 2000 4 nan 2000
4 2000 5 nan 2000
4 2000 6 nan 2000
4 2000 7 nan 2000
4 2000 8 nan 2000
4 2000 9 nan 2000
4 2000 10 nan 2000
4 2000 11 nan 2000
4 2000 12 2000-12-31 2000
5 2000 11 nan nan
Note that months may be missing, as evident in case of id 1, and fiscal years may end on different months in fyear=cyear or fyear=cyear+1 (I have included only the former example, one could construct the latter example by adding 1 to the current fyear values of e.g. id 1). Also, the last row(s) of a given firm may not necessarily be its fiscal year end month, as evident in case of id 1. Lastly, there may exist firms for which no information on fiscal years is available.
I appreciate any help on this.

Do you want this?
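def backward_fill(x):
    # Fill each gap with the next known fiscal year (the year-end that follows it).
    x = x.bfill()
    # Rows after the last known year-end are still NaN here:
    # carry the last value forward and add 1 for exactly those rows.
    x = x.ffill() + x.isna().astype(int)
    return x

df.fyear = df.groupby('id')['fyear'].transform(backward_fill)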
Output
id cyear month datadate fyear
0 1 1988 3 <NA> 1988
1 1 1988 4 <NA> 1988
2 1 1988 5 1988-05-31 1988
3 1 1988 6 <NA> 1989
4 1 1988 7 <NA> 1989
5 1 1988 8 <NA> 1989
6 1 1988 9 <NA> 1989
7 1 1988 12 <NA> 1989
8 1 1989 1 <NA> 1989
9 1 1989 2 <NA> 1989
10 1 1989 3 <NA> 1989
11 1 1989 4 <NA> 1989
12 1 1989 5 1989-05-31 1989
13 1 1989 6 <NA> 1990
14 4 2000 1 <NA> 2000
15 4 2000 2 <NA> 2000
16 4 2000 3 <NA> 2000
17 4 2000 4 <NA> 2000
18 4 2000 5 <NA> 2000
19 4 2000 6 <NA> 2000
20 4 2000 7 <NA> 2000
21 4 2000 8 <NA> 2000
22 4 2000 9 <NA> 2000
23 4 2000 10 <NA> 2000
24 4 2000 11 <NA> 2000
25 4 2000 12 2000-12-31 2000
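The following is a small self-contained sketch of the same backward-fill idea on a cut-down frame (the rows are a hypothetical subset modelled on the question's data, and the helper name fill_fiscal_year is made up for illustration). One caveat worth noting: every row after a firm's last known year-end receives "last fiscal year + 1", so a row lying more than one fiscal year beyond it, such as id 1 in 1990-08 where the question expects 1991, would not be distinguished by this approach.

```python
import numpy as np
import pandas as pd

# Hypothetical subset: firm 1 has two known year-ends, firm 5 has none.
df = pd.DataFrame({
    "id":    [1, 1, 1, 1, 1, 5],
    "cyear": [1988, 1988, 1988, 1989, 1989, 2000],
    "month": [4, 5, 6, 5, 6, 11],
    "fyear": [np.nan, 1988, np.nan, 1989, np.nan, np.nan],
})

def fill_fiscal_year(x):
    # Step 1: each month takes the fiscal year of the next known year-end.
    filled = x.bfill()
    # Step 2: months after the last year-end are still NaN; give them
    # the last known fiscal year plus one.
    return filled.ffill() + filled.isna().astype(int)

df["fyear_filled"] = df.groupby("id")["fyear"].transform(fill_fiscal_year)
print(df)
```

Firms with no fiscal-year information at all (id 5 here) stay NaN, which matches the expected result in the question.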

Drop Rows with Non-Numeric Entries in a Column (Python)

I am trying to download data from a website. When I do this, there are some rows that are not part of the data included, which is obvious because their first column is not a number.
So I'm getting something like
GM_Num Date Tm
1 Monday, Apr 3 LAA
2 Tuesday, Apr 4 LAA
... ... ...
Gm# May Tm
where the last row is one that I want to drop. In the actual table, there are multiple rows like this randomly throughout the table.
Here is the code that I have tried so far to drop those rows:
import requests
import pandas as pd
url = 'https://www.baseball-reference.com/teams/LAA/2017-schedule-scores.shtml'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
df.rename(columns={"Gm#": "GM_Num"}, inplace = True)
#Attempts that didn't work:
df[df['GM_Num'].str.isdigit().isnull()]
#df[df.GM_Num.apply(lambda x: x.isnumeric())].set_index('GM_Num', inplace = True)
#df.set_index('GM_Num', inplace = True)
df
Thank you in advance for any help!

Let's cast your 'Gm#' column and drop records in a couple of steps:
df['Gm#'] = pd.to_numeric(df['Gm#'], errors='coerce')
df = df.dropna(subset=['Gm#'])
df
Output:
Gm# Date Unnamed: 2 Tm Unnamed: 4 Opp W/L R RA \
0 1.0 Monday, Apr 3 boxscore LAA # OAK L 2 4
1 2.0 Tuesday, Apr 4 boxscore LAA # OAK W 7 6
2 3.0 Wednesday, Apr 5 boxscore LAA # OAK W 5 0
3 4.0 Thursday, Apr 6 boxscore LAA # OAK L 1 5
4 5.0 Friday, Apr 7 boxscore LAA NaN SEA W 5 1
.. ... ... ... ... ... ... ... .. ..
162 158.0 Wednesday, Sep 27 boxscore LAA # CHW L-wo 4 6
163 159.0 Thursday, Sep 28 boxscore LAA # CHW L 4 5
164 160.0 Friday, Sep 29 boxscore LAA NaN SEA W 6 5
165 161.0 Saturday, Sep 30 boxscore LAA NaN SEA L 4 6
167 162.0 Sunday, Oct 1 boxscore LAA NaN SEA W 6 2
Inn ... Rank GB Win Loss Save Time D/N \
0 NaN ... 3 1.0 Graveman Nolasco Casilla 2:56 N
1 NaN ... 2 1.0 Bailey Dull Bedrosian 3:17 N
2 NaN ... 2 1.0 Ramirez Cotton NaN 3:15 N
3 NaN ... 2 1.0 Triggs Skaggs NaN 2:44 D
4 NaN ... 1 Tied Chavez Gallardo NaN 2:56 N
.. ... ... ... ... ... ... ... ... ..
162 10 ... 2 20.0 Farquhar Parker NaN 3:58 N
163 NaN ... 2 21.0 Infante Chavez Minaya 3:04 N
164 NaN ... 2 21.0 Wood Rzepczynski Parker 3:01 N
165 NaN ... 2 21.0 Lawrence Bedrosian Diaz 3:32 N
167 NaN ... 2 21.0 Bridwell Simmons NaN 2:38 D
Attendance Streak Orig. Scheduled
0 36067 - NaN
1 11225 + NaN
2 13405 ++ NaN
3 13292 - NaN
4 43911 + NaN
.. ... ... ...
162 17012 - NaN
163 19596 -- NaN
164 35106 + NaN
165 38075 - NaN
167 34940 + NaN
[162 rows x 21 columns]
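The same two steps can be tried without downloading anything. This is a minimal sketch on a hand-made frame that imitates the repeated header row from the question; the final astype and reset_index calls are optional additions (not part of the answer above) that turn the game numbers back into integers and renumber the rows.

```python
import pandas as pd

# Hand-made stand-in for the scraped table, including a repeated header row.
df = pd.DataFrame({
    "Gm#": ["1", "2", "Gm#", "3"],
    "Date": ["Monday, Apr 3", "Tuesday, Apr 4", "May", "Wednesday, Apr 5"],
    "Tm": ["LAA", "LAA", "Tm", "LAA"],
})

# Non-numeric entries become NaN instead of raising an error.
df["Gm#"] = pd.to_numeric(df["Gm#"], errors="coerce")

# Drop the rows whose game number could not be parsed, then tidy up.
df = (
    df.dropna(subset=["Gm#"])
      .astype({"Gm#": int})
      .reset_index(drop=True)
)
print(df)
```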

Adding column in pandas based on values from other columns with conditions

I have a dataframe with information about sales of some products (unit):
unit year month price
0 1 2018 6 100
1 1 2013 4 70
2 2 2015 10 80
3 2 2015 2 110
4 3 2017 4 120
5 3 2002 6 90
6 4 2016 1 55
and I would like to add, for each sale, columns with information about the previous sales and NaN if there is no previous sale.
unit year month price prev_price prev_year prev_month
0 1 2018 6 100 70.0 2013.0 4.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 110.0 2015.0 2.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 90.0 2002.0 6.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN
For the moment I am doing some grouping on the unit, keeping those that have several rows, then extracting the information for these units that are associated with the minimal date. Then joining this table with my original table keeping only the rows that have a different date in the 2 tables that have been merged.
I feel like there is a much simpler way to do this, but I am not sure how.

Use DataFrameGroupBy.shift with add_prefix and join to append new DataFrame to original:
#if real data are not sorted
#df = df.sort_values(['unit','year','month'], ascending=[True, False, False])
df = df.join(df.groupby('unit', sort=False).shift(-1).add_prefix('prev_'))
print (df)
unit year month price prev_year prev_month prev_price
0 1 2018 6 100 2013.0 4.0 70.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 2015.0 2.0 110.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 2002.0 6.0 90.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN
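The shift(-1) above relies on the rows of each unit already being ordered from newest to oldest. As a variation (an assumption on my part, not the answer's exact method), the sketch below sorts a copy chronologically and uses shift(1), so the result does not depend on the original row order; join then aligns the shifted values back by index.

```python
import pandas as pd

df = pd.DataFrame({
    "unit":  [1, 1, 2, 2, 3, 3, 4],
    "year":  [2018, 2013, 2015, 2015, 2017, 2002, 2016],
    "month": [6, 4, 10, 2, 4, 6, 1],
    "price": [100, 70, 80, 110, 120, 90, 55],
})

# Sort chronologically within each unit; the original index is preserved.
ordered = df.sort_values(["unit", "year", "month"])

# For every sale, take the values of the sale just before it in the same unit.
prev = (
    ordered.groupby("unit")[["year", "month", "price"]]
           .shift(1)
           .add_prefix("prev_")
)

# join aligns on the index, so the original row order of df is kept.
result = df.join(prev)
print(result)
```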

Python CSV joining columns

I am trying to make a new column with conditional statements using pandas version 0.17.1. I have two CSV files, each about 100 MB in size.
What I have:
CSV1:
Index TC_NUM
1241 1105.0017
1242 1105.0018
1243 1105.0019
1244 1105.002
1245 1105.0021
1246 1105.0022
CSV2:
KEYS TC_NUM
UXS-689 3001.0045
FIT-3015 1135.0027
FIT-2994 1140.0156
FIT-2991 1910, 1942.0001, 3004.0004, 3004.0020, 3004.0026, 3004.0063, 3004.0065, 3004.0079, 3004.0084, 3004.0091, 2101.0015, 2101.0016, 2101.0017, 2101.0018, 2101.0050, 2101.0052, 2101.0054, 2101.0055, 2101.0071, 2101.0074, 2101.0075, 2206.0001, 2103.0001, 2103.0002, 2103.0009, 2103.0011, 3000.0004, 3000.0030, 1927.0020
FIT-2990 2034.0002, 3004.0035, 3004.0084, 2034.0001
FIT-2918 3001.0039, 3004.0042
What I want:
Index TC_NUM Matched_Keys
1241 1105.0017 FIT-3015
1242 1105.0018 UXS-668
1243 1105.0019 FIT-087
1244 1105.002 FIT-715
1245 1105.0021 FIT-910
1246 1105.0022 FIT-219
If a TC_NUM in CSV2 matches a TC_NUM in CSV1, the corresponding key should be written to a new column in CSV1.
Code:
dftakecolumns = pd.read_csv('JiraKeysEnv.csv')
dfmergehere = pd.read_csv('output2.csv')
s = dftakecolumns['KEYS']
a = dftakecolumns['TC_NUM']
d = dfmergehere['TC_NUM']
for crows in a:
    for toes in d:
        if toes == crows:
            print toes
dfmergehere['Matched_Keys'] = dftakecolumns.apply(toes, axis=None, join_axis=None, join='outer')

You can try this solution.
Note: I changed the TC_NUM values in the first row (to 1105.0017) and the fourth row (to 1105.0022) of df2 so that the merge has something to match.
print df1
Index TC_NUM
0 1241 1105.0017
1 1242 1105.0018
2 1243 1105.0019
3 1244 1105.0020
4 1245 1105.0021
5 1246 1105.0022
print df2
KEYS TC_NUM
0 UXS-689 1105.0017
1 FIT-3015 1135.0027
2 FIT-2994 1140.0156
3 FIT-2991 1105.0022, 1942.0001, 3004.0004, 3004.0020, 30...
4 FIT-2990 2034.0002, 3004.0035, 3004.0084, 2034.0001
5 FIT-2918 3001.0039, 3004.0042
#convert string column TC_NUM to dataframe df3
df3 = pd.DataFrame([ x.split(',') for x in df2['TC_NUM'].tolist() ])
#convert string df3 to float df3
df3 = df3.astype(float)
print df3
0 1 2 3 4 5 \
0 1105.0017 NaN NaN NaN NaN NaN
1 1135.0027 NaN NaN NaN NaN NaN
2 1140.0156 NaN NaN NaN NaN NaN
3 1105.0022 1942.0001 3004.0004 3004.0020 3004.0026 3004.0063
4 2034.0002 3004.0035 3004.0084 2034.0001 NaN NaN
5 3001.0039 3004.0042 NaN NaN NaN NaN
6 7 8 9 ... 19 20 \
0 NaN NaN NaN NaN ... NaN NaN
1 NaN NaN NaN NaN ... NaN NaN
2 NaN NaN NaN NaN ... NaN NaN
3 3004.0065 3004.0079 3004.0084 3004.0091 ... 2101.0074 2101.0075
4 NaN NaN NaN NaN ... NaN NaN
5 NaN NaN NaN NaN ... NaN NaN
21 22 23 24 25 26 27 \
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN
3 2206.0001 2103.0001 2103.0002 2103.0009 2103.0011 3000.0004 3000.003
4 NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN
28
0 NaN
1 NaN
2 NaN
3 1927.002
4 NaN
5 NaN
[6 rows x 29 columns]
#concat column KEYS to df3
df2 = pd.concat([df2['KEYS'], df3], axis=1)
#stack - rows to one column for merging
df2 = df2.set_index('KEYS').stack().reset_index(level=1,drop=True).reset_index(name='TC_NUM')
print df2
KEYS TC_NUM
0 UXS-689 1105.0017
1 FIT-3015 1135.0027
2 FIT-2994 1140.0156
3 FIT-2991 1105.0022
4 FIT-2991 1942.0001
5 FIT-2991 3004.0004
6 FIT-2991 3004.0020
7 FIT-2991 3004.0026
8 FIT-2991 3004.0063
9 FIT-2991 3004.0065
10 FIT-2991 3004.0079
11 FIT-2991 3004.0084
12 FIT-2991 3004.0091
13 FIT-2991 2101.0015
14 FIT-2991 2101.0016
15 FIT-2991 2101.0017
16 FIT-2991 2101.0018
17 FIT-2991 2101.0050
18 FIT-2991 2101.0052
19 FIT-2991 2101.0054
20 FIT-2991 2101.0055
21 FIT-2991 2101.0071
22 FIT-2991 2101.0074
23 FIT-2991 2101.0075
24 FIT-2991 2206.0001
25 FIT-2991 2103.0001
26 FIT-2991 2103.0002
27 FIT-2991 2103.0009
28 FIT-2991 2103.0011
29 FIT-2991 3000.0004
30 FIT-2991 3000.0030
31 FIT-2991 1927.0020
32 FIT-2990 2034.0002
33 FIT-2990 3004.0035
34 FIT-2990 3004.0084
35 FIT-2990 2034.0001
36 FIT-2918 3001.0039
37 FIT-2918 3004.0042
#merge on column TC_NUM
print pd.merge(df1, df2, on=['TC_NUM'])
Index TC_NUM KEYS
0 1241 1105.0017 UXS-689
1 1246 1105.0022 FIT-2991
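The answer above targets pandas 0.17.1 on Python 2. On a recent pandas (0.25 or later) with Python 3, the split-and-stack step can be sketched more compactly with str.split and DataFrame.explode. This is a swapped-in technique rather than the answer's code, and the frames below are shortened, hypothetical versions of df1 and df2.

```python
import pandas as pd

df1 = pd.DataFrame({"Index": [1241, 1246], "TC_NUM": [1105.0017, 1105.0022]})
df2 = pd.DataFrame({
    "KEYS": ["UXS-689", "FIT-2991", "FIT-2918"],
    "TC_NUM": ["1105.0017",
               "1105.0022, 1942.0001, 3004.0004",
               "3001.0039, 3004.0042"],
})

# One row per (key, test-case number) pair.
long_df2 = (
    df2.assign(TC_NUM=df2["TC_NUM"].str.split(","))
       .explode("TC_NUM")
       .reset_index(drop=True)
)

# Strip the spaces left over from ", " and convert to float so the
# dtype matches df1["TC_NUM"].
long_df2["TC_NUM"] = long_df2["TC_NUM"].str.strip().astype(float)

# A left merge keeps every row of df1, with NaN where no key matches.
result = (
    df1.merge(long_df2, on="TC_NUM", how="left")
       .rename(columns={"KEYS": "Matched_Keys"})
)
print(result)
```

If the test-case numbers are really identifiers rather than quantities, reading both columns as strings (and skipping the float conversion) may be safer, since 1105.002 and 1105.0020 are equal as floats but differ as text.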
