Selecting values from each column separately - Python

I need to write a script which sums the values in each column (each column is a separate day). In addition, I want to separate the values into planned (blue) and unplanned (red). In the HTML, I found that unplanned values have the class "colBox cal-unplanned" and planned values have the class "colBox cal-planned".
My code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
URL = 'http://gpi.tge.pl/zestawienie-ubytkow'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
# Here I tried to convert the data into a dataframe, but then you don't know which values are planned and which are unplanned
table = soup.find_all('table')
df = pd.read_html(str(table), header=2)[0]
# Here the values are correct, but they are collected from the whole table
sum = 0
for tr in soup.find_all('td', class_='colBox cal-unplanned'):
    val = int(tr.text)
    sum += val
print(sum)
for tr in soup.find_all('td', class_='colBox cal-planned'):
    print(tr.text)
And here's my question: how can I select the values from each column separately?

Not sure there's a better way, but you can iterate through the table and store the planned and unplanned values in a dictionary keyed by column name, sum those values, and then convert that dictionary to a dataframe.
But you're right: you lose that class attribute when parsing with .read_html().
This works, but I'm not sure how robust it is for your situation.
import pandas as pd
import requests
from bs4 import BeautifulSoup
URL = 'http://gpi.tge.pl/zestawienie-ubytkow'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find('table')
data = {}
headers = [x.text.strip() for x in table.find_all('tr')[2].find_all('th')]
for header in headers:
    data[header] = {'planned': [], 'unplanned': []}
rows = table.find_all('tr')[3:]
for row in rows:
    tds = row.find_all('td')[3:len(headers)+3]
    for idx, value in enumerate(tds):
        if value.has_attr("class"):
            if 'cal-planned' in value['class']:
                data[headers[idx]]['planned'].append(int(value.text.strip()))
            elif 'cal-unplanned' in value['class']:
                data[headers[idx]]['unplanned'].append(int(value.text.strip()))
sum_of_columns = {}
for col, values in data.items():
    planned_sum = sum(values['planned'])
    unplanned_sum = sum(values['unplanned'])
    sum_of_columns[col] = {'planned': planned_sum, 'unplanned': unplanned_sum}
df = pd.DataFrame.from_dict(sum_of_columns, orient="columns")
Output:
print(df.to_string())
Cz 14 Pt 15 So 16 N 17 Pn 18 Wt 19 Śr 20 Cz 21 Pt 22 So 23 N 24 Pn 25 Wt 26 Śr 27
planned 8808 8301 7750 6863 6069 6199 6069 5627 5627 5695 5695 5235 5235 5376
unplanned 2320 2020 2313 2783 950 950 950 950 950 950 950 910 910 910
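Each column of the resulting frame is one day, so a single day's sums can then be read directly (a short usage note based on the output above):
# columns are days, rows are 'planned'/'unplanned'
print(df['Cz 14'])             # Series: planned 8808, unplanned 2320
print(df['Cz 14']['planned'])  # 8808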

So if I understood right, you want to work on single columns of your dataframe?
You could use df['column_name'] to access a certain column of the df and then filter that column for the value you want, like
df['column_name'] == filter_value
But then again, I'm not sure I get your problem.
This helped me heaps with dataframe value selection.
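For example, a minimal sketch of that filtering pattern, with a made-up stand-in dataframe (the column name and threshold are purely illustrative):
import pandas as pd

# hypothetical stand-in for the scraped data
df = pd.DataFrame({'Cz 14': [8808, 2320], 'Pt 15': [8301, 2020]},
                  index=['planned', 'unplanned'])

col = df['Cz 14']   # select one column (one day)
mask = col > 5000   # boolean comparison, same idea as col == filter_value
print(col[mask])    # keeps only the rows where the mask is True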

I'm not sure this is necessarily a job for bs4, because the information is already in the DataFrame as a sum.
How do you access it? Take a look at the tail() of your dataframe:
df.tail(3)
Example
import pandas as pd
URL = 'http://gpi.tge.pl/zestawienie-ubytkow'
df = pd.read_html(URL, header=2)[0]
df.tail(3).iloc[:,2:]
Output
Moc Osiągalna (MW) Cz 14 Pt 15 So 16 N 17 Pn 18 Wt 19 Śr 20 Cz 21 Pt 22 So 23 N 24 Pn 25 Wt 26 Śr 27
219 Planowane 11279 10604 8391 6863 6069 6432 6069 5627 5627 5695 5695 5235 5235 5376
220 Nieplanowane 5520 5620 2313 2783 950 950 950 950 950 950 950 910 910 910
221 Łącznie ubytki 16799 16224 10704 9646 7019 7382 7019 6577 6577 6645 6645 6145 6145 6286
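Building on that, the planned and unplanned rows can be pulled out by label; a sketch, assuming the label column is named 'Moc Osiągalna (MW)' as in the output above:
import pandas as pd

URL = 'http://gpi.tge.pl/zestawienie-ubytkow'
df = pd.read_html(URL, header=2)[0]

# index the three summary rows by their label column; each remaining column is one day
summary = df.tail(3).iloc[:, 2:].set_index('Moc Osiągalna (MW)')
planned = summary.loc['Planowane']       # planned losses per day
unplanned = summary.loc['Nieplanowane']  # unplanned losses per day
print(planned['Cz 14'], unplanned['Cz 14'])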

Related

Get the column name for rows containing a value

I have a .csv that looks like the below. I was wondering what the best way would be to keep the first few columns (id, account_id, date, amount, payments) intact while creating a new column containing the column name for observations marked with an 'X'.
The first 10 rows of the csv look like:
id,account_id,date,amount,payments,24_A,12_B,12_A,60_D,48_C,36_D,36_C,12_C,48_A,24_C,60_C,24_B,48_D,24_D,48_B,36_A,36_B,60_B,12_D,60_A
4959,2,1994-01-05,80952,3373,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4961,19,1996-04-29,30276,2523,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4962,25,1997-12-08,30276,2523,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4967,37,1998-10-14,318480,5308,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4968,38,1998-04-19,110736,2307,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4973,67,1996-05-02,165960,6915,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4986,97,1997-08-10,102876,8573,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4988,103,1997-12-06,265320,7370,-,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4989,105,1998-12-05,352704,7348,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4990,110,1997-09-08,162576,4516,-,-,-,-,-,-,X,-,-,-,-,-,-,-,-,-,-,-,-,-
There used to be a DataFrame method called lookup, but it has been deprecated in favor of melt + loc[].
The idea is to use id_vars as the grouping; all the other columns get melted into a single column holding their respective values. Then filter where that value is 'X', effectively dropping the other rows.
import pandas as pd
df = pd.read_csv('test.txt')
df = df.melt(id_vars=['id','account_id','date','amount','payments'], var_name='x_col')
df = df.loc[df['value']=='X'].drop(columns='value')
print(df)
Output
id account_id date amount payments x_col
0 4959 2 1994-01-05 80952 3373 24_A
5 4973 67 1996-05-02 165960 6915 24_A
11 4961 19 1996-04-29 30276 2523 12_B
22 4962 25 1997-12-08 30276 2523 12_A
26 4986 97 1997-08-10 102876 8573 12_A
33 4967 37 1998-10-14 318480 5308 60_D
44 4968 38 1998-04-19 110736 2307 48_C
48 4989 105 1998-12-05 352704 7348 48_C
57 4988 103 1997-12-06 265320 7370 36_D
69 4990 110 1997-09-08 162576 4516 36_C

Scraping data from Morningstar using an API

I have a very specific issue for which I have not been able to find a solution.
Recently, I began a project monitoring about 100 ETFs and mutual funds based on specific data acquired from Morningstar. The current solution works great, but I later found out that I need more data from another "tab" within the website. Specifically, I am trying to get data from the 1st table on the following page: https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000Z1MC&tab=1
Right now, I have the code below for scraping data from a table on the "Indhold" tab and exporting it to Excel. My question is therefore: how do I adjust the code to scrape data from another part of the website?
To briefly explain and reiterate: the code below scrapes data from a different tab of the same pages. The many, many IDs identify the page for each mutual fund/ETF. The setup works very well, so I am hoping to simply adjust it (if that is possible) to extract the table from the link above. I have very limited knowledge of the topic, so any help is much, much appreciated.
import requests
import re
import pandas as pd
from openpyxl import load_workbook
auth = 'https://www.morningstar.dk/Common/funds/snapshot/PortfolioSAL.aspx'
# Create a Pandas Excel writer using XlsxWriter as the engine.
path= r'/Users/karlemilthulstrup/Downloads/data2.xlsm'
book = load_workbook(path ,read_only = False, keep_vba=True)
writer = pd.ExcelWriter(path, engine='openpyxl')
writer.book = book
ids = ['F00000VA2N','F0GBR064OO','F00000YKC2','F000015MVX','F0000020YA','0P00015YTR','0P00015YTT','F0GBR05V8D','F0GBR06XKI','F000013CKH','F00000MG6K','F000014G49',
'F00000WC0Z','F00000QSD2','F000016551','F0000146QH','F0000146QI','F0GBR04KZS','F0GBR064VU','F00000VXLM','F0000119R1','F0GBR04L4T','F000015CS3','F000015CS5','F000015CS6',
'F000015CS4','F000013BZE','F0GBR05W0Q','F000016M1C','F0GBR04L68','F00000Z9T9','F0GBR04JI8','F00000Z9TG','F0GBR04L2P','F000014CU8','F00000ZG2G','F00000MLEW',
'F000013ZOY','F000016614','F00000WUI9','F000015KRL','F0GBR04LCR','F000010ES9','F00000P780','F0GBR04HC3','F000015CV6','F00000YWCK','F00000YWCJ','F00000NAI5',
'F0GBR04L81','F0GBR05KNU','F0GBR06XKB','F00000NAI3','F0GBR06XKF','F000016UA9','F000013FC2','F000014NRE','0P0000CNVT','0P0000CNVX','F000015KRI',
'F000015KRG','F00000XLK7','F0GBR04IDG','F00000XLK6','F00000073J','F00000XLK4','F000013CKG','F000013CKJ','F000013CKK','F000016P8R','F000016P8S','F000011JG6',
'F000014UZQ','F0000159PE','F0GBR04KZG','F0000002OY','F00000TW9K','F0000175CC','F00000NBEL','F000016054','F000016056','F00000TEYP','F0000025UI','F0GBR04FV7',
'F00000WP01','F000011SQ4','F0GBR04KZO','F000010E19','F000013ZOX','F0GBR04HD7','F00000YKC1','F0GBR064UG','F00000JSDD','F000010ROF','F0000100CA','F0000100CD',
'FOGBR05KQ0','F0GBR04LBB','F0GBR04LBZ','F0GBR04LCN','F00000WLA7','F0000147D7','F00000ZB5E','F00000WC0Y']
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Mobile Safari/537.36'}
payload = {
    'languageId': 'da-DK',
    'locale': 'da-DK',
    'clientId': 'MDC_intl',
    'benchmarkId': 'category',
    'component': 'sal-components-mip-factor-profile',
    'version': '3.40.1'}
for api_id in ids:
    payload = {
        'Site': 'dk',
        'FC': '%s' % api_id,
        'IT': 'FO',
        'LANG': 'da-DK'}
    response = requests.get(auth, params=payload)
    search = re.search(r'(tokenMaaS:[\w\s]*\")(.*)(\")', response.text, re.IGNORECASE)
    bearer = 'Bearer ' + search.group(2)
    headers.update({'Authorization': bearer})
    url = 'https://www.us-api.morningstar.com/sal/sal-service/fund/factorProfile/%s/data' % api_id
    jsonData = requests.get(url, headers=headers, params=payload).json()
    rows = []
    for k, v in jsonData['factors'].items():
        row = {}
        row['factor'] = k
        historicRange = v.pop('historicRange')
        row.update(v)
        for each in historicRange:
            row.update(each)
        rows.append(row.copy())
    df = pd.DataFrame(rows)
    sheetName = jsonData['id']
    df.to_excel(writer, sheet_name=sheetName, index=False)
    print('Finished: %s' % sheetName)
writer.save()
writer.close()
If I understand you correctly, you want to get first table of that URL in the form of pandas dataframe:
import requests
import pandas as pd
from bs4 import BeautifulSoup
# load the page into soup:
url = "https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000Z1MC&tab=1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# find correct table:
tbl = soup.select_one(".returnsCalenderYearTable")
# remove the first row (it's not header):
tbl.tr.extract()
# convert the html to pandas DF:
df = pd.read_html(str(tbl))[0]
# move the first row to header:
df.columns = map(str, df.loc[0])
df = df.loc[1:].reset_index(drop=True).rename(columns={"nan": "Name"})
print(df)
Prints:
Name 2014* 2015* 2016* 2017* 2018 2019 2020 31-08
0 Samlet afkast % 2627 1490 1432 584 -589 2648 -482 1841
1 +/- Kategori 1130 583 808 -255 164 22 -910 -080
2 +/- Indeks 788 591 363 -320 -127 -262 -1106 -162
3 Rank i kategori 2 9 4 80 38 54 92 63
EDIT: To load from multiple URLs:
import requests
import pandas as pd
from bs4 import BeautifulSoup
urls = [
"https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1",
"https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1",
"https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1",
"https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1",
]
all_data = []
for url in urls:
    print("Loading URL {}".format(url))
    # load the page into soup:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    # find correct table:
    tbl = soup.select_one(".returnsCalenderYearTable")
    # remove the first row (it's not a header):
    tbl.tr.extract()
    # convert the html to pandas DF:
    df = pd.read_html(str(tbl))[0]
    # move the first row to the header:
    df.columns = map(lambda x: str(x).replace("*", "").strip(), df.loc[0])
    df = df.loc[1:].reset_index(drop=True).rename(columns={"nan": "Name"})
    df["Company"] = soup.h1.text.split("\n")[0].strip()
    df["URL"] = url
    all_data.append(df.loc[:, ~df.isna().all()])
df = pd.concat(all_data, ignore_index=True)
print(df)
Prints:
Name 2016 2017 2018 2019 2020 31-08 Company URL
0 Samlet afkast % 1755.0 942.0 -1317.0 1757.0 -189.0 3018 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
1 +/- Kategori 966.0 -54.0 -186.0 -662.0 -967.0 1152 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
2 +/- Indeks 686.0 38.0 -854.0 -1153.0 -813.0 1015 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
3 Rank i kategori 10.0 24.0 85.0 84.0 77.0 4 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
4 Samlet afkast % NaN 1016.0 -940.0 1899.0 767.0 2238 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
5 +/- Kategori NaN 20.0 190.0 -520.0 -12.0 373 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
6 +/- Indeks NaN 112.0 -478.0 -1011.0 143.0 235 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
7 Rank i kategori NaN 26.0 69.0 92.0 43.0 25 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
8 Samlet afkast % NaN NaN -939.0 1898.0 766.0 2239 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
9 +/- Kategori NaN NaN 191.0 -521.0 -12.0 373 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
10 +/- Indeks NaN NaN -477.0 -1012.0 142.0 236 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
11 Rank i kategori NaN NaN 68.0 92.0 44.0 24 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
12 Samlet afkast % NaN NaN NaN NaN NaN 2384 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
13 +/- Kategori NaN NaN NaN NaN NaN 518 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
14 +/- Indeks NaN NaN NaN NaN NaN 381 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
15 Rank i kategori NaN NaN NaN NaN NaN 18 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
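If the result should feed the asker's Excel workflow, the combined frame can be written out with pandas' Excel writer; a sketch with a hypothetical output path (the question's .xlsm/VBA handling is omitted):
import pandas as pd

# write the combined table to one sheet; path and sheet name are hypothetical
with pd.ExcelWriter('calendar_year_returns.xlsx', engine='openpyxl') as writer:
    df.to_excel(writer, sheet_name='CalendarYearReturns', index=False)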

How to scrape a non-tabulated list from Wikipedia and create a dataframe?

en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul
At the link above, there is un-tabulated data for Istanbul's neighbourhoods.
I want to fetch these neighbourhoods into a dataframe with this code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
wikiurl="https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
response=requests.get(wikiurl)
soup = BeautifulSoup(response.text, 'html.parser')
tocList=soup.findAll('a',{'class':"new"})
neighborhoods=[]
for item in tocList:
    text = item.get_text()
    neighborhoods.append(text)
df = pd.DataFrame(neighborhoods, columns=['Neighborhood'])
print(df)
and I got this output:
Neighborhood
0 Maden
1 Nizam
2 Anadolu
3 Arnavutköy İmrahor
4 Arnavutköy İslambey
... ...
705 Seyitnizam
706 Sümer
707 Telsiz
708 Veliefendi
709 Yeşiltepe
710 rows × 1 columns
but some data are not fetched; compare the data below to the output:
Adalar
Burgazada
Heybeliada
Kınalıada
Maden
Nizam
findAll() is not fetching the neighbourhoods that appear as plain list items rather than links with the "new" class, e.g.
<ol><li>Burgazada</li>
<li>Heybeliada</li>
Also, can I extend the code to produce 2 columns, one for the 'Neighborhood' and one for its 'District'?
Are you trying to fetch this list from the Table of Contents?
Please check if this solves your problem:
import pandas as pd
import requests
from bs4 import BeautifulSoup
wikiurl="https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
response=requests.get(wikiurl)
soup = BeautifulSoup(response.text, 'html.parser')
tocList=soup.findAll('span',{'class':"toctext"})
districts=[]
blocked_words = ['Neighbourhoods by districts','Further reading', 'External links']
for item in tocList:
    text = item.get_text()
    if text not in blocked_words:
        districts.append(text)
df = pd.DataFrame(districts, columns=['districts'])
print(df)
Output:
districts
0 Adalar
1 Arnavutköy
2 Ataşehir
3 Avcılar
4 Bağcılar
5 Bahçelievler
6 Bakırköy
7 Başakşehir
8 Bayrampaşa
9 Beşiktaş
10 Beykoz
11 Beylikdüzü
12 Beyoğlu
13 Büyükçekmece
14 Çatalca
15 Çekmeköy
16 Esenler
17 Esenyurt
18 Eyüp
19 Fatih
20 Gaziosmanpaşa
21 Güngören
22 Kadıköy
23 Kağıthane
24 Kartal
25 Küçükçekmece
26 Maltepe
27 Pendik
28 Sancaktepe
29 Sarıyer
30 Silivri
31 Sultanbeyli
32 Sultangazi
33 Şile
34 Şişli
35 Tuzla
36 Ümraniye
37 Üsküdar
38 Zeytinburnu
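The second part of the question (pairing each neighbourhood with its district) isn't covered above. A hedged sketch, assuming each district's heading directly precedes an <ol> of neighbourhood <li> items, as in the markup quoted in the question (other <ol> lists on the page, e.g. references, may need filtering out):
import pandas as pd
import requests
from bs4 import BeautifulSoup

wikiurl = "https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
soup = BeautifulSoup(requests.get(wikiurl).text, 'html.parser')

rows = []
for ol in soup.find_all('ol'):
    # assumption: the nearest preceding h2/h3 heading names the district
    heading = ol.find_previous(['h2', 'h3'])
    if heading is None:
        continue
    district = heading.get_text(strip=True)
    for li in ol.find_all('li'):
        rows.append({'District': district, 'Neighborhood': li.get_text(strip=True)})

df = pd.DataFrame(rows)
print(df)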

Column type changes when concatenating scraped dataframes

I'm having some trouble and I couldn't find any answers on the internet (which is very rare!!).
I'm webscraping meteorological data at a daily granularity, by city.
When I scrape a single day, the 'city' column is a string column, but when I scrape more than one day (e.g. two days), the 'city' column is not a string anymore, and the same code that worked for a 1-day scrape no longer works on a 2-day scrape.
Here's my code for scraping
import requests
import pandas as pd
from bs4 import BeautifulSoup

url_template = 'https://www.wetterkontor.de/de/wetter/deutschland/extremwerte.asp?id={}'

def get_weather(date):
    url = url_template.format(date)
    html_doc = requests.get(url).text
    soup = BeautifulSoup(html_doc)
    table = soup.find('table', id="extremwerte")
    rows = []
    for row in table.find_all('tr'):
        rows.append([val.text for val in row.find_all('td')])
    headers = [header.text for header in table.find_all('th')]
    df = pd.DataFrame(rows[2:], columns=headers)
    df['date'] = date
    return df
df2 = pd.DataFrame()
df2 = df2.fillna(0)
for d in pd.date_range(start='20170101', end='20170101'):
    df2 = pd.concat([df2, get_weather(d.strftime('%Y%m%d'))])
When I only scrape 20170101, as above, the below code works:
for i, row in df2.iterrows():
    if isinstance(df2['Wetterstation\xa0'][i], str) is True:
        df2.at[i, 'Wetterstation\xa0'] = df2['Wetterstation\xa0'][i].replace('/', '-')
But when I change the end date to 20170202, for example, the code won't work and gives me this error message:
AttributeError: 'Series' object has no attribute 'encode'
Please can you explain what has changed? Why does the type of the column change when I scrape more than one day?
Thank you in advance for your time!
It seems you need to specify the correct parser, in this case lxml (the standard html.parser produces incorrect results).
Running this code for end date 20170202:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url_template = 'https://www.wetterkontor.de/de/wetter/deutschland/extremwerte.asp?id={}'
def get_weather(date):
    url = url_template.format(date)
    html_doc = requests.get(url).text
    soup = BeautifulSoup(html_doc, 'lxml')  # <--- specify `lxml` parser
    table = soup.find('table', id="extremwerte")
    rows = []
    for row in table.find_all('tr'):
        rows.append([val.text for val in row.find_all('td')])
    headers = [header.text for header in table.find_all('th')]
    df = pd.DataFrame(rows[2:], columns=headers)
    df['date'] = date
    return df

df2 = pd.DataFrame()
df2 = df2.fillna(0)
for d in pd.date_range(start='20170101', end='20170202'):
    df2 = pd.concat([df2, get_weather(d.strftime('%Y%m%d'))])
print(df2)
Produces:
Wetterstation  MinimumTemp.[°C] MaximumTemp.[°C] Minimum 5 cmüber dem Erd-boden (Nacht) Schnee-höhe[cm] StärksteWindböe[Bft] Nieder-schlag[l/m2] Sonnen-scheindauer[h] date
0 Aachen -5,1 1,9 -6,9 0 644 1,9 5,3 20170101
1 Ahaus (Münsterland) -1,2 0,7 -1 0 429 3,9 0 20170101
2 Albstadt (Schwäbische Alb, 759 m) -9,9 4,2 -10,8 0 7,1 20170101
3 Aldersbach (Lkr. Passau, Niederbayern) -8,3 -1,7 -7,8 0 4,9 20170101
4 Alfeld (Leine) -4,1 1,6 -4,4 0 322 1,6 1,5 20170101
.. ... ... ... ... ... ... ... ... ...
965 Zernien (N) 0 20170202
966 Zielitz (N) 0 20170202
967 Zinnwald-Georgenfeld (Erzgebirge, 877 m) -7,4 0,2 -0,2 55 540 0,5 0 20170202
968 Zugspitze -7,3 -5,3 235 12123 0 5,8 20170202
969 Zwiesel (Bayerischer Wald, 612 m) -4,4 9,3 -0,3 44 213 0,1 2,4 20170202
[27841 rows x 9 columns]
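As an aside, the row-by-row replace loop from the question can be written without iterrows(); a sketch using the column name from the question:
# one-pass alternative to the iterrows() loop; non-string cells are left
# untouched, mirroring the isinstance() check in the original
col = 'Wetterstation\xa0'
df2[col] = df2[col].map(lambda v: v.replace('/', '-') if isinstance(v, str) else v)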

My year list doesn't work with BeautifulSoup. Why?

I'm a newbie learning BeautifulSoup. Could someone have a look at the following code? I'd like to scrape data from a website, so far without success. I'd like to create a dataframe with the sum of player arrivals per year and a column with the players' average age.
The resulting dataframe repeats the same values in every row (screenshot of the dataframe error not reproduced here).
My code:
import pandas as pd
import requests
from bs4 import BeautifulSoup

anos_list = list(range(2005, 2018))
anos_lista = []
valor_contratos_lista = []
idade_média_lista = []
for ano_lista in anos_list:
    url = 'https://www.transfermarkt.com/flamengo-rio-de-janeiro/transfers/verein/614/saison_id/' + str(anos_list) + ''
    page = requests.get(url, headers={'User-Agent': 'Custom5'})
    soup = BeautifulSoup(page.text, 'html.parser')
    tag_list = soup.tfoot.find_all('td')
    valor = (tag_list[0].string)
    idade = (tag_list[1].string)
    ano = ano_lista
    valor_contratos_lista.append(valor)
    idade_media_lista.append(idade)
    anos_lista.append(ano)
flamengo_df = pd.DataFrame({'Ano': ano_lista,
                            'Despesa com contratações': valor_contratos_lista,
                            'Média de idade': idade_média_lista})
flamengo_df.to_csv('flamengo.csv', encoding='utf-8')
Here's my approach, using Beautiful Soup + regex:
import requests
from bs4 import BeautifulSoup
import re
import numpy as np

# Set min and max years as variables
min_year = 2005
max_year = 2019
year_range = list(range(min_year, max_year + 1))
base_url = 'https://www.transfermarkt.com/flamengo-rio-de-janeiro/transfers/verein/614/saison_id/'
# Begin iterating
records = []
for year in year_range:
    url = base_url + str(year)
    # get the page
    page = requests.get(url, headers={'User-Agent': 'Custom5'})
    soup = BeautifulSoup(page.text, 'html.parser')
    # I used the class of "responsive-table"
    tables = soup.find_all('div', {'class': 'responsive-table'})
    rows = tables[0].find_all('tr')
    cells = [row.find_all('td', {'class': 'zentriert'}) for row in rows]
    # get variable names:
    variables = [x.text for x in rows[0].find_all('th')]
    variables_values = {x: [] for x in variables}
    # get values
    for row in rows:
        values = [' '.join(x.text.split()) for x in row.find_all('td')]
        values = [x for x in values if x != '']
        if len(variables) < len(values):
            values.pop(4)
            values.pop(2)
        for k, v in zip(variables_values.keys(), values):
            variables_values[k].append(v)
    num_pattern = re.compile('[0-9,]+')
    to_float = lambda x: float(x) if x != '' else np.NAN
    get_nums = lambda x: to_float(''.join(num_pattern.findall(x)).replace(',', '.'))
    # Add values to an individual record
    rec = {
        'Url': url,
        'Year': year,
        'Total Transfers': len(variables_values['Player']),
        'Avg Age': np.mean([int(x) for x in variables_values['Age']]),
        'Avg Cost': np.nanmean([get_nums(x) for x in variables_values['Fee'] if ('loan' not in x)]),
        'Total Cost': np.nansum([get_nums(x) for x in variables_values['Fee'] if ('loan' not in x)]),
    }
    # Store record
    records.append(rec)
Thereafter, initialize the dataframe.
Of note, some of the numbers represent millions and would need to be adjusted for that.
import pandas as pd
# Drop the URL
df = pd.DataFrame(records, columns=['Year','Total Transfers','Avg Age','Avg Cost','Total Cost'])
Year Total Transfers Avg Age Avg Cost Total Cost
0 2005 26 22.038462 2.000000 2.00
1 2006 32 23.906250 240.660000 1203.30
2 2007 37 22.837838 462.750000 1851.00
3 2008 41 22.926829 217.750000 871.00
4 2009 31 23.419355 175.000000 350.00
5 2010 46 23.239130 225.763333 1354.58
6 2011 47 23.042553 340.600000 1703.00
7 2012 45 24.133333 345.820000 1037.46
8 2013 36 24.166667 207.166667 621.50
9 2014 37 24.189189 111.700000 335.10
10 2015 49 23.530612 413.312000 2066.56
11 2016 41 23.341463 241.500000 966.00
12 2017 31 24.000000 101.433333 304.30
13 2018 18 25.388889 123.055000 738.33
14 2019 10 25.300000 NaN 0.00
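Regarding the note above that some figures represent millions: a hedged sketch of a fee parser keyed to suffixes like 'm' and 'Th.' (the exact suffixes Transfermarkt uses are an assumption here):
import re
import numpy as np

def parse_fee(text):
    # assumption: fees look like '€2.50m', '€300Th.', '-', or contain 'loan'
    if not text or 'loan' in text.lower():
        return np.nan
    m = re.search(r'([\d.,]+)', text)
    if not m:
        return np.nan
    value = float(m.group(1).replace(',', '.'))
    if 'm' in text.lower():
        value *= 1_000_000   # millions suffix
    elif 'th' in text.lower():
        value *= 1_000       # thousands suffix
    return value

print(parse_fee('€2.50m'))   # 2500000.0
print(parse_fee('€300Th.'))  # 300000.0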
