I am working on scraping text with Python from this link: tournament link
Here is my code to get the tabular data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://www.hubertiming.com/results/2017GPTR10K"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
rows = soup.find_all('tr') ## find the table rows
Now, the goal is to obtain the data as a dataframe.
listnew = []
for row in rows:
    row_td = row.find_all('td')
    str_cells = str(row_td)
    cleantext = BeautifulSoup(str_cells, "lxml").get_text()  ## obtain the text part
    listnew.append(cleantext)  ## append to the list
df = pd.DataFrame(listnew)
df.head(10)
Then we get the following output:
0 []
1 [Finishers:, 577]
2 [Male:, 414]
3 [Female:, 163]
4 []
5 [1, 814, \r\n\r\n JARED WIL...
6 [2, 573, \r\n\r\n NATHAN A ...
7 [3, 687, \r\n\r\n FRANCISCO...
8 [4, 623, \r\n\r\n PAUL MORR...
9 [5, 569, \r\n\r\n DEREK G O..
I don't know why there are newline and carriage return characters (\r\n\r\n). How can I remove them and get a DataFrame in the proper format? Thanks in advance.
Pandas can parse HTML tables; give this a try:
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup
url = "http://www.hubertiming.com/results/2017GPTR10K"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
table_1_html = soup.find('table', attrs={'id': 'individualResults'})
t_1 = pd.read_html(table_1_html.prettify())[0]
print(t_1)
Output:
Place Bib Name ... Chip Pace Gun Time Team
0 1 814 JARED WILSON ... 5:51 36:24 NaN
1 2 573 NATHAN A SUSTERSIC ... 5:55 36:45 INTEL TEAM F
2 3 687 FRANCISCO MAYA ... 6:05 37:48 NaN
3 4 623 PAUL MORROW ... 6:13 38:37 NaN
4 5 569 DEREK G OSBORNE ... 6:20 39:24 INTEL TEAM F
.. ... ... ... ... ... ... ...
572 573 273 RACHEL L VANEY ... 15:51 1:38:34 NaN
573 574 467 ROHIT B DSOUZA ... 15:53 1:40:32 INTEL TEAM I
574 575 471 CENITA D'SOUZA ... 15:53 1:40:34 NaN
575 576 338 PRANAVI APPANA ... 16:15 1:42:01 NaN
576 577 443 LIBBY B MITCHELL ... 16:20 1:42:10 NaN
[577 rows x 10 columns]
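As a side note, pd.read_html can usually pull the table straight from the URL, so the explicit BeautifulSoup step is optional; a minimal sketch, assuming the page keeps serving the table with the id 'individualResults':

import pandas as pd

url = "http://www.hubertiming.com/results/2017GPTR10K"
# read_html returns a list of DataFrames; attrs narrows the match to the results table
t_1 = pd.read_html(url, attrs={"id": "individualResults"})[0]
print(t_1.head())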
It seems some cells in the HTML have a lot of leading and trailing spaces and newlines:
<td>
JARED WILSON
</td>
Use str.strip to remove all leading and trailing whitespace, like this:
BeautifulSoup(str_cells, "lxml").get_text().strip().
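Applied to your loop, one way is to strip each cell individually instead of the stringified list; a small sketch of your own loop, adjusted:

listnew = []
for row in rows:
    row_td = row.find_all('td')
    # strip leading/trailing whitespace (including \r\n) from every cell
    listnew.append([td.get_text().strip() for td in row_td])
df = pd.DataFrame(listnew)
df.head(10)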
Well, looking at the URL you provided, you can see the newlines right in the HTML source:
...
<td>814</td>
<td>
JARED WILSON
</td>
...
so that's what you get when you scrape. These can easily be removed by the very convenient .strip() string method.
Your DataFrame is not formatted correctly because you are giving it a list of lists that are not all the same size: the first four rows come from another table, located at the top right of the page. One easy fix is to drop those first four rows, though it would be far more robust to select the table you want based on its id ("individualResults"); a sketch of that approach follows below.
df = pd.DataFrame(listnew[4:])
df.head(10)
Have a look here: BeautifulSoup table to dataframe
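A rough sketch of that more robust variant, selecting the results table by its id and building the headers from its <th> cells (assuming the header cells line up one-to-one with the data cells):

table = soup.find('table', {'id': 'individualResults'})
headers = [th.get_text().strip() for th in table.find_all('th')]
data = []
for row in table.find_all('tr')[1:]:  # skip the header row
    data.append([td.get_text().strip() for td in row.find_all('td')])
df = pd.DataFrame(data, columns=headers)
df.head(10)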
Okay, I've been beating my head against the wall enough on this one - I'm stuck! I'm trying to build a function that I can input the Favorite from Sagarin's College Football site and it will calculate the spread including the Home advantage.
I am trying to pull the "Predictions_with_Totals" from Sagarin's site:
http://sagarin.com/sports/cfsend.htm#Predictions_with_Totals
I can get to it with the following code:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
html = requests.get("http://sagarin.com/sports/cfsend.htm#Predictions_with_Totals").text
soup = bs(html, "html.parser")
#find and create the table we want to import
collegeFB_HOME_ALL = soup.find_all("pre")
collegeFB_HOME = collegeFB_HOME_ALL[6]
df_collegeFB = collegeFB_HOME
This gets me a very nice table with a few headers I would need to get past to get to the "meat" of the data.
Predictions_with_Totals
These are the "regular method". _
HOME or CLOSEBY (c) team in CAPS _
both teams in lower case _
means "n" for NEUTRAL location Rating Favorite _
MONEY=odds to 100 _
FAVORITE Rating Predict Golden Recent UNDERDOG ODDS PCT% TOTAL _
======================================================================================================
CENTRAL FLORIDA(UCF) 6.35 4.66 5.99 7.92 smu 200 67% 52.40
ALABAMA 20.86 19.07 17.01 26.30 texas a&m 796 89% 42.65
snipped.....
However, I can't get rid of the top HTML code to format this into something useful. If he had made this a table or even a list I think I would find it a lot easier.
I have tried to make a dictionary and use row.find based on searches here but I don't know why it isn't working for me - maybe I need to trash the first few rows before the "FAVORITES" row? How would I do that?
output = []
for row in df_collegeFB:
    test = {}
    test["headers"] = row.find("FAVORITES")
    test['data'] = row.find('all')
    output.append(test)
This just gives me garbage. I'm sure I'm putting garbage in, so I'm not surprised I'm getting garbage out.
print(output)
[{'headers': -1, 'data': -1}, {'headers': None, 'data': None}, {'headers': -1, 'data': 1699}, {'headers': None, 'data': None}, {'headers': -1, 'data': -1}]
I'm not sure exactly what you are after, but if you are trying to get that table, you can use a regex. The way I did it here is probably not the most efficient, but it nonetheless gets that table into a DataFrame:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
import re
html = requests.get("http://sagarin.com/sports/cfsend.htm#Predictions_with_Totals").text
soup = bs(html, "html.parser")
#find and create the table we want to import
collegeFB_HOME_ALL = str(soup.find_all("pre")[6])
pattern = re.compile(r"\s{1,}([a-zA-Z\(\)].*)\s{1,}([0-9\.\-]+)\s{1,}([0-9\.\-]+)\s{1,}([0-9\.\-]+)\s{1,}([0-9\.\-]+)\s{1,}(\D+)([0-9]+)\s{1,}([0-9%]+)\s{1,}([0-9\.]+)")
rows = []
# find all matches to groups
for match in pattern.finditer(collegeFB_HOME_ALL):
    row = {}
    for i, col in enumerate(['FAVORITE', 'Rating', 'Predict', 'Golden', 'Recent', 'UNDERDOG', 'ODDS', 'PCT%', 'TOTAL'], start=1):
        row[col] = match.group(i).strip()
    rows.append(row)
df = pd.DataFrame(rows)
Output:
print(df)
FAVORITE Rating Predict ... ODDS PCT% TOTAL
0 CENTRAL FLORIDA(UCF) 6.35 4.66 ... 200 67% 52.40
1 ALABAMA 20.86 19.07 ... 796 89% 42.65
2 oregon 12.28 11.89 ... 362 78% 75.82
3 washington 8.28 8.47 ... 244 71% 64.72
4 james madison 8.08 8.52 ... 239 70% 64.71
.. ... ... ... ... ... ... ...
104 east tennessee state 7.92 7.75 ... 235 70% 41.16
105 WEBER STATE 15.32 17.25 ... 482 83% 62.36
106 delaware 2.10 2.89 ... 126 56% 38.73
107 YALE 0.87 0.83 ... 110 52% 54.32
108 YOUNGSTOWN STATE 2.11 4.51 ... 127 56% 48.10
[109 rows x 9 columns]
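As a starting point for the spread function you described, a hypothetical lookup helper over that DataFrame could be as simple as the following; folding in the home advantage is left to you, since that logic isn't shown in the question (Sagarin marks HOME/CLOSEBY favorites in CAPS, so df['FAVORITE'].str.isupper() could serve as the home flag):

def get_favorite_row(df, team):
    # case-insensitive substring match on the FAVORITE column
    mask = df['FAVORITE'].str.lower().str.contains(team.lower(), regex=False)
    return df[mask]

print(get_favorite_row(df, 'alabama'))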
I have a very specific issue which I have not been able to find a solution to.
Recently, I began a project for which I am monitoring about 100 ETFs and Mutual funds based on specific data acquired from Morningstar. The current solution works great - but I later found out that I need more data from another "Tab" within the website. Specifically, I am trying to get data from the 1st table from the following website: https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000Z1MC&tab=1
Right now, I have the code below for scraping data from a table on the "Indhold" tab of the website and exporting it to Excel. My question is therefore: how do I adjust the code to scrape data from another part of the website?
To briefly explain the code and reiterate: the code below scrapes data from another tab of the same website. The many, many IDs identify the page for each mutual fund/ETF. The setup works very well, so I am hoping to simply adjust it (if that is possible) to extract the table from the link above. I have very limited knowledge of the topic, so any help is much, much appreciated.
import requests
import re
import pandas as pd
from openpyxl import load_workbook
auth = 'https://www.morningstar.dk/Common/funds/snapshot/PortfolioSAL.aspx'
# Create a Pandas Excel writer using XlsxWriter as the engine.
path= r'/Users/karlemilthulstrup/Downloads/data2.xlsm'
book = load_workbook(path ,read_only = False, keep_vba=True)
writer = pd.ExcelWriter(path, engine='openpyxl')
writer.book = book
ids = ['F00000VA2N','F0GBR064OO','F00000YKC2','F000015MVX','F0000020YA','0P00015YTR','0P00015YTT','F0GBR05V8D','F0GBR06XKI','F000013CKH','F00000MG6K','F000014G49',
'F00000WC0Z','F00000QSD2','F000016551','F0000146QH','F0000146QI','F0GBR04KZS','F0GBR064VU','F00000VXLM','F0000119R1','F0GBR04L4T','F000015CS3','F000015CS5','F000015CS6',
'F000015CS4','F000013BZE','F0GBR05W0Q','F000016M1C','F0GBR04L68','F00000Z9T9','F0GBR04JI8','F00000Z9TG','F0GBR04L2P','F000014CU8','F00000ZG2G','F00000MLEW',
'F000013ZOY','F000016614','F00000WUI9','F000015KRL','F0GBR04LCR','F000010ES9','F00000P780','F0GBR04HC3','F000015CV6','F00000YWCK','F00000YWCJ','F00000NAI5',
'F0GBR04L81','F0GBR05KNU','F0GBR06XKB','F00000NAI3','F0GBR06XKF','F000016UA9','F000013FC2','F000014NRE','0P0000CNVT','0P0000CNVX','F000015KRI',
'F000015KRG','F00000XLK7','F0GBR04IDG','F00000XLK6','F00000073J','F00000XLK4','F000013CKG','F000013CKJ','F000013CKK','F000016P8R','F000016P8S','F000011JG6',
'F000014UZQ','F0000159PE','F0GBR04KZG','F0000002OY','F00000TW9K','F0000175CC','F00000NBEL','F000016054','F000016056','F00000TEYP','F0000025UI','F0GBR04FV7',
'F00000WP01','F000011SQ4','F0GBR04KZO','F000010E19','F000013ZOX','F0GBR04HD7','F00000YKC1','F0GBR064UG','F00000JSDD','F000010ROF','F0000100CA','F0000100CD',
'FOGBR05KQ0','F0GBR04LBB','F0GBR04LBZ','F0GBR04LCN','F00000WLA7','F0000147D7','F00000ZB5E','F00000WC0Y']
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Mobile Safari/537.36'}
payload = {
    'languageId': 'da-DK',
    'locale': 'da-DK',
    'clientId': 'MDC_intl',
    'benchmarkId': 'category',
    'component': 'sal-components-mip-factor-profile',
    'version': '3.40.1'}
for api_id in ids:
    payload = {
        'Site': 'dk',
        'FC': '%s' %api_id,
        'IT': 'FO',
        'LANG': 'da-DK',}
    response = requests.get(auth, params=payload)
    search = re.search('(tokenMaaS:[\w\s]*\")(.*)(\")', response.text, re.IGNORECASE)
    bearer = 'Bearer ' + search.group(2)
    headers.update({'Authorization': bearer})
    url = 'https://www.us-api.morningstar.com/sal/sal-service/fund/factorProfile/%s/data' %api_id
    jsonData = requests.get(url, headers=headers, params=payload).json()
    rows = []
    for k, v in jsonData['factors'].items():
        row = {}
        row['factor'] = k
        historicRange = v.pop('historicRange')
        row.update(v)
        for each in historicRange:
            row.update(each)
        rows.append(row.copy())
    df = pd.DataFrame(rows)
    sheetName = jsonData['id']
    df.to_excel(writer, sheet_name=sheetName, index=False)
    print('Finished: %s' %sheetName)
writer.save()
writer.close()
If I understand you correctly, you want to get the first table from that URL as a pandas DataFrame:
import requests
import pandas as pd
from bs4 import BeautifulSoup
# load the page into soup:
url = "https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000Z1MC&tab=1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# find correct table:
tbl = soup.select_one(".returnsCalenderYearTable")
# remove the first row (it's not header):
tbl.tr.extract()
# convert the html to pandas DF:
df = pd.read_html(str(tbl))[0]
# move the first row to header:
df.columns = map(str, df.loc[0])
df = df.loc[1:].reset_index(drop=True).rename(columns={"nan": "Name"})
print(df)
Prints:
Name 2014* 2015* 2016* 2017* 2018 2019 2020 31-08
0 Samlet afkast % 2627 1490 1432 584 -589 2648 -482 1841
1 +/- Kategori 1130 583 808 -255 164 22 -910 -080
2 +/- Indeks 788 591 363 -320 -127 -262 -1106 -162
3 Rank i kategori 2 9 4 80 38 54 92 63
EDIT: To load from multiple URLs:
import requests
import pandas as pd
from bs4 import BeautifulSoup
urls = [
"https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1",
"https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1",
"https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1",
"https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1",
]
all_data = []
for url in urls:
    print("Loading URL {}".format(url))
    # load the page into soup:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    # find correct table:
    tbl = soup.select_one(".returnsCalenderYearTable")
    # remove the first row (it's not header):
    tbl.tr.extract()
    # convert the html to pandas DF:
    df = pd.read_html(str(tbl))[0]
    # move the first row to header:
    df.columns = map(lambda x: str(x).replace("*", "").strip(), df.loc[0])
    df = df.loc[1:].reset_index(drop=True).rename(columns={"nan": "Name"})
    df["Company"] = soup.h1.text.split("\n")[0].strip()
    df["URL"] = url
    all_data.append(df.loc[:, ~df.isna().all()])
df = pd.concat(all_data, ignore_index=True)
print(df)
Prints:
Name 2016 2017 2018 2019 2020 31-08 Company URL
0 Samlet afkast % 1755.0 942.0 -1317.0 1757.0 -189.0 3018 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
1 +/- Kategori 966.0 -54.0 -186.0 -662.0 -967.0 1152 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
2 +/- Indeks 686.0 38.0 -854.0 -1153.0 -813.0 1015 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
3 Rank i kategori 10.0 24.0 85.0 84.0 77.0 4 Great Dane Globale Aktier https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000VA2N&tab=1
4 Samlet afkast % NaN 1016.0 -940.0 1899.0 767.0 2238 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
5 +/- Kategori NaN 20.0 190.0 -520.0 -12.0 373 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
6 +/- Indeks NaN 112.0 -478.0 -1011.0 143.0 235 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
7 Rank i kategori NaN 26.0 69.0 92.0 43.0 25 Independent Generations ESG https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F0GBR064OO&tab=1
8 Samlet afkast % NaN NaN -939.0 1898.0 766.0 2239 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
9 +/- Kategori NaN NaN 191.0 -521.0 -12.0 373 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
10 +/- Indeks NaN NaN -477.0 -1012.0 142.0 236 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
11 Rank i kategori NaN NaN 68.0 92.0 44.0 24 Independent Generations ESG Akk https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F00000YKC2&tab=1
12 Samlet afkast % NaN NaN NaN NaN NaN 2384 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
13 +/- Kategori NaN NaN NaN NaN NaN 518 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
14 +/- Indeks NaN NaN NaN NaN NaN 381 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
15 Rank i kategori NaN NaN NaN NaN NaN 18 Investin Sustainable World https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id=F000015MVX&tab=1
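If you want to slot this into your existing Excel workflow, a rough sketch reusing your ids list and openpyxl writer might look like this (untested against every id; the sheet names here are just the fund ids):

url_template = "https://www.morningstar.dk/dk/funds/snapshot/snapshot.aspx?id={}&tab=1"
for api_id in ids:
    soup = BeautifulSoup(requests.get(url_template.format(api_id)).content, "html.parser")
    tbl = soup.select_one(".returnsCalenderYearTable")
    if tbl is None:  # some funds may not have this table
        print('No table found for %s' % api_id)
        continue
    tbl.tr.extract()  # drop the first (non-header) row
    df = pd.read_html(str(tbl))[0]
    df.columns = map(str, df.loc[0])  # promote the first remaining row to header
    df = df.loc[1:].reset_index(drop=True).rename(columns={"nan": "Name"})
    df.to_excel(writer, sheet_name=api_id, index=False)
writer.save()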
en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul
In the link above there is un-tabulated data for Istanbul neighbourhoods.
I want to fetch these neighbourhoods into a DataFrame with this code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
wikiurl="https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
response=requests.get(wikiurl)
soup = BeautifulSoup(response.text, 'html.parser')
tocList=soup.findAll('a',{'class':"new"})
neighborhoods=[]
for item in tocList:
    text = item.get_text()
    neighborhoods.append(text)
df = pd.DataFrame(neighborhoods, columns=['Neighborhood'])
print(df)
and I got this output:
Neighborhood
0 Maden
1 Nizam
2 Anadolu
3 Arnavutköy İmrahor
4 Arnavutköy İslambey
... ...
705 Seyitnizam
706 Sümer
707 Telsiz
708 Veliefendi
709 Yeşiltepe
710 rows × 1 columns
But some of the data is not fetched; compare the data below with the output:
Adalar
Burgazada
Heybeliada
Kınalıada
Maden
Nizam
findAll() is not fetching the neighbourhoods that appear as plain list items (or ordinary links) without the "new" class, e.g.
<ol><li>Burgazada</li>
<li>Heybeliada</li>
Also, can I extend the code to two columns, 'Neighborhood' and its 'District'?
Are you trying to fetch this list from the Table of Contents?
Please check if this solves your problem:
import pandas as pd
import requests
from bs4 import BeautifulSoup
wikiurl="https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
response=requests.get(wikiurl)
soup = BeautifulSoup(response.text, 'html.parser')
tocList=soup.findAll('span',{'class':"toctext"})
districts=[]
blocked_words = ['Neighbourhoods by districts','Further reading', 'External links']
for item in tocList:
    text = item.get_text()
    if text not in blocked_words:
        districts.append(text)
df = pd.DataFrame(districts, columns=['districts'])
print(df)
Output:
districts
0 Adalar
1 Arnavutköy
2 Ataşehir
3 Avcılar
4 Bağcılar
5 Bahçelievler
6 Bakırköy
7 Başakşehir
8 Bayrampaşa
9 Beşiktaş
10 Beykoz
11 Beylikdüzü
12 Beyoğlu
13 Büyükçekmece
14 Çatalca
15 Çekmeköy
16 Esenler
17 Esenyurt
18 Eyüp
19 Fatih
20 Gaziosmanpaşa
21 Güngören
22 Kadıköy
23 Kağıthane
24 Kartal
25 Küçükçekmece
26 Maltepe
27 Pendik
28 Sancaktepe
29 Sarıyer
30 Silivri
31 Sultanbeyli
32 Sultangazi
33 Şile
34 Şişli
35 Tuzla
36 Ümraniye
37 Üsküdar
38 Zeytinburnu
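For the two-column version you asked about (Neighborhood plus its District), one possible sketch is to walk the district headings and collect the list items that follow each one. This assumes every district section is a heading followed by <ol>/<ul> lists of neighbourhoods, as in the snippet you posted, so it may need adjusting if the page layout differs, and you may want to extend blocked_words with sections such as 'See also' or 'References':

rows = []
for headline in soup.select("span.mw-headline"):
    district = headline.get_text()
    if district in blocked_words:
        continue
    heading = headline.find_parent(["h2", "h3"])
    if heading is None:
        continue
    # collect list items between this heading and the next one (layout assumption)
    for sibling in heading.find_next_siblings():
        if sibling.name in ("h2", "h3"):
            break
        for li in sibling.find_all("li"):
            rows.append({"Neighborhood": li.get_text().strip(), "District": district})
df = pd.DataFrame(rows)
print(df)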
I have this code I am using to extract data from a site:
import csv
import requests
url = 'https://covid-19.dataflowkit.com/v1'
response = requests.get(url)
with open('covid.csv', 'w') as f:
    writer = csv.writer(f)
    for line in response.iter_lines():
        writer.writerows(line.decode('utf-8').split(','))
I am able to get data out in a CSV file, but the format is wrong and confusing.
How do I format the output in a meaningful way in the CSV file?
Or how can I insert this result/data into a table in SQL server?
The response is JSON, and I would say the response itself is already reasonably structured.
import requests
import pandas as pd
url = 'https://covid-19.dataflowkit.com/v1'
response = requests.get(url)
df = pd.DataFrame(response.json())
df.to_csv("data.csv", index=False)
What does the CSV look like?
Active Cases_text Country_text Last Update New Cases_text New Deaths_text Total Cases_text Total Deaths_text Total Recovered_text
0 4,871,695 World 2020-07-12 20:16 +175,247 +3,530 13,008,752 570,564 7,566,493
1 1,757,520 USA 2020-07-12 20:16 +52,144 +331 3,407,790 137,733 1,512,537
2 579,069 Brazil 2020-07-12 19:16 +23,869 +608 1,864,681 72,100 1,213,512
3 301,850 India 2020-07-12 19:16 +29,108 +500 879,466 23,187 554,429
4 214,766 Russia 2020-07-12 20:16 +6,615 +130 727,162 11,335 501,061
.. ... ... ... ... ... ... ... ...
212 0 Caribbean Netherlands NaN 7 7
213 0 St. Barth NaN 6 6
214 0 Anguilla NaN 3 3
215 1 Saint Pierre Miquelon NaN 2 1
If you want to pull meaning out of the data, I would suggest analysing it in the pandas DataFrame.
If you want to analyse the data in a database instead, you can use this answer for SQL Server: https://stackoverflow.com/a/25662997/6849682
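For the SQL Server route, a minimal sketch using SQLAlchemy's to_sql (the connection string, driver name and table name below are placeholders you would replace with your own; it requires the pyodbc package and an installed ODBC driver):

import requests
import pandas as pd
from sqlalchemy import create_engine

df = pd.DataFrame(requests.get('https://covid-19.dataflowkit.com/v1').json())

# placeholder connection string for SQL Server via pyodbc
engine = create_engine(
    "mssql+pyodbc://user:password@SERVER/DATABASE?driver=ODBC+Driver+17+for+SQL+Server"
)
df.to_sql("covid_stats", engine, if_exists="replace", index=False)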
I'm having some trouble and I couldn't find any answers on the internet (which is very rare!!).
I'm doing some web scraping of meteorological data at a daily granularity, by city.
When I scrape a single day, the 'city' column is treated as a string, but when I scrape more than one day (two days, for example), the 'city' column is no longer a string, and the same code that worked for a one-day scrape no longer works for a two-day scrape.
Here's my code for scraping
url_template = 'https://www.wetterkontor.de/de/wetter/deutschland/extremwerte.asp?id={}'
def get_weather(date):
    url = url_template.format(date)
    html_doc = requests.get(url).text
    soup = BeautifulSoup(html_doc)
    table = soup.find('table', id="extremwerte")
    rows = []
    for row in table.find_all('tr'):
        rows.append([val.text for val in row.find_all('td')])
    headers = [header.text for header in table.find_all('th')]
    df = pd.DataFrame(rows[2:], columns=headers)
    df['date'] = date
    return df
df2 = pd.DataFrame()
df2 = df2.fillna(0)
for d in pd.date_range(start='20170101', end='20170101'):
    df2 = pd.concat([df2, get_weather(d.strftime('%Y%m%d'))])
When I only scrape 20170101 as above, the code below works:
for i, row in df2.iterrows():
    if isinstance(df2['Wetterstation\xa0'][i], str) is True:
        df2.at[i, 'Wetterstation\xa0'] = df2['Wetterstation\xa0'][i].replace('/', '-')
But when I change the end date to 20170202, for example, the code no longer works and gives me this error message:
AttributeError: 'Series' object has no attribute 'encode'
Please can you explain what has changed? Why does the column type change when I scrape more than one day?
Thank you in advance for your time!
It seems you need to specify the correct parser, in this case lxml (the standard html.parser produces incorrect results).
Running this code for end date 20170202:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url_template = 'https://www.wetterkontor.de/de/wetter/deutschland/extremwerte.asp?id={}'
def get_weather(date):
    url = url_template.format(date)
    html_doc = requests.get(url).text
    soup = BeautifulSoup(html_doc, 'lxml')  # <--- specify `lxml` parser
    table = soup.find('table', id="extremwerte")
    rows = []
    for row in table.find_all('tr'):
        rows.append([val.text for val in row.find_all('td')])
    headers = [header.text for header in table.find_all('th')]
    df = pd.DataFrame(rows[2:], columns=headers)
    df['date'] = date
    return df
df2 = pd.DataFrame()
df2 = df2.fillna(0)
for d in pd.date_range(start='20170101', end='20170202'):
    df2 = pd.concat([df2, get_weather(d.strftime('%Y%m%d'))])
print(df2)
Produces:
Wetterstation MinimumTemp.[°C] MaximumTemp.[°C] Minimum 5 cmüber dem Erd-boden (Nacht) Schnee-höhe[cm] StärksteWindböe[Bft] Nieder-schlag[l/m2] Sonnen-scheindauer[h] date
0 Aachen -5,1 1,9 -6,9 0 644 1,9 5,3 20170101
1 Ahaus (Münsterland) -1,2 0,7 -1 0 429 3,9 0 20170101
2 Albstadt (Schwäbische Alb, 759 m) -9,9 4,2 -10,8 0 7,1 20170101
3 Aldersbach (Lkr. Passau, Niederbayern) -8,3 -1,7 -7,8 0 4,9 20170101
4 Alfeld (Leine) -4,1 1,6 -4,4 0 322 1,6 1,5 20170101
.. ... ... ... ... ... ... ... ... ...
965 Zernien (N) 0 20170202
966 Zielitz (N) 0 20170202
967 Zinnwald-Georgenfeld (Erzgebirge, 877 m) -7,4 0,2 -0,2 55 540 0,5 0 20170202
968 Zugspitze -7,3 -5,3 235 12123 0 5,8 20170202
969 Zwiesel (Bayerischer Wald, 612 m) -4,4 9,3 -0,3 44 213 0,1 2,4 20170202
[27841 rows x 9 columns]
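As an aside, once the table parses correctly you can replace the row-by-row loop from the question with the vectorized string method, using the same column name as in your own code:

# vectorized equivalent of the iterrows loop: replace '/' with '-' in the station column
df2['Wetterstation\xa0'] = df2['Wetterstation\xa0'].str.replace('/', '-', regex=False)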