The code works fine; however, the table at the URL I'm trying to fetch has its headers repeated throughout the body, and I'm not sure how to remove those rows. I'm trying to get the data into BigQuery, and there are certain characters which aren't allowed in column names.
import json

from bs4 import BeautifulSoup
from selenium import webdriver

URL = 'https://www.basketball-reference.com/leagues/NBA_2020_games-august.html'
driver = webdriver.Chrome(options=chrome_options)
driver.get(URL)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

tables = soup.find_all('table', {"id": ["schedule"]})
table = tables[0]
tab_data = [[cell.text for cell in row.find_all(["th", "td"])]
            for row in table.find_all("tr")]

json_string = ''
# sanitize column names so BigQuery accepts them
headers = [col.replace('.', '_').replace('/', '_').replace('%', 'pct').replace('3', '_3').replace('(', '_').replace(')', '_') for col in tab_data[1]]
for row in tab_data[2:]:
    json_string += json.dumps(dict(zip(headers, row))) + '\n'

with open('example.json', 'w') as f:
    f.write(json_string)
print(json_string)
You can filter the tr rows to those whose class is None, so that you don't pick up the duplicate header rows. The following code creates a dataframe from the table:
from bs4 import BeautifulSoup
import requests
import pandas as pd

res = requests.get("https://www.basketball-reference.com/leagues/NBA_2020_games-august.html")
soup = BeautifulSoup(res.text, "html.parser")
table = soup.find("div", {"id": "div_schedule"}).find("table")
columns = [i.get_text() for i in table.find("thead").find_all('th')]
data = []
# class_=False skips the repeated header rows, which carry a class attribute
for tr in table.find('tbody').find_all('tr', class_=False):
    temp = [tr.find('th').get_text(strip=True)]
    temp.extend([i.get_text(strip=True) for i in tr.find_all("td")])
    data.append(temp)
df = pd.DataFrame(data, columns=columns)
print(df)
Output:
Date Start (ET) Visitor/Neutral PTS Home/Neutral PTS Attend. Notes
0 Sat, Aug 1, 2020 1:00p Miami Heat 125 Denver Nuggets 105 Box Score
1 Sat, Aug 1, 2020 3:30p Utah Jazz 94 Oklahoma City Thunder 110 Box Score
2 Sat, Aug 1, 2020 6:00p New Orleans Pelicans 103 Los Angeles Clippers 126 Box Score
3 Sat, Aug 1, 2020 7:00p Philadelphia 76ers 121 Indiana Pacers 127 Box Score
4 Sat, Aug 1, 2020 8:30p Los Angeles Lakers 92 Toronto Raptors 107 Box Score
.. ... ... ... ... ... ... ... .. ... ...
75 Thu, Aug 13, 2020 Portland Trail Blazers Brooklyn Nets
76 Fri, Aug 14, 2020 Philadelphia 76ers Houston Rockets
77 Fri, Aug 14, 2020 Miami Heat Indiana Pacers
78 Fri, Aug 14, 2020 Oklahoma City Thunder Los Angeles Clippers
79 Fri, Aug 14, 2020 Denver Nuggets Toronto Raptors
[80 rows x 10 columns]
In order to insert into BigQuery, you can load the JSON directly, or load the dataframe using pandas.DataFrame.to_gbq: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_gbq.html
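For example, a minimal sketch of the dataframe route (the dataset and project names below are placeholders, and the pandas-gbq package must be installed):

df.to_gbq(
    destination_table="my_dataset.nba_schedule",  # placeholder dataset.table
    project_id="my-gcp-project",                  # placeholder GCP project id
    if_exists="replace",                          # overwrite the table on re-runs
)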
I am trying to web scrape a table on a webpage as part of an assignment using Python. I want to scrape all 618 records of the table which are scattered across 13 pages in the same URL. However, my program only scrapes the first page of the table and its records. The URL is in my code, which can be found below:
from bs4 import BeautifulSoup as bs
import requests as r

base_URL = 'https://www.nba.com/players'

def scrape_webpage(URL):
    player_names = []
    page = r.get(URL)
    print(f'{page.status_code}')
    soup = bs(page.content, 'html.parser')
    raw_player_names = soup.find_all('div', class_='flex flex-col lg:flex-row')
    for name in raw_player_names:
        player_names.append(name.get_text().strip())
    print(player_names)

scrape_webpage(base_URL)
The player data is embedded inside a <script> element in the page. You can decode it with this example:
import re
import json
import requests
import pandas as pd

url = "https://www.nba.com/players"

# the page embeds its data as a JSON object ({"props": ...}) inside a <script> tag
data = re.search(r'({"props":.*})', requests.get(url).text).group(0)
data = json.loads(data)

# uncomment to print all data:
# print(json.dumps(data, indent=4))

df = pd.DataFrame(data["props"]["pageProps"]["players"])
print(df.head().to_markdown())
Prints:
|   | PERSON_ID | PLAYER_LAST_NAME | PLAYER_FIRST_NAME | PLAYER_SLUG | TEAM_ID | TEAM_SLUG | IS_DEFUNCT | TEAM_CITY | TEAM_NAME | TEAM_ABBREVIATION | JERSEY_NUMBER | POSITION | HEIGHT | WEIGHT | COLLEGE | COUNTRY | DRAFT_YEAR | DRAFT_ROUND | DRAFT_NUMBER | ROSTER_STATUS | FROM_YEAR | TO_YEAR | PTS | REB | AST | STATS_TIMEFRAME | PLAYER_LAST_INITIAL | HISTORIC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1630173 | Achiuwa | Precious | precious-achiuwa | 1610612761 | raptors | 0 | Toronto | Raptors | TOR | 5 | F | 6-8 | 225 | Memphis | Nigeria | 2020 | 1 | 20 | 1 | 2020 | 2021 | 9.1 | 6.5 | 1.1 | Season | A | False |
| 1 | 203500 | Adams | Steven | steven-adams | 1610612763 | grizzlies | 0 | Memphis | Grizzlies | MEM | 4 | C | 6-11 | 265 | Pittsburgh | New Zealand | 2013 | 1 | 12 | 1 | 2013 | 2021 | 6.9 | 10 | 3.4 | Season | A | False |
| 2 | 1628389 | Adebayo | Bam | bam-adebayo | 1610612748 | heat | 0 | Miami | Heat | MIA | 13 | C-F | 6-9 | 255 | Kentucky | USA | 2017 | 1 | 14 | 1 | 2017 | 2021 | 19.1 | 10.1 | 3.4 | Season | A | False |
| 3 | 1630583 | Aldama | Santi | santi-aldama | 1610612763 | grizzlies | 0 | Memphis | Grizzlies | MEM | 7 | F-C | 6-11 | 215 | Loyola-Maryland | Spain | 2021 | 1 | 30 | 1 | 2021 | 2021 | 4.1 | 2.7 | 0.7 | Season | A | False |
| 4 | 200746 | Aldridge | LaMarcus | lamarcus-aldridge | 1610612751 | nets | 0 | Brooklyn | Nets | BKN | 21 | C-F | 6-11 | 250 | Texas-Austin | USA | 2006 | 1 | 2 | 1 | 2006 | 2021 | 12.9 | 5.5 | 0.9 | Season | A | False |
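Since the dataframe should hold the full roster (the site paginates the table client-side, so the embedded JSON is not split across pages), a quick way to verify you got all 618 records and persist them - the filename here is just an example:

print(len(df))                       # should print 618 if every record was captured
df.to_csv("players.csv", index=False)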
I'm very new to Python and a novice at development in general. I've been tasked with iterating through a div, and I am able to get some data returned. However, when no tag exists I'm getting null results. Would anyone be able to help? I have researched and tried for days now. I truly appreciate your patience and explanations. I'm trying to extract the following:
Start Date [i.e., Sat 31 Jul 2021], which I do get results for with my code
End Date [i.e., Fri 20 Aug 2021], which I get no results for with my code
Description [i.e., 20 Night New! Malta, The Adriatic & Greece], which I get no results for with my code
Ship Name [i.e., Viking Sea], which I do get results for with my code
<div class="cd-info"> <!-- Start cd-info -->
From <b>Sat 31 Jul 2021</b><br>
(To Fri 20 Aug 2021)<br>
<b>20 Night New! Malta, The Adriatic & Greece</b><br>
Ship
<a class="red" href="/cruise-ship-viking-sea.html">Viking Sea</a>
<br>
<span class="mobile-no-desktop-yes"><br></span>
More details at<br>
<a target="_blank" onclick="trackOutboundLink('https://www.vikingcruises.com/oceans/cruise-destinations/eastern-mediterranean/malta-adriatic-and-greece/index.html');" href="https://www.vikingcruises.com/oceans/cruise-destinations/eastern-mediterranean/malta-adriatic-and-greece/index.html">
<img class="noborder" src="/logos/viking-cruises.gif" alt="More details for 20 Night New! Malta, The Adriatic & Greece at Viking Cruises">
</a>
</div>
Here is my code (no judgements please..haha)
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Jul%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="
headers = {"Accept-Language": "en-US, en;q=0.5"}
results = requests.get(url, headers=headers)
soup = BeautifulSoup(results.text, "html.parser")

#initiate data storage
cruisestartdate = []
cruiseenddate = []
itinerarydescription = []
cruiseshipname = []
cruisecabinprice = []
destinationportname = []
portdatetime = []

cruise_div = soup.find_all('div', class_='cd-info')

#our loop through each container
for container in cruise_div:
    #cruise start date
    startdate = container.b.text
    print(startdate)
    cruisestartdate.append(startdate)

    # cruise end date
    enddate = container.string
    cruiseenddate.append(enddate)

    # ship name
    ship = container.a.text
    cruiseshipname.append(ship)

#pandas dataframe
cruise = pd.DataFrame({
    'Sail Date': cruisestartdate,
    'End Date': cruiseenddate,
    #'Description': description,
    'Ship Name': cruiseshipname,
    #'imdb': imdb_ratings,
    #'metascore': metascores,
    #'votes': votes,
    #'us_grossMillions': us_gross,
})

print(soup)
print(cruisestartdate)
print(cruiseenddate)
print(itinerarydescription)
print(cruiseshipname)
Here are my results from the print:
['Sat 31 Jul 2021'] [None] [] ['Viking Sea']
container.text is a nicely formatted list of lines. Just split it and use the pieces:
cruise_div = soup.find_all('div', class_='cd-info')

#our loop through each container
for container in cruise_div:
    lines = container.text.splitlines()
    cruisestartdate.append(lines[1])
    cruiseenddate.append(lines[2])
    itinerarydescription.append(lines[3])

    # ship name
    ship = container.a.text
    cruiseshipname.append(ship)
Another solution, using the bs4 API:
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd

# url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Jul%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="
url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Aug%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="

headers = {"Accept-Language": "en-US, en;q=0.5"}
results = requests.get(url, headers=headers)
soup = BeautifulSoup(results.content, "html.parser")

all_data = []
for div in soup.select("div.cd-info"):
    from_ = div.b
    to_ = from_.find_next_sibling(text=True).strip()
    # remove (To )
    to_ = re.sub(r"\(To (.*)\)", r"\1", to_)
    desc = from_.find_next("b").get_text(strip=True)
    # remove html chars
    desc = BeautifulSoup(desc, "html.parser").get_text(strip=True)
    ship_name = div.a.get_text(strip=True)
    all_data.append(
        [
            from_.get_text(strip=True),
            to_,
            desc,
            ship_name,
        ]
    )

df = pd.DataFrame(all_data, columns=["from", "to", "description", "ship name"])
print(df)
df.to_csv("data.csv", index=False)
Prints:
from to description ship name
0 Tue 3 Aug 2021 Mon 23 Aug 2021 20 Night New! Malta, The Adriatic & Greece Viking Venus
1 Tue 10 Aug 2021 Fri 20 Aug 2021 10 Night New! Malta & Adriatic Jewels Viking Sea
2 Tue 10 Aug 2021 Mon 30 Aug 2021 20 Night New! Malta, The Adriatic & Greece Viking Sea
3 Fri 13 Aug 2021 Fri 20 Aug 2021 7 Night Bermuda Escape Viking Orion
4 Fri 13 Aug 2021 Mon 23 Aug 2021 10 Night New! Malta & Greek Isles Discovery Viking Venus
5 Fri 13 Aug 2021 Thu 2 Sep 2021 20 Night New! Malta, The Adriatic & Greece Viking Venus
6 Sat 14 Aug 2021 Sat 21 Aug 2021 7 Night Iceland's Natural Beauty Viking Sky
7 Mon 16 Aug 2021 Mon 13 Sep 2021 28 Night Star Collector: From Greek Gods to Gaudí Wind Surf
8 Mon 16 Aug 2021 Fri 3 Sep 2021 18 Night Star Collector: Flamenco of the Mediterranean Wind Surf
9 Mon 16 Aug 2021 Wed 29 Sep 2021 44 Night Star Collector: Myths & Masterpieces of the Mediterranean Wind Surf
and saves data.csv.
EDIT: To scrape the data from all pages:
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {"Accept-Language": "en-US, en;q=0.5"}
# url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Jul%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="
url = "https://www.cruisetimetables.com/cruisesearch.html?id=cse&sailmonth=Aug%202021&destination=Any&departureport=&arrivalport=&openjawcruise=&portofcall=&cruiseship=&duration=Any&chartercruise="

offset = 1
all_data = []
while True:
    u = url + f"&offset={offset}"
    print(f"Getting offset {offset=}")

    results = requests.get(u, headers=headers)
    soup = BeautifulSoup(results.content, "html.parser")

    for div in soup.select("div.cd-info"):
        from_ = div.b
        to_ = from_.find_next_sibling(text=True).strip()
        # remove (To )
        to_ = re.sub(r"\(To (.*)\)", r"\1", to_)
        desc = from_.find_next("b").get_text(strip=True)
        # remove html chars
        desc = BeautifulSoup(desc, "html.parser").get_text(strip=True)
        ship_name = div.a.get_text(strip=True)
        all_data.append(
            [
                from_.get_text(strip=True),
                to_,
                desc,
                ship_name,
            ]
        )

    # get next offset:
    offset = soup.select_one(".cd-buttonlist > b + a")
    if offset:
        offset = int(offset.get_text(strip=True).split("-")[0])
    else:
        break

df = pd.DataFrame(all_data, columns=["from", "to", "description", "ship name"])
print(df)
df.to_csv("data.csv", index=False)
Prints:
Getting offset offset=1
Getting offset offset=11
Getting offset offset=21
...
from to description ship name
0 Tue 3 Aug 2021 Mon 23 Aug 2021 20 Night New! Malta, The Adriatic & Greece Viking Venus
1 Tue 10 Aug 2021 Fri 20 Aug 2021 10 Night New! Malta & Adriatic Jewels Viking Sea
2 Tue 10 Aug 2021 Mon 30 Aug 2021 20 Night New! Malta, The Adriatic & Greece Viking Sea
3 Fri 13 Aug 2021 Fri 20 Aug 2021 7 Night Bermuda Escape Viking Orion
...
144 Tue 31 Aug 2021 Tue 7 Sep 2021 7 Nights Mediterranean MSC Splendida
145 Tue 31 Aug 2021 Tue 7 Sep 2021 7 Nächte Große Freiheit - Schwedische Küste 1 Mein Schiff 6
146 Tue 31 Aug 2021 Tue 7 Sep 2021 7 Night Iceland's Natural Beauty Viking Jupiter
Getting the start date, description, and ship name is straightforward. The only trick is the end date, which does not lie inside its own tag. To get it, you have to tweak some regex. Here is one solution, similar to @Andrej's:
from bs4 import BeautifulSoup
import re
html = """
<div class="cd-info"> <!-- Start cd-info -->
From <b>Sat 31 Jul 2021</b><br>
(To Fri 20 Aug 2021)<br>
<b>20 Night New! Malta, The Adriatic & Greece</b><br>
Ship
<a class="red" href="/cruise-ship-viking-sea.html">Viking Sea</a>
<br>
<span class="mobile-no-desktop-yes"><br></span>
More details at<br>
<a target="_blank" onclick="trackOutboundLink('https://www.vikingcruises.com/oceans/cruise-destinations/eastern-mediterranean/malta-adriatic-and-greece/index.html');" href="https://www.vikingcruises.com/oceans/cruise-destinations/eastern-mediterranean/malta-adriatic-and-greece/index.html">
<img class="noborder" src="/logos/viking-cruises.gif" alt="More details for 20 Night New! Malta, The Adriatic & Greece at Viking Cruises">
</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
bold = soup.find_all('b')
print(f"Start date: {bold[0].text}")
print(f"Description: {bold[1].text}")
untagged_line = re.sub("\n", "", soup.find(text=re.compile('To')))
end_date = re.sub(r"\(To (.*)\)", r"\1", untagged_line)
print(f"End date: {end_date}")
ship = soup.find('a', class_='red')
print(f"Ship: {ship.text}")
Output:
Start date: Sat 31 Jul 2021
Description: 20 Night New! Malta, The Adriatic & Greece
End date: Fri 20 Aug 2021
Ship: Viking Sea
I want to scrape basketball results from this webpage:
http://www.nowgoal.group/nba/Schedule.aspx?f=ft2&date=2020-07-29
I created the code using bs4 and requests:
import requests
from bs4 import BeautifulSoup

url = 'http://www.nowgoal.group/nba/Schedule.aspx?f=ft2&date=2020-07-29'
with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
    r = session.get(url, timeout=30)
    soup = BeautifulSoup(r.content, 'html.parser')
The issue I face is how to add the competition name to each row I scrape.
I want to create a table where each row holds the match result (competition, home team, away team, score...).
Selenium
Try this (selenium):
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import time

res = []
url = 'http://www.nowgoal.group/nba/Schedule.aspx?f=ft2&date=2020-07-29'
driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
driver.get(url)
time.sleep(2)
page = driver.page_source
driver.close()

soup = BeautifulSoup(page, 'html.parser')
span = soup.select_one('span#live')
tables = span.select('table')
for table in tables:
    if table.get('class'):
        # a table with a class holds the competition header
        competition = table.select_one('a b font').text
    else:
        # result tables: the rows come in home/away pairs
        for home, away in zip(table.select('tr.b1')[0::2], table.select('tr.b1')[1::2]):
            res.append([f"{competition}",
                        f"{home.select_one('td a').text}",
                        f"{away.select_one('td a').text}",
                        f"{home.select_one('td.red').text}",
                        f"{away.select_one('td.red').text}",
                        f"{home.select_one('td.odds1').text}",
                        f"{away.select_one('td.odds1').text}",
                        f"{home.select('td font')[0].text}/{home.select('td font')[1].text}",
                        f"{away.select('td font')[0].text}/{away.select('td font')[1].text}",
                        f"{home.select('td div a')[-1].get('href')}"])

df = pd.DataFrame(res, columns=['competition',
                                'home',
                                'away',
                                'home score',
                                'away score',
                                'home odds',
                                'away odds',
                                'home ht',
                                'away ht',
                                'odds'])
print(df.to_string())
df.to_csv('Res.csv')
prints:
competition home away home score away score home odds away odds home ht away ht odds
0 National Basketball Association Portland Trail Blazers Oklahoma City Thunder 120 131 2.72 1.45 50/70 63/68 http://data.nowgoal.group/OddsCompBasket/387520.html
1 National Basketball Association Houston Rockets Boston Celtics 137 112 1.49 2.58 77/60 60/52 http://data.nowgoal.group/OddsCompBasket/387521.html
2 National Basketball Association Philadelphia 76ers Dallas Mavericks 115 118 2.04 1.76 39/64 48/55 http://data.nowgoal.group/OddsCompBasket/387522.html
3 Women’s National Basketball Association Connecticut Sun Washington Mystics 89 94 2.28 1.59 52/37 48/46 http://data.nowgoal.group/OddsCompBasket/385886.html
4 Women’s National Basketball Association Chicago Sky Los Angeles Sparks 96 78 2.72 1.43 40/56 36/42 http://data.nowgoal.group/OddsCompBasket/385618.html
5 Women’s National Basketball Association Seattle Storm Minnesota Lynx 90 66 1.21 4.19 41/49 35/31 http://data.nowgoal.group/OddsCompBasket/385884.html
6 Friendly Competition Labas Pasauli LT Balduasenaras 85 78 52/33 31/47 http://data.nowgoal.group/OddsCompBasket/387769.html
7 Friendly Competition BC Vikings Nemuno Banga KK 66 72 29/37 30/42 http://data.nowgoal.group/OddsCompBasket/387771.html
8 Friendly Competition NRG Kiev Hizhaki 51 76 31/20 28/48 http://data.nowgoal.group/OddsCompBasket/387766.html
9 Friendly Competition Finland Estonia 97 76 2.77 1.40 48/49 29/47 http://data.nowgoal.group/OddsCompBasket/387740.html
10 Friendly Competition Synkarb Sk nemenchine 82 79 37/45 38/41 http://data.nowgoal.group/OddsCompBasket/387770.html
and so on....
And saves a Res.csv file.
Requests
Try this (requests):
import pandas as pd
from bs4 import BeautifulSoup
import requests

res = []
url = 'http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date=2020-07-29&timezone=2&kind=0&t=1596143185000'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

items = soup.find_all('h')
for item in items:
    # each <h> element holds one match as '^'-separated fields
    values = item.text.split('^')
    res.append([f'{values[1]}', f'{values[8]}', f'{values[10]}', f'{values[11]}', f'{values[12]}'])

df = pd.DataFrame(res, columns=['competition', 'home', 'away', 'home score', 'away score'])
print(df.to_string())
df.to_csv('Res.csv')
prints:
competition home away home score away score
0 NBA Portland Trail Blazers Oklahoma City Thunder 120 131
1 NBA Houston Rockets Boston Celtics 137 112
2 NBA Philadelphia 76ers Dallas Mavericks 115 118
3 WNBA Connecticut Sun Washington Mystics 89 94
4 WNBA Chicago Sky Los Angeles Sparks 96 78
5 WNBA Seattle Storm Minnesota Lynx 90 66
6 FC Labas Pasauli LT Balduasenaras 85 78
7 FC BC Vikings Nemuno Banga KK 66 72
8 FC NRG Kiev Hizhaki 51 76
And saves a Res.csv file.
If you do not want the index column, you can simply add index=False to df.to_csv('Res.csv'), so it becomes df.to_csv('Res.csv', index=False).
Note on selenium: you need selenium and geckodriver, and in this code geckodriver is loaded from c:/program/geckodriver.exe.
The selenium version is slower, but it avoids having to find the underlying XML endpoint with DevTools.
This page uses JavaScript to load data but requests/BeautifulSoup can't run JavaScript.
So you have two options.
First: you can use Selenium to control a real web browser which can run JavaScript. This can be better when a page uses complex JavaScript code to generate its data - but it is slower, because it has to run a web browser which renders the page and executes the JavaScript.
Second: you can try to use DevTools in Firefox/Chrome (tab Network, filter XHR) to find the URL that JavaScript/AJAX (XHR) uses to get the data from the server, and request that URL with requests. Often you get JSON data which can be converted to a Python list/dictionary, and then you don't need BeautifulSoup to scrape the data. This is faster, but sometimes a page uses JavaScript code which is hard to reproduce in Python.
I choose second method.
I found it reads data from
http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date=2020-07-29&timezone=2&kind=0&t=1596143185000
but it gives XML data so it still needs BeautifulSoup (or lxml) to scrape data.
import requests
from bs4 import BeautifulSoup as BS

url = 'http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date=2020-07-29&timezone=2&kind=0&t=1596143185000'
r = requests.get(url)
soup = BS(r.text, 'html.parser')

all_items = soup.find_all('h')
for item in all_items:
    # every <h> element is one match encoded as '^'-separated fields
    values = item.text.split('^')
    #print(values)
    print(values[8], values[11])   # home team, home score
    print(values[10], values[12])  # away team, away score
    print('---')
Result:
Portland Trail Blazers 120
Oklahoma City Thunder 131
---
Houston Rockets 137
Boston Celtics 112
---
Philadelphia 76ers 115
Dallas Mavericks 118
---
Connecticut Sun 89
Washington Mystics 94
---
Chicago Sky 96
Los Angeles Sparks 78
---
Seattle Storm 90
Minnesota Lynx 66
---
Labas Pasauli LT 85
Balduasenaras 78
---
BC Vikings 66
Nemuno Banga KK 72
---
NRG Kiev 51
Hizhaki 76
---
Finland 97
Estonia 76
---
Synkarb 82
Sk nemenchine 79
---
CS Sfaxien (w) 51
ES Cap Bon (w) 54
---
Police De La Circulation (w) 43
Etoile Sportive Sahel (w) 39
---
CA Bizertin 63
ES Goulette 71
---
JS Manazeh 77
AS Hammamet 53
---
Southern Huskies 84
Canterbury Rams 98
---
Taranaki Mountainairs 99
Franklin Bulls 90
---
Chaophraya Thunder 67
Thai General Equipment 102
---
Airforce Madgoat Basketball Club 60
HiTech Bangkok City 77
---
Bizoni 82
Leningrad 75
---
chameleon 104
Leningrad 80
---
Bizoni 71
Zubuyu 57
---
Drakony 89
chameleon 79
---
Dragoni 71
Zubuyu 87
---
I am trying to web scrape, using Python 3, a chart from this website into a .csv file: 2016 NBA National TV Schedule
The chart starts out like:
Tuesday, October 25
8:00 PM Knicks/Cavaliers TNT
10:30 PM Spurs/Warriors TNT
Wednesday, October 26
8:00 PM Thunder/Sixers ESPN
10:30 PM Rockets/Lakers ESPN
I am using these packages:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
The output I want in the .csv file looks like this:
These are the first six lines of the chart as they should appear in the .csv file. Notice how some dates are used for more than one game. How do I implement the scraper to get this output?
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from itertools import groupby

url = 'https://fansided.com/2016/08/11/nba-schedule-2016-national-tv-games/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

days = 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'

data = soup.select_one('.article-content p:has(br)').get_text(strip=True, separator='|').split('|')

# group the lines: a line containing a weekday starts a new date,
# every following line is a game belonging to that date
dates, last = {}, ''
for v, g in groupby(data, lambda k: any(d in k for d in days)):
    if v:
        last = [*g][0]
        dates[last] = []
    else:
        dates[last].extend([re.findall(r'([\d:]+ [AP]M) (.*?)/(.*?) (.*)', d)[0] for d in g])

all_data = {'Date': [], 'Time': [], 'Team 1': [], 'Team 2': [], 'Network': []}
for k, v in dates.items():
    for time, team1, team2, network in v:
        all_data['Date'].append(k)
        all_data['Time'].append(time)
        all_data['Team 1'].append(team1)
        all_data['Team 2'].append(team2)
        all_data['Network'].append(network)

df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv')
Prints:
Date Time Team 1 Team 2 Network
0 Tuesday, October 25 8:00 PM Knicks Cavaliers TNT
1 Tuesday, October 25 10:30 PM Spurs Warriors TNT
2 Wednesday, October 26 8:00 PM Thunder Sixers ESPN
3 Wednesday, October 26 10:30 PM Rockets Lakers ESPN
4 Thursday, October 27 8:00 PM Celtics Bulls TNT
.. ... ... ... ... ...
159 Saturday, April 8 8:30 PM Clippers Spurs ABC
160 Monday, April 10 8:00 PM Wizards Pistons TNT
161 Monday, April 10 10:30 PM Rockets Clippers TNT
162 Wednesday, April 12 8:00 PM Hawks Pacers ESPN
163 Wednesday, April 12 10:30 PM Pelicans Blazers ESPN
[164 rows x 5 columns]
And saves data.csv.
I am new to Python and am trying to read my Excel file in Spyder (Anaconda). However, when I run it, some rows are missing and replaced with '...'. I have seven columns and 100 rows in my Excel file. The column arrangement is also quite weird.
This is my code:
import pandas as pd

print(" Comparing within 100 Airline \n\n")

def view():
    airlines = pd.ExcelFile('Airline_final.xlsx')
    df1 = pd.read_excel("Airline_final.xlsx", sheet_name=2)
    print("\n\n 1: list of all Airlines \n")
    print(df1)

view()
Here is what I get:
18 #051 Cubana Cuba
19 #003 Aigle Azur France
20 #011 Air Corsica France
21 #012 Air France France
22 #019 Air Mediterranee France
23 #050 Corsair France
24 #072 HOP France
25 #087 Joon France
26 #006 Air Berlin Germany
27 #049 Condor Flugdienst Germany
28 #057 Eurowings Germany
29 #064 Germania Germany
.. ... ... ...
70 #018 Air Mandalay Myanmar
71 #020 Air KBZ Myanmar
72 #067 Golden Myanmar Airlines Myanmar
73 #017 Air Koryo North Korea
74 #080 Jetstar Asia Singapore
75 #036 Binter Canarias Spain
76 #040 Canaryfly Spain
77 #073 Iberia and Iberia Express Spain
To print the whole dataframe, use:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df1)
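If you would rather change the limits for the whole session instead of a single print, you can also set the same pandas display options globally (a small sketch using the standard pd.set_option call):

import pandas as pd

# show every row and column in all subsequent prints
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
print(df1)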