Web scraping with bs4 python: How to display football matchups - python

I'm a beginner to Python and am trying to create a program that will scrape the football/soccer schedule from skysports.com and will send it through SMS to my phone through Twilio. I've excluded the SMS code because I have that figured out, so here's the web scraping code I am getting stuck with so far:
import requests
from bs4 import BeautifulSoup
URL = "https://www.skysports.com/football-fixtures"
page = requests.get(URL)
results = BeautifulSoup(page.content, "html.parser")
d = defaultdict(list)
comp = results.find('h5', {"class": "fixres__header3"})
team1 = results.find('span', {"class": "matches__item-col matches__participant matches__participant--side1"})
date = results.find('span', {"class": "matches__date"})
team2 = results.find('span', {"class": "matches__item-col matches__participant matches__participant--side2"})
for ind in range(len(d)):
    d['comp'].append(comp[ind].text)
    d['team1'].append(team1[ind].text)
    d['date'].append(date[ind].text)
    d['team2'].append(team2[ind].text)

The code below should do the trick:
from bs4 import BeautifulSoup
import requests
a = requests.get('https://www.skysports.com/football-fixtures')
soup = BeautifulSoup(a.text,features="html.parser")
teams = []
for date in soup.find_all(class_="fixres__header2"):  # searching in that date
    for i in soup.find_all(class_="swap-text--bp30")[1:]:  # skips the first one because that's a heading
        teams.append(i.text)
date = soup.find(class_="fixres__header2").text
print(date)
teams = [i.strip('\n') for i in teams]
for x in range(0, len(teams), 2):
    print(teams[x] + " vs " + teams[x+1])
Let me explain what the code does:
All the football team names have this class name: swap-text--bp30.
So we can use find_all to extract every element with that class.
Once we have the results, we create an empty list with "teams = []" and append each name to it in a for loop with "teams.append(i.text)". ".text" returns the element's text without the HTML tags.
Then we strip the "\n" characters from each string in the list and print the strings two at a time, one matchup per line.
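The pairing step described above can be sketched offline. This is a minimal illustration, not the site's real markup: only the class name swap-text--bp30 comes from the answer, while the HTML string and the team names are made up so the example runs without a network request. It uses zip instead of index arithmetic, so an odd number of names does not raise an IndexError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the class name is taken from the answer above.
HTML = """
<div>
  <span class="swap-text--bp30">Fixtures</span>
  <span class="swap-text--bp30">
Arsenal
</span>
  <span class="swap-text--bp30">
Chelsea
</span>
  <span class="swap-text--bp30">
Leeds
</span>
  <span class="swap-text--bp30">
Everton
</span>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Skip the first match (a heading in this sample) and strip surrounding whitespace.
tags = soup.find_all(class_="swap-text--bp30")[1:]
teams = [tag.get_text(strip=True) for tag in tags]

# Pair consecutive names: items 0 and 1, items 2 and 3, and so on.
matchups = [f"{home} vs {away}" for home, away in zip(teams[0::2], teams[1::2])]

for line in matchups:
    print(line)
```

Here get_text(strip=True) replaces the separate strip('\n') pass.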
EDIT: To scrape the title of the leagues we will do pretty much the same:
league = []
for date in soup.find_all(class_="fixres__header2"):  # searching in that date
    for i in soup.find_all(class_="fixres__header3"):  # every league heading
        league.append(i.text)
Strip the array and create another one:
league = [i.strip('\n') for i in league]
final = []
Then add this final bit of code which is essentially just printing the league then the two teams over and over:
for x in range(0, len(teams), 5):
    final.append(teams[x] + " vs " + teams[x+1])
for i in league:
    print(i)
for i in final:
    print(i)
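If the goal is to print each league followed by its own matchups, one option is to walk the elements that follow each league heading until the next heading appears. The sketch below assumes a layout in which headings and fixture blocks are siblings; the wrapper class fixres__body, the block class fixres__item and all team names are assumptions made for illustration, so the real page may need different selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical layout: league headings and fixture blocks are siblings.
HTML = """
<div class="fixres__body">
  <h5 class="fixres__header3">Premier League</h5>
  <div class="fixres__item">
    <span class="swap-text--bp30">Arsenal</span>
    <span class="swap-text--bp30">Chelsea</span>
  </div>
  <h5 class="fixres__header3">Championship</h5>
  <div class="fixres__item">
    <span class="swap-text--bp30">Leeds</span>
    <span class="swap-text--bp30">Derby</span>
  </div>
  <div class="fixres__item">
    <span class="swap-text--bp30">Stoke</span>
    <span class="swap-text--bp30">Hull</span>
  </div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

fixtures = {}
for header in soup.find_all(class_="fixres__header3"):
    league = header.get_text(strip=True)
    fixtures[league] = []
    # Visit the tags after this heading and stop at the next league heading.
    for sibling in header.find_next_siblings(True):
        if "fixres__header3" in sibling.get("class", []):
            break
        names = [t.get_text(strip=True)
                 for t in sibling.find_all(class_="swap-text--bp30")]
        if len(names) == 2:
            fixtures[league].append(f"{names[0]} vs {names[1]}")

for league, games in fixtures.items():
    print(league)
    for game in games:
        print("  " + game)
```

Searching from each heading rather than from soup is what keeps a matchup attached to its league.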

Related

How do I filter a list through a formatted web scraping for loop

I have a list of basketball players that I want to pass through a web scraping for loop I've already set up. The list is the 2011 NBA Draft picks. I want to loop through each player and get their college stats from their final year in college. The problem is that some drafted players did not go to college, so there is no page at the URL built from their name, and passing in even one such player makes the whole script fail with an error. I have tried including "pass" and "continue", but nothing seems to work. This is the closest I've gotten so far:
from bs4 import BeautifulSoup
import requests
import pandas as pd
headers = {'User Agent':'Mozilla/5.0'}
players = [
'kyrie-irving','derrick-williams','enes-kanter',
'tristan-thompson','jonas-valanciunas','jan-vesely',
'bismack-biyombo','brandon-knight','kemba-walker',
'jimmer-fredette','klay-thompson'
]
#the full list of players goes on for a total of 60 players, this is just the first handful
player_stats = []
for player in players:
    url = (f'https://www.sports-reference.com/cbb/players/{player}-1.html')
    res = requests.get(url)
    #if player in url:
    #continue
    #else:
    #print("This player has no college stats")
    #Including this if else statement makes the error say header is not defined. When not included, the error says NoneType object is not iterable
    soup = BeautifulSoup(res.content, 'lxml')
    header = [th.getText() for th in soup.findAll('tr', limit = 2)[0].findAll('th')]
    rows = soup.findAll('tr')
    player_stats.append([td.getText() for td in soup.find('tr', id ='players_per_game.2011')])
player_stats
graph = pd.DataFrame(player_stats, columns = header)
You can do one of two things:
1. Check the response status code. 200 is a successful response; 4xx and 5xx codes indicate errors. The problem is that some sites return a valid HTML page that says "invalid page", so you can still get a 200 response for a player who has no stats.
2. Just use try/except. If parsing fails, continue to the next item in the list.
Because of that issue with option 1, go with option 2 here. Also, have you considered using pandas to parse the table? It's a little easier (and it can use BeautifulSoup under the hood).
Lastly, you're going to need a little more logic here. There are multiple college players named "Derrick Williams". I suspect you don't mean https://www.sports-reference.com/cbb/players/derrick-williams-1.html, so you will need to work out how to pick the right page for each player.
from bs4 import BeautifulSoup
import requests
import pandas as pd
headers = {'User-Agent':'Mozilla/5.0'}  # the header name needs a hyphen
players = [
    'kyrie-irving','derrick-williams','enes-kanter',
    'tristan-thompson','jonas-valanciunas','jan-vesely',
    'bismack-biyombo','brandon-knight','kemba-walker',
    'jimmer-fredette','klay-thompson'
]
#the full list of players goes on for a total of 60 players, this is just the first handful
player_stats = []
for player in players:
    url = (f'https://www.sports-reference.com/cbb/players/{player}-1.html')
    res = requests.get(url, headers=headers)
    try:
        soup = BeautifulSoup(res.content, 'lxml')
        header = [th.getText() for th in soup.findAll('tr', limit = 2)[0].findAll('th')]
        rows = soup.findAll('tr')
        # soup.find returns None when the row is missing, which raises TypeError here
        player_stats.append([td.getText() for td in soup.find('tr', id ='players_per_game.2011')])
    except Exception:
        print("%s has no college stats" %player)
graph = pd.DataFrame(player_stats, columns = header)
With Pandas:
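rows = []
for player in players:
    try:
        url = (f'https://www.sports-reference.com/cbb/players/{player}-1.html')
        df = pd.read_html(url)[0]
        row = df.iloc[-2].copy()  # second-to-last row of the first table
        row['Player'] = player
        rows.append(row)
    except Exception:
        print("%s has no college stats" %player)
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
graph = pd.DataFrame(rows).reset_index(drop=True)
if not graph.empty:
    cols = [c for c in graph.columns if c != 'Player']
    graph = graph[['Player'] + cols]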
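As an alternative to a broad try/except, the missing row can be tested for directly, since soup.find returns None when nothing matches. The sketch below is an offline illustration under stated assumptions: the PAGES dictionary stands in for the downloaded pages, the helper name season_row is hypothetical, and the numbers in the sample row are made up. Only the row id players_per_game.2011 comes from the question.

```python
from bs4 import BeautifulSoup

# Stand-in for downloaded pages; the stat values are invented for illustration.
PAGES = {
    "kyrie-irving": (
        '<table><tr id="players_per_game.2011">'
        "<th>2010-11</th><td>11</td><td>17.5</td></tr></table>"
    ),
    "enes-kanter": "<html><body><h1>Page Not Found</h1></body></html>",
}


def season_row(html, row_id="players_per_game.2011"):
    """Return the cell texts of the requested row, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    row = soup.find("tr", id=row_id)
    if row is None:
        return None
    return [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]


stats = {}
missing = []
for player, html in PAGES.items():
    row = season_row(html)
    if row is None:
        missing.append(player)  # no college stats row on this page
    else:
        stats[player] = row

print(stats)
print("no college stats:", missing)
```

An explicit None check only skips the case that is expected, so unrelated failures such as a network error still surface instead of being reported as "no college stats".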

Extracting similar items from a website with beautiful soup

I'm trying to scrape a website's ratings. I want to get each individual rating and its date. However, I only get one result in my list, although there should be several.
Am I doing something wrong in the for loop?
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time
url = "https://www.kununu.com/de/heidelpay/kommentare"
while url != " ":
    print(url)
    time.sleep(15)
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    print(r.status_code)
    soup = BeautifulSoup(r.text, "html.parser")
    #print(soup.prettify())
    #Get overall score of the company
    score_avg = soup.find("span", class_="index__aggregationValue__32exy").text
    print(score_avg)
    #get individual scores and dates of the company
    rating_list = []
    for box in soup.find_all(".index__rating__3nC2L"):
        score_ind = box.select(".index__score__16yy9").text
        date = select(".index__date__eIOxr").text
        rating = [score_ind, date]
    rating_list.append(rating)
    print(rating_list)
3,3
[['5,0', 'Januar 2017']]
Many thanks in advance!
It looks like you aren't appending the rating to rating_list until after the loop has finished. Is the printed rating perchance the very last one?
Move the append inside your loop, like so:
for box in soup.select(".index__rating__3nC2L"):  # select() takes CSS selectors, find_all() does not
    score_ind = box.select_one(".index__score__16yy9").text  # select_one() returns a single tag
    date = box.select_one(".index__date__eIOxr").text
    rating = [score_ind, date]
    rating_list.append(rating)
The problem is that rating_list.append(rating) sits outside the for loop, so only the last rating is appended. What you have to do is this:
for box in soup.select(".index__rating__3nC2L"):
    score_ind = box.select_one(".index__score__16yy9").text
    date = box.select_one(".index__date__eIOxr").text
    rating = [score_ind, date]
    rating_list.append(rating)
This way you append each rating on every iteration of the for loop. Hope this helps.
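The effect of appending inside the loop can be checked offline. In the sketch below the class names come from the question, while the HTML string is an assumption and the second rating is invented so there is more than one box to collect. It also skips boxes that lack a score or date instead of failing on None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup using the class names from the question.
HTML = """
<div class="index__rating__3nC2L">
  <span class="index__score__16yy9">5,0</span>
  <span class="index__date__eIOxr">Januar 2017</span>
</div>
<div class="index__rating__3nC2L">
  <span class="index__score__16yy9">2,4</span>
  <span class="index__date__eIOxr">März 2016</span>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

rating_list = []
for box in soup.select(".index__rating__3nC2L"):
    score = box.select_one(".index__score__16yy9")
    date = box.select_one(".index__date__eIOxr")
    if score is None or date is None:
        continue  # incomplete box, nothing to record
    # The append is inside the loop, so every box contributes one entry.
    rating_list.append([score.get_text(strip=True), date.get_text(strip=True)])

print(rating_list)
```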

BeautifulSoup - Scrape multiple pages

I want to scrape the names of the members from each page, then move on to the next page and do the same. My code works for only one page. I'm very new to this, so any advice would be appreciated. Thank you.
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.bodia.com/spa-members/page/1")
soup = BeautifulSoup(r.text,"html.parser")
lights = soup.findAll("span",{"class":"light"})
lights_list = []
for l in lights[0:]:
    result = l.text.strip()
    lights_list.append(result)
print (lights_list)
I tried this and it only gives me the members of page 3.
for i in range (1,4): #to scrape names of page 1 to 3
    r = requests.get("https://www.bodia.com/spa-members/page/"+ format(i))
soup = BeautifulSoup(r.text,"html.parser")
lights = soup.findAll("span",{"class":"light"})
lights_list = []
for l in lights[0:]:
    result = l.text.strip()
    lights_list.append(result)
print (lights_list)
Then I tried this :
i = 1
while i<5:
    r = requests.get("https://www.bodia.com/spa-members/page/"+str(i))
    i+=1
    soup = BeautifulSoup(r.text,"html.parser")
    lights = soup.findAll("span",{"class":"light"})
    lights_list = []
    for l in lights[0:]:
        result = l.text.strip()
        lights_list.append(result)
    print (lights_list)
It gives me the names of 4 members, but I don't know which page each one comes from:
['Seng Putheary (Nana)']
['Marco Julia']
['Simon']
['Ms Anne Guerineau']
Just two changes are needed to get it to scrape everything.
1. r = requests.get("https://www.bodia.com/spa-members/page/"+ format(i)) can be written as r = requests.get("https://www.bodia.com/spa-members/page/{}".format(i)). The built-in format(i) already returns the page number as a string, so the concatenation does build a valid URL, but str.format is the more conventional way to write it.
2. You were not looping over all the code: only the request was inside the loop, so only the last response was parsed and printed. Indenting everything under the for loop fixes that.
import requests
from bs4 import BeautifulSoup
for i in range (1,4): #to scrape names of page 1 to 3
    r = requests.get("https://www.bodia.com/spa-members/page/{}".format(i))
    soup = BeautifulSoup(r.text,"html.parser")
    lights = soup.findAll("span",{"class":"light"})
    lights_list = []
    for l in lights[0:]:
        result = l.text.strip()
        lights_list.append(result)
    print(lights_list)
The above code was spitting out a list of names every 3 seconds for the pages it scraped.
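To keep the names from every page and also know which page each came from, the result containers can be created once, before the loop, and filled per page. The sketch below runs offline: fetch_page and FAKE_PAGES are hypothetical stand-ins for requests.get(...).text, and the member names are placeholders. Only the span with class "light" comes from the question.

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded pages so the sketch runs without a network call.
FAKE_PAGES = {
    1: '<span class="light"> Member A </span><span class="light">Member B</span>',
    2: '<span class="light">Member C</span>',
    3: '<span class="light">Member D</span>',
}


def fetch_page(page_number):
    """Hypothetical replacement for requests.get(url).text."""
    return FAKE_PAGES[page_number]


names_by_page = {}  # page number -> names found on that page
all_names = []      # created once, outside the loop, so it is never reset

for page_number in range(1, 4):
    soup = BeautifulSoup(fetch_page(page_number), "html.parser")
    names = [span.get_text(strip=True)
             for span in soup.find_all("span", {"class": "light"})]
    names_by_page[page_number] = names
    all_names.extend(names)

print(names_by_page)
print(all_names)
```

With a real site, fetch_page would wrap the requests.get call from the answer, and the rest of the loop would stay the same.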

Python Beautiful Soup not looping results

I'm using BS4 for the first time and need to scrape the items from an online catalogue to CSV.
I have set up my code, but when I run it the results only repeat the first item in the catalogue n times (where n is the number of items).
Can someone review my code and let me know where I am going wrong?
Thanks
import requests
from bs4 import BeautifulSoup
from csv import writer
#response = requests.get('https://my.supplychain.nhs.uk/Catalogue/browse/27/anaesthetic-oxygen-and-resuscitation?CoreListRequest=BrowseCoreList')
response = requests.get('https://my.supplychain.nhs.uk/Catalogue/browse/32/nhs-cat?LastCartId=&LastFavouriteId=&CoreListRequest=BrowseAll')
soup = BeautifulSoup(response.text , 'html.parser')
items = soup.find_all(class_='productPrevDetails')
#print(items)
for item in items:
    ItemCode = soup.find(class_='product_npc ').get_text().replace('\n','')
    ItemNameS = soup.select('p')[58].get_text()
    ProductInfo = soup.find(class_='product_key_info').get_text()
    print(ItemCode,ItemNameS,ProductInfo)
You always see the first result because you are searching soup (the whole page), not item (the current product). Try:
for item in items:
    ItemCode = item.find(class_='product_npc ').get_text().replace('\n','')
    # the index now counts <p> tags inside this item only, so 58 will probably need adjusting
    ItemNameS = item.select('p')[58].get_text()
    ProductInfo = item.find(class_='product_key_info').get_text()
    print(ItemCode,ItemNameS,ProductInfo)
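The difference between searching the whole document and searching the current item can be seen on a small sample. The class names below follow the question, while the HTML string and the product codes are assumptions made up for the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical catalogue fragment with two products.
HTML = """
<div class="productPrevDetails">
  <span class="product_npc">ABC001</span>
  <p class="product_key_info">Box of 10</p>
</div>
<div class="productPrevDetails">
  <span class="product_npc">ABC002</span>
  <p class="product_key_info">Box of 50</p>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
items = soup.find_all(class_="productPrevDetails")

# Searching soup always starts at the top of the page: first product every time.
from_soup = [soup.find(class_="product_npc").get_text(strip=True) for _ in items]

# Searching item stays inside the current product.
from_item = [item.find(class_="product_npc").get_text(strip=True) for item in items]

print(from_soup)
print(from_item)
```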

Parsing error with Beautiful Soup 4 and Python

I need to get the list of the rooms from this website: http://www.studentroom.ch/en/dynasite.cfm?dsmid=106547
I'm using Beautiful Soup 4 in order to parse the page.
This is the code I have written so far:
from bs4 import BeautifulSoup
import urllib
pageFile = urllib.urlopen("http://studentroom.ch/dynasite.cfm?dsmid=106547")
pageHtml = pageFile.read()
pageFile.close()
soup = BeautifulSoup("".join(pageHtml))
roomsNoFilter = soup.find('div', {"id": "ImmoListe"})
rooms = roomsNoFilter.table.find_all('tr', recursive=False)
for room in rooms:
    print room
    print "----------------"
print len(rooms)
For now I'm trying to get only the rows of the table.
But I get only 7 rows instead of 78 (or 77).
At first I thought that I was receiving only partial HTML, but I printed the whole HTML and I'm receiving it correctly.
There are no AJAX calls that load new rows after the page has loaded.
Could someone please help me find the error?
This works for me:
soup = BeautifulSoup(pageHtml)
div = soup.select('#ImmoListe')[0]
table = div.select('table > tbody')[0]
k = 0
for room in table.find_all('tr'):
    if 'onmouseout' in str(room):
        print room
        k = k + 1
print "Total ",k
Let me know how it goes.
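The string test on 'onmouseout' can also be expressed as an attribute filter, which matches only rows that actually carry that attribute. The sketch below is written for Python 3, unlike the Python 2 code above, and uses a made-up table: only the ImmoListe id and the onmouseout attribute come from this thread, and the row contents are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical table: room rows carry an onmouseout attribute, other rows do not.
HTML = """
<div id="ImmoListe"><table>
<tr><th>Room</th></tr>
<tr onmouseout="x()"><td>Room 1</td></tr>
<tr onmouseout="x()"><td>Room 2</td></tr>
<tr><td>spacer</td></tr>
</table></div>
"""

soup = BeautifulSoup(HTML, "html.parser")
container = soup.find("div", id="ImmoListe")

# attrs={"onmouseout": True} matches any <tr> that has the attribute, whatever its value.
rooms = container.find_all("tr", attrs={"onmouseout": True})
room_names = [row.get_text(strip=True) for row in rooms]

print(room_names)
print("Total", len(rooms))
```

Because find_all searches all descendants by default, this does not depend on whether the parser inserts a tbody element.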
