Error while scraping: "Expecting value: line 1 column 1 (char 0)" - python

I am scraping reviews from the Rotten Tomatoes website (link to the page) using the following code:
import requests
import re
import json
import pandas as pd
import numpy as np

r = requests.get("https://www.rottentomatoes.com/m/avatar/reviews?type=user")
content = json.loads(re.search('movieReview\s=\s(.*);', r.text).group(1))
movieId = content["movieId"]

def getReviews(endCursor):
    r = requests.get(f"https://www.rottentomatoes.com/napi/movie/{movieId}/reviews/user",
                     params={
                         "direction": "next",
                         "endCursor": endCursor,
                         "startCursor": ""
                     })
    return r.json()

data = {"User_Name": [], "Rating": [], "Review": []}
result = {}
for i in range(0, 5):
    # print(f"[{i}] request review")
    result = getReviews(result["pageInfo"]["endCursor"] if i != 0 else "")
    data['User_Name'].extend(t['displayName'] for t in result["reviews"])
    data['Rating'].extend(t['score'] for t in result["reviews"])
    data['Review'].extend(t['review'] for t in result["reviews"])

df = pd.DataFrame(data)
I want to convert the above code into a separate function. Below is the code where I have tried to do that, but it raises an error from json.loads():
"Expecting value: line 1 column 1 (char 0)"
I googled this error and found that adding a headers parameter should solve it, but that is not working here. I am not able to understand what is causing this error. It would be helpful if someone could guide me.
import requests
import re
import json
import pandas as pd
import numpy as np

def getReviews(movieId, endCursor):
    r = requests.get(f"https://www.rottentomatoes.com/napi/{movieId}/reviews/user",
                     params={
                         "direction": "next",
                         "endCursor": endCursor,
                         "startCursor": ""
                     },
                     headers={'Content-Type': 'application/json'})
    return r.json()

def ScrapeReviews(movie):
    url = "https://www.rottentomatoes.com/m/" + movie + "/reviews?type=user"
    req = requests.get(url)
    content = json.loads(re.search('movieReview\s=\s(.*);', req.text).group(1))
    movie_id = content["movieId"]
    data = {"User_Name": [], "Rating": [], "Review": []}
    result = {}
    for i in range(0, 5):
        # print(f"[{i}] request review")
        result = getReviews(movie_id, result["pageInfo"]["endCursor"] if i != 0 else "")
        data['User_Name'].extend(t['displayName'] for t in result["reviews"])
        data['Rating'].extend(t['score'] for t in result["reviews"])
        data['Review'].extend(t['review'] for t in result["reviews"])
    df = pd.DataFrame(data)
    return df

d = ScrapeReviews('avatar')

The error is in the getReviews function: the URL is missing the movie path segment and should be:
"https://www.rottentomatoes.com/napi/movie/{movieId}/reviews/user"
import requests
import re
import json
import pandas as pd
import numpy as np

def getReviews(movieId, endCursor):
    r = requests.get(
        f"https://www.rottentomatoes.com/napi/movie/{movieId}/reviews/user",
        params={"direction": "next", "endCursor": endCursor, "startCursor": ""},
        headers={"Content-Type": "application/json"},
    )
    return r.json()

def ScrapeReviews(movie):
    url = "https://www.rottentomatoes.com/m/" + movie + "/reviews?type=user"
    req = requests.get(url)
    content = json.loads(re.search("movieReview\s=\s(.*);", req.text).group(1))
    movie_id = content["movieId"]
    data = {"User_Name": [], "Rating": [], "Review": []}
    result = {}
    for i in range(0, 5):
        result = getReviews(
            movie_id, result["pageInfo"]["endCursor"] if i != 0 else ""
        )
        data["User_Name"].extend(t["displayName"] for t in result["reviews"])
        data["Rating"].extend(t["score"] for t in result["reviews"])
        data["Review"].extend(t["review"] for t in result["reviews"])
    df = pd.DataFrame(data)
    return df

d = ScrapeReviews("avatar")
print(d)
Prints:
User_Name Rating Review
0 Joe D 5.0 To me this is the most perfect blockbuster of all.\nLove Sam Worthington's empty cup, I find his everyman acting compelling, Saldana may be the most beautiful woman on the planet with her trademark perfect posture, and Sigourney adds class with extra to spare wherever she goes.\nThe planet Pandora remains the real star, and the revelation that we're the bad guys and the spiritual tree-huggers were right all along, I find genuinely touching every time.\nFirst class and I can't wait for more of Cameron's magic touch.
1 Jimmy W 1.0 The fact that this movie can make the most money of all time and also gain a following of hive-minded morons to defend it says more about the state of society than it does the movie itself. For a movie that's meant to make a point about abusive use of the environment, they sure seem to indulge in the use of massive amounts of expensive technology that no doubt utilized way more than its fair share of natural resources. Oh well, at least you can pretend to be vindicated by the box office numbers.
2 Goudkuil E 1.5 Apart from the visuals everything feels uninspired and thrown together using a old cliche of an outsider seeing what's wrong with what he's people have been doing falling in love then whiching sides. The acting is ok, the dialogue is kinda rough. The movie is padded with a lot of nice scenique views with no real narrative meaning.
3 Antonio D 4.0 o filme possui uma fotografia muito bela e, mesmo o filme sendo de 2009 não conseguimos encontrar defeitos em relação a montagem e fotografia, a história é satisfatória e é um reflexo do que sabemos que aconteceu no inicio da colonização
...and so on.
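As a side note, "Expecting value: line 1 column 1 (char 0)" from r.json() (or json.loads) generally means the response body is not JSON at all; with the wrong /napi/ path the server most likely returns an HTML error page. A small, hypothetical check you can run before parsing:

import requests

# Hypothetical debugging snippet: hit the broken endpoint directly and inspect
# what actually comes back before trying to parse it as JSON.
movie_id = "some-movie-id"  # placeholder; in the code above this is content["movieId"]
r = requests.get(f"https://www.rottentomatoes.com/napi/{movie_id}/reviews/user")
print(r.status_code)                  # likely not 200 for the wrong path
print(r.headers.get("Content-Type"))  # likely text/html rather than application/json
print(r.text[:200])                   # body starts with HTML, which is why .json() fails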

Related

From Python to Excel - Building an excel worksheet

With the help of some very kind people on here I finally got a working script to scrape some data. I now desire to transfer this data from Python to Excel, in a specific format. I have tried multiple approaches, but did not manage to get the desired result.
My script is the following:
import requests
from bs4 import BeautifulSoup

def analyze(i):
    url = f"https://ktarena.com/fr/207-dofus-world-cup/match/{i}/1"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    arena = soup.find("span", attrs=('name')).text
    title = soup.select_one("[class='team'] .name a").text
    point = soup.select(".result .points")[0].text
    image_titles = ', '.join([i['title'] for i in soup.select("[class$='dead'] > img")])
    title_ano = soup.select("[class='team'] .name a")[1].text
    point_ano = soup.select(".result .points")[1].text
    image_titles_ano = ', '.join([i['title'] for i in soup.select("[class='class'] > img")])
    print((title, point, image_titles), (title_ano, point_ano, image_titles_ano), arena)

for i in range(46270, 46394):
    analyze(i)
To summarize, I scrape a couple of things:
Team names (title & title_ano)
Image titles (image_titles & image_titles_ano)
Team points (point & point_ano)
A string of text (arena)
One line of output currently looks like this:
('Thunder', '0 pts', 'roublard, huppermage, ecaflip') ('Tweaps', '60 pts', 'steamer, feca, sacrieur') A10
My goal is to transfer this output to excel, making it look like this:
To clarify, in terms of the variables I have it would be this:
Currently I can manage to transfer my data to excel, but I can't figure out how to format my data this way. Any help would be greatly appreciated :)
First of all, the code that you are using is not actually wholly correct. E.g.:
analyze(46275)
(('Grind', '10 pts', 'roublard, ecaflip'),
('SOLARY', '50 pts', 'enutrof, eniripsa, steamer, eliotrope'), 'A10')
Notice that the first player only has two image titles, and the second one has four. This is incorrect, and happens because your code assumes that img tags with a class ending in "dead" belong to the first player, and the ones that have a class named "class" belong to the second. This happens to be true for your first match (i.e. https://ktarena.com/fr/207-dofus-world-cup/match/46270), but very often it is not true at all. E.g. if I compare my result below with the same method applied to your analyze function, I end up with mismatches in 118 rows out of 248.
Here's a suggested rewrite:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def analyze_new(i):
    # You don't need `/1` at the end of the url
    url = f"https://ktarena.com/fr/207-dofus-world-cup/match/{i}"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    arena = soup.find('span', class_='name').get_text()
    # find all teams, and look for info inside each team
    teams = soup.findAll('div', class_='team')
    my_teams = [tuple()]*2
    for idx, team in enumerate(teams):
        my_teams[idx] = my_teams[idx] + \
            (team.select(".name a")[0].get_text(),)
        my_teams[idx] = my_teams[idx] + \
            (soup.select(".result .points")[idx].get_text(),)
        my_teams[idx] = my_teams[idx] + \
            (', '.join([img['title'] for img in team.findAll('img')[1:]]),)
    # notice, we need `return` instead of `print` to use the data
    return *my_teams, arena

print(analyze_new(46275))
(('Grind', '10 pts', 'roublard, ecaflip, enutrof'),
('SOLARY', '50 pts', 'eniripsa, steamer, eliotrope'), 'A10')
Before writing this data to excel, I would create a pd.DataFrame, which can then be exported very easily:
# capture info per player in a single row
rows = []
for i in range(46270, 46394):
    one, two, arena = analyze_new(i)
    # adding `i` to rows, as "Match" seems like a useful `column` to have!
    # but if not, you can delete `i` here below (N.B. do NOT delete the COMMA!)
    # and cut 'Match' twice below
    rows.append(one + (arena, i))
    rows.append(two + (arena, i))

cols = ['Team', 'Points', 'Images', 'Arena', 'Match']

# create df
df = pd.DataFrame(data=rows, columns=cols)

# split up the images strings in `df.Images` and make new columns for them
# finally, drop the `df.Images` column itself
df = pd.concat([df,
                df.Images.str.split(',', expand=True)
                  .rename(columns={i: f'Image Title {i+1}' for i in range(3)})],
               axis=1).drop('Images', axis=1)

# Strip " pts" from the strings in `df.Points` and convert the type to an `int`
df['Points'] = df.Points.str.replace(' pts', '').astype(int)

# Re-order the columns
df = df.loc[:, ['Match', 'Arena', 'Team', 'Image Title 1', 'Image Title 2',
                'Image Title 3', 'Points']]

print(df.head())
Match Arena Team Image Title 1 Image Title 2 Image Title 3 Points
0 46270 A10 Thunder roublard huppermage ecaflip 0
1 46270 A10 Tweaps steamer feca sacrieur 60
2 46271 A10 Shadow Zoo feca osamodas ouginak 0
3 46271 A10 UndisClosed eniripsa sram pandawa 60
4 46272 A10 Laugh Tale osamodas ecaflip iop 0
# Finally, write the `df` to an Excel file
df.to_excel('fname.xlsx')
Result:
If you dislike the default styles added to the header row and index column, you can write the file without them like so:
df.T.reset_index().T.to_excel('test.xlsx', index=False, header=False)
Result:
Incidentally, I assume you have a particular reason for wanting the function to return the relevant data as *my_teams,arena. If not, it would be better to let the function itself do most of the heavy lifting. E.g. we could write something like this, and return a df directly.
def analyze_dict(i):
    url = f"https://ktarena.com/fr/207-dofus-world-cup/match/{i}"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    d = {'Match': [i]*2,
         'Arena': [soup.find('span', class_='name').get_text()]*2,
         'Team': [],
         'Image Title 1': [],
         'Image Title 2': [],
         'Image Title 3': [],
         'Points': [],
         }
    teams = soup.findAll('div', class_='team')
    for idx, team in enumerate(teams):
        d['Team'].append(team.select(".name a")[0].get_text())
        d['Points'].append(int(soup.select(".result .points")[idx].get_text().split(' ')[0]))
        for img_idx, img in enumerate(team.findAll('img')[1:]):
            d[f'Image Title {img_idx+1}'].append(img['title'])
    return pd.DataFrame(d)

print(analyze_dict(46275))
Match Arena Team Image Title 1 Image Title 2 Image Title 3 Points
0 46275 A10 Grind roublard ecaflip enutrof 10
1 46275 A10 SOLARY eniripsa steamer eliotrope 50
Now, we only need to do the following outside the function:
dfs = []
for i in range(46270, 46394):
    dfs.append(analyze_dict(i))

df = pd.concat(dfs, axis=0, ignore_index=True)
print(df.head())
Match Arena Team Image Title 1 Image Title 2 Image Title 3 Points
0 46270 A10 Thunder roublard huppermage ecaflip 0
1 46270 A10 Tweaps steamer feca sacrieur 60
2 46271 A10 Shadow Zoo feca osamodas ouginak 0
3 46271 A10 UndisClosed eniripsa sram pandawa 60
4 46272 A10 Laugh Tale osamodas ecaflip iop 0
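From there, writing the combined DataFrame to Excel is the same call as before:

df.to_excel('fname.xlsx')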
With hardly any changes from your post, you can use the openpyxl library to write the output to an excel file as shown below:
import requests
from openpyxl import Workbook
from bs4 import BeautifulSoup

def analyze(i):
    url = f"https://ktarena.com/fr/207-dofus-world-cup/match/{i}/1"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    arena = soup.find("span", attrs=('name')).text

    title = soup.select_one("[class='team'] .name a").text
    point = soup.select(".result .points")[0].text
    image_titles = [i['title'] for i in soup.select("[class='team']:nth-of-type(1) [class^='class'] > img")]
    try:
        image_title_one = image_titles[0]
    except IndexError:
        image_title_one = ""
    try:
        image_title_two = image_titles[1]
    except IndexError:
        image_title_two = ""
    try:
        image_title_three = image_titles[2]
    except IndexError:
        image_title_three = ""
    ws.append([arena, title, image_title_one, image_title_two, image_title_three, point])

    title_ano = soup.select("[class='team'] .name a")[1].text
    point_ano = soup.select(".result .points")[1].text
    image_titles_ano = [i['title'] for i in soup.select("[class='team']:nth-of-type(2) [class^='class'] > img")]
    try:
        image_title_ano_one = image_titles_ano[0]
    except IndexError:
        image_title_ano_one = ""
    try:
        image_title_ano_two = image_titles_ano[1]
    except IndexError:
        image_title_ano_two = ""
    try:
        image_title_ano_three = image_titles_ano[2]
    except IndexError:
        image_title_ano_three = ""
    ws.append([arena, title_ano, image_title_ano_one, image_title_ano_two, image_title_ano_three, point_ano])

    print((title, point, image_titles), (title_ano, point_ano, image_titles_ano), arena)

if __name__ == '__main__':
    wb = Workbook()
    wb.remove(wb['Sheet'])
    ws = wb.create_sheet("result")
    ws.append(['Arena', 'Team', 'Image Title 1', 'Image Title 2', 'Image Title 3', 'Points'])
    for i in range(46270, 46290):
        analyze(i)
    wb.save("output.xlsx")
I've fixed the selectors to grab the right number of image titles.

Scraping beach volleyball data on multiple pages

I am trying to scrape all the possible data from this webpage Gstaad 2017
Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
from selenium import webdriver  # needed for webdriver.Chrome below
from selenium.webdriver.support.ui import Select

# Starts the driver and goes to our starting webpage
driver = webdriver.Chrome("C:/Users/aldi/Downloads/chromedriver.exe")
driver.get('http://www.bvbinfo.com/Tournament.asp?ID=3294&Process=Matches')

# Imports HTML into python
page = requests.get('http://www.bvbinfo.com/Tournament.asp?ID=3294&Process=Matches')
soup = BeautifulSoup(driver.page_source, 'lxml')

stages = soup.find_all('div')
stages = driver.find_elements_by_class_name('clsTournBracketHeader')[-1].text

# TODO the first row (country quota matches) has no p tag and therefore it is not included in the data
rows = []
paragraphs = []
empty_paragraphs = []
for x in soup.find_all('p'):
    if len(x.get_text(strip=True)) != 0:
        paragraph = x.extract()
        paragraphs.append(paragraph)
    if len(x.get_text(strip=True)) == 0:
        empty_paragraph = x.extract()
        empty_paragraphs.append(empty_paragraph)

# players
home_team_player_1 = ''
home_team_player_2 = ''
away_team_player_1 = ''
away_team_player_2 = ''

for i in range(0, len(paragraphs)):
    # round and stage of the competition
    round_n = paragraphs[i].find('u').text
    paragraph_rows = paragraphs[i].text.split('\n')[1:-1]
    counter = 0
    for j in range(0, len(paragraph_rows)):
        # TODO tournament info, these can vary from tournament to tournament
        tournament_info = soup.find('td', class_='clsTournHeader').text.strip().split()
        tournament_category = [' '.join(tournament_info[0:2])][0]
        tournament_prize_money = tournament_info[2]
        # TODO tournament city can also have two elements, not just one
        tournament_city = tournament_info[3]
        tournament_year = tournament_info[-1]
        tournament_days = tournament_info[-2][:-1].split("-")
        tournament_starting_day = tournament_days[0]
        tournament_ending_day = tournament_days[-1]
        tournament_month = tournament_info[-3]
        tournament_stars = [' '.join(tournament_info[5:7])][0]

        players = paragraphs[i].find_all('a', {'href': re.compile('.*player.*')})
        home_team_player_1 = players[counter+0].text
        home_team_player_2 = players[counter+1].text
        away_team_player_1 = players[counter+2].text
        away_team_player_2 = players[counter+3].text

        # matches
        match = paragraph_rows[j].split(":")[0].split()[-1].strip()

        # nationalities
        nationalities = ["United", "States"]
        if paragraph_rows[j].split("def.")[0].split("/")[1].split("(")[0].split(" ")[3] in nationalities:
            home_team_country = "United States"
        else:
            home_team_country = paragraph_rows[j].split("def.")[0].split("/")[1].split("(")[0].split(" ")[-2]
        if paragraph_rows[j].split("def.")[1].split("/")[1].split(" ")[3] in nationalities:
            away_team_country = "United States"
        else:
            away_team_country = paragraph_rows[j].split("def.")[1].split("/")[1].split("(")[0].split(" ")[-2]

        parentheses = re.findall(r'\(.*?\)', paragraph_rows[j])
        if "," in parentheses[0]:
            home_team_ranking = parentheses[0].split(",")[0]
            home_team_ranking = home_team_ranking[1:-1]
            home_team_qualification_round = parentheses[0].split(",")[1]
            home_team_qualification_round = home_team_qualification_round[1:-1]
        else:
            home_team_ranking = parentheses[0].split(",")[0]
            home_team_ranking = home_team_ranking[1:-1]
            home_team_qualification_round = None
        if "," in parentheses[1]:
            away_team_ranking = parentheses[1].split(",")[0]
            away_team_ranking = away_team_ranking[1:-1]
            away_team_qualification_round = parentheses[1].split(",")[1]
            away_team_qualification_round = away_team_qualification_round[1:-1]
        else:
            away_team_ranking = parentheses[1].split(",")[0]
            away_team_ranking = away_team_ranking[1:-1]
            match_duration = parentheses[2]
            match_duration = match_duration[1:-1]
            away_team_qualification_round = None

        # sets
        sets = re.findall(r'\).*?\(', paragraph_rows[j])
        sets = sets[1][1:-1]
        if len(sets.split(",")) == 2:
            score_set1 = sets.split(",")[0]
            score_set2 = sets.split(",")[1]
            score_set3 = None
        if len(sets.split(",")) == 3:
            score_set1 = sets.split(",")[0]
            score_set2 = sets.split(",")[1]
            score_set3 = sets.split(",")[2]

        row = {" home_team_player_1 ": home_team_player_1,
               " home_team_player_2": home_team_player_2,
               "away_team_player_1": away_team_player_1,
               "away_team_player_2": away_team_player_1,
               "match": match,
               "home_team_country": home_team_country,
               "away_team_country": away_team_country,
               "home_team_ranking": home_team_ranking,
               "away_team_ranking": away_team_ranking,
               "match_duration": match_duration,
               "home_team_qualification_round": home_team_qualification_round,
               "away_team_qualification_round": away_team_qualification_round,
               "score_set1": score_set1,
               "score_set2": score_set2,
               "score_set3": score_set3,
               "tournament_category": tournament_category,
               "tournament_prize_money": tournament_prize_money,
               "tournament_city": tournament_city,
               "tournament_year": tournament_year,
               "tournament_starting_day": tournament_starting_day,
               "tournament_ending_day": tournament_ending_day,
               "tournament_month": tournament_month,
               "tournament_stars": tournament_stars,
               "round_n": round_n
               }
        counter += 4
        rows.append(row)

data = pd.DataFrame(rows)
data.to_csv("beachvb.csv", index=False)
I am not really experienced in web scraping. I have just started teaching myself and find the HTML source code quite messy and poorly structured.
I want to improve my code in two ways:
Include all the missing matches (country quota matches, semifinals, bronze medal, and gold medal) and the respective category for each match (country quota matches, pool, winner's bracket, semifinals, bronze medal, and gold medal)
iterate the code for more years and tournaments from the dropdown menu at the top of the webpage
I have tried to iterate through different years but my code does not work
tournament_years = {"FIVB 2015", "FIVB 2016"}
dfs = []
for year in tournament_years:
    # select desired tournament
    box_year = Select(driver.find_element_by_xpath("/html/body/table[3]/tbody/tr/td/table[1]/tbody/tr[1]/td[2]/select"))
    box_year.select_by_visible_text(year)
    box_matches = Select(driver.find_element_by_xpath("/html/body/table[3]/tbody/tr/td/table[1]/tbody/tr[2]/td[2]/select"))
    box_matches.select_by_visible_text("Matches")
The main idea was to create a list of dataframes for each year and each tournament by adding a new loop at the beginning of the code.
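Roughly, the structure I had in mind is something like this (an untested sketch reusing driver, Select, BeautifulSoup and pd from the code above; time.sleep is only a crude stand-in for a proper wait):

import time

tournament_years = ["FIVB 2015", "FIVB 2016"]
dfs = []
for year in tournament_years:
    # pick the year and the "Matches" view from the dropdowns
    box_year = Select(driver.find_element_by_xpath(
        "/html/body/table[3]/tbody/tr/td/table[1]/tbody/tr[1]/td[2]/select"))
    box_year.select_by_visible_text(year)
    box_matches = Select(driver.find_element_by_xpath(
        "/html/body/table[3]/tbody/tr/td/table[1]/tbody/tr[2]/td[2]/select"))
    box_matches.select_by_visible_text("Matches")
    time.sleep(2)  # wait for the page to update

    # re-parse the freshly loaded page and run the same extraction code as above
    soup = BeautifulSoup(driver.page_source, 'lxml')
    rows = []
    # ... the parsing code from above goes here, appending one dict per match to rows ...
    dfs.append(pd.DataFrame(rows))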
If someone has a better idea and technique to do so, it is really appreciated!

Retrieving data from the Air Quality Index (AQI) website through the API and only receiving a small number of stations

I'm working on a personal project and I'm trying to retrieve air quality data from the https://aqicn.org website using their API.
I've used this code, which I've copied and adapted for the city of Bucharest as follows:
import pandas as pd
import folium
import requests

# GET data from AQI website through the API
base_url = "https://api.waqi.info"
path_to_file = "~/path"

# Got token from:- https://aqicn.org/data-platform/token/#/
with open(path_to_file) as f:
    contents = f.readlines()
    key = contents[0]

# (lat, long)-> bottom left, (lat, lon)-> top right
latlngbox = "44.300264,25.920181,44.566991,26.297836"  # For Bucharest
trail_url = f"/map/bounds/?token={key}&latlng={latlngbox}"
my_data = pd.read_json(base_url + trail_url)  # Joined parts of URL
print('columns->', my_data.columns)  # 2 cols 'status' and 'data' JSON

### Built a dataframe from the json file
all_rows = []
for each_row in my_data['data']:
    all_rows.append([each_row['station']['name'],
                     each_row['lat'],
                     each_row['lon'],
                     each_row['aqi']])
df = pd.DataFrame(all_rows, columns=['station_name', 'lat', 'lon', 'aqi'])

# Cleaned the DataFrame
df['aqi'] = pd.to_numeric(df.aqi, errors='coerce')  # Invalid parsing to NaN
# Remove NaN entries in col
df1 = df.dropna(subset=['aqi'])
Unfortunately it only retrieves 4 stations whereas there are many more available on the actual site. In the API documentation the only limitation I saw was for "1,000 (one thousand) requests per second" so why can't I get more of them?
Also, I've tried to modify the lat-long values and managed to get more stations, but they were outside the city I was interested in.
Here is a view of the actual perimeter I've used in the embedded code.
If you have any suggestions as of how I can solve this issue, I'd be very happy to read your thoughts. Thank you!
Try using waqi through aqicn... not exactly a clean API but I found it to work quite well
import pandas as pd
import folium
from folium.plugins import HeatMap  # used for the heatmap below

url1 = 'https://api.waqi.info'
# Get token from:- https://aqicn.org/data-platform/token/#/
token = 'XXX'
box = '113.805332,22.148942,114.434299,22.561716'  # polygon around HongKong via bboxfinder.com
url2 = f'/map/bounds/?latlng={box}&token={token}'
my_data = pd.read_json(url1 + url2)

all_rows = []
for each_row in my_data['data']:
    all_rows.append([each_row['station']['name'], each_row['lat'], each_row['lon'], each_row['aqi']])

df = pd.DataFrame(all_rows, columns=['station_name', 'lat', 'lon', 'aqi'])
From there it's easy to plot:
df['aqi'] = pd.to_numeric(df.aqi, errors='coerce')
print('with NaN->', df.shape)
df1 = df.dropna(subset=['aqi'])
df2 = df1[['lat', 'lon', 'aqi']]

init_loc = [22.396428, 114.109497]
max_aqi = int(df1['aqi'].max())
print('max_aqi->', max_aqi)

m = folium.Map(location=init_loc, zoom_start=5)
heat_aqi = HeatMap(df2, min_opacity=0.1, max_val=max_aqi,
                   radius=60, blur=20, max_zoom=2)
m.add_child(heat_aqi)
m
Or as such
centre_point = [22.396428, 114.109497]
m2 = folium.Map(location=centre_point, tiles='Stamen Terrain', zoom_start=6)
for idx, row in df1.iterrows():
    lat = row['lat']
    lon = row['lon']
    station = row['station_name'] + ' AQI=' + str(row['aqi'])
    station_aqi = row['aqi']
    if station_aqi > 300:
        pop_color = 'red'
    elif station_aqi > 200:
        pop_color = 'orange'
    else:
        pop_color = 'green'
    folium.Marker(location=[lat, lon],
                  popup=station,
                  icon=folium.Icon(color=pop_color)).add_to(m2)
m2
Checking for stations within HK returns 19:
df[df['station_name'].str.contains('HongKong')]
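For what it's worth, the same /map/bounds/ call with the Bucharest box from the question would look like this (a sketch, assuming a valid personal token; widening the box and then filtering on station_name is one way to pull in the surrounding stations):

import pandas as pd

token = 'XXX'  # personal token from https://aqicn.org/data-platform/token/#/
box = '44.300264,25.920181,44.566991,26.297836'  # Bucharest box from the question
my_data = pd.read_json(f'https://api.waqi.info/map/bounds/?latlng={box}&token={token}')
all_rows = [[row['station']['name'], row['lat'], row['lon'], row['aqi']]
            for row in my_data['data']]
df = pd.DataFrame(all_rows, columns=['station_name', 'lat', 'lon', 'aqi'])
print(df.shape)  # number of stations returned for this box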

How do I return multiple 'scorers' when scraping for football results using Python?

I'm just a few hours into learning Python, so please go easy with me! I want to scrape scores and scorers off a website, and I've mostly been able to do that. However, I'm only getting one scorer (if there is one!); when there are multiple goal scorers I am only getting the first. The problem is in the part where I look for multiple scorers under '# Home Scorers'.
My code:
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://www.skysports.com/football-results"

match_results = {}
match_details = {}
match_no = 0

response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')

matches = soup.find_all('div', {'class': 'fixres__item'})
for match in matches:
    try:
        match_url_get = match.find('a', {'class': 'matches__item matches__link'}).get('href')
        match_url = match_url_get if match_url_get else "unknown"
        event_id = match_url[-6:]
        match_response = requests.get(match_url)
        match_data = match_response.text
        match_soup = BeautifulSoup(match_data, 'html.parser')

        # Match Details
        match_date = match_soup.find('time', {'class': 'sdc-site-match-header__detail-time'}).text
        match_location = match_soup.find('span', {'class': 'sdc-site-match-header__detail-venue'}).text
        match_info = match_soup.find('p', {'class': 'sdc-site-match-header__detail-fixture'}).text

        # Home Scores & Team
        home_details = match_soup.find_all('span', {'class': 'sdc-site-match-header__team-name sdc-site-match-header__team-name--home'})
        for home_detail in home_details:
            home_team = home_detail.find('span', {'class': 'sdc-site-match-header__team-name-block-target'}).text
            home_score_get = match_soup.find('span', {'class': 'sdc-site-match-header__team-score-block', 'data-update': 'score-home'})
            home_score = home_score_get.text if home_score_get else "none"

        # Home Scorers
        home_scorer_details = match_soup.find_all('ul', {'class': 'sdc-site-match-header__team-synopsis', 'data-update': 'synopsis-home'})
        for home_scorer_detail in home_scorer_details:
            goal_scorer_get = home_scorer_detail.find('li', {'class': 'sdc-site-match-header__team-synopsis-line'})
            goal_scorer = goal_scorer_get.text if goal_scorer_get else "none"
            goal_score_minute_get = home_scorer_detail.find('span', {'class': 'sdc-site-match-header__event-time'})
            goal_score_minute = goal_score_minute_get.text if goal_score_minute_get else "none"

        # Away Scores & Team
        away_details = match_soup.find_all('span', {'class': 'sdc-site-match-header__team-name sdc-site-match-header__team-name--away'})
        for away_detail in away_details:
            away_team = away_detail.find('span', {'class': 'sdc-site-match-header__team-name-block-target'}).text
            away_score_get = match_soup.find('span', {'class': 'sdc-site-match-header__team-score-block', 'data-update': 'score-away'})
            away_score = away_score_get.text if away_score_get else "none"

        # Away Scorers
        away_scorer_details = match_soup.find_all('ul', {'class': 'sdc-site-match-header__team-synopsis', 'data-update': 'synopsis-away'})
        for away_scorer_detail in away_scorer_details:
            away_goal_scorer_get = away_scorer_detail.find('li', {'class': 'sdc-site-match-header__team-synopsis-line'})
            away_goal_scorer = away_goal_scorer_get.text if away_goal_scorer_get else "none"
            away_goal_score_minute_get = away_scorer_detail.find('span', {'class': 'sdc-site-match-header__event-time'})
            away_goal_score_minute = away_goal_score_minute_get.text if away_goal_score_minute_get else "none"

        print("Match: ", event_id, "Match Date:", match_date, "Match Location:", match_location, "Match Info:", match_info, "\nResult: ", home_team, home_score, away_team, away_score)
        print("Home Scorer:", goal_scorer, "Minute:", goal_score_minute, "\nAway Scorer:", away_goal_scorer, "Minute:", away_goal_score_minute)
        print(match_date)
    except:
        pass
    match_no += 1
    match_results[match_no] = [event_id, home_team, home_score, away_team, away_score, match_url, match_date, match_location, match_info]
    match_details[match_no] = [event_id, goal_scorer, goal_score_minute, away_goal_scorer, away_goal_score_minute]

Period = "2021-22"
print("Total Matches: ", match_no)

match_results = pd.DataFrame.from_dict(match_results, orient='index', columns=['Event_ID:', 'Home Team:', 'Home Score:', 'Away Team:', 'Away Score:', 'Link:', 'Match Date:', 'Match Location:', 'Match Info:'])
match_results.to_csv("Python/FL/Premier League Results (SkySports.com) " + Period + ".csv")

match_details = pd.DataFrame.from_dict(match_details, orient='index', columns=['Event_ID:', 'Home Goal:', 'Home Goal Minute:', 'Away Goal:', 'Away Goal Minute:'])
match_details.to_csv("Python/FL/Premier League Details (SkySports.com) " + Period + ".csv")
So the bit that's not working correctly is:
# Home Scorers
home_scorer_details = match_soup.find_all('ul', {'class': 'sdc-site-match-header__team-synopsis', 'data-update': 'synopsis-home'})
for home_scorer_detail in home_scorer_details:
    goal_scorer_get = home_scorer_detail.find('li', {'class': 'sdc-site-match-header__team-synopsis-line'})
    goal_scorer = goal_scorer_get.text if goal_scorer_get else "none"
    goal_score_minute_get = home_scorer_detail.find('span', {'class': 'sdc-site-match-header__event-time'})
    goal_score_minute = goal_score_minute_get.text if goal_score_minute_get else "none"
Any ideas how I can return multiple rows for that bit?!
Thanks in advance :)
home_scorer_details only has 1 item: the unordered list itself.
To get all the scorers you need to get the items in that list.
The following code, which is pretty rough, will create a list of dictionaries where each dictionary has the name of the scorer and the minute(s) they scored.
You could use similar code to get all the away scorers (see the sketch after the snippet below).
Like I said, this code is rough and needs refining, but it should give you a start.
# Home Scorers
home_scorer_details = match_soup.find_all('ul', {'class': 'sdc-site-match-header__team-synopsis', 'data-update': 'synopsis-home'})
home_scorers = []
for home_scorer_detail in home_scorer_details[0].find_all('li'):
    goal_scorer = home_scorer_detail.text
    goal_score_minute_get = home_scorer_detail.find('span', {'class': 'sdc-site-match-header__event-time'})
    goal_score_minute = goal_score_minute_get.text if goal_score_minute_get else "none"
    home_scorers.append({'scorer': goal_scorer, 'minute': goal_score_minute})
print(home_scorers)
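For the away side, a near-identical sketch should work (untested; it simply mirrors the code above, reusing match_soup and the away selectors from the original question):

# Away Scorers - same approach, just switching the data-update attribute
away_scorer_details = match_soup.find_all('ul', {'class': 'sdc-site-match-header__team-synopsis', 'data-update': 'synopsis-away'})
away_scorers = []
for away_scorer_detail in away_scorer_details[0].find_all('li'):
    away_goal_scorer = away_scorer_detail.text
    away_goal_score_minute_get = away_scorer_detail.find('span', {'class': 'sdc-site-match-header__event-time'})
    away_goal_score_minute = away_goal_score_minute_get.text if away_goal_score_minute_get else "none"
    away_scorers.append({'scorer': away_goal_scorer, 'minute': away_goal_score_minute})
print(away_scorers)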

Looping through scraped data and outputting the result

I am trying to scrape the BBC football results website to get teams, shots, goals, cards and incidents. I currently have 3 teams' data passed into the URL.
I am writing the script in Python and using the Beautiful Soup (bs4) package. When outputting the results to screen, the first team is printed, then the first and second team, then the first, second and third team. So the first team is effectively being printed 3 times, when I am trying to get the 3 teams just once.
Once I have this problem sorted I will write the results to file. I am adding the teams' data into data frames and then into a list (I am not sure if this is the best method).
I am sure it is something to do with the for loops, but I am unsure how to resolve the problem.
Code:
from bs4 import BeautifulSoup
import urllib2
import pandas as pd

out_list = []

for numb in ('EFBO839787', 'EFBO839786', 'EFBO815155'):
    url = 'http://www.bbc.co.uk/sport/football/result/partial/' + numb + '?teamview=false'
    teams_list = []
    inner_page = urllib2.urlopen(url).read()
    soupb = BeautifulSoup(inner_page, 'lxml')

for report in soupb.find_all('td', 'match-details'):
    home_tag = report.find('span', class_='team-home')
    home_team = home_tag and ''.join(home_tag.stripped_strings)
    score_tag = report.find('span', class_='score')
    score = score_tag and ''.join(score_tag.stripped_strings)
    shots_tag = report.find('span', class_='shots-on-target')
    shots = shots_tag and ''.join(shots_tag.stripped_strings)
    away_tag = report.find('span', class_='team-away')
    away_team = away_tag and ''.join(away_tag.stripped_strings)
    df = pd.DataFrame({'away_team': [away_team], 'home_team': [home_team], 'score': [score], })
    out_list.append(df)

for shots in soupb.find_all('td', class_='shots'):
    home_shots_tag = shots.find('span', class_='goal-count-home')
    home_shots = home_shots_tag and ''.join(home_shots_tag.stripped_strings)
    away_shots_tag = shots.find('span', class_='goal-count-away')
    away_shots = away_shots_tag and ''.join(away_shots_tag.stripped_strings)
    dfb = pd.DataFrame({'home_shots': [home_shots], 'away_shots': [away_shots]})
    out_list.append(dfb)

for incidents in soupb.find("table", class_="incidents-table").find("tbody").find_all("tr"):
    home_inc_tag = incidents.find("td", class_="incident-player-home")
    home_inc = home_inc_tag and ''.join(home_inc_tag.stripped_strings)
    type_inc_goal_tag = incidents.find("td", "span", class_="incident-type goal")
    type_inc_goal = type_inc_goal_tag and ''.join(type_inc_goal_tag.stripped_strings)
    type_inc_tag = incidents.find("td", class_="incident-type")
    type_inc = type_inc_tag and ''.join(type_inc_tag.stripped_strings)
    time_inc_tag = incidents.find('td', class_='incident-time')
    time_inc = time_inc_tag and ''.join(time_inc_tag.stripped_strings)
    away_inc_tag = incidents.find('td', class_='incident-player-away')
    away_inc = away_inc_tag and ''.join(away_inc_tag.stripped_strings)
    df_incidents = pd.DataFrame({'home_player': [home_inc], 'event_type': [type_inc_goal],
                                 'event_time': [time_inc], 'away_player': [away_inc]})
    out_list.append(df_incidents)

print "end"
print out_list
I am new to Python and Stack Overflow; any suggestions on formatting my questions are also welcome.
Thanks in advance!
Those 3 for loops should be inside your main for loop.
out_list = []
for numb in ('EFBO839787', 'EFBO839786', 'EFBO815155'):
    url = 'http://www.bbc.co.uk/sport/football/result/partial/' + numb + '?teamview=false'
    teams_list = []
    inner_page = urllib.request.urlopen(url).read()
    soupb = BeautifulSoup(inner_page, 'lxml')

    for report in soupb.find_all('td', 'match-details'):
        # your code as it is
    for shots in soupb.find_all('td', class_='shots'):
        # your code as it is
    for incidents in soupb.find("table", class_="incidents-table").find("tbody").find_all("tr"):
        # your code as it is
It works just fine - shows up a team just once.
Here's output of first for loop:
[{'score': ['1-3'], 'away_team': ['Man City'], 'home_team': ['Dynamo Kiev']},
{'score': ['1-0'], 'away_team': ['Zenit St P'], 'home_team': ['Benfica']},
{'score': ['1-2'], 'away_team': ['Boston United'], 'home_team': ['Bradford Park Avenue']}]
This looks like a printing problem: at what indentation level are you printing out_list?
It should be at zero indentation, all the way to the left in your code.
Either that, or you want to move `out_list = []` into the top-most for loop so that it is re-assigned on every iteration.
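A minimal sketch of the two options being described (reusing the names from the question; the three parsing loops are elided):

# Option 1: accumulate everything and print once, at zero indentation
out_list = []
for numb in ('EFBO839787', 'EFBO839786', 'EFBO815155'):
    # ... fetch the page and run the three parsing loops here, appending to out_list ...
    pass
print(out_list)  # runs once, after the main loop has finished

# Option 2: re-assign out_list inside the main loop and print per page
for numb in ('EFBO839787', 'EFBO839786', 'EFBO815155'):
    out_list = []
    # ... fetch the page and run the three parsing loops here, appending to out_list ...
    print(out_list)  # only this page's data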
