Trouble scraping sports table - python

I'm having a lot of trouble scraping results off a sporting table in Python.
I am new to scraping and have tried everything I can find online.
The website is https://www.nrl.com/stats/teams/?competition=111&season=2022&stat=38
and I'm just trying to get the team name and tries.
Any suggestions would be appreciated.

Make sure you have the right requirements.
If you are using Anaconda, open the Anaconda command line and type:
> pip install beautifulsoup4
> pip install requests
Now you can try a scraping library called BeautifulSoup: you specify the div you want from the HTML source of your page, pass your link, and the library catches all the data. Example:
import requests
from bs4 import BeautifulSoup

# create the variable page
page = requests.get('https://www.imdb.com/title/tt7286456/criticreviews?ref_=tt_ov_rt')
# create the variable soup by calling BeautifulSoup on our web page
soup = BeautifulSoup(page.text, 'html.parser')
file = open("data.txt", "w")  # don't forget to create a data.txt file in the same directory as your file.py
# map the divs with class_="summary" using find_all()
for x in soup.find_all('div', class_='summary'):
    print(x)            # print the scraped data
    file.write(x.text)  # store the data in data.txt
    file.write("\n")
file.close()
More details in my scraping project: https://github.com/mehdimaaref7/Scrapping-Sentiment-Analysis/blob/master/big_data.py
The data.txt file is optional; you can just use the variable x with print(x) to display the data you need to scrape.
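If you prefer not to manage the file handle manually, here is a minimal variant of the same example that uses a with block, so data.txt is closed automatically:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.imdb.com/title/tt7286456/criticreviews?ref_=tt_ov_rt')
soup = BeautifulSoup(page.text, 'html.parser')

# the with block closes data.txt automatically, even if an error occurs
with open("data.txt", "w") as file:
    for x in soup.find_all('div', class_='summary'):
        print(x)
        file.write(x.text + "\n")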

No need to iterate. Just get the JSON, then let pandas parse it into a DataFrame:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json

url = 'https://www.nrl.com/stats/teams/?competition=111&season=2022&stat=38'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# the Vue component embeds the full stats payload as JSON in its q-data attribute
data = soup.find('div', {'id': 'vue-stats-detail'})['q-data']
jsonData = json.loads(data)

# the per-team totals live under totalStats -> leaders
df = pd.DataFrame(jsonData['totalStats']['leaders'])
Output:
print(df)
teamNickName ... played
0 Storm ... 24
1 Roosters ... 24
2 Rabbitohs ... 24
3 Panthers ... 24
4 Eels ... 24
5 Cowboys ... 24
6 Sharks ... 24
7 Broncos ... 24
8 Sea Eagles ... 24
9 Raiders ... 24
10 Titans ... 24
11 Dragons ... 24
12 Knights ... 24
13 Warriors ... 24
14 Bulldogs ... 24
15 Wests Tigers ... 24
[16 rows x 5 columns]
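Since the question only asks for the team name and tries, you can then slice just those columns out of the DataFrame. A minimal sketch; the exact name of the tries column is an assumption here, so check df.columns first:

print(df.columns.tolist())                 # confirm what the payload calls each stat
tries_df = df[['teamNickName', 'tries']]   # 'tries' is an assumed column name
print(tries_df)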

Related

Code giving blank output, working on only one particular website, while both websites are exactly the same

I am scraping district school websites that are all built the same; every URL is identical except for the school name. The code I use works on only one school; when I put in another school's name it gives blank output. Can anyone help me see where I am going wrong? Here is the working code:
import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'https://fairfaxhs.fcps.edu/staff-directory?field_last_name_from=&field_last_name_to=&items_per_page=10&keywords=&page={page}'

data = []
for page in range(0, 2):
    soup = BeautifulSoup(requests.get(url.format(page=page)).text, 'lxml')
    try:
        for u in ['https://fairfaxhs.fcps.edu' + link.a.get('href') for link in
                  soup.table.select('tr td[class="views-field views-field-rendered-item"]')]:
            soup2 = BeautifulSoup(requests.get(u).text, 'lxml')
            d = {
                'Name': soup2.select_one('h1.node__title.fcps-color--dark11').get_text(strip=True),
                'Position': soup2.select_one('h1+div').get_text(strip=True),
                'contact_url': u
            }
            data.append(d)
    except:
        pass

df = pd.DataFrame(data).to_csv('fcps_school.csv', index=False)
print(df)
Here is the other URL I am trying to scrape:
https://aldrines.fcps.edu/staff-directory?keywords=&field_last_name_from=&field_last_name_to=&items_per_page=10&page=
https://aldrines.fcps.edu
I've scraped 10 pages as an example without changing anything in the existing code, and it's working fine; I also get the same output in the CSV file.
import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'https://fairfaxhs.fcps.edu/staff-directory?field_last_name_from=&field_last_name_to=&items_per_page=10&keywords=&page={page}'

data = []
for page in range(0, 10):
    soup = BeautifulSoup(requests.get(url.format(page=page)).text, 'lxml')
    try:
        for u in ['https://fairfaxhs.fcps.edu' + link.a.get('href') for link in
                  soup.table.select('tr td[class="views-field views-field-rendered-item"]')]:
            soup2 = BeautifulSoup(requests.get(u).text, 'lxml')
            d = {
                'Name': soup2.select_one('h1.node__title.fcps-color--dark11').get_text(strip=True),
                'Position': soup2.select_one('h1+div').get_text(strip=True),
                'contact_url': u
            }
            data.append(d)
    except:
        pass

df = pd.DataFrame(data)  # .to_csv('fcps_school.csv', index=False)
print(df)
Output
Name Position contact_url
0 Bouchera Abutaa Instructional Assistant https://fairfaxhs.fcps.edu/staff/bouchera-abutaa
1 Margaret Aderton Substitute Teacher - Regular Term https://fairfaxhs.fcps.edu/staff/margaret-aderton
2 Aja Adu-Gyamfi School Counselor, HS https://fairfaxhs.fcps.edu/staff/aja-adu-gyamfi
3 Paul Agyeman Custodian II https://fairfaxhs.fcps.edu/staff/paul-agyeman
4 Jin Ahn Food Services Worker https://fairfaxhs.fcps.edu/staff/jin-ahn
.. ... ... ...
95 Tiffany Haddock School Counselor, HS https://fairfaxhs.fcps.edu/staff/tiffany-haddock
96 Heather Hakes Learning Disabilities Teacher, MS/HS https://fairfaxhs.fcps.edu/staff/heather-hakes
97 Gabrielle Hall History & Social Studies Teacher, HS https://fairfaxhs.fcps.edu/staff/gabrielle-hall
98 Sydney Hamrick English Teacher, HS https://fairfaxhs.fcps.edu/staff/sydney-hamrick
99 Anne-Marie Hanapole Biology Teacher, HS https://fairfaxhs.fcps.edu/staff/anne-marie-ha...
[100 rows x 3 columns]
Update:
Actually, success in web scraping doesn't depend only on good coding skills; half of it depends on understanding the website well.
The domain names
1. https://fairfaxhs.fcps.edu
2. https://aldrines.fcps.edu
aren't the same, and the h1 tag's class value is slightly different; otherwise, both websites' structures are alike.
Working code:
import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'https://aldrines.fcps.edu/staff-directory?field_last_name_from=&field_last_name_to=&items_per_page=10&keywords=&page={page}'

data = []
for page in range(0, 10):
    soup = BeautifulSoup(requests.get(url.format(page=page)).text, 'lxml')
    try:
        for u in ['https://aldrines.fcps.edu' + link.a.get('href') for link in
                  soup.table.select('tr td[class="views-field views-field-rendered-item"]')]:
            soup2 = BeautifulSoup(requests.get(u).text, 'lxml')
            d = {
                'Name': soup2.select_one('h1.node__title.fcps-color--dark7').get_text(strip=True),
                'Position': soup2.select_one('h1+div').get_text(strip=True),
                'contact_url': u
            }
            data.append(d)
    except:
        pass

df = pd.DataFrame(data)  # .to_csv('fcps_school.csv', index=False)
print(df)
Output:
Name ... contact_url
0 Jamileh Abu-Ghannam ... https://aldrines.fcps.edu/staff/jamileh-abu-gh...
1 Linda Adgate ... https://aldrines.fcps.edu/staff/linda-adgate
2 Rehab Ahmed ... https://aldrines.fcps.edu/staff/rehab-ahmed
3 Richard Amernick ... https://aldrines.fcps.edu/staff/richard-amernick
4 Laura Arm ... https://aldrines.fcps.edu/staff/laura-arm
.. ... ... ...
95 Melissa Weinhaus ... https://aldrines.fcps.edu/staff/melissa-weinhaus
96 Kathryn Wheeler ... https://aldrines.fcps.edu/staff/kathryn-wheeler
97 Latoya Wilson ... https://aldrines.fcps.edu/staff/latoya-wilson
98 Shane Wolfe ... https://aldrines.fcps.edu/staff/shane-wolfe
99 Michael Woodring ... https://aldrines.fcps.edu/staff/michael-woodring
[100 rows x 3 columns]
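If you want one script that works for either school, a minimal sketch under the assumption that h1.node__title (without the color suffix) is specific enough on both domains:

import pandas as pd
from bs4 import BeautifulSoup
import requests

def scrape_staff(base, pages=10):
    # base is e.g. 'https://fairfaxhs.fcps.edu' or 'https://aldrines.fcps.edu'
    url = base + '/staff-directory?field_last_name_from=&field_last_name_to=&items_per_page=10&keywords=&page={page}'
    data = []
    for page in range(pages):
        soup = BeautifulSoup(requests.get(url.format(page=page)).text, 'lxml')
        if soup.table is None:   # no directory table on this page
            break
        for td in soup.table.select('tr td[class="views-field views-field-rendered-item"]'):
            u = base + td.a.get('href')
            soup2 = BeautifulSoup(requests.get(u).text, 'lxml')
            # assumption: 'h1.node__title' matches on both domains regardless of the color class
            name = soup2.select_one('h1.node__title')
            position = soup2.select_one('h1+div')
            data.append({
                'Name': name.get_text(strip=True) if name else None,
                'Position': position.get_text(strip=True) if position else None,
                'contact_url': u,
            })
    return pd.DataFrame(data)

print(scrape_staff('https://aldrines.fcps.edu', pages=2))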

InvalidSchema: No connection adapters were found for "link"?

I have a dataset with multiple links and I'm trying to get the text of all the links using the code below, but I'm getting an error message: InvalidSchema: No connection adapters were found for "'https://en.wikipedia.org/wiki/Wagner_Group'".
Dataset:
links
'https://en.wikipedia.org/wiki/Wagner_Group'
'https://en.wikipedia.org/wiki/Vladimir_Putin'
'https://en.wikipedia.org/wiki/Islam_in_Russia'
The code I'm using to web-scrape is:
def get_data(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    text = ""
    for paragraph in soup.find_all('p'):
        text += paragraph.text
    return text

# works fine
url = 'https://en.wikipedia.org/wiki/M142_HIMARS'
get_data(url)

# doesn't work
df['links'].apply(get_data)
Error: InvalidSchema: No connection adapters were found for "'https://en.wikipedia.org/wiki/Wagner_Group'"
Thank you in advance
It works just fine when I apply it to a single URL, but it doesn't work when I apply it to a dataframe.
df['links'].apply(get_data) isn't working with requests and bs4 as written.
You can try one of the right ways as follows:
Example:
import requests
from bs4 import BeautifulSoup
import pandas as pd

links = [
    'https://en.wikipedia.org/wiki/Wagner_Group',
    'https://en.wikipedia.org/wiki/Vladimir_Putin',
    'https://en.wikipedia.org/wiki/Islam_in_Russia']

data = []
for url in links:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml')
    for pra in soup.select('div[class="mw-parser-output"] > table~p'):
        paragraph = pra.get_text(strip=True)
        data.append({
            'paragraph': paragraph
        })
# print(data)

df = pd.DataFrame(data)
print(df)
Output:
paragraph
0 TheWagner Group(Russian:Группа Вагнера,romaniz...
1 The group came to global prominence during the...
2 Because it often operates in support of Russia...
3 The Wagner Group first appeared in Ukraine in ...
4 The Wagner Group itself was first active in 20...
.. ...
440 A record 18,000 Russian Muslim pilgrims from a...
441 For centuries, theTatarsconstituted the only M...
442 A survey published in 2019 by thePew Research ...
443 Percentage of Muslims in Russia by region:
444 According to the 2010 Russian census, Moscow h...
[445 rows x 1 columns]
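If you would rather keep the apply approach from the question, the nested quotes in the error message suggest the values in the links column are themselves quoted strings. Under that assumption, a minimal sketch that strips the quotes before requesting:

# assumption: each value in df['links'] looks like "'https://...'" with literal quotes
df['links'] = df['links'].str.strip("'\" ")
df['text'] = df['links'].apply(get_data)   # get_data as defined in the question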

Unable to get data from <li> _data_ </li> using Python; I am making a web scraper

This is a follow-up to a question I asked earlier and got a very good answer to, but I didn't fully understand that program. Please help me scrape information from the following websites.
https://premieragile.com/csm-training/
https://www.simplilearn.com/agile-and-scrum/csm-certification-training
Here I want all the information given in each card. I am also adding the program I am using, which I got from Stack Overflow itself.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://premieragile.com/csm-training/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
for row in soup.select(".row > schedules-courses br-10 h-100 p-3 p-sm-4"):
    date = row.findAll(".d-flex align-items-center pb-4 h6").text.strip()
    # year = row.select_one(".li .batchDetails .date-details .date span").text.strip()
    # rating = row.select_one(".imdbRating").text.strip()
    # ...other variables
    all_data.append([date])

df = pd.DataFrame(all_data, columns=["date"])
print(df.head().to_markdown(index=False))
Here, please explain how I should add the div class in the for loop, and also what the hierarchy of the following will be:
div
li
h
ul
li
Please help me understand this. I got the general idea that we are creating an empty list and adding data to it using the BeautifulSoup object. I am utterly confused about how I should study the website I want to scrape and, thus, how to add a column to a row in the program.
P.S. I am getting blank output.
The content is dynamically loaded from another resource. It is not contained in your soup; that's why you get an empty output.
Simply load it from this resource https://premieragile.com/csm-training/?page=1&id=ol&city=&countryCode=DE&trainerid=undefined&timezone=Europe/Berlin and adjust the parameters to your needs.
url = "https://premieragile.com/csm-training/?page=1&id=ol&city=&countryCode=DE&trainerid=undefined&timezone=Europe/Berlin"
The HTML is wrapped in a JSON structure, so you have to specify the path that the BeautifulSoup object should be created from.
r = requests.get(url, headers={'x-requested-with': 'XMLHttpRequest'}).json()['html']
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://premieragile.com/csm-training/?page=1&id=ol&city=&countryCode=DE&trainerid=undefined&timezone=Europe/Berlin"

# the endpoint returns JSON; the rendered cards are under the 'html' key
r = requests.get(url, headers={'x-requested-with': 'XMLHttpRequest'}).json()['html']
soup = BeautifulSoup(r, 'html.parser')

all_data = []
for e in soup.select('.loop'):
    all_data.append({
        'trainer': e.h6.text.strip(),
        'date': ' '.join(s.strip() for s in e.li.text.split('\n'))
    })

df = pd.DataFrame(all_data)
print(df.head().to_markdown(index=False))
Output
| trainer            | date                   |
|:-------------------|:-----------------------|
| Daniel James Gullo | 08 Jul - 08 Jul - 2022 |
| Raj Kasturi        | 11 Jul - 13 Jul - 2022 |
| Michel Goldenberg  | 11 Jul - 12 Jul - 2022 |
| Valerio Zanini     | 12 Jul - 14 Jul - 2022 |
| Michael Franken    | 13 Jul - 15 Jul - 2022 |
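Because the resource is paged (the page query parameter), you can loop over several pages with the same request. A minimal sketch, assuming the other query parameters stay as in the answer and that an empty .loop selection means there are no more pages:

import requests
import pandas as pd
from bs4 import BeautifulSoup

base = ("https://premieragile.com/csm-training/?page={page}&id=ol&city=&countryCode=DE"
        "&trainerid=undefined&timezone=Europe/Berlin")

all_data = []
for page in range(1, 6):  # first five pages as an example
    r = requests.get(base.format(page=page),
                     headers={'x-requested-with': 'XMLHttpRequest'}).json()['html']
    soup = BeautifulSoup(r, 'html.parser')
    cards = soup.select('.loop')
    if not cards:          # assumption: no cards means we ran past the last page
        break
    for e in cards:
        all_data.append({
            'trainer': e.h6.text.strip(),
            'date': ' '.join(s.strip() for s in e.li.text.split('\n'))
        })

df = pd.DataFrame(all_data)
print(df)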

How to extract data from the site (corona) with BeautifulSoup?

For my research work, I want to save the number of articles for each country from the following site to a file, in the form: country name, number of articles. To do this, I wrote this code, which unfortunately does not work.
http://corona.sid.ir/
!pip install bs4
from bs4 import BeautifulSoup # this module helps in web scraping.
import requests # this module helps us to download a web page
url='http://corona.sid.ir/'
data = requests.get(url).text
soup = BeautifulSoup(data,"lxml") # create a soup object using the variable 'data'
soup.find_all(attrs={"class":"value"})
Result:
[]
You are using the wrong URL. Try this:
from bs4 import BeautifulSoup  # this module helps in web scraping.
import requests  # this module helps us to download a web page
import pandas as pd

# the country counts are rendered inside the world map SVG
url = 'http://corona.sid.ir/world.svg'
data = requests.get(url).text
soup = BeautifulSoup(data, "lxml")  # create a soup object using the variable 'data'

rows = []
for each in soup.find_all(attrs={"class": "value"}):
    row = {}
    row['country'] = each.text.split(':')[0]
    row['count'] = each.text.split(':')[1].strip()
    rows.append(row)

df = pd.DataFrame(rows)
Output:
print(df)
country count
0 Andorra 17
1 United Arab Emirates 987
2 Afghanistan 67
3 Albania 143
4 Armenia 49
.. ... ...
179 Yemen 54
180 Mayotte 0
181 South Africa 1938
182 Zambia 127
183 Zimbabwe 120
[184 rows x 2 columns]
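Since the goal is to save the country name and article count to a file for later use, the DataFrame can be written out directly; the file name below is just an example:

df.to_csv('corona_article_counts.csv', index=False)   # writes one "country,count" row per line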

Not all HTML seems to be retrieved using BeautifulSoup

I'm quite new to using BeautifulSoup. I have tried to retrieve a few elements from the following webpage: https://www.booking.com/hotel/tz/zuri-zanzibar.html
from bs4 import BeautifulSoup
import requests

url = "https://www.booking.com/hotel/tz/zuri-zanzibar.html"
r = requests.get(url)
soup = BeautifulSoup(r.content)

for elem in soup.findAll("div", {"class": "facilitiesChecklistSection"}):
    try:
        print(elem.attrs['data-section-id'])
    except:
        continue
I get the following IDs, whereas there should be many more:
13
-2
2
7
11
16
22
21
23
25
26
34
1
Would you know why?
Also, I don't get anything back when I try:
soup.findAll("div", {"class": "hp_location_block__map_container bui-spacer--larger"})
I'd like to retrieve the map location.
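One possibility worth ruling out (an assumption, not confirmed for this page): booking.com may serve a reduced page to the default requests User-Agent, and blocks such as the map may be rendered client-side. A minimal sketch that sends a browser-like header and checks whether the sections appear in the raw HTML at all:

from bs4 import BeautifulSoup
import requests

url = "https://www.booking.com/hotel/tz/zuri-zanzibar.html"
# assumption: a browser-like User-Agent may return more complete HTML than the default one
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "html.parser")

print(len(soup.find_all("div", {"class": "facilitiesChecklistSection"})))
# if the map container still isn't found, it is probably injected by JavaScript
# and would need a browser-based tool such as Selenium to render
print(soup.select_one("div.hp_location_block__map_container"))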
