Not able to do web scraping using BeautifulSoup and requests - Python

I am trying to scrape the values of the first two sections, i.e. 1X2 and DOUBLE CHANCE, using bs4 and requests from this website: https://web.bet9ja.com/Sport/SubEventDetail?SubEventID=76512106
The code I have written is:
import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://web.bet9ja.com/Sport/SubEventDetail?SubEventID=76512106')
soup = bs.BeautifulSoup(source, 'lxml')

for div in soup.find_all('div', class_='SEItem ng-scope'):
    print(div.text)
When I run it, I don't get any output. Can anyone help?

The page is loaded via JavaScript, so you have two options: use Selenium, or call the API directly.
Instead of using Selenium, I've called the API directly and got the required info.
Further explanation about XHR and APIs can be found here.
import requests

data = {
    'IDGruppoQuota': '0',
    'IDSottoEvento': '76512106'
}

def main(url):
    r = requests.post(url, json=data).json()
    count = 0
    for item in r['d']['ClassiQuotaList']:
        count += 1
        print(item['ClasseQuota'], [x['Quota'] for x in item['QuoteList']])
        if count == 2:
            break

main("https://web.bet9ja.com/Controls/ControlsWS.asmx/GetSubEventDetails")
Output:
1X2 ['3.60', '4.20', '1.87']
Double Chance ['1.83', '1.19', '1.25']

Try:
import bs4 as bs
import urllib.request
import lxml

source = urllib.request.urlopen('https://web.bet9ja.com/Sport/SubEventDetail?SubEventID=76512106')
soup = bs.BeautifulSoup(source, 'lxml')

a = soup.find_all('div')
for i in a:
    try:
        print(i['class'])
    except:
        pass
    try:
        sp = i.find_all('div')
        for j in sp:
            print(j['class'])
    except:
        pass
This helps you find the available classes in the <div> tags.
You get nothing when the class you specify doesn't exist in the downloaded HTML. This happens because many sites are generated dynamically with JavaScript, and requests can't retrieve that content. In those cases, you need to use Selenium.
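If you do go the Selenium route, a minimal sketch could look like this, assuming chromedriver is available on your PATH; the 'SEItem' class is taken from the question's snippet, so verify it against the rendered page in dev tools:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

url = 'https://web.bet9ja.com/Sport/SubEventDetail?SubEventID=76512106'

driver = webdriver.Chrome()   # assumes chromedriver is on your PATH
driver.get(url)
time.sleep(10)                # crude wait for the JavaScript app to finish rendering

# 'SEItem' comes from the question's snippet; confirm the class name in dev tools
for div in driver.find_elements(By.CSS_SELECTOR, 'div.SEItem'):
    print(div.text)

driver.quit()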

Related

BeautifulSoup is missing content

I'm trying to scrape the number of visitors at my local climbing centre.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://portal.rockgympro.com/portal/public/c3b9019203e4bc4404983507dbdf2359/occupancy?&iframeid=occupancyCounter&fId=1644")
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('span', id="count")
print(results)
It's printing this:
<span id="count" style="display:inline"></span>
That's nice, but the number 19 is missing... What am I doing wrong?
It's there in JSON format inside a <script> tag of the HTML. You just need to pull it out.
import requests
import json
from bs4 import BeautifulSoup

url = 'https://portal.rockgympro.com/portal/public/c3b9019203e4bc4404983507dbdf2359/occupancy?&iframeid=occupancyCounter&fId=1644'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# The occupancy data lives in a "var data = {...};" assignment inside the third <script> tag
scriptStr = str(soup.find_all('script')[2]).split('var data = ')[-1].split(';')[0].replace("'", '"')

# Drop the trailing comma before the closing brace so the string is valid JSON
last_char_index = scriptStr.rfind(",")
scriptStr = scriptStr[:last_char_index] + '}'
scriptStr = scriptStr.replace('&nbsp', ' ')

jsonData = json.loads(scriptStr)
count = jsonData['REA']['count']
capacity = jsonData['REA']['capacity']
lastUpdate = jsonData['REA']['lastUpdate']
print(f'{count} of {capacity} Climbers\n{lastUpdate}')
Output:
58 of 220 Climbers
Last updated: now (5:20 PM)
You're not doing anything wrong; the issue is that the website populates the <span> element using JavaScript, which runs after your request is made.
Unfortunately, the requests library cannot run JavaScript since it is a pure HTTP tool. I would recommend checking out something like Selenium which is more robust and can wait for the JavaScript to load before scraping the HTML.
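For example, a rough sketch with an explicit wait, assuming chromedriver is installed; the id "count" comes from the snippet in the question:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

url = ('https://portal.rockgympro.com/portal/public/'
       'c3b9019203e4bc4404983507dbdf2359/occupancy'
       '?&iframeid=occupancyCounter&fId=1644')

driver = webdriver.Chrome()   # assumes chromedriver is on your PATH
driver.get(url)

# Wait (up to 30 s) until the JavaScript has written something into the span
wait = WebDriverWait(driver, 30)
count_text = wait.until(
    lambda d: d.find_element(By.ID, 'count').text.strip() or False
)
print(count_text)

driver.quit()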
You can try the requests_html module to get dynamic values that are calculated by JavaScript. I tried the logic below and it worked for me on your site.
from bs4 import BeautifulSoup
import time
from requests_html import HTMLSession
url="Your Site Link"
# create an HTML Session object
session = HTMLSession()
# Use the object above to connect to needed webpage
resp = session.get(url)
# Run JavaScript code on webpage
resp.html.render(sleep=10)
soup = BeautifulSoup(resp.html.html, 'lxml')
results = soup.find('span', id="count")
print(results)
This prints the value calculated by your site's JavaScript.
In the dev tools, under one of the <script> tags, you can see that many of those figures are generated after the page loads by the JavaScript function showGym(). To allow those figures to be generated, you could use a browser-driving tool like webbot or Selenium, which can wait on the page long enough for the JavaScript to execute and populate those fields. It might be possible to do that with requests, but I don't know; I've only used webbot when hitting problems like these, as it's very easy to use.

How to get some data in real time from a website using python?

I want to fetch some data from the website
https://web.sensibull.com/optionchain?expiry=2020-03-26&tradingsymbol=NIFTY
I am using beautifulsoup library to fetch this data, and have tried the following code:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
url = 'https://web.sensibull.com/optionchain?expiry=2020-03-26&tradingsymbol=NIFTY'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
b = soup.find("div", {"class": "style__AtmIVWrapper-idZNMX kUMMRI"})
print(b)
But it shows "None" as the output.
Although there is only one element with this class in the full HTML, I also tried this:
for b in soup.find_all('div', attrs={'class': 'style__AtmIVWrapper-idZNMX kUMMRI'}):
    print(b.get_text())
    print(len(b))
But it doesn't work.
I also tried soup.find("div"), but it does not show the required div tag in the output, maybe because of the nested divs present.
I am unable to fetch this data and proceed with my work. Please help.
If you are looking for code, this might help:
from selenium import webdriver
import time

webpage = 'https://web.sensibull.com/optionchain?expiry=2020-03-26&tradingsymbol=NIFTY'
driver = webdriver.Chrome(executable_path='Your/path/to/chromedriver.exe')
driver.get(webpage)
time.sleep(10)

nifty_fut = driver.find_element_by_xpath('//*[@id="app"]/div/div[4]/div[2]/div[3]/div/div/div[2]/div[1]/div[1]/div/button/span[1]/div[1]')
print(nifty_fut.text)

atm_iv = driver.find_element_by_xpath('//*[@id="app"]/div/div[4]/div[2]/div[3]/div/div/div[2]/div[1]/div[2]')
print(atm_iv.text)

driver.quit()
It could be a syntax problem; try soup.find_all("div", class_="style__AtmIVWrapper-idZNMX kUMMRI") or just soup.find("div", class_="style__AtmIVWrapper-idZNMX kUMMRI").
If you are interested in web scraping and bs4, take a look at the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find
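Note that when an element carries several classes like this one, a CSS selector via select_one() is often more reliable than matching the full class string, though it still only finds the div if it is present in the raw HTML at all (it won't be if the page is rendered by JavaScript). A small sketch:
from bs4 import BeautifulSoup
import requests

url = 'https://web.sensibull.com/optionchain?expiry=2020-03-26&tradingsymbol=NIFTY'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Matches a div carrying both classes, regardless of order or extra classes
b = soup.select_one('div.style__AtmIVWrapper-idZNMX.kUMMRI')
print(b.get_text() if b else 'Not found - the page is probably rendered by JavaScript')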

Web scraping using Python and Beautiful Soup for /post-sitemap.xml/

I am trying to scrape a page, website/post-sitemap.xml, which contains all the URLs posted for a WordPress website. In the first step, I need to make a list of all the URLs present in the post sitemap. When I use requests.get and check the output, it opens all of the internal URLs as well, which is weird. My intention is to make a list of all the URLs first, and then, using a loop, scrape the individual URLs in the next function. Below is the code I have so far. My final output needs to be a list of all the URLs, if any Python gurus can help.
I have tried using requests.get and urlopen, but nothing seems to open only the base URL for /post-sitemap.xml.
import pandas as pd
import numpy as np
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import re

class wordpress_ext_url_cleanup(object):

    def __init__(self, wp_url):
        self.wp_url_raw = wp_url
        self.wp_url = wp_url + '/post-sitemap.xml/'

    def identify_ext_url(self):
        html = requests.get(self.wp_url)
        print(self.wp_url)
        print(html.text)
        soup = BeautifulSoup(html.text, 'lxml')
        #print(soup.get_text())
        raw_data = soup.find_all('tr')
        print(raw_data)
        #for link in raw_data:
        #    print(link.get("href"))

def main():
    print("Inside Main Function")
    url = "http://punefirst dot com"  # (knowingly removed the . so it doesn't look spammy)
    first_call = wordpress_ext_url_cleanup(url)
    first_call.identify_ext_url()

if __name__ == '__main__':
    main()
I need all 548 URLs present in the post sitemap as a list, which I will use in the next function for further scraping.
The document returned from the server is XML, transformed with XSLT into HTML form (more info here). To parse all the links from this XML, you can use this script:
import requests
from bs4 import BeautifulSoup

url = 'http://punefirst.com/post-sitemap.xml/'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

for loc in soup.select('url > loc'):
    print(loc.text)
Prints:
http://punefirst.com
http://punefirst.com/hospitals/pcmc-hospitals/aditya-birla-memorial-hospital-chinchwad-pune
http://punefirst.com/hospitals/pcmc-hospitals/saijyoti-hospital-and-icu-chinchwad-pune
http://punefirst.com/hospitals/pcmc-hospitals/niramaya-hospital-chinchwad-pune
http://punefirst.com/hospitals/pcmc-hospitals/chetna-hospital-chinchwad-pune
http://punefirst.com/hospitals/hadapsar-hospitals/pbmas-h-v-desai-eye-hospital
http://punefirst.com/hospitals/punecentral-hospitals/shree-sai-prasad-hospital
http://punefirst.com/hospitals/punecentral-hospitals/sadhu-vaswani-missions-medical-complex
http://punefirst.com/hospitals/katraj-kondhwa-hospitals/shivneri-hospital
http://punefirst.com/hospitals/punecentral-hospitals/kelkar-nursing-home
http://punefirst.com/hospitals/pcmc-hospitals/shrinam-hospital
http://punefirst.com/hospitals/pcmc-hospitals/dhanwantari-hospital-nigdi
http://punefirst.com/hospitals/punecentral-hospitals/dr-tarabai-limaye-hospital
http://punefirst.com/hospitals/katraj-kondhwa-hospitals/satyanand-hospital-kondhwa-pune
...and so on.
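Since you want the URLs as a list for the next step, a small variation of the same script can collect them instead of printing them (a sketch; the helper name get_sitemap_urls is just an example):
import requests
from bs4 import BeautifulSoup

def get_sitemap_urls(sitemap_url):
    soup = BeautifulSoup(requests.get(sitemap_url).text, 'lxml')
    # each <url><loc>...</loc></url> entry in the sitemap holds one post URL
    return [loc.text for loc in soup.select('url > loc')]

all_urls = get_sitemap_urls('http://punefirst.com/post-sitemap.xml/')
print(len(all_urls))   # should be around the 548 URLs mentioned in the question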

Searching through HTML pages for certain text?

I wanted to play around with python to learn it, so I'm taking on a little project, but a part of it requires me to search for a name on this list:
https://bughunter.withgoogle.com/characterlist/1
(the number 1 is to be incremented by one each time to search for the next name)
So I will be scraping the HTML. I'm new to Python and would appreciate it if someone could give me an example of how to make this work.
import json
import requests
from bs4 import BeautifulSoup

URL = 'https://bughunter.withgoogle.com'

def get_page_html(page_num):
    r = requests.get('{}/characterlist/{}'.format(URL, page_num))
    r.raise_for_status()
    return r.text

def get_page_profiles(page_html):
    page_profiles = {}
    soup = BeautifulSoup(page_html, 'html.parser')
    for table_cell in soup.find_all('td'):
        profile_name = table_cell.find_next('h2').text
        profile_url = table_cell.find_next('a')['href']
        page_profiles[profile_name] = '{}{}'.format(URL, profile_url)
    return page_profiles

if __name__ == '__main__':
    all_profiles = {}
    for page_number in range(1, 81):
        current_page_html = get_page_html(page_number)
        current_page_profiles = get_page_profiles(current_page_html)
        all_profiles.update(current_page_profiles)
    with open('google_hall_of_fame_profiles.json', 'w') as f:
        json.dump(all_profiles, f, indent=2)
Your question wasn't clear about how you wanted the data structured after scraping, so I just saved the profiles in a dict (with the key/value pair as {profile_name: profile_url}) and then dumped the results to a JSON file.
Let me know if anything is unclear!
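Since your project needs to look a name up in that list, here is a quick sketch of how you could search the resulting dict (the file name matches the script above; the name queried is just a placeholder):
import json

with open('google_hall_of_fame_profiles.json') as f:
    all_profiles = json.load(f)

query = 'Jane Doe'   # placeholder - replace with the name you are searching for
matches = {name: url for name, url in all_profiles.items()
           if query.lower() in name.lower()}
print(matches or 'No matching profile found')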
Try this. You will need to install bs4 first (Python 3). It will get all of the names of the people on the website page:
from bs4 import BeautifulSoup as soup
import urllib.request

text = str(urllib.request.urlopen('https://bughunter.withgoogle.com/characterlist/1').read())
text = soup(text, 'html.parser')
print(text.findAll(class_='item-list')[0].get_text())

Python3 scraper. Doesn't parse the xpath till the end

I'm using the lxml.html module.
from lxml import html

page = html.parse('http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution')
# print(page.content)
unis = page.xpath('//tr/td[@valign="top" and @style="width: 50%;padding-right:15px"]/h3/text()')
print(len(unis))

with open('workfile.txt', 'w') as f:
    for uni in unis:
        f.write(uni + '\n')
The website right here (http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution#Z) is full of universities.
The problem is that it only parses up to the letter 'H' (244 unis).
I can't understand why, since as far as I can see it parses all of the HTML to the end.
I have also checked that 244 is not a limit of a list or anything else in Python 3.
That HTML page simply isn't valid HTML; it's totally broken. But the following will do what you want. It uses the BeautifulSoup parser.
from lxml.html.soupparser import parse
import urllib.request

url = 'http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution'
page = parse(urllib.request.urlopen(url))
unis = page.xpath('//tr/td[@valign="top" and @style="width: 50%;padding-right:15px"]/h3/text()')
See http://lxml.de/lxmlhtml.html#really-broken-pages for more info.
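If you then want the same workfile.txt output your original script produced, the whole thing put together looks roughly like this (same XPath as above):
from lxml.html.soupparser import parse
import urllib.request

url = 'http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution'
page = parse(urllib.request.urlopen(url))
unis = page.xpath('//tr/td[@valign="top" and @style="width: 50%;padding-right:15px"]/h3/text()')

print(len(unis))   # should now go past the 244 entries you were stuck at

with open('workfile.txt', 'w') as f:
    for uni in unis:
        f.write(uni.strip() + '\n')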
For web scraping, I recommend using BeautifulSoup 4.
With bs4 this is easily done:
from bs4 import BeautifulSoup
import urllib.request

universities = []
result = urllib.request.urlopen('http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution#Z')
soup = BeautifulSoup(result.read(), 'html.parser')

table = soup.find_all(lambda tag: tag.name == 'table')
for t in table:
    rows = t.find_all(lambda tag: tag.name == 'tr')
    for r in rows:
        # there are also the A-Z headers -> check length
        # there are also empty headers -> check isspace()
        headers = r.find_all(lambda tag: tag.name == 'h3' and not tag.text.isspace() and len(tag.text.strip()) > 2)
        for h in headers:
            universities.append(h.text)
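As a quick check, you can append these lines to the end of the snippet above to confirm the scrape covered the whole A-Z list:
print(len(universities))    # total number of entries found
print(universities[:3])     # first few entries (letter A)
print(universities[-3:])    # last few entries (letter Z)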
