Web Scraping on Dynamic JS-Loaded Sites


I am scraping the following page: COVID. I need to generate a CSV from the table that appears on the page, but the table is loaded dynamically, so I am using Selenium. Even so, I cannot find the table with the following code:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
#url of the page we want to scrape
url = "https://saluddigital.ssch.gob.mx/covid/"
# initiating the webdriver. Parameter includes the path of the webdriver.
driver = webdriver.Firefox()
driver.get(url)
# this is just to ensure that the page is loaded
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
print(len(soup.find_all("table")))
driver.close()
driver.quit()
When I print the number of tables I get 0, since none are found.

I was also trying to extract the data and generate a CSV file. I hope this helps.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import csv
url = "https://saluddigital.ssch.gob.mx/covid/"
# Start the Chrome webdriver.
driver = webdriver.Chrome()
driver.get(url)
time.sleep(5)  # give the JavaScript time to load the table
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
table = soup.select_one('div.contenedor-general')
header = [[a.getText(strip=True,separator=' ')][0].split() for a in table.find_all('tr', {'class': 'header-table'})]
text1 = [t.text.strip().split() for t in soup.find_all('tr', {'class': 'ringlon-1'})]
text2 = [t.text.strip().split() for t in soup.find_all('tr', {'class': 'ringlon-2'})]
# newline='' stops the csv module from writing blank lines on Windows
with open('outz.csv', 'w', newline='') as f:
    wr = csv.writer(f, delimiter=',')
    wr.writerow(header[0][1:])
    for row in text1:
        wr.writerow(row)
    for row in text2:
        wr.writerow(row)
driver.quit()
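A fixed time.sleep(5) can be too short on a slow connection and wasteful on a fast one. As a rough sketch, the page source can be polled until the expected markup shows up. The helper name wait_for_html and its parameters are hypothetical, not part of Selenium; Selenium also provides WebDriverWait in selenium.webdriver.support.ui for the same purpose.

```python
import time


def wait_for_html(get_html, marker, timeout=10.0, interval=0.5):
    """Poll get_html() until its result contains marker, then return it.

    get_html is any zero-argument callable returning the current page
    source, for example lambda: driver.page_source (hypothetical usage).
    Raises TimeoutError if marker has not appeared once timeout seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = get_html()
        if marker in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} not found after {timeout} s")
        time.sleep(interval)
```

With a Selenium driver, html = wait_for_html(lambda: driver.page_source, 'ringlon-1') could stand in for the sleep, assuming the rows carry the ringlon-1 class used in the code above.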

It looks like you only need to perform a simple GET to https://saluddigital.ssch.gob.mx/app/asincronos/jsonstats.ashx?getconteos=1 and parse the JSON response.
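A minimal sketch of that approach, using only the standard library. The shape of the JSON response is not shown in this thread, so the code assumes the rows arrive as a list of flat objects; fetch_json and records_to_csv are hypothetical helper names, and the payload may need unwrapping first if it is nested.

```python
import csv
import io
import json
from urllib.request import urlopen

# Endpoint taken from the answer; the shape of its response is an assumption.
URL = "https://saluddigital.ssch.gob.mx/app/asincronos/jsonstats.ashx?getconteos=1"


def fetch_json(url, timeout=30):
    """GET url and decode the body as JSON (assumes a UTF-8 response)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def records_to_csv(records):
    """Render a list of flat dicts as CSV text, one row per dict."""
    # Collect column names in first-seen order so no key is dropped.
    fields = []
    for rec in records:
        for key in rec:
            if key not in fields:
                fields.append(key)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```

It is worth printing fetch_json(URL) once to see the real structure before passing the relevant list to records_to_csv and writing the returned text to a file.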

Related

Can't identify a table when scraping

Beginner question. I'm attempting to scrape data from a table, but I can't seem to locate it. I've tried identifying it by class and by id, but I get 0 results. The code and output are below.
# Import necessary packages
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
# Site URL
url="https://fbref.com/en/comps/9/stats/Premier-League-Stats"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse HTML code for the entire site
soup = BeautifulSoup(html_content, "lxml")
#print(soup.prettify()) # print the parsed data of html
gdp = soup.find_all("table", attrs={"id": "stats_standard"})
print("Number of tables on site: ",len(gdp))
Output - 'Number of tables on site: 0'

I suggest using Selenium for this kind of scraping, since it renders the page in a real browser before you read the HTML.
This code should work for you:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
option = Options()
option.add_argument('--headless')
url = 'https://fbref.com/en/comps/9/stats/Premier-League-Stats'
driver = webdriver.Chrome(options=option)
driver.get(url)
bs = BeautifulSoup(driver.page_source, 'html.parser')
gdp = bs.find_all('table', {'id': 'stats_standard'})
driver.quit()
print("Number of tables on site: ",len(gdp))
Output
Number of tables on site: 1

Can you find the table(s) without using attrs={"id": "stats_standard"}?
I have checked and indeed I cannot find any table whose ID is stats_standard (but there is one with ID stats_standard_sh, for example). So I guess you might be using the wrong ID.
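One quick way to check which IDs are actually present is to list the id of every table in the HTML you received. The sketch below uses the standard library's html.parser instead of BeautifulSoup; the sample markup is made up for illustration, and TableIdCollector and list_table_ids are hypothetical names.

```python
from html.parser import HTMLParser


class TableIdCollector(HTMLParser):
    """Collect the id attribute of every <table> start tag."""

    def __init__(self):
        super().__init__()
        self.table_ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # attrs is a list of (name, value) pairs; id may be absent.
            self.table_ids.append(dict(attrs).get("id"))


def list_table_ids(html):
    """Return the ids (or None) of all tables found in html, in order."""
    collector = TableIdCollector()
    collector.feed(html)
    return collector.table_ids


# Made-up sample; in practice pass requests.get(url).text or driver.page_source.
sample = '<div><table id="stats_standard_sh"></table><table></table></div>'
print(list_table_ids(sample))  # ['stats_standard_sh', None]
```

An empty list means the HTML you fetched contains no table tags at all, which points to a loading problem and not to a wrong ID.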

Beautifulsoup can not find table containing specific class

from bs4 import BeautifulSoup
import requests
url="https://www.calculator.net/currency-calculator.html"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse the html content
soup = BeautifulSoup(html_content,'html5lib')
print(soup.prettify()) # print the parsed data of html
conv_table = soup.find("table", attrs={"class":"cinfoT "})
conv_data = conv_table.tbody.find_all("tr")
I wrote the script above to get the table listed on this website.
When I run it, conv_table is None.
If you visit the website, I want to extract the second, larger table, whose class name contains "cinfoT ". I have also checked that there is a trailing blank space in the class name.
Please help me out.
Thanks in advance.

This is because the data is loaded by JavaScript. Try Selenium; requests only gives you the plain HTML file, before any scripts run.
from selenium import webdriver
from bs4 import BeautifulSoup
DRIVER_PATH="Your selenium chrome driver path"
url = 'https://www.calculator.net/currency-calculator.html'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get(url)
html_soup = BeautifulSoup(driver.page_source, 'html.parser')
# BeautifulSoup splits class values on whitespace, so match the token without the trailing space
table = html_soup.find_all('table', class_='cinfoT')
driver.quit()
print(table[0].tbody)

Extract all links from drop down list combination

I have a sample website and I want to extract all the href links from it. It has two drop-downs, and once an option is selected in each, it displays results with a link to a manual to download.
It does not navigate to a different page; instead it shows the results on the same page. I have extracted the combinations of drop-down values, but when I try to extract the manual links I am unable to find them.
The code is as follows:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time
from bs4 import BeautifulSoup
import requests
url = "https://www.cars.com/"
driver = webdriver.Chrome('C:/Users/webdrivers/chromedriver.exe')
driver.get(url)
time.sleep(4)
selectYear = Select(driver.find_element_by_id("odl-selected-year"))
data = []
for yearOption in selectYear.options:
    yearText = yearOption.text
    selectYear.select_by_visible_text(yearText)
    time.sleep(1)
    selectModel = Select(driver.find_element_by_id("odl-selected-model"))
    for modelOption in selectModel.options:
        modelText = modelOption.text
        selectModel.select_by_visible_text(modelText)
        data.append([yearText, modelText])
        page = requests.get(url, headers=headers)
        soup = BeautifulSoup(page.text, 'html.parser')
        content = soup.findAll('div', attrs={"class": "odl-results-container"})
        for i in content:
            x = i.findAll(['h3', 'span'])
            for y in x:
                print(y.get_text())
The print calls do not show any data. How can I get the links for the manuals? Thanks in advance.

You need to click the button for each car model and year and then retrieve the rendered HTML page source from your Selenium webdriver rather than with requests.
Add this in your inner loop:
button = driver.find_element_by_link_text("Select this vehicle")
button.click()
page = driver.page_source
soup = BeautifulSoup(page, 'html.parser')
content = soup.findAll('a',attrs={"class":"odl-download-link"})
for i in content:
    print(i["href"])
This prints out:
http://www.fordservicecontent.com/Ford_Content/vdirsnet/OwnerManual/Home/Index?Variantid=6875&languageCode=EN&countryCode=USA&marketCode=US&bookcode=O91668&VIN=&userMarket=GBR
http://www.fordservicecontent.com/Ford_Content/vdirsnet/OwnerManual/Home/Index?Variantid=7126&languageCode=EN&countryCode=USA&marketCode=US&bookcode=O134871&VIN=&userMarket=GBR
http://www.fordservicecontent.com/Ford_Content/vdirsnet/OwnerManual/Home/Index?Variantid=7708&languageCode=EN&countryCode=USA&marketCode=US&bookcode=O177941&VIN=&userMarket=GBR
...

BeautifulSoup find_all() returns nothing []

I'm trying to scrape all the offers on this page and want to iterate over <p class="white-strip">, but page_soup.find_all("p", "white-strip") returns an empty list [].
My code so far:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.sbicard.com/en/personal/offers.page#all-offers'
# Opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parsing
page_soup = soup(page_html, "lxml")
Edit: I got it working using Selenium; the code I used is below. However, I have not been able to figure out another way to do the same thing.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome(r"C:\chromedriver_win32\chromedriver.exe")  # raw string keeps the backslashes literal
driver.get('https://www.sbicard.com/en/personal/offers.page#all-offers')
# html parsing
page_soup = BeautifulSoup(driver.page_source, 'lxml')
# grabs each offer
containers = page_soup.find_all("p", {'class':"white-strip"})
filename = "offers.csv"
f = open(filename, "w")
header = "offer-list\n"
f.write(header)
for container in containers:
    offer = container.span.text
    f.write(offer + "\n")
f.close()
driver.close()

If you search the page source for any of the offers, you will find them inside a script tag containing var offerData. To get the desired content out of that script, you can try the following.
import re
import json
import requests
url = "https://www.sbicard.com/en/personal/offers.page#all-offers"
res = requests.get(url)
p = re.compile(r"var offerData=(.*?);",re.DOTALL)
script = p.findall(res.text)[0].strip()
items = json.loads(script)
for item in items['offers']['offer']:
    print(item['text'])
The output looks like:
Upto Rs 8000 off on flights at Yatra
Electricity Bill payment – Phonepe Offer
25% off on online food ordering
Get 5% cashback at Best Price stores
Get 5% cashback
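One caveat with the pattern var offerData=(.*?); is that the non-greedy match stops at the first semicolon, so the JSON would be cut short if any offer text happened to contain one. A possible alternative, sketched below, is to let json.JSONDecoder.raw_decode read exactly one JSON value starting right after the marker. The sample string is made up, and extract_offer_data is a hypothetical helper name; the key names follow the code above.

```python
import json

MARKER = "var offerData="


def extract_offer_data(page_text, marker=MARKER):
    """Return the JSON value assigned right after marker in page_text."""
    # str.index raises ValueError if the marker is missing.
    start = page_text.index(marker) + len(marker)
    # raw_decode does not skip leading whitespace, so do it here.
    while page_text[start].isspace():
        start += 1
    # raw_decode parses one JSON value and ignores whatever follows it.
    value, _end = json.JSONDecoder().raw_decode(page_text, start)
    return value


# Made-up sample with a semicolon inside a string value.
sample = (
    '<script>var offerData={"offers": {"offer": '
    '[{"text": "5% cashback; T&C apply"}]}};</script>'
)
for item in extract_offer_data(sample)["offers"]["offer"]:
    print(item["text"])  # 5% cashback; T&C apply
```

With the real page, extract_offer_data(res.text) would take the place of the regex and json.loads steps.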

The website renders its data dynamically.
Try the Selenium automation library. It lets you scrape pages whose data is rendered dynamically (by JavaScript or AJAX).
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome("/usr/bin/chromedriver")
driver.get('https://www.sbicard.com/en/personal/offers.page#all-offers')
page_soup = BeautifulSoup(driver.page_source, 'lxml')
p_list = page_soup.find_all("p", {'class':"white-strip"})
print(p_list)
where '/usr/bin/chromedriver' is the path to the Selenium web driver.
Download the Selenium web driver for the Chrome browser:
http://chromedriver.chromium.org/downloads
Install the web driver for the Chrome browser:
https://christopher.su/2015/selenium-chromedriver-ubuntu/
Selenium tutorial:
https://selenium-python.readthedocs.io/

Python Selenium accessing HTML source - After Search

Source Code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup
path = "C:\\Python27\\chromedriver\\chromedriver"
driver = webdriver.Chrome(executable_path=path)
# Open Chrome
driver.get("http://www.thehindu.com/")
# 10 Second Delay
time.sleep(10)
elem = driver.find_element_by_id("searchString")
# Enter Keyword
elem.send_keys("unilever")
elem.send_keys(Keys.RETURN)
time.sleep(10)
# Problem Here
page = driver.page_source
soup = BeautifulSoup(page, 'lxml')
print(soup)
The code is above.
I want to scrape data from "http://www.thehindu.com/". The script searches for the word "unilever" in the search box and is redirected to the results page.
Link for Search Page
My question is: how can I get the source code of the search results page?
Basically, I want news related to "Unilever".

You can get text inside <body>:
body = driver.find_element_by_tag_name("body")
bodyText = body.get_attribute("innerText")
Then you can find your keyword in bodyText.
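As a rough illustration of that last step, the sketch below keeps only the lines of the text that mention the keyword, ignoring case. The sample text is a made-up stand-in for bodyText, and lines_with_keyword is a hypothetical helper name.

```python
def lines_with_keyword(body_text, keyword):
    """Return the lines of body_text that mention keyword, ignoring case."""
    needle = keyword.casefold()
    return [
        line.strip()
        for line in body_text.splitlines()
        if needle in line.casefold()
    ]


# Made-up stand-in for the bodyText string obtained above.
sample_text = (
    "Markets\n"
    "Unilever posts quarterly results\n"
    "Sports\n"
    "Hindustan Unilever to expand\n"
)
for line in lines_with_keyword(sample_text, "unilever"):
    print(line)
```

With the snippet above, lines_with_keyword(bodyText, "unilever") would be the corresponding call.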
