How to scrape a dynamic JavaScript-based website using Python requests and BeautifulSoup? - python

I am scraping https://www.shiksha.com/b-tech/colleges/b-tech-colleges-mumbai-all to collect college information.
On the webpage, only one course name is shown below each college; the remaining courses are loaded via JavaScript, e.g. behind a "+13 More Courses+" link.
So I don't get their info when I use requests.get(url).
How can I scrape such details using requests and BeautifulSoup?
I use Anaconda Jupyter Notebook as my IDE.
I have heard about Selenium but don't know much about it.
Since Selenium is a bit heavy, is there a lighter alternative that loads all the JavaScript content at once?
I have also heard about the Splash framework. If anyone knows it and how to integrate it with Python requests and BeautifulSoup, please answer.
Things I have tried
1. PyQt
Reference: https://www.youtube.com/watch?v=FSH77vnOGqU
I imported different libraries than those used in the video, depending on the PyQt version in Anaconda.
import sys
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from PyQt5.QtWebKitWidgets import QWebPage
import requests
from bs4 import BeautifulSoup

class Client(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self.on_page_load)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def on_page_load(self):
        self.app.quit()

url = "https://www.shiksha.com/b-tech/colleges/b-tech-colleges-mumbai-all"
client_response = Client(url)
src = client_response.mainFrame().toHtml()
soup = BeautifulSoup(src, "lxml")
tpm = soup.find_all("section", {"class": "tpl-curse-dtls more_46905_0"})
print(tpm)
Output: []
2. json() in Requests Module
import requests
from bs4 import BeautifulSoup
url="https://www.shiksha.com/b-tech/colleges/b-tech-colleges-mumbai-all"
r=requests.get(url)
a=r.json()
OUTPUT:
JSONDecodeError: Expecting value: line 3 column 1 (char 3)
3. json.loads() from json module
(Screenshot: request details observed in the browser inspector on clicking the "more courses" link)
import json
import requests

j_url = 'https://www.shiksha.com//nationalCategoryList/NationalCategoryList/loadMoreCourses/'

def j_data(url=j_url):
    # tp comes from an earlier soup.find_all(...) on the listing page
    dt = tp[0].find_all("input", {"id": "remainingCourseIds_46905"})
    output = dt[0]['value']
    data = {
        'courseIds': '231298,231294,231304,231306',
        'loadedCourseCount': 0
        # 'page': page
    }
    response = requests.post(url, data=data)
    return json.loads(response.content)
print(j_data())
OUTPUT:
JSONDecodeError: Expecting value: line 3 column 1 (char 3)
Note: dryscrape is not available for Windows.

You don't need to know what the JavaScript does. Just click the link and use your browser's inspector to observe the network request it makes.
In your specific case, the JavaScript sends a POST request to "/nationalCategoryList/NationalCategoryList/loadMoreCourses/". So you can send the same request and you'll get back a new HTML string. You can parse that string with BeautifulSoup and extract the data you need.
There is one extra step: the POST request needs a payload that specifies parameters. You should be able to find these parameters in the original page. Once you find them, look at their surrounding HTML elements and either use BeautifulSoup or a regular expression to extract them.
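A rough sketch of that approach (the hidden-input id and payload keys below are taken from the question's own snippet and inspector screenshot, so verify them in your browser before relying on this):

import requests
from bs4 import BeautifulSoup

list_url = "https://www.shiksha.com/b-tech/colleges/b-tech-colleges-mumbai-all"
more_url = "https://www.shiksha.com/nationalCategoryList/NationalCategoryList/loadMoreCourses/"

with requests.Session() as s:
    listing = BeautifulSoup(s.get(list_url).text, "lxml")

    # Hidden input holding the remaining course ids (id taken from the question)
    hidden = listing.find("input", {"id": "remainingCourseIds_46905"})
    payload = {
        "courseIds": hidden["value"],   # e.g. "231298,231294,231304,..."
        "loadedCourseCount": 0,
    }

    # The endpoint returns an HTML fragment, not JSON, so parse it with BeautifulSoup
    fragment = BeautifulSoup(s.post(more_url, data=payload).text, "lxml")
    for section in fragment.find_all("section"):
        print(section.get_text(" ", strip=True))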
I hope it helps.

Related

Web scraping a lazy list (lazy loading) using python requests (without selenium/scrapy)

I have written a simple practice script to find who has bought the same tracks as me on Bandcamp, ideally to find accounts with similar taste and discover more music through them.
The problem is that the fan list on an album/track page is lazy loaded. Using Python's requests and bs4 I only get 60 results out of a potential 700.
I am trying to figure out what request to send to load more of the list, e.g. here: https://pitp.bandcamp.com/album/fragments-distancing. After finding what request is sent when I click "more" in the inspector, I tried sending that JSON request with requests, but without any result:
res = requests.get(track_link)
open_more = {"tralbum_type": "a", "tralbum_id": 3542956135, "token": "1:1609185066:1714678:0:1:0", "count": 100}
for i in range(0, 3):
    requests.post(track_link, json=open_more)
Will appreciate any help!
I think just passing a ridiculously large number for count will do. I also automated your script a bit in case you want to get data on other albums:
from urllib.parse import urlsplit
import json
import requests
from bs4 import BeautifulSoup

# build the post link
get_link = "https://pitp.bandcamp.com/album/fragments-distancing"
link = urlsplit(get_link)
base_link = f'{link.scheme}://{link.netloc}'
post_link = f"{base_link}/api/tralbumcollectors/2/thumbs"

with requests.session() as s:
    res = s.get(get_link)
    soup = BeautifulSoup(res.text, 'lxml')

    # the data for tralbum_type and tralbum_id
    # are stored in a script attribute
    key = "data-band-follow-info"
    data = soup.select_one(f'script[{key}]')[key]
    data = json.loads(data)

    open_more = {
        "tralbum_type": data["tralbum_type"],
        "tralbum_id": data["tralbum_id"],
        "count": 1000,
    }
    r = s.post(post_link, json=open_more).json()
    print(r['more_available'])  # if not false put a bigger count

Python : Extract requests query parameters that any given URL receives

Background:
Typically, if I want to see what type of requests a website is getting, I would open up chrome developer tools (F12), go to the Network tab and filter the requests I want to see.
Example:
Once I have the request URL, I can simply parse the URL for the query string parameters I want.
This is a very manual task and I thought I could write a script that does this for any URL I provide. I thought Python would be great for this.
Task:
I have found a library called requests that I use to validate the URL before opening.
testPage = "http://www.google.com"
validatedRequest = str(requests.get(testPage, verify=False).url)
page = urlopen(validatedRequest)
However, I am unsure of how to get the requests that the URL I enter receives. Is this possible in Python? A pointer in the right direction would be great. Once I know how to access these request headers, I can easily parse through them.
Thank you.
You can use the urlparse method to fetch the query params
Demo:
import requests
import urllib
from urlparse import urlparse
testPage = "http://www.google.com"
validatedRequest = str(requests.get(testPage, verify=False).url)
page = urllib.urlopen(validatedRequest)
print urlparse(page.url).query
Result:
gfe_rd=cr&dcr=0&ei=ISdiWuOLJ86dX8j3vPgI
Tested in python2.7
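If you are on Python 3, a minimal equivalent sketch (urlparse moved to urllib.parse, and requests' final URL already reflects any redirects):

import requests
from urllib.parse import urlparse, parse_qs

testPage = "http://www.google.com"
validatedRequest = requests.get(testPage, verify=False).url

query = urlparse(validatedRequest).query
print(query)            # raw query string, e.g. "gfe_rd=cr&dcr=0&..."
print(parse_qs(query))  # parsed into a dict of parameter lists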

How to search within a website using the python requests module and return the response page?

I want to retrieve data from a website called myip.ms. I'm using requests to send data to the search form and then I want the results page back. When I run the script it returns the same page (the homepage) in the response instead of the results for the query I provide. I'm new to web scraping. Here's the code I'm using:
import requests
from urllib.parse import urlencode, quote_plus

payload = {
    'name': 'educationmaza.com',
    'value': 'educationmaza.com',
}
payload = urlencode(payload)

r = requests.post("http://myip.ms/s.php", data=payload)

infile = open("E://abc.html", 'wb')
infile.write(r.content)
infile.close()
I'm no expert, but it appears that when interacting with the webpage, the form post is processed by jQuery, which requests does not handle well.
As such, you would have to use the Selenium module to interact with it.
The following code will execute as desired:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get("https://myip.ms/s.php")
driver.find_element_by_id("home_txt").send_keys('educationmaza.com')
driver.find_element_by_id("home_submit").click()
html = driver.page_source
infile=open("stack.html",'w')
infile.write(html)
infile.close()
You will have to install the Selenium package, as well as PhantomJS.
I have tested this code, and it works fine. Let me know if you need any further help!
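Note that PhantomJS is no longer maintained. As a rough alternative sketch (assuming chromedriver is installed and keeping the same Selenium 3 style locators as above), the same flow with headless Chrome would look something like this:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")           # run Chrome without a visible window
driver = webdriver.Chrome(options=options)   # assumes chromedriver is on your PATH

driver.get("https://myip.ms/s.php")
driver.find_element_by_id("home_txt").send_keys("educationmaza.com")
driver.find_element_by_id("home_submit").click()

with open("stack.html", "w", encoding="utf-8") as f:
    f.write(driver.page_source)
driver.quit()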

html get around the noscript tag

I'm using the Python library requests to download some webpages and do some parsing afterwards, e.g. getting the title of the page. However, it seems requests can't download the source correctly when there's a <noscript> tag on some webpages.
For example, when trying to get the source of https://www.coursera.org/course/startup, the source I get from requests is different from what I see when visiting the page with Chrome. The source requests gets is the same as the "view source" option in Chrome.
So is there any way to "fool" the <noscript> tag, or do I need to use something other than requests?
"The source requests get is the same with the view source option in Chrome" ...view source gives you the real html source of the url, same as requests gets. So what you're seeing is what you should expect to see.
Your problem is nothing to do with the noscript tag, it's that the content of the page is changed via javascript after loading.
As #alecxe pointed out, you need to look deeper into how the coursera site is built, eg observing XHR requests in the 'Network' tab of Chrome Developer Tools, to see the urls where the actual content you're looking for is loaded from... then you may be able to just load those urls directly with Requests.
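A minimal sketch of that idea (the endpoint below is a placeholder; copy the real XHR URL and any required headers from the Network tab):

import requests

# Placeholder URL: replace with the actual XHR endpoint observed in the Network tab
api_url = "https://www.coursera.org/api/<endpoint-seen-in-network-tab>"

resp = requests.get(api_url, headers={"Accept": "application/json"})
resp.raise_for_status()
data = resp.json()  # such endpoints usually return JSON rather than HTML
print(data)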
Alternatively, there is a tutorial here on how to get around the problem of rendering a JavaScript web page from Python:
https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/
They provide example code that looks like this:
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml import html

# Take this class for granted. Just use the result of rendering.
class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

url = 'http://pycoders.com/archive/'
r = Render(url)
result = r.frame.toHtml()
# This step is important. Converting QString to ASCII for lxml to process.
archive_links = html.fromstring(str(result.toAscii()))
print archive_links
This particular page is rendered via a set of asynchronous XHR calls to the Coursera API. Then, the API responses are used to construct the page. This is all done by the browser.
requests simply downloads the initial HTML page which, in this case, is basically a container for a lot of other things. requests doesn't have a JavaScript engine built in; it is not a browser.
Depending on what you are going to do next, you can either automate a real browser (headless or not) with the help of, for example, selenium, or mimic the API requests made by the browser. The latter approach involves exploring the Coursera API and using browser developer tools to see which API endpoints are used to fill the page with data.
Example (using selenium and Chrome browser):
>>> from selenium import webdriver
>>> from selenium.webdriver.common.by import By
>>> from selenium.webdriver.support.ui import WebDriverWait
>>> from selenium.webdriver.support import expected_conditions as EC
>>>
>>> driver = webdriver.Chrome()
>>> driver.get('https://www.coursera.org/course/startup')
>>> element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.c-coursePage-header h1")))
>>> element.text
u'Startup Engineering'

How to use python urlopen to scrape a page after it finishes loading all search results?

I am trying to scrape air ticket info (including plane info, price info, etc.) from http://flight.qunar.com/ using python3 and BeautifulSoup. Below is the Python code I am using. In this code I try to scrape flight info from Beijing (北京) to Lijiang (丽江) on 2012-07-25.
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup
url = 'http://flight.qunar.com/site/oneway_list.htm'
values = {'searchDepartureAirport':'北京', 'searchArrivalAirport':'丽江', 'searchDepartureTime':'2012-07-25'}
encoded_param = urllib.parse.urlencode(values)
full_url = url + '?' + encoded_param
response = urllib.request.urlopen(full_url)
soup = BeautifulSoup(response)
print(soup.prettify())
What I get is the initial page right after submitting the request, while the page is still loading the search results. What I want is the final page after it finishes loading the search results. How can I achieve this using Python?
The problem is actually quite hard - the site uses dynamically generated content that gets loaded via JavaScript, however urllib gets basically only what you would get in a browser if you disabled JavaScript. So, what can we do?
Use Selenium, PhantomJS, or Crowbar to fully render the webpage (they are essentially headless, automated browsers for testing and scraping).
Or, if you want a (semi-)pure Python solution, use PyQt4.QtWebKit to render the page. It works approximately like this:
import sys
import signal
from PyQt4.QtCore import QUrl, SIGNAL
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

url = "http://www.stackoverflow.com"

def page_to_file(page):
    # Write the fully rendered HTML once loading has finished
    with open("output", 'w') as f:
        f.write(page.mainFrame().toHtml())

app = QApplication(sys.argv)
page = QWebPage()
signal.signal(signal.SIGINT, signal.SIG_DFL)
# loadFinished(bool) fires when rendering is complete; hand the page to the handler
page.connect(page, SIGNAL('loadFinished(bool)'), lambda ok: page_to_file(page))
page.mainFrame().load(QUrl(url))
sys.exit(app.exec_())
Edit: There's a nice explanation of how this works here.
PS: You may want to look into requests instead of using urllib :)
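A rough sketch of the requests version (it still only fetches the initial, pre-JavaScript page, but it is a bit tidier than the urllib code in the question):

import requests
from bs4 import BeautifulSoup

params = {
    'searchDepartureAirport': '北京',
    'searchArrivalAirport': '丽江',
    'searchDepartureTime': '2012-07-25',
}
response = requests.get('http://flight.qunar.com/site/oneway_list.htm', params=params)
soup = BeautifulSoup(response.text, 'lxml')  # any installed parser will do
print(soup.prettify())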
