Full page source (prior to JS rendering) using selenium-python? - python

I am scraping data from a site with a paginated table (max results 500, with 25 results per page). When I use Chrome to "view source" I can see all 500 results; however, once the JS renders in Selenium, only 25 results show when using driver.page_source.
I have tried passing the cookies and headers off to requests, but that's not reliable, and I need to stick with Selenium. I have also put together a janky solution of clicking through the paginator's next button, but there must be a better way!
So how does one capture the full page source prior to JS rendering using selenium with the python bindings?

There might be a simpler way, but it turns out you can do all kinds of asynchronous things from the browser, including fetch:
def fetch(url):
    return driver.execute_async_script("""
        (async () => {
            let r = await fetch('""" + url + """')
            arguments[0](await r.text())
        })()
    """)
html = fetch('https://stackoverflow.com/')
Same-origin policy will apply.
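If the goal is just to count or parse those 500 rows from the pre-render HTML, the returned string can be handed straight to BeautifulSoup. A minimal sketch, assuming a hypothetical table#results tr selector for the result rows:
from bs4 import BeautifulSoup

# re-fetch the raw HTML of the page currently loaded in the driver
# (subject to the same-origin note above)
html = fetch(driver.current_url)
soup = BeautifulSoup(html, "html.parser")

# 'table#results tr' is a placeholder selector -- adjust to the real markup
rows = soup.select("table#results tr")
print(len(rows))  # ideally all 500 results, not just the 25 visible after JS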

Related

Wait Before Scraping using BeautifulSoup

I'm trying to scrape data from this review site. It first goes through the first page, checks if there's a 2nd page, and then goes to that one too. The problem is in getting to the 2nd page: the page takes time to update, and I still get the first page's data instead of the 2nd.
For example, if you go here, you will see how it takes time to load the page 2 data.
I tried putting in a timeout or sleep, but it didn't work. I'd prefer a solution with minimal package/browser dependencies (like webdriver.PhantomJS()), as I need to run this code on my employer's environment and I'm not sure if I can use it. Thank you!!
from urllib.request import Request, urlopen
from time import sleep
from socket import timeout
from bs4 import BeautifulSoup

req = Request(softwareadvice, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req, timeout=10).read()
webpage = web_byte.decode('utf-8')
parsed_html = BeautifulSoup(webpage, features="lxml")
true = parsed_html.find('div', {'class': ['Grid-cell--1of12 pagination-arrows pagination-arrows-right']})
while(true):
    true = parsed_html.find('div', {'class': ['Grid-cell--1of12 pagination-arrows pagination-arrows-right']})
    if(not True):
        true = False
    else:
        req = Request(softwareadvice+'?review.page=2', headers=hdr)
        sleep(10)
        webpage = urlopen(req, timeout=10)
        sleep(10)
        webpage = webpage.read().decode('utf-8')
        parsed_html = BeautifulSoup(webpage, features="lxml")
The reviews are loaded from an external source via an Ajax request. You can use this example of how to load them:
import re
import json
import requests
from bs4 import BeautifulSoup

url = "https://www.softwareadvice.com/sms-marketing/twilio-profile/reviews/"
api_url = (
    "https://pkvwzofxkc.execute-api.us-east-1.amazonaws.com/production/reviews"
)

params = {
    "q": "s*|-s*",
    "facet.gdm_industry_id": '{"sort":"bucket","size":200}',
    "fq": "(and product_id: '{}' listed:1)",
    "q.options": '{"fields":["pros^5","cons^5","advice^5","review^5","review_title^5","vendor_response^5"]}',
    "size": "50",
    "start": "50",
    "sort": "completeness_score desc,date_submitted desc",
}

# get product id
soup = BeautifulSoup(requests.get(url).content, "html.parser")
a = soup.select_one('a[href^="https://reviews.softwareadvice.com/new/"]')
id_ = int("".join(re.findall(r"\d+", a["href"])))

params["fq"] = params["fq"].format(id_)

for start in range(0, 3):  # <-- increase the number of pages here
    params["start"] = 50 * start
    data = requests.get(api_url, params=params).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    # print some data:
    for h in data["hits"]["hit"]:
        if "review" in h["fields"]:
            print(h["fields"]["review"])
            print("-" * 80)
Prints:
After 2 years using Twilio services, mainly phone and messages, I can say I am so happy I found this solution to handle my communications. It is so flexible, Although it has been a little bit complicated sometimes to self-learn about online phoning systems it saved me from a lot of hassles I wanted to avoid. The best benefit you get is the ultra efficient support service
--------------------------------------------------------------------------------
An amazingly well built product -- we rarely if ever had reliability issues -- the Twilio Functions were an especially useful post-purchase feature discovery -- so much so that we still use that even though we don't do any texting. We also sometimes use FracTEL, since they beat Twilio on pricing 3:1 for 1-800 texts *and* had MMS 1-800 support long before Twilio.
--------------------------------------------------------------------------------
I absolutely love using Twilio, have had zero issues in using the SIP and text messaging on the platform.
--------------------------------------------------------------------------------
Authy by Twilio is a run-of-the-mill 2FA app. There's nothing special about it. It works when you're not switching your hardware.
--------------------------------------------------------------------------------
We've had great experience with Twilio. Our users sign up for text notification and we use Twilio to deliver them information. That experience has been well-received by customers. There's more to Twilio than that but texting is what we use it for. The system barely ever goes down and always shows us accurate information of our usage.
--------------------------------------------------------------------------------
...and so on.
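If you want every page rather than a fixed three, a hedged variation on the loop above (reusing api_url and params from the same script) is to keep requesting until the API returns no more hits:
all_reviews = []
start = 0
while True:
    params["start"] = 50 * start
    data = requests.get(api_url, params=params).json()
    hits = data["hits"]["hit"]
    if not hits:  # no more pages
        break
    for h in hits:
        if "review" in h["fields"]:
            all_reviews.append(h["fields"]["review"])
    start += 1

print(len(all_reviews))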
I have been scraping many types of websites, and I think that in the world of scraping there are roughly two types of websites.
The first is "URL-based" websites (i.e. you send a request with a URL and the server responds with HTML tags from which elements can be directly extracted), and the second is "JavaScript-rendered" websites (i.e. the only response you get is JavaScript, and you can only see the HTML tags after it has run).
In the former case, you can freely navigate through the website with bs4. But in the latter case, you cannot always rely on URLs as a rule of thumb.
The site you are going to scrape is built with Angular.js, which is based on client-side rendering. So the response you get is JavaScript code, not HTML tags with the page content in them. You have to run that code to get the content.
About the code you introduced:
req = Request(softwareadvice, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req, timeout=10).read() # response is javascript, not page content you want...
webpage = web_byte.decode('utf-8')
All you can get is the JavaScript code that must be run to produce the HTML elements. That is why you get the same page (response) every time.
So, what to do? Is there any way to run JavaScript within bs4? I don't think there is an appropriate way to do that. You can use Selenium for this one. You can literally wait until the page fully loads, click buttons and anchors, or get the page content at any time.
Running the browser headless in Selenium might also work, which means you don't have to watch the controlled browser open on your computer.
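As a rough sketch of that advice, assuming a hypothetical review-card selector ('div.review-card' here stands in for the real one), headless Chrome plus an explicit wait could look like this:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # no visible browser window
driver = webdriver.Chrome(options=options)

driver.get("https://www.softwareadvice.com/sms-marketing/twilio-profile/reviews/")

# wait until at least one review element has been rendered by the client-side JS
# ('div.review-card' is a placeholder selector -- inspect the page for the real one)
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.review-card"))
)

soup = BeautifulSoup(driver.page_source, "lxml")
print(len(soup.select("div.review-card")))
driver.quit()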
Here are some links that might be of help to you.
scrape html generated by javascript with python
https://sadesmith.com/2018/06/15/blog/scraping-client-side-rendered-data-with-python-and-selenium
Thanks for reading.

Scrape displayed data through onclick using Selenium and Scrapy

I'm writing a script in Python using Scrapy in order to scrape data from a website that requires authentication.
The page I'm scraping is really painful because it is mainly built with JavaScript and AJAX requests. The whole body of the page is put inside a <form> that allows changing the page using a submit button. The URL doesn't change (and it's a .aspx).
I have successfully managed to scrape all the data I need from page one, and then change page by clicking on this input button using this code:
yield FormRequest.from_response(response,
                                formname="Form",
                                clickdata={"class": "PageNext"},
                                callback=self.after_login)
The after_login method is scraping the data.
However, I need data that appears in another div after clicking on a container with an onclick attribute. I need to loop so that I click on each container, display the data, scrape it, and only after that go to the next page and repeat the process.
The thing is, I can't figure out how to make the script click on each container using Selenium (while staying logged in, otherwise I cannot get to this page) and then have Scrapy scrape the data once the XHR request has been made.
I did a lot of research on the internet but couldn't get any solution to work.
Thanks !
Ok so I've almost got what I want, following @malberts' advice.
I've used this kind of code in order to get the Ajax response:
yield scrapy.FormRequest.from_response(
    response=response,
    formdata={
        'param1': param1value,
        'param2': param2value,
        '__VIEWSTATE': __VIEWSTATE,
        '__ASYNCPOST': 'true',
        'DetailsId': '123'},
    callback=self.parse_item)

def parse_item(self, response):
    ajax_response = response.body
    yield {'Response': ajax_response}
The response is supposed to be HTML. The thing is, the response is not quite the same as the one I see for that request in Chrome Dev Tools. I haven't taken all the form data into account yet (~10 of 25 fields); could it be that it needs all the form data, even the fields that don't change depending on the id?
Thanks !
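For what it's worth, FormRequest.from_response pre-populates the request with every field already present in the response's <form>, and the formdata dict only overrides or adds to those values, so the remaining fields can usually be carried over without listing them by hand. A minimal sketch:
def parse_page(self, response):
    # from_response copies all existing <form> inputs (including hidden
    # ASP.NET state fields such as __VIEWSTATE); formdata only overrides
    # or adds the fields listed here
    yield scrapy.FormRequest.from_response(
        response=response,
        formdata={
            '__ASYNCPOST': 'true',
            'DetailsId': '123',  # hypothetical container id
        },
        callback=self.parse_item)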

Phantomjs through selenium in python

I am trying to test a webpage's behaviour in response to requests from different referrers. This is what I am doing so far:
webdriver.DesiredCapabilities.PHANTOMJS['phantomjs.page.customHeaders.referer'] = referer
The problem is that the webpage makes Ajax requests which change some things in the HTML, and those Ajax requests should have the webpage itself as the referer, not the referer I gave at the start. It seems that the referer is set once at the start, and every subsequent request, be it Ajax, image, or anchor, takes that same referer and it never changes no matter how deep you browse. Is there a solution for choosing the referer only for the first request and having it be dynamic for the rest?
After some searching I found this and tried to achieve it through Selenium, but I have not had any success with it yet:
webdriver.DesiredCapabilities.PHANTOMJS['phantomjs.page.onInitialized'] = """function() {page.customHeaders = {};};"""
Any ideas?
From what I can tell you would need to patch PhantomJS to achieve this.
PhantomJS contains a module called GhostDriver which provides the HTTP API that WebDriver uses to communicate with the PhantomJS instance. So anything you want to do via WebDriver needs to be supported by GhostDriver, but it doesn't seem that onInitialized is supported by GhostDriver.
If you're feeling adventurous you could clone the PhantomJS repository and patch the src/ghostdriver/session.js file to do what you want.
The _init method looks like this:
_init = function() {
    var page;

    // Ensure a Current Window is available, if it's found to be `null`
    if (_currentWindowHandle === null) {
        // Create the first Window/Page
        page = require("webpage").create();
        // Decorate it with listeners and helpers
        page = _decorateNewWindow(page);
        // set session-specific CookieJar
        page.cookieJar = _cookieJar;
        // Make the new Window, the Current Window
        _currentWindowHandle = page.windowHandle;
        // Store by WindowHandle
        _windows[_currentWindowHandle] = page;
    }
},
You could try using the code you found:
page.onInitialized = function() {
    page.customHeaders = {};
};
on the page object created there.
Depending on what you are testing, though, you might be able to save a lot of effort, ditch the browser, and just test the HTTP requests directly using something like the requests module.
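For instance, if the point is just to check how the page responds to different Referer values, a small sketch with requests (the URL and referrers here are placeholders):
import requests

url = "http://www.example.com/page-under-test"
for referer in ["http://google.com/", "http://twitter.com/", ""]:
    # only this initial request carries the custom Referer; any AJAX the
    # page would normally fire is simply not executed here
    r = requests.get(url, headers={"Referer": referer})
    print(referer or "<no referer>", r.status_code, len(r.text))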

How can I use Python to log in to a website and perform actions in it?

These are the steps I need to automate:
1) Log in
2) Select an option from a drop down menu (To acces a list of products)
3) search something on the search field (The product we are looking for)
4) click a link (To open up the product's options)
5) click another link(To compile all the .pdf files relevant to said product in a bigger .pdf)
6) wait for a .pdf to load and then download it.(Save the .pdf on my machine with the name of the product as the file name)
I want to know if this is possible. If it is, where can I find how to do it?
Is it pivotal that there is actual clicking involved? If you're just looking to download PDFs then I suggest you use the Requests library. You might also want to consider using Scrapy.
In terms of searching on the site, you may want to use Fiddler to capture the HTTP POST request and then replicate that in Python.
Here is some code that might be useful as a starting point - these functions would log in to a server and download a target file.
import requests

def login():
    login_url = 'http://www.example.com'
    payload = 'usr=username&pwd=password'
    connection = requests.Session()
    # main_headers and proxies are assumed to be defined elsewhere
    post_login = connection.post(url=login_url,
                                 data=payload,
                                 headers=main_headers,
                                 proxies=proxies,
                                 allow_redirects=True)
    return connection

def download(connection):
    directory = "C:\\example"
    url = "http://example.com/download.pdf"
    filename = directory + '\\' + url[url.rfind("/") + 1:]
    r = connection.get(url=url,
                       headers=main_headers,
                       proxies=proxies)
    file_size = int(r.headers["Content-Length"])
    block_size = 1024
    print("\tDownloading: %s [%sKB]" % (filename, int(file_size / 1024)))
    if r.status_code == 200:
        with open(filename, 'wb') as f:
            for chunk in r.iter_content(block_size):
                f.write(chunk)
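A short usage sketch for the two functions above (with main_headers and proxies assumed to be defined somewhere, or stripped out to match your setup):
session = login()
download(session)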
For static sites you can use the mechanize module, available from PyPI; it does everything you want, except that it does not run JavaScript and thus does not work on dynamic websites. It is also strictly Python 2 only.
easy_install mechanize
For something way more complicated you might have to use the Python bindings for Selenium (install instructions) to control an external browser, or use spynner, which embeds a web browser. However, these two are far more difficult to set up.
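For the static-site case, a hedged mechanize sketch (with made-up URLs and form field names) might look like this:
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # the site's robots.txt may disallow scripts
br.open("http://www.example.com/login")

# select the first form on the page and fill in made-up field names
br.select_form(nr=0)
br["username"] = "user"
br["password"] = "secret"
br.submit()

# once logged in, fetch a page and read the HTML
response = br.open("http://www.example.com/products?q=my+search+term")
html = response.read()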
Sure, just use selenium webdriver
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('http://your-website.com')
search_box = browser.find_element_by_css_selector('input[id=search]')
search_box.send_keys('my search term')
browser.find_element_by_css_selector('input[type=submit]').click()
That would get you through the visit-the-page, enter-the-search-term, and click-search stages of your problem. Read through the API for the rest.
Mechanize has problems at the moment because so much of a web page is generated via JavaScript, and if that isn't rendered you can't do much with the page.
It helps if you understand CSS selectors; otherwise you can find elements by id, XPath, or other means...

How Can I Automatically Add Google Alerts Using Python Mechanize

I'm aware of a Python API for sale here (http://oktaykilic.com/my-projects/google-alerts-api-python/), but I'd like to understand why the way I'm doing it now isn't working.
Here is what I have so far:
import mechanize
import ClientForm

class GAlerts():
    def __init__(self, uName='USERNAME', passWord='PASSWORD'):
        self.uName = uName
        self.passWord = passWord

    def addAlert(self):
        self.cj = mechanize.CookieJar()
        loginURL = 'https://www.google.com/accounts/ServiceLogin?hl=en&service=alerts&continue=http://www.google.com/alerts'
        alertsURL = 'http://www.google.com/alerts'
        # log into google
        initialRequest = mechanize.Request(loginURL)
        response = mechanize.urlopen(initialRequest)
        # put in form info
        forms = ClientForm.ParseResponse(response, backwards_compat=False)
        forms[0]['Email'] = self.uName
        forms[0]['Passwd'] = self.passWord
        # click form and get cookies
        request2 = forms[0].click()
        response2 = mechanize.urlopen(request2)
        self.cj.extract_cookies(response, initialRequest)
        # now go to alerts page with cookies
        request3 = mechanize.Request(alertsURL)
        self.cj.add_cookie_header(request3)
        response3 = mechanize.urlopen(request3)
        # parse forms on this page
        formsAdd = ClientForm.ParseResponse(response3, backwards_compat=False)
        formsAdd[0]['q'] = 'Hines Ward'
        # click it and submit
        request4 = formsAdd[0].click()
        self.cj.add_cookie_header(request4)
        response4 = mechanize.urlopen(request4)
        print response4.read()

myAlerter = GAlerts()
myAlerter.addAlert()
As far as I can tell, it successfully logs in and gets to the page for adding alerts, but when I enter a query and "click" submit it sends me to a page that says "Please enter a valid e-mail address". Is there some kind of authentication I'm missing? I also don't understand how to change the values in Google's custom drop-down menus. Any ideas?
Thanks
The custom drop-down menus are done using JavaScript, so the proper solution would be to figure out the URL parameters and then try to reproduce them (this might be the reason it doesn't work as expected right now - you are omitting required URL parameters that are normally set by JavaScript when you visit the site in a browser).
The lazy solution is to use the galerts library, it looks like it does exactly what you need.
A few hints for future projects involving mechanize (or screen-scraping in general):
Use Fiddler, an extremely useful HTTP debugging tool. It captures HTTP traffic from most browsers and allows you to see what exactly your browser requests. You can then craft the desired request manually and in case it doesn't work, you just have to compare. Tools like Firebug or Google Chrome's developer tools come in handy too, especially for lots of async requests. (you have to call set_proxies on your browser object to use it with Fiddler, see documentation)
For debugging purposes, do something like for f in self.forms(): print f. This shows you all forms mechanize recognized on a page, along with their name.
Handling cookies is repetitive, so - surprise! - there's an easy way to automate it. Just do this in your browser class constructor: self.set_cookiejar(cookielib.CookieJar()). This keeps track of cookies automatically.
I relied for a long time on custom parsers like BeautifulSoup (and I still use it for some special cases), but in most cases the fastest approach to web screen scraping is using XPath (for example, lxml has a very good implementation).
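Putting a couple of those hints together, a hedged sketch of a browser subclass (class and method names here are illustrative, not from the original answer):
import cookielib
import mechanize

class DebuggableBrowser(mechanize.Browser):
    def __init__(self):
        mechanize.Browser.__init__(self)
        # keep track of cookies automatically
        self.set_cookiejar(cookielib.CookieJar())
        # route traffic through Fiddler's default local proxy for inspection
        self.set_proxies({"http": "127.0.0.1:8888", "https": "127.0.0.1:8888"})

    def dump_forms(self):
        # show every form mechanize recognized on the current page
        for f in self.forms():
            print(f)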
Mechanize doesn't handle JavaScript, and those drop-down menus are JS. If you want to do automation where JavaScript is involved, I suggest using Selenium, which also has Python bindings.
http://seleniumhq.org/
