I am trying to use the Requests framework with Python (http://docs.python-requests.org/en/latest/), but the page I am trying to get to uses JavaScript to fetch the info that I want.
I have tried to search the web for a solution, but because I am searching with the keyword "javascript", most of what I get is about how to scrape with the JavaScript language.
Is there any way to use the Requests framework with pages that use JavaScript?
Good news: there is now a requests module that supports javascript: https://pypi.org/project/requests-html/
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://www.yourjspage.com')
r.html.render() # this call executes the js in the page
As a bonus this also wraps an HTML parser (BeautifulSoup-like, I think), so you can do things like
r.html.find('#myElementID', first=True).text
which returns the content of the HTML element as you'd expect.
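Putting those pieces together, a minimal sketch (the URL and selector below are placeholders, and render() downloads Chromium the first time it runs):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('http://www.yourjspage.com')  # placeholder URL
r.html.render()  # runs the page's JavaScript

# find() returns a list by default; first=True gives a single element (or None)
element = r.html.find('#myElementID', first=True)  # placeholder selector
if element is not None:
    print(element.text)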
You are going to have to make the same request (using the Requests library) that the JavaScript is making. You can use any number of tools (including those built into Chrome and Firefox) to inspect the HTTP request that the JavaScript makes, and then simply make that request yourself from Python.
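For example, if the browser's network tab shows the page filling itself from a JSON endpoint, you can often call that endpoint directly. A hedged sketch (the endpoint, parameters and headers below are hypothetical stand-ins for whatever you copy out of the inspector):

import requests

# Hypothetical endpoint copied from the browser's network inspector
api_url = 'http://www.yourjspage.com/api/items'
params = {'page': 1}
headers = {
    'User-Agent': 'Mozilla/5.0',
    'X-Requested-With': 'XMLHttpRequest',  # some endpoints expect this header
}

response = requests.get(api_url, params=params, headers=headers)
response.raise_for_status()
data = response.json()  # the same JSON the page's JavaScript consumes
print(data)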
While Selenium might seem tempting and useful, it has one main problem that can't be fixed: performance. Because it computes every single thing a browser does, it needs a lot more power. Even PhantomJS does not compete with a simple request. I recommend using Selenium only when you really need to click buttons. If you only need JavaScript, I recommend PyQt (check https://www.youtube.com/watch?v=FSH77vnOGqU to learn it).
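If you go the PyQt route, here is a minimal sketch using PyQt5's QtWebEngine (this assumes the PyQtWebEngine package is installed; the linked video may use a different Qt API, so treat this as one possible shape rather than the tutorial's exact code):

import sys
from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEnginePage

class PageRenderer(QWebEnginePage):
    """Loads a URL, lets its JavaScript run, then stores the rendered HTML."""
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        super().__init__()
        self.html = None
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()  # blocks until _store_html calls quit()

    def _on_load_finished(self, ok):
        self.toHtml(self._store_html)  # toHtml is asynchronous and takes a callback

    def _store_html(self, html):
        self.html = html
        self.app.quit()

renderer = PageRenderer('http://www.yourjspage.com')  # placeholder URL
print(renderer.html[:200] if renderer.html else 'load failed')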
However, if you want to use Selenium, I recommend Chrome over PhantomJS. Many users have problems with PhantomJS where a website simply does not work in Phantom. Chrome can be headless (non-graphical) too!
First, make sure you have installed ChromeDriver, which Selenium depends on for using Google Chrome.
Then, make sure you have Google Chrome version 60 or higher by checking chrome://settings/help in the address bar.
Now, all you need to do is the following code:
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(chrome_options=chrome_options)
If you do not know how to use Selenium, here is a quick overview:
driver.get("https://www.google.com") #Browser goes to google.com
Finding elements:
Use either the ELEMENTS or ELEMENT method. Examples:
driver.find_element_by_css_selector("div.logo-subtext") #Find your country in Google. (singular)
driver.find_element(s)_by_css_selector(css_selector) # Every element that matches this CSS selector
driver.find_element(s)_by_class_name(class_name) # Every element with the following class
driver.find_element(s)_by_id(id) # Every element with the following ID
driver.find_element(s)_by_link_text(link_text) # Every link whose text exactly matches the given text
driver.find_element(s)_by_partial_link_text(partial_link_text) # Every link whose text contains the given partial text.
driver.find_element(s)_by_name(name) # Every element where name=argument
driver.find_element(s)_by_tag_name(tag_name) # Every element with the tag name argument
Ok! I found an element (or elements list). But what do I do now?
Here are the methods you can do on an element elem:
elem.tag_name # Could return "button" for a <button> element.
elem.get_attribute("id") # Returns the ID of an element.
elem.text # The inner text of an element.
elem.clear() # Clears a text input.
elem.is_displayed() # True for visible elements, False for invisible elements.
elem.is_enabled() # True for an enabled input, False otherwise.
elem.is_selected() # Is this radio button or checkbox element selected?
elem.location # A dictionary representing the X and Y location of an element on the screen.
elem.click() # Click elem.
elem.send_keys("thelegend27") # Type thelegend27 into elem (useful for text inputs)
elem.submit() # Submit the form in which elem takes part.
Special commands:
driver.back() # Click the Back button.
driver.forward() # Click the Forward button.
driver.refresh() # Refresh the page.
driver.quit() # Close the browser including all the tabs.
foo = driver.execute_script("return 'hello';") # Execute javascript (COULD TAKE RETURN VALUES!)
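Putting the pieces above together, a small headless-Chrome sketch (the URL and selector are placeholders):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(chrome_options=chrome_options)
try:
    driver.get("http://www.yourjspage.com")  # placeholder URL
    elem = driver.find_element_by_css_selector("#myElementID")  # placeholder selector
    print(elem.text)  # text that the page's JavaScript rendered
finally:
    driver.quit()  # always close the browser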
Using Selenium or JavaScript-enabled requests is slow. It is more efficient to find out which cookie is generated after the website checks for JavaScript in the browser, get that cookie, and use it for each of your requests.
In one example it worked through the following cookie: the cookie generated after the JavaScript check is "cf_clearance".
So simply create a session and update the cookie and headers like this:
import requests

url = "https://www.example.com"  # placeholder: the page behind the JavaScript check
s = requests.Session()
s.cookies["cf_clearance"] = "cb4c883efc59d0e990caf7508902591f4569e7bf-1617321078-0-150"
s.headers.update({
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
})
s.get(url)
And you are good to go, no need for a JavaScript solution such as Selenium. This is much faster and more efficient; you just have to get the cookie once after opening up the browser.
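A hedged sketch of that handoff, grabbing the cookies once with Selenium and then reusing them in a plain requests session (cf_clearance is Cloudflare-specific; other sites use other cookie names, and some checks also tie the cookie to your IP and user agent):

import requests
from selenium import webdriver

url = "https://www.example.com"  # placeholder: the page behind the JavaScript check

# Open the page once in a real browser so the JavaScript check can run
driver = webdriver.Chrome()
driver.get(url)
browser_cookies = driver.get_cookies()
driver.quit()

# Reuse the browser's cookies (and a matching user agent) for fast plain requests
s = requests.Session()
for cookie in browser_cookies:
    s.cookies.set(cookie["name"], cookie["value"])
s.headers.update({"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                                "AppleWebKit/537.36 (KHTML, like Gecko) "
                                "Chrome/89.0.4389.90 Safari/537.36"})
print(s.get(url).status_code)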
One way to do that is to invoke your request by using Selenium.
Let's install the dependencies by using pip or pip3:
pip install selenium
pip install webdriver-manager  # needed for ChromeDriverManager below
If you run the script using python3,
use instead:
pip3 install selenium
(...)
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
url = 'http://myurl.com'
# Wait until the page is ready:
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.some_placeholder")))
print(element.text)  # <-- Here it is! Prints e.g. 'Some text on the page :)'
Is it a wrapper around pyppeteer or something? :( I thought it was something different
@property
async def browser(self):
    if not hasattr(self, "_browser"):
        self._browser = await pyppeteer.launch(ignoreHTTPSErrors=not(self.verify), headless=True, args=self.__browser_args)
    return self._browser
I am trying to scrape from Google search results the blue highlighted portion as shown below:
When I use inspect element, it shows: span class="YhemCb". I have tried using various soup.find and soup.find_all commands, but everything I have tried has produced no output so far. What command should I use to scrape this part?
Google uses javascript to display most of its web elements, so using something like requests and BeautifulSoup is unfortunately not enough.
Instead, use selenium! It essentially allows you to control a browser using code.
First, you will need to navigate to the google page you wish to scrape
google_search = 'https://www.google.com/search?q=courtyard+by+marriott+fayetteville+fort+bragg'
driver.get(google_search)
Then, you have to wait until the review page loads in the browser.
This is done using WebDriverWait: you have to specify an element that needs to appear on the page. The [data-attrid="kc:/local:one line summary"] span css selector allows me to select the review info about the hotel.
timeout = 10
expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '[data-attrid="kc:/local:one line summary"] span'))
review_element = WebDriverWait(driver, timeout).until(expectation)
And finally, print the rating
print(review_element.get_attribute('innerHTML'))
Here's the full code in case you want to play around with it
import chromedriver_autoinstaller
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
# setup selenium (I am using chrome here, so chrome has to be installed on your system)
chromedriver_autoinstaller.install()
options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)
# navigate to google
google_search = 'https://www.google.com/search?q=courtyard+by+marriott+fayetteville+fort+bragg'
driver.get(google_search)
# wait until the page loads
timeout = 10
expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '[data-attrid="kc:/local:one line summary"] span'))
review_element = WebDriverWait(driver, timeout).until(expectation)
# print the rating
print(review_element.get_attribute('innerHTML'))
Note: Google is notoriously defensive against anyone trying to scrape it. On the first few attempts you might be successful, but eventually you will have to deal with Google's captcha.
To work around that, I would suggest using a search engine scraper, and something like the quickstart guide to get you started!
Disclaimer: I work at Oxylabs.io
I am able to get cookies from the website just fine, but I am interested in the cookies which the chatbot is using. For example, there are chatbot websites like www.kinguin.net, www.multibankfx.com or coschedule.com.
If we go to these websites, 'inspect element' them and then look under the cookies for secure.livechat.inc (this is the chatbot), there will be 1 or 2 cookies, as shown in the figure below.
Here in this image, I am looking into the cookies of the chatbot on the website www.kinguin.net, and we can see one cookie there, i.e. "__livechat".
So this cookie is what I want to extract automatically using Selenium.
My following code returns all cookies on the website, but "__livechat" is missing:
import os, sys, json, codecs, subprocess, requests, time, string
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup as bs
from selenium.common.exceptions import NoSuchElementException
driver = webdriver.Chrome()
host = 'kinguin.net'
driver.get("https://"+host)
cookies = driver.get_cookies()
driver.switch_to.default_content()
cookies = driver.get_cookies()
for item in cookies:
    print(item['name'])
Taking it further, my following code goes into the iframe of the chatbot and gets cookies, but it returns nothing:
driver.switch_to.default_content()
elementID = driver.find_element_by_id('chat-widget')
driver.switch_to.frame(0)
cookies = driver.get_cookies()
for item in cookies:
    print(item['name'])
@ble Thanks a lot. The way you suggest is helpful only for this particular website, which is not what I want. I am sorry if I could not explain it clearly in my earlier query, but I want a generic solution for a large-scale website dataset.
For example, if we look at www.ebanx.com, the chatbot there is different and hence I will search for it by
elementID = driver.find_element_by_id('hubspot-messages-iframe-container')
and if I use your code after this, driver.switch_to.frame(elementID)
gives me the error
NoSuchFrameException: Message: no such frame: element is not a frame
With this line of code you found the iframe element:
elementID = driver.find_element_by_id('chat-widget')
Use this to switch to that iframe, and you will be able to collect cookies with the code you wrote:
driver.switch_to.frame(elementID)
After you finish, switch back to the default content with
driver.switch_to.default_content()
There are more iframes on that page. The easiest approach is to find the element by a unique identifier, such as 'id' or 'name', and store it in a variable, e.g. 'elementID'. I suggest renaming it to 'iframe_element', because it is not an ID; you just got the element by its ID.
Also, avoid searching by index (driver.switch_to.frame(0)) if there are not many iframes on the page (https://www.guru99.com/handling-iframes-selenium.html).
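Putting those steps together, a minimal sketch (the chat-widget id comes from the question above; other sites use different iframe identifiers, so this is not a generic solution):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.kinguin.net")

# The iframe may load after the rest of the page, so an explicit wait may be needed here
iframe_element = driver.find_element_by_id("chat-widget")
driver.switch_to.frame(iframe_element)

# Cookies reported now belong to the iframe's document (the livechat domain)
for item in driver.get_cookies():
    print(item["name"])

# Switch back to the main page when done
driver.switch_to.default_content()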
I am making a web crawler to get information from http://www.caam.org.cn/hyzc, but it gives me HTTP Error 302, and I cannot fix it.
https://imgur.com/a/W0cykim
The picture gives you a rough idea of this website's special layout: when you browse it, a window pops up telling you that the website is accelerating because so many people are online, and then redirects you to the site. As a result, when I use my web crawler, all I get is the information in this window, but nothing from the website itself. I think this is a good way for the website keeper to get rid of web crawlers, so I want to ask for your help to get useful information from this website.
At first, I used Python's requests for my web crawler, and I only got the information from that window; the results are shown here: https://imgur.com/a/GLcpdZn
Then I disallowed redirects, and I got HTTP Error 303, shown here:
https://imgur.com/a/6YtaVOt
This is the latest code I used:
import requests

def getpage(url):
    try:
        r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "try again"

url = "http://www.caam.org.cn/hyzc"
print(getpage(url))
The expected outcome of this question is to get useful information from the website http://www.caam.org.cn/hyzc. We may need to deal with the window popped out.
It looks like this website has some kind of protection against crawlers using requests; the page is not entirely loaded when you send a GET request.
You can try to emulate a browser using selenium:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('http://www.caam.org.cn/hyzc')
print(driver.page_source)
driver.close()
driver.page_source will contain the page source.
You can learn how to set up the Selenium webdriver here.
I added something to delay the closure of my web crawler, and this worked. So I want to share my lines in case you meet a similar problem in the future:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = Options()
driver = webdriver.Chrome(chrome_options=options)
driver.get('http://www.caam.org.cn')
body = driver.find_element_by_tag_name("body")
wait = WebDriverWait(driver, 5, poll_frequency=0.05)
wait.until(EC.staleness_of(body))
print(driver.page_source)
driver.close()
I am trying to scrape an online food-ordering website using Mechanize & BS4. The problem I'm facing is that the website has a form that takes location as input powered by Google. When I try filling it using this method:
from bs4 import BeautifulSoup as bs
import requests, lxml, mechanize
url = raw_input("Enter URL: ")
browser = mechanize.Browser()
browser.open(url)
# 'placeSelectionForm' is the name of the form
browser.select_form(name='placeSelectionForm')
control1 = browser.form.controls[0]
control1._value = 'Koramangala'
browser.submit()
soup = bs(browser.response().read(), "lxml")
print soup.prettify()
The script works fine for a normal Django form that I have made. But the problem here is that the Google-powered form uses an auto-complete API.
So when I type the initials of some location, auto-complete suggestions appear, and as soon as I select one option the form auto-submits and I'm taken to a new URL.
Now, the problem with the URL of the new page is that no matter which option I choose in the form, the URL remains the same, and the values that come with the response vary according to the option I chose on the previous page.
How can I fill this form (powered by the Google Maps API) using tools like Mechanize or BS4, or anything similar?
This is a quite JavaScript-"heavy" website which you may find difficult to automate with mechanize. Here is how you can make a search, choose one of the suggestions, and wait for the results to load with selenium:
# -*- coding: utf-8 -*-
from selenium.webdriver.common.by import By
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.maximize_window()
driver.get('http://www.swiggy.com/bangalore')
# wait for input to appear and make a search
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.ID, "pac-input"))).send_keys("Koramangala")
# wait for suggestions to appear
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.pac-container div.pac-item")))
# choose the first suggestion
suggestions = driver.find_elements_by_css_selector("div.pac-container div.pac-item")
suggestions[0].click()
# wait for results to load
wait.until(EC.visibility_of_element_located((By.ID, "restaurants")))
# TODO: extract results
I've added comments to make things clear. Let me know if you want me to expand on any of the parts of the code.
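For example, the extraction step might look roughly like this; the restaurant selectors are hypothetical and would need to be taken from the live page:

# Hypothetical selectors -- inspect the live page for the real class names
restaurants = driver.find_elements_by_css_selector("#restaurants .restaurant-card")
for restaurant in restaurants:
    name = restaurant.find_element_by_css_selector(".restaurant-name").text
    print(name)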