Extracting Page Source with Python

Is there a way to extract the whole page source exactly as you would see it when you right-click and choose 'View page source' in a browser, i.e. the raw page with thousands of lines of text? I've tried requests.get(), but I only get a fraction of it.

from selenium import webdriver
browser = webdriver.Chrome()
browser.get('https://whateverpage.com')
x = browser.page_source
print(x)

Related

Get html of inspect element source with selenium

I'm working in selenium with Chrome.
The webpage I'm accessing updates dynamically.
I need the HTML that shows the results, which I can see when I use 'Inspect element'.
I don't understand how to access that HTML from my code. I always get the original HTML.
I tried this: Get HTML Source of WebElement in Selenium WebDriver using Python
browser.get('http://bijsluiters.fagg-afmps.be/?localeValue=nl')
searchform = browser.find_element_by_class_name('iceInpTxt')
searchform.send_keys('cefuroxim')
button = browser.find_element_by_class_name('iceCmdBtn').click()
element = browser.find_element_by_class_name('contentContainer')
html = element.get_attribute('innerHTML')
browser.close()
print(html)
It seems to work after some delay. If I were you, I would experiment with the delay time.
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
browser = webdriver.Chrome()
browser.get('http://bijsluiters.fagg-afmps.be/?localeValue=nl')
searchform = browser.find_element(By.CLASS_NAME, 'iceInpTxt')
searchform.send_keys('cefuroxim')
browser.find_element(By.CLASS_NAME, 'iceCmdBtn').click()
time.sleep(10)  # give the page time to render the results
element = browser.find_element(By.CLASS_NAME, 'contentContainer')
html = element.get_attribute('innerHTML')
browser.close()
print(html)
Addition: a nicer way is to let the script proceed as soon as a specific element is available, since it can take a while (because of JavaScript, for example) before that element is added to the DOM. The element to look for in your example appears to be the table with id iceDatTbl (from what I could find after a quick look).
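The idea of proceeding as soon as the element exists can be sketched as a small polling helper. This is a plain-Python illustration rather than Selenium's own API: `wait_for` and its parameters are hypothetical names. Selenium ships an equivalent in `WebDriverWait(...).until(...)` combined with `expected_conditions`, which is usually the better choice in a real script.

```python
import time


def wait_for(condition, timeout=15.0, poll=0.5):
    """Call condition() repeatedly until it returns something truthy.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Usage with the browser from the answer above (assumes Selenium 4 and
# `from selenium.webdriver.common.by import By`):
#
#   # find_elements returns an empty list while the table is missing,
#   # so the helper keeps polling until the table shows up.
#   wait_for(lambda: browser.find_elements(By.ID, "iceDatTbl"))
#   element = browser.find_element(By.CLASS_NAME, "contentContainer")
#   html = element.get_attribute("innerHTML")
#
# The built-in equivalent of the wait is:
#
#   WebDriverWait(browser, 15).until(
#       EC.presence_of_element_located((By.ID, "iceDatTbl")))
```

Compared with a fixed `time.sleep(10)`, this returns as soon as the table appears and fails loudly if it never does.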

Web scraping when scrolling down is needed

I want to scrape, e.g., the titles of the first 200 questions on the web page https://www.quora.com/topic/Stack-Overflow-4/all_questions. I tried the following code:
import requests
from bs4 import BeautifulSoup
url = "https://www.quora.com/topic/Stack-Overflow-4/all_questions"
print("url")
print(url)
r = requests.get(url) # HTTP request
print("r")
print(r)
html_doc = r.text # Extracts the html
print("html_doc")
print(html_doc)
soup = BeautifulSoup(html_doc, 'lxml') # Create a BeautifulSoup object
print("soup")
print(soup)
It gave me this text: https://pastebin.com/9dSPzAyX. If we search for href='/, we can see that the HTML does contain the titles of some questions. However, there are not enough of them; on the actual web page, a user needs to scroll down manually to trigger loading more.
Does anyone know how I could mimic "scrolling down" programmatically to load more of the page's content?
Infinite scrolling on a web page is driven by JavaScript. Therefore, to find out which URL to access and which parameters to use, we need to either study the JS code running on the page thoroughly or, preferably, examine the requests the browser makes when you scroll down the page. We can inspect those requests using the Developer Tools.
See the example for Quora.
The more you scroll down, the more requests are generated. Your requests should then go to that URL instead of the normal page URL, but keep in mind to send the correct headers and payload.
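Once the scroll request has been identified in the Developer Tools, replaying it in a loop might look like the sketch below. The endpoint, its `offset` parameter, the headers, and the JSON shape are all hypothetical placeholders: the real URL, headers, and payload have to be copied from what the browser actually sends.

```python
import json
import urllib.parse
import urllib.request


def collect_items(fetch_page, wanted):
    """Request successive pages until `wanted` items are collected.

    `fetch_page(offset)` must return a list of items; an empty list is
    treated as "no more content".
    """
    items = []
    while len(items) < wanted:
        page = fetch_page(len(items))
        if not page:
            break
        items.extend(page)
    return items[:wanted]


def fetch_page_http(offset):
    """Hypothetical fetcher: URL, parameter name and headers are placeholders."""
    query = urllib.parse.urlencode({"offset": offset})
    request = urllib.request.Request(
        "https://example.com/api/questions?" + query,
        headers={"User-Agent": "Mozilla/5.0",
                 "X-Requested-With": "XMLHttpRequest"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["items"]


# Usage, once fetch_page_http matches the real request:
#   titles = collect_items(fetch_page_http, 200)
```

Keeping the paging loop separate from the HTTP call makes it easy to swap in whatever request the site really uses.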
Another, easier solution is to use Selenium.
I couldn't find a usable response with requests, but you can use Selenium. The code first prints the number of questions at the first load, then sends the End key to mimic scrolling down. You can see the number of questions go from 20 to 40 after sending the End key.
I wait 5 seconds before reading the DOM again in case the script runs too fast, before the new content has loaded. You can improve this by using EC (expected conditions) with Selenium.
The page loads 20 questions per scroll. So if you are looking to scrape 100 questions, you need to send the End key 5 times.
To use the code below you need to install chromedriver.
http://chromedriver.chromium.org/downloads
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
CHROMEDRIVER_PATH = ""  # path to the chromedriver executable
CHROME_PATH = ""  # path to the Chrome binary
WINDOW_SIZE = "1920,1080"
chrome_options = Options()
# chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=%s" % WINDOW_SIZE)
chrome_options.binary_location = CHROME_PATH
# Do not load images, to speed up page loads
prefs = {'profile.managed_default_content_settings.images': 2}
chrome_options.add_experimental_option("prefs", prefs)
url = "https://www.quora.com/topic/Stack-Overflow-4/all_questions"
def scrape(url, times):
    if not url.startswith('http'):
        raise Exception('URLs need to start with "http"')
    driver = webdriver.Chrome(
        service=Service(CHROMEDRIVER_PATH),
        options=chrome_options
    )
    driver.get(url)
    counter = 1
    while counter <= times:
        q_list = driver.find_element(By.CLASS_NAME, 'TopicAllQuestionsList')
        questions = q_list.find_elements(By.XPATH, '//div[@class="pagedlist_item"]')
        print(len(questions))  # number of questions before scrolling
        html = driver.find_element(By.TAG_NAME, 'html')
        html.send_keys(Keys.END)  # jump to the bottom to trigger the next load
        time.sleep(5)  # give the new questions time to load
        questions2 = q_list.find_elements(By.XPATH, '//div[@class="pagedlist_item"]')
        print(len(questions2))  # number of questions after scrolling
        counter += 1
    driver.close()
if __name__ == '__main__':
    scrape(url, 5)
I recommend using Selenium rather than BeautifulSoup.
Selenium can control the browser as well as parse the page: scroll down, click buttons, etc.
This example scrolls down to collect all the users who liked a post on Instagram:
https://stackoverflow.com/a/54882356/5611675
If the content only loads on "scrolling down", this probably means that the page is using Javascript to dynamically load the content.
You can try using a web client such as PhantomJS to load the page and execute the javascript in it, and simulate the scroll by injecting some JS such as document.body.scrollTop = sY; (Simulate scroll event using Javascript).
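Injecting JavaScript to scroll can be sketched as follows. The sketch assumes a Selenium-style `driver` object with an `execute_script` method; the function name `scroll_to_end` and the stopping rule (stop once the page height no longer grows) are illustrative choices, not part of the answer above.

```python
import time


def scroll_to_end(driver, pause=2.0, max_rounds=50):
    """Scroll to the bottom repeatedly until the page stops growing.

    Returns the number of scrolls that loaded additional content.
    """
    height = driver.execute_script("return document.body.scrollHeight")
    loaded = 0
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude wait for the lazily loaded content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == height:
            break  # nothing new was appended, so we reached the end
        height = new_height
        loaded += 1
    return loaded
```

The `max_rounds` cap keeps the loop from running forever on a feed that never ends.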

Python: finding content in dynamically generated HTML

I am trying to get stock options prices from this website based on the series code (for example FMM1), but the content is dynamically generated after the page loads and my python selenium script is not able to extract the correct source code, and therefore does not find it. When I inspect element, I can find it but not when I click on "view source code".
This is my code:
# Here, we open the website for options prices in Chrome
driver = webdriver.Chrome()
driver.get("http://www.bmfbovespa.com.br/pt_br/servicos/market-data/consultas/mercado-de-derivativos/precos-referenciais/precos-referenciais-bm-f-premios-de-opcoes/")
# Since the page is populated by JavaScript code *after* loading the page, we
# tell the browser to wait 10 seconds before getting the source html code
time.sleep(10)
html_file = driver.page_source # gets the html source of the page
print(html_file)
I have also tried the following, but it did not work:
WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.ID,
"divContainerIframeBmf")))
The content is inside an iframe, so switch into it after the page loads:
driver.switch_to.frame(driver.find_element(By.XPATH, "//iframe"))
and then continue performing your operations on the page.
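A minimal sketch of reading the markup inside an iframe and then returning to the top-level document. It assumes a Selenium WebDriver in `driver`; `frame_source` is a hypothetical helper name.

```python
def frame_source(driver, frame):
    """Return the HTML of `frame`, then switch back to the main document.

    `frame` is whatever driver.switch_to.frame() accepts: a WebElement,
    a frame name or id, or an index.
    """
    driver.switch_to.frame(frame)
    try:
        return driver.page_source  # now refers to the iframe's document
    finally:
        # Always leave the driver pointing at the top-level page again.
        driver.switch_to.default_content()


# Usage (assumes `from selenium.webdriver.common.by import By`):
#   iframe = driver.find_element(By.XPATH, "//iframe")
#   html_file = frame_source(driver, iframe)
```

Switching back in a `finally` block means later lookups on the main page still work even if reading the frame fails.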

Webscraping links not the same as manual browsing

I have scraped a site for 840 URLs...
When I rebuild the URLs to get more information, my Python scraper does not provide the same data as when I click on the links manually.
For example, when I visit this website, https://salesweb.civilview.com/Sales/SalesSearch
If I click on the first 'Details' in the list, it takes me to a page with more information.
The information that is given is a relative link showing '/Sales/SaleDetails?PropertyId=254119896'
I've scraped the 'details' relative link and then rebuilt the link to match the absolute address.
This address becomes
https://salesweb.civilview.com/Sales/SaleDetails?PropertyId=254119896
However, when I do this and try to scrape, I get a totally different set of data and it takes me to a general landing page.
https://salesweb.civilview.com/
I thought at first, I needed to use a headless browser to fix the problem, but now I am not sure.
Here is my code:
import time
from selenium import webdriver
baseurl='https://salesweb.civilview.com'
link='/Sales/SaleDetails?PropertyId=254119946'
url1=baseurl+link
driver = webdriver.PhantomJS()
driver.get(url1)
html = driver.page_source
time.sleep(10)
driver.quit()
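Rebuilding the absolute address from a scraped relative link can also be done with the standard library instead of string concatenation. A small sketch using the URLs quoted in the question:

```python
from urllib.parse import urljoin

search_page = "https://salesweb.civilview.com/Sales/SalesSearch"
relative_link = "/Sales/SaleDetails?PropertyId=254119896"

# urljoin resolves the link against the page it was found on,
# the same way a browser resolves an <a href="...">.
detail_url = urljoin(search_page, relative_link)
print(detail_url)
```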
I found a workaround: if you first interact with the website, you can access the other URLs. Unfortunately, I have no idea why it works:
driver = webdriver.PhantomJS()
driver.get("https://salesweb.civilview.com/")
driver.find_element_by_link_text('Atlantic County, NJ').click()
driver.get("https://salesweb.civilview.com/Sales/SaleDetails?PropertyId=254119946")
html = driver.page_source
print(html)

Python Selenium Run All Page Javascripts

I'm scraping my site which uses a Google custom search iframe. I am using Selenium to switch into the iframe, and output the data. I am using BeautifulSoup to parse the data, etc.
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import html5lib
driver = webdriver.Firefox()
driver.get('http://myurl.com')
driver.execute_script()
time.sleep(4)
iframe = driver.find_elements_by_tag_name('iframe')[0]
driver.switch_to_default_content()
driver.switch_to_frame(iframe)
output = driver.page_source
soup = BeautifulSoup(output, "html5lib")
print soup
I am successfully getting into the iframe and getting 'some' of the data. At the very top of the data output, it talks about Javascript being enabled, and the page being reloaded, etc. The part of the page I'm looking for isn't there (from when I look at the source via developer tools). So, obviously some of it isn't loading.
So, my question - how do you get Selenium to load ALL page javascripts? Is it done automatically?
I see a lot of posts on SO about running an individual function, etc... but nothing about running all of the JS on the page.
Any help is appreciated.
Ahh, so it was in the tag that featured the "Javascript must be enabled" text.
I just posted a question on how to switch within the nested iframe here:
Python Selenum Swith into an iframe within an iframe
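For the nested case, one approach is to walk down the frames one at a time, starting from the top-level document. This is a sketch assuming a Selenium 4 `driver`; `enter_nested_frames` and the locator tuples are illustrative, not taken from the linked question.

```python
def enter_nested_frames(driver, locators):
    """Switch into a chain of nested iframes, outermost first.

    `locators` is a sequence of (by, value) pairs, for example
    [(By.TAG_NAME, "iframe"), (By.TAG_NAME, "iframe")].
    Returns the page source of the innermost frame.
    """
    driver.switch_to.default_content()  # always start from the top document
    for by, value in locators:
        # find_element only searches inside the frame we are currently in
        frame = driver.find_element(by, value)
        driver.switch_to.frame(frame)
    return driver.page_source


# Usage (assumes `from selenium.webdriver.common.by import By`):
#   output = enter_nested_frames(
#       driver, [(By.TAG_NAME, "iframe"), (By.TAG_NAME, "iframe")])
#   soup = BeautifulSoup(output, "html5lib")
```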
