Incomplete HTML from Selenium - Python

Hi, I was wondering: if I have a certain page's URL and use Selenium like this:
webdriver.get(url)
webdriver.page_source
why does the source code given by Selenium lack elements that are there when I inspect the page in the browser? Is this some way the website protects itself from scraping?

Try adding some delay between webdriver.get(url) and webdriver.page_source to let the page load completely.

Generally it gives you the entire page source with all the tags and tag attributes, but that only holds for static web pages.
For dynamic web pages, webdriver.page_source only gives you whatever is in the DOM at that point in time, because the DOM is updated as the user interacts with the page.
Note that iframe content is never included in page_source; you have to switch into the frame to read it.
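A minimal sketch of reading an iframe's contents (Selenium 4 locator style; locating by tag name is an assumption):
from selenium.webdriver.common.by import By

# page_source only covers the current frame, so switch into the iframe first
driver.switch_to.frame(driver.find_element(By.TAG_NAME, "iframe"))
iframe_html = driver.page_source  # now the iframe's own document
driver.switch_to.default_content()  # switch back to the top-level page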

If the site you are scraping is a dynamic website, it takes some time to load: the JavaScript has to run, do its DOM manipulations, etc., and only after that does the page source contain the final content.
So it is better to add some delay between your get request and reading the page source.
import time
from selenium import webdriver

driver = webdriver.Chrome()  # or webdriver.Firefox(), etc.
driver.get(url)
time.sleep(x)  # pauses execution for x seconds (pick a value like 5)
html = driver.page_source
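A fixed sleep is fragile: too short and the page isn't ready, too long and you waste time. A more robust sketch is to poll until the document reports itself loaded (this still won't cover content injected by later AJAX calls):
from selenium.webdriver.support.ui import WebDriverWait

# wait up to 10 seconds for the document to report readyState "complete"
WebDriverWait(driver, 10).until(
    lambda d: d.execute_script("return document.readyState") == "complete")
html = driver.page_source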

The page source might contain just one link to a JavaScript file, and many of the controls you see on the page are generated on your side, in your browser, by running that JS code.
The page source is:
<script>
[1,2,3,4,5].map(i => document.write(`<p id="${i}">${i}</p>`))
</script>
The rendered DOM is:
<p id="1">1</p>
<p id="2">2</p>
<p id="3">3</p>
<p id="4">4</p>
<p id="5">5</p>
To get the rendered DOM's HTML:
document.querySelector('html').innerHTML
<script>
[1,2,3,4,5].map(i => document.write(`<p id="${i}">${i}</p>`))
console.log(document.querySelector('body').innerHTML)
</script>
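From Python you can ask Selenium for the same rendered DOM directly; a minimal sketch:
# executes JS in the browser and returns the live DOM as a string
rendered_html = driver.execute_script("return document.documentElement.outerHTML")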

Related

Beautiful Soup is not returning full HTML code that I see when I inspect the page manually [duplicate]

The issue I'm having is that I want to grab the related links from this page: http://support.apple.com/kb/TS1538
If I Inspect Element in Chrome or Safari, I can see the <div id="outer_related_articles"> and all the articles listed. If I attempt to grab it with BeautifulSoup, it grabs the page and everything except the related articles.
Here's what I have so far:
import urllib2
from bs4 import BeautifulSoup
url = "http://support.apple.com/kb/TS1538"
response = urllib2.urlopen(url)
soup = BeautifulSoup(response.read())
print soup
This section is loaded using JavaScript. Disable your browser's JavaScript to see how BeautifulSoup "sees" the page.
From here you have two options:
Use a headless browser that will execute the JavaScript. See this question about that: Headless Browser for Python (Javascript support REQUIRED!)
Try to figure out how the Apple site loads the content and simulate it - it probably makes an AJAX call to some address.
After some digging, it seems it makes a request to this address (http://km.support.apple.com/kb/index?page=kmdata&requestid=2&query=iOS%3A%20Device%20not%20recognized%20in%20iTunes%20for%20Windows&locale=en_US&src=support_site.related_articles.TS1538&excludeids=TS1538&callback=KmLoader.receiveSuccess) and uses JSONP to load the results, with KmLoader.receiveSuccess being the name of the receiving function. Use Firebug or Chrome dev tools to inspect the page in more detail.
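If you go the second route, you can fetch that endpoint directly and strip the JSONP wrapper yourself; a rough sketch, assuming the response is a plain function call wrapped around JSON:
import json
import requests

url = ("http://km.support.apple.com/kb/index?page=kmdata&requestid=2"
       "&query=iOS%3A%20Device%20not%20recognized%20in%20iTunes%20for%20Windows"
       "&locale=en_US&src=support_site.related_articles.TS1538"
       "&excludeids=TS1538&callback=KmLoader.receiveSuccess")
body = requests.get(url).text
# JSONP looks like KmLoader.receiveSuccess({...}); extract the JSON inside
payload = body[body.index("(") + 1 : body.rindex(")")]
data = json.loads(payload)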
I ran into a similar problem: HTML contents that are created dynamically may not be captured by BeautifulSoup. A very basic solution is to wait a few seconds before capturing the contents, or to use Selenium instead, which can wait for an element and then proceed. For the former, this worked for me:
import time
# .... your initial bs4 code here
time.sleep(5)  # 5 seconds; it worked with 1 second too
html_source = browser.page_source
# .... do whatever you want to do with bs4
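For the Selenium route, an explicit wait on the div mentioned in the question is more robust than a fixed sleep; a sketch, assuming a Selenium browser is already open on the page:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

# wait until the related-articles div has been injected by the JS
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, "outer_related_articles")))
soup = BeautifulSoup(browser.page_source, "html.parser")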

Python: finding content in dynamically generated HTML

I am trying to get stock options prices from this website based on the series code (for example FMM1), but the content is dynamically generated after the page loads, and my Python Selenium script is not able to extract the correct source code and therefore does not find it. When I inspect the element, I can find it, but not when I click on "view source code".
This is my code:
import time
from selenium import webdriver

# Here, we open the website for options prices in Chrome
driver = webdriver.Chrome()
driver.get("http://www.bmfbovespa.com.br/pt_br/servicos/market-data/consultas/mercado-de-derivativos/precos-referenciais/precos-referenciais-bm-f-premios-de-opcoes/")
# Since the page is populated by JavaScript code *after* loading the page, we
# tell the browser to wait 10 seconds before getting the source html code
time.sleep(10)
html_file = driver.page_source  # gets the html source of the page
print(html_file)
I have also tried the following, but it did not work:
WebDriverWait(driver, 60).until(
    EC.visibility_of_element_located((By.ID, "divContainerIframeBmf")))
Use this after the page loads
driver.switch_to.frame(driver.find_element_by_xpath("//iframe"))
and continue performing your operations on the page.
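Putting the two answers together: wait for the iframe, switch into it, then read the source; a sketch in current Selenium locator style:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# waits for the iframe to be available and switches into it in one step
WebDriverWait(driver, 60).until(
    EC.frame_to_be_available_and_switch_to_it((By.TAG_NAME, "iframe")))
html_file = driver.page_source  # now the iframe's document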

Python selenium webdriver not consistently selecting element even though it's there

I'm developing a web scraper to collect the src link from a source tag in an html file and add it to a list.
The site has a video nested under a load of divs, but all of the pages eventually come to:
<video type="video/mp4" poster="someimagelink" preload="metadata" crossorigin="anonymous">
<source type="video/mp4" src="somemp4link">
</video>
My current method is logging into the site, going to the page with the links to the video pages, going to each video page one by one and trying to find the source tag and adding it to the list.
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Firefox()
# A bunch of log in and get list of video page links, which works fine
soup = BeautifulSoup(browser.page_source)
for i in range(3):
    browser.get(soup('a', {'class': 'subject__item'})[i]['href'])
    vsoup = BeautifulSoup(browser.page_source)
    print(vsoup('source'))
    browser.get('pageWithVideoPages')
# This doesn't add to a list; it just goes to each video page,
# tries to find the source tag and prints it out,
# then goes back to the original page and starts the loop again.
What happens however is I get this:
[<source src="themp4link" type="video/mp4"></source>]
[]
[]
[]
So the first one works fine, then all the rest just return blank lists... as if there were no source tag. But manually checking the inspector reveals that there is a source tag there.
Repeating this, I now get:
[<source src="http://themp4link" type="video/mp4"></source>]
[]
[<source src="http://themp4link" type="video/mp4"></source>]
The site needs JavaScript enabled to load the content (which is why I'm using webdriver to do this)... could it be something to do with that?
Any help is much appreciated!
You probably need to wait for the web element you are looking for. You should explore using WebDriverWait.
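A sketch of that idea applied to the loop above (video_page_urls is a hypothetical list of the hrefs collected from the subject links):
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

for url in video_page_urls:  # hypothetical: hrefs gathered before the loop
    browser.get(url)
    # block until the <source> tag is actually present in the DOM
    WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "source")))
    vsoup = BeautifulSoup(browser.page_source)
    print(vsoup('source'))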

Python Scrape with requests and beautifulsoup

I am trying to do a scraping exercise using Python requests and BeautifulSoup.
Basically, I am crawling an Amazon web page.
I am able to crawl the first page without any issues.
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")
#do some thing
But when I try to crawl the 2nd page, with "#2" in the URL:
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers#2")
I see that r still has the same value, equivalent to the value for page 1:
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")
I don't know if #2 is causing trouble while making the request to the second page.
I also googled the issue but could not find a fix.
What is the right way to make a request to a URL with # values? How do I address this issue? Please advise.
"#2" is an fragment identifier, it's not visible on the server-side. Html content that you get, opening "http://someurl.com/page#123" is same as content for "http://someurl.com/page".
In browser you see second page because page's javascript see fragment identifier, create ajax request and inject new content into page. You should find ajax request's url and use it:
It looks like the URL is:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2&aj
We can easily see that all we need is to change the "pg" param value to get the other pages.
You need to request the URL in the href attribute of the anchor tags describing the pagination. It's at the bottom of the page. If I inspect the page in the developer console in Google Chrome, I find the first page's URL is:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_1?ie=UTF8&pg=1
and the second page's URL is:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2
The <a> tag for the second page looks like this:
<a page="2" ajaxUrl="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2&ajax=1" href="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2">21-40</a>
So you need to change the request URL.
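So instead of appending "#2", request the pagination URL itself; a minimal sketch:
import requests

# fetch page 2 through the real pagination URL rather than the fragment
r = requests.get(
    "http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2",
    params={"ie": "UTF8", "pg": 2})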

Selenium download full html page

I am learning to use Python Selenium and BeautifulSoup for web scraping. Currently, I am trying to scrape the hot searches on Google search trends: http://www.google.com/trends/hottrends#pn=p5
This is my current code. However, I realized the full HTML is not downloaded, and I only have content from the most recent few dates. What can I do to rectify this problem?
from selenium import webdriver
from bs4 import BeautifulSoup
googleURL = "http://www.google.com/trends/hottrends#pn=p5"
browser = webdriver.Firefox()
browser.get(googleURL)
content = browser.page_source
soup = BeautifulSoup(content)
print soup
Users add more content to the page (from previous dates) by clicking the <div onclick="control.moreData()" id="moreLink">More...</div> element at the bottom of the page.
So to get your desired content, you could use Selenium to click the id="moreLink" element, or execute some JavaScript to call control.moreData() in a loop.
For example, if you want to get all content as far back as Friday, February 15, 2013 (it looks like a string of this format exists for every date of loaded content), your Python might look something like this:
content = browser.page_source
desired_content_is_loaded = False
while not desired_content_is_loaded:
    if "Friday, February 15, 2013" not in content:
        browser.execute_script("control.moreData();")
        content = browser.page_source
    else:
        desired_content_is_loaded = True
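Alternatively, you can click the "More..." element itself instead of calling the JS function; a sketch in current Selenium locator style:
from selenium.webdriver.common.by import By

# click More... until the target date shows up in the page source
while "Friday, February 15, 2013" not in browser.page_source:
    browser.find_element(By.ID, "moreLink").click()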
EDIT:
If you disable JavaScript in your browser and reload the page, you will see that there is no "trends" content at all. That tells me that those items are loaded dynamically: they are not part of the HTML document which is downloaded when you open the page. Selenium's .get() waits for the HTML document to load, but not for all JS to complete. There's no telling whether async JS will complete before or after any other event; it completes when it's ready, and that can be different every time. That would explain why you might sometimes get all, some, or none of that content when you call browser.page_source: it depends on how fast the async JS happens to be working at that moment.
So, after opening the page, you might try waiting a few seconds before getting the source, giving the JS which loads the content time to complete.
import time

browser.get(googleURL)
time.sleep(3)
content = browser.page_source
