Python: finding content in dynamically generated HTML

I am trying to get stock option prices from this website based on the series code (for example FMM1), but the content is generated dynamically after the page loads, so my Python Selenium script does not receive the rendered source code and therefore cannot find it. When I inspect the element I can see it, but not when I click on "View page source".
This is my code:
from selenium import webdriver
import time

# Here, we open the website for options prices in Chrome
driver = webdriver.Chrome()
driver.get("http://www.bmfbovespa.com.br/pt_br/servicos/market-data/consultas/mercado-de-derivativos/precos-referenciais/precos-referenciais-bm-f-premios-de-opcoes/")
# Since the page is populated by JavaScript code *after* loading the page, we
# tell the browser to wait 10 seconds before getting the source html code
time.sleep(10)
html_file = driver.page_source  # gets the html source of the page
print(html_file)
I have also tried the following, but it did not work:
WebDriverWait(driver, 60).until(
    EC.visibility_of_element_located((By.ID, "divContainerIframeBmf"))
)

Use this after the page loads:
driver.switch_to.frame(driver.find_element_by_xpath("//iframe"))
and then continue performing your operations on the page.
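Putting the two pieces together, a minimal sketch (the container id divContainerIframeBmf comes from the question; that the prices live in the first iframe on the page is an untested assumption):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.bmfbovespa.com.br/pt_br/servicos/market-data/consultas/mercado-de-derivativos/precos-referenciais/precos-referenciais-bm-f-premios-de-opcoes/")

# Wait until the container that holds the iframe is visible...
WebDriverWait(driver, 60).until(
    EC.visibility_of_element_located((By.ID, "divContainerIframeBmf"))
)
# ...then switch into the iframe; page_source reflects the current frame only.
driver.switch_to.frame(driver.find_element_by_xpath("//iframe"))
print(driver.page_source)  # now includes the dynamically generated content

The key point is that the prices table lives in a separate document inside the iframe, so the top-level page source never contains it.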

Related

Extracting Page Source with Python

Is there a way to extract the whole page source exactly as you would see it when you right-click and choose 'View page source' in a browser, i.e. a raw page with thousands of lines of text? I've tried requests.get(), but I'm only getting a fraction of it.
from selenium import webdriver

# Let a real browser render the page, then read the DOM it built
browser = webdriver.Chrome()
browser.get('https://whateverpage.com')
x = browser.page_source
print(x)
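If the missing content is built by JavaScript after the initial response, page_source can still race those scripts. A generic sketch (not specific to any one page) that at least waits for the document to finish loading:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

browser = webdriver.Chrome()
browser.get('https://whateverpage.com')
# Block until the browser reports the document as fully loaded.
WebDriverWait(browser, 30).until(
    lambda d: d.execute_script("return document.readyState") == "complete"
)
print(browser.page_source)

Note that readyState only covers the initial load; content fetched asynchronously afterwards still needs an explicit wait for a specific element.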

Get html of inspect element source with selenium

I'm working with Selenium in Chrome.
The webpage I'm accessing updates dynamically.
I need the HTML that shows the results; I can see it when I do "inspect element",
but I don't understand how to get that HTML from my code. I always get the original HTML.
I tried this: Get HTML Source of WebElement in Selenium WebDriver using Python
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('http://bijsluiters.fagg-afmps.be/?localeValue=nl')
searchform = browser.find_element_by_class_name('iceInpTxt')
searchform.send_keys('cefuroxim')
browser.find_element_by_class_name('iceCmdBtn').click()
element = browser.find_element_by_class_name('contentContainer')
html = element.get_attribute('innerHTML')
browser.close()
print(html)
It seems to work after some delay. If I were you, I would experiment with the delay time.
from selenium import webdriver
import time

browser = webdriver.Chrome()
browser.get('http://bijsluiters.fagg-afmps.be/?localeValue=nl')
searchform = browser.find_element_by_class_name('iceInpTxt')
searchform.send_keys('cefuroxim')
browser.find_element_by_class_name('iceCmdBtn').click()
time.sleep(10)  # give the page time to render the results
element = browser.find_element_by_class_name('contentContainer')
html = element.get_attribute('innerHTML')
browser.close()
print(html)
Addition: a nicer way is to let the script proceed as soon as the element is available, rather than sleeping a fixed time (JavaScript may need a while before a specific element has been added to the DOM). The element to look for in your example is the table with id iceDatTbl (from what I could find after a quick look), as in the sketch below.
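A minimal sketch of that explicit wait (the table id iceDatTbl is taken from the quick look above; the 30-second timeout is an arbitrary choice):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('http://bijsluiters.fagg-afmps.be/?localeValue=nl')
browser.find_element_by_class_name('iceInpTxt').send_keys('cefuroxim')
browser.find_element_by_class_name('iceCmdBtn').click()
# Proceed as soon as the results table exists instead of sleeping blindly.
WebDriverWait(browser, 30).until(
    EC.presence_of_element_located((By.ID, 'iceDatTbl'))
)
html = browser.find_element_by_class_name('contentContainer').get_attribute('innerHTML')
browser.close()
print(html)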

Webscraping links not the same as manual browsing

I have scraped a site for 840 URLs...
When I rebuild the URLs for more information, my Python scraper does not provide the same data as when I manually click on the links.
For example, when I visit this website, https://salesweb.civilview.com/Sales/SalesSearch
If I click on the first 'Details' in the list, it takes me to a page with more information.
The information that is given is a relative link showing '/Sales/SaleDetails?PropertyId=254119896'
I've scraped the 'details' relative link and then rebuilt the link to match the absolute address.
This address becomes
https://salesweb.civilview.com/Sales/SaleDetails?PropertyId=254119896
However, when I do this and try to scrape, I get a totally different set of data and it takes me to a general landing page:
https://salesweb.civilview.com/
At first I thought I needed to use a headless browser to fix the problem, but now I am not sure.
Here is my code:
import time
from selenium import webdriver

baseurl = 'https://salesweb.civilview.com'
link = '/Sales/SaleDetails?PropertyId=254119946'
url1 = baseurl + link
driver = webdriver.PhantomJS()
driver.get(url1)
time.sleep(10)  # wait *before* reading the source, or the sleep does nothing
html = driver.page_source
driver.quit()
I found a workaround: if you first interact with the website, you can access the other URLs. Unfortunately, I have no idea why it works:
driver = webdriver.PhantomJS()
driver.get("https://salesweb.civilview.com/")
driver.find_element_by_link_text('Atlantic County, NJ').click()
driver.get("https://salesweb.civilview.com/Sales/SaleDetails?PropertyId=254119946")
html = driver.page_source
print(html)
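One plausible explanation (a guess, not verified): clicking the county link sets a session cookie that the detail pages check, and requests without it get redirected to the landing page. If that's right, you could harvest the cookies once and reuse them for all 840 URLs with plain requests, as in this sketch:

import requests
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("https://salesweb.civilview.com/")
driver.find_element_by_link_text('Atlantic County, NJ').click()

# Copy the browser's cookies into a requests session (assumes the cookie
# is the only state the site checks -- an unverified guess).
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])
driver.quit()

html = session.get("https://salesweb.civilview.com/Sales/SaleDetails?PropertyId=254119946").text
print(html)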

Facing issues while scraping data from a table using python with selenium

I've written a script in Python with Selenium to parse a table from a target page, which can be reached by following the steps I describe below for clarity. The script does reach the destination, but when scraping data from the table it throws an error in the console: "Unable to locate element". I tried an online XPath tester to see if my expression was wrong, but the XPath I've used for "td_data" appears to be right. I suppose what I'm missing here is beyond my knowledge. I hope somebody can take a look into it and provide me with a workaround.
Btw, the site link is given in my script.
Link to see the html contents for the table: "https://www.dropbox.com/s/kaom5qzk78xndqn/Partial%20Html%20content%20for%20the%20table.txt?dl=0"
Steps to reach the target page, which my script is able to follow:
Selecting "I've read and understand above"
Putting the keyword "pump" in the input box located right below "Select medical devices"
Selecting the checkbox "Devices found for 'pump'"
Finally, pressing the search button
Script I've tried with so far:
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('http://apps.tga.gov.au/Prod/devices/daen-entry.aspx')
driver.find_element_by_id('disclaimer-accept').click()
time.sleep(5)
driver.find_element_by_id('medicine-name').send_keys('pump')
time.sleep(8)
driver.find_element_by_id('medicines-header-text').click()
driver.find_element_by_id('submit-button').click()
time.sleep(7)

for item in driver.find_elements_by_xpath('//div[@class="table-responsive"]'):
    for tr_data in item.find_elements_by_xpath('.//tr'):
        td_data = tr_data.find_element_by_xpath('.//span[@class="hovertext"]//a')
        print(td_data.text)

driver.close()
Why don't you just do this:
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('http://apps.tga.gov.au/Prod/devices/daen-entry.aspx')
driver.find_element_by_id('disclaimer-accept').click()
time.sleep(5)
driver.find_element_by_id('medicine-name').send_keys('pump')
time.sleep(8)
driver.find_element_by_id('medicines-header-text').click()
driver.find_element_by_id('submit-button').click()
time.sleep(7)

for item in driver.find_elements_by_xpath(
    '//table[@id]/tbody/tr/td[@class]/span[@class]/a[@id]'
):
    print(item.text)

driver.close()
Output:
27233
27283
27288
27289
27390
27413
27441
27520
25445
27816
27866
27970
28033
28238
26999
28264
28407
28448
28437
28509
28524
28553
28647
28677
28646
Maybe you want to think about saving the page with driver.page_source, pulling the table out, and saving it as an HTML file. Then use pandas.read_html to load the table into a DataFrame, as sketched below.
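A minimal sketch of the pandas route (assuming the results table is the first one on the page, which is untested; pandas.read_html needs lxml or html5lib installed):

import pandas as pd

# Parse every <table> in the rendered page into a list of DataFrames.
tables = pd.read_html(driver.page_source)
df = tables[0]  # assumption: the results table is the first table found
print(df.head())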

Selenium download full html page

I am learning to use Python Selenium and BeautifulSoup for web scraping. Currently, I am trying to scrape the hot searches on Google search trends http://www.google.com/trends/hottrends#pn=p5
This is my current code. However, I realized the full html is not downloaded and I only have content from the most recent few dates. What can I do to rectify this problem?
from selenium import webdriver
from bs4 import BeautifulSoup

googleURL = "http://www.google.com/trends/hottrends#pn=p5"
browser = webdriver.Firefox()
browser.get(googleURL)
content = browser.page_source
soup = BeautifulSoup(content, "html.parser")  # name a parser explicitly
print(soup)
Users add more content to the page (from previous dates) by clicking the <div onclick="control.moreData()" id="moreLink">More...</div> element at the bottom of the page.
So to get your desired content, you could use Selenium to click the id="moreLink" element or execute some JavaScript to call control.moreData(); in a loop.
For example, if you want to get all content as far back as Friday, February 15, 2013 (a string in this format appears to exist for every date that has been loaded), your Python might look something like this:
content = browser.page_source
desired_content_is_loaded = False
while not desired_content_is_loaded:
    if "Friday, February 15, 2013" not in content:
        browser.execute_script("control.moreData();")
        content = browser.page_source
    else:
        desired_content_is_loaded = True
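The same idea using a click on the id="moreLink" element instead of invoking the JS directly (a sketch; the one-second pause is an arbitrary allowance for the new rows to arrive):

import time

content = browser.page_source
while "Friday, February 15, 2013" not in content:
    browser.find_element_by_id("moreLink").click()
    time.sleep(1)  # give the newly requested rows a moment to load
    content = browser.page_source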
EDIT:
If you disable JavaScript in your browser and reload the page, you will see that there is no "trends" content at all. What that tells me is that those items are loaded dynamically. Meaning, they are not part of the HTML document which is downloaded when you open the page. Selenium's .get() waits for the HTML document to load, but not for all JS to complete. There's no telling whether async JS will complete before or after any other event; it completes when it's ready, and that could be different every time. That would explain why you might sometimes get all, some, or none of that content when you call browser.page_source: it depends on how fast the async JS happens to be working at that moment.
So, after opening the page, you might try waiting a few seconds before getting the source - giving the JS which loads the content time to complete.
browser.get(googleURL)
time.sleep(3)  # crude: give the async JS a few seconds to populate the page
content = browser.page_source
