Web scraping with nested frames and javascript - python

I want to get answers from an online chatbot.
http://talkingbox.dyndns.org:49495/braintalk? (the ? belongs to the link)
To send a question, you just have to send a simple request:
http://talkingbox.dyndns.org:49495/in?id=3B9054BC032E53EF691A9A1803040F1C&msg=[Here the question]
Source looks like this:
<frameset cols="*,185" frameborder="no" border="0" framespacing="0">
<frameset rows="100,*,82" frameborder="no" border="0" framespacing="0">
<frame src="http://thebot.de/bt_banner.html" marginwidth="0" name="frtop" scrolling="no" marginheight="0" frameborder="no">
<frame src="out?id=3B9054BC032E53EF691A9A1803040F1C" name="frout" marginwidth="0" marginheight="0">
<frameset rows="100%,*" border="0" framespacing="0" frameborder="no">
<frame src="bt_in?id=3B9054BC032E53EF691A9A1803040F1C" name="frin" scrolling="no" marginwidth="0" marginheight="0" noresize>
<frame src="" name="frempty" marginwidth="0" marginheight="0" scrolling="auto" frameborder="no" >
</frameset>
</frameset>
<frameset frameborder="no" border="0" framespacing="0" rows="82,*">
<frame src="stats?" name="fr1" scrolling="no" marginwidth="0" marginheight="0" frameborder="no">
<frame src="http://thebot.de/bt_rechts.html" name="fr2" scrolling="auto" marginwidth="0" marginheight="0" frameborder="no" >
</frameset>
</frameset>
I was using mechanize and BeautifulSoup for web scraping, but I suppose mechanize does not support dynamic webpages.
How can I get the answers in this case?
I am also looking for a solution that works well on both Windows and Linux.

Whether you use BeautifulSoup, mechanize, Requests or even Scrapy, loading those dynamic pages will have to be handled by an extra step that you write yourself.
For example, using Scrapy this may look something like:
from urlparse import urljoin

from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spider import BaseSpider

class TheBotSpider(BaseSpider):
    name = 'thebot'
    allowed_domains = ['thebot.de', 'talkingbox.dyndns.org']

    def __init__(self, *a, **kw):
        super(TheBotSpider, self).__init__(*a, **kw)
        self.domain = 'http://talkingbox.dyndns.org:49495/'
        self.start_urls = [self.domain +
                           'in?id=3B9054BC032E53EF691A9A1803040F1C&msg=' +
                           self.question]

    def parse(self, response):
        sel = Selector(response)
        # the frame src is relative, so resolve it against the response URL
        url = urljoin(response.url,
                      sel.xpath('//frame[@name="frout"]/@src').extract()[0])
        yield Request(url=url, callback=self.dynamic_page)

    def dynamic_page(self, response):
        ...  # xpath to scrape answer

Run it with a question as an argument:
scrapy crawl thebot -a question=[Here the question]
For more details on how to use Scrapy, see the Scrapy tutorial.

I would use Requests for a task like this.
import requests
r = requests.get("http://talkingbox.dyndns.org:49495/in?id=3B9054BC032E53EF691A9A1803040F1C&msg=" + your_question)
For webpages that do not contain dynamic content, r.text is what you want.
Since you didn't provide more information about the dynamic parts of the page, there is not much more to say.
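If the frameset page itself is plain HTML (and only the answer is delivered in a separate frame), you can stay with Requests and pull the answer frame's URL out of the markup. A hedged standard-library sketch; the frameset snippet is trimmed from the source pasted in the question, and the base URL is the one given there:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Trimmed copy of the frameset markup pasted in the question
FRAMESET_HTML = """
<frameset cols="*,185" frameborder="no" border="0" framespacing="0">
  <frame src="http://thebot.de/bt_banner.html" name="frtop">
  <frame src="out?id=3B9054BC032E53EF691A9A1803040F1C" name="frout">
</frameset>
"""

class FrameSrcParser(HTMLParser):
    """Collect the src of every <frame>, keyed by its name attribute."""
    def __init__(self):
        super().__init__()
        self.frames = {}

    def handle_starttag(self, tag, attrs):
        if tag == "frame":
            d = dict(attrs)
            if "name" in d and "src" in d:
                self.frames[d["name"]] = d["src"]

parser = FrameSrcParser()
parser.feed(FRAMESET_HTML)

# The "frout" frame holds the bot's answers; its src is relative,
# so resolve it against the page URL before requesting it.
answer_url = urljoin("http://talkingbox.dyndns.org:49495/braintalk",
                     parser.frames["frout"])
print(answer_url)
# http://talkingbox.dyndns.org:49495/out?id=3B9054BC032E53EF691A9A1803040F1C
```

You would then fetch answer_url with requests.get() and read r.text. If that document is itself filled in by JavaScript, no pure-HTTP library will ever see the answer, and you need a browser-driven tool such as Selenium instead.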


`Unable to locate element` after waiting till it appears

I'm writing a Selenium scraper that waits until a page is loaded before trying to locate an element. When I run the script, it looks to me like the element has loaded in the browser window, but Selenium thinks otherwise.
Here's scraper.py:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_argument("--ignore-certificate-errors")
options.add_argument("--test-type")
options.binary_location = "/usr/bin/chromium"
driver = webdriver.Chrome(chrome_options=options)

startingURL = "https://pbcvssp.co.palm-beach.fl.us/webapp/vssp/AltSelfService;jsessionid=0000lH_keoPs-fzs5sSkYGLah1X:-1"
driver.get(startingURL)
driver.find_element_by_name("guest_login").click()
driver.switch_to_window(driver.window_handles[1])  # go to the window with bids

try:
    secondsToWait = 20
    wait = WebDriverWait(driver, secondsToWait)
    openBidsLinkName = "AMSBrowseOpenSolicit"
    openBidsLink = wait.until(
        EC.element_to_be_clickable((By.NAME, openBidsLinkName))
    )
finally:
    print driver.page_source
    driver.find_element_by_name(openBidsLinkName)
But when I run python scraper.py, I get this error.
Traceback (most recent call last):
  File "scraper.py", line 30, in <module>
    driver.find_element_by_name(openBidsLinkName)
  File "/home/me/ENV/pbc_vss/local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 495, in find_element_by_name
    return self.find_element(by=By.NAME, value=name)
  File "/home/me/ENV/pbc_vss/local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 966, in find_element
    'value': value})['value']
  File "/home/me/ENV/pbc_vss/local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 320, in execute
    self.error_handler.check_response(response)
  File "/home/me/ENV/pbc_vss/local/lib/python2.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"name","selector":"AMSBrowseOpenSolicit"}
Also driver.page_source looks like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"><!-- BEGIN GENERATED HTML --><html xmlns="http://www.w3.org/1999/xhtml" lang="en" oncontextmenu="MNU_ShowPopup('Default', event);return false"><head>
<title>Self Service Application
</title>
<base href="https://pbcvssp.co.palm-beach.fl.us:443/webapp/vssp/advantage/AltSelfService/" />
<script language="JavaScript" type="text/javascript" src="../AMSJS/ALTSS/ALTSSUtil.js">
<!---->
</script>
<script language="JavaScript" type="text/javascript" src="../AMSJS/AMSMenu.js">
<!---->
</script>
<script language="JavaScript" type="text/javascript" src="../AMSJS/AMSDHTMLLib.js">
<!---->
</script>
<script language="JavaScript" type="text/javascript" src="../AMSJS/AMSUtils.js">
<!---->
</script>
<script type="text/javascript" language="JavaScript">
<!--
UTILS_InitPage();
-->
</script>
</head>
<frameset border="0" rows="33, *, 25">
<frame name="AdditionalLinks" src="/LoginExternal/Pages/LoginAdditionalLinks.htm" marginwidth="0" title="Additional Links" frameborder="0" marginheight="0" longdesc="../AMSImages/ALTSS/SelfServiceFrameDesc.htm#AdditionalLinks" scrolling="no" />
<frameset cols="150, *" border="0">
<frameset border="0" rows="150, *">
<frame name="pPrimaryNavPanel" src="pPrimaryNavPanel.htm" marginwidth="0" title="Navigation" frameborder="0" marginheight="0" longdesc="../AMSImages/ALTSS/SelfServiceFrameDesc.htm#Nav" scrolling="no" />
<frame name="Secondary" src="../AMSImages/ALTSS/portal.htm" marginwidth="0" title="Secondary Navigation" target="Display" frameborder="0" marginheight="0" longdesc="../AMSImages/ALTSS/SelfServiceFrameDesc.htm#SecondaryNavigator" scrolling="no" />
</frameset>
<frameset id="AltSSLinkFrame" border="0" rows="100, *">
<frame name="Startup" src="https://pbcvssp.co.palm-beach.fl.us/webapp/vssp/AltSelfService;jsessionid=0000CFpQkQ1YDSjZgm-4yMM0lHd:-1?session_id=CFpQkQ1YDSjZgm-4yMM0lHd&page_id=pid_2712&vsaction=pagetransition&vsnavigation=StartPageNav&frame_name=Startup" marginwidth="0" title="Welcome Area" frameborder="0" marginheight="0" longdesc="../AMSImages/ALTSS/SelfServiceFrameDesc.htm#PrimaryNav" scrolling="no" vsaction="true" />
<frame name="Display" src="AltSSHomePage.htm" marginwidth="0" title="Display Frame" frameborder="0" marginheight="0" longdesc="../AMSImages/ALTSS/SelfServiceFrameDesc.htm#Display" scrolling="auto" />
</frameset>
</frameset>
<frame name="CopyrightInfo" src="/LoginExternal/Pages/LoginCopyrightInfo.html" marginwidth="0" title="Copyright Info" frameborder="0" marginheight="0" longdesc="../AMSImages/ALTSS/SelfServiceFrameDesc.htm#CopyrightInfo" scrolling="no" />
</frameset>
<noframes>
&lt;body&gt;
&lt;p&gt;This page uses frames, but your browser does not support them. FramesetPage requires a Frames-capable browser&lt;/p&gt;
&lt;/body&gt;
</noframes>
How can I make Selenium locate the element with the name attribute "AMSBrowseOpenSolicit"?
Your element is inside a frame, so you have to switch to that frame before you can interact with it. This code snippet will help you do it:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# find the frame and switch to it
WebDriverWait(driver, 20).until(
    EC.frame_to_be_available_and_switch_to_it((By.XPATH, "//frame[@title = 'Display Frame']"))
)

# do your stuff
secondsToWait = 20
wait = WebDriverWait(driver, secondsToWait)
openBidsLinkName = "AMSBrowseOpenSolicit"
openBidsLink = wait.until(
    EC.element_to_be_clickable((By.NAME, openBidsLinkName))
)
driver.find_element_by_name(openBidsLinkName)

driver.switch_to.default_content()  # switch back to the default content
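A trick that makes pages like this easier to debug is to first enumerate every <frame> in driver.page_source, so you know which name or title to target with switch_to. Here is a hedged standard-library sketch; the markup is abridged from the page source shown above:

```python
from html.parser import HTMLParser

# Abridged from the page source printed by the scraper
PAGE_SOURCE = """
<frameset border="0" rows="33, *, 25">
  <frame name="AdditionalLinks" title="Additional Links" src="/LoginExternal/Pages/LoginAdditionalLinks.htm" />
  <frame name="Startup" title="Welcome Area" src="AltSelfService" />
  <frame name="Display" title="Display Frame" src="AltSSHomePage.htm" />
</frameset>
"""

class FrameLister(HTMLParser):
    """Record the name and title of every <frame> tag encountered."""
    def __init__(self):
        super().__init__()
        self.frames = []

    def handle_starttag(self, tag, attrs):
        if tag == "frame":
            d = dict(attrs)
            self.frames.append((d.get("name"), d.get("title")))

lister = FrameLister()
lister.feed(PAGE_SOURCE)
for name, title in lister.frames:
    print(name, "->", title)
```

On the full page source this immediately shows that the open-bids table lives behind the frame titled "Display Frame", which is the frame you have to switch into.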

Python Selenium Webdriver - Find nested frame

I am working on an Intranet with nested frames, and am unable to access a child frame.
The HTML source:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>VIS</title>
<link rel="shortcut icon" href="https://bbbbb/ma1/imagenes/iconos/favicon.ico">
</head>
<frameset rows="51,*" frameborder="no" scrolling="no" border="0">
<frame id="cabecera" name="cabecera" src="./blablabla.html" scrolling="no" border="3">
<frameset id="frame2" name="frame2" cols="180,*,0" frameborder="no" border="1">
<frame id="menu" name="menu" src="./blablabla_files/Menu.html" marginwidth="5" scrolling="auto" frameborder="3">
Buscar
<frame id="contenido" name="contenido" src="./blablabla_files/saved_resource.html" marginwidth="5" marginheight="5">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>BUSCAr</title>
</head>
<frameset name="principal" rows="220,*" frameborder="NO">
<frame name="Formulario" src="./BusquedaSimple.html" scrolling="AUTO" noresize="noresize">
<input id="year" name="year" size="4" maxlength="4" value="" onchange="javascript:Orden();" onfocus="this.value='2018';this.select();" type="text">
<frame name="Busqueda" src="./saved_resource(2).html" scrolling="AUTO">
</frameset>
<noframes>
<body>
<p>soporte a tramas.</p>
</body>
</noframes>
</html>
<frame name="frameblank" marginwidth="0" scrolling="no" src="./blablabla_files/saved_resource(1).html">
</frameset>
<noframes>
<P>Para ver esta página.</P>
</noframes>
</frameset>
</html>
I locate the button "Buscar" inside of frame "menu" with:
driver.switch_to_default_content()
driver.switch_to_frame(driver.find_element_by_css_selector("html frameset frameset#frame2 frame#menu"))
btn_buscar = driver.find_element_by_css_selector("#div_menu > table:nth-child(10) > tbody > tr > td:nth-child(2) > span > a")
btn_buscar.click()
I've tried this code to locate the input id="year" inside frame="Formulario":
driver.switch_to_default_content()
try:
    driver.switch_to_frame(driver.switch_to_frame(driver.find_element_by_css_selector("html frameset frameset#frame2 frame#contenido frameset#principal frame#Formulario")))
    print("Ok cabecera -> contenido")
except:
    print("cabecera not found")
or
driver.switch_to_frame(driver.switch_to_xpath("//*[@id='year']"))
but they don't work.
Can you help me?
Thanks!
To be able to handle the required frame you need to switch to it from the document that contains it. Note that framesets do not create separate browsing contexts, so "menu" is a direct child frame of the top-level page (your own working snippet confirms this) and can be reached with a single switch:
driver.switch_to.default_content()
driver.switch_to.frame("menu")
btn_buscar = driver.find_element_by_link_text("Buscar")
btn_buscar.click()
Also note that the WebDriver instance has no such method as switch_to_xpath(), and the switch_to_frame() and switch_to_default_content() methods are deprecated, so you'd better use switch_to.frame() and switch_to.default_content() instead.
Assuming your program has the focus on the top-level browsing context, to locate and click the button with the text Buscar you need to switch to the frame that contains it, using WebDriverWait in association with the proper expected_conditions. You can use the following code block:
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.ID, "menu")))
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.LINK_TEXT, "Buscar"))).click()
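The asker's actual target, the year input, sits one level deeper: it is inside the Formulario frame, which lives in the document loaded by the contenido frame. Since switch_to.frame() only sees frames of the current document, you have to walk that chain one hop at a time. As a sketch of the idea, here is the frame hierarchy from the pasted source modelled as nested dicts, with a small helper that computes the switch chain for any frame (the dict and the helper are illustrations, not part of Selenium):

```python
# Frame hierarchy of the pasted page; framesets don't create
# browsing contexts, so only <frame> elements appear here.
PAGE = {
    "cabecera": {},
    "menu": {},
    "contenido": {"Formulario": {}, "Busqueda": {}},
    "frameblank": {},
}

def frame_path(tree, target, path=()):
    """Depth-first search for the chain of frame names leading to target."""
    for name, children in tree.items():
        if name == target:
            return list(path) + [name]
        found = frame_path(children, target, path + (name,))
        if found:
            return found
    return None

print(frame_path(PAGE, "Formulario"))  # ['contenido', 'Formulario']
```

With Selenium that chain becomes: driver.switch_to.default_content(), then driver.switch_to.frame("contenido"), then driver.switch_to.frame("Formulario"), and finally driver.find_element_by_id("year").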

In Scrapy, can't extract text of a link with a "@" in it

In the Scrapy shell for the URL http://www.apkmirror.com/apk/sony-mobile-communications/foldinghome/foldinghome-1-00-40-release/foldinghome-1-00-40-android-apk-download/, I'm trying to extract the developer, app, and version names from the navigation bar:
I've tried the following XPath selector:
In [6]: response.xpath('//*[@class="breadcrumbs"]//a/text()').extract()
Out[6]: [u'Sony Mobile Communications', u'1.00.40']
However, notice that the app name, Folding@Home, is not among the results. I do not understand this because it does seem to have an <a> tag (as shown using "Inspect" in Chrome):
Moreover, for a similar site, http://www.apkmirror.com/apk/oculus-vr/oculus-rooms/oculus-rooms-0-0-2-release/oculus-rooms-0-0-2-android-apk-download/, this selector does work:
In [1]: response.xpath('//*[@class="breadcrumbs"]//a/text()').extract()
Out[1]: [u'Oculus VR', u'Oculus Rooms', u'0.0.2']
I am beginning to suspect that this might be some kind of bug in Scrapy whereby it does not select the text() of <a> elements containing @ symbols. Might this be the case?
As you've already figured out, one of the breadcrumb links is "protected" and is dynamically constructed via JavaScript executed in the browser.
One easy way to approach the problem would be to pass the content of the page through Splash via scrapy-splash middleware. This worked for me:
import scrapy
from scrapy_splash import SplashRequest

class ApkSpider(scrapy.Spider):
    name = "apkmirror"
    allowed_domains = ['apkmirror.com']

    def start_requests(self):
        yield SplashRequest(
            'http://www.apkmirror.com/apk/sony-mobile-communications/foldinghome/foldinghome-1-00-40-release/foldinghome-1-00-40-android-apk-download/',
            self.parse_result,
        )

    def parse_result(self, response):
        print(response.xpath('//*[@class="breadcrumbs"]//a/text()').extract())
with the following settings:
SPLASH_URL = 'http://127.0.0.1:8050'
SPLASH_COOKIES_DEBUG = True
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
and Splash running in a Docker container on port 8050.
Prints:
[u'Sony Mobile Communications', u'Folding@Home', u'1.00.40']
Viewing the page source using Chrome's "View Page Source" option instead of "Inspect", I see that the navigation bar for this particular link contains JavaScript:
<nav style="margin-left:16px; margin-right:16px;" class="navbar navbar-default" role="navigation">
<div style="color: #013967 !important;" class="breadcrumbs"><a class="withoutripple" style="color: #013967 !important;" href="/apk/sony-mobile-communications/">Sony Mobile Communications</a> <svg class="icon chevron-icon"><use xlink:href="#apkm-icon-chevron"></use></svg> <a class="withoutripple " style="color: #013967 !important;" href="/apk/sony-mobile-communications/foldinghome/"><span class="__cf_email__" data-cfemail="c781a8aba3aea9a0878fa8aaa2">[email protected]</span><script data-cfhash='f9e31' type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script></a> <svg class="icon chevron-icon"><use xlink:href="#apkm-icon-chevron"></use></svg> <a class="active withoutripple" style="color: #013967 !important;" href="/apk/sony-mobile-communications/foldinghome/foldinghome-1-00-40-release/">1.00.40</a> </nav>
whereas for the Oculus Rooms page in the second example it doesn't:
<nav style="margin-left:16px; margin-right:16px;" class="navbar navbar-default" role="navigation">
<div style="color: #646464 !important;" class="breadcrumbs"><a class="withoutripple" style="color: #646464 !important;" href="/apk/oculus-vr/">Oculus VR</a> <svg class="icon chevron-icon"><use xlink:href="#apkm-icon-chevron"></use></svg> <a class="withoutripple " style="color: #646464 !important;" href="/apk/oculus-vr/oculus-rooms/">Oculus Rooms</a> <svg class="icon chevron-icon"><use xlink:href="#apkm-icon-chevron"></use></svg> <a class="active withoutripple" style="color: #646464 !important;" href="/apk/oculus-vr/oculus-rooms/oculus-rooms-0-0-2-release/">0.0.2</a> </nav>
Handling JavaScript with Scrapy is a known issue (cf. https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/).
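For this particular link you could even skip the rendering service: Cloudflare's email obfuscation (which fired here presumably because "Folding@Home" looks like an email address) is simple to reverse, since the data-cfemail value is a hex string whose first byte is an XOR key for the remaining bytes. A hedged sketch, using the attribute value from the page source quoted above:

```python
def decode_cfemail(cfemail):
    """Reverse Cloudflare's email obfuscation: byte 0 is the XOR key,
    the remaining hex bytes are the protected text XOR'ed with it."""
    key = int(cfemail[:2], 16)
    decoded = bytes(
        int(cfemail[i:i + 2], 16) ^ key for i in range(2, len(cfemail), 2)
    )
    return decoded.decode("utf-8")

# data-cfemail value from the <span class="__cf_email__"> above
print(decode_cfemail("c781a8aba3aea9a0878fa8aaa2"))  # Folding@Home
```

So scraping the span's data-cfemail attribute with a plain Scrapy response and decoding it yourself is a lighter-weight alternative to running Splash for just this one field.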

Selenium - identifying web element

I am using Python to scrape data from a website. While I have been able to use Selenium to log in, I cannot identify the search field once logged in. It appears the web page loads with frames (not iframes), but I cannot access the frame containing the search field.
I have tried switching to the relevant frame (which seems to work; no error is thrown), but if I then search for the search element by CSS, XPath, name or id, I get a NoSuchElementException. I am using the Chrome webdriver.
Any suggestions? The page source is as follows:
<html>
<head>
<title> XYZ </title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<meta http-equiv="Content-Script-Type" content="text/javascript" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta http-equiv="content-language" content="en" />
<script type="text/javascript">
if (navigator && navigator.appVersion && navigator.appVersion.match("Safari") && !navigator.appVersion.match("Chrome")) {
// hack to force a window redraw
window.onload = function() {
document.getElementsByTagName('html')[0].style.backgroundColor = '#000000';
}
}
</script>
</head>
<frameset id="wc-frameset" rows="82,*" frameborder="no" border="0" framespacing="0">
<frame frameborder="0" src="/frontend/header/" name="top" marginwidth="0" marginheight="0" scrolling="no" noresize="noresize" />
<frameset cols="*,156,850,*" frameborder="NO" border="0" framespacing="0">
<frame frameborder="0" src="/frontend/fillbar/" name="fillbar" marginwidth="0" marginheight="0" scrolling="no" noresize="noresize" />
<frame frameborder="0" src="/frontend/navigation/" name="navigation" marginwidth="0" marginheight="0" scrolling="no" noresize="noresize" />
<frame frameborder="0" src="/frontend/frames/" name="content_area" marginwidth="0" marginheight="0" scrolling="no" noresize>
<frame frameborder="0" src="/frontend/fillbar/" name="fillbar" marginwidth="0" marginheight="0" scrolling="no" noresize="noresize" />
</frameset>
</frameset>
</html>
The code that I have so far is:
username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")
username.send_keys("****")
password.send_keys("****")
driver.find_element_by_class_name("bg-left").click()
#this bit works
driver.switch_to_frame("content_area")
#this seems to work too, got the frame name from the page source
search = driver.find_element_by_id("field-name")
search.send_keys("TEST")
#this fails, no element found
The target frame source code is:
<div id="field-name" class="field field-StringField">
<label for="name">Name</label> <div class="input-con"><input id="name" name="name" type="text" value=""></div>
</div>
It is possible that there are duplicate elements on the page.
Try the following in Chrome:
Open the URL in Chrome
Open the developer tools (F12)
Press ESC to open the Chrome console
Select your frame
Search for matching elements using XPath in the console:
$x("//input[@id='name']")
This should list the matching elements.
Maybe you need to wait for the page to load completely before searching for the element. You can try something like:
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

driver.switch_to_frame("content_area")
try:
    # wait for the element to be visible
    WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.ID, 'name')))
except TimeoutException:
    print("page timed out")  # the element never became visible
search = driver.find_element_by_id("name")
search.send_keys("TEST")

Switching iframes with Python/Selenium

I am attempting to use selenium to navigate a website that is using frames.
Here is my working python script for part 1:
from selenium import webdriver
import time
from urllib import request
driver = webdriver.Firefox()
driver.get('http://www.lgs-hosted.com/rmtelldck.html')
driver.switch_to.frame('menu')
driver.execute_script('xSubmit()')
time.sleep(.5)
link = driver.find_element_by_id('ml1T2')
link.click()
Here is the page element:
<html webdriver="true">
<head></head>
<frameset id="menuframe" name="menuframe" border="0" frameborder="0" cols="170,*">
<frameset border="0" frameborder="0" rows="0,*">
<frame scrolling="AUTO" noresize="" frameborder="NO" src="heart.html" name="heart"></frame>
<frame scrolling="AUTO" noresize="" frameborder="NO" src="rmtelldcklogin.html" name="menu"></frame>
</frameset>
<frame scrolling="AUTO" noresize="" frameborder="NO" src="rmtelldcklogo.html" name="update"></frame>
</frameset>
</html>
My issue is switching frames: I am in 'menu' and I need to get into 'update':
driver.switch_to.frame('update')
This does not work; the error tells me the frame is not there, even though we can clearly see it is. Any ideas?
How do I switch from menu to update?
You need to switch back to the default content before switching to a different frame, because frame lookups are relative to the current frame's document:
driver.switch_to.default_content()
driver.switch_to.frame("update")
# to prove it is working
title = driver.find_element_by_id("L_DOCTITLE").text
print(title)
Prints:
Civil Case Inquiry
