Dynamic content not rendered when scraping using selenium - python

I am trying to use Selenium for scraping (the script used to work on Python 3.7).
Last week I had to reset my PC, and I installed the latest versions of Python and of all the packages used in the script.
What I observed is that none of the dynamic values get rendered: the cells still display the raw {{...}} template placeholders. Please see below some of the outputs:
<tr>
<td class="textsr">Close</td>
<td class="textvalue">{{ScripHeaderData.Header.Close}}</td>
</tr>
<tr>
<td class="textsr">WAP</td>
<td class="textvalue">{{StkTrd.WAP}}</td>
</tr>
<tr>
<td class="textsr">Big Value</td>
<td class="textvalue">{{checknullheader(CompData.BigVal)?'-':(CompData.BigVal)}}</td>
</tr>
I have been using the script for my research and need it back in working shape, so I would appreciate any guidance.
Here's the snippet for reference:
import time

from bs4 import BeautifulSoup
from selenium import webdriver

# q (a queue of URLs) and opts (Chrome options) are set up earlier in the script
target_url = q.get(timeout=1)
time.sleep(1)
driver = webdriver.Chrome('./chromedriver', options=opts)
driver.get(target_url)
# this is just to ensure that the page is loaded
time.sleep(5)
html_content = driver.page_source
soup = BeautifulSoup(html_content, features="html.parser")
table_rows = soup.find_all('tr')
for row in table_rows:
    table_cols = row.find_all('td')
    for col in table_cols:
        label_value = col.text

I have referred to a lot of forums and tried many suggestions (waits, driver options, changing web drivers, switching content, etc.), however my issue seems to be more specific and did not get resolved.
Eventually I fell back to my old setup (running Python 3.9.6) and the script went back to a working state.
Thanks to Joe Carboni for your time and inputs on this.
It is a bit frustrating that I could not find the root cause of the issue or a proper workaround, but I am posting what I did here in case it helps someone. Cheers.

While it may be tempting to use time.sleep to wait for the page to load, it's better to use Selenium's explicit waits, with conditions tied to the elements you want.
https://www.selenium.dev/documentation/webdriver/waits/
Here's another thread with a good answer about Waits and conditions vs. time.sleep: How to sleep Selenium WebDriver in Python for milliseconds
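For example, here is a minimal sketch of an explicit wait for this page, assuming the value cells keep the td class="textvalue" from the snippet above; instead of a fixed sleep, it blocks until the {{...}} placeholders have been replaced (or times out after 15 seconds):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get(target_url)  # target_url as in the question's snippet
wait = WebDriverWait(driver, 15)
# first make sure at least one value cell is present in the DOM ...
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'td.textvalue')))
# ... then wait until its text is no longer an unrendered template
wait.until(lambda d: '{{' not in d.find_element(By.CSS_SELECTOR, 'td.textvalue').text)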

Related

Execute Javascript function on website using Python

Is it possible to call a JavaScript function on a website that I'm web scraping and save the result of the function?
I'm using Requests to establish a connection and save certain pages that I need, and BeautifulSoup to make them readable and access certain parts.
There is one part that I'm not sure how to call, or even if it's possible:
<tr class=TRDark>
<td width=100% colspan=3>
<a href="" onclick="OpenPayPlan('payplan.asp?app=******');return false;">
Betalingsplan
</a>
</td>
</tr>
This function will open a new window and calculate some data that I need. Is this possible to do with Python?
I cannot use Selenium or similar programs for this. This must be executed in the terminal and only the terminal.
Python does not include a JavaScript interpreter, so you may need to find one with Python bindings. When you've found one that fits your needs, read its documentation to see how that interpreter works. An example could be PyV8.
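That said, the onclick in your snippet only builds a URL (payplan.asp?app=...), so you may not need a JavaScript engine at all: you could extract that URL from the attribute and fetch it directly with Requests, which you are already using. A rough sketch (the base URL and page path are placeholders for the site you are scraping):

import re
import requests
from bs4 import BeautifulSoup

BASE = 'http://example.com/'  # placeholder base URL
page = requests.get(BASE + 'somepage.asp').text  # hypothetical path: the page you already fetch
soup = BeautifulSoup(page, 'html.parser')
# find the link whose onclick calls OpenPayPlan and pull out its argument
link = soup.find('a', onclick=re.compile(r'OpenPayPlan'))
match = re.search(r"OpenPayPlan\('([^']+)'\)", link['onclick'])
payplan_html = requests.get(BASE + match.group(1)).text  # fetches payplan.asp?app=... directly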

Python and selenium: choose option from dropdown in table

I am still new to Python and Selenium. I would like to choose a certain option from a dropdown that is contained in an HTML table, however I cannot get it to work. What am I doing wrong? Any help is appreciated.
Snippet of HTML-Code:
<table class="StdTableAutoCollapse">
<tr>
<td class="StdTableTD150">
<span id="ctl00_ContentPlaceBody_LbLProd1" class="StdLabel150">Prod1:</span>
</td>
<td class="StdTableTD330">
<select name="ctl00$ContentPlaceBody$DropDownListUnitType" onchange="javascript:setTimeout('__doPostBack(\'ctl00$ContentPlaceBody$DropDownListUnitType\',\'\')', 0)" id="ctl00_ContentPlaceBody_DropDownListUnitType" class="StdDropDownList330" Class="option">
<option selected="selected" value="#">- nothing -</option>
<option value="P">Dummy1</option>
</select>
</td>
</tr>
<tr>
I tried the following to select the value "Dummy1"
Python Code:
from selenium.webdriver.support.ui import Select

dropdown1 = browser.find_element_by_id('ctl00_ContentPlaceBody_DropDownListUnitType')
select = Select(dropdown1)
select.select_by_value("P")
What am I missing or doing wrong? Any help is much appreciated.
EDIT
I get an error on the IPython console in Anaconda with Python 3.6:
NoSuchElementException: Unable to locate element:
[id="ctl00_ContentPlaceBody_DropDownListUnitType"]
EDIT2
I checked whether the problem is due to different iframes, as mentioned in the comments and in other questions here on Stack Overflow. I used the idea from https://developer.mozilla.org/en-US/docs/Tools/Working_with_iframes to check for iframes and tried it on Alibaba's login page as an example, where two different iframes were shown. In the page I am trying to automate with Selenium there is only one iframe.
It seems WebDriver is having difficulty reaching the dropdown directly by its id. You may need to first locate the table and then reach the dropdown from there. Try the following and let me know whether it works.
dropdown1 = browser.find_element_by_xpath("//table[@class='StdTableAutoCollapse']/tr[1]/descendant::select[@id='ctl00_ContentPlaceBody_DropDownListUnitType'][1]")
select = Select(dropdown1)
select.select_by_value("P")
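If the element turns out to be present but just slow to appear, pairing the lookup with an explicit wait may also help. A minimal sketch (written against the newer Selenium API; the id comes from the question's HTML):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the dropdown before wrapping it in Select
dropdown1 = WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, 'ctl00_ContentPlaceBody_DropDownListUnitType'))
)
Select(dropdown1).select_by_value('P')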
The problem was that I was trying to use Selenium 3.0.2 with Firefox 45. This combination creates issues, and thus I could not select the dropdown values. I downgraded to Selenium 2.5.x and the problem went away. The issue was not that the select was in a table, as I first thought. I hope this helps somebody else in the future. Please see also the following question: Python, Firefox and Selenium 3: selecting value from dropdown does not work with Firefox 45

Can't scrape nested html using BeautifulSoup

I am interested in scraping "0.449" from the following source code from http://hdsc.nws.noaa.gov/hdsc/pfds/pfds_map_cont.html?Lat=33.146425&Lon=-87.5805543:
<td class="tblInner" id="0-0">
<div style="font-size:110%">
<b>0.449</b>
</div>
"(0.364-0.545)"
</td>
Using BeautifulSoup, I currently have written:
storm=soup.find("td",{"class":"tblInner","id":"0-0"})
which results in:
<td class="tblInner" id="0-0">-</td>
I am unsure of why everything nested within the td is not showing up. When I search the contents of the td, my result is simply "-". How can I scrape the value that I want from this code?
You are likely scraping a website that uses JavaScript to update the DOM after the initial load.
You have a couple of choices:
Find out where the JavaScript code that fills the HTML page gets its data from and call that instead. The data most likely comes from an API that you can call directly with curl. That's the best method 99% of the time.
Use a headless browser (zombie.js, ...) to retrieve the HTML code after the JavaScript changes it. Convenient and fast, but there are few tools in Python for this (google "python headless browser").
Use Selenium or Splinter to remote-control a real browser (Chrome, Firefox, ...). It's convenient and works in Python, but slow as hell.
Edit:
I did not see that you posted the url you wanted to scrape.
In your particular case, the data you want comes from an AJAX call to this URL:
http://hdsc.nws.noaa.gov/cgi-bin/hdsc/new/cgi_readH5.py?lat=33.1464&lon=-87.5806&type=pf&data=depth&units=english&series=pds
You now only need to understand what each parameter does, and parse the output of that instead of writing an HTML scraper.
Please excuse the lack of error checking and modularity, but this should get you what you need, based on @Eloims' observation:
import ast
import re

import requests

url = 'http://hdsc.nws.noaa.gov/cgi-bin/hdsc/new/cgi_readH5.py?lat=33.1464&lon=-87.5806&type=pf&data=depth&units=english&series=pds'
r = requests.get(url)
response = r.text
# the response is JavaScript, not JSON: pull out the "quantiles = [...];"
# assignment and parse the array as a Python literal rather than eval-ing it
coord_list_text = re.search(r'quantiles = (.*);', response)
coord_list = ast.literal_eval(coord_list_text.group(1))
print(coord_list[0][0])

Cannot switch to a frame

I am having super difficulty understanding a problem I have while automating a page using Chromedriver. I am on the login page, and here is how the HTML for the page looks:
<frame name="mainFrame" src>
<body>
<table ..>
<tr>
<td ..>
<input type="password" name="ui_pws">
</td>
..
..
..
</frame>
This is the gist; the page of course has multiple tables, divs, etc.
I am trying to enter the password in the input element using the XPath //input[@name="ui_pws"].
But the element was not found.
So I thought it might be because of wrong frame and I tried:
driver.switch_to_frame('mainFrame')
and it failed with NoSuchFrameException.
So I switched to:
main_frame = driver.find_element_by_xpath('//frame[@name="mainFrame"]')
driver.switch_to_frame(main_frame)
Then to cross verify I got the current frame element using:
current_frame = driver.execute_script("return window.frameElement")
And to my surprise I got two different elements when printed it out.
Now I am really confused as to what I should be doing to switch frames or access the password field in the webpage. I have had 4 cups of coffee since morning and still have a brain freeze.
Can anyone please guide me with this?
You can try this. It is in Java, but it should be almost the same in Python:
driver.switchTo().defaultContent();
WebElement frameElement = driver.findElement(By.xpath("//frame[@name='mainFrame']"));
driver.switchTo().frame(frameElement);
Switching to defaultContent first brings the focus back to the top-level document; after that we can switch to the desired frame in the window.
driver.switchTo().frame(driver.findElement(By.xpath("//frame[@name='mainFrame']")));
// perform the operations you want on the web elements inside the frame (mainFrame); once finished, come back to the default content
driver.switchTo().defaultContent();
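In Python, the same sequence would look roughly like this sketch (switch_to API; the XPaths come from the question):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.switch_to.default_content()  # start from the top-level document
frame_element = driver.find_element(By.XPATH, "//frame[@name='mainFrame']")
driver.switch_to.frame(frame_element)  # focus moves inside mainFrame
driver.find_element(By.XPATH, "//input[@name='ui_pws']").send_keys('secret')  # placeholder password
driver.switch_to.default_content()  # come back out when done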

Parsing Web Page's Search Results With Python

I recently started working on a program in Python which allows the user to conjugate any verb easily. To do this, I am using the urllib module to open the corresponding conjugation web page. For example, the verb "beber" would have the web page:
"http://www.spanishdict.com/conjugate/beber"
To open the page, I use the following python code:
source = urllib.urlopen("http://www.spanishdict.com/conjugate/beber").read()
This source does contain the information that I want to parse. But, when I make a BeautifulSoup object out of it like this:
soup = BeautifulSoup(source)
I appear to lose all the information I want to parse. The information lost when making the BeautifulSoup object usually looks something like this:
<tr>
<td class="verb-pronoun-row">
yo </td>
<td class="">
bebo </td>
<td class="">
bebí </td>
<td class="">
bebía </td>
<td class="">
bebería </td>
<td class="">
beberé </td>
</tr>
What am I doing wrong? I am no professional at Python or Web Parsing in general, so it may be a simple problem.
Here is my complete code (I used the "++++++" to differentiate the two):
import urllib
from bs4 import BeautifulSoup
source = urllib.urlopen("http://www.spanishdict.com/conjugate/beber").read()
soup = BeautifulSoup(source)
print source
print "+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++"
print str(soup)
When I wrote parsers I had problems with BeautifulSoup: in some cases it didn't find things that lxml found, and vice versa, because of broken HTML.
Try to use lxml.html.
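A minimal sketch of that suggestion (Python 3 urllib here; the class name comes from the question's own snippet):

import urllib.request
from lxml import html

source = urllib.request.urlopen('http://www.spanishdict.com/conjugate/beber').read()
tree = html.fromstring(source)
# print each conjugation row from the table shown in the question
for row in tree.xpath('//tr[td[@class="verb-pronoun-row"]]'):
    print([td.text_content().strip() for td in row.xpath('td')])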
Your problem may be with encoding. I think bs4 works with UTF-8, and you may have a different default encoding set on your machine (one that contains Spanish letters). So urllib requests the page in your default encoding; that's okay, the data is there in the source and even prints out fine, but when you pass it to the UTF-8-based bs4, those characters are lost. Try looking for a way to set a different encoding in bs4 and, if possible, set it to your default. This is just a guess though, take it easy.
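If that is the case, bs4 does let you override its encoding detection; a sketch, where 'iso-8859-1' is only a guess at what the site actually serves:

from bs4 import BeautifulSoup

# from_encoding overrides bs4's automatic encoding detection
soup = BeautifulSoup(source, 'html.parser', from_encoding='iso-8859-1')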
I recommend using regular expressions. I have used them for all my web crawlers. Whether this is usable for you depends on how dynamic the website is, but that problem exists even when you use bs4. You just write all your regular expressions manually and let them do the magic; you would have to work with bs4 in a similar way when looking for the information you want.
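For what it's worth, a regex for the snippet in the question could look like this sketch (assuming source holds the decoded page text; fragile if the markup changes):

import re

# capture the text of every <td> in the conjugation rows, trimming the
# whitespace visible in the question's output
cells = re.findall(r'<td class="[^"]*">\s*([^<]*?)\s*</td>', source)
print(cells)  # ['yo', 'bebo', 'bebí', ...]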
