External iFrame inside a Pyramid project. No refresh when it's updated - python

We include an iFrame inside a Pyramid webpage.
The iFrame is a local HTML file which is not a Pyramid webpage.
Every time the HTML contents (= the iFrame) get updated and I refresh or load the Pyramid webpage with the iFrame again, the iFrame contents do not get updated. If I force a refresh with my browser, then the iFrame has the new contents.
How can I solve this issue?

Well, firstly, the question has no relation to Python or Pyramid whatsoever: Pyramid just generated a blob of text which happened to be an HTML page. After that, everything happens in the browser. I suppose your "other page" is served with HTTP headers telling the browser that it may cache the page and does not need to reload it each time.
If you want to force reload of the "other" page each time the "pyramid page" is generated, you may try tricking the browser into thinking you want to load a new page each time. To do that, just add a bogus url parameter with some random number:
<iframe src="http://other.domain.com/somepage.html?blah=1452352235"></iframe>
where the number after blah= may be a timestamp or just a random number.
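A minimal sketch of generating such a URL on the server side, assuming Python 3. The helper name cache_busted and its token argument are made up for illustration; the parameter name blah is just the one used in the example above.

```python
import html
import time
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def cache_busted(url, param="blah", token=None):
    """Return url with a throwaway query parameter appended.

    token defaults to the current Unix timestamp, so the URL changes
    between page loads and the browser cannot reuse a cached copy.
    """
    parts = urlsplit(url)
    # Keep existing query parameters, but drop an older copy of our own.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != param]
    value = int(time.time()) if token is None else token
    query.append((param, str(value)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Build the iframe tag; html.escape keeps "&" in the URL valid inside the attribute.
src = cache_busted("http://other.domain.com/somepage.html")
iframe_html = '<iframe src="{}"></iframe>'.format(html.escape(src, quote=True))
```

In a Pyramid view the resulting URL would typically be passed to the template instead of being formatted by hand; the string formatting here only keeps the sketch self-contained.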

Related

How is Amazon loading next page without URL?

I'm trying to web scrape the Amazon deals page, but the problem is that I'm unable to get the URL for the next page. Here is the link to Amazon's Today's Deals page. At the bottom of the page there is pagination, but when I inspected the page, there is no URL. The href attribute only contains "#", which should only scroll the page to the top. How is Amazon able to move to the next page? Is there any hidden URL? I couldn't find anything using the Network tab in the Inspect menu either.
Probably some JavaScript wizardry they are running in the background. # seems like a placeholder. Check out the JavaScript code, and there might be more clues there.

Scrape data from JavaScript-rendered website

I want to scrape the Lulu webstore. I have the following problems with it.
The website content is loaded dynamically.
The website when tried to access, redirects to choose country page.
After choosing country, it pops up select delivery location and then redirects to home page.
When you try to hit end page programmatically, you get an empty response because the content is loaded dynamically.
I have a list of end URLs from which I have to scrape data. For example, consider mobile accessories. Now I want to
Get the HTML source of that page directly, which is loaded dynamically bypassing choose country, select location popups, so that I can use my Scrapy Xpath selectors to extract data.
If you suggest me to use Selenium, PhantomJS, Ghost or something else to deal with dynamic content, please understand that I want the end HTML source as in a web browser after processing all dynamic content which will be sent to Scrapy.
Also, I tried using proxies to skip choose country popup but still it loads it and select delivery location.
I've tried using Splash, but it returns me the source of choose country page.
At last I found the answer. I used the EditThisCookie plugin to view the cookies that the web page loads. I found that it stores 3 cookies (CurrencyCode, ServerId, Site_Config) in my local storage. I used the plugin to copy the cookies in JSON format, and I referred to this manual for setting cookies in requests.
Now I'm able to skip the location and delivery address popups. After that I found that the dynamic pages are loaded via <script type=text/javascript>, and that part of the page URL is stored in a variable. I extracted the value using split(). Here is the part of the script that gets the dynamic page URL.
import requests
from lxml import html

# url: the product listing page; jar: the cookies copied from EditThisCookie
page_source = requests.get(url, cookies=jar)
tree = html.fromstring(page_source.content)
# entire JavaScript snippet that loads the product pages
dynamic_pg_link = tree.xpath('//div[@class="col3_T02"]/div/script/text()')[0]
# take the value between the first "=" and the following ";" (the dynamic page URL part)
dynamic_pg_link = dynamic_pg_link.split("=")[1].split(";")[0].strip()
page_link = "http://www.luluwebstore.com/Handler/ProductShowcaseHandler.ashx?ProductShowcaseInput=" + dynamic_pg_link
Now I'm able to extract data from these links.
Thanks to @Cal Eliacheff for the previous guidance.
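The answer does not show how the copied cookies end up in the variable passed as cookies=. A minimal sketch of one way to do it, assuming the plugin export is a JSON array of objects that each carry "name" and "value" keys; the helper name and the placeholder values below are hypothetical, only the three cookie names come from the answer.

```python
import json


def cookies_from_export(exported_json):
    """Turn a JSON cookie export into a plain {name: value} dict.

    Assumes a JSON array of objects with "name" and "value" keys;
    any other keys (domain, path, expiry, ...) are ignored.
    """
    return {entry["name"]: entry["value"] for entry in json.loads(exported_json)}


# Placeholder values: the real ones come from your own browser session.
sample_export = """
[
  {"domain": "www.luluwebstore.com", "name": "CurrencyCode", "value": "PLACEHOLDER_1"},
  {"domain": "www.luluwebstore.com", "name": "ServerId", "value": "PLACEHOLDER_2"},
  {"domain": "www.luluwebstore.com", "name": "Site_Config", "value": "PLACEHOLDER_3"}
]
"""

jar = cookies_from_export(sample_export)
```

requests accepts a plain dict for its cookies argument, so a dict built this way could be used as requests.get(url, cookies=jar).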

Python Scrapy : response object different from source code in browser

I'm working on a project using Scrapy.
All wanted fields but one get scraped perfectly. The content of the missing field simply doesn't show up in the Scrapy response (as checked in the Scrapy shell), while it does show up when I use my browser to visit the page. In the Scrapy response, the expected tags are there, but not the text between the tags.
There's no JavaScript involved, but it is a variable that is provided by the server (it's the current number of visits to that particular page). No iframe involved either.
Already set the user agent (in the settings-file) to match my browser.
Already set the download delay (in the settings-file) to 5.
EDIT (addition):
The page : http://www.fincaraiz.com.co/apartamento-en-venta/bogota/salitre-det-1337688.aspx
XPath to the wanted element : //*[@id="numAdvertVisits"]
What could be the cause of this mystery?
It's an ajax/javascript loaded value.
What steps did you take to determine there is no JS involved? I loaded the page w/o javascript, and while that area of the page had the stub content ("Visitas"), the actual data was written there with an ajax request.
You can still load that data using scrapy, it'll just take an additional request to the URL endpoint normally accessed via on-page ajax. The server returns the number of visits in XML, via the script at http://www.fincaraiz.com.co/WebServices/Statistics.asmx/GetAdvertVisits?idAdvert=1337688&idASource=40&idType=1001 (try loading that script and you'll see the # of visits for the page you provided in the original email).
There is another ajax request that returns "True" for that page, but I'm not sure what the data's actual meaning is. Still, it may be useful:
http://www.fincaraiz.com.co/WebServices/Statistics.asmx/DetailAdvert?idAdvert=1337688&idType=1001&idASource=40&strCookie=13/11/2014:19-05419&idSession=10hx5wsfbqybyxsywezx0n1r&idOrigin=44
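A rough sketch of that additional request, using only the standard library. The endpoint and parameter names are the ones from the URL above; the shape of the XML (a single root element whose text is the count) is an assumption and should be checked against a real response, and the sample value is invented.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE = "http://www.fincaraiz.com.co/WebServices/Statistics.asmx/GetAdvertVisits"


def visits_url(advert_id, source_id=40, type_id=1001):
    """Build the URL that the on-page ajax call requests."""
    params = {"idAdvert": advert_id, "idASource": source_id, "idType": type_id}
    return BASE + "?" + urlencode(params)


def parse_visits(xml_bytes):
    """Read the visit count from the XML body.

    Assumption: the count is the text of the root element.
    """
    root = ET.fromstring(xml_bytes)
    return int(root.text.strip())


# Hypothetical response body, for illustration only.
sample_body = b'<?xml version="1.0" encoding="utf-8"?><int xmlns="http://tempuri.org/">1234</int>'
visits = parse_visits(sample_body)
```

In a Scrapy spider, the detail-page callback could yield a second scrapy.Request for visits_url(...) and pass response.body to parse_visits in that request's callback.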

Selenium. Get content for dynamic page including Ajax objects

In my Selenium test I simply need to save the web page content after all Ajax objects have been loaded. I have found answers on how to wait for Ajax loading, but no working solution for saving the whole page with its Ajax content. Here is a source example:
import contextlib
from selenium import webdriver

with contextlib.closing(webdriver.Chrome()) as driver:
    driver.get(url)  # Load page
    # Just as an example, wait manually until all Ajax objects are loaded.
    input("Done?")  # raw_input("Done?") on Python 2
    # Save whole page
    text = driver.page_source
    # text contains original page data, no Ajax elements
I assume I need to tell web driver to check with the browser and update page_source property. Is there API for that? How do you save page containing Ajax objects?
Edit: Thanks for the reply! After re-testing with sample Ajax site I've figured that above code works. The problem was that the site uses frames, therefore I need to switch to a proper one. Here is another post answering that: What does #document mean?
page_source should return the HTML of the page as it is now, including any HTML generated by AJAX after the page load. You should not have to call a different method to get the AJAX-generated content.
Is there a public link to the site we can see?
After opening the page, refresh it and then get the source code:
driver.refresh()
text = driver.page_source
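The edit to the question points at frames as the actual cause: page_source reflects only the frame the driver is currently in. A hedged sketch of collecting the source of the top-level document plus each iframe, assuming a Selenium WebDriver instance; the function name is made up, and nested frames are not handled.

```python
def collect_frame_sources(driver):
    """Return [top-level source, source of iframe 0, source of iframe 1, ...].

    driver is expected to behave like a Selenium WebDriver.
    "tag name" is the locator strategy string behind By.TAG_NAME.
    """
    sources = [driver.page_source]
    for frame in driver.find_elements("tag name", "iframe"):
        driver.switch_to.frame(frame)        # enter the iframe
        sources.append(driver.page_source)   # now reflects the iframe's DOM
        driver.switch_to.default_content()   # back to the top-level document
    return sources
```

Pages built with the older frameset markup would need "frame" as the tag name instead of "iframe".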

web scraping a problem site

I'm trying to scrape some information from a web site, but am having trouble reading the relevant pages. The pages seem to first send a basic setup, then more detailed info. My download attempts only seem to capture the basic setup. I've tried urllib and mechanize so far.
Firefox and Chrome have no trouble displaying the pages, although I can't see the parts I want when I view page source.
A sample url is https://personal.vanguard.com/us/funds/snapshot?FundId=0542&FundIntExt=INT
I'd like, for example, average maturity and average duration from the lower right of the page. The problem isn't extracting that info from the page, it's downloading the page so that I can extract the info.
The page uses JavaScript to load the data. Firefox and Chrome are only working because you have JavaScript enabled - try disabling it and you'll get a mostly empty page.
Python isn't going to be able to do this by itself - your best compromise would be to control a real browser (Internet Explorer is easiest, if you're on Windows) from Python using something like Pamie.
The website loads the data via ajax. Firebug shows the ajax calls. For the given page, the data is loaded from https://personal.vanguard.com/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542
See the corresponding javascript code on the original page:
<script>populator = new Populator({parentId:
"profileForm:vanguardFundTabBox:tab0",execOnLoad:true,
populatorUrl:"/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542",
inline:false,type:"once"});
</script>
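Rather than reading the ajax URL off Firebug by hand, it could be pulled out of the downloaded page with a regular expression. This is only a sketch: the pattern assumes populatorUrl always appears as a double-quoted string exactly like in the snippet above, and the function name is made up.

```python
import re
from urllib.parse import urljoin

# Matches populatorUrl:"..." and captures the quoted path.
POPULATOR_URL = re.compile(r'populatorUrl\s*:\s*"([^"]+)"')


def populator_urls(page_url, html_text):
    """Return the absolute URLs of every populatorUrl found in html_text."""
    return [urljoin(page_url, path) for path in POPULATOR_URL.findall(html_text)]


page_url = "https://personal.vanguard.com/us/funds/snapshot?FundId=0542&FundIntExt=INT"
snippet = '''<script>populator = new Populator({parentId:
"profileForm:vanguardFundTabBox:tab0",execOnLoad:true,
populatorUrl:"/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542",
inline:false,type:"once"});
</script>'''

urls = populator_urls(page_url, snippet)
```

Each returned URL can then be downloaded with the same tool used for the main page (urllib or mechanize in the question).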
The reason is that the page performs AJAX calls after it loads. You will need to search out those URLs and scrape their content as well.
As RichieHindle mentioned, your best bet on Windows is to use the WebBrowser class to create an instance of an IE rendering engine and then use that to browse the site.
The class gives you full access to the DOM tree, so you can do whatever you want with it.
http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser(loband).aspx
