In my Selenium test I simply need to save the web page content after all Ajax objects have been loaded. I have found answers on how to wait for Ajax loading, but no working solution for saving the whole page with Ajax content. Here is a source example:
import contextlib
from selenium import webdriver

with contextlib.closing(webdriver.Chrome()) as driver:
    driver.get(url)  # Load page
    # Just as an example, wait manually until all Ajax objects are loaded.
    raw_input("Done?")
    # Save whole page
    text = driver.page_source
    # text contains original page data, no Ajax elements
I assume I need to tell the web driver to check with the browser and update the page_source property. Is there an API for that? How do you save a page containing Ajax objects?
Edit: Thanks for the reply! After re-testing with a sample Ajax site I've figured out that the above code works. The problem was that the site uses frames, so I needed to switch to the proper one. Here is another post answering that: What does #document mean?
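For reference, a minimal sketch of the frame switch that fixed it for me; the frame name "content_frame" is a placeholder for whatever frame actually holds the Ajax content on your site:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get(url)  # same url as above
# Switch into the frame that holds the Ajax content; the name is site-specific.
driver.switch_to.frame("content_frame")
text = driver.page_source  # now returns the rendered HTML of that frame
driver.switch_to.default_content()  # switch back to the top-level document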
page_source should return the HTML of the page as it is now, including any HTML generated after page load by AJAX. You should not have to call a different method to get the AJAX-generated content.
Is there a public link to the site we can see?
After opening the page, refresh it and then get the source code:
driver.refresh()
text = driver.page_source
I am trying to get the img tag from the first image, so I can get the image link.
When I scrape the site with BeautifulSoup, there is no img tag (in image 2).
I don't understand why the website has an img tag for each image, but BeautifulSoup does not find one.
It is possible that the images do not load on the site until it gets input from the user.
For example, if you had to click a dropdown or a next arrow to view the image on the website, then it is probably making a new request for that image and updating the HTML on the page.
Another issue might be JavaScript. Websites commonly run JavaScript code after the page has first been loaded; the JavaScript then makes additional requests to update elements on the page.
To see what is happening on the site, go to it in your browser and press F12. Go to the Network tab and reload the page. You will see all the URLs that are requested.
If you need to get data that is loaded by JavaScript requests, try using Selenium.
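For example, a minimal Selenium sketch using the craigslist URL from the update below; waiting for an img tag is an assumption about what signals that the image-loading JavaScript has finished:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://auburn.craigslist.org/search/sss?query=test")
# Wait until at least one img tag exists, i.e. the image-loading JavaScript has run.
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "img")))
html = driver.page_source  # rendered HTML, including the img tags
driver.quit()
You can then feed html to BeautifulSoup exactly as before.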
UPDATE
I went to the website you posted and pulled just the HTML using the following code.
import requests
page = requests.get("https://auburn.craigslist.org/search/sss?query=test")
print(page.text)
The request returns the HTML you would get before any JavaScript and other requests run. You can see it here
The image URLs are not in this either. This means that the image HTML is not returned in the initial request. What we do see are data tags; see line 2192 of the pastebin. These are commonly used by JavaScript to make additional requests, so it knows which images to go and get.
Result: the img tags you are looking for are not in the HTML returned from your request. Selenium will help you here; alternatively, investigate how their JavaScript uses those data-ids to determine which images to request.
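If you want to try the second route, here is a rough sketch of pulling those data attributes out with BeautifulSoup; the attribute name "data-ids" is an assumption based on the data tags seen in the pastebin, so check the real markup first:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://auburn.craigslist.org/search/sss?query=test")
soup = BeautifulSoup(page.text, "html.parser")
# Collect every tag carrying a data-ids attribute (assumed name; verify in the source).
for tag in soup.find_all(attrs={"data-ids": True}):
    print(tag["data-ids"])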
I am trying to scrape names and contact details from this page: https://www.realestate.com.au/find-agent/agents/sydney-cbd-nsw. Normally I would click into each of the list items and get the information from the resulting page, but there is no href to follow.
I'm presuming that the class type somehow points to some JS code, and when a list item is clicked the JS redirects you to the new URL. Can I get at it somehow using Scrapy?
Note: I don't know much about JS
This will give you all the links you need without JS rendering:
response.css('script::text').re('"url":"(.+?)"')
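As a sketch of how that fits into a spider (the class and callback names are made up; the selectors inside parse_agent are up to you):
import scrapy

class AgentsSpider(scrapy.Spider):
    name = "agents"
    start_urls = ["https://www.realestate.com.au/find-agent/agents/sydney-cbd-nsw"]

    def parse(self, response):
        # The agent-page URLs sit inside JSON embedded in <script> tags.
        for url in response.css('script::text').re('"url":"(.+?)"'):
            yield response.follow(url, callback=self.parse_agent)

    def parse_agent(self, response):
        # Extract names and contact details here with the usual css/xpath selectors.
        pass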
Don't use Chrome for scraping unless there's no other way. It's really bad practice.
I'd recommend using Selenium, which will automate an instance of an actual browser. This means that sessions, cookies, JavaScript execution, etc. are all handled for you.
Example:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("http://example.com")
button = driver.find_element_by_id('buttonID')
button.click()
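If you then need the rendered HTML after the click, for example to feed it to an HTML parser, a quick sketch; the fixed sleep is a crude stand-in for the explicit waits shown further down:
import time

time.sleep(2)  # crude wait for the click to finish updating the DOM
html = driver.page_source  # the browser's current DOM, including JS-generated markup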
I have created a script which fills in the form and submits it.
The website then displays the results.
Once I open Chrome using Selenium, driver.page_source gives the correct HTML output of the initial state.
If I use driver.page_source after submitting the form, I am only getting the source of the initial state again; that is, no change is reflected even though there is a change in the HTML.
Question: How do I get the HTML output of the page with the changes after submitting the form?
Thanks for the help in advance!
PS: I'm new, so yeah..
EDIT:
I found the answer: it was working fine all along, but the web page hadn't fully loaded yet, so I was still getting the old source code. I just made the driver wait before extracting the new source.
Thank you!
Once you submit the form, before you pull page_source to check for the change, it is worth mentioning that although the web client may have reached document.readyState equal to "complete" at a certain stage, at which point Selenium gets back control of program execution, that does not guarantee that all the associated JavaScript and Ajax calls on the new page have completed. Until the JavaScript and Ajax calls associated with the DOM tree have completed, the page is not fully rendered and you may not be able to observe the intended changes.
An ideal way to check for changes is to use WebDriverWait in conjunction with an expected_conditions clause such as title_contains, as follows:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.find_element_by_xpath("xpath_of_element_changes_page").click()
WebDriverWait(driver, 10).until(EC.title_contains("full_or_partial_text_of_the_new_page_title"))
source = driver.page_source
Note: While the page title resides within the <head> tag of the HTML DOM, a better solution is to induce WebDriverWait for the visibility of an element which will be present in all situations within the <body> tag of the DOM tree, as follows:
from selenium.webdriver.common.by import By

driver.find_element_by_xpath("xpath_of_element_changes_page").click()
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.ID, "id_of_element_present_in_all_situation")))
source = driver.page_source
You can pass Selenium's current page to the Scrapy Selector and use the usual CSS and/or XPath selectors to get data from it:
from scrapy.selector import Selector

sel_response = Selector(text=driver.page_source)  # page_source is already unicode; no .encode() needed
sel_response.css(<your_css_selector>).extract()
I want to scrape the Lulu webstore. I have the following problems with it:
The website content is loaded dynamically.
When accessed, the website redirects to a choose-country page.
After choosing a country, it pops up a select-delivery-location dialog and then redirects to the home page.
When you try to hit an end page programmatically, you get an empty response because the content is loaded dynamically.
I have a list of end URLs from which I have to scrape data. For example, consider mobile accessories. Now I want to get the HTML source of that page directly, as loaded dynamically, bypassing the choose-country and select-delivery-location popups, so that I can use my Scrapy XPath selectors to extract data.
If you suggest using Selenium, PhantomJS, Ghost or something else to deal with the dynamic content, please understand that I want the final HTML source as seen in a web browser after all dynamic content has been processed, which will then be fed to Scrapy.
Also, I tried using proxies to skip the choose-country popup, but it still loads it and the select-delivery-location dialog.
I've tried using Splash, but it returns the source of the choose-country page.
At last I found the answer. I used the EditThisCookie plugin to view the cookies that are loaded by the web page. I found that it stores 3 cookies, CurrencyCode, ServerId and Site_Config, in my local storage. I used the above-mentioned plugin to copy the cookies in JSON format, and referred to this manual for setting cookies in requests.
Now I'm able to skip those location and delivery-address popups. After that I found that the dynamic pages are loaded via <script type=text/javascript> and that part of the page URL is stored in a variable, whose value I extracted using split(). Here is the script part to get the dynamic page URL:
import requests
from lxml import html

# Cookies copied via EditThisCookie; the values are placeholders.
jar = {'CurrencyCode': '<value>', 'ServerId': '<value>', 'Site_Config': '<value>'}
page_source = requests.get(url, cookies=jar)
tree = html.fromstring(page_source.content)
dynamic_pg_link = tree.xpath('//div[@class="col3_T02"]/div/script/text()')[0]  # entire javascript that loads the product page
dynamic_pg_link = dynamic_pg_link.split("=")[1].split(";")[0].strip()  # obtains the dynamic page url
page_link = "http://www.luluwebstore.com/Handler/ProductShowcaseHandler.ashx?ProductShowcaseInput=" + dynamic_pg_link
Now I'm able to extract data from these links.
Thanks to @Cal Eliacheff for the previous guidance.
We include an iframe inside a Pyramid webpage.
The iframe is a local HTML file which is not a Pyramid webpage.
Every time the HTML contents (= the iframe) get updated and I refresh or load the Pyramid webpage with the iframe again, the iframe contents do not get updated. If I force a refresh in my browser, then the iframe has the new contents.
How can I solve this issue?
Well, firstly, the question has no relation to Python or Pyramid whatsoever: Pyramid just generated a blob of text that happened to be an HTML page. After that, everything happens in the browser. I suppose your "other page" has HTTP headers which say that the browser does not need to reload it each time and may cache it.
If you want to force a reload of the "other" page each time the "pyramid page" is generated, you can try tricking the browser into thinking it is loading a new page each time. To do that, just add a bogus URL parameter with some random number:
<iframe src="http://other.domain.com/somepage.html?blah=1452352235"></iframe>
where the number after blah= may be a timestamp or just a random number.
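If the pyramid page is rendered from a template, a minimal sketch of generating that number server-side; the route and template names are placeholders, and pyramid_jinja2 is assumed for the renderer:
import time
from pyramid.view import view_config

@view_config(route_name="home", renderer="templates/home.jinja2")
def home_view(request):
    # A fresh timestamp on every render changes the iframe URL, so the
    # browser re-fetches the embedded document instead of using its cache.
    return {"cache_buster": int(time.time())}
The template would then emit <iframe src="http://other.domain.com/somepage.html?blah={{ cache_buster }}"></iframe>.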