How is Amazon loading the next page without a URL? - python

I'm trying to web scrape the Amazon deals page, but the problem is that I'm unable to get the URL for the next page. Here is the link to the Amazon today's deals page. At the bottom of the page there is pagination, but when I inspected the page, there is no URL: the href attribute only contains "#", which should just scroll the page to the top. How is Amazon able to move to the next page? Is there any hidden URL? I couldn't find anything using the Network tab in the Inspect menu either. I'm adding the picture below to show the pagination code.

Probably some JavaScript wizardry they are running in the background. "#" seems like a placeholder href whose click handler fetches the next page of deals. Check out the JavaScript code, and there might be more clues there.
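If the background calls are hard to reverse-engineer from the Network tab, one fallback is to drive a real browser and click the control. Below is a minimal Selenium sketch, assuming a local Chrome driver; the deals URL and the pagination selector are assumptions to verify against the live page.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.amazon.com/gp/goldbox")  # assumed deals URL
wait = WebDriverWait(driver, 10)
# assumed selector; inspect the live pagination widget for the real one
next_btn = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "ul.a-pagination li.a-last a")))
next_btn.click()
wait.until(EC.staleness_of(next_btn))  # the old button goes stale when the grid re-renders
print(driver.page_source[:500])
driver.quit()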

Related

Scraping a website whose URL doesn't change when clicking the "next page" button

I'm trying to scrape a BBC website
https://www.bbc.com/news/topics/c95yz8vxvy8t/hong-kong-anti-government-protests
and I would like to get all the news articles. But the URL doesn't change when clicking on the next-page button, so I can only get the first page's information. Can anyone help? I'm using Selenium but am familiar with requests too. Thanks!
Use the developer console in your browser: go to the Network tab and disable the cache.
You can see API requests being made for each page change. You don't need Selenium; you can just use requests or aiohttp.
This is an example:
https://push.api.bbci.co.uk/batch?t=%2Fdata%2Fbbc-morph-lx-commentary-data-paged%2Fabout%2Fd5803bfc-472d-4abf-b334-d3fc4aa8ebf9%2FisUk%2Ffalse%2Flimit%2F20%2FnitroKey%2Flx-nitro%2FpageNumber%2F2%2Fversion%2F1.5.6?timeout=5
Type "batch" in the filter bar and you should see only the API calls I believe to be responsible for the page change.
You can get the about id (d5803bfc-472d-4abf-b334-d3fc4aa8ebf9) of this topic in the webpage source, in this case at https://www.bbc.com/news/topics/c95yz8vxvy8t/hong-kong-anti-government-protests
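A minimal sketch of paging that endpoint with requests, assuming only the pageNumber segment needs to change and that the payload is JSON; both are worth confirming in the Network tab.

import requests

about_id = "d5803bfc-472d-4abf-b334-d3fc4aa8ebf9"
template = ("https://push.api.bbci.co.uk/batch?t=%2Fdata%2Fbbc-morph-lx-commentary-data-paged"
            "%2Fabout%2F{about}%2FisUk%2Ffalse%2Flimit%2F20%2FnitroKey%2Flx-nitro"
            "%2FpageNumber%2F{page}%2Fversion%2F1.5.6?timeout=5")

for page in range(1, 4):  # first three pages as a smoke test
    resp = requests.get(template.format(about=about_id, page=page))
    resp.raise_for_status()
    data = resp.json()  # assumed JSON payload; check the actual response shape
    print(page, list(data.keys()) if isinstance(data, dict) else type(data))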

BeautifulSoup scrape not showing everything

I am trying to get the img tag from the first image, so I can get the image link.
When I scrape the site with BeautifulSoup, there is no img tag (see image 2).
I don't understand why the website has an img tag for each image, but BeautifulSoup does not.
It is possible that the images do not load on the site until it gets input from the user.
For example, if you had to click a dropdown or a next arrow to view the image on the website, then it is probably making a new request for that image and updating the HTML on the site.
Another possibility is JavaScript. Websites commonly have JavaScript code that runs after the page has first been loaded. The JavaScript then makes additional requests to update elements on the page.
To see what is happening on the site, go to it in your browser and press F12. Go to the Network tab and reload the page. You will see all the URLs that are requested.
If you need to get data that is loaded by JavaScript requests, try using Selenium.
UPDATE
I went to the website you posted and pulled just the HTML using the following code.

import requests

# fetch the raw HTML, before any JavaScript has a chance to run
page = requests.get("https://auburn.craigslist.org/search/sss?query=test")
print(page.text)
The request returns the HTML you would get before any JavaScript or follow-up requests run. You can see it here.
The image URLs are not in it either. This means that the image HTML is not returned in the initial request. What we do see are data attributes; see line 2192 of the pastebin. These are commonly used by JavaScript to make additional requests so it knows which images to go and get.
Result: the img tags you are looking for are not in the HTML returned from your request. Selenium will help you here; alternatively, investigate how their JavaScript uses those data-ids to determine which images to request.
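A minimal Selenium sketch along those lines, assuming a local Chrome driver; the wait condition is an assumption about when the JavaScript has inserted the images.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://auburn.craigslist.org/search/sss?query=test")
# wait until the JavaScript has inserted at least one img element
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "img")))
for img in driver.find_elements(By.TAG_NAME, "img"):
    print(img.get_attribute("src"))
driver.quit()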

Scrape data from JavaScript-rendered website

I want to scrape the Lulu webstore. I have the following problems with it.
The website content is loaded dynamically.
When accessed, the website redirects to a choose-country page.
After choosing a country, it pops up a select-delivery-location dialog and then redirects to the home page.
When you try to hit an end page programmatically, you get an empty response because the content is loaded dynamically.
I have a list of end URLs from which I have to scrape data. For example, consider mobile accessories. Now I want to:
Get the HTML source of that dynamically loaded page directly, bypassing the choose-country and select-location popups, so that I can use my Scrapy XPath selectors to extract data.
If you suggest Selenium, PhantomJS, Ghost or something else to deal with the dynamic content, please understand that I want the final HTML source as a web browser would produce it after processing all dynamic content, to be handed to Scrapy.
Also, I tried using proxies to skip the choose-country popup, but it still loads it, along with the select-delivery-location dialog.
I've tried using Splash, but it returns the source of the choose-country page.
At last I found the answer. I used the EditThisCookie plugin to view the cookies the web page sets. I found that it stores three cookies, CurrencyCode, ServerId and Site_Config, in my local storage. I used the plugin to copy the cookies in JSON format, and I referred to this manual for setting cookies in requests.
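For reference, a minimal sketch of rebuilding that cookie jar in requests; the values are placeholders to copy from EditThisCookie, and the domain is an assumption.

import requests

jar = requests.cookies.RequestsCookieJar()
# placeholder values; copy the real ones from EditThisCookie
jar.set("CurrencyCode", "<value>", domain="www.luluwebstore.com", path="/")
jar.set("ServerId", "<value>", domain="www.luluwebstore.com", path="/")
jar.set("Site_Config", "<value>", domain="www.luluwebstore.com", path="/")
resp = requests.get("http://www.luluwebstore.com/", cookies=jar)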
Now I'm able to skip the location and delivery-address popups. After that I found that the dynamic pages are loaded via <script type=text/javascript>, with part of the page URL stored in a variable; I extracted the value using split(). Here is the script part that gets the dynamic page URL.
import requests
from lxml import html

page_source = requests.get(url, cookies=jar)  # jar holds the CurrencyCode/ServerId/Site_Config cookies
tree = html.fromstring(page_source.content)
# the entire JavaScript block that loads the product pages
dynamic_pg_link = tree.xpath('//div[@class="col3_T02"]/div/script/text()')[0]
# pull the dynamic page URL out of the script text
dynamic_pg_link = dynamic_pg_link.split("=")[1].split(";")[0].strip()
page_link = "http://www.luluwebstore.com/Handler/ProductShowcaseHandler.ashx?ProductShowcaseInput=" + dynamic_pg_link
Now I'm able to extract data from these links.
Thanks to @Cal Eliacheff for the previous guidance.

Scrape all the pages of a website when next page's follow-up link is not available in the current page source code

Hi, I have successfully scraped all the pages of a few shopping websites using Python and regular expressions.
But now I am having trouble scraping all the pages of a particular website where the next page's follow-up link is not present in the current page, like this one: http://www.jabong.com/men/clothing/mens-jeans/
This website loads the next pages' data into the same page dynamically via AJAX calls, so while scraping I am only able to get the data of the first page, but I need to scrape all the items present on all pages of the website.
I can't find a way to get the source code of all the pages of this type of website, where the next page's follow-up link is not available in the current page. Please help me with this.
Looks like the site is using AJAX requests to get more search results as the user scrolls down. The initial set of search results can be found in the main request:
http://www.jabong.com/men/clothing/mens-jeans/
As the user scrolls down the page detects when they reach the end of the current set of results, and loads the next set, as needed:
http://www.jabong.com/men/clothing/mens-jeans/?page=2
One approach would be to simply keep requesting subsequent pages until you find a page with no results, as sketched below.
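A minimal sketch of that loop with requests and lxml; the XPath for a product item is an assumption to check against the real page markup.

import requests
from lxml import html

base = "http://www.jabong.com/men/clothing/mens-jeans/"
page = 1
while True:
    resp = requests.get(base, params={"page": page})
    tree = html.fromstring(resp.content)
    items = tree.xpath('//div[contains(@class, "product")]')  # assumed item selector
    if not items:
        break  # an empty page means the results have run out
    for item in items:
        pass  # extract the fields you need from each item here
    page += 1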
By the way, I was able to determine this by using the proxy tool in screen-scraper. You could also use a tool like Charles or HttpFox. The key is to browse the site and watch what HTTP requests get made so that you can mimic them in your code.

web scraping a problem site

I'm trying to scrape some information from a web site, but am having trouble reading the relevant pages. The pages seem to first send a basic setup, then more detailed info. My download attempts only seem to capture the basic setup. I've tried urllib and mechanize so far.
Firefox and Chrome have no trouble displaying the pages, although I can't see the parts I want when I view page source.
A sample url is https://personal.vanguard.com/us/funds/snapshot?FundId=0542&FundIntExt=INT
I'd like, for example, average maturity and average duration from the lower right of the page. The problem isn't extracting that info from the page, it's downloading the page so that I can extract the info.
The page uses JavaScript to load the data. Firefox and Chrome are only working because you have JavaScript enabled; try disabling it and you'll get a mostly empty page.
Python isn't going to be able to do this by itself; your best compromise would be to control a real browser (Internet Explorer is easiest, if you're on Windows) from Python using something like Pamie.
The website loads the data via AJAX. Firebug shows the AJAX calls. For the given page, the data is loaded from https://personal.vanguard.com/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542
See the corresponding JavaScript code on the original page:
<script>
populator = new Populator({
    parentId: "profileForm:vanguardFundTabBox:tab0",
    execOnLoad: true,
    populatorUrl: "/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542",
    inline: false,
    type: "once"
});
</script>
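Since populatorUrl is fetched with a plain GET, a minimal requests sketch may be enough here, assuming the endpoint doesn't insist on session cookies.

import requests

# the endpoint the page's Populator script loads the tab content from
url = ("https://personal.vanguard.com/us/JSP/Funds/VGITab/"
       "VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542")
resp = requests.get(url)
resp.raise_for_status()
print(resp.text)  # HTML fragment that should contain average maturity and duration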
The reason is that the page performs AJAX calls after it loads. You will need to find those URLs and scrape their content as well.
As RichieHindle mentioned, your best bet on Windows is to use the WebBrowser class to create an instance of an IE rendering engine and then use that to browse the site.
The class gives you full access to the DOM tree, so you can do whatever you want with it.
http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser(loband).aspx
