How can I bypass Amazon's 400-page search result limit? - python

I am currently working on a web scraper that should extract all item descriptions from an entire category on Amazon. I am writing this script with Python, Selenium and the PhantomJS driver. How can I bypass the 400 page limit?

Amazon doesn't offer access to this data through its API. It only provides information for "Pro sellers" (not standard sellers), and only about their own sales, shipping or products (you can find the details on the Amazon Marketplace Feed API page).
The only way I could find to do it is to iterate through the category pages.
To do that, start on the category page you're interested in, retrieve the description, price and so on, and have your web scraper look for an element with id "pagnNextLink". Then load the next page and repeat the process until that element can no longer be found.
And remember that you must iterate through these pages one by one (you can't jump to a different page by altering the "sr_pg_" parameter in the link), because Amazon includes session references in the links and the link is generated anew on every page.
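Here is a minimal sketch of that loop with Selenium and PhantomJS; the starting category URL and the item-extraction step are placeholders you would fill in yourself, and only the "pagnNextLink" id comes from the page itself:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

# Hypothetical starting point: any Amazon category listing page.
START_URL = "https://www.amazon.com/s?rh=n%3A172282"

driver = webdriver.PhantomJS()
driver.get(START_URL)

while True:
    # Parse titles, descriptions, prices, etc. from the current page here
    # (left out of this sketch).

    try:
        next_link = driver.find_element_by_id("pagnNextLink")
    except NoSuchElementException:
        break  # no "Next" link, so this is the last page of the category

    # Follow the freshly generated link instead of editing "sr_pg_" by hand.
    next_link.click()

driver.quit()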

Related

Getting stats about a page on my site and links on that page from the Google API

I have a blog with the Google Analytics tag in the various pages. I also have links on my site pointing to pages on my site as well as external pages. I have not set up custom events or anything like that.
For a given URL/page on my site, within a certain date range, I want to programmatically get (ideally from the GA API):
Search words/traffic sources that unique users/traffic from outside my website (e.g. organic traffic searching on Google) used to land on and view that page
For specific links on that page - both internal and external - I want to know the number of unique users who clicked on the link and the number of clicks
For specific links on that page - both internal and external - I want to know the search terms/sources of the users/clicks of the links vs. the visitors that didn't click on the links
Is there a way I can fire a given link on my blog into the Google Analytics API to get this data? I already have a 2-column table that has all of the pages on my site (column 1) and all of the links/urls on those pages (column 2).
I am using Python for all of this btw.
Thanks in advance for any help!
Regarding the information you're looking for:
You won't get organic keywords via the GA API: what you will get most of the time is (not provided) (here is some info and workarounds). You can get this data in the GA UI by linking the search console, but that data won't be exposed via the GA API, only the Search Console API (formerly "webmasters"), which unfortunately you won't be able to link with your GA data.
You will need to implement custom events if you want to track link clicks, as by default GA doesn't do it (here is an example which you can use for both internal and external links). Once you have the events implemented, you can use the ga:eventAction or ga:eventLabel to filter on your links (depending on how you implemented the events), and ga:totalEvents / ga:uniqueEvents to get the total / unique number of clicks.
You will need to create segments in order to define conditions about what users did or did not do. What I advise you to do is to create and test your segments via the UI to make sure they're correct, then simply refer to the segments via ID using the API.
As for the GA API implementation, before coding I advise you to get familiar with the API using:
The query explorer
Google Sheets + GA API plugin
Once you get the results you're looking for, you can automate things with the Google API Python client (it's the same client for nearly all Google APIs; GA is just one of the services you use it with), and you'll find some Python samples here. A minimal sketch follows below.
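For illustration, here is a rough sketch against the Core Reporting API (v3) with the Python client; the view ID, date range, page path, segment ID and key file name are all placeholders you would replace with your own, and it assumes the event tracking described above is already in place:

from googleapiclient.discovery import build
from google.oauth2 import service_account

# Placeholder service-account key; the account needs read access to the GA view.
creds = service_account.Credentials.from_service_account_file(
    "client_secrets.json",
    scopes=["https://www.googleapis.com/auth/analytics.readonly"],
)
analytics = build("analytics", "v3", credentials=creds)

response = analytics.data().ga().get(
    ids="ga:12345678",                      # your view (profile) ID
    start_date="2018-01-01",
    end_date="2018-01-31",
    metrics="ga:totalEvents,ga:uniqueEvents",
    dimensions="ga:eventLabel",             # or ga:eventAction, depending on your setup
    filters="ga:pagePath==/my-blog-post/",  # the page holding the links
    segment="gaid::-1",                     # replace with the ID of the segment you built in the UI
).execute()

for row in response.get("rows", []):
    print(row)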

My script fetches only a few of the reviews among several

I've written a script to get all the reviews, reviewer names and ratings from Yelp using their API. My script below can fetch three reviews, reviewer names and ratings from that API. However, I can see 44 such reviews on the landing page from which I collected the API URL using Chrome dev tools. How can I get all of them?
link to the landing page
This is my try:
import requests

res = requests.get("https://eatstreet.com/api/v2/restaurants/40225?yelp_site=")
name = res.json()['name']
for texualreviews in res.json()['yelpReviews']:
    reviewtext = texualreviews['message']
    revname = texualreviews['reviewerName']
    rating = texualreviews['rating']
    print(f'{name}\n{reviewtext}\n{revname}\n{rating}\n')
As I said earlier, my script above produces only three of the reviews, whereas there are 44 of them. How can I grab them all?
Screenshot of those reviews (where to find them on that landing page).
Yelp's own API doesn't allow querying more than 3 reviews; for whatever reason they limit the number of reviews you can get (the same way Google limits its API to displaying only 5 reviews). If you are scraping, scrape the Yelp page directly: the site you are hitting uses the API to display 3 reviews (the maximum) with a link back to that location's Yelp page (where all the reviews are shown). There is sadly no native way to extract all the reviews from Yelp.
The API URL you pulled from Chrome's developer tools (https://eatstreet.com/api/v2/restaurants/40225?yelp_site=) calls Yelp's Fusion API to pull the yelpReviews array into the JSON, which is limited to 3 by default. Even if you registered your own Fusion app you wouldn't be able to pull more than 3 reviews; that's a hard cap set by Yelp.
You could search for some makeshift scripts out there, though; I've seen many people attempt to build libraries for pulling review data where the APIs are limited. A good example is one I wrote here: https://github.com/ilanpatao/Yelp-Reviews-API
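If you do go the direct-scraping route, one option is to read the JSON-LD block that many Yelp business pages embed; whether that block is present, and how many reviews it carries, depends on Yelp's current markup, and the business URL below is a placeholder:

import json
import requests
from bs4 import BeautifulSoup

# Hypothetical Yelp business URL; a browser-like User-Agent helps avoid a plain-bot rejection.
url = "https://www.yelp.com/biz/some-restaurant-somewhere"
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
soup = BeautifulSoup(html, "html.parser")

# Look for structured data (which may include reviews) in JSON-LD blocks.
for script in soup.find_all("script", type="application/ld+json"):
    if not script.string:
        continue
    data = json.loads(script.string)
    if not isinstance(data, dict):
        continue
    reviews = data.get("review", [])
    if isinstance(reviews, dict):  # a single review can come back as a dict
        reviews = [reviews]
    for review in reviews:
        print(review.get("author"),
              review.get("reviewRating", {}).get("ratingValue"))
        print(review.get("description"), "\n")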
Best,
Ilan

Get all products from specific amazon store using scrapy

Is there a way to get all the items of a specific seller on Amazon?
When I try to submit requests using different forms of URLs to the store (the basic one being https://www.amazon.com/shops/), I get a 301 with no additional info.
Even before the spider itself, from the scrapy shell (some random shop from Amazon):
scrapy shell "https://www.amazon.com/shops/A3TJVJMBQL014A"
There is a 301 response code:
request <GET https://www.amazon.com/shops/A3TJVJMBQL014A>
response <301 https://www.amazon.com/shops/A3TJVJMBQL014A>
In the browser it will be redirected to https://www.amazon.com/s?marketplaceID=ATVPDKIKX0DER&me=A3TJVJMBQL014A&merchant=A3TJVJMBQL014A&redirect=true
Using the resulting URL also leads to a 301 response.
I was using the scrapy shell which, as answered by #PadraicCunningham, doesn't follow the Location header. Running the code from a spider resolved the issue.
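A minimal spider sketch, for reference; Scrapy spiders follow 301 redirects by default (unlike the shell as used above), and the parse callback here only logs the post-redirect URL:

import scrapy

class SellerSpider(scrapy.Spider):
    name = "amazon_seller"
    # The seller id from the question; the redirect middleware will follow
    # the 301 to the search-results URL shown in the browser.
    start_urls = ["https://www.amazon.com/shops/A3TJVJMBQL014A"]

    def parse(self, response):
        # response.url is the final URL after the redirect was followed.
        self.log(response.url)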
Since you want a list of all goods sold by one specific seller, you can analyze the page of that seller specifically.
Here, I am going to take Kindle E-readers Seller as an example.
Open the console in your browser and select the max-page-count element on the seller's page; you can see that the maximum number of pages for this seller sits inside a <span class="pagnLink"> </span> tag, so you can find this tag and extract the max page count from it.
You can see there is a slight change in the URL when you move to the next page of this seller's goods list (from page=1 to page=2), so you can easily construct a new URL when you want to move to the next page.
Set up a loop bounded by the max page count you got in the first step.
Analyze the specific data you want on each page, work out which HTML tags it sits inside, and use a text-parsing library to help you extract it (re, BeautifulSoup, etc.).
Briefly, you have to analyze the page before writing code.
When you start coding, first make the request, then get the response, then extract the useful data from the response (according to the rules you worked out before writing code), as in the sketch below.
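Here is a rough sketch of those steps with requests and BeautifulSoup; the seller URL pattern, the "pagnLink" class and the product selector are assumptions about Amazon's markup (which changes often), so adjust them to whatever you see in the console:

import requests
from bs4 import BeautifulSoup

# Hypothetical seller listing URL with a page parameter.
BASE_URL = "https://www.amazon.com/s?me=A3TJVJMBQL014A&page={page}"
HEADERS = {"User-Agent": "Mozilla/5.0"}

# Step 1: read the maximum page number out of the pagination links.
first = requests.get(BASE_URL.format(page=1), headers=HEADERS)
soup = BeautifulSoup(first.text, "html.parser")
page_numbers = [
    int(span.get_text(strip=True))
    for span in soup.find_all("span", class_="pagnLink")
    if span.get_text(strip=True).isdigit()
]
max_page = max(page_numbers) if page_numbers else 1

# Steps 2-4: loop over every page and pull out the data you analyzed.
for page in range(1, max_page + 1):
    resp = requests.get(BASE_URL.format(page=page), headers=HEADERS)
    page_soup = BeautifulSoup(resp.text, "html.parser")
    for title in page_soup.select("h2"):  # hypothetical product-title selector
        print(title.get_text(strip=True))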

Python scraping using inspect element or firebug

As I was going through this YouTube scraping tutorial, https://www.youtube.com/watch?v=qbEN3boz7_M, I learned that instead of scraping the "public" page, loaded heavily with all the other stuff, there is a way to find a "private" page and scrape the necessary information much more efficiently using inspect element/Firebug.
google chrome > inspect element > network > XHR
The person in the YouTube video uses a stock price as an example and is able to locate a "private" page that can be scraped much more quickly and with less load on the server. But when I tried the sites I wanted to scrape, for example http://www.rottentomatoes.com/m/grigris/, going through inspect element (Chrome) > Network > XHR and checking the requests' URLs and previews, I didn't seem to find anything useful.
Am I missing something? How can I tell whether raw or condensed information is hidden somewhere? Using the Rotten Tomatoes page as an example, how can I tell if there is 1) a "private" page that gives the title and year of the movie and 2) a summary page (in a CSV-like format) that "stores" all the movies' titles and years in one place?
You can only find XHR requests if the page loads data dynamically. In your example, the only thing of note is this URL:
http://www.rottentomatoes.com/api/private/v1.0/users/current/ratings/771355871
which contains some information about the movie as JSON:
{"media":{"type":"movie","id":771355871,"title":"Grigris","url":"http://www.rottentomatoes.com/m/grigris/","year":2014,"mpaa":"Unrated","runtime":"1 hr. 40 min.","synopsis":"Despite a bum leg, 25-year-old Grigris has hopes of becoming a professional dancer, making some extra cash putting his killer moves to good use on the...","thumbnail":"http://content6.flixster.com/movie/11/17/21/11172196_mob.jpg","cast":[{"name":"Souleymane Démé","id":"771446344"},{"name":"Anaïs Monory","id":"771446153"}]}}
Make sure you have the Chrome developer tools open when you load the site; if not, they don't capture any requests. You can open them and refresh the page, and then you should see the requests under the XHR filter.
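For example, a small sketch that pulls the title and year out of that JSON endpoint; whether the endpoint is still served, and whether it needs cookies or extra headers, is not guaranteed:

import requests

# The endpoint spotted in the XHR tab above.
url = "http://www.rottentomatoes.com/api/private/v1.0/users/current/ratings/771355871"
media = requests.get(url).json()["media"]

print(media["title"], media["year"])  # e.g. Grigris 2014
print(media["synopsis"])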

Scrape all the pages of a website when next page's follow-up link is not available in the current page source code

Hi, I have successfully scraped all the pages of a few shopping websites using Python and regular expressions.
But now I'm stuck scraping all the pages of a particular website where the next page's follow-up link is not present in the current page, like this one: http://www.jabong.com/men/clothing/mens-jeans/
This website loads the next pages' data into the same page dynamically via Ajax calls, so while scraping I am only able to get the data of the first page. But I need to scrape all the items present on all pages of that website.
I can't work out how to get the source code of all the pages of this kind of website, where the next page's follow-up link is not available on the current page. Please help me through this.
Looks like the site is using AJAX requests to get more search results as the user scrolls down. The initial set of search results can be found in the main request:
http://www.jabong.com/men/clothing/mens-jeans/
As the user scrolls down, the page detects when they reach the end of the current set of results and loads the next set as needed:
http://www.jabong.com/men/clothing/mens-jeans/?page=2
One approach would be to simply keep requesting subsequent pages until you find a page with no results, as in the sketch below.
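A rough sketch of that approach; the ?page= parameter comes from watching the AJAX calls, but the item selector below is a placeholder that has to match the site's actual markup:

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://www.jabong.com/men/clothing/mens-jeans/?page={page}"
HEADERS = {"User-Agent": "Mozilla/5.0"}

page = 1
while True:
    resp = requests.get(BASE_URL.format(page=page), headers=HEADERS)
    soup = BeautifulSoup(resp.text, "html.parser")

    items = soup.select("div.product")  # hypothetical item selector
    if not items:
        break  # an empty page means there are no more results

    for item in items:
        print(item.get_text(strip=True))

    page += 1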
By the way, I was able to determine this by using the proxy tool in screen-scraper. You could also use a tool like Charles or HttpFox. The key is to browse the site and watch what HTTP requests get made so that you can mimic them in your code.
