(scrapy) HTTP GET response different from what's being displayed on browser - python

I am new to web scraping and scrapy.
I am trying to scrape items from a website by parsing the GET response which is in json.
However, I am noticing that instead of having just the 90 or so elements that are shown on the website, the raw json response contains 140 + elements.
Just by inspecting the json array, there doesn't seem to be any difference between items that do end up getting displayed in the browser vs those that don't.
Is it possible for me to capture with scrapy the filtered array of items instead of the raw information?
So I've realized that when the website loads, it makes 1 request for product details and 1 for stock availability. By cross-checking those responses, I realized that only those products with items available are displayed.
Now my questions is, can these 2 requests be handled in one scrapy spider class?

I'd recommend scraping all the items, and then filtering them in a custom pipeline.
You would simply get the stock data in open_spider(), and filter out the items you don't need in process_item().

Related

Where is the information stored in a html? (web-scraping)

A beginner here.
I want to extract all the jobs from Barclays (https://search.jobs.barclays/search-jobs)
I got through scraping the first page but am struggling to go to the next page, as the url don't change.
I tried to scrape the url on the next page button, but that href brings me back to the homepage.
Does that mean that all the job data is actually stored within the original html?
If so, how can I extract it?
Thanks!
So I analyzed the website, and it communicates with the server using an API, so you can get data directly from it as a JSON file.
This is the API link in this specific case(for my computer): https://search.jobs.barclays/search-jobs/results?ActiveFacetID=44699&CurrentPage=2&RecordsPerPage=15&Distance=50&RadiusUnitType=0&Keywords=&Location=&ShowRadius=False&IsPagination=False&CustomFacetName=&FacetTerm=&FacetType=0&SearchResultsModuleName=Search+Results&SearchFiltersModuleName=Search+Filters&SortCriteria=0&SortDirection=0&SearchType=5&PostalCode=&fc=&fl=&fcf=&afc=&afl=&afcf=
For you the url might be different, but the concept is the same:
As you ca see, there is a 'CurrentPage=2' inside the url which you can use to get any of the pages using requests, then extract what you need from the json.

Unable to get all the links within a page

I am trying to scrape this page:
https://www.jny.com/collections/bottoms
It has a total of 55 products listed with only 24 listed once the page is loaded. However, the div contains list of all the 55 products. I am trying to scrape that using scrappy like this :
def parse(self, response):
print("in herre")
self.product_url = response.xpath('//div[#class = "collection-grid js-filter-grid"]//a/#href').getall()
print(len(self.product_url))
print(self.product_url)
It only gives me a list of length 25. How do I get the rest?
I would suggest scraping it through the API directly - the other option would be rendering Javascript using something like Splash/Selenium, which is really not ideal.
If you open up the Network panel in the Developer Tools on Chrome/Firefox, filter down to only the XHR Requests and reload the page, you should be able to see all of the requests being sent out. Some of those requests can help us figure out how the data is being loaded into the HTML. Here's a screenshot of what's going on there behind the scenes.
Clicking on those requests can give us more details on how the requests are being made and the request structure. At the end of the day, for your use case, you would probably want to send out a request to https://www.jny.com/collections/bottoms/products.json?limit=250&page=1 and parse the body_html attribute for each Product in the response (perhaps using scrapy.selector.Selector) and use that however you want. Good luck!

Incomplete data after scraping a website for Data

I am working on some web scraping using Python and experienced some issues with extracting the table values. For example, I am interested in scraping the ETFs values from http://www.etf.com/etfanalytics/etf-finder. Below is a snapshot of the tables I am trying to scrap values from.
Here is the codes which I am trying to use in the scraping.
#Import packages
import pandas as pd
import requests
#Get website url and get request
etf_list = "http://www.etf.com/etfanalytics/etf-finder"
etf_df = pd.read_html(requests.get(etf_list, headers={'User-agent':
'Mozilla/5.0'}).text)
#printing the scraped data to screen
print(etf_df)
# Output the read data into dataframes
for i in range(0,len(etf_df)):
frame[i] = pd.DataFrame(etf_df[i])
print(frame[i])
I have several issues.
The tables only consist of 20 entries while the total entries per table from the website should be 2166 entries. How do I amend the code to pull all the values?
Some of the dataframes could not be properly assigned after scraping from the site. For example, the outputs for frame[0] is not a dataframe format and nothing was seen for frame[0] when trying to view as DataFrame under the Python console. However it seems fine when printing to the screen. Would it be better if I phase the HTML using beautifulSoup instead?
As noted by Alex, the website requests the data from http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1, which checks the Referer header to see if you're allowed to see it.
However, Alex is wrong in saying that you're unable to change the header.
It is in fact very easy to send custom headers using requests:
>>> r = requests.get('http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1', headers={'Referer': 'http://www.etf.com/etfanalytics/etf-finder'})
>>> data = r.json()
>>> len(data)
2166
At this point, data is a dict containing all the data you need, pandas probably has a simple way of loading it into a dataframe.
You get only 20 rows of the table, because only 20 rows are present on the html page by default. View the source-code of the page, you are trying to parse. There could be a possible solution to iterate through the pagination til the end, but pagination there is implemented with JS, it is not reflected in the URL, so I don't see, how you can access next pages of the table directly.
Looks like there is a request to
http://www.etf.com/etf-finder-funds-api//-aum/100/100/1
on that page, when I try to load the 2nd group of 100 rows. But getting an access to that URL might very tricky if possible. Maybe for this particular site you should use something, like WebBrowser in C# (I don't know what it will be in python, but I'm sure that python can do everything). You will be able to imitate browser and execute javascript.
Edit: I've tried to run the next JS code in console on the page, you provided.
jQuery.ajax({
url: "http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1",
success: function(data) {
console.log(JSON.parse(data));
}
});
It logged an array of all 2166 objects, representing table rows, you are looking for. Try it yourself to see the result. Looks like in the request url "0" is a start index and "3000" is a limit.
But if you try this from some other domain you will get 403 Forbidden. This is because of they have a Referer header check.
Edit again as mentioned by #stranac it is easy to set that header. Just set it to http://www.etf.com/etfanalytics/etf-finder and enjoy.

Scraping the content of a box contains infinite scrolling in Python

I am new to Python and web crawling. I intend to scrape links in the top stories of a website. I was told to look at to its Ajax requests and send similar ones. The problem is that all requests for the links are same: http://www.marketwatch.com/newsviewer/mktwheadlines
My question would be how to extract links from an infinite scrolling box like this. I am using beautiful soup, but I think it's not suitable for this task. I am also not familiar with Selenium and java scripts. I know how to scrape certain requests by Scrapy though.
It is indeed an AJAX request. If you take a look at the network tab in your browser inspector:
You can see that it's making a POST request to download the urls to the articles.
Every value is self explanatory in here except maybe for docid and timestamp. docid seems to indicate which box to pull articles for(there are multiple boxes on the page) and it seems to be the id attached to <li> element under which the article urls are stored.
Fortunately in this case POST and GET are interchangable. Also timestamp paremeter doesn't seem to be required. So in all you can actually view the results in your browser, by right clicking the url in the inspector and selecting "copy location with parameters":
http://www.marketwatch.com/newsviewer/mktwheadlines?blogs=true&commentary=true&docId=1275261016&premium=true&pullCount=100&pulse=true&rtheadlines=true&topic=All%20Topics&topstories=true&video=true
This example has timestamp parameter removed as well as increased pullCount to 100, so simply request it, it will return 100 of article urls.
You can mess around more to reverse engineer how the website does it and what the use of every keyword, but this is a good start.

Get all products from specific amazon store using scrapy

Is there is a way to get all items of specific seller on amazon?
When I try to submit requests using different forms of urls to the store (the basic is ("https://www.amazon.com/shops/"), I'm getting 301 with no additional info.
even before the spider itself, from the scrapy shell (some random shop from amazon)
scrapy shell "https://www.amazon.com/shops/A3TJVJMBQL014A"
There is 301 response code:
request <GET https://www.amazon.com/shops/A3TJVJMBQL014A>
response <301 https://www.amazon.com/shops/A3TJVJMBQL014A>
In the browser it will be redirected to https://www.amazon.com/s?marketplaceID=ATVPDKIKX0DER&me=A3TJVJMBQL014A&merchant=A3TJVJMBQL014A&redirect=true
using resulting URL also leads to 301 response.
I was using scrapy shell, while as answered by #PadraicCunningham it doesn't support location header.
Running code from spider resolved the issue.
Since you want a list of all goods sold by one specific seller, you can analyze the page of that seller specifically.
Here, I am going to take Kindle E-readers Seller as an example.
Open the console in your browser and select the max page count element on the seller's page, you can see the number of max pages of this seller is inside a tag <span class="pagnLink"> </span>, so you can find this tag and extract the max page count from it.
you can see there is a slight change in the url when you move to next page of this seller's goods list (from page=1 to page=2), so you can easily construct a new url when you wanna move to next page.
set a loop whose limitation is the number of max page count you got in the first step.
analyze the specific data you wanna get on that page, analyze what html tags they are inside and use some text analyze libraries to help you extract the data. (re, BeautifulSoup .etc)
Briefly, you have to analyze the page before writing codes.
When you start coding, you should first making requests, then get response from your request, then extracting useful data from the response(according to the rules you analyzed before writing codes).

Categories