Scrapy: Maintain location cookie for redirects - python

Code:
# -*- coding: utf-8 -*-
import scrapy
from ..items import LowesspiderItem
from scrapy.http import Request

class LowesSpider(scrapy.Spider):
    name = 'lowes'

    def start_requests(self):
        start_urls = ['https://www.lowes.com/search?searchTerm=8654RM-42']
        for url in start_urls:
            yield Request(url, cookies={'sn': '2333'})  # Added cookie to bypass location requirement

    def parse(self, response):
        items = response.css('.grid-container')
        for product in items:
            item = LowesspiderItem()
            # get product price
            productPrice = product.css('.art-pd-price::text').get()
            # get lowesNum
            productLowesNum = response.url.split("/")[-1]
            # get SKU
            productSKU = product.css('.met-product-model::text').get()
            item["productLowesNum"] = productLowesNum
            item["productSKU"] = productSKU
            item["productPrice"] = productPrice
            yield item
Output:
{'productLowesNum': '1001440644',
'productPrice': None,
'productSKU': '8654RM-42'}
Now, I'll have a list of SKUs, so that's how I'm going to format start_urls:
start_urls = ['https://www.lowes.com/search?searchTerm=('some sku)']
This URL redirects me to this link: https://www.lowes.com/pd/ZLINE-KITCHEN-BATH-Ducted-Red-Matte-Wall-Mounted-Range-Hood-Common-42-Inch-Actual-42-in/1001440644
That redirect is handled by Scrapy.
Now, the problem:
When I have
start_urls = ['https://www.lowes.com/search?searchTerm=8654RM-42']
I get the SKU but not the price.
However, when I use the actual product URL in start_urls,
start_urls = ['https://www.lowes.com/pd/ZLINE-KITCHEN-BATH-Ducted-Red-Matte-Wall-Mounted-Range-Hood-Common-42-Inch-Actual-42-in/1001440644']
then my output is fine:
{'productLowesNum': '1001440644',
'productPrice': '1,449.95',
'productSKU': '8654RM-42'}
So I believe that using a URL that has to be redirected somehow prevents my scraper from getting the price, though I still get the SKU.
Here's my guess: I had to preset a location cookie because the Lowes website does not show the price unless the user provides a zip code/location, so I assume I would have to move or adjust cookies={'sn':'2333'} to make my program work as expected.

Problem
The main issue here is that some of the cookies set by the first request
https://www.lowes.com/search?searchTerm=8654RM-42
are carried forward to the request made after the redirect, which is
https://www.lowes.com/pd/ZLINE-KITCHEN-BATH-Ducted-Red-Matte-Wall-Mounted-Range-Hood-Common-42-Inch-Actual-42-in/1001440644
These cookies override the cookies set by you.
Solution
You need to send explicit cookies with each request and prevent the cookies from the previous request from being added to the next one.
Scrapy has a request meta key called dont_merge_cookies for exactly this purpose. Set it in your request meta to prevent cookies from previous requests from being appended to the next request.
Then explicitly set the cookies in the request header, something like this:
def start_requests(self):
    start_urls = ['https://www.lowes.com/search?searchTerm=8654RM-42']
    for url in start_urls:
        yield Request(url, headers={'Cookie': 'sn=2333;'}, meta={'dont_merge_cookies': True})
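Since the question mentions building start_urls from a list of SKUs, here is a minimal sketch of how the same Cookie header and dont_merge_cookies flag might be applied across several search URLs. The spider name, SKU list, and the placeholder parse body are illustrative, not taken from the thread:

import scrapy
from scrapy.http import Request

class LowesSkuSpider(scrapy.Spider):
    name = 'lowes_sku'

    # Placeholder SKUs -- substitute the real list here
    skus = ['8654RM-42', '1001440644']

    def start_requests(self):
        for sku in self.skus:
            url = f'https://www.lowes.com/search?searchTerm={sku}'
            # Send the location cookie explicitly and stop Scrapy from merging
            # in cookies set along the redirect chain
            yield Request(
                url,
                headers={'Cookie': 'sn=2333;'},
                meta={'dont_merge_cookies': True},
            )

    def parse(self, response):
        # The original parsing logic from the question would go here
        yield {'url': response.url}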

Related

Scrapy response url not exactly the same as the one I defined in start_urls

I have a spider and I give it this url: https://tuskys.dpo.store/#!/~/search/keyword=dairy milk
However, when I try to get the url in the Scrapy parse method, the url looks like https://tuskys.dpo.store/?_escaped_fragment_=%2F%7E%2Fsearch%2Fkeyword%3Ddairy%2520milk
Here is a demo code to demonstrate my problem
import scrapy

class TuskysDpoSpider(scrapy.Spider):
    name = "Tuskys_dpo"
    #allowed_domains = ['ebay.com']
    start_urls = ['https://tuskys.dpo.store/#!/~/search/keyword=dairy milk']

    def parse(self, response):
        yield {'url': response.url}
results: {"url": "https://tuskys.dpo.store/?_escaped_fragment_=%2F%7E%2Fsearch%2Fkeyword%3Ddairy%2520milk"}
Why is my Scrapy response url not exactly the same as the url I defined, and is there a way to get around this?
You should use response.request.url because you are redirected from your start url, so response.url is the url you are redirected to.
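A minimal sketch of that suggestion, printing both URLs side by side (illustrative only, not taken from the thread):

def parse(self, response):
    # response.url is the URL the response came back from;
    # response.request.url is the URL of the request that produced it
    yield {
        'response_url': response.url,
        'request_url': response.request.url,
    }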

How can I take data from all pages?

It's my first time using the Scrapy framework for Python, so I made this code.
# -*- coding: utf-8 -*-
import scrapy

class SpiderSpider(scrapy.Spider):
    name = 'spider'
    start_urls = [
        'https://www.emag.ro/televizoare/c'
    ]

    def parse(self, response):
        for i in response.xpath('//div[@class="card-section-wrapper js-section-wrapper"]'):
            yield {
                'product-name': i.xpath('.//a[@class="product-title js-product-url"]/text()')
                                 .extract_first().replace('\n', '')
            }
        next_page_url = response.xpath('//a[@class="js-change-page"]/@href').extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))
When I look at the website it has over 800 products, but my script only takes the first 2 pages, nearly 200 products...
I tried both CSS selectors and XPath; same bug either way.
Can anyone figure out where the problem is?
Thank you!
The website you are trying to crawl gets its data from an API. When you click the pagination link, it sends an AJAX request to the API to fetch more products and shows them on the page.
Since Scrapy doesn't simulate the browser environment itself, one way would be to:
Analyse the request in your browser's network tab to inspect the endpoint and parameters.
Build a similar request yourself in Scrapy.
Call that endpoint with appropriate arguments to get the products from the API.
You also need to extract the next page from the JSON response you get from the API. Usually there is a key named pagination which contains info related to total pages, next page, etc.
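A minimal sketch of that approach with a hypothetical JSON endpoint; the endpoint URL, parameters, and key names ('items', 'pagination', 'next_page') are illustrative, not the real emag.ro API:

import json
import scrapy

class ApiSpider(scrapy.Spider):
    name = 'api_example'
    # Hypothetical endpoint discovered in the browser's network tab
    api_url = 'https://www.example.com/api/products?category=tv&page={page}'

    def start_requests(self):
        yield scrapy.Request(self.api_url.format(page=1), callback=self.parse_api)

    def parse_api(self, response):
        data = json.loads(response.text)
        # 'items' and 'pagination' are assumed key names -- check the real response
        for product in data.get('items', []):
            yield {'product-name': product.get('name')}
        next_page = data.get('pagination', {}).get('next_page')
        if next_page:
            yield scrapy.Request(self.api_url.format(page=next_page), callback=self.parse_api)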
I finally figured out how to do it.
# -*- coding: utf-8 -*-
import scrapy
from ..items import ScraperItem

class SpiderSpider(scrapy.Spider):
    name = 'spider'
    page_number = 2
    start_urls = [
        'https://www.emag.ro/televizoare/c'
    ]

    def parse(self, response):
        items = ScraperItem()
        for i in response.xpath('//div[@class="card-section-wrapper js-section-wrapper"]'):
            product_name = i.xpath('.//a[@class="product-title js-product-url"]/text()').extract_first().replace('\n ', '').replace('\n ', '')
            items["product_name"] = product_name
            yield items
        next_page = 'https://www.emag.ro/televizoare/p' + str(SpiderSpider.page_number) + '/c'
        if SpiderSpider.page_number <= 28:
            SpiderSpider.page_number += 1
            yield response.follow(next_page, callback=self.parse)

Scrapy: How To Start Scraping Data From a Search Result that uses Javascript

I am new to using Scrapy and Python.
I want to start scraping data from a search result. If you load the page, the default content appears; what I need to scrape is the filtered result, while also handling pagination.
Here's the URL
https://teslamotorsclub.com/tmc/post-ratings/6/posts
I need to scrape the items from the Time Filter: "Today" result.
I tried different approaches, but none are working.
What I have done so far is this, but it is more of a layout structure.
class TmcnfSpider(scrapy.Spider):
    name = 'tmcnf'
    allowed_domains = ['teslamotorsclub.com']
    start_urls = ['https://teslamotorsclub.com/tmc/post-ratings/6/posts']

    def start_requests(self):
        # Show form from a filtered search result
        pass

    def parse(self, response):
        # some code scraping item
        # Yield url for pagination
        pass
To get the posts for today's filter, you need to send a POST request to this url https://teslamotorsclub.com/tmc/post-ratings/6/posts along with a payload. The following should fetch the results you are interested in.
import scrapy

class TmcnfSpider(scrapy.Spider):
    name = "teslamotorsclub"
    start_urls = ["https://teslamotorsclub.com/tmc/post-ratings/6/posts"]

    def parse(self, response):
        payload = {'time_chooser': '4', '_xfToken': ''}
        yield scrapy.FormRequest(response.url, formdata=payload, callback=self.parse_results)

    def parse_results(self, response):
        for items in response.css("h3.title > a::text").getall():
            yield {"title": items.strip()}

Soundcloud Scrapy Spider

I'm trying to build a Scrapy Spider to parse the artist and track info from SoundCloud.
Using the developer tools in Firefox I've determined that an API call can be made which returns a JSON object that converts to a Python dictionary. This API call needs an artist ID, and as far as I can tell these IDs are auto-incremented. This means I don't need to crawl the site, and can just have a list of starting URLs that make the initial API call and then parse the pages that follow from that. I believe this should make my scraper friendlier to the site?
From the returned response the artists' URL can be obtained, and visiting and parsing this URL will give more information about the artist
From the artists' URL we can visit their tracks and scrape a list of tracks alongside the tracks' attributes.
I think the issues I'm having stem from not understanding Scrapy's framework...
If I directly put the artists' URLs in start_urls, Scrapy passes a scrapy.http.response.html.HtmlResponse object to parse_artist. This allows me to extract the data I need (I didn't include all of the page-parsing code to keep the snippet shorter). However, if I pass that same object to the same function from parse_api_call, it results in an error...
I cannot understand why this is, and any help would be appreciated.
Side Note:
The initial API call grabs tracks from the artist, and the offset and limit can be changed and the function called recursively to collect the tracks. This, however, has proven unreliable, and even when it doesn't result in an error that terminates the program, it doesn't get a full list of tracks from the artist.
Here's the current code:
"""
Scrapes SoundCloud websites for artists and tracks
"""
import json
import scrapy
from ..items import TrackItem, ArtistItem
from scrapy.spiders.crawl import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
class SoundCloudBot(scrapy.Spider):
name = 'soundcloudBot'
allowed_domains = ['soundcloud.com']
start_urls = [
'https://api-v2.soundcloud.com/users/7436630/tracks?offset=0&limit=20&client_id=Q11Oe0rIPEuxvMeMbdXV7qaowYzlaESv&app_version=1556892058&app_locale=en',
'https://api-v2.soundcloud.com/users/4803918/tracks?offset=0&limit=20&client_id=Q11Oe0rIPEuxvMeMbdXV7qaowYzlaESv&app_version=1556892058&app_locale=en',
'https://api-v2.soundcloud.com/users/17364233/tracks?offset=0&limit=20&client_id=Q11Oe0rIPEuxvMeMbdXV7qaowYzlaESv&app_version=1556892058&app_locale=en',
'https://api-v2.soundcloud.com/users/19697240/tracks?offset=0&limit=20&client_id=Q11Oe0rIPEuxvMeMbdXV7qaowYzlaESv&app_version=1556892058&app_locale=en',
'https://api-v2.soundcloud.com/users/5949564/tracks?offset=0&limit=20&client_id=Q11Oe0rIPEuxvMeMbdXV7qaowYzlaESv&app_version=1556892058&app_locale=en'
]
# This is added for testing purposes. When these links are added directly to the
# start_urls the code runs as expected, when these links are extracted using parse_api_call
# is when problems arise
# start_urls.extend([
# 'https://soundcloud.com/futureisnow',
# 'https://soundcloud.com/bigsean-1',
# 'https://soundcloud.com/defjam',
# 'https://soundcloud.com/ymcmbofficial',
# 'https://soundcloud.com/walefolarin',
# # 'https://soundcloud.com/futureisnow/tracks',
# # 'https://soundcloud.com/bigsean-1/tracks',
# # 'https://soundcloud.com/defjam/tracks',
# # 'https://soundcloud.com/ymcmbofficial/tracks',
# # 'https://soundcloud.com/walefolarin/tracks'
# ])
def parse(self, response):
url = response.url
if url[:35] == 'https://api-v2.soundcloud.com/users':
self.parse_api_call(response)
# 'https://soundcloud.com/{artist}'
elif url.replace('https://soundcloud.com', '').count('/') == 1: # One starting forward slash for artist folder
self.parse_artist(response)
# 'https://soundcloud.com/{artist}/{track}'
elif url.replace('https://soundcloud.com', '').count('/') == 2 and url[-6:] == 'tracks':
self.parse_tracks(response)
def parse_api_call(self, response):
data = json.loads(response.text)
artistItem = ArtistItem()
first_track = data['collection'][0]
artist_info = first_track.get('user')
artist_id = artist_info.get('id')
artist_url = artist_info.get('permalink_url')
artist_name = artist_info.get('username')
artistItem['artist_id'] = artist_id
artistItem['username'] = artist_name
artistItem['url'] = artist_url
artist_response = scrapy.http.response.html.HtmlResponse(artist_url)
self.parse_artist(artist_response)
# Once the pipelines are written this will be changed to yeild
return artistItem
def parse_artist(self, response):
# This prints out <class 'scrapy.http.response.html.HtmlResponse'>
# It doesn't matter if start_urls get extend with artists' URLS or not
print(type(response))
data = response.css('script::text').extract()
# This prints out a full HTML response if the function is called directly
# With scrapy, or an empty list if called from parse_api_call
print(data)
track_response = scrapy.http.response.html.HtmlResponse(f'{response.url}/tracks')
self.parse_tracks(track_response)
def parse_tracks(self, response):
pass
You have to use
Request(url)
to get data from a new url. But you can't execute it like a normal function and get the result at once. You have to use return Request() or yield Request(), and Scrapy puts it in a queue to fetch the data later.
After it gets the data, it uses the parse() method to parse the response. But you can set your own method in the request:
Request(url, self.parse_artist)
But in parse_artist() you will not have access to the data you got in the previous function, so you have to send it in the request using meta, i.e.
Request(artistItem['url'], self.parse_artist, meta={'item': artistItem})
Full working code. You can put it all in one file and run it without creating a project.
It also saves the results in output.csv.
import scrapy
from scrapy.http import Request
import json

class MySpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['soundcloud.com']
    start_urls = [
        'https://api-v2.soundcloud.com/users/7436630/tracks?offset=0&limit=20&client_id=Q11Oe0rIPEuxvMeMbdXV7qaowYzlaESv&app_version=1556892058&app_locale=en',
        'https://api-v2.soundcloud.com/users/4803918/tracks?offset=0&limit=20&client_id=Q11Oe0rIPEuxvMeMbdXV7qaowYzlaESv&app_version=1556892058&app_locale=en',
        'https://api-v2.soundcloud.com/users/17364233/tracks?offset=0&limit=20&client_id=Q11Oe0rIPEuxvMeMbdXV7qaowYzlaESv&app_version=1556892058&app_locale=en',
        'https://api-v2.soundcloud.com/users/19697240/tracks?offset=0&limit=20&client_id=Q11Oe0rIPEuxvMeMbdXV7qaowYzlaESv&app_version=1556892058&app_locale=en',
        'https://api-v2.soundcloud.com/users/5949564/tracks?offset=0&limit=20&client_id=Q11Oe0rIPEuxvMeMbdXV7qaowYzlaESv&app_version=1556892058&app_locale=en'
    ]

    def parse(self, response):
        data = json.loads(response.text)
        if len(data['collection']) > 0:
            artist_info = data['collection'][0]['user']
            artistItem = {
                'artist_id': artist_info.get('id'),
                'username': artist_info.get('username'),
                'url': artist_info.get('permalink_url'),
            }
            print('>>>', artistItem['url'])
            # make a request to artistItem['url'],
            # parse the response in parse_artist,
            # send artistItem to parse_artist
            return Request(artistItem['url'], self.parse_artist, meta={'item': artistItem})
        else:
            print("ERROR: no collections in data")

    def parse_artist(self, response):
        artistItem = response.meta['item']
        data = response.css('script::text').extract()
        # add data to artistItem
        #print(data)
        artistItem['new data'] = 'some new data'
        #print('>>>', response.urljoin('tracks'))
        print('>>>', response.url + '/tracks')
        # make a request to response.url + '/tracks',
        # parse the response in parse_tracks,
        # send artistItem to parse_tracks
        return Request(response.url + '/tracks', self.parse_tracks, meta={'item': artistItem})

    def parse_tracks(self, response):
        artistItem = response.meta['item']
        artistItem['tracks'] = 'some tracks'
        # send to CSV file
        return artistItem

#------------------------------------------------------------------------------
# run it without creating a project
#------------------------------------------------------------------------------
from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0',
    # save in file as CSV, JSON or XML
    'FEED_FORMAT': 'csv',   # csv, json, xml
    'FEED_URI': 'output.csv',
})
c.crawl(MySpider)
c.start()
output.csv
artist_id,username,url,new data,tracks
17364233,Def Jam Recordings,https://soundcloud.com/defjam,some new data,some tracks
4803918,Big Sean,https://soundcloud.com/bigsean-1,some new data,some tracks
19697240,YMCMB-Official,https://soundcloud.com/ymcmbofficial,some new data,some tracks
5949564,WALE,https://soundcloud.com/walefolarin,some new data,some tracks
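The question's side note mentions paging through the tracks API by adjusting offset and limit. Here is a minimal sketch of that idea layered onto the answer above; the stop condition (an empty collection means no more pages) and field names beyond 'collection' and 'user' are assumptions, not confirmed in the thread:

import json
import scrapy

class TracksPagingSpider(scrapy.Spider):
    name = 'tracks_paging'
    allowed_domains = ['soundcloud.com']
    # Same endpoint shape as in the question; user id and client_id come from the thread
    base_url = ('https://api-v2.soundcloud.com/users/7436630/tracks'
                '?offset={offset}&limit=20&client_id=Q11Oe0rIPEuxvMeMbdXV7qaowYzlaESv'
                '&app_version=1556892058&app_locale=en')

    def start_requests(self):
        yield scrapy.Request(self.base_url.format(offset=0),
                             self.parse_page, meta={'offset': 0})

    def parse_page(self, response):
        data = json.loads(response.text)
        collection = data.get('collection', [])
        for track in collection:
            # 'title' is an assumed field name -- check the real JSON
            yield {'title': track.get('title'),
                   'artist': track.get('user', {}).get('username')}
        if collection:
            # Assumed stop condition: stop when a page comes back empty
            next_offset = response.meta['offset'] + 20
            yield scrapy.Request(self.base_url.format(offset=next_offset),
                                 self.parse_page, meta={'offset': next_offset})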

Scrapy callback after redirect

I have a very basic Scrapy spider, which grabs urls from a file and then downloads them. The only problem is that some of them get redirected to a slightly modified url within the same domain. I want to get them in my callback function using response.meta, and it works on normal urls, but when the url is redirected the callback doesn't seem to get called. How can I fix it?
Here's my code.
from scrapy.contrib.spiders import CrawlSpider
from scrapy import log
from scrapy import Request

class DmozSpider(CrawlSpider):
    name = "dmoz"
    handle_httpstatus_list = [302]
    allowed_domains = ["http://www.exmaple.net/"]
    f = open("C:\\python27\\1a.csv", 'r')
    url = 'http://www.exmaple.net/Query?indx='
    start_urls = [url + row for row in f.readlines()]

    def parse(self, response):
        print response.meta.get('redirect_urls', [response.url])
        print response.status
        print (response.headers.get('Location'))
I've also tried something like that:
def parse(self, response):
    return Request(response.url, meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}, callback=self.parse_my_url)

def parse_my_url(self, response):
    print response.status
    print (response.headers.get('Location'))
And it doesn't work either.
By default, Scrapy follows redirects. If you don't want the redirect to be followed, use the start_requests method and add the flags in the request meta, like this:
def start_requests(self):
    requests = [Request(self.url + u, meta={'handle_httpstatus_list': [302],
                                            'dont_redirect': True},
                        callback=self.parse)
                for u in self.start_urls]
    return requests
