Scrapy fixed URLs - python

Trying to wrap my head around this... I have a fixed list of 100,000 URLs I would like to scrape, which is fine, I know how to handle that. But first I need to get a cookie from an initial form post and use it for the subsequent requests. Would that be like a nested spider? Just trying to understand the architecture for that use case.
Thanks!

Scrapy handles cookies automatically.
All you need to do is submit the login form post first, then yield the requests for your 100,000 URLs from its callback.
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = (
        'https://example.com/login',  # login page
    )

    def __init__(self, *args, **kwargs):
        self.url_list = []  # your list of URLs
        return super(MySpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        data = {}
        return scrapy.FormRequest.from_response(
            response,
            formdata=data,
            callback=self.my_start_requests
        )

    def my_start_requests(self, response):
        # ignore the login callback response
        for url in self.url_list:
            # Scrapy will take care of the cookies
            yield scrapy.Request(url, callback=self.parse_item, dont_filter=True)

    def parse_item(self, response):
        # your code here
        pass
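
As a usage note, here is a minimal sketch of how the url_list placeholder above could be filled, assuming the 100,000 URLs live one per line in a plain-text file; the urls.txt filename is an assumption, not something from the question:

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ('https://example.com/login',)  # login page

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # load the fixed URL list from a file, one URL per line (urls.txt is hypothetical)
        with open('urls.txt') as f:
            self.url_list = [line.strip() for line in f if line.strip()]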

Related

How to return information from the parse function to the start_requests function?

This problem is starting to frustrate me very much as I feel like I have no clue how scrapy works and that I can't wrap my head around the documentation.
My question is simple. I have the most standard of spiders.
class MySpider(scrapy.Spider):
    def start_requests(self):
        header = ..
        url = "www.website.whatever/search/filter1=.../&page=1"
        test = scrapy.Request(url=url, callback=self.parse, headers=header)

    def parse(self, response):
        site_number_of_pages = int(response.xpath(..))
        return site_number_of_pages
I just want to somehow get the number of pages from the parse function back into the start_requests function so I can start a for loop to go through all the pages on the website, using the same parse function again. The code above only illustrates the principle and would not work in practice. The variable test would be a Request object and not the plain old integer that I want.
How would I accomplish what I am trying to do?
EDIT:
This is what I have tried up till now
class MySpider(scrapy.Spider):
    def start_requests(self):
        header = ..
        url = ..
        yield scrapy.Request(url=url, callback=self.parse, headers=header)

    def parse(self, response):
        header = ..
        site_number_of_pages = int(response.xpath(..))
        for count in range(2, site_number_of_pages):
            url = url + str(count)
            yield scrapy.Request(url=url, callback=self.parse, headers=header)
Scrapy is an asynchronous framework. There is no way to return a value back to start_requests - there are only Requests followed by their callbacks.
In general, if requests are produced as the result of parsing some response (in your case, site_number_of_pages from the first URL), they do not belong in start_requests.
The easiest thing you can do in this case is to yield the requests from the parse method.
def parse(self, response):
    site_number_of_pages = int(response.xpath(..))
    for i in range(site_number_of_pages):
        ...
        yield Request(url=...
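
Filling in the gaps, a self-contained sketch of that approach; the URL pattern, the header and the pagination XPath below are placeholders rather than details from the question:

import scrapy

class PagesSpider(scrapy.Spider):
    name = "pages"

    def start_requests(self):
        headers = {"User-Agent": "Mozilla/5.0"}  # placeholder header
        url = "https://www.website.whatever/search/filter1=.../&page=1"
        yield scrapy.Request(url=url, callback=self.parse, headers=headers)

    def parse(self, response):
        # hypothetical XPath; adjust to the site's real pagination element
        site_number_of_pages = int(response.xpath('//span[@class="page-count"]/text()').get())
        for count in range(2, site_number_of_pages + 1):
            # rebuild each page URL from scratch rather than appending to the previous one
            page_url = "https://www.website.whatever/search/filter1=.../&page=" + str(count)
            yield scrapy.Request(url=page_url, callback=self.parse_page)

    def parse_page(self, response):
        # extract the items of each results page here
        pass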
Instead of grabbing the number of pages and looping through all of them, I grabbed the "next page" feature of the web page. Every time self.parse runs, it grabs the next page and schedules itself again as the callback. This goes on until there is no next page, at which point it simply errors out.
class MySpider(scrapy.Spider):
    def start_requests(self):
        header = ..
        url = "www.website.whatever/search/filter1=.../&page=1"
        yield scrapy.Request(url=url, callback=self.parse, headers=header)

    def parse(self, response):
        header = ..
        ..
        next_page = response.xpath(..)
        url = "www.website.whatever/search/filter1=.../&page=" + next_page
        yield scrapy.Request(url=url, callback=self.parse, headers=header)
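
As a small follow-up on that design choice, the final error can be avoided by yielding the next request only when a next-page link is actually found; a brief sketch, with the XPath being an assumption:

def parse(self, response):
    # ... extract items from the current page ...
    next_page = response.xpath('//a[@rel="next"]/@href').get()  # assumed selector
    if next_page:
        # stop cleanly when there is no next page instead of erroring out
        yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)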

Scrapy to login and then grab data from Weibo

I am still trying to use Scrapy to collect data from pages on Weibo which need to be logged in to access.
I now understand that I need to use Scrapy FormRequests to get the login cookie. I have updated my Spider to try to make it do this, but it still isn't working.
Can anybody tell me what I am doing wrong?
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'WB'

    def start_requests(self):
        return [
            scrapy.Request("https://www.weibo.com/u/2247704362/home?wvr=5&lf=reg", callback=self.parse_item)
        ]

    def parse_item(self, response):
        return scrapy.FormRequest.from_response(response, formdata={'user': 'user', 'pass': 'pass'}, callback=self.parse)

    def parse(self, response):
        print(response.body)
When I run this spider, Scrapy redirects from the URL under start_requests and then returns the following error:
ValueError: No element found in <200 https://passport.weibo.com/visitor/visitor?entry=miniblog&a=enter&url=https%3A%2F%2Fweibo.com%2Fu%2F2247704362%2Fhome%3Fwvr%3D5%26lf%3Dreg&domain=.weibo.com&ua=php-sso_sdk_client-0.6.28&_rand=1585243156.3952>
Does that mean I need to get the spider to look for something other than form data in the original page? How do I tell it to look for the cookie?
I have also tried a spider like this below based on this post.
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'WB'
    login_url = "https://www.weibo.com/overseas"
    test_url = 'https://www.weibo.com/u/2247704362/'

    def start_requests(self):
        yield scrapy.Request(url=self.login_url, callback=self.parse_login)

    def parse_login(self, response):
        return scrapy.FormRequest.from_response(response, formid="W_login_form", formdata={"loginname": "XXXXX", "password": "XXXXX"}, callback=self.start_crawl)

    def start_crawl(self, response):
        yield scrapy.Request(self.test_url, callback=self.parse_item)

    def parse_item(self, response):
        print("Test URL " + response.url)
But it still doesn't work, giving the error:
ValueError: No element found in <200 https://www.weibo.com/overseas>
Would really appreciate any help anybody can offer as this is kind of beyond my range of knowledge.

How to keep track of a request in scrapy

I'm scraping a list of pages. I have:
start_urls = ['page_1_id', 'page_2_id', 'page_1_2', 'page_3_id']
Now, when I do the scraping, if the page exists, the URL changes. So when I try:
response.url
or
response.request
I don't get
'page_1_id', 'page_2_id', 'page_1_2', 'page_3_id'
Since Scrapy makes asynchronous requests, I need the 'id' to match the data back, so what I need is to pass the 'id' as an argument in each request. I thought of a list:
start_urls = ['page_1_id', 'page_2_id', 'page_1_2', 'page_3_id']
id = ['id_1','id_2','id_3']
But that has two issues: first, I don't know how to pass these arguments, and second, it won't work since I don't know the order in which the requests are made. So I would probably need to use a dictionary. Is there a way to make something like this:
start_urls = {'page_1_id':id_1, 'page_2_id':id_2, 'page_1_3':id_3, 'page_4_id':id_4}
My spider is quite simple, I just need to get a link and the id back:
def parse(self, response):
    myItem = Item()
    myItem = Item(link=response.xpath('//*[@id="container"]/div/table/tbody/tr[1]/td/h4[1]/a/@href').extract())
    return myItem
I just need to add the 'id':
def parse(self, response):
    myItem = Item()
    myItem = Item(link=response.xpath('//*[@id="container"]/div/table/tbody/tr[1]/td/h4[1]/a/@href').extract(), id)
    return myItem
You can override how Scrapy starts yielding requests by overriding the start_requests() method. It seems like you want to do that and then put the id in the request.meta attribute to carry it over to the parse callback. Something like:
start_urls = ['page_1_id', 'page_2_id', 'page_1_2', 'page_3_id']

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url,
                             meta={'page_id': url.split('_', 1)[-1]})  # e.g. '1_id'

def parse(self, response):
    print(response.meta['page_id'])
    # 1_id
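
For completeness, a self-contained sketch of the same idea; the spider name, the example URLs and the item fields are placeholders assumed for illustration, not taken from the question:

import scrapy

class PageIdSpider(scrapy.Spider):
    name = "page_id_spider"  # hypothetical name
    start_urls = ['https://example.com/page_1_id', 'https://example.com/page_2_id']  # placeholders

    def start_requests(self):
        for url in self.start_urls:
            # carry the id along with the request so it is available in the callback
            page_id = url.rsplit('/', 1)[-1].split('_', 1)[-1]
            yield scrapy.Request(url, meta={'page_id': page_id})

    def parse(self, response):
        # the id travels with the response, regardless of redirects or request ordering
        yield {
            'id': response.meta['page_id'],
            'link': response.xpath('//a/@href').get(),  # assumed selector
        }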

scrapy crawl multiple page using Request

I followed the documentation, but I am still not able to crawl multiple pages.
My code is like:
def parse(self, response):
    for thing in response.xpath('//article'):
        item = MyItem()
        request = scrapy.Request(link,
                                 callback=self.parse_detail)
        request.meta['item'] = item
        yield request

def parse_detail(self, response):
    print "here\n"
    item = response.meta['item']
    item['test'] = "test"
    yield item
Running this code does not call the parse_detail function and does not crawl any data. Any ideas? Thanks!
I found that if I comment out allowed_domains it works. But that doesn't make sense, because link definitely belongs to allowed_domains.
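
A likely culprit (an assumption, since the spider's attributes are not shown) is the format of allowed_domains: Scrapy's offsite filtering expects bare domain names, not URLs, so entries containing a scheme or path can cause otherwise valid requests to be dropped. A minimal sketch:

import scrapy

class ArticleSpider(scrapy.Spider):
    name = "articles"  # hypothetical name

    # bare domains only: no scheme, no trailing path
    allowed_domains = ["example.com"]
    # allowed_domains = ["http://www.example.com/"]  # entries like this can get requests filtered as offsite

    start_urls = ["http://www.example.com/articles"]  # placeholder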

Scrapy callback after redirect

I have a very basic scrapy spider, which grabs URLs from a file and then downloads them. The only problem is that some of them get redirected to a slightly modified URL within the same domain. I want to get them in my callback function using response.meta, and it works for normal URLs, but when a URL is redirected the callback doesn't seem to get called. How can I fix it?
Here's my code.
from scrapy.contrib.spiders import CrawlSpider
from scrapy import log
from scrapy import Request

class DmozSpider(CrawlSpider):
    name = "dmoz"
    handle_httpstatus_list = [302]
    allowed_domains = ["http://www.exmaple.net/"]
    f = open("C:\\python27\\1a.csv", 'r')
    url = 'http://www.exmaple.net/Query?indx='
    start_urls = [url + row for row in f.readlines()]

    def parse(self, response):
        print response.meta.get('redirect_urls', [response.url])
        print response.status
        print (response.headers.get('Location'))
I've also tried something like this:
def parse(self, response):
    return Request(response.url, meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}, callback=self.parse_my_url)

def parse_my_url(self, response):
    print response.status
    print (response.headers.get('Location'))
And it doesn't work either.
By default Scrapy follows redirects. If you don't want requests to be redirected, you can do it like this: use the start_requests method and add the flags to the request meta.
def start_requests(self):
    requests = [Request(self.url + u, meta={'handle_httpstatus_list': [302],
                                            'dont_redirect': True},
                        callback=self.parse) for u in self.start_urls]
    return requests
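
With dont_redirect set, the 302 response itself reaches the callback, so the redirect target can be read from the Location header and followed manually; a brief sketch of that follow-up step (the parse_my_url name follows the question, the rest is assumed):

def parse(self, response):
    # with dont_redirect set, a 302 arrives here instead of being followed automatically
    location = response.headers.get('Location')
    if response.status == 302 and location:
        # follow the redirect manually, keeping the original URL in meta
        yield Request(location.decode('utf-8'),
                      meta={'original_url': response.url},
                      callback=self.parse_my_url)

def parse_my_url(self, response):
    # both the original and the final URL are now available
    self.logger.info("redirected from %s to %s",
                     response.meta.get('original_url'), response.url)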
