I'm scraping a list of pages. I have:
start_urls = ['page_1_id', 'page_2_id', 'page_1_2', 'page_3_id']
Now, when I run the scrape, if the page exists the URL changes, so when I try:
response.url
or
response.request
I don't get
'page_1_id', 'page_2_id', 'page_1_2', 'page_3_id'
Since Scrapy makes asynchronous requests, I need the 'id' to match the data back, so I need to pass the 'id' as an argument in each request. I thought of a list:
start_urls = ['page_1_id', 'page_2_id', 'page_1_2', 'page_3_id']
id = ['id_1','id_2','id_3']
But I have two issues: first, I don't know how to pass these arguments, and second, it won't work since I don't know the order in which the requests are made. So I would probably need to use a dictionary. Is there a way to make something like this:
start_urls = {'page_1_id':id_1, 'page_2_id':id_2, 'page_1_3':id_3, 'page_4_id':id_4}
My spider is quite simple; I just need to get a link and the id back:
def parse(self, response):
    myItem = Item(link=response.xpath('//*[@id="container"]/div/table/tbody/tr[1]/td/h4[1]/a/@href').extract())
    return myItem
I just need to add the 'id':
def parse(self, response):
    myItem = Item(link=response.xpath('//*[@id="container"]/div/table/tbody/tr[1]/td/h4[1]/a/@href').extract(),
                  id=?)  # how do I get the matching id here?
    return myItem
You can override how Scrapy starts yielding requests by overriding the start_requests() method. It seems like you want to do that and then put the id in the request.meta attribute to carry it over to the parse callback. Something like:
start_urls = ['page_1_id', 'page_2_id', 'page_1_2', 'page_3_id']

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url,
                             meta={'page_id': url.split('_', 1)[-1]})  # e.g. '1_id'

def parse(self, response):
    print(response.meta['page_id'])
    # '1_id'
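Putting that together with the spider from the question, parse can then read the id straight from response.meta. A minimal sketch; the Item class and XPath are taken from the question, and I'm assuming the Item actually declares an id field:

def parse(self, response):
    myItem = Item(
        link=response.xpath('//*[@id="container"]/div/table/tbody/tr[1]/td/h4[1]/a/@href').extract(),
        id=response.meta['page_id'],  # the id attached in start_requests()
    )
    return myItem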
Related
This problem is starting to frustrate me, as I feel like I have no clue how Scrapy works and I can't wrap my head around the documentation.
My question is simple. I have the most standard of spiders.
class MySpider(scrapy.Spider):

    def start_requests(self):
        header = ..
        url = "www.website.whatever/search/filter1=.../&page=1"
        test = scrapy.Request(url=url, callback=self.parse, headers=header)

    def parse(self, response):
        site_number_of_pages = int(response.xpath(..))
        return site_number_of_pages
I just want to somehow get the number of pages from the parse function back into the start_requests function so I can start a for loop to go through all the pages on the website, using the same parse function again. The code above only illustrates the principle and would not work in practice. The variable test would be a Request object and not the plain old integer that I want.
How would I accomplish what I am trying to do?
EDIT:
This is what I have tried up till now
class MySpider(scrapy.Spider):

    def start_requests(self):
        header = ..
        url = ..
        yield scrapy.Request(url=url, callback=self.parse, headers=header)

    def parse(self, response):
        header = ..
        site_number_of_pages = int(response.xpath(..))
        for count in range(2, site_number_of_pages):
            url = url + str(count)
            yield scrapy.Request(url=url, callback=self.parse, headers=header)
Scrapy is an asynchronous framework. There is no way to "return" to start_requests - only Requests followed by their callbacks.
In general, if Requests appear as the result of parsing some response (in your case, site_number_of_pages from the first URL), they are not start requests.
The easiest thing you can do in this case is to yield the requests from the parse method:
def parse(self, response):
    site_number_of_pages = int(response.xpath(..))
    for i in range(site_number_of_pages):
        ...
        yield Request(url=...)
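A slightly fuller sketch of that pattern, with a first request that only discovers the page count and a second callback for the result pages. The URL template, page parameter, and XPath are placeholders I'm assuming, not the asker's real ones:

import scrapy

class PagesSpider(scrapy.Spider):
    name = "pages"

    def start_requests(self):
        # The first request only discovers how many pages there are.
        yield scrapy.Request("https://www.example.com/search?page=1",  # placeholder URL
                             callback=self.parse)

    def parse(self, response):
        # Placeholder selector for the total page count.
        site_number_of_pages = int(response.xpath('//span[@class="page-count"]/text()').get())
        for page in range(2, site_number_of_pages + 1):
            yield scrapy.Request("https://www.example.com/search?page=%d" % page,
                                 callback=self.parse_page)

    def parse_page(self, response):
        # Extract the actual items from each result page here.
        pass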
Instead of grabbing the number of pages and looping through all of them, I grabbed the "next page" feature of the web page. So every time self.parse runs, it grabs the next page and schedules itself again. This goes on until there is no next page, at which point it simply errors out.
class MySpider(scrapy.Spider):

    def start_requests(self):
        header = ..
        url = "www.website.whatever/search/filter1=.../&page=1"
        yield scrapy.Request(url=url, callback=self.parse, headers=header)

    def parse(self, response):
        header = ..
        ..
        next_page = response.xpath(..)
        url = "www.website.whatever/search/filter1=.../&page=" + next_page
        yield scrapy.Request(url=url, callback=self.parse, headers=header)
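To avoid the error on the last page, the extracted value can be checked before yielding the next request. A sketch; the next-page XPath is an assumption, since the real one is elided above:

def parse(self, response):
    # ... item extraction ...
    next_page = response.xpath('//a[@rel="next"]/@href').get()  # assumed selector
    if next_page is not None:  # None on the last page, so the spider finishes cleanly
        yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)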
I am new to using Scrapy and Python.
I want to start scraping data from a search result. When the page loads, the default content appears; what I need to scrape is the filtered result, while also handling pagination.
Here's the URL
https://teslamotorsclub.com/tmc/post-ratings/6/posts
I need to scrape the items from the Time Filter: "Today" result.
I tried different approaches but none of them work.
What I have done so far is this, but it is more of a layout skeleton:
class TmcnfSpider(scrapy.Spider):
    name = 'tmcnf'
    allowed_domains = ['teslamotorsclub.com']
    start_urls = ['https://teslamotorsclub.com/tmc/post-ratings/6/posts']

    def start_requests(self):
        # Show form from a filtered search result

    def parse(self, response):
        # some code scraping items
        # Yield url for pagination
To get the posts under the "Today" filter, you need to send a POST request to this url https://teslamotorsclub.com/tmc/post-ratings/6/posts along with a payload. The following should fetch the results you are interested in:
import scrapy

class TmcnfSpider(scrapy.Spider):
    name = "teslamotorsclub"
    start_urls = ["https://teslamotorsclub.com/tmc/post-ratings/6/posts"]

    def parse(self, response):
        payload = {'time_chooser': '4', '_xfToken': ''}
        yield scrapy.FormRequest(response.url, formdata=payload, callback=self.parse_results)

    def parse_results(self, response):
        for items in response.css("h3.title > a::text").getall():
            yield {"title": items.strip()}
I followed the documentation, but I am still not able to crawl multiple pages.
My code looks like this:
def parse(self, response):
    for thing in response.xpath('//article'):
        item = MyItem()
        # link is extracted from thing (extraction omitted in the original snippet)
        request = scrapy.Request(link,
                                 callback=self.parse_detail)
        request.meta['item'] = item
        yield request

def parse_detail(self, response):
    print "here\n"
    item = response.meta['item']
    item['test'] = "test"
    yield item
Running this code never calls the parse_detail function and no data is crawled. Any idea? Thanks!
I find that if I comment out allowed_domains it works. But that doesn't make sense, because link definitely belongs to allowed_domains.
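That is consistent with Scrapy's OffsiteMiddleware, which silently drops requests whose domain is not covered by allowed_domains (it only logs a DEBUG "Filtered offsite request" message). The usual fixes are to list the bare registered domain, or to bypass the filter per request. A sketch with a placeholder domain and an assumed link extraction:

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    # List the bare domain only: no scheme, no "www." prefix, no path.
    allowed_domains = ["example.com"]  # placeholder domain

    def parse(self, response):
        for thing in response.xpath('//article'):
            link = thing.xpath('.//a/@href').get()  # assumed link extraction
            # dont_filter=True also skips the offsite check for this request;
            # parse_detail is the callback defined in the question.
            yield scrapy.Request(link, callback=self.parse_detail, dont_filter=True)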
I use Scrapy to scrape data from the first URL.
The first URL returns a response contains a list of URLs.
So far so good. My question is: how can I go on to scrape this list of URLs? After searching, I know I can return a request from parse, but it seems it can only process one URL.
This is my parse:
def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]
    return scrapy.Request(list[0])
    # It works, but how can I continue with b.com and c.com?
Can I do something like this?
def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]
    for link in list:
        scrapy.Request(link)
        # This is wrong, though I need something like this
Full version:
import scrapy

class MySpider(scrapy.Spider):
    name = "mySpider"
    allowed_domains = ["x.com"]
    start_urls = ["http://x.com"]

    def parse(self, response):
        # Get the list of URLs, for example:
        list = ["http://a.com", "http://b.com", "http://c.com"]
        for link in list:
            scrapy.Request(link)
            # This is wrong, though I need something like this
I think what you're looking for is the yield statement:
def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]
    for link in list:
        request = scrapy.Request(link)
        yield request
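One detail worth noting: without an explicit callback, each of those requests is handled by the same parse method again. If the followed pages should be processed differently, a separate callback can be passed. The parse_detail method and the title extraction below are illustrative, not part of the question:

def parse(self, response):
    urls = ["http://a.com", "http://b.com", "http://c.com"]
    for link in urls:
        # Each response is routed to parse_detail instead of back to parse
        yield scrapy.Request(link, callback=self.parse_detail)

def parse_detail(self, response):
    yield {"url": response.url, "title": response.css("title::text").get()}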
For this purpose, you need to subclass scrapy.Spider and define the list of URLs to start with. Scrapy will then request each of them and call parse on the responses.
Just do something like this:
import scrapy

class YourSpider(scrapy.Spider):
    name = "your_spider"
    allowed_domains = ["a.com", "b.com", "c.com"]
    start_urls = [
        "http://a.com/",
        "http://b.com/",
        "http://c.com/",
    ]

    def parse(self, response):
        # do whatever you want
        pass
You can find more information on the official documentation of Scrapy.
# within your parse method:
urlList = response.xpath('//a/@href').extract()
print(urlList)  # to see the list of URLs
for url in urlList:
    yield scrapy.Request(url, callback=self.parse)
This should work
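If the extracted hrefs are relative, response.follow resolves them against the current page URL, so the same idea can be written as the following minor variation:

# within your parse method:
for href in response.xpath('//a/@href').getall():
    # response.follow() accepts relative URLs and builds the absolute request
    yield response.follow(href, callback=self.parse)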
Trying to wrap my head around this... I have a fixed list of 100,000 URLs I would like to scrape, which is fine, I know how to handle that. But first I need to get a cookie from an initial form post and use it for the subsequent requests. Would that be like a nested spider? Just trying to understand the architecture for that use case.
Thanks!
Scrapy will handle the cookies automatically.
All you need to do is submit the form post first, then yield the requests for your 100,000 URLs:
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"

    start_urls = (
        'https://example.com/login',  # login page
    )

    def __init__(self, *args, **kwargs):
        self.url_list = []  # your url list
        return super(MySpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        data = {}
        return scrapy.FormRequest.from_response(
            response,
            formdata=data,
            callback=self.my_start_requests
        )

    def my_start_requests(self, response):
        # ignore the login callback response
        for url in self.url_list:
            # scrapy will take care of the cookies
            yield scrapy.Request(url, callback=self.parse_item, dont_filter=True)

    def parse_item(self, response):
        # your code here
        pass
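For 100,000 URLs, self.url_list would typically be filled once in __init__, for example from a text file. The filename and the one-URL-per-line format below are assumptions:

def __init__(self, *args, **kwargs):
    super(MySpider, self).__init__(*args, **kwargs)
    # Assumed input file: one URL per line
    with open('urls.txt') as f:
        self.url_list = [line.strip() for line in f if line.strip()]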