I'm trying to load some XPATH rules from a database using Scrapy.
The code I've written so far works fine; however, after some debugging I've realised that Scrapy parses the items asynchronously, meaning I have no control over the order in which each item is parsed.
What I want to do is figure out which item from the list is currently being parsed when it hits the parse() function so I can reference that index to the rows in my database and acquire the correct XPATH query. The way I'm currently doing this is by using a variable called item_index and incrementing it after each item iteration. Now I realise this is not enough and I'm hoping there's some internal functionality that could help me achieve this.
Does anyone know the proper way of keeping track of this? I've looked through the documentation but couldn't find any info about it. I've also looked at the Scrapy source code, but I can't seem to figure out how the list of URLs actually gets stored.
Here's my code to explain my problem further:
# -*- coding: utf-8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector

from dirbot.items import Product
from dirbot.database import DatabaseConnection

# Create a database connection object so we can execute queries
connection = DatabaseConnection()


class DmozSpider(Spider):
    name = "dmoz"
    start_urls = []
    item_index = 0

    # Query for all products sold by a merchant
    rows = connection.query("SELECT * FROM products_merchant WHERE 1=1")

    def start_requests(self):
        for row in self.rows:
            yield self.make_requests_from_url(row["product_url"])

    def parse(self, response):
        sel = Selector(response)
        item = Product()
        item['product_id'] = self.rows[self.item_index]['product_id']
        item['merchant_id'] = self.rows[self.item_index]['merchant_id']
        item['price'] = sel.xpath(self.rows[self.item_index]['xpath_rule']).extract()
        self.item_index += 1
        return item
Any guidance would be greatly appreciated!
Thanks
Here's the solution I came up with just in case anyone needs it.
As @toothrot suggested, you can pass the information along with each request via Request.meta; to attach it, I had to override make_requests_from_url in the spider.
Hope this helps someone.
# -*- coding: utf-8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request

from dirbot.items import Product
from dirbot.database import DatabaseConnection

# Create a database connection object so we can execute queries
connection = DatabaseConnection()


class DmozSpider(Spider):
    name = "dmoz"
    start_urls = []

    # Query for all products sold by a merchant
    rows = connection.query("SELECT * FROM products_merchant WHERE 1=1")

    def start_requests(self):
        for indx, row in enumerate(self.rows):
            self.start_urls.append(row["product_url"])
            yield self.make_requests_from_url(row["product_url"], {'index': indx})

    def make_requests_from_url(self, url, meta):
        return Request(url, callback=self.parse, dont_filter=True, meta=meta)

    def parse(self, response):
        item_index = response.meta['index']
        sel = Selector(response)
        item = Product()
        item['product_id'] = self.rows[item_index]['product_id']
        item['merchant_id'] = self.rows[item_index]['merchant_id']
        item['price'] = sel.xpath(self.rows[item_index]['xpath_rule']).extract()
        return item
You can pass the index (or the row id from the database) along with the request using Request.meta. It's a dictionary you can access from Response.meta in your handler.
For example, when you're building your request:
Request(url, callback=self.some_handler, meta={'row_id': row['id']})
Using a counter like you've attempted won't work because you can't guarantee the order in which the responses are handled.
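For completeness, here is a minimal sketch of both halves of that pattern, reusing the question's self.rows and the illustrative row_id key from above (the 'id' column and some_handler name are assumptions, not from the original code):

    # inside the spider class; assumes `from scrapy.http import Request`
    # and that each database row has an 'id' column -- adjust to your schema
    def start_requests(self):
        for row in self.rows:
            # attach the row id so the callback knows which row produced this request
            yield Request(row['product_url'],
                          callback=self.some_handler,
                          meta={'row_id': row['id']})

    def some_handler(self, response):
        # the id comes back untouched, regardless of response order
        row_id = response.meta['row_id']
        # ... look up the row by row_id and apply its XPath rule here ...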
Related
This is the first time I'm using the Scrapy framework for Python.
So I made this code:
# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    start_urls = [
        'https://www.emag.ro/televizoare/c'
    ]

    def parse(self, response):
        for i in response.xpath('//div[@class="card-section-wrapper js-section-wrapper"]'):
            yield {
                'product-name': i.xpath('.//a[@class="product-title js-product-url"]/text()')
                                 .extract_first().replace('\n', '')
            }

        next_page_url = response.xpath('//a[@class="js-change-page"]/@href').extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))
Looking at the website, it has over 800 products, but my script only scrapes the first 2 pages, roughly 200 products...
I tried both CSS selectors and XPath; same problem.
Can anyone figure out where the problem is?
Thank you!
The website you are trying to crawl gets its data from an API. When you click a pagination link, the page sends an AJAX request to that API to fetch more products and show them on the page, and Scrapy doesn't simulate a browser environment itself.
So one approach would be to:
Analyse the request in your browser's network tab to inspect the endpoint and parameters
Build a similar request yourself in Scrapy
Call that endpoint with the appropriate arguments to get the products from the API.
You also need to extract the next page from the JSON response you get from the API. Usually there is a key named pagination which contains info about total pages, the next page, etc.
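A rough sketch of that approach, with a made-up endpoint and JSON keys (replace them with whatever you actually see in the network tab):

    import json

    import scrapy


    class EmagApiSpider(scrapy.Spider):
        name = 'emag_api'

        # Hypothetical endpoint -- copy the real URL and parameters from the
        # request shown in your browser's network tab.
        api_url = 'https://www.emag.ro/search-by-url?page={page}&url=/televizoare/c'

        def start_requests(self):
            yield scrapy.Request(self.api_url.format(page=1), callback=self.parse_api)

        def parse_api(self, response):
            data = json.loads(response.text)

            # The key names below ('items', 'name', 'pagination', 'next') are
            # assumptions about the JSON structure, not the site's real schema.
            for product in data.get('items', []):
                yield {'product-name': product.get('name')}

            next_page = data.get('pagination', {}).get('next')
            if next_page:
                yield scrapy.Request(self.api_url.format(page=next_page),
                                     callback=self.parse_api)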
I finally figured out how to do it.
# -*- coding: utf-8 -*-
import scrapy

from ..items import ScraperItem


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    page_number = 2
    start_urls = [
        'https://www.emag.ro/televizoare/c'
    ]

    def parse(self, response):
        items = ScraperItem()
        for i in response.xpath('//div[@class="card-section-wrapper js-section-wrapper"]'):
            product_name = i.xpath('.//a[@class="product-title js-product-url"]/text()').extract_first().replace('\n ', '').replace('\n ', '')
            items["product_name"] = product_name
            yield items

        next_page = 'https://www.emag.ro/televizoare/p' + str(SpiderSpider.page_number) + '/c'
        if SpiderSpider.page_number <= 28:
            SpiderSpider.page_number += 1
            yield response.follow(next_page, callback=self.parse)
My code, based on examples I found, did not seem to function as intended, so I decided to start from a working model found on GitHub: https://github.com/scrapy/quotesbot/blob/master/quotesbot/spiders/toscrape-xpath.py
I then modified it slightly to showcase what I am running into. The code below works as intended, but my ultimate goal is to pass the scraped data from the first "parse" function to a second "parse2" function so that I can combine data from 2 different pages. For now I wanted to start very simple so I can follow what is happening, hence the heavily stripped code below.
# -*- coding: utf-8 -*-
import scrapy

from quotesbot.items import MyItems
from scrapy import Request


class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath'
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        item = MyItems()
        for quote in response.xpath('//div[@class="quote"]'):
            item['tinfo'] = quote.xpath('./span[@class="text"]/text()').extract_first()
            yield item
but then when I modify the code as below:
# -*- coding: utf-8 -*-
import scrapy

from quotesbot.items import MyItems
from scrapy import Request


class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath'
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        item = MyItems()
        for quote in response.xpath('//div[@class="quote"]'):
            item['tinfo'] = quote.xpath('./span[@class="text"]/text()').extract_first()
            yield Request("http://quotes.toscrape.com/",
                          callback=self.parse2, meta={'item': item})

    def parse2(self, response):
        item = response.meta['item']
        yield item
I only get one item scraped, and the log says the rest are duplicates. It also looks like "parse2" is never even called. I have played with the indentation and the brackets, thinking I was missing something simple, but without much success. I have looked at many examples to see if I can make sense of what the issue could be, but I still can't make it work. I am sure it's a very simple issue for the gurus out there, so I yelp "Help!" somebody!
Also, my items.py file looks like below. I think items.py and toscrape-xpath.py are the only two files in play, as far as I can tell, since I am quite new to all this.
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class QuotesbotItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class MyItems(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    tinfo = scrapy.Field()
Thank you very much for any and all help you can provide.
# -*- coding: utf-8 -*-
import scrapy

from quotesbot.items import MyItems
from scrapy import Request


class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath'
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        item = MyItems()
        for quote in response.xpath('//div[@class="quote"]'):
            item = {'tinfo': quote.xpath('./span[@class="text"]/text()').extract_first()}
            yield response.follow('http://quotes.toscrape.com', self.parse_2,
                                  meta={'item': item})

    def parse_2(self, response):
        print("almost there")
        item = response.meta['item']
        yield item
Your spider logic is very confusing:
def parse(self, response):
    for quote in response.xpath('//div[@class="quote"]'):
        yield Request("http://quotes.toscrape.com/",
                      callback=self.parse2, meta={'item': item})
For every quote you find on quotes.toscrape.com you schedule another request to the same webpage?
What happens is that these newly scheduled requests get filtered out by Scrapy's duplicate request filter.
Maybe you should just yield the item right there:
def parse(self, response):
    for quote in response.xpath('//div[@class="quote"]'):
        item = MyItems()
        item['tinfo'] = quote.xpath('./span[@class="text"]/text()').extract_first()
        yield item
I'm using the latest version of Scrapy (http://doc.scrapy.org/en/latest/index.html) and am trying to figure out how to make Scrapy crawl only the URL(s) fed to it in the start_urls list. In most cases I want to crawl only one page, but in some cases there may be multiple pages that I will specify. I don't want it to crawl on to other pages.
I've tried setting the depth limit to 1, but from my testing I'm not sure it accomplished what I was hoping to achieve.
Any help will be greatly appreciated!
Thank you!
2015-12-22 - Code update:
# -*- coding: utf-8 -*-
import scrapy

from generic.items import GenericItem


class GenericspiderSpider(scrapy.Spider):
    name = "genericspider"

    def __init__(self, domain, start_url, entity_id):
        self.allowed_domains = [domain]
        self.start_urls = [start_url]
        self.entity_id = entity_id

    def parse(self, response):
        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath("//body//a"):
            item = GenericItem()
            item['entity_id'] = self.entity_id
            # gets the actual email address
            item['emails'] = response.xpath("//a[starts-with(@href, 'mailto')]").re(r'mailto:\s*(.*?)"')
            yield item
Below, in the first response, you mention using a generic spider. Isn't that what I'm doing in the code? Also, are you suggesting I remove the
callback=self.parse_dir_contents
from the parse function?
Thank you.
It looks like you are using CrawlSpider, which is a special kind of Spider for crawling multiple categories of pages.
To crawl only the URLs specified in start_urls, just override the parse method, as that is the default callback for the start requests.
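For example, a stripped-down version of the question's spider that extracts directly in parse and yields no further requests (so nothing beyond start_urls is crawled) might look roughly like this sketch:

    import scrapy

    from generic.items import GenericItem


    class GenericspiderSpider(scrapy.Spider):
        name = "genericspider"

        def __init__(self, domain, start_url, entity_id):
            self.allowed_domains = [domain]
            self.start_urls = [start_url]
            self.entity_id = entity_id

        def parse(self, response):
            # parse() is the default callback for start_urls; since we yield
            # no scrapy.Request objects here, no other pages are visited
            item = GenericItem()
            item['entity_id'] = self.entity_id
            item['emails'] = response.xpath(
                "//a[starts-with(@href, 'mailto')]").re(r'mailto:\s*(.*?)"')
            yield item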
Below is code for a spider that will scrape the title from a blog post (note: the XPath might not be the same for every blog).
Filename: /spiders/my_spider.py
class MySpider(scrapy.Spider):
    name = "craig"
    allowed_domains = ["www.blogtrepreneur.com"]
    start_urls = ["http://www.blogtrepreneur.com/the-best-juice-cleanse-for-weight-loss/"]

    def parse(self, response):
        items = []
        item = DmozItem()
        item["title"] = response.xpath('//h1/text()').extract()
        item["article"] = response.xpath('//div[@id="tve_editor"]//p//text()').extract()
        items.append(item)
        return items
The above code will only fetch the title and the article body of the given article.
I got the same problem, because I was using
import scrapy
from scrapy.spiders import CrawlSpider
Then I changed it to
import scrapy
from scrapy.spiders import Spider
And changed the class to
class mySpider(Spider):
I have a website and I would like to find the webpage with information about job vacancies. There is usually only one such page. So I start crawling from the website and I manage to get all webpages up to a certain depth. It works, but the pages are crawled many times over. Instead of, let's say, 45 pages I get 1000 pages. I know the reason why: every time my "parse" function is called, it parses all the links on the current page, so when I reach a new page it crawls all its links again, some of which have been crawled before.
1) I tried to move the "items=[]" list out of the parse function, but I get a global-name error. I don't know how to keep a list of unique webpages; once I have one, I will be able to pick the right page with simple URL parsing.
2) I also tried to have both "Request" and "return items" in the "parse" function, but I get a syntax error: return inside generator.
I am using DEPTH_LIMIT. Do I really need to use Rules?
code:
import scrapy, urlparse, os
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import JobItem
from scrapy.utils.response import get_base_url
from scrapy.http import Request
from urlparse import urljoin
from datetime import datetime


class JobSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["www.gen-i.si"]
    start_urls = ["http://www.gen-i.si"]

    def parse(self, response):
        response.selector.remove_namespaces()
        # choose all "href" values: links to other websites and to pages on our website
        urls = response.xpath('//@href').extract()
        items = []
        base_url = get_base_url(response)

        for url in urls:
            # we only want webpages, so we skip external URLs and URLs with strange characters
            if (url[0:4] != "http") and not any(x in url for x in ['%', ':', '?', '&']):
                item = JobItem()
                absolute_url = urlparse.urljoin(base_url, url)
                item["link"] = absolute_url
                if item not in items:
                    items.append(item)
                    yield item
                yield Request(absolute_url, callback=self.parse)
        # return items
You're appending item (a newly instantiated object) to your list items. Since item is always a new JobItem() object, it will never already exist in your list items.
To illustrate:
>>> class MyItem(object):
... pass
...
>>> a = MyItem()
>>> b = MyItem()
>>> a.url = "abc"
>>> b.url = "abc"
>>> a == b
False
Just because they have one attribute that is the same, doesn't mean they are the same object.
Even if this worked, though, you're resetting the list items every time you call parse (i.e. for each request), so you'll never really remove duplicates.
Instead, you would be better checking vs. the absolute_url itself, and putting the list at the spider level:
class JobSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["www.gen-i.si"]
    start_urls = ["http://www.gen-i.si"]
    all_urls = []

    def parse(self, response):
        # remove "items = []"
        ...
        for url in urls:
            if (url[0:4] != "http") and not any(x in url for x in ['%', ':', '?', '&']):
                absolute_url = urlparse.urljoin(base_url, url)
                if absolute_url not in self.all_urls:
                    self.all_urls.append(absolute_url)
                    item = JobItem()
                    item['link'] = absolute_url
                    yield item
                    yield Request(absolute_url, callback=self.parse)
This functionality, however, would be better served by creating a dupefilter instead (see the Scrapy documentation on the DUPEFILTER_CLASS setting for more information). Additionally, I agree with @RodrigoNey: a CrawlSpider would likely serve your purpose better and be more maintainable in the long run.
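For reference, a rough sketch of a URL-based dupefilter, assuming Scrapy's RFPDupeFilter base class (the module path in the settings line is made up; place the class wherever suits your project):

    from scrapy.dupefilters import RFPDupeFilter


    class SeenURLFilter(RFPDupeFilter):
        """Drop any request whose exact URL has already been scheduled."""

        def __init__(self, path=None, debug=False):
            self.urls_seen = set()
            super(SeenURLFilter, self).__init__(path, debug)

        def request_seen(self, request):
            if request.url in self.urls_seen:
                return True
            self.urls_seen.add(request.url)
            return False

Then enable it in settings.py:

    DUPEFILTER_CLASS = 'tutorial.dupefilters.SeenURLFilter'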
I'm working on a web crawler and ended up keeping a list of links that still needed to be crawled; once we visited a link, it was deleted from that list and added to the crawled list. Then you can use a "not in" check to add/delete/etc.
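In pseudo-Python, that bookkeeping looks something like this (all names here are illustrative, not from the original crawler):

    # queue of links still to visit, and links already visited
    to_crawl = ["http://example.com/"]
    crawled = []

    while to_crawl:
        url = to_crawl.pop(0)
        crawled.append(url)
        # ... fetch `url` and collect the links found on the page into `new_links` ...
        new_links = []
        for link in new_links:
            # the `not in` check keeps duplicates out of the queue
            if link not in crawled and link not in to_crawl:
                to_crawl.append(link)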
Hi, can someone help me out? I seem to be stuck. I am learning how to crawl and save into MySQL using Scrapy. I am trying to get Scrapy to crawl all of the website's pages starting from "start_urls", but it does not seem to crawl beyond that one page automatically, although it does save into MySQL via pipelines.py. It does crawl all pages, and save the data via pipelines.py, when the URLs are provided from f = open("urls.txt").
here is my code
test.py
import scrapy
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from gotp.items import GotPItem
from scrapy.log import *
from gotp.settings import *
from gotp.items import *


class GotP(CrawlSpider):
    name = "gotp"
    allowed_domains = ["www.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/sss"]
    rules = [
        Rule(SgmlLinkExtractor(
            allow=('')),
            callback="parse",
            follow=True
        )
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        prices = hxs.select('//div[@class="sliderforward arrow"]')
        for price in prices:
            item = GotPItem()
            item["price"] = price.select("text()").extract()
            yield item
If I understand correctly, you are trying to follow the pagination and extract the results.
In this case, you can avoid using CrawlSpider and use the regular Spider class.
The idea would be to parse the first page, extract the total results count, calculate how many pages to go through, and yield scrapy.Request instances to the same URL while varying the s GET parameter value.
Implementation example:
import scrapy


class GotP(scrapy.Spider):
    name = "gotp"
    allowed_domains = ["www.sfbay.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/sss"]

    results_per_page = 100

    def parse(self, response):
        total_count = int(response.xpath('//span[@class="totalcount"]/text()').extract()[0])
        for page in xrange(0, total_count, self.results_per_page):
            yield scrapy.Request("http://sfbay.craigslist.org/search/sss?s=%s&" % page, callback=self.parse_result, dont_filter=True)

    def parse_result(self, response):
        results = response.xpath("//p[@data-pid]")
        for result in results:
            try:
                print result.xpath(".//span[@class='price']/text()").extract()[0]
            except IndexError:
                print "Unknown price"
This would follow the pagination and print prices on the console. Hope this is a good starting point.