I made a Scrapy crawler that collects some data from forum threads. On the list page, I can see the last modified date.
Based on that date, I want to decide whether to crawl the thread again or not. I store the data in MySQL, using a pipeline. While processing the list page with my CrawlSpider, I want to check a record in MySQL, and based on that record I either want to yield a Request or not. (I DO NOT want to load the URL unless there is a new post.)
Whats the best way to do this?
Use a CrawlSpider Rule:
Rule(SgmlLinkExtractor(), follow=True, process_request='check_moddate'),
Then in your spider:
def check_moddate(self, request):
    def dateisnew():
        # check the stored date for request.url here
        ...
    if dateisnew():
        return request
    # returning None drops the request
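A rough sketch of what check_moddate could look like with a MySQL lookup (the threads table, its columns, the MySQLdb connection details, and the idea of stashing the list-page date on request.meta are all assumptions made for illustration):

import MySQLdb  # or whichever MySQL driver the pipeline already uses

def check_moddate(self, request):
    # Hypothetical table: threads(url VARCHAR, last_post DATETIME)
    db = MySQLdb.connect(host='localhost', user='user', passwd='secret', db='forum')
    try:
        cursor = db.cursor()
        cursor.execute("SELECT last_post FROM threads WHERE url = %s", (request.url,))
        row = cursor.fetchone()
    finally:
        db.close()

    # Last-modified date scraped from the list page, assumed to have been
    # attached to the request as meta['list_date'] when the request was built.
    list_date = request.meta.get('list_date')

    if row is None or (list_date and list_date > row[0]):
        return request  # unseen or updated thread: crawl it
    return None         # nothing new: drop the request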
I already have some working Spiders and code to achieve what I want, but I was looking for advice on how to consolidate things more efficiently for the project I'm working on.
My current process involves:
Within Scrapy: manually create Items using scrapy.Item
Within Scrapy: crawl, outputting each Item row to a JSON Lines (JL) file
Current Pipeline:
#pipelines.py
class MyPipeline(object):
    def process_item(self, item, spider):
        for field in item.fields:
            item.setdefault(field, None)
        return item
Outside Scrapy w/ SQLAlchemy: truncate the incoming table, bulk insert the JL file using Pandas to_sql
Outside Scrapy w/ SQLAlchemy: update the incoming table's row_uuid column (md5 hash of all pertinent columns)
Outside Scrapy w/ SQLAlchemy: upsert (insert...on conflict...where row_uuid is distinct from) the incoming data table into the source data table (roughly like the sketch after this list)
Outside Scrapy w/ SQLAlchemy: delete from the source table as necessary (404 errors, etc.)
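For reference, that upsert step might look roughly like the following sketch (it assumes PostgreSQL, SQLAlchemy, and placeholder table/column names such as natural_key, col_a, and col_b):

# upsert sketch: incoming_table -> source_table (assumes PostgreSQL + SQLAlchemy)
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost/mydb")  # placeholder DSN

upsert_sql = text("""
    INSERT INTO source_table (natural_key, col_a, col_b, row_uuid)
    SELECT natural_key, col_a, col_b, row_uuid
    FROM incoming_table
    ON CONFLICT (natural_key) DO UPDATE
    SET col_a = EXCLUDED.col_a,
        col_b = EXCLUDED.col_b,
        row_uuid = EXCLUDED.row_uuid
    WHERE source_table.row_uuid IS DISTINCT FROM EXCLUDED.row_uuid
""")

with engine.begin() as conn:
    conn.execute(upsert_sql)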
Ideally, I want to perform all these actions within Scrapy using a proper pipeline. I've seen dataset mentioned. Would some combination of open_spider, process_item, and close_spider help? Some questions I have:
Is it possible to populate/define Scrapy Items directly from an existing database table without listing the columns manually?
If you have multiple methods in a Spider (parse, parse_detail, etc) that each have their own Item, would the Pipeline be able to insert to the proper database table?
Is it possible to bulk insert X Items at a time, rather than one Item at a time?
Would the below be a potential approach? I assume other changes would be necessary...
#pipelines.py
class SqlPipeline(object):
    def __init__(self, db_conn):
        #Connect to DB here?
        pass

    def open_spider(self, spider):
        #Truncate specific incoming table here?
        pass

    def process_item(self, item, spider):
        #(Bulk) insert item row(s) into specific incoming table?
        #Where would you define the table for this?
        return item

    def close_spider(self, spider):
        #Update row_uuid for specific incoming table?
        #Do upsert and delete rows for specific source table?
        #Close DB connection here?
        pass
Thanks for your help!
The pipelines in Scrapy are used to do exactly what you are saying.
Answering your questions:
Is it possible to populate/define Scrapy Items directly from an existing database table without listing the columns manually?
I don't fully understand the "listing the columns manually" part. I am going to guess that you have a table in a database with a bunch of columns. Those columns have to be defined in your Items because they will be mapped to the DB; otherwise, how would you map every field to its column in the table?
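In other words, the Item ends up mirroring the table, one field per column. A tiny illustration (the table and its columns are made up):

import scrapy

# Hypothetical table: business(name TEXT, street TEXT, city TEXT, row_uuid TEXT)
class BusinessItem(scrapy.Item):
    name = scrapy.Field()
    street = scrapy.Field()
    city = scrapy.Field()
    row_uuid = scrapy.Field()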
If you have multiple methods in a Spider (parse, parse_detail, etc) that each have their own Item, would the Pipeline be able to insert to the proper database table?
Yes. You can define multiple pipelines (each with its own order value in ITEM_PIPELINES) in order to keep the processing of different Item types separate and properly organized.
Is it possible to bulk insert X Items at a time, rather than one Item at a time?
Would the below be a potential approach?
Yes, of course! You have to define that logic in your pipeline, and the logic can be different in each one.
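As a rough sketch of what such a pipeline could look like (the connection string, table name, batch size, and row_uuid calculation are all assumptions, not a definitive implementation):

# pipelines.py -- sketch, assuming SQLAlchemy and a PostgreSQL 'incoming' table
import hashlib
from sqlalchemy import create_engine, MetaData, Table

class SqlPipeline(object):
    batch_size = 500  # flush to the DB every 500 items

    def open_spider(self, spider):
        self.engine = create_engine("postgresql://user:pass@localhost/mydb")
        self.incoming = Table("incoming", MetaData(), autoload_with=self.engine)
        self.buffer = []
        # empty the incoming table before the crawl starts
        with self.engine.begin() as conn:
            conn.execute(self.incoming.delete())

    def process_item(self, item, spider):
        row = dict(item)
        # row_uuid as an md5 hash over the pertinent column values
        row["row_uuid"] = hashlib.md5(
            "|".join(str(row.get(key)) for key in sorted(row)).encode("utf-8")
        ).hexdigest()
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self._flush()
        return item

    def _flush(self):
        if self.buffer:
            with self.engine.begin() as conn:
                conn.execute(self.incoming.insert(), self.buffer)
            self.buffer = []

    def close_spider(self, spider):
        self._flush()
        # upsert from 'incoming' into the source table and delete stale rows here
        self.engine.dispose()

If different callback methods yield different Item types, process_item could check isinstance(item, SomeItem) and route each one to its matching table, or you could register one pipeline per Item type in ITEM_PIPELINES.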
Scrapy has several points/places where scraped data can be processed: spiders, items, and spider middlewares. But I don't understand where I should do it. I can process scraped data in all of these places. Could you explain the differences between them in detail?
For example: a downloader middleware returns some data to the spider (a number, a short string, a URL, a lot of HTML, a list, and so on). What should I do with that data, and where? I understand what to do, but it is not clear where to do it...
Spiders are the main point where you define how to extract data, as items. When in doubt, implement your extraction logic in your spider only, and forget about the other Scrapy features.
Item loaders, item pipelines, downloader middlewares, spider middlewares and extensions are used mainly for code sharing in scraping projects that have several spiders.
If you ever find yourself repeating the same code in two or more spiders, and you decide to stop repeating yourself, then you should look into those components and choose which ones to use, simplifying your codebase by moving the existing, duplicated code into one or more components of these types.
It is generally a better approach than simply using class inheritance between Spider subclasses.
As to how to use each component:
Item loaders are for shared extraction logic (e.g. XPath and CSS selectors, regular expressions), as well as pre- and post-processing of field values.
For example:
If you were writing spiders for websites that use some standard way of tagging the data to extract, like schema.org, you could write extraction logic on an item loader and reuse it across spiders.
If you want to always convert the value of an item field to uppercase, you would use an output processor on the item loader class, and reuse that item loader across spiders (see the sketch after this list).
Item pipelines are for post-processing of whole items (not just the data in a specific field of an item).
Common use cases include dropping duplicate items (by keeping track of uniquely-identifying data of every item parsed) or sending items to database servers or other forms of storage (as a flexible alternative to feed exports).
Downloader middlewares are used for shared logic regarding the handling of requests and responses.
Common use cases include detecting and handling anti-bot measures, or proxy handling. (built-in downloader middlewares)
Spider middlewares are used for any other shared logic between spiders. It is the closest thing to a spider base class that there is. It can handle exceptions from spiders, the initial requests, etc. (built-in spider middlewares)
Extensions are used for more general changes to Scrapy itself. (built-in extensions)
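For instance, the uppercase example mentioned above could look roughly like this sketch (the item, loader, and selector names are placeholders, not anything from a real project):

import scrapy
from itemloaders.processors import Compose, TakeFirst
from scrapy.loader import ItemLoader

class ProductItem(scrapy.Item):
    name = scrapy.Field()

class ProductLoader(ItemLoader):
    default_item_class = ProductItem
    default_output_processor = TakeFirst()
    # Output processor: take the first extracted value and uppercase it.
    name_out = Compose(TakeFirst(), str.upper)

# usage inside any spider callback:
#   loader = ProductLoader(response=response)
#   loader.add_css('name', 'h1.product-name::text')
#   yield loader.load_item()

Reusing ProductLoader across spiders keeps the uppercasing rule in one place.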
I will try to explain them in order:
The Spider is where you decide which URLs to make requests to.
A downloader middleware has a process_request method, which is called before a request to a URL is made, and a process_response method, which is called once the response from that URL is received (see the sketch below).
The pipeline is where the data is sent when you yield an item or a dictionary from your Spider.
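A bare-bones illustration of those two hooks (the middleware name, header, and log message are arbitrary, and it would still need to be enabled in DOWNLOADER_MIDDLEWARES):

# middlewares.py -- minimal downloader middleware sketch
class HeaderAndLogMiddleware:
    def process_request(self, request, spider):
        # Runs before the request is sent; returning None lets it continue as usual.
        request.headers.setdefault('User-Agent', 'my-crawler/0.1')
        return None

    def process_response(self, request, response, spider):
        # Runs after the response is received; must return a Response (or a Request).
        spider.logger.debug("Got %s for %s", response.status, request.url)
        return response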
So I have an item['html'] field that is needed for MyExamplePipeline, but after processing it doesn't need to be stored in the database by, e.g., MongoDBPipeline. Is there a way in Scrapy to just drop the html field and keep the rest of the item? It's needed as part of the item to pass the page HTML from the spider to the pipeline, but I'm not able to figure out how to drop it. I looked at this SO post that mentioned using FEED_EXPORT_FIELDS or fields_to_export, but the problem is that I don't want to use an item exporter; I just want to feed the item into the next MongoDBPipeline. Is there a way to do this in Scrapy? Thanks!
You can easily do that. In your MongoDBPipeline you need to do something like the below:
del item['html']
If that impacts the item in another pipeline, use copy.deepcopy to create a copy of the item object, and then delete html from the copy before inserting it into MongoDB.
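Put together, that could look roughly like this (the database, collection, and connection details are placeholders, and the pymongo setup is assumed):

# pipelines.py -- sketch of dropping 'html' before the MongoDB insert
import copy
import pymongo

class MongoDBPipeline:
    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.collection = self.client["mydb"]["items"]

    def process_item(self, item, spider):
        doc = copy.deepcopy(dict(item))   # copy so the original item keeps 'html'
        doc.pop('html', None)             # drop the field we don't want to store
        self.collection.insert_one(doc)
        return item                       # the original item, html intact, moves on

    def close_spider(self, spider):
        self.client.close()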
I was just wondering how I can reset the dupefilter process so that certain URLs stop being filtered out.
Indeed, I tested the crawler many times before getting it to work, and now I want to run it with something like scrapy crawl quotes -o test_new.csv -s JOBDIR=crawls/quotes_new-1
It keeps telling me that some URLs are duplicates and therefore not visited.
It would definitely be OK to remove all URLs remembered for that crawler.
I would appreciate knowing where the duplicate URLs are filtered (so I could edit that?).
Simply not filtering the requests (dont_filter) is not an option for my problem, because the crawl would loop.
I can add my code but as it's a general question I felt it would be more confusing than anything. Just ask if you need it :)
Thank you very much,
You can set Scrapy's DUPEFILTER_CLASS setting to your own dupefilter class, or just extend the default RFPDupeFilter class (source code) with your changes.
This documentation page explains a bit more:
The default (RFPDupeFilter) filters based on request fingerprint using the scrapy.utils.request.request_fingerprint function.
In order to change the way duplicates are checked you could subclass RFPDupeFilter and override its request_fingerprint method. This method should accept a scrapy Request object and return its fingerprint (a string).
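The docs above talk about overriding request_fingerprint to change how fingerprints are computed; for the narrower goal of exempting a few known URLs from filtering, a minimal sketch could instead override request_seen (the exempt list and module path below are made up):

# dupefilters.py -- sketch of a dupefilter that never filters certain URLs
from scrapy.dupefilters import RFPDupeFilter

EXEMPT_URLS = {
    'http://quotes.toscrape.com/page/1/',   # placeholder URL, adjust to your crawl
}

class SelectiveDupeFilter(RFPDupeFilter):
    def request_seen(self, request):
        if request.url in EXEMPT_URLS:
            return False                     # pretend we have never seen it
        return super().request_seen(request)

# settings.py
# DUPEFILTER_CLASS = 'myproject.dupefilters.SelectiveDupeFilter'

Also note that with JOBDIR set, the seen-request fingerprints are persisted inside that directory (in requests.seen), so pointing JOBDIR at a fresh directory, or deleting the old one, resets the filter as well.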
I'm asking about the Scrapy framework.
I'm scraping a business page. First I add a Brand Item (I have a Brands table) with the business name, then I want to add several business locations (BusinessLocations table), but I need the database BrandId to insert a business location into the database. Then I add a few records about departments for each business location, and again I need the database BusinessLocationId to insert each Department.
Let's assume I insert Items into the database in a pipeline.
Can I simply assume that items processed earlier have already left the pipeline and are in the database? In that case I could simply select the needed IDs from the database using some unique text field passed via meta data.
However, I suppose there might be a race condition, since Scrapy processes multiple requests simultaneously. By race condition I mean that a BusinessLocation item gets added before the appropriate Brand has been inserted into the database. Is there a risk of that kind of race condition?
Can I simply assume that items processed earlier already left pipeline and are in database?
Generally, no.
It depends highly on what you do in the pipeline. For example, if you use the images pipeline, then items with images will be held by the images pipeline until all of their images are retrieved, while an item with no images (or very few) will pass on to the next pipeline before the earlier item does.
You could collect the sub-items inside the main item object by passing the item along to the sub-requests (see the sketch below), but then you have to take care of error handling so you don't end up with an incomplete item. Another approach is to store the items in a staging database and consolidate later, looking for orphan records.
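A rough sketch of that first approach, passing the partially-built brand item through the sub-requests (the URLs, selectors, and field names are invented for illustration):

# spider sketch: accumulate locations inside the brand item before yielding it
import scrapy

class BrandSpider(scrapy.Spider):
    name = 'brands'
    start_urls = ['https://example.com/brands']   # placeholder URL

    def parse(self, response):
        brand = {
            'name': response.css('h1::text').get(),
            'locations': [],
        }
        location_urls = response.css('a.location::attr(href)').getall()
        if location_urls:
            yield response.follow(
                location_urls[0],
                callback=self.parse_location,
                cb_kwargs={'brand': brand, 'remaining': location_urls[1:]},
            )
        else:
            yield brand

    def parse_location(self, response, brand, remaining):
        brand['locations'].append({
            'address': response.css('.address::text').get(),
        })
        if remaining:
            # walk the remaining location pages one by one
            yield response.follow(
                remaining[0],
                callback=self.parse_location,
                cb_kwargs={'brand': brand, 'remaining': remaining[1:]},
            )
        else:
            # only now is the brand item complete; the pipeline can insert
            # the Brand row first and then its locations, so the FK ids line up
            yield brand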
I found a solution for how to wait until all the data is scraped: the close_spider method of the pipeline is called after the spider is closed.
class BlsPipeline100(object):
    def __init__(self):
        self.items = []

    def process_item(self, item, spider):
        # Just collect every item; nothing is processed yet.
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # By now the crawl is finished and every item has been collected,
        # so they can be processed together in the right order.
        processAllItems()
Now I can create the hierarchy with all items accessible.