So I have an item['html'] field that is needed for MyExamplePipeline, but after processing it isn't needed when storing into a database with, e.g., MongoDBPipeline. Is there a way in Scrapy to just drop the html field and keep the rest of the item? It's needed as part of the item to pass the page html from the spider to the pipeline, but I'm not able to figure out how to drop it afterwards. I looked at an SO post that mentioned using FEED_EXPORT_FIELDS or fields_to_export, but the problem is that I don't want to use an item exporter, I just want to feed the item into the next MongoDBPipeline. Is there a way to do this in Scrapy? Thanks!
You can easily do that. In your MongoDBPipeline you need to do something like below
del item['html']
If that impacts the item in another pipeline, then use copy.deepcopy to create a copy of the item object and delete html from the copy before inserting into MongoDB.
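A minimal sketch of that approach, assuming a pymongo-backed pipeline (the connection string, database, and collection names are placeholders):

import copy
import pymongo

class MongoDBPipeline:
    def open_spider(self, spider):
        # connection details are placeholders
        self.client = pymongo.MongoClient('mongodb://localhost:27017')
        self.collection = self.client['scrapy_db']['items']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # work on a deep copy so later pipelines still receive the full item
        doc = copy.deepcopy(dict(item))
        doc.pop('html', None)  # drop the html field before inserting
        self.collection.insert_one(doc)
        return item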
I have an azure devops work item with some custom fields:
I can set some of these fields using the Azure DevOps Python API package, like so for 'RTCID':
jpo.path = "/fields/Custom.RTCID"
But when I try to set the targeted release, I can't find what the field path is for this variable, I've tried
jpo.path = "/fields/Custom.TargetedRelease"
But that results in an error.
I know my organization ID; is there any way I can list all the field path IDs in a ticket?
I tried going to https://dev.azure.com/{organization}/{project}/_apis/wit/workitemtypes/Epic/fields to see all the fields, but ctrl+f searching for 'targeted' brings up no results
To keep response times down when calling an Azure DevOps REST API, it often does not return the complete set of properties in the response body.
If you want to view more properties, you can use the $expand parameter to return them all.
GET https://dev.azure.com/{organization}/{project}/_apis/wit/workitemtypes/{type}/fields?$expand=all&api-version=7.1-preview.3
In addition, you can also use the "Work Items - Get Work Item" API to fetch an existing work item of the type you need, again with the $expand parameter to expand all of its fields.
GET https://dev.azure.com/{organization}/{project}/_apis/wit/workitems/{id}?$expand=fields&api-version=7.1-preview.3
This also lists all the fields on that work item type.
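If you want to script the search for the Targeted Release reference name, here is a rough sketch using the requests library against the first endpoint above (the organization, project, and personal access token are placeholders you would substitute):

import requests

ORG = "your-organization"            # placeholder
PROJECT = "your-project"             # placeholder
PAT = "your-personal-access-token"   # placeholder

url = (f"https://dev.azure.com/{ORG}/{PROJECT}/_apis/wit/workitemtypes/"
       f"Epic/fields?$expand=all&api-version=7.1-preview.3")
resp = requests.get(url, auth=("", PAT))  # PAT goes in as the basic-auth password
resp.raise_for_status()

# print reference names whose display name mentions 'targeted'
for field in resp.json().get("value", []):
    if "targeted" in field.get("name", "").lower():
        print(field.get("referenceName"), "->", field.get("name"))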
I was wondering if I could set some conditions that have to be met for the information to be stored (doing web-scraping with Scrapy version 1.7.3).
For example, only storing the movies with a rating greater than 7 while scraping IMDB's website.
Or would I have to do it manually when looking through the output file? (I am currently outputting the data as a CSV file)
This is an interesting question, and yes, Scrapy can totally help you with this. There are a couple of approaches you can take. If it is only about manipulating the items before actually "returning" them (which means they are already output), I'd recommend using Item Loaders, which basically let you set up rules per field on each item.
For actually dropping items according to your rules, I'd suggest using an Item Pipeline, which serves as a final filter before the items are returned. In this case it could also be interesting to combine it with something like Cerberus, which lets you define whole item schemas and, based on those, drop or return an item.
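For the IMDB example, a minimal pipeline sketch could look like the following (the 'rating' and 'title' field names are assumptions; adjust them to your item):

from scrapy.exceptions import DropItem

class MinRatingPipeline:
    """Drop movies whose rating is not greater than 7."""

    def process_item(self, item, spider):
        rating = float(item.get('rating', 0))
        if rating <= 7:
            raise DropItem(f"Rating too low ({rating}): {item.get('title')}")
        return item

You would then enable it through ITEM_PIPELINES in settings.py, and only the surviving items end up in your CSV export.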
I want to use the scrapy-elasticsearch pipeline in my scrapy project. In this project I have different items / models. These items are stored in a mysql server. In addition I want to index ONE of these items in an ElasticSearchServer.
In the documentation, however, I only find a way to index all defined items, as in the code example from settings.py below.
ELASTICSEARCH_INDEX = 'scrapy'
ELASTICSEARCH_TYPE = 'items'
ELASTICSEARCH_UNIQ_KEY = 'url'
As you can see, ELASTICSEARCH_TYPE indicates that all items will be indexed. Is there a possibility to limit this to only one item type?
The current implementation does not support sending only some items.
You could create a subclass of the original pipeline and override the process_item method to do what you want.
If you have the time, you could also send a pull request upstream with a proposal to allow filtering items before sending them to Elasticsearch.
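A rough sketch of such a subclass, assuming the pipeline class exposed by scrapy-elasticsearch is scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline and that the item you want indexed is a hypothetical ArticleItem (check both names against your installed version):

from scrapyelasticsearch.scrapyelasticsearch import ElasticSearchPipeline

from myproject.items import ArticleItem  # hypothetical item class

class SingleItemElasticSearchPipeline(ElasticSearchPipeline):
    def process_item(self, item, spider):
        # only send ArticleItem instances to Elasticsearch;
        # everything else is passed through untouched
        if isinstance(item, ArticleItem):
            return super().process_item(item, spider)
        return item

You would then register SingleItemElasticSearchPipeline in ITEM_PIPELINES instead of the original pipeline, keeping your MySQL pipeline unchanged.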
I need to return some items from inside a spider before starting to parse requests, because I need to make sure some parent items exist in the database before parsing child items.
I currently yield them from the parse method first thing, and this seems to work fine. But I was wondering if there is a better way to do this?
Instead of yielding the items, write them into the database directly in the constructor of the pipeline where you add the regular items to the database.
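A minimal sketch of that suggestion, using the pipeline's open_spider hook as the setup point (it plays the same role as the constructor here) and sqlite3 as a stand-in for whatever database you actually use; the table and row names are placeholders:

import sqlite3  # stand-in for your real database driver

class ParentAwarePipeline:
    def open_spider(self, spider):
        self.db = sqlite3.connect('items.db')
        # make sure the parent rows exist before any child item arrives
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS parents (name TEXT PRIMARY KEY)")
        self.db.execute(
            "INSERT OR IGNORE INTO parents (name) VALUES (?)", ("root",))
        self.db.commit()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        # regular items are inserted here as before
        return item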
I made a Scrapy crawler that collects some data from forum threads. On the list page, I can see the last modified date.
Based on that date, I want to decide whether to crawl the thread again or not. I store the data in MySQL, using a pipeline. While processing the list page with my CrawlSpider, I want to check a record in MySQL, and based on that record I either want to yield a Request or not. (I DO NOT want to load the url unless there is a new post.)
What's the best way to do this?
Use a CrawlSpider Rule:
Rule(SgmlLinkExtractor(), follow=True, process_request='check_moddate'),
Then in your spider:
def check_moddate(self, request):
    def dateisnew():
        # look up the thread's stored last-modified date in MySQL and
        # compare it with the date shown on the list page
        ...
    if dateisnew():
        return request
    # returning None drops the request, so the thread is not re-crawled