scrapy-elasticsearch pipeline only for specific item - python

I want to use the scrapy-elasticsearch pipeline in my Scrapy project. The project has several items / models, which are stored in a MySQL server. In addition, I want to index ONE of these items in an Elasticsearch server.
In the documentation, however, I can only find a way to index all defined items, as in this example from settings.py:
ELASTICSEARCH_INDEX = 'scrapy'
ELASTICSEARCH_TYPE = 'items'
ELASTICSEARCH_UNIQ_KEY = 'url'
As you can see, ELASTICSEARCH_TYPE implies that all items are indexed. Is there a way to limit this to only one item?

The current implementation does not support sending only some items.
You could create a subclass of the original pipeline and override the process_item method to do what you want.
If you have the time, you could also send a pull request upstream with a proposal to allow filtering items before sending them to Elasticsearch.
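A minimal sketch of the subclass approach, under some assumptions: BaseElasticsearchPipeline is a stand-in so the example runs on its own, and ProductItem / ReviewItem are hypothetical item classes. In a real project the base class would be the pipeline shipped with scrapy-elasticsearch, and ITEM_PIPELINES would point at the subclass instead of the original.

```python
class BaseElasticsearchPipeline:
    """Stand-in for the real pipeline (assumption, so the sketch is runnable).

    In a real project, subclass the pipeline class that
    scrapy-elasticsearch provides instead of this one.
    """

    def __init__(self):
        self.indexed = []  # pretend Elasticsearch index

    def process_item(self, item, spider):
        self.indexed.append(item)
        return item


class ProductItem(dict):
    """Hypothetical item that should be indexed."""


class ReviewItem(dict):
    """Hypothetical item that should NOT be indexed."""


class ProductOnlyPipeline(BaseElasticsearchPipeline):
    """Only forward ProductItem instances to the parent pipeline."""

    def process_item(self, item, spider):
        if isinstance(item, ProductItem):
            return super().process_item(item, spider)
        # Pass every other item through untouched so later
        # pipelines (e.g. the MySQL one) still receive it.
        return item


pipeline = ProductOnlyPipeline()
pipeline.process_item(ProductItem(url="http://example.com/p/1"), spider=None)
pipeline.process_item(ReviewItem(url="http://example.com/r/1"), spider=None)
print(len(pipeline.indexed))
```

Returning the item in both branches matters: a pipeline that returns nothing would stop the item from reaching the pipelines that run after it.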

How to view work item xml ids in azure ticket

I have an azure devops work item with some custom fields:
I can set some of these fields using the azure api python package, like so for 'RTCID':
jpo.path = "/fields/Custom.RTCID"
But when I try to set the targeted release, I can't find the field path for this variable. I've tried
jpo.path = "/fields/Custom.TargetedRelease"
But that results in an error.
I know my organization id. Is there any way I can list all the field paths in a ticket?
I tried going to https://dev.azure.com/{organization}/{project}/_apis/wit/workitemtypes/Epic/fields to see all the fields, but searching the page (Ctrl+F) for 'targeted' brings up no results.
To keep response times short, Azure DevOps REST APIs often do not include the complete set of properties in the response body.
If you want to see more properties, try the $expand parameter to expand them all.
GET https://dev.azure.com/{organization}/{project}/_apis/wit/workitemtypes/{type}/fields?$expand=all&api-version=7.1-preview.3
In addition, you can use the "Work Items - Get Work Item" API to fetch a work item of the type you need, and use the $expand parameter to expand all its fields.
GET https://dev.azure.com/{organization}/{project}/_apis/wit/workitems/{id}?$expand=fields&api-version=7.1-preview.3
This can also list all the fields on the work item type.
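Once you have the JSON from either call, a small helper can turn it into the patch paths you are looking for. This is only a sketch: it assumes the field list comes back under "value" with "name" and "referenceName" entries, and that a work item carries a "fields" object keyed by reference name. The sample payloads and the Custom.TargetedReleaseName field are made up for illustration.

```python
def find_fields(field_list, needle):
    """Return reference names whose display or reference name contains needle."""
    needle = needle.lower()
    return [
        field["referenceName"]
        for field in field_list.get("value", [])
        if needle in field.get("name", "").lower()
        or needle in field.get("referenceName", "").lower()
    ]


def patch_paths(work_item):
    """Map every field reference name on a work item to its JSON Patch path."""
    return {name: "/fields/" + name for name in work_item.get("fields", {})}


# Hypothetical response of .../workitemtypes/Epic/fields?$expand=all
sample_field_list = {
    "value": [
        {"name": "Title", "referenceName": "System.Title"},
        {"name": "RTCID", "referenceName": "Custom.RTCID"},
        {"name": "Targeted Release", "referenceName": "Custom.TargetedReleaseName"},
    ]
}

# Hypothetical response of .../workitems/{id}?$expand=fields
sample_work_item = {
    "id": 1,
    "fields": {"System.Title": "Example", "Custom.RTCID": "42"},
}

print(find_fields(sample_field_list, "targeted"))
print(patch_paths(sample_work_item))
```

Searching by display name as well as reference name helps when the label shown in the UI differs from the reference name used in the path.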

Scrapy - Drop item field in pipeline ?

So I have an item['html'] field that is needed for MyExamplePipeline, but after processing it doesn't need to be stored in a database by, e.g., MongoDBPipeline. Is there a way in Scrapy to drop just the html field and keep the rest of the item? It has to be part of the item to pass the page HTML from the spider to the pipeline, but I can't figure out how to drop it afterwards. I looked at this SO post that mentions FEED_EXPORT_FIELDS or fields_to_export, but I don't want to use an item exporter; I just want to feed the item into the next pipeline, MongoDBPipeline. Is there a way to do this in Scrapy? Thanks!
You can easily do that. In your MongoDBPipeline, do something like this:
del item['html']
If that affects the item in another pipeline, use copy.deepcopy to create a copy of the item object, then delete html from the copy before inserting it into MongoDB.
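A rough sketch of the copy variant, assuming the item behaves like a dict (Scrapy items do). The stored list is a stand-in for the MongoDB collection so the example runs without a database.

```python
import copy


class MongoDBPipeline:
    """Stores items without their 'html' field; the original item is untouched."""

    def __init__(self):
        self.stored = []  # stand-in for a MongoDB collection (assumption)

    def process_item(self, item, spider):
        document = copy.deepcopy(dict(item))
        # pop() with a default does not fail if 'html' is already gone
        document.pop("html", None)
        self.stored.append(document)
        # Return the original so later pipelines still see 'html'
        return item


pipeline = MongoDBPipeline()
item = {"url": "http://example.com", "html": "<html>...</html>"}
pipeline.process_item(item, spider=None)
print(pipeline.stored)
print("html" in item)
```

If no later pipeline needs the field, deleting it from the item directly is simpler and avoids the copy.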

Django Fallback to model lookup from external API

I'm using Django REST framework to serve up JSON content for a website front end. On the back end, I have two Django models, Player and Match, that each reference multiple of the other. A Match contains multiple Players, and a Player contains multiple Matches. This data is originally retrieved from a third-party API.
Matches and Players must be fetched separately from the API, and can only be fetched one at a time. When an object is fetched, its data is converted from the external JSON format into my Django model. At this point, the Match/Player will live forever in Django. The hard part is that I want this external fetching to be seamless. If I query for a player or match and it's in the DB, then just serve what we have there. Otherwise, I want to fetch that object from the external DB.
My question is, does Django provide any convenient way of handling this? Ideally, any query along the lines of Match.objects.get(id=...) will handle this API fallback transparently (I don't mind the fact that this query may take significantly longer in some cases).
Whether a way is "convenient" depends on your expectations.
You could create a custom QuerySet and override its get() method to include your fetch-from-API logic. Then create a custom manager based on that QuerySet, as the Django docs on managers show.
Finally add that custom manager to your model.
See also this question from 2011.

Return track-list using musicbrainzngs.search_releases()

I'm getting acquainted with musicbrainzngs and have run into a snag. All of the track-lists which are returned from the following are empty. Are there additional parameters I need to provide or is this a bug?
releases = musicbrainzngs.search_releases(
query='arid:' + musicbrainz_arid
)
This is expected. You have three ways of retrieving data from the MusicBrainz web service (using musicbrainzngs or directly):
lookup/get info for one entity by id: lots of info for that id
browse a list of entities: possibility to get long list, medium amount of information
search for entities: powerful to find things, but not much data given
When you know an entity by id you can look it up directly. You can even add includes to get very detailed information.
When you want not just one entity but a list (like the list of releases for one artist), you can browse. You can add includes here as well.
Search only when you don't know the id of the entity (or of an attached entity), or when you want to narrow down the list of entities.
In your case you know the artist id and want to get the list of releases. In that case you should use browse_releases and set an include for recordings:
releases = musicbrainzngs.browse_releases(artist=musicbrainz_arid,
                                          includes=["recordings"])
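To show where the tracks end up in the result, here is a small sketch. It assumes the result is a dict with a "release-list", where each release has a "medium-list" and each medium a "track-list" whose entries hold a "recording". The sample data is made up, and the app name passed to set_useragent is a placeholder.

```python
def browse_artist_releases(artist_id):
    """Fetch releases with recordings for one artist (needs network access)."""
    import musicbrainzngs  # imported here so the helper below runs without it

    # MusicBrainz asks clients to identify themselves; placeholder values
    musicbrainzngs.set_useragent("example-app", "0.1")
    return musicbrainzngs.browse_releases(
        artist=artist_id, includes=["recordings"], limit=100
    )


def track_titles(result):
    """Return a list of (release title, [track titles]) pairs."""
    releases = []
    for release in result.get("release-list", []):
        titles = []
        for medium in release.get("medium-list", []):
            for track in medium.get("track-list", []):
                titles.append(track["recording"]["title"])
        releases.append((release["title"], titles))
    return releases


# Hypothetical result in the assumed shape
sample_result = {
    "release-list": [
        {
            "title": "Example Album",
            "medium-list": [
                {
                    "track-list": [
                        {"number": "1", "recording": {"title": "First Song"}},
                        {"number": "2", "recording": {"title": "Second Song"}},
                    ]
                }
            ],
        }
    ]
}

print(track_titles(sample_result))
```

Browse requests are paged, so for artists with many releases you would repeat the call with an increasing offset until all releases have been collected.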

Mailchimp API 2.0: count members of a saved segment

I have three lists with thousands of members. I'm creating a little stats module for the admins of my Django site. The three lists have an extra custom field called Language (es_ES, en_CA, fr_CM, ...). I want to show the count of members of each list filtered by 'this' Language.
Browsing the Mailchimp API I can see that it's possible to create a saved segment with "filter" options (filtering by Language in my case), but when you get via API those segments you can't get the count of members for the segment. It's not in the return value of Mailchimp. It's possible to get it with a static segment, but not with a saved one.
Any help to get the count for a filtered and saved segment?
You can check the "list related methods" of the Mailchimp API here: https://apidocs.mailchimp.com/api/2.0/#lists-methods
I'm trying to solve the same problem myself.
I believe this can be done relatively easily with the segment-test method. Besides your API key, all you have to pass in are the parameters list_id and, inside an options object, saved_segment_id. The call will return the total number of subscribers that match the saved segment. For example:
curl -X POST https://us1.api.mailchimp.com/2.0/lists/segment-test.json --data '{"apikey":"MYAPIKEY","list_id":"MYLISTID","options":{"saved_segment_id":MYSEGMENTID}}'
To get the segment ID for all the saved segments in your list, you can first call the segments method, like this:
curl -X POST https://us1.api.mailchimp.com/2.0/lists/segments.json --data '{"apikey":"MYAPIKEY","id":"MYLISTID"}'
Note that the list ID is passed as "id" in the segments method, but as "list_id" in the segment-test method.
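Since the question is about a Django site, the same two calls could be sketched in Python with only the standard library. The helper names are my own, "us1" and the keys are placeholders as in the curl examples, and the shape of the response is not assumed here; send() just returns the parsed JSON.

```python
import json
from urllib import request

API_ROOT = "https://{dc}.api.mailchimp.com/2.0/lists/{method}.json"


def build_call(dc, method, payload):
    """Build a POST request for a lists/* method of API 2.0."""
    return request.Request(
        API_ROOT.format(dc=dc, method=method),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def segments_call(dc, apikey, list_id):
    # The segments method expects the list ID as "id"
    return build_call(dc, "segments", {"apikey": apikey, "id": list_id})


def segment_test_call(dc, apikey, list_id, segment_id):
    # The segment-test method expects the list ID as "list_id"
    payload = {
        "apikey": apikey,
        "list_id": list_id,
        "options": {"saved_segment_id": segment_id},
    }
    return build_call(dc, "segment-test", payload)


def send(req):
    """Send a prepared request and return the decoded JSON response."""
    with request.urlopen(req) as response:
        return json.load(response)


req = segment_test_call("us1", "MYAPIKEY", "MYLISTID", 12345)
print(req.full_url)
print(req.data.decode("utf-8"))
```

Building the request separately from sending it keeps the "id" versus "list_id" difference in one place and makes the payloads easy to inspect before anything goes over the network.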
