Periodically process and update documents in an Elasticsearch index - Python

I need to come up with a strategy to process and update documents in an Elasticsearch index periodically and efficiently. Documents I have already processed do not need to be looked at again.
My setup is a long-running process that continuously inserts documents into an index, at roughly 500 documents per hour (think of the common logging example).
I need a solution that periodically (e.g. via a cron job) picks up a batch of documents, runs some code on a specific field (a text field, say), and enriches each document with a number of new fields. I want to do this to offer more fine-grained aggregations on the index. In the logging analogy, I might take the UserAgent string from a log entry (document), parse it, and write some new fields back to that document before indexing it again.
So my approach would be:
1. Get some number of documents (or even all) that I haven't looked at before. I could query them by combining must_not and exists, for instance.
2. Run my code on these documents (run the parser, compute some new stuff, whatever).
3. Update the documents obtained in step 1 (preferably via the bulk API).
I know there is the Update by query API, but it does not seem to be the right fit here: I need to run my own code (which, by the way, depends on external libraries) on my server, not as a Painless script, which would not support the kind of processing I need.
I am accessing elasticsearch via python.
The problem is that I don't know how to implement the above approach. For example, what if the number of documents obtained in step 1 is larger than myindex.settings.index.max_result_window?
Any ideas?

I considered @Jay's comment and ended up with this pattern for the moment:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from elasticsearch.helpers import scan

from my_module.postprocessing import post_process_doc

es = Elasticsearch(...)
es.ping()

def update_docs(docs):
    """Turn scanned hits into bulk update actions."""
    for idx, doc in enumerate(docs):
        if idx % 10000 == 0:
            print('next 10k')
        new_field_value = post_process_doc(doc)
        doc_update = {
            "_index": doc["_index"],
            "_id": doc["_id"],
            "_op_type": "update",
            # <<the new field>> is a placeholder for the name of the field to add
            "doc": {"<<the new field>>": new_field_value},
        }
        yield doc_update

# only fetch documents that do not have the new field yet
docs = scan(
    es,
    query={"query": {"bool": {"must_not": {"exists": {"field": "<<the new field>>"}}}}},
    index=index,
    scroll="1m",
    preserve_order=True,
)
bulk(es, update_docs(docs))
Comments:
I learned that Elasticsearch keeps a view of the search results when you do a scroll and pass the corresponding IDs with the query request. The scan helper handles that for you. The scroll parameter in the call above tells Elasticsearch how long the view stays open, i.e., how long the view remains consistent.
As stated in my comment, the documentation says that the scroll API is no longer recommended for deep pagination. If you need to preserve the index state while paging, use a point in time (PIT) instead; I haven't tried it yet, but a rough sketch follows these comments.
In my implementation, I needed to pass preserve_order=True, otherwise an error was thrown.
Remember to update the mapping of the index beforehand, e.g., when you want to add a nested field as another field in your document.
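For reference, a minimal sketch of the PIT + search_after variant mentioned above. I haven't run it against this index; it assumes a 7.x-style elasticsearch-py client, reuses "<<the new field>>" as the placeholder field name, and reuses the update_docs generator from the snippet above:
pit = es.open_point_in_time(index=index, keep_alive="1m")
pit_id = pit["id"]
query = {"bool": {"must_not": {"exists": {"field": "<<the new field>>"}}}}
search_after = None
while True:
    body = {
        "size": 1000,
        "query": query,
        "pit": {"id": pit_id, "keep_alive": "1m"},
        "sort": [{"_shard_doc": "asc"}],  # tie-breaker sort required for search_after
    }
    if search_after is not None:
        body["search_after"] = search_after
    resp = es.search(body=body)  # no index argument: the PIT already pins the index
    hits = resp["hits"]["hits"]
    if not hits:
        break
    bulk(es, update_docs(hits))  # reuse the generator from the snippet above
    search_after = hits[-1]["sort"]
    pit_id = resp.get("pit_id", pit_id)
es.close_point_in_time(body={"id": pit_id})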

Related

How do I not create a new document if the previous document in the collection contains a certain value?

I am using the python library to programmatically create documents in a collection as such:
user = client.query(q.create(q.collection("my_collection"), {
    "data": {
        "UTC_datetime": str(datetime.now(pytz.UTC)),
        "item_one": str(value_one),
        "item_two": str(value_two),
        "item_three": str(value_three)
    }
}))
Upon certain conditions being met the python app executes again.
If item_two on the next app execution has the same value again I do not want a new document to be created.
How do I craft the above query to perform this?
Currently, I am reading the previous document, extracting the value from item_two and performing an if/else statement to either proceed to store a new document or sys.exit().
I'm positive there is a more elegant solution based on Fauna's logic instead of Python's; however, I have not been able to achieve this.
You can create a unique index (https://docs.fauna.com/fauna/current/api/fql/indexes) on item_two to ensure that duplicates are not possible. You may also want to use an upsert implementation:
https://forums.fauna.com/t/multi-document-upsert/488/3
https://forums.fauna.com/t/does-fauna-supports-upserts/208
q.If(
    q.Exists(q.Match(q.Index('unique_item_two'), str(value_two))),
    q.Update(...),
    q.Create(...)
)
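A rough sketch of how that could look with the Python faunadb driver (where the query functions are lowercase, e.g. q.if_), assuming the index 'unique_item_two' exists with unique=True and data.item_two as its only term; the secret and the item values below are placeholders:
from datetime import datetime

import pytz
from faunadb import query as q
from faunadb.client import FaunaClient

client = FaunaClient(secret="your-secret")  # placeholder secret
value_one, value_two, value_three = "a", "b", "c"  # stand-ins for the question's values

match = q.match(q.index("unique_item_two"), str(value_two))
result = client.query(
    q.if_(
        q.exists(match),
        # a document with this item_two already exists: update it instead of duplicating
        q.update(q.select("ref", q.get(match)),
                 {"data": {"item_three": str(value_three)}}),
        # otherwise create it, as in the question's snippet
        q.create(q.collection("my_collection"), {"data": {
            "UTC_datetime": str(datetime.now(pytz.UTC)),
            "item_one": str(value_one),
            "item_two": str(value_two),
            "item_three": str(value_three),
        }}),
    )
)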

How to paginate an aggregation pipeline result in pymongo?

I have a web app where I store some data in Mongo, and I need to return a paginated response from a find or an aggregation pipeline. I use Django Rest Framework and its pagination, which in the end just slices the Cursor object. This works seamlessly for Cursors, but aggregation returns a CommandCursor, which does not implement __getitem__().
cursor = collection.find({})
cursor[10:20] # works, no problem
command_cursor = collection.aggregate([{'$match': {}}])
command_cursor[10:20] # throws not subscriptable error
What is the reason behind this? Does anybody have an implementation for CommandCursor.__getitem__()? Is it feasible at all?
I would like to find a way to not fetch all the values when I need just a page. Converting to a list and then slicing it is not feasible for large (100k+ docs) pipeline results. There is a workaround based on this answer, but it only works well for the first few pages, and performance drops rapidly for pages near the end.
Mongo has certain aggregation pipeline stages to deal with this, like $skip and $limit that you can use like so:
aggregation_results = list(collection.aggregate([{'$match': {}}, {'$skip': 10}, {'$limit': 10}]))
Specifically, as you noticed, PyMongo's CommandCursor does not implement __getitem__, hence the usual slicing syntax does not work as expected. I would personally recommend not tampering with their code unless you're interested in becoming a contributor to their package.
The MongoDB cursors for find and aggregate work differently: the cursor from an aggregation query is the result of processed data (in most cases), whereas find cursors are static, so documents can be skipped and limited at will.
You can add the paginator limits as $skip and $limit stages in the aggregation pipeline.
For Example:
command_cursor = collection.aggregate([
    {
        "$match": {
            # Match conditions
        }
    },
    {
        "$skip": 10  # No. of documents to skip (should be 0 for page 1)
    },
    {
        "$limit": 10  # No. of documents to be displayed on your webpage
    }
])
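To tie this to a paginator, the $skip value can be derived from the page number. A small sketch (paginate_aggregation, page, and page_size are illustrative names, not from the original post):
def paginate_aggregation(collection, match_stage, page, page_size):
    # page is 1-based: page 1 skips 0 documents, page 2 skips page_size, and so on
    pipeline = [
        {"$match": match_stage},
        {"$skip": (page - 1) * page_size},
        {"$limit": page_size},
    ]
    return list(collection.aggregate(pipeline))

# e.g. the second page of 10 results:
second_page = paginate_aggregation(collection, {}, page=2, page_size=10)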

MongoDB: How to change an element of a nested array?

From what I have read, it is impossible to update an element in a nested array using the positional operator $ in MongoDB; $ only works one level deep. I see it is a requested feature for MongoDB 2.7.
Updating the whole document one level up is not an option because of write conflicts. I need to be able to change just the 'username' for a particular reward program, for instance.
One idea would be to pull, modify, and push the entire 'reward_programs' element, but then I would lose the order. Order is important.
Consider this document:
{
    "_id": 0,
    "firstname": "Tom",
    "profiles": [
        {
            "profile_name": "tom",
            "reward_programs": [
                {
                    "program_name": "American",
                    "username": "tomdoe",
                },
                {
                    "program_name": "Delta",
                    "username": "tomdoe",
                }
            ]
        }
    ]
}
How would you go about specifically changing the 'username' of 'program_name'=Delta?
After doing more reading, it looks like this is unsupported in MongoDB at the moment; positional updates are only supported one level deep. The feature might be added in MongoDB 2.7.
There are a couple of workarounds.
1) Flatten out your database structure. In this case, make 'reward_programs' its own collection and do your operation on that.
2) Instead of arrays of dicts, use dicts of dicts. That way you have an absolute path down to the object you need to modify. This can have drawbacks for query flexibility.
3) It seems hacky to me, but you can also walk the nested array, find the position index of the element you want, and do something like this (a fuller sketch follows this list):
users.update(
    {'_id': request._id, 'profiles.profile_name': profile_name},
    {'$set': {'profiles.$.reward_programs.{}.username'.format(index): new_username}}
)
4) Read in the whole document, modify it, and write it back. However, this has possible write conflicts.
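A rough sketch of workaround 3) against the example document above, hard-coding the 'Delta' lookup and using PyMongo's update_one (users is the collection, as in the snippet above); the index is computed client-side, so it can race with concurrent writers:
doc = users.find_one({'_id': 0, 'profiles.profile_name': 'tom'})
profile = next(p for p in doc['profiles'] if p['profile_name'] == 'tom')
# position of the Delta program inside the nested reward_programs array
index = next(i for i, rp in enumerate(profile['reward_programs'])
             if rp['program_name'] == 'Delta')
users.update_one(
    {'_id': 0, 'profiles.profile_name': 'tom'},
    {'$set': {'profiles.$.reward_programs.{}.username'.format(index): 'new_username'}}
)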
Setting up your database structure initially is extremely important. It really depends on how you are going to use it.
A simple way to do this:
doc = collection.find_one({'_id': 0})
doc['profiles'][0]["reward_programs"][1]['username'] = 'new user name'
# replace the whole document (note: collection.save() was removed in PyMongo 4;
# collection.replace_one({'_id': 0}, doc) is the modern equivalent)
collection.save(doc)

Add a field to existing document in CouchDB

I have a database with a bunch of regular documents that look something like this (example from wiki):
{
    "_id": "some_doc_id",
    "_rev": "D1C946B7",
    "Subject": "I like Plankton",
    "Author": "Rusty",
    "PostedDate": "2006-08-15T17:30:12-04:00",
    "Tags": ["plankton", "baseball", "decisions"],
    "Body": "I decided today that I don't like baseball. I like plankton."
}
I'm working in Python with couchdb-python and I want to know if it's possible to add a field to each document. For example, if I wanted to have a "Location" field or something like that.
Thanks!
Regarding IDs
Every document in couchdb has an id, whether you set it or not. Once the document is stored you can access it through the doc._id field.
If you want to set your own ids you'll have to assign the id value to doc._id. If you don't set it, then couchdb will assign a uuid.
If you want to update a document, then you need to make sure you have the same id and a valid revision. If, say, you are working from a blog post and the user adds the Location, then the URL of the post may be a good id to use; you'd be able to access the document instantly in that case.
So what's a revision
In your code snippet above you have the doc._rev element. This is the identifier of the revision. If you save a document with an id that already exists, couchdb requires you to prove that the document is still the valid doc and that you are not trying to overwrite someone else's document.
So how do I update a document
If you have the id of your document, you can just access each document by using the db.get(id) function. You can then update the document like this:
doc = db.get(id)
doc['Location'] = "On a couch"
db.save(doc)
I have an example where I store weather forecast data and update the forecasts approximately every 2 hours. A separate process looks up data that I get from a different provider, based on characteristics of tweets from that day.
That looks something like this:
doc = db.get(id)
doc_with_loc = GetLocationInformationFromOtherProvider(doc) # takes about 40 seconds.
doc_with_loc["_rev"] = doc["_rev"]
db.save(doc_with_loc) # This will fail if weather update has also updated the file.
If you have concurrent processes, the _rev can become invalid, so you need a failsafe; e.g., this could do:
from couchdb.http import ResourceConflict

doc = db.get(id)
doc_with_loc = GetLocationInformationFromAltProvider(doc)
update_outstanding = True
while update_outstanding:
    doc = db.get(id)  # re-retrieve to get the current _rev
    doc_with_loc["_rev"] = doc["_rev"]
    try:
        db.save(doc_with_loc)
        update_outstanding = False
    except ResourceConflict:
        pass  # another process updated the document in between; retry
So how do I get the Ids?
One option suggested above is to actively set the id so you can retrieve it, i.e. if a user sets a given location that is attached to a URL, use the URL. But you may not know which document you want to update, or you may want a process that finds all the documents that don't have a location and assigns one.
You'll most likely be using a view for this. Views have a mapper and a reducer. You'll use the first one, forget about the last one. A view with a mapper does the following:
It returns a simplified/transformed way of looking at your data. You can emit multiple values per document or skip some. It gives the data you emit a key, and if you use the include_docs option it will also give you the document itself (with _id and _rev alongside).
The simplest view is the default view db.view('_all_docs'); this returns all documents, and you may not want to update all of them. Views themselves are stored as documents when you define them.
The next simplest way is to have a view that only returns documents of a certain type. I tend to have a _type="article" field in my database. Think of this as marking that a document belongs to a certain table, as if you had stored it in a relational database.
Finally, you can filter on documents that do not yet have a location, so you'd have a view that lets you iterate over all the docs that still need one and handle them in a separate process, as sketched below. The best documentation on writing views is in the official CouchDB documentation.
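As an illustration of that last step with couchdb-python, a minimal sketch. It assumes a hypothetical permanent view 'articles/missing_location' whose map function emits only documents without a Location, and a hypothetical lookup_location() helper:
# the map function stored in the design document might look like:
#   function(doc) {
#       if (doc._type === "article" && !doc.Location) emit(doc._id, null);
#   }
for row in db.view('articles/missing_location', include_docs=True):
    doc = row.doc
    doc['Location'] = lookup_location(doc)  # hypothetical helper
    db.save(doc)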

How to insert clean data into Firebase via REST API and Python

I have a simple Firebase that I mostly interact with via Javascript, which works really well. However, I also have a Python program that needs to get data from existing children and put/update data on existing children. I tried python-firebasin, which would do what I want, but it is unreliable (hangs, fails, etc.).
So I'm looking at the python-firebase REST wrapper. This seems efficient, and works well. However, every time I try to post() data, I get not just the data I'm posting, but some kind of unique string paired with it, all inserted as a child.
For example, via Javascript, I might say:
db = new Firebase('https://myfirebase.firebaseio.com/testval/');
db.transaction(function(current) { return 1; });
This would then give me a Firebase that looked like:
|---testval: 1
But when I try to do something similar with the Python Firebase REST wrapper, such as:
db = firebase.FirebaseApplication('https://myfirebase.firebaseio.com/')
db.post('/testval/',1)
My Firebase looks something like this:
|---testval:
    |---JI4BiBbICSEAnM9mDXf: 1
In other words, it inserts a new child, gives it a new string, and then appends the data. Is there any way to insert/modify data on my Firebase using the REST wrapper that would do it cleanly like I'm doing with Javascript? Without adding children, without adding these unique strings?
Try this instead:
db.put(1)
db.post() is the equivalent of .push() in the JavaScript API, so it creates a unique ID for you. db.put() is equivalent to .set() and will just set the data, which appears to be what you want.
Note that there is no equivalent for transactions in the REST API, but your example was just using a transaction to do a .set() so hopefully you don't actually need them.
Try this:
db = firebase.FirebaseApplication('https://myfirebase.firebaseio.com/')
db.put('', 'testval', 1)
put takes three arguments: the first is the URL or path, the second is the key (snapshot) name, and the third is the data (JSON).
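A quick sketch of the round trip with the python-firebase REST wrapper (the URL is a placeholder; get takes (url, name) and put takes (url, name, data)):
from firebase import firebase

db = firebase.FirebaseApplication('https://myfirebase.firebaseio.com/')
db.put('/', 'testval', 1)          # sets /testval to 1, with no generated push ID
value = db.get('/testval', None)   # reads it back -> 1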
