Upsert function for Elasticsearch? - python

I would like to periodically update the data in Elasticsearch.
The file I send in for update may contain data that already exists in Elasticsearch (for update) and data that is new (for insert).
Since the data in Elasticsearch is managed by auto-created IDs,
I have to look up the ID by a unique column "code" to check whether a doc already exists: if it exists, update it; otherwise, insert it.
I wonder if there is any method faster than the code below.
from elasticsearch import Elasticsearch

es = Elasticsearch()

# get the doc ID by searching (exact match) on the code to check whether the doc exists
res = es.search(index=index_name, doc_type=doc_type, body=body_for_search)
id_dict = dict([('id', doc['_id']) for doc in res['hits']['hits']])

# if an ID was found, update the existing doc by ID
# else insert with an auto-created ID
if id_dict.get('id'):
    es.update(index=index_name, id=id_dict['id'], doc_type=doc_type, body=body)
else:
    es.index(index=index_name, doc_type=doc_type, body=body)
For example, is there a method where Elasticsearch searches for the exact match on col["code"] for you, so you can simply "upsert" the data without specifying an ID?
Any advice would be much appreciated, and thank you for reading.
PS: if we made the ID = col["code"] it would be much simpler and faster, but for management reasons we can't do that at the current stage.

As #Archit said, use your own ID to look up documents faster.
Use the upsert API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#upserts
Be sure your ID structure respects Lucene good practices:
If you are using your own ID, try to pick an ID that is friendly to
Lucene. Examples include zero-padded sequential IDs, UUID-1, and
nanotime; these IDs have consistent, sequential patterns that compress
well. In contrast, IDs such as UUID-4 are essentially random and offer
poor compression and slow down Lucene.
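For example, with the update API's doc_as_upsert option this becomes a single call per document (a sketch only, assuming you can use your own ID such as col["code"], which the PS says is not yet possible; index_name, doc_type and body are the variables from the question):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# one request: update the doc if it exists, otherwise create it from the same fields
es.update(
    index=index_name,
    doc_type=doc_type,
    id=col["code"],               # your own deterministic ID
    body={
        "doc": body,              # fields to apply when the doc already exists
        "doc_as_upsert": True     # create the doc from "doc" when it does not
    },
)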

Compare data from one database table to another database table in Python (sqlite3) more efficiently

Edit: The end goal is essentially to sync both databases, which are updated by different software on each end.
Here's the setup: the main database (dbOne.db) and the new database (dbTwo.db) both contain the same table (myTable) with the same columns (account_id, guid, state, last_update).
The goal is to iterate through dbOne and search dbTwo for each account_id and guid. If no results are found, I modify the guid with some regex and search again before moving on to the next row; if I do find a match, I update it based on last_update.
Here is what the code essentially looks like.
for account_id, guid, state, last_update in dbOne.execute(
        "SELECT account_id, guid, state, last_update FROM myTable"):
    n = dbTwo.execute(
        "SELECT account_id, guid, state, last_update FROM myTable "
        "WHERE account_id=? AND guid=?", (account_id, guid))
    # check whether n returned any results using n.fetchone()
    if n.fetchone() is None:
        # I modify the guid using regex to get guid_modified
        n = dbTwo.execute(
            "SELECT account_id, guid, state, last_update FROM myTable "
            "WHERE account_id=? AND guid=?", (account_id, guid_modified))
        if n.fetchone() is None:
            continue
    # Compare dbOne.last_update with dbTwo.last_update, then do stuff
Just to note: this isn't the actual code; I've simplified it as much as possible to give a better understanding of what I've done. The code I have works, but I feel like there could be a better way to do this.
I've very little experience with SQL, but through my searches I've discovered things like ATTACH and JOIN, as well as IIF and CASE statements, so I had the idea of pushing this into a single query: if the guid does not match anything in the database, try {guid_modified}, else continue. If there is a match, it would return the account_id, guid, state, and last_update from both databases; I would then compare last_update from both to find the newer date and update state in one of the databases.
One thing I don't know is whether it would be possible to modify the guid within the query, since my regex strips the guid and then uses an API to get some information in order to modify it, so I think I would have to run a second query regardless. I was just hoping for a second opinion, as I am still in the midst of learning Python and I want to make sure my code is as efficient as possible.
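For what it's worth, here is roughly what I imagine the ATTACH + JOIN version would look like (just a sketch; the regex/API-modified guid would still need a second lookup):

import sqlite3

conn = sqlite3.connect("dbOne.db")
conn.execute("ATTACH DATABASE 'dbTwo.db' AS two")

# LEFT JOIN pairs every row in dbOne with its match in dbTwo (NULLs when there is none)
rows = conn.execute("""
    SELECT one.account_id, one.guid, one.state, one.last_update,
           two_t.state, two_t.last_update
    FROM myTable AS one
    LEFT JOIN two.myTable AS two_t
        ON two_t.account_id = one.account_id AND two_t.guid = one.guid
""")
for account_id, guid, state_one, upd_one, state_two, upd_two in rows:
    if state_two is None and upd_two is None:
        # no match on the raw guid; this is where the regex/API-modified guid lookup would go
        continue
    # compare upd_one with upd_two and update the older side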
Thanks in advance.

Pymongo BulkWriteResult doesn't contain upserted_ids

Okay, so currently I'm trying to upsert something in a local MongoDB using pymongo. (I check to see if the document is in the db; if it is, I update it, otherwise I just insert it.)
I'm using bulk_write to do that, and everything is working ok. The data is inserted/updated.
However, I need the ids of the newly inserted/updated documents, but "upserted_ids" in the BulkWriteResult object is empty, even though it states that it inserted 14 documents.
I've added this screenshot with the variable. Is it a bug, or is there something I'm not aware of?
Finally, is there a way of getting the ids of the documents without actually searching for them in the db? (If possible, I would prefer to use bulk_write)
Thank you for your time.
EDIT:
As suggested, I added part of the code so it's easier to get the general idea:
for name in input_list:
    if name not in stored_names:  # completely new entry (both name and package)
        operations.append(InsertOne({"name": name, "package": [package_name]}))

if len(operations) == 0:
    print("## No new permissions to insert")
    return

bulkWriteResult = _db_insert_bulk(collection_name, operations)
and the insert function:
def _db_insert_bulk(collection_name, operations_list):
    return db[collection_name].bulk_write(operations_list)
The upserted_ids field in the pymongo BulkWriteResult only contains the ids of the records that have been inserted as part of an upsert operation, e.g. an UpdateOne or ReplaceOne with the upsert=True parameter set.
As you are performing InsertOne which doesn't have an upsert option, the upserted_ids list will be empty.
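For comparison, a bulk write built from upserts does populate upserted_ids. A rough sketch only, reusing the name/package fields from the question; the 'permissions' collection name, the sample names, and the assumption that "name" is the field to match on are all placeholders:

from pymongo import MongoClient, UpdateOne

db = MongoClient()['mydatabase']
names = ['perm.read', 'perm.write']           # stand-ins for input_list
package_name = 'com.example.app'              # stand-in for the question's package

operations = [
    UpdateOne(
        {'name': name},                                              # match on the (assumed unique) name
        {'$setOnInsert': {'name': name, 'package': [package_name]}},
        upsert=True,                                                 # insert when there is no match
    )
    for name in names
]
result = db['permissions'].bulk_write(operations)
print(result.upserted_ids)   # {operation_index: ObjectId, ...} only for docs inserted via upsert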
The lack of an inserted_ids field in pymongo's BulkWriteResult is an omission in the driver; technically it conforms to the CRUD specification mentioned in D. SM's answer, as the field is annotated with "Drivers may choose to not provide this property."
But ... there is an answer. If you are only doing inserts as part of your bulk update (and not mixed bulk operations), just use insert_many(). It is just as efficient as a bulk write and, crucially, does provide the inserted_ids value in the InsertManyResult object.
from pymongo import MongoClient
db = MongoClient()['mydatabase']
inserts = [{'foo': 'bar'}]
result = db.test.insert_many(inserts, ordered=False)
print(result.inserted_ids)
Prints:
[ObjectId('5fb92cafbe8be8a43bd1bde0')]
This functionality is part of the CRUD specification and should be implemented by compliant drivers, including pymongo. Refer to the pymongo documentation for correct usage.
Example in Ruby:
irb(main):003:0> c.bulk_write([insert_one:{a:1}])
=> #<Mongo::BulkWrite::Result:0x00005579c42d7dd0 #results={"n_inserted"=>1, "n"=>1, "inserted_ids"=>[BSON::ObjectId('5fb7e4b12c97a60f255eb590')]}>
Your output shows that zero documents were upserted, therefore there wouldn't be any ids associated with the upserted documents.
Your code doesn't appear to show any upserts at all, which again means you won't see any upserted ids.

API GET method to get all tweets with a hashtag count greater than a given value from MongoDB in JSON format

I have a MongoDB database that contains a number of tweets. I want to be able to get, as a JSON list through my API, all the tweets that contain a number of hashtags greater than that specified by the user in the URL (e.g. http://localhost:5000/tweets?morethan=5, which is 5 in this case).
The hashtags are contained inside the entities column in the database, along with other columns such as user_mentions, urls, symbols and media. Here is the code I've written so far, but it doesn't return anything.
#!flask/bin/python
from flask import Flask, request
from pymongo import MongoClient

client = MongoClient()
app = Flask(__name__)

@app.route('/tweets', methods=['GET'])
def get_tweets():
    # Connect to database and pull back collections
    db = client['mongo']
    collection = db['collection']
    parameter = request.args.get('morethan')
    if parameter:
        gt_parameter = int(parameter) + 1  # question said greater than, not greater or equal
        key_im_looking_for = "entities.hashtags.{}".format(gt_parameter)  # create the namespace
        cursor = collection.find({key_im_looking_for: {"$exists": True}})
EDIT: IT WORKS!
The line that was causing the problem was this one:
cursor = collection.find({"entities": {"hashtags": parameter}})
This answer explains why it is impossible to directly perform what you ask.
mongodb query: $size with $gt returns always 0
That answer also describes potential (but poor) ideas to get around it.
The best suggestion is to modify all your documents to add a "num_hashtags" key, index it, and query against it.
Using the Twitter JSON API you could update all your documents and put the num_hashtags key in the entities document.
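For example, a one-off backfill plus index (a rough sketch, assuming the pymongo collection object from the question and storing the count at the top level of each document) could look like this:

# hypothetical backfill: store the hashtag count so it can be indexed and queried directly
for doc in collection.find({}, {"entities.hashtags": 1}):
    count = len(doc.get("entities", {}).get("hashtags", []))
    collection.update_one({"_id": doc["_id"]}, {"$set": {"num_hashtags": count}})

collection.create_index("num_hashtags")

# the endpoint query then becomes a simple range query
cursor = collection.find({"num_hashtags": {"$gt": int(parameter)}})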
Alternatively, you could solve your immediate problem with a very slow full collection scan across all documents on every query, checking whether the hashtag at the index one greater than your parameter exists, by abusing MongoDB dot notation.
gt_parameter = int(parameter) + 1 # question said greater than not greater or equal
key_im_looking_for = "entities.hashtags.{}".format(gt_parameter)  # create the namespace
# py2.7 => key_im_looking_for = "entities.hashtags.%s" %(gt_parameter)
# in this example it would be "entities.hashtags.6"
cursor = collection.find({key_im_looking_for: {"$exists": True}})
The best answer (and the key reason to use a NoSQL database in the first place) is that you should modify your data to suit your retrieval. If possible, you should perform an in-place update adding the num_hashtags key.

Elasticsearch backfill two fields into one new field after calculations

I have been tasked with researching how to backfill data in Elasticsearch, and so far I'm coming up a bit empty. The basic gist is:
Notes: All documents are stored under daily indexes, with ~200k documents per day.
I need to be able to reindex about 60 days worth of data.
I need to take two fields from each doc, payload.time_sec and payload.time_nanosec, take their values, do some math on them (time_sec * 10**9 + time_nanosec), and then write the result as a single field into the reindexed document.
I am looking at the Python API documentation with bulk helpers:
http://elasticsearch-py.readthedocs.io/en/master/helpers.html
But I am wondering if this is even possible.
My thoughts were to use:
Bulk helpers to pull a scroll ID (bulk _update?), iterate over each doc ID, pull the data in from the two fields for each doc, do the math, and finish the update request with the new field data.
Has anyone done this? Maybe something with a Groovy script?
Thanks!
Bulk helpers to pull a scroll ID (bulk _update?), iterate over each doc ID, pull the data in from the two fields for each doc, do the math, and finish the update request with the new field data.
Basically, yes:
use /_search?scroll to fetch the docs
perform your operation
send /_bulk update requests
Other options are:
use the /_reindex API. Probably not so good if you don't want to create a new index.
use the /_update_by_query API
Both support scripting, which, if I understood correctly, would be the perfect choice, because your update does not depend on external factors, so it could just as well be done directly on the server.
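For instance, with /_update_by_query the whole backfill can run server-side (a sketch only, assuming a recent Elasticsearch with Painless scripting; the index pattern "logs-*" is a placeholder for your daily indexes):

from elasticsearch import Elasticsearch

es = Elasticsearch()

es.update_by_query(
    index="logs-*",                        # placeholder for the ~60 days of daily indexes
    body={
        "script": {
            "lang": "painless",
            "source": (
                "ctx._source.payload.duration = "
                "ctx._source.payload.time_sec * 1000000000L + "
                "ctx._source.payload.time_nanosec;"
            ),
        },
        "query": {"exists": {"field": "payload.time_sec"}},
    },
)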
Here is where I am at (roughly):
I've been working with Python and the bulk helpers, and so far I am around here:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

doc = helpers.scan(es, query={
    "query": {
        "match_all": {}
    },
    "size": 1000
}, index=INDEX, scroll='5m', raise_on_error=False)

new_index_data = []
count = 0
for x in doc:
    x['_index'] = NEW_INDEX
    try:
        time_sec = x['_source']['payload']['time_sec']
        time_nanosec = x['_source']['payload']['time_nanosec']
        duration = (time_sec * 10**9) + time_nanosec
    except KeyError:
        pass
    count = count + 1
    x['_source']['payload']['duration'] = duration
    new_index_data.append(x)

helpers.bulk(es, new_index_data)
From here I am just using the bulk python helper to insert into a new index. However I will experiment changing and testing with bulk update to an existing index.
Does this look like the right approach?
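For the update-in-place experiment, I'm assuming the bulk actions would need to look roughly like this (just a guess based on the helpers docs; docs_with_durations stands for hypothetical (doc, duration) pairs collected in the loop above):

actions = [
    {
        "_op_type": "update",
        "_index": x["_index"],
        "_type": x["_type"],                      # omit on versions without mapping types
        "_id": x["_id"],
        "doc": {"payload": {"duration": duration}},
    }
    for x, duration in docs_with_durations        # hypothetical (doc, duration) pairs
]
helpers.bulk(es, actions)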

Python-Eve: Prevent inserting duplicates without using unique fields

I am trying to prevent inserting duplicate documents by the following approach:
Get a list of all documents from the desired endpoint which will contain all the documents in JSON-format. This list is called available_docs.
Use a pre_POST_<endpoint> hook in order to handle the request before the data is inserted. I am not using the on_insert hook since I need to do this before validation.
Since we can access the request object, use request.json to get the JSON-formatted payload.
Check if request.json is already contained in available_docs.
Insert the new document only if it is not a duplicate; abort otherwise.
Using this approach I got the following snippet:
def check_duplicate(request):
    if request.json not in available_docs:
        print('Not a duplicate')
    else:
        print('Duplicate')
        flask.abort(422, description='Document is a duplicate and already in database.')
The available_docs list looks like this:
available_docs = [{'foo': ObjectId('565e12c58b724d7884cd02bb'), 'bar': [ObjectId('565e12c58b724d7884cd02b9'), ObjectId('565e12c58b724d7884cd02ba')]}]
The payload request.json looks like this:
{'foo': '565e12c58b724d7884cd02bb', 'bar': ['565e12c58b724d7884cd02b9', '565e12c58b724d7884cd02ba']}
As you can see, the only difference between the document passed to the API and the document already stored in the DB is the datatype of the IDs. Because of that, the if-statement in my snippet above evaluates to True and treats the document to be inserted as not being a duplicate, whereas it definitely is one.
Is there a way to check if a passed document is already in the database? I am not able to use unique fields since only the combination of all document fields needs to be unique. There is a unique identifier (which I left out in this example), but it is not suitable for the desired comparison since it is essentially a timestamp.
I think something like casting the given IDs at the keys foo and bar to ObjectIds would do the trick, but I do not know how to do this, since I do not know where to get the ObjectId datatype from.
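For reference, I assume the cast would look something like this (just a guess, importing ObjectId from the bson package that ships with pymongo):

from bson.objectid import ObjectId  # bson ships with pymongo

payload = dict(request.json)
payload['foo'] = ObjectId(payload['foo'])
payload['bar'] = [ObjectId(v) for v in payload['bar']]

is_duplicate = payload in available_docs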
Your approach would be much slower than setting a unique rule for the field.
Since, from your example, you are going to compare objectids, can't you simply use those as the _id field for the collection? In Mongo (and Eve of course) that field is unique by default. Actually, you typically don't even define it. You would not need to do anything at all, as a POST of a document with an already existing id would fail right away.
If you can't go that way (maybe you need to compare a different ObjectId field and still, for some reason, you can't simply set a unique rule for it), I would look at querying the db for the field value rather than getting all the documents from the db and scanning them sequentially in code. Something like db.find({db_field: new_document_field_value}). If that returns a document, the new one is a duplicate. Make sure db_field is indexed (which usually also holds for fields tagged with the unique rule).
EDIT after the comments. A trivial implementation would probably be something like this:
def pre_POST_callback(resource, request):
    # retrieve mongodb collection using eve connection
    docs = app.data.driver.db['docs']
    if docs.find_one({'foo': <value>}):
        flask.abort(422, description='Document is a duplicate and already in database.')

app = Eve()
app.run()
Here's my approach to preventing duplicate records:
def on_insert_subscription(items):
    c_subscription = app.data.driver.db['subscription']
    user = decode_token()
    if user:
        for item in items:
            if c_subscription.find_one({
                'topic': ObjectId(item['topic']),
                'client': ObjectId(user['user_id'])
            }):
                abort(422, description="Client already subscribed to this topic")
            else:
                item['client'] = ObjectId(user['user_id'])
    else:
        abort(401, description='Please provide proper credentials')
What I'm doing here is creating subscriptions for clients. If a client is already subscribed to a topic I throw 422.
Note: the client ID is decoded from the JWT token.
