How to get an auto-generated ID in Elasticsearch using Python - python

I need to store documents with a unique auto-generated ID, like when using curl -XPOST,
because every time I start the program the data is overwritten.
But I can't find an example that automatically generates the ID in Python.
Could you tell me if there are any good examples?
My code:
import json
from elasticsearch import helpers

def saveES(output, es):
    bodys = []
    i = 0
    while i < len(output) - 1:  # output[len(output)-1] is space
        json_doc = json.dumps(output[i])
        body = {
            "_index": "crawler",
            "_type": "typed",
            "_id": saveES.counter,  # counter resets on restart, so IDs repeat and docs get overwritten
            "_source": json_doc
        }
        i += 1
        bodys.append(body)
        saveES.counter += 1
    helpers.bulk(es, bodys)

You don't need to do this in Python -- if you index documents without an ID, Elasticsearch will automatically create a unique ID. However, if for some reason you want to generate the ID in Python, you could use uuid.
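For example, a minimal sketch of both options against the same bulk helper as in your code (the sample output list is a placeholder):

import uuid
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()
output = [{"field": "value"}]  # placeholder documents

# Option 1: omit "_id" entirely -- Elasticsearch assigns a unique ID to each action
actions = [{"_index": "crawler", "_type": "typed", "_source": doc} for doc in output]
helpers.bulk(es, actions)

# Option 2: generate the ID client-side with uuid
actions = [{"_index": "crawler", "_type": "typed", "_id": str(uuid.uuid4()), "_source": doc} for doc in output]
helpers.bulk(es, actions)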

If you are using the ES Python Client you will not be able to use es.create(...) without an ID. You should use es.index(...) instead.
Both of them call the Elasticsearch Index API, but es.create sets the op_type parameter to create, while es.index sets it to index.
The create operation is designed to fail if the ID already exists, so it will not accept being called without an ID.
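A small sketch of the difference (index name and document are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch()
doc = {"title": "example"}  # placeholder document

# index() works without an id: Elasticsearch generates one and returns it
res = es.index(index="crawler", doc_type="typed", body=doc)
print(res["_id"])  # the auto-generated ID

# create() requires an explicit id and fails if that id already exists
es.create(index="crawler", doc_type="typed", id="my-id", body=doc)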

Related

Upsert function for elasticsearch?

I would like to periodically update the data in Elasticsearch.
The file I send in for an update may contain data that already exists in Elasticsearch (to update) and data that is new (to insert).
Since the data in Elasticsearch is managed by auto-created IDs,
I have to search for the ID by a unique column "code" to check whether a doc already exists; if it exists, update it, otherwise insert it.
I wonder if there is any method that is faster than the code I came up with below.
es = Elasticsearch()

# get doc IDs by searching (exact match) for a code to check if the doc exists
res = es.search(index=index_name, doc_type=doc_type, body=body_for_search)
ids = [doc['_id'] for doc in res['hits']['hits']]

# if an ID exists, update the current doc by ID
# else insert with an auto-created ID
if ids:
    es.update(index=index_name, id=ids[0], doc_type=doc_type, body=body)
else:
    es.index(index=index_name, doc_type=doc_type, body=body)
For example, could there be a method where Elasticsearch searches for the exact match on col["code"] for you, so that you can simply "upsert" the data without specifying an ID?
Any advice would be much appreciated, and thank you for reading.
PS: if we made the ID = col["code"] it would be much simpler and faster, but for management reasons we can't do that at the current stage.
As @Archit said, use your own ID to look up documents faster.
Use the upsert API (a sketch follows the quote below): https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#upserts
Be sure your ID structure respects Lucene good practices:
If you are using your own ID, try to pick an ID that is friendly to
Lucene. Examples include zero-padded sequential IDs, UUID-1, and
nanotime; these IDs have consistent, sequential patterns that compress
well. In contrast, IDs such as UUID-4 are essentially random and offer
poor compression and slow down Lucene.
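For reference, a minimal upsert sketch through the Python client, reusing es, index_name, doc_type, body, and col from the question; with doc_as_upsert, Elasticsearch updates the doc when the ID exists and inserts it otherwise:

# doc_as_upsert: update the document if the ID exists, insert it otherwise
es.update(
    index=index_name,
    doc_type=doc_type,
    id=col["code"],  # assumes a stable ID can be derived, e.g. from the unique "code"
    body={
        "doc": body,            # fields to update
        "doc_as_upsert": True   # insert `doc` as a new document when the ID is missing
    }
)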

Elasticsearch backfill two fields into one new field after calculations

Question: I have been tasked with researching how to backfill data in Elasticsearch, and so far I'm coming up a bit empty. The basic gist is:
Notes: All documents are stored under daily indexes, with ~200k documents per day.
I need to be able to reindex about 60 days' worth of data.
I need to take two fields from each doc, payload.time_sec and payload.time_nanosec, take their values, do some math on them (time_sec * 10**9 + time_nanosec), and then return the result as a single field in the reindexed document.
I am looking at the Python API documentation with bulk helpers:
http://elasticsearch-py.readthedocs.io/en/master/helpers.html
But I am wondering if this is even possible.
My thoughts were to use:
Bulk helpers to pull a scroll ID (bulk _update?), iterate over each doc ID, pull in the data from the two fields for each doc, do the math, and finish the update request with the new field data.
Anyone done this? Maybe something with a groovy script?
Thanks!
Bulk helpers to pull a scroll ID (bulk _update?), iterate over each doc ID, pull in the data from the two fields for each doc, do the math, and finish the update request with the new field data.
Basically, yes:
use /_search?scroll to fetch the docs
perform your operation
send /_bulk update requests
Other options are:
use the /_reindex API (probably not so good if you don't want to create a new index)
use the /_update_by_query API
Both support scripting which, if I understood correctly, would be the perfect choice, because your update does not depend on external factors, so this could as well be done directly within the server.
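For illustration, a minimal sketch of the /_update_by_query route with a Painless script (the index name is a placeholder, and it assumes both fields exist on every matched document):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# recompute the duration server-side, without pulling documents down to the client
es.update_by_query(
    index="logs-2017.01.01",  # placeholder daily index name
    body={
        "script": {
            "lang": "painless",
            # on ES 5.x the key is "inline" instead of "source"
            "source": "ctx._source.payload.duration = "
                      "ctx._source.payload.time_sec * 1000000000L + "
                      "ctx._source.payload.time_nanosec"
        },
        "query": {"match_all": {}}
    }
)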
Here is where I am at (roughly):
I've been working with Python and the bulk helpers, and so far I am around here:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()
new_index_data = []
count = 0

docs = helpers.scan(es, query={
    "query": {
        "match_all": {}
    },
    "size": 1000
}, index=INDEX, scroll='5m', raise_on_error=False)

for x in docs:
    x['_index'] = NEW_INDEX
    try:
        time_sec = x['_source']['payload']['time_sec']
        time_nanosec = x['_source']['payload']['time_nanosec']
        # combine the two fields into a single nanosecond value
        duration = (time_sec * 10**9) + time_nanosec
        x['_source']['payload']['duration'] = duration
    except KeyError:
        pass  # doc is missing one of the fields; leave it unchanged
    count = count + 1
    new_index_data.append(x)

helpers.bulk(es, new_index_data)
From here I am just using the bulk Python helper to insert into a new index. However, I will experiment with changing this to a bulk update against the existing index.
Does this look like the right approach?
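For the bulk-update variant, a minimal sketch of what the actions could look like; _op_type "update" makes the bulk helper issue update requests, and INDEX is the same variable as above:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

actions = []
for x in helpers.scan(es, index=INDEX, scroll='5m'):
    payload = x['_source'].get('payload', {})
    if 'time_sec' not in payload or 'time_nanosec' not in payload:
        continue  # skip docs missing either field
    actions.append({
        "_op_type": "update",  # bulk update instead of the default index action
        "_index": x['_index'],
        "_type": x['_type'],
        "_id": x['_id'],
        "doc": {"payload": {"duration": payload['time_sec'] * 10**9 + payload['time_nanosec']}}
    })
helpers.bulk(es, actions)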

How to delete the last item of a collection in mongodb

I made a program with Python and MongoDB to keep some diaries.
Sometimes I want to delete the last sentence, just by typing "delete!"
But I don't know how to delete it in a smart way. I don't want to use "skip".
Is there a good way to do it?
Whether it's the first or the last item, MongoDB maintains a unique _id key for each record, so you can just pass that _id field in your delete query using either deleteOne() or deleteMany(). Since there is only one record to delete, you should use deleteOne(), like:
db.collection_name.deleteOne({"_id": "1234"}) // replace 1234 with actual id
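Since the question uses Python, here is a minimal pymongo sketch (database and collection names are placeholders); the ObjectId _id embeds a timestamp, so sorting on it in descending order finds the most recently inserted document:

from pymongo import MongoClient, DESCENDING

client = MongoClient()
collection = client["diary_db"]["entries"]  # placeholder database/collection names

# find the most recently inserted document and delete it by its _id
last = collection.find_one(sort=[("_id", DESCENDING)])
if last is not None:
    collection.delete_one({"_id": last["_id"]})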

Store dictionary in database

I created a Berkeley database and operate on it using the bsddb module. I need to store information in it in a style like this:
username = '....'
notes = {
    'name_of_note1': {
        'password': '...',
        'comments': '...',
        'title': '...'
    },
    'name_of_note2': {
        # same keys as previous, but different values
    }
}
This is how I open database
db = bsddb.btopen['data.db','c']
How do I do that?
So, first, I guess you should open your database using parentheses:
db = bsddb.btopen('data.db','c')
Keep in mind that Berkeley DB's pattern is key -> value, where both key and value are string objects (not unicode). The best way in your case would be to use:
db[str(username)] = json.dumps(notes)
since your notes are compatible with JSON syntax.
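Putting that together, a minimal round-trip sketch (the file and user names are placeholders):

import json
import bsddb

db = bsddb.btopen('data.db', 'c')  # note: parentheses, not brackets

username = 'alice'  # placeholder
notes = {'name_of_note1': {'password': '...', 'comments': '...', 'title': '...'}}

# store: serialize the nested dict to a JSON string, since values must be str
db[str(username)] = json.dumps(notes)

# load: parse the JSON string back into a dict
restored = json.loads(db[str(username)])
db.close()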
However, this is not a very good choice if, say, you want to query only usernames' comments. In that case you should use a relational database, such as sqlite, which is also built into Python.
A simple solution was described by @Falvian.
For a start, there is also a column pattern in ordered key/value stores, so the plain key/value pattern is not the only option.
I think that bsddb is a viable solution when you don't want to rely on sqlite. The first approach is to create a documents = bsddb.btopen('documents.db', 'c') database and store JSON values inside it. Regarding the keys, you have several options:
Name the keys yourself, like you do with "name_of_note_1", "name_of_note_2"
Generate random identifiers using uuid.uuid4 (don't forget to check it's not already used ;)
Or use a row inside this documents database with key=0 to store a counter that you will use to create uids (unique identifiers); see the sketch below.
If you use integers, don't forget to pack them with something like lambda uid: struct.pack('>q', uid) before storing them.
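For the counter option, a minimal sketch assuming key 0 is reserved for the counter itself (the layout is my own choice, not something bsddb prescribes):

import struct
import bsddb

documents = bsddb.btopen('documents.db', 'c')
COUNTER_KEY = struct.pack('>q', 0)  # reserve key 0 for the counter

def next_uid(db):
    # read the current counter (starting at 0), bump it, and return the packed key
    current = struct.unpack('>q', db[COUNTER_KEY])[0] if db.has_key(COUNTER_KEY) else 0
    uid = current + 1
    db[COUNTER_KEY] = struct.pack('>q', uid)
    return struct.pack('>q', uid)  # big-endian keeps the btree key order consistent

documents[next_uid(documents)] = '{"title": "..."}'  # a JSON value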
If you need to create an index, I recommend you have a look at my other answer introducing composite keys to build indexes in bsddb.

How to update data in a django table from a csv file while keeping the primary key value as it is?

I have a table which already contains some data. Now I want to upload new data from a csv file and update some of the previous values in that table. I am using django 1.3 and sqlite3 as the database, but I am not able to update the table.
If you do not want to change primary key values or add new objects to the table (which could be duplicates of the old info), then you need some kind of data which you can use as lookup parameters in your database.
If you have a model which represents this data, then it's really easy:
m = Model.objects.get(somecolumn = somedata)
m.someothervalue = someotherdata
m.save()
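Putting it together with the csv module, a minimal sketch; the file name, two-column layout, and Model/field names are placeholders for your actual schema:

import csv
from myapp.models import MyModel  # placeholder app and model names

with open('updates.csv') as f:
    for somedata, someotherdata in csv.reader(f):
        try:
            m = MyModel.objects.get(somecolumn=somedata)  # look up by stable data, not the PK
            m.someothervalue = someotherdata
            m.save()
        except MyModel.DoesNotExist:
            pass  # no matching row -- decide whether to insert or skip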
But why include django in this anyway? If you have a CSV table, then updating this info is really a matter of writing queries, and programs like Excel and OpenOffice make this very easy.
If you already have data in CSV format, just open the data as a spreadsheet and use Excel's/OpenOffice's CONCATENATE function to create update queries:
Update mytable set value1 = data1, value2 = data2 where somevalue = somedata;
If you used OpenOffice for this, then OpenOffice has this nifty Text to Columns function (under Data in the program menu), which turns concatenated values into strings. Then you can copy-paste those strings into a command prompt or phpPgAdmin and run them... and voilà, you get updated data in your database.
Edit (in response to your comment):
Look into this: https://docs.djangoproject.com/en/dev/ref/models/querysets/#get-or-create
If you want to use django for this, then use get_or_create. But you need to remember that if any of the lookup parameters you use in the get_or_create call have changed, then a new object will be created. That's why I said at the beginning of the post that you need some kind of data which will not change.
For example (taken from the link above):
obj, created = Person.objects.get_or_create(first_name='John', last_name='Lennon',
                                            defaults={'birthday': date(1940, 10, 9)})
will create a new obj (Person) the first time it is used. If it is used a 2nd time with the same first and last name, the existing Person is returned and the defaults are ignored; but if either name has changed, a new Person will be created.
So to avoid this, you'll still need to do something like
obj.someothervalue = someotherdata
obj.save()
if you want to have more control over the data that could have changed.
Alan.
