Is updating an array in an Elasticsearch document thread safe? - python

I have a document which has an array of strings as one of its properties.
There is a case where multiple update queries are executed on the index from two threads and can potentially update the same document.
For example given a document where the value of my_array is:
my_array = [1,2,3,4]
The two threads execute the following update query, which runs a Painless script on the index with different params:
thread_0 -> my_item=3
thread_1 -> my_item=2
int index_of_my_item = ctx._source.my_array.indexOf(params.my_item);
if (-1 != index_of_my_item) {
    ctx._source.my_array.remove(index_of_my_item);
}
Is the execution thread safe, and will the property value in the document be:
my_array = [1,4]
Or are there race conditions to consider?
Thanks,

Elasticsearch uses optimistic concurrency control, implemented with a document version number, so there is no concept of thread safety (from the Elasticsearch perspective).
Just go through these links and they will give you a fair idea of what an update actually means for Elasticsearch:
https://www.elastic.co/guide/en/elasticsearch/guide/master/version-control.html
https://www.elastic.co/guide/en/elasticsearch/guide/master/optimistic-concurrency-control.html
When updating a document with the index API, we read the original document, make our changes, and then reindex the whole document in one go. The most recent indexing request wins: whichever document was indexed last is the one stored in Elasticsearch. If somebody else had changed the document in the meantime, their changes would be lost.
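In practice this means the two concurrent script updates can conflict: each update reads, modifies, and reindexes the document, and the loser receives a version-conflict error rather than silently losing data. A minimal sketch of handling this from Python (assuming the elasticsearch-py client; the index and document id are hypothetical, and the script is the fixed one from the question):

from elasticsearch import Elasticsearch

es = Elasticsearch()

es.update(
    index="my-index",   # hypothetical index name
    id="doc-1",         # hypothetical document id
    body={
        "script": {
            "lang": "painless",
            "source": """
                int index_of_my_item = ctx._source.my_array.indexOf(params.my_item);
                if (-1 != index_of_my_item) {
                    ctx._source.my_array.remove(index_of_my_item);
                }
            """,
            "params": {"my_item": 3},
        }
    },
    retry_on_conflict=3,  # re-read and re-run the script on version conflict
)

With retry_on_conflict set, Elasticsearch retries the whole read-modify-reindex cycle on conflict, so both removals end up applied and my_array becomes [1,4].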


How do I not create a new document if the previous document in the collection contains a certain value?

I am using the Python library to programmatically create documents in a collection, as such:
user = client.query(q.create(q.collection("my_collection"), {
    "data": {
        "UTC_datetime": str(datetime.now(pytz.UTC)),
        "item_one": str(value_one),
        "item_two": str(value_two),
        "item_three": str(value_three)
    }
}))
Upon certain conditions being met, the Python app executes again.
If item_two has the same value again on the next execution, I do not want a new document to be created.
How do I craft the above query to perform this?
Currently, I am reading the previous document, extracting the value of item_two, and using an if/else statement to either store a new document or sys.exit().
I'm positive there is a more elegant solution based in Fauna's logic instead of Python's, but I have not been able to achieve it.
You can create a unique index (https://docs.fauna.com/fauna/current/api/fql/indexes) on item_two to ensure that duplicates are not possible. You may also want an upsert implementation:
https://forums.fauna.com/t/multi-document-upsert/488/3
https://forums.fauna.com/t/does-fauna-supports-upserts/208
q.If(
    q.Exists(q.Match(q.Index('unique_item_two'), str(value_two))),
    q.Update(...),
    q.Create(...)
)
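The snippet above uses the JavaScript driver's naming. A sketch adapted to the question's exact need, assuming the Python driver's snake_case naming (where If is exposed as if_expr) and a hypothetical unique index named unique_item_two with item_two as its term; when the value already exists we return the existing document instead of creating a duplicate:

from datetime import datetime

import pytz
from faunadb import query as q
from faunadb.client import FaunaClient

client = FaunaClient(secret="your-secret")  # hypothetical credentials

# Placeholders for the values computed in the question's app
value_one, value_two, value_three = "a", "b", "c"

result = client.query(
    q.if_expr(
        q.exists(q.match(q.index("unique_item_two"), str(value_two))),
        # Value already stored: return the existing document instead of
        # creating a duplicate (aborting the transaction is another option).
        q.get(q.match(q.index("unique_item_two"), str(value_two))),
        q.create(q.collection("my_collection"), {
            "data": {
                "UTC_datetime": str(datetime.now(pytz.UTC)),
                "item_one": str(value_one),
                "item_two": str(value_two),
                "item_three": str(value_three),
            }
        }),
    )
)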

Efficient way to get data from a Lotus Notes view

I am trying to get all the data from a view (Lotus Notes) with LotusScript and Python (noteslib module) and export it to CSV, but the problem is that this takes too much time. I have tried two ways of looping through all the documents:
import noteslib

db = noteslib.Database('database', 'file.nsf')
view = db.GetView('My View')
doc = view.GetFirstDocument()
data = list()
while doc:
    data.append(doc.ColumnValues)
    doc = view.GetNextDocument(doc)
Getting about 1,000 lines of data took me 70 seconds, but the view has about 85,000 lines, so getting all the data would take far too long; for comparison, manually using File -> Export in Lotus Notes takes about 2 minutes to export all the data to CSV.
And I tried a second way, with AllEntries, but it was even slower:
database = []
ec = view.AllEntries
ent = ec.GetFirstEntry()
while ent:
    row = []
    for v in ent.ColumnValues:
        row.append(v)
    database.append(row)
    ent = ec.GetNextEntry(ent)
Everything that I found on the Internet is based on "NextDocument" or "AllEntries". Is there any way to do it faster?
It is (or at least used to be) very expensive, from a time standpoint, to open a Notes document, as you are doing in your code.
Since you say that you want to export the data that is displayed in the view, you could use the NotesViewEntry class instead. It should be much faster.
Set col = view.AllEntries
Set entry = col.GetFirstEntry()
Do Until entry Is Nothing
    values = entry.ColumnValues '*** Array of column values
    '*** Do stuff here
    Set entry = col.GetNextEntry(entry)
Loop
I wrote a blog about this back in 2013:
http://blog.texasswede.com/which-is-faster-columnvalues-or-getitemvalue/
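Since the question is driven from Python, here is a rough translation of that entry-based loop, writing straight to CSV. This is only a sketch: it assumes noteslib exposes the same COM objects and property names used in the question, and the AutoUpdate setting is borrowed from the next answer below.

import csv

import noteslib

db = noteslib.Database('database', 'file.nsf')
view = db.GetView('My View')
view.AutoUpdate = False  # avoid view refreshes while reading (see next answer)

with open('export.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    col = view.AllEntries
    entry = col.GetFirstEntry()
    while entry:
        # ColumnValues reads the view row without opening the document
        writer.writerow(list(entry.ColumnValues))
        entry = col.GetNextEntry(entry)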
Something is going on with your code "outside" the view navigation: you already chose the most performant way to navigate a view, using "GetFirstDocument" and "GetNextDocument". Using the NotesViewNavigator as mentioned in the comments will be slightly better, but not significantly so.
You might get a little more performance out of your code by setting view.AutoUpdate = False to prevent the view object from refreshing when something in the backend changes. But as you only read data and do not change view data, that will not give you much of a performance boost.
My suggestion: identify the REAL bottleneck of your code by commenting out individual sections to find out where it starts to get slow:
First attempt:
while doc:
    doc = view.GetNextDocument(doc)
Slow?
If not then next attempt:
while doc:
    arr = doc.ColumnValues
    doc = view.GetNextDocument(doc)
Slow?
If yes: ColumnValues is your enemy...
If not then next attempt:
while doc:
    arr = doc.ColumnValues
    data.append(arr)
    doc = view.GetNextDocument(doc)
I would be very interested to get your results of where it starts to become slow.
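For the measurements, a simple timing wrapper around each attempt could look like this (a sketch using Python's time module; uncomment one extra line per run, cumulatively, as suggested above):

import time

start = time.time()
doc = view.GetFirstDocument()
while doc:
    # arr = doc.ColumnValues    # attempt 2: uncomment to measure ColumnValues
    # data.append(arr)          # attempt 3: additionally uncomment this line
    doc = view.GetNextDocument(doc)
print("elapsed: %.1f s" % (time.time() - start))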
I would suspect the performance issue lies in using COM/ActiveX from Python to access Notes databases. Transferring data via COM involves datatype "marshalling", possibly at every step, and especially for "out-of-process" method/property calls.
I don't think there is any way around this in COM. You should consider writing a Notes "agent" to do this for you instead (in LotusScript or Java, perhaps). Even a basic LotusScript agent can export thousands of docs per minute. A further alternative may be to look at the Notes C API (not an easy option, and it requires API calls from Python).

Neo4J / py2neo -- cursor-based query?

If I do something like this:
from py2neo import Graph
graph = Graph()
stuff = graph.cypher.execute("""
match (a:Article)-[p]-n return a, n, p.weight
""")
on a database with lots of articles and links, the query takes a long time and uses all my system's memory, presumably because it's copying the entire result set into memory in one go. Is there some kind of cursor-based version where I could iterate through the results one at a time without having to have them all in memory at once?
EDIT
I found the stream function:
stuff = graph.cypher.stream("""
match (a:Article)-[p]-n return a, n, p.weight
""")
which seems to be what I want according to the documentation but now I get a timeout error (py2neo.packages.httpstream.http.SocketError: timed out), followed by the server becoming unresponsive until I kill it using kill -9.
Have you tried implementing a paging mechanism? Perhaps with the skip keyword: http://neo4j.com/docs/stable/query-skip.html
Similar to using limit/offset in a Postgres/MySQL query.
EDIT: I previously said that the entire result set was stored in memory, but it appears this is not the case when using API streaming - per Nigel's (Neo engineer) comment below.
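A sketch of such paging with the py2neo API from the question (assuming the py2neo 2.x cypher.execute(statement, parameters) signature; ordering by the internal node id keeps pages stable between requests):

from py2neo import Graph

graph = Graph()
page_size = 1000
skip = 0
while True:
    page = graph.cypher.execute(
        "match (a:Article)-[p]-n "
        "return a, n, p.weight "
        "order by id(a) skip {skip} limit {limit}",
        {"skip": skip, "limit": page_size})
    if not page:
        break
    for record in page:
        pass  # process one record at a time; only one page is in memory
    skip += page_size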

py2neo: dependent batch insertion

I use py2neo (v 1.9.2) to write data to a neo4j db.
batch = neo4j.WriteBatch(graph_db)
current_relationship_index = graph_db.get_or_create_index(neo4j.Relationship, "Current_Relationship")
touched_relationship_index = graph_db.get_or_create_index(neo4j.Relationship, "Touched_Relationship")

get_rel = current_relationship_index.get(some_key1, some_value1)
if len(get_rel) == 1:
    batch.add_indexed_relationship(touched_relationship_index, some_key2, some_value2, get_rel[0])
elif len(get_rel) == 0:
    created_rel = current_relationship_index.create(some_key3, some_value3, (my_start_node, "KNOWS", my_end_node))
    batch.add_indexed_relationship(touched_relationship_index, some_key4, "touched", created_rel)

batch.submit()
Is there a way to replace current_relationship_index.get(..) and current_relationship_index.create(...) with batch commands? I know they exist, but the problem is that I need to act depending on the return values of these commands, and I would like to have all statements in one batch for performance.
I have read that it is rather uncommon to index relationships, but the reason I do it is the following: I need to parse some (text) file every day and then check whether any of the relations have changed compared with the previous day, i.e. if a relation no longer exists in the text file, I want to mark it with a "replaced" property in neo4j. Therefore, I add all "touched" relationships to the appropriate index, so I know that these did not change. All relations that are not in the touched_relationship_index obviously do not exist anymore, so I can mark them.
I can't think of an easier way to do this, even though I'm sure that py2neo offers one.
EDIT: Considering Nigel's comment, I tried this:
my_rel = batch.get_or_create_indexed_relationship(current_relationship_index, some_key, some_value, my_start_node, my_type, my_end_node)
batch.add_indexed_relationship(touched_relationship_index, some_key2, some_value2, my_rel)
batch.submit()
This obviously does not work, because I can't refer to "my_rel" within the batch. How can I solve this? Refer to the result of the previous batch statement with "0"? But consider that the whole thing is supposed to run in a loop, so the numbers are not fixed. Maybe use some variable "batch_counter" that refers to the current batch statement and is incremented whenever a statement is added to the batch?
Have a look at WriteBatch.get_or_create_indexed_relationship. That can conditionally create a relationship based on whether or not one currently exists, and it operates atomically. Documentation link below:
http://book.py2neo.org/en/latest/batches/#py2neo.neo4j.WriteBatch.get_or_create_indexed_relationship
There are a few similar uniqueness management facilities in py2neo that I recently blogged about, which you might want to read about.
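On the EDIT above: one pragmatic workaround is to split the work into two batches, letting the first submit() hand back the concrete relationship before the second batch indexes it. This is only a sketch, assuming the py2neo 1.x WriteBatch semantics used in the question, where submit() returns one result per batched request:

# First batch: atomically get or create the relationship.
batch = neo4j.WriteBatch(graph_db)
batch.get_or_create_indexed_relationship(
    current_relationship_index, some_key, some_value,
    my_start_node, my_type, my_end_node)
my_rel = batch.submit()[0]  # one result per batched request

# Second batch: mark it as touched for today's run.
batch = neo4j.WriteBatch(graph_db)
batch.add_indexed_relationship(
    touched_relationship_index, some_key2, "touched", my_rel)
batch.submit()

This costs two round trips per relationship instead of one, but each batch stays atomic and there is no need to track request indices inside a loop.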

NDB .order returns an empty result

I have two entities in my database which are connected. We'll call them A and B. I have an instance of A in memory (we'll call it a), and the following query currently works:
B.query(B.parent == a.key).fetch(limit=None)
But the following code returns an empty set, even in dev mode with indexes being created automatically:
B.query(B.parent == a.key).order(B.foo, B.bar).fetch(limit=None)
I've tried every combination I can think of, and I'm completely stumped.
It turns out the fields in question had been defined as TextProperty by a previous dev; TextProperty fields are not indexed, and thus cannot be used for ordering or filtering.
This is what you want:
B.query(ancestor=a.key)
I don't believe any of the snippets you posted will even work.
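For illustration, a minimal NDB sketch of the fix (property names are assumed from the question; switching the model fields from TextProperty to StringProperty makes them indexed and therefore orderable):

from google.appengine.ext import ndb

class B(ndb.Model):
    # TextProperty values are stored unindexed, so a query that orders
    # (or filters) on them silently returns no results.
    # StringProperty (up to 1500 bytes) is indexed and sortable.
    foo = ndb.StringProperty()
    bar = ndb.StringProperty()

# With indexed properties, the original ordered query behaves as expected:
# B.query(B.parent == a.key).order(B.foo, B.bar).fetch(limit=None)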
