I am trying to find a workaround to the following problem. I have seen it quasi-described in this SO question, yet not really answered.
The following code fails, starting with a fresh graph:
from py2neo import neo4j

def add_test_nodes():
    # Add a test node manually
    alice = g.get_or_create_indexed_node("Users", "user_id", 12345, {"user_id":12345})

def do_batch(graph):
    # Begin batch write transaction
    batch = neo4j.WriteBatch(graph)
    # get some updated node properties to add
    new_node_data = {"user_id":12345, "name": "Alice"}
    # batch requests
    a = batch.get_or_create_in_index(neo4j.Node, "Users", "user_id", 12345, {})
    batch.set_properties(a, new_node_data) #<-- I'm the problem
    # execute batch requests and clear
    batch.run()
    batch.clear()

if __name__ == '__main__':
    # Initialize Graph DB service and create a Users node index
    g = neo4j.GraphDatabaseService()
    users_idx = g.get_or_create_index(neo4j.Node, "Users")
    # run the test functions
    add_test_nodes()
    alice = g.get_or_create_indexed_node("Users", "user_id", 12345)
    print alice
    do_batch(g)
    # get alice back and assert additional properties were added
    alice = g.get_or_create_indexed_node("Users", "user_id", 12345)
    assert "name" in alice
In short, I wish, in one batch transaction, to update existing indexed node properties. The failure occurs at the batch.set_properties line, because the BatchRequest object returned by the previous line is not being interpreted as a valid node. Though not entirely identical, it feels like I am attempting something like the answer posted here
Some specifics
>>> import py2neo
>>> py2neo.__version__
'1.6.0'
>>> g = py2neo.neo4j.GraphDatabaseService()
>>> g.neo4j_version
(2, 0, 0, u'M06')
Update
If I split the problem into separate batches, then it can run without error:
def do_batch(graph):
    # Begin batch write transaction
    batch = neo4j.WriteBatch(graph)
    # get some updated node properties to add
    new_node_data = {"user_id":12345, "name": "Alice"}
    # batch request 1
    batch.get_or_create_in_index(neo4j.Node, "Users", "user_id", 12345, {})
    # execute batch request and clear
    alice = batch.submit()
    batch.clear()
    # batch request 2
    batch.set_properties(alice, new_node_data)
    # execute batch request and clear
    batch.run()
    batch.clear()
This works for many nodes as well. Though I do not love the idea of splitting the batch up, this might be the only way at the moment. Anyone have some comments on this?
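For many nodes, the same split looks roughly like this (update_users and user_data are just illustrative names, not part of the code above):

from py2neo import neo4j

def update_users(graph, user_data):
    """user_data maps user_id -> dict of properties to set (illustrative)."""
    ids = list(user_data)
    # Batch 1: get or create every node through the legacy index
    batch = neo4j.WriteBatch(graph)
    for user_id in ids:
        batch.get_or_create_in_index(neo4j.Node, "Users", "user_id", user_id, {})
    nodes = batch.submit()
    batch.clear()
    # Batch 2: set properties on the concrete nodes returned by batch 1
    for node, user_id in zip(nodes, ids):
        batch.set_properties(node, user_data[user_id])
    batch.run()
    batch.clear()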
After reading up on all the new features of Neo4j 2.0.0-M06, it seems that the older workflow of node and relationship indexes is being superseded. There is presently a bit of a divergence on Neo4j's part in the way indexing is done, namely labels and schema indexes.
Labels
Labels can be arbitrarily attached to nodes and can serve as a reference for an index.
Indexes
Indexes can be created in Cypher by referencing a label (here, User) and a node property key (here, screen_name):
CREATE INDEX ON :User(screen_name)
Cypher MERGE
Furthermore, the indexed get_or_create methods are now possible via the new Cypher MERGE clause, which incorporates labels and their indexes quite succinctly:
MERGE (me:User{screen_name:"SunPowered"}) RETURN me
Batch
Queries of this sort can be batched in py2neo by appending a CypherQuery instance to the batch object:
from py2neo import neo4j

graph_db = neo4j.GraphDatabaseService()
cypher_merge_user = neo4j.CypherQuery(graph_db,
    "MERGE (user:User {screen_name:{name}}) RETURN user")

def get_or_create_user(screen_name):
    """Return the user if it exists, create one if not"""
    return cypher_merge_user.execute_one(name=screen_name)

def get_or_create_users(screen_names):
    """Apply the get or create user cypher query to many usernames in a
    batch transaction"""
    batch = neo4j.WriteBatch(graph_db)
    for screen_name in screen_names:
        batch.append_cypher(cypher_merge_user, params=dict(name=screen_name))
    return batch.submit()

root = get_or_create_user("Root")
users = get_or_create_users(["alice", "bob", "charlie"])
Limitation
There is a limitation, however, in that the results from a Cypher query in a batch transaction cannot be referenced later in the same transaction. The original question was about updating a collection of indexed user properties in one batch transaction. This is still not possible, as far as I can tell. For example, the following snippet throws an error:
batch = neo4j.WriteBatch(graph_db)
b1 = batch.append_cypher(cypher_merge_user, params=dict(name="Alice"))
batch.set_properties(b1, dict(last_name="Smith"))
resp = batch.submit()
So, it seems that although there is a bit less overhead in implementing the get_or_create over a labelled node using py2neo because the legacy indexes are no longer necessary, the original question still needs 2 separate batch transactions to complete.
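That said, one possible way to sidestep the reference limitation is to fold the property update into the Cypher statement itself, so that nothing later in the batch needs to reference the merged node. This is only a sketch, assuming the milestone build accepts a SET clause after MERGE; I have not verified it against every 2.0 milestone:

cypher_merge_set = neo4j.CypherQuery(graph_db,
    "MERGE (user:User {screen_name:{name}}) "
    "SET user.last_name = {last_name} "
    "RETURN user")

batch = neo4j.WriteBatch(graph_db)
batch.append_cypher(cypher_merge_set,
                    params=dict(name="Alice", last_name="Smith"))
resp = batch.submit()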
Your problem seems not to be in batch.set_properties() but rather in the output of batch.get_or_create_in_index(). If you add the node with batch.create(), it works:
db = neo4j.GraphDatabaseService()
batch = neo4j.WriteBatch(db)
# create a node instead of getting it from index
test_node = batch.create({'key': 'value'})
# set new properties on the node
batch.set_properties(test_node, {'key': 'foo'})
batch.submit()
If you have a look at the properties of the BatchRequest objects returned by batch.create() and batch.get_or_create_in_index(), there is a difference in the URI, because the methods use different parts of the Neo4j REST API:
test_node = batch.create({'key': 'value'})
print test_node.uri # node
print test_node.body # {'key': 'value'}
print test_node.method # POST
index_node = batch.get_or_create_in_index(neo4j.Node, "Users", "user_id", 12345, {})
print index_node.uri # index/node/Users?uniqueness=get_or_create
print index_node.body # {u'value': 12345, u'key': 'user_id', u'properties': {}}
print index_node.method # POST
batch.submit()
So I guess batch.set_properties() somehow can't handle the URI of the indexed node? I.e. it doesn't really get the correct URI for the node?
Doesn't solve the problem, but could be a pointer for somebody else ;) ?
For a university project I am using Neo4j together with Flask and py2neo for a shift scheduling algorithm. On saving the scheduled shifts to Neo4j I realized that relationships go missing: out of 330, only 91 get inserted.
When printing them before/after inserting, they are all in the list to be inserted, and I also moved the transaction around to check whether this changes the result.
I have the following structure:
(w:Worker)-[r:works_during]->(s:Shift) with
r.day, r.month and r.year set as relationship properties, and multiple connections between each worker and each shift, which can then be filtered via those properties.
My code looks like the following:
header = df.columns.tolist()
header.remove("index")
header.remove("worker")
tuplelist = []

for index, row in df.iterrows():
    for i in header:
        worker = self.driver.nodes.match("Worker", id=int(row["worker"])).first()
        if row[i] == 1:
            # Shifts are in the format {day}_{shift_of_day}
            shift_id = str(i).split("_")[1]
            shift_day = str(i).split("_")[0]
            shift = self.driver.nodes.match("Shift", id=int(shift_id)).first()
            rel = Relationship(worker, "works_during", shift)
            rel["day"] = int(shift_day)
            rel["month"] = int(month)
            rel["year"] = int(year)
            tuplelist.append(rel)

print(len(tuplelist))

for i in tuplelist:
    connection = self.driver.begin()
    connection.create(i)
    connection.commit()
Is there any special behaviour in py2neo which I need to be aware of that could cause this issue?
py2neo allows just one relationship of a given type between node A and node B.
If multiple relationships of the same type (even with different attributes) are needed, it is necessary to use plain Cypher querying (sketched below), as py2neo will merge these edges into a single edge.
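As a rough sketch of what the plain Cypher insert might look like with py2neo's run() (assuming py2neo v4+, and that your Worker and Shift nodes carry an id property as in your match calls):

from py2neo import Graph

graph = Graph()  # connection details omitted

def create_shift_relation(worker_id, shift_id, day, month, year):
    # CREATE (not MERGE) keeps parallel works_during edges distinct
    graph.run(
        "MATCH (w:Worker {id: $wid}), (s:Shift {id: $sid}) "
        "CREATE (w)-[:works_during {day: $day, month: $month, year: $year}]->(s)",
        wid=worker_id, sid=shift_id, day=day, month=month, year=year,
    )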
After having read the following question and various answers ("Least Astonishment" and the Mutable Default Argument), as well as the official documentation (https://docs.python.org/3/tutorial/controlflow.html#default-argument-values), I've written my ResultsClass so that each instance of it has a separate list without affecting the defaults (at least, this is what should be happening based on my newly gained understanding):
class ResultsClass:
    def __init__(self,
                 project = None,
                 badpolicynames = None,
                 nonconformpolicydisks = None,
                 diskswithoutpolicies = None,
                 dailydifferences = None,
                 weeklydifferences = None):
        self.project = project
        if badpolicynames is None:
            self.badpolicynames = []
        if nonconformpolicydisks is None:
            self.nonconformpolicydisks = []
        if diskswithoutpolicies is None:
            self.diskswithoutpolicies = []
        if dailydifferences is None:
            self.dailydifferences = []
        if weeklydifferences is None:
            self.weeklydifferences = []
By itself, this works as expected:
i = 0
for result in results:
    result.diskswithoutpolicies.append("count is " + str(i))
    print(result.diskswithoutpolicies)
    i = i + 1
['count is 0']
['count is 1']
['count is 2']
['count is 3']
etc.
The context of this script is that I'm trying to obtain information from each project within our Google Cloud infrastructure; predominantly in this instance, a list of disks with a snapshot schedule associated with them, a list of the scheduled snapshots of each disk within the last 24 hours, those that have bad schedule names that do not fit our naming convention, and the disks that do not have any snapshot schedules associated with them at all.
Within the full script, I use this exact same ResultsClass; yet when used within multiple for loops, the append again seems to be adding to the default values, and in all honesty I don't understand why.
The shortened version of the code is as follows:
# Code to obtain a list of projects
results = [ResultsClass() for i in range(len(projects))]
for result in results:
    for project in projects:
        result.project = project
        # Code to obtain each zone in the project
        for zone in zones:
            # Code to get each disk in zone
            for disk in disks:
                resourcepolicy = disk.get('resourcePolicies')
                if resourcepolicy:
                    # Code to action if a resource policy exists
                    else:
                        result.badpolicynames.append(resourcepolicy[0].split('/')[-1])
                        result.nonconformpolicydisks.append(disk['id'])
                else:
                    result.diskswithoutpolicies.append(disk['id'])
        pprint(vars(result))
This then comes back with the results:
{'badpolicynames': [],
'dailydifferences': None,
'diskswithoutpolicies': ['**1098762112354315432**'],
'nonconformpolicydisks': [],
'project': '**project0**',
'weeklydifferences': None}
{'badpolicynames': [],
'dailydifferences': None,
'diskswithoutpolicies': ['**1098762112354315432**', '**1031876156872354739**'],
'nonconformpolicydisks': [],
'project': '**project1**',
'weeklydifferences': None}
Does a for loop (or multiple for loops) somehow negate the separate lists created within the ResultsClass? I need to understand why this is happening within Python and then how I can correct it.
Based on my best understanding, one of the glaring problems is that you're nesting the results and projects loops together, whereas you should be looping over only one of them. I'd suggest looping over the projects and creating a result for each, instead of instantiating the classes in a list beforehand.
results = []
for project in projects:
    result = ResultsClass(project)
    # Code to obtain each zone in the project
    for zone in zones:
        # Code to get each disk in zone
        for disk in disks:
            resourcepolicy = disk.get('resourcePolicies')
            if resourcepolicy:
                # Code to action if a resource policy exists
                else:
                    result.badpolicynames.append(resourcepolicy[0].split('/')[-1])
                    result.nonconformpolicydisks.append(disk['id'])
            else:
                result.diskswithoutpolicies.append(disk['id'])
    results.append(result)
    pprint(vars(result))
With that, results is the list of your ResultsClass instances, and each result contains only one project, whereas your previous attempt would end with every ResultsClass holding the same, last project.
I'm not sure if I understand what you are trying to achieve correctly, but you are trying to transfer data from each project to a single result, right?
If so, you might want to use zip to have a single result per project:
for result, project in zip(results, projects):
    # rest of the code
Otherwise you are overwriting the result of each previous project in the next loop iteration.
Another option would be to create the result in the loop:
results = []
for project in projects:
    result = ResultsClass()
    # ... your fetching code ...
    results.append(result)
I have what is essentially a table which is a pool of available codes/sequences for unique keys when I create records elsewhere in the DB.
Right now I run a transaction where I might grab 5000 codes out of an available pool of 1 billion codes using the slice operator [:code_count] where code_count == 5000.
This works fine, but then for every insert, I have to run through each code and insert it into the record manually when I use the code.
Is there a better way?
Example code (omitting other attributes for each new_item that are similar to all new_items):
code_count = 5000
pool_cds = CodePool.objects.filter(free_indicator=True)[:code_count]

for pool_cd in pool_cds:
    new_item = Item.objects.create(
        pool_cd=pool_cd.unique_code,
    )
    new_item.save()

cursor = connection.cursor()
update_sql = 'update CodePool set free_ind=%s where pool_cd.id in %s'
instance_param = ()
# Create ridiculously long list of params (5000 items)
for pool_cd in pool_cds:
    instance_param = instance_param + (pool_cd.id,)
params = [False, instance_param]
rows = cursor.execute(update_sql, params)
As I understand how it works:
code_count = 5000
pool_cds = CodePool.objects.filter(free_indicator=True)[:code_count]
ids = []
for pool_cd in pool_cds:
    Item.objects.create(pool_cd=pool_cd.unique_code)
    ids += [pool_cd.id]
CodePool.objects.filter(id__in=ids).update(free_ind=False)
By the way, if you create an object using the queryset method create, you don't need to call the save method. See the docs.
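If the per-row create() calls are also a bottleneck, a further option is bulk_create; this is only a sketch, assuming a Django version that has it and that Item does not rely on per-save signals:

code_count = 5000
pool_cds = list(CodePool.objects.filter(free_indicator=True)[:code_count])

# One INSERT for all items, one UPDATE for all codes
Item.objects.bulk_create(
    [Item(pool_cd=pool_cd.unique_code) for pool_cd in pool_cds]
)
CodePool.objects.filter(
    id__in=[pool_cd.id for pool_cd in pool_cds]
).update(free_ind=False)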
I use MongoDB to store compressed HTML files.
Basically, a complete document in MongoDB looks like:
{'_id': 1, 'p1': data, 'p2': data2, 'p3': data3}
where data, data2 and data3 are bson.binary.Binary(zlib_compressed_html).
I have 12 million ids, and each dataX is on average 90KB,
so each document has a size of at least 180KB + sizeof(_id) + some overhead.
The total data size would be at least 2TB.
I would like to note that '_id' is indexed.
I insert into Mongo in the following way:
def _save(self, mongo_col, my_id, page, html):
    doc = mongo_col.find_one({'_id': my_id})
    key = 'p%d' % page
    success = False
    if doc is None:
        doc = {'_id': my_id, key: html}
        try:
            mongo_col.save(doc, safe=True)
            success = True
        except:
            log.exception('Exception saving to mongodb')
    else:
        try:
            mongo_col.update({'_id': my_id}, {'$set': {key: html}})
            success = True
        except:
            log.exception('Exception updating mongodb')
    return success
As you can see, I first look up the collection to see if a document with my_id exists.
If it does not exist, I create it and save it to Mongo, otherwise I update it.
The problem with the above is that although it was super fast, at some point it became really slow.
I will give you some numbers:
When it was fast I was doing 1,500,000 documents per 4 hours, and afterwards 300,000 per 4 hours.
I suspect that this affects the speed:
Note
When performing update operations that increase the document size beyond the allocated space for that document, the update operation relocates the document on disk and may reorder the document fields depending on the type of update.
As of these driver versions, all write operations will issue a getLastError command to confirm the result of the write operation:
{ getLastError: 1 }
Refer to the documentation on write concern in the Write Operations document for more information.
The above is from: http://docs.mongodb.org/manual/applications/update/
I am saying that because we could have the following:
{'_id: 1, 'p1': some_data}, ...., {'_id': 10000000, 'p2': some_data2}, ...{'_id': N, 'p1': sd3}
and imagine that I am calling the above _save method as:
_save(my_collection, 1, 2, bin_compressed_html)
Then it should update the doc with _id 1. But if what the MongoDB site says is the case,
then because I am adding a key to the document, it no longer fits in its allocated space and the document has to be relocated.
It is possible for the document to be moved to the end of the collection, which could be very far away on disk. Could this slow things down?
Or does the slowdown have to do with the size of the collection?
In any case, do you think it would be more efficient to modify my structure to be like:
{'_id': ObjectId, 'mid': 1, 'p': 1, 'd': html}
where mid=my_id, p=page, d=compressed html
and modify the _save method to do only inserts?
def _save(self, mongo_col, my_id, page, html):
    doc = {'mid': my_id, 'p': page, 'd': html}
    success = False
    try:
        mongo_col.save(doc, safe=True)
        success = True
    except:
        log.exception('Exception saving to mongodb')
    return success
This way I avoid the update (and thus the relocation on disk) and one lookup (find_one),
but there would be 3x more documents and I would have 2 indexes (_id and mid).
What do you suggest?
Document relocation could be an issue if you continue to add pages of HTML as new attributes. Would it really be an issue to move the pages to a separate collection, where you could simply add them as one record each? Also, I don't really think MongoDB is a good fit for your use case; e.g. Redis would be much more efficient.
Another thing you should take care of is to have enough RAM for your _id index. Use db.mongocol.stats() to check the index size.
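For example, a quick way to check that from pymongo (database and collection names here are illustrative):

from pymongo import MongoClient

client = MongoClient()
db = client.mydb  # illustrative database name

stats = db.command("collstats", "mongo_col")  # illustrative collection name
print stats["totalIndexSize"]  # total bytes used by all indexes
print stats["indexSizes"]      # per-index breakdown, including _id_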
When inserting new documents into MongoDB, a document can grow up to a certain point without being moved, because the DB analyzes the incoming data and adds padding to the document.
So to deal with fewer document movements you can do two things:
- manually tweak the padding factor
- preallocate space (attributes) for each document (see the sketch below)
See the article about padding or the MongoDB docs for more information about the padding factor.
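A rough sketch of the preallocation idea, assuming all three pages will eventually be filled and using your ~90KB average as a guessed placeholder size (field names follow your schema):

import bson

def preallocate(mongo_col, my_id, pages=3, approx_size=90 * 1024):
    # Insert the document once with placeholder binaries roughly the size
    # of the real payloads, so that later $set updates fit in place.
    placeholder = bson.binary.Binary('\x00' * approx_size)
    doc = {'_id': my_id}
    for page in range(1, pages + 1):
        doc['p%d' % page] = placeholder
    mongo_col.insert(doc)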
Btw, instead of using save for creating new documents, you should use .insert(), which will throw a duplicate key error if the _id is already there (.save() will overwrite your document).
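A minimal sketch of how that could look in the original _save, dropping the find_one round trip (pymongo 2.x style insert/update to match the code in the question; the logging setup is illustrative):

import logging
from pymongo.errors import DuplicateKeyError

log = logging.getLogger(__name__)

def _save(self, mongo_col, my_id, page, html):
    key = 'p%d' % page
    try:
        # insert() refuses to overwrite an existing _id, unlike save()
        mongo_col.insert({'_id': my_id, key: html})
        return True
    except DuplicateKeyError:
        # The document already exists, so just add/overwrite this page
        try:
            mongo_col.update({'_id': my_id}, {'$set': {key: html}})
            return True
        except Exception:
            log.exception('Exception updating mongodb')
            return False
    except Exception:
        log.exception('Exception saving to mongodb')
        return False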
Looking for a simple example of retrieving 500 items from dynamodb minimizing the number of queries. I know there's a "multiget" function that would let me break this up into chunks of 50 queries, but not sure how to do this.
I'm starting with a list of 500 keys. I'm then thinking of writing a function that takes this list of keys, breaks it up into "chunks," retrieves the values, stitches them back together, and returns a dict of 500 key-value pairs.
Or is there a better way to do this?
As a corollary, how would I "sort" the items afterwards?
Depending on your schema, there are 2 ways of efficiently retrieving your 500 items.
1. Items are under the same hash_key, using a range_key
- use the query method with the hash_key
- you may ask to sort the range_keys A-Z or Z-A
2. Items are on "random" keys
- you said it: use the BatchGetItem method
- good news: the limit is actually 100/request or 1MB max
- you will have to sort the results on the Python side
On the practical side, since you use Python, I highly recommend the Boto library for low-level access or the dynamodb-mapper library for higher-level access (Disclaimer: I am one of the core devs of dynamodb-mapper).
Sadly, neither of these libraries provides an easy way to wrap the batch_get operation. On the contrary, there is a generator for scan and for query which 'pretends' you get everything in a single query.
In order to get optimal results with the batch query, I recommend this workflow:
1. submit a batch with all of your 500 items
2. store the results in your dicts
3. re-submit with the UnprocessedKeys as many times as needed
4. sort the results on the Python side
Quick example
I assume you have created a table "MyTable" with a single hash_key
import boto

# Helper function. This is more or less the code
# I added to the develop branch
def resubmit(batch, prev):
    # Empty (re-use) the batch
    del batch[:]
    # The batch answer contains the list of
    # unprocessed keys grouped by tables
    if 'UnprocessedKeys' in prev:
        unprocessed = prev['UnprocessedKeys']
    else:
        return None
    # Load the unprocessed keys
    for table_name, table_req in unprocessed.iteritems():
        table_keys = table_req['Keys']
        table = batch.layer2.get_table(table_name)
        keys = []
        for key in table_keys:
            h = key['HashKeyElement']
            r = None
            if 'RangeKeyElement' in key:
                r = key['RangeKeyElement']
            keys.append((h, r))
        attributes_to_get = None
        if 'AttributesToGet' in table_req:
            attributes_to_get = table_req['AttributesToGet']
        batch.add_batch(table, keys, attributes_to_get=attributes_to_get)
    return batch.submit()

# Main
db = boto.connect_dynamodb()
table = db.get_table('MyTable')
batch = db.new_batch_list()

keys = range(100)  # Get items from 0 to 99
batch.add_batch(table, keys)

res = batch.submit()
while res:
    print res  # Do some useful work here
    res = resubmit(batch, res)
# The END
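To make the last workflow step concrete, here is a sketch of what the "do some useful work" placeholder could become: accumulate the items from each response and sort them client-side. It assumes the raw BatchGetItem response layout with a Responses key, and the sort attribute name is hypothetical:

items = []
res = batch.submit()
while res:
    # Each response groups the returned items by table name
    for table_name, table_res in res.get('Responses', {}).iteritems():
        items.extend(table_res['Items'])
    res = resubmit(batch, res)

# Final step of the workflow: sort on the Python side
items.sort(key=lambda item: item['my_sort_attribute'])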
EDIT:
I've added a resubmit() function to BatchList in the Boto develop branch. It greatly simplifies the workflow:
1. add all of your requested keys to BatchList
2. submit()
3. resubmit() as long as it does not return None
This should be available in the next release.