neo4j CYPHER - at ON MATCH SET create new nodes on condition - python

To import XML data into a neo4j DB I first parse the XML to a python dictionary and then use CYPHER queries:
WITH $pubmed_dict as pubmed_article
UNWIND pubmed_article as particle
...
FOREACH (author IN particle.MedlineCitation.Article.AuthorList.Author |
MERGE (a:Author {last_name: COALESCE(author.LastName, 'LAST NAME MISSING!')})
ON CREATE SET a.first_name = author.ForeName, a.affiliation = author.AffiliationInfo.Affiliation
ON MATCH SET a.first_name = author.ForeName, a.affiliation = author.AffiliationInfo.Affiliation
MERGE (p)<-[:WROTE]-(a)
)
Unfortunately, Authors don't have unique IDs in the database, so it might be that different authors have the same last names but different initials or affiliations.
...
<Author ValidYN="Y">
<LastName>Smith</LastName>
<ForeName>A L</ForeName>
<Initials>AL</Initials>
<AffiliationInfo>
<Affiliation>University X</Affiliation>
</AffiliationInfo>
</Author>
...
<Author ValidYN="Y">
<LastName>Smith</LastName>
<ForeName>A L</ForeName>
<Initials>AL</Initials>
<AffiliationInfo>
<Affiliation>University BUMBABU</Affiliation>
</AffiliationInfo>
</Author>
My intention was to MERGE on author.LastName but ON MATCH check if the author has the same ForeName OR the same Affiliation and if not create a new node instead.
How would I do that using CYPHER queries?
EDIT 1
Node Key constraints are the solution, which is an Enterprise Edition feature, though. Looking for a workaround for that.
EDIT 2
This code is working almost perfectly:
WITH $pubmed_dict as pubmed_article
UNWIND pubmed_article as particle
MERGE (p:Publication {pmid: particle.MedlineCitation.PMID.text})
ON CREATE SET p.title = COALESCE (particle.MedlineCitation.Article.Journal.Title, particle.MedlineCitation.Article.ArticleTitle)
ON MATCH SET p.title = COALESCE (particle.MedlineCitation.Article.Journal.Title, particle.MedlineCitation.Article.ArticleTitle)
FOREACH (author IN particle.MedlineCitation.Article.AuthorList.Author |
MERGE (a:Author {last_name: COALESCE(author.LastName, 'LAST NAME MISSING!'), first_name: COALESCE(author.ForeName, 'FIRST NAME MISSING!')})
MERGE (p)<-[:WROTE]-(a)
)
To sum it up:
For every author I want to create a new node IF LastName OR ForeName OR Affiliation differs. I also need NEW nodes for authors where LAST NAME MISSING! and FIRST NAME MISSING!
Is it possible to achieve this result WITHOUT Node Key constraints? (because this is an Enterprise Edition feature...)

The authors do have a unique ID in Neo4j, the node ID. That can be used to identify the node and then set the properties. Maybe something like this:
MATCH (a:Author {last_name: 'xxx', first_name: 'yyy'})
WITH a, id(a) AS ID
WHERE ID > -1
MATCH (b) WHERE id(b) = ID SET b.first_name = author.ForeName, b.affiliation = author.AffiliationInfo.Affiliation
The node's ID is not necessarily stable or predictable, so you have to access it directly before using it.
Because you are using Python code, you might do better with a global query to pull down the author node data:
MATCH (a:Author {last_name: 'xxx', first_name: 'yyy'}) RETURN a.last_name, a.first_name, id(a) AS ID
Then you can write a CSV file to bulk upload the desired info. The CSV could look like this:
"ID","ForeName","LastName","Affiliation"
"26","David","Smith","Johns Hopkins"
etc.
The python code could do the filtering of nodes that do not need processing.
Then load the file:
LOAD CSV WITH HEADERS FROM 'file:///xxx.csv' AS line
MATCH (a) WHERE id(a) = toInteger(line.ID)
SET a.affiliation = toString(line.Affiliation)
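For the CSV step described above, here is a minimal Python sketch using only the standard library. The function name authors_to_csv and the tuple layout are assumptions made for illustration. The rows would come from the result of the global query.

```python
import csv
import io

def authors_to_csv(rows):
    """Serialise (node_id, fore_name, last_name, affiliation) tuples to CSV text."""
    buffer = io.StringIO()
    # QUOTE_ALL matches the fully quoted layout shown in the example file.
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["ID", "ForeName", "LastName", "Affiliation"])
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()

# Hypothetical row, taken from the sample line above.
csv_text = authors_to_csv([(26, "David", "Smith", "Johns Hopkins")])
```

Writing csv_text to a file in the Neo4j import directory would make it available to LOAD CSV.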

You can use constraints, then neo4j will check uniqueness for you.
From documentation:
To create a Node Key ensuring that all nodes with a particular label have a set of defined properties whose combined value is unique, and where all properties in the set are present
CREATE CONSTRAINT ON (author:Author) ASSERT (author.first_name, author.last_name, author.affiliation) IS NODE KEY
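Without Enterprise Edition, one possible workaround is to MERGE on all three properties and make sure none of them is null, since MERGE refuses null property values. The sketch below does that preprocessing in Python. It is only a sketch under assumptions: normalize_author, MERGE_AUTHORS and the missing_token property are hypothetical names, AffiliationInfo is assumed to be a single dict as in the sample XML, and an author counts as "missing" when either name is absent.

```python
import uuid

MISSING_LAST = "LAST NAME MISSING!"
MISSING_FIRST = "FIRST NAME MISSING!"

# Hypothetical query: merging on every property means a difference in any
# of them produces a new node.
MERGE_AUTHORS = """
UNWIND $authors AS author
MERGE (a:Author {last_name: author.last_name,
                 first_name: author.first_name,
                 affiliation: author.affiliation,
                 missing_token: author.missing_token})
"""

def normalize_author(author):
    """Return a flat dict with no None values, safe to use as MERGE properties."""
    affiliation_info = author.get("AffiliationInfo") or {}
    last = author.get("LastName")
    first = author.get("ForeName")
    # Assumption: a random token keeps nameless authors from merging together.
    token = "" if (last and first) else uuid.uuid4().hex
    return {
        "last_name": last or MISSING_LAST,
        "first_name": first or MISSING_FIRST,
        "affiliation": affiliation_info.get("Affiliation") or "",
        "missing_token": token,
    }
```

The normalized list would then be passed as the $authors parameter. The two sample Smith entries differ in affiliation, so they would end up as separate nodes.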

Related

Runtime Foreign Key vs Integerfield

I have a problem. I already have two solutions for it, but I was wondering which of them is faster.
I guess that the second solution is not only more convenient to use but also faster, but I want to be sure, so that's why I'm asking.
My problem is that I want to group multiple rows together. The group won't hold any metadata, so I'm only interested in runtime.
On the one hand, I can use an integer field and filter on it later when I need all entries that belong to the group. I guess that has a runtime of O(n).
class SingleEntries(models.Model):
name = models.CharField(max_length=20)
group = models.IntegerField(null=True)
def find_all_group_members(id):
return SingleEntries.objects.filter(group=id)
The second solution, and probably the more practical one, would be to create a foreign key to another model, using only the pk there.
Then I can use the reverse relation to find all the entries that belong to the group.
class Group(models.Model):
id = models.AutoField(primary_key=True)
class SingleEntries(models.Model):
name = models.CharField(max_length=20)
group = models.ForeignKey(Group,on_delete=models.CASCADE,null=True)
def find_all_group_members(id):
return Group.objects.get(id=id).singleentries_set.all()
The first is more efficient, since this will use one query, whereas the latter will first fetch the Group, and then another one for the SingleEntries.
Indeed, if you work with:
SingleEntries.objects.filter(group=id)
this will make a simple query:
SELECT appname_singleentries.*
FROM appname_singleentries
WHERE appname_singleentries.group_id = id
It thus does not first fetch the Group into memory.
The latter will however make two queries. Indeed, it will first make a query to retrieve the Group, and then it will make a query like the one above to fetch the SingleEntries.
The two are also semantically not entirely the same: if there is no such group, then the former will return an empty QuerySet, whereas the latter will raise a Group.DoesNotExist exception.
But you can model this with:
class Group(models.Model):
pass
class SingleEntries(models.Model):
name = models.CharField(max_length=20)
group = models.ForeignKey(Group,on_delete=models.CASCADE,null=True)
def find_all_group_members(id):
return SingleEntries.objects.filter(group_id=id)
So you can use a Group model without having to retrieve the Group first.
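To see the single-query behaviour without a Django project, here is a rough stand-in using sqlite3 from the standard library. The table layout mirrors the SQL shown above; the sample rows are made up for illustration, and this is not what Django emits verbatim.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE appname_singleentries ("
    "id INTEGER PRIMARY KEY, name TEXT, group_id INTEGER)"
)
# Made-up rows: two entries in group 1, one in group 2, one without a group.
conn.executemany(
    "INSERT INTO appname_singleentries (name, group_id) VALUES (?, ?)",
    [("a", 1), ("b", 1), ("c", 2), ("d", None)],
)

def find_all_group_members(group_id):
    """One SELECT on the entries table; the group row is never fetched."""
    cursor = conn.execute(
        "SELECT name FROM appname_singleentries WHERE group_id = ? ORDER BY id",
        (group_id,),
    )
    return [name for (name,) in cursor]
```

Asking for a group that does not exist simply yields an empty list, which matches the empty QuerySet behaviour described above.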
If the groups are static in nature, that means if you don't see more groups coming to your system, you can use choices in Django.
Define choices as below
class GroupType(models.IntegerChoices):
GROUP_0 = 0, "Group 0 name"
GROUP_1 = 1, "Group 1 name"
GROUP_2 = 2, "Group 2 name"
And use it as choices field in the SingleEntries model as below
class SingleEntries(models.Model):
name = models.CharField(max_length=20)
group = models.IntegerField(choices=GroupType.choices, default=<set default here>)
If the groups are dynamic, meaning users can create groups whenever they want, in that case, go with your second approach of having another model for group.

Updating a node with merge using Py2Neo

I'm trying to merge and then update a graph using the py2neo library. My code looks roughly like
from py2neo import Graph, Node, Relationship
graph = Graph(host, auth=(user, password,))
tx = graph.begin()
alice = Node("Person", name="Alice")
bob = Node("Person", name="Bob")
KNOWS = Relationship(alice, "KNOWS", bob)
tx.create(KNOWS)
graph.commit(tx)
This creates the nodes and edges as expected as
(:Person {name: "Alice"})-[:KNOWS]->(:Person {name: "Bob"})
If I try to modify alice in a new transaction though, I get no change
e.g.
new_tx = graph.begin()
alice["age"] = 32
new_tx.merge(alice, "Person", "name")
graph.commit(new_tx)
I suspect I have misunderstood how the Transaction works here. I would expect the above to be equivalent to either finding Alice and updating with the new property or creating a new node.
Update: I have discovered the Graph.push method, but would still appreciate advice on best practice.
You need to define a primary key to let the MERGE know which property to use as a primary key. From the documentation:
The primary property key used for Cypher MATCH and MERGE operations.
If undefined, the special value of "id" is used to hinge
uniqueness on the internal node ID instead of a property. Note that
this alters the behaviour of operations such as Graph.create() and
Graph.merge() on GraphObject instances.
It's probably best practice to define a custom class for every node type and define the primary key there.
class Person(GraphObject):
__primarykey__ = "name"
name = Property()
born = Property()
acted_in = RelatedTo(Movie)
directed = RelatedTo(Movie)
produced = RelatedTo(Movie)
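As a rough illustration of what merging on a primary key means, here is a tiny in-memory stand-in. It is not py2neo code: merge_node and the dict-based store are hypothetical, and only model the "match on label plus key, otherwise create, then update" idea.

```python
def merge_node(store, label, key, properties):
    """Match a node on (label, properties[key]) or create it, then update it."""
    identity = (label, properties[key])
    # setdefault returns the existing node, or inserts an empty one.
    node = store.setdefault(identity, {})
    node.update(properties)
    return node

store = {}
merge_node(store, "Person", "name", {"name": "Alice"})
# Same label and key value, so the existing node gains the new property.
merge_node(store, "Person", "name", {"name": "Alice", "age": 32})
```

After both calls the store holds a single Alice node carrying the age property, which is the behaviour the question expects from a merge.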

referencing an entity by its key before it gets saved to the ndb

I would like to be able to relate an entity of one class to another entity at the moment both entities are created (one entity will have the other as its parent, and the parent will hold a key pointing to the child). It seems I am unable to obtain the key of an entity before it gets saved to the Datastore. Is there any way to achieve the above without having to save one of the entities twice?
Below is the example:
class A(ndb.Model):
key_of_b = ndb.KeyProperty(kind='B')
class B(ndb.Model):
pass
What I am trying to do:
a = A()
b = B(parent=a.key)
a.key_of_b = b.key
a.put()
b.put()
If the key doesn't get assigned before the entity is saved, is there any way I could construct it myself? Or is the only solution to save one of the entities twice?
You could do this with named keys but then you have to be sure you can name the two entities with unique keys:
# It is possible to construct a key for an entity that does not yet exist.
keyname_a = 'abc'
keyname_b = 'def'
key_a = ndb.Key(A, keyname_a)
key_b = ndb.Key(A, keyname_a, B, keyname_b)
a = A(id=keyname_a)
a.key_of_b = key_b
b = B(id=keyname_b, parent=key_a)
a.put()
b.put()
However, I would suggest thinking about why you would need the key_of_b property in the first place. If you only set A as the parent of B, then you will always be able to navigate from A to B and the other way around:
# If you have the A entity from somewhere and want to find B.
b = B.query(ancestor=entity_a.key).get()
# You have the B entity from somewhere and want to find A.
a = entity_b.key.parent().get()
This also gives you the opportunity to create one-to-many relationships between A and B.
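The reason a key can be built before anything is saved is that it is just a path of (kind, id) pairs. Here is a plain-Python model of that idea. It is not the ndb API: make_key and parent_of are hypothetical helpers that only mimic the path structure.

```python
def make_key(*path):
    """Build a key as a tuple of (kind, id) pairs from a flat kind/id path."""
    if not path or len(path) % 2 != 0:
        raise ValueError("path must alternate kind, id")
    return tuple(zip(path[0::2], path[1::2]))

def parent_of(key):
    """Drop the last (kind, id) pair; a root key has no parent."""
    return key[:-1] or None

# Same names as in the answer above; nothing has been stored yet.
key_a = make_key("A", "abc")
key_b = make_key("A", "abc", "B", "def")
```

Here parent_of(key_b) equals key_a, which mirrors how entity_b.key.parent() leads back to A.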

Understanding ndb key class vs KeyProperty

I've looked through the documentation and SO questions and answers and am still struggling to understand a small piece of this. Which should you choose and when?
This is what I've read so far (just sample):
ndb documentation
movie database structure on SO
Parent Key issues
The key class seems pretty straightforward to me. When you create an ndb entity, the datastore automatically creates a key for you, usually in the form key(Kind, id), where the id is generated for you.
So say you have these two models:
class Blah(ndb.Model):
last_name = ndb.StringProperty()
class Blah2(ndb.Model):
first_name = ndb.StringProperty()
blahkey = ndb.KeyProperty()
So, just using the key class, say you want to make Blah a parent (or have several family members with the same last name):
lname = Blah(last_name = "Bonaparte")
l_key = lname.put() **OR**
l_key = lname.key.id() # spits out some long id
fname_key = l_key **OR**
fname_key = ndb.Key('Blah', lname.last_name) # which is more readable..
then:
lname = Blah2( parent=fname_key, first_name = "Napoleon")
lname.put()
lname2 = Blah2( parent=fname_key, first_name = "Lucien")
lname2.put()
So far so good (I think). Now about the KeyProperty for Blah2. Assume Blah is still the same.
lname3 = Blah2( first_name = "Louis", blahkey = fname_key)
lname3.put()
Is this correct ?
How to query various things
Query Last Name:
Blah.query() # all last names
Blah.query(last_name='Bonaparte') # That specific entity.
First Name:
Blah2.query()
napol = Blah2.query(first_name = "Napoleon")
bonakey = napol.key.parent().get() # returns Bonaparte's key ??
bona = bonakey.get() # I think this might be redundant
This is where I get lost: how to look up Bonaparte from the first name by using either the key or the KeyProperty. I didn't add it here, and perhaps should have, but there is also the discussion of parents, grandparents and great-grandparents, since keys keep track of ancestors/parents.
How and why would you use KeyProperty vs the inherent key class? Also imagine you had 3 sensors: s1, s2, s3. Each sensor has thousands of readings, but you want to keep readings associated with s1 so that you could graph, say, all readings for today for s1. Which would you use, KeyProperty or the key class? I apologize if this has been answered elsewhere, but I didn't see a clear example/guide about choosing which and why/how.
I think the confusion comes from using a Key. A Key is not associated with any properties inside of an entity, it is only a unique identifier to locate a single entity. It can be either a number or a string.
Fortunately, all your code looks good except for this one line:
fname_key = ndb.Key('Blah', lname.last_name) # which is more readable..
Constructing a Key takes a unique ID, which is not the same as a property. That is, it won't associate the variable lname.last_name with the property last_name. Instead, you can create your record like this:
lname = Blah(id = "Bonaparte")
lname.put()
lname_key = ndb.Key('Blah', "Bonaparte")
You are guaranteed to have only one Blah entity with that ID. In fact, if you use a string like last_name as the ID, you don't need to store it as a separate property. Think of the entity ID as an extra string property that is unique.
Next, be careful not to assume that Blah.last_name and Blah2.first_name are unique in your queries:
lname = Blah2( parent=fname_key, first_name = "Napoleon")
lname.put()
If you do this more than once, there will be multiple entities with a first_name of Napoleon (all with the same parent key).
Continuing with your code from above:
napol = Blah2.query(first_name = "Napoleon")
bonakey = napol.key.parent().get() # returns Bonaparte's key ??
bona = bonakey.get() # I think this might be redundant
napol holds a Query, not a result. You need to call napol.fetch() to get all entities with "Napoleon" (or napol.get() if you're sure there is just one entity).
bonakey is the opposite: because of the get(), it holds the parent entity and not Bonaparte's key. If you left the .get() off, then bona would correctly have the parent.
Finally, your question about sensors. You may not need KeyProperty or "inherent" keys. If you have a Readings class like this:
class Readings(ndb.Model):
sensor = ndb.StringProperty()
reading = ndb.IntegerProperty()
then you can store them all in a single table without keys. (You may want to include a timestamp or other attribute.) Later, you can retrieve them with this query:
s1_readings = Readings.query(Readings.sensor == 'S1').fetch()
I'm new to NDB too and don't understand all of it yet, but I think that when you create Blah2 with a parent for Napoleon, you will need the parent to query it or it will not appear. For example:
napol = Blah2.query(first_name = "Napoleon")
will not get anything (and you are not using the right format for NDB), but using the parent will:
napol = Blah2.query(ancestor=fname_key).filter(Blah2.first_name == "Napoleon").get()
I don't know if this sheds some light on your question.

SQLAlchemy/Elixir - querying to check entity's membership in a many-to-many relationship list

I am trying to construct a SQLAlchemy query to get the list of names of all professors who are assistant professors at MIT. Note that there can be multiple assistant professors associated with a certain course.
What I'm trying to do is roughly equivalent to:
uni_mit = University.get_by(name='MIT')
s = select([Professor.name],
and_(Professor.in_(Course.assistants),
Course.university = uni_mit))
session.execute(s)
This won't work, because in_ is only defined for an entity's fields, not for the whole entity. I can't use Professor.id.in_, as Course.assistants is a list of Professors, not a list of their ids. I also tried contains, but it didn't work either.
My Elixir model is:
class Course(Entity):
id = Field(Integer, primary_key=True)
assistants = ManyToMany('Professor', inverse='courses_assisted', ondelete='cascade')
university = ManyToOne('University')
..
class Professor(Entity):
id = Field(Integer, primary_key=True)
name = Field(String(50), required=True)
courses_assisted = ManyToMany('Course', inverse='assistants', ondelete='cascade')
..
This would be trivial if I could access the intermediate many-to-many entity (the condition would be and_(interm_table.prof_id = Professor.id, interm_table.course = Course.id)), but SQLAlchemy apparently hides this table from me.
I'm using Elixir 0.7 and SQLAlchemy 0.6.
Btw: This question is different from Sqlalchemy+elixir: How query with a ManyToMany relationship? in that I need to check the professors against all courses which satisfy a condition, not a single, static one.
You can find the intermediate table where Elixir has hidden it away, but note that it uses fully qualified column names (such as __package_path_with_underscores__course_id). To avoid this, define your ManyToMany using e.g.
class Course(Entity):
...
assistants = ManyToMany('Professor', inverse='courses_assisted',
local_colname='course_id', remote_colname='prof_id',
ondelete='cascade')
and then you can access the intermediate table using
rel = Course._descriptor.find_relationship('assistants')
assert rel
table = rel.table
and can access the columns using table.c.prof_id, etc.
Update: Of course you can do this at a higher level, but not in a single query, because SQLAlchemy doesn't yet support in_ for relationships. For example, with two queries:
>>> mit_courses = set(Course.query.join(
... University).filter(University.name == 'MIT'))
>>> [p.name for p in Professor.query if set(
... p.courses_assisted).intersection(mit_courses)]
Or, alternatively:
>>> plist = [c.assistants for c in Course.query.join(
... University).filter(University.name == 'MIT')]
>>> [p.name for p in set(itertools.chain(*plist))]
The first step creates a list of lists of assistants. The second step flattens the list of lists and removes duplicates through making a set.
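The flatten-and-deduplicate step can be tried on its own with plain lists. In this small sketch the names are made up and stand in for Professor objects:

```python
import itertools

# Hypothetical assistants per course; "Bob" assists two courses.
plist = [["Ann", "Bob"], ["Bob", "Cy"], []]

# chain.from_iterable flattens the list of lists; set() removes duplicates.
unique_assistants = set(itertools.chain.from_iterable(plist))
names = sorted(unique_assistants)
```

chain.from_iterable(plist) is equivalent to chain(*plist) here, but avoids unpacking the outer list into arguments.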
