Optimizing py2neo's cypher insertion - python

I am using py2neo to import several hundred thousand nodes. I've created a defaultdict to map neighborhoods to cities. One motivation was to more efficiently import these relationships having been unsuccessful with Neo4j's load tool.
Because the batch documentation suggests to avoid using it, I veered away from an implementation like the OP of this post. Instead the documentation suggests I use Cypher. However, I like the being able to create nodes from the defaultdict I have created. Plus, I found it too difficult importing this information as the first link demonstrates.
To reduce the speed of the import, should I create a Cypher transaction (and submit every 10,00) instead of the following loop?
for city_name, neighborhood_names in city_neighborhood_map.iteritems():
city_node = graph.find_one(label="City", property_key="Name", property_value=city_name)
for neighborhood_name in neighborhood_names:
neighborhood_node = Node("Neighborhood", Name=neighborhood_name)
rel = Relationship(neighborhood_node, "IN", city_node)
graph.create(rel)
I get a time-out, and it appears to be pretty slow when I do the following. Even when I break up the transaction so it commits every 1,000 neighborhoods, it still processes very slowly.
tx = graph.cypher.begin()
statement = "MERGE (city {Name:{City_Name}}) CREATE (neighborhood { Name : {Neighborhood_Name}}) CREATE (neighborhood)-[:IN]->(city)"
for city_name, neighborhood_names in city_neighborhood_map.iteritems():
for neighborhood_name in neighborhood_names:
tx.append(statement, {"City_Name": city_name, "Neighborhood_Name": neighborhood_name})
tx.commit()
It would be fantastic to save pointers to each city so I don't need to look it up each time with the merge.

It may be faster to do this in two runs, i.e. CREATE all nodes first with unique constraints (which should be very fast) and then CREATE the relationships in a second round.
Constraints first, use Labels City and Neighborhood, faster MATCH later:
graph.schema.create_uniqueness_constraint('City', 'Name')
graph.schema.create_uniqueness_constraint('Neighborhood', 'Name')
Create all nodes:
tx = graph.cypher.begin()
statement = "CREATE (:City {Name: {name}})"
for city_name in city_neighborhood_map.keys():
tx.append(statement, {"name": city_name})
statement = "CREATE (:Neighborhood {Name: {name}})"
for neighborhood_name in neighborhood_names: # get all neighborhood names for this
tx.append(statement, {name: neighborhood_name})
tx.commit()
Relationships should be fast now (fast MATCH due to constraints/index):
tx = graph.cypher.begin()
statement = "MATCH (city:City {Name: {City_Name}}), MATCH (n:Neighborhood {Name: {Neighborhood_Name}}) CREATE (n)-[:IN]->(city)"
for city_name, neighborhood_names in city_neighborhood_map.iteritems():
for neighborhood_name in neighborhood_names:
tx.append(statement, {"City_Name": city_name, "Neighborhood_Name": neighborhood_name})
tx.commit()

Related

Pyneo Inserts Limited Amount of Edges

For a university project I am using Neo4j together with flask and pyneo for a shift scheduling algorithm. On saving the scheduled shifts to Neo4j I realized that relationships go missing, from 330 only 91 get inserted.
On printing them before/after inserting, they are in the list to be inserted, and I also moved the transaction around to check if this changes the result.
I have the following structure:
(w:Worker)-[r:works_during]->(s:Shift) with
r.day, r.month, r.year as set parameters for the relationship and multiple connections between each worker and each shift, which can be filtered via the relation then.
my code looks like the following:
header = df.columns.tolist()
header.remove("index")
header.remove("worker")
tuplelist = []
for index, row in df.iterrows():
for i in header:
worker = self.driver.nodes.match("Worker", id=int(row["worker"])).first()
if row[i] == 1:
# Shifts are in the format {day}_{shift_of_day}
shift_id = str(i).split("_")[1]
shift_day = str(i).split("_")[0]
shift = self.driver.nodes.match("Shift", id=int(shift_id)).first()
rel = Relationship(worker, "works_during", shift)
rel["day"] = int(shift_day)
rel["month"] = int(month)
rel["year"] = int(year)
tuplelist.append(rel)
print(len(tuplelist))
for i in tuplelist:
connection = self.driver.begin()
connection.create(i)
connection.commit()
Is there any special behaviour in pyneo which I need to be aware of that could cause this issue?
Pyneo allows just one connection from the same type between node A and node B.
If multiple connections of the same type (even with different attributes) are needed, it is necessary to use plain Cypher Querying as pyneo will merge this edges to a single edge.

Pandas dataframe to doc2vec.LabeledSentence

I have this dataframe :
order_id product_id user_id
2 33120 u202279
2 28985 u202279
2 9327 u202279
4 39758 u178520
4 21351 u178520
5 6348 u156122
5 40878 u156122
Type user_id : String
Type product_id : Integer
I would like to use this dataframe to create a Doc2vec corpus. So, I need to use the LabeledSentence function to create a dict :
{tags : user_id, words:
all product ids ordered by each user_id}
But the the dataframe shape is (32434489, 3), so I should avoid to use a loop to create my labeledSentence.
I try to run this function (below) with multiprocessing but is too long.
Have you any idea to transform my dataframe in the good format for a Doc2vec corpus where the tag is the user_id and the words is the list of products by user_id?
def append_to_sequences(i):
user_id = liste_user_id.pop(0)
liste_produit_userID = data.ix[data["user_id"]==user_id, "product_id"].astype(str).tolist()
return doc2vec.LabeledSentence(words=prd_user_list, tags=user_id )
pool = multiprocessing.Pool(processes=3)
result = pool.map_async(append_to_sequences, np.arange(len_liste_unique_user))
pool.close()
pool.join()
sentences = result.get()
Using multiprocessing is likely overkill. The forking of processes can wind up duplicating all existing memory, and involve excess communication marshalling results back into the master process.
Using a loop should be OK. 34 million rows (and far fewer unique user_ids) isn't that much, depending on your RAM.
Note that in recent versions of gensim TaggedDocument is the preferred class for Doc2Vec examples.
If we were to assume you have a list of all unique user_ids in liste_user_id, and a (new, not shown) function that gets the list-of-words for a user_id called words_for_user(), creating the documents for Doc2Vec in memory could be as simple as:
documents = [TaggedDocument(words=words_for_user(uid), tags=[uid])
for uid in liste_user_id]
Note that tags should be a list of tags, not a single tag – even though in many common cases each document only has a single tag. (If you provide a single string tag, it will see tags as a list-of-characters, which is not what you want.)

Google's radarsearch API results

I'm trying to geolocate all the businesses related to a keyword in my city using, first, the radarsearch API in order to retrieve the Place ID and later using the Places API to get more information of each Place ID (such as the name, or the formatted address).
In my first approach I splitted my city in 9 circumferences, each one with radius 22km and avoiding rural zones, where there's no supposed to be a business. This way I obtained (once removing duplicated results, due to the circumferences overlapping) approximately 150 businesses. This result is not reliable because the official webpage of the company asserts there are 245.
In order to retrieve ALL the businesses, I split my city in circumferences of radius 10km. Therefore with approx 50 pairs of coordinates I fill the city, including now all zones, both rural and non-rural. Now, surprisingly I obtain only 81 businesses! How can this be possible?
I'm storing all the information in separated dictionaries and I noticed the amount of data of each dictionary increases with the increasing of the radius and is always the same (for a fixed radius).
Now, apart from the previous question, is there any way to limit the amount of results each request yields?
The code I'm using is the following:
dict1 = {}
radius=20000
keyword='keyworkd'
key=YOUR_API_KEY
url_base="https://maps.googleapis.com/maps/api/place/radarsearch/json?"
list_dicts = []
for i,(lo, la) in enumerate(zip(lon_txt,lat_txt)):
url=url_base+'location='+str(lo)+','+str(la)+'&radius='+str(radius)+'&keyword='+keyword+'&key='+key
response = urllib2.urlopen(url)
table = json.load(response)
if table['status']=='OK':
for j,line in enumerate(table['results']):
temp = {j : line['place_id']}
dict1.update(temp)
list_dicts.append(dict1)
else:
pass
Finally I managed to solve this problem.
The thing was the dict initialization must be done in each loop iteration. Now it stores all the information and I retrieve what I wanted from the beginning.
dict1 = {}
radius=20000
keyword='keyworkd'
key=YOUR_API_KEY
url_base="https://maps.googleapis.com/maps/api/place/radarsearch/json?"
list_dicts = []
for i,(lo, la) in enumerate(zip(lon_txt,lat_txt)):
url=url_base+'location='+str(lo)+','+str(la)+'&radius='+str(radius)+'&keyword='+keyword+'&key='+key
response = urllib2.urlopen(url)
table = json.load(response)
if table['status']=='OK':
for j,line in enumerate(table['results']):
temp = {j : line['place_id']}
dict1.update(temp)
list_dicts.append(dict1)
dict1 = {}
else:
pass

sqlalchemy query using joinedload exponentially slower with each new filter clause

I have this sqlalchemy query:
query = session.query(Store).options(joinedload('salesmen').
joinedload('comissions').
joinedload('orders')).\
filter(Store.store_code.in_(selected_stores))
stores = query.all()
for store in stores:
for salesman in store.salesmen:
for comission in salesman.comissions:
#generate html for comissions for each salesman in each store
#print html document using PySide
This was working perfectly, however I added two new filter queries:
filter(Comissions.payment_status == 0).\
filter(Order.order_date <= self.dateEdit.date().toPython())
If I add just the first filter the application hangs for a couple of seconds, if I add both the application hangs indefinitely
What am I doing wrong here? How do I make this query fast?
Thank you for your help
EDIT: This is the sql generated, unfortunately the class and variable names are in Portuguese, I just translated them to English so it would be easier to undertand,
so Loja = Store, Vendedores = Salesmen, Pedido = Order, Comission = Comissao
Query generated:
SELECT "Loja"."CodLoja", "Vendedores_1"."CodVendedor", "Vendedores_1"."NomeVendedor", "Vendedores_1"."CodLoja", "Vendedores_1"."PercentualComissao",
"Vendedores_1"."Ativo", "Comissao_1"."CodComissao", "Comissao_1"."CodVendedor", "Comissao_1"."CodPedido",
"Pedidos_1"."CodPedido", "Pedidos_1"."CodLoja", "Pedidos_1"."CodCliente", "Pedidos_1"."NomeCliente", "Pedidos_1"."EnderecoCliente", "Pedidos_1"."BairroCliente",
"Pedidos_1"."CidadeCliente", "Pedidos_1"."UFCliente", "Pedidos_1"."CEPCliente", "Pedidos_1"."FoneCliente", "Pedidos_1"."Fone2Cliente", "Pedidos_1"."PontoReferenciaCliente",
"Pedidos_1"."DataPedido", "Pedidos_1"."ValorProdutos", "Pedidos_1"."ValorCreditoTroca",
"Pedidos_1"."ValorTotalDoPedido", "Pedidos_1"."Situacao", "Pedidos_1"."Vendeu_Teflon", "Pedidos_1"."ValorTotalTeflon",
"Pedidos_1"."DataVenda", "Pedidos_1"."CodVendedor", "Pedidos_1"."TipoVenda", "Comissao_1"."Valor", "Comissao_1"."DataPagamento", "Comissao_1"."StatusPagamento"
FROM "Comissao", "Pedidos", "Loja" LEFT OUTER JOIN "Vendedores" AS "Vendedores_1" ON "Loja"."CodLoja" = "Vendedores_1"."CodLoja"
LEFT OUTER JOIN "Comissao" AS "Comissao_1" ON "Vendedores_1"."CodVendedor" = "Comissao_1"."CodVendedor" LEFT OUTER JOIN "Pedidos" AS "Pedidos_1" ON "Pedidos_1"."CodPedido" = "Comissao_1"."CodPedido"
WHERE "Loja"."CodLoja" IN (:CodLoja_1) AND "Comissao"."StatusPagamento" = :StatusPagamento_1 AND "Pedidos"."DataPedido" <= :DataPedido_1
Your FROM clause is producing a Cartesian product and includes each table twice, once for filtering the result and once for eagerly loading the relationship.
To stop this use contains_eager instead of joinedload in your options. This will look for the related attributes in the query's columns instead of constructing an extra join. You will also need to explicitly join to the other tables in your query, e.g.:
query = session.query(Store)\
.join(Store.salesmen)\
.join(Store.commissions)\
.join(Store.orders)\
.options(contains_eager('salesmen'),
contains_eager('comissions'),
contains_eager('orders'))\
.filter(Store.store_code.in_(selected_stores))\
.filter(Comissions.payment_status == 0)\
.filter(Order.order_date <= self.dateEdit.date().toPython())

Failed WriteBatch Operation with py2neo

I am trying to find a workaround to the following problem. I have seen it quasi-described in this SO question, yet not really answered.
The following code fails, starting with a fresh graph:
from py2neo import neo4j
def add_test_nodes():
# Add a test node manually
alice = g.get_or_create_indexed_node("Users", "user_id", 12345, {"user_id":12345})
def do_batch(graph):
# Begin batch write transaction
batch = neo4j.WriteBatch(graph)
# get some updated node properties to add
new_node_data = {"user_id":12345, "name": "Alice"}
# batch requests
a = batch.get_or_create_in_index(neo4j.Node, "Users", "user_id", 12345, {})
batch.set_properties(a, new_node_data) #<-- I'm the problem
# execute batch requests and clear
batch.run()
batch.clear()
if __name__ == '__main__':
# Initialize Graph DB service and create a Users node index
g = neo4j.GraphDatabaseService()
users_idx = g.get_or_create_index(neo4j.Node, "Users")
# run the test functions
add_test_nodes()
alice = g.get_or_create_indexed_node("Users", "user_id", 12345)
print alice
do_batch(g)
# get alice back and assert additional properties were added
alice = g.get_or_create_indexed_node("Users", "user_id", 12345)
assert "name" in alice
In short, I wish, in one batch transaction, to update existing indexed node properties. The failure is occurring at the batch.set_properties line, and it is because the BatchRequest object returned by the previous line is not being interpreted as a valid node. Though not entirely indentical, it feels like I am attempting something like the answer posted here
Some specifics
>>> import py2neo
>>> py2neo.__version__
'1.6.0'
>>> g = py2neo.neo4j.GraphDatabaseService()
>>> g.neo4j_version
(2, 0, 0, u'M06')
Update
If I split the problem into separate batches, then it can run without error:
def do_batch(graph):
# Begin batch write transaction
batch = neo4j.WriteBatch(graph)
# get some updated node properties to add
new_node_data = {"user_id":12345, "name": "Alice"}
# batch request 1
batch.get_or_create_in_index(neo4j.Node, "Users", "user_id", 12345, {})
# execute batch request and clear
alice = batch.submit()
batch.clear()
# batch request 2
batch.set_properties(a, new_node_data)
# execute batch request and clear
batch.run()
batch.clear()
This works for many nodes as well. Though I do not love the idea of splitting the batch up, this might be the only way at the moment. Anyone have some comments on this?
After reading up on all the new features of Neo4j 2.0.0-M06, it seems that the older workflow of node and relationship indexes are being superseded. There is presently a bit of a divergence on the part of neo in the way indexing is done. Namely, labels and schema indexes.
Labels
Labels can be arbitrarily attached to nodes and can serve as a reference for an index.
Indexes
Indexes can be created in Cypher by referencing Labels (here, User) and node property key, (screen_name):
CREATE INDEX ON :User(screen_name)
Cypher MERGE
Furthermore, the indexed get_or_create methods are now possible via the new cypher MERGE function, which incorporate Labels and their indexes quite succinctly:
MERGE (me:User{screen_name:"SunPowered"}) RETURN me
Batch
Queries of the sort can be batched in py2neo by appending a CypherQuery instance to the batch object:
from py2neo import neo4j
graph_db = neo4j.GraphDatabaseService()
cypher_merge_user = neo4j.CypherQuery(graph_db,
"MERGE (user:User {screen_name:{name}}) RETURN user")
def get_or_create_user(screen_name):
"""Return the user if exists, create one if not"""
return cypher_merge_user.execute_one(name=screen_name)
def get_or_create_users(screen_names):
"""Apply the get or create user cypher query to many usernames in a
batch transaction"""
batch = neo4j.WriteBatch(graph_db)
for screen_name in screen_names:
batch.append_cypher(cypher_merge_user, params=dict(name=screen_name))
return batch.submit()
root = get_or_create_user("Root")
users = get_or_create_users(["alice", "bob", "charlie"])
Limitation
There is a limitation, however, in that the results from a cypher query in a batch transaction cannot be referenced later in the same transaction. The original question was in reference to updating a collection of indexed user properties in one batch transaction. This is still not possible, as far as I can muster. For example, the following snippet throws an error:
batch = neo4j.WriteBatch(graph_db)
b1 = batch.append_cypher(cypher_merge_user, params=dict(name="Alice"))
batch.set_properties(b1, dict(last_name="Smith")})
resp = batch.submit()
So, it seems that although there is a bit less overhead in implementing the get_or_create over a labelled node using py2neo because the legacy indexes are no longer necessary, the original question still needs 2 separate batch transactions to complete.
Your problem seems not to be in batch.set_properties() but rather in the output of batch.get_or_create_in_index(). If you add the node with batch.create(), it works:
db = neo4j.GraphDatabaseService()
batch = neo4j.WriteBatch(db)
# create a node instead of getting it from index
test_node = batch.create({'key': 'value'})
# set new properties on the node
batch.set_properties(test_node, {'key': 'foo'})
batch.submit()
If you have a look at the properties of the BatchRequest object returned by batch.create() and batch.get_or_create_in_index() there is a difference in the URI because the methods use different parts of the neo4j REST API:
test_node = batch.create({'key': 'value'})
print test_node.uri # node
print test_node.body # {'key': 'value'}
print test_node.method # POST
index_node = batch.get_or_create_in_index(neo4j.Node, "Users", "user_id", 12345, {})
print index_node.uri # index/node/Users?uniqueness=get_or_create
print index_node.body # {u'value': 12345, u'key': 'user_id', u'properties': {}}
print index_node.method # POST
batch.submit()
So I guess batch.set_properties() somehow can't handle the URI of the indexed node? I.e. it doesn't really get the correct URI for the node?
Doesn't solve the problem, but could be a pointer for somebody else ;) ?

Categories