How to do batch insertion on Neo4j with Python

I have code that inserts many nodes and relationships:
from neo4jrestclient.client import GraphDatabase
from neo4jrestclient import client
import psycopg2

db = GraphDatabase("http://127.0.0.1:7474", username="neo4j", password="1234")
conn = psycopg2.connect("\
dbname='bdTrmmTest' \
user='postgres' \
host='127.0.0.1' \
password='1234' \
")
inicio = 0
while inicio <= 4429640:
    c = conn.cursor()
    c.execute("SELECT p.latitude, p.longitude, h.precipitacaoh, h.datah, h.horah FROM pontos AS p, historico AS h WHERE p.gid = h.gidgeo_fk LIMIT 1640 OFFSET %d" % (inicio))
    sensorlatlong = db.labels.create("LaLo")
    sensorprecip = db.labels.create("Precipitacao")
    sensordata = db.labels.create("Data")
    sensorhora = db.labels.create("Hora")
    records = c.fetchall()
    for i in records:
        s2 = db.nodes.create(precipitacao=i[2])
        sensorprecip.add(s2)
        s5 = db.nodes.create(horah=i[4])
        sensorhora.add(s5)
        s5.relationships.create("REGISTROU", s2)
        q = 'MATCH (s:LaLo) WHERE s.latitude = "%s" AND s.longitude = "%s" RETURN s' % (str(i[0]), str(i[1]))
        results = db.query(q, returns=(client.Node))
        q2 = 'MATCH (s:LaLo)-->(d:Data)-->(h:Hora)-->(p:Precipitacao) WHERE s.latitude = "%s" AND s.longitude = "%s" AND d.datah = "%s" RETURN d' % (str(i[0]), str(i[1]), str(i[3]))
        results1 = db.query(q2, returns=(client.Node))
        if len(results) > 0:
            n = results[0].pop()
            if len(results1) > 0:
                n1 = results1[0].pop()
                n1.relationships.create("AS", s5)
            else:
                s4 = db.nodes.create(datah=i[3])
                sensordata.add(s4)
                n.relationships.create("EM", s4)
                s4.relationships.create("AS", s5)
        else:
            s3 = db.nodes.create(latitude=i[0], longitude=i[1])
            sensorlatlong.add(s3)
            if len(results1) > 0:
                n1 = results1[0].pop()
                n1.relationships.create("AS", s5)
            else:
                s4 = db.nodes.create(datah=i[3])
                sensordata.add(s4)
                s3.relationships.create("EM", s4)
                s4.relationships.create("AS", s5)
    inicio = inicio + 1640
But it takes many days to insert. How can I batch the inserts in this code to decrease the insertion time? I read this post http://jexp.de/blog/2012/10/parallel-batch-inserter-with-neo4j/ but it is in Java.

I haven't used Neo4j from Python, but I'm pretty sure the client works the same way as in other languages, and that means your code will generate a lot of distinct HTTP connections, manipulating the low-level node and relationship endpoints. That means lots of latency.
It also generates lots of distinct queries, because it does string replacement instead of using parameterized queries, and Neo4j will have to parse each and every one of them.
You'd be much better off with a small number of parameterized Cypher queries, or even one.
If I've read the documentation for neo4jrestclient correctly, I think it would look something like this:
c.execute("SELECT p.latitude, p.longitude, h.precipitacaoh, h.datah, h.horah FROM pontos AS p, historico AS h WHERE p.gid = h.gidgeo_fk LIMIT 1640 OFFSET %d"%(inicio))
records = c.fetchall()
q = """
MERGE (lalo:LaLo {latitude: {latitude}, longitude: {longitude}})
WITH lalo
MERGE (lalo)-[:EM]->(data:Data {datah: {datah}})
WITH data
CREATE (data)-[:AS]->(hora:Hora {horah: {horah}})
CREATE (hora)-[:REGISTROU]->(:Precipitacao {precipitacao: {precipitacao}})
"""
for i in records:
params = {
"latitude": str(i[0]),
"longitude": str(i[1]),
"precipitacao": i[2],
"datah": i[3],
"horah": i[4],
}
db.query(q=q, params=params)
Of course, it will run faster if you have indices, so you'd need to create those first (at least the first 2), for example right before the loop, or outside of the process:
CREATE INDEX ON :LaLo(latitude)
CREATE INDEX ON :LaLo(longitude)
CREATE INDEX ON :Data(datah)
The last thing you could do to speed things up is use transactions, so writes happen in batches:
Open a transaction:
tx = db.transaction(for_query=True)
Append (for example) up to a thousand queries (or fewer, if you reach the end of the rows):
params = ...  # built as in the loop above
tx.append(q=q, params=params)
Commit the transaction:
tx.execute()
Repeat until you've run out of rows from the SQL database.
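Putting those pieces together, the per-batch loop might look something like the sketch below. It only rearranges the calls already shown above (db.transaction, tx.append, tx.execute); the batch size of 1000 is an arbitrary choice.

BATCH_SIZE = 1000  # arbitrary; tune as needed

tx = db.transaction(for_query=True)
pending = 0
for i in records:
    params = {
        "latitude": str(i[0]),
        "longitude": str(i[1]),
        "precipitacao": i[2],
        "datah": i[3],
        "horah": i[4],
    }
    tx.append(q=q, params=params)
    pending += 1
    if pending >= BATCH_SIZE:
        tx.execute()                          # commit this batch of writes in one round trip
        tx = db.transaction(for_query=True)   # start the next batch
        pending = 0
if pending:
    tx.execute()  # commit whatever is left over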

Related

Problems querying with python to BigQuery (Python String Format)

I am trying to make a query to BigQuery in order to modify all the values of a row (in Python). When I use a simple string to query, I have no problems. Nevertheless, when I introduce string formatting, the query does not work. Below I present the same query, but with fewer of the columns that I am modifying.
I have already made the connection to BigQuery by defining the Client, etc. (and it works properly).
I tried:
"UPDATE `riscos-dev.survey_test.data-test-bdrn` SET informaci_meteorol_gica = {inf}, risc = {ri} WHERE objectid = {obj_id}".format(inf = df.informaci_meteorol_gica[index], ri = df.risc[index], obj_id = df.objectid[index])
To specify the input values for format:
df.informaci_meteorol_gica[index] = 'Neu' (a string), df.risc[index] is also a string, and df.objectid[index] = 3
I am obtaining the following error message:
BadRequest: 400 Braced constructors are not supported at [1:77]
Instead of using the string format method, I propose another approach using f-string formatting in Python:
def build_query():
    inf = "'test_inf'"
    ri = "'test_ri'"
    obj_id = "'test_obj_id'"
    return f"UPDATE `riscos-dev.survey_test.data-test-bdrn` SET informaci_meteorol_gica = {inf}, risc = {ri} WHERE objectid = {obj_id}"

if __name__ == '__main__':
    query = build_query()
    print(query)
The result is:
UPDATE `riscos-dev.survey_test.data-test-bdrn` SET informaci_meteorol_gica = 'test_inf', risc = 'test_ri' WHERE objectid = 'test_obj_id'
I mocked the query params in my example with:
inf = "'test_inf'"
ri = "'test_ri'"
obj_id = "'test_obj_id'"
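Since you said the BigQuery client is already set up and working, running the built query is then just a matter of passing the string to it; a minimal sketch using the standard google-cloud-bigquery calls (client creation shown only for completeness):

from google.cloud import bigquery

client = bigquery.Client()       # you said this part already works

query = build_query()
query_job = client.query(query)  # starts the UPDATE job
query_job.result()               # blocks until the job finishes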

Looking for a better strategy for an SQLAlchemy bulk upsert

I have a Flask application with a RESTful API. One of the API calls is a 'mass upsert' call with a JSON payload. I am struggling with performance.
The first thing I tried was to use merge-result on a Query object, because...
This is an optimized method which will merge all mapped instances, preserving the structure of the result rows and unmapped columns with less method overhead than that of calling Session.merge() explicitly for each value.
This was the initial code:
class AdminApiUpdateTasks(Resource):
    """Bulk task creation / update endpoint"""
    def put(self, slug):
        taskdata = json.loads(request.data)
        existing = db.session.query(Task).filter_by(challenge_slug=slug)
        existing.merge_result(
            [task_from_json(slug, **task) for task in taskdata])
        db.session.commit()
        return {}, 200
A request to that endpoint with ~5000 records, all of them already existing in the database, takes more than 11m to return:
real 11m36.459s
user 0m3.660s
sys 0m0.391s
As this would be a fairly typical use case, I started looking into alternatives to improve performance. Against my better judgement, I tried to merge the session for each individual record:
class AdminApiUpdateTasks(Resource):
    """Bulk task creation / update endpoint"""
    def put(self, slug):
        # Get the posted data
        taskdata = json.loads(request.data)
        for task in taskdata:
            db.session.merge(task_from_json(slug, **task))
        db.session.commit()
        return {}, 200
To my surprise, this turned out to be more than twice as fast:
real 4m33.945s
user 0m3.608s
sys 0m0.258s
I have two questions:
Why is the second strategy using merge faster than the supposedly optimized first one that uses merge_result?
What other strategies should I pursue to optimize this more, if any?
This is an old question, but I hope this answer can still help people.
I used the same idea as this example from SQLAlchemy, but I added benchmarking for UPSERT (update the record if it exists, otherwise insert it) operations. The results from a PostgreSQL 11 database are below:
Tests to run: test_customer_individual_orm_select, test_customer_batched_orm_select, test_customer_batched_orm_select_add_all, test_customer_batched_orm_merge_result
test_customer_individual_orm_select : UPSERT statements via individual checks on whether objects exist and add new objects individually (10000 iterations); total time 9.359603 sec
test_customer_batched_orm_select : UPSERT statements via batched checks on whether objects exist and add new objects individually (10000 iterations); total time 1.553555 sec
test_customer_batched_orm_select_add_all : UPSERT statements via batched checks on whether objects exist and add new objects in bulk (10000 iterations); total time 1.358680 sec
test_customer_batched_orm_merge_result : UPSERT statements using batched merge_results (10000 iterations); total time 7.191284 sec
As you can see, merge_result is far from the most efficient option. I'd suggest checking in batches whether the records exist and should be updated. Hope this helps!
"""
This series of tests illustrates different ways to UPSERT
or INSERT ON CONFLICT UPDATE a large number of rows in bulk.
"""
from sqlalchemy import Column
from sqlalchemy import create_engine
from sqlalchemy import Integer
from sqlalchemy import String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import Session
from profiler import Profiler
Base = declarative_base()
engine = None
class Customer(Base):
__tablename__ = "customer"
id = Column(Integer, primary_key=True)
name = Column(String(255))
description = Column(String(255))
Profiler.init("bulk_upserts", num=100000)
#Profiler.setup
def setup_database(dburl, echo, num):
global engine
engine = create_engine(dburl, echo=echo)
Base.metadata.drop_all(engine)
Base.metadata.create_all(engine)
s = Session(engine)
for chunk in range(0, num, 10000):
# Insert half of the customers we want to merge
s.bulk_insert_mappings(
Customer,
[
{
"id": i,
"name": "customer name %d" % i,
"description": "customer description %d" % i,
}
for i in range(chunk, chunk + 10000, 2)
],
)
s.commit()
#Profiler.profile
def test_customer_individual_orm_select(n):
"""
UPSERT statements via individual checks on whether objects exist
and add new objects individually
"""
session = Session(bind=engine)
for i in range(0, n):
customer = session.query(Customer).get(i)
if customer:
customer.description += "updated"
else:
session.add(Customer(
id=i,
name=f"customer name {i}",
description=f"customer description {i} new"
))
session.flush()
session.commit()
#Profiler.profile
def test_customer_batched_orm_select(n):
"""
UPSERT statements via batched checks on whether objects exist
and add new objects individually
"""
session = Session(bind=engine)
for chunk in range(0, n, 1000):
customers = {
c.id: c for c in
session.query(Customer)\
.filter(Customer.id.between(chunk, chunk + 1000))
}
for i in range(chunk, chunk + 1000):
if i in customers:
customers[i].description += "updated"
else:
session.add(Customer(
id=i,
name=f"customer name {i}",
description=f"customer description {i} new"
))
session.flush()
session.commit()
#Profiler.profile
def test_customer_batched_orm_select_add_all(n):
"""
UPSERT statements via batched checks on whether objects exist
and add new objects in bulk
"""
session = Session(bind=engine)
for chunk in range(0, n, 1000):
customers = {
c.id: c for c in
session.query(Customer)\
.filter(Customer.id.between(chunk, chunk + 1000))
}
to_add = []
for i in range(chunk, chunk + 1000):
if i in customers:
customers[i].description += "updated"
else:
to_add.append({
"id": i,
"name": "customer name %d" % i,
"description": "customer description %d new" % i,
})
if to_add:
session.bulk_insert_mappings(
Customer,
to_add
)
to_add = []
session.flush()
session.commit()
#Profiler.profile
def test_customer_batched_orm_merge_result(n):
"UPSERT statements using batched merge_results"
session = Session(bind=engine)
for chunk in range(0, n, 1000):
customers = session.query(Customer)\
.filter(Customer.id.between(chunk, chunk + 1000))
customers.merge_result(
Customer(
id=i,
name=f"customer name {i}",
description=f"customer description {i} new"
) for i in range(chunk, chunk + 1000)
)
session.flush()
session.commit()
I also think this query may have been contributing to the slowness of your first approach:
existing = db.session.query(Task).filter_by(challenge_slug=slug)
Also, you should probably change this:
existing.merge_result(
    [task_from_json(slug, **task) for task in taskdata])
to:
existing.merge_result(
    (task_from_json(slug, **task) for task in taskdata))
That should save you some memory and time, since the list won't be generated in memory before being sent to the merge_result method.
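For completeness, here is a rough sketch of the batched-select idea applied to the endpoint from the question. The question doesn't show the Task model, so the identifier attribute and the fields being copied are assumptions, not the actual API:

class AdminApiUpdateTasks(Resource):
    """Bulk task creation / update endpoint (batched-select sketch)"""
    def put(self, slug):
        taskdata = json.loads(request.data)
        incoming = [task_from_json(slug, **task) for task in taskdata]

        # One query for all existing tasks of this challenge, keyed by an
        # assumed 'identifier' attribute.
        existing = {
            t.identifier: t
            for t in db.session.query(Task).filter_by(challenge_slug=slug)
        }

        for task in incoming:
            current = existing.get(task.identifier)
            if current is not None:
                # Copy over whatever fields can change (names are assumed).
                current.instruction = task.instruction
            else:
                db.session.add(task)
        db.session.commit()
        return {}, 200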

Creating a Table(array) of Records

If I wanted to store records from two files in a table (an array of records), could I use a format similar to the code below, and just put both file names in the function definition, like def readTable(log1, log2):, and then use the same code for both log1 and log2, letting it build a table1 and a table2?
def readTable(fileName):
    s = Scanner(fileName)
    table = []
    record = readRecord(s)
    while record != "":
        table.append(record)
        record = readRecord(s)
    s.close()
    return table
Just use *args, and get back a list of tables:
def readTable(*args):
    tables = []
    for filename in args:
        s = Scanner(filename)
        table = []
        record = readRecord(s)
        while record != "":
            table.append(record)
            record = readRecord(s)
        s.close()
        tables.append(table)
    return tables
This way, you can pass log1, log2, log3 (any number of logs you like) and get back a list of tables, one for each.
Since readTable returns a list, if you want to concatenate the records from 2 logs, use the + operator.
readTable(log1) + readTable(log2)
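As a usage example (assuming log1 and log2 are file names), the *args version reads both logs in one call and returns the tables in the same order as the arguments:

table1, table2 = readTable(log1, log2)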

Not able to understand cursors in App Engine

I'm trying to fetch results in a Python 2.7 App Engine app using cursors, but each time I use with_cursor() it fetches the same result set.
query = Model.all().filter("profile =", p_key).order('-created')
if r.get('cursor'):
    query = query.with_cursor(start_cursor=r.get('cursor'))
cursor = query.cursor()
objs = query.fetch(limit=10)
count = len(objs)
for obj in objs:
    ...
Each time through I'm getting the same 10 results. I'm thinking it has to do with using end_cursor, but how do I get that value if query.cursor() returns the start_cursor? I've looked through the docs but this is poorly documented.
Your formatting is a bit screwy, by the way. Looking at your code (which is incomplete and therefore potentially leaving something out), I have to assume you have forgotten to store the cursor after fetching results (or return it to the user; I am assuming r is a request?).
So after you have fetched some data, you need to call cursor() on the query. For example, this function counts all entities using a cursor:
def count_entities(kind):
    c = None
    count = 0
    q = kind.all(keys_only=True)
    while True:
        if c:
            q.with_cursor(c)
        i = q.fetch(1000)
        count = count + len(i)
        if not i:
            break
        c = q.cursor()
    return count
See how, after fetch() has been called, c = q.cursor() stores the cursor, and it is used as the start cursor the next time through the loop.
Here's what finally worked:
query = Model.all().filter("profile =", p_key).order('-created')
if request.get('cursor'):
    query = query.with_cursor(request.get('cursor'))
objs = query.fetch(limit=10)
cursor = query.cursor()
for obj in objs:
    ...
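One thing the snippet doesn't show is how the cursor gets back to the client; that's the other half of the loop, since the next request has to send it in as 'cursor'. A minimal sketch of that hand-off, assuming a webapp-style handler and a JSON response (both assumptions, not shown in the question):

objs = query.fetch(limit=10)
next_cursor = query.cursor()  # end cursor of the batch just fetched

# Hypothetical response: return next_cursor to the client so it comes
# back in as request.get('cursor') on the following request.
self.response.out.write(json.dumps({
    'keys': [str(obj.key()) for obj in objs],
    'cursor': next_cursor,
}))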

Limit calls to external database with Python CGI

I've got a Python CGI script that pulls data from a GPS service; I'd like this information to be updated on the webpage about once every 10s (the max allowed by the GPS service's TOS). But there could be, say, 100 users viewing the webpage at once, all calling the script.
I think the users' scripts need to grab data from a buffer page that itself only updates once every ten seconds. How can I make this buffer page auto-update if there's no one directly viewing the content (and not accessing the CGI)? Are there better ways to accomplish this?
Cache the results of your GPS data query in a file or database (sqlite) along with a datetime.
You can then do a datetime check against the last cached datetime to initiate another GPS data query.
You'll probably run into concurrency issues with cgi and the datetime check though...
To get around concurrency issues, you can use sqlite, and put the write in a try/except.
Here's a sample cache implementation using sqlite.
import datetime
import sqlite3


class GpsCache(object):
    db_path = 'gps_cache.db'

    def __init__(self):
        self.con = sqlite3.connect(self.db_path)
        self.cur = self.con.cursor()

    def _get_period(self, dt=None):
        '''normalize time to 15 minute periods'''
        if dt.minute < 15:
            minute_period = 0
        elif 15 <= dt.minute < 30:
            minute_period = 15
        elif 30 <= dt.minute < 45:
            minute_period = 30
        elif 45 <= dt.minute:
            minute_period = 45
        period_dt = datetime.datetime(year=dt.year, month=dt.month, day=dt.day, hour=dt.hour, minute=minute_period)
        return period_dt

    def get_cache(self, dt=None):
        period_dt = self._get_period(dt)
        select_sql = 'SELECT * FROM GPS_CACHE WHERE date_time = "%s";' % period_dt.strftime('%Y-%m-%d %H:%M')
        self.cur.execute(select_sql)
        row = self.cur.fetchone()
        return row[0] if row else None  # nothing cached yet for this period

    def put_cache(self, dt=None, data=None):
        period_dt = self._get_period(dt)
        insert_sql = 'INSERT ....'  # edit to your table structure
        try:
            self.cur.execute(insert_sql)
            self.con.commit()
        except sqlite3.OperationalError:
            # assume db is being updated by another process with the current results and ignore
            pass
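The INSERT is deliberately left open above ("edit to your table structure"); one possible (hypothetical) layout that matches the SELECT in get_cache could look like this, with the data column first so that fetchone()[0] returns the cached payload:

# Hypothetical schema for the GPS_CACHE table used above.
schema_sql = '''
CREATE TABLE IF NOT EXISTS GPS_CACHE (
    data      TEXT,
    date_time TEXT PRIMARY KEY
);
'''

# And a matching parameterized insert inside put_cache:
insert_sql = 'INSERT INTO GPS_CACHE (data, date_time) VALUES (?, ?);'
# self.cur.execute(insert_sql, (data, period_dt.strftime('%Y-%m-%d %H:%M')))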
So we have the cache tool; now the implementation side.
You'll want to check the cache first; then, if it's not 'fresh' (doesn't return anything), go grab the data using your current method, and cache the data you grabbed.
You should probably organize this better, but you should get the general idea here.
Using this sample, you just replace your current calls to 'remote_get_gps_data' with 'get_gps_data'.
import datetime

from gps_cacher import GpsCache


def remote_get_gps_data():
    # your function here
    return data


def get_gps_data():
    data = None
    gps_cache = GpsCache()
    current_dt = datetime.datetime.now()
    cached_data = gps_cache.get_cache(current_dt)
    if cached_data:
        data = cached_data
    else:
        data = remote_get_gps_data()
        gps_cache.put_cache(current_dt, data)
    return data
