I am populating a graph in Neo4j from an SQLite3 database, using py2neo with Python 3.2 on Ubuntu Linux. Although speed is not of the utmost concern, the graph has only reached 40K rows (one relationship per SQL row) in about 3 hours, out of a total of 5 million rows.
Here is the main loop:
from py2neo import neo4j as neo
import sqlite3 as sql

# select all 5M rows from the SQL database
sql_str = """select * from bigram_with_number"""

# loop through each row
for (freq, first, firstfreq, second, secondfreq) in sql_cursor.execute(sql_str):
    # create the Cypher query string using Cypher 2.0 with MERGE
    # so that nodes are created only if needed
    query = neo.CypherQuery(neo4j_db, """
        CYPHER 2.0
        merge (n:word {form: {firstvar}, freq: {freqfirst}})
        merge (m:word {form: {secondvar}, freq: {freqsecond}})
        create unique (n)-[:bigram {freq: {freqbigram}}]->(m) return n, m""")
    # execute the string with parameters from the SQL query
    result = query.execute(freqbigram=freq, firstvar=first, freqfirst=firstfreq,
                           secondvar=second, freqsecond=secondfreq)
Although the database populates nicely, it will take weeks before it is finished.
I suspect it is possible to do this faster.
For bulk loading, you're probably better off bypassing the REST interface and using something lower level such as Michael Hunger's load tools: https://github.com/jexp/neo4j-shell-tools. Even at optimal performance, the REST interface is unlikely to ever achieve the speeds you're looking for.
As an aside, please note that I don't officially support Python 3.2 although I do support 3.3.
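If you do stay with py2neo, one common mitigation is to batch many SQL rows into a single parameterized UNWIND + MERGE statement, so each round trip does far more work than one row. The sketch below is illustrative only: it reuses the sql_cursor and neo4j_db objects from the question, assumes a Neo4j version new enough to support UNWIND (2.1+), and assumes py2neo passes a list-of-maps parameter through to Cypher unchanged.
from itertools import islice
from py2neo import neo4j as neo

# one query that merges a whole batch of bigrams at once (untested sketch)
batch_query = neo.CypherQuery(neo4j_db, """
    UNWIND {rows} AS row
    MERGE (n:word {form: row.first, freq: row.firstfreq})
    MERGE (m:word {form: row.second, freq: row.secondfreq})
    CREATE UNIQUE (n)-[:bigram {freq: row.freq}]->(m)""")

rows = sql_cursor.execute("select * from bigram_with_number")
while True:
    # pull the next 1000 SQL rows and ship them as a single Cypher call
    batch = [dict(zip(("freq", "first", "firstfreq", "second", "secondfreq"), r))
             for r in islice(rows, 1000)]
    if not batch:
        break
    batch_query.execute(rows=batch)
Fewer, larger requests are usually the biggest single win over one request per SQL row, whichever client library you end up using.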
Related
I am using MySQL with pandas and SQLAlchemy. However, it is extremely slow. A simple query like the one below takes more than 11 minutes to complete on a table with 11 million rows. What actions could improve this performance? The table mentioned does not have a primary key and is indexed on only one column.
from sqlalchemy import create_engine
import pandas as pd

sql_engine_access = 'mysql+pymysql://root:[password]@localhost'
sql_engine = create_engine(sql_engine_access, echo=False)
script = 'select * from my_database.my_table'
df = pd.read_sql(script, con=sql_engine)
You can try out our tool connectorx (pip install -U connectorx). It is implemented in Rust and targets improving the performance of pandas.read_sql. The API is basically the same as pandas'. For example, in your case the code would look like this:
import connectorx as cx
conn_url = "mysql://root:[password]#localhost:port/my_database"
query = "select * from my_table"
df = cx.read_sql(conn_url, query)
If there is a numerical column that is evenly distributed like ID in your query result, you can also further speed up the process by leveraging multiple cores like this:
df = cx.read_sql(conn_url, query, partition_on="ID", partition_num=4)
This would split the entire query into four smaller ones by filtering on the ID column, and connectorx will run them in parallel. You can check out the project documentation for more usage and examples.
Here is the benchmark result loading 60M rows x 16 columns from MySQL to pandas DataFrame using 4 cores:
While perhaps not the entire cause of the slow performance, one contributing factor is that PyMySQL (mysql+pymysql://) can be significantly slower than mysqlclient (mysql+mysqldb://) under heavy loads. In a very informal test (no multiple runs, no averaging, no server restarts) I saw the following results using pd.read_sql_query() against a local MySQL database:
rows retrieved    mysql+mysqldb (seconds)    mysql+pymysql (seconds)
1_000_000         13.6                       54.0
2_000_000         25.9                       114.1
3_000_000         38.9                       171.5
4_000_000         62.8                       217.0
5_000_000         78.3                       277.4
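Assuming the credentials from the question, switching drivers only requires installing mysqlclient and changing the scheme in the SQLAlchemy URL; a minimal, untested sketch:
from sqlalchemy import create_engine
import pandas as pd

# mysqlclient must be installed first (pip install mysqlclient);
# the 'mysql+mysqldb' scheme selects it, everything else stays the same
sql_engine = create_engine('mysql+mysqldb://root:[password]@localhost', echo=False)
df = pd.read_sql('select * from my_database.my_table', con=sql_engine)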
We are dealing with some performance issues after adding the DISTINCT keyword to our SQL queries.
The problem occurs only in the following scenario: 100,000 entries (or more) with only ~1% (or fewer) distinct values among them.
We boiled the issue down to the following minimal Python example (but it is not related to Python; MySQL Workbench behaves the same):
import mysql.connector
import time
import numpy as np

conn = mysql.connector.connect(user='user', password='password', host='server',
                               database='database', raise_on_warnings=True, autocommit=False)
cursor = conn.cursor()

# define amount of entries
max_exponent = 4.7
n_entry = 10**max_exponent

# fill table with 10, 100, ... distinct entries
for n_distinct in np.logspace(1, max_exponent, num=int(max_exponent)):
    # drop the BENCHMARK table if it already exists and create a new one
    cursor.execute("DROP TABLE IF EXISTS BENCHMARK")
    cursor.execute('CREATE TABLE BENCHMARK(ID INT)')

    # create distinct number set and insert a random permutation of it into the table
    distinct_numbers = range(int(n_distinct))
    random_numbers = np.random.randint(len(distinct_numbers), size=int(n_entry))
    value_string = ','.join([f"({i_name})" for i_name in random_numbers])
    mySql_insert_query = f"INSERT INTO BENCHMARK (ID) VALUES {value_string}"
    print(f'filling table with {n_entry:.0f} random values of {n_distinct:.0f} distinct numbers')
    cursor.execute(mySql_insert_query)
    conn.commit()

    # benchmark the DISTINCT call
    start = time.time()
    sql_query = 'SELECT DISTINCT ID from BENCHMARK'
    cursor.execute(sql_query)
    result = cursor.fetchall()
    print(f'Time to read {len(result)} distinct values: {time.time()-start:.2f}')

conn.close()
The extracted benchmark times show a counter-intuitive behaviour, where time suddenly increases for fewer distinct values in the table:
If we run the query without DISTINCT, the time drops to 170 ms, independent of the number of distinct entries.
We cannot make any sense of this dependence (except for some "hardware limitation", but 100,000 entries should be ... negligible?), so we are asking for insight into what the root cause of this behaviour might be.
The machine we are using for the database has the following specs:
CPU: Intel i5 @ 3.3 GHz (CPU load goes to 30% during execution)
RAM: 8 GB (mysqld takes about 2.4 GB, does not rise during query execution, InnoDB buffer usage stays at 42%, buffer_size = 4 GB)
HDD: 500 GB, ~90% empty
OS, MySQL: Windows 10, MySQL Server 8.0.18
Thanks for reading!
Having versus not having an index on id is likely to make a huge difference.
At some point, MySQL shifts gears -- There are multiple ways to do a GROUP BY or DISTINCT query:
Have a hash in memory and count how many of each.
Write to a temp table, sort it, then go through it counting how many distinct values
If there is a usable index, then skip from one value to the next.
The Optimizer cannot necessarily predict the best approach for a given situation, so there may be times when it fails to pick the optimal one. There was probably no way in the old 5.5 version (almost a decade old) to get insight into what the Optimizer chose to do; newer versions have EXPLAIN FORMAT=JSON and the "Optimizer Trace".
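For example (illustrative only, reusing the cursor and table from the benchmark script above, and requiring MySQL 8.0), you can compare the plan the optimizer picks with and without an index on ID:
# inspect the execution plan chosen for the DISTINCT query (MySQL 8.0+)
cursor.execute("EXPLAIN FORMAT=JSON SELECT DISTINCT ID FROM BENCHMARK")
print(cursor.fetchone()[0])  # JSON description of the chosen strategy

# add an index and check whether the optimizer switches strategy
cursor.execute("CREATE INDEX idx_id ON BENCHMARK(ID)")
cursor.execute("EXPLAIN FORMAT=JSON SELECT DISTINCT ID FROM BENCHMARK")
print(cursor.fetchone()[0])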
Another issue is I/O. Reading data from disk can slow down a query ten-fold. However, this does not seem to be an issue since the table is rather small. And you seem to run the query immediately after building the table; that is, the table is probably entirely cached in RAM (the buffer_pool).
I hope this adds some specifics to the Comments that say that benchmarking is difficult.
I am currently struggling to find a performant way of running multiple queries with py2neo. My problem is that I have a big list of write queries in Python that need to be written to Neo4j.
I have tried multiple ways to solve this so far. The best-working approach for me was the following:
from py2neo import Graph

queries = ["create (n) return id(n)", "create (n) return id(n)", ...]  ## list of queries
g = Graph()
t = g.begin(autocommit=False)
for idx, q in enumerate(queries):
    t.run(q)
    if idx % 100 == 0:
        t.commit()
        t = g.begin(autocommit=False)
t.commit()
It still takes too long to write the queries. I also tried the "run many" from APOC without success; the query never finished. I also tried the same writing method with auto-commit. Is there a better way to do this? Are there any tricks, like dropping indexes first and then adding them back after inserting the data?
-- Edit: Additional information:
I'm using Neo4j 3.4, Py2neo v4 and Python 3.7
You may want to read up on Michael Hunger's tips and tricks for fast batched updates.
The key trick is using UNWIND to transform list elements into rows, and then subsequent operations are performed per row.
There are supporting functions that can easily create lists for you, like range().
As an example, if you wanted to create 10k nodes and add a name property, then return the node name and its graph id, you could do something like this:
UNWIND range(1, 10000) as index
CREATE (n:Node {name:'Node ' + index})
RETURN n.name as name, id(n) as id
Likewise if you have a good amount of data to import, you can create a list of parameter maps, call the query, then UNWIND the list to operate on each entry at once, similar to how we process CSV files with LOAD CSV.
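In py2neo v4 terms, a rough sketch of that pattern might look like this (the :Node label, the name property, and the batch size are made up for illustration):
from py2neo import Graph

graph = Graph()

# hypothetical data: one parameter map per node to create
rows = [{"name": f"Node {i}"} for i in range(10000)]

cypher = """
UNWIND $rows AS row
CREATE (n:Node {name: row.name})
RETURN count(n) AS created
"""

batch_size = 1000
for start in range(0, len(rows), batch_size):
    tx = graph.begin()
    tx.run(cypher, rows=rows[start:start + batch_size])
    tx.commit()
One transaction per sizeable batch, rather than per query, is usually what makes the difference.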
I'm trying to extract information from Oracle into Python lists in order to use them as inputs in functions. I am using the following code:
import cx_Oracle

dsnRiesgos = cx_Oracle.makedsn(host="MYHOST", port="MYPORT", sid="MYSID")
conect = cx_Oracle.connect(user="USER", password="PASS", dsn=dsnRiesgos)
cursor = conect.cursor()

query = """ MY_QUERY """

Referencias = []
Referencias_Exp = []

cursor.execute(query)

# The result is a view with five columns and 400,000+ rows
for row in cursor:
    Referencias.append(row[1])
    Referencias_Exp.append([row[1], row[4]])
The problem I have is that the output from 'query' is 400,000+ rows, and it is taking forever to append them to the lists (I stopped it after 15 minutes). My intuition tells me there is a more efficient way to do this, but I don't know how.
I am using Windows 7, Python 3.6.2, Oracle client: instantclient-basic-windows.x64-11.2.0.4.0.
I am a beginner with Python and this is the first time I have connected it to Oracle, so 'basic' concepts might be unknown to me.
Since you only seem to need two of the five columns (the ones you read as row[1] and row[4]), I would advise that you create a query that only selects those.
Then, using fetchall() (http://cx-oracle.readthedocs.io/en/latest/cursor.html) might be of assistance to you. It will bring you a list of tuples, each tuple being one of the rows yielded by your query.
And to move from the 'how' to the 'why': why do you need all 400k rows in a list before processing it? Can't you avoid this step? Minor optimizations aside, this will be inherently slow and would be best avoided.
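A minimal sketch combining both suggestions, where COL_B, COL_E and MY_VIEW are made-up placeholders for the two columns actually used and for the view from the question:
import cx_Oracle

dsnRiesgos = cx_Oracle.makedsn(host="MYHOST", port="MYPORT", sid="MYSID")
conect = cx_Oracle.connect(user="USER", password="PASS", dsn=dsnRiesgos)
cursor = conect.cursor()

# fetch only the two columns that are used, in larger network round trips
cursor.arraysize = 10000
cursor.execute("SELECT COL_B, COL_E FROM MY_VIEW")  # placeholder query

rows = cursor.fetchall()  # list of (col_b, col_e) tuples
Referencias = [r[0] for r in rows]
Referencias_Exp = [[r[0], r[1]] for r in rows]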
I have a database with roughly 30 million entries, which is a lot, and I don't expect anything but trouble working with a database that large.
But using py-postgresql and the .prepare() statement, I had hoped I could fetch entries on a "yield" basis and thus avoid filling up my memory with the results from the database, which apparently I can't?
This is what I've got so far:
import postgresql

user = 'test'
passwd = 'test'

db = postgresql.open('pq://' + user + ':' + passwd + '@192.168.1.1/mydb')
results = db.prepare("SELECT time FROM mytable")

uniqueue_days = []
with db.xact():
    for row in results():
        if not row['time'] in uniqueue_days:
            uniqueue_days.append(row['time'])

print(uniqueue_days)
Before even getting to if not row['time'] in uniqueue_days: I run out of memory, which isn't so strange considering results() probably fetches all rows before looping through them?
Is there a way to get the py-postgresql library to "page" or batch down the results, say 60k per round, or perhaps even rework the query to do more of the work?
Thanks in advance!
Edit: I should mention that the dates in the database are Unix timestamps, and I intend to convert them into %Y-%m-%d format before adding them to the uniqueue_days list.
If you were using the better-supported psycopg2 extension, you could use a loop over the client cursor, or fetchone, to get just one row at a time, as psycopg2 uses a server-side portal to back its cursor.
If py-postgresql doesn't support something similar, you could always explicitly DECLARE a cursor on the database side and FETCH rows from it progressively. I don't see anything in the documentation that suggests py-postgresql can do this for you automatically at the protocol level like psycopg2 does.
Usually you can switch between database drivers pretty easily, but py-postgresql doesn't seem to follow the Python DB-API, so switching will take a few more changes. I still recommend trying it.
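For reference, a minimal sketch of that approach with psycopg2's server-side (named) cursor, assuming the same credentials, host and table as in the question:
import psycopg2

conn = psycopg2.connect(host='192.168.1.1', dbname='mydb',
                        user='test', password='test')

uniqueue_days = set()
with conn, conn.cursor(name='mytable_cursor') as cur:  # named => server-side cursor
    cur.itersize = 60000  # rows fetched per round trip to the server
    cur.execute("SELECT time FROM mytable")
    for (time_value,) in cur:
        uniqueue_days.add(time_value)

print(sorted(uniqueue_days))
Using a set instead of a list also sidesteps the linear membership test in the original loop.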
You could let the database do all the heavy lifting.
Ex: Instead of reading all the data into Python and then calculating unique_dates, why not try something like this:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES FROM mytable;
If you want to strictly enforce sort order on unique_dates returned then do the following:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES
FROM mytable
order by 1;
Useful references for the functions used above:
Date/Time Functions and Operators
Data Type Formatting Functions
If you would like to read data in chunks you could use the dates you get from above query to subset your results further down the line:
Ex:
'SELECT * FROM mytable WHERE time BETWEEN ' + UNIQUE_DATES[i] + ' AND ' + UNIQUE_DATES[j]
Where UNIQUE_DATES[i] and UNIQUE_DATES[j] are parameters you would pass from Python.
I will leave it to you to figure out how to convert the dates back into Unix timestamps.