How to handle Oracle information in Python?

I'm trying to extract information from Oracle to Python lists in order to use them as inputs in functions. I am using the following code:
import cx_Oracle

dsnRiesgos = cx_Oracle.makedsn(host="MYHOST", port="MYPORT", sid="MYSID")
conect = cx_Oracle.connect(user="USER", password="PASS", dsn=dsnRiesgos)
cursor = conect.cursor()

query = """ MY_QUERY """

Referencias = []
Referencias_Exp = []

cursor.execute(query)

# The result is a view with five columns and 400,000+ rows
for row in cursor:
    Referencias.append(row[1])
    Referencias_Exp.append([row[1], row[4]])
The problem I have is that the output of 'query' is 400,000+ rows, and it is taking forever to insert them into the lists (I stopped it after 15 minutes). My intuition tells me there is a more efficient way to do this, but I don't know how.
I am using Windows 7, Python 3.6.2, Oracle client: instantclient-basic-windows.x64-11.2.0.4.0.
I am a beginner with Python and this is the first time I have connected it to Oracle, so 'basic' concepts might be unknown to me.

Since you only seem to need the second and fifth columns (row[1] and row[4]), I would advise that you create a query that only selects those.
Then, using fetchall() (http://cx-oracle.readthedocs.io/en/latest/cursor.html) might be of assistance to you. It will bring you a list of tuples, each tuple being one of the rows yielded by your query.
And to move from the 'how' to the 'why': why do you need all 400k rows in a list before processing it? Can't you avoid this step? Minor optimizations aside, this will be inherently slow and would be best avoided.
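For instance, a minimal sketch of that approach, assuming a placeholder view name MY_VIEW and hypothetical column names REF and EXP for the two columns you actually use (substitute your real names):

import cx_Oracle

dsnRiesgos = cx_Oracle.makedsn(host="MYHOST", port="MYPORT", sid="MYSID")
conect = cx_Oracle.connect(user="USER", password="PASS", dsn=dsnRiesgos)

cursor = conect.cursor()
cursor.arraysize = 10000  # fetch rows from the server in larger batches

# Only pull the two columns that are actually needed; REF and EXP are placeholder names.
cursor.execute("SELECT REF, EXP FROM MY_VIEW")

rows = cursor.fetchall()                    # list of (ref, exp) tuples
Referencias = [r[0] for r in rows]
Referencias_Exp = [list(r) for r in rows]

Raising arraysize reduces the number of network round trips cx_Oracle makes, which is usually where most of the time goes with large result sets.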

Related

Query writing performance on neo4j with py2neo

I'm currently struggling to find a performant way to run multiple queries with py2neo. My problem is that I have a big list of write queries in Python that need to be written to Neo4j.
I have tried multiple ways to solve the issue. The best working approach for me was the following:
from py2neo import Graph

queries = ["create (n) return id(n)", "create (n) return id(n)", ...]  # list of queries

g = Graph()
t = g.begin(autocommit=False)
for idx, q in enumerate(queries):
    t.run(q)
    if idx % 100 == 0:
        t.commit()
        t = g.begin(autocommit=False)
t.commit()
It still takes too long to write the queries. I also tried the run-many approach from APOC without success; the query never finished. I also tried the same writing method with autocommit. Is there a better way to do this? Are there any tricks, like dropping indexes first and adding them back after inserting the data?
Edit: additional information:
I'm using Neo4j 3.4, py2neo v4, and Python 3.7.
You may want to read up on Michael Hunger's tips and tricks for fast batched updates.
The key trick is using UNWIND to transform list elements into rows, and then subsequent operations are performed per row.
There are supporting functions that can easily create lists for you, like range().
As an example, if you wanted to create 10k nodes and add a name property, then return the node name and its graph id, you could do something like this:
UNWIND range(1, 10000) as index
CREATE (n:Node {name:'Node ' + index})
RETURN n.name as name, id(n) as id
Likewise if you have a good amount of data to import, you can create a list of parameter maps, call the query, then UNWIND the list to operate on each entry at once, similar to how we process CSV files with LOAD CSV.
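As a rough sketch of that pattern with py2neo (the :Node label and name property here are placeholders, not your data model):

from py2neo import Graph

graph = Graph()  # assumes a local Neo4j instance with default connection settings

# Hypothetical batch of parameter maps to import in one call.
rows = [{"name": "Node {}".format(i)} for i in range(10000)]

query = """
UNWIND $rows AS row
CREATE (n:Node {name: row.name})
RETURN count(n) AS created
"""

# One query call sends the whole batch; the server unwinds it into per-row operations.
created = graph.run(query, rows=rows).evaluate()
print(created)

Keeping each batch to a few thousand maps per call is a common way to keep transaction sizes manageable.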

Inserting rows while looping over result set

I am working on a program to clone rows in my database from one user to another. It works by selecting the rows, editing a few values, and then inserting them back.
I also need to store the newly inserted rowIDs with their existing counterparts so I can clone some other link tables later on.
My code looks like the following:
import mysql.connector
from collections import namedtuple

con = mysql.connector.connect(host='127.0.0.1')

selector = con.cursor(prepared=True)
insertor = con.cursor(prepared=True)

user_map = {}

selector.execute('SELECT * FROM users WHERE companyID = ?', (56, ))
Row = namedtuple('users', selector.column_names)

for row in selector:
    curr_row = Row._make(row)
    new_row = curr_row._replace(userID=None, companyID=95)
    insertor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = insertor.lastrowid

selector.close()
insertor.close()
When running this code, I get the following error:
mysql.connector.errors.InternalError: Unread result found
I'm assuming this is because I am trying to run an INSERT while I am still looping over the SELECT, but I thought using two cursors would fix that. Why do I still get this error with multiple cursors?
I found a solution using fetchall(), but I was afraid that would use too much memory as there could be thousands of results returned from the SELECT.
import mysql.connector
from collections import namedtuple

con = mysql.connector.connect(host='127.0.0.1')

cursor = con.cursor(prepared=True)

user_map = {}

cursor.execute('SELECT * FROM users WHERE companyID = ?', (56, ))
Row = namedtuple('users', cursor.column_names)

for curr_row in map(Row._make, cursor.fetchall()):
    new_row = curr_row._replace(userID=None, companyID=95)
    cursor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = cursor.lastrowid

cursor.close()
This works, but it's not very fast. I was thinking that not using fetchall() would be quicker, but it seems if I do not fetch the full result set then MySQL yells at me.
Is there a way to insert rows while looping over a result set without fetching the entire result set?
Yes. Use two MySQL connections: one for reading and the other for writing.
The performance impact isn't too bad, as long as you don't have thousands of instances of the program trying to connect to the same MySQL server.
One connection is reading a result set, and the other is inserting rows to the end of the same table, so you shouldn't have a deadlock. It would be helpful if the WHERE condition you use to read the table could explicitly exclude the rows you're inserting, if there's a way to tell the new rows apart from the old rows.
At some level, the performance impact of two connections doesn't matter because you don't have much choice. The only other way to do what you want to do is slurp the whole result set into RAM in your program, close your reading cursor, and then write.
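A minimal sketch of the two-connection version of your loop, assuming the same table layout as your original code:

import mysql.connector
from collections import namedtuple

# Separate connections: one streams the SELECT, the other runs the INSERTs.
read_con = mysql.connector.connect(host='127.0.0.1')
write_con = mysql.connector.connect(host='127.0.0.1')

selector = read_con.cursor(prepared=True)
insertor = write_con.cursor(prepared=True)

user_map = {}

selector.execute('SELECT * FROM users WHERE companyID = ?', (56, ))
Row = namedtuple('users', selector.column_names)

for row in selector:  # rows stream in from the read connection
    curr_row = Row._make(row)
    new_row = curr_row._replace(userID=None, companyID=95)
    insertor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = insertor.lastrowid

write_con.commit()
selector.close()
insertor.close()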

Memory-efficient way of fetching unique PostgreSQL dates?

I have a database with roughly 30 million entries, which is a lot, and I don't expect anything but trouble working with result sets that large.
Using py-postgresql and its .prepare() statement, I had hoped I could fetch entries on a "yield" basis and thus avoid filling up my memory with the results from the database, but apparently I can't?
This is what I've got so far:
import postgresql

user = 'test'
passwd = 'test'

db = postgresql.open('pq://' + user + ':' + passwd + '@192.168.1.1/mydb')
results = db.prepare("SELECT time FROM mytable")

uniqueue_days = []

with db.xact():
    for row in results():
        if not row['time'] in uniqueue_days:
            uniqueue_days.append(row['time'])

print(uniqueue_days)
Before even getting to if not row['time'] in uniqueue_days: I run out of memory, which isn't so strange considering results() probably fetches all the results before looping through them?
Is there a way to get the py-postgresql library to "page" or batch the results, say 60k per round, or perhaps rework the query to do more of the work?
Thanks in advance!
Edit: I should mention that the dates in the database are Unix timestamps, and I intend to convert them into %Y-%m-%d format before adding them to the uniqueue_days list.
If you were using the better-supported psycopg2 extension, you could use a loop over the client cursor, or fetchone, to get just one row at a time, as psycopg2 uses a server-side portal to back its cursor.
If py-postgresql doesn't support something similar, you could always explicitly DECLARE a cursor on the database side and FETCH rows from it progressively. I don't see anything in the documentation that suggests py-postgresql can do this for you automatically at the protocol level like psycopg2 does.
Usually you can switch between database drivers pretty easily, but py-postgresql doesn't seem to follow the Python DB-API, so switching will take a few more changes. I still recommend making the switch.
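For example, here is a minimal sketch with psycopg2 and a named (server-side) cursor; the connection details are placeholders, and the strftime conversion matches the %Y-%m-%d format mentioned in the edit:

from datetime import datetime, timezone

import psycopg2

conn = psycopg2.connect(host="192.168.1.1", dbname="mydb", user="test", password="test")

with conn:
    # A named cursor is backed by a server-side portal, so rows arrive in batches.
    with conn.cursor(name="mytable_time") as cur:
        cur.itersize = 60000  # rows pulled per round trip
        cur.execute("SELECT time FROM mytable")

        unique_days = set()
        for (ts,) in cur:
            day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
            unique_days.add(day)

print(sorted(unique_days))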
You could let the database do all the heavy lifting.
Ex: Instead of reading all the data into Python and then calculating the unique dates, why not try something like this:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES FROM mytable;
If you want to strictly enforce sort order on unique_dates returned then do the following:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES
FROM mytable
order by 1;
Useful references for the functions used above:
Date/Time Functions and Operators
Data Type Formatting Functions
If you would like to read data in chunks you could use the dates you get from above query to subset your results further down the line:
Ex:
'SELECT * FROM mytable WHERE time BETWEEN ' + UNIQUE_DATES[i] + ' AND ' + UNIQUE_DATES[j]
where UNIQUE_DATES[i] and UNIQUE_DATES[j] are parameters you would pass from Python.
I will leave it to you to figure out how to convert the dates back into Unix timestamps.
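A rough sketch of running the DISTINCT query through py-postgresql, reusing the connection details from the question:

import postgresql

db = postgresql.open('pq://test:test@192.168.1.1/mydb')

# The server computes the distinct days, so only a small result set comes back.
get_unique_dates = db.prepare(
    "SELECT DISTINCT DATE(to_timestamp(time)) AS unique_date "
    "FROM mytable ORDER BY 1"
)

unique_dates = [row['unique_date'] for row in get_unique_dates()]
print(unique_dates)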

Any easy way to alter the data that comes from a mysql database?

So I'm using MySQL to grab data from a database and feed it into a Python function. I import MySQLdb, connect to the database, and run a query like this:
conn.query('SELECT info FROM bag')
x = conn.store_result()
for row in x.fetch_row(100):
    print row
But my problem is that my data comes out like this: (1.234234,) (1.12342,) (3.123412,)
when I really want it to come out like this: 1.23424, 1.1341234, 5.1342314 (i.e. without parentheses). I need it this way to feed it into a Python function. Does anyone know how I can grab data from the database in a way that doesn't have parentheses?
Rows are returned as tuples, even if there is only one column in the query. You can access the first and only item as row[0].
The first time around in the for loop, row does indeed refer to the first row. The second time around, it refers to the second row, and so on.
By the way, you say that you are using MySQLdb, but the methods you are using are from the underlying _mysql library (low-level, scarcely portable)... why?
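For reference, a sketch of the standard MySQLdb cursor interface the answer is hinting at (connection details are placeholders):

import MySQLdb

conn = MySQLdb.connect(host='localhost', user='me', passwd='secret', db='mydb')
cursor = conn.cursor()
cursor.execute('SELECT info FROM bag')

# Each row is a one-element tuple, so unpack it in the loop header.
for (info,) in cursor.fetchmany(100):
    print(info)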
You could also simply use this as your for loop:
for (info, ) in x.fetch_row(100):
    print info

Batch select with SQLAlchemy

I have a large set of values V, some of which are likely to exist in a table T. I would like to insert into the table those which are not yet inserted. So far I have the code:
to_insert = []
for value in values:
    s = self.conn.execute(mytable.__table__.select(mytable.value == value)).first()
    if not s:
        to_insert.append(value)
I feel like this is running slower than it should. I have a few related questions:
Is there a way to construct a select statement such that you provide a list (in this case, 'values') to which sqlalchemy responds with records which match that list?
Is this code overly expensive in constructing select objects? Is there a way to construct a single select statement, then parameterize at execution time?
For the first question, something like this should work, if I understand you correctly:
mytable.__table__.select(mytable.value.in_(values))
For the second question, querying one row at a time is indeed overly expensive, although you might not have a choice in the matter. As far as I know there is no tuple select support in SQLAlchemy, so if there are multiple variables (think polymorphic keys) then SQLAlchemy can't help you.
Either way, if you select all matching rows and insert the difference you should be done :)
Something like this should work:
results = self.conn.execute(mytable.__table__.select(mytable.value.in_(values)))
available_values = set(row.value for row in results)
to_insert = set(values) - available_values
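To finish the job, the missing values can then be inserted in one executemany-style call. This is only a sketch and assumes value is the single column you need to populate:

# Insert everything that was not already present, in a single execute call.
if to_insert:
    self.conn.execute(
        mytable.__table__.insert(),
        [{"value": v} for v in to_insert],
    )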
