Memory efficient way of fetching PostgreSQL unique dates? - python

I have a database with roughly 30 million entries, which is a lot, and I don't expect anything but trouble working with result sets that large.
But using py-postgresql and its .prepare() statement, I would hope I could fetch entries on a "yield" basis and thus avoid filling up my memory with the results from the database, which apparently I can't?
This is what I've got so far:
import postgresql
user = 'test'
passwd = 'test'
db = postgresql.open('pq://' + user + ':' + passwd + '@192.168.1.1/mydb')
results = db.prepare("SELECT time FROM mytable")
uniqueue_days = []
with db.xact():
    for row in results():
        if not row['time'] in uniqueue_days:
            uniqueue_days.append(row['time'])
print(uniqueue_days)
Before even getting to if not row['time'] in uniqueue_days: I run out of memory, which isn't so strange considering results() probably fetches all the rows before looping through them?
Is there a way to get the py-postgresql library to "page" or batch down the results to, say, 60k per round, or perhaps even to rework the query so the database does more of the work?
Thanks in advance!
Edit: I should mention that the dates in the database are Unix timestamps, and I intend to convert them into %Y-%m-%d format prior to adding them to the uniqueue_days list.

If you were using the better-supported psycopg2 extension, you could use a loop over the client cursor, or fetchone, to get just one row at a time, as psycopg2 uses a server-side portal to back its cursor.
If py-postgresql doesn't support something similar, you could always explicitly DECLARE a cursor on the database side and FETCH rows from it progressively. I don't see anything in the documentation that suggests py-postgresql can do this for you automatically at the protocol level like psycopg2 does.
Usually you can switch between database drivers pretty easily, but py-postgresql doesn't seem to follow the Python DB-API, so porting the code will take a few more changes. I still recommend the switch.
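For illustration, here is a minimal sketch of the psycopg2 server-side cursor approach, assuming the connection details, table and column names from the question; a named cursor is backed by a server-side portal, so rows are streamed in batches rather than loaded all at once:

import datetime
import psycopg2

conn = psycopg2.connect(host='192.168.1.1', dbname='mydb', user='test', password='test')
crs = conn.cursor(name='time_cursor')   # named cursor -> server-side portal
crs.itersize = 60000                    # rows fetched from the server per round trip
crs.execute("SELECT time FROM mytable")

unique_days = set()
for (ts,) in crs:
    unique_days.add(datetime.date.fromtimestamp(ts).strftime('%Y-%m-%d'))

crs.close()
conn.close()
print(sorted(unique_days))

Using a set instead of a list also keeps the membership test cheap once the number of distinct days grows.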

You could let the database do all the heavy lifting.
For example: instead of reading all the data into Python and then calculating the unique dates, why not try something like this:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES FROM mytable;
If you want to strictly enforce a sort order on the unique dates returned, then do the following:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES
FROM mytable
ORDER BY 1;
Useful references for the functions used above:
Date/Time Functions and Operators
Data Type Formatting Functions
If you would like to read the data in chunks, you could use the dates you get from the above query to subset your results further down the line, for example:
'SELECT * FROM mytable WHERE time BETWEEN ' + UNIQUE_DATES[i] + ' AND ' + UNIQUE_DATES[j] + ';'
where UNIQUE_DATES[i] and UNIQUE_DATES[j] are parameters you would pass from Python.
I will leave it to you to figure out how to convert a date back into a Unix timestamp.
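As a rough sketch of how that could look with py-postgresql itself (assuming the same connection and table names as in the question, and using a $1 placeholder instead of string concatenation):

import postgresql

db = postgresql.open('pq://test:test@192.168.1.1/mydb')

# Let the database compute the distinct calendar dates.
distinct_days = db.prepare(
    "SELECT DISTINCT DATE(to_timestamp(time)) AS unique_date FROM mytable ORDER BY 1")
unique_dates = [row['unique_date'] for row in distinct_days()]

# Then pull back one day's worth of rows at a time, passing the date as a
# bound parameter rather than concatenating it into the SQL string.
day_rows = db.prepare("SELECT * FROM mytable WHERE DATE(to_timestamp(time)) = $1")
for day in unique_dates:
    for row in day_rows(day):
        pass  # process one day's rows here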

Related

SQL Select statement optimization

I have made a test table in SQL with the information schema as shown:
Now I extract this information using a Python script, the code of which is shown below:
import pandas as pd
import mysql.connector
db = mysql.connector.connect(host="localhost", user="root", passwd="abcdef")
pointer = db.cursor()
pointer.execute("use holdings")
x = "Select * FROM orders where tradingsymbol like 'TATACHEM'"
pointer.execute(x)
rows = pointer.fetchall()
rows = pd.DataFrame(rows)
stock = rows[1]
The production table contains 200 unique trading symbols and has a schema similar to the test table.
My concern is that, for the following statement:
x = "Select * FROM orders where tradingsymbol like 'TATACHEM'"
I will have to replace the value of tradingsymbol 200 times, which is inefficient.
Is there a more effective way to do this?
If I understand you correctly, your problem is that you want to avoid sending a separate query for each trading symbol, correct? In this case MySQL's IN operator might be of help: you could then send a single query to the database containing all the trading symbols you want. If you want to do different things with the various trading symbols, you can select the subsets within pandas.
Another performance improvement could come from pandas.read_sql, since it speeds up the creation of the dataframe somewhat.
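A minimal sketch of both ideas together, assuming the connection and table names from the question and a hypothetical symbols list (pandas.read_sql officially targets SQLAlchemy connections, but it generally also accepts a DBAPI connection like this one):

import pandas as pd
import mysql.connector

db = mysql.connector.connect(host="localhost", user="root",
                             passwd="abcdef", database="holdings")

symbols = ['TATACHEM', 'INFY', 'RELIANCE']   # hypothetical list of symbols

# One IN (...) query with placeholders instead of 200 separate queries.
placeholders = ', '.join(['%s'] * len(symbols))
query = "SELECT * FROM orders WHERE tradingsymbol IN (%s)" % placeholders

orders = pd.read_sql(query, con=db, params=symbols)

# Per-symbol subsets can then be selected within pandas.
tatachem = orders[orders['tradingsymbol'] == 'TATACHEM']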
Two more things to add for efficiency:
Ensure that tradingsymbol is indexed in MySQL for faster lookups.
Make tradingsymbol an ENUM to ensure that no typos or the like are accepted. Otherwise the above-mentioned IN method is also less efficient, since it has to do full string comparisons.

How to handle Oracle information in Python?

I'm trying to extract information from Oracle into Python lists in order to use them as inputs in functions. I am using the following code:
import cx_Oracle
dsnRiesgos = cx_Oracle.makedsn(host="MYHOST", port="MYPORT", sid="MYSID")
conect = cx_Oracle.connect(user="USER", password="PASS", dsn=dsnRiesgos)
cursor = conect.cursor()
query = """ MY_QUERY """
Referencias = []
Referencias_Exp = []
cursor.execute(query)
# The result is a view with five columns and 400,000+ rows
for row in cursor:
    Referencias.append(row[1])
    Referencias_Exp.append([row[1], row[4]])
The problem I have is that the output from 'query' is 400,000+ rows, and it is taking forever to complete the insertion into the lists (I stopped it after 15 minutes). My intuition tells me that there is a more efficient way to do this, but I don't know how.
I am using Windows 7, Python 3.6.2, Oracle client: instantclient-basic-windows.x64-11.2.0.4.0.
I am a beginner with Python and it's the first time I have connected it with Oracle, so 'basic' concepts might be unknown to me.
Since you only seem to need the first and fourth columns, I would advise that you create a query that only gets those.
Then, using fetchall() (http://cx-oracle.readthedocs.io/en/latest/cursor.html) might be of assistance to you. It will bring you a list of tuples, each tuple being one of the rows yielded by your query.
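A minimal sketch of that, with hypothetical column names COL1 and COL4 standing in for the two columns of the real view:

import cx_Oracle

dsnRiesgos = cx_Oracle.makedsn(host="MYHOST", port="MYPORT", sid="MYSID")
conect = cx_Oracle.connect(user="USER", password="PASS", dsn=dsnRiesgos)
cursor = conect.cursor()

cursor.arraysize = 10000  # fetch rows from Oracle in large batches
cursor.execute("SELECT COL1, COL4 FROM MY_VIEW")  # hypothetical view/column names

rows = cursor.fetchall()                      # list of (col1, col4) tuples
Referencias = [r[0] for r in rows]
Referencias_Exp = [[r[0], r[1]] for r in rows]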
And to move from the 'how' to the 'why': why do you need all 400k rows in a list before processing it? Can't you avoid this step? Minor optimizations aside, this will be inherently slow and would be best avoided.

psycopg2: each execute yields a different table and the cursor is not scrolling to the beginning

I am querying a postgresql database through python's psycopg2 package.
In short: the problem is that psycopg2's fetchmany() yields a different table every time I run a psycopg2 cursor.execute() command.
import psycopg2 as ps
conn = ps.connect(database='database', user='user')
crs = conn.cursor()
nlines = 1000
tab = "some_table"
statement = """SELECT * FROM """ + tab + " LIMIT %d;" % (nlines)
crs.execute(statement)
Then I fetch the data in pieces. Running the following executes just fine, and each time I scroll back to the beginning I get the same results.
rec=crs.fetchmany(10)
crs.scroll(0, mode='absolute')
print rec[-1][-2]
However, if I run crs.execute(statement) again and then fetch the data, it yields a completely different output. I tried running ps.connect again, doing conn.rollback(), conn.reset(), crs.close(), and nothing ever resulted in consistent output from the table. I also tried a named cursor with scrollable enabled:
crs= conn.cursor(name="cur1")
crs.scrollable=1
...
crs.scroll(0, mode= 'absolute')
still no luck.
You don't have any ORDER BY clause in your query, and Postgres does not guarantee any particular ordering without one. It's particularly likely to change ordering for tables which have lots of churn (i.e. lots of inserts/updates/deletes).
See the Postgres SELECT doc for more details, but the most salient snippet here is this:
If the ORDER BY clause is specified, the returned rows are sorted in the specified order. If ORDER BY is not given, the rows are returned in whatever order the system finds fastest to produce.
I wouldn't expect any query, regardless of the type of cursor used, to necessarily return the exact same result set given a query of this type.
What happens when you add an explicit ORDER BY?
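For example, a sketch with an ORDER BY on a hypothetical id primary key column, reusing the question's variables:

# Assumes "some_table" has a primary key column named id (hypothetical).
statement = "SELECT * FROM " + tab + " ORDER BY id LIMIT %d;" % nlines
crs.execute(statement)
rec = crs.fetchmany(10)
crs.scroll(0, mode='absolute')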

Any easy way to alter the data that comes from a MySQL database?

So I'm using MySQL to grab data from a database and feed it into a Python function. I import MySQLdb, connect to the database and run a query like this:
conn.query('SELECT info FROM bag')
x = conn.store_result()
for row in x.fetch_row(100):
    print row
but my problem is that my data comes out like this: (1.234234,)(1.12342,)(3.123412,)
when I really want it to come out like this: 1.23424, 1.1341234, 5.1342314 (i.e. without parentheses). I need it this way to feed it into a Python function. Does anyone know how I can grab the data from the database in a way that doesn't have parentheses?
Rows are returned as tuples, even if there is only one column in the query. You can access the first and only item as row[0].
The first time around in the for loop, row does indeed refer to the first row. The second time around, it refers to the second row, and so on.
By the way, you say that you are using MySQLdb, but the methods you are using are from the underlying _mysql library (low level, scarcely portable)... why?
You could also simply use this as your for loop:
for (info, ) in x.fetch_row(100):
    print info
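As a sketch of the higher-level MySQLdb cursor API mentioned above (the connection arguments here are hypothetical):

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="user", passwd="pass", db="mydb")
cur = conn.cursor()
cur.execute("SELECT info FROM bag")

# Unpack the single-column tuples into a plain list of values.
values = [info for (info,) in cur.fetchall()]
print values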

SQLAlchemy: Scan huge tables using ORM?

I am currently playing around with SQLAlchemy a bit, which is really quite neat.
For testing I created a huge table containing my picture archive, indexed by SHA1 hashes (to remove duplicates :-)). Which was impressively fast...
For fun I did the equivalent of a select * over the resulting SQLite database:
session = Session()
for p in session.query(Picture):
    print(p)
I expected to see hashes scrolling by, but instead it just kept scanning the disk. At the same time, memory usage was skyrocketing, reaching 1GB after a few seconds. This seems to come from the identity map feature of SQLAlchemy, which I thought was only keeping weak references.
Can somebody explain this to me? I thought that each Picture p would be collected after the hash is written out!?
Okay, I just found a way to do this myself. Changing the code to
session = Session()
for p in session.query(Picture).yield_per(5):
    print(p)
loads only 5 pictures at a time. It seems like the query loads all rows at once by default. However, I don't yet understand the disclaimer on that method. Quote from the SQLAlchemy docs:
WARNING: use this method with caution; if the same instance is present in more than one batch of rows, end-user changes to attributes will be overwritten.
In particular, it’s usually impossible to use this setting with eagerly loaded collections (i.e. any lazy=False) since those collections will be cleared for a new load when encountered in a subsequent result batch.
So if using yield_per is actually the right way (tm) to scan over copious amounts of SQL data while using the ORM, when is it safe to use it?
Here's what I usually do for this situation:
def page_query(q):
    offset = 0
    while True:
        r = False
        for elem in q.limit(1000).offset(offset):
            r = True
            yield elem
        offset += 1000
        if not r:
            break

for item in page_query(Session.query(Picture)):
    print(item)
This avoids the various buffering that DBAPIs do as well (such as psycopg2 and MySQLdb). It still needs to be used appropriately if your query has explicit JOINs, although eagerly loaded collections are guaranteed to load fully, since they are applied to a subquery which has the actual LIMIT/OFFSET supplied.
I have noticed that PostgreSQL takes almost as long to return the last 100 rows of a large result set as it does to return the entire result (minus the actual row-fetching overhead), since OFFSET just does a simple scan of the whole thing.
You can defer loading of the picture column so that it is only retrieved on access. You can do it on a query-by-query basis, like this:
session = Session()
for p in session.query(Picture).options(sqlalchemy.orm.defer("picture")):
    print(p)
or you can do it in the mapper:
mapper(Picture, pictures, properties={
    'picture': deferred(pictures.c.picture)
})
How to do it is described in the documentation here.
Doing it either way will make sure that the picture is only loaded when you access the attribute.
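A short usage sketch of that behaviour, assuming the Picture model has a sha1 column alongside the deferred picture column:

session = Session()
for p in session.query(Picture).options(sqlalchemy.orm.defer("picture")):
    print(p.sha1)    # cheap: the deferred blob is not loaded here
    # p.picture      # accessing the attribute would issue a second SELECT
    #                # to fetch the deferred blob for this row only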
