How can I reduce my sqlite3 code using variables? - Python

I have around 500 lines of sqlite3 code and I am trying to reduce that number, so I tried making a function with variables.
I tried using this code:
x = "c.execute("
o = "'"
y = "'INSERT INTO "
unio = x+y
tabl = ["d","d_2","n_d","n_d_2"]
val = "VALUES ("
w = ") "
k = "?,?"
mark = 369
ek = 963
def db_up(table):
    aa = y+table+"("+tabl[0] + ", " + tabl[1]+w+val+k+")"+"',"+"(mark,ek,)"
    bb = unio+table+"("+tabl[1]+ ", " + tabl[1]+w+val+k+")"+"',"+"(mark,ek,))"
    print(aa)  # 'INSERT INTO avg_dt(d, d_2) VALUES (?,?)',(mark,ek,)
    print(bb)  # c.execute('INSERT INTO avg_dt(d_2, d_2) VALUES (?,?)',(mark,ek,))
    c.execute(str(aa))  # no success
    c.execute(aa)       # no success
    bb                  # no success
When I run the "c.execute(aa)" line, it throws this error:
sqlite3.OperationalError: near "'INSERT INTO avg_dt(d, d_2) VALUES (?,?)'": syntax error
So... how can I build sqlite3 statements using variables and functions?
Thanks for taking the time ;)

I can't post a comment with my current rep, so let me try to provide a full answer.
Your situation seems to be that you have too much SQL code, which you consider a problem, so you're trying to "compress" the code by factoring common statement fragments into shorter variable names. There are a number of reasons you should avoid doing this, and a number of better ways to achieve a similar result.
First, let's talk about why this is a bad idea:
Your code doesn't become more maintainable. Interspersing cryptic variable names like x, o, y, or unio doesn't make the code any easier to read, even for the author, assuming a few days have passed since you wrote it.
Using this kind of method doesn't make your code any more performant, and most likely makes it less performant: your program now has to worry about allocating and reallocating memory when doing string interpolation or concatenation, which takes cycles.
Doing string interpolation or concatenation in SQL should be done with extreme caution: this is essentially a homebrew version of prepared statements, which are usually authored by people with loads of experience in SQL programming. Rolling this on your own risks exposing your program to SQL injection (or at least column/value type mismatches).
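For reference, the sqlite3 module already gives you safe parameter binding via ? placeholders, so none of the quoting or concatenation above is needed. A minimal sketch, using the avg_dt table and the d / d_2 columns from your printed output (the file name and the CREATE TABLE are assumptions, just to make it runnable):
import sqlite3

conn = sqlite3.connect("example.db")   # assumed database file
c = conn.cursor()
c.execute("CREATE TABLE IF NOT EXISTS avg_dt (d INTEGER, d_2 INTEGER)")  # assumed schema

mark, ek = 369, 963

# The ? placeholders keep the values out of the SQL text entirely;
# sqlite3 binds them safely, so no quoting or string building is required.
c.execute("INSERT INTO avg_dt (d, d_2) VALUES (?, ?)", (mark, ek))
conn.commit()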
Now, let's talk about mitigating this issue for you:
Lots of SQL code need not be unmanageable: typically, if you must maintain large amounts of raw SQL in your project, you dump it into a separate SQL file which you can then either execute directly from the database CLI or run from your program via a database driver (a sketch of the latter follows after the next point).
ORMs are (usually) your friend: with the exception of exotic or outstandingly performance-sensitive SQL queries, modern ORMs can get rid of most raw SQL code in your program. The more naturally programmatic structure of ORM code also means you can break it up into different delegate functions to avoid code duplication, to an extent.
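If you go the separate-file route from the first point, the sqlite3 module can run a whole script in one call. A minimal sketch, assuming a hypothetical schema.sql file that holds the raw SQL:
import sqlite3

conn = sqlite3.connect("example.db")   # assumed database file
with open("schema.sql") as f:          # hypothetical file containing the raw SQL
    conn.executescript(f.read())       # executes every statement in the file
conn.commit()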
Please feel free to add details to your question; as it stands, it's not totally clear whether your concerns can be addressed with this answer.

Related

Searching a list using astroquery

I have the following code below that I got from https://astroquery.readthedocs.io/en/latest/gaia/gaia.html.
I get br when ra and dec are single numbers. However, I have not just one number but a list of ra and dec values. When I tried putting lists in for ra and dec in the code below, I got an error saying Error 500: null. Is there a way to find br using a list of ra and dec?
coord = SkyCoord(ra=, dec=, unit=(u.degree, u.degree), frame='icrs')
width = u.Quantity(0.0005, u.deg)
height = u.Quantity(0.0005, u.deg)
r = Gaia.query_object_async(coordinate=coord, width=width, height=height)
r.pprint()
r.columns
br=[r["phot_bp_rp_excess_factor"]]
print (br)
I am new to astroquery, so any help will be appreciated.
Hi and congrats on your first StackOverflow question. As I understand it, Astroquery is a community effort, and the modules for querying individual online catalogues are in many cases developed and maintained side by side with the online query systems, often by the same developers. So different modules within Astroquery are sometimes worked on by different teams and have varying degrees of feature-completeness and consistency in their interfaces (something I'd like to see improved, but it's very difficult to coordinate).
In the case of Gaia.query_object_async, the docs are not completely clear, and it's not obvious that it even supports an array SkyCoord. It should, or at least, if it doesn't, it should give a better error.
To double-check, I dug into the code and found what I half suspected: it does not allow this at all. It builds a SQL query using string replacement (generally considered a bad idea) and passes that query to a web-based service. Because the ra and dec values are arrays, it just blindly substitutes their array representations into the query template, resulting in an invalid query:
SELECT
  TOP 50
  DISTANCE(
    POINT('ICRS', ra, dec),
    POINT('ICRS', [99.00000712 87.00000767], [24.99999414 24.99999461])
  ) as dist,
  *
FROM
  gaiadr2.gaia_source
WHERE
  1 = CONTAINS(
    POINT('ICRS', ra, dec),
    BOX(
      'ICRS',
      [99.00000712 87.00000767],
      [24.99999414 24.99999461],
      0.005,
      0.005
    )
  )
ORDER BY
  dist ASC
The server, rather than returning an error message suggesting that the query is malformed, just returns a generic server error. Basically, it crashes.
Long story short, you should probably open a bug report about this against astroquery, and see if it's on the Gaia maintainers' radar to deal with: https://github.com/astropy/astroquery/issues/new
In the meantime it sounds like your best bet is to make multiple queries in a loop and join their results together. Since it returns a Table, you can use astropy.table.vstack:
from astropy.table import vstack

results = []
for coord in coords:
    # One query per coordinate; each call returns an astropy Table
    results.append(Gaia.query_object_async(coord, width=width, height=height))
# Stack the per-coordinate tables into a single Table
results = vstack(results)
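For completeness, here is a minimal sketch of how coords might be built from your lists of ra and dec (the example values below are just the ones that appear in the generated query); an array SkyCoord is iterable, yielding one scalar coordinate per loop pass:
import astropy.units as u
from astropy.coordinates import SkyCoord

ra_list = [99.00000712, 87.00000767]     # example right ascensions, in degrees
dec_list = [24.99999414, 24.99999461]    # example declinations, in degrees

# An array SkyCoord: iterating over it yields one scalar SkyCoord per (ra, dec) pair
coords = SkyCoord(ra=ra_list * u.degree, dec=dec_list * u.degree, frame='icrs')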

Neo4J / py2neo -- cursor-based query?

If I do something like this:
from py2neo import Graph
graph = Graph()
stuff = graph.cypher.execute("""
match (a:Article)-[p]-n return a, n, p.weight
""")
on a database with lots of articles and links, the query takes a long time and uses all my system's memory, presumably because it's copying the entire result set into memory in one go. Is there some kind of cursor-based version where I could iterate through the results one at a time without having to have them all in memory at once?
EDIT
I found the stream function:
stuff = graph.cypher.stream("""
match (a:Article)-[p]-n return a, n, p.weight
""")
which seems to be what I want according to the documentation, but now I get a timeout error (py2neo.packages.httpstream.http.SocketError: timed out), followed by the server becoming unresponsive until I kill it with kill -9.
Have you tried implementing a paging mechanism? Perhaps with the skip keyword: http://neo4j.com/docs/stable/query-skip.html
Similar to using LIMIT / OFFSET in a Postgres / MySQL query.
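A rough sketch of what that paging could look like with the same cypher.execute call from the question; page_size is an arbitrary batch size, and the exact parameter and Cypher syntax may vary with your Neo4j / py2neo versions:
from py2neo import Graph

graph = Graph()
page_size = 1000   # arbitrary batch size
skip = 0
while True:
    # Fetch one page of results at a time instead of the whole result set
    batch = graph.cypher.execute(
        "MATCH (a:Article)-[p]-(n) "
        "RETURN a, n, p.weight "
        "SKIP {skip} LIMIT {limit}",
        skip=skip, limit=page_size)
    if len(batch) == 0:
        break
    for record in batch:
        pass   # process each record here, one at a time
    skip += page_size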
EDIT: I previously said that the entire result set was stored in memory, but it appears this is not the case when using API streaming, per Nigel's (a Neo engineer) comment below.

SPARQL initialization

I use SPARQLWrapper in Python to query a web endpoint with many different queries in a loop.
So I tried to make it work like this (let queries hold all the different queries and result hold the results):
sparql = SPARQLWrapper("url")
prefix = "prefix..."
for i in arange(1, len(queries)):
    sparql.setQuery(prefix + queries[i])
    result[i] = sparql.query().convert()
But this does not work. The first query I pick from the list returns the expected result, but any further queries don't.
Instead of that, I now use this:
for i in arange(1, len(queries)):
    [sparql, prefix] = initializeSPARQL()
    sparql.setQuery(prefix + queries[i])
    result[i] = sparql.query().convert()
and also
def initializeSPARQL():
    sparql = SPARQLWrapper("url")
    prefix = "prefix..."
    return sparql, prefix
That works, and performance is not a concern since the querying itself is the bottleneck. But is there a better solution? This seems so wrong...
It is strange... I've been checking the code, and the query() method is completely stateless, so I have no idea why it's failing.
With i > 1, what does result[i] contain?
May I suggest you to try the following?
sparql = SPARQLWrapper("url")
prefix = "prefix..."
results = []
for i in range(0, len(queries)):
    sparql.resetQuery()
    sparql.setQuery(prefix + queries[i])
    results.append(sparql.query().convert())
I'm one of the developers of the library.
Your first attempt surfaces a bug. I'll check what internal data structure is keeping state from the previous usage, so that this way of using the library can be supported.
Your second solution, even if it works, is not the right way to do it.
As I said, I'll take a look at how to fix this.
In the future, please submit a proper bug report to the project or send an email to the mailing list.

Writing MySQL databases in Python using mySQLdb

So I have the following code, and it works:
for count in range(0, 1000):
    L = [random.randint(0, 127), random.randint(0, 127), random.randint(0, 127)]
    random.randint(0, 127)
    name = ''.join(map(chr, L))
    number = random.randint(0, 1000)
    x.execute('insert into testTable set name=(%s), number=(%s)', (name, number))
Above, x is just the cursor I made (obviously). I just create a random string from ASCII values and a random number and write them to my database (this was purely a throwaway example so that I knew it worked).
Then, in another script, I have:
x.execute('insert into rooms set \
    room_name=(%s),\
    room_sqft=(%s),\
    room_type=(%s),\
    room_purpose=(%s),\
    room_floor_number=(%s)',
    (name, sqft, roomType, room_use_ranking, floor))
And I get SyntaxError: invalid syntax on the first line, right at the x part of x.execute.
What is different between the two statements? In the problem code, all arguments except name are ints (name is a string), obtained from an int(raw_input(...))-style prompt that catches bad input.
Clearly this works, but what is going wrong in the second piece of code?
Thanks,
nkk
There's a problem on the line BEFORE the x.execute (x is unexpected at this point). Can you post more of the file?
Also, try this formatting, which can clear up this sort of thing by making the string one blob. (Your syntax highlighter should show it as one big multi-line string, too!)
sql = '''
INSERT INTO rooms
SET room_name=(%s),
room_sqft=(%s),
room_type=(%s),
room_purpose=(%s),
room_floor_number=(%s)
'''
x.execute(sql, (name, sqft, roomType, room_use_ranking, floor))
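As a side note, the same one-big-string style works for the first script as well. Here is a minimal sketch that also batches the 1000 inserts with executemany; the connection details are placeholders, and the INSERT is written in the standard VALUES form rather than MySQL's SET form so that executemany can send it efficiently:
import random
import MySQLdb

# Placeholder connection details; substitute your own
conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="test")
x = conn.cursor()

sql = '''
INSERT INTO testTable (name, number)
VALUES (%s, %s)
'''

# Build all the random (name, number) rows first, then send them in one call
rows = []
for count in range(1000):
    name = ''.join(chr(random.randint(0, 127)) for _ in range(3))
    rows.append((name, random.randint(0, 1000)))

x.executemany(sql, rows)
conn.commit()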

What data is cached during a "select" in sqlite3/Python, and can this be done manually from the start?

Suppose you have a sqlite database with several thousand rows -- each of which either contains or references a sizable, unique blob -- and you want to sparsely sample this collection, pulling rows based on rowid or some equivalent primary key. I find that the first time I attempt to fetch several (500) datapoints after connecting (out of 20k rows), the call takes over 10 seconds to return; and, with every successive iteration, the calls get shorter and shorter, until converging to around 100 milliseconds after 50-100 such queries.
Clearly, either sqlite or its Python wrapper must be caching... something. If I clear out inactive memory (I'm on OS X, but I think Linux has a comparable if-not-identical "purge" command?), the behavior can be replicated exactly. The question is, what is it caching that an index doesn't address? And furthermore, is it possible to automatically pull whatever information is accelerating these queries into memory from the start? Or is there something else I've missed entirely?
A few notes in case someone doesn't immediately know the answer...
Each blob is around 40 kB, and they are a large (ha) source of the problem. I've included some code below for anyone who wants to play along at home, but I've had better luck keeping separate tables for sortable information and data. This introduces an inner join, but it's generally been better than keeping it all together (although if anyone feels this is wrong, I'm keen to hear it). Without the inner join / data fetch, things start at 4 seconds and drop to 3 ms in a hurry.
I feel like this might be a PRAGMA thing, but I fiddled with some settings suggested by others in the wilderness of the web and didn't really see any benefit.
In-memory databases are not an option. For one, I'm trying to share across threads (which might not actually be a problem for in-mems...? not sure), but more importantly the database files are typically on the order of 17 GB. So, that's out.
That being said, there's no problem caching a reasonable amount of information. After a few dozen calls, inactive memory gets somewhat bloated anyways, but I'd rather do it (1) right and (2) efficiently.
Okay, now some code for anyone who wants to try to replicate things. You should be able to copy and paste it into a stand-alone script (that's basically what I did, save for formatting).
import sqlite3
import numpy as np
import time

ref_uid_index = """CREATE INDEX ref_uid_idx
                   ON data(ref_uid)"""

def populate_db_split(db_file, num_classes=10, num_points=20000, VERBOSE=False):
    def_schema_split0 = """
        CREATE TABLE main (
            uid INTEGER PRIMARY KEY,
            name TEXT,
            label INTEGER,
            ignore INTEGER default 0,
            fold INTEGER default 0)"""
    def_schema_split1 = """
        CREATE TABLE data (
            uid INTEGER PRIMARY KEY,
            ref_uid INTEGER REFERENCES main(uid),
            data BLOB)"""
    def_insert_split0 = """
        INSERT INTO main (name, label, fold)
        VALUES (?,?,?)"""
    def_insert_split1 = """
        INSERT INTO data (ref_uid, data)
        VALUES (?,?)"""
    blob_size = 5000
    k_folds = 5
    some_names = ['apple', 'banana', 'cherry', 'date']

    dbconn = sqlite3.connect(db_file)
    dbconn.execute(def_schema_split0)
    dbconn.execute(def_schema_split1)
    rng = np.random.RandomState()
    for n in range(num_points):
        if n % 1000 == 0 and VERBOSE:
            print n
        # Make up some data
        data = buffer(rng.rand(blob_size).astype(float))
        fold = rng.randint(k_folds)
        label = rng.randint(num_classes)
        rng.shuffle(some_names)
        # And add it
        dbconn.execute(def_insert_split0, [some_names[0], label, fold])
        ref_uid = dbconn.execute("SELECT uid FROM main WHERE rowid=last_insert_rowid()").fetchone()[0]
        dbconn.execute(def_insert_split1, [ref_uid, data])
    dbconn.execute(ref_uid_index)
    dbconn.commit()
    return dbconn

def timeit_join(dbconn, n_times=10, num_rows=500):
    qmarks = "?," * (num_rows - 1) + "?"
    q_join = """SELECT data.data, main.uid, main.label
                FROM data INNER JOIN main ON main.uid=data.ref_uid
                WHERE main.uid IN (%s)""" % qmarks
    row_max = dbconn.execute("SELECT MAX(rowid) from main").fetchone()[0]
    tstamps = []
    for n in range(n_times):
        now = time.time()
        uids = np.random.randint(low=1, high=row_max, size=num_rows).tolist()
        res = dbconn.execute(q_join, uids).fetchall()
        tstamps += [time.time() - now]
        print tstamps[-1]
Now, if you want to replicate things, do the following. On my machine, this creates an 800MB database and produces something like below.
>>> db = populate_db_split('/some/file/path.db')
>>> timeit_join(db)
12.0593519211
5.56209111214
3.51154184341
2.20699000359
1.73895692825
1.18351387978
1.27329611778
0.934082984924
0.780968904495
0.834318161011
So... what say you, knowledgeable sages?
Database files of gigabyte size are never loaded into memory entirely. They are split into a tree of so-called pages. These pages are cached in memory; the default is 2000 pages.
You can use the following statement to, for example, double the number of cached pages of 1 kB each:
conn.execute("""PRAGMA cache_size = 4000""")
The connection also keeps a cache of the last 100 statements, as you can see in the function signature:
sqlite3.connect(database[, timeout, detect_types, isolation_level, check_same_thread, factory, cached_statements])
cached_statements expects an integer and defaults to 100.
Apart from setting the cache size, it is unlikely that you will benefit from actively caching statements or pages at application start.
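A minimal sketch of how these two knobs might be set when opening the database; the numbers are illustrative, not tuned:
import sqlite3

# cached_statements enlarges the per-connection statement cache (default 100)
conn = sqlite3.connect("/some/file/path.db", cached_statements=200)

# Enlarge the page cache: a positive value is a number of pages,
# a negative value is a size in kibibytes
conn.execute("PRAGMA cache_size = 4000")
# conn.execute("PRAGMA cache_size = -8000")   # roughly 8 MB of page cache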
