I am very new to Python and SQLite. I am trying to create a script that reads data from a table (rawdata) and then performs some calculations whose results are stored in a new table. I am counting the number of races a player has won before a given date at a particular track position and calculating the percentage. There are 15 track positions in total. Overall the script is very slow. Any suggestions to improve its speed? I have already set the PRAGMA parameters.
Below is the script.
for item in result:
    l1 = str(item[0])
    l2 = item[1]
    l3 = int(item[2])
    winpost = []
    key = l1.split("|")
    dt = l2
    ### Denominator --------------
    cursor.execute(
        "SELECT rowid FROM rawdata WHERE Track = ? AND Date < ? AND Distance = ? AND Surface = ? AND OfficialFinish = 1",
        (key[2], dt, str(key[4]), str(key[5])))
    result_den1 = cursor.fetchall()
    cursor.execute(
        "SELECT rowid FROM rawdata WHERE Track = ? AND RaceSN <= ? AND Date = ? AND Distance = ? AND Surface = ? AND OfficialFinish = 1",
        (key[2], int(key[3]), dt, str(key[4]), str(key[5])))
    result_den2 = cursor.fetchall()
    totalmat = len(result_den1) + len(result_den2)
    if totalmat > 0:
        for i in range(1, 16):
            cursor.execute(
                "SELECT rowid FROM rawdata WHERE Track = ? AND Date < ? AND PolPosition = ? AND Distance = ? AND Surface = ? AND OfficialFinish = 1",
                (key[2], dt, i, str(key[4]), str(key[5])))
            result_num1 = cursor.fetchall()
            cursor.execute(
                "SELECT rowid FROM rawdata WHERE Track = ? AND RaceSN <= ? AND Date = ? AND PolPosition = ? AND Distance = ? AND Surface = ? AND OfficialFinish = 1",
                (key[2], int(key[3]), dt, i, str(key[4]), str(key[5])))
            result_num2 = cursor.fetchall()
            winpost.append(len(result_num1) + len(result_num2))
        winpost = [float(x) / totalmat for x in winpost]
        rank = rankmin(winpost)
        franks = list(rank)
        franks.insert(0, int(key[3]))
        franks.insert(0, dt)
        franks.insert(0, l1)
        table1.append(franks)
        franks = []

cursor.executemany("INSERT INTO posttable VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)", table1)
Sending and retrieving an SQL query is "expensive" in terms of time. The easiest way to speed things up would be to use SQL functions to reduce the number of queries.
For example, the first two queries can be reduced to a single call using COUNT(), UNION, and a subquery alias.
SELECT COUNT(*)
FROM
( SELECT rowid FROM rawdata where ...
UNION
SELECT rowid FROM rawdata where ...
) totalmatch
In this case we take the two original queries (with your conditions in place of the "...") combine them with a UNION statement, give that union the alias "totalmatch", and count all the rows in it.
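For instance, the denominator in the posted script could become one round trip per item. A sketch, reusing the cursor, key, and dt variables from the original loop:

    # One round trip instead of two: UNION the two rowid sets and count them.
    cursor.execute(
        """SELECT COUNT(*)
           FROM (SELECT rowid FROM rawdata
                 WHERE Track = ? AND Date < ? AND Distance = ? AND Surface = ?
                   AND OfficialFinish = 1
                 UNION
                 SELECT rowid FROM rawdata
                 WHERE Track = ? AND RaceSN <= ? AND Date = ? AND Distance = ?
                   AND Surface = ? AND OfficialFinish = 1) totalmatch""",
        (key[2], dt, str(key[4]), str(key[5]),
         key[2], int(key[3]), dt, str(key[4]), str(key[5])))
    totalmat = cursor.fetchone()[0]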
The same thing can be done with the second set of queries. Instead of looping 15 times over 2 queries (30 calls to the SQL engine), you can replace them all with a single query by also using GROUP BY.
SELECT PolPosition, COUNT(PolPosition)
FROM
( SELECT PolPosition FROM rawdata WHERE ...
  UNION ALL
  SELECT PolPosition FROM rawdata WHERE ...
) totalmatch
GROUP BY PolPosition
In this case we take the same combined query as before, select PolPosition instead of rowid, and group by PolPosition, using COUNT to display how many rows are in each group. Note the UNION ALL: a plain UNION would collapse duplicate rows and skew the counts, since here we are selecting repeated position values rather than unique rowids.
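Applied to the posted script, the whole 15-iteration loop might collapse to something like this (again a sketch, assuming the same cursor, key, dt, and totalmat variables; positions that never appear in the result simply default to 0):

    # One query returns the win counts for every position at once.
    cursor.execute(
        """SELECT PolPosition, COUNT(PolPosition)
           FROM (SELECT PolPosition FROM rawdata
                 WHERE Track = ? AND Date < ? AND Distance = ? AND Surface = ?
                   AND OfficialFinish = 1
                 UNION ALL
                 SELECT PolPosition FROM rawdata
                 WHERE Track = ? AND RaceSN <= ? AND Date = ? AND Distance = ?
                   AND Surface = ? AND OfficialFinish = 1) totalmatch
           GROUP BY PolPosition""",
        (key[2], dt, str(key[4]), str(key[5]),
         key[2], int(key[3]), dt, str(key[4]), str(key[5])))
    counts = dict(cursor.fetchall())  # {PolPosition: win count}
    winpost = [float(counts.get(i, 0)) / totalmat for i in range(1, 16)]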
W3Schools is a great resource for how these functions work:
http://www.w3schools.com/sql/default.asp
I am using pyodbc and Microsoft SQL Server
I am trying to replicate a stored procedure in Python, where this query is executed for every @currentSurveyId:
SELECT *
FROM
(
    SELECT
        SurveyId,
        QuestionId,
        1 as InSurvey
    FROM
        SurveyStructure
    WHERE
        SurveyId = @currentSurveyId
    UNION
    SELECT
        @currentSurveyId as SurveyId,
        Q.QuestionId,
        0 as InSurvey
    FROM
        Question as Q
    WHERE NOT EXISTS
    (
        SELECT *
        FROM SurveyStructure as S
        WHERE S.SurveyId = @currentSurveyId AND S.QuestionId = Q.QuestionId
    )
) as t
ORDER BY QuestionId
In Python, so far I have:
cursor.execute("""SELECT UserId FROM dbo.[User]""")
allSurveyID = cursor.fetchall()

for i in allSurveyID:
    p = i
    test = cursor.execute("""SELECT *
        FROM
        (
            SELECT
                SurveyId,
                QuestionId,
                1 as InSurvey
            FROM
                SurveyStructure
            WHERE
                SurveyId = (?)
            UNION
            SELECT
                (?) as SurveyId,
                Q.QuestionId,
                0 as InSurvey
            FROM
                Question as Q
            WHERE NOT EXISTS
            (
                SELECT *
                FROM SurveyStructure as S
                WHERE S.SurveyId = (?) AND S.QuestionId = Q.QuestionId
            )
        ) as t
        ORDER BY QuestionId""", p)
    for i in test:
        print(i)
The parameter works when used once (if I delete everything from UNION onwards). When I try to use the same parameter in the rest of the query, I get the following error: ('The SQL contains 3 parameter markers, but 1 parameters were supplied', 'HY000')
Is it possible to use the same parameter multiple times in the same query?
Thank you
pyodbc itself only supports "qmark" (positional) parameters, but with T-SQL (Microsoft SQL Server) we can use an anonymous code block to avoid having to pass the same parameter value multiple times:
cnxn = pyodbc.connect(connection_string)
crsr = cnxn.cursor()
sql = """\
SET NOCOUNT ON;
DECLARE @my_param int = ?;
SELECT @my_param AS original, @my_param * 2 AS doubled;
"""
results = crsr.execute(sql, 2).fetchone()
print(results) # (2, 4)
If reusing the same parameter value, simply multiply a one-item list of the parameter:
cursor.execute(sql, [p]*3)
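In the context of the question's loop, that might look like the following (a sketch; note that a pyodbc Row has to be unpacked to its scalar value first):

    # `sql` holds the three-marker query from the question.
    cursor.execute("SELECT UserId FROM dbo.[User]")
    for row in cursor.fetchall():
        p = row[0]                    # unpack the scalar from the pyodbc Row
        cursor.execute(sql, [p] * 3)  # the same value fills all three markers
        for result_row in cursor.fetchall():
            print(result_row)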
Consider also refactoring your SQL to use a LEFT JOIN (or FULL JOIN), which requires only two qmarks:
SELECT DISTINCT
ISNULL(S.SurveyId, ?) AS SurveyId,
Q.QuestionId,
IIF(S.SurveyId IS NOT NULL, 1, 0) AS InSurvey
FROM Question Q
LEFT JOIN SurveyStructure S
ON S.QuestionId = Q.QuestionId
AND S.SurveyId = ?
ORDER BY Q.QuestionId
Or possibly even a version with a single parameter:
SELECT MAX(S.SurveyId) AS SurveyId,
Q.QuestionId,
IIF(S.SurveyId IS NOT NULL, 1, 0) AS InSurvey
FROM Question Q
LEFT JOIN SurveyStructure S
ON S.QuestionId = Q.QuestionId
AND S.SurveyId = ?
GROUP BY Q.QuestionId,
IIF(S.SurveyId IS NOT NULL, 1, 0)
ORDER BY Q.QuestionId
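Either way the call site stays simple. A sketch, assuming left_join_sql holds the two-qmark LEFT JOIN query above and survey_id is the current id (both names are placeholders, not from the question):

    # Two markers, so two copies of the same value.
    cursor.execute(left_join_sql, (survey_id, survey_id))
    rows = cursor.fetchall()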
I am using Python to access an Oracle Exadata database, which is HUGE. The documentation for the table is rather poor and I need to understand strange cases. Coming from an R/Python world, I ran the following query:
query = ("""
    SELECT COUNT(counter) as freq, counter
    FROM (
        SELECT COUNT(*) as counter
        FROM schema.table
        WHERE x = 1 AND y = 1
        GROUP BY a, b)
    GROUP BY counter""")

with cx_Oracle.connect(dsn=tsn, encoding="UTF-8") as con:
    df = pd.read_sql(query, con=con)
This essentially counts the frequency of observations for a given (a,b) pair. My prior was that they are all 1 (they are not). So I would like to see the observations that drive this:
query = ("""
SELECT *
FROM schema.table
WHERE x = 1 and y = 1
AND (for each (a,b) there is more than one record)""")
I am struggling to translate this into proper Oracle SQL.
In R (dplyr) this would be a combination of group_by and mutate (instead of summarise) and in Python pandas this could be done with transform.
I am new to SQL and may use incorrect terminology. I appreciate being corrected.
You can use window functions:
SELECT ab.*
FROM (SELECT t.*, COUNT(*) OVER (PARTITION BY a, b) as cnt
FROM schema.table t
WHERE x = 1 AND y = 1
) ab
WHERE cnt > 1;
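A sketch of pulling that result into a DataFrame, reusing the connection setup from the question (with its typos corrected; tsn is assumed to be your DSN string):

    import cx_Oracle
    import pandas as pd

    query = """
        SELECT ab.*
        FROM (SELECT t.*, COUNT(*) OVER (PARTITION BY a, b) as cnt
              FROM schema.table t
              WHERE x = 1 AND y = 1) ab
        WHERE cnt > 1"""

    with cx_Oracle.connect(dsn=tsn, encoding="UTF-8") as con:
        df = pd.read_sql(query, con=con)  # all rows whose (a, b) pair occurs more than once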
I'm trying to create SQL queries for a large list of records (>42 million) to insert into a remote database. Right now I'm building queries in the format INSERT INTO tablename (columnnames) VALUES (values).
tablename, columnnames, and values are all of varying length so I'm generating a number of placeholders equal to the number of values required.
The result is I have a string called sqlcommand that looks like INSERT INTO ColName (?,?,?) VALUES (?,?,?); and a list of parameters that looks like ([Name1, Name2, Name3, Val1, Val2, Val3]).
When I try to execute the query as db.execute(sqlcommand, params), I get errors indicating I'm trying to insert into columns "#P1", "#P2", "#P3", et cetera. Why aren't the values from my list properly translating? Where is it getting "#P1" from? I know I don't have a column of that name, and as far as I can tell I'm not referencing a column of that name, yet the execute method is still trying to use it.
UPDATE: As per request, the full code is below, modified to avoid anything that might be private. The end result of this is to move data, row by row, from an sqlite3 db file to an AWS SQL server.
newDB = pyodbc.connect(newDataBase)
oldDB = sqlite3.connect(oldDatabase)

tables = oldDB.execute("SELECT * FROM sqlite_master WHERE type='table';").fetchall()

t0 = datetime.now()
for table in tables:
    print('Parsing:', str(table[1]))
    t1 = datetime.now()
    colInfo = oldDB.execute('PRAGMA table_info('+table[1]+');').fetchall()
    cols = list()
    cph = ""
    i = 0
    for col in colInfo:
        cph += "?,"
        cols.append(str(col[1]))
    rowCount = oldDB.execute("SELECT COUNT(*) FROM "+table[1]+" ;").fetchall()
    count = 0
    while count <= int(rowCount[0][0]):
        params = list()
        params.append(cols)
        count += 1
        row = oldDB.execute("SELECT * FROM "+table[1]+" LIMIT 1;").fetchone()
        ph = ""
        for val in row:
            ph += "?,"
            params = params.append(str(val))
        ph = ph[:-1]
        cph = cph[:-1]
        print(str(table[1]))
        sqlcommand = "INSERT INTO "+str(table[1])+" ("+cph+") VALUES ("+ph+");"
        print(sqlcommand)
        print(params)
        newDB.execute(sqlcommand, params)
        sqlcommand = "DELETE FROM ? WHERE ? = ?;"
        oldDB.execute(sqlcommand, (str(table[1]), cols[0], vals[0],))
        newDB.commit()
Unbeknownst to me, column names can't be passed as parameters. Panagiotis Kanavos answered this in a comment. I guess I'll have to figure out a different way to generate the queries. Thank you all very much, I appreciate it.
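In other words, only values can be bound with ?; table and column names have to be baked into the SQL string itself. A sketch of that approach, using a hypothetical helper (safe only because the names come from your own PRAGMA table_info call rather than user input):

    # Identifiers (table/column names) cannot be parameterized, so they are
    # embedded in the SQL text; only the row values are passed as parameters.
    def insert_row(db, table_name, col_names, values):
        cols = ", ".join("[" + c + "]" for c in col_names)   # bracket-quote identifiers
        marks = ", ".join("?" for _ in values)               # one marker per value
        sql = "INSERT INTO [" + table_name + "] (" + cols + ") VALUES (" + marks + ");"
        db.execute(sql, values)

    insert_row(newDB, table[1], cols, [str(v) for v in row])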
I have a table with these data:
ID, Name, LastName, Date, Type
I have to query the table for the user with ID 1.
Get the row; if that user's Type is not 2, then return that user, else return all users that have the same LastName and Date.
What would be the most efficient way to do this?
What I have done is:
query1 = "SELECT * FROM clients WHERE ID = 1"
query2 = "SELECT * FROM clients WHERE LastName = %s AND Date = %s"
And I execute the first query:
cursor.execute(query1)
rows = cursor.fetchall()
for row in rows:
    if row['Type'] == 2:
        cursor.execute(query2, (row['LastName'], row['Date']))
        # save results
    else:
        results = rows  # ?
Is there a more efficient way of doing this using joins?
For example, if I only had a LEFT JOIN, how would I also check whether the user's Type is 2?
And if multiple rows are returned, how would I assign them to an array of objects in Python?
Just do two queries to avoid loops here:
q1 = """
SELECT c.* FROM clients c where c.ID = 1
"""
q2 = """
SELECT b.* FROM clients b
JOIN (SELECT c.* FROM
clients c
WHERE
c.ID = 1
AND
c.Type = 2) a
ON
a.LastName = b.LastName
AND
a.Date = b.Date
"""
Then you can just execute both queries and you'll have all the desired results without any loops. The loop approach issues n queries, where n is the number of matching rows, whereas the join grabs everything in one pass. Without more specifics on the desired data structure of the final results (it seems you only care about saving them), this should give you what you want.
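On the Python side this could look like the following (a sketch, assuming a DB-API cursor and the q1/q2 strings above):

    cursor.execute(q1)
    user_rows = cursor.fetchall()    # the ID = 1 row, if any

    cursor.execute(q2)
    matches = cursor.fetchall()      # non-empty only when ID 1 has Type = 2

    # q2 only returns rows in the Type = 2 case, so pick accordingly:
    results = matches if matches else user_rows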
I gather a list of items, and for each item I check the database with the following SQL query:
SELECT *
FROM task_activity as ja
join task as j on ja.task_id = j.id
WHERE j.name = '%s'
AND ja.avg_runtime <> 0
AND ja.avg_runtime is not NULL
AND ja.id = (SELECT MAX(id) FROM task_activity
WHERE task_id = ja.task_id
and avg_runtime <> 0
AND ja.avg_runtime is not NULL)
% str(task.get('name'))).fetchall()
But do I need to iterate through the list and make a query for each one? The list is quite large at times. Can I just make one query and get back the whole data set as a list?
In this particular query I'm only looking for the avg_runtime column for each task id; the row with the maximum id is the last calculated runtime.
I don't have access to the database other than to make queries. Using Microsoft SQL Server 2012 (SP1) - 11.0.3349.0 (X64).
You might be able to speed this up using row_number(). Note, I think there's a bug in your original query. Should ja.avg_runtime in the subquery just be avg_runtime?
sql = """with x as (
    select
        ja.task_id,
        ja.avg_runtime,
        ja.id,
        row_number() over (partition by ja.task_id order by ja.id desc) rn
    from
        task_activity as ja
        join task as j on ja.task_id = j.id
    where
        j.name in ({0}) and
        ja.avg_runtime <> 0 and
        ja.avg_runtime is not null
)
select
    task_id,
    avg_runtime,
    id
from
    x
where
    rn = 1;"""
# build up ?,?,? for parameter substitution;
# assume tasknames is the list containing the task names.
params = ",".join("?" for _ in tasknames)

# connection is your db connection
cursor = connection.cursor()

# interpolate the ?,?,? and bind the parameters
cursor.execute(sql.format(params), tasknames)
rows = cursor.fetchall()
The following index should make this query pretty fast (although it depends on how many rows are excluded by the filters on ja.avg_runtime):
create index ix_task_id_id on task_activity (task_id, id desc);