join & f.write behaviour not as expected - python

I have a query running in SQL which returns the results into a variable via a loop, and then puts that into an HTML file. When I test this by printing to the console in Jupyter Notebook, it prints as expected: the next 30 days of the calendar in order of date.
However, when I tell it to join the data using
dates = ''.join(variable)
it seems not only to reorder the dates, so that the 13th of August sits oddly before the 13th of July, but also to repeat the date divs four times in the page. See below for the full code:
from os import getenv
import pyodbc
import os
cnxn = pyodbc.connect('DRIVER={ODBC Driver 13 for SQL Server};SERVER=MYVM\SQLEXPRESS;DATABASE=MyTables;UID=test;PWD=t')
cursor = cnxn.cursor() #makes connection
cursor.execute('DECLARE @today as date SET @today = GetDate() SELECT style112, day, month, year, dayofweek, showroom_name, isbusy from ShowroomCal where Date Between @today and dateadd(month,1,@today) order by style112') #runs statement
inset = []
row = cursor.fetchone()
while row is not None:
    inset = inset + ['<div class="' + str(row.isbusy) + '">' + str(row.day) + '</div>']
    row = cursor.fetchone()
dates = ''.join(inset)
f = open("C:\\tes.html",'r') # open file with read permissions
filedata = f.read() # read contents
f.close() # closes file
filedata = filedata.replace("{inset}", dates)
#os.remove("c:\\inetpub\\wwwroot\\cal\\tes.html")
f = open("c:\\inetpub\\wwwroot\\cal\\tes.html",'w')
f.write(filedata) # update it replacing the previous strings
f.close() # closes the file
cnxn.close()

''.join() does not alter the order in any way. If you get a different order then the database query produced rows in a different order.
I don't think you are telling the database to order your results by date. You order by style112, and the database is free to order values with the same style112 column value in any order it pleases. If style112 doesn't include date information (as a year, month, day sequence of fixed length) and date order is important, tell the database to use a correct order! Here that'd include year, month, day at the very least.
I'd also refactor the code to avoid quadratic performance behaviour; the inset = inset + [....] expression has to create a new list object each time, copying across all elements from inset and the new list into that. When adding N elements to a list this way, Python has to execute N * N steps. For 1000 elements, that's 1 million steps to execute! Use list.append() to add single elements, which will reduce the workload to roughly N steps.
You can loop directly over a cursor; this is more efficient as it can buffer rows, whereas cursor.fetchone() can't assume you'll fetch more data. A for row in cursor: loop is also more readable.
You can also use string formatting rather than string concatenation, it'll help avoid all those str() calls and redundancy, as well as further reduce performance issues; all those string concatenations also create and recreate a lot of intermediate string objects that you don't need to create at all.
So use this:
cnxn = pyodbc.connect(
    'DRIVER={ODBC Driver 13 for SQL Server};SERVER=MYVM\SQLEXPRESS;'
    'DATABASE=MyTables;UID=test;PWD=t')
cursor = cnxn.cursor()
cursor.execute('''
    DECLARE @today as date
    SET @today = GetDate()
    SELECT
        style112, day, month, year, dayofweek, showroom_name, isbusy
    from ShowroomCal
    where Date Between @today and dateadd(month, 1, @today)
    order by year, month, day, style112
    ''')

inset = []
for row in cursor:
    inset.append(
        '<div class="{r.isbusy}">'
        '<a href="#" id="{r.style112}"'
        ' onclick="parent.updateField(field38, {r.style112});">'
        '{r.day}</a></div>'.format(r=row))

with open(r"C:\tes.html") as f:
    template = f.read()

html = template.format(inset=''.join(inset))

with open(r"C:\inetpub\wwwroot\cal\tes.html", 'w') as output:
    output.write(html)
Note: if any of your database data was entered by your users, you must ensure that the data is properly escaped for inclusion in HTML first, or you'll leave yourself open to XSS (cross-site scripting) attacks. Personally, I'd use an HTML templating engine with default escaping support, such as Jinja.
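For example, a minimal sketch of that Jinja route, assuming jinja2 is installed and the template file is rewritten to contain a Jinja loop rather than the {inset} placeholder from the question:

from jinja2 import Template

# Sketch only: the template is assumed to contain something like
# {% for row in rows %}<div class="{{ row.isbusy }}">{{ row.day }}</div>{% endfor %}
with open(r"C:\tes.html") as f:
    template = Template(f.read(), autoescape=True)  # autoescaping guards against XSS

rows = cursor.fetchall()          # row objects with .isbusy, .day, etc.
html = template.render(rows=rows)

with open(r"C:\inetpub\wwwroot\cal\tes.html", 'w') as output:
    output.write(html)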

Related

Read from SQL Server with Python using a few parameters from a DataFrame

I need to read from a SQL Server database using these parameters:
a period of time from the uploaded DataFrame (date of order and the date a month after)
client IDs from the same DataFrame
So I have something like this:
sql_sales = """
SELECT
dt,
clientID,
cost
WHERE
dt between %(date1)s AND %(date2)s
AND kod in %(client)s
"""
And I have a df with the columns:
clientsID
date of order
date after month
I can use a list of clients, but the code should query the database with a few lists of parameters (two of them make up the period).
sales = sales.append(pd.read_sql(sql_sales, conn, params={'client': df['clientsID'].tolist()}))
The way I got something similar to work in the past was to put {} placeholders in the query and then use .format with the parameters listed in order. That way you don't need to use the params argument. Finally, if you are using IN with SQL, then in Python you need to create a tuple from the client list. For the line dt between {} AND {}, you may also be able to do dt between ? AND ? (a parameterised sketch of that follows the code below).
client = tuple(df['clientsID'].tolist())
sql_sales = """
SELECT
    dt,
    clientID,
    cost
WHERE
    dt between {} AND {}
    AND kod in {}
""".format(date1, date2, client)
sales = sales.append(pd.read_sql(sql_sales, conn))
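If you'd rather keep the values out of the SQL string entirely, here's a sketch of the ?-placeholder route mentioned above, assuming a DB-API connection that uses ? markers (e.g. pyodbc); the FROM clause and table name are placeholders, since the question's query omits them, and df, conn, date1 and date2 come from the question:

import pandas as pd

clients = df['clientsID'].tolist()
placeholders = ','.join('?' * len(clients))   # one "?" per client id

sql_sales = """
    SELECT dt, clientID, cost
    FROM sales_table              -- placeholder table name
    WHERE dt BETWEEN ? AND ?
      AND kod IN ({})
""".format(placeholders)

sales = pd.read_sql(sql_sales, conn, params=[date1, date2] + clients)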

Incorrect date being returned from SQL DB with Python script

I have an SQL DB from which I am trying to extract data. When I extract date/time values, my script adds three zeros to the date/time value, like so: 2011-05-03 15:25:26.170000
Below is my code in question:
value_Time = ('SELECT TOP (4) [TimeCol] FROM [database1].[dbo].[table1]')
cursor.execute(value_Time)
for Timerow in cursor:
    print(Timerow)
    Time_list = [elem for elem in Timerow]
The desired result is that there are no additional three zeros at the end of the date/time value, so that I can insert it into a different database.
Values within Time_list will contain the incorrect date/time values, as well as the Timerow value.
Any help with this would be much appreciated!
from datetime import datetime

value_Time = ('SELECT TOP (4) [TimeCol] FROM [database1].[dbo].[table1]')
cursor.execute(value_Time)
row = cursor.fetchone()
for i in range(len(row)):
    var = datetime.strftime(row[i], '%Y-%m-%d %H:%M:%S')
    print(var)
I think you need a wrapper to surround your date control, for example "yyyy/mm/dd/hh/mm/ss" or "yyyymmddhhmmss":
Format((Datecontrol),"yyyy/mm/dd/hh/mm/ss")
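If the end goal really is to push these values into another database without the fractional seconds, one hedged sketch in Python (the target table, target_cursor and target_connection are hypothetical, and the ? placeholders assume a pyodbc-style driver):

from datetime import datetime

# Strip the fractional seconds from every datetime in each fetched row,
# then insert the cleaned row into the (hypothetical) target table.
cursor.execute('SELECT TOP (4) [TimeCol] FROM [database1].[dbo].[table1]')
for Timerow in cursor.fetchall():
    cleaned = [value.replace(microsecond=0) if isinstance(value, datetime) else value
               for value in Timerow]
    target_cursor.execute(
        'INSERT INTO [database2].[dbo].[table2] ([TimeCol]) VALUES (?)',  # hypothetical target
        cleaned)
target_connection.commit()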

Creating tables in MySQL based on the names of the columns in another table

I have a table with ~133M rows and 16 columns. I want to create 14 tables in another database on the same server, one for each of columns 3-16 (columns 1 and 2 are `id` and `timestamp`, which will be in the final 14 tables as well but won't have their own tables), where each table will have the name of the original column. Is this possible to do exclusively with an SQL script? It seems logical to me that this would be the preferred, and fastest, way to do it.
Currently, I have a Python script that "works" by parsing the CSV dump of the original table (testing with 50 rows), creating new tables, and adding the associated values, but it is very slow (I estimated almost 1 year to transfer all 133M rows, which is obviously not acceptable). This is my first time using SQL in any capacity, and I'm certain that my code can be sped up, but I'm not sure how because of my unfamiliarity with SQL. The big SQL string command in the middle was copied from some other code in our codebase. I've tried using transactions as seen below, but it didn't seem to have any significant effect on the speed.
import re
import mysql.connector
import time

# option flags
debug = False   # prints out information during runtime
timing = True   # times the execution time of the program

# save start time for timing. won't be used later if timing is false
start_time = time.time()

# open file for reading
path = 'test_vaisala_sql.csv'
file = open(path, 'r')

# read in column values
column_str = file.readline().strip()
columns = re.split(',vaisala_|,', column_str)  # parse columns with regex to remove commas and vaisala_
if debug:
    print(columns)

# open connection to MySQL server
cnx = mysql.connector.connect(user='root', password='<redacted>',
                              host='127.0.0.1',
                              database='measurements')
cursor = cnx.cursor()

# create the table in the MySQL database if it doesn't already exist
for i in range(2, len(columns)):
    table_name = 'vaisala2_' + columns[i]
    sql_command = "CREATE TABLE IF NOT EXISTS " + \
                  table_name + "(`id` BIGINT(20) NOT NULL AUTO_INCREMENT, " \
                  "`timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, " \
                  "`milliseconds` BIGINT(20) NOT NULL DEFAULT '0', " \
                  "`value` varchar(255) DEFAULT NULL, " \
                  "PRIMARY KEY (`id`), " \
                  "UNIQUE KEY `milliseconds` (`milliseconds`) " \
                  "COMMENT 'Eliminates duplicate millisecond values', " \
                  "KEY `timestamp` (`timestamp`)) " \
                  "ENGINE=InnoDB DEFAULT CHARSET=utf8;"
    if debug:
        print("Creating table", table_name, "in database")
    cursor.execute(sql_command)

# read in rest of lines in CSV file
for line in file.readlines():
    cursor.execute("START TRANSACTION;")
    line = line.strip()
    values = re.split(',"|",|,', line)  # regex split along commas, or commas and quotes
    if debug:
        print(values)
    # iterate over each data column. Starts at 2 to eliminate `id` and `timestamp`
    for i in range(2, len(columns)):
        table_name = "vaisala2_" + columns[i]
        timestamp = values[1]
        # translate timestamp back to epoch time
        try:
            pattern = '%Y-%m-%d %H:%M:%S'
            epoch = int(time.mktime(time.strptime(timestamp, pattern)))
            milliseconds = epoch * 1000  # convert seconds to ms
        except ValueError:  # errors default to 0
            milliseconds = 0
        value = values[i]
        # generate SQL command to insert data into destination table
        sql_command = "INSERT IGNORE INTO {} VALUES (NULL,'{}',{},'{}');".format(table_name, timestamp,
                                                                                 milliseconds, value)
        if debug:
            print(sql_command)
        cursor.execute(sql_command)
    cnx.commit()  # commits changes in destination MySQL server

# print total execution time
if timing:
    print("Completed in %s seconds" % (time.time() - start_time))
This doesn't need to be incredibly optimized; it's perfectly acceptable if the machine has to run for a few days in order to do it. But 1 year is far too long.
You can create a table from a SELECT like:
CREATE TABLE <other database name>.<column name>
AS
SELECT <column name>
FROM <original database name>.<table name>;
(Replace the <...> with your actual object names or extend it with other columns or a WHERE clause or ...)
That will also insert the data from the query into the new table. And it's probably the fastest way.
You could use dynamic SQL and information from the catalog (namely information_schema.columns) to create the CREATE statements or create them manually, which is annoying but acceptable for 14 columns I guess.
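For illustration, a rough sketch of that catalog-driven approach; the source table name `vaisala`, the target schema `measurements2` and the credentials are placeholders, not taken from the question:

import mysql.connector

cnx = mysql.connector.connect(user='root', password='<redacted>', host='127.0.0.1')
cursor = cnx.cursor()

# pull the data column names out of the catalog, skipping id/timestamp
cursor.execute("""
    SELECT column_name
    FROM information_schema.columns
    WHERE table_schema = 'measurements'
      AND table_name = 'vaisala'
      AND column_name NOT IN ('id', 'timestamp')
""")
data_columns = [row[0] for row in cursor.fetchall()]

for col in data_columns:
    # each new table keeps id/timestamp plus one data column; note that
    # CREATE TABLE ... AS does not copy indexes or AUTO_INCREMENT settings
    cursor.execute(
        "CREATE TABLE measurements2.`{0}` AS "
        "SELECT `id`, `timestamp`, `{0}` FROM measurements.vaisala".format(col))
cnx.commit()
cnx.close()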
When using scripts to talk to databases, you want to minimise the number of messages that are sent, as each message adds a further delay to your execution time. Currently, it looks as if you are sending (by your approximation) 133 million messages, and thus slowing down your script 133 million times. A simple optimisation would be to parse your spreadsheet and split the data into the tables (either in memory or saving them to disk) and only then send the data to the new DB.
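For instance, a sketch of that batching idea as a drop-in replacement for the insert loop in the question's script; it reuses the script's file, columns, cursor and cnx objects and assumes mysql.connector's %s parameter style:

import re
import time

batch = {}  # table name -> list of row tuples, built entirely in memory first
for line in file.readlines():
    values = re.split(',"|",|,', line.strip())
    try:
        epoch = int(time.mktime(time.strptime(values[1], '%Y-%m-%d %H:%M:%S')))
        milliseconds = epoch * 1000
    except ValueError:
        milliseconds = 0
    for i in range(2, len(columns)):
        table_name = "vaisala2_" + columns[i]
        batch.setdefault(table_name, []).append((values[1], milliseconds, values[i]))

# one executemany per table sends the rows in far fewer round trips
insert_sql = "INSERT IGNORE INTO {} VALUES (NULL, %s, %s, %s)"
for table_name, rows in batch.items():
    cursor.executemany(insert_sql.format(table_name), rows)
cnx.commit()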
As you hinted, it's much quicker to write an SQL script to redistribute the data.

Python csv_writer: Change the output format of an Oracle date column

I have a table with a date column and want to format that column as DD.MM.YYYY in a CSV file, but ALTER SESSION does not affect the Python csv_writer.
Is there a way to handle all date columns without using to_char in the SQL code?
import csv
import cx_Oracle

file_handle = open("test.csv", "w")
csv_writer = csv.writer(file_handle, dialect="excel", lineterminator='\n', delimiter=';',
                        quoting=csv.QUOTE_NONNUMERIC)
conn = cx_Oracle.connect(connectionstring)
cur = conn.cursor()
cur.execute("ALTER SESSION SET NLS_DATE_FORMAT = 'DD.MM.YYYY HH24:MI:SS'")
cur.execute("select attr4, to_char(attr4,'DD.MM.YYYY') from aTable")
rows = cur.fetchmany(16000)
while len(rows) > 0:
    csv_writer.writerows(rows)
    rows = cur.fetchmany(16000)
cur.close()
result:
"1943-04-21 00:00:00";"21.04.1943"
"1955-12-22 00:00:00";"22.12.1955"
"1947-11-01 00:00:00";"01.11.1947"
"1960-01-07 00:00:00";"07.01.1960"
"1979-12-01 00:00:00";"01.12.1979"
The output you see comes from the fact that the result of a query is converted to the corresponding Python datatypes: the values of the first column are datetime objects, and those of the second are strings (due to the to_char() cast you do in the query). NLS_DATE_FORMAT only controls the output for regular (user) clients.
Thus the output in the CSV is just the default representation of Python's datetime; if you want output in a different form, you need to format it yourself.
As the query response is a list of tuples, you can't just change it in place; it has to be copied and modified. Alternatively, you could write it row by row, modified (a sketch of the copy-and-modify variant follows further below).
Here's just the write part with the 2nd approach:
import datetime

# the rest of your code

while len(rows) > 0:
    for row in rows:
        value = (row[0].strftime('%d.%m.%Y'), row[1])
        csv_writer.writerow(value)
    rows = cur.fetchmany(16000)
For reference, here's a short list of Python's strftime directives.
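And, for completeness, a short sketch of the first (copy-and-modify) approach mentioned above, assuming the same two-column result set:

# Build a new list of tuples with the date already formatted,
# then hand each batch to writerows().
rows = cur.fetchmany(16000)
while len(rows) > 0:
    formatted = [(row[0].strftime('%d.%m.%Y'), row[1]) for row in rows]
    csv_writer.writerows(formatted)
    rows = cur.fetchmany(16000)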

How to make an efficient query for extracting entries of all days in a database in sets?

I have a database that includes 440 days of several series with a sampling time of 5 seconds. There is also missing data.
I want to calculate the daily average. The way I am doing it now is that I make 440 queries and do the averaging afterwards. But this is very time-consuming, since for every query the whole database is searched for related entries. I imagine there must be a more efficient way of doing this.
I am doing this in Python, and I am just learning SQL. Here's the query section of my code:
time_cur = date_begin
Data = numpy.zeros(shape=(N, NoFields - 1))
X = []
nN = 0
while time_cur < date_end:
    X.append(time_cur)
    cur = con.cursor()
    cur.execute("SELECT * FROM os_table \
                 WHERE EXTRACT(year from datetime_)=%s\
                 AND EXTRACT(month from datetime_)=%s\
                 AND EXTRACT(day from datetime_)=%s",\
                (time_cur.year, time_cur.month, time_cur.day));
    Y = numpy.array([0]*(NoFields-1))
    n = 0.0
    while True:
        n = n + 1
        row = cur.fetchone()
        if row == None:
            break
        Y = Y + numpy.array(row[1:])
    Data[nN][:] = Y/n
    nN = nN + 1
    time_cur = time_cur + datetime.timedelta(days=1)
And, my data looks like this:
datetime_,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14
2012-11-13-00:07:53,42,30,0,0,1,9594,30,218,1,4556,42,1482,42
2012-11-13-00:07:58,70,55,0,0,2,23252,55,414,2,2358,70,3074,70
2012-11-13-00:08:03,44,32,0,0,0,11038,32,0,0,5307,44,1896,44
2012-11-13-00:08:08,36,26,0,0,0,26678,26,0,0,12842,36,1141,36
2012-11-13-00:08:13,33,26,0,0,0,6590,26,0,0,3521,33,851,33
I appreciate your suggestions.
Thanks
Iman
I don't know the numpy functions involved, so I don't understand what you are averaging. If you show your table and the logic to get the average...
But this is how to get a daily average for a single column
import psycopg2

conn = psycopg2.connect('host=localhost4 port=5432 dbname=cpn')
cursor = conn.cursor()
cursor.execute('''
    select
        datetime_::date as day,
        avg(c1) as c1_average,
        avg(c2) as c2_average
    from os_table
    where datetime_ between %s and %s
    group by 1
    order by 1
    ''',
    (time_cur, date_end)
)
rs = cursor.fetchall()
conn.close()

for day in rs:
    print(day[0], day[1], day[2])
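To tie this back to the arrays in the question, a hedged sketch; it assumes the 14 data columns c1..c14 from the sample data and reuses date_begin/date_end from the question's loop:

import numpy
import psycopg2

conn = psycopg2.connect('host=localhost4 port=5432 dbname=cpn')
cursor = conn.cursor()

# one grouped query replaces the 440-query loop; each result row is
# (day, c1_average, ..., c14_average)
avg_cols = ', '.join('avg(c{0}) as c{0}_average'.format(i) for i in range(1, 15))
cursor.execute(
    'select datetime_::date as day, ' + avg_cols +
    ' from os_table where datetime_ between %s and %s group by 1 order by 1',
    (date_begin, date_end))
rs = cursor.fetchall()
conn.close()

X = [row[0] for row in rs]                                  # list of days
Data = numpy.array([row[1:] for row in rs], dtype=float)    # avg() may come back as Decimal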
This answer uses SQL Server syntax; I am not sure how different PostgreSQL is, but it should be fairly similar. You may find things like the DATEADD, DATEDIFF and CONVERT statements are different (actually, almost certainly the CONVERT statement: just convert the date to a varchar instead; I am only using it as a report name, so it is not vital). You should be able to follow the theory of this, even if the code doesn't run in PostgreSQL without tweaking.
First, create a reports table (you will use this to link to the actual table you want to report on):
CREATE TABLE Report_Periods (
    report_name       VARCHAR(30) NOT NULL PRIMARY KEY,
    report_start_date DATETIME NOT NULL,
    report_end_date   DATETIME NOT NULL,
    CONSTRAINT date_ordering
        CHECK (report_start_date <= report_end_date)
)
Next, populate the report table with the dates you need to report on. There are many ways to do this; the method I've chosen here will only use the days you need, but you could create the table with all the dates you are ever likely to use, so you only have to do it once.
INSERT INTO Report_Periods (report_name, report_start_date, report_end_date)
SELECT CONVERT(VARCHAR, [DatePartOnly], 0) AS DateName,
       [DatePartOnly] AS StartDate,
       DATEADD(ms, -3, DATEADD(dd, 1, [DatePartOnly])) AS EndDate
FROM (SELECT DISTINCT DATEADD(DD, DATEDIFF(DD, 0, datetime_), 0) AS [DatePartOnly]
      FROM os_table) AS M
Note: in SQL Server, the smallest time increment allowed in a DATETIME is 3 milliseconds, so the above statement adds 1 day and then subtracts 3 milliseconds to create a start and end datetime for each day. Again, PostgreSQL may use different values.
This means you can simply link the reports table back to your os_table to get averages, counts etc. very easily:
SELECT AVG(value) AS AvgValue, COUNT(value) AS NumValues, R.report_name
FROM os_table AS T
JOIN Report_Periods AS R ON T.datetime_>= R.report_start_date AND T.datetime_<= R.report_end_date
GROUP BY R.report_name
