Conflict accessing twice in azure storage table simultaneously - python

I've been running a script that retrieves data from an Azure Storage table and copies it into another table in the same storage account, without any problem.
The issue appeared when I tried to access this second table to run some calculations and copy the results into yet another table in the same storage account. This script returned the following error:
AzureConflictHttpError: Conflict
{"odata.error":{"code":"EntityAlreadyExists","message":{"lang":"en-US","value":"The specified entity already exists.\nRequestId:57d9b721-6002-012d-3d0c-b88bef000000\nTime:2019-01-29T19:55:53.5984026Z"}}}
At the same time, the script I was running previously also stopped, printing the same error, and it won't start again even when nothing else is running; it keeps returning the same error over and over.
Is there any way to access the same Azure Storage table API from multiple scripts at the same time?
UPDATE
Adding the source code, sorry for not having done that before. The two scripts I'm running in parallel are essentially the same, just with different filters. This one takes the data from Table 1 (which has one row per second) and averages those numbers per minute to add a row to Table 2; the other script takes data from Table 2 and averages those rows into a 5-minute average row in another Table 3. So a few parameters change, but the code is basically the same.
There will be a third script, slightly different from these two, that will also take Table 2 as its input, run other calculations and write the results as a new row per minute into a future Table 4. In general, the idea is to have multiple writers to multiple tables at the same time, building new, more specific tables.
import datetime
import time
from azure.storage.table import TableService, Entity

delta_time = '00:01:00'
retrieve_time = '00:10:00'
start_time = '08:02:00'
utc_diff = 3

table_service = TableService(account_name='xxx', account_key='yyy')

while True:
    now_time = datetime.datetime.now().strftime("%H:%M:%S")
    now_date = datetime.datetime.now().strftime("%d-%m-%Y")
    hour = datetime.datetime.now().hour
    if hour >= 21:
        now_date = (datetime.datetime.now() + datetime.timedelta(days=1)).strftime("%d-%m-%Y")
    retrieve_max = (datetime.datetime.now() + datetime.timedelta(hours=utc_diff) + datetime.timedelta(minutes=-10)).strftime("%H:%M:%S")
    start_diff = datetime.datetime.strptime(now_time, '%H:%M:%S') - datetime.datetime.strptime(start_time, '%H:%M:%S') + datetime.timedelta(hours=utc_diff)

    if start_diff.total_seconds() > 0:
        query = "PartitionKey eq '"+str(now_date)+"' and RowKey ge '"+str(retrieve_max)+"'"
        tasks = table_service.query_entities('Table1', query)
        iqf_0 = []
        for task in tasks:
            if task.Name == "IQF_0":
                iqf_0.append([task.RowKey, task.Area])

        last_time = iqf_0[len(iqf_0)-1][0]
        time_max = datetime.datetime.strptime(last_time, '%H:%M:%S') - datetime.datetime.strptime(delta_time, '%H:%M:%S')  # + datetime.timedelta(hours=utc_diff)

        area = 0.0
        count = 0
        for i in range(len(iqf_0)-1, -1, -1):
            diff = datetime.datetime.strptime(last_time, '%H:%M:%S') - datetime.datetime.strptime(iqf_0[i][0], '%H:%M:%S')
            if diff.total_seconds() < 60:
                area += iqf_0[i][1]
                count += 1
            else:
                break
        area_average = area/count

        output_row = Entity()
        output_row.PartitionKey = now_date
        output_row.RowKey = last_time
        output_row.Name = task.Name
        output_row.Area = area_average
        table_service.insert_entity('Table2', output_row)

        date_max = datetime.datetime.now() + datetime.timedelta(days=-1)
        date_max = date_max.strftime("%d-%m-%Y")
        query = "PartitionKey eq '"+str(date_max)+"' and RowKey ge '"+str(retrieve_max)+"'"
        tasks = table_service.query_entities('Table2', query)
        for task in tasks:
            diff = datetime.datetime.strptime(now_time, '%H:%M:%S') - datetime.datetime.strptime(task.RowKey, '%H:%M:%S') + datetime.timedelta(hours=utc_diff)
            print(i, datetime.datetime.strptime(now_time, '%H:%M:%S'), datetime.datetime.strptime(task.RowKey, '%H:%M:%S'), diff.total_seconds())
            if task.PartitionKey == date_max and diff.total_seconds() > 0:
                table_service.delete_entity('Table2', task.PartitionKey, task.RowKey)

    time.sleep(60 - time.time() % 60)

It sounds like you were running two scripts that copy data within the same Azure Storage account, from Table 1 to Table 2 and from Table 2 to Table 3, at the same time. In my experience, this issue is normally caused by writing the same Table Entity concurrently, or by using the wrong operation for an entity that already exists; it is a write-contention problem.
EntityAlreadyExists is a common Table Service error; you can find it in the list of Table Service error codes.
There is also a document, Inserting and Updating Entities, which explains the differences between the Insert Entity, Update Entity, Merge Entity, Insert Or Merge Entity, and Insert Or Replace Entity operations.
Your code had not been shared at the time of writing, so considering all possible cases, there are three ways to fix the issue:
Run your two scripts one after another, in the order the data is copied between tables, rather than concurrently.
Use the correct function to update data for an existing entity; you can refer to the document above and the similar SO thread Add or replace entity in Azure Table Storage. A short sketch follows this list.
Use a global lock keyed on a Table Entity's unique primary key (PartitionKey plus RowKey) so the two scripts never operate on the same entity at the same time.
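As an illustration of the second option, here is a minimal, untested sketch using the same azure.storage.table SDK as the question; the account credentials, table name and property values are placeholders:
from azure.storage.table import TableService, Entity

table_service = TableService(account_name='xxx', account_key='yyy')

row = Entity()
row.PartitionKey = '29-01-2019'   # placeholder values for illustration
row.RowKey = '19:55:00'
row.Name = 'IQF_0'
row.Area = 123.4

# insert_entity raises EntityAlreadyExists (409 Conflict) when the
# PartitionKey/RowKey pair is already present. insert_or_replace_entity
# overwrites the existing entity instead, and insert_or_merge_entity merges
# the new properties into the existing one.
table_service.insert_or_replace_entity('Table2', row)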

Related

SQLITE - unable to combine thousands of delete statements into single transaction

I'm facing an issue where deleting thousands of rows from a single table takes a long time (over 2 minutes for 14k records), while inserting the same records is nearly instant (200 ms). Both the insert and delete statements are handled identically: a loop generates the statements and appends them to a list, then the list is passed to a separate function that opens a transaction, executes all the statements and finishes with a commit. At least that was my impression before I started testing this with pseudocode; it looks like I misunderstood the need for opening the transaction manually.
I've read about transactions (https://www.sqlite.org/faq.html#q19), but since the inserts are pretty much instant, I am not sure whether this is the issue here.
My understanding is that a transaction ends with a commit, and if this is correct then it looks like all the delete statements are in a single transaction: mid-processing I can still see all the rows being deleted until the final commit, after which they are actually gone. So the situation described in the FAQ link above should not apply, since no per-statement commit takes place. But the slow speed suggests it is still doing something else that slows things down, as if each delete statement were its own transaction.
After running the pseudocode it appears that, while the changes are indeed not committed until an explicit commit is sent (via conn.commit()), the "begin" or "begin transaction" in front of the loop has no effect. I think this is because sqlite3 sends the BEGIN automatically in the background (see Merge SQLite files into one db file, and 'begin/commit' question).
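For what it's worth, here is a minimal sketch of taking manual control of the transaction in sqlite3, assuming a test_table like the one in the pseudocode below; with isolation_level=None the module stops issuing BEGIN on your behalf, so an explicit BEGIN/COMMIT pair actually delimits the transaction:
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so sqlite3 no
# longer inserts an implicit BEGIN before data-modifying statements.
conn = sqlite3.connect('/data/test.db', isolation_level=None)

conn.execute('BEGIN')
for i in range(1000):
    conn.execute('DELETE FROM test_table WHERE column1 = ?', (str(i),))
conn.execute('COMMIT')
conn.close()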
Pseudocode to test this out:
import sqlite3
from datetime import datetime

insert_queries = []
delete_queries = []
rows = 30000

for i in range(rows):
    insert_queries.append(f'''INSERT INTO test_table ("column1") VALUES ("{i}");''')

for i in range(rows):
    delete_queries.append(f'''DELETE from test_table where column1 ="{i}";''')

conn = sqlite3.connect('/data/test.db', check_same_thread=False)

timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%S")
print('*'*50)
print(f'Starting inserts: {timestamp}')
# conn.execute('BEGIN TRANSACTION')
for query in insert_queries:
    conn.execute(query)
conn.commit()
timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%S")
print(f'Finished inserts: {timestamp}')
print('*'*50)
timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%S")
print(f'Starting deletes: {timestamp}')
# conn.isolation_level = None
# conn.execute('BEGIN;')
# conn.execute('BEGIN TRANSACTION;')
for query in delete_queries:
    conn.execute(query)
conn.commit()
timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%S")
print(f'Finished deletes: {timestamp}')
One weird thing is that the delete time grows much faster than linearly with the row count (2 s to delete 10k rows, 7 s for 20k rows, 43 s for 50k rows), while the insert time stays near-instant regardless of the row count.
EDIT:
The original question was: why does the delete statement take so much more time than the insert statement, and how can it be sped up so that deleting rows is about as fast as inserting them?
As per snakecharmerb's suggestion, one workaround is to do it like this:
rows = 100000
delete_ids = ''
for i in range(rows):
    if delete_ids:
        delete_ids += f',"{i}"'
    else:
        delete_ids += f'"{i}"'
delete_str = f'''DELETE from test_table where column1 IN ({delete_ids});'''
conn.execute(delete_str)
conn.commit()
While this is most likely against all best-practices, it does seem to work - it takes around 2 seconds to delete 1mil rows.
I tried batching the deletes in sets of 50:
...
batches = []
batch = []
for i in range(rows):
    batch.append(str(i))
    if len(batch) == 50:
        batches.append(batch)
        batch = []
if batch:
    batches.append(batch)
...
base = 'DELETE FROM test_table WHERE column1 IN ({})'
for batch in batches:
    placeholders = ','.join(['?'] * len(batch))
    sql = base.format(placeholders)
    conn.execute(sql, batch)
conn.commit()
...
and this reduced the duration to 1 - 2 seconds (from 6 - 8 originally).
Combining this approach with executemany resulted in a 1 second duration.
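For reference, a minimal sketch of that executemany variant, assuming the same test.db / test_table / column1 schema as the pseudocode above, and that the row count is a multiple of the batch size (executemany needs every parameter set to have the same length):
import sqlite3

rows = 30000
batch_size = 50

conn = sqlite3.connect('/data/test.db')

ids = [str(i) for i in range(rows)]
placeholders = ','.join(['?'] * batch_size)
sql = f'DELETE FROM test_table WHERE column1 IN ({placeholders})'

# One prepared statement, executed once per batch, all inside a single
# transaction that is committed at the end.
batches = [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]
conn.executemany(sql, batches)
conn.commit()
conn.close()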
Using a subquery to define the rows to delete was almost instant
DELETE FROM test_table WHERE column1 IN (SELECT column1 FROM test_table)
but it's possible SQLite recognises that this query is the same as a bare DELETE FROM test_table and optimises accordingly.
Switching off the secure_delete PRAGMA seemed to make performance even worse.

Is there a faster way to move millions of rows from Excel to a SQL database using Python?

I am a financial analyst with about two month's experience with Python, and I am working on a project using Python and SQL to automate the compilation of a report. The process involves accessing a changing number of Excel files saved in a share drive, pulling two tabs from each (summary and quote) and combining the datasets into two large "Quote" and "Summary" tables. The next step is to pull various columns from each, combine, calculate, etc.
The problem is that the dataset ends up being 3.4mm rows and around 30 columns. The program I wrote below works, but it took 40 minutes to work through the first part (creating the list of dataframes) and another 4.5 hours to create the database and export the data, not to mention using a LOT of memory.
I know there must be a better way to accomplish this, but I don't have a CS background. Any help would be appreciated.
import os
import pandas as pd
from datetime import datetime
import sqlite3
from sqlalchemy import create_engine
from playsound import playsound
reportmonth = '2020-08'
month_folder = r'C:\syncedSharePointFolder'
os.chdir(month_folder)
starttime = datetime.now()
print('Started', starttime)
c = 0
tables = list()
quote_combined = list()
summary_combined = list()
# Step through files in synced Sharepoint directory, select the files with the specific
# name format. For each file, parse the file name and add to 'tables' list, then load
# two specific tabs as pandas dataframes. Add two columns, format column headers, then
# add each dataframe to the list of dataframes.
for xl in os.listdir(month_folder):
    if '-Amazon' in xl:
        ttime = datetime.now()
        table_name = str(xl[11:-5])
        tables.append(table_name)
        quote_sheet = pd.read_excel(xl, sheet_name='-Amazon-Quote')
        summary_sheet = pd.read_excel(xl, sheet_name='-Amazon-Summary')
        quote_sheet.insert(0, 'reportmonth', reportmonth)
        summary_sheet.insert(0, 'reportmonth', reportmonth)
        quote_sheet.insert(0, 'source_file', table_name)
        summary_sheet.insert(0, 'source_file', table_name)
        quote_sheet.columns = quote_sheet.columns.str.strip()
        quote_sheet.columns = quote_sheet.columns.str.replace(' ', '_')
        summary_sheet.columns = summary_sheet.columns.str.strip()
        summary_sheet.columns = summary_sheet.columns.str.replace(' ', '_')
        quote_combined.append(quote_sheet)
        summary_combined.append(summary_sheet)
        c = c + 1
        print('Step', c, 'complete: ', datetime.now() - ttime, datetime.now() - starttime)
# Concatenate the list of dataframes to append one to another.
# Totals about 3.4mm rows for August
totalQuotes = pd.concat(quote_combined)
totalSummary = pd.concat(summary_combined)
# Change directory, create Sqlite database, and send the combined dataframes to database
os.chdir(r'H:\AaronS\Databases')
conn = sqlite3.connect('AMZN-Quote-files_' + reportmonth)
cur = conn.cursor()
engine = create_engine('sqlite:///AMZN-Quote-files_' + reportmonth + '.sqlite', echo=False)
sqlite_connection = engine.connect()
sqlite_table = 'totalQuotes'
sqlite_table2 = 'totalSummary'
totalQuotes.to_sql(sqlite_table, sqlite_connection, if_exists = 'replace')
totalSummary.to_sql(sqlite_table2, sqlite_connection, if_exists = 'replace')
print('Finished. It took: ', datetime.now() - starttime)
I see a few things you could do. Firstly, since your first step is just to transfer the data to your SQL DB, you don't necessarily need to append all the files to each other. You can just attack the problem one file at a time (which means you can multiprocess!); then whatever computations need to be completed can come later. This will also cut down your RAM usage, since if you have 10 files in your folder you aren't loading all 10 up at the same time.
I would recommend the following:
Construct an array of filenames that you need to access
Write a wrapper function that can take a filename, open and parse the file, and write the contents to your SQL DB (a rough sketch of this idea appears after this answer's SQL note below)
Use the Python multiprocessing.Pool class to process them simultaneously. If you run 4 processes, for example, your task becomes 4 times faster! If you need to derive computations from this data and hence need to aggregate it, please do this once the data's in the SQL DB. This will be way faster.
If you need to define some computations based on the aggregate data, do it now, in the SQL DB. SQL is an incredibly powerful language, and there's a command out there for practically everything!
I've added in a short code snippet to show you what I'm talking about :)
from multiprocessing import Pool

PROCESSES = 4
FILES = []

def _process_file(filename):
    print("Processing: " + filename)

pool = Pool(PROCESSES)
pool.map(_process_file, FILES)
SQL clarification: You don't need an independent table for every file you move to SQL! You can create a table based on a given schema, and then add the data from ALL your files to that one table, row by row. This is essentially what the function you use to go from DataFrame to table does, but it creates 10 different tables. You can look at some examples on inserting a row into a table here. However, in the specific use case that you have, setting the if_exists parameter to "append" should work, as you've mentioned in your comment. I just added the earlier references in because you mentioned that you're fairly new to Python, and a lot of my friends in the finance industry have found gaining a slightly more nuanced understanding of SQL to be extremely useful.
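Here is a rough, untested sketch of that wrapper-function idea (not the exact code above): it assumes the same folder, sheet names and SQLite database as the question, has the workers parse the Excel files in parallel, and appends to the database from the parent process only, since concurrent writers to a single SQLite file tend to block each other.
import os
from multiprocessing import Pool

import pandas as pd
from sqlalchemy import create_engine

MONTH_FOLDER = r'C:\syncedSharePointFolder'
REPORT_MONTH = '2020-08'
PROCESSES = 4

def parse_file(xl):
    """Read one workbook and return its cleaned quote and summary frames."""
    table_name = str(xl[11:-5])
    path = os.path.join(MONTH_FOLDER, xl)
    quote = pd.read_excel(path, sheet_name='-Amazon-Quote')
    summary = pd.read_excel(path, sheet_name='-Amazon-Summary')
    for df in (quote, summary):
        df.insert(0, 'reportmonth', REPORT_MONTH)
        df.insert(0, 'source_file', table_name)
        df.columns = df.columns.str.strip().str.replace(' ', '_')
    return quote, summary

if __name__ == '__main__':
    files = [f for f in os.listdir(MONTH_FOLDER) if '-Amazon' in f]
    engine = create_engine(r'sqlite:///H:\AaronS\Databases\AMZN-Quote-files_'
                           + REPORT_MONTH + '.sqlite')
    with Pool(PROCESSES) as pool:
        # Workers parse the Excel files in parallel; the parent appends the
        # results to the database one file at a time.
        for quote, summary in pool.imap_unordered(parse_file, files):
            quote.to_sql('totalQuotes', engine, if_exists='append', index=False)
            summary.to_sql('totalSummary', engine, if_exists='append', index=False)
    engine.dispose()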
Try this. Most of the time is spent loading the data from Excel into the DataFrames. I am not sure the following script will bring the run time down to seconds, but it will reduce the RAM footprint, which in turn could speed up the process; it should cut at least 5-10 minutes. Since I have no access to the data I cannot be sure, but you should try this:
import os
import pandas as pd
from datetime import datetime
import sqlite3
from sqlalchemy import create_engine
from playsound import playsound

reportmonth = '2020-08'
month_folder = r'C:\syncedSharePointFolder'

# Set up the database connection first (reportmonth must be defined before
# it is used in the database name).
os.chdir(r'H:\AaronS\Databases')
conn = sqlite3.connect('AMZN-Quote-files_' + reportmonth)
engine = create_engine('sqlite:///AMZN-Quote-files_' + reportmonth + '.sqlite', echo=False)
sqlite_connection = engine.connect()
sqlite_table = 'totalQuotes'
sqlite_table2 = 'totalSummary'

os.chdir(month_folder)
starttime = datetime.now()
print('Started', starttime)
c = 0
tables = list()

for xl in os.listdir(month_folder):
    if '-Amazon' in xl:
        ttime = datetime.now()
        table_name = str(xl[11:-5])
        tables.append(table_name)
        quote_sheet = pd.read_excel(xl, sheet_name='-Amazon-Quote')
        summary_sheet = pd.read_excel(xl, sheet_name='-Amazon-Summary')
        quote_sheet.insert(0, 'reportmonth', reportmonth)
        summary_sheet.insert(0, 'reportmonth', reportmonth)
        quote_sheet.insert(0, 'source_file', table_name)
        summary_sheet.insert(0, 'source_file', table_name)
        quote_sheet.columns = quote_sheet.columns.str.strip()
        quote_sheet.columns = quote_sheet.columns.str.replace(' ', '_')
        summary_sheet.columns = summary_sheet.columns.str.strip()
        summary_sheet.columns = summary_sheet.columns.str.replace(' ', '_')
        # Append each file's rows straight to the database instead of keeping
        # every dataframe in memory.
        quote_sheet.to_sql(sqlite_table, sqlite_connection, if_exists='append')
        summary_sheet.to_sql(sqlite_table2, sqlite_connection, if_exists='append')
        c = c + 1
        print('Step', c, 'complete: ', datetime.now() - ttime, datetime.now() - starttime)

Bigquery Partition Date Python

I would like to write the result into another BigQuery table, partitioned by date, but I couldn't find how to do it. I use Python and the Google Cloud library, and I want to create the table using standard SQL. However, I get an error.
Error: google.api_core.exceptions.BadRequest: 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/astute-baton-272707/queries/f4b9dadb-1390-4260-bb0e-fb525aff662c?maxResults=0&location=US: The number of columns in the column definition list does not match the number of columns produced by the query at [2:72]
Please let me know if there is another solution. The next stage of the project is to insert into this table day by day.
I may have been doing it wrong from the beginning; I am not sure.
Thank you.
client = bigquery.Client()
sql = """
CREATE OR REPLACE TABLE zzzzz.xxxxx.yyyyy (visitStartTime_ts INT64,date TIMESTAMP,hitsTime_ts INT64,appId STRING,fullVisitorId STRING,cUserId STRING,eventCategory STRING,eventLabel STRING,player_type STRING,PLAY_SESSION_ID STRING,CHANNEL_ID STRING,CONTENT_EPG_ID STRING,OFF_SET STRING)
PARTITION BY date
OPTIONS (
description="weather stations with precipitation, partitioned by day"
) AS
select
FORMAT_TIMESTAMP("%Y-%m-%d %H:%M:%S", TIMESTAMP_SECONDS(SAFE_CAST(visitStartTime AS INT64)), "Turkey") AS visitStartTime_ts,
date
,FORMAT_TIMESTAMP("%Y-%m-%d %H:%M:%S", TIMESTAMP_SECONDS(SAFE_CAST(visitStartTime+(h.time/1000) AS INT64)), "Turkey") AS hitsTime_ts
,h.appInfo.appId as appId
,fullVisitorId
,(SELECT value FROM h.customDimensions where index=1) as cUserId
,h.eventInfo.eventCategory as eventCategory
,h.eventInfo.eventAction as eventAction
,h.eventInfo.eventLabel as eventLabel
,REPLACE(SPLIT(h.eventInfo.eventCategory,'/{')[OFFSET(1)],'}','') as player_type
,SPLIT(h.eventInfo.eventLabel,'|')[OFFSET(0)] as PLAY_SESSION_ID
,SPLIT(h.eventInfo.eventLabel,'|')[OFFSET(1)] as CHANNEL_ID
,SPLIT(h.eventInfo.eventLabel,'|')[OFFSET(2)] as CONTENT_EPG_ID
,SPLIT(h.eventInfo.eventLabel,'|')[OFFSET(3)] as OFF_SET
FROM `zzzzz.yyyyyy.xxxxxx*` a,
UNNEST(hits) AS h
where
1=1
and SPLIT(SPLIT(h.eventInfo.eventCategory,'/{')[OFFSET(0)],'/')[OFFSET(0)] like 'player'
and _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND (BYTE_LENGTH(h.eventInfo.eventCategory) - BYTE_LENGTH(REPLACE(h.eventInfo.eventCategory,'/{','')))/2 + 1 = 2
AND h.eventInfo.eventAction='heartBeat'
"""
job = client.query(sql)  # API request.
job.result()  # Waits for the query to finish.
print('Query results loaded to table zzzzz.xxxxx.yyyyy')
A quick solution for the problem presented here: when creating a table, you don't need to declare its schema if the data comes from a query. Right now there is a conflict between the columns the query produces and the declared schema, so remove one of them.
Instead of starting the query with:
CREATE OR REPLACE TABLE zzzzz.xxxxx.yyyyy (visitStartTime_ts INT64,date TIMESTAMP,hitsTime_ts INT64,appId STRING,fullVisitorId STRING,cUserId STRING,eventCategory STRING,eventLabel STRING,player_type STRING,PLAY_SESSION_ID STRING,CHANNEL_ID STRING,CONTENT_EPG_ID STRING,OFF_SET STRING)
PARTITION BY date
Start the query with:
CREATE OR REPLACE TABLE zzzzz.xxxxx.yyyyy
PARTITION BY date
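As a small, hedged follow-up sketch: once the corrected CREATE OR REPLACE TABLE ... PARTITION BY date AS SELECT ... statement has run, you can confirm the partitioning from the same google-cloud-bigquery client (the table ID below is the placeholder from the question, and recent client versions accept it as a plain string):
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table('zzzzz.xxxxx.yyyyy')  # API request

# For a day-partitioned table this prints something like
# TimePartitioning(type_=DAY, field='date').
print(table.time_partitioning)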

Improve sqlite query speed

I have a list of numbers (actually, percentages) to update in a database. The query is very simple: I obtain the ids of the items somewhere in my code, and then I update these items in the database with the list of numbers. See my code:
start_time = datetime.datetime.now()

query = QtSql.QSqlQuery("files.sqlite")
for id_bdd, percentage in zip(list_id, list_percentages):
    request = "UPDATE papers SET percentage_match = ? WHERE id = ?"
    params = (percentage, id_bdd)
    query.prepare(request)
    for value in params:
        query.addBindValue(value)
    query.exec_()

elapsed_time = datetime.datetime.now() - start_time
print(elapsed_time.total_seconds())
It takes 1 second to generate list_percentages, and more than 2 minutes to write all the percentages to the database.
I use SQLite for the database, and there are about 7000 items in it. Is it normal that the query takes so much time?
If not, is there a way to optimize it?
EDIT:
Comparison with the sqlite3 module from the std library:
bdd = sqlite3.connect("test.sqlite")
bdd.row_factory = sqlite3.Row
c = bdd.cursor()

request = "UPDATE papers SET percentage_match = ? WHERE id = ?"
for id_bdd, percentage in zip(list_id, list_percentages):
    params = (percentage, id_bdd)
    c.execute(request, params)

bdd.commit()
c.close()
bdd.close()
I think QSqlQuery commits the changes at each loop iteration, while the sqlite3 module lets me commit all the different queries at the end, in one go.
For the same test database, the QSqlQuery version takes ~22 s, while the sqlite3 version takes ~0.3 s. I can't believe this is just a performance issue; I must be doing something wrong.
You need to start a transaction, and commit all the updates after the loop.
Not tested but should be close to:
start_time = datetime.datetime.now()

# Start the transaction on the default database connection
db = QtSql.QSqlDatabase.database()
db.transaction()

query = QtSql.QSqlQuery("files.sqlite")
for id_bdd, percentage in zip(list_id, list_percentages):
    request = "UPDATE papers SET percentage_match = ? WHERE id = ?"
    params = (percentage, id_bdd)
    query.prepare(request)
    for value in params:
        query.addBindValue(value)
    query.exec_()

# Commit the changes
if db.commit():
    print("updates ok")

elapsed_time = datetime.datetime.now() - start_time
print(elapsed_time.total_seconds())
On the other hand, this could also be a database performance issue; try creating an index on the id field: https://www.sqlite.org/lang_createindex.html
You will need direct access to the database:
CREATE INDEX papers_id_idx ON papers (id);
Do you really need to call prepare() each time? It seems to me the request doesn't change, so the prepare() call could be moved out of the loop.
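Putting the two suggestions together, here is an untested sketch (assuming PyQt5 and the list_id / list_percentages lists from the question) with a single transaction around the loop and prepare() called only once:
import datetime
from PyQt5 import QtSql  # assumption: PyQt5; adjust the import for PyQt4/PySide

start_time = datetime.datetime.now()

db = QtSql.QSqlDatabase.database()   # the default, already-opened connection
db.transaction()                     # one transaction for all the updates

query = QtSql.QSqlQuery(db)
query.prepare("UPDATE papers SET percentage_match = ? WHERE id = ?")

# list_id and list_percentages are the lists built earlier in the question's code
for id_bdd, percentage in zip(list_id, list_percentages):
    query.bindValue(0, percentage)
    query.bindValue(1, id_bdd)
    query.exec_()

if db.commit():
    print("updates ok")

elapsed_time = datetime.datetime.now() - start_time
print(elapsed_time.total_seconds())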

How to make an efficient query for extracting entries of all days in a database in sets?

I have a database that includes 440 days of several series with a sampling time of 5 seconds. There is also missing data.
I want to calculate the daily average. The way I am doing it now is that I make 440 queries and do the averaging afterward. But, this is very time consuming since for every query the whole database is searched for related entries. I imagine there must be a more efficient way of doing this.
I am doing this in python, and I am just learning sql. Here's the query section of my code:
time_cur = date_begin
Data = numpy.zeros(shape=(N, NoFields - 1))
X = []
nN = 0
while time_cur < date_end:
    X.append(time_cur)
    cur = con.cursor()
    cur.execute("SELECT * FROM os_table \
                 WHERE EXTRACT(year from datetime_)=%s \
                 AND EXTRACT(month from datetime_)=%s \
                 AND EXTRACT(day from datetime_)=%s",
                (time_cur.year, time_cur.month, time_cur.day))
    Y = numpy.array([0]*(NoFields-1))
    n = 0.0
    while True:
        n = n + 1
        row = cur.fetchone()
        if row == None:
            break
        Y = Y + numpy.array(row[1:])
    Data[nN][:] = Y/n
    nN = nN + 1
    time_cur = time_cur + datetime.timedelta(days=1)
And, my data looks like this:
datetime_,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14
2012-11-13-00:07:53,42,30,0,0,1,9594,30,218,1,4556,42,1482,42
2012-11-13-00:07:58,70,55,0,0,2,23252,55,414,2,2358,70,3074,70
2012-11-13-00:08:03,44,32,0,0,0,11038,32,0,0,5307,44,1896,44
2012-11-13-00:08:08,36,26,0,0,0,26678,26,0,0,12842,36,1141,36
2012-11-13-00:08:13,33,26,0,0,0,6590,26,0,0,3521,33,851,33
I appreciate your suggestions.
Thanks
Iman
I don't know the numpy part, so I don't fully understand what you are averaging; it would help if you showed your table and the logic used to get the average.
But this is how to get a daily average for a couple of columns:
import psycopg2

conn = psycopg2.connect('host=localhost4 port=5432 dbname=cpn')
cursor = conn.cursor()
cursor.execute('''
    select
        datetime_::date as day,
        avg(c1) as c1_average,
        avg(c2) as c2_average
    from os_table
    where datetime_ between %s and %s
    group by 1
    order by 1
    ''',
    (time_cur, date_end)
)
rs = cursor.fetchall()
conn.close()

for day in rs:
    print(day[0], day[1], day[2])
This answer uses SQL Server syntax; I am not sure how different PostgreSQL is, but it should be fairly similar. You may find things like the DATEADD, DATEDIFF and CONVERT statements differ (almost certainly the CONVERT statement; just convert the date to a varchar instead, as it is only used as a report name, so it is not vital). You should be able to follow the theory of this, even if the code doesn't run in PostgreSQL without tweaking; a PostgreSQL sketch of the same idea follows at the end of this answer.
First Create a Reports Table ( you will use this to link to the actual table you want to report on )
CREATE TABLE Report_Periods (
report_name VARCHAR(30) NOT NULL PRIMARY KEY,
report_start_date DATETIME NOT NULL,
report_end_date DATETIME NOT NULL,
CONSTRAINT date_ordering
CHECK (report_start_date <= report_end_date)
)
Next populate the report table with the dates you need to report on, there are many ways to do this - the method I've chosen here will only use the days you need, but you could create this with all dates you are ever likely to use, so you only have to do it once.
INSERT INTO Report_Periods (report_name, report_start_date, report_end_date)
SELECT CONVERT(VARCHAR, [DatePartOnly], 0) AS DateName,
[DatePartOnly] AS StartDate,
DATEADD(ms, -3, DATEADD(dd,1,[DatePartOnly])) AS EndDate
FROM ( SELECT DISTINCT DATEADD(DD, DATEDIFF(DD, 0, datetime_), 0) AS [DatePartOnly]
FROM os_table ) AS M
Note that in SQL Server the smallest time increment allowed for DATETIME is about 3 milliseconds, so the statement above adds 1 day and then subtracts 3 milliseconds to create a start and end datetime for each day. Again, PostgreSQL may have different values.
This means you can simply link the reports table back to your os_table to get averages, counts, etc.:
SELECT AVG(value) AS AvgValue, COUNT(value) AS NumValues, R.report_name
FROM os_table AS T
JOIN Report_Periods AS R ON T.datetime_>= R.report_start_date AND T.datetime_<= R.report_end_date
GROUP BY R.report_name
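Since the answer above is written in SQL Server syntax, here is a hedged, untested PostgreSQL/psycopg2 sketch of the same reports-table idea, adapted to the os_table / datetime_ schema from the question. It uses a half-open interval [start, end) instead of subtracting 3 milliseconds from the end of each day, which sidesteps the smallest-time-unit question entirely, and it only averages c1 for brevity.
import psycopg2

conn = psycopg2.connect('host=localhost4 port=5432 dbname=cpn')
cur = conn.cursor()

# One row per distinct day found in os_table.
cur.execute("""
    CREATE TABLE report_periods (
        report_name       varchar(30) NOT NULL PRIMARY KEY,
        report_start_date timestamp   NOT NULL,
        report_end_date   timestamp   NOT NULL,
        CONSTRAINT date_ordering CHECK (report_start_date <= report_end_date)
    )
""")
cur.execute("""
    INSERT INTO report_periods (report_name, report_start_date, report_end_date)
    SELECT to_char(d, 'YYYY-MM-DD'), d, d + interval '1 day'
    FROM (SELECT DISTINCT date_trunc('day', datetime_) AS d FROM os_table) AS m
""")

# Join the periods back to the raw samples and aggregate per day.
cur.execute("""
    SELECT r.report_name, avg(t.c1) AS avg_c1, count(t.c1) AS num_values
    FROM os_table AS t
    JOIN report_periods AS r
      ON t.datetime_ >= r.report_start_date
     AND t.datetime_ <  r.report_end_date
    GROUP BY r.report_name
    ORDER BY r.report_name
""")
for report_name, avg_c1, num_values in cur.fetchall():
    print(report_name, avg_c1, num_values)

conn.commit()
conn.close()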
