SQLite - calculate moving average - python

I have a table in SQLite, created using pysqlite:
create table t
(
id integer primary key not null,
time datetime not null,
price decimal(5,2)
)
How can I calculate a moving average with a window X seconds wide from this data, using an SQL statement?

As far as I understand your question, you do not want the average over the last N items, but over the last X seconds, is that correct?
Well, this gives you the list of all prices recorded in the last 720 seconds:
>>> cur.execute("SELECT price FROM t WHERE datetime(time) > datetime('now','-720 seconds')").fetchall()
Of course, you can feed that to SQL's AVG function to get the average price in that window:
>>> cur.execute("SELECT AVG(price) FROM t WHERE datetime(time) > datetime('now','-720 seconds')").fetchall()
You can also use other time units, and even chain them.
For example, to obtain the average price for the last one and a half hours, do:
>>> cur.execute("SELECT AVG(price) FROM t WHERE datetime(time) > datetime('now','-30 minutes','-1 hour')").fetchall()
Edit: SQLite datetime reference can be found here
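If you want the moving average for every row rather than just the most recent window, one option is a correlated subquery. This is only a sketch, reusing the question's table with a 720-second window, and it assumes time is stored in an ISO-8601 format that SQLite's datetime() understands:
>>> cur.execute("""
...     SELECT t1.id, t1.time,
...            (SELECT AVG(t2.price)
...             FROM t AS t2
...             WHERE t2.time BETWEEN datetime(t1.time, '-720 seconds') AND t1.time) AS moving_avg
...     FROM t AS t1
...     ORDER BY t1.time
... """).fetchall()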

The moving average with a window N items wide is, at time i:
(x[i] + x[i-1] + ... + x[i-N+1]) / N
To compute it incrementally, keep a FIFO queue of the last N values together with their running sum. For each new value, append it to the queue and add it to the sum; once the queue holds more than N items, pop the oldest value off the front and subtract it from the sum. The new value comes from the database, and the old one comes off the queue.
Does that make sense?
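In Python, the queue-plus-running-sum idea above looks roughly like this. It is a minimal sketch for a fixed window of N items, not the seconds-based window from the question:
from collections import deque

def moving_average(values, window):
    """Yield the running mean of the last `window` values."""
    buf = deque()
    total = 0.0
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            total -= buf.popleft()   # drop the oldest value from the running sum
        yield total / len(buf)

# list(moving_average([1, 2, 3, 4, 5], 3)) -> [1.0, 1.5, 2.0, 3.0, 4.0]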

Related

Calculate weighted average in pandas with unique condition

I'm trying to calculate the weighted average of the "prices" column in the following dataframe for each zone, regardless of hour. I want to essentially sum the quantities that match A, divide each individual quantity row by that amount (to get the weights) and then multiply it by the price.
There are about 200 zones, and I'm having a hard time writing something that will generically detect that the zones match, without having to write df['ZONE'] = 'A' and so on for each one. Please help my lost self =)
HOUR: 1,2,3,1,2,3,1,2,3
ZONE: A,A,A,B,B,B,C,C,C
PRICE: 12,15,16,17,12,11,12,13,15
QUANTITY: 5,6,1,5,7,9,6,3,2
I'm not sure if you can write something fully generic, but I thought: what if I wrote a function where x is my zone, built a list of the possible zones, and then looped over it? Here's the function I wrote; it doesn't really work, and I'm trying to figure out how else I can make it work:
def wavgp(x):
    df.loc[df['ZONE'].isin([str(x)])] = x
Here is a possible solution using a groupby operation:
weighted_price = df.groupby('ZONE').apply(lambda x: (x['PRICE'] * x['QUANTITY']).sum()/x['QUANTITY'].sum())
Explanation
First we group by ZONE; within each block (rows of the same zone) we multiply the price by the quantity and sum those values. We then divide that result by the sum of the quantities in the block to get the desired weighted average.
ZONE
A 13.833333
B 12.761905
C 12.818182
dtype: float64
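Putting the question's sample data together with the groupby above, a self-contained sketch (column names as in the question):
import pandas as pd

# reconstruct the sample data from the question
df = pd.DataFrame({
    'HOUR':     [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'ZONE':     list('AAABBBCCC'),
    'PRICE':    [12, 15, 16, 17, 12, 11, 12, 13, 15],
    'QUANTITY': [5, 6, 1, 5, 7, 9, 6, 3, 2],
})

# quantity-weighted average price per zone
weighted_price = df.groupby('ZONE').apply(
    lambda x: (x['PRICE'] * x['QUANTITY']).sum() / x['QUANTITY'].sum()
)
print(weighted_price)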

Python Pandas - Cumulative time series data, calculate change?

I'm pulling data from an API at intervals, each item I pull has a "start date" (for the ad campaign, t1) and the next points will be an increasing t2 value (now). It's for a learning project I'm doing, as I'm relatively new to data science.
The values, such as revenue, cost, clicks, conversions etc. are cumulative. To find the change from one datapoint to the next, I'd have to subtract n - (n-1), as n contains the data from (n-1).
I pull the data into a dataframe using (the database is sqlite for now):
SQL = """SELECT
MAX(a.t2) as "Snapshot time",
a.volid AS "Camp ID",
a.tsid AS "TS ID",
a.placement as "Source ID",
a.clicks AS "Clicks tracker",
a.visits AS "Visits tracker",
a.conversions AS "Conversion",
a.revenue AS "Revenue USD",
b.cost AS "Cost USD" ,
b.clicks AS "ts Clicks",
from tracker a JOIN ts b ON a.placement = b.placement AND a.tsid =
b.campaignid AND a.t2 = b.t2
GROUP BY a.voli, a.tsid, a.placement"""
df_t2 = pd.read_sql_query(SQL, conn)
df_t2['snapshot'] = pd.to_datetime(df_t2['snapshot'], format='%Y-%m-%dT%H:%M:%S.%fZ')
# Generate time value for the second sql query, for n-x
t1 = df_t2['snapshot'].max() - dt.timedelta(hours=offset)
t1 = t1.strftime('%Y-%m-%dT%H:%M:%S.%fZ')
This gives me the latest snapshot (t0..tn). My initial thought was to build a similar dataframe for (t0..t(n-1)) and subtract the two, which is where the t1 variable in the code above came into the picture.
But I tried this and can't get it to work. I also tried handling it as a time series in pandas, but I'm not sure my data structure works for that.
The expected result would be a dataframe containing only the data for tn - t(n-1). Even better would be to generate all the n - (n-1) differences for the entire series, so each record is an increment rather than a cumulative value.
Any input would be much appreciated. Thanks in advance.
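A grouped diff is one way to turn cumulative snapshots into per-interval increments. A minimal sketch, assuming df_t2 holds the full history of snapshots, the time column is named snapshot as in the code above, and the identifier and metric columns follow the query's aliases:
import pandas as pd

keys = ['Camp ID', 'TS ID', 'Source ID']               # assumed identifier columns
metrics = ['Clicks tracker', 'Visits tracker', 'Conversion',
           'Revenue USD', 'Cost USD', 'ts Clicks']      # cumulative columns to difference

df_t2 = df_t2.sort_values('snapshot')
increments = df_t2.copy()
# n - (n-1) within each campaign/source series; the first snapshot of a series
# has no predecessor, so keep its cumulative value as the first increment
increments[metrics] = df_t2.groupby(keys)[metrics].diff().fillna(df_t2[metrics])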

Getting a sub-dataframe from values

I'm working with time series data in this format:
[timestamp][rain value]
I want to count rainfall events in the time series, where a rainfall event is defined as a sub-dataframe of the main dataframe containing the nonzero values between zero rainfall values.
I managed to get the start of the event by taking the index of the rainfall value just before the first nonzero value:
start = df.rain.values.nonzero()[0][0] - 1
cur = df[start:]
What I can't figure out is how to find the end. I was looking for some function zero():
end=cur.rain.values.zero()[0][0]
to find the next zero value in the rain column and mark that as the end of my sub-dataframe.
Additionally, because my data is sampled at 15-minute intervals, a temporary lull of 15 minutes would give me two rainfall events instead of one, which realistically isn't true. So I would like to define some time period, 6 hours for example, that has to pass before rainfall counts as a separate event.
What I was thinking of (but could not execute, because I couldn't find the end of the sub-dataframe yet), in pseudocode:
start = df.rain.values.nonzero()[0][0] - 1
cur = df[start:]
end=cur.rain.values.zero()[0][0]
temp = df[end:]
z = temp.rain.values.nonzero()[0][0] - 1
if timedelta(z - end) >= 6hrs:
    end stays as endpoint of cur
else:
    z is new endpoint, find next nonzero to again check
So I guess my questions are: how do I find the end of my sub-dataframe without iterating over all the rows, and am I on the right track with my pseudocode in defining the end of a rainfall event as, say, 6 hours of zero rain?
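One way to do both things without a row loop is to label the events directly from the timestamps of the wet samples. A minimal sketch, assuming df has a DatetimeIndex at 15-minute intervals and a rain column, with 6 hours as the dry spell that separates events:
import pandas as pd

gap = pd.Timedelta(hours=6)          # dry spell that separates two events
wet = df['rain'] > 0

# time elapsed since the previous wet sample, evaluated at each wet sample
wet_times = df.index.to_series()[wet]
pause = wet_times.diff()

# a wet sample starts a new event if the pause before it is at least `gap`
new_event = pause >= gap
new_event.iloc[0] = True             # the very first wet sample starts event 1
event_id = new_event.cumsum()

# each group is one rainfall event: the nonzero rows between long dry spells
events = {eid: sub for eid, sub in df.loc[wet].groupby(event_id)}

# the "end" of an event is simply the last row of its group, so there is no
# need to search for the next zero by hand
first_event = events[1]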

SQLAlchemy Subtract 2 Values in Same Column Differentiated By Another Column ID

I am looking to write a query that looks at the "reading" column of a table and returns the difference between the average readings over the past hour, split by another column (called height) with an id of 1 or 2.
Essentially: take the average of all readings over the past hour with a height value of 1, take the average of all readings over the past hour with a height value of 2, and subtract the two.
How can I do this in one query in SQLAlchemy?
This might not be what you want because I don't know SQLAlchemy, but a raw PostgreSQL query to achieve it could be:
SELECT
AVG(CASE WHEN height = 2 THEN reading ELSE NULL END) -
AVG(CASE WHEN height = 1 THEN reading ELSE NULL END)
FROM table
WHERE reading_time >= (CURRENT_TIMESTAMP - INTERVAL '1 HOUR')
This makes use of the fact that NULL values are excluded from the AVG aggregate function.
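On the SQLAlchemy side, the same conditional-aggregation trick can be written with case(). This is only a sketch in the 1.4+ calling style; Reading, its columns, and session are hypothetical names standing in for your mapped class and session:
from datetime import datetime, timedelta
from sqlalchemy import func, case

one_hour_ago = datetime.utcnow() - timedelta(hours=1)

# AVG ignores NULLs, so each case() only contributes rows of the matching height
avg_h2 = func.avg(case((Reading.height == 2, Reading.reading)))
avg_h1 = func.avg(case((Reading.height == 1, Reading.reading)))

diff = (
    session.query(avg_h2 - avg_h1)
    .filter(Reading.reading_time >= one_hour_ago)
    .scalar()
)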

Python library for dealing with time associated data?

I've got some data (NOAA-provided weather forecasts) I'm trying to work with. There are various data series (temperature, humidity, etc.), each of which contains a series of data points indexed by datetimes, on various time scales (some series are hourly, others 3-hourly, some daily). Is there any sort of library for dealing with data like this and accessing it in a user-friendly way?
Ideal usage would be something like:
db = TimeData()
db.set_val('2010-12-01 12:00','temp',34)
db.set_val('2010-12-01 15:00','temp',37)
db.set_val('2010-12-01 12:00','wind',5)
db.set_val('2010-12-01 13:00','wind',6)
db.query('2010-12-01 13:00') # {'wind':6, 'temp':34}
Basically the query would return the most recent value of each series.
I looked at scikits.timeseries, but it isn't very amenable to this use case, due to the amount of pre-computation involved (it expects all the data in one shot, no random-access setting).
If your data is sorted you can use the bisect module to quickly get the entry with the greatest time less than or equal to the specified time.
Something like:
from bisect import bisect_right

i = bisect_right(times, time)
# times[j] <= time for j < i
# times[j] > time for j >= i
if times[i-1] == time:
    # exact match
    value = values[i-1]
else:
    # interpolate between the surrounding samples
    value = (values[i-1] + values[i]) / 2
SQLite also has built-in date and time functions. Alternatively, you can convert all the times to seconds since the epoch (e.g. with time.mktime() or calendar.timegm()), which makes comparisons trivial.
It is a classic rows-to-columns problem; in a good SQL DBMS you can use unions:
SELECT MAX(d_t) AS d_t, SUM(temp) AS temp, SUM(wind) AS wind, ... FROM (
    (SELECT d_t, 0 AS temp, value AS wind FROM table
     WHERE type='wind' AND d_t >= some_date
     ORDER BY d_t DESC LIMIT 1)
    UNION
    (SELECT d_t, value, 0 FROM table
     WHERE type='temp' AND d_t >= some_date
     ORDER BY d_t DESC LIMIT 1)
    UNION
    ...
) q1;
The trick is to make a subquery for each dimension while providing placeholder columns for the other dimensions. In Python you can use SQLAlchemy to dynamically generate a query like this.
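As a sketch of what that dynamic generation could look like in SQLAlchemy (1.4+ Core style; the readings table, its columns, and some_date are hypothetical stand-ins for the actual schema):
from datetime import datetime, timedelta
from sqlalchemy import (MetaData, Table, Column, DateTime, String, Float,
                        select, union_all, literal, func)

metadata = MetaData()
# hypothetical layout: one row per (timestamp, series name, value)
readings = Table(
    'readings', metadata,
    Column('d_t', DateTime),
    Column('type', String),
    Column('value', Float),
)

dimensions = ['temp', 'wind']                          # series to pivot into columns
some_date = datetime.utcnow() - timedelta(days=1)      # hypothetical cutoff

def latest_for(dim):
    # one row: the most recent reading for `dim`, with 0 placeholders
    # in every other dimension's column
    cols = [readings.c.d_t] + [
        (readings.c.value if d == dim else literal(0)).label(d) for d in dimensions
    ]
    return (select(*cols)
            .where(readings.c.type == dim, readings.c.d_t >= some_date)
            .order_by(readings.c.d_t.desc())
            .limit(1))

# wrap each member so its ORDER BY / LIMIT sit inside parentheses, which most
# backends require inside a compound select
members = [select(latest_for(d).subquery()) for d in dimensions]
subq = union_all(*members).subquery('q1')

query = select(
    func.max(subq.c.d_t).label('d_t'),
    *[func.sum(subq.c[d]).label(d) for d in dimensions],
)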
