I need to make a complex query, and I need help. Below is an example of what I have:
id | Date | Validity
48 | 6-1-2009 | notFound
47 | 6-1-2009 | valid
46 | 6-1-2009 | valid
45 | 3-1-2009 | invalid
44 | 3-1-2009 | invalid
42 | 4-1-2009 | notFound
41 | 4-1-2009 | notFound
48 | 4-1-2009 | valid
[Here is where the SQL should come in.]
And the query result would look like:
Date | valid | invalid | notFound
3-1-2009 | 0 | 2 | 0
4-1-2009 | 1 | 2 | 2
6-1-2009 | 3 | 2 | 3
Can I do this with SQLite alone? Can someone help me? I'm working in Python.
I need the result to generate a line graph (the example I have in mind is a simple line chart).
I believe this should do it:
SELECT Date,
       SUM(Validity = 'valid')    AS valid,
       SUM(Validity = 'invalid')  AS invalid,
       SUM(Validity = 'notFound') AS notFound
FROM table_name
GROUP BY Date
The SUM function tallies the results of the comparisons: the = operator evaluates to 1 when the comparison is true and 0 when it is false, which is what makes the counting work. (Note that string literals in SQLite belong in single quotes; double quotes are normally reserved for identifiers.)
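Since you mention you are working in Python, here is a minimal sketch of running that query with the standard sqlite3 module; the database file name and table_name are placeholders for whatever you actually use:

import sqlite3

# Placeholder file and table names; adjust them to your schema.
conn = sqlite3.connect('mydata.db')
cursor = conn.cursor()
cursor.execute("""
    SELECT Date,
           SUM(Validity = 'valid')    AS valid,
           SUM(Validity = 'invalid')  AS invalid,
           SUM(Validity = 'notFound') AS notFound
    FROM table_name
    GROUP BY Date
    ORDER BY Date
""")
for date, valid, invalid, notfound in cursor.fetchall():
    print(date, valid, invalid, notfound)

From there the three count columns can be handed straight to whatever plotting library you use (matplotlib, for instance) to draw the line chart.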
** EDIT **
I just realized that this isn't quite what you want, as you're looking for a running aggregate at each date that also includes the data from previous dates. This can be done fairly easily in Python.
valid_sum = 0
invalid_sum = 0
notfound_sum = 0
for r in cursor.fetchall():
    date = r[0]
    valid_sum += int(r[1])
    invalid_sum += int(r[2])
    notfound_sum += int(r[3])
    # print (or collect) the aggregate data for this date
    print(date, valid_sum, invalid_sum, notfound_sum)
Your Date column cannot be properly sorted because it does not begin with the most significant field, the year.
Assuming that you change the date to have a proper yyyy-mm-dd format, you could use something like this:
SELECT A.Date,
       (SELECT COUNT(*) FROM MyTable AS B
        WHERE B.Date <= A.Date
          AND B.Validity = 'valid') AS valid,
       (SELECT COUNT(*) FROM MyTable AS B
        WHERE B.Date <= A.Date
          AND B.Validity = 'invalid') AS invalid,
       (SELECT COUNT(*) FROM MyTable AS B
        WHERE B.Date <= A.Date
          AND B.Validity = 'notFound') AS notFound
FROM (SELECT DISTINCT Date FROM MyTable
      ORDER BY Date) AS A
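If your SQLite is version 3.25 or newer, window functions give you the same running totals without the correlated subqueries. A sketch of how that could be issued from Python, assuming an open sqlite3 cursor and the same MyTable layout:

cumulative_sql = """
    SELECT Date,
           SUM(valid)    OVER (ORDER BY Date) AS valid,
           SUM(invalid)  OVER (ORDER BY Date) AS invalid,
           SUM(notFound) OVER (ORDER BY Date) AS notFound
    FROM (SELECT Date,
                 SUM(Validity = 'valid')    AS valid,
                 SUM(Validity = 'invalid')  AS invalid,
                 SUM(Validity = 'notFound') AS notFound
          FROM MyTable
          GROUP BY Date)
    ORDER BY Date
"""
for row in cursor.execute(cumulative_sql):
    print(row)

The inner query counts each Validity per date, and SUM(...) OVER (ORDER BY Date) turns those per-date counts into running totals.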
I have a large dataset (think: big data) of network elements that form a tree-like network.
A toy dataset looks like this:
| id | type | parent_id |
|-----:|:-------|:------------|
| 1 | D | <NA> |
| 2 | C | 1 |
| 3 | C | 2 |
| 4 | C | 3 |
| 5 | B | 3 |
| 6 | B | 4 |
| 7 | A | 4 |
| 8 | A | 5 |
| 9 | A | 3 |
Important rules:
The root nodes (in the toy example of type D) and the leaf nodes (in the toy example of type A) cannot be connected with each other and amongst each other. I.e., a D node cannot be connected with another D node (vice-versa for A nodes) and an A node cannot directly be connected with a D node.
For simplicity reasons, any other node type can randomly be connected in terms of types.
The tree depth can be arbitrarily deep.
The leaf node is always of type A.
A leaf node does not need to be connected through all intermediate nodes. In reality there are only a handful of intermediary nodes that are mandatory to pass through; this circumstance can be neglected for this example.
If you are to recommend doing it in Spark, the solution must be written with pyspark in mind.
What I would like to achieve is to build an efficient way (preferably in Spark) to calculate the tree-path for each node, like so:
| id | type | parent_id | path |
|-----:|:-------|:------------|:--------------------|
| 1 | D | <NA> | D:1 |
| 2 | C | 1 | D:1>C:2 |
| 3 | C | 2 | D:1>C:2>C:3 |
| 4 | C | 3 | D:1>C:2>C:3>C:4 |
| 5 | B | 3 | D:1>C:2>C:3>B:5 |
| 6 | B | 4 | D:1>C:2>C:3>C:4>B:6 |
| 7 | A | 4 | D:1>C:2>C:3>C:4>A:7 |
| 8 | A | 5 | D:1>C:2>C:3>B:5>A:8 |
| 9 | A | 3 | D:1>C:2>C:3>A:9 |
Note:
Each element in the tree path is constructed like this: type:id.
If you have other efficient ways to store the tree path (e.g., closure tables) and calculate them, I am happy to hear them as well. However, the runtime for calculation must be really low (less than an hour, preferably minutes) and retrieval later needs to be in the area of few seconds.
The ultimate end goal is to have a data structure that allows me to aggregate any network node underneath a certain node efficiently (runtime of a few seconds at most).
The actual dataset consisting of around 3M nodes can be constructed like this:
Note:
The commented-out node_counts produces the toy example shown above.
The distribution of the node elements is close to reality.
import random
import pandas as pd
random.seed(1337)
node_counts = {'A': 1424383, 'B': 596994, 'C': 234745, 'D': 230937, 'E': 210663, 'F': 122859, 'G': 119453, 'H': 57462, 'I': 23260, 'J': 15008, 'K': 10666, 'L': 6943, 'M': 6724, 'N': 2371, 'O': 2005, 'P': 385}
#node_counts = {'A': 3, 'B': 2, 'C': 3, 'D': 1}
elements = list()
candidates = list()
root_type = list(node_counts.keys())[-1]
leaf_type = list(node_counts.keys())[0]
root_counts = node_counts[root_type]
leaves_count = node_counts[leaf_type]
ids = [i + 1 for i in range(sum(node_counts.values()))]
idcounter = 0
for i, (name, count) in enumerate(sorted(node_counts.items(), reverse=True)):
    for _ in range(count):
        _id = ids[idcounter]
        idcounter += 1
        _type = name
        if i == 0:
            _parent = None
        else:
            # select a random one that is not a root or a leaf
            if len(candidates) == 0:  # first bootstrap case
                candidate = random.choice(elements)
            else:
                candidate = random.choice(candidates)
            _parent = candidate['id']
        _obj = {'id': _id, 'type': _type, 'parent_id': _parent}
        #print(_obj)
        elements.append(_obj)
        if _type != root_type and _type != leaf_type:
            candidates.append(_obj)
df = pd.DataFrame.from_dict(elements).astype({'parent_id': 'Int64'})
In order to produce the tree path in pure python with the above toy data you can use the following function:
def get_hierarchy_path(df, cache_dict, ID='id', LABEL='type', PARENT_ID='parent_id', node_sep='|', elem_sep=':'):
    def get_path(record):
        if pd.isna(record[PARENT_ID]):
            return f'{record[LABEL]}{elem_sep}{record[ID]}'
        else:
            if record[PARENT_ID] in cache_dict:
                parent_path = cache_dict[record[PARENT_ID]]
            else:
                try:
                    parent_path = get_path(df.query(f'{ID} == {record[PARENT_ID]}').iloc[0])
                except IndexError as e:
                    print(f'Index Miss for {record[PARENT_ID]} on record {record.to_dict()}')
                    parent_path = f'{record[LABEL]}{elem_sep}{record[ID]}'
                cache_dict[record[PARENT_ID]] = parent_path
            return f"{parent_path}{node_sep}{record[LABEL]}{elem_sep}{record[ID]}"
    return df.apply(get_path, axis=1)
df['path'] = get_hierarchy_path(df, dict(), node_sep='>')
What I have already tried:
Calculating in pure Python with the above function on the large dataset takes me around 5.5 hours, so that is not really a solution. Anything quicker than this is appreciated.
Technically, using the Spark graphframes package, I could use BFS. That would give me a good solution for individual leaf nodes, but it does not scale to the entire network.
I think Pregel is the way to go here. But I do not know how to construct it in Pyspark.
Thank you for your help.
My current solution for this challenge no longer relies on Spark but on SQL.
I load the whole dataset into a Postgres DB and place a unique index on id, type and parent_id.
Then, using the following query, I can calculate the path:
with recursive recursive_hierarchy AS (
    -- starting point
    select
        parent_id
      , id
      , type
      , type || ':' || id as path
      , 1 as lvl
    from hierarchy.nodes
    union all
    -- recursion
    select
        ne.parent_id as parent_id
      , h.id
      , h.type
      , ne.type || ':' || ne.id || '|' || h.path as path
      , h.lvl + 1 as lvl
    from (
        select *
        from hierarchy.nodes
    ) ne
    inner join recursive_hierarchy h
        on ne.id = h.parent_id
), paths as (
    -- complete results
    select
        *
    from recursive_hierarchy
), max_lvl as (
    -- retrieve the longest path of a network element
    select
        id
      , max(lvl) as max_lvl
    from paths
    group by id
)
-- all results with only the longest path of a network element
select distinct
    p.id
  , p.type
  , p.path
from paths p
inner join max_lvl l
    on p.id = l.id
    and p.lvl = l.max_lvl
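If you would still like to stay in Spark, one option (only a rough sketch, not what the query above uses, and not a real Pregel implementation) is to resolve each node's path with iterative self-joins in plain pyspark.sql, one ancestor level per pass. It assumes a Spark DataFrame named nodes with the columns id, type and parent_id (parent_id is NULL for roots); the nullable Int64 column from the pandas frame may need casting first.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# 'type:id' element for a node, e.g. D:1
elem = F.concat_ws(':', F.col('type'), F.col('id').cast('string'))

# For every node keep the path resolved so far plus the ancestor that still
# needs to be resolved (NULL once the root has been reached).
paths = nodes.select('id', F.col('parent_id').alias('pending'), elem.alias('path'))

parents = nodes.select(
    F.col('id').alias('p_id'),
    F.col('parent_id').alias('p_parent'),
    elem.alias('p_elem'),
)

# Loops as many times as the tree is deep; assumes every parent_id exists,
# otherwise the affected rows never resolve.
while paths.filter(F.col('pending').isNotNull()).count() > 0:
    paths = (
        paths.join(parents, paths['pending'] == parents['p_id'], 'left')
             .select(
                 'id',
                 F.when(F.col('p_id').isNotNull(), F.col('p_parent'))
                  .otherwise(F.col('pending')).alias('pending'),
                 F.when(F.col('p_id').isNotNull(),
                        F.concat_ws('>', F.col('p_elem'), F.col('path')))
                  .otherwise(F.col('path')).alias('path'),
             )
             .localCheckpoint()  # truncate the lineage so the plan stays small
    )

result = nodes.join(paths.select('id', 'path'), on='id', how='left')

Whichever way the paths are materialised, the end goal from the question - aggregating everything underneath a given node - then becomes a prefix match on the path column (every descendant of node 1 has a path starting with D:1 plus the separator), which is what keeps retrieval in the range of seconds.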
I have this query:
MyTable.objects.filter(date=date).exclude(starthour__range=(start, end), endhour__range=(start, end))
But I want to exclude the records where starthour__range=(start, end) AND endhour__range=(start, end), not OR. I think in this case the OR is used.
Could you help me please ?
Thank you very much !
This is a consequence of De Morgan's law [wiki], which specifies that ¬ (x ∧ y) is ¬ x ∨ ¬ y. This means that the negation of x and y is not x or not y. Indeed, if we take a look at the truth table:
x | y | x ∧ y | ¬x | ¬y | ¬(x ∧ y) | ¬x ∨ ¬y
---+---+-------+----+----+----------+---------
0 | 0 | 0 | 1 | 1 | 1 | 1
0 | 1 | 0 | 1 | 0 | 1 | 1
1 | 0 | 0 | 0 | 1 | 1 | 1
1 | 1 | 1 | 0 | 0 | 0 | 0
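You can check the equivalence quickly in Python itself:

# Verify ¬(x ∧ y) == (¬x ∨ ¬y) for every boolean combination.
print(all(
    (not (x and y)) == ((not x) or (not y))
    for x in (False, True)
    for y in (False, True)
))  # prints True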
So excluding items where both the starthour is in the (start, end) range and the endhour is in the (start, end) range is logically equivalent to allowing items where the starthour is not in the range, or where the endhour is not in the range.
Using and logic
You can thus make a disjunction in the .exclude(…) call to filter out items that satisfy one of the two conditions, or thus retain objects that do not satisfy any of the two conditions:
from django.db.models import Q

MyTable.objects.filter(date=date).exclude(
    Q(starthour__range=(start, end)) | Q(endhour__range=(start, end))
)
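Equivalently, for simple column lookups like these you can avoid the Q objects by chaining two .exclude(…) calls; each call adds its own negated condition, so the combined effect is the same "not A and not B":

MyTable.objects.filter(date=date).exclude(
    starthour__range=(start, end)
).exclude(
    endhour__range=(start, end)
)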
Overlap logic
Based on your query however, you are looking for overlap, not for such range checks, and then validating only the starthour and endhour is not sufficient. Indeed, imagine that an event starts at 08:00 and ends at 18:00, and you filter for a range from 09:00 to 17:00: neither the starthour nor the endhour lies in the range, but the events still overlap.
Two ranges [s1, e1] and [s2, e2] do not overlap if s1 ≥ e2, or s2 ≥ e1. The negation, the condition under which the two overlap, is thus: s1 < e2 and s2 < e1. We can thus exclude the items that overlap with:
# records that do not overlap
MyTable.objects.filter(date=date).exclude(
    starthour__lt=end, endhour__gt=start
)
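Conversely, if you ever need the overlapping records themselves, the same condition can simply go into a .filter(…) call (a small sketch with the same field names):

# records that do overlap with the (start, end) range
MyTable.objects.filter(date=date, starthour__lt=end, endhour__gt=start)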
I have looked at many similar sources to this question but can't seem to find an answer that solves my problem.
I am using python3.7.1 and sqlite3 to essentially create quizzes. I have an array of questions which I have inserted into a table, with a 'Class_name' and 'quizNumber' using this code example:
quizQuestions = ['1 + 1 =', '2 + 2 =', '3 + 3 =', ...]
quizNumber = 1
quizClass = '6D'
for question in quizQuestions:
    cursor.execute("INSERT INTO set_questionsTbl (QuizNumber, Class_name, Question) VALUES (?,?,?)",
                   (quizNumber, quizClass, question))
This code works fine and the table(set_questionsTbl) looks like this:
QuizNumber | Class_name | Question | Answer
-------------------------------------------
1 | 6D | 1 + 1 = |
1 | 6D | 2 + 2 = |
1 | 6D | 3 + 3 = |
etc..
I also have an array of answers:
quizAnswers = [2,4,6,...]
The problem that occurs is when trying to update the table with the answers so it looks like this:
QuizNumber | Class_name | Question | Answer
-------------------------------------------
1 | 6D | 1 + 1 = | 2
1 | 6D | 2 + 2 = | 4
1 | 6D | 3 + 3 = | 6
etc...
The code I tried was this:
for answer in quizAnswers:
    cursor.execute("UPDATE set_questionsTbl SET Answer = (?)", (answer,))
This didn't work: with every loop iteration the previously written answer got overwritten, leaving me with this:
QuizNumber | Class_name | Question | Answer
-------------------------------------------
1 | 6D | 1 + 1 = | 6
1 | 6D | 2 + 2 = | 6
1 | 6D | 3 + 3 = | 6
etc...
I have also tried joining the loops together, but that doesn't work either; the table then looks like:
QuizNumber | Class_name | Question | Answer
-------------------------------------------
1 | 6D | 1 + 1 = | 2
1 | 6D | 2 + 2 = | 2
1 | 6D | 3 + 3 = | 2
1 | 6D | 1 + 1 = | 4
1 | 6D | 2 + 2 = | 4
1 | 6D | 3 + 3 = | 4
1 | 6D | 1 + 1 = | 6
1 | 6D | 2 + 2 = | 6
1 | 6D | 3 + 3 = | 6
I have tried to correct this many times and searched many different examples but couldn't seem to find a solution. So how do I loop through each question and update the answer with each answer in quizAnswers?
I am new to stack overflow so I may have missed a question similar to this, if so please link it.
The point you are missing is that you instruct your database to update all records with the answer in the current iteration of the loop. Your example:
quizAnswers = [2,4,6]
for answer in quizAnswers:
    cursor.execute("UPDATE set_questionsTbl SET Answer = (?)", (answer,))
basically does:
set Answer for *all* records to 2
set Answer for *all* records to 4
set Answer for *all* records to 6
Your database does not understand the meaning of the data you put into the tables. For the DB, the questions you inserted into the Question column of your set_questionsTbl are just sequences of characters. It has no means of evaluating the equation in Question in order to set a value in Answer only if that value is the result of the equation.
You need to tell your DB which answer to set for which question(s).
You need to give your query criteria, that a record must meet for the update operation to be applied to it. That's what WHERE clauses are for.
Following your example, you'll want to create a correlation between questions and their respective answers. The below is one way to do that.
First, create a list of tuples. Each tuple contains an answer in the first element, the second element holds the corresponding question (this needs to be the exact value you inserted into the Question column).
quiz_question_answers = [(2, '1 + 1 ='), (4, '2 + 2 ='), (6, '3 + 3 =')]
Then you can iterate over that list of tuples to execute the UPDATE query in your cursor:
for q_a in quiz_question_answers:
    cursor.execute("UPDATE set_questionsTbl SET Answer = ? WHERE Question = ?", (q_a[0], q_a[1]))
This will update the answer in all records that have the specific equation in Question. Records with different Class_name and/or QuizNumber - a record like this for example:
QuizNumber | Class_name | Question | Answer
-------------------------------------------
4 | 5A | 1 + 1 = |
- will be updated as well, because the only criterion, the Question equaling 1 + 1 =, is met for this record too.
If you want the answers to be set only for questions of quiz number 1 for class 6D, you'll have to add more restrictive criteria to your query, e.g.:
quiz_question_answers = [(2, '1 + 1 =', 1, '6D'), (4, '2 + 2 =', 1, '6D'), (6, '3 + 3 =', 1, '6D')]
for q_a in quiz_question_answers:
    cursor.execute("UPDATE set_questionsTbl "
                   "SET Answer = ? "
                   "WHERE Question = ? "
                   "AND QuizNumber = ? "
                   "AND Class_name = ?",
                   (q_a[0], q_a[1], q_a[2], q_a[3]))
With this particular solution (using a list of tuples with the query parameters in the correct order) you can also use executemany to let the cursor do the loop over the list of parameter tuples for you:
quiz_question_answers = [(2, '1 + 1 ='), (4, '2 + 2 ='), (6, '3 + 3 =')]
cursor.executemany("UPDATE set_questionsTbl SET Answer = ? WHERE Question = ?", quiz_question_answers)
Here's some further reading on WHERE clauses specifically for SQLite.
You should have a unique ID in at least one of the fields of your table. That way, when you execute the answer update, you can add a WHERE clause, e.g.:
cursor.execute( "UPDATE set_questionsTbl SET answer=? WHERE unique_id=?", (answer,uniqueId))
and that way not all of your records will be updated.
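If you do not want to add an extra column, note that an ordinary SQLite table already carries an implicit rowid that can serve as that unique ID. A rough sketch, assuming the rows were inserted in the same order as the answers in quizAnswers and that connection is your sqlite3 connection object:

# Fetch the rowids of this quiz's questions in insertion order.
rows = cursor.execute(
    "SELECT rowid FROM set_questionsTbl WHERE QuizNumber = ? AND Class_name = ? ORDER BY rowid",
    (quizNumber, quizClass)
).fetchall()

# Pair each rowid with its answer and update everything in one call.
cursor.executemany(
    "UPDATE set_questionsTbl SET Answer = ? WHERE rowid = ?",
    [(answer, row[0]) for row, answer in zip(rows, quizAnswers)]
)
connection.commit()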
I have a dataframe df of the form
type | time | value
------------------------
a | 1.2 | 1
a | 1.3 | 3
a | 2.1 | 4
a | 2.3 | 6
b | 2 | 21
b | 3 | 3
. . .
. . .
Is there any feasible way to consolidate (sum), for each row, all following rows of the same type that have a timestamp difference of less than, for example, 1?
So for this example, the second and third row should be added to the first and the output should be
type | time | value
------------------------
a | 1.2 | 8
a | 2.3 | 6
b | 2 | 21
b | 3 | 3
. . .
. . .
Normally I would simply iterate over every row, add the value of all following rows that satisfy the constraint to the active row, and then drop all rows whose values were added from the dataframe. But I'm not completely sure how to do that safely with pandas, considering that "You should never modify something you are iterating over."
But I sadly also don't see how this could be done with any operation that is applied on the whole dataframe at once.
Edit: I've found a very rough way to do it using a while loop. In every iteration it only adds the next row to those rows that already have no row of the same type with a time stamp less than 1 before it:
df['nexttime']= df['time'].shift(-1)
df['nexttype']= df['type'].shift(-1)
df['lasttime']= df['time'].shift(1)
df['lasttype']= df['type'].shift(1)
df['nextvalue'] = df['value'].shift(-1)
while df.loc[(df.type == df.nexttype) & ((df.time - df.lasttime > 1) | (df.type != df.lasttype)) & (df.time - df.nexttime <= 1), 'value'].any():
    mask = (df.type == df.nexttype) & ((df.time - df.lasttime > 1) | (df.type != df.lasttype)) & (df.time - df.nexttime <= 1)
    df.loc[mask, 'value'] = df.loc[mask, 'value'] + df.loc[mask, 'nextvalue']
    df = df.loc[~((df.shift(1).type == df.shift(1).nexttype) & ((df.shift(1).time - df.shift(1).lasttime > 1) | (df.shift(1).type != df.shift(1).lasttype)) & (df.shift(1).time - df.shift(1).nexttime <= 1))]
    df['nexttime'] = df['time'].shift(-1)
    df['nexttype'] = df['type'].shift(-1)
    df['lasttime'] = df['time'].shift(1)
    df['lasttype'] = df['type'].shift(1)
    df['nextvalue'] = df['value'].shift(-1)
I would still be very interested if there is any faster way to do this, as this kind of loop obviously is not very efficient (especially since for the kind of dataframes I work with it has to iterate a few ten thousand times).
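One way to avoid mutating the frame while iterating is to first assign a group id per type - a new group starts whenever a row's time is 1 or more away from the first row of the current group, so only differences strictly below 1 get consolidated, matching your example - and then let groupby do the summing. This is only a sketch of the idea, using the column names from your example:

import pandas as pd

def assign_groups(times, threshold=1.0):
    # Start a new group whenever a time is `threshold` or more
    # after the first time of the current group.
    group_ids, gid, anchor = [], -1, None
    for t in times:
        if anchor is None or t - anchor >= threshold:
            gid += 1
            anchor = t
        group_ids.append(gid)
    return group_ids

parts = []
for _, grp in df.sort_values('time').groupby('type'):
    grp = grp.copy()
    grp['group'] = assign_groups(grp['time'].tolist())
    parts.append(grp)

out = (pd.concat(parts)
         .groupby(['type', 'group'], as_index=False)
         .agg({'time': 'first', 'value': 'sum'})
         .drop(columns='group'))

The plain Python loop only assigns group ids in a single pass per type, so the whole thing stays linear in the number of rows instead of rescanning the frame on every iteration.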
Basically I'm estimating pi using polygons. I have a loop which gives me a value for n, ann and bnn before running the loop again. Here is what I have so far:
from math import sqrt

def printPiTable(an, bn, n, k):
    """Prints out a table for values n, 2n, ..., (2^k)n"""
    u = (2**k) * n
    power = 0
    t = (2**power) * n
    while t <= u:
        if power < 1:
            print(t, an, bn)
            power = power + 1
            t = (2**power) * n
        else:
            afrac = (1/2) * ((1/an) + (1/bn))
            ann = 1/afrac
            bnn = sqrt(ann*bn)
            print(t, ann, bnn)
            an = ann
            bn = bnn
            power = power + 1
            t = (2**power) * n
    return
This is what I get if I run it with these values:
>>> printPiTable(4,2*sqrt(2),4,5)
4 4 2.8284271247461903
8 3.3137084989847607 3.0614674589207187
16 3.1825978780745285 3.121445152258053
32 3.1517249074292564 3.1365484905459398
64 3.1441183852459047 3.1403311569547534
128 3.1422236299424577 3.1412772509327733
I want to find a way to print these values in a nice, neat table instead of just printing them out like this. Any help?
Use string formatting. For example,
print('{:<4}{:>20f}{:>20f}'.format(t,ann,bnn))
produces
4 4.000000 2.828427
8 3.313708 3.061467
16 3.182598 3.121445
32 3.151725 3.136548
64 3.144118 3.140331
128 3.142224 3.141277
{:<4} is replaced by t, left-justified in a field at least 4 characters wide.
{:>20f} is replaced by ann, right-justified and formatted as a float in a field at least 20 characters wide.
The full story on the format string syntax is explained here.
To add column headers, just add a print statement like
print('{:<4}{:>20}{:>20}'.format('t','a','b'))
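On Python 3.6+ the same format specifications also work inside f-strings, which some find easier to read:

print(f'{t:<4}{ann:>20f}{bnn:>20f}')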
For fancier ascii tables, consider using a package like prettytable:
from math import sqrt

import prettytable

def printPiTable(an, bn, n, k):
    """Prints out a table for values n, 2n, ..., (2^k)n"""
    table = prettytable.PrettyTable(['t', 'a', 'b'])
    u = (2**k) * n
    power = 0
    t = (2**power) * n
    while t <= u:
        if power < 1:
            table.add_row((t, an, bn))
            power = power + 1
            t = (2**power) * n
        else:
            afrac = (1/2) * ((1/an) + (1/bn))
            ann = 1/afrac
            bnn = sqrt(ann*bn)
            table.add_row((t, ann, bnn))
            an = ann
            bn = bnn
            power = power + 1
            t = (2**power) * n
    print(table)

printPiTable(4, 2*sqrt(2), 4, 5)
yields
+-----+---------------+---------------+
| t | a | b |
+-----+---------------+---------------+
| 4 | 4 | 2.82842712475 |
| 8 | 3.31370849898 | 3.06146745892 |
| 16 | 3.18259787807 | 3.12144515226 |
| 32 | 3.15172490743 | 3.13654849055 |
| 64 | 3.14411838525 | 3.14033115695 |
| 128 | 3.14222362994 | 3.14127725093 |
+-----+---------------+---------------+
Perhaps it is overkill for this sole purpose, but Pandas can make nice tables too, and can export them in other formats, such as HTML.
You can use output formatting to make it look pretty. Look here for an example:
http://docs.python.org/release/1.4/tut/node45.html