I've got a system built in Django which receives data. I store the data as follows:
id | sensor | message_id | value
----+--------+------------+-------
1 | A | 1 | xxx
2 | A | 2 | xxx
3 | A | 3 | xxx
4 | B | 1 | xxx
5 | B | 2 | xxx
6 | B | 4 | xxx
7 | B | 7 | xxx
We expect the message_id to increase by one per sensor with every subsequent message. As you can see, the message_ids for sensor B are: 1, 2, 4, 7. This means the messages with numbers 3, 5 and 6 are missing for sensor B. In this case we would need to investigate the missing messages, especially if many are missing. So I now want a way to know about these missing messages when this happens.
So I want to check whether any messages went missing in the past five minutes. I would expect an output that says something like:
3 messages are missing for sensor B in the last 5 minutes. The following ids are missing: 3, 5, 6
The simplest way I thought of doing this is by querying the message_id for one sensor and then looping over them to check whether any number is skipped. I thought of something like this:
five_minutes_ago = datetime.now() - timedelta(minutes=5)
queryset = MessageData.objects.filter(created__gt=five_minutes_ago).filter(sensor='B').order_by('message_id')
last_message_id = None
for md in queryset:
    if last_message_id is not None and md.message_id != last_message_id + 1:
        missing_messages = md.message_id - last_message_id - 1
        print(f"{missing_messages} messages missing for sensor {md.sensor}")
    last_message_id = md.message_id
But since I've got hundreds of sensors this seems like it's not the best way to do it. It might even be possible to do in the SQL itself, but I'm unaware of a way to do so.
How could I efficiently do this?
You can try something like this. I have added comments above each line explaining the logic; feel free to comment in case of any query.
five_minutes_ago = datetime.now() - timedelta(minutes=5)
queryset = MessageData.objects.filter(created__gt=five_minutes_ago).filter(sensor='B').order_by('message_id')
# number of rows that should ideally be there if no message_id was missing,
# i.e. a full run from the first to the last message_id in the window
ideal_num_of_rows = queryset.last().message_id - queryset.first().message_id + 1
# total number of message_ids actually present
total_num_of_rows_present = queryset.count()
# number of missing message_ids
num_of_missing_message_ids = ideal_num_of_rows - total_num_of_rows_present
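If you also need the specific ids that are missing (as in your example output), a set difference over the window gives them; a minimal sketch along the same lines, assuming message_id is an integer field and the queryset is not empty:

present = set(queryset.values_list('message_id', flat=True))
expected = set(range(min(present), max(present) + 1))
missing = sorted(expected - present)
print(f"{len(missing)} messages are missing for sensor B in the last 5 minutes. "
      f"The following ids are missing: {', '.join(map(str, missing))}")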
You can accomplish what you want with a single SQL statement. The following generates, for each sensor, an array of the missing message ids and the number of missing messages. This is done in 3 steps:
1. Get the minimum and maximum message ids per sensor.
2. Generate a dense list of the message ids needed.
3. Left join the actual sensor messages with the dense list, select only those ids in the dense list that are not in the actual table, and count the items selected.
with sensor_range (sensor, min_msg_id, max_msg_id) as -- 1 get necessary message range
( select sensor
, min(message_id)
, max(message_id)
from sensor_messages
group by sensor
-- where message_ts > current_timestamp - interval '5 min'
) --select * from sensor_range;
, sensor_series (sensor, msg_id) as -- 2 generate list of needed message_ids
( select sensor, n
from sensor_range sr
cross join generate_series( sr.min_msg_id
, sr.max_msg_id
, 1
) gs(n)
) --select * from sensor_series;
select ss.sensor
, array_agg(ss.msg_id) missing_message_ids --3 Identify missing message_ids and count their number
, array_length(array_agg(ss.msg_id),1) missing_messages_count
from sensor_series ss
left join sensor_messages sm
on ( ss.sensor = sm.sensor
and sm.message_id = ss.msg_id
)
where sm.message_id is null
group by ss.sensor
order by ss.sensor;
See demo here. This could be packaged into an SQL function that returns a table if desired. A good reference.
Your description mentions a time range, but your data does not have a timestamp column. The query has a comment for handling this.
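If you want to run this from Django rather than directly in the database, a raw query is one way; a minimal sketch, assuming the statement above is stored in MISSING_MESSAGES_SQL (a name chosen here) and that sensor_messages corresponds to your MessageData table:

from django.db import connection

MISSING_MESSAGES_SQL = """
-- paste the full query shown above here
"""

with connection.cursor() as cursor:
    cursor.execute(MISSING_MESSAGES_SQL)
    for sensor, missing_ids, missing_count in cursor.fetchall():
        print(f"{missing_count} messages are missing for sensor {sensor}: {missing_ids}")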
I'm trying to wrap my head around the pyflink datastream api.
My use case is the following:
The source is a kinesis datastream consisting of the following:
cookie | cluster | dim0 | dim1 | dim2 | time_event
-------+---------+------+------+------+-----------
  1    |    1    |  5   |  5   |  5   | 1min
  1    |    2    |  1   |  0   |  6   | 30min
  2    |    1    |  1   |  2   |  3   | 45min
  1    |    1    |  10  |  10  |  15  | 70min
  2    |    1    |  5   |  5   |  10  | 120min
I want to create a session window aggregation with a gap of 60 minutes, calculating the mean for each cookie-cluster combination. The window assignment should be based on the cookie, the aggregation based on cookie and cluster.
The result would therefore be like this (each row being forwarded immediately; for example, the fourth row is the running mean of the two cookie 1 / cluster 1 events at 1min and 70min, which belong to the same session because consecutive events for cookie 1 are less than 60 minutes apart):
cookie | cluster | dim0 | dim1 | dim2 | time_event
-------+---------+------+------+------+-----------
  1    |    1    |  5   |  5   |  5   | 1min
  1    |    2    |  1   |  0   |  6   | 30min
  2    |    1    |  1   |  2   |  3   | 45min
  1    |    1    | 7.5  | 7.5  |  10  | 70min
  2    |    1    |  5   |  5   |  10  | 120min
Expressed in SQL, for a new record I'd like to perform this aggregation:
INSERT INTO `input` (`cookie`, `cluster`, `dim0`, `dim1`, `dim2`, `time_event`) VALUES
("1", "1", 0, 0, 0, 125)
WITH RECURSIVE by_key AS (
SELECT *,
(time_event - lag(time_event) over (partition by cookie order by time_event)) as "time_passed"
FROM input
WHERE cookie = "1"
),
new_session AS (
SELECT *,
CASE WHEN time_passed > 60 THEN 1 ELSE 0 END as "new_session"
FROM by_key),
by_session AS (
SELECT *, SUM(new_session) OVER(partition by cookie order by time_event) as "session_number"
FROM new_session)
SELECT cookie, cluster, avg(dim0), avg(dim1), avg(dim2), max(time_event)
FROM by_session
WHERE cluster = "1"
GROUP BY session_number
ORDER BY session_number DESC
LIMIT 1
I tried to accomplish this with the table api, but I need the results to be updated as soon as a new record is added to a cookie-cluster combination. This is my first project with flink, and the datastream API is an entirely different beast, especially since a lot of stuff is not included yet for python.
My current approach looks like this:
Create a table from the Kinesis data stream (the DataStream API has no Kinesis connector)
Convert it to a datastream to perform the aggregation. From what I've read, watermarks are propagated and the resulting row objects contain the column names, i.e. I can handle them like a python dictionary. Please correct me, if I'm wrong on this.
Key the data stream by the cookie.
Window with a custom SessionWindowAssigner, borrowing from the Table API. I'm working on a separate post on that.
Process the windows by calculating the mean for each cluster
table_env = StreamTableEnvironment.create(stream_env, environment_settings=env_settings)
table_env.execute_sql(
create_table(input_table_name, input_stream, input_region, stream_initpos)
)
ds = table_env.to_append_stream(input_table_name)
ds.key_by(lambda r: r["cookie"])\
    .window(SessionWindowAssigner(session_gap=60, is_event_time=True))\
    .trigger(OnElementTrigger())\
    .process(MeanWindowProcessFunction())
My basic idea for the ProcessWindowFunction would go like this:
class MeanWindowProcessFunction(ProcessWindowFunction[Dict, Dict, str, TimeWindow]):

    def process(self,
                key: str,
                context: ProcessWindowFunction.Context,
                elements: Iterable) -> Iterable[Dict]:
        clusters = {}
        cluster_records = {}
        for element in elements:
            if element["cluster"] not in clusters:
                clusters[element["cluster"]] = {k: v for k, v in element.as_dict().items()}
                cluster_records[element["cluster"]] = 1  # first record seen for this cluster
            else:
                for dim in range(3):
                    clusters[element["cluster"]][f"dim{dim}"] += element[f"dim{dim}"]
                clusters[element["cluster"]]["time_event"] = element["time_event"]
                cluster_records[element["cluster"]] += 1

        # divide the accumulated sums by the record count per cluster to get the means
        for cluster in clusters.keys():
            for dim in range(3):
                clusters[cluster][f"dim{dim}"] /= cluster_records[cluster]

        return clusters.values()

    def clear(self, context: 'ProcessWindowFunction.Context') -> None:
        pass
Is this the right approach for this problem?
Do I need to consider anything else for the ProcessWindowFunction, like actually implementing the clear method?
I'd be very grateful for any help, or any more elaborate examples of windowed analytics applications in pyflink. Thank you!
I have a MySQL Column with the following information:
codes = [
"[1]",
"[1-1]",
"[1-1-01]",
"[1-1-01-01]",
"[1-1-01-02]",
"[1-1-01-03]",
"[1-1-02]",
"[1-1-02-01]",
"[1-1-02-02]"
"[1-2]",
"[1-2-01]",
"[1-2-01-01]",
"[2]",
"[2-1]",
"[2-1-01-01]",
"[2-1-01-02]"
]
This is a hierarchical structure for accounts, and for each account I need to know which is the parent and which is the child, to add to a secondary table called AccountsTree.
My models are:
Accounts:
id = db.Column(Integer)
account = db.Column(Integer)
...
numbers = db.Column(Integer)
AccountsTree:
id = db.Column(Integer)
parent = db.Column(Integer, db.ForeignKey('Accounts.id'))
child = db.Column(Integer, db.ForeignKey('Accounts.id'))
I started coding something like:
parsed = []
for each_element in code_list:
    cleaned = each_element.replace("[", "").replace("]", "")
    # split on '-' so each inner list holds the levels of one code
    parsed.append(cleaned.split("-"))
But that's starting not to look very good, and it looks as if I'm adding unnecessary complexity.
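An alternative that might keep it simpler: the parent code of any account is just its code with the last level dropped, e.g. the parent of [1-1-01] is [1-1]. A rough sketch of that idea (parent_of is only an illustrative helper, not part of the models above):

def parent_of(code):
    # "[1-1-01]" -> "[1-1]"; top-level codes like "[1]" have no parent
    levels = code.strip("[]").split("-")
    if len(levels) == 1:
        return None
    return "[" + "-".join(levels[:-1]) + "]"

pairs = [(parent_of(c), c) for c in codes if parent_of(c) is not None]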
My workflow is:
(1) Import XLS from front end
(2) Parse XLS and add information to Accounts table
(3) Find out hierarchy of accounts and add information to AccountsTree table
I'm currently struggling with the 3rd step. So, after adding information to my Accounts table, how do I find out the hierarchical structure to fill my AccountsTree table?
My desired result is:
ParentID | ChildID
1 | 2
2 | 3
3 | 4
3 | 4
3 | 5
3 | 6
Has anybody been through a similar challenge and can share an efficient approach?
Thanks
Hi, I have a rather simple task, but it seems like none of the online help is working.
I have data set like this:
ID | Px_1 | Px_2
theta| 106.013676 | 102.8024788702673
Rho | 100.002818 | 102.62640389123405
gamma| 105.360589 | 107.21999706084836
Beta | 106.133046 | 115.40449479551263
alpha| 106.821119 | 110.54312246081719
I want to find the row-wise min in a fourth column, so that the output would be, for example: theta is 102.802, because that is the min of its Px_1 and Px_2 values.
I tried this, but it doesn't work; I constantly get the max value:
df_subset = read.set_index('ID')[['Px_1','Px_2']]
d = df_subset.min( axis=1)
Thanks
You can try this
df["min"] = df[["Px_1", "Px_2"]].min(axis=1)
Select the columns needed, here ["Px_1", "Px_2"], and take the min along axis=1 (row-wise).
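Applied to the frame from the question, a minimal sketch reusing the df_subset variable (assuming read holds the original data and the Px columns are numeric):

df_subset = read.set_index('ID')[['Px_1', 'Px_2']]
df_subset['min'] = df_subset.min(axis=1)
print(df_subset.loc['theta', 'min'])  # 102.8024788702673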
I am coming from a Java background and learning Python by applying it in my work environment whenever possible. I have a piece of functioning code that I would really like to improve.
Essentially I have a list of namedtuples with 3 numerical values and 1 time value.
complete=[]
uniquecomplete=set()
screenedPartitions = namedtuple('screenedPartitions', ['feedID', 'partition', 'date', 'screeeningMode'])
I parse a log and after this is populated, I want to create a reduced set that is essentially the most recently dated member where feedID, partition and screeningMode are identical. So far I can only get it out by using a nasty nested loop.
for a in complete:
    newest = a
    for b in complete:
        if a.feedID == b.feedID and a.partition == b.partition and \
                a.screeeningMode == b.screeeningMode and newest.date < b.date:
            newest = b
    uniquecomplete.add(newest)
Could anyone give me advice on how to improve this? It would be great to work it out with what's available in the stdlib, as I guess my main task here is to get me thinking about it with the map/filter functionality.
The data looks akin to
FeedID | Partition | Date | ScreeningMode
68 | 5 |10/04/2017 12:40| EPEP
164 | 1 |09/04/2017 19:53| ISCION
164 | 1 |09/04/2017 20:50| ISCION
180 | 1 |10/04/2017 06:11| ISAN
128 | 1 |09/04/2017 21:16| ESAN
So after the code is run, row 2 would be removed, as row 3 is a more recent version of it.
Tl;Dr, what would this SQL be in Python :
SELECT feedID, partition, screeeningMode, max(date)
FROM Complete
GROUP BY feedID, partition, screeeningMode
Try something like this:
import pandas as pd
df = pd.DataFrame(complete, columns=screenedPartitions._fields)
df = df.groupby(['feedID','partition','screeeningMode']).max()
It really depends on how your date is represented, but if you provide data I think we can work something out.
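If you would rather stay in the stdlib as the question suggests, a dict keyed on the three grouping fields does the same job in one pass; a minimal sketch, assuming the date values compare chronologically (e.g. datetime objects):

latest = {}
for rec in complete:
    group = (rec.feedID, rec.partition, rec.screeeningMode)
    # keep only the most recently dated record per group
    if group not in latest or rec.date > latest[group].date:
        latest[group] = rec
uniquecomplete = set(latest.values())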
I have two tables with the following structures, where in table 1 the ID is next to the Name, while in table 2 the ID is next to Title 1. The one similarity between the two tables is that the first person always has the ID next to their name; they differ for the subsequent people.
Table 1:
Name&Title | ID #
----------------------
Random_Name 1|2000
Title_1_1 | -
Title_1_2 | -
Random_Name 2| 2000
Title_2_1 | -
Title_2_2 | -
... |...
Table 2:
Name&Title | ID #
----------------------
Random_Name 1| 2000
Title_1_1 | -
Title_1_2 | -
Random_Name 2| -
Title_2_1 | 2000
Title_2_2 | -
... |...
I have the code to recognize table 1 but struggle to incorporate structure 2. The table is stored as a nested list of rows (each row is a list). Usually, for one person there is only one name row but multiple title rows. The pseudo-code is this:
set count = 0
find the ID next to the first name, set it to be a recognizer
for row_i,row in enumerate(table):
compare the ID of the next row until I find: row[1] == recognizer
set count = row_i
slice the table to get the first person.
The actual code is this:
header_ind = 0  # something related to the rest of the code
recognizer = data[header_ind+1][1]
count = header_ind+1
result = []
result.append(data[0])  # this appends the headers
for i, row in enumerate(data[header_ind+2:]):
    if i <= len(data[header_ind+4:]):
        if row[1] and data[i+1+header_ind+2][1] == recognizer:
            print data[i+header_ind+3]
            one_person = data[count:i+header_ind+3]
            result.append(one_person)
            count = i+header_ind+3
    else:
        if i == len(data[header_ind+3:]):
            last_person = data[count:i+header_ind+3]
            result.append(last_person)
            count = i+header_ind+3
I have been thinking about this for a while, so I just want to know whether it is possible to get an algorithm to incorporate Table 2, given that we cannot distinguish the name rows from the title rows.
Going to stick this here
So these are your inputs; the assumption is you are restricted to this:
# Table 1
data1 = [['Name&Title','ID#'],
['Random_Name1','2000'],
['Title_1_1','-'],
['Title_1_2','-'],
['Random_Name2','2000'],
['Title_2_1','-'],
['Title_2_2','-']]
# TABLE 2
data2 = [['Name&Title','ID#'],
['Random_Name1','2000'],
['Title_1_1','-'],
['Title_1_2','-'],
['Random_Name2','-'],
['Title_2_1','2000'],
['Title_2_2','-']]
And this is your desired output:
for x in data:
    print x
['Random_Name2', '2000']
['Name&Title', 'ID#']
[['Random_Name1', '2000'], ['Title_1_1', '-'], ['Title_1_2', '-']]
[['Random_Name2', '2000'], ['Title_2_1', '-'], ['Title_2_2', '-']]