SlidingWindows in Apache Beam Python duplicates the data

The problem
Each time the system receives a message from Pub/Sub into a SlidingWindows window, the message gets duplicated.
The code
| 'Parse dictionary' >> beam.Map(lambda elem: (elem['Serial'], int(elem['Value'])))
| 'window' >> beam.WindowInto(window.SlidingWindows(30, 15), accumulation_mode=AccumulationMode.DISCARDING)
| 'Count' >> beam.CombinePerKey(beam.combiners.MeanCombineFn())
The output
If I send only one message from Pub/Sub and print what I get after the sliding window finishes, using this code:
class print_row2(beam.DoFn):
    def process(self, row=beam.DoFn.ElementParam, window=beam.DoFn.WindowParam, timestamp=beam.DoFn.TimestampParam):
        print row, timestamp2str(float(window.start)), timestamp2str(float(window.end)), timestamp2str(float(timestamp))
The result
('77777', 120.0) 2018-11-16 08:21:15.000 2018-11-16 08:21:45.000 2018-11-16 08:21:45.000
('77777', 120.0) 2018-11-16 08:21:30.000 2018-11-16 08:22:00.000 2018-11-16 08:22:00.000
If I print the message before 'window' >> beam.WindowInto(window.SlidingWindows(30, 15)), I get it only once.
The process in "graphic" mode:
time: ----t+00---t+15---t+30----t+45----t+60------>
          :      :      :       :       :
w1:       |=X===========|       :       :
w2:              |==============|       :
...
The message X was sent only once, at the beginning of the sliding window; it should be received once, but it is being received twice.
I have tried both AccumulationMode values, and also trigger=AfterWatermark, but I cannot fix the problem.
What could be wrong?
Extra
With FixedWindows, this is the correct code for my purpose:
| 'Window' >> beam.WindowInto(window.FixedWindows(1 * 30))
| 'Speed Average' >> beam.GroupByKey()
| "Calculating average" >> beam.CombineValues(beam.combiners.MeanCombineFn())
or
| 'Window' >> beam.WindowInto(window.FixedWindows(1 * 30))
| "Calculating average" >> beam.CombinePerKey(beam.combiners.MeanCombineFn())

All elements that belong to the window are emitted. If an element belongs to multiple windows, it will be emitted in each window.
Accumulation mode only matters if you plan to handle late data/multiple trigger firings. In this case discarding mode gives you only the new elements in the window when the trigger fires again, i.e. it emits only the elements that arrived in the window since the previous trigger firing; the elements that were already emitted are not emitted again and are discarded. In accumulating mode the whole window is emitted for every trigger firing; it includes the old elements that were already emitted last time plus the new elements that have arrived since then.
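For illustration, a minimal sketch of where these settings go (the trigger, the allowed lateness, and the events PCollection here are illustrative assumptions, not from the question):
import apache_beam as beam
from apache_beam.transforms import trigger, window

# Sketch only: accumulation_mode decides what a repeated firing emits.
# DISCARDING -> each firing emits only elements since the previous firing;
# ACCUMULATING -> each firing re-emits everything in the window so far.
windowed = (events  # assumed: a timestamped PCollection
    | beam.WindowInto(
        window.SlidingWindows(30, 15),
        trigger=trigger.AfterWatermark(late=trigger.AfterCount(1)),  # illustrative
        accumulation_mode=trigger.AccumulationMode.DISCARDING,
        allowed_lateness=60))  # illustrative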
If I understand your example correctly, you have sliding windows with a length of 30 seconds that start every 15 seconds. So they overlap for 15 seconds:
time: ----t+00---t+15---t+30----t+45----t+60------>
          :      :      :       :       :
w1:       |=============|       :       :
w2:              |==============|       :
w3:                     |===============|
...
So any element in your case will belong to at least two windows (except for elements in the first and last windows).
E.g. in your example, if your message was sent between 08:21:30 and 08:21:45, it will appear in both of the windows shown in your output.
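To make this concrete, here is a plain-Python sketch (not a Beam API) of the sliding-window assignment rule; with size 30 and period 15, every timestamp lands in size/period = 2 windows:
# Plain-Python sketch of SlidingWindows(30, 15) assignment:
# return (start, end) of every window that contains `timestamp`.
def sliding_windows(timestamp, size=30, period=15):
    windows = []
    start = timestamp - timestamp % period  # last window starting at or before t
    while start > timestamp - size:        # walk back while t is still inside
        windows.append((start, start + size))
        start -= period
    return windows

print(sliding_windows(95))  # [(90, 120), (75, 105)] -- one element, two windows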
Fixed windows don't overlap, so an element can belong to only one window:
time: ----t+00---t+15---t+30----t+45----t+60------>
          :             :               :
w1:       |=============|               :
w2:                     |===============|
w3:                                     |====...
...
More about windows here: https://beam.apache.org/documentation/programming-guide/#windowing

I have exactly the same issue, however in Java. I have a window with a 10-second duration and a step of 3 seconds. When an event is emitted from the MQTT topic that I subscribe to, it looks like my ParDo function runs and emits the first and only event to all three "constructed" windows.
X is the event that I send at a random timestamp: 2020-09-15T21:17:57.292Z
time: ----t+00---t+15---t+30----t+45----t+60------>
          :      :      :       :       :
w1:       |X============|       :       :
w2:       |X=============|              :
w3:       |X==============|
...
Even the same timestamp is assigned to them! I must really be doing something completely wrong.
I use Scala 2.12 and Beam 2.23 with a DirectRunner.
[Hint]: I use state in the processElement function, where the state is held per key + window. Maybe there is a bug there? I will try to test it without state.
UPDATE: After removing the state fields, the single event is assigned to one window.

Related

minimalmodbus read multiple register on same device

Is there a way to read a set of registers on a device with the Python library minimalmodbus? Or do I need to read them one by one with read_register, like this?
instrument.read_register(0x0033,1)
Look at the minimalmodbus documentation for read_registers(). You shouldn't need to change the functioncode argument.
Assuming you want the first one hundred and twenty-five registers, starting at register zero:
registers = instrument.read_registers(0, 125)
If you wanted to print those registers:
for idx in range(len(registers)):
    print(f"Register {idx:3d} : {registers[idx]:5d} ({hex(registers[idx])})")
This will output something like:
Register 0 : 0 (0x0)
Register 1 : 1 (0x1)
Register 2 : 2 (0x2)
Register 3 : 3 (0x3)
Register 4 : 4 (0x4)
…
EDIT: Looking at page 9 of the specification document, there are sixteen- and thirty-two-bit registers co-mingled. It will be easier to read them explicitly. Otherwise you'll need to shift and combine the two sixteen-bit registers yourself, which is annoying; minimalmodbus has functions to make this easier and less error-prone for you.
E.g.
# 0000h 2 V L1-N INT32 Value weight: Volt*10
# read_long(registeraddress: int, functioncode: int = 3, signed: bool = False, byteorder: int = 0) → int
L1_N_voltage = instrument.read_long(0, signed=True) / 10
# 0028h 2 W sys INT32 Value weight: Watt*10
sys_wattage = instrument.read_long(0x28, signed=True) / 10
Note that read_long() doesn't have number_of_decimals support, so you need to manually divide the reading by ten. For the current and power factor you'll need to divide by 1,000.
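For contrast, a sketch of the manual shift-and-combine that read_long() replaces (byte order is an assumption here: big-endian with the high word first, matching the default byteorder=0; check the device's spec):
import struct

# Read the two 16-bit halves of the 32-bit value at address 0 and
# combine them by hand into one signed 32-bit integer.
hi, lo = instrument.read_registers(0, 2)
raw = struct.unpack('>i', struct.pack('>HH', hi, lo))[0]
L1_N_voltage = raw / 10  # value weight is Volt*10, as above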

Beam job with Repeated AfterProcessingTime trigger runs forever

I tried to create a Beam pipeline that allows late events and has an AfterProcessingTime trigger, so that the trigger can aggregate all the data that arrives on time at its first firing and then fire as few times as possible for late data. I found this tutorial, but my pipeline gets stuck with their late_data_stream:
options = StandardOptions(streaming=True)
with TestPipeline(options=options) as p:
    _ = (p | create_late_data_stream()
           | beam.Map(lambda x: x)  # Work around for typing issue
           | beam.WindowInto(beam.window.FixedWindows(5),
                             trigger=beam.trigger.Repeatedly(beam.trigger.AfterProcessingTime(5)),
                             accumulation_mode=beam.trigger.AccumulationMode.DISCARDING,
                             allowed_lateness=50)
           | beam.combiners.Count.PerKey()
           | beam.Map(lambda x: f'Count is {x[1]}')
           | "Output Window" >> beam.ParDo(GetElementTimestamp(print_pane_info=True))
           | "Print count" >> PrettyPrint()
        )
Any idea why this is happening? And is there a way to make the Repeatedly trigger stop once the watermark goes past window_end + allowed_lateness?
Thanks in advance for any help.
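One direction worth exploring (a sketch only, not verified against this pipeline; events stands in for the upstream PCollection): AfterWatermark fires once when the watermark passes the end of the window, and its late= argument controls additional firings for late data until allowed_lateness expires, which seems closer to the stated goal than a bare Repeatedly(AfterProcessingTime(5)):
# Sketch only: swap the trigger in the windowing step above.
windowed = events | beam.WindowInto(
    beam.window.FixedWindows(5),
    trigger=beam.trigger.AfterWatermark(
        late=beam.trigger.AfterProcessingTime(5)),  # late firings only
    accumulation_mode=beam.trigger.AccumulationMode.DISCARDING,
    allowed_lateness=50)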

Parsing MML logs with python

I would like to parse a text file to extract information into a spreadsheet program. The problem is that the input text file often has multiple separators, delimiters, and text markers, like below.
Script Task : dd_new
==========Summary==========
Success Num : 2690
Fail Num : 2342
===========================
==========Fail MML Command==========
MML Command-----LST RACHCFG:;
NE : 042249
Report : Ne is not connected.
MML Command-----LST RACHCFG:;
NE : 073196
Report : +++ 073196 2016-08-11 16:01:00 DST
O&M #22638
%%/*327033029*/LST RACHCFG:;%%
RETCODE = 939589973 Invalid Command.
--- END
MML Command-----LST RACHCFG:;
NE : 236263
Report : +++ 236263 2016-08-11 16:00:59 DST
O&M #807456706
%%/*325939762*/LST RACHCFG:;%%
RETCODE = 0 Operation succeeded.
Display RACHCfg
---------------
Local cell ID Power ramping step(dB) Preamble initial received target power(dBm) Message size of select group A(bit) PRACH Frequency Offset Indication of PRACH Configuration Index PRACH Configuration Index Maximum number of preamble transmission Timer for contention resolution(subframe) Maximum number of Msg3 HARQ transmissions
1 2dB -104dBm 56bits 6 Not configure NULL 10times 64 5
2 2dB -104dBm 56bits 6 Not configure NULL 10times 64 5
3 2dB -104dBm 56bits 6 Not configure NULL 10times 64 5
(Number of results = 3)
--- END
MML Command-----LST RACHCFG:;
NE : 236264
Report : +++ 236264 2016-08-11 16:01:00 DST
O&M #807445026
%%/*325939772*/LST RACHCFG:;%%
RETCODE = 0 Operation succeeded.
Display RACHCfg
---------------
Local cell ID = 1
Power ramping step(dB) = 2dB
Preamble initial received target power(dBm) = -104dBm
Message size of select group A(bit) = 56bits
PRACH Frequency Offset = 6
Indication of PRACH Configuration Index = Not configure
PRACH Configuration Index = NULL
Maximum number of preamble transmission = 10times
Timer for contention resolution(subframe) = 64
Maximum number of Msg3 HARQ transmissions = 5
(Number of results = 1)
--- END
Approach:
The part where we split the line with a ':', as in line.split(':'), is a clear way to build a dictionary, but I need a way for the program to work its way through the file, making decisions all along:
Is it a valid block? (The header Script Task : dd_new can be discarded.)
Failed MML commands, both Ne is not connected and Invalid Command, need to be extracted with the corresponding type into a dict:
{'NE': '042249', 'Status': 'Ne is not connected'}, {'NE': '073196', 'Status': 'Invalid Command'}
A successful MML command should parse into a nested dictionary when either of the two output formats of Display RACHCfg is encountered. Both should yield a nested key-value dictionary for each NE, then each Local cell ID, then each parameter:
{'NE': '236263', 'Local Cell ID': {'1': {'Power ramping step': '2dB', 'Preamble initial received target power': '-104dBm', ...}, '2': {'Power ramping step': '2dB', ...}, '3': {'Power ramping step': '2dB', ...}}}
and similarly (with a single Local cell ID in the second case):
{'NE': '236264', 'Local Cell ID': {'1': {'Power ramping step': '2dB', 'Preamble initial received target power': '-104dBm', ...}}}
I have searched the forums and seen examples using regex, and ideas like replacing the '===' lines with start/end tags and reading blocks, but if there is a more Pythonic way, or an existing library that deals with this kind of thing, it would make my life very easy.
For those who want to know what it is, these are MML logs from telecommunication equipment listing its configuration.
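In case it helps as a starting point, here is a rough regex-based sketch of the block-by-block approach (the block boundaries, field names, and file name are assumptions based on the sample above; the columnar Display RACHCfg table in the first success block, and the per-cell nesting, would still need separate handling):
import re

def parse_mml(text):
    results = []
    # Every command block starts with "MML Command-----".
    for block in re.split(r'MML Command-----', text)[1:]:
        m = re.search(r'NE\s*:\s*(\S+)', block)
        if not m:
            continue
        entry = {'NE': m.group(1)}
        if 'Ne is not connected' in block:
            entry['Status'] = 'Ne is not connected'
        elif 'Invalid Command' in block:
            entry['Status'] = 'Invalid Command'
        else:
            entry['Status'] = 'Success'
            # Collect "key = value" parameter lines (the second success format).
            params = {}
            for key, val in re.findall(r'^\s*([A-Za-z][^=\n]*?)\s*=\s*(.+)$',
                                       block, re.MULTILINE):
                if key != 'RETCODE':
                    params[key] = val.strip()
            entry['Config'] = params
        results.append(entry)
    return results

with open('dd_new.log') as f:  # hypothetical file name
    for rec in parse_mml(f.read()):
        print(rec)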

Unexpected Python Arithmetic Behavior

I'm working on a Huffman encoder/decoder in Python, and am experiencing some unexpected (at least for me) behavior in my code. Encoding the file is fine; the problem occurs when decoding the file. Below is the associated code:
import codecs
import json

def decode(cfile):
    with open(cfile, "rb") as f:
        enc = f.read()
    len_dkey = int(bin(ord(enc[0]))[2:].zfill(8) + bin(ord(enc[1]))[2:].zfill(8), 2)  # length of dictionary
    pad = ord(enc[2])  # number of padding zeros at end of message
    dkey = {int(k): v for k, v in json.loads(enc[3:len_dkey+3]).items()}  # dictionary
    enc = enc[len_dkey+3:]  # actual message in bytes
    com = []
    for b in enc:
        com.extend([bit == "1" for bit in bin(ord(b))[2:].zfill(8)])  # encoded message in bits (True/False)
    cnode = 0  # current node for tree traversal
    dec = ""  # decoded message
    for b in com:
        cnode = 2 * cnode + b + 1  # array implementation of tree
        if cnode in dkey:
            dec += dkey[cnode]
            cnode = 0
    with codecs.open("uncompressed_" + cfile, "w", "ISO-8859-1") as f:
        f.write(dec)
The first with open(cfile,"rb") as f call runs very quickly for all file sizes (tested sizes are 1.2MB, 679KB, and 87KB), but the part that slows down the code significantly is the for b in com loop. I've done some timing and I honestly don't know what's going on.
I've timed the whole decode function on each file, as shown below:
87KB 1.5 sec
679KB 6.0 sec
1.2MB 384.7 sec
First of all, I don't even know how to characterize this complexity. Next, I've timed a single run through the problematic loop, and found that the line cnode = 2*cnode + b + 1 takes 2e-6 seconds while the if cnode in dkey line takes 0.0 seconds (according to time.clock() on OS X). So it seems as if the arithmetic is slowing down my program significantly, which I feel doesn't make sense.
I actually have no idea what is going on, and any help at all would be super welcome.
I found a solution to my problem, but I am still left with some confusion afterwards. I solved the problem by changing dec from "" to [], and then changing the line dec += dkey[cnode] to dec.append(dkey[cnode]). This resulted in the following times:
87KB 0.11 sec
679KB 0.21 sec
1.2MB 1.01 sec
As you can see, this has immensely cut down the time, so in that respect it was a success. However, I am still confused as to why Python's string concatenation seems to be the problem here.
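For reference, a sketch of the change described above (assuming the final write becomes a join, since file.write() needs a string rather than a list):
cnode = 0
dec = []  # list instead of ""
for b in com:
    cnode = 2 * cnode + b + 1
    if cnode in dkey:
        dec.append(dkey[cnode])  # was: dec += dkey[cnode]; appending avoids
        cnode = 0                # re-copying the growing string every time
with codecs.open("uncompressed_" + cfile, "w", "ISO-8859-1") as f:
    f.write("".join(dec))  # join once at the end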

Python reading from Arduino script still not working

I asked a few weeks ago about a solution to my Python script problem.
I have just restarted my project, and I still have the problem.
My Arduino is working fine; the command sudo screen /dev/ttyACM0 works perfectly, and I'm getting:
T: 52.80% 23.80 15% 92% N
T: 52.80% 23.80 15% 92% N
T: 52.80% 23.80 15% 92% N
The letter T is the separator between rows.
The first number is humidity.
The second is temperature.
The third is the photoresistor.
The next one is soil moisture.
And the last one is the fan's working state (N - not working, Y - working).
I would like to use a Python script with cron to write a text file with the results for every single sensor.
For example, I'll use cron to save four text files (temp.txt, humi.txt, soil.txt, photo.txt) every 5 minutes, 30 minutes, 1 hour, 3 hours, 12 hours, and 24 hours.
Then I'll use a PHP script to show the data as diagrams on my website.
But the problem is with my Python script. I got a solution here, and at the moment I'm using the following script (temperature example):
#!/usr/bin/python
import serial
import time

buffer = bytes()
ser = serial.Serial('/dev/ttyACM0', 9600, timeout=10)
while buffer.count('T:') < 2:
    buffer += ser.read(30)
ser.close()

# Now we have at least one complete datum. Isolate it.
start = buffer.index('T:')
end = buffer.index('T:', start+1)
items = buffer[start:end].strip().split()
print time.strftime("%Y-%m-%d %H:%M:%S"), items[2]
But in my text file I've got incorrect info, which looks like:
2013-05-10 19:47:01 12%
2013-05-10 19:48:01
2013-05-10 19:49:01 N
2013-05-10 19:50:01 24.10
2013-05-10 19:51:01 24.10
2013-05-10 19:52:01 7%
2013-05-10 19:53:01 24.10
but it should be 2013-05-10 19:47:01 24.10 all the time.
What's wrong with it?
I suspect that instead of
items = buffer[start:end].strip().split()
you want
items = [item.strip() for item in buffer[start:end].split()]
(note that .split() already returns a list, so you can't call .strip() on its result directly), or maybe just
items = buffer[start:end].split()
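A line-based sketch may also sidestep the issue (assuming each reading is a single newline-terminated line starting with T:, as in the screen output above): readline() avoids slicing the byte stream at arbitrary 30-byte boundaries.
#!/usr/bin/python
import serial
import time

ser = serial.Serial('/dev/ttyACM0', 9600, timeout=10)
ser.readline()                 # discard a possibly partial first line
line = ser.readline().strip()  # e.g. "T: 52.80% 23.80 15% 92% N"
ser.close()

items = line.split()           # ['T:', '52.80%', '23.80', '15%', '92%', 'N']
if len(items) >= 3 and items[0] == 'T:':
    print time.strftime("%Y-%m-%d %H:%M:%S"), items[2]  # temperature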
