Combiner functions seemingly not emitting correct results - Python

So I'm working on a test streaming case: reading from Pub/Sub and, for now, sending to stdout for some visuals on the pipeline and transforms.
I'm getting some unusual output and am likely missing something, so I'm hoping someone can help.
Take my code (stripped back to debug):
with beam.Pipeline(options=opts) as p:
    (
        p
        | ReadFromPubSub(topic=topic_name,
                         timestamp_attribute='timestamp')
        | beam.WindowInto(beam.window.FixedWindows(beam.window.Duration(5)),
                          trigger=beam.trigger.AfterWatermark(),
                          accumulation_mode=beam.trigger.AccumulationMode.ACCUMULATING)
        | beam.CombineGlobally(beam.combiners.CountCombineFn()).without_defaults()
        | beam.Map(print)
    )
I am generating an arbitrary number of events and pushing those to my topic - currently 40. I can confirm through the generation of the events that they all succeed in reaching the topic. Simply printing the contents of the topic (using Beam), I can see what I would expect.
However, what I wanted to try was some basic window aggregation, and using both beam.CombineGlobally(beam.combiners.CountCombineFn()) and beam.combiners.Count.Globally(), I notice 2 things happening (not strictly at the same time).
The first issue:
When I print additional window start/end timestamps, I am getting more than 1 instance of the same window returned. My expectation on a local runner would be that a single fixed window collects the number of events and emits one result.
This is the DoFn I've used to get a picture of the windowing data.
class ShowWindowing(beam.DoFn):
    def process(self, elem, window=beam.DoFn.WindowParam):
        yield (f'I am an element: {elem}\n'
               f'start window time: {window.start.to_utc_datetime()} '
               f'and the end window time: {window.end.to_utc_datetime()}')
And to reiterate: the issue is not that I am getting 'duplicate' results; rather, I am getting multiple semi-grouped results.
The second issue I have (which I feel is related to the above, though I've seen it occur without the semi-grouping of elements):
When I execute my pipeline through the CLI (I use notebooks a lot) and generate events to my topic, I get considerably less output back, and what I do get appears to be just partial results.
Example: I produce 40 events, each with a lag of half a second. My window is set to 5 seconds, so I expect (give or take) a combined count of 10 every 5 seconds over 20 seconds. What I actually get is completely partial results: a count of 1 over one window, or a count of 8 over another.
I've read and re-read the docs (admittedly skimming some of them just to seek an answer), and I've referenced the katas and the Google Dataflow quest to look for examples/alternatives, but I cannot identify where I'm going wrong.
Thanks

I think this boils down to a TODO in the Python local runner's handling of watermarks for Pub/Sub subscriptions. Essentially, it thinks it has received all the data up until now, but there is still data in Pub/Sub with a timestamp less than now(), and that data becomes late data once it is actually read.
A real runner such as Dataflow won't have this issue.
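
That said, if you want to keep experimenting locally, one hedged workaround is to tolerate late data explicitly. A minimal sketch, assuming the pipeline from the question; allowed_lateness and late firings are standard Beam Python options, though how faithfully the local runner honors them is exactly the open question above:

import apache_beam as beam

windowed = (
    p
    | beam.WindowInto(
        beam.window.FixedWindows(5),
        trigger=beam.trigger.AfterWatermark(
            late=beam.trigger.AfterCount(1)),  # fire again for each late element
        accumulation_mode=beam.trigger.AccumulationMode.ACCUMULATING,
        allowed_lateness=60)  # seconds to keep windows open for late data (assumed value)
)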


My Azure Function in Python v2 doesn't show any signs of running, but it probably is

I have a simple function app in Python v2. The plan is to process millions of images, but right now I just want to get the scaffolding right, i.e. no image processing, just dummy data. So I have two functions:
process with an HTTP trigger @app.route, which inserts 3 random image URLs into the Azure Queue Storage,
process_image with a Queue trigger @app.queue_trigger, which processes one image URL from above (currently it only logs the event).
I trigger the first one with a curl request and, as expected, I can see the invocation in the Azure portal in the function's invocations section, and I can see the items in the Storage Explorer's queue.
But unexpectedly, I do not see any invocations for the second function, even though after a few seconds the items disappear from the images queue and end up in the images-poison queue. So something did run with the queue items 5 times. I see the following warning in Application Insights when checking traces and exceptions:
Message has reached MaxDequeueCount of 5. Moving message to queue 'case-images-deduplication-poison'.
Can anyone help with what's going on? Here's the gist of the code.
If I were to guess, something else is hitting that storage queue, like your dev machine or another function. Can you put logging into the second function? (Sorry, I'm a C# guy, so I don't know the code for logging.)
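For reference, here's a minimal sketch of logging in the Python v2 programming model; the queue name and connection setting below are assumptions, so adapt them to your app:

import logging
import azure.functions as func

app = func.FunctionApp()

# Assumed queue name and connection setting; match them to your own app.
@app.queue_trigger(arg_name="msg", queue_name="images",
                   connection="AzureWebJobsStorage")
def process_image(msg: func.QueueMessage) -> None:
    # Anything logged here should show up in Application Insights traces.
    logging.info("Dequeued message: %s", msg.get_body().decode("utf-8"))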
Have you checked the individual function's metrics in the portal, under Function App >> Functions >> Function name >> Overview >> Total Execution Count, expanded to the relevant time period?
Do note that it takes up to 5 minutes for executions to show, but after that you'll see them in the metrics.

How can I measure the coverage (in production system)?

I would like to measure the coverage of my Python code which gets executed in the production system.
I want an answer to this question:
Which lines get executed often (hot spots) and which lines are never used (dead code)?
Of course this must not slow down my production site.
I am not talking about measuring the coverage of tests.
I assume you are not talking about test-suite code coverage, which the other answer refers to; that is indeed a job for CI.
If you want to know which code paths are hit often in your production system, then you're going to have to do some instrumentation / profiling. This will have a cost. You cannot add measurements for free. You can do it cheaply though and typically you would only run it for short amounts of time, long enough until you have your data.
Python has cProfile to do full profiling, measuring call counts per function etc. This will give you the most accurate data but will likely have relatively high impact on performance.
Alternatively, you can do statistical profiling, which basically means you sample the stack on a timer instead of instrumenting everything. This can be much cheaper, even with a high sampling rate! The downside, of course, is a loss of precision.
Even though it is surprisingly easy to do in Python, this stuff is still a bit much to put into an answer here. There is an excellent blog post by the Nylas team on this exact topic though.
The sampler below was lifted from the Nylas blog with some tweaks. After you start it, it fires an interrupt every millisecond and records the current call stack:
import collections
import signal

class Sampler(object):
    def __init__(self, interval=0.001):
        self.stack_counts = collections.defaultdict(int)
        self.interval = interval

    def start(self):
        signal.signal(signal.SIGVTALRM, self._sample)
        signal.setitimer(signal.ITIMER_VIRTUAL, self.interval, 0)

    def _sample(self, signum, frame):
        # Walk the stack from the interrupted frame to the root.
        stack = []
        while frame is not None:
            formatted_frame = '{}({})'.format(
                frame.f_code.co_name,
                frame.f_globals.get('__name__'))
            stack.append(formatted_frame)
            frame = frame.f_back

        formatted_stack = ';'.join(reversed(stack))
        self.stack_counts[formatted_stack] += 1
        # Re-arm the timer for the next sample.
        signal.setitimer(signal.ITIMER_VIRTUAL, self.interval, 0)
You inspect stack_counts to see what your program has been up to. This data can be plotted in a flame graph, which makes it really obvious to see in which code paths your program is spending the most time.
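For illustration, a minimal usage sketch (the workload call and the top-10 printout are my own additions, not from the blog post):

sampler = Sampler()
sampler.start()

run_my_workload()  # hypothetical stand-in for whatever your service does

# Print the ten most frequently sampled stacks.
top = sorted(sampler.stack_counts.items(), key=lambda kv: kv[1], reverse=True)
for stack, count in top[:10]:
    print(count, stack)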
If I understand it right, you want to learn which parts of your application are used most often by users.
TL;DR:
Use one of the metrics frameworks for Python if you do not want to do it by hand. Some of them are listed below:
DataDog
Prometheus
Prometheus Python Client
Splunk
It is usually done at the function level, and the right approach actually depends on the application.
If it is a desktop app with internet access:
You can create a simple DB and collect how many times your functions are called. To accomplish this, you can write a simple function and call it inside every function that you want to track (a sketch follows below). After that, you can define an asynchronous task to upload your data to the internet.
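As an illustrative sketch (all names here are made up), the per-function counting can be done with a decorator; the collected counts are what your asynchronous task would then upload:

import collections
import functools

call_counts = collections.Counter()

def tracked(func):
    # Count each call; call_counts is what the upload task would flush.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        call_counts[func.__qualname__] += 1
        return func(*args, **kwargs)
    return wrapper

@tracked
def export_report():  # hypothetical application function
    pass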
If it is a web application:
You can track which functions are called from JS (mostly preferred for user-behaviour tracking) or from the web API. It is good practice to start from the outside and work inwards. First detect which endpoints are frequently called (if you are using a proxy like nginx, you can analyze the server logs to gather this information; it is the easiest and cleanest way). After that, insert a logger into every other function that you want to track, and simply analyze your logs every week or month.
But if you want to analyze your production code line by line (a very bad idea), you can start your application under a Python profiler. Python already ships with one: cProfile.
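As a hedged illustration, you can scope cProfile to a bounded section so the overhead stays contained; handle_request here is a hypothetical stand-in for the code path you want to measure:

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
handle_request()  # hypothetical: the code path under measurement
profiler.disable()

# Dump the hottest functions by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)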
Maybe make a text file and, in every method of your program, append some text referencing it, like "Method one executed". Run the web application about 10 times, as thoroughly as a viewer would, and afterwards write a Python program that reads the file, counts specific parts of it (or even a pattern), and outputs the totals.

How to delay a ROS Topic by a certain amount of time

Actual problem:
I have a controller node that subscribes to 2 topics and publishes to 1 topic. Although in simulation everything seems to work as expected, on the actual hardware the performance degrades. I suspect the problem is that one of the two input topics lags behind the other by a significant amount of time.
Question:
I want to re-create this behavior in simulation in order to test the robustness of the controller. Therefore, I need to delay one of the topics by a certain amount of time - ideally this should be a configurable parameter. I could write a node that has a FIFO memory buffer and adjust the delay time by monitoring the frequency of the topic. Before I do that, is there a command-line tool or any other quick-to-implement method that I can use?
P.S. I'm using Ubuntu 16.04 and ROS Kinetic.
I do not know of any out-of-the-box solution that does exactly what you describe.
For a quick hack, if your topic does not have a timestamp and the node just takes in the messages as they arrive, the easiest thing to do would be to record a bag and play the two topics back from two different instances of rosbag play. Something like this:
first terminal
rosbag play mybag.bag --clock --topics /my/topic
second terminal started some amount of time later
rosbag play mybag.bag --topics /my/other_topic
Not sure about the --clock flag; whether you need it depends mostly on what you mean by simulation. If you want to control the time difference more precisely than pressing enter in two different terminals, you could write a small bash script to launch them.
Another option that would still involve bags, but would give you more control over the exact time the messages are delayed by, would be to edit the bag so that the messages already carry the correct delay. This could be done relatively easily by modifying the first example in the rosbag cookbook:
import rosbag

with rosbag.Bag('output.bag', 'w') as outbag:
    for topic, msg, t in rosbag.Bag('input.bag').read_messages():
        # This also replaces tf timestamps under the assumption
        # that all transforms in the message share the same timestamp
        if topic == "/tf" and msg.transforms:
            outbag.write(topic, msg, msg.transforms[0].header.stamp)
        else:
            outbag.write(topic, msg, msg.header.stamp if msg._has_header else t)
Replacing the if else with:
...
import rospy
...
if topic == "/my/other_topic":
    outbag.write(topic, msg, t + rospy.Duration.from_sec(0.5))
else:
    outbag.write(topic, msg, t)
Should get you most of the way there.
Other than that, if you think the node would be useful in the future, or you want it to work on live data as well, then you would need to implement the node you described with some queue, as sketched below. One thing you could look at for inspiration is the topic tools (source available on git).
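If you do go that route, here is a rough sketch of such a delay-relay node; the topic names, message type, polling rate, and default delay are all assumptions to adapt:

#!/usr/bin/env python
import rospy
from geometry_msgs.msg import Twist  # stand-in for your actual message type

def main():
    rospy.init_node('topic_delayer')
    delay = rospy.Duration.from_sec(rospy.get_param('~delay', 0.5))
    pub = rospy.Publisher('/my/other_topic_delayed', Twist, queue_size=100)
    buf = []  # FIFO of (release_time, message) pairs

    def callback(msg):
        buf.append((rospy.Time.now() + delay, msg))

    rospy.Subscriber('/my/other_topic', Twist, callback)
    rate = rospy.Rate(200)
    while not rospy.is_shutdown():
        # Re-publish every message once its delay has elapsed.
        while buf and buf[0][0] <= rospy.Time.now():
            pub.publish(buf.pop(0)[1])
        rate.sleep()

if __name__ == '__main__':
    main()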

Caching a script using NUKE API

I want to write a script that uses Nuke's built-in Performance Timers to "sanity-check" the current comp.
For this I am clearing all of the viewer cache to start off fresh. Now I need to trigger the caching, and it seems the only way to achieve this is by using nuke.activeViewer().play(1). With this call I get my timeline cached, but I have no indication of when the timeline is fully cached so that I can stop and reset the Performance Timers.
I am aware that I can also use nuke.activeViewer().frameControl(+1) to skip 1 frame at a time until I'm at the last frame, but this call does not seem to cause the comp to cache that frame. The timeline indicates that the frame is cached, yet nuke.activeViewer().node().frameCached(nuke.frame()) returns false.
Nevertheless, I have written something that works, but only barely.
Here it is:
import nuke

nuke.clearRAMCache()
vc = nuke.activeViewer()
v = vc.node()
fr = v.playbackRange()
vc.frameControl(-6)
print fr.maxFrame()

cached_frames = 0
while cached_frames < fr.maxFrame():
    print "Current Frame: {}".format(nuke.frame())
    if not v.frameCached(nuke.frame()):
        print "Frame: {} not cached".format(nuke.frame())
        while not v.frameCached(nuke.frame()):
            print "caching..."
            vc.play(1)
        print "Frame: {} cached".format(nuke.frame())
        print "Incrementing from caching"
        cached_frames += 1
    else:
        vc.frameControl(1)
        print "incrementing from skipping"
        #cached_frames += 1
    print "Cached Frames: {}".format(cached_frames)

print "DONE"
vc.stop()
I know that this is not a really nice piece of code, but sometimes these lines execute really well, and at other times it just hangs for a (seemingly) random amount of time.
So are there any callbacks available or writable for the Viewer in Nuke or something similar?
Any help is much appreciated!
What specific requirement w.r.t. performance do you want to achieve?
Nuke has a built-in feature for this:
"Nuke can display accurate performance timing data onscreen or output it to an XML file to help you troubleshoot bottlenecks in slow scripts. When performance timing is enabled, timing information is displayed in the Node Graph, and the nodes themselves are colored according to the proportion of the total processing time spent in each one, from green (fast nodes) through to red (slow nodes)." (from the Nuke documentation)
Mimicking callbacks on the viewer timeline is only achievable using threads.
Just create a thread, check for the current frame to be cached, and step to the next frame using nuke.activeViewer().frameControl() from that thread.
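A rough, untested sketch of that idea; it assumes the viewer calls from the question behave as described, and it uses nuke.executeInMainThread because UI calls should not be made directly from a worker thread:

import threading
import time
import nuke

def step_when_cached():
    # Runs off the main thread: poll the cache and advance one frame at a time.
    vc = nuke.activeViewer()
    v = vc.node()
    fr = v.playbackRange()
    frame = fr.minFrame()
    while frame <= fr.maxFrame():
        if v.frameCached(frame):
            # Marshal viewer calls back onto the main thread.
            nuke.executeInMainThread(vc.frameControl, args=(1,))
            frame += 1
        else:
            time.sleep(0.05)  # wait for the renderer to catch up
    nuke.executeInMainThread(vc.stop)
    nuke.tprint("All frames cached")

threading.Thread(target=step_when_cached).start()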

Why does it look like multiple processes are being used when only 1 process is specified?

Please forgive me, as I'm new to using the multiprocessing library in Python and new to testing multiprocess/multi-threaded projects.
In some legacy code, someone created a pool of processes to execute multiple tasks in parallel. I'm trying to debug the code by making the pool contain only 1 process, but the output looks like it's still using multiple processes.
Below is some sanitized example code. Hopefully I included all the important elements to demo what I'm experiencing.
import multiprocessing
import sys

def myTestFunc():
    pool = multiprocessing.Pool(1)  # should only use 1 process
    for i in someListOfNames:
        pool.apply_async(method1, args=(listA,))  # args needs to be a tuple

def method1(listA):
    for i in listA:
        print "this is the value of i: " + i
        sys.stdout.flush()
What is happening: since I expect there to be only 1 process in the pool, I shouldn't get any output collisions. Yet this is what I sometimes see in the log messages:
this is the value of i: Alpha
this is the value of i: Bravo
this is the this is the value of i: Mike # seems like 2 things trying to write at the same time
The two things writing at the same time seem to appear closer to the bottom of my debug log rather than the top, which means the longer I run, the more likely I am to get these messages overwriting each other. I haven't tested with a shorter list yet, though.
I realize testing multi-process/multi-threaded programs is difficult, but in this case I think I've restricted it such that it should be a lot easier than normal to test. I'm confused why this is happening because:
I set the pool to have only 1 process, and
(I think) I force the process to flush its write buffer, so it should write without waiting/queuing and without getting into this situation.
Thanks in advance for any help you can give me.
