Video Overlay System with OpenCV - Python

I'm trying to implement a video overlay solution such as this one: https://www.videologixinc.com/, where there is no delay in the original source video.
My problem is that, with OpenCV, all the necessary drawings (circles, text, etc.) require the entire frame to be processed and then returned for display. Is there any solution where I could just overlay the information on the original source without introducing delay or frame drops? (The additional information, such as drawings and text, can be displayed with a delay, but not the original video pipeline.)
Multiprocessing could make things faster, but I would still have delay or frame drops.
I was also wondering whether it would be better to have two simultaneous applications, maybe even on two different computers: one to read the frames and do the processing, and another to just receive the information somehow and overlay it on the original video pipeline.
Any thoughts? Thank you all!
(Diagram: an example data pipeline for this case that does not interfere with the original video flow.)
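For illustration, here is a minimal sketch of the decoupled approach described in the question, assuming OpenCV capture and a background worker thread; the worker, its sleep times, and the point list it publishes are placeholders, not part of any particular solution. The capture/display loop never waits on the analysis; it simply draws whatever overlay results the worker has published so far, even if they lag by a few frames.

import threading
import time

import cv2

latest_frame = None    # most recent camera frame, shared with the worker
overlay_data = []      # results published by the worker; may lag a few frames
lock = threading.Lock()

def analysis_worker():
    # Slow analysis runs here, fully decoupled from the capture/display loop.
    global overlay_data
    while True:
        with lock:
            frame = None if latest_frame is None else latest_frame.copy()
        if frame is None:
            time.sleep(0.01)
            continue
        time.sleep(0.05)                 # stand-in for expensive detection work
        results = [(100, 100)]           # hypothetical output: points to mark
        with lock:
            overlay_data = results

threading.Thread(target=analysis_worker, daemon=True).start()

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    with lock:
        latest_frame = frame
        points = list(overlay_data)
    for x, y in points:                  # draw whatever results are ready, even if stale
        cv2.circle(frame, (x, y), 20, (0, 255, 0), 2)
    cv2.imshow('live', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

The same split generalizes to separate processes or machines: the analysis side only ever consumes the latest frame and publishes small overlay descriptions (coordinates, text), never the frames themselves.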

Related

Programmatically accessing PTS times in MP4 container

Background
For a research project, we are recording video data from two cameras and feeding a synchronization pulse directly into the microphone ADC every second.
Problem
We want to derive a frame time stamp in the clock of the pulse source for each camera frame to relate the camera images temporally. With our current methods (see below), we get a frame offset of around 2 frames between the cameras. Unfortunately, inspection of the video shows that we are clearly 6 frames off (at least at one point) between the cameras.
I assume that this is because we are relating audio and video signal wrong (see below).
Approach I think I need help with
I read that in the MP4 container, there should be PTS times for video and audio. How do we access those programmatically? Python would be perfect, but if we have to call ffmpeg via system calls, we may do that too ...
What we currently fail with
The original idea was to find video and audio times as
audio_sample_times = np.arange(N_audiosamples) / audio_sampling_rate
video_frame_times = np.arange(N_videoframes) / video_frame_rate
then identify audio_pulse_times on the audio_sample_times axis, calculate the relative position of each video_time between the surrounding audio_pulse_times, and pick the value at the same relative position between the corresponding source_pulse_times.
However, a first indication that this approach is problematic is that, for some videos, N_audiosamples/audio_sampling_rate differs from N_videoframes/video_frame_rate by multiple frames.
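For reference, the mapping step described above amounts to a piecewise-linear interpolation between the two clocks. A minimal sketch, assuming the pulse times have already been detected in both time bases; all values below are made up for illustration:

import numpy as np

# Hypothetical pulse times as detected in the audio clock, and the known times
# at which the pulses were emitted in the source clock.
audio_pulse_times = np.array([0.98, 2.01, 3.03, 4.02])
source_pulse_times = np.array([1.00, 2.00, 3.00, 4.00])
video_frame_times = np.arange(150) / 30.0      # 150 frames at 30 fps

# Each video time is placed relative to the surrounding audio pulses and mapped
# to the same relative position between the corresponding source pulses.
# Times outside the pulse range are clamped to the first/last source pulse.
video_times_in_source_clock = np.interp(video_frame_times,
                                        audio_pulse_times,
                                        source_pulse_times)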
What I have found by now
OpenCV's cv2.CAP_PROP_POS_MSEC seems to compute exactly what we already compute, rather than access any PTS ...
Edit: What I took from the winning answer
import av
import numpy as np
from tqdm import tqdm

container = av.open(video_path)  # video_path: path to the recorded MP4

signal = []
audio_sample_times = []
video_sample_times = []

for frame in tqdm(container.decode(video=0, audio=0)):
    if isinstance(frame, av.audio.frame.AudioFrame):
        # assumes the audio time_base is 1/sample_rate, so pts counts samples
        sample_times = (frame.pts + np.arange(frame.samples)) / frame.sample_rate
        audio_sample_times += list(sample_times)
        # keep only the first audio channel
        signal_f_ch0 = frame.to_ndarray().reshape((-1, len(frame.layout.channels))).T[0]
        signal += list(signal_f_ch0)
    elif isinstance(frame, av.video.frame.VideoFrame):
        # video pts is in stream time_base units; convert to seconds
        video_sample_times.append(float(frame.pts * frame.time_base))

signal = np.abs(np.array(signal))
audio_sample_times = np.array(audio_sample_times)
video_sample_times = np.array(video_sample_times)
Unfortunately, in my particular case, all pts are consecutive and gapless, so the result is the same as with the naive solution ...
From picture clues, we identified a section of ~10 s in the videos somewhere in which they desync, but we can't find any trace of that in the data.
You need to run ffprobe to retrieve the PTS times. I don't know the exact command, but if you're ok with another package, try ffmpegio:
pip install ffmpegio-core
# OR
pip install ffmpegio   # if you also want to use it to read video frames & audio samples
If you're on Windows, see this doc on where ffmpeg.exe can be found automatically.
Then you can run:
import ffmpegio
frames = ffmpegio.probe.frames('video.mp4', intervals=10)
This returns the frame info of the first 10 packets (of mixed streams, in order of pts) as a list of dicts. If you remove the intervals argument, it retrieves every frame (which will take a long time).
Inspect each dict in frames and decide which entries you need (say 'media_type', 'stream_index', 'pts' and 'pts_time'). Then add an entries argument containing these:
frames = ffmpegio.probe.frames('video.mp4', intervals=10,
                               entries=['media_type', 'stream_index', 'pts', 'pts_time'])
Once you're happy with what it returns, incorporate it into your program.
The intervals argument accepts many different formats; please read the doc.
What this or any other FFmpeg-based approach does not offer is getting this info along with the frame data. You need to read the frame timing data separately and merge it with the data yourself. If you prefer a solution with more control (but perhaps more coding), look into pyav, which interfaces FFmpeg's underlying libraries. I'm fairly certain you can retrieve pts together with the frame data.
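For reference, a minimal PyAV sketch of that idea might look like this; it reads each decoded video frame together with its pts and converts it to seconds:

import av  # pip install av

container = av.open('video.mp4')
stream = container.streams.video[0]
for frame in container.decode(stream):
    # pts is expressed in the stream's time_base; convert to seconds
    print(float(frame.pts * stream.time_base), frame.width, frame.height)
container.close()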
Disclaimer: the probe.frames function has not been tested extensively, so you may encounter issues. If you do, please report them on GitHub and I'll fix them asap.

Playing audio in sync with video with frames generated on the fly, real time. Plausible?

I'm a self-taught Python programmer working on a hobby project, but I'm having some difficulty and would like to address what I see as a potential XY problem.
My app takes an audio file as input (converts it to wav) and produces visual representations of the audio (90x90, RGB, frames) in the form of numpy arrays. I used to save these frames to a video file using OpenCV, then use ffmpeg to scale the video and add the (original, non-wav) audio over the top, but this meant waiting until the app had finished before playing the file. I would like to be able to play the audio and display the frames as they are generated, in sync. My generation code takes at most 8 ms of a 16 ms frame (60 fps), so I have a reasonable number of cycles to play with.
From my research, I have found that SDL is the tool that is most appropriate to display frames at high speeds, and I have managed to make a simple system that displays frames 'in time' by brute-force pixel editing. I have also discovered that SDL can play audio, and it even seems that I could synchronize this with the video as I would like, via the callback function. However, being a decidedly non-C programmer, I am at a loss as to how best to display frames, as directly assigning pixels cannot be the safest or fastest way, and I would like to scale the frames as they are displayed. I am also at a loss as to how best to convert numpy arrays to textures efficiently, as well as how best to control the synchronicity of my generation code, the audio, and the video frames.
I'm not specifically looking for an answer to any of those problems, though advice would be appreciated, I'm just making sure that this is a reasonable way forward. Is SDL/pysdl2 coupled with numpy appropriate in this scenario? Or is this asking too much from python overall?
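As a rough illustration of the display side only, here is a minimal sketch using pygame (which, like pysdl2, wraps SDL); generate_frame is a hypothetical stand-in for the app's visualiser, and the window size and frame rate are assumptions:

import numpy as np
import pygame

def generate_frame(i):
    # hypothetical stand-in for the app's visualiser: one 90x90 RGB frame
    rng = np.random.default_rng(i)
    return rng.integers(0, 255, (90, 90, 3), dtype=np.uint8)

pygame.init()
screen = pygame.display.set_mode((720, 720))
clock = pygame.time.Clock()

for i in range(600):                                           # ~10 s at 60 fps
    pygame.event.pump()                                        # keep the window responsive
    surf = pygame.surfarray.make_surface(generate_frame(i))    # expects (width, height, 3)
    surf = pygame.transform.scale(surf, screen.get_size())     # scale up to the window
    screen.blit(surf, (0, 0))
    pygame.display.flip()
    clock.tick(60)                                             # cap the loop at 60 fps

pygame.quit()

Audio playback and the actual audio/video synchronization (whether via pygame.mixer or SDL's audio callback) are a separate concern that this sketch does not address.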

Is there a way to generate a gif in python without consuming an excessive amount of RAM?

I'm writing a little application to generate a GIF from a kifu file (a type of file used to save a game of Japanese chess). I'm currently using Matplotlib to draw the board and the pieces, and the matplotlib.animation.FuncAnimation class combined with numpngw.AnimatedPNGWriter to write the gif. However, it uses more than 800 MB of RAM to generate a single gif with 80 frames. On reflection, this value is not surprising, because (from my understanding) each frame has a dimension of 1700x1000 and is in color. So, to keep every frame in memory, it needs a minimum of 1700*1000*80*(bytes per pixel), which at 3 bytes per pixel is already roughly 400 MB.
Is there a way to minimize this amount either with matplotlib or with another library? I suppose I need to compress frames after creating them instead of keeping them raw, but I can't figure out how to do that.
Thank you very much
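One way to avoid holding all the raw frames at once is to stream each frame into the GIF as soon as it is drawn, for example with imageio's incremental writer. A minimal sketch, where render_board is a hypothetical stand-in for the existing drawing code and returns one RGB frame as a uint8 numpy array:

import imageio
import numpy as np

def render_board(move_index):
    # hypothetical placeholder for the Matplotlib board drawing for one move
    return np.zeros((1000, 1700, 3), dtype=np.uint8)

# Each frame is encoded and written immediately, so only one raw frame
# is held in memory at a time.
with imageio.get_writer('game.gif', mode='I', duration=0.5) as writer:
    for move in range(80):
        writer.append_data(render_board(move))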

Presenting parts of a pre-prepared image array in Shady

I'm interested in migrating from psychtoolbox to Shady for my stimulus presentation. I looked through the online docs, but it is not very clear to me how to replicate in Shady what I'm currently doing in Matlab.
What I do is actually very simple. For each trial,
I load from disk a single image (I do luminance linearization off-line), which contains all the frames I plan to display in that trial (the stimulus is 1000x1000 px, and I present 25 frames, hence the image is 5000x5000px. I only use BW images, so I have a single int8 value per pixel).
I transfer the entire image from the CPU to the GPU
At some point (externally controlled) I copy the first frame to the video buffer and present it
At some other point (externally controlled) I trigger the presentation of the remaining 24 frames (copying the relevant part of the image to the video buffer for each video frame, and then calling flip()).
The external control happens by having another machine communicate with the stimulus presentation code over TCP/IP. After the control PC sends a command to the presentation PC and this is executed, the presentation PC needs to send back an acknowledgement message to the control PC. I need to send three ACK messages, one when the first frame appears on screen, one when the 2nd frame appears on screen, and one when the 25th frame appears on screen (this way the control PC can easily verify if a frame has been dropped).
In matlab I do this by calling the blocking method flip() to present a frame, and when it returns I send the ACK to the control PC.
That's it. How would I do that in shady? Is there an example that I should look at?
The places to look for this information are the docstrings of Shady.Stimulus and Shady.Stimulus.LoadTexture, as well as the included example script animated-textures.py.
Like most things Python, there are multiple ways to do what you want. Here's how I would do it:
w = Shady.World()
s = w.Stimulus( [frame00, frame01, frame02, ...], multipage=True )
where each frameNN is a 1000x1000-pixel numpy array (either floating-point or uint8).
Alternatively you can ask Shady to load directly from disk:
s = w.Stimulus('trial01/*.png', multipage=True)
where directory trial01 contains twenty-five 1000x1000-pixel image files, named (say) 00.png through 24.png so that they get sorted correctly. Or you could supply an explicit list of filenames.
Either way, whether you loaded from memory or from disk, the frames are all transferred to the graphics card in that call. You can then (time-critically) switch between them with:
s.page = 0 # or any number up to 24 in your case
Note that, due to our use of the multipage option, we're using the "page" animation mechanism (create one OpenGL texture per frame) instead of the default "frame" mechanism (create one 1000x25000 OpenGL texture) because the latter would exceed the maximum allowable dimensions for a single texture on many graphics cards. The distinction between these mechanisms is discussed in the docstring for the Shady.Stimulus class as well as in the aforementioned interactive demo:
python -m Shady demo animated-textures
To prepare the next trial, you might use .LoadPages() (new in Shady version 1.8.7). This loops through the existing "pages" loading new textures into the previously-used graphics-card texture buffers, and adds further pages as necessary:
s.LoadPages('trial02/*.png')
Now, you mention that your established workflow is to concatenate the frames as a single 5000x5000-pixel image. My solutions above assume that you have done the work of cutting it up again into 1000x1000-pixel frames, presumably using numpy calls (sounds like you might be doing the equivalent in Matlab at the moment). If you're going to keep saving as 5000x5000, the best way of staying in control of things might indeed be to maintain your own code for cutting it up. But it's worth mentioning that you could take the entirely different strategy of transferring it all in one go:
s = w.Stimulus('trial01_5000x5000.png', size=1000)
This loads the entire pre-prepared 5000x5000 image from disk (or again from memory, if you want to pass a 5000x5000 numpy array instead of a filename) into a single texture in the graphics card's memory. However, because of the size specification, the Stimulus will only show the lower-left 1000x1000-pixel portion of the array. You can then switch "frames" by shifting the carrier relative to the envelope. For example, if you were to say:
s.carrierTranslation = [-1000, -2000]
then you would be looking at the frame located one "column" across and two "rows" up in your 5x5 array.
As a final note, remember that you could take advantage of Shady's on-the-fly gamma-correction and dithering: they're happening anyway unless you explicitly disable them, though of course they have no physical effect if you leave the stimulus .gamma at 1.0 and use integer pixel values. So you could generate your stimuli as separate 1000x1000 arrays, each containing unlinearized floating-point values in the range [0.0, 1.0], and let Shady worry about everything beyond that.

The simplest video streaming?

I have a camera that is taking pictures one by one (about 10 pictures per second) and sending them to a PC. I need to show this incoming sequence of images as live video on the PC.
Is it enough just to use some Python GUI framework, create a control that will hold a single image and just change the image in the control very fast?
Or would that be just lame? Should I use some sort of video streaming library? If yes, what do you recommend?
Or would that be just lame?
No. It wouldn't work at all.
There's a trick to getting video to work. Apple's QuickTime implements that trick. So do a bunch of Microsoft products. Plus some open source video playback tools.
There are several closely-related tricks, all of which are a huge pain in the neck.
Compression. Full-sized video is huge. Do the math: 640x480 pixels at 24-bit color and 30 frames per second works out to roughly 27 MB per second. Without compression, you can't read it in fast enough.
Buffering and Timing. Sometimes the data rates and frame rates don't align well. You need a buffer of ready-to-display frames and a deadly accurate clock to get them to display at exactly the right intervals.
Making a sequence of JPEG images into a movie is what iPhoto and iMovie are for.
Usually, what we do is create a video file from the images and play that file through a standard video player. Making a QuickTime movie or Flash movie from images isn't that hard. There are a lot of tools to help make movies from images. Almost any photo management solution can create a slide show and save it as a movie in some standard format.
Indeed, I think that Graphic Converter can do this.
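As one concrete example of the images-to-movie route, an ffmpeg invocation along these lines (the file names and frame rate are assumptions) turns a numbered JPEG sequence into a playable MP4:

ffmpeg -framerate 10 -i frame_%04d.jpg -c:v libx264 -pix_fmt yuv420p out.mp4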
