I'm writing a multi-threaded application in Python 3. One thread grabs frames from a webcam using OpenCV; another records audio frames using PyAudio. Each thread puts its frames into its own circular buffer, with an absolute timestamp for every frame.
Now I'd like to create another thread that reads from the buffers, joins audio and video frames together using the timestamp information, and saves everything to an MP4 file. The only thing I have found is merging existing audio and video files with, for example, ffmpeg, but nothing about joining frames on the fly.
Do I really need to create the audio and video files before joining them? What I don't understand in that case is how to handle synchronization.
Any hints will be appreciated.
EDIT
In response to the comments: the timestamps are created by me and are absolute. I use a data structure that contains the actual data (video or audio frame) and the timestamp. The point is that audio is recorded with a microphone and video with a webcam, which are different pieces of hardware and are not synchronized.
The webcam grabs a frame, processes it and puts it in a circular buffer using my data structure (data + timestamp).
The microphone records an audio frame, processes it and puts it in a circular buffer using my data structure (data + timestamp).
So I have two buffers. I want to pop frames and join them in whatever video file format, matching the timestamps as accurately as possible. My idea is something that can add an audio frame to a video frame (I will take care of checking that the timestamps match).
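To make the matching step concrete, here is a minimal sketch of pairing each video frame with the audio chunk nearest in time. The `Stamped` tuple is a hypothetical stand-in for the data + timestamp structure described above, and the sketch assumes the audio buffer is already sorted by timestamp. It only covers the pairing, not writing the result to a file.

```python
from bisect import bisect_left
from collections import namedtuple

# Hypothetical stand-in for the "data + timestamp" structure described above.
Stamped = namedtuple("Stamped", ["timestamp", "data"])


def nearest_audio_index(audio_times, t):
    """Return the index of the audio timestamp closest to t.

    audio_times must be non-empty and sorted in ascending order.
    """
    i = bisect_left(audio_times, t)
    if i == 0:
        return 0
    if i == len(audio_times):
        return len(audio_times) - 1
    # Pick whichever neighbour is closer; ties go to the earlier chunk.
    before, after = audio_times[i - 1], audio_times[i]
    return i - 1 if t - before <= after - t else i


def pair_by_timestamp(video_frames, audio_chunks):
    """Pair every video frame with the audio chunk nearest in time."""
    audio_times = [a.timestamp for a in audio_chunks]
    return [
        (v, audio_chunks[nearest_audio_index(audio_times, v.timestamp)])
        for v in video_frames
    ]
```

Since audio chunks usually arrive more often than video frames, several chunks will fall between two frames; a real muxer would keep all of them and only use the timestamps to place them, so treat the one-to-one pairing here as an illustration of the lookup only.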
Background
For a research project, we are recording video data from two cameras and feeding a synchronization pulse directly into the microphone ADC every second.
Problem
We want to derive, for each camera frame, a timestamp in the clock of the pulse source, so that we can relate the camera images temporally. With our current methods (see below), we get a frame offset of around 2 frames between the cameras. Unfortunately, inspection of the video shows that we are clearly 6 frames off (at least at one point) between the cameras.
I assume that this is because we are relating the audio and video signals incorrectly (see below).
Approach I think I need help with
I read that in the MP4 container, there should be PTS times for video and audio. How do we access those programmatically? Python would be perfect, but if we have to call ffmpeg via system calls, we can do that too.
What we currently fail with
The original idea was to find video and audio times as
audio_sample_times = np.arange(N_audiosamples) / audio_sampling_rate
video_frame_times = np.arange(N_videoframes) / video_frame_rate
then identify audio_pulse_times on the audio_sample_times base, calculate the relative position of each video frame time between the audio_pulse_times around it, and take the same relative position between the corresponding source_pulse_times.
However, a first indication that this approach is problematic is already that for some videos, N_audiosamples/audio_sampling_rate differs from N_videoframes/video_frame_rate by multiple frames.
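The mapping described above is a piecewise linear interpolation between pulses. A minimal sketch of it in plain Python follows; the function name `to_source_clock` is made up for illustration, and it assumes both pulse lists are sorted, equally long, and have at least two entries.

```python
from bisect import bisect_right


def to_source_clock(t, audio_pulse_times, source_pulse_times):
    """Map a time t on the recording's timeline to the pulse-source clock.

    Times before the first or after the last pulse are extrapolated
    linearly from the nearest pulse interval.
    """
    # Index of the pulse interval [i, i + 1] that contains t (clamped).
    i = bisect_right(audio_pulse_times, t) - 1
    i = max(0, min(i, len(audio_pulse_times) - 2))
    a0, a1 = audio_pulse_times[i], audio_pulse_times[i + 1]
    s0, s1 = source_pulse_times[i], source_pulse_times[i + 1]
    # Relative position of t between the two surrounding pulses.
    frac = (t - a0) / (a1 - a0)
    return s0 + frac * (s1 - s0)
```

With NumPy, `np.interp(video_frame_times, audio_pulse_times, source_pulse_times)` gives the same values between the first and last pulse, but it clamps outside that range instead of extrapolating.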
What I have found by now
OpenCV's cv2.CAP_PROP_POS_MSEC seems to do exactly what we do, and not access any PTS ...
Edit: What I took from the winning answer
import av
import numpy as np
from tqdm import tqdm

container = av.open(video_path)
signal = []
audio_sample_times = []
video_sample_times = []
# Decode both streams in container order and collect the time of every
# audio sample and every video frame.
for frame in tqdm(container.decode(video=0, audio=0)):
    if isinstance(frame, av.audio.frame.AudioFrame):
        # Assumes the audio time base is 1/sample_rate, so pts counts samples.
        sample_times = (frame.pts + np.arange(frame.samples)) / frame.sample_rate
        audio_sample_times += list(sample_times)
        # Channel 0 only; the reshape assumes interleaved (packed) samples.
        signal_f_ch0 = frame.to_ndarray().reshape((-1, len(frame.layout.channels))).T[0]
        signal += list(signal_f_ch0)
    elif isinstance(frame, av.video.frame.VideoFrame):
        video_sample_times.append(float(frame.pts * frame.time_base))
signal = np.abs(np.array(signal))
audio_sample_times = np.array(audio_sample_times)
video_sample_times = np.array(video_sample_times)
Unfortunately, in my particular case, all pts are consecutive and gapless, so the result is the same as with the naive solution ...
By picture clues, we identified a section of ~10s in the videos, somewhere in which they desync, but can't find any traces of that in the data.
You need to run ffprobe to retrieve the PTS times. I don't know the exact command, but if you're ok with another package, try ffmpegio:
pip install ffmpegio-core
# OR
pip install ffmpegio  # if you also want to use it to read video frames & audio samples
If you're on Windows, see this doc on where ffmpeg.exe can be found automatically.
Then you can run
import ffmpegio
frames = ffmpegio.probe.frames('video.mp4', intervals=10)
This will return the frames info as a list of dicts of the first 10 packets (of mixed streams in the order of pts). If you remove the intervals argument, it'll retrieve every frame (will take a long time).
Inspect each dict in frames and decide which entries you need (say 'media_type', 'stream_index', 'pts' and 'pts_time'). Then add an entries argument containing these:
frames = ffmpegio.probe.frames('video.mp4', intervals=10,
                               entries=['media_type', 'stream_index', 'pts', 'pts_time'])
Once you're happy with what it returns, incorporate it into your program.
The intervals argument accepts many different formats; please read the doc.
What this or any other FFmpeg-based approach does not offer is getting this info together with the data frames. You need to read the frame timing data separately and merge it with the data yourself. If you prefer a solution with more control (but perhaps more coding), look into PyAV, which interfaces with the underlying libraries of FFmpeg. I'm fairly certain you can retrieve pts simultaneously with the frame data.
Disclaimer: This function has not been tested extensively, so you may encounter an issue. If you do, please report it on GitHub and I'll fix it asap.
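For anyone who would rather call ffprobe directly, as mentioned at the top of this answer, a rough sketch follows. The option set is a guess at a reasonable command rather than a tested recipe, and the frame field name is an assumption to check against your build: recent FFmpeg versions report `pts_time`, while older ones report `pkt_pts_time`. The helper names are made up for illustration.

```python
import json
import subprocess


def ffprobe_command(path, stream="v:0"):
    """Build an ffprobe call that prints per-frame timestamps as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", stream,
        "-show_entries", "frame=pts_time",
        "-of", "json",
        path,
    ]


def parse_pts_times(ffprobe_json):
    """Extract frame timestamps (in seconds) from ffprobe's JSON output."""
    frames = json.loads(ffprobe_json).get("frames", [])
    times = []
    for frame in frames:
        # Accept the older field name too; skip frames without a timestamp.
        value = frame.get("pts_time", frame.get("pkt_pts_time"))
        if value is not None:
            times.append(float(value))
    return times


def frame_pts_times(path, stream="v:0"):
    """Run ffprobe on path and return the frame timestamps in seconds."""
    result = subprocess.run(
        ffprobe_command(path, stream),
        capture_output=True, text=True, check=True,
    )
    return parse_pts_times(result.stdout)
```

Calling `frame_pts_times('video.mp4', 'a:0')` would do the same for the first audio stream. Like the ffmpegio route, this decodes the whole file, so it can take a while on long recordings.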
I am trying to build a sports analysis platform in which a deep learning model processes live video (RTMP/webcam) frames and applies overlays, the score, etc. I then need to combine the result with microphone audio and rebroadcast it with audio and video in sync. I think I need the presentation timestamps of the frames (since the AI frame processing takes a variable amount of time) and somehow have to provide them to ffmpeg, but I'm lost and could not find a similar example doing this.
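One piece that can be pinned down independently of the encoder: if each frame is stamped when it is captured, not when the model finishes with it, the variable processing time drops out of the timeline. A minimal sketch of turning such capture times into integer presentation timestamps is below. The function name `to_pts` is made up for illustration, and the 1/90000 time base is only an assumed default (it is a common choice for broadcast-style streams). How the resulting PTS is handed to ffmpeg is not shown here.

```python
from fractions import Fraction


def to_pts(timestamp, first_timestamp, time_base=Fraction(1, 90000)):
    """Convert a capture time in seconds to an integer PTS.

    time_base is the duration of one PTS tick in seconds. The result counts
    ticks since first_timestamp, rounded to the nearest tick.
    """
    elapsed = Fraction(timestamp) - Fraction(first_timestamp)
    return round(elapsed / time_base)
```

Using the same `first_timestamp` for both the video frames and the audio chunks keeps the two streams on one shared timeline.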
I'm trying to implement a video overlay solution such as this one: https://www.videologixinc.com/, where there is no delay in the original source video.
My problem is that, with OpenCV, all the necessary drawing (circles, text, etc.) requires the entire frame to be processed before it can be displayed. Is there any solution where I could just overlay the information on the original source without introducing delay or dropped frames? (The additional information, i.e. drawings and text, can be displayed with a delay, but the original video pipeline cannot.)
Multiprocessing could make things faster, but I would still have delay or frame drops.
I was also wondering whether it would be better to have two simultaneous applications, maybe on two different computers: one to read the frames and do the processing, and another to just receive, somehow, the information and overlay it on the original video pipeline.
Any thoughts? Thank you all!
An example of data pipeline in this case, without interfering in the original video flow
I can open a video and play it with opencv 2 using the cv2.VideoCapture(myvideo). But is there a way to delete a frame within that video using opencv 2? The deletion must happen in-place, that is, the file being played will end up with a shorter time due to deleted frames. Simply zeroing out the matrix wouldn't be sufficient.
For example something like:
import cv2

video = cv2.VideoCapture("myvideo.flv")
while True:
    ok, img = video.read()
    if not ok:
        break
    # Show the image
    cv2.imshow("frame", img)
    cv2.waitKey(1)
    # Then go delete it and proceed to next frame, but is this possible?
    # delete(img)??
So after the above code has run, the video file would technically contain 0 bytes, since each frame is read and then deleted from the file.
OpenCV is not the right tool for this job. What you need for this is a media processing framework, like ffmpeg (=libavformat/libavcodec/libswscale) or GStreamer.
Also, depending on the encoding scheme used, simply deleting a single frame may not be possible. Frame-exact editing is only possible in a video consisting entirely of intra frames (I-frames). If the video is encoded in so-called groups of pictures (GOPs), removing a single frame requires re-encoding the whole GOP it was part of.
You can't do it in-place, but you can use OpenCV's VideoWriter to write the frames that you want in a new video file.
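A minimal sketch of that approach, assuming the frames to drop are known by index. The helper names and the "mp4v" codec are illustrative choices, not requirements, and the codec may not be available in every OpenCV build.

```python
def keep_frames(frames, drop):
    """Yield only the frames whose index is not in drop."""
    for index, frame in enumerate(frames):
        if index not in drop:
            yield frame


def read_frames(capture):
    """Yield frames from a cv2.VideoCapture until it runs out."""
    while True:
        ok, frame = capture.read()
        if not ok:
            return
        yield frame


def copy_without_frames(src_path, dst_path, drop):
    """Write src_path to dst_path, skipping the frame indices in drop."""
    import cv2  # imported here so the helpers above work without OpenCV

    capture = cv2.VideoCapture(src_path)
    fps = capture.get(cv2.CAP_PROP_FPS)
    size = (int(capture.get(cv2.CAP_PROP_FRAME_WIDTH)),
            int(capture.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    # "mp4v" is an assumed codec choice; pick one your build supports.
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(dst_path, fourcc, fps, size)
    try:
        for frame in keep_frames(read_frames(capture), set(drop)):
            writer.write(frame)
    finally:
        capture.release()
        writer.release()
```

For example, `copy_without_frames("myvideo.flv", "shorter.mp4", {10, 11, 12})` would write a copy without frames 10 to 12. Note that this re-encodes every frame and that OpenCV does not copy the audio track.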
I've been messing around with GStreamer and GNonLin lately. I've been concatenating segments of video files, but when I dynamically connect the src pad on the composition, I can choose either the audio or the video portion of the files, producing silent playback or videoless audio. How can I attach my composition to an audio converter and a video sink at the same time? Do I have to make two compositions and add the files to both of them?
Yes, gnonlin compositions work on one media type at a time. Audio and Video are treated separately.