I need help combining image and audio clips into 1 video

My goal is to make a video of a text-to-speech voice reading images I made.
I have the images and audio as files, and I want to combine them in a slideshow fashion where each image stays on screen for as long as its text-to-speech audio lasts. It would also be nice to have a transition mp4 between the clips.
The problem is that I have no idea where to start. The pymovie documentation doesn't seem to cover this, as far as I understand.
I need directions on where to go, what to use, and how to use it.
I am also creating the images in a for loop and planning to write a function that adds each image and its audio to the output file.
I have searched for 10-20 minutes now and didn't find anything to help me.
Keep in mind I am a newbie Python programmer.
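
One possible starting point, sketched under assumptions: instead of a Python video library, this drives the ffmpeg command-line tool (assumed to be installed and on the PATH) through subprocess. Each image/audio pair becomes one short mp4 that lasts as long as its audio, and the pieces are then joined with ffmpeg's concat demuxer. The function names, the "segments" folder, and the file naming are hypothetical, chosen only for illustration.

```python
import subprocess
from pathlib import Path


def build_segment_cmd(image, audio, out):
    """ffmpeg arguments that show one still image for the length of one audio file."""
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-i", str(image),   # repeat the still image as a video stream
        "-i", str(audio),                 # narration for this slide
        "-c:v", "libx264", "-tune", "stillimage",
        "-pix_fmt", "yuv420p",            # pixel format most players accept
        "-c:a", "aac",
        "-shortest",                      # stop encoding when the audio ends
        str(out),
    ]


def build_concat_list(segments):
    """Text for ffmpeg's concat demuxer: one "file '...'" line per segment."""
    return "".join(f"file '{Path(s).as_posix()}'\n" for s in segments)


def make_slideshow(pairs, out, workdir="segments", transition=None):
    """Render every (image, audio) pair, then join the results into `out`.

    `transition` is an optional mp4 placed between slides (hypothetical
    parameter); it must use the same codecs and frame size as the slides,
    because the final step copies streams without re-encoding.
    """
    work = Path(workdir)
    work.mkdir(exist_ok=True)
    segments = []
    for i, (image, audio) in enumerate(pairs):
        seg = (work / f"seg_{i:03d}.mp4").resolve()
        subprocess.run(build_segment_cmd(image, audio, seg), check=True)
        if segments and transition is not None:
            segments.append(Path(transition).resolve())
        segments.append(seg)
    list_file = work / "list.txt"
    list_file.write_text(build_concat_list(segments))
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
         "-i", str(list_file), "-c", "copy", str(out)],
        check=True,
    )
```

A call such as make_slideshow([("1.png", "1.mp3"), ("2.png", "2.mp3")], "final.mp4") would fit inside the for loop that creates the images, by collecting the pairs as they are produced. All images should share the same size, and libx264 with yuv420p needs even width and height.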

Related

OpenCV & MoviePy - Analyzing video frames

I've been using OpenCV and MoviePy to get images out of a video (1 image per second) and, once extracted, I analyze each image with pytesseract. The part where the script extracts images takes quite a bit of time. Is there a function that I've overlooked in MoviePy or OpenCV that allows video frames to be analyzed without having to create image files first? This could tremendously speed up the process.
Current steps:
Scan and extract 1fps with a specific video as argument
From each of those images, perform analysis on a specific area
Desired:
Perform analysis on a specific area of the video itself at 1 fps.
If this function exists, please inform me. Otherwise, would there be a workaround for this? Suggestions?
Thanks!!
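
A minimal sketch of one way this could work, assuming the opencv-python and pytesseract packages: frames read through cv2.VideoCapture are already numpy arrays in memory, so a region can be cropped by slicing and handed to pytesseract without writing any image file. The function names and the (x, y, w, h) region layout are hypothetical.

```python
def sample_frame_indices(fps, frame_count, every_seconds=1.0):
    """Indices of the frames to analyze, one every `every_seconds`."""
    step = max(1, round(fps * every_seconds))
    return list(range(0, frame_count, step))


def ocr_region(video_path, region, every_seconds=1.0):
    """Run OCR on one region of the video, straight from memory.

    `region` is (x, y, width, height) in pixels. Returns a list of
    (timestamp_in_seconds, text) tuples.
    """
    # Imported here so the helper above works without these packages.
    import cv2          # opencv-python
    import pytesseract  # also needs the tesseract binary installed

    x, y, w, h = region
    cap = cv2.VideoCapture(video_path)
    try:
        # Fall back to 25 fps if the container reports none (assumption).
        fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
        count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        results = []
        for idx in sample_frame_indices(fps, count, every_seconds):
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # jump to the wanted frame
            ok, frame = cap.read()
            if not ok:
                break
            # OpenCV frames are BGR; crop first, then convert the small crop.
            crop = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2RGB)
            results.append((idx / fps, pytesseract.image_to_string(crop)))
        return results
    finally:
        cap.release()
```

Seeking with CAP_PROP_POS_FRAMES may be slow or imprecise for some codecs; reading every frame and skipping the unwanted ones is a possible alternative.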

Playing audio in sync with video with frames generated on the fly, real time. Plausible?

I'm a self taught python programmer working on a hobby project, but I'm having some difficulty and would like to address what I see as a potential XY problem.
My app takes an input of an audio file (converts it to wav) and produces visual representations of the audio (90x90, RGB, frames) in the form of numpy arrays. I used to save these frames to a video file using open-cv, then use ffmpeg to scale the video and add the (original, non-wav) audio over the top, but this meant waiting until the app had finished to play the file. I would like to be able to play the audio and display the frames as they are generated, in sync. My generation code takes at maximum 8ms of a 16ms frame (60fps), so I have a reasonable amount of cycles to play with.
From my research, I have found that SDL is the most appropriate tool for displaying frames at high speed, and I have managed to make a simple system that displays frames 'in time' by brute-force pixel editing. I have also discovered that SDL can play audio, and it even seems that I could synchronize this with the video as I would like, via the callback function. However, being a decidedly non-C programmer, I am at a loss as to how best to display frames, as directly assigning pixels cannot be the safest or fastest way, and I would like to scale the frames as they are displayed. I am also at a loss as to how best to convert numpy arrays to textures efficiently, and how best to keep my generation code, the audio, and the video frames in sync.
I'm not specifically looking for an answer to any of those problems, though advice would be appreciated, I'm just making sure that this is a reasonable way forward. Is SDL/pysdl2 coupled with numpy appropriate in this scenario? Or is this asking too much from python overall?
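
On the synchronization part, one common approach is to treat the audio as the master clock: the number of audio samples already played determines which frame should be on screen. The sketch below shows only that bookkeeping in plain Python. It is not SDL code, and samples_played is assumed to come from whatever audio callback the app uses.

```python
def frame_for_audio_position(samples_played, sample_rate, fps):
    """Index of the video frame that should be visible right now."""
    # Integer arithmetic avoids drift from accumulated float rounding.
    return (samples_played * fps) // sample_rate


def sync_action(next_frame, samples_played, sample_rate, fps):
    """Decide what to do with the next generated frame.

    Returns "wait" if the frame is early, "drop" if the audio has
    already moved past it, and "show" if it is due now.
    """
    due = frame_for_audio_position(samples_played, sample_rate, fps)
    if next_frame > due:
        return "wait"
    if next_frame < due:
        return "drop"
    return "show"
```

With 44100 Hz audio and 60 fps, a new frame becomes due every 735 samples, so the display loop only has to compare its frame counter against the audio position on each pass.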

Detecting a noise within an audio stream in Python

My goal is to be able to detect a specific noise that comes through the speakers of a PC using Python. That means the following, in pseudo code:
Sound is being played out of the speakers, by applications such as games for example
My "audio to detect" sound happens, and I want to detect that, and take an action
The specific sound I want to detect for example can be found here.
If I break that down, I believe I need two things:
A way to sample the audio that is being streamed to an audio device -- perhaps something based on this? or potentially sounddevice - but I can't determine how to make this work by looking at their api?
A way to compare each sample with my "audio to detect" sound file.
The detection does not need to be exact - it just needs to be close. For example, there will be lots of other noises happening at the same time, so it's more about detecting the footprint of the "audio to detect" within an audio stream containing a variety of sounds.
Having investigated this, I found technologies mentioned in this post on SO and also this interesting article on Chromaprint. The Chromaprint article uses fpcalc to generate fingerprints, but because my "audio to detect" is only around 1-2 seconds long, fpcalc can't generate the fingerprint. I need something that works across shorter time spans.
My question is - can somebody help me with the two parts to my question:
How do I sample the audio device on my PC using python
How should I attempt this comparison (ideally with a little example)
Many thanks in advance.
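
For the comparison part, one simple technique to try is normalized cross-correlation: slide the short "audio to detect" clip over a captured buffer and score every offset between -1 and 1. The sketch below assumes numpy, mono audio already loaded as arrays at the same sample rate, and a hypothetical threshold of 0.6. Capturing the speaker output is not covered here.

```python
import numpy as np


def find_template(stream, template, threshold=0.6):
    """Locate `template` inside `stream` by normalized cross-correlation.

    Returns (best_offset, best_score, detected). The score is the Pearson
    correlation between the template and the best-matching window.
    """
    stream = np.asarray(stream, dtype=np.float64)
    template = np.asarray(template, dtype=np.float64)
    n = len(template)
    if len(stream) < n:
        raise ValueError("stream must be at least as long as the template")

    t = template - template.mean()          # zero-mean template
    corr = np.correlate(stream, t, mode="valid")

    # Energy of each stream window with its own mean removed,
    # computed from running sums: sum(x^2) - (sum x)^2 / n.
    csum = np.concatenate(([0.0], np.cumsum(stream)))
    csum2 = np.concatenate(([0.0], np.cumsum(stream ** 2)))
    win_sum = csum[n:] - csum[:-n]
    win_sum2 = csum2[n:] - csum2[:-n]
    win_energy = np.maximum(win_sum2 - win_sum ** 2 / n, 1e-12)

    score = corr / (np.linalg.norm(t) * np.sqrt(win_energy))
    best = int(np.argmax(score))
    return best, float(score[best]), bool(score[best] >= threshold)
```

Matching raw waveforms like this tends to be sensitive to loud background sound, so it may be worth comparing spectral features instead if the scores turn out too low in practice.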

Audio alignment (same sentence with different speakers)

I am super new to audio processing. I have one reference audio file and several other audio recordings (the same sentence spoken by different speakers, differing in dialect and duration), and I want to align all the audio files to the one reference file with the least warping. I tried using MFCC and Chroma features (python/librosa) but I don't know what to do next. I was reading about DTW (Dynamic Time Warping) for alignment; would that work? Is there an example, open source project, or audio tool that already does this? It seems to be a solved problem but I couldn't find it. Please help.
I was following this example:
https://librosa.github.io/librosa_gallery/auto_examples/plot_music_sync.html but how do I save the aligned audio back in the time domain?
This seems related - Dynamic time warping with python (final mapping)
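
To make the DTW idea concrete, here is a small textbook-style implementation in plain numpy. It is a sketch for illustration, not the librosa routine. It takes two feature sequences with one row per frame and returns the total cost plus the warping path, which maps reference frames to query frames.

```python
import numpy as np


def dtw_path(ref, query):
    """Dynamic time warping between two feature sequences.

    `ref` and `query` are arrays with one row per frame (1-D input is
    treated as one feature per frame). Returns (cost, path), where path
    is a list of (ref_index, query_index) pairs from start to end.
    """
    ref = np.asarray(ref, dtype=float)
    query = np.asarray(query, dtype=float)
    if ref.ndim == 1:
        ref = ref[:, None]
    if query.ndim == 1:
        query = query[:, None]
    n, m = len(ref), len(query)

    # Euclidean distance between every pair of frames.
    dist = np.linalg.norm(ref[:, None, :] - query[None, :, :], axis=2)

    # acc[i, j] = cheapest cost of aligning the first i and first j frames.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )

    # Walk back from the end, always taking the cheapest predecessor.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    path.reverse()
    return float(acc[n, m]), path
```

librosa's MFCC output has one column per frame, so it would need to be transposed before being passed in. The double loop is slow for long recordings; it is meant to show the mechanics only.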

The simplest video streaming?

I have a camera that is taking pictures one by one (about 10 pictures per second) and sending them to PC. I need to show this incoming sequence of images as a live video in PC.
Is it enough just to use some Python GUI framework, create a control that will hold a single image and just change the image in the control very fast?
Or would that be just lame? Should I use some sort of video streaming library? If yes, what do you recommend?

Or would that be just lame?
No. It wouldn't work at all.
There's a trick to getting video to work. Apple's QuickTime implements that trick. So do a number of Microsoft products, plus some open source video playback tools.
There are several closely-related tricks, all of which are a huge pain in the neck.
Compression. Full-sized video is huge. Do the math: 640x480 pixels at 24-bit color and 30 frames per second. It adds up quickly. Without compression, you can't read it in fast enough.
Buffering and Timing. Sometimes the data rates and frame rates don't align well. You need a buffer of ready-to-display frames and you need a deadly accurate clock to get them to display at exactly the right intervals.
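
The arithmetic behind the compression point can be spelled out in a few lines of Python; the function name is hypothetical.

```python
def raw_video_rate(width, height, bits_per_pixel, fps):
    """Uncompressed video data rate in bytes per second."""
    bytes_per_frame = width * height * bits_per_pixel // 8
    return bytes_per_frame * fps


rate = raw_video_rate(640, 480, 24, 30)
print(f"{rate:,} bytes/s = {rate * 8 / 1e6:.1f} Mbit/s")
# prints "27,648,000 bytes/s = 221.2 Mbit/s"
```

At the 10 pictures per second from the question, and assuming the same 640x480 frame size, the raw rate would be 9,216,000 bytes per second.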
Making a sequence of JPEG images into a movie is what iPhoto and iMovie are for.
Usually, what we do is create the video file from the images and play the video file through a standard video player. Making a QuickTime movie or Flash movie from images isn't that hard. There are a lot of tools to help make movies from images. Almost any photo management solution can create a slide show and save it as a movie in some standard format.
Indeed, I think that Graphic Converter can do this.
