Python Audio Analysis: What properties/values can I extract from it?

I'm currently working on a tkinter Python school project whose sole purpose is to generate images from audio files. I'm going to pick audio properties and use them as values to generate unique abstract images, but I don't know which properties I can analyze to extract those values from. So I'm looking for guidance on which properties (audio frequency, amplitude, etc.) I can extract values from to generate the images with Python.

The question is very broad in its current form.
(Bear in mind audio is not my area of expertise, so do keep an eye out for the opinions of people working in audio/audiovisual/generative fields.)
You can go about it either way: figure out what kind of image(s) you'd like to create from audio, and from there work out which audio features to use. The other way around is also valid: pick an audio feature you'd like to explore, then think of how you'd best or most interestingly represent it visually.
There's a distinction between a single image and multiple images.
For a single image, the simplest thing I can think of is drawing a grid of squares where a visual property of each square (e.g. square size, fill colour intensity, etc.) is mapped to the amplitude at that time. The single image would visualise a whole track's amplitude pattern. Even with such a simple example there are many choices you can make: how often you sample, how you lay out the grid (cartesian, polar), and how each amplitude sample is visualised (different shapes, sizes, colours, etc.).
(Similar concept to CinemaRedux, simpler for audio only)
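As a rough illustration of that grid idea, here is a minimal sketch assuming a mono 16-bit WAV file; the filename, grid dimensions and size mapping are placeholders you would tweak:

```python
# Minimal sketch: sample a track's amplitude envelope and map it to the size
# of squares in a grid. Assumes a mono 16-bit WAV file ("track.wav" is a
# placeholder path).
import wave
import numpy as np
from PIL import Image, ImageDraw

COLS, ROWS, CELL = 32, 32, 16                          # grid layout, cell size in px

with wave.open("track.wav", "rb") as wf:
    frames = wf.readframes(wf.getnframes())
samples = np.frombuffer(frames, dtype=np.int16).astype(np.float32)
samples = np.abs(samples) / (np.abs(samples).max() or 1.0)   # normalised amplitude

# One amplitude value per grid cell, averaged over that cell's slice of time.
levels = [chunk.mean() for chunk in np.array_split(samples, COLS * ROWS)]

img = Image.new("RGB", (COLS * CELL, ROWS * CELL), "black")
draw = ImageDraw.Draw(img)
for i, level in enumerate(levels):
    cx = (i % COLS) * CELL + CELL // 2
    cy = (i // COLS) * CELL + CELL // 2
    half = int(level * CELL / 2)                       # square size follows amplitude
    draw.rectangle([cx - half, cy - half, cx + half, cy + half], fill="white")
img.save("amplitude_grid.png")
```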
You can look into the field of data visualisation for inspiration.
Information is Beautiful is a great place to start.
If you want to generate multiple images, that seems to go into audiovisual territory (e.g. abstract animation, audio-reactive motion graphics, etc.).
Your question originally had the Processing tag, which I removed; however, you could be using Processing's Python Mode.
In terms of audio visualisation, one good example I can think of is Robert Hodgin's work; see Magnetosphere and the audio-generated landscape prototype. He is using frequency analysis (FFT) with a bit of smoothing/data massaging to amplify the elements useful for visualisation and dampen some of the noise:
(There are a few handy audio libraries such as Minim and Beads; however, I assume you're interested in using raw Python, not Jython, which is what the official Processing Python mode uses.) Here is an answer on FFT analysis for visualisation; even though it's in Processing Java, the principles can be applied in Python.
Personally I've only used pyaudio so far for basic audio tasks. I would assume you could use it for amplitude analysis, but for other, more complex tasks you might need something extra.
Doing a quick search, librosa pops up.
If what you want to achieve isn't clear, try prototyping first and start with the simplest audio analysis and visual elements you can think of (e.g. amplitude mapped to boxes over time). Constraints can be great for creativity, and the minimal approach could translate into cleaner, more minimal visuals.
You can then look into FFT, MFCC, onset/beat detection, etc.
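To give a concrete feel for those features, here is a quick librosa sketch ("track.wav" is a placeholder path); each call returns arrays you can map to visual parameters:

```python
# Quick sketch of common librosa features; each returns numbers you can map
# to visual parameters (colour, size, position, ...).
import librosa

y, sr = librosa.load("track.wav", sr=None)

rms = librosa.feature.rms(y=y)[0]                      # amplitude envelope per frame
spectrum = abs(librosa.stft(y))                        # FFT magnitudes over time
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)     # rough timbre description
tempo, beats = librosa.beat.beat_track(y=y, sr=sr)     # global tempo + beat frames
onsets = librosa.onset.onset_detect(y=y, sr=sr)        # note/percussion onset frames

print(rms.shape, spectrum.shape, mfcc.shape, tempo, len(beats), len(onsets))
```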
Another tool that could be useful for prototyping is Sonic Visualiser.
You can open a track and use some of the built-in feature extractors.
(You can even get away with exporting XML or CSV data from Sonic Visualiser, which you can load/parse in Python and use to render image(s).)
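For example, loading a simple Sonic Visualiser CSV export is just a few lines (a sketch assuming a plain two-column time,value layout; check your actual export for headers and extra columns):

```python
# Sketch: parse a Sonic Visualiser CSV export (assumed "time,value" rows).
import csv

times, values = [], []
with open("sonic_visualiser_export.csv", newline="") as f:
    for row in csv.reader(f):
        if len(row) >= 2:
            times.append(float(row[0]))
            values.append(float(row[1]))

# times/values can now drive whatever drawing code you like.
print(len(times), min(values), max(values))
```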
It uses a plugin system (similar to VST plugins in DAWs like Ableton Live, Apple Logic, etc.) called Vamp plugins. You can then use the VampPy Python wrapper if you need the data at runtime.
(You might also want to draw inspiration from other environments used for audiovisual artworks, like Pure Data + GEM, Max/MSP + Jitter, VVVV, etc.)

Time domain: zero-crossing rate, root mean square energy, etc. Frequency domain: spectral bandwidth, flux, rolloff, flatness, MFCC, etc. Also tempo. You can use librosa for Python (https://librosa.org/doc/latest/index.html) to extract these from a .wav file; it implements the Fast Fourier Transform and framing. You can then apply statistics such as the mean and standard deviation to the vectors of the above characteristics across the whole audio file.
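A sketch of that workflow with librosa, summarising a few frame-level features with mean and standard deviation ("track.wav" is a placeholder path):

```python
# Summarise frame-level features with simple statistics across the whole file.
import numpy as np
import librosa

y, sr = librosa.load("track.wav", sr=None)

features = {
    "zcr": librosa.feature.zero_crossing_rate(y)[0],
    "rms": librosa.feature.rms(y=y)[0],
    "bandwidth": librosa.feature.spectral_bandwidth(y=y, sr=sr)[0],
    "rolloff": librosa.feature.spectral_rolloff(y=y, sr=sr)[0],
    "flatness": librosa.feature.spectral_flatness(y=y)[0],
}

for name, values in features.items():
    print(f"{name}: mean={np.mean(values):.4f}, std={np.std(values):.4f}")
```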

Providing an additional avenue for exploration: you have some tools to explore this qualitatively (as opposed to quantitatively, using metrics derived from the audio signal as suggested in the great answers above).
As you mention, the objective is to generate unique abstract images from sound, so I would suggest an interesting angle may be to apply some machine learning techniques and derive mood classification predictions from the source audio.
For instance, you could use the TensorFlow models in Essentia to predict the mood of the track and associate the images you select with the mood scores generated. I would suggest going well beyond this and using the tkinter image creation tools to create your mappings to mood. Use pen and paper to develop your mapping strategy: are certain moods more angular or circular? What colour mappings will you select, and why? You have a great deal of freedom to create these mappings, so start simple, as complexity builds naturally.
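As a toy sketch of what such a mapping could look like, here is a tkinter Canvas example; the `moods` dict is placeholder data standing in for whatever scores your mood model returns:

```python
# Toy sketch: map mood scores to shape, size, colour and line width on a Canvas.
import tkinter as tk

moods = {"happy": 0.7, "aggressive": 0.2, "relaxed": 0.5}    # hypothetical scores

root = tk.Tk()
canvas = tk.Canvas(root, width=400, height=400, bg="black")
canvas.pack()

# Example mapping decisions: "happy" -> size and colour, "aggressive" ->
# angular vs. round shape, "relaxed" -> line width.
size = int(100 + 200 * moods["happy"])
colour = "#%02x%02x%02x" % (int(255 * moods["happy"]), 80, int(255 * moods["relaxed"]))
width = int(1 + 9 * moods["relaxed"])
x0, y0, x1, y1 = 200 - size // 2, 200 - size // 2, 200 + size // 2, 200 + size // 2

if moods["aggressive"] > 0.5:
    canvas.create_polygon(200, y0, x1, y1, x0, y1, outline=colour, fill="", width=width)
else:
    canvas.create_oval(x0, y0, x1, y1, outline=colour, width=width)

root.mainloop()
```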
Using some simple mood predictions may be more useful for you as someone with more experience of the qualitative side of sound than of the quantitative side of an audio engineer. If a report is a requirement of the task, this may be worth making central to it: document your mapping decisions and design process.

Related

Subtract/compare differences between two audio files using Python

My goal is to take two identical-length .wav files, one original with noise + speech, one processed with improved speech, and compare the two. This should leave me with the difference between the two wave files, i.e. the noise that was removed during the processing.
I would like to do this to practice my python coding skills, also to test the efficiency of speech processing programs. So far I have found programs that can do this but I would really like to build my own simple version in python. Some libraries that I have considered are audiodiff and librosa but it doesn't seem like they include a subtract function.
I have a few programs that can accomplish this but I would like to create a simple version and be able to customize it over time as my needs expand.
The task of estimating the quality of a speech clip is known as Speech Quality Estimation. A slightly simpler task formulation is Speech Intelligibility, which only measures how easy it is to understand the speech, not the subjective "quality" of the speech signal.
These are established tasks within speech signal processing, and there are industry standards as well as bleeding-edge research. Such methods (often referred to as objective metrics) take into account the human auditory system and perception of speech in order to give a good estimate of how a group of humans would evaluate the result in a listening test. Because of the complexity of human perception, good methods are a bit more involved than just a difference of the audio.
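For completeness, the literal sample-wise difference the question asks about is only a few lines with soundfile (a sketch assuming two equal-length, same-sample-rate files; as noted above, it is not a perceptual quality measure):

```python
# Naive residual: subtract the processed file from the original, sample by sample.
import soundfile as sf

original, sr1 = sf.read("noisy_speech.wav")       # placeholder filenames
processed, sr2 = sf.read("enhanced_speech.wav")
assert sr1 == sr2 and original.shape == processed.shape

residual = original - processed                   # roughly "what processing removed"
sf.write("residual.wav", residual, sr1)
```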
If you have access to the original speech (without noise), then you can use a full reference based method. An early ITU standard is PESQ, and the most recent as of 2021 is POLQA. A good metric with open source implementation is VISQOL.
If you do not have access to the original speech without noise, you must use a no-reference / single-ended / non-intrusive method. An ITU standard is P.563. There is currently a lot of research using deep neural networks to better solve this task. Examples are NISQA, Gamper2019
Some more details can be found on https://github.com/jonnor/machinehearing/tree/master/audio-quality

Is there a way to generate real time depthmap from single camera video in python/opencv?

I'm trying to convert single images into depth maps, but I can't find any useful tutorial or documentation.
I'd like to use opencv, but if you know a way to get the depth map using for example tensorflow, I'd be glad to hear it.
There are numerous tutorials for stereo vision but I want to make it cheaper because it's for a project to help blind people.
I'm currently using an ESP32-CAM to stream frame by frame, and I'm receiving the images in Python using OpenCV.
Usually, we need a photometric measurement from a different position in the world to form a geometric understanding of the world (a.k.a. a depth map). For a single image it is not possible to measure the geometry directly, but it is possible to infer depth from prior understanding.
One way to make a single image work is to use a deep-learning-based method to directly infer depth. Deep-learning-based approaches are usually based on Python, so if you are only familiar with Python, this is the approach you should go for. If the image is small enough, I think real-time performance is possible. There is a lot of work of this kind using Caffe, TensorFlow, PyTorch, etc.; you can search on GitHub for more options. The one I posted here is what I used recently.
reference:
Godard, Clément, et al. "Digging into self-supervised monocular depth estimation." Proceedings of the IEEE international conference on computer vision. 2019.
Source code: https://github.com/nianticlabs/monodepth2
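If you want something runnable without building the monodepth2 repo, here is a rough sketch using MiDaS via torch.hub instead (a different but similar monocular depth model that ships a PyTorch Hub entry point; the frame path is a placeholder):

```python
# Sketch: monocular depth inference with MiDaS loaded from torch.hub.
import cv2
import torch

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")     # small model for speed
midas.eval()
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = transforms.small_transform

img = cv2.imread("frame_from_esp32.jpg")                     # placeholder frame
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

with torch.no_grad():
    batch = transform(img_rgb)                               # resize + normalise + batch
    prediction = midas(batch)
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img_rgb.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze().cpu().numpy()

# Normalise to 8-bit for viewing/saving as a grayscale depth map.
depth_vis = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype("uint8")
cv2.imwrite("depthmap.png", depth_vis)
```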
The other way is to use large-FOV video for single-camera SLAM. This has various constraints, such as the need for good features, a large FOV, slow motion, etc. You can find a lot of this work, such as DTAM, LSD-SLAM, DSO, etc. There are a couple of other packages from HKUST or ETH that do the mapping given the position (e.g. if you have GPS/compass); some of the famous names are REMODE+SVO, open_quadtree_mapping, etc.
One typical example of single-camera SLAM would be LSD-SLAM. It is a real-time SLAM.
It is implemented in C++ on ROS, and I remember it does publish the depth image. You can write a Python node to subscribe to the depth directly, or to the globally optimized point cloud and project it into a depth map from any view angle.
reference: Engel, Jakob, Thomas Schöps, and Daniel Cremers. "LSD-SLAM: Large-scale direct monocular SLAM." European conference on computer vision. Springer, Cham, 2014.
source code: https://github.com/tum-vision/lsd_slam

How to compare whether two images represent the same object when the pictures come from two different sources, in OpenCV?

Suppose I have an image of a car taken from my mobile camera and I have another image of the same car taken downloaded from the internet.
(For simplicity please assume that both the images contain the same side view projection of the same car.)
How can I detect that both the images are representing the same object i.e. the car, in this case, using OpenCV?
I've tried template matching, feature matching (ORB), etc., but those are not providing satisfactory results.
SIFT feature matching might produce better results than ORB. However, the main problem here is that you have only one image of each type (from the mobile camera and from the Internet). If you have a large number of images of this car model, then you can train a machine learning system using those images. Later you can submit one image of the car to the machine learning system, and there is a much higher chance of it recognizing the car.
From a machine learning point of view, using only one image as the master and matching another with it is analogous to teaching a child the letter "A" using only one handwritten letter "A", and expecting him/her to recognize any handwritten letter "A" written by anyone.
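For reference, a small sketch of SIFT matching with Lowe's ratio test in OpenCV (requires a build where SIFT is available, e.g. OpenCV >= 4.4 or opencv-contrib-python; the filenames are placeholders):

```python
# SIFT keypoint matching between the two car images with a ratio test.
import cv2

img1 = cv2.imread("car_phone.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("car_internet.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)

# Lowe's ratio test keeps only distinctive matches.
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(f"{len(good)} good matches out of {len(matches)}")
```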
Think about how you can mathematically describe the car's features so that every car is different. Maybe every car has a different size of wheels? Maybe the distance from the door handle to bottom of the side window is a unique characteristic for every car? Maybe every car's proportion of front side window's to rear side window's width is an individual feature of that car?
You probably can't answer yes with 100% confidence to any of these questions. But what you can do is combine those into a multidimensional feature vector and perform classification.
Now, the crucial part here is that since you're doing manual feature description, you need to take care to do excellent work and to test every step of the way. For example, you need to design features that will be scale and perspective invariant. Here, I'd recommend reading on how face detection was designed to fulfil that requirement.
Will Machine Learning be a better solution? Depends greatly on two things. Firstly, what kind of data are you planning to throw at the algorithm. Secondly, how well can you control the process.
What most people don't realize today, is that Machine Learning is not some magical solution to every problem. It is a tool and as every tool it needs proper handling to provide results. If I were to give you advice, I'd say you will not handle it very well yet.
My suggestion: get acquainted with basic feature extraction and general image processing algorithms. Edge detection (Canny, Sobel), contour finding, shape description, Hough transform, morphological operations, masking, etc. Without those at your fingertips, I'd say that in this particular case even Machine Learning will not save you.
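To get your hands dirty with a couple of those, here is a minimal OpenCV sketch (the filenames are placeholders): blur, Canny edges, then contour extraction.

```python
# Minimal pipeline: smooth, detect edges, extract and draw contours.
import cv2

img = cv2.imread("car.jpg", cv2.IMREAD_GRAYSCALE)
blurred = cv2.GaussianBlur(img, (5, 5), 0)            # reduce noise before edges
edges = cv2.Canny(blurred, 50, 150)                   # Canny edge detection

contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
print(f"found {len(contours)} contours")

vis = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
cv2.drawContours(vis, contours, -1, (0, 255, 0), 2)
cv2.imwrite("contours.png", vis)
```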
I'm sorry: there is no shortcut here. You need to do your homework in order to make that one work. But don't let that scare you. It's a great project. Good luck!

How to make a semantic label image?

Now I want to train my own image data in caffe using SegNet.
But as a first step we need to label our own images with per-pixel class annotations like these:
I have tried searching GitHub but cannot find anything. So my question is: does anyone know which tool can make semantic label images?
Check out a tool called sloth: https://github.com/cvhciKIT/sloth. It is an open-source tool written in Python with PyQt for creating ground-truth computer vision datasets for a wide array of applications, such as semantically labelled data like you have above.
If you don't like sloth, you can use any image editing software, like GIMP where you would make one layer per label and use polygons and flood fill of different hues to create your data. You would then merge all of the layers together to make a final image that you would use for your purposes.
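If you would rather script the labelling than paint it in GIMP, the same idea (one integer class id per region, filled polygons) is a few lines of PIL; the polygon coordinates below are made up for illustration:

```python
# Sketch: build a per-pixel label image from polygon annotations.
import numpy as np
from PIL import Image, ImageDraw

width, height = 480, 360
label_img = Image.new("L", (width, height), 0)        # 0 = background class
draw = ImageDraw.Draw(label_img)

# Hypothetical annotations: class id -> polygon vertices.
annotations = {
    1: [(50, 200), (200, 200), (200, 350), (50, 350)],    # e.g. "road"
    2: [(220, 100), (320, 100), (320, 250), (220, 250)],  # e.g. "car"
}
for class_id, polygon in annotations.items():
    draw.polygon(polygon, fill=class_id)

label_img.save("label.png")
print(np.unique(np.array(label_img)))                 # -> [0 1 2]
```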
However, as user Miki mentioned (see the discussion thread below), creating new datasets from scratch will take a considerable amount of effort. It is highly advisable that you don't create this on your own, as you need a lot of data to ensure your algorithms are performing correctly. You'll need the help of other (hopefully willing) PhD students, preferably those you know personally or who work with you in your lab or workplace, to help manually curate this data for you.
If this isn't an option, you can use crowdsourcing platforms like Amazon Mechanical Turk, where you outsource the work to willing individuals: you inform them of the task at hand and pay a small amount per image. This would be something to consider if you can't find many people to help you.
All in all, this will take a considerable amount of effort, not only in terms of time but in terms of people, if you want to create a large dataset within a short span of time. I would recommend you simply use established datasets, such as the one you referenced from Cambridge, or, as Miki suggested, LabelMe by Antonio Torralba, which is not only a toolbox for annotating images from his LabelMe dataset but also allows you to do the same for your own images.
Good luck!
As answered by @rayryeng, a tool called sloth is a great way to do this task simply. However, if I have more than 20 objects waiting to be classified, sloth is not an ideal tool. So I developed a simple tool called IsLabel to solve this problem with a few algorithms.
Using IsLabel took me only about 40 seconds, and the result looked like this (input and output screenshots omitted).
I know it's not perfect, but it works fine for me.
I would recommend using https://www.labelbox.io/. They open sourced a lot of their code and have a hosting platform to manage the whole labeling process end to end.
Here is an example of segmentation (screenshot omitted), and you can export labels with a mask.

Automatic font recognition with Python

As you may have heard, there is an online font recognition service called WhatTheFont.
I'm curious about the tech behind this tool. I think basically we can separate this into two parts:
Generate images from font files of various formats; refer to http://www.fileinfo.com/filetypes/font for a list of font file extensions.
Compare submitted image with all generated images
I'd appreciate it if you could share some advice or Python code to implement the two steps above.
As the OP states, there are two parts (and probably also a third part):
Use PIL to generate images from fonts.
Use an image analysis toolkit, like OpenCV (which has Python bindings) to compare different shapes. There are a variety of standard techniques to compare different objects to see whether they're similar. For example, scale invariant moments work fairly well and are part of the OpenCv toolkit.
Most of the standard tools in #2 are designed to look for similar but not necessarily identical shapes, but for font comparison this might not be what you want, since the differences between fonts can come down to very fine details. For fine-detail analysis, try comparing the x and y profiles of a perimeter path around each letter, appropriately normalized, of course. (This, or a more mathematically complicated variant of it, has been used with good success in font analysis.)
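For part 1, rendering glyph images from a font file with PIL is only a few lines (the font path is a placeholder; point it at a real .ttf):

```python
# Render a few characters of a font to grayscale images with PIL.
from PIL import Image, ImageDraw, ImageFont

font = ImageFont.truetype("SomeFont.ttf", 64)         # placeholder font file
for char in "ABC":
    img = Image.new("L", (96, 96), 255)               # white background
    draw = ImageDraw.Draw(img)
    draw.text((16, 16), char, font=font, fill=0)      # black glyph
    img.save(f"glyph_{char}.png")
```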
I can't offer Python code, but here are two possible approaches.
"Eigen-characters." In face recognition, given a large training set of normalized facial images, you can use principal component analysis (PCA) to obtain a set of "eigenfaces" which, when the training faces are projected upon this subspace, exhibit the greatest variance. The "coordinates" of the input test faces with respect to the space of eigenfaces can be used as the feature vector for classification. The same thing can be done with textual characters, i.e., many versions of the character 'A'.
Dynamic Time Warping (DTW). This technique is sometimes used for handwriting character recognition. The idea is that the trajectory taken by the tip of a pencil (i.e., d/dx, d/dy) is similar for similar characters. DTW makes some of the variation across instances of a single person's writing invariant. Similarly, the outline of a character can represent a trajectory, which then becomes the feature vector for each font set. I guess the DTW part is not as necessary with font recognition because a machine creates the characters, not a human. But it may still be useful to disambiguate spatial ambiguities.
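As a rough illustration of the eigen-characters idea (not the answerer's code), a scikit-learn sketch assuming you already have a stack of normalized glyph images:

```python
# Sketch: project flattened glyph images onto their principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
glyphs = rng.random((200, 32, 32))           # placeholder for real glyph images

X = glyphs.reshape(len(glyphs), -1)          # flatten each image to a vector
pca = PCA(n_components=16)                   # keep the top 16 "eigen-characters"
features = pca.fit_transform(X)              # per-glyph feature vectors

# `features` can now be fed to any classifier (k-NN, SVM, ...) to match a
# query glyph against rendered reference fonts.
print(features.shape)                        # (200, 16)
```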
This question is a little old, so here goes an updated answer.
You should take a look at this paper: DeepFont: Identify Your Font from An Image. Basically it's a neural network trained on tons of images. It was presented commercially in this video.
Unfortunately, there is no code available. However, there is an independent implementation available here. You'll need to train it yourself, since weights are not provided, but the code is really easy to follow. In addition to this, consider that this implementation is only for a few fonts.
There is also a link to the dataset and a repo to generate more data.
Hope it helps.
