I know that it is possible to convert audio into a representative image.
Does anyone know if the opposite is possible? Can we convert the representative image back to audio?
If it is possible, please tell me how.
I looked for ways to do this but did not find any.
Edit: my main goal is to generate new/random music using a DCGAN.
My idea is to take audio, convert it to an image of its frequency content, run the DCGAN on those images, and then convert the result back to audio.
I don't know which tools to use or exactly how to do this.
Any help would be appreciated.
There are many ways to do this. The approach I used is to iterate across each pixel in the input image and assign each pixel, in order, a unique frequency. The range of frequencies can be arbitrary; let's vary it across the human audible range from 200 to 8,000 Hz. Divide this audio frequency range by the number of pixels, which gives you a frequency increment value. Give the first pixel 200 Hz, and as you iterate across all pixels, give each pixel a frequency by adding this frequency increment to the previous pixel's frequency.
While you perform the above iteration across all pixels, determine the light intensity of the current pixel and normalize it to a value from zero to one; this becomes the amplification factor for that pixel's frequency.
Now you have an array where each element records a light intensity value and a frequency. Walk across this array and create an oscillator that outputs a sine wave at the frequency of the current element, with an amplitude driven by its amplification factor. Then combine all of the oscillator outputs and normalize them into a single aggregate audio signal.
This aggregate synthesized audio is the time-domain representation of the input image, which is your frequency-domain starting point.
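To make the recipe concrete, here is a minimal Python/NumPy sketch of the oscillator-bank idea (my own implementation was in Go, so treat this as an illustration rather than my actual code); the file path, duration, and sample rate are placeholders, and the image is downscaled to 64x64 inside the sketch because the loop sums one sine per pixel.

import numpy as np
from PIL import Image

def image_to_audio(path, duration_s=2.0, sr=44100, f_min=200.0, f_max=8000.0):
    # Map each pixel, in raster order, to a sine oscillator whose frequency
    # increases linearly across the pixels and whose amplitude is the pixel brightness.
    img = np.asarray(Image.open(path).convert("L").resize((64, 64)), dtype=np.float64)
    pixels = img.flatten() / 255.0                   # brightness normalized to 0..1
    freqs = np.linspace(f_min, f_max, pixels.size)   # one frequency per pixel
    t = np.arange(int(duration_s * sr)) / sr
    audio = np.zeros_like(t)
    for amp, f in zip(pixels, freqs):                # sum all oscillators
        if amp > 0:
            audio += amp * np.sin(2 * np.pi * f * t)
    audio /= np.max(np.abs(audio))                   # normalize the aggregate output
    return audio, sr, img.shape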
The beautiful thing is that this output audio is, in effect, the inverse Fourier transform of the image. Anyone fluent in Fourier transforms will predict what comes next: this audio can be sent into an FFT call, which will output a new image that, if you implement all of this correctly, will match your original input image more or less.
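A sketch of that round trip: FFT the synthesized audio and read one magnitude per pixel frequency to get an approximate image back. It assumes the image_to_audio sketch above and a nearest-bin lookup, so expect only a rough match.

import numpy as np

def audio_to_image(audio, sr, shape, f_min=200.0, f_max=8000.0):
    # Sample the FFT magnitude at each pixel's assigned frequency and
    # reshape back to the original image dimensions.
    spectrum = np.abs(np.fft.rfft(audio))
    bin_freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    freqs = np.linspace(f_min, f_max, shape[0] * shape[1])
    nearest_bins = np.searchsorted(bin_freqs, freqs)
    img = spectrum[nearest_bins].reshape(shape)
    return img / img.max()    # brightness back in the 0..1 range

# Round trip example using the sketch above:
# audio, sr, shape = image_to_audio("input.png")
# approx = audio_to_image(audio, sr, shape)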
I used Go rather than Python, but this challenge is language agnostic. Good luck and have fun.
There are several refinements to this. A naive way to traverse the input image is to simply zigzag left to right, top to bottom, which will work; however, if you use a Hilbert curve to determine which pixel comes next, your output audio will be better suited to human listening, especially when and if you change the resolution of your original input image (see the sketch below). Ignore this embellishment until you have the basic version working.
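For the Hilbert-curve ordering just mentioned, a minimal sketch of the classic d2xy mapping (the grid side must be a power of two; this is only the pixel-ordering helper, not the full pipeline):

def hilbert_d2xy(order, d):
    # Map a 1-D Hilbert index d to (x, y) on a (2**order) x (2**order) grid.
    x = y = 0
    t = d
    s = 1
    side = 1 << order
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:           # rotate/flip the quadrant
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Visit pixels in Hilbert order instead of a zigzag:
# order = 6   # 64 x 64 image
# pixel_order = [hilbert_d2xy(order, d) for d in range((1 << order) ** 2)]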
Far more valuable than the code that implements this is the voyage of discovery you go through while writing it. Here is the video that inspired me to embark on this voyage: https://www.youtube.com/watch?v=3s7h2MHQtxc (Hilbert's Curve: Is infinite math useful?)
Here is a sample input photo.
Here is the output photo after converting the above image into audio and then back into an image.
Once you get this up and running and are able to toggle from the frequency domain into the time domain and back again, you are free to choose whether you start from audio or from an image.
I have a picture of a human eye taken roughly 10 cm away using a mobile phone (no specifications regarding the camera). After some detection and contouring, I got 113 px as the Euclidean distance between the center of the detected iris and the outermost edge of the iris in the image. Dimensions of the image: 483x578 px.
I tried converting the pixels into mm by simply multiplying the number of pixels by the size of a pixel in mm, since 1 px is roughly equal to 0.264 mm, but this gives the proper length only if the image is at a 1:1 scale with respect to the real eye, which is not the case here.
Edit:
Device used: One Plus 7T
Field of view = 117 degrees
Aperture = f/2.2
Distance photo was taken = 10 cm (approx)
Question:
Is there a sound way to find the real-world radius of this particular eye with the information I have gathered through processing so far, without including a reference object in the image?
P.S. The actual HVID of the volunteer's iris is 12.40 mm, measured using Sirus (a high-end device to calculate iris radius; I'm trying to reproduce the same measurement using Python and OpenCV).
After months, a ton of research, and lots of trial and error, I was able to come up with a result. This is not the most ideal answer, but it gave me the expected results with decent precision.
Simply put, in order to measure object size/distance from an image we need multiple parameters. In my case, I was trying to measure the diameter of the iris from a smartphone camera image.
To make that possible, we need to know the following details prior to the calculation:
1. The size of the physical sensor (height and width, usually in mm)
(the camera inside the smartphone; these details can be obtained from websites on the internet, but you need to know the exact brand and model of the smartphone used)
Note: You cannot use random values for these, otherwise you will get inaccurate results. Every step/constraint must be considered carefully.
2. The size of the image taken (in pixels).
Note: The size of the image can easily be obtained using img.shape, but make sure the image is not cropped. This method relies on the total width/height of the original smartphone image, so any modifications/inconsistencies will result in inaccurate results.
3. The focal length of the physical sensor (mm).
Note: Information about the focal length of the sensor can be found on the internet; random values should not be used. Make sure you take images with the autofocus feature disabled so the focal length is preserved. If autofocus is on, the focal length will be constantly changing and the results will be all over the place.
4. The distance at which the image is taken (very important).
Note: As Christoph Rackwitz said in the comment section, the distance from which the image is taken must be known and should not be arbitrary. Guessing a number as input will always result in inaccuracy. Make sure you properly measure the distance from the sensor to the object using some sort of measuring tool. There are depth-estimation algorithms on the internet, but they are not accurate in most cases and need to be recalibrated after every attempt. That is indeed an option if you don't have a setup for taking consistent photos, but inaccuracies are inevitable, especially for objects like the iris that require medical precision.
Once you have gathered all of this "proper" information, the rest is to plug it into two very simple equations derived from similar triangles.
Object height/width on sensor (mm) = Sensor height/width (mm) × Object height/width (pixels) / Sensor height/width (pixels)
Real Object height (in units) = Distance to Object (in units) × Object height on sensor (mm) / Focal Length (mm)
In the first equation, you must decide which axis you want to measure along. For instance, if the image was taken in portrait and you are measuring the width of the object in the image, then use the width of the image in pixels and the width of the sensor in mm.
The sensor height/width in pixels is simply the size of the image.
You must also obtain the object size in pixels by some means.
If you are taking the image in landscape, make sure you pass the correct width and height.
Equation 2 is pretty simple as well.
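A tiny Python sketch of the two equations above, assuming everything is measured along the same axis; the sensor width and focal length below are placeholders, not actual OnePlus 7T specifications.

def real_object_size_mm(object_px, image_px, sensor_mm, focal_length_mm, distance_mm):
    # Equation 1: project the object onto the physical sensor.
    object_on_sensor_mm = sensor_mm * object_px / image_px
    # Equation 2: scale up by distance / focal length.
    return distance_mm * object_on_sensor_mm / focal_length_mm

# Hypothetical values for illustration only (sensor width and focal length are guesses):
iris_px = 2 * 113    # iris diameter in pixels, from the question
print(real_object_size_mm(iris_px, image_px=578, sensor_mm=5.6,
                          focal_length_mm=4.7, distance_mm=100))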
Things to consider:
No magnification (digital magnification can destroy any depth info)
No autofocus (already explained)
No cropping/editing/resizing of the image (already explained)
No image skewing (rotating the image can make it unusable)
Do not substitute random values for any of these inputs (golden advice)
Do not tilt the camera while taking images (tilting the camera distorts the image, so the object height/width will be altered)
Make sure the object and the camera are exactly in line
Don't use the image's EXIF data (the depth information in EXIF data is not accurate at all; do not rely on it)
Things I'm still unsure about:
Lens distortion / Manufacturing defects
Effects of field of view
Perspective Foreshortening due to camera tilt
Depth field cameras
DISCLAIMER: There are multiple ways to solve this problem, but I chose this method, and I highly recommend exploring further to see what you can come up with. You can basically extend this idea to measure pretty much any object using a smartphone (given the kinds of images a normal smartphone can take).
(Please don't try to measure the size of an amoeba with this. It simply won't work, but you can certainly use some of the advice I have given to your advantage.)
If you have cool ideas or issues with my answer, please feel free to let me know; I would love to have a discussion. Feel free to correct me if I have made any mistakes or misunderstood any of these concepts.
Final Note:
No matter how hard you try, you cannot make something like a smartphone work and behave like a camera sensor that is specifically designed to take images for measurement purposes. A smartphone can never beat those, but we can certainly push the smartphone camera toward similar results up to a certain degree. Keep this in mind; I learnt it the hard way.
I have signals (of a person climbing stairs) of the following nature. This signal contains 38K+ samples over a period of 6 minutes of stair ascent. The parts with low-frequency noise are the times when the person takes a turn to get to the next flight of stairs (and hence do not count as stair ascent).
Figure 1
This is why I need to get rid of it for my deep learning model, which only accepts the stair-ascent data. Essentially, I only need the high-frequency regions where the person is climbing stairs. I could eliminate it manually, but that would take a lot of time since there are 58 such signals.
My approach was to modulate this signal with a square wave that is 0 in the low-frequency regions and 1 in the high-frequency regions and then multiply the two signals together. But the problem is: how do I create such a square-wave signal that detects the high- and low-frequency regions on its own?
I tried enveloping the signal (using MATLAB's envelope rms function) and I got the following result:
Figure 2
As you can see, the RMS envelope follows the signal quite well. But I am stuck on how to create a modulating square wave from it (essentially, what I am asking for is a variable pulse-width modulated waveform).
PS: I have considered using a high-pass filter, but this won't work because there are some low-frequency components in the high-frequency stair-climbing regions which I cannot afford to remove. I have also thought of using some form of rising/falling edge detection (on the RMS envelope) but have found no practical way of implementing it. Please advise.
Thank you for your help in advance,
Shreya
Thanks to David for his thresholding suggestion. Applying it to my dataset gives the results below, though I am again stuck trying to get rid of the redundant peaks between zeros (see image below). What do I do next?
Figure 3
I think I have been able to solve my problem of isolating the "interesting" part of the waveform from the original waveform using the following procedure (for the reader's future reference):
A non-uniform waveform such as the one in Figure 1 can be passed through MATLAB's envelope(...,'rms') function to obtain the orange curve shown in Figure 2. I then filtered this RMS envelope using MATLAB's idfilt function, which got rid of the unwanted spikes (between zeros) occurring between the "interesting" parts of the waveform. Then, using thresholding, I converted this envelope to 1 at the "interesting" parts and 0 at the "uninteresting" parts, giving me a pulse-width-modulated square wave that follows only the "interesting" parts of the original waveform (Figure 1). Finally, I multiplied this square wave with the original signal and was able to filter out the "uninteresting" parts, as demonstrated in Figure 4.
Figure 4
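For readers without MATLAB, here is a rough Python/SciPy analogue of the same procedure (envelope, smooth, threshold, gate); the window length, cutoff, and threshold below are guesses and would need tuning to the actual recordings.

import numpy as np
from scipy.signal import butter, filtfilt

def gate_interesting(signal, fs, rms_win_s=1.0, cutoff_hz=0.2, threshold=None):
    win = int(rms_win_s * fs)
    # Moving-RMS envelope (rough stand-in for MATLAB's envelope(x, win, 'rms')).
    rms_env = np.sqrt(np.convolve(signal**2, np.ones(win) / win, mode="same"))
    # Low-pass the envelope to suppress short spikes between activity bursts.
    b, a = butter(2, cutoff_hz / (fs / 2), btype="low")
    smooth_env = filtfilt(b, a, rms_env)
    # Threshold into a 0/1 "square wave" and gate the original signal with it.
    if threshold is None:
        threshold = 0.5 * smooth_env.max()
    gate = (smooth_env > threshold).astype(float)
    return signal * gate, gate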
Thank You all for your help! This thread is now resolved!
I know the basic flow or process of image registration/alignment, but what happens at the pixel level when two images are registered/aligned? Similar pixels of the moving image that are transformed to the fixed image are kept intact, but what happens to the pixels that are not matched? Are they averaged, or something else?
And how is the correct transformation estimated, i.e. how will I know whether to apply translation, scaling, rotation, etc., and by how much (what angle of rotation, what translation values, etc.)?
Also, in the initial step, how are similar pixel values identified and matched?
I've implemented the python code given in https://simpleitk.readthedocs.io/en/master/Examples/ImageRegistrationMethod1/Documentation.html
The input images are prostate MRI scans (screenshots of the fixed image, moving image, output image, and console output were attached).
The difference can be seen at the top right and top left of the output image, but I can't interpret the console output or understand how things actually work internally.
A deeper explanation would be very helpful. Thank you.
A transformation is applied to all pixels. You might be confusing rigid transformations, which will only translate, rotate and scale your moving image to match the fixed image, with elastic transformations, which will also allow some morphing of the moving image.
Any pixel that a transformation cannot place in the fixed image is interpolated from the pixels that it is able to place, though a registration is not really intelligent.
What it attempts to do is simply reduce a cost function, where a high cost is associated with a large difference and a low cost with a small difference. Cost functions can be intensity based (pixel values) or feature based (shapes). It will (semi-)randomly shift the image around until a preset criterion is met, generally a maximum number of iterations.
What that might look like can be seen in the following gif:
http://insightsoftwareconsortium.github.io/SimpleITK-Notebooks/registration_visualization.gif
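To make the iteration loop concrete, here is roughly what the SimpleITK example linked in the question does (a sketch following that documentation page; the file names and optimizer settings are illustrative):

import SimpleITK as sitk

# Placeholder file names; use your own fixed and moving prostate MRI slices.
fixed = sitk.ReadImage("fixed.png", sitk.sitkFloat32)
moving = sitk.ReadImage("moving.png", sitk.sitkFloat32)

R = sitk.ImageRegistrationMethod()
R.SetMetricAsMeanSquares()                                   # intensity-based cost function
R.SetOptimizerAsRegularStepGradientDescent(4.0, 0.01, 200)   # step size, min step, max iterations
R.SetInitialTransform(sitk.TranslationTransform(fixed.GetDimension()))
R.SetInterpolator(sitk.sitkLinear)

out_tx = R.Execute(fixed, moving)        # iteratively lowers the cost
print(out_tx)                            # the estimated transform parameters
print("final metric value:", R.GetMetricValue())

# Applying the transform maps every moving-image pixel into the fixed image's
# grid and interpolates; nothing is averaged between the two images.
resampled = sitk.Resample(moving, fixed, out_tx, sitk.sitkLinear, 0.0, moving.GetPixelID())

The console output in the question is essentially that optimizer loop reporting the metric value and the current transform parameters at each iteration, plus the stopping condition.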
I have a song and I'd like to use Python to analyze it.
I need to find the "major sounds" in the song.
I use this term because I don't know the technical term for it, but here is what I mean:
https://www.youtube.com/watch?v=TYYyMu3pzL4
If you play only the first second of the song, I count about 4 major sounds.
In general, these are the same sounds that a person would hum if they were humming the song.
What are these called? And is there a function in librosa (or any other library/programming language) that can help me pinpoint their occurrence in a song?
I can provide more info/examples as needed.
UPDATE: After doing more research, I believe I am looking for what is called the "strongest beats". Librosa already has a beat_track function, but I think this gives you every single thing that can be called a beat in the song. I don't really want every beat, just the ones that stand out the most. The over-arching goal here is to create a music video where the major action happening on the screen lines up perfectly with the strongest beats. This creates a synergistic effect within the video - everything feels connected.
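For the "strongest beats" idea in the update, here is a hedged librosa sketch: run beat_track, then keep only the beats whose onset strength is above a chosen percentile. The file name and the 75th-percentile cut-off are arbitrary choices, not an established definition of "strongest".

import numpy as np
import librosa

y, sr = librosa.load("song.mp3")                 # placeholder file name
onset_env = librosa.onset.onset_strength(y=y, sr=sr)
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr, onset_envelope=onset_env)

# Keep only beats whose onset strength is in the top quartile ("strongest").
strengths = onset_env[beat_frames]
strong_beats = beat_frames[strengths > np.percentile(strengths, 75)]
print(librosa.frames_to_time(strong_beats, sr=sr))   # times to line the video cuts up with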
The process of parsing audio to identify its sonic archetypes is usually called acoustic fingerprinting.
Audio has a time dimension, so to witness your "major sounds" you have to listen to the audio for a period of time, across a succession of instantaneous audio samples. Audio can be thought of as a time-series curve where, for each instant in time, you record the height of the audio curve digitized into PCM format. It takes wall-clock time to hear a given "major sound". Here your audio is in its natural state in the time domain. However, the information load of a stretch of audio can be transformed into its frequency-domain counterpart by feeding a window of audio samples into an FFT API call (to take its Fourier transform).
A powerfully subtle aspect of taking the FFT is it removes the dimension of time from the input data and replaces it with a distillation while retaining the input information load. As an aside, if the audio is periodic once transformed from the time domain into its frequency domain representation by applying a Fourier Transform, it can be reconstituted back into the same identical time domain audio curve by applying an inverse Fourier Transform. The data which began life as a curve which wobbles up and down over time is now cast as a spread of frequencies each with an intensity and phase offset yet critically without any notion of time. Now you have the luxury to pluck from this static array of frequencies a set of attributes which can be represented by a mundane struct data structure and yet imbued by its underlying temporal origins.
Here is where you can find your "major sounds". To a first approximation, you simply stow the top X frequencies along with their intensity values, and this becomes a measure of a given stretch of your input audio captured as its "major sound". Once you have a collection of "major sounds", you can identify when any subsequent audio contains an occurrence of one by performing a difference test between your pre-stored set of "major sounds" and the FFT of the current window of audio samples. You have found a match when there is little or no difference between the frequency intensity values of each of those top X frequencies of the current FFT result and a pre-stored "major sound".
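A minimal sketch of that "top X frequencies" signature and the difference test; the window length, X, and tolerance are arbitrary, and a real fingerprinting system would be considerably more robust.

import numpy as np

def major_sound_signature(samples, sr, top_x=5):
    # FFT one window of audio and keep the strongest top_x frequencies
    # with their magnitudes as the "major sound" signature.
    spectrum = np.abs(np.fft.rfft(samples * np.hanning(len(samples))))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sr)
    top = np.sort(np.argsort(spectrum)[-top_x:])      # strongest bins, in frequency order
    return freqs[top], spectrum[top]

def signatures_match(sig_a, sig_b, tol=0.2):
    # Crude difference test: same top frequencies, magnitudes within tol (relative).
    freqs_a, mags_a = sig_a
    freqs_b, mags_b = sig_b
    return (np.array_equal(freqs_a, freqs_b)
            and np.all(np.abs(mags_a - mags_b) <= tol * np.maximum(mags_a, mags_b)))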
I could digress by explaining how by sitting down and playing the piano you are performing the inverse Fourier Transform of those little white and black frequency keys, or by saying the muddied wagon tracks across a spring rain swollen pasture is the Fourier Transform of all those untold numbers of heavily laden market wagons as they trundle forward leaving behind an ever deepening track imprinted with each wagon's axle width, but I won't.
Here are some links to audio fingerprinting
Audio fingerprinting and recognition in Python
https://github.com/worldveil/dejavu
Audio Fingerprinting with Python and Numpy http://willdrevo.com/fingerprinting-and-audio-recognition-with-python/
Shazam-like acoustic fingerprinting of continuous audio streams (github.com) https://news.ycombinator.com/item?id=15809291
https://github.com/dest4/stream-audio-fingerprint
Audio landmark fingerprinting as a Node Stream module - nodejs converts a PCM audio signal into a series of audio fingerprints. https://github.com/adblockradio/stream-audio-fingerprint
https://stackoverflow.com/questions/26357841/audio-matching-audio-fingerprinting
Using the open-source library pylibdmtx, I am able to detect a Data Matrix barcode inside an image. The processing is slower when the barcode is only a small portion of a large image. The library takes a few arguments to shrink the image and detect the barcode.
Here is part of the code in the library:
with libdmtx_decoder(img, shrink) as decoder:
    properties = [
        (DmtxProperty.DmtxPropScanGap, gap_size),
        (DmtxProperty.DmtxPropSymbolSize, shape),
        (DmtxProperty.DmtxPropSquareDevn, deviation),
        (DmtxProperty.DmtxPropEdgeThresh, threshold),
        (DmtxProperty.DmtxPropEdgeMin, min_edge),
        (DmtxProperty.DmtxPropEdgeMax, max_edge)
    ]
My question is: is there any other library to use besides pylibdmtx? Or any suggestion to increase the processing speed without affecting accuracy? By the way, pylibdmtx was updated on 18/1/2017, so it is a maintained library.
An option is to pre-locate the code by image filtering (see the sketch below).
A Data Matrix has high contrast (in theory) and a given cell size. If you shrink the image so that the cells become one or two pixels large, the Data Matrix will stand out as a highly textured area, and the gradient will respond strongly.
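A hedged OpenCV sketch of that pre-location idea: shrink the image, measure local gradient energy, crop the most textured region from the full-resolution image, and only then hand it to pylibdmtx. The shrink factor and window size are guesses and would need tuning.

import cv2
from PIL import Image
from pylibdmtx.pylibdmtx import decode

def locate_and_decode(path, shrink=4, win=15):
    # Pre-locate the highly textured Data Matrix area on a shrunk copy,
    # then decode only that crop at full resolution.
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    small = cv2.resize(gray, None, fx=1.0 / shrink, fy=1.0 / shrink)
    gx = cv2.Sobel(small, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(small, cv2.CV_32F, 0, 1)
    energy = cv2.boxFilter(cv2.magnitude(gx, gy), -1, (win, win))
    _, _, _, max_loc = cv2.minMaxLoc(energy)       # centre of the most textured area
    cx, cy = max_loc[0] * shrink, max_loc[1] * shrink
    half = win * shrink                            # generous crop around that centre
    crop = gray[max(0, cy - half):cy + half, max(0, cx - half):cx + half]
    return decode(Image.fromarray(crop), timeout=300)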
I am also using this library for decoding Data Matrix codes, and here is what I found out about these arguments: timeout is an int value in milliseconds, which really helps speed up decoding; gap_size is the number of pixels between two Data Matrix codes, used when you have more than one Data Matrix in a row to decode with equal gaps; threshold lets you pass a threshold value between 0 and 100 directly to the function without using OpenCV functions; max_count is the number of Data Matrix codes to be decoded in one image; and shape is the Data Matrix size, i.e. 0 for 10x10, 1 for 12x12, and so on.
Using all of these together, we can decode Data Matrix codes quickly and effectively.
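For reference, a minimal call showing those arguments together (the file name and the specific values are placeholders, not recommendations):

from PIL import Image
from pylibdmtx.pylibdmtx import decode

img = Image.open("label.png")   # placeholder file name
# timeout is in milliseconds, shrink downsamples before scanning,
# max_count stops after that many codes, threshold is 0-100
results = decode(img, timeout=300, shrink=2, max_count=1, threshold=50)
for result in results:
    print(result.data, result.rect)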