Can aubio be used to detect rhythm-only segments?

Does aubio have a way to detect sections of a piece of audio that lack tonal elements -- rhythm only? I tested a piece of music that has 16 seconds of rhythm at the start, but all the aubiopitch and aubionotes algorithms seemed to detect tonality during the rhythmic section. Could it be tuned somehow to distinguish tonal from non-tonal onsets? Or is there a related library that can do this?

Been busy the past couple of days - but started looking into this today...
It'll take a while to perfect I guess but I thought I'd give you a few thoughts and some code I've started working on to attack this!
Firstly, pseudo code's a good way to design an initial method.
1/ use import matplotlib.pyplot as plt to spectrum analyse the audio, and plot various fft and audio signals.
2/ import numpy as np for basic array-like structure handling.
(I know this is more than pseudo code, but hey :-)
3/ plt.specgram creates spectral maps of your audio. Apart from the image it creates (which can be used to start to manually deconstruct your audio file), it returns 4 structures.
eg
ffts,freqs,times,img = plt.specgram(signal,Fs=44100)
ffts is a 2-dimensional array in which each column is the FFT (Fast Fourier Transform) of one time section and each row corresponds to one frequency bin.
The plain vanilla specgram analyses time sections of 256 samples long, stepping 128 samples forwards each time.
This gives a very low resolution frequency array at a pretty fast rate.
As musical notes merge into a single sound when played at more or less 10 Hz, I decided to use the specgram options to divide the audio into 4096-sample sections (a frequency resolution of circa 10 Hz), stepping forwards every 2048 samples (i.e. roughly 20 times a second).
This gives a decent frequency resolution, and the time sections being 20th sec apart are faster than people can perceive individual notes.
This means calling the specgram as follows:
plt.specgram(signal,Fs=44100,NFFT=4096,noverlap=2048,mode='magnitude')
(Note the mode - this seems to give me amplitudes of between 0 and 0.1. I have a problem with fft not giving me amplitudes of the same scale as the audio signal (you may have seen the question I posted). But here we are...)
4/ Next I decided to get rid of noise in the ffts returned. This means we can concentrate on freqs of a decent amplitude, and zero out the noise which is always present in ffts (in my experience).
Here is (are) my function(s):
def gate(signal,minAmplitude):
    return np.array([int((((a-minAmplitude)+abs(a-minAmplitude))/2) > 0) * a for a in signal])
Looks a bit crazy - and I'm sure a proper mathematician could come up with something more efficient - but this is the best I could invent. It zeros any frequencies of amplitude less than minAmplitude.
This is the relevant code to call it on the ffts returned by plt.specgram. My own function is more involved, as it is part of a class and references other functions, but this should be enough:
def fft_noise_gate(ffts, minAmplitude=0.001, check=True):
    '''
    zero the amplitudes of frequencies
    with amplitudes below minAmplitude
    across ffts (as returned by plt.specgram)
    check - plot middle fft just because!
    '''
    nffts = ffts.shape[1]
    gated_ffts = []
    for f in range(nffts):
        fft = ffts[..., f]
        # Anyone got a more efficient noise gate formula? Best I could think up!
        fft_gated = gate(fft, minAmplitude)
        gated_ffts.append(fft_gated)
    # note: the result has one row per time section (transposed relative to the input)
    gated = np.array(gated_ffts)
    if check:
        # plot middle fft just to see!
        plt.plot(gated[nffts // 2])
        plt.show(block=False)
    return gated
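On the question of a more efficient noise gate: a possible vectorized version, as a sketch only. It assumes the same rule as gate above (keep a value only if it is strictly greater than the threshold) and applies it to the whole 2D array at once; the name gate_vectorized is made up for this example.

```python
import numpy as np

def gate_vectorized(ffts, min_amplitude):
    """Zero every value that is not above min_amplitude; keep the rest unchanged."""
    ffts = np.asarray(ffts, dtype=float)
    return np.where(ffts > min_amplitude, ffts, 0.0)

# tiny made-up "spectrogram": 2 frequency bins x 2 time sections
spectrum = np.array([[0.0005, 0.02],
                     [0.3, 0.0009]])
gated = gate_vectorized(spectrum, 0.001)
```

Unlike the loop version, this keeps the shape of the input (rows are still frequency bins), so no transposition happens.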
This should give you a start. I'm still working on it and will get back to you when I've got further - but if you have any ideas, please share them.
Any way my strategy from here is to:
1/ find the peaks ( ie start of any sounds) then
2/ Look for ranges of frequencies which rise and fall in unison (ie make up a sound).
And
3/ Differentiate them into individual instruments (sound sources more specifically), and plot the times and amplitudes thereof to create your analysis (score).
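Step 1 of that strategy could be sketched with spectral flux, a common onset measure that sums how much each frequency bin rises from one time section to the next. This is a swapped-in technique for illustration, not the code from the answer; the function names and the threshold are assumptions, and the input layout (rows = frequency bins, columns = time sections) follows what plt.specgram returns.

```python
import numpy as np

def spectral_flux(mags):
    """Sum of the positive bin-wise increases between consecutive time sections."""
    diff = np.diff(mags, axis=1)          # change from section t to t+1
    rising = np.maximum(diff, 0.0)        # ignore bins that got quieter
    return rising.sum(axis=0)

def pick_onsets(flux, threshold):
    """Return section indices where the flux is a local maximum above threshold."""
    onsets = []
    for i in range(len(flux)):
        left = flux[i - 1] if i > 0 else -np.inf
        right = flux[i + 1] if i < len(flux) - 1 else -np.inf
        if flux[i] > threshold and flux[i] >= left and flux[i] > right:
            onsets.append(i + 1)          # +1 because diff shifts indices by one
    return onsets

# toy magnitudes: one sound starts at section 2, another at section 4
mags = np.array([[0, 0, 1, 1, 0, 0],
                 [0, 0, 1, 1, 0, 0],
                 [0, 0, 0, 0, 2, 2]], dtype=float)
flux = spectral_flux(mags)
onsets = pick_onsets(flux, threshold=0.5)
```

On real audio the threshold would need tuning, and the gated ffts from above are a reasonable input.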
Hope you're having fun with it - I know I am.
As I said any thoughts...
Regards
Tony

Use a spectrum analyser to detect sections with high amplitude. If you program, you could take each section and average the frequencies (and amplitudes) present to give you an idea of the instrument(s) involved in creating that amplitude peak.
Hope that helps - if you're using python I could give you some pointers how to program this!?
Regards
Tony

Related

Head related impulse response for binaural audio

I am working with audio digital signal processing and binaural audio processing.
I am still learning the basics.
Right now, the idea is to do deconvolution and get an impulse response.
Please see the attached screenshot
Detailed description of what is happening:
Here, an exponential sweep signal is played back through a loudspeaker and the playback is recorded with a microphone. The recorded signal is extended by zero padding (probably to double the original length), and the original exponential sweep signal is extended in the same way. FFTs are taken of both (the extended recording and the extended original), the two FFTs are divided, and we get the room transfer function. Finally, an inverse FFT is taken and some windowing is performed to get the impulse response.
My question:
I am having difficulty implementing this diagram in python. How would you divide two FFT's? Is it possible? I can probably do all steps like zero padding and fft's, but I guess I am not going the correct way. I do not understand the windowing and discarding second half option.
Please can anyone with his/her knowledge show me how would I implement this in python with sweep signal? Just a small example would also help to get an idea with few plots. Please help.
Source of this image: http://www.four-audio.com/data/MF/aes-swp-english.pdf
Thanks in advance,
Sanket Jain

Yes, dividing two FFT spectra is possible and actually quite easy to implement in Python (but with some caveats).
Simply put: as convolution of two time signals corresponds to multiplying their spectra, deconvolution can conversely be realized by dividing the spectra.
Here is an example for a simple deconvolution with numpy:
(x is your excitation sweep signal and y is the recorded sweep signal, from which you want to obtain the impulse response.)
import numpy as np
from numpy.fft import rfft, irfft
# define length of FFT (zero padding): at least double length of input
input_length = np.size(x)
n = np.ceil(np.log2(input_length)) + 1
N_fft = int(pow(2, n))
# transform
# real fft: N real input -> N/2+1 complex output (single sided spectrum)
# real ifft: N/2+1 complex input -> N real output
X_f = rfft(x, N_fft)
Y_f = rfft(y, N_fft)
# deconvolve
H = Y_f / X_f
# backward transform
h = irfft(H, N_fft)
# truncate to original length
h = h[:input_length]
This simple solution is a practical one but can (and should) be improved. A problem is that you will get a boost of the noise floor at those frequencies where X_f has a low amplitude. For example, if your exponential sine sweep starts at 100 Hz, then for the frequency bins below that frequency you get a division by (almost) zero. One simple possible solution is to first invert X_f, apply a band-limit filter (highpass + lowpass) to remove the "boost areas", and then multiply the result with Y_f:
# deconvolve
Xinv_f = 1 / X_f
Xinv_f = Xinv_f * bandlimit_filter
H = Y_f * Xinv_f
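The snippet above leaves bandlimit_filter undefined. One way it might be built is as a brick-wall mask over the rfft bins, shown here as a sketch on synthetic data. The sample rate, sweep range, delay and gain are all assumptions chosen for the demo, and a real measurement would usually want smoother filter edges than a hard 0/1 mask.

```python
import numpy as np
from numpy.fft import rfft, irfft, rfftfreq

fs = 8000                       # sample rate in Hz (assumed for the demo)
f_lo, f_hi = 100.0, 3000.0      # sweep range in Hz (assumed)
duration = 1.0                  # sweep length in seconds (assumed)
t = np.arange(int(fs * duration)) / fs

# exponential sine sweep from f_lo to f_hi
k = np.log(f_hi / f_lo)
x = np.sin(2 * np.pi * f_lo * duration / k * (np.exp(t / duration * k) - 1.0))

# simulated "recording": the sweep delayed by 40 samples and attenuated
delay = 40
h_true = np.zeros(64)
h_true[delay] = 0.5
y = np.convolve(x, h_true)

n_fft = 2 ** int(np.ceil(np.log2(len(y))))
X_f = rfft(x, n_fft)
Y_f = rfft(y, n_fft)

# brick-wall band limit: 1 inside the sweep range, 0 outside
freqs = rfftfreq(n_fft, d=1.0 / fs)
bandlimit_filter = ((freqs >= f_lo) & (freqs <= f_hi)).astype(float)

# invert only inside the band, so near-zero bins of X_f are never divided by
in_band = bandlimit_filter > 0
Xinv_f = np.zeros_like(X_f)
Xinv_f[in_band] = 1.0 / X_f[in_band]

H = Y_f * Xinv_f
h = irfft(H, n_fft)
peak_index = int(np.argmax(np.abs(h)))   # position of the main response
```

The recovered h is a band-limited version of the true response, so the single tap shows up as a short ringing pulse centred on the delay.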
Regarding the distortion:
A nice property of the exponential sine sweep is that harmonic distortion produced during the measurement (e.g. by nonlinearities in the loudspeaker) will produce smaller "side" responses before the "main" response after deconvolution (see this for more details). These side responses are the distortion products and can simply be removed by a time window. If there is no delay of the "main" response (it starts at t=0), those side responses will appear at the end of the whole iFFT, so you remove them by windowing out the second half.
I cannot guarantee that this is 100% correct from a signal-theory point of view, but I think it shows the point and it works ;)
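The "windowing out the second half" step described above could look roughly like this. It is a sketch under assumptions: the helper name keep_first_half and the 32-sample raised-cosine fade are made up for illustration, and the fade is only there to avoid a hard edge at the cut.

```python
import numpy as np

def keep_first_half(h, fade_len=32):
    """Keep the first half of h, fade out just before the midpoint, zero the rest."""
    h = np.array(h, dtype=float)
    half = len(h) // 2
    window = np.zeros(len(h))
    window[:half] = 1.0
    # raised-cosine fade from 1 down to 0 over the last fade_len kept samples
    fade = 0.5 * (1.0 + np.cos(np.linspace(0.0, np.pi, fade_len)))
    window[half - fade_len:half] = fade
    return h * window

# stand-in for an impulse response returned by the inverse FFT
h_full = np.ones(256)
h_windowed = keep_first_half(h_full)
```

This assumes the main response starts at t=0; with a delayed main response the cut point would have to move accordingly.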

This is a little over my head, but maybe the following bits of advice can help.
First, I came across a very helpful amount of sample code presented in Steve Smith's book The Scientist and Engineer's Guide to Digital Signal Processing. This includes a range of operations, from the basics of convolution to the FFT algorithm itself. The sample code is in BASIC, not Python. But the BASIC is perfectly readable, and should be easy to translate.
I'm not entirely sure about the specific calculation you describe, but many operations in this realm (when dealing with multiple signals) turn out to simply employ addition or subtraction of constituent elements. To get an authoritative answer, I think you will have better luck at Stack Overflow's Signal Processing forum or at one of the forums at DSP Related.
If you do get an answer elsewhere, it might be good to either recap it here or delete this question entirely to reduce clutter.

Optimising parameters for finding peaks in 1D array

I need to optimise a method for finding the number of data peaks in a 1D array. The data is a time-series of the amplitude of a wav file.
I have the code implemented already:
from scipy.io.wavfile import read
from scipy.signal import find_peaks
_, amplitudes = read('audio1.wav')
indexes, _ = find_peaks(amplitudes, height=80)
print(f'Number of peaks: {len(indexes)}')
When plotted, the data looks like this:
General scale
The 'peaks' that I am interested in are clear to the human eye - there are 23 in this particular dataset.
However, because the array is so large, the data is extremely variant within the peaks that are clear at a general scale (hence the many hundreds of peaks labelled with blue crosses):
Zoomed in view of one peak
Peak-finding questions have been asked many times before (I've been through a lot of them!) - but I can't find any help or explanation of optimising the parameters for finding only the peaks I want. I know a little about Python, but am blind when it comes to mathematical analysis!
Analysing by width seems useless because, as per the second image, the peaks clear at a large scale are actually interspersed with 'silent' ranges. Distance is not helpful because I do not know how close the peaks will be in other wav files. Prominence has been suggested as the best method but I could not get the results I needed; threshold likewise. I have also tried decimating the signal, smoothing the signal with a Savitzky-Golay filter, and different combinations of parameters and values, all with inaccurate results.
Height alone has been useful because I can see from the charts that peaks always reach above 80.

This is a common task in audio processing and there are several approaches which totally depend on your data.
However, there are implementations out there which are used for finding peaks in novelty functions (e.g., the output from a beat tracker). Try these ones:
https://madmom.readthedocs.io/en/latest/modules/features/onsets.html#madmom.features.onsets.peak_picking
https://librosa.github.io/librosa/generated/librosa.util.peak_pick.html#librosa.util.peak_pick
Basically they implement the same method but there might be differences in the details.
Furthermore, you could check whether you really need to work at this high sampling frequency. Try downsampling the signal or using a moving average filter.
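The moving-average suggestion might be sketched as follows: smooth the absolute amplitude into an envelope, then count how often the envelope rises above a height. The window length, the height and the toy signal are assumptions for the demo, and count_bursts is a made-up helper, not part of scipy or librosa.

```python
import numpy as np

def count_bursts(amplitudes, window, height):
    """Count rising crossings of `height` by the moving-average envelope."""
    kernel = np.ones(window) / window
    envelope = np.convolve(np.abs(amplitudes), kernel, mode="same")
    above = envelope > height
    # a burst starts wherever the envelope goes from below to above the height
    rising = np.flatnonzero(above[1:] & ~above[:-1]) + 1
    n_bursts = len(rising) + (1 if above[0] else 0)
    return n_bursts, envelope

# toy signal: three oscillating bursts separated by silence
signal = np.zeros(1000)
for start in (100, 400, 700):
    signal[start:start + 100] = 100 * np.sin(2 * np.pi * np.arange(100) / 10)

n_bursts, envelope = count_bursts(signal, window=50, height=30)
```

The window should be longer than the "silent" ranges inside one audible peak and shorter than the gaps between peaks, which may take some experimenting per recording.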

Look at 0d persistent homology to find a good strategy, where the parameter you can optimize for is peak persistence. A nice blog post here explains the basics.
But in short the idea is to imagine your graph being filled by water, and then slowly draining the water. Every time a piece of the graph comes above water a new island is born. When two islands are next to each other they merge, which causes the younger island (with the lower peak) to die.
Then each data point has a birth time and a death time. The most significant peaks are those with the longest persistence, which is death - birth.
If the water level drops at a continuous rate, then the persistence is defined in terms of peak height. Another possibility is to drop the water instantaneously from point to point as time goes from step t to step t+1, in which case the persistence is defined as peak width in terms of signal samples.
For you it seems that using the original definition in terms of peak height > 70 finds all peaks you are interested in, albeit possibly too many, clustered together. You can limit this by choosing the first peak in each cluster or the highest peak in each cluster or by doing both approaches and only choosing peaks that have both great height persistence as well as width persistence.
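A compact sketch of the height-based persistence described above, for a 1D array. It visits samples from highest to lowest (the draining water), tracks islands with a union-find structure, and records how long each lower island survives before merging. The function name and the toy data are assumptions; dedicated libraries exist for this, but the idea fits in a few lines.

```python
import numpy as np

def peak_persistence(values):
    """Return {peak_index: persistence} using height-based 0-d persistence."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(-values, kind="stable")   # highest samples surface first
    parent = {}        # union-find parents of samples already above water
    peak_of = {}       # island root -> index of that island's highest sample
    persistence = {}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in map(int, order):
        parent[i] = i
        peak_of[i] = i
        for j in (i - 1, i + 1):
            if j not in parent:
                continue
            ri, rj = find(i), find(j)
            if ri == rj:
                continue
            # the island with the lower peak dies at the current water level
            if values[peak_of[ri]] >= values[peak_of[rj]]:
                survivor, dead = ri, rj
            else:
                survivor, dead = rj, ri
            life = values[peak_of[dead]] - values[i]
            if life > 0:
                persistence[peak_of[dead]] = float(life)
            parent[dead] = survivor

    top = int(order[0])                          # the highest island never dies
    persistence[top] = float(values[top] - values.min())
    return persistence

signal = [0, 3, 1, 5, 2, 4, 0]
result = peak_persistence(signal)
significant = sorted(i for i, p in result.items() if p >= 2)
```

Choosing the persistence cut-off (2 in this toy case) is then the single parameter to tune.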

How to determine the frequency range of interested sound with ambient noise

I'm very new to signal processing. I have two sound signals right now. Each was recorded at a sample rate of 10 kHz for 2 seconds. I have imported the data into Python, so sound_1 and sound_2 are both numpy arrays, each of length 20000.
Sound_1 contains a water flow sound(which I'm interested) and environmental noise(I'm not interested), while sound_2 only contains environment noise(I'm not interested).
I'm looking for an algorithm(or package) which can help me determine the frequency range of this water flow sound. I think if I can find out the frequency range, I can use an inverse Fourier transform to filter the environment noise.
However, my ultimate purpose is to extract the water flow sound from sound_1 data and eliminate environmental noise. It would be great if there are other approaches.
I'm currently looking at this post: Python frequency detection
But I don't understand how they can find out the frequency by only one sound signal. I think we need to compare 2 signal data at least(one contains the sound I am interested, the other doesn't), so we can find out the difference.

Since sound_1 contains both water flow and environmental noise, there's no straightforward way of extracting the water flow. The Fourier transform will get you all frequencies in the signal, irrespective of the source.
The way to approach is get frequencies of environmental noise from sound_2 and then remove them from sound_1. After that is done, you can extract the frequencies from already denoised sound_1.
One of the popular approaches to such noise reduction is spectral gating. Essentially, you first determine what the noise sounds like and then remove its smoothed spectrum from your signal. Smoothing is crucial, as sound is a wave, a continuous entity. If you simply chop out discrete frequencies from the wave, you will get very poor results (audio will sound unnatural and robotic). The amount of smoothing you apply will determine how much noise is reduced (mind that it's never truly removed - you will always get some residue).
To the concrete solution.
As you're new to the subject, I'd recommend first seeing how noise reduction works in software that will do the work for you. Audacity is an excellent choice. I linked the manual for noise reduction, but there are plenty of tutorials out there.
After you know what you want to get, you can either implement spectral gating yourself or use existing package. Audacity has an excellent implementation in C++, but it may prove difficult to a newbie to port. I'd recommend going first with noisereduce package. It's based on Audacity implementation. If you use it, you will be done in a few lines.
Here's a snippet:
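import noisereduce as nr
from scipy.io import wavfile
# load data
rate, data = wavfile.read("sound_1.wav")
# select section of data that is noise (wavfile.read returns (rate, samples))
_, noisy_part = wavfile.read("sound_2.wav")
# perform noise reduction
# note: audio_clip/noise_clip are the argument names of older noisereduce releases;
# newer ones take y=, sr= and y_noise= instead
reduced_noise = nr.reduce_noise(audio_clip=data, noise_clip=noisy_part, verbose=True)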
Now simply run FFT on the reduced_noise to discover the frequencies of water flow.
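To get a rough idea of the frequency range directly from the two recordings, one could also compare their averaged power spectra and look at where sound_1 exceeds sound_2. The sketch below uses synthetic stand-ins for both signals, and the frame length, the 10 dB threshold and the helper names are assumptions for illustration.

```python
import numpy as np

def average_power_spectrum(signal, frame_len, hop):
    """Average the Hann-windowed power spectra of overlapping frames."""
    window = np.hanning(frame_len)
    starts = range(0, len(signal) - frame_len + 1, hop)
    frames = np.array([signal[s:s + frame_len] * window for s in starts])
    return (np.abs(np.fft.rfft(frames, axis=1)) ** 2).mean(axis=0)

def excess_band(sound_1, sound_2, fs, frame_len=1000, threshold_db=10.0):
    """Return (low, high) in Hz where sound_1 exceeds sound_2 by threshold_db."""
    p1 = average_power_spectrum(sound_1, frame_len, frame_len // 2)
    p2 = average_power_spectrum(sound_2, frame_len, frame_len // 2)
    ratio_db = 10 * np.log10((p1 + 1e-12) / (p2 + 1e-12))
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    flagged = freqs[ratio_db > threshold_db]
    if len(flagged) == 0:
        return None
    return float(flagged.min()), float(flagged.max())

# synthetic stand-ins: "water" occupies 1000-2000 Hz, noise is white
fs = 10000
t = np.arange(2 * fs) / fs
rng = np.random.default_rng(0)
water = sum(np.sin(2 * np.pi * f * t) for f in range(1000, 2001, 50))
sound_1 = water + rng.normal(0.0, 1.0, len(t))
sound_2 = rng.normal(0.0, 1.0, len(t))

low, high = excess_band(sound_1, sound_2, fs)
```

On real recordings the result depends heavily on the threshold and on how similar the noise in the two recordings actually is.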
Here's how I am using noisereduce. In this part I am determining the frequency statistics.

Why should I discard half of what a FFT returns?

Looking at this answer:
Python Scipy FFT wav files
The technical part is obvious and working, but I have three theoretical questions (the code mentioned is below):
1) Why do I have to normalize (b=...) the frames? What would happen if I used the raw data?
2) Why should I only use half of the FFT result (d=...)?
3) Why should I abs(c) the FFT result?
Perhaps I'm missing something due to inadequate understanding of WAV format or FFT, but while this code works just fine, I'd be glad to understand why it works and how to make the best use of it.
Edit: in response to the comment by @Trilarion:
I'm trying to write a simple, not 100% accurate but more like a proof-of-concept Speaker Diarisation in Python. That means taking a wav file (right now I am using this one for my tests) and in each second (or any other resolution) say if the speaker is person #1 or person #2. I know in advance that these are 2 persons and I am not trying to link them to any known voice signatures, just to separate. Right now take each second, FFT it (and thus get a list of frequencies), and cluster them using KMeans with the number of clusters between 2 and 4 (A, B [,Silence [,A+B]]).
I'm still new to analyzing wav files and audio in general.
import matplotlib.pyplot as plt
import scipy.fftpack as sfft # provides the fft used below
from scipy.io import wavfile # get the api
fs, data = wavfile.read('test.wav') # load the data
a = data.T[0] # this is a two channel soundtrack, I get the first track
b=[(ele/2**8.)*2-1 for ele in a] # this is 8-bit track, b is now normalized on [-1,1)
c = sfft.fft(b) # create a list of complex number
d = len(c)//2 # you only need half of the fft list (integer division so d can be used as an index)
plt.plot(abs(c[:(d-1)]),'r')
plt.show()

To address these in order:
1) You don't need to normalize, but without normalization the input is close to the raw structure of the digitized waveform, so the numbers are unintuitive. For example, how loud is a value of 67? It's easier to normalize it to be in the range -1 to 1 to interpret the values. (But if you wanted to implement a filter, for example, where you did an FFT, modified the FFT values, followed by an IFFT, normalizing would be an unnecessary hassle.)
2) and 3) are similar in that they both have to do with the math living primarily in the complex numbers space. That is, FFTs take a waveform of complex numbers (eg, [.5+.1j, .4+.7j, .4+.6j, ...]) to another sequence of complex numbers.
So in detail:
2) It turns out that if the input waveform is real instead of complex, then the FFT has a symmetry about 0, so only the values that have a frequency >=0 are uniquely interesting.
3) The values output by the FFT are complex, so they have a Re and Im part, but this can also be expressed as a magnitude and phase. For audio signals, it's usually the magnitude that's the most interesting, because this is primarily what we hear. Therefore people often use abs (which is the magnitude), but the phase can be important for different problems as well.
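Points 2) and 3) can be checked numerically. The sketch below uses a tiny made-up real signal (one sine cycle over 8 samples) to show that the second half of the FFT mirrors the first as complex conjugates, that numpy's rfft returns just the non-redundant half, and that abs gives the magnitudes.

```python
import numpy as np

n = np.arange(8)
x = np.sin(2 * np.pi * 1 * n / 8)      # real input: one cycle over 8 samples

c = np.fft.fft(x)                      # full complex spectrum, 8 values

# for real input, bin k and bin N-k are complex conjugates of each other
mirror_ok = np.allclose(c[1:], np.conj(c[1:][::-1]))

# the non-redundant half (bins 0..N/2) is exactly what rfft returns
half = c[:len(c) // 2 + 1]
same_as_rfft = np.allclose(half, np.fft.rfft(x))

# magnitude discards the phase; the sine shows up as a single peak at bin 1
magnitudes = np.abs(half)
peak_bin = int(np.argmax(magnitudes))
```

For an inverse transform the phase still matters, so either keep the full complex output or use rfft together with irfft.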

That depends on what you're trying to do. It looks like you're only looking to plot the spectral density and then it's OK to do so.
In general, each coefficient of the DFT depends on the phase at that frequency, so if you want to keep the phase information you have to keep the argument of the complex numbers.
The symmetry you see is only guaranteed if the input is a real-valued sequence (IIRC). It's related to the mirroring distortion you'll get if you have frequencies above the Nyquist frequency (half the sampling frequency): the original frequency shows up in the DFT, but so does the mirrored frequency.
If you're going to inverse DFT you should keep the full data and also keep the arguments of the DFT-coefficients.

How to convert a pitch track from a melody extraction algorithm to a humming like audio signal

As part of a fun-at-home-research-project, I am trying to find a way to reduce/convert a song to a humming like audio signal (the underlying melody that we humans perceive when we listen to a song). Before I proceed any further in describing my attempt on this problem, I would like to mention that I am totally new to audio analysis though I have a lot of experience with analyzing images and videos.
After googling a bit, I found a bunch of melody extraction algorithms. Given a polyphonic audio signal of a song (ex: .wav file), they output a pitch track --- at each point in time they estimate the dominant pitch (coming from a singer's voice or some melody generating instrument) and track the dominant pitch over time.
I read a few papers, and they seem to compute a short time Fourier transform of the song, and then do some analysis on the spectrogram to get and track the dominant pitch. Melody extraction is only a component in the system I am trying to develop, so I don't mind using any algorithm that's available as far as it does a decent job on my audio files and the code is available. Since I am new to this, I would be happy to hear any suggestions on which algorithms are known to work well and where can I find its code.
I found two algorithms:
Yaapt pitch tracking
Melodia
I chose Melodia as the results on different music genres looked quite impressive. Please check this to see its results. The humming that you hear for each piece of music is essentially what I am interested in.
"It is the generation of this humming for any arbitrary song, that I want your help with in this question".
The algorithm (available as a vamp plugin) outputs a pitch track --- [time_stamp, pitch/frequency] --- an Nx2 matrix where the first column is the time-stamp (in seconds) and the second column is the dominant pitch detected at the corresponding time-stamp. Shown below is a visualization of the pitch track obtained from the algorithm, overlaid in purple on a song's time-domain signal (above) and its spectrogram/short-time Fourier transform. Negative values of pitch/frequency represent the algorithm's dominant pitch estimate for un-voiced/non-melodic segments. So all pitch estimates >= 0 correspond to the melody; the rest are not important to me.
Now I want to convert this pitch track back to a humming like audio signal -- just like the authors have it on their website.
Below is a MATLAB function that I wrote to do this:
function [melSignal] = melody2audio(melody, varargin)
% melSignal = melody2audio(melody, Fs, synthtype)
% melSignal = melody2audio(melody, Fs)
% melSignal = melody2audio(melody)
%
% Convert melody/pitch-track to a time-domain signal
%
% Inputs:
%
% melody - [time-stamp, dominant-frequency]
% an Nx2 matrix with time-stamp in the
% first column and the detected dominant
% frequency at corresponding time-stamp
% in the second column.
%
% synthtype - string to choose synthesis method
% passed to synth function in synth.m
% current choices are: 'fm', 'sine' or 'saw'
% default='fm'
%
% Fs - sampling frequency in Hz
% default = 44.1e3
%
% Output:
%
% melSignal -- time-domain representation of the
% melody. When you play this, you
% are supposed to hear a humming
% of the input melody/pitch-track
%
p = inputParser;
p.addRequired('melody', @isnumeric);
p.addParamValue('Fs', 44100, @(x) isnumeric(x) && isscalar(x));
p.addParamValue('synthtype', 'fm', @(x) ismember(x, {'fm', 'sine', 'saw'}));
p.addParamValue('amp', 60/127, @(x) isnumeric(x) && isscalar(x));
p.parse(melody, varargin{:});
parameters = p.Results;
% get parameter values
Fs = parameters.Fs;
synthtype = parameters.synthtype;
amp = parameters.amp;
% generate melody
numTimePoints = size(melody,1);
endtime = melody(end,1);
melSignal = zeros(1, ceil(endtime*Fs));
h = waitbar(0, 'Generating Melody Audio' );
for i = 1:numTimePoints
% frequency
freq = max(0, melody(i,2));
% duration
if i > 1
n1 = floor(melody(i-1,1)*Fs)+1;
dur = melody(i,1) - melody(i-1,1);
else
n1 = 1;
dur = melody(i,1);
end
% synthesize/generate signal of given freq
sig = synth(freq, dur, amp, Fs, synthtype);
N = length(sig);
% augment note to whole signal
melSignal(n1:n1+N-1) = melSignal(n1:n1+N-1) + reshape(sig,1,[]);
% update status
waitbar(i/size(melody,1));
end
close(h);
end
The underlying logic behind this code is the following: at each time-stamp, I synthesize a short-lived wave (say a sine-wave) with frequency equal to the detected dominant pitch/frequency at that time-stamp for a duration equal to its gap with the next time-stamp in the input melody matrix. I only wonder if I am doing this right.
Then I take the audio signal I get from this function and play it with the original song (melody on the left channel and original song on the right channel). Though the generated audio signal seems to segment the melody-generating sources (voice/lead instrument) fairly well -- it's active where the voice is and zero everywhere else --- the signal itself is far from being the humming that the authors show on their website (I get something like beep beep beeeeep beep beeep beeeeeeeep). Specifically, below is a visualization showing the time-domain signal of the input song at the bottom and the time-domain signal of the melody generated using my function.
One main issue is -- though I am given the frequency of the wave to generate at each time-stamp and also the duration, I don't know how to set the amplitude of the wave. For now, I set the amplitude to a flat/constant value, and I suspect this is where the problem is.
Does anyone have any suggestions on this? I welcome suggestions in any program language (preferably MATLAB, python, C++), but I guess my question here is more general --- How to generate the wave at each time-stamp?
A few ideas/fixes in my mind:
Set the amplitude by getting an averaged/max estimate of the amplitude from the time-domain signal of the original song.
Totally change my approach --- compute the spectrogram/short-time Fourier transform of the song's audio signal, zero out (a hard cut-off) or attenuate (a soft cut-off) all frequencies except the ones in, or close to, my pitch track, and then compute the inverse short-time Fourier transform to get the time-domain signal.

If I understand correctly, you seem to already have an accurate representation of the pitch but your problem is that what you generate just doesn't "sound good enough".
Starting with your second approach: filtering out anything but the pitch isn't going to lead to anything good. By removing everything but a few frequency bins corresponding to your local pitch estimates, you will lose the texture of the input signal, what makes it sound good. In fact, if you took that to an extreme and removed everything but the one sample corresponding to the pitch and took an ifft, you would get exactly a sinusoid, which is what you are doing currently.
If you wanted to do this anyway, I recommend you perform all of this by just applying a filter to your temporal signal rather than going in and out of the frequency domain, which is more expensive and cumbersome. The filter would have a small cutoff around the frequency you want to keep and that would allow for a sound with better texture as well.
However, if you already have pitch and duration estimates that you are happy with but you want to improve on the sound rendering, I suggest that you just replace your sine waves--which will always sound like silly beep-beep no matter how much you massage them--with some actual humming (or violin or flute or whatever you like) samples for each frequency in the scale. If memory is a concern or if the songs you represent do not fall into a well-tempered scale (thinking of Middle Eastern songs, for example), instead of having a humming sample for each note of the scale, you could have humming samples for only a few frequencies. You would then derive the humming sounds at any frequency by doing a sample rate conversion from one of these humming samples. Having a few samples to pick from for the sample rate conversion would allow you to pick the one that leads to the "best" ratio with the frequency you need to produce, since the complexity of sample rate conversion depends on that ratio. Obviously, adding a sample rate conversion would be more work and more computationally demanding compared to just having a bank of samples to pick from.
Using a bank of real samples will make a big difference in the quality of what you render. It will also allow you to have realistic attacks for each new note you play.
Then yes, like you suggest, you may want to also play with the amplitude by following the instantaneous amplitude of the input signal to produce a more nuanced rendering of the song.
Last, I would also play with the duration estimates you have so that you have smoother transitions from one sound to the next. Guessing from your performance of your audio file that I enjoyed very much (beep beep beeeeep beep beeep beeeeeeeep) and the graph that you display, it looks like you have many interruptions inserted in the rendering of your song. You could avoid this by extending the duration estimates to get rid of any silence that is shorter than, say, 0.1 second. That way you would preserve the real silences from the original song but avoid cutting off each note of your song.
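Getting rid of silences shorter than 0.1 second might be sketched like this on a pitch track of (time, frequency) pairs, where a frequency of 0 or below means "no melody". The helper name, the choice to fill a short gap with the previous pitch, and the toy data are assumptions for illustration.

```python
import numpy as np

def bridge_short_gaps(times, freqs, max_gap=0.1):
    """Fill unvoiced runs no longer than max_gap seconds with the preceding pitch."""
    times = np.asarray(times, dtype=float)
    freqs = np.array(freqs, dtype=float)
    voiced = freqs > 0
    n = len(freqs)
    i = 0
    while i < n:
        if voiced[i]:
            i += 1
            continue
        j = i
        while j < n and not voiced[j]:
            j += 1
        # frames i..j-1 are silent; bridge only gaps with voiced frames on both sides
        if i > 0 and j < n and (times[j] - times[i]) <= max_gap:
            freqs[i:j] = freqs[i - 1]
        i = j
    return freqs

times = np.arange(10) * 0.05                       # one estimate every 50 ms
track = [440, 440, 0, 440, 0, 0, 0, 0, 440, 440]   # one short gap, one long gap
bridged = bridge_short_gaps(times, track)
```

The long gap is left alone, so genuine rests in the song survive.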

Though I don't have access to your synth() function, based on the parameters it takes I'd say your problem is because you're not handling the phase.
That is - it is not enough to concatenate waveform snippets together, you must ensure that they have continuous phase. Otherwise, you're creating a discontinuity in the waveform every time you concatenate two waveform snippets. If this is the case, my guess is that you're hearing the same frequency all the time and that it sounds more like a sawtooth than a sinusoid - am I right?
The solution is to set the starting phase of snippet n to the end phase of snippet n-1. Here's an example of how you would concatenate two waveforms with different frequencies without creating a phase discontinuity:
fs = 44100; % sampling frequency
f1 = 440; f2 = 660; % example frequencies in Hz (placeholder values, not from the original post)
% synthesize a cosine waveform with frequency f1 and starting additional phase p1
p1 = 0;
dur1 = 1;
t1 = 0:1/fs:dur1;
x1(1:length(t1)) = 0.5*cos(2*pi*f1*t1 + p1);
% Compute the phase at the end of the waveform
p2 = mod(2*pi*f1*dur1 + p1,2*pi);
dur2 = 1;
t2 = 0:1/fs:dur2;
x2(1:length(t2)) = 0.5*cos(2*pi*f2*t2 + p2); % use p2 so that the phase is continuous!
x3 = [x1 x2]; % this should give you a waveform without any discontinuities
Note that whilst this gives you a continuous waveform, the frequency transition is instantaneous. If you want the frequency to gradually change from time_n to time_n+1 then you would have to use something more complex like McAulay-Quatieri interpolation. But in any case, if your snippets are short enough this should sound good enough.
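The same phase-continuity idea can be applied to a whole pitch track at once by accumulating the phase sample by sample, so that no snippet ever restarts at phase zero. This is a Python sketch under assumptions: the function name is hypothetical, the track is held constant between time stamps, and non-positive pitches are rendered as silence.

```python
import numpy as np

def synth_pitch_track(times, freqs, fs=44100, amp=0.5):
    """Render a (time, frequency) track as one phase-continuous sine wave."""
    times = np.asarray(times, dtype=float)
    freqs = np.maximum(np.asarray(freqs, dtype=float), 0.0)
    n_samples = int(np.ceil(times[-1] * fs))
    t = np.arange(n_samples) / fs
    # hold each frequency until the next time stamp
    idx = np.clip(np.searchsorted(times, t, side="right") - 1, 0, len(freqs) - 1)
    f_inst = freqs[idx]
    # running sum of the per-sample phase increments keeps the phase continuous
    phase = 2 * np.pi * np.cumsum(f_inst) / fs
    signal = amp * np.sin(phase)
    signal[f_inst <= 0] = 0.0            # unvoiced frames stay silent
    return signal

fs = 8000
sig = synth_pitch_track([0.0, 0.5, 1.0, 1.5], [440.0, 660.0, -100.0, 0.0], fs=fs)
```

Because the phase never jumps, the step between consecutive samples is bounded by the highest frequency in the track, which is what removes the clicks at snippet boundaries.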
Regarding other comments, if I understand correctly your goal is just to be able to hear the frequency sequence, not for it to sound like the original source. In this case, the amplitude is not that important and you can keep it fixed.
If you wanted to make it sound like the original source that's a whole different story and probably beyond the scope of this discussion.
Hope this answers your question!

You have at least 2 problems.
First, as you surmised, your analysis has thrown away all the amplitude information of the melody portion of the original spectrum. You will need an algorithm that captures that information (and not just the amplitude of the entire signal for polyphonic input, or that of just the FFT pitch bin for any natural musical sounds). This is a non-trivial problem, somewhere between melodic pitch extraction and blind source separation.
Second, sound has timbre, including overtones and envelopes, even at a constant frequency. Your synthesis method is only creating a single sinewave, while humming probably creates a bunch of more interesting overtones, including a lot of higher frequencies than just the pitch. For a slightly more natural sound, you could try analyzing the spectrum of yourself humming a single pitch and try to recreate all those dozens of overtone sine waves, instead of just one, each at the appropriate relative amplitude, when synthesizing each frequency time-stamp in your analysis. You could also look at the amplitude envelope over time of yourself humming one short note, and use that envelope to modulate the amplitude of your synthesizer.
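The overtone-plus-envelope suggestion could be sketched as simple additive synthesis. The relative harmonic amplitudes and the attack/release times below are invented placeholder values; the answer proposes measuring them from a recording of real humming instead.

```python
import numpy as np

def hum_note(freq, dur, fs=44100, harmonic_amps=(1.0, 0.5, 0.25, 0.125),
             attack=0.02, release=0.05):
    """One note built from a few harmonics, shaped by a linear attack/release envelope."""
    n = int(round(dur * fs))
    t = np.arange(n) / fs
    tone = np.zeros(n)
    for k, a in enumerate(harmonic_amps, start=1):
        if k * freq < fs / 2:                     # skip harmonics above Nyquist
            tone += a * np.sin(2 * np.pi * k * freq * t)
    tone /= sum(harmonic_amps)                    # keep the peak at or below 1

    env = np.ones(n)
    n_att = min(int(attack * fs), n // 2)
    n_rel = min(int(release * fs), n // 2)
    if n_att > 0:
        env[:n_att] = np.linspace(0.0, 1.0, n_att, endpoint=False)
    if n_rel > 0:
        env[n - n_rel:] = np.linspace(1.0, 0.0, n_rel)
    return tone * env

note = hum_note(220.0, 0.5, fs=8000)
```

A function like this could stand in for the single sine wave when synthesizing each time-stamp of the pitch track.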

Use libfmp.c8 to sonify the values:
import numpy as np
import pandas as pd
import vamp
import IPython.display as ipd
import libfmp.b
import libfmp.c8
# audio, samplerate and params are assumed to be defined already
# (the loaded song, its sample rate and the Melodia parameters)
data = vamp.collect(audio, samplerate, "mtg-melodia:melodia", parameters=params)
hop, melody = data['vector']
timestamps = np.arange(0, len(melody)) * float(hop)
melody_pos = melody.copy()
melody_pos[melody <= 0] = 0 # get rid of negative values
d = {'time': timestamps, 'frequency': pd.Series(melody_pos)}
df = pd.DataFrame(d)
traj = df.values
x_traj_mono = libfmp.c8.sonify_trajectory_with_sinusoid(traj, len(audio), samplerate, smooth_len=50, amplitude=0.8)
ipd.display(ipd.Audio(x_traj_mono + audio, rate=samplerate))
