I'm doing an rfft and irfft from a wave file:
samplerate, data = wavfile.read(location)
input = data.T[0] # first track of audio
fftData = np.fft.rfft(input[sample:], length)
output = np.fft.irfft(fftData).astype(data.dtype)
So it reads from a file and then does an rfft. However, the result produces a lot of noise when I play the audio through a PyAudio stream. I tried searching for an answer to this and used this solution:
rfft or irfft increasing wav file volume in python
That is why I added the .astype(data.dtype) when doing the irfft. It reduced the noise a bit, but it still sounds all wrong.
This is the playback, where p is the pyAudio:
stream = p.open(format=pyaudio.paFloat32,
                channels=1,
                rate=fs,
                output=True)
stream.write(output)
stream.stop_stream()
stream.close()
p.terminate()
So what am I doing wrong here?
Thanks!
Edit: I also tried using .astype(dtype=np.float32) when doing the irfft, since PyAudio uses that format when streaming the audio. However, it was still noisy.
The best working solution so far seems to be normalizing by the median value and using .astype(np.float32), since the PyAudio output is float32:
samplerate, data = wavfile.read(location)
input = data.T[0] # first track of audio
fftData = np.fft.rfft(input[sample:], length)
fftData = np.divide(fftData, np.median(fftData))
output = np.fft.irfft(fftData).astype(dtype=np.float32)
If anyone has a better solution, I'd like to hear it. I tried normalizing by the mean, but it still resulted in clipping audio, and normalizing by np.max made the whole audio too quiet. This normalization problem with FFT always gives me trouble, and I haven't found any fully working solutions here on SO.
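For reference, one likely explanation (an assumption on my part, since the actual file isn't available): wavfile.read returns raw int16 samples for a 16-bit WAV, while the stream above is opened with pyaudio.paFloat32, which expects float values in roughly [-1.0, 1.0]. Scaling by the int16 full-scale value instead of by the median avoids both the clipping and the low volume. A minimal sketch of that idea (the 32768.0 factor assumes 16-bit input):
import numpy as np
from scipy.io import wavfile

samplerate, data = wavfile.read(location)
mono = data.T[0]                                    # first track of audio
fftData = np.fft.rfft(mono[sample:], length)
# ...modify fftData here...
restored = np.fft.irfft(fftData)
output = (restored / 32768.0).astype(np.float32)    # int16 full scale -> [-1.0, 1.0]
stream.write(output.tobytes())                      # stream opened with pyaudio.paFloat32, as above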
The Question
I want to load an audio file of any type (mp3, m4a, flac, etc.) and write it to an output stream.
I tried using pydub, but it loads the entire file at once which takes forever and runs out of memory easily.
I also tried using python-vlc, but it's been unreliable and too much of a black box.
So, how can I open large audio files chunk-by-chunk for streaming?
Edit #1
I found half of a solution here, but I'll need to do more research for the other half.
TL;DR: Use subprocess and ffmpeg to convert the file to wav data, and pipe that data into np.frombuffer. The problem is that the subprocess still has to finish before frombuffer can be used.
...unless it's possible to have the pipe written to on 1 thread while np reads it from another thread, which I haven't tested yet. For now, this problem is not solved.
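One way around that (a sketch under the assumption that ffmpeg is on the PATH and that 16-bit mono PCM at 44.1 kHz is an acceptable intermediate format; "input.m4a" is just a placeholder name): have ffmpeg write raw PCM to its stdout and read the pipe in fixed-size chunks, decoding each chunk with np.frombuffer as it arrives, so nothing has to wait for the subprocess to finish:
import subprocess
import numpy as np

CHUNK_FRAMES = 4096                 # frames per read
BYTES_PER_FRAME = 2                 # s16le mono: 2 bytes per frame

# Decode any input format to raw 16-bit little-endian PCM on stdout
proc = subprocess.Popen(
    ["ffmpeg", "-i", "input.m4a", "-f", "s16le", "-acodec", "pcm_s16le",
     "-ac", "1", "-ar", "44100", "-"],
    stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)

while True:
    raw = proc.stdout.read(CHUNK_FRAMES * BYTES_PER_FRAME)
    if not raw:
        break
    chunk = np.frombuffer(raw, dtype=np.int16)
    # ...feed chunk to the output stream here...

proc.wait()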
I think the Python package https://github.com/irmen/pyminiaudio can be helpful. You can stream an audio file like this:
import miniaudio
audio_path = "my_audio_file.mp3"
target_sampling_rate = 44100 # the input audio will be resampled at this sampling rate
n_channels = 1 #either 1 or 2
waveform_duration = 30 #in seconds
offset = 15 #this means that we read only in the interval [15s, duration of file]
waveform_generator = miniaudio.stream_file(
    filename=audio_path,
    sample_rate=target_sampling_rate,
    seek_frame=int(offset * target_sampling_rate),
    frames_to_read=int(waveform_duration * target_sampling_rate),
    output_format=miniaudio.SampleFormat.FLOAT32,
    nchannels=n_channels)

for waveform in waveform_generator:
    # do something with the waveform....
I know for sure that this works for mp3, ogg, wav, and flac, but for some reason it does not work for mp4/aac, and I am actually looking for a way to read mp4/aac.
I'm doing almost everything exactly as the pyaudio documentation describes. The main difference: in the stream loop, I am running my song data through a filter that outputs audio data as an array. The resulting output is only white noise. Here's my code:
import librosa as lr
y, sr = lr.load(song_filename)
# open stream
stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                channels=wf.getnchannels(),
                rate=wf.getframerate(),
                output=True)
while _:
    x = my_filter(parameter, y=y[CHUNK])
    stream.write(x)
    CHUNK = updateCHUNK()
My guess is that librosa uses NumPy arrays while PyAudio uses bytes. So I tried this:
while _:
    x = my_filter(parameter, y=y[CHUNK])
    x = x.tobytes()
    stream.write(x)
    CHUNK = updateCHUNK()
Nothing was fixed by .tobytes(). Can anyone suggest solutions to this issue?
Edit:
I realize that I'm not ready to look for solutions. Still trying to understand what the problem is.
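For what it's worth, a guess at what is going on (the full code isn't shown, so this is an assumption): librosa.load returns float32 samples in [-1.0, 1.0], while p.get_format_from_width(wf.getsampwidth()) gives paInt16 for a 16-bit WAV, so the float32 bytes get reinterpreted as int16 and come out as white noise. A minimal sketch of keeping the two formats matched, with the filter step left out (song_filename is the question's own name):
import librosa as lr
import numpy as np
import pyaudio

y, sr = lr.load(song_filename)        # float32 samples in [-1.0, 1.0]

p = pyaudio.PyAudio()
# open the stream in the same format as the data being written
stream = p.open(format=pyaudio.paFloat32, channels=1, rate=sr, output=True)

CHUNK = 1024
for start in range(0, len(y), CHUNK):
    x = y[start:start + CHUNK]        # apply my_filter(...) here instead, if desired
    stream.write(x.astype(np.float32).tobytes())

stream.stop_stream()
stream.close()
p.terminate()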
I accidentally forgot to convert some NumPy arrays to bytes objects when using PyAudio, but to my surprise it still played audio, even if it sounded a bit off. I wrote a little test script (see below) for playing 1 second of a 440Hz tone, and it seems like writing a NumPy array directly to a PyAudio Stream cuts that tone short.
Can anyone explain why this happens? I thought a NumPy array was a contiguous sequence of bytes with some header information about its dtype and strides, so I would've predicted that PyAudio played the full second of the tone after some garbled audio from the header, not cut the tone off.
# script segment
import pyaudio
import numpy as np
RATE = 48000
p = pyaudio.PyAudio()
stream = p.open(format = pyaudio.paFloat32, channels = 1, rate = RATE, output = True)
TONE = 440
SECONDS = 1
t = np.arange(0, 2*np.pi*TONE*SECONDS, 2*np.pi*TONE/RATE)
sina = np.sin(t).astype(np.float32)
sinb = sina.tobytes()
# console commands segment
stream.write(sinb) # bytes object plays 1 second of 440Hz tone
stream.write(sina) # still plays 440Hz tone, but noticeably shorter than 1 second
The problem is more subtle than you describe. Your first call passes a bytes object of size 192,000. The second call passes an array of 48,000 float32 values. PyAudio handles both of them and passes the buffer to PortAudio to be played.
However, when you opened the stream, you told PyAudio you were sending paFloat32 data, which has 4 bytes per sample. PyAudio's write handler takes the length of the array you gave it and divides by the number of channels times the sample size to determine how many audio samples there are. In your second call, the length of the array is 48,000, which it divides by 4, thereby telling PortAudio "there are 12,000 samples here".
So everyone understood the format but was confused about the size. If you change the second call to
stream.write(sina, 48000)
then no one has to guess, and it works perfectly fine.
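Put differently, you can either hand write() a bytes object (so its length is already a byte count) or keep the NumPy array and state the frame count yourself via the num_frames argument that Stream.write accepts. A short sketch of both, reusing sina from the script above:
# Option 1: bytes; len() is 192,000, so PyAudio infers 48,000 frames
stream.write(sina.tobytes())

# Option 2: NumPy array, with the frame count stated explicitly
stream.write(sina, num_frames=len(sina))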
As you might notice, I am really new to Python and sound processing. I (hopefully) extracted FFT data from a wave file using Python and the logfbank and mfcc functions. (The logfbank seems to give the most promising data; the mfcc output looked a bit weird to me.)
In my program I want to change the logfbank/mfcc data and then create wave data from it (and write it into a file). I didn't really find any information about the process of creating wave data from FFT data. Does anyone have an idea how to solve this? I would appreciate it a lot :)
This is my code so far:
from scipy.io import wavfile
import numpy as np
from python_speech_features import mfcc, logfbank
rate, signal = wavfile.read('orig.wav')
fbank = logfbank(signal, rate, nfilt=100, nfft=1400).T
mfcc = mfcc(signal, rate, numcep=13, nfilt=26, nfft=1103).T
#magic data processing of fbank or mfcc here
#creating wave data and writing it back to a .wav file here
A suitably constructed STFT spectrogram containing both magnitude and phase can be converted back to a time-domain waveform using the overlap-add method. The important thing is that the spectrogram construction must have the constant-overlap-add (COLA) property.
It can be challenging to have your modifications correctly manipulate both magnitude and phase of a spectrogram. So sometimes the phase is discarded, and magnitude manipulated independently. In order to convert this back into a waveform one must then estimate phase information during reconstruction (phase reconstruction). This is a lossy process, and usually pretty computationally intensive. Established approaches use an iterative algorithm, usually a variation on Griffin-Lim. But there are now also new methods using Convolutional Neural Networks.
Waveform from mel-spectrogram or MFCC using librosa
librosa version 0.7.0 contains a fast Griffin-Lim implementation as well as helper functions to invert a mel-spectrogram or MFCCs.
Below is a code example. The input test file is found at https://github.com/jonnor/machinehearing/blob/ab7fe72807e9519af0151ec4f7ebfd890f432c83/handson/spectrogram-inversion/436951__arnaud-coutancier__old-ladies-pets-and-train-02.flac
import numpy
import librosa
import soundfile
# parameters
sr = 22050
n_mels = 128
hop_length = 512
n_iter = 32
n_mfcc = None # can try n_mfcc=20
# load audio and create Mel-spectrogram
path = '436951__arnaud-coutancier__old-ladies-pets-and-train-02.flac'
y, _ = librosa.load(path, sr=sr)
S = numpy.abs(librosa.stft(y, hop_length=hop_length, n_fft=hop_length*2))
mel_spec = librosa.feature.melspectrogram(S=S, sr=sr, n_mels=n_mels, hop_length=hop_length)
# optional, compute MFCCs in addition
if n_mfcc is not None:
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(S), sr=sr, n_mfcc=n_mfcc)
    mel_spec = librosa.feature.inverse.mfcc_to_mel(mfcc, n_mels=n_mels)
# Invert mel-spectrogram
S_inv = librosa.feature.inverse.mel_to_stft(mel_spec, sr=sr, n_fft=hop_length*4)
y_inv = librosa.griffinlim(S_inv, n_iter=n_iter,
                           hop_length=hop_length)
soundfile.write('orig.wav', y, samplerate=sr)
soundfile.write('inv.wav', y_inv, samplerate=sr)
Results
The reconstructed waveform will have some artifacts.
The above example got a lot of repetitive noise, more than I expected. It was possible to reduce it quite a lot using the standard Noise Reduction algorithm in Audacity.
I'm trying to create a spectrogram program (in python), which will analyze and display the frequency spectrum from a microphone input in real time. I am using a template program for recording audio from here: http://people.csail.mit.edu/hubert/pyaudio/#examples (recording example)
This template program works fine, but I am unsure of the format of the data that is being returned from the data = stream.read(CHUNK) line. I have done some research on the .wav format, which is used in this program, but I cannot find the meaning of the actual data bytes themselves, just definitions for the metadata in the .wav file.
I understand this program uses 16 bit samples, and the 'chunks' are stored in python strings. I was hoping somebody could help me understand exactly what the data in each sample represents. Even just a link to a source for this information would be helpful. I tried googling, but I don't think I know the terminology well enough to search accurately.
stream.read gives you raw binary data. With format=pyaudio.paInt16 and one channel, each sample is a signed 16-bit integer, i.e. a PCM amplitude value in the range -32768 to 32767. To get the numeric audio samples, you can use numpy.frombuffer (numpy.fromstring is deprecated) to turn the data into a NumPy array, or you can use Python's built-in struct.unpack.
Example:
import pyaudio
import numpy
import struct
CHUNK = 128
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=44100, input=True, frames_per_buffer=CHUNK)
data = stream.read(CHUNK)
print(numpy.frombuffer(data, numpy.int16))  # use external numpy module
print(struct.unpack('h' * CHUNK, data))     # use built-in struct module
stream.stop_stream()
stream.close()
p.terminate()