When readframes() is used in Python, the online documentation says the sampling frequency is returned, but it looks like it returns 2 bytes. I think there are 4 bytes in each frame:
left = 2 bytes
right = 2 bytes
Do I have to check whether it is mono or stereo and, if it is stereo, read 2 frames at a time, and if it is mono, read 1 frame at a time?
A wave file has:
a sample rate of Wave_read.getframerate() frames per second (e.g. 44100 if from an audio CD),
a sample width of Wave_read.getsampwidth() bytes (i.e. 1 for 8-bit samples, 2 for 16-bit samples),
and Wave_read.getnchannels() channels (typically 1 for mono, 2 for stereo).
Every time you do a Wave_read.readframes(N), you get N * sample_width * n_channels bytes.
So, if you read 2048 frames from a 44100Hz, 16-bit stereo file, you get 8192 bytes as a result.
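If you want to sanity-check this on a real file, here is a minimal sketch (the file name is just a placeholder):

import wave

# Open the file and read a block of frames to confirm the arithmetic above.
with wave.open('example.wav', 'rb') as wav:  # placeholder file name
    print(wav.getframerate())  # e.g. 44100
    print(wav.getsampwidth())  # e.g. 2 for 16-bit samples
    print(wav.getnchannels())  # e.g. 2 for stereo
    frames = wav.readframes(2048)
    # For a 16-bit stereo file: 2048 frames * 2 bytes * 2 channels = 8192 bytes
    print(len(frames))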
I am trying to convert the images in DICOM files to arrays with the pydicom library, using this code:
import numpy as np
import pydicom

dcm_file = pydicom.dcmread(path)
print(dcm_file)
dcm_file.NumberOfFrames = 1
img = np.array(dcm_file.pixel_array)
Initially, the number of frames is 60. The images are 1024x1024, and here is the error I am getting:
The length of the pixel data in the dataset (1047504 bytes) doesn't match the expected length (2097152 bytes).
The dataset may be corrupted or there may be an issue with the pixel data handler.
For some files, the pixel data is an array of 2097152 elements and the code works. I don't understand the difference between the two, as they are the same type of object.
Some information about the files for which it is not working:
(0028, 0008) Number of Frames IS: '60'
(0028, 0010) Rows US: 1024
(0028, 0011) Columns US: 1024
(0028, 0100) Bits Allocated US: 16
(0028, 0101) Bits Stored US: 16
(0028, 0102) High Bit US: 15
(0028, 0103) Pixel Representation US: 0
(7fe0, 0010) Pixel Data OW: Array of 1047504 elements
For all the files, we also have:
(0002, 0010) Transfer Syntax UID UI: Explicit VR Little Endian
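For what it's worth, you can compute the expected pixel-data length yourself from those attributes; a minimal sketch, assuming uncompressed pixel data with one sample per pixel (path as in the question's code):

import pydicom

dcm = pydicom.dcmread(path)  # `path` as in the question

# Expected size for uncompressed data: rows * columns * frames * bytes per sample.
n_frames = int(getattr(dcm, 'NumberOfFrames', 1) or 1)
expected = dcm.Rows * dcm.Columns * n_frames * (dcm.BitsAllocated // 8)

print(expected)            # 1024 * 1024 * 60 * 2 = 125829120 for the 60-frame files
print(len(dcm.PixelData))  # actual number of bytes stored in the file

If the stored length is far smaller than the expected one, the pixel data may be compressed or truncated despite the declared transfer syntax.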
Background: I am writing a Python script that will take in an audio file and modify it using pydub. Pydub seems to require converting audio input to WAV format, though, which has a 4 GB limit. So I put a 400 MB .m4a file into pydub and got an error that the file is too large.
Instead of having pydub run for a couple of minutes and then throw an error if the converted, decompressed size is too large, I would like to quickly calculate ahead of time what the decompressed file size would be. If it is over 4 GB, my script will chop the original audio and then run it through pydub.
Thanks.
It's simple arithmetic to calculate the size of a theoretical .WAV file. The size, in bytes, is the bit depth divided by 8, multiplied by the sample rate, multiplied by the duration, multiplied by the number of channels.
So if you had an audio clip that was 3:20 long, 44100Hz, 16-bit and stereo, the calculation would be:
sample_rate = 44100 # Hz/Samples per second - CD Quality
bit_depth = 16 # 2 bytes; CD quality
channels = 2 # stereo
duration = 200.0 # seconds
file_size = sample_rate * (bit_depth / 8) * channels * duration
# = 44100 * (2) * 2 * 200
# = 35280000 bytes
# = 35.28 MB (megabytes)
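Applied to the 4 GB WAV limit from the question, here is a hypothetical helper (estimated_wav_size is my own name; the duration would come from the source file's metadata, e.g. via ffprobe or mutagen):

WAV_MAX_BYTES = 4 * 1024**3  # the ~4 GiB RIFF/WAV size limit

def estimated_wav_size(duration_s, sample_rate=44100, bit_depth=16, channels=2):
    # Header overhead (~44 bytes) is negligible next to the audio payload.
    return int(duration_s * sample_rate * (bit_depth // 8) * channels)

# e.g. 4 hours of CD-quality stereo:
size = estimated_wav_size(4 * 3600)
print(size, size > WAV_MAX_BYTES)  # 2540160000 False -> still under the limit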
I found this online audio file size calculator which you can also use to confirm your math: https://www.colincrawley.com/audio-file-size-calculator/
If instead you wanted to go the other direction, i.e. figure out the size of a theoretical compressed file, it depends on how you're doing the compression. Typical lossy compression, thankfully, uses a fixed (constant) bitrate, which makes the math to figure out the resulting compressed file size really simple.
So, if you had a 3:20 audio clip you wanted to convert to MP3 at a bitrate of 128 kbps (kilobits per second; 128 is a common mid-range quality setting), the calculation would just be the bitrate divided by 8 (bits per byte), multiplied by the duration:
bits_per_kb = 1000
bitrate_kbps = 128
bits_per_byte = 8
duration_seconds = 200
filesize_bytes = (bitrate_kbps * bits_per_kb / bits_per_byte) * duration_seconds
# = (128000 / 8) * 200
# = 16000 * 200
# = 3200000 bytes
# = 3.2 MB
I understand how the readframes() method works for mono audio input; however, I don't know how it works for stereo input. Would it give a tuple of two bytes objects?
A wave file has:
a sample rate of Wave_read.getframerate() frames per second (e.g. 44100 if from an audio CD),
a sample width of Wave_read.getsampwidth() bytes (i.e. 1 for 8-bit samples, 2 for 16-bit samples),
and Wave_read.getnchannels() channels (typically 1 for mono, 2 for stereo).
Every time you do a Wave_read.readframes(N), you get N * sample_width * n_channels bytes.
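So no tuple: for stereo, readframes() returns a single bytes object with the left and right samples interleaved (L0 R0 L1 R1 ...). A minimal sketch of splitting the channels, assuming 16-bit signed samples and a placeholder file name:

import struct
import wave

with wave.open('stereo.wav', 'rb') as wav:  # placeholder file name
    assert wav.getnchannels() == 2 and wav.getsampwidth() == 2
    raw = wav.readframes(wav.getnframes())

samples = struct.unpack('<%dh' % (len(raw) // 2), raw)  # little-endian int16
left = samples[0::2]   # even-indexed samples: left channel
right = samples[1::2]  # odd-indexed samples: right channel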
I am using wave files for making a deep learning model. They are of different lengths, so I want to pad all of them to 16 seconds using Python.
If I understood correctly, the question asks how to pad every file to a given fixed length, so the solution is slightly different:
from pydub import AudioSegment

pad_ms = 1000  # set the fixed target length you want (in milliseconds)
audio = AudioSegment.from_wav('your-wav-file.wav')
assert pad_ms >= len(audio), "Audio is longer than the target length: " + str(len(audio)) + " ms"
silence = AudioSegment.silent(duration=pad_ms - len(audio))
padded = audio + silence  # adding silence after the audio
padded.export('padded-file.wav', format='wav')
This answer differs from the one below in that this one pads every audio to the same total length, whereas the other appends the same fixed amount of silence to the end of each file.
Using pydub:
from pydub import AudioSegment
pad_ms = 1000 # milliseconds of silence needed
silence = AudioSegment.silent(duration=pad_ms)
audio = AudioSegment.from_wav('your-wav-file.wav')
padded = audio + silence # Adding silence after the audio
padded.export('padded-file.wav', format='wav')
Note that AudioSegment objects are immutable; the + operator returns a new segment.
You can use librosa. The librosa.util.fix_length function pads an audio signal by appending zeros to the end of the NumPy array containing the audio data:
import numpy as np
from librosa import load
from librosa.util import fix_length

file_path = 'dir/audio.wav'
sf = 44100  # sampling frequency of the wav file
required_audio_size = 5  # e.g. audio of 2 seconds needs to be padded to 5 seconds

audio, sf = load(file_path, sr=sf, mono=True)  # mono=True converts stereo audio to mono
padded_audio = fix_length(audio, size=required_audio_size * sf)  # array size is required_audio_size * sampling frequency

print('Array length before padding', np.shape(audio))
print('Audio length before padding in seconds', np.shape(audio)[0] / sf)
print('Array length after padding', np.shape(padded_audio))
print('Audio length after padding in seconds', np.shape(padded_audio)[0] / sf)
Output:
Array length before padding (88200,)
Audio length before padding in seconds 2.0
Array length after padding (220500,)
Audio length after padding in seconds 5.0
After looking through a number of similar questions, though, it seems like pydub.AudioSegment is the go-to solution.
I have a database which contains video streams. I want to calculate LBP features from the images and MFCCs from the audio, and for every frame in the video I have an annotation. The annotations are aligned with the video frames and the video time, so I want to map the annotation times onto the MFCC results. I know that sample_rate = 44100.
from python_speech_features import mfcc
from python_speech_features import logfbank
import scipy.io.wavfile as wav
audio_file = "sample.wav"
(rate,sig) = wav.read(audio_file)
mfcc_feat = mfcc(sig,rate)
print(len(sig))        # 2130912
print(len(mfcc_feat))  # 4831
First, why is the length of the MFCC result 4831, and how do I map it onto my annotations, which are in seconds? The total duration of the video is 48 seconds, and the annotation is 0 everywhere except in the 19-29 sec window, where it is 1. How can I locate the samples within that window (19-29) in the MFCC results?
Run
mfcc_feat.shape
You should get (4831, 13). 13 is your MFCC length (the default numcep is 13). 4831 is the number of windows; the default winstep is 10 ms, and this matches your sound file's duration. To get the windows corresponding to 19-29 sec, just slice:
mfcc_feat[1900:2900,:]
Remember that you cannot listen to the MFCCs. Each row just represents a slice of audio of 0.025 sec (the default value of the winlen parameter).
If you want to get to the audio itself, it is
sig[time_beg_in_sec*rate:time_end_in_sec*rate]
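To generalise the 19-29 sec slice, here is a small sketch (seconds_to_mfcc_rows is a hypothetical helper; it assumes the default winstep of 0.01 sec):

def seconds_to_mfcc_rows(t_begin, t_end, winstep=0.01):
    # One MFCC row is produced every winstep seconds.
    return int(t_begin / winstep), int(t_end / winstep)

begin, end = seconds_to_mfcc_rows(19, 29)
window_feat = mfcc_feat[begin:end, :]   # rows 1900:2900, as above
window_sig = sig[19 * rate:29 * rate]   # the corresponding raw samples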