TensorFlow Dataset - how to play / convert a WAV file (int64)?

I want to test the following dataset: https://www.tensorflow.org/datasets/catalog/speech_commands
When I load and play the audio, I just get (random?) noise.
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import IPython.display as ipd
ds, ds_info = tfds.load('speech_commands', shuffle_files=False, with_info=True)
ds_info
tfds.core.DatasetInfo(
name='speech_commands',
full_name='speech_commands/0.0.2',
description="""
An audio dataset of spoken words designed to help train and evaluate keyword
spotting systems. Its primary goal is to provide a way to build and test small
models that detect when a single word is spoken, from a set of ten target words,
with as few false positives as possible from background noise or unrelated
speech. Note that in the train and validation set, the label "unknown" is much
more prevalent than the labels of the target words or background noise.
One difference from the release version is the handling of silent segments.
While in the test set the silence segments are regular 1 second files, in the
training they are provided as long segments under "background_noise" folder.
Here we split these background noise into 1 second clips, and also keep one of
the files for the validation set.
""",
homepage='https://arxiv.org/abs/1804.03209',
data_path='C:\\Users\\abc\\tensorflow_datasets\\speech_commands\\0.0.2',
download_size=2.37 GiB,
dataset_size=9.07 GiB,
features=FeaturesDict({
'audio': Audio(shape=(None,), dtype=tf.int64),
'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=12),
}),
supervised_keys=('audio', 'label'),
splits={
'test': <SplitInfo num_examples=4890, num_shards=4>,
'train': <SplitInfo num_examples=106497, num_shards=128>,
'validation': <SplitInfo num_examples=121, num_shards=1>,
},
citation="""#article{speechcommandsv2,
author = {{Warden}, P.},
title = "{Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition}",
journal = {ArXiv e-prints},
archivePrefix = "arXiv",
eprint = {1804.03209},
primaryClass = "cs.CL",
keywords = {Computer Science - Computation and Language, Computer Science - Human-Computer Interaction},
year = 2018,
month = apr,
url = {https://arxiv.org/abs/1804.03209},
}""",
)
The audio files are arrays of type int64 with a sample rate of 16000. I couldn't find any information on how to play the files within this dataset. From other datasets I was able to play the WAV sounds; one difference is that those datasets used float arrays while this one uses int arrays. Maybe I'm missing a conversion step?
ds_list = list(ds['validation'])
idx = -1
audio, label = ds_list[idx]['audio'], ds_list[idx]['label']
ipd.Audio(audio, rate=16_000)
I obviously tried multiple indices within the dataset, but I always just get noise. One audio entry looks something like this:
tf.Tensor([ -112 1285 -2002 ... -140 1000 -595], shape=(16000,), dtype=int64)
Ty :)

As per the source code, on the description page [1], it is stated that:
Its primary goal is to provide a way to build and test small
models that detect when a single word is spoken, from a set of ten target words,
with as few false positives as possible from background noise or unrelated
speech.
At first, I could only play the noisy wav file, as you have shown. Then I modified my code based on [2] to produce a cleaner voice.
I use the following code to convert the tensors to WAV format.
import numpy as np
import scipy.io.wavfile as wavfile
import tensorflow as tf
import tensorflow_datasets as tfds

# load speech commands dataset
ds = tfds.load('speech_commands', split=['train', 'validation', 'test'],
               shuffle_files=True)

# convert from tfds format to list
ds_train = list(ds[0])
ds_val = list(ds[1])

# convert from tensor int64 to numpy float32 in [-1, 1]
sc1 = ds_train[0]['audio'].numpy().astype(np.float32) / np.iinfo(np.int16).max
sv1 = ds_val[0]['audio'].numpy().astype(np.float32) / np.iinfo(np.int16).max

# save as wav
wavfile.write('sc_train_1.wav', 16000, sc1)
wavfile.write('sc_val_1.wav', 16000, sv1)
The trick is to convert the int64 samples to float32 and divide by the maximum value of np.int16: .astype(np.float32) / np.iinfo(np.int16).max (note that max is an attribute, not a method).
Now I can hear a much cleaner voice than with the raw int64 data.
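If you just want to listen inside the notebook instead of writing files, the same scaling can be applied before handing the array to IPython.display.Audio. A minimal sketch, reusing ds_val from the code above:
import numpy as np
import IPython.display as ipd

# Scale the int64 samples into a float range before playback.
audio_int = ds_val[0]['audio'].numpy()
audio_float = audio_int.astype(np.float32) / np.iinfo(np.int16).max
ipd.Audio(audio_float, rate=16_000)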
[1] https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/audio/speech_commands.py
[2] https://github.com/google-research/google-research/blob/master/non_semantic_speech_benchmark/train_and_eval_sklearn_small_tfds_dataset.ipynb

Related

Map function of tf.data.Dataset API giving unexpected error | TypeError

I'm doing an audio classification project, i.e. detecting whether a whale call is present or not in a given .wav file. The following are the data preprocessing steps. The first cell creates a dataset of paths to positive and negative samples. The second cell shows the data type of the dataset object. As you can see in the third cell, when we iterate through some samples of the dataset, each sample is a tensor which contains a path to a .wav file and a label. The fourth cell is the preprocessing method which I need to apply to each dataset sample. The problem is that when I run data.map(preprocess) it throws an error; see the end for more detail.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import math
import tensorflow as tf
import numpy as np

FRAME_SIZE = 2048
HOP_LENGTH = 512
SR = 200

POS = '/kaggle/input/datafestintegration2023/train/train/1'
NEG = '/kaggle/input/datafestintegration2023/train/train/0'

pos = tf.data.Dataset.list_files(POS+'/*.wav')
neg = tf.data.Dataset.list_files(NEG+'/*.wav')

positives = tf.data.Dataset.zip((pos, tf.data.Dataset.from_tensor_slices(tf.ones(len(pos)))))
negatives = tf.data.Dataset.zip((neg, tf.data.Dataset.from_tensor_slices(tf.zeros(len(neg)))))
dataset = positives.concatenate(negatives)
dataset

for d in dataset.take(5):
    print(d)

def preprocess(file_path, label):
    wav, _ = librosa.load(file_path, sr=SR)
    wav = wav[:12000]
    zero_padding = tf.zeros([12000] - tf.shape(wav), dtype=tf.float32)
    wav = tf.concat([zero_padding, wav], 0)
    wav = np.array(wav)
    mel_spectrogram = librosa.feature.melspectrogram(wav, sr=SR, n_fft=FRAME_SIZE, hop_length=HOP_LENGTH, n_mels=10)
    log_mel_spectrogram = librosa.power_to_db(mel_spectrogram)
    spectrogram = tf.expand_dims(log_mel_spectrogram, axis=2)
    return spectrogram, label

dataset = dataset.map(preprocess)
When I run the above code cell, it throws an error. As per my understanding, the preprocess method is not able to fetch the paths from the dataset. What should I do?
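One likely explanation (an assumption, since the actual traceback isn't shown here): dataset.map traces preprocess with symbolic tensors, while librosa.load needs a concrete Python path string and NumPy data. A common workaround is to wrap the eager code in tf.py_function. A rough sketch, reusing the constants and dataset defined above:
def preprocess_py(file_path, label):
    # Runs eagerly, so .numpy() and librosa are available here.
    wav, _ = librosa.load(file_path.numpy().decode('utf-8'), sr=SR)
    wav = wav[:12000]
    wav = np.pad(wav, (12000 - len(wav), 0)).astype(np.float32)
    mel = librosa.feature.melspectrogram(y=wav, sr=SR, n_fft=FRAME_SIZE,
                                         hop_length=HOP_LENGTH, n_mels=10)
    log_mel = librosa.power_to_db(mel).astype(np.float32)
    return log_mel[..., np.newaxis], label

def preprocess_tf(file_path, label):
    spectrogram, label = tf.py_function(
        preprocess_py, [file_path, label], [tf.float32, tf.float32])
    # py_function loses static shape info; restore what is known for downstream layers.
    spectrogram.set_shape((10, None, 1))
    label.set_shape(())
    return spectrogram, label

dataset = dataset.map(preprocess_tf)
Alternatively, the WAV decoding and spectrogram can be done with pure TensorFlow ops (tf.io.read_file, tf.audio.decode_wav, tf.signal.stft), which avoids py_function's per-element eager overhead.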

Why Do Librosa, PyDub and Tensorflow read the same mp3 differently?

I have downloaded the Kaggle Speech Accent Archive to learn how to handle audio data. I'm comparing three ways of reading mp3s in this dataset: the first uses TensorFlow's AudioIOTensor, the second uses Librosa and the third uses PyDub. I let each of them read the same mp3 file; however, all three get different results on the same file.
I used this code:
import librosa
import numpy as np
import os
import pathlib
import pyaudio
from pydub import AudioSegment as pydub_AudioSegment
from pydub.utils import mediainfo as pydub_mediainfo
import tensorflow as tf
import tensorflow_io as tfio
DATA_DIR = <Path to data>
data_path = pathlib.Path(DATA_DIR)
mp3Files = [x for x in data_path.iterdir() if '.mp3' in x.name]
def load_audios(file_list):
    dataset = []
    for curr_file in file_list:
        tf2 = tfio.audio.AudioIOTensor(curr_file.as_posix())
        librsa, librsa_sr = librosa.load(curr_file.as_posix())
        pdub = pydub_AudioSegment.from_file(curr_file.as_posix(), 'mp3')
        dataset.append([tf2, librsa, librsa_sr, pdub, curr_file.name])
    return dataset
audios = load_audios(mp3Files[0:1]) # Reads 'afrikaans1.mp3' from
# Kaggle's Speech Accent Archive
tf2 = audios[0][0]
libr = audios[0][1]
libr_sr = audios[0][2]
pdub = audios[0][3]
But now when I start comparing the way these 3 modules read the same mp3 file I see this behavior for Tensorflow's AudioIOTensor:
>> tf2_arr = tf.squeeze(tf2.to_tensor(),-1).numpy()
>> tf2_arr, tf2, tf2_arr.shape # Gives raw data, sampling rate & shape
(array([ 0.00905748, 0.01102116, 0.00883307, ..., -0.00131128,
-0.00134344, -0.00090137], dtype=float32),
<AudioIOTensor: shape=[916057 1], dtype=<dtype: 'float32'>, rate=44100>,
(916057,))
>> np.argmax(tf2_arr), np.argmin(tf2_arr)
(113149, 106715)
This behavior for Librosa:
>> libr, libr_sr, libr.shape # Gives raw data, sampling rate & shape
(array([ 0.00711342, 0.01064209, 0.00806945, ..., -0.00168153,
-0.00148052, 0. ], dtype=float32),
22050,
(458029,))
And for PyDub, I see this:
>> pdub_data = np.array(pdub.get_array_of_samples())
>> pdub_data, pdub.frame_rate, pdub_data.shape # Gives raw data, sampling rate
# & shape
(array([297, 361, 289, ..., -43, -44, -30], dtype=int16), 44100, (916057,))
Although the raw values all disagreed with each other, the first reassuring thing I noticed is that the AudioIOTensor and PyDub results have the same sampling frequency (44100) and the same shape ((916057,)). Yet Librosa's result had a sampling frequency (22050) and shape ((458029,)) that were half of those of the other two techniques.
Next, I looked to see where the max and mins of each array was. I found this:
>> np.argmax(tf2_arr), np.argmin(tf2_arr)
(113149, 106715)
>> np.argmax(pdub_data), np.argmin(pdub_data)
(113149, 106715)
>> np.argmax(libr)*2, np.argmin(libr)*2
(113150, 106714)
So, allowing for the fact that Librosa has half the sampling rate of the other two libraries, all three libraries agree on where the max's and min's are.
Lastly, I decided to see if Tensorflow's AudioIOTensor and PyDub's result were separated by a constant multiplicative factor by taking the average of the ratio of the maxes and mins:
>> pdub_data[113149]/tf2_arr[113149], pdub_data[106715]/tf2_arr[106715]
(32768.027, 32768.184)
>> test = tf2_arr * 32768.105
>> diff = test-pdub_data
>> np.max(diff), np.min(diff)
(0.578125, -0.5917969)
Since pdub_data had values ranging from 23864 to -22269 (i.e. I checked np.max(pdub_data) and np.min(pdub_data)), I was willing to assume that if the differences were bounded by +/- 0.6, they were due to rounding and similar effects. I was willing to assume that the same would hold for Librosa, but now I'm left wondering why.
I would've thought that reading an mp3 file wouldn't leave room for interpretation. Raw data was stored using whatever rules mp3 uses and should be recovered when the file is read.
Why do these 3 libraries differ in the raw numbers they return, and in one case in the sampling rate of the returned data? How can I get one or all of them to return the raw data stored in the mp3 file? Should I attach any significance to the fact that the ratio between the pdub_data values and the tf2_arr values is 32768 (i.e. 2^15)?
=====================================================================
Later thoughts: I'm wondering if part of the reason for the differences between these libraries lies in the variable types they use. Librosa uses float32 and PyDub uses int16. So it might make sense that PyDub sees twice as many numbers as Librosa, which gives it twice the sampling rate. Similarly, AudioIOTensor differs from PyDub by a factor of 2^15. If one prepends 15 bits to a 16-bit int, with one more to handle the sign, one could conceivably get a 32-bit float. But both of these cases seem to imply that one set of numbers will be, in some sense, 'wrong'.....
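If the goal is simply to make the three readers agree, the resampling and the int16 scaling can both be undone explicitly. A rough sketch, reusing mp3Files, pdub and tf2_arr from the code above:
import numpy as np
import librosa

# Keep librosa at the file's native rate instead of its default 22050 Hz resample.
libr_native, libr_native_sr = librosa.load(mp3Files[0].as_posix(), sr=None)
print(libr_native_sr)  # should now report 44100 for this file

# Scale PyDub's int16 samples into the same [-1, 1] float range as the others.
pdub_float = np.array(pdub.get_array_of_samples()).astype(np.float32) / 2**15
print(np.max(np.abs(pdub_float - tf2_arr)))  # should now be tiny (rounding-level)
Exact bit-for-bit agreement still isn't guaranteed, because different mp3 decoder back-ends may reconstruct slightly different sample values from the same compressed stream.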

Bind label to image

From the MNIST dataset example I know that the dataset looks something like (60000,28,28) and the labels are (60000,). When I print the first three examples of the MNIST dataset and then the first three corresponding labels, I can see that the images and labels are bound together.
I want to know how I can bind a folder of 1200 images of size 64x64 to an Excel file with a column named "damage" containing 5 different classes, so I can train a neural network.
For example, an image of a car door whose damage is class 3.
Here's a rough sketch of how you can approach this problem.
Loading each image
The first step is how you pre-process each image. You can use Python Imaging Library for this.
Example:
from PIL import Image

def load_image(path):
    image = Image.open(path)
    # Images can be in one of several different modes.
    # Convert to a single consistent mode.
    image = image.convert("RGB")
    image = image.resize((64, 64))
    return image
Optional step: cropping
Cropping the images to focus on the feature you want the network to pay attention to can improve performance, but requires some work for each training example and for each inference.
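For example, if the relevant region sits in roughly the same place in every photo, a fixed crop could be applied before resizing. The box coordinates below are placeholders you would need to tune to your images:
from PIL import Image

def load_image_cropped(path):
    image = Image.open(path).convert("RGB")
    # (left, upper, right, lower) in pixels of the original photo.
    image = image.crop((100, 50, 500, 350))
    return image.resize((64, 64))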
Loading all images
I would load the images like this:
import glob
import pandas as pd

image_search_path = "image_directory/*.png"

def load_all_images():
    images = []
    for path in glob.glob(image_search_path):
        image = load_image(path)
        images.append({
            'path': path,
            'img': image,
        })
    return pd.DataFrame(images)
Loading the labels
I would use Pandas to load the labels. Suppose you have an excel file with the columns path and label, named labels.xlsx.
labels = pd.read_excel("labels.xlsx")
You then have the problem that the images that are loaded are probably not in the same order as your file full of labels. You can fix this by merging the two datasets.
images = load_all_images()
images_and_labels = images.merge(labels, on="path", validate="1:1")
# check that no rows were dropped or added, say by a missing label
assert len(images.index) == len(images_and_labels.index)
assert len(labels.index) == len(images_and_labels.index)
Converting images to numpy
Next, you need to convert both the images and the labels into numpy arrays.
Example for images:
import numpy as np

images_processed = []
for image in images_and_labels['img'].tolist():
    image = np.array(image)
    # Does the image have the expected shape?
    assert image.shape == (64, 64, 3)
    images_processed.append(image)
images_numpy = np.array(images_processed)

# Check that this has the expected shape. You'll need
# to replace 1200 with the number of training examples.
assert images_numpy.shape == (1200, 64, 64, 3)
Converting labels to numpy
Assuming you're setting up a classifier, like MNIST, you'll first want to decide on an ordering of categories, and map each element of that list of categories to its position within that ordering.
The ordering of categories is arbitrary, but you'll want to be consistent about it.
Example:
categories = {
    'damage high': 0,
    'damage low': 1,
    'damage none': 2,
}

categories_num = images_and_labels['label'].map(categories)
# Are there any labels that didn't get mapped to something?
assert categories_num.isna().sum() == 0

# Convert labels to numpy
labels_np = categories_num.values

# Check shape. You'll need to replace 1200 with the number of training examples
assert labels_np.shape == (1200,)
You should now have the variables images_numpy and labels_np set up as numpy arrays in the same style as the MNIST example.

Python - Image recognition classifier

I want to evaluate if an event is happening in my screen, every time it happens a particular box/image shows up in a screen region with very similar structure.
I have collected a bunch of 84x94 .png RGB images from that screen region and I'd like to build a classifier to tell me if the event is happening or not.
Therefore my idea was to create a pd.DataFrame (df) containing two columns: df['np_array'] contains every picture as an np.array, and df['is_category'] contains boolean values telling whether that image indicates that the event is happening or not.
The structure looks like this (with different array sizes); I have resized the images to 10x10 for training and converted them to greyscale:
df = pd.DataFrame(
    {'np_array': [np.random.random((10, 10, 2)) for x in range(0, 10)],
     'is_category': [bool(random.getrandbits(1)) for x in range(0, 10)]
    })
My problem is that I can't fit a scikit-learn classifier by doing clf.fit(df['np_array'], df['is_category']).
I've never tried image recognition before, thanks upfront for any help!
If it's a 10x10 grayscale image, you can flatten it:
import numpy as np
from sklearn import ensemble

# generate 100 random 10x10 images
image_data = np.random.rand(100, 10, 10)
# generate random binary labels
labels = np.random.randint(0, 2, 100)

# flatten each image into a single feature vector
X = image_data.reshape(100, -1)

# then use any scikit-learn classification model
clf = ensemble.RandomForestClassifier()
clf.fit(X, labels)
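Mapping this back to the DataFrame in the question, the flattening step might look like the sketch below (assuming every entry of df['np_array'] has the same shape):
import numpy as np
from sklearn import ensemble

# Stack the per-row arrays into one (n_samples, ...) block, then flatten each image.
X = np.stack(df['np_array'].to_list()).reshape(len(df), -1)
y = df['is_category'].values

clf = ensemble.RandomForestClassifier()
clf.fit(X, y)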
By the way, for images the best performing algorithms are convolutional neural networks.
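For comparison, a minimal Keras sketch of such a network, assuming 10x10 single-channel inputs and a binary label (not tuned in any way):
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10, 10, 1)),
    tf.keras.layers.Conv2D(16, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(image_data[..., None], labels, epochs=10)  # image_data: (n, 10, 10) float array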

Python/OpenCV - Machine Learning-based OCR (Image to Text)

I am experimenting with using OpenCV via the Python 2.7 interface to implement a machine learning-based OCR application to parse text out of an image file. I am using this tutorial (I've reposted the code below for convenience). I am completely new to machine learning, and relatively new to OpenCV.
OCR of Hand-written Digits:
import numpy as np
import cv2
from matplotlib import pyplot as plt
img = cv2.imread('digits.png')
gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
# Now we split the image to 5000 cells, each 20x20 size
cells = [np.hsplit(row,100) for row in np.vsplit(gray,50)]
# Make it into a Numpy array. Its size will be (50,100,20,20)
x = np.array(cells)
# Now we prepare train_data and test_data.
train = x[:,:50].reshape(-1,400).astype(np.float32) # Size = (2500,400)
test = x[:,50:100].reshape(-1,400).astype(np.float32) # Size = (2500,400)
# Create labels for train and test data
k = np.arange(10)
train_labels = np.repeat(k,250)[:,np.newaxis]
test_labels = train_labels.copy()
# Initiate kNN, train the data, then test it with test data for k=1
knn = cv2.KNearest()
knn.train(train,train_labels)
ret,result,neighbours,dist = knn.find_nearest(test,k=5)
# Now we check the accuracy of classification
# For that, compare the result with test_labels and check which are wrong
matches = result==test_labels
correct = np.count_nonzero(matches)
accuracy = correct*100.0/result.size
print accuracy
# save the data
np.savez('knn_data.npz',train=train, train_labels=train_labels)
# Now load the data
with np.load('knn_data.npz') as data:
print data.files
train = data['train']
train_labels = data['train_labels']
OCR of English Alphabets:
import cv2
import numpy as np
import matplotlib.pyplot as plt
# Load the data, converters convert the letter to a number
data= np.loadtxt('letter-recognition.data', dtype= 'float32', delimiter = ',',
converters= {0: lambda ch: ord(ch)-ord('A')})
# split the data to two, 10000 each for train and test
train, test = np.vsplit(data,2)
# split trainData and testData to features and responses
responses, trainData = np.hsplit(train,[1])
labels, testData = np.hsplit(test,[1])
# Initiate the kNN, classify, measure accuracy.
knn = cv2.KNearest()
knn.train(trainData, responses)
ret, result, neighbours, dist = knn.find_nearest(testData, k=5)
correct = np.count_nonzero(result == labels)
accuracy = correct*100.0/10000
print accuracy
The 2nd code snippet (for the English alphabet) takes input from a .data file in the following format:
T,2,8,3,5,1,8,13,0,6,6,10,8,0,8,0,8
I,5,12,3,7,2,10,5,5,4,13,3,9,2,8,4,10
D,4,11,6,8,6,10,6,2,6,10,3,7,3,7,3,9
N,7,11,6,6,3,5,9,4,6,4,4,10,6,10,2,8
G,2,1,3,1,1,8,6,6,6,6,5,9,1,7,5,10
S,4,11,5,8,3,8,8,6,9,5,6,6,0,8,9,7
B,4,2,5,4,4,8,7,6,6,7,6,6,2,8,7,10
...there are about 20,000 lines of that. The data describes the contours of characters.
I have a basic grasp on how this works, but I am confused as to how I can use this to actually perform OCR on an image. How can I use this code to write a function that takes a cv2 image as a parameter and returns a string representing the recognized text?
In general, machine learning works like this: first you train your program to understand the domain of your problem, and then you start asking it questions.
So if you are creating an OCR system, the first step is teaching your program what the letter A looks like, then the letter B, and so on.
You use OpenCV to clean the noise from the image, identify groups of pixels that could be letters, and isolate them.
Then you feed those letters to your OCR program. In training mode, you feed in an image and state which letter it represents. In asking mode, you feed in an image and ask which letter it is. The better the training, the more accurate your answers will be (the program could still get a letter wrong; there is always a chance of that).
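As a rough illustration of how the trained digit model from the first snippet could be turned into the requested image-to-string function (a sketch only, using the same OpenCV 2.4 / Python 2 API as the tutorial; the segmentation here is deliberately naive and assumes dark characters on a light background):
import cv2
import numpy as np

def recognize_digits(image, knn):
    # image: a BGR cv2 image; knn: the cv2.KNearest model trained on 20x20 digit cells
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Make the characters white on black so findContours picks them up.
    _, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(thresh.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = sorted(cv2.boundingRect(c) for c in contours)  # left-to-right by x
    text = ''
    for x, y, w, h in boxes:
        # Resize each candidate character to the 20x20 cells the model was trained on.
        cell = cv2.resize(thresh[y:y+h, x:x+w], (20, 20))
        sample = cell.reshape(1, 400).astype(np.float32)
        ret, result, neighbours, dist = knn.find_nearest(sample, k=5)
        text += str(int(result[0][0]))
    return text
Accuracy will depend heavily on how closely the segmented, resized characters resemble the training cells, so in practice the thresholding and segmentation usually need far more care than this.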
