Speech labels for speech recognition - python

Another newbie question: I’m trying to create a speech recognition ANN, and I’m not entirely sure about the format of the labels. Since each label is a string of characters (converted to integers, I assume), should it be an array of characters? Should it be a single string? Should the labels have a fixed length? I’m using fastai in Python and basing the model on an image recognition model.
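One common way to handle this, as a minimal sketch rather than a definitive answer: keep each label as a variable-length list of integer character ids, pad the lists in a batch to the same length, and store the true lengths separately (sequence losses such as CTC typically need them). The vocabulary, the choice of id 0 for padding, and the helper names `encode`, `decode`, and `pad_batch` are all assumptions made for illustration; none of this is fastai-specific.

```python
# Hypothetical character vocabulary; id 0 is reserved here for padding.
VOCAB = " 'abcdefghijklmnopqrstuvwxyz"
PAD_ID = 0
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(VOCAB)}
ID_TO_CHAR = {i: ch for ch, i in CHAR_TO_ID.items()}


def encode(text):
    """Turn a transcript into a list of integer ids, skipping unknown characters."""
    return [CHAR_TO_ID[ch] for ch in text.lower() if ch in CHAR_TO_ID]


def decode(ids):
    """Turn a list of ids back into text, ignoring padding."""
    return "".join(ID_TO_CHAR[i] for i in ids if i != PAD_ID)


def pad_batch(sequences):
    """Pad every sequence to the longest one; also return the true lengths."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


labels = [encode("hello world"), encode("hi")]
padded, lengths = pad_batch(labels)
```

So the labels do not need a fixed length in the dataset itself; they only get padded to a common length per batch.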

Related

Why are labels written in Amharic not displayed in YOLOv5?

I am developing character recognition for Amharic characters using YOLOv5. My model runs well, but after prediction it does not display these characters on the predicted image. Please, I need your help. Could it be an encoding problem?
I need Python code that enables me to encode these labels in YOLOv5.

Facial feature extraction using OpenCV, Python

I have been working on a project that requires extracting facial features in Python. I will be using OpenCV in this project too. I have found a deep learning model; is there any other way to extract facial features?
It seems that Haar cascades don't work in this case. You can use this link: https://www.pyimagesearch.com/2017/04/10/detect-eyes-nose-lips-jaw-dlib-opencv-python/
This uses a pretrained model (dlib's facial landmark predictor). The extracted features are returned as a list of coordinates, so they can be manipulated with OpenCV.
This model is very accurate and helpful.

Speaker Recognition System using Python

I'm trying to make a speaker recognition (not speech but speaker) system using Python. I've extracted MFCC features from both the training audio file and the test audio file and have built a GMM for each. I'm not sure how to compare the models to compute a similarity score that the system can use to validate the test audio. I've been struggling for 4 days to get this done. I would be glad if someone could help.
From what I can understand from the question, you are describing an aspect of the cocktail party problem.
I have found a whitepaper with a solution to your problem that uses a modified iterative Wiener filter and a multi-layer perceptron neural network to separate speakers into separate channels.
Interestingly, the cocktail party problem can be solved in one line in Octave: [W,s,v]=svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');
You can read more about it in this Stack Overflow post.
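On the scoring part of the question itself, a widely used alternative to comparing two GMMs directly is to score the test MFCC frames against the GMM trained on the enrolled speaker, and accept if the average log-likelihood beats that of a background model by some threshold. The sketch below is only an illustration under assumptions: it uses diagonal covariances, toy two-dimensional "frames" instead of real MFCCs, and hypothetical names (`gmm_avg_log_likelihood`, `verify`, the background model, the threshold). If the models were fitted with scikit-learn's `GaussianMixture`, its `score(X)` method returns the same kind of per-sample average log-likelihood.

```python
import math


def gmm_avg_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM."""
    total = 0.0
    for x in frames:
        comp_logs = []
        for w, mu, var in zip(weights, means, variances):
            log_p = math.log(w)
            for xi, mi, vi in zip(x, mu, var):
                log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
            comp_logs.append(log_p)
        # log-sum-exp over mixture components for numerical stability
        m = max(comp_logs)
        total += m + math.log(sum(math.exp(c - m) for c in comp_logs))
    return total / len(frames)


def verify(frames, speaker_gmm, background_gmm, threshold=0.0):
    """Log-likelihood ratio test: speaker model versus background model."""
    score = (gmm_avg_log_likelihood(frames, *speaker_gmm)
             - gmm_avg_log_likelihood(frames, *background_gmm))
    return score, score > threshold


# Toy models as (weights, means, variances); real ones would come from training.
speaker_gmm = ([1.0], [[0.0, 0.0]], [[1.0, 1.0]])
background_gmm = ([1.0], [[5.0, 5.0]], [[1.0, 1.0]])
test_frames = [[0.1, -0.2], [0.0, 0.3]]

score, accepted = verify(test_frames, speaker_gmm, background_gmm)
```

The threshold would normally be tuned on held-out recordings; 0.0 is just a placeholder here.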

TensorFlow model for Arabic OCR

I am a beginner in Tensorflow and I want to build an OCR model with Tensorflow that detects Arabic words from cursive Arabic fonts (i.e. joint Arabic handwriting). Ideally, the model would be able to detect both Arabic and English. Please see the attached image of a page in a dictionary that I am currently trying to OCR. The other pages in the book have the same font and layout with both English and Arabic.
I have two questions:
(1) Would I be training with individual characters in the joint/cursive Arabic text or would I need bounding boxes for the entire words or individual characters?
(2) Are there any other OCR TensorFlow (or Keras) models available that deal with cursive writing, particularly Arabic?
Tesseract, an OCR engine from Google, has an Arabic trained model.
Learn more about it here: https://github.com/tesseract-ocr/tesseract
Languages it supports are here: https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages
The Arabic trained data file is here: https://github.com/tesseract-ocr/tessdata/blob/master/ara.traineddata
Hope this helps!
I don't think you can use the whole page as the input image; going word by word may be a better choice as a first solution. Take a look at these links:
https://hackernoon.com/latest-deep-learning-ocr-with-keras-and-supervisely-in-15-minutes-34aecd630ed8
http://ai.stanford.edu/~ang/papers/ICPR12-TextRecognitionConvNeuralNets.pdf
How to create dataset in the same format as the FSNS dataset?

Experimenting with creating OCR in tensorflow, what to do after training on letters?

Honestly, I'm just stuck and can't think. I have worked hard to create an amazing model that can read letters, but how do I move on to words, sentences, paragraphs and full papers?
This is a general question so forgive me for not providing code, but assume I have successfully trained a network at recognizing letters of many kinds and many fonts, with all sorts of different noise and distortions in the image.
(just to be technical, the images the model is trained on are 36*36 grayscale images only, and the model is a simple classifier with some conv2d layers)
Now I want to use this well-trained model with all its parameters and give it something to read, to turn it into a full OCR program. This is where I'm stuck. I want to give the program a photo/scan of a paper and have it recognize all the letters. But how do I "predict" using my model when the image is obviously larger than the single-letter images it was trained on?
I have tried adding an additional layer of conv2d that would try to read features of parts of the image, but that was too complicated and I couldn't figure it out.
I have also looked at OpenCV programs that recognize where there is text in the image and crop it out, but none that I could find separate out single letters that could then be fed to the trained model.
What is my next step from here?
If the font of the letters is the same throughout the whole image, you could use the so-called "sliding window technique".
You start from the upper left corner and slide your scan window to the right by the width of one letter until you reach the right edge, then move down one line and repeat until you reach the end of the paper.
The sliding window is the same size as a single scanned letter, so when you feed it to your neural network it outputs the letter. Save those letters somewhere.
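The sliding window idea can be sketched roughly as follows. This is an illustration under assumptions, not a full OCR pipeline: the image is a plain list of pixel rows, `classify` is a hypothetical stand-in for the trained letter model, and the letters are assumed to sit on a regular grid with a fixed size (36 pixels by default, matching the 36*36 training images mentioned in the question).

```python
def sliding_windows(image, win_h, win_w, step):
    """Yield (top, left, patch), scanning left to right, then top to bottom."""
    height, width = len(image), len(image[0])
    for top in range(0, height - win_h + 1, step):
        for left in range(0, width - win_w + 1, step):
            patch = [row[left:left + win_w] for row in image[top:top + win_h]]
            yield top, left, patch


def read_page(image, classify, win=36, step=36):
    """Run a per-letter classifier on every window; return one string per row of windows."""
    lines = {}
    for top, left, patch in sliding_windows(image, win, win, step):
        lines.setdefault(top, []).append(classify(patch))
    return ["".join(chars) for _, chars in sorted(lines.items())]


# Toy demo: a 4x6 "page", 2x2 windows, and a fake classifier that
# reports '#' for a window containing any ink and '.' otherwise.
page = [
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
]


def toy_classify(patch):
    return "#" if sum(map(sum, patch)) > 0 else "."


result = read_page(page, toy_classify, win=2, step=2)
```

With a real model, `classify` would wrap the network's predict call on each patch. A step smaller than the window size gives overlapping windows, which helps when letters are not perfectly aligned but then requires merging duplicate detections.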
Other methods would include changing your neural network and being smarter about detecting blobs of text on the scanned paper.
If you are looking for an off-the-shelf solution, take a look at Tesseract OCR.
Check out the following links for ideas:
STN-OCR: A single Neural Network for Text Detection and Text Recognition
STN-OCR on Medium
Attention-based Extraction of Structured Information from Street View Imagery
Another Attention-based OCR Repo
A model using both CNN and LSTM
