How to detect sentences written in Hindi using roman script - python

I have a list of comments on YouTube videos in a CSV file; each row contains one comment. The problem is that the comments are in different languages, e.g. Hindi in Devanagari script, English in Roman script, and Hindi comments in Roman script (some people call it Hinglish).
Is there a way to extract the rows containing Hindi comments in Roman script for further processing? A regex to detect such a pattern would be a great help.

In the general case, regular expressions are not a good solution to this problem. This is related to Why is it such a bad idea to parse XML with regex? -- a regular expression is excellent for identifying a pattern which doesn't depend on its surroundings, but that's not how human language works. In Indo-Aryan languages, you have "action at a distance" phenomena like sandhi which are hard to model with a regex.
If your target is only text which is either in English or in romanized Hindi, though, you can probably find some heuristics which identify them with limited accuracy. For example, observe that romanized Hindi contains digraphs which are unusual in English, such as bh, dh, and aa. Conversely, some digraphs common in English are unlikely in Hindi.
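For example, here is a minimal heuristic sketch along those lines; the hint lists below are illustrative assumptions, not a vetted lexicon, so expect limited accuracy:

HINDI_HINTS = ["bh", "dh", "kh", "aa", "ee", "hai", "nahi", "kya", "acha"]
ENGLISH_HINTS = ["th", "wh", "ough", "tion", "the "]

def looks_like_romanized_hindi(text, margin=1):
    # Count substrings common in romanized Hindi vs. common in English.
    text = text.lower()
    hindi_score = sum(text.count(h) for h in HINDI_HINTS)
    english_score = sum(text.count(e) for e in ENGLISH_HINTS)
    return hindi_score - english_score >= margin

print(looks_like_romanized_hindi("yeh video bahut acha hai bhai"))  # likely True
print(looks_like_romanized_hindi("this video is really good"))      # likely False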
However, a better solution with the same basic approach would be to train a simple language-identification model which works out a statistical probability based on the characteristics of an entire input text, instead of having a regex make a black-and-white decision based on individual letter pairs. Python: How to determine the language? has some suggestions for Python modules which do this.
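For instance, here is a rough sketch using the langdetect package (pip install langdetect). Note the caveat: off-the-shelf detectors are trained on Devanagari Hindi and standard English, so they cannot label Hinglish directly; this only filters out rows that are confidently one or the other and keeps the rest as candidates.

from langdetect import detect_langs, DetectorFactory

DetectorFactory.seed = 0  # make results reproducible

def bucket(comment, min_prob=0.9):
    try:
        best = detect_langs(comment)[0]
    except Exception:  # langdetect raises on empty / featureless input
        return "unknown"
    if best.lang == "hi" and best.prob >= min_prob:
        return "hindi_devanagari"
    if best.lang == "en" and best.prob >= min_prob:
        return "english"
    return "possible_hinglish"  # candidate rows for further processing

print(bucket("यह वीडियो बहुत अच्छा है"))         # hindi_devanagari
print(bucket("this video is really good"))       # english (most likely)
print(bucket("yeh video bahut accha hai bhai"))  # often lands in possible_hinglish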

Related

Unsure of how to get started with using NLP for analyzing user feedback

I have ~138k records of user feedback that I'd like to analyze to understand broad patterns in what our users are most often saying. Each one has a rating between 1-5 stars, so I don't need to do any sort of sentiment analysis. I'm mostly interested in splitting the dataset into >=4 stars to see what we're doing well and <= 3 stars to see what we need to improve upon.
One key problem I'm running into is that I expect to see a lot of n-grams. Some of these I know, like "HOV lane", "carpool lane", "detour time", "out of my way", etc. But I also want to detect common bi- and tri-grams programmatically. I've been playing around with spaCy a bit, but it doesn't seem to have any capability to do analysis at the corpus level, only at the document level.
Ideally my pipeline would look something like this (I think):
1. Import a list of known n-grams into the tokenizer
2. Process each string into a tokenized document, removing punctuation, stopwords, etc., while respecting the known n-grams during tokenization (i.e., "HOV lane" should be a single noun token)
3. Identify the most common bi- and tri-grams in the corpus that I missed
4. Re-tokenize using the found n-grams
5. Split by rating (>=4 and <=3)
6. Find the most common topics for each split of data in the corpus
I can't seem to find a single tool, or even a collection of tools, that will allow me to do what I want here. Am I approaching this the wrong way somehow? Any pointers on how to get started would be greatly appreciated!
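For the corpus-level bi-/tri-gram detection in steps 3 and 4 of the pipeline above, one option worth knowing about is gensim's Phrases model. A rough sketch follows; the toy token lists stand in for the tokenized feedback and are purely illustrative (connector_words requires gensim >= 4.0):

from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS

token_lists = [
    ["the", "hov", "lane", "saved", "me", "time"],
    ["hov", "lane", "was", "closed", "again"],
    ["the", "detour", "time", "was", "way", "off"],
    ["detour", "time", "estimate", "was", "wrong"],
]

# Learn corpus-level bigrams; chaining a second Phrases model over the
# bigrammed corpus yields trigrams as well.
bigram = Phrases(token_lists, min_count=1, threshold=1,
                 connector_words=ENGLISH_CONNECTOR_WORDS)
trigram = Phrases(bigram[token_lists], min_count=1, threshold=1,
                  connector_words=ENGLISH_CONNECTOR_WORDS)

for tokens in token_lists:
    print(trigram[bigram[tokens]])  # e.g. ['the', 'hov_lane', 'saved', 'me', 'time']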
Bingo, there are state-of-the-art results for your problem!
It's called zero-shot learning: state-of-the-art NLP models for text classification without annotated data.
For code and details, read the blog: https://joeddav.github.io/blog/2020/05/29/ZSL.html
Let me know if it works for you, or if you need any other help.
The VADER tool works well for sentiment analysis and NLP-based applications.
I think the proposed workflow is fine for this case study. Pay close attention to your feature extraction, as it matters a lot.
Most of the time, tri-grams make good sense for these use cases.
Using spaCy would be a good decision: compared with regular expressions, spaCy's rule-based matching engine and components not only help you find the terms and phrases you are searching for, but also give you access to the tokens within a text and their relationships.
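As a concrete illustration of that rule-based matching, here is a rough sketch using spaCy's PhraseMatcher plus the retokenizer so that known n-grams such as "HOV lane" end up as single tokens; the model name en_core_web_sm is an assumption, and any English pipeline should do:

import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

nlp = spacy.load("en_core_web_sm")

known_ngrams = ["HOV lane", "carpool lane", "detour time", "out of my way"]
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("KNOWN_NGRAMS", [nlp.make_doc(text) for text in known_ngrams])

def tokenize_with_ngrams(text):
    doc = nlp(text)
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    with doc.retokenize() as retokenizer:
        for span in filter_spans(spans):  # drop any overlapping matches
            retokenizer.merge(span)
    return [t.text for t in doc if not t.is_punct and not t.is_stop]

print(tokenize_with_ngrams("The HOV lane adds 10 minutes of detour time."))
# e.g. ['HOV lane', 'adds', '10', 'minutes', 'detour time']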

How to split a Chinese paragraph into sentences in Python?

Since Chinese is different from English, how can we split a Chinese paragraph into sentences in Python? A Chinese paragraph sample is given as
我是中文段落,如何为我分句呢?我的宗旨是“先谷歌搜索,再来问问题”,我已经搜索了,但是没找到好的答案。
To the best of my knowledge,
from nltk import tokenize
tokenize.sent_tokenize(paragraph, "chinese")
does not work because tokenize.sent_tokenize() doesn't support Chinese.
All the methods I found through a Google search rely on regular expressions (such as
re.split(r'(。|！|\!|\.|？|\?)', paragraph_variable)
). Those methods are not complete enough. It seems that no single regular-expression pattern can be employed to split a Chinese paragraph into sentences correctly. I guess there should be some learned patterns to accomplish this task, but I can't find them.
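For what it is worth, a slightly more robust version of the regex approach splits just after the sentence-final punctuation, so the punctuation stays attached to its sentence. This is still only a rough sketch and will mis-handle quotes, ellipses, and other edge cases, which is exactly why a learned model is appealing (zero-width splitting requires Python 3.7+):

import re

def split_chinese_sentences(paragraph):
    # Split at the position immediately after sentence-final punctuation.
    pieces = re.split(r'(?<=[。！？!?])', paragraph)
    return [p.strip() for p in pieces if p.strip()]

paragraph = "我是中文段落,如何为我分句呢?我的宗旨是“先谷歌搜索,再来问问题”,我已经搜索了,但是没找到好的答案。"
print(split_chinese_sentences(paragraph))
# ['我是中文段落,如何为我分句呢?', '我的宗旨是“先谷歌搜索,再来问问题”,我已经搜索了,但是没找到好的答案。']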

How do I recognize what string from a list is closest to my input?

I am working on a school project and have a function that recognizes a comment, finds the information in the comment, and writes it down to a file. How could I check an input string against a list of strings of information? For example, if I have the input
input = "How many fingers do I have?"
How do I check which of these is closest to it?
fingers = "You have 10."
pigs = "yummy"
I want it to respond with fingers. I want to match it with the variable name and not the variable's value.
I suggest you read this chapter.
This is a chapter from Natural Language Processing with Python, by Steven Bird, Ewan Klein and Edward Loper.
Detecting patterns is a central part of Natural Language Processing. Words ending in -ed tend to be past tense verbs (5). Frequent use of will is indicative of news text (3). These observable patterns — word structure and word frequency — happen to correlate with particular aspects of meaning, such as tense and topic. But how did we know where to start looking, which aspects of form to associate with which aspects of meaning?
The goal of this chapter is to answer the following questions: How can we identify particular features of language data that are salient for classifying it? How can we construct models of language that can be used to perform language processing tasks automatically? What can we learn about language from these models?
It's all described in Python, and it's very effective.
http://www.nltk.org/book/ch06.html
Also, processing the text by matching a keyword against a variable name is neither good practice nor efficient; I wouldn't recommend it.
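As a minimal sketch of that advice: keep the candidate answers in a dictionary keyed by topic rather than relying on variable names, and pick the topic whose keywords overlap the input the most. The keyword lists here are illustrative assumptions:

import re

responses = {
    "fingers": "You have 10.",
    "pigs": "yummy",
}
keywords = {
    "fingers": {"finger", "fingers", "hand", "hands"},
    "pigs": {"pig", "pigs", "bacon"},
}

def answer(text):
    words = set(re.findall(r"[a-z]+", text.lower()))
    # Score each topic by how many of its keywords appear in the input.
    topic = max(keywords, key=lambda t: len(words & keywords[t]))
    return responses[topic]

print(answer("How many fingers do I have?"))  # -> "You have 10."

The NLTK chapter linked above shows how to replace this keyword overlap with a trained classifier once you have labelled examples.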

Is there any way in Python to count syllables without the use of a dictionary?

CMUdict works for the English language, but what if I want to count the syllables of content in another language?
This depends on the language. This may sound like an obvious answer, but it all comes down to how the orthography is designed. In English, syllables are pretty much independent of how the words are written, so you'd need a dictionary. Many other languages are like this.
Certain other languages though (like (South) Korean, Japanese Hiragana and Katakana (but not Kanji)) are written in such a way that the characters themselves are obviously matched up with a syllable or a specific number of syllables. In that case, if you know how those languages work, you could theoretically use Python to break the writing up into syllables.
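For instance, here is a small sketch for the Korean case: every precomposed Hangul syllable block lies in the Unicode range U+AC00 to U+D7A3, so counting those characters counts the syllables of Korean text directly.

def count_hangul_syllables(text):
    # Each precomposed Hangul syllable block corresponds to one syllable.
    return sum(1 for ch in text if "\uac00" <= ch <= "\ud7a3")

print(count_hangul_syllables("안녕하세요"))  # 5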
Otherwise, you'd need a dictionary, or some other computational-linguistics platform that takes care of this. Poke around NLTK and see what you can find.
In general, no. For some languages there might be a way, but if you don't have a dictionary you'd need knowledge of those languages' linguistic structure. How words are divided into syllables varies from language to language.
You certainly can't do it in a general way for all languages, because different languages render sounds to text differently.
For example, the Hungarian word "vagy" looks like 2 syllables to an English speaker, but it's only one. And the English word "bike" would naturally be read as 2 syllables by speakers of many other languages.
Furthermore, for English you probably can't do this very accurately without a dictionary anyway, because English has so much bizarre variation in its spelling. For example, we pronounce the "oe" in "poet" as two distinct syllables, but only one in "does". This is probably true of some other languages as well.

How would I go about categorizing sentences according to tense (present, past, future, etc.)?

I want to parse a text and categorize the sentences according to their grammatical structure, but I have a very small understanding of NLP so I don't even know where to start.
As far as I have read, I need to parse the text and find out (or tag?) the part-of-speech of every word. Then I search for the verb clause or whatever other defining characteristic I want to use to categorize the sentences.
What I don't know is if there is already some method to do this more easily or if I need to define the grammar rules separately or what.
Any resources on NLP that discuss this would be great. Program examples are welcome as well. I have used NLTK before, but not extensively. Other parsers or languages are OK too!
The Python Natural Language Toolkit (NLTK) is a library well suited to this kind of work. As with any NLP library, you will have to download the training datasets separately; corpora (data) and training scripts are available too.
There are also tutorials that will help you identify the parts of speech of words. By all means, I think nltk.org should be the place to go for what you are looking for.
Specific questions can be posted here again.
Maybe you simply need to define patterns like "noun verb noun" etc. for each type of grammatical structure and search for matches in the part-of-speech tagger's output sequence.
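A rough sketch of that idea, assuming NLTK's default English tagger (the tokenizer and tagger data packages need to be downloaded once via nltk.download); the tag-to-tense mapping below is a heuristic, not a full grammatical analysis:

import nltk

def guess_tense(sentence):
    # Penn Treebank tags: VBD/VBN suggest past, MD ("will") + VB suggests
    # future, VBP/VBZ/VBG suggest present.
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))]
    if "MD" in tags and "VB" in tags:
        return "future"
    if "VBD" in tags or "VBN" in tags:
        return "past"
    if any(t in tags for t in ("VBP", "VBZ", "VBG")):
        return "present"
    return "unknown"

print(guess_tense("I walked to the store."))     # past
print(guess_tense("She will call you later."))   # future
print(guess_tense("He writes code every day."))  # present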
