Since Chinese is different from English, how can we split a Chinese paragraph into sentences (in Python)? A sample Chinese paragraph is given as
我是中文段落,如何为我分句呢?我的宗旨是“先谷歌搜索,再来问问题”,我已经搜索了,但是没找到好的答案。 (Roughly: "I am a Chinese paragraph; how should I be split into sentences? My motto is 'Google first, then ask'; I have already searched but found no good answer.")
To the best of my knowledge,
from nltk import tokenize
tokenize.sent_tokenize(paragraph, "chinese")
does not work, because tokenize.sent_tokenize() does not support Chinese.
All the methods I found through a Google search rely on regular expressions (such as
re.split('(。|!|\!|\.|?|\?)', paragraph_variable)
). Those methods are not complete enough. It seems that no single regular-expression pattern can split a Chinese paragraph into sentences correctly. I suspect there must be some learned patterns that can accomplish this task, but I can't find them.
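For reference, a minimal regex-based sketch along those lines is below. It keeps the terminal punctuation attached to each sentence, but the delimiter set is my own assumption and it still misses the hard cases (ellipses, nested quotes, enumeration commas), which is exactly the limitation described above.
import re

# Sketch only: the delimiter set is an assumption and will miss edge cases.
_SENTENCE = re.compile(r'[^。！？!?]*[。！？!?]+[”」』]?')

def split_chinese_sentences(paragraph):
    """Split on 。！？!? while keeping the punctuation attached to the sentence."""
    sentences = _SENTENCE.findall(paragraph)
    tail = _SENTENCE.sub('', paragraph).strip()  # trailing text without terminal punctuation
    if tail:
        sentences.append(tail)
    return sentences

print(split_chinese_sentences('我是中文段落。如何为我分句呢？'))
# -> ['我是中文段落。', '如何为我分句呢？']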
Related
I have a list of comments on YouTube videos in a CSV file; each row contains one comment. The problem is that the comments are in different languages, e.g. Hindi in Devanagari script, English in Roman script, and Hindi in Roman script (some people call it Hinglish).
Is there a way to extract the rows containing Hindi comments written in Roman script for further processing? A regex that detects such a pattern would be a great help.
In the general case, regular expressions are not a good solution to this problem. This is related to Why is it such a bad idea to parse XML with regex? -- a regular expression is excellent for identifying a pattern that doesn't depend on its surroundings, but that's not how human language works. In Indo-Aryan languages you have "action at a distance" phenomena like sandhi, which are hard to model with a regex.
If your target is solely text which is either in English or in Hindi, you can probably find some heuristics which identify them with some limited accuracy, though. For example, observe that Hindi contains digraphs which are unusual in English, such as bh and dh and aa. Conversely, some digraphs of English are unlikely in Hindi.
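As a rough illustration of that digraph heuristic, a scoring sketch might look like the following; the digraph list and the threshold are assumptions and would need tuning on real comments.
def romanized_hindi_score(text):
    """Count digraphs that are common in romanized Hindi but unusual in English.

    The digraph list is illustrative ('bh', 'dh', 'aa' come from the answer above;
    the rest are assumptions) and the threshold used below is arbitrary.
    """
    text = text.lower()
    hindi_digraphs = ("bh", "dh", "aa", "kh", "chh")
    return sum(text.count(d) for d in hindi_digraphs)

# Usage sketch: keep rows whose score reaches an (assumed) threshold.
comments = ["kya baat hai bhai", "this video is great"]
likely_hinglish = [c for c in comments if romanized_hindi_score(c) >= 2]
print(likely_hinglish)  # -> ['kya baat hai bhai']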
However, a better solution with the same basic approach would be to train a simple language identification model which works out a statistical probability based on the characteristics of an entire input text, instead of having a regex make a black vs white decision based on individual letter pairs. Python: How to determine the language? has some suggestions for Python modules which do this.
I want to segment text at the punctuation marks in a sentence or paragraph. If I include the comma (,) in my regex, it also chunks individual nouns, verbs, or adjectives that are separated by commas.
Suppose we have "dogs, cats, rats and other animals". "Dogs" becomes a separate chunk, which I do not want to happen.
Is there any way, using regex or any other means in NLTK, to ignore those commas so that I only get comma-separated clauses as text segments?
Code
from nltk import sent_tokenize
import re
text = "Peter Mattei's 'Love in the Time of Money' is a visually stunning film to watch. Mrs. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situation we encounter.
text= re.sub("(?<=..Dr|.Mrs|..Mr|..Ms|Prof)[.]","<prd>", text)
txt = re.split(r'\.\s|;|:|\?|\'\s|"\s|!|\s\'|\s\"', text)
print(txt)
This is too complicated to be solved with a regex: there is no way for the regex to know that there is a predicate (verb) within the clause candidate, and if you expand it, you will break into another clause.
The problem you are trying to solve is called chunking in NLP. Traditionally, there were regex-based algorithms over POS tags (so you need to do POS tagging first). NLTK has a tutorial for that; however, this is a rather outdated approach.
Now that fast and reliable taggers and parsers are available (e.g., in spaCy), I would suggest analyzing the sentence first and then finding the chunks in a constituency parse.
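As a concrete sketch of that direction, here is one possible heuristic using spaCy: it refuses to split at a comma that introduces a conjunct, so an enumeration like "dogs, cats, rats and other animals" stays in one segment. The 'conj' check and the en_core_web_sm model are my assumptions, not the only way to do this.
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def clause_segments(text):
    """Split at commas/semicolons, except list commas inside an enumeration.

    Heuristic sketch: a comma immediately followed by a token attached as a
    conjunct ('conj' dependency) is treated as part of an enumeration.
    """
    doc = nlp(text)
    segments, start = [], 0
    for tok in doc:
        if tok.text in {",", ";"}:
            nxt = doc[tok.i + 1] if tok.i + 1 < len(doc) else None
            if nxt is not None and nxt.dep_ == "conj":
                continue  # list comma: keep the enumeration together
            segments.append(doc[start:tok.i].text)
            start = tok.i + 1
    segments.append(doc[start:].text)
    return [s.strip() for s in segments if s.strip()]

print(clause_segments("Dogs, cats, rats and other animals live here, and they are noisy."))
# expected: ['Dogs, cats, rats and other animals live here', 'and they are noisy.']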
I want to build a model for language modelling, which should predict the next words in a sentence, given the previous word(s) and/or the previous sentence.
Use case: I want to automate writing reports. So the model should automatically complete the sentence I am writing. Therefore, it is important that nouns and the words at the beginning of a sentence are capitalized.
Data: The data is in German and contains a lot of technical jargon.
My text corpus is in German and I am currently working on the preprocessing. Because my model should predict grammatically correct sentences, I have decided which preprocessing steps to use and which to skip (a short sketch of the replacement steps follows the list):
no stopword removal
no lemmatization
replace all numeric expressions with NUMBER
normalisation of synonyms and abbreviations
replace rare words with RARE
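A minimal sketch of the two replacement steps is below (the NUMBER and RARE tokens come from the list above; the whitespace tokenization and the frequency threshold are assumptions):
import re
from collections import Counter

def preprocess(sentences, rare_threshold=2):
    """Replace numeric expressions with NUMBER and infrequent words with RARE (sketch)."""
    tokenized = [re.sub(r"\d+([.,]\d+)?", "NUMBER", s).split() for s in sentences]
    counts = Counter(tok for sent in tokenized for tok in sent)
    return [[tok if counts[tok] >= rare_threshold else "RARE" for tok in sent]
            for sent in tokenized]

print(preprocess(["Der Druck betrug 3,5 bar .", "Der Druck stieg auf 4 bar ."]))
# -> [['Der', 'Druck', 'RARE', 'NUMBER', 'bar', '.'],
#     ['Der', 'Druck', 'RARE', 'RARE', 'NUMBER', 'bar', '.']]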
However, I am not sure whether to convert the corpus to lowercase. When searching the web I found differing opinions: although lower-casing is quite common, it would cause my model to predict the capitalization of nouns, sentence beginnings, etc. incorrectly.
I also found the idea to convert only the words at the beginning of a sentence to lower-case on the following Stanford page.
What is the best strategy for this use-case? Should I convert the text to lower-case and change the words to the correct case after prediction? Should I leave the capitalization as it is? Should I only lowercase words at the beginning of a sentence?
Thanks a lot for any suggestions and experiences!
I think that for your particular use case it would be better to convert the text to lowercase, because ultimately you need to predict words given a certain context, and you probably won't need to predict sentence beginnings. Also, if a noun is predicted you can capitalize it later. Consider the other way round (assuming an English corpus for the moment): your model might treat a word at the beginning of a sentence, with its capital letter, differently from the same word appearing later in the sentence without one, which can hurt accuracy. Lowercasing therefore seems like the better trade-off; in a question-answering project I worked on, converting the text to lowercase was a good trade-off.
Edit: since your corpus is in German, it would be better to retain the capitalization, since it is an important aspect of the German language.
If it is of any help, spaCy supports German. You can use it to train your model.
In general, tRuEcasIng helps.
Truecasing is the process of restoring case information to badly-cased or noncased text.
See
How can I best determine the correct capitalization for a word?
https://github.com/nreimers/truecaser
https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/train-truecaser.perl
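The linked truecasers are statistical; a bare-bones, frequency-based sketch of the same idea (remember the most frequent surface casing of each word in a well-cased corpus and restore it at prediction time) could look like this. The toy corpus is an assumption, and a real truecaser also has to handle sentence-initial and unseen words more carefully.
from collections import Counter, defaultdict

def train_truecaser(well_cased_sentences):
    """Learn the most frequent casing of each word from a well-cased corpus (sketch)."""
    stats = defaultdict(Counter)
    for sent in well_cased_sentences:
        for tok in sent.split():
            stats[tok.lower()][tok] += 1
    return {low: forms.most_common(1)[0][0] for low, forms in stats.items()}

def truecase(sentence, casing):
    return " ".join(casing.get(tok.lower(), tok) for tok in sentence.split())

casing = train_truecaser(["Der Bericht beschreibt den Druck im Kessel .",
                          "Wir messen den Druck täglich ."])
print(truecase("der druck im kessel steigt .", casing))
# -> 'Der Druck im Kessel steigt .'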
Definitely convert the majority of the words to lowercase, but consider the following cases:
Acronyms, e.g. MIT: if you lowercase it to mit, which is a word in German, you'll be in trouble
Initials, e.g. J. A. Snow
Enumerations, e.g. (I), (II), (III), APPENDIX A
I would also advise against the <RARE> token: what percentage of your corpus is <RARE>, and what about unknown words?
Since you are dealing with German, where words can be arbitrarily long and rare, you might need a way to break them down further.
Thus some sort of lemmatization and tokenization is needed.
I recommend using spaCy, which has supported German from day one; the support and docs are very helpful (thank you Matthew and Ines).
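A minimal sketch of tokenizing and lemmatizing German text with spaCy, assuming the de_core_news_sm model is installed:
import spacy

# Assumes: pip install spacy && python -m spacy download de_core_news_sm
nlp = spacy.load("de_core_news_sm")

doc = nlp("Die Druckmessungen wurden im Kesselhaus durchgeführt.")
for token in doc:
    # token.lemma_ is the lemma, token.pos_ the coarse part-of-speech tag
    print(token.text, token.lemma_, token.pos_)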
Background: I got a lot of text that has some technical expressions, which are not always standard.
I know how to find the bigrams and filter them.
Now, I want to use them when tokenizing the sentences. So words that should stay together (according to the calculated bigrams) are kept together.
I would like to know if there is a correct way to do this within NLTK. If not, I can think of various inefficient ways of rejoining all the broken words by checking dictionaries.
The way topic modelers usually pre-process text with n-grams is to connect the words with an underscore (say, topic_modeling or white_house). You can do that when identifying the bigrams themselves. And don't forget to make sure that your tokenizer does not split on underscores (Mallet does, if token-regex is not set explicitly).
P.S. NLTK's native bigram collocation finder is super slow; if you want something more efficient, look around if you haven't yet, or create your own based on, say, Dunning (1993).
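For reference, a sketch of the underscore approach with NLTK's collocation finder; the frequency filter, the number of bigrams kept, and the toy text are assumptions:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.tokenize import word_tokenize  # assumes nltk.download('punkt') has been run

text = ("topic modeling is widely used ; topic modeling tools such as "
        "Mallet make topic modeling accessible .")
tokens = word_tokenize(text.lower())

# Score bigrams with the likelihood-ratio measure (Dunning, 1993).
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # assumed threshold
top_bigrams = set(finder.nbest(BigramAssocMeasures.likelihood_ratio, 10))

def join_bigrams(tokens, bigrams):
    """Rejoin adjacent tokens that form a selected bigram with an underscore."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(join_bigrams(tokens, top_bigrams))  # e.g. ['topic_modeling', 'is', ...]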
I have 30,000+ French-language articles in a JSON file. I would like to perform some text analysis on both individual articles and on the set as a whole. Before I go further, I'm starting with simple goals:
Identify important entities (people, places, concepts)
Find significant changes in the importance (~=frequency) of those entities over time (using the article sequence number as a proxy for time)
The steps I've taken so far:
Imported the data into a Python list:
import json
json_articles=open('articlefile.json')
articlelist = json.load(json_articles)
Selected a single article to test, and concatenated the body text into a single string:
txt = ' '.join(articlelist[10000]['body'])
Loaded a French sentence tokenizer and split the string into a list of sentences:
import nltk
french_tokenizer = nltk.data.load('tokenizers/punkt/french.pickle')
sentences = french_tokenizer.tokenize(txt)
Attempted to split the sentences into words using the WhitespaceTokenizer:
from nltk.tokenize import WhitespaceTokenizer
wst = WhitespaceTokenizer()
tokens = [wst.tokenize(s) for s in sentences]
This is where I'm stuck, for the following reasons:
NLTK doesn't have a built-in tokenizer which can split French into words. Whitespace doesn't work well, particularly because it won't correctly split on apostrophes.
Even if I were to use regular expressions to split into individual words, there's no French PoS (part-of-speech) tagger that I can use to tag those words, and no way to chunk them into logical units of meaning.
For English, I could tag and chunk the text like so:
tagged = [nltk.pos_tag(token) for token in tokens]
chunks = nltk.batch_ne_chunk(tagged)
My main options (in order of current preference) seem to be:
Use nltk-trainer to train my own tagger and chunker.
Use the python wrapper for TreeTagger for just this part, as TreeTagger can already tag French, and someone has written a wrapper which calls the TreeTagger binary and parses the results.
Use a different tool altogether.
If I were to do (1), I imagine I would need to create my own tagged corpus. Is this correct, or would it be possible (and permitted) to use the French Treebank?
If the French Treebank corpus format (example here) is not suitable for use with nltk-trainer, is it feasible to convert it into such a format?
What approaches have French-speaking users of NLTK taken to PoS tag and chunk text?
There is also TreeTagger (which supports a French corpus) with a Python wrapper. This is the solution I am currently using and it works quite well.
As of version 3.1.0 (January 2012), the Stanford PoS tagger supports French.
It should be possible to use this French tagger in NLTK, using Nitin Madnani's Interface to the Stanford POS-tagger
I haven't tried this yet, but it sounds easier than the other approaches I've considered, and I should be able to control the entire pipeline from within a Python script. I'll comment on this post when I have an outcome to share.
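In case it helps anyone trying the same route, a rough sketch of calling the tagger through NLTK's StanfordPOSTagger wrapper is below; the jar and model paths (including the french.tagger file name) are assumptions that depend on where you unpack the Stanford distribution.
from nltk.tag import StanfordPOSTagger

# Paths and the model file name are assumptions; point them at your own
# download of the Stanford POS tagger.
jar = "/path/to/stanford-postagger.jar"
model = "/path/to/models/french.tagger"

tagger = StanfordPOSTagger(model, path_to_jar=jar, encoding="utf8")
tokens = ["Cette", "phrase", "est", "en", "français", "."]
print(tagger.tag(tokens))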
Here are some suggestions:
WhitespaceTokenizer is doing what it's meant to do. If you want to split on apostrophes, try WordPunctTokenizer, check out the other available tokenizers, or roll your own with RegexpTokenizer or directly with the re module.
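For instance (a quick sketch; the custom pattern for French elisions is an assumption and only covers the simple l'/d'/j' cases):
from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer, RegexpTokenizer

sentence = "L'article décrit les résultats de l'étude."

print(WhitespaceTokenizer().tokenize(sentence))
# -> ["L'article", 'décrit', 'les', 'résultats', 'de', "l'étude."]

print(WordPunctTokenizer().tokenize(sentence))
# -> ['L', "'", 'article', 'décrit', 'les', 'résultats', 'de', 'l', "'", 'étude', '.']

# A custom pattern that keeps the elided article together with its apostrophe:
clitic_tokenizer = RegexpTokenizer(r"\w+['’]|\w+|[^\w\s]")
print(clitic_tokenizer.tokenize(sentence))
# -> ["L'", 'article', 'décrit', 'les', 'résultats', 'de', "l'", 'étude', '.']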
Make sure you've resolved text encoding issues (unicode or latin1), otherwise the tokenization will still go wrong.
The nltk only comes with the English tagger, as you discovered. It sounds like using TreeTagger would be the least work, since it's (almost) ready to use.
Training your own is also a practical option. But you definitely shouldn't create your own training corpus! Use an existing tagged corpus of French. You'll get best results if the genre of the training text matches your domain (articles). Also, you can use nltk-trainer but you could also use the NLTK features directly.
You can use the French Treebank corpus for training, but I don't know if there's a reader that knows its exact format. If not, you must start with XMLCorpusReader and subclass it to provide a tagged_sents() method.
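A skeleton of such a subclass might look like this; the element names (SENT, w) and the cat attribute are assumptions based on the linked French Treebank example, so check them against the actual files:
from nltk.corpus.reader import XMLCorpusReader

class FrenchTreebankReader(XMLCorpusReader):
    """Sketch of a reader exposing tagged_sents() for FTB-style XML files.

    Assumed structure: <SENT> elements containing <w cat="...">word</w>;
    adjust the element and attribute names to the real corpus format.
    """

    def tagged_sents(self, fileids=None):
        sents = []
        for fileid in (fileids or self.fileids()):
            root = self.xml(fileid)  # ElementTree element parsed by XMLCorpusReader
            for sent in root.iter("SENT"):
                sents.append([(w.text.strip(), w.get("cat"))
                              for w in sent.iter("w") if w.text])
        return sents

# Usage sketch (paths are placeholders):
# reader = FrenchTreebankReader("/path/to/ftb", r".*\.xml")
# print(reader.tagged_sents()[:1])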
If you're not already on the nltk-users mailing list, I think you'll want to get on it.