How to print tokenized Arabic text in Python/NLTK?

I'm doing sentiment analysis for the Arabic language using Python/NLTK and the DreamPie shell. This problem occurs when I apply the tokenization function. How can I display these words?
>>> import nltk
>>> sentence = "مصادمات عنيفه في"
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['\xd9\x85\xd8\xb5\xd8\xa7\xd8\xaf\xd9\x85\xd8\xa7\xd8\xaa', '\xd8\xb9\xd9\x86\xd9\x8a\xd9\x81\xd9\x87', '\xd9\x81\xd9\x8a']

By printing tokens, you're printing the whole list, and the \x... sequences are the UTF-8 byte representations of each token (Python 2 byte strings). If you want to print the Arabic form, just loop through the list and print the tokens one by one.
>>> import nltk
>>> sentence = "مصادمات عنيفه في"
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['\xd9\x85\xd8\xb5\xd8\xa7\xd8\xaf\xd9\x85\xd8\xa7\xd8\xaa', '\xd8\xb9\xd9\x86\xd9\x8a\xd9\x81\xd9\x87', '\xd9\x81\xd9\x8a']
>>> for i in tokens:
...     print i
...
مصادمات
عنيفه
في
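For comparison, a minimal sketch of the same session on Python 3, where word_tokenize returns str objects, so the Arabic characters are displayed directly even when printing the whole list:
import nltk

sentence = "مصادمات عنيفه في"
tokens = nltk.word_tokenize(sentence)
print(tokens)        # on Python 3 the list repr shows the Arabic forms directly
for token in tokens:
    print(token)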

Related

How do I singularize and lemmatize an entire pandas dataframe column using TextBlob?

I have a pandas dataframe that contains the following columns: df['adjectives'], df['nouns'], and df['adverbs']. Each of these columns contains lists of tokens based on their respective parts of speech.
I would like to use TextBlob to create three new columns in my data frame, df['adjlemmatized'], df['nounlemmatized'], and df['advlemmatized'].
Each of these columns should contain wordlists consisting of words in their singularized, lemma form.
I have tried following the TextBlob documentation, but I am stuck writing functions that will iterate over my entire dataframe.
Words Inflection and Lemmatization
Each word in TextBlob.words or Sentence.words is a Word object (a subclass of unicode) with useful methods, e.g. for word inflection.
>>> sentence = TextBlob('Use 4 spaces per indentation level.')
>>> sentence.words
WordList(['Use', '4', 'spaces', 'per', 'indentation', 'level'])
>>> sentence.words[2].singularize()
'space'
>>> sentence.words[-1].pluralize()
'levels'
Words can be lemmatized by calling the lemmatize method.
>>> from textblob import Word
>>> w = Word("octopi")
>>> w.lemmatize()
'octopus'
>>> w = Word("went")
>>> w.lemmatize("v") # Pass in WordNet part of speech (verb)
'go'
Here is the code I used to get the parts of speech from my text:
# get adjectives
from textblob import TextBlob

def get_adjectives(text):
    blob = TextBlob(text)
    print(text)
    return [word for (word, tag) in blob.tags if tag.startswith("JJ")]

df['adjectives'] = df['clean_reviews'].apply(get_adjectives)
If your words are already tokenized and you want to keep them that way, it's easy:
from textblob import Word

# wrap each token in a Word (which has .lemmatize()), then lemmatize in a second pass
df['adjlemmatized'] = df.adjectives.apply(lambda x: [Word(w) for w in x])
df['adjlemmatized'] = df.adjlemmatized.apply(lambda x: [w.lemmatize() for w in x])
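Building on that, a compact sketch that covers all three columns from the question and also singularizes each word before taking its lemma. The helper name singular_lemmas is mine, and this assumes each column holds a list of token strings:
from textblob import Word

def singular_lemmas(tokens):
    # singularize each token, re-wrap the result as a Word, then take its lemma
    return [Word(Word(tok).singularize()).lemmatize() for tok in tokens]

df['adjlemmatized'] = df['adjectives'].apply(singular_lemmas)
df['nounlemmatized'] = df['nouns'].apply(singular_lemmas)
df['advlemmatized'] = df['adverbs'].apply(singular_lemmas)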

NLTK tokenize text with dialog into sentences

I am able to tokenize non-dialog text into sentences, but when I add quotation marks to the sentences, the NLTK tokenizer doesn't split them up correctly. For example, this works as expected:
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text1 = 'Is this one sentence? This is separate. This is a third he said.'
tokenizer.tokenize(text1)
This results in a list of three different sentences:
['Is this one sentence?', 'This is separate.', 'This is a third he said.']
However, if I make it into a dialogue, the same process doesn't work.
text2 = '“Is this one sentence?” “This is separate.” “This is a third” he said.'
tokenizer.tokenize(text2)
This returns it as a single sentence:
['“Is this one sentence?” “This is separate.” “This is a third” he said.']
How can I make the NLTK tokenizer work in this case?
It seems the tokenizer doesn't know what to do with the directed quotes. Replace them with regular ASCII double quotes and the example works fine.
>>> import re
>>> text3 = re.sub('[“”]', '"', text2)
>>> nltk.sent_tokenize(text3)
['"Is this one sentence?"', '"This is separate."', '"This is a third" he said.']

Python - specify corpus from whose sentence to perform a function on?

I have imported all the books from the NLTK Book library, and I am just trying to figure out how to specify a corpus and then a sentence from it to print.
For example, suppose I wanted to print sentence 1 of text 3, then sentence 2 of text 4:
import nltk
from nltk.book import *
print(???)
print(???)
I've tried the below combinations, which do not work:
print(text3.sent1)
print(text4.sent2)
print(sent1.text3)
print(sent2.text4)
print(text3(sent1))
print(text4(sent2))
I am new to Python, so it is likely a very basic question, but I cannot seem to find the solution elsewhere.
Many thanks in advance!
A simple example:
from nltk.tokenize import sent_tokenize

# a string containing several sentences
sentences = "This is first sentence. This is second sentence. Let's try to tokenize the sentences. how are you? I am doing good"

# define function
def sentence_tokenizer(sentences):
    sentence_tokenize_list = sent_tokenize(sentences)
    print("tokenized sentences are =", sentence_tokenize_list)
    return sentence_tokenize_list

# call function
tokenized_sentences = sentence_tokenizer(sentences)

# print first sentence
print(tokenized_sentences[0])
Hope this helps.
You need to split the texts into lists of sentences first.
If you already have text3 and text4:
from nltk.tokenize import sent_tokenize

# text3 and text4 from nltk.book are Text objects (sequences of tokens),
# so join the tokens back into a single string before sentence-splitting
sents = sent_tokenize(' '.join(text3))
print(sents[0])  # the first sentence in the list is at position 0

sents = sent_tokenize(' '.join(text4))
print(sents[1])  # the second sentence in the list is at position 1

print(text3[0])  # prints the first word of text3
You seem to need both an NLTK tutorial and a Python tutorial. Luckily, the NLTK book is both.
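Putting that together, a small hypothetical helper (the name nth_sentence is mine) that returns the n-th sentence of any of the texts preloaded by nltk.book:
from nltk.tokenize import sent_tokenize

def nth_sentence(text_obj, n):
    # text_obj is an nltk.text.Text, i.e. a sequence of tokens; n is 1-based
    sents = sent_tokenize(' '.join(text_obj))
    return sents[n - 1]

print(nth_sentence(text3, 1))  # sentence 1 of text 3
print(nth_sentence(text4, 2))  # sentence 2 of text 4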

Removing stopwords from list using python3

I have been trying to remove stopwords from a CSV file that I'm reading with Python, but my code does not seem to work. I have tried using a sample text in the code to validate it, but the result is the same. Below is my code; I would appreciate it if anyone could help me rectify the issue.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import csv
article = ['The computer code has a little bug' ,
'im learning python' ,
'thanks for helping me' ,
'this is trouble' ,
'this is a sample sentence'
'cat in the hat']
tokenized_models = [word_tokenize(str(i)) for i in article]
stopset = set(stopwords.words('english'))
stop_models = [i for i in tokenized_models if str(i).lower() not in stopset]
print('token:'+str(stop_models))
Your tokenized_models is a list of tokenized sentences, i.e. a list of lists. So the following line checks whether an entire list of words is in the stopword set:
stop_models = [i for i in tokenized_models if str(i).lower() not in stopset]
Instead, iterate over the words within each sentence. Something like:
clean_models = []
for m in tokenized_models:
    stop_m = [i for i in m if str(i).lower() not in stopset]
    clean_models.append(stop_m)

print(clean_models)
Off-topic useful hint:
To define a single multi-line string, use parentheses and no commas (adjacent string literals are concatenated; note the trailing spaces so the words don't run together):
article = ('The computer code has a little bug '
           'im learning python '
           'thanks for helping me '
           'this is trouble '
           'this is a sample sentence '
           'cat in the hat')
This version would work with your original code.
word_tokenize(str(i)) returns a list of words, so tokenized_models is a list of lists. You need to flatten that list, or better yet just make article a single string, since I don't see why it's a list at the moment.
This is because the in operator won't search through a list and then through strings in that list at the same time, e.g.:
>>> 'a' in 'abc'
True
>>> 'a' in ['abc']
False
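A minimal sketch of that flattening approach, applied to the tokenized_models and stopset already built in the question's code:
from itertools import chain

# flatten the list of token lists into one flat list of words,
# then drop any word whose lowercase form is in the stopword set
flat_tokens = list(chain.from_iterable(tokenized_models))
stop_models = [w for w in flat_tokens if w.lower() not in stopset]
print('token:' + str(stop_models))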

To find frequency of every word in text file in python

I want to find the frequency of all words in my text file so that I can find the most frequently occurring words.
Can someone please tell me the command to use for that?
import nltk
text1 = "hello he heloo hello hi "  # example text
fdist1 = nltk.FreqDist(text1)
I have used the above code, but the problem is that it is not giving word frequencies; rather, it is displaying the frequency of every character.
I also want to know how to read the input text from a text file.
I tried your example and saw the same thing you are seeing: for it to work properly, you have to split the string on spaces. If you do not do this, FreqDist counts each character, which is what you were seeing. The version below returns the proper count of each word, not of each character.
import nltk
text1 = 'hello he heloo hello hi '
text1 = text1.split(' ')
fdist1 = nltk.FreqDist(text1)
print (fdist1.most_common(50))
If you want to read from a file and get the word count, you can do it like so:
input.txt
hello he heloo hello hi
my username is heinst
your username is frooty
Python code
import nltk

with open("input.txt", "r") as myfile:
    data = myfile.read().replace('\n', ' ')

data = data.split(' ')
fdist1 = nltk.FreqDist(data)
print(fdist1.most_common(50))
For what it's worth, NLTK seems like overkill for this task. A collections.Counter will give you word frequencies, and its most_common() method lists them from highest to lowest.
from collections import Counter
input_string = [...] # get the input from a file
word_freqs = Counter(input_string.split())
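A quick usage sketch of the Counter approach, reading from a file (the filename is just an example):
from collections import Counter

# read the whole file and split it on whitespace
with open('input.txt') as f:
    word_freqs = Counter(f.read().split())

# the ten most frequent words, highest count first
print(word_freqs.most_common(10))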
text1 in the nltk book is a collection of tokens (words, punctuation), unlike in your code example, where text1 is a string (a sequence of Unicode code points):
>>> from nltk.book import text1
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text1[99] # 100th token in the text
','
>>> from nltk import FreqDist
>>> FreqDist(text1)
FreqDist({',': 18713, 'the': 13721, '.': 6862, 'of': 6536, 'and': 6024,
'a': 4569, 'to': 4542, ';': 4072, 'in': 3916, 'that': 2982, ...})
If your input is indeed space-separated words, then to find the frequency, use @Boa's answer:
freq = Counter(text_with_space_separated_words.split())
Note: FreqDist is a Counter but it also defines additional methods such as .plot().
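For example, a small illustrative snippet using the Moby Dick text loaded above:
from nltk import FreqDist
from nltk.book import text1

freq = FreqDist(text1)
print(freq.most_common(10))  # the ten most frequent tokens with their counts
freq.plot(10)                # plot the ten most common tokens (requires matplotlib)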
If you want to use nltk tokenizers instead:
#!/usr/bin/env python3
from itertools import chain
from nltk import FreqDist, sent_tokenize, word_tokenize  # $ pip install nltk

with open('your_text.txt') as file:
    text = file.read()

words = chain.from_iterable(map(word_tokenize, sent_tokenize(text)))
freq = FreqDist(map(str.casefold, words))
freq.pprint()
# -> FreqDist({'hello': 2, 'hi': 1, 'heloo': 1, 'he': 1})
sent_tokenize() tokenizes the text into sentences. Then word_tokenize tokenizes each sentence into words. There are many ways to tokenize text in nltk.
In order to get the frequency of each word as a dictionary, the following code will be helpful:
import nltk
from nltk.tokenize import word_tokenize

# build a frequency distribution first, then copy the counts into a plain dict
fre = nltk.FreqDist(word_tokenize(inputSentence))
freq_dict = {}
for f in word_tokenize(inputSentence):
    freq_dict[f] = fre[f]
print(freq_dict)
I think the code below is useful for getting the frequency of each word in a file in dictionary form:
myfile = open('greet.txt')
temp = myfile.read()
x = temp.split("\n")
y = list()
for item in x:
    z = item.split(" ")
    y.append(z)
count = dict()
for name in y:
    for items in name:
        if items not in count:
            count[items] = 1
        else:
            count[items] = count[items] + 1
print(count)
