PYTHON: Extract non-English words and iterate the process over a dataframe

I have a table of about 30,000 rows and need to extract the non-English words from the outcome column of a dataframe named dummy_df. I need to put the non-English words in an adjacent column named non_english.
Dummy data looks like this:
dummy_df = pandas.DataFrame({'outcome': ["I want to go to church", "I love Matauranga", "Take me to Oranga Tamariki"]})
My idea is to extract non-English words from a sentence, and then iterate the process over a dataframe. I was able to accurately extract non-English words from a sentence with this code:
import nltk
nltk.download('words')
from nltk.corpus import words
words = set(nltk.corpus.words.words())
sent = "I love Matauranga"
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if not w.lower() in words or not w.isalpha())
The result of the above code is 'Matauranga', which is perfectly correct.
But when I try to iterate the code over a dataframe using this code:
import nltk
nltk.download('words')
from nltk.corpus import words
def no_english(text):
    words = set(nltk.corpus.words.words())
    " ".join(w for w in nltk.wordpunct_tokenize(text['outcome'])
             if not w.lower() in words or not w.isalpha())

dummy_df['non_english'] = dummy_df.apply(no_english, axis=1)
print(dummy_df)
I got an undesirable result: the non_english column contains None instead of the desired non-English words (see below):
                      outcome non_english
0      I want to go to church        None
1           I love Matauranga        None
2  Take me to Oranga Tamariki        None
Instead, the desired result should be:
                      outcome      non_english
0      I want to go to church
1           I love Matauranga       Matauranga
2  Take me to Oranga Tamariki  Oranga Tamariki

You are missing the return in your function:
import nltk
nltk.download('words')
from nltk.corpus import words
def no_english(text):
    words = set(nltk.corpus.words.words())
    return " ".join(w for w in nltk.wordpunct_tokenize(text['outcome'])
                    if not w.lower() in words or not w.isalpha())

dummy_df['non_english'] = dummy_df.apply(no_english, axis=1)
print(dummy_df)
output:
                      outcome      non_english
0      I want to go to church
1           I love Matauranga       Matauranga
2  Take me to Oranga Tamariki  Oranga Tamariki
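Since the real table has about 30,000 rows, it may also be worth building the English word set once, outside the function, instead of rebuilding it on every call to apply. A minimal sketch of that variation (english_words and non_english_words are names chosen here for illustration):
import nltk
import pandas as pd
nltk.download('words')

# Build the English word set once, not once per row.
english_words = set(nltk.corpus.words.words())

def non_english_words(sentence):
    # Keep the tokens that are not plain English dictionary words.
    return " ".join(w for w in nltk.wordpunct_tokenize(sentence)
                    if not w.lower() in english_words or not w.isalpha())

# Applying to the column passes each cell directly, so axis=1 is not needed.
dummy_df['non_english'] = dummy_df['outcome'].apply(non_english_words)
print(dummy_df)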

Related

Text Clustering and Visualize Similarity in Python

I'm analyzing a book of 1167 pages (a txt file). So far I've done preprocessing of the data (cleaning, removing punctuation, removing stop words, tokenization).
Now, how can I cluster the text and visualize the similarity (plot the clusters)?
For example -
text1 = "there is one unshakable conviction that people—whatever the degree of development of their understanding and whatever the form taken by the factors present in their individuality for engendering all kinds of ideal"
text2 = "That is why I now also, in setting forth on this venture quite new for me, namely authorship, begin by pronouncing this invocation"
In my task, I divided the whole book into chapters: text1 is chapter one, text2 is chapter two, and so on. Now I want to compare chapter 1 and chapter 2 like this.
Example code that I used for data pre-processing:
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(pages[1])
# convert to lower case
tokens = [w.lower() for w in tokens]
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
# filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if not w in stop_words]
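One possible way to cluster the chapters and plot their similarity is to join each preprocessed chapter back into a string, vectorize with TF-IDF, cluster with KMeans, and project to 2-D for plotting. A minimal sketch under those assumptions (the chapter list, the number of clusters, and the variable names are placeholders to adapt):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt

# One string per chapter, e.g. " ".join(words) from the preprocessing above.
chapters = [text1, text2]  # extend with the remaining chapters

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(chapters)

# Pairwise cosine similarity between chapters (chapter 1 vs chapter 2, etc.).
print(cosine_similarity(X))

# Cluster the chapters; n_clusters is a guess to tune.
labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)

# Project the TF-IDF vectors to 2-D so the clusters can be plotted.
coords = PCA(n_components=2).fit_transform(X.toarray())
plt.scatter(coords[:, 0], coords[:, 1], c=labels)
for i, (x, y) in enumerate(coords):
    plt.annotate("chapter %d" % (i + 1), (x, y))
plt.show()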

Word count frequency: removing stopwords

I have the following word-frequency list, generated by the code below.
           Frequency
the                3
15                 5
18                 1
a                  1
2020               4
...              ...
house              1
apartment          1
hotel              5
pool               1
swimming           1
The code is:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

word_vectorizer = CountVectorizer(ngram_range=(1, 1), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(df['Sentences'])
w_freq = sum(sparse_matrix).toarray()[0]
w_df = pd.DataFrame(w_freq, index=word_vectorizer.get_feature_names(), columns=['Frequency'])
w_df
I would like to remove the stopwords from the list of words above (not in the column of my dataframe, but just in this output, creating a new variable in case it is needed).
I have tried w_df = [w for w in w_df if w not in stop_words], but it gave me ['Frequency'] as output.
I think this happens because it is not a list.
Could you please tell me how to remove stopwords (numbers included) from there?
Thanks
CountVectorizer has a parameter that does that for you. You can feed it a custom list of stopwords, or set it to 'english' to use the built-in English stop word list. Here's an example:
s = pd.Series('Just a random sentence with more than one stopword')

word_vectorizer = CountVectorizer(ngram_range=(1, 1),
                                  analyzer='word',
                                  stop_words='english')
sparse_matrix = word_vectorizer.fit_transform(s)
w_freq = sum(sparse_matrix).toarray()[0]
w_df = pd.DataFrame(w_freq,
                    index=word_vectorizer.get_feature_names(),
                    columns=['Frequency'])
print(w_df)
          Frequency
just              1
random            1
sentence          1
stopword          1
Just to add, your approach wasn't all that wrong. You needed just a minor change.
w_df = [w for w in w_df.index if not w in stop_words]
Your problem was simply that, in the list comprehension, you iterated over the dataframe itself rather than over the tokens, which are in its index. This also returns the desired result.
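To also drop the purely numeric tokens such as 15 or 2020 (the question asks for numbers to be removed as well), one option is to filter the frequency table afterwards. A small sketch, keeping w_df from above (w_df_filtered is just an illustrative name):
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
# Keep only alphabetic tokens that are not stopwords; this drops '15', '18', '2020', ...
w_df_filtered = w_df[[w.isalpha() and w not in stop_words for w in w_df.index]]
print(w_df_filtered)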

why does if not in(x,y) not work at all in python

I want to select words only if the word in each row of my column is not in the stop words and not in string punctuation.
This is my data after tokenizing and removing the stopwords; I also want to remove the punctuation at the same time as I remove the stopwords. See in row two, after usf there's a comma. I thought of if word not in (stopwords, string.punctuation), since that would mean not in stopwords and not in string.punctuation (I saw it from here), but it fails to remove the stopwords and the punctuation. How do I fix this?
data['text'].head(5)
Out[38]:
0 ['ve, searching, right, words, thank, breather...
1 [free, entry, 2, wkly, comp, win, fa, cup, fin...
2 [nah, n't, think, goes, usf, ,, lives, around,...
3 [even, brother, like, speak, ., treat, like, a...
4 [date, sunday, !, !]
Name: text, dtype: object
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
data = pd.read_csv(r"D:/python projects/read_files/SMSSpamCollection.tsv",
                   sep='\t', header=None)
data.columns = ['label', 'text']

stopwords = set(stopwords.words('english'))

def process(df):
    data = word_tokenize(df.lower())
    data = [word for word in data if word not in (stopwords, string.punctuation)]
    return data

data['text'] = data['text'].apply(process)
If you still want to do it in one if statement, you can convert string.punctuation to a set and combine it with stopwords with an OR operation. This is how it would look:
data = [word for word in data if word not in (stopwords|set(string.punctuation))]
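Dropped into the process function from the question, the same idea might look like this (a sketch; the unwanted name and the sample sentence are just for illustration):
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
# One set containing both the stopwords and every punctuation character.
unwanted = stop_words | set(string.punctuation)

def process(text):
    tokens = word_tokenize(text.lower())
    return [word for word in tokens if word not in unwanted]

print(process("Nah I don't think he goes to usf, he lives around here though."))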
Otherwise, you need to change
data = [word for word in data if word not in (stopwords,string.punctuation)]
to
data = [word for word in data if word not in stopwords and word not in string.punctuation]
Alternatively, in the process function you can convert the punctuation string and the stopwords into a single pandas Series with concat and test membership against its values. The function would then be:
def process(df):
    data = word_tokenize(df.lower())
    unwanted = pd.concat([pd.Series(sorted(stopwords)),
                          pd.Series(list(string.punctuation))])
    data = [word for word in data if word not in unwanted.values]
    return data

How can I find and print unmatched/dissimilar words from the documents (dataset)?

I am trying to rewrite an algorithm that takes an input text file, compares it against different documents, and returns the similarities.
Now I want to print the unmatched words and write a new text file containing them.
In this code, "hello force" is the input and is checked against raw_documents, printing a rank between 0 and 1 for each document (the word "force" matches the second document, so the output ranks it higher, but "hello" is not in any raw_document and I want to report it as not matched). What I want is to print any input word that was not matched with any of the raw_documents.
import gensim
import nltk
from nltk.tokenize import word_tokenize
raw_documents = ["I'm taking the show on the road",
                 "My socks are a force multiplier.",
                 "I am the barber who cuts everyone's hair who doesn't cut their own.",
                 "Legend has it that the mind is a mad monkey.",
                 "I make my own fun."]

gen_docs = [[w.lower() for w in word_tokenize(text)]
            for text in raw_documents]

dictionary = gensim.corpora.Dictionary(gen_docs)
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
tf_idf = gensim.models.TfidfModel(corpus)

s = 0
for i in corpus:
    s += len(i)

sims = gensim.similarities.Similarity('/usr/workdir/', tf_idf[corpus],
                                      num_features=len(dictionary))

query_doc = [w.lower() for w in word_tokenize("hello force")]
query_doc_bow = dictionary.doc2bow(query_doc)
query_doc_tf_idf = tf_idf[query_doc_bow]
result = sims[query_doc_tf_idf]
print(result)
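One way to get the unmatched words is to check the query tokens against the dictionary built from raw_documents: anything absent from dictionary.token2id never occurs in any document. A small sketch continuing from the code above (unmatched and the output filename are placeholders):
# Words from the query that never appear in any of the raw_documents.
unmatched = [w for w in query_doc if w not in dictionary.token2id]
print("Unmatched words:", unmatched)  # e.g. ['hello']

# Optionally write them out to a new text file.
with open("unmatched_words.txt", "w") as f:
    f.write("\n".join(unmatched))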

Counting non stop words in an NLTK corpus

In Python, using NLTK, how would I find a count of the number of non-stop words in a document, filtered by category?
I can figure out how to get the words in a corpus filtered by a category, e.g. all the words in the Brown corpus for category 'news':
text = nltk.corpus.brown.words(categories=category)
And separately I can figure out how to get all the words for a particular document, e.g. all the words in the document 'cj47' in the Brown corpus:
text = nltk.corpus.brown.words(fileids='cj47')
And then I can loop through the results and count up the words that are not stopwords, e.g.:
stopwords = nltk.corpus.stopwords.words('english')
count = 0
for w in text:
    if w.lower() not in stopwords:
        count += 1  # found a non-stop word
But how do I put it together so that I am filtering by category and by a particular document? If I try to specify a category and a fileid at the same time, e.g.
text = nltk.corpus.brown.words(categories=category, fileids='cj47')
I get an error saying:
ValueError: Specify fileids or categories, not both
Get fileids for a category:
fileids = nltk.corpus.brown.fileids(categories=category)
For each file, count the non-stopwords:
for f in fileids:
    words = nltk.corpus.brown.words(fileids=f)
    count = sum(1 for w in words if w.lower() not in stopwords)
    print("Document %s: %d non-stopwords." % (f, count))
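As a minor aside, stopwords here is a plain list, so w.lower() not in stopwords scans the whole list for every word; converting it to a set once makes the lookups much faster on large categories. A small sketch of the same loop under that change (the 'news' category is just the example from the question):
import nltk

stopword_set = set(nltk.corpus.stopwords.words('english'))
for f in nltk.corpus.brown.fileids(categories='news'):
    words = nltk.corpus.brown.words(fileids=f)
    count = sum(1 for w in words if w.lower() not in stopword_set)
    print("Document %s: %d non-stopwords." % (f, count))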
