I'm a new student of natural language processing and I have an assignment on simple corpus analysis. Given an input file (MovieCorpus.txt), we are asked to compute the following statistics:
Number of sentences, tokens, types (lemmas)
Distribution of sentence length, types, POS
import nltk
import spacy as sp
from nltk import word_tokenize
# Set up the spaCy model
nlp = sp.load('en_core_web_sm')
# Movie Corpus
with open('MovieCorpus.txt', 'r') as f:
    read_data = f.read().splitlines()
# Tokenize, POS, Lemma
tokens = []
lemma = []
pos = []
for doc in nlp.pipe(read_data):
    if doc.is_parsed:
        tokens.append([n.text for n in doc])
        lemma.append([n.lemma_ for n in doc])
        pos.append([n.pos_ for n in doc])
    else:
        tokens.append(None)
        lemma.append(None)
        pos.append(None)
ls = len(read_data)
print("The number of sentences is %d" % ls)
lt = len(tokens)
print("The number of tokens is %d" % lt)
ll = len(lemma)
print("The number of lemmas is %d" % ll)
This is my attempt at answering those questions, but since the file is very large (>300,000 sentences) it takes forever to analyze. Did I do anything wrong? Should I use NLTK instead of spaCy?
import pandas as pd
import nltk
from nltk import word_tokenize
# Movie Corpus
with open('MovieCorpus.txt', 'r') as f:
    read_data = f.read().splitlines()
df = pd.DataFrame({"text": read_data})  # assuming your data has no header
data = df.head(10)  # work on a small sample first; drop .head(10) for the full corpus
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()
def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]
data['lemma'] = data.text.apply(lemmatize_text)
data["tokens"] = data.text.apply(nltk.word_tokenize)
data["posR"] = data.tokens.apply(lambda x: nltk.pos_tag(x))
tags = [[tag for word, tag in _] for _ in data["posR"].to_list()]
data["pos"] = tags
print(data)
From here on, you should be able to do all the other tasks yourself.
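For reference, here is a minimal sketch of how the requested counts and distributions could be computed from that DataFrame (assuming the tokens, lemma and pos columns above are built over the full df rather than the 10-row sample):
from collections import Counter
n_sentences = len(data)
n_tokens = sum(len(toks) for toks in data["tokens"])
n_types = len({lem for lemmas in data["lemma"] for lem in lemmas})
sentence_length_dist = Counter(len(toks) for toks in data["tokens"])
pos_dist = Counter(tag for tags in data["pos"] for tag in tags)
print("Sentences:", n_sentences)
print("Tokens:", n_tokens)
print("Types (lemmas):", n_types)
print("Sentence length distribution:", sentence_length_dist.most_common(10))
print("POS distribution:", pos_dist.most_common())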
Related
I'm working on extractive text summarization of long text data. I have text data from multiple users in the input CSV file, but the current code appends all of the text column values into one list of sentences and then applies the logic to that. How do I apply the code to each row instead of merging all the column values? Any help will be appreciated.
Input.csv (^ delimited)
uid^name^text
36d73f013aa7^Don Howard^The Irvine Foundation has entered into a partnership with College Futures Foundation that starts a new chapter in our support of postsecondary success in California.To achieve Irvine’s singular goal.
36d73f013aa8^Simon Haris^That’s why we have long provided funding to expand postsecondary success. Now with our focus on low-wage workers, we have decided to split our postsecondary funding into two parts:. Strengthening and expanding work-ready credentialing programs (which we will do directly, primarily as part of our Better Careers initiative).
36d73f013aa8^David^Accelerating and expanding the attainment of bachelor’s degrees (which we will fund through our partnership with College Futures). We believe that College Futures is in a stronger position than we are to make grants to support improvements in how the CSUs and the California Community Colleges can better serve students.
Pseudocode:
for each record:
    apply the logic below to the text column to get its summary
Code: text summarization
import numpy as np
import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt') # one time execution
from nltk.corpus import stopwords
import re
# Read the CSV file
import io
df = pd.read_csv('/home/sshuser/textsummerisation/input.csv',sep='^')
# split the text in the articles into sentences
sentences = []
for s in df['text']:
    sentences.append(sent_tokenize(s))
# flatten the list
sentences = [y for x in sentences for y in x]
# remove punctuations, numbers and special characters
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)
# make alphabets lowercase
clean_sentences = [s.lower() for s in clean_sentences]
nltk.download('stopwords')# one time execution
stop_words = stopwords.words('english')
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new
# remove stopwords from the sentences
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
# Extract word vectors
word_embeddings = {}
fopen = open('/home/sshuser/textsummerisation/glove.6B.100d.txt', encoding='utf-8')
for line in fopen:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    word_embeddings[word] = coefs
fopen.close()
sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()]) / (len(i.split()) + 0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)
len(sentence_vectors)
# similarity matrix
from sklearn.metrics.pairwise import cosine_similarity
sim_mat = np.zeros([len(sentences), len(sentences)])
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
import networkx as nx
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
# Specify number of sentences to form the summary
sn = 10
# Generate summary
for i in range(sn):
    print(ranked_sentences[i][1])
Expected output: the output of the above code should appear in a summary column for each record:
uid^name^text^summary
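One possible restructuring (a sketch, not tested against your data): wrap the per-article logic above in a function and apply it to each row, writing the result to a summary column. The output file name and the default of 3 summary sentences per record are assumptions; word_embeddings, remove_stopwords, sent_tokenize and df are taken from the code above.
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx
def summarize(text, sn=3):
    # sentences of this record only, cleaned the same way as above
    sents = sent_tokenize(text)
    clean = pd.Series(sents).str.replace("[^a-zA-Z]", " ", regex=True).str.lower()
    clean = [remove_stopwords(s.split()) for s in clean]
    # average the GloVe vectors per sentence (word_embeddings loaded above)
    vectors = []
    for s in clean:
        words = s.split()
        if words:
            v = sum(word_embeddings.get(w, np.zeros((100,))) for w in words) / (len(words) + 0.001)
        else:
            v = np.zeros((100,))
        vectors.append(v)
    # similarity matrix and PageRank over this record's sentences only
    n = len(sents)
    sim_mat = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                sim_mat[i][j] = cosine_similarity(vectors[i].reshape(1, 100), vectors[j].reshape(1, 100))[0, 0]
    scores = nx.pagerank(nx.from_numpy_array(sim_mat))
    ranked = sorted(((scores[i], s) for i, s in enumerate(sents)), reverse=True)
    return " ".join(s for _, s in ranked[:sn])
# one summary per record instead of one global list of sentences
df["summary"] = df["text"].apply(summarize)
df.to_csv("output.csv", sep="^", index=False)  # assumed output path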
I wrote the code below to restrict the words in the tweets to content words, i.e., nouns, verbs, and adjectives. Now I want to transform the words to lower case and add the POS with an underscore. E.g.:
love_VERB old-fashioneds_NOUN
but I don't know how; can anyone help me?
! pip install wget
import wget
url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/reviews.full.tsv.zip'
wget.download(url, 'reviews.full.tsv.zip')
from zipfile import ZipFile
with ZipFile('reviews.full.tsv.zip', 'r') as zf:
    zf.extractall()
import pandas as pd
df = pd.read_csv('reviews.full.tsv', sep='\t', nrows=100000)  # nrows: maximum number of rows to read
documents = df.text.values.tolist()
print(documents[:4])
import spacy
nlp = spacy.load('en_core_web_sm') #you can use other methods
# tags to keep (content words)
included_tags = {"NOUN", "VERB", "ADJ"}
#document = [line.strip() for line in open('moby_dick.txt', encoding='utf8').readlines()]
sentences = documents[:103] # first 103 documents
new_sentences = []
for sentence in sentences:
    new_sentence = []
    for token in nlp(sentence):
        if token.pos_ in included_tags:
            new_sentence.append(token.text)
    new_sentences.append(" ".join(new_sentence))
#Creates a list of lists of tokens
tokens = [[token.text for token in nlp(new_sentence)] for new_sentence in documents[:200]]
tokens
# import itertools
# tok = itertools.chain.from_iterable(
# [[token.text for token in nlp(new_sentence)] for new_sentence in documents[:200]])
# tok
I believe if you change
new_sentence.append(token.text)
to
new_sentence.append(token.text.lower() + '_' + token.pos_)
you'll have what you are after (note that the spaCy attribute is token.pos_, not token.POS).
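For example, a minimal runnable sketch of that change (the sample sentence here is made up):
import spacy
nlp = spacy.load('en_core_web_sm')
included_tags = {"NOUN", "VERB", "ADJ"}
doc = nlp("I love this old movie")
# keep only content words, lowercased, with the coarse POS tag appended
kept = [token.text.lower() + '_' + token.pos_ for token in doc if token.pos_ in included_tags]
print(" ".join(kept))  # something like: love_VERB old_ADJ movie_NOUN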
I have five plain text documents in a directory that are already clustered based on their content and named cluster1.txt, cluster2.txt and so on, so they function as my corpus. They have no other labels; they are just named that way.
My task is to cluster a new text document with new sentences, but not the document as a whole: I should assign each sentence to one of these 5 clusters or classes, and also produce a confusion matrix with recall and precision scores to show how similar the sentences are to the clusters.
I first tried kNN and then k-means, but I think my logic is flawed, since this is not a clustering problem but a classification problem, right?
At least I tried to preprocess the text (removing stop words, lemmatizing, lowercasing, tokenizing), and then I calculated the term frequency with a CountVectorizer and then the TF-IDF.
I'm having trouble with the logic of this problem. Anyway, this is what I have tried so far, but now I'm stuck; any help is appreciated.
import glob
import os
import re
import numpy as np
import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
file_list = glob.glob(os.path.join(os.getcwd(), 'C:/Users/ds191033/FH/Praktikum/Testdaten/Clusters', "*.txt"))
corpus = []
for file_path in file_list:
    with open(file_path, encoding="utf8") as f_input:
        corpus.append(f_input.read())
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('german')
lem = WordNetLemmatizer()
def normalize_document(doc):
    # lower case and remove special characters/whitespace
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
    doc = doc.lower()
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # lemmatize
    lemmatized_tokens = [lem.lemmatize(t) for t in filtered_tokens]
    # re-create document from filtered tokens
    doc = ' '.join(lemmatized_tokens)
    return doc
normalize_corpus = np.vectorize(normalize_document)
norm_corpus = normalize_corpus(corpus)
norm_corpus
cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix = cv_matrix.toarray()
cv_matrix
# get all unique words in the corpus
vocab = cv.get_feature_names()
# show document feature vectors
pd.DataFrame(cv_matrix, columns=vocab)
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(norm_corpus)
tv_matrix = tv_matrix.toarray()
vocab = tv.get_feature_names()
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)
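One possible way to turn this into the classification step described above: fit the TF-IDF vectorizer on the five cluster documents, vectorize each new sentence in the same space, and assign it to the most similar cluster. This is only a sketch; the file name new_doc.txt is an assumption, and it reuses normalize_document and norm_corpus from above.
from sklearn.metrics.pairwise import cosine_similarity
# TF-IDF space defined by the five cluster documents
tv = TfidfVectorizer(use_idf=True)
cluster_matrix = tv.fit_transform(norm_corpus)
# read the new document and split it into sentences
with open('new_doc.txt', encoding='utf8') as f:
    new_sentences = nltk.sent_tokenize(f.read(), language='german')
norm_sentences = [normalize_document(s) for s in new_sentences]
# each sentence goes to the cluster whose document it is most similar to
sentence_matrix = tv.transform(norm_sentences)
similarities = cosine_similarity(sentence_matrix, cluster_matrix)
predicted = similarities.argmax(axis=1) + 1  # 1-based cluster number
for sent, cluster in zip(new_sentences, predicted):
    print(f"cluster{cluster}: {sent}")
For the confusion matrix and the precision/recall scores you would then compare these predictions against gold labels for the new sentences, e.g. with sklearn.metrics.confusion_matrix and classification_report.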
I am trying to do sentiment analysis with Python on a bunch of .txt documents.
So far I have done the preprocessing and extracted only the important words from the text, e.g. I deleted stop words and punctuation. I also created a kind of bag-of-words model counting the term frequencies. The next step would be to implement a corresponding model.
I am not experienced in machine learning or text mining. I am also uncertain about the way I created the bag-of-words model. Could you please have a look at my code and tell me if I am on the right track? I would also like to know whether my previous steps are a good basis for a model, and how I can build a good model on that basis in order to categorize my documents.
This is my code:
import spacy
import string
import os,sys
import re
import numpy as np
np.set_printoptions(threshold=sys.maxsize)
from collections import Counter
# Load English tokenizer, tagger, parser, NER and word vectors
nlp_en = spacy.load("en_core_web_sm")
nlp_de = spacy.load("de_core_news_sm")
path_train = "Sentiment/Train/"
path_test = "Sentiment/Test/"
text_train = []
text_test = []
# Process whole documents
for filename in os.listdir(path_train):
    text = open(os.path.join(path_train, filename), encoding="utf8", errors='ignore').read()
    text = text.replace("\ue004", "s").replace("\ue006", "y")
    text = re.sub(r'^http?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    text = "".join(filter(lambda x: x in string.printable, text))
    text = " ".join(text.split())
    text = re.sub('[A-Z]+', lambda m: m.group(0).lower(), text)
    if filename.startswith("de_"):
        text_train.append(nlp_de(text))
    else:
        text_train.append(nlp_en(text))
docsClean = []
for doc in text_train:
    #for token in doc:
    #    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)
    cleanWords = [token.lemma_ for token in doc if token.is_stop == False and token.is_punct == False and token.pos_ != "NUM"]
    docsClean.append(cleanWords)
print(docsClean)
for doc in docsClean:
    bag_vector = np.zeros(len(doc))
    for w in doc:
        for i, word in enumerate(doc):
            if word == w:
                bag_vector[i] += 1
    print(bag_vector)
This is what my bag-of-words model looks like:
You can try using pandas and get_dummies for this.
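For example, something along these lines turns token lists like the docsClean lists above into a document-term matrix with one shared column per word (a sketch; the two sample documents are made up):
import pandas as pd
docs = [["good", "movie", "good"], ["bad", "plot", "movie"]]  # stand-in for docsClean
# one row per document, one column per word, cell = term frequency
bow = pd.concat([pd.get_dummies(pd.Series(doc)).sum() for doc in docs], axis=1).T.fillna(0).astype(int)
print(bow)
Unlike the positional bag_vector above, every document row here shares the same columns, which is what a downstream classifier needs; scikit-learn's CountVectorizer or TfidfVectorizer would give you the same kind of matrix directly.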
import re
import spacy
import nltk
from nltk.corpus import stopwords
stop = stopwords.words('english')
from nltk.corpus import wordnet
inputfile = open('inputfile.txt', 'r')
String = inputfile.read()
nlp = spacy.load('en_core_web_sm')
def candidate_name_extractor(input_string, nlp):
    input_string = str(input_string)
    doc = nlp(input_string)
    # Extract entities
    doc_entities = doc.ents
    # Subset to person type entities
    doc_persons = filter(lambda x: x.label_ == 'PERSON', doc_entities)
    doc_persons = filter(lambda x: len(x.text.strip().split()) >= 2, doc_persons)
    doc_persons = list(map(lambda x: x.text.strip(), doc_persons))
    print(doc_persons)
    # Assuming that the first PERSON entity with at least two tokens is the candidate's name
    candidate_name = doc_persons[0]
    return candidate_name
if __name__ == '__main__':
    names = candidate_name_extractor(String, nlp)
    print(names)
I want to extract the name of the candidate from a text file, but it returns the wrong value. When I remove list() around map(), the map object also does not work and gives an error.
import re
import nltk
from nltk.corpus import stopwords
stop = stopwords.words('english')
from nltk.corpus import wordnet
String = 'Ravana was killed in a war'
Sentences = nltk.sent_tokenize(String)
Tokens = []
for Sent in Sentences:
    Tokens.append(nltk.word_tokenize(Sent))
Words_List = [nltk.pos_tag(Token) for Token in Tokens]
Nouns_List = []
for List in Words_List:
    for Word in List:
        if re.match('NN.*', Word[1]):
            Nouns_List.append(Word[0])
Names = []
for Nouns in Nouns_List:
    if not wordnet.synsets(Nouns):
        Names.append(Nouns)
print (Names)
Check this code. I am getting Ravana as output.
EDIT:
I used a few sentences from my resume to create a text file, and gave it as input to my program. Only the changed portion of the code is shown below:
import io
File = io.open("Documents\\Temp.txt", 'r', encoding = 'utf-8')
String = File.read()
String = re.sub('[/|.|#|%|\d+]', '', String)
And it returns all the names that are not in the WordNet corpus, like my name, my house name, my place, and my college's name and location.
From the word list obtained after part-of-speech tagging, extract all the words with a noun tag using a regular expression:
Nouns_List = []
for List in Words_List:
    for Word in List:
        if re.match('NN.*', Word[1]):
            Nouns_List.append(Word[0])
For each word in the Nouns_List, check whether it is an English word. This can be done by checking whether synsets are available for that word in wordnet:
from nltk.corpus import wordnet
Names = []
for Nouns in Nouns_List:
    if not wordnet.synsets(Nouns):
        # Not an English word
        Names.append(Nouns)
Since Indian names are unlikely to have entries in an English dictionary, this can be a possible method for extracting them from text.