I'm working on extractive text summarization for long text data. I have text data for multiple users in the input CSV file, but the current code appends all the text column values into one list of sentences and then applies the logic. How do I apply the code to each row instead of merging all the column values? Any help will be appreciated.
Input.csv (^ delimited)
uid^name^text
36d73f013aa7^Don Howard^The Irvine Foundation has entered into a partnership with College Futures Foundation that starts a new chapter in our support of postsecondary success in California.To achieve Irvine’s singular goal.
36d73f013aa8^Simon Haris^That’s why we have long provided funding to expand postsecondary success. Now with our focus on low-wage workers, we have decided to split our postsecondary funding into two parts:. Strengthening and expanding work-ready credentialing programs (which we will do directly, primarily as part of our Better Careers initiative).
36d73f013aa8^David^Accelerating and expanding the attainment of bachelor’s degrees (which we will fund through our partnership with College Futures). We believe that College Futures is in a stronger position than we are to make grants to support improvements in how the CSUs and the California Community Colleges can better serve students.
Pseudo code:
for each record:
    apply the logic below to the text column to get the summary
Code (text summarization):
import numpy as np
import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt') # one time execution
from nltk.corpus import stopwords
import re
# Read the CSV file
import io
df = pd.read_csv('/home/sshuser/textsummerisation/input.csv',sep='^')
# split the text in the articles into sentences
sentences = []
for s in df['text']:
    sentences.append(sent_tokenize(s))
# flatten the list
sentences = [y for x in sentences for y in x]
# remove punctuations, numbers and special characters
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)
# make alphabets lowercase
clean_sentences = [s.lower() for s in clean_sentences]
nltk.download('stopwords')# one time execution
stop_words = stopwords.words('english')
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new
# remove stopwords from the sentences
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
# Extract word vectors
word_embeddings = {}
fopen = open('/home/sshuser/textsummerisation/glove.6B.100d.txt', encoding='utf-8')
for line in fopen:
    values = line.split()
    word = values[0]
    print(values)
    print(word)
    coefs = np.asarray(values[1:], dtype='float32')
    word_embeddings[word] = coefs
fopen.close()
sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)
len(sentence_vectors)
# similarity matrix
from sklearn.metrics.pairwise import cosine_similarity
sim_mat = np.zeros([len(sentences), len(sentences)])
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
import networkx as nx
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
# Specify number of sentences to form the summary
sn = 10
# Generate summary
for i in range(sn):
    print(ranked_sentences[i][1])
Expected output: the summary produced by the code above should appear in a summary column for each record:
uid^name^text^summary
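One way to get a per-row summary is to wrap the pipeline above in a function that takes a single text string and returns its summary, then apply it to each row with DataFrame.apply. The sketch below is only a minimal illustration of that pattern, reusing the pieces defined above (sent_tokenize, remove_stopwords, word_embeddings, cosine_similarity, networkx); the function name summarize_text, the sn=2 default, and the output.csv path are illustrative choices, not part of the original code.
def summarize_text(text, sn=2):
    # split this row's text into sentences
    sents = sent_tokenize(text)
    if len(sents) <= sn:
        return " ".join(sents)
    # clean the sentences the same way as above
    clean = pd.Series(sents).str.replace("[^a-zA-Z]", " ", regex=True)
    clean = [remove_stopwords(s.lower().split()) for s in clean]
    # sentence vectors from the GloVe embeddings loaded above
    vecs = []
    for s in clean:
        if len(s) != 0:
            v = sum([word_embeddings.get(w, np.zeros((100,))) for w in s.split()]) / (len(s.split()) + 0.001)
        else:
            v = np.zeros((100,))
        vecs.append(v)
    # similarity matrix and PageRank over this row's sentences only
    sim = np.zeros([len(sents), len(sents)])
    for i in range(len(sents)):
        for j in range(len(sents)):
            if i != j:
                sim[i][j] = cosine_similarity(vecs[i].reshape(1, 100), vecs[j].reshape(1, 100))[0, 0]
    scores = nx.pagerank(nx.from_numpy_array(sim))
    ranked = sorted(((scores[i], s) for i, s in enumerate(sents)), reverse=True)
    return " ".join(s for _, s in ranked[:sn])
# apply the function to every row and store the result in a new column
df['summary'] = df['text'].apply(summarize_text)
df.to_csv('output.csv', sep='^', index=False)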
Related
I'm a new student of natural language processing and I have a task regarding simple corpus analysis. Given an input file (MovieCorpus.txt) we are assigned to compute the following statistics:
Number of sentences, tokens, types (lemmas)
Distribution of sentence length, types, POS
import nltk
import spacy as sp
from nltk import word_tokenize
# Setting spaCy model
nlp = sp.load('en_core_web_sm')
# Movie Corpus
with open('MovieCorpus.txt', 'r') as f:
    read_data = f.read().splitlines()
# Tokenize, POS, Lemma
tokens = []
lemma = []
pos = []
for doc in nlp.pipe(read_data):
    if doc.is_parsed:
        tokens.append([n.text for n in doc])
        lemma.append([n.lemma_ for n in doc])
        pos.append([n.pos_ for n in doc])
    else:
        tokens.append(None)
        lemma.append(None)
        pos.append(None)
ls = len(read_data)
print("The amount of sentences is %d:" %ls)
lt = len(tokens)
print("The amount of tokens is %d:" %lt)
ll = len(lemma)
print("The amount of lemmas is %d:" %ll)
This is my attempt at answering those questions, but since the file is very large (>300,000 sentences) it takes forever to analyze. Is there anything I did wrong? Should I rather use NLTK instead of spaCy?
import pandas as pd
import nltk
from nltk import word_tokenize
# Movie Corpus
with open('MovieCorpus.txt', 'r') as f:
    read_data = f.read().splitlines()
df = pd.DataFrame({"text": read_data})  # Assuming your data has no header
data = df.head(10)
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()
def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]
data['lemma'] = data.text.apply(lemmatize_text)
data["tokens"] = data.text.apply(nltk.word_tokenize)
data["posR"] = data.tokens.apply(lambda x: nltk.pos_tag(x))
tags = [[tag for word, tag in _] for _ in data["posR"].to_list()]
data["pos"] = tags
print(data)
From here on you should be able to do all other tasks by yourself.
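For the remaining statistics, a minimal sketch on top of the frame built above (drop the .head(10) to run it on the full corpus; each input line is treated as one sentence, matching the code above) could look like this:
from collections import Counter
# number of sentences, total tokens, and distinct lemmas (types)
n_sentences = len(data)
all_tokens = [t for row in data["tokens"] for t in row]
all_lemmas = [l for row in data["lemma"] for l in row]
print("Sentences: %d, tokens: %d, types (lemmas): %d"
      % (n_sentences, len(all_tokens), len(set(all_lemmas))))
# distribution of sentence lengths (in tokens) and of POS tags
length_dist = Counter(len(row) for row in data["tokens"])
pos_dist = Counter(tag for row in data["pos"] for tag in row)
print(length_dist.most_common(10))
print(pos_dist.most_common(10))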
This is officially doing my head in. I am web scraping a collection of tweets for text analysis. The tweets have been scraped and put into a dataframe, where each row is a string containing the entire tweet. I can't for the life of me remove quotation marks or apostrophes, but removing all other punctuation is OK.
What I am trying to do is extract just the verbs, nouns and adjectives from each of the scraped tweets, which I have done, but anything in quotation marks is excluded.
The code that I have been using so far is below, but I can't for the life of me add quotation marks or apostrophes to it. I have also tried every other method I can find on this site, but it either does nothing or produces errors.
tweets['Text_processed'] = tweets['Text'].map(lambda x: re.sub('[,\##.!?]', "", x))
The entire code base up until this point is:
import GetOldTweets3 as got
import pandas as pd
import re
from wordcloud import WordCloud
# Join the different processed titles together.
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter("ignore", DeprecationWarning)
# Load the LDA model from sk-learn
from sklearn.decomposition import LatentDirichletAllocation as LDA
import os
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly as py
import plotly.graph_objs as go
import gensim
from gensim import corpora, models, similarities
import logging
import tempfile
from nltk.corpus import stopwords
from string import punctuation
from collections import OrderedDict
import pyLDAvis.gensim
import tempfile
%matplotlib inline
init_notebook_mode(connected=True) #do not miss this line
warnings.filterwarnings("ignore")
# Function that pulls tweets based on a general search query and turns to csv file
# Parameters: (text query you want to search), (max number of most recent tweets to pull from)
def text_query_to_csv(text_query, count):
    # Creation of query object
    tweetCriteria = got.manager.TweetCriteria().setQuerySearch(text_query)\
                                               .setMaxTweets(count)
    # Creation of list that contains all tweets
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    # Creating list of chosen tweet data
    text_tweets = [[tweet.date, tweet.text] for tweet in tweets]
    # Creation of dataframe from tweets
    tweets_df = pd.DataFrame(text_tweets, columns=['Datetime', 'Text'])
    # Converting tweets dataframe to csv file
    tweets_df.to_csv('{}-{}-tweets.csv'.format(text_query, int(count)), sep=',')
############################################
# Search word and number of tweets to scrape
############################################
text_query = '#barackobama'
count = 5
# Calling function to query X amount of relevant tweets and create a CSV file
text_query_to_csv(text_query, count)
filename = '#barackobama-5-tweets.csv'
tweets = pd.read_csv(filename)
# Convert tweets to strings and lower case
tweets['Text'] = tweets['Text'].astype(str)
tweets['Text'] = tweets['Text'].map(lambda x: x.lower())
tweets
This is the offending bit of code below...
# remove punctuation
tweets['Text_processed'] = tweets['Text'].map(lambda x: re.sub('[,\##.!?]', "", x))
tweets['Text_processed'].head()
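For what it's worth, one way to also strip quotation marks and apostrophes (including the curly variants that often appear in tweets) is to list them explicitly in the character class; this is only a sketch of the pattern change, assuming the goal is simply to delete those characters:
# remove punctuation, plus straight and curly quotes and apostrophes
quote_chars = "[,#.!?'\"\u2018\u2019\u201c\u201d]"
tweets['Text_processed'] = tweets['Text'].map(lambda x: re.sub(quote_chars, "", x))
tweets['Text_processed'].head()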
#####################################
# Extract nouns, verbs and adjectives
#####################################
import nltk
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from IPython.display import display
lemmatizer = nltk.WordNetLemmatizer()
def leaves(tree):
    """Finds NP (nounphrase) leaf nodes of a chunk tree."""
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
        yield subtree.leaves()
def get_word_postag(word):
    if pos_tag([word])[0][1].startswith('J'):
        return wordnet.ADJ
    if pos_tag([word])[0][1].startswith('V'):
        return wordnet.VERB
    if pos_tag([word])[0][1].startswith('N'):
        return wordnet.NOUN
    else:
        return wordnet.NOUN
def normalise(word):
    """Normalises words to lowercase and stems and lemmatizes it."""
    word = word.lower()
    postag = get_word_postag(word)
    word = lemmatizer.lemmatize(word, postag)
    return word
def get_terms(tree):
    for leaf in leaves(tree):
        terms = [normalise(w) for w, t in leaf]
        yield terms
tidied_tweets = []
for t in tweets['Text']:
    # word tokenizing and part-of-speech tagging
    document = t
    tokens = [nltk.word_tokenize(sent) for sent in [document]]
    postag = [nltk.pos_tag(sent) for sent in tokens][0]
    # Rule for NP chunk and VB Chunk
    grammar = r"""
        NBAR:
            {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns
            {<RB.?>*<VB.?>*<JJ>*<VB.?>+<VB>?}  # Verbs and Verb Phrases
        NP:
            {<NBAR>}
            {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
    """
    # Chunking
    cp = nltk.RegexpParser(grammar)
    # the result is a tree
    tree = cp.parse(postag)
    terms = get_terms(tree)
    features = []
    for term in terms:
        _term = ''
        for word in term:
            _term += ' ' + word
        features.append(_term.strip())
    tidied_tweets.append(features)
tidied_tweets
The code base I have after this works OK, but the inability to remove quoted text or anything with an apostrophe is causing real problems.
EDITED TO ADD
I've managed to solve the problem, but in doing so, created another. The latest bit of code to extract the words sans the punctuation is:
tweet_list = []
ind_tweet = []
for tweets in tidied_tweets:
    for words in tweets:
        a = re.findall(r"[\w']+", words)
        ind_tweet.append(a)
    tweet_list.append(ind_tweet)
re.findall(r"[\w']+", words) does the job of extracting the word, but I can't create the same structured list I started with. What I wanted is 'tweet_list' to act as the parent list, and 'ind_tweet' to act as a sucession of child lists (nested). When I print out the result of the code above, I'm not able to create the nested list I am looking for. ind_tweet produces the output but in a single list with no nesting, and tweet_list duplicates ind_tweet. It probably isn't helping that it's 2:30am on a Saturday, but this should be much easier than I am making it...
And the answer is...
tweet_list = [[] for i in range(len(tidied_tweets))]
for tweets, t in zip(tidied_tweets, range(len(tidied_tweets))):
    for words in tweets:
        a = re.findall(r"[\w']+", words)
        tweet_list[t].append(a)
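An equivalent way to build the same nested structure, shown here only as an alternative sketch, is a nested list comprehension that creates one fresh inner list per tweet, so no index bookkeeping is needed:
# one inner list per tweet, one word list per matched token group
tweet_list = [[re.findall(r"[\w']+", words) for words in tweet]
              for tweet in tidied_tweets]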
I want to read the book (in a text file) as it is, in its current form. The book is a text file formatted like:
In Proceedings of the American Mathematical Society, volume 9, pages
541–544, 1958. [314] R. De Nicola and F. Vaandrager. Three logics for
branching bisimulation (extended abstract). In 5th Annual IEEE
Symposium on Logic in Computer Science (LICS), pages 118–129. IEEE
Computer Society Press, Springer-Verlag, 1990. [315] X. Nicollin and
J.-L. Richier and J. Sifakis and J. Voiron. ATP: an algebra for timed
processes. In IFIP TC2 Working Conference on Programming Concepts and
Methods, pages 402–427. North Holland, 1990.
The book can be any book (.txt file) with 400 or more pages.
What I am trying to do is, first of all, split each sentence of the book (txt file) into a list using sent_tokenize. After that, I have a list of lists of sentences for the input text file, and I need to write it (as it is, a list of lists of sentences) into a new file.
The program executes successfully, but when I open the newly generated file, which is supposed to contain the lists of sentences, it only shows blanks: [][][]. I suspect there is some encoding issue.
I am interested in figuring out this issue. Any help is appreciated.
The code I have tried is as follows:
import nltk, re
import string
from collections import Counter
from string import punctuation
from nltk.tokenize import TweetTokenizer, sent_tokenize, word_tokenize
from nltk.corpus import gutenberg, stopwords
from nltk.stem import WordNetLemmatizer
def remove_punctuation(from_text):
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in from_text]
    return stripped
def preprocessing():
    with open("F:\\Pen Drive 8 GB\\PDF\\Code\\Books Handbooks\\Books Handbooks Text\\b1.txt", encoding="utf-8") as f:
        tokens_sentences = sent_tokenize(f.read())
    tokens = [[word.lower() for word in line.split()] for line in tokens_sentences]
    global stripped_tokens
    stripped_tokens = [remove_punctuation(i) for i in tokens]
    sw = (stopwords.words('english'))
    filter_set = [[token for token in sentence if (token.lower() not in sw and token.isalnum() and token.isalpha() and re.findall(r"[^_ .'\"-[A-Za-z]]+", token))] for sentence in stripped_tokens]
    lemma = WordNetLemmatizer()
    lem = []
    for w in filter_set:
        lem.append([wi for wi in map(lemma.lemmatize, w)])
    return lem
result = preprocessing()
with open('F:\\Pen Drive 8 GB\\PDF\\Code\\Books Handbooks\\Books Handbooks Text\\b1list.txt', "w", encoding="utf-8") as f:
    for e in result[:]:
        f.write(str(e))
preprocessing()
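As a side note, the blank [] entries most likely come from the filtering step rather than from the file writing: for ordinary alphabetic tokens the re.findall(...) condition above returns an empty list, so nearly every token is dropped before anything is written. If the goal is to write the nested list to disk in a form that can be read back later, one option (a sketch only, using a hypothetical b1list.json path alongside the original files) is to serialize it with json instead of calling str() on each sub-list:
import json
result = preprocessing()
# hypothetical output path next to the original files
out_path = 'F:\\Pen Drive 8 GB\\PDF\\Code\\Books Handbooks\\Books Handbooks Text\\b1list.json'
with open(out_path, "w", encoding="utf-8") as f:
    # json.dump preserves the list-of-lists structure and UTF-8 text,
    # and json.load can read it back later
    json.dump(result, f, ensure_ascii=False)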
I have a set of documents stored in a JSON file. I retrieve them with the following code, so that they are stored in the variable data:
import json
with open('SDM_2015.json') as f:
    data = [json.loads(line) for line in f]
Integrating all texts into a single one to form the corpus is done by:
corpus = []
for i in range(len(data) - 1):
    corpus.append(data[i]['body'] + data[i+1]['body'])
Until now, pretty straightforward manipulations. To build the tf-idf I use the following lines of code, which remove stop words and punctuation, stem each term, and tokenize the data.
import nltk
import string
from sklearn.feature_extraction.text import TfidfVectorizer
# stemming each word (common root)
stemmer = nltk.stem.porter.PorterStemmer()
# removing punctuation etc.
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
## First function that creates the tokens
def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]
## Function that, using the first one, converts all words to lowercase and removes punctuation (via the map specified above)
def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))
## Lastly, a vectorizer that combines the previous functions and adds stopword removal
vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
I then try to apply this to the corpus as follows:
tfidf = vectorizer.fit_transform(corpus)
print(((tfidf*tfidf.T).A)[0,1])
But nothing happens. Any idea of how to proceed?
Kind regards
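As an aside, one way to check whether the vectorizer actually produced anything is to inspect the fitted matrix and its vocabulary; this is only a sketch, assuming the corpus list built above:
tfidf = vectorizer.fit_transform(corpus)
# (number of documents, number of terms in the learned vocabulary)
print(tfidf.shape)
# a few of the learned terms
print(sorted(vectorizer.vocabulary_)[:10])
# similarity between the first two documents
print((tfidf * tfidf.T).toarray()[0, 1])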
I am trying to build a small program that calculates tf-idf in Python. There are two very nice tutorials which I have used (I have code from here and another function from Kaggle).
import nltk
import string
import os
from bs4 import *
import re
from nltk.corpus import stopwords # Import the stop word list
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer
path = 'my/path'
token_dict = {}
stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed
def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
def review_to_words(raw_review):
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review).get_text()
    # 2. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    # 4. In Python, searching a set is much faster than searching
    #    a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))
    # 5. Remove stop words
    meaningful_words = [w for w in words if not w in stops]
    # 6. Join the words back into one string separated by space,
    #    and return the result.
    return " ".join(meaningful_words)
for subdir, dirs, files in os.walk(path):
    for file in files:
        file_path = subdir + os.path.sep + file
        shakes = open(file_path, 'r')
        text = shakes.read()
        token_dict[file] = review_to_words(text)
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(token_dict.values())
str = 'this sentence has unseen text such as computer but also king lord lord this this and that lord juliet'#teststring
response = tfidf.transform([str])
feature_names = tfidf.get_feature_names()
for col in response.nonzero()[1]:
    print(feature_names[col], ' - ', response[0, col])
The code seems to work fine, but then I had a look at the results:
thi - 0.612372435696
text - 0.204124145232
sentenc - 0.204124145232
lord - 0.612372435696
king - 0.204124145232
juliet - 0.204124145232
ha - 0.204124145232
comput - 0.204124145232
The IDFs seem to be the same for all the words because the TFIDFs are just n*0.204. I have checked with tfidf.idf_ and this seems to be the case.
Is there something in the method that I have not implemented correctly?
Do you know why the idf_s are the same?
Since you provided a list containing one document, all terms' idfs will have an equal 'binary frequency'.
idf is the inverted term frequency over the set of documents (or just inverted document frequency). Most, if not all, idf formulas only check for term presence in a document, so it does not matter how many times a term appears in a document.
Try feeding it a list with three distinct documents, for instance; this way the idfs will not be the same.
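For illustration, a minimal sketch with three made-up documents (independent of the code above) shows the idfs diverging as soon as the documents differ:
from sklearn.feature_extraction.text import TfidfVectorizer
docs = ["the king and the lord",
        "the lord of the castle",
        "juliet speaks to the king"]
vec = TfidfVectorizer()
vec.fit(docs)
# terms that appear in fewer documents get a higher idf
for term, idx in sorted(vec.vocabulary_.items()):
    print(term, vec.idf_[idx])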
The inverse document frequency of a term t is calculated as idf(t) = log(N / df_t), where N is the total number of documents and df_t is the number of documents in which the term t appears.
In this case, your program has one document (the str variable).
Therefore, both N and df_t equal 1.
As a result, the IDF is the same for all terms.