NO CODE NEEDED
I am computing the probability that, given a series of words, the next word is some given word. I am currently working with nltk/Python and was wondering whether there is a simple function to do this, or whether I need to hard-code this kind of thing myself by iterating through the text and counting all occurrences.
Thanks
You have to iterate over the whole text first and count the n-grams so that you can compute their probability given a preceding sequence.
Here is a very simple example:
import re
from collections import defaultdict, Counter
# Tokenize the text in a very naive way.
text = "The Maroon Bells are a pair of peaks in the Elk Mountains of Colorado, United States, close to the town of Aspen. The two peaks are separated by around 500 meters (one-third of a mile). Maroon Peak is the higher of the two, with an altitude of 14,163 feet (4317.0 m), and North Maroon Peak rises to 14,019 feet (4273.0 m), making them both fourteeners. The Maroon Bells are a popular tourist destination for day and overnight visitors, with around 300,000 visitors every season."
tokens = re.findall(r"\w+", text.lower(), re.U)
def get_ngram_mapping(tokens, n):
    # Add markers for the beginning and end of the text.
    tokens = ["[BOS]"] + tokens + ["[EOS]"]
    # Map a preceding sequence of n-1 tokens to a list
    # of following tokens. 'defaultdict' is used to
    # give us an empty list when we access a key that
    # does not exist yet.
    ngram_mapping = defaultdict(list)
    # Iterate through the text using a moving window
    # of length n.
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i+n]
        preceding_sequence = tuple(window[:-1])
        following_token = window[-1]
        # Example for n=3: 'it is good' =>
        # ngram_mapping[("it", "is")] = ["good"]
        ngram_mapping[preceding_sequence].append(following_token)
    return ngram_mapping
def compute_ngram_probability(ngram_mapping):
    ngram_probability = {}
    for preceding, following in ngram_mapping.items():
        # Let's count which tokens appear right
        # behind the tokens in the preceding sequence.
        # Example: Counter(['a', 'a', 'b'])
        # => {'a': 2, 'b': 1}
        token_counts = Counter(following)
        # Next we compute the probability that
        # a token 'w' follows our sequence 's'
        # by dividing by the frequency of 's'.
        frequency_s = len(following)
        token_probability = defaultdict(float)
        for token, token_frequency in token_counts.items():
            token_probability[token] = token_frequency / frequency_s
        ngram_probability[preceding] = token_probability
    return ngram_probability
ngrams = get_ngram_mapping(tokens, n=2)
ngram_probability = compute_ngram_probability(ngrams)
print(ngram_probability[("the",)]["elk"]) # = 0.14285714285714285
print(ngram_probability[("the",)]["unknown"]) # = 0.0
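The same functions work unchanged for longer contexts. For example, a quick trigram check on the sample text above, where "maroon bells" is always followed by "are":
trigram_mapping = get_ngram_mapping(tokens, n=3)
trigram_probability = compute_ngram_probability(trigram_mapping)
print(trigram_probability[("maroon", "bells")]["are"])  # = 1.0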
I needed to solve the same issue as well. I used the nltk.ngrams() function to get n-grams and then flattened them into a list of (context, word) pairs, because nltk.ConditionalFreqDist() expects pair-like input. I then fed the result into nltk.ConditionalProbDist(). Example code below:
import nltk
from collections import defaultdict

# 'tokens' (a list of words), 'n' (the n-gram order) and 'topk' are assumed
# to be defined elsewhere.
ngram_prob = defaultdict(float)
ngrams_as_bigrams = []
ngrams_as_bigrams.extend([(t[:-1], t[-1]) for t in nltk.ngrams(tokens, n)])
cfd = nltk.ConditionalFreqDist(ngrams_as_bigrams)
cpdist = nltk.ConditionalProbDist(cfd, nltk.LidstoneProbDist, gamma=0.2, bins=len(tokens))
for (pre, follow) in ngrams_as_bigrams:
    all_st = pre + (follow,)
    ngram_prob[all_st] = cpdist[pre].prob(follow)
sorted_ngrams = [' '.join(k) for k, v in sorted(ngram_prob.items(), key=lambda x: x[1])[::-1]][:topk]
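Once cpdist is built you can also query a single continuation directly; for a bigram model (n=2) the condition is a one-element tuple, e.g. (assuming "the" occurs in your tokens):
print(cpdist[("the",)].prob("elk"))  # smoothed probability of "elk" following "the"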
I have quite a large dataframe (2000+ entries) with a column of text. I want to calculate the number of 'rare' words in each row. I think I have it mostly worked out, but the last line
final = [(len([w for w in df['text_cleaned'][idx] if w not in most_common])) for idx, w in enumerate(df)]
doesn't seem to be iterating over the entire dataframe; instead the output only covers the first two entries, so I can't add that list back into my dataframe with df['count']=final.
Also, I am concerned about processing time, so I am wondering if there is a more efficient way of doing this?
!pip install clean-text
import nltk
nltk.download('punkt')
import pandas as pd
import string
from collections import Counter
from cleantext.sklearn import CleanTransformer
import string
# Sample data here
df = pd.DataFrame()
df['text']=['Peter Piper picked a peck of pickled peppers. A peck of pickled peppers Peter Piper picked. If Peter Piper picked a peck of pickled peppers. Where’s the peck of pickled peppers Peter Piper picked?',
'Betty Botter bought some butter But she said the butter’s bitter If I put it in my batter, it will make my batter bitter But a bit of better butter will make my batter better So ‘twas better Betty Botter bought a bit of better butter',
'How much wood would a woodchuck chuck if a woodchuck could chuck wood?. He would chuck, he would, as much as he could, and chuck as much wood. As a woodchuck would if a woodchuck could chuck wood',
'Susie works in a shoeshine shop. Where she shines she sits, and where she sits she shines']
#--
# Convert strings to list
df['text_cleaned'] = [[i] for i in df['text']]
# Clean text for each row in dataframe
cleaner = CleanTransformer(no_punct=True, lower=True) # defining the parameters of the cleaner
full_text_clean = [cleaner.transform(element) for element in df['text_cleaned']]
df['text_cleaned']=full_text_clean
# Tokenize each row in dataframe
text_clean_string = [' '.join(list_element) for list_element in df['text_cleaned']]
Token = [nltk.word_tokenize(token_words) for token_words in text_clean_string]
df['text_cleaned']=Token
# ----
# create a list of all the words in the dataframe, to calculate the high-frequency words across the entire sample
full_text = [element for element in df['text']] # create a list
cleaner = CleanTransformer(no_punct=True, lower=True) # clean the list
full_text_clean = cleaner.transform(full_text)
Words_s = ' '.join(full_text_clean) # convert the list to a string
tokens = nltk.word_tokenize(Words_s) # tokenize
dictionary = Counter(Words_s.split()).most_common(10) # dictionary of the 10 most frequently occurring words and their frequencies
most_common = [x for x, y in dictionary] # create a list of the top occurring words
# Compare the lists
final = [(len([w for w in df['text_cleaned'][idx] if w not in most_common])) for idx, w in enumerate(df)]
Just for completeness I wanted to post what I ended up doing. @Panagiotis Papastathis brought up a good point about the 'most_common' words, in that I was specifying the top 10 words but was not taking their frequency into account. I ended up replacing
tokens = nltk.word_tokenize(Words_s) # tokenize
dictionary = Counter(Words_s.split()).most_common(10) # dictionary of the 10 most frequently occurring words and their frequencies
most_common = [x for x, y in dictionary] # create a list of the top occurring words
with
dictionary = Counter(Words_s.split()).most_common() # dictionary
most_common = [x for x, y in dictionary if y >= 4 ] # take into account frequency when filtering
which I think accounts for the problem (also removing the line where I tokenize the words)
And as @Panagiotis Papastathis pointed out, the last line was changed to
final = [(len([w for w in df['text_cleaned'][idx] if w not in most_common])) for idx, w in enumerate(df["text_cleaned"])]
df['count']=final
so all together
import nltk
from collections import Counter
from cleantext.sklearn import CleanTransformer
import string
# Convert strings to list
df['text_cleaned'] = [[i] for i in df['text']]
# Clean text for each row in dataframe
cleaner = CleanTransformer(no_punct=True, lower=True) # defining the parameters of the cleaner
full_text_clean = [cleaner.transform(element) for element in df['text_cleaned']]
df['text_cleaned']=full_text_clean
# Tokenize each row in dataframe
text_clean_string = [' '.join(list_element) for list_element in df['text_cleaned']]
Token = [nltk.word_tokenize(token_words) for token_words in text_clean_string]
df['text_cleaned']=Token
# ----
# create a list of all the words in the dataframe, to calculate the high-frequency words across the entire sample
full_text = [element for element in df['text']] # create a list
cleaner = CleanTransformer(no_punct=True, lower=True) # clean the list
full_text_clean = cleaner.transform(full_text)
Words_s = ' '.join(full_text_clean) # convert the list to a string
dictionary = Counter(Words_s.split()).most_common() # dictionary
most_common = [x for x, y in dictionary if y >= 4 ]
# Compare the lists
final = [(len([w for w in df['text_cleaned'][idx] if w not in most_common])) for idx, w in enumerate(df["text_cleaned"])]
df['uncommon_words'] = final
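Regarding the efficiency concern: once df['text_cleaned'] holds a list of tokens per row, the per-row count can also be computed with a pandas apply instead of positional indexing; a minimal sketch, assuming most_common is defined as above:
most_common_set = set(most_common)  # set membership tests are faster on larger data
df['uncommon_words'] = df['text_cleaned'].apply(
    lambda tokens: sum(1 for w in tokens if w not in most_common_set)
)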
This is the CSV in question. I got this data from Extra History, which is a video series that talks about many historical topics through many mini-series, such as 'Rome: The Punic Wars', or 'Europe: The First Crusades' (in the CSV). Episodes in these mini-series are numbered #1 up to #6 (though Justinian Theodora has two series, the first numbered #1-#6, the second #7-#12).
I would like to do some statistical analysis on these mini-series, which entails sorting the numbered episodes (e.g. episodes #2-#6) into their appropriate series, i.e. the end result should look something like this; I can then easily automate sorting them into the appropriate Python lists.
My Python code matches the #2-#6 episodes to the #1 episode correctly 99% of the time, with only 1 big error in red and 1 slight error in yellow (because the first episode of that series is #7, not #1). However, I get the nagging feeling that there is an easier, foolproof way, since the strings are well organized with regular patterns in their names. Is that possible? And can I achieve it with my current code, or should I approach it from a different angle?
import csv
eh_csv = '/Users/Work/Desktop/Extra History Playlist Video Data.csv'
with open(eh_csv, newline='', encoding='UTF-8') as f:
    reader = csv.reader(f)
    data = list(reader)
import re
series_first = []
series_rest = []
singles_music_lies = []
all_episodes = [] #list of name of all episodes
#separates all videos into 3 non-overlapping lists: first episodes of a series,
#the other numbered episodes of the series, and singles/music/lies videos
for video in data:
    all_episodes.append(video[0])
    #need regex b/c a normal string search for #1 also matches (Justinian &) Theodora #10
    if len(re.findall('\\b1\\b', video[0])) == 1:
        series_first.append(video[0])
    elif '#' not in video[0]:
        singles_music_lies.append(video[0])
    else:
        series_rest.append(video[0])
#Dice's Coefficient
#got from here; John Rutledge's answer with NinjaMeTimbers modification
#https://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings
#------------------------------------------------------------------------------------------
def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    s = string.lower()
    return [s[i:i+2] for i in list(range(len(s) - 1))]

def string_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                pairs2.remove(y)
                break
    return (2.0 * hit_count) / union
#-------------------------------------------------------------------------------------------
#only take a couple of words of the episode's name for comparison, b/c the first couple of words are 99% of the
#time the name of the series; can't make it too short or words like 'the, of' etc will get matched (now or
#in future), or too long because that will increase the chance of a superfluous match; does much better than w/o
#limiting to the first few words
def first_three_words(name_string):
    #eg ''.join vs ' '.join
    first_three = ' '.join(name_string.split()[:5]) #-->'The Haitian Revolution' slightly worse
    #first_three = ''.join(name_string.split()[:5]) #--> 'TheHaitianRevolution', slightly better
    return first_three

#compare a given episode with all first videos, and return a list of comparison scores
def compared_with_first(episode, series_name = series_first):
    episode_scores = []
    for i in series_name:
        x = first_three_words(episode)
        y = first_three_words(i)
        #comparison_score = round(string_similarity(episode, i),4)
        comparison_score = round(string_similarity(x,y),4)
        episode_scores.append((comparison_score, i))
    return episode_scores
matches = []
#go through video number 2,3,4 etc in a series and compare them with the first episode
#of all series, then get a comparison score
for episode in series_rest:
    scores_list = compared_with_first(episode)
    similarity_score = 0
    most_likely_match = []
    #go through the list of comparison scores returned from compared_with_first,
    #then append the current episode/highest score/first episode to
    #most_likely_match; repeat for all non-first episodes
    for score in scores_list:
        if score[0] > similarity_score:
            similarity_score = score[0]
            most_likely_match.clear() #MIGHT HAVE BEEN THE CRUCIAL KEY
            most_likely_match.append((episode,score))
    matches.append(most_likely_match)

final_match = []
for i in matches:
    final_match.append((i[0][0], i[0][1][1], i[0][1][0]))
#just to get output in desired presentation
path = '/Users/Work/Desktop/'
with open('EH Sorting Episodes.csv', 'w', newline='', encoding='UTF-8') as csvfile:
    csvwriter = csv.writer(csvfile)
    for currentRow in final_match:
        csvwriter.writerow(currentRow)
        #print(currentRow)
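Since the titles do follow regular patterns, a pattern-based grouping may be worth trying before fuzzy matching; a minimal sketch (the exact split depends on how the titles in the CSV are actually formatted, so this is an assumption, not a drop-in replacement):
from collections import defaultdict

def series_key(title):
    # keep the text before the '#<number>' marker and strip trailing separators;
    # assumes the series name comes first in the title
    return title.split('#')[0].rstrip(' -–')

groups = defaultdict(list)
for title in all_episodes:
    if '#' in title:
        groups[series_key(title)].append(title)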
I'm an amateur with basic coding skills in Python. I'm working on a data frame that has a column as below. The intent is to group the output of nltk.FreqDist by the first word.
What I have so far
t_words = df_tech['message']
data_analysis = nltk.FreqDist(t_words)
# Let's take the specific words only if their frequency is greater than 3.
filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])
for key in sorted(filter_words):
    print("%s: %s" % (key, filter_words[key]))
sample current output:
click full refund showing currently viewed rr number: 1
click go: 1
click post refund: 1
click refresh like replace tokens sending: 1
click refund: 1
click refund order: 1
click resend email confirmation: 1
click responsible party: 1
click send right: 1
click tick mark right: 1
I have 10000+ rows in my output.
My Expected Output
I would like to group the output by the first word and extract it as a dataframe
What I have tried among other solutions
I have tried adapting solutions given here and here, but no satisfactory results.
Any help/guidance appreciated.
Try the following (documentation is inside the code):
import itertools

# I assume the input, t_words, is a list of strings (each containing multiple words)
t_words = ...

# This creates a counter from a string to its occurrences
input_frequencies = nltk.FreqDist(t_words)

# Taking inputs only if they appear more than 3 times.
# This is similar to your code, but looks at the frequency. Your previous code
# did len(m) where m was the message. If you want to filter by the string length,
# you can restore it to len(input_str) > 3
frequent_inputs = {
    input_str: count
    for input_str, count in input_frequencies.items()
    if count > 3
}

# We will apply this function on each string to get the first word (to be
# used as the key for the grouping)
def first_word(value):
    # You can replace this by a better implementation from nltk
    return value.split(' ')[0]

# Now we will use itertools.groupby for the grouping, as documented in
# https://docs.python.org/3/library/itertools.html#itertools.groupby
# Note that groupby only merges consecutive items with the same key, so the
# strings are sorted by their first word beforehand.
first_word_to_inputs = itertools.groupby(
    # Take the strings from the above dictionary, sorted by first word
    sorted(frequent_inputs.keys(), key=first_word),
    # And key by the first word
    first_word)

# If you would also want to keep the count of each word, we can map from
# first word to a list of (string, count) pairs:
first_word_to_inputs_and_counts = itertools.groupby(
    # Pairs of string and count, sorted by the first word of the string
    sorted(frequent_inputs.items(), key=lambda pair: first_word(pair[0])),
    # Extract the string from the pair, and then take the first word
    lambda pair: first_word(pair[0])
)
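Note that groupby returns an iterator of (key, group-iterator) pairs, so to inspect the result you would materialize each group, e.g.:
for word, group in first_word_to_inputs:
    print(word, list(group))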
I managed to do it like below. There could be an easier implementation. But for now, this gives me what I had expected.
temp = pd.DataFrame(sorted(data_analysis.items()), columns=['word', 'frequency'])
temp['word'] = temp['word'].apply(lambda x: x.strip())
#Removing empty rows
filter = temp["word"] != ""
dfNew = temp[filter]
#Splitting off the first word
dfNew['first_word'] = dfNew.word.str.split().str.get(0)
#New column with the sentences minus their first word
dfNew['rest_words'] = dfNew['word'].str.split(n=1).str[1]
#Subsetting required columns
dfNew = dfNew[['first_word','rest_words']]
# Grouping by first word
dfNew= dfNew.groupby('first_word').agg(lambda x: x.tolist()).reset_index()
#Transpose
dfNew.T
Sample Output
Given a model, e.g.
from gensim.models.word2vec import Word2Vec
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
texts = [d.lower().split() for d in documents]
w2v_model = Word2Vec(texts, size=5, window=5, min_count=1, workers=10)
It's possible to remove the word from the w2v vocabulary, e.g.
# Originally, it's there.
>>> print(w2v_model['graph'])
[-0.00401433 0.08862179 0.08601206 0.05281207 -0.00673626]
>>> print(w2v_model.wv.vocab['graph'])
Vocab(count:3, index:5, sample_int:750148289)
# Find most similar words.
>>> print(w2v_model.most_similar('graph'))
[('binary', 0.6781558990478516), ('a', 0.6284914612770081), ('unordered', 0.5971308350563049), ('perceived', 0.5612867474555969), ('iv', 0.5470727682113647), ('error', 0.5346164703369141), ('machine', 0.480206698179245), ('quasi', 0.256790429353714), ('relation', 0.2496253103017807), ('trees', 0.2276223599910736)]
# We can delete it from the dictionary
>>> del w2v_model.wv.vocab['graph']
>>> print(w2v_model['graph'])
KeyError: "word 'graph' not in vocabulary"
But when we do a similarity on other words after deleting graph, we see the word graph popping up, e.g.
>>> w2v_model.most_similar('binary')
[('unordered', 0.8710334300994873), ('ordering', 0.8463168144226074), ('perceived', 0.7764195203781128), ('error', 0.7316686511039734), ('graph', 0.6781558990478516), ('generation', 0.5770125389099121), ('computer', 0.40017056465148926), ('a', 0.2762695848941803), ('testing', 0.26335978507995605), ('trees', 0.1948457509279251)]
How to remove a word completely from a Word2Vec model in gensim?
Updated
To answer @vumaasha's comment:
could you give some details as to why you want to delete a word
Let's say my universe of words is all the words in the corpus, used to learn the dense relations between all words.
But when I want to generate the similar words, they should only come from a subset of domain-specific words.
It's possible to generate more than enough from .most_similar() and then filter the words, but let's say the space of the specific domain is small; I might be looking for a word that's ranked 1000th most similar, which is inefficient.
It would be better if the word is totally removed from the word vectors then the .most_similar() words won't return words outside of the specific domain.
I wrote a function which removes words from KeyedVectors which aren't in a predefined word list.
def restrict_w2v(w2v, restricted_word_set):
    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []

    for i in range(len(w2v.vocab)):
        word = w2v.index2entity[i]
        vec = w2v.vectors[i]
        vocab = w2v.vocab[word]
        vec_norm = w2v.vectors_norm[i]
        if word in restricted_word_set:
            vocab.index = len(new_index2entity)
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(vec)
            new_vectors_norm.append(vec_norm)

    w2v.vocab = new_vocab
    w2v.vectors = new_vectors
    w2v.index2entity = new_index2entity
    w2v.index2word = new_index2entity
    w2v.vectors_norm = new_vectors_norm
It rewrites all of the word-related attributes of the Word2VecKeyedVectors object.
Usage:
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
w2v.most_similar("beer")
[('beers', 0.8409687876701355),
('lager', 0.7733745574951172),
('Beer', 0.71753990650177),
('drinks', 0.668931245803833),
('lagers', 0.6570086479187012),
('Yuengling_Lager', 0.655455470085144),
('microbrew', 0.6534324884414673),
('Brooklyn_Lager', 0.6501551866531372),
('suds', 0.6497018337249756),
('brewed_beer', 0.6490240097045898)]
restricted_word_set = {"beer", "wine", "computer", "python", "bash", "lagers"}
restrict_w2v(w2v, restricted_word_set)
w2v.most_similar("beer")
[('lagers', 0.6570085287094116),
('wine', 0.6217695474624634),
('bash', 0.20583480596542358),
('computer', 0.06677375733852386),
('python', 0.005948573350906372)]
There is no direct way to do what you are looking for. However, you are not completely lost. The method most_similar is implemented in the class WordEmbeddingsKeyedVectors (check the link). You can take a look at this method and modify it to suit your needs.
The lines shown below perform the actual logic of computing the similar words; you need to replace the variable limited with the vectors corresponding to the words of your interest. Then you are done.
limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]
dists = dot(limited, mean)
if not topn:
    return dists
best = matutils.argsort(dists, topn=topn + len(all_words), reverse=True)
Update:
limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]
Looking at this line: if restrict_vocab is used, it restricts limited to the top n words of the vocab, which is meaningful only if you have sorted the vocab by frequency. If you are not passing restrict_vocab, self.vectors_norm is what goes into limited.
The method most_similar calls another method, init_sims, which initializes the value of self.vectors_norm as shown below:
self.vectors_norm = (self.vectors / sqrt((self.vectors ** 2).sum(-1))[..., newaxis]).astype(REAL)
So you can pick up the words that you are interested in, prepare their norms and use them in place of limited. This should work.
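A minimal sketch of that idea, assuming w2v is a loaded KeyedVectors and words_of_interest is a hypothetical domain-specific subset:
import numpy as np

words_of_interest = ["beer", "wine", "lagers"]  # hypothetical subset
idx = [w2v.vocab[w].index for w in words_of_interest if w in w2v.vocab]
subset = w2v.vectors[idx]
# L2-normalize the subset, mirroring what init_sims does for vectors_norm
limited = subset / np.sqrt((subset ** 2).sum(-1))[..., np.newaxis]
# 'mean' is the (normalized) query vector built inside most_similar;
# dists = np.dot(limited, mean) would then score only the subset.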
Note that this does not trim the model per se. It trims the KeyedVectors object that the similarity look-ups are based on.
Suppose you only want to keep the top 5000 words in your model.
import numpy as np

wv = w2v_model.wv
words_to_trim = wv.index2word[5000:]
# In op's case
# words_to_trim = ['graph']
ids_to_trim = [wv.vocab[w].index for w in words_to_trim]

for w in words_to_trim:
    del wv.vocab[w]

wv.vectors = np.delete(wv.vectors, ids_to_trim, axis=0)
wv.init_sims(replace=True)

for i in sorted(ids_to_trim, reverse=True):
    del(wv.index2word[i])
This does the job because the BaseKeyedVectors class contains the following attributes: self.vectors, self.vectors_norm, self.vocab, self.vector_size, self.index2word.
The advantage of this is that if you write the KeyedVectors using methods such as save_word2vec_format(), the file is much smaller.
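For example, after trimming you could write the smaller set back out (hypothetical file name):
wv.save_word2vec_format("w2v_trimmed_top5000.bin", binary=True)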
I have tried this and feel that the most straightforward way is as follows:
Get the Word2Vec embeddings in text file format.
Identify the lines corresponding to the word vectors that you would like to keep.
Write a new text file Word2Vec embedding model.
Load model and enjoy (save to binary if you wish, etc.)...
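In outline, for a generic keep-set that might look like this (hypothetical file names; the word2vec text format has a '<count> <dim>' header followed by one 'word v1 v2 ...' line per word):
keep = {"beer", "wine", "lagers"}  # hypothetical keep-set
with open("vectors.txt", "r", encoding="utf-8") as src:
    header = src.readline()
    dim = header.split()[1]
    kept = [line for line in src if line.split(" ", 1)[0] in keep]
with open("vectors_small.txt", "w", encoding="utf-8") as dst:
    dst.write(f"{len(kept)} {dim}\n")
    dst.writelines(kept)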
My sample code is as follows:
# isLatin, txtWrite, txtAppend, file_entVecs_txt and file_entVecs_SHORT_txt
# are helper functions / paths defined elsewhere.
line_no = 0 # line0 = header
numEntities = 0
targetLines = []

with open(file_entVecs_txt,'r') as fp:
    header = fp.readline() # header

    while True:
        line = fp.readline()
        if line == '': #EOF
            break
        line_no += 1

        isLatinFlag = True
        for i_l, char in enumerate(line):
            if not isLatin(char): # Care about entity that is Latin-only
                isLatinFlag = False
                break
            if char==' ': # reached separator
                ent = line[:i_l]
                break

        if not isLatinFlag:
            continue

        # Check for numbers in entity
        if re.search('\d',ent):
            continue

        # Check for entities with subheadings '#' (e.g. 'ENTITY/Stereotactic_surgery#History')
        if re.match('^ENTITY/.*#',ent):
            continue

        targetLines.append(line_no)
        numEntities += 1

# Update header with new metadata
header_new = re.sub('^\d+',str(numEntities),header,count=1)

# Generate the file
txtWrite('',file_entVecs_SHORT_txt)
txtAppend(header_new,file_entVecs_SHORT_txt)

line_no = 0
ptr = 0
with open(file_entVecs_txt,'r') as fp:
    while ptr < len(targetLines):
        target_line_no = targetLines[ptr]

        while (line_no != target_line_no):
            fp.readline()
            line_no+=1

        line = fp.readline()
        line_no+=1
        ptr+=1
        txtAppend(line,file_entVecs_SHORT_txt)
FYI, FAILED ATTEMPT: I tried out @zsozso's method (with the np.array modifications suggested by @Taegyung) and left it to run overnight for at least 12 hrs; it was still stuck at getting new words from the restricted set... This is perhaps because I have a lot of entities. But my text-file method works within an hour.
FAILED CODE
# [FAILED] Stuck at Building new vocab...
def restrict_w2v(w2v, restricted_word_set):
    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []

    print('Building new vocab..')

    for i in range(len(w2v.vocab)):

        if (i%int(1e6)==0) and (i!=0):
            print(f'working on {i}')

        word = w2v.index2entity[i]
        vec = np.array(w2v.vectors[i])
        vocab = w2v.vocab[word]
        vec_norm = w2v.vectors_norm[i]
        if word in restricted_word_set:
            vocab.index = len(new_index2entity)
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(vec)
            new_vectors_norm.append(vec_norm)

    print('Assigning new vocab')
    w2v.vocab = new_vocab
    print('Assigning new vectors')
    w2v.vectors = np.array(new_vectors)
    print('Assigning new index2entity, index2word')
    w2v.index2entity = new_index2entity
    w2v.index2word = new_index2entity
    print('Assigning new vectors_norm')
    w2v.vectors_norm = np.array(new_vectors_norm)
I came across the following code, newsfeatures.py, in the book Programming Collective Intelligence.
Here's the code:
import feedparser
import re
feedlist=['http://today.reuters.com/rss/topNews',
'http://today.reuters.com/rss/domesticNews',
'http://today.reuters.com/rss/worldNews',
'http://hosted.ap.org/lineups/TOPHEADS-rss_2.0.xml',
'http://hosted.ap.org/lineups/USHEADS-rss_2.0.xml',
'http://hosted.ap.org/lineups/WORLDHEADS-rss_2.0.xml',
'http://hosted.ap.org/lineups/POLITICSHEADS-rss_2.0.xml',
'http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml',
'http://www.nytimes.com/services/xml/rss/nyt/International.xml',
'http://news.google.com/?output=rss',
'http://feeds.salon.com/salon/news',
'http://www.foxnews.com/xmlfeed/rss/0,4313,0,00.rss',
'http://www.foxnews.com/xmlfeed/rss/0,4313,80,00.rss',
'http://www.foxnews.com/xmlfeed/rss/0,4313,81,00.rss',
'http://rss.cnn.com/rss/edition.rss',
'http://rss.cnn.com/rss/edition_world.rss',
'http://rss.cnn.com/rss/edition_us.rss']
def stripHTML(h):
    p=''
    s=0
    for c in h:
        if c=='<': s=1
        elif c=='>':
            s=0
            p+=' '
        elif s==0: p+=c
    return p
def separatewords(text):
    splitter=re.compile('\\W*')
    return [s.lower( ) for s in splitter.split(text) if len(s)>3]
def getarticlewords( ):
    allwords={}
    articlewords=[]
    articletitles=[]
    ec=0
    # Loop over every feed
    for feed in feedlist:
        f=feedparser.parse(feed)

        # Loop over every article
        for e in f.entries:
            # Ignore identical articles
            if e.title in articletitles: continue

            # Extract the words
            txt=e.title.encode('utf8')+stripHTML(e.description.encode('utf8'))
            words=separatewords(txt)
            articlewords.append({})
            articletitles.append(e.title)

            # Increase the counts for this word in allwords and in articlewords
            for word in words:
                allwords.setdefault(word,0)
                allwords[word]+=1
                articlewords[ec].setdefault(word,0)
                articlewords[ec][word]+=1
            ec+=1
    return allwords,articlewords,articletitles
def makematrix(allw,articlew):
    wordvec=[]

    # Only take words that are common but not too common
    for w,c in allw.items( ):
        if c>3 and c<len(articlew)*0.6:
            wordvec.append(w)

    # Create the word matrix
    l1=[[(word in f and f[word] or 0) for word in wordvec] for f in articlew]
    return l1,wordvec
from numpy import *
def showfeatures(w,h,titles,wordvec,out='features.txt'):
    outfile=file(out,'w')
    pc,wc=shape(h)
    toppatterns=[[] for i in range(len(titles))]
    patternnames=[]

    # Loop over all the features
    for i in range(pc):
        slist=[]
        # Create a list of words and their weights
        for j in range(wc):
            slist.append((h[i,j],wordvec[j]))

        # Reverse sort the word list
        slist.sort( )
        slist.reverse( )

        # Print the first six elements
        n=[s[1] for s in slist[0:6]]
        outfile.write(str(n)+'\n')
        patternnames.append(n)

        # Create a list of articles for this feature
        flist=[]
        for j in range(len(titles)):
            # Add the article with its weight
            flist.append((w[j,i],titles[j]))
            toppatterns[j].append((w[j,i],i,titles[j]))

        # Reverse sort the list
        flist.sort( )
        flist.reverse( )

        # Show the top 3 articles
        for f in flist[0:3]:
            outfile.write(str(f)+'\n')
        outfile.write('\n')
    outfile.close( )

    # Return the pattern names for later use
    return toppatterns,patternnames
The usage is as follows:
>>> import newsfeatures
>>> allw,artw,artt= newsfeatures.getarticlewords( )
>>> artt[1]
u'Fatah, Hamas men abducted freed: sources'
As you can see, this line produces the news headline.
>>> artt[1]
u'Fatah, Hamas men abducted freed: sources'
What I want to know is: is there some way for the program to display not only the headline but also the source of the headline from the feedlist?
Could anyone help?
Thanks!
Replace
articletitles.append(e.title)
in getarticlewords() with something like
articletitles.append(' '.join([e.title, ', from', feed]))
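Alternatively, to keep the titles untouched, you could collect the sources in a parallel list (a sketch, not from the book): add
articlesources=[]
next to the other lists in getarticlewords(), add
articlesources.append(feed)
right after articletitles.append(e.title), and return it as a fourth value:
return allwords,articlewords,articletitles,articlesources
Then the feed URL for artt[1] is available at the same index of the extra list.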