I have quite a large dataframe (2000+ entries) with a text column. I want to calculate the number of 'rare' words in each row. I think I have it mostly worked out, but the last line
final = [(len([w for w in df['text_cleaned'][idx] if w not in most_common])) for idx, w in enumerate(df)]
doesn't seem to iterate over the entire dataframe; the output only has one entry per column (two in total), so I can't add the list back into my dataframe with df['count']=final.
Also, I am concerned about processing times, so I am wondering if there is a more efficient way of doing this?
!pip install clean-text
import nltk
nltk.download('punkt')
import pandas as pd
import string
from collections import Counter
from cleantext.sklearn import CleanTransformer
# Sample data here
df = pd.DataFrame()
df['text']=['Peter Piper picked a peck of pickled peppers. A peck of pickled peppers Peter Piper picked. If Peter Piper picked a peck of pickled peppers. Where’s the peck of pickled peppers Peter Piper picked?',
'Betty Botter bought some butter But she said the butter’s bitter If I put it in my batter, it will make my batter bitter But a bit of better butter will make my batter better So ‘twas better Betty Botter bought a bit of better butter',
'How much wood would a woodchuck chuck if a woodchuck could chuck wood?. He would chuck, he would, as much as he could, and chuck as much wood. As a woodchuck would if a woodchuck could chuck wood',
'Susie works in a shoeshine shop. Where she shines she sits, and where she sits she shines']
#--
# Convert strings to list
df['text_cleaned'] = [[i] for i in df['text']]
# Clean text for each row in dataframe
cleaner = CleanTransformer(no_punct=True, lower=True) # defining parameters of the cleaner
full_text_clean = [cleaner.transform(element) for element in df['text_cleaned']]
df['text_cleaned']=full_text_clean
# Tokenize each row in dataframe
text_clean_string = [' '.join(list_element) for list_element in df['text_cleaned']]
Token = [nltk.word_tokenize(token_words) for token_words in text_clean_string]
df['text_cleaned']=Token
# ----
# create a list of all the words in the dataframe, to calculate the high-frequency words across the entire sample
full_text = [element for element in df['text']] # create a list
cleaner = CleanTransformer(no_punct=True, lower=True) # clean the list
full_text_clean = cleaner.transform(full_text)
Words_s = ' '.join(full_text_clean) # convert the list to a string
tokens = nltk.word_tokenize(Words_s) # tokenize
dictionary = Counter(Words_s.split()).most_common(10) # the 10 most frequent words and their counts
most_common = [x for x, y in dictionary] # create a list of the top occurring words
# Compare the lists
final = [(len([w for w in df['text_cleaned'][idx] if w not in most_common])) for idx, w in enumerate(df)]
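Side note on why the list only has two entries: iterating over a DataFrame (which is what enumerate(df) does here) yields the column labels, not the rows, so the comprehension runs once per column. A minimal illustration with a throwaway two-column frame (the frame name is just for this demo):

import pandas as pd

demo = pd.DataFrame({'text': ['a b', 'c d'], 'text_cleaned': [['a', 'b'], ['c', 'd']]})
# Iterating a DataFrame yields its column labels, so enumerate gives two pairs here.
print(list(enumerate(demo)))  # [(0, 'text'), (1, 'text_cleaned')]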
Just for completeness I wanted to post what I ended up doing. @Panagiotis Papastathis brought up a good point about the 'most_common' words, in that I was specifying the top 10 words but not taking their frequency into account. I ended up replacing
tokens = nltk.word_tokenize(Words_s) # tokenize
dictionary = Counter(Words_s.split()).most_common(10) # the 10 most frequent words and their counts
most_common = [x for x, y in dictionary] # create a list of the top occurring words
with
dictionary = Counter(Words_s.split()).most_common() # dictionary
most_common = [x for x, y in dictionary if y >= 4 ] # take into account frequency when filtering
which I think accounts for the problem (also removing the line where I tokenize the words)
And as @Panagiotis Papastathis pointed out, the last line was changed to
final = [(len([w for w in df['text_cleaned'][idx] if w not in most_common])) for idx, w in enumerate(df["text_cleaned"])]
df['count']=final
so all together
from cleantext.sklearn import CleanTransformer
from collections import Counter
import nltk
import string
# Convert strings to list
df['text_cleaned'] = [[i] for i in df['text']]
# Clean text for each row in dataframe
cleaner = CleanTransformer(no_punct=True, lower=True) # defining parameters of the cleaner
full_text_clean = [cleaner.transform(element) for element in df['text_cleaned']]
df['text_cleaned']=full_text_clean
# Tokenize each row in dataframe
text_clean_string = [' '.join(list_element) for list_element in df['text_cleaned']]
Token = [nltk.word_tokenize(token_words) for token_words in text_clean_string]
df['text_cleaned']=Token
# ----
# create a list of all the words in the dataframe, to calculate the high-frequency words across the entire sample
full_text = [element for element in df['text']] # create a list
cleaner = CleanTransformer(no_punct=True, lower=True) # clean the list
full_text_clean = cleaner.transform(full_text)
Words_s = ' '.join(full_text_clean) # convert the list to a string
dictionary = Counter(Words_s.split()).most_common() # dictionary
most_common = [x for x, y in dictionary if y >= 4 ]
# Compare the lists
final = [(len([w for w in df['text_cleaned'][idx] if w not in most_common])) for idx, w in enumerate(df["text_cleaned"])]
df['uncommon_words'] = final
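On the processing-time concern from the question: converting most_common to a set makes each membership test O(1), and the per-row count can be done with a single apply over the tokenized column. A small sketch, reusing df and most_common from the code above:

common = set(most_common)  # set lookups are O(1), list lookups scale with the list length
df['uncommon_words'] = df['text_cleaned'].apply(
    lambda tokens: sum(1 for w in tokens if w not in common))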
I have a dataframe with two columns: the first is names of organisms and the second is their sequence, which is a string of letters. I am trying to create an algorithm to see if an organism's sequence is in a string of a larger genome, also composed of letters. If it is in the genome, I want to add the name of the organism to a list. So, for example, if flu is in the genome below, I want flu to be added to a list.
dict_1={'organisms':['flu', 'cold', 'stomach bug'], 'seq_list':['HTIDIJEKODKDMRM',
'AGGTTTEFGFGEERDDTER', 'EGHDGGEDCGRDSGRDCFD']}
df=pd.DataFrame(dict_1)
organisms seq_list
0 flu HTIDIJEKODKDMRM
1 cold AGGTTTEFGFGEERDDTER
2 stomach bug EGHDGGEDCGRDSGRDCFD
genome='TLTPSRDMEDHTIDIJEKODKDMRM'
This first function finds the index of the match, if there is one, where p is the organism's sequence and t is the genome. The second portion is the one I am having trouble with. I am trying to use a for loop to search each entry in the df, but if I get a match I am not sure how to reference the first column in the df to add the name to the empty list. Thank you for your help!
def naive(p, t):
    occurences = []
    for i in range(len(t) - len(p) + 1):
        match = True
        for j in range(len(p)):
            if t[i+j] != p[j]:
                match = False
                break
        if match:
            occurences.append(i)
    return occurences
Organisms_that_matched = []
for x in df:
    matches = naive(genome, x)
    if len(matches) > 0:
        # add name of organism to Organisms_that_matched list
I'm not sure if you are learning about different ways to traverse a list and apply custom logic, but you can use list comprehensions:
import pandas as pd
dict_1 = {
'organisms': ['flu', 'cold', 'stomach bug'],
'seq_list': ['HTIDIJEKODKDMRM', 'AGGTTTEFGFGEERDDTER', 'EGHDGGEDCGRDSGRDCFD']}
df = pd.DataFrame(dict_1)
genome = 'TLTPSRDMEDHTIDIJEKODKDMRM'
organisms_that_matched = [dict_1['organisms'][index] for index, x in enumerate(dict_1['seq_list']) if x in genome]
print(organisms_that_matched)
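If you prefer to keep the DataFrame and the naive matcher from the question, a sketch of the loop (note the argument order: naive(p, t) expects the pattern first and the text second, which is the reverse of the call in the question):

organisms_that_matched = []
for organism, seq in zip(df['organisms'], df['seq_list']):
    # pattern (the organism's sequence) first, the genome second
    if naive(seq, genome):
        organisms_that_matched.append(organism)
print(organisms_that_matched)  # ['flu']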
I'm working on extractive text summarization of long text data. I have multiple users' text data in the input CSV file, but the current code appends all of the text column's data into sentences and then applies the logic. How do I apply the code to each row instead of merging all the column values? Any help will be appreciated.
Input.csv (^ delimited)
uid^name^text
36d73f013aa7^Don Howard^The Irvine Foundation has entered into a partnership with College Futures Foundation that starts a new chapter in our support of postsecondary success in California.To achieve Irvine’s singular goal.
36d73f013aa8^Simon Haris^That’s why we have long provided funding to expand postsecondary success. Now with our focus on low-wage workers, we have decided to split our postsecondary funding into two parts:. Strengthening and expanding work-ready credentialing programs (which we will do directly, primarily as part of our Better Careers initiative).
36d73f013aa8^David^Accelerating and expanding the attainment of bachelor’s degrees (which we will fund through our partnership with College Futures). We believe that College Futures is in a stronger position than we are to make grants to support improvements in how the CSUs and the California Community Colleges can better serve students.
Pseudo code:
for each record:
    apply the summarization logic below to the text column to get its summary
Code: text summarization
import numpy as np
import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt') # one time execution
from nltk.corpus import stopwords
import re
from sklearn.metrics.pairwise import cosine_similarity  # used later to build the similarity matrix
# Read the CSV file
import io
df = pd.read_csv('/home/sshuser/textsummerisation/input.csv',sep='^')
# split the text in the articles into sentences
sentences = []
for s in df['text']:
    sentences.append(sent_tokenize(s))
# flatten the list
sentences = [y for x in sentences for y in x]
# remove punctuations, numbers and special characters
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)
# make alphabets lowercase
clean_sentences = [s.lower() for s in clean_sentences]
nltk.download('stopwords')# one time execution
stop_words = stopwords.words('english')
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new
# remove stopwords from the sentences
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
# Extract word vectors
word_embeddings = {}
fopen = open('/home/sshuser/textsummerisation/glove.6B.100d.txt', encoding='utf-8')
for line in fopen:
    values = line.split()
    word = values[0]
    print(values)
    print(word)
    coefs = np.asarray(values[1:], dtype='float32')
    word_embeddings[word] = coefs
fopen.close()
sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)
len(sentence_vectors)
# similarity matrix
sim_mat = np.zeros([len(sentences), len(sentences)])
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
import networkx as nx
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
# Specify number of sentences to form the summary
sn = 10
# Generate summary
for i in range(sn):
    print(ranked_sentences[i][1])
Expected output: the output of the above code should go into a summary column for each record:
uid^name^text^summary
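One way to get a summary per row rather than one summary for the concatenated text is to wrap the whole pipeline in a function and apply it to the text column. The sketch below only shows that per-row structure; the summarize helper is hypothetical and uses a deliberately simplified scorer (sentence length) where the GloVe/similarity/PageRank steps from the code above would go:

import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')  # one time execution

def summarize(text, sn=2):
    # Stand-in scorer: rank sentences by length just to keep the sketch short.
    # Replace this body with the embedding/similarity/PageRank logic above.
    sentences = sent_tokenize(text)
    ranked = sorted(sentences, key=len, reverse=True)
    return ' '.join(ranked[:sn])

df = pd.read_csv('/home/sshuser/textsummerisation/input.csv', sep='^')
df['summary'] = df['text'].apply(summarize)  # one summary per row
df.to_csv('output_with_summary.csv', sep='^', index=False)  # hypothetical output path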
I'm trying to remove stopwords in my data. So it would go from this
data['text'].head(5)
Out[25]:
0 go until jurong point, crazy.. available only ...
1 ok lar... joking wif u oni...
2 free entry in 2 a wkly comp to win fa cup fina...
3 u dun say so early hor... u c already then say...
4 nah i don't think he goes to usf, he lives aro...
Name: text, dtype: object
to this
data['newt'].head(5)
Out[26]:
0 [go, jurong, point,, crazy.., available, bugis...
1 [ok, lar..., joking, wif, u, oni...]
2 [free, entry, 2, wkly, comp, win, fa, cup, fin...
3 [u, dun, say, early, hor..., u, c, already, sa...
4 [nah, think, goes, usf,, lives, around, though]
Name: newt, dtype: object
I have two options for how to do this, and I'm trying them separately so nothing gets overwritten. First, I'm applying a function to the text column. This works; it removes the stopwords and achieves what I wanted to do.
def process(data):
    data = data.lower()
    data = data.split()
    data = [row for row in data if row not in stopwords]
    return data
data['newt'] = data['text'].apply(process)
The second option is without using apply. It does exactly the same as the function, but why does it return TypeError: unhashable type: 'list'? I checked that if row not in stopwords in that line is what causes this, because when I delete it, it runs, but it doesn't do the stopword removal.
data['newt'] = data['text'].str.lower()
data['newt'] = data['newt'].str.split()
data['newt'] = [row for row in data['newt'] if row not in stopwords]
Your list comprehension fails because it checks whether each entire row of the column (which, after the split, is a list of words) is in the stopwords collection, rather than checking the individual words. That check is never true, so what [row for row in data['newt'] if row not in stopwords] produces is simply the list of values in the original data['newt'] column.
I think that following your logic, your last lines for stopwords removal may read
data['newt'] = data['text'].str.lower()
data['newt'] = data['newt'].str.split()
data['newt'] = [[word for word in row if word not in stopwords] for row in data['newt']]
If you are OK using apply, the last line can be replaced with
data['newt'] = data['newt'].apply(lambda row: [word for word in row if word not in stopwords])
Finally, you could also call
data['newt'].apply(lambda row: " ".join(row))
to get back strings at the end of the process.
Mind that str.split may not be the best way to do tokenization; you may opt for a dedicated library like spaCy, combining its built-in stop-word removal with custom added stop words (see the spaCy sketch after the comparison below).
To convince yourself of the above argument, try out the following code:
import spacy
sent = "She said: 'beware, your sentences may contain a lot of funny chars!'"
# spacy tokenization
spacy.cli.download("en_core_web_sm")
nlp = spacy.load('en_core_web_sm')
doc = nlp(sent)
print([token.text for token in doc])
# simple split
print(sent.split())
and compare the two outputs.
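As a follow-up on the custom stop word point, a minimal sketch of stop-word removal with spaCy, reusing nlp and sent from the snippet above (the choice of "beware" as a custom stop word is purely illustrative):

# Register a custom stop word on top of spaCy's defaults.
nlp.Defaults.stop_words.add("beware")
nlp.vocab["beware"].is_stop = True

# Keep only tokens that are neither stop words nor punctuation.
filtered = [token.text for token in nlp(sent) if not token.is_stop and not token.is_punct]
print(filtered)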
I have a list of words in one list (word_list), and I created another list that is just a row of article headlines (headline_col). The headlines are strings of many words, while the word_list are single words. I want to search the headlines to see if they contain any of the words in my word list and, if so, append another list (slam_list) with the headline.
I've looked this up, and everything I see only matches one exact string to another. For example, checking whether the entry is exactly "apple", not whether it appears in "john ate an apple today".
I've tried using sets, but I was only able to get it to return True if there was a match; I didn't know how to get it to append to slam_list, or even just print the entry. This is what I have. How would I use this to get what I need?
import csv
word_list = ["Slam", "Slams", "Slammed", "Slamming",
"Blast", "Blasts", "Blasting", "Blasted"]
slam_list = []
csv_data = []
# Creating the list I need by opening a csv and getting the column I need
with open("website_headlines.csv", encoding="utf8") as csvfile:
reader = csv.reader(csvfile)
for row in reader:
data.append(row)
headline_col = [headline[2] for headline in csv_data]
So using sets, as you mentioned, is definitely the way to go here. This is because lookups in sets are much faster than in lists. If you want to know why, do a quick google search on hashing. All you need to do to make this change is change the square brackets in word_list to curly braces.
The real issue that you need to deal with is "The headlines are strings of many words, while the word_list are single words"
What you need to do is iterate over the many words. I'm assuming headline_col is a list of headlines, where headline is a string containing one or more words. We'll iterate over all the headlines, then iterate over each word in the headline.
word_list = {"Slam", "Slams", "Slammed", "Slamming", "Blast", "Blasts", "Blasting", "Blasted"}
# Iterate over each headline
for headline in headline_col:
# Iterate over each word in headline
# Headline.split will break the headline into a list of words (breaks on whitespace)
for word in headline.split():
# if we've found our word
if word in word_list:
# add the word to our list
slam_list.append(headline)
# we're done with this headline, so break from the inner for loop
break
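One caveat with the lookup above, added here as an aside: word_list contains capitalized words, while headline words may be lower-case or carry punctuation, so an exact membership test can miss them. A sketch of a normalized lookup (the normalization choices are assumptions about your data; slam_list is re-initialized just for this sketch):

import string

normalized_words = {w.lower() for w in word_list}
slam_list = []
for headline in headline_col:
    for word in headline.split():
        # strip surrounding punctuation and compare case-insensitively
        if word.strip(string.punctuation).lower() in normalized_words:
            slam_list.append(headline)
            break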
Here, since you are reading a csv, it is likely going to be easier to use pandas to accomplish your goals.
What you want to do is identify the column by its index, which looks like it is 2. Then you find the values of the third column that are in word_list.
import pandas as pd
df = pd.read_csv("website_headlines.csv")
col = df.columns[2]
df.loc[df[col].isin(word_list), col]
Consider the following example
import numpy as np
import pandas as pd
word_list = ["Slam", "Slams", "Slammed", "Slamming",
"Blast", "Blasts", "Blasting", "Blasted"]
# add some extra characters to see if limited to exact matches
word_list_mutated = np.random.choice(word_list + [item + '_extra' for item in word_list], 10)
data = {'a': range(1, 11), 'b': range(1, 11), 'c': word_list_mutated}
df = pd.DataFrame(data)
col = df.columns[2]
>>> df
    a   b               c
0   1   1           Slams
1   2   2           Slams
2   3   3   Blasted_extra
3   4   4          Blasts
4   5   5     Slams_extra
5   6   6  Slamming_extra
6   7   7            Slam
7   8   8     Slams_extra
8   9   9            Slam
9  10  10        Blasting
>>> df.loc[df[col].isin(word_list), col]
0       Slams
1       Slams
3      Blasts
6        Slam
8        Slam
9    Blasting
Name: c, dtype: object
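If the goal is the substring case from the question (headlines that merely contain one of the words), isin will not catch that; a possible alternative, reusing df, col and word_list from above, is pandas' str.contains with a joined pattern:

# Keep rows whose headline column contains any word from word_list,
# matched case-insensitively anywhere in the text.
pattern = '|'.join(word_list)
matching = df.loc[df[col].str.contains(pattern, case=False, na=False), col]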
NO CODE NEEDED
I am trying to check the probability that, given a series of words, the next word is some given word. I am currently working with nltk/Python and was wondering if there is a simple function to do this, or if I need to hand-code this kind of thing myself by iterating through and counting all occurrences.
Thanks
You have to iterate over the whole text first and count the n-grams so that you can compute their probability given a preceding sequence.
Here is a very simple example:
import re
from collections import defaultdict, Counter
# Tokenize the text in a very naive way.
text = "The Maroon Bells are a pair of peaks in the Elk Mountains of Colorado, United States, close to the town of Aspen. The two peaks are separated by around 500 meters (one-third of a mile). Maroon Peak is the higher of the two, with an altitude of 14,163 feet (4317.0 m), and North Maroon Peak rises to 14,019 feet (4273.0 m), making them both fourteeners. The Maroon Bells are a popular tourist destination for day and overnight visitors, with around 300,000 visitors every season."
tokens = re.findall(r"\w+", text.lower(), re.U)
def get_ngram_mapping(tokens, n):
    # Add markers for the beginning and end of the text.
    tokens = ["[BOS]"] + tokens + ["[EOS]"]
    # Map a preceding sequence of n-1 tokens to a list
    # of following tokens. 'defaultdict' is used to
    # give us an empty list when we access a key that
    # does not exist yet.
    ngram_mapping = defaultdict(list)
    # Iterate through the text using a moving window
    # of length n.
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i+n]
        preceding_sequence = tuple(window[:-1])
        following_token = window[-1]
        # Example for n=3: 'it is good' =>
        # ngram_mapping[("it", "is")] = ["good"]
        ngram_mapping[preceding_sequence].append(following_token)
    return ngram_mapping

def compute_ngram_probability(ngram_mapping):
    ngram_probability = {}
    for preceding, following in ngram_mapping.items():
        # Let's count which tokens appear right
        # behind the tokens in the preceding sequence.
        # Example: Counter(['a', 'a', 'b'])
        # => {'a': 2, 'b': 1}
        token_counts = Counter(following)
        # Next we compute the probability that
        # a token 'w' follows our sequence 's'
        # by dividing by the frequency of 's'.
        frequency_s = len(following)
        token_probability = defaultdict(float)
        for token, token_frequency in token_counts.items():
            token_probability[token] = token_frequency / frequency_s
        ngram_probability[preceding] = token_probability
    return ngram_probability
ngrams = get_ngram_mapping(tokens, n=2)
ngram_probability = compute_ngram_probability(ngrams)
print(ngram_probability[("the",)]["elk"]) # = 0.14285714285714285
print(ngram_probability[("the",)]["unknown"]) # = 0.0
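A quick usage note on the example above: longer contexts only require a larger n, in which case the keys become (n-1)-token tuples. With the same sample text:

trigram_mapping = get_ngram_mapping(tokens, n=3)
trigram_probability = compute_ngram_probability(trigram_mapping)
# "maroon bells" occurs twice in the text and is followed by "are" both times.
print(trigram_probability[("maroon", "bells")]["are"])  # = 1.0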
I needed to solve the same issue as well. I used the nltk.ngrams() function to get the n-grams and then converted them into a list of bigram-like (context, word) pairs, because nltk.ConditionalFreqDist() requires such pairs. Then I fed the results into nltk.ConditionalProbDist(). You can find the example code below:
import nltk
from collections import defaultdict

# Assumes `tokens` (the token list), `n` (the n-gram order) and `topk`
# (how many of the most probable n-grams to keep) are defined elsewhere.
ngram_prob = defaultdict(float)
ngrams_as_bigrams=[]
ngrams_as_bigrams.extend([((t[:-1]), t[-1]) for t in nltk.ngrams(tokens, n)])
cfd = nltk.ConditionalFreqDist(ngrams_as_bigrams)
cpdist = nltk.ConditionalProbDist(cfd, nltk.LidstoneProbDist, gamma=0.2, bins=len(tokens))
for (pre, follow) in ngrams_as_bigrams:
    all_st = pre + (follow,)
    ngram_prob[all_st] = cpdist[pre].prob(follow)
sorted_ngrams = [' '.join(k) for k, v in sorted(ngram_prob.items(), key=lambda x: x[1])[::-1]][:topk]
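For completeness, cpdist can also be queried directly for the smoothed probability of a particular next word given a preceding context; for example, assuming n = 2 and the tokens from the first answer's sample text:

# Lidstone-smoothed probability that "elk" follows the one-token context ("the",).
print(cpdist[("the",)].prob("elk"))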