Python word grouping based on words before and after

I am trying to create groups of words. First I count all words. Then I establish the top 10 words by word count. Then I want to create 10 groups of words based on those top 10. Each group consists of all the words that appear before and after its top word.
I have survey results stored in a pandas DataFrame structured like this:
Question_ID | Customer_ID | Answer
1           | 234         | Data is very important to use because ...
2           | 234         | We value data since we need it ...
I am using the following code to find the 3 words before and after a given word (I had to convert the Answer column values to strings first):
import re

answers_str = df.Answer.apply(str)
for value in answers_str:
    non_data = re.split('data|Data', value)                     # split on the keyword
    terms_list = [term for term in non_data if len(term) > 0]   # skip empty terms
    substrs = [term.split()[0:3] for term in terms_list]        # slice and grab the first three words
    result = [' '.join(term) for term in substrs]               # combine the words back into substrings
    print(result)
I have been creating the groups of words manually, but is there a way of doing it in Python?
So, based on the example shown above, the group with word counts would look like this:
group "data":
data : 2
important: 1
value: 1
need:1
Then, when it goes through the whole file, there would be another group:
group "analytics":
analyze: 5
report: 7
list: 10
visualize: 16
The idea would be to get rid of "we", "to", "is" as well, but I can do that manually if it is not possible. Then the goal is to establish the 10 most used words (by word count) and to create 10 groups of the words that appear in front of and behind those top 10 words.

We can use regex for this. We'll be using this regular expression
((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})
to extract up to three words before and after each occurrence of data (you can test the pattern in any online regex tester).
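As a quick illustration of what the pattern captures on a made-up sentence:
import re

data_pat = r'((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})'
print(re.findall(data_pat, "We value data since we need it to plan"))
# -> [('We value ', ' since we need')]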
First, let's remove all the words we don't like from the strings.
import re

# If you're processing a lot of sentences, it's probably wise to precompile
# the pattern, assuming that bad_words is the same for all sentences
def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)
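For example, on one of the sample answers (note that the pattern matches substrings, so a bad word like 'is' would also be stripped out of a word like 'this'):
bad_words = ['we', 'is', 'to']
print(remove_words("We value data since we need it", bad_words))
# -> ' value data since  need it'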
Then we want to get the words that surround data in each line:
data_pat = r'((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})'
res = re.findall(data_pat, s, flags=re.IGNORECASE)  # s is one cleaned sentence
This gives us a list of tuples of strings. We want a flat list of the individual words, so we split each captured string and chain everything together.
from itertools import chain
list_of_words = list(chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res))))))
That's not pretty, but it works. Basically, we pull the tuples out of the list, pull the strings out of each tuple, split each string into words, and then chain all of those words into one big list.
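If the chained map calls are hard to read, a nested comprehension (just an equivalent alternative, not part of the original answer) does the same flattening:
list_of_words = [word
                 for pair in res              # each match is a (before, after) tuple
                 for captured in pair         # each captured string of up to 3 words
                 for word in captured.split()]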
Let's put this all together with your pandas code. pandas isn't my strongest area, so please don't assume that I haven't made some elementary mistake if you see something weird looking.
import re
from itertools import chain
from collections import Counter

def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)

bad_words = ['we', 'is', 'to']
sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))

c = Counter()
data_pat = r'((?:\b\w+?\b\s*){0,3})data((?:\s*\b\w+?\b){0,3})'
for sentence in sentence_list:
    res = re.findall(data_pat, sentence, flags=re.IGNORECASE)
    words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
    c.update(words)
The nice thing about the regex we're using is that none of the complicated parts care about which keyword we're looking for. With a slight change, we can make it a format string
base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'
such that
base_pat.format('data') == data_pat
So, given some list key_words of words we want to collect information about:
import re
from itertools import chain
from collections import Counter

def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)

bad_words = ['we', 'is', 'to']
sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))

key_words = ['data', 'analytics']
d = {}
base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'
for keyword in key_words:
    key_pat = base_pat.format(keyword)
    c = Counter()
    for sentence in sentence_list:
        res = re.findall(key_pat, sentence, flags=re.IGNORECASE)
        words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
        c.update(words)
    d[keyword] = c
Now we have a dictionary d that maps keywords, like data and analytics, to Counters that map words not on our blacklist to their counts in the vicinity of the associated keyword. Something like:
d = {'data': Counter({'important': 2,
                      'very': 3}),
     'analytics': Counter({'boring': 5,
                           'sleep': 3})}
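To read off the most common neighbours of any one keyword, Counter.most_common does the work; with the toy dictionary above:
print(d['data'].most_common(2))
# -> [('very', 3), ('important', 2)]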
As to how we get the top 10 words, that's basically the thing Counter is best at.
key_words, _ = zip(*Counter(w for sentence in sentence_list for w in sentence.split()).most_common(10))
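As a quick sanity check on a couple of made-up cleaned sentences:
from collections import Counter

sentence_list = [' value data since  need it', 'data  very important  use']
top_counter = Counter(w for sentence in sentence_list for w in sentence.split())
key_words, _ = zip(*top_counter.most_common(10))
print(key_words[0])  # -> 'data', the only word that appears twice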

Related

Python: How to match the words in split and non split?

I have a DataFrame as below and I wish to detect the repeated words, whether they appear split or unsplit:
Table A:
Cat    | Comments
Stat A | power down due to electric shock
Stat A | powerdown because short circuit
Stat A | top 10 on re work
Stat A | top10 on rework
I wish to get the output as below:
Repeated words = ['Powerdown', 'top10', 'on', 'rework']
Anyone have ideas?
I assume that having the words in a dataframe column is not really relevant for the problem at hand. I will therefore transfer them into a list, and then search for repeats.
import pandas as pd
df = pd.DataFrame({"Comments": ["power down due to electric shock", "powerdown because short circuit", "top 10 on re work", "top10 on rework"]})
words = df['Comments'].to_list()
This leads to
['power down due to electric shock',
'powerdown because short circuit',
'top 10 on re work',
'top10 on rework']
Now we create a new list to account for the fact that "top 10" and "top10" should be treated equal:
newa = []
for s in words:
    a = s.split()
    for i in range(len(a)-1):
        w = a[i] + a[i+1]
        a.append(w)
    newa.append(a)
which yields:
[['power',
'down',
'due',
'to',
'electric',
'shock',
'powerdown',
'downdue',
'dueto',
'toelectric',
'electricshock'],...
Finally we flatten the list and use Counter to find words which occur more than once:
from collections import Counter
from itertools import chain
wordList = list(chain(*newa))
wordCount = Counter(wordList)
[w for w,c in wordCount.most_common() if c>1]
leading to
['powerdown', 'on', 'top10', 'rework']
Let's try:
# assumes df has both 'Cat' and 'Comments' columns, as in the question
words = df['Comments'].str.split(' ').explode()
biwords = words + words.groupby(level=0).shift(-1)
(pd.concat([words.groupby(level=0).apply(pd.Series.drop_duplicates),     # remove duplicate words within a comment
            biwords.groupby(level=0).apply(pd.Series.drop_duplicates)])  # remove duplicate bi-words within a comment
   .dropna()                                   # remove NaN created by shifting
   .to_frame().join(df[['Cat']])               # join with the original Cat
   .loc[lambda x: x.duplicated(keep=False)]    # select the duplicated Comments within Cat
   .groupby('Cat')['Comments'].unique()        # select the unique values within each Cat
)
Output:
Cat
Stat A [powerdown, on, top10, rework]
Name: Comments, dtype: object

Python; use list of initial characters to retrieve full word from other list?

I'm trying to use the list of shortened words to select & retrieve the corresponding full word identified by its initial sequence of characters:
shortwords = ['appe', 'kid', 'deve', 'colo', 'armo']
fullwords = ['appearance', 'armour', 'colored', 'developing', 'disagreement', 'kid', 'pony', 'treasure']
Trying this regex match with a single shortened word:
import re

shortword = 'deve'
retrieved = filter(lambda i: re.match(r'{}'.format(shortword), i), fullwords)
print(*retrieved)
returns
developing
So the regex match works but the question is how to adapt the code to iterate through the shortwords list and retrieve the full words?
EDIT: The solution needs to preserve the order from the shortwords list.
Maybe use a dictionary:
# Using a dictionary
test = 'appe is a deve arm'
shortwords = ['appe', 'deve', 'colo', 'arm', 'pony', 'disa']
fullwords = ['appearance', 'developing', 'colored', 'armour', 'pony', 'disagreement']

# Building the dictionary
d = {}
for i in range(len(shortwords)):
    d[shortwords[i]] = fullwords[i]

# apply dictionary to test
res = " ".join(d.get(s, s) for s in test.split())

# print test data after dictionary mapping
print(res)
That is one way to do it:
shortwords = ['appe', 'deve', 'colo', 'arm', 'pony', 'disa']
fullwords = ['appearance', 'developing', 'colored', 'armour', 'pony', 'disagreement']
# Dict comprehension
words = {short:full for short, full in zip(shortwords, fullwords)}
#Solving problem
keys = ['deve','arm','pony']
values = [words[key] for key in keys]
print(values)
This is a classical key-value problem. Use a dictionary for that, or consider pandas in the long term.
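For the pandas route hinted at above, a minimal sketch (not part of the original answer) could use a Series indexed by the short forms:
import pandas as pd

shortwords = ['appe', 'deve', 'colo', 'arm', 'pony', 'disa']
fullwords = ['appearance', 'developing', 'colored', 'armour', 'pony', 'disagreement']

# the Series acts as a lookup table: index = short form, values = full word
lookup = pd.Series(fullwords, index=shortwords)
print(lookup.loc[['deve', 'arm', 'pony']].tolist())
# -> ['developing', 'armour', 'pony']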
Your question text seems to indicate that you're looking for your shortwords at the start of each word. That should be easy then:
matched_words = [word for word in fullwords if any(word.startswith(shortword) for shortword in shortwords)]
If you'd like to regex this for some reason (it's unlikely to be faster), you could do that with a large alternation:
regex_alternation = '|'.join(re.escape(shortword) for shortword in shortwords)
# group the alternation so the anchor applies to every alternative
matched_words = [word for word in fullwords if re.match(rf"^(?:{regex_alternation})", word)]
Alternately if your shortwords are always four characters, you could just slice the first four off:
shortwords = set(shortwords)  # sets have O(1) lookups, so this will save
                              # a significant amount of time if either shortwords
                              # or fullwords is long
matched_words = [word for word in fullwords if word[:4] in shortwords]
This snippet has the functionality I wanted. It builds a regular expression pattern at each loop iteration in order to accommodate varying word lengths, and it maintains the original order of the wordroots list. In essence, it looks at each word in wordroots and fills out the full word from the dataset. This is useful when working with the BIP-0039 word list, which contains words of 3-8 characters that are uniquely identifiable by their initial 4 characters. Recovery phrases are built by randomly selecting a sequence of words from the BIP-0039 list, and order is important. Observed security practice is often to abbreviate each word to at most its four initial characters. Here is code which rebuilds a recovery phrase from its abbreviation:
import re

wordroots = ['sun', 'sunk', 'sunn', 'suns']
dataset = ['sun', 'sunk', 'sunny', 'sunshine']
retrieved = []
for root in wordroots:
    # (exact match) or ((exact match at beginning of word when root is 4 or more characters) else (exact match))
    pattern = r"(^" + root + "$|" + ("^" + root + "[a-zA-Z]+)" if len(root) >= 4 else "^" + root + "$)")
    retrieved.extend(filter(lambda i: re.match(pattern, i), dataset))
print(*retrieved)
Output:
sun sunk sunny sunshine

Why isn't there a function to count Document Frequency (DF) in NLTK?

I am looking for a function to get the DF of a certain term (meaning how many documents in a corpus contain a certain word), but I can't seem to find one in NLTK. The documentation only has functions to get the values of tf, idf, and tf_idf. I am looking specifically for DF. I copied the code below from the documentation,
matches = len([True for text in self._texts if term in text])
but I don't like the result it gives. For example, if I have a list of strings and I am looking for the word Pete, it also counts the name Peter, which is not what I want. For example:
texts = [['the', 'boy', 'peter'],['pete','the', 'boy'],['peter','rabbit']]
So I am looking for pete which appears TWICE, but the code I showed above will tell you that there are THREE pete's because it also counts peter. How do I solve this? Thanks.
Your description is incorrect. The expression you posted does indeed give 1, not 3, when you search for pete in texts:
>>> texts = [['the', 'boy', 'peter'],['pete','the', 'boy'],['peter','rabbit']]
>>> len([True for text in texts if 'pete' in text])
1
The only way you could have matched partial words is if your texts were not tokenized (i.e. if texts is a list of strings, not a list of token lists).
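To see the difference, compare a membership test on a raw string with one on a token list:
>>> 'pete' in 'peter rabbit'        # substring test on a raw string
True
>>> 'pete' in ['peter', 'rabbit']   # membership test on a token list
False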
But the above code is terrible: it builds a list for no reason at all. A better (and more conventional) way to count hits is this:
>>> sum(1 for text in texts if 'pete' in text)
1
As for the question that you pose (Why (...)?) : I don't know.
As a solution to your example (noting that peter occurs twice and pete just once):
texts = [['the', 'boy', 'peter'], ['pete', 'the', 'boy'], ['peter', 'rabbit']]

def flatten(l):
    out = []
    for item in l:
        if isinstance(item, (list, tuple)):
            out.extend(flatten(item))
        else:
            out.append(item)
    return out

flat = flatten(texts)
len([c for c in flat if c in ['pete']])   # -> 1
len([c for c in flat if c in ['peter']])  # -> 2
Compare the two results
Edit:
import collections

def counts(listr, word):
    total = []
    for i in range(len(listr)):
        total.append(word in collections.Counter(listr[i]))
    return sum(total)

counts(texts, 'peter')
# 2

Python: Fastest way to find list of words [with n-grams] in text conversations

I am looking for how many of the words in a word list are found in a conversation - not the individual frequency of each word, just the total count. The word list includes n-grams up to 3:
import numpy as np
import pandas as pd
from nltk.util import ngrams

find = ['car', 'motor cycle', 'heavy traffic vehicle']
data = pd.read_csv('inputdata.csv')

def count_words(doc, find):
    onegrams = [' '.join(grams) for grams in ngrams(doc.split(), 1)]
    bigrams = [' '.join(grams) for grams in ngrams(doc.split(), 2)]
    trigrams = [' '.join(grams) for grams in ngrams(doc.split(), 3)]
    n_gram = onegrams + bigrams + trigrams
    # get count of unique find-words present in doc
    lst = ".".join([i for i in find if i in n_gram])
    cnt = np.count_nonzero(np.unique(lst.split(".")))
    return cnt

result = data['text'].apply(lambda x: count_words(x, find))
These steps are very processing-heavy and take a long time to run on large datasets. What are the options to optimize the present approach, or are there alternative approaches?
First, split the doc once, not three times on each call.
def count_words(doc, find):
    word_list = doc.split()
    onegrams = [' '.join(grams) for grams in ngrams(word_list, 1)]
    ...
Second, you can count nicely using the collections Counter class. Then counting is trivial in your code, and as fast as Python can make it.
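A minimal sketch of that idea, assuming the same find list and the same goal of counting how many distinct find-terms occur in each doc (the helper name here is illustrative):
from collections import Counter
from nltk.util import ngrams

find = ['car', 'motor cycle', 'heavy traffic vehicle']
find_set = set(find)  # set membership checks are O(1)

def count_found_terms(doc, find_set):
    word_list = doc.split()
    # build the 1-, 2- and 3-grams from the already-split words
    grams = (' '.join(g) for n in (1, 2, 3) for g in ngrams(word_list, n))
    counts = Counter(g for g in grams if g in find_set)
    return len(counts)  # number of distinct find-terms present in doc

print(count_found_terms("heavy traffic vehicle overtook the car on the bridge", find_set))
# -> 2  ('heavy traffic vehicle' and 'car')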

Tagging words based on a dictionary/list in Python

I have the following dictionary of gene names:
gene_dict = {"repA1":1, "leuB":1}
# the actual dictionary is longer, around ~30K entries.
# or in list format
# gene_list = ["repA1", "leuB"]
What I want to do is, given any sentence, search for terms that are contained in the above dictionary and then tag them.
For example given this sentence:
mytext = "xxxxx repA1 yyyy REPA1 zzz."
It will be then tagged as:
xxxxx <GENE>repA1</GENE> yyyy <GENE>REPA1</GENE> zzz.
Is there any efficient way to do that? In practice we would process a couple of million sentences.
If you "gene_list" in not really-really-really long, you could use a compiled regular expression, like
import re
gene_list = ["repA1", "leuB"]
regexp = re.compile('|'.join(gene_list), flags=re.IGNORECASE)
result = re.sub(regexp, r'<GENE>\g<0></GENE>', 'xxxxx repA1 yyyy REPA1 zzz.')
and put in a loop for all your sentences. I think this should be quite fast.
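For instance, applied to a small list of sentences (the sentences list here is just a stand-in for however your sentences are stored):
sentences = ["xxxxx repA1 yyyy REPA1 zzz.", "leuB was upregulated."]
tagged = [regexp.sub(r'<GENE>\g<0></GENE>', s) for s in sentences]
# tagged == ['xxxxx <GENE>repA1</GENE> yyyy <GENE>REPA1</GENE> zzz.',
#            '<GENE>leuB</GENE> was upregulated.']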
If most of the sentences are short and separated by single spaces, something like:
gene_dict = {"repA1":1, "leuB":1}
format_gene = "<GENE>{}</GENE>".format
mytext = " ".join(format_gene(word) if word in gene_dict else word for word in mytext.split())
is going to be faster.
For slightly longer sentences or sentences you cannot reform with " ".join it might be more efficient or more correct to use several .replaces:
gene_dict = {"repA1":1, "leuB":1}
genes = set(gene_dict)
format_gene = "<GENE>{}</GENE>".format
to_replace = genes.intersection(mytext.split())
for gene in to_replace:
mytext = mytext.replace(gene, format_gene(gene))
Each of these assumes that splitting the sentences will not take an extortionate amount of time, which is fair assuming gene_dict is much longer than the sentences.
