I'm working in some kind of NLP. I compare a daframe of articles with inputs words. The main goal is classify text if a bunch of words were found
I've tried to extract the values in the dictionary and convert into a list and then apply stemming to it. The problem is that later I'll do another process to split and compare according to the keys. I think if more practical to work directly in the dictionary.
search = {'Tecnology' : ['computer', 'digital', 'sistem'], 'Economy' : ['bank', 'money']}
words_list = list()
for key in search.keys():
words_list.append(search[key])
search_values = [val for sublist in words_list for val in sublist]
search_values_stem = [stemmer.stem(word) for word in test]
I expect a dictionary stemmed to compare directly with the column of the articles stemmed
If I understood your question correctly, you are looking to apply stemming to the values of your dictionary (and not the keys), and in addition the values in your dictionary are all lists of strings.
The following code should do that:
def stemList(l):
return([stemmer.stem(word) for word in l])
# your initial dictionary is called search (as in your example code)
#the following creates a new dictionary where stemming has been applied to the values
stemmedSearch = {}
for key in search:
stemmedSearch[key] = stemList(search[key])
Related
Hey guys I have a quick question about dataframe stemming. You see I want to know how to iterate through a dataframe of text, stem each word, and then return the dataframe to the text is was originally in (With the stemmed word changes of course).
I will give an example
dic = {0:['He was caught eating by the tree'],1:['She was playing with her friends']}
dic = pd.DataFrame(dic)
dic = dic.T
dic.rename(columns = {0:'text'},inplace=True)
When you run these 4 codes, you will get a dataframe of text. I would like to a method to iterate and stem each word as I have a dataframe constiting of over 30k of such sentences and would like to stem it. Thank you very much.
I'm trying to use the list of shortened words to select & retrieve the corresponding full word identified by its initial sequence of characters:
shortwords = ['appe', 'kid', 'deve', 'colo', 'armo']
fullwords = ['appearance', 'armour', 'colored', 'developing', 'disagreement', 'kid', 'pony', 'treasure']
Trying this regex match with a single shortened word:
import re
shortword = 'deve'
retrieved=filter(lambda i: re.match(r'{}'.format(shortword),i), fullwords)
print(retrieved*)
returns
developing
So the regex match works but the question is how to adapt the code to iterate through the shortwords list and retrieve the full words?
EDIT: The solution needs to preserve the order from the shortwords list.
Maybe using a dictionary
# Using a dictionary
test= 'appe is a deve arm'
shortwords = ['appe', 'deve', 'colo', 'arm', 'pony', 'disa']
fullwords = ['appearance', 'developing', 'colored', 'armour', 'pony', 'disagreement']
#Building the dictionary
d={}
for i in range(len(shortwords)):
d[shortwords[i]]=fullwords[i]
# apply dictionary to test
res=" ".join(d.get(s,s) for s in test.split())
# print test data after dictionary mapping
print(res)
That is one way to do it:
shortwords = ['appe', 'deve', 'colo', 'arm', 'pony', 'disa']
fullwords = ['appearance', 'developing', 'colored', 'armour', 'pony', 'disagreement']
# Dict comprehension
words = {short:full for short, full in zip(shortwords, fullwords)}
#Solving problem
keys = ['deve','arm','pony']
values = [words[key] for key in keys]
print(values)
This is a classical key - value problem. Use a dictionary for that or consider pandas in long-term.
Your question text seems to indicate that you're looking for your shortwords at the start of each word. That should be easy then:
matched_words = [word for word in fullwords if any(word.startswith(shortword) for shortword in shortwords]
If you'd like to regex this for some reason (it's unlikely to be faster), you could do that with a large alternation:
regex_alternation = '|'.join(re.escape(shortword) for shortword in shortwords)
matched_words = [word for word in fullwords if re.match(rf"^{regex_alternation}", word)]
Alternately if your shortwords are always four characters, you could just slice the first four off:
shortwords = set(shortwords) # sets have O(1) lookups so this will save
# a significant amount of time if either shortwords
# or longwords is long
matched_words = [word for word in fullwords if word[:4] in shortwords]
This snippet has the functionality I wanted. It builds a regular expression pattern at each loop iteration in order to accomodate varying word length parameters. Further it maintains the original order of the wordroots list. In essence it looks at each word in wordroots and fills out the full word from the dataset. This is useful when working with the bip-0039 words list which contains words of 3-8 characters in length and which are uniquely identifiable by their initial 4 characters. Recovery phrases are built by randomly selecting a sequence of words from the bip-0039 list, order is important. Observed security practice is often to abbreviate each word to a maximum of its four initial characters. Here is code which would rebuild a recovery phrase from its abbreviation:
import re
wordroots = ['sun', 'sunk', 'sunn', 'suns']
dataset = ['sun', 'sunk', 'sunny', 'sunshine']
retrieved = []
for root in wordroots:
#(exact match) or ((exact match at beginning of word when root is 4 or more characters) else (exact match))
pattern = r"(^" + root + "$|" + ("^" + root + "[a-zA-Z]+)" if len(root) >= 4 else "^" + root + "$)")
retrieved.extend( filter(lambda i: re.match(pattern, i), dataset) )
print(*retrieved)
Output:
sun sunk sunny sunshine
I have to make a function that given a text of concatenated words without spaces and a list that both contains words that appear and do not appear in said text.
I have to create a tuple that contains a new list that only includes the words that are in the text in order of appearance and the word that appears the most in the text. If there are two words that appear the most number of times, the function will chose one on an alphabetical order (if the words appear like "b"=3,"c"=3,"a"=1, then it will chose "b")
Also I have to modify the original list so that it only includes the words that are not in the text without modifying its order.
For example if I have a
list=["tree","page","law","gel","sand"]
text="pagelawtreetreepagepagetree"`
then the tuple will be
(["page","tree","law"], "page")
and list will become
list=["gel","sand"]
Now I have done this function but it's incredibly slow, can someone help?
ls=list
def es3(ls,text):
d=[]
v={}
while text:
for item in ls:
if item in text[:len(item)]:
text=text[len(item):]
if item not in d:
d+=[item]
v[item]=1
else:
v[item]+=1
if text=="":
p=sorted(v.items())
f=max(p, key=lambda k: k[1])
M=(d,f[0])
for b in d:
if b in lista:
ls.remove(b)
return (M)
In python strings are immutable - if you modify them you create new objects. Object creation is time/memory inefficient - almost all of the times it is better to use lists instead.
By creating a list of all possible k-lenght parts of text - k being the (unique) lenghts of the words you look for ( 3 and 4 in your list) you create all splits that you could count and filter out that are not in your word-set:
# all 3+4 length substrings as list - needs 48 lookups to clean it up to whats important
['pag', 'page', 'age', 'agel', 'gel', 'gela', 'ela', 'elaw', 'law', 'lawt', 'awt',
'awtr', 'wtr', 'wtre', 'tre', 'tree', 'ree', 'reet', 'eet', 'eetr', 'etr', 'etre',
'tre', 'tree', 'ree', 'reep', 'eep', 'eepa', 'epa', 'epag', 'pag', 'page', 'age',
'agep', 'gep', 'gepa', 'epa', 'epag', 'pag', 'page', 'age', 'aget', 'get', 'getr',
'etr', 'etre', 'tre', 'tree']
Using a set for "is A in B" checks makes the coder faster as well - sets have O(1) lookup - list take the longer the more lements are in it (worst case: n). So you eliminate all words from the k-lenght parts list that do not match any of the words you look for (i.e. 'eter'):
# whats left after the list-comprehension including the filter criteria is done
['page', 'gel', 'law', 'tree', 'tree', 'page', 'page', 'tree']
For counting iterables I use collections.Counter - a specialiced dictionary .. that counts things. It's most_common() method returns sorted tuples (key,count) sorted by most occured first which I format to a return-value that matches your OP.
One version to solve it respection overlapping results:
from collections import Counter
def findWordsInText(words,text):
words = set(words) # set should be faster for lookup
lens = set(len(w) for w in words)
# get all splits of len 3+4 (in this case) from text
splitted = [text[i:i+ll] for i in range(len(text)-min(lens)) for ll in lens
if text[i:i+ll] in words] # only keep whats in words
# count them
counted = Counter(splitted)
# set-difference
not_in = words-set(splitted)
# format as requested: list of words in order, most common word
most_common = counted.most_common()
ordered_in = ( [w for w,_ in most_common], most_common[0][0] )
return list(not_in), ordered_in
words = ["tree","page","law","gel","sand"]
text = "pagelawtreetreepagepagetree"
not_in, found = findWordsInText(words,text)
print(not_in)
print(found)
Output:
['sand']
(['page', 'tree', 'gel', 'law'], 'page')
I am trying to extract separated multi words from a python list with two different list as a query string. My sentences list is
lst = ['we have the terrible HIV epidemic that takes down the life expectancy of the African ','and I take the regions down here','The poorest are down']
lst_verb = ['take','go','wake']
lst_prep = ['down','up','in']
import re
output=[]
item = 'down'
p = re.compile(r'(?:\w+\s+){1,20}'+item)
for i in lst:
output.append(p.findall(i))
for item in output:
print(item)
with this i am able to extract word from the list, However I am only want to extract separated multiwords, i.e it should extract the word from the list "and I take the regions down here".
furthermore, I want to use the word from lst_verb and lst_prep as query string.
for example
re.findall(r \lst_verb+'*.\b'+ \lst_prep)
Thank you for your answer.
You can use regex like
(?is)^(?=.*\b(take)\b)(?=.*?\b(go)\b)(?=.*\b(where)\b)(?=.*\b(wake)\b).*
To match Multiple words
like this your example
use functions to create regex string from the verbs and prep.
hope this helps
I've got a piece of code here that checks anagrams of a long list of words. I'm trying to find out how to search through every word in my long word list to find other anagrams that can match this word. Some words should have more than one anagram in my word list, yet I'm unable to find a solution to join the anagrams found in my list.
set(['biennials', 'fawn', 'unsupportable', 'jinrikishas', 'nunnery', 'deferment', 'surlinesss', 'sonja', 'bioko', 'devon'] ect...
Since I've been using sets, the set never reads to the end, and it returns only the shortest words. I know there should be more. I've been trying to iterate over my key over my whole words set so I can find all the ones that are anagrams to my key.
anagrams_found = {'diss': 'sids', 'abels': 'basel', 'adens': 'sedna', 'clot': 'colt', 'bellow': 'bowell', 'cds': 'dcs', 'doss': 'sods', '
als': 'las', 'abes': 'base', 'fir': 'fri', 'blot': 'bolt', 'ads': 'das', 'elm': 'mel', 'hops': 'shop', 'achoo': 'ochoa'... and more}
I was wondering where my code has been cutting off short. It should be finding a lot more anagrams from my Linux word dictionary. Can anyone see what's wrong with my piece of code? Simpley put, first the program iterates through every word I have, then checks if the sets contain my keys. This will append the keys to my dictionary for words later that will also match my same key. If there is already a key that I've added an anagram for, I will update my dictionary by concatenating the old dict value with a new word (anagram)
anagram_list = dict()
words = set(words)
anagrams_found = []
for word in words:
key = "".join(sorted([w for w in word]))
if (key in words) and (key != word):
anagrams_found.append(word)
for name, anagram in anagram_list.iteritems():
if anagram_list[name] == key:
anagram = " ".join([anagram],anagram_found)
anagram_list.update({key:anagram})
anagram_list[key] = word
return anagram_list
All in all, this program is maybe not efficient. Can someone explain the shortcomings of my code?
anagram_dict = {} # You could also use defaultdict(list) here
for w in words:
key = "".join(sorted(w))
if key in anagram_dict:
anagram_dict[key].append(w)
else:
anagram_dict[key] = [w]
Now entries that only have one item in the list aren't anagrams so
anagram_list = []
for v in anagram_dict.iteritems():
if len(v) > 1:
anagram_list += v