I am doing some analysis on a large corpus, but my function to remove custom stop words just isn't working. I tried several solutions from questions already asked here, but I can't find why the words are not being removed from the test list.
Any help pointing out my mistake is welcome.
test = [['acesso em',
'agindo',
'alegre',
'ambiente escolar',
'ambientes digitais',
'anual',
'aplicativos digitais',
'apresentar conclusões',
'argumentação cuidado',
'articuladas projeto',
'associadas eixos',
'associação',
'ativas',
'atos linguagem',
'avaliar oportunidades',
'bairro',
'base critérios',
'base estudos',
'bibliográfica exploratória',
'blogs',
'buscando apresentar',
'campo artístico']]
removed = ['anual']
new_words = [word for word in test if word not in removed]
new_words
Maybe the list comprehension is not working for you; as an alternative, you can remove stop words in place with a simple loop:
words = ['a', 'b', 'a', 'c', 'd']
stopwords = ['a', 'c']
for word in list(words):  # iterate over a copy, since removing from the original shifts items
    if word in stopwords:
        words.remove(word)
The list comprehension itself is fine; the real problem is that test is a list containing a single inner list, so word iterates over that one inner list object rather than over the individual strings. One fix is to remove one level of nesting, i.e. one pair of [ and ], from the source data:
test = ['acesso em',
'agindo',
'alegre',
'ambiente escolar',
'ambientes digitais',
'anual',
'aplicativos digitais',
'apresentar conclusões',
'argumentação cuidado',
'articuladas projeto',
'associadas eixos',
'associação',
'ativas',
'atos linguagem',
'avaliar oportunidades',
'bairro',
'base critérios',
'base estudos',
'bibliográfica exploratória',
'blogs',
'buscando apresentar',
'campo artístico']
removed = ['anual']
new_words = [word for word in test if word not in removed]
new_words
Output:
['acesso em',
'agindo',
'alegre',
'ambiente escolar',
'ambientes digitais',
'aplicativos digitais',
'apresentar conclusões',
'argumentação cuidado',
'articuladas projeto',
'associadas eixos',
'associação',
'ativas',
'atos linguagem',
'avaliar oportunidades',
'bairro',
'base critérios',
'base estudos',
'bibliográfica exploratória',
'blogs',
'buscando apresentar',
'campo artístico']
Alternatively, if you want to keep the nested structure and access the inner list, you can index into it:
new_words = [word for word in test[0] if word not in removed]
In this case there is no need to remove the outer brackets.
Your nested data structure seems a bit strange, but if you want to create a new list of lists with the stop words removed, this may work for you:
new_words = [[word for word in mywords if word not in removed] for mywords in test]
Try it on Binder.
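For the nested test data from the question, that keeps the outer structure intact; a shortened sanity check:
test = [['acesso em', 'anual', 'blogs']]  # shortened version of the question's data
removed = ['anual']
new_words = [[word for word in mywords if word not in removed] for mywords in test]
print(new_words)  # [['acesso em', 'blogs']]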
This is how I obtained the input data at the beginning:
with open("wordslist.txt") as f:
words_list = {word.removesuffix("\n") for word in f}
with open("negation_handling.csv") as g:
for tweete in g:
for word in tweete.split():
if word not in words_list:
print(word)
This code results in data of type <class 'str'>. The string output contains a lot of duplicate words. I want to print all of the words with no word repeated (deleting all the duplicates). Here is how it looks; I call it word:
gfg
best
gfg
I
am
I
two
two
three
..............
My list of strings contains around 4,500 words separated by newlines, just like in the example above. I cannot copy and paste the strings because there are too many, so I store them in something I call "word", but I don't know how to refer to it correctly. I want the code to loop over the words and remove all the duplicates, so the output would look like this:
gfg best I am two three..........
This is what I tried:
input_list_of_strings = word

# Create empty list to store unique words
unique_words = []

# Loop through each word and check if it exists in the unique words list
for word in input_list_of_strings:
    if word not in unique_words:
        unique_words.append(word)

# Print the result
print(unique_words)
But the result is just this:
['e']
How can I refer to word correctly?
If your input is a list of strings, you can simply collect all the unique strings using a loop:
input_list_of_strings = ['gfg', 'best', 'gfg', 'I', 'am', 'I', 'two', 'two', 'three']

# Create empty list to store unique words
unique_words = []

# Loop through each word and check if it exists in the unique words list
for word in input_list_of_strings:
    if word not in unique_words:
        unique_words.append(word)

# Print the result
print(unique_words)
Also, you could use a Python set, but be aware that it won't preserve the initial word order:
input_list_of_strings = ['gfg', 'best', 'gfg', 'I', 'am', 'I', 'two', 'two', 'three']
# Create a set of unique words from the list
unique_words = set(input_list_of_strings)
unique_words_list = list(unique_words)
# Print the result
print(unique_words_list)
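If you need both uniqueness and the original order, dict.fromkeys gives you that in one line, since dict keys are unique and preserve insertion order (guaranteed from Python 3.7 on):
input_list_of_strings = ['gfg', 'best', 'gfg', 'I', 'am', 'I', 'two', 'two', 'three']
# dict keys are unique and keep insertion order, so this dedupes in order
unique_words_list = list(dict.fromkeys(input_list_of_strings))
print(unique_words_list)  # ['gfg', 'best', 'I', 'am', 'two', 'three']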
Hope it helps =)
I don't understand why the stopword "a" isn't removed in this loop. It seems so obvious that this should work...
The task: given a list of stop words, write a function that takes a string and returns the string stripped of the stop words. Expected output: stripped_paragraph = 'want figure out how can better data scientist'
Below I define 'stopwords'
I split all the words by a space, make a set of words while retaining the order
loop through the ordered and split substring set ('osss' var) and conditionally remove each word if it's a word in the list 'stopwords'
paragraph = 'I want to figure out how I can be a better data scientist'

def rm_stopwards(par):
    stopwords = ['I', 'as', 'to', 'you', 'your', 'but', 'be', 'a']
    osss = list(list(dict.fromkeys(par.split(' '))))  # ordered_split_shortened_set
    for word in osss:
        if word.strip() in stopwords:
            osss.remove(word)
        else:
            next
    return ' '.join(osss)

print("stripped_paragraph = " + "'" + rm_stopwards(paragraph) + "'")
My incorrect output is: 'want figure out how can a better data scientist'
Correct output: 'want figure out how can better data scientist'
Edit: the .strip() in the condition check with word.strip() is unnecessary and I still get the same output; that was me checking to make sure there wasn't an extra space somehow.
Edit 2: this is an interview question, so I can't use any imports.
What you're trying to do can be achieved with far fewer lines of code.
The main problem in your code is that you're changing the list while iterating over it.
The version below works and is much simpler: it loops over the paragraph's words, keeps only the ones that aren't in the stopwords list, and joins them back together with a space.
paragraph = 'I want to figure out how I can be a better data scientist'
stopwords = ['I', 'as', 'to', 'you', 'your','but','be', 'a']
filtered = ' '.join([word for word in paragraph.split() if word not in stopwords])
print(filtered)
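For reference, here is a minimal sketch of why removing items while iterating skips elements; it is exactly how the stray 'a' survived in your output:
words = ['be', 'a', 'better']
for word in words:           # the iterator tracks positions, not items
    if word in ('be', 'a'):
        words.remove(word)   # everything after shifts left, so the next item is skipped
print(words)                 # ['a', 'better'] -- 'a' was never examined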
You may also consider using nltk, which has a predefined list of stopwords.
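Outside of the no-imports interview constraint, a sketch of that nltk route (this assumes nltk is installed and the stopwords corpus has been downloaded; note its list is lower-case and broader than the hand-written one, so more words get removed):
import nltk
nltk.download('stopwords')  # one-time download of the stopwords corpus
from nltk.corpus import stopwords

stop_set = set(stopwords.words('english'))
paragraph = 'I want to figure out how I can be a better data scientist'
print(' '.join(w for w in paragraph.split() if w.lower() not in stop_set))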
You should not change (delete from or add to) a collection (osss) while iterating over it. Instead, collect the words to delete, then filter:
del_list = []
for word in osss:
    if word.strip() in stopwords:
        del_list.append(word)

osss = [e for e in osss if e not in del_list]
Alternatively, iterate over the original list and remove from a copy:
paragraph = 'I want to figure out how I can be a better data scientist'

def rm_stopwards(par):
    stopwords = ['I', 'as', 'to', 'you', 'your', 'but', 'be', 'a']
    osss = list(dict.fromkeys(par.split(' ')))  # ordered_split_shortened_set
    x = list(osss)  # a copy to remove from, so the iteration is not disturbed
    for word in osss:
        if word.strip() in stopwords:
            x.remove(word)
    return ' '.join(x)

print("stripped_paragraph = " + "'" + rm_stopwards(paragraph) + "'")
I'm looking to parse through a list of email text to identify keywords. Let's say I have the following list:
sentences = [['this is a paragraph there should be lots more words here'],
['more information in this one'],
['just more words to be honest, not sure what to write']]
I want to check whether words from a keywords list appear in any of these sentences, using regex. I wouldn't want informations to be captured, only information:
keywords = ['information', 'boxes', 'porcupine']
I was trying to do something like:
['words' in words for [word for word in [sentence for sentence in sentences]]
or
for sentence in sentences:
    sentence.split(' ')
Ultimately, I would like to filter the current list down to the elements that contain the keywords I've specified:
keywords = ['information', 'boxes']
sentences = [['this is a paragraph there should be lots more words here'],
['more information in this one'],
['just more words to be honest, not sure what to write']]
output: [False, True, False]
or ultimately:
parsed_list = [['more information in this one']]
Here is a one-liner to solve your problem. I find lambda syntax easier to read than nested list comprehensions.
keywords = ['information', 'boxes']
sentences = [['this is a paragraph there should be lots more words here'],
['more information in this one'],
['just more words to be honest, not sure what to write']]
results_lambda = list(
    filter(lambda sentence: any(word in sentence[0] for word in keywords), sentences))
print(results_lambda)
[['more information in this one']]
This can be done with a quick list comprehension!
lists = [['here is one sentence'], ['and here is another'], ['let us filter!'], ['more than one word filter']]
keywords = ['filter', 'one']  # avoid naming this 'filter', which would shadow the built-in
result = [x for x in lists if any(s in x[0] for s in keywords)]
print(result)
result:
[['here is one sentence'], ['let us filter!'], ['more than one word filter']]
Hope this helps!
Do you want to find sentences which have all the words in your keywords list?
If so, then you could use a set of those keywords and filter each sentence based on whether all of the keywords are present in it. One way is:
keyword_set = set(keywords)
n = len(keyword_set)  # number of distinct keywords

def allKeywdsPresent(sentence):
    # the intersection of both sets should equal the keyword set
    return len(set(sentence.split(" ")) & keyword_set) == n

# note the [0]: each sentence is wrapped in a single-element list
filtered = [sentence for sentence in sentences if allKeywdsPresent(sentence[0])]
# filtered is the final list of sentences which satisfy your condition

# if you want a list of booleans:
boolean_array = [allKeywdsPresent(sentence[0]) for sentence in sentences]
There could be more optimal ways to do this (e.g. the set created for each sentence in allKeywdsPresent could be replaced with a single pass over all elements, etc.) But, this is a start.
Also, understand that using a set means duplicates in your keyword list will be eliminated. So, if you have a list of keywords with some duplicates, then use a dict instead of the set to keep a count of each keyword and reuse above logic.
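For illustration, here is a minimal sketch of that count-based idea using collections.Counter (the keyword list with a duplicate is made up for the example):
from collections import Counter

keyword_counts = Counter(['data', 'data', 'science'])  # 'data' must appear twice

def all_keywords_present_with_counts(sentence):
    word_counts = Counter(sentence.split())
    # every keyword must occur at least as often as it appears in the keyword list
    return all(word_counts[w] >= n for w, n in keyword_counts.items())

print(all_keywords_present_with_counts('data science needs data'))  # True
print(all_keywords_present_with_counts('data science needs code'))  # False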
From your example, it seems enough to have at least one keyword match. In that case you need to modify allKeywdsPresent() (and maybe rename it to anyKeywdsPresent):
def anyKeywdsPresent(sentence):
    return any(word in keyword_set for word in sentence.split())
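Applied to the sample data from the question (unwrapping the single-element inner lists with [0], as above), a quick sanity check:
keywords = ['information', 'boxes']
keyword_set = set(keywords)
sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['just more words to be honest, not sure what to write']]

def anyKeywdsPresent(sentence):
    return any(word in keyword_set for word in sentence.split())

print([anyKeywdsPresent(s[0]) for s in sentences])       # [False, True, False]
print([s for s in sentences if anyKeywdsPresent(s[0])])  # [['more information in this one']]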
If you want to match only whole words and not just substrings, you'll have to account for all word separators (whitespace, punctuation, etc.) and first split your sentences into words, then match them against your keywords. The easiest, although not fool-proof, way is to just use the regex \W (non-word character) class and split your sentence on such occurrences.
Once you have the list of words in your text and the list of keywords to match, the easiest, and probably most performant, way to see if there is a match is to do a set intersection between the two. So:
import re

# not sure why you have the sentences in single-element lists, but if you insist...
sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['just more disinformation, to make sure we have no partial matches']]
keywords = {'information', 'boxes', 'porcupine'}  # note we're using a set here!

WORD = re.compile(r"\W+")  # a simple regex to split sentences into words

# finally, iterate over each sentence, split it into words and check for intersection
result = [s for s in sentences if set(WORD.split(s[0].lower())) & keywords]
# [['more information in this one']]
So, how does it work - simple, we iterate over each of the sentences (and lowercase them for a good measure of case-insensitivity), then we split the sentence into words with the aforementioned regex. This means that, for example, the first sentence will split into:
['this', 'is', 'a', 'paragraph', 'there', 'should', 'be', 'lots', 'more', 'words', 'here']
We then convert it into a set for blazing fast comparisons (sets are hash-based, and intersections based on hashes are extremely fast) and, as a bonus, this also gets rid of duplicate words.
Finally, we do the set intersection against our keywords - if anything is returned, these two sets have at least one word in common, which means that the if ... comparison evaluates to True and, in that case, the current sentence gets added to the result.
Final note - beware that while \W+ might be enough to split sentences into words (certainly better than a whitespace split only), it's far from perfect and not really suitable for all languages. If you're serious about word processing take a look at some of the NLP modules available for Python, such as the nltk.
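For instance, a minimal sketch with nltk's tokenizer (assumes nltk is installed and its punkt tokenizer data has been downloaded):
import nltk
nltk.download('punkt')  # one-time download of the tokenizer model
from nltk.tokenize import word_tokenize

print(word_tokenize('more information in this one!'))
# ['more', 'information', 'in', 'this', 'one', '!']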
I am doing a little project in which my program unscrambles a string and finds every possible combination of it.
I have two lists; comboList and wordList. comboList holds EVERY combination of the word; for example, the comboList for 'ABC' is:
['ABC','ACB','BAC','BCA','CAB','CBA']
(Only 'CAB' is a real word)
wordList holds about 56,000 words imported from a text file. These are all found in the English dictionary and sorted by length and then alphabetically.
isRealWord(comboList,wordList) is my function to test which words in comboList are real by checking if it is in the wordList. Here's the code:
def isRealWord(comboList, wordList):
    print 'Debug 1'
    for combo in comboList:
        print 'Debug 2'
        if combo in wordList:
            print 'Debug 3'
            print combo
            listOfActualWords.append(combo)
    print 'Debug 4'
This is the output:
run c:/Users/uzair/Documents/Programming/Python/unscramble.py
Please give a string of scrambled letters to unscramble: abc
['A', 'B', 'C']
['ABC', 'ACB', 'BAC', 'BCA', 'CAB', 'CBA']
Loading word list...
55909 words loaded
Debug 1
Debug 2
Debug 2
Debug 2
Debug 2
Debug 2
Debug 2
Debug 4
[]
Why is if combo in wordList not returning True and how do I fix it?
I'd suggest using a set, because membership tests on it are much faster due to its hash-based implementation, and there is a set.intersection method. Here is a case-insensitive solution:
listOfActualWords = set.intersection(set(word.upper() for word in comboList),
                                     set(word.upper() for word in wordList))
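For the ABC example above, a quick sanity check with a tiny stand-in word list (the real wordList would hold the 56,000 dictionary words):
comboList = ['ABC', 'ACB', 'BAC', 'BCA', 'CAB', 'CBA']
wordList = ['aah', 'cab', 'zoo']  # tiny stand-in for the dictionary file
listOfActualWords = set.intersection(set(word.upper() for word in comboList),
                                     set(word.upper() for word in wordList))
print(listOfActualWords)  # prints the one real word: CAB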
I think the problem here is that you are comparing two strings with the same letters but different upper/lower cases.
To see if I'm correct, convert all words in wordList to upper-case, and then in isRealWord compare the upper-cased combo (just to be sure), as follows:
UpperCaseWordList = [word.upper() for word in wordList]
...
def isRealWord(comboList, wordList):
    for combo in comboList:
        if combo.upper() in wordList:
            print combo
            listOfActualWords.append(combo)
I am stuck on a list comprehension in Python.
I have the following data structure
dataset = [sentence1, sentence2,...]
sentence = [word1, word2,...]
In addition, I have a list of special words:
special_words = [special_word1, special_word2, special_word3,...]
I want to iterate over all the special words in special_words and fetch all words that occur in a sentence together with each special word.
As a result I expect:
data = [special_word1_list, special_word2_list, ...]
where special_word1_list = [word1, word2, ...], meaning word1, word2, ... occurred in sentences together with special_word1.
I have tried many different ways to construct the list comprehension, unfortunately without any success. I would appreciate any help; also, if you know a good article about list comprehensions, please post it here.
data = [
    {
        word
        for sentence in sentences
        if special_word in sentence
        for word in sentence
    }
    for special_word in special_words
]
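A tiny worked example of what this produces (made-up data; note each result is a set, so ordering may vary):
special_words = ['cat', 'dog']
sentences = [['the', 'cat', 'sat'], ['a', 'dog', 'ran'], ['the', 'dog', 'slept']]
data = [
    {word for sentence in sentences if special_word in sentence for word in sentence}
    for special_word in special_words
]
print(data)  # [{'the', 'cat', 'sat'}, {'a', 'dog', 'ran', 'the', 'slept'}]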
I think you want:
data = [sentence for sentence in dataset
        if any(word in special_words
               for word in sentence)]
Alternatively, I'd suggest creating a dictionary mapping special words to sets of other words occurring in the same sentence as the respective special word:
{sw : {w for ds in dataset if sw in ds for w in ds if w != sw}
for sw in special_words}
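And a tiny worked example of the dictionary version (same made-up data as above):
special_words = ['cat', 'dog']
dataset = [['the', 'cat', 'sat'], ['a', 'dog', 'ran'], ['the', 'dog', 'slept']]
result = {sw: {w for ds in dataset if sw in ds for w in ds if w != sw}
          for sw in special_words}
print(result)  # {'cat': {'the', 'sat'}, 'dog': {'a', 'ran', 'the', 'slept'}}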