how to print only unique words in a class string - python

this is how I obtained the data input at the beginning:
with open("wordslist.txt") as f:
words_list = {word.removesuffix("\n") for word in f}
with open("negation_handling.csv") as g:
for tweete in g:
for word in tweete.split():
if word not in words_list:
print(word)
This code results in data of type <class 'str'>. The string contains a lot of words with duplicates. I want to print all of the words with none repeated (delete all the duplicates). Here is what the data looks like; the variable name is word:
gfg
best
gfg
I
am
I
two
two
three
..............
My list of strings contains around 4500 words, separated by newlines just like in the example above. I cannot copy-paste the strings because there are too many, so I store them in a variable called "word", but I don't know how to use it correctly. I want the code to loop through it and remove all the duplicate words, so the output would look like this:
gfg best I am two three..........
this is what I tried:
input_list_of_strings = word
# Create empty list to store unique
unique_words = []
# Loop through each word and check if it exists in unique words list
for word in input_list_of_strings:
    if word not in unique_words:
        unique_words.append(word)
# Print the result
print(unique_words)
but the results are like this:
['e']
how can I call the class word correctly?

If your input is a list of strings, you can simply collect all unique strings using a loop:
input_list_of_strings = ['gfg', 'best', 'gfg', 'I', 'am', 'I', 'two', 'two', 'three']
# Create empty list to store unique
unique_words = []
# Loop through each word and check if it exists in unique words list
for word in input_list_of_strings:
    if word not in unique_words:
        unique_words.append(word)
# Print the result
print(unique_words)
Alternatively, you could use a Python set, but note that it does not preserve the original word order:
input_list_of_strings = ['gfg', 'best', 'gfg', 'I', 'am', 'I', 'two', 'two', 'three']
# Create a set of unique words from the list
unique_words = set(input_list_of_strings)
unique_words_list = list(unique_words)
# Print the result
print(unique_words_list)
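Since the data in the question is actually one newline-separated string rather than a list, a minimal sketch (assuming the whole string is stored in a variable called word) would be to split it first; dict.fromkeys keeps the first occurrence of each word in order:
# assuming `word` holds the whole newline-separated string from the question
word = "gfg\nbest\ngfg\nI\nam\nI\ntwo\ntwo\nthree"
unique_words = list(dict.fromkeys(word.split()))
print(' '.join(unique_words))  # gfg best I am two three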
Hope it helps =)

Related

Stop Words not being removed from list

I am doing some analysis on a large corpus, but my function to remove custom stop words is just not working. I tried several different solutions from questions already asked here, but I can't find why words are not being removed from the Test list.
Any help pointing out my stupidness is welcome.
test = [['acesso em',
'agindo',
'alegre',
'ambiente escolar',
'ambientes digitais',
'anual',
'aplicativos digitais',
'apresentar conclusões',
'argumentação cuidado',
'articuladas projeto',
'associadas eixos',
'associação',
'ativas',
'atos linguagem',
'avaliar oportunidades',
'bairro',
'base critérios',
'base estudos',
'bibliográfica exploratória',
'blogs',
'buscando apresentar',
'campo artístico']]
removed = ['anual']
new_words = [word for word in test if word not in removed]
new_words
Maybe the function is not working properly, so you can use the following code instead; just add it and adjust it to your own data.
words = ['a', 'b', 'a', 'c', 'd']
stopwords = ['a', 'c']
for word in list(words):  # iterating on a copy since removing will mess things up
    if word in stopwords:
        words.remove(word)
Note that test in your code is a list that contains a single inner list, so the list comprehension iterates over that inner list as one element and nothing is removed. To make it work as written, remove one level of [ and ] from the source data:
test = ['acesso em',
'agindo',
'alegre',
'ambiente escolar',
'ambientes digitais',
'anual',
'aplicativos digitais',
'apresentar conclusões',
'argumentação cuidado',
'articuladas projeto',
'associadas eixos',
'associação',
'ativas',
'atos linguagem',
'avaliar oportunidades',
'bairro',
'base critérios',
'base estudos',
'bibliográfica exploratória',
'blogs',
'buscando apresentar',
'campo artístico']
removed = ['anual']
new_words = [word for word in test if word not in removed]
new_words
Output:
['acesso em',
'agindo',
'alegre',
'ambiente escolar',
'ambientes digitais',
'aplicativos digitais',
'apresentar conclusões',
'argumentação cuidado',
'articuladas projeto',
'associadas eixos',
'associação',
'ativas',
'atos linguagem',
'avaliar oportunidades',
'bairro',
'base critérios',
'base estudos',
'bibliográfica exploratória',
'blogs',
'buscando apresentar',
'campo artístico']
And if you want to access the list elements inside the outer list, you can use its index:
new_words = [word for word in test[0] if word not in removed]
In this case, there is no need to delete [].
Your data structure seems a bit strange, but if you want to create a new list of lists with some stop words removed, this may work for you.
new_words = [[word for word in mywords if word not in removed] for mywords in test]
Try it on Binder.
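For completeness, a small self-contained version of that nested comprehension, using a shortened made-up sample:
test = [['acesso em', 'agindo', 'anual', 'blogs'],
        ['anual', 'bairro']]
removed = ['anual']
new_words = [[word for word in mywords if word not in removed] for mywords in test]
print(new_words)  # [['acesso em', 'agindo', 'blogs'], ['bairro']]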

Python: How to split a string by phrase from right to left (first occurrence only)

Is it possible to split a string on a phrase (potentially more than one word) in Python 3 from right to left (first occurrence only)?
Currently I'm able to split a string based on a list of phrases but I have an edge case in that if more than one of those specified phrases occurs in the string then it splits on both.
The problem
Given a sample CSV containing the following:
SENTENCES
THIS IS SENTENCE THREE
THIS IS SENTENCE TWO
I CONTAIN ONE BUT ALSO TWO
And my code which opens a CSV, loops through each row, and then looks to split out specified phrases:
import re
import csv

def split_phrase(string):
    phrases = ['ONE', 'TWO', 'THREE']
    print(f'Raw: {string}')
    split_phrase = ''  # Only needed for testing purposes to prevent error on output
    for phrase in phrases:
        if phrase in string:
            list = re.split(f'\\b({phrase})\\b', string)
            print(f'Split: {list}')
            sentence = list[0]
            split_phrase = list[1]
            print(f'Phrase: {split_phrase}')
    return sentence, split_phrase

input_dir = 'input1/'
output_dir = 'output1/'
filename = 'demo.csv'

with open(input_dir + filename, 'r') as input_csv:
    csv_reader = csv.reader(input_csv)
    data = list(csv_reader)
    input_csv.close()

for row in data[1:]:  # Ignore the header row
    sentence = row[0]  # First column
    sentence = split_phrase(sentence)  # Split out specified phrase
I get the following output:
$ python3 demo.py
Raw: THIS IS SENTENCE THREE
Split: ['THIS IS SENTENCE ', 'THREE', '']
Phrase: THREE
Raw: THIS IS SENTENCE TWO
Split: ['THIS IS SENTENCE ', 'TWO', '']
Phrase: TWO
Raw: I CONTAIN ONE BUT ALSO TWO
Split: ['I CONTAIN ', 'ONE', ' BUT ALSO TWO']
Phrase: ONE
Split: ['I CONTAIN ONE BUT ALSO ', 'TWO', '']
Phrase: TWO
NOTE: The last sentence is processed by the for loop twice due to it containing two of the phrases in the phrase list.
Desired outcome
I know that of the listed phrases to split out it will always be the last one on the right. So I'd like to grab only the first occurrence from right to left.
NOTE: A "phrase" can contain one or more words.
Is this possible? And if so, how may I achieve it?
I've answered this by using string.rfind() to search from the end of the string, and iterating through the list of possible phrases. There may be better ways to do this that do not iterate, but this is the best I've found.
one = "THIS IS SENTENCE THREE"
two = "THIS IS SENTENCE TWO"
three = "I CONTAIN ONE BUT ALSO TWO"
four = "I CONTAIN ONE BUT ALSO TWO AND SOME MORE TEXT"
phrases = ['ONE', 'TWO', 'THREE']
def find_words(phrases, string):
i = -1
p = ""
for phrase in phrases:
newI = string.rfind(phrase)
if newI > i:
i = newI
p = phrase
return (string[:i], string[i:i+len(p)], string[i+len(p)::])
print(find_words(phrases, one))
print(find_words(phrases, two))
print(find_words(phrases, three))
print(find_words(phrases, four))
Output:
('THIS IS SENTENCE ', 'THREE', '')
('THIS IS SENTENCE ', 'TWO', '')
('I CONTAIN ONE BUT ALSO ', 'TWO', '')
('I CONTAIN ONE BUT ALSO ', 'TWO', ' AND SOME MORE TEXT')
I believe it will work if you use "rsplit()" instead of "split()"
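For a single known phrase, str.rsplit with maxsplit=1 does split on the last occurrence; a minimal sketch of that suggestion (picking the rightmost match among several phrases would still need the comparison logic from the other answers):
sentence = "I CONTAIN ONE BUT ALSO TWO"
# rsplit with maxsplit=1 splits on the last occurrence of the separator
left, right = sentence.rsplit("TWO", 1)
print((left, "TWO", right))  # ('I CONTAIN ONE BUT ALSO ', 'TWO', '')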
The key, I think, is to split into words, then reverse that list, then search for all the hits and pick the lowest number:
def split_word(string):
    words = ['ONE', 'TWO', 'THREE']
    search = string.split()
    rsearch = list(reversed(search))
    locs = [rsearch.index(w) for w in words if w in rsearch]
    if not locs:
        return None
    target = len(search) - min(locs) - 1
    return ' '.join(search[0:target]), search[target], ' '.join(search[target+1:])
print(split_word("THIS IS SENTENCE THREE"))
print(split_word("THIS IS SENTENCE TWO"))
print(split_word("I CONTAIN ONE BUT ALSO TWO"))
Output:
('THIS IS SENTENCE', 'THREE', '')
('THIS IS SENTENCE', 'TWO', '')
('I CONTAIN ONE BUT ALSO', 'TWO', '')

Removing stopwords from a string with ordered set and join retains a single stopword

I don't understand why I don't remove the stopword "a" in this loop. It seems so obvious that this should work...
Given a list of stop words, write a function that takes a string and returns a string stripped of the stop words. Output: stripped_paragraph = 'want figure out how can better data scientist'
Below I define 'stopwords'
I split all the words by a space, make a set of words while retaining the order
loop through the ordered and split substring set ('osss' var) and conditionally remove each word if it's a word in the list 'stopwords'
paragraph = 'I want to figure out how I can be a better data scientist'

def rm_stopwards(par):
    stopwords = ['I', 'as', 'to', 'you', 'your', 'but', 'be', 'a']
    osss = list(list(dict.fromkeys(par.split(' '))))  # ordered_split_shortened_set
    for word in osss:
        if word.strip() in stopwords:
            osss.remove(word)
        else:
            next
    return ' '.join(osss)

print("stripped_paragraph = " + "'" + (rm_stopwards(paragraph)) + "'")
My incorrect output is: 'want figure out how can a better data scientist'
Correct output: 'want figure out how can better data scientist'
edit: note that .strip() in the condition check with word.strip() is unnecessary and I still get the same output, that was me checking to make sure there wasn't an extra space somehow
edit2: this is an interview question, so I can't use any imports
What you're trying to do can be achieved with far fewer lines of code.
The main problem in your code is that you're changing the list while iterating over it.
This works and is much simpler: it loops over your paragraph's words, keeps only the ones that aren't in the stopwords list, then joins them back together with spaces.
paragraph = 'I want to figure out how I can be a better data scientist'
stopwords = ['I', 'as', 'to', 'you', 'your','but','be', 'a']
filtered = ' '.join([word for word in paragraph.split() if word not in stopwords])
print(filtered)
You may also consider using nltk, which has a predefined list of stopwords.
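For reference, nltk's stopword list could be used roughly like this (nltk has to be installed and its stopword data downloaded separately, so this sketch would not satisfy the no-imports constraint from the question):
# pip install nltk, then download the stopword data once:
#   import nltk; nltk.download('stopwords')
from nltk.corpus import stopwords
paragraph = 'I want to figure out how I can be a better data scientist'
nltk_stopwords = set(stopwords.words('english'))
filtered = ' '.join(word for word in paragraph.split() if word.lower() not in nltk_stopwords)
print(filtered)  # e.g. 'want figure better data scientist'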
You should not change (delete/add) a collection (osss) while iterating over it.
del_list = []
for word in osss:
    if word.strip() in stopwords:
        del_list.append(word)
    else:
        next
osss = [e for e in osss if e not in del_list]
paragraph = 'I want to figure out how I can be a better data scientist'

def rm_stopwards(par):
    stopwords = ['I', 'as', 'to', 'you', 'your', 'but', 'be', 'a']
    osss = list(list(dict.fromkeys(par.split(' '))))  # ordered_split_shortened_set
    x = list(osss)
    for word in osss:
        if word.strip() in stopwords:
            x.remove(word)
        #else:
        #    next
    ret = ' '.join(x)
    return ret

print("stripped_paragraph = " + "'" + (rm_stopwards(paragraph)) + "'")

parsing emails to identify keywords

I'm looking to parse through a list of email text to identify keywords. Let's say I have the following list:
sentences = [['this is a paragraph there should be lots more words here'],
['more information in this one'],
['just more words to be honest, not sure what to write']]
I want to check to see if words from a keywords list are in any of these sentences in the list, using regex. I wouldn't want informations to be captured, only information
keywords = ['information', 'boxes', 'porcupine']
was trying to do something like:
['words' in words for [word for word in [sentence for sentence in sentences]]
or
for sentence in sentences:
    sentence.split(' ')
Ultimately I would like to filter the current list down to the elements that contain the keywords I've specified.
keywords = ['information', 'boxes']
sentences = [['this is a paragraph there should be lots more words here'],
['more information in this one'],
['just more words to be honest, not sure what to write']]
output: [False, True, False]
or ultimately:
parsed_list = [['more information in this one']]
Here is a one-liner to solve your problem. I find the lambda syntax easier to read than nested list comprehensions.
keywords = ['information', 'boxes']
sentences = [['this is a paragraph there should be lots more words here'],
['more information in this one'],
['just more words to be honest, not sure what to write']]
results_lambda = list(
    filter(lambda sentence: any((word in sentence[0] for word in keywords)), sentences))
print(results_lambda)
[['more information in this one']]
This can be done with a quick list comprehension!
lists = [['here is one sentence'], ['and here is another'], ['let us filter!'], ['more than one word filter']]
keywords = ['filter', 'one']  # avoid shadowing the built-in filter()
# keep each inner list once, even if it matches more than one keyword
result = [x for x in lists if any(s in x[0] for s in keywords)]
print(result)
result:
[['here is one sentence'], ['let us filter!'], ['more than one word filter']]
hope this helps!
Do you want to find sentences which have all the words in your keywords list?
If so, then you could use a set of those keywords and filter each sentence based on whether all words are present in the list:
One way is:
keyword_set = set(keywords)
n = len(keyword_set)  # number of keywords

def allKeywdsPresent(sentence):
    return len(set(sentence.split(" ")) & keyword_set) == n  # the intersection of both sets should equal the keyword set

filtered = [sentence for sentence in sentences if allKeywdsPresent(sentence[0])]
# filtered is the final set of sentences which satisfy your condition
# if you want a list of booleans:
boolean_array = [allKeywdsPresent(sentence[0]) for sentence in sentences]
There could be more optimal ways to do this (e.g. the set created for each sentence in allKeywdsPresent could be replaced with a single pass over all elements, etc.) But, this is a start.
Also, understand that using a set means duplicates in your keyword list will be eliminated. So, if you have a list of keywords with some duplicates, then use a dict instead of the set to keep a count of each keyword and reuse above logic.
From your example, it seems enough to have at least one keyword match. Then you need to modify allKeywdsPresent() [maybe rename it to anyKeywdsPresent]:
def allKeywdsPresent(sentence):
    return any(word in keyword_set for word in sentence.split())
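A minimal, self-contained sketch of how that variant might be wired up against the sample data from the question:
sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['just more words to be honest, not sure what to write']]
keywords = ['information', 'boxes']
keyword_set = set(keywords)

def anyKeywdsPresent(sentence):
    return any(word in keyword_set for word in sentence.split())

boolean_array = [anyKeywdsPresent(s[0]) for s in sentences]
parsed_list = [s for s in sentences if anyKeywdsPresent(s[0])]
print(boolean_array)  # [False, True, False]
print(parsed_list)    # [['more information in this one']]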
If you want to match only whole words and not just substrings, you'll have to account for all word separators (whitespace, punctuation, etc.) and first split your sentences into words, then match them against your keywords. The easiest, although not fool-proof, way is to just use the regex \W (non-word character) class and split your sentence on such occurrences.
Once you have the list of words in your text and list of keywords to match, the easiest, and probably most performant way to see if there is a match is to just do set intersection between the two. So:
import re

# not sure why you have the sentences in single-element lists, but if you insist...
sentences = [['this is a paragraph there should be lots more words here'],
['more information in this one'],
['just more disinformation, to make sure we have no partial matches']]
keywords = {'information', 'boxes', 'porcupine'} # note we're using a set here!
WORD = re.compile(r"\W+") # a simple regex to split sentences into words
# finally, iterate over each sentence, split it into words and check for intersection
result = [s for s in sentences if set(WORD.split(s[0].lower())) & keywords]
# [['more information in this one']]
So, how does it work - simple, we iterate over each of the sentences (and lowercase them for a good measure of case-insensitivity), then we split the sentence into words with the aforementioned regex. This means that, for example, the first sentence will split into:
['this', 'is', 'a', 'paragraph', 'there', 'should', 'be', 'lots', 'more', 'words', 'here']
We then convert it into a set for blazing fast comparisons (sets are hash-based and hash-based intersections are extremely fast) and, as a bonus, this also gets rid of duplicate words.
Finally, we do the set intersection against our keywords - if anything is returned these two sets have at least one word in common, which means that the if ... comparison evaluates to True and, in that case, the current sentence gets added to the result.
Final note - beware that while \W+ might be enough to split sentences into words (certainly better than a whitespace split only), it's far from perfect and not really suitable for all languages. If you're serious about word processing, take a look at some of the NLP modules available for Python, such as nltk.
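For reference, a tokenizer-based split with nltk might look roughly like this (nltk and its tokenizer data need to be installed separately; this is only a sketch, not part of the answer above):
# pip install nltk, then download the tokenizer data once:
#   import nltk; nltk.download('punkt')
from nltk.tokenize import word_tokenize
words = word_tokenize('just more disinformation, to make sure we have no partial matches')
print(words)
# ['just', 'more', 'disinformation', ',', 'to', 'make', 'sure', 'we', 'have', 'no', 'partial', 'matches']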

python word grouping based on words before and after

I am trying to create groups of words. First I count all words. Then I establish the top 10 words by word count. Then I want to create 10 groups of words based on those top 10. Each group consists of all the words that appear before and after the top word.
I have survey results stored in a python pandas dataframe structured like this:
Question_ID | Customer_ID | Answer
1           | 234         | Data is very important to use because ...
2           | 234         | We value data since we need it ...
I also saved the answers column as a string.
I am using the following code to find 3 words before and after a word (I actually had to create a string out of the answers column):
import re

answers_str = df.Answer.apply(str)
for value in answers_str:
    non_data = re.split('data|Data', value)
    terms_list = [term for term in non_data if len(term) > 0]  # skip empty terms
    substrs = [term.split()[0:3] for term in terms_list]  # slice and grab first three terms
    result = [' '.join(term) for term in substrs]  # combine the terms back into substrings
    print(result)
I have been manually creating groups of words - but is there a way of doing it in python?
So based on the example shown above the group with word counts would look like this:
group "data":
data : 2
important: 1
value: 1
need:1
then when it goes through the whole file, there would be another group:
group "analytics:
analyze: 5
report: 7
list: 10
visualize: 16
The idea would be to get rid of "we", "to", "is" as well - but I can do it manually, if that's not possible.
Then to establish the 10 most used words (by word count) and then create 10 groups with words that are in front and behind those main top 10 words.
We can use regex for this. We'll be using this regular expression
((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})
which you can test for yourself, to extract the three words before and after each occurrence of data.
First, let's remove all the words we don't like from the strings.
import re

# If you're processing a lot of sentences, it's probably wise to preprocess
# the pattern, assuming that bad_words is the same for all sentences
def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)
Then we want to get the words that surround data in each line:
data_pat = r'((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})'
res = re.findall(data_pat, s, flags=re.IGNORECASE)
gives us a list of tuples of strings. We want to get a list of those strings after they are split.
from itertools import chain
list_of_words = list(chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res))))))
That's not pretty, but it works. Basically, we pull the tuples out of the list, pull the strings out of each tuple, split each string, and finally pull all the resulting words into one big list.
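To make that chain a bit more concrete, here is a tiny trace with a made-up findall result:
from itertools import chain

# a made-up findall result: tuples of (before, after) capture groups
res = [('is very ', ' important to use'), ('We value ', ' since we need')]
list_of_words = list(chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res))))))
print(list_of_words)
# ['is', 'very', 'important', 'to', 'use', 'We', 'value', 'since', 'we', 'need']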
Let's put this all together with your pandas code. pandas isn't my strongest area, so please don't assume that I haven't made some elementary mistake if you see something weird looking.
import re
from itertools import chain
from collections import Counter

def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)

bad_words = ['we', 'is', 'to']
sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))
c = Counter()
data_pat = r'((?:\b\w+?\b\s*){0,3})data((?:\s*\b\w+?\b){0,3})'
for sentence in sentence_list:
    res = re.findall(data_pat, sentence, flags=re.IGNORECASE)
    words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
    c.update(words)
The nice thing about the regex we're using is that all of the complicated parts don't care about what word we're using. With a slight change, we can make a format string
base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'
such that
base_pat.format('data') == data_pat
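As a quick sanity check (the doubled braces are what keep the {0,3} quantifiers literal until .format() runs):
base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'
data_pat = r'((?:\b\w+?\b\s*){0,3})data((?:\s*\b\w+?\b){0,3})'
assert base_pat.format('data') == data_pat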
So, given some list of words we want to collect information about, key_words:
import re
from itertools import chain
from collections import Counter

def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)

bad_words = ['we', 'is', 'to']
sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))
key_words = ['data', 'analytics']
d = {}
base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'
for keyword in key_words:
    key_pat = base_pat.format(keyword)
    c = Counter()
    for sentence in sentence_list:
        res = re.findall(key_pat, sentence, flags=re.IGNORECASE)
        words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
        c.update(words)
    d[keyword] = c
Now we have a dictionary d that maps keywords, like data and analytics to Counters that map words that are not on our blacklist to their counts in the vicinity of the associated keyword. Something like
d = {'data': Counter({'important': 2,
                      'very': 3}),
     'analytics': Counter({'boring': 5,
                           'sleep': 3})}
As to how we get the top 10 words, that's basically the thing Counter is best at.
key_words, _ = zip(*Counter(w for sentence in sentence_list for w in sentence.split()).most_common(10))
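A quick illustration of that unpacking with a tiny made-up sentence list:
from collections import Counter

sentence_list = ["data is data", "we value data and reports"]
counts = Counter(w for sentence in sentence_list for w in sentence.split())
key_words, _ = zip(*counts.most_common(10))
print(key_words)  # ('data', 'is', 'we', 'value', 'and', 'reports')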
