How to modify list element within a for loop? - python

I'm trying to modify list elements and replace the original element with the newly modified one. However, I've noticed that the behavior differs depending on how I construct my for loop. For example:
samples = ['The cat sat on the mat.', 'The dog at my homework.']
punctuation = ['\'', '\"', '?', '!', ',', '.']
for sample in samples:
    sample = [character for character in sample if character not in punctuation]
    sample = ''.join(sample)
print(samples)
for i in range(len(samples)):
    samples[i] = [character for character in samples[i] if character not in punctuation]
    samples[i] = ''.join(samples[i])
print(samples)
This program outputs:
['The cat sat on the mat.', 'The dog at my homework.']
['The cat sat on the mat', 'The dog at my homework']
The second for loop produces the desired output with the punctuation removed from the sentences, but I'm having trouble understanding why. I've searched online and found a Quora answer helpful in explaining the technical details, but I'm wondering whether it's impossible to modify list elements using the first style of for loop, and whether I have to resort to functions like range or enumerate to modify list elements within loops.
Thank you.

Rebinding the loop variable is not enough;
you need to modify the list as well:

You need to replace the item in the list, not update the local variable created by the for loop. One option would be to use a range and update by index.
for i in range(len(samples)):
    sample = [character for character in samples[i] if character not in punctuation]
    samples[i] = ''.join(sample)
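Since you mention enumerate, that works here too: it yields the index alongside the element, so you can write the result back through the index. A minimal sketch with the same names:
for i, sample in enumerate(samples):
    # rebuild the cleaned string, then assign it back into the list by index
    cleaned = ''.join(character for character in sample if character not in punctuation)
    samples[i] = cleaned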
That said, a more pythonic approach would be to use a comprehension. You can also use the re module to do the substitution.
import re
clean_samples = [
    re.sub("['\"?!,.]", "", sample)
    for sample in samples
]
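If you would rather avoid regex, a comprehension over the characters of each sample does the same job, using only the names already defined above; a minimal sketch:
clean_samples = [
    ''.join(character for character in sample if character not in punctuation)
    for sample in samples
]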

Try this out:
samples = ['The cat sat on the mat.', 'The dog at my homework.']
punctuation = ['\'', '\"', '?', '!', ',', '.']
new_sample = []
for sample in samples:
    sample = [character for character in sample if character not in punctuation]
    sample = ''.join(sample)
    new_sample.append(sample)
print(new_sample)
In this case, sample is just a name bound to each element in turn, not a reference into the list itself, so when you reassign sample you are not updating the element.
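A minimal demonstration of that point: rebinding the loop variable leaves the list untouched.
nums = [1, 2, 3]
for n in nums:
    n = n * 10  # rebinds the local name n only
print(nums)  # still [1, 2, 3]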

Related

python: find the last appearance of a word (of a list of words) in a text

given a list of stop words and a string:
list_stop_words = ['for', 'the', 'with']
mystring = 'this is the car for the girl with the long nice red hair'
I would like to get the text starting from the end up to the first stop word of the list.
expected result 'the long nice red hair'
I tried with several for loops, but it is super cumbersome; there should be a straightforward way, probably even a one-liner.
my super verbose solution:
list_stop_words = ['for', 'the', 'with']
mystring = 'this is the car for the girl with the long nice red hair'
reversed_sentence = mystring.split()[::-1]
for i, word in enumerate(reversed_sentence):
    if word in list_stop_words:
        position = i
        words = reversed_sentence[0:i+1]
        print(' '.join(words[::-1]))
        break
Any suggestion for a better approach?
EDIT AFTER THE ANSWER (SEE BELOW)
You can try something like this:
mystring[max([mystring.rfind(stop_word) for stop_word in list_stop_words]):]
Basically, you find the last occurrence of each word with rfind, then you find the rightmost of those positions with max, then you slice the string from there.
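Putting it together as a runnable snippet (same names as above):
list_stop_words = ['for', 'the', 'with']
mystring = 'this is the car for the girl with the long nice red hair'
# slice from the rightmost position at which any stop word starts
print(mystring[max([mystring.rfind(stop_word) for stop_word in list_stop_words]):])
# -> 'the long nice red hair'
One caveat worth noting: rfind matches substrings, not whole words, so a stop word that happens to appear inside another word (e.g. 'the' in 'there') would also be picked up.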

Problem during replacing word/phrases from a text file using a set in python?

Suppose I have a list (new_list) with 3000 sentences, one sentence per element.
Example (a part):
new_list = ['air purity controller, to detect pollution and letting cold air in', 'air quality in my home by air conditioning', 'air conditioner depending on home', 'household alarm clock for time']
I want to replace certain words (single words or phrases) in new_list by adding some special characters (at the start and end). I am doing this with the help of a set.
Example of the set:
dict = {'air conditioner', 'air', 'air quality', 'house', 'air conditioning', 'alarm clock'}
The size of the set (dict) is 317. I want to scan each word of new_list and, when there is a match with the set, append special characters at the start and end. Further, if the match is a phrase from the set, it should additionally insert a special character (_) between the words, along with the special characters at the start and end.
I have tried, but it is failing. Please suggest where I am going wrong (though I don't think I am wrong). The new_list and dict are shown above.
import re, csv, nltk
from nltk.corpus import stopwords
from nltk import regexp_tokenize
with open("raw_data.txt", 'r', encoding='utf-8') as f1:
    reader = csv.reader(f1, skipinitialspace=True)
    new_list = next(reader)

with open('updatd_file.txt', 'w', encoding='utf-8') as f2:
    dic = {'air conditioner', 'air quality', 'air conditioning', 'air', 'house', 'alarm clock'}
    dic = {i: i.replace(' ', '_') for i in dic}
    pattern = re.compile(r"\b(" + "|".join(dic) + r")\b")
    modify_reqs = [pattern.sub(lambda x: "_{}_".format(dic[x.group()]), i) for i in new_list]
    sw = stopwords.words('english')
    unfiltered_tokens = [[word for word in regexp_tokenize(word, pattern=r"\s|[\d]|[^\wa-z+]", gaps=True) if word not in sw] for word in modify_reqs]
    f2.write(str(unfiltered_tokens))
I am executing this program and writing the results to a file. When I check the output file, I can sometimes see the words in the desired order (missing a few words), but at other times I cannot. I am unable to understand or explain this strange behavior.
That is, sometimes I find a phrase in the correct order (as expected), such as '_air_conditioning_', but the next time I execute this fragment I find the same phrase as '_air_', 'conditioning' (separated). The same thing happens with the other phrases like air quality, air conditioning, etc. The problem is with the phrases, not with the single words.
Please note that the set (dict) has 317 words and new_list contains almost 3000 sentences, so it is not possible to show them all here.
How is this possible? I have been trying this for 7-8 days and it is getting frustrating.
The comment by @Toto really helped me resolve the issue.
I sorted the set in descending order of word length using the built-in sorted function:
dic = sorted(dic, key=len, reverse=True)
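To see why the order matters: re tries the alternatives in a pattern left to right and takes the first that matches, so if 'air' appears before 'air conditioning' in the alternation, 'air' wins and the phrase gets split. Sets also iterate in an order that can change between runs, which explains the intermittent behavior of "|".join(dic) built straight from the set. A sketch of the corrected build order (the names ordered and replacements are introduced here, not from the original code):
import re

dic = {'air conditioner', 'air quality', 'air conditioning', 'air', 'house', 'alarm clock'}
ordered = sorted(dic, key=len, reverse=True)  # longest keys first
replacements = {i: i.replace(' ', '_') for i in ordered}
# 'air conditioning' is now tried before 'air', so phrases stay intact
pattern = re.compile(r"\b(" + "|".join(ordered) + r")\b")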

Getting list of string array into separate string arrays in python

This is my code.
from nltk.tokenize import sent_tokenize

SENTENCE = "He sad might have lung cancer. It’s just a rumor."
sent = sent_tokenize(SENTENCE)
The output is
['He sad might have lung cancer.', 'It’s just a rumor.']
I want to get this array as
['He sad might have lung cancer.']
['It’s just a rumor.']
Is there any way of doing this, and if so, how?
Since you want to split by sentence, you can simply do this:
sentence_list = SENTENCE.split('.')
for sentence in sentence_list:
    if sentence:  # skip the empty string left after the final '.'
        single_sentence = [sentence.strip() + '.']
If you actually want all lists containing a single sentence in the same data structure, you'd have to use a list of lists or a dictionary:
my_sentences = []
sentence_list = SENTENCE.split('.')
for sentence in sentence_list:
    if sentence:
        my_sentences.append([sentence.strip() + '.'])
To shorten this using a list comprehension:
my_sentences = [[sentence.strip() + '.'] for sentence in SENTENCE.split('.') if sentence]
The caveat is that splitting on '.' is naive: it will also break on abbreviations and other embedded periods, which is why sent_tokenize exists.
A solution using the re.split() function:
import re
s = "He sad might have lung cancer. It’s just a rumor."
parts = [l if l[-1] == '.' else l + '.' for l in re.split(r'\.\s?(?!$)', s)]
print(parts)
The output:
['He sad might have lung cancer.', 'It’s just a rumor.']
The r'\.\s?(?!$)' pattern defines the separator as a '.', except for one at the end of the text (excluded by the (?!$) lookahead).
l if l[-1] == '.' else l + '.' restores the '.' at the end of each part (since the delimiter was not captured while splitting).
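A quick demonstration of what the lookahead buys you, on the same string s:
import re

s = "He sad might have lung cancer. It’s just a rumor."
# without (?!$), the final '.' also splits, leaving a trailing empty string
print(re.split(r'\.\s?', s))       # ['He sad might have lung cancer', 'It’s just a rumor', '']
print(re.split(r'\.\s?(?!$)', s))  # ['He sad might have lung cancer', 'It’s just a rumor.']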

In Python, how to check if words in a string are keys in a dictionary?

For a class I am tackling the twitter sentiment analysis problem. I have looked at the other questions on the site and they don't help with my particular issue.
I am given a string that is one tweet with its letters changed so that they are all in lowercase. For example,
'after 23 years i still love this place. (# tel aviv kosher pizza) http://t.co/jklp0uj'
as well as a dictionary of words where the key is the word and the value is the sentiment score for that word. To be more specific, a key can be a single word (such as 'hello'), more than one word separated by a space (such as 'yellow hornet'), a hyphenated compound word (such as '2-dimensional'), or a number (such as '365').
I need to find the sentiment of the tweet by adding the sentiments for every eligible word and dividing by the number of eligible words (by eligible word, I mean a word that is in the dictionary). I'm not sure of the best way to check whether a tweet has a word in the dictionary.
I tried using the "key in string" approach while looping through all the keys, but this was problematic because there are a lot of keys and words-within-words would be counted (e.g. 'eradicate' counts 'cat', 'ate', 'era', etc. as well).
I then tried using .split(' ') and looping through the elements of the resulting list, but I ran into problems because of punctuation and keys which are two words.
Anyone have any ideas on how I can more suitably tackle this?
For example, using the tweet above: still: -0.625, love: 0.625, and every other word is not in the dictionary. So this should return (-0.625 + 0.625)/2 = 0.
The whole point of dictionaries is that they are quick at looking things up:
for word in instring.split():
    if word in wordsdict:
        print word
You would probably do better at getting rid of punctuation, etc. (thank you, Soke) by using regular expressions rather than split, e.g.
for word in re.findall(r'\w+', instring):
    if wordsdict.get(word) is not None:
        print word
Of course you will have to decide on some maximum length of word grouping, possibly found with a single pass through the dictionary, and then take your pairs, triples, etc. and check them as well.
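A sketch of that idea, using the instring and wordsdict names from above and an assumed maximum grouping length of 3:
max_len = 3
tokens = instring.split()
for n in range(1, max_len + 1):
    # slide a window of n words across the text and test each group
    for i in range(len(tokens) - n + 1):
        group = ' '.join(tokens[i:i + n])
        if group in wordsdict:
            print(group)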
You can use NLTK, which is very powerful for what you want to do; it can also be done with split:
>>> import string
>>> a= 'after 23 years i still love this place. (# tel aviv kosher pizza) http://t.co/jklp0uj'
>>> import nltk
>>> my_dict = {'still' : -0.625, 'love' : 0.625}
>>> words = nltk.word_tokenize(a)
>>> words
['after', '23', 'years', 'i', 'still', 'love', 'this', 'place.', '(', '#', 'tel', 'aviv', 'kosher', 'pizza', ')', 'http', ':', '//t.co/jklp0uj']
>>> sum(my_dict.get(x.strip(string.punctuation),0) for x in words)/2
0.0
using split:
>>> words = a.split()
>>> words
['after', '23', 'years', 'i', 'still', 'love', 'this', 'place.', '(#', 'tel', 'aviv', 'kosher', 'pizza)', 'http://t.co/jklp0uj']
>>> sum(my_dict.get(x.strip(string.punctuation),0) for x in words)/2
0.0
my_dict.get(key, default): get returns the value if the key is found in the dictionary, otherwise it returns the default, in this case 0.
Check this example, where 'place' is in the dictionary:
>>> import string
>>> import nltk
>>> my_dict = {'still' : -0.625, 'love' : 0.625,'place':1}
>>> a= 'after 23 years i still love this place. (# tel aviv kosher pizza) http://t.co/jklp0uj'
>>> words = nltk.word_tokenize(a)
>>> sum(my_dict.get(x.strip(string.punctuation),0) for x in words)/2
0.5
Going by the length of the dictionary keys might be one solution.
For example, you have the dict as:
Sentimentdict = {"habit":5, "bad habit":-1}
the sentence might be:
s1="He has good habit"
s2="He has bad habit"
s1 should get a better sentiment score than s2. Now, you can do this:
for w in sorted(Sentimentdict.keys(), key=len, reverse=True):  # longest keys first
    if w in s1:
        # remove the matched phrase and do your sentiment calculation
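A runnable version of the idea, with the pseudocode filled in (the sentiment function and score variable are introduced here, not part of the original answer):
def sentiment(sentence, sentimentdict):
    score = 0
    # longest keys first, so 'bad habit' is consumed before 'habit'
    for w in sorted(sentimentdict, key=len, reverse=True):
        if w in sentence:
            sentence = sentence.replace(w, '')  # remove the matched phrase
            score += sentimentdict[w]
    return score

Sentimentdict = {"habit": 5, "bad habit": -1}
print(sentiment("He has good habit", Sentimentdict))  # 5
print(sentiment("He has bad habit", Sentimentdict))   # -1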

Find Pattern in Textfile From Several Elements In Several Lists?

I am a beginner; I have been learning Python for a few months as my very first programming language. I am looking to find a pattern in a text file. My first attempt has been using regex, which does work but has a limitation:
import re

noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
CC_list = ['and', 'or']

noun_list_pattern1 = r'\b\w+\b,\s\b\w+\b,\sand\s\b\w+\b|\b\w+\b,\s\b\w+\b,\sor\s\b\w+\b|\b\w+\b,\s\b\w+\b\sand\s\b\w+\b|\b\w+\b,\s\b\w+\b,\saor\s\b\w+\b'

with open('test_sentence.txt', 'r') as input_f:
    read_input = input_f.read()
    word = re.findall(noun_list_pattern1, read_input)
    for w in word:
        print w
So at this point you may be asking why the lists are in this code, since they are not being used. Well, I have been racking my brain, trying all sorts of for loops and if statements in functions, to try and find a way to replicate the regex pattern using the lists.
The limitation with regex is that the \b\w+\b token, which appears a number of times in `noun_list_pattern1`, matches any word, not just specific nouns. This can raise false positives. I want to narrow things down by using the elements in the lists above instead.
Since I actually have 4 different regexes in the pattern (separated by |), I will just go with one of them here. So I would need to find a pattern such as:
'noun in noun_list' + ', ' + 'noun in noun_list' + ', ' + 'C in CC_list' + ' ' + 'noun in noun_list'
Obviously, the above quoted line is not real Python code, but an expression of my thoughts about the match needed. Where I say noun in noun_list I mean an iteration through noun_list; C in CC_list is an iteration through CC_list; ', ' is a literal string match for a comma and whitespace.
Hopefully I have made myself clear!
Here is the content of the test_sentence.txt file that I am using:
I need to buy are bacon, cheese and eggs.
I also need to buy milk, cheese, and bacon.
What's your favorite: milk, cheese or eggs.
What's my favorite: milk, bacon, or eggs.
Break your problem down a little. First, you need a pattern that will match the words from your list, but no others. You can accomplish that with the alternation operator | and the literal words. red|green|blue, for example, will match "red", "green", or "blue", but not "purple". Join the noun list with that character, and add the word-boundary metacharacters along with parentheses to group the alternation:
noun_patt = r'\b(' + '|'.join(nouns) + r')\b'
Do the same for your list of conjunctions:
conj_patt = r'\b(' + '|'.join(conjunctions) + r')\b'
The overall match you want to make is "one or more noun_patt match, each optionally followed by a comma, followed by a match for the conj_patt and then one more noun_patt match". Easy enough for a regex:
patt = r'({0},? )+{1} {0}'.format(noun_patt, conj_patt)
You don't really want to use re.findall(), but re.search(), since you're only expecting one match per line:
>>> for line in lines:
...     print re.search(patt, line).group(0)
...
bacon, cheese and eggs
milk, cheese, and bacon
milk, cheese or eggs
milk, bacon, or eggs
As a note, you're close to, if not rubbing up against, the limits of regular expressions as far as parsing English. Any more complex than this, and you will want to look into actual parsing, perhaps with NLTK.
In actuality, you don't necessarily need regular expressions, as there are a number of ways to do this using just your original lists.
noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']
# This assumes that the file has been read into a list of newline-delimited lines called `rawlines`
for line in rawlines:
    matches = [noun for noun in noun_list if noun in line] + [conj for conj in conjunctions if conj in line]
    if len(matches) == 4:
        for match in matches:
            print match
The match count is 4 because each of your example lines contains exactly three nouns and one conjunction. (Note that repeated nouns or conjunctions could also produce this count.)
EDIT:
This version prints the lines that are matched and the words matched. Also fixed the possible multiple word match problem:
words_matched = []
matching_lines = []
for l in lst:
    matches = [noun for noun in noun_list if noun in l] + [conj for conj in conjunctions if conj in l]
    invalid = True
    valid_count = 0
    for match in matches:
        if matches.count(match) == 1:
            valid_count += 1
    if valid_count == len(matches):
        invalid = False
    if not invalid:
        words_matched.append(matches)
        matching_lines.append(l)

for line, matches in zip(matching_lines, words_matched):
    print line, matches
However, if this doesn't suit you, you can always build the regex as follows (using the itertools module):
import itertools

# The number of nouns per permutation is 3 (as revealed by your examples)
for nouns, conj in itertools.product(itertools.permutations(noun_list, 3), conjunctions):
    matches = [noun for noun in nouns]
    matches.append(conj)
    # matches[:2] is the sublist containing the first 2 items, -1 is the last element,
    # and matches[2:-1] is the element before the last (if there were more than 3 nouns,
    # this would be the elements between the 2nd and the last).
    regex_string = r',\s'.join(matches[:2]) + r'\s' + matches[-1] + r'\s' + r',\s'.join(matches[2:-1])
    print regex_string
    # ... do regex-related matching here
The caveat of this method is that it is pure brute force: it generates all the possible combinations (read: permutations) of both lists, which can then be tested to see whether each line matches. Hence it is horrendously slow, but for examples like the ones given (with no comma before the conjunction) it will generate exact matches.
Adapt as required.
