Python - nested list comprehension with tokenizing

Python question: I have a list of sentences on which I want to apply NLTK stemming, so for each word in each sentence I want to apply, in this case, the NLTK snowball.stem function.
I want to write that as concisely as possible via a list comprehension.
The code below works fine, but I want to write it in fewer lines:
data_stemming = []
for sentence in data:
    word_list = word_tokenize(sentence)
    stemmed_sentence = ' '.join([stemmer.stem(w) for w in word_list])
    data_stemming.append(stemmed_sentence)
print(data_stemming)
output:
['do do done', 'do requir', 'shoe shoe']
Can someone help me out here?
Thanks a lot!

nltk.word_tokenize accepts a string as input, but data is a list of strings, so you need to tokenize each sentence individually inside the comprehension:
data = ['doing do done', 'requires require', 'shoe shoes']
data_stemming = [' '.join(snowball.stem(w) for w in nltk.word_tokenize(sentence)) for sentence in data]

You can try doing it in a single list comprehension:
data_stemming = [' '.join([stemmer.stem(w) for w in word_tokenize(sentence)]) for sentence in data]
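For reference, a minimal runnable sketch of the corrected one-liner (assuming nltk is installed and its 'punkt' tokenizer data has been downloaded):
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')
data = ['doing do done', 'requires require', 'shoe shoes']

# stem every token of every sentence, then re-join each sentence into a string
data_stemming = [' '.join(stemmer.stem(w) for w in word_tokenize(sentence)) for sentence in data]
print(data_stemming)  # one stemmed sentence per input sentence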

Related

How to replace multiple substrings in a list of sentences using regex in python?

I have a list of sentences as below :
sentences = ["I am learning to code", "coding seems to be intresting in python", "how to code in python", "practicing how to code is the key"]
Now I wish to replace a few substrings in this list of sentences using a dictionary of words and their replacements.
word_list = {'intresting': 'interesting', 'how to code': 'learning how to code', 'am learning':'love learning', 'in python': 'using python'}
I tried the following code:
replaced_sentences = [' '.join([word_list.get(w, w) for w in sentence.split()])
                      for sentence in sentences]
But only the one-word strings are getting replaced, not the keys with more than one word. That is because I am using sentence.split(), which tokenizes the sentence word by word and so misses substrings longer than one word.
How do I get to replace the substring with exact match using regex or any other suggestions?
expected output:
sentences = ["I love learning to code", "coding seems to be interesting using python", "learning how to code using python", "practicing learning how to code is the key"]
Thanks in advance.
It's probably easiest to read if you break this into a function that replaces all the words for a single sentence; then you can apply it to all the sentences in the list. Here we make a single regex by concatenating all the keys of the dict with '|'. Then we use re.sub with a callable replacement that looks up the matched key and returns the associated value as the replacement.
import re

def replace_words(s, word_lookup):
    rx = '|'.join(word_lookup.keys())
    return re.sub(rx, lambda match: word_lookup[match.group(0)], s)

[replace_words(s, word_list) for s in sentences]
This will result in:
['I love learning to code',
 'coding seems to be interesting using python',
 'learning how to code using python',
 'practicing learning how to code is the key']
You could optimize a bit by compiling the regex once instead of building it on every call. That would allow you to do something like:
import re
rx = re.compile('|'.join(word_list.keys()))
[rx.sub(lambda match: word_list[match.group(0)], s) for s in sentences]
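One caveat worth adding (my note, not part of the original answer): regex alternation tries the alternatives from left to right, so if one key were a prefix of another (say, a hypothetical 'code' alongside 'code in python'), the shorter key could win. Sorting the keys longest-first, and escaping them in case they contain regex metacharacters, avoids both surprises:
import re

# longest keys first, escaped in case they contain regex metacharacters
rx = re.compile('|'.join(re.escape(k) for k in sorted(word_list, key=len, reverse=True)))
[rx.sub(lambda match: word_list[match.group(0)], s) for s in sentences]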

parsing emails to identify keywords

I'm looking to parse through a list of email texts to identify keywords. Let's say I have this following list:
sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['just more words to be honest, not sure what to write']]
I want to check whether words from a keywords list appear in any of these sentences in the list, using regex. I wouldn't want informations to be captured, only information:
keywords = ['information', 'boxes', 'porcupine']
I was trying to do something like:
['words' in words for [word for word in [sentence for sentence in sentences]]
or
for sentence in sentences:
    sentence.split(' ')
Ultimately I would like to filter the current list down to only the elements that contain the keywords I've specified.
keywords = ['information', 'boxes']
sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['just more words to be honest, not sure what to write']]
output: [False, True, False]
or ultimately:
parsed_list = [['more information in this one']]
Here is a one-liner to solve your problem. I find the lambda syntax easier to read than nested list comprehensions.
keywords = ['information', 'boxes']
sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['just more words to be honest, not sure what to write']]
results_lambda = list(
    filter(lambda sentence: any(word in sentence[0] for word in keywords), sentences))
print(results_lambda)
[['more information in this one']]
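For comparison (my sketch, not part of the answer), the equivalent list comprehension is arguably just as readable here:
results_comp = [sentence for sentence in sentences if any(word in sentence[0] for word in keywords)]
print(results_comp)  # [['more information in this one']]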
This can be done with a quick list comprehension!
lists = [['here is one sentence'], ['and here is another'], ['let us filter!'], ['more than one word filter']]
keywords = ['filter', 'one']
result = [x for x in lists if any(s in x[0] for s in keywords)]
print(result)
result:
[['here is one sentence'], ['let us filter!'], ['more than one word filter']]
(Using any() keeps each matching sentence exactly once, even when it contains several keywords, and avoids shadowing the built-in filter.)
hope this helps!
Do you want to find the sentences which contain all the words in your keywords list?
If so, you could use a set of those keywords and filter each sentence based on whether all the keywords are present in it.
One way is:
keyword_set = set(keywords)
n = len(keyword_set)  # number of keywords

def allKeywdsPresent(sentence):
    return len(set(sentence.split(" ")) & keyword_set) == n  # the intersection of both sets should equal the keyword set

filtered = [sentence for sentence in sentences if allKeywdsPresent(sentence[0])]
# filtered is the final list of sentences which satisfy your condition
# if you want a list of booleans:
boolean_array = [allKeywdsPresent(sentence[0]) for sentence in sentences]
There could be more optimal ways to do this (e.g. the set created for each sentence in allKeywdsPresent could be replaced with a single pass over all elements, etc.), but this is a start.
Also, understand that using a set means duplicates in your keyword list will be eliminated. So if you have a list of keywords with duplicates, use a dict instead of the set to keep a count of each keyword, and reuse the logic above.
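A hedged sketch of that counting variant (my illustration, using collections.Counter rather than a bare dict):
from collections import Counter

keyword_counts = Counter(keywords)  # duplicate keywords are kept as counts

def allKeywdsPresent(sentence):
    word_counts = Counter(sentence.split())
    # every keyword must occur at least as often in the sentence as in the keyword list
    return all(word_counts[k] >= c for k, c in keyword_counts.items())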
From your example, it seems enough to have at least one keyword match. In that case, modify allKeywdsPresent() (and maybe rename it to anyKeywdsPresent):
def anyKeywdsPresent(sentence):
    return any(word in keyword_set for word in sentence.split())
If you want to match only whole words and not just substrings, you'll have to account for all word separators (whitespace, punctuation, etc.) and first split your sentences into words, then match them against your keywords. The easiest, although not fool-proof, way is to split your sentences on occurrences of the regex \W (non-word character) class.
Once you have the list of words in your text and list of keywords to match, the easiest, and probably most performant way to see if there is a match is to just do set intersection between the two. So:
import re

# not sure why you have the sentences in single-element lists, but if you insist...
sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['just more disinformation, to make sure we have no partial matches']]
keywords = {'information', 'boxes', 'porcupine'}  # note we're using a set here!

WORD = re.compile(r"\W+")  # a simple regex to split sentences into words

# finally, iterate over each sentence, split it into words and check for intersection
result = [s for s in sentences if set(WORD.split(s[0].lower())) & keywords]
# [['more information in this one']]
So, how does it work - simple, we iterate over each of the sentences (and lowercase them for a good measure of case-insensitivity), then we split the sentence into words with the aforementioned regex. This means that, for example, the first sentence will split into:
['this', 'is', 'a', 'paragraph', 'there', 'should', 'be', 'lots', 'more', 'words', 'here']
We then convert it into a set for blazing-fast comparisons (a set is hash-based, and intersections over hashes are extremely fast) and, as a bonus, this also gets rid of duplicate words.
Finally, we take the set intersection against our keywords - if anything is returned, the two sets have at least one word in common, which means the if ... comparison evaluates to True and, in that case, the current sentence gets added to the result.
Final note - beware that while \W+ might be enough to split sentences into words (certainly better than a whitespace-only split), it's far from perfect and not really suitable for all languages. If you're serious about word processing, take a look at some of the NLP modules available for Python, such as nltk.
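Following that suggestion, a hedged sketch of the same keyword check with nltk doing the word splitting (assumes nltk is installed and its 'punkt' tokenizer data is downloaded):
from nltk.tokenize import word_tokenize

result = [s for s in sentences if {w.lower() for w in word_tokenize(s[0])} & keywords]
# [['more information in this one']]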

Flattening 3D list of words to 2D

I have a pandas column with text strings. For simplicity, let's assume I have a column with two strings.
s = ["How are you. Don't wait for me", "this is all fine"]
I want to get something like this:
[["How", "are", "you"], ["Don't", "wait", "for", "me"], ["this", "is", "all", "fine"]]
Basically, take each sentence of each document and tokenize it into a list of words, so in the end I need a list of lists of strings.
I tried using a map like below:
nlp = spacy.load('en')

def text_to_words(x):
    """ This function converts sentences in a text to a list of words
    """
    global log_txt
    x = re.sub(r"\s\s+", " ", x.strip())
    txt_to_words = [str(doc).replace(".", "").split(" ") for doc in nlp(x).sents]
    #log_txt = log_txt.extend(txt_to_words)
    return txt_to_words
The nlp object from spacy is used to split a string of text into a list of sentences.
log_txt = list(map(text_to_words, s))
log_txt
But, as you can see, this puts the results from the two documents into separate nested lists:
[[['How', 'are', 'you'], ["Don't", 'wait', 'for', 'me']],
[['this', 'is', 'all', 'fine']]]
You'll need a nested list comprehension. Additionally, you can get rid of punctuation using re.sub.
import re
data = ["How are you. Don't wait for me", "this is all fine"]
words = [
    re.sub(r"[^a-z\s]", '', j.text.lower()).split() for i in data for j in nlp(i).sents
]
Or,
words = []
for i in data:
    for j in nlp(i).sents:
        words.append(re.sub(r"[^a-z\s]", '', j.text.lower()).split())
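An alternative sketch (my addition, not from the answer) that skips the regex and uses spaCy's own token attributes to drop punctuation; note it keeps the original casing, unlike the lowercasing variants above:
words = [[t.text for t in sent if not t.is_punct] for x in data for sent in nlp(x).sents]
# nested lists of word tokens per sentence (spaCy splits contractions, so "Don't" becomes 'Do' and "n't")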
There is a much simpler way that needs only a plain list comprehension: first join the strings with a period '.' and then split on periods again.
[x.split() for x in '.'.join(s).split('.')]
It will give the desired result (assuming sentences are delimited by periods only):
[["How", "are", "you"], ["Don't", "wait", "for", "me"], ["this", "is", "all", "fine"]]
For pandas DataFrames, you may get an object, and hence a list of lists, back from the tolist function; just extract the first element.
For example,
import pandas as pd

def splitwords(s):
    s1 = [x.split() for x in '.'.join(s).split('.')]
    return s1

df = pd.DataFrame(s)
result = df.apply(splitwords).tolist()[0]
Again, it will give you the preferred result.
Hope it helps ;)
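As an aside (my sketch, assuming pandas 0.25+ for Series.explode), a more pandas-native way to get the same nested list from a column:
import pandas as pd

s = ["How are you. Don't wait for me", "this is all fine"]
col = pd.Series(s)
# split each document into sentences on '.', give each sentence its own row, then split into words
nested = col.str.split('.').explode().str.split().tolist()
# [['How', 'are', 'you'], ["Don't", 'wait', 'for', 'me'], ['this', 'is', 'all', 'fine']]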

Removing stopwords from list using python3

I have been trying to remove stopwords from a csv file that I'm reading with Python, but my code does not seem to work. I tried using a sample text in the code to validate it, but the result is still the same. Below is my code, and I would appreciate it if anyone can help me rectify the issue:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import csv
article = ['The computer code has a little bug',
           'im learning python',
           'thanks for helping me',
           'this is trouble',
           'this is a sample sentence'
           'cat in the hat']

tokenized_models = [word_tokenize(str(i)) for i in article]

stopset = set(stopwords.words('english'))
stop_models = [i for i in tokenized_models if str(i).lower() not in stopset]
print('token:' + str(stop_models))
Your tokenized_models is a list of tokenized sentences, so a list of lists. Ergo, the following line tries to match a list of words to a stopword:
stop_models = [i for i in tokenized_models if str(i).lower() not in stopset]
Instead, iterate through the words again. Something like:
clean_models = []
for m in tokenized_models:
    stop_m = [i for i in m if str(i).lower() not in stopset]
    clean_models.append(stop_m)
print(clean_models)
Off-topic useful hint:
To define a multi-line string, use parentheses and no commas:
article = ('The computer code has a little bug'
           'im learning python'
           'thanks for helping me'
           'this is trouble'
           'this is a sample sentence'
           'cat in the hat')
This version would work with your original code.
word_tokenize(str(i)) returns a list of words, so tokenized_models is a list of lists. You need to flatten that list, or better yet just make article a single string, since I don't see why it's a list at the moment.
This is because the in operator won't search through a list and then through strings in that list at the same time, e.g.:
>>> 'a' in 'abc'
True
>>> 'a' in ['abc']
False
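A minimal sketch of the flattening approach suggested above (my illustration, reusing tokenized_models and stopset from the question):
flat_tokens = [w for sentence in tokenized_models for w in sentence]
stop_models = [w for w in flat_tokens if w.lower() not in stopset]
print('token:' + str(stop_models))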

How to convert this function/for loop into a list comprehension or higher-order function using Python?

Hello all, I wrote the following simple translator program using a function and for loops, but I am trying to understand list comprehensions/higher-order functions better. I have a very basic grasp of functions such as map and of list comprehensions, but I don't know how to work with them when the loop requires a placeholder value such as place_holder in the code below. Also, any suggestions on what I can do better would be greatly appreciated. Thanks in advance, you guys rock!
P.S. How do you get that fancy formatting where my posted code looks like it does in Notepad++?
sweedish = {'merry': 'god', 'christmas': 'jul', 'and': 'och', 'happy': 'nytt', 'year': 'ar'}
english = ('merry christmas and happy new year')

def translate(s):
    new = s.split()  # split the string into a list
    place_holder = []  # empty list to hold the translated words
    for item in new:  # loop through each item in new
        if item in sweedish:
            place_holder.append(sweedish[item])  # if the item is found in sweedish, add the corresponding value to place_holder
    for item in place_holder:  # only way I know how to print a list out with no brackets, quotes or other such items. Do you know a better way?
        print(item, end=' ')

translate(english)
Edit to show chepner's answer and chisaku's formatting tips:
sweedish = {'merry': 'god', 'christmas': 'jul', 'and': 'och', 'happy': 'nytt', 'year': 'ar'}
english = 'merry christmas and happy new year'
new = english.split()
print(' '.join([sweedish[item] for item in new if item in sweedish]))
A list comprehension simply builds a list all at once, rather than individually calling append to add items to the end inside a for loop.
place_holder = [ sweedish[item] for item in new if item in sweedish ]
The variable itself is unnecessary, since you can put the list comprehension directly in the for loop:
for item in [ sweedish[item] for item in new if item in sweedish ]:
As chepner says, you can use a list comprehension to concisely build your new list of words translated from English to Swedish.
To access the dictionary, you might want to use swedish.get(word, 'null_value_placeholder') so you don't get a KeyError if your English word isn't in the dictionary.
In my example, 'None' is the placeholder for English words without a translation in the dictionary. You could instead use '' as a placeholder, acknowledging that the gaps in the dictionary mean you only get an approximate translation.
swedish = {'merry': 'god', 'christmas': 'jul', 'and': 'och', 'happy': 'nytt', 'year': 'ar'}
english = 'merry christmas and happy new year'

def translate(s):
    words = s.split()
    translation = [swedish.get(word, 'None') for word in words]
    print(' '.join(translation))

translate(english)
>>>
god jul och nytt None ar
Alternatively, you can put a condition in your list comprehension so it only attempts to translate words that show up in the dictionary.
def translate(s):
    words = s.split()
    translation = [swedish[word] for word in words if word in swedish]
    print(' '.join(translation))

translate(english)
>>>
god jul och nytt ar
' '.join(translation) converts your list of words into a single string with the words separated by spaces.
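Since the question also asks about higher-order functions, here is a hedged sketch (my illustration, not from the answers) of the same translation using map and filter instead of a comprehension:
words = english.split()
translation = map(swedish.get, filter(lambda w: w in swedish, words))
print(' '.join(translation))  # god jul och nytt ar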
