Flattening 3D list of words to 2D - python

I have a pandas column with text strings. For simplicity, let's assume I have a column with two strings:
s=["How are you. Don't wait for me", "this is all fine"]
I want to get something like this:
[["How", "are","you"],["Don't", "wait", "for", "me"],["this","is","all","fine"]]
Basically, take each sentence of a document and tokenize it into a list of words. So in the end I need a list of lists of strings.
I tried using a map like below:
nlp = spacy.load('en')

def text_to_words(x):
    """This function converts sentences in a text to a list of words."""
    global log_txt
    x = re.sub(r"\s\s+", " ", x.strip())
    txt_to_words = [str(doc).replace(".", "").split(" ") for doc in nlp(x).sents]
    #log_txt = log_txt.extend(txt_to_words)
    return txt_to_words
The nlp from spacy is used to split a string of text into a list of sentences.
log_txt=list(map(text_to_words,s))
log_txt
But this, as you know, puts all of the results from both documents into another enclosing list:
[[['How', 'are', 'you'], ["Don't", 'wait', 'for', 'me']],
[['this', 'is', 'all', 'fine']]]
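One way to flatten that extra level afterwards (a minimal sketch using itertools, separate from the answers below) would be:
import itertools

nested = [[['How', 'are', 'you'], ["Don't", 'wait', 'for', 'me']],
          [['this', 'is', 'all', 'fine']]]
flat = list(itertools.chain.from_iterable(nested))
# [['How', 'are', 'you'], ["Don't", 'wait', 'for', 'me'], ['this', 'is', 'all', 'fine']]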

You'll need a nested list comprehension. Additionally, you can get rid of punctuation using re.sub.
import re
data = ["How are you. Don't wait for me", "this is all fine"]
words = [
    re.sub(r"[^a-z\s]", "", str(j).lower()).split() for i in data for j in nlp(i).sents
]
Or,
words = []
for i in data:
    ...  # do something here
    for j in nlp(i).sents:
        words.append(re.sub(r"[^a-z\s]", "", str(j).lower()).split())

There is a much simpler way using a list comprehension.
You can first join the strings with a period '.' and split them again.
[x.split() for x in '.'.join(s).split('.')]
It will give the desired result.
[["How", "are","you"],["Don't", "wait", "for", "me"],["this","is","all","fine"]]
For Pandas DataFrames, you may get an object back, and hence a list of lists after calling tolist(). Just extract the first element.
For example,
import pandas as pd
def splitwords(s):
    s1 = [x.split() for x in '.'.join(s).split('.')]
    return s1

df = pd.DataFrame(s)
result = df.apply(splitwords).tolist()[0]
Again, it will give you the preferred result.
Hope it helps ;)

Related

Python - nested list comprehension with tokenizing

Python question: I have a list of sentences to which I want to apply nltk stemming. So for each word in each sentence, I want to apply, in this case, the nltk snowball.stem function.
I want to write that as concisely as possible via a list comprehension.
The code below works fine, but I want to write it in fewer lines:
data_stemming = []
for sentence in data:
    word_list = word_tokenize(sentence)
    stemmed_sentence = ' '.join([stemmer.stem(w) for w in word_list])
    data_stemming.append(stemmed_sentence)
print(data_stemming)
output:
['do do done', 'do requir', 'shoe shoe']
Can someone help me out here?
Thanks a lot!
nltk.word_tokenize accepts as input a string, but data is a list of strings. What you need is:
data = ['doing do done', 'requires require', 'shoe shoes']
data_stemming = [' '.join(snowball.stem(w) for w in nltk.word_tokenize(sentence)) for sentence in data]
You can try doing it in a single list comprehension:
data_stemming = [' '.join([stemmer.stem(w) for w in word_tokenize(sentence)]) for sentence in data]

Removing stopwords from a string with ordered set and join retains a single stopword

I don't understand why I don't remove the stopword "a" in this loop. It seems so obvious that this should work...
Given a list of stop words, write a function that takes a string and returns a string stripped of the stop words. Output: stripped_paragraph = 'want figure out how can better data scientist'
Below I define 'stopwords'
I split all the words by a space, make a set of words while retaining the order
loop through the ordered and split substring set ('osss' var) and conditionally remove each word if it's a word in the list 'stopwords'
paragraph = 'I want to figure out how I can be a better data scientist'

def rm_stopwards(par):
    stopwords = ['I', 'as', 'to', 'you', 'your', 'but', 'be', 'a']
    osss = list(list(dict.fromkeys(par.split(' '))))  # ordered_split_shortened_set
    for word in osss:
        if word.strip() in stopwords:
            osss.remove(word)
        else:
            next
    return ' '.join(osss)

print("stripped_paragraph = " + "'" + rm_stopwards(paragraph) + "'")
My incorrect output is: 'want figure out how can a better data scientist'
Correct output: 'want figure out how can better data scientist'
edit: note that .strip() in the condition check with word.strip() is unnecessary and I still get the same output, that was me checking to make sure there wasn't an extra space somehow
edit2: this is an interview question, so I can't use any imports
What you're trying to do can be achieved with far fewer lines of code.
The main problem in your code is that you're changing the list while iterating over it.
This works and is much simpler: loop over your paragraph's words, keep only the ones that aren't in the stopwords list, then join them back together with a space.
paragraph = 'I want to figure out how I can be a better data scientist'
stopwords = ['I', 'as', 'to', 'you', 'your','but','be', 'a']
filtered = ' '.join([word for word in paragraph.split() if word not in stopwords])
print(filtered)
You may also consider using nltk, which has a predefined list of stopwords.
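For example, a small sketch (assuming the stopwords corpus has been downloaded; note nltk's list is lowercase, so the result will differ slightly from the expected output above):
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')                  # one-time download of the corpus
stop_set = set(stopwords.words('english'))  # predefined English stopword list
filtered = ' '.join(w for w in paragraph.split() if w.lower() not in stop_set)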
You should not change (delete/add) a collection (osss) while iterating over it.
del_list = []
for word in osss:
    if word.strip() in stopwords:
        del_list.append(word)
    else:
        next
osss = [e for e in osss if e not in del_list]
paragraph = 'I want to figure out how I can be a better data scientist'

def rm_stopwards(par):
    stopwords = ['I', 'as', 'to', 'you', 'your', 'but', 'be', 'a']
    osss = list(list(dict.fromkeys(par.split(' '))))  # ordered_split_shortened_set
    x = list(osss)
    for word in osss:
        if word.strip() in stopwords:
            x.remove(word)
        #else:
        #    next
    ret = ' '.join(x)
    return ret

print("stripped_paragraph = " + "'" + rm_stopwards(paragraph) + "'")

Removing stopwords from a pandas series based on list

I have the following data frame called sentences
data = ["Home of the Jacksons"], ["Is it the real thing?"], ["What is it with you?"], [ "Tomatoes are the best"] [ "I think it's best to path ways now"]
sentences = pd.DataFrame(data, columns = ['sentence'])
And a dataframe called stopwords:
data = [["the"], ["it"], ["best"], [ "is"]]
stopwords = pd.DataFrame(data, columns = ['word'])
I want to remove all stopwords from sentences["sentence"]. I tried the code below but it does not work. I think there is an issue with my if statement. Can anyone help?
Def remove_stopwords(input_string, stopwords_list):
    stopwords_list = list(stopwords_list)
    my_string_split = input_string.split(' ')
    my_string = []
    for word in my_string_split:
        if word not in stopwords_list:
            my_string.append(word)
    my_string = " ".join(my_string)
    return my_string

sentence['cut_string'] = sentence.apply(lambda row: remove_stopwords(row['sentence'], stopwords['word']), axis=1)
When I apply the function, it just returns the first or first few strings in the sentence but does not cut out stopwords at all. Kinda stuck here
You can convert the stopwords word column to a list and remove those words from sentences using a list comprehension,
stopword_list = stopwords['word'].tolist()
sentences['filtered'] = sentences['sentence'].apply(lambda x: ' '.join([i for i in x.split() if i not in stopword_list]))
You get
0 Home of Jacksons
1 Is real thing?
2 What with you?
3 Tomatoes are
4 I think it's to path ways now
Or you can wrap the code in a function,
def remove_stopwords(input_string, stopwords_list):
    my_string = []
    for word in input_string.split():
        if word not in stopwords_list:
            my_string.append(word)
    return " ".join(my_string)

stopword_list = stopwords['word'].tolist()
sentences['sentence'].apply(lambda row: remove_stopwords(row, stopword_list))
You have many syntax errors in your code above. If you keep the stopwords as a list (or set) rather than a DataFrame, the following will work -
data = ["Home of the Jacksons", "Is it the real thing?", "What is it with you?", "Tomatoes are the best", "I think it's best to path ways now"]
sentences = pd.DataFrame(data, columns = ['sentence'])
stopwords = ["the", "it", "best", "is"]
sentences.sentence.str.split().apply(lambda x: " ".join([y for y in x if y not in stopwords]))
The key to success is to convert the list of stopwords into a set(): sets have O(1) lookup times, while lists' time is O(N).
stop_set = set(stopwords.word.tolist())
sentences.sentence.str.split()\
    .apply(lambda x: ' '.join(w for w in x if w not in stop_set))

parsing emails to identify keywords

I'm looking to parse through a list of email text to identify keywords. Let's say I have the following list:
sentences = [['this is a paragraph there should be lots more words here'],
['more information in this one'],
['just more words to be honest, not sure what to write']]
I want to check whether words from a keywords list appear in any of these sentences, using regex. I wouldn't want 'informations' to be captured, only 'information'.
keywords = ['information', 'boxes', 'porcupine']
I was trying to do something like:
['words' in words for [word for word in [sentence for sentence in sentences]]
or
for sentence in sentences:
    sentence.split(' ')
Ultimately I would like to filter the current list down to the elements that contain the keywords I've specified.
keywords = ['information', 'boxes']
sentences = [['this is a paragraph there should be lots more words here'],
['more information in this one'],
['just more words to be honest, not sure what to write']]
output: [False, True, False]
or ultimately:
parsed_list = [['more information in this one']]
Here is a one-liner to solve your problem. I find using lambda syntax is easier to read than nested list comprehensions.
keywords = ['information', 'boxes']
sentences = [['this is a paragraph there should be lots more words here'],
['more information in this one'],
['just more words to be honest, not sure what to write']]
results_lambda = list(
    filter(lambda sentence: any((word in sentence[0] for word in keywords)), sentences))
print(results_lambda)
[['more information in this one']]
This can be done with a quick list comprehension!
lists = [['here is one sentence'], ['and here is another'], ['let us filter!'], ['more than one word filter']]
keywords_to_match = ['filter', 'one']
result = [x for x in lists if any(s in x[0] for s in keywords_to_match)]
print(result)
result:
[['here is one sentence'], ['let us filter!'], ['more than one word filter']]
hope this helps!
Do you want to find sentences which have all the words in your keywords list?
If so, then you could use a set of those keywords and filter each sentence based on whether all words are present in the list:
One way is:
keyword_set = set(keywords)
n = len(keyword_set)  # number of keywords

def allKeywdsPresent(sentence):
    return len(set(sentence.split(" ")) & keyword_set) == n  # the intersection of both sets should equal the keyword set

filtered = [sentence for sentence in sentences if allKeywdsPresent(sentence[0])]
# filtered is the final set of sentences which satisfy your condition
# if you want a list of booleans:
boolean_array = [allKeywdsPresent(sentence[0]) for sentence in sentences]
There could be more optimal ways to do this (e.g. the set created for each sentence in allKeywdsPresent could be replaced with a single pass over all elements, etc.) But, this is a start.
Also, understand that using a set means duplicates in your keyword list will be eliminated. So, if you have a list of keywords with some duplicates, then use a dict instead of the set to keep a count of each keyword and reuse above logic.
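A rough sketch of that idea with collections.Counter (assuming a duplicated keyword means it must occur at least that many times in a sentence):
from collections import Counter

keyword_counts = Counter(keywords)  # e.g. Counter({'information': 2, 'boxes': 1}) if 'information' is listed twice

def allKeywdsPresent(sentence):
    word_counts = Counter(sentence.split())
    # every keyword must appear at least as often as it was listed
    return all(word_counts[kw] >= n for kw, n in keyword_counts.items())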
From your example, it seems enough to have at least one keyword match. Then you need to modify allKeywdsPresent() [maybe rename it to anyKeywdsPresent]:
def allKeywdsPresent(sentence):
    return any(word in keyword_set for word in sentence.split())
If you want to match only whole words and not just substrings, you'll have to account for all word separators (whitespace, punctuation, etc.) and first split your sentences into words, then match them against your keywords. The easiest, although not fool-proof, way is to just use the regex \W (non-word character) class and split your sentence on such occurrences.
Once you have the list of words in your text and list of keywords to match, the easiest, and probably most performant way to see if there is a match is to just do set intersection between the two. So:
import re

# not sure why you have the sentences in single-element lists, but if you insist...
sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['just more disinformation, to make sure we have no partial matches']]
keywords = {'information', 'boxes', 'porcupine'}  # note we're using a set here!

WORD = re.compile(r"\W+")  # a simple regex to split sentences into words

# finally, iterate over each sentence, split it into words and check for intersection
result = [s for s in sentences if set(WORD.split(s[0].lower())) & keywords]
# [['more information in this one']]
So, how does it work - simple, we iterate over each of the sentences (and lowercase them for a good measure of case-insensitivity), then we split the sentence into words with the aforementioned regex. This means that, for example, the first sentence will split into:
['this', 'is', 'a', 'paragraph', 'there', 'should', 'be', 'lots', 'more', 'words', 'here']
We then convert it into a set for blazing fast comparisons (sets are hash-based, and intersections based on hashes are extremely fast) and, as a bonus, this also gets rid of duplicate words.
Finally, we do the set intersection against our keywords - if anything is returned, the two sets have at least one word in common, which means that the if ... comparison evaluates to True and, in that case, the current sentence gets added to the result.
Final note - beware that while \W+ might be enough to split sentences into words (certainly better than a whitespace-only split), it's far from perfect and not really suitable for all languages. If you're serious about word processing, take a look at some of the NLP modules available for Python, such as nltk.
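For instance, nltk's word_tokenize splits off punctuation as separate tokens (a small sketch; it needs the 'punkt' tokenizer data downloaded once):
import nltk
nltk.download('punkt')  # one-time download of the tokenizer models

from nltk.tokenize import word_tokenize
word_tokenize("just more disinformation, to make sure!")
# ['just', 'more', 'disinformation', ',', 'to', 'make', 'sure', '!']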

python to search separated words

I am trying to extract separated multi-word matches from a Python list, using two other lists as the query strings. My sentence list is:
lst = ['we have the terrible HIV epidemic that takes down the life expectancy of the African ','and I take the regions down here','The poorest are down']
lst_verb = ['take','go','wake']
lst_prep = ['down','up','in']
import re

output = []
item = 'down'
p = re.compile(r'(?:\w+\s+){1,20}' + item)
for i in lst:
    output.append(p.findall(i))
for item in output:
    print(item)
With this I am able to extract words from the list. However, I only want to extract separated multi-word matches, i.e. it should only extract the sentence "and I take the regions down here".
Furthermore, I want to use the words from lst_verb and lst_prep as the query strings,
for example
re.findall(r \lst_verb+'*.\b'+ \lst_prep)
Thank you for your answer.
You can use a regex with lookaheads like
(?is)^(?=.*\b(take)\b)(?=.*?\b(go)\b)(?=.*\b(where)\b)(?=.*\b(wake)\b).*
to match multiple words, as in your example.
Use functions to create the regex string from the verbs and prepositions, as in the sketch below.
Hope this helps.
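A small sketch of that idea, building the pattern from the two lists and requiring at least one other word between the verb and the preposition (so only the separated matches are kept):
import re

lst = ['we have the terrible HIV epidemic that takes down the life expectancy of the African ',
       'and I take the regions down here',
       'The poorest are down']
lst_verb = ['take', 'go', 'wake']
lst_prep = ['down', 'up', 'in']

verbs = '|'.join(map(re.escape, lst_verb))
preps = '|'.join(map(re.escape, lst_prep))
# a verb, then one or more intervening words, then a preposition
p = re.compile(r'\b(?:%s)\b(?:\s+\w+)+\s+(?:%s)\b' % (verbs, preps))

matches = [s for s in lst if p.search(s)]
# ['and I take the regions down here']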
