List of lists of words in Python - python

I have a long list of comments (say, 50) such as this one:
"this was the biggest disappointment of our trip. the restaurant had
received some very good reviews, so our expectations were high. the
service was slow even though the restaurant was not very full. I had
the house salad which could have come out of any sizzler in the us.
the keshi yena, although tasty reminded me of barbequed pulled
chicken. this restaurant is very overrated".
I want to create a list of lists of words, retaining the sentence tokenization, using Python.
After removing stopwords, I want a result for all 50 comments in which the sentence tokens are retained and the word tokens are retained within each tokenized sentence. In the end I hope the result is similar to:
list(c("disappointment", "trip"),
c("restaurant", "received", "good", "reviews", "expectations", "high"),
c("service", "slow", "even", "though", "restaurant", "full"),
c("house", "salad", "come", "us"),
c("although", "tasty", "reminded", "pulled"),
"restaurant")
How could I do that in Python? Is R a good option in this case? I would really appreciate your help.

If you do not want to create a list of stop words by hand, I would recommend that you use the nltk library in python. It also handles sentence splitting (as opposed to splitting on every period). A sample that parses your sentence might look like this:
import nltk

# NLTK's built-in list of English stop words
stop_words = set(nltk.corpus.stopwords.words('english'))

text = "this was the biggest disappointment of our trip. the restaurant had received some very good reviews, so our expectations were high. the service was slow even though the restaurant was not very full. I had the house salad which could have come out of any sizzler in the us. the keshi yena, although tasty reminded me of barbequed pulled chicken. this restaurant is very overrated"

# the Punkt tokenizer handles sentence splitting properly
sentence_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = sentence_detector.tokenize(text.strip())

results = []
for sentence in sentences:
    tokens = nltk.word_tokenize(sentence)
    # keep only alphanumeric tokens, lowercased
    words = [t.lower() for t in tokens if t.isalnum()]
    # drop the stop words
    not_stop_words = tuple(w for w in words if w not in stop_words)
    results.append(not_stop_words)

print(results)
However, note that this does not give the exact same output as listed in your question, but instead looks like this:
[('biggest', 'disappointment', 'trip'), ('restaurant', 'received', 'good', 'reviews', 'expectations', 'high'), ('service', 'slow', 'even', 'though', 'restaurant', 'full'), ('house', 'salad', 'could', 'come', 'sizzler', 'us'), ('keshi', 'yena', 'although', 'tasty', 'reminded', 'barbequed', 'pulled', 'chicken'), ('restaurant', 'overrated')]
You might need to add some stop words manually in your case if the output needs to look the same.
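For instance, a rough sketch of that manual addition, with the extra words taken from the difference between the two outputs (which words you actually add depends on what you want to keep):
extra_words = {'biggest', 'could', 'sizzler', 'keshi', 'yena', 'barbequed', 'chicken', 'overrated'}
# union of NLTK's English stop words and the manual additions
stop_words = set(nltk.corpus.stopwords.words('english')) | extra_words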

I'm not sure whether you want R for this or not, but based on your requirement, I think it can be done in a purely Pythonic way as well.
You basically want a list that contains a small list of important words (that are not stop words) per sentence.
So you can do something like this:
input_reviews = """
this was the biggest disappointment of our trip. the restaurant had received some very good reviews, so our expectations were high.
the service was slow even though the restaurant was not very full. I had the house salad which could have come out of any sizzler in the us.
the keshi yena, although tasty reminded me of barbequed pulled chicken. this restaurant is very overrated.
"""
# load your stop words list here
stop_words_list = ['this', 'was', 'the', 'of', 'our', 'biggest', 'had', 'some', 'very', 'so', 'were', 'not']
def main():
    sentences = input_reviews.split('.')
    sentence_list = []
    for sentence in sentences:
        inner_list = []
        words_in_sentence = sentence.split(' ')
        for word in words_in_sentence:
            stripped_word = str(word).lstrip('\n')
            if stripped_word and stripped_word not in stop_words_list:
                # this is a good word
                inner_list.append(stripped_word)
        if inner_list:
            sentence_list.append(inner_list)
    print(sentence_list)

if __name__ == '__main__':
    main()
On my end, this outputs
[['disappointment', 'trip'], ['restaurant', 'received', 'good', 'reviews,', 'expectations', 'high'], ['service', 'slow', 'even', 'though', 'restaurant', 'full'], ['I', 'house', 'salad', 'which', 'could', 'have', 'come', 'out', 'any', 'sizzler', 'in', 'us'], ['keshi', 'yena,', 'although', 'tasty', 'reminded', 'me', 'barbequed', 'pulled', 'chicken'], ['restaurant', 'is', 'overrated']]

This is one way to do it. You may need to initialize stop_words to suit your application. I have assumed stop_words is in lowercase; hence, lower() is used on the original text for comparison. sentences.lower().split('.') gives the sentences, and s.split() gives the list of words in each sentence.
stokens = [list(filter(lambda x: x not in stop_words, s.split())) for s in sentences.lower().split('.')]
You may wonder why we use filter and lambda. An alternative is the following, but it gives a flat list and hence is not suitable:
stokens = [word for s in sentences.lower().split('.') for word in s.split() if word not in stop_words]
filter is a functional programming construct. It helps us to process an entire list, in this case, via an anonymous function using the lambda syntax.
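For completeness, here is a small, self-contained run of that one-liner; the example text and the tiny stop_words set are made up purely for illustration (the trailing empty list comes from the final period):
stop_words = {'this', 'was', 'the', 'of', 'our', 'is', 'very'}
sentences = "This was the biggest disappointment of our trip. This restaurant is very overrated."

stokens = [list(filter(lambda x: x not in stop_words, s.split()))
           for s in sentences.lower().split('.')]
print(stokens)
# [['biggest', 'disappointment', 'trip'], ['restaurant', 'overrated'], []]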

Related

How to remove stop words and get lemmas in a pandas data frame using spacy?

I have a column of tokens in a pandas data frame in python. Something that looks like:
word_tokens
(the,cheeseburger,was,great)
(i,never,did,like,the,pizza,too,much)
(yellow,submarine,was,only,an,ok,song)
I want to get two more new columns in this dataframe using the spacy library. One column that contains each row's tokens with the stopwords removed, and the other one containing the lemmas from the second column. How could I do that?
You're right about making your text a spaCy type - you want to transform every tuple of tokens into a spaCy Doc. From there, it is best to use the attributes of the tokens to answer the questions of "is the token a stop word" (use token.is_stop) or "what is the lemma of this token" (use token.lemma_). My implementation is below; I altered your input data slightly to include some examples of plurals so you can see that the lemmatization works properly.
import spacy
import pandas as pd
nlp = spacy.load('en_core_web_sm')
texts = [('the', 'cheeseburger', 'was', 'great'),
         ('i', 'never', 'did', 'like', 'the', 'pizzas', 'too', 'much'),
         ('yellowed', 'submarines', 'was', 'only', 'an', 'ok', 'song')]
df = pd.DataFrame({'word_tokens': texts})
The initial DataFrame looks like this:
                                                     word_tokens
0                        ('the', 'cheeseburger', 'was', 'great')
1  ('i', 'never', 'did', 'like', 'the', 'pizzas', 'too', 'much')
2  ('yellowed', 'submarines', 'was', 'only', 'an', 'ok', 'song')
I define functions to perform the main tasks:
tuple of tokens -> spaCy Doc
spaCy Doc -> list of non-stop words
spaCy Doc -> list of non-stop, lemmatized words
def to_doc(words: tuple) -> spacy.tokens.Doc:
    # Create spaCy documents by joining the words into a string
    return nlp(' '.join(words))

def remove_stops(doc) -> list:
    # Filter out stop words by using the `token.is_stop` attribute
    return [token.text for token in doc if not token.is_stop]

def lemmatize(doc) -> list:
    # Take the `token.lemma_` of each non-stop word
    return [token.lemma_ for token in doc if not token.is_stop]
Applying these looks like:
# create documents for all tuples of tokens
docs = list(map(to_doc, df.word_tokens))
# apply removing stop words to all
df['removed_stops'] = list(map(remove_stops, docs))
# apply lemmatization to all
df['lemmatized'] = list(map(lemmatize, docs))
The output you get should look like this:
                                                     word_tokens                             removed_stops                             lemmatized
0                        ('the', 'cheeseburger', 'was', 'great')                 ['cheeseburger', 'great']              ['cheeseburger', 'great']
1  ('i', 'never', 'did', 'like', 'the', 'pizzas', 'too', 'much')                        ['like', 'pizzas']                      ['like', 'pizza']
2  ('yellowed', 'submarines', 'was', 'only', 'an', 'ok', 'song')  ['yellowed', 'submarines', 'ok', 'song']  ['yellow', 'submarine', 'ok', 'song']
Based on your use case, you may want to explore other attributes of spaCy's document object (https://spacy.io/api/doc). Particularly, take a look at doc.noun_chunks and doc.ents if you're trying to extract more meaning out of text.
It is also worth noting that if you plan on using this with a very large number of texts, you should consider nlp.pipe: https://spacy.io/usage/processing-pipelines. It processes your documents in batches instead of one by one, and could make your implementation more efficient.
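As a rough sketch only (this is not from the original answer), the map-based calls above could be rewritten with nlp.pipe along these lines, reusing the remove_stops and lemmatize functions:
# nlp.pipe takes an iterable of strings and processes them in batches
docs = list(nlp.pipe(' '.join(words) for words in df.word_tokens))
df['removed_stops'] = [remove_stops(doc) for doc in docs]
df['lemmatized'] = [lemmatize(doc) for doc in docs]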
If you are working with spacy, you should make your text a spacy type, so something like this:
import spacy

nlp = spacy.load("en_core_web_sm")
# topic_data is the DataFrame holding the 'word_tokens' column
text = topic_data['word_tokens'].values.tolist()
text = '.'.join(map(str, text))
text = nlp(text)
This makes it easier to work with. You can then tokenize the words like this
token_list = []
for token in text:
    token_list.append(token.text)
And remove stop words like so:
token_list= [word for word in token_list if not word in nlp.Defaults.stop_words]
I haven't figured out the lemmatization part yet, but this is a start till then.
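If it helps, a minimal sketch of that missing step, reusing the token.lemma_ attribute shown in the other answer (and filtering stop words on the Doc tokens directly rather than on the plain strings):
lemma_list = [token.lemma_ for token in text if not token.is_stop]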

How to check for words that are not immediately followed by a keyword, how about words not surrounded by the keyword?

I am trying to look for words that do not immediately come before 'the'.
I performed a positive look-behind to get the words that come after the keyword 'the' ((?<=the\W)). However, I am unable to capture 'people' and 'that', as the above logic does not apply to these cases.
I am unable to take care of the words that do not have the keyword 'the' before or after them (for example, 'that' and 'people' in the sentence).
p = re.compile(r'(?<=the\W)\w+')
m = p.findall('the part of the fair that attracts the most people is the fireworks')
print(m)
The current output I am getting is
'part', 'fair', 'most', 'fireworks'.
Edit:
Thank you for all the help below. Using the suggestions in the comments, I managed to update my code.
p = re.compile(r"\b(?!the)(\w+)(\W\w+\Wthe)?")
m = p.findall('the part of the fair that attracts the most people is the fireworks')
This brings me closer to the output I need to get.
Updated Output:
[('part', ' of the'), ('fair', ''),
('that', ' attracts the'), ('most', ''),
('people', ' is the'), ('fireworks', '')]
I just need the strings ('part', 'fair', 'that', 'most', 'people', 'fireworks').
Any advice?
I am trying to look for words that do not immediately come before 'the'.
Note that the code below does not use re.
words = 'the part of the fair that attracts the most people is the fireworks'
words_list = words.split()
words_not_before_the = []
for idx, w in enumerate(words_list):
    if idx < len(words_list)-1 and words_list[idx + 1] != 'the':
        words_not_before_the.append(w)
words_not_before_the.append(words_list[-1])
print(words_not_before_the)
output
['the', 'part', 'the', 'fair', 'that', 'the', 'most', 'people', 'the', 'fireworks']
using regex:
import re
m = re.sub(r'\b(\w+)\b the', 'the', 'the part of the fair that attracts the most people is the fireworks')
print([word for word in m.split(' ') if not word.isspace() and word])
output:
['the', 'part', 'the', 'fair', 'that', 'the', 'most', 'people', 'the', 'fireworks']
I am trying to look for words that do not immediately come before 'the'.
Try this:
import re
# The capture group (\w+) matches a word that is followed by another word, which is followed by the word "the"
p = re.compile(r'(\w+)\W\w+\Wthe')
m = p.findall('the part of the fair that attracts the most people is the fireworks')
print(m)
Output:
['part', 'that', 'people']
Try to spin it around: instead of finding the words that do not immediately come before 'the', eliminate all the occurrences that do come immediately before 'the' (along with 'the' itself)
import re
test = "the part of the fair that attracts the most people is the fireworks"
pattern = r"\s\w*\sthe|the\s"
print(re.sub(pattern, "", test))
output: part fair that most people fireworks
I have finally solved the question. Thank you all!
p = re.compile(r"\b(?!the)(\w+)(?:\W\w+\Wthe)?")
m = p.findall('the part of the fair that attracts the most people is the fireworks')
print(m)
Made the last group non-capturing by adding '?:' inside it.
Output:
['part', 'fair', 'that', 'most', 'people', 'fireworks']

Is it possible to get a list of words given the Lemma in Spacy?

I am trying to fix grammatical gender in French text and wanted to know if there is a way to get a list of all words for a certain lemma, and if it is possible to do a lookup in such a list?
Try:
import spacy
lemma_lookup = spacy.lang.en.LOOKUP
reverse_lemma_lookup = {}
for word, lemma in lemma_lookup.items():
    if not reverse_lemma_lookup.get(lemma):
        reverse_lemma_lookup[lemma] = [word]
    elif word not in reverse_lemma_lookup[lemma]:
        reverse_lemma_lookup[lemma].append(word)
reverse_lemma_lookup["be"]
["'m", 'am', 'are', 'arst', 'been', 'being', 'is', 'm', 'was', 'wass', 'were']
Change spacy.lang.en.LOOKUP to spacy.lang.fr.LOOKUP for French, I guess.

Remove non-select text in Python Script

I have a huge amount of text data to deal with (through Orange), but I need to clean it up somehow, which means I need to remove all useless words from every line. Here is the code I put in Python Script (in Orange).
for i in range(1):
    print(in_data[i])
The data is one word per column.
Running script:
['1', 'NSW', 'Worst service ever', '0', 'I've', 'experi', 'drop', 'calls', 'late', 'voicemail', 'messages', 'poor', 'batteri', 'life', 'and', 'bad', '3G', 'coverage.', 'Complain', 'to', 'the', 'call', 'centr', 'doe', 'noth', 'and', 'thei', 'refus', 'to', 'replac', 'my', 'phone', 'or', 'let', 'me', 'out', 'of', 'the', 'contract', 'I', 'just', 'signed.', 'Thei', 'deni', 'there', 'is', 'ani', 'Dropped calls']
I am planning to remove all useless words. For example, I want to keep only "Dropped calls" and "Complain" and remove all the rest. Based on this large amount of data, I need to use a for loop to clean each line. But what method can keep the words I want and remove all the rest?
If the order of words is not important, you could define a set of useful words and take a set intersection with the list of all words per line.
useful_words = set(['Complain', 'Dropped calls', 'lolcat'])
for i in range(x):  # x is the number of rows to process
    filtered_words = useful_words.intersection(set(in_data[i]))
    print(filtered_words)
(This is just a rough draft, which needs some form of text preprocessing and normalizations, but you get the idea)
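A possible sketch of that preprocessing, assuming each in_data row is a list of strings as shown in the question (lowercasing and stripping trailing punctuation are just one choice of normalization):
useful_words = {'complain', 'dropped calls'}
for row in in_data:
    # normalize both sides to lowercase and strip trailing punctuation before intersecting
    normalized = {str(word).lower().strip('.,') for word in row}
    print(useful_words.intersection(normalized))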
The following should be an efficient solution both in time and space
# generator to yield every word which is in the set to keep
def filter_gen(words, words_to_keep):
    for word in words:
        if word in words_to_keep:
            yield word

words_to_keep = set(("Bb", "Dd"))
words = ["Aa", "Bb", "Cc", "Dd"]
res = [word for word in filter_gen(words, words_to_keep)]
>>> res
['Bb', 'Dd']

Python title() with apostrophes

Is there a way to use .title() to get the correct output from a title with apostrophes? For example:
"john's school".title() --> "John'S School"
How would I get the correct title here, "John's School" ?
If your titles do not contain several whitespace characters in a row (which would be collapsed), you can use string.capwords() instead:
>>> import string
>>> string.capwords("john's school")
"John's School"
EDIT: As Chris Morgan rightfully says below, you can alleviate the whitespace collapsing issue by specifying " " in the sep argument:
>>> string.capwords("john's school", " ")
"John's School"
This is difficult in the general case, because some single apostrophes are legitimately followed by an uppercase character, such as Irish names starting with "O'". string.capwords() will work in many cases, but ignores anything in quotes. string.capwords("john's principal says,'no'") will not return the result you may be expecting.
>>> capwords("John's School")
"John's School"
>>> capwords("john's principal says,'no'")
"John's Principal Says,'no'"
>>> capwords("John O'brien's School")
"John O'brien's School"
A more annoying issue is that title itself does not produce the proper results. For example, in American English usage, articles and prepositions are generally not capitalized in titles or headlines (Chicago Manual of Style).
>>> capwords("John clears school of spiders")
'John Clears School Of Spiders'
>>> "John clears school of spiders".title()
'John Clears School Of Spiders'
You can easy_install the titlecase module that will be much more useful to you, and does what you like, without capwords's issues. There are still many edge cases, of course, but you'll get much further without worrying too much about a personally-written version.
>>> titlecase("John clears school of spiders")
'John Clears School of Spiders'
I think that can be tricky with title().
Let's try out something different:
def titlize(s):
    b = []
    for temp in s.split(' '):
        b.append(temp.capitalize())
    return ' '.join(b)

titlize("john's school")
# You get: John's School
Hope that helps.. !!
Although the other answers are helpful, and more concise, you may run into some problems with them, for example if there are newlines or tabs in your string. Also, hyphenated words (whether with regular or non-breaking hyphens) may be a problem in some instances, as well as words that begin with apostrophes. However, using regular expressions (with a function as the replacement argument), you can solve these problems:
import re
def title_capitalize(match):
    text = match.group()
    i = 0
    new_text = ""
    capitalized = False
    while i < len(text):
        if text[i] not in {"’", "'"} and capitalized == False:
            new_text += text[i].upper()
            capitalized = True
        else:
            new_text += text[i].lower()
        i += 1
    return new_text

def title(the_string):
    return re.sub(r"[\w'’‑-]+", title_capitalize, the_string)
s="here's an apostrophe es. this string has multiple spaces\nnew\n\nlines\nhyphenated words: and non-breaking   spaces, and a non‑breaking hyphen, as well as 'ords that begin with ’strophies; it\teven\thas\t\ttabs."
print(title(s))
Anyway, you can edit this to compensate for any further problems, such as backticks and what-have-you, if needed.
If you're of the opinion that title casing should keep words such as prepositions, conjunctions and articles lowercase unless they're at the beginning or end of the title, you can try something like this code (but there are a few ambiguous words that you'll have to figure out by context, such as 'when'):
import re
lowers={'this', 'upon', 'altogether', 'whereunto', 'across', 'between', 'and', 'if', 'as', 'over', 'above', 'afore', 'inside', 'like', 'besides', 'on', 'atop', 'about', 'toward', 'by', 'these', 'for', 'into', 'beforehand', 'unlike', 'until', 'in', 'aft', 'onto', 'to', 'vs', 'amid', 'towards', 'afterwards', 'notwithstanding', 'unto', 'while', 'next', 'including', 'thru', 'a', 'down', 'after', 'with', 'afterward', 'or', 'those', 'but', 'whereas', 'versus', 'without', 'off', 'among', 'because', 'some', 'against', 'before', 'around', 'of', 'under', 'that', 'except', 'at', 'beneath', 'out', 'amongst', 'the', 'from', 'per', 'mid', 'behind', 'along', 'outside', 'beyond', 'up', 'past', 'through', 'beside', 'below', 'during'}
def title_capitalize(match, use_lowers=True):
    text = match.group()
    lower = text.lower()
    if lower in lowers and use_lowers == True:
        return lower
    else:
        i = 0
        new_text = ""
        capitalized = False
        while i < len(text):
            if text[i] not in {"’", "'"} and capitalized == False:
                new_text += text[i].upper()
                capitalized = True
            else:
                new_text += text[i].lower()
            i += 1
        return new_text

def title(the_string):
    first = re.sub(r"[\w'’‑-]+", title_capitalize, the_string)
    return re.sub(r"(^[\w'’‑-]+)|([\w'’‑-]+$)", lambda match: title_capitalize(match, use_lowers=False), first)
IMHO, the best answer is #Frédéric's one. But if you already have your string separated into words, and you know how string.capwords is implemented, then you can avoid the unneeded joining step:
def capwords(s, sep=None):
    return (sep or ' ').join(
        x.capitalize() for x in s.split(sep)
    )
As a result, you can just do this:
# here my_words == ['word1', 'word2', ...]
s = ' '.join(word.capitalize() for word in my_words)
If you have to cater for dashes then use:
import string

" ".join(
    string.capwords(word, sep="-")
    for word in string.capwords(
        "john's school at bel-red"
    ).split()
)
# "John's School At Bel-Red"
