Convert list of string representations of sentences into vocabulary set - python

I have a list of string representations of sentences that looks something like this:
original_format = ["This is a question", "This is another question", "And one more too"]
I want to convert this list into a set of unique words in my corpus. Given the above list, the output would look something like this:
{'And', 'This', 'a', 'another', 'is', 'more', 'one', 'question', 'too'}
I've figured out a way to do this, but it takes a very long time to run. I am interested in a more efficient way of converting from one format to another (especially since my actual dataset contains >200k sentences).
FYI, what I'm doing right now is creating an empty set for the vocab and then looping through each sentence (split by spaces) and unioning with the vocab set. Using the original_format variable as defined above, it looks like this:
vocab = set()
for q in original_format:
    vocab = vocab.union(set(q.split(' ')))
Can you help me run this conversion more efficiently?

You can use itertools.chain.from_iterable with set. This builds the vocabulary in a single pass, without constructing an intermediate set for every sentence and re-unioning it each time.
from itertools import chain
original_format = ["This is a question", "This is another question", "And one more too"]
res = set(chain.from_iterable(i.split() for i in original_format))
print(res)
{'And', 'This', 'a', 'another', 'is', 'more', 'one', 'question', 'too'}
Or for a truly functional approach:
from itertools import chain
from operator import methodcaller
res = set(chain.from_iterable(map(methodcaller('split'), original_format)))

Using a simple set comprehension:
{j for i in original_format for j in i.split()}
Output:
{'too', 'is', 'This', 'And', 'question', 'another', 'more', 'one', 'a'}

Related

How to avoid Gensim Simple Preprocess to remove digits?

I am having some problems preprocessing some data with gensim.utils.simple_preprocess.
In short, I noticed that the simple_preprocess function removes the digits from my text, but I don't want that!
For instance, I have this code:
import gensim
from gensim.utils import simple_preprocess
my_text = ["I am doing activity number 1", "Instead, I am doing the number 2"]
def gen_words(texts):
    final = []
    for text in texts:
        new = gensim.utils.simple_preprocess(text, deacc=True, min_len=1)
        final.append(new)
    return final

solution = gen_words(my_text)
print(solution)
The output is the following:
[['i', 'am', 'doing', 'activity', 'number'], ['instead', 'i', 'am', 'doing', 'the', 'number']]
I would like instead to have this as a solution:
[['i', 'am', 'doing', 'activity', 'number', '1'], ['instead', 'i', 'am', 'doing', 'the', 'number', '2']]
How can I stop the digits from being removed? I have also tried setting min_len=0, but it still doesn't work.
The simple_preprocess() function is just one rather simple convenience option for tokenizing text from a string, into a list-of-tokens.
It's not especially well-tuned for any particular need, and it has no configurable option to retain tokens that don't match its hardcoded pattern (PAT_ALPHABETIC), which rules out tokens with leading digits.
Many projects will want to apply their own tokenization/preprocessing instead, better suited to their data and problem domain. If you need ideas for how to start, you can consult the actual source code for simple_preprocess() (and other functions it relies upon, like tokenize() and simple_tokenize()) that Gensim uses:
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/utils.py
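If you go that route, a minimal replacement tokenizer is easy to write yourself. The sketch below is an assumption about what you might want (lowercase tokens, digits kept), not Gensim's API:
import re

def tokenize_keep_digits(text):
    # keep runs of letters or digits, lowercased, so tokens like '1' survive
    return re.findall(r'[a-z0-9]+', text.lower())

my_text = ["I am doing activity number 1", "Instead, I am doing the number 2"]
print([tokenize_keep_digits(t) for t in my_text])
# [['i', 'am', 'doing', 'activity', 'number', '1'], ['instead', 'i', 'am', 'doing', 'the', 'number', '2']]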

How to create a 3rd list from a nested list based on items in another list

I have a list of some users
list_of_users=['#elonmusk', '#YouTube','#FortniteGame','#BillGates','#JeffBezos']
and a nested list of tweets, each split into words.
tweets_splitted_by_words=[['#MrBeastYT', '#BillGates', 'YOU’RE', 'THE', 'LAST', 'ONE', 'FINISH', 'THE', 'MISSION', '#TeamTrees'], ['#MrBeastYT', '#realDonaldTrump', 'do', 'something', 'useful', 'with', 'your', 'life', 'and', 'donate', 'to', '#TeamTrees'], ['Please', 'please', 'donate']]
I want to create a third list containing only those sub-lists of tweets_splitted_by_words that contain at least one of the users in list_of_users.
Output that I want:
output=[['#MrBeastYT', '#BillGates', 'YOU’RE', 'THE', 'LAST', 'ONE', 'FINISH', 'THE', 'MISSION', '#TeamTrees']]
I tried the following code but it didn't work out:
tweets_per_user_mentioned = []
giorgia = []
for r in range(len(tweets_splitted_by_words)):
    giorgia.append(r)
for _i in range(len(giorgia)):
    if _i in range(len(list_of_users)):
        tweets_per_user_mentioned.append(tweets_splitted_by_words[r])
    else:
        pass
print(tweets_per_user_mentioned)
Since you will be performing lookups on the list of users, it is a good idea to use a set data structure. Sets provide O(1) average-case lookups, which greatly reduces the time complexity of many problems.
For filtering, I'd just use python's built-in any and a list comprehension
set_of_users = set(list_of_users)
filtered_tweets = [tweet for tweet in tweets_splitted_by_words
                   if any(word in set_of_users for word in tweet)]
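With the sample data from the question, filtered_tweets then contains only the first sub-list (a quick sanity check, assuming the variables are defined exactly as above):
print(filtered_tweets)
# [['#MrBeastYT', '#BillGates', 'YOU’RE', 'THE', 'LAST', 'ONE', 'FINISH', 'THE', 'MISSION', '#TeamTrees']]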

Remove all special characters and numbers and stop words

I have a list of issues like the one below, and I would like to remove all special characters and numbers from it. I would also like to do tokenization and stop-word removal on this list:
issue=[[hi iam !#going $%^ to uk&*(us \\r\\ntomorrow {morning} by
the_way two-three!~`` [problems]:are there;]
[happy"journey" (and) \\r\\n\\rbring 576 chachos?>]]
I have tried the code below but I am not getting the desired output:
import re
ab=re.sub('[^A-Za-z0-9]+', '', issue)
bc=re.split(r's, ab)
I would like to see output like below:
issue_output=[['hi','going','uk','us','tomorrow','morning',
'way','two','three','problems' ]
[ 'happy','journey','bring','chachos']]
There are two glaring problems with the code you have posted. The first is that your input list issue is not valid Python, which makes it impossible to parse; depending on how you actually want it formatted, the answer to your question might change. The second is that you are calling re.sub on a list, when you want to do the substitution on the list's elements. You can use a list comprehension for that:
issue_output = [re.sub(r'[^A-Za-z0-9]+', ' ', item) for item in issue]
Since there is no valid Python list provided in the question, I will assume the values in the list based on my best guess.
issue = [
    ['hi iam !#going $%^ to uk&*(us \\r\\ntomorrow {morning} by the_way two-three!~`` [problems]:are there;'],
    ['happy"journey" (and) \\r\\n\\rbring 576 chachos?>']
]
In this case, when you have a list of lists of strings, you need to adjust the list comprehension for that.
cleaned_issue = [[re.sub(r'[^A-Za-z0-9]+', ' ', item) for item in inner_list] for inner_list in issue]
This returns a list of lists with strings inside:
[['hi iam going to uk us r ntomorrow morning by the way two three problems are there '], ['happy journey and r n rbring 576 chachos ']]
If you want to have the separate words in that list, simply split() them after substitution.
tokenized_issue = [[re.sub(r'[^A-Za-z0-9]+', ' ', item).split() for item in inner_list][0] for inner_list in issue]
This gives the result of:
[['hi', 'iam', 'going', 'to', 'uk', 'us', 'r', 'ntomorrow', 'morning', 'by', 'the', 'way', 'two', 'three', 'problems', 'are', 'there'], ['happy', 'journey', 'and', 'r', 'n', 'rbring', '576', 'chachos']]
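The question also asks for stop-word removal, which the substitution alone does not handle. One hedged way to bolt that on, assuming NLTK's English stop-word list is acceptable (it needs a one-time nltk.download('stopwords')) and the issue list guessed above, is to filter the tokens afterwards:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
tokenized_issue = [
    [word for word in re.sub(r'[^A-Za-z0-9]+', ' ', inner_list[0]).split()
     if word.lower() not in stop_words]
    for inner_list in issue
]
print(tokenized_issue)
# roughly: [['hi', 'iam', 'going', 'uk', 'us', 'r', 'ntomorrow', 'morning', 'way', 'two', 'three', 'problems'], ['happy', 'journey', 'r', 'n', 'rbring', '576', 'chachos']]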

List of lists of words in Python

I have a long list of comments (say, 50 of them), each one something like this:
"this was the biggest disappointment of our trip. the restaurant had
received some very good reviews, so our expectations were high. the
service was slow even though the restaurant was not very full. I had
the house salad which could have come out of any sizzler in the us.
the keshi yena, although tasty reminded me of barbequed pulled
chicken. this restaurant is very overrated".
I want to create a list of lists of words, retaining the sentence tokenization, using Python.
After removing stopwords, I want a result for all 50 comments in which the sentence tokens are retained and the word tokens are kept inside each tokenized sentence. In the end I hope the result looks similar to:
list(c("disappointment", "trip"),
c("restaurant", "received", "good", "reviews", "expectations", "high"),
c("service", "slow", "even", "though", "restaurant", "full"),
c("house", "salad", "come", "us"),
c("although", "tasty", "reminded", "pulled"),
"restaurant")
How could I do that in Python? Is R a good option in this case? I would really appreciate your help.
If you do not want to create a list of stop words by hand, I would recommend that you use the nltk library in Python. It also handles sentence splitting (as opposed to splitting on every period). A sample that parses your text might look like this:
import nltk
stop_words = set(nltk.corpus.stopwords.words('english'))
text = "this was the biggest disappointment of our trip. the restaurant had received some very good reviews, so our expectations were high. the service was slow even though the restaurant was not very full. I had the house salad which could have come out of any sizzler in the us. the keshi yena, although tasty reminded me of barbequed pulled chicken. this restaurant is very overrated"
sentence_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = sentence_detector.tokenize(text.strip())
results = []
for sentence in sentences:
    tokens = nltk.word_tokenize(sentence)
    words = [t.lower() for t in tokens if t.isalnum()]
    not_stop_words = tuple([w for w in words if w not in stop_words])
    results.append(not_stop_words)
print(results)
However, note that this does not give the exact same output as listed in your question, but instead looks like this:
[('biggest', 'disappointment', 'trip'), ('restaurant', 'received', 'good', 'reviews', 'expectations', 'high'), ('service', 'slow', 'even', 'though', 'restaurant', 'full'), ('house', 'salad', 'could', 'come', 'sizzler', 'us'), ('keshi', 'yena', 'although', 'tasty', 'reminded', 'barbequed', 'pulled', 'chicken'), ('restaurant', 'overrated')]
You might need to add some stop words manually in your case if the output needs to look the same.
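For example, one way to push the output closer to the lists in the question is to extend the set before the loop; which extra words to treat as stop words is entirely up to your application (the ones below are only illustrative):
# hypothetical extras chosen only to mimic the desired output above
stop_words.update({'biggest', 'could', 'sizzler', 'overrated'})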
Not sure whether you want R for this or not, but based on your requirements, I think it can be done in a purely Pythonic way as well.
You basically want a list that contains small list of important words (that are not stop words) per sentence.
So you can do something like
input_reviews = """
this was the biggest disappointment of our trip. the restaurant had received some very good reviews, so our expectations were high.
the service was slow even though the restaurant was not very full. I had the house salad which could have come out of any sizzler in the us.
the keshi yena, although tasty reminded me of barbequed pulled chicken. this restaurant is very overrated.
"""
# load your stop words list here
stop_words_list = ['this', 'was', 'the', 'of', 'our', 'biggest', 'had', 'some', 'very', 'so', 'were', 'not']
def main():
    sentences = input_reviews.split('.')
    sentence_list = []
    for sentence in sentences:
        inner_list = []
        words_in_sentence = sentence.split(' ')
        for word in words_in_sentence:
            stripped_word = str(word).lstrip('\n')
            if stripped_word and stripped_word not in stop_words_list:
                # this is a good word
                inner_list.append(stripped_word)
        if inner_list:
            sentence_list.append(inner_list)
    print(sentence_list)

if __name__ == '__main__':
    main()
On my end, this outputs
[['disappointment', 'trip'], ['restaurant', 'received', 'good', 'reviews,', 'expectations', 'high'], ['service', 'slow', 'even', 'though', 'restaurant', 'full'], ['I', 'house', 'salad', 'which', 'could', 'have', 'come', 'out', 'any', 'sizzler', 'in', 'us'], ['keshi', 'yena,', 'although', 'tasty', 'reminded', 'me', 'barbequed', 'pulled', 'chicken'], ['restaurant', 'is', 'overrated']]
This is one way to do it. You may need to initialize stop_words to suit your application. I have assumed stop_words is in lowercase, hence the use of lower() on the original text for comparison. sentences.lower().split('.') gives the sentences, and s.split() gives the list of words in each sentence.
stokens = [list(filter(lambda x: x not in stop_words, s.split())) for s in sentences.lower().split('.')]
You may wonder why we use filter and lambda. An alternative is the following, but it gives a flat list and hence is not suitable:
stokens = [word for s in sentences.lower().split('.') for word in s.split() if word not in stop_words]
filter is a functional programming construct. It helps us to process an entire list, in this case, via an anonymous function using the lambda syntax.
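As a self-contained illustration (the text and the small stop-word set here are assumptions, not the full lists from the question):
# sample input; in practice load your own text and stop words
sentences = ("this was the biggest disappointment of our trip. "
             "the restaurant had received some very good reviews, so our expectations were high.")
stop_words = {'this', 'was', 'the', 'of', 'our', 'had', 'some', 'very', 'so', 'were'}
stokens = [list(filter(lambda x: x not in stop_words, s.split())) for s in sentences.lower().split('.')]
print(stokens)
# [['biggest', 'disappointment', 'trip'], ['restaurant', 'received', 'good', 'reviews,', 'expectations', 'high'], []]
Note the trailing empty list, which comes from the final period; filter it out if it bothers you.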

How do I put single word as an array to a list in Python?

I need to make a list of words like this in Python:
list_of_words = ['saffron'], ['aloha'], ['leave'], ['cola'], ['packing']
by choosing some random words from another list, word_bank = ['cola', 'home', 'undone', 'some', 'good', ...], until, let's say, len(list_of_words) == 15.
I have never used that before. What is it called?
Where should I search for it?
How do I obtain such a list?
Maybe this is what you are looking for:
import random
word_bank = ['cola', 'home', 'undone', 'some', 'good']
tuple([[x] for x in random.sample(word_bank, 5)])
Possible output:
(['cola'], ['some'], ['good'], ['undone'], ['home'])
Here is my solution to what I believe you are asking:
import random
list_of_words = []
word_bank = ['cola', 'home', 'undone', 'some', 'good']
while len(list_of_words) < 15:
    list_of_words.append(random.choice(word_bank))
This creates an empty list and then appends random choices from word_bank to it until it holds 15 items. I wasn't sure exactly why you wanted a list of lists, but this puts the result into a format similar to word_bank.
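If you really do want each word wrapped in its own single-element list, as in your list_of_words example, a small variation on the loop above does it in one line (still just a sketch):
# 15 single-element lists drawn at random (duplicates possible)
list_of_words = [[random.choice(word_bank)] for _ in range(15)]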