Python list comprehension on two lists

I am stuck with a list comprehension in Python.
I have the following data structure:
dataset = [sentence1, sentence2, ...]
sentence = [word1, word2, ...]
In addition, I have a list of special words:
special_words = [special_word1, special_word2, special_word3, ...]
I want to iterate over all the special words in special_words and, for each one, collect all the words that occur in the same sentences as that special word.
As a result I expect
data = [special_word1_list, special_word2_list, ...]
where special_word1_list = [word1, word2, ...], meaning that word1, word2, ... appeared in sentences together with special_word1.
I tried many different ways to construct the list comprehension, unfortunately without any success.
I would appreciate any help; also, if you know a good article about list comprehensions, please post it here.

data = [
    {
        word
        for sentence in dataset
        if special_word in sentence
        for word in sentence
    }
    for special_word in special_words
]

I think you want:
data = [sentence for sentence in dataset
        if any(word in special_words
               for word in sentence)]

Alternatively, I'd suggest creating a dictionary mapping special words to sets of other words occurring in the same sentence as the respective special word:
{sw : {w for ds in dataset if sw in ds for w in ds if w != sw}
for sw in special_words}
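For example, with a tiny made-up dataset (hypothetical values, just to illustrate the shape of the result):
dataset = [["the", "cat", "sat"], ["the", "dog", "ran"], ["a", "cat", "ran"]]
special_words = ["cat", "dog"]

data = {sw: {w for ds in dataset if sw in ds for w in ds if w != sw}
        for sw in special_words}
print(data)
# e.g. {'cat': {'a', 'ran', 'sat', 'the'}, 'dog': {'ran', 'the'}}  (set order may vary)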

Related

same words in different contexts with Word2Vec

I want to use Word2Vec to represent words by vectors.
If there are 2 identical words in the Word2Vec's input,
is it possible to get a different representation for them?
Are there different methods to solve this issue?
First of all, great question.
I found a tutorial that explains how to avoid duplicate words when building word-vector representations in Python:
https://towardsdatascience.com/learn-word2vec-by-implementing-it-in-tensorflow-45641adaf2ac
You can split the text and avoid word duplication in the following way:
corpus_raw = 'He is the king . The king is royal . She is the royal queen '

# convert to lower case
corpus_raw = corpus_raw.lower()

# ---- dictionaries: word -> int and int -> word ----
words = []
for word in corpus_raw.split():
    if word != '.':  # because we don't want to treat . as a word
        words.append(word)

words = set(words)  # so that all duplicate words are removed

word2int = {}
int2word = {}
vocab_size = len(words)  # gives the total number of unique words

for i, word in enumerate(words):
    word2int[word] = i
    int2word[i] = word

print(words)
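As a follow-up sketch (along the lines of the linked tutorial, with a hypothetical to_one_hot helper), these mappings are typically used to turn each word into an integer id or a one-hot vector before training:
import numpy as np

def to_one_hot(index, vocab_size):
    # hypothetical helper: one-hot vector with a 1 at the word's index
    vec = np.zeros(vocab_size)
    vec[index] = 1
    return vec

print(word2int['king'])                           # integer id assigned to 'king'
print(to_one_hot(word2int['queen'], vocab_size))  # a length-7 vector with a single 1 (position depends on the id)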

Code to create a reliable Language model from my own corpus

I have a corpus of sentences in a specific domain.
I am looking for open-source code or a package that I can feed my data to and that will generate a good, reliable language model (meaning: given a context, it knows the probability of each possible next word).
Is there such a code/project?
I saw this github repo: https://github.com/rafaljozefowicz/lm, but it didn't work.
I recommend writing your own basic implementation. First, we need some sentences:
import nltk
from nltk.corpus import brown
words = brown.words()
total_words = len(words)
sentences = list(brown.sents())
sentences is now a list of lists. Each sublist represents a sentence with each word as an element. Now you need to decide whether or not you want to include punctuation in your model. If you want to remove it, try something like the following:
punctuation = [",", ".", ":", ";", "!", "?"]
for i, sentence in enumerate(sentences.copy()):
    new_sentence = [word for word in sentence if word not in punctuation]
    sentences[i] = new_sentence
Next, you need to decide whether or not you care about capitalization. If you don't care about it, you could remove it like so:
for i, sentence in enumerate(sentences.copy()):
    new_sentence = list()
    for j, word in enumerate(sentence.copy()):
        new_word = word.lower()  # lower-case all characters in the word
        new_sentence.append(new_word)
    sentences[i] = new_sentence
Next, we need special start and end words to represent words that are valid at the beginning and end of sentences. You should pick start and end words that don't exist in your training data.
start = ["<<START>>"]
end = ["<<END>>"]
for i, sentence in enumerate(sentences.copy()):
    new_sentence = start + sentence + end
    sentences[i] = new_sentence
Now, let's count unigrams. A unigram is a sequence of one word in a sentence. Yes, a unigram model is just a frequency distribution of each word in the corpus:
new_words = list()
for sentence in sentences:
    for word in sentence:
        new_words.append(word)

unigram_fdist = nltk.FreqDist(new_words)
And now it's time to count bigrams. A bigram is a sequence of two words in a sentence. So, for the sentence "i am the walrus", we have the following bigrams: "<<START>> i", "i am", "am the", "the walrus", and "walrus <<END>>".
bigrams = list()
for sentence in sentences:
    new_bigrams = nltk.bigrams(sentence)
    bigrams += new_bigrams
Now we can create a frequency distribution:
bigram_fdist = nltk.ConditionalFreqDist(bigrams)
Finally, we want to know the probability of each word in the model:
def getUnigramProbability(word):
    if word in unigram_fdist:
        return unigram_fdist[word] / total_words
    else:
        return -1  # You should figure out how you want to handle out-of-vocabulary words

def getBigramProbability(word1, word2):
    if word1 not in bigram_fdist:
        return -1  # You should figure out how you want to handle out-of-vocabulary words
    elif word2 not in bigram_fdist[word1]:
        # i.e. "word1 word2" never occurs in the corpus
        return getUnigramProbability(word2)
    else:
        bigram_frequency = bigram_fdist[word1][word2]
        unigram_frequency = unigram_fdist[word1]
        bigram_probability = bigram_frequency / unigram_frequency
        return bigram_probability
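For example, a quick sanity check (the exact numbers depend on the Brown corpus and the preprocessing choices above):
print(getUnigramProbability("the"))              # relative frequency of "the"
print(getBigramProbability("<<START>>", "the"))  # how often a sentence starts with "the"
print(getBigramProbability("the", "xyzzy"))      # unseen bigram: falls back to the unigram probability (-1 here, out of vocabulary)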
While this isn't a framework/library that just builds the model for you, I hope seeing this code has demystified what goes on in a language model.
You might try word_language_model from the PyTorch examples. There might just be an issue if you have a big corpus, though: they load all the data into memory.

parsing emails to identify keywords

I'm looking to parse through a list of email text to identify keywords. Let's say I have the following list:
sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['just more words to be honest, not sure what to write']]
I want to check, using regex, whether words from a keywords list appear in any of these sentences. I wouldn't want 'informations' to be captured, only 'information':
keywords = ['information', 'boxes', 'porcupine']
I was trying to do something like:
['words' in words for [word for word in [sentence for sentence in sentences]]
or
for sentence in sentences:
    sentence.split(' ')
Ultimately, I would like to filter the current list down to only the elements that contain the keywords I've specified.
keywords = ['information', 'boxes']
sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['just more words to be honest, not sure what to write']]
output: [False, True, False]
or ultimately:
parsed_list = [['more information in this one']]
Here is a one-liner to solve your problem. I find the lambda syntax easier to read than nested list comprehensions.
keywords = ['information', 'boxes']
sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['just more words to be honest, not sure what to write']]

results_lambda = list(
    filter(lambda sentence: any(word in sentence[0] for word in keywords), sentences))

print(results_lambda)
[['more information in this one']]
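For comparison, the same filter written as a list comprehension (using the keywords and sentences defined above) would be:
results_comprehension = [sentence for sentence in sentences
                         if any(word in sentence[0] for word in keywords)]
print(results_comprehension)  # [['more information in this one']]
Note that both versions do substring matching, so a word like 'informations' would also match 'information'; see the regex-based answer further down for whole-word matching.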
This can be done with a quick list comprehension!
lists = [['here is one sentence'], ['and here is another'], ['let us filter!'], ['more than one word filter']]
keywords = ['filter', 'one']  # words to filter on
result = [x for x in lists if any(keyword in x[0] for keyword in keywords)]
print(result)
result:
[['here is one sentence'], ['let us filter!'], ['more than one word filter']]
hope this helps!
Do you want to find sentences which have all the words in your keywords list?
If so, then you could use a set of those keywords and filter each sentence based on whether all of the keywords are present in it.
One way is:
keyword_set = set(keywords)
n = len(keyword_set)  # number of keywords

def allKeywdsPresent(sentence):
    # the intersection of both sets should equal the keyword set
    return len(set(sentence.split(" ")) & keyword_set) == n

filtered = [sentence for sentence in sentences if allKeywdsPresent(sentence[0])]
# filtered is the final list of sentences which satisfy your condition
# if you want a list of booleans:
boolean_array = [allKeywdsPresent(sentence[0]) for sentence in sentences]
There could be more optimal ways to do this (e.g. the set created for each sentence in allKeywdsPresent could be replaced with a single pass over all elements, etc.), but this is a start.
Also, understand that using a set means duplicates in your keyword list will be eliminated. So, if you have a list of keywords with some duplicates, then use a dict instead of the set to keep a count of each keyword and reuse above logic.
From your example, it seems enough to have at least one keyword match. Then you need to modify allKeywdsPresent() (maybe rename it to anyKeywdsPresent):
def allKeywdsPresent(sentence):
    return any(word in keyword_set for word in sentence.split())
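With the example data from the question (keywords = ['information', 'boxes']), a quick check of this any-variant:
boolean_array = [allKeywdsPresent(sentence[0]) for sentence in sentences]
print(boolean_array)  # [False, True, False]

parsed_list = [sentence for sentence in sentences if allKeywdsPresent(sentence[0])]
print(parsed_list)    # [['more information in this one']]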
If you want to match only whole words and not just substrings, you'll have to account for all word separators (whitespace, punctuation, etc.) and first split your sentences into words, then match them against your keywords. The easiest, although not fool-proof, way is to just use the regex \W (non-word character) class and split your sentence on such occurrences.
Once you have the list of words in your text and the list of keywords to match, the easiest, and probably most performant, way to see if there is a match is set intersection between the two. So:
import re

# not sure why you have the sentences in single-element lists, but if you insist...
sentences = [['this is a paragraph there should be lots more words here'],
             ['more information in this one'],
             ['just more disinformation, to make sure we have no partial matches']]
keywords = {'information', 'boxes', 'porcupine'}  # note we're using a set here!

WORD = re.compile(r"\W+")  # a simple regex to split sentences into words

# finally, iterate over each sentence, split it into words and check for intersection
result = [s for s in sentences if set(WORD.split(s[0].lower())) & keywords]
# [['more information in this one']]
So, how does it work - simple, we iterate over each of the sentences (and lowercase them for a good measure of case-insensitivity), then we split the sentence into words with the aforementioned regex. This means that, for example, the first sentence will split into:
['this', 'is', 'a', 'paragraph', 'there', 'should', 'be', 'lots', 'more', 'words', 'here']
We then convert it into a set for blazing-fast comparisons (a set is hash-based, and intersections based on hashes are extremely fast) and, as a bonus, this also gets rid of duplicate words.
Finally, we do the set intersection against our keywords - if anything is returned, the two sets have at least one word in common, which means the filter condition evaluates to True and, in that case, the current sentence gets added to the result.
Final note - beware that while \W+ might be enough to split sentences into words (certainly better than a whitespace-only split), it's far from perfect and not really suitable for all languages. If you're serious about word processing, take a look at some of the NLP modules available for Python, such as nltk.
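If you do go down the NLP route, a minimal sketch of the same idea using NLTK's word_tokenize instead of the \W+ split (assuming the punkt tokenizer data has been downloaded) might look like this:
import nltk
# nltk.download('punkt')  # one-time download of the tokenizer data

sentences = [['more information in this one'],
             ['just more disinformation, to make sure we have no partial matches']]
keywords = {'information', 'boxes', 'porcupine'}

result = [s for s in sentences
          if set(nltk.word_tokenize(s[0].lower())) & keywords]
print(result)  # [['more information in this one']]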

Removing stopwords from a list of text files

I have a list of processed text files that looks somewhat like this:
text = 'this is the first text document " this is the second text document " this is the third document '
I've been able to successfully tokenize the sentences:
from nltk.tokenize import sent_tokenize, word_tokenize

sentences = sent_tokenize(text)
for ii, sentence in enumerate(sentences):
    sentences[ii] = remove_punctuation(sentence)  # remove_punctuation: my own helper, defined elsewhere

sentence_tokens = [word_tokenize(sentence) for sentence in sentences]
And now I would like to remove stopwords from this list of tokens. However, because it's a list of sentences within a list of text documents, I can't seem to figure out how to do this.
This is what I've tried so far, but it returns no results:
sentence_tokens_no_stopwords = [w for w in sentence_tokens if w not in stopwords]
I'm assuming achieving this will require some sort of for loop, but what I have now isn't working. Any help would be appreciated!
You can use a nested list comprehension, like this:
sentence_tokens_no_stopwords = [[w for w in s if w not in stopwords] for s in sentence_tokens]
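Putting it together with NLTK's English stopword list (a minimal sketch; the stopwords corpus has to be downloaded once):
import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords')  # uncomment on the first run

stop_words = set(stopwords.words('english'))
sentence_tokens = [['this', 'is', 'the', 'first', 'text', 'document'],
                   ['this', 'is', 'the', 'second', 'text', 'document']]

sentence_tokens_no_stopwords = [[w for w in s if w not in stop_words]
                                for s in sentence_tokens]
print(sentence_tokens_no_stopwords)
# [['first', 'text', 'document'], ['second', 'text', 'document']]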

is it possible to create an if statement dynamically using list elements and OR?

I'm trying to adapt the code I wrote below to work with a dynamic list of required values rather than with a string, as it works at present:
required_word = "duck"
sentences = [["the", "quick", "brown", "fox", "jump", "over", "lazy", "dog"],
             ["Hello", "duck"]]

sentences_not_containing_required_words = []
for sentence in sentences:
    if required_word not in sentence:
        sentences_not_containing_required_words.append(sentence)

print(sentences_not_containing_required_words)
Say, for example, I had two required words (only one of which is actually required); I could do this:
required_words = ["dog", "fox"]
sentences = [["the", "quick", "brown", "fox", "jump", "over", "lazy", "dog"],
             ["Hello", "duck"]]

sentences_not_containing_required_words = []
for sentence in sentences:
    if (required_words[0] not in sentence) or (required_words[1] not in sentence):
        sentences_not_containing_required_words.append(sentence)

print(sentences_not_containing_required_words)
>>> [['Hello', 'duck']]
However, what I need is a method that can deal with a list that will vary in size (number of items), and that satisfies the if statement if any of the list's items are not in the list named 'sentence'. Being quite new to Python, I'm stumped and don't know how to phrase the question any better. Do I need to come up with a different approach?
Thanks in advance!
(Note that the real code will do something more complicated than printing sentences_not_containing_required_words.)
You can construct this list pretty easily with a combination of a list comprehension and the any() built-in function:
non_matches = [s for s in sentences if not any(w in s for w in required_words)]
This will iterate over the list sentences while constructing a new list, and only include sentences where none of the words from required_words are present.
If you are going to end up with longer lists of sentences, you may consider using a generator expression instead to minimize memory footprint:
non_matches = (s for s in sentences if not any(w in s for w in required_words))

for s in non_matches:
    ...  # do stuff with each non-matching sentence
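With the data from the question, a quick check (same output the original loop produced):
required_words = ["dog", "fox"]
sentences = [["the", "quick", "brown", "fox", "jump", "over", "lazy", "dog"],
             ["Hello", "duck"]]

non_matches = [s for s in sentences if not any(w in s for w in required_words)]
print(non_matches)  # [['Hello', 'duck']]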
