I have the following text and want to isolate a part of the sentence related to a keyword, in this case keywords = ['pizza', 'chips'].
text = "The pizza is great but the chips aren't the best"
Expected Output:
{'pizza': 'The pizza is great'}
{'chips': "the chips aren't the best"}
I have tried using the spaCy DependencyMatcher, but admittedly I'm not quite sure how it works. I tried the following pattern for 'chips', which yields no matches.
import spacy
from spacy.matcher import DependencyMatcher
nlp = spacy.load("en_core_web_sm")
pattern = [
    {
        "RIGHT_ID": "chips_id",
        "RIGHT_ATTRS": {"ORTH": "chips"}
    },
    {
        "LEFT_ID": "chips_id",
        "REL_OP": "<<",
        "RIGHT_ID": "other_words",
        "RIGHT_ATTRS": {"POS": '*'}
    }
]
matcher = DependencyMatcher(nlp.vocab)
matcher.add("chips", [pattern])
doc = nlp("The pizza is great but the chips aren't the best")
for id_, (_, other_words) in matcher(doc):
    print(doc[other_words])
Edit:
Additional example sentences:
example_sentences = [
    "The pizza's are just OK, the chips is stiff and the service mediocre",
    "Then the mains came and the pizza - these we're really average - chips had loads of oil and was poor",
    "Nice pizza freshly made to order food is priced well, but chips are not so keenly priced.",
    "The pizzas and chips taste really good and the Tango Ice Blast was refreshing"
]
Here is my attempt at a very limited solution to your problem, since I do not know how extensive you will want this to be.
I utilized code from this answer in order to address the problem.
import spacy
import re
en = spacy.load('en_core_web_sm')
text = "The pizza is great but the chips aren't the best"
doc = en(text)
seen = set() # keep track of covered words
chunks = []
for sent in doc.sents:
    heads = [cc for cc in sent.root.children if cc.dep_ == 'conj']
    for head in heads:
        words = [ww for ww in head.subtree]
        for word in words:
            seen.add(word)
        chunk = ' '.join([ww.text for ww in words])
        chunks.append((head.i, chunk))
    unseen = [ww for ww in sent if ww not in seen]
    chunk = ' '.join([ww.text for ww in unseen])
    chunks.append((sent.root.i, chunk))
chunks = sorted(chunks, key=lambda x: x[0])
output_dict = {}
for np in doc.noun_chunks:
    insensitive_the = re.compile(re.escape('the '), re.IGNORECASE)
    new_np = insensitive_the.sub('', np.text)
    output_dict[new_np] = ''
for ii, chunk in chunks:
    #print(ii, chunk)
    for key in output_dict:
        if key in chunk:
            output_dict[key] = chunk
print(output_dict)
The output I get is not quite right; I am aware of a few problems:
The conjunction 'but' should not be in the value of the 'pizza' key.
The tokens "are n't" should be joined back into "aren't" in the second value of the dictionary.
However, I believe we can fix this if we know more about what sort of sentences you are dealing with. For instance, we might have a list of conjunctions that we can strip from all the values of the dict if the sentences are simple enough.
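For instance, here is a rough sketch of that conjunction-stripping idea (my assumption: a small hand-picked conjunction list, and trimming only the edges of each value is enough for sentences like yours):

conjunctions = {'but', 'and', 'or'}

for key, value in output_dict.items():
    words = value.split()
    # Drop leading/trailing conjunctions left over from the chunking step.
    while words and words[0].lower() in conjunctions:
        words.pop(0)
    while words and words[-1].lower() in conjunctions:
        words.pop()
    output_dict[key] = ' '.join(words)
print(output_dict)

On the original sentence this would turn 'The pizza is great but' into 'The pizza is great'.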
Update with the example sentences:
Running the same code on them, I think spaCy struggles a bit with the punctuation, as well as with knowing that you presumably only want food items as the nouns in the dictionary.
You could use the following function:
def spliter(text: str, keyword: list, number_of_words: int):
    L = text.split()
    sentences = dict()
    for k in L:
        if k in keyword:
            n = L.index(k)
            if len(L) - n - 1 > number_of_words:
                sentences.update({k: ' '.join(L[n: n + number_of_words])})
            else:
                sentences.update({k: ' '.join(L[n:])})
    return sentences
Note: number_of_words defines how many words you want to keep, starting from (and including) the desired keyword.
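For example, with the text and keywords from the question, the call would look like this (my example, not part of the original answer):

text = "The pizza is great but the chips aren't the best"
print(spliter(text, ['pizza', 'chips'], 3))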
Output: for number_of_words = 3 you get:
{'pizza': 'pizza is great', 'chips': "chips aren't the best"}
Related
I am new to spaCy and to NLP overall.
To understand how spaCy works, I would like to create a function which takes a sentence and returns a dictionary, tuple, or list with the noun and the words describing it.
I know that spaCy creates a tree of the sentence and knows the use of each word (as shown in displaCy).
But what's the right way to get from:
"A large room with two yellow dishwashers in it"
To:
{noun:"room",adj:"large"}
{noun:"dishwasher",adj:"yellow",adv:"two"}
Or any other solution that gives me all related words in a usable bundle.
Thanks in advance!
This is a very straightforward use of the DependencyMatcher.
import spacy
from spacy.matcher import DependencyMatcher
nlp = spacy.load("en_core_web_sm")
pattern = [
    {
        "RIGHT_ID": "target",
        "RIGHT_ATTRS": {"POS": "NOUN"}
    },
    # noun -> adjectival or numeric modifier
    {
        "LEFT_ID": "target",
        "REL_OP": ">",
        "RIGHT_ID": "modifier",
        "RIGHT_ATTRS": {"DEP": {"IN": ["amod", "nummod"]}}
    },
]
matcher = DependencyMatcher(nlp.vocab)
matcher.add("FOUNDED", [pattern])
text = "A large room with two yellow dishwashers in it"
doc = nlp(text)
for match_id, (target, modifier) in matcher(doc):
    print(doc[modifier], doc[target], sep="\t")
Output:
large room
two dishwashers
yellow dishwashers
It should be easy to turn that into a dictionary or whatever you'd like. You might also want to modify it to take proper nouns as the target, or to support other kinds of dependency relations, but this should be a good start.
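For example, here is a rough sketch (my addition, not from the answer above) of collecting the matches into one dictionary per noun; note that spaCy tags "two" as NUM rather than ADV:

from collections import defaultdict

grouped = defaultdict(dict)
for match_id, (target, modifier) in matcher(doc):
    noun, mod = doc[target], doc[modifier]
    grouped[noun.i]["noun"] = noun.text
    # e.g. "adj" for large/yellow, "num" for two
    grouped[noun.i][mod.pos_.lower()] = mod.text
print(list(grouped.values()))
# e.g. [{'noun': 'room', 'adj': 'large'}, {'noun': 'dishwashers', 'num': 'two', 'adj': 'yellow'}]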
You may also want to look at the noun chunks feature.
What you want to do is called "noun chunks":
import spacy
nlp = spacy.load('en_core_web_md')
txt = "A large room with two yellow dishwashers in it"
doc = nlp(txt)
chunks = []
for chunk in doc.noun_chunks:
    out = {}
    root = chunk.root
    out[root.pos_] = root
    for tok in chunk:
        if tok != root:
            out[tok.pos_] = tok
    chunks.append(out)
print(chunks)
[
{'NOUN': room, 'DET': A, 'ADJ': large},
{'NOUN': dishwashers, 'NUM': two, 'ADJ': yellow},
{'PRON': it}
]
You may notice "noun chunk" doesn't guarantee the root will always be a noun. Should you wish to restrict your results to nouns only:
chunks = []
for chunk in doc.noun_chunks:
    out = {}
    noun = chunk.root
    if noun.pos_ != 'NOUN':
        continue
    out['noun'] = noun
    for tok in chunk:
        if tok != noun:
            out[tok.pos_] = tok
    chunks.append(out)
print(chunks)
[
{'noun': room, 'DET': A, 'ADJ': large},
{'noun': dishwashers, 'NUM': two, 'ADJ': yellow}
]
OK, so the title might sound a bit confusing, but here's an example of what I'm trying to achieve. Let's imagine that we have the following dataset:
Brand name   | Product type | Product_Description
Nike         | Shoes        | These black shoes are wonderful. They are elegant, and really comfortable
BMW          | Car          | This car goes fast. If you like speed, you'll like it.
Suzuki       | Car          | A family car, elegant and made for long journeys.
Call of Duty | VideoGame    | A nervous shooter, putting you in the shoes of a desperate soldier, who has nothing left to lose.
Adidas       | Shoes        | Sneakers made for men, and women, who always want to go out with style.
This is just a made-up sample, but let's imagine this list goes on for a lot of other products.
What I'm trying to achieve here is to cluster the elements (whether they are shoes, cars, or video games) based on the words used in their respective descriptions. I would thus obtain brands that are clustered together according to their descriptions, but that perhaps do not belong to the same type (e.g. Suzuki + Adidas), and I would get the names of the brands that are clustered together.
To do so, I relied on a word embedding method. After cleaning the descriptions (stop words, non-alphanumeric characters) and tokenizing them, I used a FastText model (the Wikipedia one) to compute the embeddings of the product descriptions.
def clean_text(text, tokenizer, stopwords):
    text = str(text).lower()  # Lowercase words
    text = re.sub(r"\[(.*?)\]", "", text)  # Remove [+XYZ chars] in content
    text = re.sub(r"\s+", " ", text)  # Remove multiple spaces in content
    text = re.sub(r"\w+…|…", "", text)  # Remove ellipsis (and last word)
    text = re.sub(r"<a[^>]*>(.*?)</a>", r"\1", text)  # Remove html tags
    text = re.sub(f"[{re.escape(punctuation)}]", "", text)
    text = re.sub(r"(?<=\w)-(?=\w)", " ", text)  # Replace dash between words
    text = re.sub(
        f"[{re.escape(string.punctuation)}]", "", text
    )  # Remove punctuation
    doc = nlp_model(text)
    tokens = [token.lemma_ for token in doc]
    #tokens = tokenizer(text)  # Get tokens from text
    tokens = [t for t in tokens if not t in stopwords]  # Remove stopwords
    tokens = ["" if t.isdigit() else t for t in tokens]  # Remove digits
    tokens = [t for t in tokens if len(t) > 1]  # Remove short tokens
    return tokens  # Clean the text
def sent_vectorizer(sent, model):
    sent_vec = []
    numw = 0
    for w in sent:
        try:
            if numw == 0:
                sent_vec = model[w]
            else:
                sent_vec = np.add(sent_vec, model[w])
            numw += 1
        except:
            pass
    return np.asarray(sent_vec) / numw
df = pd.read_csv("./mockup.csv")
custom_stopwords = set(stopwords.words("english"))
df["Product_Description"] = df["Product_Description"].fillna("")
df["tokens"] = df["Product_Description"].map(lambda x: clean_text(x, word_tokenize, custom_stopwords))
model = KeyedVectors.load_word2vec_format('./wiki-news-300d-1M.vec')
The problem is that I'm a bit of a beginner with word embeddings and clustering. As I said, my goal is to cluster brands according to the words used in their descriptions (the hypothesis being that some brands are perhaps linked together through the words used in their descriptions), thus forgoing the old classification (shoes, cars, video games...). I would also like to get the key brands of each cluster (so cluster 1 = Suzuki + Adidas, cluster 2 = Call of Duty + Nike, cluster 3 = BMW + ..., etc.).
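One possible next step, sketched under assumptions of mine (scikit-learn's KMeans, an arbitrary n_clusters=3 for this mock-up, and the "Brand name" column from the table above), is to stack the averaged description vectors, cluster them, and then group the brand names by cluster label:

import numpy as np
from sklearn.cluster import KMeans

# Average FastText vector per description, reusing sent_vectorizer from above.
X = np.vstack([sent_vectorizer(tokens, model) for tokens in df["tokens"]])

# n_clusters=3 is an arbitrary choice for this small mock-up; on real data
# you would pick it with something like silhouette scores.
kmeans = KMeans(n_clusters=3, random_state=0)
df["cluster"] = kmeans.fit_predict(X)

# List the brands that end up in each cluster.
for cluster_id, group in df.groupby("cluster"):
    print(cluster_id, list(group["Brand name"]))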
Does anyone have any ideas on how to tackle this problem? I read several tutorials online on word embeddings and clustering, and to be completely honest, I am a bit lost.
Thank you for your help.
I am looking for a tokenizer that is expanding contractions.
When using nltk to split a phrase into tokens, the contraction is not expanded:
nltk.word_tokenize("she's")
-> ['she', "'s"]
However, when using a dictionary with contraction mappings only, and therefore not taking any information provided by surrounding words into account, it's not possible to decide whether "she's" should be mapped to "she is" or to "she has".
Is there a tokenizer that provides contraction expansion?
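To make the ambiguity concrete, here is a tiny illustration (my own, with a made-up one-entry mapping) of why a pure dictionary lookup cannot pick the right expansion:

contractions = {"she's": "she is"}
text = "she's got a cat"
print(" ".join(contractions.get(w, w) for w in text.split()))
# prints "she is got a cat", which is wrong; here "she's" means "she has"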
You can do rule-based matching with spaCy to take the information provided by surrounding words into account.
I wrote some demo code below, which you can extend to cover more cases:
import spacy
from spacy.matcher import Matcher
sentences = ["now she's a software engineer" , "she's got a cat", "he's a tennis player", "He thinks that she's 30 years old"]
nlp = spacy.load('en_core_web_sm')
def normalize(sentence):
    ans = []
    doc = nlp(sentence)
    #print([(t.text, t.pos_ , t.dep_) for t in doc])
    matcher = Matcher(nlp.vocab)
    pattern = [{"POS": "PRON"}, {"LOWER": "'s"}, {"LOWER": "got"}]
    matcher.add("case_has", None, pattern)
    pattern = [{"POS": "PRON"}, {"LOWER": "'s"}, {"LOWER": "been"}]
    matcher.add("case_has", None, pattern)
    pattern = [{"POS": "PRON"}, {"LOWER": "'s"}, {"POS": "DET"}]
    matcher.add("case_is", None, pattern)
    pattern = [{"POS": "PRON"}, {"LOWER": "'s"}, {"IS_DIGIT": True}]
    matcher.add("case_is", None, pattern)
    # .. add more cases
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]
        for idx, t in enumerate(doc):
            if string_id == 'case_has' and t.text == "'s" and idx >= start and idx < end:
                ans.append("has")
                continue
            if string_id == 'case_is' and t.text == "'s" and idx >= start and idx < end:
                ans.append("is")
                continue
            else:
                ans.append(t.text)
    return ' '.join(ans)
for s in sentences:
    print(s)
    print(normalize(s))
    print()
output:
now she's a software engineer
now she is a software engineer
she's got a cat
she has got a cat
he's a tennis player
he is a tennis player
He thinks that she's 30 years old
He thinks that she is 30 years is old
I've built a web crawler which fetches me data. The data is typically structured, but here and there are a few anomalies. Now, to do analysis on top of the data, I am searching for a few words, i.e. searched_words=['word1','word2','word3', ...], and I want the sentences in which these words are present. So I coded as below:
searched_words=['word1','word2','word3'......]
fsa = re.compile('|'.join(re.escape(w.lower()) for w in searched_words))
str_df['context'] = str_df['text'].apply(
    lambda text: [sent for sent in sent_tokenize(text)
                  if any(True for w in word_tokenize(sent) if w.lower() in searched_words)])
It is working, but the problem I am facing is that if there is a missing white-space after a full stop in the text, the two sentences are not split and I get them back as one.
Example :
searched_words = ['snakes','venomous']
text = "I am afraid of snakes.I hate them."
output : ['I am afraid of snakes.I hate them.']
Desired output : ['I am afraid of snakes.']
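As an aside, one possible pre-processing workaround (my own suggestion, not from the original post) is to reinsert the missing space after the full stop before tokenizing; note the naive regex below will also split abbreviations and decimals:

import re
from nltk.tokenize import sent_tokenize

text = "I am afraid of snakes.I hate them."
# Assumption: a period directly followed by a non-space character marks a missing boundary.
fixed = re.sub(r'\.(?=\S)', '. ', text)
print(sent_tokenize(fixed))
# ['I am afraid of snakes.', 'I hate them.']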
If all tokenizers (including nltk's) fail you, you can take matters into your own hands and try:
import re
s = 'I am afraid of snakes.I hate venomous them. Theyre venomous.'

def findall(s, p):
    return [m.start() for m in re.finditer(p, s)]

def find(sent, word):
    res = []
    indexes = findall(sent, word)
    for index in indexes:
        i = index
        while i > 0:
            if sent[i] != '.':
                i -= 1
            else:
                break
        end = index + len(word)
        nextFullStop = end + sent[end:].find('.')
        res.append(sent[i:nextFullStop])
        i = 0
    return res
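For example, running it on the sample string gives something like:

print(find(s, 'venomous'))
# ['.I hate venomous them', '. Theyre venomous']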
There are some dots left in the results, as I do not know what exactly you want to do with them.
What the function does is find all occurrences of the given word and give you the sentence all the way back to the previous dot. It only handles this edge case, but you can tune it easily to your specific needs.
I have a series of strings like:
'i would like a blood orange'
I also have a list of strings like:
["blood orange", "loan shark"]
Operating on the string, I want the following list:
["i", "would", "like", "a", "blood orange"]
What is the best way to get the above list? I've been using re throughout my code, but I'm stumped with this issue.
This is a fairly straightforward generator implementation: split the string into words, group together words which form phrases, and yield the results.
(There may be a cleaner way to handle skip, but for some reason I'm drawing a blank.)
def split_with_phrases(sentence, phrase_list):
    words = sentence.split(" ")
    phrases = set(tuple(s.split(" ")) for s in phrase_list)
    max_phrase_length = max(len(p) for p in phrases)

    # Find a phrase within words starting at the specified index. Return the
    # phrase as a tuple, or None if no phrase starts at that index.
    def find_phrase(start_idx):
        # Iterate backwards, so we'll always find longer phrases before shorter ones.
        # Otherwise, if we have a phrase set like "hello world" and "hello world two",
        # we'll never match the longer phrase because we'll always match the shorter
        # one first.
        for phrase_length in xrange(max_phrase_length, 0, -1):
            test_word = tuple(words[start_idx:start_idx + phrase_length])
            if test_word in phrases:
                return test_word
        return None

    skip = 0
    for idx in xrange(len(words)):
        if skip:
            # This word was returned as part of a previous phrase; skip it.
            skip -= 1
            continue

        phrase = find_phrase(idx)
        if phrase is not None:
            # Skip the remaining words of the phrase on later iterations.
            skip = len(phrase) - 1
            yield " ".join(phrase)
            continue

        yield words[idx]
print [s for s in split_with_phrases('i would like a blood orange',
["blood orange", "loan shark"])]
Ah, this is crazy, crude and ugly, but it looks like it works. You may want to clean it up and optimize it, but certain ideas here might work.
list_to_split = ['i would like a blood orange', 'i would like a blood orange ttt blood orange']
input_list = ["blood orange", "loan shark"]
for item in input_list:
    for str_lst in list_to_split:
        if item in str_lst:
            tmp = str_lst.split(item)
            lst = []
            for itm in tmp:
                if itm != '':
                    lst.append(itm)
                    lst.append(item)
            print lst
output:
['i would like a ', 'blood orange']
['i would like a ', 'blood orange', ' ttt ', 'blood orange']
One quick and dirty, completely un-optimized approach might be to just replace the compounds in the string with a version that uses a different separator (preferably one that does not occur anywhere else in your target string or compound words), then split and replace back. A more efficient approach would be to iterate only once through the string, matching the compound words where appropriate, but you may have to watch out for instances where there are nested compounds, etc., depending on your array.
#!/usr/bin/python
import re

my_string = "i would like a blood orange"
compounds = ["blood orange", "loan shark"]

for i in range(0, len(compounds)):
    my_string = my_string.replace(compounds[i], compounds[i].replace(" ", "&"))

my_segs = re.split(r"\s+", my_string)
for i in range(0, len(my_segs)):
    my_segs[i] = my_segs[i].replace("&", " ")

print my_segs
Edit: Glenn Maynard's solution is better.