I have to find and "apply" collocations in several sentences. The sentences are stored in a list of strings. Let's focus on only one sentence for now.
Here's an example:
sentence = 'I like to eat the ice cream in new york'
Here's what I want in the end:
sentence_final = 'I like to eat the ice_cream in new_york'
I'm using Python NLTK to find the collocations and I'm able to create a set containing all the possible collocations over all the sentences I have.
Here's an example of the set:
set_collocations = set([('ice', 'cream'), ('new', 'york'), ('go', 'out')])
It's obviously bigger in reality.
I created the following function, which should return the new sentence, modified as described above:
import nltk

def apply_collocations(sentence, set_colloc):
    window_size = 2
    words = sentence.lower().split()
    list_bigrams = list(nltk.bigrams(words))
    set_bigrams = set(list_bigrams)
    intersect = set_bigrams.intersection(set_colloc)
    print(set_colloc)
    print(set_bigrams)
    # No collocation in this sentence
    if not intersect:
        return sentence
    # At least one collocation in this sentence
    else:
        set_words_iters = set()
        # Create set of words of the collocations
        for bigram in intersect:
            set_words_iters.add(bigram[0])
            set_words_iters.add(bigram[1])
        # Sentence beginning
        if list_bigrams[0][0] not in set_words_iters:
            new_sentence = list_bigrams[0][0]
            begin = 1
        else:
            new_sentence = list_bigrams[0][0] + '_' + list_bigrams[0][1]
            begin = 2
        for i in range(begin, len(list_bigrams)):
            print(new_sentence)
            if list_bigrams[i][1] in set_words_iters and list_bigrams[i] in intersect:
                new_sentence += ' ' + list_bigrams[i][0] + '_' + list_bigrams[i][1]
            elif list_bigrams[i][1] not in set_words_iters:
                new_sentence += ' ' + list_bigrams[i][1]
        return new_sentence
Two questions:
Is there a more optimized way to do this?
Since I'm a little bit inexperienced with NLTK, can someone tell me if there's a "direct way" to apply collocations to a certain text? I mean, once I have identified the bigrams which I consider collocations, is there some function (or fast method) to modify my sentences?
You can simply replace the string "x y" by "x_y" for each element in your collocations set:
def apply_collocations(sentence, set_colloc):
    res = sentence.lower()
    for b1, b2 in set_colloc:
        res = res.replace("%s %s" % (b1, b2), "%s_%s" % (b1, b2))
    return res
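For example, with the sentence and collocation set from the question (note that, like your original function, this lowercases the input):

sentence = 'I like to eat the ice cream in new york'
set_collocations = set([('ice', 'cream'), ('new', 'york'), ('go', 'out')])

print(apply_collocations(sentence, set_collocations))
# i like to eat the ice_cream in new_york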
Input:
string = "My dear adventurer, do you understand the nature of the given discussion?"
expected output:
string = 'My dear ##########, do you ########## the nature ## the given ##########?'
How can you replace every third word in a string of words with a '#' for each of its letters, without counting special characters found in the string such as apostrophes ('), quotation marks ("), full stops (.), commas (,), exclamation marks (!), question marks (?), colons (:) and semicolons (;)?
I took the approach of converting the string to a list of elements, but I am having difficulty filtering out the special characters and replacing the words with the '#' equivalent. Is there a better way to go about it?
I solved it with:
s = "My dear adventurer, do you understand the nature of the given discussion?"
def replace_alphabet_with_char(word: str, replacement: str) -> str:
new_word = []
alphabet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
for c in word:
if c in alphabet:
new_word.append(replacement)
else:
new_word.append(c)
return "".join(new_word)
every_nth_word = 3
s_split = s.split(' ')
result = " ".join([replace_alphabet_with_char(s_split[i], '#') if i % every_nth_word == every_nth_word - 1 else s_split[i] for i in range(len(s_split))])
print(result)
Output:
My dear ##########, do you ########## the nature ## the given ##########?
There are more efficient ways to solve this question, but I hope this is the simplest!
My approach is:
Split the sentence into a list of its words.
Using that, make a list of every third word.
Remove unwanted characters from this list.
Replace the third words in the original string with '#' repeated for the length of the word.
Here's the code (explained in comments):
# original line
line = "My dear adventurer, do you understand the nature of the given discussion?"
# printing original line
print(f'\n\nOriginal Line:\n"{line}"\n')
# printing something to indicate that the next few prints show what happens after each line
print('\n\nStages of parsing:')
# splitting by spaces, into a list
wordList = line.split(' ')
# printing word list
print(wordList)
# making a list of every third word
thirdWordList = [wordList[i-1] for i in range(1, len(wordList)+1) if i % 3 == 0]
# printing third-word list
print(thirdWordList)
# characters that you don't want hashed
unwantedCharacters = ['.', '/', '|', '?', '!', '_', '"', ',', '-', '#', '\n', '\\', ':', ';', '(', ')', '<', '>', '{', '}', '[', ']', '%', '*', '&', '+']
# replacing these characters with empty strings in the list of third words
for unwantedchar in unwantedCharacters:
    for i in range(0, len(thirdWordList)):
        thirdWordList[i] = thirdWordList[i].replace(unwantedchar, '')
# printing third-word list, now without punctuation
print(thirdWordList)
# replacing with #
for word in thirdWordList:
    line = line.replace(word, len(word) * '#')
# Voila! Printing the result:
print(f'\n\nFinal Output:\n"{line}"\n\n')
Hope this helps!
The following works and does not use regular expressions:
special_chars = {'.', '/', '|', '?', '!', '_', '"', ',', '-', '#', '\n', '\\'}

def format_word(w, fill):
    if w[-1] in special_chars:
        return fill * (len(w) - 1) + w[-1]
    else:
        return fill * len(w)

def obscure(string, every=3, fill='#'):
    return ' '.join(
        (format_word(w, fill) if (i + 1) % every == 0 else w)
        for (i, w) in enumerate(string.split())
    )
Here is some example usage:
In [15]: obscure(string)
Out[15]: 'My dear ##########, do you ########## the nature ## the given ##########?'
In [16]: obscure(string, 4)
Out[16]: 'My dear adventurer, ## you understand the ###### of the given ##########?'
In [17]: obscure(string, 3, '?')
Out[17]: 'My dear ??????????, do you ?????????? the nature ?? the given ???????????'
With the help of some regex. Explanation is in the comments.
import re

imp = "My dear adventurer, do you understand the nature of the given discussion?"
every_nth = 3  # in case you want to change this later
out_list = []

# split the input at spaces, enumerate the parts for looping
for idx, word in enumerate(imp.split(' ')):
    # only do the special logic for multiples of n (0-indexed, thus +1)
    if (idx + 1) % every_nth == 0:
        # find how many special chars there are in the current segment
        len_special_chars = len(re.findall(r'[.,!?:;\'"]', word))
        # ^ add more special chars here if needed
        # subtract the number of special chars from the length of the segment
        str_len = len(word) - len_special_chars
        # repeat '#' for every non-special char, then keep any trailing special chars
        out_list.append('#' * str_len + (word[-len_special_chars:] if len_special_chars > 0 else ''))
    else:
        # if the index is not a multiple of n, just add the word
        out_list.append(word)

print(' '.join(out_list))
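This prints:
My dear ##########, do you ########## the nature ## the given ##########?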
A mix of regex and string manipulation:
import re

string = "My dear adventurer, do you understand the nature of the given discussion?"
new_string = []
for i, s in enumerate(string.split()):
    if (i + 1) % 3 == 0:
        s = re.sub(r'[^\.:,;\'"!\?]', '#', s)
    new_string.append(s)
new_string = ' '.join(new_string)
print(new_string)
I have a function drop_stopwords like this. How do I make it run faster?
temp.reverse()

def drop_stopwords(text):
    for x in temp:
        if len(x.split()) > 1:
            text_list = text.split()
            for y in range(len(text_list) - len(x.split())):
                if " ".join(text_list[y:y+len(x.split())]) == x:
                    del text_list[y:y+len(x.split())]
            text = " ".join(text_list)
        else:
            text = " ".join(text for text in text.split() if text not in vietnamese)
    return text
Processing one text of my data takes 14 s, and if I add a trick like this the time drops to 3 s:
temp.reverse()

def drop_stopwords(text):
    for x in temp:
        if len(x.split()) > 2:
            if x in text:
                text = text.replace(x, '')
        elif len(x.split()) > 1:
            text_list = text.split()
            for y in range(len(text_list) - len(x.split())):
                if " ".join(text_list[y:y+len(x.split())]) == x:
                    del text_list[y:y+len(x.split())]
            text = " ".join(text_list)
        else:
            text = " ".join(text for text in text.split() if text not in vietnamese)
    return text
but I think it may get something wrong somewhere in my language. How can I rewrite this function in Python to make it faster? (In C and C++ I could solve it easily with the function above.)
Your function does a lot of the same thing over and over, particularly repeated split and join of the same text. Doing a single split, operating on the list, and then doing a single join at the end might be faster, and would definitely lead to simpler code. Unfortunately I don't have any of your sample data to test the performance with, but hopefully this gives you something to experiment with:
temp = ["foo", "baz ola"]
def drop_stopwords(text):
text_list = text.split()
text_len = len(text_list)
for word in temp:
word_list = word.split()
word_len = len(word_list)
for i in range(text_len + 1 - word_len):
if text_list[i:i+word_len] == word_list:
text_list[i:i+word_len] = [None] * word_len
return ' '.join(t for t in text_list if t)
print(drop_stopwords("the quick brown foo jumped over the baz ola dog"))
# the quick brown jumped over the dog
You could also just try iteratively doing text.replace in all cases and seeing how that performs compared to your more complex split-based solution:
temp = ["foo", "baz ola"]
def drop_stopwords(text):
for word in temp:
text = text.replace(word, '')
return ' '.join(text.split())
print(drop_stopwords("the quick brown foo jumped over the baz ola dog"))
# the quick brown jumped over the dog
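One caveat worth checking with the plain replace approach (my own aside, not part of the answer above): str.replace also matches inside longer words, so a stopword can be deleted out of the middle of another word. A sketch of a word-boundary guard using re, with the same sample data:

import re

temp = ["foo", "baz ola"]

def drop_stopwords_boundary(text):
    # \b keeps "foo" from being removed out of a longer word such as "foobar";
    # whether this matters depends on the stopword list and the language
    for word in temp:
        text = re.sub(r'\b' + re.escape(word) + r'\b', '', text)
    return ' '.join(text.split())

print(drop_stopwords_boundary("the quick brown foobar foo jumped over the baz ola dog"))
# the quick brown foobar jumped over the dog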
I use a list of synonyms to replace words in my sentence. The function works, but there is a slight problem with the output.
#Function
eda(t, alpha_sr=0.1, num_aug=3)
Original : "Un abricot est bon."
New sentence : 'Un aubercot est bon .'
As you can see the replacement was made, but the punctuation ends up separated from the last word by a space, unlike in the original. I would like to modify the code so that I obtain this result for every punctuation mark:
New sentence : 'Un aubercot est bon.'
augmented_sentences.append(' '.join(a_words))  # the problem arises here: since I join the words after splitting them, the punctuation is also joined with a space
Since I am working with some quite long reviews, the punctuation is really important.
The code is below:
import re
import random
from random import shuffle

def cleaning(texte):
    texte = re.sub(r"<iwer>.*?</iwer>", " ", str(texte))  # clean
    return texte

def eda(sentence, alpha_sr=0.1, num_aug=3):
    sentence = cleaning(sentence)
    sent_doc = nlp(sentence)
    words = [token.text for token in sent_doc if token.pos_ != "SPACE"]
    num_words = len(words)
    augmented_sentences = []
    num_new_per_technique = int(num_aug / 4) + 1
    if alpha_sr > 0:
        n_sr = max(1, int(alpha_sr * num_words))
        for _ in range(num_new_per_technique):
            a_words = synonym_replacement(words, n_sr)
            print(a_words)
            augmented_sentences.append(' '.join(a_words))  # the problem is here, since I join the words after splitting them
    shuffle(augmented_sentences)
    # trim so that we have the desired number of augmented sentences
    if num_aug >= 1:
        augmented_sentences = augmented_sentences[:num_aug]
    else:
        keep_prob = num_aug / len(augmented_sentences)
        augmented_sentences = [s for s in augmented_sentences if random.uniform(0, 1) < keep_prob]
    # append the original sentence
    augmented_sentences.append(sentence)
    #print(len(augmented_sentences))
    return augmented_sentences
def synonym_replacement(words, n):
    new_words = words.copy()
    random_word_list = [word for word in words if word not in stop_words]
    random_word_list = ' '.join(new_words)
    #print("random list :", random_word_list)
    sent_doc = nlp(random_word_list)
    random_word_list = [token.lemma_ for token in sent_doc if token.pos_ in ("NOUN", "ADJ", "VERB", "ADV")]
    random.shuffle(random_word_list)
    num_replaced = 0
    for random_word in random_word_list:
        synonyms = get_synonyms(random_word)
        if len(synonyms) >= 1:
            synonym = random.choice(list(synonyms))
            new_words = [synonym if word == random_word else word for word in new_words]
            #print("replaced", random_word, "with", synonym)
            num_replaced += 1
        if num_replaced >= n:  # only replace up to n words
            break
    #this is stupid but we need it, trust me
    sentence = ' '.join(new_words)
    new_words = sentence.split(' ')
    return new_words
def get_synonyms(word):
    synonyms = []
    for k_syn, v_syn in word_syn_map.items():
        if k_syn == word:
            print(v_syn)
            synonyms.extend(v_syn)
    synonyms = set(synonyms)
    if word in synonyms:
        synonyms.remove(word)
    return list(synonyms)
The dictionary of synonyms looks like this:
#word_syn_map
defaultdict(<class 'list'>,
{'ADN': ['acide désoxyribonucléique', 'acide désoxyribonucléique'],
'abdomen': ['bas-ventre',
'bide',
'opisthosome',
'panse',
'ventre',
'bas-ventre',
'bide',
'opisthosome',
'panse',
'ventre'],
'abricot': ['aubercot', 'michemis', 'aubercot', 'michemis']})
The tokenization pipeline:
import stanza
import spacy_stanza
stanza.download('fr')
nlp = spacy_stanza.load_pipeline('fr', processors='tokenize,mwt,pos,lemma')
Two answers to this:
I can't see your nlp function, so I don't know exactly how you're tokenising the string, but it looks like punctuation is being treated as a separate token. That's why a space gets inserted: the punctuation is handled like any other word. You either need to adjust your tokenisation algorithm so that it includes the punctuation in the word, or, if you can't do that, do an extra pass through the words list at the start to stick punctuation back onto the token it belongs to (i.e. if a given token is punctuation, and you'll need a list of punctuation tokens, glue it onto the token before it); a minimal sketch of that re-gluing pass is shown below. Either way, you then need to adjust your matching algorithm so it ignores punctuation and matches the rest of the word.
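For example, a re-gluing pass could look something like this (my own illustration, assuming the tokeniser emits punctuation as standalone tokens, as in the 'Un aubercot est bon .' output above):

PUNCTUATION = {'.', ',', '!', '?', ':', ';', "'", '"'}

def reattach_punctuation(tokens):
    # glue punctuation-only tokens back onto the previous token,
    # so that ' '.join(...) no longer leaves a space before the punctuation
    glued = []
    for tok in tokens:
        if tok in PUNCTUATION and glued:
            glued[-1] += tok
        else:
            glued.append(tok)
    return glued

print(' '.join(reattach_punctuation(['Un', 'aubercot', 'est', 'bon', '.'])))
# Un aubercot est bon.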
This feels like you're overcomplicating the problem. I'd be inclined to do something like this:
import re
import random

def make_replacer(word):
    def get_synonym(wordmatch):
        synonym = random.choice(word_syn_map[word])  # pick one synonym for the word at random
        return wordmatch.group(1) + synonym + wordmatch.group(2)
    return get_synonym

new_sentence = sentence  # strings are immutable, so a plain assignment is enough here
for original_word in word_syn_map:
    # add more punctuation to the character class as needed
    wordexp = re.compile(r'(^| )' + re.escape(original_word) + r'([ .!?,-])', re.IGNORECASE)
    new_sentence = wordexp.sub(make_replacer(original_word), new_sentence)
This isn't guaranteed to work, since I haven't tested it (and you'll certainly need to do something to maintain capitalisation, or it will lowercase everything), but I'd do something with regexes myself.
Take a sample of sentences from each of the corpus1, corpus2 and corpus3 corpora and display the average length (measured as the number of characters in the sentence).
So I have three corpora, and sample_raw_sents is a defined function that returns random sentences:
tcr = corpus1()
rcr = corpus2()
mcr = corpus3()
sample_size = 50

for sentence in tcr.sample_raw_sents(sample_size):
    print(len(sentence))
for sentence in rcr.sample_raw_sents(sample_size):
    print(len(sentence))
for sentence in mcr.sample_raw_sents(sample_size):
    print(len(sentence))
Using this code all the lengths are printed, but how do I sum() these lengths?
Use zip; it will allow you to draw a sentence from each corpus all at once.
tcr = corpus1()
rcr = corpus2()
mcr = corpus3()
sample_size = 50

zipped = zip(tcr.sample_raw_sents(sample_size),
             rcr.sample_raw_sents(sample_size),
             mcr.sample_raw_sents(sample_size))

for s1, s2, s3 in zipped:
    summed = len(s1) + len(s2) + len(s3)
    average = summed / 3
    print(summed, average)
You could store all the sentence lengths in a list and then sum them up:
tcr = corpus1()
rcr = corpus2()
mcr = corpus3()
sample_size = 50

lengths = []
for sentence in tcr.sample_raw_sents(sample_size):
    lengths.append(len(sentence))
for sentence in rcr.sample_raw_sents(sample_size):
    lengths.append(len(sentence))
for sentence in mcr.sample_raw_sents(sample_size):
    lengths.append(len(sentence))

print(sum(lengths) / len(lengths))
tcr = corpus1()
rcr = corpus2()
mcr = corpus3()
sample_size = 50

s = 0
for sentence in tcr.sample_raw_sents(sample_size):
    s = s + len(sentence)
for sentence in rcr.sample_raw_sents(sample_size):
    s = s + len(sentence)
for sentence in mcr.sample_raw_sents(sample_size):
    s = s + len(sentence)

average = s / 150
print('average: {}'.format(average))
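For completeness, a variation on the answers above (assuming the same corpus objects): the three loops can be collapsed into a single comprehension and one sum() call, which is what the question was asking about.

sample_size = 50

lengths = [len(sentence)
           for corpus in (corpus1(), corpus2(), corpus3())
           for sentence in corpus.sample_raw_sents(sample_size)]

print(sum(lengths) / len(lengths))  # average characters per sentence over all three samples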
Given the following basis:
basis = "Each word of the text is converted as follows: move any consonant (or consonant cluster) that appears at the start of the word to the end, then append ay."
and the following words:
words = "word, text, bank, tree"
How can I calculate the PMI values of each word in "words" compared to each word in "basis", using a context window of size 5 (that is, two positions before and two after the target word)?
I know how to calculate the PMI, but I don't know how to handle the context window.
I calculate the 'normal' PMI-values as follows:
from math import log

def PMI(ContingencyTable):
    (a, b, c, d, N) = ContingencyTable
    # avoid log(0)
    a += 1
    b += 1
    c += 1
    d += 1
    N += 4
    R_1 = a + b
    C_1 = a + c
    return log(float(a) / (float(R_1) * float(C_1)) * float(N), 2)
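For reference, a call with a hypothetical contingency table (a = joint count of the two words, b and c = counts where only one of the two occurs, d = neither, N = the total) looks like this; the numbers are made up purely to show the expected input shape:

a, b, c, d = 8, 42, 15, 935
N = a + b + c + d

print(PMI((a, b, c, d, N)))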
I did a little searching on PMI; it looks like heavy-duty packages are out there, "windowing" included.
In PMI the "mutual" refers to the joint probability of two different words, so you need to firm up that idea with respect to the problem statement.
I took on the smaller problem of just generating the short windowed lists in your problem statement, mostly for my own exercise:
def wndw(wrd_l, m_l, pre, post):
    """
    Returns a list of all lists of sequential words in input wrd_l
    that are within range -pre and +post of any word in wrd_l that matches
    a word in m_l.

    wrd_l = list of words
    m_l = list of words to match on
    pre, post = ints giving range of indices to include in window size
    """
    wndw_l = list()
    for i, w in enumerate(wrd_l):
        if w in m_l:
            wndw_l.append([wrd_l[i + k] for k in range(-pre, post + 1)
                           if 0 <= (i + k) < len(wrd_l)])
    return wndw_l

basis = """Each word of the text is converted as follows: move any
consonant (or consonant cluster) that appears at the start
of the word to the end, then append ay."""

words = "word, text, bank, tree"

print(*wndw(basis.split(), [x.strip() for x in words.split(',')], 2, 2),
      sep="\n")
['Each', 'word', 'of', 'the']
['of', 'the', 'text', 'is', 'converted']
['of', 'the', 'word', 'to', 'the']
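One possible way to go from those windows to the contingency table expected by the PMI function in the question is to treat every (centre word, context word) pair inside a window as one observation and fill a, b, c, d from pair and word counts. This is my own sketch, not a standard recipe, and it reuses PMI and basis as defined earlier:

from collections import Counter

def window_pmi(wrd_l, target, context, pre=2, post=2):
    # count (centre word, context word) pairs within the -pre/+post window
    pair_counts = Counter()
    centre_counts = Counter()
    context_counts = Counter()
    N = 0
    for i, w in enumerate(wrd_l):
        for k in range(-pre, post + 1):
            j = i + k
            if k == 0 or not (0 <= j < len(wrd_l)):
                continue
            pair_counts[(w, wrd_l[j])] += 1
            centre_counts[w] += 1
            context_counts[wrd_l[j]] += 1
            N += 1
    a = pair_counts[(target, context)]   # target and context seen together
    b = centre_counts[target] - a        # target with some other context word
    c = context_counts[context] - a      # context word around some other target
    d = N - a - b - c                    # everything else
    return PMI((a, b, c, d, N))

print(window_pmi(basis.split(), 'word', 'the'))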