I'm trying to print a text while highlighting certain words and word bigrams. This would be fairly straightforward if I didn't also have to print the other tokens, like punctuation, as well.
I have a list of words to highlight and another list of word bigrams to highlight.
Highlighting individual words is fairly easy, like for example:
import re
import string

regex_pattern = re.compile("([%s \n])" % string.punctuation)

def highlighter(content, terms_to_highlight):
    tokens = regex_pattern.split(content)
    for token in tokens:
        if token.lower() in terms_to_highlight:
            print('\x1b[6;30;42m' + token + '\x1b[0m', end="")
        else:
            print(token, end="")
Only highlighting words that appear in sequence is more complex. I have been playing around with iterators but haven't been able to come up with anything that isn't overly complicated.
If I understand the question correctly, one solution is to look ahead to the next word token and check if the bigram is in the list.
import re
import string

regex_pattern = re.compile("([%s \n])" % string.punctuation)

def find_next_word(tokens, idx):
    nonword = string.punctuation + " \n"
    for i in range(idx + 1, len(tokens)):
        if tokens[i] not in nonword:
            return (tokens[i], i)
    return (None, -1)

def highlighter(content, terms, bigrams):
    tokens = regex_pattern.split(content)
    idx = 0
    while idx < len(tokens):
        token = tokens[idx]
        (next_word, nw_idx) = find_next_word(tokens, idx)
        if token.lower() in terms:
            print('*' + token + '*', end="")
            idx += 1
        elif next_word and (token.lower(), next_word.lower()) in bigrams:
            concat = "".join(tokens[idx:nw_idx + 1])
            print('-' + concat + '-', end="")
            idx = nw_idx + 1
        else:
            print(token, end="")
            idx += 1
terms = ['man', 'the']
bigrams = [('once', 'upon'), ('i','was')]
text = 'Once upon a time, as I was walking to the city, I met a man. As I was tired, I did not look once... upon this man.'
highlighter(text, terms, bigrams)
When called, this gives:
-Once upon- a time, as -I was- walking to *the* city, I met a *man*. As -I was- tired, I did not look -once... upon- this *man*.
Please note that:
this is a greedy algorithm: it matches the first bigram it finds. So if you check for both yellow banana and banana boat, yellow banana boat will always be highlighted as -yellow banana- boat. If you want different behavior, you should update the test logic.
you probably also want to update the logic to handle the case where a word is both in terms and the first part of a bigram
I haven't tested all edge cases, so some things may break / there may be fence-post errors
you can optimize performance if necessary by:
building a list of the first words of the bigrams and checking whether a word is in it before doing the look-ahead to the next word (see the sketch below)
and/or using the result of the look-ahead to consume in one step all the non-word tokens between two words (implementing this step should be enough to ensure linear performance)
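For illustration, here is a minimal sketch of the first optimization, reusing the regex_pattern and find_next_word defined above. A set is used only because it gives fast membership tests; this is untested beyond the example sentence and behaves like the highlighter above otherwise.
def highlighter_fast(content, terms, bigrams):
    # Pre-compute the first words of all bigrams so we only look ahead when needed.
    bigram_starts = set(first for (first, second) in bigrams)
    tokens = regex_pattern.split(content)
    idx = 0
    while idx < len(tokens):
        token = tokens[idx]
        lowered = token.lower()
        if lowered in terms:
            print('*' + token + '*', end="")
            idx += 1
        elif lowered in bigram_starts:
            # Only do the look-ahead for tokens that can start a bigram.
            (next_word, nw_idx) = find_next_word(tokens, idx)
            if next_word and (lowered, next_word.lower()) in bigrams:
                print('-' + "".join(tokens[idx:nw_idx + 1]) + '-', end="")
                idx = nw_idx + 1
            else:
                print(token, end="")
                idx += 1
        else:
            print(token, end="")
            idx += 1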
Hope this helps.
I'm attempting to capitalize all words in a section of text that only appear once. I have the bit that finds which words only appear once down, but when I go to replace the original word with the .upper version, a bunch of other stuff gets capitalized too. It's a small program, so here's the code.
from collections import Counter
from string import punctuation

path = input("Path to file: ")

with open(path) as f:
    word_counts = Counter(word.strip(punctuation) for line in f for word in
                          line.replace(")", " ").replace("(", " ").replace(":", " ").replace("", " ").split())

wordlist = open(path).read().replace("\n", " ").replace(")", " ").replace("(", " ").replace("", " ")
unique = [word for word, count in word_counts.items() if count == 1]

for word in unique:
    print(word)
    wordlist = wordlist.replace(word, str(word.upper()))

print(wordlist)
The output should be 'Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan.', as sojournings is the first word that only appears once. Instead, it outputs GenesIs 37:1 Jacob lIved In the land of hIs FATher's SOJOURNINGS, In the land of Canaan. Because some of the other unique words appear as substrings inside other words, their letters get capitalized as well.
Any ideas?
I rewrote the code pretty significantly since some of the chained replace calls might prove to be unreliable.
import string

# The sentence.
sentence = "Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan."

rm_punc = sentence.translate(str.maketrans('', '', string.punctuation))  # remove punctuation (Python 3)
words = rm_punc.split(' ')  # split on spaces to get a list of words

# Find all unique word occurrences.
single_occurrences = []
for word in words:
    # If the word only occurs 1 time, append it to the list.
    if words.count(word) == 1:
        single_occurrences.append(word)

# For each unique word, find its index and capitalize the letter at that index
# in the initial string (the letter at that index is also the first letter of
# the word). Note that strings are immutable, so we are actually creating a new
# string on each iteration. Also, sometimes small words occur inside of other
# words, e.g. 'an' inside of 'land'. In order to make sure that our call to
# `index()` doesn't find these small words, we keep track of `start`, which
# makes sure we only ever search from the end of the previously found word.
start = 0
for word in single_occurrences:
    try:
        word_idx = start + sentence[start:].index(word)
    except ValueError:
        # Could not find word in sentence. Skip it.
        pass
    else:
        # Update counter.
        start = word_idx + len(word)
        # Rebuild sentence with capitalization.
        first_letter = sentence[word_idx].upper()
        sentence = sentence[:word_idx] + first_letter + sentence[word_idx + 1:]

print(sentence)
Text replacement by patterns calls for regex.
Your text is a bit tricky, you have to
remove digits
remove punctuation
split into words
care about capitalisation: 'It's' vs 'it's'
only replace full matches ('remote' vs 'mote' when replacing mote)
etc.
This should do it - see comments inside for explanations (bible.txt is from your link):
from collections import Counter, defaultdict
from string import punctuation, digits
import re

with open(r"SO\AllThingsPython\P4\bible.txt") as f:
    s = f.read()

# Get a set of unwanted characters and clean the text.
ps = set(punctuation + digits)
s2 = ''.join(c for c in s if c not in ps)

# Split into words.
s3 = s2.split()

# Create a set of all capitalizations of each word.
repl = defaultdict(set)
for word in s3:
    repl[word.upper()].add(word)  # e.g. {..., 'IN': {'In', 'in'}, 'THE': {'The', 'the'}, ...}

# Count all words _upper case_ and keep those that only occur once.
single_occurence_upper_words = [w for w, n in Counter((w.upper() for w in s3)).most_common() if n == 1]

text = s
# Now the replace part - for all upper-case single words
for upp in single_occurence_upper_words:
    # for all capitalizations occurring in the text
    for orig in repl[upp]:
        # use regex replace to find the original word from our repl dict with
        # space/punctuation before/after it and replace it with the uppercase word
        text = re.sub(f"(?<=[{punctuation} ])({orig})(?=[{punctuation} ])", upp, text)

print(text)
Output (shortened):
Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan.
2 These are the GENERATIONS of Jacob.
Joseph, being seventeen years old, was pasturing the flock with his brothers. He was a boy with the sons of Bilhah and Zilpah, his father's wives. And Joseph brought a BAD report of them to their father. 3 Now Israel loved Joseph more than any other of his sons, because he was the son of his old age. And he made him a robe of many colors. [a] 4 But when his brothers saw that their father loved him more than all his brothers, they hated him
and could not speak PEACEFULLY to him.
<snipp>
The regex uses lookahead ('(?=...)') and lookbehind ('(?<=...)') syntax to make sure we replace only full words; see the regex syntax documentation.
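For a quick illustration of why the lookarounds matter, here is a toy example (the words are made up, not from the original post):
import re
from string import punctuation

# 'mote' inside 'remote' and 'remotely' is not replaced, because the character
# before it is a letter rather than punctuation or a space.
sample = "a remote mote, remotely."
print(re.sub(f"(?<=[{punctuation} ])(mote)(?=[{punctuation} ])", "MOTE", sample))
# -> a remote MOTE, remotely.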
I want to do data augmentation for a sentiment analysis task by replacing words with their synonyms from WordNet, but the replacement is random. I want to loop over the synonyms and replace the word with each synonym, one at a time, to increase the data size.
from random import randint

import nltk
from nltk.corpus import wordnet

sentences = []
for index, r in pos_df.iterrows():
    text = normalize(r['text'])
    words = tokenize(text)
    output = ""
    # Identify the parts of speech.
    tagged = nltk.pos_tag(words)
    for i in range(0, len(words)):
        replacements = []
        # Only replace nouns with nouns, verbs with verbs, etc.
        for syn in wordnet.synsets(words[i]):
            # Do not attempt to replace proper nouns or determiners.
            if tagged[i][1] == 'NNP' or tagged[i][1] == 'DT':
                break
            # The tagger returns strings like NNP, VBP etc.,
            # but the wordnet synset names have tags like .n.
            # So we extract the first character from NNP, i.e. n
            # (lowercased to match wordnet), then we check whether
            # the synset name contains a .n. or not.
            word_type = tagged[i][1][0].lower()
            if syn.name().find("." + word_type + ".") != -1:
                # Extract the word only.
                r = syn.name()[0:syn.name().find(".")]
                replacements.append(r)
        if len(replacements) > 0:
            # Choose a random replacement.
            replacement = replacements[randint(0, len(replacements) - 1)]
            print(replacement)
            output = output + " " + replacement
        else:
            # If no replacement could be found, then just use the
            # original word.
            output = output + " " + words[i]
    sentences.append([output, 'positive'])
I'm working on a similar kind of project, generating new sentences from a given input without changing the context of the input text.
While looking into this, I found a data augmentation technique that seems to work well for the augmentation part: EDA (Easy Data Augmentation), a paper with code at https://github.com/jasonwei20/eda_nlp.
Hope this helps you.
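As an addendum, here is a minimal sketch of the "one synonym at a time" variation the question describes. It is an assumption-laden illustration rather than the EDA method or a drop-in replacement for the loop in the question: it works on a plain list of word tokens, uses WordNet lemma names, and skips the POS filtering.
from nltk.corpus import wordnet

def one_word_variants(words, index):
    # Yield one new sentence per distinct WordNet synonym of words[index].
    original = words[index]
    synonyms = set()
    for syn in wordnet.synsets(original):
        for lemma in syn.lemmas():
            name = lemma.name().replace('_', ' ')
            if name.lower() != original.lower():
                synonyms.add(name)
    for synonym in sorted(synonyms):
        yield ' '.join(words[:index] + [synonym] + words[index + 1:])

# Example: every single-word substitution for 'good' in a toy sentence.
for variant in one_word_variants(['the', 'movie', 'was', 'good'], 3):
    print(variant)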
I have a corpus of sentences in a specific domain.
I am looking for an open-source code/package, that I can give the data and it will generate a good, reliable language model. (Meaning, given a context, know the probability for each word).
Is there such a code/project?
I saw this github repo: https://github.com/rafaljozefowicz/lm, but it didn't work.
I recommend writing your own basic implementation. First, we need some sentences:
import nltk
from nltk.corpus import brown
words = brown.words()
total_words = len(words)
sentences = list(brown.sents())
sentences is now a list of lists. Each sublist represents a sentence with each word as an element. Now you need to decide whether or not you want to include punctuation in your model. If you want to remove it, try something like the following:
punctuation = [",", ".", ":", ";", "!", "?"]
for i, sentence in enumerate(sentences.copy()):
    new_sentence = [word for word in sentence if word not in punctuation]
    sentences[i] = new_sentence
Next, you need to decide whether or not you care about capitalization. If you don't care about it, you could remove it like so:
for i, sentence in enumerate(sentences.copy()):
    new_sentence = list()
    for j, word in enumerate(sentence.copy()):
        new_word = word.lower()  # Lower case all characters in word
        new_sentence.append(new_word)
    sentences[i] = new_sentence
Next, we need special start and end words to represent words that are valid at the beginning and end of sentences. You should pick start and end words that don't exist in your training data.
start = ["<<START>>"]
end = ["<<END>>"]
for i, sentence in enumerate(sentences.copy()):
    new_sentence = start + sentence + end
    sentences[i] = new_sentence
Now, let's count unigrams. A unigram is a sequence of one word in a sentence. Yes, a unigram model is just a frequency distribution of each word in the corpus:
new_words = list()
for sentence in sentences:
    for word in sentence:
        new_words.append(word)

unigram_fdist = nltk.FreqDist(new_words)
And now it's time to count bigrams. A bigram is a sequence of two words in a sentence. So, for the sentence "i am the walrus", we have the following bigrams: "<<START>> i", "i am", "am the", "the walrus", and "walrus <<END>>".
bigrams = list()
for sentence in sentences:
    new_bigrams = nltk.bigrams(sentence)
    bigrams += new_bigrams
Now we can create a frequency distribution:
bigram_fdist = nltk.ConditionalFreqDist(bigrams)
Finally, we want to know the probability of each word in the model:
def getUnigramProbability(word):
    if word in unigram_fdist:
        return unigram_fdist[word] / total_words
    else:
        return -1  # You should figure out how you want to handle out-of-vocabulary words

def getBigramProbability(word1, word2):
    if word1 not in bigram_fdist:
        return -1  # You should figure out how you want to handle out-of-vocabulary words
    elif word2 not in bigram_fdist[word1]:
        # i.e. "word1 word2" never occurs in the corpus
        return getUnigramProbability(word2)
    else:
        bigram_frequency = bigram_fdist[word1][word2]
        unigram_frequency = unigram_fdist[word1]
        bigram_probability = bigram_frequency / unigram_frequency
        return bigram_probability
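For example, once everything above has run, you could query the model like this (the exact numbers depend on your preprocessing choices and the Brown corpus; 'fox' is just an arbitrary word):
# Probability of a single word anywhere in the corpus.
print(getUnigramProbability('the'))

# Probability of 'fox' given the previous word 'the' (falls back to the
# unigram probability, or -1, if the pair never occurs).
print(getBigramProbability('the', 'fox'))

# Probability of 'the' starting a sentence.
print(getBigramProbability('<<START>>', 'the'))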
While this isn't a framework/library that just builds the model for you, I hope seeing this code has demystified what goes on in a language model.
You might try word_language_model from the PyTorch examples. There might just be an issue if you have a big corpus, since they load all the data into memory.
I've built a web crawler which fetches me data. The data is typically structured, but here and there are a few anomalies. Now, to do analysis on top of the data, I am searching for a few words, i.e. searched_words=['word1','word2','word3'......], and I want the sentences in which these words are present. So I coded as below:
import re
from nltk.tokenize import sent_tokenize, word_tokenize

searched_words = ['word1', 'word2', 'word3'......]
fsa = re.compile('|'.join(re.escape(w.lower()) for w in searched_words))
str_df['context'] = str_df['text'].apply(
    lambda text: [sent for sent in sent_tokenize(text)
                  if any(True for w in word_tokenize(sent) if w.lower() in words)])
It is working, but the problem I am facing is that if there are missing white-spaces after a full stop in the text, the tokenizer does not split there and I get the whole run back as one sentence.
Example:
searched_words = ['snakes', 'venomous']
text = "I am afraid of snakes.I hate them."
output: ['I am afraid of snakes.I hate them.']
Desired output: ['I am afraid of snakes.']
If all tokenizers (including nltk) fail you, you can take matters into your own hands and try:
import re

s = 'I am afraid of snakes.I hate venomous them. Theyre venomous.'

def findall(s, p):
    return [m.start() for m in re.finditer(p, s)]

def find(sent, word):
    res = []
    indexes = findall(sent, word)
    for index in indexes:
        i = index
        while i > 0:
            if sent[i] != '.':
                i -= 1
            else:
                break
        end = index + len(word)
        nextFullStop = end + sent[end:].find('.')
        res.append(sent[i:nextFullStop])
        i = 0
    return res
Play with it here. There are some dots left in the results, as I do not know what exactly you want to do with them.
What it does is find all occurrences of the given word and take the sentence all the way back to the previous dot. It handles this edge case only, but you can tune it easily to your specific needs.
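For example, running the helper above on the sample string s gives (output traced from the code above, leading dots included):
print(find(s, 'venomous'))
# ['.I hate venomous them', '. Theyre venomous']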
I am a beginner, been learning python for a few months as my very first programming language. I am looking to find a pattern from a text file. My first attempt has been using regex, which does work but has a limitation:
import re

noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
CC_list = ['and', 'or']

noun_list_pattern1 = r'\b\w+\b,\s\b\w+\b,\sand\s\b\w+\b|\b\w+\b,\s\b\w+\b,\sor\s\b\w+\b|\b\w+\b,\s\b\w+\b\sand\s\b\w+\b|\b\w+\b,\s\b\w+\b,\saor\s\b\w+\b'

with open('test_sentence.txt', 'r') as input_f:
    read_input = input_f.read()

word = re.findall(noun_list_pattern1, read_input)
for w in word:
    print w
else:
    pass
So at this point you may be asking why the lists are in this code, since they are not being used. Well, I have been racking my brains out, trying all sorts of for loops and if statements in functions, to find a way to replicate the regex pattern, but using the lists.
The limitation with regex is that the \b\w+\b part, which appears a number of times in `noun_list_pattern1`, actually only finds words - any words - but not specific nouns. This could produce false positives. I want to narrow things down more by using the elements in the lists above instead of the regex.
Since I actually have 4 different regexes in the pattern (it contains 4 alternatives separated by |), I will just go with 1 of them here. So I would need to find a pattern such as:
'noun in noun_list' + ', ' + 'noun in noun_list' + ', ' + 'C in CC_list' + ' ' + 'noun in noun_list'
Obviously, the quoted line above is not real python code, but an expression of my thoughts about the match needed. Where I say noun in noun_list I mean an iteration through the noun_list; C in CC_list is an iteration through the CC_list; ', ' is a literal string match for a comma and whitespace.
Hopefully I have made myself clear!
Here is the content of the test_sentence.txt file that I am using:
I need to buy are bacon, cheese and eggs.
I also need to buy milk, cheese, and bacon.
What's your favorite: milk, cheese or eggs.
What's my favorite: milk, bacon, or eggs.
Break your problem down a little. First, you need a pattern that will match the words from your list, but no others. You can accomplish that with the alternation operator | and the literal words. red|green|blue, for example, will match "red", "green", or "blue", but not "purple". Join the noun list with that character, and add the word boundary metacharacters along with parentheses to group the alternations:
noun_patt = r'\b(' + '|'.join(nouns) + r')\b'
Do the same for your list of conjunctions:
conj_patt = r'\b(' + '|'.join(conjunctions) + r')\b'
The overall match you want to make is "one or more noun_patt match, each optionally followed by a comma, followed by a match for the conj_patt and then one more noun_patt match". Easy enough for a regex:
patt = r'({0},? )+{1} {0}'.format(noun_patt, conj_patt)
You don't really want to use re.findall(), but re.search(), since you're only expecting one match per line:
>>> for line in lines:
...     print re.search(patt, line).group(0)
...
bacon, cheese and eggs
milk, cheese, and bacon
milk, cheese or eggs
milk, bacon, or eggs
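Putting the pieces together, here is a self-contained sketch of the above, reusing the lists and the test file contents from the question (the lines are simply hard-coded here instead of being read from test_sentence.txt):
import re

nouns = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']

noun_patt = r'\b(' + '|'.join(nouns) + r')\b'
conj_patt = r'\b(' + '|'.join(conjunctions) + r')\b'
patt = r'({0},? )+{1} {0}'.format(noun_patt, conj_patt)

lines = [
    "I need to buy are bacon, cheese and eggs.",
    "I also need to buy milk, cheese, and bacon.",
    "What's your favorite: milk, cheese or eggs.",
    "What's my favorite: milk, bacon, or eggs.",
]

for line in lines:
    match = re.search(patt, line)
    if match:  # guard against lines that contain no list at all
        print(match.group(0))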
As a note, you're close to, if not rubbing up against, the limits of regular expressions as far as parsing English. Any more complex than this, and you will want to look into actual parsing, perhaps with NLTK.
In actuality, you don't necessarily need regular expressions, as there are a number of ways to do this using just your original lists.
noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']

# This assumes that the file has been read into a list of newline-delimited lines called `rawlines`.
for line in rawlines:
    matches = [noun for noun in noun_list if noun in line] + [conj for conj in conjunctions if conj in line]
    if len(matches) == 4:
        for match in matches:
            print match
The reason for requiring exactly 4 matches is that 4 is the number of words to find in each of your example lines: three nouns plus one conjunction. (Note that repeated nouns or conjunctions in a line could also satisfy this check.)
EDIT:
This version prints the lines that are matched and the words matched. Also fixed the possible multiple word match problem:
words_matched = []
matching_lines = []
for l in lst:
    matches = [noun for noun in noun_list if noun in l] + [conj for conj in conjunctions if conj in l]
    invalid = True
    valid_count = 0
    for match in matches:
        if matches.count(match) == 1:
            valid_count += 1
    if valid_count == len(matches):
        invalid = False
    if not invalid:
        words_matched.append(matches)
        matching_lines.append(l)

for line, matches in zip(matching_lines, words_matched):
    print line, matches
However, if this doesn't suit you, you can always build the regex as follows (using the itertools module):
#The number of permutations choices is 3 (as revealed from your examples)
for nouns, conj in itertools.product(itertools.permutations(noun_list, 3), conjunctions):
matches = [noun for noun in nouns]
matches.append(conj)
#matches[:2] is the sublist containing the first 2 items, -1 is the last element, and matches[2:-1] is the element before the last element (if the number of nouns were more than 3, this would be the elements between the 2nd and last).
regex_string = '\s,\s'.join(matches[:2]) + '\s' + matches[-1] + '\s' + '\s,\s'.join(matches[2:-1])
print regex_string
#... do regex related matching here
The caveat of this method is that it is pure brute force, as it generates all the possible combinations (read: permutations) of both lists, which can then be tested against each line. Hence, it is horrendously slow, but for examples like the ones given (with no comma before the conjunction), it will generate exact matches.
Adapt as required.