Find lots of strings in text - Python

I'm searching for the best algorithm to solve this problem: given a list (or a dict, or a set) of small sentences, find all occurrences of these sentences in a bigger text. The sentences in the list (or dict, or set) number about 600k and are formed, on average, of 3 words. The text is, on average, 25 words long. I've already normalized the text (removing punctuation, lowercasing everything, and so on).
Here is what I have tried out (Python):
to_find_sentences = [
    'bla bla',
    'have a tea',
    'hy i m luca',
    'i love android',
    'i love ios',
    # ...
]

text = 'i love android and i think i will have a tea with john'

def find_sentence(to_find_sentences, text):
    words = text.split()
    res = []
    w = len(words)
    for i in range(w):
        for j in range(i + 1, w + 1):
            tmp = ' '.join(words[i:j])
            if tmp in to_find_sentences:
                res.append(tmp)
    return res

print(find_sentence(to_find_sentences, text))
Out:
['i love android', 'have a tea']
In my case I've used a set to speed up the in operation

A fast solution would be to build a Trie out of your sentences and convert this trie to a regex. For your example, the pattern would look like this:
(?:bla\ bla|h(?:ave\ a\ tea|y\ i\ m\ luca)|i\ love\ (?:android|ios))
It might be a good idea to add '\b' as word boundaries, to avoid matching "have a team".
You'll need a small Trie script. It isn't an official package yet, but you can simply download it here and save it as trie.py in your current directory.
You can then use this code to generate the trie/regex:
import re
from trie import Trie
to_find_sentences = [
    'bla bla',
    'have a tea',
    'hy i m luca',
    'i love android',
    'i love ios',
]

trie = Trie()
for sentence in to_find_sentences:
    trie.add(sentence)

print(trie.pattern())
# (?:bla\ bla|h(?:ave\ a\ tea|y\ i\ m\ luca)|i\ love\ (?:android|ios))
pattern = re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)
text = 'i love android and i think i will have a tea with john'
print(re.findall(pattern, text))
# ['i love android', 'have a tea']
You invest some time to create the Trie and the regex, but the processing should be extremely fast.
Here's a related answer (Speed up millions of regex replacements in Python 3) if you want more information.
Note that it wouldn't find overlapping sentences:
to_find_sentences = [
    'i love android',
    'android Marshmallow'
]
# ...
print(re.findall(pattern, "I love android Marshmallow"))
# ['I love android']
You'd have to modify the regex with positive lookaheads to find overlapping sentences.
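For example, here is a rough sketch of that lookahead idea (assuming the trie has been rebuilt from the two sentences above, as the elided step suggests): wrapping the alternation in (?=(...)) makes every match zero-width, so overlapping sentences are all reported through the capture group.
# Sketch only: wrap the trie's alternation in a capturing lookahead so that
# overlapping matches are all collected.
overlap_pattern = re.compile(r"(?=\b(" + trie.pattern() + r")\b)", re.IGNORECASE)
print(overlap_pattern.findall("I love android Marshmallow"))
# ['I love android', 'android Marshmallow']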

Related

How to replace multiple substrings in a list of sentences using regex in python?

I have a list of sentences as below:
sentences = ["I am learning to code", "coding seems to be intresting in python", "how to code in python", "practicing how to code is the key"]
Now I wish to replace a few substrings in this list of sentences using a dictionary of words and their replacements.
word_list = {'intresting': 'interesting', 'how to code': 'learning how to code', 'am learning':'love learning', 'in python': 'using python'}
I tried the following code:
replaced_sentences = [' '.join([word_list.get(w, w) for w in sentence.split()])
                      for sentence in sentences]
But only the single-word strings are getting replaced, not the keys with more than one word. This is because I am using sentence.split(), which tokenizes the sentence word by word and so misses substrings longer than one word.
How do I replace these substrings with an exact match using regex, or is there any other suggestion?
expected output:
sentences = ["I love learning to code", "coding seems to be interesting using python", "learning how to code using python", "practicing learning how to code is the key"]
Thanks in advance.
It's probably easiest to read if you break this into a function that replaces all the words for a single sentence. Then you can apply it to all the sentences in the list. Here we make a single regex by joining all the keys of the dict with '|'. Then we use re.sub to grab the value associated with the matched key and return it as the replacement.
import re

def replace_words(s, word_lookup):
    rx = '|'.join(word_lookup.keys())
    return re.sub(rx, lambda match: word_lookup[match.group(0)], s)

[replace_words(s, word_list) for s in sentences]
This will result in:
['I love learning to code',
'coding seems to be interesting using python',
'learning how to code using python',
'practicing learning how to code is the key']
You could optimize a bit by compiling the regex once instead of building it each time the function is called. This would allow you to do something like:
import re
rx = re.compile('|'.join(word_list.keys()))
[rx.sub(lambda match: word_list[match.group(0)], s) for s in sentences]
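One caveat: the keys go into the pattern verbatim, so a key containing regex metacharacters would misbehave, and a key could also match inside a longer word. A slightly more defensive variant (a sketch, assuming you want whole-word matches) escapes each key and adds word boundaries:
import re

# Sketch: escape each key and anchor the alternation on word boundaries.
rx = re.compile(r'\b(?:' + '|'.join(map(re.escape, word_list.keys())) + r')\b')
replaced_sentences = [rx.sub(lambda match: word_list[match.group(0)], s) for s in sentences]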

How to replace multiple words with single word using dictionary in python?

I have a dictionary of keys and multiple values as below:
word_list = {'cool':['better','good','best','great'], 'car':['vehicle','moving','automobile','four-wheeler'], 'sound':['noise', 'disturbance', 'rattle']}
sentences = ['that day I heard a vehicle noise not a great four-wheeler', 'the automobile industry is doing good these days', 'that moving noise is better now']
As I have multiple values for a given key, if any of these values appear in the sentences, I want to replace them with the associated key.
I tried the following, but did not get the desired output.
results= [' '.join(word_list.get(y, y) for y in w.split()) for w in sentences]
Desired output:
['that day I heard a car sound not a cool car', 'the car industry is doing cool these days', 'that car sound is better now']
Not sure how to achieve this.
The trick is to create an inverted mapping, in which each value from the original dict becomes a key and the original key becomes its value.
After that it's easy: you just iterate over each word of each sentence and replace it with the value from that inverted mapping whenever the word is one of the mapping's keys.
word_list = {
    'cool': ['better', 'good', 'best', 'great'],
    'car': ['vehicle', 'moving', 'automobile', 'four-wheeler'],
    'sound': ['noise', 'disturbance', 'rattle']
}
sentences = [
    'that day I heard a vehicle noise not a great four-wheeler',
    'the automobile industry is doing good these days',
    'that moving noise is better now'
]

swapped_word_list = {
    word: replacement
    for replacement, words in word_list.items()
    for word in words
}

new_sentences = [
    ' '.join([
        swapped_word_list.get(word, word)
        for word in sentence.split()
    ])
    for sentence in sentences
]
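As a quick check, printing new_sentences with the inputs above should give the following (worked out by hand from the mapping; it matches what the regex version below prints):
for s in new_sentences:
    print(s)
# that day I heard a car sound not a cool car
# the car industry is doing cool these days
# that car sound is cool now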
a solution using regex & reduce, because why not:
create a list of mappings, each pairing a regex pattern that matches all the words with the replacement word
apply all the mappings to each string in turn using reduce
Note: the rf prefix before the string specifies that it is a raw f-string.
from functools import reduce
import re

mappings = [
    {'pat': rf'\b({"|".join(words)})\b', 'rep': rep}
    for rep, words in word_list.items()
]

cleaned_sentences = [
    reduce(lambda s, m: re.sub(m['pat'], m['rep'], s), mappings, sentence)
    for sentence in sentences
]

for s in cleaned_sentences:
    print(s)
# outputs:
that day I heard a car sound not a cool car
the car industry is doing cool these days
that car sound is cool now

replace any words in string that match an entry in list with a single tag (python)

I have a list of sentences (~100k sentences total) and a list of "infrequent words" (length ~20k). I would like to run through each sentence and replace any word that matches an entry in "infrequent_words" with the tag "UNK".
So, as a small example, if
infrequent_words = ['dog','cat']
sentence = 'My dog likes to chase after cars'
Then after applying the transformation it should be
sentence = 'My unk likes to chase after cars'
I am having trouble finding an efficient way to do this. This function below (applied to each sentence) works, but it is very slow and I know there must be something better. Any suggestions?
def replace_infrequent_words(text, infrequent_words):
    for word in infrequent_words:
        text = text.replace(word, 'unk')
    return text
Thank you!
infrequent_words = {'dog', 'cat'}
sentence = 'My dog likes to chase after cars'

def replace_infrequent_words(text, infrequent_words):
    words = text.split()
    for i in range(len(words)):
        if words[i] in infrequent_words:
            words[i] = 'unk'
    return ' '.join(words)

print(replace_infrequent_words(sentence, infrequent_words))
Two things that should improve performance:
Use a set instead of a list for storing infrequent_words.
Use a list to store each word in text so you don't have to scan the entire text string with each replacement.
This doesn't account for grammar and punctuation, but it should be a performance improvement over what you posted.
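For what it's worth, both suggestions fold into a compact sketch like this (assuming, as above, that infrequent_words is already a set):
def replace_infrequent_words(text, infrequent_words):
    # Split once, test each word against the set, and join the result back.
    return ' '.join('unk' if word in infrequent_words else word for word in text.split())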

filtering stopwords near punctuation

I am trying to filter out stopwords in my text like so:
clean = ' '.join([word for word in text.split() if word not in (stopwords)])
The problem is that text.split() has elements like 'word.' that don't match the stopword 'word'.
I later use clean in sent_tokenize(clean), however, so I don't want to get rid of the punctuation altogether.
How do I filter out stopwords while retaining punctuation, but filtering words like 'word.'?
I thought it would be possible to change the punctuation:
text = text.replace('.',' . ')
and then
clean = ' '.join([word for word in text.split() if word not in stopwords or word == "."])
But is there a better way?
Tokenize the text first, then clean it of stopwords. A tokenizer usually recognizes punctuation.
import nltk

text = ('Son, if you really want something in this life, '
        'you have to work for it. Now quiet! They are about '
        'to announce the lottery numbers.')
stopwords = ['in', 'to', 'for', 'the']

sents = []
for sent in nltk.sent_tokenize(text):
    tokens = nltk.word_tokenize(sent)
    sents.append(' '.join([w for w in tokens if w not in stopwords]))

print(sents)
['Son , if you really want something this life , you have work it .', 'Now quiet !', 'They are about announce lottery numbers .']
You could use something like this:
import re
clean = ' '.join([word for word in text.split() if re.match('([a-z]|[A-Z])+', word).group().lower() not in (stopwords)])
This pulls out the leading run of lowercase and uppercase ASCII letters from each token and checks it against the words in your stopwords set or list. It also assumes that all of your words in stopwords are lowercase, which is why I converted the word to lowercase; take that out if it is too strong an assumption.
Also, I'm not proficient in regex, so sorry if there's a cleaner or more robust way of doing this.
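If you go this route, note that re.match returns None for a token with no letters at all (say a stray "..." or a number), and .group() would then raise an AttributeError. Below is a slightly more defensive sketch of the same idea, using a hypothetical remove_stopwords helper and the same all-lowercase-stopwords assumption:
import re

def remove_stopwords(text, stopwords):
    kept = []
    for word in text.split():
        m = re.match(r'[A-Za-z]+', word)
        # Keep tokens with no letters (e.g. bare punctuation); otherwise keep the
        # token only if its leading letters are not a stopword.
        if m is None or m.group().lower() not in stopwords:
            kept.append(word)
    return ' '.join(kept)

clean = remove_stopwords(text, stopwords)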

String splitting problem with multiword expressions

I have a series of strings like:
'i would like a blood orange'
I also have a list of strings like:
["blood orange", "loan shark"]
Operating on the string, I want the following list:
["i", "would", "like", "a", "blood orange"]
What is the best way to get the above list? I've been using re throughout my code, but I'm stumped with this issue.
This is a fairly straightforward generator implementation: split the string into words, group together words which form phrases, and yield the results.
(There may be a cleaner way to handle skip, but for some reason I'm drawing a blank.)
def split_with_phrases(sentence, phrase_list):
    words = sentence.split(" ")
    phrases = set(tuple(s.split(" ")) for s in phrase_list)
    print(phrases)
    max_phrase_length = max(len(p) for p in phrases)

    # Find a phrase within words starting at the specified index. Return the
    # phrase as a tuple, or None if no phrase starts at that index.
    def find_phrase(start_idx):
        # Iterate backwards, so we'll always find longer phrases before shorter ones.
        # Otherwise, if we have a phrase set like "hello world" and "hello world two",
        # we'll never match the longer phrase because we'll always match the shorter
        # one first.
        for phrase_length in range(max_phrase_length, 0, -1):
            test_word = tuple(words[start_idx:start_idx + phrase_length])
            if test_word in phrases:
                return test_word
        return None

    skip = 0
    for idx in range(len(words)):
        if skip:
            # This word was returned as part of a previous phrase; skip it.
            skip -= 1
            continue
        phrase = find_phrase(idx)
        if phrase is not None:
            # Skip the remaining words of the phrase we just yielded.
            skip = len(phrase) - 1
            yield " ".join(phrase)
            continue
        yield words[idx]

print([s for s in split_with_phrases('i would like a blood orange',
                                     ["blood orange", "loan shark"])])
Ah, this is crazy, crude and ugly. But looks like it works. You may wanna clean and optimize it but certain ideas here might work.
list_to_split = ['i would like a blood orange', 'i would like a blood orange ttt blood orange']
input_list = ["blood orange", "loan shark"]

for item in input_list:
    for str_lst in list_to_split:
        if item in str_lst:
            tmp = str_lst.split(item)
            lst = []
            for itm in tmp:
                if itm != '':
                    lst.append(itm)
                    lst.append(item)
            print(lst)
output:
['i would like a ', 'blood orange']
['i would like a ', 'blood orange', ' ttt ', 'blood orange']
One quick and dirty, completely un-optimized approach might be to just replace the compounds in the string with a version that uses a different separator (preferably one that does not occur anywhere else in your target string or compound words), then split and replace back. A more efficient approach would be to iterate only once through the string, matching the compound words where appropriate, but you may have to watch out for instances where there are nested compounds, etc., depending on your array.
#!/usr/bin/python
import re

my_string = "i would like a blood orange"
compounds = ["blood orange", "loan shark"]

for i in range(0, len(compounds)):
    my_string = my_string.replace(compounds[i], compounds[i].replace(" ", "&"))

my_segs = re.split(r"\s+", my_string)
for i in range(0, len(my_segs)):
    my_segs[i] = my_segs[i].replace("&", " ")

print(my_segs)
Edit: Glenn Maynard's solution is better.
