I have a text string and I want to replace two words with a single word, e.g. if the phrase is 'artificial intelligence', I want to replace it with 'artificial_intelligence'. This needs to be done for a list of 200 phrases and on a text file of about 5 MB.
I tried string.replace, but it only works for one element, not for a list.
Example:
Text = 'Artificial intelligence is useful for us in every situation of deep learning.'
list_a : list_b
Artificial intelligence : artificial_intelligence
Deep learning : deep_learning
...
Text.replace('Artificial intelligence', 'Artificial_intelligence') works, but
for i in range(len(list_a)):
    Text = Text.replace(list_a[i], list_b[i])
doesn't.
I would suggest using a dict for your replacements:
text = "Artificial intelligence is useful for us in every situation of deep learning."
replacements = {"Artificial intelligence": "Artificial_intelligence",
                "deep learning": "deep_learning"}
Then your approach works (although it is case-sensitive):
>>> for rep in replacements:
...     text = text.replace(rep, replacements[rep])
>>> print(text)
Artificial_intelligence is useful for us in every situation of deep_learning.
For other approaches (like the suggested regex-approach), have a look at SO: Python replace multiple strings.
Since you have a case problem between your list entries and your string, you could use the re.sub() function with the IGNORECASE flag to obtain what you want:
import re
list_a = ['Artificial intelligence', 'Deep learning']
list_b = ['artificial_intelligence', 'deep_learning']
text = 'Artificial intelligence is useful for us in every situation of deep learning.'
for from_, to in zip(list_a, list_b):
    text = re.sub(from_, to, text, flags=re.IGNORECASE)
print(text)
# artificial_intelligence is useful for us in every situation of deep_learning.
Note the use of the zip() function, which allows you to iterate over the two lists at the same time.
Also note that Christian is right: a dict would be more suitable for your substitution data. The previous code would then become the following, with the exact same result:
import re
subs = {'Artificial intelligence': 'artificial_intelligence',
        'Deep learning': 'deep_learning'}
text = 'Artificial intelligence is useful for us in every situation of deep learning.'
for from_, to in subs.items():
    text = re.sub(from_, to, text, flags=re.IGNORECASE)
print(text)
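For 200 phrases and a 5 MB file, running 200 separate re.sub() passes adds up. One variation (a sketch, not benchmarked on your data) is to build a single case-insensitive pattern from all the phrases with re.escape() and do the replacements in one pass, looking each match up in the dict:
import re

subs = {'artificial intelligence': 'artificial_intelligence',
        'deep learning': 'deep_learning'}   # extend with your 200 pairs (lowercase keys)

pattern = re.compile('|'.join(re.escape(k) for k in subs), flags=re.IGNORECASE)

text = 'Artificial intelligence is useful for us in every situation of deep learning.'
text = pattern.sub(lambda m: subs[m.group(0).lower()], text)
print(text)
# artificial_intelligence is useful for us in every situation of deep_learning.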
I have a large set of long text documents with punctuation. Three short examples are provided here:
doc = ["My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you?", "My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you love dogs?", "My house, the most beautiful!, is NEAR the #sea. I really love holidays, do you?"]
and I have sets of words like the following:
wAND = set(["house", "near"])
wOR = set(["seaside"])
wNOT = set(["dogs"])
I want to search all text documents that meet the following condition:
(any(w in doc for w in wOR) or not wOR) and (all(w in doc for w in wAND) or not wAND) and (not any(w in doc for w in wNOT) or not wNOT)
The or not condition in each parenthesis is needed as the three lists could be empty. Please notice that before applying the condition I also need to clean text from punctuation, transform it to lowercase, and split it into a set of words, which requires additional time.
This process would match the first text in doc but not the second or the third: the second does not match because it contains the word "dogs", and the third because it does not include the word "seaside".
I am wondering whether this general problem (with the words in the wOR, wAND and wNOT lists changing) can be solved in a faster way, avoiding the text pre-processing for cleaning. Maybe with a fast regex solution, perhaps one that uses a trie. Is that possible, or is there any other suggestion?
Your solution appears to be linear in the length of the document - you won't be able to get any better than this without sorting, as the words you're looking for could be anywhere in the document. You could try using one loop over the entire doc:
def check(doc_words):
    remaining_and = set(wAND)       # work on a copy so the global set isn't destroyed
    or_satisfied = not wOR          # an empty wOR is trivially satisfied
    for word in doc_words:
        if word in remaining_and:
            remaining_and.remove(word)
        if not or_satisfied and word in wOR:
            or_satisfied = True
        if word in wNOT:
            return False
    return or_satisfied and not remaining_and
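For example, you could run it against the question's documents like this (the clean() helper is just one possible way to do the cleaning the question mentions):
import re

def clean(document):
    # lowercase, drop punctuation, split into a set of words
    return set(re.sub(r"[^\w\s]", " ", document).lower().split())

print([check(clean(d)) for d in doc])   # expected: [True, False, False]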
You can build regexps for the word bags you have, and use them:
import re

def make_re(word_set):
    return re.compile(
        r'\b(?:{})\b'.format('|'.join(re.escape(word) for word in word_set)),
        flags=re.I,
    )
wAND_re = make_re(wAND)
wOR_re = make_re(wOR)
wNOT_re = make_re(wNOT)
def re_match(doc):
    if not wOR_re.search(doc):
        return False
    if wNOT_re.search(doc):
        return False
    found = set()
    expected = len(wAND)
    for match in wAND_re.finditer(doc):
        found.add(match.group(0).lower())
        if len(found) == expected:
            break
    return len(found) == expected
A quick time test seems to say this is about 89% faster than the original (and passes the original "test suite"), likely due to the fact that
documents don't need to be cleaned (since the \bs limit matches to words and re.I deals with case normalization)
regexps are run in native code, which tends to always be faster than Python
name='original' iters=10000 time=0.206 iters_per_sec=48488.39
name='re_match' iters=20000 time=0.218 iters_per_sec=91858.73
name='bag_match' iters=10000 time=0.203 iters_per_sec=49363.58
where bag_match is my original comment suggestion of using set intersections:
def bag_match(doc):
    bag = set(clean_doc(doc))
    return (
        (bag.intersection(wOR) or not wOR) and
        (bag.issuperset(wAND) or not wAND) and
        (not bag.intersection(wNOT) or not wNOT)
    )
If you have already cleaned the documents to an iterable of words (here I just slapped @lru_cache on clean_doc, which you probably wouldn't do in real life since your documents are likely to all be unique and caching wouldn't help), then bag_match is much faster:
name='orig-with-cached-clean-doc' iters=50000 time=0.249 iters_per_sec=200994.97
name='re_match-with-cached-clean-doc' iters=20000 time=0.221 iters_per_sec=90628.94
name='bag_match-with-cached-clean-doc' iters=100000 time=0.265 iters_per_sec=377983.60
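clean_doc isn't shown above; for reference, the benchmark presumably used something along these lines (this exact implementation is my assumption):
import re

def clean_doc(doc):
    # lowercase a document and split it into bare words, dropping punctuation
    return re.findall(r'\w+', doc.lower())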
I have two Python lists: one contains about 13,000 disallowed phrases, and the other contains about 10,000 sentences.
phrases = [
    "phrase1",
    "phrase2",
    "phrase with spaces",
    # ...
]
sentences = [
    "sentence",
    "some sentences are longer",
    "some sentences can be really really ... really long, about 1000 characters.",
    # ...
]
I need to check every sentence in the sentences list to see if it contains any phrase from the phrases list. If it does, I want to put ** around the phrase and add the sentence to another list. I also need to do this in the fastest possible way.
This is what I have so far:
import re

newlist = []
for sentence in sentences:
    for phrase in phrases:
        if phrase in sentence.lower():
            iphrase = re.compile(re.escape(phrase), re.IGNORECASE)
            newsentence = iphrase.sub("**" + phrase + "**", sentence)
            newlist.append(newsentence)
So far this approach takes about 60 seconds to complete.
I tried using multiprocessing (each sentence's for loop was mapped separately) however this yielded even slower results. Given that each process was running at about 6% CPU usage, it appears the overhead makes mapping such a small task to multiple cores not worth it. I thought about separating the sentences list into smaller chunks and mapping those to separate processes, but haven't quite figured out how to implement this.
I've also considered using a binary search algorithm but haven't been able to figure out how to use this with strings.
So essentially, what would be the fastest possible way to perform this check?
Build your regex once, sorting by longest phrase first so that the **s wrap the longest matching phrases rather than the shortest. Then perform the substitution and filter out the sentences where no substitution was made, e.g.:
phrases = [
    "phrase1",
    "phrase2",
    "phrase with spaces",
    'can be really really',
    'characters',
    'some sentences',
    # ...
]
sentences = [
    "sentence",
    "some sentences are longer",
    "some sentences can be really really ... really long, about 1000 characters.",
    # ...
]
# Build the regex string required
rx = '({})'.format('|'.join(re.escape(el) for el in sorted(phrases, key=len, reverse=True)))
# Generator to yield replaced sentences
it = (re.sub(rx, r'**\1**', sentence) for sentence in sentences)
# Build list of paired new sentences and old to filter out where not the same
results = [new_sentence for old_sentence, new_sentence in zip(sentences, it) if old_sentence != new_sentence]
Gives you a result of:
['**some sentences** are longer',
'**some sentences** **can be really really** ... really long, about 1000 **characters**.']
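If case-insensitivity matters (the question lowercases each sentence before checking), one variation is to compile the pattern once with re.IGNORECASE; a sketch along the same lines:
import re

pattern = re.compile(
    '({})'.format('|'.join(re.escape(p) for p in sorted(phrases, key=len, reverse=True))),
    flags=re.IGNORECASE,
)

results = []
for sentence in sentences:
    new_sentence = pattern.sub(r'**\1**', sentence)
    if new_sentence != sentence:
        results.append(new_sentence)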
What about a set comprehension?
found = {'**' + p + '**' for s in sentences for p in phrases if p in s}
You could try updating (reducing) the phrases list if you don't mind altering it:
found = []
p = phrases[:]  # shallow copy for modification
for s in sentences:
    for i in range(len(phrases)):
        phrase = phrases[i]
        if phrase in s:
            p.remove(phrase)
            found.append('**' + phrase + '**')
    phrases = p[:]
Basically, each pass reduces the phrases container: we iterate through the latest container until we find a phrase that is in at least one sentence.
We remove it from the copied list, and once we have checked the latest phrases, we update the container with the reduced subset of phrases (those that haven't been seen yet). We do this since we only need to see a phrase once; checking it again (although it may exist in another sentence) is unnecessary.
I know you can use noun extraction to get nouns out of sentences but how can I use sentence overlays/maps to take out phrases?
For example:
Sentence Overlay:
"First, #action; Second, Foobar"
Input:
"First, Dance and Code; Second, Foobar"
I want to return:
action = "Dance and Code"
Normal noun extraction won't work because it won't always be nouns.
The way sentences are phrased differs, so it can't be words[x], because the positioning of the words changes.
You can slightly rewrite your string templates to turn them into regexps, and see which one (or which ones) match.
>>> template = "First, (?P<action>.*); Second, Foobar"
>>> mo = re.search(template, "First, Dance and Code; Second, Foobar")
>>> if mo:
...     print(mo.group("action"))
Dance and Code
You can even transform your existing strings into this kind of regexp (after escaping regexp metacharacters like .?*()).
>>> template = "First, #action; (Second, Foobar...)"
>>> re_template = re.sub(r"\\#(\w+)", r"(?P<\g<1>>.*)", re.escape(template))
>>> print(re_template)
First\,\ (?P<action>.*)\;\ \(Second\,\ Foobar\.\.\.\)
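To extend this to several overlays, a small helper (the name template_to_regex and the extra sample template are mine, just for illustration) could convert each template and try them in turn; note that whether re.escape() puts a backslash before # depends on the Python version, so the pattern below accepts both forms:
import re

def template_to_regex(template):
    # escape regex metacharacters, then turn each "#name" placeholder
    # into a named capture group (re.escape may or may not escape '#')
    return re.sub(r"\\?#(\w+)", r"(?P<\1>.*)", re.escape(template))

templates = [
    "First, #action; Second, Foobar",
    "Goal: #goal (due #deadline)",     # a made-up second overlay
]

text = "First, Dance and Code; Second, Foobar"
for template in templates:
    mo = re.fullmatch(template_to_regex(template), text)
    if mo:
        print(mo.groupdict())   # {'action': 'Dance and Code'}
        break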
I want to know if there is any way that I can un-stem them to a normal form.
The problem is that I have thousands of words in different forms, e.g. eat, eaten, ate, eating, and so on, and I need to count the frequency of each word. All of these (eat, eaten, ate, eating, etc.) will count towards eat, and hence I used stemming.
But the next part of the problem requires me to find similar words in the data, and I am using nltk's synsets to calculate the Wu-Palmer similarity among the words. The problem is that nltk's synsets won't work on stemmed words, or at least in this code they won't: check if two words are related to each other
How should I do it? Is there a way to un-stem a word?
I think an OK approach is something like the one described in https://stackoverflow.com/a/30670993/7127519.
A possible implementation could be something like this:
import re
import string
import nltk
import pandas as pd
stemmer = nltk.stem.porter.PorterStemmer()
A stemmer to use. And here is a text to work with:
complete_text = ''' cats catlike catty cat
stemmer stemming stemmed stem
fishing fished fisher fish
argue argued argues arguing argus argu
argument arguments argument '''
Create a list with the different words:
my_list = []
try:
    aux = complete_text.decode().split()   # Python 2 byte strings
except AttributeError:
    aux = complete_text.split()            # Python 3 str has no .decode()
for i in aux:
    if i not in my_list:
        my_list.append(i.lower())
my_list
with output:
['cats',
'catlike',
'catty',
'cat',
'stemmer',
'stemming',
'stemmed',
'stem',
'fishing',
'fished',
'fisher',
'fish',
'argue',
'argued',
'argues',
'arguing',
'argus',
'argu',
'argument',
'arguments']
And now create the dictionary:
aux = pd.DataFrame(my_list, columns=['word'])
aux['word_stemmed'] = aux['word'].apply(lambda x: stemmer.stem(x))
aux = aux.groupby('word_stemmed').transform(lambda x: ', '.join(x))
aux['word_stemmed'] = aux['word'].apply(lambda x: stemmer.stem(x.split(',')[0]))
aux.index = aux['word_stemmed']
del aux['word_stemmed']
my_dict = aux.to_dict('dict')['word']
my_dict
The output is:
{'argu': 'argue, argued, argues, arguing, argus, argu',
'argument': 'argument, arguments',
'cat': 'cats, cat',
'catlik': 'catlike',
'catti': 'catty',
'fish': 'fishing, fished, fish',
'fisher': 'fisher',
'stem': 'stemming, stemmed, stem',
'stemmer': 'stemmer'}
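With this dictionary in hand, recovering the candidate original forms for a stemmed token is a plain lookup (a small usage sketch based on the code above):
token = stemmer.stem('arguing')   # -> 'argu'
print(my_dict[token])             # -> 'argue, argued, argues, arguing, argus, argu'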
Companion notebook here.
No, there isn't. With stemming, you lose information, not only about the word form (as in eat vs. eats or eaten), but also about the word itself (as in tradition vs. traditional). Unless you're going to use a prediction method to try and predict this information on the basis of the context of the word, there's no way to get it back.
tl;dr: you could use any stemmer you want (e.g.: Snowball) and keep track of what word was the most popular before stemming for each stemmed word by counting occurrences.
You may like this open-source project which uses Stemming and contains an algorithm to do Inverse Stemming:
https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA
On this page of the project, there are explanations of how to do the Inverse Stemming. To sum things up, it works as follows.
First, you stem some documents; here, for example, short French-language strings with their stop words removed:
['sup chat march trottoir',
'sup chat aiment ronron',
'chat ronron',
'sup chien aboi',
'deux sup chien',
'combien chien train aboi']
Then the trick is to have kept, for each stemmed word, a count of the original words that produced it:
{'aboi': {'aboie': 1, 'aboyer': 1},
'aiment': {'aiment': 1},
'chat': {'chat': 1, 'chats': 2},
'chien': {'chien': 1, 'chiens': 2},
'combien': {'Combien': 1},
'deux': {'Deux': 1},
'march': {'marche': 1},
'ronron': {'ronronner': 1, 'ronrons': 1},
'sup': {'super': 4},
'train': {'train': 1},
'trottoir': {'trottoir': 1}}
Finally, you can now guess how to implement this yourself: simply take, for each stemmed word, the original word with the highest count. You can refer to the following implementation, which is available under the MIT License as part of the Multilingual-Latent-Dirichlet-Allocation-LDA project:
https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA/blob/master/lda_service/logic/stemmer.py
Improvements could be made by ditching the non-top reverse words (by using a heap, for example), which would yield just one dict in the end instead of a dict of dicts.
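For illustration, the bookkeeping itself is small; a minimal sketch (my own, not the project's code) using a Counter per stem could look like this:
from collections import Counter, defaultdict
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
counts = defaultdict(Counter)

# while stemming your corpus, remember how often each original form occurred
for word in ["eat", "eaten", "eating", "eats", "eating"]:
    counts[stemmer.stem(word)][word] += 1

# "un-stem" by picking the most frequent original form for each stem
unstem = {stem: c.most_common(1)[0][0] for stem, c in counts.items()}
print(unstem)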
I suspect what you really mean by stem is "tense", as in you want the different tenses of each word to count towards the "base form" of the verb.
Check out the pattern package:
pip install pattern
Then use the en.lemma function to return a verb's base form.
import pattern.en as en
base_form = en.lemma('ate') # base_form == "eat"
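If installing pattern is a problem, a similar result can often be had (this is my own suggestion, not part of the answer above) with NLTK's WordNet lemmatizer, as long as you tell it the part of speech:
from nltk.stem import WordNetLemmatizer   # requires nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('ate', pos='v'))     # typically 'eat'
print(lemmatizer.lemmatize('eating', pos='v'))  # typically 'eat'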
Theoretically, the only way to un-stem is if, prior to stemming, you kept a dictionary of terms or a mapping of some kind and carried this mapping through the rest of your computations. The mapping should capture the position of each unstemmed token; then, when you need to un-stem a token, knowing the original position of the stemmed token lets you trace back and recover the original unstemmed representation.
For the bag-of-words representation this seems computationally intensive and somewhat defeats the purpose of the statistical nature of the BoW approach.
But again, theoretically I believe it could work. I haven't seen it in any implementation, though.
I have extracted the list of sentences from a document. I am pre-processing this list of sentences to make it more sensible, and I am faced with the following problem.
I have sentences such as "more recen t ly the develop ment, wh ich is a po ten t "
I would like to correct such sentences using a lookup dictionary to remove the unwanted spaces.
The final output should be "more recently the development, which is a potent "
I would assume that this is a straightforward task in text preprocessing. I need some pointers on where to look for such approaches. Thanks.
Take a look at word or text segmentation. The problem is to find the most probable split of a string into a group of words. Example:
thequickbrownfoxjumpsoverthelazydog
The most probable segmentation should be of course:
the quick brown fox jumps over the lazy dog
Here's an article including prototypical source code for the problem, using the Google Ngram corpus:
http://jeremykun.com/2012/01/15/word-segmentation/
The key for this algorithm to work is access to knowledge about the world, in this case word frequencies in some language. I implemented a version of the algorithm described in the article here:
https://gist.github.com/miku/7279824
Example usage:
$ python segmentation.py t hequi ckbrownfoxjum ped
thequickbrownfoxjumped
['the', 'quick', 'brown', 'fox', 'jumped']
Using real-world data, even this can be segmented:
$ python segmentation.py lmaoro fll olwt f pwned
lmaorofllolwtfpwned
['lmao', 'rofl', 'lol', 'wtf', 'pwned']
Note that the algorithm is quite slow - it's prototypical.
Another approach using NLTK:
http://web.archive.org/web/20160123234612/http://www.winwaed.com:80/blog/2012/03/13/segmenting-words-and-sentences/
As for your problem, you could just concatenate all the string parts you have to get a single string and then run a segmentation algorithm on it.
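To make the idea concrete, here is a tiny, deliberately simplified segmenter (a toy sketch, not the article's algorithm: it just prefers the split with the fewest pieces instead of scoring with real word probabilities):
from functools import lru_cache

# A toy "dictionary"; a real implementation would score splits with word
# frequencies (e.g. the Google Ngram counts used in the article above).
WORDS = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}
MAX_LEN = max(len(w) for w in WORDS)

@lru_cache(maxsize=None)
def segment(text):
    # return a split of `text`, preferring the one with the fewest pieces
    if not text:
        return ()
    best = None
    for i in range(1, min(len(text), MAX_LEN) + 1):
        head, rest = text[:i], text[i:]
        if head in WORDS:
            candidate = (head,) + segment(rest)
            if best is None or len(candidate) < len(best):
                best = candidate
    return best if best is not None else (text,)

print(segment("thequickbrownfoxjumpsoverthelazydog"))
# ('the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog')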
Your goal is to improve text, not necessarily to make it perfect; so the approach you outline makes sense in my opinion. I would keep it simple and use a "greedy" approach: Start with the first fragment and stick pieces to it as long as the result is in the dictionary; if the result is not, spit out what you have so far and start over with the next fragment. Yes, occasionally you'll make a mistake with cases like the me thod, so if you'll be using this a lot, you could look for something more sophisticated. However, it's probably good enough.
Mainly what you require is a large dictionary. If you'll be using it a lot, I would encode it as a "prefix tree" (a.k.a. trie), so that you can quickly find out whether a fragment is the start of a real word; the nltk provides a Trie implementation.
Since these spurious word breaks are inconsistent, I would also extend my dictionary with words already processed in the current document; you may have seen the complete word earlier, but now it's broken up.
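For the prefix-tree part, a minimal nested-dict trie (a sketch of my own, not nltk's implementation) might look like this:
# A nested-dict trie that can answer both "is this a word?" and
# "could this be the start of a word?". The tiny word list is just
# for illustration; load a real dictionary instead.
END = object()

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def lookup(trie, fragment):
    node = trie
    for ch in fragment:
        if ch not in node:
            return (False, False)        # neither a word nor a prefix of one
        node = node[ch]
    return (END in node, True)           # (is_word, is_prefix)

trie = build_trie(["more", "recently", "the", "development"])
print(lookup(trie, "recen"))      # (False, True): keep gluing fragments on
print(lookup(trie, "recently"))   # (True, True): emit the word
print(lookup(trie, "xyz"))        # (False, False): give up and start over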
--Solution 1:
Let's think of these chunks in your sentence as beads on an abacus, with each bead consisting of a partial string; the beads can be moved left or right to generate the permutations. The position of each fragment is fixed between its two adjacent fragments.
In the current case, the beads would be:
(more)(recen)(t)(ly)(the)(develop)(ment,)(wh)(ich)(is)(a)(po)(ten)(t)
This solves two subproblems:
a) A bead is a single unit, so we do not care about permutations within the bead, i.e. permutations of "more" are not possible.
b) The order of the beads is constant; only the spacing between them changes, i.e. "more" will always come before "recen" and so on.
Now, generate all the permutations of these beads, which will give output like:
morerecentlythedevelopment,which is a potent
morerecentlythedevelopment,which is a poten t
morerecentlythedevelop ment, wh ich is a po tent
morerecentlythedevelop ment, wh ich is a po ten t
morerecentlythe development,whichisapotent
Then score these permutations based on how many words from your relevant dictionary they contain; the most correct results can easily be filtered out.
more recently the development, which is a potent will score higher than morerecentlythedevelop ment, wh ich is a po ten t
Code which does the permutation part of the beads:
import re

def gen_abacus_perms(frags):
    if len(frags) == 0:
        return []
    if len(frags) == 1:
        return [frags[0]]
    prefix_1 = "{0}{1}".format(frags[0], frags[1])
    prefix_2 = "{0} {1}".format(frags[0], frags[1])
    if len(frags) == 2:
        return [prefix_1, prefix_2]
    rem_perms = gen_abacus_perms(frags[2:])
    res = ["{0}{1}".format(prefix_1, x) for x in rem_perms] + \
          ["{0} {1}".format(prefix_1, x) for x in rem_perms] + \
          ["{0}{1}".format(prefix_2, x) for x in rem_perms] + \
          ["{0} {1}".format(prefix_2, x) for x in rem_perms]
    return res

broken = "more recen t ly the develop ment, wh ich is a po ten t"
frags = re.split(r"\s+", broken)
perms = gen_abacus_perms(frags)
print("\n".join(perms))
Demo: http://ideone.com/pt4PSt
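A rough sketch of the scoring step (the small dictionary here is my own placeholder; use your full lookup dictionary instead):
DICTIONARY = {"more", "recently", "the", "development", "which", "is", "a", "potent"}

def score(candidate):
    # count how many space-separated tokens are known dictionary words
    return sum(tok.strip(",.!?").lower() in DICTIONARY for tok in candidate.split())

print(max(perms, key=score))
# more recently the development, which is a potent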
--Solution 2:
I would suggest an alternate approach that makes use of the text analysis intelligence already developed by folks working on similar problems over big corpora of data, relying on dictionaries and grammar, e.g. search engines.
I am not well aware of such public/paid APIs, so my example is based on Google results.
Let's try to use Google:
You can keep feeding your invalid terms to Google over multiple passes, and keep evaluating the results with some score based on your lookup dictionary.
Here are two relevant outputs from using two passes of your text through Google; the output of the first pass is used for the second pass, which gives you the conversion "more recently the development, which is a potent".
To verify the conversion, you will have to use some similarity algorithm and scoring to filter out invalid or not-so-good results.
One crude technique could be a comparison of normalized strings using difflib.
>>> import difflib
>>> import re
>>> input = "more recen t ly the develop ment, wh ich is a po ten t "
>>> output = "more recently the development, which is a potent "
>>> input_norm = re.sub(r'\W+', '', input).lower()
>>> output_norm = re.sub(r'\W+', '', output).lower()
>>> input_norm
'morerecentlythedevelopmentwhichisapotent'
>>> output_norm
'morerecentlythedevelopmentwhichisapotent'
>>> difflib.SequenceMatcher(None,input_norm,output_norm).ratio()
1.0
I would recommend stripping away the spaces and looking for dictionary words to break it down into. There are a few things you can do to make it more accurate.
To get the first word in text with no spaces, try taking the entire string and going through dictionary words from a file (you can download several such files from http://wordlist.sourceforge.net/), the longest ones first, then taking off letters from the end of the string you want to segment. If you want it to work on a big string, you can make it automatically take off letters from the back, so that the string you are looking for the first word in is only as long as the longest dictionary word. This should result in you finding the longest words, and make it less likely to do something like classify "asynchronous" as "a synchronous".
Here is an example that uses raw_input to take in the text to correct and a dictionary file called dictionary.txt:
# load a file with a list of words to break the string up into
with open("dictionary.txt", 'r') as f:
    dictionary = set(line.strip() for line in f)

words = raw_input("enter text to correct spaces on: ")
words = words.replace(" ", "")      # strips away the existing spaces
spaced = []                         # this is the list of newly broken up words
while len(words) > 0:               # loop until all of the text has been broken into words
    matched = False
    for size in range(45, 0, -1):   # try each possible word length, starting from the biggest (45)
        word = words[:size]
        if word in dictionary:      # this is the longest dictionary word at the front
            spaced.append(word)
            words = words[len(word):]   # take the matched word off the front of the text
            matched = True
            break
    if not matched:                 # nothing matched; give up on the first character and move on
        spaced.append(words[0])
        words = words[1:]

print ' '.join(spaced)  # prints the output
If you want it to be even more accurate, you could try using a natural language parsing program, there are several available for python free online.
Here's something really basic:
chunks = []
for chunk in my_str.split():
    chunks.append(chunk)
    joined = ''.join(chunks)
    if is_word(joined):
        print joined,
        del chunks[:]

# deal with left overs
if chunks:
    print ''.join(chunks)
I assume you have a set of valid words somewhere that can be used to implement is_word. You also have to make sure it deals with punctuation. Here's one way to do that:
def is_word(wd):
    if not wd:
        return False
    # Strip off trailing punctuation. There might be stuff in front
    # that you want to strip too, such as open parentheses; this is
    # just to give the idea, not a complete solution.
    if wd[-1] in ',.!?;:':
        wd = wd[:-1]
    return wd in valid_words
You can iterate through a dictionary of words to find the best fit, adding the fragments together while no match is found:
def iterate(word, dictionary):
    # True if the accumulated fragment is a word in the dictionary
    return word.strip(",.!?").lower() in dictionary

sentence = "more recen t ly the develop ment, wh ich is a po ten t "
finished_sentence = []
current = ""
for fragment in sentence.split():
    current += fragment
    if iterate(current, dictionary):        # a match was found
        finished_sentence.append(current)
        current = ""                        # start accumulating the next word
if current:                                 # leftovers that never formed a word
    finished_sentence.append(current)

print(" ".join(finished_sentence))
This should work. For the variable dictionary, download a txt file of every English word, then load it in your program.
My index.py file looks like this:
from wordsegment import load, segment
load()
print(segment('morerecentlythedevelopmentwhichisapotent'))
My index.php file looks like this:
<html>
<head>
<title>py script</title>
</head>
<body>
<h1>Hey There! Python Working Successfully In A PHP Page.</h1>
<?php
$python = `python index.py`;
echo $python;
?>
</body>
</html>
Hope this will work.