For example, I have a string such as
text = '{Hello|Good morning|Hi}{. We|, we} have a {good |best }offer for you.'
How can I generate a set of all possible strings with variants of words in braces?
Hello. We have a good offer for you.
Good morning, we have a best offer for you.
etc...
You can use the re and random modules, like this:
import random
import re

def randomize(match):
    # Split the brace group into its options and pick one at random.
    res = match.group(1).split('|')
    random.shuffle(res)
    return res[0]

def random_sentence(tpl):
    # Replace every {...} group with one randomly chosen option.
    return re.sub(r'{(.*?)}', randomize, tpl)

tpl = '{Hello|Good morning|Hi}{. We|, we} have a {good |best }offer for you.'
print(random_sentence(tpl))
I would use a tree-traversal method to get all possible variants:
import re

text = '{Hello|Good morning|Hi}{. We|, we} have a {good |best }offer for you.'
variants = ['']
elements = re.split(r'([{|}])', text)
inside = False
options = []
for elem in elements:
    if elem == '{':
        inside = True
        continue
    if not inside:
        # Plain text: append it to every variant built so far.
        variants = [v + elem for v in variants]
    if inside and elem not in '|}':
        options.append(elem)
    if inside and elem == '}':
        # End of a group: branch every variant with every option.
        variants = [v + opt for opt in options for v in variants]
        options = []
        inside = False
print(*variants, sep='\n')
Output:
Hello. We have a good offer for you.
Good morning. We have a good offer for you.
Hi. We have a good offer for you.
Hello, we have a good offer for you.
Good morning, we have a good offer for you.
Hi, we have a good offer for you.
Hello. We have a best offer for you.
Good morning. We have a best offer for you.
Hi. We have a best offer for you.
Hello, we have a best offer for you.
Good morning, we have a best offer for you.
Hi, we have a best offer for you.
Explanation: I use re.split to split the string into elements:
['', '{', 'Hello', '|', 'Good morning', '|', 'Hi', '}', '', '{', '. We', '|', ', we', '}', ' have a ', '{', 'good ', '|', 'best ', '}', 'offer for you.']
Then I create a flag, inside, which tracks whether I am currently between { and }, and act accordingly.
If I find {, I set the flag and move on to the next element (continue).
If I am not inside braces, I simply add the given element to every variant.
If I am inside and the element is not } and is not |, I append this element to the options list.
If I am inside and find }, I build new variants from every combination of (one of variants) + (one of options), and variants becomes the result of this operation.
Note that I assume the input string is always well-formed, and that {, }, and | are used solely as control characters.
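For reference, the same enumeration can also be written compactly with itertools.product (a sketch, not part of the answer above): re.split with a capture group alternates literal text and the contents of the brace groups, and the Cartesian product of the options yields every variant.

import itertools
import re

text = '{Hello|Good morning|Hi}{. We|, we} have a {good |best }offer for you.'

# re.split with a capture group alternates literal text (even indices)
# and the contents of {...} groups (odd indices).
parts = re.split(r'{(.*?)}', text)
choices = [part.split('|') if i % 2 else [part] for i, part in enumerate(parts)]
for combo in itertools.product(*choices):
    print(''.join(combo))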
I'm looking for a package, or any other approach (other than manual replacement), for handling option templates within string formatting.
I want to achieve something like this (this is just an example so you could get the idea, not the actual working code):
text = "I {what:like,love} {item:pizza,space,science}".format(what=2,item=3)
print(text)
So the output would be:
I love science
How can I achieve this? I have been searching but cannot find anything appropriate; probably I used the wrong search terms.
If there isn't any ready-to-use package around, I would love to read some tips on a starting point for coding this myself.
I think using a list is sufficient, since Python lists preserve order:
what = ["like","love"]
items = ["pizza","space","science"]
text = "I {} {}".format(what[1],items[2])
print(text)
Output:
I love science
Maybe use a list or a tuple for what and item, as both data types preserve insertion order.
what = ['like', 'love']
item = ['pizza', 'space', 'science']
text = "I {what} {item}".format(what=what[1],item=item[2])
print(text) # I like science
or even this is possible.
text = "I {what[1]} {item[2]}".format(what=what, item=item)
print(text) # I like science
Hope this helps!
Why not use a dictionary?
options = {'what': ('like', 'love'), 'item': ('pizza', 'space', 'science')}
print("I " + options['what'][1] + ' ' + options['item'][2])
This returns: "I love science"
Or if you wanted a method to rid yourself of having to reformat to accommodate/remove spaces, then incorporate this into your dictionary structure, like so:
options = {'what': (' like', ' love'), 'item': (' pizza', ' space', ' science'), 'fullstop': '.'}
print("I" + options['what'][0] + options['item'][0] + options['fullstop'])
And this returns: "I like pizza."
Since no one has provided an answer that addresses my question directly, I decided to work on this myself.
I had to use double braces, because single ones are reserved for string formatting.
I ended up with the following class:
import re

class ArgTempl:
    def __init__(self, _str):
        self._str = _str

    def format(self, **args):
        for k in re.finditer(r"{{(\w+):([\w,]+?)}}", self._str,
                             flags=re.DOTALL | re.MULTILINE | re.IGNORECASE):
            key, replacements = k.groups()
            if key not in args:
                continue
            self._str = self._str.replace(k.group(0),
                                          replacements.split(',')[args[key]])
        return self._str
This is primitive code, written in 5 minutes, so it lacks checks and so on. It works as expected and can be improved easily.
Tested on Python 2.7 & 3.6~
Usage:
test = "I {{what:like,love}} {{item:pizza,space,science}}"
print(ArgTempl(test).format(what=1, item=2))
> I love science
Thanks for all of the replies.
I have a MongoDB query that searches for addresses. The problem is that if a user accidentally adds an extra space, the query will not find the address. For example, if the user types "123  Fakeville St" (with two spaces) instead of "123 Fakeville St", the query will not return any results.
Is there a simple way to deal with this issue, perhaps using $regex? I guess the space would need to be ignored between the house number (123) and the street name (Fakeville). My query is set up like this:
@app.route('/getInfo', methods=['GET'])
def getInfo():
    address = request.args.get("a")
    addressCollection = myDB["addresses"]
    addressJSON = []
    regex = "^" + address
    for address in addressCollection.find({'Address': {'$regex': regex, '$options': 'i'}},
                                          {"Address": 1, "_id": 0}).limit(3):
        addressJSON.append({"Address": address["Address"]})
    return jsonify(addresses=addressJSON)
Clean up the query before sending it off:
>>> import re
>>> re.sub(r'\s+', ' ', '123  abc')
'123 abc'
>>> re.sub(r'\s+', ' ', '123 abc  def   ghi')
'123 abc def ghi'
You'll probably want to make sure that the data in your database is similarly normalised. Also consider similar strategies for things like punctuation.
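For instance, a small helper along these lines could be applied to the incoming parameter before building the regex (a sketch; the route and parameter names are the ones from the question):

import re

def normalise_address(raw):
    # Collapse runs of whitespace and trim, so '123  Fakeville St'
    # and '123 Fakeville St' yield the same query prefix.
    return re.sub(r'\s+', ' ', raw).strip()

# In the route from the question:
#   address = normalise_address(request.args.get("a"))
#   regex = "^" + re.escape(address)  # re.escape guards against regex metacharacters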
In fact, using a regex for this seems overly strict, as well as reinventing the wheel. Consider using a proper search engine such as Lucene or Elasticsearch.
An alternative approach that avoids regex is to utilise MongoDB text indexes. By adding a text index on the field, you can perform text searches using the $text operator.
For example:
db.coll.find(
    { $text: { $search: "123 Fakeville St" } },
    { score: { $meta: "textScore" } }
).sort( { score: { $meta: "textScore" } } ).limit(1)
This should work for entries such as "123 Fakeville St.", "123 fakeville street", etc., as long as the important parts of the address make it in.
See more info on $text behaviour
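Since the question uses pymongo, the same query might look like this there (a sketch; myDB is the database handle from the question, and the index only needs to be created once):

import pymongo

# myDB is the database handle from the question, e.g. pymongo.MongoClient()["myDB"]
addressCollection = myDB["addresses"]
addressCollection.create_index([("Address", pymongo.TEXT)])

cursor = addressCollection.find(
    {"$text": {"$search": "123 Fakeville St"}},
    {"Address": 1, "_id": 0, "score": {"$meta": "textScore"}},
).sort([("score", {"$meta": "textScore"})]).limit(3)
for doc in cursor:
    print(doc["Address"])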
I am in need of a little help here: I need to identify negated phrases like "not good" and "not bad" and then determine the polarity (negative or positive) of the sentiment. I did everything except handling the negations; I just want to know how I can include negations in it. How do I go about it?
Negation handling is quite a broad field, with numerous different potential implementations. Here I can provide sample code that negates a sequence of text and stores negated uni/bi/trigrams in not_ form. Note that nltk isn't used here in favor of simple text processing.
# negate_sequence(text)
# text: sentence to process (creation of uni/bi/trigrams
#       is handled here)
#
# Detects negations and transforms negated words into 'not_' form
#
def negate_sequence(text):
    negation = False
    delims = "?.,!:;"
    result = []
    words = text.split()
    prev = None
    pprev = None
    for word in words:
        stripped = word.strip(delims).lower()
        negated = "not_" + stripped if negation else stripped
        result.append(negated)
        if prev:
            bigram = prev + " " + negated
            result.append(bigram)
            if pprev:
                trigram = pprev + " " + bigram
                result.append(trigram)
            pprev = prev
        prev = negated
        if any(neg in word for neg in ["not", "n't", "no"]):
            negation = not negation
        if any(c in word for c in delims):
            negation = False
    return result
If we run this program on a sample input text = "I am not happy today, and I am not feeling well", we obtain the following sequences of unigrams, bigrams, and trigrams:
[ 'i',
'am',
'i am',
'not',
'am not',
'i am not',
'not_happy',
'not not_happy',
'am not not_happy',
'not_today',
'not_happy not_today',
'not not_happy not_today',
'and',
'not_today and',
'not_happy not_today and',
'i',
'and i',
'not_today and i',
'am',
'i am',
'and i am',
'not',
'am not',
'i am not',
'not_feeling',
'not not_feeling',
'am not not_feeling',
'not_well',
'not_feeling not_well',
'not not_feeling not_well']
We may subsequently store these n-grams in an array for future retrieval and analysis. Treat the not_ words as having the opposite of the [sentiment, polarity] that you have defined for their counterparts.
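For example, with the negate_sequence function above and a toy polarity lexicon (the words and scores below are made up for illustration), flipping the sign for not_ tokens could look like this:

# Hypothetical mini-lexicon; plug in your own [sentiment, polarity] scores.
polarity = {"happy": 1.0, "well": 1.0, "good": 1.0, "bad": -1.0}

def token_polarity(token):
    # A not_ token gets the opposite score of its un-negated counterpart.
    if token.startswith("not_"):
        return -polarity.get(token[len("not_"):], 0.0)
    return polarity.get(token, 0.0)

tokens = negate_sequence("I am not happy today, and I am not feeling well")
print(sum(token_polarity(t) for t in tokens))  # negative: 'happy'/'well' appear only negated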
This seems to work decently well as a poor man's word negation in Python. It's definitely not perfect, but it may be useful for some cases. It takes a spaCy sentence object.
def word_is_negated(word):
    """Return True if the word has a 'neg' dependency child, or, for
    verbs, if a governing verb is negated."""
    for child in word.children:
        if child.dep_ == 'neg':
            return True

    if word.pos_ in {'VERB'}:
        for ancestor in word.ancestors:
            if ancestor.pos_ in {'VERB'}:
                for child2 in ancestor.children:
                    if child2.dep_ == 'neg':
                        return True

    return False

def find_negated_wordSentIdxs_in_sent(sent, idxs_of_interest=None):
    """Return the sentence indices of negated words, optionally
    restricted to idxs_of_interest."""
    negated_word_idxs = set()
    for word_sent_idx, word in enumerate(sent):
        if idxs_of_interest:
            if word_sent_idx not in idxs_of_interest:
                continue
        if word_is_negated(word):
            negated_word_idxs.add(word_sent_idx)
    return negated_word_idxs
Call it like this:
import spacy
nlp = spacy.load('en_core_web_lg')
find_negated_wordSentIdxs_in_sent(nlp("I have hope, but I do not like summer"))
EDIT:
As @Amandeep pointed out, depending on your use case, you may also want to include nouns, adjectives, and adverbs in the line if word.pos_ in {'VERB'}:.
It's been a while since I've worked on sentiment analysis, so I'm not sure what the status of this area is now; in any case, I have never used nltk for it, so I wouldn't be able to point you to anything there. But in general, I think it's safe to say that this is an active area of research and an essential part of NLP, and that it certainly isn't a 'solved' problem yet. It's one of the finer, more interesting fields of NLP, involving irony, sarcasm, and the scope of negations. Often, coming up with a correct analysis means interpreting a lot of context/domain/discourse information, which isn't straightforward at all.
You may want to look at this topic: Can an algorithm detect sarcasm. And some googling will probably give you a lot more information.
In short: your question is way too broad to come up with a specific answer.
Also, I wonder what you mean by "I did everything except handling the negations". Do you mean you identified 'negative' words? Have you considered that this information can be conveyed by a lot more than the words not, no, etc.? Consider for example "Your solution was not good" vs. "Your solution was suboptimal".
What exactly you are looking for, and what will suffice in your situation, obviously depends on the context and domain of application.
This probably wasn't the answer you were hoping for, but I'd suggest you do a bit more research (as a lot of smart things have been done by smart people in this field).
I have a list
['mPXSz0qd6j0 youtube ', 'lBz5XJRLHQM youtube ', 'search OpHQOO-DwlQ ',
'sachin 47427243 ', 'alex smith ', 'birthday JEaM8Lg9oK4 ',
'nebula 8x41n9thAU8 ', 'chuck norris ',
'searcher O6tUtqPcHDw ', 'graham wXqsg59z7m0 ', 'queries K70QnTfGjoM ']
Is there some way to identify the strings in each list item which aren't spellable words and remove them?
You can use, e.g. PyEnchant for basic dictionary checking and NLTK to take minor spelling issues into account, like this:
import enchant
import nltk

spell_dict = enchant.Dict('en_US')  # or whatever language is supported

def get_distance_limit(w):
    '''
    The word is considered good
    if it's no further from a known word than this limit.
    '''
    return len(w) / 5 + 2  # just for example, allowing around 1 typo per 5 chars

def check_word(word):
    if spell_dict.check(word):
        return True  # a known dictionary word
    # try similar words
    max_dist = get_distance_limit(word)
    for suggestion in spell_dict.suggest(word):
        if nltk.edit_distance(suggestion, word) < max_dist:
            return True
    return False
Add case normalisation and a filter for digits and you'll get a pretty good heuristic.
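For example, applied to the list from the question, the filter might be used like this (a sketch; the token rules here are just one possible choice, and whether names like "alex" pass depends on the dictionary):

data = ['mPXSz0qd6j0 youtube ', 'alex smith ', 'birthday JEaM8Lg9oK4 ',
        'chuck norris ', 'sachin 47427243 ']

def clean_item(item):
    # Keep only purely alphabetic tokens that pass the dictionary check.
    kept = [w for w in item.split() if w.isalpha() and check_word(w.lower())]
    return ' '.join(kept)

print([clean_item(item) for item in data])  # random IDs and digit runs are dropped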
It is entirely possible to compare your list members against words that you believe to be valid for your input.
This can be done in many ways, partially depending on your definition of "properly spelled" and what you end up using for a comparison list. If you decide that numbers preclude an entry from being valid, or underscores, or mixed case, you could test for regex matching.
Post regex, you would have to decide what a valid character to split on should be. Is it spaces (are you willing to break on 'ad hoc' ('ad' is an abbreviation, 'hoc' is not a word))? Is it hyphens (this will break on hyphenated last names)?
With these above criteria decided, it's just a decision of what word, proper name, and common slang list to use and a list comprehension:
word_list[:] = [term for term in word_list if passes_my_membership_criteria(term)]
where passes_my_membership_criteria() is a function that contains the rules for staying in the list of words, returning False for things that you've decided are not valid.
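As a concrete (hypothetical) example, a criteria function implementing the rules sketched above, where digits or underscores disqualify a term, might look like:

import re

def passes_my_membership_criteria(term):
    # Reject empty strings, anything containing digits or underscores,
    # and anything whose space-separated tokens aren't purely alphabetic.
    if not term.strip():
        return False
    if re.search(r'[\d_]', term):
        return False
    return all(w.isalpha() for w in term.split())

word_list = ['mPXSz0qd6j0 youtube ', 'alex smith ', 'sachin 47427243 ']
word_list[:] = [t for t in word_list if passes_my_membership_criteria(t)]
print(word_list)  # ['alex smith ']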
Here is what I have so far:
text = "Hello world. It is a nice day today. Don't you think so?"
re.findall(r'\w{3,}\s{1,}\w{3,}', text)
# ['Hello world', 'nice day', 'you think']
The desired output would be ['Hello world', 'nice day', 'day today', "today Don't", "Don't you", 'you think'].
Can this be done with a simple regex pattern?
import itertools as it
import re

three_pat = re.compile(r'\w{3}')
text = "Hello world. It is a nice day today. Don't you think so?"
for key, group in it.groupby(text.split(), lambda x: bool(three_pat.match(x))):
    if key:
        group = list(group)
        for i in range(0, len(group) - 1):
            print(' '.join(group[i:i + 2]))

# Hello world.
# nice day
# day today.
# today. Don't
# Don't you
# you think
It's not clear to me what you want done with the punctuation. On the one hand, it looks like you want periods to be removed, but single quotation marks to be kept. It would be easy to implement the removal of periods, but before I do, could you clarify what you want to happen to all punctuation?
list(map(lambda x: x[0] + x[1], re.findall(r'(\w{3,}(?=(\s{1,}\w{3,})))', text)))
Maybe the lambda can be written more concisely (e.g. map(''.join, ...)).
And by the way, ' is not part of \w or \s.
Something like this, with additional checks for list boundaries, should do:
>>> text = "Hello world. It is a nice day today. Don't you think so?"
>>> k = text.split()
>>> k
['Hello', 'world.', 'It', 'is', 'a', 'nice', 'day', 'today.', "Don't", 'you', 'think', 'so?']
>>> z = [x for x in k if len(x) > 2]
>>> z
['Hello', 'world.', 'nice', 'day', 'today.', "Don't", 'you', 'think', 'so?']
>>> [z[n]+ " " + z[n+1] for n in range(0, len(z)-1, 2)]
['Hello world.', 'nice day', "today. Don't", 'you think']
>>>
There are two problems with your approach:
Neither \w nor \s matches punctuation.
When you match a string with a regular expression using findall, that part of the string is consumed. Searching for the next match commences immediately after the end of the previous match. Because of this a word can't be included in two separate matches.
To solve the first issue you need to decide what you mean by a word. Regular expressions aren't good for this sort of parsing. You might want to look at a natural language parsing library instead.
But assuming that you can come up with a regular expression that works for your needs, to fix the second problem you can use a lookahead assertion to check the second word. This won't return the entire match as you want but you can at least find the first word in each word pair using this method.
re.findall(r'\w{3,}(?=\s{1,}\w{3,})', text)
                   ^^^^^^^^^^^^^^^^
                   lookahead assertion
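Building on that, a capture group inside the lookahead recovers the second word of each overlapping pair without consuming it (a sketch; note that the apostrophe in Don't still splits those pairs, since ' matches neither \w nor \s):

import re

text = "Hello world. It is a nice day today. Don't you think so?"
# The outer group captures the first word; the group inside the
# lookahead captures the second word without consuming it, so
# overlapping pairs are all found.
pairs = [' '.join(m) for m in re.findall(r"(\w{3,})(?=\s{1,}(\w{3,}))", text)]
print(pairs)  # ['Hello world', 'nice day', 'day today', 'you think']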