Ignoring Multiple Whitespace Characters in a MongoDB Query - python

I have a MongoDB query that searches for addresses. The problem is that if a user accidentally adds an extra whitespace character, the query will not find the address. For example, if the user types 123  Fakeville St (with two spaces after 123) instead of 123 Fakeville St, the query will not return any results.
Is there a simple way to deal with this issue, perhaps using $regex? I guess the space would need to be ignored between the house number (123) and the street name (Fakeville). My query is set up like this:
@app.route('/getInfo', methods=['GET'])
def getInfo():
    address = request.args.get("a")
    addressCollection = myDB["addresses"]
    addressJSON = []
    regex = "^" + address
    for address in addressCollection.find({'Address': {'$regex': regex, '$options': 'i'}}, {"Address": 1, "_id": 0}).limit(3):
        addressJSON.append({"Address": address["Address"]})
    return jsonify(addresses=addressJSON)

Clean up the query before sending it off:
>>> import re
>>> re.sub(r'\s+', ' ', '123    abc')
'123 abc'
>>> re.sub(r'\s+', ' ', '123   abc  def    ghi')
'123 abc def ghi'
You'll probably want to make sure that the data in your database is similarly normalised. Also consider similar strategies for things like punctuation.
In fact, using a regex for this seems overly strict, as well as reinventing the wheel. Consider using a proper search engine such as Lucene or Elasticsearch.
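As a rough sketch of how the clean-up could feed the original prefix query: the helper below (build_address_regex is a hypothetical name, not part of any library) splits the user's input into words, escapes each one, and joins them with \s+ so that any run of whitespace matches on either side. It assumes the stored addresses may also contain irregular spacing.

```python
import re

def build_address_regex(user_input):
    """Return a '^'-anchored pattern that tolerates any run of whitespace between words."""
    # split() with no arguments drops leading/trailing whitespace and
    # treats any run of whitespace as a single separator
    tokens = user_input.split()
    # re.escape stops characters such as '.' or '(' from acting as regex syntax
    return "^" + r"\s+".join(re.escape(token) for token in tokens)

pattern = build_address_regex("123    Fakeville  St")
print(pattern)  # ^123\s+Fakeville\s+St
```

The resulting string could be passed as the $regex value in place of "^" + address. Note that re.escape follows Python's regex rules; MongoDB uses PCRE, which is close but not identical, so unusual characters may need checking.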

An alternative that avoids regex is to use a MongoDB text index. Once the field has a text index, you can run text searches with the $text operator.
For example:
db.coll.find(
    { $text: { $search: "123 Fakeville St" } },
    { score: { $meta: "textScore" } }
).sort( { score: { $meta: "textScore" } } ).limit(1)
This should work for entries such as "123 Fakeville St." or "123 fakeville street", as long as the important parts of the address make it into the search string.
See more info on $text behaviour
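Since the question uses Python, here is a hedged sketch of the same text search expressed as pymongo-style arguments. The helper name text_search_args is hypothetical, and the commented usage assumes a live collection called addressCollection (as in the question) with a text index on the Address field.

```python
def text_search_args(query):
    """Build the filter and projection for a $text search ranked by relevance."""
    text_filter = {"$text": {"$search": query}}
    # Return only the address plus the relevance score computed by MongoDB
    projection = {"Address": 1, "_id": 0, "score": {"$meta": "textScore"}}
    return text_filter, projection

text_filter, projection = text_search_args("123 Fakeville St")

# With a live pymongo collection the pieces would be used roughly like this:
#
#   addressCollection.create_index([("Address", "text")])   # one-off setup
#   cursor = (addressCollection.find(text_filter, projection)
#             .sort([("score", {"$meta": "textScore"})])
#             .limit(3))
```

Text search matches whole (stemmed) words, so it will not behave like the prefix match in the original query while the user is still typing.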

Related

How do I match a string in a pandas column then return what follows it?

I have a pandas dataframe which contains a column of Twitter profile descriptions. In some of these descriptions, there are strings like 'insta: profile_name'.
How can I create a line of code which would search for a string (eg, 'insta:' or 'instagram:') and then return the rest of the string of whatever is next to it?
1252: 'lad who loves to cook 🥘 • insta: xxx',
1254: 'founder and head chef | insta: xxx |',
1992: '🇬🇧 |bakery instagram - xxx',
2291: 'insta: #xxx for enquiries'
2336: 'self taught baker. ig:// xxxx 🍥🧆',
You can use a regex to match each of the keywords, such as "insta".
The code should be something like this:
import re
container = []
for word in [list of keywords, e.g. "insta", "face"]:
    _tag = re.findall(word + 'Regex Syntax', the_string_to_find_from)
    container.append([word, _tag])
Then you can unpack the resulting container variable when you want the results. I can help you write the regex syntax, but I need more information on how the information you need is wrapped in the text.
Answer provided by Nk03 in the comments:
df['name'].str.extract(pat = r'(insta:|ig:)(.*)')[1].str.strip('\',')
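To make the pattern idea concrete without needing pandas, here is a minimal sketch using only the re module on the sample rows from the question. The keyword list (instagram, insta, ig), the separator set (':', '-', '/') and the name extract_handle are assumptions based on those five samples, not a general solution.

```python
import re

# Keyword, then one or more ':' / '-' / '/' separators, an optional '#',
# then the handle itself (word characters and dots).
HANDLE_RE = re.compile(r'\b(?:instagram|insta|ig)\b\s*[:\-/]+\s*#?([\w.]+)', re.IGNORECASE)

def extract_handle(description):
    """Return the handle following an Instagram keyword, or None if absent."""
    match = HANDLE_RE.search(description)
    return match.group(1) if match else None

samples = [
    'lad who loves to cook 🥘 • insta: xxx',
    'founder and head chef | insta: xxx |',
    '🇬🇧 |bakery instagram - xxx',
    'insta: #xxx for enquiries',
    'self taught baker. ig:// xxxx 🍥🧆',
]
for text in samples:
    print(extract_handle(text))
```

The same pattern string should also be usable with Series.str.extract, passing flags=re.IGNORECASE, since it contains a single capture group.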

Generate text from a given template

For example I have a string such as
text = '{Hello|Good morning|Hi}{. We|, we} have a {good |best }offer for you.'
How can I generate a set of all possible strings with variants of words in braces?
Hello. We have a good offer for you.
Good morning, we have a best offer for you.
etc...
You can use the re and random modules, like this:
import random
import re
def randomize(match):
    # Pick one of the '|'-separated options at random
    return random.choice(match.group(1).split('|'))
def random_sentence(tpl):
    return re.sub(r'{(.*?)}', randomize, tpl)
tpl = '{Hello|Good morning|Hi}{. We|, we} have a {good |best }offer for you.'
print(random_sentence(tpl))
I would use a tree-traversal method to get all possible variants:
import re
text = '{Hello|Good morning|Hi}{. We|, we} have a {good |best }offer for you.'
variants = ['']
elements = re.split(r'([{\|}])', text)
inside = False
options = []
for elem in elements:
    if elem == '{':
        inside = True
        continue
    if not inside:
        variants = [v + elem for v in variants]
    if inside and elem not in '|}':
        options.append(elem)
    if inside and elem == '}':
        variants = [v + opt for opt in options for v in variants]
        options = []
        inside = False
print(*variants, sep='\n')
Output:
Hello. We have a good offer for you.
Good morning. We have a good offer for you.
Hi. We have a good offer for you.
Hello, we have a good offer for you.
Good morning, we have a good offer for you.
Hi, we have a good offer for you.
Hello. We have a best offer for you.
Good morning. We have a best offer for you.
Hi. We have a best offer for you.
Hello, we have a best offer for you.
Good morning, we have a best offer for you.
Hi, we have a best offer for you.
Explanation: I use re.split to split the string into elements:
['', '{', 'Hello', '|', 'Good morning', '|', 'Hi', '}', '', '{', '. We', '|', ', we', '}', ' have a ', '{', 'good ', '|', 'best ', '}', 'offer for you.']
Then I create a flag, inside, which records whether I am currently inside or outside { and }, and act accordingly:
If I find {, I set the flag and go to the next element (continue).
If I am not inside braces, I simply add the given element to every variant.
If I am inside and the element is neither | nor }, I append this element to the options list.
If I am inside and find }, I build every (variant, option) combination, and variants becomes the result of this operation.
Note that I assume that text is always well formed, and that {, } and (inside braces) | are used solely as control characters.
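For comparison, the same enumeration can be sketched with itertools.product, which is a different technique from the loop above rather than a rewrite of it. The function name expand_template is hypothetical, and the sketch makes the same assumption of well-formed, non-nested braces.

```python
import itertools
import re

def expand_template(template):
    """Return every string obtainable by choosing one option per {a|b|c} group."""
    # Splitting on a capturing group keeps the brace contents at odd indices
    # and the literal text at even indices.
    parts = re.split(r'\{(.*?)\}', template)
    choices = [part.split('|') if index % 2 else [part]
               for index, part in enumerate(parts)]
    # product() yields one combination per way of picking an item from each list
    return [''.join(combo) for combo in itertools.product(*choices)]

text = '{Hello|Good morning|Hi}{. We|, we} have a {good |best }offer for you.'
for variant in expand_template(text):
    print(variant)
```

This yields the same 12 sentences, though in a different order, because product varies the last group fastest.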

Templates with argument in string formatting

I'm looking for a package or any other approach (other than manual replacement) for the templates within string formatting.
I want to achieve something like this (this is just an example so you could get the idea, not the actual working code):
text = "I {what:like,love} {item:pizza,space,science}".format(what=2,item=3)
print(text)
So the output would be:
I love science
How can I achieve this? I have been searching but cannot find anything appropriate. Probably used wrong naming terms.
If there isn't any ready-to-use package around, I would love to read some tips on a starting point for coding this myself.
I think using a list is sufficient, since Python lists preserve the order of their items:
what = ["like","love"]
items = ["pizza","space","science"]
text = "I {} {}".format(what[1],items[2])
print(text)
output:
I love science
Maybe use a list or a tuple for what and item, as both data types preserve insertion order.
what = ['like', 'love']
item = ['pizza', 'space', 'science']
text = "I {what} {item}".format(what=what[1], item=item[2])
print(text) # I love science
Or even this is possible:
text = "I {what[1]} {item[2]}".format(what=what, item=item)
print(text) # I love science
Hope this helps!
Why not use a dictionary?
options = {'what': ('like', 'love'), 'item': ('pizza', 'space', 'science')}
print("I " + options['what'][1] + ' ' + options['item'][2])
This returns: "I love science"
Or if you wanted a method to rid yourself of having to reformat to accommodate/remove spaces, then incorporate this into your dictionary structure, like so:
options = {'what': (' like', ' love'), 'item': (' pizza', ' space', ' science'), 'fullstop': '.'}
print("I" + options['what'][0] + options['item'][0] + options['fullstop'])
And this returns: "I like pizza."
Since no one has provided an answer that addresses my question directly, I decided to work on this myself.
I had to use double braces, because single ones are reserved for string formatting.
I ended up with the following class:
import re
class ArgTempl:
    def __init__(self, _str):
        self._str = _str
    def format(self, **args):
        # Work on a copy so the stored template is not modified by a call
        result = self._str
        for k in re.finditer(r"{{(\w+):([\w,]+?)}}", self._str):
            key, replacements = k.groups()
            if key not in args:
                continue
            result = result.replace(k.group(0), replacements.split(',')[args[key]])
        return result
This is primitive code written in five minutes, so it lacks checks and so on. It works as expected and can be improved easily.
Tested on Python 2.7 and 3.6.
Usage:
test = "I {{what:like,love}} {{item:pizza,space,science}}"
print(ArgTempl(test).format(what=1, item=2))
> I love science
Thanks for all of the replies.
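One possible refinement, sketched under the same {{key:a,b,c}} syntax: doing the substitution in a single re.sub pass with a callback, so each placeholder is resolved where it is found. The names render and PLACEHOLDER_RE are hypothetical, and an out-of-range index still raises IndexError.

```python
import re

PLACEHOLDER_RE = re.compile(r"\{\{(\w+):([\w,]+)\}\}")

def render(template, **choices):
    """Replace each {{key:a,b,c}} with the option at index choices[key]."""
    def pick(match):
        key, options = match.group(1), match.group(2).split(",")
        if key not in choices:
            return match.group(0)  # leave unknown placeholders untouched
        return options[choices[key]]
    return PLACEHOLDER_RE.sub(pick, template)

print(render("I {{what:like,love}} {{item:pizza,space,science}}", what=1, item=2))
```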

Identify symbols in string

I am implementing a simple DSL. I have the following input string:
txt = 'Hi, my name is <<name>>. I was born in <<city>>.'
And I have the following data:
{
'name': 'John',
'city': 'Paris',
'more': 'xxx',
'data': 'yyy',
...
}
I need to implement the following function:
def tokenize(txt):
    ...
    return fmt, vars
Where I get:
fmt = 'Hi, my name is {name}. I was born in {city}.'
vars = ['name', 'city']
That is, fmt can be passed to the str.format() function, and vars is a list of the detected tokens (so that I can perform lookup in the data, which can be more complex than what I described, since it can be split in several namespaces)
After this, processing the format would be simple:
def expand(fmt, vars, data):
    params = get_params(vars, data)
    return fmt.format(**params)
Where get_params is performing simple lookup of the data, and returning something like:
params = {
'name': 'John',
'city': 'Paris',
}
My question is:
How can I implement tokenize? How can I detect the tokens, knowing that the delimiters are << and >>? Should I go for regexes, or is there an easier path?
This is something similar to what pystache, or even .format itself, are doing, but I would like a light-weight implementation. Robustness is not very critical at this stage.
Yes, this is a perfect target for regexp. Find the begin/end quotation marks, replace them with braces, and extract the symbol names into a list. Do you have a solid description of legal symbols? You'll want a search such as
/\<\<([a-zA-Z]+[a-zA-Z0-9_]*)\>\>/
For classical variable names (note that this excludes leading underscores). Are you familiar enough with regexps to take it from here?
import re
def tokenize(text):
    found_variables = []
    def replace_and_capture(match):
        found_variables.append(match.group(1))
        return "{{{}}}".format(match.group(1))
    return re.sub(r'<<([^>]+)>>', replace_and_capture, text), found_variables
fmt, vars = tokenize('Hi, my name is <<name>>. I was born in <<city>>.')
print(fmt)
print(vars)
# Output:
# Hi, my name is {name}. I was born in {city}.
# ['name', 'city']
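Putting tokenize together with the expand step from the question, a minimal end-to-end sketch might look like the following. It makes two assumptions that are not in the original: token names are plain identifiers, and literal braces in the input should survive str.format, so they are doubled first. The flat dictionary lookup stands in for the asker's get_params.

```python
import re

def tokenize(txt):
    """Convert <<name>> tokens to {name} and return (fmt, list_of_names)."""
    names = []
    def to_brace(match):
        names.append(match.group(1))
        return '{' + match.group(1) + '}'
    # Double any literal braces so str.format treats them as plain text
    escaped = txt.replace('{', '{{').replace('}', '}}')
    fmt = re.sub(r'<<([A-Za-z_]\w*)>>', to_brace, escaped)
    return fmt, names

def expand(fmt, names, data):
    """Look up each detected name in data and fill in the format string."""
    params = {name: data[name] for name in names}
    return fmt.format(**params)

data = {'name': 'John', 'city': 'Paris', 'more': 'xxx', 'data': 'yyy'}
fmt, names = tokenize('Hi, my name is <<name>>. I was born in <<city>>.')
print(expand(fmt, names, data))  # Hi, my name is John. I was born in Paris.
```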

Address formatting using regex - add state before zip code

I have an address formatted like this:
street address, town zip
I need to add the state abbreviation before the zip, which is always 5 digits.
I think I should use regex to do something like below, but I don't know how to finish it:
instr = "123 street st, anytown 12345"
state = 'CA'
outstr = re.sub(r'(???)(/\b\d{5}\b/g)', r'\1state\2', instr)
My question is what to put in the ??? and whether I used the state variable correctly in outstr. Also, did I get the zip regex correct?
You can also use rsplit to do that:
instr = "123 street st, anytown 12345"
state = 'CA'
address, zip_code = instr.rsplit(' ', 1) # ['123 street st, anytown', '12345']
print('%s %s %s' % (address, state, zip_code))
>> "123 street st, anytown CA 12345"
From the str.rsplit documentation:
str.rsplit([sep[, maxsplit]])
Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done, the rightmost ones.
You can't put the variable state straight into the replacement string. You should use Python string formatting to refer to the variable.
Keep the regex simple and assume the data is simple. If the ZIP always appears at the end of the string, just match from the end using $.
Let me try:
import re
instr = "123 street st, anytown 12345"
# Always strip trailing spaces to avoid surprises
instr = instr.rstrip()
state = 'CA'
# Assume the ZIP is in the last position with no trailing space
search_pattern = r"(\d{5})$"
# Format the replacement; since I search from the end, group 1 is the ZIP
replace_str = r"{mystate} \g<1>".format(mystate=state)
outstr = re.sub(search_pattern, replace_str, instr)
@Forge's example is lean and clean. However, you need to be careful about data quality when using str.rsplit(). For example:
# If the town and zip code are stuck together
instr = "123 street st, anytown12345"
# or there are trailing spaces
instr = "123 street st, anytown 12345 "
The universal fix is to use strip and a regex, as shown in my code. Always think ahead about input data quality; some code will still fail on real data after passing its unit tests.
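As a sketch of how both data-quality cases could be handled in one step, the pattern below also consumes any whitespace around the ZIP, and a replacement function inserts the state. The name add_state is hypothetical, and the sketch assumes the last five digits of the string are always the ZIP.

```python
import re

def add_state(address, state):
    """Insert `state` before a trailing 5-digit ZIP code."""
    def insert_state(match):
        # Rebuild the tail with exactly one space on each side of the state
        return " {} {}".format(state, match.group(1))
    # \s* on both sides absorbs missing, extra or trailing whitespace
    return re.sub(r"\s*(\d{5})\s*$", insert_state, address, count=1)

for raw in ["123 street st, anytown 12345",
            "123 street st, anytown12345",
            "123 street st, anytown 12345 "]:
    print(add_state(raw, "CA"))
```

Using a function for the replacement avoids having to escape backslashes or group references that might appear in the state string. An address without a trailing ZIP is returned unchanged.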
