Identify symbols in string - python

I am implementing a simple DSL. I have the following input string:
txt = 'Hi, my name is <<name>>. I was born in <<city>>.'
And I have the following data:
{
'name': 'John',
'city': 'Paris',
'more': 'xxx',
'data': 'yyy',
...
}
I need to implement the following function:
def tokenize(txt):
    ...
    return fmt, vars
Where I get:
fmt = 'Hi, my name is {name}. I was born in {city}.'
vars = ['name', 'city']
That is, fmt can be passed to the str.format() function, and vars is a list of the detected tokens (so that I can perform lookup in the data, which can be more complex than what I described, since it can be split in several namespaces)
After this, processing the format would be simple:
def expand(fmt, vars, data):
    params = get_params(vars, data)
    # params is a dict, so it must be unpacked into keyword arguments
    return fmt.format(**params)
Where get_params is performing simple lookup of the data, and returning something like:
params = {
'name': 'John',
'city': 'Paris',
}
My question is:
How can I implement tokenize? How can I detect the tokens, knowing that the delimiters are << and >>? Should I go for regexes, or is there an easier path?
This is something similar to what pystache, or even .format itself, are doing, but I would like a light-weight implementation. Robustness is not very critical at this stage.

Yes, this is a perfect target for regexp. Find the begin/end quotation marks, replace them with braces, and extract the symbol names into a list. Do you have a solid description of legal symbols? You'll want a search such as
/\<\<([a-zA-Z]+[a-zA-Z0-9_]*)\>\>/
For classical variable names (note that this excludes leading underscores). Are you familiar enough with regexps to take it from here?

import re
def tokenize(text):
    found_variables = []
    def replace_and_capture(match):
        found_variables.append(match.group(1))
        return "{{{}}}".format(match.group(1))
    return re.sub(r'<<([^>]+)>>', replace_and_capture, text), found_variables
fmt, vars = tokenize('Hi, my name is <<name>>. I was born in <<city>>.')
print(fmt)
print(vars)
# Output:
# Hi, my name is {name}. I was born in {city}.
# ['name', 'city']
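To connect this back to the expand function from the question, here is a minimal end-to-end sketch. It assumes the stricter identifier pattern suggested in the other answer. The get_params below is only a hypothetical stand-in for the lookup the question describes, and escaping literal braces before building the format string is an added assumption, not something the question asked for.

```python
import re

# Assumed token syntax: <<identifier>>, starting with a letter
TOKEN = re.compile(r'<<([A-Za-z][A-Za-z0-9_]*)>>')

def tokenize(txt):
    names = []

    def repl(match):
        names.append(match.group(1))
        return '{' + match.group(1) + '}'

    # Double any literal braces first so str.format leaves them alone
    escaped = txt.replace('{', '{{').replace('}', '}}')
    return TOKEN.sub(repl, escaped), names

def get_params(names, data):
    # Hypothetical lookup: a plain dict access per detected name
    return {name: data[name] for name in names}

def expand(fmt, names, data):
    return fmt.format(**get_params(names, data))

data = {'name': 'John', 'city': 'Paris', 'more': 'xxx'}
fmt, names = tokenize('Hi, my name is <<name>>. I was born in <<city>>. {ok}')
result = expand(fmt, names, data)
print(result)
```

A name that is missing from data raises KeyError inside get_params, which may or may not be what you want at this stage.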

Related

How can I use variables on a variables loaded from text file? [duplicate]

I am looking for either a technique or a templating system for Python for formatting output as simple text. It needs to be able to iterate through multiple lists or dicts. It would be nice if I could define the template in a separate file (like output.templ) instead of hardcoding it into the source code.
As simple example what I want to achieve, we have variables title, subtitle and list
title = 'foo'
subtitle = 'bar'
list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
After running these through a template, the output would look like this:
Foo
Bar
Monday
Tuesday
Wednesday
Thursday
Friday
Saturday
Sunday
How to do this? Thank you.
You can use the standard library string module and its Template class.
Having a file foo.txt:
$title
$subtitle
$list
And the processing of the file (example.py):
from string import Template
d = {
'title': 'This is the title',
'subtitle': 'And this is the subtitle',
'list': '\n'.join(['first', 'second', 'third'])
}
with open('foo.txt', 'r') as f:
    src = Template(f.read())
result = src.substitute(d)
print(result)
Then run it:
$ python example.py
This is the title
And this is the subtitle
first
second
third
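For the variables in the question, the same idea can be sketched without a file. This assumes the template text is held in a string and that the list is joined with newlines before substitution; the names days, output and partial are made up for the illustration. The last call shows safe_substitute, which leaves unknown placeholders in place instead of raising KeyError.

```python
from string import Template

title = 'foo'
subtitle = 'bar'
days = ['Monday', 'Tuesday', 'Wednesday']

template = Template('$title\n$subtitle\n$days')

# Template has no loops, so the list is flattened to one string first
output = template.substitute(
    title=title.capitalize(),
    subtitle=subtitle.capitalize(),
    days='\n'.join(days),
)
print(output)

# safe_substitute keeps placeholders it cannot resolve
partial = Template('$title / $missing').safe_substitute(title='Foo')
print(partial)
```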
There are quite a number of template engines for python: Jinja, Cheetah, Genshi etc. You won't make a mistake with any of them.
If you prefer to use something shipped with the standard library, take a look at the format string syntax. By default it is not able to format lists like in your output example, but you can handle this with a custom Formatter which overrides the convert_field method.
Suppose your custom formatter cf uses the conversion code l to format lists; this should then produce your given example output:
cf.format("{title}\n{subtitle}\n\n{list!l}", title=title, subtitle=subtitle, list=list)
Alternatively you could preformat your list using "\n".join(list) and then pass this to your normal template string.
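One possible shape for such a formatter is sketched below. The class name ListFormatter and the choice to join items with newlines are assumptions made for the illustration; only string.Formatter and its convert_field hook come from the standard library.

```python
from string import Formatter

class ListFormatter(Formatter):
    def convert_field(self, value, conversion):
        # 'l' is a custom conversion code: one item per line
        if conversion == 'l':
            return '\n'.join(str(item) for item in value)
        # Fall back to the built-in conversions (!s, !r, !a)
        return super().convert_field(value, conversion)

cf = ListFormatter()
days = ['Monday', 'Tuesday']
output = cf.format('{title}\n{subtitle}\n\n{days!l}',
                   title='Foo', subtitle='Bar', days=days)
print(output)
```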
If you want arbitrary prefixes/suffixes to identify your variables, you can simply use re.sub with a lambda expression:
from pathlib import Path
import re
def tpl(fn: Path, v: dict[str, str]) -> str:
    text = fn.with_suffix('.html').read_text()
    return re.sub("(<!-- (.+?) -->)", lambda m: v[m[2].lower()], text)
html = tpl(Path(__file__), {
    'title': 't',
    'body': 'b'
})
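The same technique can be tried on an in-memory string. The variant below is only a sketch: the function name fill and the decision to leave unknown markers untouched (instead of raising KeyError as the code above would) are assumptions for the illustration.

```python
import re

def fill(text, values):
    # m[1] is the name inside the comment marker, m[0] the whole marker
    return re.sub(r'<!-- (.+?) -->',
                  lambda m: values.get(m[1].lower(), m[0]),
                  text)

page = '<h1><!-- TITLE --></h1><p><!-- body --></p><!-- unknown -->'
filled = fill(page, {'title': 't', 'body': 'b'})
print(filled)
```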

Python - Remove tab and new line in Object

I'm a new user of Scrapy and a newbie to Python. The brand and title properties (in Java OOP terms) of my item contain tabs, spaces and newlines. How can I trim them so that these two properties hold plain string values like the following?
item['brand'] = "KORAL ACTIVEWEAR"
item['title'] = "Boom Leggings"
Below is the data structure
{'store_id': 870, 'sale_price_low': [], 'brand': [u'\n KORAL ACTIVEWEAR\n '], 'currency': 'AUD', 'retail_price': [u'$140.00'], 'category': [u'Activewear'], 'title': [u'\n Boom Leggings\n '], 'url': [u'/boom-leggings-koral-activewear/vp/v=1/1524019474.htm?folderID=13331&fm=other-shopbysize-viewall&os=false&colorId=68136'], 'sale_price_high': [], 'image_url': [u' https://images-na.sample-store.com/images/G/01/samplestore/p/prod/products/kacti/kacti3025868136/kacti3025868136_q1_2-0._SH20_QL90_UY365_.jpg\n'], 'category_link': 'https://www.samplestore.com/clothing-activewear/br/v=1/13331.htm?baseIndex=500', 'store': 'SampleStore'}
I was able to trim the prices to only get the number and decimal by using regex search method, which I think might be wrong when there is a price comma separator.
price = re.compile('[0-9\.]+')
item['retail_price'] = filter(price.search, item['retail_price'])
It looks like all you need to do, at least for this example, is strip all whitespace off the edges of the brand and title values. You don't need a regex for that, just call the strip method.
However, your brand isn't a single string; it's a list of strings (even if there's only one string in the list). So, if you try to just strip it, or run a regex on it, you're going to get an AttributeError or TypeError from trying to treat that list as a string.
To fix this, you need to map the strip over all of the strings, with either the map function or a list comprehension:
item['brand'] = [brand.strip() for brand in item['brand']]
item['title'] = map(unicode.strip, item['title'])  # the values are unicode, so str.strip would raise a TypeError
… whichever of the two is easier for you to understand.
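If the goal is a plain string rather than a one-element list, as in the desired output in the question, one option is to take the first non-empty stripped value. This is a Python 3 sketch; the helper name first_stripped and the empty-string default are assumptions for the illustration.

```python
def first_stripped(values, default=''):
    # Return the first value that is non-empty after stripping whitespace
    for value in values:
        stripped = value.strip()
        if stripped:
            return stripped
    return default

item = {
    'brand': ['\n KORAL ACTIVEWEAR\n '],
    'title': ['\n Boom Leggings\n '],
}
for key in ('brand', 'title'):
    item[key] = first_stripped(item[key])
print(item)
```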
If you have other examples that have embedded runs of whitespace, and you want to turn every such run into exactly one space character, you need to use the sub method with your regex:
item['brand'] = [re.sub(ur'\s+', u' ', brand.strip()) for brand in item['brand']]
Notice the u prefixes. In Python 2, you need a u prefix to make a unicode literal instead of a str (encoded bytes) literal. And it's important to use Unicode patterns against Unicode strings, even if the pattern itself doesn't care about any non-ASCII characters. (If all of this seems like a pointless pain and a bug magnet—well, it is; that's the main reason Python 3 exists.)
As for the retail_price, the same basic observations apply. Again, it's a list of strings, not just a string. And again, you probably don't need regex. Assuming the price is always a $ (or other single-character currency marker) followed by a number, just slice off the $ and call float or Decimal on it:
item['retail_price'] = [float(price[1:]) for price in item['retail_price']]
… but if you have examples that look different, with arbitrary extra characters on both sides of the price, you can use re.search here, but you'll still need to map it, and to use a Unicode pattern.
You also need to grab the matching group out of the search, and to handle empty/invalid strings in some way (they'll return None for the search, and you can't convert that to a float). You have to decide what to do about it, but from your attempt with filter it looks like you just want to skip them. This is complicated enough that I'd do it in multiple steps:
prices = item['retail_price']
matches = (re.search(ur'[0-9.]+', price) for price in prices)
groups = (match.group() for match in matches if match)
item['retail_price'] = map(float, groups)
… or maybe wrap that in a function.
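Wrapped in a function, and written for Python 3, the steps above might look like the sketch below. It assumes that a comma in a price is a thousands separator (the case the question worries about) and uses Decimal rather than float; the name parse_prices and the sample inputs are made up for the illustration.

```python
import re
from decimal import Decimal

# A digit, then digits or commas, then an optional decimal part
PRICE = re.compile(r'\d[\d,]*(?:\.\d+)?')

def parse_prices(raw_prices):
    prices = []
    for raw in raw_prices:
        match = PRICE.search(raw)
        if match:  # strings without a number are skipped
            prices.append(Decimal(match.group().replace(',', '')))
    return prices

parsed = parse_prices(['$140.00', ' AUD 1,299.50 ', 'n/a', ''])
print(parsed)
```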
You can define a method like below which takes an object and returns all the leaves normalized.
import six
def normalize(obj):
    if isinstance(obj, six.string_types):
        return ' '.join(obj.split())
    elif isinstance(obj, list):
        return [normalize(x) for x in obj]
    elif isinstance(obj, dict):
        return {k: normalize(v) for k, v in obj.items()}
    return obj
This method is recursive and does not modify the original object; instead, it returns a normalized copy. You can also call it directly on a single string.
For your example item
>> item = {'store_id': 870, 'sale_price_low': [], 'brand': [u'\n KORAL ACTIVEWEAR\n '], 'currency': 'AUD', 'retail_price': [u'$140.00'], 'category': [u'Activewear'], 'title': [u'\n Boom Leggings\n '], 'url': [u'/boom-leggings-koral-activewear/vp/v=1/1524019474.htm?folderID=13331&fm=other-shopbysize-viewall&os=false&colorId=68136'], 'sale_price_high': [], 'image_url': [u' https://images-na.sample-store.com/images/G/01/samplestore/p/prod/products/kacti/kacti3025868136/kacti3025868136_q1_2-0._SH20_QL90_UY365_.jpg\n'], 'category_link': 'https://www.samplestore.com/clothing-activewear/br/v=1/13331.htm?baseIndex=500', 'store': 'SampleStore'}
>> print (normalize(item))
>> {'category': [u'Activewear'], 'store_id': 870, 'sale_price_low': [], 'title': [u'Boom Leggings'], 'url': [u'/boom-leggings-koral-activewear/vp/v=1/1524019474.htm?folderID=13331&fm=other-shopbysize-viewall&os=false&colorId=68136'], 'brand': [u'KORAL ACTIVEWEAR'], 'currency': 'AUD', 'image_url': [u'https://images-na.sample-store.com/images/G/01/samplestore/p/prod/products/kacti/kacti3025868136/kacti3025868136_q1_2-0._SH20_QL90_UY365_.jpg'], 'category_link': 'https://www.samplestore.com/clothing-activewear/br/v=1/13331.htm?baseIndex=500', 'sale_price_high': [], 'retail_price': [u'$140.00'], 'store': 'SampleStore'}

How to format a list into a string?

I am trying to format a string in python that takes arguments as items from a list of names.
The catch is, I want to print all the list items with double quotes and backslash and one after each other in the same string only.
The code is:
list_names=['Alex', 'John', 'Joseph J']
String_to_pring='Hi my name is (\\"%s\\")'%(list_names)
The output should look like this:
'Hi my name is (\"Alex\",\"John\",\"Joseph J\")'
But instead, I keep getting like this:
'Hi my name is (\"['Alex', 'John', 'Joseph J']\")'
I've even tried using .format() and json.dumps() but still the same result.
Is there any way to print the desired output or can I only print each list item at a time?
Without changing much of your code, you could simply format the repr representation of the list that's converted into a tuple.
# proper way - this is what you actually want
list_names = ['Alex', 'John', 'Joseph J']
string_to_print = 'Hi my name is %s' % (repr(tuple(list_names)))
 
print(string_to_print)
# Hi my name is ('Alex', 'John', 'Joseph J')
If you want to get your exact output, just do some string replacing:
# improper way
list_names = ['Alex', 'John', 'Joseph J']
string_to_print = 'Hi my name is %s' % (repr(tuple(list_names)).replace("\'", '\\"'))
print(string_to_print)
# Hi my name is (\"Alex\", \"John\", \"Joseph J\")
If you're trying to pass string_to_print somewhere else, try the proper way first; it might actually work for you.
If you look closely, you'll find that the "improper way" above contains a small bug. Try adding "Alex's house" to list_names, and the output looks like this:
Hi my name is (\"Alex\", \"John\", \"Joseph J\", "Alex\"s house")
To take care of that bug, you'll need to have a better way of replacing, by using re.sub().
from re import sub
list_names = ['Alex', 'John', 'Joseph J', "Alex's house"]
string_to_print = 'Hi my name is %s' % (sub(r'([\'\"])(.*?)(?!\\\1)(\1)', r'\"\2\"', repr(tuple(list_names))))
print(string_to_print)
But if things like this wouldn't happen during your usage, I would suggest to keep using the "improper way" as it's a lot simpler.
There is no function for formatting lists as human-friendly strings. You have to format lists yourself:
names = ",".join(r'\"{}\"'.format(name) for name in list_names)
print(names)
#\"Alex\",\"John\",\"Joseph J\"
print('Hi my name is ({})'.format(names))
#Hi my name is (\"Alex\",\"John\",\"Joseph J\")
This is one way using format and join:
list_names = ['Alex', 'John', 'Joseph J']
String_to_pring='Hi my name is (\\"{}\\")'.format('\\",\\"'.join(i for i in list_names))
# Hi my name is (\"Alex\",\"John\",\"Joseph J\")
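The join-based approaches also sidestep the apostrophe problem mentioned earlier, because they never go through repr. A small sketch, with format_names as a hypothetical helper name:

```python
def format_names(names):
    # Wrap each name in backslash-escaped double quotes, then join with commas
    quoted = ','.join('\\"{}\\"'.format(name) for name in names)
    return 'Hi my name is ({})'.format(quoted)

result = format_names(['Alex', 'John', 'Joseph J', "Alex's house"])
print(result)
```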

Templates with argument in string formatting

I'm looking for a package or any other approach (other than manual replacement) for the templates within string formatting.
I want to achieve something like this (this is just an example so you could get the idea, not the actual working code):
text = "I {what:like,love} {item:pizza,space,science}".format(what=2,item=3)
print(text)
So the output would be:
I love science
How can I achieve this? I have been searching but cannot find anything appropriate. Probably used wrong naming terms.
If there isn't any ready-to-use package around, I would love to read some tips on a starting point for coding this myself.
I think using a list is sufficient, since Python lists preserve the order of their items:
what = ["like","love"]
items = ["pizza","space","science"]
text = "I {} {}".format(what[1],items[2])
print(text)
output:
I love science
Maybe use a list or a tuple for what and item, as both data types preserve insertion order.
what = ['like', 'love']
item = ['pizza', 'space', 'science']
text = "I {what} {item}".format(what=what[1],item=item[2])
print(text) # I love science
or even this is possible.
text = "I {what[1]} {item[2]}".format(what=what, item=item)
print(text) # I love science
Hope this helps!
Why not use a dictionary?
options = {'what': ('like', 'love'), 'item': ('pizza', 'space', 'science')}
print("I " + options['what'][1] + ' ' + options['item'][2])
This returns: "I love science"
Or if you wanted a method to rid yourself of having to reformat to accommodate/remove spaces, then incorporate this into your dictionary structure, like so:
options = {'what': (' like', ' love'), 'item': (' pizza', ' space', ' science'), 'fullstop': '.'}
print("I" + options['what'][0] + options['item'][0] + options['fullstop'])
And this returns: "I like pizza."
Since no one has provided an answer that addresses my question directly, I decided to work on this myself.
I had to use double brackets, because single ones are reserved for the string formatting.
I ended up with the following class:
import re
class ArgTempl:
    def __init__(self, _str):
        self._str = _str
    def format(self, **args):
        # Each placeholder looks like {{key:option0,option1,...}}
        for k in re.finditer(r"{{(\w+):([\w,]+?)}}", self._str,
                             flags=re.DOTALL | re.MULTILINE | re.IGNORECASE):
            key, replacements = k.groups()
            if key not in args:
                continue
            # Pick the option at the index passed for this key
            self._str = self._str.replace(k.group(0), replacements.split(',')[args[key]])
        return self._str
This is primitive code written in five minutes, so it lacks checks and so on. It works as expected and can easily be improved.
Tested on Python 2.7 & 3.6~
Usage:
test = "I {{what:like,love}} {{item:pizza,space,science}}"
print(ArgTempl(test).format(what=1, item=2))
> I love science
Thanks for all of the replies.
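One possible refinement of the class above is a single re.sub pass with a callback, which avoids changing any stored state between calls. This is only a sketch: the function name choose and the decision to leave placeholders without a selection untouched are assumptions for the illustration.

```python
import re

# {{key:option0,option1,...}} with the braces escaped explicitly
CHOICE = re.compile(r'\{\{(\w+):([\w,]+)\}\}')

def choose(template, **selected):
    def repl(match):
        key = match.group(1)
        options = match.group(2).split(',')
        if key not in selected:
            return match.group(0)  # keep the placeholder as written
        return options[selected[key]]
    return CHOICE.sub(repl, template)

test = "I {{what:like,love}} {{item:pizza,space,science}}"
text = choose(test, what=1, item=2)
partial = choose(test, what=0)
print(text)
print(partial)
```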

replace placeholder tags with dictionary fields in python

this is my code so far:
import re
template="Hello,my name is [name],today is [date] and the weather is [weather]"
placeholder=re.compile('(\[([a-z]+)\])')
find_tags=placeholder.findall(cam.template_id.text)
fields={field_name:'Michael',field_date:'21/06/2015',field_weather:'sunny'}
for key,placeholder in find_tags:
assemble_msg=template.replace(placeholder,?????)
print assemble_msg
I want to replace every tag with the associated dictionary field and the final message to be like this:
My name is Michael,today is 21/06/2015 and the weather is sunny.
I want to do this automatically and not manually. I am sure that the solution is simple, but I couldn't find one so far. Any help?
No need for a manual solution using regular expressions. This is (in a slightly different format) already supported by str.format:
>>> template = "Hello, my name is {name}, today is {date} and the weather is {weather}"
>>> fields = {'name': 'Michael', 'date': '21/06/2015', 'weather': 'sunny'}
>>> template.format(**fields)
Hello, my name is Michael, today is 21/06/2015 and the weather is sunny
If you can not alter your template string accordingly, you can easily replace the [] with {} in a preprocessing step. But note that this will raise a KeyError in case one of the placeholders is not present in the fields dict.
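The preprocessing step could be sketched as follows. The KeepMissing dict subclass is an assumption added for the illustration: together with str.format_map it leaves unknown placeholders in the text instead of raising KeyError. Literal braces in the template are doubled first so that str.format does not misread them.

```python
import re

class KeepMissing(dict):
    # Called by format_map for keys that are not in the dict
    def __missing__(self, key):
        return '[' + key + ']'

def fill(template, fields):
    # Protect literal braces, then turn [name] into {name}
    escaped = template.replace('{', '{{').replace('}', '}}')
    fmt = re.sub(r'\[([a-z]+)\]', r'{\1}', escaped)
    return fmt.format_map(KeepMissing(fields))

template = "Hello, my name is [name], today is [date] and the weather is [weather]"
message = fill(template, {'name': 'Michael', 'date': '21/06/2015'})
print(message)
```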
In case you want to keep your manual approach, you could try like this:
template = "Hello, my name is [name], today is [date] and the weather is [weather]"
fields = {'field_name': 'Michael', 'field_date': '21/06/2015', 'field_weather': 'sunny'}
for placeholder, key in re.findall('(\[([a-z]+)\])', template):
    template = template.replace(placeholder, fields.get('field_' + key, placeholder))
Or a bit simpler, without using regular expressions:
for key in fields:
    placeholder = "[%s]" % key[6:]
    template = template.replace(placeholder, fields[key])
Afterwards, template is the new string with replacements. If you need to keep the template, just create a copy of that string and do the replacement in that copy. In this version, if a placeholder can not be resolved, it stays in the string. (Note that I swapped the meaning of key and placeholder in the loop, because IMHO it makes more sense that way.)
You can use dictionaries to put data straight into strings, like so...
fields={'field_name':'Michael','field_date':'21/06/2015','field_weather':'sunny'}
string="Hello,my name is %(field_name)s,today is %(field_date)s and the weather is %(field_weather)s" % fields
This might be an easier alternative for you?
