python: extracting variables from string templates

I am familiar with the ability to insert variables into a string using Templates, like this:
Template('value is between $min and $max').substitute(min=5, max=10)
What I now want to know is if it is possible to do the reverse. I want to take a string, and extract the values from it using a template, so that I have some data structure (preferably just named variables, but a dict is fine) that contains the extracted values. For example:
>>> string = 'value is between 5 and 10'
>>> d = Backwards_template('value is between $min and $max').extract(string)
>>> print d
{'min': '5', 'max':'10'}
Is this possible?

That's called regular expressions:
import re
string = 'value is between 5 and 10'
m = re.match(r'value is between (.*) and (.*)', string)
print(m.group(1), m.group(2))
Output:
5 10
Update 1. Names can be given to groups:
m = re.match(r'value is between (?P<min>.*) and (?P<max>.*)', string)
print(m.group('min'), m.group('max'))
Named groups are not used often, though, because the harder problem is usually capturing exactly what you want. That is not a big deal in this particular case, but even here: what if the string is 'value is between 1 and 2 and 3'? Should the string be accepted, and what are min and max?
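One way to sidestep that question, sketched here under the assumption that min and max are always unsigned integers (the question does not say so), is to capture digits only and require the whole string to match. The names STRICT and extract are made up for this illustration:

```python
import re

# Assumption: both values are unsigned integers; adjust the groups otherwise.
STRICT = re.compile(r'value is between (?P<min>\d+) and (?P<max>\d+)')

def extract(text):
    """Return {'min': ..., 'max': ...}, or None if text does not fit exactly."""
    m = STRICT.fullmatch(text)
    return m.groupdict() if m else None

print(extract('value is between 5 and 10'))       # {'min': '5', 'max': '10'}
print(extract('value is between 1 and 2 and 3'))  # None
```

With fullmatch() the ambiguous string is rejected outright instead of being silently split in one of two ways.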
Update 2. Rather than making a precise regex, it's sometimes easier to combine regular expressions and "regular" code like this:
m = re.match(r'value is between (?P<min>.*) and (?P<max>.*)', string)
try:
    value_min = float(m.group('min'))
    value_max = float(m.group('max'))
except (AttributeError, ValueError): # no match or failed conversion
    value_min = None
    value_max = None
This combined approach is especially worth remembering when your text consists of many chunks (like phrases in quotes of different types) to be processed: in tricky cases, it's harder to define a single regex to handle both delimiters and contents of chunks than to define several steps like text.split(), optional merging of chunks, and independent processing of each chunk (using regexes and other means).
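A minimal sketch of that split-then-process idea, assuming a made-up input format of comma-separated name=value chunks (parse_chunks is a hypothetical helper, not part of any library):

```python
import re

def parse_chunks(text):
    """Split first, then handle each chunk independently."""
    result = {}
    for chunk in text.split(','):
        # A small regex per chunk is easier to get right than one big regex.
        m = re.fullmatch(r'\s*(\w+)\s*=\s*(\S+)\s*', chunk)
        if m is None:
            continue  # skip chunks that are not name=value
        name, raw = m.groups()
        try:
            result[name] = float(raw)
        except ValueError:
            result[name] = None  # failed conversion
    return result

print(parse_chunks('min=5, max=abc, step=0.5'))
# {'min': 5.0, 'max': None, 'step': 0.5}
```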

It's not possible to perfectly reverse the substitution. The problem is that some strings are ambiguous, for example
value is between 5 and 7 and 10
would have two possible solutions: min = "5", max = "7 and 10" and min = "5 and 7", max = "10"
However, you might be able to achieve useful results with regex:
import re
string = 'value is between 5 and 10'
template= 'value is between $min and $max'
pattern= re.escape(template)
pattern= re.sub(r'\\\$(\w+)', r'(?P<\1>.*)', pattern)
match= re.match(pattern, string)
print(match.groupdict()) # output: {'max': '10', 'min': '5'}
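To see which of the two solutions a regex actually picks for the ambiguous string, compare a greedy and a lazy first group. This is only an illustration of standard re behaviour; the variable names are made up:

```python
import re

ambiguous = 'value is between 5 and 7 and 10'

# Greedy: the first group grabs as much as it can.
greedy = re.fullmatch(r'value is between (?P<min>.*) and (?P<max>.*)', ambiguous)
# Lazy: the first group stops at the first ' and '.
lazy = re.fullmatch(r'value is between (?P<min>.*?) and (?P<max>.*)', ambiguous)

print(greedy.groupdict())  # {'min': '5 and 7', 'max': '10'}
print(lazy.groupdict())    # {'min': '5', 'max': '7 and 10'}
```

So the generated pattern above, which uses the greedy .*, resolves the ambiguity in favour of the longest possible min.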

The behave module for Behavior-Driven Development provides a few different mechanisms for specifying and parsing templates.
Depending on the complexity of your templates, and the other needs of your app, you might find one or the other most useful. (Plus, you can steal their pre-written code.)

You can use the difflib module to compare the two strings and pull out the information you want.
https://docs.python.org/3.6/library/difflib.html
For example:
import difflib
def backwards_template(my_string, template):
    my_lib = {}
    entry = ''
    value = ''
    for s in difflib.ndiff(my_string, template):
        if s[0]==' ':
            if entry != '' and value != '':
                my_lib[entry] = value
                entry = ''
                value = ''
        elif s[0]=='-':
            value += s[2]
        elif s[0]=='+':
            if s[2] != '$':
                entry += s[2]
    # check ending if non-empty
    if entry != '' and value != '':
        my_lib[entry] = value
    return my_lib
my_string = 'value is between 5 and 10'
template = 'value is between $min and $max'
print(backwards_template(my_string, template))
Gives:
{'min': '5', 'max': '10'}

Related

Can I include variable in string formatting mini-language (Python)? [duplicate]

Is it possible to use variables in the format specifier in the format()-function in Python? I have the following code, and I need VAR to equal field_size:
def pretty_printer(*numbers):
    str_list = [str(num).lstrip('0') for num in numbers]
    field_size = max([len(string) for string in str_list])
    i = 1
    for num in numbers:
        print("Number", i, ":", format(num, 'VAR.2f')) # VAR needs to equal field_size
You can use the str.format() method, which lets you interpolate other variables for things like the width:
'Number {i}: {num:{field_size}.2f}'.format(i=i, num=num, field_size=field_size)
Each {} is a placeholder, filling in named values from the keyword arguments (you can use numbered positional arguments too). The part after the optional : gives the format (the second argument to the format() function, basically), and you can use more {} placeholders there to fill in parameters.
Using numbered positions would look like this:
'Number {0}: {1:{2}.2f}'.format(i, num, field_size)
but you could also mix the two or pick different names:
'Number {0}: {1:{width}.2f}'.format(i, num, width=field_size)
If you omit the numbers and names, the fields are automatically numbered, so the following is equivalent to the preceding format:
'Number {}: {:{width}.2f}'.format(i, num, width=field_size)
Note that the whole string is a template, so things like the Number string and the colon are part of the template here.
You need to take into account that the field size includes the decimal point, however; you may need to adjust your size to add those 3 extra characters.
Demo:
>>> i = 3
>>> num = 25
>>> field_size = 7
>>> 'Number {i}: {num:{field_size}.2f}'.format(i=i, num=num, field_size=field_size)
'Number 3:   25.00'
Last but not least, as of Python 3.6 you can put the variables directly into the string literal by using a formatted string literal:
f'Number {i}: {num:{field_size}.2f}'
The advantage of using a regular string template and str.format() is that you can swap out the template; the advantage of f-strings is that they make for very readable and compact string formatting, inline in the string literal itself.
I prefer this (new 3.6) style:
name = 'Eugene'
f'Hello, {name}!'
or a multi-line string:
f'''
Hello,
{name}!!!
{a_number_to_format:.1f}
'''
which is really handy.
I find the old style formatting sometimes hard to read. Even concatenation could be more readable. See an example:
'{} {} {} {} which one is which??? {} {} {}'.format('1', '2', '3', '4', '5', '6', '7')
I just assigned field_size to VAR and changed the print statement to build the format spec with an f-string. It works.
def pretty_printer(*numbers):
    str_list = [str(num).lstrip('0') for num in numbers]
    field_size = max([len(string) for string in str_list])
    VAR=field_size
    i = 1
    for num in numbers:
        print("Number", i, ":", format(num, f'{VAR}.2f'))
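Putting the answers above together, here is one possible version of the function. It is a sketch, not the asker's final code: it assumes the width should be measured on the formatted numbers (so the decimal point and the two decimals are counted), and it returns the lines instead of printing them so that the result is easy to check:

```python
def pretty_printer(*numbers):
    # Width of the widest number once it is formatted with two decimals.
    field_size = max(len(format(num, '.2f')) for num in numbers)
    lines = []
    for i, num in enumerate(numbers, start=1):
        # The nested {width} placeholder supplies the field width at run time.
        lines.append('Number {}: {:{width}.2f}'.format(i, num, width=field_size))
    return lines

for line in pretty_printer(1, 25.5, 100):
    print(line)
# Number 1:   1.00
# Number 2:  25.50
# Number 3: 100.00
```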

Convert, or unformat, a string to variables (like format(), but in reverse) in Python

I have strings of the form Version 1.4.0\n and Version 1.15.6\n, and I'd like a simple way of extracting the three numbers from them. I know I can put variables into a string with the format method; I basically want to do that backwards, like this:
# So I know I can do this:
x, y, z = 1, 4, 0
print 'Version {0}.{1}.{2}\n'.format(x,y,z)
# Output is 'Version 1.4.0\n'
# But I'd like to be able to reverse it:
mystr='Version 1.15.6\n'
a, b, c = mystr.unformat('Version {0}.{1}.{2}\n')
# And have the result that a, b, c = 1, 15, 6
Someone else I found asked the same question, but the reply was specific to their particular case: Use Python format string in reverse for parsing
A general answer (how to do format() in reverse) would be great! An answer for my specific case would be very helpful too though.
Just to build on Uche's answer, I was looking for a way to reverse a string via a pattern with kwargs. So I put together the following function:
import re
def string_to_dict(string, pattern):
    regex = re.sub(r'{(.+?)}', r'(?P<_\1>.+)', pattern)
    values = list(re.search(regex, string).groups())
    keys = re.findall(r'{(.+?)}', pattern)
    _dict = dict(zip(keys, values))
    return _dict
Which works as per:
>>> p = 'hello, my name is {name} and I am a {age} year old {what}'
>>> s = p.format(name='dan', age=33, what='developer')
>>> s
'hello, my name is dan and I am a 33 year old developer'
>>> string_to_dict(s, p)
{'age': '33', 'name': 'dan', 'what': 'developer'}
>>> s = p.format(name='cody', age=18, what='quarterback')
>>> s
'hello, my name is cody and I am a 18 year old quarterback'
>>> string_to_dict(s, p)
{'age': '18', 'name': 'cody', 'what': 'quarterback'}
>>> import re
>>> re.findall(r'(\d+)\.(\d+)\.(\d+)', 'Version 1.15.6\n')
[('1', '15', '6')]
The pypi package parse serves this purpose well:
pip install parse
Can be used like this:
>>> import parse
>>> result = parse.parse('Version {0}.{1}.{2}\n', 'Version 1.15.6\n')
>>> result
<Result ('1', '15', '6') {}>
>>> values=list(result)
>>> print(values)
['1', '15', '6']
Note that the docs say the parse package does not EXACTLY emulate the format specification mini-language by default; it also uses some type-indicators specified by re. Of special note is that s means "whitespace" by default, rather than str. This can be easily modified to be consistent with the format specification by changing the default type for s to str (using extra_types):
result = parse.parse(format_str, string, extra_types=dict(s=str))
Here is a conceptual idea for a modification of the string.Formatter built-in class using the parse package to add unformat capability that I have used myself:
import parse
from string import Formatter
class Unformatter(Formatter):
    '''A parsable formatter.'''
    def unformat(self, format, string, extra_types=dict(s=str), evaluate_result=True):
        return parse.parse(format, string, extra_types, evaluate_result)
    unformat.__doc__ = parse.Parser.parse.__doc__
IMPORTANT: the method name parse is already in use by the Formatter class, so I have chosen unformat instead to avoid conflicts.
UPDATE: You might use it like this, very similar to the string.Formatter class.
Formatting (identical to '{:d} {:d}'.format(1, 2)):
>>> formatter = Unformatter()
>>> s = formatter.format('{:d} {:d}', 1, 2)
>>> s
'1 2'
Unformatting:
>>> result = formatter.unformat('{:d} {:d}', s)
>>> result
<Result (1, 2) {}>
>>> tuple(result)
(1, 2)
This is of course of very limited use as shown above. However, I've put up a pypi package (parmatter - a project originally for my own use but maybe others will find it useful) that explores some ideas of how to put this idea to more useful work. The package relies heavily on the aforementioned parse package. EDIT: a few years of experience under my belt later, I realized parmatter (my first package!) was a terrible, embarrassing idea and have since deleted it.
Actually, the Python regular expression library already provides the general functionality you are asking for. You just have to change the syntax of the pattern slightly:
>>> import re
>>> from operator import itemgetter
>>> mystr='Version 1.15.6\n'
>>> m = re.match(r'Version (?P<_0>.+)\.(?P<_1>.+)\.(?P<_2>.+)', mystr)
>>> map(itemgetter(1), sorted(m.groupdict().items()))
['1', '15', '6']
As you can see, you have to change the (un)format strings from {0} to (?P<_0>.+). You could even require a decimal with (?P<_0>\d+). In addition, you have to escape some of the characters to prevent them from being interpreted as regex special characters. But this in turn can be automated, e.g. with
>>> re.sub(r'\\{(\d+)\\}', r'(?P<_\1>.+)', re.escape('Version {0}.{1}.{2}'))
'Version\\ (?P<_0>.+)\\.(?P<_1>.+)\\.(?P<_2>.+)'
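Wrapped into a function, the same idea might look like the sketch below. The name unformat is hypothetical, and the code assumes that every field is a positional {0}, {1}, ... field holding a non-negative integer, with indices running from 0 upward:

```python
import re

def unformat(fmt, text):
    """Reverse a format string made of {0}, {1}, ... fields (illustrative helper)."""
    # re.escape turns '{0}' into '\{0\}'; swap each escaped field for a named group.
    pattern = re.sub(r'\\\{(\d+)\\\}', r'(?P<_\1>\\d+)', re.escape(fmt))
    match = re.fullmatch(pattern, text)
    if match is None:
        return None
    groups = match.groupdict()
    # Return the values in field order, converted to int.
    return tuple(int(groups['_%d' % i]) for i in range(len(groups)))

a, b, c = unformat('Version {0}.{1}.{2}\n', 'Version 1.15.6\n')
print(a, b, c)  # 1 15 6
```

Using \d+ instead of .+ keeps the groups from swallowing the dots, at the cost of only handling integer fields.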
Some time ago I made the code below that does the reverse of format but limited to the cases I needed.
And, I never tried it, but I think this is also the purpose of the parse library
My code:
import string
import re
_def_re = '.+'
_int_re = '[0-9]+'
_float_re = r'[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?'
_spec_char = r'[\^$.|?*+()'
def format_parse(text, pattern):
    """
    Scan `text` using the string.format-type `pattern`
    If `text` is not a string but iterable return a list of parsed elements
    Not all format-like patterns can be processed:
      - variable names cannot repeat (even unspecified ones, such as '{}_{0}')
      - alignment is not taken into account
      - only the following variable types are recognized:
          'd' looks for and returns an integer
          'f' looks for and returns a float
    Examples::
        res = format_parse('the depth is -42.13', 'the {name} is {value:f}')
        print res
        print type(res['value'])
        # {'name': 'depth', 'value': -42.13}
        # <type 'float'>
        print 'the {name} is {value:f}'.format(**res)
        # 'the depth is -42.130000'
        # Ex2: without given variable name and with an invalid item (2nd)
        versions = ['Version 1.4.0', 'Version 3,1,6', 'Version 0.1.0']
        v = format_parse(versions, 'Version {:d}.{:d}.{:d}')
        # v=[{0: 1, 1: 4, 2: 0}, None, {0: 0, 1: 1, 2: 0}]
    """
    # convert pattern to suitable regular expression & variable name
    v_int = 0 # available integer variable name for unnamed variable
    cur_g = 0 # index of the current regexp group name
    n_map = {} # map variable name (keys) to regexp group name (values)
    v_cvt = {} # (optional) type conversion function attached to variable name
    rpattern = '^' # stores the regexp pattern related to the format pattern
    for txt,vname, spec, conv in string.Formatter().parse(pattern):
        # process variable name
        if len(vname)==0:
            vname = v_int
            v_int += 1
        if vname not in n_map:
            gname = '_'+str(cur_g)
            n_map[vname] = gname
            cur_g += 1
        else:
            gname = n_map[vname]
        # process type of required variables
        if 'd' in spec: vtype = _int_re; v_cvt[vname] = int
        elif 'f' in spec: vtype = _float_re; v_cvt[vname] = float
        else: vtype = _def_re;
        # check for regexp special characters in txt (add '\' before)
        txt = ''.join(map(lambda c: '\\'+c if c in _spec_char else c, txt))
        rpattern += txt + '(?P<'+gname+'>' + vtype +')'
    rpattern += '$'
    # replace dictionary key from regexp group-name to the variable-name
    def map_result(match):
        if match is None: return None
        match = match.groupdict()
        match = dict((vname, match[gname]) for vname,gname in n_map.iteritems())
        for vname, value in match.iteritems():
            if vname in v_cvt:
                match[vname] = v_cvt[vname](value)
        return match
    # parse pattern
    if isinstance(text,basestring):
        match = re.search(rpattern, text)
        match = map_result(match)
    else:
        comp = re.compile(rpattern)
        match = map(comp.search, text)
        match = map(map_result, match)
    return match
For your case, here is a usage example:
versions = ['Version 1.4.0', 'Version 3.1.6', 'Version 0.1.0']
v = format_parse(versions, 'Version {:d}.{:d}.{:d}')
# v=[{0: 1, 1: 4, 2: 0}, {0: 3, 1: 1, 2: 6}, {0: 0, 1: 1, 2: 0}]
# to get the versions as a list of integer list, you can use:
v = [[vi[i] for i in range(3)] for vi in filter(None,v)]
Note the filter(None,v) to remove unparsable versions (which return None). Here it is not necessary.
This
a, b, c = (int(i) for i in mystr.split()[1].split('.'))
will give you int values for a, b and c
>>> a
1
>>> b
15
>>> c
6
Depending on how consistent your number/version formats are, you may want to consider regular expressions; if they will stay in this format, though, I would favor the simpler solution if it works for you.
Here's a solution in case you don't want to use the parse module. It converts format strings into regular expressions with named groups. It makes a few assumptions (described in the docstring) that were okay in my case, but may not be okay in yours.
import re
def match_format_string(format_str, s):
    """Match s against the given format string, return dict of matches.
    We assume all of the arguments in format string are named keyword arguments (i.e. no {} or
    {:0.2f}). We also assume that all chars are allowed in each keyword argument, so separators
    need to be present which aren't present in the keyword arguments (i.e. '{one}{two}' won't work
    reliably as a format string but '{one}-{two}' will if the hyphen isn't used in {one} or {two}).
    We raise if the format string does not match s.
    Example:
    fs = '{test}-{flight}-{go}'
    s = fs.format(test='first', flight='second', go='third')
    match_format_string(fs, s) -> {'test': 'first', 'flight': 'second', 'go': 'third'}
    """
    # First split on any keyword arguments, note that the names of keyword arguments will be in the
    # 1st, 3rd, ... positions in this list
    tokens = re.split(r'\{(.*?)\}', format_str)
    keywords = tokens[1::2]
    # Now replace keyword arguments with named groups matching them. We also escape between keyword
    # arguments so we support meta-characters there. Re-join tokens to form our regexp pattern
    tokens[1::2] = map(u'(?P<{}>.*)'.format, keywords)
    tokens[0::2] = map(re.escape, tokens[0::2])
    pattern = ''.join(tokens)
    # Use our pattern to match the given string, raise if it doesn't match
    matches = re.match(pattern, s)
    if not matches:
        raise Exception("Format string did not match")
    # Return a dict with all of our keywords and their values
    return {x: matches.group(x) for x in keywords}

Regex Python / group quantifiers

I want to match a list of variables which look like directories, e.g.:
Same/Same2/Foot/Ankle/Joint/Actuator/Sensor/Temperature/Value=4.123
Same/Same2/Battery/Name=SomeString
Same/Same2/Home/Land/Some/More/Stuff=0.34
The length of the "subdirectories" is variable having an upper bound (above it's 9).
I want to group every subdirectory except the 1st one which I named "Same" above.
The best I could come up with is:
^(?:([^/]+)/){4,8}([^/]+)=(.*)
It already looks for 4-8 subdirectories but only groups the last one. Why's that?
Is there a better solution using group quantifiers?
Edit: Solved. Will use split() instead.
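As for the "why": in Python's re module, a capturing group that sits inside a quantifier is overwritten on every repetition, so only its last match is kept. A short demonstration, together with the split() alternative mentioned in the edit (the variable names are only for illustration):

```python
import re

line = 'Same/Same2/Foot/Ankle/Joint/Actuator/Sensor/Temperature/Value=4.123'

# Group 1 is matched eight times here, but only the last repetition survives.
m = re.match(r'^(?:([^/]+)/){4,8}([^/]+)=(.*)', line)
print(m.groups())  # ('Temperature', 'Value', '4.123')

# split() keeps every component; [1:] drops the leading "Same".
path, value = line.rsplit('=', 1)
parts = path.split('/')[1:]
print(parts, value)
```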
import re
regx = re.compile('(?:(?<=\A)|(?<=/)).+?(?=/|\Z)')
for ss in ('Same/Same2/Foot/Ankle/Joint/Actuator/Sensor/Temperature/Value=4.123',
           'Same/Same2/Battery/Name=SomeString',
           'Same/Same2/Home/Land/Some/More/Stuff=0.34'):
    print ss
    print regx.findall(ss)
    print
Edit 1
Now that you have given more info on what you want to obtain ( _"Same/Same2/Battery/Name=SomeString becoming SAME2_BATTERY_NAME=SomeString"_ ), better solutions can be proposed, either with a regex or with split() + replace():
import re
from os import sep
sep2 = r'\\' if sep=='\\' else '/'
pat = '^(?:.+?%s)(.+$)' % sep2
print 'pat==%s\n' % pat
ragx = re.compile(pat)
for ss in ('Same\Same2\Foot\Ankle\Joint\Actuator\Sensor\Temperature\Value=4.123',
           'Same\Same2\Battery\Name=SomeString',
           'Same\Same2\Home\Land\Some\More\Stuff=0.34'):
    print ss
    print ragx.match(ss).group(1).replace(sep,'_')
    print ss.split(sep,1)[1].replace(sep,'_')
    print
result
pat==^(?:.+?\\)(.+$)
Same\Same2\Foot\Ankle\Joint\Actuator\Sensor\Temperature\Value=4.123
Same2_Foot_Ankle_Joint_Actuator_Sensor_Temperature_Value=4.123
Same2_Foot_Ankle_Joint_Actuator_Sensor_Temperature_Value=4.123
Same\Same2\Battery\Name=SomeString
Same2_Battery_Name=SomeString
Same2_Battery_Name=SomeString
Same\Same2\Home\Land\Some\More\Stuff=0.34
Same2_Home_Land_Some_More_Stuff=0.34
Same2_Home_Land_Some_More_Stuff=0.34
Edit 2
Re-reading your comment, I realized that I didn't take into account that you want to uppercase the part of the string that lies before the '=' sign but not the part after it.
Hence this new code, which shows 3 methods that meet this requirement. Choose whichever you prefer:
import re
from os import sep
sep2 = r'\\' if sep=='\\' else '/'
pot = '^(?:.+?%s)(.+?)=([^=]*$)' % sep2
print 'pot==%s\n' % pot
rogx = re.compile(pot)
pet = '^(?:.+?%s)(.+?(?==[^=]*$))' % sep2
print 'pet==%s\n' % pet
regx = re.compile(pet)
for ss in ('Same\Same2\Foot\Ankle\Joint\Sensor\Value=4.123',
           'Same\Same2\Battery\Name=SomeString',
           'Same\Same2\Ocean\Atlantic\North=',
           'Same\Same2\Maths\Addition\\2+2=4\Simple=ohoh'):
    print ss + '\n' + len(ss)*'-'
    print 'rogx groups '.rjust(32),rogx.match(ss).groups()
    a,b = ss.split(sep,1)[1].rsplit('=',1)
    print 'split split '.rjust(32),(a,b)
    print 'split split join upper replace %s=%s' % (a.replace(sep,'_').upper(),b)
    print 'regx split group '.rjust(32),regx.match(ss.split(sep,1)[1]).group()
    print 'regx split sub '.rjust(32),\
          regx.sub(lambda x: x.group(1).replace(sep,'_').upper(), ss)
    print
result, on a Windows platform
pot==^(?:.+?\\)(.+?)=([^=]*$)
pet==^(?:.+?\\)(.+?(?==[^=]*$))
Same\Same2\Foot\Ankle\Joint\Sensor\Value=4.123
----------------------------------------------
rogx groups ('Same2\\Foot\\Ankle\\Joint\\Sensor\\Value', '4.123')
split split ('Same2\\Foot\\Ankle\\Joint\\Sensor\\Value', '4.123')
split split join upper replace SAME2_FOOT_ANKLE_JOINT_SENSOR_VALUE=4.123
regx split group Same2\Foot\Ankle\Joint\Sensor\Value
regx split sub SAME2_FOOT_ANKLE_JOINT_SENSOR_VALUE=4.123
Same\Same2\Battery\Name=SomeString
----------------------------------
rogx groups ('Same2\\Battery\\Name', 'SomeString')
split split ('Same2\\Battery\\Name', 'SomeString')
split split join upper replace SAME2_BATTERY_NAME=SomeString
regx split group Same2\Battery\Name
regx split sub SAME2_BATTERY_NAME=SomeString
Same\Same2\Ocean\Atlantic\North=
--------------------------------
rogx groups ('Same2\\Ocean\\Atlantic\\North', '')
split split ('Same2\\Ocean\\Atlantic\\North', '')
split split join upper replace SAME2_OCEAN_ATLANTIC_NORTH=
regx split group Same2\Ocean\Atlantic\North
regx split sub SAME2_OCEAN_ATLANTIC_NORTH=
Same\Same2\Maths\Addition\2+2=4\Simple=ohoh
-------------------------------------------
rogx groups ('Same2\\Maths\\Addition\\2+2=4\\Simple', 'ohoh')
split split ('Same2\\Maths\\Addition\\2+2=4\\Simple', 'ohoh')
split split join upper replace SAME2_MATHS_ADDITION_2+2=4_SIMPLE=ohoh
regx split group Same2\Maths\Addition\2+2=4\Simple
regx split sub SAME2_MATHS_ADDITION_2+2=4_SIMPLE=ohoh
I probably misunderstood what exactly you want to do, but here is how you would do it without regex:
for entry in list_of_vars:
    key, value = entry.split('=')
    key_components = key.split('/')
    if 4 <= len(key_components) <= 8:
        # here the actual work is done
        print "%s=%s" % ('_'.join(key_components[1:]).upper(), value)
Just use split?
>>> p='Same/Same2/Foot/Ankle/Joint/Actuator/Sensor/Temperature/Value=4.123'
>>> p.split('/')
['Same', 'Same2', 'Foot', 'Ankle', 'Joint', 'Actuator', 'Sensor', 'Temperature', 'Value=4.123']
Also, if you want that key/val pair you can do something like this...
>>> s = p.split('/')
>>> s[-1].split('=')
['Value', '4.123']
A couple of variations on your theme. For one, I've always found regexen to be cryptic to the point of unmaintainable, so I wrote the pyparsing module. In my mind, I look at your code and think, "oh, it's a list of '/'-delimited strings, an '=' sign, and then some kind of rvalue." And that translates pretty directly into the pyparsing parser definition code. By adding a name here and there in the parser ("key" and "value", similar to named groups in regex), the output is pretty easily processed.
data="""\
Same/Same2/Foot/Ankle/Joint/Actuator/Sensor/Temperature/Value=4.123
Same/Same2/Battery/Name=SomeString
Same/Same2/Home/Land/Some/More/Stuff=0.34""".splitlines()
from pyparsing import Word, alphas, alphanums, nums, QuotedString, delimitedList
wd = Word(alphas, alphanums)
number = Word(nums+'+-', nums+'.').setParseAction(lambda t:float(t[0]))
rvalue = wd | number | QuotedString('"')
defn = delimitedList(wd, '/')('key') + '=' + rvalue('value')
for d in data:
    result = defn.parseString(d)
Second, I question your approach of defining all of those variable names. Creating variable names on the fly based on your data is a pretty well-recognized Code Smell (not necessarily bad, but you might really want to rethink this approach). I used a recursive defaultdict to create a navigable structure so that you can easily do operations like "find all the entries that are sub-elements of Same2" (in this case, "Foot", "Battery", and "Home"). This kind of work is more difficult when trying to sift through some collection of variable names as found in locals(); it seems to me you will end up re-parsing these names to reconstruct the key hierarchy.
from collections import defaultdict
class recursivedefaultdict(defaultdict):
    def __init__(self, attrFactory=int):
        self.default_factory = lambda : type(self)(attrFactory)
        self._attrFactory = attrFactory
    def __getattr__(self, attr):
        newval = self._attrFactory()
        setattr(self, attr, newval)
        return newval
table = recursivedefaultdict()
# parse each entry, and accumulate into hierarchical dict
for d in data:
    # use pyparsing parser, gives us key (list of names) and value
    result = defn.parseString(d)
    t = table
    for k in result.key[:-1]:
        t = t[k]
    t[result.key[-1]] = result.value
# recursive method to iterate over hierarchical dict
def showTable(t, indent=''):
    for k,v in t.items():
        print indent+k,
        if isinstance(v,dict):
            print
            showTable(v, indent+' ')
        else:
            print v
showTable(table)
Prints:
Same
 Same2
  Foot
   Ankle
    Joint
     Actuator
      Sensor
       Temperature
        Value 4.123
  Battery
   Name SomeString
  Home
   Land
    Some
     More
      Stuff 0.34
If you are really set on defining those variable names, then adding some helpful parse actions to pyparsing will reformat the parsed data at parse time, so that it's directly processable afterwards:
wd = Word(alphas, alphanums)
number = Word(nums+'+-', nums+'.').setParseAction(lambda t:float(t[0]))
rvaluewd = wd.copy().setParseAction(lambda t: '"%s"' % t[0])
rvalue = rvaluewd | number | QuotedString('"')
defn = delimitedList(wd, '/')('key') + '=' + rvalue('value')
def joinNamesWithAllCaps(tokens):
    tokens["key"] = '_'.join(map(str.upper, tokens.key))
defn.setParseAction(joinNamesWithAllCaps)
for d in data:
    result = defn.parseString(d)
    print result.key,'=', result.value
Prints:
SAME_SAME2_FOOT_ANKLE_JOINT_ACTUATOR_SENSOR_TEMPERATURE_VALUE = 4.123
SAME_SAME2_BATTERY_NAME = "SomeString"
SAME_SAME2_HOME_LAND_SOME_MORE_STUFF = 0.34
(Note that this also encloses your SomeString value in quotes, so that the resulting assignment statement is valid Python.)

parsing a line of text to get a specific number

I have a line of text in the form " some spaces variable = 7 = '0x07' some more data"
I want to parse it and get the number 7 from "some variable = 7". How can this be done in python?
I would use a simpler solution, avoiding regular expressions.
Split on '=' and get the value at the position you expect
text = 'some spaces variable = 7 = ...'
if '=' in text:
    chunks = text.split('=')
    assignedval = chunks[1] # second value, 7
    print 'assigned value is', assignedval
else:
    print 'no assignment in line'
Use a regular expression.
Essentially, you create an expression that goes something like "variable = (\d+)", do a match, and then take the first group, which will give you the string 7. You can then convert it to an int.
Read the tutorial in the link above.
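As a minimal sketch of the expression described above (allowing optional whitespace around the '=' is my own assumption about the input):

```python
import re

line = " some spaces variable = 7 = '0x07' some more data"

# Capture the run of digits that follows "variable =".
m = re.search(r'variable\s*=\s*(\d+)', line)
number = int(m.group(1)) if m else None
print(number)  # 7
```

If the line contains no such assignment, re.search() returns None and number stays None instead of raising.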
Basic regex code snippet to find numbers in a string.
>>> import re
>>> input = " some spaces variable = 7 = '0x07' some more data"
>>> nums = re.findall("[0-9]*", input)
>>> nums = [i for i in nums if i] # remove empty strings
>>> nums
['7', '0', '07']
Check out the documentation and How-To on python.org.
