Automatic conversion of the advanced string formatter from the old style - python

Is there any automatic way to convert a piece of code from python's old style string formatting (using %) to the new style (using .format)? For example, consider the formatting of a PDB atom specification:
spec = "%-6s%5d %4s%1s%3s %1s%4d%1s %8.3f%8.3f%8.3f%6.2f%6.2f %2s%2s"
I've been converting some of these specifications by hand as needed, but this is both error prone, and time-consuming as I have many such specifications.

Use pyupgrade
pyupgrade --py3-plus <filename>
You can convert to f-strings (formatted string literals) instead of .format() with
pyupgrade --py36-plus <filename>
You can install it with
pip install pyupgrade

The functionality of the two forms does not match up exactly, so there is no way you could automatically translate every % string into an equivalent {} string or (especially) vice-versa.
Of course there is a lot of overlap, and many of the sub-parts of the two formatting languages are the same or very similar, so someone could write a partial converter (which could, e.g., raise an exception for non-convertible code).
For a small subset of the language like what you seem to be using, you could do it pretty trivially with a simple regex—every pattern starts with % and ends with one of [sdf], and something like {:\1\2} as a replacement pattern ought to be all you need.
But why bother? Except as an exercise in writing parsers, what would be the benefit? The % operator is not deprecated, and using % with an existing % format string will obviously do at least as well as using format with a % format string converted to {}.
If you are looking at this as an exercise in writing parsers, I believe there's an incomplete example buried inside pyparsing.
Some differences that are hard to translate, off the top of my head:
* for dynamic field width or precision; format has a similar feature, but does it differently.
%(10)s, because format tries to interpret the key name as a number first, then falls back to a dict key.
%(a[b])s, because format doesn't quote or otherwise separate the key from the rest of the field, so a variety of characters simply can't be used.
%c takes integers or single-char strings; :c only integers.
%r/%s/%a analogues are not part of the format string, but a separate part of the field (which also comes on the opposite side).
%g and :g have slightly different cutoff rules.
%a and !a don't do the exact same thing.
The actual differences aren't listed anywhere; you will have to dig them out by a thorough reading of the Format Specification Mini-Language vs. the printf-style String Formatting language.

The docs explain some of the differences. As far as I can tell -- although I'm not very familiar with old-style format strings -- is that the functionality of the new style is a superset of the functionality of the oldstyle.
You'd have to do more tweaking to handle edge cases, but I think something simple like
re.replace(r'%(\w+)([sbcdoXnf...])', r'{\1\2}', your_string)
would get you 90% of the way there. The remaining translation -- going from things like %x to {0:x} -- will be too complex for a regular expression to handle (without writing some ridiculously complex conditionals inside of your regex).

Related

Attempted regex of string that contains multiple numbers as different base counts

Currently building an encryption module in python 3.8 and have run into a snag with a desired feature/upgrade. Looking for some assistance in finding a solution that would be more helpful than writing a 'string crawler' to parse out an encrypted string of data.
In my first 'official' release everything works fine, but this is due to how much easier it is to split a string based of off easily identifiable prefixes in a string. For example, '0x' in a hexadecimal or '0o' in an octal.
The current definitions for what a number can inhabit use the aforementioned base types along with support for counts of 2-10 as '{n}bXX'.
What I currently have implemented works just fine for the present design, but having trouble with trying to come up with something that can handle higher bases (at least up to 64) that isn't going to be bulky or slow; also, the redesign is having trouble parsing out a string which contains multiple base counts, post assignment to their corresponding characters.
TL;DR - If I have an encoded string like so: "0x9a0o25179b83629b86740xc01d64b9HM-70o5521"
I would like it to be split as: [0x9a, 0o2517, 9b8362, 9b8674, 0xc0ld, 64b9HM-7, 0o5521]
and need help finding a better solution than: r'(?:0x)|(?:9b)|...'

Why does python have so many ways to interpolate a variable inside a string? [duplicate]

Python has at least six ways of formatting a string:
In [1]: world = "Earth"
# method 1a
In [2]: "Hello, %s" % world
Out[2]: 'Hello, Earth'
# method 1b
In [3]: "Hello, %(planet)s" % {"planet": world}
Out[3]: 'Hello, Earth'
# method 2a
In [4]: "Hello, {0}".format(world)
Out[4]: 'Hello, Earth'
# method 2b
In [5]: "Hello, {planet}".format(planet=world)
Out[5]: 'Hello, Earth'
# method 2c
In [6]: f"Hello, {world}"
Out[6]: 'Hello, Earth'
In [7]: from string import Template
# method 3
In [8]: Template("Hello, $planet").substitute(planet=world)
Out[8]: 'Hello, Earth'
A brief history of the different methods:
printf-style formatting has been around since Pythons infancy
The Template class was introduced in Python 2.4
The format method was introduced in Python 2.6
f-strings were introduced in Python 3.6
My questions are:
Is printf-style formatting deprecated or going to be deprecated?
In the Template class, is the substitute method deprecated or going to be deprecated? (I'm not talking about safe_substitute, which as I understand it offers unique capabilities)
Similar questions and why I think they're not duplicates:
Python string formatting: % vs. .format — treats only methods 1 and 2, and asks which one is better; my question is explicitly about deprecation in the light of the Zen of Python
String formatting options: pros and cons — treats only methods 1a and 1b in the question, 1 and 2 in the answer, and also nothing about deprecation
advanced string formatting vs template strings — mostly about methods 1 and 3, and doesn't address deprecation
String formatting expressions (Python) — answer mentions that the original '%' approach is planned to be deprecated. But what's the difference between planned to be deprecated, pending deprecation and actual deprecation? And the printf-style method doesn't raise even a PendingDeprecationWarning, so is this really going to be deprecated? This post is also quite old, so the information may be outdated.
See also
PEP 502: String Interpolation - Extended Discussion
String Formatter
The new .format() method is meant to replace the old % formatting syntax. The latter has been de-emphasised, (but not officially deprecated yet). The method documentation states as much:
This method of string formatting is the new standard in Python 3, and should be preferred to the % formatting described in String Formatting Operations in new code.
(Emphasis mine).
To maintain backwards compatibility and to make transition easier, the old format has been left in place for now. From the original PEP 3101 proposal:
Backwards Compatibility
Backwards compatibility can be maintained by leaving the existing
mechanisms in place. The new system does not collide with any of
the method names of the existing string formatting techniques, so
both systems can co-exist until it comes time to deprecate the
older system.
Note the until it comes time to deprecate the older system; it hasn't been deprecated, but the new system is to be used whenever you write new code.
The new system has as an advantage that you can combine the tuple and dictionary approach of the old % formatter:
"{greeting}, {0}".format(world, greeting='Hello')
and is extensible through the object.__format__() hook used to handle formatting of individual values.
Note that the old system had % and the Template class, where the latter allows you to create subclasses that add or alter its behaviour. The new-style system has the Formatter class to fill the same niche.
Python 3 has further stepped away from deprecation, instead giving you warning in the printf-style String Formatting section:
Note: The formatting operations described here exhibit a variety of quirks that lead to a number of common errors (such as failing to display tuples and dictionaries correctly). Using the newer formatted string literals or the str.format() interface helps avoid these errors. These alternatives also provide more powerful, flexible and extensible approaches to formatting text.
Python 3.6 also added formatted string literals, which in-line the expressions into the format strings. These are the fastest method of creating strings with interpolated values, and should be used instead of str.format() wherever you can use a literal.
The % operator for string formatting is not deprecated, and is not going to be removed - despite the other answers.
Every time the subject is raised on Python development list, there is strong controversy on which is better, but no controversy on whether to remove the classic way - it will stay. Despite being denoted on PEP 3101, Python 3.1 had come and gone, and % formatting is still around.
The statements for the keeping classic style are clear: it is simple, it is fast, it is quick to do for short things. Using the .format method is not always more readable - and barely anyone - even among the core developers, can use the full syntax provided by .format without having to look at the reference
Even back in 2009, one had messages like this: http://mail.python.org/pipermail/python-dev/2009-October/092529.html - the subject had barely showed up in the lists since.
2016 update
In current Python development version (which will become Python 3.6) there is a third method of string interpolation, described on PEP-0498. It defines a new quote prefix f"" (besides the current u"", b"" and r"").
Prefixing a string by f will call a method on the string object at runtime, which will automatically interpolate variables from the current scope into the string:
>>> value = 80
>>> f'The value is {value}.'
'The value is 80.'
Looking at the older Python docs and PEP 3101 there was a statement that the % operator will be deprecated and removed from the language in the future. The following statement was in the Python docs for Python 3.0, 3.1, and 3.2:
Since str.format() is quite new, a lot of Python code still uses the %
operator. However, because this old style of formatting will
eventually be removed from the language, str.format() should generally
be used.
If you go to the same section in Python 3.3 and 3.4 docs, you will see that statement has been removed. I also cannot find any other statement anywhere else in the documentation indicating that the operator will be deprecated or removed from the language. It's also important to note that PEP3101 has not been modified in over two and a half years (Fri, 30 Sep 2011).
Update
PEP461 Adding % formatting to bytes and bytearray is accepted and should be part of Python 3.5 or 3.6. It's another sign that the % operator is alive and kicking.
While there are various indications in the docs that .format and f-strings are superior to % strings, there's no surviving plan to ever deprecate the latter.
In commit Issue #14123: Explicitly mention that old style % string formatting has caveats but is not going away any time soon., inspired by issue Indicate that there are no current plans to deprecate printf-style formatting, the docs on %-formatting were edited to contain this phrase:
As the new string-formatting syntax is more flexible and handles tuples and dictionaries naturally, it is recommended for new code. However, there are no current plans to deprecate printf-style formatting.
(Emphasis mine.)
This phrase was removed later, in commit Close #4966: revamp the sequence docs in order to better explain the state of modern Python. This might seem like a sign that a plan to deprecate % formatting was back on the cards... but diving into the bug tracker reveals that the intent was the opposite. On the bug tracker, the author of the commit characterises the change like this:
changed the prose that describes the relationship between printf-style formatting and the str.format method (deliberately removing the implication that the former is any real danger of disappearing - it's simply not practical for us to seriously contemplate killing it off)
In other words, we've had two consecutive changes to the %-formatting docs intended to explicitly emphasise that it will not be deprecated, let alone removed. The docs remain opinionated on the relative merits of different kinds of string formatting, but they're also clear the %-formatting isn't going to get deprecated or removed.
What's more, the most recent change to that paragraph, in March 2017, changed it from this...
The formatting operations described here exhibit a variety of quirks that lead to a number of common errors (such as failing to display tuples and dictionaries correctly). Using the newer formatted string literals or the str.format interface helps avoid these errors. These alternatives also provide more powerful, flexible and extensible approaches to formatting text.
... to this:
The formatting operations described here exhibit a variety of quirks that lead to a number of common errors (such as failing to display tuples and dictionaries correctly). Using the newer formatted string literals, the str.format interface, or template strings may help avoid these errors. Each of these alternatives provides their own trade-offs and benefits of simplicity, flexibility, and/or extensibility.
Notice the change from "helps avoid" to "may help avoid", and how the clear recommendation of .format and f-strings has been replaced by fluffy, equivocal prose about how each style "provides their own trade-offs and benefits". That is, not only is a formal deprecation no longer on the cards, but the current docs are openly acknowledging that % formatting at least has some "benefits" over the other approaches.
I'd infer from all this that the movement to deprecate or remove % formatting has not only faltered, but been defeated thoroughly and permanently.
Guido's latest position on this seems to be indicated here:
What’s New In Python 3.0
PEP 3101: A New Approach To String Formatting
A new system for built-in string formatting operations replaces
the % string formatting operator. (However, the % operator is still
supported; it will be deprecated in Python 3.1 and removed from the
language at some later time.) Read PEP 3101 for the full scoop.
And the PEP3101 itself, which has the last modified dating back to (Fri, 30 Sep 2011), so no progress as of late on that one, I suppose.

Gettext does not flag python format strings with "python-format" unlike old style python strings with format specifiers

It seems that gettext is not able to recognise Python format strings with replacement fields properly and hence does not flag them as "python-format". E.g.
ugettext("This is a sample format string with a {kwarg}").format(kwarg='key word argument')
However, gettext identifies Python strings with format specifiers properly and flags such source strings with "python-format", e.g.,
ugettext("This is a sample string with a %(format_spec).") % {'format_spec': 'format specifier'}
I have tried using xgettext and Django's manage.py makemessages tools to generate PO files for Python format strings, but didn't see the Python format strings being flagged as "python-format".
Also, http://www.gnu.org/software/gettext/manual/html_node/Python.html does not specify anything about the new Python format strings.
Please help me find a work around for this issue.
I suppose you are being hit by this bug: the gettext tools support only the old Python string format for substitutions. So you should use old string format. Or you can use a different tool for doing your translations.
This flag python-format refers to the classic __mod__ replacement method which uses the % operator. The newer format mini-language is something completely different and of course can't share the same format flag.
For someone who freshly jumps in and was told to use .format because it is new and flexible this might be confusing; but the flag name must contain -format for consistency reasons and is not at all related to the name of the used Python mechanism.
The "classic" style has the advantage of being faster. Something like
kwarg = 'keyword argument'
ugettext('This is a sample format string with a %(kwarg)s') % locals()
should be recognized, and flagged as python-format.
Something like
ugettext('This is a sample format string with a %s') % (kwarg,)
would be flagged as c-format.
(You need to take care for the translation of "keyword argument" yourself elsewhere, of course.)
Update: Apparently, the format flag for the .format() syntax is python-brace-format.

Mapping Unicode to ASCII in Python

I receive strings after querying via urlopen in JSON format:
def get_clean_text(text):
return text.translate(maketrans("!?,.;():", " ")).lower().strip()
for track in json["tracks"]:
print track["name"].lower()
get_clean_text(track["name"].lower())
For the string "türlich, türlich (sicher, dicker)" I then get
File "main.py", line 23, in get_clean_text
return text.translate(maketrans("!?,.;():", " ")).lower().strip()
TypeError: character mapping must return integer, None or unicode
I want to format the string to be "türlich türlich sicher dicker".
The question is not a complete self-contained example; I can't be sure whether it's Python 2 or 3, where maketrans came from, etc. There's a good chance I will guess wrong, which is why you should be sure to tag your questions appropriately and provide a short, self contained, correct example. (That, and the fact that various other people—some of them probably smarter than me—likely ignored your question because it was ambiguous.)
Assuming you're using 2.x, and you've done a from string import * to get maketrans, and json["name"] is unicode rather than str/bytes, here's your problem:
There are two kinds of translation tables: old-style 8-bit tables (which are just an array of 256 characters) and new-style sparse tables (which are just a dict mapping one character's ordinal to another). The str.translate function can use either, but unicode.translate can only use the second (for reasons that should be obvious if you think about it for a bit).
The string.maketrans function makes old-style 8-bit translation tables. So you can't use it with unicode.translate.
You can always write your own "makeunitrans" function as a drop-in replacement, something like this:
def makeunitrans(frm, to):
return {ord(f):ord(t) for (f,t) in zip(frm, to)}
But if you just want to map out certain characters, you could do something a bit more special purpose:
def makeunitrans(frm):
return {ord(f):ord(' ') for f in frm}
However, from your final comment, I'm not sure translate is even what you want:
I want to format the string to be "türlich türlich sicher dicker"
If you get this right, you're going to format the string to be "türlich türlich sicher dicker ", because you're mapping all those punctuation characters to spaces, not nothing.
With new-style translation tables you can map anything you want to None, which solves that problem. But you might want to step back and ask why you're using the translate method in the first place instead of, e.g., calling replace multiple times (people usually say "for performance", but you wouldn't be building the translation table in-line every time through if that were an issue) or using a trivial regular expression.

Tell a raw string (r'') from a regular string ('')?

I'm currently building a tool that will have to match filenames against a pattern. For convenience, I intend to provide both lazy matching (in a glob-like fashion) and regexp matching. For example, the following two snippets would eventually have the same effects:
#mylib.rule('static/*.html')
def myfunc():
pass
#mylib.rule(r'^static/([^/]+)\.html')
def myfunc():
pass
AFAIK r'' is only useful to the Python parser and it actually creates a standard str instance after parsing (the only difference being that it keeps the \).
Is anybody aware of a way to tell one from another?
I would hate to have to provide two alternate decorators for the same purpose or, worse, resorting manually parsing the string to determine if it's a regexp or not.
You can't tell them apart. Every raw string literal could also be written as a standard string literal (possibly requiring more quoting) and vice versa. Apart from this, I'd definitely give different names to the two decorators. They don't do the same things, they do different things.
Example (CPython):
>>> a = r'^static/([^/]+)\.html'; b = '^static/([^/]+)\.html'
>>> a is b
True
So in this particular example, the raw string literal and the standard string literal even result in the same string object.
You can't tell whether a string was defined as a raw string after the fact. Personally, I would in fact use a separate decorator, but if you don't want to, you could use a named parameter (e.g. #rule(glob="*.txt") for globs and #rule(re=r".+\.txt") for regex).
Alternatively, require users to provide a compiled regular expression object if they want to use a regex, e.g. #rule(re.compile(r".+\.txt")) -- this is easy to detect because its type is different.
The term "raw string" is confusing because it sounds like it is a special type of string - when in fact, it is just a special syntax for literals that tells the compiler to do no interpretation of '\' characters in the string. Unfortunately, the term was coined to describe this compile-time behavior, but many beginners assume it carries some special runtime characteristics.
I prefer to call them "raw string literals", to emphasize that it is their definition of a string literal using a don't-interpret-backslashes syntax that is what makes them "raw". Both raw string literals and normal string literals create strings (or strs), and the resulting variables are strings like any other. The string created by a raw string literal is equivalent in every way to the same string defined non-raw-ly using escaped backslashes.

Categories