I'm currently building a tool that will have to match filenames against a pattern. For convenience, I intend to provide both lazy matching (in a glob-like fashion) and regexp matching. For example, the following two snippets would eventually have the same effects:
#mylib.rule('static/*.html')
def myfunc():
pass
#mylib.rule(r'^static/([^/]+)\.html')
def myfunc():
pass
AFAIK r'' is only useful to the Python parser and it actually creates a standard str instance after parsing (the only difference being that it keeps the \).
Is anybody aware of a way to tell one from another?
I would hate to have to provide two alternate decorators for the same purpose or, worse, resorting manually parsing the string to determine if it's a regexp or not.
You can't tell them apart. Every raw string literal could also be written as a standard string literal (possibly requiring more quoting) and vice versa. Apart from this, I'd definitely give different names to the two decorators. They don't do the same things, they do different things.
Example (CPython):
>>> a = r'^static/([^/]+)\.html'; b = '^static/([^/]+)\.html'
>>> a is b
True
So in this particular example, the raw string literal and the standard string literal even result in the same string object.
You can't tell whether a string was defined as a raw string after the fact. Personally, I would in fact use a separate decorator, but if you don't want to, you could use a named parameter (e.g. #rule(glob="*.txt") for globs and #rule(re=r".+\.txt") for regex).
Alternatively, require users to provide a compiled regular expression object if they want to use a regex, e.g. #rule(re.compile(r".+\.txt")) -- this is easy to detect because its type is different.
The term "raw string" is confusing because it sounds like it is a special type of string - when in fact, it is just a special syntax for literals that tells the compiler to do no interpretation of '\' characters in the string. Unfortunately, the term was coined to describe this compile-time behavior, but many beginners assume it carries some special runtime characteristics.
I prefer to call them "raw string literals", to emphasize that it is their definition of a string literal using a don't-interpret-backslashes syntax that is what makes them "raw". Both raw string literals and normal string literals create strings (or strs), and the resulting variables are strings like any other. The string created by a raw string literal is equivalent in every way to the same string defined non-raw-ly using escaped backslashes.
Related
I know that we can use the r(raw string) and u(unicode) flags before a string to get what we might actually desired. However, I am wondering how these do work with strings. I tried this in the IDLE:
a = r"This is raw string and \n will come as is"
print a
# "This is raw string and \n will come as is"
help(r)
# ..... Will get NameError
help(r"")
# Prints empty
How Python knows that it should treat the r or u in the front of a string as a flag? Or as string literals to be specific? If I want to learn more about what are the string literals and their limitations, how can I learn them?
The u and r prefixes are a part of the string literal, as defined in the python grammar. When the python interpreter parses a textual command in order to understand what the command does, it reads r"foo" as a single string literal with the value "foo". On the other hand, it reads b"foo" as a single bytes literal with an equivalent value.
For more information, you can refer to the literals section in python's documentation. Also, python has an ast module, that allows you to explore the way python parses commands.
Is there any automatic way to convert a piece of code from python's old style string formatting (using %) to the new style (using .format)? For example, consider the formatting of a PDB atom specification:
spec = "%-6s%5d %4s%1s%3s %1s%4d%1s %8.3f%8.3f%8.3f%6.2f%6.2f %2s%2s"
I've been converting some of these specifications by hand as needed, but this is both error prone, and time-consuming as I have many such specifications.
Use pyupgrade
pyupgrade --py3-plus <filename>
You can convert to f-strings (formatted string literals) instead of .format() with
pyupgrade --py36-plus <filename>
You can install it with
pip install pyupgrade
The functionality of the two forms does not match up exactly, so there is no way you could automatically translate every % string into an equivalent {} string or (especially) vice-versa.
Of course there is a lot of overlap, and many of the sub-parts of the two formatting languages are the same or very similar, so someone could write a partial converter (which could, e.g., raise an exception for non-convertible code).
For a small subset of the language like what you seem to be using, you could do it pretty trivially with a simple regex—every pattern starts with % and ends with one of [sdf], and something like {:\1\2} as a replacement pattern ought to be all you need.
But why bother? Except as an exercise in writing parsers, what would be the benefit? The % operator is not deprecated, and using % with an existing % format string will obviously do at least as well as using format with a % format string converted to {}.
If you are looking at this as an exercise in writing parsers, I believe there's an incomplete example buried inside pyparsing.
Some differences that are hard to translate, off the top of my head:
* for dynamic field width or precision; format has a similar feature, but does it differently.
%(10)s, because format tries to interpret the key name as a number first, then falls back to a dict key.
%(a[b])s, because format doesn't quote or otherwise separate the key from the rest of the field, so a variety of characters simply can't be used.
%c takes integers or single-char strings; :c only integers.
%r/%s/%a analogues are not part of the format string, but a separate part of the field (which also comes on the opposite side).
%g and :g have slightly different cutoff rules.
%a and !a don't do the exact same thing.
The actual differences aren't listed anywhere; you will have to dig them out by a thorough reading of the Format Specification Mini-Language vs. the printf-style String Formatting language.
The docs explain some of the differences. As far as I can tell -- although I'm not very familiar with old-style format strings -- is that the functionality of the new style is a superset of the functionality of the oldstyle.
You'd have to do more tweaking to handle edge cases, but I think something simple like
re.replace(r'%(\w+)([sbcdoXnf...])', r'{\1\2}', your_string)
would get you 90% of the way there. The remaining translation -- going from things like %x to {0:x} -- will be too complex for a regular expression to handle (without writing some ridiculously complex conditionals inside of your regex).
I am from a c background. started learning python few days ago. my question is what is the end of string notation in python. like we are having \0 in c. is there anything like that in python.
There isn't one. Python strings store the length of the string independent from the string contents.
There is nothing like that in Python. A string is simply a string. The following:
test = "Hello, world!"
is simply a string of 13 characters. It's a self-contained object and it knows how many character it contains, there is no need for an end-of-string notation.
Python's string management is internally a little more complex than that. Strings is a sequence type so that from a python coder's point of view it is more an array of characters than anything. (And so it has no terminating character but just a length property.)
If you must know: Internally python strings' data character arrays are null terminated. But the String object stores a couple of other properties as well. (e.g. the hash of the string for use as key in dictionaries.)
For more detailed info (especially for C coders) see here: http://www.laurentluce.com/posts/python-string-objects-implementation/
In Python, if I have a string like:
a =" Hello - to - everybody"
And I do
a.split('-')
then I get
[u'Hello', u'to', u'everybody']
This is just an example.
How can I get a simple list without that annoying u'??
The u means that it's a unicode string - your original string must also have been a unicode string. Generally it's a good idea to keep strings Unicode as trying to convert to normal strings could potentially fail due to characters with no equivalent.
The u is purely used to let you know it's a unicode string in the representation - it will not affect the string itself.
In general, unicode strings work exactly as normal strings, so there should be no issue with leaving them as unicode strings.
In Python 3.x, unicode strings are the default, and don't have the u prepended (instead, bytes (the equivalent to old strings) are prepended with b).
If you really, really need to convert to a normal string (rarely the case, but potentially an issue if you are using an extension library that doesn't support unicode strings, for example), take a look at unicode.encode() and unicode.decode(). You can either do this before the split, or after the split using a list comprehension.
I have a opposite problem. The str '第一回\u3000甄士隐梦幻识通灵 贾雨村风尘怀闺秀' needs to be splitted by the unicode character. But I made wrong and code split('\u') that leaded to the unicode syntax error.
I should code split('\u3000')
I receive strings after querying via urlopen in JSON format:
def get_clean_text(text):
return text.translate(maketrans("!?,.;():", " ")).lower().strip()
for track in json["tracks"]:
print track["name"].lower()
get_clean_text(track["name"].lower())
For the string "türlich, türlich (sicher, dicker)" I then get
File "main.py", line 23, in get_clean_text
return text.translate(maketrans("!?,.;():", " ")).lower().strip()
TypeError: character mapping must return integer, None or unicode
I want to format the string to be "türlich türlich sicher dicker".
The question is not a complete self-contained example; I can't be sure whether it's Python 2 or 3, where maketrans came from, etc. There's a good chance I will guess wrong, which is why you should be sure to tag your questions appropriately and provide a short, self contained, correct example. (That, and the fact that various other people—some of them probably smarter than me—likely ignored your question because it was ambiguous.)
Assuming you're using 2.x, and you've done a from string import * to get maketrans, and json["name"] is unicode rather than str/bytes, here's your problem:
There are two kinds of translation tables: old-style 8-bit tables (which are just an array of 256 characters) and new-style sparse tables (which are just a dict mapping one character's ordinal to another). The str.translate function can use either, but unicode.translate can only use the second (for reasons that should be obvious if you think about it for a bit).
The string.maketrans function makes old-style 8-bit translation tables. So you can't use it with unicode.translate.
You can always write your own "makeunitrans" function as a drop-in replacement, something like this:
def makeunitrans(frm, to):
return {ord(f):ord(t) for (f,t) in zip(frm, to)}
But if you just want to map out certain characters, you could do something a bit more special purpose:
def makeunitrans(frm):
return {ord(f):ord(' ') for f in frm}
However, from your final comment, I'm not sure translate is even what you want:
I want to format the string to be "türlich türlich sicher dicker"
If you get this right, you're going to format the string to be "türlich türlich sicher dicker ", because you're mapping all those punctuation characters to spaces, not nothing.
With new-style translation tables you can map anything you want to None, which solves that problem. But you might want to step back and ask why you're using the translate method in the first place instead of, e.g., calling replace multiple times (people usually say "for performance", but you wouldn't be building the translation table in-line every time through if that were an issue) or using a trivial regular expression.