What's the preferred way to include unicode in python source files? - python

When using unicode strings in source code, there seems to be many ways to skin a cat. The docs and the relevant PEPs have plenty of information about what's possible, but are scant about what is preferred.
For example, the following each seem to give same result:
# coding: utf8
u1 = '\xe2\x82\xac'.decode('utf8')
u2 = u'\u20ac'
u3 = unichr(0x20ac)
u4 = "€".decode('utf8')
u5 = u"€"
If using the __future__ imports, I've found one more option:
# coding: utf8
from __future__ import unicode_literals
u6 = "€"
In python I am used to there being one obvious way to do it, so what is the recommended method of including international content in source files?
This is a python 2 question.
some background...
Methods u1, u2, u3 just seem silly to me, but I have seen enough people writing like this that I assume it is not just personal preference - is there any particular reason why we might want to force only ascii characters in source files, rather than specifying the encoding, or is this just a habit more likely to be found in older code lying around?
There's huge readability improvement in the code to use the actual symbols rather than some escape sequences, and to not do so would seem to be ignoring the strengths of the language rather than taking advantage of hard work by the python devs.

I think the most common way I've used (in Python 2) is:
# coding: utf-8
text = u'résumé'
The text is readable. Compare to text = u'r\u00e9sum\u00e9', where I must look up what character that is. Everything else is less readable.
If you're using Unicode, your variable is most certainly text and not binary data, so there's no point in keeping it in anything other than a unicode object. (Just in case '€' became an option.)
from __future__ import unicode_literals changes the parsing mode of the program; I think you'd need to be more aware of the difference between text & binary data. (Something that, if you ask me, most programmers are not good at.)
In large projects, it might be confusing to have the parsing mode change for just one file, so it's probably better as an all files or no files, so you don't need to refer to the file header. If you're in Python 2, the default is probably off unless you're also targetting Python 3. If you're targetting Python 2.5 or older¹, then it's not an option.
Most editors these days are Unicode-aware. That said, I have seen editors corrupt non-ASCII characters in files, but exceedingly rarely; if the author of such a commit doesn't review his code adequately, code review should catch this. (The diff will be painfully obvious.) It is not worth supporting these people: Unicode is here to stay; track them down and fix their set up. Of note, vim handles Unicode just fine.
¹You should upgrade.

Related

Python docstring escaping tabs and universal newlines

I'd appreciate some help on an efficient Pythonic solution for this problem.
Our internal coding standards mandate certain information fields should be in a block comment at the top of the file. In Perl, this was obviously a block of text beginning with '#'.
I'm experimenting with including this information in the module docstring in Python. The problem is I need to access some of this information in the program.
I have surgically extended docstring_parser to recognise the information fields, and create a data structure. This all works.
Except that one of the fields includes the source file location. That's fine on Unix, but we are a cross platform shop, and Windows uses '\' as a path separator. Python decides to process this as universal newlines and tabs, with weird results.
So the string %workspace%\PythonLib\rr2\tests\test_rr2.py
get rendered as:
%workspace%\PythonLib
r2 ests est_rr2.py
which isn't exactly readable anymore.
The fix I have attempted is based on repeated applications of str.replace(), but is there a better way?
#user2357112 is correct. The docstring can be made raw by beginning it with r""", and then everything works.

How to deal with strings where encoding is unclear

I know there is quite a lot on the web and on stackoverflow about Python and character encoding, but I haven't really found the answer I'm looking for. So at the risk of creating a duplicate, I'm going to ask anyway.
It's a script that gets a dictionary, where all keys are properly as unicode. The values are strings with unknown encoding. For the keys it wouldn't matter that much, keys are all very simple very unlike the values. The values can (and do) contain a large variety of encodings. There are some dictionaries, where some values are in ASCII others as UTF-16BE yet others cp1250.
That totally messes up further processing, which currently consists mainly printing or concatenating (yes, that simple).
The work-around that I came up with, which makes Python print statements work properly is:
for key in data.keys():
# hope they did not chose a funky encoding
try:
print key+":"+data[key] # this triggers a UnicodeDecodeError on many encodings
current_data = data[key]
except UnicodeDecodeError:
# trying to cope with a funky encoding
current_data = data[key].decode(chardet.detect(data[key])['encoding']) # doing this on each value, because the dictionary sometimes contains multiple encodings
print key+":", # printing without newline was a workaround, because connecting didn't work
print current_data.encode('UTF-8')
In Python this works just fine. In Jython 2.7rc1 which I use in the project (not an option to switch), it prints characters which are definitely not the original encoding (funky looking characters). If anyone has an idea how I can make this also work in Jython that'd be great!
Edit (Example):
Sample-Value:
Our latest scenarios explore two possible versions of the future seen through fresh “lenses”.
Creates a string where the right and left double quotes turn to \x8D and \x8E. I don't know what encoding that is. In Python after using the above code it strips them. In Jython it turns them into white squares.
I'm not familiar with Jython, but the following link I found may prove useful: http://python.6.x6.nabble.com/character-encoding-issues-td1766833.html
It says that you should keep all unicode strings in separate files to your source, and read them with codecs.open. This seemed to work for the person who was experiencing a problem similar to yours.
The following link also mentions something about specifying an encoding parameter to the JVM: https://answers.launchpad.net/sikuli/+question/156443
Without seeing any actual error output, this is the extent of the help I can provide.

When is semicolon use in Python considered "good" or "acceptable"?

Python is a "whitespace delimited" language. However, the use of semicolons are allowed. For example, the following works, but it is frowned upon:
print("Hello!");
print("This is valid");
I've been using Python for several years now, and the only time I have ever used semicolons is in generating one-time command-line scripts with Python:
python -c "import inspect, mymodule; print(inspect.getfile(mymodule))"
Or adding code in comments on Stack Overflow (i.e., "you should try import os; print os.path.join(a,b)")
I also noticed in this answer to a similar question that the semicolon can also be used to make one line if blocks, as in
if x < y < z: print(x); print(y); print(z)
which is convenient for the two usage examples I gave (command-line scripts and comments).
The above examples are for communicating code in paragraph form or making short snippets, but not something I would expect in a production codebase.
Here is my question: in Python, is there ever a reason to use the semicolon in a production code? I imagine that they were added to the language solely for the reasons I have cited, but it’s always possible that Guido had a grander scheme in mind.
No opinions please; I'm looking either for examples from existing code where the semicolon was useful, or some kind of statement from the python docs or from Guido about the use of the semicolon.
PEP 8 is the official style guide and says:
Compound statements (multiple statements on the same line) are generally discouraged.
(See also the examples just after this in the PEP.)
While I don't agree with everything PEP 8 says, if you're looking for an authoritative source, that's it. You should use multi-statement lines only as a last resort. (python -c is a good example of such a last resort, because you have no way to use actual linebreaks in that case.)
I use semicolons in code all of the time. Our code folds across lines quite frequently, and a semicolon is a definitive assertion that a statement is ending.
output, errors, status = generate_output_with_errors_and_status(
first_monstrous_functional_argument(argument_one_to_argument
, argument_two_to_argument)
, second_argument);
See? They're quite helpful to readability.

Programming in Python using a Non-English language for keywords and variables

My end goal is to enable people that know the language Urdu and not English to be able to program in a Python environment.
Urdu is written left to right. I imagine having Urdu versions of all python keywords and using Urdu characters to define variable/function/class names.
Is this goal possible? If yes, then what all will need to be done?
I can already see one issue where the standard library functions would still be in English.
Update:
Whether this should be done or not is certainly a very interesting debate topic, which I'd love to talk about. However, is it actually possible and how?
I love Python and I know a lot of intelligent people that might never be taught English properly, only Urdu, hence the thought.
Chinese Python used to allow programmers to write source code entirely in Chinese, including translated module names, function names, and all keywords. Here's a code example from their website:
載入 系統 # import sys
文件名 = 系統.參數[1:] # filenames = sys.argv[1:]
定義 修正行尾(文件): # def fixline(file):
內文 = 打開(文件).讀入() # text = open(file).read()
內文 = 內文.替換('\n\r','\n') # text = text.replace('\n\r', '\n')
傳回 內文 # return text
取 文件 自 文件名: # for file in filenames:
寫 修正行尾(文件) # print fixline(file)
Sadly, it is no longer an active project, but you can get the source and find out what they changed to get an idea for your own implementation. (Learning why they failed may also help you understand what challenges you will have in making such a system successful).
You can use the ideas project to transform keywords,
it has an example for making a kind of French Python.
It's based on python import-hook so it runs in normal python3.
I built a python in Hebrew (that also uses the python builtins and not just keywords ) using this project ,that looks more like a normal python
Yes, it is possible. But it requires you to download Python from source, and change the source code and compile it. If I remember correctly from another discussion on the same topic, changing the grammar and regenerating some files is all it takes, except for when you are doing advanced stuff like meta-programming and things.
WHAT HAVE YOU TRIED? Python 3 allows you to use Unicode for all your function, class and variable names. Check it out, you just need to make sure you're using utf-8 for your script-- it's primarily a matter for your editor. Dealing gracefully with the mixed RTL/LTR issues is also your editor's problem. (The feature is discussed in PEP 3131)
The python language does not have "dialects", i.e., alternative sets of keywords. If you are not satisfied with Urdu identifiers and you are determined to have a completely Urdu-language experience, you could write a preprocessor that maps Urdu "keywords" to the corresponding (English) python keyword.
It shouldn't be too hard to wrap this kind of preprocessor around the interactive console and import modules, so that it works transparently to the user (and without recompiling python from source). If you can program and want to try this on for size, check out python's code module. It's designed to read python source and send it on to be compiled and executed. You just step in and add a preprocessing step.
I'd use a preprocessor for keywords and built-ins. But providing localized wrappers for your choice of common modules is even easier, actually. Let's demonstrate with a wrapper module RE.py that contains all exported identifiers of regular re, but renamed to upper case:
import re as _re
for name in _re.__all__:
locals()[name.upper()] = _re.__dict__[name]
That was it! You can now import this module and use it:
import RE
docs = RE.SUB(r"\bEnglish\b", r"Urdu", docs)
Use Urdu words instead of uppercase (of course you need to specify each one individually) and you're on your way. As you say, whether all this is a good idea is a different question.

Is there a way around coding in Python without the tab, indent & whitespace criteria?

I want to start using Python for small projects but the fact that a misplaced tab or indent can throw a compile error is really getting on my nerves. Is there some type of setting to turn this off?
I'm currently using NotePad++. Is there maybe an IDE that would take care of the tabs and indenting?
The answer is no.
At least, not until something like the following is implemented:
from __future__ import braces
No. Indentation-as-grammar is an integral part of the Python language, for better and worse.
Emacs! Seriously, its use of "tab is a command, not a character", is absolutely perfect for python development.
All of the whitespace issues I had when I was starting Python were the result mixing tabs and spaces. Once I configured everything to just use one or the other, I stopped having problems.
In my case I configured UltraEdit & vim to use spaces in place of tabs.
It's possible to write a pre-processor which takes randomly-indented code with pseudo-python keywords like "endif" and "endwhile" and properly indents things. I had to do this when using python as an "ASP-like" language, because the whole notion of "indentation" gets a bit fuzzy in such an environment.
Of course, even with such a thing you really ought to indent sanely, at which point the conveter becomes superfluous.
I find it hard to understand when people flag this as a problem with Python. I took to it immediately and actually find it's one of my favourite 'features' of the language :)
In other languages I have two jobs:
1. Fix the braces so the computer can parse my code
2. Fix the indentation so I can parse my code.
So in Python I have half as much to worry about ;-)
(nb the only time I ever have problem with indendation is when Python code is in a blog and a forum that messes with the white-space but this is happening less and less as the apps get smarter)
I'm currently using NotePad++. Is
there maybe an IDE that would take
care of the tabs and indenting?
I liked pydev extensions of eclipse for that.
I do not believe so, as Python is a whitespace-delimited language. Perhaps a text editor or IDE with auto-indentation would be of help. What are you currently using?
No, there isn't. Indentation is syntax for Python. You can:
Use tabnanny.py to check your code
Use a syntax-aware editor that highlights such mistakes (vi does that, emacs I bet it does, and then, most IDEs do too)
(far-fetched) write a preprocessor of your own to convert braces (or whatever block delimiters you love) into indentation
You should disable tab characters in your editor when you're working with Python (always, actually, IMHO, but especially when you're working with Python). Look for an option like "Use spaces for tabs": any decent editor should have one.
I agree with justin and others -- pick a good editor and use spaces rather than tabs for indentation and the whitespace thing becomes a non-issue. I only recently started using Python, and while I thought the whitespace issue would be a real annoyance it turns out to not be the case. For the record I'm using emacs though I'm sure there are other editors out there that do an equally fine job.
If you're really dead-set against it, you can always pass your scripts through a pre-processor but that's a bad idea on many levels. If you're going to learn a language, embrace the features of that language rather than try to work around them. Otherwise, what's the point of learning a new language?
Not really. There are a few ways to modify whitespace rules for a given line of code, but you will still need indent levels to determine scope.
You can terminate statements with ; and then begin a new statement on the same line. (Which people often do when golfing.)
If you want to break up a single line into multiple lines you can finish a line with the \ character which means the current line effectively continues from the first non-whitespace character of the next line. This visually appears violate the usual whitespace rules but is legal.
My advice: don't use tabs if you are having tab/space confusion. Use spaces, and choose either 2 or 3 spaces as your indent level.
A good editor will make it so you don't have to worry about this. (python-mode for emacs, for example, you can just use the tab key and it will keep you honest).
Tabs and spaces confusion can be fixed by setting your editor to use spaces instead of tabs.
To make whitespace completely intuitive, you can use a stronger code editor or an IDE (though you don't need a full-blown IDE if all you need is proper automatic code indenting).
A list of editors can be found in the Python wiki, though that one is a bit too exhausting:
- http://wiki.python.org/moin/PythonEditors
There's already a question in here which tries to slim that down a bit:
https://stackoverflow.com/questions/60784/poll-which-python-ideeditor-is-the-best
Maybe you should add a more specific question on that: "Which Python editor or IDE do you prefer on Windows - and why?"
Getting your indentation to work correctly is going to be important in any language you use.
Even though it won't affect the execution of the program in most other languages, incorrect indentation can be very confusing for anyone trying to read your program, so you need to invest the time in figuring out how to configure your editor to align things correctly.
Python is pretty liberal in how it lets you indent. You can pick between tabs and spaces (but you really should use spaces) and can pick how many spaces. The only thing it requires is that you are consistent which ultimately is important no matter what language you use.
I was a bit reluctant to learn Python because of tabbing. However, I almost didn't notice it when I used Vim.
If you don't want to use an IDE/text editor with automatic indenting, you can use the pindent.py script that comes in the Tools\Scripts directory. It's a preprocessor that can convert code like:
def foobar(a, b):
if a == b:
a = a+1
elif a < b:
b = b-1
if b > a: a = a-1
end if
else:
print 'oops!'
end if
end def foobar
into:
def foobar(a, b):
if a == b:
a = a+1
elif a < b:
b = b-1
if b > a: a = a-1
# end if
else:
print 'oops!'
# end if
# end def foobar
Which is valid python.
Nope, there's no way around it, and it's by design:
>>> from __future__ import braces
File "<stdin>", line 1
SyntaxError: not a chance
Most Python programmers simply don't use tabs, but use spaces to indent instead, that way there's no editor-to-editor inconsistency.
I'm surprised no one has mentioned IDLE as a good default python editor. Nice syntax colors, handles indents, has intellisense, easy to adjust fonts, and it comes with the default download of python. Heck, I write mostly IronPython, but it's so nice & easy to edit in IDLE and run ipy from a command prompt.
Oh, and what is the big deal about whitespace? Most easy to read C or C# is well indented, too, python just enforces a really simple formatting rule.
Many Python IDEs and generally-capable text/source editors can handle the whitespace for you.
However, it is best to just "let go" and enjoy the whitespace rules of Python. With some practice, they won't get into your way at all, and you will find they have many merits, the most important of which are:
Because of the forced whitespace, Python code is simpler to understand. You will find that as you read code written by others, it is easier to grok than code in, say, Perl or PHP.
Whitespace saves you quite a few keystrokes of control characters like { and }, which litter code written in C-like languages. Less {s and }s means, among other things, less RSI and wrist pain. This is not a matter to take lightly.
In Python, indentation is a semantic element as well as providing visual grouping for readability.
Both space and tab can indicate indentation. This is unfortunate, because:
The interpretation(s) of a tab varies
among editors and IDEs and is often
configurable (and often configured).
OTOH, some editors are not
configurable but apply their own
rules for indentation.
Different sequences of
spaces and tabs may be visually
indistinguishable.
Cut and pastes can alter whitespace.
So, unless you know that a given piece of code will only be modified by yourself with a single tool and an unvarying config, you must avoid tabs for indentation (configure your IDE) and make sure that you are warned if they are introduced (search for tabs in leading whitespace).
And you can still expect to be bitten now and then, as long as arbitrary semantics are applied to control characters.
Check the options of your editor or find an editor/IDE that allows you to convert TABs to spaces. I usually set the options of my editor to substitute the TAB character with 4 spaces, and I never run into any problems.
Yes, there is a way. I hate these "no way" answers, there is no way until you discover one.
And in that case, whatever it is worth, there is one.
I read once about a guy who designed a way to code so that a simple script could re-indent the code properly. I didn't managed to find any links today, though, but I swear I read it.
The main tricks are to always use return at the end of a function, always use pass at the end of an if or at the end of a class definition, and always use continue at the end of a while. Of course, any other no-effect instruction would fit the purpose.
Then, a simple awk script can take your code and detect the end of block by reading pass/continue/return instructions, and the start of code with if/def/while/... instructions.
Of course, because you'll develop your indenting script, you'll see that you don't have to use continue after a return inside the if, because the return will trigger the indent-back mechanism. The same applies for other situations. Just get use to it.
If you are diligent, you'll be able to cut/paste and add/remove if and correct the indentations automagically. And incidentally, pasting code from the web will require you to understand a bit of it so that you can adapt it to that "non-classical" setting.

Categories