How to set the filename encoding globally for a python interpreter?

In a project, we need to open many files all over the place in the source code. We know that all filenames on disk are encoded in utf-8 and all filenames are processed as Unicode internally.
Is there a way to set the filename encoding globally for the running interpreter, so that we change the following command:
open(filename.encode('utf-8'))
to this simpler version:
open(filename)
This would reduce errors and confusion among the developers. We use python 2.7.

Assuming this is really what you meant (and not the encoding of the file's contents), this would do it:
import __builtin__  # in Python 2, __builtins__ is a dict outside __main__, so use the module
open = lambda fname, *a, **kw: __builtin__.open(fname.encode('utf-8'), *a, **kw)
This will only affect modules that include (or import) the redefinition, so it's reasonably safe. But it might be less confusing, and certainly less trouble-prone over the long run, to provide a different command for opening files in your environment for which this applies:
def fsopen(fname, *args, **kwargs):
    """(explain when and why you want this)"""
    return open(fname.encode('utf-8'), *args, **kwargs)
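One wrinkle with a wrapper like this: in Python 2, calling .encode('utf-8') on a name that is already a byte string first triggers an implicit ASCII decode, which can raise UnicodeDecodeError for non-ASCII names. A minimal sketch that encodes only text names follows; the name fsopen_safe is hypothetical and not part of the answer above.

```python
def fsopen_safe(fname, *args, **kwargs):
    """Open a file whose on-disk name is UTF-8 encoded.

    Hypothetical helper: text names are encoded to UTF-8, while names
    that are already byte strings are passed through unchanged.
    """
    if not isinstance(fname, bytes):  # `bytes` is an alias of `str` in Python 2.7
        fname = fname.encode('utf-8')
    return open(fname, *args, **kwargs)
```

Both fsopen_safe(u'd\xe9mo.txt') and fsopen_safe(b'd\xc3\xa9mo.txt') then refer to the same file.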

There is no such option, because the open function has documented behaviour not to UTF-8 encode the strings.
There's a simple reason for this: open is more or less just a delegate of the underlying OS function, and must work on any string that the OS function would work with. Hence, it expects raw Python strings and not unicode.
If you need this, implement a wrapper function to do this.
In Python, you can just override the local name -- or the global one, for that matter:
import __builtin__
__uni_open = lambda string, *a, **kw: __builtin__.open(string.encode("utf-8"), *a, **kw)
open = __uni_open
in whatever module all your applications work with.
I don't recommend doing this -- it might break stuff beyond your project's control.

Related

Python documentation input variable type hints

TLDR
I'm noticing a significant difference in the information presented by the official python docs compared to what I'm seeing in the PyCharm hover-over / quickdocs. I'm hoping someone can point me to where I can find the source of this quickdoc information such that I can use it outside of PyCharm as a general reference.
For example, in the python docs for os.makedirs I see:
os.makedirs(name, mode=0o777, exist_ok=False)
Recursive directory creation function. Like mkdir(), but makes all intermediate-level directories needed to contain the leaf directory.
The mode parameter is passed to mkdir() for creating the leaf directory; see the mkdir() description for how it is interpreted. To set the file permission bits of any newly created parent directories you can set the umask before invoking makedirs(). The file permission bits of existing parent directories are not changed.
If exist_ok is False (the default), an FileExistsError is raised if the target directory already exists.
Note
makedirs() will become confused if the path elements to create include pardir (eg. “..” on UNIX systems).
This function handles UNC paths correctly.
Raises an auditing event os.mkdir with arguments path, mode, dir_fd.
New in version 3.2: The exist_ok parameter.
Changed in version 3.4.1: Before Python 3.4.1, if exist_ok was True and the directory existed, makedirs() would still raise an error if mode did not match the mode of the existing directory. Since this behavior was impossible to implement safely, it was removed in Python 3.4.1. See bpo-21082.
Changed in version 3.6: Accepts a path-like object.
Changed in version 3.7: The mode argument no longer affects the file permission bits of newly created intermediate-level directories.
But in the quickdocs I see:
os def makedirs(name: str | bytes | PathLike[str] | PathLike[bytes],
                mode: int = ...,
                exist_ok: bool = ...) -> None
makedirs(name [, mode=0o777][, exist_ok=False]) Super-mkdir; create a leaf directory and all intermediate ones. Works like mkdir, except that any intermediate path segment (not just the rightmost) will be created if it does not exist. If the target directory already exists, raise an OSError if exist_ok is False. Otherwise no exception is raised. This is recursive.
Where is this quickdoc type hinting information coming from and where can I find a complete reference with all these type hints such that I can reference it outside of PyCharm?
Background
Coming mainly from a strongly typed language like Java, I struggle to make constructive use of the python documentation with regards to function input parameter types. I am hoping someone can elucidate a standard process for resolving ambiguity compared to my current trial+[lots of]errors approach.
For example, the os.makedirs function's first parameter is name.
os.makedirs(name, mode=0o777, exist_ok=False)
It is not apparent to me what sorts of things I can pass as name here. Could it be:
A str? If so, how should I create this? Via a string literal, double quoted string? Does this accept / separators or \ separators or system dependent?
A pathlib.Path?
Anything pathlike?
[Note the above are rhetorical questions and not the focus of this post]. These are all informed guesses, but if I were completely new to python and was trying to use this documentation to make some directories, I see two options:
Read the source code via some IDE or other indexing
Guess until I get it right
The first is fine for easier to understand functions like makedirs but for more complicated functions this would require gaining expertise in a library that I don't necessarily want to reuse and just want to try out. I simply don't have enough time to become an expert in everything I encounter. This seems quite inefficient.
The second also seems to be quite inefficient, with the added demerit of not knowing how to write robust code to check for inappropriate inputs.
Now I don't want to bash the python docs, as they are LEAPS and BOUNDS better than a fair few other languages I've used, but is this dilemma just a case of unfinished/poor documentation, or is there a standard way of knowing/understanding what input parameters like name in this case should be that I haven't outlined above?
To be fair, this may not be the optimal example, as if you look towards the end of the doc for makedirs you can see it does state:
Changed in version 3.6: Accepts a path-like object.
but this is not specifically referring to name. Yes, in this example it may seem rather obvious it is referring to name, but with the advent of type-hinting, why are the docs not type hinted like the quickdocs from PyCharm? Is this something planned for the future, or is it too large a can of worms to try to hint all possibilities in a flexible language like python?
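For what it's worth, the "path-like" note can be checked empirically. A quick sketch, assuming Python 3.6+ and a throwaway temporary directory (the directory names are made up for illustration):

```python
import os
import pathlib
import shutil
import tempfile

base = tempfile.mkdtemp()

# name given as a str, a pathlib.Path, and bytes
os.makedirs(os.path.join(base, "from_str", "leaf"))
os.makedirs(pathlib.Path(base) / "from_path" / "leaf")
os.makedirs(os.fsencode(os.path.join(base, "from_bytes", "leaf")))

# exist_ok=True tolerates an existing target directory
os.makedirs(os.path.join(base, "from_str", "leaf"), exist_ok=True)

# the default exist_ok=False raises FileExistsError instead
try:
    os.makedirs(os.path.join(base, "from_str", "leaf"))
    raised = False
except FileExistsError:
    raised = True

created = sorted(os.listdir(base))
shutil.rmtree(base)
```

This only confirms what the signature in the quickdoc already states; it does not replace a typed reference.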
Just as a comparison, take a look at Java's java.io.File.mkdirs, where the various File constructors tell you all the options for specifying the path of the file:
File(File parent, String child)
// Creates a new File instance from a parent abstract pathname and a child pathname string.
File(String pathname)
// Creates a new File instance by converting the given pathname string into an abstract pathname.
File(String parent, String child)
// Creates a new File instance from a parent pathname string and a child pathname string.
File(URI uri)
// Creates a new File instance by converting the given file: URI into an abstract pathname.
Just reading this I already know exactly how to make a File object and create directories without running/testing anything. With the quickdoc in PyCharm I can do the same, so where is this type hint information in the official docs?

How to load abstract syntax tree of a python module by its name?

Let's say the name of the module is available in form of a string rather than module object. How do I locate its source code location and load the abstract syntax tree (if the source code is present)?
I'd take the problem in three steps:
Import the module by name. This should be relatively easy using importlib.import_module, though you could bodge up your own version with the builtin __import__ if you needed to.
Get the source code for the module. Using inspect.getsource is probably the easiest way (but you could also just try open(the_module.__file__).read() and it is likely to work).
Parse the source into an AST. This should be easy with ast.parse. Even for this step, the library isn't essential, as you can use the builtin compile instead, as long as you pass the appropriate flag (ast.PyCF_ONLY_AST appears to be 1024 on my system, so compile(source, filename, 'exec', 1024) should work).
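The three steps above can be sketched roughly as follows, assuming Python 3 and a module whose source is available. The helper names load_module_ast and load_module_ast_via_compile are made up for illustration. Note that importing a module executes its top-level code.

```python
import ast
import importlib
import inspect


def load_module_ast(module_name):
    """Import a module by name and return the AST of its source."""
    module = importlib.import_module(module_name)   # step 1: import by name
    source = inspect.getsource(module)              # step 2: fetch the source text
    return ast.parse(source, filename=module.__file__)  # step 3: parse


def load_module_ast_via_compile(module_name):
    """Same idea, using the builtin compile() with the AST-only flag."""
    module = importlib.import_module(module_name)
    source = inspect.getsource(module)
    return compile(source, module.__file__, "exec", ast.PyCF_ONLY_AST)


tree = load_module_ast("textwrap")
top_level_functions = [
    node.name for node in tree.body if isinstance(node, ast.FunctionDef)
]
```

inspect.getsource raises an exception for built-in or extension modules, so callers may want to catch OSError and TypeError when the source might not be present.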

PTVS IntelliSense not working for Built-in Function

class Class:
    def __init__(self, path):
        self._path = path
        string = open(self._path, 'r'). #HERE
When I try to type read(), IntelliSense says there are no completions.
However, I know the open() function returns a file object, which has a read() function. I want to see all supported functions after typing a dot.
PyCharm shows me a recommended function list, but PTVS does not.
I want to know whether this is normal in PTVS or only happening to me.
My current Python environment is Anaconda 4.3.0 (Python 3.5.3).
How can I fix it?
We've already fixed the specific case of open for our upcoming update (not the one that released today - the next one), but in short the problem is that you don't really know what open is going to return. In our fix, we guess one of two likely types, which should cover most use cases.
To work around it right now, your best option is to assign the result of open to a variable and force it to a certain type using an assert statement. For example:
import io

# Text mode:
f = open(self._path, 'r')
assert isinstance(f, io.TextIOWrapper)

# Binary mode:
f = open(self._path, 'rb')
assert isinstance(f, io.BufferedIOBase)
Note that your code will now fail if the variable is not the expected type, and that the code for Python 2 would be different from this, but until you can get the update where we embed this knowledge into our code it is the best you can do.
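To illustrate why the return type of open is hard to infer statically, here is a small sketch, assuming Python 3 and a temporary file created only for the demonstration. The concrete class depends on the mode and buffering arguments.

```python
import io
import os
import tempfile

# Create an empty temporary file to open in different modes.
fd, path = tempfile.mkstemp()
os.close(fd)

try:
    with open(path, "r") as text_file:
        text_type = type(text_file)        # text mode
    with open(path, "rb") as binary_file:
        binary_type = type(binary_file)    # buffered binary mode
    with open(path, "rb", buffering=0) as raw_file:
        raw_type = type(raw_file)          # unbuffered binary mode
finally:
    os.remove(path)

print(text_type.__name__, binary_type.__name__, raw_type.__name__)
```

On CPython 3 these are io.TextIOWrapper, io.BufferedReader and io.FileIO respectively, which is why an isinstance assertion helps the analyzer.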

Can I make decode(errors="ignore") the default for all strings in a Python 2.7 program?

I have a Python 2.7 program that writes out data from various external applications. I continually get bitten by exceptions when I write to a file until I add .decode(errors="ignore") to the string being written out. (FWIW, opening the file as mode="wb" doesn't fix this.)
Is there a way to say "ignore encoding errors on all strings in this scope"?
You cannot redefine methods on built-in types, and you cannot change the default value of the errors parameter to str.decode(). There are other ways to achieve the desired behaviour, though.
The slightly nicer way: Define your own decode() function:
def decode(s, encoding="ascii", errors="ignore"):
    return s.decode(encoding=encoding, errors=errors)
Now, you will need to call decode(s) instead of s.decode(), but that's not too bad, is it?
The hack: You can't change the default value of the errors parameter, but you can overwrite what the handler for the default errors="strict" does:
import codecs
def strict_handler(exception):
    return u"", exception.end
codecs.register_error("strict", strict_handler)
This will essentially change the behaviour of errors="strict" to the standard "ignore" behaviour. Note that this will be a global change, affecting all modules you import.
I recommend neither of these two ways. The real solution is to get your encodings right. (I'm well aware that this isn't always possible.)
As mentioned in my thread on the issue, the hack from Sven Marnach is even possible without a new function:
import codecs
codecs.register_error("strict", codecs.ignore_errors)
I'm not sure what your setup is exactly, but you can derive a class from str and override its decode method:
class easystr(str):
    def decode(self):
        return str.decode(self, errors="ignore")
If you then convert all incoming strings to easystr, errors will be silently ignored:
line = easystr(input.readline())
That said, decoding a string converts it to unicode, which should never be lossy. Could you figure out which encoding your strings are using and give that as the encoding argument to decode? That would be a better solution (and you can still make it the default in the above way).
Yet another thing you should try is to read your data differently. Do it like this and the decoding errors may well disappear:
import codecs
input = codecs.open(filename, "r", encoding="latin-1") # or whatever
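A small sketch of the difference between reading with the right encoding and ignoring errors. It assumes the data is Latin-1 and uses io.open (which accepts an encoding argument on both Python 2.7 and 3) in place of codecs.open; the sample bytes are made up for illustration.

```python
import io
import os
import tempfile

raw = b"caf\xe9\n"  # "café" encoded as Latin-1 (assumed sample data)

# Write the raw bytes to a temporary file.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(raw)
    # Reading with the correct encoding keeps every character.
    with io.open(path, "r", encoding="latin-1") as f:
        decoded = f.readline()
finally:
    os.remove(path)

# Ignoring errors silently drops the byte that is not valid ASCII.
lossy = raw.decode("ascii", "ignore")
```

Here decoded keeps the accented character, while lossy has lost it, which is the data loss that errors="ignore" hides.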

How can I view source code for datetime.py from PyCharm?

I'm ctrl-clicking on the datetime.timedelta function from within PyCharm, and getting to a file named ....PyCharm10\system\python_stubs\76178323\datetime.py, which appears to contain many empty methods. Specifically, ctrl-clicking on datetime.timedelta brings me to this part of the file:
def __init__(self, *args, **kwargs): # real signature unknown
    pass
Is this a bug? How can I view/trace through the real datetime.py?
Perhaps there is no way. Possibly your installation does not contain the Python source code of the library, only the compiled binary (e.g. /usr/lib/python2.6/lib-dynload/datetime.so).
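One way to check whether a given module ships with Python source or only as a compiled binary is to ask the interpreter directly. A minimal sketch, assuming Python 3; the helper name describe_module_origin is made up for illustration.

```python
import importlib
import inspect


def describe_module_origin(name):
    """Return (file path or None, whether Python source can be retrieved)."""
    module = importlib.import_module(name)
    path = getattr(module, "__file__", None)  # built-in modules have no __file__
    try:
        inspect.getsource(module)
        has_source = True
    except (OSError, TypeError):
        # OSError: source not available (e.g. a compiled extension module)
        # TypeError: built-in module compiled into the interpreter
        has_source = False
    return path, has_source


print(describe_module_origin("datetime"))
print(describe_module_origin("sys"))
```

If the reported path ends in .so or .pyd, or there is no path at all, there is no Python source for the IDE to open, and it falls back to a generated stub.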
