I need to create a mapping of keys (strings, all valid Python identifiers) to values in Python (3.9).
All keys and values are constant and known at creation time, and I want to make sure that every single key has an associated value.
1. dict
The first idea that comes to mind for this would be using a dictionary, which comes with the big problem that keys (in my case) would be strings.
That means I have to retype the key as a string literal each time a value is accessed, so IDEs and type checkers can't spot typos or suggest key names in autocomplete, and I can't use their utility functions to rename a key or find its usages.
1.5 dict with constant variable keys
The naive solution would be to create a constant for each key, or an enum, which I don't think is a good solution. Not only does it add at least one name lookup to each access, it also means that the key definition and the value assignment are separated, which can lead to keys that never get a value assigned to them, as sketched below.
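To illustrate (the names here are made up), nothing ties such constants to the dict, so a key can silently end up without a value:

FIRST = "first"
SECOND = "second"

VALUES = {
    FIRST: 1,
    # SECOND was never assigned a value; nothing enforces it,
    # so VALUES[SECOND] only fails at runtime with a KeyError.
}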
2. enum
This leads to the idea of skipping the dict and using an enum to associate the keys directly with the values. Enums are conveniently supported by syntax checkers, autocompletion and the like, as they support both attribute reference via "dot notation" and subscription via "[]".
However, an enum has the big disadvantage that it requires all keys/enum members to have unique values: members that violate this rule are automatically converted to aliases, which makes outputs very confusing (see the example below).
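For reference, this is the aliasing behaviour with the stdlib Enum:

from enum import Enum

class Color(Enum):
    RED = 1
    CRIMSON = 1  # same value, so CRIMSON becomes an alias of RED

print(Color.CRIMSON)  # Color.RED
print(list(Color))    # [<Color.RED: 1>] -- CRIMSON does not show up when iterating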
I already thought about copying the Enum source code and removing the unwanted bits, but that seems like a lot of effort for such a basic problem.
question:
So basically, what I'm looking for is a Pythonic, neat and concise way to define a (potentially immutable) mapping from string keys to arbitrary values which supports the following:
iterable (over keys)
keys with identical values don't interfere with each other
keys are required to have an associated value
keys are considered by syntax-checkers, auto-completion, refactorings, etc.
The preferred way of using it would be to define it in a Python source file, but it would be a nice bonus if the solution offered an easy way to write the data to a text file (JSON, INI or similar) and to create a new instance from such a file.
How would you do that and why would you choose a specific solution?
For the first part, I would use aenum¹, which has a noalias setting (so duplicate values can exist with distinct names):
from aenum import NoAliasEnum

class Unique(NoAliasEnum):
    first = 1
    one = 1
and in use:
>>> Unique.first
<Unique.first: 1>
>>> Unique.one
<Unique.one: 1>
>>> # name lookup still works
>>> Unique['one']
<Unique.one: 1>
>>> # but value lookups do not
>>> Unique(1)
Traceback (most recent call last):
...
TypeError: NoAlias enumerations cannot be looked up by value
For the second part, decide which you want:
read and create enums from a file
create enum in Python and write to a file
Doing both doesn't seem to make a lot of sense.
To create from a file you can use my JSONEnumMeta answer.
To write to a file you can use my share enums with arduino answer (after adapting the __init_subclass__ code).
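As a rough sketch of the write-to-file direction (not the linked answer, just an illustration, and assuming the values are JSON-serialisable):

import json
from aenum import NoAliasEnum

class Unique(NoAliasEnum):
    first = 1
    one = 1

# Dump name -> value pairs; with NoAliasEnum, iteration yields every member.
with open('unique.json', 'w') as f:
    json.dump({member.name: member.value for member in Unique}, f, indent=2)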
The only thing I'm not certain about is the last point: syntax-checker and auto-completion support.
¹ Disclosure: I am the author of the Python stdlib Enum, the enum34 backport, and the Advanced Enumeration (aenum) library.
In PySpark one can use column objects and strings to select columns. Both ways return the same result. Is there any difference? When should I use column objects instead of strings?
For example, I can use a column object:
import pyspark.sql.functions as F
df.select(F.lower(F.col('col_name')))
# or
df.select(F.lower(df['col_name']))
# or
df.select(F.lower(df.col_name))
Or I can use a string instead and get the same result:
df.select(F.lower('col_name'))
What are the advantages of using column objects instead of strings in PySpark?
Read the PySpark style guide from Palantir, which explains when to use F.col() (and when not to) along with other best practices.
In many situations the first style can be simpler, shorter and visually less polluted. However, we have found that it faces a number of limitations, that lead us to prefer the second style:
If the dataframe variable name is large, expressions involving it quickly become unwieldy;
If the column name has a space or other unsupported character, the bracket operator must be used instead. This generates inconsistency, and df1['colA'] is just as difficult to write as F.col('colA');
Column expressions involving the dataframe aren't reusable and can't be used for defining abstract functions;
Renaming a dataframe variable can be error-prone, as all column references must be updated in tandem.
Additionally, the dot syntax encourages use of short and non-descriptive variable names for the dataframes, which we have found to be harmful for maintainability. Remember that dataframes are containers for data, and descriptive names are a helpful way to quickly set expectations about what's contained within.
By contrast, F.col('colA') will always reference a column designated colA in the dataframe being operated on, named df, in this case. It does not require keeping track of other dataframes' states at all, so the code becomes more local and less susceptible to "spooky interaction at a distance," which is often challenging to debug.
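For example, the reusability point looks like this in practice (df1 and df2 stand for any dataframes that have a name column):

import pyspark.sql.functions as F

# A reusable column expression: it is not tied to a particular dataframe,
# so the same helper works on df1, df2, ...
def normalized_name():
    return F.trim(F.lower(F.col('name')))

df1.select(normalized_name().alias('name'))
df2.select(normalized_name().alias('name'))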
It depends on how the functions are implemented in Scala.
In Scala, the signature of a function is part of the function itself. For example, func(foo: String) and func(bar: Int) are two different functions, and Scala can tell which one you are calling based on the type of the argument you pass.
F.col('col_name'), df['col_name'] and df.col_name are all the same type of object: a Column. It makes almost no difference which syntax you use. One small difference is that you could write, for example:
df_2.select(F.lower(df.col_name)) # Where the column is from another dataframe
# Spoiler alert : It may raise an error !!
When you call df.select(F.lower('col_name')), if no lower(smth: String) overload is defined in Scala, then you will get an error. Some functions are defined with a string as input, others take only Column objects. Try it to see whether it works and then use it; otherwise, you can make a pull request on the Spark project to add the new signature.
I am using Python 3.5 and the documentation for it at
https://docs.python.org/3.5/library/stdtypes.html#sequence-types-list-tuple-range
says:
list([iterable])
(...)
The constructor builds a list whose items are the same and in the same order as iterable’s items.
OK, for the following script:
#!/usr/bin/python3
import random
def rand6():
    return random.randrange(63)
random.seed(0)
check_dict = {}
check_dict[rand6()] = 1
check_dict[rand6()] = 1
check_dict[rand6()] = 1
print(list(check_dict))
I always get
[24, 48, 54]
But, if I change the function to:
def rand6():
    return bytes([random.randrange(63)])
then the order returned is not always the same:
>./foobar.py
[b'\x18', b'6', b'0']
>./foobar.py
[b'6', b'0', b'\x18']
Why?
Python dictionaries are implemented as hash tables. In most Python versions (more on this later), the order you get the keys when you iterate over a dictionary is the arbitrary order of the values in the table, which has only very little to do with the order in which they were added (when hash collisions occur, the order of insertions can matter a little bit). This order is implementation dependent. The Python language does not offer any guarantee about the order other than that it will remain the same for several iterations over a dictionary if no keys are added or removed in between.
For your dictionary with integer keys, the hash table doesn't do anything fancy. Integers hash to themselves (except -1), so with the same numbers getting put in the dict, you get a consistent order in the hash table.
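You can see this directly in the interpreter:

>>> hash(24), hash(48), hash(54)
(24, 48, 54)
>>> hash(-1)  # the exception: -1 is reserved as an error code in CPython
-2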
For the dictionary with bytes keys however, you're seeing different behavior due to Hash Randomization. To prevent a kind of dictionary collision attack (where a webapp implemented in Python could be DoSed by sending it data with thousands of keys that hash to the same value leading to lots of collisions and very bad (O(N**2)) performance), Python picks a random seed every time it starts up and uses it to randomize the hash function for Unicode and byte strings as well as datetime types.
You can disable the hash randomization by setting the environment variable PYTHONHASHSEED to 0 (or you can pick your own seed by setting it to any positive integer up to 2**32-1).
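For example, a script can check that randomization is off when launched as PYTHONHASHSEED=0 ./foobar.py:

import sys

# 0 when PYTHONHASHSEED=0 is set; byte-string hashes (and hence the
# dict order in the script above) are then reproducible across runs.
print(sys.flags.hash_randomization)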
It's worth noting that this behavior changed in Python 3.6. Hash randomization still happens, but a dictionary's iteration order is no longer based on the hash values of the keys: the CPython implementation of dict now preserves the order in which keys were added. In 3.6 this was still officially an implementation detail, but since Python 3.7 insertion-order preservation is guaranteed by the language. If you need a guaranteed iteration order on earlier versions, use the collections.OrderedDict class instead of a normal dict.
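For instance, with the keys from above:

from collections import OrderedDict

check_dict = OrderedDict()
check_dict[b'\x18'] = 1
check_dict[b'6'] = 1
check_dict[b'0'] = 1
print(list(check_dict))  # always [b'\x18', b'6', b'0'], whatever the hash seed is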
Is there an easy and more or less standard way to dump all the variables into a file, something like stacktrace but with the variables names and values? The ones that are in locals(), globals() and maybe dir().
I can't find an easy way; here's my code for locals(), which doesn't work because the keys can be of different types:
vars1 = list(filter(lambda x: len(x) > 2 and locals()[x][:2] != "__", locals()))
And without filtering, when trying to dump the variables I get an error:
f.write(json.dumps(locals()))
# =>
TypeError: <filter object at 0x7f9bfd02b710> is not JSON serializable
I think there must be something better than doing it manually.
To start with your non-working example: you don't actually filter the keys (which should normally only be strings, even if that's not technically required); locals()[x] gives the values.
But even if you did filter the keys in some way, you don't generally know that all of the remaining values are JSON serialisable. Therefore, you either need to filter the values to keep only types that can be mapped to JSON, or you need a default serialiser implementation that applies some sensible serialisation to any value. The simplest thing would be to just use the built-in string representation as a fall-back:
json.dumps(locals(), default=repr)
By the way, there's also a more direct and efficient way of dumping JSON to a file (note the difference between dump and dumps):
json.dump(locals(), f, default=repr)
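Putting it together, a small sketch (the file name and the dunder filter are just one way to do it):

import json

def dump_vars(variables, path='vars_dump.json'):
    # Drop dunder entries such as __name__, __builtins__, ...
    cleaned = {k: v for k, v in variables.items() if not k.startswith('__')}
    with open(path, 'w') as f:
        # repr() serves as a fallback for anything that isn't JSON serialisable.
        json.dump(cleaned, f, default=repr, indent=2)

dump_vars(locals())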
Say I have a dict at hand at runtime; is there an easy way to create code that defines the dict? For example, it should output the string
"d = {'string_attr1' : 'value1', 'bool_attr1': True}"
Of course it would be possible to write a converter function by hand, which iterates over the key-value pairs and puts the string together. That would still require handling special cases to decide whether values have to be quoted or not, etc.
More generally: is there a built-in way or a library to generate variable declarations from runtime data structures?
Context: I would like to use a list of dicts as input for a code generator. The content of the dicts would be queried from an SQL database. I don't want to tightly couple code generation to the querying of the SQL database, so I think it would be convenient to go with generating a python source file defining a list of dictionaries, which can be used as an input to the code generator.
>>> help(repr)
Help on built-in function repr in module __builtin__:
repr(...)
repr(object) -> string
Return the canonical string representation of the object.
For most object types, eval(repr(object)) == object.
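So for dicts of basic types (strings, numbers, booleans, None, lists, nested dicts), repr() already produces valid Python source, and generating a module is a one-liner (the file name here is just an example):

d = {'string_attr1': 'value1', 'bool_attr1': True}

# Write a Python source file that defines the dict.
with open('generated_config.py', 'w') as f:
    f.write('d = %r\n' % d)

# Sanity check: the representation round-trips to an equal object.
assert eval(repr(d)) == d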
I'm working on a simple class, something like an "in-memory Linux-like filesystem", for educational purposes. Files will be StringIO objects. I can't decide how to implement the file-folder hierarchy in Python. I'm thinking about using a list of objects with fields such as type, name and parent; what else? Maybe I should look into trees and graphs.
Update:
There will be these methods:
new_dir(path)
dir_list(path)
is_file(path)
is_dir(path)
remove(path)
read(file_descr)
file_descr open(file_path, mode=w|r)
close(file_descr)
write(file_descr, str)
It's perfectly possible to represent a tree as a nested set of lists. However, since entries are typically indexed by name, and a directory is generally considered to be unordered, nested dictionaries would make many operations faster and easier to write.
I wouldn't store the parent for each entry though, that's implicit from its position in the hierarchy.
Also, if you want your virtual file system to efficiently support hard links, you need to separate a file's contents from the directory hierarchy. That way, you can re-use the contents by giving each piece of content any number of names, which is what hard linking does.
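A minimal sketch of that idea (all names made up): directories are dicts, and file entries point at shared content objects, so two names can hard-link to the same data:

import io

# Shared file contents; several directory entries may reference the same object.
readme = io.StringIO('hello\n')

root = {
    'home': {
        'alice': {
            'readme.txt': readme,       # one name for the contents
            'readme_link.txt': readme,  # a "hard link": same object, second name
        },
    },
    'tmp': {},
}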
Maybe you can try using networkx. You just have to be a bit creative to adapt it for use with files and folders.
A simple example
import os
import networkx as nx

G = nx.Graph()
for (path, dirs, files) in os.walk(os.getcwd()):
    bname = os.path.basename(path)  # use the directory's base name as the node
    for f in files:
        G.add_edge(bname, f)
# Now do whatever you want with the graph
You should first ask the question: What operations should my "file system" support?
Based on the answer you select the data representation.
For example, if you choose to support only create and delete and the order of the files in the dictionary is not important, then select a python dictionary. A dictionary will map a file name (sub path name) to either a dictionary or the file container object.
What's the API of the filestore? Do you want to keep creation, modification and access times? Presumably the primary lookup will be by file name. Are any other retrieval operations anticipated?
If only lookup by name is required then one possible representation is to map the filestore root directory on to a Python dict. Each entry's key will be the filename, and the value will either be a StringIO object (hint: in Python 2 use cStringIO for better performance if it becomes an issue) or another dict. The StringIO objects represent your files, the dicts represent subdirectories.
So, to access any path you split it up into its constituent components (using .split("/")) and then use each to look up a successive element. Any KeyError exceptions imply "File or directory not found," as would any attempts to index a StringIO object (I'm too lazy to verify the specific exception).
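A sketch of that lookup, assuming the nested-dict layout described above (the names are made up):

import io

def resolve(root, path):
    # Walk a nested dict of dicts/StringIO objects along a '/'-separated path.
    node = root
    for part in path.strip('/').split('/'):
        if not part:  # skip empty components, e.g. from a double slash
            continue
        try:
            node = node[part]
        except (KeyError, TypeError):  # missing entry, or indexing into a file
            raise FileNotFoundError(path)
    return node

root = {'home': {'alice': {'readme.txt': io.StringIO('hello\n')}}}
print(resolve(root, '/home/alice/readme.txt').getvalue())  # hello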
If you want to implement greater detail then you would replace the StringIO objects and dicts with instances of some "filestore object" class. You could call it a "link" (since that's what it models: A Linux hard link). The various attributes of this object can easily be manipulated to keep the file attributes up to date, and the .data attribute can be either a StringIO object or a dict as before.
Overall I would prefer the second solution, since then it's easy to implement methods that do things like keep access times up to date by updating them as the operations are performed, but as I said much depends on the level of detail you want to provide.