I'm writing a set of python functions that perform some sort of conformance checking on a source code project. I'd like to specify quite verbose names for these functions, e.g.: check_5_theVersionOfAllVPropsMatchesTheVersionOfTheAutolinkHeader()
Could such excessively long names be a problem for python? Is there a maximum length for attribute names?
2.3. Identifiers and keywords from The Python Language Reference:
Identifiers are unlimited in length.
But you'll be violating PEP-8 most likely, which is not really cool:
Limit all lines to a maximum of 79 characters.
Also you'll be violating PEP-20 (the Zen of Python):
Readability counts.
They could be a problem for the programmer. Keep the function names reasonably short, and use docstrings to document them.
Since attribute names just get hashed and turned in to keys on inst.__dict__ for 99% of classes you'll ever encounter, there's no real limit on length. As long as it is hashable, it'll work as an attribute name. For the other 1% of classes that fiddle with __setattr__\ __getattr__\ __getattribute__ in ways that break the guarantee that anything hashable is a valid attribute name though, the previous does not apply.
Of course, as others have pointed out, you will have code style and quality concerns with longer named attributes. If you are finding yourself needing such long names, it's likely indicative of a design flaw in your program, and you should probably look at giving your data more hierarchical structure and better abstracting and dividing responsibility in your functions and methods.
Related
I am a newbie reading Uncle Bob's Clean Code Book.
It is indeed great practice to limit the number of function arguments as few as possible. But I still come across so many functions offered in many libraries that require a bunch of arguments. For example, in Python's pandas, there is a function with 9 arguments:
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=<object object>, observed=False, dropna=True)
(And this function also violates the advice about flag arguments)
It seems that such cases are much rarer in Python standard libraries, but I still managed to find one with 4 arguments:
re.split(pattern, string, maxsplit=0, flags=0)
I understand that this is just a suggestion instead of silver bullet, but is it applicable when it comes to something mentioned above?
Uncle Bob does not mention a hard limit of arguments that would make your code smell, but I would consider 9 arguments as too much.
Today's IDEs are much better in supporting the readability of the code, nevertheless refactoring stays tricky, especially with a large number of equally typed arguments.
The suggested solution is to encapsulate the arguments in a single struct/object (depending on your language). In the given case, this could be a GroupingStrategy:
strategy = GroupingStrategy();
strategy.by = "Foo"
strategy.axis = 0
strategy.sorted = true
DataFrame.groupby(strategy)
All not mentioned attributes will be assigned with the respective default values.
You could then also convert it to a fluent API:
DataFrame.groupby(GroupingStrategy.by("Foo").axis(0).sorted())
Or keep some of the arguments, if this feels better:
DataFrame.groupby("Foo", GroupingStrategy.default())
The first point to note is that all those arguments to groupby are relevant. You can reduce the number of arguments by having different versions of groupby but that doesn't help much when the arguments can be applied independently of each other, as is the case here. The same logic would apply to re.split.
It's true that integer "flag" arguments can be dodgy from a maintenance point of view - what happens if you want to change a flag value in your code? You have to hunt through and manually fix each case. The traditional approach is to use enums (which map numbers to words eg a Day enum would have Day.Sun = 0, Day.Mon = 1, etc) In compiled languages like C++ or C# this gives you the speed of using integers under the hood but the readability of using labels/words in your code. However enums in Python are slow.
One rule that I think applies to any source code is to avoid "magic numbers", ie numbers which appear directly in the source code. The enum is one solution. Another solution is to have constant variables to represent different flag settings. Python sort-of supports constants (uppercase variable names in constant.py which you then import) however they are constant only by convention, you can actually change their value :(
Reading the documentation I have noticed that the built-in function len doesn't support all iterables but just sequences and mappings (and sets). Before reading that, I always thought that the len function used the iteration protocol to evaluate the length of an object, so I was really surprised reading that.
I read the already-posted questions (here and here) but I am still confused, I'm still not getting the real reason why not allow len to work with all iterables in general.
Is it a more conceptual/logical reason than an implementational one? I mean when I'm asking the length of an object, I'm asking for one property (how many elements it has), a property that objects as generators don't have because they do not have elements inside, the produce elements.
Furthermore generator objects can yield infinite elements bring to an undefined length, something that can not happen with other objects as lists, tuples, dicts, etc...
So am I right, or are there more insights/something more that I'm not considering?
The biggest reason is that it reduces type safety.
How many programs have you written where you actually needed to consume an iterable just to know how many elements it had, throwing away anything else?
I, in quite a few years of coding in Python, never needed that. It's a non-sensical operation in normal programs. An iterator may not have a length (e.g. infinite iterators or generators that expects inputs via send()), so asking for it doesn't make much sense. The fact that len(an_iterator) produces an error means that you can find bugs in your code. You can see that in a certain part of the program you are calling len on the wrong thing, or maybe your function actually needs a sequence instead of an iterator as you expected.
Removing such errors would create a new class of bugs where people, calling len, erroneously consume an iterator, or use an iterator as if it were a sequence without realizing.
If you really need to know the length of an iterator, what's wrong with len(list(iterator))? The extra 6 characters? It's trivial to write your own version that works for iterators, but, as I said, 99% of the time this simply means that something with your code is wrong, because such an operation doesn't make much sense.
The second reason is that, with that change, you are violating two nice properties of len that currently hold for all (known) containers:
It is known to be cheap on all containers ever implemented in Python (all built-ins, standard library, numpy & scipy and all other big third party libraries do this on both dynamically sized and static sized containers). So when you see len(something) you know that the len call is cheap. Making it work with iterators would mean that suddenly all programs might become inefficient due to computations of the length.
Also note that you can, trivially, implement O(1) __len__ on every container. The cost to pre-compute the length is often negligible, and generally worth paying.
The only exception would be if you implement immutable containers that have part of their internal representation shared with other instances (to save memory). However, I don't know of any implementation that does this, and most of the time you can achieve better than O(n) time anyway.
In summary: currently everybody implements __len__ in O(1) and it's easy to continue to do so. So there is an expectation for calls to len to be O(1). Even if it's not part of the standard. Python developers intentionally avoid C/C++'s style legalese in their documentation and trust the users. In this case, if your __len__ isn't O(1), it's expected that you document that.
It is known to be not destructive. Any sensible implementation of __len__ doesn't change its argument. So you can be sure that len(x) == len(x), or that n = len(x);len(list(x)) == n.
Even this property is not defined in the documentation, however it's expected by everyone, and currently, nobody violates it.
Such properties are good, because you can reason and make assumptions about code using them.
They can help you ensure the correctness of a piece of code, or understand its asymptotic complexity. The change you propose would make it much harder to look at some code and understand whether it's correct or what would be it's complexity, because you have to keep in mind the special cases.
In summary, the change you are proposing has one, really small, pro: saving few characters in very particular situations, but it has several, big, disadvantages which would impact a huge portion of existing code.
An other minor reason. If len consumes iterators I'm sure that some people would start to abuse this for its side-effects (replacing the already ugly use of map or list-comprehensions). Suddenly people can write code like:
len(print(something) for ... in ...)
to print text, which is really just ugly. It doesn't read well. Stateful code should be relagated to statements, since they provide a visual cue of side-effects.
I need a module or strategy for detecting that a piece of data is written in a programming language, not syntax highlighting where the user specifically chooses a syntax to highlight. My question has two levels, I would greatly appreciate any help, so:
Is there any package in python that receives a string(piece of data) and returns if it belongs to any programming language syntax ?
I don't necessarily need to recognize the syntax, but know if the string is source code or not at all.
Any clues are deeply appreciated.
Maybe you can use existing multi-language syntax highlighters. Many of them can detect language a file is written in.
You could have a look at methods around baysian filtering.
My answer somewhat depends on the amount of code you're going to be given. If you're going to be given 30+ lines of code, it should be fairly easy to identify some unique features of each language that are fairly common. For example, tell the program that if anything matches an expression like from * import * then it's Python (I'm not 100% sure that phrasing is unique to Python, but you get the gist). Other things you could look at that are usually slightly different would be class definition (i.e. Python always starts with 'class', C will start with a definition of the return so you could check to see if there is a line that starts with a data type and has the formatting of a method declaration), conditionals are usually formatted slightly differently, etc, etc. If you wanted to make it more accurate, you could introduce some sort of weighting system, features that are more unique and less likely to be the result of a mismatched regexp get a higher weight, things that are commonly mismatched get a lower weight for the language, and just calculate which language has the highest composite score at the end. You could also define features that you feel are 100% unique, and tell it that as soon as it hits one of those, to stop parsing because it knows the answer (things like the shebang line).
This would, of course, involve you knowing enough about the languages you want to identify to find unique features to look for, or being able to find people that do know unique structures that would help.
If you're given less than 30 or so lines of code, your answers from parsing like that are going to be far less accurate, in that case the easiest best way to do it would probably be to take an appliance similar to Travis, and just run the code in each language (in a VM of course). If the code runs successfully in a language, you have your answer. If not, you would need a list of errors that are "acceptable" (as in they are errors in the way the code was written, not in the interpreter). It's not a great solution, but at some point your code sample will just be too short to give an accurate answer.
One of the things that I admire about Python is its distinction between mutable and immutable types. Having spent a while programming in c before coming to Python, I was astonished at how easily Python does away with all the complexities of pointer dereferencing that drive me mad in c. In Python everything just works the way I expect, and I quickly realized that the mutable/immutable distinction plays an important part in that.
There are still a few wrinkles, of course (mutable function argument defaults being a notable example) but overall, I feel that the mutable/immutable distinction greatly clarifies the question of what variables and their values are and how they ought to behave.
But where does it come from? I have to assume that GvR was not the first person to conceive of this distinction, and that Python was not the first language to use it. I'm interested in hearing about earlier languages that used this concept, as well as any early theoretical discussions of it.
If you like the idea of immutability, you should check out a pure functional language. In Haskell all (pure) variables are immutable (yet still referred to as "variables", but there you go). It is a great idea - you and the compiler both know that passing something into a function cannot change it in any way.
In C there is no explicit concept of immutable because the language is based on copy semantic. In Python instead values are always passed by reference and immutability plays an important role to keep the language manageable.
In Python you don't have pointers because everything is indeed a pointer! Imagine what it could mean for a Python program that even number objects could change value over time... you would be forced to play tricks like
class Shape:
def __init__(self, points):
self.points = points[:] # make a copy of the list
not only for lists and dicts, but also for numeric values and strings.
So basically in Python some types have immutable instances because they normally are used as "values" and you don't care about identity. In the rare cases in which you need for example a mutable object with a numeric value you need to explicitly wrap it up in say a class instance.
In other languages substantially based on reference semantic like LISP immutability is a choice left to the programmer and not a constraint even if many functions and idioms are supporting it (a big percentage of LISP standard library functions are non-destructive and it's indeed sort of a shame that destructive ones aren't always clearly distinguishable by the name).
So for example strings are mutable LISP but probably not many LISP programs actually modify strings in place because that would mean giving up the nice possibility of sharing and would require explicitly copying strings in many places (in most cases strings are just values, not objects in which you care about identity). Are strings immutable in LISP? No. Are programs mutating them? Almost never.
Not leaving a choice to the programmer is in the spirit of the Python language. It feels great when the choices you are forced to are in line with your idea... less decisions and things are exactly like you wanted them to be.
There are however in my opinion two different dangers with this approach:
Problems arise when those pre-made choices are NOT in line on what you would like or need to do. Python is a wonderful language, but with a different dictator could become pure hell.
Understanding and making choiches forces you to think and expands your mind. Not making choices instead just puts a cage on your mind and after a while you may not even realize that what you are using is just ONE possibility, not the only possibility. After a while you may begin to just think that "things MUST be done this way": you don't even feel you are forced because your mind has already been "mutilated".
Objective C is loaded with mutable/immutable distinctions (to the point where there are both NSString and NSMutableString, for example); it predates Python by about 8 years. Smalltalk, from which Objective C inherited much of its OO design, uses the concept to a lesser extent (notably, strings are not immutable; the trend these days is towards immutable strings as in Python, Ruby, etc.).
I'd recommend reading the following entries from Wikipedia:
Immutable object
Functional Programming
I've had some really awesome help on my previous questions for detecting paws and toes within a paw, but all these solutions only work for one measurement at a time.
Now I have data that consists off:
about 30 dogs;
each has 24 measurements (divided into several subgroups);
each measurement has at least 4 contacts (one for each paw) and
each contact is divided into 5 parts and
has several parameters, like contact time, location, total force etc.
Obviously sticking everything into one big object isn't going to cut it, so I figured I needed to use classes instead of the current slew of functions. But even though I've read Learning Python's chapter about classes, I fail to apply it to my own code (GitHub link)
I also feel like it's rather strange to process all the data every time I want to get out some information. Once I know the locations of each paw, there's no reason for me to calculate this again. Furthermore, I want to compare all the paws of the same dog to determine which contact belongs to which paw (front/hind, left/right). This would become a mess if I continue using only functions.
So now I'm looking for advice on how to create classes that will let me process my data (link to the zipped data of one dog) in a sensible fashion.
How to design a class.
Write down the words. You started to do this. Some people don't and wonder why they have problems.
Expand your set of words into simple statements about what these objects will be doing. That is to say, write down the various calculations you'll be doing on these things. Your short list of 30 dogs, 24 measurements, 4 contacts, and several "parameters" per contact is interesting, but only part of the story. Your "locations of each paw" and "compare all the paws of the same dog to determine which contact belongs to which paw" are the next step in object design.
Underline the nouns. Seriously. Some folks debate the value of this, but I find that for first-time OO developers it helps. Underline the nouns.
Review the nouns. Generic nouns like "parameter" and "measurement" need to be replaced with specific, concrete nouns that apply to your problem in your problem domain. Specifics help clarify the problem. Generics simply elide details.
For each noun ("contact", "paw", "dog", etc.) write down the attributes of that noun and the actions in which that object engages. Don't short-cut this. Every attribute. "Data Set contains 30 Dogs" for example is important.
For each attribute, identify if this is a relationship to a defined noun, or some other kind of "primitive" or "atomic" data like a string or a float or something irreducible.
For each action or operation, you have to identify which noun has the responsibility, and which nouns merely participate. It's a question of "mutability". Some objects get updated, others don't. Mutable objects must own total responsibility for their mutations.
At this point, you can start to transform nouns into class definitions. Some collective nouns are lists, dictionaries, tuples, sets or namedtuples, and you don't need to do very much work. Other classes are more complex, either because of complex derived data or because of some update/mutation which is performed.
Don't forget to test each class in isolation using unittest.
Also, there's no law that says classes must be mutable. In your case, for example, you have almost no mutable data. What you have is derived data, created by transformation functions from the source dataset.
The following advices (similar to #S.Lott's advice) are from the book, Beginning Python: From Novice to Professional
Write down a description of your problem (what should the problem do?). Underline all the nouns, verbs, and adjectives.
Go through the nouns, looking for potential classes.
Go through the verbs, looking for potential methods.
Go through the adjectives, looking for potential attributes
Allocate methods and attributes to your classes
To refine the class, the book also advises we can do the following:
Write down (or dream up) a set of use cases—scenarios of how your program may be used. Try to cover all the functionally.
Think through every use case step by step, making sure that everything we need is covered.
I like the TDD approach...
So start by writing tests for what you want the behaviour to be. And write code that passes. At this point, don't worry too much about design, just get a test suite and software that passes. Don't worry if you end up with a single big ugly class, with complex methods.
Sometimes, during this initial process, you'll find a behaviour that is hard to test and needs to be decomposed, just for testability. This may be a hint that a separate class is warranted.
Then the fun part... refactoring. After you have working software you can see the complex pieces. Often little pockets of behaviour will become apparent, suggesting a new class, but if not, just look for ways to simplify the code. Extract service objects and value objects. Simplify your methods.
If you're using git properly (you are using git, aren't you?), you can very quickly experiment with some particular decomposition during refactoring, and then abandon it and revert back if it doesn't simplify things.
By writing tested working code first you should gain an intimate insight into the problem domain that you couldn't easily get with the design-first approach. Writing tests and code push you past that "where do I begin" paralysis.
The whole idea of OO design is to make your code map to your problem, so when, for example, you want the first footstep of a dog, you do something like:
dog.footstep(0)
Now, it may be that for your case you need to read in your raw data file and compute the footstep locations. All this could be hidden in the footstep() function so that it only happens once. Something like:
class Dog:
def __init__(self):
self._footsteps=None
def footstep(self,n):
if not self._footsteps:
self.readInFootsteps(...)
return self._footsteps[n]
[This is now a sort of caching pattern. The first time it goes and reads the footstep data, subsequent times it just gets it from self._footsteps.]
But yes, getting OO design right can be tricky. Think more about the things you want to do to your data, and that will inform what methods you'll need to apply to what classes.
After skimming your linked code, it seems to me that you are better off not designing a Dog class at this point. Rather, you should use Pandas and dataframes. A dataframe is a table with columns. You dataframe would have columns such as: dog_id, contact_part, contact_time, contact_location, etc.
Pandas uses Numpy arrays behind the scenes, and it has many convenience methods for you:
Select a dog by e.g. : my_measurements['dog_id']=='Charly'
save the data: my_measurements.save('filename.pickle')
Consider using pandas.read_csv() instead of manually reading the text files.
Writing out your nouns, verbs, adjectives is a great approach, but I prefer to think of class design as asking the question what data should be hidden?
Imagine you had a Query object and a Database object:
The Query object will help you create and store a query -- store, is the key here, as a function could help you create one just as easily. Maybe you could stay: Query().select('Country').from_table('User').where('Country == "Brazil"'). It doesn't matter exactly the syntax -- that is your job! -- the key is the object is helping you hide something, in this case the data necessary to store and output a query. The power of the object comes from the syntax of using it (in this case some clever chaining) and not needing to know what it stores to make it work. If done right the Query object could output queries for more then one database. It internally would store a specific format but could easily convert to other formats when outputting (Postgres, MySQL, MongoDB).
Now let's think through the Database object. What does this hide and store? Well clearly it can't store the full contents of the database, since that is why we have a database! So what is the point? The goal is to hide how the database works from people who use the Database object. Good classes will simplify reasoning when manipulating internal state. For this Database object you could hide how the networking calls work, or batch queries or updates, or provide a caching layer.
The problem is this Database object is HUGE. It represents how to access a database, so under the covers it could do anything and everything. Clearly networking, caching, and batching are quite hard to deal with depending on your system, so hiding them away would be very helpful. But, as many people will note, a database is insanely complex, and the further from the raw DB calls you get, the harder it is to tune for performance and understand how things work.
This is the fundamental tradeoff of OOP. If you pick the right abstraction it makes coding simpler (String, Array, Dictionary), if you pick an abstraction that is too big (Database, EmailManager, NetworkingManager), it may become too complex to really understand how it works, or what to expect. The goal is to hide complexity, but some complexity is necessary. A good rule of thumb is to start out avoiding Manager objects, and instead create classes that are like structs -- all they do is hold data, with some helper methods to create/manipulate the data to make your life easier. For example, in the case of EmailManager start with a function called sendEmail that takes an Email object. This is a simple starting point and the code is very easy to understand.
As for your example, think about what data needs to be together to calculate what you are looking for. If you wanted to know how far an animal was walking, for example, you could have AnimalStep and AnimalTrip (collection of AnimalSteps) classes. Now that each Trip has all the Step data, then it should be able to figure stuff out about it, perhaps AnimalTrip.calculateDistance() makes sense.