In Python (2.7 and above, probably other versions too), it is possible to create a string that is centered by doing something like this:
'{:^10}'.format('abc')
The meaning of 'centered' is pretty clear when the total number of padding characters is even, but what about when it is odd?
When I print the above in vanilla C Python (and IPython), I get
'   abc    '
This appears to put the extra pad character on the right. However, the docs do not explicitly mention a spec for this behavior. Is the behavior of the centering format specifier in the presence of an odd number of padding characters specified somewhere, or is it an implementation detail that is not to be relied on?
You should be able to rely on this. I don't know that it is documented anywhere, but the standard python test suite asserts that the extra space is added on the right. Since test is part of the standard library, it's a good starting point for other python implementations and they'll be aiming for compliance with the reference implementation wherever possible.
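A quick check confirms the behavior (CPython 2.7 and 3.x agree):

```python
# '^' alignment computes the left padding as total_pad // 2, so when
# the total padding is odd, the extra character lands on the right.
print(repr('{:^10}'.format('abc')))  # 7 pads: 3 left, 4 right
print(repr('{:^6}'.format('abc')))   # 3 pads: 1 left, 2 right
```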
Related
I have a large dataset with over 2 million rows of textual data. Now I want to remove the accents from the strings.
In the link below, two different modules are described to remove the accents:
What is the best way to remove accents in a Python unicode string?
The modules described are unicode and unicodedata. To me it's not clear what the differences are between the two and a comparison is hard, because I don't have many rows with accents and I don't know what accents might be replaced and which ones are not.
Therefore, I would like to know what the differences are between the two and which one is recommended to use.
There is only one module: unicodedata, which includes the unicode database, i.e. the names and properties of unicode code points.
unicode was a built-in function (and type) in Python 2. It merely converted byte strings to unicode strings given an encoding; it had no need for the character database. In Python 3 all strings are unicode (with some particularities), and only the encoding now has to be specified explicitly.
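For instance, in Python 3 the decoding step is explicit:

```python
# Python 3: bytes must be decoded to str explicitly; all str objects
# are unicode. The byte values here are 'café' encoded as UTF-8.
raw = b'caf\xc3\xa9'
text = raw.decode('utf-8')
print(text)
```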
In that answer, you see only import unicodedata, so only one module. To remove accents you need not just the unicode code points but also information about the type of each code point (e.g. whether it is a combining character), so you need unicodedata.
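A common unicodedata-based filter (the function name strip_accents is my own) decomposes each character and drops the combining marks:

```python
import unicodedata

def strip_accents(text):
    # NFD splits 'é' into 'e' plus a combining accent; category 'Mn'
    # (Mark, nonspacing) identifies the combining marks to drop.
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed
                   if unicodedata.category(ch) != 'Mn')

print(strip_accents('café naïve'))  # cafe naive
```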
Maybe you mean unidecode. That is a separate module, outside the standard library. It can be useful for some purposes: it is simple and produces results purely in the ASCII domain. That may be fine in some cases, but it can cause problems outside the Latin writing system.
unicodedata, on the other hand, does nothing for you by itself. You have to understand unicode and apply the right filter function yourself (and perhaps know how other languages work).
So it depends on the case, and maybe you just need a different slug function (to create a non-escaped string). When working with languages, take care not to overdo things (you may build an offensive word).
The E226 error code is about "missing whitespace around arithmetic operator".
I use the Anaconda package in Sublime, which, for example, highlights this line as a PEP8 E226 violation:
hypot2 = x*x + y*y
But in Guido's PEP8 style guide that line is actually shown as an example of recommended use of spaces within operators.
Question: which is the correct guideline? Always spaces around operators or just in some cases (as Guido's recommendation shows)?
Also: who decides what goes into PEP8? I would've thought Guido's recommendation would pretty much determine how that works.
The maintainers of the PEP8 tool decide what goes into it.
As you noticed, these do not always match the PEP8 style guide exactly. In this particular case, I don't know whether it's an oversight by the maintainers or a deliberate decision. You'd have to ask them to find out, or you might find the answer in the commit history.
Guido recently asked the maintainers of pep8 and pep257 tools to rename them, to avoid this confusion. See this issue for example. As a result, the tools are getting renamed to pycodestyle and pydocstyle, respectively.
It says in PEP8:
If operators with different priorities are used, consider adding whitespace around the operators with the lowest priority(ies). Use your own judgment; however, never use more than one space, and always have the same amount of whitespace on both sides of a binary operator.
(Emphasis is my own).
In the listed example, + has a lower priority, so the BDFL elects to use whitespace around it and uses no whitespace around higher priority *.
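Both of the following lines follow that recommendation, even though E226 flags the first one (the sample values are mine, the expressions are PEP 8's own examples):

```python
# PEP 8's examples: spaces only around the lowest-priority operator.
x, y, a, b = 3.0, 4.0, 5.0, 2.0
hypot2 = x*x + y*y   # '+' binds loosest, so it gets the spaces
c = (a+b) * (a-b)    # here '*' binds loosest
print(hypot2, c)
```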
In the case that happened to me, the fix was to always put spaces between the numbers or variables and the operators:
example:
a=b*4 wrong
a = b * 4 correct
I want to get the expression of p__s_alpha = 1/k * Product(a_0,(i,1,O)) in latex. I use:
print sympy.latex(p__s_alpha)
When I run latex on the result, the expression is rendered as a single fraction, \frac{\prod_{i=1}^{O} a_{0}}{k}. However, I want to print this equation: \frac{1}{k} \prod_{i=1}^{O} a_{0}.
Is there a way to keep the representation of the expression the way it is?
I started writing up an answer about how you could make your own custom printer that does this, but then I realized that there's already an option to latex that does what you want: the long_frac_ratio option. If the numerator is long enough relative to the denominator (as controlled by that ratio), the fraction is printed as 1/b*a instead of a/b.
In [31]: latex(p__s_alpha, long_frac_ratio=1)
Out[31]: '\\frac{1}{k} \\prod_{i=1}^{O} a_{0}'
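For a self-contained reproduction (the symbol definitions are my assumption, reconstructed from the expression in the question):

```python
from sympy import symbols, Product, latex

k = symbols('k')
i, O = symbols('i O', integer=True)
a_0 = symbols('a_0')

p__s_alpha = Product(a_0, (i, 1, O)) / k

# With long_frac_ratio=1 the 1/k factor is pulled out in front
# instead of everything being grouped into one fraction.
print(latex(p__s_alpha, long_frac_ratio=1))
```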
If you're interested, here is some of what I was going to write about writing a custom printer:
Internally, SymPy makes no distinction between a/b and a*1/b. They are both represented by the exact same object (see http://docs.sympy.org/latest/tutorial/manipulation.html).
However, the printing system is different. As you can see from that page, a/b is represented as Mul(a, Pow(b, -1)), i.e., a*b**-1, but it is the printer that converts this into a fraction format (this holds for any printer, not just the LaTeX one).
The good news for you is that the printing system in SymPy is very extensible (and the other good news is that SymPy is BSD open source, so you can freely reuse the printing logic that is already there in extending it). To create a custom LaTeX printer that does what you want, you need to subclass LatexPrinter in sympy.printing.latex and override the _print_Mul function (because as noted above, a/b is a Mul). The logic in this function is not really split up modularly, so you'll really need to copy the whole source and change the relevant parts of it [as I noted above, for this one, there is already an option that does what you want, but in other cases, there may not be].
And a final note: if you make a modification that would probably be wanted by a wider audience, we would love to have you submit it as a pull request to SymPy.
I need a module or strategy for detecting that a piece of data is written in a programming language, not syntax highlighting where the user specifically chooses a syntax to highlight. My question has two levels, I would greatly appreciate any help, so:
Is there any package in python that receives a string(piece of data) and returns if it belongs to any programming language syntax ?
I don't necessarily need to recognize the syntax, but know if the string is source code or not at all.
Any clues are deeply appreciated.
Maybe you can use existing multi-language syntax highlighters. Many of them can detect language a file is written in.
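For instance, Pygments (a third-party package) ships a lexer guesser; a minimal sketch, assuming pygments is installed:

```python
from pygments.lexers import guess_lexer

snippet = 'from os import path\nprint(path.join("a", "b"))'
lexer = guess_lexer(snippet)  # picks the most plausible lexer
print(lexer.name)             # e.g. 'Python' for this snippet
```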
You could have a look at methods around Bayesian filtering.
My answer somewhat depends on the amount of code you're going to be given. If you're going to be given 30+ lines of code, it should be fairly easy to identify some unique features of each language that are fairly common. For example, tell the program that if anything matches an expression like from * import * then it's Python (I'm not 100% sure that phrasing is unique to Python, but you get the gist).
Other things you could look at that are usually slightly different would be class definitions (e.g. Python always starts with 'class', while C starts with a return type, so you could check whether a line starts with a data type and has the shape of a method declaration), conditionals, which are usually formatted slightly differently, and so on.
If you wanted to make it more accurate, you could introduce some sort of weighting system: features that are more unique and less likely to be the result of a mismatched regexp get a higher weight, things that are commonly mismatched get a lower weight for that language, and you just calculate which language has the highest composite score at the end. You could also define features that you feel are 100% unique, and tell it that as soon as it hits one of those, to stop parsing because it knows the answer (things like the shebang line).
This would, of course, involve you knowing enough about the languages you want to identify to find unique features to look for, or being able to find people that do know unique structures that would help.
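The scoring idea above can be sketched roughly like this (the feature table is illustrative only, not an exhaustive or reliable set of language fingerprints):

```python
import re

# Hypothetical feature table: (pattern, language, weight).
FEATURES = [
    (re.compile(r'^\s*from\s+\w+\s+import\s+'), 'Python', 3),
    (re.compile(r'^\s*def\s+\w+\(.*\):'),       'Python', 2),
    (re.compile(r'^\s*#include\s*<\w+\.h>'),    'C',      3),
    (re.compile(r'^\s*public\s+class\s+\w+'),   'Java',   3),
]

def guess_language(source):
    # Sum the weights of every matching feature per language and
    # return the language with the highest composite score.
    scores = {}
    for line in source.splitlines():
        for pattern, lang, weight in FEATURES:
            if pattern.search(line):
                scores[lang] = scores.get(lang, 0) + weight
    return max(scores, key=scores.get) if scores else None

print(guess_language('#include <stdio.h>\nint main(void) { return 0; }'))
```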
If you're given less than 30 or so lines of code, your answers from parsing like that are going to be far less accurate. In that case the easiest way to do it would probably be to take an approach similar to Travis CI, and just run the code in each language (in a VM, of course). If the code runs successfully in a language, you have your answer. If not, you would need a list of errors that are "acceptable" (as in, they are errors in the way the code was written, not in the interpreter). It's not a great solution, but at some point your code sample will just be too short to give an accurate answer.
The style guide says that underscores should be used, but many Python built-in functions do not. What should the criteria be for underscores? I would like to stay consistent with Python style guidelines but this area seems a little vague. Is there a good rule of thumb, is it based on my own judgment, or does it just not really matter either way?
For example, should I name my function isfoo() to match older functions, or should I name it is_foo() to match the style guideline?
The style guide leaves this up to you:
Function names should be lowercase, with words separated by underscores as necessary to improve readability.
In other words, if you feel like adding an underscore to your method name would make it easier to read -- by all means go ahead and throw one (or two!) in there. If you think that there are enough other similar cases in the standard library, then feel free to leave it out. There is no hard rule here (although others may disagree about that point). The only thing which I think is universally accepted is that you shouldn't use "CapWords" or "camelCase" for your methods. "CapWords" should be reserved for classes, and I'm not aware of any precedent for "camelCase" anywhere (though I could be wrong about that) ...
The style guide says that underscores should be used, but many Python built-in functions do not.
Using the built-in isinstance() as an example, this was written in 1997.
PEP 8 -- Style Guide for Python Code wasn't published until 2001, which may account for this discrepancy.
What should the criteria be for underscores? I would like to stay consistent with Python style guidelines but this area seems a little vague. Is there a good rule of thumb, is it based on my own judgment, or does it just not really matter either way?
"When in doubt, use your best judgment."
It's a style guide, not a law! Follow the guide where it is appropriate, bearing in mind the caveats (emphasis my own):
Function names should be lowercase, with words separated by underscores as necessary to improve readability.
mixedCase is allowed only in contexts where that's already the prevailing style (e.g. threading.py), to retain backwards compatibility.
Therefore...
For example, should I name my function isfoo() to match older functions, or should I name it is_foo() to match the style guideline?
You should probably call it isfoo(), to follow the convention of similar functions. It is readable. "Readability counts."