I'd appreciate some help on an efficient Pythonic solution for this problem.
Our internal coding standards mandate certain information fields should be in a block comment at the top of the file. In Perl, this was obviously a block of text beginning with '#'.
I'm experimenting with including this information in the module docstring in Python. The problem is I need to access some of this information in the program.
I have surgically extended docstring_parser to recognise the information fields, and create a data structure. This all works.
Except that one of the fields includes the source file location. That's fine on Unix, but we are a cross-platform shop, and Windows uses '\' as a path separator. Python treats the backslash sequences in the docstring (\r, \t and so on) as string escapes, turning them into carriage returns and tabs, with weird results.
So the string %workspace%\PythonLib\rr2\tests\test_rr2.py
gets rendered as:
%workspace%\PythonLib
r2 ests est_rr2.py
which isn't exactly readable anymore.
The fix I have attempted is based on repeated applications of str.replace(), but is there a better way?
@user2357112 is correct. The docstring can be made raw by beginning it with r""", and then everything works.
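For anyone else who hits this, a quick way to see the difference in the interpreter, using the path from the question:

# Without the r prefix, \r and \t inside the literal are compiled into a
# carriage return and tabs, which is exactly the mangling shown above.
# (Recent Python versions also warn about the unrecognised \P escape.)
plain = "%workspace%\PythonLib\rr2\tests\test_rr2.py"
raw = r"%workspace%\PythonLib\rr2\tests\test_rr2.py"

print(repr(plain))   # '%workspace%\\PythonLib\r2\tests\test_rr2.py'
print(repr(raw))     # '%workspace%\\PythonLib\\rr2\\tests\\test_rr2.py'
print(plain == raw)  # False

The same r prefix on the module docstring (r""" ... """) is all that's needed.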
I'm trying to understand why, when we were using pandas to_csv(), the number 3189069486778499 was output as "0.\x103189069486778499". This is the only such case in a huge amount of data.
When using to_csv(), we already pass encoding='utf8'; normally that would take care of most unicode problems...
So I'm trying to understand what "\x10" is, so that I can work out why...
The whole process was running in a luigi pipeline, and sometimes luigi generates weird output. I tried the same thing in IPython with the same version of pandas, and everything works fine...
Because it's the likely answer, even if the details aren't provided in your question:
It's highly likely something in your pipeline is intentionally producing fields with length prefixed text, rather than the raw unstructured text. \x103189069486778499 is a binary byte with the value 16 (0x10), followed by precisely 16 characters. The 0. before it may be from a previous output, or some other part of whatever custom data serialization format it's using.
This design is usually intended to make parsing more efficient; if you use a delimiter character between fields (e.g. a comma, like CSV), you're stuck coming up with ways to escape or quote the delimiter when it occurs in your actual data, and parsers have to scan character by character, statefully, to figure out where a field begins and ends. With length prefixed text, the parser finds a field length and knows exactly how many characters to read to slurp the field, or how many to skip to find the next field, no quoting or escaping required, no matter what the field contains.
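To make the idea concrete, here is a minimal sketch of length-prefixed fields (illustrative only; whatever serializer your pipeline actually uses is unknown):

# Each field is one length byte followed by exactly that many characters.
def pack_fields(fields):
    out = []
    for field in fields:
        if len(field) > 255:
            raise ValueError("field too long for a one-byte length prefix")
        out.append(chr(len(field)) + field)
    return "".join(out)

def unpack_fields(data):
    fields, i = [], 0
    while i < len(data):
        length = ord(data[i])                    # the prefix byte, e.g. '\x10' == 16
        fields.append(data[i + 1:i + 1 + length])
        i += 1 + length
    return fields

packed = pack_fields(["3189069486778499"])
print(repr(packed))           # '\x103189069486778499' -- the value from the question
print(unpack_fields(packed))  # ['3189069486778499']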
As for what's doing this: You're going to have to check the commands in your pipeline. Your question provides no meaningful way to determine the cause of this problem.
Working from the command line I wrote a function called go(). When called, it prompts the user for a directory address in the format drive:\directory. No need for extra slashes or quotes or r literal qualifiers or what have you. Once you've provided a directory, it lists all the non-hidden files and directories under it.
I want to update the function now with a statement that stores this location in a variable, so that I can start browsing my hierarchy without specifying the full address every time.
Unfortunately I don't remember what statements I put in the function in the first place to make it work as it does. I know it's simple and I could just look it up and rebuild it from scratch with not too much effort, but that isn't the point.
As someone who is trying to learn the language, I try to stay at the command line as much as possible, only visiting the browser when I need to learn something NEW. Having to refer to obscure findings attached to vaguely related questions to rediscover how to do things I've already done is very cumbersome.
So my question is, can I see the contents of functions I have written, and how?
Unfortunately, no. For a function typed at the interactive prompt, Python does not keep the source text around; the best you can do is see the compiled byte code.
The inspect module details what information is available at runtime: https://docs.python.org/3.5/library/inspect.html
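For example, you can always disassemble a function you typed at the prompt, but asking for its source will fail; a minimal sketch (the go() body here is just a stand-in, since the original is unknown):

import dis
import inspect

def go():
    return sorted("abc")

dis.dis(go)  # the compiled byte code is always available

# inspect.getsource() works when the function was loaded from a .py file,
# but raises an error for functions typed directly at the >>> prompt.
try:
    print(inspect.getsource(go))
except (IOError, OSError):
    print("source not available")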
As a secondary task to a Python auto-completion (https://github.com/davidhalter/jedi), I'm writing a VIM plugin with the ability to do renaming (refactoring).
The most comfortable way to do renaming is to use cw and autocommand InsertLeave :call do_renaming_func(). To do this I need to access the redo-register (see help redo-register) or something similar, which would record the written text.
If possible, I'd like to do this without macros, because I don't want to mess anything up.
The . register (@.) unfortunately contains all editing keys in raw form, so also <Del> and <BS>, which show up as <80>kD and which insert completion does not interpret. Instead, to extract only the net text entered, use the range delimited by the marks '[ and '] (the last one exclusive).
For an example on how to do this, have a look at my PrevInsertComplete plugin.
The . register contains the last inserted text. See :help quote_..
The help doesn't specifically mention any caveats of when this register is populated, however it does mention that it doesn't work when editing the command line. This shouldn't be an issue for you.
The problem was not knowing which register it was, but how to access it.
I eventually found the method:
getreg('.')
As @Ingo Karkat points out, this register might include some escape chars.
However, I used a different method in the end. I just read expand('<cword>') to get the new word (because a rename is always only one word). This is far easier and more reliable.
Here's the code (line 113):
https://github.com/davidhalter/jedi/commit/6920f15caf332cd14a7833a071765dfa77d82328
I've just started a coding blog, and I'm using the SyntaxHighlighter Evolved WordPress plugin for syntax highlighting of my snippets.
I've just about finished writing a Pythonic post, and wanted to test out my code snippets before publishing.
If you double click code from inside my snippets, the plugin will stop highlighting the code, allowing you to select it as plain text. However, if I copy and paste some Python code from my snippets, it comes with non-breaking space characters in it (\xc2\xa0 in UTF-8). This causes Python to whinge about the encoding:
SyntaxError: Non-ASCII character '\xc2' in file ex2.py on line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
I don't particularly want to be declaring encodings for every single Python snippet I write - and I don't even know if this will solve the issue.
The best solution would of course be to get the plugin not to use these characters in the plain-text version. Or would it?
Does anyone have any ideas as to how I can get around this issue?
Ah, got it. Just a bit of poking around in the plugin's source fixed this issue for me...
If you beautify the syntaxhighlighter3/scripts/shCore.js file, then you can see there is a config variable, which includes:
space: "&nbsp;"
All I had to do was change it to space: " " (a regular space) and repack it.
I ran into this problem when Python code had been copied from Skype. Since I use vim to edit, I went ahead and found all of these by doing this:
:set hls
/<space>
This highlights every ordinary space, so the odd space characters stand out as the ones that are not highlighted.
Yank one of those characters, which stores it in register 0.
Use the substitute command and press <ctrl-R><0> to paste that character into the command line.
:%s/<ctrl-R><0>/ /g
It will look like
:%s/ / /g
(the character in the search pattern is actually the pasted non-breaking space), but when run, it will correct the problem.
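If you'd rather do the cleanup outside Vim, a quick Python pass does the same job (a sketch, assuming the snippet was saved as a UTF-8 file; the file name is just an example):

import io

path = "ex2.py"
with io.open(path, encoding="utf-8") as f:
    text = f.read()

# Replace non-breaking spaces (U+00A0, which is \xc2\xa0 in UTF-8)
# with ordinary spaces and write the file back out.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text.replace(u"\u00a0", u" "))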
NBSP isn't considered whitespace for the purposes of indentation anyway, so you should take a look at what the pre select user script does and mimic it.
Say you have some meta data for a custom file format that your Python app reads. Something like a CSV with variables that can change as the file is manipulated:
var1,data1
var2,data2
var3,data3
So if the user can manipulate this meta data, do you have to worry about someone crafting a malformed meta data file that allows arbitrary code execution? The only thing I can imagine is if you made the poor choice to make var1 be a shell command that you execute with os.system(data1) somewhere in your own code. Also, if this were C you would have to worry about buffers being blown, but I don't think you have to worry about that with Python. If you're reading that data in as a string, is it possible to somehow escape it with something like "\n os.system('rm -r /')"? This SQL-injection-like example totally won't work, but is something similar possible?
If you are doing what you say there (plain text, just reading and parsing a simple format), you will be safe. As you indicate, Python is generally safe from the more mundane memory corruption errors that C developers can create if they are not careful. The SQL injection scenario you note is not a concern when simply reading in files in Python.
However, if you are concerned about security, which it seems you are (interjection: good for you! A good programmer should be lazy and paranoid), here are some things to consider:
Validate all input. Make sure that each piece of data you read is of the expected size, type, range, etc. Error early, and don't propagate tainted variables elsewhere in your code. (There's a short sketch of what this can look like after this list.)
Do you know the expected names of the vars, or at least their format? Make sure to validate that it is the kind of thing you expect before you use it. If it should be just letters, confirm that with a regex or similar.
Do you know the expected range or format of the data? If you're expecting a number, make sure it's a number before you use it. If it's supposed to be a short string, verify the length; you get the idea.
What if you get characters or bytes you don't expect? What if someone throws unicode at you?
If any of these are paths, make sure you canonicalize and know that the path points to an acceptable location before you read or write.
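Concretely, for the two-column format above, that validation might look something like this (just a sketch; the allowed-name pattern and the numeric range are invented placeholders):

import csv
import re

VALID_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")   # example name rule

def load_meta(path):
    meta = {}
    with open(path, newline="") as f:
        for lineno, row in enumerate(csv.reader(f), 1):
            if len(row) != 2:
                raise ValueError("line %d: expected exactly two fields" % lineno)
            name, value = row
            if not VALID_NAME.match(name):
                raise ValueError("line %d: bad variable name %r" % (lineno, name))
            number = float(value)            # fail early on non-numeric data
            if not 0 <= number <= 1000:      # example range check
                raise ValueError("line %d: value out of range" % lineno)
            meta[name] = number
    return meta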
Some specific things not to do:
os.system(attackerControlledString)
eval(attackerControlledString)
__import__(attackerControlledString)
pickle/unpickle attacker controlled content (here's why)
Also, rather than rolling your own config file format, consider ConfigParser or something like JSON. A well understood format (and libraries) helps you get a leg up on proper validation.
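For instance, if the metadata were stored as JSON, the standard library reader only ever builds plain data structures and never executes anything from the file (a sketch; the file name and keys are made up):

import json

# meta.json might contain: {"var1": 1.0, "var2": 2.0, "var3": 3.0}
with open("meta.json") as f:
    meta = json.load(f)

print(meta["var1"])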
OWASP would be my normal go-to for providing a "further reading" link, but their Input Validation page needs help. In lieu of that, this looks like a reasonably pragmatic read: "Secure Programmer: Validating Input". A slightly dated but more Python-specific one is "Dealing with User Input in Python".
Depends entirely on the way the file is processed, but generally this should be safe. In Python, you have to put in some effort if you want to treat text as code and execute it.