I am using a yaml file to store some config data, including some (many) regex strings that I don't want to keep in code. Everything works except when I try to search for some incorrectly escaped pipe characters with r'\\\|'. I tried quoted, unquoted and literal strings in yaml, nothing works. Yaml and Python string escape rules together seem to conspire to keep the number of backslashes in a string even. I open and load the file with
f=open(file_path, 'r', encoding='utf_8')
python_dict = yaml.load(f)
I'd like to reproduce
re.compile('\\\|')
using something like
re.compile(python_dict['escaped-pipes'])
Python 3.4 with PyYAML.
In YAML, \ is a special character only in double-quoted strings. The Python string r'\\\|' is a raw string, so it consists of three backslashes and a pipe. You have the following options to encode this in a YAML document:
plain: \\\| # plain scalar. YAML does not process the backslashes.
single: '\\\|' # single quoted scalar. YAML also does not process the backslashes.
double: "\\\\\\|" # double quoted scalar. YAML does process escape sequences here, so you
# need to double the backslashes
Single-quoted scalars are kind-of the YAML equivalent of Python's raw strings.
You may have other characters in your regexes that may interfere with YAML's syntax. You may want to use block scalars instead, which treat all characters as content and are ended via indentation:
block: |-
  \\\|
next block: |-
  \\\|
| starts a literal block scalar, - causes the final linebreak (before the next item) to not be part of the scalar, which is what you want.
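As a quick check of the options above, here is a minimal sketch, assuming Python 3 and PyYAML. The key names are just the illustrative ones used above. It loads each form and compares the result with the raw Python string:

```python
import re
import yaml  # PyYAML, assumed to be installed

# Raw triple-quoted literal so Python itself leaves the backslashes alone.
document = r'''
plain: \\\|
single: '\\\|'
double: "\\\\\\|"
block: |-
  \\\|
'''

patterns = yaml.safe_load(document)
expected = r'\\\|'  # three backslashes and a pipe

for name, value in patterns.items():
    # Every style should decode to the same Python string.
    print(name, value == expected)

# The loaded value can be handed straight to re.compile().
escaped_pipe = re.compile(patterns['single'])
print(bool(escaped_pipe.search(r'a \| b')))
```

The raw triple-quoted literal is only there so that the YAML text can live inside a Python source file; when the YAML is read from a real file, no Python escaping is involved at all.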
While asking this question, I realized I didn't know much about raw strings. For somebody claiming to be a Django trainer, this sucks.
I know what an encoding is, and I know what u'' alone does since I get what is Unicode.
But what does r'' do exactly? What kind of string does it result in?
And above all, what the heck does ur'' do?
Finally, is there any reliable way to go back from a Unicode string to a simple raw string?
Ah, and by the way, if your system and your text editor charset are set to UTF-8, does u'' actually do anything?
There's not really any "raw string"; there are raw string literals, which are exactly the string literals marked by an 'r' before the opening quote.
A "raw string literal" is a slightly different syntax for a string literal, in which a backslash, \, is taken as meaning "just a backslash" (except when it comes right before a quote that would otherwise terminate the literal) -- no "escape sequences" to represent newlines, tabs, backspaces, form-feeds, and so on. In normal string literals, each backslash must be doubled up to avoid being taken as the start of an escape sequence.
This syntax variant exists mostly because the syntax of regular expression patterns is heavy with backslashes (but never at the end, so the "except" clause above doesn't matter) and it looks a bit better when you avoid doubling up each of them -- that's all. It also gained some popularity to express native Windows file paths (with backslashes instead of regular slashes like on other platforms), but that's very rarely needed (since normal slashes mostly work fine on Windows too) and imperfect (due to the "except" clause above).
r'...' is a byte string (in Python 2.*), ur'...' is a Unicode string (again, in Python 2.*), and any of the other three kinds of quoting also produces exactly the same types of strings (so for example r'...', r'''...''', r"...", r"""...""" are all byte strings, and so on).
Not sure what you mean by "going back" - there is no intrinsically back and forward directions, because there's no raw string type, it's just an alternative syntax to express perfectly normal string objects, byte or unicode as they may be.
And yes, in Python 2.*, u'...' is of course always distinct from just '...' -- the former is a unicode string, the latter is a byte string. What encoding the literal might be expressed in is a completely orthogonal issue.
E.g., consider (Python 2.6):
>>> sys.getsizeof('ciao')
28
>>> sys.getsizeof(u'ciao')
34
The Unicode object of course takes more memory space (very small difference for a very short string, obviously ;-).
There are two types of string in Python 2: the traditional str type and the newer unicode type. If you type a string literal without the u in front you get the old str type which stores 8-bit characters, and with the u in front you get the newer unicode type that can store any Unicode character.
The r doesn't change the type at all, it just changes how the string literal is interpreted. Without the r, backslashes are treated as escape characters. With the r, backslashes are treated as literal. Either way, the type is the same.
ur is of course a Unicode string where backslashes are literal backslashes, not part of escape codes.
You can try to convert a Unicode string to an old string using the str() function, but if there are any unicode characters that cannot be represented in the old string, you will get an exception. You could replace them with question marks first if you wish, but of course this would cause those characters to be unreadable. It is not recommended to use the str type if you want to correctly handle unicode characters.
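For readers on Python 3, where the text type is str and the byte type is bytes, the same trade-off can be sketched with str.encode(). This is only an illustration of the idea, not the Python 2 str() call described above, and the sample word is arbitrary:

```python
text = 'Alem\u00e3o'  # 'Alemão', contains one non-ASCII character

# Strict encoding fails when a character has no ASCII representation.
try:
    text.encode('ascii')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# errors='replace' substitutes a question mark, losing the original character.
lossy = text.encode('ascii', errors='replace')

print(strict_ok)  # False
print(lossy)      # b'Alem?o'
```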
A raw string literal is stored as it appears. For example, in r'\n' the \ is just a backslash rather than the start of an escape sequence.
Let me explain it simply:
In Python 2, you can store strings in two different types.
The first is str, a byte string: each character takes 1 byte, so it can represent 256 distinct values (enough for the English alphabet and simple symbols).
The second is unicode, which can store text in any language.
By default, Python 2 gives you a str, but if you want to store a string as unicode you can put u in front of the literal, as in u'text', or call unicode('text').
So for plain ASCII text, u'text' gives the same result as converting the str with unicode('text'). That's it!
Now the r part, you put it in front of the text to tell the computer that the text is raw text, backslash should not be an escaping character. r'\n' will not create a new line character. It's just plain text containing 2 characters.
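A short check of this in Python 3 (the answer above is about Python 2, but raw literals behave the same way in this respect):

```python
escaped = "\\n"  # ordinary literal: the backslash is written doubled
raw = r"\n"      # raw literal: the backslash is written once

print(escaped == raw)              # True
print(type(escaped) is type(raw))  # True
print(len(raw))                    # 2
print(len("\n"))                   # 1
```

Both spellings produce the same two-character string object type; only the source notation differs.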
If you want to convert str to unicode and also put raw text in there, use ur because ru will raise an error.
NOW, the important part:
You cannot store a single backslash using r; it is the one exception.
So this code produces an error: r'\'
To store exactly one backslash you need to use '\\'
If you want to store more than one backslash you can still use r: r'\\' produces 2 backslashes, as you would expect.
The reason r doesn't work for a lone backslash is that, even in a raw literal, a backslash immediately before the closing quote stops that quote from ending the literal, so r'\' is left unterminated. It is a consequence of how string literals are tokenized, not a bug.
A "u" prefix denotes the value has type unicode rather than str.
Raw string literals, with an "r" prefix, escape any escape sequences within them, so len(r"\n") is 2. Because they escape escape sequences, you cannot end a string literal with a single backslash: that's not a valid escape sequence (e.g. r"\").
"Raw" is not part of the type, it's merely one way to represent the value. For example, "\\n" and r"\n" are identical values, just like 32, 0x20, and 0b100000 are identical.
You can have unicode raw string literals:
>>> u = ur"\n"
>>> print type(u), len(u)
<type 'unicode'> 2
The source file encoding just determines how to interpret the source file, it doesn't affect expressions or types otherwise. However, it's recommended to avoid code where an encoding other than ASCII would change the meaning:
Files using ASCII (or UTF-8, for Python 3.0) should not have a coding cookie. Latin-1 (or UTF-8) should only be used when a comment or docstring needs to mention an author name that requires Latin-1; otherwise, using \x, \u or \U escapes is the preferred way to include non-ASCII data in string literals.
Unicode string literals
Unicode string literals (string literals prefixed by u) are no longer used in Python 3. They are still valid but just for compatibility purposes with Python 2.
Raw string literals
If you want to create a string literal consisting of only easily typable characters like English letters or numbers, you can simply type them: 'hello world'. But if you also want to include some more exotic characters, you'll have to use a workaround.
One such workaround is escape sequences. This way you can, for example, represent a new line in your string simply by adding the two easily typable characters \n to your string literal. So when you print the 'hello\nworld' string, the words will be printed on separate lines. That's very handy!
On the other hand, sometimes you might want to include the actual characters \ and n into your string – you might not want them to be interpreted as a new line. Look at these examples:
'New updates are ready in c:\windows\updates\new'
'In this lesson we will learn what the \n escape sequence does.'
In such situations you can just prefix the string literal with the r character like this: r'hello\nworld' and no escape sequences will be interpreted by Python. The string will be printed exactly as you created it.
Raw string literals are not completely "raw"?
Many people expect the raw string literals to be raw in a sense that "anything placed between the quotes is ignored by Python". That is not true. Python still recognizes all the escape sequences, it just does not interpret them - it leaves them unchanged instead. It means that raw string literals still have to be valid string literals.
From the lexical definition of a string literal:
string ::= "'" stringitem* "'"
stringitem ::= stringchar | escapeseq
stringchar ::= <any source character except "\" or newline or the quote>
escapeseq ::= "\" <any source character>
It is clear that string literals (raw or not) containing a bare quote character: 'hello'world' or ending with a backslash: 'hello world\' are not valid.
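One way to see this rule in action is to feed candidate literals to the built-in compile() and check which ones are rejected. This is a minimal Python 3 sketch; is_valid_literal is a hypothetical helper name, and the backslash is built with chr(92) only to keep the example itself free of escaping puzzles:

```python
BACKSLASH = chr(92)  # built this way to avoid escaping confusion in this sketch


def is_valid_literal(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, '<literal>', 'eval')
    except SyntaxError:
        return False
    return True


lone = "r'" + BACKSLASH + "'"                 # source text: r'\'
trailing = "r'hello world" + BACKSLASH + "'"  # source text: r'hello world\'
doubled = "r'" + BACKSLASH * 2 + "'"          # source text: r'\\'
quoted = "r'" + BACKSLASH + "''"              # source text: r'\''

print(is_valid_literal(lone))           # False
print(is_valid_literal(trailing))       # False
print(is_valid_literal(doubled))        # True
print(len(eval(doubled)))               # 2
print(eval(quoted) == BACKSLASH + "'")  # True
```

The last line shows the other half of the rule: in a raw literal the backslash before a quote keeps the literal open, and both the backslash and the quote end up in the value.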
Maybe this is obvious, maybe not, but you can make the string '\' by calling x=chr(92)
x=chr(92)
print type(x), len(x) # <type 'str'> 1
y='\\'
print type(y), len(y) # <type 'str'> 1
x==y # True
x is y # identity is an implementation detail (CPython may cache one-character strings); don't rely on it
I am trying to add single quotes in dumping the following yaml string:
yaml_str = 'Type: modified'
But the output includes double quotes which are not required.
Here is my code:
import sys
import ruamel.yaml
yaml = ruamel.yaml.YAML()
data = yaml.load(yaml_str)
data['Type'] = f"'{data['Type']}'"
yaml.dump(data, sys.stdout)
The output:
Type: "'modified'"
The expected output:
Type: 'modified'
Any ideas, please?
I tried all kinds of string formatting, nothing helped.
I also tried to add yaml.preserve_quotes = True which also didn't do any good.
Your expectation is completely wrong, so string formatting is not going to help you at all. YAML, like many other languages, needs to be able to handle scalars that have embedded quotes, and YAML has multiple ways to handle that:
if a string to be dumped into a scalar has special characters that need backslash escaping (e.g. the audible bell \a), the scalar needs to be between double quotes (and double quotes in the string escaped in the scalar)
if a string to be dumped into a scalar has no special characters, but starts with a double quote, the whole scalar can be single quoted (and any existing single quotes in the string, will need to be duplicated '' in the scalar)
If you want to force single quotes in ruamel.yaml, even if they are superfluous, you can use:
data['Type'] = ruamel.yaml.scalarstring.SingleQuotedScalarString(data['Type'])
although the much better solution would be to get rid of the program that reads your output file and requires the unnecessary quotes to be there in the first place.
Please note that having quotes in a string doesn't necessarily require the corresponding scalar to have quotes. E.g. a string that has no spaces and a quote somewhere between normal readable characters can be dumped without (extra) quotes.
I am reading rows from an Excel file and dumping them to a YAML file. After dumping, I noticed that some rows are written in single quotes, some in double quotes, and some as plain text.
Data without any special characters is created as plain text.
Data with a \n character or parentheses is created as 'Data here'
Data with special characters is created as "Data here"
I am using yaml dumper to create YAML file
with open(myprops['output'], "w") as f:
    ruamel.yaml.dump(doc, f, Dumper=ruamel.yaml.RoundTripDumper, default_flow_style=False)
How to represent all data to be in single quote - 'Data here'?
You can force the dumper to use single quotes, when the scalar can be represented
using single quoted strings by providing the default_style="'" parameter.
This is not guaranteed to get you single quotes though, single quotes cannot do
the escape sequences that double quotes have (i.e. it is not like Python) and
some values might still get double quotes.
Using ruamel.yaml's new API (where round-trip-dumping is the default):
import sys
import ruamel.yaml
data = [
"25",
"with an\n embedded newline",
"entry with single quote: (')",
42
]
yaml = ruamel.yaml.YAML()
yaml.default_style = "'"
yaml.dump(data, sys.stdout)
which gives:
- '25'
- "with an\n embedded newline"
- 'entry with single quote: ('')'
- !!int '42'
Please note that in order to recognise 42 as an integer, because of
the quotes, that scalar needs to be tagged. The same holds for the
other special types YAML can represent (float, booleans, etc.) If you
don't want that make sure all the values you dump are strings.
You can also see the one escape mechanism single quoted scalars in YAML have:
a single quote in the scalar is doubled. (And if it had been at the end of the
Python string, you would have three single quotes in a row at the end of the scalar.)
If you want consistency in your quoting, you should use double quotes, as that can represent all valid characters. Single quoted scalars in YAML can span multiple lines, so in principle it is possible to embed a newline. But there are restrictions on whitespace around the newline.
If you have a mix of string and non-string values in your input data, and you don't want to get the non-strings quoted, then you have to recurse over the data structure and replace each string x with ruamel.yaml.scalarstring.SingleQuotedScalarString(x), that is the
internal representation that ruamel.yaml uses if you specify yaml.preserve_quotes = True to distinguish single quoted input from plain/double/literal/folded scalars.
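The same default_style idea can be tried with PyYAML, whose safe_dump() accepts a default_style argument as well. This sketch assumes PyYAML is installed and that it behaves like ruamel.yaml on this point; it is not the ruamel.yaml API shown above, and the sample strings are made up:

```python
import yaml  # PyYAML, assumed installed; ruamel.yaml is not used in this sketch

data = ["25", "entry with single quote: (')", "plain text"]

# Ask the dumper to prefer single-quoted scalars wherever they are possible.
text = yaml.safe_dump(data, default_style="'")
print(text)

# Loading the dump again should give back the original strings,
# including the one whose embedded quote was doubled on output.
restored = yaml.safe_load(text)
print(restored == data)
```

Checking the round trip is a cheap way to confirm that forcing a quoting style has not changed any values.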
I have this YAML:
---
test: {"gender":0,"nacionality":"Alem\u00e3o"}
I am reading it using python 3.5 as follow:
with open('teste.yaml', 'r') as stream:
    doc = yaml.load_all(stream)
    for line in doc:
        print(line)
This is the result I get:
{'test': {'gender': 0, 'nacionality': 'Alemão'}}
But If I change " for ' in my YAML, I get this:
{'test': {'nacionality': 'Alem\\u00e3o', 'gender': 0}}
As you can see, when I use " the escape \u00e3 in Alem\u00e3o is decoded to ã, but with ' it is not.
So I have two questions:
Why do I get different outputs when I use ' and "?
What can I do to get the output as Alem\\u00e3o when using "?
That's how the YAML data format is defined. Within double quotes, specific escape sequences are interpreted. Within single quotes, they're not.
7.3.1. Double-Quoted Style
The double-quoted style is specified by surrounding “"” indicators. This is the only style capable of expressing arbitrary strings, by using “\” escape sequences. This comes at the cost of having to escape the “\” and “"” characters.
http://yaml.org/spec/1.2/spec.html#id2787109
What can I do to get the output as Alem\u00e3o when using "?
Escape the escape character:
test: {"gender":0,"nacionality":"Alem\\u00e3o"}
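The difference between the quoting styles can be checked side by side. This is a minimal sketch assuming Python 3 and PyYAML; the key names are invented for the illustration:

```python
import yaml  # PyYAML, assumed to be installed

# Raw literal so that Python passes the backslashes through to YAML untouched.
document = r'''
double: "Alem\u00e3o"
single: 'Alem\u00e3o'
plain: Alem\u00e3o
escaped: "Alem\\u00e3o"
'''

data = yaml.safe_load(document)
literal = 'Alem' + chr(92) + 'u00e3o'  # backslash kept as an ordinary character

print(data['double'] == 'Alem\u00e3o')  # escape decoded to a-tilde
print(data['single'] == literal)        # no escape processing
print(data['plain'] == literal)         # no escape processing
print(data['escaped'] == literal)       # doubled backslash yields one backslash
```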
Backslash escaping in YAML is only available in double-quoted scalars, not in single-quoted scalars, unquoted (plain) scalars, or (literal) block scalars.
To get the output you want, the best way is to drop the quotes altogether and use this as input:
---
test: {gender: 0, nacionality: Alem\u00e3o}
Your program however is up for some improvement.
You should never use load_all() or load() on this kind of non-tagged YAML. That is unsafe and can lead to arbitrary code being executed on your machine if you don't have complete control over the source YAML. Newer versions of ruamel.yaml will throw a warning if you don't explicitly specify the unsafe Loader as an argument. Do yourself a favour and get into the habit of using safe_load() and safe_load_all().
load_all() gives back an iterator over documents, so doc and line are misleading variable names. You should use:
import ruamel.yaml as yaml
with open('teste.yaml', 'r') as stream:
    for doc in yaml.safe_load_all(stream):
        print(doc)
or if there is always just one document in teste.yaml you can simplify that to:
import ruamel.yaml as yaml
with open('teste.yaml') as stream:
    print(yaml.safe_load(stream))
both of which will give you:
{'test': {'gender': 0, 'nacionality': 'Alem\\u00e3o'}}
Please note that it is mandatory in YAML to have a space after the : separating key and value in a mapping. Only for compatibility with JSON is it allowed to drop the space assuming the key is quoted (double and single quotes both work). So this works as input as well:
---
test: {"gender":0, 'nacionality':Alem\u00e3o}
I want to parse yaml documents like the following
meta-info-1: val1
meta-info-2: val2
---
Plain text/markdown content!
jhaha
If I load_all this with PyYAML, I get the following
>>> list(yaml.load_all(open('index.yml')))
[{'meta-info-1': 'val1', 'meta-info-2': 'val2'}, 'Plain text/markdown content! jhaha']
What I am trying to achieve here is that the yaml file should contain two documents, and the second one is supposed to be interpreted as a single string document, more specifically any large body of text with markdown formatting. I don't want it to be parsed as YAML syntax.
In the above example, PyYAML returns the second document as a single string. But if the second document has a : character in place of the ! for instance, I get a syntax error. This is because PyYAML is parsing the stuff in that document.
Is there a way I can tell PyYAML that the second document is just a raw string and not to parse it?
Edit: A few excellent answers there. While using quotes or the literal syntax solves the said problem, I'd like the users to be able to write the plain text without any extra cruft. Just the three -'s (or .'s) and write away a large body of plain text. Which might also include quotes too. So, I'd like to know if I can tell PyYAML to parse only one document, and give the second to me raw.
Edit 2: So, adapting agf's idea, but without the try/except, since the second document could happen to be valid YAML syntax:
with open(filename) as f:
    # split only at the first document separator
    config_content, body_content = f.read().split('\n---', 1)
config = yaml.safe_load(config_content)
body = body_content  # keep the body as a raw string, unparsed
Thanks agf.
You can do
import yaml

raw = open(filename).read()
docs = []
for raw_doc in raw.split('\n---'):
    try:
        docs.append(yaml.safe_load(raw_doc))
    except yaml.YAMLError:  # PyYAML reports parse failures as YAMLError, not SyntaxError
        docs.append(raw_doc)
if you don't have control over the format of the original document.
From the PyYAML docs,
Double-quoted is the most powerful style and the only style that can express any scalar value. Double-quoted scalars allow escaping. Using escaping sequences \x** and \u****, you may express any ASCII or Unicode character.
So it sounds like there is no way to represent an arbitrary scalar in the parsing if it's not double quoted.
If all you want is to escape the colon character in YAML, then enclose it within single or double quotes. Also, you can try literal style for your second document which should be treated as single scalar.
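To illustrate the literal-style suggestion, here is a minimal sketch assuming Python 3 and PyYAML. The body text is made up, and note that it has to be indented, which is exactly the kind of extra markup the asker wanted to avoid:

```python
import yaml  # PyYAML, assumed to be installed

text = """\
meta-info-1: val1
meta-info-2: val2
--- |
  Plain text: with a colon, "quotes" and # hashes
  second line
"""

# The second document is a single literal block scalar, so its content
# is returned as one string and is not interpreted as YAML syntax.
docs = list(yaml.safe_load_all(text))
config, body = docs

print(config)
print(body)
```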