Loading document as raw string in yaml with PyYAML - python

I want to parse yaml documents like the following
meta-info-1: val1
meta-info-2: val2
---
Plain text/markdown content!
jhaha
If I load_all this with PyYAML, I get the following
>>> list(yaml.load_all(open('index.yml')))
[{'meta-info-1': 'val1', 'meta-info-2': 'val2'}, 'Plain text/markdown content! jhaha']
What I am trying to achieve here is that the yaml file should contain two documents, and the second one is supposed to be interpreted as a single string document, more specifically any large body of text with markdown formatting. I don't want it to be parsed as YAML syntax.
In the above example, PyYAML returns the second document as a single string. But if the second document has a : character in place of the ! for instance, I get a syntax error. This is because PyYAML is parsing the stuff in that document.
Is there a way I can tell PyYAML that the second document is a just a raw string and not to parse it?
Edit: A few excellent answers there. While using quotes or the literal syntax solves the said problem, I'd like users to be able to write the plain text without any extra cruft: just the three -'s (or .'s), then write away a large body of plain text, which might also include quotes. So I'd like to know if I can tell PyYAML to parse only the first document, and give the second one to me raw.
Edit 2: So, adapting agf's idea. Instead of using a try/except (since the second document could itself be valid YAML syntax), I parse only the first part and keep the second raw:
config_content, body_content = open(filename).read().split('\n---', 1)
config = yaml.safe_load(config_content)
body = body_content
Thanks agf.

You can do
raw = open(filename).read()
docs = []
for raw_doc in raw.split('\n---'):
    try:
        docs.append(yaml.safe_load(raw_doc))
    except yaml.YAMLError:
        docs.append(raw_doc)
if you won't have control over the format of the original document.
From the PyYAML docs,
Double-quoted is the most powerful style and the only style that can express any scalar value. Double-quoted scalars allow escaping. Using escaping sequences \x** and \u****, you may express any ASCII or Unicode character.
So it sounds like there is no way to represent an arbitrary scalar unless it is double-quoted.
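Putting the split idea together with safe_load, a minimal sketch of parsing only the front matter and keeping the body raw (assuming PyYAML is installed; the file contents are inlined as a string here for illustration):

```python
import yaml  # PyYAML, assumed available

raw = """\
meta-info-1: val1
meta-info-2: val2
---
Plain text: with a colon that would normally trip the YAML parser.
"""

# Split on the first document separator; parse only the front matter
# and keep the body as a raw, unparsed string.
config_text, body = raw.split('\n---\n', 1)
config = yaml.safe_load(config_text)

print(config)  # {'meta-info-1': 'val1', 'meta-info-2': 'val2'}
print(body)
```

Because the body is never handed to the YAML parser, colons, quotes, and markdown in it cannot cause a parse error.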

If all you want is to escape the colon character in YAML, enclose the value in single or double quotes. Alternatively, you can use literal style (|) for your second document, which makes it be treated as a single scalar.
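For illustration, a literal block scalar passes colons, quotes, and markdown through untouched; a small sketch assuming PyYAML:

```python
import yaml  # PyYAML, assumed available

doc = """\
meta: front matter
body: |
  Any text: colons, 'quotes', and *markdown* formatting
  pass through untouched inside a literal block scalar.
"""

# The body's indented lines become one string; no YAML parsing happens inside.
data = yaml.safe_load(doc)
print(data['body'])
```

The trade-off is that the body must be indented under the key, which is exactly the "extra cruft" the asker wanted to avoid.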

Related

How to get rid of double quotes when dumping a string that includes single quotes

I am trying to add single quotes when dumping the following YAML string:
yaml_str = 'Type: modified'
But the output includes double quotes which are not required.
Here is my code:
import sys
import ruamel.yaml
yaml = ruamel.yaml.YAML()
data = yaml.load(yaml_str)
data['Type'] = f"'{data['Type']}'"
yaml.dump(data, sys.stdout)
The output:
Type: "'modified'"
The expected output:
Type: 'modified'
Any ideas, please?
I tried all kinds of string formatting, nothing helped.
I also tried to add yaml.preserve_quotes = True which also didn't do any good.
Your expectation is completely wrong, so string formatting is not going to help you at all. YAML, like many other languages, needs to be able to handle scalars that have embedded quotes, and YAML has multiple ways to handle that:
if a string to be dumped into a scalar has special characters that need backslash escaping (e.g. the audible bell \a), the scalar needs to be between double quotes (and any double quotes in the string escaped in the scalar)
if a string to be dumped into a scalar has no special characters but starts with a double quote, the whole scalar can be single-quoted (and any existing single quotes in the string will need to be duplicated, '', in the scalar)
If you want to force single quotes in ruamel.yaml, even if they are superfluous, you can use:
data['Type'] = ruamel.yaml.scalarstring.SingleQuotedScalarString(data['Type'])
although the much better solution would be to get rid of the program that reads your output file and requires the unnecessary quotes to be there in the first place.
Please note that having quotes in a string doesn't necessarily require the corresponding scalar to have quotes. E.g. a string that has no spaces and a quote somewhere between normal readable characters can be dumped without (extra) quotes.

How to write unicode characters to LaTeX document via pylatex

Assume the string myStr which contains three special characters.
myStr = "emdash —; delta Δ; thin space:  ;"
Further assume that we wish to write this string to a LaTeX document via pylatex.
If we write the string as is to a LaTeX document, errors occur during its compilation:
import pylatex
doc = pylatex.Document()
with doc.create(pylatex.Section('myStr -- not encoded')):
    doc.append(myStr)
doc.generate_pdf("myStr_notEncoded", clean_tex=False)
...
! Package inputenc Error: Unicode character Δ (U+0394)
(inputenc) not set up for use with LaTeX.
...
! Package inputenc Error: Unicode character   (U+2009)
(inputenc) not set up for use with LaTeX.
...
If we first encode the string via pylatexenc, the special characters are either represented by their respective LaTeX encoding (emdash, delta) or encoded in a way unclear to me (thin space).
import pylatexenc
from pylatexenc import latexencode
myStr_latex = pylatexenc.latexencode.unicode_to_latex(myStr)
doc = pylatex.Document()
with doc.create(pylatex.Section('myStr')):
    doc.append(myStr_latex)
doc.generate_pdf("myStr", clean_tex=False)
How do I have to write the string into the LaTeX document so that the special characters are printed as the actual characters when compiling with pdflatex?
Edit 1:
I also tried to change the default encoding inside the LaTeX document for the unencoded pathway but it results in a series of compilation errors as well.
doc.preamble.append(pylatex.NoEscape("\\usepackage[utf8]{inputenc}"))
You were close with your pylatexenc solution. When you encode LaTeX yourself, e.g. with pylatexenc.latexencode.unicode_to_latex(), you have to tell pylatex that the string should not be additionally escaped. To wit:
Using regular LaTeX strings may not be as simple as it seems though, because by default almost all strings are escaped[...] there are cases where raw LaTeX strings should just be used directly in the document. This is why the NoEscape string type exists. This is just a subclass of str, but it will not be escaped
In other words, to solve this, just make sure to use NoEscape to tell pylatex your string is already encoded as LaTeX and should not be encoded again:
import pylatex
from pylatexenc import latexencode
myStr_latex = latexencode.unicode_to_latex(myStr)
doc = pylatex.Document()
with doc.create(pylatex.Section('myStr')):
    doc.append(pylatex.utils.NoEscape(myStr_latex))
doc.generate_pdf("myStr", clean_tex=False)
Until a better solution has been found, a workable (although by no means preferable) solution would be to encode all the offending characters via the package newunicodechar:
doc.preamble.append(pylatex.Command('usepackage', 'newunicodechar'))
doc.preamble.append(pylatex.NoEscape(r'\newunicodechar{²}{\ensuremath{{}^2}}'))
doc.preamble.append(pylatex.NoEscape(r'\newunicodechar{ }{\,}'))
doc.preamble.append(pylatex.NoEscape(r"\newunicodechar{′}{'}"))
doc.preamble.append(pylatex.NoEscape(r'\newunicodechar{−}{\ensuremath{-}}'))
doc.preamble.append(pylatex.NoEscape(r'\newunicodechar{∶}{\ensuremath{:}}'))
doc.preamble.append(pylatex.NoEscape(r'\newunicodechar{≤}{\ensuremath{\leq}}'))
doc.preamble.append(pylatex.NoEscape(r'\newunicodechar{≥}{\ensuremath{\geq}}'))
doc.preamble.append(pylatex.NoEscape(r'\newunicodechar{α}{\ensuremath{\alpha}}'))
doc.preamble.append(pylatex.NoEscape(r'\newunicodechar{β}{\ensuremath{\beta}}'))
doc.preamble.append(pylatex.NoEscape(r'\newunicodechar{Δ}{\ensuremath{\Delta}}')) ## (U+0394)
doc.preamble.append(pylatex.NoEscape(r'\newunicodechar{∆}{\ensuremath{\Delta}}')) ## (U+2206)
doc.preamble.append(pylatex.NoEscape(r'\newunicodechar{λ}{\ensuremath{\lambda}}'))
doc.preamble.append(pylatex.NoEscape(r'\newunicodechar{μ}{\ensuremath{\mu}}'))

How to get r'\\\|' from a yaml file

I am using a yaml file to store some config data, including some (many) regex strings that I don't want to keep in code. Everything works except when I try to search for some incorrectly escaped pipe characters with r'\\\|'. I tried quoted, unquoted and literal strings in YAML; nothing worked. YAML and Python string escape rules together seem to conspire to keep the number of backslashes in a string even. I open and load the file with
f=open(file_path, 'r', encoding='utf_8')
python_dict = yaml.load(f)
I'd like to reproduce
re.compile('\\\|')
using something like
re.compile(python_dict['escaped-pipes'])
Python 3.4 with PyYAML.
In YAML, \ is a special character only in double-quoted strings. The Python string r'\\\|' is a raw string, so it consists of three backslashes and a pipe. You have the following options to encode this in a YAML document:
plain: \\\|       # plain scalar; YAML does not process the backslashes
single: '\\\|'    # single-quoted scalar; YAML also does not process the backslashes
double: "\\\\\\|" # double-quoted scalar; YAML does process escape sequences here,
                  # so you need to double the backslashes
Single-quoted scalars are kind-of the YAML equivalent of Python's raw strings.
You may have other characters in your regexes that may interfere with YAML's syntax. You may want to use block scalars instead, which treat all characters as content and are ended via indentation:
block: |-
  \\\|
next block: |-
  \\\|
| starts a literal block scalar, - causes the final linebreak (before the next item) to not be part of the scalar, which is what you want.
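The claim that all three styles yield the same four-character string can be checked with a short sketch (assuming PyYAML is installed):

```python
import re
import yaml  # PyYAML, assumed available

doc = r"""
plain: \\\|
single: '\\\|'
block: |-
  \\\|
"""
data = yaml.safe_load(doc)

# All three styles yield the same four characters:
# three backslashes followed by a pipe.
pattern = re.compile(data['block'])  # matches a literal '\' followed by '|'
print(bool(pattern.search(r'foo \| bar')))  # True
```

Only a double-quoted scalar would need the backslashes doubled, since it is the only style in which YAML interprets escape sequences.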

Yaml load converting string to UTF8?

I have this YAML:
---
test: {"gender":0,"nacionality":"Alem\u00e3o"}
I am reading it using python 3.5 as follow:
with open('teste.yaml', 'r') as stream:
    doc = yaml.load_all(stream)
    for line in doc:
        print(line)
This is the result I get:
{'test': {'gender': 0, 'nacionality': 'Alemão'}}
But If I change " for ' in my YAML, I get this:
{'test': {'nacionality': 'Alem\\u00e3o', 'gender': 0}}
As you can see, when I use " the escape sequence in Alem\\u00e3o is converted to the character ã, but with ' it is not.
So I have two questions:
Why do I get different outputs when I use ' and "?
What can I do to get the output as Alem\\u00e3o when using "?
That's how the YAML data format is defined. Within double quotes, specific escape sequences are interpreted. Within single quotes, they're not.
7.3.1. Double-Quoted Style
The double-quoted style is specified by surrounding “"” indicators. This is the only style capable of expressing arbitrary strings, by using “\” escape sequences. This comes at the cost of having to escape the “\” and “"” characters.
http://yaml.org/spec/1.2/spec.html#id2787109
What can I do to get the output as Alem\u00e3o when using "?
Escape the escape character:
test: {"gender":0,"nacionality":"Alem\\u00e3o"}
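A quick sketch of both behaviours side by side (assuming PyYAML is installed):

```python
import yaml  # PyYAML, assumed available

double = yaml.safe_load(r'val: "Alem\u00e3o"')  # escape sequence is processed
single = yaml.safe_load(r"val: 'Alem\u00e3o'")  # backslash is taken literally

print(double['val'])  # Alemão
print(single['val'])  # Alem\u00e3o
```

So doubling the backslash inside double quotes ("Alem\\u00e3o") leaves a literal backslash in the result, which is exactly what the asker wants.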
Backslash escaping in YAML is only available in double-quoted scalars. Not in single-quoted scalars, unquoted scalars, nor (literal) block scalars.
To get the output as you wish, the best way is to drop the quotes all together and use this as input:
---
test: {gender: 0, nacionality: Alem\u00e3o}
Your program however is up for some improvement.
you should never use load_all() or load() on this kind of non-tagged YAML. That is unsafe and can lead to arbitrary code being executed on your machine if you don't have complete control over the source YAML. Newer versions of ruamel.yaml will throw a warning if you don't explicitly specify the unsafe Loader as an argument. Do yourself a favour and get into the habit of using safe_load() and safe_load_all().
load_all() gives back an iterator over documents, so doc and line are misleading variable names. You should use:
import ruamel.yaml as yaml
with open('teste.yaml', 'r') as stream:
    for doc in yaml.safe_load_all(stream):
        print(doc)
or if there is always just one document in teste.yaml you can simplify that to:
import ruamel.yaml as yaml
with open('teste.yaml') as stream:
    print(yaml.safe_load(stream))
both of which will give you:
{'test': {'gender': 0, 'nacionality': 'Alem\\u00e3o'}}
Please note that it is mandatory in YAML to have a space after the : separating key and value in a mapping. Only for compatibility with JSON is it allowed to drop the space, assuming the key is quoted (double and single quotes both work). So this works as input as well:
---
test: {"gender":0, 'nacionality':Alem\u00e3o}

Replace numeric character references in XML document using Python

I am struggling with the following issue: I have an XML string that contains the following tag and I want to convert this, using cElementTree, to a valid XML document:
<tag>&#55296;&#57136;&#55296;&#57149;&#55296;&#57139;&#55296;&#57136;&#55296;&#57151;&#55296;&#57154;&#55296;&#57136;</tag>
This is a unicode string and the encoding is UTF-8. I want to discard these numeric character references because they are not legal XML in a valid XML document (see Parser error using Perl XML::DOM module, "reference to invalid character number")
I have tried different regular expressions to match these numeric character references. For example, I have tried the following (Python) regex:
RE_NUMERIC_CHARACTER = re.compile('&#[\d{1,5}]+;')
This does work in a regular Python session, but as soon as I use the same regex in my code it doesn't work, presumably because those numeric characters have been interpreted (and are shown as boxes or question marks).
I have also tried the unescape function from http://effbot.org/zone/re-sub.htm but that does not work either.
Thus: how can I match, using a regular expression in Python, these numeric character references and create a valid XML document?
Eurgh. You've got surrogates (UTF-16 code units in the range D800-DFFF), which some fool has incorrectly encoded individually instead of using a pair of code units for a single character. It would be ideal to replace this mess with what it should look like:
<tag>&#66352;&#66365;&#66355;&#66352;&#66367;&#66370;&#66352;</tag>
Or, just as valid, in literal characters (if you've got a font that can display the Gothic alphabet):
<tag>𐌰𐌽𐌳𐌰𐌿𐍂𐌰</tag>
Usually, it would be best to do replacement operations like this on parsed text nodes, to avoid messing up non-character-reference sequences in other places like comments or PIs. However of course that's not possible in this case since this isn't really XML at all. You could try to fix it up with a crude regex, though it would be better to find out where the invalid input is coming from and kick the person responsible until they fix it.
>>> def lenient_deccharref(m):
...     return unichr(int(m.group(1)))
...
>>> tag = '<tag>&#55296;&#57136;&#55296;&#57149;&#55296;&#57139;&#55296;&#57136;&#55296;&#57151;&#55296;&#57154;&#55296;&#57136;</tag>'
>>> re.sub(r'&#(\d+);', lenient_deccharref, tag).encode('utf-8')
'<tag>\xf0\x90\x8c\xb0\xf0\x90\x8c\xbd\xf0\x90\x8c\xb3\xf0\x90\x8c\xb0\xf0\x90\x8c\xbf\xf0\x90\x8d\x82\xf0\x90\x8c\xb0</tag>'
This is the correct UTF-8 encoding of 𐌰𐌽𐌳𐌰𐌿𐍂𐌰. The utf-8 codec allows you to encode a sequence of surrogates to correct UTF-8 even on a wide-Unicode platform where the surrogates should not have appeared in the string in the first place.
>>> _.decode('utf-8')
u'<tag>\U00010330\U0001033d\U00010333\U00010330\U0001033f\U00010342\U00010330</tag>'
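The snippet above is Python 2 (unichr, and a str.encode that tolerates surrogates). On Python 3, surrogates cannot be UTF-8-encoded directly; one way to sketch the same repair is to pair the surrogates back up via a UTF-16 round trip (a hypothetical adaptation, not from the original answer; shortened to the first two characters for brevity):

```python
import re

def lenient_deccharref(m):
    # chr() on Python 3 happily produces lone surrogate code points
    return chr(int(m.group(1)))

tag = '<tag>&#55296;&#57136;&#55296;&#57149;</tag>'

# Replace each decimal character reference with its (surrogate) code unit,
# then pair the surrogates back into real astral characters via UTF-16.
with_surrogates = re.sub(r'&#(\d+);', lenient_deccharref, tag)
repaired = with_surrogates.encode('utf-16', 'surrogatepass').decode('utf-16')

print(repaired.encode('utf-8'))  # b'<tag>\xf0\x90\x8c\xb0\xf0\x90\x8c\xbd</tag>'
```

The 'surrogatepass' error handler lets the lone surrogates through the encoder, and the UTF-16 decoder then combines each high/low pair into a single supplementary-plane character.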
