Dividing a .yml file up - python

I need to break down .yml files into 3 parts: Header, Working (the part I need to edit), and footer. The header is everything that comes before the 'Resource: ' block, and the footer is everything after the resource block. I essentially need to create code that creates 3 lists, dictionaries, strings, whatever works, that holds these three sections of the YAML file, then allows me to run more code against the working piece, then concatenate all of them together at the end and produce the new document with the same indentations. No changes should be made to the header or the tail.
Note: I've looked up everything about yaml parsing and whatnot, but cannot seem to implement the advice I've found effectively. A solution that does not involve importing yaml would be preferred, but if you must, please explain what is really going on with the import yaml code so I can understand what I'm messing up.

Files that contain one or more YAML documents (in short: a YAML file, which since
Sept. 2006 has been recommended to have the extension .yaml) are
text files and can be concatenated from parts as such. The only requirement is
that in the end you have a text file that is a valid YAML file.
The easiest approach is of course to keep the header and footer in separate
files, but as you are talking about multiple YAML files this
soon becomes unwieldy. It is, however, always possible to do some basic
parsing of the file contents.
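If you do go the separate-files route, reassembly is plain text concatenation. A minimal sketch (the file names header.yaml, working.yaml and footer.yaml are assumptions for illustration, not from the question):

from pathlib import Path

# concatenate the three text parts back into one YAML file;
# whatever edits you make to working.yaml happen in between
parts = [Path(name).read_text() for name in ('header.yaml', 'working.yaml', 'footer.yaml')]
Path('combined.yaml').write_text(''.join(parts))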
Your Working part starts with Resource:, and you indicate 3
lists or dictionaries (you cannot have three strings at the root of a
YAML document). So the root-level data structure of your YAML document
needs to be either a mapping, in which case everything except the
keys of that mapping needs to be indented (in theory it only needs to
be indented more, but in practice this almost always means that the
keys are not indented), like (m.yaml):
# header
a: 1
b:
- 2
- c: 3       # end of header
Resource:
# footer
c:
  d: "the end"  # really
or the root level needs to be a sequence (s.yaml):
# header
- a: 1
  b:
  - 2
  - c: 3
- 42         # end of header
- Resource:
# footer
- c:
    d: "the end"  # really
Both can easily be split without loading the YAML; here is example code for doing so for
the file with the root-level mapping:
from pathlib import Path
from ruamel.yaml import YAML

inf = Path('m.yaml')
header = []    # list of lines
resource = []
footer = []
for line in inf.open():
    if not resource:
        if line.startswith('Resource:'):  # check if we are at the end of the header
            resource.append(line)
            continue
        header.append(line)
        continue
    elif not footer:
        if not line or line[0] == ' ':  # still in the resource part
            resource.append(line)
            continue
    footer.append(line)
# you now have lists of lines for the header and the footer
# define the new data structure for the resource; this is going to be a single-key dict
upd_resource = dict(Resource=['some text', 'for the resource spec', {'a': 1, 'b': 2}])
# write the header lines, dump the resource, write the footer lines
outf = Path('out.yaml')
with outf.open('w') as out:
    out.write(''.join(header))
    yaml = YAML()
    yaml.indent(mapping=2, sequence=2, offset=0)  # the default values
    yaml.dump(upd_resource, out)
    out.write(''.join(footer))
print(outf.read_text())
this gives:
# header
a: 1
b:
- 2
- c: 3       # end of header
Resource:
- some text
- for the resource spec
- a: 1
  b: 2
# footer
c:
  d: "the end"  # really
Doing the same while parsing the YAML file is not more difficult. The following automatically handles
both cases (whether the root level is a mapping or a sequence):
from pathlib import Path
from ruamel.yaml import YAML

inf = Path('s.yaml')
upd_resource_val = ['some text', 'for the resource spec', {'a': 1, 'b': 2}]
outf = Path('out.yaml')
yaml = YAML()
yaml.indent(mapping=2, sequence=2, offset=0)
yaml.preserve_quotes = True
data = yaml.load(inf)
if isinstance(data, dict):
    data['Resource'] = upd_resource_val
else:  # assume a list: search for the item that has as value a dict with key Resource
    for item in data:
        try:
            if 'Resource' in item:
                item['Resource'] = upd_resource_val
                break
        except TypeError:
            pass
yaml.dump(data, outf)
This creates the following out.yaml:
# header
- a: 1
  b:
  - 2
  - c: 3
- 42         # end of header
- Resource:
  - some text
  - for the resource spec
  - a: 1
    b: 2
# footer
- c:
    d: "the end"  # really
If the m.yaml file had been the input, the output would have
been exactly the same as with the text-based "concatenation" example code.
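If you want to convince yourself of that (my addition, assuming you save the two results under different, hypothetical names, say out_text.yaml and out_parsed.yaml):

from pathlib import Path

# both approaches should produce byte-identical output for m.yaml
assert Path('out_text.yaml').read_text() == Path('out_parsed.yaml').read_text()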

Related

ruamel.yaml: Preserve comments and blank lines when dumping a class that was previously loaded from yaml

I have a class that I wish to load/modify/dump from and to YAML while preserving comments and general formatting, using ruamel.yaml (0.17.21).
My issue is that after a yaml --> python --> yaml round-trip, some comments disappear, some inline comments get put on their own line, and some blank lines (which are comments in ruamel.yaml, I believe) are missing.
I'm not sure if I'm doing something wrong, or if this should be a bug report.
Here's a minimal working example:
import sys
from ruamel.yaml import YAML, yaml_object

yaml = YAML()

@yaml_object(yaml)
class ExampleClass():
    def __init__(self, subentries):
        if 'subentry_0' not in subentries:
            raise AssertionError
        for k, v in subentries.items():
            setattr(self, k, v)
    # Here I can also define a `__setstate__` method that calls the init for me
    # But it doesn't change much

source = """
# top-level comment
entry: !ExampleClass # entry inline comment
    subentry_0: 0
    subentry_1: 1 # subentry inline comment

    # separation comment
    subentry_2: 2

entry 2: |
    This is a long
    text
    entry
"""

a = yaml.load(source)
yaml.dump(a, sys.stdout)
Outputs:
# top-level comment
entry: !ExampleClass
# entry inline comment
  subentry_0: 0
  subentry_1: 1
  subentry_2: 2
entry 2: |
  This is a long
  text
  entry
Where some funky stuff happened to the comments and blank lines.
If I initialize my class via a['entry'].__init__(a['entry'].__dict__), I also lose most comments and blank lines, but it looks better:
# top-level comment
entry: !ExampleClass
  subentry_0: 0
  subentry_1: 1
  subentry_2: 2
entry 2: |
  This is a long
  text
  entry
For blank lines, it'd be acceptable to me to just strip them all and then insert blank lines back between top-level entries.
There are two issues here. One is that when you want to round-trip, you should not register tags for your own objects.
ruamel.yaml can round-trip tagged collections (mapping, sequence) and most scalars (most notably, it cannot
round-trip a tagged null/~). This gives you subclasses of standard Python types that mostly behave
as you would expect and preserve all of the comments as well as any tags.
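For instance (a sketch of my own, not from the original answer), loading the tagged mapping without registering ExampleClass keeps both the tag and the comment:

import sys
import ruamel.yaml

yaml = ruamel.yaml.YAML()
data = yaml.load("""\
entry: !ExampleClass
  subentry_0: 0  # a comment that survives the round-trip
""")
# the tagged node comes back as a comment- and tag-preserving mapping
print(data['entry']['subentry_0'])  # -> 0
yaml.dump(data, sys.stdout)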
The second issue is that comments between keys and their values have issues,
and that tags interfere with comments (i.e. this is not
properly covered by enough test cases, because of laziness on the part of the ruamel.yaml author). IIRC, comments between a key and a tagged value get completely lost.
The easiest solution for this second issue (for now) is probably to post-process the output.
import sys
import ruamel.yaml

yaml_str = """\
# top-level comment
entry: !ExampleClass # entry inline comment
    subentry_0: 0
    subentry_1: 1 # subentry inline comment

    # separation comment
    subentry_2: 2

entry 2: |
    This is a long
    text
    entry
"""

yaml = ruamel.yaml.YAML()
yaml.indent(mapping=4)
yaml.preserve_quotes = True
data = yaml.load(yaml_str)
# print(data['entry'].tag.value)

def correct_comment_after_tag(s):
    # if a previous line ends in a tag and this line has enough spaces
    # at the start, append the end of the line to the previous one
    res = []
    prev_line = -1  # -1 if the previous line didn't end in a tag, else its length
    for line in s.splitlines():
        linesplit = line.split()
        if linesplit and linesplit[-1].startswith('!'):
            prev_line = len(line)
        else:
            if prev_line > 0:
                if line.lstrip().startswith('#') and line.find('#') > prev_line:
                    res[-1] += line[prev_line:]
                    prev_line = -1
                    continue
            prev_line = -1
        res.append(line)
    return '\n'.join(res)

yaml.dump(data, sys.stdout, transform=correct_comment_after_tag)
which gives:
# top-level comment
entry: !ExampleClass # entry inline comment
    subentry_0: 0
    subentry_1: 1 # subentry inline comment
    # separation comment
    subentry_2: 2
entry 2: |
    This is a long
    text
    entry
To get the ExampleClass behaviour I would probably duck-type a __getattr__ on ruamel.yaml.comments.CommentedMap
that checks for subentry_0 and returns the value for that key. (I usually know up-front whether I am going to round-trip or not,
and use yamlrt = YAML() if I do, and yamls = YAML(typ='safe') with the classes registered on yamls if I don't.)
If you need to do (extra) checks on tagged nodes, it is IMO easiest to recursively walk over the
data structure, testing dicts, lists and possibly items for their .tag attribute, and doing the appropriate check when the tag matches.
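A rough sketch of such a walk (my illustration, not from the original answer; it assumes the round-trip nodes expose a .tag attribute with a .value string, and reuses data as loaded in the example above):

def walk_tagged(node, tag, check):
    # run check() on every node whose tag matches, then recurse into collections
    node_tag = getattr(node, 'tag', None)
    if node_tag is not None and getattr(node_tag, 'value', None) == tag:
        check(node)
    if isinstance(node, dict):
        for value in node.values():
            walk_tagged(value, tag, check)
    elif isinstance(node, list):
        for value in node:
            walk_tagged(value, tag, check)

def check_example(node):
    if 'subentry_0' not in node:
        raise AssertionError('missing subentry_0')

walk_tagged(data, '!ExampleClass', check_example)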
Alternatively, you might get a Python data structure that preserves comments on round-trip
by making ExampleClass a subclass
of CommentedMap, but I am not sure.
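A sketch of the duck-typing idea (again my illustration; whether this plays well with round-tripping in all cases is exactly the part the answer is unsure about):

from ruamel.yaml import YAML
from ruamel.yaml.comments import CommentedMap

def _getattr(self, name):
    # fall back to mapping keys for attribute access, ExampleClass-style
    try:
        return self[name]
    except KeyError:
        raise AttributeError(name)

CommentedMap.__getattr__ = _getattr

yaml = YAML()
data = yaml.load("""\
entry: !ExampleClass
  subentry_0: 0
""")
print(data['entry'].subentry_0)  # -> 0, via the duck-typed __getattr__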

Split out-of-order string in python

I'm reading text files which may look like this
file1.txt:
Header A
blab
iuyt
Header B
bkas
rtyu
Header C
asdf
file2.txt:
Header B
asdw
Header A
hufd
ousu
Header C
dfsn
At the end of the file might be a newline, space, or nothing at all. The headers are the same in all the files but may be ordered differently as above.
I would like to map this so that a = blab\niuyt for the first input or a = hufd\nousu for the second.
I'm not sure I fully understand your question. It sounds to me as though you want to take an input:
XABCDE
or, equivalently (at least as far as I can tell in your notation):
BCXADE
DEBCXA
and return a mapping like
{"x": "A", "b": "C", "d": "E"}
(which is one way of representing the name-value pairs).
Is that correct? If so:
# This is the input.
c = "XABCDE"

# This is a dictionary comprehension, one way
# of creating a set of key-value pairs.
{
    c[idx].lower(): c[idx + 1]  # Map the first element of each pair to the second.
    for idx in range(0, len(c), 2)  # Iterate over the pairs in the string.
}
The question's been edited materially since my original answer, so I'm adding a separate answer.
The OP's input is given as follows: there is a file foo.txt with the following contents:
Header A
blab
iuyt
Header B
bkas
rtyu
Header C
asdf
The OP's expected output is a dictionary mapping header values to the contents (not lines) following the header, i.e.:
{
    "A": "blab\niuyt",
    "B": "bkas\nrtyu",
    "C": "asdf"
}
Note that the trailing line delimiter (\n) before each new header should not be included.
One approach:
import re
from collections import defaultdict
# Given something like "Header A" with a trailing newline,
# this will match "A" under group "key". The header formats
# in the example are simple enough that you could fetch the
# value using _, group = line.split(" "), but this accomodates
# more complex formats. Note that this regular expression
# assumes each header will be followed by AT LEAST ONE line
# of data in a file!
PATTERN = re.compile(r"^Header\s*(?P<key>.+)(\r\n|\r|\n)")
# Using defaultdict with an str constructor means we don't have to check
# for key existence before attempting an append. Check the standard library
# documentation for more info:
# https://docs.python.org/3/library/collections.html#collections.defaultdict
structured_output = defaultdict(str)
with open("txt", "r") as handle:
last_match = None # Track the second-last match we made.
for line in handle:
maybe_match = PATTERN.match(line)
if maybe_match: # We've matched a new header group. Strip the trailing newline from preceding, if any.
# This is either (a) the FIRST header we're matching or (b) the n-th header.
# In the first case, structured_output[key] returns "" (see defaultdict), and "".rstrip("\n")
# is "". In the second case, we strip the last newline from the previous group (per the spec).
group = (last_match or maybe_match).group("key")
structured_output[group] = structured_output[group].rstrip()
# Move the last_match "pointer" forward to the
# new header.
last_match = maybe_match
else: # This is not a header, it's a line of data: append it.
structured_output[last_match.group("key")] += line
# Once we've run off the end of the file, we should still rstrip the _last_ group.
structured_output[last_match.group("key")] = structured_output[last_match.group("key")].rstrip()
Use an iterator:
it = iter(string1)
res = {c: next(it) for c in it}
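Applied to the XABCDE input from the earlier answer (string1 is assumed to hold that input), each key consumes the following character as its value:

string1 = "XABCDE"
it = iter(string1)
res = {c: next(it) for c in it}  # 'X' pairs with 'A', 'B' with 'C', 'D' with 'E'
print(res)  # {'X': 'A', 'B': 'C', 'D': 'E'}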

inserting node in yaml with ruamel

I would like to have printed the following layout:
extra:
  identifiers:
    biotools:
    - http://bio.tools/abyss
I'm using this code to add nodes:
yaml_file_content['extra']['identifiers'] = {}
yaml_file_content['extra']['identifiers']['biotools'] = ['- http://bio.tools/abyss']
But instead I'm getting this output, which wraps the tool in []:
extra:
  identifiers:
    biotools: ['- http://bio.tools/abyss']
I have tried other combinations, but they didn't work.
The dash in - http://bio.tools/abyss indicates a sequence element and is added on output if you dump a Python list in block style.
So instead of doing:
yaml_file_content['extra']['identifiers']['biotools'] = ['- http://bio.tools/abyss']
you should be doing:
yaml_file_content['extra']['identifiers']['biotools'] = ['http://bio.tools/abyss']
and then force the output of all composite elements in block style using:
yaml.default_flow_style = False
If you want finer grained control, create a ruamel.yaml.comments.CommentedSeq instance:
tmp = ruamel.yaml.comments.CommentedSeq(['http://bio.tools/abyss'])
tmp.fa.set_block_style()
yaml_file_content['extra']['identifiers']['biotools'] = tmp
Once you have loaded a YAML file it's no longer "yaml"; it's now a Python data structure, and the contents of the biotools key is a list:
>>> import ruamel.yaml as yaml
>>> data = yaml.load(open('data.yml'))
>>> data['extra']['identifiers']['biotools']
['http://bio.tools/abyss']
Like any other Python list, you can append to it:
>>> data['extra']['identifiers']['biotools'].append('http://bio.tools/anothertool')
>>> data['extra']['identifiers']['biotools']
['http://bio.tools/abyss', 'http://bio.tools/anothertool']
And if you print out the data structure you get valid YAML:
>>> print( yaml.dump(data))
extra:
  identifiers:
    biotools: [http://bio.tools/abyss, http://bio.tools/anothertool]
Of course, if for some reason you don't like that list representation you can also get the syntactically equivalent:
>>> print( yaml.dump(data, default_flow_style=False))
extra:
  identifiers:
    biotools:
    - http://bio.tools/abyss
    - http://bio.tools/anothertool

Load specific PyYAML documents from file

I have a .yml file, and I'm trying to load certain documents from it. I know that:
print yaml.load(open('doc_to_open.yml', 'r+'))
will open the first (or only) document in a .yml file, and that:
for x in yaml.load_all(open('doc_to_open.yml', 'r+')):
    print x
which will print all the YAML documents in the file. But say I just want to open the first three documents in the file, or want to open the 8th document in the file. How would I do that?
If you don't want to parse the first seven YAML documents at all, e.g. for efficiency reasons, you will have to search for the 8th document yourself.
There is the possibility to hook into the first stage of the parser and count the number of DocumentStartToken instances within the stream, passing on the tokens only after the 8th one and stopping to do so on the 9th, but doing that is far from trivial. And that would still tokenize, at least, all of the preceding documents.
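For a taste of that token level, here is a small sketch of mine using PyYAML's low-level scanner; it merely counts explicit document starts (and still scans the whole stream, so it does not avoid the cost; note that a first document without a leading --- produces no DocumentStartToken):

import yaml
from yaml.tokens import DocumentStartToken

def count_document_starts(stream):
    # tokenize the stream and count the explicit '---' markers
    return sum(1 for tok in yaml.scan(stream) if isinstance(tok, DocumentStartToken))

with open('input.yaml') as fp:
    print(count_document_starts(fp))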
The completely inefficient way, which an efficient replacement would IMO need to behave the same as, is to use .load_all() and select the appropriate document after completely tokenizing/parsing/composing/resolving all of the documents ¹:
import sys
import ruamel.yaml

yaml = ruamel.yaml.YAML()
for idx, data in enumerate(yaml.load_all(open('input.yaml'))):
    if idx == 7:
        yaml.dump(data, sys.stdout)
If you run the above on a document input.yaml:
---
document: 0
---
document: 1
---
document: 2
---
document: 3
---
document: 4
---
document: 5
---
document: 6
---
document: 7 # < the 8th document
---
document: 8
---
document: 9
...
you get the output:
document: 7 # < the 8th document
You unfortunately cannot naively just count the number of document markers (---), as the document doesn't have to start with one:
document: 0
---
document: 1
.
.
nor does it have to have the marker on the first line if the file starts with a directive ²:
%YAML 1.2
---
document: 0
---
document: 1
.
.
or starts with a "document" consisting of comments only:
# the 8th document is the interesting one
---
document: 0
---
document: 1
.
.
To account for all that you can use:
def get_nth_yaml_doc(stream, doc_nr):
    doc_idx = 0
    data = []
    for line in stream:
        if line == u'---\n' or line.startswith('--- '):
            doc_idx += 1
            continue
        if line == '...\n':
            break
        if doc_nr < doc_idx:
            break
        if line.startswith(u'%'):
            continue
        if doc_idx == 0:  # no initial '---'; a YAML file doesn't have to start with one
            if line.lstrip().startswith('#'):
                continue
            doc_idx = 1
        if doc_idx == doc_nr:
            data.append(line)
    return yaml.load(''.join(data))

with open("input.yaml") as fp:
    data = get_nth_yaml_doc(fp, 8)
yaml.dump(data, sys.stdout)
and get:
document: 7 # < the 8th document
In all of the above cases this works efficiently, without even tokenizing the preceding YAML documents (nor the following ones).
There is an additional caveat in that the YAML file could start with a byte-order-marker, and that the individual documents within a stream can start with these markers. The above routine doesn't handle that.
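A minimal way to cover that caveat (my addition, not part of the original routine) is to strip the marker before the line tests:

BOM = u'\ufeff'

def strip_bom(line):
    # drop a leading byte-order-mark so the '---'/'%'/'#' tests still match
    return line[1:] if line.startswith(BOM) else line

and then call line = strip_bom(line) at the top of the for loop in get_nth_yaml_doc.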
¹ This was done using ruamel.yaml, of which I am the author, and which is an enhanced version of PyYAML. AFAIK PyYAML would work the same (but it would e.g. drop the comment on the round-trip).
² Technically the directive is in its own directives document, so you could count that as a document, but .load_all() doesn't give you that document back, so I don't count it as such.

Cascaded string split, pythonic way

Take for example this format from IANA: http://www.iana.org/assignments/language-subtag-registry
%%
Type: language
Subtag: aa
Description: Afar
Added: 2005-10-16
%%
Type: language
Subtag: ab
Description: Abkhazian
Added: 2005-10-16
Suppress-Script: Cyrl
%%
Type: language
Subtag: ae
Description: Avestan
Added: 2005-10-16
%%
Say I open the file:
import urllib
f = urllib.urlopen("http://www.iana.org/assignments/language-subtag-registry")
all=f.read()
Normally you would do it like this:
lan = all.split("%%")
then iterate over lan and split("\n"), then iterate over the result and split(":"). Is there a way to do this in Python in one batch, without the iteration, with the output still looking like this:
[[["Type","language"],["Subtag", "ae"],...]...]?
As a single comprehension:
raw = """\
%%
Type: language
Subtag: aa
Description: Afar
Added: 2005-10-16
%%
Type: language
Subtag: ab
Description: Abkhazian
Added: 2005-10-16
Suppress-Script: Cyrl
%%
Type: language
Subtag: ae
Description: Avestan
Added: 2005-10-16
%%"""
data = [
    dict(
        row.split(': ')
        for row in item_str.split("\n")
        if row  # required to avoid the empty lines which contained '%%'
    )
    for item_str in raw.split("%%")
    if item_str  # required to avoid the empty items at the start and end
]
>>> data[0]['Added']
'2005-10-16'
I don't see any sense in trying to do this in a single pass if the elements you are getting to after each split are semantically different.
You could start by splitting on ":" -- that would get you to the fine-grained data -- but what good would that be if you would not know where the data belongs?
That said, you could put all the levels of separation inside a generator, and have it yield
dictionary objects with your data, ready for consumption:
def iana_parse(data):
    for record in data.split("%%\n"):
        # skip empty records at file endings:
        if not record.strip():
            continue
        rec_data = {}
        for line in record.split("\n"):
            if not line.strip():  # skip the empty trailing line of each record
                continue
            key, value = line.split(":", 1)  # split once: values may contain ':'
            rec_data[key.strip()] = value.strip()
        yield rec_data
It can be done as a one-liner, as you request in the comments -- but, as I commented back,
while it can be written to fit in a single expression on one line, that took more time to write than the example above and would be nearly impossible to maintain. The code in the example above unfolds the logic into a few lines that are placed "out of the way" -- i.e. not inline where you are dealing with the actual data -- providing readability and maintainability for both tasks.
That said, parsing as a structure of nested lists as you want can be done thus:
structure = [[[token.strip() for token in line.split(":")] for line in record.split("\n")] for record in data.split("%%") if record.strip()]
Regexes, but I don't see the point:
re.split('%%|:|\\n', string)
Here, multiple patterns are chained using the or (|) operator.
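For instance, on a small slice of the data above, the flat result (with empty strings and leading spaces) shows why this doesn't give the nested structure asked for:

import re

s = "%%\nType: language\nSubtag: aa\n"
print(re.split('%%|:|\\n', s))
# ['', '', 'Type', ' language', 'Subtag', ' aa', '']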
You can use itertools.groupby:
ss = """%%
Type: language
Subtag: aa
Description: Afar
Added: 2005-10-16
%%
Type: language
Subtag: ab
Description: Abkhazian
Added: 2005-10-16
Suppress-Script: Cyrl
%%
Type: language
Subtag: ae
Description: Avestan
Added: 2005-10-16
"""
sss = ss.splitlines(True)  # a list which looks like you're iterating over a file object

import itertools
output = []
for k, v in itertools.groupby(sss, lambda x: x.strip() == '%%'):
    if k:  # hit a '%%' separator: need a new group.
        print "\nNew group:\n"
        current = {}
        output.append(current)
    else:  # just a regular record, write the data to our current record dict.
        for line in v:
            print line.strip()
            key, value = line.split(None, 1)
            current[key.rstrip(':')] = value.strip()  # drop the trailing ':' from the key
One benefit of this approach is that it doesn't require you to read the entire file into memory: itertools.groupby consumes its input lazily.
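To make that concrete, the same loop runs directly on a file handle (registry.txt is a hypothetical local copy of the IANA data, not a name from the question):

import itertools

records = []
current = {}  # catches any preamble lines before the first '%%' (e.g. File-Date)
with open('registry.txt') as fp:  # groupby pulls lines from the file lazily
    for is_sep, lines in itertools.groupby(fp, lambda x: x.strip() == '%%'):
        if is_sep:  # a '%%' separator starts a new record
            current = {}
            records.append(current)
        else:
            for line in lines:
                key, _, value = line.partition(':')
                current[key.strip()] = value.strip()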
