I'm using TinyDB for a small CLI utility to manage personal document drafts. The database stores metadata for each draft; the file should be human-editable (so that I can add details manually), and for this reason I'd like to use YAML over JSON as the format.
I implemented a YamlStorage class subclassing storages.Storage as indicated in the TinyDB docs:
class TestYamlStorage(Storage):
    """
    Store the data in a YAML file.
    Written following the example at http://tinydb.readthedocs.io/en/latest/extend.html#write-a-custom-storage
    """

    def __init__(self, filename):
        super().__init__()
        self.filename = filename
        touch(filename)

    def read(self):
        with open(self.filename) as handle:
            try:
                data = yaml.load(handle.read())
                return data
            except yaml.YAMLError:
                return None

    def write(self, data):
        print('writing data: {}'.format(data))
        with open(self.filename, 'w') as handle:
            yaml.dump(data, handle)

    def close(self):
        pass
Everything works fine when inserting only one element, or multiple elements at the same time using insert_multiple:
db = TinyDB('db.yaml', storage=TestYamlStorage)

dicts = [
    dict(name='Homer', age=38),
    dict(name='Marge', age=34),
    dict(name='Bart', age=10)
]

# this works as expected
db.insert_multiple(dicts)
The resulting db.yaml:
_default:
  1: {age: 38, name: Homer}
  2: {age: 34, name: Marge}
  3: {age: 10, name: Bart}
However, when inserting elements multiple times with insert, the resulting YAML file is different:
db = TinyDB('db.yaml', storage=TestYamlStorage)
db.insert(dict(name='Homer', age=38))
db.insert(dict(name='Bart', age=10))
db.yaml:
_default:
  1: !!python/object/new:tinydb.database.Element
    dictitems: {age: 38, name: Homer}
    state: {eid: 1}
  2: {age: 10, name: Bart}
The data in this format (apart from looking messier) seems to be incompatible with yaml.safe_load (calling db.all() returns []). My interpretation is that the YAML serialization process is somehow "over-eager", i.e. the Element instance gets written to db.yaml instead of the underlying data.
Is there something wrong with my code? I've tried fiddling with PyYAML options, using a different YAML module (ruamel.yaml), and creating a second YamlStorage class copied from the default JSONStorage, but none of it made any difference.
Version info: Python 3.4.3, TinyDB 3.2.0, PyYAML 3.11. I posted a runnable MWE with all imports here.
Edit
After @Anthon's suggestion, I tried printing the YAML output to sys.stdout immediately before dumping it to file. The problem is reproduced in this case as well. See notebook.
When you update an existing "database" you retrieve a database.Element, which includes (as you can see in the second YAML file) state information.
When that is saved again you are not saving a dict but an instance of this Element, which is a subclass of dict; for that, ruamel.yaml (and PyYAML) needs to store both the dictitems (the key-value pairs of the dict) and the state (a dictionary of the attributes and their values).
Converting your Element to a dict explicitly before writing should do the trick:
def write(self, data):
    print('writing data: {}'.format(data))
    with open(self.filename, 'w') as handle:
        yaml.dump(dict(data), handle)
        #         ^^^^     ^
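To check the result, here is a quick sanity test (a minimal sketch, assuming the same db.yaml path) that the file now loads with yaml.safe_load:

import yaml

with open('db.yaml') as handle:
    # if the dump contains only plain mappings, safe_load succeeds and prints ordinary dicts;
    # with the python-specific Element tag still present it would raise a ConstructorError instead
    print(yaml.safe_load(handle.read()))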
Related
I have a lot of YAML files that have a similar structure but different data. I need to parse out selected data and put it into a single CSV (Excel) file as three columns.
But I'm facing an issue with an empty key, which always gives me a KeyError: 'port'.
my yaml file example:
base:
  server: 10.100.80.47
  port: 3306
  namePrefix: well
  user: user1
  password: kjj&%$

base:
  server: 10.100.80.48
  port:
  namePrefix: done
  user: user2
  password: fhfh#$%
In the second block I have an empty "port", and my script gets stuck at that point.
Whenever an empty key is encountered, I need the script to simply write nothing for that value.
from asyncio.windows_events import NULL
from queue import Empty
import yaml
import csv
import glob

yaml_file_names = glob.glob('./*.yaml')
rows_to_write = []

for i, each_yaml_file in enumerate(yaml_file_names):
    print("Processing file {} of {} file name: {}".format(
        i+1, len(yaml_file_names), each_yaml_file))
    with open(each_yaml_file) as file:
        data = yaml.safe_load(file)
        for v in data:
            if "port" in v == "":
                data['base']['port'] = ""
        rows_to_write.append([data['base']['server'], data['base']['port'], data['server']['host'], data['server']['contex']])

with open('output_csv_file.csv', 'w', newline='') as out:
    csv_writer = csv.writer(out)
    csv_writer.writerow(["server", "port", "hostname", "contextPath"])
    csv_writer.writerows(rows_to_write)
print("Output file output_csv_file.csv created")
You are trying to access the key by indexing into the dict, e.g.
data['base']['port']
But what you want is to access it with the get method like so:
data['base'].get('port')
This way, if the key does not exist it returns None by default, and you can even change the default to whatever you want by passing it as the second parameter.
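For example, a small sketch using the field names from the question (the second argument is the fallback value):

port = data['base'].get('port', "")  # "" is used if the 'port' key is missing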
In PyYAML, an empty element is returned as None, not an empty string.
if data['base']['port'] is None:
    data['base']['port'] = ""
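If the key might also be missing entirely, the two suggestions can be combined (a small sketch, not part of the original answer):

port = data['base'].get('port') or ""  # covers both a missing key and an empty (None) value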
Your YAML file is invalid. In a YAML file, whenever you have a key (like port: in your example) you must provide a value; you cannot leave it empty and go to the next line. Unless the value is the next bunch of keys, of course, but in that case you need to indent the following lines one step more, which is obviously not what you intend to do here.
This is likely why you cannot parse the file as expected with the Python yaml module. If you are the creator of those YAML files, you really need to put a value in the file, like port: None, if you don't want to provide a value for the port, or even better just not provide any port key at all.
If they are provided to you by someone else, ask them to provide valid YAML files.
Then the other solutions posted should work.
Ok, so I am having a weird one. I am running python in a SideFX Hython (their custom build) implementation that is using PDG. The only real difference between Hython and vanilla Python is some internal functions for handling geometry data and compiled nodes, which shouldn't be an issue even though they are being used.
The way the code runs, I am generating a list of files from the disk which creates PDG work items. Those work items are then processed in parallel by PDG. Here is the code for that:
import importlib.util
import pdg
import os
from pdg.processor import PyProcessor
import json

class CustomProcessor(PyProcessor):
    def __init__(self, node):
        PyProcessor.__init__(self, node)
        self.extractor_module = 'GeoExtractor'

    def onGenerate(self, item_holder, upstream_items, generation_type):
        for upstream_item in upstream_items:
            new_item = item_holder.addWorkItem(parent=upstream_item, inProcess=True)
        return pdg.result.Success

    def onCookTask(self, work_item):
        spec = importlib.util.spec_from_file_location("callback", "Geo2Custom.py")
        GE = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(GE)
        GE.convert(f"{work_item.attribValue('directory')}/{work_item.attribValue('filename')}{work_item.attribValue('extension')}", work_item.index, f'FRAME { work_item.index }', self.extractor_module)
        return pdg.result.Success
def bulk_convert (path_pattern, extractor_module = 'GeoExtractor'):
    type_registry = pdg.TypeRegistry.types()
    try:
        type_registry.registerNode(CustomProcessor, pdg.nodeType.Processor, name="customprocessor", label="Custom Processor", category="Custom")
    except Exception:
        pass
    whereItWorks = pdg.GraphContext("testBed")
    whatWorks = whereItWorks.addScheduler("localscheduler")
    whatWorks.setWorkingDir(os.getcwd (), '$HIP')
    whereItWorks.setValues(f'{whatWorks.name}', {'maxprocsmenu':-1, 'tempdirmenu':0, 'verbose':1})
    findem = whereItWorks.addNode("filepattern")
    whereItWorks.setValue(f'{findem.name}', 'pattern', path_pattern, 0)
    generic = whereItWorks.addNode("genericgenerator")
    whereItWorks.setValue(generic.name, 'itemcount', 4, 0)
    custom = whereItWorks.addNode("customprocessor")
    custom.extractor_module = extractor_module
    node1 = [findem]
    node2 = [custom]*len(node1)
    for n1, n2 in zip(node1, node2):
        whereItWorks.connect(f'{n1.name}.output', f'{n2.name}.input')
        n2.cook(True)
        for node in whereItWorks.graph.nodes():
            node.dirty(False)
        whereItWorks.disconnect(f'{n1.name}.output', f'{n2.name}.input')
    print ("FULLY DONE")
import os
import hou
import traceback
import CustomWriter
import importlib

def convert (filename, frame_id, marker, extractor_module = 'GeoExtractor'):
    Extractor = importlib.__import__ (extractor_module)
    base, ext = os.path.splitext (filename)
    if ext == '.sc':
        base = os.path.splitext (base)[0]
    dest_file = base + ".custom"
    geo = hou.Geometry ()
    geo.loadFromFile (filename)
    try:
        frame = Extractor.extract_geometry (geo, frame_id)
    except Exception as e:
        print (f'F{ frame_id } Geometry extraction failed: { traceback.format_exc () }.')
        return None
    print (f'F{ frame_id } Geometry extracted. Writing file { dest_file }.')
    try:
        CustomWriter.write_frame (frame, dest_file)
    except Exception as e:
        print (f'F{ frame_id } writing failed: { e }.')
    print (marker + " SUCCESS")
The onCookTask code is run when the work item is processed.
Inside of the GeoExtractor.py program I am importing the geometry file defined by the work item, then converting it into a couple Pandas dataframes to collate and process the massive volumes of data quickly, which is then passed to a custom set of functions for writing binary files to disk from the Pandas data.
Everything appears to run flawlessly, until I check my output binaries and see that they grow in file size much more than they should. This indicates that either something is being shared between instances or not cleared from memory, and subsequent loads of the extractor code are appending to the dataframes, which are named the same.
I have run the GeoExtractor code sequentially, with the Python instance closing between each file conversion, using the exact same code, and the files are fine, growing only very slowly as the geometry data volume grows. So the issue has to lie somewhere in the parallelization with PDG and in calling the GeoExtractor.py code over and over for each work item.
I have contemplated moving the importlib stuff to the class's __init__(), leaving only the call to the member function in onCookTask() (see the sketch below). Maybe even going so far as to pass a unique variable for each work item, which is used inside GeoExtractor to create a closure over the internal functions so they are unique instances in memory.
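For reference, a rough, untested sketch of what that first idea might look like, reusing the names from the code above (only the changed methods are shown):

class CustomProcessor(PyProcessor):
    def __init__(self, node):
        PyProcessor.__init__(self, node)
        self.extractor_module = 'GeoExtractor'
        # load the callback module once per processor instead of once per work item
        spec = importlib.util.spec_from_file_location("callback", "Geo2Custom.py")
        self.GE = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(self.GE)

    def onCookTask(self, work_item):
        self.GE.convert(
            f"{work_item.attribValue('directory')}/{work_item.attribValue('filename')}{work_item.attribValue('extension')}",
            work_item.index,
            f'FRAME { work_item.index }',
            self.extractor_module)
        return pdg.result.Success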
I tried to do a stripped down version of GeoExtractor and since I'm not sure where the leak is, I just ended up pulling out comments with proprietary or superfluous information and changing some custom library names, but the file ended up kinda long so I am including a pastebin: https://pastebin.com/4HHS8D2W
As for CustomGeometry and CustomWriter, there is no working form of either of those libraries that will be NDA safe, so unfortunately they have to stay blackboxed. The CustomGeometry is a handful of container classes which organize all of the data coming out of the geometry, and the writer is a formatter/writer for the binary format we are utilizing. I am hoping the issue wouldn't be in either of them.
Edit 1: I fixed an issue in the example code.
Edit 2: Added larger examples.
I have a Python project that performs a JSON validation against a specific schema.
It will run as a Transform step in GCP Dataflow, so it's very important that all dependencies are gathered before the run to avoid downloading the same file again and again.
The schema is placed in a separated Git repository.
The nature of the Transformer is that you receive a single record in your class and you work with it. The typical flow is that you load the JSON Schema, validate the record against it, and then do stuff with the invalid and the valid ones. Loading the schema this way means that I download the schema from the repo for every record, and there could be hundreds of thousands of them.
The code gets "cloned" onto the workers, which then work more or less independently.
Inspired by the way Python loads the requirements at the beginning (one single time) and using them as imports, I thought I could add the repository (where the JSON schema lives) as a Python requirement, and then simply use it in my Python code. But of course, it's a JSON, not a Python module to be imported. How can it work?
An example would be something like:
requirements.txt
git+git://github.com/path/to/json/schema#41b95ec
dataflow_transformer.py
import apache_beam as beam
import the_downloaded_schema
from jsonschema import validate

class Verifier(beam.DoFn):
    def process(self, record: dict):
        validate(instance=record, schema=the_downloaded_schema)
        # ... more stuff
        yield record

class Transformer(beam.PTransform):
    def expand(self, record):
        return (
            record
            | "Verify Schema" >> beam.ParDo(Verifier())
        )
You can load the json schema once and use it as a side input.
An example:
import json
import requests
import apache_beam as beam  # needed for the pipeline snippet below

json_current = 'https://covidtracking.com/api/v1/states/current.json'

def get_json_schema(url):
    with requests.Session() as session:
        schema = json.loads(session.get(url).text)
        return schema

schema_json = get_json_schema(json_current)

def feed_schema(data, schema):
    yield {'record': data, 'schema': schema[0]}
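# note (not part of the original answer): p below is assumed to be an apache_beam Pipeline, e.g. p = beam.Pipeline()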
schema = p | beam.Create([schema_json])
data = p | beam.Create(range(10))
data_with_schema = data | beam.FlatMap(feed_schema, schema=beam.pvalue.AsSingleton(schema))
# Now do your schema validation
Just a demonstration of what the data_with_schema PCollection looks like:
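Each element pairs one record with the (single) schema value. Since the example URL returns a JSON list, schema[0] is its first entry, so the elements look roughly like:

{'record': 0, 'schema': schema_json[0]}
{'record': 1, 'schema': schema_json[0]}
...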
Why don't you just use a class for loading your resources that uses a cache in order to prevent double loading? Something along the lines of:
import os

class JsonLoader:
    def __init__(self):
        self.cache = set()

    def load(self, filename):  # 'import' is a reserved word in Python, so the method needs another name
        filename = os.path.abspath(filename)
        if filename not in self.cache:
            self._load_json(filename)
            self.cache.add(filename)

    def _load_json(self, filename):
        ...
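A hypothetical usage, assuming _load_json is filled in to parse the file and keep the result on the instance:

loader = JsonLoader()
loader.load('schema.json')  # parsed and cached on the first call
loader.load('schema.json')  # already cached, so nothing is loaded again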
Using intersphinx and autodoc, having:
:param stores: Array of objects
:type stores: list[dict[str,int]]
Would result in an entry like:
stores (list[dict[str,int]]) - Array of objects.
Is there a way to convert list[dict[str,int]] outside of the autodoc :param: directive (or others like :rtype:) with raw RST (within the docstring) or programmatically, given a 'list[dict[str,int]]' string?
Additionally, is it possible to use external links within the aforementioned example?
Example
Consider a script.py file:
def some_func(arg1):
    """
    This is a head description.

    :param arg1: The type of this param is hyperlinked.
    :type arg1: list[dict[str,int]]

    Is it possible to hyperlink this, here: dict[str,list[int]]

    Or even add custom references amongst the classes: dict[int,ref]

    Where *ref* links to a foreign, external source.
    """
Now in the Sphinx conf.py file add:
extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.intersphinx'
]

intersphinx_mapping = {
    'py': ('https://docs.python.org/3', None),
}
In your index.rst, add:
Title
=====

.. toctree::
   :maxdepth: 2

.. autofunction:: script.some_func
The list[dict[str,int]] next to :type arg1: will be hyperlinked as shown at the beginning of this question, but dict[str,list[int]] obviously won't. Is there a way to make the latter behave like the former?
I reached a solution after digging around in Sphinx's code.
Injecting External References (:param:)
I created a custom extension that connects to the missing-reference event and attempts to resolve unknown references.
Code of reflinks.py:
import docutils.nodes as nodes

_cache = {}

def fill_cache(app):
    _cache.update(app.config.reflinks)

def missing_reference(app, env, node, contnode):
    target = node['reftarget']
    try:
        uri = _cache[target]
    except KeyError:
        return
    newnode = nodes.reference('', '', internal = False, refuri = uri)
    if not node.get('refexplicit'):
        name = target.replace('_', ' ')  # style
        contnode = contnode.__class__(name, name)
    newnode.append(contnode)
    return newnode

def setup(app):
    app.add_config_value('reflinks', None, False)
    app.connect('builder-inited', fill_cache)
    app.connect('missing-reference', missing_reference, priority = 1000)
Explanation
I consulted intersphinx's methodology for resolving unknown references and connected the function with a high priority value so that it is hopefully only consulted as a last resort.
Followup
Include the extension.
Adding to conf.py:
reflinks = {'google': 'https://google.com'}
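For completeness, a minimal sketch of the combined conf.py additions, assuming reflinks.py lives in an _ext directory next to conf.py (the location is an assumption, not part of the original setup):

import os
import sys

sys.path.insert(0, os.path.abspath('_ext'))  # make reflinks.py importable

extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.intersphinx',
    'reflinks',
]

reflinks = {'google': 'https://google.com'}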
This allowed, in script.py:
def some_func(arg1):
    """
    :param arg1: Google homepages.
    :type arg1: dict[str, google]
    """
Where dict[str, google] are now all hyperlinks.
Formatting Nested Types
There were instances where I wanted to use type structures like list[dict[str,myref]] outside of fields like :param:, :rtype:, etc. Another short extension did the trick.
Code of nestlinks.py:
import sphinx.domains.python as domain
import docutils.parsers.rst.roles as roles

_field = domain.PyTypedField('class')

def handle(name, rawtext, text, lineno, inliner, options = {}, content = []):
    refs = _field.make_xrefs('class', 'py', text)
    return (refs, [])

def setup(app):
    roles.register_local_role('nref', handle)
Explanation
After reading this guide on roles, and digging here and here, I realised that all I needed was a dummy field to handle the whole reference-making work and pretend that it's trying to reference classes.
Followup
Include the extension.
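Registration works the same way as for reflinks above: make nestlinks.py importable (for example from the same _ext directory) and add 'nestlinks' to the extensions list in conf.py.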
Now script.py:
def some_func(arg1):
    """
    :param arg1: Google homepages.
    :type arg1: dict[str, google]

    Now this :nref:`list[dict[str,google]]` is hyperlinked!
    """
Notes
I am using intersphinx and autodoc to link to python's types and document my function's docstrings.
I am not well-versed in Sphinx's underlying mechanisms so take my methodology with a grain of salt.
The examples provided are adjusted for the sake of being reusable and generic, and have not been tested.
The usability of such features is obviously questionable and only necessary when libraries like extlinks don't cover your needs.
Using python 2 (atm) and ruamel.yaml 0.13.14 (RedHat EPEL)
I'm currently writing some code to load yaml definitions, but they are split up in multiple files. The user-editable part contains eg.
users:
  xxxx1:
    timestamp: '2018-10-22 11:38:28.541810'
    << : *userdefaults
  xxxx2:
    << : *userdefaults
    timestamp: '2018-10-22 11:38:28.541810'
the defaults are stored in another file, which is not editable:
userdefaults: &userdefaults
  # Default values for user settings
  fileCountQuota: 1000
  diskSizeQuota: "300g"
I can process these together by loading both files, concatenating the strings, and then running them through merged_data = list(yaml.load_all("{}\n{}".format(defaults_data, user_data), Loader=yaml.RoundTripLoader)), which correctly resolves everything. (When not using RoundTripLoader I get errors that the references cannot be resolved, which is normal.)
Now, I want to do some updates via Python code (e.g. update the timestamp), and for that I need to write back just the user part. And that's where things get hairy. I so far haven't found a way to write just that YAML document, not both.
First of all, unless there are multiple documents in your defaults file, you don't have to use load_all, as you don't concatenate two documents into a multiple-document stream. If you did, by using a format string with a document-end marker ("{}\n...\n{}") or with a directives-end marker ("{}\n---\n{}"), your aliases would not carry over from one document to another, as per the YAML specification:
It is an error for an alias node to use an anchor that does not
previously occur in the document.
The anchor has to be in the document, not just in the stream (which can consist of multiple
documents).
I tried some hocus pocus, pre-populating the already represented dictionary
of anchored nodes:
import sys
import datetime
from ruamel import yaml

def load():
    with open('defaults.yaml') as fp:
        defaults_data = fp.read()
    with open('user.yaml') as fp:
        user_data = fp.read()
    merged_data = yaml.load("{}\n{}".format(defaults_data, user_data),
                            Loader=yaml.RoundTripLoader)
    return merged_data

class MyRTDGen(object):
    class MyRTD(yaml.RoundTripDumper):
        def __init__(self, *args, **kw):
            pps = kw.pop('pre_populate', None)
            yaml.RoundTripDumper.__init__(self, *args, **kw)
            if pps is not None:
                for pp in pps:
                    try:
                        anchor = pp.yaml_anchor()
                    except AttributeError:
                        anchor = None
                    node = yaml.nodes.MappingNode(
                        u'tag:yaml.org,2002:map', [], flow_style=None, anchor=anchor)
                    self.represented_objects[id(pp)] = node

    def __init__(self, pre_populate=None):
        assert isinstance(pre_populate, list)
        self._pre_populate = pre_populate

    def __call__(self, *args, **kw):
        kw1 = kw.copy()
        kw1['pre_populate'] = self._pre_populate
        myrtd = self.MyRTD(*args, **kw1)
        return myrtd

def update(md, file_name):
    ud = md.pop('userdefaults')
    MyRTD = MyRTDGen([ud])
    yaml.dump(md, sys.stdout, Dumper=MyRTD)
    with open(file_name, 'w') as fp:
        yaml.dump(md, fp, Dumper=MyRTD)

md = load()
md['users']['xxxx2']['timestamp'] = str(datetime.datetime.utcnow())
update(md, 'user.yaml')
Since the PyYAML-based API requires a class instead of an object, you need to use a class generator that actually adds the data elements to pre-populate on the fly from within yaml.dump().
But this doesn't work, as a node only gets written out with an anchor once it is
determined that the anchor is used (i.e. there is a second reference). So actually the
first merge key gets written out as an anchor. And although I am quite familiar
with the code base, I could not get this to work properly in a reasonable amount of time.
So instead, I would just rely on the fact that there is only one key that matches
the first key of users.yaml at the root level of the dump of the combined updated
file and strip anything before that.
import sys
import datetime
from ruamel import yaml

with open('defaults.yaml') as fp:
    defaults_data = fp.read()
with open('user.yaml') as fp:
    user_data = fp.read()
merged_data = yaml.load("{}\n{}".format(defaults_data, user_data),
                        Loader=yaml.RoundTripLoader)

# find the key
for line in user_data.splitlines():
    line = line.split('# ')[0].rstrip()  # end of line comment, not checking for strings
    if line and line[-1] == ':' and line[0] != ' ':
        split_key = line
        break

merged_data['users']['xxxx2']['timestamp'] = str(datetime.datetime.utcnow())

buf = yaml.compat.StringIO()
yaml.dump(merged_data, buf, Dumper=yaml.RoundTripDumper)
document = split_key + buf.getvalue().split('\n' + split_key)[1]
sys.stdout.write(document)
which gives:
users:
  xxxx1:
    <<: *userdefaults
    timestamp: '2018-10-22 11:38:28.541810'
  xxxx2:
    <<: *userdefaults
    timestamp: '2018-10-23 09:59:13.829978'
I had to make a virtualenv to make sure I could run the above with ruamel.yaml==0.13.14.
That version is from the time I was still young (I won't claim to have been innocent).
There have been over 85 releases of the library since then.
I can understand that you might not be able to run anything but Python 2 at the moment and cannot compile/use a newer version. But what you really should do is install virtualenv (can be done using EPEL, but also without further "polluting" your system installation), make a virtualenv for the code you are developing and install the latest version of ruamel.yaml (and your other libraries) in there. You can also do that if you need to distribute your software to other systems, just install virtualenv there as well.
I have all my utilities under /opt/util, and manage them with virtualenvutils, a wrapper around virtualenv.
For writing the user part, you will have to manually split the output of yaml.dump() and write the appropriate part back to the users YAML file.
import datetime
import StringIO
import ruamel.yaml

yaml = ruamel.yaml.YAML(typ='rt')

data = None
with open('defaults.yaml', 'r') as defaults:
    with open('users.yaml', 'r') as users:
        raw = "{}\n{}".format(''.join(defaults.readlines()), ''.join(users.readlines()))
        data = list(yaml.load_all(raw))

data[0]['users']['xxxx1']['timestamp'] = datetime.datetime.now().isoformat()

with open('users.yaml', 'w') as outfile:
    sio = StringIO.StringIO()
    yaml.dump(data[0], sio)
    out = sio.getvalue()
    outfile.write(out.split('\n\n')[1])  # write the second part here as this is the contents of users.yaml