pyyaml: dumping without tags - python

I have
>>> import yaml
>>> yaml.dump(u'abc')
"!!python/unicode 'abc'\n"
But I want
>>> import yaml
>>> yaml.dump(u'abc', magic='something')
'abc\n'
What magic param forces no tagging?

You can use safe_dump instead of dump. Just keep in mind that it won't be able to represent arbitrary Python objects then. Also, when you load the YAML, you will get a str object instead of unicode.
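A quick illustration (Python 2; as in the transcripts later in this thread, PyYAML appends a '...' document-end marker when dumping a bare scalar):
>>> import yaml
>>> yaml.safe_dump(u'abc')
'abc\n...\n'
>>> yaml.safe_load(yaml.safe_dump(u'abc'))
'abc'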

How about this:
def unicode_representer(dumper, uni):
    node = yaml.ScalarNode(tag=u'tag:yaml.org,2002:str', value=uni)
    return node

yaml.add_representer(unicode, unicode_representer)
This seems to make dumping unicode objects work the same as dumping str objects for me (Python 2.6).
In [72]: yaml.dump(u'abc')
Out[72]: 'abc\n...\n'
In [73]: yaml.dump('abc')
Out[73]: 'abc\n...\n'
In [75]: yaml.dump(['abc'])
Out[75]: '[abc]\n'
In [76]: yaml.dump([u'abc'])
Out[76]: '[abc]\n'

You need a new dumper class that does everything the standard Dumper class does but overrides the representers for str and unicode.
from yaml.dumper import Dumper
from yaml.representer import SafeRepresenter

class KludgeDumper(Dumper):
    pass

KludgeDumper.add_representer(str, SafeRepresenter.represent_str)
KludgeDumper.add_representer(unicode, SafeRepresenter.represent_unicode)
Which leads to
>>> print yaml.dump([u'abc',u'abc\xe7'],Dumper=KludgeDumper)
[abc, "abc\xE7"]
>>> print yaml.dump([u'abc',u'abc\xe7'],Dumper=KludgeDumper,encoding=None)
[abc, "abc\xE7"]
Granted, I'm still stumped on how to keep this pretty.
>>> print u'abc\xe7'
abcç
And it breaks a later yaml.load()
>>> yy=yaml.load(yaml.dump(['abc','abc\xe7'],Dumper=KludgeDumper,encoding=None))
>>> yy
['abc', 'abc\xe7']
>>> print yy[1]
abc�
>>> print u'abc\xe7'
abcç
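An aside not in the original answer: PyYAML's allow_unicode=True option should keep the output pretty by emitting the character itself instead of an escape sequence (exact output hedged):
>>> print yaml.dump([u'abc', u'abc\xe7'], Dumper=KludgeDumper,
...                 allow_unicode=True, encoding=None)
[abc, abcç]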

A little addition to interjay's excellent answer: you can keep your unicode on a reload if you take care of your file encodings.
# -*- coding: utf-8 -*-
import yaml
import codecs

data = dict(key=u"abcç\U0001F511")
fn = "test2.yaml"
with codecs.open(fn, "w", encoding="utf-8") as fo:
    yaml.safe_dump(data, fo)
with codecs.open(fn, encoding="utf-8") as fi:
    data2 = yaml.safe_load(fi)
print("data2:", data2, "type(data.key):", type(data2.get("key")))
print data2.get("key")
test2.yaml contents in my editor:
{key: "abc\xE7\uD83D\uDD11"}
print outputs:
('data2:', {'key': u'abc\xe7\U0001f511'}, 'type(data.key):', <type 'unicode'>)
abcç🔑
Plus, after reading http://nedbatchelder.com/blog/201302/war_is_peace.html I am pretty sure that safe_load/safe_dump is where I want to be anyway.

I've just started with Python and YAML, but this may also help. Just compare the outputs:
def test_dump(self):
    print yaml.dump([{'name': 'value'}, {'name2': 1}], explicit_start=True)
    print yaml.dump_all([{'name': 'value'}, {'name2': 1}])
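For reference (exact formatting hedged): explicit_start=True prefixes the document with ---, while dump_all separates the two documents with ---:
>>> print yaml.dump([{'name': 'value'}, {'name2': 1}], explicit_start=True)
---
- {name: value}
- {name2: 1}
>>> print yaml.dump_all([{'name': 'value'}, {'name2': 1}])
{name: value}
--- {name2: 1}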

Related

Python YAML dump into single line

I want to dump a Python object into a YAML string that only contains a single line. However, ruamel.yaml.safe_dump appends newline characters as well as (sometimes) '...'.
Dumping, for example, list or dict objects appends a single newline character:
import ruamel.yaml as yaml
yaml.safe_dump([1, None], default_flow_style=None)
Outputs: '[1, null]\n'
The output I need is: '[1, null]'
When dumping "scalar" objects, even more is appended:
import ruamel.yaml as yaml
yaml.safe_dump(None, default_flow_style=None)
Outputs: 'null\n...\n'
The output I need is: 'null'
Both expected outputs are valid YAML syntax, I think, since
yaml.safe_load('null')
correctly returns None.
Is there any way (besides manually removing the trailing line breaks and '...', which is very hacky) to achieve what I want?
You should not be using the old API in ruamel.yaml; it's deprecated and about to be removed. If you want everything on one line, you should probably use .default_flow_style = True, depending on how complex your data structure can become, and widen the output so you don't get line wraps.
Contrary to JSON, YAML normally appends a newline, so it is best to just transform the output to chop off anything after the first one.
import sys
import ruamel.yaml

class DSL:
    def __init__(self):
        pass

    @property
    def yaml(self):
        # Lazily construct and cache a configured YAML instance.
        try:
            return self._yaml
        except AttributeError:
            pass
        yaml = ruamel.yaml.YAML(typ='safe')
        yaml.default_flow_style = True
        yaml.width = 2048
        self._yaml = yaml
        return yaml

    def __call__(self, data, stream=sys.stdout):
        def strip_nl(s):
            # Keep only the first line; anything other than a bare
            # document-end marker after it means the dump really was multi-line.
            result, rest = s.split('\n', 1)
            if rest not in ['', '...\n']:
                print('multi-line YAML output', repr(rest))
                sys.exit(1)
            return result

        self.yaml.dump(data, stream, transform=strip_nl)

dsl = DSL()
sys.stdout.write('|')
dsl([1, None])
sys.stdout.write('|\n')
sys.stdout.write('|')
dsl(None)
sys.stdout.write('|\n')
sys.stdout.write('|')
dsl(dict(a=[1, None], b=42))
sys.stdout.write('|\n')
which gives:
|[1, null]|
|null|
|{a: [1, null], b: 42}|

how to easily read Python built-in types from a file

I have a file which lists values of some Python built-in types: None, integers, and strings, with proper Python syntax, including escaping. For example, the file might look like this:
2
"""\\nfoo
bar
""" 'foo bar'
None
I then want to read that file into the array of the values. For the above example, the array would be:
[2, '\\nfoo\nbar\n', 'foo bar', None]
I can do this by careful parsing and/or using the split function.
Is there an easy way to do it?
I would recommend changing your file format. That said, what you have is parseable. It might get harder to parse if you have multi-token values like lists, but with only None, ints, and strings, you can tokenize the input with tokenize and parse it with something like ast.literal_eval:
import tokenize
import ast

values = []
with open('input_file') as f:
    for token_type, token_string, _, _, _ in tokenize.generate_tokens(f.readline):
        # Ignore newlines and the file-ending dummy token.
        if token_type in (tokenize.ENDMARKER, tokenize.NEWLINE, tokenize.NL):
            continue
        values.append(ast.literal_eval(token_string))
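For the sample file above, this should leave values as [2, '\\nfoo\nbar\n', 'foo bar', None].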
You can use ast.literal_eval:
>>> import ast
>>> ast.literal_eval('2')
2
>>> type(ast.literal_eval('2'))
<type 'int'>
>>> ast.literal_eval('[1,2,3]')
[1, 2, 3]
>>> type(ast.literal_eval('[1,2,3]'))
<type 'list'>
>>> ast.literal_eval('"a"')
'a'
>>> type(ast.literal_eval('"a"'))
<type 'str'>
This almost gets you there, but due to the way strings work, it ends up combining the two strings:
import ast

with open('tokens.txt') as in_file:
    current_string = ''
    tokens = []
    for line in in_file:
        current_string += line.strip()
        try:
            new_token = ast.literal_eval(current_string)
            tokens.append(new_token)
            current_string = ''
        except SyntaxError:
            print("Couldn't parse current line, combining with next")
tokens
Out[8]: [2, '\\nfoobarfoo bar', None]
The problem is that in Python, if you have two string literals sitting next to each other, they concatenate even if you don't use +, e.g.:
x = 'string1' 'string2'
x
Out[10]: 'string1string2'
I apologize for posting an answer to my own question, but what works is to replace unquoted whitespace (including newlines) with commas, wrap the whole thing in [], and evaluate it as a list literal.
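A minimal sketch of that idea, assuming the whole file fits in memory; read_values is a made-up name. Tokenizing first means whitespace inside quoted strings is never touched, and the inserted commas stop adjacent string literals from concatenating:
import ast
import io
import tokenize

def read_values(path):
    with io.open(path) as f:
        literals = [tok_string
                    for tok_type, tok_string, _, _, _
                    in tokenize.generate_tokens(f.readline)
                    if tok_type not in (tokenize.NEWLINE, tokenize.NL,
                                        tokenize.ENDMARKER)]
    # Commas between tokens plus surrounding [] turn the file into a
    # single list literal that ast.literal_eval can parse in one go.
    return ast.literal_eval('[' + ', '.join(literals) + ']')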

Storing a UNC in JSON and loading into a dict

I have a situation where a JSON configuration document, editable by users, needs to be loaded into a dictionary in my application.
One specific scenario causing problems is a windows UNC path, such as:
\\server\share\file_path
So, valid JSON for this would intuitively be:
{"foo" : "\\\server\\share\\file_path"}
however this is invalid.
I'm going in circles with this. Here are some trials:
# starting with a json string
>>> x = '{"foo" : "\\\server\\share\\file_path"}'
>>> json.loads(x)
ValueError: Invalid \escape: line 1 column 18 (char 18)
# that didn't work, let's try to reverse engineer a dict that's correct
>>> d = {"foo":"\\server\share\file_path"}
>>> d["foo"]
'\\server\\share\x0cile_path'
# good grief, where'd my "f" go?
SUMMARY
How do I create a properly formatted JSON document that includes \\server\share\file_path?
How do I load that string into a dictionary that will return the exact value?
You're running into the escape sequences supported by Python string literals. Using raw strings, this becomes clearer:
>>> d = {"foo":"\\server\share\file_path"}
>>> d
{'foo': '\\server\\share\x0cile_path'}
>>> d = {"foo": r"\\server\share\file_path"}
>>> d
{'foo': '\\\\server\\share\\file_path'}
>>> import json
>>> json.dumps(d)
'{"foo": "\\\\\\\\server\\\\share\\\\file_path"}'
>>> with open('out.json', 'w') as f: f.write(json.dumps(d))
...
>>>
$ cat out.json
{"foo": "\\\\server\\share\\file_path"}
Without raw strings, you must "escape all the things!"
>>> d = {"foo":"\\server\share\file_path"}
>>> d
{'foo': '\\server\\share\x0cile_path'}
>>> d = {"foo":"\\\\server\\share\\file_path"}
>>> d
{'foo': '\\\\server\\share\\file_path'}
>>> print d['foo']
\\server\share\file_path
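To close the loop on the second part of the question, a round trip recovers the exact value (Python 2 REPL):
>>> import json
>>> s = json.dumps({"foo": r"\\server\share\file_path"})
>>> print json.loads(s)["foo"]
\\server\share\file_path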

Python readline() from a string?

In Python, is there a built-in way to do a readline() on a string? I have a large chunk of data and want to strip off just the first couple of lines without doing split() on the whole string.
Hypothetical example:
def handleMessage(msg):
    headerTo = msg.readline()
    headerFrom = msg.readline()
    sendMessage(headerTo, headerFrom, msg)
msg = "Bob Smith\nJane Doe\nJane,\nPlease order more widgets\nThanks,\nBob\n"
handleMessage(msg)
I want this to result in:
sendMessage("Bob Smith", "Jane Doe", "Jane,\nPlease order...")
I know it would be fairly easy to write a class that does this, but I'm looking for something built-in if possible.
EDIT: Python v2.7
In Python 3, you can use io.StringIO:
>>> msg = "Bob Smith\nJane Doe\nJane,\nPlease order more widgets\nThanks,\nBob\n"
>>> msg
'Bob Smith\nJane Doe\nJane,\nPlease order more widgets\nThanks,\nBob\n'
>>>
>>> import io
>>> buf = io.StringIO(msg)
>>> buf.readline()
'Bob Smith\n'
>>> buf.readline()
'Jane Doe\n'
>>> len(buf.read())
44
In Python 2, you can use StringIO (or cStringIO if performance is important):
>>> import StringIO
>>> buf = StringIO.StringIO(msg)
>>> buf.readline()
'Bob Smith\n'
>>> buf.readline()
'Jane Doe\n'
The easiest way, for both Python 2 and 3, is the string method splitlines(), which returns a list of lines:
>>> "some\nmultiline\nstring\n".splitlines()
['some', 'multiline', 'string']
Why not just do as many splits as you need? Since you're using all of the resulting parts (including the rest of the string), loading it into some other buffer object and then reading it back out again is probably going to be slower, not faster (plus the overhead of the function calls).
If you want the first N lines separated out, just do .split("\n", N).
>>> foo = "ABC\nDEF\nGHI\nJKL"
>>> foo.split("\n", 1)
['ABC', 'DEF\nGHI\nJKL']
>>> foo.split("\n", 2)
['ABC', 'DEF', 'GHI\nJKL']
So for your function:
def handleMessage(msg):
    headerTo, headerFrom, msg = msg.split("\n", 2)
    sendMessage(headerTo, headerFrom, msg)
or if you really wanted to get fancy:
# either...
def handleMessage(msg):
    sendMessage(*msg.split("\n", 2))

# or just...
handleMessage = lambda msg: sendMessage(*msg.split("\n", 2))
Do it like StringIO does it:
i = self.buf.find('\n', self.pos)
So this means:
pos = msg.find("\n")
first_line = msg[:pos]
...
Seems more elegant than using the whole StringIO...
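Fleshing that out into a small helper (a sketch, not part of the original answer; readline_from is a made-up name):
def readline_from(s, pos=0):
    # Return (line_including_newline, new_pos), mimicking file.readline().
    i = s.find('\n', pos)
    if i == -1:
        return s[pos:], len(s)
    return s[pos:i + 1], i + 1

headerTo, pos = readline_from(msg)
headerFrom, pos = readline_from(msg, pos)
sendMessage(headerTo.rstrip('\n'), headerFrom.rstrip('\n'), msg[pos:])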
In Python, strings have the splitlines method:
msg = "Bob Smith\nJane Doe\nJane,\nPlease order more widgets\nThanks,\nBob\n"
msg_splitlines = msg.splitlines()
headerTo = msg_splitlines[0]
headerFrom = msg_splitlines[1]
# Pass along the remainder of the message, not the full original string.
sendMessage(headerTo, headerFrom, "\n".join(msg_splitlines[2:]))

Python strings / match case

I have a CSV file which has the following format:
id,case1,case2,case3
Here is a sample:
123,null,X,Y
342,X,X,Y
456,null,null,null
789,null,null,X
For each line I need to know which of the cases is not null. Is there an easy way to find out which case(s) are not null without splitting the string and going through each element?
This is what the result should look like:
123,case2:case3
342,case1:case2:case3
456:None
789:case3
You probably want to take a look at the CSV module, which has readers and writers that will enable you to create transforms.
>>> from StringIO import StringIO
>>> from csv import DictReader
>>> fh = StringIO("""
... id,case1,case2,case3
...
... 123,null,X,Y
...
... 342,X,X,Y
...
... 456,null,null,null
...
... 789,null,null,X
... """.strip())
>>> dr = DictReader(fh)
>>> dr.next()
{'case1': 'null', 'case3': 'Y', 'case2': 'X', 'id': '123'}
At which point you can do something like:
>>> from csv import DictWriter
>>> out_fh = StringIO()
>>> writer = DictWriter(out_fh, fieldnames=dr.fieldnames)
>>> for mapping in dr:
...     writer.writerow(dict((k, v) for k, v in mapping.items() if v != 'null'))
...
Replace out_fh with the filehandle that you'd like to output to; dr.fieldnames is a real attribute on DictReader.
Any way you slice it, you are still going to have to go through the list. There are more and less elegant ways to do it. Depending on the Python version you are using, you can use list comprehensions:
ids = line.split(",")
print "%s:%s" % (ids[0], ":".join(["case%d" % x for x in range(1, len(ids)) if ids[x] != "null"]))
Why do you treat splitting as a problem? For performance reasons?
You could technically avoid splitting with smart regexps, like:
\d+,null,\w+,\w+
\d+,\w+,null,\w+
...
but I find that a worse solution than parsing the data into lists.
You could use the Python csv module, which comes with the standard installation of Python... It will not be much easier, though...
