Yajl parse error with githubarchive.org JSON stream in Python - python

I'm trying to parse a GitHub archive file with yajl-py. I believe the basic format of the file is a stream of JSON objects, so the file itself is not valid JSON, but it contains objects which are.
To test this out, I installed yajl-py and then used their example parser (from https://github.com/pykler/yajl-py/blob/master/examples/yajl_py_example.py) to try to parse a file:
python yajl_py_example.py < 2012-03-12-0.json
where 2012-03-12-0.json is one of the GitHub archive files that's been decompressed.
It appears this sort of thing should work from their reference implementation in Ruby. Do the Python packages not handle JSON streams?
By the way, here's the error I get:
yajl.yajl_common.YajlError: parse error: trailing garbage
9478bbc3","type":"PushEvent"}{"repository":{"url":"https://g
(right here) ------^

You need to use a stream parser to read the data. Yajl supports stream parsing, which allows you to read one object at a time from a file/stream. Having said that, it doesn't look like Python has working bindings for Yajl.
py-yajl has iterload commented out, not sure why: https://github.com/rtyler/py-yajl/commit/a618f66005e9798af848c15d9aa35c60331e6687#L1R264
Not a Python solution, but you can use Ruby bindings to read in the data and emit it in a format you need:
# gem install yajl-ruby
require 'open-uri'
require 'zlib'
require 'yajl'
gz = open('http://data.githubarchive.org/2012-03-11-12.json.gz')
js = Zlib::GzipReader.new(gz).read
Yajl::Parser.parse(js) do |event|
  print event
end

The example does not enable any of Yajl's extra features. For what you are looking for, you need to enable the allow_multiple_values flag on the parser. Here is what you need to change in the basic example to have it parse your file:
--- a/examples/yajl_py_example.py
+++ b/examples/yajl_py_example.py
@@ -37,6 +37,7 @@ class ContentHandler(YajlContentHandler):
 
 def main(args):
     parser = YajlParser(ContentHandler())
+    parser.allow_multiple_values = True
     if args:
         for fn in args:
             f = open(fn)
Yajl-Py is a thin wrapper around yajl, so you can use all the features Yajl provides. Here are all the flags that yajl provides that you can enable:
yajl_allow_comments
yajl_dont_validate_strings
yajl_allow_trailing_garbage
yajl_allow_multiple_values
yajl_allow_partial_values
To turn these on in yajl-py you do the following:
parser = YajlParser(ContentHandler())
# enabling these features, note that to make it more pythonic, the prefix `yajl_` was removed
parser.allow_comments = True
parser.dont_validate_strings = True
parser.allow_trailing_garbage = True
parser.allow_multiple_values = True
parser.allow_partial_values = True
# then go ahead and parse
parser.parse()

I know this has been answered, but I prefer the following approach, and it does not use any packages. The GitHub data is all on a single line for some reason, so you cannot assume a single dictionary per line. It looks like this:
{"json-key":"json-val", "sub-dict":{"sub-key":"sub-val"}}{"json-key2":"json-val2", "sub-dict2":{"sub-key2":"sub-val2"}}
I decided to create a function which fetches one dictionary at a time. It returns the JSON as a string.
def read_next_dictionary(f):
    depth = 0
    json_str = ""
    while True:
        c = f.read(1)
        if not c:
            break  # EOF
        json_str += str(c)
        if c == '{':
            depth += 1
        elif c == '}':
            depth -= 1
            if depth == 0:
                break
    return json_str
I used this function to loop through the Github archive with a while loop:
import json
import pprint

arr_of_dicts = []
f = open(file_path)
while True:
    json_as_str = read_next_dictionary(f)
    try:
        json_dict = json.loads(json_as_str)
        arr_of_dicts.append(json_dict)
    except ValueError:
        break  # exception on loading json ends the loop

pprint.pprint(arr_of_dicts)
This works on the dataset posted here: http://www.githubarchive.org/ (after gunzip).

As a workaround you can split the GitHub Archive files into lines and then parse each line as json:
import json

with open('2013-05-31-10.json') as f:
    lines = f.read().splitlines()

for line in lines:
    rec = json.loads(line)
    ...
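If you would rather stay with the standard library and also cope with the single-line case described in the earlier answer, json.JSONDecoder.raw_decode can peel concatenated objects off a string one at a time. This is only a sketch and is not from the original answers; the file name is the one used in the question.
import json

def iter_json_objects(text):
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(text):
        # skip whitespace/newlines between objects
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx >= len(text):
            break
        obj, end = decoder.raw_decode(text, idx)
        yield obj
        idx = end

with open('2012-03-12-0.json') as f:
    for event in iter_json_objects(f.read()):
        print(event.get('type'))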

Related

python ruamel.yaml package, how to get header comment lines?

I want to get the comments at the head of a YAML file, like:
# 11111111111111111
# 11111111111111111
# 22222222222222222
# bbbbbbbbbbbbbbbbb
---
start:
....
I used the ca attribute on the loaded data, but found that these comments are not on it. Is there any other way to get these comments?
Currently (ruamel.yaml==0.17.17) the comments that occur
before the document start token (---) are not passed on from the
DocumentStartToken to the DocumentStartEvent, so these comments are
effectively lost during parsing. Even if they were passed on, it is
non-trivial to preserve them as the DocumentStartEvent is silently
dropped during composition.
You can either put the comments after the end of directives indicator
(---) which allows you to get at the comments using the .ca
attribute without a problem, or remove that indicator altogether as it
is superfluous (at least in your example). Alternatively you will have to
write a small wrapper around the loader:
import sys
import pathlib
import ruamel.yaml

fn = pathlib.Path('input.yaml')

def load_with_pre_directives_comments(yaml, path):
    comments = []
    text = path.read_text()
    if '\n---\n' not in text and '\n--- ' not in text:
        return yaml.load(text), comments
    for line in text.splitlines(True):
        if line.lstrip().startswith('#'):
            comments.append(line)
        elif line.startswith('---'):
            return yaml.load(text), comments

yaml = ruamel.yaml.YAML()
yaml.explicit_start = True

data, comments = load_with_pre_directives_comments(yaml, fn)
print(''.join(comments), end='')
yaml.dump(data, sys.stdout)
which gives:
# 11111111111111111
# 11111111111111111
# 22222222222222222
# bbbbbbbbbbbbbbbbb
---
start: 42
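For completeness, here is a short sketch of the first alternative mentioned above: move the comments after the --- marker and read them back via the .ca attribute. Exactly where ruamel.yaml hangs the tokens inside .ca varies between versions, so the sketch only prints the attribute for inspection rather than assuming a fixed layout.
import ruamel.yaml

yaml_str = """\
---
# 11111111111111111
# 22222222222222222
start: 42
"""

yaml = ruamel.yaml.YAML()
data = yaml.load(yaml_str)
print(data['start'])  # -> 42
print(data.ca)        # the header comment tokens are reachable from here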

Force Sphinx to interpret Markdown in Python docstrings instead of reStructuredText

I'm using Sphinx to document a python project. I would like to use Markdown in my docstrings to format them. Even if I use the recommonmark extension, it only covers the .md files written manually, not the docstrings.
I use autodoc, napoleon and recommonmark in my extensions.
How can I make sphinx parse markdown in my docstrings?
Sphinx's Autodoc extension emits an event named autodoc-process-docstring every time it processes a doc-string. We can hook into that mechanism to convert the syntax from Markdown to reStructuredText.
Unfortunately, Recommonmark does not expose a Markdown-to-reST converter. It maps the parsed Markdown directly to a Docutils object, i.e., the same representation that Sphinx itself creates internally from reStructuredText.
Instead, I use Commonmark for the conversion in my projects. Because it's fast — much faster than Pandoc, for example. Speed is important as the conversion happens on the fly and handles each doc-string individually. Other than that, any Markdown-to-reST converter would do. M2R2 would be a third example. The downside of any of these is that they do not support Recommonmark's syntax extensions, such as cross-references to other parts of the documentation. Just the basic Markdown.
To plug in the Commonmark doc-string converter, make sure that package is installed (pip install commonmark) and add the following to Sphinx's configuration file conf.py:
import commonmark

def docstring(app, what, name, obj, options, lines):
    md = '\n'.join(lines)
    ast = commonmark.Parser().parse(md)
    rst = commonmark.ReStructuredTextRenderer().render(ast)
    lines.clear()
    lines += rst.splitlines()

def setup(app):
    app.connect('autodoc-process-docstring', docstring)
Meanwhile, Recommonmark was deprecated in May 2021. The Sphinx extension MyST, a more feature-rich Markdown parser, is the replacement recommended by Sphinx and by Read-the-Docs. MyST does not yet support Markdown in doc-strings either, but the same hook as above can be used to get limited support via Commonmark.
A possible alternative to the approach outlined here is using MkDocs with the MkDocStrings plug-in, which would eliminate Sphinx and reStructuredText entirely from the process.
Building on john-hennig's answer, the following will keep reStructuredText fields like :py:attr:, :py:class:, etc. This allows you to reference other classes, etc.
import re
import commonmark

py_attr_re = re.compile(r"\:py\:\w+\:(``[^:`]+``)")

def docstring(app, what, name, obj, options, lines):
    md = '\n'.join(lines)
    ast = commonmark.Parser().parse(md)
    rst = commonmark.ReStructuredTextRenderer().render(ast)
    lines.clear()
    lines += rst.splitlines()
    for i, line in enumerate(lines):
        while True:
            match = py_attr_re.search(line)
            if match is None:
                break
            start, end = match.span(1)
            line_start = line[:start]
            line_end = line[end:]
            line_modify = line[start:end]
            line = line_start + line_modify[1:-1] + line_end
        lines[i] = line

def setup(app):
    app.connect('autodoc-process-docstring', docstring)
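As a quick illustration of what that rewrite does (the sample line is hypothetical): the Commonmark renderer presumably emits Markdown inline code as rST double-backtick literals, so a role written as :py:attr:`MyClass.value` in the docstring comes back as :py:attr:``MyClass.value``; stripping one pair of backticks turns it back into a valid Sphinx role.
import re

py_attr_re = re.compile(r"\:py\:\w+\:(``[^:`]+``)")

line = "See :py:attr:``MyClass.value`` for details."
match = py_attr_re.search(line)
start, end = match.span(1)
print(line[:start] + line[start:end][1:-1] + line[end:])
# -> See :py:attr:`MyClass.value` for details.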
I had to extend the accepted answer by john-hen to allow multi-line descriptions of Args: entries to be considered a single parameter:
def docstring(app, what, name, obj, options, lines):
    wrapped = []
    literal = False
    for line in lines:
        if line.strip().startswith(r'```'):
            literal = not literal
        if not literal:
            line = ' '.join(x.rstrip() for x in line.split('\n'))
        indent = len(line) - len(line.lstrip())
        if indent and not literal:
            wrapped.append(' ' + line.lstrip())
        else:
            wrapped.append('\n' + line.strip())
    ast = commonmark.Parser().parse(''.join(wrapped))
    rst = commonmark.ReStructuredTextRenderer().render(ast)
    lines.clear()
    lines += rst.splitlines()

def setup(app):
    app.connect('autodoc-process-docstring', docstring)
The current john-hennig answer is great, but seems to fail for multi-line Args: in Python style. Here was my fix:
def docstring(app, what, name, obj, options, lines):
    md = "\n".join(lines)
    ast = commonmark.Parser().parse(md)
    rst = commonmark.ReStructuredTextRenderer().render(ast)
    lines.clear()
    lines += _normalize_docstring_lines(rst.splitlines())

def _normalize_docstring_lines(lines: list[str]) -> list[str]:
    """Fix an issue with multi-line args which are incorrectly parsed.

    ```
    Args:
      x: My multi-line description which fit on multiple lines
        and continue in this line.
    ```

    Is parsed as (missing indentation):

    ```
    :param x: My multi-line description which fit on multiple lines
    and continue in this line.
    ```

    Instead of:

    ```
    :param x: My multi-line description which fit on multiple lines
        and continue in this line.
    ```
    """
    is_param_field = False
    new_lines = []
    for l in lines:
        if l.lstrip().startswith(":param"):
            is_param_field = True
        elif is_param_field:
            if not l.strip():  # Blank line reset param
                is_param_field = False
            else:  # Restore indentation
                l = " " + l.lstrip()
        new_lines.append(l)
    return new_lines

def setup(app):
    app.connect("autodoc-process-docstring", docstring)

Loading and dumping multiple yaml files with ruamel.yaml (python)

Using python 2 (atm) and ruamel.yaml 0.13.14 (RedHat EPEL)
I'm currently writing some code to load yaml definitions, but they are split up in multiple files. The user-editable part contains eg.
users:
  xxxx1:
    timestamp: '2018-10-22 11:38:28.541810'
    << : *userdefaults
  xxxx2:
    << : *userdefaults
    timestamp: '2018-10-22 11:38:28.541810'
the defaults are stored in another file, which is not editable:
userdefaults: &userdefaults
# Default values for user settings
fileCountQuota: 1000
diskSizeQuota: "300g"
I can process these together by loading both and concatenating the strings, and then running them through merged_data = list(yaml.load_all("{}\n{}".format(defaults_data, user_data), Loader=yaml.RoundTripLoader)), which correctly resolves everything. (When not using RoundTripLoader I get errors that the references cannot be resolved, which is normal.)
Now, I want to do some updates via Python code (eg. update the timestamp), and for that I need to write back just the user part. And that's where things get hairy. So far I haven't found a way to write just that YAML document, not both.
First of all, unless there are multiple documents in your defaults file, you don't have to use load_all, as you don't concatenate two documents into a multiple-document stream. If you had, by using a format string with a document-end marker ("{}\n...\n{}") or with a directives-end marker ("{}\n---\n{}"), your aliases would not carry over from one document to another, as per the YAML specification:
It is an error for an alias node to use an anchor that does not
previously occur in the document.
The anchor has to be in the document, not just in the stream (which can consist of multiple
documents).
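To make that concrete, here is a small illustration (file contents inlined as strings, using the same old-style API as the code below; this is not from the original answer): the *userdefaults alias only resolves when its anchor occurs earlier in the same document.
from ruamel import yaml

defaults = "userdefaults: &userdefaults\n  fileCountQuota: 1000\n"
users = "users:\n  xxxx1:\n    <<: *userdefaults\n"

# one document: the anchor is visible to the alias, so this loads fine
print(yaml.load(defaults + "\n" + users, Loader=yaml.RoundTripLoader))

# two documents: the alias now lives in a different document than its anchor,
# so composing fails (typically with a "found undefined alias" error)
try:
    list(yaml.load_all(defaults + "\n---\n" + users, Loader=yaml.RoundTripLoader))
except Exception as exc:
    print(type(exc).__name__, exc)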
I tried some hocus pocus, pre-populating the already represented dictionary
of anchored nodes:
import sys
import datetime
from ruamel import yaml

def load():
    with open('defaults.yaml') as fp:
        defaults_data = fp.read()
    with open('user.yaml') as fp:
        user_data = fp.read()
    merged_data = yaml.load("{}\n{}".format(defaults_data, user_data),
                            Loader=yaml.RoundTripLoader)
    return merged_data

class MyRTDGen(object):
    class MyRTD(yaml.RoundTripDumper):
        def __init__(self, *args, **kw):
            pps = kw.pop('pre_populate', None)
            yaml.RoundTripDumper.__init__(self, *args, **kw)
            if pps is not None:
                for pp in pps:
                    try:
                        anchor = pp.yaml_anchor()
                    except AttributeError:
                        anchor = None
                    node = yaml.nodes.MappingNode(
                        u'tag:yaml.org,2002:map', [], flow_style=None, anchor=anchor)
                    self.represented_objects[id(pp)] = node

    def __init__(self, pre_populate=None):
        assert isinstance(pre_populate, list)
        self._pre_populate = pre_populate

    def __call__(self, *args, **kw):
        kw1 = kw.copy()
        kw1['pre_populate'] = self._pre_populate
        myrtd = self.MyRTD(*args, **kw1)
        return myrtd

def update(md, file_name):
    ud = md.pop('userdefaults')
    MyRTD = MyRTDGen([ud])
    yaml.dump(md, sys.stdout, Dumper=MyRTD)
    with open(file_name, 'w') as fp:
        yaml.dump(md, fp, Dumper=MyRTD)

md = load()
md['users']['xxxx2']['timestamp'] = str(datetime.datetime.utcnow())
update(md, 'user.yaml')
Since the PyYAML based API requires a class instead of an object, you need to use a class generator, which actually adds the data elements to pre-populate on the fly from within yaml.dump().
But this doesn't work, as a node only gets written out with an anchor once it is
determined that the anchor is used (i.e. there is a second reference). So actually the
first merge key gets written out as an anchor. And although I am quite familiar
with the code base, I could not get this to work properly in a reasonable amount of time.
So instead, I would just rely on the fact that there is only one key that matches
the first key of users.yaml at the root level of the dump of the combined updated
file and strip anything before that.
import sys
import datetime
from ruamel import yaml

with open('defaults.yaml') as fp:
    defaults_data = fp.read()
with open('user.yaml') as fp:
    user_data = fp.read()
merged_data = yaml.load("{}\n{}".format(defaults_data, user_data),
                        Loader=yaml.RoundTripLoader)

# find the key
for line in user_data.splitlines():
    line = line.split('# ')[0].rstrip()  # end of line comment, not checking for strings
    if line and line[-1] == ':' and line[0] != ' ':
        split_key = line
        break

merged_data['users']['xxxx2']['timestamp'] = str(datetime.datetime.utcnow())
buf = yaml.compat.StringIO()
yaml.dump(merged_data, buf, Dumper=yaml.RoundTripDumper)
document = split_key + buf.getvalue().split('\n' + split_key)[1]
sys.stdout.write(document)
which gives:
users:
  xxxx1:
    <<: *userdefaults
    timestamp: '2018-10-22 11:38:28.541810'
  xxxx2:
    <<: *userdefaults
    timestamp: '2018-10-23 09:59:13.829978'
I had to make a virtualenv to make sure I could run the above with ruamel.yaml==0.13.14.
That version is from the time I was still young (I won't claim to have been innocent).
There have been over 85 releases of the library since then.
I can understand that you might not be able to run anything but
Python2 at the moment and cannot compile/use a newer version. But what
you really should do is install virtualenv (can be done using EPEL, but also without
further "polluting" your system installation), make a virtualenv for the
code you are developing and install the latest version of ruamel.yaml (and
your other libraries) in there. You can also do that if you need
to distribute your software to other systems, just install virtualenv there as well.
I have all my utilities under /opt/util, managed with virtualenvutils, a wrapper around virtualenv.
For writing the user part, you will have to manually split the output of yaml.dump() and write the appropriate part back to the users yaml file.
import datetime
import StringIO
import ruamel.yaml

yaml = ruamel.yaml.YAML(typ='rt')
data = None
with open('defaults.yaml', 'r') as defaults:
    with open('users.yaml', 'r') as users:
        raw = "{}\n{}".format(''.join(defaults.readlines()), ''.join(users.readlines()))
        data = list(yaml.load_all(raw))

data[0]['users']['xxxx1']['timestamp'] = datetime.datetime.now().isoformat()

with open('users.yaml', 'w') as outfile:
    sio = StringIO.StringIO()
    yaml.dump(data[0], sio)
    out = sio.getvalue()
    outfile.write(out.split('\n\n')[1])  # write the second part here as this is the contents of users.yaml

Adapting Python Code to Extract Mentions from Tweets for Mention Network

Background:
I found this code from Alex Hanna (https://github.com/alexhanna) and I've been trying to adapt it to my situation, but haven't been as successful as I would like, and was hoping for help.
This code takes JSON output from the Twitter API and extracts user mentions from it to create a mention network. Currently, the code is executed as follows:
cat file.json | mentionMapper.py
Desires:
I would like to change it so that the python code reads in the JSON file. Previously I've used the code below, but haven't been able to figure out how to adapt it for this instance.
for line in in_file:
    try:
        tweet = json.loads(line)
    except:
        pass
I would also like the code to write the output to a CSV file instead of printing it to the screen or having to manually direct the output using > after the code.
Ultimate Goal:
Edgelist of Twitter users who mention one another.
Code:
import json
import sys

def main():
    for line in sys.stdin:
        line = line.strip()
        data = ''
        try:
            data = json.loads(line)
        except ValueError as detail:
            continue

        if not (isinstance(data, dict)):
            ## not a dictionary, skip
            pass
        elif 'delete' in data:
            ## a delete element, skip for now.
            pass
        elif 'user' not in data:
            ## bizarre userless edge case
            pass
        else:
            if 'entities' in data and len(data['entities']['user_mentions']) > 0:
                user = data['user']
                user_mentions = data['entities']['user_mentions']
                for u2 in user_mentions:
                    print ",".join([
                        user['screen_name'],
                        u2['screen_name'],
                        "1"
                    ])

if __name__ == '__main__':
    main()
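Not from the original thread, but here is a sketch of the two requested changes, written for Python 3: read the JSON file directly instead of stdin, and write the edge list with the standard-library csv module. The file names are placeholders.
import csv
import json

def build_mention_edgelist(json_path, csv_path):
    with open(json_path) as in_file, open(csv_path, 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(['source', 'target', 'weight'])
        for line in in_file:
            try:
                data = json.loads(line)
            except ValueError:
                continue
            # skip deletes, userless edge cases and anything that isn't a dict
            if not isinstance(data, dict) or 'delete' in data or 'user' not in data:
                continue
            for u2 in data.get('entities', {}).get('user_mentions', []):
                writer.writerow([data['user']['screen_name'], u2['screen_name'], 1])

build_mention_edgelist('file.json', 'mentions.csv')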

python process communications via pipes: Race condition

So I have two Python 3.2 processes that need to communicate with each other. Most of the information that needs to be communicated consists of standard dictionaries. Named pipes seemed like the way to go, so I made a Pipe class that can be instantiated in both processes. This class implements a very basic protocol for getting information around.
My problem is that sometimes it works, sometimes it doesn't. There seems to be no pattern to this behavior except the place where the code fails.
Here are the bits of the Pipe class that matter. Shout if you want more code:
class Pipe:
    """
    there are a bunch of constants set up here. I dont think it would be useful to include them. Just think like this: Pipe.WHATEVER = 'WHATEVER'
    """
    def __init__(self, sPath):
        """
        create the fifo. if it already exists just associate with it
        """
        self.sPath = sPath
        if not os.path.exists(sPath):
            os.mkfifo(sPath)
        self.iFH = os.open(sPath, os.O_RDWR | os.O_NONBLOCK)
        self.iFHBlocking = os.open(sPath, os.O_RDWR)

    def write(self, dMessage):
        """
        write the dict to the fifo
        if dMessage is not a dictionary then there will be an exception here. There never is
        """
        self.writeln(Pipe.MESSAGE_START)
        for k in dMessage:
            self.writeln(Pipe.KEY)
            self.writeln(k)
            self.writeln(Pipe.VALUE)
            self.writeln(dMessage[k])
        self.writeln(Pipe.MESSAGE_END)

    def writeln(self, s):
        os.write(self.iFH, bytes('{0} : {1}\n'.format(Pipe.LINE_START, len(s)+1), 'utf-8'))
        os.write(self.iFH, bytes('{0}\n'.format(s), 'utf-8'))
        os.write(self.iFH, bytes(Pipe.LINE_END+'\n', 'utf-8'))

    def readln(self):
        """
        look for LINE_START, get line size
        read until LINE_END
        clean up
        return string
        """
        iLineStartBaseLength = len(self.LINE_START)+3  #'{0} : '
        try:
            s = os.read(self.iFH, iLineStartBaseLength).decode('utf-8')
        except:
            return Pipe.READLINE_FAIL
        if Pipe.LINE_START in s:
            #get the length of the line
            sLineLen = ''
            while True:
                try:
                    sCurrent = os.read(self.iFH, 1).decode('utf-8')
                except:
                    return Pipe.READLINE_FAIL
                if sCurrent == '\n':
                    break
                sLineLen += sCurrent
            try:
                iLineLen = int(sLineLen.strip(string.punctuation+string.whitespace))
            except:
                raise Exception('Not a valid line length: "{0}"'.format(sLineLen))
            #read the line
            sLine = os.read(self.iFHBlocking, iLineLen).decode('utf-8')
            #read the line terminator
            sTerm = os.read(self.iFH, len(Pipe.LINE_END+'\n')).decode('utf-8')
            if sTerm == Pipe.LINE_END+'\n':
                return sLine
            return Pipe.READLINE_FAIL
        else:
            return Pipe.READLINE_FAIL

    def read(self):
        """
        read from the fifo, make a dict
        """
        dRet = {}
        sKey = ''
        sValue = ''
        sCurrent = None

        def value_flush():
            nonlocal dRet, sKey, sValue, sCurrent
            if sKey:
                dRet[sKey.strip()] = sValue.strip()
            sKey = ''
            sValue = ''
            sCurrent = ''

        if self.message_start():
            while True:
                sLine = self.readln()
                if Pipe.MESSAGE_END in sLine:
                    value_flush()
                    return dRet
                elif Pipe.KEY in sLine:
                    value_flush()
                    sCurrent = Pipe.KEY
                elif Pipe.VALUE in sLine:
                    sCurrent = Pipe.VALUE
                else:
                    if sCurrent == Pipe.VALUE:
                        sValue += sLine
                    elif sCurrent == Pipe.KEY:
                        sKey += sLine
        else:
            return Pipe.NO_MESSAGE
It sometimes fails here (in readln):
try:
    iLineLen = int(sLineLen.strip(string.punctuation+string.whitespace))
except:
    raise Exception('Not a valid line length: "{0}"'.format(sLineLen))
It doesn't fail anywhere else.
An example error is:
Not a valid line length: "KE 17"
The fact that it's intermittent says to me that it's due to some kind of race condition, I'm just struggling to figure out what it might be. Any ideas?
EDIT added stuff about calling processes
How the Pipe is used: it is instantiated in process A and process B by calling the constructor with the same path. Process A will then intermittently write to the Pipe and process B will try to read from it. At no point do I ever try to get the thing acting as a two-way.
Here is a more long winded explanation of the situation. I've been trying to keep the question short but I think it's about time I give up on that. Anyhoo, I have a daemon and a Pyramid process that need to play nice. There are two Pipe instances in use: One that only Pyramid writes to, and one that only the daemon writes to. The stuff Pyramid writes is really short, I have experienced no errors on this pipe. The stuff that the daemon writes is much longer, this is the pipe that's giving me grief. Both pipes are implemented in the same way. Both processes only write dictionaries to their respective Pipes (if this were not the case then there would be an exception in Pipe.write).
The basic algorithm is: Pyramid spawns the daemon, the daemon loads a crazy object hierarchy of doom with vast RAM consumption. Pyramid sends POST requests to the daemon, which then does a whole bunch of calculations and sends data to Pyramid so that a human-friendly page can be rendered. The human can then respond to what's in the hierarchy by filling in HTML forms and suchlike, thus causing Pyramid to send another dictionary to the daemon, and the daemon to send back a dictionary response.
So: only one pipe has exhibited any problems, the problem pipe has a lot more traffic than the other one, and it is guaranteed that only dictionaries are written to either.
EDIT as response to question and comment
Before you tell me to take out the try...except stuff read on.
The fact that the exception gets raised at all is what is bothering me. iLineLen = int(stuff) looks to me like it should always be passed a string that looks like an integer. This is the case only most of the time, not all of it. So if you feel the urge to comment about how it's probably not an integer, please don't.
To paraphrase my question: Spot the race condition and you will be my hero.
EDIT a little example:
process_1.py:
oP = Pipe(some_path)
while 1:
    oP.write({'a':'foo','b':'bar','c':'erm...','d':'plop!','e':'etc'})
process_2.py:
oP = Pipe(same_path_as_before)
while 1:
    print(oP.read())
After playing around with the code, I suspect the problem is coming from how you are reading the file.
Specifically, lines like this:
os.read(self.iFH, iLineStartBaseLength)
That call doesn't necessarily return iLineStartBaseLength bytes - it might consume "LI", then return READLINE_FAIL and retry. On the second attempt, it will get the remainder of the line, and somehow end up giving the non-numeric string to the int() call.
The unpredictability likely comes from how the fifo is being flushed - if it happens to flush when the complete line is written, all is fine. If it flushes when the line is half-written, weirdness.
At least in the hacked-up version of the script I ended up with, the oP.read() call in process_2.py often got a different dict to the one sent (where the KEY might bleed into the previous VALUE and other strangeness).
I might be mistaken, as I had to make a bunch of changes to get the code running on OS X, and further while experimenting. My modified code here
Not sure exactly how to fix it, but with the json module or similar the protocol/parsing can be greatly simplified, since newline-separated JSON data is much easier to parse:
import os
import time
import json
import errno

def retry_write(*args, **kwargs):
    """Like os.write, but retries until EAGAIN stops appearing
    """
    while True:
        try:
            return os.write(*args, **kwargs)
        except OSError as e:
            if e.errno == errno.EAGAIN:
                time.sleep(0.5)
            else:
                raise

class Pipe(object):
    """FIFO based IPC based on newline-separated JSON
    """
    ENCODING = 'utf-8'

    def __init__(self, sPath):
        self.sPath = sPath
        if not os.path.exists(sPath):
            os.mkfifo(sPath)
        self.fd = os.open(sPath, os.O_RDWR | os.O_NONBLOCK)
        self.file_blocking = open(sPath, "r", encoding=self.ENCODING)

    def write(self, dmsg):
        serialised = json.dumps(dmsg) + "\n"
        dat = bytes(serialised.encode(self.ENCODING))
        # This blocks until data can be read by other process.
        # Can just use os.write and ignore EAGAIN if you want
        # to drop the data
        retry_write(self.fd, dat)

    def read(self):
        serialised = self.file_blocking.readline()
        return json.loads(serialised)
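Hypothetical usage, mirroring the process_1.py / process_2.py example above (the fifo path is a placeholder):
# writer process
oP = Pipe('/tmp/ipc_fifo')
oP.write({'a': 'foo', 'b': 'bar', 'c': 'erm...', 'd': 'plop!', 'e': 'etc'})

# reader process, run separately
oP = Pipe('/tmp/ipc_fifo')
print(oP.read())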
Try getting rid of the try:, except: blocks and seeing what exception is actually being thrown.
So replace your sample with just:
iLineLen = int(sLineLen.strip(string.punctuation+string.whitespace))
I bet it'll now throw a ValueError, and it's because you're trying to cast "KE 17" to an int.
You'll need to strip more than string.whitespace and string.punctuation if you're going to cast the string to an int.
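For example, the failing value from the question survives the strip() untouched and then blows up in int():
import string

s = "KE 17"
print(s.strip(string.punctuation + string.whitespace))  # still 'KE 17'
int(s)  # ValueError: invalid literal for int() with base 10: 'KE 17'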
