Python yield JSON docs from a stream

I have a REST API (RavenDB's Query Streaming) that returns a lot of data in JSON format. It's too much to load into memory and parse in one go.
The issue is that rather than 'one document per line', which would make this very easy, it returns a single string with our documents in a field called "Results", as follows:
{"Results":[
{"Name":"Hello World"}
]}
What I really want to do is use Python's requests library to stream the response, like so:
r = requests.get('.../streams/query/Raven/DocumentsByEntityName?query=', stream=True)
for chunk in r.iter_content(chunk_size=512, decode_unicode=False):
    print chunk
But I want to yield individual JSON documents, so as not to have to parse the entire response. What would be the most efficient way to yield one JSON document at a time?

json.load() has an optional object_pairs_hook argument which you may be able to use. The idea is to capture each inner dict as it goes along, returning from your callback function an empty dict (or maybe None) so as to avoid building up the gigantic data structure in memory.
Keep in mind that this is not a performance optimization: in my testing (using import simplejson as json), I found that while I could save memory, using the hooks to inspect each element actually made the parsing several times slower. Still, if you are out of memory, it's better than nothing.
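A minimal sketch of that idea (the saved-response filename is an assumption, and the print call stands in for whatever per-document handling you need; note that json.load still reads the raw text in one go, the saving is only in not keeping the parsed structure):
import json
def handle_pairs(pairs):
    # called for every JSON object, innermost first, as soon as it has been parsed
    obj = dict(pairs)
    if "Name" in obj:       # looks like one of the inner documents
        print(obj)          # stand-in for real per-document handling
        return {}           # return something tiny so the big structure never builds up
    return obj              # keep the outer wrapper as-is
with open("response.json") as fp:
    json.load(fp, object_pairs_hook=handle_pairs)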

Here is how I am going about things at the moment. What I am doing is matching the braces ({}) so that I can output just the inner JSON documents, one per line (see: JSON Lines).
This buys me the ability to stream the output to a text file, which I can decode line by line later on without having to decode the whole response in memory.
Any suggestions or optimizations would be most welcome!
import requests

def yield_stream(url1='/streams/query/Raven/DocumentsByEntityName?query=', query1=''):
    r = requests.get(conf.db + url1 + query1, auth=conf.db_auth, stream=True)
    i = 0            # current brace depth
    is_doc = False   # currently inside an inner document?
    is_str = False   # currently inside a JSON string literal?
    doc1 = []
    for chunk in r.iter_content(chunk_size=1024, decode_unicode=True):
        for char in chunk:
            if is_doc:
                doc1.append(char)
                # toggle string state on unescaped quotes so braces inside strings are ignored
                if doc1[-2:-1] != ['\\'] and doc1[-1:] == ['"']:
                    is_str = not is_str
            if char == '{' and not is_str:
                i += 1
                if i == 2:          # depth 2 marks the start of an inner document
                    doc1.append(char)
                    is_doc = True
            if char == '}' and not is_str:
                i -= 1
                if i == 1:          # back to depth 1: the inner document is complete
                    yield ''.join(doc1)
                    doc1 = []
                    is_doc = False
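A usage sketch for the generator above (the output filename is an assumption): each yielded item is one inner document as a JSON string, so it can be written straight to a JSON Lines file, or decoded, one at a time.
import json
with open('results.jsonl', 'w') as out:
    for doc in yield_stream():
        out.write(doc + '\n')        # one JSON document per line (JSON Lines)
        record = json.loads(doc)     # or decode just this one document
        print(record.get('Name'))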

Related

How to pull out JSON blobs from large random string which contains plain text and JSON

I have a large file full of short JSON blobs, and random strings between the JSON.
The JSON objects are all different and do not follow a regular format. The strings also contain random data and do not have a consistent length or structure.
How would I filter this string to only pull out the valid JSON?
Does Python have a string filter function which could be used to reject anything that is not JSON?
Every sample I can find examines the entire string to decide whether it is JSON, which we know will not work in this case.
There's a slow, tedious approach you can take: attempt to parse the string, and each time you get an exception, use the information it carries to decide what to discard. Either the error says no value could be parsed at that position, in which case you discard a prefix, or it says a valid value was followed by extra data, in which case you parse that prefix and continue with whatever comes after it.
import json

def extract_json_values(input_str):
    results = []
    while input_str:
        try:
            value = json.loads(input_str)
            input_str = ""
        except json.decoder.JSONDecodeError as exc:
            if str(exc).startswith("Expecting value"):
                # no JSON value starts at this position; skip ahead and retry
                input_str = input_str[exc.pos + 1:]
                continue
            elif str(exc).startswith("Extra data"):
                # a complete value ends at exc.pos; parse it and keep the remainder
                value = json.loads(input_str[:exc.pos])
                input_str = input_str[exc.pos:]
        results.append(value)
    return results

for x in extract_json_values('x"foo"x3[1,2,3]asfa{"bar": "baz"}'):
    print(x)
This should output
foo
3
[1, 2, 3]
{'bar': 'baz'}
You could use a greedy approach to scan through all substrings of the file content for valid JSON blobs, yielding on the longest match:
import json

def extract_json_blobs(content):
    i = 0
    while i < len(content):
        if content[i] == '{':
            # try the longest candidate first, shrinking from the right
            for j in range(len(content) - 1, i, -1):
                if content[j] == '}':
                    try:
                        yield json.loads(content[i:j + 1])
                        i = j
                        break
                    except json.JSONDecodeError:
                        pass
        i += 1
>>> content = 'abd 123 ** 9 {. {"id":"10"} aaasd {"foo": {"id": [1234]}} aae 324'
>>> for blob in extract_json_blobs(content):
...     print(blob)
{'id': '10'}
{'foo': {'id': [1234]}}
If you want to yield any valid JSON value (i.e., not only JSON "objects"), drop the checks on { and } above.
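A related variant of the same scanning idea uses the standard library's json.JSONDecoder.raw_decode, which parses a value starting at a given index and reports where it ends. A minimal sketch (note that bare numbers buried in the noise will also be picked up, just as when the { and } checks are dropped):
import json
def iter_json_values(text):
    decoder = json.JSONDecoder()
    i = 0
    while i < len(text):
        try:
            value, end = decoder.raw_decode(text, i)
        except json.JSONDecodeError:
            i += 1          # no JSON value starts here; move on
        else:
            yield value
            i = end         # jump past the value just parsed
for value in iter_json_values('x"foo"x3[1,2,3]asfa{"bar": "baz"}'):
    print(value)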

How to load a dataframe from a file containing unwanted characters?

I'm in need of some knowledge on how to fix an error I have made while collecting data. The collected data has the following structure:
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
I normally wouldn't have added "[" or "]" to the .txt file when writing the data to it, line by line. However, the mistake was made, and as a result pandas splits the rows into the wrong columns when the file is loaded.
Is there a way to load the data properly into pandas?
On the snippet that I could cut and paste from the question (which I saved as test.txt), I could successfully read a dataframe via:
Purging the square brackets (with sed on a Linux command line, but this can also be done with a text editor, or in Python if need be; a sketch follows at the end of this answer)
sed -i 's/^\[//g' test.txt # remove left square brackets assuming they are at the beginning of the line
sed -i 's/\]$//g' test.txt # remove right square brackets assuming they are at the end of the line
Loading the dataframe (in a python console)
import pandas as pd
pd.read_csv("test.txt", skipinitialspace = True, quotechar='"')
(not sure that this will work for the entirety of your file though).
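If you would rather do the bracket purge in Python than with sed, here is a minimal sketch, with the file names assumed:
# strip a leading '[' and a trailing ']' from each line, writing a cleaned copy
with open('test.txt') as src, open('test_clean.txt', 'w') as dst:
    for line in src:
        line = line.rstrip('\n')
        if line.startswith('['):
            line = line[1:]
        if line.endswith(']'):
            line = line[:-1]
        dst.write(line + '\n')
After that, pd.read_csv("test_clean.txt", skipinitialspace=True, quotechar='"') should work as above.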
Consider the code below, which reads the text in myfile.txt, which looks like this:
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words ,it's basically creating a mini tornado."]
The code below removes [ and ] from the text, then splits every line on commas, skipping the first line (the headers). Some messages contain commas of their own, which would otherwise create extra columns (NaN for the rows without them), so the code joins everything after the first comma back into a single Message string, as intended.
Code:
import pandas as pd

with open('myfile.txt', 'r') as my_file:
    text = my_file.read()

text = text.replace("[", "")
text = text.replace("]", "")

df = pd.DataFrame({
    'Author': [i.split(',')[0] for i in text.split('\n')[1:]],
    'Message': [''.join(i.split(',')[1:]) for i in text.split('\n')[1:]]
}).applymap(lambda x: x.replace('"', ''))
Output:
Author Message
0 littleblackcat There's a lot of redditors here that live in the area maybe/hopefully someone saw something.
1 Kruse In other words it's basically creating a mini tornado.
Here are a few more options to add to the mix:
You could parse the lines yourself using ast.literal_eval, and then load them into a pd.DataFrame directly using an iterator over the lines:
import pandas as pd
import ast

with open('data', 'r') as f:
    lines = (ast.literal_eval(line) for line in f)
    header = next(lines)
    df = pd.DataFrame(lines, columns=header)

print(df)
Note, however, that calling ast.literal_eval once for each line may not be very fast, especially if your data file has a lot of lines. However, if the data file is not too big, this may be an acceptable, simple solution.
Another option is to wrap an arbitrary iterator (which yields bytes) in an IterStream. This very general tool (thanks to Mechanical snail) allows you to manipulate the contents of any file and then re-package it into a file-like object. Thus, you can fix the contents of the file, and yet still pass it to any function which expects a file-like object, such as pd.read_csv. (Note: I've answered a similar question using the same tool, here.)
import io
import pandas as pd

def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.

    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).

    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None
        def readable(self):
            return True
        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0    # indicate EOF
    return io.BufferedReader(IterStream(), buffer_size=buffer_size)

def clean(f):
    for line in f:
        yield line.strip()[1:-1] + b'\n'

with open('data', 'rb') as f:
    # https://stackoverflow.com/a/50334183/190597 (Davide Fiocco)
    df = pd.read_csv(iterstream(clean(f)), skipinitialspace=True, quotechar='"')

print(df)
A pure pandas option is to change the separator from , to ", " in order to have only 2 columns, and then strip the unwanted characters, which to my understanding are [, ], " and spaces:
import pandas as pd
import io

string = '''
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
'''

df = pd.read_csv(io.StringIO(string), sep='\", \"', engine='python').apply(lambda x: x.str.strip('[\"] '))
# the \" instead of simply " is to make sure Python does not interpret it as an end-of-string character
df.columns = [df.columns[0][2:], df.columns[1][:-2]]
print(df)

# Output (note that the space before "There's" is also gone):
#             Author                                            Message
# 0   littleblackcat  There's a lot of redditors here that live in t...
# 1            Kruse  In other words, it's basically creating a mini...
For now the following solution was found:
sep = '[|"|]'
Using a multi-character separator allowed the brackets to be stored in separate columns of the pandas dataframe, which were then dropped. This avoids having to strip the brackets out line by line.
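A sketch of what that might have looked like; the exact column positions to keep, and the file name, are assumptions, since the answer above does not show its code:
import pandas as pd
# sep='[|"|]' is treated as a regular expression: a character class matching
# '|' or '"'. Every quote therefore becomes a column boundary, so the brackets
# and the ", " separators fall into throw-away columns of their own.
raw = pd.read_csv('test.txt', sep='[|"|]', engine='python', header=None)
df = raw.iloc[1:, [1, 3]]            # keep only the Author and Message columns
df.columns = ['Author', 'Message']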

optimal method to parse a json object in a datafile

I am trying to setup a simple data file format, and I am working with these files in Python for analysis. The format basically consists of header information, followed by the data. For syntax and future extensibility reasons, I want to use a JSON object for the header information. An example file looks like this:
{
"name": "my material",
"sample-id": null,
"description": "some material",
"funit": "MHz",
"filetype": "material_data"
}
18 6.269311533 0.128658208 0.962033017 0.566268827
18.10945274 6.268810641 0.128691962 0.961950095 0.565591807
18.21890547 6.268312637 0.128725463 0.961814928 0.564998228...
If the data length/structure is always the same, this is not hard to parse. However, it raised the question of the most flexible way to parse out the JSON object, given an unknown number of lines, an unknown number of nested curly braces, and potentially more than one JSON object in the file.
If there is only one JSON object in the file, one can use this regular expression:
import re

with open(fname, 'r') as fp:
    fstring = fp.read()

json_string = re.search('{.*}', fstring, flags=re.S)
However, if there is more than one JSON string, and I want to grab the first one, I need to use something like this:
def grab_json(mystring):
    lbracket = 0
    rbracket = 0
    lbracket_pos = 0
    rbracket_pos = 0
    for i in range(len(mystring)):
        if mystring[i] == '{':
            lbracket = 1
            lbracket_pos = i
            break
    for i in range(lbracket_pos + 1, len(mystring)):
        if mystring[i] == '}':
            rbracket += 1
            if rbracket == lbracket:
                rbracket_pos = i
                break
        elif mystring[i] == '{':
            lbracket += 1
    json_string = mystring[lbracket_pos : rbracket_pos + 1]
    return json_string, lbracket_pos, rbracket_pos

json_string, beg_pos, end_pos = grab_json(fstring)
I guess the question as always: is there a better way to do this? Better meaning simpler code, more flexible code, more robust code, or really anything?
The easiest solution, as Klaus suggested, is just to use JSON for the entire file. That makes your life much simpler, because then writing is just json.dump and reading is just json.load.
A second solution is to put the metadata in a separate file, which keeps reading and writing simple at the expense of multiple files for each data set.
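A minimal sketch of that second option (the file names, and keeping the numeric block as plain text, are assumptions; metadata and data mirror the question's example):
import json
metadata = {"name": "my material", "funit": "MHz"}
data = "18 6.269311533 0.128658208 0.962033017 0.566268827\n"
# write: metadata and data live in sibling files
with open('sample.header.json', 'w') as fp:
    json.dump(metadata, fp, indent=2)
with open('sample.data.txt', 'w') as fp:
    fp.write(data)
# read
with open('sample.header.json') as fp:
    metadata = json.load(fp)
with open('sample.data.txt') as fp:
    data = fp.read()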
A third solution would be, when writing the file to disk, to prepend the length of the JSON data. So writing might look something like:
metadata_json = json.dumps(metadata)
myfile.write('%d\n' % len(metadata_json))
myfile.write(metadata_json)
myfile.write(data)
Then reading looks like:
with open('myfile') as fd:
    header_len = fd.readline()                # the length line written first
    metadata_json = fd.read(int(header_len))
    metadata = json.loads(metadata_json)
    data = fd.read()
A fourth option is to adopt an existing storage format (maybe hdf?) that already has the features you are looking for in terms of storing both data and metadata in the same file.
I would store the headers separately. That gives you the option of reusing the same header file for multiple data files.
Alternatively, you may want to take a look at the Apache Parquet format, especially if you want to process your data on a distributed cluster using Spark.

how to put data from a text file into array in python

I am trying to put data from a text file into an array. Below is the array I am trying to create.
[("major",r,w,w,s,w,w,w,s), ("relative minor",r,w,s,w,w,s,w,w),
("harmonic minor",r,w,s,w,w,s,w+s,s)]
But instead, when I load the data from the text file, I get the output below. It should look like the array above; I realise I have to split it, but I don't really know how for this sort of array. Could anyone help me with this?
['("major",r,w,w,s,w,w,w,s), ("relative minor",r,w,s,w,w,s,w,w),
("harmonic minor",r,w,s,w,w,s,w+s,s)']
Below is the text file I am trying to load.
("major",r,w,w,s,w,w,w,s), ("relative minor",r,w,s,w,w,s,w,w), ("harmonic minor",r,w,s,w,w,s,w+s,s)
And this is how I'm loading it:
file = open("slide.txt", "r")
scale = [file.readline()]
If you mean a list instead of an array:
with open(filename) as f:
    list_name = f.readlines()
Some questions come to mind about what the rest of your implementation looks like and how you intend it all to work, but below is an example of how this could be done in a fairly straightforward way:
class W(object):
    pass

class S(object):
    pass

class WS(W, S):
    pass

class R(object):
    pass

def main():
    # separate parts that should become tuples eventually
    text = str()
    with open("data", "r") as fh:
        text = fh.read()
    parts = text.split("),")

    # remove unwanted characters and whitespace
    cleaned = list()
    for part in parts:
        part = part.replace('(', '')
        part = part.replace(')', '')
        cleaned.append(part.strip())

    # convert text parts into tuples with actual data types
    list_of_tuples = list()
    for part in cleaned:
        t = construct_tuple(part)
        list_of_tuples.append(t)

    # now use the data for something
    print list_of_tuples

def construct_tuple(data):
    t = tuple()
    content = data.split(',')
    for item in content:
        t = t + (get_type(item),)
    return t

# there needs to be some way to decide what type/object should be used:
def get_type(id):
    type_mapping = {
        '"harmonic minor"': 'harmonic minor',
        '"major"': 'major',
        '"relative minor"': 'relative minor',
        's': S(),
        'w': W(),
        'w+s': WS(),
        'r': R()
    }
    return type_mapping.get(id)

if __name__ == "__main__":
    main()
This code makes some assumptions:
there is a file data with the content:
("major",r,w,w,s,w,w,w,s), ("relative minor",r,w,s,w,w,s,w,w), ("harmonic minor",r,w,s,w,w,s,w+s,s)
you want a list of tuples which contains the values.
It's acceptable to have w+s represented by some data type, as it would be difficult to have something like w+s appear inside a tuple without it being evaluated when the tuple is created. Another way to do it would be to have w and s represented by data types that can be used with +.
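If you instead make w and s addable, as the last point suggests, a small sketch of what that could look like (the step sizes are assumptions introduced purely for illustration):
class Step(object):
    # hypothetical interval type: adding two steps combines their sizes
    def __init__(self, size):
        self.size = size
    def __add__(self, other):
        return Step(self.size + other.size)
    def __repr__(self):
        return 'Step(%d)' % self.size
w = Step(2)    # whole step (assumed size)
s = Step(1)    # half step (assumed size)
print(w + s)   # Step(3): the w+s entry is evaluated when the tuple is created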
So even if this works, it might be a good idea to think about the format of the text file (if you have control over that), and see whether it can be changed into something that lets you use a parsing library in a simple way, e.g. see whether it could be more easily represented as CSV or even turned into JSON.
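To give an idea of the JSON variant: the same scales stored as a JSON array (the slide.json file name and layout are assumptions) need no hand-written parsing at all.
import json
# slide.json would contain:
# [["major", "r", "w", "w", "s", "w", "w", "w", "s"],
#  ["relative minor", "r", "w", "s", "w", "w", "s", "w", "w"],
#  ["harmonic minor", "r", "w", "s", "w", "w", "s", "w+s", "s"]]
with open("slide.json") as f:
    scales = [tuple(row) for row in json.load(f)]
print(scales[0])   # ('major', 'r', 'w', 'w', 's', 'w', 'w', 'w', 's')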

Manipulate string data

I'm new to Python and I'm trying to create a script that modifies the output of a JS file to match what is required to send data to an API. The JS file is being read via urllib2.
import urllib2

def getPage():
    url = "http://url:port/min_day.js"
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read()

# JS Data
# m[mi++]="19.12.12 09:30:00|1964;2121;3440;293;60"
# m[mi++]="19.12.12 09:25:00|1911;2060;3277;293;59"

# Required format for API
# addbatchstatus.jsp?data=20121219,09:25,3277.0,1911,-1,-1,59.0,293.0;20121219,09:30,3440.0,1964,-1,-1,60.0,293.0
As a breakdown, the values needed from a line such as
m[mi++]="19.12.12 09:30:00|1964;2121;3440;293;60"
are the date, the time, and all of the numeric fields except the second one, and values of -1,-1 also need to be added into the string.
I've managed to get the date into the correct format and to replace characters and line breaks so that the output looks like the line below, but I have a feeling I'm heading down the wrong track if I need to reorder these values. It also looks like the entries are in reverse chronological order.
20121219,09:30:00,1964,2121,3440,293,60;20121219,09:25:00,1911,2060,3277,293,59
Any help would be greatly appreciated! I'm thinking regex might be what I need.
Here's a regex pattern to strip out the bits you don't want:
m\[mi\+\+\]="(?P<day>\d{2})\.(?P<month>\d{2})\.(?P<year>\d{2}) (?P<time>[\d:]{8})\|(?P<v1>\d+);(?P<v2>\d+);(?P<v3>\d+);(?P<v4>\d+);(?P<v5>\d+).+
and replace with
20\g<year>\g<month>\g<day>,\g<time>,\g<v3>,\g<v1>,-1,-1,\g<v5>,\g<v4>
This pattern assumes that the characters before the date are constant. You can replace m\[mi\+\+\]=" with [^\d]+ if you want more general handling of that bit.
So to put this into practice in Python:
import re
import urllib2

def getPage():
    url = "http://url:port/min_day.js"
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read()

def repl(match):
    return '20%s%s%s,%s,%s,%s,-1,-1,%s,%s' % (match.group('year'),
                                              match.group('month'),
                                              match.group('day'),
                                              match.group('time'),
                                              match.group('v3'),
                                              match.group('v1'),
                                              match.group('v5'),
                                              match.group('v4'))

pattern = re.compile(r'm\[mi\+\+\]="(?P<day>\d{2})\.(?P<month>\d{2})\.(?P<year>\d{2}) (?P<time>[\d:]{8})\|(?P<v1>\d+);(?P<v2>\d+);(?P<v3>\d+);(?P<v4>\d+);(?P<v5>\d+).+')

data = [re.sub(pattern, repl, line).split(',') for line in getPage().split('\n')]

# If you want to sort your data
data = sorted(data, key=lambda x: x[0], reverse=True)

# If you want to write your data back to a formatted string
new_string = ';'.join(','.join(x) for x in data)

# If you want to write it back to file
with open('new/file.txt', 'w') as f:
    f.write(new_string)
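For example, running the substitution on the first sample line from the question (with pattern and repl defined as above) gives:
sample = 'm[mi++]="19.12.12 09:30:00|1964;2121;3440;293;60"'
print(re.sub(pattern, repl, sample))
# 20121219,09:30:00,3440,1964,-1,-1,60,293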
Hope that helps!
