I have the following json output:
{
"code":0,
"message":"success",
"data":[
{
"group_id":"12345678901234567890",
"display_name":"GROUP",
"description":"Group 1",
"monitors":[
"12345678901234567890",
"12345678901234567890",
"12345678901234567890"
]
},
{
"group_id":"12345678901234567890",
"display_name":"KK-GROUP1",
"description":"KK Group 1",
"monitors":[
"12345678901234567890",
"12345678901234567890",
"12345678901234567890"
]
},
{
"group_id":"12345678901234567890",
"display_name":"KK-GROUP2",
"description":"KK Group 2",
"monitors":[
"12345678901234567890",
"12345678901234567890",
"12345678901234567890"
]
},
{
"group_id":"12345678901234567890",
"display_name":"KK-GROUP3",
"description":"KK Group 3",
"monitors":[
"12345678901234567890",
"12345678901234567890",
"12345678901234567890"
]
}
]
}
I have this function that should loop through the JSON output received from a pycurl command, find all groups whose display_name starts with, for example, KK, and build a list of all the IDs in the respective monitors fields to use in another section of a script I wrote. For the output above it should return 9 IDs (3 from each group), but for whatever reason it's only grabbing the first 3 monitor IDs.
def ReturnedMonitors():
    listOfChecks = json.loads(connectSite('GET',''))
    for i in listOfChecks['data']:
        while i['display_name'].startswith(options.clusterName.upper()):
            return i['monitors']
The output from this will be passed into the following:
for monitor in ReturnedMonitors():
    putData = 'activate/' + monitor
    print "Activated: " + modifyMonitors('PUT',putData)
modifyMonitors is another pycurl definition that will post to a site.
If you are trying to create a generator function with ReturnedMonitors(), you have written it incorrectly: return exits the function on the first match, so you only ever get the first matching group's monitors. You need to use the yield keyword instead, and if you need to yield each id in the `monitors` list separately, you should loop over them and yield each one. Note also that the while should be an if, since the condition never changes inside the loop. Example -
def ReturnedMonitors():
    listOfChecks = json.loads(connectSite('GET',''))
    for i in listOfChecks['data']:
        if i['display_name'].startswith(options.clusterName.upper()):
            for x in i['monitors']:
                yield x
For Python 3.3+, you can use yield from to yield all values from an iterable/iterator (called generator delegation) -
def ReturnedMonitors():
    listOfChecks = json.loads(connectSite('GET',''))
    for i in listOfChecks['data']:
        if i['display_name'].startswith(options.clusterName.upper()):
            yield from i['monitors']
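To see the whole thing end to end, here is a self-contained sketch of the generator approach. The connectSite stub, the prefix parameter (standing in for options.clusterName), and the short placeholder monitor IDs are all assumptions for illustration, since the real IDs in the question are duplicated:

```python
import json

# hypothetical stand-in for the real API response
SAMPLE = json.dumps({"data": [
    {"display_name": "GROUP",     "monitors": ["b1", "b2", "b3"]},
    {"display_name": "KK-GROUP1", "monitors": ["c1", "c2", "c3"]},
    {"display_name": "KK-GROUP2", "monitors": ["d1", "d2", "d3"]},
    {"display_name": "KK-GROUP3", "monitors": ["e1", "e2", "e3"]},
]})

def connectSite(method, path):
    # stand-in for the real pycurl call
    return SAMPLE

def ReturnedMonitors(prefix="KK"):
    listOfChecks = json.loads(connectSite('GET', ''))
    for i in listOfChecks['data']:
        # 'if', not 'while': the condition never changes inside the loop
        if i['display_name'].startswith(prefix.upper()):
            yield from i['monitors']

print(list(ReturnedMonitors()))  # nine IDs, three from each KK group
```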
I have many markdown files with titles, subheadings, sub-subheadings etc.
I'm interested in parsing them into a JSON that'll separate for each heading the text and "subheadings" in it.
For example, I've got the following markdown file, I want it to be parsed into something of the form:
outer1
outer2
# title 1
text1.1
## title 1.1
text1.1.1
# title 2
text 2.1
to:
{
"text": [
"outer1",
"outer2"
],
"inner": [
{
"section": [
{
"title": "title 1",
"inner": [
{
"text": [
"text1.1"
],
"inner": [
{
"section": [
{
"title": "title 1.1",
"inner": [
{
"text": [
"text1.1.1"
]
}
]
}
]
}
]
}
]
},
{
"title": "title 2",
"inner": [
{
"text": [
"text2.1"
]
}
]
}
]
}
]
}
To further illustrate the need - notice how the inner heading is nested inside the outer one, whereas the 2nd outer heading is not.
I tried using pyparsing to solve this but it seems to me that it's not able to achieve this, because to get section "title 2" on the same level as "title 1" I need some sort of "counting logic" to check that the number of "#" in the new header is less than or equal, which is something I can't seem to do.
Is this an issue with the expressibility of pyparsing? Is there another kind of parser that could achieve this?
I could implement this in pure python but I wanted to do something better.
Here is my current pyparsing implementation which doesn't work as explained above:
section = pp.Forward()("section")
inner_block = pp.Forward()("inner")

line = pp.Combine(
    pp.OneOrMore(pp.Word(pp.unicode.Latin1.printables), stop_on=pp.LineEnd()),
    join_string=' ', adjacent=False)
start_section = pp.OneOrMore(pp.Word("#"))
title_section = line
title = start_section.suppress() + title_section('title')
text = ~title + pp.OneOrMore(line, stop_on=(pp.LineEnd() + pp.FollowedBy("#")))
inner_block << pp.Group(section | (text('text') + pp.Optional(section.set_parse_action(foo))))
section << pp.Group(title + pp.Optional(inner_block))
markdown = pp.OneOrMore(inner_block)
test = """\
out1
out2
# title 1
text1.1
# title 2
text2.1
"""
res = markdown.parse_string(test, parse_all=True).as_dict()
test_eq(res, dict(
    inner=[
        dict(
            text=["out1", "out2"],
            section=[
                dict(title="title 1", inner=[
                    dict(
                        text=["text1.1"]
                    ),
                ]),
                dict(title="title 2", inner=[
                    dict(
                        text=["text2.1"]
                    ),
                ]),
            ]
        )
    ]
))
I took a slightly different approach to this problem, using scan_string instead of parse_string, and doing more of the data structure management and storage in a scan_string loop instead of in the parser itself with parse actions.
scan_string scans the input and for each match found, returns the matched tokens as a ParseResults, and the start and end locations of the match in the source string.
Starting with an import, I define an expression for a title line:
import pyparsing as pp
# define a pyparsing expression that will match a line with leading '#'s
title = pp.AtLineStart(pp.Word("#")) + pp.rest_of_line
To get ready to gather data by title, I define a title_stack list and a last_end int to keep track of the end of the last title found (so we can slice out the contents of the last title that was parsed). I initialize the stack with a fake entry representing the start of the file:
# initialize title_stack with a level-0 title at the start of the file
title_stack = []
title_stack.append([0, '<start of file>'])
last_end = 0
Here is the scan loop using scan_string:
for t, start, end in title.scan_string(sample):
    # save content since last title in the last item in title_stack
    title_stack[-1].append(sample[last_end:start].lstrip("\n"))
    # add a new entry to title_stack
    marker, title_content = t
    level = len(marker)
    title_stack.append([level, title_content.lstrip()])
    # update last_end to the end of the current match
    last_end = end

# add trailing text to the final parsed title
title_stack[-1].append(sample[last_end:])
At this point, title_stack contains a list of 3-element lists: the title level, the title text, and the body text for that title. Here is the output for your sample markdown:
[[0, '<start of file>', 'outer1\nouter2\n\n'],
 [1, 'title 1', 'text1.1\n\n'],
 [2, 'title 1.1', 'text1.1.1\n\n'],
 [1, 'title 2', 'text 2.1']]
From here, you should be able to walk this list and convert it into your desired tree structure.
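For example, a minimal sketch of that last step, using a stack of currently open sections (the dict shape here is simplified, not your exact target schema):

```python
def stack_to_tree(title_stack):
    """Convert [level, title, body] triples into a nested dict tree."""
    root = {"title": title_stack[0][1], "text": title_stack[0][2], "children": []}
    open_sections = [(0, root)]  # (level, node) pairs for currently open sections
    for level, title, body in title_stack[1:]:
        node = {"title": title, "text": body, "children": []}
        # close any sections at the same or deeper level
        while open_sections and open_sections[-1][0] >= level:
            open_sections.pop()
        open_sections[-1][1]["children"].append(node)
        open_sections.append((level, node))
    return root
```

This makes "title 2" a sibling of "title 1" (both children of the root), while "title 1.1" nests under "title 1".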
I'm working with a large number of large JSON files. I've written the code below (extremely inelegant, I know; I'm still learning, so please keep that in mind) to generate two dictionaries from which I can create a dataframe to work with. However, for JSON objects that have empty arrays as values, my code propagates the last 'valid' values into those subsequent objects. I've tried replacing empty arrays with blanks, but that doesn't seem to work either.
dicts = []
fid = []
x = 0
while x < 1:
    for i in files:
        n = []
        k = []
        t = []
        op = open(i)
        data = op.read()
        js = json.loads(data)
        items = js['metadata']['items']
        #items = json.dumps(items).replace('[]','""')
        #items = json.loads(items)
        fileid = js['id']
        fid.append(fileid)
        ##Everything after this point is what's throwing me off##
        for a in items:
            b = json.loads(json.dumps(a, sort_keys=True))
            key = b['name']
            k.append(key)
            val = b['values']
            values = []
            for c in val:
                j = json.dumps(c['value'])
                if isinstance(c, list) == False:
                    continue
                values.append(j)
            j = ';'.join(values) #<-- For objects with more than one value
            t.append(j)
            output_dict = dict(zip([key], [j]))
            n.append(output_dict)
        dicts.append(n)
    x = x + 1
Here is an example section of the json where I'm observing this behavior:
x = {"metadata": {
"items": [
{
"values": [
{ "attribute1": "attribute", #<-- NOT IMPORTANT
"value": "VALUE 1" #<----VALUE I'M AFTER
},
{"attribute2": "attribute",#<-- NOT IMPORTANT
"value2": "VALUE 2"#<----VALUE I'M AFTER
}
],
"name": "NAME 1" #<--NAME I'M AFTER
},
{
"values": [
{
"value": []#<-- EMPTY ARRAY
}
],
"name": "NAME 2"}
]
}
}
In the above snippet, my ideal output is a list of dictionary pairings that looks like:
[{"NAME 1": "VALUE 1; VALUE 2", "NAME 2": " "...}]
But what I'm getting is:
[{"NAME 1": "VALUE 1; VALUE 2"}, {"NAME 2": "VALUE 1; VALUE 2"}...}]
I've tried deconstructing my work and can't figure out why. I've re-indented and done a walkthrough a couple of times, and I don't understand why it would behave like this. What about the way my loop is constructed is causing this?
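For comparison, here is a minimal sketch of a cleaner extraction. It assumes, as in your snippet, that each item carries a 'name' and a 'values' list, and additionally assumes the wanted strings sit under keys beginning with 'value' (as in your example, where they appear under both "value" and "value2"). Building a fresh values list and a single result dict per file means nothing can leak from one item to the next:

```python
import json

def extract_items(js):
    """Build one {name: joined-values} dict from a parsed JSON document."""
    result = {}
    for item in js["metadata"]["items"]:
        values = []
        for entry in item["values"]:
            for key, val in entry.items():
                # keep only non-empty string values under keys like 'value', 'value2', ...
                if key.startswith("value") and isinstance(val, str):
                    values.append(val)
        result[item["name"]] = ";".join(values)  # empty string when nothing was found
    return result
```

On the sample above this yields {'NAME 1': 'VALUE 1;VALUE 2', 'NAME 2': ''}.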
{
"Logging": {
"LogLevel": {
"Default": "Information",
"Microsoft": "Warning",
"Microsoft.Hosting.Lifetime": "Information",
"Microsoft.AspNetCore": "Warning",
"System.Net.Http.HttpClient.Default.ClientHandler": "Warning",
"System.Net.Http.HttpClient.Default.LogicalHandler": "Warning"
}
},
"AllowedHosts": "*",
"AutomaticTransferOptions": {
"DateOffsetForDirectoriesInDays": -1,
"DateOffsetForPortfoliosInDays": -3,
"Clause": {
"Item1": "1"
}
},
"Authentication": {
"ApiKeys": [
{
"Key": "AB8E5976-2A7C-4EEE-92C1-7B0B4DC840F6",
"OwnerName": "Cron job",
"Claims": [
{
"Type": "http://schemas.microsoft.com/ws/2008/06/identity/claims/role",
"Value": "StressTestManager"
}
]
},
{
"Key": "B11D4F27-483A-4234-8EC7-CA121712D5BE",
"OwnerName": "Test admin",
"Claims": [
{
"Type": "http://schemas.microsoft.com/ws/2008/06/identity/claims/role",
"Value": "StressTestAdmin"
},
{
"Type": "http://schemas.microsoft.com/ws/2008/06/identity/claims/role",
"Value": "TestManager"
}
]
},
{
"Key": "EBF98F2E-555E-4E66-9D77-5667E0AA1B54",
"OwnerName": "Test manager",
"Claims": [
{
"Type": "http://schemas.microsoft.com/ws/2008/06/identity/claims/role",
"Value": "TestManager"
}
]
}
],
"LDAP": {
"Domain": "domain.local",
"MachineAccountName": "Soft13",
"MachineAccountPassword": "vixuUEY7884*",
"EnableLdapClaimResolution": true
}
},
"Authorization": {
"Permissions": {
"Roles": [
{
"Role": "TestAdmin",
"Permissions": [
"transfers.create",
"bindings.create"
]
},
{
"Role": "TestManager",
"Permissions": [
"transfers.create"
]
}
]
}
}
}
I have the JSON above and need to parse it into output like this:
Logging__LogLevel__Default
Authentication__ApiKeys__0__Claims__0__Type
Everything mostly works, but I always get some extra, truncated strings mixed into the output:
Authentication__ApiKeys__0__Key
Authentication__ApiKeys__0__OwnerName
Authentication__ApiKeys__0__Claims__0__Type
Authentication__ApiKeys__0__Claims__0__Value
Authentication__ApiKeys__0__Claims__0
Authentication__ApiKeys__2
Authorization__Permissions__Roles__0__Role
Authorization__Permissions__Roles__0__Permissions__1
Authorization__Permissions__Roles__1__Role
Authorization__Permissions__Roles__1__Permissions__0
Authorization__Permissions__Roles__1
Why does my code add incomplete strings like
Authentication__ApiKeys__0__Claims__0
Authentication__ApiKeys__2
Authorization__Permissions__Roles__1
And why doesn't it print every value from
Authorization__Permissions__Roles__0__Permissions__*
and from
Authorization__Permissions__Roles__1__Permissions__*
I have this code in python3:
def checkdepth(sub_key, variable):
    delmt = '__'
    for item in sub_key:
        try:
            if isinstance(sub_key[item], dict):
                sub_variable = variable + delmt + item
                checkdepth(sub_key[item], sub_variable)
        except TypeError:
            continue
        if isinstance(sub_key[item], list):
            sub_variable = variable + delmt + item
            for it in sub_key[item]:
                sub_variable = variable + delmt + item + delmt + str(sub_key[item].index(it))
                checkdepth(it, sub_variable)
            print(sub_variable)
        if isinstance(sub_key[item], int) or isinstance(sub_key[item], str):
            sub_variable = variable + delmt + item
            print(sub_variable)

for key in data:
    if type(data[key]) is str:
        print(key + '=' + str(data[key]))
    else:
        variable = key
        checkdepth(data[key], variable)
I know that the problem is in the block where I process the list data type, but I don't know exactly where the problem is.
Use a recursive generator:
import json

with open('input.json') as f:
    data = json.load(f)

def strkeys(data):
    if isinstance(data, dict):
        for k, v in data.items():
            for item in strkeys(v):
                yield f'{k}__{item}' if item else k
    elif isinstance(data, list):
        for i, v in enumerate(data):
            for item in strkeys(v):
                yield f'{i}__{item}' if item else str(i)
    else:
        yield None  # termination condition, not a list or dict

for s in strkeys(data):
    print(s)
Output:
Logging__LogLevel__Default
Logging__LogLevel__Microsoft
Logging__LogLevel__Microsoft.Hosting.Lifetime
Logging__LogLevel__Microsoft.AspNetCore
Logging__LogLevel__System.Net.Http.HttpClient.Default.ClientHandler
Logging__LogLevel__System.Net.Http.HttpClient.Default.LogicalHandler
AllowedHosts
AutomaticTransferOptions__DateOffsetForDirectoriesInDays
AutomaticTransferOptions__DateOffsetForPortfoliosInDays
AutomaticTransferOptions__Clause__Item1
Authentication__ApiKeys__0__Key
Authentication__ApiKeys__0__OwnerName
Authentication__ApiKeys__0__Claims__0__Type
Authentication__ApiKeys__0__Claims__0__Value
Authentication__ApiKeys__1__Key
Authentication__ApiKeys__1__OwnerName
Authentication__ApiKeys__1__Claims__0__Type
Authentication__ApiKeys__1__Claims__0__Value
Authentication__ApiKeys__1__Claims__1__Type
Authentication__ApiKeys__1__Claims__1__Value
Authentication__ApiKeys__2__Key
Authentication__ApiKeys__2__OwnerName
Authentication__ApiKeys__2__Claims__0__Type
Authentication__ApiKeys__2__Claims__0__Value
Authentication__LDAP__Domain
Authentication__LDAP__MachineAccountName
Authentication__LDAP__MachineAccountPassword
Authentication__LDAP__EnableLdapClaimResolution
Authorization__Permissions__Roles__0__Role
Authorization__Permissions__Roles__0__Permissions__0
Authorization__Permissions__Roles__0__Permissions__1
Authorization__Permissions__Roles__1__Role
Authorization__Permissions__Roles__1__Permissions__0
Using flatten_json this can be converted to pandas, though it's not clear if that's what you want. Once you convert, you can use df.iloc[0] to see how each column got its name (i.e. you see the value for that key).
Note: you need to pass a list, so I just wrapped your json from above in [].
# https://github.com/amirziai/flatten
from flatten_json import flatten
import pandas as pd

dic = ...  # your json from above
dic = [dic]  # put it in a list
dic_flattened = (flatten(d, '__') for d in dic)  # add your delimiter
df = pd.DataFrame(dic_flattened)
df.iloc[0]
Logging__LogLevel__Default Information
Logging__LogLevel__Microsoft Warning
Logging__LogLevel__Microsoft.Hosting.Lifetime Information
Logging__LogLevel__Microsoft.AspNetCore Warning
Logging__LogLevel__System.Net.Http.HttpClient.Default.ClientHandler Warning
Logging__LogLevel__System.Net.Http.HttpClient.Default.LogicalHandler Warning
AllowedHosts *
AutomaticTransferOptions__DateOffsetForDirectoriesInDays -1
AutomaticTransferOptions__DateOffsetForPortfoliosInDays -3
AutomaticTransferOptions__Clause__Item1 1
Authentication__ApiKeys__0__Key AB8E5976-2A7C-4EEE-92C1-7B0B4DC840F6
Authentication__ApiKeys__0__OwnerName Cron job
Authentication__ApiKeys__0__Claims__0__Type http://schemas.microsoft.com/ws/2008/06/identi...
Authentication__ApiKeys__0__Claims__0__Value StressTestManager
Authentication__ApiKeys__1__Key B11D4F27-483A-4234-8EC7-CA121712D5BE
Authentication__ApiKeys__1__OwnerName Test admin
Authentication__ApiKeys__1__Claims__0__Type http://schemas.microsoft.com/ws/2008/06/identi...
Authentication__ApiKeys__1__Claims__0__Value StressTestAdmin
Authentication__ApiKeys__1__Claims__1__Type http://schemas.microsoft.com/ws/2008/06/identi...
Authentication__ApiKeys__1__Claims__1__Value TestManager
Authentication__ApiKeys__2__Key EBF98F2E-555E-4E66-9D77-5667E0AA1B54
Authentication__ApiKeys__2__OwnerName Test manager
Authentication__ApiKeys__2__Claims__0__Type http://schemas.microsoft.com/ws/2008/06/identi...
Authentication__ApiKeys__2__Claims__0__Value TestManager
Authentication__LDAP__Domain domain.local
Authentication__LDAP__MachineAccountName Soft13
Authentication__LDAP__MachineAccountPassword vixuUEY7884*
Authentication__LDAP__EnableLdapClaimResolution true
Authorization__Permissions__Roles__0__Role TestAdmin
Authorization__Permissions__Roles__0__Permissions__0 transfers.create
Authorization__Permissions__Roles__0__Permissions__1 bindings.create
Authorization__Permissions__Roles__1__Role TestManager
Authorization__Permissions__Roles__1__Permissions__0 transfers.create
Ok, I looked at your code and it's hard to follow. Your variable and function names don't make their purpose easy to understand. Which is fine, because everyone has to learn best practices and all the little tips and tricks in Python. So hopefully I can help you out.
You have a recursive-ish function, which is definitely the best way to handle a situation like this. However, your code is part recursive and part not; if you go recursive to solve a problem, you have to go 100% recursive.
Also, the only time you should print in a recursive function is for debugging. A recursive function should take an object that is passed down through the recursion, gets appended to or altered along the way, and is returned once the recursion reaches the end.
When you get a problem like this, think about which data you actually need or care about. In this problem we don't care about the values stored in the object, only the keys, so the code shouldn't even look at a value except to determine its type.
Here is some code I wrote up that should do what you want. Note that because it is purely recursive, the code base is small. My function passes a list around, adds to it, and returns it at the end so we can use it for whatever we need. If you have questions, just comment and I'll answer the best I can.
def convert_to_delimited_keys(obj, parent_key='', delimiter='__', keys_list=None):
    if keys_list is None:
        keys_list = []
    if isinstance(obj, dict):
        for k in obj:
            convert_to_delimited_keys(obj[k], delimiter.join((parent_key, str(k))), delimiter, keys_list)
    elif isinstance(obj, list):
        for i, _ in enumerate(obj):
            convert_to_delimited_keys(obj[i], delimiter.join((parent_key, str(i))), delimiter, keys_list)
    else:
        # Append to the list, but strip the leading delimiter left by str.join
        keys_list.append(parent_key[len(delimiter):])
    return keys_list

for item in convert_to_delimited_keys(data):
    print(item)
I'm writing an automation script in Python that makes use of another library. The output I'm given contains the array I need; however, the output also includes irrelevant log messages in string format.
For my script to work, I need to retrieve only the array which is in the file.
Here's an example of the output I'm getting.
Split /adclix.$~image into 2 rules
Split /mediahosting.engine$document,script into 2 rules
[
{
"action": {
"type": "block"
},
"trigger": {
"url-filter": "/adservice\\.",
"unless-domain": [
"adservice.io"
]
}
}
]
Generated a total of 1 rules (1 blocks, 0 exceptions)
How would I get only the array from this file?
FWIW, I'd rather not have the logic based on the strings outside of the array, as they could be subject to change.
UPDATE: Script I'm getting the data from is here: https://github.com/brave/ab2cb/tree/master/ab2cb
My full code is here:
def pipe_in(process, filter_lists):
    try:
        for body, _, _ in filter_lists:
            process.stdin.write(body)
    finally:
        process.stdin.close()

def write_block_lists(filter_lists, path, expires):
    block_list = generate_metadata(filter_lists, expires)
    process = subprocess.Popen(('ab2cb'),
                               cwd=ab2cb_dirpath,
                               stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    threading.Thread(target=pipe_in, args=(process, filter_lists)).start()
    result = process.stdout.read()
    with open('output.json', 'w') as destination_file:
        destination_file.write(result)
    if process.wait():
        raise Exception('ab2cb returned %s' % process.returncode)
The output will ideally be modified in stdout and written to file later, as I still need to modify the data within the previously mentioned array.
You can use regex too
import re
input = """
Split /adclix.$~image into 2 rules
Split /mediahosting.engine$document,script into 2 rules
[
{
"action": {
"type": "block"
},
"trigger": {
"url-filter": "/adservice\\.",
"unless-domain": [
"adservice.io"
]
}
}
]
Generated a total of 1 rules (1 blocks, 0 exceptions)
asd
asd
"""
regex = re.compile(r"\[(.|\n)*(?:^\]$)", re.M)
x = re.search(regex, input)
print(x.group(0))
EDIT
re.M turns on 'MultiLine matching'
https://repl.it/repls/InfantileDopeyLink
I have written a library for this purpose. It's not often that I get to plug it!
from jsonfinder import jsonfinder
logs = r"""
Split /adclix.$~image into 2 rules
Split /mediahosting.engine$document,script into 2 rules
[
{
"action": {
"type": "block"
},
"trigger": {
"url-filter": "/adservice\\.",
"unless-domain": [
"adservice.io"
]
}
}
]
Generated a total of 1 rules (1 blocks, 0 exceptions)
Something else that looks like JSON: [1, 2]
"""
for start, end, obj in jsonfinder(logs):
    if (
        obj
        and isinstance(obj, list)
        and isinstance(obj[0], dict)
        and {"action", "trigger"} <= obj[0].keys()
    ):
        print(obj)
Demo: https://repl.it/repls/ImperfectJuniorBootstrapping
Library: https://github.com/alexmojaki/jsonfinder
Install with pip install jsonfinder.
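If you'd rather stay in the standard library, json.JSONDecoder.raw_decode can do the same scan: look for a '[' and let the decoder try to parse a complete JSON value starting there, skipping over anything that isn't valid JSON. A sketch:

```python
import json

def extract_json_arrays(text):
    """Yield every JSON array embedded in a blob of mixed log/JSON text."""
    decoder = json.JSONDecoder()
    idx = 0
    while True:
        start = text.find("[", idx)
        if start == -1:
            return
        try:
            # raw_decode parses one JSON value and reports where it ended
            obj, end = decoder.raw_decode(text, start)
        except ValueError:
            idx = start + 1  # not valid JSON here, keep scanning
            continue
        yield obj
        idx = end
```

For the log output above, the first (and only) yielded object is the rules array, already parsed into Python lists and dicts, so no string-based logic on the surrounding log lines is needed.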
I am trying to retrieve the urls from the sub-elements below, given comp1 and comp2 as input to the python script:
{
"main1": {
"comp1": {
"url": [
"http://kcdclcm.com",
"http://dacklsd.com"
]
},
"comp2": {
"url": [
"http://dccmsdlkm.com",
"http://clsdmcsm.com"
]
}
},
"main2": {
"comp3": {
"url": [
"http://csdc.com",
"http://uihjkn.com"
]
},
"comp4": {
"url": [
"http://jkll.com",
"http://ackjn.com"
]
}
}
}
Following is the snippet of the python function I am trying to use to grab the urls:
import json

data = json.load(open('test.json'))

def geturl(comp):
    if comp in data[comp]:
        for url in data[comp]['url']:
            print url

geturl('comp1')
geturl('comp2')
I totally understand the error is in the 4th and 5th lines of the script, since I am trying to grab the url information from the second-level element of the json data without passing the first-level element 'main1' or 'main2'. The same script works fine if I replace the 4th and 5th lines as below:
if comp in data['main1']:
for url in data['main1'][comp]['url']:
In my case, I would not know main1 and main2, as the user would just pass comp1, comp2, comp3 or comp4 as input to the script. Is there a way to find the url information given that only the second-level element is known?
Any inputs would be highly appreciated.
You need to iterate through the keys/values in the dict to check if the second level key you are searching for is present:
import json

data = json.load(open('test.json'))

def geturl(comp):
    for k, v in data.items():
        if comp in v and 'url' in v[comp]:
            print "%s" % "\n".join(v[comp]['url'])

geturl('comp1')
geturl('comp2')
If you want to search for the urls given only the comp key, you just need to check every main section like this:
import json

data = json.load(open('test.json'))

def geturl(comp):
    for mainKey in data:
        main = data[mainKey]
        if comp in main:
            urls = main[comp]['url']
            for url in urls:
                print url

geturl('comp1')
geturl('comp2')
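If you need the urls back as a list rather than printed (for example to reuse them elsewhere in the script), a Python 3 sketch along the same lines, shown here with a trimmed-down copy of the sample data:

```python
import json

def get_urls(data, comp):
    """Collect the urls for `comp` from whichever top-level section contains it."""
    urls = []
    for main in data.values():
        if comp in main:
            urls.extend(main[comp]["url"])
    return urls  # empty list when comp is not found anywhere

data = {
    "main1": {"comp1": {"url": ["http://kcdclcm.com", "http://dacklsd.com"]}},
    "main2": {"comp3": {"url": ["http://csdc.com", "http://uihjkn.com"]}},
}
print(get_urls(data, "comp1"))
```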