I have many markdown files with titles, subheadings, sub-subheadings, etc.
I'm interested in parsing them into JSON that, for each heading, separates out its text and the "subheadings" within it.
For example, I've got the following markdown file, which I want parsed from:
outer1
outer2
# title 1
text1.1
## title 1.1
text1.1.1
# title 2
text 2.1
to:
{
"text": [
"outer1",
"outer2"
],
"inner": [
{
"section": [
{
"title": "title 1",
"inner": [
{
"text": [
"text1.1"
],
"inner": [
{
"section": [
{
"title": "title 1.1",
"inner": [
{
"text": [
"text1.1.1"
]
}
]
}
]
}
]
}
]
},
{
"title": "title 2",
"inner": [
{
"text": [
"text2.1"
]
}
]
}
]
}
]
}
To further illustrate the need, notice how the inner heading is nested inside the outer one, whereas the second outer heading is not.
I tried using pyparsing to solve this, but it doesn't seem able to achieve it: to put section "title 2" on the same level as "title 1" I need some sort of "counting logic" that checks whether the number of "#" characters in the new header is less than or equal to the current level, and I can't seem to express that.
Is this a limitation of pyparsing's expressiveness? Is there another kind of parser that could achieve this?
I could implement this in pure Python, but I wanted to do something better.
Here is my current pyparsing implementation, which doesn't work, as explained above:
import pyparsing as pp

section = pp.Forward()("section")
inner_block = pp.Forward()("inner")
start_section = pp.OneOrMore(pp.Word("#"))
line = pp.Combine(
    pp.OneOrMore(pp.Word(pp.unicode.Latin1.printables), stop_on=pp.LineEnd()),
    join_string=' ', adjacent=False)
title_section = line
title = start_section.suppress() + title_section('title')
text = ~title + pp.OneOrMore(line, stop_on=(pp.LineEnd() + pp.FollowedBy("#")))
inner_block << pp.Group(section | (text('text') + pp.Optional(section.set_parse_action(foo))))
section << pp.Group(title + pp.Optional(inner_block))
markdown = pp.OneOrMore(inner_block)
test = """\
out1
out2
# title 1
text1.1
# title 2
text2.1
"""
res = markdown.parse_string(test, parse_all=True).as_dict()
test_eq(res, dict(
inner=[
dict(
text = ["out1", "out2"],
section=[
dict(title="title 1", inner=[
dict(
text=["text1.1"]
),
]),
dict(title="title 2", inner=[
dict(
text=["text2.1"]
),
]),
]
)
]
))
I took a slightly different approach to this problem, using scan_string instead of parse_string, and doing more of the data structure management and storage in a scan_string loop instead of in the parser itself with parse actions.
scan_string scans the input and for each match found, returns the matched tokens as a ParseResults, and the start and end locations of the match in the source string.
Starting with an import, I define an expression for a title line:
import pyparsing as pp
# define a pyparsing expression that will match a line with leading '#'s
title = pp.AtLineStart(pp.Word("#")) + pp.rest_of_line
To get ready to gather data by title, I define a title_stack list, and a last_end int to keep track of the end of the last title found (so we can slice out the contents of the last title that was parsed). I initialize this stack with a fake entry representing the start of the file:
title_stack = []
last_end = 0
# initialize title_stack with level-0 title at the start of the file
title_stack.append([0, '<start of file>'])
Here is the scan loop using scan_string:
for t, start, end in title.scan_string(sample):
# save content since last title in the last item in title_stack
title_stack[-1].append(sample[last_end:start].lstrip("\n"))
# add a new entry to title_stack
marker, title_content = t
level = len(marker)
title_stack.append([level, title_content.lstrip()])
# update last_end to the end of the current match
last_end = end
# add trailing text to the final parsed title
title_stack[-1].append(sample[last_end:])
At this point, title_stack contains a list of 3-element lists: the title level, the title text, and the body text for that title. Here is the output for your sample markdown:
[[0, '<start of file>', 'outer1\nouter2\n\n'],
[1, 'title 1', 'text1.1\n\n'],
[2, 'title 1.1', 'text1.1.1\n\n'],
[1, 'title 2', 'text 2.1']]
From here, you should be able to walk this list and convert it into your desired tree structure.
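To illustrate that last step, here is a minimal sketch of one way to walk title_stack into a tree, using a simpler title/text/children node shape rather than the exact text/inner/section layout from the question (build_tree is a made-up name):

```python
import json

def build_tree(title_stack):
    # Each entry is [level, title, body]. Keep a stack of (level, node)
    # pairs so each new title attaches to the nearest shallower heading.
    root = {"title": None, "text": "", "children": []}
    stack = [(0, root)]
    for level, title, body in title_stack:
        if level == 0:
            root["text"] = body.strip()
            continue
        node = {"title": title, "text": body.strip(), "children": []}
        # pop until the top of the stack is a shallower heading
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root

title_stack = [
    [0, '<start of file>', 'outer1\nouter2\n\n'],
    [1, 'title 1', 'text1.1\n\n'],
    [2, 'title 1.1', 'text1.1.1\n\n'],
    [1, 'title 2', 'text 2.1'],
]
print(json.dumps(build_tree(title_stack), indent=2))
```

Because "title 2" is level 1, the loop pops the level-2 and level-1 entries off the stack before attaching it, which is exactly the "counting logic" the question was missing.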
Related
I have some items in a list with points, sub-points and sub-sub-points, and I need to convert all of them into JSON in a parent-child hierarchy.
As each point is appended to the list, a point together with its sub- and sub-sub-points all end up in one flat list.
My list looks like this:
lst = ["1. content", "(a) content", "(b) ", "(i)", "(ii)", "(c)", "2.", "3.", "(A)", "(B)", "4."]
import re

Notes = {}
for ptags in soup.findAll('p'):
    lst.append(ptags.get_text())

regex = r"^\([a-z]\)\s.*"
regex1 = r"^\([\D]+\)\s.*"
j = 0
sub = []
for i in lst:
    if sub:
        match = re.match(regex, i)
        match1 = re.match(regex1, i)
        if match:
            sub.append(i)
        elif match1:
            sub.append(i)
        else:
            j = j + 1
            sub = [i]
            Notes[str(j)] = sub
    else:
        if sub:
            Notes[str(j)] = sub
            sub = [i]
        Notes[str(j)] = i
I need the JSON hierarchy output in this way:
"1. content",
"(a) content",
"(b) ",
"(i)",
"(ii"),
"(c)",
"2.",
"3.",
"(A)",
"(B)",
"4."
JSON STRUCTURE:
[
{
"1. content": [
"(a) content",
{
"(b) ": [
"(i)",
"(ii)"
]
},
"(c)"
]
},
"2.",
{
"3.": [
"(A)",
"(B)"
]
},
"4."
]
If you want to have a similar hierarchy, you should change your data to a dict. Because you did not include enough information in your code, I will just add a sample of what your data should look like:
from json import dumps
lst = [{"1. content": ["(a) content", {"(b) ": ["(i)","(ii)"]},"(c)"]},"2.",{"3.": ["(A)","(B)"]},"4."]
Each hierarchy level should be a dictionary. Elements that have no children can be passed as simple list elements.
You can get your json string using dumps now:
dumps(lst)
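If the result should be human-readable, dumps also accepts an indent parameter, and loads round-trips the string back to the same structure:

```python
from json import dumps, loads

lst = [{"1. content": ["(a) content", {"(b) ": ["(i)", "(ii)"]}, "(c)"]},
       "2.", {"3.": ["(A)", "(B)"]}, "4."]

# indent=4 pretty-prints the nested hierarchy
pretty = dumps(lst, indent=4)
print(pretty)

# parsing the string back yields an equal structure
assert loads(pretty) == lst
```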
I'm writing an automation script in Python that makes use of another library. The output I'm given contains the array I need, however the output also includes log messages in string format that are irrelevant.
For my script to work, I need to retrieve only the array which is in the file.
Here's an example of the output I'm getting.
Split /adclix.$~image into 2 rules
Split /mediahosting.engine$document,script into 2 rules
[
{
"action": {
"type": "block"
},
"trigger": {
"url-filter": "/adservice\\.",
"unless-domain": [
"adservice.io"
]
}
}
]
Generated a total of 1 rules (1 blocks, 0 exceptions)
How would I get only the array from this file?
FWIW, I'd rather not have the logic based on the strings outside of the array, as they could be subject to change.
UPDATE: Script I'm getting the data from is here: https://github.com/brave/ab2cb/tree/master/ab2cb
My full code is here:
def pipe_in(process, filter_lists):
try:
for body, _, _ in filter_lists:
process.stdin.write(body)
finally:
process.stdin.close()
def write_block_lists(filter_lists, path, expires):
block_list = generate_metadata(filter_lists, expires)
process = subprocess.Popen(('ab2cb'),
cwd=ab2cb_dirpath,
stdin=subprocess.PIPE, stdout=subprocess.PIPE)
threading.Thread(target=pipe_in, args=(process, filter_lists)).start()
result = process.stdout.read()
with open('output.json', 'w') as destination_file:
destination_file.write(result)
destination_file.close()
if process.wait():
raise Exception('ab2cb returned %s' % process.returncode)
Ideally the array will be modified in memory and written to file later, as I still need to change the data within it.
You can use a regex too:
import re
input = """
Split /adclix.$~image into 2 rules
Split /mediahosting.engine$document,script into 2 rules
[
{
"action": {
"type": "block"
},
"trigger": {
"url-filter": "/adservice\\.",
"unless-domain": [
"adservice.io"
]
}
}
]
Generated a total of 1 rules (1 blocks, 0 exceptions)
asd
asd
"""
regex = re.compile(r"\[(.|\n)*(?:^\]$)", re.M)
x = re.search(regex, input)
print(x.group(0))
EDIT
re.M turns on 'MultiLine matching'
https://repl.it/repls/InfantileDopeyLink
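If you'd rather not depend on a regex or a third-party package, the standard library can also do this: json.JSONDecoder.raw_decode parses a JSON value starting at a given index and reports where it ended, so you can scan the log for candidate start characters. A rough sketch (extract_json_values is a made-up helper name, and the log text below is a simplified copy of the output above):

```python
import json

log_output = """\
Split /adclix.$~image into 2 rules
Split /mediahosting.engine$document,script into 2 rules
[
  {"action": {"type": "block"},
   "trigger": {"url-filter": "/adservice.", "unless-domain": ["adservice.io"]}}
]
Generated a total of 1 rules (1 blocks, 0 exceptions)
"""

def extract_json_values(text):
    """Yield every JSON array or object embedded in free-form text."""
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(text):
        # jump to the next character that could start a JSON container
        starts = [i for i in (text.find('[', idx), text.find('{', idx)) if i != -1]
        if not starts:
            return
        idx = min(starts)
        try:
            obj, end = decoder.raw_decode(text, idx)
            yield obj
            idx = end
        except ValueError:
            idx += 1  # not valid JSON here - keep scanning

rules = next(extract_json_values(log_output))
print(rules[0]["trigger"]["unless-domain"])  # -> ['adservice.io']
```

This sidesteps the asker's concern about the surrounding log lines changing, since nothing outside the JSON is matched.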
I have written a library for this purpose. It's not often that I get to plug it!
from jsonfinder import jsonfinder
logs = r"""
Split /adclix.$~image into 2 rules
Split /mediahosting.engine$document,script into 2 rules
[
{
"action": {
"type": "block"
},
"trigger": {
"url-filter": "/adservice\\.",
"unless-domain": [
"adservice.io"
]
}
}
]
Generated a total of 1 rules (1 blocks, 0 exceptions)
Something else that looks like JSON: [1, 2]
"""
for start, end, obj in jsonfinder(logs):
if (
obj
and isinstance(obj, list)
and isinstance(obj[0], dict)
and {"action", "trigger"} <= obj[0].keys()
):
print(obj)
Demo: https://repl.it/repls/ImperfectJuniorBootstrapping
Library: https://github.com/alexmojaki/jsonfinder
Install with pip install jsonfinder.
How can I cut from such a string (JSON) everything before and including the first [ and everything after and including the last ] with Python?
{
"Customers": [
{
"cID": "w2-502952",
"soldToId": "34124"
},
...
...
],
"status": {
"success": true,
"message": "Customers: 560",
"ErrorCode": ""
}
}
I want to end up with just:
{
"cID" : "w2-502952",
"soldToId" : "34124",
}
...
...
String manipulation is not the way to do this. You should parse your JSON into Python and extract the relevant data using normal data structure access.
import json

obj = json.loads(data)
relevant_data = obj["Customers"]
In addition to @Daniel Rosman's answer, if you want all the lists from the JSON:
import json

result = []
obj = json.loads(data)
for value in obj.values():
    if isinstance(value, list):
        result.extend(value)  # extend, not append(*value), to collect every item
While I agree that Daniel's answer is the absolute best way to go, if you must use string splitting, you can try .find()
string = ...  # however you are loading this JSON text into a string
start = string.find('[')
end = string.rfind(']')  # rfind, so the *last* ']' is found rather than the first
customers = string[start:end + 1]
print(customers)
The output will be everything from the opening [ through the closing ], inclusive.
If you really want to do this via string manipulation (which I don't recommend), you can do it this way:
start = s.find('[') + 1
finish = s.rfind(']')  # use rfind so a nested ']' doesn't cut the slice short
inner = s[start:finish]
I have the following json output:
{
"code":0,
"message":"success",
"data":[
{
"group_id":"12345678901234567890",
"display_name":"GROUP",
"description":"Group 1",
"monitors":[
"12345678901234567890",
"12345678901234567890",
"12345678901234567890"
]
},
{
"group_id":"12345678901234567890",
"display_name":"KK-GROUP1",
"description":"KK Group 1",
"monitors":[
"12345678901234567890",
"12345678901234567890",
"12345678901234567890"
]
},
{
"group_id":"12345678901234567890",
"display_name":"KK-GROUP2",
"description":"KK Group 2",
"monitors":[
"12345678901234567890",
"12345678901234567890",
"12345678901234567890"
]
},
{
"group_id":"12345678901234567890",
"display_name":"KK-GROUP3",
"description":"KK Group 3",
"monitors":[
"12345678901234567890",
"12345678901234567890",
"12345678901234567890"
]
}
]
}
I have a function that should loop through the JSON output received from a pycurl command, find all groups whose name starts with, for example, KK, and build a list of the IDs in each matching group's monitors field to feed into another section of a script I wrote. For the above output it should produce 9 IDs (3 from each KK group), but for whatever reason it's only grabbing the first 3 monitor IDs.
def ReturnedMonitors():
    listOfChecks = json.loads(connectSite('GET', ''))
    for i in listOfChecks['data']:
        while i['display_name'].startswith(options.clusterName.upper()):
            return i['monitors']
The output from this will be passed into the following:
for monitor in ReturnedMonitors():
    putData = 'activate/' + monitor
    print "Activated: " + modifyMonitors('PUT', putData)
modifyMonitors is another pycurl definition that will post to a site.
If you are trying to create a generator function with ReturnedMonitors(), you have written it incorrectly: return exits the function on the first match. You need the yield keyword instead, and if you want to yield each id in the monitors list separately, you should loop over the list and yield the items one at a time. Example -
def ReturnedMonitors():
    listOfChecks = json.loads(connectSite('GET', ''))
    for i in listOfChecks['data']:
        # `if`, not `while`: the condition never changes inside the loop,
        # so `while` would yield the same monitors forever
        if i['display_name'].startswith(options.clusterName.upper()):
            for x in i['monitors']:
                yield x
For Python 3.3+, you can use yield from to yield all values from an iterable/iterator (known as generator delegation):
def ReturnedMonitors():
    listOfChecks = json.loads(connectSite('GET', ''))
    for i in listOfChecks['data']:
        if i['display_name'].startswith(options.clusterName.upper()):
            yield from i['monitors']
I'm in over my head, trying to parse JSON for the first time and dealing with a multi-dimensional array.
{
"secret": "[Hidden]",
"minutes": 20,
"link": "http:\/\/www.1.com",
"bookmark_collection": {
"free_link": {
"name": "#free_link#",
"bookmarks": [
{
"name": "1",
"link": "http:\/\/www.1.com"
},
{
"name": "2",
"link": "http:\/\/2.dk"
},
{
"name": "3",
"link": "http:\/\/www.3.in"
}
]
},
"boarding_pass": {
"name": "Boarding Pass",
"bookmarks": [
{
"name": "1",
"link": "http:\/\/www.1.com\/"
},
{
"name": "2",
"link": "http:\/\/www.2.com\/"
},
{
"name": "3",
"link": "http:\/\/www.3.hk"
}
]
},
"sublinks": {
"name": "sublinks",
"link": [
"http:\/\/www.1.com",
"http:\/\/www.2.com",
"http:\/\/www.3.com"
]
}
}
}
This is divided into 3 parts: the static data on the first level (secret, minutes, link), which I need as separate strings.
Then I need a dictionary per "bookmark collection"; these do not have fixed names, so I need each collection's name plus the links/names of each bookmark in it.
Then there is the separate sublinks entry, which is always the same, where I need all the links in a separate dictionary.
I'm reading about parsing JSON, but most of what I find is a simple array put into one dictionary.
Does anyone have any good techniques for this?
After you parse the JSON, you will end up with a Python dict. So, suppose the above JSON is in a string named input_data:
import json

# This converts from JSON to a python dict
parsed_input = json.loads(input_data)

# Now, all of your static variables are referenceable as keys:
secret = parsed_input['secret']
minutes = parsed_input['minutes']
link = parsed_input['link']

# Plus, you can get your bookmark collection as:
bookmark_collection = parsed_input['bookmark_collection']

# Print a list of names of the bookmark collections...
print(list(bookmark_collection.keys()))  # Note this contains sublinks, so remove it if needed

# Get the name of the Boarding Pass bookmark:
print(bookmark_collection['boarding_pass']['name'])

# Print out a list of all bookmark links as:
#   Boarding Pass
#    * 1: http://www.1.com/
#    * 2: http://www.2.com/
#   ...
for bookmark_definition in bookmark_collection.values():
    # Skip sublinks...
    if bookmark_definition['name'] == 'sublinks':
        continue
    print(bookmark_definition['name'])
    for bookmark in bookmark_definition['bookmarks']:
        print(" * %(name)s: %(link)s" % bookmark)

# Get the sublink definition:
sublinks = parsed_input['bookmark_collection']['sublinks']
# .. and print them
print(sublinks['name'])
for link in sublinks['link']:
    print(' *', link)
Hmm, doesn't json.loads do the trick?
For example, if your data is in a file,
import json

text = open('/tmp/mydata.json').read()
d = json.loads(text)

# first level fields
print(d['minutes'])  # or 'secret' or 'link'
# the names of each of bookmark_collection's items
print(list(d['bookmark_collection'].keys()))
# the sublinks section, as a dict
print(d['bookmark_collection']['sublinks'])
The output of this code (given your sample input above) is:
20
['free_link', 'boarding_pass', 'sublinks']
{'name': 'sublinks', 'link': ['http://www.1.com', 'http://www.2.com', 'http://www.3.com']}
Which, I think, gets you what you need?