Parse only selected records from empty-line separated file - python

I have a file with the following structure:
SE|text|Baz
SE|entity|Bla
SE|relation|Bla
SE|relation|Foo

SE|text|Bla
SE|entity|Foo

SE|text|Zoo
SE|relation|Bla
SE|relation|Baz
Records (i.e., blocks) are separated by an empty line. Each line in a block starts with an SE tag, and a text tag always occurs in the first line of each block.
I wonder how to properly extract only the blocks that contain a relation tag, which is not necessarily present in every block. My attempt is pasted below:
from itertools import groupby
with open('test.txt') as f:
    for nonempty, group in groupby(f, bool):
        if nonempty:
            process_block()  ## ?
The desired output is a JSON dump:
{
    "result": [
        {
            "text": "Baz",
            "relation": ["Bla", "Foo"]
        },
        {
            "text": "Zoo",
            "relation": ["Bla", "Baz"]
        }
    ]
}
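For what it's worth, here is a sketch of how the groupby idea might be completed. Note that a bare newline is truthy, so the key has to strip each line first; the process_block body below is only my guess at the missing piece:
import json
from itertools import groupby

def process_block(block):
    # guess at the missing piece: keep only the text/relation values
    tags = [line.strip().split('|') for line in block]
    return {'text': next(v for _, t, v in tags if t == 'text'),
            'relation': [v for _, t, v in tags if t == 'relation']}

with open('test.txt') as f:
    result = [process_block(block)
              for nonempty, block in groupby(f, key=lambda line: bool(line.strip()))
              if nonempty]
result = [r for r in result if r['relation']]
print(json.dumps({'result': result}))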

I have a proposed solution in pure Python that keeps a block if it contains the value in any position. This could most likely be done more elegantly in a proper framework like pandas.
from pprint import pprint

fname = 'ex.txt'

# extract blocks
with open(fname, 'r') as f:
    blocks = [[]]
    for line in f:
        if len(line) == 1:
            blocks.append([])
        else:
            blocks[-1] += [line.strip().split('|')]

# remove blocks that don't contain 'relation'
blocks = [block for block in blocks
          if any('relation' == x[1] for x in block)]
pprint(blocks)
# [[['SE', 'text', 'Baz'],
# ['SE', 'entity', 'Bla'],
# ['SE', 'relation', 'Bla'],
# ['SE', 'relation', 'Foo']],
# [['SE', 'text', 'Zoo'], ['SE', 'relation', 'Bla'], ['SE', 'relation', 'Baz']]]
# To export to proper json format the following can be done
import pandas as pd
import json

results = []
for block in blocks:
    df = pd.DataFrame(block)
    json_dict = {}
    json_dict['text'] = list(df[2][df[1] == 'text'])
    json_dict['relation'] = list(df[2][df[1] == 'relation'])
    results.append(json_dict)
print(json.dumps(results))
# '[{"text": ["Baz"], "relation": ["Bla", "Foo"]}, {"text": ["Zoo"], "relation": ["Bla", "Baz"]}]'
Let's go through it:
Read the file into a list, dividing blocks at blank lines and splitting columns on the | character.
Go through each block in the list and filter out any that do not contain relation.
Print the output.
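As a side note, if you'd rather skip the pandas dependency for the export step, a plain-Python sketch over the same blocks structure could look like this:
import json

results = []
for block in blocks:
    json_dict = {
        'text': [value for _, tag, value in block if tag == 'text'],
        'relation': [value for _, tag, value in block if tag == 'relation'],
    }
    results.append(json_dict)
print(json.dumps(results))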

You cannot store the same key twice in a dictionary, as mentioned in the comments.
You can read your file, split it at '\n\n' into blocks, split the blocks into lines at '\n', and split the lines into data at '|'.
You can then put it into a suitable data structure and serialize that to a string using the json module:
Create data file:
with open("f.txt","w")as f:
f.write('''SE|text|Baz
SE|entity|Bla
SE|relation|Bla
SE|relation|Foo
SE|text|Bla
SE|entity|Foo
SE|text|Zoo
SE|relation|Bla
SE|relation|Baz''')
Read data and process it:
with open("f.txt") as f:
all_text = f.read()
as_blocks = all_text.split("\n\n")
# skip SE when splitting and filter only with |relation|
with_relation = [[k.split("|")[1:]
for k in b.split("\n")]
for b in as_blocks if "|relation|" in b]
print(with_relation)
Create a suitable data structure, grouping repeated keys into a list:
result = []
for inner in with_relation:
    result.append({})
    for k, v in inner:
        # add as simple key
        if k not in result[-1]:
            result[-1][k] = v
        # got the key a 2nd time: convert the value to a list
        elif not isinstance(result[-1][k], list):
            result[-1][k] = [result[-1][k], v]
        # got it a 3rd+ time: add to the list
        else:
            result[-1][k].append(v)
print(result)
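An alternative sketch for the same grouping step uses collections.defaultdict(list); the trade-off is that every value ends up wrapped in a list, even single ones:
from collections import defaultdict

grouped = []
for inner in with_relation:
    d = defaultdict(list)
    for k, v in inner:
        d[k].append(v)  # every value is collected into a list
    grouped.append(dict(d))
print(grouped)
# [{'text': ['Baz'], 'entity': ['Bla'], 'relation': ['Bla', 'Foo']},
#  {'text': ['Zoo'], 'relation': ['Bla', 'Baz']}]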
Create json from data structure:
import json

print(json.dumps({"result": result}, indent=4))
Output:
# with_relation
[[['text', 'Baz'], ['entity', 'Bla'], ['relation', 'Bla'], ['relation', 'Foo']],
[['text', 'Zoo'], ['relation', 'Bla'], ['relation', 'Baz']]]
# result
[{'text': 'Baz', 'entity': 'Bla', 'relation': ['Bla', 'Foo']},
{'text': 'Zoo', 'relation': ['Bla', 'Baz']}]
# json string
{
    "result": [
        {
            "text": "Baz",
            "entity": "Bla",
            "relation": [
                "Bla",
                "Foo"
            ]
        },
        {
            "text": "Zoo",
            "relation": [
                "Bla",
                "Baz"
            ]
        }
    ]
}

In my opinion this is a very good case for a small parser.
This solution uses a PEG parser called parsimonious, but you could totally use another one:
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor
import json

data = """
SE|text|Baz
SE|entity|Bla
SE|relation|Bla
SE|relation|Foo

SE|text|Bla
SE|entity|Foo

SE|text|Zoo
SE|relation|Bla
SE|relation|Baz
"""

class TagVisitor(NodeVisitor):
    grammar = Grammar(r"""
        content = (ws / block)+
        block   = line+
        line    = ~".+" nl?
        nl      = ~"[\n\r]"
        ws      = ~"\s+"
    """)

    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_content(self, node, visited_children):
        filtered = [child[0] for child in visited_children
                    if isinstance(child[0], dict)]
        return {"result": filtered}

    def visit_block(self, node, visited_children):
        text, relations = None, []
        for child in visited_children:
            if child[1] == "text" and not text:
                text = child[2].strip()
            elif child[1] == "relation":
                relations.append(child[2])
        if relations:
            return {"text": text, "relation": relations}

    def visit_line(self, node, visited_children):
        tag1, tag2, text = node.text.split("|")
        return tag1, tag2, text.strip()

tv = TagVisitor()
result = tv.parse(data)
print(json.dumps(result))
This yields
{"result":
[{"text": "Baz", "relation": ["Bla", "Foo"]},
{"text": "Zoo", "relation": ["Bla", "Baz"]}]
}
The idea is to write a grammar, build an abstract syntax tree out of it, and return each block's content in a suitable data format.
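If pulling in a parser library feels heavy for this input, the same result can be sketched with re.split on blank lines (assuming blocks really are separated by empty lines, as in the question):
import json
import re

blocks = [b for b in re.split(r'\n\s*\n', data.strip()) if '|relation|' in b]
result = []
for b in blocks:
    rows = [line.split('|') for line in b.splitlines()]
    result.append({
        'text': next(v for _, tag, v in rows if tag == 'text'),
        'relation': [v for _, tag, v in rows if tag == 'relation'],
    })
print(json.dumps({'result': result}))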

Related

How do I unflatten a DataFrame back to JSON/XML format?

I am doing analysis on semi-structured data, and for that I had to flatten both XML and JSON files into a pandas DataFrame. Now that the analysis is done and I have made improvements like dropping null values and fixing some data errors, I need to generate XML or JSON files (depending on which format the user entered).
This is what I'm using to flatten XML:
import xml.etree.ElementTree as et
from collections import defaultdict
import pandas as pd

def flatten_xml(node, key_prefix=()):
    """
    Walk an XML node, generating tuples of key parts and values.
    """
    # Copy tag content if any
    text = (node.text or '').strip()
    if text:
        yield key_prefix, text
    # Copy attributes
    for attr, value in node.items():
        yield key_prefix + (attr,), value
    # Recurse into children
    for child in node:
        yield from flatten_xml(child, key_prefix + (child.tag,))

def dictify_key_pairs(pairs, key_sep='.'):
    """
    Dictify key pairs from flatten_xml, taking care of duplicate keys.
    """
    out = {}
    # Group by candidate key.
    key_map = defaultdict(list)
    for key_parts, value in pairs:
        key_map[key_sep.join(key_parts)].append(value)
    # Figure out the final dict with suffixes if required.
    for key, values in key_map.items():
        if len(values) == 1:  # No need to suffix keys.
            out[key] = values[0]
        else:  # More than one value for this key.
            for suffix, value in enumerate(values, 1):
                out[f'{key}{key_sep}{suffix}'] = value
    return out

# Parse XML with etree
tree = et.parse('NCT00571389.xml').iter()
# Generate flat rows out of the root nodes in the tree
rows = [dictify_key_pairs(flatten_xml(row)) for row in tree]
df = pd.DataFrame(rows)
And this is what I'm using to flatten JSON:
from collections import defaultdict
import pandas as pd
import json

def flatten_json(nested_json, exclude=['']):
    out = {}
    def flatten(x, name='', exclude=exclude):
        if type(x) is dict:
            for a in x:
                if a not in exclude:
                    flatten(x[a], name + a + '.')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x
    flatten(nested_json)
    return out

f = open('employee_data.json')
this_dict = json.load(f)
df = pd.DataFrame([flatten_json(x) for x in this_dict[list(this_dict.keys())[0]]])
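For reference, this is what flatten_json produces on a small nested record (made-up record, just to show the column-name shape the DataFrame ends up with):
record = {'candidate': {'first_name': 'Margaret', 'skills': ['Java', 'R']}}
print(flatten_json(record))
# {'candidate.first_name': 'Margaret', 'candidate.skills.0': 'Java', 'candidate.skills.1': 'R'}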
I need to know how to go from a DataFrame back to the original structure of the files. Help, please?
Edit:
This is the example of the JSON file I'm using:
{"features": [{"candidate": {"first_name": "Margaret", "last_name": "Mcdonald", "skills": ["skLearn", "Java", "R", "SQL", "Spark", "C++"], "state": "AL", "specialty": "Database", "experience": "Mid", "relocation": "no"}}, {"candidate": {"first_name": "Michael", "last_name": "Carter", "skills": ["TensorFlow", "R", "Spark", "MongoDB", "C++", "SQL"], "state": "AR", "specialty": "Statistics", "experience": "Senior", "relocation": "yes"}}]}
And these are the columns after I flatten it:
candidate.first_name
candidate.last_name
candidate.skills.0
candidate.skills.1
candidate.skills.2
candidate.skills.3
candidate.skills.4
candidate.skills.5
candidate.state
candidate.specialty
candidate.experience
candidate.relocation
candidate.skills.6
candidate.skills.7
candidate.skills.8
Ok, this was not easy and I should have guided you instead of coding it for you, but here is what I've done:
json = {"features": [{"candidate": {"first_name": "Margaret", "last_name": "Mcdonald", "skills": ["skLearn", "Java", "R", "SQL", "Spark", "C++"], "state": "AL", "specialty": "Database", "experience": "Mid", "relocation": "no"}}, {"candidate": {"first_name": "Michael", "last_name": "Carter", "skills": ["TensorFlow", "R", "Spark", "MongoDB", "C++", "SQL"], "state": "AR", "specialty": "Statistics", "experience": "Senior", "relocation": "yes"}}]}
df = pd.DataFrame([flatten_json(x) for x in json[list(json.keys())[0]]])
import re
header = df.columns
print(header)
regex = r'(\w+)\.(\w+)\.?(\d+)?'
m=re.findall(regex,'\n'.join(header))
def make_json(json,feature,pos,value):
if pos+1 == len(feature):
json[feature[pos]] = value
return json
elif feature[pos+1] == '':
json[feature[pos]] = value
return json
elif feature[pos+1].isdigit():
if feature[pos+1] == '0':
json[feature[pos]] = [value]
return json
else:
json[feature[pos]].append(value)
return json
else:
if feature[pos] not in json:
json[feature[pos]] = make_json({},feature,pos+1,value)
return json
else:
json[feature[pos]] = make_json(json[feature[pos]],feature,pos+1,value)
return json
json = {'features': []}
for row in range(len(df)):
cadidate = {}
for col, feature in enumerate(m):
cadidate = make_json(cadidate,feature,0,df.iloc[row][header[col]])
json['features'].append(cadidate)
print(json)
You see, I wanted to make it recursive so it can work for more complex JSON, as long as you define the regex right. For your specific example it could be simpler.
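To make the regex step less magic, here is what re.findall actually yields for a couple of the column names; the empty third group for non-list columns is why make_json checks feature[pos + 1] == '':
import re

regex = r'(\w+)\.(\w+)\.?(\d+)?'
print(re.findall(regex, 'candidate.skills.3'))    # [('candidate', 'skills', '3')]
print(re.findall(regex, 'candidate.first_name'))  # [('candidate', 'first_name', '')]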

Editing Json Program

My code takes a JSON file path, opens/parses it, and prints out desired values with the help of a CSV mapping file (which tells it what keywords to look for and what name to print the values out under).
Some JSON files, however, have multiple values: for example, a JSON file with the key "Affiliate" will have more key/value pairs inside of it instead of just having a value.
How can I parse within a key like this one and print out the 'true' value vs the 'false' ones? Currently my code would print out the entire array of key/value pairs within that target key.
Example json:
"Affiliate": [
{
"ov": true,
"value": "United States",
"lookupCode": "US"
},
{
"ov": false,
"value": "France",
"lookupCode": "FR"
}
]
My code:
import json
import csv

output_dict = {}

# maps csv and json information
def findValue(json_obj, target_key, output_key):
    for key in json_obj:
        if isinstance(json_obj[key], dict):
            findValue(json_obj[key], target_key, output_key)
        else:
            if target_key == key:
                output_dict[output_key] = json_obj[key]

# Opens and parses json file
file = open('source_data.json', 'r')
json_read = file.read()
obj = json.loads(json_read)

# Opens and parses csv file (mapping)
with open('inputoutput.csv') as csvfile:
    fr = csv.reader(csvfile)
    for row in fr:
        findValue(obj, row[0], row[1])

# creates/writes into json file
with open("output.json", "w") as out:
    json.dump(output_dict, out, indent=4)
So you'll need to change the way the mapping CSV is structured, as you'll need variables to determine which criteria to meet, and which value to return when the criteria are met...
Please note that with the logic implemented below, if there are 2 list items in Affiliate that both have the key ov set to true, only the last one will be added (dict keys are unique). You could put a return where I commented in the code, but then it would only use the first one, of course.
I've restructured the CSV as below:
inputoutput.csv
Affiliate,Cntr,ov,true,value
Sample1,Output1,,,
Sample2,Output2,criteria2,true,returnvalue
The JSON I used as the source data is this one:
source_data.json
{
    "Affiliate": [
        {
            "ov": true,
            "value": "United States",
            "lookupCode": "US"
        },
        {
            "ov": false,
            "value": "France",
            "lookupCode": "FR"
        }
    ],
    "Sample1": "im a value",
    "Sample2": [
        {
            "criteria2": false,
            "returnvalue": "i am not a return value"
        },
        {
            "criteria2": true,
            "returnvalue": "i am a return value"
        }
    ]
}
The actual code is below, note that I commented a bit on my choices.
main.py
import json
import csv

output_dict = {}

def str2bool(input: str) -> bool:
    """simple check to see if a str is a bool"""
    # shamelessly stolen from:
    # https://stackoverflow.com/a/715468/9267296
    return input.lower() in ("yes", "true", "t", "1")

def findValue(
    json_obj,
    target_key,
    output_key,
    criteria_key=None,
    criteria_value=False,
    return_key="",
):
    """maps csv and json information"""
    # ^^ use PEP standard for docstrings:
    # https://www.python.org/dev/peps/pep-0257/#id16
    # you need to global the output_dict to avoid weirdness
    # see https://www.w3schools.com/python/gloss_python_global_scope.asp
    global output_dict
    for key in json_obj:
        if isinstance(json_obj[key], dict):
            findValue(json_obj[key], target_key, output_key)
        # in this case I advise to use "elif" instead of the "else: if..."
        elif target_key == key:
            # so this is the actual logic change.
            if isinstance(json_obj[key], list):
                for item in json_obj[key]:
                    if (
                        criteria_key != None
                        and criteria_key in item
                        and item[criteria_key] == criteria_value
                    ):
                        output_dict[output_key] = item[return_key]
                        # here you could put a return
            else:
                # this part doesn't change
                output_dict[output_key] = json_obj[key]
            # since we found the key and added it to the output_dict
            # you can return here to slightly speed up the total
            # execution time
            return

# Opens and parses json file
with open("source_data.json") as sourcefile:
    json_obj = json.load(sourcefile)

# Opens and parses csv file (mapping)
with open("inputoutput.csv") as csvfile:
    fr = csv.reader(csvfile)
    for row in fr:
        # this check is to determine if you need to add criteria
        # row[2] would be the key to check
        # row[3] would be the value that the key needs to have
        # row[4] would be the key for which to return the value
        if row[2] != "":
            findValue(json_obj, row[0], row[1], row[2], str2bool(row[3]), row[4])
        else:
            findValue(json_obj, row[0], row[1])

# Creates/writes into json file
with open("output.json", "w") as out:
    json.dump(output_dict, out, indent=4)
Running the above code with the input files I provided results in the following file:
output.json
{
    "Cntr": "United States",
    "Output1": "im a value",
    "Output2": "i am a return value"
}
I know there are ways to optimize this, but I wanted to keep it close to the original. You might need to play with the exact way you add stuff to output_dict to get the exact output JSON you want...
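If you ever do need to keep every matching item instead of only the last, one hedged variant is to collect values into a list with setdefault; a tiny standalone sketch of the pattern:
output_dict = {}

def add_match(output_key, value):
    # setdefault creates the list on first use, then appends to it
    output_dict.setdefault(output_key, []).append(value)

add_match("Cntr", "United States")
add_match("Cntr", "Canada")
print(output_dict)  # {'Cntr': ['United States', 'Canada']}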

How to make a single list from lists when lists are results of a function?

I have a JSON file like:
{
    "level0name": {
        "level1name": [
            {
                "notkey": null,
                "key": "Some text 626 dollars."
            },
            {
                "notkey": null,
                "key": "Some text 3213 dollars."
            }
        ]
    }
}
and Python code that parses the JSON with a regex:
import json
import re

path = 'name.json'

def parser():
    with open(path, 'r') as jfile:
        data = json.loads(jfile.read())
    for i in data["level0name"]["level1name"]:
        try:
            all_messages = i['key']
            a = re.findall(u'[0-9]{1,}\sdollars.', all_messages)
            for i in a:
                print(i)
        except KeyError:
            continue

parser()
The function gives me many separate lists like the ones below, and I can't merge them.
[625 dollars]
[3213 dollars]
[121 dollars]
[692 dollars]
How can I get a single list? Maybe I'm doing something wrong while parsing?
I just need a single comma-separated list, like:
[625, 3213, 121, 692]
You can slightly change your regex to use a positive lookahead, along with a list to collect the results:
import json
import re

path = 'name.json'

def parser():
    with open(path, 'r') as jfile:
        data = json.loads(jfile.read())
    final_list = []  # for collecting the data
    for i in data["level0name"]["level1name"]:
        try:
            all_messages = i['key']
            a = re.findall(u'[0-9]{1,}(?=\sdollars\.)', all_messages)
            if a:
                final_list.append(int(a[0]))  # append to final_list
        except KeyError:
            continue
    return final_list

print(parser())
Or, you can read the file and apply regex:
import re
final_data = map(int, re.findall('\d+(?=\sdollars\.)', open('path.json').read()))
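One caveat with this one-liner: in Python 3, map returns a lazy iterator, so wrap it in list() if you want the list itself (and a with block avoids leaving the file open); a sketch using the question's file name:
import re

with open('name.json') as f:
    final_data = list(map(int, re.findall(r'\d+(?=\sdollars\.)', f.read())))
print(final_data)  # e.g. [626, 3213, ...] depending on the file contents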
Using a slightly longer JSON input file:
{
    "level0name": {
        "level1name": [
            {
                "notkey": null,
                "key": "Some text 626 dollars."
            },
            {
                "notkey": null,
                "key": "Some text 3213 dollars."
            },
            {
                "notkey": null,
                "key": "Some text 121 dollars."
            },
            {
                "notkey": null,
                "key": "Some text 692 dollars."
            }
        ]
    }
}
Along with this code:
import json
import re

path = 'name.json'

def parser():
    results = []
    with open(path, 'r') as jfile:
        data = json.loads(jfile.read())
    for i in data["level0name"]["level1name"]:
        try:
            match = re.search(r'([0-9]{1,})\sdollars.', i['key'])
            if match:
                results.append(match.group(1))
        except KeyError:
            continue
    return results

print(parser())  # -> ['626', '3213', '121', '692']
Seems to do what you want. Note how a pair of parentheses was added to the regex pattern to indicate which portion (aka group) of the pattern contains the characters of interest. These are capturing parentheses, as opposed to the non-capturing parentheses, (?:...), described in the documentation; in other words, regular parentheses are the kind that do capturing.
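A quick illustration of the difference between the two kinds of parentheses:
import re

text = "Some text 626 dollars."
print(re.search(r'([0-9]{1,})\sdollars\.', text).group(1))  # 626  (captured group)
print(re.findall(r'(?:[0-9]{1,})\sdollars\.', text))        # ['626 dollars.']  (nothing captured)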

Dictionary with same keys in python

I am trying to create a JSON object using a dictionary in Python. As far as I understand, the key part needs to be unique, but in my case the array has multiple items with the same key, so it looks like a dictionary will not work for me here. Trying to understand my options here? Finally, I will be saving this JSON object into a JSON file on the server.
data = {}
data['key1'] = hostname
for line in pipe.stdout:
    parts = line.split()  # split line into parts
    if len(parts) > 1:  # if at least 2 parts/columns
        data['package'] = {'name': parts[0], 'installed': parts[1], 'available': parts[2]}
print(json.dumps(data, indent=4))
Expected JSON output:
{
    "key1": "xyz-abc-m001",
    "package": [
        { "name": "abc", "installed": "1:1", "available": "1:1.2." },
        { "name": "xyz", "installed": "2.02", "available": "2.02" },
        { "name": "zyc", "installed": "1.17.1", "available": "1.17.1" }
    ]
}
data = {}
data['key1'] = hostname
data['package'] = []
for line in pipe.stdout:
    parts = line.split()  # split line into parts
    if len(parts) > 1:  # if at least 2 parts/columns
        data['package'].append({'name': parts[0], 'installed': parts[1], 'available': parts[2]})
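A quick way to see it working without the real pipe is to substitute a few sample lines (hostname and the lines below are made up for the demonstration):
import json

hostname = 'xyz-abc-m001'  # stand-in for the real hostname
sample_lines = ['abc 1:1 1:1.2.', 'xyz 2.02 2.02', 'zyc 1.17.1 1.17.1']

data = {'key1': hostname, 'package': []}
for line in sample_lines:
    parts = line.split()
    if len(parts) > 1:
        data['package'].append({'name': parts[0], 'installed': parts[1], 'available': parts[2]})
print(json.dumps(data, indent=4))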

Formatting a string in required format in Python

I have data in the format:
id1 id2 value
Something like
1 234 0.2
1 235 0.1
and so on.
I want to convert it into JSON format:
{
    "nodes": [
        {"name": "1"},    # first element
        {"name": "234"},  # second element
        {"name": "235"}   # third element
    ],
    "links": [
        {"source": 1, "target": 2, "value": 0.2},
        {"source": 1, "target": 3, "value": 0.1}
    ]
}
So, going from the original data to the above format: the nodes contain the set of (distinct) names present in the original data, and the links use the positions of source and target in the list of values held by "nodes".
For example, for:
1 234 0.2
1 is the first element in the list of values held by the key "nodes"
234 is the second element in the list of values held by the key "nodes"
Hence the link dictionary is {"source": 1, "target": 2, "value": 0.2}.
How do I do this efficiently in Python? I am sure there should be a better way than what I am doing, which is so messy :(
Here is what I am doing
from collections import defaultdict

def open_file(filename, output=None):
    f = open(filename, "r")
    offset = 3429
    data_dict = {}
    node_list = []
    node_dict = {}
    link_list = []
    num_lines = 0
    line_ids = []
    for line in f:
        line = line.strip()
        tokens = line.split()
        mod_wid = int(tokens[1]) + offset
        if not node_dict.has_key(tokens[0]):
            d = {"name": tokens[0], "group": 1}
            node_list.append(d)
            node_dict[tokens[0]] = True
            line_ids.append(tokens[0])
        if not node_dict.has_key(mod_wid):
            d = {"name": str(mod_wid), "group": 1}
            node_list.append(d)
            node_dict[mod_wid] = True
            line_ids.append(mod_wid)
        link_d = {"source": line_ids.index(tokens[0]), "target": line_ids.index(mod_wid), "value": tokens[2]}
        link_list.append(link_d)
        if num_lines > 10000:
            break
        num_lines += 1
    data_dict = {"nodes": node_list, "links": link_list}
    print "{\n"
    for k, v in data_dict.items():
        print '"' + k + '"' + ":\n [ \n "
        for each_v in v:
            print each_v, ","
        print "\n],"
    print "}"

open_file("lda_input.tsv")
I'm assuming by "efficiently" you're talking about programmer efficiency—how easy it is to read, maintain, and code the logic—rather than runtime speed efficiency. If you're worried about the latter, you're probably worried for no reason. (But the code below will probably be faster anyway.)
The key to coming up with a better solution is to think more abstractly. Think about rows in a CSV file, not lines in a text file; create a dict that can be rendered in JSON rather than trying to generate JSON via string processing; wrap things up in functions if you want to do them repeatedly; etc. Something like this:
import csv
import json
import sys

def parse(inpath, namedict):
    lastname = [0]
    def lookup_name(name):
        try:
            print('Looking up {} in {}'.format(name, names))
            return namedict[name]
        except KeyError:
            lastname[0] += 1
            print('Adding {} as {}'.format(name, lastname[0]))
            namedict[name] = lastname[0]
            return lastname[0]
    with open(inpath) as f:
        reader = csv.reader(f, delimiter=' ', skipinitialspace=True)
        for id1, id2, value in reader:
            yield {'source': lookup_name(id1),
                   'target': lookup_name(id2),
                   'value': value}

for inpath in sys.argv[1:]:
    names = {}
    links = list(parse(inpath, names))
    nodes = [{'name': name} for name in names]
    outpath = inpath + '.json'
    with open(outpath, 'w') as f:
        json.dump({'nodes': nodes, 'links': links}, f, indent=4)
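The lastname = [0] one-element list is the classic pre-nonlocal trick for mutating a variable from an enclosing scope; in Python 3 the same counter can be sketched with nonlocal:
def make_lookup(namedict):
    lastname = 0
    def lookup_name(name):
        nonlocal lastname  # rebind the enclosing variable directly
        if name in namedict:
            return namedict[name]
        lastname += 1
        namedict[name] = lastname
        return lastname
    return lookup_name

names = {}
lookup = make_lookup(names)
print(lookup('1'), lookup('234'), lookup('1'))  # 1 2 1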
Don't construct the JSON manually. Make it out of an existing Python object with the json module:
def parse(data):
    nodes = set()
    links = set()
    for line in data.split('\n'):
        fields = line.split()
        id1, id2 = map(int, fields[:2])
        value = float(fields[2])
        nodes.update((id1, id2))
        links.add((id1, id2, value))
    return {
        'nodes': [{
            'name': node
        } for node in nodes],
        'links': [{
            'source': link[0],
            'target': link[1],
            'value': link[2]
        } for link in links]
    }
Now, you can use json.dumps to get a string:
>>> import json
>>> data = '1 234 0.2\n1 235 0.1'
>>> parsed = parse(data)
>>> parsed
{'links': [{'source': 1, 'target': 235, 'value': 0.1},
{'source': 1, 'target': 234, 'value': 0.2}],
'nodes': [{'name': 1}, {'name': 234}, {'name': 235}]}
>>> json.dumps(parsed)
'{"nodes": [{"name": 1}, {"name": 234}, {"name": 235}], "links": [{"source": 1, "target": 235, "value": 0.1}, {"source": 1, "target": 234, "value": 0.2}]}'
