Convert CSV to JSON with nested array - python

I have a CSV file
group, first, last
fans, John, Smith
fans, Alice, White
students, Ben, Smith
students, Joan, Carpenter
...
The Output JSON file needs this format:
[
{
"group" : "fans",
"user" : [
{
"first" : "John",
"last" : "Smith"
},
{
"first" : "Alice",
"last" : "White"
}
]
},
{
"group" : "students",
"user" : [
{
"first" : "Ben",
"last" : "Smith"
},
{
"first" : "Joan",
"last" : "Carpenter"
}
]
}
]

Short answer
Use itertools.groupby, as described in the documentation.
Long answer
This is a multi-step process.
Start by getting your CSV into a list of dict:
from csv import DictReader
with open('data.csv') as csvfile:
    r = DictReader(csvfile, skipinitialspace=True)
    data = [dict(d) for d in r]
groupby needs sorted data, so define a function to get the key, and pass it in like so:
def keyfunc(x):
    return x['group']
data = sorted(data, key=keyfunc)
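To see why the sort matters, here is a small sketch. The sample rows are made up for illustration. groupby only merges consecutive rows that share a key, so unsorted input can yield the same group more than once:

```python
from itertools import groupby

# Hypothetical rows, deliberately not sorted by 'group'
rows = [
    {"group": "fans", "first": "John"},
    {"group": "students", "first": "Ben"},
    {"group": "fans", "first": "Alice"},
]

def keyfunc(x):
    return x["group"]

# Without sorting, 'fans' shows up as two separate groups
unsorted_keys = [k for k, _ in groupby(rows, keyfunc)]

# After sorting, each key appears exactly once
sorted_keys = [k for k, _ in groupby(sorted(rows, key=keyfunc), keyfunc)]

print(unsorted_keys)  # ['fans', 'students', 'fans']
print(sorted_keys)    # ['fans', 'students']
```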
Last, call groupby, providing your sorted data and your key function:
from itertools import groupby
groups = []
for k, g in groupby(data, keyfunc):
    groups.append({
        "group": k,
        "user": [{k:v for k, v in d.items() if k != 'group'} for d in list(g)]
    })
This will iterate over your data, and every time the key changes, it drops into the for block and executes that code, providing k (the key for that group) and g (the dict objects that belong to it). Here we just store those in a list for later.
In this example, the user key uses some pretty dense comprehensions to remove the group key from every row of user. If you can live with that little bit of extra data, that whole line can be simplified as:
"user": list(g)
The result looks like this:
[
{
"group": "fans",
"user": [
{
"first": "John",
"last": "Smith"
},
{
"first": "Alice",
"last": "White"
}
]
},
{
"group": "students",
"user": [
{
"first": "Ben",
"last": "Smith"
},
{
"first": "Joan",
"last": "Carpenter"
}
]
}
]
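Putting the steps together, a minimal end-to-end sketch might look like the following. It assumes the CSV has exactly the columns shown in the question, and it reads the rows from an inline string instead of data.csv so that it can run on its own. json.dumps produces the final JSON text.

```python
import csv
import io
import json
from itertools import groupby

# Inline stand-in for data.csv (assumption: same columns as in the question)
csv_text = """group, first, last
fans, John, Smith
fans, Alice, White
students, Ben, Smith
students, Joan, Carpenter
"""

def keyfunc(row):
    return row["group"]

# skipinitialspace=True drops the blank that follows each comma
reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
data = sorted((dict(row) for row in reader), key=keyfunc)

groups = [
    {
        "group": key,
        "user": [{k: v for k, v in row.items() if k != "group"} for row in rows],
    }
    for key, rows in groupby(data, keyfunc)
]

json_text = json.dumps(groups, indent=4)
print(json_text)
```

To write a file instead of a string, json.dump(groups, f, indent=4) with an open file object f does the same job.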

Related

Delete item from JSON based on key Python

I have a large JSON file that needs cutting, I'm trying to delete the following items: "owner", "ticker", "comment" and "ptr_link" as keys.
JSON file:
{
"transactions": {
"0": [
{
"transaction_date": "11/29/2022",
"owner": "Spouse",
"ticker": "WIW",
"asset_description": "Western Asset Inflation-Linked Opportunities & Inc",
"asset_type": "Stock",
"type": "Sale (Full)",
"amount": "$1,001 - $15,000",
"comment": "--",
"ptr_link": "https://efdsearch.senate.gov/search/view/ptr/5ac4d053-0258-4531-af39-8a8067f0d085/"
},
{
"transaction_date": "11/29/2022",
"owner": "Spouse",
"ticker": "GBIL",
"asset_description": "Goldman Sachs Access Treasury 0-1 Year ETF",
"asset_type": "Other Securities",
"type": "Purchase",
"amount": "$1,001 - $15,000",
"comment": "--",
"ptr_link": "https://efdsearch.senate.gov/search/view/ptr/5ac4d053-0258-4531-af39-8a8067f0d085/"
}
]
}
}
The "0" key that holds this list can range up to the 60s, so I need to access all of them collectively rather than writing code specifically for this list. The same applies to the dictionaries that hold the keys and values: there could be any number of them, so I can't hard-code [0], [1], etc.
This is my code. I'm trying to navigate to the relevant object and simply delete the keys, but I need to do this collectively, as mentioned.
import json
data = json.load(open("xxxtester.json"))
data1 = data['transactions']
data2 = data1['0'][0]
for i in data2:
    del data2['owner']
for i in data2:
    del data2['ticker']
for i in data2:
    del data2['comment']
for i in data2:
    del data2['ptr_link']
open("xxxtester.json", "w").write(json.dumps(data, indent=4))
Try:
import json
with open("your_data.json", "r") as f_in:
    data = json.load(f_in)
to_delete = {"owner", "ticker", "comment", "ptr_link"}
for k in data["transactions"]:
    data["transactions"][k] = [
        {kk: vv for kk, vv in d.items() if kk not in to_delete}
        for d in data["transactions"][k]
    ]
print(data)
Prints:
{
"transactions": {
"0": [
{
"transaction_date": "11/29/2022",
"asset_description": "Western Asset Inflation-Linked Opportunities & Inc",
"asset_type": "Stock",
"type": "Sale (Full)",
"amount": "$1,001 - $15,000",
},
{
"transaction_date": "11/29/2022",
"asset_description": "Goldman Sachs Access Treasury 0-1 Year ETF",
"asset_type": "Other Securities",
"type": "Purchase",
"amount": "$1,001 - $15,000",
},
]
}
}
To save back as JSON:
with open("output.json", "w") as f_out:
    json.dump(data, f_out, indent=4)
If you just want to remove some keys from each dictionary in every list, try this:
import json
data = json.load(open("xxxtester.json"))
for_delete = ["owner", "ticker", "comment", "ptr_link"]
# loop over every list under "transactions", not just "0"
for records in data['transactions'].values():
    for d in records:
        for key in for_delete:
            # pop with a default is a no-op when the key is missing
            d.pop(key, None)
open("xxxtester.json", "w").write(
    json.dumps(data, indent=4))

Pythonic way to transform/flatten JSON containing nested table-as-list-of-dicts structures

Suppose I have a table represented in JSON as a list of dicts, where the keys of each item are the same:
J = [
{
"symbol": "ETHBTC",
"name": "Ethereum",
:
},
{
"symbol": "LTCBTC",
"name": "LiteCoin",
:
},
And suppose I require efficient lookup, e.g. symbols['ETHBTC']['name']
I can transform with symbols = { item['symbol']: item for item in J }, producing:
{
"ETHBTC": {
"symbol": "ETHBTC",
"name": "Ethereum",
:
},
"LTCBTC": {
"symbol": "LTCBTC",
"name": "LiteCoin",
:
},
(Ideally I would also remove the now redundant symbol field).
However, what if each item itself contains a "table-as-list-of-dicts"?
Here's a fuller minimal example (I've removed lines not pertinent to the problem):
J = {
"symbols": [
{
"symbol":"ETHBTC",
"filters":[
{
"filterType":"PRICE_FILTER",
"minPrice":"0.00000100",
},
{
"filterType":"PERCENT_PRICE",
"multiplierUp":"5",
},
],
},
{
"symbol":"LTCBTC",
"filters":[
{
"filterType":"PRICE_FILTER",
"minPrice":"0.00000100",
},
{
"filterType":"PERCENT_PRICE",
"multiplierUp":"5",
},
],
}
]
}
So the challenge is to transform this structure into:
J = {
"symbols": {
"ETHBTC": {
"filters": {
"PRICE_FILTER": {
"minPrice": "0.00000100",
:
}
I can write a flatten function:
def flatten(L:list, key) -> dict:
    def remove_key_from(D):
        del D[key]
        return D
    return { D[key]: remove_key_from(D) for D in L }
Then I can flatten the outer list and loop through each key/val in the resulting dict, flattening val['filters']:
J['symbols'] = flatten(J['symbols'], key="symbol")
for symbol, D in J['symbols'].items():
    D['filters'] = flatten(D['filters'], key="filterType")
Is it possible to improve upon this using glom (or otherwise)?
Initial transform has no performance constraint, but I require efficient lookup.
I don't know if you'd call it pythonic, but you could make your function more generic by using recursion and dropping key as an argument. Since you already assume that your lists contain dictionaries, you could take advantage of Python's dynamic typing and accept any kind of input:
from pprint import pprint
def flatten_rec(I) -> dict:
    if isinstance(I, dict):
        I = {k: flatten_rec(v) for k,v in I.items()}
    elif isinstance(I, list):
        I = { list(D.values())[0]: {k:flatten_rec(v) for k,v in list(D.items())[1:]} for D in I }
    return I
pprint(flatten_rec(J))
Output:
{'symbols': {'ETHBTC': {'filters': {'PERCENT_PRICE': {'multiplierUp': '5'},
'PRICE_FILTER': {'minPrice': '0.00000100'}}},
'LTCBTC': {'filters': {'PERCENT_PRICE': {'multiplierUp': '5'},
'PRICE_FILTER': {'minPrice': '0.00000100'}}}}}
Since you have different transformation rules for different keys, you can keep a list of the key names that require "grouping" on:
t = ['symbol', 'filterType']
def transform(d):
    if (m:={a:b for a, b in d.items() if a in t}):
        return {[*m.values()][0]:transform({a:b for a, b in d.items() if a not in m})}
    return {a:b if not isinstance(b, list) else {x:y for j in b for x, y in transform(j).items()} for a, b in d.items()}
import json
print(json.dumps(transform(J), indent=4))
{
"symbols": {
"ETHBTC": {
"filters": {
"PRICE_FILTER": {
"minPrice": "0.00000100"
},
"PERCENT_PRICE": {
"multiplierUp": "5"
}
}
},
"LTCBTC": {
"filters": {
"PRICE_FILTER": {
"minPrice": "0.00000100"
},
"PERCENT_PRICE": {
"multiplierUp": "5"
}
}
}
}
}
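For comparison, here is a non-recursive sketch that names the key field for each level explicitly and leaves the input untouched. The helper name index_by is made up for this example, and the sample data is the trimmed structure from the question.

```python
def index_by(records, key):
    """Turn a list of dicts into a dict keyed by each record's `key` field,
    dropping that field from the stored copy."""
    return {
        rec[key]: {k: v for k, v in rec.items() if k != key}
        for rec in records
    }

J = {
    "symbols": [
        {
            "symbol": "ETHBTC",
            "filters": [
                {"filterType": "PRICE_FILTER", "minPrice": "0.00000100"},
                {"filterType": "PERCENT_PRICE", "multiplierUp": "5"},
            ],
        },
        {
            "symbol": "LTCBTC",
            "filters": [
                {"filterType": "PRICE_FILTER", "minPrice": "0.00000100"},
                {"filterType": "PERCENT_PRICE", "multiplierUp": "5"},
            ],
        },
    ]
}

# Index the outer list first, then the nested "filters" list of each entry
symbols = index_by(J["symbols"], "symbol")
for entry in symbols.values():
    entry["filters"] = index_by(entry["filters"], "filterType")

result = {"symbols": symbols}
print(result["symbols"]["ETHBTC"]["filters"]["PRICE_FILTER"]["minPrice"])  # 0.00000100
```

Because index_by builds new dictionaries, J itself still holds the original lists afterwards, which can be handy if the raw response is needed elsewhere.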

Creating a deeply nested dictionary from a csv file using Python [duplicate]

This question already has an answer here:
Python: Dynamically update a dictionary with varying variable "depth"
(1 answer)
Closed 3 years ago.
I am creating a multi-level nested dictionary by reading from a large CSV file. The contents of the file are in the following format, where each row stores the relevant information for a unique book. We can assume each row has 6 columns (author, title, year, category, url, citations); all column entries have identical formatting. For example:
Author,Title,Year,Category,Url,Citations
"jk rowling, etc....",goblet of fire,1973,magic: fantasy: english literature,http://doi.acm.org/10.1145/800010.808066,6
"Weiner, Leonard H.",cracking the coding interview,1973,LA: assessment: other,http://doi.acm.org/10.1145/800010.808105,2
"Tolkien",hobbit,1953,magic: fantasy: medieval,http://doi.acm.org/10.1145/800010.808066,6
I want the output to match how each row in the csv file is parsed, similar to the following:
(Note: the number of nested dictionaries depends on the book categories under the Category header of the CSV. Keys are based on successive categories (order matters), separated by the ':' delimiter. Think of the ordering of categories in each row as a directory path: multiple files can share the same path up to a certain point, or they can have exactly the same path and be placed in the same folder.)
results = {'1973':{
"magic": {
"fantasy": {
"English literature": {
"name": "goblet of fire",
"citations": 6,
"url": "http://doi.acm.org/10.1145/800010.808066"
}
},
"medieval": {
"name": "The Hobbit",
"citations": 7,
"url": "http://doi.acm.org/10.1145/800fdfdffd010.808066"
}
}
},
'1953':{
"la": {
"assessment": {
"other": {
"name": "cracking the coding interview",
"citations": 6,
"url": "http://doi.acm.org/10.1145/800010.808105"
}
}
}
}
}
Obviously some books will share common successive categories, as in the example above. Some books might also share exactly the same successive categories. I think I should iterate recursively through the string of categories in each row of the CSV, creating new sub-dicts wherever the path deviates from a pre-existing category order, and then creating a dictionary representation of the book once there are no more successive categories to check. I'm just not sure exactly how to start.
Here's what I have so far, it's just a standard setup of reading csv files:
with open(DATA_FILE, 'r') as data_file:
    data = csv.reader(data_file)
Essentially, I want to create a tree representation of this CSV using nested dictionaries, with the relative category path (i.e. magic:fantasy:etc...) determining which subtree to traverse or create. If two or more books have the same consecutive path, I want to make all those books leaves of their respective key, instead of overwriting each book (leaf) whenever a new book has an identical category path. Leaves are dictionary representations of the books listed in each row of the CSV.
You can group your data by category (using a simple dictionary, as you mentioned that you cannot use any modules other than csv) and then apply recursion:
import csv
_, *data = csv.reader(open('filename.csv'))
new_data = [[i[3].split(': '), *i[4:], *i[:3]] for i in data]
def group(d):
    _d = {}
    for a, *b in d:
        if a[0] not in _d:
            _d[a[0]] = [[a[1:], *b]]
        else:
            _d[a[0]].append([a[1:], *b])
    r = {a:{'books':[{'name':c[-2], 'citations':c[2], 'url':c[1], 'author':c[3]} for c in b if not c[0]], **(lambda x:{} if not x else group(x))([c for c in b if c[0]])} for a, b in _d.items()}
    return {a:{c:d for c, d in b.items() if d} for a, b in r.items()}
import json
print(json.dumps(group(new_data), indent=4))
Output:
{
"magic": {
"fantasy": {
"english literature": {
"books": [
{
"name": "goblet of fire",
"citations": "6",
"url": "http://doi.acm.org/10.1145/800010.808066",
"author": "jk rowling, etc...."
}
]
},
"medieval": {
"books": [
{
"name": "hobbit",
"citations": "6",
"url": "http://doi.acm.org/10.1145/800010.808066",
"author": "Tolkien"
}
]
}
}
},
"LA": {
"assessment": {
"other": {
"books": [
{
"name": "cracking the coding interview",
"citations": "2",
"url": "http://doi.acm.org/10.1145/800010.808105",
"author": "Weiner, Leonard H."
}
]
}
}
}
}
Edit: grouping by publication date:
import csv
_, *data = csv.reader(open('filename.csv'))
new_data = [[i[3].split(': '), *i[4:], *i[:3]] for i in data]
_data = {}
for i in new_data:
    if i[-1] not in _data:
        _data[i[-1]] = [i]
    else:
        _data[i[-1]].append(i)
final_result = {a:group(b) for a, b in _data.items()}
Output:
{
"1973": {
"magic": {
"fantasy": {
"english literature": {
"books": [
{
"name": "goblet of fire",
"citations": "6",
"url": "http://doi.acm.org/10.1145/800010.808066",
"author": "jk rowling, etc...."
}
]
}
}
},
"LA": {
"assessment": {
"other": {
"books": [
{
"name": "cracking the coding interview",
"citations": "2",
"url": "http://doi.acm.org/10.1145/800010.808105",
"author": "Weiner, Leonard H."
}
]
}
}
}
},
"1953": {
"magic": {
"fantasy": {
"medieval": {
"books": [
{
"name": "hobbit",
"citations": "6",
"url": "http://doi.acm.org/10.1145/800010.808066",
"author": "Tolkien"
}
]
}
}
}
}
}
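The same tree can also be built without recursion by walking down one level per category with dict.setdefault. This is only a sketch under a few assumptions: the rows are the three from the question, the separator is ': ', and no category is itself named "books". io and json are used here just to keep the demo self-contained and printable; the tree-building loop itself needs only csv.

```python
import csv
import io
import json

# Inline stand-in for the CSV file, using the rows from the question
csv_text = '''Author,Title,Year,Category,Url,Citations
"jk rowling, etc....",goblet of fire,1973,magic: fantasy: english literature,http://doi.acm.org/10.1145/800010.808066,6
"Weiner, Leonard H.",cracking the coding interview,1973,LA: assessment: other,http://doi.acm.org/10.1145/800010.808105,2
"Tolkien",hobbit,1953,magic: fantasy: medieval,http://doi.acm.org/10.1145/800010.808066,6
'''

tree = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    # Start at the year, then descend one level per category
    node = tree.setdefault(row["Year"], {})
    for part in row["Category"].split(": "):
        node = node.setdefault(part, {})
    # Books sharing a full path accumulate in the same list
    node.setdefault("books", []).append({
        "name": row["Title"],
        "citations": row["Citations"],
        "url": row["Url"],
        "author": row["Author"],
    })

print(json.dumps(tree, indent=4))
```

Citations stay strings here, as in the output above; wrap row["Citations"] in int() if numbers are preferred.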
Separate categories by their nest
Parse CSV to pandas dataframe
Groupby by category in a loop
use to_dict() to convert to dict in a groupby loop
You can do something like the following:
import pandas as pd
df = pd.read_csv('yourcsv.csv', sep=',')
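Next, you want to isolate the Category column and split its content on the ': ' separator:
cols_no_categ = list(df.columns)
cols_no_categ.remove('Category')
category = df['Category']
DICT = {}
for c in category.unique():
    # select the rows of this category, without the Category column
    dicto = df.loc[df.Category == c, cols_no_categ].to_dict()
    *parents, leaf = c.split(': ')
    # create the intermediate levels on the way down
    node = DICT
    for p in parents:
        node = node.setdefault(p, {})
    node[leaf] = dicto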

Extracting data from JSON depending on other parameters

What are the options for extracting value from JSON depending on other parameters (using python)? For example, JSON:
"list": [
{
"name": "value",
"id": "123456789"
},
{
"name": "needed-value",
"id": "987654321"
}
]
When using json_name["list"][0]["id"] it obviously returns 123456789. Is there a way to specify the "name" value "needed-value" so I could get 987654321 in return?
For example:
import json as j
s = '''
{
"list": [
{
"name": "value",
"id": "123456789"
},
{
"name": "needed-value",
"id": "987654321"
}
]
}
'''
js = j.loads(s)
print([x["id"] for x in js["list"] if x["name"] == "needed-value"])
The best way to handle this is to refactor the json as a single dictionary. Since "name" and "id" are redundant you can make the dictionary with the value from "name" as the key and the value from "id" as the value.
import json
j = '''{
"list":[
{
"name": "value",
"id": "123456789"
},{
"name": "needed-value",
"id": "987654321"
}
]
}'''
jlist = json.loads(j)['list']
d = {jd['name']: jd['id'] for jd in jlist}
print(d) ##{'value': '123456789', 'needed-value': '987654321'}
Now you can iterate the items like you normally would from a dictionary.
for k, v in d.items():
    print(k, v)
# value 123456789
# needed-value 987654321
And since the names are now hashed, you can check membership more efficiently than continually querying the list.
assert 'needed-value' in d
jsn = {
"list": [
{
"name": "value",
"id": "123456789"
},
{
"name": "needed-value",
"id": "987654321"
}
]
}
def get_id(items, name):
    # yield the id of every element whose name matches
    for el in items:
        if el['name'] == name:
            yield el['id']
print(list(get_id(jsn['list'], 'needed-value')))
Once parsed, this JSON becomes a Python dictionary whose "list" key holds a list of dictionaries. With this in mind, you can index into the list directly when you know the position of the entry you need.
In your case, I would use json_name["list"][1]["id"]
If, however, you don't know where your needed value sits within the list, then you can run an old-fashioned for loop this way:
for user in json_name["list"]:
    if user["name"] == "needed-value":
        print(user["id"])
        break
This assumes there is only one entry named needed-value in your list.
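If only the first match is needed, the loop can also be written with next() and a generator expression. A minimal sketch, with the parsed data inlined for illustration; the second argument to next() is the fallback returned when nothing matches:

```python
data = {
    "list": [
        {"name": "value", "id": "123456789"},
        {"name": "needed-value", "id": "987654321"},
    ]
}

# First matching id, or None if no entry has that name
found = next((x["id"] for x in data["list"] if x["name"] == "needed-value"), None)
missing = next((x["id"] for x in data["list"] if x["name"] == "no-such-name"), None)

print(found)    # 987654321
print(missing)  # None
```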

Merge two json files containing dict & list into single json using python?

I'm trying to merge two JSON files into a single JSON using python.
File1:
{
"key1": "protocol1",
"key2": [
{
"name": "user.name",
"value": "user#EXAMPLE123.COM"
},
{
"name": "user.shortname",
"value": "user"
},
{
"name": "proxyuser.hosts",
"value": "*"
},
{
"name": "kb.groups",
"value": "hadoop,users,localusers"
},
{
"name": "proxy.groups",
"value": "group1, group2, group3"
},
{
"name": "internal.user.groups",
"value": "group1, group2"
}
]
}
File2:
{
"key1": "protocol1",
"key2": [
{
"name": "user.name",
"value": "user#EXAMPLE456.COM"
},
{
"name": "user.shortname",
"value": "user"
},
{
"name": "proxyuser.hosts",
"value": "*"
},
{
"name": "kb.groups",
"value": ""
},
{
"name": "proxy.groups",
"value": "group3, group4, group5"
},
{
"name": "internal.groups",
"value": "none"
}
]
}
Final expected result:
{
"key1": "protocol1",
"key2": [
{
"name": "user.name",
"value": "user#EXAMPLE123.COM, user#EXAMPLE456.COM"
},
{
"name": "user.shortname",
"value": "user"
},
{
"name": "proxyuser.hosts",
"value": "*"
},
{
"name": "kb.groups",
"value": "hadoop,users,localusers"
},
{
"name": "proxy.groups",
"value": "group1, group2, group3, group4, group5"
},
{
"name": "internal.user.groups",
"value": "group1, group2"
},
{
"name": "internal.groups",
"value": "none"
}
]
}
I need to merge based on below rules:
If the 'name' key within the list(key2) match in both the files then concatenate the values.
e.g.
File1:
"key2": [{"name" : "firstname", "value" : "bob"}]
File2:
"key2": [{"name" : "firstname", "value" : "charlie"}]
Final output:
"key2": [{"name" : "firstname", "value" : "bob, charlie"}]
Some considerations while appending the values:
If both files contain duplicate value(s) in 'value', final result should only be the union of the values.
If any of 'value' contains ' * ', then final value should be ' * '.
If 'name' key in 2nd JSON file is not present in 1st file, add it to first file.
I've written a python script to load the two JSON files and merge them but it seems to just concatenate everything into the first JSON file.
def merge(a, b):
    "merges b into a"
    for key in b:
        if key in a:# if key is in both a and b
            if key == "key1":
                pass
            elif key == "key2":
                for d1, d2 in zip(a[key], b[key]):
                    for key, value in d1.items():
                        if value != d2[key]:
                            a.append({"name": d2[key], "value": d2["value"]})
            else:
                a[key] = a[key]+ b[key]
        else: # if the key is not in dict a , add it to dict a
            a.update({key:b[key]})
    return a
Can someone point out how I can compare the value for the "name" section with the list for key2 in both the files and concatenate the values in "value"?
Here's a solution that runs in linear time, using a dictionary to quickly look up an item given its name key. Dictionary b's key2 list is iterated through once, and a is modified in constant time per item as required. Sets are used to eliminate duplicates and handle asterisks.
def merge(a, b):
    lookup = {o['name']: o for o in a['key2']}
    for e in a['key2']:
        # split into a set, dropping empty entries such as the one produced by ""
        e['value'] = set(x.strip() for x in e['value'].split(",") if x.strip())
    for e in b['key2']:
        if e['name'] in lookup:
            lookup[e['name']]['value'].update(x.strip() for x in e['value'].split(",") if x.strip())
        else:
            e['value'] = set(x.strip() for x in e['value'].split(",") if x.strip())
            a['key2'].append(e)
    for e in a['key2']:
        if "*" in e['value']:
            e['value'] = "*"
        else:
            e['value'] = ", ".join(sorted(e['value']))
Sample output:
key1:
protocol1
key2:
{'name': 'user.name', 'value': 'user#EXAMPLE123.COM, user#EXAMPLE456.COM'}
{'name': 'user.shortname', 'value': 'user'}
{'name': 'proxyuser.hosts', 'value': '*'}
{'name': 'kb.groups', 'value': 'hadoop, localusers, users'}
{'name': 'proxy.groups', 'value': 'group1, group2, group3, group4, group5'}
{'name': 'internal.user.groups', 'value': 'group1, group2'}
{'name': 'internal.groups', 'value': 'none'}
Order of elements in a["key2"] and b["key2"] is not guaranteed to be the same, so you should build a mapping from the "name" value to the index in a["key2"], and then browse b["key2"] comparing each "name" value to that dict.
Code could be:
def merge(a, b):
    "merges b into a"
    for key in b:
        if key in a:# if key is in both a and b
            if key == "key2":
                # build a mapping from names from a[key2] to the member index
                akey2 = { d["name"]: i for i,d in enumerate(a[key]) }
                for d2 in b[key]: # browse b["key2"]
                    if d2["name"] in akey2: # a name from a["key2"] matches
                        a[key][akey2[d2["name"]]]["value"] += ", " + d2["value"]
                    else:
                        a[key].append(d2) # when no match
        else: # if the key is not in dict a , add it to dict a
            a[key] = b[key]
    return a
You can then test it:
import pprint
a = {"key1": "value1",
     "key2": [{"name" : "firstname", "value" : "bob"}]
}
b = {"key1": "value2",
     "key2": [{"name" : "firstname", "value" : "charlie"},
              {"name" : "foo", "value": "bar"}]
}
merge(a, b)
pprint.pprint(a)
gives as expected:
{'key1': 'value1',
'key2': [{'name': 'firstname', 'value': 'bob, charlie'},
{'name': 'foo', 'value': 'bar'}]}
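The version above concatenates the two values as they are. If the union and '*' rules from the question are also needed, the concatenation step could be swapped for something like the following sketch. combine_values is a hypothetical helper name; it assumes values are comma-separated strings and keeps them in first-seen order.

```python
def combine_values(first, second):
    """Merge two comma-separated strings: '*' wins, duplicates and blanks are dropped."""
    parts = [p.strip() for p in (first + "," + second).split(",")]
    if "*" in parts:
        return "*"
    # dict.fromkeys removes duplicates while keeping first-seen order
    unique = dict.fromkeys(p for p in parts if p)
    return ", ".join(unique)

print(combine_values("group1, group2, group3", "group3, group4, group5"))
# group1, group2, group3, group4, group5
print(combine_values("hadoop,users,localusers", ""))
# hadoop, users, localusers
print(combine_values("*", "*"))
# *
```

Inside merge, the matching branch would then assign combine_values(old_value, d2["value"]) instead of appending with +=.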
Just loop through the keys: if a key is not in the new dict, add it; if it is, merge the two values.
d1 = {"name" : "firstname", "value" : "bob"}
d2 = {"name" : "firstname", "value" : "charlie"}
d3 = {}
for i in d1:
    for j in d2:
        if i not in d3:
            d3[i] = d1[i]
        else:
            d3[i] = '{}, {}'.format(d1[i], d2[i])
print(d3)
(xenial)vash#localhost:~/python/stack_overflow$ python3.7 formats.py
{'name': 'firstname, firstname', 'value': 'bob, charlie'}
