I'm trying to convert the rows of a tab-delimited file into a single flattened dictionary and export it to a JSON file. I have tried several approaches but haven't reached the desired result. If the file contains more than one row, the values of each column should be stored as arrays in the dictionary.
Name file code file_location
TESTLIB1 443 123 location1
TESTLIB2 444 124 location2
Current Output:
{'library': 'TESTLIB2', 'file': '444', 'code': '124', 'file_location': 'location2'}
Desired Output if num_lines > 1:
{'library': ['TESTLIB1', 'TESTLIB2'], 'file': ['443', '444'], 'code': ['123', '124'], 'file_location': ['location1', 'location2']}
Code Snippet
data_dict = {}
with open('file.tmp') as input:
    reader = csv.DictReader(input, delimiter='\t')
    num_lines = sum(1 for line in open('write_object.tmp'))
    for row in reader:
        data_dict.update(row)
        if num_lines > 1:
            data_dict.update(row)
with open('output.json', 'w') as output:
    output.write(json.dumps(data_dict))
print(data_dict)
Create a list for each column, then iterate over the rows and append each value to its column's list:
import csv
import json

# read file
d = {}
with open('write_object.tmp') as f:
    reader = csv.reader(f, delimiter='\t')
    headers = next(reader)
    for head in headers:
        d[head] = []
    for row in reader:
        for i, head in enumerate(headers):
            d[head].append(row[i])

# save as json file
with open('output.json', 'w') as f:
    json.dump(d, f)
output:
{'Name': ['TESTLIB1', 'TESTLIB2'],
'file': ['443', '444'],
'code': ['123', '124'],
'file_location': ['location1', 'location2']}
import csv
from collections import defaultdict

data_dict = defaultdict(list)
with open('input-file') as inp:
    for row in csv.DictReader(inp, delimiter='\t'):
        for key, val in row.items():
            data_dict[key].append(val)
# convert to a plain dict so the printed form matches the output below
print(dict(data_dict))
# output
{'Name': ['TESTLIB1', 'TESTLIB2'],
'file': ['443', '444'],
'code': ['123', '124'],
'file_location': ['location1', 'location2']}
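Neither answer covers the single-row case from the question, where the values should stay plain strings rather than one-element lists. A minimal sketch of one way to handle that, assuming the input really is tab-delimited; the helper name `columns_from_tsv` and the in-memory sample are made up for illustration:

```python
import csv
import io
import json
from collections import defaultdict


def columns_from_tsv(handle):
    """Collect each column into a list; unwrap the list if there is only one data row."""
    columns = defaultdict(list)
    for row in csv.DictReader(handle, delimiter='\t'):
        for key, value in row.items():
            columns[key].append(value)
    # A single data row yields plain strings, several rows yield lists.
    return {key: values[0] if len(values) == 1 else values
            for key, values in columns.items()}


# Hypothetical sample standing in for the real file.
sample = (
    "Name\tfile\tcode\tfile_location\n"
    "TESTLIB1\t443\t123\tlocation1\n"
    "TESTLIB2\t444\t124\tlocation2\n"
)
lines = sample.splitlines(keepends=True)

many = columns_from_tsv(io.StringIO(sample))            # two data rows
one = columns_from_tsv(io.StringIO(''.join(lines[:2])))  # header plus one row

print(json.dumps(many))
print(json.dumps(one))
```

With a real file, the same function can be called on the handle returned by `open(...)`, and `json.dump(result, f)` writes the result to disk.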
I'm working with a JSON array payload where I want to extract it into a separate object for processing downstream.
The payload is dynamic and can have multiple nested levels in the JSON array, but the first level will always have an id field which is the unique identifier.
[{'field1': [],
  'field2': {'field2_1': 'string',
             'field2_2': 'string',
             'field2_3': 'string'},
  'field3': '<html> strings <html>',
  'id': 1},
 {'field1': [],
  'field2': {'field2_1': 'string',
             'field2_2': 'string',
             'field2_3': 'string'},
  'field3': '<html> strings <html>',
  'id': 2},
 {'field1': [],
  'field2': {'field2_1': 'string',
             'field2_2': 'string',
             'field2_3': 'string'},
  'field3': '<html> strings <html>',
  'id': 3},
 {'field1': [],
  'field2': {'field2_1': 'string',
             'field2_2': 'string',
             'field2_3': 'string'},
  'field3': '<html> strings <html>',
  'id': 4}]
The payload is not limited to this structure: it may have more fields, or more nested fields holding other types of data. But the id field will always be attached to each object in the payload. I want to create a dictionary (I'm open to other suggestions for the data type) keyed by the id field, with everything else in that object as a cleaned-up string, without any of the brackets, HTML tags, etc.
The output should be (depending on the data type) something like this:
{1: string string string strings,
2: string string string strings,
3: string string string strings,
4: string string string strings}
This is a very generic example. I'm having trouble navigating the JSON array with all the nesting and content and would just like to extract the id and the rest of the content in a clean manner. Any help is appreciated!
You can use BeautifulSoup to strip all tags from the string. For example:
from bs4 import BeautifulSoup
lst = [{'field1': [],
'field2': {'field2_1': 'string1',
'field2_2': 'string2',
'field2_3': 'string3'},
'field3': '<html> strings4 <html>',
'id':1},
{'field1': [],
'field2': {'field2_1': 'string1',
'field2_2': 'string2',
'field2_3': 'string3'},
'field3': '<html> strings4 <html>',
'id':2},
{'field1': [],
'field2': {'field2_1': 'string1',
'field2_2': 'string2',
'field2_3': 'string3'},
'field3': '<html> strings4 <html>',
'id':3},
{'field1': [],
'field2': {'field2_1': 'string1',
'field2_2': 'string2',
'field2_3': 'string3'},
'field3': '<html> strings4 <html>',
'id':4}]
def flatten(d):
    # recursively yield every string found in nested dicts and lists
    if isinstance(d, dict):
        for v in d.values():
            yield from flatten(v)
    elif isinstance(d, list):
        for v in d:
            yield from flatten(v)
    elif isinstance(d, str):
        yield d

out = {}
for d in lst:
    out[d['id']] = ' '.join(map(str.strip, BeautifulSoup(' '.join(flatten(d)), 'html.parser').find_all(text=True)))

print(out)
Prints:
{1: 'string1 string2 string3 strings4', 2: 'string1 string2 string3 strings4', 3: 'string1 string2 string3 strings4', 4: 'string1 string2 string3 strings4'}
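If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's `html.parser`. This is an alternative to the answer above, not a drop-in replacement: the names `_TextOnly`, `strip_tags` and `iter_strings` are hypothetical, and like the original `flatten` it only collects string values, so numbers (including the id) are skipped.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects only the text between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(markup):
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()
    # split() + join collapses the whitespace left where tags used to be
    return ' '.join(' '.join(parser.chunks).split())


def iter_strings(node):
    """Yield every string found in nested dicts and lists."""
    if isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for value in node:
            yield from iter_strings(value)
    elif isinstance(node, str):
        yield node


# Shortened, made-up payload in the shape used above.
payload = [
    {'field1': [],
     'field2': {'field2_1': 'string1', 'field2_2': 'string2'},
     'field3': '<html> strings4 <html>',
     'id': 1},
    {'field1': ['extra'],
     'field2': {'field2_1': 'string1'},
     'field3': '<p>strings4</p>',
     'id': 2},
]

cleaned = {item['id']: strip_tags(' '.join(iter_strings(item))) for item in payload}
print(cleaned)
```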
I have a CSV file with a header like this:
cpus/0/compatible clocks/HSE/compatible ../frequency memories/flash/compatible ../address ../size [and so on...]
I'm able to parse that header into nested dictionaries, which may look like this:
{'clocks': {'HSE': {'compatible': '[1]',
'frequency': '[2]'}},
'cpus': {'0': {'compatible': '[0]'}},
'memories': {'bkpsram': {'address': '[13]',
'compatible': '[12]',
'size': '[14]'},
'ccm': {'address': '[7]',
'compatible': '[6]',
'size': '[8]'},
'flash': {'address': '[4]',
'compatible': '[3]',
'size': '[5]'},
'sram': {'address': '[10]',
'compatible': '[9]',
'size': '[11]'}},
'pin-controller': {'GPIOA': {'enabled': '[16]'},
'GPIOB': {'enabled': '[17]'},
'GPIOC': {'enabled': '[18]'},
'GPIOD': {'enabled': '[19]'},
'GPIOE': {'enabled': '[20]'},
'GPIOF': {'enabled': '[21]'},
'GPIOG': {'enabled': '[22]'},
'GPIOH': {'enabled': '[23]'},
'GPIOI': {'enabled': '[24]'},
'GPIOJ': {'enabled': '[25]'},
'GPIOK': {'enabled': '[26]'},
'compatible': '[15]'}}
(it is a dict object, printed with pprint())
The values of keys which look like '[<number>]' reflect the index of column in the CSV file from which the data should be loaded.
As I mainly use C/C++ I would actually love to have pointers/references in Python, as then I would just put a pointer to a list element in each value and for each row I could modify list contents, but I think there's no way to obtain such behaviour easily in Python.
So now I plan to dump this dictionary into a string and perform the following three modifications in a row:
replace { with {{,
replace } with }},
replace '[<number>]' with {<number>}.
After that I will be able to "load" the data with something like this ast.literal_eval(dictAsStr.format(*rowFromCsv)), but it seems like a waste of time to convert the whole dict to a string and then back to a dict...
Am I missing some other obvious solution here? The format of the CSV and the way I load the header is not fixed, I may alter that easily, but I would really like a solution which would not boil down to "visit each key recursively and load appropriate value from current row manually".
From the CSV file I load each row as a list of strings, for example:
['["ARM,Cortex-M4", "ARM,ARMv7-M"]',
'["ST,STM32-HSE", "fixed-clock"]',
'0',
'["on-chip-flash"]',
'0x8000000',
'131072',
'',
'',
'',
'["on-chip-ram"]',
'0x20000000',
'65536',
'',
'',
'',
'["ST,STM32-GPIOv2-pin-controller"]',
'False',
'False',
'False',
'',
'',
'',
'',
'False',
'',
'',
'']
Now I would like to insert the values from each loaded row (list of strings) into appropriate keys in the nested dictionary, so following with the examples above I would like to get:
{'clocks': {'HSE': {'compatible': '["ST,STM32-HSE", "fixed-clock"]',
'frequency': '0'}},
'cpus': {'0': {'compatible': '["ARM,Cortex-M4", "ARM,ARMv7-M"]'}},
'memories': {'bkpsram': {'address': '',
'compatible': '',
'size': ''},
'ccm': {'address': '',
'compatible': '',
'size': ''},
'flash': {'address': '0x8000000',
'compatible': '["on-chip-flash"]',
'size': '131072'},
'sram': {'address': '0x20000000',
'compatible': '["on-chip-ram"]',
'size': '65536'}},
'pin-controller': {'GPIOA': {'enabled': 'False'},
'GPIOB': {'enabled': 'False'},
'GPIOC': {'enabled': 'False'},
'GPIOD': {'enabled': ''},
'GPIOE': {'enabled': ''},
'GPIOF': {'enabled': ''},
'GPIOG': {'enabled': ''},
'GPIOH': {'enabled': 'False'},
'GPIOI': {'enabled': ''},
'GPIOJ': {'enabled': ''},
'GPIOK': {'enabled': ''},
'compatible': '["ST,STM32-GPIOv2-pin-controller"]'}}
For completeness, here are a few first lines from the CSV file I would like to load. The first column is not part of the dictionary presented above, as it is used for indexing.
chip,cpus/0/compatible,clocks/HSE/compatible,../frequency,memories/flash/compatible,../address,../size,memories/ccm/compatible,../address,../size,memories/sram/compatible,../address,../size,memories/bkpsram/compatible,../address,../size,pin-controller/compatible,pin-controller/GPIOA/enabled,pin-controller/GPIOB/enabled,pin-controller/GPIOC/enabled,pin-controller/GPIOD/enabled,pin-controller/GPIOE/enabled,pin-controller/GPIOF/enabled,pin-controller/GPIOG/enabled,pin-controller/GPIOH/enabled,pin-controller/GPIOI/enabled,pin-controller/GPIOJ/enabled,pin-controller/GPIOK/enabled
STM32F401CB,"[""ARM,Cortex-M4"", ""ARM,ARMv7-M""]","[""ST,STM32-HSE"", ""fixed-clock""]",0,"[""on-chip-flash""]",0x8000000,131072,,,,"[""on-chip-ram""]",0x20000000,65536,,,,"[""ST,STM32-GPIOv2-pin-controller""]",False,False,False,,,,,False,,,
STM32F401CC,"[""ARM,Cortex-M4"", ""ARM,ARMv7-M""]","[""ST,STM32-HSE"", ""fixed-clock""]",0,"[""on-chip-flash""]",0x8000000,262144,,,,"[""on-chip-ram""]",0x20000000,65536,,,,"[""ST,STM32-GPIOv2-pin-controller""]",False,False,False,,,,,False,,,
STM32F401CD,"[""ARM,Cortex-M4"", ""ARM,ARMv7-M""]","[""ST,STM32-HSE"", ""fixed-clock""]",0,"[""on-chip-flash""]",0x8000000,393216,,,,"[""on-chip-ram""]",0x20000000,98304,,,,"[""ST,STM32-GPIOv2-pin-controller""]",False,False,False,,,,,False,,,
The code used to parse the header:
import csv

with open("some-path-to-CSV-file") as csvFile:
    csvReader = csv.reader(csvFile)
    header = next(csvReader)
    previousKeyElements = header[1].split('/')
    dictionary = {}
    for index, key in enumerate(header[1:]):
        keyElements = key.split('/')
        # replace each leading '..' with the matching part of the previous key
        i = 0
        while keyElements[i] == '..':
            i += 1
        keyElements[0:i] = previousKeyElements[0:-i]
        previousKeyElements = keyElements
        # walk (and create) the nested dicts down to the leaf
        node = dictionary
        for keyElement in keyElements[:-1]:
            node = node.setdefault(keyElement, {})
        node[keyElements[-1]] = '[{}]'.format(index)
What about just using the actual row index (as an integer) as the value in the "parsed" header, i.e.:
{'clocks': {'HSE': {'compatible': 1,
'frequency': 2}},
# etc
Then use recursion on a copy of the parsed header to populate it from the row values:
import csv
import sys
import copy
import pprint

def parse_header(header):
    previousKeyElements = header[1].split('/')
    dictionary = {}
    for index, key in enumerate(header[1:]):
        keyElements = key.split('/')
        i = 0
        while keyElements[i] == '..':
            i += 1
        keyElements[0:i] = previousKeyElements[0:-i]
        previousKeyElements = keyElements
        node = dictionary
        for keyElement in keyElements[:-1]:
            node = node.setdefault(keyElement, {})
        node[keyElements[-1]] = index
    return dictionary

def _rparse(d, k, v, row):
    if isinstance(v, dict):
        for subk, subv in v.items():
            _rparse(v, subk, subv, row)
    elif isinstance(v, int):
        d[k] = row[v]
    else:
        raise ValueError("'v' should be either a dict or an int (got : %s(%s))" % (type(v), v))

def parse_row(header, row):
    struct = copy.deepcopy(header)
    for k, v in struct.items():
        _rparse(struct, k, v, row)
    return struct

def main(*args):
    path = args[0]
    with open(path) as f:
        reader = csv.reader(f)
        header = parse_header(next(reader))
        results = [parse_row(header, row[1:]) for row in reader]
    pprint.pprint(results)

if __name__ == "__main__":
    main(*sys.argv[1:])
Another solution (that might actually be faster) would be to build a reverse mapping with row indices as keys and the dict "path" as values, i.e.:
{0: ("cpus", "0", "compatible"),
1: ("clocks", "HSE", "compatible"),
2: ("clocks", "HSE", "frequency"),
# etc
}
and then:
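def parse_row(template, index_map, row):
    # 'template' is your parsed header dict,
    # 'index_map' is the reverse mapping shown above
    struct = copy.deepcopy(template)
    for index, path in index_map.items():
        target = struct  # restart from the root for every path
        for key in path[:-1]:
            target = target[key]
        target[path[-1]] = row[index]
    return struct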
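The reverse mapping itself is not built anywhere above, so here is a minimal end-to-end sketch under some assumptions: the header list already has the leading "chip" column removed, and the helper names `build_index_map`, `build_template` and `fill_row` are invented for this example. The '..' handling follows the header-parsing code from the question.

```python
import copy


def build_index_map(header):
    """Map each column index to its full key path, resolving leading '..' parts."""
    index_map = {}
    previous = []
    for index, key in enumerate(header):
        parts = key.split('/')
        ups = 0
        while parts[ups] == '..':
            ups += 1
        if ups:
            # each '..' drops one trailing element of the previous path
            parts[:ups] = previous[:-ups]
        previous = parts
        index_map[index] = tuple(parts)
    return index_map


def build_template(index_map):
    """Create the nested dict skeleton with None placeholders at the leaves."""
    template = {}
    for path in index_map.values():
        node = template
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = None
    return template


def fill_row(template, index_map, row):
    struct = copy.deepcopy(template)
    for index, path in index_map.items():
        target = struct  # restart from the root for every path
        for key in path[:-1]:
            target = target[key]
        target[path[-1]] = row[index]
    return struct


# Shortened header and row taken from the question's example data.
header = ['cpus/0/compatible', 'clocks/HSE/compatible', '../frequency',
          'memories/flash/compatible', '../address', '../size']
row = ['["ARM,Cortex-M4", "ARM,ARMv7-M"]', '["ST,STM32-HSE", "fixed-clock"]', '0',
       '["on-chip-flash"]', '0x8000000', '131072']

index_map = build_index_map(header)
template = build_template(index_map)
result = fill_row(template, index_map, row)
print(result)
```

The template and the mapping are built once per file; only `fill_row` runs per row.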
Oh and yes, as an added bonus, you may want to use ast.literal_eval() to turn your values into proper python types:
>>> import ast
>>> ast.literal_eval("False")
False
>>> ast.literal_eval('["on-chip-flash"]')
['on-chip-flash']
>>> ast.literal_eval('0x8000000')
134217728
>>> ast.literal_eval('["ARM,Cortex-M4", "ARM,ARMv7-M"]')
['ARM,Cortex-M4', 'ARM,ARMv7-M']
>>> ast.literal_eval("this should fail")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/ast.py", line 49, in literal_eval
    node_or_string = parse(node_or_string, mode='eval')
  File "/usr/lib/python2.7/ast.py", line 37, in parse
    return compile(source, filename, mode, PyCF_ONLY_AST)
  File "<unknown>", line 1
    this should fail
^
SyntaxError: invalid syntax
>>> def to_python(value):
...     try:
...         return ast.literal_eval(value)
...     except Exception as e:
...         return value
...
>>> to_python('["on-chip-flash"]')
['on-chip-flash']
>>> to_python('wtf')
'wtf'
>>>
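Applied to a whole CSV row, the conversion might look like the sketch below. It is a variant of the helper above that catches only `ValueError` and `SyntaxError`, which is what `ast.literal_eval` raises for the kinds of fields shown in the question; other inputs could conceivably raise something else, so treat the narrowed `except` as an assumption.

```python
import ast


def to_python(value):
    """Return the evaluated literal, or the original string if it is not a literal."""
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value


# A few fields in the style of the question's rows, including an empty one.
row = ['["on-chip-flash"]', '0x8000000', '131072', '', 'False', 'fixed-clock']
converted = [to_python(field) for field in row]
print(converted)
```

Empty fields stay empty strings, because `ast.literal_eval('')` raises `SyntaxError` and the original value is returned.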
Is there a more elegant way in Python to create a dictionary from a list and a comma-separated line, besides a loop?
my_master_list = ["ABC", "DEF", "GHI"]
my_list = ["field1", "field2", "field3"]
my_line = "test1,test2,test3"
my_dict = {}
for x in my_master_list:
    my_dict[x] = {}
    line_parts = my_line.split(",")
    n = 0
    for y in my_list:
        my_dict[x][y] = line_parts[n]
        n += 1
print(my_dict)
# {'ABC': {'field2': 'test2', 'field3': 'test3', 'field1': 'test1'}, 'GHI': {'field2': 'test2', 'field3': 'test3', 'field1': 'test1'}, 'DEF': {'field2': 'test2', 'field3': 'test3', 'field1': 'test1'}}
You can use zip with a dictionary comprehension:
# construct the inner dictionary
d = dict(zip(my_list, my_line.split(",")))
# construct the outer dictionary; if you don't need independent copies, you can use
# {master_key: d ... } directly here, but keep in mind that every key will then
# refer to the same inner dict object
{master_key: d.copy() for master_key in my_master_list}
#{'ABC': {'field1': 'test1', 'field2': 'test2', 'field3': 'test3'},
# 'DEF': {'field1': 'test1', 'field2': 'test2', 'field3': 'test3'},
# 'GHI': {'field1': 'test1', 'field2': 'test2', 'field3': 'test3'}}
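The difference between sharing one inner dict and copying it only shows up once a value is modified. A small sketch using the question's data; the variable names `shared` and `copied` are just for illustration:

```python
my_master_list = ["ABC", "DEF", "GHI"]
my_list = ["field1", "field2", "field3"]
my_line = "test1,test2,test3"

inner = dict(zip(my_list, my_line.split(",")))

# Every key points at the very same dict object.
shared = {key: inner for key in my_master_list}
# Every key gets its own shallow copy.
copied = {key: inner.copy() for key in my_master_list}

shared["ABC"]["field1"] = "changed"   # visible under every key of 'shared'
copied["ABC"]["field1"] = "only ABC"  # affects the "ABC" entry only

print(shared["DEF"]["field1"])
print(copied["DEF"]["field1"])
```

Since the values here are plain strings, a shallow `copy()` is enough; nested mutable values would need `copy.deepcopy`.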
d = {x: dict(zip(my_list, my_line.split(','))) for x in my_master_list}
Reading the expression from the inside out:
[1] my_line.split(',') creates a list from the string.
[2] zip(my_list, ...) pairs the elements of the two lists into (key, value) tuples.
[3] dict(...) creates a dictionary from those tuples.
[4] The overall expression is a dictionary comprehension.
Read about dict comprehensions in PEP 274.
"Type","Name","Description","Designation","First-term assessment","Second-term assessment","Total"
"Subject","Nick","D1234","F4321",10,19,29
"Unit","HTML","D1234-1","F4321",18,,
"Topic","Tags","First Term","F4321",18,,
"Subtopic","Review of representation of HTML",,,,,
The data above comes from an Excel sheet that was converted to CSV.
As you can see, the header contains seven columns; the number of values in the rows below it varies.
I have the following Python script to read this data:
from django.db import transaction
import sys
import csv
import StringIO

file = sys.argv[1]
no_cols_flag=0
flag=0
header_arr=[]
print file
f = open(file, 'r')
while (f.readline() != ""):
    for i in [line.split(',') for line in open(file)]: # split on the separator
        print "==========================================================="
        row_flag=0
        row_d=""
        for j in i: # for each token in the split string
            row_flag=1
            print j
            if j:
                no_cols_flag=no_cols_flag+1
                data=j.strip()
                print j
    break
How can I modify the script above so that each value is associated with its column header?
Thanks.
You're importing the csv module but never use it. Why?
If you do
import csv
reader = csv.reader(open(file, "rb"), dialect="excel") # Python 2.x
# Python 3: reader = csv.reader(open(file, newline=""), dialect="excel")
you get a reader object that will contain all you need; the first row will contain the headers, and the subsequent rows will contain the data in the corresponding places.
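To make the header/data pairing concrete, here is a small Python 3 sketch that reads the first rows of the question's data from an in-memory string (a stand-in for the real file) and zips each data row with the header row:

```python
import csv
import io

# First lines of the question's CSV, held in memory for the example.
sample = '''"Type","Name","Description","Designation","First-term assessment","Second-term assessment","Total"
"Subject","Nick","D1234","F4321",10,19,29
"Unit","HTML","D1234-1","F4321",18,,
'''

reader = csv.reader(io.StringIO(sample), dialect="excel")
headers = next(reader)  # the first row holds the column names
records = [dict(zip(headers, row)) for row in reader]

for record in records:
    print(record)
```

All values come back as strings, and empty cells become empty strings.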
Even better might be (if I understand you correctly):
import csv
reader = csv.DictReader(open(file, "rb"), dialect="excel") # Python 2.x
# Python 3: reader = csv.DictReader(open(file, newline=""), dialect="excel")
This DictReader can be iterated over, returning a sequence of dicts that use the column header as keys and the following data as values, so
for row in reader:
print(row)
will output
{'Name': 'Nick', 'Designation': 'F4321', 'Type': 'Subject', 'Total': '29', 'First-term assessment': '10', 'Second-term assessment': '19', 'Description': 'D1234'}
{'Name': 'HTML', 'Designation': 'F4321', 'Type': 'Unit', 'Total': '', 'First-term assessment': '18', 'Second-term assessment': '', 'Description': 'D1234-1'}
{'Name': 'Tags', 'Designation': 'F4321', 'Type': 'Topic', 'Total': '', 'First-term assessment': '18', 'Second-term assessment': '', 'Description': 'First Term'}
{'Name': 'Review of representation of HTML', 'Designation': '', 'Type': 'Subtopic', 'Total': '', 'First-term assessment': '', 'Second-term assessment': '', 'Description': ''}