Attempting to output the following json as a csv - python

Json object (output):
[424783, [198184], [605], [644], [296], [2048], 424694, [369192], [10139],
[152532], [397538], [1420]]
<<< code removed >>>
Desired output:
424783,198184
424783,605
424783,644
424783,296
424783,2048
424694,369192
424694,10139
424694,152532
424694,397538
424694,1420

From your data it looks like the non-bracketed items should be treated as values for the first column (i.e. a key) and the bracketed ones as values for the second column, using the key that precedes them. You can do this purely procedurally:
import csv
import json

src = '''[424783, [198184], [605], [644], [296], [2048],
424694, [369192], [10139], [152532], [397538], [1420]]'''

with open('output.csv', 'w', newline='') as f:  # Python 2.x: open('output.csv', 'wb')
    writer = csv.writer(f)  # create a simple CSV writer
    current_key = None  # holds the last seen / cached 'key'
    for element in json.loads(src):  # parse the structure and iterate over it
        if isinstance(element, list):  # the element is a 'list', i.e. a value
            writer.writerow((current_key, element[0]))  # write to CSV with the cached key
        else:
            current_key = element  # cache the element as the key for the following entries
Which should produce an output.csv containing:
424783,198184
424783,605
424783,644
424783,296
424783,2048
424694,369192
424694,10139
424694,152532
424694,397538
424694,1420

itertools.groupby is a little challenging for Python beginners, but is very handy for walking through a collection of items and working with them in groups. In this case, we group by items that are or are not Python lists.
From each group of nested ints, we'll create one or more entries in our accumulator list.
Once the accumulator list is loaded, the code below just prints out the results, easily converted to writing to a file.
import ast
from itertools import groupby
from collections import namedtuple

# this may be JSON, but it's also an ordinary Python nested list of ints,
# so it is safely parseable using ast.literal_eval()
text = "[424783, [198184], [605], [644], [296], [2048], 424694, [369192], [10139], [152532], [397538], [1420]]"
items = ast.literal_eval(text)

# a namedtuple to hold each record, and a list to accumulate them
DataRow = namedtuple("DataRow", "old_id new_id")
accumulator = []

# use groupby to process the entries in groups, depending on whether the items are lists or not
key = None
for is_data, values in groupby(items, key=lambda x: isinstance(x, list)):
    if not is_data:
        # the group's sole value is the key for the records that follow
        key = list(values)[0]
    else:
        # the values are the collection of lists until the next key
        accumulator.extend(DataRow(key, v[0]) for v in values)

# dump out as csv
for item in accumulator:
    print("{old_id},{new_id}".format_map(item._asdict()))
Prints:
424783,198184
424783,605
424783,644
424783,296
424783,2048
424694,369192
424694,10139
424694,152532
424694,397538
424694,1420
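Since groupby trips up beginners, here is just the grouping step in isolation, a minimal sketch using a shortened version of the data from the question:

    from itertools import groupby

    items = [424783, [198184], [605], 424694, [369192]]

    # group consecutive elements by whether or not they are lists;
    # each group's key is the result of the isinstance() test
    groups = [(is_list, list(vals))
              for is_list, vals in groupby(items, key=lambda x: isinstance(x, list))]
    print(groups)
    # [(False, [424783]), (True, [[198184], [605]]), (False, [424694]), (True, [[369192]])]

Note that groupby only groups *consecutive* elements with the same key, which is exactly what this data layout needs: each bare key is followed by its bracketed values.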

I think using itertools.groupby() would be a good approach, since grouping items is the main thing that needs to be done to accomplish what you want.
Here's a fairly simple way of using it:
import csv
from itertools import groupby
import json

json_src = '''[424783, [198184], [605], [644], [296], [2048],
424694, [369192], [10139], [152532], [397538], [1420]]'''

def xyz():
    return json.loads(json_src)

def abc():
    json_processed = xyz()
    output_filename = 'y.csv'
    with open(output_filename, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        for is_list, items in groupby(json_processed, key=lambda v: isinstance(v, list)):
            if is_list:
                new_ids = [item[0] for item in items]
            else:
                old_id = next(items)
                continue
            for new_id in new_ids:
                writer.writerow([old_id, new_id])

abc()
Contents of csv file produced:
424783,198184
424783,605
424783,644
424783,296
424783,2048
424694,369192
424694,10139
424694,152532
424694,397538
424694,1420


How to iterate over JSON using python with children nodes?
Here is a way that you could read in the data, clean it (remove the entries with vulnerable: false), sort it, and then write it out as a new file.
import json

def base_score(item):  # sorting key function used in .sort()
    # https://stackoverflow.com/questions/3121979/how-to-sort-a-list-tuple-of-lists-tuples-by-the-element-at-a-given-index
    if 'baseMetricV3' not in item['impact']:
        return (0, item['cve']['CVE_data_meta']['ID'])  # no score available, so sort as 0, then by ID
    return (item['impact']['baseMetricV3']['cvssV3']['baseScore'], item['cve']['CVE_data_meta']['ID'])  # ties on score are broken by ID

with open('nvdcve-1.1-2022.json', 'r') as file:  # read in the file and load it as JSON (similar to a Python dictionary)
    dict_data = json.load(file)

for CVE_Item in dict_data['CVE_Items']:
    for node in CVE_Item['configurations']['nodes']:
        # https://stackoverflow.com/questions/1207406/how-to-remove-items-from-a-list-while-iterating
        node['cpe_match'][:] = [item for item in node['cpe_match'] if item['vulnerable']]  # filter in place
        if node['children']:  # check the children for vulnerable:false items and remove those too
            for child_node in node['children']:
                child_node['cpe_match'][:] = [item for item in child_node['cpe_match'] if item['vulnerable']]

dict_data['CVE_Items'].sort(reverse=True, key=base_score)  # sort the data in descending order

with open('cleaned_nvdcve-1.1-2022.json', 'w') as f:  # write the file to the current working directory
    f.write(json.dumps(dict_data))
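The in-place filtering idiom used above (slice assignment, `lst[:] = [...]`) is worth seeing in isolation: unlike rebinding the name, it mutates the original list object, so every other reference to that list sees the filtered result. A minimal sketch with made-up records:

    # hypothetical records mirroring the cpe_match shape
    matches = [{'vulnerable': True, 'cpe': 'a'},
               {'vulnerable': False, 'cpe': 'b'},
               {'vulnerable': True, 'cpe': 'c'}]
    alias = matches  # another reference to the same list object

    # slice assignment replaces the list's contents in place
    matches[:] = [m for m in matches if m['vulnerable']]

    print(alias)  # the alias sees the filtered contents too

This matters here because the list being filtered is nested inside the loaded JSON structure; rebinding a local name would leave the structure unchanged.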
This solution should solve your initial problem (remove configurations with 'vulnerable:false'). I used the sample json data you provided in the question.
import json

with open('data.json', 'r') as f:
    data = json.load(f)

nodes = data.get('CVE_Items')[0].get('configurations').get('nodes')[0].get('cpe_match')
# build a filtered copy instead of popping while iterating,
# which would skip elements as the indices shift
nodes[:] = [node for node in nodes if node.get('vulnerable')]

with open('new_data.json', 'w') as f:
    f.write(json.dumps(data))

Is there a way to add a file which has similar ids to a dictionary?

I am retrieving grades from a file and I want to insert them into a dictionary with the id as the key. What is the best way to do this? Below is the code.
from HW08_Swayam_Shah import file_reader
import os
from collections import defaultdict

def grades(path):
    l = defaultdict()
    g = {"A": 4.0, "A-": 3.75, "B+": 3.25, "B": 3.0, "B-": 2.75, "C+": 2.25, "C-": 0, "D+": 0, "D": 0, "D-": 0, "F": 0}
    for id, course, grade, prof_id in file_reader(
            os.path.join(path, "g.txt"), fields=4, sep='|', header=False):
        for k, v in g.items():
            if grade == k:
                l[id].append(v)
    return l

x = grades("C://Users/Documents/Downloads")
print(x)
Below is the input file I am using:
10103|SSW 567|A|98765
10103|SSW 564|A-|98764
10103|SSW 687|B|98764
As you can see, the first field is the same for all rows, and I need it to be my key. Obviously the dictionary will throw a KeyError, but when the key repeats I want the value appended to the existing entry. Something like:
{10103:{A,A-,B}}
How can I achieve this?
You need to specify the factory for the defaultdict:
l = defaultdict(list)
This will create an empty list if the dictionary item doesn't exist, then append() will work.
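A minimal sketch of the behavior (the id and grade values are taken from the sample input above):

    from collections import defaultdict

    l = defaultdict(list)   # factory: missing keys start as an empty list
    l[10103].append(4.0)    # no KeyError: [] is created first, then appended to
    l[10103].append(3.75)   # same key again, so it appends to the existing list
    print(dict(l))          # {10103: [4.0, 3.75]}

With a plain dict (or a defaultdict() with no factory, as in the question), the first `l[10103]` lookup would raise a KeyError instead.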
You don't need to loop over the dictionary g. Since grade is a key, just use g[grade] to get the value.
def grades(path):
    l = defaultdict(list)
    g = {"A": 4.0, "A-": 3.75, "B+": 3.25, "B": 3.0, "B-": 2.75, "C+": 2.25, "C-": 0, "D+": 0, "D": 0, "D-": 0, "F": 0}
    for id, course, grade, prof_id in file_reader(
            os.path.join(path, "g.txt"), fields=4, sep='|', header=False):
        grade_numeric = g[grade]
        l[id].append(grade_numeric)
    return l

Take 2 key values from list of python dicts & make new list/tuple/array/dictionary with each index containing 2 key values from 1st listed dict

I have a list of dictionaries in a json file.
I have iterated through the list and each dictionary to obtain two specific key:value pairs from each dictionary for each element.
i.e. List[dictionary{i(key_x:value_x, key_y:value_y)}]
My question is now:
How do I place these two new key: value pairs in a new list/dictionary/array/tuple, representing the two key: value pairs extracted for each listed element in the original?
To be clear:
ORIGINAL_LIST (i.e. with each element being a nested dictionary) =
[{"a":{"blah":"blah",
"key_1":value_a1,
"key_2":value_a2,
"key_3":value_a3,
"key_4":value_a4,
"key_5":value_a5,},
"b":"something_a"},
{"a":{"blah":"blah",
"key_1":value_b1,
"key_2":value_b2,
"key_3":value_b3,
"key_4":value_b4,
"key_5":value_b5,},
"b":"something_b"}]
So my code so far is:
import json
from collections import *
from pprint import pprint

json_file = "/some/path/to/json/file"
with open(json_file) as json_data:
    data = json.load(json_data)
    json_data.close()
for i in data:
    event = dict(i)
    event_key_b = event.get('b')
    event_key_2 = event.get('key_2')
    print(event_key_b)  # print value of "b" for each nested dict 'i'
    print(event_key_2)  # print value of "key_2" for each nested dict 'i'
To be clear:
FINAL_LIST(i.e. with each element being a nested dictionary) =
[{"b":"something_a", "key_2":value_2},
{"b":"something_b", "key_2":value_2}]
So I have an answer to getting the keys into individual dictionaries, as follows in the code below. The only problem is that the value for 'key_2' in the original json dictionaries is either an int value or it is "" for values which are 0. My script just returns 'None' for all instances of value_2 for key_2. How can I get it to read the appropriate values for 'value_2'? I want to only return dictionaries for cases where 'value_2' > 0 (i.e. where value_2 != "")
Below is the current code:
import json
from pprint import pprint

json_file = "/some/path/to/json/file"
with open(json_file) as json_data:
    data = json.load(json_data)
    json_data.close()
for i in data:
    event_key_b = event.get('b')
    for x in i:
        event_key_2 = event.get('key_2')
        x = {'b': something_b, 'key_2': value_2}
        print(x)
Also, if there are any more elegant solutions anyone can think of, I would really be interested in learning them. Some of the json files I'm looking at range from 200 dictionary entries in the original list to 2,000,000. I'm planning to feed my parsed results into a message queue for processing by a different service, so any efficiencies in the code will help with scalability. And if anyone has any recommendations on Redis vs. RabbitMQ, I'd really appreciate them.
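Judging from ORIGINAL_LIST above, "key_2" lives inside the nested dict under "a", so event.get('key_2') at the top level will always return None, which would explain the None values. A hedged sketch of one way to build FINAL_LIST, using made-up sample values (7 and "" are assumptions for illustration):

    # hypothetical sample mirroring the structure described in the question
    data = [
        {"a": {"blah": "blah", "key_2": 7}, "b": "something_a"},
        {"a": {"blah": "blah", "key_2": ""}, "b": "something_b"},
    ]

    # reach one level down for key_2 and skip entries where it is ""
    final_list = [
        {"b": event["b"], "key_2": event["a"]["key_2"]}
        for event in data
        if event["a"]["key_2"] != ""
    ]
    print(final_list)

A list comprehension like this is also a reasonable fit for large inputs, since it makes a single pass and does no per-item lookups beyond the two keys needed.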

How do I iterate through nested dictionaries in a list of dictionaries?

Still new to Python and need a little help here. I've found some answers for iterating through a list of dictionaries but not for nested dictionaries in a list of dictionaries.
Here is a rough structure of a single dictionary within the dictionary list
[{'a': '1',
  'b': '2',
  'c': '3',
  'd': {'ab': '12',
        'cd': '34',
        'ef': '56'},
  'e': '4',
  'f': 'etc...'
}]
dict_list = [{ 'a':'1', 'b':'2', 'c':'3', 'd':{ 'ab':'12','cd':'34', 'ef':'56'}, 'e':'4', 'f':'etc...'}, { 'a':'2', 'b':'3', 'c':'4', 'd':{ 'ab':'23','cd':'45', 'ef':'67'}, 'e':'5', 'f':'etcx2...'},{},........,{}]
That's more or less what I am looking at although there are some keys with lists as values instead of a dictionary but I don't think I need to worry about them right now although code that would catch those would be great.
Here is what I have so far which does a great job of iterating through the json and returning all the values for each 'high level' key.
import ujson as json

with open('test.json', 'r') as f:
    json_text = f.read()
dict_list = json.loads(json_text)
for dic in dict_list:
    for val in dic.values():
        print(val)
Here is the first set of values that are returned when that loop runs
1
2
3
{'ab':'12','cd':'34','ef':'56'}
4
etc...
What I need to be able to do pick specific values from the top level and go one level deeper and grab specific values in that nested dictionary and append them to a list(s). I'm sure I am missing a simple solution. Maybe I'm looking at multiple loops?
Following the duck-typing style encouraged in Python, just assume everything has a .values method, and catch the cases that do not:
import ujson as json

with open('test.json', 'r') as f:
    json_text = f.read()
dict_list = json.loads(json_text)
for dic in dict_list:
    for val in dic.values():
        try:
            for l2_val in val.values():
                print(l2_val)
        except AttributeError:
            print(val)
Bazingaa's solution would be faster if inner dictionaries are expected to be rare.
Of course, any deeper than that and you would probably need some recursion:
def print_dict(d):
    for val in d.values():
        try:
            print_dict(val)
        except AttributeError:
            print(val)
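For instance, on a doubly nested made-up example the recursion reaches the innermost values (the definition is repeated here so the snippet runs on its own):

    def print_dict(d):  # repeated so this snippet is self-contained
        for val in d.values():
            try:
                print_dict(val)
            except AttributeError:
                print(val)

    # 'x' adds a second level of nesting beyond the question's sample
    print_dict({'a': '1', 'd': {'ab': '12', 'x': {'deep': '99'}}, 'e': '4'})
    # prints 1, 12, 99 and 4, each on its own line

The AttributeError is raised when .values() is called on a string, so each leaf value falls through to the plain print.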
How about checking the value's type using isinstance (of course this only works one level deep)? Might not be the best way though:
for dic in dict_list:
    for val in dic.values():
        if not isinstance(val, dict):
            print(val)
        else:
            for val2 in val.values():
                print(val2)
# 1
# 2
# 3
# 12
# 34
# 56
# 4
# etc...
# 2
# 3

How to duplicate a python DictReader object?

I'm currently trying to modify a DictReader object to strip all the spaces for every cell in the csv. I have this function:
def read_the_csv(input_file):
    csv_reader = csv.DictReader(input_file)
    for row in csv_reader:
        for key, value in row.items():
            value.strip()
    return csv_reader
However, the issue with this function is that the reader it returns has already been iterated through, so I can't re-iterate over it (as I could if I had just called csv.DictReader(input_file)). I want to be able to create a new object that is exactly like the DictReader (i.e., has the fieldnames attribute too), but with all fields stripped of whitespace. Any tips on how I can accomplish this?
Two things: firstly, the reader is a lazy iterator object which is exhausted after one full run (meaning it will be empty once you return it at the end of your function!), so you have to either collect the modified rows in a list and return that list in the end or make the function a generator producing the modified rows. Secondly, str.strip() does not modify strings in-place (strings are immutable), but returns a new stripped string, so you have to rebind that new value to the old key:
def read_the_csv(input_file):
    csv_reader = csv.DictReader(input_file)
    for row in csv_reader:
        for key, value in row.items():
            row[key] = value.strip()  # rebind the stripped value to the key
        yield row
Now you can use that generator function like you did the DictReader:
reader = read_the_csv(input_file)
for row in reader:
    # process data which is already stripped
I prefer using inheritance; make a subclass of DictReader as follows:
from csv import DictReader
from collections import OrderedDict

class MyDictReader(DictReader):
    def __next__(self):
        return OrderedDict({k: v.strip()
                            for k, v in super().__next__().items()})
Usage, just as DictReader:
with open('../data/risk_level_model_5.csv') as input_file:
    for row in MyDictReader(input_file):
        print(row)
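A quick self-check with an in-memory file via io.StringIO (the class body is repeated so the snippet runs on its own; the header and data are made up):

    import io
    from csv import DictReader
    from collections import OrderedDict

    class MyDictReader(DictReader):  # repeated so this snippet is self-contained
        def __next__(self):
            return OrderedDict({k: v.strip()
                                for k, v in super().__next__().items()})

    src = io.StringIO("name,level\n alpha , 3 \n beta,7\n")
    rows = list(MyDictReader(src))
    print(rows)  # every value comes back stripped

Because the subclass only overrides __next__, everything else (fieldnames, restkey, dialect handling) still behaves exactly like a plain DictReader.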
