I have a CSV file which has a header like this:
cpus/0/compatible clocks/HSE/compatible ../frequency memories/flash/compatible ../address ../size [and so on...]
I'm able to parse that header into nested dictionaries, which may look like this:
{'clocks': {'HSE': {'compatible': '[1]',
'frequency': '[2]'}},
'cpus': {'0': {'compatible': '[0]'}},
'memories': {'bkpsram': {'address': '[13]',
'compatible': '[12]',
'size': '[14]'},
'ccm': {'address': '[7]',
'compatible': '[6]',
'size': '[8]'},
'flash': {'address': '[4]',
'compatible': '[3]',
'size': '[5]'},
'sram': {'address': '[10]',
'compatible': '[9]',
'size': '[11]'}},
'pin-controller': {'GPIOA': {'enabled': '[16]'},
'GPIOB': {'enabled': '[17]'},
'GPIOC': {'enabled': '[18]'},
'GPIOD': {'enabled': '[19]'},
'GPIOE': {'enabled': '[20]'},
'GPIOF': {'enabled': '[21]'},
'GPIOG': {'enabled': '[22]'},
'GPIOH': {'enabled': '[23]'},
'GPIOI': {'enabled': '[24]'},
'GPIOJ': {'enabled': '[25]'},
'GPIOK': {'enabled': '[26]'},
'compatible': '[15]'}}
(it is a dict object, printed with pprint())
The values that look like '[<number>]' give the index of the CSV column from which the data should be loaded.
As I mainly use C/C++ I would actually love to have pointers/references in Python, as then I would just put a pointer to a list element in each value and for each row I could modify list contents, but I think there's no way to obtain such behaviour easily in Python.
So now I plan to dump this dictionary into a string and perform the following 3 modifications in a row:
replace { with {{,
replace } with }},
replace '[<number>]' with {<number>}.
After that I will be able to "load" the data with something like this ast.literal_eval(dictAsStr.format(*rowFromCsv)), but it seems like a waste of time to convert the whole dict to a string and then back to a dict...
Am I missing some other obvious solution here? The format of the CSV and the way I load the header is not fixed, I may alter that easily, but I would really like a solution which would not boil down to "visit each key recursively and load appropriate value from current row manually".
From the CSV file I load each row as a list of strings, for example:
['["ARM,Cortex-M4", "ARM,ARMv7-M"]',
'["ST,STM32-HSE", "fixed-clock"]',
'0',
'["on-chip-flash"]',
'0x8000000',
'131072',
'',
'',
'',
'["on-chip-ram"]',
'0x20000000',
'65536',
'',
'',
'',
'["ST,STM32-GPIOv2-pin-controller"]',
'False',
'False',
'False',
'',
'',
'',
'',
'False',
'',
'',
'']
Now I would like to insert the values from each loaded row (list of strings) into the appropriate keys in the nested dictionary, so, continuing the examples above, I would like to get:
{'clocks': {'HSE': {'compatible': '["ST,STM32-HSE", "fixed-clock"]',
'frequency': '0'}},
'cpus': {'0': {'compatible': '["ARM,Cortex-M4", "ARM,ARMv7-M"]'}},
'memories': {'bkpsram': {'address': '',
'compatible': '',
'size': ''},
'ccm': {'address': '',
'compatible': '',
'size': ''},
'flash': {'address': '0x8000000',
'compatible': '["on-chip-flash"]',
'size': '131072'},
'sram': {'address': '0x20000000',
'compatible': '["on-chip-ram"]',
'size': '65536'}},
'pin-controller': {'GPIOA': {'enabled': 'False'},
'GPIOB': {'enabled': 'False'},
'GPIOC': {'enabled': 'False'},
'GPIOD': {'enabled': ''},
'GPIOE': {'enabled': ''},
'GPIOF': {'enabled': ''},
'GPIOG': {'enabled': ''},
'GPIOH': {'enabled': 'False'},
'GPIOI': {'enabled': ''},
'GPIOJ': {'enabled': ''},
'GPIOK': {'enabled': ''},
'compatible': '["ST,STM32-GPIOv2-pin-controller"]'}}
For completeness, here are the first few lines of the CSV file I would like to load. The first column is not part of the dictionary presented above, as it is used for indexing.
chip,cpus/0/compatible,clocks/HSE/compatible,../frequency,memories/flash/compatible,../address,../size,memories/ccm/compatible,../address,../size,memories/sram/compatible,../address,../size,memories/bkpsram/compatible,../address,../size,pin-controller/compatible,pin-controller/GPIOA/enabled,pin-controller/GPIOB/enabled,pin-controller/GPIOC/enabled,pin-controller/GPIOD/enabled,pin-controller/GPIOE/enabled,pin-controller/GPIOF/enabled,pin-controller/GPIOG/enabled,pin-controller/GPIOH/enabled,pin-controller/GPIOI/enabled,pin-controller/GPIOJ/enabled,pin-controller/GPIOK/enabled
STM32F401CB,"[""ARM,Cortex-M4"", ""ARM,ARMv7-M""]","[""ST,STM32-HSE"", ""fixed-clock""]",0,"[""on-chip-flash""]",0x8000000,131072,,,,"[""on-chip-ram""]",0x20000000,65536,,,,"[""ST,STM32-GPIOv2-pin-controller""]",False,False,False,,,,,False,,,
STM32F401CC,"[""ARM,Cortex-M4"", ""ARM,ARMv7-M""]","[""ST,STM32-HSE"", ""fixed-clock""]",0,"[""on-chip-flash""]",0x8000000,262144,,,,"[""on-chip-ram""]",0x20000000,65536,,,,"[""ST,STM32-GPIOv2-pin-controller""]",False,False,False,,,,,False,,,
STM32F401CD,"[""ARM,Cortex-M4"", ""ARM,ARMv7-M""]","[""ST,STM32-HSE"", ""fixed-clock""]",0,"[""on-chip-flash""]",0x8000000,393216,,,,"[""on-chip-ram""]",0x20000000,98304,,,,"[""ST,STM32-GPIOv2-pin-controller""]",False,False,False,,,,,False,,,
The code used to parse the header:
import csv
with open("some-path-to-CSV-file") as csvFile:
    csvReader = csv.reader(csvFile)
    header = next(csvReader)
    previousKeyElements = header[1].split('/')
    dictionary = {}
    for index, key in enumerate(header[1:]):
        keyElements = key.split('/')
        i = 0
        while keyElements[i] == '..':
            i += 1
        keyElements[0:i] = previousKeyElements[0:-i]
        previousKeyElements = keyElements
        node = dictionary
        for keyElement in keyElements[:-1]:
            node = node.setdefault(keyElement, {})
        node[keyElements[-1]] = '[{}]'.format(index)
What about just using the actual column index (as an integer) as the value in the "parsed" header, i.e.:
{'clocks': {'HSE': {'compatible': 1,
'frequency': 2}},
# etc
Then use recursion on a copy of the parsed header to populate it from the row values:
import csv
import sys
import copy
import pprint
def parse_header(header):
    previousKeyElements = header[1].split('/')
    dictionary = {}
    for index, key in enumerate(header[1:]):
        keyElements = key.split('/')
        i = 0
        while keyElements[i] == '..':
            i += 1
        keyElements[0:i] = previousKeyElements[0:-i]
        previousKeyElements = keyElements
        node = dictionary
        for keyElement in keyElements[:-1]:
            node = node.setdefault(keyElement, {})
        node[keyElements[-1]] = index
    return dictionary
def _rparse(d, k, v, row):
    if isinstance(v, dict):
        for subk, subv in v.items():
            _rparse(v, subk, subv, row)
    elif isinstance(v, int):
        d[k] = row[v]
    else:
        raise ValueError("'v' should be either a dict or an int (got : %s(%s))" % (type(v), v))
def parse_row(header, row):
    struct = copy.deepcopy(header)
    for k, v in struct.items():
        _rparse(struct, k, v, row)
    return struct
def main(*args):
    path = args[0]
    with open(path) as f:
        reader = csv.reader(f)
        header = parse_header(next(reader))
        results = [parse_row(header, row[1:]) for row in reader]
    pprint.pprint(results)
if __name__ == "__main__":
    main(*sys.argv[1:])
Another solution (that might actually be faster) would be to build a reverse mapping with column indices as keys and the dict "path" as values, i.e.:
{0: ("cpus", "0", "compatible"),
1: ("clocks", "HSE", "compatible"),
2: ("clocks", "HSE", "frequency"),
# etc
}
and then:
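def parse_row(template, path_map, row):
    # 'template' is your parsed header dict,
    # 'path_map' is the reverse mapping shown above
    struct = copy.deepcopy(template)
    for index, path in path_map.items():
        # restart from the root for every path
        target = struct
        for key in path[:-1]:
            target = target[key]
        # the last path element is the leaf key to overwrite
        target[path[-1]] = row[index]
    return struct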
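The reverse mapping itself can be derived from the header with the same '..' resolution used when parsing it. The following is only a minimal sketch, not part of the original answer: build_path_map and row_to_dict are hypothetical helper names, and the two-line CSV sample is made up for illustration. It builds each row's nested dict directly from the paths, so no template copy is needed.

```python
import csv
import io


def build_path_map(header):
    """Map each column index to the tuple of keys leading to its leaf."""
    path_map = {}
    previous = []
    for index, key in enumerate(header):
        parts = key.split('/')
        # Count the leading '..' elements of this column name.
        ups = 0
        while parts[ups] == '..':
            ups += 1
        if ups:
            # Replace the '..' prefix with the parent path of the previous column.
            parts[:ups] = previous[:-ups]
        previous = parts
        path_map[index] = tuple(parts)
    return path_map


def row_to_dict(path_map, row):
    """Build a fresh nested dict for one row (a list of strings)."""
    result = {}
    for index, path in path_map.items():
        node = result
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = row[index]
    return result


# Made-up sample data in the same shape as the question's CSV.
sample = io.StringIO(
    "chip,cpus/0/compatible,clocks/HSE/compatible,../frequency\n"
    "STM32F401CB,a,b,0\n"
)
reader = csv.reader(sample)
path_map = build_path_map(next(reader)[1:])  # skip the 'chip' index column
rows = {row[0]: row_to_dict(path_map, row[1:]) for row in reader}
print(rows)
```

Keying the result by the first column is an assumption based on the question's remark that this column is used for indexing.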
Oh and yes, as an added bonus, you may want to use ast.literal_eval() to turn your values into proper python types:
>>> import ast
>>> ast.literal_eval("False")
False
>>> ast.literal_eval('["on-chip-flash"]')
['on-chip-flash']
>>> ast.literal_eval('0x8000000')
134217728
>>> ast.literal_eval('["ARM,Cortex-M4", "ARM,ARMv7-M"]')
['ARM,Cortex-M4', 'ARM,ARMv7-M']
>>> ast.literal_eval("this should fail")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/ast.py", line 49, in literal_eval
node_or_string = parse(node_or_string, mode='eval')
File "/usr/lib/python2.7/ast.py", line 37, in parse
return compile(source, filename, mode, PyCF_ONLY_AST)
File "<unknown>", line 1
this should fail
^
SyntaxError: invalid syntax
>>> def to_python(value):
...     try:
...         return ast.literal_eval(value)
...     except Exception as e:
...         return value
...
>>> to_python('["on-chip-flash"]')
['on-chip-flash']
>>> to_python('wtf')
'wtf'
>>>
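To apply this conversion to a whole populated row, one option is to walk the nested dict once more. This is a sketch under the assumption that empty CSV cells should become None (the question does not say how they should be treated), written for Python 3; convert_tree is a hypothetical helper name, and to_python is a slightly narrowed variant of the function above.

```python
import ast


def to_python(value):
    """Turn a CSV string into a Python literal where possible."""
    if value == '':
        return None  # assumption: an empty cell means "no value"
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Not a valid literal: keep the original string.
        return value


def convert_tree(node):
    """Return a copy of a nested dict with every leaf converted."""
    if isinstance(node, dict):
        return {key: convert_tree(child) for key, child in node.items()}
    return to_python(node)


populated = {'memories': {'flash': {'address': '0x8000000',
                                    'compatible': '["on-chip-flash"]',
                                    'size': '131072'},
                          'ccm': {'address': ''}},
             'pin-controller': {'GPIOA': {'enabled': 'False'}}}
converted = convert_tree(populated)
print(converted)
```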
Related
I have a problem. I have a dict my_Dict. This is somewhat nested. However, I would like to 'clean up' the dict my_Dict, by this I mean that I would like to separate all nested ones and also generate a unique ID so that I can later find the corresponding object again.
For example, I have detail: {...}. This nested dict should later become an independent dict my_Detail_Dict, and in addition detail should receive a unique ID within my_Dict. Unfortunately, the list that I return is empty. How can I remove the nested values from their keys and give them an ID?
my_Dict = {
'_key': '1',
'group': 'test',
'data': {},
'type': '',
'code': '007',
'conType': '1',
'flag': None,
'createdAt': '2021',
'currency': 'EUR',
'detail': {
'selector': {
'number': '12312',
'isTrue': True,
'requirements': [{
'type': 'customer',
'requirement': '1'}]
}
}
}
def nested_dict(my_Dict):
    my_new_dict_list = []
    for key in my_Dict.keys():
        #print(f"Looking for {key}")
        if isinstance(my_Dict[key], dict):
            print(f"{key} is nested")
            # Add id to nested stuff
            my_Dict[key]["__id"] = 1
            my_nested_Dict = my_Dict[key]
            # Delete all nested from the key
            del my_Dict[key]
            # Add id to key, but not the nested stuff
            my_Dict[key] = 1
            my_new_dict_list.append(my_Dict[key])
    my_new_dict_list.append(my_Dict)
    return my_new_dict_list
nested_dict(my_Dict)
[OUT] []
# What I want
[my_Dict, my_Details_Dict, my_Data_Dict]
What I have
{'_key': '1',
'group': 'test',
'data': {},
'type': '',
'code': '007',
'conType': '1',
'flag': None,
'createdAt': '2021',
'currency': 'EUR',
'detail': {'selector': {'number': '12312',
'isTrue': True,
'requirements': [{'type': 'customer', 'requirement': '1'}]}}}
What I want
my_Dict = {'_key': '1',
'group': 'test',
'data': 18,
'type': '',
'code': '007',
'conType': '1',
'flag': None,
'createdAt': '2021',
'currency': 'EUR',
'detail': 22}
my_Data_Dict = {'__id': 18}
my_Detail_Dict = {'selector': {'number': '12312',
'isTrue': True,
'requirements': [{'type': 'customer', 'requirement': '1'}]}, '__id': 22}
The following code snippet will solve what you are trying to do:
my_Dict = {
'_key': '1',
'group': 'test',
'data': {},
'type': '',
'code': '007',
'conType': '1',
'flag': None,
'createdAt': '2021',
'currency': 'EUR',
'detail': {
'selector': {
'number': '12312',
'isTrue': True,
'requirements': [{
'type': 'customer',
'requirement': '1'}]
}
}
}
def nested_dict(my_Dict):
    # Initializing a dictionary that will store all the nested dictionaries
    my_new_dict = {}
    idx = 0
    for key in my_Dict.keys():
        # Checking which keys are nested i.e are dictionaries
        if isinstance(my_Dict[key], dict):
            # Generating ID
            idx += 1
            # Adding generated ID as another key
            my_Dict[key]["__id"] = idx
            # Adding nested key with the ID to the new dictionary
            my_new_dict[key] = my_Dict[key]
            # Replacing nested key value with the generated ID
            my_Dict[key] = idx
    # Returning new dictionary containing all nested dictionaries with ID
    return my_new_dict
result = nested_dict(my_Dict)
print(my_Dict)
# Iterating through dictionary to get all nested dictionaries
for item in result.items():
    print(item)
If I understand you correctly, you wish to automatically make each nested dictionary its own variable, and remove it from the main dictionary.
Finding the nested dictionaries and removing them from the main dictionary is not so difficult. However, automatically assigning them to a variable is not recommended for various reasons. Instead, what I would do is store all these dictionaries in a list, and then assign them manually to a variable.
# Prepare a list to store data in
individual_dicts = []
id_index = 1
for key in my_Dict.keys():
    # For each key, we get the current value
    value = my_Dict[key]
    # Determine if the current value is a dictionary. If so, then it's a nested dict
    if isinstance(value, dict):
        print(key + " is a nested dict")
        # Get the nested dictionary, and replace it with the ID
        dict_value = my_Dict[key]
        my_Dict[key] = id_index
        # Add the id to previously nested dictionary
        dict_value['__id'] = id_index
        id_index = id_index + 1 # increase for next nested dict
        individual_dicts.append(dict_value) # store it as a new dictionary
# Manually write out variable names, and assign the nested dictionaries to them.
# The list follows the key order of my_Dict, where 'data' comes before 'detail'.
[my_Data_Dict, my_Details_Dict] = individual_dicts
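If mutating my_Dict in place is not wanted, the same idea can be written so that it returns new objects. This is only a sketch: split_nested is a hypothetical name, the IDs are simply counted up from 1 (the question's 18 and 22 look like arbitrary examples), and only the top level is split, as in the answers above.

```python
from itertools import count


def split_nested(source, ids=None):
    """Return (flat copy of source, {key: extracted nested dict with '__id'})."""
    ids = ids if ids is not None else count(1)
    flat = {}
    extracted = {}
    for key, value in source.items():
        if isinstance(value, dict):
            new_id = next(ids)
            # Copy the nested dict and tag it with its ID.
            extracted[key] = {**value, '__id': new_id}
            # The parent keeps only the ID.
            flat[key] = new_id
        else:
            flat[key] = value
    return flat, extracted


example = {'_key': '1', 'data': {}, 'detail': {'selector': {'number': '12312'}}}
flat, extracted = split_nested(example)
print(flat)
print(extracted)
```

Keeping the extracted dicts in a mapping keyed by their original name avoids relying on list order when assigning them to variables.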
I am working on a coding challenge for self-development and I came across a question where I am given an input like this:
add {"id":1,"last":"Doe","first":"John","location":{"city":"Oakland","state":"CA","postalCode":"94607"},"active":true}
add {"id":2,"last":"Doe","first":"Jane","location":{"city":"San Francisco","state":"CA","postalCode":"94105"},"active":true}
add {"id":3,"last":"Black","first":"Jim","location":{"city":"Spokane","state":"WA","postalCode":"99207"},"active":true}
add {"id":4,"last":"Frost","first":"Jack","location":{"city":"Seattle","state":"WA","postalCode":"98204"},"active":false}
get {"location":{"state":"WA"},"active":true}
get {"id":1}
get {"active":true}
delete {"active":true}
get {}
And what I am doing is adding the entries that start with add to a list called database = []:
json_input = []
database = []
for line in sys.stdin:
    json_input.append(line.split("', "))
for i in range(0, len(json_input)):
    if json_input[i][0] == 'add':
        database.append(json_input[i][1])
What I want to do is print every entry that matches what follows get, and delete every entry that matches what follows delete. This is where I am stuck. Currently, this is what json_input looks like; database is empty:
[
['add {"id":1,"last":"Doe","first":"John","location":{"city":"Oakland","state":"CA","postalCode":"94607"},"active":true}\n'],
['add {"id":2,"last":"Doe","first":"Jane","location":{"city":"San Francisco","state":"CA","postalCode":"94105"},"active":true}\n'],
['add {"id":3,"last":"Black","first":"Jim","location":{"city":"Spokane","state":"WA","postalCode":"99207"},"active":true}\n'],
['add {"id":4,"last":"Frost","first":"Jack","location":{"city":"Seattle","state":"WA","postalCode":"98204"},"active":false}\n'],
['get {"location":{"state":"WA"},"active":true}\n'], ['get {"id":1}\n'],
['get {"active":true}\n'], ['delete {"active":true}\n'],
['get {}']
]
Perhaps an easy-to-read way to handle this would be a simple class that maintains a list of records. You can add methods for the various commands you want to handle. Then it's just a matter of defining the methods and processing the input to pass to the methods. Here's a possible way (without any frills like error checking):
import json
raw_data = '''add {"id":1,"last":"Doe","first":"John","location":{"city":"Oakland","state":"CA","postalCode":"94607"},"active":true}
add {"id":2,"last":"Doe","first":"Jane","location":{"city":"San Francisco","state":"CA","postalCode":"94105"},"active":true}
add {"id":3,"last":"Black","first":"Jim","location":{"city":"Spokane","state":"WA","postalCode":"99207"},"active":true}
add {"id":4,"last":"Frost","first":"Jack","location":{"city":"Seattle","state":"WA","postalCode":"98204"},"active":false}
get {"location":{"state":"WA"},"active":true}
get {"id":1}
get {"active":true}
delete {"active":true}
get {}'''
class Data:
    @staticmethod
    def matches(obj, query):
        if not isinstance(query, dict):
            return obj == query
        return all(Data.matches(obj.get(key), q) for key, q in query.items())
    def __init__(self):
        self.data = []
    def add(self, record):
        self.data.append(record)
    def get(self, query):
        for item in self.data:
            if (Data.matches(item, query)):
                print(item)
    def delete(self, query):
        self.data = [record for record in self.data if not Data.matches(record, query)]
data = Data()
for line in raw_data.split('\n'):
    command, line = line.split(None, 1)
    command = getattr(data, command)
    command(json.loads(line))
This will print the active record from WA, then the record with id 1, then all the active:True records. After deleting the active records, it prints everything that is left (the result of the {} query), which is only the active:False record:
{'id': 3, 'last': 'Black', 'first': 'Jim', 'location': {'city': 'Spokane', 'state': 'WA', 'postalCode': '99207'}, 'active': True}
{'id': 1, 'last': 'Doe', 'first': 'John', 'location': {'city': 'Oakland', 'state': 'CA', 'postalCode': '94607'}, 'active': True}
{'id': 1, 'last': 'Doe', 'first': 'John', 'location': {'city': 'Oakland', 'state': 'CA', 'postalCode': '94607'}, 'active': True}
{'id': 2, 'last': 'Doe', 'first': 'Jane', 'location': {'city': 'San Francisco', 'state': 'CA', 'postalCode': '94105'}, 'active': True}
{'id': 3, 'last': 'Black', 'first': 'Jim', 'location': {'city': 'Spokane', 'state': 'WA', 'postalCode': '99207'}, 'active': True}
{'id': 4, 'last': 'Frost', 'first': 'Jack', 'location': {'city': 'Seattle', 'state': 'WA', 'postalCode': '98204'}, 'active': False}
If this were a test or a serious coding challenge, you would probably want to look carefully at matches() to make sure it properly handles edge cases (I didn't do that).
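Two such edge cases are a query that descends into a value that is not a dict (obj.get would raise AttributeError) and a query for None on a key the record does not have (obj.get returns None, so it would match). A possible hardened variant, offered as a sketch rather than a vetted solution, written as a plain function for brevity:

```python
def matches(obj, query):
    """Return True if every key/value in query is present in obj, recursively."""
    if not isinstance(query, dict):
        return obj == query
    if not isinstance(obj, dict):
        # The query expects a nested object but the record has a scalar here.
        return False
    # Require the key to exist, so a missing key never equals None by accident.
    return all(key in obj and matches(obj[key], q) for key, q in query.items())


record = {"id": 3, "location": {"state": "WA"}, "active": True, "note": None}
print(matches(record, {"location": {"state": "WA"}, "active": True}))
print(matches(record, {"id": {"nested": 1}}))
print(matches(record, {"missing": None}))
```

Whether a missing key should match a None query depends on the challenge's rules, so treating it as a non-match here is an assumption.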
I have a CSV with 500+ rows where one column, "_source", is stored as JSON. I want to extract that into a pandas DataFrame, with each key as its own column. The data is a roughly 1 MB JSON file of online social media data (Facebook, Twitter, web-crawled, etc.). There are approximately 528 separate rows of posts/tweets/text, each containing many dictionaries inside dictionaries. I am attaching a few steps from my Jupyter notebook below to give a more complete picture. I need to turn all key-value pairs of the nested dictionaries into columns of a DataFrame.
Thank you so much this will be a huge help!!!
I have tried changing it to a dataframe by doing this
source = pd.DataFrame.from_dict(source, orient='columns')
And it returns something like this... I thought it might unpack the dictionary but it did not.
#source.head()
#_source
#0 {'sub_organization_id': 'default', 'uid': 'aba...
#1 {'sub_organization_id': 'default', 'uid': 'ab0...
#2 {'sub_organization_id': 'default', 'uid': 'ac0...
below is the shape
#source.shape (528, 1)
Below is what an actual "_source" row looks like when stretched out. There are many dictionaries and key:value pairs where each key needs to be its own column. Thanks! The actual links have been altered/scrambled for privacy reasons.
{'sub_organization_id': 'default',
'uid': 'ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b',
'project_veid': 'default',
'campaign_id': 'default',
'organization_id': 'default',
'meta': {'rule_matcher': [{'atribs': {'website': 'github.com/res',
'source': 'Explicit',
'version': '1.1',
'type': 'crawl'},
'results': [{'rule_type': 'hashtag',
'rule_tag': 'Far',
'description': None,
'project_veid': 'A7180EA-7078-0C7F-ED5D-86AD7',
'campaign_id': '2A6DA0C-365BB-67DD-B05830920',
'value': '#Far',
'organization_id': None,
'sub_organization_id': None,
'appid': 'ray',
'project_id': 'CDE2F42-5B87-C594-C900E578C',
'rule_id': '1838',
'node_id': None,
'metadata': {'campaign_title': 'AF',
'project_title': 'AF '}}]}],
'render': [{'attribs': {'website': 'github.com/res',
'version': '1.0',
'type': 'Page Render'},
'results': [{'render_status': 'success',
'path': 'https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg',
'image_hash': 'bb7674b8ea3fc05bfd027a19815f82c',
'url': 'https://discooprdapp.com/',
'load_time': 32}]}]},
'norm_attribs': {'website': 'github.com/res',
'version': '1.1',
'type': 'crawl'},
'project_id': 'default',
'system_timestamp': '2019-02-22T19:04:53.569623',
'doc': {'appid': 'subtter',
'links': [],
'response_url': 'https://discooprdapp.com',
'url': 'https://discooprdapp.com/',
'status_code': 200,
'status_msg': 'OK',
'encoding': 'utf-8',
'attrs': {'uid': '2ab8f2651cb32261b911c990a8b'},
'timestamp': '2019-02-22T19:04:53.963',
'crawlid': '7fd95-785-4dd259-fcc-8752f'},
'type': 'crawl',
'norm': {'body': '\n',
'domain': 'discordapp.com',
'author': 'crawl',
'url': 'https://discooprdapp.com',
'timestamp': '2019-02-22T19:04:53.961283+00:00',
'id': '7fc5-685-4dd9-cc-8762f'}}
before you post make sure the actual code works for the data attached. Thanks!
I tried the code below, but it did not work; there was a syntax error that I could not figure out.
pd.io.json.json_normalize(source_data.[_source].apply(json.loads))
pd.io.json.json_normalize(source_data.[_source].apply(json.loads))
^
SyntaxError: invalid syntax
Whoever can help me with this will be a saint!
I had to do something like that a while back. Basically, I used a function that completely flattened out the JSON to identify the keys that would be turned into columns, then iterated through the flattened result to reconstruct a row and append each row to a "results" dataframe. With the data you provided, it created a 52-column row and, looking through it, it appears to have put every key into its own column. Anything nested, for example 'meta': {'rule_matcher':[{'atribs': {'website': ...]}, should then have a column name like meta.rule_matcher.atribs.website, where the '.' denotes those nested keys.
data_source = {'sub_organization_id': 'default',
'uid': 'ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b',
'project_veid': 'default',
'campaign_id': 'default',
'organization_id': 'default',
'meta': {'rule_matcher': [{'atribs': {'website': 'github.com/res',
'source': 'Explicit',
'version': '1.1',
'type': 'crawl'},
'results': [{'rule_type': 'hashtag',
'rule_tag': 'Far',
'description': None,
'project_veid': 'A7180EA-7078-0C7F-ED5D-86AD7',
'campaign_id': '2A6DA0C-365BB-67DD-B05830920',
'value': '#Far',
'organization_id': None,
'sub_organization_id': None,
'appid': 'ray',
'project_id': 'CDE2F42-5B87-C594-C900E578C',
'rule_id': '1838',
'node_id': None,
'metadata': {'campaign_title': 'AF',
'project_title': 'AF '}}]}],
'render': [{'attribs': {'website': 'github.com/res',
'version': '1.0',
'type': 'Page Render'},
'results': [{'render_status': 'success',
'path': 'https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg',
'image_hash': 'bb7674b8ea3fc05bfd027a19815f82c',
'url': 'https://discooprdapp.com/',
'load_time': 32}]}]},
'norm_attribs': {'website': 'github.com/res',
'version': '1.1',
'type': 'crawl'},
'project_id': 'default',
'system_timestamp': '2019-02-22T19:04:53.569623',
'doc': {'appid': 'subtter',
'links': [],
'response_url': 'https://discooprdapp.com',
'url': 'https://discooprdapp.com/',
'status_code': 200,
'status_msg': 'OK',
'encoding': 'utf-8',
'attrs': {'uid': '2ab8f2651cb32261b911c990a8b'},
'timestamp': '2019-02-22T19:04:53.963',
'crawlid': '7fd95-785-4dd259-fcc-8752f'},
'type': 'crawl',
'norm': {'body': '\n',
'domain': 'discordapp.com',
'author': 'crawl',
'url': 'https://discooprdapp.com',
'timestamp': '2019-02-22T19:04:53.961283+00:00',
'id': '7fc5-685-4dd9-cc-8762f'}}
Code:
def flatten_json(y):
    out = {}
    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x
    flatten(y)
    return out
flat = flatten_json(data_source)
import pandas as pd
import re
results = pd.DataFrame()
special_cols = []
columns_list = list(flat.keys())
for item in columns_list:
    try:
        # keys that came from a list contain an index such as '_0_'
        row_idx = re.findall(r'\_(\d+)\_', item )[0]
    except IndexError:
        # no list index in the key: handle it after the loop
        special_cols.append(item)
        continue
    column = re.findall(r'\_\d+\_(.*)', item )[0]
    column = re.sub(r'\_\d+\_', '.', column)
    row_idx = int(row_idx)
    value = flat[item]
    results.loc[row_idx, column] = value
for item in special_cols:
    results[item] = flat[item]
Output:
print (results.to_string())
atribs_website atribs_source atribs_version atribs_type results.rule_type results.rule_tag results.description results.project_veid results.campaign_id results.value results.organization_id results.sub_organization_id results.appid results.project_id results.rule_id results.node_id results.metadata_campaign_title results.metadata_project_title attribs_website attribs_version attribs_type results.render_status results.path results.image_hash results.url results.load_time sub_organization_id uid project_veid campaign_id organization_id norm_attribs_website norm_attribs_version norm_attribs_type project_id system_timestamp doc_appid doc_response_url doc_url doc_status_code doc_status_msg doc_encoding doc_attrs_uid doc_timestamp doc_crawlid type norm_body norm_domain norm_author norm_url norm_timestamp norm_id
0 github.com/res Explicit 1.1 crawl hashtag Far NaN A7180EA-7078-0C7F-ED5D-86AD7 2A6DA0C-365BB-67DD-B05830920 #Far NaN NaN ray CDE2F42-5B87-C594-C900E578C 1838 NaN AF AF github.com/res 1.0 Page Render success https://east.amanaws.com/rays-ime-store/render... bb7674b8ea3fc05bfd027a19815f82c https://discooprdapp.com/ 32.0 default ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b default default default github.com/res 1.1 crawl default 2019-02-22T19:04:53.569623 subtter https://discooprdapp.com https://discooprdapp.com/ 200 OK utf-8 2ab8f2651cb32261b911c990a8b 2019-02-22T19:04:53.963 7fd95-785-4dd259-fcc-8752f crawl \n discordapp.com crawl https://discooprdapp.com 2019-02-22T19:04:53.961283+00:00 7fc5-685-4dd9-cc-8762f
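For comparison, here is a standard-library-only sketch of the flattening step that keeps the full path in each column name, joined with '.', so that keys such as meta.rule_matcher.0.atribs.website stay distinct. It is not the code that produced the output above; flatten is a hypothetical name and the sample record is a trimmed-down version of the data in the question. A list of such flat dicts, one per row, can then be passed to pandas.DataFrame.

```python
def flatten(value, prefix=''):
    """Flatten nested dicts/lists into {'a.b.0.c': leaf} pairs."""
    items = {}
    if isinstance(value, dict):
        for key, child in value.items():
            items.update(flatten(child, f'{prefix}{key}.'))
    elif isinstance(value, list):
        # List positions become part of the path.
        for position, child in enumerate(value):
            items.update(flatten(child, f'{prefix}{position}.'))
    else:
        # Leaf value: drop the trailing '.' from the accumulated path.
        items[prefix[:-1]] = value
    return items


record = {'uid': 'ac0f',
          'meta': {'rule_matcher': [{'atribs': {'website': 'github.com/res'}}]},
          'doc': {'links': [], 'status_code': 200}}
flat_record = flatten(record)
print(flat_record)
```

As in the answer's flatten_json, empty lists and dicts (such as 'links': []) produce no column at all.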
I have a python dictionary and I would like to find and replace part of the characters in the values of the dictionary. I'm using python 2.7.
My dictionary is
data1 = {'customer_order': {'id': '20'},
'patient':
{'birthdate': None,
'medical_proc': None,
'medical_ref': 'HG_CTRL12',
'name': 'Patient_96',
'sex': None},
'physician_name': 'John Doe'
}
I would like to change the underscore to backslash underscore only in the values of the dictionary, in this case only for Patient_96 and HG_CTRL12.
I would like to change it to the following:
data1 = {'customer_order': {'id': '20'},
'patient':
{'birthdate': None,
'medical_proc': None,
'medical_ref': 'HG\_CTRL12',
'name': 'Patient\_96',
'sex': None},
'physician_name': 'John Doe'
}
Thank you for your help
This function recursively replaces the underscore in the values of the dictionary with replace_char:
def replace_underscores(a_dict, replace_char):
    for k, v in a_dict.items():
        if not isinstance(v, dict):
            if v and '_' in v:
                a_dict[k] = v.replace('_', replace_char)
        else:
            replace_underscores(v, replace_char)
More on isinstance() here.
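The function above modifies the dictionary in place and assumes every non-dict value is either falsy or a string. A non-mutating variant that also tolerates numbers and lists might look like the sketch below. It is written for Python 3 (the question uses 2.7, where unicode values would need an extra check), and escape_underscores is a hypothetical name.

```python
def escape_underscores(value, replacement='\\_'):
    """Return a copy of value with '_' replaced in every string leaf."""
    if isinstance(value, dict):
        # Keys are left untouched; only values are processed.
        return {k: escape_underscores(v, replacement) for k, v in value.items()}
    if isinstance(value, list):
        return [escape_underscores(v, replacement) for v in value]
    if isinstance(value, str):
        return value.replace('_', replacement)
    # None, numbers, booleans, etc. are returned unchanged.
    return value


data1 = {'customer_order': {'id': '20'},
         'patient': {'birthdate': None,
                     'medical_ref': 'HG_CTRL12',
                     'name': 'Patient_96'},
         'physician_name': 'John Doe'}
escaped = escape_underscores(data1)
print(escaped)
```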
>>> for i in data1:
...     if type(data1[i]) is str:
...         if data1[i]:
...             data1[i] = data1[i].replace('_','\_')
...     elif type(data1[i]) is dict:
...         for j in data1[i]:
...             if data1[i][j]:
...                 data1[i][j] = data1[i][j].replace('_','\_')
...
>>>
>>>
>>> data1
{'physician_name': 'John Doe', 'customer_order': {'id': '20'}, 'patient': {'medical_ref': 'HG\\_CTRL12', 'medical_proc': None, 'name': 'Patient\\_96', 'birthdate': None, 'sex': None}}
I'm having problems getting my head around this Python data structure:
data = {'nmap': {'command_line': u'ls',
'scaninfo': {u'tcp': {'method': u'connect',
'services': u'80,443'}},
'scanstats': {'downhosts': u'0',
'elapsed': u'1.18',
'timestr': u'Wed Mar 19 21:37:54 2014',
'totalhosts': u'1',
'uphosts': u'1'}},
'scan': {u'url': {'addresses': {u'ipv6': u'2001:470:0:63::2'},
'hostname': u'abc.net',
'status': {'reason': u'syn-ack',
'state': u'up'},
u'tcp': {80: {'conf': u'3',
'cpe': '',
'extrainfo': '',
'name': u'http',
'product': '',
'reason': u'syn-ack',
'state': u'open',
'version': ''},
443: {'conf': u'3',
'cpe': '',
'extrainfo': '',
'name': u'https',
'product': '',
'reason': u'syn-ack',
'script': {
u'ssl-cert': u'place holder'},
'state': u'open',
'version': ''}},
'vendor': {}
}
}
}
Basically I need to iterate over the 'tcp' key values and extract the contents of the 'script' item if it exists.
This is what I've tried:
items = data["scan"]
for item in items['url']['tcp']:
    if t["script"] is not None:
        print t
However I can't seem to get it to work.
This will find any dictionary items with the key 'script' anywhere in the data structure:
def find_key(data, search_key, out=None):
    """Find all values from a nested dictionary for a given key."""
    if out is None:
        out = []
    if isinstance(data, dict):
        if search_key in data:
            out.append(data[search_key])
        for key in data:
            find_key(data[key], search_key, out)
    return out
For your data, I get:
>>> find_key(data, 'script')
[{'ssl-cert': 'place holder'}]
To find the ports, too, modify slightly:
tcp_dicts = find_key(data, 'tcp') # find all values for key 'tcp'
ports = [] # list to hold ports
for d in tcp_dicts: # iterate through values for key 'tcp'
    if all(isinstance(port, int) for port in d): # ensure all are port numbers
        for port in d:
            ports.append((port,
                          d[port].get('script'))) # extract number and script
Now you get something like:
[(80, None), (443, {'ssl-cert': 'place holder'})]
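If it is also useful to know where in the structure each match was found, a generator can yield the key path alongside the value. This is a sketch and not part of the answer above; find_key_paths is a hypothetical name and the data is a cut-down version of the question's structure.

```python
def find_key_paths(data, search_key, path=()):
    """Yield (path, value) for every occurrence of search_key in nested dicts."""
    if isinstance(data, dict):
        for key, value in data.items():
            if key == search_key:
                yield path + (key,), value
            # Keep descending, in case the key occurs again further down.
            yield from find_key_paths(value, search_key, path + (key,))


data = {'scan': {'url': {'tcp': {80: {'name': 'http'},
                                 443: {'name': 'https',
                                       'script': {'ssl-cert': 'place holder'}}}}}}
found = list(find_key_paths(data, 'script'))
print(found)
```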
data['scan']['url']['tcp'] is a dictionary, so when you just iterate over it, you get the keys but not the values. If you want to iterate over the values, you have to ask for them explicitly:
for t in data['scan']['url']['tcp'].values():
    if 'script' in t and t['script'] is not None:
        print(t)
If you need the key as well, iterate over the items instead:
for k, t in data['scan']['url']['tcp'].items():
    if 'script' in t and t['script'] is not None:
        print(k, t)
You also need to change your test to check 'script' in t first, otherwise accessing t['script'] will raise a key error.
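The difference between iterating over a dict directly and iterating over its items can be seen on a small stand-in for the 'tcp' dictionary. This is a minimal illustration using made-up variable names (tcp, ports, scripts), with dict.get used to avoid the KeyError:

```python
tcp = {
    80: {'name': 'http'},
    443: {'name': 'https', 'script': {'ssl-cert': 'place holder'}},
}

# Plain iteration over a dict yields its keys, here the port numbers.
ports = list(tcp)

# items() yields (key, value) pairs; get() returns None for a missing key.
scripts = {
    port: info['script']
    for port, info in tcp.items()
    if info.get('script') is not None
}

print(ports)
print(scripts)
```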
Don't you mean if item["script"]?
Really though if the key has a chance to not exist, use the get method provided by dict.
So try instead
items = data["scan"]
# iterating the dict itself yields the port numbers, so iterate over the values
for item in items['url']['tcp'].values():
    script = item.get('script')
    if script:
        print(script)