I have a JSON object in Python created through requests built in .json() function.
Here is a simplified sample of what I'm doing:
data = session.get(url)
obj = data.json()
s3object = s3.Object(s3_bucket, output_file)
s3object.put(Body=(bytes(json.dumps(obj).encode('UTF-8'))))
Example obj:
{'id': 'fab779b7-2586-4895-9f3b-c9518f34e028', 'project_id': 'a1a73e68-9943-4584-9d59-cc84a0d3e92b', 'created_at': '2017-10-23 02:57:03 -0700', 'sections': [{'section_name': '', 'items': [{'id': 'ffadc652-dd36-4b9f-817c-6539a4b462ab', 'created_at': '2017-10-23 03:36:13 -0700', 'updated_at': '2017-10-23 03:38:32 -0700', 'created_by': 'paul', 'question_text': 'Drawing Ref(s)', 'spec_ref': '', 'display_number': null, 'response': '', 'comment': 'see attached mh309', 'position': 1, 'is_conforming': 'N/A', 'display_type': 'text'}]}]}
I need to replace any occurrence of the string "N/A" with "Not Applicable" anywhere it appears regardless of its key or location before I upload the JSON to S3. I cannot use local disk writes hence the reason this is done this way.
Is this possible?
My original plan was to turn it to a string and just replace before turning back, is this inefficient?
Thanks,
As mentioned in the comments, obj is a dict. One way to replace N/A with Not Applicable regardless of location is to convert it to a string, use string.replace and convert it back to dict for further processing
import json
#Original dict with N/A
obj = {'id': 'fab779b7-2586-4895-9f3b-c9518f34e028', 'project_id': 'a1a73e68-9943-4584-9d59-cc84a0d3e92b', 'created_at': '2017-10-23 02:57:03 -0700', 'sections': [{'section_name': '', 'items': [{'id': 'ffadc652-dd36-4b9f-817c-6539a4b462ab', 'created_at': '2017-10-23 03:36:13 -0700', 'updated_at': '2017-10-23 03:38:32 -0700', 'created_by': 'paul', 'question_text': 'Drawing Ref(s)', 'spec_ref': '', 'display_number': None, 'response': '', 'comment': 'see attached mh309', 'position': 1, 'is_conforming': 'N/A', 'display_type': 'text'}]}]}
#Convert to string and replace
obj_str = json.dumps(obj).replace('N/A', 'Not Applicable')
#Get obj back with replacement
obj = json.loads(obj_str)
Although #Devesh Kumar Singh's answer works with the sample json data in your question, converting the whole thing to a string, and then doing a wholesale bulk replace of the substring seems possibly error-prone because potentially it might change it in portions other than only in the values associated with dictionary keys.
To avoid that I would suggest using the following, which is more selective even though it takes a few more lines of code:
import json
def replace_NA(obj):
def decode_dict(a_dict):
for key, value in a_dict.items():
try:
a_dict[key] = value.replace('N/A', 'Not Applicable')
except AttributeError:
pass
return a_dict
return json.loads(json.dumps(obj), object_hook=decode_dict)
obj = {'id': 'fab779b7-2586-4895-9f3b-c9518f34e028', 'project_id': 'a1a73e68-9943-4584-9d59-cc84a0d3e92b', 'created_at': '2017-10-23 02:57:03 -0700', 'sections': [{'section_name': '', 'items': [{'id': 'ffadc652-dd36-4b9f-817c-6539a4b462ab', 'created_at': '2017-10-23 03:36:13 -0700', 'updated_at': '2017-10-23 03:38:32 -0700', 'created_by': 'paul', 'question_text': 'Drawing Ref(s)', 'spec_ref': '', 'display_number': None, 'response': '', 'comment': 'see attached mh309', 'position': 1, 'is_conforming': 'N/A', 'display_type': 'text'}]}]}
obj = replace_NA(obj)
I guess the Object you've pasted here must be of dict type, you can check it as if "type(json_object) is class dict". With that assumption youcan do it as:-
keys = json_object.keys()
for i in keys:
if json_object[i]=="N/A":
json_object[i]="Not Available"
Hope it helps!
Related
I'd like to add
"5f6c" as a key with values as photo_dict = {'caption': 'eb test', 'photo_id':'330da114-e41e-4cee-ba15-f9632'} into the below using python. I am not sure how to go about it
{'record': {'status': 'bad', 'form_id': '16bba1', 'project_id': None, 'form_values': {'5121':
'yes', '8339': 'ZTVPNG', '6cd3': '234624', '6b5b': '105626', 'e1f6': '[]', '5f6c': [{'id':
'f6efe67d7c5', 'created_at': '1614189636', 'updated_at': '1614189636', 'form_values': {'4ba6':
'Gaaaaah!'}}}
Such that the dictionary becomes
{'record': {'status': 'bad', 'form_id': '16bba1', 'project_id': None, 'form_values': {'5121':
'yes', '8339': 'ZTVPNG', '6cd3': '234624', '6b5b': '105626', 'e1f6': '[]', '5f6c': [{'id':
'f6efe67d7c5', 'created_at': '1614189636', 'updated_at': '1614189636', 'form_values': {'4ba6':
'Gaaaaah!', '5f6c': [{'caption': eb test, 'photo_id': '330da114-e41e-4cee-ba15-f9632'}]}}}
So it seems your initial dictionary is incorrectly formed and does not fully match what you are asking in the question, I would start by updating it with a correct format .
That being said, I ll assume the format would be close to this:
{
"record":{
"status":"bad",
"form_id":"16bba1",
"project_id":"None",
"form_values":{
"5121":"yes",
"8339":"ZTVPNG",
"6cd3":"234624",
"6b5b":"105626",
"e1f6":"[]",
"5f6c":[
{
"id":"f6efe67d7c5",
"created_at":"1614189636",
"updated_at":"1614189636",
"form_values":{
"4ba6":"Gaaaaah!",
"5f6c": {
"caption":"eb test",
"photo_id":"330da114-e41e-4cee-ba15-f9632"
}
}
}
]
}
}
}
You want to modify 5f6c with value
{'caption': 'eb test', 'photo_id':'330da114-e41e-4cee-ba15-f9632'}
Its not very clear which key you actually want to modify as the key 5f6c can be found in two different place in your dict, but it seems what you are trying to do is just modify a dictionary.
So what you need to take away from this is that to modify a dictionary and list you simply do
myDict[key] = value
myList[indice] = value
You contact as much as you want the operation in the same line.
If we take your example from above, and try to modify the most narrowed value, it would give us
import json
records = json.loads(theString)
records['record']['form_values']['5f6c'][0]['form_values']['5f6c'] = {"caption":"eb test", "photo_id":"330da114-e41e-4cee-ba15-f9632" }
After correcting your formatting I would do this.
my_dict = {'record': {'status': 'bad',
'form_id': '16bba1',
'project_id': None,
'form_values': {'5121': 'yes',
'8339': 'ZTVPNG',
'6cd3': '234624',
'6b5b': '105626',
'e1f6': '[]',
'5f6c': [{'id': 'f6efe67d7c5',
'created_at': '1614189636',
'updated_at': '1614189636',
'form_values': {'4ba6': 'Gaaaaah!'}
}
]}
}
}
photo_dict = {'caption': 'eb test', 'photo_id':'330da114-e41e-4cee-ba15-f9632'}
my_dict['record']['form_values']['5f6c'][0]['form_values']['5f6c'] = photo_dict
By the way you can get tools in your IDE that will help you with formatting and handle it for you.
I have a file from an Open API Spec that I have been trying to access in a Jupyter notebook. It is a .yaml file. I was able to upload it into Jupyter and put it in the same folder as the notebook I'd like to use to access it. I am new to Jupyter and Python, so I'm sorry if this is a basic question. I found a forum that suggested this code to read the data (in my file: "openapi.yaml"):
import yaml
with open("openapi.yaml", 'r') as stream:
try:
print(yaml.safe_load(stream))
except yaml.YAMLError as exc:
print(exc)
This seems to bring the data in, but it is a completely unstructured stream like so:
{'openapi': '3.0.0', 'info': {'title': 'XY Tracking API', 'version': '2.0', 'contact': {'name': 'Narrativa', 'url': 'http://link, 'email': '}, 'description': 'The XY Tracking Project collects information from different data sources to provide comprehensive data for the XYs, X-Y. Contact Support:'}, 'servers': [{'url': 'link'}], 'paths': {'/api': {'get': {'summary': 'Data by date range', 'tags': [], 'responses': {'200': {'description': 'OK', 'content': {'application/json': {'schema': {'$ref': '#/components/schemas/covidtata'}}}}}, 'operationId': 'get-api', 'parameters': [{'schema': {'type': 'string', 'format': 'date'}, 'in': 'query', 'name': 'date_from', 'description': 'Date range beginig (YYYY-DD-MM)', 'required': True}, {'schema': {'type': 'string', 'format': 'date'}, 'in': 'query', 'name': 'date_to', 'description': 'Date range ending (YYYY-DD-MM)'}], 'description': 'Returns the data for a specific date range.'}}, '/api/{date}': {'parameters': [{'schema': {'type': 'string', 'format': 'date'}, 'name': 'date', 'in': 'path', 'required': True}], 'get': {'summary': 'Data by date', 'tags': [], 'responses': {'200': {'description': 'OK', 'content': {'application/json': {'schema': {'$ref': '#/components/schemas/data'}}}}}, 'operationId': 'get-api-date', 'description': 'Returns the data for a specific day.'}}, '/api/country/{country}': {'parameters': [{'schema': {'type': 'string', 'example': 'spain'}, 'name': 'country', 'in': 'path', 'required': True, 'example': 'spain'}, {'schema': {'type': 'strin
...etc.
I'd like to work through the data for analysis but can't seem to access it correctly. Any help would be extremely appreciated!!! Thank you so much for reading.
What you're seeing in the output is JSON. This is in a machine-readable format which doesn't need human-readable newlines or indentation. You should be able to work with this data just fine in your code.
Alternatively, you may want to consider another parser/emitter such as ruamel.yaml which can make dealing with YAML files considerably easier than the package you're currently importing. Print statements with this package can preserve lines and indentation for better readability.
I am fairly new to Python so my apologies if the terminology is mistaken. I assume it would be easier to explain through the code itself.
{'continue': {'rvcontinue': '20160625113031|17243371', 'continue': '||'},
'query': {'pages': {'4270': {'pageid': 4270,
'ns': 0,
'title': 'Bulgaristan',
'revisions': [{'revid': 17077223,
'parentid': 16909061,
'user': '85.103.140.217',
'anon': '',
'userid': 0,
'timestamp': '2016-05-11T15:30:31Z',
'comment': 'BULGARİSTAN',
'tags': ['visualeditor']},
{'revid': 17077230,
'parentid': 17077223,
'user': 'GurayKant',
'userid': 406350,
'timestamp': '2016-05-11T15:31:31Z',
'comment': '[[Özel:Katkılar/Muratero|Muratero]] ([[Kullanıcı_mesaj:Muratero|mesaj]]) tarafından yapılmış 16907788 numaralı değişiklikler geri getirildi. ([[VP:TW|TW]])',
'tags': []},
{'revid': 17079353,
'parentid': 17077230,
'user': '85.105.16.34',
'anon': '',
'userid': 0,
'timestamp': '2016-05-12T11:03:43Z',
'comment': 'Dipnotta 2001 sayımı verilmesine rağmen burada 2011 yazılmış.',
'tags': ['visualeditor']},
{'revid': 17085285,
'parentid': 17079353,
'user': 'İazak',
'userid': 200435,
'timestamp': '2016-05-14T09:36:18Z',
'comment': 'Gerekçe: Etnik dağılım kaynağı 2001, nüfus sayımı 2011 tarihli.',
'tags': []},
{'revid': 17109975,
'parentid': 17085285,
'user': 'Kudelski',
'userid': 167898,
'timestamp': '2016-05-21T13:14:44Z',
'comment': 'Düzeltme.',
'tags': []}]}}}}
From this code I want to get the values of 'pageid', 'title', 'revid', 'user', 'userid', 'timestamp', 'comment', 'tags'.
I can use y['query']['pages'] however I do not want to continue with '4270' since that number will be changing each time I run the API.
I hope it is explanatory enough! Thank you very much! I can give additional info if necessary!!
You can use list() to convert .keys() or .values() to list and get only first element.
data = {...your dictionary...}
item = list(data['query']['pages'].values())[0]
print('title:', item['title'])
print('pageid:', item['pageid'])
for x in item['revisions']:
print('---')
print('revid:', x['revid'])
print('user:', x['user'])
print('userid:', x['userid'])
print('timestamp:', x['timestamp'])
print('comment:', x['comment'])
print('tags:', x['tags'])
But if you have many pages in this directory then you should use for-loop
for key, item in data['query']['pages'].items():
print('key:', key)
print('title:', item['title'])
print('pageid:', item['pageid'])
for x in item['revisions']:
print('---')
print('revid:', x['revid'])
print('user:', x['user'])
print('userid:', x['userid'])
print('timestamp:', x['timestamp'])
print('comment:', x['comment'])
print('tags:', x['tags'])
I have a csv with 500+ rows where one column "_source" is stored as JSON. I want to extract that into a pandas dataframe. I need each key to be its own column. #I have a 1 mb Json file of online social media data that I need to convert the dictionary and key values into their own separate columns. The social media data is from Facebook,Twitter/web crawled... etc. There are approximately 528 separate rows of posts/tweets/text with each having many dictionaries inside dictionaries. I am attaching a few steps from my Jupyter notebook below to give a more complete understanding. need to turn all key value pairs for dictionaries inside dictionaries into columns inside a dataframe
Thank you so much this will be a huge help!!!
I have tried changing it to a dataframe by doing this
source = pd.DataFrame.from_dict(source, orient='columns')
And it returns something like this... I thought it might unpack the dictionary but it did not.
#source.head()
#_source
#0 {'sub_organization_id': 'default', 'uid': 'aba...
#1 {'sub_organization_id': 'default', 'uid': 'ab0...
#2 {'sub_organization_id': 'default', 'uid': 'ac0...
below is the shape
#source.shape (528, 1)
below is what the an actual "_source" row looks like stretched out. There are many dictionaries and key:value pairs where each key needs to be its own column. Thanks! The actual links have been altered/scrambled for privacy reasons.
{'sub_organization_id': 'default',
'uid': 'ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b',
'project_veid': 'default',
'campaign_id': 'default',
'organization_id': 'default',
'meta': {'rule_matcher': [{'atribs': {'website': 'github.com/res',
'source': 'Explicit',
'version': '1.1',
'type': 'crawl'},
'results': [{'rule_type': 'hashtag',
'rule_tag': 'Far',
'description': None,
'project_veid': 'A7180EA-7078-0C7F-ED5D-86AD7',
'campaign_id': '2A6DA0C-365BB-67DD-B05830920',
'value': '#Far',
'organization_id': None,
'sub_organization_id': None,
'appid': 'ray',
'project_id': 'CDE2F42-5B87-C594-C900E578C',
'rule_id': '1838',
'node_id': None,
'metadata': {'campaign_title': 'AF',
'project_title': 'AF '}}]}],
'render': [{'attribs': {'website': 'github.com/res',
'version': '1.0',
'type': 'Page Render'},
'results': [{'render_status': 'success',
'path': 'https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg',
'image_hash': 'bb7674b8ea3fc05bfd027a19815f82c',
'url': 'https://discooprdapp.com/',
'load_time': 32}]}]},
'norm_attribs': {'website': 'github.com/res',
'version': '1.1',
'type': 'crawl'},
'project_id': 'default',
'system_timestamp': '2019-02-22T19:04:53.569623',
'doc': {'appid': 'subtter',
'links': [],
'response_url': 'https://discooprdapp.com',
'url': 'https://discooprdapp.com/',
'status_code': 200,
'status_msg': 'OK',
'encoding': 'utf-8',
'attrs': {'uid': '2ab8f2651cb32261b911c990a8b'},
'timestamp': '2019-02-22T19:04:53.963',
'crawlid': '7fd95-785-4dd259-fcc-8752f'},
'type': 'crawl',
'norm': {'body': '\n',
'domain': 'discordapp.com',
'author': 'crawl',
'url': 'https://discooprdapp.com',
'timestamp': '2019-02-22T19:04:53.961283+00:00',
'id': '7fc5-685-4dd9-cc-8762f'}}
before you post make sure the actual code works for the data attached. Thanks!
The below code I tried but it did not work there was a syntax error that I could not figure out.
pd.io.json.json_normalize(source_data.[_source].apply(json.loads))
pd.io.json.json_normalize(source_data.[_source].apply(json.loads))
^
SyntaxError: invalid syntax
Whoever can help me with this will be a saint!
I had to do something like that a while back. Basically I used a function that completely flattened out the json to identify the keys that would be turned into the columns, then iterated through the json to reconstruct a row and append each row into a "results" dataframe. So with the data you provided, it created 52 column row and looking through it, looks like it included all the keys into it's own column. Anything nested, for example: 'meta': {'rule_matcher':[{'atribs': {'website': ...]} should then have a column name meta.rule_matcher.atribs.website where the '.' denotes those nested keys
data_source = {'sub_organization_id': 'default',
'uid': 'ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b',
'project_veid': 'default',
'campaign_id': 'default',
'organization_id': 'default',
'meta': {'rule_matcher': [{'atribs': {'website': 'github.com/res',
'source': 'Explicit',
'version': '1.1',
'type': 'crawl'},
'results': [{'rule_type': 'hashtag',
'rule_tag': 'Far',
'description': None,
'project_veid': 'A7180EA-7078-0C7F-ED5D-86AD7',
'campaign_id': '2A6DA0C-365BB-67DD-B05830920',
'value': '#Far',
'organization_id': None,
'sub_organization_id': None,
'appid': 'ray',
'project_id': 'CDE2F42-5B87-C594-C900E578C',
'rule_id': '1838',
'node_id': None,
'metadata': {'campaign_title': 'AF',
'project_title': 'AF '}}]}],
'render': [{'attribs': {'website': 'github.com/res',
'version': '1.0',
'type': 'Page Render'},
'results': [{'render_status': 'success',
'path': 'https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg',
'image_hash': 'bb7674b8ea3fc05bfd027a19815f82c',
'url': 'https://discooprdapp.com/',
'load_time': 32}]}]},
'norm_attribs': {'website': 'github.com/res',
'version': '1.1',
'type': 'crawl'},
'project_id': 'default',
'system_timestamp': '2019-02-22T19:04:53.569623',
'doc': {'appid': 'subtter',
'links': [],
'response_url': 'https://discooprdapp.com',
'url': 'https://discooprdapp.com/',
'status_code': 200,
'status_msg': 'OK',
'encoding': 'utf-8',
'attrs': {'uid': '2ab8f2651cb32261b911c990a8b'},
'timestamp': '2019-02-22T19:04:53.963',
'crawlid': '7fd95-785-4dd259-fcc-8752f'},
'type': 'crawl',
'norm': {'body': '\n',
'domain': 'discordapp.com',
'author': 'crawl',
'url': 'https://discooprdapp.com',
'timestamp': '2019-02-22T19:04:53.961283+00:00',
'id': '7fc5-685-4dd9-cc-8762f'}}
Code:
def flatten_json(y):
out = {}
def flatten(x, name=''):
if type(x) is dict:
for a in x:
flatten(x[a], name + a + '_')
elif type(x) is list:
i = 0
for a in x:
flatten(a, name + str(i) + '_')
i += 1
else:
out[name[:-1]] = x
flatten(y)
return out
flat = flatten_json(data_source)
import pandas as pd
import re
results = pd.DataFrame()
special_cols = []
columns_list = list(flat.keys())
for item in columns_list:
try:
row_idx = re.findall(r'\_(\d+)\_', item )[0]
except:
special_cols.append(item)
continue
column = re.findall(r'\_\d+\_(.*)', item )[0]
column = re.sub(r'\_\d+\_', '.', column)
row_idx = int(row_idx)
value = flat[item]
results.loc[row_idx, column] = value
for item in special_cols:
results[item] = flat[item]
Output:
print (results.to_string())
atribs_website atribs_source atribs_version atribs_type results.rule_type results.rule_tag results.description results.project_veid results.campaign_id results.value results.organization_id results.sub_organization_id results.appid results.project_id results.rule_id results.node_id results.metadata_campaign_title results.metadata_project_title attribs_website attribs_version attribs_type results.render_status results.path results.image_hash results.url results.load_time sub_organization_id uid project_veid campaign_id organization_id norm_attribs_website norm_attribs_version norm_attribs_type project_id system_timestamp doc_appid doc_response_url doc_url doc_status_code doc_status_msg doc_encoding doc_attrs_uid doc_timestamp doc_crawlid type norm_body norm_domain norm_author norm_url norm_timestamp norm_id
0 github.com/res Explicit 1.1 crawl hashtag Far NaN A7180EA-7078-0C7F-ED5D-86AD7 2A6DA0C-365BB-67DD-B05830920 #Far NaN NaN ray CDE2F42-5B87-C594-C900E578C 1838 NaN AF AF github.com/res 1.0 Page Render success https://east.amanaws.com/rays-ime-store/render... bb7674b8ea3fc05bfd027a19815f82c https://discooprdapp.com/ 32.0 default ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b default default default github.com/res 1.1 crawl default 2019-02-22T19:04:53.569623 subtter https://discooprdapp.com https://discooprdapp.com/ 200 OK utf-8 2ab8f2651cb32261b911c990a8b 2019-02-22T19:04:53.963 7fd95-785-4dd259-fcc-8752f crawl \n discordapp.com crawl https://discooprdapp.com 2019-02-22T19:04:53.961283+00:00 7fc5-685-4dd9-cc-8762f
I'm having problems getting my head around this Python data structure:
data = {'nmap': {'command_line': u'ls',
'scaninfo': {u'tcp': {'method': u'connect',
'services': u'80,443'}},
'scanstats': {'downhosts': u'0',
'elapsed': u'1.18',
'timestr': u'Wed Mar 19 21:37:54 2014',
'totalhosts': u'1',
'uphosts': u'1'}},
'scan': {u'url': {'addresses': {u'ipv6': u'2001:470:0:63::2'},
'hostname': u'abc.net',
'status': {'reason': u'syn-ack',
'state': u'up'},
u'tcp': {80: {'conf': u'3',
'cpe': '',
'extrainfo': '',
'name': u'http',
'product': '',
'reason': u'syn-ack',
'state': u'open',
'version': ''},
443: {'conf': u'3',
'cpe': '',
'extrainfo': '',
'name': u'https',
'product': '',
'reason': u'syn-ack',
'script': {
u'ssl-cert': u'place holder'},
'state': u'open',
'version': ''}},
'vendor': {}
}
}
}
Basically I need to iterate over the 'tcp' key values and extract the contents of the 'script' item if it exists.
This is what I've tried:
items = data["scan"]
for item in items['url']['tcp']:
if t["script"] is not None:
print t
However I can't seem to get it to work.
This will find any dictionary items with the key 'script' anywhere in the data structure:
def find_key(data, search_key, out=None):
"""Find all values from a nested dictionary for a given key."""
if out is None:
out = []
if isinstance(data, dict):
if search_key in data:
out.append(data[search_key])
for key in data:
find_key(data[key], search_key, out)
return out
For your data, I get:
>>> find_key(data, 'script')
[{'ssl-cert': 'place holder'}]
To find the ports, too, modify slightly:
tcp_dicts = find_key(data, 'tcp') # find all values for key 'tcp'
ports = [] # list to hold ports
for d in tcp_dicts: # iterate through values for key 'tcp'
if all(isinstance(port, int) for port in d): # ensure all are port numbers
for port in d:
ports.append((port,
d[port].get('script'))) # extract number and script
Now you get something like:
[(80, None), (443, {'ssl-cert': 'place holder'})]
data['scan']['url']['tcp'] is a dictionary, so when you just iterate over it, you will get the keys but not the values. If you want to iterate over the values, you have to do so:
for t in data['scan']['url']['tcp'].values():
if 'script' in t and t['script'] is not None:
print(t)
If you need the key as well, iterate over the items instead:
for k, t in data['scan']['url']['tcp'].items():
if 'script' in t and t['script'] is not None:
print(k, t)
You also need to change your test to check 'script' in t first, otherwise accessing t['script'] will raise a key error.
Don't you mean if item["script"]?
Really though if the key has a chance to not exist, use the get method provided by dict.
So try instead
items = data["scan"]
for item in items['url']['tcp']:
script = item.get('script')
if script:
print script