I need to convert a complex JSON file to CSV using Python. I have tried a lot of code without success, so I came here for help. I have updated the question: the JSON file contains about a million records, and I need to convert them to CSV format. The data looks like this:
{
"_id": {
"$oid": "2e3230"
},
"add": {
"address1": {
"address": "kvartira 14",
"zipcode": "10005",
},
"name": "Evgiya Kovava",
"address2": {
"country": "US",
"country_name": "NY",
}
}
}
{
"_id": {
"$oid": "2d118c8bo"
},
"add": {
"address1": {
"address": "kvartira 14",
"zipcode": "52805",
},
"name": "Eiya tceva",
"address2": {
"country": "US",
"country_name": "TX",
}
}
}
import pandas as pd

# "null" is defined so the pasted JSON snippet is a valid Python literal
null = 'null'
data = {
"_id": {
"$oid": "2e3230s314i5dc07e118c8bo"
},
"add": {
"address": {
"address_type": "Door",
"address": "kvartira 14",
"city": "new york",
"region": null,
"zipcode": "10005",
},
"name": "Evgeniya Kovantceva",
"type": "Private person",
"code": null,
"additional_phone_nums": null,
"email": null,
"notifications": [],
"address": {
"address": "kvartira 14",
"city": "new york",
"region": null,
"zipcode": "10005",
"country": "US",
"country_name": "NY",
}
}
}
df = pd.json_normalize(data)
df.to_csv('yourpath.csv')
Beware the null value. Also note that the "address" nested dictionary appears inside "add" twice, almost identical. Is that intentional?
EDIT
Ok, after your added information it looks like json.JSONDecoder() is what you need.
Originally posted by @pschill in this answer:
how to analyze json objects that are NOT separated by comma (preferably in Python)
I tried his code on your data:
import json
import pandas as pd
data = """{
"_id": {
"$oid": "2e3230"
},
"add": {
"address1": {
"address": "kvartira 14",
"zipcode": "10005"
},
"name": "Evgiya Kovava",
"address2": {
"country": "US",
"country_name": "NY"
}
}
}
{
"_id": {
"$oid": "2d118c8bo"
},
"add": {
"address1": {
"address": "kvartira 14",
"zipcode": "52805"
},
"name": "Eiya tceva",
"address2": {
"country": "US",
"country_name": "TX"
}
}
}"""
Keep in mind that your data also has trailing commas, which make it invalid JSON (the commas right before the closing brackets). You have to remove them with a regex or some other approach. For the purpose of this answer I removed them manually.
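One rough regex-based sketch for stripping the trailing commas (caution: this naive pattern would also strip a comma inside a string literal that happens to precede a closing brace, so treat it as a starting point, not a robust fix):

```python
import re

raw = '{"add": {"address1": {"zipcode": "10005",},},}'
# Drop any comma followed only by whitespace and a closing } or ]
cleaned = re.sub(r',\s*(?=[}\]])', '', raw)
print(cleaned)  # {"add": {"address1": {"zipcode": "10005"}}}
```

After this, the string parses as regular JSON.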
So after that I tried this:
content = data
parsed_values = []
decoder = json.JSONDecoder()
while content:
    value, new_start = decoder.raw_decode(content)
    content = content[new_start:].strip()
    # You can handle the value directly in this loop:
    # print("Parsed:", value)
    # Or you can store it in a container and use it later:
    parsed_values.append(value)
which gave me an error but the list seems to get populated with all the values:
parsed_values
[{'_id': {'$oid': '2e3230'},
'add': {'address1': {'address': 'kvartira 14', 'zipcode': '10005'},
'name': 'Evgiya Kovava',
'address2': {'country': 'US', 'country_name': 'NY'}}},
{'_id': {'$oid': '2d118c8bo'},
'add': {'address1': {'address': 'kvartira 14', 'zipcode': '52805'},
'name': 'Eiya tceva',
'address2': {'country': 'US', 'country_name': 'TX'}}}]
next I did:
df = pd.json_normalize(parsed_values)
which worked fine.
You can always save that to a csv with:
df.to_csv('yourpath.csv')
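For completeness, the raw_decode loop above can be wrapped in a reusable function. A sketch (the file paths in the comments are placeholders):

```python
import json

def parse_concatenated(text):
    """Parse a string of back-to-back JSON objects into a list of dicts."""
    decoder = json.JSONDecoder()
    values = []
    text = text.strip()
    while text:
        value, end = decoder.raw_decode(text)
        values.append(value)
        text = text[end:].strip()
    return values

# For the file with ~a million objects ('data.json' is a placeholder path):
# with open('data.json') as f:
#     records = parse_concatenated(f.read())
# import pandas as pd
# pd.json_normalize(records).to_csv('output.csv', index=False)
```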
Tell me if that helped.
Your JSON is quite problematic after all: duplicate keys, null values, trailing commas, dicts not separated by commas... It didn't catch the eye at first :P
Related
Goal: to create a script that will take a nested JSON object as input and output a CSV file with all keys as rows in the CSV.
Example:
{
"Document": {
"DocumentType": 945,
"Version": "V007",
"ClientCode": "WI",
"Shipment": [
{
"ShipmentHeader": {
"ShipmentID": 123456789,
"OrderChannel": "Shopify",
"CustomerNumber": 234234,
"VendorID": "2343SDF",
"ShipViaCode": "FEDX2D",
"AsnDate": "2018-01-27",
"AsnTime": "09:30:47-08:00",
"ShipmentDate": "2018-01-23",
"ShipmentTime": "09:30:47-08:00",
"MBOL": 12345678901234568,
"BOL": 12345678901234566,
"ShippingNumber": "1ZTESTTEST",
"LoadID": 321456987,
"ShipmentWeight": 10,
"ShipmentCost": 2.3,
"CartonsTotal": 2,
"CartonPackagingCode": "CTN25",
"OrdersTotal": 2
},
"References": [
{
"Reference": {
"ReferenceQualifier": "TST",
"ReferenceText": "Testing text"
}
}
],
"Addresses": {
"Address": [
{
"AddressLocationQualifier": "ST",
"LocationNumber": 23234234,
"Name": "John Smith",
"Address1": "123 Main St",
"Address2": "Suite 12",
"City": "Hometown",
"State": "WA",
"Zip": 92345,
"Country": "USA"
},
{
"AddressLocationQualifier": "BT",
"LocationNumber": 2342342,
"Name": "Jane Smith",
"Address1": "345 Second Ave",
"Address2": "Building 32",
"City": "Sometown",
"State": "CA",
"Zip": "23665-0987",
"Country": "USA"
}
]
},
"Orders": {
"Order": [
{
"OrderHeader": {
"PurchaseOrderNumber": 23456342,
"RetailerPurchaseOrderNumber": 234234234,
"RetailerOrderNumber": 23423423,
"CustomerOrderNumber": 234234234,
"Department": 3333,
"Division": 23423,
"OrderWeight": 10.23,
"CartonsTotal": 2,
"QTYOrdered": 12,
"QTYShipped": 23
},
"Cartons": {
"Carton": [
{
"SSCC18": 12345678901234567000,
"TrackingNumber": "1ZTESTTESTTEST",
"CartonContentsQty": 10,
"CartonWeight": 10.23,
"LineItems": {
"LineItem": [
{
"LineNumber": 1,
"ItemNumber": 1234567890,
"UPC": 9876543212,
"QTYOrdered": 34,
"QTYShipped": 32,
"QTYUOM": "EA",
"Description": "Shoes",
"Style": "Tall",
"Size": 9.5,
"Color": "Bllack",
"RetailerItemNumber": 2342333,
"OuterPack": 10
},
{
"LineNumber": 2,
"ItemNumber": 987654321,
"UPC": 7654324567,
"QTYOrdered": 12,
"QTYShipped": 23,
"QTYUOM": "EA",
"Description": "Sunglasses",
"Style": "Short",
"Size": 10,
"Color": "White",
"RetailerItemNumber": 565465456,
"OuterPack": 12
}
]
}
}
]
}
}
]
}
}
]
}
}
In the above JSON object, I want all the keys (nested included) in a list. Duplicates can be removed by using a set data structure. If a nested key occurs multiple times in the JSON, it can appear multiple times in the CSV.
I personally feel that recursion is a perfect fit for this type of problem when the amount of nesting you will encounter is unpredictable. Here is an example in Python of how you can use recursion to extract all keys. Cheers.
import json

row = ""

def extract_keys(data):
    global row
    if isinstance(data, dict):
        for key, value in data.items():
            row += key + "\n"
            extract_keys(value)
    elif isinstance(data, list):
        for element in data:
            extract_keys(element)

# MAIN
with open("input.json", "r") as rfile:
    dicts = json.load(rfile)

extract_keys(dicts)

with open("output.csv", "w") as wfile:
    wfile.write(row)
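If only the unique keys are needed (the question mentions deduplicating with a set), here is a sketch of the same recursion collecting into a set instead of a global string:

```python
def collect_keys(data, keys=None):
    """Recursively collect every dict key into a set (duplicates removed)."""
    if keys is None:
        keys = set()
    if isinstance(data, dict):
        for key, value in data.items():
            keys.add(key)
            collect_keys(value, keys)
    elif isinstance(data, list):
        for element in data:
            collect_keys(element, keys)
    return keys

sample = {"Document": {"Version": "V007", "Shipment": [{"ShipmentHeader": {"ShipmentID": 1}}]}}
print(sorted(collect_keys(sample)))
# ['Document', 'Shipment', 'ShipmentHeader', 'ShipmentID', 'Version']
```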
I'd like to preface by saying that I'm VERY new to Python, so apologies if the answer to this is obvious. :) I have a Python script that does a couple of API calls and returns JSON data. The output currently looks similar to the below:
IP Information is below:
{
"id": 318283,
"name": "Name",
"type": "IP4Address",
"properties": "VLAN=5|DeviceName=Device|Notes=This is a description address|Administration=Team that admins the system|Location-Code=Location|address=1.2.3.4|state=STATIC|"
}
IP Subnet Range Information is below:
{
"id": 118836,
"name": "VLAN Description",
"type": "IP4Network",
"properties": "Location-Code=Location|Notes=Description of the subnet|CIDR=1.2.3.0/25|allowDuplicateHost=disable|inheritAllowDuplicateHost=true|pingBeforeAssign=disable|inheritPingBeforeAssign=true|inheritDefaultDomains=true|defaultView=118346|inheritDefaultView=true|inheritDNSRestrictions=true|"
I'd like for the response to not have the "id" string and would like to split the various | delimited strings in properties into their own string so it looks like below:
{
"id": 318283,
"name": "Name",
"type": "IP4Address",
"properties":{
"VLAN": “5”,
"DeviceName": “Device”,
"Notes": “Description”,
"Administration": “Admin team”,
"Location-Code": “Location”,
"Address": “1.2.3.4”,
"State": “STATIC”
}
}
{
"id": 118836,
"name": "Subnet name",
"type": "IP4Network",
"properties":{
"Location-Code": "Location",
"Notes": "Subnet description. ",
"CIDR": "1.2.3.0/25",
"allowDuplicateHost": "disable",
"inheritAllowDuplicateHost": "true",
"pingBeforeAssign": "disable",
"inheritPingBeforeAssign": "true",
"inheritDefaultDomains": "true",
"defaultView": "118346",
"inheritDefaultView": "true",
"inheritDNSRestrictions": "true"
}
}
Any suggestions are greatly appreciated!
Here's a way to handle your pipe-separated properties:
a = { "id": 318283, "name": "Name", "type": "IP4Address", "properties": "VLAN=5|DeviceName=Device|Notes=This is a description address|Administration=Team that admins the system|Location-Code=Location|address=1.2.3.4|state=STATIC|" }
new_properties = {}
for key_val_pair in a['properties'].split("|"):
if key_val_pair.strip() == "":
continue
key, val = key_val_pair.split("=")
new_properties[key] = val
a["properties"] = new_properties
print(a)
Output:
{'id': 318283, 'name': 'Name', 'type': 'IP4Address', 'properties': {'VLAN': '5', 'DeviceName': 'Device', 'Notes': 'This is a description address', 'Administration': 'Team that admins the system', 'Location-Code': 'Location', 'address': '1.2.3.4', 'state': 'STATIC'}}
I have a JSON file with lots of data, and I want to keep only specific data.
I thought I could read the file, get all the data I want, and save it as a new JSON.
The JSON is like this:
{
"event": [
{
"date": "2019-01-01",
"location": "world",
"url": "www.com",
"comments": "null",
"country": "china",
"genre": "blues"
},
{
"date": "2000-01-01",
"location": "street x",
"url": "www.cn",
"comments": "null",
"country":"turkey",
"genre": "reds"
},
{...
and I want it to be like this (with just date and url from each event):
{
"event": [
{
"date": "2019-01-01",
"url": "www.com"
},
{
"date": "2000-01-01",
"url": "www.cn"
},
{...
I can open the JSON and read from it using
with open('xx.json') as f:
data = json.load(f)
data2=data["events"]["date"]
But I still need to understand how to save the data I want into a new JSON file, keeping its structure.
You can use a list comprehension to loop over the events and return a dictionary containing only the keys that you want.
data = { "event": [
{
"date": "2019-01-01",
"location": "world",
"url": "www.com",
"comments": None,
"country": "china",
"genre": "blues",
},
{
"date": "2000-01-01",
"location": "street x",
"url": "www.cn",
"comments": None,
"country" :"turkey",
"genre":"reds",
}
]}
# List comprehension
data["event"] = [{"date": x["date"], "url": x["url"]} for x in data["event"]]
Alternatively, you can map a function over the events list
keys_to_keep = ["date", "url"]

def subset_dict(d):
    return {x: d[x] for x in keys_to_keep}

data["event"] = list(map(subset_dict, data["event"]))
[
{
"id": 1,
"name": "Lea",
"username": "Bret",
"email": "hhaa#gma",
"address": {
"street": "Light",
"suite": "Apt. 5",
"city": "Gwen",
"zipcode": "3874",
"geo": {
"lat": "-37.3159",
"lng": "81.1496"
}
},
"phone": "1-770",
"website": "hilde.org",
"company": {
"name": "Roma",
"catchPhrase": "net",
"bs": "markets"
}
},
{
"id": 2,
"name": "Er",
"username": "Ant",
"email": "Sh",
"address": {
"street": "Vis",
"suite": "89",
"city": "Wibrugh",
"zipcode": "905",
"geo": {
"lat": "-43.9509",
"lng": "-34.4618"
}
},
"phone": "010-69",
"website": "ansia.net",
"company": {
"name": "Deist",
"catchPhrase": "contingency",
"bs": " supply-chains"
}
}
]
I am getting this data from web scraping and I would like to store it in a Netezza database. Can you please give me sample code? Do I need to correct the JSON first? If yes, how would I do it?
Also, when I try to iterate over the items in the list, I only get the last user ID's details.
I would suggest a different approach, due to better scalability:
1) Load the raw text data into a (temporary) table with the 'external table' syntax of Netezza.
2) Use these functions to parse the JSON data into table columns: https://developer.ibm.com/articles/i-json-table-trs/
I am having difficulty updating nested json structure in mongo.
I am using pymongo along with Mongoengine-Rest-framework.
Since this particular json has dynamic structure and is heavily nested, I chose to use pymongo over mongo-engine ORM.
The create, retrieve, and delete operations work fine.
I would like some suggestions on the update issue.
Lets consider a sample object which is already present in mongo:
st1 = {
"name": "Some_name",
"details": {
"address1": {
"house_no": "731",
"street": "Some_street",
"city": "some_city"
"state": "some_state"
}
}
}
If I try to update st1 by adding address2 to details, sending the JSON st2 in the update command with _id as the filter,
st2 = {
"details": {
"address2": {
"house_no": "5102",
"street": "Some_street",
"city": "some_city"
"state": "some_state"
}
}
}
I get the following object st3 as the result in Mongo,
st3 = {
"name": "Some_name",
"details": {
"address2": {
"house_no": " 5102",
"street": "Some_street",
"city": "some_city"
"state": "some_state"
}
}
}
instead of the expected st4 object.
st4 = {
"name": "Some_name",
"details": {
"address1": {
"house_no": "731",
"street": "Some_street",
"city": "some_city"
"state": "some_state"
},
"address2": {
"house_no": "5102",
"street": "Some_street",
"city": "some_city"
"state": "some_state"
}
}
}
my update command is:
result = collection.update_one({'_id': id}, doc)
where
id: _id of document
doc: (here) st2
collection: pymongo colllection object
The original JSON is 6 levels deep and the keys are dynamic. Updates will be needed at different depths.
First, change the object to update to this:
to_update = {
"house_no": "5102",
"street": "Some_street",
"city": "some_city",
"state": "some_state"
}
And then use it to update the specific part of the document you want:
collection.update_one({'_id': id}, {'$set': {"details.address2": to_update}})
Use this to add address2:
collection.update({'_id': ObjectId(doc_id)}, {'$set': {'details.%s' % 'address2': address2}}, upsert=True)
Check out the complete code:
import pymongo
from bson.objectid import ObjectId
data = {"name": "Some_name",
"details": {"address1": {"house_no": "731", "street": "Some_street", "city": "some_city", "state": "some_state"}}}
address2 = {"house_no": "731", "street": "Some_street", "city": "some_city", "state": "some_state"}
connect = pymongo.MongoClient('192.168.4.202', 20020)
database = connect['my_test']
collection = database['coll']
# # CREATE COLLECTIONS AND INSERT DATA
# _id = collection.insert(data)
# print _id
doc_id = '57568aa11ec52522343ee695'
collection.update({'_id': ObjectId(doc_id)}, {'$set': {'details.%s' % 'address2': address2}}, upsert=True)
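Since the keys are dynamic and the depth varies, a more general sketch (not tied to any particular document shape) is to flatten the update document into dot-notation paths and pass the result to $set, so only the targeted leaves are touched and siblings like address1 survive:

```python
def to_dot_paths(doc, prefix=""):
    """Flatten a nested dict into {'a.b.c': value} pairs suitable for $set."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            flat.update(to_dot_paths(value, path))
        else:
            flat[path] = value
    return flat

st2 = {"details": {"address2": {"house_no": "5102", "city": "some_city"}}}
print(to_dot_paths(st2))
# {'details.address2.house_no': '5102', 'details.address2.city': 'some_city'}

# Then, with a pymongo collection (sketch):
# collection.update_one({'_id': ObjectId(doc_id)}, {'$set': to_dot_paths(st2)})
```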