Using Pandas to convert JSON to CSV with specific fields - python

I am currently trying to convert a JSON file to a CSV file using Pandas.
The code I'm using now converts the JSON to a CSV file:
import pandas as pd
json_data = pd.read_json("out1.json")
from pandas.io.json import json_normalize
df = json_normalize(json_data["events"])
df.to_csv("out.csv)
This is my JSON file:
{
"events": [
{
"raw": "{\"level\": \"INFO\", \"message\": \"Disabled camera with QR scan on by 80801234 at Area A\n\"}",
"logtypes": [
"json"
],
"timestamp": 1537190572023,
"unparsed": null,
"logmsg": "{\"level\": \"INFO\", \"message\": \"Disabled camera with QR scan on by 80801234 at Area A\n\"}",
"id": "c77afb4c-ba7c-11e8-8000-12b233ae723a",
"tags": [
"INFO"
],
"event": {
"json": {
"message": "Disabled camera with QR scan on by 80801234 at Area A\n",
"level": "INFO"
},
"http": {
"clientHost": "116.197.237.29",
"contentType": "text/plain; charset=UTF-8"
}
}
},
{
"raw": "{\"level\": \"INFO\", \"message\": \"Employee number saved successfully.\"}",
"logtypes": [
"json"
],
"timestamp": 1537190528619,
"unparsed": null,
"logmsg": "{\"level\": \"INFO\", \"message\": \"Employee number saved successfully.\"}",
"id": "ad9c0175-ba7c-11e8-803d-12b233ae723a",
"tags": [
"INFO"
],
"event": {
"json": {
"message": "Employee number saved successfully.",
"level": "INFO"
},
"http": {
"clientHost": "116.197.237.29",
"contentType": "text/plain; charset=UTF-8"
}
}
}
]
}
But what I wanted was just some fields (timestamp, level, message) from the JSON file, not all of it.
I have tried a variety of ways:
df = json_normalize(json_data["timestamp"]) // gives a KeyError on 'timestamp'
df = json_normalize(json_data, 'timestamp', ['event', 'json', ['level', 'message']]) // TypeError: string indices must be integers
Where did I go wrong?

I don't think json_normalize is intended to work on this specific orientation. I could be wrong but from the documentation, it appears that normalization means "Deal with lists within each dictionary".
Assume data is
data = json.load(open('out1.json'))['events']
Look at the first entry
data[0]['timestamp']
1537190572023
json_normalize wants this to be a list
[{'timestamp': 1537190572023}]
Create augmented data2
I don't actually recommend this approach.
If we create data2 accordingly:
data2 = [{**d, **{'timestamp': [{'timestamp': d['timestamp']}]}} for d in data]
We can use json_normalize
json_normalize(
    data2, 'timestamp',
    [['event', 'json', 'level'], ['event', 'json', 'message']]
)
timestamp event.json.level event.json.message
0 1537190572023 INFO Disabled camera with QR scan on by 80801234 a...
1 1537190528619 INFO Employee number saved successfully.
Comprehension
I think it's simpler to just do
pd.DataFrame([
    (d['timestamp'],
     d['event']['json']['level'],
     d['event']['json']['message'])
    for d in data
], columns=['timestamp', 'level', 'message'])
timestamp level message
0 1537190572023 INFO Disabled camera with QR scan on by 80801234 a...
1 1537190528619 INFO Employee number saved successfully.
json_normalize
But without the fancy arguments
json_normalize(data).pipe(
    lambda d: d[['timestamp']].join(
        d.filter(like='event.json')
    )
)
timestamp event.json.level event.json.message
0 1537190572023 INFO Disabled camera with QR scan on by 80801234 a...
1 1537190528619 INFO Employee number saved successfully.
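A side note: in pandas 1.0 and later, json_normalize is exposed at the top level and the pandas.io.json import is deprecated. A minimal sketch of the whole task against the sample file above, assuming a recent pandas, including the to_csv step the question ultimately wants:
import json
import pandas as pd  # pandas >= 1.0: use pd.json_normalize

with open("out1.json") as f:
    events = json.load(f)["events"]

# flatten everything, keep only the wanted columns, then write the CSV
df = pd.json_normalize(events)[["timestamp", "event.json.level", "event.json.message"]]
df.columns = ["timestamp", "level", "message"]
df.to_csv("out.csv", index=False)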

Related

Getting this error when running the Python code in Databricks

THE ERROR
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).csv(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).csv(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).csv(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().
THE CODE
import itertools
from pyspark.sql.functions import explode, col
# Read the JSON file from Databricks storage
df_json = spark.read.json("/mnt/BigData_JSONFiles/new_test.json")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
# Convert the dataframe to a dictionary
data = df_json.toPandas().to_dict()
# Split the data into two parts
d1 = dict(itertools.islice(data.items(), 8))
d2 = dict(itertools.islice(data.items(), 8, len(data.items())))
# Convert the first part of the data back to a dataframe
df1 = spark.createDataFrame([d1])
# Write the first part of the data to a JSON file in Databricks storage
df1.write.format("json").save("/mnt/BigData_JSONFiles/new_test_header.json")
# Convert the second part of the data back to a dataframe
df2 = spark.createDataFrame([d2])
# Write the second part of the data to a JSON file in Databricks storage
df2.write.format("json").save("/mnt/BigData_JSONFiles/new_test_detail.json")
A SAMPLE OF THE LARGE JSON FILE
{
"reporting_entity_name": "launcher",
"reporting_entity_type": "launcher",
"plan_name": "launched",
"plan_id_type": "hios",
"plan_id": "1111111111",
"plan_market_type": "individual",
"last_updated_on": "2020-08-27",
"version": "1.0.0",
"in_network": [
{
"negotiation_arrangement": "ffs",
"name": "Boosters",
"billing_code_type": "CPT",
"billing_code_type_version": "2020",
"billing_code": "27447",
"description": "Boosters On Demand",
"negotiated_rates": [
{
"provider_groups": [
{
"npi": [
0
],
"tin": {
"type": "ein",
"value": "11-1111111"
}
}
],
"negotiated_prices": [
{
"negotiated_type": "negotiated",
"negotiated_rate": 123.45,
"expiration_date": "2022-01-01",
"billing_class": "organizational"
}
]
}
]
}
]
}
Hi, I am trying to divide a big JSON file into two parts, which is what the above code does. But it is failing with the error above, which says to cache; I added .cache() when loading the file but am still getting this error. Kindly let me know how I can solve it.
I was able to resolve this error by changing this
df_json = spark.read.json("/mnt/BigData_JSONFiles/new_test.json")
to this
df_json = spark.read.option("multiline","true").json("/mnt/BigData_JSONFiles/new_test.json")

Extracting data from JSON log

I am a beginner when it comes to programming. I'm trying to extract elements from a JSON log file, but I get an error and I don't know how to deal with it.
import json
with open("/Users/milosz/Desktop/logi.json") as f:
data = json.load(f)
print(type(data['Objects']))
print(data)
for object in data ['Objects']:
print(object)
Error:
File "/Users/milosz/PycharmProjects/JsonDataExtracter/Program/Python Exracter.py", line 4, in <module>
print(type(data['Objects']))
TypeError: list indices must be integers or slices, not str
Process finished with exit code 1
I am sending the log below.
{
"_id": "635bd4bfc594743ce9b1a5a3",
"dateStart": "2022-10-28T13:09:28.609Z",
"dateFinish": "2022-10-28T13:10:23.698Z",
"method": "customer.file.upsert",
"request": {
"Objects": [
{
"ERPId": "6915",
"B24Id": 403772,
"FileName": "B2B000202",
"FileContent": "JVBERi0xLjMNJeLjz9MN",
"B24EntityId": 3334
}
]
Following up on the guidance from @accdias, here is a code snippet that closes the gaps in your JSON snippet and demonstrates how to access the Objects section:
import json
json_string = """
{
"_id": "635bd4bfc594743ce9b1a5a3",
"dateStart": "2022-10-28T13:09:28.609Z",
"dateFinish": "2022-10-28T13:10:23.698Z",
"method": "customer.file.upsert",
"request": {
"Objects": [
{
"ERPId": "6915",
"B24Id": 403772,
"FileName": "B2B000202",
"FileContent": "JVBERi0xLjMNJeLjz9MN",
"B24EntityId": 3334
}
]
}
}
"""
json_dict = json.loads(json_string)
print(json_dict["request"]["Objects"])
Output:
[{'ERPId': '6915', 'B24Id': 403772, 'FileName': 'B2B000202', 'FileContent': 'JVBERi0xLjMNJeLjz9MN', 'B24EntityId': 3334}]
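From there you can iterate over the nested list as usual. Also note that the TypeError in the original traceback ("list indices must be integers or slices, not str") means the data returned by json.load was a top-level list, not a dict, so the real log file presumably holds a list of entries and you would index one first, e.g. data[0]['request']['Objects']. A small sketch using the field names from the snippet above:
for obj in json_dict["request"]["Objects"]:
    print(obj["FileName"], obj["ERPId"])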

Python - How to retrieve element from json

Aloha,
My Python routine will retrieve JSON from a site, then check the file, download another JSON based on the first answer, and eventually download a zip.
The first JSON file gives information about the documents.
Here's an example:
[
{
"id": "d9789918772f935b2d686f523d066a7b",
"originalName": "130010259_AC2_R44_20200101",
"type": "SUP",
"status": "document.deleted",
"legalStatus": "APPROVED",
"name": "130010259_SUP_R44_AC2",
"grid": {
"name": "R44",
"title": "GRAND EST"
},
"bbox": [
3.4212881,
47.6171589,
8.1598899,
50.1338684
],
"documentSource": "UPLOAD",
"uploadDate": "2020-06-25T14:56:27+02:00",
"updateDate": "2021-01-19T14:33:35+01:00",
"fileIdentifier": "SUP-AC2-R44-130010259-20200101",
"legalControlStatus": 101
},
{
"id": "6a9013bdde6acfa632861aeb1a02942b",
"originalName": "130010259_AC2_R44_20210101",
"type": "SUP",
"status": "document.production",
"legalStatus": "APPROVED",
"name": "130010259_SUP_R44_AC2",
"grid": {
"name": "R44",
"title": "GRAND EST"
},
"bbox": [
3.4212881,
47.6171589,
8.1598899,
50.1338684
],
"documentSource": "UPLOAD",
"uploadDate": "2021-01-18T16:37:01+01:00",
"updateDate": "2021-01-19T14:33:29+01:00",
"fileIdentifier": "SUP-AC2-R44-130010259-20210101",
"legalControlStatus": 101
},
{
"id": "efd51feaf35b12248966cb82f603e403",
"originalName": "130010259_PM2_R44_20210101",
"type": "SUP",
"status": "document.production",
"legalStatus": "APPROVED",
"name": "130010259_SUP_R44_PM2",
"grid": {
"name": "R44",
"title": "GRAND EST"
},
"bbox": [
3.6535762,
47.665021,
7.9509455,
49.907347
],
"documentSource": "UPLOAD",
"uploadDate": "2021-01-28T09:52:31+01:00",
"updateDate": "2021-01-28T18:53:34+01:00",
"fileIdentifier": "SUP-PM2-R44-130010259-20210101",
"legalControlStatus": 101
},
{
"id": "2e1b6104fdc09c84077d54fd9e74a7a7",
"originalName": "444619258_I4_R44_20210211",
"type": "SUP",
"status": "document.pre_production",
"legalStatus": "APPROVED",
"name": "444619258_SUP_R44_I4",
"grid": {
"name": "R44",
"title": "GRAND EST"
},
"bbox": [
2.8698336,
47.3373246,
8.0881368,
50.3796449
],
"documentSource": "UPLOAD",
"uploadDate": "2021-04-19T10:20:20+02:00",
"updateDate": "2021-04-19T14:46:21+02:00",
"fileIdentifier": "SUP-I4-R44-444619258-20210211",
"legalControlStatus": 100
}
]
What I'm trying to do is retrieve each "id" from this JSON file (e.g. "id": "2e1b6104fdc09c84077d54fd9e74a7a7").
I've tried:
import json
from jsonpath_rw import jsonpath, parse
import jsonpath_rw_ext as jp

with open('C:/temp/gpu/SUP/20210419/SUPGE.json') as f:
    d = json.load(f)
data = json.dumps(d)
print("oriName: {}".format(jp.match1("$.id[*]", data)))
It doesn't work. In fact, I'm not sure how jsonpath-rw is intended to work. Thankfully there was this blog post, but I'm still stuck.
Does anyone have a clue ?
With the id, I'll be able to download another json and in this json there'll be an archiveUrl to get the zipfile.
Thanks in advance.
import json

with open('SUPGE.json') as f:
    d = json.load(f)
for i in d:
    print(i.get('id'))
This will give you the ids only:
d9789918772f935b2d686f523d066a7b
6a9013bdde6acfa632861aeb1a02942b
efd51feaf35b12248966cb82f603e403
2e1b6104fdc09c84077d54fd9e74a7a7
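As an aside, the original jsonpath attempt likely failed for two reasons: jp.match1 should be given the parsed Python object rather than the string produced by json.dumps, and for a top-level list the path is $[*].id. A sketch, assuming jsonpath_rw_ext's match/match1 helpers:
import json
import jsonpath_rw_ext as jp

with open('C:/temp/gpu/SUP/20210419/SUPGE.json') as f:
    d = json.load(f)

print(jp.match('$[*].id', d))   # all ids as a list
print(jp.match1('$[0].id', d))  # just the first id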
Ok.
Here's what I've done.
import json
import urllib.request

# Not sure it's the best way to load JSON from a URL, but it works fine
# and I could test most of the code if needed.
def getResponse(url):
    operUrl = urllib.request.urlopen(url)
    if operUrl.getcode() == 200:
        data = operUrl.read()
        jsonData = json.loads(data)
    else:
        print("Error received", operUrl.getcode())
    return jsonData

# Here I get the JSON from the URL.
# This part will be a parameter in the final script,
# because I have a lot of territory to control.
d = getResponse('https://www.geoportail-urbanisme.gouv.fr/api/document?documentFamily=SUP&grid=R44&legalStatus=APPROVED')
for i in d:
    if i['status'] == 'document.production':
        print('id of doc in production:', i.get('id'))
        # Here we use the id to fetch the whole document.
        # Same server, same API, but a different URL.
        _URL = 'https://www.geoportail-urbanisme.gouv.fr/api/document/' + i.get('id') + '/details'
        d2 = getResponse(_URL)
        print('archive', d2['archiveUrl'])
        urllib.request.urlretrieve(d2['archiveUrl'], 'c:/temp/gpu/SUP/' + d2['metadata'] + '.zip')
# I used wget in the past and loved the progress bar.
# Maybe I'd switch to wget because of it.
# Works fine.
Thanks for your answer. I'm delighted to see that you can do amazing things with only the json library. Just normal stuff, but amazing.
Feel free to comment if you think I've missed something.
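One possible refinement, if you can add a dependency: the requests library makes the download helper shorter and raises on bad status codes instead of printing. A sketch:
import requests

def get_json(url):
    resp = requests.get(url)
    resp.raise_for_status()  # raises on any non-2xx response
    return resp.json()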

Dictionary length is equal to 3, but I receive a KeyError when trying to access an index

I am attempting to parse a json response that looks like this:
{
"links": {
"next": "http://www.neowsapp.com/rest/v1/feed?start_date=2015-09-08&end_date=2015-09-09&detailed=false&api_key=xxx",
"prev": "http://www.neowsapp.com/rest/v1/feed?start_date=2015-09-06&end_date=2015-09-07&detailed=false&api_key=xxx",
"self": "http://www.neowsapp.com/rest/v1/feed?start_date=2015-09-07&end_date=2015-09-08&detailed=false&api_key=xxx"
},
"element_count": 22,
"near_earth_objects": {
"2015-09-08": [
{
"links": {
"self": "http://www.neowsapp.com/rest/v1/neo/3726710?api_key=xxx"
},
"id": "3726710",
"neo_reference_id": "3726710",
"name": "(2015 RC)",
"nasa_jpl_url": "http://ssd.jpl.nasa.gov/sbdb.cgi?sstr=3726710",
"absolute_magnitude_h": 24.3,
"estimated_diameter": {
"kilometers": {
"estimated_diameter_min": 0.0366906138,
"estimated_diameter_max": 0.0820427065
},
"meters": {
"estimated_diameter_min": 36.6906137531,
"estimated_diameter_max": 82.0427064882
},
"miles": {
"estimated_diameter_min": 0.0227984834,
"estimated_diameter_max": 0.0509789586
},
"feet": {
"estimated_diameter_min": 120.3760332259,
"estimated_diameter_max": 269.1689931548
}
},
"is_potentially_hazardous_asteroid": false,
"close_approach_data": [
{
"close_approach_date": "2015-09-08",
"close_approach_date_full": "2015-Sep-08 09:45",
"epoch_date_close_approach": 1441705500000,
"relative_velocity": {
"kilometers_per_second": "19.4850295284",
"kilometers_per_hour": "70146.106302123",
"miles_per_hour": "43586.0625520053"
},
"miss_distance": {
"astronomical": "0.0269230459",
"lunar": "10.4730648551",
"kilometers": "4027630.320552233",
"miles": "2502653.4316094954"
},
"orbiting_body": "Earth"
}
],
"is_sentry_object": false
},
}
I am trying to figure out how to parse through this to get the "miss_distance" dictionary values. I am unable to wrap my head around it.
Here is what I have been able to do so far:
After I get a Response object from requests.get():
response = requests.get(url)
I convert the response object to a JSON object:
data = response.json()  # this returns a dictionary object
I try to parse the first level of the dictionary:
for i in data:
    if i == "near_earth_objects":
        dataset1 = data["near_earth_objects"]["2015-09-08"]
        # this returns the next object, which is of type list
Can someone please explain to me:
1. How to decipher this response in the first place.
2. How I can move forward in parsing the response object and get to the miss_distance dictionary.
Any pointers/help is appreciated. Thank you.
Your data will have multiple dictionaries for each date, near earth object, and close approach:
near_earth_objects = data['near_earth_objects']
for date in near_earth_objects:
    objects = near_earth_objects[date]
    for object in objects:
        close_approach_data = object['close_approach_data']
        for close_approach in close_approach_data:
            print(close_approach['miss_distance'])
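For the sample above, this prints each object's miss_distance dict, e.g. {'astronomical': '0.0269230459', 'lunar': '10.4730648551', 'kilometers': '4027630.320552233', 'miles': '2502653.4316094954'}.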
The code below gives you a table of (date, miss_distance) records for every object on every date:
import json
raw_json = '''
{
"near_earth_objects": {
"2015-09-08": [
{
"close_approach_data": [
{
"miss_distance": {
"astronomical": "0.0269230459",
"lunar": "10.4730648551",
"kilometers": "4027630.320552233",
"miles": "2502653.4316094954"
},
"orbiting_body": "Earth"
}
]
}
]
}
}
'''
if __name__ == "__main__":
    parsed = json.loads(raw_json)
    # assuming this json includes more than one near_earth_object spread across dates
    near_objects = []
    for date, near_objs in parsed['near_earth_objects'].items():
        for obj in near_objs:
            for appr in obj['close_approach_data']:
                o = {
                    'date': date,
                    'miss_distances': appr['miss_distance']
                }
                near_objects.append(o)
    print(near_objects)
output:
[
{'date': '2015-09-08',
'miss_distances': {
'astronomical': '0.0269230459',
'lunar': '10.4730648551',
'kilometers': '4027630.320552233',
'miles': '2502653.4316094954'
}
}
]
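If you want the distances as numbers for further analysis, pandas can flatten the same structure directly; a sketch, assuming pandas is available and parsed is the dict from above:
import pandas as pd

df = pd.json_normalize(
    parsed['near_earth_objects']['2015-09-08'],
    record_path='close_approach_data',
)
print(df['miss_distance.kilometers'].astype(float))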

Add missing fields with null values as per position mentioned in the config file in Python while parsing the JSON file data

I have a config file:
Position,ColumnName
1,TXS_ID
4,TXX_NAME
8,AGE
As per the above positions (1, 4, 8), only 3 columns are available. In between 1 and 4 we don't have positions 2 and 3, and I want to fill those with null values.
As per the above config file, I am trying to parse the data from a JSON file using Python, but I need to define the columns based on the positions mentioned above. When the Python script runs, if "TXS_ID" is available it should pick the data from the JSON file, and since I don't have fields 2 and 3, I want to keep them as null.
Sample output file
TSX_ID,,,TXX_NAME,,,,AGE
10000,,,AAAAAAAAA,,,,40
As per the config file I specify, data should be extracted from the JSON file, and if a position is missing, as in the above example, it should be filled with nulls. Please help me if there is any way to achieve this.
Below is the sample JSON file.
{
"entities": [
{
"id": "XXXXXXXXXXXXXXX",
"data": {
"attributes": {
"TSX_ID": {
"values": [
{
"value": 10000
}
]
},
"TXX_NAME": {
"values": [
{
"value": "AAAAAAAAA"
}
]
},
"AGE": {
"values": [
{
"value": "40"
}
]
}
}
}
}
]
}
Assuming that the config file line 1,TXS_ID has a typo and is actually 1,TSX_ID, this program works with your sample data (see explanations in comments):
import pandas
# read the "config file" into a Series of the "ColumnName"s:
config = pandas.read_csv('config', index_col='Position', squeeze=True)
maxdex = config.index[-1] # get the maximum Position
# fill the Positions missing in the "config file" with empty "ColumnName"s:
config = config.reindex(range(1, maxdex+1), fill_value='')
import json
sample = json.load(open('sample.json'))
# create an empty DataFrame with the desired columns:
output = pandas.DataFrame(columns=config.values)
# now insert the nested JSON data values into the given columns:
for a in config.values:
    if a:  # only if not an empty column name, of course
        output[a] = [av['value'] for e in sample['entities']
                     for av in e['data']['attributes'][a]['values']]
output.to_csv('output.csv', index=False)
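One caveat: the squeeze keyword of read_csv was deprecated in pandas 1.4 and removed in 2.0. On current pandas the same Series can be built with the DataFrame.squeeze method (a sketch):
# pandas >= 2.0: squeeze the single remaining column into a Series
config = pandas.read_csv('config', index_col='Position').squeeze('columns')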
