Converting large nested JSON to CSV - python

I am fairly new to Python and have several nested JSON files I need to convert to CSVs.
The structure of these is the following:
{'Z': {'#SchemaVersion': 9,
       'FD': [{'FDRecord': [{'NewCase': {'#TdrRecVer': 5,
                                         'CaseLabel': '',
                                         'StdHdr': {'DevDateTime': '2000-05-02T10:43:18',
                                                    'ElapsedTime': 0,
                                                    'GUID': '5815D34615C15690936B822714009468',
                                                    'MsecTime': 5012,
                                                    'RecId': 4},
                                         'UniqueCaseId': '5389F346136315497325122714009468'}},
                            {'NewCase': {'#TdrRecVer': 5,
                                         'CaseLabel': '',
                                         'StdHdr': {'DevDateTime': '2000-05-02T10:43:18',
                                                    'ElapsedTime': 0,
                                                    'GUID': '5819D346166610458312622714009468',
                                                    'MsecTime': 9459,
                                                    'RecId': 4},
                                         'UniqueCaseId': '5819F346148627009653284714009468'}},
                            {'AnnotationEvt': {'#EvtName': 'New',
                                               '#TdrRecVer': 1,
                                               'DevEvtCode': 13,
                                               'Payload': '0 0 0 0',
                                               'StdHdr': {'DevDateTime': '2000-05-02T10:43:18',
                                                          'ElapsedTime': 0,
                                                          'GUID': '5899D34616BC1000938B824538219968',
                                                          'MsecTime': 7853,
                                                          'RecId': 8},
                                               'TreatmentSummary': 1,
                                               'XidCode': '0000000B'}},
                            {'TrendRpt': {'#TdrRecVer': 9,
                                          'CntId': 0,
                                          'DevEvtCode': 30,
                                          'StdHdr': {'DevDateTime': '2000-05-02T10:43:18',
                                                     'ElapsedTime': 0,
                                                     'GUID': '5819C34616781004852225698409468',
                                                     'MsecTime': 4052,
                                                     'RecId': 12}, ...
My problem is that most examples online show how to read in a very small JSON file and write it out to a CSV by explicitly listing the keys or field names when creating the CSV. My files are far too large for that approach, some of them being over 40 MB.
I tried following someone else's example from online (below), but did not succeed:
import json
import csv

with open('path-where-json-located') as file:
    data_parse = json.load(file)

data = data_parse['Z']

data_writer = open('path-for-csv', 'w')
csvwriter = csv.writer(data_writer)

count = 0
for i in data:
    if count == 0:
        header = i.keys()
        csvwriter.writerow(header)
        count += 1
    csvwriter.writerow(i.values())

data_writer.close()
When I run this, I get the following error:
AttributeError: 'str' object has no attribute 'keys'
I understand that for some reason it is treating the key I want to pull as a string object, but I do not know how to get around this and correctly parse the data into a CSV.
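For what it's worth, the error happens because iterating over a dictionary yields its keys: for i in data walks the keys of data_parse['Z'] (strings such as '#SchemaVersion' and 'FD'), so i.keys() fails on a string. A minimal sketch of one way to flatten the records instead, assuming the rows you want are the entries inside each FDRecord list and that pandas is an option (the paths mirror the sample above; the file names are placeholders):

import json
import pandas as pd

with open('path-where-json-located') as file:
    data = json.load(file)

# Gather the individual records from every FDRecord list.
records = []
for fd in data['Z']['FD']:
    for rec in fd['FDRecord']:
        # Each rec is a dict with one key ('NewCase', 'AnnotationEvt', 'TrendRpt', ...);
        # keep that record type as a column and flatten the payload.
        for rec_type, payload in rec.items():
            records.append({'RecordType': rec_type, **payload})

# json_normalize turns nested dicts such as StdHdr into dotted column names.
df = pd.json_normalize(records)
df.to_csv('path-for-csv', index=False)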

Related

How to save empty string as "" and None as nothing (not even quotes) in csv using Python

I am trying to select specific columns from a MySQL table into CSV. This table has integers, varchar, NULL and empty strings. I want to differentiate between NULL and empty strings in a specific way.
I am using the mysql.connector module, which gives me a list of tuples as the result of the SQL. A sample row converted into a tuple is:
DB Row: '786a1f32e9e4', 'ABC', 'Abcdefgh', 'ABC2Y1Z', '1.2.3-4-x202203100026-ABC2Y1Z-envr', 'ABC', '', 'active', 1, 0, 2022-05-18 00:00:00, NULL, NULL
Tuple: '786a1f32e9e4', 'ABC', 'Abcdefgh', 'ABC2Y1Z', '1.2.3-4-x202203100026-ABC2Y1Z-envr', 'ABC', '', 'active', 1, 0, datetime.datetime(2022, 5, 18, 0, 0), None, None
Now I want to write this entire list of tuples to a csv file and ensure that empty string is written in csv as "" and None is written in csv as nothing, not even quotes.
So the result I want is:
786a1f32e9e4|ABC|Abcdefgh|ABC2Y1Z|1.2.3-4-x202203100026-ABC2Y1Z-envr|ABC|""|active|1|0|2022-05-18 00:00:00||
OR
"786a1f32e9e4"|"ABC"|"Abcdefgh"|"ABC2Y1Z"|"1.2.3-4-x202203100026-ABC2Y1Z-envr"|"ABC"|""|"active"|"1"|"0"|"2022-05-18 00:00:00"||
I tried using csv:
import csv
import datetime

myresult = [('786a1f32e9e4', 'ABC', 'Abcdefgh', 'ABC2Y1Z', '1.2.3-4-x202203100026-ABC2Y1Z-envr', 'ABC', '', 'active', 1, 0, datetime.datetime(2022, 5, 18, 0, 0), None, None)]
with open('so_ex_1.csv', 'w') as f:
    write = csv.writer(f)
    write.writerows(myresult)
pandas:
import pandas as pd
import datetime
myresult = [('786a1f32e9e4', 'ABC', 'Abcdefgh', 'ABC2Y1Z', '1.2.3-4-x202203100026-ABC2Y1Z-envr', 'ABC', '', 'active', 1, 0, datetime.datetime(2022, 5, 18, 0, 0), None, None)]
df = pd.DataFrame(myresult)
df.to_csv('so_ex_2.csv')
numpy:
import numpy as np
import datetime
myresult = [('786a1f32e9e4', 'ABC', 'Abcdefgh', 'ABC2Y1Z', '1.2.3-4-x202203100026-ABC2Y1Z-envr', 'ABC', '', 'active', 1, 0, datetime.datetime(2022, 5, 18, 0, 0), None, None)]
np.savetxt('so_ex_3.csv', myresult, delimiter = "|", fmt = '% s')
But I can't find the right parameters to get the desired result.
Closest I could get was using pandas and csv modules with code:
import pandas as pd
import datetime
import csv
myresult = [('786a1f32e9e4', 'ABC', 'Abcdefgh', 'ABC2Y1Z', '1.2.3-4-x202203100026-ABC2Y1Z-envr', 'ABC', '', 'active', 1, 0, datetime.datetime(2022, 5, 18, 0, 0), None, None)]
df = pd.DataFrame(myresult)
df.to_csv(r'so_ex_4.csv', index=False, sep='|', quoting=csv.QUOTE_ALL, na_rep='None') # Option 5
But even here I am unable to find a value for the na_rep option that makes it print nothing for None. So I end up getting the following in the CSV:
"786a1f32e9e4"|"ABC"|"Abcdefgh"|"ABC2Y1Z"|"1.2.3-4-x202203100026-ABC2Y1Z-envr"|"ABC"|""|"active"|"1"|"0"|"2022-05-18"|"None"|"None"
Is there any way to get the result in the format I want using Python without having to do row-level operations? The reason I want to avoid row-level operations is that the SQL can pull a large amount of data, so row-level operations can be time consuming.
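For reference, one way to get exactly the second desired format is to bypass the csv/pandas quoting machinery and format each field yourself. This is only a sketch: it does touch every value (via a generator), which the question hoped to avoid, the output file name is made up, and it assumes the values themselves contain no '|' or '"' characters.

import datetime

myresult = [('786a1f32e9e4', 'ABC', '', 'active', 1, 0,
             datetime.datetime(2022, 5, 18, 0, 0), None, None)]

def fmt(value):
    # None -> completely empty field; everything else -> quoted text
    return '' if value is None else '"{}"'.format(value)

with open('so_ex_manual.csv', 'w', newline='') as f:
    for row in myresult:
        f.write('|'.join(fmt(v) for v in row) + '\n')

# Produces: "786a1f32e9e4"|"ABC"|""|"active"|"1"|"0"|"2022-05-18 00:00:00"||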

How to format query results as CSV?

My goal: Automate the operation of executing a query and output the results into a csv.
I have been successful in obtaining the query results using Python (this is my first project ever in Python). I am trying to format these results as a CSV but am completely lost. It's basically just creating 2 massive rows with all the data not parsed out. The .txt and .csv results are attached (I obtained these by simply calling the query and entering "file name > results.txt" or "file name > results.csv").
txt results: {'data': {'get_result': {'job_id': None, 'result_id': '72a17fd2-e63c-4732-805a-ad6a7b980a99', '__typename': 'get_result_response'}}} {'data': {'query_results': [{'id': '72a17fd2-e63c-4732-805a-ad6a7b980a99', 'job_id': '05eb2527-2ca0-4dd1-b6da-96fb5aa2e67c', 'error': None, 'runtime': 157, 'generated_at': '2022-04-07T20:14:36.693419+00:00', 'columns': ['project_name', 'leaderboard_date', 'volume_30day', 'transactions_30day', 'floor_price', 'median_price', 'unique_holders', 'rank', 'custom_sort_order'], '__typename': 'query_results'}], 'get_result_by_result_id': [{'data': {'custom_sort_order': 'AA', 'floor_price': 0.375, 'leaderboard_date': '2022-04-07', 'median_price': 343.4, 'project_name': 'Terraforms by Mathcastles', 'rank': 1, 'transactions_30day': 2774, 'unique_holders': 2179, 'volume_30day': 744611.6252}, '__typename': 'get_result_template'}, {'data': {'custom_sort_order': 'AB', 'floor_price': 4.69471, 'leaderboard_date': '2022-04-07', 'median_price': 6.5, 'project_name': 'Meebits', 'rank': 2, 'transactions_30day': 4153, 'unique_holders': 6200, 'volume_30day': 163520.7377371168}, '__typename': 'get_result_template'}, etc. (repeats for 100s of rows)..
Your results text string actually contains two dictionaries separated by a space character.
Here's a formatted version of what's in each of them:
dict1 = {'data': {'get_result': {'job_id': None,
                                 'result_id': '72a17fd2-e63c-4732-805a-ad6a7b980a99',
                                 '__typename': 'get_result_response'}}}

dict2 = {'data': {'query_results': [{'id': '72a17fd2-e63c-4732-805a-ad6a7b980a99',
                                     'job_id': '05eb2527-2ca0-4dd1-b6da-96fb5aa2e67c',
                                     'error': None,
                                     'runtime': 157,
                                     'generated_at': '2022-04-07T20:14:36.693419+00:00',
                                     'columns': ['project_name',
                                                 'leaderboard_date',
                                                 'volume_30day',
                                                 'transactions_30day',
                                                 'floor_price',
                                                 'median_price',
                                                 'unique_holders',
                                                 'rank',
                                                 'custom_sort_order'],
                                     '__typename': 'query_results'}],
                  'get_result_by_result_id': [{'data': {'custom_sort_order': 'AA',
                                                        'floor_price': 0.375,
                                                        'leaderboard_date': '2022-04-07',
                                                        'median_price': 343.4,
                                                        'project_name': 'Terraforms by Mathcastles',
                                                        'rank': 1,
                                                        'transactions_30day': 2774,
                                                        'unique_holders': 2179,
                                                        'volume_30day': 744611.6252},
                                               '__typename': 'get_result_template'},
                                              {'data': {'custom_sort_order': 'AB',
                                                        'floor_price': 4.69471,
                                                        'leaderboard_date': '2022-04-07',
                                                        'median_price': 6.5,
                                                        'project_name': 'Meebits',
                                                        'rank': 2,
                                                        'transactions_30day': 4153,
                                                        'unique_holders': 6200,
                                                        'volume_30day': 163520.7377371168},
                                               '__typename': 'get_result_template'},
                                             ]}}
(BTW, I formatted them using the pprint module. This is often a good first step when dealing with these kinds of problems, so you know what you're dealing with.)
Ignoring the first one completely and all but the repetitive data in the second — which is what I assume is all you really want — you could create a CSV file from the nested dictionary values in the dict2['data']['get_result_by_result_id'] list. Here's how that could be done using the csv.DictWriter class:
import csv
from pprint import pprint  # If needed.

output_filepath = 'query_results.csv'

# Determine CSV fieldnames based on keys of first dictionary.
fieldnames = dict2['data']['get_result_by_result_id'][0]['data'].keys()

with open(output_filepath, 'w', newline='') as outp:
    writer = csv.DictWriter(outp, delimiter=',', fieldnames=fieldnames)
    writer.writeheader()  # Optional.
    for result in dict2['data']['get_result_by_result_id']:
        # pprint(result['data'], sort_dicts=False)
        writer.writerow(result['data'])

print('fini')
Using the test data, here's the contents of the 'query_results.csv' file it created:
custom_sort_order,floor_price,leaderboard_date,median_price,project_name,rank,transactions_30day,unique_holders,volume_30day
AA,0.375,2022-04-07,343.4,Terraforms by Mathcastles,1,2774,2179,744611.6252
AB,4.69471,2022-04-07,6.5,Meebits,2,4153,6200,163520.7377371168
It appears you have the data in a Python dictionary. The Google Sheet says access denied, so I can't see the whole data.
But essentially you want to convert the dictionary data to a CSV file.
At the bare bones, you can use code like this to get where you need to. For your example, you'll need to drill down to where the rows actually are.
import csv

new_path = open("mytest.csv", "w")
file_dictionary = {"oliva": 199, "james": 145, "potter": 187}
z = csv.writer(new_path)
for new_k, new_v in file_dictionary.items():
    z.writerow([new_k, new_v])
new_path.close()
This guide should help you out.
https://pythonguides.com/python-dictionary-to-csv/
If I understand your question right, you should construct a DataFrame from your results and then save the DataFrame in .csv format. The pandas library is useful and easy to use.
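A minimal sketch of that suggestion, reusing the dict2 structure shown in the first answer (the output file name is a placeholder):

import pandas as pd

# One row per repeated record; the nested 'data' dicts become the columns.
rows = [item['data'] for item in dict2['data']['get_result_by_result_id']]
df = pd.DataFrame(rows)
df.to_csv('query_results_pandas.csv', index=False)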

pandas to_json() redundant backslashes

I have a '.csv' file containing data about movies and I'm trying to reformat it as a JSON file to use it in MongoDB. So I loaded that CSV file into a pandas DataFrame and then used the to_json method to write it back.
Here is what one row of the DataFrame looks like:
In [43]: result.iloc[0]
Out[43]:
title Avatar
release_date 2009
cast [{"cast_id": 242, "character": "Jake Sully", "...
crew [{"credit_id": "52fe48009251416c750aca23", "de...
Name: 0, dtype: object
But when pandas writes it back, it comes out like this:
{ "title":"Avatar",
"release_date":"2009",
"cast":"[{\"cast_id\": 242, \"character\": \"Jake Sully\", \"credit_id\": \"5602a8a7c3a3685532001c9a\", \"gender\": 2,...]",
"crew":"[{\"credit_id\": \"52fe48009251416c750aca23\", \"department\": \"Editing\", \"gender\": 0, \"id\": 1721,...]"
}
As you can see, 'cast' and 'crew' are lists, and they have tons of redundant backslashes. These backslashes appear in the MongoDB collections and make it impossible to extract data from these two fields.
How can I solve this problem other than replacing \" with "?
P.S.1: this is how I save the DataFrame as JSON:
result.to_json('result.json', orient='records', lines=True)
UPDATE 1:
Apparently pandas is doing just fine and the problem is caused by the original CSV files.
Here is what they look like:
movie_id,title,cast,crew
19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""credit_id"": ""5602a8a7c3a3685532001c9a"", ""gender"": 2, ""id"": 65731, ""name"": ""Sam Worthington"", ""order"": 0}, {""cast_id"": 3, ""character"": ""Neytiri"", ""credit_id"": ""52fe48009251416c750ac9cb"", ""gender"": 1, ""id"": 8691, ""name"": ""Zoe Saldana"", ""order"": 1}, {""cast_id"": 25, ""character"": ""Dr. Grace Augustine"", ""credit_id"": ""52fe48009251416c750aca39"", ""gender"": 1, ""id"": 10205, ""name"": ""Sigourney Weaver"", ""order"": 2}, {""cast_id"": 4, ""character"": ""Col. Quaritch"", ""credit_id"": ""52fe48009251416c750ac9cf"", ""gender"": 2, ""id"": 32747, ""name"": ""Stephen Lang"", ""order"": 3},...]"
I tried to replace "" with " (and I really wanted to avoid this hack):
sed -i 's/\"\"/\"/g'
And of course it caused problems in some lines of data when reading it as csv again:
ParserError: Error tokenizing data. C error: Expected 1501 fields in line 4, saw 1513
So we can conclude it's not safe to do such a blind replacement. Any ideas?
P.S.2: I'm using kaggle's 5000 movie dataset: https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset
I had the same issue: the solution is in 3 steps (the 4th is optional).
1- DataFrame from csv, or in my case from xlsx:
excel_df = pd.read_excel(dataset, sheet_name=my_sheet_name)
2- Convert to JSON (use date_format='iso' if you have dates in your data):
json_str = excel_df.to_json(orient='records', date_format='iso')
3- The most important thing: json.loads. This is it!
parsed = json.loads(json_str)
4- (Optional) You can write out or send the JSON file, for example writing it locally:
with open(out, 'w') as json_file:
    json_file.write(json.dumps({"data": parsed}, indent=4))
More info:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html
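Putting those steps together as one runnable sketch (the file and sheet names here are placeholders, not from the original answer):

import json
import pandas as pd

excel_df = pd.read_excel('dataset.xlsx', sheet_name='Sheet1')      # step 1
json_str = excel_df.to_json(orient='records', date_format='iso')   # step 2
parsed = json.loads(json_str)                                       # step 3

with open('result.json', 'w') as json_file:                         # step 4
    json_file.write(json.dumps({"data": parsed}, indent=4))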
Pandas is escaping the " character because it thinks the values in the json columns are text. To get the desired behaviour, simply parse the values in the json column as json.
Let the file data.csv have the following content (with quotes escaped):
# data.csv
movie_id,title,cast
19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""credit_id"": ""5602a8a7c3a3685532001c9a"", ""gender"": 2, ""id"": 65731, ""name"": ""Sam Worthington"", ""order"": 0}, {""cast_id"": 3, ""character"": ""Neytiri"", ""credit_id"": ""52fe48009251416c750ac9cb"", ""gender"": 1, ""id"": 8691, ""name"": ""Zoe Saldana"", ""order"": 1}, {""cast_id"": 25, ""character"": ""Dr. Grace Augustine"", ""credit_id"": ""52fe48009251416c750aca39"", ""gender"": 1, ""id"": 10205, ""name"": ""Sigourney Weaver"", ""order"": 2}, {""cast_id"": 4, ""character"": ""Col. Quaritch"", ""credit_id"": ""52fe48009251416c750ac9cf"", ""gender"": 2, ""id"": 32747, ""name"": ""Stephen Lang"", ""order"": 3}]"
Read this into a DataFrame, then apply the json.loads function and write it out to a file as JSON:
import json
import pandas as pd

df = pd.read_csv('data.csv')
df.cast = df.cast.apply(json.loads)
df.to_json('data.json', orient='records', lines=True)
The output is a properly formatted json (extra newlines added by me)
# data.json
{"movie_id":19995,
"title":"Avatar",
"cast":[{"cast_id":242,"character":"Jake Sully","credit_id":"5602a8a7c3a3685532001c9a","gender":2,"id":65731,"name":"Sam Worthington","order":0},
{"cast_id":3,"character":"Neytiri","credit_id":"52fe48009251416c750ac9cb","gender":1,"id":8691,"name":"Zoe Saldana","order":1},
{"cast_id":25,"character":"Dr. Grace Augustine","credit_id":"52fe48009251416c750aca39","gender":1,"id":10205,"name":"Sigourney Weaver","order":2},
{"cast_id":4,"character":"Col. Quaritch","credit_id":"52fe48009251416c750ac9cf","gender":2,"id":32747,"name":"Stephen Lang","order":3}]
}
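(In the question's actual data there is also a crew column; presumably the same treatment applies to it, e.g. df.crew = df.crew.apply(json.loads), before calling to_json.)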

How to convert SQLAlchemy data into JSON format?

I am working with SQLAlchemy and want to fetch the data from the database and convert it into JSON format.
I have the code below:
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base

db_string = "postgres://user:pwd#10.**.**.***:####/demo_db"
Base = declarative_base()
db = create_engine(db_string)

record = db.execute("SELECT name, columndata, gridname, ownerid, issystem, ispublic, isactive FROM col.layout WHERE (ispublic=1 AND isactive=1) OR (isactive=1 AND ispublic=1 AND ownerid=ownerid);")

result = []
for row in record:
    result.append(row)
print(result)
Data is coming in this format:
[('layout-1', {'theme': 'blue', 'sorting': 'price_down', 'filtering': ['Sub Strategye', 'PM Strategy']}, 'RealTimeGrid', 1, 0, 1, 1), ('layout-2', {'theme': 'orange', 'sorting': 'price_up', 'filtering': ['FX Rate', 'Start Price']}, 'RealBalancing Grid', 2, 0, 1, 1), ('layout-3', {'theme': 'red', 'sorting': 'mv_price', 'filtering': ['Sub Strategye', 'PM Strategy']}, 'RT', 3, 0, 1, 1)]
But I am facing a lot of issues converting the above result into JSON format. Please suggest.
Your data is basically a list of tuples.
For example, one of the tuples looks like this:
('layout-3',
 {'filtering': ['Sub Strategye', 'PM Strategy'],
  'sorting': 'mv_price',
  'theme': 'red'},
 'RT',
 3,
 0,
 1,
 1)
If you want to convert the whole data as-is to JSON, you can use the json module's dumps function:
import json
jsn_data = json.dumps(data)
Your list of tuples is converted to JSON:
[["layout-1", {"theme": "blue", "sorting": "price_down", "filtering": ["Sub Strategye", "PM Strategy"]}, "RealTimeGrid", 1, 0, 1, 1], ["layout-2", {"theme": "orange", "sorting": "price_up", "filtering": ["FX Rate", "Start Price"]}, "RealBalancing Grid", 2, 0, 1, 1], ["layout-3", {"theme": "red", "sorting": "mv_price", "filtering": ["Sub Strategye", "PM Strategy"]}, "RT", 3, 0, 1, 1]]
But if you need the JSON formatted as key-value pairs, you first need to convert the result into Python dictionaries and then use json.dumps(dictionary_Var).
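A minimal sketch of that conversion, pairing each row with the column names from the SELECT in the question (the column list is simply copied from that query and is otherwise an assumption):

import json

columns = ['name', 'columndata', 'gridname', 'ownerid', 'issystem', 'ispublic', 'isactive']

# One dictionary per row, keyed by column name.
rows_as_dicts = [dict(zip(columns, row)) for row in result]
json_data = json.dumps(rows_as_dicts)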
What you want to accomplish is called "serialization".
You can follow Sudhanshu Patel's answer if you just want to dump json into response.
However, if you intend to produce a more sophisticated application, consider using a serialization library. You'll be able to load data from a request into the db, check that the input data is in the right format, and send responses in a standardised format.
Check these libraries:
Marshmallow
Python's own Pickle
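As a rough illustration of the Marshmallow route (the schema and field types below are guessed from the query in the question, not something the question or this answer specifies):

from marshmallow import Schema, fields

class LayoutSchema(Schema):
    name = fields.Str()
    columndata = fields.Dict()
    gridname = fields.Str()
    ownerid = fields.Int()
    issystem = fields.Int()
    ispublic = fields.Int()
    isactive = fields.Int()

# rows_as_dicts as built in the sketch above; dumps() returns a JSON string (marshmallow 3).
json_data = LayoutSchema(many=True).dumps(rows_as_dicts)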

Getting total word count from a string series in a Pandas Data Frame

I have a 2-column Pandas data frame composed of tweets: the second column is the tweets themselves. I want to get a word count of all the tweets together.
The DataFrame looks like so:
RT #PaulHaleAndMom: Four Hours After #Piedmont...
RT #NatPoliceAssoc: Body camera video shows At...
RT #XLNB: When Spanish Drake and Jamaican Drak...
😫😫 I almost cried this morning. My babies are ...
#SebastianDanzig Hey Bassy are tickets and VIP...
The following is giving me the count by row.
wc_DF = tweets_DF['text'].apply(lambda x: Counter(x.lower().split()))
i.e.
{'rt': 1, '#paulhaleandmom:': 1, 'four': 1, 'h...
What would a good vectorized implementation in Pandas be for this?
Another solution, if you want to stay in pandas, assuming your string Series is referenced as tweets_DF['text']:
words = tweets_DF['text'].str.split()
word_counts = pd.value_counts(words.apply(pd.Series).stack())
words will be a Series of lists, and you can convert it to a DataFrame by running apply over the Series with the Series constructor. After that you can convert it back into a (multi-indexed) Series containing each word as its value using stack(). Finally you can use value_counts() to count the observations, returning a Series indexed on the words with the counts as the values.
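A possibly simpler variant of the same idea, if your pandas version has Series.explode (0.25+), which avoids building the intermediate DataFrame:

words = tweets_DF['text'].str.split()
word_counts = words.explode().value_counts()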
Use sum and Counter
import pandas as pd
from collections import Counter

c = ["RT #PaulHaleAndMom: Four Hours After #Piedmont...", "RT #NatPoliceAssoc: Body camera video shows At...", "RT #XLNB: When Spanish Drake and Jamaican Drak..."]
Counter(pd.Series(c).str.split().sum())
Counter({'RT': 3,
         '#PaulHaleAndMom:': 1,
         'Four': 1,
         'Hours': 1,
         'After': 1,
         '#Piedmont...': 1,
         '#NatPoliceAssoc:': 1,
         'Body': 1,
         'camera': 1,
         'video': 1,
         'shows': 1,
         'At...': 1,
         '#XLNB:': 1,
         'When': 1,
         'Spanish': 1,
         'Drake': 1,
         'and': 1,
         'Jamaican': 1,
         'Drak...': 1})
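Applied to the question's DataFrame, that would presumably be Counter(tweets_DF['text'].str.split().sum()). Note that summing a Series of lists concatenates them one by one, so for a very large number of tweets the explode-based variant shown earlier will likely scale better.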
