Changing complex JSON files to .csv - python

I've downloaded a JSON file with lots of data about football players and I want to get at the data in a .csv. I'm a newb at most of this!
You can find the raw file here: https://raw.githubusercontent.com/llimllib/fantasypl_stats/8ba3e796fc3e73c43921da44d4344c08ce1d7031/data/players.1440000356.json
In the past I've used this code to export some of the data into a .csv using some python code (I think!) in command prompt:
import csv
import json

json_data = open("file.json")
data = json.load(json_data)
f = csv.writer(open("fix_hists.csv","wb+"))
arr = []
for i in data:
    fh = data[i]["fixture_history"]
    array = fh["all"]
    for j in array:
        try:
            j.insert(0,str(data[i]["first_name"]))
        except:
            j.insert(0,'error')
        try:
            j.insert(1,data[i]["web_name"])
        except:
            j.insert(1,'error')
        try:
            f.writerow(j)
        except:
            f.writerow(['error','error'])
json_data.close()
Sadly, when I do this now in the command prompt, I get the following error:
Traceback (most recent call last):
  File "fix_hist.py", line 12, in <module>
    fh = data[i]["fixture_history"]
TypeError: list indices must be integers, not str
Can this be fixed, or is there another way I can grab some of the data and convert it to .csv? Specifically the 'fixture_history', and then 'first_name', 'type_name', etc.

I'd advise using pandas.
Pandas has a function for parsing JSON files: pd.read_json().
Docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
This will read the JSON file directly into a DataFrame.
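For example, a minimal sketch (assuming the players file is a JSON object keyed by player id, as in the linked file, and that columns such as first_name, web_name and type_name exist):

import pandas as pd

# one row per player: orient="index" treats the top-level keys (player ids) as the row index
df = pd.read_json("players.json", orient="index")

# pick a few columns and dump them to CSV
df[["first_name", "web_name", "type_name"]].to_csv("players.csv", index=False)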

There's an improper tab on the for loop on line 11. If you update your code to look like the following, it should run without errors:
import csv
import json

json_data = open("players.json")
data = json.load(json_data)
f = csv.writer(open("fix_hists.csv","wb+"))
arr = []
for i in data:
    fh = data[i]["fixture_history"]
    array = fh["all"]
    for j in array:
        try:
            j.insert(0,str(data[i]["first_name"]))
        except:
            j.insert(0,'error')
        try:
            j.insert(1,data[i]["web_name"])
        except:
            j.insert(1,'error')
        try:
            f.writerow(j)
        except:
            f.writerow(['error','error'])
json_data.close()
Make sure that you've named the JSON file players.json to match the name on line 4. Also, make sure that the JSON file and this Python file are in the same directory. You can run the Python file in a development environment like PyCharm, or you can cd to the directory in a terminal/command prompt window and run it with python fileName.py. It will create a CSV file in that directory called fix_hists.csv.

Related

Pyarrow/Parquet - Cast all null columns to string during batch processing

There is a problem with my code that I have not been able to solve for a while now.
I'm trying to convert a tar.gz-compressed CSV file to Parquet. The file itself, when uncompressed, is about 700 MB. The processing is done on a memory-restricted system, so I have to process the file in batches.
I figured out how to read the tar.gz as a stream, extract the file I need and use pyarrow's open_csv() to read batches. From here, I want to save the data to a Parquet file by writing in batches.
This is where the problem appears. The file has lots of columns without any values, but once in a while a single value shows up around row 500,000 or so, so pyarrow does not infer the dtype properly. Most of the columns therefore end up with dtype null. My idea is to modify the schema and cast these columns to string, so any value is valid. Modifying the schema works fine, but when I run the code, I get this error:
Traceback (most recent call last):
  File "b:\snippets\tar_converter.py", line 38, in <module>
    batch = reader.read_next_batch()
  File "pyarrow\ipc.pxi", line 682, in pyarrow.lib.RecordBatchReader.read_next_batch
  File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: In CSV column #49: CSV conversion error to null: invalid value '0.0000'
Line 38 is this one:
batch = reader.read_next_batch()
Does anyone have any idea how to enforce the schema on the batches so the conversion succeeds?
Here is my code.
import io
import os
import tarfile
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.csv as csv
import logging

srcs = list()
path = "C:\\data"
for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith("tar.gz"):
            srcs.append(os.path.join(root, name))

for source_file_name in srcs:
    file_name: str = source_file_name.replace(".tar.gz", "")
    target_file_name: str = source_file_name.replace(".tar.gz", ".parquet")
    clean_file_name: str = os.path.basename(source_file_name.replace(".tar.gz", ""))
    # download CSV file, preserving folder structure
    logging.info(f"Processing '{source_file_name}'.")
    with io.open(source_file_name, "rb") as file_obj_in:
        # unpack all files to temp_path
        file_obj_in.seek(0)
        with tarfile.open(fileobj=file_obj_in, mode="r") as tf:
            file_obj = tf.extractfile(f"{clean_file_name}.csv")
            file_obj.seek(0)
            reader = csv.open_csv(file_obj, read_options=csv.ReadOptions(block_size=25*1024*1024))
            schema = reader.schema
            null_cols = list()
            for index, entry in enumerate(schema.types):
                if entry.equals(pa.null()):
                    schema = schema.set(index, schema.field(index).with_type(pa.string()))
                    null_cols.append(index)
            with pq.ParquetWriter(target_file_name, schema) as writer:
                while True:
                    try:
                        batch = reader.read_next_batch()
                        table = pa.Table.from_batches(batches=[batch]).cast(target_schema=schema)
                        batch = table.to_batches()[0]
                        writer.write_batch(batch)
                    except StopIteration:
                        break
Also, I could leave out this part:
batch = reader.read_next_batch()
table = pa.Table.from_batches(batches=[batch]).cast(target_schema=schema)
batch = table.to_batches()[0]
But then the error is like this (shortened), showing that the schema change works at least.
Traceback (most recent call last):
  File "b:\snippets\tar_converter.py", line 39, in <module>
    writer.write_batch(batch)
  File "C:\Users\me\AppData\Roaming\Python\Python39\site-packages\pyarrow\parquet\__init__.py", line 981, in write_batch
    self.write_table(table, row_group_size)
  File "C:\Users\me\AppData\Roaming\Python\Python39\site-packages\pyarrow\parquet\__init__.py", line 1004, in write_table
    raise ValueError(msg)
ValueError: Table schema does not match schema used to create file:
table:
ACCOUNT_NAME: string
BOOK_VALUE: double
ESTIMATED_TO_REALISE: double
VAT_PAYABLE_ID: null
VAT_RECEIVABLE_ID: null
MONTHLY_AMOUNT_EFFECTIVE_DATE: null
vs.
file:
ACCOUNT_NAME: string
BOOK_VALUE: double
ESTIMATED_TO_REALISE: double
VAT_PAYABLE_ID: string
VAT_RECEIVABLE_ID: string
MONTHLY_AMOUNT_EFFECTIVE_DATE: string
Thank you!
So I think I figured it out; I wanted to post it for those who have similar issues.
Also, thanks to all who had a look and helped!
I did a workaround to solve this by reading the file twice.
In the first pass I only read the first batch from the stream to get the schema. Then I converted the null columns to string and closed the stream (this is important if you reuse the same variable name). After this you read the file again, but now pass the modified schema via ConvertOptions to the reader. Thanks to #0x26res, whose comment gave me the idea.
# get initial schema by reading one batch
initial_reader = csv.open_csv(file_obj, read_options=csv.ReadOptions(block_size=16*1024*1024))
schema = initial_reader.schema
for index, entry in enumerate(schema.types):
    if entry.equals(pa.null()):
        schema = schema.set(index, schema.field(index).with_type(pa.string()))

# now use the modified schema for the reader
# must close the old reader first, otherwise wrong data is loaded
file_obj.close()
file_obj = tf.extractfile(f"{file_name}.csv")
file_obj.seek(0)
reader = csv.open_csv(file_obj,
                      read_options=csv.ReadOptions(block_size=16*1024*1024),
                      convert_options=csv.ConvertOptions(column_types=schema))
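The rest of the original loop can then stay essentially unchanged; a sketch of the write side under this setup (pq, target_file_name and schema as in the question's snippet; the batches from the new reader already match the modified schema, so no per-batch cast is needed):

with pq.ParquetWriter(target_file_name, schema) as writer:
    while True:
        try:
            # formerly-null columns now arrive as string, matching the writer schema
            writer.write_batch(reader.read_next_batch())
        except StopIteration:
            break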

Converting cloud-init logs to json using a script

I am trying to convert the cloud-init logs to JSON so that Filebeat can pick them up and send them to Kibana. I want to do this using a shell script or a Python script. Is there any script that converts such logs to JSON?
My Python script is below:
import json
import subprocess

filename = "/home/umesh/Downloads/scripts/cloud-init.log"

def convert_to_json_log(line):
    """ convert each line to json format """
    log = {}
    log['msg'] = line
    log['logger-name'] = 'cloud-init'
    log['ServiceName'] = 'Contentprocessing'
    return json.dumps(log)

def log_as_json(filename):
    f = subprocess.Popen(['cat','-F',filename],
                         stdout=subprocess.PIPE,stderr=subprocess.PIPE)
    while True:
        line = f.stdout.readline()
        log = convert_to_json_log(line)
        print log
        with open("/home/umesh/Downloads/outputs/cloud-init-json.log", 'a') as new:
            new.write(log + '\n')

log_as_json(filename)
The script produces a file in JSON format, but the msg field is an empty string. I want each line of the log to become the message string.
Firstly, try reading the raw log file using Python's built-in functions rather than running OS commands via subprocess, because:
- It will be more portable (it works across OSes)
- It is faster and less prone to errors
Re-writing your log_as_json function as follows worked for me:
inputfile = "cloud-init.log"
outputfile = "cloud-init-json.log"

def log_as_json(filename):
    # Open cloud-init log file for reading
    with open(inputfile, 'r') as log:
        # Open the output file to append json entries
        with open(outputfile, 'a') as jsonlog:
            # Read line by line
            for line in log.readlines():
                # Convert to json and write to file
                jsonlog.write(convert_to_json_log(line) + "\n")
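For completeness, this still relies on the converter from the question (convert_to_json_log) being defined in the same script, and the function is then invoked the same way as before:

# hypothetical usage, assuming convert_to_json_log is defined above
log_as_json(inputfile)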
After taking some time to prepare the customised script, I finally came up with the script below. It might be helpful to others.
import json

def convert_to_json_log(line):
    """ convert each line to json format """
    log = {}
    log['msg'] = json.dumps(line)
    log['logger-name'] = 'cloud-init'
    log['serviceName'] = 'content-processing'
    return json.dumps(log)

# Open the file with read only permit
f = open('/var/log/cloud-init.log', "r")
# use readlines to read all lines in the file
# The variable "lines" is a list containing all lines in the file
lines = f.readlines()
# close the file after reading the lines.
f.close()

jsonData = ''
for line in lines:
    jsonLine = convert_to_json_log(line)
    jsonData = jsonData + "\n" + jsonLine

with open("/var/log/cloud-init/cloud-init-json.log", 'w') as new:
    new.write(jsonData)

Getting error while creating multiple file in python

I'm creating two files using a Python script: the first file is JSON and the second one is an HTML file. My code below creates the JSON file, but I'm getting an error while creating the HTML file. Could someone help me resolve the issue? I'm new to Python scripting, so it would be really appreciated if you could suggest a solution.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import json

JsonResponse = '[{"status": "active", "due_date": null, "group": "later", "task_id": 73286}]'

def create(JsonResponse):
    print JsonResponse
    print 'creating new file'
    try:
        jsonFile = 'testFile.json'
        file = open(jsonFile, 'w')
        file.write(JsonResponse)
        file.close()
        with open('testFile.json') as json_data:
            infoFromJson = json.load(json_data)
            print infoFromJson
        htmlReportFile = 'Report.html'
        htmlfile = open(htmlReportFile, 'w')
        htmlfile.write(infoFromJson)
        htmlfile.close()
    except:
        print 'error occured'
        sys.exit(0)

create(JsonResponse)
I used the online Python editor below to execute my code:
https://www.tutorialspoint.com/execute_python_online.php
infoFromJson = json.load(json_data)
Here, json.load() expects valid JSON data in json_data. But the json_data you provided is not valid JSON; it's a simple string (Hello World!). So you are getting the error:
ValueError: No JSON object could be decoded
Update:
In your code you should get this error:
TypeError: expected a character buffer object
That's because the content you are writing to the file needs to be a string, but instead you have a list of dictionaries.
There are two ways to solve this. Replace the line:
htmlfile.write(infoFromJson)
with either this:
htmlfile.write(str(infoFromJson))
to make infoFromJson a string,
or use the dump utility of the json module:
json.dump(infoFromJson, json_data)
If you delete the try...except statement, you will see the error below:
Traceback (most recent call last):
  File "/Volumes/Ithink/wechatProjects/django_wx_joyme/app/test.py", line 26, in <module>
    create(JsonResponse)
  File "/Volumes/Ithink/wechatProjects/django_wx_joyme/app/test.py", line 22, in create
    htmlfile.write(infoFromJson)
TypeError: expected a string or other character buffer object
The error occurs because htmlfile.write needs a string, but infoFromJson is a list.
So changing htmlfile.write(infoFromJson) to htmlfile.write(str(infoFromJson)) will avoid the error.

json2html python lib is not working

I'm trying to create a new JSON file from my custom JSON input, convert the JSON to HTML format, and save it into an .html file. But I'm getting an error while generating the JSON and HTML files. Please find my code below; I'm not sure what I'm doing wrong here:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from json2html import *
import sys
import json

JsonResponse = {
    "name": "json2html",
    "description": "Converts JSON to HTML tabular representation"
}

def create(JsonResponse):
    #print JsonResponse
    print 'creating new file'
    try:
        jsonFile = 'testFile.json'
        file = open(jsonFile, 'w')
        file.write(JsonResponse)
        file.close()
        with open('testFile.json') as json_data:
            infoFromJson = json.load(json_data)
        scanOutput = json2html.convert(json=infoFromJson)
        print scanOutput
        htmlReportFile = 'Report.html'
        htmlfile = open(htmlReportFile, 'w')
        htmlfile.write(str(scanOutput))
        htmlfile.close()
    except:
        print 'error occured'
        sys.exit(0)

create(JsonResponse)
Can someone please help me resolve this issue.
Thanks!
First, get rid of your try / except. Using except without a type expression is almost always a bad idea. In this particular case, it prevented you from knowing what was actually wrong.
After we remove the bare except:, we get this useful error message:
Traceback (most recent call last):
  File "x.py", line 31, in <module>
    create(JsonResponse)
  File "x.py", line 18, in create
    file.write(JsonResponse)
TypeError: expected a character buffer object
Sure enough, JsonResponse isn't a character string (str), but a dictionary. This is easy enough to fix:
file.write(json.dumps(JsonResponse))
Here is a create() subroutine with some other fixes I recommend. Note that dumping the JSON and then immediately loading it back is usually silly; I left it in on the assumption that your actual program does something slightly different.
def create(JsonResponse):
    jsonFile = 'testFile.json'
    with open(jsonFile, 'w') as json_data:
        json.dump(JsonResponse, json_data)

    with open('testFile.json') as json_data:
        infoFromJson = json.load(json_data)
    scanOutput = json2html.convert(json=infoFromJson)

    htmlReportFile = 'Report.html'
    with open(htmlReportFile, 'w') as htmlfile:
        htmlfile.write(str(scanOutput))
The error occurs while writing to the JSON file. Instead of file.write(JsonResponse) you should use json.dump(JsonResponse, file). It will work.
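As a minimal illustration of that suggestion, the write step in the question's create() would become something like this (same variable names as the question):

jsonFile = 'testFile.json'
with open(jsonFile, 'w') as file:
    json.dump(JsonResponse, file)  # serialize the dict straight into the file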

Using Python to Write JSON data in CSV file, and repeat multiple times

Trying to accomplish the following:
1. Every few seconds, have Python pull Unicode JSON data (THIS IS WORKING FINE)
2. Save one item of that JSON data by opening the CSV file on the desktop, clearing it, writing to it, and closing it (THIS IS THE ISSUE - THE CSV FILE STOPS UPDATING)
3. MATLAB reads the file and processes it (WORKS FINE)
4. Go back to step 1
The way I am currently attempting it:
MATLAB CODE:
system('python /weather.py');
load_weather_matlab();
if final_weather > 30
    disp('sunny')
else
    disp('not sunny')
end
PYTHON CODE:
r = requests.post(api_url + 'days', json=day, auth=auth)
print r.json()
r_output = r.json()
weather = r_output['weatherA']
print weather
with open(CSV_FILE, "w+") as fp:
    fp.close()
with open(CSV_FILE, "a") as fp:
    fp.write("%s" % (weather))
    fp.close()
MATLAB FUNCTION load_weather_matlab:
function [success] = load_weather_matlab();
    global final_weather
    load_weather(); % Import the CSV File
    weather_temperature = transpose(weather_temperature);
    final_weather = weather_temperature(1);
    success = 1;
end
MATLAB FUNCTION load_weather:
filename = '/Users/m/Desktop/CSV_FILE';
delimiter = ',';
formatSpec = '%f[^\n\r]';
fileID = fopen(filename,'r');
dataArray = textscan(fileID, formatSpec, 'Delimiter', delimiter, 'EmptyValue' ,NaN, 'ReturnOnError', false);
fclose(fileID);
weather_temperature = dataArray{:, 1};
clearvars filename delimiter formatSpec fileID dataArray ans;
THE ERROR I GET IS THAT:
1) The file on the desktop, CSV_FILE ... stops updating...
2) Sometimes, if the JSON data being pulled by Python does not have the 'weather' data, then this is seen in MATLAB:
Traceback (most recent call last):
  File "/Users/m/Desktop/weather.py", line 106, in <module>
    weather = r_output['weatherA']
KeyError: 'weatherA'
BUT OTHER TIMES (BEFORE IT STOPS UPDATING) it works.
This works a couple of times, but then it stops, and I am not sure why. Sometimes I get a KeyError when the 'weather' data is not in the JSON, but that shouldn't just stop the file from updating, correct?
Any help appreciated, thanks!
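Worth noting: an uncaught KeyError on r_output['weatherA'] terminates the Python script before the CSV is rewritten, which would match the "stops updating" symptom. A minimal defensive sketch, not the asker's exact script (api_url, day, auth and CSV_FILE are assumed from the question's setup, which is not shown):

import requests

r = requests.post(api_url + 'days', json=day, auth=auth)
r_output = r.json()

# dict.get returns None instead of raising KeyError when the field is missing
weather = r_output.get('weatherA')
if weather is None:
    print("no 'weatherA' in response, keeping the previous CSV")
else:
    # mode "w" truncates the file, so a separate clear-then-append step isn't needed
    with open(CSV_FILE, "w") as fp:
        fp.write("%s" % weather)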
