The goal is to open a JSON file or URL so that I can view earthquake data. I wrote a function that uses a dictionary and a list, but in the terminal I get an "invalid argument" error. What is the best way to open a JSON file using Python?
import requests
import json

def earthquake_daily_summary():
    req = requests.get("https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.geojson")
    data = req.json()  # The .json() method converts the JSON data from the server to a dictionary

    # Open json file -- this is the line that fails with "invalid argument"
    f = open('https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.geojson')
    # returns JSON object as a dictionary
    data = json.load(f)

    # Iterating through the json list
    for i in data['emp_details']:
        print(i)
    f.close()

print("\n=========== PROBLEM 5 TESTS ===========")
earthquake_daily_summary()
You can convert the response to JSON directly and read the data you need; open() only works on local file paths, which is why it fails on a URL. I didn't find an 'emp_details' key in the feed, so I replaced it with 'features'.
import requests

def earthquake_daily_summary():
    data = requests.get("https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.geojson").json()
    for row in data['features']:
        print(row)

print("\n=========== PROBLEM 5 TESTS ===========")
earthquake_daily_summary()
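Each feature in the feed keeps its details under a properties key, so if you want a readable summary instead of raw dictionaries, a minimal sketch like this should work (mag and place are standard fields in the USGS GeoJSON summary format):

import requests

def earthquake_daily_summary():
    data = requests.get("https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.geojson").json()
    for feature in data['features']:
        props = feature['properties']
        # 'mag' and 'place' come from the documented GeoJSON summary format
        print(f"M{props['mag']} - {props['place']}")

earthquake_daily_summary()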
I am trying to read a JSON file (BioRelEx dataset: https://github.com/YerevaNN/BioRelEx/releases/tag/1.0alpha7) in Python. The JSON file is a list of objects, one per sentence.
This is how I try to do it:
def _read(self, file_path):
    with open(cached_path(file_path), "r") as data_file:
        for line in data_file.readlines():
            if not line:
                continue
            items = json.loads(line)
            text = items["text"]
            label = items.get("label")
My code is failing on items = json.loads(line). It looks like the data is not formatted as the code expects it to be, but how can I change it?
Thanks in advance for your time!
Best,
Julia
With json.load() you don't need to read each line; you can do either of these:
import json

def open_json(path):
    with open(path, 'r') as file:
        return json.load(file)

data = open_json('./1.0alpha7.dev.json')
Or, even cooler, you can GET the JSON straight from GitHub:
import json
import requests
url = 'https://github.com/YerevaNN/BioRelEx/releases/download/1.0alpha7/1.0alpha7.dev.json'
response = requests.get(url)
data = response.json()
These will both give the same output: the data variable will be a list of dictionaries that you can iterate over in a for loop for further processing.
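For example, since each sentence object carries a text field (as your own code shows), iterating looks like:

for item in data:
    print(item["text"])  # each item is one sentence object from the dataset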
Your code reads one line at a time and parses each line individually as JSON. Unless the creator of the file wrote it in that line-delimited format (unlikely, given the .json extension), that won't work: JSON does not use line breaks to indicate the end of an object.
Load the whole file content as JSON instead, then process the resulting items in the array.
def _read(self, file_path):
    with open(cached_path(file_path), "r") as data_file:
        data = json.load(data_file)
        for item in data:
            text = item["text"]
The label appears to be buried in item["interaction"].
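I haven't dug into the exact schema, so treat this as a sketch only; assuming item["interaction"] is a list of dicts, each with its own label key, access would look like:

for item in data:
    text = item["text"]
    # assumption: item["interaction"] is a list of dicts with a "label" key
    for interaction in item.get("interaction", []):
        label = interaction.get("label")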
I want to use Avro to serialize a dictionary to produce a bytestring, write it to io.BytesIO, read it back, and deserialize it.
Q1: Should I load the schema from the Avro schema file as an avro.schema.RecordSchema, or can I load it from a JSON file with json.load?
Q2: When BytesIO is used, do I need to call seek(0)?
Q3: I use BytesIO just to pass the serialized bytestring along, then read it back and deserialize it. I want to do this in memory, which is why I don't write/read a file. Is that OK?
import io
import json
import avro.io
import avro.schema

msg = {"name": "foo", "favorite_number": 1, "favorite_color": "pink"}

# Parse the schema definition (avro.schema.parse expects the JSON text of the schema)
with open("schema", "rb") as f:
    SCHEMA = avro.schema.parse(f.read())

# Serialize: DatumWriter encodes msg according to SCHEMA into the in-memory buffer
writer = avro.io.DatumWriter(SCHEMA)
bytes_writer = io.BytesIO()
encoder = avro.io.BinaryEncoder(bytes_writer)
writer.write(msg, encoder)
b = bytes_writer.getvalue()

# Deserialize: a fresh BytesIO(b) is already positioned at 0, so no seek is needed
reader = avro.io.DatumReader(SCHEMA)
bytes_reader = io.BytesIO(b)
decoder = avro.io.BinaryDecoder(bytes_reader)
deserialized_json = reader.read(decoder)
EDIT:
The documentation contains an example with serde and file write/read:
https://avro.apache.org/docs/1.8.2/gettingstartedpython.pdf
They use DataFileWriter, which, according to the documentation, will "verify that the items we write are valid items and write the appropriate fields." If I skip it and use DatumWriter alone to write to BytesIO, am I doing everything correctly? The documentation says DatumWriter can be used separately.
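For Q1, since an Avro schema file is itself plain JSON, the json.load route I have in mind would be a minimal sketch like this (assuming the file on disk contains a valid schema definition):

import json
import avro.schema

# A schema file is plain JSON, so it can be loaded with json.load and
# re-serialized for avro.schema.parse, which expects a JSON string.
with open("schema") as f:
    schema_dict = json.load(f)
SCHEMA = avro.schema.parse(json.dumps(schema_dict))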
I'm given a URL which returns some JSON text. In the text there are URLs for CSV files. I'm trying to parse the JSON from the URL and download the CSV files. I am able to print the JSON from the URL but don't know how to grab the CSV files from within it.
import json
import urllib.request

with urllib.request.urlopen("http://staging.test.com/api/reports/68.json?auth_token=test") as url:
    s = url.read()
    print(s)
The above prints the JSON from the URL (see the printout below); it contains the URLs for the CSV files that I then need to download using Python.
{"id":68,"name":"Carrier Rates","state":"complete","user_id":166,"data_set_id":7,"bounding_date":{"id":101,"start_date":"2019-01-01T00:00:00.000-05:00","end_date":"2999-12-31T00:00:00.000-05:00","bounding_field_id":322,"related_id":68,"related_type":"Reports::Report"},"results":[{"id":68,"created_at":"2019-07-26T15:29:40.872-04:00","version_name":"07/26/2019 03:29PM","content":"https://test-staging.s3.amazonaws.com/reports/manufacturer/carrier-test.1dec2e6d-0c36-44b7-ab26-fd43fe710daf.csv"},{"id":67,"created_at":"2019-07-26T15:29:07.112-04:00","version_name":"07/26/2019 03:29PM","content":"https://test-staging.s3.amazonaws.com/reports/manufacturer/carrier-test.3b02195e-c0a2-4abe-88f7-27d20ac76e07.csv"},{"id":35,"created_at":"2019-06-26T11:01:26.900-04:00","version_name":"06/26/2019 11:01AM","content":"https://test-staging.s3.amazonaws.com/reports/manufacturer/carrier-test.a488c58d-5e04-4c28-a429-7167e9e8edaa.csv"},{"id":34,"created_at":"2019-06-26T10:57:51.396-04:00","version_name":"06/26/2019 10:57AM","content":"https://cloudtestlogistics-staging.s3.amazonaws.com/reports/manufacturer/carrier-test.bf73db19-5604-4a1d-bc31-da6cf25742cc.csv"}]}
The following code can help you.
import json
import urllib.request

with urllib.request.urlopen("http://staging.test.com/api/reports/68.json?auth_token=test") as url:
    s = url.read()

# json.loads (not json.load) parses a string or bytes object
loaded_json = json.loads(s)
results = loaded_json["results"]

csv_links = []
for result in results:
    csv_links.append(result["content"])
Now you have a list of links to CSV files. Download them using urllib.
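For example, a minimal sketch of the download step (deriving the local file name from the URL is just one possible choice):

import os
import urllib.request

for link in csv_links:
    # use the last path segment of the URL as the local file name (an arbitrary choice)
    filename = os.path.basename(link.split("?")[0])
    urllib.request.urlretrieve(link, filename)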
import json
from collections import namedtuple

# This is your "s" -- data = s
data = '{"name": "John Smith", "hometown": {"name": "New York", "id": 123}}'

# Parse JSON into an object with attributes corresponding to dict keys.
x = json.loads(data, object_hook=lambda d: namedtuple('X', d.keys())(*d.values()))
print(x.name, x.hometown.name, x.hometown.id)
This answer is adapted from How to convert JSON data into a Python object; it loads the JSON into an object. You can then access a value via the key it had in the JSON:
print(x.content)
Of course you'll have to wiggle the code around to get it to work exactly how you want. I'm not really a Python expert and have nothing to test with, but the idea is to load everything into a namedtuple and access it via its keys.
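Applied to the report JSON above, that would look something like this (untested, per the caveat above):

x = json.loads(s, object_hook=lambda d: namedtuple('X', d.keys())(*d.values()))
for result in x.results:
    print(result.content)  # each "content" field is one of the CSV links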
import json
import urllib.request

with urllib.request.urlopen("http://staging.test.com/api/reports/68.json?auth_token=test") as url:
    s = url.read()

# assuming here you got that json content
s = '{"id":68,"name":"Carrier Rates","state":"complete","user_id":166,"data_set_id":7,"bounding_date":{"id":101,"start_date":"2019-01-01T00:00:00.000-05:00","end_date":"2999-12-31T00:00:00.000-05:00","bounding_field_id":322,"related_id":68,"related_type":"Reports::Report"},"results":[{"id":68,"created_at":"2019-07-26T15:29:40.872-04:00","version_name":"07/26/2019 03:29PM","content":"https://test-staging.s3.amazonaws.com/reports/manufacturer/carrier-test.1dec2e6d-0c36-44b7-ab26-fd43fe710daf.csv"},{"id":67,"created_at":"2019-07-26T15:29:07.112-04:00","version_name":"07/26/2019 03:29PM","content":"https://test-staging.s3.amazonaws.com/reports/manufacturer/carrier-test.3b02195e-c0a2-4abe-88f7-27d20ac76e07.csv"},{"id":35,"created_at":"2019-06-26T11:01:26.900-04:00","version_name":"06/26/2019 11:01AM","content":"https://test-staging.s3.amazonaws.com/reports/manufacturer/carrier-test.a488c58d-5e04-4c28-a429-7167e9e8edaa.csv"},{"id":34,"created_at":"2019-06-26T10:57:51.396-04:00","version_name":"06/26/2019 10:57AM","content":"https://cloudtestlogistics-staging.s3.amazonaws.com/reports/manufacturer/carrier-test.bf73db19-5604-4a1d-bc31-da6cf25742cc.csv"}]}'

d = json.loads(s)
for f in d['results']:
    # manage download here
    csv_url = f['content']
I'm currently using Yahoo Pipes, which provides me with a JSON file at a URL.
I would like to fetch it and convert it into a CSV file, and I have no idea where to begin (I'm a complete beginner in Python).
How can I fetch the JSON data from the URL?
How can I transform it to CSV?
Thank you
import urllib.request
import json
import csv

def getRows(data):
    # ?? this totally depends on what's in your data
    return []

url = "http://www.yahoo.com/something"
data = urllib.request.urlopen(url).read()
data = json.loads(data)

fname = "mydata.csv"
with open(fname, 'w', newline='') as outf:
    outcsv = csv.writer(outf)
    outcsv.writerows(getRows(data))
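Since getRows depends entirely on the shape of your JSON, here is one hedged sketch for the common case where the feed is a list of flat dictionaries (the structure is an assumption; adjust the keys to your actual feed):

def getRows(data):
    # assumes data is a list of flat dicts, e.g. [{"title": ..., "link": ...}, ...]
    if not data:
        return []
    header = list(data[0].keys())
    rows = [header]
    rows.extend([item.get(col, "") for col in header] for item in data)
    return rows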