get data from multiple JSONs to database

get data from multiple JSONs to database - python

I'm new to Python and I'm working on a project where I need to retrieve data from different JSONs and store it in a database (I'm using PostgreSQL). The JSON URLs are all from the same website: https://www.ine.pt/xportal/xmain?xpid=INE&xpgid=ine_main
But they have different codes when it comes to selecting, for example, a year or a location.
This is an example of one JSON: https://www.ine.pt/ine/json_indicador/pindica.jsp?op=2&varcd=0010042&Dim1=S3A202006&Dim2=1701106&Dim3=T&lang=PT
Every "Dim" in this URL can have different codes. I need a way in Python to get the data for, say, 18 different cities without requesting them one by one.
This is an example of the JSON data:
[ {
"IndicadorCod" : "0010042",
"IndicadorDsg" : "Valor mediano de avaliação bancária (€/ m²) por Localização geográfica (Município - 2013) e Tipo de construção; Mensal - INE, Inquérito à avaliação bancária na habitação",
"MetaInfUrl" : "https://www.ine.pt/bddXplorer/htdocs/minfo.jsp?var_cd=0010042&lingua=PT",
"DataExtracao" : "2020-06-29T15:55:51.640+01:00",
"DataUltimoAtualizacao" : "2020-06-29",
"UltimoPref" : "Maio de 2020",
"Dados" : {
"202005" : [ {
"geocod" : "1701106",
"geodsg" : "Lisboa",
"dim_3" : "T",
"dim_3_t" : "Total",
"valor" : "3084"
} ]
}
} ]
Besides that, I have another question. There is a field in this JSON which holds the year and the month of the data. It is the key directly below "Dados", and in this example it is "202005". How can I get this field, since it is a key rather than a value? Basically, I want to store the year, the location and the field "valor" in a database.
Thank you for all the help!

For parsing JSON data in Python, you can use the json library. It loads JSON data into Python objects: a JSON object becomes a dictionary (a data type similar to a hash map) and a JSON array becomes a list. After that you have to write a parse function for each JSON layout.
For example, for your JSON:
import json
with open("JSON_file_path") as f:  # if you have a JSON file
    data = json.load(f)
data = json.loads(json_data)  # if you have the JSON data as a string
indicator_code = data[0]["IndicadorCod"]  # the top level is a list, so index it first, then use the key
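To cover the two parts of the question (many cities, and the period that appears as a key under "Dados"), here is a minimal sketch. The helper names build_url and extract_rows are made up for illustration, and the sketch assumes every response has the same layout as the sample in the question. Iterating over .items() on the "Dados" dictionary is what gives you the key ("202005") together with its list of entries.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://www.ine.pt/ine/json_indicador/pindica.jsp"

def build_url(period_code, geo_code, varcd="0010042"):
    """Build one request URL; only Dim1 (period) and Dim2 (location) vary here."""
    params = {"op": 2, "varcd": varcd, "Dim1": period_code,
              "Dim2": geo_code, "Dim3": "T", "lang": "PT"}
    return BASE_URL + "?" + urlencode(params)

def extract_rows(payload):
    """Return (period, location, valor) tuples from one parsed response."""
    rows = []
    for indicator in payload:                                 # top level is a list
        for period, entries in indicator["Dados"].items():    # period is the key
            for entry in entries:
                rows.append((period, entry["geodsg"], entry["valor"]))
    return rows

# Trimmed copy of the sample response from the question.
sample = json.loads("""
[{"IndicadorCod": "0010042",
  "Dados": {"202005": [{"geocod": "1701106", "geodsg": "Lisboa",
                        "dim_3": "T", "dim_3_t": "Total", "valor": "3084"}]}}]
""")

rows = extract_rows(sample)
url = build_url("S3A202006", "1701106")
print(rows)
print(url)
```

With a list of location codes you can call build_url once per code in a loop, download each URL, and collect the rows. The resulting tuples have the shape that a DB-API driver's cursor.executemany expects for an INSERT statement (assuming you use a driver such as psycopg2 for PostgreSQL).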

Related

Unable to access dictionary values stored in pandas dataframe

I have read a JSON file into a pandas DataFrame, which now looks like this:
document_nr doc_type doc_details.Summary.ID doc_details.Summary.date ....
209 202207220341 A None 07/22/2022
210 202207220217 B None 07/27/2022
211 202207220327 C None 07/29/2022
....
My issue is that I cannot access column values that were originally nested dictionaries. For example, I can do print(df['document_nr']) without any problems, but I cannot do print(df.doc_details.Summary.ID), as it gives me the error:
AttributeError: 'DataFrame' object has no attribute 'doc_details'
Likewise for print(df.doc_details.Summary.date).
I have also tried the code below but got the same error.
df['summary'] = df['doc_details'].str.get("Summary")
I have no idea why it's giving me this issue. My original sample Json file looks like this:
[
{
"document_nr": "202207220914",
"doc_type": "A",
"doc_details": {
"Summary": {
"ID": null,
"date": "07/22/2022",
.....}
}
}
]
Please advise.

Since the whole string doc_details.Summary.ID is the actual name of your column, you should access the data like this:
print(df["doc_details.Summary.ID"])
When you use dot notation, as in df.doc_details.Summary.ID, Python first looks for an attribute named doc_details on the df object, and no such column or attribute exists.
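To see why the dotted name is a single key, here is a small standard-library sketch of the kind of flattening that turns nested dictionaries into dotted column names. The flatten function is a simplified stand-in written for this answer, not pandas' own implementation, and it assumes the DataFrame was built by a flattening step of this sort.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into one dict whose keys are dot-joined paths."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))   # descend, carrying the path
        else:
            flat[name] = value
    return flat

record = {
    "document_nr": "202207220914",
    "doc_type": "A",
    "doc_details": {"Summary": {"ID": None, "date": "07/22/2022"}},
}

row = flatten(record)
print(list(row))
print(row["doc_details.Summary.date"])
```

After flattening there is no key called doc_details any more, only keys such as doc_details.Summary.ID, which is why the lookup has to use the full string.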

How to access nested attribute without passing parent attribute in pyspark json

I am trying to access inner attributes of following json using pyspark
[
{
"432": [
{
"atttr1": null,
"atttr2": "7DG6",
"id":432,
"score": 100
}
]
},
{
"238": [
{
"atttr1": null,
"atttr2": "7SS8",
"id":432,
"score": 100
}
]
}
]
In the output, I am looking for something like below in form of csv
atttr1, atttr2,id,score
null,"7DG6",432,100
null,"7SS8",238,100
I understand I can get these details like below but I don't want to pass 432 or 238 in lambda expression as in bigger json this(italic one) will vary. I want to iterate over all available values.
print(inputDF.rdd.map(lambda x:(x['*432*'])).first())
print(inputDF.rdd.map(lambda x:(x['*238*'])).first())
I also tried registering a temp table with the name "test" but it gave an error with message element._id doesn't exist.
inputDF.registerTempTable("test")
srdd2 = spark.sql("select element._id from test limit 1")
Any help will be highly appreciated. I am using spark 2.4

Without using pyspark features, you can do it like this:
import json
data = json.loads(json_str)  # or whatever way you're getting the data
columns = 'atttr1 atttr2 id score'.split()
print(','.join(columns))  # headers
for item in data:
    for obj in list(item.values())[0]:  # each item has a single key whose value is a list of objects
        print(','.join(str(obj[col]) for col in columns))
Output:
atttr1,atttr2,id,score
None,7DG6,432,100
None,7SS8,432,100
Or
for item in data:
    obj = list(item.values())[0][0]  # since the object is the one and only item in list
    print(','.join(str(obj[col]) for col in columns))
FYI, you can store those in a variable or write it out to csv instead of/and also printing it.
And if you're just looking to dump that to csv, see this answer.
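If you want CSV text rather than printed lines, a minimal sketch with the standard csv module could look like the following. The to_csv helper and the extra parent_key column are my own additions for illustration; the parent key (432, 238) is included on the assumption that it is the value you wanted in place of the hard-coded lookup. Note that csv.writer writes None as an empty field.

```python
import csv
import io
import json

json_str = """
[{"432": [{"atttr1": null, "atttr2": "7DG6", "id": 432, "score": 100}]},
 {"238": [{"atttr1": null, "atttr2": "7SS8", "id": 432, "score": 100}]}]
"""

def to_csv(data, columns):
    """Write one CSV row per inner object, prefixed by its varying parent key."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["parent_key"] + columns)
    for item in data:
        for parent_key, objects in item.items():   # no need to know the key up front
            for obj in objects:
                writer.writerow([parent_key] + [obj[col] for col in columns])
    return buffer.getvalue()

csv_text = to_csv(json.loads(json_str), ["atttr1", "atttr2", "id", "score"])
print(csv_text)
```

To write to a file instead, pass a file opened with newline="" to csv.writer in place of the StringIO buffer.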

Turning JSON into dataframe with pandas

I'm trying to get a data frame but keep running into various error messages depending on the arguments I pass to read_json after I specify my file.
I've run through many of the arguments in the pandas.read_json documentation, but haven't been able to identify a solution.
import pandas
json_file = "https://gis.fema.gov/arcgis/rest/services/NSS/OpenShelters/MapServer/0/query?where=1%3D1&outFields=*&returnGeometry=false&outSR=4326&f=json"
pandas.read_json(json_file)

This happens because the JSON is not directly convertible to a DataFrame. read_json works only with a few formats, selected by the orient parameter. Your JSON doesn't follow any of the allowed formats, so you need to manipulate it before converting it to a data frame.
Let's take a high level look at your JSON:
{
"displayFieldName": ...,
"fieldAliases": {...},
"fields": {...},
"features": [...]
}
I'm gonna hazard a guess and assume the features node is what you want. Let's dive deeper into features:
"features": [
{
"attributes": {
"OBJECTID": 1,
"SHELTER_ID": 223259,
...
}
},
{
"attributes": {
"OBJECTID": 2,
"SHELTER_ID": 223331,
...
}
},
...
]
features contains a list of objects, each having an attributes node. The data contained in the attributes node is what you actually want.
Here's the code
import pandas as pd
import json
from urllib.request import urlopen
json_file = "https://gis.fema.gov/arcgis/rest/services/NSS/OpenShelters/MapServer/0/query?where=1%3D1&outFields=*&returnGeometry=false&outSR=4326&f=json"
data = urlopen(json_file).read()
raw_json = json.loads(data)
formatted_json = [feature['attributes'] for feature in raw_json['features']]
formatted_json is now a list of dictionaries containing the data we are after. It is no longer JSON. To create the data frame:
df = pd.DataFrame(formatted_json)

how to get data by range time with timestamp format on mongodb?

how to sort data based on time timestamp format on mongodb data?
I have data with the format as below:
{
"_id" : ObjectId("5996562c31f238391609f526"),
"created_at" : 1502683719,
"uname" : "username_here",
"source" : "sourcer"
}
...
...
I want to find the data using a filter on "created_at". I tried a command like this:
db.getCollection('data').find({
'created_at':{
'$gte':1502683719,
'$lt':1494616578
}
)
As a result, all the data comes out, whether its value is more than or less than the values I entered.
The "created_at" field is stored as an integer.

Assuming your collection is users:
db.users.find().sort({created_at: -1})
Check out the documentation for more detailed options https://docs.mongodb.com/manual/reference/operator/meta/orderby/
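For the range part of the question, note that the two bounds in the attempted query are swapped: 1502683719 is greater than 1494616578, so no value can be both $gte the first and $lt the second. Below is a minimal Python sketch that builds the filter document with the bounds checked. build_range_filter is a hypothetical helper written for this answer, not part of any MongoDB driver.

```python
def build_range_filter(start_ts, end_ts, field="created_at"):
    """Return a MongoDB-style filter for start_ts <= field < end_ts."""
    if start_ts >= end_ts:
        raise ValueError("start_ts must be smaller than end_ts")
    return {field: {"$gte": start_ts, "$lt": end_ts}}

# Same two timestamps as in the question, lower bound first.
query = build_range_filter(1494616578, 1502683719)
print(query)
```

Assuming you use a driver such as pymongo, the resulting dictionary can be passed to find(), and the sort from above can be chained onto the cursor.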

Format JSON objects in python

I have JSON objects in a notepad file (C:\data.txt). There are millions of records; I just used one record as an example. But I want to see only the data in my notepad, like:
1 123-567-9876 TEST1 TEST 717-567-9876 Harrisburg null US_PA
I don't want parentheses, etc.
Once I get the clean data, the plan is to import it from notepad (say C:\data2.txt) into a SQL database.
This is the format of json object.
{
"status":"ok",
"items":[
{
"1":{
"Work_Phone":"123-567-9876",
"Name_Part":[
"TEST1",
"TEST"
],
"Residence_Phone":"717-567-9876",
"Mailing_City":"Harrisburg",
"Mailing_Street_Address_line_1":"",
"Cell_Phone":null,
"Mailing_Country_AND_Province_OR_State":"US_PA"
}
}
]
}
Can someone please help with Python code to format this JSON object and export it to notepad?

You can use
import simplejson as json
(the standard-library json module offers the same loads function). Then you can open your file and load it into a Python dictionary:
with open("C:/data.txt", "r") as f:
    data = json.loads(f.read())
But this only works when the file contains a single JSON value, so if there are several JSON objects they have to be stored in an array in your file. It has to look like this:
[{ ... first record ...},
{... second record ...},
...,
{... last record ...}]
Then data holds a list of dictionaries. Now you can write the records to another file:
with open("output.txt", "w") as g:
    for d in data:
        for item in d["items"]:  # each item maps a record id to its fields
            for key, record in item.items():
                g.write(... some string build from the parameters ...)
If done well, the file output.txt contains the lines. The details might be tricky because each item seems to contain some arrays.
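As a rough sketch of the "some string" step, the function below turns one record into the line format shown in the question. record_to_line is a hypothetical helper, and the rules it applies are inferred from the requested output: lists are expanded, null is written as the word null, and empty strings are skipped.

```python
import json

def record_to_line(record_id, record):
    """Join a record id and its field values into one space-separated line."""
    parts = [record_id]
    for value in record.values():
        values = value if isinstance(value, list) else [value]  # expand arrays
        for v in values:
            if v is None:
                parts.append("null")      # keep JSON null visible
            elif v != "":
                parts.append(str(v))      # skip empty strings
    return " ".join(parts)

# The sample object from the question.
raw = """
{"status": "ok",
 "items": [{"1": {"Work_Phone": "123-567-9876",
                  "Name_Part": ["TEST1", "TEST"],
                  "Residence_Phone": "717-567-9876",
                  "Mailing_City": "Harrisburg",
                  "Mailing_Street_Address_line_1": "",
                  "Cell_Phone": null,
                  "Mailing_Country_AND_Province_OR_State": "US_PA"}}]}
"""

data = json.loads(raw)
lines = [record_to_line(key, record)
         for item in data["items"]
         for key, record in item.items()]
print("\n".join(lines))
```

Because values are separated by single spaces, a field that itself contains a space cannot be told apart afterwards, so a delimiter such as a tab or a comma may be safer if the file is going to be imported into a SQL database.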
