How to make json.loads() read a JSON string with the column names as the first element - Python

I am serializing a DataTable from an HTTP GET and, for performance reasons, would prefer to serialize it in a Names/Values structure, so that the first element contains the column names. Can json.loads deal with this? If not, is there another JSON parser that will?
{
  "Names": ["summaryDate", "count"],
  "Values": [["2020-01-15T00:00:00", 10], ["2020-01-16T00:00:00", 12], ["2020-01-17T00:00:00", 16]]
}
(This reduces the size to 20% of a standard JSON stream, where the field names are repeated for each 'row'.)

I did some digging and found ijson.
It lets you iterate over a JSON file and access its objects.
You can build your data structure like this (I was lazy and used pandas):
import ijson
import pandas as pd

# ijson consumes its input stream, so each iterator needs its own file handle
f = open("testjson.txt", "r")
f2 = open("testjson.txt", "r")

names = ijson.items(f, "Names.item")     # yields each column name
values = ijson.items(f2, "Values.item")  # yields each row as a list

df = pd.DataFrame(values, columns=list(names))
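Since ijson streams the file rather than loading it all at once, this approach also keeps memory usage flat for large HTTP responses.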

You can load it into a dictionary with json, and then apply this comprehension to obtain the records you want:
dic = {
    "Names": ["summaryDate", "count"],
    "Values": [["2020-01-15T00:00:00", 10], ["2020-01-16T00:00:00", 12], ["2020-01-17T00:00:00", 16]]
}
d = [{"summaryDate": summaryDate, "count": count} for summaryDate, count in dic["Values"]]
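If the column set is not fixed, the same idea generalizes by zipping each row with the Names list (a small sketch, not part of the original answer):

d = [dict(zip(dic["Names"], row)) for row in dic["Values"]]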

Related

Convert specific columns to list and then create json

I have a spreadsheet like the following:
As you can see, there are multiple "tags" columns like this: "tags_0", "tags_1", "tags_2". And there can be more.
I'm trying to find all the "tags" and put them inside a list using a pandas DataFrame, and eventually put them inside an array of "tags" inside a JSON file.
I thought of using regex, but I can't find a way to apply it.
This is the function I'm using to output the json file. I added the tags array for reference:
import json
import pandas as pd

def convert_products():
    read_exc = pd.read_excel('./data/products.xlsx')
    df = pd.DataFrame(read_exc)
    all_data = []
    for i in range(len(df)):
        js = {
            "sku": df['sku'][i],
            "brand": df['brand'][i],
            "tags": [?]  # <-- how do I build this list?
        }
        all_data.append(js)
    json_object = json.dumps(all_data, ensure_ascii=False, indent=2)
    with open("./data/products.json", "w", encoding='utf-8') as outfile:
        outfile.write(json_object)
How can I achieve this?
Thanks
You can achieve that in a much easier way by doing something like this...
df = pd.read_excel('your_file.xlsx')
tags_columns = [col for col in df.columns if col.startswith("tags_")]
df["tags"] = df[tags_columns].values.tolist()
df[["sku","brand","tags"]].to_json("test.json",orient="records")
You can try other JSON orientations if you want: ["index", "columns", "split", "records", "values", "table"]. Check them in the pandas documentation.
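For a quick feel of the difference between orientations, a toy sketch (the frame contents are made up):

import pandas as pd

df = pd.DataFrame({"sku": ["ADX112"], "brand": ["ADX"]})
print(df.to_json(orient="records"))  # [{"sku":"ADX112","brand":"ADX"}]
print(df.to_json(orient="split"))    # {"columns":["sku","brand"],"index":[0],"data":[["ADX112","ADX"]]}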
First, you can get all the column names as a list:
list(df.columns.values)
Now you can search this list for the column names that contain tags_. Once you have the tags columns, you can loop through them for a specific row, retrieve each tag value, put the values in a list, and pass that list into the JSON object, as in the sketch below.
tag_columns = [col for col in df.columns if col.startswith("tags_")]
for i in range(len(df)):
    tag_list = []
    for tag_column in tag_columns:
        tag_list.append(df[tag_column][i])
    # ... your code for creating the json object ...
    # pass tag_list for the "tags" key in the json object
You are probably looking for filter:
out = pd.concat([df[['sku', 'brand']],
                 df.filter(regex='^tags_').agg(list, axis=1).rename('tags')],
                axis=1).to_json(orient='records', indent=2)
print(out)
# Output
[
  {
    "sku":"ADX112",
    "brand":"ADX",
    "tags":[
      "art",
      "frame",
      "painting"
    ]
  }
]
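For reference: filter(regex='^tags_') selects only the columns whose names start with tags_, and agg(list, axis=1) collapses them row-wise into a single list, which becomes the "tags" column.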

Parse JSON string from Pyspark Dataframe

I have a nested JSON dict that I need to convert to a Spark dataframe. This JSON dict is present in a dataframe column. I have been trying to parse it using "from_json" and "get_json_object", but have been unable to read the data. Here's the smallest snippet of the source data that I've been trying to read:
{"value": "\u0000\u0000\u0000\u0000/{\"context\":\"data\"}"}
I need to extract the nested dict value. I used the below code to clean the data and read it into a dataframe:
from pyspark.sql.functions import *
from pyspark.sql.types import *

input_path = '/FileStore/tables/enrl/context2.json'  # path for the above file
schema1 = StructType([StructField("context", StringType(), True)])  # schema I'm providing
raw_df = spark.read.json(input_path)
cleansed_df = raw_df.withColumn("cleansed_value", regexp_replace(raw_df.value, '/', '')).select('cleansed_value')  # removed extra '/' in the data
cleansed_df.select(from_json('cleansed_value', schema=schema1)).show(1, truncate=False)
I get a null dataframe each time I run the above code. Please help.
I tried the approach below and it didn't work:
PySpark: Read nested JSON from a String Type Column and create columns
I also tried writing it to a JSON file and reading it back. That didn't work either:
reading a nested JSON file in pyspark
The null chars \u0000 affect the parsing of the JSON. You can replace them as well:
from pyspark.sql import functions as F

df = spark.read.json('path')
df2 = df.withColumn(
    'cleansed_value',
    F.regexp_replace('value', '[\u0000/]', '')
).withColumn(
    'parsed',
    F.from_json('cleansed_value', 'context string')
)
df2.show(20, 0)
df2.show(20,0)
+-----------------------+------------------+------+
|value |cleansed_value |parsed|
+-----------------------+------------------+------+
|/{"context":"data"}|{"context":"data"}|[data]|
+-----------------------+------------------+------+
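To pull the nested value out as a plain column afterwards, a small follow-up sketch (not part of the original answer; dot notation accesses the struct field):

df2.select(F.col('parsed.context').alias('context')).show()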

Adding a parent in JSON in Python

I have a pandas dataframe with 2 columns as below:
Column1 Column2
122132 123123
213213 231233
231234 232133
I have converted this to json using the below code:
df_json = dataframe.to_json(orient='records')
OUTPUT JSON:
[{"Column1":"122132","Column2":"123123"},{"Column1":"213213","Column2":"231233"},{"Column1":"231234","Column2":"232133"}]
I need to add a parent to it so that the JSON looks like below:
{
"APP_Request":
[{"Column1":"122132","Column2":"123123"},{"Column1":"213213","Column2":"231233"},{"Column1":"231234","Column2":"232133"}]
}
How do I do this?
I tried the below but the JSON is not in the right format.
df_new = dict()
df_new["APP_Request"] = df_json
The OUTPUT of the above is below and is not the right format:
{'APP_Request': '[{"Column1":"122132","Column2":"123123"},{"Column1":"213213","Column2":"231233"},{"Column1":"231234","Column2":"232133"}]'}
The above JSON is wrapped in single quotes, with the square brackets inside them, which is not what I want. Is there a way to create the JSON in the format I need?
The simplest way to create a parent is by creating a new dict:
new_dict = {"APP_Request": df_json}
Because your df_json is a string, you need to convert it back to a real object with json.loads().
So:
new_dict = {"APP_Request": json.loads(df_json)}

Write Python3/Pandas dataframe to JSON with orient=records but without the array when there is only one record

I'm writing a very small Pandas dataframe to a JSON file. In fact, the Dataframe has only one row with two columns.
To build the dataframe:
import pandas as pd
df = pd.DataFrame.from_dict(dict({'date': '2020-10-05', 'ppm': 411.1}), orient='index').T
print(df)
prints
date ppm
0 2020-10-05 411.1
The desired json output is as follows:
{
  "date": "2020-10-05",
  "ppm": 411.1
}
but when writing the json with pandas, I can only print it as an array with one element, like so:
[
  {
    "date":"2020-10-05",
    "ppm":411.1
  }
]
I've currently hacked my code to convert the Dataframe to a dict, and then use the json module to write the file.
import json

data = df.to_dict(orient='records')
data = data[0]  # keep the only element
with open('data.json', 'w') as fp:
    json.dump(data, fp, indent=2)
Is there a native way with pandas' .to_json() to keep the only dictionary item if there is only one?
I am currently using .to_json() like this, which incorrectly prints the array with one dictionary item.
df.to_json('data.json', orient='index', indent = 2)
Python 3.8.6
Pandas 1.1.3
If you want to export only one row, use iloc:
print (df.iloc[0].to_dict())
#{'date': '2020-10-05', 'ppm': 411.1}
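If you want to write that single row straight to a file with pandas instead, a short sketch (Series.to_json supports indent in pandas 1.0+, and the default Series orientation produces a plain object rather than an array):

df.iloc[0].to_json('data.json', indent=2)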

Converting responses returned from REST API into CSV

I am trying to write the responses returned from a REST API into a CSV file. As there are multiple requests, I am calling the API on the requests one by one, so there are multiple responses. I am not able to achieve the desired format.
Desired format :
name,job,id,createdAt
morpheus,leader,727,2018-10-12T12:04:39.234Z
Mark,SSE,925,2018-10-12T12:04:40.200Z
Taylor,SE,247,2018-10-12T12:04:41.115Z
Code:
import json
import requests

url = 'https://reqres.in/api/users'
data = [
    {"name": "morpheus", "job": "leader"},
    {"name": "Mark", "job": "SSE"},
    {"name": "Taylor", "job": "SE"}
]

with open('response.csv', 'w') as f:
    for element in data:
        r = requests.post(url, json=element)
        response = json.loads(r.text)
        for key in response.keys():
            pass
            # f.write("%s,%s" % (key, response[key]))
Python has built-in support for CSV reading and writing that lets you define dialects with different separators and escaping logic.
Cell values containing the separator, a newline, or your escape chars need to be escaped, or the resulting CSV is broken - the csv module does that for you. You can choose between different formats (Excel can be picky when loading CSV), or define your own.
https://docs.python.org/3/library/csv.html#csv.writer
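A minimal sketch of what that looks like here (the rows are just the sample values from the desired output):

import csv

with open('response.csv', 'w', newline='') as f:
    writer = csv.writer(f)  # handles quoting/escaping automatically
    writer.writerow(['name', 'job', 'id', 'createdAt'])
    writer.writerow(['morpheus', 'leader', '727', '2018-10-12T12:04:39.234Z'])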
Assuming the data you are getting from the server has the exact keys you're looking for, something like this should work:
import json
import requests

data = []  # Your data here.
url = 'https://reqres.in/api/users'
desired_columns = ['name', 'job', 'id', 'createdAt']

with open('response.csv', 'w') as f:
    # First we need to write the column names to the file
    f.write(','.join(desired_columns) + '\n')
    for element in data:
        r = requests.post(url, json=element)
        response = json.loads(r.text)
        # Here, I will assume response has 'name', 'job', 'id' and 'createdAt'
        # as keys to the dictionary. We will save them to the list 'data_to_write'
        # and then write that out the same way we did above.
        data_to_write = []
        for column in desired_columns:
            data_to_write.append(str(response[column]))  # str() since 'id' may be an int
        f.write(','.join(data_to_write) + '\n')
