We are using the dataframe below to create a JSON file.
Input:
import pandas as pd
import numpy as np
a1=["DA_STinf","DA_Stinf_NA","DA_Stinf_city","DA_Stinf_NA_ID","DA_Stinf_NA_ID_GRANT","DA_country"]
a2=["data.studentinfo","data.studentinfo.name","data.studentinfo.city","data.studentinfo.name.id","data.studentinfo.name.id.grant","data.country"]
a3=[np.NaN,np.NaN,"StringType",np.NaN,"BoolType","StringType"]
d1=pd.DataFrame(list(zip(a1,a2,a3)),columns=['data','action','datatype'])
We have to build the two structures below from this dataframe in a dynamic way, fitting the data into the following formats.
For the schema, e.g.:
StructType([StructField(Column_name, Datatype, True)])
For the data, e.g.:
F.struct(F.col(column_name)).alias(json_expected_name)
1) Expected output structure for the schema
StructType([
    StructField("data",
        StructType([
            StructField("studentinfo",
                StructType([
                    StructField("city", StringType(), True),
                    StructField("name",
                        StructType([
                            StructField("id",
                                StructType([
                                    StructField("grant", BoolType(), True)
                                ])
                            )
                        ])
                    )
                ])
            ),
            StructField("country", StringType(), True)
        ])
    )
])
2) Expected data fetch
df.select(
    F.struct(
        F.struct(
            F.struct(F.col("DA_Stinf_city")).alias("city"),
            F.struct(
                F.struct(F.col("DA_Stinf_NA_ID_GRANT")).alias("id")
            ).alias("name"),
        ).alias("studentinfo"),
        F.struct(F.col("DA_country")).alias("country")
    ).alias("data")
)
We have to use a for loop to add this kind of entry (data.studentinfo.name.id, i.e. data -> studentinfo -> name -> id), which I have already added to the expected output structure above.
Below, the dataframe is dumped to JSON and then reassembled into the hierarchical structure you want. The action column carries the hierarchy elements of your tree and the datatype column carries the type. You can assume null datatypes on leaf columns are numeric; nodes that have children (such as name) are treated as StructType.
import pandas as pd
import numpy as np
import json
a1=["DA_STinf","DA_Stinf_NA","DA_Stinf_city","DA_Stinf_NA_ID","DA_Stinf_NA_ID_GRANT","DA_country"]
a2=["data.studentinfo","data.studentinfo.name","data.studentinfo.city","data.studentinfo.name.id","data.studentinfo.name.id.grant","data.country"]
a3=["StructType","StructType","StringType","NumericType","BoolType","StringType"]
df=pd.DataFrame(list(zip(a1,a2,a3)),columns=['data','action','datatype'])
json_tree=df.to_json()
{
"data":{
"0":"DA_STinf",
"1":"DA_Stinf_NA",
"2":"DA_Stinf_city",
"3":"DA_Stinf_NA_ID",
"4":"DA_Stinf_NA_ID_GRANT",
"5":"DA_country"
},
"action":{
"0":"data.studentinfo",
"1":"data.studentinfo.name",
"2":"data.studentinfo.city",
"3":"data.studentinfo.name.id",
"4":"data.studentinfo.name.id.grant",
"5":"data.country"
},
"datatype":{
"0":"StructType",
"1":"StructType",
"2":"StringType",
"3":"NumericType",
"4":"BoolType",
"5":"StringType"
}
}
def convert_action_to_hierarchy(data):
    data = json.loads(data)
    action = data['action']
    datatype_list = data['datatype']
    # map every dotted path to its datatype so each node can be looked up by path
    path_types = {action[str(i)]: datatype_list[str(i)] for i in range(len(action))}
    result = {}
    for i in range(len(action)):
        action_list = action[str(i)].split('.')
        for j in range(len(action_list)):
            prefix = '.'.join(action_list[:j + 1])
            # nodes without a row of their own (e.g. the root "data") are structs
            datatype = path_types.get(prefix, 'StructType')
            result[action_list[j]] = (j, datatype)
    return result
print(convert_action_to_hierarchy(json_tree))
output:
{'data': (0, 'StructType'), 'studentinfo': (1, 'StructType'), 'name': (2, 'StructType'), 'city': (2, 'StringType'), 'id': (3, 'NumericType'), 'grant': (4, 'BoolType'), 'country': (1, 'StringType')}
The number is the level in the hierarchy
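Going a step further, the same dotted paths can be folded into an actual nested PySpark schema. A minimal sketch of that next step, assuming pyspark is installed and mapping the placeholder names NumericType and BoolType onto concrete Spark types (IntegerType and BooleanType here), could look like this:
from pyspark.sql.types import StructType, StructField, StringType, BooleanType, IntegerType

# assumed mapping from the placeholder names in the dataframe to Spark types
type_map = {"StringType": StringType(), "BoolType": BooleanType(), "NumericType": IntegerType()}

def build_schema(pairs):
    """pairs: iterable of (dotted_path, datatype_name) taken from the dataframe."""
    tree = {}
    for path, dtype in pairs:
        node = tree
        parts = path.split('.')
        for part in parts[:-1]:
            child = node.get(part)
            if not isinstance(child, dict):      # promote a leaf to a struct when it gains children
                child = {}
                node[part] = child
            node = child
        if not isinstance(node.get(parts[-1]), dict):
            node[parts[-1]] = dtype              # a leaf keeps its datatype name

    def to_struct(node):
        fields = []
        for name, child in node.items():
            if isinstance(child, dict):
                fields.append(StructField(name, to_struct(child), True))
            else:
                fields.append(StructField(name, type_map.get(child, StringType()), True))
        return StructType(fields)

    return to_struct(tree)

schema = build_schema(zip(df['action'], df['datatype']))
print(schema)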
I have this DataFrame:
df = pd.DataFrame({'Survey': "001_220816080015", 'BCD': "001_220816080015.bcd", 'Sections': "4700A1/305, 4700A1/312"})
All the dataframe fields are ASCII strings, and the dataframe is the output of a SQL query (pd.read_sql_query), so the line above that creates it may not be quite right.
And I wish the final JSON output to be in the form
[{
"Survey": "001_220816080015",
"BCD": "001_220816080015.bcd",
"Sections": [
"4700A1/305",
"4700A1/312"
}]
I realize that may not be 'normal' JSON but that is the format expected by a program over which I have no control.
The nearest I have achieved so far is
[{
"Survey": "001_220816080015",
"BCD": "001_220816080015.bcd",
"Sections": "4700A1/305, 4700A1/312"
}]
Problem might be the structure of the dataframe but how to reformat it to produce the requirement is not clear to me.
The JSON line is:
df.to_json(orient='records', indent=2)
Isn't parsing Sections into a list the only thing you need to do?
import pandas as pd
df= pd.DataFrame({'Survey': "001_220816080015", 'BCD': "001_220816080015.bcd", 'Sections': "4700A1/305, 4700A1/312"}, index=[0])
df['Sections'] = df['Sections'].str.split(', ')
print(df.to_json(orient='records', indent=2))
[
{
"Survey":"001_220816080015",
"BCD":"001_220816080015.bcd",
"Sections":[
"4700A1\/305",
"4700A1\/312"
]
}
]
The DataFrame won't help you here, since it's just giving back the input parameter you gave it.
You should just split the specific column you need into an array:
input_data = {'Survey': "001_220816080015", 'BCD': "001_220816080015.bcd", 'Sections': "4700A1/305, 4700A1/312"}
input_data['Sections'] = input_data['Sections'].split(', ')
nested_json = [input_data]
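If an actual JSON string is needed rather than a Python list, a minimal follow-up using the standard json module could be:
import json

print(json.dumps(nested_json, indent=2))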
I'm trying to flatten 2 columns from a table loaded into a dataframe, as below:
| u_group | t_group |
| --- | --- |
| {"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"} | {"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"} |
| {"link": "https://hi.com/api/now/table/system/99b27bc1db761f4", "value": "99b27bc1db761f4"} | {"link": "https://hi.com/api/now/table/system/99b27bc1db761f4", "value": "99b27bc1db761f4"} |
I want to separate them and get them as:
| u_group.link | u_group.value | t_group.link | t_group.value |
| --- | --- | --- | --- |
| https://hi.com/api/now/table/system/2696f18b376bca0 | 2696f18b376bca0 | https://hi.com/api/now/table/system/2696f18b376bca0 | 2696f18b376bca0 |
| https://hi.com/api/now/table/system/99b27bc1db761f4 | 99b27bc1db761f4 | https://hi.com/api/now/table/system/99b27bc1db761f4 | 99b27bc1db761f4 |
I used the code below, but wasn't successful.
import ast
from pandas.io.json import json_normalize
df12 = spark.sql("""select u_group,t_group from tbl""")
def only_dict(d):
'''
Convert json string representation of dictionary to a python dict
'''
return ast.literal_eval(d)
def list_of_dicts(ld):
'''
Create a mapping of the tuples formed after
converting json strings of list to a python list
'''
return dict([(list(d.values())[1], list(d.values())[0]) for d in ast.literal_eval(ld)])
A = json_normalize(df12['u_group'].apply(only_dict).tolist()).add_prefix('link.')
B = json_normalize(df['u_group'].apply(list_of_dicts).tolist()).add_prefix('value.')
TypeError: 'Column' object is not callable
Kindly help or suggest if any other code would work better.
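As context for the error: df12 is a Spark DataFrame, so df12['u_group'] is a Column; .apply on a Column is interpreted as a nested-field reference, and calling it raises "'Column' object is not callable". A minimal sketch of the pandas route, assuming u_group and t_group hold JSON strings (otherwise substitute ast.literal_eval), converts to pandas first:
import json
import pandas as pd

pdf = df12.toPandas()  # bring the two columns into pandas
for c in ['u_group', 't_group']:
    parsed = pdf[c].apply(json.loads)  # one dict per row
    expanded = pd.json_normalize(parsed.tolist()).add_prefix(c + '.')
    pdf = pd.concat([pdf.drop(columns=c), expanded], axis=1)
print(pdf)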
Here is a simple example and code for an answer.
Example:
data = [[{'link':'A1', 'value':'B1'}, {'link':'A2', 'value':'B2'}],
[{'link':'C1', 'value':'D1'}, {'link':'C2', 'value':'D2'}]]
df = pd.DataFrame(data, columns=['u', 't'])
output(df):
u t
0 {'link': 'A1', 'value': 'B1'} {'link': 'A2', 'value': 'B2'}
1 {'link': 'C1', 'value': 'D1'} {'link': 'C2', 'value': 'D2'}
Use the following code:
pd.concat([df[i].apply(lambda x: pd.Series(x)).add_prefix(i + '_') for i in df.columns], axis=1)
output:
u_link u_value t_link t_value
0 A1 B1 A2 B2
1 C1 D1 C2 D2
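An alternative, assuming each cell already holds a dict, is pd.json_normalize instead of apply(pd.Series), which is usually faster on larger frames:
import pandas as pd

# expand each dict column and prefix the new columns with the original column name
pd.concat([pd.json_normalize(df[c].tolist()).add_prefix(c + '_') for c in df.columns], axis=1)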
Here are my 2 cents.
A simple way to achieve this is with PySpark.
Create the dataframe as follows:
data = [
(
"""{"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"}""",
"""{"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"}"""
),
(
"""{"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"}""",
"""{"link": "https://hi.com/api/now/table/system/99b27bc1db761f4", "value": "99b27bc1db761f4"}"""
)
]
df = spark.createDataFrame(data,schema=['u_group','t_group'])
Then use from_json() to parse the dictionaries and fetch the individual values as follows:
from pyspark.sql.types import *
from pyspark.sql.functions import *
schema_column = StructType([
StructField("link",StringType(),True),
StructField("value",StringType(),True),
])
df = df.withColumn('U_GROUP_PARSE', from_json(col('u_group'), schema_column))\
       .withColumn('T_GROUP_PARSE', from_json(col('t_group'), schema_column))\
       .withColumn('U_GROUP.LINK', col("U_GROUP_PARSE.link"))\
       .withColumn('U_GROUP.VALUE', col("U_GROUP_PARSE.value"))\
       .withColumn('T_GROUP.LINK', col("T_GROUP_PARSE.link"))\
       .withColumn('T_GROUP.VALUE', col("T_GROUP_PARSE.value"))\
       .drop('u_group', 't_group', 'U_GROUP_PARSE', 'T_GROUP_PARSE')
Print the dataframe
df.show(truncate=False)
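One caveat: because the new column names contain dots, they must be referenced with backticks afterwards, for example:
df.select(col('`U_GROUP.LINK`'), col('`U_GROUP.VALUE`')).show(truncate=False)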
I am trying to request data from a third-party API; the endpoint looks like this:
GET /api/v4/dblines
Suppose we have to insert data starting from 2020-04-12 07:07:00 and we already have the original data before that DateTime. The API limit is 1000 records per request; how do I keep the script running and insert the real-time data one by one, all the time?
Below is the sample JSON data and my sample code:
from sqlalchemy import create_engine
import pandas as pd
import requests
# getting json using requests
# data = requests.get('https://api.example.com/api/v4/dblines?a=ABC123&b=1min&startTime=1586646420000&limit=1000').json()
# data example
data = [
[
1593649440000,
"2.9923453200",
"2.9923453200",
"2.0045299700",
"2.0045299700",
"2.2400009700",
1593649499999,
"2.0010870500",
2,
"2.0300009700",
"2.0001359600",
"0"
],
[
1593649500000,
"2.9923453297",
"2.9923453297",
"2.9923453297",
"2.9923453297",
"25.950000970",
1593649559999,
"2.1176054000",
4,
"25.950000970",
"2.1176054000",
"0"
]
]
# create df using json from API
df = pd.DataFrame(
data,
columns=[
# change aliases to column names here...
'start_time',
'alias1',
'alias2',
'alias3',
'alias4',
'alias5',
'end_time',
'alias6',
'alias7',
'alias8',
'alias9',
'alias10',
]
)
# drop unnecessary columns, duplicates, processing df, blabla etc...
# and init db connection
engine = create_engine('mysql+pymysql://{user}:{pw}@localhost/{db}'
                       .format(user='db_user',
                               pw='db_password',
                               db='db_name'))
# insert df into db using connection
# change if_exists if you need
df.to_sql(con=engine, name='table_name_here', if_exists='replace') # original data can't be replaced, how to change here?
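A minimal polling sketch follows, assuming the API returns at most `limit` rows ordered by start_time and accepts startTime in epoch milliseconds; switching if_exists to 'append' keeps the original data instead of replacing it:
import time

last_ts = 1586646420000  # 2020-04-12 07:07:00 in epoch ms; start right after the existing data
while True:
    url = ('https://api.example.com/api/v4/dblines'
           f'?a=ABC123&b=1min&startTime={last_ts + 1}&limit=1000')
    rows = requests.get(url).json()
    if rows:
        batch = pd.DataFrame(rows, columns=df.columns)  # reuse the column aliases defined above
        # 'append' adds the new rows instead of overwriting the table
        batch.to_sql(con=engine, name='table_name_here', if_exists='append', index=False)
        last_ts = rows[-1][0]  # advance the cursor to the newest start_time
    time.sleep(60)  # poll once a minute; adjust to the API's rate limit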
I've tried to follow a bunch of answers I've seen on SO, but I'm really stuck here. I'm trying to convert a CSV to JSON.
The JSON schema has multiple levels of nesting and some of the values in the CSV will be shared.
Here's a link to one record in the CSV.
Think of this sample as two different parties attached to one document.
The fields on the document (document_source_id, document_amount, record_date, source_url, document_file_url, document_type__title, apn, situs_county_id, state_code) should not duplicate.
While the fields of each entity are unique.
I've tried to nest these using a complex groupby statement, but am stuck getting the data into my schema.
Here's what I've tried. It doesn't contain all fields because I'm having a difficult time understanding what it all means.
j = (df.groupby(['state_code',
'record_date',
'situs_county_id',
'document_type__title',
'document_file_url',
'document_amount',
'source_url'], as_index=False)
.apply(lambda x: x[['source_url']].to_dict('r'))
.reset_index()
.rename(columns={0:'metadata', 1:'parcels'})
.to_json(orient='records'))
Here's how the sample CSV should output
{
"metadata":{
"source_url":"https://a836-acris.nyc.gov/DS/DocumentSearch/DocumentDetail?doc_id=2019012901225004",
"document_file_url":"https://a836-acris.nyc.gov/DS/DocumentSearch/DocumentImageView?doc_id=2019012901225004"
},
"state_code":"NY",
"nested_data":{
"parcels":[
{
"apn":"3972-61",
"situs_county_id":"36005"
}
],
"participants":[
{
"entity":{
"name":"5 AIF WILLOW, LLC",
"situs_street":"19800 MACARTHUR BLVD",
"situs_city":"IRVINE",
"situs_unit":"SUITE 1150",
"state_code":"CA",
"situs_zip":"92612"
},
"participation_type":"Grantee"
},
{
"entity":{
"name":"5 ARCH INCOME FUND 2, LLC",
"situs_street":"19800 MACARTHUR BLVD",
"situs_city":"IRVINE",
"situs_unit":"SUITE 1150",
"state_code":"CA",
"situs_zip":"92612"
},
"participation_type":"Grantor"
}
]
},
"record_date":"01/31/2019",
"situs_county_id":"36005",
"document_source_id":"2019012901225004",
"document_type__title":"ASSIGNMENT, MORTGAGE"
}
You might need to use the json_normalize function from pandas.io.json
from pandas.io.json import json_normalize
import csv
li = []
with open('filename.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        li.append(row)
df = json_normalize(li)
Here, we create a list of dictionaries from the CSV file and build a dataframe with the json_normalize function.
Below is one way to export your data:
# all columns used in groupby()
grouped_cols = ['state_code', 'record_date', 'situs_county_id', 'document_source_id'
, 'document_type__title', 'source_url', 'document_file_url']
# adjust some column names to map to those in the 'entity' node in the desired JSON
situs_mapping = {
'street_number_street_name': 'situs_street'
, 'city_name': 'situs_city'
, 'unit': 'situs_unit'
, 'state_code': 'state_code'
, 'zipcode_full': 'situs_zip'
}
# define columns used for the 'entity' node (Python 2 needs the alternative syntax below)
entity_cols = ['name', *situs_mapping.values()]
# below for Python 2:
# entity_cols = ['name'] + list(situs_mapping.values())
# specify output fields
output_cols = ['metadata','state_code','nested_data','record_date'
, 'situs_county_id', 'document_source_id', 'document_type__title']
# define a function to get nested_data
def get_nested_data(d):
    return {
        'parcels': d[['apn', 'situs_county_id']].drop_duplicates().to_dict('r')
        , 'participants': d[['entity', 'participation_type']].to_dict('r')
    }
j = (df.rename(columns=situs_mapping)
       .assign(entity=lambda x: x[entity_cols].to_dict('r'))
       .groupby(grouped_cols)
       .apply(get_nested_data)
       .reset_index()
       .rename(columns={0: 'nested_data'})
       .assign(metadata=lambda x: x[['source_url', 'document_file_url']].to_dict('r'))[output_cols]
       .to_json(orient="records")
)
print(j)
Note: if participants contains duplicates and you must run drop_duplicates() on it as we do on parcels, then the assign(entity) step can be moved into the definition of participants inside the get_nested_data() function:
    , 'participants': d[['participation_type', *entity_cols]] \
        .drop_duplicates() \
        .assign(entity=lambda x: x[entity_cols].to_dict('r')) \
        .loc[:, ['entity', 'participation_type']] \
        .to_dict('r')
My data structure is defined approximately as follows:
schema = StructType([
    # ... fields skipped
    StructField("extra_features",
                ArrayType(StructType([
                    StructField("key", StringType(), False),
                    StructField("value", StringType(), True)
                ])), nullable=False)
])
Now, I'd like to search for entries in a data frame where a struct {"key": "somekey", "value": "somevalue"} exists in the array column. How do I do this?
Spark has a function array_contains that can be used to check the contents of an ArrayType column, but unfortunately it doesn't seem like it can handle arrays of complex types. It is possible to do it with a UDF (User Defined Function) however:
from pyspark.sql.types import *
from pyspark.sql import Row
import pyspark.sql.functions as F
schema = StructType([StructField("extra_features", ArrayType(StructType([
    StructField("key", StringType(), False),
    StructField("value", StringType(), True)])),
    False)])
df = spark.createDataFrame([
    Row([{'key': 'a', 'value': '1'}]),
    Row([{'key': 'b', 'value': '2'}])], schema)
# UDF to check whether {'key': 'a', 'value': '1'} is in an array
# The actual data of a (nested) StructType value is a Row
contains_keyval = F.udf(lambda extra_features: Row(key='a', value='1') in extra_features, BooleanType())
df.where(contains_keyval(df.extra_features)).collect()
This results in:
[Row(extra_features=[Row(key=u'a', value=u'1')])]
You can also use the UDF to add another column that indicates whether the key-value pair is present:
df.withColumn('contains_it', contains_keyval(df.extra_features)).collect()
results in:
[Row(extra_features=[Row(key=u'a', value=u'1')], contains_it=True),
Row(extra_features=[Row(key=u'b', value=u'2')], contains_it=False)]
Since Spark 2.4.0 you can use the function exists.
Example with SparkSQL:
SELECT
  EXISTS(
    ARRAY(named_struct("key", "a", "value", "1"), named_struct("key", "b", "value", "2")),
    x -> x = named_struct("key", "a", "value", "1")
  )
Example with PySpark:
df.filter('exists(extra_features, x -> x = named_struct("key", "a", "value", "1"))')
Note that not all the functions that manipulate arrays start with array_*.
E.g. exists, filter, size, ...
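For the DataFrame API, a possible equivalent sketch (assuming Spark >= 3.1, where pyspark.sql.functions.exists is available) is:
import pyspark.sql.functions as F

# struct literal to compare against each array element
target = F.struct(F.lit('a').alias('key'), F.lit('1').alias('value'))
df.where(F.exists('extra_features', lambda x: x == target)).collect()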