Im trying to flatten 2 columns from a table loaded into a dataframe as below:
u_group
t_group
{"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"}
{"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"}
{"link": "https://hi.com/api/now/table/system/99b27bc1db761f4", "value": "99b27bc1db761f4"}
{"link": "https://hi.com/api/now/table/system/99b27bc1db761f4", "value": "99b27bc1db761f4"}
I want to separate them and get them as:
u_group.link
u_group.value
t_group.link
t_group.value
https://hi.com/api/now/table/system/2696f18b376bca0
2696f18b376bca0
https://hi.com/api/now/table/system/2696f18b376bca0
2696f18b376bca0
https://hi.com/api/now/table/system/99b27bc1db761f4
99b27bc1db761f4
https://hi.com/api/now/table/system/99b27bc1db761f4
99b27bc1db761f4
I used the below code, but wasnt successful.
import ast
from pandas.io.json import json_normalize
df12 = spark.sql("""select u_group,t_group from tbl""")
def only_dict(d):
'''
Convert json string representation of dictionary to a python dict
'''
return ast.literal_eval(d)
def list_of_dicts(ld):
'''
Create a mapping of the tuples formed after
converting json strings of list to a python list
'''
return dict([(list(d.values())[1], list(d.values())[0]) for d in ast.literal_eval(ld)])
A = json_normalize(df12['u_group'].apply(only_dict).tolist()).add_prefix('link.')
B = json_normalize(df['u_group'].apply(list_of_dicts).tolist()).add_prefix('value.')
TypeError: 'Column' object is not callable
Kindly help or suggest if any other code would work better.
need simple example and code for answer
example:
data = [[{'link':'A1', 'value':'B1'}, {'link':'A2', 'value':'B2'}],
[{'link':'C1', 'value':'D1'}, {'link':'C2', 'value':'D2'}]]
df = pd.DataFrame(data, columns=['u', 't'])
output(df):
u t
0 {'link': 'A1', 'value': 'B1'} {'link': 'A2', 'value': 'B2'}
1 {'link': 'C1', 'value': 'D1'} {'link': 'C2', 'value': 'D2'}
use following code:
pd.concat([df[i].apply(lambda x: pd.Series(x)).add_prefix(i + '_') for i in df.columns], axis=1)
output:
u_link u_value t_link t_value
0 A1 B1 A2 B2
1 C1 D1 C2 D2
Here are my 2 cents,
A simple way to achieve this using PYSPARK.
Create the dataframe as follows:
data = [
(
"""{"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"}""",
"""{"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"}"""
),
(
"""{"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"}""",
"""{"link": "https://hi.com/api/now/table/system/99b27bc1db761f4", "value": "99b27bc1db761f4"}"""
)
]
df = spark.createDataFrame(data,schema=['u_group','t_group'])
Then use the from_json() to parse the dictionary and fetch the individual values as follows:
from pyspark.sql.types import *
from pyspark.sql.functions import *
schema_column = StructType([
StructField("link",StringType(),True),
StructField("value",StringType(),True),
])
df = df .withColumn('U_GROUP_PARSE',from_json(col('u_group'),schema_column))\
.withColumn('T_GROUP_PARSE',from_json(col('t_group'),schema_column))\
.withColumn('U_GROUP.LINK',col("U_GROUP_PARSE.link"))\
.withColumn('U_GROUP.VALUE',col("U_GROUP_PARSE.value"))\
.withColumn('T_GROUP.LINK',col("T_GROUP_PARSE.link"))\
.withColumn('T_GROUP.VALUE',col("T_GROUP_PARSE.value"))\
.drop('u_group','t_group','U_GROUP_PARSE','T_GROUP_PARSE')
Print the dataframe
df.show(truncate=False)
Please check the below image for your reference:
Related
I've been using pandas' json_normalize for a bit but ran into a problem with specific json file, similar to the one seen here: https://github.com/pandas-dev/pandas/issues/37783#issuecomment-1148052109
I'm trying to find a way to retrieve the data within the Ats -> Ats dict and return any null values (like the one seen in the ID:101 entry) as NaN values in the dataframe. Ignoring errors within the json_normalize call doesn't prevent the TypeError that stems from trying to iterate through a null value.
Any advice or methods to receive a valid dataframe out of data with this structure is greatly appreciated!
import json
import pandas as pd
data = """[
{
"ID": "100",
"Ats": {
"Ats": [
{
"Name": "At1",
"Desc": "Lazy At"
}
]
}
},
{
"ID": "101",
"Ats": null
}
]"""
data = json.loads(data)
df = pd.json_normalize(data, ["Ats", "Ats"], "ID", errors='ignore')
df.head()
TypeError: 'NoneType' object is not iterable
I tried to iterate through the Ats dictionary, which would work normally for the data with ID 100 but not with ID 101. I expected ignoring errors within the function to return a NaN value in a dataframe but instead received a TypeError for trying to iterate through a null value.
The desired output would look like this: Dataframe
This approach can be more efficient when it comes to dealing with large datasets.
data = json.loads(data)
desired_data = list(
map(lambda x: pd.json_normalize(x, ["Ats", "Ats"], "ID").to_dict(orient="records")[0]
if x["Ats"] is not None
else {"ID": x["ID"], "Name": np.nan, "Desc": np.nan}, data))
df = pd.DataFrame(desired_data)
Output:
Name Desc ID
0 At1 Lazy At 100
1 NaN NaN 101
You might want to consider using this simple try and except approach when working with small datasets. In this case, whenever an error is found it should append new row to DataFrame with NAN.
Example:
data = json.loads(data)
df = pd.DataFrame()
for item in data:
try:
df = df.append(pd.json_normalize(item, ["Ats", "Ats"], "ID"))
except TypeError:
df = df.append({"ID" : item["ID"], "Name": np.nan, "Desc": np.nan}, ignore_index=True)
print(df)
Output:
Name Desc ID
0 At1 Lazy At 100
1 NaN NaN 101
Maybe you can create a DataFrame from the data normally (without pd.json_normalize) and then transform it to requested form afterwards:
import json
import pandas as pd
data = """\
[
{
"ID": "100",
"Ats": {
"Ats": [
{
"Name": "At1",
"Desc": "Lazy At"
}
]
}
},
{
"ID": "101",
"Ats": null
}
]"""
data = json.loads(data)
df = pd.DataFrame(data)
df["Ats"] = df["Ats"].str["Ats"]
df = df.explode("Ats")
df = pd.concat([df, df.pop("Ats").apply(pd.Series, dtype=object)], axis=1)
print(df)
Prints:
ID Name Desc
0 100 At1 Lazy At
1 101 NaN NaN
Say I have a DataFrame defined as:
df = {
"customer_name":"john",
"phone":{
"mobile":000,
"office":111
},
"mail":{
"office":"john#office.com",
"personal":"john#home.com",
"fax":"12345"
}
}
I want to somehow alter the value in column "mail" to remove the key "fax". Eg, the output DataFrame would be something like:
output_df = {
"customer_name":"john",
"phone":{
"mobile":000,
"office":111
},
"mail":{
"office":"john#office.com",
"personal":"john#home.com"
}
}
where the "fax" key-value pair has been deleted. I tried to use pandas.map with a dict in the lambda, but it does not work. One bad workaround I had was to normalize the dict, but this created unnecessary output columns, and I could not merge them back. Eg.;
df = pd.json_normalize(df)
Is there a better way for this?
You can use pop to remove a element from dict having the given key.
import pandas as pd
df['mail'].pop('fax')
df = pd.json_normalize(df)
df
Output:
customer_name phone.mobile phone.office mail.office mail.personal
0 john 0 111 john#office.com john#home.com
Is there a reason you just don't access it directly and delete it?
Like this:
del df['mail']['fax']
print(df)
{'customer_name': 'john',
'phone': {'mobile': 0, 'office': 111},
'mail': {'office': 'john#office.com', 'personal': 'john#home.com'}}
This is the simplest technique to achieve your aim.
import pandas as pd
import numpy as np
df = {
"customer_name":"john",
"phone":{
"mobile":000,
"office":111
},
"mail":{
"office":"john#office.com",
"personal":"john#home.com",
"fax":"12345"
}
}
del df['mail']['fax']
df = pd.json_normalize(df)
df
Output :
customer_name phone.mobile phone.office mail.office mail.personal
0 john 0 111 john#office.com john#home.com
I have two DataFrames, and I want to post these DataFrames as json (to the web service) but first I have to concatenate them as json.
#first df
input_df = pd.DataFrame()
input_df['first'] = ['a', 'b']
input_df['second'] = [1, 2]
#second df
customer_df = pd.DataFrame()
customer_df['first'] = ['c']
customer_df['second'] = [3]
For converting to json, I used following code for each DataFrame;
df.to_json(
path_or_buf='out.json',
orient='records', # other options are (split’, ‘records’, ‘index’, ‘columns’, ‘values’, ‘table’)
date_format='iso',
force_ascii=False,
default_handler=None,
lines=False,
indent=2
)
This code gives me the table like this: For ex, input_df export json
[
{
"first":"a",
"second":1
},
{
"first":"b",
"second":2
}
]
my desired output is like that:
{
"input": [
{
"first": "a",
"second": 1
},
{
"first": "b",
"second": 2
}
],
"customer": [
{
"first": "d",
"second": 3
}
]
}
How can I get this output like this? I couldn't find the way :(
You can concatenate the DataFrames with appropriate key names, then groupby the keys and build dictionaries at each group; finally build a json string from the entire thing:
out = (
pd.concat([input_df, customer_df], keys=['input', 'customer'])
.droplevel(1)
.groupby(level=0).apply(lambda x: x.to_dict('records'))
.to_json()
)
Output:
'{"customer":[{"first":"c","second":3}],"input":[{"first":"a","second":1},{"first":"b","second":2}]}'
or a dict by replacing the last to_json() to to_dict().
I have the data in CSV format, for example:
First row is column number, let's ignore that.
Second row, starting at Col_4, are number of days
Third row on: Col_1 and Col_2 are coordinates (lon, lat), Col_3 is a statistical value, Col_4 onwards are measurements.
As you can see, this format is a confusing mess. I would like to convert this to JSON the following way, for example:
{"points":{
"dates": ["20190103", "20190205"],
"0":{
"lon": "-8.072557",
"lat": "41.13702",
"measurements": ["-0.191792","-10.543130"],
"val": "-1"
},
"1":{
"lon": "-8.075557",
"lat": "41.15702",
"measurements": ["-1.191792","-2.543130"],
"val": "-9"
}
}
}
To summarise what I've done till now, I read the CSV to a Pandas DataFrame:
df = pandas.read_csv("sample.csv")
I can extract the dates into a Numpy Array with:
dates = df.iloc[0][3:].to_numpy()
I can extract measurements for all points with:
measurements_all = df.iloc[1:,3:].to_numpy()
And the lon and lat and val, respectively, with:
lon_all = df.iloc[1:,0:1].to_numpy()
lat_all = df.iloc[1:,1:2].to_numpy()
val_all = df.iloc[1:,2:3].to_numpy()
Can anyone explain how I can format this info a structure identical to the .json example?
With this dataframe
df = pd.DataFrame([{"col1": 1, "col2": 2, "col3": 3, "col4":3, "col5":5},
{"col1": None, "col2": None, "col3": None, "col4":20190103, "col5":20190205},
{"col1": -8.072557, "col2": 41.13702, "col3": -1, "col4":-0.191792, "col5":-10.543130},
{"col1": -8.072557, "col2": 41.15702, "col3": -9, "col4":-0.191792, "col5":-2.543130}])
This code does what you want, although it is not converting anything to strings as in you example. But you should be able to easily do it, if necessary.
# generate dict with dates
final_dict = {"dates": [list(df["col4"])[1],list(df["col5"])[1]]}
# iterate over relevant rows and generate dicts
for i in range(2,len(df)):
final_dict[i-2] = {"lon": df["col1"][i],
"lat": df["col2"][i],
"measurements": [df[cname][i] for cname in ["col4", "col5"]],
"val": df["col3"][i]
}
this leads to this output:
{0: {'lat': 41.13702,
'lon': -8.072557,
'measurements': [-0.191792, -10.54313],
'val': -1.0},
1: {'lat': 41.15702,
'lon': -8.072557,
'measurements': [-0.191792, -2.54313],
'val': -9.0},
'dates': [20190103.0, 20190205.0]}
Extracting dates from data and then eliminating first row from data frame:
dates =list(data.iloc[0][3:])
data=data.iloc[1:]
Inserting dates into dict:
points={"dates":dates}
Iterating through data frame and adding elements to the dictionary:
i=0
for index, row in data.iterrows():
element= {"lon":row["Col_1"],
"lat":row["Col_2"],
"measurements": [row["Col_3"], row["Col_4"]]}
points[str(i)]=element
i+=1
You can convert dict to string object using json.dumps():
points_json = json.dumps(points)
It will be string object, not json(dict) object. More about that in this post Converting dictionary to JSON
I converted the pandas dataframe values to a list, and then loop through one of the lists, and add the lists to a nested JSON object containing the values.
import pandas
import json
import argparse
import sys
def parseInput():
parser = argparse.ArgumentParser(description="Convert CSV measurements to JSON")
parser.add_argument(
'-i', "--input",
help="CSV input",
required=True,
type=argparse.FileType('r')
)
parser.add_argument(
'-o', "--output",
help="JSON output",
type=argparse.FileType('w'),
default=sys.stdout
)
return parser.parse_args()
def main():
args = parseInput()
input_file = args.input
output = args.output
dataframe = pandas.read_csv(input_file)
longitudes = dataframe.iloc[1:,0:1].T.values.tolist()[0]
latitudes = dataframe.iloc[1:,1:2].T.values.tolist()[0]
averages = dataframe.iloc[1:,2:3].T.values.tolist()[0]
measurements = dataframe.iloc[1:,3:].values.tolist()
dates=dataframe.iloc[0][3:].values.tolist()
points={"dates":dates}
for index, val in enumerate(longitudes):
entry = {
"lon":longitudes[index],
"lat":latitudes[index],
"measurements":measurements[index],
"average":averages[index]
}
points[str(index)] = entry
json.dump(points, output)
if __name__ == "__main__":
main()
I've tried to follow a bunch of answers I've seen on SO, but I'm really stuck here. I'm trying to convert a CSV to JSON.
The JSON schema has multiple levels of nesting and some of the values in the CSV will be shared.
Here's a link to one record in the CSV.
Think of this sample as two different parties attached to one document.
The fields on the document (document_source_id, document_amount, record_date, source_url, document_file_url, document_type__title, apn, situs_county_id, state_code) should not duplicate.
While the fields of each entity are unique.
I've tried to nest these using a complex groupby statement, but am stuck getting the data into my schema.
Here's what I've tried. It doesn't contain all fields because I'm having a difficult time understanding what it all means.
j = (df.groupby(['state_code',
'record_date',
'situs_county_id',
'document_type__title',
'document_file_url',
'document_amount',
'source_url'], as_index=False)
.apply(lambda x: x[['source_url']].to_dict('r'))
.reset_index()
.rename(columns={0:'metadata', 1:'parcels'})
.to_json(orient='records'))
Here's how the sample CSV should output
{
"metadata":{
"source_url":"https://a836-acris.nyc.gov/DS/DocumentSearch/DocumentDetail?doc_id=2019012901225004",
"document_file_url":"https://a836-acris.nyc.gov/DS/DocumentSearch/DocumentImageView?doc_id=2019012901225004"
},
"state_code":"NY",
"nested_data":{
"parcels":[
{
"apn":"3972-61",
"situs_county_id":"36005"
}
],
"participants":[
{
"entity":{
"name":"5 AIF WILLOW, LLC",
"situs_street":"19800 MACARTHUR BLVD",
"situs_city":"IRVINE",
"situs_unit":"SUITE 1150",
"state_code":"CA",
"situs_zip":"92612"
},
"participation_type":"Grantee"
},
{
"entity":{
"name":"5 ARCH INCOME FUND 2, LLC",
"situs_street":"19800 MACARTHUR BLVD",
"situs_city":"IRVINE",
"situs_unit":"SUITE 1150",
"state_code":"CA",
"situs_zip":"92612"
},
"participation_type":"Grantor"
}
]
},
"record_date":"01/31/2019",
"situs_county_id":"36005",
"document_source_id":"2019012901225004",
"document_type__title":"ASSIGNMENT, MORTGAGE"
}
You might need to use the json_normalize function from pandas.io.json
from pandas.io.json import json_normalize
import csv
li = []
with open('filename.csv', 'r') as f:
reader = csv.DictReader(csvfile)
for row in reader:
li.append(row)
df = json_normalize(li)
Here , we are creating a list of dictionaries from the csv file and creating a dataframe from the function json_normalize.
Below is one way to export your data:
# all columns used in groupby()
grouped_cols = ['state_code', 'record_date', 'situs_county_id', 'document_source_id'
, 'document_type__title', 'source_url', 'document_file_url']
# adjust some column names to map to those in the 'entity' node in the desired JSON
situs_mapping = {
'street_number_street_name': 'situs_street'
, 'city_name': 'situs_city'
, 'unit': 'situs_unit'
, 'state_code': 'state_code'
, 'zipcode_full': 'situs_zip'
}
# define columns used for 'entity' node. python 2 need to adjust to the syntax
entity_cols = ['name', *situs_mapping.values()]
#below for python 2#
#entity_cols = ['name'] + list(situs_mapping.values())
# specify output fields
output_cols = ['metadata','state_code','nested_data','record_date'
, 'situs_county_id', 'document_source_id', 'document_type__title']
# define a function to get nested_data
def get_nested_data(d):
return {
'parcels': d[['apn', 'situs_county_id']].drop_duplicates().to_dict('r')
, 'participants': d[['entity', 'participation_type']].to_dict('r')
}
j = (df.rename(columns=situs_mapping)
.assign(entity=lambda x: x[entity_cols].to_dict('r'))
.groupby(grouped_cols)
.apply(get_nested_data)
.reset_index()
.rename(columns={0:'nested_data'})
.assign(metadata=lambda x: x[['source_url', 'document_file_url']].to_dict('r'))[output_cols]
.to_json(orient="records")
)
print(j)
Note: If participants contain duplicates and must run drop_duplicates() as we do on parcels, then assign(entity) can be moved to defining the participants in the get_nested_data() function:
, 'participants': d[['participation_type', *entity_cols]] \
.drop_duplicates() \
.assign(entity=lambda x: x[entity_cols].to_dict('r')) \
.loc[:,['entity', 'participation_type']] \
.to_dict('r')