  Groups sub-groups selections
0    sg1       csg1        sc1
1    sg1       csg1        sc2
2    sg1       csg2        sc3
3    sg1       csg2        sc4
4    sg2       csg3        sc5
5    sg2       csg3        sc6
6    sg2       csg4        sc7
7    sg2       csg4        sc8
I have the dataframe mentioned above and I am trying to create a JSON object as follows:
{
"sg1": {
"csg1": ['sc1', 'sc2'],
"csg2": ['sc3', 'sc4']
},
"sg2": {
"csg3": ['sc5', 'sc6'],
"csg4": ['sc7', 'sc8']
}
}
I tried using the pandas to_json and to_dict with orient arguments but I am not getting the expected result. I also tried grouping by the columns and then creating the list and converting it into a JSON.
Any help is much appreciated.
You can groupby ['Groups','sub-groups'], aggregate the selections into lists, and then build a nested dictionary from the resulting MultiIndex Series. Note that a flat comprehension like {k1: {k2: v}} overwrites the inner dict on every iteration and keeps only the last sub-group per group, so build the inner dicts incrementally:
s = df.groupby(['Groups','sub-groups']).selections.agg(list)
d = {}
for (k1, k2), v in s.items():  # .iteritems() was removed in pandas 2.0
    d.setdefault(k1, {})[k2] = v
print(d)
# {'sg1': {'csg1': ['sc1', 'sc2'], 'csg2': ['sc3', 'sc4']},
#  'sg2': {'csg3': ['sc5', 'sc6'], 'csg4': ['sc7', 'sc8']}}
You need to group on the columns of interest; for example:
import pandas as pd
data = {
'Groups': ['sg1', 'sg1', 'sg1', 'sg1', 'sg2', 'sg2', 'sg2', 'sg2'],
'sub-groups': ['csg1', 'csg1', 'csg2', 'csg2', 'csg3', 'csg3', 'csg4', 'csg4'],
'selections': ['sc1', 'sc2', 'sc3', 'sc4', 'sc5', 'sc6', 'sc7', 'sc8']
}
df = pd.DataFrame(data)
print(df.groupby(['Groups', 'sub-groups'])['selections'].unique().to_dict())
The output is:
{
('sg1', 'csg1'): array(['sc1', 'sc2'], dtype=object),
('sg1', 'csg2'): array(['sc3', 'sc4'], dtype=object),
('sg2', 'csg3'): array(['sc5', 'sc6'], dtype=object),
('sg2', 'csg4'): array(['sc7', 'sc8'], dtype=object)
}
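If the nested shape from the question is needed, the tuple keys can then be folded into per-group dictionaries. A minimal sketch, starting from a plain dict shaped like the output above (lists instead of numpy arrays):

```python
from collections import defaultdict

# Tuple-keyed dict as produced by .to_dict() above.
flat = {
    ('sg1', 'csg1'): ['sc1', 'sc2'],
    ('sg1', 'csg2'): ['sc3', 'sc4'],
    ('sg2', 'csg3'): ['sc5', 'sc6'],
    ('sg2', 'csg4'): ['sc7', 'sc8'],
}

# Fold (group, subgroup) keys into {group: {subgroup: values}}.
nested = defaultdict(dict)
for (group, subgroup), values in flat.items():
    nested[group][subgroup] = list(values)

print(dict(nested))
# {'sg1': {'csg1': ['sc1', 'sc2'], 'csg2': ['sc3', 'sc4']},
#  'sg2': {'csg3': ['sc5', 'sc6'], 'csg4': ['sc7', 'sc8']}}
```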
Let's try a dictify function, which builds a nested dictionary with top-level keys from Groups and corresponding sub-level keys from sub-groups:
from collections import defaultdict

def dictify():
    dct = defaultdict(dict)
    for (x, y), g in df.groupby(['Groups', 'sub-groups']):
        dct[x][y] = [*g['selections']]
    return dict(dct)

# dictify()
{
"sg1": {
"csg1": ["sc1","sc2"],
"csg2": ["sc3","sc4"]
},
"sg2": {
"csg3": ["sc5","sc6"],
"csg4": ["sc7","sc8"]
}
}
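Since the goal is a JSON object rather than a Python dict, the nested result can then be serialized with the standard library. A quick sketch:

```python
import json

# Nested dict as returned by dictify() above.
d = {'sg1': {'csg1': ['sc1', 'sc2'], 'csg2': ['sc3', 'sc4']},
     'sg2': {'csg3': ['sc5', 'sc6'], 'csg4': ['sc7', 'sc8']}}

# json.dumps emits valid JSON (double-quoted keys, arrays for lists).
text = json.dumps(d, indent=2)
print(text)
```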
Related
I'm trying to flatten 2 columns from a table loaded into a dataframe as below:
u_group                                                                                      t_group
{"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"}  {"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"}
{"link": "https://hi.com/api/now/table/system/99b27bc1db761f4", "value": "99b27bc1db761f4"}  {"link": "https://hi.com/api/now/table/system/99b27bc1db761f4", "value": "99b27bc1db761f4"}
I want to separate them and get them as:
u_group.link                                         u_group.value    t_group.link                                         t_group.value
https://hi.com/api/now/table/system/2696f18b376bca0  2696f18b376bca0  https://hi.com/api/now/table/system/2696f18b376bca0  2696f18b376bca0
https://hi.com/api/now/table/system/99b27bc1db761f4  99b27bc1db761f4  https://hi.com/api/now/table/system/99b27bc1db761f4  99b27bc1db761f4
I used the below code, but wasn't successful.
import ast
from pandas.io.json import json_normalize

df12 = spark.sql("""select u_group,t_group from tbl""")

def only_dict(d):
    '''
    Convert json string representation of dictionary to a python dict
    '''
    return ast.literal_eval(d)

def list_of_dicts(ld):
    '''
    Create a mapping of the tuples formed after
    converting json strings of list to a python list
    '''
    return dict([(list(d.values())[1], list(d.values())[0]) for d in ast.literal_eval(ld)])

A = json_normalize(df12['u_group'].apply(only_dict).tolist()).add_prefix('link.')
B = json_normalize(df['u_group'].apply(list_of_dicts).tolist()).add_prefix('value.')
TypeError: 'Column' object is not callable
Kindly help or suggest if any other code would work better.
Here is a simple example.
Example data:
data = [[{'link':'A1', 'value':'B1'}, {'link':'A2', 'value':'B2'}],
[{'link':'C1', 'value':'D1'}, {'link':'C2', 'value':'D2'}]]
df = pd.DataFrame(data, columns=['u', 't'])
output(df):
u t
0 {'link': 'A1', 'value': 'B1'} {'link': 'A2', 'value': 'B2'}
1 {'link': 'C1', 'value': 'D1'} {'link': 'C2', 'value': 'D2'}
Use the following code:
pd.concat([df[i].apply(lambda x: pd.Series(x)).add_prefix(i + '_') for i in df.columns], axis=1)
output:
u_link u_value t_link t_value
0 A1 B1 A2 B2
1 C1 D1 C2 D2
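An equivalent route (assuming, as in the example above, the cells are already Python dicts rather than JSON strings) is pd.json_normalize per column:

```python
import pandas as pd

data = [[{'link': 'A1', 'value': 'B1'}, {'link': 'A2', 'value': 'B2'}],
        [{'link': 'C1', 'value': 'D1'}, {'link': 'C2', 'value': 'D2'}]]
df = pd.DataFrame(data, columns=['u', 't'])

# Normalize each dict column into flat columns, prefix with the column
# name, then stitch the pieces back together side by side.
out = pd.concat(
    [pd.json_normalize(df[c].tolist()).add_prefix(c + '_') for c in df.columns],
    axis=1,
)
print(out)
#   u_link u_value t_link t_value
# 0     A1      B1     A2      B2
# 1     C1      D1     C2      D2
```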
Here are my 2 cents: a simple way to achieve this using PySpark.
Create the dataframe as follows:
data = [
(
"""{"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"}""",
"""{"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"}"""
),
(
"""{"link": "https://hi.com/api/now/table/system/2696f18b376bca0", "value": "2696f18b376bca0"}""",
"""{"link": "https://hi.com/api/now/table/system/99b27bc1db761f4", "value": "99b27bc1db761f4"}"""
)
]
df = spark.createDataFrame(data,schema=['u_group','t_group'])
Then use from_json() to parse the dictionary and fetch the individual values as follows:
from pyspark.sql.types import *
from pyspark.sql.functions import *
schema_column = StructType([
StructField("link",StringType(),True),
StructField("value",StringType(),True),
])
df = df.withColumn('U_GROUP_PARSE',from_json(col('u_group'),schema_column))\
.withColumn('T_GROUP_PARSE',from_json(col('t_group'),schema_column))\
.withColumn('U_GROUP.LINK',col("U_GROUP_PARSE.link"))\
.withColumn('U_GROUP.VALUE',col("U_GROUP_PARSE.value"))\
.withColumn('T_GROUP.LINK',col("T_GROUP_PARSE.link"))\
.withColumn('T_GROUP.VALUE',col("T_GROUP_PARSE.value"))\
.drop('u_group','t_group','U_GROUP_PARSE','T_GROUP_PARSE')
Print the dataframe
df.show(truncate=False)
Say I have a DataFrame defined as:
df = {
"customer_name":"john",
"phone":{
"mobile":000,
"office":111
},
"mail":{
"office":"john#office.com",
"personal":"john#home.com",
"fax":"12345"
}
}
I want to somehow alter the value in column "mail" to remove the key "fax". Eg, the output DataFrame would be something like:
output_df = {
"customer_name":"john",
"phone":{
"mobile":000,
"office":111
},
"mail":{
"office":"john#office.com",
"personal":"john#home.com"
}
}
where the "fax" key-value pair has been deleted. I tried to use pandas.map with a dict in the lambda, but it does not work. One bad workaround I had was to normalize the dict, but this created unnecessary output columns, and I could not merge them back. Eg.;
df = pd.json_normalize(df)
Is there a better way for this?
You can use pop to remove an element from a dict by its key.
import pandas as pd
df['mail'].pop('fax')
df = pd.json_normalize(df)
df
Output:
customer_name phone.mobile phone.office mail.office mail.personal
0 john 0 111 john#office.com john#home.com
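One caveat: pop raises a KeyError when the key is absent, so passing a default makes the removal safe to repeat. A small sketch:

```python
record = {'mail': {'office': 'john#office.com',
                   'personal': 'john#home.com',
                   'fax': '12345'}}

record['mail'].pop('fax', None)  # removes 'fax' if present
record['mail'].pop('fax', None)  # no KeyError on a second call
print(record['mail'])
# {'office': 'john#office.com', 'personal': 'john#home.com'}
```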
Is there a reason you just don't access it directly and delete it?
Like this:
del df['mail']['fax']
print(df)
{'customer_name': 'john',
'phone': {'mobile': 0, 'office': 111},
'mail': {'office': 'john#office.com', 'personal': 'john#home.com'}}
This is the simplest technique to achieve your aim.
import pandas as pd
import numpy as np
df = {
"customer_name":"john",
"phone":{
"mobile":000,
"office":111
},
"mail":{
"office":"john#office.com",
"personal":"john#home.com",
"fax":"12345"
}
}
del df['mail']['fax']
df = pd.json_normalize(df)
df
Output:
customer_name phone.mobile phone.office mail.office mail.personal
0 john 0 111 john#office.com john#home.com
I have the following JSON file:
{
"matches": [
{
"team": "Sunrisers Hyderabad",
"overallResult": "Won",
"totalMatches": 3,
"margins": [
{
"bar": 290
},
{
"bar": 90
}
]
},
{
"team": "Pune Warriors",
"overallResult": "None",
"totalMatches": 0,
"margins": null
}
],
"totalMatches": 70
}
Note - The JSON above is a fragment of the original. The actual file contains a lot more attributes after 'margins', some of them nested and others not. I just included some for brevity and to give an idea of the expected structure.
My goal is to flatten the data and load it into CSV. Here is the code I have written so far -
import json
import pandas as pd

path = r"/Users/samt/Downloads/test_data.json"

with open(path) as f:
    t_data = {}
    data = json.load(f)
    for team in data['matches']:
        if team['margins']:
            for idx, margin in enumerate(team['margins']):
                t_data['team'] = team['team']
                t_data['overallResult'] = team['overallResult']
                t_data['totalMatches'] = team['totalMatches']
                t_data['margin'] = margin.get('bar')
        else:
            t_data['team'] = team['team']
            t_data['overallResult'] = team['overallResult']
            t_data['totalMatches'] = team['totalMatches']
            t_data['margin'] = margin.get('bar')

df = pd.DataFrame.from_dict(t_data, orient='index')
print(df)
I know that the data is getting overwritten and the loop is not properly structured. I am a bit new to dealing with JSON objects in Python and I am not able to understand how to concatenate the results.
My goal is, once all the results are appended, to use to_csv and convert them into rows. For each margin, the entire data is to be replicated as a separate row. Here is what I am expecting the output to be. Can someone please help me translate this?
From whatever I find on the net, it is about first gathering the dictionary items, but how to transpose them to rows is something I am not able to understand. Also, is there a better way to parse the JSON than looping twice for one attribute, i.e. margins?
I can't use json_normalize as that library is not supported in our environment.
[output data]
Using the json and csv modules: create a dictionary for each team, for each margin if there is one.
import json, csv
s = '''{
"matches": [
{
"team": "Sunrisers Hyderabad",
"overallResult": "Won",
"totalMatches": 3,
"margins": [
{
"bar": 290
},
{
"bar": 90
}
]
},
{
"team": "Pune Warriors",
"overallResult": "None",
"totalMatches": 0,
"margins": null
}
],
"totalMatches": 70
}'''
j = json.loads(s)
matches = j['matches']

rows = []
for thing in matches:
    # print(thing)
    if not thing['margins']:
        rows.append(thing)
    else:
        for bar in (b['bar'] for b in thing['margins']):
            d = dict((k, thing[k]) for k in ('team', 'overallResult', 'totalMatches'))
            d['margins'] = bar
            rows.append(d)
# for row in rows: print(row)

# using an in-memory stream for this example instead of an actual file
import io
f = io.StringIO(newline='')

fieldnames = ('team', 'overallResult', 'totalMatches', 'margins')
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)

f.seek(0)
print(f.read())
team,overallResult,totalMatches,margins
Sunrisers Hyderabad,Won,3,290
Sunrisers Hyderabad,Won,3,90
Pune Warriors,None,0,
Getting multiple item values from a dictionary can be aided by using operator.itemgetter()
>>> import operator
>>> items = operator.itemgetter(*('team','overallResult','totalMatches'))
>>> #items = operator.itemgetter('team','overallResult','totalMatches')
>>> #stuff = ('team','overallResult','totalMatches'))
>>> #items = operator.itemgetter(*stuff)
>>> d = {'margins': 90,
... 'overallResult': 'Won',
... 'team': 'Sunrisers Hyderabad',
... 'totalMatches': 3}
>>> items(d)
('Sunrisers Hyderabad', 'Won', 3)
>>>
I like to use it and give the callable a descriptive name, but I don't see it used much here on SO.
You can use pd.DataFrame to create the DataFrame and then explode the margins column:
import json
import pandas as pd
with open('data.json', 'r', encoding='utf-8') as f:
    data = json.loads(f.read())
df = pd.DataFrame(data['matches']).explode('margins', ignore_index=True)
print(df)
team overallResult totalMatches margins
0 Sunrisers Hyderabad Won 3 {'bar': 290}
1 Sunrisers Hyderabad Won 3 {'bar': 90}
2 Pune Warriors None 0 None
Then fill the None values in the margins column with a placeholder dictionary and expand the dictionaries into their own column:
bar = df['margins'].apply(lambda x: x if x else {'bar': pd.NA}).apply(pd.Series)
print(bar)
bar
0 290
1 90
2 <NA>
Finally, join the Series back to the original dataframe:
df = df.join(bar).drop(columns='margins')
print(df)
team overallResult totalMatches bar
0 Sunrisers Hyderabad Won 3 290
1 Sunrisers Hyderabad Won 3 90
2 Pune Warriors None 0 <NA>
My API gives me a json file as output with the following structure:
{
"results": [
{
"statement_id": 0,
"series": [
{
"name": "PCJeremy",
"tags": {
"host": "001"
},
"columns": [
"time",
"memory"
],
"values": [
[
"2021-03-20T23:00:00Z",
1049911288
],
[
"2021-03-21T00:00:00Z",
1057692712
]
]
},
{
"name": "PCJohnny",
"tags": {
"host": "002"
},
"columns": [
"time",
"memory"
],
"values": [
[
"2021-03-20T23:00:00Z",
407896064
],
[
"2021-03-21T00:00:00Z",
406847488
]
]
}
]
}
]
}
I want to transform this output to a pandas dataframe so I can create some reports from it. I tried using the pd.DataFrame.from_dict method:
with open(fn) as f:
    data = json.load(f)
print(pd.DataFrame.from_dict(data))
But as a resulting set, I just get one column and one row with all the data back:
results
0 {'statement_id': 0, 'series': [{'name': 'Jerem...
The structure is quite hard for me to understand, as I am no professional. I would like to get a dataframe with 4 columns: name, host, time and memory, with a row of data for every combination of values in the json file. Example:
name host time memory
JeremyPC 001 "2021-03-20T23:00:00Z" 1049911288
JeremyPC 001 "2021-03-21T00:00:00Z" 1049911288
Is this in any way possible? Thanks a lot in advance!
First, extract the data you are interested in from the json:
extracted_data = []
for series in data['results'][0]['series']:
    d = {}
    d['name'] = series['name']
    d['host'] = series['tags']['host']
    d['time'] = [value[0] for value in series['values']]
    d['memory'] = [value[1] for value in series['values']]
    extracted_data.append(d)

df = pd.DataFrame(extracted_data)
# print(df)
name host time memory
0 PCJeremy 001 [2021-03-20T23:00:00Z, 2021-03-21T00:00:00Z] [1049911288, 1057692712]
1 PCJohnny 002 [2021-03-20T23:00:00Z, 2021-03-21T00:00:00Z] [407896064, 406847488]
Second, explode multiple columns into rows
df1 = pd.concat([df.explode('time')['time'], df.explode('memory')['memory']], axis=1)
df_ = df.drop(['time','memory'], axis=1).join(df1).reset_index(drop=True)
# print(df_)
name host time memory
0 PCJeremy 001 2021-03-20T23:00:00Z 1049911288
1 PCJeremy 001 2021-03-21T00:00:00Z 1057692712
2 PCJohnny 002 2021-03-20T23:00:00Z 407896064
3 PCJohnny 002 2021-03-21T00:00:00Z 406847488
With a carefully constructed dict, it can be done without exploding:
extracted_data = []
for series in data['results'][0]['series']:
    d = {}
    d['name'] = series['name']
    d['host'] = series['tags']['host']
    for values in series['values']:
        d_ = d.copy()
        for column, value in zip(series['columns'], values):
            d_[column] = value
        extracted_data.append(d_)

df = pd.DataFrame(extracted_data)
You could use jmespath to extract the data; it is quite a handy tool for such nested JSON data. You can read the docs for more details; to summarize the basics: to access a key, use a dot; to access values in a list, use []. Combining these two lets you traverse the JSON paths. There are more tools, but these basics should get you started.
Your json is wrapped in a data variable:
data
{'results': [{'statement_id': 0,
'series': [{'name': 'PCJeremy',
'tags': {'host': '001'},
'columns': ['time', 'memory'],
'values': [['2021-03-20T23:00:00Z', 1049911288],
['2021-03-21T00:00:00Z', 1057692712]]},
{'name': 'PCJohnny',
'tags': {'host': '002'},
'columns': ['time', 'memory'],
'values': [['2021-03-20T23:00:00Z', 407896064],
['2021-03-21T00:00:00Z', 406847488]]}]}]}
Let's create an expression to parse the json, and get the specific values:
expression = """{name: results[].series[].name,
host: results[].series[].tags.host,
time: results[].series[].values[*][0],
memory: results[].series[].values[*][-1]}
"""
Parse the expression to the json data:
expression = jmespath.compile(expression).search(data)
expression
{'name': ['PCJeremy', 'PCJohnny'],
'host': ['001', '002'],
'time': [['2021-03-20T23:00:00Z', '2021-03-21T00:00:00Z'],
['2021-03-20T23:00:00Z', '2021-03-21T00:00:00Z']],
'memory': [[1049911288, 1057692712], [407896064, 406847488]]}
Note the time and memory are nested lists, and match the values in data:
Create dataframe and explode relevant columns:
pd.DataFrame(expression).apply(pd.Series.explode)
name host time memory
0 PCJeremy 001 2021-03-20T23:00:00Z 1049911288
0 PCJeremy 001 2021-03-21T00:00:00Z 1057692712
1 PCJohnny 002 2021-03-20T23:00:00Z 407896064
1 PCJohnny 002 2021-03-21T00:00:00Z 406847488
I have a dataframe which contains a column of dictionaries, and I need to group by the values inside those dictionaries. For example,
import pandas as pd
data = [
{
"name":"xx",
"values":{
"element":[
{
"path":"path1/id1"
},
{
"path":"path2/id1"
}
],
"nonrequired":[
{}
]
}
},
{
"name":"yy",
"values":{
"element":[
{
"path":"path1/id2"
},
{
"path":"path2/id2"
}
],
"nonrequired":[
{}
]
}
}
]
df = pd.DataFrame(data)
What I'm looking for:
I want to group the column "values" by a specific key inside it.
The grouping should be on values->element->path.
The grouping should be based on the partial path value. For example, if path="path1/id2", the grouping should be based on path="path1".
After grouping I need to extract the result as a dictionary.
Expected result:
result = {
'path1': [
{
"name":'xx',
"renamecolumn":['id1','id2']
}
],
'path2': [
{
"name":'yy',
"renamecolumn":['id1','id2']
}
]
}
Still not 100% sure of the logic of the final dictionary creation, as the example input and output don't quite match up. However, here is how you can extract the values; you can create your desired dictionary from there.
# extract the values and split them on the forward slash
df['split'] = df['values'].apply(lambda x: [item['path'].split('/') for item in x['element']])
# generate the path and ids columns
df['path'] = df['split'].apply(lambda x: [x[i][0] for i in range(0,len(x))])
df['ids'] = df['split'].apply(lambda x: [x[i][1] for i in range(0,len(x))])
# separate out all the lists and drop the helper columns
result = df.drop(['values', 'split'], axis=1) \
.explode('ids').explode('path').drop_duplicates()
Result is:
name path ids
0 xx path1 id1
0 xx path2 id1
1 yy path1 id2
1 yy path2 id2
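From that frame, one way to assemble a dictionary keyed by path is a groupby over path. A sketch, under the assumption that renamecolumn should collect the ids per name (the question's example output is slightly ambiguous):

```python
import pandas as pd

# The exploded result from above, rebuilt here so the sketch is self-contained.
result_df = pd.DataFrame({
    'name': ['xx', 'xx', 'yy', 'yy'],
    'path': ['path1', 'path2', 'path1', 'path2'],
    'ids':  ['id1', 'id1', 'id2', 'id2'],
})

# For each path, list one record per name with that name's ids.
result = {
    path: [{'name': name, 'renamecolumn': list(g['ids'])}
           for name, g in grp.groupby('name')]
    for path, grp in result_df.groupby('path')
}
print(result)
# {'path1': [{'name': 'xx', 'renamecolumn': ['id1']},
#            {'name': 'yy', 'renamecolumn': ['id2']}],
#  'path2': [{'name': 'xx', 'renamecolumn': ['id1']},
#            {'name': 'yy', 'renamecolumn': ['id2']}]}
```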