Nested JSON into Dataframe - python

I have a Dataframe and one of the columns contains JSON objects of this type:
{'a': 'x', 'b':'y', 'c':'z'}
{'a': 'x1', 'b':'y2', 'c':'z3'}
...
How can I split such objects and expand them into separate a/b/c columns with their respective elements, within the same dataframe?
a b c
x y z
x1 y2 z3
...
Thank you in advance!

If your dataframe looks like this, with a column called json_col:
>>> import pandas as pd
>>> df
json_col
0 {'a': 'x', 'b': 'y', 'c': 'z'}
1 {'a': 'x1', 'b': 'y2', 'c': 'z3'}
You can do this:
df[['a','b','c']] = df.json_col.apply(pd.Series)
resulting in this final df:
>>> df
json_col a b c
0 {'a': 'x', 'b': 'y', 'c': 'z'} x y z
1 {'a': 'x1', 'b': 'y2', 'c': 'z3'} x1 y2 z3
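As an alternative sketch (not part of the answer above), the same expansion can be done by building a second DataFrame from the list of dicts and joining it back. This assumes every cell of json_col already holds a Python dict rather than a JSON string; the name expanded is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'json_col': [{'a': 'x', 'b': 'y', 'c': 'z'},
                                {'a': 'x1', 'b': 'y2', 'c': 'z3'}]})

# Build the new columns from the list of dicts in one step, reusing df's index
expanded = pd.DataFrame(df['json_col'].tolist(), index=df.index)

# Attach the new columns next to the original column
df = df.join(expanded)
print(df)
```

The column names come from the dict keys, so they do not have to be listed by hand. If the dicts themselves contain nested dicts, pd.json_normalize(df['json_col'].tolist()) may be worth a look instead.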

Related

Convert Dataframe to dictionary with one column as key and the other columns as another dict

Currently I have a dataframe.
ID   A  B
123  a  b
456  c  d
I would like to convert this into a dictionary, where the key of the dictionary is the "ID" column. The value of the dictionary would be another dictionary, where the keys of that dictionary are the name of the other columns, and the value of that dictionary would be the corresponding column value. Using the example above, this would look like:
{ 123 : { A : a, B : b}, 456 : {A : c, B : d} }
I have tried:
mydataframe.set_index("ID").to_dict() , but this results in a different format than the one wanted.
You merely need to pass the proper orient parameter, per the documentation.
import io
import pandas as pd
pd.read_csv(io.StringIO('''ID A B
123 a b
456 c d'''), sep=r'\s+').set_index('ID').to_dict(orient='index')
{123: {'A': 'a', 'B': 'b'}, 456: {'A': 'c', 'B': 'd'}}
Of course, the columns maintain their string types, as indicated by the quote marks.
Consider the following:
import pandas as pd
df = pd.DataFrame({'ID':[1,2,3], 'A':['x','y','z'], 'B':[111,222,333]})
What you're going for would be returned with the following two lines:
df.set_index('ID', inplace=True)
some_dict = {i:dict(zip(row.keys(), row.values)) for i, row in df.iterrows()}
With the output being equal to:
{1: {'A': 'x', 'B': 111}, 2: {'A': 'y', 'B': 222}, 3: {'A': 'z', 'B': 333}}
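To see why the plain to_dict() call from the question gives "a different format", here is a small sketch comparing the default orientation with orient='index' on the question's data. The variable names by_column and by_row are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'ID': [123, 456], 'A': ['a', 'c'], 'B': ['b', 'd']}).set_index('ID')

# Default orient='dict': outer keys are column names, inner keys are index values
by_column = df.to_dict()

# orient='index': outer keys are index values, inner keys are column names
by_row = df.to_dict(orient='index')

print(by_column)  # {'A': {123: 'a', 456: 'c'}, 'B': {123: 'b', 456: 'd'}}
print(by_row)     # {123: {'A': 'a', 'B': 'b'}, 456: {'A': 'c', 'B': 'd'}}
```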

Convert Pandas dataframe to a dictionary with first column as key

I have a Pandas Dataframe :
A || B || C
x1 x [x,y]
x2 a [b,c,d]
and I am trying to make a dictionary that looks like:
{x1: {B: x, C: [x,y]}, x2: {B: a, C: [b,c,d]}}
I have tried the to_dict function but that changes the entire dataframe into a dictionary. I am kind of lost on how to iterate onto the first column and make it the key and the rest of the df a dictionary as the value of that key.
Try:
x = df.set_index("A").to_dict("index")
print(x)
Prints:
{'x1': {'B': 'x', 'C': ['x', 'y']}, 'x2': {'B': 'a', 'C': ['b', 'c', 'd']}}

Splitting dictionary into existing columns

Suppose I have the dataframe pd.DataFrame([{'a':nan, 'b':nan, 'c':{'a':1, 'b':2}}, {'a':4, 'b':7, 'c':nan}, {'a':nan, 'b':nan, 'c':{'a':6, 'b':7}}]). I want to take the values from the keys of the dictionary in column c and write them into columns a and b.
Expected output is:
a b c
0 1 2 {'a':1, 'b':2}
1 4 7 nan
2 6 7 {'a':6, 'b':7}
I know how to do this to create new columns, but that is not the task I need for this, since a and b have relevant information needing updates from c. I have not been able to find anything relevant to this task.
Any suggestions for an efficient method would be most welcome.
** EDIT **
The real problem is that I have the following dataframe, which I reduced to the above (in several, no doubt, extraneous steps):
a b c
0 nan nan [{'a':1, 'b':2}, {'a':6, 'b':7}]
1 4 7 nan
and I need to have output, in as few steps as possible, as per
a b c
0 1 2 {'a':1, 'b':2}
1 4 7 nan
2 6 7 {'a':6, 'b':7}
Thanks!
This works:
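import ast

def func(x):
    # 'c' may hold a dict or its string form; literal_eval is a safer parser than eval
    d = ast.literal_eval(x['c']) if isinstance(x['c'], str) else x['c']
    if isinstance(d, dict):  # skip rows where 'c' is NaN
        x['a'] = d['a']
        x['b'] = d['b']
    return x

df = df.apply(func, axis=1)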
How about this:
for t in d['c'].keys():
    d[t] = d['c'][t]
Here is an example:
>>> d = {'a': '', 'b': '', 'c':{'a':1, 'b':2}}
>>> d
{'a': '', 'b': '', 'c': {'a': 1, 'b': 2}}
>>> d.keys()
dict_keys(['a', 'b', 'c'])
>>> d['c'].keys()
dict_keys(['a', 'b'])
>>> for t in d['c'].keys():
...     d[t] = d['c'][t]
...
>>> d
{'a': 1, 'b': 2, 'c': {'a': 1, 'b': 2}}
>>>
We can turn it into a function:
>>> def updateDict(data, sourceKey):
...     for targetKey in data[sourceKey].keys():
...         data[targetKey] = data[sourceKey][targetKey]
...     return data
...
>>> d = {'a': '', 'b': '', 'c':{'a':1, 'b':2}}
>>> updateDict(d, 'c')
{'a': 1, 'b': 2, 'c': {'a': 1, 'b': 2}}
>>> d = {'a': '', 'b': '', 'c':{'a':1, 'b':2, 'z':1000}}
>>> updateDict(d, 'c')
{'a': 1, 'b': 2, 'c': {'a': 1, 'b': 2, 'z': 1000}, 'z': 1000}
>>>
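For the list-of-dicts layout shown in the EDIT, one possible pandas approach is sketched below. It is not taken from the answers above: it assumes pandas 0.25 or later (for DataFrame.explode), assumes every dict in c carries both keys a and b, and the names out, is_dict, and expanded are illustrative. Note that the two rows produced from the list stay adjacent, so the row order differs from the expected output in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [np.nan, 4],
    'b': [np.nan, 7],
    'c': [[{'a': 1, 'b': 2}, {'a': 6, 'b': 7}], np.nan],
})

# One row per list element; rows where 'c' is NaN are kept unchanged
out = df.explode('c').reset_index(drop=True)

# Rows whose 'c' now holds a single dict
is_dict = out['c'].apply(lambda v: isinstance(v, dict))

# Expand those dicts into columns and copy the values into 'a' and 'b'
expanded = pd.DataFrame(out.loc[is_dict, 'c'].tolist())
out.loc[is_dict, ['a', 'b']] = expanded[['a', 'b']].to_numpy()

print(out)
```

Because a and b started out with missing values, they remain floating-point columns after the update.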

How to save pandas DataFrame's rows as JSON strings?

I have a pandas DataFrame df and I convert each row to JSON string as follows:
df = pd.DataFrame(np.random.randn(50, 4), columns=list('ABCD'))
df_as_json = df.to_json(orient='records')
Then I want to iterate over the JSON strings (rows) of df_as_json and make further processing as follows:
for json_document in df_as_json.split('\n'):
    jdict = json.loads(json_document)
    # ...
The problem is that df_as_json.split('\n') does not really split df_as_json into separate JSON strings.
How can I do what I need?
To get each row of the dataframe as a dict, you can use pandas.DataFrame.to_dict():
Code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
for jdict in df.to_dict(orient='records'):
    print(jdict)
Results:
{'A': -0.81155648424969018, 'B': 0.54051722275060621, 'C': 2.1858014972680886, 'D': -0.92089743800379931}
{'A': -0.051650790117511704, 'B': -0.79176498452586563, 'C': -0.9181773278020231, 'D': 1.1698955805545324}
{'A': -0.59790963665018559, 'B': -0.63673166723131003, 'C': 1.0493603533698836, 'D': 1.0027811601157812}
{'A': -0.20909149867564752, 'B': -1.8022674158328837, 'C': 1.0849019267782165, 'D': 1.2203116471260997}
{'A': 0.33798033123267207, 'B': 0.13927004774974402, 'C': 1.6671536830551967, 'D': 0.29193412587056755}
{'A': -0.079327003827824386, 'B': 0.58625181818942929, 'C': -0.42365912798153349, 'D': -0.69644626255641828}
{'A': 0.33849577559616656, 'B': -0.42955248285258169, 'C': 0.070860788937864225, 'D': 1.4971679265264808}
{'A': 1.3411846077264038, 'B': -0.20189961315847924, 'C': 1.6294881274421233, 'D': 1.1168181183218009}
{'A': 0.61028134135655399, 'B': 0.48445766812257018, 'C': -0.31117315672299928, 'D': -1.7986688463810827}
{'A': 0.9181074339928279, 'B': 0.84151139156427757, 'C': -1.111794854210024, 'D': -0.7131446510569609}
Starting from v0.19, you can use to_json with lines=True parameter to save your data as a JSON lines file.
df.to_json('file.json', orient='records', lines=True)
This eliminates the need for a loop to save each record, as a solution with to_dict would involve.
The first 5 lines of file.json look like this -
{"A":0.0162261253,"B":0.8770884013,"C":0.1577913843,"D":-0.3097990255}
{"A":-1.2870077735,"B":-0.1610902061,"C":-0.2426829569,"D":-0.3247587907}
{"A":-0.7743891125,"B":-0.9487264737,"C":1.6366125588,"D":0.2943377348}
{"A":1.5128287075,"B":-0.389437321,"C":0.4841038875,"D":0.5315466818}
{"A":-0.1455759399,"B":1.0205229385,"C":0.6776108196,"D":0.832060379}
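If the goal is to loop over the rows as JSON strings in memory, as in the question, the same lines=True option can be used without a file path, in which case to_json returns the text instead of writing it. A minimal sketch under that assumption (the names json_lines and docs are illustrative):

```python
import json

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD'))

# With no path given, to_json returns a string; lines=True puts one record per line
json_lines = df.to_json(orient='records', lines=True)

# Parse each non-empty line as its own JSON document
docs = [json.loads(line) for line in json_lines.splitlines() if line]

for jdict in docs:
    print(jdict)
```

Skipping empty lines guards against a trailing newline at the end of the returned string.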
Another way is to serialize each row separately (here dataset is the DataFrame):
input_data = [row.to_json() for index, row in dataset.iterrows()]

python: How to modify dictionaries in a DataFrame?

How can I modify a list value inside a DataFrame? I am trying to adjust data received as JSON, and the DataFrame is shown below.
Each cell of the options column holds a list of several dictionaries.
Dataframe df:
id options
0 0 [{'a':1 ,'b':2, 'c':3, 'd':4},{'a':5 ,'b':6, 'c':7, 'd':8}]
1 1 [{'a':9 ,'b':10, 'c':11, 'd':12},{'a':13 ,'b':14, 'c':15, 'd':16}]
2 2 [{'a':9 ,'b':10, 'c':11, 'd':12},{'a':17 ,'b':18, 'c':19, 'd':20}]
If I want to keep only the 'a' and 'c' keys and values in options, how can I modify the dataframe? The expected result would be
Dataframe df:
id options
0 0 [{'a':1 ,'c':3},{'a':5 ,'c':7}]
1 1 [{'a':9, 'c':11},{'a':13,'c':15}]
2 2 [{'a':9 ,'c':11},{'a':17,'c':19}]
I tried filtering but I could not assign the value to the dataframe
for x in totaldf['options']:
    for y in x:
        y = {'a': y['a'], 'c': y['c']} ...?
Using a nested list comprehension:
df['options'] = [[{'a': y['a'], 'c': y['c']} for y in x] for x in df['options']]
If you wanted to use a for loop it would be something like:
new_options = []
for x in df['options']:
    row = []
    for y in x:
        # keep only the 'a' and 'c' entries of each dict
        row.append({'a': y['a'], 'c': y['c']})
    new_options.append(row)
df['options'] = new_options
# An alternative using apply (still a Python-level loop, not truly vectorized).
df.options = df.options.apply(lambda x: [{k: v for k, v in e.items() if k in ['a', 'c']} for e in x])
Out[398]:
id options
0 0 [{'a': 1, 'c': 3}, {'a': 5, 'c': 7}]
1 1 [{'a': 9, 'c': 11}, {'a': 13, 'c': 15}]
2 2 [{'a': 9, 'c': 11}, {'a': 17, 'c': 19}]
