Pandas JSON Orient Autodetection - python

I'm trying to find out if Pandas.read_json performs some level of autodetection. For example, I have the following data:
data_records = [
    {
        "device": "rtr1",
        "dc": "London",
        "vendor": "Cisco",
    },
    {
        "device": "rtr2",
        "dc": "London",
        "vendor": "Cisco",
    },
    {
        "device": "rtr3",
        "dc": "London",
        "vendor": "Cisco",
    },
]
data_index = {
    "rtr1": {"dc": "London", "vendor": "Cisco"},
    "rtr2": {"dc": "London", "vendor": "Cisco"},
    "rtr3": {"dc": "London", "vendor": "Cisco"},
}
If I do the following:
import pandas as pd
import json
pd.read_json(json.dumps(data_records))
---
device dc vendor
0 rtr1 London Cisco
1 rtr2 London Cisco
2 rtr3 London Cisco
though I get the output I desired, the data is record-based. Given that the default orient is columns, I would not have expected this to work.
So is there some level of autodetection going on? With index-based input the behaviour seems more in line with the documentation: as shown below, the data is parsed with the default columns orient unless orient="index" is passed.
pd.read_json(json.dumps(data_index))
rtr1 rtr2 rtr3
dc London London London
vendor Cisco Cisco Cisco
pd.read_json(json.dumps(data_index), orient="index")
dc vendor
rtr1 London Cisco
rtr2 London Cisco
rtr3 London Cisco

Rather than auto-detection, what is at work is a nested hierarchical structure interpreted according to the specified orientation, or the default one.
Moreover, it should be noted that not every data structure can be used with every orientation.
Case 1 : DataFrames to / from a JSON string
read_json : convert a JSON string to a DataFrame with argument typ = 'frame'
to_json : convert a DataFrame to a JSON string
The orient value must be explicitly specified with the pandas to_json and read_json functions for the split, records, index, table and values orientations.
It is not necessary to specify an orient value for columns, because the orientation is columns by default.
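For illustration, here is a minimal round trip for a DataFrame (a sketch; the exact JSON strings, and whether a literal string is accepted without wrapping it in io.StringIO, may vary by pandas version):
import pandas as pd

df = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])

df.to_json()                # orient defaults to 'columns' for a DataFrame
# '{"a":{"x":1,"y":2}}'
df.to_json(orient="split")  # here orient must be given explicitly
# '{"columns":["a"],"index":["x","y"],"data":[[1],[2]]}'

# On the way back, read_json must be told the same orient that produced the string
pd.read_json(df.to_json(orient="split"), orient="split")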
Case 2 : Series to / from a JSON string
read_json : convert a JSON string to a Series with argument typ = 'series'
to_json : convert a Series to a JSON string
If typ = 'series' in read_json, the default value for the orient argument is index (see the pandas documentation).
When converting a Series into a JSON string using to_json, the default orient value is also index.
For the other orientation values allowed for a Series (split, records, table), the orient argument must be specified.
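A quick sketch of the Series case:
import pandas as pd

s = pd.Series([1, 2], index=["x", "y"])

s.to_json()  # orient defaults to 'index' for a Series
# '{"x":1,"y":2}'

# reading back: typ='series' also defaults to orient='index'
pd.read_json(s.to_json(), typ="series")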
Resources
Some of the oriented structures are shown in the code comments at this GitHub link (have a look around line 680 in the _json.py file).
Note that there are no examples with orient='columns' in those code comments.
This is simply because, in the absence of an orientation specification, columns is used by default.
Clearer view of a nested hierarchical structure
import pandas as pd
import json
##### BEGINNING : HIERARCHICAL LEVEL #####
# Second Level - Values levels
d21 = {'v1': "value 1", 'v2': "value 3"}
d22 = {'v1': "value 3", 'v2': "value 4"}
# First Level - Rows levels
d1 = {'row1': d21, 'row2': d22}
# 0-Level - Columns Levels
d = {'col1': d1}
##### END : HIERARCHICAL LEVEL #####
print(pd.read_json(json.dumps(d))) # No need for specification : orient is columns by default
# col1
# row1 {'v1': 'value 1', 'v2': 'value 3'}
# row2 {'v1': 'value 3', 'v2': 'value 4'}
Be careful here
Not every data structure can be used with every value of the orient argument; otherwise an exception such as builtins.AttributeError may be raised (see the GitHub link above for the different structures).
pd.read_json(json.dumps(data_records))
# device dc vendor
# 0 rtr1 London Cisco
# 1 rtr2 London Cisco
# 2 rtr3 London Cisco
#### Since orient is columns by default, the previous call is the same as the following
pd.read_json(json.dumps(data_records), orient='columns')
# device dc vendor
# 0 rtr1 London Cisco
# 1 rtr2 London Cisco
# 2 rtr3 London Cisco
pd.read_json(json.dumps(data_records), orient='values')
# device dc vendor
# 0 rtr1 London Cisco
# 1 rtr2 London Cisco
# 2 rtr3 London Cisco
#### The shape of the data also matters, and an exception may be raised
pd.read_json(json.dumps(data_records), orient='index')
# ...
# builtins.AttributeError: 'list' object has no attribute 'values'
pd.read_json(json.dumps(data_records), orient='table')
# builtins.KeyError: 'schema'
pd.read_json(json.dumps(data_records), orient='split')
# builtins.AttributeError: 'list' object has no attribute 'items'
Is there an autodetection mechanism ?
After some experimentation, I can now say the answer is no.
On GitHub, the split data shape is presented as follows:
data = {
    "columns": ["col 1", "col 2"],
    "index": ["row 1", "row 2"],
    "data": [["a", "b"], ["c", "d"]]
}
So let's do an experiment.
We will use read_json to read the data without specifying the orientation and see whether the split shape is recognized.
Then we will read the data again, this time passing the split orientation.
If there were automatic shape recognition, the result would be the same in both cases.
import pandas as pd
import json
data = {
    "columns": ["col 1", "col 2"],
    "index": ["row 1", "row 2"],
    "data": [["a", "b"], ["c", "d"]]
}
json_string = json.dumps(data)
First we read without specifying the orientation:
>>> pd.read_json(json_string)
columns index data
0 col 1 row 1 [a, b]
1 col 2 row 2 [c, d]
and now we read again, this time specifying the split orientation:
>>> pd.read_json(json_string, orient='split')
col 1 col 2
row 1 a b
row 2 c d
The DataFrames are different: pandas does not recognize the split shape. There is no automatic detection mechanism.

TL;DR
When using pd.read_json() with orient=None, the representation of the data is automatically determined through pd.DataFrame().
Explanation
The pandas documentation is a bit misleading here. When not specifying orient, the parser for 'columns' is used, which is self.obj = pd.DataFrame(json.loads(json)). So
pd.read_json(json.dumps(data_records))
is equivalent to
pd.DataFrame(json.loads(json.dumps(data_records)))
which again is equivalent to
pd.DataFrame(data_records)
I.e., you pass a list of dicts to the DataFrame constructor, which then performs the automatic determination of the data representation. Note that this does not mean that orient is auto-detected. Instead, simple heuristics (see below) on how the data should be loaded into a DataFrame are applied.
Loading JSON-like data through pd.DataFrame()
For the 3 most relevant cases of JSON-structured data, the DataFrame construction through pd.DataFrame() is:
Dict of lists
In[1]: data = {"a": [1, 2, 3], "b": [9, 8, 7]}
...: pd.DataFrame(data)
Out[1]:
a b
0 1 9
1 2 8
2 3 7
Dict of dicts
In[2]: data = {"a": {"x": 1, "y": 2, "z": 3}, "b": {"x": 9, "y": 8, "z": 7}}
...: pd.DataFrame(data)
Out[2]:
a b
x 1 9
y 2 8
z 3 7
List of dicts
In[3]: data = [{'a': 1, 'b': 9}, {'a': 2, 'b': 8}, {'a': 3, 'b': 7}]
...: pd.DataFrame(data)
Out[3]:
a b
0 1 9
1 2 8
2 3 7

No, Pandas does not perform any autodetection when using the read_json function.
It is entirely determined by the orient parameter, which specifies the format of the input json data.
In your first example, you passed the data_records list to json.dumps, which converted it to a JSON string. When the resulting string is passed to pd.read_json, it is read as a records-style structure.
In your second example, you passed data_index to json.dumps, and the result is read as a "columns"-style structure.
In both cases, the behaviour of read_json is based entirely on the value of the orient parameter, not on any automatic detection by pandas.
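To see this explicitly, the same JSON string yields different frames depending only on the orient you pass (a small sketch with the question's data):
import json
import pandas as pd

data_index = {
    "rtr1": {"dc": "London", "vendor": "Cisco"},
    "rtr2": {"dc": "London", "vendor": "Cisco"},
}
s = json.dumps(data_index)

pd.read_json(s)                  # default 'columns': outer keys become columns
pd.read_json(s, orient="index")  # outer keys become the row index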

If you want to understand every detail of a function call, I would suggest using VSCode and setting "justMyCode": false in your launch.json for debugging.
That being said, if you follow what's going on when you call pd.read_json(), you'll find that it instantiates a JsonReader, which in turn instantiates a FrameParser whose parsing is done by _parse_no_numpy:
def _parse_no_numpy(self):
    json = self.json
    orient = self.orient

    if orient == "columns":  # default
        self.obj = DataFrame(
            loads(json, precise_float=self.precise_float), dtype=None
        )
    elif orient == "split":
        decoded = {
            str(k): v
            for k, v in loads(json, precise_float=self.precise_float).items()
        }
        self.check_keys_split(decoded)
        self.obj = DataFrame(dtype=None, **decoded)
    elif orient == "index":  # your second case
        self.obj = DataFrame.from_dict(
            loads(json, precise_float=self.precise_float),
            dtype=None,
            orient="index",
        )
    elif orient == "table":
        self.obj = parse_table_schema(json, precise_float=self.precise_float)
    else:
        self.obj = DataFrame(
            loads(json, precise_float=self.precise_float), dtype=None
        )
As you can see, and as stated in a previous answer, in terms of orientation:
pd.read_json(json.dumps(data_records))
is equivalent to
pd.DataFrame(data_records)
and
pd.read_json(json.dumps(data_index), orient='index')
to
pd.DataFrame.from_dict(data_index, orient='index')
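Both equivalences are easy to check directly (a quick sanity check, using trimmed-down versions of the question's data):
import json
import pandas as pd

data_records = [{"device": "rtr1", "dc": "London", "vendor": "Cisco"}]
data_index = {"rtr1": {"dc": "London", "vendor": "Cisco"}}

print(pd.read_json(json.dumps(data_records)).equals(pd.DataFrame(data_records)))  # True
print(pd.read_json(json.dumps(data_index), orient="index")
      .equals(pd.DataFrame.from_dict(data_index, orient="index")))                # True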
So in the end it all boils down to how pd.DataFrame handles the passed list of dicts.
Going down this rabbit hole, you'll find that the constructor checks whether the data is list-like and then calls nested_data_to_arrays, which in turn calls to_arrays, which finally calls _list_of_dict_to_arrays:
def _list_of_dict_to_arrays(
    data: list[dict],
    columns: Index | None,
) -> tuple[np.ndarray, Index]:
    """
    Convert list of dicts to numpy arrays

    if `columns` is not passed, column names are inferred from the records
    - for OrderedDict and dicts, the column names match
      the key insertion-order from the first record to the last.
    - For other kinds of dict-likes, the keys are lexically sorted.

    Parameters
    ----------
    data : iterable
        collection of records (OrderedDict, dict)
    columns: iterables or None

    Returns
    -------
    content : np.ndarray[object, ndim=2]
    columns : Index
    """
    if columns is None:
        gen = (list(x.keys()) for x in data)
        sort = not any(isinstance(d, dict) for d in data)
        pre_cols = lib.fast_unique_multiple_list_gen(gen, sort=sort)
        columns = ensure_index(pre_cols)

    # assure that they are of the base dict class and not of derived
    # classes
    data = [d if type(d) is dict else dict(d) for d in data]

    content = lib.dicts_to_array(data, list(columns))
    return content, columns
The "autodetection" is actually the hierarchical handling of all possible cases/types.
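A small illustration of that handling (my own example): for a list of dicts, column names are collected across all the records, and missing keys simply become NaN.
import pandas as pd

records = [{"a": 1, "b": 2}, {"b": 3, "c": 4}]
print(pd.DataFrame(records))
#      a  b    c
# 0  1.0  2  NaN
# 1  NaN  3  4.0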

Related

Converting complex nested json to csv via pandas

I have the following json file
{
    "matches": [
        {
            "team": "Sunrisers Hyderabad",
            "overallResult": "Won",
            "totalMatches": 3,
            "margins": [
                {
                    "bar": 290
                },
                {
                    "bar": 90
                }
            ]
        },
        {
            "team": "Pune Warriors",
            "overallResult": "None",
            "totalMatches": 0,
            "margins": null
        }
    ],
    "totalMatches": 70
}
Note - The above JSON is a fragment of the original. The actual file contains a lot more attributes after 'margins', some of them nested and others not. I just put in a few for brevity and to give an idea of expectations.
My goal is to flatten the data and load it into CSV. Here is the code I have written so far -
import json
import pandas as pd
path = r"/Users/samt/Downloads/test_data.json"
with open(path) as f:
    t_data = {}
    data = json.load(f)
    for team in data['matches']:
        if team['margins']:
            for idx, margin in enumerate(team['margins']):
                t_data['team'] = team['team']
                t_data['overallResult'] = team['overallResult']
                t_data['totalMatches'] = team['totalMatches']
                t_data['margin'] = margin.get('bar')
        else:
            t_data['team'] = team['team']
            t_data['overallResult'] = team['overallResult']
            t_data['totalMatches'] = team['totalMatches']
            t_data['margin'] = margin.get('bar')

df = pd.DataFrame.from_dict(t_data, orient='index')
print(df)
I know that the data is getting overwritten and the loop is not properly structured. I am a bit new to dealing with JSON objects in Python, and I am not able to understand how to concatenate the results.
My goal is, once all the results are appended, to use to_csv and convert them into rows. For each margin, the entire record should be replicated as a separate row. Here is what I am expecting the output to be. Can someone please help me translate this?
From whatever I found on the net, it is about first gathering the dictionary items, but how to transpose them into rows is something I cannot work out. Also, is there a better way to parse the JSON than looping twice for one attribute, i.e. margins?
I can't use json_normalize as that library is not supported in our environment.
[output data]
Using the json and csv modules: create a dictionary for each team, for each margin if there is one.
import json, csv

s = '''{
    "matches": [
        {
            "team": "Sunrisers Hyderabad",
            "overallResult": "Won",
            "totalMatches": 3,
            "margins": [
                {
                    "bar": 290
                },
                {
                    "bar": 90
                }
            ]
        },
        {
            "team": "Pune Warriors",
            "overallResult": "None",
            "totalMatches": 0,
            "margins": null
        }
    ],
    "totalMatches": 70
}'''

j = json.loads(s)
matches = j['matches']

rows = []
for thing in matches:
    # print(thing)
    if not thing['margins']:
        rows.append(thing)
    else:
        for bar in (b['bar'] for b in thing['margins']):
            d = dict((k, thing[k]) for k in ('team', 'overallResult', 'totalMatches'))
            d['margins'] = bar
            rows.append(d)
# for row in rows: print(row)

# using an in-memory stream for this example instead of an actual file
import io
f = io.StringIO(newline='')

fieldnames = ('team', 'overallResult', 'totalMatches', 'margins')
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)

f.seek(0)
print(f.read())
team,overallResult,totalMatches,margins
Sunrisers Hyderabad,Won,3,290
Sunrisers Hyderabad,Won,3,90
Pune Warriors,None,0,
Getting multiple item values from a dictionary can be aided by using operator.itemgetter()
>>> import operator
>>> items = operator.itemgetter(*('team','overallResult','totalMatches'))
>>> #items = operator.itemgetter('team','overallResult','totalMatches')
>>> #stuff = ('team','overallResult','totalMatches')
>>> #items = operator.itemgetter(*stuff)
>>> d = {'margins': 90,
... 'overallResult': 'Won',
... 'team': 'Sunrisers Hyderabad',
... 'totalMatches': 3}
>>> items(d)
('Sunrisers Hyderabad', 'Won', 3)
>>>
I like to use it and give the callable a descriptive name, but I don't see it used much here on SO.
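For example, the row-building loop above could be written with it like this (a sketch, reusing the matches list from the earlier snippet):
import operator

fields = ('team', 'overallResult', 'totalMatches')
get_fields = operator.itemgetter(*fields)

rows = []
for match in matches:
    if not match['margins']:
        rows.append(match)
    else:
        for bar in (b['bar'] for b in match['margins']):
            d = dict(zip(fields, get_fields(match)))
            d['margins'] = bar
            rows.append(d)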
You can use pd.DataFrame to create the DataFrame and then explode the margins column:
import json
import pandas as pd

with open('data.json', 'r', encoding='utf-8') as f:
    data = json.loads(f.read())

df = pd.DataFrame(data['matches']).explode('margins', ignore_index=True)
print(df)
team overallResult totalMatches margins
0 Sunrisers Hyderabad Won 3 {'bar': 290}
1 Sunrisers Hyderabad Won 3 {'bar': 90}
2 Pune Warriors None 0 None
Then replace the None values in the margins column with a placeholder dictionary and expand the dictionaries into a column:
bar = df['margins'].apply(lambda x: x if x else {'bar': pd.NA}).apply(pd.Series)
print(bar)
bar
0 290
1 90
2 <NA>
Finally, join the result to the original DataFrame and drop the margins column:
df = df.join(bar).drop(columns='margins')
print(df)
team overallResult totalMatches bar
0 Sunrisers Hyderabad Won 3 290
1 Sunrisers Hyderabad Won 3 90
2 Pune Warriors None 0 <NA>
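As a side note, the two post-processing steps can be collapsed with Series.str.get, which indexes into the dicts and returns NaN for the None rows (a sketch, same data as above):
df = pd.DataFrame(data['matches']).explode('margins', ignore_index=True)
df['bar'] = df['margins'].str.get('bar')
df = df.drop(columns='margins')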

Extract objects from nested json with Pandas

I have a nested json (like the one reported below) of translated labels, and I want to extract the leaves into separate json files, based on the language keys (it, en, etc.).
I don't know at "compile time" the depth and the schema of the json, because there are a lot of files similar to the big nested one, but I know that I always have the following structure: key path/to/en/label and value content.
I tried using Pandas with the json_normalize function to flatten my json, and it works great, but afterwards I had trouble rebuilding the json schema; e.g. with the following json I get a 1x12 DataFrame, but I want a resulting DataFrame with shape 4x3, where 4 is the number of different labels (index) and 3 the number of different languages (columns).
def fix_df(df: pd.DataFrame):
    assert df.shape[0] == 1
    columns = df.columns
    columns_last_piece = [s.split("/")[-1] for s in columns]
    fixed_columns = [s.split(".")[1] for s in columns_last_piece]
    index = [".".join(elem.split(".")[2:]) for elem in columns_last_piece]
    return pd.DataFrame(df.values, index=index, columns=fixed_columns)

def main():
    path = pathlib.Path(os.getenv("FIXTURE_FLATTEN_PATH"))
    assert path.exists()
    json_dict = json.load(open(path, encoding="utf-8"))
    flattened_json = pd.json_normalize(json_dict)
    flattened_json_fixed = fix_df(flattened_json)
    # do something with flattened_json_fixed
Example of my_labels.json:
{
    "dynamicData": {
        "bff_scoring": {
            "subCollection": {
                "dynamicData/bff_scoring/translations": {
                    "it": {
                        "area_title.people": "PERSONE",
                        "area_title.planet": "PIANETA",
                        "area_title.prosperity": "PROSPERITÀ",
                        "area_title.principle-gov": "PRINCIPI DI GOVERNANCE"
                    },
                    "en": {
                        "area_title.people": "PEOPLE",
                        "area_title.planet": "PLANET",
                        "area_title.prosperity": "PROSPERITY",
                        "area_title.principle-gov": "PRINCIPLE OF GOVERNANCE"
                    },
                    "fr": {
                        "area_title.people": "PERSONNES",
                        "area_title.planet": "PLANÈTE",
                        "area_title.prosperity": "PROSPERITÉ",
                        "area_title.principle-gov": "PRINCIPES DE GOUVERNANCE"
                    }
                }
            }
        }
    }
}
Example of my_labels_it.json:
{
    "area_title.people": "PERSONE",
    "area_title.planet": "PIANETA",
    "area_title.prosperity": "PROSPERITÀ",
    "area_title.principle-gov": "PRINCIPI DI GOVERNANCE"
}
I finally managed to solve this problem.
First, I needed to use the melt function.
>>> df = flattened_json.melt()
>>> df
variable value
0 dynamicData.bff_scoring.subCollection.dynamicD... PERSONE
1 dynamicData.bff_scoring.subCollection.dynamicD... PIANETA
2 dynamicData.bff_scoring.subCollection.dynamicD... PROSPERITÀ
3 dynamicData.bff_scoring.subCollection.dynamicD... PRINCIPI DI GOVERNANCE
...
From here, I can extract the fields I'm interested in with a regular expression. I tried using .str.extractall and explode, but I was greeted with an exception, so I resorted to using .str.extract twice.
>>> df2 = df.assign(language=df.variable.str.extract(r".*\.([a-z]{2})\.[\w\.-]+$"), label=df.variable.str.extract(r"(?<=\.[a-z]{2}\.)([\w\.-]+)$")).drop(columns="variable")
>>> df2
value language label
0 PERSONE it area_title.people
1 PIANETA it area_title.planet
2 PROSPERITÀ it area_title.prosperity
3 PRINCIPI DI GOVERNANCE it area_title.principle-gov
...
And then, with a pivot, I can have the dataframe with the desired schema.
>>> df3 = df2.pivot(index="label", columns="language", values="value")
>>> df3
language en ... it
label ...
area_title.people PEOPLE ... PERSONE
area_title.planet PLANET ... PIANETA
area_title.principle-gov PRINCIPLE OF GOVERNANCE ... PRINCIPI DI GOVERNANCE
area_title.prosperity PROSPERITY ... PROSPERITÀ
From this dataframe it is very simple to obtain the expected json.
>>> df3["it"].to_json(force_ascii=False)
'{"area_title.people":"PERSONE","area_title.planet":"PIANETA","area_title.principle-gov":"PRINCIPI DI GOVERNANCE","area_title.prosperity":"PROSPERITÀ"}'
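And to produce the per-language files that were the original goal, one can loop over the columns (a sketch; the file-naming scheme is my assumption):
for lang in df3.columns:
    with open(f"my_labels_{lang}.json", "w", encoding="utf-8") as f:
        f.write(df3[lang].to_json(force_ascii=False))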

What am I doing wrong in the process of vectorizing the test of whether my geolocation fields are valid?

I was calling out to a geolocation API and was converting the results to a DataFrame like so:
results = geolocator.lookup(ip_list)
results:
[{
    query: "0.0.0.0",
    coordinates: { lat: "0", lon: "0" }
}, ...]
So we queried 0.0.0.0 and the API returned "0"s for the lat/lon, indicating an IP that obviously can't be geolocated. A weird way to handle things, as opposed to a false value or something, but we can work with it.
To DataFrame:
df = pd.DataFrame(results)
But wait, this leads to those "coordinate" fields being dictionaries within the DataFrame, and I may be a pandas beginner, but I know I probably want those stored as DataFrames, not dicts, so we can vectorize.
So instead I did:
for result in results:
    result["coordinates"] = pd.DataFrame(result["coordinates"], index=[0])
df = pd.DataFrame(results)
Not sure what index=[0] does there, but without it I get an error, so I did it like that. Stop me here and tell me why I'm wrong if I'm doing this badly so far. I'm new to Python, and DataFrames with more than two dimensions are confusing to visualize.
Then I wanted to process over df and add a "geolocated" column with True or False based on a vectorized test, and tried to do that like so:
def is_geolocated(coordinate_df):
    # yes the API returned string coords
    lon_zero = np.equal(coordinate_df["lon"], "0")  # error here
    lat_zero = np.equal(coordinate_df["lat"], "0")
    return lon_zero & lat_zero

df["geolocated"] = is_geolocated(df["coordinates"])
But this throws a KeyError "lon".
Am I even on the right track, and if not, how should I set this up?
Generally I would agree with you that a dictionary is a bad way to store latitude/longitude values. This happens due to the way pd.DataFrame() works, as it will pick up on the keys query and coordinates, where the value for the key coordinates is simply a dictionary of the lat/lon values.
You can circumvent the entire problem by, e.g., defining every row as a tuple, and the whole dataframe as a list of these tuples. You can then perform a comparison whether both the lat and lon value are zero, and return this as a new column.
import pandas as pd
# Test dataset
results = [{
    'query': "0.0.0.0",
    'coordinates': {'lat': "0", 'lon': "0"}
},
{
    'query': "0.0.0.0",
    'coordinates': {'lat': "1", 'lon': "1"}
}]
df = pd.DataFrame([(result['query'], result['coordinates']['lat'], result['coordinates']['lon']) for result in results])
df.columns = ['Query', 'Lat', 'Lon']
df['Geolocated'] = ((df['Lat'] == '0') & (df['Lon'] == '0'))
df.head()
Query Lat Lon Geolocated
0 0.0.0.0 0 0 True
1 0.0.0.0 1 1 False
In this code I used a list comprehension to build the list of tuples and defined the 'Geolocated' column as a series, which comes from the comparison of the row's Lat and Lon values.
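An alternative worth mentioning: pd.json_normalize flattens the nested coordinates dict into its own columns, so the manual tuple building is not needed (a sketch using the same test data as above):
df = pd.json_normalize(results)
#      query coordinates.lat coordinates.lon
# 0  0.0.0.0               0               0
# 1  0.0.0.0               1               1
df['Geolocated'] = (df['coordinates.lat'] == '0') & (df['coordinates.lon'] == '0')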

Extract JSON | API | Pandas DataFrame

I am using the Facebook API (v2.10), from which I've extracted the data I need; 95% of it is perfect. My problem is the 'actions' metric, which returns as a dictionary within a list within another dictionary.
At present, all the data is in a DataFrame, however, the 'actions' column is a list of dictionaries that contain each individual action for that day.
{
    "actions": [
        {
            "action_type": "offsite_conversion.custom.xxxxxxxxxxx",
            "value": "7"
        },
        {
            "action_type": "offsite_conversion.custom.xxxxxxxxxxx",
            "value": "3"
        },
        {
            "action_type": "offsite_conversion.custom.xxxxxxxxxxx",
            "value": "144"
        },
        {
            "action_type": "offsite_conversion.custom.xxxxxxxxxxx",
            "value": "34"
        }
    ]
}
All this appears in one cell (row) within the DataFrame.
What is the best way to:
Get each action type and create a new column, using the "action_type" value as the column name?
List the correct value under this column?
It looks like JSON, but when I look at the type, it's a pandas Series (stored as an object).
For those willing to help (thank you, I greatly appreciate it) - can you either point me in the direction of the right material, and I will read it and work it out on my own (I'm not entirely sure what to look for), or, if you decide this is an easy problem, explain to me how and why you solved it this way. I don't just want the answer.
I have tried the following (with help from a friend) and it kind of works, but I have issues with it running in my script, i.e. if it runs within a bigger code block, I get the following error:
for i in range(df.shape[0]):
    line = df.loc[i, 'Conversions']
    L = ast.literal_eval(line)
    for l in L:
        cid = l['action_type']
        value = l['value']
        df.loc[i, cid] = value
If I save the DF as a CSV and read it back using pd.read_csv, it executes properly, but not within the script. No idea why.
Error:
ValueError: malformed node or string: [{'value': '1', 'action_type': 'offsite_conversion.custom.xxxxx}]
Any help would be greatly appreciated.
Thanks,
Adrian
You can use json_normalize:
In [11]: d # e.g. dict from json.load OR instead pass the json path to json_normalize
Out[11]:
{'actions': [{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx',
'value': '7'},
{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '3'},
{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '144'},
{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '34'}]}
In [12]: pd.io.json.json_normalize(d, record_path="actions")
Out[12]:
action_type value
0 offsite_conversion.custom.xxxxxxxxxxx 7
1 offsite_conversion.custom.xxxxxxxxxxx 3
2 offsite_conversion.custom.xxxxxxxxxxx 144
3 offsite_conversion.custom.xxxxxxxxxxx 34
You can use df.join(pd.DataFrame(df['Conversions'].tolist()).pivot(columns='action_type', values='value').reset_index(drop=True)).
Explanation:
df['Conversions'].tolist() returns a list of dictionaries. This list is then transformed into a DataFrame using pd.DataFrame. Then, you can use the pivot function to pivot the table into the shape that you want.
Lastly, you can join the table with your original DataFrame. Note that this only works if your DataFrame's index is the default (i.e., integers starting from 0). If this is not the case, you can do this instead:
df2 = pd.DataFrame(df['Conversions'].tolist()).pivot(columns='action_type', values='value').reset_index(drop=True)
for col in df2.columns:
    df[col] = df2[col]

Pandas DataFrame from Dictionary with Lists

I have an API that returns a single row of data as a Python dictionary. Most of the keys have a single value, but some of the keys have values that are lists (or even lists-of-lists or lists-of-dictionaries).
When I throw the dictionary into pd.DataFrame to try to convert it to a pandas DataFrame, it throws a "Arrays must be the same length" error. This is because it cannot process the keys which have multiple values (i.e. the keys which have values of lists).
How do I get pandas to treat the lists as 'single values'?
As a hypothetical example:
data = {'building': 'White House', 'DC?': True,
        'occupants': ['Barack', 'Michelle', 'Sasha', 'Malia']}
I want to turn it into a DataFrame like this:
ix building DC? occupants
0 'White House' True ['Barack', 'Michelle', 'Sasha', 'Malia']
This works if you pass a list (of rows):
In [11]: pd.DataFrame(data)
Out[11]:
DC? building occupants
0 True White House Barack
1 True White House Michelle
2 True White House Sasha
3 True White House Malia
In [12]: pd.DataFrame([data])
Out[12]:
DC? building occupants
0 True White House [Barack, Michelle, Sasha, Malia]
This turns out to be very trivial in the end
data = { 'building': 'White House', 'DC?': True, 'occupants': ['Barack', 'Michelle', 'Sasha', 'Malia'] }
df = pandas.DataFrame([data])
print df
Which results in:
DC? building occupants
0 True White House [Barack, Michelle, Sasha, Malia]
A solution to make a dataframe from a dictionary of lists, where the keys become a sorted index and column names are provided. Good for creating dataframes from scraped HTML tables.
d = { 'B':[10,11], 'A':[20,21] }
df = pd.DataFrame(d.values(),columns=['C1','C2'],index=d.keys()).sort_index()
df
C1 C2
A 20 21
B 10 11
Would it be acceptable if instead of having one entry with a list of occupants, you had individual entries for each occupant? If so you could just do
n = len(data['occupants'])
for key, val in data.items():
    if key != 'occupants':
        data[key] = n*[val]
EDIT: Actually, I'm getting this behavior in pandas (i.e. just with pd.DataFrame(data)) even without this pre-processing. What version are you using?
I had a closely related problem, but my data structure was a multi-level dictionary with lists in the second level dictionary:
result = {'hamster': {'confidence': 1, 'ids': ['id1', 'id2']},
'zombie': {'confidence': 1, 'ids': ['id3']}}
When importing this with pd.DataFrame([result]), I end up with columns named hamster and zombie. The (for me) correct import would be to have these as row titles, and confidence and ids as column titles. To achieve this, I used pd.DataFrame.from_dict:
In [42]: pd.DataFrame.from_dict(result, orient="index")
Out[42]:
confidence ids
hamster 1 [id1, id2]
zombie 1 [id3]
This works for me with python 3.8 + pandas 1.2.3.
If you know the keys of the dictionary beforehand, why not first create an empty data frame and then keep adding rows?
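A sketch of that approach, using the question's data (though collecting all rows first and constructing the frame once is usually cheaper than repeated appends):
import pandas as pd

df = pd.DataFrame(columns=['building', 'DC?', 'occupants'])  # keys known up front

row = {'building': 'White House', 'DC?': True,
       'occupants': ['Barack', 'Michelle', 'Sasha', 'Malia']}
df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)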
