unpack dictionary values in dataframe python

I have this dataframe from which I need to extract act1omschr from the column adresactiviteit. However, since the column is an object holding a list and a dict, I don't know how to extract these values.
Can someone help me out?

It looks like that's not a plain dictionary but JSON (JavaScript Object Notation). It's a bit like a CSV but with nested values, and it's pretty common, especially for web data.
Pandas has a function called json_normalize which should help. For specifically using it on one column, this was answered pretty well over here. You should more or less be able to use the exact code given.
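
As a rough illustration of that suggestion, here is a minimal sketch. The question did not show the column contents, so the sample rows and the act1code key are made up; only adresactiviteit and act1omschr come from the question. It assumes each cell already holds a Python list of dicts.

```python
import pandas as pd

# Hypothetical data: the real contents of the column were not shown.
df = pd.DataFrame({
    "id": [1, 2],
    "adresactiviteit": [
        [{"act1omschr": "Retail", "act1code": "47"}],
        [{"act1omschr": "Wholesale", "act1code": "46"}],
    ],
})

# explode() gives one row per list element, so each cell becomes a single dict.
exploded = df.explode("adresactiviteit").reset_index(drop=True)

# json_normalize() turns the list of dicts into a frame with one column per key.
details = pd.json_normalize(exploded["adresactiviteit"].tolist())

# Put the extracted columns next to the remaining original columns.
result = pd.concat([exploded.drop(columns="adresactiviteit"), details], axis=1)
print(result[["id", "act1omschr"]])
```

If the cells are JSON text rather than Python objects, they would need to be parsed first, for example with json.loads, before this approach applies.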

Related

Pandas Styles removing default table format

I am trying to format a pandas DataFrame value representation.
Basically, all I want is to get the "Thousand" separator on my values.
I managed to do it using df.style.format. It does the job, but it also "breaks" my table's original design.
Here is an example of what is going on:
Is there anything I can do to avoid doing it? I want to keep the original table format, only changing the format of the value.
PS: Don't know if it makes any difference, but I am using Google Colab.

In case anyone is having the same problem as I was using Colab, I have found a solution:
.set_table_attributes('class="dataframe"') seems to solve the problem
More information can be found here: https://github.com/googlecolab/colabtools/issues/1687

For this case you could do:
pdf.assign(a=pdf['a'].map("{:,.0f}".format))
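
A small runnable version of that one-liner, with made-up sample values (the frame in the question was not shown). Note that the mapped column holds strings afterwards, so this is probably best done on a copy used only for display.

```python
import pandas as pd

# Hypothetical sample frame; only the column name "a" comes from the answer.
pdf = pd.DataFrame({"a": [1234567.0, 9876.4], "b": ["x", "y"]})

# Format column "a" with a thousands separator and no decimals.
# assign() returns a new frame, so pdf itself keeps its numeric values.
formatted = pdf.assign(a=pdf["a"].map("{:,.0f}".format))
print(formatted)
```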

How to create a dataframe from a complex (nested) list of dictionaries?

Good morning!
I have retrieved a bunch of data from the Facebook Graph API using the python facebook library. I'd like to organize all the meaningful data neatly in a csv file for later analysis, but the problem is I'm quite new to python and I don't know how to approach the format in which the data has been retrieved. Basically, I have all the data about a page posts from 01-05-2020 in a list called data_basic:
Every instance of the list represents one post and is a dict of size 8.
Every dict has: 3 dict elements, 3 string elements, 1 bool element and 1 list element.
For example, in order to access the media_type of the first post I must type: data_basic[0]['attachments']['data'][0]['value'], because inside the dict representing the first post I have a dict containing the attachments whose key 'data' contains a list in which I have the values (for example, again, media_type). A nightmare...
Every instance of the dict containing the post data is different... Attachment is the most nested, but something similar happens for the comments or the tags, while message, created time and so on are much more accessible.
I'd like to obtain a csv table whose rows are the various posts and whose columns are the variables (except, of course, the comments, which I'll store in a different file since there's more than one for each post).
How can I approach the problem? The first thing that comes to my mind is a brute-force approach using a for loop through all the posts and all the variables, filling the dataframe cell by cell. But I hope there's a quicker and more elegant way... I've come across the json_normalize function and tried something, but I really don't understand how it works or whether it can be of any help... Any thoughts?
Thanks in advance!
Edit: a couple of screenshots to make this clearer.
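
One possible approach with json_normalize, sketched on invented posts. The field names below (shares, is_published, url and so on) are assumptions, since the real structure was only shown in screenshots; only data_basic, attachments, data and media_type come from the question.

```python
import pandas as pd

# Hypothetical posts, loosely shaped like the description in the question.
data_basic = [
    {
        "id": "1",
        "message": "Hello",
        "created_time": "2020-05-01T10:00:00+0000",
        "is_published": True,
        "shares": {"count": 3},
        "attachments": {"data": [{"media_type": "photo", "url": "https://example.com/a"}]},
    },
    {
        "id": "2",
        "message": "World",
        "created_time": "2020-05-02T11:00:00+0000",
        "is_published": True,
        "attachments": {"data": [{"media_type": "video", "url": "https://example.com/b"}]},
    },
]

# Nested dicts become dotted columns such as "shares.count".
# Lists are left as lists, here in the column "attachments.data".
posts = pd.json_normalize(data_basic)


def first_media_type(attachments):
    """Return media_type of the first attachment, or None if there is none."""
    if isinstance(attachments, list) and attachments:
        return attachments[0].get("media_type")
    return None


posts["media_type"] = posts["attachments.data"].map(first_media_type)
posts = posts.drop(columns="attachments.data")

# to_csv() without a path returns the CSV text; pass a filename to write a file.
csv_text = posts.to_csv(index=False)
print(csv_text)
```

Posts that lack a key simply get a missing value in that column, so differently shaped posts can still share one table. Comments could be handled separately by normalizing only that part of each post.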

What .object means for a GroupBy Object

I keep seeing
for index, row in group.object.iterrows():
in Tensorflow tutorials. I get what it's doing, and that group is a GroupBy object, but I wonder what the ".object" is there for. I googled "group.object.iterrows", all I got was Tensorflow object detection code. I tried other variants, but nothing had a GroupBy.object example or description of what it is.
EDIT: here's a tutorial:
https://github.com/EdjeElectronics/TensorFlow-Object-Detection-API-Tutorial-Train-Multiple-Objects-Windows-10/blob/master/generate_tfrecord.py
See line 70.
Here's another, there are a bunch, actually:
https://www.skcript.com/svr/realtime-object-and-face-detection-in-android-using-tensorflow-object-detection-api/
Some more context:
They involve making a tensorflow.train.Example and loading features into it. These were originally taken from XML produced by some labeling tools, then converted to a csv, then converted to a pandas data frame.
In fact, the code mostly looks like cut-and-paste from some original script with small edits.

Like a DataFrame, a Pandas GroupBy object supports accessing columns by attribute access notation, as long as the column name doesn't conflict with "regular" attributes. object is merely one of the column names in the grouped data, and group.object accesses that column.

object is a column in the group DataFrame.
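
Another reading is worth checking against the script in question: a pandas Series has no iterrows method, so group.object.iterrows() only works if group.object is a DataFrame. That fits a pattern where the script wraps each group in a namedtuple with a field called object. The sketch below assumes that pattern; the Group name and the sample data are made up.

```python
from collections import namedtuple

import pandas as pd

# Hypothetical annotations: one row per labelled box, several rows per file.
df = pd.DataFrame({
    "filename": ["a.jpg", "a.jpg", "b.jpg"],
    "label": ["cat", "dog", "cat"],
})

# Assumed wrapper: one record per file, with that file's rows stored in "object".
Group = namedtuple("Group", ["filename", "object"])
groups = [Group(name, frame) for name, frame in df.groupby("filename")]

for group in groups:
    # group.object is an ordinary DataFrame here, so iterrows() is available.
    for index, row in group.object.iterrows():
        print(group.filename, index, row["label"])
```

Under this assumption, .object is just a field name chosen by the script's author, not something defined by pandas.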

Python/Dataframes: Column count and names?

Sorry for this newbie question. Working in Python, I need to get an integer value of the number of columns in the df and for each of them I need to get the column names.
df.printSchema() displays a nice tree view
df.describe().show() displays some stats as well.
but I can't seem to find a way to get a count of columns and an array of column names. Perhaps I should do this using the SQL API, but I am not quite familiar with it yet. Still learning the basics... Thanks so much!
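
The calls printSchema() and show() suggest a Spark DataFrame. Assuming so, its columns attribute is a plain list of names, and a pandas DataFrame exposes columns as well. The helper below (column_summary is a made-up name) relies only on that attribute, and the demo uses a pandas frame so it runs without a Spark session.

```python
import pandas as pd


def column_summary(df):
    """Return (number of columns, list of column names) for a DataFrame.

    Only relies on the .columns attribute, which both pandas and
    PySpark DataFrames provide.
    """
    names = list(df.columns)
    return len(names), names


# Stand-in pandas frame; with Spark you would pass the Spark DataFrame instead.
demo = pd.DataFrame({"id": [1, 2], "name": ["a", "b"], "score": [0.5, 0.7]})
count, names = column_summary(demo)
print(count, names)
```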

What is the data structure in python that can contain multiple pandas data frames?

I want to write a function to return several data frames (different dims) and put them into a larger "container" and then select each from the "container" using indexing. I think I want to find some data structure like list in R, which can have different kinds of objects.
What can I use to do this?

I haven't done much with Panels, but what exactly is the functionality that you need? Is there a reason a simple python list wouldn't work? Or, if you want to refer by name and not just by list position, a dictionary?

It depends a bit on what you want to achieve. People used to work a lot with MultiIndex, keeping an identifier of the dataframe as an index level (documentation).
But recently, there have been a lot of improvements to the Panel class, which is most likely the optimal solution for you (api, documentation)

I agree with @foobar. I have used MultiColumns and MultiIndexes before for this type of data. However, I believe the best data type for this would be a pandas Panel. Here is the documentation...
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.Panel.html
You can add the frames on just like you would add elements to a dict
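
Panel was deprecated in pandas 0.20 and removed in 0.25, so on current versions the dict and MultiIndex suggestions are the ones that still apply. A minimal sketch of both, with made-up frames and a hypothetical build_frames function:

```python
import pandas as pd


def build_frames():
    """Return several frames of different shapes in one dict, keyed by name."""
    sales = pd.DataFrame({"region": ["N", "S"], "total": [10, 20]})
    costs = pd.DataFrame({
        "item": ["x", "y", "z"],
        "cost": [1.5, 2.5, 3.5],
        "qty": [1, 2, 3],
    })
    return {"sales": sales, "costs": costs}


frames = build_frames()

# Select one frame by name, much like a named list element in R.
print(frames["costs"].shape)

# Optional: stack everything into one frame whose outer index level is the key.
# Columns missing from a given frame are filled with NaN.
stacked = pd.concat(frames)
print(stacked.loc["sales"])
```

A plain list works the same way if positional access is enough. The dict keeps each frame's own shape, while the concat form suits frames that share most of their columns.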
