Understanding the details of this Python code

The task is to load the iris data set from sklearn and then make some plots. I wish to understand what each command is doing.
from sklearn.datasets import load_iris
Q1 Is load_iris a function in sklearn?
data = load_iris()
Q2 Now I believe this load_iris function is returning some output which we are storing as data. What exactly is the output of load_iris() (its type, etc.)?
df = pd.DataFrame(data.data, columns=data.feature_names)
Q3 Now we are storing this as a DataFrame, but what are data.data and data.feature_names?
df['target_names'] = [data.target_names[i] for i in data.target]
Q4 I don't understand the right-hand side of the above code.
I need help with Questions 1, 2, 3, and 4. I tried looking at the scikit-learn documentation but didn't understand it. This code is from an online course on edX, but they didn't explain it.

Discover the power of interactivity in Jupyter/IPython. I'm using IPython in this example.
Q1 Is load_iris a function in sklearn?
In [33]: type(load_iris)
Out[33]: function
Q2 Now I believe this load_iris function is returning some output which we are storing as data. What exactly is the output of load_iris() (its type, etc.)?
The docstring is very helpful:
In [34]: load_iris?
Signature: load_iris(return_X_y=False)
Docstring:
Load and return the iris dataset (classification).
The iris dataset is a classic and very easy multi-class classification
dataset.
================= ==============
Classes 3
Samples per class 50
Samples total 150
Dimensionality 4
Features real, positive
================= ==============
Read more in the :ref:`User Guide <datasets>`.
Parameters
----------
return_X_y : boolean, default=False.
If True, returns ``(data, target)`` instead of a Bunch object. See
below for more information about the `data` and `target` object.
.. versionadded:: 0.18
Returns
-------
data : Bunch
Dictionary-like object, the interesting attributes are:
'data', the data to learn, 'target', the classification labels,
'target_names', the meaning of the labels, 'feature_names', the
meaning of the features, and 'DESCR', the
full description of the dataset.
(data, target) : tuple if ``return_X_y`` is True
...
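Per the return_X_y parameter shown above, you can also get the arrays directly instead of the Bunch:
In [35]: X, y = load_iris(return_X_y=True)
In [36]: X.shape, y.shape
Out[36]: ((150, 4), (150,))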
Print the description:
In [51]: print(data.DESCR)
Iris Plants Database
====================
Notes
-----
Data Set Characteristics:
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
...
Q3 Now we are storing this as a DataFrame, but what are data.data and data.feature_names?
In [37]: type(data.data)
Out[37]: numpy.ndarray
In [88]: data.data.shape
Out[88]: (150, 4)
In [38]: df = pd.DataFrame(data.data, columns=data.feature_names)
In [39]: pd.set_option('display.max_rows', 10)
In [40]: df
Out[40]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
.. ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8
[150 rows x 4 columns]
In [41]: df.columns
Out[41]: Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'], dtype='object')
In [42]: data.feature_names
Out[42]:
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']
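Similarly, data.target holds the integer class labels (0, 1, 2) and data.target_names their meanings; these are exactly what Q4's code combines (output shown for a recent scikit-learn version):
In [43]: data.target_names
Out[43]: array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
In [44]: data.target[:5]
Out[44]: array([0, 0, 0, 0, 0])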
Q4 I don't understand the right-hand side of the above code.
Execute the code and check the result; usually it's easy to see what has happened. BTW, I'd use NumPy for this step:
In [49]: df['target_names'] = np.take(data.target_names, data.target)
In [50]: df
Out[50]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target_names
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica
[150 rows x 5 columns]
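For reference, the original list comprehension does the same mapping in pure Python: for each integer label i in data.target, it looks up the corresponding name data.target_names[i]:
In [52]: df['target_names'] = [data.target_names[i] for i in data.target]
np.take(data.target_names, data.target) is simply the vectorized equivalent.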


How to annotate swarmplot points on a categorical axis and labels from a different column

I’m trying to add labels to a few values in my matplotlib/seaborn plot. Not all, just those above a certain value (below, using iris from sklearn, labels for values greater than 3.6 on the x-axis).
Here is a discussion from @Scinana last year about doing that when both axes are numeric. While it includes an accepted answer, I'm having trouble adapting it to my situation, and the links provided in the accepted answer don't help either.
The code below works until the last step (the labeling), which throws: TypeError: 'FacetGrid' object is not callable.
Additionally, the outliers need to be annotated with values from dfiris['sepal length (cm)'], not just the word 'outlier'.
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

dfiris = load_iris()
dfiris = pd.DataFrame(data=dfiris.data, columns=dfiris.feature_names)
dfiris['name'] = np.where(dfiris['sepal width (cm)'] < 3, 'Amy', 'Bruce')  # adding a fake categorical variable
dfiris['name'] = np.where((dfiris.name != 'Amy') & (dfiris['petal length (cm)'] >= 3.4), 'Charles', dfiris.name)  # adding to that fake categorical variable
a_viz = sns.catplot(x='sepal width (cm)', y='name', kind='swarm', data=dfiris)
a_viz.fig.set_size_inches(5, 6)
a_viz.fig.subplots_adjust(top=0.81, right=0.86)
for x, y in zip(dfiris['sepal width (cm)'], dfiris['name']):
    if x > 3.6:
        a_viz.text(x, y, 'outlier', horizontalalignment='left', size='medium', color='black')  # this line raises the TypeError
The following duplicates didn't completely address the issue with adding annotations from a different column, nor how to prevent the annotations from overlapping.
How to Annotate Bars in a Seaborn Facetgrid (works in Factorplot)
Different functions return different object types, such as FacetGrid and AxesSubplot. Why and what is the difference?
displot 'FacetGrid' object is not callable
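A note on the immediate error first: sns.catplot returns a seaborn FacetGrid, not a matplotlib Axes, and the .text drawing method lives on the Axes (a_viz.ax for a single-facet plot). A minimal sketch of that fix, which still leaves the labels overlapping:
for x, y in zip(dfiris['sepal width (cm)'], dfiris['name']):
    if x > 3.6:
        a_viz.ax.text(x, y, 'outlier', horizontalalignment='left', size='medium', color='black')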
For a swarmplot, every observation in a category sits at the same tick location on the categorical axis, so the text annotations for equal values on the x-axis will overlap.
This can be resolved by using pandas.DataFrame.groupby to combine the overlapping values into the single strings passed to s=.
Non-overlapping Annotations
import seaborn as sns
# load sample data that has text labels
df = sns.load_dataset('iris')
# plot the DataFrame
g = sns.catplot(x='sepal_width', y='species', kind='swarm', data=df, height=7, aspect=2)
# there is only one axes for this plot; provide an alias for ease of use
ax = g.axes[0, 0]
# get the ytick locations for each name
ytick_loc = {v.get_text(): v.get_position()[1] for v in ax.get_yticklabels()}
# add the ytick locations for each observation
df['ytick_loc'] = df.species.map(ytick_loc)
# filter the dataframe to only contain the outliers
outliers = df[df.sepal_width.gt(3.6)].copy()
# convert the column to strings for annotations
outliers['sepal_length'] = outliers['sepal_length'].astype(str)
# combine all the sepal_length values as a single string for each species and width
labels = outliers.groupby(['sepal_width', 'ytick_loc']).agg({'sepal_length': '\n'.join}).reset_index()
# iterate through each Axes of the FacetGrid with `for ax in g.axes.flat:` or specify the exact Axes to use
for _, (x, y, s) in labels.iterrows():
    ax.text(x + 0.01, y, s=s, horizontalalignment='left', size='medium', color='black', verticalalignment='center', linespacing=1)
DataFrame Views
df
sepal_length sepal_width petal_length petal_width species ytick_loc
0 5.1 3.5 1.4 0.2 setosa 0
1 4.9 3.0 1.4 0.2 setosa 0
2 4.7 3.2 1.3 0.2 setosa 0
3 4.6 3.1 1.5 0.2 setosa 0
4 5.0 3.6 1.4 0.2 setosa 0
outliers
sepal_length sepal_width petal_length petal_width species ytick_loc
5 5.4 3.9 1.7 0.4 setosa 0
10 5.4 3.7 1.5 0.2 setosa 0
14 5.8 4.0 1.2 0.2 setosa 0
15 5.7 4.4 1.5 0.4 setosa 0
16 5.4 3.9 1.3 0.4 setosa 0
18 5.7 3.8 1.7 0.3 setosa 0
19 5.1 3.8 1.5 0.3 setosa 0
21 5.1 3.7 1.5 0.4 setosa 0
32 5.2 4.1 1.5 0.1 setosa 0
33 5.5 4.2 1.4 0.2 setosa 0
44 5.1 3.8 1.9 0.4 setosa 0
46 5.1 3.8 1.6 0.2 setosa 0
48 5.3 3.7 1.5 0.2 setosa 0
117 7.7 3.8 6.7 2.2 virginica 2
131 7.9 3.8 6.4 2.0 virginica 2
labels
sepal_width ytick_loc sepal_length
0 3.7 0 5.4\n5.1\n5.3
1 3.8 0 5.7\n5.1\n5.1\n5.1
2 3.8 2 7.7\n7.9
3 3.9 0 5.4\n5.4
4 4.0 0 5.8
5 4.1 0 5.2
6 4.2 0 5.5
7 4.4 0 5.7
Overlapping Annotations
import seaborn as sns
# load sample data that has text labels
df = sns.load_dataset('iris')
# plot the DataFrame
g = sns.catplot(x='sepal_width', y='species', kind='swarm', data=df, height=7, aspect=2)
# there is only one axes for this plot; provide an alias for ease of use
ax = g.axes[0, 0]
# get the ytick locations for each name
ytick_loc = {v.get_text(): v.get_position()[1] for v in ax.get_yticklabels()}
# plot the text annotations
for x, y, s in zip(df.sepal_width, df.species.map(ytick_loc), df.sepal_length):
    if x > 3.6:
        ax.text(x, y, s, horizontalalignment='left', size='medium', color='k')

Python clustering package that can cluster based on a distance matrix but also predict a new row (without a new clustering/distance matrix)

I am aware of various (sklearn) clustering algorithms that work with distance matrices, e.g. one produced via a proximity matrix coming from a random forest (some clumsy reproducible code below). Is there any clustering algorithm (working with a distance matrix) where the fitted cluster model (e.g. cluster_model below) can produce the cluster membership of a new data row?
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.cluster import AgglomerativeClustering
import pandas as pd
import numpy as np
import os
from sklearn.ensemble import RandomForestRegressor
from sklearn import datasets
def distanceMatrix(model, X, normalize=True):
    # leaf indices for every sample in every tree: shape (n_samples, n_trees)
    terminals = model.apply(X)
    nTrees = terminals.shape[1]
    # count, for each pair of samples, how many trees put them in the same leaf
    a = terminals[:, 0]
    proxMat = 1 * np.equal.outer(a, a)
    for i in range(1, nTrees):
        a = terminals[:, i]
        proxMat += 1 * np.equal.outer(a, a)
    if normalize:
        proxMat = proxMat / nTrees
    # convert proximity to distance
    return 1 - proxMat
# use iris data to make example reproducible and fast
iris = datasets.load_iris()
df = pd.DataFrame(iris['data'], columns = iris['feature_names'])
df['target'] = pd.Series(iris['target'], name = 'target_values')
df['target_name'] = df['target'].replace([0,1,2], ['iris-' + species for species in iris['target_names'].tolist()])
# simple one hot
df['iris_setosa'] = (df['target_name'] == 'iris-setosa').astype(int)
df['iris_versicolor'] = (df['target_name'] == 'iris-versicolor').astype(int)
df['iris_virginica'] = (df['target_name'] == 'iris-virginica').astype(int)
# the new regression model "target"
y = df['petal width (cm)']
X = df.drop(['target', 'target_name', 'petal width (cm)'], axis=1)
# fit a random forest just for the purpose of getting a proximity matrix
# open question: does it matter which target is picked and/or whether regression or classification is used?
# this is just to produce a toy dataset with mixed data
overfitted_model = RandomForestRegressor(n_estimators=250, min_samples_leaf=10)
overfitted_model.fit(X, y)
distance_matrix = distanceMatrix(overfitted_model, X, normalize=True)
cluster_model = AgglomerativeClustering(n_clusters=3, affinity='precomputed', linkage='average')
cluster_model.fit(distance_matrix)
df['label'] = cluster_model.labels_
PS:
Readers may find this interesting in this context.
For Agglomerative Clustering, adding additional data points requires a recompute of the clusters because of how this type of clustering works. Agglomerative Clustering iteratively builds clusters based on started points, and then merges according to a linkage measure, so adding new data points can and will modify the final clusters.
...
cluster_model.fit_predict(distance_matrix)
df['label'] = cluster_model.labels_
Check out fit_predict, which is more relevant for unsupervised or transductive estimators; it fits and returns the labels in one call, so the two lines above can also be collapsed into df['label'] = cluster_model.fit_predict(distance_matrix).
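As a sketch of a common workaround (an addition, not part of the answer above): once the agglomerative model has been fitted, you can train a separate inductive classifier on its labels and use that to assign new rows, assuming you can compute the new row's distances to the training rows with the same proximity measure:
from sklearn.neighbors import KNeighborsClassifier

# 'precomputed' lets k-NN consume the same distance matrix used for clustering
knn = KNeighborsClassifier(n_neighbors=5, metric='precomputed')
knn.fit(distance_matrix, cluster_model.labels_)
# new_row_distances is hypothetical: shape (1, n_training_rows), the distances
# from the new row to every training row (e.g. derived from the forest's leaf indices)
# new_label = knn.predict(new_row_distances)
This does not change the fitted clusters; it only approximates membership for new data.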
output:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target target_name iris_setosa iris_versicolor iris_virginica label
0 5.1 3.5 1.4 0.2 0 iris-setosa 1 0 0 1
1 4.9 3.0 1.4 0.2 0 iris-setosa 1 0 0 1
2 4.7 3.2 1.3 0.2 0 iris-setosa 1 0 0 1
3 4.6 3.1 1.5 0.2 0 iris-setosa 1 0 0 1
4 5.0 3.6 1.4 0.2 0 iris-setosa 1 0 0 1
.. ... ... ... ... ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 2 iris-virginica 0 0 1 2
146 6.3 2.5 5.0 1.9 2 iris-virginica 0 0 1 2
147 6.5 3.0 5.2 2.0 2 iris-virginica 0 0 1 2
148 6.2 3.4 5.4 2.3 2 iris-virginica 0 0 1 2
149 5.9 3.0 5.1 1.8 2 iris-virginica 0 0 1 2

Is there a function that gives practical information about a dataframe?

In Python, there is the DataFrame method .info(). This method gives you information about a dataset such as the datatypes, the memory usage, the number of entries, etc.
Here you can look up more information about the .info() function in Python.
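For concreteness, this is roughly what .info() prints for the iris DataFrame built in the first question (abbreviated; the exact format varies by pandas version):
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
...
dtypes: float64(4), object(1)
memory usage: ...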
Is there also a function in R that gives me this kind of information?
So here we have a few options
Base R
Within base R there are a few options for getting this kind of information about your data:
str
You can use str to see the structure of a data frame
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary
Additionally, there is the summary function, which produces a five-number summary plus the mean for each numeric column, and counts for factors:
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
dplyr
dplyr provides glimpse, something similar to str which also shows the data types:
library(dplyr)
glimpse(iris)
Observations: 150
Variables: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5...
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0, 3.0, 4.0, 4.4, 3.9, 3.5, 3.8, 3...
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.5, 1.3, 1.4, 1.7, 1...
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.2, 0.4, 0.4, 0.3, 0.3, 0...
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, ...
skimr
Finally, the skimr package provides an enhanced summary, including little histograms:
library(skimr)
skim(iris)
-- Data Summary ------------------------
Values
Name iris
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 4
________________________
Group variables None
-- Variable type: factor -------------------------------------------------------
skim_variable n_missing complete_rate ordered n_unique top_counts
1 Species 0 1 FALSE 3 set: 50, ver: 50, vir: 50
-- Variable type: numeric ------------------------------------------------------
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 Sepal.Length 0 1 5.84 0.828 4.3 5.1 5.8 6.4 7.9 ▆▇▇▅▂
2 Sepal.Width 0 1 3.06 0.436 2 2.8 3 3.3 4.4 ▁▆▇▂▁
3 Petal.Length 0 1 3.76 1.77 1 1.6 4.35 5.1 6.9 ▇▁▆▇▂
4 Petal.Width 0 1 1.20 0.762 0.1 0.3 1.3 1.8 2.5 ▇▁▇▅▃
Between those functions you can get a pretty good look at your data!
It's not a single function, but the first three things I always do are
library(tidyverse)
# Shows top 6 rows
iris %>% head()
# Gives dimensions of data.frame
iris %>% dim()
# Gives the classes of the data in each column (e.g. numeric, character etc)
iris %>% sapply(class)
The best package I use, which I haven't seen mentioned above, is inspectdf (mentioned by Niels in a comment above). inspectdf does much of the summary you see in skimr in @MDEWITT's answer via specific function calls; for instance, inspect_cat and inspect_num for categorical and numerical variable summaries, respectively.
The contribution of my answer is that inspectdf has two additional functions, inspect_imb and inspect_cor, which respectively look at the most common value per column and the correlation between numerical columns. I find these tremendously useful for data cleaning/pre-processing.

Descriptive statistics along rows of text file using Pandas

I am reading a text file using pandas in Python 2.7. The dataset in this question is related to a question I asked before here. To be specific, the first two rows and the first column of my data consist of text information. The following is a snapshot of a truncated version of my dataset.
The data file can be found here. I am using the helpful answers given here to load the dataset (df = pd.read_csv('dum.txt',sep='\t', header=[0,1], index_col=0)).
I want to get descriptive statistics of my pandas dataframe along rows, instead of columns. I have tried using df.describe(), but it gives me descriptive statistics along columns. I had a look at the answers given in this question, but I get the following error when I use the answers suggested in that link.
TypeError: ('unbound method describe() must be called with DataFrame instance as first argument (got Series instance instead)', u'occurred at index foxq1')
How can I get descriptive statistics using Pandas for the numerical entries in every row for the dataset that I have? Thanks in advance.
Following a few comments, I am including the actual code that I am using, and the error message:
The actual code is this:
df = pd.read_csv('dum.txt',sep='\t', header=[0,1], index_col=0)
df.apply(pd.DataFrame.describe, axis=1)
Error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-20-0d7a5fde0f42> in <module>()
----> 1 df.apply(pd.DataFrame.describe, axis=1)
2 #df.apply(pd.DataFrame.describe, axis=1)
/Users/LG/anaconda2/lib/python2.7/site-packages/pandas/core/frame.pyc in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
4260 f, axis,
4261 reduce=reduce,
-> 4262 ignore_failures=ignore_failures)
4263 else:
4264 return self._apply_broadcast(f, axis)
/Users/LG/anaconda2/lib/python2.7/site-packages/pandas/core/frame.pyc in _apply_standard(self, func, axis, ignore_failures, reduce)
4356 try:
4357 for i, v in enumerate(series_gen):
-> 4358 results[i] = func(v)
4359 keys.append(v.name)
4360 except Exception as e:
TypeError: ('unbound method describe() must be called with DataFrame instance as first argument (got Series instance instead)', u'occurred at index object1')
From the question you referenced, you can just use this code (in other words apply describe along the rows):
df.apply(pd.DataFrame.describe, axis=1)
And you get the following result:
count mean std min 25% 50% 75% max
object1 5.0 3.1 1.581139 1.1 2.1 3.1 4.1 5.1
object2 5.0 3.2 1.581139 1.2 2.2 3.2 4.2 5.2
object3 5.0 3.3 1.581139 1.3 2.3 3.3 4.3 5.3
object4 5.0 3.4 1.581139 1.4 2.4 3.4 4.4 5.4
object5 5.0 3.5 1.581139 1.5 2.5 3.5 4.5 5.5
object6 5.0 3.6 1.581139 1.6 2.6 3.6 4.6 5.6
object7 5.0 3.7 1.581139 1.7 2.7 3.7 4.7 5.7
object8 5.0 3.8 1.581139 1.8 2.8 3.8 4.8 5.8
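As an aside on the asker's TypeError: in Python 2, pd.DataFrame.describe is an unbound method that type-checks its first argument, and apply(axis=1) passes each row in as a Series, hence the error. A sketch of a fix is to apply the Series method instead, which accepts the row Series:
df.apply(pd.Series.describe, axis=1)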
You can try using NumPy to obtain many of the statistics for the rows:
import numpy as np
import pandas as pd

df = pd.read_csv('dum.txt', sep='\t', header=[0,1], index_col=0)
print df
Type T1 T2 T3 T4 T5 T6 T7
Tag Tag1 Tag1 Tag1 Tag5 Tag5 Tag6 Tag6
object1 1.1 2.1 3.1 4.1 5.1 6.1 7.1
object2 1.2 2.2 3.2 4.2 5.2 6.2 7.2
object3 1.3 2.3 3.3 4.3 5.3 6.3 7.3
object4 1.4 2.4 3.4 4.4 5.4 6.4 7.4
object5 1.5 2.5 3.5 4.5 5.5 6.5 7.5
object6 1.6 2.6 3.6 4.6 5.6 6.6 7.6
object7 1.7 2.7 3.7 4.7 5.7 6.7 7.7
object8 1.8 2.8 3.8 4.8 5.8 6.8 7.8
data = df.values
data_mean = np.mean(data, axis=1)
data_std = np.std(data, axis=1)
data_min = np.min(data, axis=1)
data_max = np.max(data, axis=1)
print data_mean
[ 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8]
print data_std
[ 2. 2. 2. 2. 2. 2. 2. 2.]
print data_min
[ 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8]
print data_max
[ 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8]
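A small addition (not part of the original answer): the per-row statistics can be collected back into a single DataFrame, keeping the object names as the index:
stats = pd.DataFrame({'mean': data_mean, 'std': data_std, 'min': data_min, 'max': data_max}, index=df.index)
print stats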

Python: List Nested Dictionary to pandas DataFrame Issue

I am struggling with a simple from_dict conversion. I have a dictionary of nested dictionaries of lists, as below (quite confusing to me as well):
dict_total = {'Jane' : {'a1' : [1.1,1.3,1.4,1.9],
                        'a2' : [3.1,2.4,2.3,1.2],
                        'a3' : [4.3,2.3,1.5,5.3],
                        'st' : ['d','dc','sc','sc']},
              'Mark' : {'a1' : [3.1,2.3,1.3,1.9],
                        'a2' : [1.2,2.3,9.3,1.2],
                        'a3' : [1.1,5.5,1.2,5.3],
                        'st' : ['cs','s','wc','cd']}
              }
The above is just a simple example, but my original contains more than 20,000 keys in dict_total. I want to convert this dictionary to a DataFrame (hopefully without explicit loops) like below.
df_total =
a1 a2 a3 st
Jane 1.1 3.1 4.3 d
Jane 1.3 2.4 2.3 dc
Jane 1.4 2.3 1.5 sc
Jane 1.9 1.2 5.3 sc
Mark 3.1 1.2 1.1 cs
Mark 2.3 2.3 5.5 s
Mark 1.3 9.3 1.2 wc
Mark 1.9 1.2 5.3 cd
As you can see, the keys of dict_total become the index of the DataFrame, the keys under "Jane" and "Mark" become the column names, and the lists become the values.
I hope there is a pythonic way to solve this. Thanks.
I think you need concat with a dict comprehension, then drop the second index level with reset_index:
df_total = (pd.concat({k: pd.DataFrame(v) for k, v in dict_total.items()})
              .reset_index(level=1, drop=True))
print (df_total)
a1 a2 a3 st
Jane 1.1 3.1 4.3 d
Jane 1.3 2.4 2.3 dc
Jane 1.4 2.3 1.5 sc
Jane 1.9 1.2 5.3 sc
Mark 3.1 1.2 1.1 cs
Mark 2.3 2.3 5.5 s
Mark 1.3 9.3 1.2 wc
Mark 1.9 1.2 5.3 cd
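As a small optional follow-up (assuming you would rather have the names as a column than as the index):
df_total = df_total.rename_axis('name').reset_index()
rename_axis names the index, so reset_index turns it into a regular 'name' column.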
