First I create a two-level MultiIndex:
import numpy as np
import pandas as pd
ind = pd.MultiIndex.from_product([('X','Y'), ('a','b')])
I can use it like this:
pd.DataFrame(np.zeros((3,4)), columns=ind)
Which gives:
     X         Y     
     a    b    a    b
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
But now I'm trying to do this:
dtype = [('Xa','f8'), ('Xb','i4'), ('Ya','f8'), ('Yb','i4')]
pd.DataFrame(np.zeros(3, dtype), columns=ind)
But that gives:
Empty DataFrame
Columns: [(X, a), (X, b), (Y, a), (Y, b)]
Index: []
I expected something like the previous result, with three rows.
Perhaps more generally, what I want to do is to generate a pandas DataFrame with MultiIndex columns where the columns have distinct types (as in the example, where a is float but b is int).
This looks like a bug, and is worth reporting as an issue on GitHub.
A workaround is to set the columns manually after construction:
In [11]: df1 = pd.DataFrame(np.zeros(3, dtype))
In [12]: df1.columns = ind
In [13]: df1
Out[13]:
     X       Y   
     a  b    a  b
0  0.0  0  0.0  0
1  0.0  0  0.0  0
2  0.0  0  0.0  0
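You can verify that the per-column dtypes survive this workaround (a quick check; the int32 shown assumes a platform where 'i4' maps to int32):
In [14]: df1.dtypes
Out[14]:
X  a    float64
   b      int32
Y  a    float64
   b      int32
dtype: object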
The output of
pd.DataFrame(np.zeros(3, dtype), columns=ind)
Empty DataFrame
Columns: [(X, a), (X, b), (Y, a), (Y, b)]
Index: []
is just the textual representation of the DataFrame, and
Columns: [(X, a), (X, b), (Y, a), (Y, b)]
is just the text representation of its columns index.
If you instead do:
df = pd.DataFrame(np.zeros(3, dtype), columns=ind)
print(type(df.columns))
<class 'pandas.indexes.multi.MultiIndex'>
you see that it is indeed a pd.MultiIndex.
With that out of the way, what I don't understand is why specifying the columns in the DataFrame constructor drops the values.
A workaround is this:
df = pd.DataFrame(np.zeros(3, dtype))
df.columns = ind
print(df)
     X       Y   
     a  b    a  b
0  0.0  0  0.0  0
1  0.0  0  0.0  0
2  0.0  0  0.0  0
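Another way around it is to build the frame from a dict keyed by the MultiIndex tuples, so each column keeps its own dtype from the start. This is a minimal sketch that zips the structured array's field names to the index tuples (it assumes the fields are listed in the same order as the index entries):
import numpy as np
import pandas as pd

ind = pd.MultiIndex.from_product([('X', 'Y'), ('a', 'b')])
dtype = [('Xa', 'f8'), ('Xb', 'i4'), ('Ya', 'f8'), ('Yb', 'i4')]
arr = np.zeros(3, dtype)

# tuple keys become MultiIndex columns; each field keeps its own dtype
df = pd.DataFrame({col: arr[name] for col, name in zip(ind, arr.dtype.names)})
df = df.reindex(columns=ind)  # make sure the column order matches the index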
Related
Given a dict like this:
d = {'paris': ['a', 'b'],
     'brussels': ['b', 'c'],
     'mallorca': ['a', 'd']}
#when doing:
df = pd.DataFrame(d)
df.T
I don't get the expected result.
What I would like to get is a one-hot-encoded DataFrame in which the columns are the cities (paris, mallorca, etc.) and the values 1 or 0 indicate whether each of the letters appears in that city's list.
The desired result is:
df = pd.DataFrame([[1,1,0,0],[0,1,1,0],[1,0,0,1]], index=['paris','brussels','mallorca'], columns=list('abcd'))
df.T
Any clever way to do this without having to multiloop over the first dict to transform it into another one?
Solution 1:
Combine df.apply with Series.value_counts, then use df.fillna to fill the NaN values with zeros.
out = df.apply(pd.Series.value_counts).fillna(0)
print(out)
paris brussels mallorca
a 1.0 0.0 1.0
b 1.0 1.0 0.0
c 0.0 1.0 0.0
d 0.0 0.0 1.0
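If you prefer integer counts instead of floats, one small tweak (a sketch) is to cast after filling:
out = df.apply(pd.Series.value_counts).fillna(0).astype(int)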
Solution 2:
Transform your df using df.melt and then use the result inside pd.crosstab.
Again use df.fillna to change NaN values to zeros. Finally, reorder the columns based on the order in the original df.
out = df.melt(value_name='index')
out = pd.crosstab(index=out['index'], columns=out['variable'])\
        .fillna(0).loc[:, df.columns]
print(out)
paris brussels mallorca
index
a 1 0 1
b 1 1 0
c 0 1 0
d 0 0 1
I don't know how 'clever' my solution is, but it works and it is pretty concise and readable.
import pandas as pd
d = {'paris': ['a', 'b'],
     'brussels': ['b', 'c'],
     'mallorca': ['a', 'd']}
df = pd.DataFrame(d).T
df.columns = ['0', '1']
df = pd.concat([df['0'], df['1']])
df = pd.crosstab(df, columns=df.index)
print(df)
Yields:
brussels mallorca paris
a 0 1 1
b 1 0 1
c 1 0 0
d 0 1 0
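For comparison, here is a sketch of the same idea with less manual bookkeeping, using Series.explode (available in pandas 0.25+) to turn the dict into (city, letter) pairs and pd.crosstab to count them:
e = pd.Series(d).explode()  # index: city, value: one letter per row
out = pd.crosstab(e.index, e).rename_axis(index=None, columns=None)
print(out)
Rows are the cities and columns the letters; add .T if you want the cities as columns.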
I have a df with correlation values for A and B
df = pd.DataFrame({'x':['A','A','B','B'],'y':['A','B','A','B'],'c':[1,0.5,0.5,1]})
I'm trying to create a correlation-matrix-like data frame from df of the kind DataFrame.corr would give me.
I tried
corr = df.pivot_table(columns='y',index='x')
     c     
y    A    B
x          
A  1.0  0.5
B  0.5  1.0
but I don't know how to get rid of the multi-index.
You just need to specify values to get rid of the MultiIndex:
corr = df.pivot_table(columns='y',index='x', values='c')
Out[41]:
y A B
x
A 1.0 0.5
B 0.5 1.0
If you also want to get rid of the axis names, chain rename_axis:
corr = (df.pivot_table(columns='y',index='x', values='c')
.rename_axis(index=None, columns=None))
Out[43]:
A B
A 1.0 0.5
B 0.5 1.0
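Since each (x, y) pair occurs only once here, a sketch of an equivalent without aggregation is df.pivot:
corr = (df.pivot(index='x', columns='y', values='c')
          .rename_axis(index=None, columns=None))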
I start with a dictionary, which is the way my data was already formatted:
import pandas as pd
dict2 = {'A': {'a': 1.0, 'b': 2.0, 'd': 4.0},
         'B': {'a': 2.0, 'c': 2.0, 'd': 5.0},
         'C': {'b': 1.0, 'c': 2.0, 'd': 4.0}}
I then convert it to a pandas dataframe:
df = pd.DataFrame(dict2)
print(df)
A B C
a 1.0 2.0 NaN
b 2.0 NaN 1.0
c NaN 2.0 2.0
d 4.0 5.0 4.0
Of course, I can get the difference one at a time by doing this:
df['A'] - df['B']
Out[643]:
a -1.0
b NaN
c NaN
d -1.0
dtype: float64
I figured out how to loop through and calculate A-A, A-B, A-C:
for column in df:
    print(df['A'] - df[column])
a 0.0
b 0.0
c NaN
d 0.0
Name: A, dtype: float64
a -1.0
b NaN
c NaN
d -1.0
dtype: float64
a NaN
b 1.0
c NaN
d 0.0
dtype: float64
What I would like to do is iterate through the columns so as to calculate |A-B|, |A-C|, and |B-C| and store the results in another dictionary.
I want to do this so as to calculate the Euclidean distance between all combinations of columns later on. If there is an easier way to do this I would like to see it as well. Thank you.
You can use NumPy broadcasting to compute the vectorised Euclidean distance (L2 norm), ignoring NaNs using np.nansum.
import numpy as np
i = df.values.T                                     # shape (n_columns, n_rows)
j = np.nansum((i - i[:, None]) ** 2, axis=2) ** .5  # pairwise distances between columns
If you want a DataFrame representing a distance matrix, here's what that would look like:
dist = pd.DataFrame(j, index=df.columns, columns=df.columns)
dist
          A         B    C
A  0.000000  1.414214  1.0
B  1.414214  0.000000  1.0
C  1.000000  1.000000  0.0
The entry at row i, column j of dist is the distance between the ith and jth columns of the original DataFrame.
The code below iterates through columns to calculate the difference.
# Import libraries
import pandas as pd
import numpy as np

# Create dataframe
df = pd.DataFrame({'A': {'a': 1.0, 'b': 2.0, 'd': 4.0},
                   'B': {'a': 2.0, 'c': 2.0, 'd': 5.0},
                   'C': {'b': 1.0, 'c': 2.0, 'd': 4.0}})
df2 = pd.DataFrame()

# Calculate the absolute difference for every unordered pair of columns
clist = df.columns
for i in range(0, len(clist) - 1):
    for j in range(i + 1, len(clist)):  # start at i + 1 so each pair occurs exactly once
        var = clist[i] + '-' + clist[j]
        df[var] = abs(df[clist[i]] - df[clist[j]])   # output in the same dataframe (optional)
        df2[var] = abs(df[clist[i]] - df[clist[j]])  # output in a new dataframe (optional)
Output in same dataframe
df.head()
Output in a new dataframe
df2.head()
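For reference, a more idiomatic sketch of the same pairwise loop uses itertools.combinations, which yields each unordered column pair exactly once (starting from the original df, before any diff columns were added):
from itertools import combinations

df2 = pd.DataFrame({a + '-' + b: (df[a] - df[b]).abs()
                    for a, b in combinations(df.columns, 2)})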
I have a pandas.DataFrame that I want to convert to a MultiIndexed pandas.DataFrame.
import numpy
import pandas
import itertools
xs = numpy.linspace(0, 10, 100)
ys = numpy.linspace(0, 0.1, 20)
zs = numpy.linspace(0, 5, 200)
def func(x, y, z):
    return x * y / z
vals = list(itertools.product(xs, ys, zs))
result = [func(x, y, z) for x, y, z in vals]
# Original DataFrame.
df = pandas.DataFrame(vals, columns=['x', 'y', 'z'])
df = pandas.concat((pandas.DataFrame(result, columns=['result']), df), axis=1)
# I want to turn `df` into this `df2`.
index = pandas.MultiIndex.from_tuples(vals, names=['x', 'y', 'z'])
df2 = pandas.DataFrame(result, columns=['result'], index=index)
Note that in this example I create what I want and what I have.
So, IRL I would start with df and want to turn it into df2 (and don't have access to vals and result), how do I do this?
You need set_index:
print (df2.head())
                    result
x   y   z                 
0.0 0.0 0.000000       NaN
        0.025126       0.0
        0.050251       0.0
        0.075377       0.0
        0.100503       0.0
print (df.set_index(['x','y','z']).head())
                    result
x   y   z                 
0.0 0.0 0.000000       NaN
        0.025126       0.0
        0.050251       0.0
        0.075377       0.0
        0.100503       0.0
If you need to compare both DataFrames, you must first replace NaN with the same value, otherwise you get False (NaN never equals NaN):
print (df.set_index(['x','y','z']).eq(df2).all())
result False
dtype: bool
print (np.nan == np.nan)
False
print (df.fillna(1).set_index(['x','y','z']).eq(df2.fillna(1)).all())
result True
dtype: bool
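Alternatively, DataFrame.equals treats NaNs in the same locations as equal, so the fillna step can be skipped (a sketch):
print (df.set_index(['x','y','z']).equals(df2))
True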
I'd like to search a pandas DataFrame for minimum values. I need the min across the entire dataframe (all values), analogous to df.min().min(). However, I also need to know the index of the location(s) where this value occurs.
I've tried a number of different approaches:
df.where(df == df.min().min())
df.where(df == df.min().min()).notnull()
val_mask = df == df.min().min(); df[val_mask]
These return a DataFrame of NaN/boolean values with only the min positions filled in, but I can't figure out a way to get the (row, col) locations from them.
Is there a more elegant way of searching a DataFrame for a min/max and returning a list containing all of the locations of the occurrence(s)?
import pandas as pd
keys = ['x', 'y', 'z']
vals = [[1,2,-1], [3,5,1], [4,2,3]]
data = dict(zip(keys,vals))
df = pd.DataFrame(data)
list_of_lowest = []
for column_name, column in df.items():  # iteritems() is deprecated in newer pandas
    if len(df[column == df.min().min()]) > 0:
        print(column_name, column.where(column == df.min().min()).dropna())
        list_of_lowest.append([column_name, column.where(column == df.min().min()).dropna()])
list_of_lowest
list_of_lowest
output: [['x', 2 -1.0
Name: x, dtype: float64]]
Based on your revised update:
In [209]:
keys = ['x', 'y', 'z']
vals = [[1,2,-1], [3,5,-1], [4,2,3]]
data = dict(zip(keys,vals))
df = pd.DataFrame(data)
df
Out[209]:
x y z
0 1 3 4
1 2 5 2
2 -1 -1 3
Then the following would work:
In [211]:
df[df==df.min().min()].dropna(axis=1, thresh=1).dropna()
Out[211]:
x y
2 -1.0 -1.0
So this uses the boolean mask on the df:
In [212]:
df[df==df.min().min()]
Out[212]:
x y z
0 NaN NaN NaN
1 NaN NaN NaN
2 -1.0 -1.0 NaN
and we call dropna with param thresh=1; this drops columns that don't have at least 1 non-NaN value:
In [213]:
df[df==df.min().min()].dropna(axis=1, thresh=1)
Out[213]:
x y
0 NaN NaN
1 NaN NaN
2 -1.0 -1.0
It's probably safer to call dropna again with thresh=1 to also drop rows without at least one non-NaN value:
In [214]:
df[df==df.min().min()].dropna(axis=1, thresh=1).dropna(thresh=1)
Out[214]:
x y
2 -1.0 -1.0
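A more direct sketch of the general pattern: stack the frame into long form and take the index entries where the value equals the global min.
s = df.stack()  # MultiIndex of (row, column) -> value
locations = s.index[s == s.min()].tolist()
print(locations)  # [(2, 'x'), (2, 'y')] for the example above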