pandas DataFrame diagonal - python

What is an efficient way to get the diagonal of a square DataFrame? I would expect the result to be a Series with a two-level MultiIndex: the first level being the index of the DataFrame and the second level being its columns.
Setup
import pandas as pd
import numpy as np
np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(3, 3) * 5,
                  columns=list('abc'),
                  index=list('ABC'),
                  dtype=np.int64
                  )
I want to see this:
print(df.stack().loc[[('A', 'a'), ('B', 'b'), ('C', 'c')]])
A a 2
B b 2
C c 3

If you don't mind using numpy, you could use numpy.diag:
pd.Series(np.diag(df), index=[df.index, df.columns])
A a 2
B b 2
C c 3
dtype: int64

You could do something like this:
In [16]:
midx = pd.MultiIndex.from_tuples(list(zip(df.index,df.columns)))
pd.DataFrame(data=np.diag(df), index=midx)
Out[16]:
0
A a 2
B b 2
C c 3
np.diag returns the diagonal values as a NumPy array. You can then build the MultiIndex by zipping the index and the columns, and pass it as the index to the DataFrame constructor.
Actually, the MultiIndex generation doesn't need to be that complicated:
In [18]:
pd.DataFrame(np.diag(df), index=[df.index, df.columns])
Out[18]:
0
A a 2
B b 2
C c 3
But johnchase's answer is neater.

You can also use iat in a list comprehension to get the diagonal.
>>> pd.Series([df.iat[n, n] for n in range(len(df))], index=[df.index, df.columns])
A a 2
B b 2
C c 3
dtype: int64
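
The answers above all pair the i-th row label with the i-th column label and attach the diagonal values to that pairing. As a minimal sketch, they could be wrapped in a small helper. The name diagonal_series and the demo frame below are hypothetical and used for illustration only; the frame is deterministic rather than the random one from the question.

```python
import numpy as np
import pandas as pd


def diagonal_series(frame):
    """Return the main diagonal of a square DataFrame as a MultiIndexed Series.

    Hypothetical helper, not part of pandas.
    """
    if frame.shape[0] != frame.shape[1]:
        raise ValueError("expected a square DataFrame")
    # Pair the i-th row label with the i-th column label.
    index = pd.MultiIndex.from_arrays([frame.index, frame.columns])
    # np.diag extracts the main diagonal of a 2-D array.
    return pd.Series(np.diag(frame.to_numpy()), index=index)


# Small deterministic frame so the result is easy to check by eye.
demo = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                    index=list('ABC'), columns=list('abc'))
diag = diagonal_series(demo)
print(diag)
```

The square-shape check is an added assumption: np.diag also accepts non-square arrays, but then the row and column labels would no longer have the same length.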

Related

Select only columns that have at most N unique values

I want to count the number of unique values in each column and select only those columns which have fewer than 32 unique values.
I tried using
df.filter(nunique<32)
and
df[[ c for df.columns in df if c in c.nunique<32]]
but because nunique is a method and not a function, they don't work. I thought len(set()) would work and tried
df.apply(lambda x : len(set(x))
but that doesn't work either. Any ideas, please? Thanks in advance!
nunique can be called on the entire DataFrame (it is a method, so you have to call it). You can then select the columns you want using loc:
df.loc[:, df.nunique() < 32]
Minimal Verifiable Example
df = pd.DataFrame({'A': list('abbcde'), 'B': list('ababab')})
df
A B
0 a a
1 b b
2 b a
3 c b
4 d a
5 e b
df.nunique()
A 5
B 2
dtype: int64
df.loc[:, df.nunique() < 3]
B
0 a
1 b
2 a
3 b
4 a
5 b
If you want to do it in a method-chaining fashion, you can pass a callable to loc:
df.loc[:, lambda x: x.nunique() < 3]
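
For reuse, the same selection could be wrapped in a function. This is only a sketch: the name select_low_cardinality and its max_unique parameter are made up for illustration and are not part of pandas.

```python
import pandas as pd


def select_low_cardinality(frame, max_unique):
    """Keep only the columns with fewer than `max_unique` distinct values.

    Hypothetical helper, not part of pandas.
    """
    # frame.nunique() returns a Series of per-column counts; the comparison
    # yields a boolean mask labelled by column name, which loc uses to
    # select columns.
    return frame.loc[:, frame.nunique() < max_unique]


demo = pd.DataFrame({'A': list('abbcde'), 'B': list('ababab')})
few = select_low_cardinality(demo, 3)
print(few.columns.tolist())
```

Note that nunique ignores NaN by default (dropna=True), so a column's missing values do not count toward the threshold here.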

Why transpose data to get a multiindexed dataframe?

I'm a bit confused about data orientation when creating a MultiIndexed DataFrame from a DataFrame.
I import data with read_excel() and I begin with something like:
import pandas as pd
df = pd.DataFrame([['A', 'B', 'A', 'B'], [1, 2, 3, 4]],
columns=['k', 'k', 'm', 'm'])
df
Out[3]:
k k m m
0 A B A B
1 1 2 3 4
I want to multiindex this and to obtain:
A B A B
k k m m
0 1 2 3 4
Mainly following the pandas documentation, I did:
arrays = df.iloc[0].tolist(), list(df)
tuples = list(zip(*arrays))
multiindex = pd.MultiIndex.from_tuples(tuples, names=['topLevel', 'downLevel'])
df = df.drop(0)
If I try
df2 = pd.DataFrame(df.values, index=multiindex)
(...)
ValueError: Shape of passed values is (4, 1), indices imply (4, 4)
I then have to transpose the values:
df2 = pd.DataFrame(df.values.T, index=multiindex)
df2
Out[11]:
0
topLevel downLevel
A k 1
B k 2
A m 3
B m 4
Finally, I transpose this DataFrame again to obtain:
df2.T
Out[12]:
topLevel A B A B
downLevel k k m m
0 1 2 3 4
OK, this is what I want, but I don't understand why I have to transpose twice. It seems unnecessary.
You can create the MultiIndex yourself, and then drop the row. From your starting df:
import pandas as pd
df.columns = pd.MultiIndex.from_arrays([df.iloc[0], df.columns], names=[None]*2)
df = df.iloc[1:].reset_index(drop=True)
A B A B
k k m m
0 1 2 3 4
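
As for why the two transposes were needed in the question: the index argument of the DataFrame constructor labels rows, while the remaining data has one row and four columns, so a four-entry MultiIndex only fits as row labels after transposing. A sketch of building the frame directly, passing the MultiIndex as columns instead (the variable names raw, values and df2 are chosen here for illustration):

```python
import pandas as pd

raw = pd.DataFrame([['A', 'B', 'A', 'B'], [1, 2, 3, 4]],
                   columns=['k', 'k', 'm', 'm'])

# First data row becomes the top level, the old header the second level.
multiindex = pd.MultiIndex.from_arrays([raw.iloc[0], raw.columns],
                                       names=['topLevel', 'downLevel'])

# The remaining data has shape (1, 4): one row, four columns.
values = raw.iloc[1:].to_numpy()

# Four labels describe the four columns, so they go to `columns`,
# not `index`, and no transpose is required.
df2 = pd.DataFrame(values, columns=multiindex)
print(df2)
```

Because the original columns mixed strings and numbers, the values stay object dtype; converting them afterwards (for example with astype) may be desirable.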

Apply an operation on pandas DataFrame rows based on a Series that matches the number of columns of the DataFrame

Let's say that I have a DataFrame df and a Series s like this:
>>> df = pd.DataFrame(np.random.randn(2,3), columns=["A", "B", "C"])
>>> df
A B C
0 -0.625816 0.793552 -1.519706
1 -0.955960 0.142163 0.847624
>>> s = pd.Series([1, 2, 3])
>>> s
0 1
1 2
2 3
dtype: int64
I'd like to add the values of s to each row in df. I guess I should use apply with axis=1, or applymap, but I can't figure out how (do I have to transpose at some point?).
Actually, my problem is more complex than that: the final DataFrame will be composed of the elements of the initial DataFrame, processed according to the values of two Series.
One possible solution is to add a 1-D NumPy array created from the Series, which prevents pandas from aligning the DataFrame's columns with the Series' index:
df = df + s.values
print (df)
A B C
0 0.207070 1.995021 4.829518
1 0.819741 2.802982 2.801355
If the Series' index has the same values as the DataFrame's columns, plain addition works:
# index is the same as the column names
s = pd.Series([1, 2, 3], index=df.columns)
print (s)
A 1
B 2
C 3
dtype: int64
df = df + s
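
To make the difference between label alignment and positional addition concrete, here is a small sketch with fixed numbers instead of the random values above. The variable names aligned, by_position and by_label are illustrative only.

```python
import pandas as pd

df = pd.DataFrame([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], columns=['A', 'B', 'C'])
s = pd.Series([1, 2, 3])  # default index 0, 1, 2

# DataFrame + Series matches the Series' index against the column labels.
# 0, 1, 2 never match 'A', 'B', 'C', so every cell ends up NaN.
aligned = df + s

# A plain NumPy array carries no labels, so it is broadcast by position
# across each row.
by_position = df + s.to_numpy()

# Relabelling the Series with the column names gives the same result
# while keeping the addition label-based.
by_label = df + pd.Series(s.to_numpy(), index=df.columns)

print(by_position)
```

The positional version relies on the Series having exactly as many values as the DataFrame has columns, in the same order.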

Join an empty pandas DataFrame with a Multiindex DataFrame

I want to build a large pandas DataFrame in a loop. In the first iteration the DataFrame df1 is still empty. When I join df1 with df2 that has a MultiIndex, the Index gets squashed somehow.
df1 = pd.DataFrame(index=range(6))
df2 = pd.DataFrame(np.random.randn(6, 3),
columns=pd.MultiIndex.from_arrays((['A','A','A'],
['a', 'b', 'c'])))
df1[df2.columns] = df2
df1
(A, a) (A, b) (A, c)
0 -0.673923 1.392369 1.848935
1 1.427368 0.042691 0.130962
2 -0.258589 0.216157 0.196101
3 -1.022283 1.312113 -0.770108
4 0.511127 -0.633477 -0.229149
5 -1.364237 0.713107 2.124274
I was hoping for a DataFrame with the MultiIndex intact like this:
A
a b c
0 -0.673923 1.392369 1.848935
1 1.427368 0.042691 0.130962
2 -0.258589 0.216157 0.196101
3 -1.022283 1.312113 -0.770108
4 0.511127 -0.633477 -0.229149
5 -1.364237 0.713107 2.124274
What am I doing wrong?
A MultiIndex is not always preserved when you assign into a DataFrame whose columns are a plain Index, so create df1 with (empty) MultiIndex columns first:
df1 = pd.DataFrame(index=range(6),columns=pd.MultiIndex.from_arrays([[],[]]))
df1[df2.columns] = df2
df1
Out[697]:
A
a b c
0 -0.755397 0.574920 0.901570
1 -0.165472 -1.865715 1.583416
2 -0.403287 1.358329 0.706650
3 0.028019 1.432543 -0.586325
4 -0.414851 0.825253 0.745090
5 0.389917 0.940657 0.125837
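
A different technique from the one in the answer above, which may suit the "build in a loop" use case: collect the pieces in a list and combine them once with pd.concat, so no empty starting DataFrame is needed. This is a sketch under the assumption that every piece has MultiIndex columns with the same number of levels and shares the same row index; the names pieces and combined and the zero-filled data are illustrative.

```python
import numpy as np
import pandas as pd

pieces = []
for top in ['A', 'B']:
    # Each iteration produces a frame with two-level column labels.
    cols = pd.MultiIndex.from_arrays([[top] * 3, ['a', 'b', 'c']])
    pieces.append(pd.DataFrame(np.zeros((6, 3)), columns=cols))

# Concatenating side by side keeps the two-level column labels.
combined = pd.concat(pieces, axis=1)
print(combined.columns.tolist())
```

Concatenating once at the end also avoids copying the growing DataFrame on every iteration.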

set a multi index in the dataframe constructor using the data-list provided to the constructor

I know that by using set_index I can convert an existing column into a DataFrame index, but is there a way to specify, directly in the DataFrame constructor, that one of the data columns should be used as the index (instead of being kept as a column)?
Right now I initialize a DataFrame from data records, then use set_index to turn the column into an index.
DataFrame([{'a':1,'b':1,"c":2,'d':1},{'a':1,'b':2,"c":2,'d':2}], index= ['a', 'b'], columns=('c', 'd'))
I want:
c d
ab
11 2 1
12 2 2
Instead I get:
c d
a 2 1
b 2 2
You can use MultiIndex.from_tuples:
d = [{'a':1,'b':1,"c":2,'d':1},{'a':1,'b':2,"c":2,'d':2}]
print (pd.MultiIndex.from_tuples([(x['a'], x['b']) for x in d], names=('a','b')))
MultiIndex(levels=[[1], [1, 2]],
labels=[[0, 0], [0, 1]],
names=['a', 'b'])
df= pd.DataFrame(d,
index = pd.MultiIndex.from_tuples([(x['a'], x['b']) for x in d],
names=('a','b')),
columns=('c', 'd'))
print (df)
c d
a b
1 1 2 1
2 2 2
You can simply chain set_index onto the constructor call, without specifying the index and columns parameters:
In [19]:
df=pd.DataFrame([{'a':1,'b':1,"c":2,'d':1},{'a':1,'b':2,"c":2,'d':2}]).set_index(['a','b'])
df
Out[19]:
c d
a b
1 1 2 1
2 2 2
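
If this pattern comes up often, the chained call could be wrapped in a small function. A minimal sketch; the name records_to_indexed_frame and its keys parameter are hypothetical and not part of pandas.

```python
import pandas as pd


def records_to_indexed_frame(records, keys):
    """Build a DataFrame from a list of dicts, using `keys` as the index.

    Hypothetical helper, not part of pandas.
    """
    # Build the frame with every field as a column first, then move the
    # key columns into the (Multi)Index.
    return pd.DataFrame(records).set_index(list(keys))


d = [{'a': 1, 'b': 1, 'c': 2, 'd': 1}, {'a': 1, 'b': 2, 'c': 2, 'd': 2}]
result = records_to_indexed_frame(d, ['a', 'b'])
print(result)
```

With a single key, set_index produces a regular Index rather than a MultiIndex.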
