Subsetting multiindex columns - python

I have a dataframe with multiindex hierarchical colmn names, including empty strings as column index names. How to subset second and third columns?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(15).reshape(5,3),
index=[1,2,3,4,5],
columns=[['A', 'A', 'B'],
['a', 'b', ''],
['', 'x', '']]
)
df.columns.names = ["c_ix0", "c_ix1", "c_ix2"]
print(df)
c_ix0 A B
c_ix1 a b
c_ix2 x
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
5 12 13 14
expected output:
c_ix0 A B
c_ix1 b
c_ix2 x
1 1 2
2 4 5
3 7 8
4 10 11
5 13 14

I believe you need xs:
a = df.xs('b', axis=1, level=1)
print (a)
c_ix0 A
c_ix2 x
1 1
2 4
3 7
4 10
5 13
b = df.xs('B', axis=1, level=0)
print (b)
c_ix1
c_ix2
1 2
2 5
3 8
4 11
5 14
If want select by positions use iloc:
c = df.iloc[:, 1]
print (c)
1 1
2 4
3 7
4 10
5 13
Name: (A, b, x), dtype: int32
EDIT:
d = df.iloc[:, [1, 2]]
print (d)
c_ix0 A B
c_ix1 b
c_ix2 x
1 1 2
2 4 5
3 7 8
4 10 11
5 13 14

Related

Sum of every two columns and leave one column in pandas dataframe

My task is like this:
df=pd.DataFrame([(1,2,3,4,5,6),(1,2,3,4,5,6),(1,2,3,4,5,6)],columns=['a','b','c','d','e','f'])
Out:
a b c d e f
0 1 2 3 4 5 6
1 1 2 3 4 5 6
2 1 2 3 4 5 6
I want to do is the output dataframe looks like this:
Out
s1 b s2 d s3 f
0 3 2 7 4 11 6
1 3 2 7 4 11 6
2 3 2 7 4 11 6
That is to say, sum the column (a,b),(c,d),(e,f) separately and keep each last column and rename the result columns names as (s1,s2,s3). Could anyone help solve this problem in Pandas? Thank you so much.
You can seelct columns by posistions by iloc, sum each 2 values and last rename columns by f-strings
i = 2
for x in range(0, len(df.columns), i):
df.iloc[:, x] = df.iloc[:, x:x+i].sum(axis=1)
df = df.rename(columns={df.columns[x]:f's{x // i + 1}'})
print (df)
s1 b s2 d s3 f
0 3 2 7 4 11 6
1 3 2 7 4 11 6
2 3 2 7 4 11 6
For one do
df['a'] = df['a'] + df['b']
df.rename(columns={col1: 's1')}, inplace=True)
You can use a loop to do all
the loop using enumerate and zip, generates
(0,('a','b')), (1,('c','d')), (2,('e','f'))
use these indexes to do the sum and the renaming
import pandas as pd
cols = ['a','b','c','d','e','f']
df =pd.DataFrame([(1,2,3,4,5,6),(1,2,3,4,5,6),(1,2,3,4,5,6)],columns=cols)
for idx, (col1, col2) in enumerate(zip(cols[::2], cols[1::2])):
df[col1] = df[col1] + df[col2]
df.rename(columns={col1: 's'+str(idx+1)}, inplace=True)
print(df)
CODE DEMO
You can try this:-
res = pd.DataFrame()
for i in range(len(df.columns)-1):
if i%2==0:
res[df.columns[i]] = df[df.columns[i]]+df[df.columns[i+1]]
else:
res[df.columns[i]] = df[df.columns[i]]
res['f'] = df[df.columns[-1]]
res.columns = ['s1', 'b', 's2', 'd', 's3', 'f']
Output:-
s1 b s2 d s3 f
0 3 2 7 4 11 6
1 3 2 7 4 11 6
2 3 2 7 4 11 6
df=pd.DataFrame([(1,2,3,4,5,6),(1,2,3,4,5,6),(1,2,3,4,5,6)],columns=['a','b','c','d','e','f'])
df['s1'] = df['a'] + df['b']
df['s2'] = df['c'] + df['d']
df['s3'] = df['e'] + df['f']
df = a b c d e f s1 s2 s3
0 1 2 3 4 5 6 3 7 11
1 1 2 3 4 5 6 3 7 11
2 1 2 3 4 5 6 3 7 11
and you can remove the columns 'a', 'b', 'c'
df.pop('a')
df.pop('c')
df.pop('d')
df = b e f s1 s2 s3
0 2 5 6 3 7 11
1 2 5 6 3 7 11
2 2 5 6 3 7 11
Jump is in steps of two; so we can split the dataframe with np.split :
res = np.split(df.to_numpy(), df.shape[-1] // 2, 1)
Next, we compute the new data, where we sum pairs of columns and keep the last column in each pair :
new_frame = np.hstack([np.vstack((np.sum(entry,1), entry[:,-1])).T for entry in res])
Create new column, taking into cognizance the jump of 2 :
new_cols = [f"s{ind//2+1}" if ind%2==0 else val for ind,val in enumerate(df.columns)]
Create new dataframe :
pd.DataFrame(new_frame, columns=new_cols)
s1 b s2 d s3 f
0 3 2 7 4 11 6
1 3 2 7 4 11 6
2 3 2 7 4 11 6

How to drop duplicates in python if consecutive values are the same in two columns?

I have a dataframe like below:
A B C
1 8 23
2 8 22
3 9 45
4 9 45
5 6 12
6 4 10
7 11 12
I want to drop duplicates where keep the first value in the consecutive occurence if the C is also the same.
E.G here occurence '9' is column B is repetitive and their correponding occurences in column 'C' is also repetitive '45'. In this case i want to retain the first occurence.
Expected Output:
A B C
1 8 23
2 8 22
3 9 45
5 6 12
6 4 10
7 11 12
I tried some group by, but didnot know how to drop.
code:
df['consecutive'] = (df['B'] != df['B'].shift(1)).cumsum()
test=df.groupby('consecutive',as_index=False).apply(lambda x: (x['B'].head(1),x.shape[0],
x['C'].iloc[-1] - x['C'].iloc[0]))
This group by returns me a series, but i want to drop.
Add DataFrame.drop_duplicates by 2 columns:
df['consecutive'] = (df['B'] != df['B'].shift(1)).cumsum()
df = df.drop_duplicates(['consecutive','C'])
print (df)
A B C consecutive
0 1 8 23 1
1 2 8 22 1
2 3 9 45 2
4 5 6 12 3
5 6 4 10 4
6 7 11 12 5
Or chain both conditions with | for bitwise OR:
df = df[(df['B'] != df['B'].shift()) | (df['C'] != df['C'].shift())]
print (df)
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
the easy way to check the difference between row of B and C then drop value if difference is 0 (duplicate values), the code is
df[ ~((df.B.diff()==0) & (df.C.diff()==0)) ]
A oneliner to filter out such records is:
df[(df[['B', 'C']].shift() != df[['B', 'C']]).any(axis=1)]
Here we thus check if the columns ['B', 'C'] is the same as the shifted rows, if it is not, we retain the values:
>>> df[(df[['B', 'C']].shift() != df[['B', 'C']]).any(axis=1)]
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
This is quite scalable, since we can define a function that will easily operate on an arbitrary number of values:
def drop_consecutive_duplicates(df, *colnames):
dff = df[list(colnames)]
return df[(dff.shift() != dff).any(axis=1)]
So you can then filter with:
drop_consecutive_duplicates(df, 'B', 'C')
Using diff, ne and any over axis=1:
Note: this method only works for numeric columns
m = df[['B', 'C']].diff().ne(0).any(axis=1)
print(df[m])
Output
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
Details
df[['B', 'C']].diff()
B C
0 NaN NaN
1 0.0 -1.0
2 1.0 23.0
3 0.0 0.0
4 -3.0 -33.0
5 -2.0 -2.0
6 7.0 2.0
Then we check if any of the values in a row are not equal (ne) to 0:
df[['B', 'C']].diff().ne(0).any(axis=1)
0 True
1 True
2 True
3 False
4 True
5 True
6 True
dtype: bool
You can compute a series of the rows to drop, and then drop them:
to_drop = (df['B'] == df['B'].shift())&(df['C']==df['C'].shift())
df = df[~to_drop]
It gives as expected:
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
Code
df1 = df.drop_duplicates(subset=['B', 'C'])
Result
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
If I understand your question correctly, given the following dataframe:
df = pd.DataFrame({'B': [8, 8, 9, 9, 6, 4, 11], 'C': [22, 23, 45, 45, 12, 10, 12],})
This one-line code solved your problem using the drop_duplicates method:
df.drop_duplicates(['B', 'C'])
It gives as expected results:
B C
0 8 22
1 8 23
2 9 45
4 6 12
5 4 10
6 11 12

Multiply multiple columns conditionally by one column (easy in Excel)

I have a dataframe like this:
a b c d m1 m2
3 2 2 2 5 4
1 4 1 1 5 4
3 2 2 3 5 4
I would like to multiply a and b for m1 and c and d for m2:
a b c d m1 m2
15 10 8 8 5 4
5 20 4 4 5 4
15 10 8 12 5 4
Also retaining the original dataframe structure, This is fairly simple in Excel, but Pandas is proving complicated since if I try the first multiplication (m1) then the DF drops non used columns.
Cheers!
Use mul with subset of columns defined by list of columns names:
df[['a', 'b']] = df[['a', 'b']].mul(df['m1'], axis=0)
df[['c', 'd']] = df[['c', 'd']].mul(df['m2'], axis=0)
print (df)
a b c d m1 m2
0 15 10 8 8 5 4
1 5 20 4 4 5 4
2 15 10 8 12 5 4
Here's one way using np.repeat:
df.iloc[:, :4] *= np.repeat(df.iloc[:, 4:].values, 2, axis=1)
print(df)
a b c d m1 m2
0 15 10 8 8 5 4
1 5 20 4 4 5 4
2 15 10 8 12 5 4

Reshape MultiIndex dataframe to tabular format

Given a sample MultiIndex:
idx = pd.MultiIndex.from_product([[0, 1, 2], ['a', 'b', 'c', 'd']])
df = pd.DataFrame({'value' : np.arange(12)}, index=idx)
df
value
0 a 0
b 1
c 2
d 3
1 a 4
b 5
c 6
d 7
2 a 8
b 9
c 10
d 11
How can I efficiently convert this to a tabular format like so?
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Furthermore, given the dataframe above, how can I bring it back to its original multi-indexed state?
What I've tried:
pd.DataFrame(df.values.reshape(-1, df.index.levels[1].size),
index=df.index.levels[0], columns=df.index.levels[1])
Which works for the first problem, but I'm not sure how to bring it back to its original from there.
Using unstack and stack
In [5359]: dff = df['value'].unstack()
In [5360]: dff
Out[5360]:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
In [5361]: dff.stack().to_frame('name')
Out[5361]:
name
0 a 0
b 1
c 2
d 3
1 a 4
b 5
c 6
d 7
2 a 8
b 9
c 10
d 11
By using get_level_values
pd.crosstab(df.index.get_level_values(0),df.index.get_level_values(1),values=df.value,aggfunc=np.sum)
Out[477]:
col_0 a b c d
row_0
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Another alternative, which you should think of when using stack/unstack (though unstack is clearly better in this case!) is pivot_table:
In [11]: df.pivot_table(values="value", index=df.index.get_level_values(0), columns=df.index.get_level_values(1))
Out[11]:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11

pandas compare and select the smallest number from another dataframe

I have two dataframes.
df1
Out[162]:
a b c
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
10 10 10 10
11 11 11 11
df2
Out[194]:
A B
0 a 3
1 b 4
2 c 5
I wish to create a 3rd column in df2 that maps df2['A'] to df1 and find the smallest number in df1 that's greater than the number in df2['B']. For example, for df2['C'].ix[0], it should go to df1['a'] and search for the smallest number that's greater than df2['B'].ix[0], which should be 4.
I had something like df2['C'] = df2['A'].map( df1[df1 > df2['B']].min() ). But this doesn't work as it won't go to df2['B'] search for corresponding rows. Thanks.
Use apply for row-wise methods:
In [54]:
# create our data
import pandas as pd
df1 = pd.DataFrame({'a':list(range(12)), 'b':list(range(12)), 'c':list(range(12))})
df1
Out[54]:
a b c
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
10 10 10 10
11 11 11 11
[12 rows x 3 columns]
In [68]:
# create our 2nd dataframe, note I have deliberately used alternate values for column 'B'
df2 = pd.DataFrame({'A':list('abc'), 'B':[3,5,7]})
df2
Out[68]:
A B
0 a 3
1 b 5
2 c 7
[3 rows x 2 columns]
In [69]:
# apply row-wise function, must use axis=1 for row-wise
df2['C'] = df2.apply(lambda row: df1[row['A']].ix[df1[row.A] > row.B].min(), axis=1)
df2
Out[69]:
A B C
0 a 3 4
1 b 5 6
2 c 7 8
[3 rows x 3 columns]
There is some example usage in the pandas docs

Categories