Selecting by MultiIndex - python

I have two DataFrames:
import pandas as pd

df_a = pd.DataFrame(data=[['A', 'B', 'C'], ['A1', 'B1', 'C1']], columns=['first', 'secound', 'third'])
df_a.set_index(['first', 'secound'], inplace=True)
df_b = pd.DataFrame(data=[['A', 'B', 12], ['A', 'B', 143], ['C1', 'C1', 11]], columns=['first', 'secound', 'data'])
df_b.set_index(['first', 'secound'], inplace=True)
df_a:
               third
first secound
A     B            C
A1    B1          C1

df_b:
               data
first secound
A     B           12
      B          143
C1    C1          11
How can I select, in df_b, only the elements whose index is shared with df_a:
               data
first secound
A     B           12
      B          143
Thanks for the help.

You could take the intersection of the indexes, and use that as an indexer for df_b.loc:
In [28]: df_b.loc[df_b.index.intersection(df_a.index)]
Out[28]:
               data
first secound
A     B           12
      B          143
or, alternatively, use isin to generate a boolean mask for df_b.loc:
In [32]: df_b.loc[df_b.index.isin(df_a.index)]
Out[32]:
               data
first secound
A     B           12
      B          143
Using isin seems to be the fastest option in a perfplot benchmark. This was the setup used to generate the perfplot:
import numpy as np
import pandas as pd
import perfplot

def isin(x):
    df_a, df_b = x
    return df_b.loc[df_b.index.isin(df_a.index)]

def intersection(x):
    df_a, df_b = x
    return df_b.loc[df_b.index.intersection(df_a.index)]

def join(x):
    df_a, df_b = x
    return df_a.drop(df_a.columns, axis=1).join(df_b).dropna()

def make_df(n):
    df = pd.DataFrame(np.random.randint(10, size=(n, 3)))
    df = df.set_index([0, 1])
    return df

perfplot.show(
    setup=lambda n: [make_df(n) for i in range(2)],
    kernels=[isin, intersection, join],
    n_range=[2**k for k in range(2, 15)],
    logx=True,
    logy=True,
    equality_check=False,  # rows may appear in different order
    xlabel='len(df)')
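If perfplot is not installed, a rough check with the standard library's timeit (reusing the kernels and make_df defined above) should show the same ordering; timings are machine-dependent:
import timeit

df_a, df_b = make_df(10_000), make_df(10_000)
for fn in (isin, intersection, join):
    # time 20 runs of each kernel on the same pair of frames
    print(fn.__name__, timeit.timeit(lambda: fn((df_a, df_b)), number=20))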

You could join the index of df_a with df_b, and then drop NaNs:
>>> df_a.drop(df_a.columns, axis=1).join(df_b).dropna()
                data
first secound
A     B         12.0
      B        143.0
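As a variant, joining with how='inner' keeps only the matching rows directly, so no NaNs are introduced and the integer dtype of data is preserved (a sketch on the same frames):
# inner join: only the shared index entries survive, so 'data' stays integer
df_a.drop(df_a.columns, axis=1).join(df_b, how='inner')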

Related

Removing columns selectively from multilevel index dataframe

Say we have a dataframe like this and want to remove columns when certain conditions are met.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(2, 14).reshape(-1, 4),
    index=list('ABC'),
    columns=pd.MultiIndex.from_arrays([
        ['data1', 'data2', 'data1', 'data2'],
        ['F', 'K', 'R', 'X'],
        ['C', 'D', 'E', 'E']
    ], names=['meter', 'Sleeper', 'sweeper'])
)
df
Then let's say we want to remove columns only when meter == 'data1' and sweeper == 'E', so I tried
df = df.drop(('data1', 'E'), axis=1)
KeyError: 'E'
second try
df.drop(('data1','E'), axis = 1, level = 2)
KeyError: "labels [('data1', 'E')] not found in level"
Pandas: drop a level from a multi-level column index?
It seems drop doesn't support selection across split levels (levels 0 and 2 here). We can instead create a mask from the conditions using get_level_values:
# keep where not ((level 0 is 'data1') and (level 2 is 'E'))
col_mask = ~((df.columns.get_level_values(0) == 'data1')
             & (df.columns.get_level_values(2) == 'E'))
df = df.loc[:, col_mask]
We can also do this by integer location by excluding the locs that are in a particular index slice, however, this is overall less clear and less flexible:
idx = pd.IndexSlice['data1', :, 'E']
cols = [i for i in range(len(df.columns))
        if i not in df.columns.get_locs(idx)]
df = df.iloc[:, cols]
Either approach produces df:
meter   data1 data2
Sleeper     F     K     X
sweeper     C     D     E
A           2     3     5
B           6     7     9
C          10    11    13
You have to do them individually, since they are on different levels:
df.drop('data1', axis=1, level='meter').drop('E', axis = 1, level='sweeper')
Out[833]:
meter   data2
Sleeper     K
sweeper     D
A           3
B           7
C          11
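A further option (a sketch, starting again from the original df): select the matching columns with an IndexSlice and hand their full tuples to drop, which does accept complete column keys:
# columns where meter == 'data1' and sweeper == 'E', as full tuples
to_drop = df.loc[:, pd.IndexSlice['data1', :, 'E']].columns
df = df.drop(columns=to_drop)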

pandas merge and update efficiently

I am getting df1 from the database.
df2 needs to be merged with df1. df1 contains additional columns that are not present in df2. df2 contains indexes that are already present in df1 and whose rows need to be updated. The dataframes are multi-indexed.
What I want:
- keep rows in df1 that are not in df2
- update df1's values with df2's values for matching indexes
- in the updated rows, keep the values of the columns that are not present in df2
- append rows that are in df2 but not in df1
My Solution:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    data={'idx1': ['A', 'B', 'C', 'D', 'E'], 'idx2': [1, 2, 3, 4, 5],
          'one': ['df1', 'df1', 'df1', 'df1', 'df1'],
          'two': ["y", "x", "y", "x", "y"]})
df2 = pd.DataFrame(data={'idx1': ['D', 'E', 'F', 'G'], 'idx2': [4, 5, 6, 7], 'one': ['df2', 'df2', 'df2', 'df2']})
desired_result = pd.DataFrame(data={'idx1': ['A', 'B', 'C', 'D', 'E', 'F', 'G'], 'idx2': [1, 2, 3, 4, 5, 6, 7],
                                    'one': ['df1', 'df1', 'df1', 'df2', 'df2', 'df2', 'df2'],
                                    'two': ["y", "x", "y", "x", "y", np.nan, np.nan]})
updated = pd.merge(df1[['idx1', 'idx2']], df2, on=['idx1', 'idx2'], how='right')
keep = df1[~df1.isin(df2)].dropna()
my_res = pd.concat([updated, keep])
my_res.drop(columns='two', inplace=True)
my_res = pd.merge(my_res, df1[['idx1', 'idx2', 'two']], on=['idx1', 'idx2'])
This is very inefficient, as I:
- merge df2 into the index-only columns of df1 with a right outer join
- find indexes that are in df2 but not in df1
- concat the two dataframes
- drop the columns that were not included in df2
- merge on the index to append the columns I previously dropped
Is there maybe a more efficient, easier way to do this? I just cannot wrap my head around it.
EDIT:
By multi-indexed I mean that to identify a row I need to look at 4 different columns combined.
And unfortunately my solution does not work properly.
Merge the dataframes, update the column one with the values from one_, then drop this temporary column.
df = df1.merge(df2, on=['idx1', 'idx2'], how='outer', suffixes=['', '_'])
df['one'].update(df['one_'])
>>> df.drop(columns=['one_'])
  idx1  idx2  one  two
0    A     1  df1    y
1    B     2  df1    x
2    C     3  df1    y
3    D     4  df2    x
4    E     5  df2    y
5    F     6  df2  NaN
6    G     7  df2  NaN
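A non-mutating variant of the update step (a sketch on the merged df above, before the helper column is dropped): take one_ where it is present and fall back to one otherwise.
# prefer df2's value where it exists, otherwise keep df1's, then drop the helper
df['one'] = df['one_'].fillna(df['one'])
df = df.drop(columns=['one_'])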
Using DataFrame.append, DataFrame.drop_duplicates and Series.update:
First we append df2 to df1. Then we drop the duplicates based on the columns idx1 and idx2. Finally we update the NaN values in the column two based on the existing values in df1.
df3 = (df1.append(df2, sort=False)
          .drop_duplicates(subset=['idx1', 'idx2'], keep='last')
          .reset_index(drop=True))
df3['two'].update(df1['two'])
  idx1  idx2  one  two
0    A     1  df1    y
1    B     2  df1    x
2    C     3  df1    y
3    D     4  df2    x
4    E     5  df2    y
5    F     6  df2  NaN
6    G     7  df2  NaN
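Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on newer versions the same approach can be written with pd.concat (a sketch using the same frames):
# pd.concat replaces the deprecated DataFrame.append
df3 = (pd.concat([df1, df2], sort=False)
         .drop_duplicates(subset=['idx1', 'idx2'], keep='last')
         .reset_index(drop=True))
df3['two'].update(df1['two'])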
One-line combine_first:
Yourdf = df2.set_index(['idx1', 'idx2']).combine_first(df1.set_index(['idx1', 'idx2'])).reset_index()
Yourdf
Out[216]:
  idx1  idx2  one  two
0    A     1  df1    y
1    B     2  df1    x
2    C     3  df1    y
3    D     4  df2    x
4    E     5  df2    y
5    F     6  df2  NaN
6    G     7  df2  NaN

Python iterate each sub group of rows and apply function

I need to combine all iterations of the subgroups, apply a function to each combination, and return a single-value output along with a concatenated string identifying which iterations were looped.
I understand how to use pd.groupby and can set level=0 or level=1 and then call agg({'LOOPED_AVG': 'mean'}). However, I need to group (or subset) rows by subgroup, combine all rows from an iteration, and then apply the function to them.
Input data table:
MAIN_GROUP  SUB_GROUP  CONCAT_GRP_NAME  X_1
A           1          A1                 9
A           1          A1                 6
A           1          A1                 3
A           2          A2                 7
A           3          A3                 9
B           1          B1                 7
B           1          B1                 3
B           2          B2                 7
B           2          B2                 8
C           1          C1                 9
Desired result:
LOOP_ITEMS  LOOPED_AVG
A1 B1 C1    6.166666667
A1 B2 C1    7
A2 B1 C1    6.5
A2 B2 C1    7.75
A3 B1 C1    7
A3 B2 C1    8.25
Assuming that you have three column pairs, you can apply the following; for more column pairs, adjust the script accordingly. I wanted to give you a way to solve the problem; this may not be the most efficient way, but it gives a starting point.
import pandas as pd
import numpy as np

ls = [
    ['A', 1, 'A1', 9],
    ['A', 1, 'A1', 6],
    ['A', 1, 'A1', 3],
    ['A', 2, 'A2', 7],
    ['A', 3, 'A3', 9],
    ['B', 1, 'B1', 7],
    ['B', 1, 'B1', 3],
    ['B', 2, 'B2', 7],
    ['B', 2, 'B2', 8],
    ['C', 1, 'C1', 9],
]
# convert to dataframe
df = pd.DataFrame(ls, columns=["Main_Group", "Sub_Group", "Concat_GRP_Name", "X_1"])
# get count and sum of concatenated groups
df_sum = df.groupby('Concat_GRP_Name')['X_1'].agg(['sum', 'count']).reset_index()
# use permutations to calculate the different combinations of group names
import itertools as it
perms = it.permutations(df_sum.Concat_GRP_Name)

def compute_combinations(df, colname, main_group_series):
    l = []
    perms = it.permutations(df[colname])
    # provides a sorted list of the unique values in the Series
    unique_groups = np.unique(main_group_series)
    for perm_pairs in perms:
        # take only the first three entries of each permutation and make sure
        # the first one starts with A, the second with B, and the third with C
        if all([main_group in perm_pairs[ind] for ind, main_group in enumerate(unique_groups)]):
            l.append([perm_pairs[ind] for ind in range(unique_groups.shape[0])])
    return l

t = compute_combinations(df_sum, 'Concat_GRP_Name', df['Main_Group'])

# convert to dataframe and drop duplicate pairs
df2 = pd.DataFrame(t, columns=["Item1", "Item2", "Item3"]).drop_duplicates()

# join the sums and counts for each Concat_GRP_Name onto df2;
# since there are three item columns, we apply this three times
merged = df2.merge(df_sum[['sum', 'count', 'Concat_GRP_Name']], left_on=['Item1'], right_on=['Concat_GRP_Name'], how='inner')\
            .drop(['Concat_GRP_Name'], axis=1)\
            .rename({'sum': 'item1_sum'}, axis=1)\
            .rename({'count': 'item1_count'}, axis=1)
merged2 = merged.merge(df_sum[['sum', 'count', 'Concat_GRP_Name']], left_on=['Item2'], right_on=['Concat_GRP_Name'], how='inner')\
                .drop(['Concat_GRP_Name'], axis=1)\
                .rename({'sum': 'item2_sum'}, axis=1)\
                .rename({'count': 'item2_count'}, axis=1)
merged3 = merged2.merge(df_sum[['sum', 'count', 'Concat_GRP_Name']], left_on=['Item3'], right_on=['Concat_GRP_Name'], how='inner')\
                 .drop(['Concat_GRP_Name'], axis=1)\
                 .rename({'sum': 'item3_sum'}, axis=1)\
                 .rename({'count': 'item3_count'}, axis=1)
# get the sum of all of the item_sum cols
merged3['sums'] = merged3[['item3_sum', 'item2_sum', 'item1_sum']].sum(axis=1)
# get the sum of all the item_count cols
merged3['counts'] = merged3[['item3_count', 'item2_count', 'item1_count']].sum(axis=1)
# find the average
merged3['LOOPED_AVG'] = merged3['sums'] / merged3['counts']
# remove irrelevant fields
merged3 = merged3.drop(['item3_count', 'item2_count', 'item1_count', 'item3_sum', 'item2_sum', 'item1_sum', 'counts', 'sums'], axis=1)
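For comparison, the whole computation can also be sketched more compactly with itertools.product: pool the sum and count of X_1 per Concat_GRP_Name, then average every combination that takes one group name per Main_Group. This sketch reuses the df built above; the output column names follow the desired result in the question.
import itertools

# sum and count of X_1 for every (Main_Group, Concat_GRP_Name) pair
stats = df.groupby(['Main_Group', 'Concat_GRP_Name'])['X_1'].agg(['sum', 'count'])

# one list of (name, sum, count) triples per Main_Group
per_main = [
    [(name, r['sum'], r['count']) for name, r in grp.droplevel(0).iterrows()]
    for _, grp in stats.groupby(level=0)
]

# every combination picking one group name per Main_Group
rows = []
for combo in itertools.product(*per_main):
    names = ' '.join(name for name, _, _ in combo)
    total = sum(s for _, s, _ in combo)
    count = sum(c for _, _, c in combo)
    rows.append({'LOOP_ITEMS': names, 'LOOPED_AVG': total / count})
result = pd.DataFrame(rows)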

Why transpose data to get a multiindexed dataframe?

I'm a bit confused with data orientation when creating a Multiindexed DataFrame from a DataFrame.
I import data with read_excel() and I begin with something like:
import pandas as pd
df = pd.DataFrame([['A', 'B', 'A', 'B'], [1, 2, 3, 4]],
                  columns=['k', 'k', 'm', 'm'])
df
Out[3]:
   k  k  m  m
0  A  B  A  B
1  1  2  3  4
I want to multiindex this and to obtain:
   A  B  A  B
   k  k  m  m
0  1  2  3  4
Mainly from Pandas' doc, I did:
arrays = df.iloc[0].tolist(), list(df)
tuples = list(zip(*arrays))
multiindex = pd.MultiIndex.from_tuples(tuples, names=['topLevel', 'downLevel'])
df = df.drop(0)
If I try
df2 = pd.DataFrame(df.values, index=multiindex)
(...)
ValueError: Shape of passed values is (4, 1), indices imply (4, 4)
I then have to transpose the values:
df2 = pd.DataFrame(df.values.T, index=multiindex)
df2
Out[11]:
                    0
topLevel downLevel
A        k          1
B        k          2
A        m          3
B        m          4
Lastly, I re-transpose this dataframe to obtain:
df2.T
Out[12]:
topLevel   A  B  A  B
downLevel  k  k  m  m
0          1  2  3  4
OK, this is what I want, but I don't understand why I have to transpose twice. It seems unnecessary.
You can create the MultiIndex yourself, and then drop the row. From your starting df:
import pandas as pd
df.columns = pd.MultiIndex.from_arrays([df.iloc[0], df.columns], names=[None]*2)
df = df.iloc[1:].reset_index(drop=True)
   A  B  A  B
   k  k  m  m
0  1  2  3  4
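As for the double transpose in the question: DataFrame(df.values, index=multiindex) puts the MultiIndex on the rows, which is why the data has to be transposed in and back out. Passing the index as columns instead avoids both transposes (a sketch reusing the multiindex and the df from the question after df.drop(0)):
# attach the MultiIndex to the columns directly, so no transposes are needed
df2 = pd.DataFrame(df.values, columns=multiindex)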

select rows from dataframe where any of the columns is higher than 0.001

I would normally write
df[ (df.Col1>0.0001) | (df.Col2>0.0001) | (df.Col3>0.0001) ].index
to get the labels where the condition holds True. But what if I have many columns and, say, a tuple
cols = ('Col1', 'Col2', 'Col3')
where cols is a subset of df's columns?
Is there a more succinct way of writing the above?
You can combine pandas.DataFrame.any and list indexing to create a mask for use in indexing.
Note that cols has to be a list, not a tuple.
import pandas as pd
import numpy as np

N = 10
M = 0.8
df = pd.DataFrame(data={'Col1': np.random.random(N), 'Col2': np.random.random(N),
                        'Col3': np.random.random(N), 'Col4': np.random.random(N)})
cols = ['Col1', 'Col2', 'Col3']
mask = (df[cols] > M).any(axis=1)
print(df[mask].index)
# Int64Index([0, 1, 4, 5, 6, 7], dtype='int64')
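If cols genuinely arrives as a tuple, as in the question, it can be converted on the fly; a small variant of the mask above (gt is just the method form of >):
# same mask, but accepting the original tuple of column names
idx = df.loc[df[list(cols)].gt(M).any(axis=1)].index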
You can use a list comprehension with 'any' or 'all':
import pandas as pd
import numpy as np
In [148]: df = pd.DataFrame(np.random.randn(25).reshape(5,5), columns=list('abcde'))
In [149]: df
Out[149]:
          a         b         c         d         e
0 -1.484887  2.204350  0.498393  0.003432  0.792417
1 -0.595458  0.850336  0.286450  0.201722  1.699081
2 -0.437681 -0.907156  0.514573 -1.162837 -0.334180
3 -0.160818 -0.384901  0.076484  0.599763  1.923360
4  0.351161  0.519289  1.727934 -1.232707  0.007984
Example where you want all the columns in a given row to be greater than -1
In [153]: df.iloc[ [row for row in df.index if all(df.loc[row] > -1)], :]
Out[153]:
          a         b         c         d         e
1 -0.595458  0.850336  0.286450  0.201722  1.699081
3 -0.160818 -0.384901  0.076484  0.599763  1.923360
Example where you want any of the columns in a given row to be greater than -1
In [154]: df.iloc[ [row for row in df.index if any(df.loc[row] > -1)], :]
Out[154]:
          a         b         c         d         e
0 -1.484887  2.204350  0.498393  0.003432  0.792417
1 -0.595458  0.850336  0.286450  0.201722  1.699081
2 -0.437681 -0.907156  0.514573 -1.162837 -0.334180
3 -0.160818 -0.384901  0.076484  0.599763  1.923360
4  0.351161  0.519289  1.727934 -1.232707  0.007984
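The loop-based selections above can also be written without the Python-level loop, which is usually faster and shorter on larger frames (same df as above):
# vectorized equivalents of the two examples above
df[(df > -1).all(axis=1)]  # rows where every column is greater than -1
df[(df > -1).any(axis=1)]  # rows where at least one column is greater than -1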
