pandas: conditional select using .loc with MultiIndex - python

I have read the documentation on Advanced indexing with hierarchical index, which explains how to use .loc with a MultiIndex. I have also read this thread: Using .loc with a MultiIndex in pandas?
Still, I don't see how to select rows where (first index == some value) or (second index == some value).
Example:
import pandas as pd
index = pd.MultiIndex.from_arrays([['a', 'a', 'a', 'b', 'b', 'b'],
                                   ['a', 'b', 'c', 'a', 'b', 'c']],
                                  names=['i0', 'i1'])
df = pd.DataFrame({'x': [1,2,3,4,5,6], 'y': [6,5,4,3,2,1]}, index=index)
This gives the DataFrame:
       x  y
i0 i1
a  a   1  6
   b   2  5
   c   3  4
b  a   4  3
   b   5  2
   c   6  1
How can I get rows where i0 == 'b' or i1 == 'b'?
       x  y
i0 i1
a  b   2  5
b  a   4  3
   b   5  2
   c   6  1

I think the easier answer is to use the DataFrame.query function which allows you to query the multi-index by name as follows:
import pandas as pd
import numpy as np
index = pd.MultiIndex.from_arrays([list("aaabbb"),
list("abcabc")],
names=['i0', 'i1'])
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [6, 5, 4, 3, 2, 1]}, index=index)
df.query('i0 == "b" | i1 == "b"')
returns:
       x  y
i0 i1
a  b   2  5
b  a   4  3
   b   5  2
   c   6  1
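If the values to match are held in Python variables, query can reference them with the @ prefix. The following is a minimal sketch on the same df as above; the helper name select_either and its parameters are hypothetical, introduced only for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays([list("aaabbb"), list("abcabc")],
                                  names=["i0", "i1"])
df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6], "y": [6, 5, 4, 3, 2, 1]}, index=index)


def select_either(frame, first, second):
    """Hypothetical helper: rows where level i0 == first or level i1 == second."""
    # '@name' refers to a local Python variable; 'i0' and 'i1' are index level names.
    return frame.query("i0 == @first or i1 == @second")


selected = select_either(df, "b", "b")
print(selected)
```

Wrapping the call in a function keeps the @ variables local, so the query string does not depend on module-level names.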

Use get_level_values()
>>> mask = (df.index.get_level_values(0)=='b') | (df.index.get_level_values(1)=='b')
>>> df[mask] # same as df.loc[mask]
       x  y
i0 i1
a  b   2  5
b  a   4  3
   b   5  2
   c   6  1
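An equivalent mask can be built with MultiIndex.isin and its level argument, which also makes it easy to test against several values per level. A small sketch, assuming the same df as in the question:

```python
import pandas as pd

index = pd.MultiIndex.from_arrays([list("aaabbb"), list("abcabc")],
                                  names=["i0", "i1"])
df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6], "y": [6, 5, 4, 3, 2, 1]}, index=index)

# isin(..., level=...) checks a single level and returns a boolean array,
# so the two results can be combined with | just like the masks above.
mask = df.index.isin(["b"], level="i0") | df.index.isin(["b"], level="i1")
selected = df[mask]
print(selected)
```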

This might be possible with some logical condition on the index levels i0 and i1 using .loc. However, to me using .iloc seems easier:
You can get the positional (iloc) indices via pd.MultiIndex.get_locs.
import pandas as pd
import numpy as np
index = pd.MultiIndex.from_arrays([list("aaabbb"),
list("abcabc")],
names=['i0', 'i1'])
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [6, 5, 4, 3, 2, 1]}, index=index)
idx0 = index.get_locs(['b', slice(None)]) # i0 == 'b' => [3, 4, 5]
idx1 = index.get_locs([slice(None), 'b']) # i1 == 'b' => [1, 4]
idx = np.union1d(idx0, idx1)
print(df.iloc[idx])
will yield
       x  y
i0 i1
a  b   2  5
b  a   4  3
   b   5  2
   c   6  1
Note:
slice(None) means the same as [:] in index-slicing.

It has been quite some time since this question was raised. After reading the available answers, however, I see a benefit in adding one more case: selecting a row by giving a value for every index level with minimal code. Note that this selects the row where i0 == 'b' and i1 == 'b', not the "or" condition from the question.
To select by both index levels at once, you can do:
df.loc[('b','b')]
The critical point here is to wrap the index values in parentheses (). This gives the output:
x 5
y 2
Name: (b, b), dtype: int64
You can also select a column ('x' in my case) in the same call, if needed:
df.loc[('b','b'),'x']
This will give output:
5
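For contrast with the question's "or" condition, the sketch below shows what each form selects on the question's df. It is only an illustration: pd.IndexSlice is used to address one level at a time, and the two partial selections are combined through a union of their index labels.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays([list("aaabbb"), list("abcabc")],
                                  names=["i0", "i1"])
df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6], "y": [6, 5, 4, 3, 2, 1]}, index=index)

# A full tuple matches every level at once: i0 == 'b' AND i1 == 'b'.
both = df.loc[("b", "b")]

# IndexSlice addresses one level and leaves the other open.
# Slicing like this relies on the index being sorted, which it is here.
idx = pd.IndexSlice
by_i0 = df.loc[idx["b", :], :]   # i0 == 'b'
by_i1 = df.loc[idx[:, "b"], :]   # i1 == 'b'

# "Or" is the union of the two sets of row labels.
either = df.loc[by_i0.index.union(by_i1.index)]
print(either)
```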

Related

Insert Row in Dataframe at certain place

I have the following Dataframe:
Now I want to insert an empty row after every row where the column "Zweck" equals 7.
For example, the third row should be an empty row.
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5], 'f': [1, 7, 3, 4, 7]})
ren_dict = {i: df.columns[i] for i in range(len(df.columns))}
ind = df[df['f'] == 7].index
df = pd.DataFrame(np.insert(df.values, ind, values=[33], axis=0))
df.rename(columns=ren_dict, inplace=True)
ind_empt = df['a'] == 33
df[ind_empt] = ''
print(df)
Output
   a  b  f
0  1  1  1
1
2  2  2  7
3  3  3  3
4  4  4  4
5
6  5  5  7
Here the DataFrame is rebuilt rather than appended to, since repeated append operations would be resource-intensive. np.insert places a placeholder row filled with the value 33 before each row where 'f' equals 7; a numeric placeholder is needed because np.insert cannot put string values into the numeric array. The columns are then renamed back to their original names with df.rename. Finally, the rows where df['a'] == 33 are located and set to empty strings. This assumes that 33 never occurs as a real value in column 'a'.
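Note that np.insert places the new row before the given position, so the output above has the blank row ahead of each f == 7 row. If the blank row should come after each match, as the question asks, one option is to shift the positions by one. A minimal sketch under that assumption, converting to an object array so that empty strings can be inserted directly (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5], 'f': [1, 7, 3, 4, 7]})

# Positions of the matching rows, shifted by one so that np.insert
# (which inserts *before* a position) lands after each match.
pos = np.flatnonzero(df['f'].to_numpy() == 7) + 1

# An object array can hold both the numbers and the empty-string filler.
values = np.insert(df.to_numpy(dtype=object), pos, '', axis=0)
out = pd.DataFrame(values, columns=df.columns)
print(out)
```

Working on an object array avoids the numeric placeholder and the later replacement step, at the cost of the result having object dtype in every column.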

How to compare and replace individual cell values in data according to a list?: Pandas

I have a dataframe containing numerical values. I want to replace all values in the dataframe by comparing individual cell values to the respective elements of the list. The length of the list and the length of the columns are the same. Here's an example:
df = pd.DataFrame(np.array([[101, 2, 3], [4, 500, 6], [712, 8, 9]]),
                  columns=['a', 'b', 'c'])
Output
     a    b  c
0  101    2  3
1    4  500  6
2  712    8  9
list_numbers = [100,100,100]
I want to compare individual cell values to the respective elements of the list.
So, the column 'a' will be compared to 100. If the values are greater than hundred, I want to replace the values with another number.
Here is my code so far:
df = pd.DataFrame(np.array([[101, 2, 3], [4, 500, 6], [712, 8, 9]]),
                  columns=['a', 'b', 'c'])
df_columns = df.columns
df_index = df.index
# Creating a new dataframe to store the values.
df1 = pd.DataFrame(index=df_index, columns=df_columns)
df1 = df1.fillna(0)
for index, value in enumerate(df.columns):
    # df.where replaces values where the condition is false
    df1[[value]] = df[[value]].where(df[[value]] > list_numbers[index], -1)
    df1[[value]] = df[[value]].where(df[[value]] < list_numbers[index], 1)
# I am getting something like: nan for column a and error for other columns.
# The output should look something like:
Output
   a  b  c
0  1 -1 -1
1 -1  1 -1
2  1 -1 -1
Iterating over a DataFrame iterates over its column names, so you could simply do:
df1 = pd.DataFrame()
for i, c in enumerate(df):
    # 1 where the value is greater than the column's threshold, -1 otherwise
    df1[c] = np.where(df[c] > list_numbers[i], 1, -1)
You can avoid iterating over the columns by using NumPy broadcasting, which is more efficient:
df1 = pd.DataFrame(
    np.where(df.values > np.array(list_numbers), 1, -1),
    columns=df.columns)
df1
Output:
   a  b  c
0  1 -1 -1
1 -1  1 -1
2  1 -1 -1
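If the thresholds should be matched to columns by name rather than by position, they can be held in a Series and compared with DataFrame.gt. A sketch, assuming the data shown in the question's output table; the thresholds Series is a hypothetical addition for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[101, 2, 3], [4, 500, 6], [712, 8, 9]], columns=["a", "b", "c"])
list_numbers = [100, 100, 100]

# Label each threshold with its column so the comparison aligns by name.
thresholds = pd.Series(list_numbers, index=df.columns)
above = df.gt(thresholds, axis="columns")

# 1 where the cell exceeds its column threshold, -1 otherwise.
df1 = pd.DataFrame(np.where(above.to_numpy(), 1, -1),
                   index=df.index, columns=df.columns)
print(df1)
```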

Select rows based on condition and set values from a vector

I want to set entire rows to a value from a vector if a condition in one column is met.
import pandas as pd
df = pd.DataFrame([['a', 1, 1], ['a', 1, 1], ['b', 1, 1]], columns=('one', 'two', 'three'))
vector = pd.Series([2,3,4])
print(df)
  one  two  three
0   a    1      1
1   a    1      1
2   b    1      1
I want the result to be like this:
df_wanted = pd.DataFrame([['a', 1, 1], ['a', 1, 1], ['b', 4, 4]], columns=('one', 'two', 'three'))
print(df_wanted)
  one  two  three
0   a    1      1
1   a    1      1
2   b    4      4
I tried this, but it gives me an error:
df.loc[df['one']=='b'] = vector[df['one']=='b']
ValueError: Must have equal len keys and value when setting with an iterable
You can specify the columns to set in a list:
df.loc[df['one']=='b', ['two', 'three']] = vector[df['one']=='b']
print(df)
  one  two  three
0   a    1      1
1   a    1      1
2   b    4      4
Or, if you need a more dynamic solution, select all numeric columns (this requires import numpy as np):
df.loc[df['one']=='b', df.select_dtypes(np.number).columns] = vector[df['one']=='b']
Or compare only once and assign the mask to a variable:
m = df['one']=='b'
df.loc[m, df.select_dtypes(np.number).columns] = vector[m]
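A self-contained version of the approach above, as a sketch. It assigns one column at a time so that each assignment is a plain Series-to-column alignment on the row index; the name num_cols is illustrative.

```python
import pandas as pd

df = pd.DataFrame([['a', 1, 1], ['a', 1, 1], ['b', 1, 1]],
                  columns=['one', 'two', 'three'])
vector = pd.Series([2, 3, 4])

m = df['one'] == 'b'                            # rows to overwrite
num_cols = df.select_dtypes('number').columns   # here: 'two' and 'three'

for col in num_cols:
    # vector[m] keeps only the entries whose index label matches a selected row,
    # so row 2 receives vector[2] == 4.
    df.loc[m, col] = vector[m]

print(df)
```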

Why transpose data to get a multiindexed dataframe?

I'm a bit confused about data orientation when creating a MultiIndexed DataFrame from a DataFrame.
I import data with read_excel() and start with something like:
import pandas as pd
df = pd.DataFrame([['A', 'B', 'A', 'B'], [1, 2, 3, 4]],
                  columns=['k', 'k', 'm', 'm'])
df
Out[3]:
   k  k  m  m
0  A  B  A  B
1  1  2  3  4
I want to multiindex this and to obtain:
   A  B  A  B
   k  k  m  m
0  1  2  3  4
Mainly from Pandas' doc, I did:
arrays = df.iloc[0].tolist(), list(df)
tuples = list(zip(*arrays))
multiindex = pd.MultiIndex.from_tuples(tuples, names=['topLevel', 'downLevel'])
df = df.drop(0)
If I try
df2 = pd.DataFrame(df.values, index=multiindex)
(...)
ValueError: Shape of passed values is (4, 1), indices imply (4, 4)
I then have to transpose the values:
df2 = pd.DataFrame(df.values.T, index=multiindex)
df2
Out[11]:
                    0
topLevel downLevel
A        k          1
B        k          2
A        m          3
B        m          4
Finally, I transpose this DataFrame again to obtain:
df2.T
Out[12]:
topLevel   A  B  A  B
downLevel  k  k  m  m
0          1  2  3  4
OK, this is what I want, but I don't understand why I have to transpose twice. It seems unnecessary.
You can create the MultiIndex yourself, and then drop the row. From your starting df:
import pandas as pd
df.columns = pd.MultiIndex.from_arrays([df.iloc[0], df.columns], names=[None]*2)
df = df.iloc[1:].reset_index(drop=True)
   A  B  A  B
   k  k  m  m
0  1  2  3  4
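As for why the two transposes were needed in the question: index= labels rows, and the MultiIndex has four entries while the remaining data is a single row of four values, hence the shape error. Passing the same MultiIndex as columns= avoids both transposes. A minimal sketch, starting from the question's original df:

```python
import pandas as pd

df = pd.DataFrame([['A', 'B', 'A', 'B'], [1, 2, 3, 4]],
                  columns=['k', 'k', 'm', 'm'])

# Build the two-level labels from the first data row and the existing column names.
multiindex = pd.MultiIndex.from_arrays([df.iloc[0].tolist(), list(df)],
                                       names=['topLevel', 'downLevel'])

# The labels describe columns, so pass them as columns=, not index=.
df2 = pd.DataFrame(df.iloc[1:].to_numpy(), columns=multiindex)
print(df2)
```

Because the first row held strings, the values come through with object dtype; calling df2.infer_objects() afterwards is one way to recover numeric dtypes.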

how to re-arrange multiple columns into one column with same index

I'm using pandas and I want to stack multiple columns into a single column while keeping the same index for each value. If possible, I also want to drop the zero values.
I have this data frame
index  A  B  C
a      8  0  1
b      2  3  0
c      0  4  0
d      3  2  7
I'd like my output to look like this
index  data  value
a      A     8
b      A     2
d      A     3
b      B     3
c      B     4
d      B     2
a      C     1
d      C     7
===
I solved this task as described below. My original data has 2 index levels, and the zeros in the DataFrame were NaN values.
At first, I tried to apply the melt function while removing NaN values in one step, following this (How to melt a dataframe in Pandas with the option for removing NA values), but I couldn't, because my original data has several columns ('value_vars'). So I reorganized the DataFrame in 2 steps:
First, I turned the multiple columns into one column with the melt function.
Then I removed the rows containing NaN values with the dropna function.
This looks a little like the melt function in pandas, with the only difference being the index.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html
Here is some code you can run to test:
import pandas as pd
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},'B': {0: 1, 1: 3, 2: 5},'C': {0: 2, 1: 4, 2: 6}})
pd.melt(df)
With a little manipulation, you could solve the indexing issue.
This is not particularly pythonic, but if you have a limited number of columns, you could make do with:
molten = pd.melt(df)
a = molten.merge(df, left_on='value', right_on = 'A')
b = molten.merge(df, left_on='value', right_on = 'B')
c = molten.merge(df, left_on='value', right_on = 'C')
merge = pd.concat([a,b,c])
Try this:
array = [['a', 8, 0, 1], ['b', 2, 3, 0] ... ]
cols = ['A', 'B', 'C']
result = [[[array[i][0], cols[j], array[i][j + 1]] for i in range(len(array))] for j in range(len(cols))]
Output:
[[['a', 'A', 8], ['b', 'A', 2]], [['a', 'B', 0], ['b', 'B', 3]] ... ]
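For reference, the desired output in the question can be sketched with reset_index, melt, and a filter on the value column. This assumes the index is named index, as in the question's table; if the data held NaN instead of zeros, dropna(subset=['value']) would take the place of the filter.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [8, 2, 0, 3], "B": [0, 3, 4, 2], "C": [1, 0, 0, 7]},
    index=pd.Index(list("abcd"), name="index"),
)

# Move the index into a column, then stack A, B and C into (data, value) pairs.
long_df = df.reset_index().melt(id_vars="index", var_name="data", value_name="value")

# Drop the zero entries and renumber the rows.
long_df = long_df[long_df["value"] != 0].reset_index(drop=True)
print(long_df)
```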
