After constructing a pandas DataFrame with two-level MultiIndex columns I have difficulty accessing only specific columns.
The scenario is as follows:
import numpy as np
import pandas as pd

head = ["h1", "h2"]
cols = ["col_1", "col_2", "col_3"]
heads = len(cols) * [head[0]] + len(cols) * [head[1]] # -> ['h1','h1','h1','h2','h2','h2']
no_of_rows = 4
A = np.array(heads)
B = np.array(cols * len(head)) # -> ['col_1','col_2','col_3','col_1','col_2','col_3']
C = np.array([np.zeros(no_of_rows)] * len(head) * len(cols)) # -> shape=(6, 4)
df = pd.DataFrame(data=C.T,
columns=pd.MultiIndex.from_tuples(zip(A,B)))
yielding
h1 h2
col_1 col_2 col_3 col_1 col_2 col_3
0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0
Now I would like to get e.g. all col_1, meaning col_1 of h1 and col_1 of h2. The output should look like this
h1 h2
col_1 col_1
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
Any suggestions on how I could access those two columns?
You can use df.loc with slice(None), as follows:
df.loc[:, (slice(None), 'col_1')]
or use pd.IndexSlice, as follows:
idx = pd.IndexSlice
df.loc[:, idx[:, 'col_1']]
or simply:
df.loc[:, pd.IndexSlice[:, 'col_1']]
(Defining the extra variable idx as a shorthand for pd.IndexSlice is useful if you are going to use it multiple times.)
Result:
h1 h2
col_1 col_1
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
You can also do it with .xs() as follows:
df.xs('col_1', level=1, axis=1)
Result:
h1 h2
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
Slightly different output without the repeating col_1 column labels shown.
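If you want .xs() to keep the repeated col_1 labels, it also accepts drop_level=False. A small runnable sketch, rebuilding the frame from the question:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame: two top-level headers, three sub-columns each.
heads = ['h1'] * 3 + ['h2'] * 3
cols = ['col_1', 'col_2', 'col_3'] * 2
df = pd.DataFrame(np.zeros((4, 6)),
                  columns=pd.MultiIndex.from_tuples(zip(heads, cols)))

# drop_level=False keeps the second level in the result,
# matching the output of the loc-based approaches.
sub = df.xs('col_1', level=1, axis=1, drop_level=False)
print(sub.columns.tolist())  # [('h1', 'col_1'), ('h2', 'col_1')]
```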
The first 2 ways support selecting multiple columns too, e.g. ['col_1', 'col_3']:
df.loc[:, (slice(None), ['col_1', 'col_3'])]
and also:
df.loc[:, pd.IndexSlice[:, ['col_1', 'col_3']]]
Result:
h1 h2
col_1 col_3 col_1 col_3
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
You can use loc with get_level_values(1), since your columns col_1, col_2, col_3 are in the second level (level 1) of the column MultiIndex:
>>> df.loc[:,df.columns.get_level_values(1).isin(['col_1'])]
h1 h2
col_1 col_1
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
If you want to grab all columns under h1, you can use get_level_values(0) and select h1:
>>> df.loc[:,df.columns.get_level_values(0).isin(['h1'])]
h1
col_1 col_2 col_3
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
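The isin approach also extends to several sub-columns at once; a minimal sketch, rebuilding the frame from the question:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
heads = ['h1'] * 3 + ['h2'] * 3
cols = ['col_1', 'col_2', 'col_3'] * 2
df = pd.DataFrame(np.zeros((4, 6)),
                  columns=pd.MultiIndex.from_tuples(zip(heads, cols)))

# isin with a list selects every sub-column whose level-1 label matches.
picked = df.loc[:, df.columns.get_level_values(1).isin(['col_1', 'col_3'])]
print(picked.columns.tolist())
```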
I have two dataframes. One only contains binary values, the other floats between 0 and 1.
Eg.
df1:
col 1 col 2 col 3 col 4 col 5 col 6 col 7
0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2 0.0 0.0 0.0 0.0 1.0 0.0 1.0
3 0.0 0.0 0.0 0.0 0.0 0.0 1.0
4 0.0 0.0 0.0 0.0 1.0 0.0 0.0
df2:
col 1 col 2 col 3 col 4 col 5 col 6 col 7
0 0.068467 0.099870 0.090778 0.087500 0.612955 0.081495 0.570557
1 0.091651 0.084946 0.082704 0.103070 0.517317 0.092595 0.603526
2 0.070380 0.104353 0.103062 0.086780 0.598848 0.101543 0.570064
3 0.052239 0.123760 0.215329 0.087608 0.581883 0.080650 0.574241
4 0.087564 0.104460 0.125887 0.079945 0.646284 0.081015 0.609308
What I need is to compute the average of df1 where df2 >= 0.5 (or any other number)
All I could find on this topic is for columns only and I could not get it to work on the entire dataframe.
Any help is appreciated.
Both DataFrames must first have the same index and the same column names.
Then use DataFrame.where to keep values where the mask is True (setting NaN elsewhere) and take the mean:
df = df1.where(df2 >= 0.5).mean()
If you need the mean of all values, use numpy.nanmean to exclude the missing values:
mean = np.nanmean(df1.where(df2 >= 0.5))
Another idea is to convert all values to a Series with DataFrame.stack and then take the mean:
mean = df1.where(df2 >= 0.5).stack().mean()
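For a concrete check of how the per-column and overall means behave, here is a small sketch with made-up data (column names and values are arbitrary):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col 1': [0.0, 1.0, 0.0], 'col 2': [1.0, 0.0, 1.0]})
df2 = pd.DataFrame({'col 1': [0.7, 0.2, 0.6], 'col 2': [0.9, 0.55, 0.1]})

masked = df1.where(df2 >= 0.5)   # keep df1 where df2 >= 0.5, NaN elsewhere
per_column = masked.mean()       # Series: one mean per column
overall = masked.stack().mean()  # scalar: mean over all kept values

print(per_column.tolist())  # [0.0, 0.5]
print(overall)              # 0.25
```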
What about creating a DataFrame that keeps the values of df1 where df2 >= 0.5 and holds NaN everywhere else:
df = df1.where(df2 >= 0.5)
We then calculate the sum of the values and count the number of values to get the mean:
sum_values = df.sum().sum()
count_values = df.count().sum()
mean_value = sum_values / count_values
I have a dataframe which looks like below:
df
column_A column_B
0 0.0 0.0
1 0.0 0.0
2 0.0 1.0
3 0.0 0.0
4 0.0 0.0
5 1.0 0.0
I want to create an if condition like:
if (df['column_A'] == 0.0) & (df['column_B'] == 0.0):
    df['label'] = 'OK'
else:
    df['label'] = 'NO'
I tried this:
if((0.0 in df['column_A'] ) & (0.0 in df['column_B']))
for index, row in df.iterrows():
(df[((df['column_A'] == 0.0) & (df['column_B']== 0.0))])
Nothing really gave the expected outcome
I expect my output to be:
column_A column_B label
0 0.0 0.0 OK
1 0.0 0.0 OK
2 0.0 1.0 NO
3 0.0 0.0 OK
4 0.0 0.0 OK
5 1.0 0.0 NO
You can use np.where in order to create an array with either OK or NO depending on the result of the condition:
import numpy as np
df['label'] = np.where(df.column_A.add(df.column_B).eq(0), 'OK', 'NO')
column_A column_B label
0 0.0 0.0 OK
1 0.0 0.0 OK
2 0.0 1.0 NO
3 0.0 0.0 OK
4 0.0 0.0 OK
5 1.0 0.0 NO
Use numpy.where with DataFrame.any:
#solution if only 1.0, 0.0 values
df['label'] = np.where(df[['column_A', 'column_B']].any(axis=1), 'NO','OK')
#general solution with compare 0
#df['label'] = np.where(df[['column_A', 'column_B']].eq(0).all(axis=1),'OK','NO')
print (df)
column_A column_B label
0 0.0 0.0 OK
1 0.0 0.0 OK
2 0.0 1.0 NO
3 0.0 0.0 OK
4 0.0 0.0 OK
5 1.0 0.0 NO
Suppose we are given a DataFrame like the following one:
import pandas as pd
import numpy as np
a = ['a', 'b']
b = ['i', 'ii']
mi = pd.MultiIndex.from_product([a,b], names=['first', 'second'])
A = pd.DataFrame(np.zeros([3,4]), columns=mi)
first a b
second i ii i ii
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
I would like to create new columns iii for all first-level columns and assign the value of a new array (of matching size). I tried the following, to no avail.
A.loc[:,pd.IndexSlice[:,'iii']] = np.arange(6).reshape(3,-1)
The result should look like this:
a b
i ii iii i ii iii
0 0.0 0.0 0.0 0.0 0.0 1.0
1 0.0 0.0 2.0 0.0 0.0 3.0
2 0.0 0.0 4.0 0.0 0.0 5.0
Since you have a MultiIndex in the columns, I recommend creating a separate DataFrame to append, then concatenating it back:
appenddf=pd.DataFrame(np.arange(6).reshape(3,-1),
index=A.index,
columns=pd.MultiIndex.from_product([A.columns.levels[0],['iii']]))
appenddf
a b
iii iii
0 0 1
1 2 3
2 4 5
A=pd.concat([A,appenddf],axis=1).sort_index(level=0,axis=1)
A
first a b
second i ii iii i ii iii
0 0.0 0.0 0 0.0 0.0 1
1 0.0 0.0 2 0.0 0.0 3
2 0.0 0.0 4 0.0 0.0 5
Another workable solution
for i,x in enumerate(A.columns.levels[0]):
A[x,'iii']=np.arange(6).reshape(3,-1)[:,i]
A
first a b a b
second i ii i ii iii iii
0 0.0 0.0 0.0 0.0 0 1
1 0.0 0.0 0.0 0.0 2 3
2 0.0 0.0 0.0 0.0 4 5
# note: sort_index was not applied here, so the iii columns stay at the end
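Putting the loop and the sort together, an end-to-end sketch rebuilding the frame from the question:

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([['a', 'b'], ['i', 'ii']],
                                names=['first', 'second'])
A = pd.DataFrame(np.zeros([3, 4]), columns=mi)

vals = np.arange(6).reshape(3, -1)  # column i of vals goes to top-level key i
for i, x in enumerate(A.columns.levels[0]):
    A[x, 'iii'] = vals[:, i]

# regroup the new iii columns under their top-level headers
A = A.sort_index(level=0, axis=1)
print(A.columns.tolist())
```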
I am attempting to multiply specific columns by a value in their respective row.
For example:
X Y Z
A 10 1 0 1
B 50 0 0 0
C 80 1 1 1
Would become:
X Y Z
A 10 10 0 10
B 50 0 0 0
C 80 80 80 80
The problem I am having is that it times out when I use mul(), since my real dataset is very large. I tried to iterate with a loop in my real code as follows:
for i in range(1,df_final_small.shape[0]):
df_final_small.iloc[i].values[3:248] = df_final_small.iloc[i].values[3:248] * df_final_small.iloc[i].values[2]
Which when applied to the example dataframe would look like this:
for i in range(1,df_final_small.shape[0]):
df_final_small.iloc[i].values[1:4] = df_final_small.iloc[i].values[1:4] * df_final_small.iloc[i].values[0]
There must be a better way to do this; I am having trouble figuring out how to apply the multiplication only to certain columns in the row rather than the entire row.
EDIT:
To detail further here is my df.head(5).
id gross 150413 Welcome Email 150413 Welcome Email Repeat Cust 151001 Welcome Email 151001 Welcome Email Repeat Cust 161116 eKomi 1702 Hot Leads Email 1702 Welcome Email - All Purchases 1804 Hot Leads ... SILVER GOLD PLATINUM Acquisition Direct Mail Conversion Direct Mail Retention Direct Mail Retention eMail cluster x y
0 0033333 46.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 10 -0.230876 0.461990
1 0033331 2359.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 9 0.231935 -0.648713
2 0033332 117.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 5 -0.812921 -0.139403
3 0033334 89.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 5 -0.812921 -0.139403
4 0033335 1908.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 1.0 0.0 0.0 7 -0.974142 0.145032
Just specify the columns you want to multiply. Example
df=pd.DataFrame({'A':10,'X':1,'Y':1,'Z':1},index=[1])
df.loc[:,['X', 'Y', 'Z']]=df.loc[:,['X', 'Y', 'Z']].values*df.iloc[:,0:1].values
If you want to select an arbitrary range of columns, use iloc (note that in Python 3, range objects must be converted to lists before concatenating):
range_of_columns = list(range(10, 5001)) + list(range(5030, 10001))
df.iloc[:, range_of_columns].values * df.iloc[:, 0:1].values # multiplying the range of columns with the first column
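A fully vectorized alternative using mul with axis=0, assuming the multiplier sits in a regular column named A (names taken from the small example in the question):

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 50, 80],
                   'X': [1, 0, 1],
                   'Y': [0, 0, 1],
                   'Z': [1, 0, 1]},
                  index=['A', 'B', 'C'])

cols = ['X', 'Y', 'Z']
# mul with axis=0 aligns the multiplier with the rows, no Python loop needed
df[cols] = df[cols].mul(df['A'], axis=0)

print(df[cols].to_numpy().tolist())  # [[10, 0, 10], [0, 0, 0], [80, 80, 80]]
```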
Use mul with axis=0, pulling the multiplier out of the index with get_level_values:
df.mul(df.index.get_level_values(1),axis=0)
Out[167]:
X Y Z
A 10 10 0 10
B 50 0 0 0
C 80 80 80 80
Also, when the dataframe is way too big, you can split it and process it chunk by chunk:
dfs = np.split(df, [2], axis=0)
pd.concat([x.mul(x.index.get_level_values(1), axis=0) for x in dfs])
Out[174]:
X Y Z
A 10 10 0 10
B 50 0 0 0
C 80 80 80 80
I would also recommend NumPy broadcasting (convert the index level to an array first, since multi-dimensional indexing on an Index is no longer supported):
df.values * df.index.get_level_values(1).to_numpy()[:, None]
Out[177]:
array([[10,  0, 10],
       [ 0,  0,  0],
       [80, 80, 80]])
pd.DataFrame(df.values * df.index.get_level_values(1).to_numpy()[:, None], index=df.index, columns=df.columns)
Out[181]:
X Y Z
A 10 10 0 10
B 50 0 0 0
C 80 80 80 80
I'm playing around with computing subtotals within a DataFrame that looks like this (note the MultiIndex):
0 1 2 3 4 5
A 1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
B 1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
I can successfully add the subtotals with the following code:
(
df
.groupby(level=0)
.apply(
lambda df: pd.concat(
[df.xs(df.name), df.sum().to_frame('Total').T]
)
)
)
And it looks like this:
0 1 2 3 4 5
A 1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
Total 0.0 0.0 0.0 0.0 0.0 0.0
B 1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
Total 0.0 0.0 0.0 0.0 0.0 0.0
However, when I work with the transposed DataFrame, it does not work. The DataFrame looks like:
A B
1 2 1 2
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0
And I use the following code:
(
df2
.groupby(level=0, axis=1)
.apply(
lambda df: pd.concat(
[df.xs(df.name, axis=1), df.sum(axis=1).to_frame('Total')],
axis=1
)
)
)
I have specified axis=1 everywhere I can think of, but I get an error:
ValueError: cannot reindex from a duplicate axis
I would expect the output to be:
A B
1 2 Total 1 2 Total
0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0
Is this a bug? Or have I not specified the axis correctly everywhere? As a workaround, I can obviously transpose the DataFrame, produce the totals, and transpose back, but I'd like to know why it's not working here, and submit a bug report if necessary.
The problem DataFrame can be generated with:
df2 = pd.DataFrame(
np.zeros([6, 4]),
columns=pd.MultiIndex.from_product([['A', 'B'], [1, 2]])
)