I'm trying to group rows by multiple columns.
What I want to achieve can be illustrated by this small example:
import pandas as pd
col_index = pd.MultiIndex.from_arrays([['A','A','B','B'],['a','b','c','d']])
df = pd.DataFrame([[1, 2, 3, 3],
                   [4, 2, 2, 2],
                   [6, 4, 2, 2],
                   [1, 2, 4, 4],
                   [3, 8, 4, 4],
                   [1, 2, 3, 3]], columns=col_index)
The DataFrame created by this looks like this:
   A     B
   a  b  c  d
0  1  2  3  3
1  4  2  2  2
2  6  4  2  2
3  1  2  4  4
4  3  8  4  4
5  1  2  3  3
I would like to group by 'c' and 'd' (actually, by the whole 'B' block).
This gives me "KeyError: 'c'":
# something like this
df.groupby(['c','d'], axis=1, level=1)
# or like this
df.groupby('B', axis=1, level=0)
I tried searching for an answer but can't seem to find any.
Can somebody tell me what I'm doing wrong?
This is one way of doing it by resetting the columns first:
df.set_axis(df.columns.droplevel(0), axis=1).groupby(['c','d']).sum()
Out:
      a   b
c d
2 2  10   6
3 3   2   4
4 4   4  10
You can also specify the two-level column keys explicitly as tuples:
df.groupby([("B","c"), ("B", "d")])
Context: I'd like to add a new multi-index/row on top of the columns. For example if I have this dataframe:
tt = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
Desired Output: How could I make it so that I can add "Table X" on top of the columns A,B, and C?
Table X
A B C
0 1 4 7
1 2 5 8
2 3 6 9
Possible solutions(?): I was thinking about transposing the dataframe, adding the multi-index, and then transposing it back again, but I'm not sure how to do that without having to write the dataframe columns manually (I've checked other SO posts about this as well).
Thank you!
In the meantime I've also discovered this solution:
tt = pd.concat([tt],keys=['Table X'], axis=1)
which also yields the desired output:
Table X
A B C
0 1 4 7
1 2 5 8
2 3 6 9
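As a usage note, keys pairs up with the list of objects, so the same pattern can label several tables at once. A small sketch, assuming the original flat tt and a second, purely hypothetical table tt2:
tt2 = tt * 10  # hypothetical second table, just for illustration
combined = pd.concat([tt, tt2], keys=['Table X', 'Table Y'], axis=1)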
If you want a data frame like the one you wrote, you need a MultiIndex data frame. Try this:
import pandas as pd

# you need a nested dict first
dict_nested = {'Table X': {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}}

# then you have to reform it into tuple keys: {(outer, inner): values}
reformed_dict = {}
for outer_key, inner_dict in dict_nested.items():
    for inner_key, values in inner_dict.items():
        reformed_dict[(outer_key, inner_key)] = values

# last but not least, convert it to a MultiIndex dataframe
multiindex_df = pd.DataFrame(reformed_dict)
print(multiindex_df)
# >> Table X
# >> A B C
# >> 0 1 4 7
# >> 1 2 5 8
# >> 2 3 6 9
You can use pd.MultiIndex.from_tuples() to set / change the columns of the dataframe with a multi index:
tt.columns = pd.MultiIndex.from_tuples(
    [('Table X', 'A'), ('Table X', 'B'), ('Table X', 'C')])
Result (tt):
Table X
A B C
0 1 4 7
1 2 5 8
2 3 6 9
Add-on: as those are MultiIndex levels, you can change them later. set_levels returns a new index, so assign it back:
tt.columns = tt.columns.set_levels(['table_x'], level=0)
tt.columns = tt.columns.set_levels(['a', 'b', 'c'], level=1)
table_x
a b c
0 1 4 7
1 2 5 8
2 3 6 9
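Coming back to the concern about writing the columns out manually: pd.MultiIndex.from_product can prepend a constant top level without listing any tuples. A minimal sketch, assuming tt still has its original flat columns A, B, C:
# prepend one constant level to all existing columns at once
tt.columns = pd.MultiIndex.from_product([['Table X'], tt.columns])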
I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:
a = df.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
df['id'] = df['A'].astype(str) + df['B'].astype(str)
df[df['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
Had a similar situation but with a more complex column heading (e.g. "B val") in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can be easily expanded to select the n rows with the smallest values in a specific column (see the sketch below)
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1).reset_index(drop=True)
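As a side note, the n-rows expansion mentioned above does not even need apply: GroupBy has its own head. A sketch with the same df and n=2:
# two smallest-B rows per A group, renumbered
df.sort_values('B').groupby('A').head(2).reset_index(drop=True)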
I found an answer that is a little more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First we will get the min values on a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then, we merge this series result on the original data frame
data = data.merge(min_value, on='A', suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
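One caveat: sort_values reorders the rows. If you want the surviving rows back in their original order, a small variant of the same idea:
# same per-A minimum rows, restored to the original row order
df.sort_values('B').drop_duplicates('A').sort_index()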
The solution, as written before, is:
df.loc[df.groupby('A')['B'].idxmin()]
But if you then get an error like:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
it may be because column B contains NaN values (as it did in my case); idxmin returns NaN for those groups. Using dropna() first makes it work:
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also use boolean indexing to keep the rows where column B equals the group-wise minimum:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4
I have a dataframe extracted from an Excel file which I have manipulated to be in the following form (there are multiple rows but this is reduced to make my question as clear as possible):
|A|B|C|A|B|C|
index 0: 1 2 3 4 5 6
As you can see there are repetitions of the column names. I would like to merge this dataframe to look like the following:
|A|B|C|
index 0: 1 2 3
index 1: 4 5 6
I have tried to use the melt function but have not had any success thus far.
import pandas as pd
df = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=['A', 'B', 'C', 'A', 'B', 'C'])
df
A B C A B C
0 1 2 3 4 5 6
pd.concat(x for _, x in df.groupby(df.columns.duplicated(), axis=1))
A B C
0 1 2 3
0 4 5 6
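For context, df.columns.duplicated() marks the second A/B/C block as duplicates, so groupby(..., axis=1) splits the frame into the two column blocks, and pd.concat stacks them vertically. If you would rather have a fresh 0, 1 index than the repeated 0, a small variant of the same call:
pd.concat(
    (x for _, x in df.groupby(df.columns.duplicated(), axis=1)),
    ignore_index=True,  # renumber the stacked rows 0, 1
)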
value Group something
0 a 1 1
1 b 1 2
2 c 1 4
3 c 2 9
4 b 2 10
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
I want to select the last 3 rows of each group (from the above df), like the following, but perform the operation in place. I want to ensure that I am keeping only the new df object in memory after assignment. What would be an efficient way of doing it?
df = df.groupby('Group').tail(3)
The result should look like the following:
value Group something
0 a 1 1
1 b 1 2
2 c 1 4
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
N.B.: This question is related to Keeping the last N duplicates in pandas.
df = df.groupby('Group').tail(3) is already an efficient way of doing it. Because you are overwriting the df variable, Python will take care of releasing the memory of the old dataframe, and you will only have access to the new one.
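If you want to make that handover explicit, a minimal sketch (equivalent in effect; the intermediate name is only for illustration):
tail3 = df.groupby('Group').tail(3)  # build the reduced frame
del df                               # drop the last reference to the old frame
df = tail3                           # rebind the name to the reduced frame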
Trying way too hard to guess what you want.
NOTE: using Pandas inplace argument where it is available is NO guarantee that a new DataFrame won't be created in memory. In fact, it may very well create a new DataFrame in memory and replace the old one behind the scenes.
from collections import defaultdict

def f(s):
    # count occurrences of each Group value from the end;
    # yield the index labels of anything past the third occurrence
    c = defaultdict(int)
    for i, x in zip(s.index[::-1], s.values[::-1]):
        c[x] += 1
        if c[x] > 3:
            yield i

df.drop([*f(df.Group)], inplace=True)
df
value Group something
0 a 1 1
1 b 1 2
2 c 1 4
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
Your answer is already in the post. However, as said earlier in the comments, you are overwriting the existing df; to avoid that, assign the result to a new name, like below:
df_tail = df.groupby('Group').tail(3)
However, out of curiosity, if you are not concerned about the groupby and are only looking for the last N lines of the df, you can do it like below:
df[-2:]  # last 2 rows
I have a dataframe with 2 columns, and I want to create a 3rd column based on a comparison between the 2 columns.
So the logic is:
column 1 val = 3, column 2 val = 4, so the new column value is nothing
column 1 val = 3, column 2 val = 2, so the new column is 3
It's a very similar problem to one previously asked, but the answer there, using np.where(), isn't working for me.
Here's what I tried:
FinalDF['c'] = np.where(FinalDF['a']>FinalDF['b'],[FinalDF['a'],""])
and after that failed I tried to see if maybe it doesn't like the [x,y] I gave it, so I tried:
FinalDF['c'] = np.where(FinalDF['a']>FinalDF['b'],[1,0])
the result is always:
ValueError: either both or neither of x and y should be given
Edit: I also removed the [x,y], to see what happens, since the documentation says it is optional. But I still get an error:
ValueError: Length of values does not match length of index
Which is odd because they are sitting in the same dataframe, although one column does have some Nan values.
I don't think I can use np.select because I only have one condition here. I've linked to the previous questions so readers can reference them.
Thanks for any help.
I think that this should work:
FinalDF['c'] = np.where(FinalDF['a'] > FinalDF['b'], FinalDF['a'], "")
Example:
FinalDF = pd.DataFrame({'a': [4, 2, 4, 5, 5, 4],
                        'b': [4, 3, 2, 2, 2, 4]})
print(FinalDF)
a b
0 4 4
1 2 3
2 4 2
3 5 2
4 5 2
5 4 4
Output:
a b c
0 4 4
1 2 3
2 4 2 4
3 5 2 5
4 5 2 5
5 4 4
or, if instead you want column b's value where it is greater than column a, use this:
FinalDF['c'] = np.where(FinalDF['a']<FinalDF['b'], FinalDF['b'],"")
Output:
a b c
0 4 4
1 2 3 3
2 4 2
3 5 2
4 5 2
5 4 4
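One dtype note: filling the "nothing" slots with "" turns column c into an object column. If you would rather keep it numeric, a minimal variant of the same call uses np.nan instead:
import numpy as np

# NaN marks the "nothing" rows and keeps c a float column
FinalDF['c'] = np.where(FinalDF['a'] > FinalDF['b'], FinalDF['a'], np.nan)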