How to use groupby on a dataframe and still show all the columns - python

I have a dataframe with 15 columns, and I am trying to use groupby to find the maximum value of one of those columns.
Below is what I've been doing and the output: I get the maximum value of item_number_revision within each item_number_start, but I would also like to show all of the other existing columns from the original dataframe:

If you want to update item_number_revision to contain the max by group, you can do this:
data1 = df.set_index('item_number_start').assign(item_number_revision=df.groupby('item_number_start')['item_number_revision'].max()).reset_index()
Input:
item_number_start item_number_revision other_column
0 80-0010 1 a
1 80-0011 2 b
2 80-0012 3 c
3 80-0010 4 d
4 80-0011 5 e
5 80-0012 6 f
6 80-0010 7 g
7 80-0011 8 h
8 80-0012 9 i
9 80-0010 10 j
10 80-0011 11 k
11 80-0012 12 l
Output:
item_number_start item_number_revision other_column
0 80-0010 10 a
1 80-0011 11 b
2 80-0012 12 c
3 80-0010 10 d
4 80-0011 11 e
5 80-0012 12 f
6 80-0010 10 g
7 80-0011 11 h
8 80-0012 12 i
9 80-0010 10 j
10 80-0011 11 k
11 80-0012 12 l
Alternatively, if you want to preserve the original column and add a new column containing the max, you can do this:
data1 = df.set_index('item_number_start').assign(item_number_revision_max=df.groupby('item_number_start')['item_number_revision'].max()).reset_index()
Output:
item_number_start item_number_revision other_column item_number_revision_max
0 80-0010 1 a 10
1 80-0011 2 b 11
2 80-0012 3 c 12
3 80-0010 4 d 10
4 80-0011 5 e 11
5 80-0012 6 f 12
6 80-0010 7 g 10
7 80-0011 8 h 11
8 80-0012 9 i 12
9 80-0010 10 j 10
10 80-0011 11 k 11
11 80-0012 12 l 12
In the examples above, we use set_index() to temporarily give the original DataFrame an index matching that of the groupby()...max() Series, then we use assign() either to overwrite the column we took the max of or to add a new column, with each row assigned the max for its group. Finally, reset_index() restores the column we temporarily set as the index.
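The same result can be had more directly with groupby().transform(), which broadcasts the group max back to every row with no index juggling (a sketch covering both variants above):
# overwrite in place
df['item_number_revision'] = df.groupby('item_number_start')['item_number_revision'].transform('max')
# or keep the original column and add the max alongside it
df['item_number_revision_max'] = df.groupby('item_number_start')['item_number_revision'].transform('max')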
UPDATE:
To keep only the rows where item_number_revision equals its group's maximum, we can do this:
data1 = (
    df.join(
        df.groupby('item_number_start')['item_number_revision'].max()
          .to_frame().set_index('item_number_revision', append=True),
        on=['item_number_start', 'item_number_revision'], how='right')
)
Output:
item_number_start item_number_revision other_column
9 80-0010 10 j
10 80-0011 11 k
11 80-0012 12 l
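An equivalent filter (a sketch) keeps only the rows whose revision equals their group's max, using transform instead of the MultiIndex join:
# boolean mask: True where the row holds its group's maximum revision
data1 = df[df['item_number_revision'] == df.groupby('item_number_start')['item_number_revision'].transform('max')]
For the sample data this produces the same three rows.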

Related

Pandas, filter dataframe based on unique values in one column and groupby in another

I have a dataframe like this:
ID Packet Type
1 1 A
2 1 B
3 2 A
4 2 C
5 2 B
6 3 A
7 3 C
8 4 C
9 4 B
10 5 B
11 6 C
12 6 B
13 6 A
14 7 A
I want to filter the dataframe so that I keep only entries that are part of a packet of size n whose types are all different. There are only n types.
For this example let's use n=3 and the types A,B,C.
In the end I want this:
ID Packet Type
3 2 A
4 2 C
5 2 B
11 6 C
12 6 B
13 6 A
How do I do this with pandas?
Another solution, using .groupby + .filter:
df = df.groupby("Packet").filter(lambda x: len(x) == x["Type"].nunique() == 3)
print(df)
Prints:
ID Packet Type
2 3 2 A
3 4 2 C
4 5 2 B
10 11 6 C
11 12 6 B
12 13 6 A
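Since the question states there are exactly n types in total, n need not be hard-coded; a sketch deriving it from the data:
# n is the number of distinct types present anywhere in the frame
n = df["Type"].nunique()
out = df.groupby("Packet").filter(lambda x: len(x) == x["Type"].nunique() == n)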
You can use transform with nunique:
out = df[df.groupby('Packet')['Type'].transform('nunique')==3]
Out[46]:
ID Packet Type
2 3 2 A
3 4 2 C
4 5 2 B
10 11 6 C
11 12 6 B
12 13 6 A
I'd loop over the groupby object, filter and concatenate:
>>> pd.concat(frame for _,frame in df.groupby("Packet") if len(frame) == 3 and frame.Type.is_unique)
ID Packet Type
2 3 2 A
3 4 2 C
4 5 2 B
10 11 6 C
11 12 6 B
12 13 6 A

how to export values generated from agg groupby function

I have a large dataset based on servers at target locations. I used the following code to calculate the mean of a set of values for each server grouped by Site.
df4 = df4.merge(df4.groupby('SITE',as_index=False).agg({'DSKPERCENT':'mean'})[['SITE','DSKPERCENT']],on='SITE',how='left')
Sample Resulting DF
Site Server DSKPERCENT DSKPERCENT_MEAN
A 1 12 11
A 2 10 11
A 3 11 11
B 1 9 9
B 2 12 9
B 3 7 9
C 1 12 13
C 2 12 13
C 3 16 13
What I need now is to print/export the newly calculated mean per site. How can I print/export just the single unique calculated mean value per site (i.e. Site A has a calculated mean of 11, Site B of 9, etc.)?
IIUC, you're looking for a groupby -> transform type of operation. Essentially, transform is similar to agg except that the results are broadcast back to the same shape as the original group.
Sample Data
df = pd.DataFrame({
    "groups": list("aaabbbcddddd"),
    "values": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
})
df
groups values
0 a 1
1 a 2
2 a 3
3 b 4
4 b 5
5 b 6
6 c 7
7 d 8
8 d 9
9 d 10
10 d 11
11 d 12
Method
df["group_mean"] = df.groupby("groups")["values"].transform("mean")
print(df)
groups values group_mean
0 a 1 2
1 a 2 2
2 a 3 2
3 b 4 5
4 b 5 5
5 b 6 5
6 c 7 7
7 d 8 10
8 d 9 10
9 d 10 10
10 d 11 10
11 d 12 10
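To print/export just the single calculated mean per site, as the question asks, aggregate instead of transforming and write the result out (a sketch; the file name is only an example):
# one row per group, then export
site_means = df.groupby("groups", as_index=False)["values"].mean()
site_means.to_csv("group_means.csv", index=False)  # hypothetical output path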

Can You Preserve Column Order When Pandas Dataframe.Combine Or DataFrame.Combine_First?

If you have 2 dataframes, represented as:
A F Y
0 1 2 3
1 4 5 6
And
B C T
0 7 8 9
1 10 11 12
When combining it becomes:
A B C F T Y
0 1 7 8 2 9 3
1 4 10 11 5 12 6
I would like it to become:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
How do I combine 1 data frame with another but keep the original column order?
In [1294]: new_df = df.join(df1)
In [1295]: new_df
Out[1295]:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
Or you can also use pd.merge on the index (not a very clean solution though):
In [1297]: pd.merge(df, df1, left_index=True, right_index=True)
Out[1297]:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
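A third option (a sketch, assuming the two frames share the same index) is pd.concat along the columns, which likewise preserves each frame's column order:
# horizontal concatenation aligns on the index and keeps A F Y before B C T
new_df = pd.concat([df, df1], axis=1)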

compare multiple columns of pandas dataframe with one column

I have a dataframe:
df:
A B C D E
0 V 10 5 18 20
1 W 9 18 11 13
2 X 8 7 12 5
3 Y 7 9 7 8
4 Z 6 5 3 90
I want to add a column 'Result' which should be 1 if the value in column 'E' is greater than the values in columns B, C and D, else 0.
Output should be:
A B C D E Result
0 V 10 5 18 20 1
1 W 9 18 11 13 0
2 X 8 7 12 5 0
3 Y 7 9 7 8 0
4 Z 6 5 3 90 1
For a few columns, I would use logic like if(and(E>B, E>C, E>D), 1, 0),
But I have to compare around 20 columns (from B to U) with the column named 'V'. Additionally, the dataframe has around 100 thousand rows.
I am using
df['Result'] = np.where((df.ix[:, 1:20] < df['V']).all(1), 1, 0)
And it gives a Memory error.
One possible solution is to compare in numpy and finally convert the boolean mask to ints:
df['Result'] = (df.iloc[:, 1:4].values < df[['E']].values).all(axis=1).astype(int)
print (df)
A B C D E Result
0 V 10 5 18 20 1
1 W 9 18 11 13 0
2 X 8 7 12 5 0
3 Y 7 9 7 8 0
4 Z 6 5 3 90 1
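A pandas-only variant (a sketch) uses .lt with axis=0 to align the comparison column, avoiding the drop down to numpy; for the real data with columns B through U and comparison column 'V', the slice would presumably be df.iloc[:, 1:21] and df['V']:
# compare columns B..D against E row-wise, then require all True
df['Result'] = df.iloc[:, 1:4].lt(df['E'], axis=0).all(axis=1).astype(int)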

sort dataframe by position in group then by that group

consider the dataframe df
df = pd.DataFrame(dict(
    A=list('aaaaabbbbccc'),
    B=range(12)
))
print(df)
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
8 b 8
9 c 9
10 c 10
11 c 11
I want to sort the dataframe so that if I grouped by column 'A' I'd pull the first position from each group, then cycle back and get the second position from each group if any rows remain, and so on.
I'd expect the results to look like this:
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
You can use cumcount to count positions within groups first, then sort_values and reindex by the Series cum:
cum = df.groupby('A')['B'].cumcount().sort_values()
print (cum)
0 0
5 0
9 0
1 1
6 1
10 1
2 2
7 2
11 2
3 3
8 3
4 4
dtype: int64
print (df.reindex(cum.index))
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
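The two steps can be combined into one line (a sketch; kind='mergesort' makes the sort stable, so within each rank the groups keep their a, b, c order):
# cumcount ranks rows within each group; a stable sort interleaves the groups
df.reindex(df.groupby('A').cumcount().sort_values(kind='mergesort').index)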
Here's a NumPy approach -
import numpy as np

def approach1(g, v):
    # Inputs: 1D arrays of the groupby and value columns (groups assumed contiguous)
    id_arr2 = np.ones(v.size, dtype=int)
    # Positions where a new group starts
    sf = np.flatnonzero(g[1:] != g[:-1]) + 1
    # Make the running count reset at each group boundary
    id_arr2[sf[0]] = -sf[0] + 1
    id_arr2[sf[1:]] = sf[:-1] - sf[1:] + 1
    # cumsum gives within-group positions; stable argsort interleaves the groups
    return id_arr2.cumsum().argsort(kind='mergesort')
Sample run -
In [246]: df
Out[246]:
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
8 b 8
9 c 9
10 c 10
11 c 11
In [247]: df.iloc[approach1(df.A.values, df.B.values)]
Out[247]:
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
Or using df.reindex from #jezrael's post (this works here because the default RangeIndex makes the labels coincide with the positions returned by argsort):
df.reindex(approach1(df.A.values, df.B.values))
