I have a large dataset of servers at target locations. I used the following code to calculate the mean of a set of values per site and attach it to each server's row:
df4 = df4.merge(df4.groupby('SITE', as_index=False).agg({'DSKPERCENT': 'mean'}).rename(columns={'DSKPERCENT': 'DSKPERCENT_MEAN'}), on='SITE', how='left')
Sample Resulting DF
Site Server DSKPERCENT DSKPERCENT_MEAN
A 1 12 11
A 2 10 11
A 3 11 11
B 1 9 9
B 2 12 9
B 3 7 9
C 1 12 13
C 2 12 13
C 3 16 13
What I need now is to print/export the newly calculated mean per site. How can I print/export just the single calculated mean value per site (i.e. Site A has a mean of 11, Site B of 9, etc.)?
IIUC, you're looking for a groupby -> transform type of operation. transform is similar to agg, except that the results are broadcast back to the same shape as the original DataFrame, so every row receives its group's aggregated value.
Sample Data
df = pd.DataFrame({
"groups": list("aaabbbcddddd"),
"values": [1,2,3,4,5,6,7,8,9,10,11,12]
})
df
groups values
0 a 1
1 a 2
2 a 3
3 b 4
4 b 5
5 b 6
6 c 7
7 d 8
8 d 9
9 d 10
10 d 11
11 d 12
Method
df["group_mean"] = df.groupby("groups")["values"].transform("mean")
print(df)
groups values group_mean
0 a 1 2
1 a 2 2
2 a 3 2
3 b 4 5
4 b 5 5
5 b 6 5
6 c 7 7
7 d 8 10
8 d 9 10
9 d 10 10
10 d 11 10
11 d 12 10
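Note that transform broadcasts the mean back onto every row. To print or export just one mean per group, which is what the original question asks for, aggregate instead of transforming. A minimal sketch, assuming the original column names SITE and DSKPERCENT and an illustrative output path:
site_means = df4.groupby('SITE', as_index=False)['DSKPERCENT'].mean()  # one row per site
print(site_means)
site_means.to_csv('site_means.csv', index=False)  # export; the path is hypothetical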
Related
I have a dataframe like this:
ID Packet Type
1 1 A
2 1 B
3 2 A
4 2 C
5 2 B
6 3 A
7 3 C
8 4 C
9 4 B
10 5 B
11 6 C
12 6 B
13 6 A
14 7 A
I want to filter the dataframe so that I keep only the entries that belong to a packet of size n whose types are all different. There are only n types.
For this example let's use n=3 and the types A,B,C.
In the end I want this:
ID Packet Type
3 2 A
4 2 C
5 2 B
11 6 C
12 6 B
13 6 A
How do I do this with pandas?
Another solution, using .groupby + .filter:
df = df.groupby("Packet").filter(lambda x: len(x) == x["Type"].nunique() == 3)
print(df)
Prints:
ID Packet Type
2 3 2 A
3 4 2 C
4 5 2 B
10 11 6 C
11 12 6 B
12 13 6 A
You can use transform with nunique. Note this keeps any packet with exactly 3 distinct types; if a packet could contain duplicate types, combine it with a size check:
out = df[df.groupby('Packet')['Type'].transform('nunique')==3]
Out[46]:
ID Packet Type
2 3 2 A
3 4 2 C
4 5 2 B
10 11 6 C
11 12 6 B
12 13 6 A
I'd loop over the groupby object, filter and concatenate:
>>> pd.concat(frame for _,frame in df.groupby("Packet") if len(frame) == 3 and frame.Type.is_unique)
ID Packet Type
2 3 2 A
3 4 2 C
4 5 2 B
10 11 6 C
11 12 6 B
12 13 6 A
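All three answers hard-code n=3. A hedged generalization for arbitrary n, assuming a packet qualifies when both its size and its number of distinct types equal n:
n = 3  # assumed packet size / number of types
sizes = df.groupby('Packet')['Type'].transform('size')      # rows per packet
ntypes = df.groupby('Packet')['Type'].transform('nunique')  # distinct types per packet
out = df[(sizes == n) & (ntypes == n)]
On large frames the transform-based masks are usually faster than .filter with a Python lambda, since the lambda runs once per group.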
I have a dataframe with 15 columns, and I am trying to use groupby to find the maximum value of one of those columns. I am getting the maximum value of item_number_revision within each item_number_start, but I would also like to keep all of the other existing columns from the original dataframe:
If you want to update item_number_revision to contain the max for its group, you can do this:
data1 = df.set_index('item_number_start').assign(item_number_revision=df.groupby('item_number_start')['item_number_revision'].max()).reset_index()
Input:
item_number_start item_number_revision other_column
0 80-0010 1 a
1 80-0011 2 b
2 80-0012 3 c
3 80-0010 4 d
4 80-0011 5 e
5 80-0012 6 f
6 80-0010 7 g
7 80-0011 8 h
8 80-0012 9 i
9 80-0010 10 j
10 80-0011 11 k
11 80-0012 12 l
Output:
item_number_start item_number_revision other_column
0 80-0010 10 a
1 80-0011 11 b
2 80-0012 12 c
3 80-0010 10 d
4 80-0011 11 e
5 80-0012 12 f
6 80-0010 10 g
7 80-0011 11 h
8 80-0012 12 i
9 80-0010 10 j
10 80-0011 11 k
11 80-0012 12 l
Alternatively, if you want to preserve the original column and add a new column containing the max, you can do this:
data1 = df.set_index('item_number_start').assign(item_number_revision_max=df.groupby('item_number_start')['item_number_revision'].max()).reset_index()
Output:
item_number_start item_number_revision other_column item_number_revision_max
0 80-0010 1 a 10
1 80-0011 2 b 11
2 80-0012 3 c 12
3 80-0010 4 d 10
4 80-0011 5 e 11
5 80-0012 6 f 12
6 80-0010 7 g 10
7 80-0011 8 h 11
8 80-0012 9 i 12
9 80-0010 10 j 10
10 80-0011 11 k 11
11 80-0012 12 l 12
In the examples above, we use set_index() to temporarily give the original DataFrame an index that matches the index of the groupby()...max() Series, then we use assign() to either overwrite the column we took the max of or add a new column, with each row assigned the max for its group. Finally, we use reset_index() to restore the column we temporarily set as the index.
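The same broadcasting can also be done with transform, which skips the set_index()/reset_index() round trip. A sketch of the new-column variant:
# broadcast each group's max back to every row of that group
df['item_number_revision_max'] = df.groupby('item_number_start')['item_number_revision'].transform('max')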
UPDATE:
To keep only the rows where item_number_revision equals its group's maximum, we can do this:
data1 = (
df.join(
df.groupby('item_number_start')['item_number_revision'].max()
.to_frame().set_index('item_number_revision', append=True),
on=['item_number_start', 'item_number_revision'], how='right')
)
Output:
item_number_start item_number_revision other_column
9 80-0010 10 j
10 80-0011 11 k
11 80-0012 12 l
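An equivalent, arguably simpler sketch builds a boolean mask with transform and keeps only the rows holding the group maximum:
group_max = df.groupby('item_number_start')['item_number_revision'].transform('max')
data1 = df[df['item_number_revision'] == group_max]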
Let's say I have my data shaped as in this example:
idx = pd.MultiIndex.from_product([[1, 2, 3, 4, 5, 6], ['a', 'b', 'c']],
names=['numbers', 'letters'])
col = ['Value']
df = pd.DataFrame(list(range(18)), idx, col)
print(df.unstack())
The output will be
Value
letters a b c
numbers
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
5 12 13 14
6 15 16 17
letters and numbers are index levels and Value is the only column.
The question is: how can I replace the Value column with columns named after the values of the letters index?
So I would like to get such output
numbers a b c
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
5 12 13 14
6 15 16 17
where a, b and c are columns and numbers is the only index.
Appreciate your help.
The problem is caused by calling unstack on the DataFrame rather than on the pd.Series:
df.Value.unstack().rename_axis(None, axis=1)
Out[151]:
a b c
numbers
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
5 12 13 14
6 15 16 17
Wen-Ben's answer prevents you from running into a data frame with multiple column levels in the first place.
If you happened to be stuck with a multi-index column anyway, you can get rid of it by using .droplevel():
df = df.unstack()
df.columns = df.columns.droplevel()
df
Out[7]:
letters a b c
numbers
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
5 12 13 14
6 15 16 17
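If you also want to drop the leftover letters name from the columns, the two steps can be chained with rename_axis; a sketch:
df = df.unstack().droplevel(0, axis=1).rename_axis(columns=None)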
I have a dataframe df:
A B C D E
0 V 10 5 18 20
1 W 9 18 11 13
2 X 8 7 12 5
3 Y 7 9 7 8
4 Z 6 5 3 90
I want to add a column 'Result' that contains 1 if the value in column 'E' is greater than the values in columns B, C and D, and 0 otherwise.
Output should be:
A B C D E Result
0 V 10 5 18 20 1
1 W 9 18 11 13 0
2 X 8 7 12 5 0
3 Y 7 9 7 8 0
4 Z 6 5 3 90 1
For a few columns I would use spreadsheet-style logic such as if(and(E>B, E>C, E>D), 1, 0), but I have to compare around 20 columns (B through U) against a column named 'V'. Additionally, the dataframe has around 100 thousand rows.
I am using
df['Result'] = np.where((df.ix[:, 1:20] < df['V']).all(1), 1, 0)
and it gives a MemoryError.
One possible solution is to compare in numpy and then convert the boolean mask to ints:
df['Result'] = (df.iloc[:, 1:4].values < df[['E']].values).all(axis=1).astype(int)
print (df)
A B C D E Result
0 V 10 5 18 20 1
1 W 9 18 11 13 0
2 X 8 7 12 5 0
3 Y 7 9 7 8 0
4 Z 6 5 3 90 1
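If memory is the concern with ~100 thousand rows and 20 comparison columns, DataFrame.lt with axis=0 compares each column against the Series row-wise and avoids the misaligned broadcast the original attempt ran into. A sketch on the sample frame (columns B through D against E; on the real data the slice would cover B through U and the column 'V'):
df['Result'] = df.iloc[:, 1:4].lt(df['E'], axis=0).all(axis=1).astype(int)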
I have a pandas dataframe, with 4 rows and 4 columns - here is a simple version:
import pandas as pd
import numpy as np
rows = np.arange(1, 5, 1)
values = np.arange(1, 17).reshape(4,4)
df = pd.DataFrame(values, index=rows, columns=['A', 'B', 'C', 'D'])
What I am trying to do is convert this to a two-column dataframe (12 rows by 2 columns), with the B, C and D values lined up under each A value, so it would look like this:
1 2
1 3
1 4
5 6
5 7
5 8
9 10
9 11
9 12
13 14
13 15
13 16
Reading the pandas documentation, I tried this:
df1 = pd.pivot_table(df, rows = ['B', 'C', 'D'], cols = 'A')
but it gives me an error whose source I cannot identify (it ends with DataError: No numeric types to aggregate).
Following that, I want to split the dataframe based on A values, but I think the .groupby command will probably take care of that.
What you are looking for is the melt function:
pd.melt(df,id_vars=['A'])
A variable value
0 1 B 2
1 5 B 6
2 9 B 10
3 13 B 14
4 1 C 3
5 5 C 7
6 9 C 11
7 13 C 15
8 1 D 4
9 5 D 8
10 9 D 12
11 13 D 16
A final sort on A is then necessary:
pd.melt(df, id_vars=['A']).sort_values('A')
A variable value
0 1 B 2
4 1 C 3
8 1 D 4
1 5 B 6
5 5 C 7
9 5 D 8
2 9 B 10
6 9 C 11
10 9 D 12
3 13 B 14
7 13 C 15
11 13 D 16
Note: the original answer used pd.DataFrame.sort, which has since been removed in favour of pd.DataFrame.sort_values, used above.
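Putting it together and dropping the variable column to match the two-column output asked for, a sketch:
out = pd.melt(df, id_vars=['A']).sort_values('A')[['A', 'value']]
print(out.to_string(index=False, header=False))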