Creating an aggregated summed table in Python [duplicate]

This question already has answers here:
How do I Pandas group-by to get sum?
(11 answers)
Closed 3 years ago.
Is there an easy way of creating a summed aggregate table in Python from columns in an existing table? I have only used SQL previously, where this would be done with code such as:
select AreaCode, Measure, sum(Value) as 'VALUE'
from Table
group by AreaCode, Measure
In my current table (sticking with the example above) I have hundreds of rows containing AreaCode, Measure and Value that I want to aggregate into a new table in Python.

Given a pandas dataframe named table looking like this:
table
#   AreaCode Measure  Value
# 0        A      M1     13
# 1        A      M1      1
# 2        B      M1     15
# 3        B      M1      1
# 4        A      M2     54
# 5        A      M2      1
# 6        B      M2     17
# 7        B      M2      1
The code to perform the aggregation would be:
table.groupby(['AreaCode', 'Measure'], as_index=False).sum()
#   AreaCode Measure  Value
# 0        A      M1     14
# 1        A      M2     55
# 2        B      M1     16
# 3        B      M2     18
Code to generate table and test this solution:
import pandas as pd

table = pd.DataFrame({'AreaCode': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'A', 5: 'A', 6: 'B', 7: 'B'},
                      'Measure': {0: 'M1', 1: 'M1', 2: 'M1', 3: 'M1', 4: 'M2', 5: 'M2', 6: 'M2', 7: 'M2'},
                      'Value': {0: 13, 1: 1, 2: 15, 3: 1, 4: 54, 5: 1, 6: 17, 7: 1}})
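If you also want the SQL-style column alias (VALUE instead of Value), named aggregation is one way to reproduce it - a minimal sketch, assuming a pandas version that supports named aggregation (0.25+):
# named aggregation lets you name the output column, like "sum(Value) as 'VALUE'"
out = table.groupby(['AreaCode', 'Measure'], as_index=False).agg(VALUE=('Value', 'sum'))
print(out)
#   AreaCode Measure  VALUE
# 0        A      M1     14
# 1        A      M2     55
# 2        B      M1     16
# 3        B      M2     18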

Related

Filter for column value and set its other column to an array

I have a df such as
Letter  Stats
B       0
B       1
C       22
B       0
C       0
B       3
How can I filter for a value in the Letter column and then convert the Stats column for that value into an array?
Basically, I want to filter for B and convert the Stats column to an array. Thanks!
Here is one way to do it:
# the function receives a dataframe and a letter,
# and returns the Stats values as a list for that letter
def grp(df, letter):
    return df.loc[df['Letter'].eq(letter)]['Stats'].values.tolist()

# pass the dataframe and the letter
result = grp(df, 'B')
print(result)
[0, 1, 0, 3]
data used
import pandas as pd

data = {'Letter': {0: 'B', 1: 'B', 2: 'C', 3: 'B', 4: 'C', 5: 'B'},
        'Stats': {0: 0, 1: 1, 2: 22, 3: 0, 4: 0, 5: 3}}
df = pd.DataFrame(data)
Although I believe the solution proposed by @Naveed is enough for this problem, one small extension could be suggested. If you would like to get the result as a pandas Series and obtain some statistics for it:
data = {'Letter': {0: 'B', 1: 'B', 2: 'C', 3: 'B', 4: 'C', 5: 'B'},
        'Stats': {0: 0, 1: 1, 2: 22, 3: 0, 4: 0, 5: 3}}
df = pd.DataFrame(data)
letter = 'B'
ser = pd.Series(name=letter, data=df.loc[df['Letter'].eq(letter)]['Stats'].values)
print(f"Max value: {ser.max()} | Min value: {ser.min()} | Median value: {ser.median()}") etc.
Output:
Max value: 3 | Min value: 0 | Median value: 0.5
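As a further extension - my own sketch, not part of the answers above - groupby can build the lists for every letter in one pass, which is handy if you need more than one of them:
# one list of Stats per letter, indexed by Letter
lists = df.groupby('Letter')['Stats'].apply(list)
print(lists['B'])
# [0, 1, 0, 3]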

Python Pandas group by mean() for a certain count of rows

I need to group by category and take the mean() of the first 2 values in each category. How do I define that?
df like
   category  value
->        a      2
->        a      5
          a      4
          a      8
->        b      6
->        b      3
          b      1
->        c      2
->        c      2
          c      7
by reading only the arrowed rows, so that the output looks like
category mean
a 3.5
b 4.5
c 2
How can I do this? I am trying the following, but do not know where to specify that only the first 2 observations of each category should be used:
output = df.groupby(['category'])['value'].mean().reset_index()
Your help is appreciated; thanks in advance.
You can also do this via groupby() and agg():
out = df.groupby('category', as_index=False)['value'].agg(lambda x: x.head(2).mean())
Try apply on each group of values, using head(2) to take just the first 2 values before computing the mean:
import pandas as pd
df = pd.DataFrame({
    'category': {0: 'a', 1: 'a', 2: 'a', 3: 'a', 4: 'b', 5: 'b',
                 6: 'b', 7: 'c', 8: 'c', 9: 'c'},
    'value': {0: 2, 1: 5, 2: 4, 3: 8, 4: 6, 5: 3, 6: 1, 7: 2,
              8: 2, 9: 7}
})
output = df.groupby('category', as_index=False)['value'] \
           .apply(lambda a: a.head(2).mean())
print(output)
output:
category value
0 a 3.5
1 b 4.5
2 c 2.0
Or create a boolean mask to filter df with:
m = df.groupby('category').cumcount().lt(2)  # True for the first 2 rows of each category
output = df[m].groupby('category')['value'].mean().reset_index()
print(output)
category value
0 a 3.5
1 b 4.5
2 c 2.0
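A third variation (my own sketch): groupby().head(2) returns the first 2 rows of each group directly, so you can take those rows and then group again for the mean:
# keep the first 2 rows per category, then aggregate
out = df.groupby('category').head(2).groupby('category', as_index=False)['value'].mean()
print(out)
#   category  value
# 0        a    3.5
# 1        b    4.5
# 2        c    2.0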

groupby + apply results in a series appearing both in index and column - how to prevent it?

I've got a following data frame:
import pandas as pd

dict1 = {'id': {0: 11, 1: 12, 2: 13, 3: 14, 4: 15, 5: 16, 6: 19, 7: 18, 8: 17},
'var1': {0: 20.272108843537413,
1: 21.088435374149658,
2: 20.68027210884354,
3: 23.945578231292515,
4: 22.857142857142854,
5: 21.496598639455787,
6: 39.18367346938776,
7: 36.46258503401361,
8: 34.965986394557824},
'var2': {0: 27.731092436974773,
1: 43.907563025210074,
2: 55.67226890756303,
3: 62.81512605042017,
4: 71.63865546218487,
5: 83.40336134453781,
6: 43.48739495798319,
7: 59.243697478991606,
8: 67.22689075630252},
'var3': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 2}}
ex = pd.DataFrame(dict1).set_index('id')
I wanted to sort within groups according to var1, so I wrote the following:
ex.groupby('var3').apply(lambda x: x.sort_values('var1'))
However, it results in a data frame which has var3 both in index and in column. How to prevent that and leave it only in a column?
Add the optional parameter as_index=False to groupby:
ex.groupby('var3', as_index=False) \
.apply(lambda x: x.sort_values('var1'))
Or, if you don't want a MultiIndex:
ex.groupby('var3', as_index=False) \
.apply(lambda x: x.sort_values('var1')) \
.reset_index(level=0, drop=True)
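Another option - a sketch, assuming your pandas version supports the group_keys parameter of groupby (most do) - is to tell groupby not to add the group keys to the index at all:
# group_keys=False keeps the original id index; var3 stays a column
ex.groupby('var3', group_keys=False).apply(lambda x: x.sort_values('var1'))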
You could use:
df_sorted = ex.groupby('var3').apply(lambda x: x.sort_values('var1')).reset_index(drop=True)
print(df_sorted)
var1 var2 var3
0 20.272109 27.731092 1
1 20.680272 55.672269 1
2 21.088435 43.907563 1
3 21.496599 83.403361 1
4 22.857143 71.638655 1
5 23.945578 62.815126 1
6 34.965986 67.226891 2
7 36.462585 59.243697 2
8 39.183673 43.487395 2
But you only need DataFrame.sort_values, sorting first by var3 and then by var1:
df_sort = ex.sort_values(['var3', 'var1'])
print(df_sort)
var1 var2 var3
id
11 20.272109 27.731092 1
13 20.680272 55.672269 1
12 21.088435 43.907563 1
16 21.496599 83.403361 1
15 22.857143 71.638655 1
14 23.945578 62.815126 1
17 34.965986 67.226891 2
18 36.462585 59.243697 2
19 39.183673 43.487395 2
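If you additionally want id as a regular column with a fresh 0..n index (matching the groupby output above), a small follow-up sketch:
# reset_index() moves the id index into a column
df_sort = ex.sort_values(['var3', 'var1']).reset_index()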

How do you pass a dictionary through the pandas replace function? [duplicate]

This question already has answers here:
Pandas - replacing column values
(5 answers)
Closed 4 years ago.
How would you use a dictionary to replace values throughout all the columns in a dataframe?
Below is an example where I attempted to pass the dictionary through the replace function. I have a script that produces variously sized dataframes based on a company's manager/employee structure, so the number of columns varies per dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}, 'col3': {0: 'w', 1: 1, 2: 2}, 'col4': {0: 'w', 1: 1, 2: 2}})
di = {1: "A", 2: "B"}
print(df)
df = df.replace({di})
print(df)
There is a similar question linked below whose solution specifies column names, but given that I'm looking at whole dataframes that vary in column names and size, I'd like to apply the replace function to the whole dataframe.
Remap values in pandas column with a dict
Thanks!
Don't put {} around your dictionary - that tries to turn it into a set, which throws an error since a dictionary is unhashable and therefore can't be an element of a set. Instead pass in the dictionary directly:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}, 'col3': {0: 'w', 1: 1, 2: 2}, 'col4': {0: 'w', 1: 1, 2: 2}})
di = {1: "A", 2: "B"}
print(df)
df = df.replace(di)
print(df)
Output:
col1 col2 col3 col4
0 w a w w
1 1 2 1 1
2 2 NaN 2 2
col1 col2 col3 col4
0 w a w w
1 A B A A
2 B NaN B B
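As a side note - a hedged sketch, in case you later want to restrict the replacement - replace also accepts a nested dictionary keyed by column name, which limits the mapping to specific columns:
# replace 1 and 2 only in col1 and col3; other columns are untouched
df = df.replace({'col1': di, 'col3': di})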

Performing groupby function using two columns as parameters regardless of the order of the columns

Given the following dataframe:
Node_1 Node_2 Time
A B 6
A B 4
B A 2
B C 5
How can one obtain, using groupby or other methods, the dataframe as follows:
Node_1 Node_2 Mean_Time
A B 4
B C 5
The first row's Mean_Time is obtained by averaging all routes A->B and B->A, i.e. (6 + 4 + 2)/3 = 4.
You could sort each row of the Node_1 and Node_2 columns using np.sort:
nodes = df.filter(regex='Node')
arr = np.sort(nodes.values, axis=1)
df.loc[:, nodes.columns] = arr
which results in df now looking like:
Node_1 Node_2 Time
0 A B 6
1 A B 4
2 A B 2
3 B C 5
With the Node columns sorted, you can groupby/agg as usual (cols is the list of Node column names, defined in the full example below):
result = df.groupby(cols).agg('mean').reset_index()
import numpy as np
import pandas as pd
data = {'Node_1': {0: 'A', 1: 'A', 2: 'B', 3: 'B'},
'Node_2': {0: 'B', 1: 'B', 2: 'A', 3: 'C'},
'Time': {0: 6, 1: 4, 2: 2, 3: 5}}
df = pd.DataFrame(data)
nodes = df.filter(regex='Node')
arr = np.sort(nodes.values, axis=1)
cols = nodes.columns.tolist()
df.loc[:, nodes.columns] = arr
result = df.groupby(cols).agg('mean').reset_index()
print(result)
yields
Node_1 Node_2 Time
0 A B 4
1 B C 5
Something along these lines should give you the desired result... This got a lot uglier than it was :D
import numpy as np
import pandas as pd
data = {'Node_1': {0: 'A', 1: 'A', 2: 'B', 3: 'B'},
'Node_2': {0: 'B', 1: 'B', 2: 'A', 3: 'C'},
'Time': {0: 6, 1: 4, 2: 2, 3: 5}}
df = pd.DataFrame(data)
# create a new order-insensitive key column to group by
df["Node"] = df[["Node_1", "Node_2"]].apply(lambda x: tuple(sorted(x)), axis=1)
# create the Mean_time column from the group means of Time
df["Mean_time"] = df.groupby('Node')['Time'].transform('mean')
# drop duplicate rows, then drop the helper Node and the original Time columns
df = df.drop_duplicates("Node").drop(['Node', 'Time'], axis=1)
print(df)
Returns:
Node_1 Node_2 Mean_time
0 A B 4
3 B C 5
An alternative would be to use:
df = (df.groupby('Node', as_index=False)
        .agg({'Node_1': lambda x: list(x)[0],
              'Node_2': lambda x: list(x)[0],
              'Time': np.mean})
        .drop('Node', axis=1))
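For completeness, one more compact sketch (my own variation, starting again from the original df): frozensets are hashable and order-insensitive, so you can group on them directly without a helper column, at the cost of the result being indexed by frozensets rather than two node columns:
# frozenset({'A', 'B'}) == frozenset({'B', 'A'}), so direction is ignored
key = df[['Node_1', 'Node_2']].apply(frozenset, axis=1)
print(df.groupby(key)['Time'].mean())  # {A, B} -> 4.0, {B, C} -> 5.0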
