groupby with multiple columns with addition and frequency counts in pandas [duplicate] - python

This question already has answers here:
Multiple aggregations of the same column using pandas GroupBy.agg()
(4 answers)
Closed 4 years ago.
I have a table that looks as follows:
name type val
A online 12
B online 24
A offline 45
B online 32
A offline 43
B offline 44
I want to group the dataframe by the multiple columns name & type, and also add a column with the count of records in each group alongside the sum of val for those records. It should look like this:
name type count val
A online 1 12
offline 2 88
B online 2 56
offline 1 44
I have tried df.groupby(['name', 'type'])['val'].sum(), which gives the sum, but I am unable to add the count of records.

Add the parameter sort=False to groupby to avoid the default sorting, aggregate with agg using tuples of new column names and aggregation functions, and finally call reset_index to convert the MultiIndex into columns:
df1 = (df.groupby(['name', 'type'], sort=False)['val']
.agg([('count', 'count'),('val', 'sum')])
.reset_index())
print (df1)
name type count val
0 A online 1 12
1 B online 2 56
2 A offline 2 88
3 B offline 1 44
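On pandas 0.25 or newer, the same result can be written with named aggregation instead of the tuple syntax. A minimal runnable sketch, assuming the sample data from the question:

import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'B', 'A', 'B', 'A', 'B'],
    'type': ['online', 'online', 'offline', 'online', 'offline', 'offline'],
    'val': [12, 24, 45, 32, 43, 44],
})

# Named aggregation: the keyword is the output column, the value is (input column, function)
df1 = (df.groupby(['name', 'type'], sort=False)
         .agg(count=('val', 'count'), val=('val', 'sum'))
         .reset_index())
print(df1)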

You can try pivoting, i.e.
df.pivot_table(index=['name','type'],aggfunc=['count','sum'],values='val')
count sum
val val
name type
A offline 2 88
online 1 12
B offline 1 44
online 2 56
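Note that pivot_table here returns a MultiIndex on the columns (the aggfunc on the first level, 'val' on the second). If you want flat columns matching the desired output, one option (a sketch, assuming the pivot result is stored in out) is:

out = df.pivot_table(index=['name', 'type'], aggfunc=['count', 'sum'], values='val')
out.columns = ['count', 'val']  # flatten the (aggfunc, column) MultiIndex
out = out.reset_index()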

Related

drop rows based on a condition from another column [duplicate]

This question already has answers here:
Get the row(s) which have the max value in groups using groupby
(15 answers)
Closed 6 months ago.
I have the following data frame
user_id value
1       5
1       7
1       11
1       15
1       35
2       8
2       9
2       14
I want to drop all rows that are not the maximum value for each user_id, resulting in a 2-row data frame:
user_id value
1       35
2       14
How can I do that?
You can take the max of the value column after the grouping.
Assuming that your original dataframe is named df, try the code below:
out = df.groupby('user_id', as_index=False)['value'].max()
>>> print(out)
   user_id  value
0        1     35
1        2     14
Edit:
If you want to group by more than one column, use this:
out = df.groupby(['user_id', 'sex'], as_index=False, sort=False)['value'].max()
>>> print(out)
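Note that groupby(...).max() aggregates each column independently, so it can mix values from different rows. If you need the entire original row holding the maximum value (the approach from the linked duplicate), a sketch using idxmax:

# Keep the full row where 'value' is largest within each user_id
out = df.loc[df.groupby('user_id')['value'].idxmax()]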

Merging two dataframes while considering overlaps and missing indexes [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 1 year ago.
I have multiple dataframes that have an ID and a value, and I am trying to merge them such that each ID has all the values in its row.
ID Value
1  10
3  21
4  12
5  43
7  11
And then I have another dataframe:
ID Value2
1  12
2  14
4  55
6  23
7  90
I want to merge these two in a way that keeps the IDs that are already in the first dataframe, and if an ID from the second dataframe is not in the first one, adds it as a new row with Value2 filled in and Value left empty. This is what my result would look like:
ID Value Value2
1  10    12
3  21    -
4  12    55
5  43    -
7  11    90
2  -     14
6  -     23
Hope this makes sense. I don't really care about the order of the ID numbers; they can be sorted or not. My goal is to be able to create a dictionary for each ID with "Value", "Value2", "Value3", ... as keys and the corresponding actual value numbers as the dictionary values. Please let me know if any clarification is needed.
You can use pandas' DataFrame.merge method:
import pandas as pd
df1.merge(df2, how='outer', on='ID')
Specifying how='outer' uses the union of keys from both dataframes.
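Since the stated goal is a dictionary per ID, one way to get there from the merged frame (a sketch; missing entries come out as NaN rather than '-') is:

merged = df1.merge(df2, how='outer', on='ID')
# e.g. {1: {'Value': 10.0, 'Value2': 12.0}, 2: {'Value': nan, 'Value2': 14.0}, ...}
per_id = merged.set_index('ID').to_dict(orient='index')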

Populate Dataframe column from information in other Dataframe [duplicate]

This question already has answers here:
Pandas: how to merge two dataframes on a column by keeping the information of the first one?
(4 answers)
Closed 1 year ago.
I have two dataframes: one (A) contains the notes associated with certain accounts; the other (B) is a list of accounts to which I wish to add a column containing the note for each account. In this example there will be times when an account number in dataframe B is not in dataframe A, and I would like to fill these with either NaN or 0.
Input:
Dataframe A:
Account Note
11 a
12 b
13 c
14 d
15 e
16 f
Dataframe B:
Account
11
25
42
14
15
19
26
Desired Output:
Dataframe C:
Account Note
11 a
25
42
14 d
15 e
19
26
Note that in my real-world example the size of dataframe B will be much bigger than that of A.
Try merge with how='left' and on='Account':
>>> df_b.merge(df_a, how='left', on='Account')
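A left merge leaves NaN in Note for accounts missing from dataframe A; if you would rather have 0, as mentioned in the question, one option (a sketch) is to chain fillna:

df_c = df_b.merge(df_a, how='left', on='Account').fillna(0)

# Alternative that avoids a full merge, which can help when B is much larger than A:
df_b['Note'] = df_b['Account'].map(df_a.set_index('Account')['Note']).fillna(0)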

Create multiple DataFrames based on given column values [duplicate]

This question already has answers here:
Split pandas dataframe based on groupby
(4 answers)
Closed 4 years ago.
There's probably a simple solution to this that I just couldn't find...
With the given DataFrame, how can I separate it into multiple DataFrames and go from something like:
>>> import pandas as pd
>>> d = {'LOT': [102,104,162,102,104,102], 'VAL': [22,424,65,4,34,6]}
>>> df = pd.DataFrame(data=d)
>>> df
LOT VAL
0 102 22
1 104 424
2 162 65
3 102 4
4 104 34
5 102 6
to:
>>> df[0]
LOT VAL
0 102 22
1 102 4
2 102 6
>>> df[1]
LOT VAL
0 104 424
1 104 34
>>> df[2]
LOT VAL
0 162 65
That is, 3 distinct DataFrames.
Please let me know if you need more information.
This is a simple groupby. Let me see if I find a dupe:
import pandas as pd

df = pd.DataFrame({
    'LOT': [102, 104, 162, 102, 104, 102],
    'VAL': [22, 424, 65, 4, 34, 6]
})

# One DataFrame per distinct LOT value
dfs = [x for _, x in df.groupby('LOT')]
Ok, I found something. However, the answer there seems overcomplicated, so I'm gonna leave this here.
Looks a lot like: Split pandas dataframe based on groupby
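If you also want the fresh 0-based index shown in the desired output, and lookup by LOT value rather than by list position, a sketch using a dict comprehension:

# Map each LOT value to its own re-indexed DataFrame
dfs_by_lot = {lot: g.reset_index(drop=True) for lot, g in df.groupby('LOT')}
print(dfs_by_lot[102])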

Pandas: Get top 10 values AFTER grouping

I have a pandas data frame with a column 'id' and a column 'value'. It is already sorted by first id (ascending) and then value (descending). What I need is the top 10 values per id.
I assumed that something like the following would work, but it doesn't:
df.groupby("id", as_index=False).aggregate(lambda (index,rows) : rows.iloc[:10])
What I get is just a list of ids, the value column (and other columns that I omitted for the question) aren't there anymore.
Any ideas how it might be done, without iterating through each of the single rows and appending the first ten to another data structure?
Is this what you're looking for?
df.groupby('id').head(10)
I would like to answer this by giving an example dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([['a','a','b','c','a','c','b'], [4,6,1,8,9,4,1],
                            [12,11,7,1,5,5,7], [123,54,146,96,10,114,200]]).T,
                  columns=['item','date','hour','value'])
df['value'] = pd.to_numeric(df['value'])
This gives you a dataframe
item date hour value
a 4 12 123
a 6 11 54
b 1 7 146
c 8 1 96
a 9 5 10
c 4 5 114
b 1 7 200
Now this is grouped below, displaying the first 2 values of each group:
df.groupby(['item'])['value'].head(2)
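Note that head(n) returns the first n rows of each group in their existing order, so it only yields the top values when the frame is pre-sorted (as it is in the question). For unsorted data, a sketch using nlargest or an explicit sort:

# Top 2 'value' entries per item, regardless of row order
df.groupby('item')['value'].nlargest(2)

# Or, to keep all columns: sort first, then take the head of each group
df.sort_values('value', ascending=False).groupby('item').head(2)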
