How do I transpose and aggregate this dataframe in the right order? - python

I am trying to find an efficient way to create a dataframe that lists all distinct game values as columns and aggregates the game-play hours by user_id. This is my example df:
user_id | game | game_hours | rank_order
1 | Fortnight | 1.5 | 1
1 | COD | 0.5 | 2
1 | Horizon | 1.7 | 3
1 | ... | ... | n
2 | Fifa2021 | 1.9 | 1
2 | A Way Out | 0.2 | 2
2 | ... | ... | n
...
Step 1: How do I get from this to the following df format (matching the rank order correctly, since it reflects the time sequence)?
user_id | game_1 | game_2 | game_3 | game_n ...| game_hours
1 | Fortnight | COD | Horizon| | 3.7
2 | Fifa21 | A Way Out | | | 2.1
...

Use DataFrame.pivot with DataFrame.add_prefix, and create the new column with DataFrame.assign and a groupby sum:
df = (df.pivot(index='user_id', columns='rank_order', values='game')
        .add_prefix('game_')
        .assign(game_hours=df.groupby('user_id')['game_hours'].sum())
        .reset_index()
        .rename_axis(None, axis=1))
print(df)
user_id game_1 game_2 game_3 game_hours
0 1 Fortnight COD Horizon 3.7
1 2 Fifa2021 A Way Out NaN 2.1
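For reference, here is a self-contained version of the above; the sample frame is re-typed from the question's example, so the exact values are only illustrative:

import pandas as pd

# Sample data re-typed from the question (values are illustrative)
df = pd.DataFrame({
    'user_id':    [1, 1, 1, 2, 2],
    'game':       ['Fortnight', 'COD', 'Horizon', 'Fifa2021', 'A Way Out'],
    'game_hours': [1.5, 0.5, 1.7, 1.9, 0.2],
    'rank_order': [1, 2, 3, 1, 2],
})

out = (df.pivot(index='user_id', columns='rank_order', values='game')
         .add_prefix('game_')
         .assign(game_hours=df.groupby('user_id')['game_hours'].sum())
         .reset_index()
         .rename_axis(None, axis=1))
print(out)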

Related

Calculate mean, median, first and third quartile from a dataframe with a distribution, with pandas

I have an aggregate-level dataframe as follows:
errors | class | num_students
1 | A | 5
2 | A | 8
3 | A | 2
...
10 | A | 1
1 | B | 9
2 | B | 12
3 | B | 5
10 | B | 2
...
The original data was at student-ID level, so this dataframe holds the distribution of errors for each class. I want to get summary statistics per class that look like the table below:
Class | average error | median error | Q1 error | Q3 error
A | 2.1 | 2 | 1 | 3
B | 3.4 | 3 | 2 | 5
What is the best way to accomplish this?
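One possible approach (a sketch, not an accepted answer): re-expand the aggregated counts back to one row per student with Index.repeat, then compute standard summary statistics per class. The data below is made up to match the shape described:

import pandas as pd

# Made-up data in the shape described above
df = pd.DataFrame({'errors':       [1, 2, 3, 10, 1, 2, 3, 10],
                   'class':        ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'num_students': [5, 8, 2, 1, 9, 12, 5, 2]})

# One row per student again, then aggregate per class
expanded = df.loc[df.index.repeat(df['num_students']), ['class', 'errors']]
stats = (expanded.groupby('class')['errors']
                 .agg(average_error='mean',
                      median_error='median',
                      Q1_error=lambda s: s.quantile(0.25),
                      Q3_error=lambda s: s.quantile(0.75))
                 .reset_index())
print(stats)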

Converting groupby pandas df of absolute numbers to percentage of row totals

I have some data in my df that shows the two categories a user belongs to, and I want to see the number of users for each category pair expressed as a percentage of the row total.
Original dataframe df:
+------+------+--------+
| cat1 | cat2 | user |
+------+------+--------+
| A | X | 687568 |
| A | Y | 68575 |
| B | Y | 56478 |
| A | X | 6587 |
| A | Y | 45678 |
| B | X | 5678 |
| B | X | 967 |
| A | X | 345 |
+------+------+--------+
I convert this to a groupby df using:
df2 = df.groupby(['cat1', 'cat2']).agg({'user': 'nunique'}).reset_index().pivot(index='cat1', columns='cat2', values='user')
to get the pairwise count of users per combination of categories (the numbers here are made up):
+------+----+----+
| cat2 | X | Y |
+------+----+----+
| cat1 | | |
+------+----+----+
| A | 5 | 5 |
| B | 10 | 40 |
+------+----+----+
And I would like to convert the numbers to percent totals of the rows (Cat1), e.g. for the first row, 5/(5+5) = 0.5 and so on to give:
+------+-----+-----+
| cat2 | X | Y |
+------+-----+-----+
| cat1 | | |
| A | 0.5 | 0.5 |
| B | 0.2 | 0.8 |
+------+-----+-----+
Would I have to create a new column in my grouped df that contains the row-wise sum, and then iterate through each value in a row and divide it by that total?
You can simplify your expression:
piv = df.pivot_table('user', 'cat1', 'cat2', aggfunc='nunique')
pct = piv.div(piv.sum(axis=1), axis=0)
Output:
>>> piv
cat2 X Y
cat1
A 3 2
B 2 1
>>> pct
cat2 X Y
cat1
A 0.600000 0.400000
B 0.666667 0.333333
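As a possible one-step alternative (a sketch assuming the same df as in the question), pd.crosstab can aggregate and normalize by the row total in one call via normalize='index':

# Count distinct users per cat1/cat2 pair, then divide each row by its total
pct = pd.crosstab(df['cat1'], df['cat2'],
                  values=df['user'], aggfunc='nunique',
                  normalize='index')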

Grouping many columns in one column in Pandas

I have a DataFrame that is similar to this one:
| | id | Group1 | Group2 | Group3 |
|---|----|--------|--------|--------|
| 0 | 22 | A | B | C |
| 1 | 23 | B | C | D |
| 2 | 24 | C | B | A |
| 3 | 25 | D | A | C |
And I want to get something like this:
| | Group | id_count |
|---|-------|----------|
| 0 | A | 3 |
| 1 | B | 3 |
| 2 | C | 3 |
| 3 | D | 2 |
Basically, for each group I want to know how many people (id) have chosen it.
I know there is groupby(), but it only gives an appropriate result for one column (if I pass it a list, it does not combine Group1, Group2 and Group3 into one column).
Use DataFrame.melt with GroupBy.size:
df1 = (df.melt('id', value_name='Group')
         .groupby('Group')
         .size()
         .reset_index(name='id_count'))
print(df1)
Group id_count
0 A 3
1 B 3
2 C 4
3 D 2
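If the same id can pick a group in more than one column and should only be counted once per group (an assumption about the intended semantics, not stated in the question), counting distinct ids instead of occurrences is a small variant:

# Count distinct ids per group rather than raw occurrences
df1 = (df.melt('id', value_name='Group')
         .groupby('Group')['id']
         .nunique()
         .reset_index(name='id_count'))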

How to aggregate data by counts of a level, setting each level's counts as its own column?

I have data which has a row granularity in terms of events, and I want to aggregate them by a customer ID. The data is in the form of a pandas df and looks like so:
| Event ID | Cust ID | P1  | P2 | P3 | P4 |
|----------|---------|-----|----|----|----|
| 1        | 1       | 12  | 0  | 0  | 0  |
| 2        | 1       | 12  | 0  | 0  | 0  |
| 3        | 1       | 10  | 12 | 0  | 0  |
| 4        | 2       | 206 | 0  | 0  | 0  |
| 5        | 2       | 206 | 25 | 0  | 0  |
P1 to P4 contain numbers that are just levels; they are event categories I need to get counts of (there are 175+ codes), where each event category gets its own column.
The output I want, would ideally look like:
| Cust ID | Count(12) | Count(10) | Count(25) | Count(206) |
|---------|-----------|-----------|-----------|------------|
| 1       | 3         | 1         | 0         | 0          |
| 2       | 0         | 0         | 1         | 2          |
The challenge I am facing is taking the counts across multiple columns. There are 2 '12's in P1 and 1 '12' in P2.
I tried using groupby and merge. But I've either used them incorrectly or they're the wrong functions to use because I get a lot of 'NaN's in the resulting table.
You can use the following method:
df = pd.DataFrame({'Event ID': [1, 2, 3, 4, 5],
                   'Cust ID': [1]*3 + [2]*2,
                   'P1': [12, 12, 10, 206, 206],
                   'P2': [0, 0, 12, 0, 25],
                   'P3': [0]*5,
                   'P4': [0]*5})
df.melt(['Event ID', 'Cust ID'])\
  .groupby('Cust ID')['value'].value_counts()\
  .unstack().add_prefix('Count_')\
  .reset_index()
Output:
value  Cust ID  Count_0  Count_10  Count_12  Count_25  Count_206
0            1      8.0       1.0       3.0       NaN        NaN
1            2      5.0       NaN       NaN       1.0        2.0
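The NaNs simply mean a code never occurred for that customer. Assuming 0 is only a filler and not a real event code, the result can be tidied like this (a sketch building on the same melt/value_counts idea):

out = (df.melt(['Event ID', 'Cust ID'])
         .groupby('Cust ID')['value'].value_counts()
         .unstack(fill_value=0)             # missing codes become 0 instead of NaN
         .drop(columns=0, errors='ignore')  # assumes 0 is just a placeholder
         .add_prefix('Count_')
         .reset_index()
         .rename_axis(None, axis=1))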

Pandas, create new column based on values from previous rows with certain values

Hi, I'm trying to use ML to predict some future sales, so I would like to add the mean sales from the previous month/year for each product.
My df is something like [ id | year | month | product_id | sales ], and I would like to add prev_month_mean_sale and prev_month_id_sale columns:
id | year | month | product_id | sales | prev_month_mean_sale | prev_month_id_sale
----------------------------------------------------------------------
1 | 2018 | 1 | 123 | 5 | NaN | NaN
2 | 2018 | 1 | 234 | 4 | NaN | NaN
3 | 2018 | 1 | 345 | 2 | NaN | NaN
4 | 2018 | 2 | 123 | 3 | 3.6 | 5
5 | 2018 | 2 | 345 | 2 | 3.6 | 2
6 | 2018 | 3 | 123 | 4 | 2.5 | 3
7 | 2018 | 3 | 234 | 6 | 2.5 | 0
8 | 2018 | 3 | 567 | 7 | 2.5 | 0
9 | 2019 | 1 | 234 | 4 | 5.6 | 6
10 | 2019 | 1 | 567 | 3 | 5.6 | 7
I would also like to add prev_year_mean_sale and prev_year_id_sale.
prev_month_mean_sale is the mean of the total sales of the previous month, e.g. for month 2 it is (5+4+2)/3.
My actual code is something like:
for index, row in df.iterrows():
    loc = df.index[(df['month'] == row['month'] - 1) &
                   (df['year'] == row['year']) &
                   (df['product_id'] == row['product_id'])].tolist()[0]
    df.loc[index, 'prev_month_id_sale'] = df.loc[loc, 'sales']
but it is really slow and my df is really big. Maybe there is another option using groupby() or something like that.
A simple way to avoid the loop is to use the dataframe's merge():
df["prev_month"] = df["month"] - 1
result = df.merge(df.rename(columns={"sales", "prev_month_id"sale"}),
how="left",
left_on=["year", "prev_month", "product_id"],
right_on=["year", "month", "product_id"])
The result will have more columns than you need; you should drop() some of them and/or rename() others.
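To also fill prev_month_mean_sale, the same merge idea works with a per-month mean. Below is a rough sketch (column names follow the question; shifting the right-hand side forward by one month is just the mirror image of the prev_month trick above and, like it, it does not carry over a December-to-January year boundary):

# Previous month's sales of the same product: take each row, label it with the
# following month, and merge it back in
prev_id = (df[["year", "month", "product_id", "sales"]]
             .assign(month=lambda d: d["month"] + 1)
             .rename(columns={"sales": "prev_month_id_sale"}))
df = df.merge(prev_id, on=["year", "month", "product_id"], how="left")

# Mean of all sales in the previous month (one value per year/month)
month_mean = (df.groupby(["year", "month"], as_index=False)["sales"].mean()
                .assign(month=lambda d: d["month"] + 1)
                .rename(columns={"sales": "prev_month_mean_sale"}))
df = df.merge(month_mean, on=["year", "month"], how="left")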
