How do I transpose and aggregate this dataframe in the right order? - python

I am trying to find an efficient way to create a dataframe that lists all distinct game values as columns and aggregates the game-play hours by user_id. This is my example df:
user_id | game | game_hours | rank_order
1 | Fortnight | 1.5 | 1
1 | COD | 0.5 | 2
1 | Horizon | 1.7 | 3
1 | ... | ... | n
2 | Fifa2021 | 1.9 | 1
2 | A Way Out | 0.2 | 2
2 | ... | ... | n
...
Step 1: How do I get from this to the following df format (matching the rank order correctly, since it reflects the time sequence)?
user_id | game_1 | game_2 | game_3 | game_n ...| game_hours
1 | Fortnight | COD | Horizon| | 3.7
2 | Fifa21 | A Way Out | | | 2.1
...

Use DataFrame.pivot with DataFrame.add_prefix, and create the new column with DataFrame.assign and a groupby sum:
df = (df.pivot(index='user_id', columns='rank_order', values='game')
        .add_prefix('game_')
        .assign(game_hours=df.groupby('user_id')['game_hours'].sum())
        .reset_index()
        .rename_axis(None, axis=1))
print(df)
user_id game_1 game_2 game_3 game_hours
0 1 Fortnight COD Horizon 3.7
1 2 Fifa2021 A Way Out NaN 2.1
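For reference, here is a self-contained version of the above; the sample frame is re-typed from the question's example, so the exact values are only illustrative:

import pandas as pd

# Sample data re-typed from the question (values are illustrative)
df = pd.DataFrame({
    'user_id':    [1, 1, 1, 2, 2],
    'game':       ['Fortnight', 'COD', 'Horizon', 'Fifa2021', 'A Way Out'],
    'game_hours': [1.5, 0.5, 1.7, 1.9, 0.2],
    'rank_order': [1, 2, 3, 1, 2],
})

out = (df.pivot(index='user_id', columns='rank_order', values='game')
         .add_prefix('game_')
         .assign(game_hours=df.groupby('user_id')['game_hours'].sum())
         .reset_index()
         .rename_axis(None, axis=1))
print(out)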

Related

Calculate mean, median, first and third quartile from a dataframe with a distribution, with pandas

I have an aggregate-level dataframe as follows:
errors | class | num_students
1 | A | 5
2 | A | 8
3 | A | 2
...
10 | A | 1
1 | B | 9
2 | B | 12
3 | B | 5
10 | B | 2
...
The original data was at student-ID level, so this dataframe holds the distribution of errors for each class. I want to get summary statistics per class that look like the table below:
Class | average error | median error | Q1 error | Q3 error
A | 2.1 | 2 | 1 | 3
B | 3.4 | 3 | 2 | 5
What is the best way to accomplish this?
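One possible approach (a sketch, not an accepted answer): re-expand the aggregated counts back to one row per student with Index.repeat, then compute standard summary statistics per class. The data below is made up to match the shape described:

import pandas as pd

# Made-up data in the shape described above
df = pd.DataFrame({'errors':       [1, 2, 3, 10, 1, 2, 3, 10],
                   'class':        ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'num_students': [5, 8, 2, 1, 9, 12, 5, 2]})

# One row per student again, then aggregate per class
expanded = df.loc[df.index.repeat(df['num_students']), ['class', 'errors']]
stats = (expanded.groupby('class')['errors']
                 .agg(average_error='mean',
                      median_error='median',
                      Q1_error=lambda s: s.quantile(0.25),
                      Q3_error=lambda s: s.quantile(0.75))
                 .reset_index())
print(stats)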

Converting groupby pandas df of absolute numbers to percentage of row totals

I have some data in my df that shows the two categories a user belongs to, and I want to see the number of users for each category pair expressed as a percentage of the row total.
Original dataframe df:
+------+------+--------+
| cat1 | cat2 | user |
+------+------+--------+
| A | X | 687568 |
| A | Y | 68575 |
| B | Y | 56478 |
| A | X | 6587 |
| A | Y | 45678 |
| B | X | 5678 |
| B | X | 967 |
| A | X | 345 |
+------+------+--------+
I convert this to a groupby df using:
df2 = df.groupby(['cat1', 'cat2']).agg({'user': 'nunique'}).reset_index().pivot(index='cat1', columns='cat2', values='user')
to get the pairwise count of users per combination of categories (the numbers here are made up):
+------+----+----+
| cat2 | X | Y |
+------+----+----+
| cat1 | | |
+------+----+----+
| A | 5 | 5 |
| B | 10 | 40 |
+------+----+----+
And I would like to convert the numbers to percent totals of the rows (Cat1), e.g. for the first row, 5/(5+5) = 0.5 and so on to give:
+------+-----+-----+
| cat2 | X | Y |
+------+-----+-----+
| cat1 | | |
| A | 0.5 | 0.5 |
| B | 0.2 | 0.8 |
+------+-----+-----+
Would I have to create a new column in my grouped df that contains the row-wise sum, and then iterate through each value in a row and divide it by that total?
You can simplify your expression:
piv = df.pivot_table('user', 'cat1', 'cat2', aggfunc='nunique')
pct = piv.div(piv.sum(axis=1), axis=0)
Output:
>>> piv
cat2 X Y
cat1
A 3 2
B 2 1
>>> pct
cat2 X Y
cat1
A 0.600000 0.400000
B 0.666667 0.333333
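As a possible one-step alternative (a sketch assuming the same df as in the question), pd.crosstab can aggregate and normalize by the row total in one call via normalize='index':

# Count distinct users per cat1/cat2 pair, then divide each row by its total
pct = pd.crosstab(df['cat1'], df['cat2'],
                  values=df['user'], aggfunc='nunique',
                  normalize='index')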

Grouping many columns in one column in Pandas

I have a DataFrame that is similar to this one:
| | id | Group1 | Group2 | Group3 |
|---|----|--------|--------|--------|
| 0 | 22 | A | B | C |
| 1 | 23 | B | C | D |
| 2 | 24 | C | B | A |
| 3 | 25 | D | A | C |
And I want to get something like this:
| | Group | id_count |
|---|-------|----------|
| 0 | A | 3 |
| 1 | B | 3 |
| 2 | C | 3 |
| 3 | D | 2 |
Basically, for each group I want to know how many people (id) have chosen it.
I know there is groupby(), but it only gives an appropriate result for one column (if I pass it a list, it does not combine Group1, Group2 and Group3 into one column).
Use DataFrame.melt with GroupBy.size:
df1 = (df.melt('id', value_name='Group')
         .groupby('Group')
         .size()
         .reset_index(name='id_count'))
print(df1)
Group id_count
0 A 3
1 B 3
2 C 4
3 D 2
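If the same id can pick a group in more than one column and should only be counted once per group (an assumption about the intended semantics, not stated in the question), counting distinct ids instead of occurrences is a small variant:

# Count distinct ids per group rather than raw occurrences
df1 = (df.melt('id', value_name='Group')
         .groupby('Group')['id']
         .nunique()
         .reset_index(name='id_count'))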

How to aggregate data by counts of a level, setting each level's counts as its own column?

I have data which has a row granularity in terms of events, and I want to aggregate them by a customer ID. The data is in the form of a pandas df and looks like so:
| Event ID | Cust ID | P1  | P2 | P3 | P4 |
|----------|---------|-----|----|----|----|
| 1        | 1       | 12  | 0  | 0  | 0  |
| 2        | 1       | 12  | 0  | 0  | 0  |
| 3        | 1       | 10  | 12 | 0  | 0  |
| 4        | 2       | 206 | 0  | 0  | 0  |
| 5        | 2       | 206 | 25 | 0  | 0  |
P1 to P4 contain numbers that are just levels; they are event categories I need to get counts of (there are 175+ codes), where each event category gets its own column.
The output I want, would ideally look like:
| Cust ID | Count(12) | Count(10) | Count(25) | Count(206) |
|---------|-----------|-----------|-----------|------------|
| 1       | 3         | 1         | 0         | 0          |
| 2       | 0         | 0         | 1         | 2          |
The challenge I am facing is taking the counts across multiple columns. There are 2 '12's in P1 and 1 '12' in P2.
I tried using groupby and merge. But I've either used them incorrectly or they're the wrong functions to use because I get a lot of 'NaN's in the resulting table.
You can use the following method:
df = pd.DataFrame({'Event ID': [1, 2, 3, 4, 5],
                   'Cust ID': [1]*3 + [2]*2,
                   'P1': [12, 12, 10, 206, 206],
                   'P2': [0, 0, 12, 0, 25],
                   'P3': [0]*5,
                   'P4': [0]*5})
df.melt(['Event ID', 'Cust ID'])\
  .groupby('Cust ID')['value'].value_counts()\
  .unstack().add_prefix('Count_')\
  .reset_index()
Output:
value  Cust ID  Count_0  Count_10  Count_12  Count_25  Count_206
0            1      8.0       1.0       3.0       NaN        NaN
1            2      5.0       NaN       NaN       1.0        2.0
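The NaNs simply mean a code never occurred for that customer. Assuming 0 is only a filler and not a real event code, the result can be tidied like this (a sketch building on the same melt/value_counts idea):

out = (df.melt(['Event ID', 'Cust ID'])
         .groupby('Cust ID')['value'].value_counts()
         .unstack(fill_value=0)             # missing codes become 0 instead of NaN
         .drop(columns=0, errors='ignore')  # assumes 0 is just a placeholder
         .add_prefix('Count_')
         .reset_index()
         .rename_axis(None, axis=1))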

Pandas, create new column based on values from previous rows with certain values

Hi, I'm trying to use ML to predict some future sales, so I would like to add the mean sales from the previous month/year for each product.
My df is something like [ id | year | month | product_id | sales ], and I would like to add prev_month_mean_sale and prev_month_id_sale columns:
id | year | month | product_id | sales | prev_month_mean_sale | prev_month_id_sale
----------------------------------------------------------------------
1 | 2018 | 1 | 123 | 5 | NaN | NaN
2 | 2018 | 1 | 234 | 4 | NaN | NaN
3 | 2018 | 1 | 345 | 2 | NaN | NaN
4 | 2018 | 2 | 123 | 3 | 3.6 | 5
5 | 2018 | 2 | 345 | 2 | 3.6 | 2
6 | 2018 | 3 | 123 | 4 | 2.5 | 3
7 | 2018 | 3 | 234 | 6 | 2.5 | 0
8 | 2018 | 3 | 567 | 7 | 2.5 | 0
9 | 2019 | 1 | 234 | 4 | 5.6 | 6
10 | 2019 | 1 | 567 | 3 | 5.6 | 7
I would also like to add prev_year_mean_sale and prev_year_id_sale.
prev_month_mean_sale is the mean of the total sales of the previous month, e.g. for month 2 it is (5+4+2)/3.
My actual code is something like:
for index, row in df.iterrows():
    loc = df.index[(df['month'] == row['month'] - 1) &
                   (df['year'] == row['year']) &
                   (df['product_id'] == row['product_id'])].tolist()[0]
    df.loc[index, 'prev_month_id_sale'] = df.loc[loc, 'sales']
but it is really slow and my df is really big. Maybe there is another option using groupby() or something like that.
A simple way to avoid the loop is to use the dataframe's merge():
df["prev_month"] = df["month"] - 1
result = df.merge(df.rename(columns={"sales", "prev_month_id"sale"}),
how="left",
left_on=["year", "prev_month", "product_id"],
right_on=["year", "month", "product_id"])
The result will have more columns than you need; you should drop() some of them and/or rename() others.
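To also fill prev_month_mean_sale, the same merge idea works with a per-month mean. Below is a rough sketch (column names follow the question; shifting the right-hand side forward by one month is just the mirror image of the prev_month trick above and, like it, it does not carry over a December-to-January year boundary):

# Previous month's sales of the same product: take each row, label it with the
# following month, and merge it back in
prev_id = (df[["year", "month", "product_id", "sales"]]
             .assign(month=lambda d: d["month"] + 1)
             .rename(columns={"sales": "prev_month_id_sale"}))
df = df.merge(prev_id, on=["year", "month", "product_id"], how="left")

# Mean of all sales in the previous month (one value per year/month)
month_mean = (df.groupby(["year", "month"], as_index=False)["sales"].mean()
                .assign(month=lambda d: d["month"] + 1)
                .rename(columns={"sales": "prev_month_mean_sale"}))
df = df.merge(month_mean, on=["year", "month"], how="left")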
