How do I get only the new unique values per group? - python

import pandas as pd
df = pd.DataFrame({'Month': [2, 2, 3, 3],
'user': ['Michael', 'Michael', 'Lea', 'Michael']})
I have a dataframe like this, it is already a result grouped by Month.
Month user
0 2 Michael
1 2 Michael
2 3 Lea
3 3 Michael
What I want is to count the total unique AND the new unique users compared to the month before.
Total is no problem, can just use:
df.groupby(['Month'])['user'].nunique()
Month
2 1
3 2
But what I want are only the new unique ones, I do not want to count the ones that already were there in Month 2 when I count in Month 3.
In my minimal example, "Lea" is a new user in month 3; "Michael" is not, because he was already a user in month 2. So my expected result is the count of new unique users per month, like this:
Month Unique_Count_New_Users
0 2 1
1 3 1 <- Lea is new compared to February, Michael isn't
How can I achieve this in python? Do I need some sort of element wise comparison between the groups?
Edit, to make this clearer: for each month I need to check against all previous months whether the user was already there.
import pandas as pd
df = pd.DataFrame({'Month':[2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4],
'user':['Michael', 'Michael', 'Markus', 'Moritz', 'Lea',
'Michael', 'Stefan', 'Dora', 'Erika',
'Dora', 'Markus']})
df
Month user
0 2 Michael
1 2 Michael
2 2 Markus
3 2 Moritz
4 2 Lea
5 3 Michael
6 3 Stefan
7 3 Dora
8 3 Erika
9 4 Dora
10 4 Markus
df.groupby(['Month'])['user'].nunique()
# Solution
# Sort the dataframe first
df.sort_values(by='Month', inplace=True)
# Duplicated trick
(~df['user'].duplicated()).groupby(df['Month']).sum()
# Result
Month
2 4
3 3
4 0

IIUC, you can use
(~df['user'].duplicated()).groupby(df['Month']).sum()
Demo:
>>> df
Month user
0 2 Michael
1 2 Michael
2 3 Lea
3 3 Michael
>>> (~df['user'].duplicated()).groupby(df['Month']).sum()
Month
2 1
3 1
I'm assuming that the 'Month' column is sorted, otherwise the duplicated trick won't work.
edit: your exact output can be produced with
(~df['user'].duplicated()).groupby(df['Month']).sum().reset_index().rename({'user': 'Unique_Count_New_Users'}, axis=1)
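The sorting assumption can be made explicit. Below is a minimal sketch, not the answerer's exact code: `count_new_users` is a hypothetical helper name, and the rows are shuffled on purpose to show that sorting by 'Month' first restores the expected counts for the extended example from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Month": [2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4],
    "user": ["Michael", "Michael", "Markus", "Moritz", "Lea",
             "Michael", "Stefan", "Dora", "Erika",
             "Dora", "Markus"],
})


def count_new_users(frame):
    """Count users per month who did not appear in any earlier month."""
    # Sort chronologically so each user's first occurrence is in their earliest month.
    ordered = frame.sort_values("Month", kind="stable")
    # True only for the first time a user appears anywhere in the data.
    first_seen = ~ordered["user"].duplicated()
    # Summing booleans per month counts those first appearances.
    counts = first_seen.groupby(ordered["Month"]).sum()
    return counts.rename("Unique_Count_New_Users")


# Shuffle the rows to simulate unsorted input (hypothetical scenario).
shuffled = df.sample(frac=1, random_state=0)
new_users = count_new_users(shuffled)
print(new_users)
```

Because the helper sorts a copy rather than using inplace=True, the caller's dataframe keeps its original row order.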

Related

Creating Data Frame with repeating values that repeat

I'm trying to create a dataframe in Pandas that has two variables, "date" and "time_of_day". "date" is 120 observations long and covers 30 days, with four observations per day (1,1,1,1; 2,2,2,2; etc.). The second variable, "time_of_day", repeats the values 1,2,3,4 thirty times.
The closest I found to this question was here: How to create a series of numbers using Pandas in Python, which got me the below code, but I'm receiving an error that it must be a 1-dimensional array.
df = pd.DataFrame({'date': np.tile([pd.Series(range(1,31))],4), 'time_of_day': pd.Series(np.tile([1, 2, 3, 4],30 ))})
So the final dataframe would look something like
date time_of_day
1 1
1 2
1 3
1 4
2 1
2 2
2 3
2 4
Thanks much!
You need np.repeat for one column and np.tile for the other:
df = pd.DataFrame({'date': np.repeat(range(1,31),4),
'time_of_day': np.tile([1, 2, 3, 4],30)})
print(df.head(10))
date time_of_day
0 1 1
1 1 2
2 1 3
3 1 4
4 2 1
5 2 2
6 2 3
7 2 4
8 3 1
9 3 2
or you could use pd.MultiIndex.from_product, same result.
df = (
pd.MultiIndex.from_product([range(1,31), range(1,5)],
names=['date','time_of_day'])
.to_frame(index=False)
)
or product from itertools
from itertools import product
df = pd.DataFrame(product(range(1,31), range(1,5)), columns=['date','time_of_day'])
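Newer pandas versions (1.2 and later) also support a cross merge; naming the columns up front avoids the default 0_x and 0_y column names:
out = pd.DataFrame({'date': range(1,31)}).merge(pd.DataFrame({'time_of_day': [1, 2, 3, 4]}), how='cross')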
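To see why one column needs np.repeat and the other np.tile, here is a small sketch on a reduced grid. The sizes (3 days, 4 time slots) are assumptions chosen only to keep the output short; it also checks that the MultiIndex.from_product variant yields the same rows.

```python
import numpy as np
import pandas as pd

days, slots = 3, 4  # hypothetical small sizes for illustration

# np.repeat repeats each element in place: 1,1,1,1,2,2,2,2,...
repeated = np.repeat(np.arange(1, days + 1), slots)
# np.tile repeats the whole sequence: 1,2,3,4,1,2,3,4,...
tiled = np.tile(np.arange(1, slots + 1), days)

small = pd.DataFrame({"date": repeated, "time_of_day": tiled})

# Same grid built as a Cartesian product of the two ranges.
via_product = (
    pd.MultiIndex.from_product(
        [range(1, days + 1), range(1, slots + 1)],
        names=["date", "time_of_day"],
    )
    .to_frame(index=False)
)

# Compare raw values so integer dtype differences do not matter.
same = bool((small.to_numpy() == via_product.to_numpy()).all())
print(small)
print(same)
```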

How to find the total amount of a column x sorted by another column y?

I have a dataframe with a list of songs; each row contains data such as name, artist, year, streams etc. I'm trying to find the 'year' in which songs got the most 'votes', in other words the year with the highest number of total votes.
I'm pretty new to dataframes, and I know how to find things such as the total votes and sort by certain things, but for this, you need to also group them by year and find the sum, and that's what I'm mainly having trouble with.
Does this help?
>>> df = pd.DataFrame({"year": [1, 1, 2, 3, 3], "votes": [2, 4, 1, 5, 2]})
>>> df
year votes
0 1 2
1 1 4
2 2 1
3 3 5
4 3 2
>>> df.groupby("year")["votes"].sum()
year
1 6
2 1
3 7
Name: votes, dtype: int64
>>> df.groupby("year")["votes"].sum().idxmax()
3
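If the winning total or a full ranking is needed as well, the same grouped sum can be reused. This is a small sketch on the toy data above; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"year": [1, 1, 2, 3, 3], "votes": [2, 4, 1, 5, 2]})

# Total votes per year (index = year, values = summed votes).
totals = df.groupby("year")["votes"].sum()

best_year = totals.idxmax()   # index label of the largest total
best_total = totals.max()     # the largest total itself

# Years ranked from most to fewest votes.
ranked = totals.sort_values(ascending=False)
print(best_year, best_total)
print(ranked)
```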

Pandas: How to get unique values counted by two indexes

I need to find month by month way of showing year to date unique values. For example:
month value
1 a
1 b
1 a
2 a
2 a
2 a
3 c
3 b
3 b
4 d
4 e
4 f
Should output:
Month Monthly unique Year to date unique
1 2 2
2 1 2
3 2 3
4 3 6
For the monthly unique count it is just a matter of a group by and unique(), but that won't work for year-to-date. Year-to-date could be computed with a for loop that filters the dataframe month by month from the beginning of the year, but that is slow and non-pythonic, and I want to avoid it.
How to do it in efficient way?
Let us do
s = df.groupby('month').value.agg(['nunique',list])
s['list'] = s['list'].cumsum().map(lambda x : len(set(x)))
s
nunique list
month
1 2 2
2 1 2
3 2 3
4 3 6
BEN_YO's approach is simple and effective for small datasets. However, it can be slow and costly on a big dataframe because of the cumsum on lists (of strings).
Let's try drop_duplicates first and only work on duplicates:
(df.drop_duplicates(['month','value'])
.assign(year=lambda x: ~x.duplicated(['value']))
.groupby('month')
.agg({'value':'nunique', 'year':'sum'})
.assign(year=lambda x: x.year.cumsum())
)
Output:
value year
month
1 2 2
2 1 2
3 2 3
4 3 6
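Another way to look at the year-to-date count, sketched here as a different technique from the two answers above: a value becomes "new" in the month of its first appearance, so the cumulative sum of first appearances per month gives the year-to-date unique count. The explicit loop at the end is only a slow baseline used to check the vectorized result.

```python
import pandas as pd

df = pd.DataFrame({
    "month": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "value": ["a", "b", "a", "a", "a", "a", "c", "b", "b", "d", "e", "f"],
})

# Month in which each distinct value is seen for the first time.
first_month = df.groupby("value")["month"].min()
# Number of values that first appear in each month.
new_per_month = first_month.value_counts().sort_index()
# Months without new values are missing above, so fill them with 0 before cumsum.
months = sorted(df["month"].unique())
ytd = new_per_month.reindex(months, fill_value=0).cumsum()

# Slow reference implementation: grow a set month by month.
seen = set()
expected = []
for _, values in df.groupby("month")["value"]:
    seen.update(values)
    expected.append(len(seen))

print(ytd.tolist(), expected)
```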

Create new column based on multiple groupby conditions

I want a new column in this df with the following condition. The column education is a categorical value that goes from 1 to 5 (1 is the lowest level of education and 5 is the highest). I want to create a function with the following logic (so as to create a new column in the df):
First, for any id, check whether there is at least one graduated education level; if so, the new column must hold the highest graduated level.
Second, if there is no graduated education level for a particular id (all of its education levels are "In course"), take the maximum level of education and subtract one.
df
id education stage
1 2 Graduated
1 3 Graduated
1 4 In course
2 3 In course
3 2 Graduated
3 3 In course
4 2 In course
expected output:
id education stage new_column
1 2 Graduated 3
1 3 Graduated 3
1 4 In course 3
2 3 In course 2
3 2 Graduated 2
3 3 In course 2
4 2 In course 1
You can do it like this:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 2, 3, 3, 4], 'education': [2, 3, 4, 3, 2, 3, 2],
'stage': ['Graduated', 'Graduated', 'In course', 'In course', 'Graduated', 'In course', 'In course']})
max_gr = df[df.stage == 'Graduated'].groupby('id').education.max()
max_ic = df[df.stage == 'In course'].groupby('id').education.max()
# set all cells to the value from max_gr
df['new_col'] = df.id.map(max_gr)
# set cells that have not been filled to the value from max_ic - 1
df.loc[df.new_col.isna(), ['new_col']] = df.id.map(max_ic - 1)
series.map(other_series) returns a new series in which each value of series is looked up in the index of other_series and replaced by the matching value; values without a match become NaN.
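The lookup behaviour of map can be seen on a tiny example. The series below are made-up illustration data, not the question's dataframe: ids without an entry in the lookup come back as NaN, which is what makes the "fill the remaining cells" step possible.

```python
import pandas as pd

ids = pd.Series([1, 1, 2, 4])

# Lookup table: index = key to match, value = replacement.
lookup = pd.Series({1: 3, 2: 2})
mapped = ids.map(lookup)          # id 4 has no entry, so it becomes NaN

# Second lookup used only where the first one found nothing.
fallback = pd.Series({4: 1})
filled = mapped.fillna(ids.map(fallback)).astype(int)

print(mapped)
print(filled)
```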
This is one way.
df['new'] = df.loc[df['stage'] == 'Graduated']\
.groupby('id')['education']\
.transform('max').astype(int)
df['new'] = df['new'].fillna(df.loc[df['stage'] == 'In course']\
.groupby('id')['education']\
.transform('max').sub(1)).astype(int)
Result
id education stage new
0 1 2 Graduated 3
1 1 3 Graduated 3
2 1 4 In course 3
3 2 3 In course 2
4 3 2 Graduated 2
5 3 3 In course 2
6 4 2 In course 1
Explanation
First, for the "Graduated" rows, take the maximum education per id.
Second, fill the remaining rows with the maximum "In course" education per id minus 1.
Note that the second step fills every "In course" row from the in-course maximum, even for an id that also has graduated rows. That matches the expected output for this data, but it is not guaranteed to in general.
Alternative solution based on Markus Löffler's answer.
max_ic = df[df.stage.eq('In course')].groupby('id').education.max() - 1
max_gr = df[df.stage.eq('Graduated')].groupby('id').education.max()
# Update with max_gr
max_ic.update(max_gr)
df['new_col'] = df.id.map(max_ic)

Pandas indexing behavior after grouping: do I see an "extra row"?

This might be a very simple question, but I am trying to understand how grouping and indexing work in pandas.
Let's say I have a DataFrame with the following data:
df = pd.DataFrame(data={
'p_id': [1, 1, 1, 2, 3, 3, 3, 4, 4],
'rating': [5, 3, 2, 2, 5, 1, 3, 4, 5]
})
Now, the index would be assigned automatically, so the DataFrame looks like:
p_id rating
0 1 5
1 1 3
2 1 2
3 2 2
4 3 5
5 3 1
6 3 3
7 4 4
8 4 5
When I try to group it by p_id, I get:
>> df[['p_id', 'rating']].groupby('p_id').count()
rating
p_id
1 3
2 1
3 3
4 2
I noticed that p_id now becomes an index for the grouped DataFrame, but the first row looks weird to me -- why does it have p_id index in it with empty rating?
I know how to fix it, kind of, if I do this:
>> df[['p_id', 'rating']].groupby('p_id', as_index=False).count()
p_id rating
0 1 3
1 2 1
2 3 3
3 4 2
Now I don't have this weird first column, but I have both index and p_id.
So my question is: where does this extra row come from when I don't use as_index=False, and is there a way to group a DataFrame and keep p_id as the index without having to deal with this extra row? If there are any docs I can read on this, that would also be greatly appreciated.
It's just an index name...
Demo:
In [46]: df
Out[46]:
p_id rating
0 1 5
1 1 3
2 1 2
3 2 2
4 3 5
5 3 1
6 3 3
7 4 4
8 4 5
In [47]: df.index.name = 'AAA'
Pay attention to the index name: AAA
In [48]: df
Out[48]:
p_id rating
AAA
0 1 5
1 1 3
2 1 2
3 2 2
4 3 5
5 3 1
6 3 3
7 4 4
8 4 5
You can get rid of it using rename_axis() method:
In [42]: df[['p_id', 'rating']].groupby('p_id').count().rename_axis(None)
Out[42]:
rating
1 3
2 1
3 3
4 2
There is no "extra row"; it's simply how pandas renders a DataFrame whose index has a name. The result of groupby('p_id').count() is an ordinary DataFrame: rating is the column, and p_id has gone from being a column to being the (row) index, so its name is printed on a separate line below the column headers.
Another reason the two lines are staggered (the row with the column names and the row with the index/MultiIndex names) is that the index can be a MultiIndex (if you group by multiple columns).
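A quick way to confirm this is to inspect the grouped result directly rather than its printed form. The sketch below uses the question's data and only checks the shape and the index name.

```python
import pandas as pd

df = pd.DataFrame(data={
    "p_id": [1, 1, 1, 2, 3, 3, 3, 4, 4],
    "rating": [5, 3, 2, 2, 5, 1, 3, 4, 5],
})

counts = df.groupby("p_id").count()

print(type(counts).__name__)  # an ordinary DataFrame
print(counts.shape)           # four groups, one column: no extra row
print(counts.index.name)      # the label printed on its own line

# Turn the named index back into a regular column when that is more convenient.
flat = counts.reset_index()
print(list(flat.columns))
```

Here reset_index() gives the same layout as passing as_index=False to groupby.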
