I want to group by two columns and get a cumulative group count. I couldn't find relevant code for this, but based on a few hints I wrote the code below; it ends with an error. Can this be solved?
ID ABC XYZ
1 A .512
2 A .123
3 B .999
4 B .999
5 B .999
6 C .456
7 C .456
8 C .888
9 d .888
10 d .888
The output should be as below (whenever either ABC or XYZ has a new value, the counter should be incremented).
ID ABC XYZ GID
1 A .123 1
2 A .512 2
3 B .999 3
4 B .999 3
5 B .999 3
6 C .456 4
7 C .456 4
8 C .888 5
9 d .888 6
10 d .888 6
The code is as below
DF=DF.sort(['ABC','XYZ'] ,ascending = [1,0])
DF['GID'] = DF.groupby('ABC','XYZ').cumcount()
But it fails with an error:
ValueError: No axis named XYZ for object type
I got the desired results like this.
c1 = DF.ABC != DF.ABC.shift()
c2 = DF.XYZ != DF.XYZ.shift()
DF['GID'] = (c1 | c2).cumsum()
DF
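For reference, the original error comes from `DF.groupby('ABC','XYZ')`: the second positional argument is interpreted as `axis`, not as a second key, which is why pandas complains about "No axis named XYZ". Passing the keys as a list fixes it, and `ngroup` (rather than `cumcount`, which counts rows *within* each group) gives the group ids directly. A minimal sketch on the sample data, before any sorting:

```python
import pandas as pd

DF = pd.DataFrame({
    'ABC': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'd', 'd'],
    'XYZ': [.512, .123, .999, .999, .999, .456, .456, .888, .888, .888],
})

# Keys must be passed as a list; sort=False numbers the (ABC, XYZ)
# groups in order of appearance, and ngroup() returns that number.
DF['GID'] = DF.groupby(['ABC', 'XYZ'], sort=False).ngroup() + 1
print(DF['GID'].tolist())  # [1, 2, 3, 3, 3, 4, 4, 5, 6, 6]
```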
Related
Let's say I have the following df -
data={'Location':[1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4]}
df = pd.DataFrame(data=data)
df
Location
0 1
1 1
2 1
3 2
4 2
5 2
6 2
7 2
8 2
9 3
10 3
11 3
12 3
13 3
14 3
15 4
16 4
17 4
In addition, I have the following dict:
Unlock={
1:"A",
2:"B",
3:"C",
4:"D",
5:"E",
6:"F",
7:"G",
8:"H",
9:"I",
10:"J"
}
I'd like to create another column that randomly selects a string from the 'Unlock' dict, on the condition that the chosen key is <= Location. So, for example, rows with Location 2 will get either 'A' or 'B'.
I've tried to do the following but with no luck (I'm getting an error) -
df['Name']=np.select(df['Location']<=Unlock,np.random.choice(Unlock,size=len(df))
Thanks in advance for your help!
You can convert your dictionary values to a list and randomly select from a prefix of that list: only the first Location elements.
With Python versions >= 3.7, dict maintains insertion order. For lower versions - see below.
lst = list(Unlock.values())
df['Name'] = df['Location'].transform(lambda loc: np.random.choice(lst[:loc]))
Example output:
Location Name
0 1 A
1 1 A
2 1 A
3 2 B
4 2 B
5 2 B
6 2 B
7 2 A
8 2 B
9 3 B
10 3 B
11 3 C
12 3 C
13 3 C
14 3 B
15 4 A
16 4 C
17 4 D
If you are using a lower version of Python, you can build the list of dictionary values sorted by key:
lst = [value for key, value in sorted(Unlock.items())]
For a vectorized method, multiply Location by a random value in (0,1] and take the ceiling, then map the result through your dictionary.
This gives an equiprobable integer between 1 and the current Location value (inclusive):
import numpy as np
df['random'] = (np.ceil(df['Location'].mul(1-np.random.random(size=len(df))))
.astype(int).map(Unlock)
)
output (reproducible with np.random.seed(0)):
Location random
0 1 A
1 1 A
2 1 A
3 2 B
4 2 A
5 2 B
6 2 A
7 2 B
8 2 B
9 3 B
10 3 C
11 3 B
12 3 B
13 3 C
14 3 A
15 4 A
16 4 A
17 4 D
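As a sanity check on this method, every mapped key should land between 1 and that row's Location; a sketch re-running the pieces above (assuming the same Unlock dict and df):

```python
import numpy as np
import pandas as pd

Unlock = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E",
          6: "F", 7: "G", 8: "H", 9: "I", 10: "J"}
df = pd.DataFrame({'Location': [1, 1, 1, 2, 2, 2, 2, 2, 2,
                                3, 3, 3, 3, 3, 3, 4, 4, 4]})

# (0,1] * Location, then ceil: an integer uniform on 1..Location
keys = np.ceil(df['Location'].mul(1 - np.random.random(size=len(df)))).astype(int)
df['random'] = keys.map(Unlock)

# every drawn key is in the allowed range for its row
assert ((keys >= 1) & (keys <= df['Location'])).all()
```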
I want to assign a number to each group. I tried to do
df['group_n'] = df.groupby('ID').ngroup()
but it gives me this warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
If I do df['group_n'] = df.groupby('ID').ngroup().add(1),
I get group_n in reverse of the row order (meaning C:3, B:2, A:1). Is there a way to preserve the row order (C:1, B:2, A:3) and have group_n start from 1?
My current table:
ID date sender
C Jan20 3
C Feb20 7
C Mar20 12
C Apr20 15
B Mar20 1
B May20 10
B Jun20 15
...
A Jan21 10
A Feb21 12
A Mar21 20
A Apr21 5
desired table:
ID date sender group_n
C Jan20 3 1
C Feb20 7 1
C Mar20 12 1
C Apr20 15 1
B Mar20 1 2
B May20 10 2
B Jun20 15 2
A Jan21 10 3
A Feb21 12 3
A Mar21 20 3
A Apr21 5 3
Thank you in advance!
Use:
df['group_n'] = pd.factorize(df['ID'])[0] + 1
Or:
df['group_n'] = df.groupby('ID', sort=False).ngroup().add(1)
print(df)
ID date sender group_n
C Jan20 3 1
C Feb20 7 1
C Mar20 12 1
C Apr20 15 1
B Mar20 1 2
B May20 10 2
B Jun20 15 2
A Jan21 10 3
A Feb21 12 3
A Mar21 20 3
A Apr21 5 3
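The key difference is the sort flag: by default groupby sorts the keys alphabetically before numbering, while factorize and groupby(..., sort=False) number them in order of first appearance. A minimal sketch:

```python
import pandas as pd

ids = pd.Series(['C', 'C', 'B', 'B', 'A', 'A'])

# factorize numbers keys by first appearance: C -> 1, B -> 2, A -> 3
print((pd.factorize(ids)[0] + 1).tolist())       # [1, 1, 2, 2, 3, 3]

# default sort=True numbers alphabetically: A -> 1, B -> 2, C -> 3
print((ids.groupby(ids).ngroup() + 1).tolist())  # [3, 3, 2, 2, 1, 1]
```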
I have a pandas DataFrame as below:
index ColumnName ColumnValue
0 A 1
1 B 2
2 C 3
3 A 4
4 B 5
5 C 6
6 A 7
7 B 8
8 C 9
I want output like below as a pandas DataFrame:
A B C
1 2 3
4 5 6
7 8 9
Can anyone suggest how I can achieve the desired output?
Regards
Vipul
The first solution that came to my mind is to use a for loop over the unique column names, as below. If you want to achieve it with the pivot method, someone else might be able to help.
columns = df['ColumnName'].unique()
data = {}
for column in columns:
    data[column] = list(df[df['ColumnName'] == column]['ColumnValue'])
pd.DataFrame(data)
which will give you the below output
A B C
0 1 2 3
1 4 5 6
2 7 8 9
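For the pivot route the answer mentions, ColumnName alone has duplicates, so you first need a within-group counter to serve as the row index; a sketch, using a hypothetical helper column named `row`:

```python
import pandas as pd

df = pd.DataFrame({
    'ColumnName': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'ColumnValue': [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# cumcount() numbers each name's occurrences 0, 1, 2, ... so that
# the n-th A, B and C all land on row n of the pivoted frame
out = (df.assign(row=df.groupby('ColumnName').cumcount())
         .pivot(index='row', columns='ColumnName', values='ColumnValue'))
# columns A, B, C; rows [1, 2, 3], [4, 5, 6], [7, 8, 9]
```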
I use the following code to try to change the values in columns 4, 5 and 6 of a DataFrame to percentage format, but it raises an error:
df.iloc[:,4:7].apply('{:.2%}'.format)
You can use DataFrame.applymap, which applies the formatter elementwise (apply passes whole columns to it, hence the error):
df = pd.DataFrame({
'a':list('abcdef'),
'b':list('aaabbb'),
'c':[4,5,4,5,5,4],
'd':[7,8,9,4,2,3],
'e':[5,3,6,9,2,4],
'f':[7,8,9,4,2,3],
'g':[1,3,5,7,1,0],
'h':[7,8,9,4,2,3],
'i':[1,3,5,7,1,0]
})
df.iloc[:,4:7] = df.iloc[:,4:7].applymap('{:.2%}'.format)
print (df)
a b c d e f g h i
0 a a 4 7 500.00% 700.00% 100.00% 7 1
1 b a 5 8 300.00% 800.00% 300.00% 8 3
2 c a 4 9 600.00% 900.00% 500.00% 9 5
3 d b 5 4 900.00% 400.00% 700.00% 4 7
4 e b 5 2 200.00% 200.00% 100.00% 2 1
5 f b 4 3 400.00% 300.00% 0.00% 3 0
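Note that applymap was deprecated in pandas 2.1 in favor of the elementwise DataFrame.map (same behavior, new name). A version-agnostic sketch on a small frame:

```python
import pandas as pd

df = pd.DataFrame({'e': [5, 3], 'f': [7, 8], 'g': [1, 3]})

# DataFrame.map exists from pandas 2.1 on; fall back to applymap before that
fmt = df.map if hasattr(df, 'map') else df.applymap
df[['e', 'f', 'g']] = fmt('{:.2%}'.format)
print(df.loc[0, 'e'])  # 500.00%
```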
I have a pandas DataFrame say this:
user value
0 a 1
1 a 2
2 a 3
3 a 4
4 a 5
5 b 6
6 b 7
7 b 8
8 b 9
9 b 10
10 c 11
11 c 12
12 c 13
13 c 14
14 c 15
Now I want to group by user and create two mutually exclusive random samples out of it, e.g.
Set1 with 1 sample per group:
user value
3 a 4
9 b 10
13 c 14
Set2 with 2 samples per group:
user value
0 a 1
1 a 2
5 b 6
6 b 7
10 c 11
11 c 12
So far I've tried this:
u = np.array(['a','b','c'])
u = np.repeat(u,5)
df = pd.DataFrame({'user':u,'value':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]})
set1 = df.groupby(['user']).tail(1)
set2 = df.groupby(['user']).head(2)
But these are not random samples, and I would like them to be mutually exclusive. Any ideas?
PS. Each group always has at least 3 elements.
You can randomly select 3 records for each user:
a = df.groupby("user")["value"].apply(lambda x: x.sample(3))
a
Out[27]:
user
a 3 4
0 1
2 3
b 5 6
7 8
6 7
c 14 15
10 11
13 14
dtype: int64
Then assign the first one per group to the first set, and the remaining two to the second set:
a.groupby(level=0).head(1)
Out[28]:
user
a 3 4
b 5 6
c 14 15
dtype: int64
a.groupby(level=0).tail(2)
Out[29]:
user
a 0 1
2 3
b 7 8
6 7
c 10 11
13 14
dtype: int64
This may be a bit naive, but all I did was reindex the DataFrame with a random permutation of its length and reset the index. After that I take the head and tail as you did with your original code, and it seems to work. This could probably be made into a function:
a = np.arange(len(df))
np.random.shuffle(a)
df = df.reindex(a).reset_index()
set1 = df.groupby(['user']).tail(1)
>>>
index user value
12 9 b 10
13 10 c 11
14 1 a 2
set2 = df.groupby(['user']).head(2)
>>>
index user value
0 6 b 7
1 2 a 3
2 5 b 6
3 13 c 14
4 3 a 4
6 12 c 13
Hope this helps.
There is likely a better solution, but what about just randomizing your data before grouping and then taking the tail and head per group? You could take your indices, randomly permute them, use the permutation to build a shuffled DataFrame, and then apply your current procedure.
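On pandas 1.1 or later there is also DataFrameGroupBy.sample, which makes the per-group randomization a one-liner: draw three random rows per user, then split them into disjoint sets. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'user': np.repeat(['a', 'b', 'c'], 5),
                   'value': range(1, 16)})

# three random rows per user; the original index is preserved
picked = df.groupby('user').sample(n=3, random_state=0)

# first sampled row per user -> set1, the remaining two -> set2
set1 = picked.groupby('user').head(1)
set2 = picked.drop(set1.index)
```

Since both sets come from the same draw, they are mutually exclusive by construction.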