How to cut and group by letter in pandas dataframe - python

A B
a0 1
a0-2 2
a1 3
a2 4
a2-2 5
a3 6
a4 7
I would like to group into the bins below and sum column B (df.B.sum()) per bin:
[a0~a0-2) 3
[a1~a1-2) 3
[a2~a2-2) 9
[a3~a3-2) 6
[a4~a4-2) 7
How could this be done?

You can use groupby with a Series created by cut from the second character of column A:
print (df.A.str[1:2].astype(int))
0 0
1 0
2 1
3 2
4 2
5 3
6 4
Name: A, dtype: int32
bins = [-1,0,1,2,5]
labels=['[a0~a0-2)','[a1~a1-2)','[a2~a2-2)','[a3~a4-2)']
s = pd.cut(df.A.str[1:2].astype(int), bins=bins, labels=labels)
print (s)
0 [a0~a0-2)
1 [a0~a0-2)
2 [a1~a1-2)
3 [a2~a2-2)
4 [a2~a2-2)
5 [a3~a4-2)
6 [a3~a4-2)
Name: A, dtype: category
Categories (4, object): [[a0~a0-2) < [a1~a1-2) < [a2~a2-2) < [a3~a4-2)]
df = df.groupby(s).B.sum().reset_index()
print (df)
A B
0 [a0~a0-2) 3
1 [a1~a1-2) 3
2 [a2~a2-2) 9
3 [a3~a4-2) 13
A similar solution to the other answer, only using the map function:
d = {'a0': '[a0~a0-2)',
     'a1': '[a1~a1-2)',
     'a2': '[a2~a2-2)',
     'a3': '[a3~a4-2)',
     'a4': '[a3~a4-2)'}
df = df.groupby(df.A.str[:2].map(d)).B.sum().reset_index()
print (df)
A B
0 [a0~a0-2) 3
1 [a1~a1-2) 3
2 [a2~a2-2) 9
3 [a3~a4-2) 13

You can create a new column with a shortened version of your column and then group on this column.
# take only the first two characters into the new column
df['group_col'] = df.A.str[:2]
df.groupby('group_col').B.sum()
Of course you can be creative when creating the group column.
lo = {'a0': 0, 'a1': 1, 'a2': 2, 'a3': 3, 'a4': 3}
df['group_col'] = df.A.str[:2].apply(lambda val: lo[val])
df.groupby('group_col').B.sum()
group_col
0 3
1 3
2 9
3 13
Name: B, dtype: int64

If you want to group elements that start with the same letter and number, you can pass a function to groupby like this:
def group_func(i):
    global df
    return df.iloc[i]['A'].split("-")[0]

df.groupby(group_func).sum()
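A vectorized alternative along the same lines (just a sketch, not part of the original answer) is to group directly by the prefix before the dash:
# group by the part of A before the dash, e.g. 'a2-2' -> 'a2'
df.groupby(df['A'].str.split('-').str[0])['B'].sum()
For the sample data this gives one sum per prefix: a0 3, a1 3, a2 9, a3 6, a4 7.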
Otherwise, if you want to group every two elements:
def group_func(i):
    return i // 2

df.groupby(group_func).sum()

Related

create a dataframe mask and sum columns

Suppose I have the following dataframe
# dictionary with list objects as values
details = {
    'A1' : [1,3,4,5],
    'A2' : [2,3,5,6],
    'A3' : [4,3,2,6],
}
# creating a Dataframe object
df = pd.DataFrame(details)
I want to query each column with the following conditions to obtain a boolean mask and then perform the sum on axis=1:
A1 >= 3
A2 >=3
A3 >=4
I would like to end up with the following dataframe:
details = {
    'A1' : [1,3,4,5],
    'A2' : [2,3,5,6],
    'A3' : [4,3,2,6],
    'score' : [1,2,2,3]
}
# creating a Dataframe object
df = pd.DataFrame(details)
How would you do it?
Since your operators are the same, you can try numpy broadcasting:
import numpy as np
df['score'] = (df.T >= np.array([3,3,4])[:, None]).sum()
print(df)
A1 A2 A3 score
0 1 2 4 1
1 3 3 3 2
2 4 5 2 2
3 5 6 6 3
You could also do:
df.assign(score = (df >=[3,3,4]).sum(1))
A1 A2 A3 score
0 1 2 4 1
1 3 3 3 2
2 4 5 2 2
3 5 6 6 3
If you want to specifically align your comparators to each column, you can pass them as a Series built from a dictionary, which aligns against the DataFrame's columns.
>>> comparisons = pd.Series({'A1': 3, 'A2': 3, 'A3': 4})
>>> df['score'] = df.ge(comparisons).sum(axis=1)
>>> df
A1 A2 A3 score
0 1 2 4 1
1 3 3 3 2
2 4 5 2 2
3 5 6 6 3
For a little more manual control, you can always subset your df according to your comparators before performing the comparisons.
comparisons = pd.Series({'A1': 3, 'A2': 3, 'A3': 4})
df['score'] = df[comparisons.index].ge(comparisons).sum(axis=1)

Sort grouped dataframe

I have the dataframe below.
Type Major GPA
1 A 0
2 B 1
3 C 0
4 A 0
5 B 0
6 C 1
I would like to groupby('Major', sort=False), but sort the outer groups by referencing column 'GPA'.
The desired dataframe would be like this:
Type Major GPA
2 B 1
5 B 0
6 C 1
3 C 0
1 A 0
4 A 0
How can this be done? Thanks!
Let us use transform to create an additional sort key:
out = (df.assign(key=df.groupby('Major')['GPA'].transform('sum'))
         .sort_values(['key', 'Major', 'GPA'], ascending=[False, True, False])
         .drop(columns='key'))
Out[37]:
Type Major GPA
1 2 B 1
4 5 B 0
5 6 C 1
2 3 C 0
0 1 A 0
3 4 A 0
This might work:
order = {'B': 0, 'C': 1, 'A': 2}
# the key callable receives each sort column as a Series, so apply the custom order only to 'Major'
df.sort_values(['Major', 'GPA'], ascending=[True, False],
               key=lambda col: col.map(order) if col.name == 'Major' else col)

Pandas - Keeping groups having at least two different codes

I'm working with a DataFrame having the following structure:
import pandas as pd
df = pd.DataFrame({'group' : [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4],
                   'brand' : ['A', 'B', 'X', 'A', 'B', 'C', 'X', 'B', 'C', 'X', 'A', 'B'],
                   'code' : [2185, 2185, 0, 1410, 1390, 1390, 0, 3670, 4870, 0, 2000, 0]})
print(df)
group brand code
0 1 A 2185
1 1 B 2185
2 1 X 0
3 2 A 1410
4 2 B 1390
5 2 C 1390
6 2 X 0
7 3 B 3670
8 3 C 4870
9 3 X 0
10 4 A 2000
11 4 B 0
My goal is to view only the groups having at least two different codes. Missing codes labelled with 0s should not be taken into consideration in the filtering criterion. For example, even though the two records from group 4 have different codes, we don't keep this group in the final DataFrame since one of the codes is missing.
The resulting DataFrame on the above example should look like this:
group brand code
1 2 A 1410
2 2 B 1390
3 2 C 1390
4 2 X 0
5 3 B 3670
6 3 C 4870
7 3 X 0
I didn't manage to do much with this problem. I think that the first step should be to create a mask to remove the records with a missing (0) code. Something like:
mask = df['code'].eq(0)
df = df[~mask]
print(df)
group brand code
0 1 A 2185
1 1 B 2185
3 2 A 1410
4 2 B 1390
5 2 C 1390
7 3 B 3670
8 3 C 4870
10 4 A 2000
And now only keep the groups having at least two different codes, but I don't know how to work this out in Python. Also, this method removes the records with a missing code from my final DataFrame, which I don't want. I want to have a view of the full group.
Any additional help would be appreciated.
This is a job for transform():
mask = (df.groupby('group')['code']
          .transform(lambda x: x.mask(x==0)  # mask out the 0 values
                                .nunique())  # count the nunique
          .gt(1)
       )
df[mask]
Output:
group brand code
3 2 A 1410
4 2 B 1390
5 2 C 1390
6 2 X 0
7 3 B 3670
8 3 C 4870
9 3 X 0
Option 2: Similar idea, but without the lambda function:
mask = (df['code'].mask(df['code']==0)  # mask out the 0 values
                  .groupby(df['group']) # groupby
                  .transform('nunique') # count uniques
                  .gt(1)                # at least 2
       )
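Applying the mask is then the same as in the first option:
df[mask]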
We can also use groupby.filter:
df.groupby('group').filter(lambda x: x.code.mask(x.code.eq(0)).nunique()>1)
Or, likely faster than the previous:
(df.assign(code=df['code'].replace(0, np.nan))
   .groupby('group')
   .filter(lambda x: x.code.nunique() > 1)
   .fillna({'code': 0}))
Output
group brand code
3 2 A 1410
4 2 B 1390
5 2 C 1390
6 2 X 0
7 3 B 3670
8 3 C 4870
9 3 X 0

When to use .count() and .value_counts() in Pandas?

I am learning pandas. I'm not sure when to use the .count() function and when to use .value_counts().
count() is used to count the number of non-NA/null observations across the given axis. It works with non-floating type data as well.
Now as an example create a dataframe df
df = pd.DataFrame({"A":[10, 8, 12, None, 5, 3],
"B":[-1, None, 6, 4, None, 3],
"C":["Shreyas", "Aman", "Apoorv", np.nan, "Kunal", "Ayush"]})
Find the count of non-NA values in each column (counting down the rows, axis=0):
df.count(axis = 0)
Output:
A 5
B 4
C 5
dtype: int64
Find the number of non-NA/null values in each row (counting across the columns, axis=1):
df.count(axis = 1)
Output:
0 3
1 2
2 3
3 1
4 2
5 3
dtype: int64
value_counts() returns a Series containing counts of unique values. The resulting object is in descending order so that the first element is the most frequently occurring element. It excludes NA values by default.
So for the example shown below
s = pd.Series([3, 1, 2, 3, 4, np.nan])
s.value_counts()
The output would be:
3.0 2
4.0 1
2.0 1
1.0 1
dtype: int64
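If you want NaN counted as well, value_counts() accepts a dropna flag (here applied to the same s as above):
s.value_counts(dropna=False)
This adds a row for NaN with a count of 1.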
value_counts() aggregates the data and counts each unique value. You can achieve the same with groupby, which is a broader function for aggregating data in pandas.
count() simply returns the number of non-NaN/null values in the column (Series) you apply it to.
df = pd.DataFrame({'Id': ['A', 'B', 'B', 'C', 'D', 'E', 'F', 'F'],
                   'Value': [10, 20, 15, 5, 35, 20, 10, 25]})
print(df)
Id Value
0 A 10
1 B 20
2 B 15
3 C 5
4 D 35
5 E 20
6 F 10
7 F 25
# Value counts
df['Id'].value_counts()
F 2
B 2
C 1
A 1
D 1
E 1
Name: Id, dtype: int64
# Same operation but with groupby
df.groupby('Id')['Id'].count()
Id
A 1
B 2
C 1
D 1
E 1
F 2
Name: Id, dtype: int64
# Count()
df['Id'].count()
8
Example with NaN values and count:
print(df)
Id Value
0 A 10
1 B 20
2 B 15
3 NaN 5
4 D 35
5 E 20
6 F 10
7 F 25
df['Id'].count()
7
count() returns the total number of non-null values in the series.
value_counts() returns a series of the number of times each unique non-null value appears, sorted from most to least frequent.
As usual, an example is the best way to convey this:
ser = pd.Series(list('aaaabbbccdef'))
ser
>
0 a
1 a
2 a
3 a
4 b
5 b
6 b
7 c
8 c
9 d
10 e
11 f
dtype: object
ser.count()
>
12
ser.value_counts()
>
a 4
b 3
c 2
f 1
d 1
e 1
dtype: int64
Note that a DataFrame has a count() method, which returns a Series with the count() (a scalar) for each column. Historically a DataFrame had no value_counts() method; since pandas 1.1, DataFrame.value_counts() exists and counts unique rows.
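As a small illustrative sketch on a DataFrame like the df above (the value_counts call assumes pandas >= 1.1):
# per-column counts of non-null values
df.count()
# counts of unique rows (pandas >= 1.1)
df.value_counts()
# an older idiom with a similar effect
df.groupby(['Id', 'Value']).size()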

How can I extract a column from dataframe and attach it to rows while keeping other columns intact

How can I extract a column from a pandas dataframe and attach it as extra rows while keeping the other columns the same?
This is my example dataset.
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID': np.arange(0, 5),
                   'sample_1' : [5,6,7,8,9],
                   'sample_2' : [10,11,12,13,14],
                   'group_id' : ["A","B","C","D","E"]})
The output I'm looking for is:
df2 = pd.DataFrame({'ID': [0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
                    'sample_1' : [5,6,7,8,9,10,11,12,13,14],
                    'group_id' : ["A","B","C","D","E","A","B","C","D","E"]})
I have tried slicing the dataframe and concatenating with pd.concat, but it was giving NaN values.
My original dataset is large.
You could do this using stack: Set the index to the columns you don't want to modify, call stack, sort by the "sample" column, then reset your index:
df.set_index(['ID','group_id']).stack().sort_values(0).reset_index([0,1]).reset_index(drop=True)
ID group_id 0
0 0 A 5
1 1 B 6
2 2 C 7
3 3 D 8
4 4 E 9
5 0 A 10
6 1 B 11
7 2 C 12
8 3 D 13
9 4 E 14
Using pd.wide_to_long:
res = pd.wide_to_long(df, stubnames='sample_', i='ID', j='group_id')
res.index = res.index.droplevel(1)
res = res.rename(columns={'sample_': 'sample_1'}).reset_index()
print(res)
ID group_id sample_1
0 0 A 5
1 1 B 6
2 2 C 7
3 3 D 8
4 4 E 9
5 0 A 10
6 1 B 11
7 2 C 12
8 3 D 13
9 4 E 14
The function you are looking for is called melt.
For example:
df2 = pd.melt(df, id_vars=['ID', 'group_id'], value_vars=['sample_1', 'sample_2'])
df2 = df2.drop('variable', axis=1).rename(columns={'value': 'sample_1'})
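The result then has ID, group_id and sample_1 in long form, matching the target df2 from the question (sample_1 running from 5 to 14).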
