I have a DataFrame and want to group by two columns, which is working fine.
df.groupby(["Sektor", "CustomerID"]).count().head(10)
            _Order_ID_  Order_timezone  Order_weight
AE 1298772           1               1             1
   1298788           1               1             1
   1298840           2               2             2
   1298912           1               1             1
AT 1038570           1               1             1
   1040424           1               1             1
   1040425           3               3             3
   1040426           2               2             2
   1040427           1               1             1
   1040428           1               1             1
   1040429           2               2             2
Now the grouped DataFrame is sorted by the CustomerID values, but I want to sort it by count(). Within each Sektor, the CustomerIDs that occur most often should be at the top, i.e. in descending order.
Expected output:
            _Order_ID_  Order_timezone  Order_weight
AE 1298840           2               2             2
   1298772           1               1             1
   1298788           1               1             1
   1298912           1               1             1
AT 1040425           3               3             3
   1040426           2               2             2
   1040429           2               2             2
   1038570           1               1             1
   1040424           1               1             1
   1040427           1               1             1
   1040428           1               1             1
How do I do that?
Use:
df1 = df.groupby(["Sektor", "CustomerID"]).count()
If you need 10 rows in the output:
df1 = df1.sort_values(['Sektor','_Order_ID_'], ascending=[True, False]).head(10)
print (df1)
                   _Order_ID_  Order_timezone  Order_weight
Sektor CustomerID
AE     1298840              2               2             2
       1298772              1               1             1
       1298788              1               1             1
       1298912              1               1             1
AT     1040425              3               3             3
       1040426              2               2             2
       1040429              2               2             2
       1038570              1               1             1
       1040424              1               1             1
       1040427              1               1             1
If you instead need up to 10 rows per Sektor group (starting again from the grouped df1):
df1 = df1.sort_values(['Sektor','_Order_ID_'], ascending=[True, False]).groupby('Sektor').head(10)
print (df1)
                   _Order_ID_  Order_timezone  Order_weight
Sektor CustomerID
AE     1298840              2               2             2
       1298772              1               1             1
       1298788              1               1             1
       1298912              1               1             1
AT     1040425              3               3             3
       1040426              2               2             2
       1040429              2               2             2
       1038570              1               1             1
       1040424              1               1             1
       1040427              1               1             1
       1040428              1               1             1
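As a minimal, self-contained sketch of the sort above: the sample rows below are made up for illustration and are not the asker's data. It assumes pandas 0.23 or later, where sort_values accepts an index level name ('Sektor') next to a column label.

```python
import pandas as pd

# Hypothetical sample data: one row per order.
df = pd.DataFrame({
    "Sektor": ["AE", "AE", "AE", "AT", "AT", "AT", "AT", "AT", "AT"],
    "CustomerID": [1298772, 1298840, 1298840,
                   1038570, 1040425, 1040425, 1040425, 1040426, 1040426],
    "_Order_ID_": range(9),
})

# Count orders per (Sektor, CustomerID) pair.
counts = df.groupby(["Sektor", "CustomerID"]).count()

# Keep the Sektor order ascending, but put the largest counts first within each Sektor.
result = counts.sort_values(["Sektor", "_Order_ID_"], ascending=[True, False])
print(result)
```

Any of the count columns can serve as the sort key here, since count() gives the same number in every column when there are no missing values.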
I want to label my data with the length of each consecutive run.
a
---
1
0
1
1
0
1
1
1
0
1
1
I want:
a | c
--------
1 1
0 0
1 2
1 2
0 0
1 3
1 3
1 3
0 0
1 2
1 2
Then I can calculate the mean of column "b" for each group in "c". I tried shift, cumsum and cumcount, but none of them worked.
Use GroupBy.transform on groups of consecutive values, then set the result to 0 wherever column a is not 1:
df['c1'] = (df.groupby(df.a.ne(df.a.shift()).cumsum())['a']
              .transform('size')
              .where(df.a.eq(1), 0))
print (df)
a b c c1
0 1 1 1 1
1 0 2 0 0
2 1 3 2 2
3 1 2 2 2
4 0 1 0 0
5 1 3 3 3
6 1 1 3 3
7 1 3 3 3
8 0 2 0 0
9 1 2 2 2
10 1 1 2 2
If the column contains only 0 and 1 values, you can instead multiply by a:
df['c1'] = (df.groupby(df.a.ne(df.a.shift()).cumsum())['a']
              .transform('size')
              .mul(df.a))
print (df)
a b c c1
0 1 1 1 1
1 0 2 0 0
2 1 3 2 2
3 1 2 2 2
4 0 1 0 0
5 1 3 3 3
6 1 1 3 3
7 1 3 3 3
8 0 2 0 0
9 1 2 2 2
10 1 1 2 2
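To connect this back to the asker's goal (the mean of "b" per label in "c"), here is a short sketch that rebuilds the a and b columns from the printed output above and adds the final groupby. The intermediate name run_id is only illustrative.

```python
import pandas as pd

# Columns a and b copied from the printed frame above.
df = pd.DataFrame({
    "a": [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1],
    "b": [1, 2, 3, 2, 1, 3, 1, 3, 2, 2, 1],
})

# A new run starts wherever the value differs from the previous row.
run_id = df["a"].ne(df["a"].shift()).cumsum()

# Label each row with the size of its run, and use 0 for rows where a is not 1.
df["c"] = df.groupby(run_id)["a"].transform("size").where(df["a"].eq(1), 0)

# Mean of b for each run-length label.
mean_b = df.groupby("c")["b"].mean()
print(mean_b)
```

Note that grouping by "c" pools all runs of the same length, so the two separate runs of length 2 end up in one group. Group by run_id instead if each run should be kept apart.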
I have the following dataframe:
Group from to
1 2 1
1 1 2
1 3 2
1 3 1
2 1 4
2 3 1
2 1 2
2 3 1
I want to create a 4th column that counts the occurrences of each (from, to) combination within each group and drops any repeated combination within each group (leaving only one).
Expected output:
Group from to weight
1 2 1 1
1 1 2 1
1 3 2 1
1 3 1 1
2 1 4 1
2 3 1 2
2 1 2 1
In the expected output, the 2nd from 3, to 1 row in group 2 was dropped because it is a duplicate.
In your case, groupby with size is enough:
out = df.groupby(df.columns.tolist()).size().to_frame(name='weight').reset_index()
Out[258]:
Group from to weight
0 1 1 2 1
1 1 2 1 1
2 1 3 1 1
3 1 3 2 1
4 2 1 2 1
5 2 1 4 1
6 2 3 1 2
You can group by the 3 columns using .groupby() and take their size by GroupBy.size(), as follows:
df_out = df.groupby(['Group', 'from', 'to'], sort=False).size().reset_index(name='weight')
Result:
print(df_out)
Group from to weight
0 1 2 1 1
1 1 1 2 1
2 1 3 2 1
3 1 3 1 1
4 2 1 4 1
5 2 3 1 2
6 2 1 2 1
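The two answers differ only in row order. The sketch below rebuilds the question's data by hand and runs both variants side by side. The variable names sorted_out and ordered_out are illustrative.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "Group": [1, 1, 1, 1, 2, 2, 2, 2],
    "from":  [2, 1, 3, 3, 1, 3, 1, 3],
    "to":    [1, 2, 2, 1, 4, 1, 2, 1],
})
cols = ["Group", "from", "to"]

# Default sort=True: rows come back ordered by the group keys.
sorted_out = df.groupby(cols).size().reset_index(name="weight")

# sort=False: groups keep the order in which they first appear in df.
ordered_out = df.groupby(cols, sort=False).size().reset_index(name="weight")

print(sorted_out)
print(ordered_out)
```

Only the sort=False variant reproduces the row order of the expected output. In both cases the weights sum to the number of rows in the input.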
I would like to combine two columns in a new column.
Let's suppose I have:
Index A B
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0
5 1 2
6 1 2
7 1 2
8 1 2
9 1 2
10 1 2
Now I would like to create a column C that takes its entries from column A for Index 0 to 4 and from column B for Index 5 to 10. It should look like this:
Index A B C
0 1 0 1
1 1 0 1
2 1 0 1
3 1 0 1
4 1 0 1
5 1 2 2
6 1 2 2
7 1 2 2
8 1 2 2
9 1 2 2
10 1 2 2
Is there Python code that does this? Thanks in advance!
If Index is an actual column, you can use numpy.where and specify your condition:
import numpy as np
df['C'] = np.where(df['Index'] <= 4, df['A'], df['B'])
Index A B C
0 0 1 0 1
1 1 1 0 1
2 2 1 0 1
3 3 1 0 1
4 4 1 0 1
5 5 1 2 2
6 6 1 2 2
7 7 1 2 2
8 8 1 2 2
9 9 1 2 2
10 10 1 2 2
If Index is your actual index, you can slice by position with iloc and build the column with concat:
df['C'] = pd.concat([df['A'].iloc[:5], df['B'].iloc[5:]])
print(df)
A B C
0 1 0 1
1 1 0 1
2 1 0 1
3 1 0 1
4 1 0 1
5 1 2 2
6 1 2 2
7 1 2 2
8 1 2 2
9 1 2 2
10 1 2 2
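A third variant, sketched here as an assumption rather than as part of the answer above, selects by row position. It works whether Index is a column or the actual index, as long as the split point is known as a row count.

```python
import numpy as np
import pandas as pd

# Data copied from the question (default RangeIndex 0..10).
df = pd.DataFrame({"A": [1] * 11, "B": [0] * 5 + [2] * 6})

# Take A for the first five rows (positions 0-4) and B for the rest.
positions = np.arange(len(df))
df["C"] = np.where(positions < 5, df["A"], df["B"])
print(df)
```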
Dataframe
a b c
0 0 1 1
1 0 1 1
2 0 0 1
3 0 0 1
4 1 1 0
5 1 1 1
6 1 1 1
7 0 0 1
I am trying to apply a cumulative count (cumcount) to multiple columns of a DataFrame. I have tried applying the cumulative count by grouping on each column. Is there an easy way to achieve the expected output?
I have tried this code, but it is not working:
li = []
for column in df.columns:
    li.append(df.groupby(column)[column].cumcount())
pd.concat(li, axis=1)
Expected output
a b c
0 1 1 1
1 1 2 2
2 1 1 3
3 1 1 4
4 1 1 1
5 2 2 1
6 3 3 2
7 1 1 3
Create groups of consecutive values by comparing each column with its shifted values, apply cumcount to each column, and finally set the count to 1 with a boolean mask wherever the original value is 0:
df = (df.ne(df.shift()).cumsum()
        .apply(lambda x: df.groupby(x).cumcount() + 1)
        .mask(df == 0, 1))
print (df)
a b c
0 1 1 1
1 1 2 2
2 1 1 3
3 1 1 4
4 1 1 1
5 2 2 1
6 3 3 2
7 1 1 3
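Another solution, if performance is important: count only the runs of 1 values, then set every other position to 1 with np.where:
import numpy as np
a = df == 1
# running total of 1 values per column
b = a.cumsum()
# subtract the total reached at the last non-1 row, so the count restarts in each run
arr = np.where(a, b-b.mask(a).ffill().fillna(0).astype(int), 1)
df = pd.DataFrame(arr, index=df.index, columns=df.columns)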
print (df)
a b c
0 1 1 1
1 1 2 2
2 1 1 3
3 1 1 4
4 1 1 1
5 2 2 1
6 3 3 2
7 1 1 3
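For reference, here is a self-contained version of the first approach, with the question's frame rebuilt by hand. The attempt in the question falls short because groupby(column).cumcount() counts every earlier occurrence of a value in the whole column (starting at 0) and does not restart at each consecutive run. The names run_ids and out are illustrative.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "a": [0, 0, 0, 0, 1, 1, 1, 0],
    "b": [1, 1, 0, 0, 1, 1, 1, 0],
    "c": [1, 1, 1, 1, 0, 1, 1, 1],
})

# Per column: an id that increases each time the value changes.
run_ids = df.ne(df.shift()).cumsum()

# Per column: 1-based position inside each run, then force 1 where the value is 0.
out = (run_ids.apply(lambda col: df.groupby(col).cumcount() + 1)
              .mask(df == 0, 1))
print(out)
```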
I have a pandas DataFrame and group it by two columns (for example col1 and col2). For fixed values of col1 and col2 (i.e. for a group) I can have several different values in col3. I would like to count the number of distinct values in the third column.
For example, if I have this as my input:
1 1 1
1 1 1
1 1 2
1 2 3
1 2 3
1 2 3
2 1 1
2 1 2
2 1 3
2 2 3
2 2 3
2 2 3
I would like to have this table (data frame) as the output:
1 1 2
1 2 1
2 1 3
2 2 1
df.groupby(['col1','col2'])['col3'].nunique().reset_index()
In [17]: df
Out[17]:
0 1 2
0 1 1 1
1 1 1 1
2 1 1 2
3 1 2 3
4 1 2 3
5 1 2 3
6 2 1 1
7 2 1 2
8 2 1 3
9 2 2 3
10 2 2 3
11 2 2 3
In [19]: df.groupby([0,1])[2].apply(lambda x: len(x.unique()))
Out[19]:
0  1
1  1    2
   2    1
2  1    3
   2    1
dtype: int64
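A runnable sketch of the nunique approach on the question's data. The column names col1, col2 and col3 come from the question's wording and are an assumption about the real frame, which the second answer indexes by 0, 1 and 2 instead.

```python
import pandas as pd

# Data copied from the question, with assumed column names.
df = pd.DataFrame({
    "col1": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "col2": [1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2],
    "col3": [1, 1, 2, 3, 3, 3, 1, 2, 3, 3, 3, 3],
})

# Number of distinct col3 values for each (col1, col2) pair.
out = df.groupby(["col1", "col2"])["col3"].nunique().reset_index()
print(out)
```

One difference between the two answers: nunique() ignores NaN by default, while len(x.unique()) counts NaN as a value, so the results can differ when col3 has missing data.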