I want to group by Date and Hour with a count aggregation, and split the result by each distinct GpID (as columns) in the output.
df = pd.DataFrame({'GpID': [1,1,0,1,1,0,1,1,2,2,1,1,2,1,1,0,1,2,0,1,1],
                   'HR': [1,1,1,1,1,1,1, 2,2,2,2,1,1,1, 2,2,2,2,3,3,3],
                   'Date_': [1,1,1,2,2,2,2, 2,2,2,2,3,3,3, 3,3,3,3,3,3,3]})
The output format should look like this:
df_out = pd.DataFrame({'HR': [1,2,3,1,2,3],
                       'Date_': [1,1,1,2,2,2],
                       'GpID_0': [1,2,5,1,4,2],
                       'GpID_1': [1,2,5,1,4,2],
                       'GpID_2': [4,2,5,1,4,2]})
Tried:
# 1st try
df_g = df.groupby(["HR", "Date_"], observed=False).count().fillna(0).unstack()
# 2nd try
df_g = df.groupby(["HR", "Date_", "GpID"], observed=False).count().fillna(0).unstack(-1)
# 3rd try
df_g = df.groupby(["HR", "Date_"], observed=False).count().fillna(0).unstack()
Nothing accurate yet.
I believe you were trying to do something like this:
In [1]:
import pandas as pd
df = pd.DataFrame({'GpID': [1,1,0,1,1,0,1,1,2,2,1,1,2,1,1,0,1,2,0,1,1],
                   'HR': [1,1,1,1,1,1,1, 2,2,2,2,1,1,1, 2,2,2,2,3,3,3],
                   'Date_': [1,1,1,2,2,2,2, 2,2,2,2,3,3,3, 3,3,3,3,3,3,3]})
df.loc[:,'Count']=1
pd.pivot_table(df, values='Count', index=['Date_', 'HR'], columns=['GpID'], aggfunc='count').fillna(0).reset_index()
Out [1]:
Date_ HR 0 1 2
0 1 1 1 2 0
1 2 1 1 3 0
2 2 2 0 2 2
3 3 1 0 2 1
4 3 2 1 2 1
5 3 3 1 2 0
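Alternatively, you can skip the helper Count column and count group sizes directly, then unstack GpID into columns. A minimal sketch using the df defined above (the GpID_ column prefix is just cosmetic):
out = (df.groupby(['Date_', 'HR', 'GpID'])
         .size()
         .unstack('GpID', fill_value=0)
         .add_prefix('GpID_')
         .reset_index())
This gives the same counts as the pivot_table call, with the zeros already filled in by fill_value.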
I have a big dataframe with more than 100 columns. I am sharing a miniature version of my real dataframe below
ID rev_Q1 rev_Q5 rev_Q4 rev_Q3 rev_Q2 tx_Q3 tx_Q5 tx_Q2 tx_Q1 tx_Q4
1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
I would like to do the below:
a) sort the column names based on the quarter (e.g. Q1, Q2, Q3, Q4, Q5 ... Q100 ... Q1000) within each column pattern
b) by column pattern, I mean the keyword before the underscore, which is rev and tx
So I tried the below, but it doesn't work, and it also shifts the ID column to the back:
df = df.reindex(sorted(df.columns), axis=1)
I expect my output to be like as below. In real time, there are more than 100 columns with more than 30 patterns like rev, tx etc. I want my ID column to be in the first position as shown below.
ID rev_Q1 rev_Q2 rev_Q3 rev_Q4 rev_Q5 tx_Q1 tx_Q2 tx_Q3 tx_Q4 tx_Q5
1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
For the provided example, df.sort_index(axis=1) should work fine.
If you have Q values higher than 9, use natural sorting with natsort:
from natsort import natsort_key
out = df.sort_index(axis=1, key=natsort_key)
Or using manual sorting with np.lexsort:
import numpy as np

idx = df.columns.str.split('_Q', expand=True, n=1)
order = np.lexsort([idx.get_level_values(1).astype(float), idx.get_level_values(0)])
out = df.iloc[:, order]
Something like:
new_order = list(df.columns)
new_order.remove("ID")  # remove() mutates in place and returns None
new_order = ["ID"] + sorted(new_order)
df = df[new_order]
We manually put "ID" in front and then sort what remains; note that plain sorted() is lexicographic, so quarters above Q9 will sort out of order (see the natural-sort sketch below).
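To keep ID first and still handle quarters above Q9, you can combine this with natural sorting; a sketch, assuming natsort is installed:
from natsort import natsorted

rest = [c for c in df.columns if c != 'ID']
df = df[['ID'] + natsorted(rest)]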
The idea is to create a dataframe from the column names, with two columns: one for the variable and another for the quarter number. Then sort this dataframe by those values and extract the index.
idx = (df.columns.str.extract(r'(?P<V>[^_]+)_Q(?P<Q>\d+)')
         .fillna('0').astype({'Q': int})
         .sort_values(by=['V', 'Q']).index)
df = df.iloc[:, idx]
Output:
>>> df
ID rev_Q1 rev_Q2 rev_Q3 rev_Q4 rev_Q5 tx_Q1 tx_Q2 tx_Q3 tx_Q4 tx_Q5
0 1 1 1 1 1 1 1 1 1 1 1
1 2 1 1 1 1 1 1 1 1 1 1
>>> (df.columns.str.extract(r'(?P<V>[^_]+)_Q(?P<Q>\d+)')
     .fillna('0').astype({'Q': int})
     .sort_values(by=['V', 'Q']))
V Q
0 0 0
1 rev 1
5 rev 2
4 rev 3
3 rev 4
2 rev 5
9 tx 1
8 tx 2
6 tx 3
10 tx 4
7 tx 5
I have a dataframe:
df =
   col1  Num
0     1    4
1     1    4
2     2    5
3     2    1
4     2    1
5     3    2
I want to sum the numbers within each col1 group and show the totals.
So I will get:
      Sum
col1
1       8
2       7
3       2
Try this:
df.groupby('col1').sum()
If you want the resulting column to be named 'Sum' as in your example, you can rename it afterwards:
df1 = df.groupby('col1').sum()
df1.columns = ['Sum']
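Alternatively, named aggregation produces the Sum column in one step (pandas 0.25+); this sketch assumes the value column is named Num as in the example above:
df1 = df.groupby('col1').agg(Sum=('Num', 'sum'))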
I have the following pandas dataframe
import pandas as pd
foo = pd.DataFrame({'id': [1,1,1,1,1,2,2,2,2,2],
                    'col_a': [0,1,1,0,1,1,1,0,1,1]})
I would like to create a new column (col_a_new) which is the same as col_a, but with the first of every two consecutive 1s in col_a replaced by 0, per id.
The resulting dataframe looks like this:
foo = pd.DataFrame({'id': [1,1,1,1,1,2,2,2,2,2],
                    'col_a': [0,1,1,0,1,1,1,0,1,1],
                    'col_a_new': [0,0,1,0,1,0,1,0,0,1]})
Any ideas?
Other approach: just group by id and compute the new values with transform, zeroing every value whose successor within the group is a 1:
def zero_first_of_pair(series):
    vals = series.tolist()
    # zero any value that is immediately followed by a 1 within the group
    return [0 if i < len(vals) - 1 and vals[i + 1] == 1 else x
            for i, x in enumerate(vals)]

foo["col_a_new"] = foo.groupby("id").col_a.transform(zero_first_of_pair)
import numpy as np

# group by id and by non-consecutive clusters of 0/1 in col_a
group = foo.groupby(["id", foo["col_a"].ne(foo["col_a"].shift()).cumsum()])
# get the cumcount and the size of each cluster
foo_cumcount = group.cumcount()
foo_count = group.col_a.transform(len)
# zero the first 1 of every cluster of two or more 1s, otherwise keep the original value
foo["col_a_new"] = np.where(foo_cumcount.eq(0)
                            & foo_count.gt(1)
                            & foo.col_a.eq(1),
                            0, foo.col_a)
# result
id col_a col_a_new
0 1 0 0
1 1 1 0
2 1 1 1
3 1 0 0
4 1 1 1
5 2 1 0
6 2 1 1
7 2 0 0
8 2 1 0
9 2 1 1
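The same rule can also be written without an explicit loop, using a group-wise shift; a sketch based on the foo frame above:
# a value is zeroed when the next row within the same id is a 1
next_is_one = foo.groupby('id')['col_a'].shift(-1).eq(1)
foo['col_a_new'] = foo['col_a'].where(~next_is_one, 0)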
I know how to append a column counting the number of elements in a group, but I need to count only the elements within each group that meet a certain condition.
For example, if I have the following data:
import numpy as np
import pandas as pd
columns=['group1', 'value1']
data = np.array([np.arange(5)]*2).T
mydf = pd.DataFrame(data, columns=columns)
mydf.group1 = [0,0,1,1,2]
mydf.value1 = ['P','F',100,10,0]
valueslist={'50','51','52','53','54','55','56','57','58','59','60','61','62','63','64','65','66','67','68','69','70','71','72','73','74','75','76','77','78','79','80','81','82','83','84','85','86','87','88','89','90','91','92','93','94','95','96','97','98','99','100','A','B','C','D','P','S'}
and my dataframe therefore looks like this:
mydf
group1 value1
0 0 P
1 0 F
2 1 100
3 1 10
4 2 0
I would then want to count the number of rows within each group1 value where value1 is in valueslist.
My desired output is:
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
After changing the type of the value1 column to match your valueslist (or the other way around), you can use isin to get a True/False column, and convert that to 1s and 0s with astype(int). Then we can apply an ordinary groupby transform:
In [13]: mydf["value1"] = mydf["value1"].astype(str)
In [14]: mydf["count"] = (mydf["value1"].isin(valueslist).astype(int)
.groupby(mydf["group1"]).transform(sum))
In [15]: mydf
Out[15]:
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
mydf.value1=mydf.value1.astype(str)
mydf['count']=mydf.group1.map(mydf.groupby('group1').apply(lambda x : sum(x.value1.isin(valueslist))))
mydf
Out[412]:
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
Data input:
valueslist=['50','51','52','53','54','55','56','57','58','59','60','61','62','63','64','65','66','67','68','69','70','71','72','73','74','75','76','77','78','79','80','81','82','83','84','85','86','87','88','89','90','91','92','93','94','95','96','97','98','99','100','A','B','C','D','P','S']
You can group by group1 and then use transform to count how many values in each group are in the list.
mydf['count'] = mydf.groupby('group1')['value1'].transform(lambda x: x.astype(str).isin(valueslist).sum())
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
Here is one way to do it, albeit a one-liner; note that it counts distinct matching values per group rather than matching rows (the two coincide for this data):
mydf.merge(mydf.groupby('group1').apply(lambda x: len(set(x['value1'].values).intersection(valueslist))).reset_index().rename(columns={0: 'count'}), how='inner', on='group1')
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
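A join-based variant with row-count semantics, reusing mydf and valueslist from above, if you prefer computing the per-group counts as a separate step:
# flag matching rows, sum the flags per group, then join the counts back
counts = (mydf.assign(match=mydf['value1'].astype(str).isin(valueslist))
              .groupby('group1')['match'].sum()
              .rename('count'))
mydf = mydf.join(counts, on='group1')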
Is there a shorter way of dropping a column MultiIndex level (in my case, basic_amt) except transposing it twice?
In [704]: test
Out[704]:
basic_amt
Faculty NSW QLD VIC All
All 1 1 2 4
Full Time 0 1 0 1
Part Time 1 0 2 3
In [705]: test.reset_index(level=0, drop=True)
Out[705]:
basic_amt
Faculty NSW QLD VIC All
0 1 1 2 4
1 0 1 0 1
2 1 0 2 3
In [711]: test.transpose().reset_index(level=0, drop=True).transpose()
Out[711]:
Faculty NSW QLD VIC All
All 1 1 2 4
Full Time 0 1 0 1
Part Time 1 0 2 3
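In more recent pandas (0.24 and later), DataFrame.droplevel gives the same result without the double transpose:
test.droplevel(0, axis=1)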
Another solution is to use MultiIndex.droplevel with rename_axis (new in pandas 0.18.0):
import pandas as pd
cols = pd.MultiIndex.from_arrays([['basic_amt']*4,
                                   ['NSW','QLD','VIC','All']],
                                  names=[None, 'Faculty'])
idx = pd.Index(['All', 'Full Time', 'Part Time'])
df = pd.DataFrame([(1,1,2,4),
                   (0,1,0,1),
                   (1,0,2,3)], index=idx, columns=cols)
print (df)
basic_amt
Faculty NSW QLD VIC All
All 1 1 2 4
Full Time 0 1 0 1
Part Time 1 0 2 3
df.columns = df.columns.droplevel(0)
# pandas 0.18.0 and higher
df = df.rename_axis(None, axis=1)
# pandas below 0.18.0
# df.columns.name = None
print (df)
NSW QLD VIC All
All 1 1 2 4
Full Time 0 1 0 1
Part Time 1 0 2 3
print (df.columns)
Index(['NSW', 'QLD', 'VIC', 'All'], dtype='object')
If you need both column names, use list comprehension:
df.columns = ['_'.join(col) for col in df.columns]
print (df)
basic_amt_NSW basic_amt_QLD basic_amt_VIC basic_amt_All
All 1 1 2 4
Full Time 0 1 0 1
Part Time 1 0 2 3
print (df.columns)
Index(['basic_amt_NSW', 'basic_amt_QLD', 'basic_amt_VIC', 'basic_amt_All'], dtype='object')
Zip levels together
Here is an alternative solution which zips the levels together and joins them with an underscore.
It is derived from the answer above; it is what I wanted to do when I found this question, so I am sharing it even though it does not answer the exact question asked.
["_".join(pair) for pair in df.columns]
gives
['basic_amt_NSW', 'basic_amt_QLD', 'basic_amt_VIC', 'basic_amt_All']
Just set this as a the columns
df.columns = ["_".join(pair) for pair in df.columns]
basic_amt_NSW basic_amt_QLD basic_amt_VIC basic_amt_All
Faculty
All 1 1 2 4
Full Time 0 1 0 1
Part Time 1 0 2 3
How about simply reassigning df.columns:
levels = df.columns.levels
codes = df.columns.codes  # named .labels in pandas < 0.24
df.columns = levels[1][codes[1]]
For example:
import pandas as pd
columns = pd.MultiIndex.from_arrays([['basic_amt']*4,
                                      ['NSW','QLD','VIC','All']])
index = pd.Index(['All', 'Full Time', 'Part Time'], name='Faculty')
df = pd.DataFrame([(1,1,2,4),
                   (0,1,0,1),
                   (1,0,2,3)])
df.columns = columns
df.index = index
Before:
print(df)
basic_amt
NSW QLD VIC All
Faculty
All 1 1 2 4
Full Time 0 1 0 1
Part Time 1 0 2 3
After:
levels = df.columns.levels
codes = df.columns.codes  # named .labels in pandas < 0.24
df.columns = levels[1][codes[1]]
print(df)
NSW QLD VIC All
Faculty
All 1 1 2 4
Full Time 0 1 0 1
Part Time 1 0 2 3
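The same reassignment can be written more directly with get_level_values, which returns the values of the chosen level already aligned to the columns:
df.columns = df.columns.get_level_values(1)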