I have the following data:
Name  Date        Class Attended
AB    15-02-2019  3
CD    15-02-2019  2
AB    19-02-2019  4
CD    19-02-2019  2
AB    15-02-2019  1
CD    19-02-2019  3
I need output like:
Name  15-02-2019  19-02-2019
AB    4 (3+1)     4
CD    2           5 (2+3)
Use groupby() and then pivot():
(df.groupby(['Name', 'Date'])
   .sum()
   .reset_index()
   .pivot(index='Name', columns='Date', values='Class Attended')
   .reset_index())
Out[12]:
Date Name 15-02-2019 19-02-2019
0 AB 4 4
1 CD 2 5
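A one-step alternative, as a hedged sketch: pivot_table can aggregate the duplicate (Name, Date) pairs itself, folding the groupby().sum() into the reshape. The frame below reconstructs the question's data; the column name 'Class Attended' is taken from the post.
import pandas as pd

# Reconstruction of the question's data (an assumption, built from the post)
df = pd.DataFrame({
    'Name': ['AB', 'CD', 'AB', 'CD', 'AB', 'CD'],
    'Date': ['15-02-2019', '15-02-2019', '19-02-2019',
             '19-02-2019', '15-02-2019', '19-02-2019'],
    'Class Attended': [3, 2, 4, 2, 1, 3],
})

# aggfunc='sum' collapses the duplicated pairs during the pivot
df.pivot_table(index='Name', columns='Date',
               values='Class Attended', aggfunc='sum').reset_index()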
I have a data frame df where I would like to create a new column ID which is a diagonal combination of two other columns, ID1 & ID2.
This is the data frame:
import pandas as pd
df = pd.DataFrame({'Employee': [5, 5, 5, 20, 20],
                   'Department': [4, 4, 4, 6, 6],
                   'ID1': ['AB', 'CD', 'EF', 'XY', 'AA'],
                   'ID2': ['CD', 'EF', 'GH', 'AA', 'ZW']})
This is what the initial data frame looks like:
Employee Department ID1 ID2
0 5 4 AB CD
1 5 4 CD EF
2 5 4 EF GH
3 20 6 XY AA
4 20 6 AA ZW
If I group df by Employee & Department:
df2=df.groupby(["Employee","Department"])
There would be only two kinds of groups: groups containing two rows and groups containing three rows.
The column ID would be the concatenation of the current row's ID1 and the next row's ID2; for the last row of each group, ID would take the value of the previous row's ID.
Expected output:
Employee Department ID1 ID2 ID
0 5 4 AB CD ABEF
1 5 4 CD EF CDGH
2 5 4 EF GH CDGH
3 20 6 XY AA XYZW
4 20 6 AA ZW XYZW
I thought about using shift()
df2["ID"]=df["ID1"]+df["ID2"].shift(-1)
But I could not quite figure it out. Any ideas?
(df["ID1"] + df.groupby(["Employee", "Department"])["ID2"].shift(-1)).ffill()
This is almost your code, but we first group by and then shift up. Lastly, we forward-fill to cover the last row of each group.
In [24]: df
Out[24]:
Employee Department ID1 ID2
0 5 4 AB CD
1 5 4 CD EF
2 5 4 EF GH
3 20 6 XY AA
4 20 6 AA ZW
In [25]: df["ID"] = (df["ID1"] + df.groupby(["Employee", "Department"])["ID2"].shift(-1)).ffill()
In [26]: df
Out[26]:
Employee Department ID1 ID2 ID
0 5 4 AB CD ABEF
1 5 4 CD EF CDGH
2 5 4 EF GH CDGH
3 20 6 XY AA XYZW
4 20 6 AA ZW XYZW
You can groupby.shift, concatenate, and ffill:
df['ID'] = (df['ID1'] +
            df.groupby(['Employee', 'Department'])['ID2'].shift(-1)).ffill()
output:
Employee Department ID1 ID2 ID
0 5 4 AB CD ABEF
1 5 4 CD EF CDGH
2 5 4 EF GH CDGH
3 20 6 XY AA XYZW
4 20 6 AA ZW XYZW
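One caveat, offered as a hedged aside: the trailing .ffill() runs over the whole column, so a group of size 1 (whose only row is NaN after the shift) would inherit a value from the previous group. A minimal sketch of a groupwise fill that avoids this:
tmp = df['ID1'] + df.groupby(['Employee', 'Department'])['ID2'].shift(-1)
# forward-fill within each (Employee, Department) group only,
# so nothing leaks across group boundaries
df['ID'] = tmp.groupby([df['Employee'], df['Department']]).ffill()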
I have a dataframe such as:
id info date group label
1 aa 02/05 1 7
2 ba 02/05 1 8
3 cp 09/05 2 7
4 dd 09/05 2 8
5 ii 09/05 2 9
Every group should have the numbers 7, 8 and 9. In the example above, the group 1 does not have the three numbers, the number 9 is missing. In that case, I would like to find the closest row with a 9 in the label, and add it to the dataframe, also changing the date to the group's date.
So the desired result would be:
id info date group label
1 aa 02/05 1 7
2 ba 02/05 1 8
6 ii 02/05 1 9
3 cp 09/05 2 7
4 dd 09/05 2 8
5 ii 09/05 2 9
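For reference, a sketch reconstructing the question's frame (dates kept as plain strings, an assumption on my part):
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'info': ['aa', 'ba', 'cp', 'dd', 'ii'],
    'date': ['02/05', '02/05', '09/05', '09/05', '09/05'],
    'group': [1, 1, 2, 2, 2],
    'label': [7, 8, 7, 8, 9],
})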
Welcome to SO. It's good if you include what you have tried so far, so keep that in mind. Anyhow, for this question, break down your thought process into pandas syntax. The first step would be to check which groups are missing which label from {8, 9}:
dfs = df.groupby(['group', 'date']).agg({'label':set}).reset_index().sort_values('group')
dfs['label'] = dfs['label'].apply(lambda x: {8, 9}.difference(x)).explode() # This is the missing label
dfs
Which will give you:
   group   date label
0      1  02/05     9
1      2  09/05   NaN
Now merge it with the original on label to have info filled in:
final_df = pd.concat([df, dfs.merge(df[['label', 'info']], on='label', suffixes=['','_grouped'])])
final_df
  id  info   date  group  label
   1    aa  02/05      1      7
   2    ba  02/05      1      8
   3    cp  09/05      2      7
   4    dd  09/05      2      8
   5    ii  09/05      2      9
 NaN    ii  02/05      1      9
And prettify:
(final_df.reset_index(drop=True)
         .reset_index()
         .assign(id=lambda x: x['index'] + 1)  # rebuild id as a 1-based row number
         .drop(columns=['index'])
         .sort_values(['group', 'id']))
  id  info   date  group  label
   1    aa  02/05      1      7
   2    ba  02/05      1      8
   6    ii  02/05      1      9
   3    cp  09/05      2      7
   4    dd  09/05      2      8
   5    ii  09/05      2      9
Given a particular df:
ID Text
1 abc
1 xyz
2 xyz
2 abc
3 xyz
3 abc
3 ijk
4 xyz
I want to apply a condition: grouping by ID, if abc exists then delete the row with xyz. The outcome would be:
ID Text
1 abc
2 abc
3 abc
3 ijk
4 xyz
Usually I would group them by ID and apply np.where(...). However, I don't think this approach would work for this case, since it's based on rows.
Many thanks!
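For reference, a sketch reconstructing the frame the answers below operate on:
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3, 3, 3, 4],
    'Text': ['abc', 'xyz', 'xyz', 'abc', 'xyz', 'abc', 'ijk', 'xyz'],
})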
To the best of my knowledge, you can vectorize this with a groupby + transform:
df[~(df.Text.eq('abc').groupby(df.ID).transform('any') & df.Text.eq('xyz'))]
ID Text
0 1 abc
3 2 abc
5 3 abc
6 3 ijk
7 4 xyz
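If the one-liner is dense, the same mask can be unpacked with intermediate names (the names here are mine, purely for illustration):
# True for every row whose ID group contains at least one 'abc'
has_abc = df.Text.eq('abc').groupby(df.ID).transform('any')
# True for the 'xyz' rows, the candidates for removal
is_xyz = df.Text.eq('xyz')
# keep everything except 'xyz' rows inside groups that also have 'abc'
result = df[~(has_abc & is_xyz)]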
I am using crosstab
s=pd.crosstab(df.ID,df.Text)
s.xyz=s.xyz.mask(s.abc.eq(1)&s.xyz.eq(1))
s
Out[162]:
Text abc ijk xyz
ID
1 1 0 NaN
2 1 0 NaN
3 1 1 NaN
4 0 0 1.0
# stack() drops the NaN cells here (its classic default), removing both the masked
# xyz entries and the zeroed counts; drop(columns=0) replaces the old positional
# drop(0, 1), which recent pandas no longer accepts
s.replace(0, np.nan).stack().reset_index().drop(columns=0)
Out[167]:
ID Text
0 1 abc
1 2 abc
2 3 abc
3 3 ijk
4 4 xyz
>> df
Foo Bar Number Date
0 abc None NaN NaT
1 abcdefg None NaN NaT
2 abcd this 1111222 3/8/2017
3 abcd that 1233336 3/3/2017
4 abcd what 1346554 3/3/2017
5 abcde that 8889995 3/9/2017
6 abcde this 1849552 3/8/2017
7 abcd that 7418652 3/3/2017
8 abcdef this 4865154 3/7/2017
>> df.groupby(['Foo']).size().reset_index(name='Total')
If I do it this way, the row is counted as having one value, which it does, and I understand that. What I'm not sure about is how to keep the row's Foo in the result without actually counting its None/NaN/NaT values.
Returns:
Foo Total
0 abc 1
1 abcd 4
2 abcde 2
3 abcdef 1
4 abcdefg 1
Expected result:
Foo Total
0 abc 0
1 abcd 4
2 abcde 2
3 abcdef 1
4 abcdefg 0
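For reference, a sketch reconstructing the frame (the null cells filled in as None/NaN/NaT to match the display):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Foo': ['abc', 'abcdefg', 'abcd', 'abcd', 'abcd',
            'abcde', 'abcde', 'abcd', 'abcdef'],
    'Bar': [None, None, 'this', 'that', 'what',
            'that', 'this', 'that', 'this'],
    'Number': [np.nan, np.nan, 1111222, 1233336, 1346554,
               8889995, 1849552, 7418652, 4865154],
    'Date': pd.to_datetime([None, None, '3/8/2017', '3/3/2017', '3/3/2017',
                            '3/9/2017', '3/8/2017', '3/3/2017', '3/7/2017']),
})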
You could drop nulls first, then reindex with the unique values of the Foo column at the end with a fill value.
(df.dropna().groupby('Foo')
.size()
.reindex(df.Foo.unique(), fill_value=0)
.reset_index(name='total'))
Or alternatively, you could make your Foo column Categorical.
df.Foo = pd.Categorical(df.Foo)
df.dropna().groupby('Foo').size().reset_index(name='total')
Demo
>>> (df.dropna().groupby('Foo')
.size()
.reindex(df.Foo.unique(), fill_value=0)
.reset_index(name='total'))
Foo total
0 abc 0
1 abcdefg 0
2 abcd 4
3 abcde 2
4 abcdef 1
############################################################################
>>> df.Foo = pd.Categorical(df.Foo)
>>> df.dropna().groupby('Foo').size().reset_index(name='total')
Foo total
0 abc 0
1 abcd 4
2 abcde 2
3 abcdef 1
4 abcdefg 0
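A version note on the Categorical variant, hedged: newer pandas releases move the groupby default for categorical keys toward observed=True, which would hide the zero-count categories. Passing observed=False explicitly keeps them:
df.Foo = pd.Categorical(df.Foo)
# observed=False keeps the unobserved categories (abc, abcdefg) in the output
df.dropna().groupby('Foo', observed=False).size().reset_index(name='total')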
I have a dataframe that looks something like this:
df = pd.DataFrame({'Name':['a','a','a','a','b','b','b'], 'Year':[1999,1999,1999,2000,1999,2000,2000], 'Name_id':[1,1,1,1,2,2,2]})
Name Name_id Year
0 a 1 1999
1 a 1 1999
2 a 1 1999
3 a 1 2000
4 b 2 1999
5 b 2 2000
6 b 2 2000
What I'd like to have is a new column 'yr_name_id' that increases for each unique Name_id-Year combination and then begins anew with each new Name_id.
Name Name_id Year yr_name_id
0 a 1 1999 1
1 a 1 1999 1
2 a 1 1999 1
3 a 1 2000 2
4 b 2 1999 1
5 b 2 2000 2
6 b 2 2000 2
I've tried a variety of things and looked here, here and at a few posts on groupby and enumerate.
At first I tried creating a unique dictionary after combining Name_id and Year and then using map to assign values, but when I try to combine Name_id and Year as strings via:
df['yr_name_id'] = str(df['Name_id']) + str(df['Year'])
The new column has a non-unique value of 0 0 1\n1 1\n2 1\n3 1\n4 2\n5 2... which I don't really understand.
A more promising approach, where I think I just need help with the lambda, is to use groupby:
df['yr_name_id'] = df.groupby(['Name_id', 'Year'])['Name_id'].transform(lambda x: )#unsure from this point
I am very unfamiliar with lambdas, so any guidance on how I might do this would be greatly appreciated.
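An aside on the str() attempt above: str(df['Name_id']) renders the entire Series as one string, index included, rather than converting each element, which is where that repr-looking text comes from. Elementwise conversion would be:
df['yr_name_id'] = df['Name_id'].astype(str) + df['Year'].astype(str)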
IIUC you can do it this way:
In [99]: df['yr_name_id'] = pd.Categorical(pd.factorize(df['Name_id'].astype(str) + '-' + df['Year'].astype(str))[0] + 1)
In [100]: df
Out[100]:
Name Name_id Year yr_name_id
0 a 1 1999 1
1 a 1 1999 1
2 a 1 1999 1
3 a 1 2000 2
4 b 2 1999 3
5 b 2 2000 4
6 b 2 2000 4
In [101]: df.dtypes
Out[101]:
Name object
Name_id int64
Year int64
yr_name_id category
dtype: object
But looking at your desired DF, it looks like you want to categorize just the Year column, not the combination of Name_id + Year:
In [102]: df['yr_name_id'] = pd.Categorical(pd.factorize(df.Year)[0] + 1)
In [103]: df
Out[103]:
Name Name_id Year yr_name_id
0 a 1 1999 1
1 a 1 1999 1
2 a 1 1999 1
3 a 1 2000 2
4 b 2 1999 1
5 b 2 2000 2
6 b 2 2000 2
In [104]: df.dtypes
Out[104]:
Name object
Name_id int64
Year int64
yr_name_id category
dtype: object
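Note that the desired output restarts the counter within each Name_id, which the factorize approach above does not do for group b. A sketch that restarts per group, using a dense rank of Year within each Name_id (my variant, not from either answer):
# rank Year within each Name_id; method='dense' gives 1, 2, ... per group
df['yr_name_id'] = (df.groupby('Name_id')['Year']
                      .rank(method='dense')
                      .astype(int))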
Use itertools.count:
from itertools import count
counter = count(1)
df['yr_name_id'] = (df.groupby(['Name_id', 'Year'])['Name_id']
.transform(lambda x: next(counter)))
Output:
Name Name_id Year yr_name_id
0 a 1 1999 1
1 a 1 1999 1
2 a 1 1999 1
3 a 1 2000 2
4 b 2 1999 3
5 b 2 2000 4
6 b 2 2000 4
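For the global numbering this produces, groupby.ngroup is arguably the idiomatic shortcut; a one-line sketch:
# ngroup() numbers each (Name_id, Year) group 0..n-1, hence the +1
df['yr_name_id'] = df.groupby(['Name_id', 'Year']).ngroup() + 1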