Sum two columns in a grouped data frame using shift() - python

I have a data frame df where I would like to create a new column ID, which is a diagonal combination of two other columns, ID1 & ID2.
This is the data frame:
import pandas as pd
df = pd.DataFrame({'Employee': [5, 5, 5, 20, 20],
                   'Department': [4, 4, 4, 6, 6],
                   'ID1': ['AB', 'CD', 'EF', 'XY', 'AA'],
                   'ID2': ['CD', 'EF', 'GH', 'AA', 'ZW']})
This is what the initial data frame looks like:
Employee Department ID1 ID2
0 5 4 AB CD
1 5 4 CD EF
2 5 4 EF GH
3 20 6 XY AA
4 20 6 AA ZW
If I group df by Employee & Department:
df2 = df.groupby(["Employee", "Department"])
there are only two kinds of groups: those containing two rows and those containing three rows.
The column ID should be the sum (string concatenation) of the current row's ID1 and the next row's ID2; for the last row of each group, ID takes the value of the previous row's ID.
Expected output:
Employee Department ID1 ID2 ID
0 5 4 AB CD ABEF
1 5 4 CD EF CDGH
2 5 4 EF GH CDGH
3 20 6 XY AA XYZW
4 20 6 AA ZW XYZW
I thought about using shift():
df2["ID"] = df["ID1"] + df["ID2"].shift(-1)
But I could not quite figure it out. Any ideas?

(df["ID1"] + df.groupby(["Employee", "Department"])["ID2"].shift(-1)).ffill()
This is almost your code, but we group by first and then shift up within each group. Lastly, forward fill handles the last row of each group.
In [24]: df
Out[24]:
Employee Department ID1 ID2
0 5 4 AB CD
1 5 4 CD EF
2 5 4 EF GH
3 20 6 XY AA
4 20 6 AA ZW
In [25]: df["ID"] = (df["ID1"] + df.groupby(["Employee", "Department"])["ID2"].shift(-1)).ffill()
In [26]: df
Out[26]:
Employee Department ID1 ID2 ID
0 5 4 AB CD ABEF
1 5 4 CD EF CDGH
2 5 4 EF GH CDGH
3 20 6 XY AA XYZW
4 20 6 AA ZW XYZW

You can groupby.shift, concatenate, and ffill:
df['ID'] = (df['ID1'] + df.groupby(['Employee', 'Department'])['ID2'].shift(-1)).ffill()
output:
Employee Department ID1 ID2 ID
0 5 4 AB CD ABEF
1 5 4 CD EF CDGH
2 5 4 EF GH CDGH
3 20 6 XY AA XYZW
4 20 6 AA ZW XYZW
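Note that the trailing .ffill() runs over the whole column. It is safe here because the shift only leaves a NaN in the last row of each group, but if a group could also begin with NaN, a per-group fill is the more cautious sketch (same df as above):
# Build ID from the current row's ID1 and the next row's ID2 within the
# group, then forward-fill within each group instead of across the frame.
df["ID"] = df["ID1"] + df.groupby(["Employee", "Department"])["ID2"].shift(-1)
df["ID"] = df.groupby(["Employee", "Department"])["ID"].ffill()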

Related

How to use pandas to expand a groupby table to have the rows repeated

I have data that is grouped as in the image below:
before
I would like to expand the dataframe so it is ungrouped, into a table that looks like the image below:
after
What would be the best way to repeat these items to get a simpler table?
I have tried to use unstack, but I would like the columns to stay the same as they currently are.
There are a couple of different ways to interpret the images.
If this is a dataframe that you've aggregated and that's what's in pandas (which is what it looks like), and you just want to display it with repeated labels, I think you're just looking for a .reset_index().
If you export it via to_excel, to_csv, to_markdown, or the like, the labels will be repeated.
If it's an Excel table with just null values, you'd want to fill the NaNs using ffill:
my_df["id"] = my_df["id"].ffill()
here is a similar answer
Below demonstrates what the code does in each of the scenarios:
reset_index
>>> df
a
id colb
a AA 1
BB 2
b CC 3
DD 4
c EE 5
FF 6
d GG 7
>>>
>>> df.reset_index()
id colb a
0 a AA 1
1 a BB 2
2 b CC 3
3 b DD 4
4 c EE 5
5 c FF 6
6 d GG 7
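For reference, a self-contained sketch that rebuilds the frame above (its construction was not shown, so the MultiIndex is an assumption based on the printout):
import pandas as pd

# Rebuild the grouped-looking frame as a MultiIndex DataFrame
df = pd.DataFrame(
    {"a": [1, 2, 3, 4, 5, 6, 7]},
    index=pd.MultiIndex.from_tuples(
        [("a", "AA"), ("a", "BB"), ("b", "CC"), ("b", "DD"),
         ("c", "EE"), ("c", "FF"), ("d", "GG")],
        names=["id", "colb"],
    ),
)
print(df.reset_index())  # 'id' now repeats on every row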
testing export
>>> df
a
id colb
a AA 1
BB 2
b CC 3
DD 4
c EE 5
FF 6
d GG 7
>>> df.to_csv("testing_if_labels_repeat.csv")
>>> pd.read_csv("testing_if_labels_repeat.csv")
id colb a
0 a AA 1
1 a BB 2
2 b CC 3
3 b DD 4
4 c EE 5
5 c FF 6
6 d GG 7
if the source is a table with null values
>>> df = pd.read_excel("table_file.xlsx")
>>> df
id colb cola
0 a AA 1
1 NaN BB 2
2 b CC 3
3 NaN DD 4
4 c EE 5
5 NaN FF 6
6 d GG 7
>>> df["id"] = df["id"].fillna(method="ffill")
>>> df
id colb cola
0 a AA 1
1 a BB 2
2 b CC 3
3 b DD 4
4 c EE 5
5 c FF 6
6 d GG 7
>>>
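The same fill also works without the Excel file; here is a self-contained sketch with the frame rebuilt inline to match the printout above:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "id":   ["a", np.nan, "b", np.nan, "c", np.nan, "d"],
    "colb": ["AA", "BB", "CC", "DD", "EE", "FF", "GG"],
    "cola": [1, 2, 3, 4, 5, 6, 7],
})
df["id"] = df["id"].ffill()  # each NaN takes the value from the row above
print(df)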
Is that helpful? Or are you trying for something different?

How to find the next row that has a value in a column in a pandas dataframe?

I have a dataframe such as:
id info date group label
1 aa 02/05 1 7
2 ba 02/05 1 8
3 cp 09/05 2 7
4 dd 09/05 2 8
5 ii 09/05 2 9
Every group should have the numbers 7, 8 and 9. In the example above, group 1 does not have all three numbers; the number 9 is missing. In that case, I would like to find the closest row with a 9 in the label column and add it to the dataframe, also changing the date to the group's date.
So the desired result would be:
id info date group label
1 aa 02/05 1 7
2 ba 02/05 1 8
6 ii 02/05 1 9
3 cp 09/05 2 7
4 dd 09/05 2 8
5 ii 09/05 2 9
Welcome to SO. It's good if you include what you have tried so far, so keep that in mind. Anyhow, for this question, break down your thought process into pandas syntax. The first step would be to check which label from {7, 8, 9} each group is missing:
dfs = df.groupby(['group', 'date']).agg({'label': set}).reset_index().sort_values('group')
dfs['label'] = dfs['label'].apply(lambda x: {7, 8, 9}.difference(x)).explode()  # the missing label(s)
dfs
Which will give you:
group date label
1 02/05 9
2 09/05 NaN
Now merge it with the original on label so that info gets filled in:
final_df = pd.concat([df, dfs.merge(df[['label', 'info']], on='label', suffixes=['','_grouped'])])
final_df
id info date group label
1 aa 02/05 1 7
2 ba 02/05 1 8
3 cp 09/05 2 7
4 dd 09/05 2 8
5 ii 09/05 2 9
NaN ii 02/05 1 9
And prettify:
final_df.reset_index(drop=True).reset_index().assign(id=lambda x:x['index']+1).drop(columns=['index']).sort_values(['group', 'id'])
id info date group label
1 aa 02/05 1 7
2 ba 02/05 1 8
6 ii 02/05 1 9
3 cp 09/05 2 7
4 dd 09/05 2 8
5 ii 09/05 2 9
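Putting the steps together as one runnable sketch (the frame is rebuilt from the question, and {7, 8, 9} is the required label set):
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "info": ["aa", "ba", "cp", "dd", "ii"],
    "date": ["02/05", "02/05", "09/05", "09/05", "09/05"],
    "group": [1, 1, 2, 2, 2],
    "label": [7, 8, 7, 8, 9],
})

# Step 1: which labels each (group, date) is missing from {7, 8, 9}
dfs = df.groupby(["group", "date"]).agg({"label": set}).reset_index()
dfs["label"] = dfs["label"].apply(lambda x: {7, 8, 9}.difference(x))
dfs = dfs.explode("label").dropna(subset=["label"])
dfs["label"] = dfs["label"].astype(int)  # object -> int so the merge keys match

# Step 2: borrow 'info' from existing rows that carry the missing label
filler = dfs.merge(df[["label", "info"]], on="label")

# Step 3: append, renumber id, and order by group
out = (pd.concat([df, filler], ignore_index=True)
         .assign(id=lambda x: x.index + 1)
         .sort_values(["group", "id"]))
print(out)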

Fill down zeros for specific column where there are no values (Python)

I have a dataframe, df, where I would like to fill down zeroes in a specific column where there are no values.
Data
id type count ppr
aa cool 12 9
aa hey 7
aa hi 12 7
bb no 7
bb yes 7
Desired
id type count ppr
aa cool 12 9
aa hey 0 7
aa hi 12 7
bb no 0 7
bb yes 0 7
Doing
df['count'] = df.fillna(0).astype(int)
However, this only gives output for a single column and not the full dataset.
Any suggestion is appreciated
If you have no values, maybe it's because you have an empty string:
>>> df['count']
0 12
1
2 12
3
4
Name: count, dtype: object # <- HERE
So, this should work:
df['count'] = df['count'].replace('', 0).astype(int)
>>> df
id type count ppr
0 aa cool 12 9
1 aa hey 0 7
2 aa hi 12 7
3 bb no 0 7
4 bb yes 0 7
df['count'] = df['count'].fillna(0).astype(int)
Here is the code that worked for me (and that should have been added to the question to make solving it easier):
import pandas as pd

df_sample = pd.DataFrame(
    [["day1", "day2", "day1", "day2", "day1", "day2"],
     [None, 160, None, 180, 110, None]]).T
df_sample.columns = ["day", "count"]
df_sample['count'] = df_sample['count'].fillna(0).astype(int)
print(df_sample)
From your sample code, fillna() works for a single column, so I suppose your "no values" are actually NaN values.
From your last paragraph, where you mention difficulty in applying this to the full dataset, I further suppose you want to apply it to multiple columns. As such, see below.
If you want to apply it to ALL numeric columns, changing them to integer type where possible, you can try:
df.loc[:, df.select_dtypes(include='number').columns] = df.select_dtypes(include='number').fillna(0, downcast='infer')
Demo
# Before conversion:
print(df)
id type count ppr
0 aa cool 12.0 9.0
1 aa hey NaN 7.0
2 aa hi 12.0 NaN
3 bb no NaN 7.0
4 bb yes NaN 7.0
df.loc[:, df.select_dtypes(include='number').columns] = df.select_dtypes(include='number').fillna(0, downcast='infer')
# After conversion:
print(df)
id type count ppr
0 aa cool 12 9
1 aa hey 0 7
2 aa hi 12 0
3 bb no 0 7
4 bb yes 0 7
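If the "no values" could be either empty strings or real NaNs (it depends on how the data was loaded), a sketch that covers both at once, using the df from the question:
# Coerce anything non-numeric (including '') to NaN, then fill and cast
df['count'] = (pd.to_numeric(df['count'], errors='coerce')
                 .fillna(0)
                 .astype(int))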

Drop duplicate rows, but only if column equals NaN

I only want to drop rows where two columns (ID, Code) are duplicates, but the third column (Descrip) is equal to NaN. My dataframe, df (shown below), reflects my initial dataframe, and df2 is what I want instead.
df:
ID Descrip Code
1 NaN CC
1 3 SS
2 4 CC
2 7 SS
3 NaN CC
3 1 CC
3 NaN SS
4 20 CC
4 22 SS
5 15 CC
5 10 SS
6 100 CC
6 NaN CC
6 4 SS
6 NaN SS
df2:
ID Descrip Code
1 NaN CC
1 3 SS
2 4 CC
2 7 SS
3 1 CC
3 NaN SS
4 20 CC
4 22 SS
5 15 CC
5 10 SS
6 100 CC
6 4 SS
I know using df.drop_duplicates(subset=['ID', 'Code'], keep='first') would remove the duplicate rows, but I only want this where Descrip is NaN.
You can use groupby and take the max value (max skips NaN, so any number wins over NaN):
df2 = df.groupby(["ID", "Code"])["Descrip"].max().reset_index()
I think you could use:
df = df[~(df.duplicated(['ID','Code'], False) & df['Descrip'].isna())]
Where (and I'll try my best to explain, to my understanding):
df.duplicated(['ID','Code'], False) - returns a boolean Series that is True for every duplicated row in the subset ID and Code; passing False for keep marks all duplicates rather than keeping a first or last occurrence. Documentation here.
df['Descrip'].isna() - checks whether or not Descrip holds NaN. Documentation here.
df[~(....first point above .... & .... second point above ....)] - the tilde is the not operator, inverting the boolean mask, and the ampersand chains the two expressions together with a bitwise and, together filtering out the rows of interest. Documentation here.
Result:
ID Descrip Code
0 1 NaN CC
1 1 3 SS
2 2 4 CC
3 2 7 SS
5 3 1 CC
6 3 NaN SS
7 4 20 CC
8 4 22 SS
9 5 15 CC
10 5 10 SS
11 6 100 CC
13 6 4 SS
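For completeness, a self-contained check of the mask approach, with the frame rebuilt from the question:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID':      [1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6],
    'Descrip': [np.nan, 3, 4, 7, np.nan, 1, np.nan, 20, 22,
                15, 10, 100, np.nan, 4, np.nan],
    'Code':    ['CC', 'SS', 'CC', 'SS', 'CC', 'CC', 'SS', 'CC',
                'SS', 'CC', 'SS', 'CC', 'CC', 'SS', 'SS'],
})
# Drop rows that are (ID, Code) duplicates AND have NaN in Descrip
df2 = df[~(df.duplicated(['ID', 'Code'], keep=False) & df['Descrip'].isna())]
print(df2)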

Unique Value in Rows to Column Header with Conditions

I have the following data:
Name Date Class Attended
AB 15-02-2019 3
CD 15-02-2019 2
AB 19-02-2019 4
CD 19-02-2019 2
AB 15-02-2019 1
CD 19-02-2019 3
I need output like:
Name 15-02-2019 19-02-2019
AB 4 (3+1) 4
CD 2 5 (2+3)
Use groupby() and then pivot()
df.groupby(['Name', 'Date']).sum().reset_index().pivot(index='Name', columns='Date', values='Class Attended').reset_index()
Out[12]:
Date Name 15-02-2019 19-02-2019
0 AB 4 4
1 CD 2 5
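An equivalent single step, assuming the same df, is pivot_table, which aggregates the duplicate (Name, Date) pairs itself:
df.pivot_table(index='Name', columns='Date', values='Class Attended', aggfunc='sum').reset_index()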
