groupby results in empty dataframe - python

Issue:
groupby on the data below results in an empty DataFrame; I'm not sure how to fix it. Please help, thanks.
data:
business_group business_unit cost_center GL_code profit_center count
NaN a 12 12-09 09 1
NaN a 12 12-09 09 1
NaN b 23 23-87 87 1
NaN b 23 23-87 87 1
NaN b 34 34-76 76 1
groupby:
group_df = df.groupby(['business_group', 'business_unit','cost_center','GL_code','profit_center'],
as_index=False).count()
expected result:
business_group business_unit cost_center GL_code profit_center count
NaN a 12 12-09 09 2
NaN b 23 23-87 87 2
NaN b 34 34-76 76 1
result received:
Empty DataFrame
Columns: [business_group, business_unit, cost_center, GL_code, profit_center, count]
Index: []

That's because the NaN values in business_group are nulls, and groupby() drops NaN group keys by default. You can pass dropna=False to groupby():
group_df = df.groupby(['business_group', 'business_unit', 'cost_center', 'GL_code', 'profit_center'],
                      dropna=False, as_index=False).count()
Output:
business_group business_unit cost_center GL_code profit_center count
0 NaN a 12 12-09 9 2
1 NaN b 23 23-87 87 2
2 NaN b 34 34-76 76 1
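For reference, a self-contained sketch of the fix (assuming pandas 1.1 or newer, where the dropna parameter of groupby() was introduced):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'business_group': [np.nan] * 5,
    'business_unit': ['a', 'a', 'b', 'b', 'b'],
    'cost_center': [12, 12, 23, 23, 34],
    'GL_code': ['12-09', '12-09', '23-87', '23-87', '34-76'],
    'profit_center': ['09', '09', '87', '87', '76'],
    'count': [1] * 5,
})
# dropna=False keeps groups whose key contains NaN (here: business_group)
group_df = df.groupby(
    ['business_group', 'business_unit', 'cost_center', 'GL_code', 'profit_center'],
    dropna=False, as_index=False).count()
print(group_df)   # three groups, with counts 2, 2 and 1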

Related

Add the values of several columns when the number of columns exceeds 3 - Pandas

I have a pandas DataFrame with several columns of dates, invoice numbers and bill amounts. I would like to add the amounts of the later invoices to the 3rd one and change its invoice number to "1111".
Here is an example:
ID customer | Bill1 | Date 1 | ID Bill 1 | Bill2 | Date 2 | ID Bill 2 | Bill3 | Date3 | ID Bill 3 | Bill4 | Date 4 | ID Bill 4 | Bill5 | Date 5 | ID Bill 5
4 | 6 | 2000-10-04 | 1 | 45 | 2000-11-05 | 2 | 51 | 1999-12-05 | 3 | 23 | 2001-11-23 | 6 | 76 | 2011-08-19 | 12
6 | 8 | 2016-05-03 | 7 | 39 | 2017-08-09 | 8 | 38 | 2018-07-14 | 17 | 21 | 2009-05-04 | 9 | Nan | Nan | Nan
12 | 14 | 2016-11-16 | 10 | 73 | 2017-05-04 | 15 | Nan | Nan | Nan | Nan | Nan | Nan | Nan | Nan | Nan
And I would like to get this:
ID customer | Bill1 | Date 1 | ID Bill 1 | Bill2 | Date 2 | ID Bill 2 | Bill3 | Date3 | ID Bill 3
4 | 6 | 2000-10-04 | 1 | 45 | 2000-11-05 | 2 | 150 | 1999-12-05 | 1111
6 | 8 | 2016-05-03 | 7 | 39 | 2017-08-09 | 8 | 59 | 2018-07-14 | 1111
12 | 14 | 2016-11-16 | 10 | 73 | 2017-05-04 | 15 | Nan | Nan | Nan
This example is a sample of my data; I may have many more than 5 columns.
Thanks for your help.
With a little data manipulation, you should be able to do it like this:
import numpy as np
import pandas as pd

df = df.replace('Nan', np.nan)
idx_col_bill3 = 7   # positional index of the 'Bill3' column
step = 3            # Bill / Date / ID Bill blocks repeat every 3 columns
n_keep = 10         # keep everything up to and including 'ID Bill 3'
cols = df.columns
bill_cols = cols[range(idx_col_bill3, len(cols), step)]   # Bill3, Bill4, Bill5, ...
bills = df[bill_cols].sum(axis=1)    # NaNs are skipped when summing
bills = bills.replace(0, np.nan)     # rows with no bills at all stay NaN
df = df[cols[range(n_keep)]].copy()  # drop the Bill4/Bill5 blocks
df['Bill3'] = bills
df.loc[df['ID Bill 3'].notna(), 'ID Bill 3'] = 1111
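On the sample above this should give Bill3 = 150, 59 and NaN, and set ID Bill 3 to 1111 for the first two rows, matching the expected output. One assumption worth flagging: the bill columns need to be numeric for sum(axis=1) to add them; if they came out of the file as strings, coerce them first, e.g. df[bill_cols] = df[bill_cols].apply(pd.to_numeric, errors='coerce').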

I want to change the column value for a specific index

df = pd.read_csv('test.txt',dtype=str)
print(df)
HE WE
0 aa NaN
1 181 76
2 22 13
3 NaN NaN
I want to overwrite this DataFrame at the following indexes with the values from another DataFrame:
dff = pd.DataFrame({'HE' : [100,30]},index=[1,2])
print(dff)
HE
1 100
2 30
for i in dff.index:
    df._set_value(i, 'HE', dff._get_value(i, 'HE'))
print(df)
HE WE
0 aa NaN
1 100 76
2 30 13
3 NaN NaN
Is there a way to change it all at once without using 'for'?
Use DataFrame.update (it works in place):
df.update(dff)
print (df)
HE WE
0 aa NaN
1 100 76.0
2 30 13.0
3 NaN NaN
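Note that DataFrame.update aligns on both the index and the column labels, works in place, and by default only non-NA values from dff overwrite df, so every other cell is left untouched. A minimal self-contained sketch of the same idea:
import numpy as np
import pandas as pd

df = pd.DataFrame({'HE': ['aa', '181', '22', np.nan],
                   'WE': [np.nan, '76', '13', np.nan]})
dff = pd.DataFrame({'HE': [100, 30]}, index=[1, 2])
df.update(dff)   # rows 1 and 2 of 'HE' become 100 and 30; nothing else changes
print(df)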

How do I split cell contents that contain "=" so that the left part becomes column titles and the right part the row values, using pandas

Origin:
0 1
PASS AC=24;AF=1;AN=24;DP=39;ExcessHet=3.0103
PASS AC=14;AF=1;AN=14;FS=0;MLEAC=2
What I want:
0 AC AF AN DP ExcessHet FS MLEAC
PASS 24 1 24 39 3.0103 NAN NAN
PASS 14 1 14 NAN NAN 0 2
Thanks!
IIUC, str.split + explode, then unstack and concat:
# one key=value pair per row, split into key (column 0) and value (column 1)
m = df['1'].str.split(';').explode().str.split('=', expand=True)
# group the values by key and original row, then pivot the keys into columns
m1 = m.groupby([m[0], m.index])[1].agg(list).apply(pd.Series).unstack(0)
m1.columns = m1.columns.get_level_values(1)
df1 = pd.concat([df, m1], axis=1).drop('1', axis=1)
print(df1)
0 AC AF AN DP ExcessHet FS MLEAC
0 PASS 24 1 24 39 3.0103 NaN NaN
1 PASS 14 1 14 NaN NaN 0 2
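If the chained one-liner is hard to follow, an alternative sketch (assuming the columns really are named '0' and '1' as above, and every cell is a well-formed ';'-separated list of key=value pairs) is to parse each cell into a dict and expand it:
# build one dict per row, e.g. {'AC': '24', 'AF': '1', ...}
parsed = df['1'].str.split(';').apply(lambda kvs: dict(kv.split('=') for kv in kvs))
# expand the dicts into columns; keys missing from a row become NaN
wide = pd.DataFrame(parsed.tolist(), index=df.index)
df1 = pd.concat([df['0'], wide], axis=1)
print(df1)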

Resolve complementary missing values between rows

I have a df that looks like this
Day ID ID_2 AS D E AS1 D1 E1
29 72 Participant 1 PS 6 42 NaN NaN NaN NaN NaN NaN
35 78 Participant 1 NaN yes 3 no 2 no 2
49 22 Participant 2 PS 1 89 NaN NaN NaN NaN NaN NaN
85 18 Participant 2 NaN yes 3 no 2 no 2
I'm looking for a way to add the ID_2 column value to all rows where ID matches (i.e., for Participant 1, fill in the NaN values with the values from the other row where ID=Participant 1). I've looked into using combine but that doesn't seem to work for this particular case.
Expected output:
Day ID ID_2 AS D E AS1 D1 E1
29 72 Participant 1 PS 6 42 yes 3 no 2 no 2
35 78 Participant 1 PS 6 42 yes 3 no 2 no 2
49 22 Participant 2 PS 1 89 yes 3 no 2 no 2
85 18 Participant 2 PS 1 89 yes 3 no 2 no 2
or
Day ID ID_2 AS D E AS1 D1 E1
29 72 Participant 1 PS 6 42 NaN NaN NaN NaN NaN NaN
35 78 Participant 1 PS 6 42 yes 3 no 2 no 2
49 22 Participant 2 PS 1 89 NaN NaN NaN NaN NaN NaN
85 18 Participant 2 PS 1 89 yes 3 no 2 no 2
I think you could try
df.ID_2 = df.groupby('ID').ID_2.ffill()
# 29 PS 6 42
# 35 PS 6 42
# 49 PS 1 89
# 85 PS 1 89
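If you also want the AS ... E1 columns filled into the first row of each participant (your first expected output), the same idea extends to those columns by filling in both directions within each group; a sketch along these lines should work:
cols = ['ID_2', 'AS', 'D', 'E', 'AS1', 'D1', 'E1']
df[cols] = df.groupby('ID')[cols].transform(lambda s: s.ffill().bfill())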
Not tested, but something like this should work - can't copy your df into my browser.
print(df)
Day ID ID_2 AS D E AS1 D1 E1
0 72 Participant_1 PS_6_42 NaN NaN NaN NaN NaN NaN
1 78 Participant_1 NaN yes 3.0 no 2.0 no 2.0
2 22 Participant_2 PS_1_89 NaN NaN NaN NaN NaN NaN
3 18 Participant_2 NaN yes 3.0 no 2.0 no 2.0
df2 = df.set_index('ID').groupby('ID').transform('ffill').transform('bfill').reset_index()
print(df2)
ID Day ID_2 AS D E AS1 D1 E1
0 Participant_1 72 PS_6_42 yes 3 no 2 no 2
1 Participant_1 78 PS_6_42 yes 3 no 2 no 2
2 Participant_2 22 PS_1_89 yes 3 no 2 no 2
3 Participant_2 18 PS_1_89 yes 3 no 2 no 2
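One caveat with this version, if I'm reading it right: the second .transform('bfill') runs on the whole frame rather than per group, so if a participant had no value at all in some column it could pull in the next participant's value; keeping both fills inside the groupby (as in the per-group sketch above) avoids that.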

How to move every element in a column by n range in a dataframe using python?

I have a dataframe df that looks like below:
No A B value
1 23 36 1
2 45 23 1
3 34 12 2
4 22 76 NaN
...
I would like to shift each of the values in the "value" column by 2, i.e. leave two NaN rows between consecutive values, and the first row's "value" should not be shifted.
I have already tried the normal shift, which simply shifts everything down by 2:
df['value']=df['value'].shift(2)
I expect the result below:
No A B value
1 23 36 1
2 45 23 NaN
3 34 12 NaN
4 22 76 1
5 10 12 NaN
6 34 2 NaN
7 21 11 2
...
In your case
df['Newvalue']=pd.Series(df.value.values,index=np.arange(len(df))*3)
df
Out[41]:
No A B value Newvalue
0 1 23 36 1.0 1.0
1 2 45 23 1.0 NaN
2 3 34 12 2.0 NaN
3 4 22 76 NaN 1.0
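The trick here is that assigning a Series to a DataFrame column aligns on the index: the original values are re-indexed onto positions 0, 3, 6, ..., so each value lands three rows apart with NaN filling the gaps (this assumes the frame has the default 0..n-1 RangeIndex). A sketch generalising it to an arbitrary gap:
import numpy as np
import pandas as pd

gap = 2   # number of NaN rows wanted between consecutive values
# index labels beyond the end of the frame are simply dropped during alignment
df['Newvalue'] = pd.Series(df['value'].values, index=np.arange(len(df)) * (gap + 1))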
