How to count number of rows dropped in a pandas dataframe - python

How do I print the number of rows dropped while executing the following code in Python:
df.dropna(inplace = True)

# making new data frame with dropped NA values
new_data = data.dropna(axis = 0, how ='any')
len(new_data)

Use:
np.random.seed(2022)
df = pd.DataFrame(np.random.choice([0,np.nan, 1], size=(10, 3)))
print (df)
0 1 2
0 NaN 0.0 NaN
1 0.0 NaN NaN
2 0.0 0.0 1.0
3 0.0 0.0 NaN
4 NaN NaN 1.0
5 1.0 0.0 0.0
6 1.0 0.0 1.0
7 NaN 0.0 1.0
8 1.0 1.0 NaN
9 1.0 0.0 NaN
You can count the rows containing missing values before dropping them, using DataFrame.isna with DataFrame.any and sum:
count = df.isna().any(axis=1).sum()
df.dropna(inplace = True)
print (df)
0 1 2
2 0.0 0.0 1.0
5 1.0 0.0 0.0
6 1.0 0.0 1.0
print (count)
7
Or take the difference between the DataFrame's length before and after dropna:
orig = df.shape[0]
df.dropna(inplace = True)
count = orig - df.shape[0]
print (df)
0 1 2
2 0.0 0.0 1.0
5 1.0 0.0 0.0
6 1.0 0.0 1.0
print (count)
7
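If this comes up often, you can wrap either approach in a small helper. A minimal sketch (the name dropna_report is ours, not a pandas API):

import numpy as np
import pandas as pd

def dropna_report(df, **kwargs):
    # Hypothetical helper: drop NaN rows and report how many were removed.
    before = len(df)
    out = df.dropna(**kwargs)
    return out, before - len(out)

np.random.seed(2022)
df = pd.DataFrame(np.random.choice([0, np.nan, 1], size=(10, 3)))
df, dropped = dropna_report(df)
print(dropped)  # 7 with the seed above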

Related

Zero Division in one row causing error in all other rows Pandas

Considering the two columns of this df:
ColA ColB
0 0.0 0.0
1 5.523288 0.0
2 6.115068 0.0
3 6.904110 1.0
4 9.172603 2.0
5 10.849315 2.0
6 12.230137 0.0
7 11.210959 0.0
8 10.849315 1.0
9 2.169863 0.0
Attempting to generate a calculated column in this way:
df['Result'] = df['ColB']/df['ColA']
This attempt raises a 'float division by zero' error, as expected, because the calculation in the first row divides by zero. To navigate around that temporarily, I did this:
try:
    df['Result'] = df['ColB'] / df['ColA']
except ZeroDivisionError:
    df['Result'] = 0
However, this code consistently produces the following result (i.e. all rows are zeros):
ColA ColB Result
0 0.0 0.0 0
1 5.523288 0.0 0
2 6.115068 0.0 0
3 6.90411 1.0 0
4 9.172603 2.0 0
5 10.849315 2.0 0
6 12.230137 0.0 0
7 11.210959 0.0 0
8 10.849315 1.0 0
9 2.169863 0.0 0
Starting at index row 3, the Result column should contain float values that are not merely zero. I also set the except branch to assign "some error", and every value in the Result column then displayed "some error".
I am at a loss as to why pandas is not bypassing the error and producing valid results in the appropriate rows.
Try this -
df['Result'] = (df['ColB']/df['ColA']).fillna(0)
ColA ColB Result
0 0.000000 0.0 0.000000
1 5.523288 0.0 0.000000
2 6.115068 0.0 0.000000
3 6.904110 1.0 0.144841
4 9.172603 2.0 0.218041
5 10.849315 2.0 0.184343
6 12.230137 0.0 0.000000
7 11.210959 0.0 0.000000
8 10.849315 1.0 0.092172
9 2.169863 0.0 0.000000
Check out the fillna documentation for details.
Regarding this -
try:
    df['Result'] = df['ColB'] / df['ColA']
except ZeroDivisionError:
    df['Result'] = 0
I am actually not able to reproduce the result that you are facing. Here is what I get, as expected:
ColA ColB Result
0 0.000000 0.0 NaN #<- 0/0 still yields NaN
1 5.523288 0.0 0.000000 #<- divide by 0
2 6.115068 0.0 0.000000
3 6.904110 1.0 0.144841
4 9.172603 2.0 0.218041
5 10.849315 2.0 0.184343
6 12.230137 0.0 0.000000 #<- divide by 0
7 11.210959 0.0 0.000000
8 10.849315 1.0 0.092172
9 2.169863 0.0 0.000000
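Worth noting: element-wise division of float64 Series goes through NumPy, which returns NaN for 0/0 and ±inf for x/0 instead of raising ZeroDivisionError; the exception typically only appears with object-dtype columns holding plain Python numbers, which may be what happened in the question. A sketch that cleans up both cases, assuming float64 columns:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ColA': [0.0, 5.523288, 6.115068, 6.904110],
                   'ColB': [0.0, 0.0, 0.0, 1.0]})

# float64 division never raises: 0/0 -> NaN, x/0 -> +/-inf.
result = df['ColB'] / df['ColA']
df['Result'] = result.replace([np.inf, -np.inf], np.nan).fillna(0)
print(df)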

Best way to reassemble a pandas data frame

I need to reassemble a data frame that is the result of a group-by operation. It is assumed to be ordered.
Major Minor RelType SomeNulls
0 0.0 0.0 1 1.0
1 NaN NaN 2 NaN
2 1.0 1.0 1 NaN
3 NaN NaN 2 NaN
4 NaN NaN 3 NaN
5 2.0 3.0 1 NaN
6 NaN NaN 2 2.0
And looking for something like this
Major Minor RelType SomeNulls
0 0.0 0.0 1 1.0
1 0.0 0.0 2 NaN
2 1.0 1.0 1 NaN
3 1.0 1.0 2 NaN
4 1.0 1.0 3 NaN
5 2.0 3.0 1 NaN
6 2.0 3.0 2 2.0
Wondering if there is an elegant way to resolve it.
import pandas as pd
import numpy as np

def refill_frame(df, cols):
    # Repeatedly pull values down one row until no NaN remains
    # in the given columns.
    while df[cols].isnull().values.any():
        for col in cols:
            if col in list(df):
                df[col] = np.where(df[col].isnull(), df[col].shift(1), df[col])
    return df

df = pd.DataFrame({'Major': [0, None, 1, None, None, 2, None],
                   'Minor': [0, None, 1, None, None, 3, None],
                   'RelType': [1, 2, 1, 2, 3, 1, 2],
                   'SomeNulls': [1, None, None, None, None, None, 2]})
print(df)

cols2fill = ['Major', 'Minor']
df = refill_frame(df, cols2fill)
print(df)
If I understand the question correctly, you could do a transform on the specific columns:
df.loc[:, ['Major', 'Minor']] = df.loc[:, ['Major', 'Minor']].transform('ffill')
Major Minor RelType SomeNulls
0 0.0 0.0 1 1.0
1 0.0 0.0 2 NaN
2 1.0 1.0 1 NaN
3 1.0 1.0 2 NaN
4 1.0 1.0 3 NaN
5 2.0 3.0 1 NaN
6 2.0 3.0 2 2.0
You could also use the fill_direction function from pyjanitor:
# pip install pyjanitor
import janitor
df.fill_direction({"Major":"down", "Minor":"down"})
Major Minor RelType SomeNulls
0 0.0 0.0 1 1.0
1 0.0 0.0 2 NaN
2 1.0 1.0 1 NaN
3 1.0 1.0 2 NaN
4 1.0 1.0 3 NaN
5 2.0 3.0 1 NaN
6 2.0 3.0 2 2.0
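For completeness, plain ffill on the column subset gives the same result without transform or an extra dependency. A minimal sketch:

import pandas as pd

df = pd.DataFrame({'Major': [0, None, 1, None, None, 2, None],
                   'Minor': [0, None, 1, None, None, 3, None],
                   'RelType': [1, 2, 1, 2, 3, 1, 2],
                   'SomeNulls': [1, None, None, None, None, None, 2]})

# Forward-fill only the key columns; SomeNulls is left untouched.
df[['Major', 'Minor']] = df[['Major', 'Minor']].ffill()
print(df)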

How to merge rows with combination of values in a DataFrame

I have a DataFrame (df1) as given below
Hair Feathers Legs Type Count
R1 1 NaN 0 1 1
R2 1 0 NaN 1 32
R3 1 0 2 1 4
R4 1 NaN 4 1 27
I want to merge rows based by different combinations of the values in each column and also want to add the count values for each merged row. The resultant dataframe(df2) will look like this:
Hair Feathers Legs Type Count
R1 1 0 0 1 33
R2 1 0 2 1 36
R3 1 0 4 1 59
The merging is performed in such a way that any NaN value is merged with a 0 or a 1. In df2, R1 is obtained by merging the NaN value of Feathers (df1, R1) with the 0 value of Feathers (df1, R2). Similarly, the 0 value of Legs (df1, R1) is merged with the NaN value of Legs (df1, R2), and the counts of R1 (1) and R2 (32) are added. In the same manner, R2 and R3 are merged because the Feathers value in R2 (df1) is the same as in R3 (df1), the NaN value of Legs is merged with the 2 in R3 (df1), and the counts of R2 (32) and R3 (4) are added.
I hope the explanation makes sense. Any help will be highly appreciated
A possible way to do it is to replicate each row containing NaN and fill it with the possible values for each column.
First, we need to get the possible not-null unique values per column:
unique_values = df.iloc[:, :-1].apply(
    lambda x: x.dropna().unique().tolist(), axis=0).to_dict()
> unique_values
{'Hair': [1.0], 'Feathers': [0.0], 'Legs': [0.0, 2.0, 4.0], 'Type': [1.0]}
Then iterate through each row of the dataframe and replace each NaN by the possible values for each column. We can do this using pandas.DataFrame.iterrows:
mask = df.iloc[:, :-1].isnull().any(axis=1)

# Keep the rows that do not contain NaN,
# then append the modified rows
list_of_df = [r for i, r in df[~mask].iterrows()]

for row_index, row in df[mask].iterrows():
    for c in row[row.isnull()].index:
        # For each null column of the row, replace
        # NaN by each possible value for the column
        for v in unique_values[c]:
            list_of_df.append(row.copy().fillna({c: v}))

df_res = pd.concat(list_of_df, axis=1, ignore_index=True).T
The result is a dataframe where all the NaN have been filled with possible values for the column:
> df_res
Hair Feathers Legs Type Count
0 1.0 0.0 2.0 1.0 4.0
1 1.0 0.0 0.0 1.0 1.0
2 1.0 0.0 0.0 1.0 32.0
3 1.0 0.0 2.0 1.0 32.0
4 1.0 0.0 4.0 1.0 32.0
5 1.0 0.0 4.0 1.0 27.0
To get the final result of Count grouping by possible combinations of ['Hair', 'Feathers', 'Legs', 'Type'] we just need to do:
> df_res.groupby(['Hair', 'Feathers', 'Legs', 'Type']).sum().reset_index()
Hair Feathers Legs Type Count
0 1.0 0.0 0.0 1.0 33.0
1 1.0 0.0 2.0 1.0 36.0
2 1.0 0.0 4.0 1.0 59.0
Hope it serves
UPDATE
If one or more elements of a row are missing, the procedure should look for all possible combinations of the missing values at the same time. Let us add a new row with two elements missing:
> df
Hair Feathers Legs Type Count
0 1.0 NaN 0.0 1.0 1.0
1 1.0 0.0 NaN 1.0 32.0
2 1.0 0.0 2.0 1.0 4.0
3 1.0 NaN 4.0 1.0 27.0
4 1.0 NaN NaN 1.0 32.0
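(For reference, a minimal construction of this frame, with values taken from the printed table:)

import numpy as np
import pandas as pd

df = pd.DataFrame({'Hair':     [1.0, 1.0, 1.0, 1.0, 1.0],
                   'Feathers': [np.nan, 0.0, 0.0, np.nan, np.nan],
                   'Legs':     [0.0, np.nan, 2.0, 4.0, np.nan],
                   'Type':     [1.0, 1.0, 1.0, 1.0, 1.0],
                   'Count':    [1.0, 32.0, 4.0, 27.0, 32.0]})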
We will proceed in a similar way, but the replacement combinations will be obtained using itertools.product:
import itertools

unique_values = df.iloc[:, :-1].apply(
    lambda x: x.dropna().unique().tolist(), axis=0).to_dict()
mask = df.iloc[:, :-1].isnull().any(axis=1)
list_of_df = [r for i, r in df[~mask].iterrows()]

for row_index, row in df[mask].iterrows():
    cols = row[row.isnull()].index.tolist()
    for p in itertools.product(*[unique_values[c] for c in cols]):
        list_of_df.append(row.copy().fillna({c: v for c, v in zip(cols, p)}))

df_res = pd.concat(list_of_df, axis=1, ignore_index=True).T
> df_res.sort_values(['Hair', 'Feathers', 'Legs', 'Type']).reset_index(drop=True)
Hair Feathers Legs Type Count
1 1.0 0.0 0.0 1.0 1.0
2 1.0 0.0 0.0 1.0 32.0
6 1.0 0.0 0.0 1.0 32.0
0 1.0 0.0 2.0 1.0 4.0
3 1.0 0.0 2.0 1.0 32.0
7 1.0 0.0 2.0 1.0 32.0
4 1.0 0.0 4.0 1.0 32.0
5 1.0 0.0 4.0 1.0 27.0
8 1.0 0.0 4.0 1.0 32.0
> df_res.groupby(['Hair', 'Feathers', 'Legs', 'Type']).sum().reset_index()
Hair Feathers Legs Type Count
0 1.0 0.0 0.0 1.0 65.0
1 1.0 0.0 2.0 1.0 68.0
2 1.0 0.0 4.0 1.0 91.0

How to combine dataframe rows

I have the following code:
import os
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile

fileName = input("Enter file name here (Case Sensitive) > ")
df = pd.read_excel(fileName + '.xlsx', sheetname=None, ignore_index=True)
xl = pd.ExcelFile(fileName + '.xlsx')
SystemCount = len(xl.sheet_names)

df1 = pd.DataFrame([])
for y in range(1, int(SystemCount) + 1):
    df = pd.read_excel(xl, 'System ' + str(y))
    df['System {0}'.format(y)] = "1"
    df1 = df1.append(df)

df1 = df1.sort_values(['Email'])
df = df1['Email'].value_counts()
df1['Count'] = df1.groupby('Email')['Email'].transform('count')
print(df1)
Which prints something like this:
Email System 1 System 2 System 3 System 4 Count
test_1_#test.com NaN 1 NaN NaN 1
test_2_#test.com NaN NaN 1 NaN 3
test_2_#test.com 1 NaN NaN NaN 3
test_2_#test.com NaN NaN NaN 1 3
test_3_#test.com NaN 1 NaN NaN 1
test_4_#test.com NaN NaN 1 NaN 1
test_5_#test.com 1 NaN NaN NaN 3
test_5_#test.com NaN NaN 1 NaN 3
test_5_#test.com NaN NaN NaN 1 3
How do I combine this, so the email only shows once, with all marked systems?
I would like the output to look like this:
System1 System2 System3 System4 Count
Email
test_1_#test.com 0.0 1.0 0.0 0.0 1
test_2_#test.com 1.0 0.0 1.0 1.0 3
test_3_#test.com 0.0 1.0 0.0 0.0 1
test_4_#test.com 0.0 0.0 1.0 0.0 1
test_5_#test.com 1.0 0.0 1.0 1.0 3
If I understand it correctly:
df1 = df1.apply(lambda x: pd.to_numeric(x, errors='ignore'))
d = dict(zip(df1.columns[1:],
             ['sum'] * df1.columns[1:].str.contains('System').sum() + ['first']))
df1.fillna(0).groupby('Email').agg(d)
Out[95]:
System1 System2 System3 System4 Count
Email
test_1_#test.com 0.0 1.0 0.0 0.0 1
test_2_#test.com 1.0 0.0 1.0 1.0 3
test_3_#test.com 0.0 1.0 0.0 0.0 1
test_4_#test.com 0.0 0.0 1.0 0.0 1
test_5_#test.com 1.0 0.0 1.0 1.0 3
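The dict(zip(...)) line builds a column-to-aggregation mapping; the same spec can be written more explicitly. A sketch with a small stand-in frame (emails and values assumed):

import numpy as np
import pandas as pd

# Stand-in for df1 above, reduced to two systems.
df1 = pd.DataFrame({'Email': ['a#test.com', 'b#test.com', 'b#test.com'],
                    'System 1': [np.nan, 1.0, np.nan],
                    'System 2': [1.0, np.nan, 1.0],
                    'Count': [1, 2, 2]})

# Sum every System column per email; keep the first Count.
sys_cols = [c for c in df1.columns if c.startswith('System')]
agg_spec = {**{c: 'sum' for c in sys_cols}, 'Count': 'first'}
print(df1.fillna(0).groupby('Email').agg(agg_spec))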
It'd be easier to get help if you would post code to generate your input data.
But you probably want a GroupBy:
df2 = df1.groupby('Email').sum()

How to pivot with binning with complicated condition in pandas

I have a dataframe like the one below:
age type days
1 a 1
2 b 3
2 b 4
3 a 5
4 b 2
6 c 1
7 f 0
7 d 4
10 e 2
14 a 1
First I would like to bin by age:
age
[0~4]
age type days
1 a 1
2 b 3
2 b 4
3 a 5
4 b 2
Then sum and count days, grouping by type:
sum count
a 6 2
b 9 3
c 0 0
d 0 0
e 0 0
f 0 0
Then I would like to apply this method to the other bins: [5~9] and [10~14].
My desired result is below
[0~4] [5~9] [10~14]
sum count sum count sum count
a 6 2 0 0 1 1
b 9 3 0 0 0 0
c 0 0 1 1 0 0
d 0 0 4 1 0 0
e 0 0 0 0 2 1
f 0 0 0 1 0 0
How can this be done?
It is very complicated for me.
Consider a pivot_table with pd.cut if you do not care too much about column ordering, since count and sum are not paired together under each bin. With some manipulation you can change that ordering (see the sketch after the output below).
df['bin'] = pd.cut(df.age, [0, 4, 9, 14])
pvtdf = df.pivot_table(index='type', columns=['bin'], values='days',
                       aggfunc=['count', 'sum']).fillna(0)
# count sum
# bin (0, 4] (4, 9] (9, 14] (0, 4] (4, 9] (9, 14]
# type
# a 2.0 0.0 1.0 6.0 0.0 1.0
# b 3.0 0.0 0.0 9.0 0.0 0.0
# c 0.0 1.0 0.0 0.0 1.0 0.0
# d 0.0 1.0 0.0 0.0 4.0 0.0
# e 0.0 0.0 1.0 0.0 0.0 2.0
# f 0.0 1.0 0.0 0.0 0.0 0.0
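The manipulation mentioned above can be a level swap on the columns. A sketch continuing from pvtdf:

# Swap the (aggfunc, bin) column levels so each bin carries its own
# count/sum pair, then sort the columns to group them by bin.
pvtdf = pvtdf.swaplevel(axis=1).sort_index(axis=1)
print(pvtdf)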
We'll use some stacking and groupby operations to get us to the desired output.
import io
import numpy as np
import pandas as pd

string_ = io.StringIO('''age type days
1 a 1
2 b 3
2 b 4
3 a 5
4 b 2
6 c 1
7 f 0
7 d 4
10 e 2
14 a 1''')
df = pd.read_csv(string_, sep='\s+')
df['age_bins'] = pd.cut(df['age'], [0, 4, 9, 14])

# Aggregate the sum of days and a row count per (bin, type),
# then move the aggregation labels into the row index.
df_stacked = df.groupby(['age_bins', 'type']).agg(
    {'days': np.sum, 'type': 'count'}).transpose().stack().fillna(0)
df_stacked.rename(index={'days': 'sum', 'type': 'count'}, inplace=True)
>>> df_stacked
age_bins (0, 4] (4, 9] (9, 14]
type
sum a 6.0 0.0 1.0
b 9.0 0.0 0.0
c 0.0 1.0 0.0
d 0.0 4.0 0.0
e 0.0 0.0 2.0
f 0.0 0.0 0.0
count a 2.0 0.0 1.0
b 3.0 0.0 0.0
c 0.0 1.0 0.0
d 0.0 1.0 0.0
e 0.0 0.0 1.0
f 0.0 1.0 0.0
This doesn't produce the exact output you listed, but it's similar, and I think it will be easier to index and retrieve data from. Alternatively, you could use the following to get something closer to the desired output:
>>> df_stacked.unstack(level=0)
age_bins (0, 4] (4, 9] (9, 14]
count sum count sum count sum
type
a 2.0 6.0 0.0 0.0 1.0 1.0
b 3.0 9.0 0.0 0.0 0.0 0.0
c 0.0 0.0 1.0 1.0 0.0 0.0
d 0.0 0.0 1.0 4.0 0.0 0.0
e 0.0 0.0 0.0 0.0 1.0 2.0
f 0.0 0.0 1.0 0.0 0.0 0.0
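Another route that lands close to the desired layout is to group on the cut directly and unstack the bins into columns. A sketch assuming the same data (observed=False keeps the empty type/bin combinations as zeros):

import pandas as pd

df = pd.DataFrame({'age':  [1, 2, 2, 3, 4, 6, 7, 7, 10, 14],
                   'type': list('abbabcfdea'),
                   'days': [1, 3, 4, 5, 2, 1, 0, 4, 2, 1]})

bins = pd.cut(df['age'], [0, 4, 9, 14])
out = (df.groupby([bins, 'type'], observed=False)['days']
         .agg(['sum', 'count'])
         .fillna(0)
         .unstack(level=0, fill_value=0)   # bins become a column level
         .swaplevel(axis=1)
         .sort_index(axis=1))              # pair (count, sum) under each bin
print(out)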
