Best way to reassemble a pandas data frame - python

I need to reassemble a data frame that is the result of a group-by operation. It can be assumed to be ordered.
Major Minor RelType SomeNulls
0 0.0 0.0 1 1.0
1 NaN NaN 2 NaN
2 1.0 1.0 1 NaN
3 NaN NaN 2 NaN
4 NaN NaN 3 NaN
5 2.0 3.0 1 NaN
6 NaN NaN 2 2.0
And I am looking for something like this:
Major Minor RelType SomeNulls
0 0.0 0.0 1 1.0
1 0.0 0.0 2 NaN
2 1.0 1.0 1 NaN
3 1.0 1.0 2 NaN
4 1.0 1.0 3 NaN
5 2.0 3.0 1 NaN
6 2.0 3.0 2 2.0
Wondering if there is an elegant way to resolve it. My current approach:
import pandas as pd
import numpy as np
def refill_frame(df, cols):
    while df[cols].isnull().values.any():
        for col in cols:
            if col in list(df):
                # print(col)
                df[col] = np.where(df[col].isnull(), df[col].shift(1), df[col])
    return df

df = pd.DataFrame({'Major': [0, None, 1, None, None, 2, None],
                   'Minor': [0, None, 1, None, None, 3, None],
                   'RelType': [1, 2, 1, 2, 3, 1, 2],
                   'SomeNulls': [1, None, None, None, None, None, 2]
                   })
print (df)
cols2fill =['Major', 'Minor']
df = refill_frame(df, cols2fill)
print (df)

If I understand the question correctly, you could do a transform on the specific columns:
df.loc[:, ['Major', 'Minor']] = df.loc[:, ['Major', 'Minor']].transform('ffill')
Major Minor RelType SomeNulls
0 0.0 0.0 1 1.0
1 0.0 0.0 2 NaN
2 1.0 1.0 1 NaN
3 1.0 1.0 2 NaN
4 1.0 1.0 3 NaN
5 2.0 3.0 1 NaN
6 2.0 3.0 2 2.0
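Equivalently, since the transform is just a forward fill, you could call ffill directly on the sliced columns (a minimal sketch, using the column names from the question):
df[['Major', 'Minor']] = df[['Major', 'Minor']].ffill()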
You could also use the fill_direction function from pyjanitor:
# pip install pyjanitor
import janitor
df.fill_direction({"Major":"down", "Minor":"down"})
Major Minor RelType SomeNulls
0 0.0 0.0 1 1.0
1 0.0 0.0 2 NaN
2 1.0 1.0 1 NaN
3 1.0 1.0 2 NaN
4 1.0 1.0 3 NaN
5 2.0 3.0 1 NaN
6 2.0 3.0 2 2.0

Related

Drop nan of each column in Pandas DataFrame

I have a dataframe as an example:
      A    B    C
0     1
1     1
2     1
3     1    2
4     1    2
5     1    2
6          2    3
7          2    3
8          2    3
9               3
10              3
11              3
And I would like to remove nan values of each column to get the result:
A B C
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 1 2 3
Do I have an easy way to do that?
You can apply a custom sorting function for each column that doesn't actually sort numerically; it just moves all the NaN values to the end of the column. Then, dropna:
df = df.apply(lambda x: sorted(x, key=lambda v: isinstance(v, float) and np.isnan(v))).dropna()
Output:
>>> df
A B C
0 1.0 2.0 3.0
1 1.0 2.0 3.0
2 1.0 2.0 3.0
3 1.0 2.0 3.0
4 1.0 2.0 3.0
5 1.0 2.0 3.0
Given
>>> df
A B C
0 1.0 NaN NaN
1 1.0 NaN NaN
2 1.0 NaN NaN
3 1.0 2.0 NaN
4 1.0 2.0 NaN
5 1.0 2.0 NaN
6 NaN 2.0 3.0
7 NaN 2.0 3.0
8 NaN 2.0 3.0
9 NaN NaN 3.0
10 NaN NaN 3.0
11 NaN NaN 3.0
use
>>> df.apply(lambda s: s.dropna().to_numpy())
A B C
0 1.0 2.0 3.0
1 1.0 2.0 3.0
2 1.0 2.0 3.0
3 1.0 2.0 3.0
4 1.0 2.0 3.0
5 1.0 2.0 3.0
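The apply above assumes every column has the same number of non-null values; if they differ, returning a pandas Series instead of a bare array pads the shorter columns with NaN rather than ending up with a ragged result (a sketch, not part of the original answer):
df.apply(lambda s: pd.Series(s.dropna().to_numpy()))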

Replacing NAN value in a pandas dataframe from values in other records of same group

I have a dataframe df
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [np.nan, 1, 2,np.nan,2,np.nan,np.nan],
'B': [10, np.nan, np.nan,5,np.nan,np.nan,7],
'C': [1,1,2,2,3,3,3]})
which looks like:
A B C
0 NaN 10.0 1
1 1.0 NaN 1
2 2.0 NaN 2
3 NaN 5.0 2
4 2.0 NaN 3
5 NaN NaN 3
6 NaN 7.0 3
I want to replace all the NaN values in columns A and B with values from other records in the same group, as defined by column C.
My expected output is :
A B C
0 1.0 10.0 1
1 1.0 10.0 1
2 2.0 5.0 2
3 2.0 5.0 2
4 2.0 7.0 3
5 2.0 7.0 3
6 2.0 7.0 3
How can I do this with a pandas dataframe?
Use GroupBy.apply, forward- and back-filling the missing values within each group:
df[['A','B']] = df.groupby('C')[['A','B']].apply(lambda x: x.ffill().bfill())
print (df)
A B C
0 1.0 10.0 1
1 1.0 10.0 1
2 2.0 5.0 2
3 2.0 5.0 2
4 2.0 7.0 3
5 2.0 7.0 3
6 2.0 7.0 3
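The same result can be written with transform, which keeps the fill inside each group without an explicit apply (a sketch using the same column names):
df[['A','B']] = df.groupby('C')[['A','B']].transform(lambda x: x.ffill().bfill())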

How to reset cumulative sum every time there is a NaN in a pandas dataframe?

If I have a Pandas data frame like this:
1 2 3 4 5 6 7
1 NaN 1 1 1 NaN 1 1
2 NaN NaN 1 1 1 1 1
3 NaN NaN NaN 1 NaN 1 1
4 1 1 NaN NaN 1 1 NaN
How do I do a cumulative sum such that the count resets every time there is a NaN value in the row? Such that I get something like this:
1 2 3 4 5 6 7
1 NaN 1 2 3 NaN 1 2
2 NaN NaN 1 2 3 4 5
3 NaN NaN NaN 1 NaN 1 2
4 1 2 NaN NaN 1 2 NaN
You could do:
# compute mask where np.nan = True
mask = pd.isna(df).astype(bool)
# compute cumsum across rows fillna with ffill
cumulative = df.cumsum(1).fillna(method='ffill', axis=1).fillna(0)
# get the values of cumulative where nan is True use the same method
restart = cumulative[mask].fillna(method='ffill', axis=1).fillna(0)
# set the result
result = (cumulative - restart)
result[mask] = np.nan
# display the result
print(result)
Output
1 2 3 4 5 6 7
0 NaN 1.0 2.0 3.0 NaN 1.0 2.0
1 NaN NaN 1.0 2.0 3.0 4.0 5.0
2 NaN NaN NaN 1.0 NaN 1.0 2.0
3 1.0 2.0 NaN NaN 1.0 2.0 NaN
You can do it with stack and unstack:
s = df.stack(dropna=False).isnull().cumsum()
df = df.where(df.isnull(), s.groupby(s).cumcount().unstack())
df
Out[86]:
1 2 3 4 5 6 7
1 NaN 1.0 2.0 3.0 NaN 1 2.0
2 NaN NaN 1.0 2.0 3.0 4 5.0
3 NaN NaN NaN 1.0 NaN 1 2.0
4 3.0 4.0 NaN NaN 1.0 2 NaN
I came up with a slightly different answer here that might be helpful.
For a single series I made this function to do the cumsum reset on nulls.
def cumsum_reset_on_null(srs: pd.Series) -> pd.Series:
    """
    For a pandas series with null values,
    do a cumsum and reset the cumulative sum when a null value is encountered.

    Example:
        input:  [1, 1, np.nan, 1, 2, 3, np.nan, 1, np.nan]
        return: [1, 2, 0, 1, 3, 6, 0, 1, 0]
    """
    cumulative = srs.cumsum().fillna(method='ffill')
    restart = ((cumulative * srs.isnull()).replace(0.0, np.nan)
               .fillna(method='ffill').fillna(0))
    result = cumulative - restart
    return result.replace(0, np.nan)
Then for the full dataframe, just apply this function row-wise
df = pd.DataFrame([
[np.nan, 1, 1, 1, np.nan, 1, 1],
[np.nan, np.nan, 1, 1, 1, 1, 1],
[np.nan, np.nan, np.nan, 1, np.nan, 1, 1],
[1, 1, np.nan, np.nan, 1, 1, np.nan],
])
df.apply(cumsum_reset_on_null, axis=1)
0 NaN 1.0 2.0 3.0 NaN 1.0 2.0
1 NaN NaN 1.0 2.0 3.0 4.0 5.0
2 NaN NaN NaN 1.0 NaN 1.0 2.0
3 1.0 2.0 NaN NaN 1.0 2.0 NaN
One way to do it:
sample = pd.DataFrame({1: [np.nan, np.nan, np.nan, 1],
                       2: [1, np.nan, np.nan, 1],
                       3: [1, 1, np.nan, np.nan],
                       4: [1, 1, 1, np.nan],
                       5: [np.nan, 1, np.nan, 1],
                       6: [1, 1, 1, 1],
                       7: [1, 1, 1, np.nan]},
                      index=[1, 2, 3, 4])
Output of sample
1 2 3 4 5 6 7
1 NaN 1.0 1.0 1.0 NaN 1 1.0
2 NaN NaN 1.0 1.0 1.0 1 1.0
3 NaN NaN NaN 1.0 NaN 1 1.0
4 1.0 1.0 NaN NaN 1.0 1 NaN
The following code would do it:
# numr = number of rows
# numc = number of columns
numr, numc = sample.shape

for i in range(numr):
    s = 0
    flag = 0
    for j in range(numc):
        if np.isnan(sample.iloc[i, j]):
            flag = 1
        else:
            if flag == 1:
                s = sample.iloc[i, j]
                flag = 0
            else:
                s += sample.iloc[i, j]
            sample.iloc[i, j] = s
Output:
1 2 3 4 5 6 7
1 NaN 1.0 2.0 3.0 NaN 1.0 2.0
2 NaN NaN 1.0 2.0 3.0 4.0 5.0
3 NaN NaN NaN 1.0 NaN 1.0 2.0
4 1.0 2.0 NaN NaN 1.0 2.0 NaN

How to combine dataframe rows

I have the following code:
import os
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
fileName= input("Enter file name here (Case Sensitve) > ")
df = pd.read_excel(fileName +'.xlsx', sheetname=None, ignore_index=True)
xl = pd.ExcelFile(fileName +'.xlsx')
SystemCount= len(xl.sheet_names)
df1 = pd.DataFrame([])
for y in range(1, int(SystemCount) + 1):
    df = pd.read_excel(xl, 'System ' + str(y))
    df['System {0}'.format(y)] = "1"
    df1 = df1.append(df)
df1 = df1.sort_values(['Email'])
df = df1['Email'].value_counts()
df1['Count'] = df1.groupby('Email')['Email'].transform('count')
print(df1)
Which prints something like this:
Email System 1 System 2 System 3 System 4 Count
test_1_#test.com NaN 1 NaN NaN 1
test_2_#test.com NaN NaN 1 NaN 3
test_2_#test.com 1 NaN NaN NaN 3
test_2_#test.com NaN NaN NaN 1 3
test_3_#test.com NaN 1 NaN NaN 1
test_4_#test.com NaN NaN 1 NaN 1
test_5_#test.com 1 NaN NaN NaN 3
test_5_#test.com NaN NaN 1 NaN 3
test_5_#test.com NaN NaN NaN 1 3
How do I combine this, so the email only shows once, with all marked systems?
I would like the output to look like this:
System1 System2 System3 System4 Count
Email
test_1_#test.com 0.0 1.0 0.0 0.0 1
test_2_#test.com 1.0 0.0 1.0 1.0 3
test_3_#test.com 0.0 1.0 0.0 0.0 1
test_4_#test.com 0.0 0.0 1.0 0.0 1
test_5_#test.com 1.0 0.0 1.0 1.0 3
If I understand it correctly:
df1 = df1.apply(lambda x: pd.to_numeric(x, errors='ignore'))
d = dict(zip(df1.columns[1:], ['sum'] * df1.columns[1:].str.contains('System').sum() + ['first']))
df1.fillna(0).groupby('Email').agg(d)
Out[95]:
System1 System2 System3 System4 Count
Email
test_1_#test.com 0.0 1.0 0.0 0.0 1
test_2_#test.com 1.0 0.0 1.0 1.0 3
test_3_#test.com 0.0 1.0 0.0 0.0 1
test_4_#test.com 0.0 0.0 1.0 0.0 1
test_5_#test.com 1.0 0.0 1.0 1.0 3
It'd be easier to get help if you would post code to generate your input data.
But you probably want a GroupBy:
df2 = df1.groupby('Email').sum()
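A plain sum will also sum the repeated Count column, so a fuller version (a sketch, assuming the column names shown in the printed output and that the System columns hold "1" strings as in the question code) aggregates the System columns with sum and keeps the first Count per email, much like the agg answer above:
sys_cols = [c for c in df1.columns if c.startswith('System')]
df1[sys_cols] = df1[sys_cols].apply(pd.to_numeric, errors='coerce')  # "1" strings -> numbers
agg_map = {c: 'sum' for c in sys_cols}
agg_map['Count'] = 'first'  # Count is repeated on every row for an email, so take it once
df2 = df1.fillna(0).groupby('Email').agg(agg_map)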

Pandas: Drop Rows, Columns If More Than Half Are NaN

I have a Pandas DataFrame called df with 1,460 rows and 81 columns. I want to remove all columns where at least half the entries are NaN and to do something similar for rows.
From the Pandas docs, I attempted this:
train_df.shape                 # (1460, 81)
train_df.dropna(thresh=len(train_df)/2, axis=1, inplace=True)
train_df.shape                 # (1460, 77)
Is this the correct way of doing it? It does remove 4 columns, but I'm surprised; I would have thought len(train_df) gives the number of rows, so I may have passed the wrong value to thresh...?
How would I do the same thing for rows (removing rows where at least half the columns are NaN)?
Thanks!
I guess you did the right thing but forgot to add the .index.
The line should look like this:
train_df.dropna(thresh=len(train_df.index)/2, axis=1, inplace=True)
Hope that helps.
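For the row part of the question, the same call works along the other axis (a sketch; thresh is the minimum number of non-NaN values a row must keep to survive):
train_df.dropna(thresh=train_df.shape[1] // 2, axis=0, inplace=True)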
Using count and loc: count(axis=...) ignores NaNs when counting.
In [4135]: df.loc[df.count(1) > df.shape[1]/2, df.count(0) > df.shape[0]/2]
Out[4135]:
0
0 0.382991
1 0.428040
7 0.441113
Details
In [4136]: df
Out[4136]:
0 1 2 3
0 0.382991 0.658090 0.881214 0.572673
1 0.428040 0.258378 0.865269 0.173278
2 0.579953 NaN NaN NaN
3 0.117927 NaN NaN NaN
4 0.597632 NaN NaN NaN
5 0.547839 NaN NaN NaN
6 0.998631 NaN NaN NaN
7 0.441113 0.527205 0.779821 0.251350
In [4137]: df.count(1) > df.shape[1]/2
Out[4137]:
0 True
1 True
2 False
3 False
4 False
5 False
6 False
7 True
dtype: bool
In [4138]: df.count(0) < df.shape[0]/2
Out[4138]:
0 False
1 True
2 True
3 True
dtype: bool
Setup
np.random.seed([3,14159])
df = pd.DataFrame(np.random.choice([1, np.nan], size=(10, 10)))
df
0 1 2 3 4 5 6 7 8 9
0 1.0 1.0 NaN NaN NaN 1.0 1.0 NaN 1.0 NaN
1 NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN 1.0
2 NaN 1.0 1.0 NaN NaN NaN NaN 1.0 1.0 1.0
3 1.0 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN
4 1.0 1.0 1.0 1.0 1.0 1.0 NaN NaN 1.0 NaN
5 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN NaN 1.0
6 NaN NaN 1.0 NaN NaN 1.0 1.0 NaN NaN 1.0
7 NaN NaN NaN 1.0 NaN 1.0 NaN 1.0 NaN NaN
8 1.0 1.0 1.0 NaN 1.0 NaN 1.0 NaN NaN 1.0
9 NaN NaN NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0
Solution 1
This assumes you make the calculation for rows and columns before you drop either rows or columns.
n = df.notnull()
df.loc[n.mean(1) > .5, n.mean() > .5]
5 6 9
1 1.0 1.0 1.0
4 1.0 NaN NaN
8 NaN 1.0 1.0
9 1.0 1.0 1.0
Solution 2
Similar concept but using numpy tools.
v = np.isnan(df.values)
r = np.count_nonzero(v, 1) < v.shape[1] // 2
c = np.count_nonzero(v, 0) < v.shape[0] // 2
df.loc[r, c]
5 6 9
1 1.0 1.0 1.0
4 1.0 NaN NaN
8 NaN 1.0 1.0
9 1.0 1.0 1.0
Try this code; it should do it:
df.dropna(thresh = df.shape[1]/3, axis = 0, inplace = True)
