I have a pandas dataframe with company name and date as a multi-index:
companyname  date  emp1  emp2  emp3 ..... emp80
where emp1, emp2, etc. are the counts of phone calls made by employee 1, employee 2, and so on, on that date. There are dates when no employee made a call, i.e. rows where all the column values are 0. I want to replace these values with NA.
Do I have to write out all the column names manually in some function? Any suggestions on how to achieve this?
You can check that the entire row is 0 with all:
In [11]: df = pd.DataFrame([[1, 2], [0, 4], [0, 0], [7, 8]])
In [12]: df
Out[12]:
   0  1
0  1  2
1  0  4
2  0  0
3  7  8
In [13]: (df == 0).all(1)
Out[13]:
0 False
1 False
2 True
3 False
dtype: bool
Now you can assign all the entries in these rows to NaN using loc:
In [14]: df.loc[(df == 0).all(1)] = np.nan
In [15]: df
Out[15]:
     0    1
0    1    2
1    0    4
2  NaN  NaN
3    7    8
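For the original question the same pattern works across the employee-count columns. Here is a minimal sketch, assuming hypothetical column names emp1..emp80 as described in the question:

import numpy as np
import pandas as pd

# hypothetical column names (emp1 .. emp80), not taken from the real data
emp_cols = ['emp%d' % i for i in range(1, 81)]

# rows where every employee made zero calls on that date
mask = (df[emp_cols] == 0).all(axis=1)

# set every entry in those rows to NaN without typing column names by hand
df.loc[mask, emp_cols] = np.nan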
Related
Here is how the data looks in the df dataframe:
       A  B  C  D
0.js   2  1  1 -1
1.js   3 -5  1 -4
total  5 -4  2 -5
And I would like to get a new dataframe df1:
       A  C
0.js   2  1
1.js   3  1
total  5  2
So basically it should look like this:
df1 = df[df["total"] > 0]
but it should filter on a row instead of a column, and I can't figure it out.
You want to use .loc[:, column_mask] i.e.
In [11]: df.loc[:, df.sum() > 0]
Out[11]:
       A  C
0.js   2  1
1.js   3  1
total  5  2

# or, filtering on the last ('total') row directly
In [12]: df.loc[:, df.iloc[-1] > 0]
Out[12]:
       A  C
0.js   2  1
1.js   3  1
total  5  2
Use .where to set negative values to NaN and then dropna setting axis = 1:
df.where(df.gt(0)).dropna(axis=1)
       A  C
0.js   2  1
1.js   3  1
total  5  2
You can use loc with boolean indexing, or reindex:
df.loc[:, df.columns[(df.loc['total'] > 0)]]
OR
df.reindex(df.columns[(df.loc['total'] > 0)], axis=1)
Output:
       A  C
0.js   2  1
1.js   3  1
total  5  2
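For reference, the example frame used in these answers could be rebuilt like this (a sketch reconstructed from the values shown in the question, not the original code):

import pandas as pd

df = pd.DataFrame({'A': [2, 3, 5],
                   'B': [1, -5, -4],
                   'C': [1, 1, 2],
                   'D': [-1, -4, -5]},
                  index=['0.js', '1.js', 'total'])

df1 = df.loc[:, df.loc['total'] > 0]  # columns whose 'total' value is positive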
I have a long list of data, with meaningful data sandwiched between 0 values. Here is how it looks:
0
0
1
0
0
2
3
1
0
0
0
0
1
0
The lengths of the zero runs and of the meaningful sequences are variable. I want to extract the meaningful sequences, each into its own row of a dataframe. For example, the above data would be extracted to this:
1
2 3 1
1
I used this code to 'slice' the meaningful data:
import pandas as pd
import numpy as np
raw = pd.read_csv('data.csv')
df = pd.DataFrame(index=np.arange(0, 10000),columns = ['DT01', 'DT02', 'DT03', 'DT04', 'DT05', 'DT06', 'DT07', 'DT08', 'DT02', 'DT09', 'DT10', 'DT11', 'DT12', 'DT13', 'DT14', 'DT15', 'DT16', 'DT17', 'DT18', 'DT19', 'DT20',])
a = 0
b = 0
n=0
for n in range(0,999999):
    if raw.iloc[n].values > 0:
        df.iloc[a,b] = raw.iloc[n].values
        a=a+1
        if raw[n+1] == 0:
            b=b+1
            a=0
but I keep getting KeyError: n, where n is the row after the first row that has a value different from 0.
Where is the problem with my code? And is there any way to improve it, in terms of speed and memory cost?
Thank you very much.
You can use:
df['Group'] = df['col'].eq(0).cumsum()
df = df.loc[df['col'] != 0]
s = df.groupby('Group')['col'].apply(list)
print (s)
Group
2 [1]
4 [2, 3, 1]
8 [1]
Name: col, dtype: object
And for a dataframe output:
df = pd.DataFrame(df.groupby('Group')['col'].apply(list).values.tolist())
print (df)
   0    1    2
0  1  NaN  NaN
1  2  3.0  1.0
2  1  NaN  NaN
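The snippets above assume the values sit in a single column named col; a minimal setup (hedged, since the question actually reads the data from data.csv) would be:

import pandas as pd

df = pd.DataFrame({'col': [0, 0, 1, 0, 0, 2, 3, 1, 0, 0, 0, 0, 1, 0]})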
Let's try this; it outputs a dataframe:
df.groupby(df[0].eq(0).cumsum().mask(df[0].eq(0)),as_index=False)[0]\
.apply(lambda x: x.reset_index(drop=True)).unstack(1)
Output:
     0    1    2
0  1.0  NaN  NaN
1  2.0  3.0  1.0
2  1.0  NaN  NaN
Or a string:
df.groupby(df[0].eq(0).cumsum().mask(df[0].eq(0)),as_index=False)[0]\
.apply(lambda x: ' '.join(x.astype(str)))
Output:
0        1
1    2 3 1
2        1
dtype: object
Or as a list:
df.groupby(df[0].eq(0).cumsum().mask(df[0].eq(0)),as_index=False)[0]\
.apply(list)
Output:
0          [1]
1    [2, 3, 1]
2          [1]
dtype: object
Try this; I'll break down the steps:
df.LIST=df.LIST.replace({0:np.nan})
df['Group']=df.LIST.isnull().cumsum()
df=df.dropna()
df.groupby('Group').LIST.apply(list)
Out[384]:
Group
2 [1]
4 [2, 3, 1]
8 [1]
Name: LIST, dtype: object
Data Input
df = pd.DataFrame({'LIST' : [0,0,1,0,0,2,3,1,0,0,0,0,1,0]})
Let's start with packing your original data into a pandas dataframe (in real life, you will probably use pd.read_csv() to generate this dataframe):
raw = pd.DataFrame({'0' : [0,0,1,0,0,2,3,1,0,0,0,0,1,0]})
The default index will help you locate zero spans:
s1 = raw.reset_index()
s1['index'] = np.where(s1['0'] != 0, np.nan, s1['index'])
s1['index'] = s1['index'].fillna(method='ffill').fillna(0).astype(int)
s1[s1['0'] != 0].groupby('index')['0'].apply(list).tolist()
#[[1], [2, 3, 1], [1]]
I want to eliminate all rows that contain certain values (or values in a certain range) from a dataframe with a large number of columns. For example, if I had the following dataframe:
   a  b
0  1  0
1  2  1
2  3  2
3  0  3
and wanted to remove all rows containing 0, I could use:
a_df[(a_df['a'] != 0) & (a_df['b'] !=0)]
but this becomes a pain when you're dealing with a large number of columns. It could be done as:
for i in a_df.columns.values:
    a_df = a_df[a_df[i] != 0]
but this seems inefficient. Is there a better way to do this?
Just do it for the whole df and call dropna:
In [45]:
df[df != 0].dropna()
Out[45]:
   a  b
1  2  1
2  3  2
The condition df != 0 produces a boolean mask:
In [47]:
df != 0
Out[47]:
       a      b
0   True  False
1   True   True
2   True   True
3  False   True
When this is combined with the df it produces NaN values where the condition is not met:
In [48]:
df[df != 0]
Out[48]:
    a   b
0   1 NaN
1   2   1
2   3   2
3 NaN   3
Calling dropna then drops any row with a NaN value.
Here's a variant of EdChum's approach. You could do df != 0 and then use all to make your selector:
>>> (df != 0).all(axis=1)
0 False
1 True
2 True
3 False
dtype: bool
and then use this to select:
>>> df.loc[(df != 0).all(axis=1)]
   a  b
1  2  1
2  3  2
The advantage of this is that it keeps NaNs if you want, e.g.
>>> df
   a   b
0  1   0
1  2 NaN
2  3   2
3  0   3
>>> df.loc[(df != 0).all(axis=1)]
   a   b
1  2 NaN
2  3   2
>>> df[(df != 0)].dropna()
   a  b
2  3  2
As you've mentioned in your question, you may also need to drop rows whose values fall in a certain set or range. You can do that as follows.
Suppose the values to exclude are 0, 10 and 20:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
mask = frame.applymap(lambda x: x not in [0, 10, 20])
frame[mask.all(axis=1)]
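A possible alternative, assuming the same excluded values 0, 10 and 20, is isin, which avoids the applymap lambda (a hedged suggestion, not part of the original answer):

# keep only rows in which no cell equals one of the excluded values
frame[~frame.isin([0, 10, 20]).any(axis=1)]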
I've got a dataset with a large number of rows. Some of the values are NaN, like this:
In [91]: df
Out[91]:
1 3 1 1 1
1 3 1 1 1
2 3 1 1 1
1 1 NaN NaN NaN
1 3 1 1 1
1 1 1 1 1
And I want to count the number of NaN values in each row; the result would look like this:
In [91]: list = <somecode with df>
In [92]: list
Out[92]:
[0,
0,
0,
3,
0,
0]
What is the best and fastest way to do it?
You could first find whether each element is NaN with isnull() and then take the row-wise sum(axis=1):
In [195]: df.isnull().sum(axis=1)
Out[195]:
0 0
1 0
2 0
3 3
4 0
5 0
dtype: int64
And if you want the output as a list, you can use tolist():
In [196]: df.isnull().sum(axis=1).tolist()
Out[196]: [0, 0, 0, 3, 0, 0]
Or use count:
In [130]: df.shape[1] - df.count(axis=1)
Out[130]:
0 0
1 0
2 0
3 3
4 0
5 0
dtype: int64
To count NaNs over specific columns only, use
cols = ['col1', 'col2']
df['number_of_NaNs'] = df[cols].isna().sum(1)
or index the columns by position, e.g. count NaNs in the first 4 columns:
df['number_of_NaNs'] = df.iloc[:, :4].isna().sum(1)
I want to index a Pandas dataframe using a boolean mask, then set a value in a subset of the filtered dataframe based on an integer index, and have this value reflected in the dataframe. That is, I would be happy if this worked on a view of the dataframe.
Example:
In [293]:
df = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5, 6, 7],
                   'b': [5, 5, 2, 2, 5, 5, 2, 2],
                   'c': [0, 0, 0, 0, 0, 0, 0, 0]})
mask = (df['a'] < 7) & (df['b'] == 2)
df.loc[mask, 'c']
Out[293]:
2 0
3 0
6 0
Name: c, dtype: int64
Now I would like to set the values of the first two elements returned in the filtered dataframe. Chaining an iloc onto the loc call above works to index:
In [294]:
df.loc[mask, 'c'].iloc[0: 2]
Out[294]:
2 0
3 0
Name: c, dtype: int64
But not to assign:
In [295]:
df.loc[mask, 'c'].iloc[0: 2] = 1
print(df)
   a  b  c
0  0  5  0
1  1  5  0
2  2  2  0
3  3  2  0
4  4  5  0
5  5  5  0
6  6  2  0
7  7  2  0
Making the assigned value the same length as the slice (i.e. = [1, 1]) also doesn't work. Is there a way to assign these values?
This does work but is a little ugly; basically we use the index generated from the mask and make an additional call to loc:
In [57]:
df.loc[df.loc[mask,'c'].iloc[0:2].index, 'c'] = 1
df
Out[57]:
   a  b  c
0  0  5  0
1  1  5  0
2  2  2  1
3  3  2  1
4  4  5  0
5  5  5  0
6  6  2  0
7  7  2  0
So breaking the above down:
In [60]:
# take the index from the mask and iloc
df.loc[mask, 'c'].iloc[0: 2]
Out[60]:
2 0
3 0
Name: c, dtype: int64
In [61]:
# call loc using this index, we can now use this to select column 'c' and set the value
df.loc[df.loc[mask,'c'].iloc[0:2].index]
Out[61]:
   a  b  c
2  2  2  0
3  3  2  0
How about:
ix = df.index[mask][:2]
df.loc[ix, 'c'] = 1
Same idea as EdChum's, but more elegant, as suggested in the comment.
EDIT: Have to be a little bit careful with this one as it may give unwanted results with a non-unique index, since there could be multiple rows indexed by either of the label in ix above. If the index is non-unique and you only want the first 2 (or n) rows that satisfy the boolean key, it would be safer to use .iloc with integer indexing with something like
ix = np.where(mask)[0][:2]
df.iloc[ix, df.columns.get_loc('c')] = 1
I don't know if this is any more elegant, but it's a little different:
mask = mask & (mask.cumsum() < 3)
df.loc[mask, 'c'] = 1
   a  b  c
0  0  5  0
1  1  5  0
2  2  2  1
3  3  2  1
4  4  5  0
5  5  5  0
6  6  2  0
7  7  2  0
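The same idea generalizes to the first n matching rows, using the original mask (a hedged variant of the snippet above, since cumsum() < 3 is just cumsum() <= 2):

n = 2  # number of matching rows to update
df.loc[mask & (mask.cumsum() <= n), 'c'] = 1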