Pandas: Find empty/missing values and add them to DataFrame


I have a dataframe where column 1 should contain every value from 1 to 169. If a value doesn't exist, I'd like to add a new row to my dataframe containing that value (and some zeros).
I can't get the following code to work, even though it raises no errors:
for i in range(1,170):
    if i in df.col1 is False:
        df.loc[len(df)+1] = [i,0,0]
    else:
        continue
Any advice?

It would be better to do something like:
In [37]:
# create our test df, we have values 1 to 9 in steps of 2
df = pd.DataFrame({'a':np.arange(1,10,2)})
df['b'] = np.nan
df['c'] = np.nan
df
Out[37]:
a b c
0 1 NaN NaN
1 3 NaN NaN
2 5 NaN NaN
3 7 NaN NaN
4 9 NaN NaN
In [38]:
# now set the index to a, this allows us to reindex the values with optional fill value, then reset the index
df = df.set_index('a').reindex(index = np.arange(1,10), fill_value=0).reset_index()
df
Out[38]:
a b c
0 1 NaN NaN
1 2 0 0
2 3 NaN NaN
3 4 0 0
4 5 NaN NaN
5 6 0 0
6 7 NaN NaN
7 8 0 0
8 9 NaN NaN
So just to explain the above:
In [40]:
# set the index to 'a', this allows us to reindex and fill missing values
df = df.set_index('a')
df
Out[40]:
b c
a
1 NaN NaN
3 NaN NaN
5 NaN NaN
7 NaN NaN
9 NaN NaN
In [41]:
# now reindex and pass fill_value for the extra rows we want
df = df.reindex(index = np.arange(1,10), fill_value=0)
df
Out[41]:
b c
a
1 NaN NaN
2 0 0
3 NaN NaN
4 0 0
5 NaN NaN
6 0 0
7 NaN NaN
8 0 0
9 NaN NaN
In [42]:
# now reset the index
df = df.reset_index()
df
Out[42]:
a b c
0 1 NaN NaN
1 2 0 0
2 3 NaN NaN
3 4 0 0
4 5 NaN NaN
5 6 0 0
6 7 NaN NaN
7 8 0 0
8 9 NaN NaN
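Applied to the range in the question (1 to 169), the same three steps can be sketched as follows. The column names x and y and the sample values are hypothetical, and the sketch assumes col1 contains no duplicates, since reindex raises on a duplicated index.

```python
import numpy as np
import pandas as pd

# Hypothetical data: col1 should cover 1..169 but most values are absent.
df = pd.DataFrame({"col1": [1, 2, 5, 169],
                   "x": [10, 20, 50, 1690],
                   "y": [1, 1, 1, 1]})

full_range = np.arange(1, 170)  # 1 to 169 inclusive

filled = (
    df.set_index("col1")                    # make col1 the index
      .reindex(full_range, fill_value=0)    # add missing labels, filled with 0
      .rename_axis("col1")                  # make sure the index keeps its name
      .reset_index()                        # turn the index back into a column
)
```

The new rows end up in sorted position rather than at the end of the frame.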
If you modified your loop to the following then it would work:
In [63]:
for i in range(1,10):
    if any(df.a.isin([i])) == False:
        df.loc[len(df)+1] = [i,0,0]
    else:
        continue
df
Out[63]:
a b c
0 1 NaN NaN
1 3 NaN NaN
2 5 NaN NaN
3 7 NaN NaN
4 9 NaN NaN
6 2 0 0
7 4 0 0
8 6 0 0
9 8 0 0
EDIT
If you wanted the missing rows to appear at the end of the df then you could just create a temporary df with the full range of values and other columns set to zero and then filter this df based on the values that are missing in the other df and concatenate them:
In [70]:
df_missing = pd.DataFrame({'a':np.arange(10),'b':0,'c':0})
df_missing
Out[70]:
a b c
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 5 0 0
6 6 0 0
7 7 0 0
8 8 0 0
9 9 0 0
In [73]:
df = pd.concat([df,df_missing[~df_missing.a.isin(df.a)]], ignore_index=True)
df
Out[73]:
a b c
0 1 NaN NaN
1 3 NaN NaN
2 5 NaN NaN
3 7 NaN NaN
4 9 NaN NaN
5 0 0 0
6 2 0 0
7 4 0 0
8 6 0 0
9 8 0 0

The condition i in df.col1 is False never evaluates to True. Python parses it as a chained comparison, (i in df.col1) and (df.col1 is False), and a Series is never the object False. In addition, in applied to a Series checks the index labels rather than the values. Assigning to df.loc[] with a new label does add a row, but growing a dataframe one row at a time is slow, so I would use pandas.concat instead.
I would recommend gathering all missing values in a list and then concatenating them to the dataframe at the end. For instance:
>>> df = pd.DataFrame({'col1': list(range(5)) + [i + 6 for i in range(5)], 'col2': range(10)})
>>> print(df)
col1 col2
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 6 5
6 7 6
7 8 7
8 9 8
9 10 9
>>> to_add = []
>>> for i in range(11):
...     if i not in df.col1.values:
...         to_add.append([i, 0])
...     else:
...         continue
...
>>> pd.concat([df, pd.DataFrame(to_add, columns=['col1', 'col2'])])
col1 col2
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 6 5
6 7 6
7 8 7
8 9 8
9 10 9
0 5 0
I assume you don't care about the index values of the rows you add.
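To see why the original condition in the question silently does nothing, here is a small sketch with made-up values. It shows that membership on a Series looks at the index labels, and that the chained form with is False is False in every case.

```python
import pandas as pd

df = pd.DataFrame({"col1": [10, 20, 30]})  # index labels are 0, 1, 2

# `in` on a Series tests the index labels, not the values.
label_hit = 1 in df.col1            # 1 is an index label
value_hit = 10 in df.col1           # 10 is a value, not a label
values_hit = 10 in df.col1.values   # membership in the underlying array

# `i in s is False` is a chained comparison: (i in s) and (s is False).
# A Series is never the object False, so both results below are False.
chained_present = 1 in df.col1 is False
chained_absent = 99 in df.col1 is False

print(label_hit, value_hit, values_hit, chained_present, chained_absent)
```

Using not in df.col1.values, or ~df.col1.isin(...), tests the values instead.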

Related

count sets of consecutive true values in a column

Let's say that I have a dataframe as follows:
df = pd.DataFrame({'A':[1,1,1,1,1,0,0,1,1,0,1,1,1,1,1,0,0,0,0,0,1,1]})
Then, I convert it into a boolean form:
df.eq(1)
Out[213]:
A
0 True
1 True
2 True
3 True
4 True
5 False
6 False
7 True
8 True
9 False
10 True
11 True
12 True
13 True
14 True
15 False
16 False
17 False
18 False
19 False
20 True
21 True
What I want is to count consecutive sets of True values in the column. In this example, the output would be:
df
Out[215]:
A count
0 1 5.0
1 1 2.0
2 1 5.0
3 1 2.0
4 1 NaN
5 0 NaN
6 0 NaN
7 1 NaN
8 1 NaN
9 0 NaN
10 1 NaN
11 1 NaN
12 1 NaN
13 1 NaN
14 1 NaN
15 0 NaN
16 0 NaN
17 0 NaN
18 0 NaN
19 0 NaN
20 1 NaN
21 1 NaN
I have tried tools such as 'groupby' and 'cumsum', but honestly I cannot figure out how to solve it. Thanks in advance.

You can use df['A'].diff().ne(0).cumsum() to generate a grouper that will group each consecutive group of zeros/ones:
# A side-by-side comparison:
>>> pd.concat([df['A'], df['A'].diff().ne(0).cumsum()], axis=1)
A A
0 1 1
1 1 1
2 1 1
3 1 1
4 1 1
5 0 2
6 0 2
7 1 3
8 1 3
9 0 4
10 1 5
11 1 5
12 1 5
13 1 5
14 1 5
15 0 6
16 0 6
17 0 6
18 0 6
19 0 6
20 1 7
21 1 7
Thus, group by that grouper, calculate sums, replace zero with NaN + dropna, and reset the index:
df['count'] = df.groupby(df['A'].diff().ne(0).cumsum())['A'].sum().replace(0, np.nan).dropna().reset_index(drop=True)
Output:
>>> df
A count
0 1 5.0
1 1 2.0
2 1 5.0
3 1 2.0
4 1 NaN
5 0 NaN
6 0 NaN
7 1 NaN
8 1 NaN
9 0 NaN
10 1 NaN
11 1 NaN
12 1 NaN
13 1 NaN
14 1 NaN
15 0 NaN
16 0 NaN
17 0 NaN
18 0 NaN
19 0 NaN
20 1 NaN
21 1 NaN
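The same idea as a self-contained sketch: instead of summing and then discarding the zero groups, it keeps only the rows equal to 1 and takes the size of each run. The names run_id, is_one and run_lengths are illustrative and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({"A": [1,1,1,1,1,0,0,1,1,0,1,1,1,1,1,0,0,0,0,0,1,1]})

# A new run starts wherever the value differs from the previous row.
run_id = df["A"].diff().ne(0).cumsum()

# Keep only the rows equal to 1 and count how many rows each run contains.
is_one = df["A"].eq(1)
run_lengths = is_one[is_one].groupby(run_id[is_one]).size().tolist()

# Assigning a shorter Series aligns on the index, so the remaining rows get NaN.
df["count"] = pd.Series(run_lengths, dtype="float64")
```

Counting sizes rather than sums also works when the value being tracked is something other than 1.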

I propose an alternative way that makes use of the split string function.
Let's transform the Series df.A into a string and then split it where the zeros are.
df = pd.DataFrame({'A':[1,1,1,1,1,0,0,1,1,0,1,1,1,1,1,0,0,0,0,0,1,1]})
ll = ''.join(df.A.astype('str').tolist()).split('0')
The list ll looks like
print(ll)
['11111', '', '11', '11111', '', '', '', '', '11']
Now we count the length of every non-empty string and collect the results in a list:
[len(item) for item in ll if len(item)>0]
This works as long as the Series is not too long.
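If building one long string is a concern, a similar result can be sketched with itertools.groupby from the standard library, which walks the values once and groups consecutive equal items. This is an alternative to the string split, not the method described above.

```python
from itertools import groupby

values = [1,1,1,1,1,0,0,1,1,0,1,1,1,1,1,0,0,0,0,0,1,1]

# groupby yields (key, run) pairs for each stretch of consecutive equal values;
# keep the runs of ones and count their items.
run_lengths = [sum(1 for _ in run) for key, run in groupby(values) if key == 1]

print(run_lengths)
```

With a dataframe, values would be df.A.tolist().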

How to get the null rows of certain columns in python? [duplicate]

This question already has answers here:
How to select rows with one or more nulls from a pandas DataFrame without listing columns explicitly?
(6 answers)
Closed 2 years ago.
I am facing an issue with null rows. I want only the rows that have nulls in certain columns of a data frame. Is it possible to get these rows?
In [57]: df
Out[57]:
a b c d e
0 0 1 2 3 4
1 0 NaN 0 1 5
2 0 0 NaN NaN 5
3 0 1 2 5 NaN
4 0 1 2 6 NaN
Now I want the rows with nulls in b, c, or e. The result should be this one:
Out[57]:
a b c d e
1 0 NaN 0 1 5
2 0 0 NaN NaN 5
3 0 1 2 5 NaN
4 0 1 2 6 NaN

You could use isna() together with any(axis=1) to select the rows that contain at least one null.
df = pd.DataFrame({"a":[0,0,0,0,0], "b":[1,np.nan,0,1,1], "c":[2,0,np.nan,2,2], "d":[3,1,np.nan,5,6], "e":[4,5,5,np.nan,np.nan]})
>>> df[df.isna().any(axis=1)]
a b c d e
1 0 NaN 0.0 1.0 5.0
2 0 0.0 NaN NaN 5.0
3 0 1.0 2.0 5.0 NaN
4 0 1.0 2.0 6.0 NaN
The same can be done with isnull(), which is an alias of isna():
df[df.isnull().any(axis=1)]
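The snippets above check every column. To restrict the check to b, c and e as the question asks, one option is to select those columns before calling isna(). In this sketch, row 5 is a hypothetical extra row (null only in d) added to show the difference.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [0, 0, 0, 0, 0, 0],
    "b": [1, np.nan, 0, 1, 1, 1],
    "c": [2, 0, np.nan, 2, 2, 2],
    "d": [3, 1, np.nan, 5, 6, np.nan],   # row 5: null only in d (hypothetical)
    "e": [4, 5, 5, np.nan, np.nan, 7],
})

cols = ["b", "c", "e"]

# Rows with at least one null among the chosen columns only.
null_in_cols = df[df[cols].isna().any(axis=1)]

# For comparison: rows with a null in any column.
null_anywhere = df[df.isna().any(axis=1)]
```

Replacing any with all would instead keep rows where every chosen column is null.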

Turn columns' values to headers of columns with values 1 and 0 ( accordingly) [python]

I have a column of the form:
0 q4
1 4
2 3
3 1
4 2
5 1
6 5
7 1
8 3
The column represents the answers of users to a question of 5 choices (1-5).
I want to turn this into a matrix of 5 columns where the column headers are the 5 possible answers and the values are 1 or 0 according to the user's given answer.
Visually, I want a matrix of the form:
0 q4_1 q4_2 q4_3 q4_4 q4_5
1 NaN NaN NaN 1 NaN
2 NaN NaN 1 NaN NaN
3 1 NaN NaN NaN NaN
4 NaN 1 NaN NaN NaN
5 1 NaN NaN NaN NaN

# build one 0/1 indicator column per possible answer
for i in range(1,6):
    df['q4_'+str(i)]=np.where(df.q4==i, 1, 0)
# drop the original column
del df['q4']
Output:
>>> print(df)
q4_1 q4_2 q4_3 q4_4 q4_5
0 0 0 0 1 0
1 0 0 1 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 1 0 0 0 0
5 0 0 0 0 1
6 1 0 0 0 0
7 0 0 1 0 0
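The loop above is essentially one-hot encoding, which pandas also offers through pd.get_dummies. A minimal sketch, assuming the answers are stored as integers in a column named q4; the reindex step is only there so that all five columns exist even if an answer never occurs.

```python
import pandas as pd

df = pd.DataFrame({"q4": [4, 3, 1, 2, 1, 5, 1, 3]})

# One indicator column per distinct answer, named q4_<answer>.
dummies = pd.get_dummies(df["q4"], prefix="q4", dtype=int)

# Guarantee columns q4_1..q4_5 in order, filling unseen answers with 0.
expected_cols = [f"q4_{i}" for i in range(1, 6)]
dummies = dummies.reindex(columns=expected_cols, fill_value=0)
```

If NaN is wanted instead of 0, as in the question's sketch, dummies.where(dummies.eq(1)) would blank out the zeros.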

I think pivot is the way to go. You'd have to prepopulate the df with the info you want in the new table.
Also, I don't understand why you want only 5 rows but I added it as well in iloc. If you remove it, you will have this data for your entire index (up to 8).
import pandas as pd
df = pd.DataFrame({'q4': [4, 3, 1, 2, 1, 5, 1, 3]})
df.index += 1
df['values'] = 1
df = df.reset_index().pivot(index='q4', columns='index', values='values').T.iloc[:5]
prints
q4 1 2 3 4 5
index
1 NaN NaN NaN 1.0 NaN
2 NaN NaN 1.0 NaN NaN
3 1.0 NaN NaN NaN NaN
4 NaN 1.0 NaN NaN NaN
5 1.0 NaN NaN NaN NaN

Iterating over rows subtracting by 1 in dataframe pandas

I have a pandas dataframe that I would like to iterate from the last non Null value and then subtract 1 from that value for all following rows.
z = pd.DataFrame({'l':range(10),'r':[4,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]\
,'gh':[np.nan,np.nan,np.nan,np.nan,15,np.nan,np.nan,np.nan,np.nan,np.nan],\
'gfh':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,2]})
df = z.transpose().copy()
df.reset_index(inplace=True)
df.drop(['index'],axis=1, inplace=True)
df.columns = ['a','b','c','d','e','f','g','h','i','j']
In [8]: df
Out[8]:
a b c d e f g h i j
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
1 NaN NaN NaN NaN 15 NaN NaN NaN NaN NaN
2 0 1 2 3 4 5 6 7 8 9
3 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN
In the above dataframe, I would like to decrease the value by 1 in each subsequent column of every row, up to the last column. For example, in the second row the value is 15, so I want 14, 13, 12, 11, 10 to follow. Nothing will follow the 2 in the first row since there are no columns left. Also, the 4 in the last row would be followed by 3, 2, 1, 0, 0, 0, and so on.
I reached my desired output by doing the following.
for index, row in df.iterrows():
    df.iloc[index,df.columns.get_loc(df.iloc[index].last_valid_index())+1:] =\
    [(df.iloc[index,df.columns.get_loc(df.iloc[index].last_valid_index()):][0]-(x+1)).astype(int) \
    for x in range((df.shape[1]-1)-df.columns.get_loc(df.iloc[index].last_valid_index()))]
df[df < 0] = 0
This gives me the desired output
In [13]: df
Out[13]:
a b c d e f g h i j
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
1 NaN NaN NaN NaN 15 14 13 12 11 10
2 0 1 2 3 4 5 6 7 8 9
3 4 3 2 1 0 0 0 0 0 0
However, my real-world data has more than 50K columns, and the above code takes far too long.
Can anyone please suggest how I can make this run faster?
I believe the solution would be to somehow tell the code to move on to the next row once the subtraction reaches zero, but I don't know how to do that: even if I use max(0, subtraction formula), the code still wastes time subtracting.
Thank you.

I don't know how fast it will be, but you could experiment with ffill, fillna, and cumsum. For example:
>>> df
a b c d e f g h i j
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
1 NaN NaN NaN NaN 15 NaN NaN NaN NaN NaN
2 0 1 2 3 4 5 6 7 8 9
3 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN
>>> mask = df.ffill(axis=1).notnull() & df.isnull()
>>> df.where(~mask, df.fillna(-1).cumsum(axis=1).clip(lower=0))
a b c d e f g h i j
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
1 NaN NaN NaN NaN 15 10 9 8 7 6
2 0 1 2 3 4 5 6 7 8 9
3 4 3 2 1 0 0 0 0 0 0
This is a little tricky. First we figure out which cells we need to fill, by forward-filling the rightmost element and seeing whether it's null (there might be a faster way to use last_valid_index tests, but this is the first thing that occurred to me)
>>> mask = df.ffill(axis=1).notnull() & df.isnull()
>>> mask
a b c d e f g h i j
0 False False False False False False False False False False
1 False False False False False True True True True True
2 False False False False False False False False False False
3 False True True True True True True True True True
If we fill the empty spots with -1, we can get the values we want by cumulative summing to the right:
>>> (df.fillna(-1).cumsum(axis=1))
a b c d e f g h i j
0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -7
1 -1 -2 -3 -4 11 10 9 8 7 6
2 0 1 3 6 10 15 21 28 36 45
3 4 3 2 1 0 -1 -2 -3 -4 -5
Many of those values we don't want, but that's okay, because we're only going to insert the ones we need. We should clip to 0, though:
>>> df.fillna(-1).cumsum(axis=1).clip(lower=0)
a b c d e f g h i j
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 11 10 9 8 7 6
2 0 1 3 6 10 15 21 28 36 45
3 4 3 2 1 0 0 0 0 0 0
and finally we can use the original ones where mask is False, and the new values where mask is True:
>>> df.where(~mask, df.fillna(-1).cumsum(axis=1).clip(lower=0))
a b c d e f g h i j
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
1 NaN NaN NaN NaN 15 10 9 8 7 6
2 0 1 2 3 4 5 6 7 8 9
3 4 3 2 1 0 0 0 0 0 0
(Note: this assumes the rows we need to fill look like the ones in your example. If they're messier we'd have to do a little more work, but the same techniques will apply.)
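In the output above, row 1 continues as 10, 9, 8, ... rather than 14, 13, 12, ... because the NaNs to the left of the 15 also contribute -1 to the cumulative sum. A variant that counts steps only inside the masked cells is sketched below. It assumes each row has a single trailing run of NaNs after its last valid value, as in the example; the names steps and filled are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        [np.nan] * 9 + [2],
        [np.nan] * 4 + [15] + [np.nan] * 5,
        list(range(10)),
        [4] + [np.nan] * 9,
    ],
    columns=list("abcdefghij"),
)

# Cells to fill: empty, but with a valid value somewhere to their left.
mask = df.ffill(axis=1).notna() & df.isna()

# Number of steps since the last valid value (0 outside the mask).
steps = mask.astype(int).cumsum(axis=1)

# Carry the last valid value forward, subtract the step count, floor at 0.
filled = (df.ffill(axis=1) - steps).clip(lower=0)

# Keep the original cells where mask is False, use the new values elsewhere.
result = df.where(~mask, filled)
```

Everything here is column-wise vectorised, so there is no Python-level loop over rows.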

How to replace values in pandas DataFrame respecting index alignment

I want to replace some missing values in a dataframe with some other values, keeping the index alignment.
For example, in the following dataframe
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':np.repeat(['a','b','c'],4), 'B':np.tile([1,2,3,4],3),'C':range(12),'D':range(12)})
df = df.iloc[:-1]
df.set_index(['A','B'], inplace=True)
df.loc['b'] = np.nan
df
C D
A B
a 1 0 0
2 1 1
3 2 2
4 3 3
b 1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
c 1 8 8
2 9 9
3 10 10
I would like to replace the missing values of 'b' rows matching them with the corresponding indices of 'c' rows.
The result should look like
C D
A B
a 1 0 0
2 1 1
3 2 2
4 3 3
b 1 8 8
2 9 9
3 10 10
4 NaN NaN
c 1 8 8
2 9 9
3 10 10

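You can use fillna, passing the matching 'c' rows as the fill values, and assign the result back to the 'b' rows. fillna aligns on the B index, so B=4, which has no counterpart in 'c', stays NaN:
# .ix has been removed from pandas, so use .loc
>>> df.loc['b'] = df.loc['b'].fillna(df.loc['c']).to_numpy()
>>> df.loc['b']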
C D
B
1 8 8
2 9 9
3 10 10
4 NaN NaN
Result:
>>> df
C D
A B
a 1 0 0
2 1 1
3 2 2
4 3 3
b 1 8 8
2 9 9
3 10 10
4 NaN NaN
c 1 8 8
2 9 9
3 10 10
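An alternative that stays index-aligned over the whole frame is to relabel the 'c' rows as 'b' and let fillna align on the full (A, B) index. This is a sketch under a few assumptions: the columns are created as floats to avoid dtype upcasting when NaN is inserted, and the names fill and fill_b are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": np.repeat(["a", "b", "c"], 4),
    "B": np.tile([1, 2, 3, 4], 3),
    "C": np.arange(12, dtype=float),
    "D": np.arange(12, dtype=float),
}).iloc[:-1].set_index(["A", "B"])

# Blank out the 'b' rows (boolean-mask form of df.loc['b'] = np.nan).
df.loc[df.index.get_level_values("A") == "b"] = np.nan

# Take the 'c' rows (indexed by B only) and relabel them as 'b' rows.
fill = df.loc["c"]
fill_b = pd.concat({"b": fill}, names=["A"])

# fillna with a DataFrame aligns on index and columns; ('b', 4) has no match.
result = df.fillna(fill_b)
```

Because the fill values are matched by label, the row order of the 'c' block does not matter.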
