Pandas dataframe insert missing row and fill with previous row - python

I have a dataframe as below:
import pandas as pd
import numpy as np
df=pd.DataFrame({'id':[0,1,2,4,5],
'A':[0,1,0,1,0],
'B':[None,None,1,None,None]})
id A B
0 0 0 NaN
1 1 1 NaN
2 2 0 1.0
3 4 1 NaN
4 5 0 NaN
Notice that the vast majority of values in the B column are NaN.
The id column increments by 1, so the row between id 2 and id 4 is missing.
The missing row that needs to be inserted is the same as the previous row, except for the id column.
So for example the result is
id A B
0 0 0.0 NaN
1 1 1.0 NaN
2 2 0.0 1.0
3 3 0.0 1.0 <-add row here
4 4 1.0 NaN
5 5 0.0 NaN
I can do this for the A column, but I don't know how to deal with the B column: ffill would fill 1.0 at rows 4 and 5, which is incorrect.
step=1
idx=np.arange(df['id'].min(), df['id'].max() + step, step)
df=df.set_index('id').reindex(idx).reset_index()
df['A']=df["A"].ffill()
EDIT:
Sorry, I forgot one situation: the B column can have different values.
When DataFrame is as below:
id A B
0 0 0 NaN
1 1 1 NaN
2 2 0 1.0
3 4 1 NaN
4 5 0 NaN
5 6 1 2.0
6 9 0 NaN
7 10 1 NaN
the result would be:
id A B
0 0 0 NaN
1 1 1 NaN
2 2 0 1.0
3 3 0 1.0
4 4 1 NaN
5 5 0 NaN
6 6 1 2.0
7 7 1 2.0
8 8 1 2.0
9 9 0 NaN
10 10 1 NaN

Make two changes: keep a copy of the original ids, then use update with isin so only the inserted rows get filled:
s = df.id.copy()  # change 1: remember which ids existed originally
step = 1
idx = np.arange(df['id'].min(), df['id'].max() + step, step)
df = df.set_index('id').reindex(idx).reset_index()
df['A'] = df['A'].ffill()
df.B.update(df.B.ffill().mask(df.id.isin(s)))  # change 2: fill B only on the inserted rows
df
id A B
0 0 0.0 NaN
1 1 1.0 NaN
2 2 0.0 1.0
3 3 0.0 1.0
4 4 1.0 NaN
5 5 0.0 NaN
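As a quick check (my own sketch, not part of the original answer), the same two changes reproduce the expected result for the edited example with several distinct B values:
df = pd.DataFrame({'id': [0, 1, 2, 4, 5, 6, 9, 10],
                   'A': [0, 1, 0, 1, 0, 1, 0, 1],
                   'B': [None, None, 1, None, None, 2, None, None]})
s = df.id.copy()
idx = np.arange(df['id'].min(), df['id'].max() + 1)
df = df.set_index('id').reindex(idx).reset_index()
df['A'] = df['A'].ffill()
df.B.update(df.B.ffill().mask(df.id.isin(s)))
The inserted row with id 3 gets B=1.0, the inserted rows with ids 7 and 8 get B=2.0, and the original NaNs stay NaN.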

If I understand correctly, here is some sample code.
new_df = pd.DataFrame({
    'new_id': [i for i in range(df['id'].max() + 1)],
})
df = df.merge(new_df, how='outer', left_on='id', right_on='new_id')
df = df.sort_values('new_id')
df = df.ffill()
df = df.drop(columns='id')
df
A B new_id
0 0.0 NaN 0
1 1.0 NaN 1
2 0.0 1.0 2
5 0.0 1.0 3
3 1.0 1.0 4
4 0.0 1.0 5

Try this
df = pd.DataFrame({'id': [0, 1, 2, 4, 5],
                   'A': [0, 1, 0, 1, 0],
                   'B': [None, None, 1, None, None]})
# sort the missing ids so consecutive gaps are filled in order:
# each new row is copied from the row with id i-1, which must already exist
missingid = sorted(set(range(df.id.min(), df.id.max())) - set(df.id.tolist()))
for i in missingid:
    df.loc[len(df)] = np.concatenate((np.array([i]), df[df.id == i - 1][["A", "B"]].values[0]))
df = df.sort_values("id").reset_index(drop=True)
Output:
id A B
0 0.0 0.0 NaN
1 1.0 1.0 NaN
2 2.0 0.0 1.0
3 3.0 0.0 1.0
4 4.0 1.0 NaN
5 5.0 0.0 NaN

Related

Fill NaN based on multiple column condition in Pandas

The objective is to fill NaN with respect to two columns (a and b).
a,b,c,d
2,0,1,4
5,0,5,6
6,0,1,1
1,1,1,4
4,1,5,6
5,1,5,6
6,1,1,1
1,2,2,3
6,2,5,6
That is, column a should contain the continuous values 1 to 6 for each fixed value in column b, with the other rows assigned NaN.
The following code snippet does the trick:
import numpy as np
import pandas as pd
maxval_col_a=6
lowval_col_a=1
maxval_col_b=2
lowval_col_b=0
r=list(range(lowval_col_b,maxval_col_b+1))
df = pd.DataFrame(np.column_stack([[2, 5, 6, 1, 4, 5, 6, 1, 6],
                                   [0, 0, 0, 1, 1, 1, 1, 2, 2],
                                   [1, 5, 1, 1, 5, 5, 1, 2, 5],
                                   [4, 6, 1, 4, 6, 6, 1, 3, 6]]),
                  columns=['a', 'b', 'c', 'd'])
all_df = []
for idx in r:
    k = df.loc[df['b'] == idx].set_index('a').reindex(range(lowval_col_a, maxval_col_a + 1)).reset_index()
    k['b'] = idx
    all_df.append(k)
df = pd.concat(all_df)
But I am curious whether there is a more efficient and better way of doing this with pandas.
The expected output
a b c d
0 1 0 NaN NaN
1 2 0 1.0 4.0
2 3 0 NaN NaN
3 4 0 NaN NaN
4 5 0 5.0 6.0
5 6 0 1.0 1.0
0 1 1 1.0 4.0
1 2 1 NaN NaN
2 3 1 NaN NaN
3 4 1 5.0 6.0
4 5 1 5.0 6.0
5 6 1 1.0 1.0
0 1 2 2.0 3.0
1 2 2 NaN NaN
2 3 2 NaN NaN
3 4 2 NaN NaN
4 5 2 NaN NaN
5 6 2 5.0 6.0
Create the Cartesian product of the combinations (swaplevel puts the levels into the (a, b) order used by set_index below):
mi = pd.MultiIndex.from_product([df['b'].unique(), range(1, 7)],
                                names=['b', 'a']).swaplevel()
out = df.set_index(['a', 'b']).reindex(mi).reset_index()
print(out)
# Output
a b c d
0 1 0 NaN NaN
1 2 0 1.0 4.0
2 3 0 NaN NaN
3 4 0 NaN NaN
4 5 0 5.0 6.0
5 6 0 1.0 1.0
6 1 1 1.0 4.0
7 2 1 NaN NaN
8 3 1 NaN NaN
9 4 1 5.0 6.0
10 5 1 5.0 6.0
11 6 1 1.0 1.0
12 1 2 2.0 3.0
13 2 2 NaN NaN
14 3 2 NaN NaN
15 4 2 NaN NaN
16 5 2 NaN NaN
17 6 2 5.0 6.0
First create a MultiIndex from columns [a, b], then build a new MultiIndex with all the combinations, and reindex with it (showing all steps):
# set both a and b as index (it's a MultiIndex)
df.set_index(['a', 'b'], drop=True, inplace=True)
# create the new MultiIndex with every (a, b) combination
new_idx_a = np.tile(np.arange(0, 6 + 1), 3)
new_idx_b = np.repeat([0, 1, 2], 6 + 1)
new_multidx = pd.MultiIndex.from_arrays([new_idx_a, new_idx_b])
# reindex
df = df.reindex(new_multidx)
# convert the MultiIndex back to columns
df.index.names = ['a', 'b']
df.reset_index()
results:
a b c d
0 0 0 NaN NaN
1 1 0 NaN NaN
2 2 0 1.0 4.0
3 3 0 NaN NaN
4 4 0 NaN NaN
5 5 0 5.0 6.0
6 6 0 1.0 1.0
7 0 1 NaN NaN
8 1 1 1.0 4.0
9 2 1 NaN NaN
10 3 1 NaN NaN
11 4 1 5.0 6.0
12 5 1 5.0 6.0
13 6 1 1.0 1.0
14 0 2 NaN NaN
15 1 2 2.0 3.0
16 2 2 NaN NaN
17 3 2 NaN NaN
18 4 2 NaN NaN
19 5 2 NaN NaN
20 6 2 5.0 6.0
We can do it with a groupby on column b: set a as the index and add the missing values of a using numpy.arange, then reset the index to get the expected result:
import numpy as np
df.groupby('b').apply(lambda x: x.set_index('a').reindex(np.arange(1, 7))).drop(columns='b').reset_index()
Output:
b a c d
0 0 1 NaN NaN
1 0 2 1.0 4.0
2 0 3 NaN NaN
3 0 4 NaN NaN
4 0 5 5.0 6.0
5 0 6 1.0 1.0
6 1 1 1.0 4.0
7 1 2 NaN NaN
8 1 3 NaN NaN
9 1 4 5.0 6.0
10 1 5 5.0 6.0
11 1 6 1.0 1.0
12 2 1 2.0 3.0
13 2 2 NaN NaN
14 2 3 NaN NaN
15 2 4 NaN NaN
16 2 5 NaN NaN
17 2 6 5.0 6.0
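For larger data, a merge against the full grid avoids the per-group apply. A sketch along the same lines (my addition, not from the original answers):
import itertools
# full Cartesian grid of the observed b values and the desired a range
grid = pd.DataFrame(list(itertools.product(df['b'].unique(), range(1, 7))),
                    columns=['b', 'a'])
out = grid.merge(df, on=['a', 'b'], how='left')[['a', 'b', 'c', 'd']]
Combinations absent from df come back with NaN in c and d, exactly as in the reindex-based solutions.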

How to iterate over an array using a lambda function with pandas apply

I have the following dataset:
0 1 2
0 2.0 2.0 4
0 1.0 1.0 2
0 1.0 1.0 3
3 1.0 1.0 5
4 1.0 1.0 2
5 1.0 NaN 1
6 NaN 1.0 1
and what I want to do is insert a new column, built row by row: if the row contains a NaN, give it 0; otherwise copy the value from column '2', to get this:
0 1 2 3
0 2.0 2.0 4 4
0 1.0 1.0 2 2
0 1.0 1.0 3 3
3 1.0 1.0 5 5
4 1.0 1.0 2 2
5 1.0 NaN 1 0
6 NaN 1.0 1 0
The following code is what I have so far, which works fine but does not iterate over the values of column '2'.
df.isna().sum(axis=1).apply(lambda x: df[2].iloc[x] if x==0 else 0)
If I use df.iloc[x] I get:
0 4
1 4
2 4
3 4
4 4
5 0
6 0
How can I iterate over the column '2'?
Try the code below, using np.where with isna and any:
>>> df['3'] = np.where(df[['0', '1']].isna().any(axis=1), 0, df['2'])
>>> df
0 1 2 3
0 2.0 2.0 4 4
0 1.0 1.0 2 2
0 1.0 1.0 3 3
3 1.0 1.0 5 5
4 1.0 1.0 2 2
5 1.0 NaN 1 0
6 NaN 1.0 1 0
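For reference, an equivalent formulation (my sketch, not part of the original answer) keeps the value from column '2' wherever both other columns are non-null:
# keep column '2' where neither '0' nor '1' is NaN, otherwise use 0
df['3'] = df['2'].where(df[['0', '1']].notna().all(axis=1), 0)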

Python: how to select indexes from pandas dataframe?

I have a dataframe that looks like the following:
df
0 1 2 3 4
0 0.0 NaN NaN NaN NaN
1 NaN 0.0 0.0 NaN 4.0
2 NaN 2.0 0.0 NaN 5.0
3 NaN NaN NaN 0.0 NaN
4 NaN 0.0 3.0 NaN 0.0
I would like to have a dataframe of all the (i, j) pairs whose value is different from NaN. The dataframe should look like the following:
df
i j val
0 0 0 0.0
1 1 1 0.0
2 1 2 0.0
3 1 4 5.0
4 3 3 0.0
5 4 1 0.0
6 4 2 3.0
7 4 4 0.0
Use DataFrame.stack (which drops the NaN cells) with Series.rename_axis and Series.reset_index:
df = df.stack().rename_axis(('i','j')).reset_index(name='val')
print (df)
i j val
0 0 0 0.0
1 1 1 0.0
2 1 2 0.0
3 1 4 4.0
4 2 1 2.0
5 2 2 0.0
6 2 4 5.0
7 3 3 0.0
8 4 1 0.0
9 4 2 3.0
10 4 4 0.0
Like this:
In [379]: df.stack().reset_index(name='val').rename(columns={'level_0':'i', 'level_1':'j'})
Out[379]:
i j val
0 0 0 0.0
1 1 1 0.0
2 1 2 0.0
3 1 4 4.0
4 2 1 2.0
5 2 2 0.0
6 2 4 5.0
7 3 3 0.0
8 4 1 0.0
9 4 2 3.0
10 4 4 0.0
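An equivalent route (my sketch, not from the answers) goes through melt, which makes the NaN filtering explicit:
out = (df.rename_axis('i').reset_index()
         .melt(id_vars='i', var_name='j', value_name='val')
         .dropna(subset=['val'])
         .sort_values(['i', 'j'], ignore_index=True))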

Replacing values in dataframe by values from other rows by "target row"

I am trying to do the following: when the value of 'content' is NaN, replace it with the value from the target row. Below is my code, which does this by iterating over all rows, but it is ugly and slow. I suspect there is a more elegant/fast way to do this with mask, so I hope someone can inspire me:
Inputs:
import pandas as pd
d = {'content': [1, 3, None, 6, 1, 59, None], 'target': [0,1,0,2,4,5,4]}
df = pd.DataFrame(data=d)
print(df)
for index, row in df.iterrows():
    if df.loc[index, 'content'] != df.loc[index, 'content']:  # NaN != NaN, so this detects NaN
        df.loc[index, 'content'] = df.loc[df.loc[index, 'target'], 'content']
print(df)
outputs:
content target
0 1.0 0
1 3.0 1
2 NaN 0
3 6.0 2
4 1.0 4
5 59.0 5
6 NaN 4
content target
0 1.0 0
1 3.0 1
2 1.0 0
3 6.0 2
4 1.0 4
5 59.0 5
6 1.0 4
Thanks in advance!
Note: only when the content of a row is NaN should it be replaced with that of the target row.
Additional question: can I do the same thing whenever the content is 59 or 6? Thanks a lot!
By using fillna:
df.content=df.content.fillna(df.target)
df
Out[268]:
content target
0 1.0 0
1 3.0 1
2 6.0 0
3 2.0 2
4 1.0 4
5 59.0 5
6 5.0 5
EDIT
df.ffill()
Out[487]:
content target
0 1.0 0
1 3.0 1
2 6.0 0
3 6.0 2
4 1.0 4
5 59.0 5
6 59.0 5
I guess you need this
df.content.reindex(df.target)
Out[492]:
target
0 1.0
1 3.0
0 1.0
2 6.0
4 1.0
5 59.0
5 59.0
Name: content, dtype: float64
Then assign it back:
df.content=df.content.reindex(df.target).values
df
Out[494]:
content target
0 1.0 0
1 3.0 1
2 1.0 0
3 6.0 2
4 1.0 4
5 59.0 5
6 59.0 5
Let me edit again:
df.content.fillna(df.content.reindex(df.target).reset_index(drop=True))
Out[508]:
0 1.0
1 3.0
2 1.0
3 6.0
4 1.0
5 59.0
6 1.0
Name: content, dtype: float64
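To also cover the additional question (treating 59 or 6 like NaN), the same pattern works after masking those values first. This is a sketch on my part, not from the original answer; note that lookup is built from the original column, so self-referencing or chained targets are not resolved:
# content of each row's target, aligned back to the row's own position
lookup = df.content.reindex(df.target).reset_index(drop=True)
df['content'] = df.content.mask(df.content.isin([59, 6])).fillna(lookup)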

Pandas: get two different rows with same pair of values in two different columns

I have two columns, _Id and _ParentId, with this example data. Using these, I want to group _Id with _ParentId.
_Id _ParentId
1 NaN
2 NaN
3 1.0
4 2.0
5 NaN
6 2.0
After grouping the result should be shown as below.
_Id _ParentId
1 NaN
3 1.0
2 NaN
4 2.0
6 2.0
5 NaN
The main aim is to work out which _Id belongs to which _ParentId (e.g., _Id 3 belongs to _Id 1).
I have attempted to use groupby and duplicated but I can't seem to get the results shown above.
Use sort_values on a temporary key:
In [3188]: (df.assign(temp=df._ParentId.combine_first(df._Id))
               .sort_values(by='temp').drop(columns='temp'))
Out[3188]:
_Id _ParentId
0 1 NaN
2 3 1.0
1 2 NaN
3 4 2.0
5 6 2.0
4 5 NaN
Details
In [3189]: df._ParentId.combine_first(df._Id)
Out[3189]:
0 1.0
1 2.0
2 1.0
3 2.0
4 5.0
5 2.0
Name: _ParentId, dtype: float64
In [3190]: df.assign(temp=df._ParentId.combine_first(df._Id))
Out[3190]:
_Id _ParentId temp
0 1 NaN 1.0
1 2 NaN 2.0
2 3 1.0 1.0
3 4 2.0 2.0
4 5 NaN 5.0
5 6 2.0 2.0
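One detail worth noting (my addition, not from the original answer): with duplicate keys, a stable sort keeps each parent row ahead of its children, because the parents appear earlier in the original frame:
df.assign(temp=df._ParentId.combine_first(df._Id)).sort_values(by='temp', kind='stable').drop(columns='temp')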
Your expected output is almost the same as the input, just with IDs 4 and 6 together and the NaNs in different places. It's not possible to get that expected output.
Here is how group-by would ideally work:
print("Original: ")
print(df)
df = df.fillna(-1) # if not replaced with another character , the grouping won't show NaNs.
df2 = df.groupby('_Parent')
print("\nAfter grouping: ")
for key, item in df2:
print (df2.get_group(key))
Output:
Original:
_Id _Parent
0 1 NaN
1 2 NaN
2 3 1.0
3 4 2.0
4 5 NaN
5 6 2.0
After grouping:
_Id _Parent
0 1 0.0
1 2 0.0
4 5 0.0
_Id _Parent
2 3 1.0
_Id _Parent
3 4 2.0
5 6 2.0
