Pandas - Remove leading and trailing zeroes from each row - python

I would like to remove leading and trailing zero values row-wise in my df and then have them shift to be 'aligned'.
Probably best demonstrated with the below example.
Initial df:
index c1 c2 c3 c4 c5 c6 c7 c8
1 0 0 1 2 3 4 5 0
2 0 0 0 1 2 3 4 5
3 0 1 2 3 0 0 0 0
4 0 0 1 2 3 4 0 0
5 1 2 3 4 5 6 7 0
6 0 0 0 1 0 0 4 0
Output:
index c1 c2 c3 c4 c5 c6 c7
1 1 2 3 4 5
2 1 2 3 4 5
3 1 2 3
4 1 2 3 4
5 1 2 3 4 5 6 7
6 1 0 0 4
Note that there may be zeroes within the "string" of true values, so I need to stop at the first nonzero occurrence from each end. Is this possible? Thanks.

Using np.trim_zeros:
Trim the leading and/or trailing zeros from a 1-D array or sequence.
out = pd.DataFrame([np.trim_zeros(i) for i in df.values], index=df.index)
out.columns = df.columns[:len(out.columns)]
c1 c2 c3 c4 c5 c6 c7
index
1 1 2 3 4.0 5.0 NaN NaN
2 1 2 3 4.0 5.0 NaN NaN
3 1 2 3 NaN NaN NaN NaN
4 1 2 3 4.0 NaN NaN NaN
5 1 2 3 4.0 5.0 6.0 7.0
6 1 0 0 4.0 NaN NaN NaN
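For reference, here is a self-contained version of the np.trim_zeros approach; the frame construction is my own reconstruction of the question's example data:

```python
import numpy as np
import pandas as pd

# Reconstruction of the question's example frame
df = pd.DataFrame(
    [[0, 0, 1, 2, 3, 4, 5, 0],
     [0, 0, 0, 1, 2, 3, 4, 5],
     [0, 1, 2, 3, 0, 0, 0, 0],
     [0, 0, 1, 2, 3, 4, 0, 0],
     [1, 2, 3, 4, 5, 6, 7, 0],
     [0, 0, 0, 1, 0, 0, 4, 0]],
    columns=[f'c{i}' for i in range(1, 9)],
    index=pd.Index(range(1, 7), name='index'),
)

# Trim leading/trailing zeros per row; rows become ragged, so pandas
# pads the short ones with NaN on the right
out = pd.DataFrame([np.trim_zeros(row) for row in df.values], index=df.index)
out.columns = df.columns[:len(out.columns)]
```

Interior zeros (row 6) survive, since trim_zeros only strips from the ends.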

You can use this:
df_out = df.apply(lambda x: pd.Series(x.loc[x.mask(x == 0).first_valid_index():
                                            x.mask(x == 0).last_valid_index()].tolist()),
                  axis=1)
df_out = df_out.set_axis(df.columns[df_out.columns], axis=1)
Output:
c1 c2 c3 c4 c5 c6 c7
index
1 1.0 2.0 3.0 4.0 5.0 NaN NaN
2 1.0 2.0 3.0 4.0 5.0 NaN NaN
3 1.0 2.0 3.0 NaN NaN NaN NaN
4 1.0 2.0 3.0 4.0 NaN NaN NaN
5 1.0 2.0 3.0 4.0 5.0 6.0 7.0
6 1.0 0.0 0.0 4.0 NaN NaN NaN
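A runnable sketch of the mask/first_valid_index idea on a tiny toy frame (the data here is my own, not the question's):

```python
import pandas as pd

df = pd.DataFrame({'c1': [0, 1], 'c2': [1, 0], 'c3': [2, 3], 'c4': [0, 0]},
                  index=[1, 2])

def trim_row(x):
    # mask() turns zeros into NaN, so first/last_valid_index() locate the
    # first and last nonzero labels; slice between them and re-pack
    m = x.mask(x == 0)
    return pd.Series(x.loc[m.first_valid_index():m.last_valid_index()].tolist())

out = df.apply(trim_row, axis=1)
```

Because label slicing with .loc is inclusive on both ends, interior zeros between the first and last nonzero values are kept.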

Related

How to insert multiple rows to a pandas DF with a missing value?

I have a DF:
df = pd.DataFrame({"A":[0,1,3,5,6], "B":['B0','B1','B3','B5','B6'], "C":['C0','C1','C3','C5','C6']})
I'm trying to insert 10 empty rows at the position where a number is missing from the continuous sequence of column A. For those 10 rows, the values of columns A, B and C are the missing number, NaN, and NaN, respectively. Like this:
A B C
0 B0 C0
1 B1 C1
2 NaN NaN
2 NaN NaN
2 NaN NaN
2 NaN NaN
2 NaN NaN
2 NaN NaN
2 NaN NaN
2 NaN NaN
2 NaN NaN
2 NaN NaN
3 B3 C3
4 NaN NaN
4 NaN NaN
4 NaN NaN
4 NaN NaN
4 NaN NaN
4 NaN NaN
4 NaN NaN
4 NaN NaN
4 NaN NaN
4 NaN NaN
5 B5 C5
6 B6 C6
I've played with the index, but this adds only 1 row:
df1 = df.merge(how='right', on='A', right = pd.DataFrame({'A':np.arange(df.iloc[0]['A'],
df.iloc[-1]['A']+1)})).reset_index().drop(['index'], axis=1)
Thanks in advance!
Let's repeat the indices where the difference to the next value is above 1, then concat:
N = 10
out = (pd.concat([df, df[['A']].loc[df.index.repeat(df['A'].diff(-1).lt(-1).mul(N-1))]])
.sort_index(kind='stable')
)
Output:
A B C
0 0 B0 C0
1 1 B1 C1
1 1 NaN NaN
1 1 NaN NaN
1 1 NaN NaN
1 1 NaN NaN
1 1 NaN NaN
1 1 NaN NaN
1 1 NaN NaN
1 1 NaN NaN
1 1 NaN NaN
2 3 B3 C3
2 3 NaN NaN
2 3 NaN NaN
2 3 NaN NaN
2 3 NaN NaN
2 3 NaN NaN
2 3 NaN NaN
2 3 NaN NaN
2 3 NaN NaN
2 3 NaN NaN
3 5 B5 C5
4 6 B6 C6
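To make this reproducible, here is a self-contained version of the repeat-and-concat recipe. Note that, as in the output above, the inserted rows carry the A value of the row preceding the gap, and N - 1 extra copies are appended after the original row:

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 3, 5, 6],
                   "B": ['B0', 'B1', 'B3', 'B5', 'B6'],
                   "C": ['C0', 'C1', 'C3', 'C5', 'C6']})

N = 10
# Rows whose gap to the next A value exceeds 1 are repeated N - 1 times;
# diff(-1) computes A - A.shift(-1), so a gap of 2 gives -2 < -1
rep = df['A'].diff(-1).lt(-1).mul(N - 1)
out = (pd.concat([df, df[['A']].loc[df.index.repeat(rep)]])
       .sort_index(kind='stable'))
```

Only column A is selected for the repeated rows, so B and C come out as NaN after the concat; the stable sort keeps each original row ahead of its copies.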
One approach could be as follows:
First, use df.set_index to make column A the index.
Next, use range for a range that runs from 0 through to the max of A (i.e. 6).
Now, apply df.reindex based on np.repeat. A list comprehension feeds a 1 to the repeats parameter for every value that already exists in A; for the missing ones, it feeds 10.
Finally, chain df.reset_index.
df.set_index('A', inplace=True)
rng = range(df.index.max()+1)
df = df.reindex(np.repeat(rng,[1 if i in df.index else 10 for i in rng]))\
.reset_index(drop=False)
print(df)
A B C
0 0 B0 C0
1 1 B1 C1
2 2 NaN NaN
3 2 NaN NaN
4 2 NaN NaN
5 2 NaN NaN
6 2 NaN NaN
7 2 NaN NaN
8 2 NaN NaN
9 2 NaN NaN
10 2 NaN NaN
11 2 NaN NaN
12 3 B3 C3
13 4 NaN NaN
14 4 NaN NaN
15 4 NaN NaN
16 4 NaN NaN
17 4 NaN NaN
18 4 NaN NaN
19 4 NaN NaN
20 4 NaN NaN
21 4 NaN NaN
22 4 NaN NaN
23 5 B5 C5
24 6 B6 C6
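Put together as a runnable sketch, with the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 3, 5, 6],
                   "B": ['B0', 'B1', 'B3', 'B5', 'B6'],
                   "C": ['C0', 'C1', 'C3', 'C5', 'C6']})

df = df.set_index('A')
rng = range(df.index.max() + 1)
# 1 repeat for labels already present, 10 for each missing label
repeats = [1 if i in df.index else 10 for i in rng]
df = df.reindex(np.repeat(rng, repeats)).reset_index()
```

np.repeat(rng, repeats) expands the label sequence, and reindex fills NaN for every repeated (missing) label, so each gap becomes a block of 10 empty rows.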

Fill Nan based on multiple column condition in Pandas

The objective is to fill NaN with respect to two columns (i.e., a, b) .
a b c d
2,0,1,4
5,0,5,6
6,0,1,1
1,1,1,4
4,1,5,6
5,1,5,6
6,1,1,1
1,2,2,3
6,2,5,6
That is, column a should hold a continuous run of values from 1 to 6 for each fixed value in column b, with the newly added rows assigned NaN.
The code snippet below does the trick:
import numpy as np
import pandas as pd
maxval_col_a=6
lowval_col_a=1
maxval_col_b=2
lowval_col_b=0
r=list(range(lowval_col_b,maxval_col_b+1))
df = pd.DataFrame(np.column_stack([[2, 5, 6, 1, 4, 5, 6, 1, 6],
                                   [0, 0, 0, 1, 1, 1, 1, 2, 2],
                                   [1, 5, 1, 1, 5, 5, 1, 2, 5],
                                   [4, 6, 1, 4, 6, 6, 1, 3, 6]]),
                  columns=['a', 'b', 'c', 'd'])
all_df = []
for idx in r:
    k = df.loc[df['b'] == idx].set_index('a').reindex(range(lowval_col_a, maxval_col_a + 1)).reset_index()
    k['b'] = idx
    all_df.append(k)
df=pd.concat(all_df)
But I am curious whether there is a more efficient and cleaner way of doing this with Pandas.
The expected output
a b c d
0 1 0 NaN NaN
1 2 0 1.0 4.0
2 3 0 NaN NaN
3 4 0 NaN NaN
4 5 0 5.0 6.0
5 6 0 1.0 1.0
0 1 1 1.0 4.0
1 2 1 NaN NaN
2 3 1 NaN NaN
3 4 1 5.0 6.0
4 5 1 5.0 6.0
5 6 1 1.0 1.0
0 1 2 2.0 3.0
1 2 2 NaN NaN
2 3 2 NaN NaN
3 4 2 NaN NaN
4 5 2 NaN NaN
5 6 2 5.0 6.0
Create the cartesian product of combinations:
mi = pd.MultiIndex.from_product([df['b'].unique(), range(1, 7)],
names=['b', 'a']).swaplevel()
out = df.set_index(['a', 'b']).reindex(mi).reset_index()
print(out)
# Output
a b c d
0 1 0 NaN NaN
1 2 0 1.0 4.0
2 3 0 NaN NaN
3 4 0 NaN NaN
4 5 0 5.0 6.0
5 6 0 1.0 1.0
6 1 1 1.0 4.0
7 2 1 NaN NaN
8 3 1 NaN NaN
9 4 1 5.0 6.0
10 5 1 5.0 6.0
11 6 1 1.0 1.0
12 1 2 2.0 3.0
13 2 2 NaN NaN
14 3 2 NaN NaN
15 4 2 NaN NaN
16 5 2 NaN NaN
17 6 2 5.0 6.0
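Self-contained, with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 5, 6, 1, 4, 5, 6, 1, 6],
                   'b': [0, 0, 0, 1, 1, 1, 1, 2, 2],
                   'c': [1, 5, 1, 1, 5, 5, 1, 2, 5],
                   'd': [4, 6, 1, 4, 6, 6, 1, 3, 6]})

# Cartesian product of b's unique values with a = 1..6; swaplevel
# puts the levels in (a, b) order to match set_index(['a', 'b'])
mi = pd.MultiIndex.from_product([df['b'].unique(), range(1, 7)],
                                names=['b', 'a']).swaplevel()
out = df.set_index(['a', 'b']).reindex(mi).reset_index()
```

Every (a, b) pair absent from the original frame comes back as a NaN row, which is exactly the fill the question asks for.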
First create a MultiIndex with cols [a, b], then a new MultiIndex with all the combinations, and then reindex with the new MultiIndex:
(showing all steps)
# set both a and b as index (it's a multiindex)
df.set_index(['a','b'],drop=True,inplace=True)
# create the new multindex
new_idx_a = np.tile(np.arange(0, 6 + 1), 3)
new_idx_b = np.repeat([0, 1, 2], 6 + 1)
new_multidx = pd.MultiIndex.from_arrays([new_idx_a, new_idx_b])
# reindex
df=df.reindex(new_multidx)
# convert the multindex back to columns
df.index.names=['a','b']
df.reset_index()
results:
a b c d
0 0 0 NaN NaN
1 1 0 NaN NaN
2 2 0 1.0 4.0
3 3 0 NaN NaN
4 4 0 NaN NaN
5 5 0 5.0 6.0
6 6 0 1.0 1.0
7 0 1 NaN NaN
8 1 1 1.0 4.0
9 2 1 NaN NaN
10 3 1 NaN NaN
11 4 1 5.0 6.0
12 5 1 5.0 6.0
13 6 1 1.0 1.0
14 0 2 NaN NaN
15 1 2 2.0 3.0
16 2 2 NaN NaN
17 3 2 NaN NaN
18 4 2 NaN NaN
19 5 2 NaN NaN
20 6 2 5.0 6.0
We can do it with a groupby on column b: set a as the index and add the missing values of a using numpy.arange.
To finish, reset the index to get the expected result:
import numpy as np
df.groupby('b').apply(lambda x: x.set_index('a').reindex(np.arange(1, 7))).drop(columns='b').reset_index()
Output :
b a c d
0 0 1 NaN NaN
1 0 2 1.0 4.0
2 0 3 NaN NaN
3 0 4 NaN NaN
4 0 5 5.0 6.0
5 0 6 1.0 1.0
6 1 1 1.0 4.0
7 1 2 NaN NaN
8 1 3 NaN NaN
9 1 4 5.0 6.0
10 1 5 5.0 6.0
11 1 6 1.0 1.0
12 2 1 2.0 3.0
13 2 2 NaN NaN
14 2 3 NaN NaN
15 2 4 NaN NaN
16 2 5 NaN NaN
17 2 6 5.0 6.0

Pandas dataframe insert missing row and fill with previous row

I have a dataframe as below:
import pandas as pd
import numpy as np
df=pd.DataFrame({'id':[0,1,2,4,5],
'A':[0,1,0,1,0],
'B':[None,None,1,None,None]})
id A B
0 0 0 NaN
1 1 1 NaN
2 2 0 1.0
3 4 1 NaN
4 5 0 NaN
Notice that the vast majority of the values in column B are NaN.
The id column increments by 1, so one row between id 2 and 4 is missing.
The missing row that needs to be inserted is the same as the previous row, except for the id column.
So, for example, the result is:
id A B
0 0 0.0 NaN
1 1 1.0 NaN
2 2 0.0 1.0
3 3 0.0 1.0 <-add row here
4 4 1.0 NaN
5 5 0.0 NaN
I can do this for column A, but I don't know how to deal with column B, as ffill would fill 1.0 at rows 4 and 5, which is incorrect:
step=1
idx=np.arange(df['id'].min(), df['id'].max() + step, step)
df=df.set_index('id').reindex(idx).reset_index()
df['A']=df["A"].ffill()
EDIT:
Sorry, I forgot one situation: column B can have different values.
When the DataFrame is as below:
id A B
0 0 0 NaN
1 1 1 NaN
2 2 0 1.0
3 4 1 NaN
4 5 0 NaN
5 6 1 2.0
6 9 0 NaN
7 10 1 NaN
the result would be:
id A B
0 0 0 NaN
1 1 1 NaN
2 2 0 1.0
3 3 0 1.0
4 4 1 NaN
5 5 0 NaN
6 6 1 2.0
7 7 1 2.0
8 8 1 2.0
9 9 0 NaN
10 10 1 NaN
Do the changes but keep the original id, then patch B with update and isin:
s=df.id.copy() #change 1
step=1
idx=np.arange(df['id'].min(), df['id'].max() + step, step)
df=df.set_index('id').reindex(idx).reset_index()
df['A']=df["A"].ffill()
df.B.update(df.B.ffill().mask(df.id.isin(s))) # change two
df
id A B
0 0 0.0 NaN
1 1 1.0 NaN
2 2 0.0 1.0
3 3 0.0 1.0
4 4 1.0 NaN
5 5 0.0 NaN
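Applied to the edited example with several B values, the same recipe looks like this; the where() form in the last line is my own equivalent of the update/mask trick, written that way to avoid in-place update pitfalls:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [0, 1, 2, 4, 5, 6, 9, 10],
                   'A':  [0, 1, 0, 1, 0, 1, 0, 1],
                   'B':  [None, None, 1, None, None, 2, None, None]})

s = df['id'].copy()                                  # remember the original ids
idx = np.arange(df['id'].min(), df['id'].max() + 1)
df = df.set_index('id').reindex(idx).reset_index()
df['A'] = df['A'].ffill()
# keep B as-is on original rows; take the ffilled value only on inserted rows
df['B'] = df['B'].where(df['id'].isin(s), df['B'].ffill())
```

Original rows keep their B (including NaN); only the freshly inserted ids receive the forward-filled value.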
If I understand it in the right way, here is some sample code.
new_df = pd.DataFrame({
'new_id': [i for i in range(df['id'].max() + 1)],
})
df = df.merge(new_df, how='outer', left_on='id', right_on='new_id')
df = df.sort_values('new_id')
df = df.ffill()
df = df.drop(columns='id')
df
A B new_id
0 0.0 NaN 0
1 1.0 NaN 1
2 0.0 1.0 2
5 0.0 1.0 3
3 1.0 1.0 4
4 0.0 1.0 5
Try this
df=pd.DataFrame({'id':[0,1,2,4,5],
'A':[0,1,0,1,0],
'B':[None,None,1,None,None]})
missingid = list(set(range(df.id.min(), df.id.max())) - set(df.id.tolist()))
for i in missingid:
    df.loc[len(df)] = np.concatenate((np.array([i]),
                                      df[df.id == i - 1][["A", "B"]].values[0]))
df = df.sort_values("id").reset_index(drop=True)
output
id A B
0 0.0 0.0 NaN
1 1.0 1.0 NaN
2 2.0 0.0 1.0
3 3.0 0.0 1.0
4 4.0 1.0 NaN
5 5.0 0.0 NaN

How can i compare a DataFrame with other DataFrame's columns?

I have two different DataFrames. df is the complete one and sample is for comparing. Here is the data I have:
sample.tail()
T1 C C1 C2 C3
0 1 5 0.0 7.0 5.0
df.tail()
T1 T2 C C1 C2 C3
0 1 0 5 4.0 6.0 6.0
1 0 0 5 5.0 4.0 6.0
2 0 1 7 5.0 5.0 4.0
3 1 1 0 7.0 5.0 5.0
4 1 1 5 0.0 7.0 5.0
I have selected some columns from the sample df and am trying to find rows in df that match the sample.
Here is what I did so far, but with no luck:
cols = sample.columns
df = df[df[cols] == sample[cols]]
and i am getting the following error:
ValueError: Can only compare identically-labeled DataFrame objects
Can you kindly help me find a solution for this?
EDIT: Expected output
df.tail()
T1 T2 C C1 C2 C3
0 1 0 5 0.0 7.0 5.0
21 1 1 5 0.0 7.0 5.0
27 1 0 5 0.0 7.0 5.0
34 1 1 5 0.0 7.0 5.0
42 1 1 5 0.0 7.0 5.0
47 1 0 5 0.0 7.0 5.0
51 1 1 5 0.0 7.0 5.0
You can see that all the data matches the sample dataframe except T2. This is the expected output for me.
Thanks
Using pd.Index.intersection to get the shared columns, you can use:
cols = sample.columns.intersection(df.columns)
df[df[cols].apply(tuple, axis=1).isin(sample[cols].apply(tuple, axis=1))]
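A self-contained sketch built on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'T1': [1, 0, 0, 1, 1],
                   'T2': [0, 0, 1, 1, 1],
                   'C':  [5, 5, 7, 0, 5],
                   'C1': [4.0, 5.0, 5.0, 7.0, 0.0],
                   'C2': [6.0, 4.0, 5.0, 5.0, 7.0],
                   'C3': [6.0, 6.0, 4.0, 5.0, 5.0]})
sample = pd.DataFrame({'T1': [1], 'C': [5],
                       'C1': [0.0], 'C2': [7.0], 'C3': [5.0]})

# Only compare the columns both frames share; T2 is ignored automatically
cols = sample.columns.intersection(df.columns)
# Pack each row into a tuple so whole rows can be tested with isin
matches = df[df[cols].apply(tuple, axis=1).isin(sample[cols].apply(tuple, axis=1))]
```

Turning each row into a tuple sidesteps the "identically-labeled" restriction of ==, because isin only compares values, not index labels.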

Forward fill Pandas df only if an entire line is made of Nan

I would like to forward fill a pandas df with the previous line only when the current line is entirely composed of NaN.
This means that fillna(method='ffill', limit=1) does not work in my case, because it works element-wise while I need a line-wise fillna.
Is there a more elegant way to achieve this task than the following instructions?
s = df.count(axis=1)
for d in df.index[1:]:
    if s.loc[d] == 0:
        i = s.index.get_loc(d)
        df.iloc[i] = df.iloc[i - 1]
Input
v1 v2
1 1 2
2 nan 3
3 2 4
4 nan nan
Output
v1 v2
1 1 2
2 nan 3
3 2 4
4 2 4
You can use a boolean mask to select the rows for applying ffill:
m = df.isnull().all(axis=1) | df.notnull().all(axis=1)
print (m)
1 True
2 False
3 True
4 True
dtype: bool
print (df[m])
v1 v2
1 1.0 2.0
3 2.0 4.0
4 NaN NaN
df[m] = df[m].ffill()
print (df)
v1 v2
1 1.0 2.0
2 NaN 3.0
3 2.0 4.0
4 2.0 4.0
EDIT:
print (df)
v1 v2
1 1.0 2.0
2 NaN 7.0
3 4.0 8.0
4 NaN NaN
5 2.0 4.0
6 NaN 3.0
7 NaN NaN
m = df.isnull().all(axis=1) | df.notnull().all(axis=1)
print (m)
1 True
2 False
3 True
4 True
5 True
6 False
7 True
dtype: bool
long_str = 'some long helper str'
df[~m] = df[~m].fillna(long_str)
df = df.ffill().replace(long_str, np.nan)
print (df)
v1 v2
1 1.0 2.0
2 NaN 7.0
3 4.0 8.0
4 4.0 8.0
5 2.0 4.0
6 NaN 3.0
7 NaN 3.0
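Here is the sentinel trick end to end on the EDIT data; the astype(object) cast is my addition so the string sentinel fits into the numeric columns on recent pandas versions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': [1.0, np.nan, 4.0, np.nan, 2.0, np.nan, np.nan],
                   'v2': [2.0, 7.0, 8.0, np.nan, 4.0, 3.0, np.nan]},
                  index=range(1, 8))

# Rows that are fully NaN or fully filled; only the rest need protecting
m = df.isnull().all(axis=1) | df.notnull().all(axis=1)

sentinel = 'some long helper str'
df = df.astype(object)          # allow the string sentinel in float columns
# Protect NaNs in partially filled rows, ffill everything, then restore
df[~m] = df[~m].fillna(sentinel)
df = df.ffill().replace(sentinel, np.nan)
```

The sentinel stops ffill from leaking values into partially filled rows: only rows that were entirely NaN end up copying the previous line.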
