Partial fillna with condition - python

I have the following data frame, to which I want to apply bfill as follows:
    amount  percentage
0      NaN         NaN
1      1.0        20.0
2      2.0        10.0
3      NaN         NaN
4      NaN         NaN
5      NaN         NaN
6      NaN         NaN
7      3.0        50.0
8      4.0        10.0
9      NaN         NaN
10     5.0        10.0
I want to bfill the NaNs in the amount column according to the percentage column, i.e. if the corresponding percentage is 50, then backfill only 50% of the NaNs before that number (a partial fill). E.g. the amount value 3.0 has a percentage of 50, so out of the 4 NaN entries above it, only 50% (the 2 nearest) are to be backfilled.
proposed output:
    amount  percentage
0      NaN         NaN
1      1.0        20.0
2      2.0        10.0
3      NaN         NaN
4      NaN         NaN
5      3.0         NaN
6      3.0         NaN
7      3.0        50.0
8      4.0        10.0
9      NaN         NaN
10     5.0        10.0
Please help.
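For reference, a minimal construction of this frame (a sketch; the NaN percentages on the all-NaN rows are inferred from the answer's output below):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'amount':     [np.nan, 1.0, 2.0, np.nan, np.nan, np.nan, np.nan, 3.0, 4.0, np.nan, 5.0],
    'percentage': [np.nan, 20,  10,  np.nan, np.nan, np.nan, np.nan, 50,  10,  np.nan, 10],
})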

Create groups according to NaNs
df['group_id'] = df.amount.where(df.amount.isna(), 1).cumsum().bfill()
Create a filling function
def custom_fill(x):
    # Calculate number of rows to be filled
    max_fill_rows = math.floor(x.iloc[-1, 1] * (x.shape[0] - 1) / 100)
    # Fill only if number of rows to fill is not zero
    return x.bfill(limit=max_fill_rows) if max_fill_rows else x
Fill the DataFrame
df.groupby('group_id').apply(custom_fill)
Output
    amount  percentage  group_id
0      NaN         NaN       1.0
1      1.0        20.0       1.0
2      2.0        10.0       2.0
3      NaN         NaN       3.0
4      NaN         NaN       3.0
5      3.0        50.0       3.0
6      3.0        50.0       3.0
7      3.0        50.0       3.0
8      4.0        10.0       4.0
9      NaN         NaN       5.0
10     5.0        10.0       5.0
PS: Don't forget to import the required libraries
import math
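To see why the group construction works, here are the intermediate values on the sample frame above (a sketch; each non-NaN amount becomes its own group number, and bfill attaches every run of NaNs to the group of the value that follows it):
steps = df.amount.where(df.amount.isna(), 1)
print(steps.tolist())
# [nan, 1.0, 1.0, nan, nan, nan, nan, 1.0, 1.0, nan, 1.0]
print(steps.cumsum().tolist())
# [nan, 1.0, 2.0, nan, nan, nan, nan, 3.0, 4.0, nan, 5.0]
print(steps.cumsum().bfill().tolist())
# [1.0, 1.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0, 5.0]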

Related

Is there a way to forward fill with ascending logic in pandas / numpy?

What is the most pandastic way to forward fill with ascending logic (without iterating over the rows)?
input:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['test'] = np.nan,np.nan,1,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,6,np.nan,np.nan
df['desired_output'] = np.nan,np.nan,1,1,1,3,3,3,3,3,6,6,6
print (df)
output:
test desired_output
0 NaN NaN
1 NaN NaN
2 1.0 1.0
3 NaN 1.0
4 NaN 1.0
5 3.0 3.0
6 NaN 3.0
7 NaN 3.0
8 2.0 3.0
9 NaN 3.0
10 6.0 6.0
11 NaN 6.0
12 NaN 6.0
In the 'test' column, the number of consecutive NaN's is random.
In the 'desired_output' column, I am trying to forward fill with ascending values only. Also, when lower values are encountered (row 8, value = 2.0 above), they are overwritten with the current higher value.
Can anyone help? Thanks in advance.
You can combine cummax to select the cumulative maximum value and ffill to replace the NaNs:
df['desired_output'] = df['test'].cummax().ffill()
output:
test desired_output
0 NaN NaN
1 NaN NaN
2 1.0 1.0
3 NaN 1.0
4 NaN 1.0
5 3.0 3.0
6 NaN 3.0
7 NaN 3.0
8 2.0 3.0
9 NaN 3.0
10 6.0 6.0
11 NaN 6.0
12 NaN 6.0
intermediate Series:
df['test'].cummax()
0 NaN
1 NaN
2 1.0
3 NaN
4 NaN
5 3.0
6 NaN
7 NaN
8 3.0
9 NaN
10 6.0
11 NaN
12 NaN
Name: test, dtype: float64
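A side note: the same result can be computed in one NumPy step (a sketch; np.fmax ignores NaNs, so its accumulation takes the running maximum and forward-fills at once, leaving only the leading NaNs in place):
import numpy as np
df['desired_output'] = np.fmax.accumulate(df['test'].to_numpy())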

Fill nan gaps in pandas df only if gaps smaller than N nans

I am working with a pandas data frame that also contains nan values. I want to substitute the nans with interpolated values using df.interpolate, but only if the length of the sequence of nan values is <= N. As an example, let's assume I choose N = 2 (so I want to fill sequences of nans that are up to 2 nans long) and I have a dataframe with
print(df)
  A    B    C
  1    1    1
nan  nan    2
nan  nan    3
nan    4  nan
  5    5    5
In such a case I want to apply a function to df so that only the nan sequences of length <= 2 get filled, while the longer sequences are left untouched, resulting in my desired output of
print(df)
  A    B    C
  1    1    1
nan    2    2
nan    3    3
nan    4    4
  5    5    5
Note that I am aware of the option limit=N inside df.interpolate, but it doesn't do what I want, because it would fill any length of nan sequence, just limiting the filling to the first N nans, resulting in the undesired output
print(df)
  A    B    C
  1    1    1
  2    2    2
  3    3    3
nan    4    4
  5    5    5
So do you know of a function, or how to construct code, that results in my desired output? Thanks.
You can perform run length encoding and identify the runs of NaN that are shorter than or equal to two elements for each columns. One way to do that is to use get_id from package pdrle (disclaimer: I wrote it).
import pdrle
chk = df.isna() & (df.apply(lambda x: x.groupby(pdrle.get_id(x)).transform(len)) <= 2)
df[chk] = df.interpolate()[chk]
# A B C
# 0 1.0 1.0 1.0
# 1 NaN 2.0 2.0
# 2 NaN 3.0 3.0
# 3 NaN 4.0 4.0
# 4 5.0 5.0 5.0
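If you would rather not add a dependency, the same run lengths can be computed in plain pandas (a sketch; nan_run_lengths is a hypothetical helper, not part of the answer above):
import pandas as pd

def nan_run_lengths(s):
    # label runs of identical isna() state, then broadcast each run's size to its rows
    grp = s.isna().ne(s.isna().shift()).cumsum()
    return s.groupby(grp).transform('size')

chk = df.isna() & (df.apply(nan_run_lengths) <= 2)
df[chk] = df.interpolate()[chk]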
Try:
N = 2
df_interpolated = df.interpolate()
for c in df:
    mask = df[c].isna()
    x = (
        mask.groupby((mask != mask.shift()).cumsum()).transform(
            lambda x: len(x) > N
        )
        * mask
    )
    df_interpolated[c] = df_interpolated.loc[~x, c]
print(df_interpolated)
Prints:
A B C
0 1.0 1.0 1.0
1 NaN 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 5.0 5.0 5.0
Trying with a different df:
A B C
0 1.0 1.0 1.0
1 NaN NaN 2.0
2 NaN NaN 3.0
3 NaN 4.0 NaN
4 5.0 5.0 5.0
5 NaN 5.0 NaN
6 NaN 5.0 NaN
7 8.0 5.0 NaN
produces:
A B C
0 1.0 1.0 1.0
1 NaN 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 5.0 5.0 5.0
5 6.0 5.0 NaN
6 7.0 5.0 NaN
7 8.0 5.0 NaN
You can try the following -
n = 2
cols = df.columns[df.isna().sum() <= n]
df[cols] = df[cols].interpolate()
df
A B C
0 1.0 1.0 1.0
1 NaN 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 5.0 5.0 5.0
df.columns[df.isna().sum() <= n] filters the columns based on your condition. Then, you simply overwrite the columns after interpolation. Note that this compares each column's total NaN count against n, so it only matches the per-gap condition when all of a column's NaNs form a single run, as in the example.

How to freeze first numbers in sequences between NaNs in Python pandas dataframe

Is there a Pythonic way, in a timeseries dataframe, to go down each column, pick the first number in a sequence, push it forward until the next NaN, then take the next non-NaN number and push that one down until the next NaN, and so on (retaining the indices and NaNs)?
For example, I would like to convert this dataframe:
DF = pd.DataFrame(data={'A':[np.nan,1,3,5,7,np.nan,2,4,6,np.nan], 'B':[8,6,4,np.nan,np.nan,9,7,3,np.nan,3], 'C':[np.nan,np.nan,4,2,6,np.nan,1,5,2,8]})
A B C
0 NaN 8.0 NaN
1 1.0 6.0 NaN
2 3.0 4.0 4.0
3 5.0 NaN 2.0
4 7.0 NaN 6.0
5 NaN 9.0 NaN
6 2.0 7.0 1.0
7 4.0 3.0 5.0
8 6.0 NaN 2.0
9 NaN 3.0 8.0
To this dataframe:
Result = pd.DataFrame(data={'A':[np.nan,1,1,1,1,np.nan,2,2,2,np.nan], 'B':[8,8,8,np.nan,np.nan,9,9,9,np.nan,3], 'C':[np.nan,np.nan,4,4,4,np.nan,1,1,1,1]})
A B C
0 NaN 8.0 NaN
1 1.0 8.0 NaN
2 1.0 8.0 4.0
3 1.0 NaN 4.0
4 1.0 NaN 4.0
5 NaN 9.0 NaN
6 2.0 9.0 1.0
7 2.0 9.0 1.0
8 2.0 NaN 1.0
9 NaN 3.0 1.0
I know I can use a loop to iterate down the columns to do this, but would appreciate some help on how to do it in a more efficient Pythonic way on a very large dataframe. Thank you.
IIUC:
# where DF is not NaN
mask = DF.notna()
Result = (DF.shift(-1)           # fill the original NaN's with their next value
            .mask(mask)          # replace all the original non-NaN with NaN
            .ffill()             # forward fill
            .fillna(DF.iloc[0])  # handle the top of columns that start with a non-NaN
            .where(mask)         # replace the original NaN's back
          )
Output:
A B C
0 NaN 8.0 NaN
1 1.0 8.0 NaN
2 1.0 8.0 4.0
3 1.0 NaN 4.0
4 1.0 NaN 4.0
5 NaN 9.0 NaN
6 2.0 9.0 1.0
7 2.0 9.0 1.0
8 2.0 NaN 1.0
9 NaN 3.0 1.0
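An alternative sketch with the same output: keep only the first value of each non-NaN run, forward fill, and re-apply the original NaN mask:
starts = DF.notna() & DF.shift().isna()       # first element of every non-NaN run
Result = DF[starts].ffill().where(DF.notna()) # spread it down the run, restore the NaNs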

Create a column that has the same length of the longest column in the data at the same time

I have the following data:
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]]
dataFrame = pandas.DataFrame(data).transpose()
Output:
0 1 2
0 1.0 1.0 1.0
1 2.0 2.0 2.0
2 3.0 3.0 3.0
3 NaN 4.0 4.0
4 NaN 5.0 5.0
5 NaN NaN 6.0
6 NaN NaN 7.0
Is it possible to create a 4th column, AT THE SAME TIME the other columns are created in data, which has the same length as the longest column of this dataframe (the 3rd one)?
The data of this column doesn't matter. Assume it's 8. So the desired output can be:
0 1 2 3
0 1.0 1.0 1.0 8.0
1 2.0 2.0 2.0 8.0
2 3.0 3.0 3.0 8.0
3 NaN 4.0 4.0 8.0
4 NaN 5.0 5.0 8.0
5 NaN NaN 6.0 8.0
6 NaN NaN 7.0 8.0
In my script the dataframe keeps changing every time. This means the longest columns keeps changing with it.
Thanks for reading
This is quite similar to answers from #jpp, #Cleb, and maybe some other answers here, just slightly simpler:
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]] + [[]]
This will automatically give you a column of NaNs that is the same length as the longest column, so you don't need the extra work of calculating the length of the longest column. Resulting dataframe:
0 1 2 3
0 1.0 1.0 1.0 NaN
1 2.0 2.0 2.0 NaN
2 3.0 3.0 3.0 NaN
3 NaN 4.0 4.0 NaN
4 NaN 5.0 5.0 NaN
5 NaN NaN 6.0 NaN
6 NaN NaN 7.0 NaN
Note that this answer is less general than some others here (such as by #jpp & #Cleb) in that it will only fill with NaNs. If you want some default fill values other than NaN, you should use one of their answers.
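In runnable form (a sketch of the same idea):
import pandas as pd
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]] + [[]]
df = pd.DataFrame(data).transpose()  # column 3 exists and is all NaN, at full length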
You can append to a list which then immediately feeds the pd.DataFrame constructor:
import pandas as pd
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]]
df = pd.DataFrame(data + [[8]*max(map(len, data))]).transpose()
print(df)
0 1 2 3
0 1.0 1.0 1.0 8.0
1 2.0 2.0 2.0 8.0
2 3.0 3.0 3.0 8.0
3 NaN 4.0 4.0 8.0
4 NaN 5.0 5.0 8.0
5 NaN NaN 6.0 8.0
6 NaN NaN 7.0 8.0
But this is inefficient. Pandas uses NumPy to hold underlying series and setting a series to a constant value is trivial and efficient; you can simply use:
df[3] = 8
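For instance, on the frame built above (a sketch):
df = pd.DataFrame(data).transpose()
df[3] = 8  # broadcasts the scalar down the whole index, including the NaN-padded rows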
It is not entirely clear what you mean by at the same time, but the following would work:
import pandas as pd
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]]
# get the longest list in data
data.append([8] * max(map(len, data)))
pd.DataFrame(data).transpose()
yielding
0 1 2 3
0 1.0 1.0 1.0 8.0
1 2.0 2.0 2.0 8.0
2 3.0 3.0 3.0 8.0
3 NaN 4.0 4.0 8.0
4 NaN 5.0 5.0 8.0
5 NaN NaN 6.0 8.0
6 NaN NaN 7.0 8.0
If you'd like to do it as you create the DataFrame, simply chain a call to assign:
pd.DataFrame(data).T.assign(**{'3': 8})
0 1 2 3
0 1.0 1.0 1.0 8
1 2.0 2.0 2.0 8
2 3.0 3.0 3.0 8
3 NaN 4.0 4.0 8
4 NaN 5.0 5.0 8
5 NaN NaN 6.0 8
6 NaN NaN 7.0 8
You can do a def (read comments):
def f(df):
    l = [8] * df[max(df, key=lambda x: df[x].count())].count()
    df[3] = l + [np.nan] * (len(df) - len(l))
    # the above two lines can be just `df[3] = another solution currently for this problem`
    return df
dataFrame = f(pandas.DataFrame(data).transpose())
Then now:
print(dataFrame)
Returns:
0 1 2 3
0 1.0 1.0 1.0 8
1 2.0 2.0 2.0 8
2 3.0 3.0 3.0 8
3 NaN 4.0 4.0 8
4 NaN 5.0 5.0 8
5 NaN NaN 6.0 8
6 NaN NaN 7.0 8
If you mean at the same time as running pd.DataFrame, the data has to be prepped before it is loaded into your frame.
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]]
longest = max(len(i) for i in data)
dummy = [8 for i in range(longest)] #dummy data filled with 8
data.append(dummy)
dataFrame = pd.DataFrame(data).transpose()
The example above gets the longest element in your list and creates a dummy to be added onto it before creating your dataframe.
One solution is to add an element to the list that is passed to the dataframe:
pd.DataFrame(data + [[np.hstack(data).max() + 1] * len(max(data))]).T
0 1 2 3
0 1.0 1.0 1.0 8.0
1 2.0 2.0 2.0 8.0
2 3.0 3.0 3.0 8.0
3 NaN 4.0 4.0 8.0
4 NaN 5.0 5.0 8.0
5 NaN NaN 6.0 8.0
6 NaN NaN 7.0 8.0
If data is to be modified just:
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]]
data = data + [[np.hstack(data).max() + 1] * len(max(data))]
pd.DataFrame(data).T

Is there a way to get a "union" of several columns of pandas DataFrame?

I am not looking for merging/concatenating columns or replacing some values with other values (although... maybe yes?). I have a large dataframe (>100 rows and columns) and I would like to extract columns that are "almost identical", i.e. that have >2 values (at the same index) in common and no different values at other indexes (if there is a value in one column, there must be either the same value or a NaN in the other).
Here is an example of such a dataframe:
a = np.random.randint(1,10,10)
b = np.array([np.nan,2,np.nan,3,np.nan,6,8,1,2,np.nan])
c = np.random.randint(1,10,10)
d = np.array([7,2,np.nan,np.nan,np.nan,6,8,np.nan,2,2])
e = np.array([np.nan,2,np.nan,np.nan,np.nan,6,8,np.nan,np.nan,2])
f = np.array([np.nan,2,np.nan,3.0,7,np.nan,8,np.nan,np.nan,2])
df = pd.DataFrame({'A':a,'B':b,'C':c,'D':d,'E':e, 'F':f})
df.loc[3:6, 'A'] = np.nan
df.loc[4:8, 'C'] = np.nan
EDIT
keys=['S01_o4584','S02_o2531','S03_o7812','S03_o1122','S04_o5210','S04_o3212','S05_o4665','S06_o7425','S07_o3689','S08_o2371']
df['index']=keys
df = df.set_index('index')
A B C D E F
index
S01_o4584 8.0 NaN 9.0 7.0 NaN NaN
S02_o2531 8.0 2.0 5.0 2.0 2.0 2.0
S03_o7812 1.0 NaN 5.0 NaN NaN NaN
S03_o1122 NaN 3.0 6.0 NaN NaN 3.0
S04_o5210 NaN NaN NaN NaN NaN 7.0
S04_o3212 NaN 6.0 NaN 6.0 6.0 NaN
S05_o4665 NaN 8.0 NaN 8.0 8.0 8.0
S06_o7425 1.0 1.0 NaN NaN NaN NaN
S07_o3689 8.0 2.0 NaN 2.0 NaN NaN
S08_o2371 3.0 NaN 9.0 2.0 2.0 2.0
As you can see, columns B and D (and newly E) have identical values at locations (indexes) S02_o2531, S04_o3212, S05_o4665 and S08_o2371, whereas at the other locations, one has a value while the other has a NaN.
My desired output would be:
index      BD*E*
S01_o4584    7
S02_o2531    2
S03_o7812  NaN
S03_o1122    3
S04_o5210  NaN
S04_o3212    6
S05_o4665    8
S06_o7425    1
S07_o3689    2
S08_o2371    2
However, I can't combine columns that would then have two different values for the same beginning of the index: as you can see, column F also shares some of the indexes, but a new one is at S04_o5210, but the previous combined columns already have a value at "S04_" (index S04_o3212).
Is there a reasonably pythonic way to do it? I.e. 1) find the columns based on the condition that the values in them must be either identical or np.nan, not different. 2) set a condition that a column cannot be combined if it has the same beginning of the index of previously included values (I may probably need to split the string into two columns and do multiindex???) 3) combine them into the new Series/DataFrame.
def almost(df):
    i, j = np.triu_indices(len(df.columns), 1)
    v = df.values
    d = v[:, i] - v[:, j]
    m = (np.where(np.isnan(d), 0, d) == 0).all(0)
    return pd.concat(
        [
            df.iloc[:, i_].combine_first(
                df.iloc[:, j_]
            ).rename(
                tuple(df.columns[[i_, j_]])
            ) for i_, j_ in zip(i[m], j[m])
        ],
        axis=1
    )
almost(df)
B
D
0 7.0
1 2.0
2 NaN
3 3.0
4 NaN
5 6.0
6 8.0
7 1.0
8 2.0
9 2.0
how it works:
- i and j represent every combination of columns, using numpy to get the indices of an upper triangle (see the sketch after this list).
- slice the underlying numpy array df.values with i and j and subtract them. Where the differences are nan, one or the other value was nan; otherwise, the difference should be zero if the respective elements are the same.
- since we can tolerate nan in one or the other, fill the nans with zero using np.where.
- find where all rows are zero with .all(0).
- use the mask above to slice i and j and identify the columns that were matches.
- build a dataframe of all matches with a pd.MultiIndex for columns that shows what matches what.
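A quick look at what np.triu_indices produces (a sketch, for 4 columns):
import numpy as np
i, j = np.triu_indices(4, 1)
print(i)  # [0 0 0 1 1 2]
print(j)  # [1 2 3 2 3 3] -> the pairs (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)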
cooler example
np.random.seed([3,1415])
m, n = 20, 26
df = pd.DataFrame(
    np.random.randint(10, size=(m, n)),
    columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
).mask(np.random.choice([True, False], (m, n), p=(.6, .4)))
df
almost(df)
A D G H I J K
J X K M N J K V S X
0 6.0 7.0 3.0 NaN 4.0 6.0 NaN 6.0 NaN 7.0
1 3.0 3.0 2.0 6.0 4.0 NaN 2.0 6.0 2.0 2.0
2 3.0 0.0 NaN 2.0 4.0 3.0 NaN 3.0 4.0 0.0
3 4.0 4.0 3.0 5.0 5.0 4.0 3.0 4.0 3.0 3.0
4 7.0 NaN NaN 7.0 3.0 7.0 NaN 7.0 NaN NaN
5 NaN NaN 2.0 0.0 5.0 NaN 2.0 2.0 2.0 2.0
6 NaN 8.0 NaN NaN 9.0 2.0 2.0 1.0 NaN 8.0
7 NaN 7.0 NaN 9.0 9.0 6.0 6.0 NaN NaN 7.0
8 NaN NaN 8.0 3.0 1.0 NaN NaN NaN 4.0 NaN
9 0.0 0.0 8.0 2.0 NaN 3.0 3.0 NaN NaN NaN
10 0.0 0.0 NaN 6.0 1.0 NaN NaN 8.0 NaN NaN
11 NaN NaN 3.0 NaN 9.0 3.0 3.0 NaN 3.0 3.0
12 5.0 NaN NaN NaN 6.0 5.0 NaN 5.0 8.0 NaN
13 NaN NaN NaN NaN 7.0 5.0 5.0 NaN NaN NaN
14 NaN NaN 6.0 4.0 8.0 8.0 8.0 NaN 0.0 NaN
15 8.0 8.0 7.0 NaN NaN NaN NaN NaN 2.0 NaN
16 4.0 4.0 4.0 4.0 9.0 9.0 9.0 6.0 4.0 NaN
17 NaN 4.0 NaN 4.0 2.0 8.0 8.0 4.0 NaN 4.0
18 NaN NaN 2.0 7.0 NaN NaN NaN NaN NaN NaN
19 NaN 7.0 6.0 3.0 5.0 NaN NaN 7.0 NaN 7.0
It sounds like the sticking point is how to detect "almost identical" columns, which are columns that only differ (if at all) in what values are missing. Given two column names, how do you check if they are almost identical? Note that if we find a difference that counts, it must be at an index for which neither column has NaN. In other words, the trick is to discard rows with a missing value and compare the rest:
tocheck = df[["B", "D"]].dropna()
if all(tocheck.B == tocheck.D):
    print("B, D are almost identical")
Let's use this to iterate over all pairs of columns, and merge the ones that match:
import itertools

for a, b in itertools.combinations(df.columns, 2):
    if a not in df.columns or b not in df.columns:  # Was one deleted already?
        continue
    tocheck = df[[a, b]].dropna()
    if all(tocheck[a] == tocheck[b]):
        print(b, "->", a)
        df[a] = df[a].combine_first(df[b])
        del df[b]
Note (in case you haven't noticed) that when multiple columns end up being merged, it's possible to have order-dependent behavior. For example:
A B C
0 NaN 1 2
1 10 NaN NaN
Here you could either merge B or C into A, but not both. Such problems aside, multiple columns can be merged into one since the merged column is saved in place of one of the compared columns.
et voila
test = df.B == df.D
df.loc[test, 'myunion'] = df.loc[test, 'B']
df.loc[~test, 'myunion'] = df.loc[~test, 'B'].fillna(0) + df.loc[~test, 'D'].fillna(0)
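A simpler way to express the same union (a sketch) is combine_first, which takes B's value and falls back to D where B is NaN; unlike the fillna(0) arithmetic above, it also keeps NaN where both columns are NaN, matching the desired output:
df['myunion'] = df['B'].combine_first(df['D'])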
