Is it possible to specify an ffill for an entire row? What I mean by this is to condition on one value (Check) in the row to see if the row should be forward-filled.
My main goal is to keep row integrity intact (i.e. I only want to forward-fill an entire row into the next one). For the sake of simplicity, assume that each row corresponds to an event; I want to forward-fill the data from the past event if the new event does not have data (in Val1). I do not want to mix data from past events as I forward-fill. It should be noted that NaN values might be legitimate values for an event and should be forward-filled as well.
First Example:
Check Val1 Val2 Val3 Val4
0 2.00 3.00 2.00 2.00 3.00
1 2.00 4.00 nan 3.00 4.00
2 2.00 nan nan nan nan
3 2.00 2.00 4.00 3.00 3.00
Should become
Check Val1 Val2 Val3 Val4
0 2.00 3.00 2.00 2.00 3.00
1 2.00 4.00 nan 3.00 4.00
2 2.00 4.00 nan 3.00 4.00
3 2.00 2.00 4.00 3.00 3.00
and not:
Check Val1 Val2 Val3 Val4
0 2.00 3.00 2.00 2.00 3.00
1 2.00 4.00 2.00 3.00 4.00
2 2.00 4.00 2.00 3.00 4.00
3 2.00 2.00 4.00 3.00 3.00
Second example:
Check Val1 Val2 Val3 Val4
0 2.00 3.00 2.00 2.00 3.00
1 2.00 4.00 nan 3.00 4.00
2 2.00 4.00 nan nan nan
3 2.00 2.00 4.00 3.00 3.00
Should remain unchanged.
To replace only one NaN per column, first forward-fill all values and then check for consecutive NaNs, which are set back to NaN by mask:
df = df.ffill().mask((df.ffill(limit=1) * df.bfill(limit=1)).isnull())
print (df)
0 1 2 3 4
0 2.0 3.0 2.0 2.0 3.0
1 2.0 4.0 NaN 3.0 4.0
2 2.0 4.0 NaN 3.0 4.0
3 2.0 2.0 4.0 3.0 3.0
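To see why this keeps longer gaps untouched, here is a sketch of the intermediates on the first example (column names as in the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Check': [2.0, 2.0, 2.0, 2.0],
                   'Val1': [3.0, 4.0, np.nan, 2.0],
                   'Val2': [2.0, np.nan, np.nan, 4.0],
                   'Val3': [2.0, 3.0, np.nan, 3.0],
                   'Val4': [3.0, 4.0, np.nan, 3.0]})

# ffill(limit=1) * bfill(limit=1) is NaN exactly where a run of two or
# more consecutive NaNs sits, since neither one-step fill can reach it
blocker = (df.ffill(limit=1) * df.bfill(limit=1)).isnull()
print(df.ffill().mask(blocker))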
A snippet of the dataframe is as follows, but the actual dataset is 200000 x 130.
ID 1-jan 2-jan 3-jan 4-jan
1. 4 5 7 8
2. 2 0 1 9
3. 5 8 0 1
4. 3 4 0 0
I am trying to compute the Mean Absolute Deviation for each row, like this.
ID 1-jan 2-jan 3-jan 4-jan mean
1. 4 5 7 8 12.5
1_MAD 8.5 7.5 5.5 4.5
2. 2 0 1 9 6
2_MAD 4 6 5 3
.
.
I tried this,
new_df = pd.DataFrame()
for row_value in df['ID']:
    new_df[str(row_value) + '_mad'] = mad(df.loc[row_value][1:])
new_df.T
where mad is a function that compares the mean to each value.
But this is very time consuming since I have a large dataset, and I need to do it in the quickest way possible.
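For reference, one plausible reconstruction of that mad helper (hypothetical, since the original function is not shown):
def mad(row):
    # deviation of each value from the row's mean, per the description
    return (row - row.mean()).abs()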
mean1 = df1.mean(axis=1)
pd.concat([
    df1.assign(mean1=mean1).set_index(df1.index.astype('str')),
    df1.assign(mean1=mean1)
       .apply(lambda ss: ss.mean1 - ss, axis=1)
       .T.add_suffix('_MAD').T
       .assign(mean1='')
]).sort_index().pipe(print)
1-jan 2-jan 3-jan 4-jan mean1
ID
1.0 4.00 5.00 7.00 8.00 6.0
1.0_MAD 2.00 1.00 -1.00 -2.00
2.0 2.00 0.00 1.00 9.00 3.0
2.0_MAD 1.00 3.00 2.00 -6.00
3.0 5.00 8.00 0.00 1.00 3.5
3.0_MAD -1.50 -4.50 3.50 2.50
4.0 3.00 4.00 0.00 0.00 1.75
4.0_MAD -1.25 -2.25 1.75 1.75
IIUC use:
from toolz import interleave

# convert ID to index
df = df.set_index('ID')
# row means as a Series
mean = df.mean(axis=1)
# subtract the mean from all columns, take the absolute value, add suffix
df1 = df.sub(mean, axis=0).abs().rename(index=lambda x: f'{x}_MAD')
# join with the original (plus a mean column) and interleave the indices
df = pd.concat([df.assign(mean=mean), df1]).loc[list(interleave([df.index, df1.index]))]
print (df)
1-jan 2-jan 3-jan 4-jan mean
ID
1.0 4.00 5.00 7.00 8.00 6.00
1.0_MAD 2.00 1.00 1.00 2.00 NaN
2.0 2.00 0.00 1.00 9.00 3.00
2.0_MAD 1.00 3.00 2.00 6.00 NaN
3.0 5.00 8.00 0.00 1.00 3.50
3.0_MAD 1.50 4.50 3.50 2.50 NaN
4.0 3.00 4.00 0.00 0.00 1.75
4.0_MAD 1.25 2.25 1.75 1.75 NaN
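If toolz is unavailable, the same interleaving can be done with plain zip; a minimal sketch that replaces the final pd.concat line above (assuming df still holds the original frame and df1 the _MAD frame):
# alternate each original label with its _MAD twin, then reorder
order = [label for pair in zip(df.index, df1.index) for label in pair]
df = pd.concat([df.assign(mean=mean), df1]).loc[order]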
It's possible to specify axis=1 to apply the mean calculation across columns:
df['mean_across_cols'] = df.mean(axis=1)
I have a DataFrame that looks like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan, 1, np.nan, np.nan, 4, 2, 3, np.nan],
                   'b': [4, 2, 3, np.nan, np.nan, 1, 5, np.nan, 5, 8]})
a b
0 1.0 4.0
1 2.0 2.0
2 NaN 3.0
3 1.0 NaN
4 NaN NaN
5 NaN 1.0
6 4.0 5.0
7 2.0 NaN
8 3.0 5.0
9 NaN 8.0
I want to dynamically replace the NaN values. I have tried doing (df.ffill()+df.bfill())/2, but that does not yield the desired output, as it casts the fill value to the whole column at once rather than dynamically. I have tried interpolate, but it doesn't work well for non-linear data.
I have seen this answer, but I did not fully understand it and am not sure if it would work.
Update on the computation of the values
I want every NaN value to be the mean of the previous and next non-NaN value. In case there is more than one NaN value in sequence, I want to replace them one at a time and then compute the mean, e.g., given 1, np.nan, np.nan, 4, I first want the mean of 1 and 4 (2.5) for the first NaN value, obtaining 1, 2.5, np.nan, 4, and then the second NaN will be the mean of 2.5 and 4, getting to 1, 2.5, 3.25, 4.
The desired output is
a b
0 1.00 4.0
1 2.00 2.0
2 1.50 3.0
3 1.00 2.0
4 2.50 1.5
5 3.25 1.0
6 4.00 5.0
7 2.00 5.0
8 3.00 5.0
9 1.50 8.0
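As a quick check of the arithmetic, the sequential rule has a closed form for a run of NaNs between valid values a and b: the j-th filled value is b - (b - a) / 2**j:
# closed form of the sequential rule for the worked example 1, NaN, NaN, 4
a, b = 1.0, 4.0
print([b - (b - a) / 2**j for j in (1, 2)])  # [2.5, 3.25]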
Inspired by the @ye olde noobe answer (thanks to him!):
I've optimized it to make it ≃ 100x faster (times comparison below):
def custom_fillna(s: pd.Series):
    for i in range(len(s)):
        if pd.isna(s[i]):
            # nearest valid value before i (0 if there is none);
            # earlier iterations have already filled positions < i
            last_valid_number = s[s[:i].last_valid_index()] if s[:i].last_valid_index() is not None else 0
            # nearest valid value at or after i (0 if there is none)
            next_valid_number = s[s[i:].first_valid_index()] if s[i:].first_valid_index() is not None else 0
            s[i] = (last_valid_number + next_valid_number) / 2

custom_fillna(df['a'])
df
Times comparison: (benchmark plot from the original answer omitted)
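Since the plot is not reproduced here, a rough way to rerun the comparison yourself (a sketch, assuming both this custom_fillna and the fill_dynamically function from the answer below are defined):
import timeit

# each run gets a fresh copy of the column, since both functions
# mutate the Series in place
fast = timeit.timeit(lambda: custom_fillna(df['a'].copy()), number=100)
slow = timeit.timeit(lambda: fill_dynamically(df['a'].copy()), number=100)
print(f'custom_fillna: {fast:.4f}s, fill_dynamically: {slow:.4f}s')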
Maybe not the most optimized, but it works (note: from your example, I assume that if there is no valid value before or after a NaN, like in the last row of column a, 0 is used as the replacement):
import pandas as pd
def fill_dynamically(s: pd.Series):
    for i in range(len(s)):
        # mean of the nearest valid value at or after i and the nearest
        # valid value at or before i (0 if none exists on that side);
        # for a non-NaN s[i] both terms are s[i], so it is unchanged
        s[i] = (
            (0 if s[i:].first_valid_index() is None else s[i:][s[i:].first_valid_index()]) +
            (0 if s[:i+1].last_valid_index() is None else s[:i+1][s[:i+1].last_valid_index()])
        ) / 2
Use like this for the full dataframe:
df = pd.DataFrame({'a':[1,2,np.nan,1,np.nan,np.nan,4,2,3,np.nan],
'b':[4,2,3,np.nan,np.nan,1,5,np.nan,5,8]
})
df.apply(fill_dynamically)
df after applying:
a b
0 1.00 4.0
1 2.00 2.0
2 1.50 3.0
3 1.00 2.0
4 2.50 1.5
5 3.25 1.0
6 4.00 5.0
7 2.00 5.0
8 3.00 5.0
9 1.50 8.0
In case you have other columns and don't want to apply that to the whole dataframe, you can of course use it on a single column, like this:
df = pd.DataFrame({'a':[1,2,np.nan,1,np.nan,np.nan,4,2,3,np.nan],
'b':[4,2,3,np.nan,np.nan,1,5,np.nan,5,8]
})
fill_dynamically(df['a'])
In this case, df looks like this:
a b
0 1.00 4.0
1 2.00 2.0
2 1.50 3.0
3 1.00 NaN
4 2.50 NaN
5 3.25 1.0
6 4.00 5.0
7 2.00 NaN
8 3.00 5.0
9 1.50 8.0
Below is an example DataFrame.
0 1 2 3 4
0 0.0 13.00 4.50 30.0 0.0,13.0
1 0.0 13.00 4.75 30.0 0.0,13.0
2 0.0 13.00 5.00 30.0 0.0,13.0
3 0.0 13.00 5.25 30.0 0.0,13.0
4 0.0 13.00 5.50 30.0 0.0,13.0
5 0.0 13.00 5.75 0.0 0.0,13.0
6 0.0 13.00 6.00 30.0 0.0,13.0
7 1.0 13.25 0.00 30.0 0.0,13.25
8 1.0 13.25 0.25 0.0 0.0,13.25
9 1.0 13.25 0.50 30.0 0.0,13.25
10 1.0 13.25 0.75 30.0 0.0,13.25
11 2.0 13.25 1.00 30.0 0.0,13.25
12 2.0 13.25 1.25 30.0 0.0,13.25
13 2.0 13.25 1.50 30.0 0.0,13.25
14 2.0 13.25 1.75 30.0 0.0,13.25
15 2.0 13.25 2.00 30.0 0.0,13.25
16 2.0 13.25 2.25 30.0 0.0,13.25
I want to split this into new dataframes when the row in column 0 changes.
0 1 2 3 4
0 0.0 13.00 4.50 30.0 0.0,13.0
1 0.0 13.00 4.75 30.0 0.0,13.0
2 0.0 13.00 5.00 30.0 0.0,13.0
3 0.0 13.00 5.25 30.0 0.0,13.0
4 0.0 13.00 5.50 30.0 0.0,13.0
5 0.0 13.00 5.75 0.0 0.0,13.0
6 0.0 13.00 6.00 30.0 0.0,13.0

0 1 2 3 4
7 1.0 13.25 0.00 30.0 0.0,13.25
8 1.0 13.25 0.25 0.0 0.0,13.25
9 1.0 13.25 0.50 30.0 0.0,13.25
10 1.0 13.25 0.75 30.0 0.0,13.25

0 1 2 3 4
11 2.0 13.25 1.00 30.0 0.0,13.25
12 2.0 13.25 1.25 30.0 0.0,13.25
13 2.0 13.25 1.50 30.0 0.0,13.25
14 2.0 13.25 1.75 30.0 0.0,13.25
15 2.0 13.25 2.00 30.0 0.0,13.25
16 2.0 13.25 2.25 30.0 0.0,13.25
I've tried adapting the following solutions without any luck so far:
Split array at value in numpy
Split a large pandas dataframe
Looks like you want to groupby the first column. You could create a dictionary from the groupby object, with the groupby keys as the dictionary keys:
out = dict(tuple(df.groupby(0)))
Or we could also build a list from the groupby object. This becomes more useful when we only want positional indexing rather than based on the grouping key:
out = [sub_df for _, sub_df in df.groupby(0)]
We could then index the dict based on the grouping key, or the list based on the group's position:
print(out[0])
0 1 2 3 4
0 0.0 13.0 4.50 30.0 0.0,13.0
1 0.0 13.0 4.75 30.0 0.0,13.0
2 0.0 13.0 5.00 30.0 0.0,13.0
3 0.0 13.0 5.25 30.0 0.0,13.0
4 0.0 13.0 5.50 30.0 0.0,13.0
5 0.0 13.0 5.75 0.0 0.0,13.0
6 0.0 13.0 6.00 30.0 0.0,13.0
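From there, a common pattern is to iterate the dict variant's key/sub-DataFrame pairs, e.g. to save each group separately (the filenames are just illustrative):
# iterate the key -> sub-DataFrame mapping from the dict variant
for key, sub_df in out.items():
    sub_df.to_csv(f'group_{key}.csv', index=False)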
Based on
I want to split this into new dataframes when the row in column 0 changes.
If you only want to start a new group when the value in column 0 changes, you can try:
d=dict([*df.groupby(df['0'].ne(df['0'].shift()).cumsum())])
print(d[1])
print(d[2])
0 1 2 3 4
0 0.0 13.0 4.50 30.0 0.0,13.0
1 0.0 13.0 4.75 30.0 0.0,13.0
2 0.0 13.0 5.00 30.0 0.0,13.0
3 0.0 13.0 5.25 30.0 0.0,13.0
4 0.0 13.0 5.50 30.0 0.0,13.0
5 0.0 13.0 5.75 0.0 0.0,13.0
6 0.0 13.0 6.00 30.0 0.0,13.0
0 1 2 3 4
7 1.0 13.25 0.00 30.0 0.0,13.25
8 1.0 13.25 0.25 0.0 0.0,13.25
9 1.0 13.25 0.50 30.0 0.0,13.25
10 1.0 13.25 0.75 30.0 0.0,13.25
I will use GroupBy.__iter__:
d = dict(df.groupby(df['0'].diff().ne(0).cumsum()).__iter__())
#d = dict(df.groupby(df[0].diff().ne(0).cumsum()).__iter__())
Note that if there are repeated non-consecutive values, different groups will be created; if you only use groupby(0), they will end up in the same group.
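A minimal sketch of that difference on a toy Series:
s = pd.Series([0, 0, 1, 0])
# the run key starts a new group whenever the value changes
print(s.ne(s.shift()).cumsum().tolist())  # [1, 1, 2, 3] -> three groups
# plain groupby on the values would merge both runs of 0 into one group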
When I try to use fillna to replace the NaNs in a column with the column mean, the column dtype changes from float64 to object, showing:
bound method Series.mean of 0 NaN\n1
Here is the code:
mean = df['texture_mean'].mean
df['texture_mean'] = df['texture_mean'].fillna(mean)
You cannot use mean = df['texture_mean'].mean. Without the parentheses this assigns the bound method itself rather than calling it, which is why fillna puts that method object into the column. The following code will work -
df=pd.DataFrame({'texture_mean':[2,4,None,6,1,None],'A':[1,2,3,4,5,None]}) # Example
df
A texture_mean
0 1.0 2.0
1 2.0 4.0
2 3.0 NaN
3 4.0 6.0
4 5.0 1.0
5 NaN NaN
df['texture_mean']=df['texture_mean'].fillna(df['texture_mean'].mean())
df
A texture_mean
0 1.0 2.00
1 2.0 4.00
2 3.0 3.25
3 4.0 6.00
4 5.0 1.00
5 NaN 3.25
In case you want to replace the NaNs in every column with that column's respective mean, just do this -
df=df.fillna(df.mean())
df
A texture_mean
0 1.0 2.00
1 2.0 4.00
2 3.0 3.25
3 4.0 6.00
4 5.0 1.00
5 3.0 3.25
Let me know if this is what you want.
I have a question about how to select a different column (to create a new Series) based on another column's value. The raw data is as follows:
DEST_ZIP5 EXP_EDD_FRC_DAY GND_EDD_FRC_DAY \
0 00501 5 6
1 00544 5 6
2 01001 4 8
3 01001 4 8
4 01001 4 8
EXP_DAY_2 EXP_DAY_3 EXP_DAY_4 EXP_DAY_5 ... \
0 0.0 1.00 1.00 1.0 ...
1 0.0 1.00 1.00 1.0 ...
2 0.0 0.85 1.00 1.0 ...
3 0.0 1.00 1.00 1.0 ...
4 0.0 0.85 0.85 1.0 ...
GND_DAY_3 GND_DAY_4 GND_DAY_5 GND_DAY_6 GND_DAY_7 GND_DAY_8 \
0 NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 0.0 0.0 0.16 0.33 0.83 1.00
3 0.0 0.0 0.00 0.14 0.71 0.85
4 0.1 0.1 0.20 0.40 0.40 0.60
I want to get two new data series which pick the numeric value from the corresponding column: in the first row, EXP_EDD_FRC_DAY = 5, so return df['EXP_DAY_5'], and GND_EDD_FRC_DAY = 6, so return df['GND_DAY_6']:
DEST_ZIP5 EXP_percentage GND_percentage \
0 00501 1.0 NaN
1 00544 1.0 NaN
2 01001 1.0 1.00
3 01001 1.0 0.85
4 01001 0.85 0.60
I found the lookup function but am not sure how to use it.
Thank you very much
IIUC:
c = df['EXP_EDD_FRC_DAY'].astype(str).radd('EXP_DAY_')
df['EXP_percentage'] = pd.Series(df.lookup(df.index, c), df.index)
c = df['GND_EDD_FRC_DAY'].astype(str).radd('GND_DAY_')
df['GND_percentage'] = pd.Series(df.lookup(df.index, c), df.index)
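Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; a sketch of an equivalent on newer versions, assuming every looked-up DAY column exists in the frame:
import numpy as np

# map each row's target label (e.g. 'EXP_DAY_5') to a column position,
# then pick one value per row with positional indexing
c = df['EXP_EDD_FRC_DAY'].astype(str).radd('EXP_DAY_')
pos = df.columns.get_indexer(c)
df['EXP_percentage'] = df.to_numpy()[np.arange(len(df)), pos]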