I have the following dataframe:
id indicator
1 NaN
1 NaN
1 1
1 NaN
1 NaN
1 NaN
In reality I have many more ids. How do I do a forward or backward fill over a limited range only, e.g. for just the next/previous 2 observations? My dataframe should then look like this:
id indicator
1 NaN
1 NaN
1 1
1 1
1 1
1 NaN
I know the command
df.groupby("id")["indicator"].fillna(value=None, method="ffill")
However, this fills all the missing values instead of just the next two observations. Does anyone know a solution?
I think DataFrameGroupBy.ffill or DataFrameGroupBy.bfill with the limit parameter is nicer:
df.groupby("id")["indicator"].ffill(limit=3)
df.groupby("id")["indicator"].bfill(limit=3)
Sample:
# the 5.0 at the end of group 2 is followed by only one row, so only one value is filled
df['filled'] = df.groupby("id")["indicator"].ffill(limit=2)
print (df)
id indicator filled
0 1 NaN NaN
1 1 NaN NaN
2 1 1.0 1.0
3 1 NaN 1.0
4 1 NaN 1.0
5 1 NaN NaN
6 1 NaN NaN
7 1 NaN NaN
8 1 4.0 4.0
9 1 NaN 4.0
10 1 NaN 4.0
11 1 NaN NaN
12 1 NaN NaN
13 2 NaN NaN
14 2 NaN NaN
15 2 1.0 1.0
16 2 NaN 1.0
17 2 NaN 1.0
18 2 NaN NaN
19 2 5.0 5.0
20 2 NaN 5.0
21 3 3.0 3.0
22 3 NaN 3.0
23 3 NaN 3.0
24 3 NaN NaN
25 3 NaN NaN
You're almost there. Straight from the docs:
If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
df.groupby("id")["indicator"].fillna(value=None,method="ffill",limit=3)
I am looking for a way to create an array of numbers that labels groups, based on the value of the 'number' column, if that's possible.
With this abbreviated example DF:
from numpy import nan
import pandas as pd

number = [nan, nan, 1, nan, nan, nan, 2, nan, nan, 3, nan, nan, nan, nan, nan, 4, nan, nan]
df = pd.DataFrame({'number': number})
Ideally I would like to make a new column, 'group', based on the ints in column 'number', so that there would effectively be groups of 1, 2, 3, etc. FWIW, the DF is thousands of lines long, with sporadically placed ints.
The result would be a new column, something like this:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
All advice much appreciated!
You can use notna combined with cumsum:
df['group'] = df['number'].notna().cumsum()
NB. if the missing values were zeros instead of NaN: df['group'] = df['number'].ne(0).cumsum().
output:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
You can use forward fill:
df['number'].ffill().fillna(0)
Output:
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 2.0
7 2.0
8 2.0
9 3.0
10 3.0
11 3.0
12 3.0
13 3.0
14 3.0
15 4.0
16 4.0
17 4.0
Name: number, dtype: float64
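If you prefer integer labels like the cumsum answer produces, the float result can be cast once the remaining NaNs are filled:
df['group'] = df['number'].ffill().fillna(0).astype(int)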
I am having trouble selecting the rows I want and printing the columns of choice.
So I have 8 columns and what I am looking to do is take all the rows where column 8 is not NA and is equal to 2, and print only columns 2 to 5.
I have tried this:
df.where(df['dhch'].notnull())[['scchdg', 'dhch']]
Here I have just entered 2 columns to check that the condition that dhch is not NA worked, and I got the expected result:
scchdg dhch
0 3 1
1 -1 2
2 -1 2
3 1 1
4 3 1
... ...
12094 -9 1
12095 1 1
12096 4 1
12097 3 1
12098 4 1
[12099 rows x 2 columns]
And when I check the conditional value I get the expected output (i.e., values of 2 and NaNs in the dhch column):
df.where(df['dhch']==2)[['scchdg', 'dhch']]
Out[50]:
scchdg dhch
0 NaN NaN
1 -1.0 2.0
2 -1.0 2.0
3 NaN NaN
4 NaN NaN
... ...
12094 NaN NaN
12095 NaN NaN
12096 NaN NaN
12097 NaN NaN
12098 NaN NaN
[12099 rows x 2 columns]
But when I combine these, I just get piles of NAs:
df.where(df['dhch'].notnull() & df['dhch']==2)[['scchdg', 'dhch']]
Out[51]:
scchdg dhch
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
... ...
12094 NaN NaN
12095 NaN NaN
12096 NaN NaN
12097 NaN NaN
12098 NaN NaN
[12099 rows x 2 columns]
What am I doing wrong please?
In R, what I want to do is as follows:
df[!is.na(df$dhch) & df$dhch==2, c('scchdg', 'dhch')]
But how do I do exactly this in Python?
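For what it's worth, the all-NaN result comes from operator precedence: & binds more tightly than ==, so the combined condition is parsed as (df['dhch'].notnull() & df['dhch']) == 2. Parenthesizing the comparison, and using .loc to drop rather than mask the non-matching rows, gives a direct equivalent of the R line; a sketch:
# .where keeps the full shape and masks non-matching rows with NaN;
# boolean indexing with .loc drops them instead, like the R subset
df.loc[df['dhch'].notna() & (df['dhch'] == 2), ['scchdg', 'dhch']]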
If I have a Pandas Data frame like this:
0 1 2 3 4 5
1 NaN NaN 1 NaN 1 1
2 1 NaN NaN 1 NaN 1
3 NaN 1 1 NaN 1 1
4 1 1 1 1 1 1
5 NaN NaN NaN NaN NaN NaN
How do I count each group of ones and assign a value based on the number of groups in each row? Such that I get a data frame like this:
0 1 2 3 4 5
1 NaN NaN 1 NaN 2 2
2 1 NaN NaN 2 NaN 3
3 NaN 1 1 NaN 2 2
4 1 1 1 1 1 1
5 NaN NaN NaN NaN NaN NaN
It is a little bit hard to find a simple way:
s = df.isnull().cumsum(axis=1)  # within a row, each run of 1s shares the same cumulative null count
s = s[df.notnull()].apply(lambda x: pd.factorize(x)[0], axis=1) + 1  # assign a group key per run
df = s.mask(s == 0)  # factorize encodes NaN as -1, so 0 now marks the originally-null cells; mask them back to NaN
df
0 1 2 3 4 5
1 NaN NaN 1.0 NaN 2.0 2.0
2 1.0 NaN NaN 2.0 NaN 3.0
3 NaN 1.0 1.0 NaN 2.0 2.0
4 1.0 1.0 1.0 1.0 1.0 1.0
5 NaN NaN NaN NaN NaN NaN
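As an aside, a fully vectorized sketch that avoids the row-wise apply: mark where each run of 1s starts, number the runs with a cumulative sum along the row, and restore NaN where the input was NaN:
# True at the first cell of each run (non-null cell whose left neighbour is null)
starts = df.notnull() & df.shift(1, axis=1).isnull()
# number the runs left to right, then mask the originally-null cells
result = starts.cumsum(axis=1).where(df.notnull())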
I'm trying to fill out missing values in a pandas dataframe by interpolating or copying the last-known value within a group (identified by trip). My data looks like this:
brake speed trip
0 0.0 NaN 1
1 1.0 NaN 1
2 NaN 1.264 1
3 NaN 0.000 1
4 0.0 NaN 1
5 NaN 1.264 1
6 NaN 6.704 1
7 1.0 NaN 1
8 0.0 NaN 1
9 NaN 11.746 2
10 1.0 NaN 2
11 0.0 NaN 2
12 NaN 16.961 3
13 1.0 NaN 3
14 NaN 11.832 3
15 0.0 NaN 3
16 NaN 17.082 3
17 NaN 22.435 3
18 NaN 28.707 3
19 NaN 34.216 3
I have found Pandas interpolate within a groupby, but I need brake to simply be copied from the last known value, while speed is interpolated (my actual dataset has 12 columns that each need such treatment).
You can apply separate methods to each column. For example:
# interpolate speed within each trip
df['speed'] = df.groupby('trip').speed.transform(lambda x: x.interpolate())
# fill brake with the last known value within each trip
df['brake'] = df.groupby('trip').brake.ffill()
>>> df
brake speed trip
0 0.0 NaN 1
1 1.0 NaN 1
2 1.0 1.2640 1
3 1.0 0.0000 1
4 0.0 0.6320 1
5 0.0 1.2640 1
6 0.0 6.7040 1
7 1.0 6.7040 1
8 0.0 6.7040 1
9 NaN 11.7460 2
10 1.0 11.7460 2
11 0.0 11.7460 2
12 NaN 16.9610 3
13 1.0 14.3965 3
14 1.0 11.8320 3
15 0.0 14.4570 3
16 0.0 17.0820 3
17 0.0 22.4350 3
18 0.0 28.7070 3
19 0.0 34.2160 3
Note that this leaves some NaNs in brake, because there was no "last known value" for the first row of a trip, and some NaNs in speed where the first rows of a trip were NaN. You can replace these as you see fit with fillna().
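Since the actual dataset has 12 such columns, one way to avoid repeating a line per column is to map each column to its strategy; the mapping below is hypothetical and would be extended with the real column names:
# hypothetical column -> strategy mapping; extend it for all 12 columns
strategies = {'speed': 'interpolate', 'brake': 'ffill'}
for col, how in strategies.items():
    grp = df.groupby('trip')[col]
    df[col] = grp.transform(lambda x: x.interpolate()) if how == 'interpolate' else grp.ffill()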
Consider the below pandas Series object,
import numpy as np
import pandas as pd

index = list('abcdabcdabcd')
df = pd.Series(np.arange(len(index)), index=index)
My desired output is,
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
I have put some effort into pd.pivot_table and pd.unstack, and the solution probably lies with the correct use of one of them. The closest I have reached is
df.reset_index(level = 1).unstack(level = 1)
but this does not give me the output I am looking for.
Here is something even closer to the desired output, but I am not able to handle the index grouping:
df.to_frame().set_index(df.values, append=True, drop=False).unstack(level=0)
a b c d
0 0.0 NaN NaN NaN
1 NaN 1.0 NaN NaN
2 NaN NaN 2.0 NaN
3 NaN NaN NaN 3.0
4 4.0 NaN NaN NaN
5 NaN 5.0 NaN NaN
6 NaN NaN 6.0 NaN
7 NaN NaN NaN 7.0
8 8.0 NaN NaN NaN
9 NaN 9.0 NaN NaN
10 NaN NaN 10.0 NaN
11 NaN NaN NaN 11.0
A somewhat more general solution, using cumcount to build the new index values and pivot to do the reshaping:
# Reset the existing index, and construct the new index values.
df = df.reset_index()
df.index = df.groupby('index').cumcount()
# Pivot and remove the column axis name.
df = df.pivot(columns='index', values=0).rename_axis(None, axis=1)
The resulting output:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
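For completeness, the same steps as a single chain, rebuilding the Series from the question (named s here, since it is a Series rather than a DataFrame):
import numpy as np
import pandas as pd

s = pd.Series(np.arange(12), index=list('abcdabcdabcd'))
out = (s.rename('value')
        .reset_index()                                      # columns: 'index', 'value'
        .assign(row=lambda d: d.groupby('index').cumcount())
        .pivot(index='row', columns='index', values='value')
        .rename_axis(None, axis=0)                          # drop the 'row' index name
        .rename_axis(None, axis=1))                         # drop the 'index' columns name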
Here is a way that will work if the index is always cycling in the same order, and you know the "period" (in this case 4):
>>> pd.DataFrame(df.values.reshape(-1,4), columns=list('abcd'))
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
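The reshape silently assumes the length is an exact multiple of the period; a small guard makes that assumption explicit, and the column labels can be taken from the index itself rather than hard-coded:
# guard against a partial final cycle before reshaping
period = 4
assert len(df) % period == 0, "length must be a multiple of the period"
out = pd.DataFrame(df.values.reshape(-1, period), columns=df.index[:period])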