I am trying to use the dropna function in pandas, and I would like to apply it based on a specific column. I can only figure out how to use it to drop rows when ALL of their values are NaN.
I have a dataframe (see below) from which I would like to drop every row from the first occurrence of a NaN in a specific column, column "A", onward.
My current code, which only works if all of a row's values are NaN:
data.dropna(axis = 0, how = 'all')
data
Original Dataframe
data = pd.DataFrame({"A": (1, 2, 3, 4, 5, 6, 7, np.nan, np.nan, np.nan),
                     "B": (1, 2, 3, 4, 5, 6, 7, np.nan, 9, 10),
                     "C": range(10)})
data
A B C
0 1 1 0
1 2 2 1
2 3 3 2
3 4 4 3
4 5 5 4
5 6 6 5
6 7 7 6
7 NaN NaN 7
8 NaN 9 8
9 NaN 10 9
What I would like the output to look like:
A B C
0 1 1 0
1 2 2 1
2 3 3 2
3 4 4 3
4 5 5 4
5 6 6 5
6 7 7 6
Any help on this is appreciated.
Ideally I would like to do it in the cleanest, most efficient way possible.
Thanks!
Use iloc + argmax:
data.iloc[:data.A.isnull().values.argmax()]
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
3 4.0 4 3
4 5.0 5 4
5 6.0 6 5
6 7.0 7 6
or with a different syntax
top_data = data[:data['A'].isnull().argmax()]
Re: the accepted answer. If the column in question has no NaNs, argmax returns 0, and thus df[:argmax] will return an empty dataframe.
Here's my workaround:
max_ = data.A.isnull().argmax()
max_ = len(data) if max_ == 0 else max_
top_data = data[:max_]
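The same guard, as a self-contained sketch (the frame here is illustrative, not the asker's data):

```python
import numpy as np
import pandas as pd

# illustrative frame: NaNs begin partway through column "A"
data = pd.DataFrame({"A": [1, 2, 3, np.nan, np.nan],
                     "B": [10, 20, 30, 40, 50]})

mask = data["A"].isnull()
# argmax returns 0 both when the first row is NaN and when no row is NaN,
# so only trust it when at least one NaN actually exists
cut = mask.to_numpy().argmax() if mask.any() else len(data)
top_data = data.iloc[:cut]
print(top_data)
```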
Related
I have a timeseries with several products. For each product I want to remove the leading and trailing zero rows, and in the middle I want to replace runs of two consecutive 0s with np.nan. Here is an example:
Date Id Units Should be
1 a 0 remove row
2 a 5 5
3 a 0 np.nan
4 a 0 np.nan
5 a 1 1
6 a 3 3
1 b 4 4
2 b 2 2
3 b 0 0
4 b 4 4
5 b 0 remove row
6 b 0 remove row
I tried using groupby and a for loop to get the indexes, but I wasn't able to combine the rules.
You can use:
## PART 1: remove the external 0s
# flag the non-zero rows
m = df['Units'].ne(0)
# get masks to identify the middle values
m1 = m.groupby(df['Id']).cummax()
m2 = m[::-1].groupby(df['Id']).cummax()
# slice the "internal" rows
out = df[m1 & m2].copy()
## PART 2: replace stretches of exactly two 0s
# label consecutive runs of equal values within each Id
g = m.ne(m.groupby(df['Id']).shift()).cumsum()
# flag rows that belong to a run of length 2
m3 = df.groupby(['Id', g])['Units'].transform('size').eq(2)
out.loc[m3 & ~m, 'Units'] = np.nan
output:
Date Id Units Should be
1 2 a 5.0 5
2 3 a NaN np.nan
3 4 a NaN np.nan
4 5 a 1.0 1
5 6 a 3.0 3
6 1 b 4.0 4
7 2 b 2.0 2
8 3 b 0.0 0
9 4 b 4.0 4
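Putting both parts together as a self-contained sketch (data reconstructed from the example table):

```python
import numpy as np
import pandas as pd

# data reconstructed from the example table
df = pd.DataFrame({
    "Date": [1, 2, 3, 4, 5, 6] * 2,
    "Id": ["a"] * 6 + ["b"] * 6,
    "Units": [0, 5, 0, 0, 1, 3, 4, 2, 0, 4, 0, 0],
})

m = df["Units"].ne(0)
# True from the first non-zero row of each Id onward
m1 = m.groupby(df["Id"]).cummax()
# True up to the last non-zero row of each Id
m2 = m[::-1].groupby(df["Id"]).cummax()
# keep only the rows between the first and last non-zero per Id
out = df[m1 & m2].copy()

# label consecutive runs of equal values within each Id
g = m.ne(m.groupby(df["Id"]).shift()).cumsum()
# flag rows that belong to a run of exactly two values
m3 = df.groupby(["Id", g])["Units"].transform("size").eq(2)
# blank out the double-zero runs that survived the trim
out.loc[m3 & ~m, "Units"] = np.nan
print(out)
```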
I have a dataframe like this:
ID Value
1 5
1 6
1 NaN
2 NaN
2 8
2 4
2 NaN
2 10
3 NaN
Expected output:
ID Value
1 5
1 6
1 7
2 NaN
2 8
2 4
2 2
2 10
3 NaN
I want to do something like this:
df.groupby('ID')['Value'].shift(all).interpolate()
Currently I am using the code below, but it also takes the rows below into account.
df.groupby('ID')['Value'].interpolate()
I have the following two dataframes:
test=pd.DataFrame({"x":[1,2,3,4,5],"y":[6,7,8,9,0]})
test2=pd.DataFrame({"z":[1],"p":[6]})
which result respectively in:
x y
0 1 6
1 2 7
2 3 8
3 4 9
4 5 0
and
z p
0 1 6
What is the best way to create a column "s" in table test that is equal to:
test["s"] = test["x"] * test2["z"] + test2["p"]
when I try the above expression I get the following output:
x y s
0 1 6 7.0
1 2 7 NaN
2 3 8 NaN
3 4 9 NaN
4 5 0 NaN
but I want the result along all the rows. I have read about the apply method and so-called vectorized operations, but I don't really know how to approach the problem.
Expected output:
x y s
0 1 6 7.0
1 2 7 8.0
2 3 8 9.0
3 4 9 10.0
4 5 0 11.0
Thanks in advance
Here is my solution; I took Trenton_M's suggestion.
test=pd.DataFrame({"x":[1,2,3,4,5],"y":[6,7,8,9,0]})
test2=pd.DataFrame({"z":[1],"p":[6]})
Multiplication process:
test["s"] = test['x'] * test2.z.loc[0] + test2.p.loc[0]
test
Output:
x y s
0 1 6 7
1 2 7 8
2 3 8 9
3 4 9 10
4 5 0 11
Use scalar multiplication, like this:
test['s'] = test.x * test2.z[0] + test2.p[0]
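The NaNs in the original attempt come from index alignment: test2["z"] is a length-1 Series, so the elementwise expression is only defined at index 0. A sketch contrasting the aligned expression with scalar broadcasting (same frames as above):

```python
import pandas as pd

test = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [6, 7, 8, 9, 0]})
test2 = pd.DataFrame({"z": [1], "p": [6]})

# Series arithmetic aligns on the index: test2["z"] only has index 0,
# so every other row of the result is NaN
aligned = test["x"] * test2["z"] + test2["p"]

# extracting the scalars instead broadcasts them over all rows
test["s"] = test["x"] * test2.loc[0, "z"] + test2.loc[0, "p"]
print(test)
```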
I have a data frame (sample, not real):
df =
A B C D E F
0 3 4 NaN NaN NaN NaN
1 9 8 NaN NaN NaN NaN
2 5 9 4 7 NaN NaN
3 5 7 6 3 NaN NaN
4 2 6 4 3 NaN NaN
Now I want to fill the NaN values in each row with the previous couple(!!!) of values in that row (fill the NaNs with the nearest existing couple of numbers to their left) and apply this to the whole dataset.
There are a lot of answers concerning filling down columns. But in this case I need to fill along rows.
There are also answers about filling NaN based on one other column, but in my case the number of columns is more than 2000. This is sample data.
Desired output is:
df =
A B C D E F
0 3 4 3 4 3 4
1 9 8 9 8 9 8
2 5 9 4 7 4 7
3 5 7 6 3 6 3
4 2 6 4 3 4 3
IIUC, a quick solution without reshaping the data:
df.iloc[:, ::2] = df.iloc[:, ::2].ffill(axis=1)
df.iloc[:, 1::2] = df.iloc[:, 1::2].ffill(axis=1)
df
Output:
A B C D E F
0 3 4 3 4 3 4
1 9 8 9 8 9 8
2 5 9 4 7 4 7
3 5 7 6 3 6 3
4 2 6 4 3 4 3
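A runnable version of this on the sample frame (a sketch; it assumes the pairs start in the first column and every row has its first pair present):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [3, 9, 5, 5, 2], "B": [4, 8, 6, 7, 6],
    "C": [np.nan, np.nan, 4, 6, 4], "D": [np.nan, np.nan, 7, 3, 3],
    "E": [np.nan] * 5, "F": [np.nan] * 5,
})

# forward-fill the even-positioned columns (A, C, E) along each row,
# then the odd-positioned ones (B, D, F)
df.iloc[:, ::2] = df.iloc[:, ::2].ffill(axis=1)
df.iloc[:, 1::2] = df.iloc[:, 1::2].ffill(axis=1)
print(df.astype(int))
```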
The idea is to reshape the DataFrame so the missing values can be forward and back filled after stack, using modulo and integer division by 2 of the column positions:
c = df.columns
a = np.arange(len(df.columns))
df.columns = [a // 2, a % 2]
# if some pairs can be entirely missing, remove .astype(int)
df1 = df.stack().ffill(axis=1).bfill(axis=1).unstack().astype(int)
df1.columns = c
print (df1)
A B C D E F
0 3 4 3 4 3 4
1 9 8 9 8 9 8
2 5 9 4 7 4 7
3 5 7 6 3 6 3
4 2 6 4 3 4 3
Detail:
print (df.stack())
0 1 2
0 0 3 NaN NaN
1 4 NaN NaN
1 0 9 NaN NaN
1 8 NaN NaN
2 0 5 4.0 NaN
1 9 7.0 NaN
3 0 5 6.0 NaN
1 7 3.0 NaN
4 0 2 4.0 NaN
1 6 3.0 NaN
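For completeness, the reshape pipeline run end to end on the sample data (a sketch; a // 2 and a % 2 are temporary pair/position column levels):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [3, 9, 5, 5, 2], "B": [4, 8, 6, 7, 6],
    "C": [np.nan, np.nan, 4, 6, 4], "D": [np.nan, np.nan, 7, 3, 3],
    "E": [np.nan] * 5, "F": [np.nan] * 5,
})

c = df.columns
a = np.arange(len(df.columns))
# pair index (0, 0, 1, 1, 2, 2) and position within the pair (0, 1, ...)
df.columns = [a // 2, a % 2]

# stack the position level into rows, fill across pairs, then restore
df1 = df.stack().ffill(axis=1).bfill(axis=1).unstack().astype(int)
df1.columns = c
print(df1)
```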
I want to replace some missing values in a dataframe with some other values, keeping the index alignment.
For example, in the following dataframe
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':np.repeat(['a','b','c'],4), 'B':np.tile([1,2,3,4],3),'C':range(12),'D':range(12)})
df = df.iloc[:-1]
df.set_index(['A','B'], inplace=True)
df.loc['b'] = np.nan
df
C D
A B
a 1 0 0
2 1 1
3 2 2
4 3 3
b 1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
c 1 8 8
2 9 9
3 10 10
I would like to replace the missing values in the 'b' rows by matching them with the corresponding indices of the 'c' rows.
The result should look like
C D
A B
a 1 0 0
2 1 1
3 2 2
4 3 3
b 1 8 8
2 9 9
3 10 10
4 NaN NaN
c 1 8 8
2 9 9
3 10 10
You can use fillna with the values dictionary to_dict built from the relevant c rows. Note that .ix is removed in modern pandas, so use .loc and assign the result back instead of relying on inplace chained assignment:
>>> filled = df.loc['b'].fillna(value=df.loc['c'].to_dict())
>>> filled
C D
B
1 8 8
2 9 9
3 10 10
4 NaN NaN
Result (write the filled block back):
>>> df.loc['b'] = filled.values
>>> df
C D
A B
a 1 0 0
2 1 1
3 2 2
4 3 3
b 1 8 8
2 9 9
3 10 10
4 NaN NaN
c 1 8 8
2 9 9
3 10 10
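On current pandas, the same idea can be sketched in a self-contained form by aligning the 'b' block with the 'c' block on the shared B level and writing the result back:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.repeat(['a', 'b', 'c'], 4),
                   'B': np.tile([1, 2, 3, 4], 3),
                   'C': range(12), 'D': range(12)})
df = df.iloc[:-1]
df.set_index(['A', 'B'], inplace=True)
df.loc['b'] = np.nan

# fillna aligns the two blocks on the B index; B=4 has no 'c'
# counterpart, so it stays NaN
df.loc['b'] = df.loc['b'].fillna(df.loc['c']).values
print(df)
```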