Given the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[np.nan,1,2],'b':[np.nan,np.nan,4]})
a b
0 NaN NaN
1 1.0 NaN
2 2.0 4.0
How do I return rows where both columns 'a' and 'b' are null without having to use pd.isnull for each column?
Desired result:
a b
0 NaN NaN
I know this works (but it's not how I want to do it):
df.loc[(pd.isnull(df['a']) & (pd.isnull(df['b'])]
I tried this:
df.loc[pd.isnull(df[['a', 'b']])]
...but got the following error:
ValueError: Cannot index with multidimensional key
Thanks in advance!
You are close:
df[~pd.isnull(df[['a', 'b']]).all(1)]
Or
df[df[['a','b']].isna().all(1)]
How about:
df.dropna(subset=['a','b'], how='all')
With your shown samples, please try following. Using isnull function here.
mask1 = df['a'].isnull()
mask2 = df['b'].isnull()
df[mask1 & mask2]
Above answer is with creating 2 variables for better understanding. In case you want to use conditions inside df itself and don't want to create condition variables(mask1 and mask2 in this case) then try following.
df[df['a'].isnull() & df['b'].isnull()]
Output will be as follows.
a b
0 NaN NaN
You can use dropna() with parameter as how=all
df.dropna(how='all')
Output:
a b
1 1.0 NaN
2 2.0 4.0
Since the question was updated, you can then create masking either using df.isnull() or using df.isna() and filter accordingly.
df[df.isna().all(axis=1)]
a b
0 NaN NaN
How do I combine values from two rows with identical index and has no intersection in values?
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,None,None],[None,5,6]],index=['a','b','b'])
df
#input
0 1 2
a 1.0 2.0 3.0
b 4.0 NaN NaN
b NaN 5.0 6.0
Desired output
0 1 2
a 1.0 2.0 3.0
b 4.0 5.0 6.0
Please stack(), drops all nans and unstack()
df.stack().unstack()
If possible simplify solution for first non missing values per index labels use GroupBy.first:
df1 = df.groupby(level=0).first()
If possible same output from sample data is use sum per labels use sum:
df1 = df.sum(level=0)
If there is multiple non missing values per groups is necessary specify expected output, obviously is is more complicated.
I have a dataframe column which contains a list of numbers from a .csv. These numbers range from 1-1400 and may or may not be repeated and the a NaN value can appear pretty much anywhere at random.
Two examples would be
a=[1,4,NaN,5,6,7,...1398,1400,1,2,3,NaN,8,9,...,1398,NaN]
b=[1,NaN,2,3,4,NaN,7,10,...,1398,1399,1400]
I would like to create another column that finds the first 1-1400 and records a '1' in the same index and if the second set of 1-1400 exists, then mark that down as a '2' in the new column
I can think of some roundabout ways using temporary placeholders and some other kind of checks, but I was wondering if there was a 1-3 liner to do this operation
Edit1: I would prefer there to be a single column returned
a1=[1,1,NaN,1,1,1,...1,1,2,2,2,NaN,2,2,...,2,NaN]
b1=[1,NaN,1,1,1,NaN,1,1,...,1,1,1]
You can use groupby() and cumcount() to count numbers in each column:
# create new columns for counting
df['a1'] = np.nan
df['b1'] = np.nan
# take groupby for each value in column `a` and `b` and count each value
df.a1 = df.groupby('a').cumcount() + 1
df.b1 = df.groupby('b').cumcount() + 1
# set np.nan as it is
df.loc[df.a.isnull(), 'a1'] = np.nan
df.loc[df.b.isnull(), 'b1'] = np.nan
EDIT (after receiving a comment of 'does not work'):
df['a2'] = df.ffill().a.diff()
df['a1'] = df.loc[df.a2 < 0].groupby('a').cumcount() + 1
df['a1'] = df['a1'].bfill().shift(-1)
df.loc[df.a1.isnull(), 'a1'] = df.a1.max() + 1
df.drop('a2', axis=1, inplace=True)
df.loc[df.a.isnull(), 'a1'] = np.nan
you can use diff to check when the difference between two following values is negative, meaning of the start of a new range. Let's create a dataframe:
import pandas as pd
import numpy as np
# to create a dataframe with two columns my range go up to 12 but 1400 is the same
df = pd.DataFrame({'a':[1,4,np.nan,5,10,12,2,3,4,np.nan,8,12],'b':range(1,13)})
df.loc[[4,8],'b'] = np.nan
Because you have 'NaN', you need to use ffill to fill NaN with previous value and you want the opposite of the row (using ~) where the diff is greater or equal than 0 (I know it sound like less than 0, but not exactely here as it miss the first row of the dataframe). For column 'a' for example
print (df.loc[~(df.a.ffill().diff()>=0),'a'])
0 1.0
6 2.0
Name: a, dtype: float64
you get the two rows where a "new" range start. To use this property to create 'a1', you can do:
# put 1 in the rows with a new range start
df.loc[~(df.a.ffill().diff()>=0),'a1'] = 1
# create a mask to select notnull row in a:
mask_a = df.a.notnull()
# use cumsum and ffill on column a1 with the mask_a
df.loc[mask_a,'a1'] = df.loc[mask_a,'a1'].cumsum().ffill()
Finally, for several column, you can do:
list_col = ['a','b']
for col in list_col:
df.loc[~(df[col].ffill().diff()>=0),col+'1'] = 1
mask = df[col].notnull()
df.loc[mask,col+'1'] = df.loc[mask,col+'1'].cumsum().ffill()
and with my input, you get:
a b a1 b1
0 1.0 1.0 1.0 1.0
1 4.0 2.0 1.0 1.0
2 NaN 3.0 NaN 1.0
3 5.0 4.0 1.0 1.0
4 10.0 NaN 1.0 NaN
5 12.0 6.0 1.0 1.0
6 1.0 7.0 2.0 1.0
7 3.0 8.0 2.0 1.0
8 4.0 NaN 2.0 NaN
9 NaN 10.0 NaN 1.0
10 8.0 11.0 2.0 1.0
11 12.0 12.0 2.0 1.0
EDIT: you can even do it in one line for each column, same result:
df['a1'] = df[df.a.notnull()].a.diff().fillna(-1).lt(0).cumsum()
df['b1'] = df[df.b.notnull()].b.diff().fillna(-1).lt(0).cumsum()
There are a few questions here on this topic, but none seem to be helpful in my case. Here's a dumbed down version of what I want:
This is the csv file of interest: http://pastebin.com/rP7tPDse
I'm creating the pivot table as:
piv = pd.read_csv("test.csv",delimiter = "\s+").pivot_table('z','x','y')
And this returns
y 0.0 1.0 1.3 2.0
x
0.0 1.0 5.0 NaN 4.0
1.0 3.0 4.0 NaN 6.0
1.5 NaN NaN 7.0 NaN
2.0 3.0 5.0 NaN 7.0
I would like to find a slice of this array as a pivot_table, such as:
y 1.3 2.0
x
0.0 NaN 4.0
1.0 NaN 6.0
Based on the x and y values. I want to include the NaN's as well, to do processing on them later. Help much appreciated.
EDIT: updating the question to be more specific.
I'm looking to extract a pivot table that has values denoted by the column 'z' and indexed by 'x' and 'y', with the condition that:
All x values between arbitrary xmin and xmax
All y values between arbitrary ymin and ymax
From piv, as defined above, I want to do something like:
piv.loc[(piv.y <= 2.0) &
(piv.y >= 1.3) &
(piv.x >= 0.0) &
(piv.x <= 1.2)]
And this would yield me the example answer, above.
Also, in the actual dataset, which I did not post here, there are many more columns. 'x', 'y' and 'z' are just some of them.
When I copied dataframe, the columns were strings and rows were floats.
To get the columns as float
df.columns = df.columns.astype(float)
Now you can pd.IndexSlice
df.loc[pd.IndexSlice[0:1], pd.IndexSlice[1.3:2]]
I'm trying to find the most recent index with a value that is not 'NaN' relative to the current index. So, say I have a DataFrame with 'NaN' values like this:
A B C
0 2.1 5.3 4.7
1 5.1 4.6 NaN
2 5.0 NaN NaN
3 7.4 NaN NaN
4 3.5 NaN NaN
5 5.2 1.0 NaN
6 5.0 6.9 5.4
7 7.4 NaN NaN
8 3.5 NaN 5.8
If I am currently at index 4, I have the values:
A B C
4 3.5 NaN NaN
I want to know the last known value of 'B' relative to index 4, which is at index 1:
A B C
1 5.1 -> 4.6 NaN
I know I can get a list of all indexes with NaN values using something like:
indexes = df.index[df['B'].apply(np.isnan)]
But this seems inefficient in a large database. Is there a way to tail just the last one relative to the current index?
You may try something like this, convert the index to a series that have the same NaN values as column B and then use ffill() which carries the last non missing index forward for all subsequent NaNs:
import pandas as pd
import numpy as np
df['Last_index_notnull'] = df.index.to_series().where(df.B.notnull(), np.nan).ffill()
df['Last_value_notnull'] = df.B.ffill()
df
Now at index 4, you know the last non missing value is 4.6 and index is 1.
some useful methods to know
last_valid_index
first_valid_index
for columns B as of index 4
df.B.ix[:4].last_valid_index()
1
you can use this for all columns in this way
pd.concat([df.ix[:i].apply(pd.Series.last_valid_index) for i in df.index],
axis=1).T