Given the following pandas DataFrame where some index labels are NaN, how do I drop the third and eighth rows, since their index is NaN? Thanks.
import pandas as pd
import numpy as np
data = list('abcdefghil')
indices = [0, 1, np.nan, 3, 4, 5, 6, np.nan, 8, 9]
df = pd.DataFrame(data, index=indices, columns=['data'])
You can call dropna on the index:
In[68]:
df.loc[df.index.dropna()]
Out[68]:
data
0.0 a
1.0 b
3.0 d
4.0 e
5.0 f
6.0 g
8.0 i
9.0 l
Note that the presence of NaN makes the index dtype float; to change it back to int, cast the type:
In[70]:
df = df.loc[df.index.dropna()]
df.index = df.index.astype(int)
df
Out[70]:
data
0 a
1 b
3 d
4 e
5 f
6 g
8 i
9 l
Calling notnull on the index also works (somewhat under-documented):
In[71]:
df = df.loc[df.index.notnull()]
df.index = df.index.astype(int)
df
Out[71]:
data
0 a
1 b
3 d
4 e
5 f
6 g
8 i
9 l
There is also isna:
In[78]:
df.loc[~df.index.isna()]
Out[78]:
data
0.0 a
1.0 b
3.0 d
4.0 e
5.0 f
6.0 g
8.0 i
9.0 l
and notna, its more readable inverse:
In[79]:
df.loc[df.index.notna()]
Out[79]:
data
0.0 a
1.0 b
3.0 d
4.0 e
5.0 f
6.0 g
8.0 i
9.0 l
As commented by @jpp, you can also use the top-level pd.notnull:
In[80]:
df.loc[pd.notnull(df.index)]
Out[80]:
data
0.0 a
1.0 b
3.0 d
4.0 e
5.0 f
6.0 g
8.0 i
9.0 l
There are also the top-level isna, notna, and isnull functions, but I'm not going to display those; you can check the docs.
You can use the following:
df = df[~df.index.isnull()]
You might want to reset the index afterwards.
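A minimal sketch of that reset:
df = df.reset_index(drop=True)   # discard the old float labels, build a fresh 0..n-1 RangeIndex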
Using np.isnan and negating the mask:
res = df[~np.isnan(df.index)]
print(res)
data
0.0 a
1.0 b
3.0 d
4.0 e
5.0 f
6.0 g
8.0 i
9.0 l
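One caveat with this approach: np.isnan only accepts numeric input, so it raises a TypeError on an object-dtype index (strings mixed with NaN, say). A small sketch of the dtype-agnostic alternative (df_obj is a hypothetical example frame):
df_obj = pd.DataFrame({'data': [1, 2]}, index=['x', np.nan])
# np.isnan(df_obj.index) would raise TypeError here; pd.isnull handles any dtype
print(df_obj[~pd.isnull(df_obj.index)])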
A related follow-up question:
import pandas
import numpy
names = ['a', 'b', 'c']
df = pandas.DataFrame([1, 2, 3, numpy.nan, numpy.nan, 4, 5, 6, numpy.nan, numpy.nan, 7, 8, 9])
For the above, how would the condition change? Can someone please explain how I can get this:
df1 =
0
0 1.0
1 2.0
2 3.0
df2 =
0
4 4.0
5 5.0
6 6.0
df3 =
0
8 7.0
9 8.0
10 9.0
You can generate a temporary column, remove NaNs, and group by the temporary column:
dataframes = {f'df{idx+1}': d for idx, (_, d) in enumerate(df.dropna().groupby(df.assign(cond=df.isna().cumsum()).dropna()['cond']))}
Output:
>>> dataframes
{'df1': 0
0 1.0
1 2.0
2 3.0,
'df2': 0
5 4.0
6 5.0
7 6.0,
'df3': 0
10 7.0
11 8.0
12 9.0}
>>> dataframes['df1']
0
0 1.0
1 2.0
2 3.0
>>> dataframes['df2']
0
5 4.0
6 5.0
7 6.0
>>> dataframes['df3']
0
10 7.0
11 8.0
12 9.0
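For readability, the dense one-liner can be unpacked into equivalent steps (a sketch; it assumes the single value column is labelled 0, as pandas names it here):
cond = df[0].isna().cumsum()                # run counter: increments at every NaN
clean = df.dropna()                         # drop the NaN separator rows
groups = clean.groupby(cond[clean.index])   # group the surviving rows by run counter
dataframes = {f'df{i+1}': d for i, (_, d) in enumerate(groups)}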
I have two data frames and I want to make a single data frame.
I is the index and V is the value that I am interested in.
df1 looks like:
I V
A 4
B 5
C 8
D 6
F 2
df2 looks like:
I V
A 8
C 6
D 9
E 4
G 7
I want the output to look like:
I V1 v2
A 4 8
B 5 -
C 8 6
D 6 9
E - 4
F 2 -
G - 7
Is there a direct method in pandas that can do this, or do I have to loop through the set of all indices and fill in values cell by cell?
As you can see, df1 and df2 each have a few unique rows.
EDIT: I initially posted this with the wrong data for df1. In the end I used merge.
Yes, you can use merge for what you want:
df1 = pd.DataFrame({"C1": ["A","B", "C", "D", "F" ] , "C2": [4,5,8,6,2]})
df2 = pd.DataFrame({"C1": ["A","C", "D", "E", "G" ], "C2": [8,6,9,4,7]})
pd.merge(df1, df2, on="C1", how="outer").sort_values("C1")
This gives the following:
C1 C2_x C2_y
0 A 4.0 8.0
1 B 5.0 NaN
2 C 8.0 6.0
3 D 6.0 9.0
5 E NaN 4.0
4 F 2.0 NaN
6 G NaN 7.0
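The C2_x/C2_y names come from merge's default suffixes; to get closer to the asked-for V1/V2 you can pass suffixes explicitly and rename (a small sketch):
out = pd.merge(df1, df2, on="C1", how="outer", suffixes=("_1", "_2")).sort_values("C1")
out = out.rename(columns={"C1": "I", "C2_1": "V1", "C2_2": "V2"})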
You don't even need to merge. Just construct a new DataFrame with df1 and df2 as columns.
index2 = 'abcdef'
index1 = 'abcdeg'
df1 = pd.DataFrame(index=list(index1), data=list(range(len(index1))))
df2 = pd.DataFrame(index=list(index2), data=list(range(len(index2))))
pd.DataFrame(data={'a': df1.iloc[:, 0], 'b': df2.iloc[:, 0]})
a b
a 0.0 0.0
b 1.0 1.0
c 2.0 2.0
d 3.0 3.0
e 4.0 4.0
f NaN 5.0
g 5.0 NaN
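The same outer alignment can also be had with pd.concat (a sketch; the column names 'a' and 'b' are just illustrative):
pd.concat([df1.iloc[:, 0].rename('a'), df2.iloc[:, 0].rename('b')], axis=1)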
I'd like to calculate the percentile rank of each value based on which Group it belongs to. I have written the following code and was able to calculate, say, the z-score, since that function takes only one input. What should I do with a function that takes two arguments? Thanks.
import pandas as pd
import scipy.stats as stats
import numpy as np
funZScore = lambda x: (x - x.mean()) / x.std()
funPercentile = lambda x, y: stats.percentileofscore(x[~np.isnan(x)], y)
A = pd.DataFrame({'Group' : ['A','A','A','A','B','B','B'],
'Value' : [4, 7, None, 6, 2, 8, 1]})
# Compute the Z-score by group
A['Z'] = A.groupby('Group')['Value'].apply(funZScore)
print(A)
Group Value Z
0 A 4.0 -1.091089
1 A 7.0 0.872872
2 A NaN NaN
3 A 6.0 0.218218
4 B 2.0 -0.440225
5 B 8.0 1.144586
6 B 1.0 -0.704361
# compute the percentile rank by group
# how to put two arguments into groupby apply?
# I hope to get something like below
Group Value Z P
0 A 4.0 -1.091089 33.33
1 A 7.0 0.872872 100
2 A NaN NaN NaN
3 A 6.0 0.218218 66.67
4 B 2.0 -0.440225 66.67
5 B 8.0 1.144586 100
6 B 1.0 -0.704361 33.33
I think you need:
d = A.groupby('Group')['Value'].apply(list).to_dict()
print (d)
{'A': [4.0, 7.0, nan, 6.0], 'B': [2.0, 8.0, 1.0]}
A['P'] = A.apply(lambda x: funPercentile(np.array(d[x['Group']]), x['Value']), axis=1)
print (A)
Group Value Z P
0 A 4.0 -1.091089 33.333333
1 A 7.0 0.872872 100.000000
2 A NaN NaN NaN
3 A 6.0 0.218218 66.666667
4 B 2.0 -0.440225 66.666667
5 B 8.0 1.144586 100.000000
6 B 1.0 -0.704361 33.333333
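A groupby-native alternative worth knowing (a sketch): Series.rank(pct=True) gives the same numbers as percentileofscore's default kind='rank' whenever each score is drawn from its own group, and it keeps NaN rows as NaN automatically, so no dictionary round-trip is needed:
A['P'] = A.groupby('Group')['Value'].rank(pct=True) * 100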
I have a list of columns in a dataframe that shouldn't be empty.
I want to remove any rows that are empty in any of these columns. My solution would be to iterate through the required columns and set the column 'excluded' to the error message that the user will be shown before excluding them. (I will present these messages to the user as a report at the end of the process.)
I'm currently trying something like this:
for col in requiredColumns:
df[pd.isnull(df[col])]['excluded'] = df[pd.isnull(df[col])]['excluded'].apply(lambda x: str(x) + col + ' empty, excluded')
but no luck: the columns aren't updated. The filter by itself (to get only the empty rows) works; the update part doesn't.
I'm used to SQL:
UPDATE df SET e = e & "empty, excluded" WHERE NZ(col, '') = ''
If you need to update a pandas DataFrame based on multiple conditions, you can simply use .loc:
>>> df
A B C
0 2 40 800
1 1 90 600
2 6 80 700
3 1998 70 55
4 1 90 300
5 7 80 700
6 4 20 300
7 1998 20 2
8 7 10 100
9 1998 60 2
>>> df.loc[(df['A'] > 7) & (df['B'] > 69) , 'C'] = 75
This will set 'C' = 75 where 'A' > 7 and 'B' > 69
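Applied back to the original loop (a sketch, assuming the requiredColumns list and a string-typed 'excluded' column already exist), .loc makes the assignment hit the frame itself rather than a temporary copy returned by chained indexing:
for col in requiredColumns:
    mask = df[col].isnull()
    df.loc[mask, 'excluded'] = df.loc[mask, 'excluded'].astype(str) + col + ' empty, excluded'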
One way is to use numpy functions to create a column with the desired marker.
Setup
import pandas as pd, numpy as np
df = pd.DataFrame({'A': [1, np.nan, 2, 3, 4, 5],
'B': [2, 3, np.nan, 5, 1, 9],
'C': [5, 8, 1, 9, np.nan, 7]})
A B C
0 1.0 2.0 5.0
1 NaN 3.0 8.0
2 2.0 NaN 1.0
3 3.0 5.0 9.0
4 4.0 1.0 NaN
5 5.0 9.0 7.0
Solution
df['test'] = np.any(np.isnan(df.values), axis=1)
A B C test
0 1.0 2.0 5.0 False
1 NaN 3.0 8.0 True
2 2.0 NaN 1.0 True
3 3.0 5.0 9.0 False
4 4.0 1.0 NaN True
5 5.0 9.0 7.0 False
Explanation
np.isnan returns a Boolean array indicating which elements of a numpy array are NaN.
Use np.any or np.all, as required, to determine which rows are in scope.
Use df.values to extract the underlying numpy array from the dataframe. For selected columns, you can use df[['A', 'B']].values.
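A pandas-native equivalent is also worth noting, since np.isnan raises a TypeError on object (e.g. string) columns while DataFrame.isnull handles any dtype:
df['test'] = df.isnull().any(axis=1)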
I want to make a whole row NaN according to a condition based on one column. For example, if B > 5, I want to make the whole row NaN.
The unprocessed data frame looks like this:
A B
0 1 4
1 3 5
2 4 6
3 8 7
Make the whole row NaN if B > 5:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thank you.
Use boolean indexing to assign values by condition:
df[df['B'] > 5] = np.nan
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Or DataFrame.mask, which by default replaces values with NaN where the condition holds:
df = df.mask(df['B'] > 5)
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Or, as Bharath Shetty suggested, use df.where:
df = df.where(~(df['B']>5))
You can also use df.loc[df.B > 5, :] = np.nan
Example
In [14]: df
Out[14]:
A B
0 1 4
1 3 5
2 4 6
3 8 7
In [15]: df.loc[df.B > 5, :] = np.nan
In [16]: df
Out[16]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
In human language, df.loc[df.B > 5, :] = np.nan can be translated to: assign np.nan to every column (:) of the dataframe (df) wherever the condition df.B > 5 holds.
Or using reindex, which selects the rows to keep and then reinstates the dropped labels as all-NaN rows:
df.loc[df.B <= 5, :].reindex(df.index)
Out[83]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN