I am taking a column from a csv file and putting its data into an array using pandas. However, many of the cells are empty and get stored in the array as 'nan'. I want to either identify the empty cells so I can skip them, or remove them all from the array afterwards. Something like the following pseudo-code:
if df.row(column number) == nan
skip
or
if df.row(column number) != nan
do stuff
Basically, how do I identify whether a cell from the csv file is empty?
The best approach is to get rid of the NaN rows after you load the file, by boolean indexing:
df = df[df['column_to_check'].notnull()]
For example, to get rid of the rows with NaN values in column 3 of the following dataframe:
>>> df
0 1 2 3 4
0 1.0 1.0 NaN 1.0 1.0
1 1.0 NaN 1.0 1.0 1.0
2 NaN NaN NaN NaN NaN
3 NaN 1.0 1.0 NaN NaN
4 1.0 NaN NaN 1.0 1.0
>>> df[df[3].notnull()]
0 1 2 3 4
0 1.0 1.0 NaN 1.0 1.0
1 1.0 NaN 1.0 1.0 1.0
4 1.0 NaN NaN 1.0 1.0
pd.isnull() and pd.notnull() are the standard ways of checking individual null values if you're iterating over a DataFrame row by row and indexing by column, as you suggest in your pseudo-code above. You can then use that check to do whatever you like with the value.
Example:
import pandas as pd
import numpy as np
a = np.nan
pd.isnull(a)
Out[4]: True
pd.notnull(a)
Out[5]: False
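For instance, a minimal sketch of the row-by-row check from your pseudo-code could look like this (the file name and column name below are placeholders, not taken from your data):
import pandas as pd

df = pd.read_csv('data.csv')  # hypothetical file

for value in df['column_to_check']:  # hypothetical column name
    if pd.isnull(value):
        continue  # skip empty cells
    # "do stuff" with the non-empty value
    print(value)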
If you want to manipulate all (or certain) NaN values in a DataFrame, note that handling missing data is a big topic when working with tabular data and there are many methods of doing so. I'd recommend chapter 7 from this book; its first section would be most pertinent to your question.
If you just want to exclude missing values, you can use pd.DataFrame.dropna()
Below is an example based on the one described by @sacul:
>>> import pandas as pd
>>> df
0 1 2 3 4
0 0.0 1.0 NaN 1.0 1.0
1 1.0 NaN 1.0 1.0 1.0
2 NaN NaN NaN NaN NaN
3 NaN 1.0 1.0 NaN NaN
4 1.0 NaN NaN 1.0 1.0
>>> df.dropna(axis=0, subset=['3'])
0 1 2 3 4
0 0.0 1.0 NaN 1.0 1.0
1 1.0 NaN 1.0 1.0 1.0
4 1.0 NaN NaN 1.0 1.0
axis=0 indicates that rows containing NaN are excluded.
subset=['3'] indicates that only column '3' is considered (use the label type that matches your columns).
See the pandas documentation for DataFrame.dropna for details.
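As a self-contained sketch of the same idea (the example frame is rebuilt here with pandas' default integer column labels, so the subset key is the integer 3; keep the string '3' only if your columns really are strings):
import numpy as np
import pandas as pd

# Reconstruction of the example frame shown above.
df = pd.DataFrame([[0.0, 1.0, np.nan, 1.0, 1.0],
                   [1.0, np.nan, 1.0, 1.0, 1.0],
                   [np.nan, np.nan, np.nan, np.nan, np.nan],
                   [np.nan, 1.0, 1.0, np.nan, np.nan],
                   [1.0, np.nan, np.nan, 1.0, 1.0]])

# Drop the rows whose value in column 3 is NaN.
print(df.dropna(axis=0, subset=[3]))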
I need to fill the NaN values of my df with a static 0, but only starting from the first non-NaN value in each column.
In effect, this combines method="ffill" (identify the first valid value per column and act only on the NaN values that follow it) with value=0 (fill with 0 rather than with the preceding value).
How can I do that? This post is close, but not it: How to replace NaNs by preceding or next values in pandas DataFrame?
Example df
0 1 2
0 NaN NaN NaN
1 6.0 NaN 1.0
2 NaN 3.0 NaN
3 NaN NaN 4.0
Desired output:
0 1 2
0 NaN NaN NaN
1 6.0 NaN 1.0
2 0.0 3.0 0.0
3 0.0 0.0 4.0
If possible, df.fillna(value=0, method='ffill') would be great. But that returns ValueError: Cannot specify both 'value' and 'method'.
Edit: Oh, and time matters. We are talking ~60M rows and 4k columns, so looping is out of the question, and masking only if it is really, really fast.
You can try mask(), ffill() and fillna():
df = df.fillna(df.mask(df.ffill().notna(), 0))
# OR via where
df = df.fillna(df.where(df.ffill().isna(), 0))
output:
0 1 2
0 NaN NaN NaN
1 6.0 NaN 1.0
2 0.0 3.0 0.0
3 0.0 0.0 4.0
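A sketch of the intermediate steps, using the example frame from the question, may make the chain easier to follow:
import numpy as np
import pandas as pd

df = pd.DataFrame({0: [np.nan, 6.0, np.nan, np.nan],
                   1: [np.nan, np.nan, 3.0, np.nan],
                   2: [np.nan, 1.0, np.nan, 4.0]})

# True at and after the first valid value in each column, False before it.
seen = df.ffill().notna()

# Zero out every position at or after the first valid value...
zeros = df.mask(seen, 0)

# ...then use those zeros only to fill the NaNs of the original frame,
# so existing values and the leading NaNs are left untouched.
print(df.fillna(zeros))
Both variants are fully vectorised, so they should scale to the ~60M-row frame mentioned in the edit.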
Say I have two data frames:
Original:
A B C
0 NaN 4.0 7.0
1 2.0 5.0 NaN
2 NaN NaN 9.0
Imputation:
A B C
0 1 4 7
1 2 5 8
2 3 6 9
(Both are the same dataframe, except the imputation one has the NaNs filled in.)
I would like to reintroduce the NaN values into column A of the imputation df so it looks like this (columns B and C are filled in, but A keeps the NaN values):
# A B C
# 0 NaN 4.0 7.0
# 1 2.0 5.0 8.0
# 2 NaN 6.0 9.0
import pandas as pd
import numpy as np
dfImputation = pd.DataFrame({'A': [1, 2, 3],
                             'B': [4, 5, 6],
                             'C': [7, 8, 9]})
dfOrginal = pd.DataFrame({'A': [np.nan, 2, np.nan],
                          'B': [4, 5, np.nan],
                          'C': [7, np.nan, 9]})
print(dfOrginal.fillna(dfImputation))
I do not get the result I want because it obviously fills in all the values. Is there a way to reintroduce the NaN values, or a way to fill in NA for specific columns only? I'm not quite sure of the best approach to get the intended outcome.
You can fill in only specified columns by subsetting the frame you pass into the fillna operation:
>>> dfOrginal.fillna(dfImputation[["B", "C"]])
A B C
0 NaN 4.0 7.0
1 2.0 5.0 8.0
2 NaN 6.0 9.0
Check DataFrame.update, which overwrites a frame in place with the non-NaN values of the frame you pass in (here df is the original and im is the imputation frame):
df.update(im[['B','C']])
df
Out[7]:
A B C
0 NaN 4.0 7.0
1 2.0 5.0 8.0
2 NaN 6.0 9.0
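Put together with the variable names from the question, a runnable sketch of the update approach would be:
import numpy as np
import pandas as pd

dfImputation = pd.DataFrame({'A': [1, 2, 3],
                             'B': [4, 5, 6],
                             'C': [7, 8, 9]})
dfOrginal = pd.DataFrame({'A': [np.nan, 2, np.nan],
                          'B': [4, 5, np.nan],
                          'C': [7, np.nan, 9]})

# update() aligns on index and columns and overwrites dfOrginal in place with
# the non-NaN values of the passed frame; restricting it to B and C leaves
# column A's NaNs untouched.
dfOrginal.update(dfImputation[['B', 'C']])
print(dfOrginal)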
This is my empty dataframe:
Columns: [TERMID, NAME, TYP, NAMECHANGE, ALIASES, SUCHREGELN, ZUW, SEOTEXT1, SEOTEXT2, SEOKommentar, DBIKommentar]
one two three four
0 1.0 4.0 2.4 6.4
1 2.0 3.0 4.4 4.1
2 3.0 2.0 7.0 1.0
3 4.0 1.0 9.0 5.0
I need to fill these values into my empty dataframe.
So let's say 'TERMID' takes the value from 'one', 'TYP' the value from 'two', 'ZUW' the value from 'three', and last but not least 'SEOKommentar' takes the value from 'four'.
The empty dataframe needs to be filled row by row, and the columns which are not filled should contain NaN.
How can I do this in an accurate way?
IIUC, you can rename the second dataframe's columns and then reindex them to the columns of the original empty dataframe:
Creating the empty data frame:
s = 'TERMID,NAME,TYP,NAMECHANGE,ALIASES,SUCHREGELN,ZUW,SEOTEXT1,SEOTEXT2,SEOKommentar,DBIKommentar'
df = pd.DataFrame(columns=s.split(','))
Empty DataFrame
Columns: [TERMID, NAME, TYP, NAMECHANGE, ALIASES, SUCHREGELN, ZUW, SEOTEXT1, SEOTEXT2, SEOKommentar, DBIKommentar]
Index: []
Solution (df1 is the second dataframe in your example):
d = {'one': 'TERMID', 'two': 'TYP', 'three': 'ZUW', 'four': 'SEOKommentar'}
df = df1.rename(columns=d).reindex(columns=df.columns)
TERMID NAME TYP NAMECHANGE ALIASES SUCHREGELN ZUW SEOTEXT1 \
0 1.0 NaN 4.0 NaN NaN NaN 2.4 NaN
1 2.0 NaN 3.0 NaN NaN NaN 4.4 NaN
2 3.0 NaN 2.0 NaN NaN NaN 7.0 NaN
3 4.0 NaN 1.0 NaN NaN NaN 9.0 NaN
SEOTEXT2 SEOKommentar DBIKommentar
0 NaN 6.4 NaN
1 NaN 4.1 NaN
2 NaN 1.0 NaN
3 NaN 5.0 NaN
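As one runnable sketch (the construction of df1 below is a hypothetical reconstruction of the source frame shown in the question):
import pandas as pd

s = 'TERMID,NAME,TYP,NAMECHANGE,ALIASES,SUCHREGELN,ZUW,SEOTEXT1,SEOTEXT2,SEOKommentar,DBIKommentar'
df = pd.DataFrame(columns=s.split(','))

# Hypothetical reconstruction of the source frame from the question.
df1 = pd.DataFrame({'one': [1.0, 2.0, 3.0, 4.0],
                    'two': [4.0, 3.0, 2.0, 1.0],
                    'three': [2.4, 4.4, 7.0, 9.0],
                    'four': [6.4, 4.1, 1.0, 5.0]})

d = {'one': 'TERMID', 'two': 'TYP', 'three': 'ZUW', 'four': 'SEOKommentar'}
# rename maps the source columns onto the target names; reindex then adds the
# remaining target columns, filled with NaN.
df = df1.rename(columns=d).reindex(columns=df.columns)
print(df)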
I have a Pandas DataFrame called df with 1,460 rows and 81 columns. I want to remove all columns where at least half the entries are NaN and to do something similar for rows.
From the Pandas docs, I attempted this:
train_df.shape  # (1460, 81)
train_df.dropna(thresh=len(train_df)/2, axis=1, inplace=True)
train_df.shape  # (1460, 77)
Is this the correct way of doing it? It seems to remove 4 columns, but I'm surprised: I would have thought len(train_df) gives me the number of rows, so have I passed the wrong value to thresh?
How would I do the same thing for rows (removing rows where at least half the columns are NaN)?
Thanks!
You did the right thing: len(train_df) and len(train_df.index) both return the number of rows, so thresh=len(train_df)/2 keeps only the columns that have at least that many non-NaN values. Writing it with .index just makes the intent explicit:
train_df.dropna(thresh=len(train_df.index)/2, axis=1, inplace=True)
Hope that helps.
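To cover the rows part of the question as well, here is a sketch of both directions (the random frame below is just a stand-in for train_df):
import numpy as np
import pandas as pd

# Stand-in for train_df.
train_df = pd.DataFrame(np.random.choice([1.0, np.nan], size=(8, 6)))
n_rows, n_cols = train_df.shape

# Keep the columns that have at least half of their entries non-NaN.
cols_kept = train_df.dropna(axis=1, thresh=n_rows / 2)

# Same idea for rows: keep the rows with at least half of their columns non-NaN.
rows_kept = train_df.dropna(axis=0, thresh=n_cols / 2)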
Using count and loc. count(axis=...) ignores NaNs when counting, so it gives the number of non-NaN values per row or column.
In [4135]: df.loc[df.count(1) > df.shape[1]/2, df.count(0) > df.shape[0]/2]
Out[4135]:
0
0 0.382991
1 0.428040
7 0.441113
Details
In [4136]: df
Out[4136]:
0 1 2 3
0 0.382991 0.658090 0.881214 0.572673
1 0.428040 0.258378 0.865269 0.173278
2 0.579953 NaN NaN NaN
3 0.117927 NaN NaN NaN
4 0.597632 NaN NaN NaN
5 0.547839 NaN NaN NaN
6 0.998631 NaN NaN NaN
7 0.441113 0.527205 0.779821 0.251350
In [4137]: df.count(1) > df.shape[1]/2
Out[4137]:
0 True
1 True
2 False
3 False
4 False
5 False
6 False
7 True
dtype: bool
In [4138]: df.count(0) < df.shape[0]/2
Out[4138]:
0 False
1 True
2 True
3 True
dtype: bool
Setup
np.random.seed([3,14159])
df = pd.DataFrame(np.random.choice([1, np.nan], size=(10, 10)))
df
0 1 2 3 4 5 6 7 8 9
0 1.0 1.0 NaN NaN NaN 1.0 1.0 NaN 1.0 NaN
1 NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN 1.0
2 NaN 1.0 1.0 NaN NaN NaN NaN 1.0 1.0 1.0
3 1.0 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN
4 1.0 1.0 1.0 1.0 1.0 1.0 NaN NaN 1.0 NaN
5 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN NaN 1.0
6 NaN NaN 1.0 NaN NaN 1.0 1.0 NaN NaN 1.0
7 NaN NaN NaN 1.0 NaN 1.0 NaN 1.0 NaN NaN
8 1.0 1.0 1.0 NaN 1.0 NaN 1.0 NaN NaN 1.0
9 NaN NaN NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0
Solution 1
This assumes you make the calculation for rows and columns before you drop either rows or columns.
n = df.notnull()
# n.mean(1) is the fraction of non-null values per row and n.mean() per column;
# keep only the rows and columns where more than half the values are non-null.
df.loc[n.mean(1) > .5, n.mean() > .5]
5 6 9
1 1.0 1.0 1.0
4 1.0 NaN NaN
8 NaN 1.0 1.0
9 1.0 1.0 1.0
Solution 2
Similar concept but using numpy tools.
# Boolean array marking the NaN positions.
v = np.isnan(df.values)
# Rows and columns whose NaN count is below half of the respective dimension.
r = np.count_nonzero(v, 1) < v.shape[1] // 2
c = np.count_nonzero(v, 0) < v.shape[0] // 2
df.loc[r, c]
5 6 9
1 1.0 1.0 1.0
4 1.0 NaN NaN
8 NaN 1.0 1.0
9 1.0 1.0 1.0
To do the same for rows (dropping rows where at least half of the columns are NaN), use dropna with a thresh based on the number of columns:
df.dropna(thresh=df.shape[1]/2, axis=0, inplace=True)
I am trying to create a very large dataframe, made up of one column from many smaller dataframes (renamed to the dataframe name). I am using pd.concat() and looping through dictionary values which represent dataframes, and looping over index values, to create the large dataframe. The pd.concat() join_axes is the common index to all the dataframes. This works fine; however, I then have duplicate column names.
I must be able to loop over the indexes at specific windows as part of my final dataframe creation, so removing this step isn't an option.
For example, this results in a final dataframe with duplicate column names. Is there any way I can use pd.concat() exactly as I am, but merge the duplicate columns afterwards into a single set of columns?
I think you need:
df = pd.concat([df1, df2])
Or, if there are duplicate column names, use groupby on the column labels (level=0, axis=1); where values overlap, they are summed. (In recent pandas versions, all-NaN groups sum to 0 unless you pass min_count=1 to sum, and axis=1 grouping is deprecated in favour of grouping the transposed frame.)
print (df.groupby(level=0, axis=1).sum())
Sample:
df1 = pd.DataFrame({'A': [5, 8, 7, np.nan],
                    'B': [1, np.nan, np.nan, 9],
                    'C': [7, 3, np.nan, 0]})
df2 = pd.DataFrame({'A': [np.nan, np.nan, np.nan, 2],
                    'B': [1, 2, np.nan, np.nan],
                    'C': [np.nan, 6, np.nan, 3]})
print (df1)
A B C
0 5.0 1.0 7.0
1 8.0 NaN 3.0
2 7.0 NaN NaN
3 NaN 9.0 0.0
print (df2)
A B C
0 NaN 1.0 NaN
1 NaN 2.0 6.0
2 NaN NaN NaN
3 2.0 NaN 3.0
df = pd.concat([df1, df2],axis=1)
print (df)
A B C A B C
0 5.0 1.0 7.0 NaN 1.0 NaN
1 8.0 NaN 3.0 NaN 2.0 6.0
2 7.0 NaN NaN NaN NaN NaN
3 NaN 9.0 0.0 2.0 NaN 3.0
print (df.groupby(level=0, axis=1).sum())
A B C
0 5.0 2.0 7.0
1 8.0 2.0 9.0
2 7.0 NaN NaN
3 2.0 9.0 3.0
What you want is df1.combine_first(df2). Refer to the pandas documentation.
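A minimal sketch with the sample frames from the previous answer (note that, unlike the groupby sum, overlapping values are taken from df1 rather than added together):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [5, 8, 7, np.nan],
                    'B': [1, np.nan, np.nan, 9],
                    'C': [7, 3, np.nan, 0]})
df2 = pd.DataFrame({'A': [np.nan, np.nan, np.nan, 2],
                    'B': [1, 2, np.nan, np.nan],
                    'C': [np.nan, 6, np.nan, 3]})

# combine_first keeps df1's values and only patches its NaNs with the values
# from df2 at the same row and column labels.
print(df1.combine_first(df2))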