In my code, df.fillna() does not work, whereas df.dropna() does. I don't want to drop the column, though. What can I do to make fillna() work?
def preprocess_df(df):
    for col in df.columns:  # go through all of the columns
        if col != "target":  # normalize all ... except for the target itself!
            df[col] = df[col].pct_change()  # pct change "normalizes" the different currencies (each crypto coin has vastly diff values, we're really more interested in the other coin's movements)
            # df.dropna(inplace=True)  # remove the nas created by pct_change
            df.fillna(method="ffill", inplace=True)
            print(df)
            break
            df[col] = preprocessing.scale(df[col].values)  # scale between 0 and 1.
It should work, unless it is not within the loop, as mentioned.
You should consider filling the values before you construct the loop, or during the DataFrame construction.
The example below clearly shows it working:
>>> df
  col1
0  one
1  NaN
2  two
3  NaN
Works as expected:
>>> df['col1'].fillna(method='ffill')  # this fills only the column `col1`
0    one
1    one
2    two
3    two
Name: col1, dtype: object
Secondly, if you want to change only a few selected columns, use the method below.
Suppose you have 3 columns and want to fillna with ffill for only 2 of them.
>>> df
  col1  col2 col3
0  one  test  new
1  NaN   NaN  NaN
2  two  rest  NaN
3  NaN   NaN  NaN
Define the columns to be changed:
>>> cols = ['col1', 'col2']
>>> df[cols] = df[cols].fillna(method='ffill')
>>> df
  col1  col2 col3
0  one  test  new
1  one  test  NaN
2  two  rest  NaN
3  two  rest  NaN
If you want this to apply across the entire DataFrame, use it as follows:
>>> df
  col1  col2
0  one  test
1  NaN   NaN
2  two  rest
3  NaN   NaN
>>> df.fillna(method='ffill')  # pass inplace=True if you want the change to be permanent
  col1  col2
0  one  test
1  one  test
2  two  rest
3  two  rest
The first value was a NaN, so I had to use the bfill method instead. Thanks, everyone.
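The leading-NaN case mentioned here can be reproduced on a small series. This is a minimal sketch with made-up prices, not the asker's data: pct_change() always leaves the first row empty, forward fill has nothing earlier to copy from, and backward fill does. It uses the ffill() and bfill() methods, which recent pandas releases recommend over fillna(method=...).

```python
import pandas as pd

# Made-up prices, purely for illustration.
prices = pd.Series([100.0, 110.0, 99.0, 108.9])

# The first element has no predecessor, so its percentage change is NaN.
changes = prices.pct_change()

# Forward fill copies the previous valid value; the first row has none.
forward = changes.ffill()

# Backward fill copies the next valid value, so the first row gets filled.
backward = changes.bfill()

leading_nan_after_ffill = bool(forward.isna().iloc[0])   # True
leading_nan_after_bfill = bool(backward.isna().iloc[0])  # False
print(leading_nan_after_ffill, leading_nan_after_bfill)
```

Whether a backward-filled first row is acceptable depends on the analysis; dropping that single row is the other common choice.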
Let's say the dataframe has 300 columns:
col1  col2  col3  ...  col300
A     2     5     ...  50
A     NaN   NaN   ...  32
B     5     4     ...  NaN
I want to fill in the blanks with the means of the corresponding groups in col1, but I want to keep col4 and col5 unchanged, so that the NaN values in those columns stay as they are.
I am using the following code to fill the NaN values for the entire dataframe:
df.groupby("col1").transform(lambda x: x.fillna(x.mean()))
What can I do to add col4 and col5 as exceptions?
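You can drop col4 and col5 before computing the group means; fillna then leaves them alone because means has no such columns:
means = df.drop(['col4','col5'], axis=1).groupby('col1').transform('mean')
df = df.fillna(means)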
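The drop-then-fill idea can be checked on a small frame. This is a sketch under assumptions: the data is made up, and keep_nan is a hypothetical name for the list of columns whose NaN values should survive.

```python
import numpy as np
import pandas as pd

# Made-up data; col4 stands in for a column that must keep its NaN values.
df = pd.DataFrame({
    "col1": ["A", "A", "B", "B"],
    "col2": [2.0, np.nan, 5.0, 7.0],
    "col3": [5.0, np.nan, 4.0, np.nan],
    "col4": [50.0, np.nan, np.nan, 1.0],
})

keep_nan = ["col4"]  # hypothetical list of columns to leave untouched

# Per-group means for every column except the excluded ones.
means = df.drop(columns=keep_nan).groupby("col1").transform("mean")

# fillna aligns on index and column labels, so columns missing from
# `means` are not filled.
filled = df.fillna(means)
print(filled)
```

Because the alignment is by label, the same pattern should scale to hundreds of columns without listing the ones to fill.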
I imported an Excel file and now I need to multiply certain values from the list, but if the value in the first column is NaN, Python should use another column for the calculation. I have the following code:
if pd['Column1'] == 'NaN':
    pd['Column2'] * pd['Column3']
else:
    pd['Column1'] * pd['Column3']
Thank you for your help.
You can use isna() together with any() or all(). Here is an example:
import pandas as pd
import numpy as np
# generating test data assuming all the values in Col1 are NaN
df = pd.DataFrame({'Col1':[np.nan,np.nan,np.nan,np.nan], 'Col2':[1,2,3,4], 'Col3':[2,3,4,5]})
if df['Col1'].isna().all():  # you can also use 'any()' instead of all()
    df['Col4'] = df['Col2']*df['Col3']
else:
    df['Col4'] = df['Col1']*df['Col3']
print(df)
Output:
   Col1  Col2  Col3  Col4
0   NaN     1     2     2
1   NaN     2     3     6
2   NaN     3     4    12
3   NaN     4     5    20
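If the fallback should apply row by row rather than to the whole column at once, one option is to fill the gaps in Col1 from Col2 before multiplying. A minimal sketch with made-up values, assuming Col2 is always populated where Col1 is missing:

```python
import numpy as np
import pandas as pd

# Made-up sample: Col1 is missing in rows 0 and 2 only.
df = pd.DataFrame({
    "Col1": [np.nan, 10.0, np.nan],
    "Col2": [1.0, 2.0, 3.0],
    "Col3": [2.0, 3.0, 4.0],
})

# Use Col1 where present, otherwise fall back to Col2 in the same row.
first_factor = df["Col1"].fillna(df["Col2"])
df["Col4"] = first_factor * df["Col3"]

print(df["Col4"].tolist())  # [2.0, 30.0, 12.0]
```

This avoids the if/else entirely, since fillna with a Series picks the replacement value per row.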
I have a large dataframe and I want to search 144 of the columns to check if there are any negative values in them. If there is even one negative value in a column, I want to replace the whole column with np.nan. I then want to use the new version of the dataframe for later analysis.
I've tried a variety of methods but can't seem to find one that works. I think this is almost there, but I can't find a solution to what I'm trying to do.
clean_data_df.loc[clean_data_df.cols < 0, cols] = np.nan #cols is a list of the column names I want to check
null_columns=clean_data_df.columns[clean_data_df.isnull().any(axis=1)]
clean_data_df[null_columns] = np.nan
When I run the above code I get the following error: AttributeError: 'DataFrame' object has no attribute 'cols'
Thanks in advance!
You could use a loop to iterate over the columns. Note that this checks for NaN, so it assumes the negative values have already been replaced with NaN:
for i in col:
    if df[i].isna().any():
        df[i] = np.nan
Minimum reproducible example:
df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan],'c':[1,2,3]})
for i in df:
    if df[i].isna().any():
        df[i] = np.nan
print(df)
Output:
    a   b  c
0 NaN NaN  1
1 NaN NaN  2
2 NaN NaN  3
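The loop above tests for NaN. To test for negative values directly, as the question describes, the condition can be swapped for a comparison. A sketch with made-up data, where cols is a hypothetical list of the columns to check:

```python
import numpy as np
import pandas as pd

# Made-up data: "a" contains a negative value, "b" does not,
# and "c" is negative but deliberately left out of the check.
df = pd.DataFrame({
    "a": [1, -2, 3],
    "b": [4, 5, 6],
    "c": [-1.0, 0.5, 2.0],
})
cols = ["a", "b"]  # hypothetical list of columns to check

for name in cols:
    # Blank out the whole column if any value in it is negative.
    if (df[name] < 0).any():
        df[name] = np.nan

print(df)
```

Only column "a" is replaced here; "c" keeps its values because it is not in cols.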
The idea is to test only the columns listed in cols with DataFrame.lt and DataFrame.any, then add all the other columns, filled with False, using Series.reindex, and finally set the values with DataFrame.loc; here the first : means all rows:
df = pd.DataFrame({'a':list('abc'), 'b':[-2,-1,-3],'c':[1,2,3]})
cols = ['b','c']
df.loc[:, df[cols].lt(0).any().reindex(df.columns, fill_value=False)] = np.nan
print(df)
   a   b  c
0  a NaN  1
1  b NaN  2
2  c NaN  3
Detail:
print(df[cols].lt(0).any())
b     True
c    False
dtype: bool
print(df[cols].lt(0).any().reindex(df.columns, fill_value=False))
a    False
b     True
c    False
dtype: bool
I'm trying to match values in a matrix in Python using pandas dataframes. Maybe this is not the best way to express it.
Imagine you have the following dataset:
import pandas as pd
d = {'stores':['','','','',''],'col1': ['x','price','','',1],'col2':['y','quantity','',1,''], 'col3':['z','',1,'',''] }
df = pd.DataFrame(data=d)
  stores   col1      col2 col3
0    NaN      x         y    z
1    NaN  price  quantity  NaN
2    NaN    NaN       NaN    1
3    NaN    NaN         1  NaN
4    NaN      1       NaN  NaN
I'm trying to get the following:
  stores   col1      col2 col3
0    NaN      x         y    z
1    NaN  price  quantity  NaN
2      z    NaN       NaN    1
3      y    NaN         1  NaN
4      x      1       NaN  NaN
Any ideas how this might work? I've tried running loops on lists but I'm not quite sure how to do it.
This is what I have so far but it's just terrible (and obviously not working) and I am sure there is a much simpler way of doing this but I just can't get my head around it.
stores = ['x','y','z']
for i in stores:
    for v in df.iloc[0,:]:
        if i==v :
            df['stores'] = i
It yields the following:
  stores   col1      col2 col3
0      z      x         y    z
1      z  price  quantity  NaN
2      z    NaN       NaN    1
3      z    NaN         1  NaN
4      z      1       NaN  NaN
Thank you in advance.
You can complete this task with a loop. It goes through each column except the first (the one you want to write to), takes the index values where the value is 1, and writes that column's first-row value to the 'stores' column.
Be careful if a row has 1s in several columns: in that case the stores column ends up with the label of the last column that had a 1.
for col in df.columns[1:]:
    index_values = df[col][df[col]==1].index.tolist()
    df.loc[index_values, 'stores'] = df[col][0]
You can fill the whole column at once, like this:
df["stores"] = df[["col1", "col2", "col3"]].rename(columns=df.loc[0]).eq(1).idxmax(axis=1)
This first creates a version of the dataframe with the columns renamed "x", "y", and "z" after the values in the first row; then idxmax(axis=1) returns the column heading associated with the max value in each row (which is the True one).
However, this adds an "x" in rows where none of the columns has a 1. If that is a problem, you could do something like this:
import numpy as np
df["NA"] = 1  # add a column of ones
df["stores"] = df[["col1", "col2", "col3", "NA"]].rename(columns=df.loc[0]).eq(1).idxmax(axis=1)
df["stores"] = df["stores"].replace(1, np.nan)  # replace the 1s with NaNs
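Another way to avoid the stray "x" is to keep a separate mask of which rows actually contain a 1 and only write to those rows. This is a sketch of that variant, not the answerer's method; value_cols, labels, hits and matched are names introduced here for illustration.

```python
import pandas as pd

# The frame from the question, with empty strings as placeholders.
df = pd.DataFrame({
    "stores": ["", "", "", "", ""],
    "col1": ["x", "price", "", "", 1],
    "col2": ["y", "quantity", "", 1, ""],
    "col3": ["z", "", 1, "", ""],
})

value_cols = ["col1", "col2", "col3"]

# First-row labels keyed by column name: col1 -> "x", col2 -> "y", col3 -> "z".
labels = df.loc[0, value_cols]

# True where a cell equals 1.
hits = df[value_cols].eq(1)

# Name of the first matching column per row, translated to its label.
matched = hits.idxmax(axis=1).map(labels)

# Keep the label only for rows that really contain a 1.
df["stores"] = matched.where(hits.any(axis=1), df["stores"])

print(df["stores"].tolist())  # ['', '', 'z', 'y', 'x']
```

Rows without a 1 keep whatever was in the stores column before, so no helper column has to be added.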
Given this dataframe, how to select only those rows that have "Col2" equal to NaN?
df = pd.DataFrame([range(3), [0, np.nan, 0], [0, 0, np.nan], range(3), range(3)], columns=["Col1", "Col2", "Col3"])
which looks like:
   Col1  Col2  Col3
0     0   1.0   2.0
1     0   NaN   0.0
2     0   0.0   NaN
3     0   1.0   2.0
4     0   1.0   2.0
The result should be this one:
   Col1  Col2  Col3
1     0   NaN   0.0
Try the following:
df[df['Col2'].isnull()]
@qbzenker provided the most idiomatic method, IMO.
Here are a few alternatives:
In [28]: df.query('Col2 != Col2') # Using the fact that: np.nan != np.nan
Out[28]:
Col1 Col2 Col3
1 0 NaN 0.0
In [29]: df[np.isnan(df.Col2)]
Out[29]:
Col1 Col2 Col3
1 0 NaN 0.0
If you want to select rows with at least one NaN value, then you could use isna + any on axis=1:
df[df.isna().any(axis=1)]
If you want to select rows with a certain number of NaN values, then you could use isna + sum on axis=1 + gt. For example, the following will fetch rows with at least 2 NaN values:
df[df.isna().sum(axis=1)>1]
If you want to limit the check to specific columns, you could select them first, then check:
df[df[['Col1', 'Col2']].isna().any(axis=1)]
If you want to select rows with all NaN values, you could use isna + all on axis=1:
df[df.isna().all(axis=1)]
If you want to select rows with no NaN values, you could use notna + all on axis=1:
df[df.notna().all(axis=1)]
This is equivalent to:
df[df['Col1'].notna() & df['Col2'].notna() & df['Col3'].notna()]
which could become tedious if there are many columns. Instead, you could use functools.reduce to chain & operators:
import functools, operator
df[functools.reduce(operator.and_, (df[i].notna() for i in df.columns))]
or numpy.logical_and.reduce:
import numpy as np
df[np.logical_and.reduce([df[i].notna() for i in df.columns])]
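The row filters above can be compared side by side on a small frame. A minimal sketch with made-up data, where the last row is entirely NaN so that every filter selects something:

```python
import numpy as np
import pandas as pd

# Made-up frame: row 0 is complete, rows 1 and 2 have one NaN each,
# and row 3 is entirely NaN.
df = pd.DataFrame(
    [[0, 1, 2], [0, np.nan, 0], [0, 0, np.nan], [np.nan, np.nan, np.nan]],
    columns=["Col1", "Col2", "Col3"],
)

nan_mask = df.isna()

any_nan = df[nan_mask.any(axis=1)]          # rows 1, 2, 3
two_or_more = df[nan_mask.sum(axis=1) > 1]  # row 3
all_nan = df[nan_mask.all(axis=1)]          # row 3
no_nan = df[df.notna().all(axis=1)]         # row 0

print(any_nan.index.tolist(), two_or_more.index.tolist())
print(all_nan.index.tolist(), no_nan.index.tolist())
```

Computing isna() once and reusing the mask keeps the filters cheap when several of them are needed on the same frame.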
If you want to use query to filter the rows where some column has no NaN, you can do so with the engine='python' parameter:
df.query('Col2.notna()', engine='python')
or use the fact that NaN != NaN, like @MaxU - stop WAR against UA:
df.query('Col2==Col2')