.fillna empties the whole column instead of replacing null values - python

I have a dataframe with a column named rDREB% which contains missing values, as shown in the screenshot of null-value counts per column. I tried:
playersData['rDREB%'] = playersData['rDREB%'].fillna(0, inplace=True)
After executing the code, the whole column is empty when I check. Isn't the code supposed to replace only null values with 0? I am confused.
before the code
after the code
P.S. I am also trying to replace missing values in other columns, e.g. ScoreVal, PlayVal, rORB%, OBPM, BPM...

Using inplace=True means fillna returns None, which is what you're assigning to your column. Either remove inplace, or don't assign the return value to the column:
playersData['rDREB%'] = playersData['rDREB%'].fillna(0)
or
playersData['rDREB%'].fillna(0, inplace=True)
The first approach is recommended. See this question for more info: In pandas, is inplace = True considered harmful, or not?
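Since you mention several other columns with missing values (ScoreVal, PlayVal, rORB%, OBPM, BPM), here is a minimal sketch of filling them all in one call, assuming those are the exact column names in playersData:
cols = ['rDREB%', 'ScoreVal', 'PlayVal', 'rORB%', 'OBPM', 'BPM']
playersData[cols] = playersData[cols].fillna(0)  # fillna returns a new frame; assign it back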

Related

Missing Value detection fails for completely empty cells python pandas

I have created a function for which the input is a pandas dataframe.
It should return the row-indices of the rows with a missing value.
It works for all of the defined missingness values except when the cell is entirely empty, even though I tried to cover this in the missing_values list with an empty string [..., ""].
What could be the issue here? Or is there even a more intuitive way to solve this in general?
def missing_values(x):
    df = x
    missing_values = ["NaN","NAN","NA","Na","n/a", "na", "--","-"," ","","None","0","-inf"] #common ways to indicate missingness
    observations = df.shape[0] # Gives number of observations (rows)
    variables = df.shape[1] # Gives number of variables (columns)
    row_index_list = []
    #this goes through each observation in the first row
    for n in range(0,variables): #this iterates over all variables
        column_list = [] #creates a list for each value per variable
        for i in range(0,observations): #now this iterates over every observation per variable
            column_list.append(df.iloc[i,n]) #and adds the values to the list
        for i in range(0,len(column_list)): #now for every value
            if column_list[i] in missing_values: #it is checked, whether the value is a Missing one
                row_index_list.append(column_list.index(column_list[i])) #and if yes, the row index is appended
    finished = list(set(row_index_list)) #set is used to make sure the index only appears once if there are multiple occurences in one row and then it is listed
    return finished
There might be spurious whitespace, so try adding strip() on this line:
if column_list[i].strip() in missing_values: #it is checked, whether the value is a Missing one
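Note that .strip() assumes the cell holds a string; a NaN float would raise an AttributeError. A hedged variant that simply skips non-string cells:
if isinstance(column_list[i], str) and column_list[i].strip() in missing_values: #non-strings are never in the string list anyway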
Also a simpler way to get the indexes of rows containing missing_values is with isin() and any(axis=1):
x = x.replace('\s+', '', regex=True)
row_index_list = x[x.isin(missing_values).any(axis=1)].index
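As a quick illustration on a small made-up frame (not your data):
import pandas as pd

x = pd.DataFrame({'a': ['1', 'NA', '3'], 'b': ['ok', 'fine', '-']})
missing_values = ["NaN", "NA", "n/a", "--", "-", " ", "", "None"]
x = x.replace(r'\s+', '', regex=True)                         # drop stray whitespace
row_index_list = x[x.isin(missing_values).any(axis=1)].index  # rows containing any missing marker
print(list(row_index_list))                                   # [1, 2]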
When you import a file into pandas using, for example, read_csv or read_excel, a literally missing cell comes in as np.nan (or another null type from the numpy library), so it cannot be matched against a list of strings.
(Sorry, my bad right here: I was being silly with np.nan == np.nan, which is always False.)
You can replace the np.nan value first with:
df = df.replace(np.nan, 'NaN')
then your function can catch it.
Another way is to use isna() in pandas,
df.isna()
This will return a DataFrame of the same shape, but with boolean values: True for each cell that is np.nan.
If you do df.isna().any(),
This will return a Series with True for any column that contains a null value.
If you want to retrieve the row IDs instead, simply add the parameter axis=1 to any():
df.isna().any(axis = 1)
This will return a Series showing which rows contain an np.nan value.
Now you have boolean values that indicate which rows contain nulls. If you put these boolean values into a list and apply it to df.index, this will pull out the index values of the rows containing nulls.
booleanlist = df.isna().any(axis =1).tolist()
null_row_id = df.index[booleanlist]
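Put together on a small made-up frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [4, 5, np.nan]})
booleanlist = df.isna().any(axis=1).tolist()
null_row_id = df.index[booleanlist]
print(list(null_row_id))   # [1, 2]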

Pandas: How to replace Zero values in a column with the mean of that column, For all columns with Zero Value

I have a dataframe with multiple values as zero.
I want to replace the values that are zero with the mean values of that column, without repeating code.
I have columns called runtime, budget, and revenue that all contain zeros, and I want to replace those zero values with the mean of that column.
I have tried to do it one column at a time like this:
print(df['budget'].mean())
-> 14624286.0643
df['budget'] = df['budget'].replace(0, 14624286.0643)
Is there a way to write a function so I don't have to repeat the code for each column with zero values?
Since this is a pandas DataFrame, I will use mask to turn all 0s into np.nan, then fillna:
df=df.mask(df==0).fillna(df.mean())
We can achieve the same directly with the replace method, without fillna:
df.replace(0,df.mean(axis=0),inplace=True)
Method info:
Replace values given in "to_replace" with "value".
Values of the DataFrame are replaced with other values dynamically.
This differs from updating with .loc or .iloc which require
you to specify a location to update with some value.
How about iterating through all columns and replacing them?
for col in df.columns:
    val = df[col].mean()
    df[col] = df[col].replace(0, val)
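If you only want to touch the three columns from the question and leave everything else alone, a sketch assuming the column names runtime, budget and revenue exist (note the mean here is computed after the zeros are set to NaN, as in the mask approach above):
import numpy as np

cols = ['runtime', 'budget', 'revenue']
df[cols] = df[cols].replace(0, np.nan)       # zeros become NaN
df[cols] = df[cols].fillna(df[cols].mean())  # fill each column with its own mean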

Pandas - df.fillna(df.mean()) not working on multiindex DataFrame

Here is my code:
df.head(20)
df = df.fillna(df.mean()).head(20)
Below is the result:
There are many NaNs.
I want to replace NaN with the column average, so I used df.fillna(df.mean()), but it has no effect.
What's the problem??
I have got it!! Before replacing the NaN values, I need to reset the index first.
Below is code:
df = df.reset_index()
df = df.fillna(df.mean())
now everything is okay!
This worked for me
for i in df.columns:
    df[i] = df[i].fillna(df[i].mean())  # fill each column with its own mean
Each column in your DataFrame has at least one non-numeric value (in the rows #0 and partially #1). When you apply .mean() to a DataFrame, it skips all non-numeric columns (in your case, all columns). Thus, the NaNs are not replaced. Solution: drop the non-numeric rows.
I think the problem may be that your columns are not of float or int type. Check with df.dtypes; if it returns object, mean won't work. Change the type using df.astype().
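A sketch of that check-and-convert step, assuming pandas is imported as pd and all the columns are meant to be numeric (pd.to_numeric with errors='coerce' turns anything non-numeric into NaN, so only do this for columns you actually want as numbers):
print(df.dtypes)                               # object columns are skipped by .mean()
df = df.apply(pd.to_numeric, errors='coerce')  # non-numeric cells become NaN
df = df.fillna(df.mean())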

Python Pandas Dataframes comparison on 2 columns (with where clause)

I'm stuck on a particular Python question here. I have 2 dataframes DF1 and DF2. In both, I have 2 columns pID and yID (which are not indexed, just default). I'm looking to add a column Found in DF1 where the respective values of the columns (pID and yID) were found in DF2. Also, I would like to narrow this down to just the values in DF2 where aID == 'Text'.
I believe the below gets me the 1st part of this question; however, I'm unsure how as to incorporate the where.
DF1['Found'] = (DF1[['pID', 'yID']] == DF2[['pID','yID']]).all(axis=1).astype(bool)
Suggestions or answers would be most appreciated. Thanks.
You could subset the second dataframe containing aID == 'Text' to get a reduced DF from which select those portions of columns to be compared against the first dataframe.
Use DataFrame.isin() to check whether the values under these column names match. Then .all(axis=1) returns True only when both columns are True for that row. Convert the boolean series to integers via astype(int) and assign the result to the new column, Found.
df1_sub = df1[['pID', 'yID']]
df2_sub = df2.query('aID=="Text"')[['pID', 'yID']]
df1['Found'] = df1_sub.isin(df2_sub).all(axis=1).astype(int)
df1
Demo DF's used:
df1 = pd.DataFrame(dict(pID=[1,2,3,4,5],
                        yID=[10,20,30,40,50]))
df2 = pd.DataFrame(dict(pID=[1,2,8,4,5],
                        yID=[10,12,30,40,50],
                        aID=['Text','Best','Text','Best','Text']))
If it does not matter where those matches occur, then merge the two dataframes on the common 'pID' and 'yID' columns as the key, passing right_index=True so the merge emits an index that can be aligned with after the operation is over.
Access the indices that indicate matches found and assign the value 1 to a new column named Found, filling its missing elements with 0's throughout.
df1.loc[pd.merge(df1_sub, df2_sub, on=['pID', 'yID'], right_index=True).index, 'Found'] = 1
df1['Found'].fillna(0, inplace=True)
df1 should be modified accordingly after the above steps.
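If combining on= with right_index= complains in your pandas version, here is an alternative sketch on the demo frames above that carries df1's row labels through the merge with reset_index():
df2_sub = df2.query('aID == "Text"')[['pID', 'yID']]
matched = df1.reset_index().merge(df2_sub, on=['pID', 'yID'])['index']  # original df1 row labels that matched
df1['Found'] = 0
df1.loc[matched, 'Found'] = 1    # rows 0 and 4 match in the demo data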

.fillna column if two cells are empty in Pandas

Can somebody tell me why in my for loop
df_all = pd.read_csv("assembly_summary.txt", delimiter='\t', index_col=0)
for row in df_all.index:
    if pd.isnull(df_all.infraspecific_name[row]) and pd.isnull(df_all.isolate[row]):
        df_all.infraspecific_name.fillna('NA', inplace=True)
print(df_all[['infraspecific_name', 'isolate']])
.fillna fills the specified cell even when the column referred to in the second part of the if statement is not null?
I am trying to use .fillna ONLY if both of the cells referred to in my if statement are null.
I also tried changing the second-to-last line to df_all.infraspecific_name[row].fillna('NA', inplace=True), which doesn't work either.
df_all.loc[row,['infraspecific_name']].fillna('NA', inplace=True) corrects the problem, but then when both cells infraspecific_name and isolate ARE null, it doesn't fill the cell with 'NA'
I am not sure if my lack of understanding is in Python loops or Pandas.
The .csv file I am using can be found at ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
Since you are indexing your first col, you could use update:
df_all['infraspecific_name']
returns a Series of only the specified column. The following will perform .fillna only on the selected elements (rows where the condition is True):
df_all['infraspecific_name'][(df_all['infraspecific_name'].isnull()) & (df_all['isolate'].isnull())].fillna('NA')
You can achieve all your steps in one line by combining the above and preceding it all with update.
df_all.update(df_all['infraspecific_name'][(df_all['infraspecific_name'].isnull()) & (df_all['isolate'].isnull())].fillna('NA'))
Number of rows changed
len(df_all[df_all['infraspecific_name'] == 'NA'])
1825
The rest of the dataframe should be intact.
This should get you what you want
csvfile = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt'
df_all = pd.read_csv(csvfile, delimiter='\t', index_col=0)
mask = df_all[['infraspecific_name', 'isolate']].isnull().all(axis=1)
df_all.loc[mask, 'infraspecific_name'] = 'NA'
The 3rd line takes these values, df_all[['infraspecific_name', 'isolate']], tests each one for nulls with .isnull(), and then the final .all(axis=1) checks whether all columns in each row are True.
The 4th line is using that mask to find the locations of the values that need changing.
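On a tiny made-up frame the same pattern looks like this; only the row where both cells are null gets filled:
import numpy as np
import pandas as pd

df = pd.DataFrame({'infraspecific_name': [np.nan, np.nan, 'strain A'],
                   'isolate': ['iso1', np.nan, np.nan]})
mask = df[['infraspecific_name', 'isolate']].isnull().all(axis=1)
df.loc[mask, 'infraspecific_name'] = 'NA'
print(df)   # only row 1 now shows 'NA' in infraspecific_name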
