This code selects all the columns of a Dask DataFrame whose dtype is int or float, and then fills NaN values with zero:
df_dask = df_dask.select_dtypes(include=['int64', 'float64'])
df_dask = df_dask.where(df_dask.notnull(), 0)
print(df_dask.compute())
The problem is that the original DataFrame has string columns that I need to keep in the final DataFrame, but they are dropped by the first filter.
How can I keep all the columns and set zero only where the column is numeric and the value is NaN?
Why not just use the standard fillna method on the selected columns?
Something like:
select_cols = df_dask.select_dtypes(include=['int64', 'float64']).columns
for c in select_cols:
    df_dask[c] = df_dask[c].fillna(0)
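Alternatively, a single fillna call with a column-to-value dict avoids the Python-level loop. A minimal sketch on made-up data; it assumes your Dask version forwards dict values to pandas' fillna per partition, as current releases do:
import dask.dataframe as dd
import numpy as np
import pandas as pd

# Hypothetical sample frame; your df_dask would come from elsewhere.
pdf = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                    'b': ['x', 'y', 'z'],
                    'c': [np.nan, 2.0, 5.0]})
df_dask = dd.from_pandas(pdf, npartitions=1)

# Fill only the numeric columns; string columns pass through untouched.
num_cols = df_dask.select_dtypes(include=['int64', 'float64']).columns
df_dask = df_dask.fillna({c: 0 for c in num_cols})
print(df_dask.compute())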
The problem is that when I transpose the DataFrame, the header of the transposed DataFrame becomes the numerical Index values and not the values in the "id" column. See the original data below for examples:
Original data that I wanted to transpose (but keep the 0,1,2,... Index intact and change "id" to "id2" in the final transposed DataFrame).
DataFrame after I transpose; notice the headers are the Index values and NOT the "id" values (which is what I was expecting and needed).
Logic Flow
First this helped to get rid of the numerical index that got placed as the header: How to stop Pandas adding time to column title after transposing a datetime index?
Then this helped to get rid of the index numbers as the header: Reassigning index in pandas DataFrame
But now my "id" and "index" values got shuffled for some reason.
How can I fix this so the columns are [id2, 600mpe, au565, ...]?
How can I do this more efficiently?
Here's my code:
DF = pd.read_table(data, sep="\t", index_col=[0]).transpose()  # index_col=[0] keeps the index values from becoming their own row during transposition
m, n = DF.shape
DF.reset_index(drop=False, inplace=True)
DF.head()
This didn't help much: Add indexed column to DataFrame with pandas
If I understand your example, what happens is that the transpose takes your actual index (the 0...n sequence) as the column headers. First, if you want to preserve the numerical index, you can store it as id2:
DF['id2'] = DF.index
Now if you want id to be the column headers, you must set it as the index, overriding the default one:
DF.set_index('id',inplace=True)
DF.T
I don't have your data reproduced, but this should give you the values of id across columns.
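Putting the answer's two steps together on made-up data (the id values are taken from the question; the other columns are assumptions):
import pandas as pd

# Hypothetical stand-in for the tab-separated file.
DF = pd.DataFrame({'id': ['600mpe', 'au565'],
                   'val1': [1.0, 2.0],
                   'val2': [3.0, 4.0]})

DF['id2'] = DF.index        # preserve the 0..n index as its own column
DFt = DF.set_index('id').T  # "id" values become the column headers
print(DFt)
# Columns are now ['600mpe', 'au565'], and 'id2' survives as a row.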
I have a pivot table. Unfortunately, I am unable to cast the column to int because of NaN values, and it represents a year in the data. Is there a way to use a function (a lambda?) to manipulate the columns during creation of the pivot table?
submissions_by_country = df_maa_lu.pivot_table(
    index=["COUNTRY_DISPLAY_LABEL"],
    columns=["APPROVAL_YEAR"],
    values='LU_NUMBER_NO_SUFFIX',
    aggfunc='nunique',
    fill_value=0)
@smackenzie,
Is it possible to replace the values and recast? For example, assuming your dataframe is called df:
import numpy as np
df = df.replace(to_replace=np.nan, value=0.0)
df = df.astype(float)
If retaining np.nan is important, you can replace it with a sentinel value like -999.0 and then, after changing the dtype, replace it again.
If you just wanted to update the values in a single column, i.e. a pd.Series, instead of the entire dataframe, you could try it like this; note that Series.map sends any value missing from the dict to NaN, so this only works if the dict covers every value (and I'm not sure dictionaries allow NaN as a key):
df['Afghanistan'] = df['Afghanistan'].map({np.nan: 0.0})
Can you post a sample dataset to work with?
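One likely reading of the problem: the NaNs live in APPROVAL_YEAR itself, which forces that column to float and turns the pivot's column headers into 2018.0-style values. A minimal sketch on made-up data that drops the NaN years and casts before pivoting (all values here are assumptions):
import numpy as np
import pandas as pd

df_maa_lu = pd.DataFrame({
    'COUNTRY_DISPLAY_LABEL': ['DE', 'DE', 'FR', 'FR'],
    'APPROVAL_YEAR': [2018.0, np.nan, 2019.0, 2018.0],
    'LU_NUMBER_NO_SUFFIX': ['a1', 'a2', 'b1', 'b2']})

# Remove rows with no year, then cast so the headers come out as ints.
clean = df_maa_lu.dropna(subset=['APPROVAL_YEAR']).astype({'APPROVAL_YEAR': int})

submissions_by_country = clean.pivot_table(
    index=["COUNTRY_DISPLAY_LABEL"],
    columns=["APPROVAL_YEAR"],
    values='LU_NUMBER_NO_SUFFIX',
    aggfunc='nunique',
    fill_value=0)
print(submissions_by_country)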
I have a dataframe that looks like this:
And once I run the following code: DF = DF.groupby('CIF').mean() (and fill NaN with zeros)
I get the following dataframe:
Why are two columns 'CYCLE' and 'BALANCE.GEL' disappearing?
Because those columns contain a mix of missing values, numbers, and string representations of numbers, pandas treats them as non-numeric and silently drops them from the mean.
So try converting all columns other than CIF to numbers; because the CIF column becomes the index, you can aggregate by mean per index level:
DF = DF.set_index('CIF').astype(float).mean(level=0)  # pandas >= 2.0: .groupby(level=0).mean()
If the first solution fails, use to_numeric with errors='coerce' to convert non-numbers to NaN:
DF = DF.set_index('CIF').apply(pd.to_numeric, errors='coerce').mean(level=0)
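Putting the coerce version together on made-up data (the column names come from the question; the values are assumptions):
import pandas as pd

# BALANCE.GEL mixes real numbers, string numbers and blanks, which is
# exactly what makes groupby().mean() drop the column.
DF = pd.DataFrame({'CIF': [1, 1, 2],
                   'CYCLE': [10, '20', None],
                   'BALANCE.GEL': ['100.5', 200.0, None]})

DF = (DF.set_index('CIF')
        .apply(pd.to_numeric, errors='coerce')  # strings -> numbers, junk -> NaN
        .groupby(level=0).mean()                # modern spelling of mean(level=0)
        .fillna(0))
print(DF)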
I have a dataframe with multiple values that are zero.
I want to replace the zero values with the mean of that column, without repeating code.
I have columns called runtime, budget, and revenue that all contain zeros, and I want to replace those zero values with the mean of each column.
I have tried to do it one column at a time like this:
print(df['budget'].mean())
# -> 14624286.0643
df['budget'] = df['budget'].replace(0, 14624286.0643)
Is there a way to write a function so I don't have to repeat this code for every column with zero values?
Since this is a pandas dataframe, I would use mask to turn every 0 into np.nan, then fillna:
df = df.mask(df == 0).fillna(df.mean())
We can achieve the same directly with the replace method, without fillna:
df.replace(0, df.mean(axis=0), inplace=True)
Method info:
Replace values given in to_replace with value. Values of the DataFrame are replaced with other values dynamically. This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.
How about iterating through all columns and replacing them?
for col in df.columns:
    val = df[col].mean()
    df[col] = df[col].replace(0, val)
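Since the question asks for a function, here is a minimal sketch that applies the zero-to-mean replacement only to a chosen list of columns (the column names come from the question; the data is made up):
import pandas as pd

def zeros_to_mean(df, cols):
    # Replace 0 with the column's mean in each of the given columns.
    for col in cols:
        # Compute the mean over the nonzero entries so the zeros
        # themselves don't drag the replacement value down.
        mean_val = df.loc[df[col] != 0, col].mean()
        df[col] = df[col].replace(0, mean_val)
    return df

df = pd.DataFrame({'runtime': [0, 90, 110],
                   'budget': [100, 0, 300],
                   'revenue': [10, 20, 0],
                   'title': ['a', 'b', 'c']})
df = zeros_to_mean(df, ['runtime', 'budget', 'revenue'])
print(df)
Note this excludes the zeros when computing the mean; drop the df[col] != 0 filter if you want them counted, which is what the one-liners above do.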
Suppose I have a slice of a column within a dataframe df where I want to replace float values with other float values. However, the replacement values come from another dataframe, newdf.
I've tried using
df.loc[row index condition, [column to replace vals]] = newdf[column]
but for some reason the resulting values are all NaN. Why is this so?
The values from newdf need to align with the index of df. If newdf has the exact number of values you want to insert, you can try using .values:
df.loc[row index condition, [column to replace vals]] = newdf[column].values
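A small sketch of why the NaNs appear and how .values avoids them (all names and data here are made up):
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]})
newdf = pd.DataFrame({'x': [10.0, 20.0]}, index=[100, 101])  # non-matching index

mask = df['x'] > 2.0  # the "row index condition": selects rows 2 and 3

# Index-aligned assignment: newdf's labels (100, 101) don't exist in df,
# so every assigned cell becomes NaN.
df.loc[mask, 'x'] = newdf['x']
print(df)

# Assigning the raw ndarray skips alignment and inserts by position.
df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]})
df.loc[mask, 'x'] = newdf['x'].values
print(df)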