I have a DataFrame that looks like this:
Once I run the following code: DF = DF.groupby('CIF').mean() (and fill NaN with zeros)
I get the following DataFrame:
Why do the two columns 'CYCLE' and 'BALANCE.GEL' disappear?
Because those columns contain a mix of missing values, numbers, and string representations of numbers, pandas stores them as object dtype, and mean() silently drops non-numeric columns.
So try converting all columns except CIF to numbers; because CIF is converted to the index, it is possible to aggregate by mean per index level:
DF = DF.set_index('CIF').astype(float).groupby(level=0).mean()
If the first solution fails, use to_numeric with errors='coerce' to convert non-numbers to NaN:
DF = DF.set_index('CIF').apply(pd.to_numeric, errors='coerce').groupby(level=0).mean()
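A runnable sketch of the second approach, using made-up data in the shape the question describes (the column names are from the question; the values are invented):

```python
import pandas as pd

# Invented sample: CYCLE and BALANCE.GEL mix real numbers,
# string representations of numbers, and missing values.
DF = pd.DataFrame({
    'CIF': [1, 1, 2, 2],
    'CYCLE': ['1', 2, None, '4'],
    'BALANCE.GEL': ['10.5', '20.0', 30.0, None],
})

out = (DF.set_index('CIF')
         .apply(pd.to_numeric, errors='coerce')  # strings -> numbers, junk -> NaN
         .groupby(level=0).mean()                # mean per CIF
         .fillna(0))                             # fill NaN with zeros, as in the question
print(out)
```

Both columns survive, because after the conversion they are float dtype and mean() keeps them.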
I have a pandas dataframe with a column named ranking_pos. All the rows of this column look like this: #123 of 12,216.
The output I need is only the number of the ranking, so for this example: 123 (as an integer).
How do I extract the number after the # and get rid of the of 12,216?
Currently the type of the column is object; just converting it to integer with .astype() doesn't work because of the other characters.
You can use .str.extract:
df['ranking_pos'].str.extract(r'#(\d+)', expand=False).astype(int)
or you can use .str.split():
df['ranking_pos'].str.split(' of ').str[0].str.replace('#', '').astype(int)
df.loc[:, "ranking_pos"] = df.loc[:, "ranking_pos"].str.replace("#", "").str.split(" of ").str[0].astype(int)
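For reference, a small end-to-end sketch of the .str.extract approach on invented sample values:

```python
import pandas as pd

# Invented sample in the question's "#123 of 12,216" format.
df = pd.DataFrame({'ranking_pos': ['#123 of 12,216', '#4 of 12,216']})

# expand=False returns a Series instead of a one-column DataFrame,
# so the result can be assigned straight back to the column.
df['ranking_pos'] = df['ranking_pos'].str.extract(r'#(\d+)', expand=False).astype(int)
print(df['ranking_pos'].tolist())
```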
This code filters all the columns in a Dask dataframe where the column type is int or float, and then fills with zero if there's a NaN:
df_dask = df_dask.select_dtypes(include=['int64', 'float64'])
df_dask = df_dask.where(df_dask.notnull(), 0)
print(df_dask.compute())
The problem is that the original dataframe has string columns that I need to keep in the final dataframe, but they are dropped by the first filter.
How can I keep all the columns and set zero only where the column is numeric and the value is NaN?
Why not just use the standard fillna method on the selected columns?
Something like:
select_cols = df_dask.select_dtypes(include=['int64', 'float64']).columns
for c in select_cols:
    df_dask[c] = df_dask[c].fillna(0)
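Since Dask DataFrames mirror the pandas API here, the loop can be sketched with plain pandas on invented sample data; the identical code should work on df_dask, with .compute() at the end:

```python
import numpy as np
import pandas as pd

# Invented sample: one string column to keep, two numeric columns with NaN.
df = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'amount': [1.0, np.nan, 3.0],
    'count': [10.0, 20.0, np.nan],
})

# Only the numeric columns are touched; 'name' is left as-is.
select_cols = df.select_dtypes(include=['int64', 'float64']).columns
for c in select_cols:
    df[c] = df[c].fillna(0)
print(df)
```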
I have a pandas DataFrame with data that looks like this:
With the data extending beyond what you can see here. I can't tell whether the blue cells hold numeric or string data; they should be numeric, since I produced those values by multiplication, but I don't know pandas well enough to be sure.
Anyway, I call .max(axis=1) on this dataframe, and it gives me this:
As far as I know, there are no empty cells or cells with weird data. So why am I getting all NaN?
First convert all values to numeric with DataFrame.astype:
df = df.astype(float)
If that does not work, use to_numeric with errors='coerce' to turn non-numeric values into NaN:
df = df.apply(pd.to_numeric, errors='coerce')
And then compute the max:
print(df.max(axis=1))
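A minimal sketch of the coerce-then-max pattern, with invented string-typed numbers:

```python
import pandas as pd

# Invented frame where every value is a string, so a row-wise numeric max
# is impossible until the columns are converted.
df = pd.DataFrame({'a': ['1', '5'], 'b': ['3', '2']})

df = df.apply(pd.to_numeric, errors='coerce')  # all columns become float
print(df.max(axis=1).tolist())
```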
I have a DataFrame that has columns with numbers, but these numbers are represented as strings. I want to find these columns automatically, without specifying which columns should be numeric. How can I do this in pandas?
You can use str.contains on the column names:
>>> df.columns[df.columns.str.contains('.*[0-9].*', regex=True)]
The regex can be modified to accommodate a wide range of patterns you want to search for. Note that this matches digits in the column names, not in the values.
You can first parse with pd.to_numeric and then combine_first with the original column:
df['COL_NAME'] = pd.to_numeric(df['COL_NAME'],errors='coerce').combine_first(df['COL_NAME'])
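A small sketch of this pattern on an invented column: values that parse become floats, everything else keeps its original string:

```python
import pandas as pd

# Invented column mixing numeric strings with genuine text.
df = pd.DataFrame({'COL_NAME': ['1', 'abc', '3.5']})

# Parsed numbers win; unparseable entries fall back to the original string.
df['COL_NAME'] = (pd.to_numeric(df['COL_NAME'], errors='coerce')
                    .combine_first(df['COL_NAME']))
print(df['COL_NAME'].tolist())
```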
Here is my code:
df.head(20)
df = df.fillna(df.mean()).head(20)
Below is the result:
There are many NaN.
I want to replace NaN with the average value; I used df.fillna(df.mean()), but it had no effect.
What's the problem?
I've got it! Before replacing the NaN values, I need to reset the index first.
Below is code:
df = df.reset_index()
df = df.fillna(df.mean())
Now everything is okay!
This worked for me:
for i in df.select_dtypes(include='number').columns:
    df[i] = df[i].fillna(df[i].mean())
Each column in your DataFrame has at least one non-numeric value (in the rows #0 and partially #1). When you apply .mean() to a DataFrame, it skips all non-numeric columns (in your case, all columns). Thus, the NaNs are not replaced. Solution: drop the non-numeric rows.
I think the problem may be that your columns are not of float or int type. Check with df.dtypes: if it returns object, mean() won't work. Change the type using df.astype().
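A sketch tying the two fixes together on invented data: convert the object column to numeric first, then fill the gaps with the mean:

```python
import pandas as pd

# Invented object-dtype column holding numbers as strings plus a gap.
df = pd.DataFrame({'score': ['1.0', None, '3.0']}, dtype=object)

df['score'] = pd.to_numeric(df['score'], errors='coerce')  # object -> float
df = df.fillna(df.mean(numeric_only=True))                 # NaN -> column mean
print(df['score'].tolist())
```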