Dropping "nan" string columns in a pandas dataframe - python

I would like to drop all "nan" columns (like the first one in the image) as they don't contain any information. I tried this using
df.dropna(how='all', axis=1, inplace=True)
which unfortunately had no effect. I am afraid that this might be the case because I had to convert my df into a string using
df = df.applymap(str)
This Thread suggests that dropna won't work in such a case, which makes sense to me.
I tried to loop over the columns using:
for i in range(len(list(df))):
    if df.iloc[:, i].str.contains('nan').all():
        df.drop(columns=i, axis=1, inplace=True)
which doesn't seem to work. Any help on how to drop those columns (and rows, as that also doesn't work) is much appreciated.

IIUC, try:
df = df.replace('nan', np.nan).dropna(how='all', axis=1)
This replaces the string 'nan' with np.nan, allowing dropna to work as expected. Note that the result is assigned back: chaining inplace=True onto replace() would only modify a temporary copy.
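A minimal sketch of that fix on a toy frame (the column names here are invented):

```python
import numpy as np
import pandas as pd

# after applymap(str), missing values have become the literal text 'nan'
df = pd.DataFrame({'empty': ['nan', 'nan'], 'data': ['1', '2']})

# turn the text 'nan' back into real np.nan, then drop all-NaN columns;
# assigning the result back avoids the inplace-on-a-temporary pitfall
df = df.replace('nan', np.nan).dropna(how='all', axis=1)
print(list(df.columns))  # ['data']
```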

Related

Python Pandas drop a specific range of columns that are all NaN

I'm attempting to drop a range of columns in a pandas dataframe that have all NaN. I know the following code:
df.dropna(axis=1, how='all', inplace = True)
Will search all the columns in the dataframe and drop the ones that have all NaN.
However, when I extend this code to a specific range of columns:
df[df.columns[48:179]].dropna(axis=1, how='all', inplace = True)
The result is the original dataframe with no columns removed. I also know for a fact that the selected range has multiple columns that are all NaNs.
Any idea what I might be doing wrong here?
Don't use inplace=True. Instead do this:
cols = df.columns[48:179]
df = df.drop(columns=cols[df[cols].isna().all()])
inplace=True only takes effect when you apply dropna to the whole dataframe; on a slice of columns it operates on a temporary copy, so nothing sticks. Try dropna without inplace=True first to inspect the result (e.g. in a Jupyter notebook).
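A small sketch of restricting the all-NaN check to a slice of columns (a one-column slice stands in for 48:179; the data is invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [np.nan, np.nan], 'c': [3, 4]})

# check only a slice of columns, then drop the ones that are entirely NaN
cols = df.columns[1:2]                          # stand-in for df.columns[48:179]
df = df.drop(columns=cols[df[cols].isna().all()])
print(list(df.columns))  # ['a', 'c']
```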

How to not lose rows with NaN when stack/unstack?

I have a set of data running from 1945 to 2020, for a series of materials produced in two countries. To create a dataframe I concat different dfs.
df = pd.concat([ProdCountry1['Producta'], ProdCountry2['Producta'], ProdCountry1['Productb'], ProdCountry2['Productb'], ...], ...)
With axis=1, the keys and names, etc.
I get this kind of table:
Then I stack this dataframe to get the NaNs out of the row index (years), but then I lose the years 1946/1948/1949, which contain only NaNs.
df = df.stack()
Here is the kind of df I get when I unstack it:
So, my question is: how can I avoid losing the years with all-NaN rows in my df? I need them to interpolate and work with later in my notebook.
Thanks in advance for your help.
There is a dropna parameter on the stack method; pass it as False:
DataFrame.stack(level=-1, dropna=True)
Cf. the documentation for pandas.DataFrame.stack.
Let us try dropna
df = df.dropna(how='all')
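A sketch of the difference on a toy frame (the data is invented; note that the dropna argument is deprecated in recent pandas, where the new stack implementation keeps all-NaN rows by default):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Producta': [1.0, np.nan], 'Productb': [2.0, np.nan]},
                  index=[1945, 1946])

try:
    stacked = df.stack(dropna=False)   # legacy API: keep the all-NaN year
except (TypeError, ValueError):
    stacked = df.stack()               # newer pandas keeps NaNs by default

print(1946 in stacked.index.get_level_values(0))  # True: 1946 survives
```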

Dataframe sum(axis=1) is returning Nan Values

I'm trying to compute the sum of the second column ('ALL_PPA'), grouped by Numéro_département.
Here's my code :
df.fillna(0,inplace=True)
df = df.loc[:, ('Numéro_département','ALL_PPA')]
df = df.groupby('Numéro_département').sum(axis=1)
print(df)
My DF is full of numbers and I don't have any NaN values, but when I apply df.sum(axis=1), some rows come out with a NaN value.
Here's what my table looks like before sum():
Here's after sum()
My question is : How am I supposed to do this? I've try to use numpy library but, it doesn't work as I want it to work
Drop the first row of that dataframe, as it just has the column names in it, and convert the rest to int. Right now the dtype is object because of the mixed data types:
df2 = df.iloc[1:].astype(int).copy()
Then, apply groupby.sum() and specify the column as well:
df3 = df2.groupby('Numéro_département')['ALL_PPA'].sum()
I think using .dropna() before summing the DF will help remove any rows or columns (depending on the axis you choose) with NaN values. Per the screenshot provided, also drop the first line of the DF, as it is a string.
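A runnable sketch of that answer, assuming toy data where the header leaked into the first row (the values are invented; the column names come from the question):

```python
import pandas as pd

# mimic a read where the header row ended up as data, making every dtype object
df = pd.DataFrame({'Numéro_département': ['Numéro_département', '01', '01', '02'],
                   'ALL_PPA': ['ALL_PPA', '10', '20', '30']})

df2 = df.iloc[1:].copy()                     # drop the stray header row
df2['ALL_PPA'] = df2['ALL_PPA'].astype(int)  # numeric strings -> int so sum() works
out = df2.groupby('Numéro_département')['ALL_PPA'].sum()
print(out.to_dict())  # {'01': 30, '02': 30}
```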

The code "df.dropna" in python erases my entire data frame, what is wrong with my code?

I want to drop all NaN values in one of my columns, but when I use df.dropna(axis=0, inplace=True) it erases my entire dataframe. Why is this happening?
I've used both df.dropna and df.dropna(axis=0, inplace=True), and neither removes the NaNs without wiping everything else.
I'm binning my data so I can run a Gaussian model, but I can't do that with NaN values; I want to remove them and still have my dataframe to run the model.
Before and after
Not sure about your case, but sharing the solution that worked on my case:
The ones that didn't work:
df = df.dropna()                      # ==> makes the df empty
df = df.dropna(axis=0, inplace=True)  # ==> makes the df empty (and binds df to None, since inplace returns None)
df.dropna(axis=0, inplace=True)       # ==> makes the df empty
The one that worked:
df.dropna(how='all', axis=0, inplace=True)  # ==> worked very well
Thanks to Anky above for his comment.
By default, dropna uses how='any', which deletes every row that has any NaN; that is why your entire dataframe disappeared. how='all', as you found out, deletes only rows in which all columns are NaN:
df.dropna(how='all', inplace=True)
or, more basic:
newDF = df.dropna(how='all')
For anyone in the future: try changing axis=0 to axis=1 to drop all-NaN columns instead of rows:
df.dropna(axis=1, how='all')
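The how='any' vs how='all' difference, shown on one invented toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                   'b': [np.nan, 2.0, np.nan]})

print(len(df.dropna()))           # 0: the default how='any' drops every row here
print(len(df.dropna(how='all')))  # 2: only the all-NaN last row is dropped
```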

Finding median of entire pandas Data frame

I'm trying to find the median flow of the entire dataframe. The first part of this is to select only certain items in the dataframe.
There were two problems with this: it included parts of the dataframe that aren't in 'states', and the median was not a single value, it was computed per row. How would I get the overall median of all the data in the dataframe?
Two options:
1) A pandas option:
df.stack().median()
2) A numpy option:
np.median(df.values)
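Both options on an invented toy frame. One caveat worth noting: np.median returns nan if any NaN is present, so np.nanmedian is the safer numpy spelling:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, np.nan]})

print(df.stack().median())      # 2.0: stack flattens, median skips the NaN
print(np.nanmedian(df.values))  # 2.0: np.median(df.values) would give nan here
```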
The DataFrame you pasted is slightly messy due to some spaces, but you're going to want to melt the dataframe and then use median() on the new melted dataframe:
df2 = pd.melt(df, id_vars =['U.S.'])
print(df2['value'].median())
Your dataframe may be slightly different, but the concept is the same. Check the comments I left above to understand pd.melt(), especially the value_vars and id_vars arguments.
Here is a very detailed way of how I went about cleaning and getting the correct answer:
# reading in on clipboard
df = pd.read_clipboard()
# printing it out to see and also the column names
print(df)
print(df.columns)
# melting the DF and then printing the result
df2 = pd.melt(df, id_vars =['U.S.'])
print(df2)
# Creating a new DF so that no nulls are in there for ease of code readability
# using .copy() to avoid the Pandas warning about working on top of a copy
df3 = df2.dropna().copy()
# there were some funky values in the 'value' column. Just getting rid of those
df3.loc[df3.value.isin(['Columbia', 'of']), 'value'] = 99
# printing out the cleaned version and getting the median
print(df3)
print(df3['value'].median())
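A condensed sketch of the melt approach (the frame below is invented; only the 'U.S.' id column comes from the answer):

```python
import pandas as pd

df = pd.DataFrame({'U.S.': ['Alabama', 'Alaska'],
                   '1990': [10, 30], '1991': [20, 40]})

# melt folds the year columns into one long 'value' column
df2 = pd.melt(df, id_vars=['U.S.'])
print(df2['value'].median())  # median of 10, 30, 20, 40 -> 25.0
```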
