I'm attempting to drop a range of columns in a pandas dataframe that are entirely NaN. I know the following code:
df.dropna(axis=1, how='all', inplace = True)
will search all the columns in the dataframe and drop the ones that are entirely NaN.
However, when I extend this code to a specific range of columns:
df[df.columns[48:179]].dropna(axis=1, how='all', inplace = True)
The result is the original dataframe with no columns removed. I also know for a fact that the selected range contains multiple columns that are all NaN.
Any idea what I might be doing wrong here?
Don't use inplace=True. df[df.columns[48:179]] builds a temporary copy of those columns, so dropping from it never touches df. Instead, find the all-NaN columns in that range and drop them from df itself:
cols = df.columns[48:179]
df = df.drop(columns=cols[df[cols].isna().all()])
inplace=True only works when you call the method on the DataFrame itself; it won't work on a slice of columns, because the slice is a separate object. Try dropna without inplace=True to see the result (e.g. in a Jupyter notebook).
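For reference, here's a minimal, self-contained sketch on made-up data (the column names and the 1:3 range are stand-ins for your 48:179 range):
import numpy as np
import pandas as pd

# toy frame standing in for the real one
df = pd.DataFrame({
    "a": [1, 2],             # outside the range: kept
    "b": [np.nan, np.nan],   # all NaN and inside the range: should be dropped
    "c": [3.0, 4.0],         # inside the range but has data: kept
    "d": [np.nan, np.nan],   # all NaN but outside the range: kept
})

cols = df.columns[1:3]                             # stand-in for df.columns[48:179]
df = df.drop(columns=cols[df[cols].isna().all()])  # drop only the all-NaN ones
print(list(df))                                    # ['a', 'c', 'd']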
I would like to drop all "nan" columns (like the first one on the image) as they don't contain any information. I tried this using
df.dropna(how='all', axis=1, inplace=True)
which unfortunately had no effect. I am afraid that this might be the case because I had to convert my df into a string using
df = df.applymap(str)
This Thread suggests that dropna won't work in such a case, which makes sense to me.
I tried to loop over the columns using:
for i in range(len(list(df))):
    if df.iloc[:, i].str.contains('nan').all():
        df.drop(columns=i, axis=1, inplace=True)
which doesn't seem to work. Any help on how to drop those columns (and rows, since that doesn't work either) is much appreciated.
IIUC, try:
import numpy as np
df = df.replace('nan', np.nan).dropna(how='all', axis=1)
This replaces the string 'nan' with np.nan so dropna can work as expected. Note that replace returns a new frame, so assign the result back; chaining inplace=True onto it would only modify the temporary and leave your df unchanged.
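For example, on a toy frame that has been stringified with applymap(str) like yours (data made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({"keep": [1, 2], "empty": [np.nan, np.nan]}).applymap(str)
print(df.dtypes)   # both columns are object now, and the NaNs became the string 'nan'

df = df.replace('nan', np.nan).dropna(how='all', axis=1)
print(list(df))    # ['keep']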
I am using drop in pandas with an inplace=True set. I am performing this on a duplicate dataframe, but the original dataframe is also being modified.
df1 = df
for col in df1.columns:
    if df1[col].sum() > 1:
        df1.drop(col, inplace=True, axis=1)
This is modifying my 'df' dataframe and I don't understand why.
Use df1 = df.copy(). Otherwise they are the same object in memory.
However, it would be better to generate a new DataFrame directly, e.g.
df1 = df.loc[:, df.sum() <= 1]
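To see the aliasing problem concretely, here's a small sketch on toy data (names and values are made up):
import pandas as pd

df = pd.DataFrame({"a": [1, 1], "b": [0, 0]})

alias = df        # same object: dropping through alias would also change df
copy = df.copy()  # independent object: safe to mutate

copy.drop(columns="a", inplace=True)
print(list(df))    # ['a', 'b'] -- df is untouched
print(list(copy))  # ['b']

# the non-mutating one-liner: keep only columns whose sum is <= 1
df1 = df.loc[:, df.sum() <= 1]
print(list(df1))   # ['b']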
I'm trying to compute the sum of the second column ('ALL_PPA'), grouped by Numéro_département.
Here's my code:
df.fillna(0, inplace=True)
df = df.loc[:, ('Numéro_département','ALL_PPA')]
df = df.groupby('Numéro_département').sum(axis=1)
print(df)
My DF is full of numbers and I don't have any NaN values, but when I apply df.sum(axis=1), some rows appear to have a NaN value.
Here's what my table looks like before sum():
Here's after sum():
My question is: how am I supposed to do this? I've tried using the numpy library, but it doesn't work the way I want it to.
Drop the first row of that dataframe, as it just has the column names in it, and convert the rest to int. Right now the dtype is object because of the mixed data types:
df2 = df.iloc[1:].astype(int).copy()
Then, apply groupby.sum() and specify the column as well:
df3 = df2.groupby('Numéro_département')['ALL_PPA'].sum()
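As a rough sketch on made-up numbers (the department codes and values here are assumptions, not the asker's data):
import pandas as pd

# toy version of the frame after dropping the header row
df2 = pd.DataFrame({
    "Numéro_département": ["01", "01", "02"],
    "ALL_PPA": [10, 5, 7],
})

df3 = df2.groupby("Numéro_département")["ALL_PPA"].sum()
print(df3)   # 01 -> 15, 02 -> 7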
I think using .dropna() before summing the DF will help remove any rows or columns (depending on the axis= you choose) with NaN values. Based on the screenshot provided, drop the first line of the DF, as it is a string.
I'm trying to find the median flow of the entire dataframe. The first part of this is to select only certain items in the dataframe.
There were two problems with this: it included parts of the dataframe that aren't in 'states', and the median was not a single value but was computed per row. How would I get the overall median of all the data in the dataframe?
Two options:
1) A pandas option:
df.stack().median()
2) A numpy option:
np.median(df.values)
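One caveat: the two options treat NaN differently, which matters if any slip through. A small sketch on made-up data:
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, np.nan]})

print(df.stack().median())      # 2.0 -- the NaN is skipped along the way
print(np.median(df.values))     # nan -- plain numpy propagates it
print(np.nanmedian(df.values))  # 2.0 -- the NaN-aware numpy variant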
The DataFrame you pasted is slightly messy due to some spaces, but you're going to want to melt the DataFrame and then call median() on the melted result:
df2 = pd.melt(df, id_vars=['U.S.'])
print(df2['value'].median())
Your DataFrame may be slightly different, but the concept is the same. Check the comment I left above to understand pd.melt(), especially the value_vars and id_vars arguments.
Here is a very detailed way of how I went about cleaning and getting the correct answer:
import pandas as pd

# reading in on clipboard
df = pd.read_clipboard()
# printing it out to see and also the column names
print(df)
print(df.columns)
# melting the DF and then printing the result
df2 = pd.melt(df, id_vars=['U.S.'])
print(df2)
# Creating a new DF so that no nulls are in there for ease of code readability
# using .copy() to avoid the Pandas warning about working on top of a copy
df3 = df2.dropna().copy()
# there were some funky values ('Columbia', 'of') in the 'value' column; replacing them with a numeric placeholder
df3.loc[df3.value.isin(['Columbia', 'of']), 'value'] = 99
# printing out the cleaned version and getting the median
print(df3)
print(df3['value'].median())
I have the following toy code:
import pandas as pd
df = pd.DataFrame()
df["foo"] = [1,2,3,4]
df2 = pd.DataFrame()
df2["bar"]=[4,5,6,7]
df = pd.concat([df, df2], ignore_index=True, axis=1)
print(list(df))
Output: [0,1]
Expected Output: [foo,bar] (order is not important)
Is there any way to concatenate two dataframes without losing the original column headers, if I can guarantee that the headers will be unique?
Iterating through the columns and then adding them to one of the DataFrames comes to mind, but is there a pandas function, or concat parameter that I am unaware of?
Thanks!
As stated in the merge, join, and concatenate documentation, ignore_index=True will discard the original labels and use a range (0...n-1) instead. So you'll get the result you want once you remove the ignore_index argument or set it to False (the default):
df = pd.concat([df, df2], axis=1)
This will join your df and df2 on their indexes (rows with the same index are concatenated side by side; where one dataframe has no row for a given index, the values are filled in as NaN).
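For instance, with mismatched indexes (a made-up sketch):
import pandas as pd

left = pd.DataFrame({"foo": [1, 2]}, index=[0, 1])
right = pd.DataFrame({"bar": [9]}, index=[1])

print(pd.concat([left, right], axis=1))
#    foo  bar
# 0    1  NaN
# 1    2  9.0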
If you have different indexing on your dataframes and still want to concatenate this way, you can either create a temporary shared index and join on that, or set the new dataframe's columns after using concat(..., ignore_index=True).
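Applied to the toy code from the question, the fix is just dropping ignore_index:
import pandas as pd

df = pd.DataFrame({"foo": [1, 2, 3, 4]})
df2 = pd.DataFrame({"bar": [4, 5, 6, 7]})

df = pd.concat([df, df2], axis=1)   # no ignore_index, so the headers survive
print(list(df))                     # ['foo', 'bar']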
I don't think the accepted answer answers the question, which is about column headers, not indexes.
I am facing the same problem, and my workaround is to add the column names after the concatenation:
df.columns = ["foo", "bar"]