How to not lose rows with NaN when stack/unstack? - python

I have a set of data running from 1945 to 2020, for a series of materials produced in two countries. To create a single dataframe I concat the different DataFrames:
df = pd.concat([ProdCountry1['Producta'], ProdCountry2['Producta'], ProdCountry1['Productb'], ProdCountry2['Productb'], ...], ...)
with axis=1, plus the keys, names, etc.
I get this kind of table:
Then I stack this dataframe to get rid of the NaNs in the row index (years), but then I lose the years 1946/1948/1949, which contain only NaNs.
df = df.stack()
Here is the kind of df I get when I unstack it:
So, my question is: how can I avoid losing the all-NaN years in my df? I need them to interpolate and work with later in my notebook.
Thanks in advance for your help.

There is a dropna parameter on the stack method; pass it as False. The signature (with its defaults) is:
DataFrame.stack(level=-1, dropna=True)
Cf. the documentation for pandas.DataFrame.stack.
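For illustration, a minimal sketch (column names and values invented) showing that an all-NaN year survives the stack/unstack round trip when dropna=False is passed:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Producta': [1.0, np.nan, 3.0], 'Productb': [2.0, np.nan, np.nan]}, index=[1945, 1946, 1947])
stacked = df.stack(dropna=False)  # the all-NaN year 1946 is kept in the stacked index
print(stacked.unstack())          # 1946 reappears as an all-NaN row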

Let us try dropna with how='all', which removes only the rows whose values are all NaN:
df = df.dropna(how='all')
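A tiny sketch of what how='all' does here (toy data): only rows where every value is NaN are removed, while partially filled years survive:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Producta': [1.0, np.nan, np.nan], 'Productb': [2.0, np.nan, 4.0]}, index=[1945, 1946, 1947])
print(df.dropna(how='all'))  # 1946 (all NaN) is dropped; the partially filled 1947 is kept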

Related

Is there a way to return a pandas dataframe with a modified column?

Say I have a dataframe df with column "age".
Say "age" has some NaN values and I want to create two new dataframes, dfMean and dfMedian, which fill in the NaN values differently.
This is the way I would do it:
# Step 1:
dfMean = df
dfMean["age"].fillna(df["age"].mean(), inplace=True)
# Step 2:
dfMedian = df
dfMedian["age"].fillna(df["age"].median(), inplace=True)
I'm curious whether there's a way to do each of these steps in one line instead of two, by returning the modified dataframe without needing to copy the original. But I haven't been able to find anything so far. Thanks, and let me know if I can clarify or if you have a better title in mind for the question :)
With dfMean = dfMean["age"].fillna(df["age"].mean()) you would create a Series, not a DataFrame.
To add two new Series (=columns) to your DataFrame, use:
df2 = df.assign(
    age_fill_mean=df["age"].fillna(df["age"].mean()),
    age_fill_median=df["age"].fillna(df["age"].median()),
)
Alternatively, you can use pandas.DataFrame.agg(), which will "aggregate using one or more operations over the specified axis":
df.agg({'age': ['mean', 'median']})
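As a small sketch of what that agg call returns (toy values):
import pandas as pd
df = pd.DataFrame({'age': [20, 30, None, 50]})
print(df.agg({'age': ['mean', 'median']}))
# NaN is skipped: mean of [20, 30, 50] is ~33.33, median is 30; the result is a small DataFrame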
You need to define the two new DataFrames separately, using DataFrame.fillna with a dictionary to specify the column names whose missing values should be replaced:
dfMean = df.fillna({'age': df["age"].mean()})
dfMedian = df.fillna({'age': df["age"].median()})
As one line:
dfMean, dfMedian = df.fillna({'age': df["age"].mean()}), df.fillna({'age': df["age"].median()})
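A quick, self-contained check of the dictionary approach (toy data, column names invented):
import pandas as pd
df = pd.DataFrame({'age': [20, 30, None, 50], 'name': list('abcd')})
dfMean, dfMedian = df.fillna({'age': df['age'].mean()}), df.fillna({'age': df['age'].median()})
print(dfMean['age'].tolist())    # [20.0, 30.0, 33.33..., 50.0]
print(dfMedian['age'].tolist())  # [20.0, 30.0, 30.0, 50.0]
print(df['age'].isna().sum())    # 1 -- the original df is left untouched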

Dropping "nan" string columns in a pandas dataframe

I would like to drop all "nan" columns (like the first one in the image), as they don't contain any information. I tried this using
df.dropna(how='all', axis=1, inplace=True)
which unfortunately had no effect. I am afraid this might be because I had to convert my df to strings using
df = df.applymap(str)
This thread suggests that dropna won't work in such a case, which makes sense to me.
I tried to loop over the columns using:
for i in range(len(list(df))):
    if df.iloc[:, i].str.contains('nan').all():
        df.drop(columns=i, axis=1, inplace=True)
which doesn't seem to work either. Any help on how to drop those columns (and rows, as that also doesn't work) is much appreciated.
IIUC, try:
df = df.replace('nan', np.nan).dropna(how='all', axis=1)
This replaces the string 'nan' with a real np.nan (it needs import numpy as np), allowing dropna to function as expected. Note that the result is assigned back rather than passed inplace=True: with method chaining, inplace would only modify the intermediate object, not df.
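A self-contained check of that approach (column names invented):
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': ['nan', 'nan'], 'b': ['1', 'nan']})
cleaned = df.replace('nan', np.nan).dropna(how='all', axis=1)
print(cleaned.columns.tolist())  # ['b'] -- the all-'nan' column is gone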

How to use pivot_table or unstack to create new data frame

I am very new to Python, so please do help me. I have a data frame as below:
and I want to unstack/pivot it into the following:
I tried different methods, but I am not getting the desired result.
I first tried groupby with only 2 values as a trial, but this doesn't do what I want and it fills with NaN:
newdf = df.groupby(['ACCIDENT_NO', 'SEX'])['Age_Group'].value_counts().unstack()
I also tried pivot_table:
new_1 = new_df.pivot_table(index=['ACCIDENT_NO'], columns= [ 'SEX','Age_Group'], aggfunc=len, fill_value=0)
This at least fills in the zeros, but here SEX becomes the top-level column and the age groups are subdivided under it. I don't care about any subdivisions; I just want to unstack the categories as different columns, as shown in the desired image above.
Try this one-liner using get_dummies and groupby:
df = (pd.get_dummies(df, columns=df.columns[1:], prefix='', prefix_sep='')
        .groupby('ACCIDENT_NO', as_index=False)
        .sum())
And now printing df would return the desired result.
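A toy reproduction (accident numbers and categories invented):
import pandas as pd
df = pd.DataFrame({'ACCIDENT_NO': ['A1', 'A1', 'A2'],
                   'SEX': ['M', 'F', 'M'],
                   'Age_Group': ['0-17', '18-64', '18-64']})
out = (pd.get_dummies(df, columns=df.columns[1:], prefix='', prefix_sep='')
         .groupby('ACCIDENT_NO', as_index=False)
         .sum())
print(out)  # one row per accident, one 0/1-count column per SEX and Age_Group value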

Dataframe sum(axis=1) is returning NaN values

I'm trying to compute the sum of the second column ('ALL_PPA'), grouping by Numéro_département.
Here's my code :
df.fillna(0,inplace=True)
df = df.loc[:, ('Numéro_département','ALL_PPA')]
df = df.groupby('Numéro_département').sum(axis=1)
print(df)
My DF is full of numbers and I don't have any NaN values, but when I apply df.sum(axis=1), some rows come out with a NaN value.
Here's how my table looks before sum():
Here's after sum():
My question is: how am I supposed to do this? I've tried the numpy library, but it doesn't work the way I want it to.
Drop the first row of that dataframe, as it just has the column names in it, and convert the frame to int. Right now it has object dtype because of the mixed data types:
df2 = df.iloc[1:].astype(int).copy()
Then apply groupby.sum(), specifying the column as well:
df3 = df2.groupby('Numéro_département')['ALL_PPA'].sum()
I think using .dropna() before summing the DF will help remove any rows or columns (depending on the axis you choose) with NaN values. As the screenshot shows, the first line of the DF should also be dropped, as it is a string.
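Putting the first answer's steps together on toy data (values invented; pd.to_numeric is shown as a slightly safer cast for the value column):
import pandas as pd
df = pd.DataFrame({'Numéro_département': ['Numéro_département', '01', '01', '02'],
                   'ALL_PPA': ['ALL_PPA', '10', '20', '30']})
df2 = df.iloc[1:].copy()                        # drop the stray header row
df2['ALL_PPA'] = pd.to_numeric(df2['ALL_PPA'])  # object -> numeric
print(df2.groupby('Numéro_département')['ALL_PPA'].sum())  # 01 -> 30, 02 -> 30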

Possible optimization of going through pandas dataframe

I'm trying to find a way to optimize looping through a pandas dataframe. The dataset contains ~450k rows with ~20 columns. The dataframe has 3 locational variables as a multiindex, and I want to drop the rows of any group in which some column is entirely NaN; otherwise, I fill the NaNs with the mean of the group.
LOC = ['market_id', 'midmarket_id', 'submarket_id']
# Assign -1000 to NaN values in the multiindex columns
df = df.fillna({c: -1000 for c in LOC})
df = df.set_index(LOC).sort_index(level=[i for i in range(len(LOC))])
# Loop through each subset with the same (market, midmarket, submarket)
for k, v in df.copy().groupby(level=[i for i in range(len(LOC))]):
    # If any column is entirely NaN within the group, drop the group from df
    if v.isnull().all().any():
        df = df.drop(v.index.values)
    # Otherwise every column has at least one non-NaN value: fill with the group mean
    else:
        df.loc[v.index.values] = df.loc[v.index.values].fillna(v.mean())
So, given a dataframe like the one in the "before" image, it should be converted into the "after" image, removing the rows of groups that contain all-NaN columns.
I apologize if this is redundant or not in accordance with the Stack Overflow question guidelines, but if anyone has a better solution, I would greatly appreciate it.
Thanks in advance.
There's no need to copy your entire dataframe, nor to iterate over the GroupBy elements manually. Here's an alternative solution:
import numpy as np

LOC = ['market_id', 'midmarket_id', 'submarket_id']
# Assign -1000 to NaN values in the index columns only
df[LOC] = df[LOC].fillna(-1000)
# Keep only the columns containing at least one non-null
non_nulls = np.where(df.notnull().any())[0]
df = df.iloc[:, non_nulls]
# Fill each remaining column with its respective groupwise mean
g = df.groupby(LOC)
for col in df.columns.difference(LOC):
    df[col] = df[col].fillna(g[col].transform('mean'))
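A small sketch of the groupwise-mean fill on toy data (identifiers and the price column invented):
import numpy as np
import pandas as pd
LOC = ['market_id', 'midmarket_id', 'submarket_id']
df = pd.DataFrame({'market_id': [1, 1, 2, 2],
                   'midmarket_id': [1, 1, 1, 1],
                   'submarket_id': [1, 1, 1, 1],
                   'price': [10.0, np.nan, np.nan, np.nan]})
g = df.groupby(LOC)
for col in df.columns.difference(LOC):
    df[col] = df[col].fillna(g[col].transform('mean'))
print(df['price'].tolist())  # [10.0, 10.0, nan, nan]
# group (1,1,1) is filled with its mean; group (2,1,1) has nothing to average, so it stays NaN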
