How to drop columns and rows with missing values? - python

I've been trying to take a pandas.DataFrame and drop its rows and columns with missing values simultaneously. When I tried to use dropna on both axes at once, I found out that this is no longer supported. So then I tried, using dropna, to drop the columns and then the rows (and vice versa), and obviously the results come out different, as the second drop no longer reflects the initial state of the data.
So to give an example I receive:
pandas.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
"toy": [numpy.nan, 'Batmobile', 'Bullwhip'],
"weapon": [numpy.nan, 'Boomerang', 'Gun']})
and return:
pandas.DataFrame({"name": ['Batman', 'Catwoman']})
Any help will be appreciated.

Test for non-missing values with DataFrame.notna, check whether all values per row and per column are True with DataFrame.all, and select both at once with DataFrame.loc:
m = df.notna()
df0 = df.loc[m.all(axis=1), m.all(axis=0)]
print(df0)
name
1 Batman
2 Catwoman
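For a self-contained run, here is a minimal sketch that combines the frame from the question with the mask above (nothing new, just both pieces together):

import numpy
import pandas

df = pandas.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                       "toy": [numpy.nan, 'Batmobile', 'Bullwhip'],
                       "weapon": [numpy.nan, 'Boomerang', 'Gun']})

m = df.notna()                              # boolean mask of non-missing cells
df0 = df.loc[m.all(axis=1), m.all(axis=0)]  # keep complete rows and complete columns at once
print(df0)  # -> only the 'name' column and the Batman/Catwoman rows remain

Because both selections are computed from the same mask of the original frame, neither drop is affected by the other.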

Related

Exploding two columns in pandas: ValueError

I have tried to explode two columns as follows
df2=df[['Name','Surname','Properties','Score']].copy()
df2 = df2.reset_index()
df2.set_index('Name').apply(pd.Series.explode).reset_index()
but I have got the error: ValueError: cannot reindex from a duplicate axis
My dataset looks like
Name  Surname      Properties                   Score
A.    McLarry      ['prop1','prop2']            [1,2]
G.    Livingstone  []                           []
S.    Silver       ['prop5','prop3','prop2']    [2,55,2]
...
I would like to explode both Properties and Score. If you can tell me what I am doing wrong, it would be great!
Try using pd.Series.explode as an apply function, after setting ALL the other columns as the index. After that, you can reset_index to get the columns back:
df.set_index(['Name','Surname']).apply(pd.Series.explode).reset_index()
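As a hedged, self-contained sketch built from the sample rows above (it assumes Properties and Score always have the same number of elements per row, which is what makes the exploded columns line up):

import pandas as pd

df = pd.DataFrame({
    'Name': ['A.', 'G.', 'S.'],
    'Surname': ['McLarry', 'Livingstone', 'Silver'],
    'Properties': [['prop1', 'prop2'], [], ['prop5', 'prop3', 'prop2']],
    'Score': [[1, 2], [], [2, 55, 2]],
})

# every non-list column goes into the index, so only the list columns get exploded
out = df.set_index(['Name', 'Surname']).apply(pd.Series.explode).reset_index()
print(out)

Rows with empty lists come back as a single row of NaN, which you can drop afterwards if needed.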

Remove duplicates in pandas. copy() and drop_duplicates() is removing rows that appear only once

As the question states, I am trying to get rid of duplicate rows in a df with two series/columns, df[['Offering Family', 'Major Offering']].
I hope to merge the resulting df with another one I have based on the Major Offering column, so only the Offering Family column will be brought across to the new df. I should note that I only want to get rid of rows whose values are repeated in both columns. If a value appears more than once in the Offering Family column but the value in the Major Offering column is different, it should not be deleted. However, when I run the code below, I'm finding that I'm losing those sorts of rows. Can anybody help?
df = pd.read_excel(pipelineEx, sheet_name='Data')
dfMO = df[['Offering Family', 'Major Offering']].copy()
dfMO.filter(['Offering Family', 'Major Offering'])
dfMO = df.drop_duplicates(subset=None, keep="first", inplace=False)
#dfMO.drop_duplicates(keep=False,inplace=True)
print(dfMO)
dfMO.to_excel("Major Offering.xlsx")
I have updated your code; as Aditya Chhabra mentioned, you are creating a copy and not using it.
df = pd.read_excel(pipelineEx, sheet_name='Data')
dfMO = df[['Offering Family', 'Major Offering']].copy()
dfMO.drop_duplicates(inplace=True)
print(dfMO)
dfMO.to_excel("Major Offering.xlsx")
Well, there are a few things that are odd with the code you've shared.
Primarily, you created dfMO as a copy of df with only the two columns. But then you're applying the drop_duplicates() function to df, the original dataframe, and overwriting the dfMO you created.
From what I understand, what you need is the dataframe to retain all unique combinations that could be made from values in the two columns. groupby() would be better suited for your purposes.
Try this:
cols = ['Offering Family', 'Major Offering']
dfMO = df[cols].groupby(cols).count().reset_index()
reset_index() will return a copy, by default, so no additional keyword arguments are necessary.
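As a quick illustration with made-up offering names, the groupby approach keeps the ('Networking', 'Routers') pair even though 'Networking' also appears with 'Switches':

import pandas as pd

df = pd.DataFrame({'Offering Family': ['Networking', 'Networking', 'Networking', 'Storage'],
                   'Major Offering': ['Switches', 'Switches', 'Routers', 'Arrays']})

cols = ['Offering Family', 'Major Offering']
dfMO = df[cols].groupby(cols).count().reset_index()
print(dfMO)
#   Offering Family Major Offering
# 0      Networking        Routers
# 1      Networking       Switches
# 2         Storage         Arrays

df[cols].drop_duplicates() on the copy gives the same unique pairs; the key in both cases is to operate on the two-column frame rather than on the full df.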

How to not lose rows with NaN when stack/unstack?

I have a set of data running from 1945 to 2020, for a series of materials produced in two countries. To create a dataframe I concatenate different dfs:
df = pd.concat([ProdCountry1['Producta'], ProdCountry2['Producta'], ProdCountry1['Productb'], ProdCountry2['Productb'], ...] ...)
With axis=1, the keys and names, etc.
I get this kind of table:
Then I stack this dataframe to get rid of the NaNs, but then I lose the years 1946/1948/1949, whose rows contain only NaNs.
df = df.stack()
Here is the kind of df I get when I unstack it:
So, my question is: how can I avoid losing the years with NaN-only rows in my df? I need them to interpolate and work with later in my notebook.
Thanks in advance for your help.
There is a dropna parameter on the stack method; pass it as False:
DataFrame.stack(level=-1, dropna=True)
See the documentation for pandas.DataFrame.stack.
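For instance, a tiny sketch with made-up production figures, where 1946 contains only NaNs (note that newer pandas versions may warn about the dropna argument on stack):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Producta': [1.0, np.nan, 3.0],
                   'Productb': [2.0, np.nan, np.nan]},
                  index=[1945, 1946, 1947])

stacked = df.stack(dropna=False)   # the all-NaN year 1946 is kept
restored = stacked.unstack()       # 1946 is still present after unstacking
print(restored)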
Let us try dropna
df = df.dropna(how='all')

Pandas. Selecting rows with missing values in multiple columns

Suppose we have a dataframe with the columns 'Race', 'Age', 'Name'. I want to create two DFs:
1) Without missing values in columns 'Race' and 'Age'
2) Only with missing values in columns 'Race' and 'Age'
I wrote the following code
first_df = df[df[columns].notnull()]
second_df= df[df[columns].isnull()]
However this code does not work. I solved the problem using this code:
first_df = df[df['Race'].notnull() & df['Age'].notnull()]
second_df = df[df['Race'].isnull() & df['Age'].isnull()]
But what if there are 10 columns? Is there a way to write this code without logical operators, using only a list of columns?
Selecting multiple columns gives a boolean DataFrame, so you need to test whether all values per row are True with DataFrame.all, or whether at least one value per row is True with DataFrame.any:
first_df = df[df[columns].notnull().all(axis=1)]
second_df = df[df[columns].isnull().all(axis=1)]
You can also use ~ to invert the mask:
mask = df[columns].notnull().all(axis=1)
first_df = df[mask]
second_df = df[~mask]
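A small worked example with a made-up frame, showing what each selection returns:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Race': ['Human', np.nan, np.nan],
                   'Age': [30, np.nan, 25],
                   'Name': ['Ann', 'Bob', 'Cid']})
columns = ['Race', 'Age']

mask = df[columns].notnull().all(axis=1)
first_df = df[mask]    # Ann: no missing values in Race or Age
second_df = df[~mask]  # Bob and Cid: at least one missing value in Race or Age

Note that ~mask selects rows with at least one missing value; use df[columns].isnull().all(axis=1) if you want only the rows where both columns are missing.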
Step 1: Make a new dataframe by dropping the missing data (NaN, pd.NaT, None), so that incomplete rows are filtered out.
DataFrame.dropna drops all rows containing at least one field with missing data.
Call the new df DF_updated and the earlier one DF_Original.
Step 2: Now our solution DF will be the difference between the two DFs. It can be found by:
pd.concat([DF_Original,DF_updated]).drop_duplicates(keep=False)
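Put together as a sketch, using the DF_Original / DF_updated names from the steps above (and assuming DF_Original has no fully duplicated rows, since drop_duplicates(keep=False) would remove those as well):

import numpy as np
import pandas as pd

DF_Original = pd.DataFrame({'Race': ['Human', np.nan],
                            'Age': [30, np.nan],
                            'Name': ['Ann', 'Bob']})

DF_updated = DF_Original.dropna()   # Step 1: keep only complete rows (Ann)

# Step 2: the difference, i.e. rows that contain missing values (Bob)
second_df = pd.concat([DF_Original, DF_updated]).drop_duplicates(keep=False)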

DataFrame sum(axis=1) is returning NaN values

I'm trying to make a sum of the second column ('ALL_PPA'), grouping by 'Numéro_département'.
Here's my code:
df.fillna(0,inplace=True)
df = df.loc[:, ('Numéro_département','ALL_PPA')]
df = df.groupby('Numéro_département').sum(axis=1)
print(df)
My DF is full of numbers and I don't have any NaN values, but when I apply df.sum(axis=1), some rows appear to have a NaN value.
Here's how my table looks before sum():
Here's after sum():
My question is : How am I supposed to do this? I've try to use numpy library but, it doesn't work as I want it to work
Drop the first row of that dataframe, as it just has the column names in it, and convert the frame to int. Right now, it is an object dtype because of the mixed data types:
df2 = df.iloc[1:].astype(int).copy()
Then, apply groupby.sum() and specify the column as well:
df3 = df2.groupby('Numéro_département')['ALL_PPA'].sum()
I think using .dropna() before summing the DF will help remove any rows or columns (depending on the axis= you choose) with NaN values. According to the screenshot provided, please also drop the first line of the DF, as it is a string.
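A minimal sketch of the first answer's fix, with made-up figures and the header duplicated as a string row (which is what makes the dtype object and the sums NaN); here the numeric cast is applied to the 'ALL_PPA' column rather than the whole frame:

import pandas as pd

df = pd.DataFrame({'Numéro_département': ['Numéro_département', '01', '01', '02'],
                   'ALL_PPA': ['ALL_PPA', '10', '20', '5']})

df2 = df.iloc[1:].copy()                     # drop the stray header row
df2['ALL_PPA'] = df2['ALL_PPA'].astype(int)  # force the column back to a numeric dtype
result = df2.groupby('Numéro_département')['ALL_PPA'].sum()
print(result)
# Numéro_département
# 01    30
# 02     5
# Name: ALL_PPA, dtype: int64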
