Pandas. Selecting rows with missing values in multiple columns - python

Suppose we have a dataframe with the columns 'Race', 'Age', 'Name'. I want to create two DFs:
1) Without missing values in columns 'Race' and 'Age'
2) Only with missing values in columns 'Race' and 'Age'
I wrote the following code:
first_df = df[df[columns].notnull()]
second_df= df[df[columns].isnull()]
However, this code does not work. I solved the problem with this code:
first_df = df[df['Race'].notnull() & df['Age'].notnull()]
second_df = df[df['Race'].isnull() & df['Age'].isnull()]
But what if there are 10 columns? Is there a way to write this without logical operators, using only a list of columns?

If you select multiple columns you get a boolean DataFrame, so it is necessary to test whether all values per row are True with DataFrame.all, or whether at least one value per row is True with DataFrame.any:
first_df = df[df[columns].notnull().all(axis=1)]
second_df = df[df[columns].isnull().all(axis=1)]
You can also use ~ to invert the mask:
mask = df[columns].notnull().all(axis=1)
first_df = df[mask]
second_df = df[~mask]
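A minimal runnable sketch of those masks on made-up data (the values below are assumptions, only the column names come from the question):
import pandas as pd
import numpy as np

df = pd.DataFrame({'Race': ['A', np.nan, 'B', np.nan],
                   'Age': [25, np.nan, np.nan, np.nan],
                   'Name': ['x', 'y', 'z', 'w']})
columns = ['Race', 'Age']

first_df = df[df[columns].notnull().all(axis=1)]   # row 0: no NaN in Race or Age
second_df = df[df[columns].isnull().all(axis=1)]   # rows 1 and 3: NaN in both Race and Age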

Step 1: Make a new dataframe by dropping the missing data (NaN, pd.NaT, None), which filters out incomplete rows.
DataFrame.dropna drops all rows containing at least one field with missing data.
Call the new df DF_updated and the original one DF_Original.
Step 2: The rows with missing data are then the difference between the two DFs, which can be found by
pd.concat([DF_Original,DF_updated]).drop_duplicates(keep=False)
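A short sketch of this two-step approach on assumed data; note it relies on the original frame having no duplicate rows, since drop_duplicates(keep=False) would remove those as well:
import pandas as pd
import numpy as np

DF_Original = pd.DataFrame({'Race': ['A', np.nan], 'Age': [25, np.nan], 'Name': ['x', 'y']})
DF_updated = DF_Original.dropna()  # Step 1: keep only rows without any missing data
# Step 2: the rows with missing data are the difference between the two frames
missing_df = pd.concat([DF_Original, DF_updated]).drop_duplicates(keep=False)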

Related

How to drop columns and rows with missing values?

I've been trying to take a pandas.DataFrame and drop its rows and columns with missing values simultaneously. While trying to use dropna on both axes at once, I found out that this is no longer supported. So then I tried, using dropna, to drop the columns and then the rows, and vice versa; obviously, the results come out different, as the values no longer accurately reflect the initial state.
So to give an example I receive:
pandas.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
"toy": [numpy.nan, 'Batmobile', 'Bullwhip'],
"weapon": [numpy.nan, 'Boomerang', 'Gun']})
and return:
pandas.DataFrame({"name": ['Batman', 'Catwoman']})
Any help will be appreciated.
Test whether all values are non-missing per row and per column with DataFrame.notna and DataFrame.all, then select with DataFrame.loc:
m = df.notna()
df0 = df.loc[m.all(axis=1), m.all(axis=0)]
print(df0)
name
1 Batman
2 Catwoman
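Putting the question's frame and this answer together as one runnable snippet (only the variable name df is assumed):
import pandas as pd
import numpy as np

df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                   "weapon": [np.nan, 'Boomerang', 'Gun']})

m = df.notna()
df0 = df.loc[m.all(axis=1), m.all(axis=0)]  # keep only rows and columns with no NaN
print(df0)
#        name
# 1    Batman
# 2  Catwoman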

Best way to move an unexpected column in a Pandas DF to a new DF?

Wondering what the best way to tackle this issue is. If I have a DF with the following columns
df1()
type_of_fruit name_of_fruit price
..... ..... .....
and a list called
expected_cols = ['name_of_fruit','price']
what's the best way to automate the check of df1 against the expected_cols list? I was trying something like
df_cols=df1.columns.values.tolist()
if df_cols != expected_cols:
And then try to drop to another df any columns not in expected_cols, but this doesn't seem like a great idea to me. Is there a way to save the "dropped" columns?
df2 = df1.drop(columns=expected_cols)
But then this seems problematic depending on column ordering, and also in cases where there could be either more columns than expected or fewer. In cases where there are fewer columns than expected (i.e. df1 only contains the column name_of_fruit) I'm planning on using
df1.reindex(columns=expected_cols)
But I'm a bit iffy on how to do this programmatically, and on how to handle the case where there are more columns than expected.
You can use set difference with -:
Assuming df1 has these columns:
df1_cols = df1.columns  # ['type_of_fruit', 'name_of_fruit', 'price']
expected_cols = ['name_of_fruit', 'price']
unwanted_cols = list(set(df1_cols) - set(expected_cols))
df2 = df1[unwanted_cols]
df1.drop(columns=unwanted_cols, inplace=True)
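A self-contained version of the same idea, with a one-row frame invented purely for illustration:
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3]], columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit', 'price']

unwanted_cols = list(set(df1.columns) - set(expected_cols))
df2 = df1[unwanted_cols]                   # the "dropped" columns are saved here
df1 = df1.drop(columns=unwanted_cols)      # df1 keeps only the expected columns
df1 = df1.reindex(columns=expected_cols)   # fixes the order and adds any missing expected columns as NaN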
Use groupby along the columns axis to split the DataFrame succinctly. In this case, check whether the columns are in your list to form the grouper, and you can store the results in a dict where the True key gets the DataFrame with the subset of columns in the list and the False key has the subset of columns not in the list.
Sample Data
import pandas as pd
df = pd.DataFrame(data=[[1, 2, 3]],
                  columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit','price']
Code
d = dict(tuple(df.groupby(df.columns.isin(expected_cols), axis=1)))
# If you need to ensure columns are always there then do
#d[True] = d[True].reindex(expected_cols)
d[True]
# name_of_fruit price
#0 2 3
d[False]
# type_of_fruit
#0 1

Compare data of two columns of one dataframe with two columns of another dataframe and find mismatch data

I have dataframe df1 as follows:
The second dataframe df2 is as follows:
and I want the resulting dataframe as follows.
Dataframes df1 and df2 contain a large number of columns and rows, but here I am only showing sample data. My goal is to compare the Customer and ID columns of df1 with the Customer and Part Number columns of df2. The comparison should find rows where the combination of df1['Customer'] and df1['ID'] does not appear in df2['Customer'] and df2['Part Number']. Finally, the mismatched data is stored in another dataframe df3. For example: Customer (rishab) with ID (89ab) is present in df1 but not in df2, so Customer, Order#, and Part are stored in df3.
I am using the isin() method to find mismatches between df1 and df2 for one column only, but I am not able to do it for a comparison of two columns.
df3 = df1[~df1['ID'].isin(df2['Part Number'].values)]
#here I am only able to find mismatch based upon only 1 column ID but I want to include Customer also
I could also use a loop, but the data is very large (the time complexity would increase) and I am sure there is a one-liner to achieve this task. I have also tried merge but was not able to produce the exact output.
So, how can I produce this exact output? I am also not able to use isin() for two columns, and I think isin() cannot be used for two columns.
The easiest way to achieve this is:
df3 = df1.merge(df2, left_on = ['Customer', 'ID'],right_on= ['Customer', 'Part Number'], how='left', indicator=True)
df3.reset_index(inplace = True)
df3 = df3[df3['_merge'] == 'left_only']
Here, you first do a left join on the columns with indicator=True, which adds a column like _merge indicating which side each row came from, and then you keep only the left_only rows.
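A small runnable sketch with made-up rows (only the column names come from the question; the values are assumptions):
import pandas as pd

df1 = pd.DataFrame({'Customer': ['rishab', 'john'],
                    'Order#': [1, 2],
                    'ID': ['89ab', '12cd']})
df2 = pd.DataFrame({'Customer': ['john'],
                    'Part Number': ['12cd']})

df3 = df1.merge(df2, left_on=['Customer', 'ID'],
                right_on=['Customer', 'Part Number'],
                how='left', indicator=True)
df3 = df3[df3['_merge'] == 'left_only']  # rows of df1 with no match in df2
print(df3)  # only the rishab / 89ab row remains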
You can also try an outer join to get the non-matching rows, something like df3 = df1.merge(df2, left_on=['Customer', 'ID'], right_on=['Customer', 'Part Number'], how='outer')

how to reindex the 'multi - groupbyed' dataframe?

I have a dataframe containing 4 columns; the first 3 columns are numerical variables which indicate the features of the variable in the last column, and the last column contains strings.
I want to merge the last string column by the previous 3 columns through the groupby function. That works (I mean the strings which share the same features logged by the first three columns were merged successfully).
Previously the length of the dataframe was 1200, and the length of the merged dataframe is 1100. I found the later df is multi-indexed (hierarchical index) and only contains 2 columns. Thus I tried the reindex method with a generated ascending numerical list. Sadly, I failed.
df1.columns
*[Out]Index(['time', 'column','author', 'text'], dtype='object')
series = df1.groupby(['time', 'column', 'author'])['body_text'].sum()  # merge the last column by the first 3 columns
dfx = series.to_frame()  # get the new df
dfx.columns
*[Out]Index(['author', 'text'], dtype='object')
len(dfx)
*[Out]1100
indexs = list(range(1100))
dfx.reindex(index = indexs)
*[Out]Exception: cannot handle a non-unique multi-index!
Reindex here is not necessary; it is better to use DataFrame.reset_index or to add the parameter as_index=False to DataFrame.groupby:
dfx = df1.groupby(['time', 'column','author'])['body_text'].sum().reset_index()
Or:
dfx = df1.groupby(['time', 'column','author'], as_index=False)['body_text'].sum()
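For instance, with a tiny frame invented here just to show the shape of the result (the column names follow the answer; the values are assumptions):
import pandas as pd

df1 = pd.DataFrame({'time': [1, 1, 2],
                    'column': ['a', 'a', 'b'],
                    'author': ['x', 'x', 'y'],
                    'body_text': ['foo', 'bar', 'baz']})

dfx = df1.groupby(['time', 'column', 'author'], as_index=False)['body_text'].sum()
print(dfx)
#    time column author body_text
# 0     1      a      x    foobar
# 1     2      b      y       baz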

Update dataframe according to another dataframe based on certain conditions

I have two dataframes, df1 and df2. df1 has columns A, B, C, D, E, F and df2 has A, B, J, D, E, K. I want to update the second dataframe with the rows of the first, but only when the first two columns have the same value in both dataframes. For each row where the following two conditions are true:
df1.A = df2.A
df1.B = df2.B
then update accordingly:
df2.D = df1.D
df2.E = df1.E
My dataframes have different numbers of rows.
When I tried this code I got a TypeError: cannot do positional indexing with these indexers of type 'str'.
for a in df1:
    for t in df2:
        if df1.iloc[a]['A'] == df2.iloc[t]['A'] and df1.iloc[a]['B'] == df2.iloc[t]['B']:
            df2.iloc[t]['D'] = df1.iloc[a]['D']
            df2.iloc[t]['E'] = df1.iloc[a]['E']
The Question:
You'd be better served merging the dataframes than doing nested iteration.
df2 = df2.merge(df1[['A', 'B', 'D', 'E']], on=['A', 'B'], how='left', suffixes=['_old', ''])
df2['D'] = df2['D'].fillna(df2['D_old'])
df2['E'] = df2['E'].fillna(df2['E_old'])
del df2['D_old']
del df2['E_old']
The first line merges the D and E values from the matching rows of df1 into df2 and renames df2's original columns to D_old and E_old.
The next two lines fill in the rows for which df1 had no matching row, and the last two delete the initial, now outdated versions of the columns.
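A runnable sketch of that sequence with invented data (the column names follow the question; the values are assumptions):
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y'], 'C': [0, 0],
                    'D': [10, 20], 'E': [100, 200], 'F': [0, 0]})
df2 = pd.DataFrame({'A': [1, 3], 'B': ['x', 'z'], 'J': [0, 0],
                    'D': [-1, -3], 'E': [-1, -3], 'K': [0, 0]})

df2 = df2.merge(df1[['A', 'B', 'D', 'E']], on=['A', 'B'],
                how='left', suffixes=['_old', ''])
df2['D'] = df2['D'].fillna(df2['D_old'])  # keep the old value where df1 had no match
df2['E'] = df2['E'].fillna(df2['E_old'])
df2 = df2.drop(columns=['D_old', 'E_old'])
# the A=1, B='x' row picks up D=10, E=100 from df1; the A=3, B='z' row keeps its own D and E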
The Error:
Your TypeError happened because for a in df1: iterates over the columns of a dataframe, which are strings here, while .iloc only takes integers. Additionally, though you didn't get to this point, to set a value you'd need both index and column contained within the brackets.
So if you did need to set values by row, you'd want something more like
for a, _ in df1.iterrows():
    for t, _ in df2.iterrows():
        if df1.loc[a, 'A'] == ...
Though I'd strongly caution against doing that. If you find yourself thinking about it, there's probably either a much faster, less painful way to do it in pandas, or you're better off using another tool less focused on tabular data.
