I have hundreds of columns in a DataFrame and would like to drop rows where multiple columns are NaN, meaning the entire row is NaN for those columns.
I have tried to slice columns but the code is taking forever to run.
df = df.drop(df[(df.loc[:,'col1':'col100'].isna()) & (df.loc[:,'col120':'col220'].isna())].index)
Appreciate any help.
Part of your original question reads: "...would like to drop rows where multiple columns are NaN. Meaning entire row is NaN for those columns."
Can I interpret this as: you want to delete a row when the entire row is NaN? If that is true, you should be able to achieve this with:
df.dropna(axis='rows', how='all', inplace=True)
If that is not the case then I misunderstood your question.
You should try the dropna() function with the subset parameter set to the columns you want to drop on. Here is a short example taken from the pandas documentation:
df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
"toy": [np.nan, 'Batmobile', 'Bullwhip'],
"born": [pd.NaT, pd.Timestamp("1940-04-25"),
pd.NaT]})
df
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT
df.dropna(subset=['name', 'born'])
This gives you the following:
     name        toy       born
1  Batman  Batmobile 1940-04-25
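Applied to the original question, the same idea works together with how='all', so a row is dropped only when every column in the subset is NaN. A minimal sketch, assuming the column ranges from your slicing code exist in df:

# Build the subset from the two column ranges in the question, then
# drop rows where all of those columns are NaN.
cols = df.loc[:, 'col1':'col100'].columns.union(
    df.loc[:, 'col120':'col220'].columns)
df = df.dropna(subset=cols, how='all')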
I'm merging two pretty large data frames: the shape of RD_4ML is (97058, 24), while the shape of NewDF is (104047, 3). They share a common column called 'personUID'; below is the merge code I used.
Final_DF = RD_4ML.merge(NewDF, how='left', on='personUID')
Final_DF.fillna('none', inplace=True)
Final_DF.sample()
DF sample output:
| personUID | code  | Death | diagnosis_code_type | lr   |
|-----------|-------|-------|---------------------|------|
| abc123    | ICD10 | 1     | none                | none |
Essentially the columns from RD_4ML populate, while the two columns from NewDF come back as "none". Does anyone know how to solve an issue like this?
I think the 'personUID' values do not match between the two dataframes.
Ensure that the column has the same data type in both.
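A quick sanity check, sketched with the frame names from your question:

# Compare the key dtypes; a common failure mode is integer vs. string keys.
print(RD_4ML['personUID'].dtype, NewDF['personUID'].dtype)

# If they differ, cast both sides to string before merging:
RD_4ML['personUID'] = RD_4ML['personUID'].astype(str)
NewDF['personUID'] = NewDF['personUID'].astype(str)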
Merge with how='left' takes every entry from the left dataframe and tries to find a corresponding matching id in the right dataframe. For all non-matching ones, it fills in NaNs for the columns coming from the right frame. In SQL this is called a left join. As an example, have a look at this:
import pandas as pd

df1 = pd.DataFrame({"uid": range(4), "values": range(4)})
df2 = pd.DataFrame({"uid": range(5, 9), "values2": range(4)})
df1.merge(df2, how="left", on='uid')
# OUTPUT
   uid  values  values2
0    0       0      NaN
1    1       1      NaN
2    2       2      NaN
3    3       3      NaN
Here you see that all uids from the left dataframe end up in the merged dataframe, and since no matching entry was found, the columns from the right dataframe are set to NaN.
If your goal is to end up with only those rows that have a match, you can change "left" to "inner". For more information, have a look at the pandas docs.
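For instance, with the same two frames (whose uid ranges do not overlap), an inner join returns an empty result:

df1.merge(df2, how="inner", on='uid')
# OUTPUT
# Empty DataFrame
# Columns: [uid, values, values2]
# Index: []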
I have a data frame with four columns. I need columns 1 and 2 (new_df_1 and bill_df_1) to stay unchanged. I want to sort columns 3 and 4 (new_File_Number_Data and new_invoice_total) so that they match columns 1 and 2, and if there is no match, fill them in with missing.
             new_df_1  bill_df_1 new_File_Number_Data new_invoice_total
0  1-08912-000218-033       25.0   1-08915-000041-054            134.50
1  1-08915-000041-054      163.0   001-0464-01589-061            148.50
2  001-0464-01589-061      166.7   004-3001-00080-532             54.00
3  004-3001-00080-532       74.0              missing           missing
You can't sort only some columns of a dataframe and not others. It sounds like you need to separate the columns into two different dataframes and then merge them so that they are matched as you want. You can then fill the missing values with the string 'missing'. For example:
df1 = df[['new_df_1', 'bill_df_1']]
df2 = df[['new_File_Number_Data', 'new_invoice_total']]
new_df = pd.merge(df1, df2, how='left', left_on='new_df_1', right_on='new_File_Number_Data').fillna('missing')
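Since the original screenshot is not available, here is a runnable sketch of that answer, using the sample values from the question and modeling the unmatched file number as NaN in the input:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'new_df_1': ['1-08912-000218-033', '1-08915-000041-054',
                 '001-0464-01589-061', '004-3001-00080-532'],
    'bill_df_1': [25.0, 163.0, 166.7, 74.0],
    'new_File_Number_Data': ['1-08915-000041-054', '001-0464-01589-061',
                             '004-3001-00080-532', np.nan],
    'new_invoice_total': [134.50, 148.50, 54.00, np.nan],
})

df1 = df[['new_df_1', 'bill_df_1']]
df2 = df[['new_File_Number_Data', 'new_invoice_total']]
new_df = pd.merge(df1, df2, how='left',
                  left_on='new_df_1',
                  right_on='new_File_Number_Data').fillna('missing')
# Row 0 has no matching file number, so its right-hand columns become
# 'missing'; the other rows line up with columns 1 and 2.
print(new_df)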
I was able to pull the rows that I would like to delete from a CSV file, but I can't make the drop() function work.
from glob import iglob
import pandas as pd

data = pd.read_csv(next(iglob('*.csv')))
data_top = data.head()
data_top = data_top.drop(axis=0)  # fails: drop() needs labels, index or columns
What needs to be added?
Example of a CSV file. The code should delete everything until it reaches the 'Employee' row.
  creation date      Unnamed: 1   Unnamed: 2
0           NaN  type of client          NaN
1           age             NaN          NaN
2           NaN      birth date          NaN
3           NaN             NaN     days off
4      Employee          Salary     External
5           Dan            130e          yes
6       Abraham             10e           no
7      Richmond            201e  third-party
If it is just the top 5 rows you want to delete, then you can do it as follows:
data = pd.read_csv(next(iglob('*.csv')))
data.drop([0,1,2,3,4], axis=0, inplace=True)
Along with axis, you pass either a single label or a list of labels: row indexes for axis=0, column names for axis=1.
There are, of course, many other ways to achieve this too. Especially if the case is that the index of rows you want to delete is not just the top 5.
edit: inplace added as pointed out in comments.
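Another option, if the junk always occupies the same lines at the top of the file, is to skip them at read time. A sketch, assuming the real header row ('Employee', 'Salary', ...) is always the sixth line of the file:

from glob import iglob
import pandas as pd

# skiprows=5 skips the original header plus the four junk rows, so
# pandas reads the 'Employee' line as the header.
data = pd.read_csv(next(iglob('*.csv')), skiprows=5)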
Considering the comments and further explanations, assuming you know the name of the column and that you have a positional index, you can try the following:
data = pd.read_csv(next(iglob('*.csv')))
row = data[data['creation date'] == 'Employee']
n = row.index[0]
data.drop(labels=list(range(n)), inplace=True)
The main goal is to find the index of the row that contains the value 'Employee'. To achieve that, assuming there are no other rows that contain that word, you can filter the dataframe to match the value in question in the specific column.
After that, you extract the index value, which you use to build the list of labels (given a positional index) to drop from the dataframe, as @MAK7 stated in his answer.
I have been searching for an answer to my question for a while, and have not been able to find anything that produces my desired result.
The problem is this: I have a dataframe with two rows that I want to merge into a single-row dataframe with multi-level columns. Using my example below (which I drafted in Excel to better visualize my desired output), I want the new DF to have multi-level columns, with the first level based on the original columns A-C and a new sub-level based on the values from the original 'Name' column. It is quite possible I'm using existing functions incorrectly. If you could provide me with your simplest way of altering the dataframe, I would greatly appreciate it!
Code to construct current df:
import pandas as pd

df = pd.DataFrame([['Alex', 1, 2, 3], ['Bob', 4, 5, 6]],
                  columns='Name A B C'.split())
(An image of the current df and the desired output accompanied the original question.)
Using set_index + unstack
df.set_index('Name').unstack().to_frame().T
Out[198]:
        A         B         C
Name Alex Bob Alex Bob Alex Bob
0       1   4    2   5    3   6
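Step by step, the same chain with comments; a sketch that simply reuses the df from the question:

import pandas as pd

df = pd.DataFrame([['Alex', 1, 2, 3], ['Bob', 4, 5, 6]],
                  columns='Name A B C'.split())

out = (
    df.set_index('Name')  # index: Name; columns: A, B, C
      .unstack()          # Series with a (column, Name) MultiIndex
      .to_frame()         # back to a one-column DataFrame
      .T                  # transpose: one row, MultiIndex columns
)
print(out)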
I have a dataframe with millions of rows with unique indexes and a column('b') that has several repeated values.
I would like to generate a dataframe without the duplicated data but I do not want to lose the index information. I want the new dataframe to have an index that is a concatenation of the indexes ("old_index1,old_index2") where 'b' had duplicated values but remains unchanged for rows where 'b' had unique values. The values of the 'b' column should remain unchanged like in a keep=first strategy. Example below.
Input dataframe:
import pandas as pd

df = pd.DataFrame(data=[[1, "non_duplicated_1"],
                        [2, "duplicated"],
                        [2, "duplicated"],
                        [3, "non_duplicated_2"],
                        [4, "non_duplicated_3"]],
                  index=['one', 'two', 'three', 'four', 'five'],
                  columns=['a', 'b'])
Desired output:
           a                 b
one        1  non_duplicated_1
two,three  2        duplicated
four       3  non_duplicated_2
five       4  non_duplicated_3
The actual dataframe is quite large so I would like to avoid non-vectorized operations.
I am finding this surprisingly difficult... Any ideas?
You can use transform on the index column (after you use reset_index). Then, drop duplicates in column b:
df.index = df.reset_index().groupby('b')['index'].transform(','.join)
df.drop_duplicates('b',inplace=True)
>>> df
           a                 b
index
one        1  non_duplicated_1
two,three  2        duplicated
four       3  non_duplicated_2
five       4  non_duplicated_3
Setup
dct = {'index': ','.join, 'a': 'first'}
You can reset_index before using groupby, although it's unclear to me why you want this:
df.reset_index().groupby('b', as_index=False, sort=False).agg(dct).set_index('index')
                          b  a
index
one        non_duplicated_1  1
two,three        duplicated  2
four       non_duplicated_2  3
five       non_duplicated_3  4
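If the original a, b column order matters (to match the desired output exactly), one way is to reselect the columns at the end:

df.reset_index().groupby('b', as_index=False, sort=False).agg(dct) \
    .set_index('index')[['a', 'b']]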