I have a dataframe df:
>>> df
sales discount net_sales cogs
STK_ID RPT_Date
600141 20060331 2.709 NaN 2.709 2.245
20060630 6.590 NaN 6.590 5.291
20060930 10.103 NaN 10.103 7.981
20061231 15.915 NaN 15.915 12.686
20070331 3.196 NaN 3.196 2.710
20070630 7.907 NaN 7.907 6.459
Then I want to drop rows whose sequence numbers (positional indexes) are given in a list, say [1,2,4] here, leaving:
sales discount net_sales cogs
STK_ID RPT_Date
600141 20060331 2.709 NaN 2.709 2.245
20061231 15.915 NaN 15.915 12.686
20070630 7.907 NaN 7.907 6.459
How, or with what function, can I do that?
Use DataFrame.drop and pass it an array of index labels; here df.index[[1,3]] selects the labels at positions 1 and 3:
In [65]: df
Out[65]:
one two
one 1 4
two 2 3
three 3 2
four 4 1
In [66]: df.drop(df.index[[1,3]])
Out[66]:
one two
one 1 4
three 3 2
Note that it may be important to use the inplace argument when you want to do the drop in place:
df.drop(df.index[[1,3]], inplace=True)
If you are not assigning the result back to a variable, this is the form to use, since drop otherwise returns a new DataFrame and leaves the original unchanged.
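For clarity, a minimal sketch of the two equivalent styles (using the toy frame above):
# Style 1: keep the returned copy; the original df is untouched
df2 = df.drop(df.index[[1, 3]])
# Style 2: mutate df itself; drop() returns None in this form
df.drop(df.index[[1, 3]], inplace=True)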
http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.drop.html
If the DataFrame is huge and the number of rows to drop is large as well, then a simple drop by index, df.drop(df.index[...]), takes too much time.
In my case, I have a multi-indexed DataFrame of floats with 100M rows x 3 cols, and I need to remove 10k rows from it. The fastest method I found is, quite counterintuitively, to take the remaining rows.
Let indexes_to_drop be an array of positional indexes to drop ([1, 2, 4] in the question).
# Keep the complement of the positions to drop
indexes_to_keep = set(range(df.shape[0])) - set(indexes_to_drop)
df_sliced = df.take(list(indexes_to_keep))
In my case this took 20.5s, while the simple df.drop took 5min 27s and consumed a lot of memory. The resulting DataFrame is the same.
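A variant of the same keep-the-rest idea, assuming NumPy is imported, builds the keep array without Python sets:
import numpy as np
# All positions 0..n-1 minus the drop positions, returned sorted
indexes_to_keep = np.setdiff1d(np.arange(df.shape[0]), indexes_to_drop)
df_sliced = df.take(indexes_to_keep)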
I solved this in a simpler way - just in 2 steps.
Make a dataframe with unwanted rows/data.
Use the index of this unwanted dataframe to drop the rows from the original dataframe.
Example:
Suppose you have a dataframe df with many columns, including 'Age', which is an integer. Now let's say you want to drop all the rows where 'Age' is negative.
df_age_negative = df[df['Age'] < 0]  # Step 1
df = df.drop(df_age_negative.index, axis=0)  # Step 2
Hope this is much simpler and helps you.
You can also pass DataFrame.drop the label itself (instead of an array of index labels):
In[17]: df
Out[17]:
a b c d e
one 0.456558 -2.536432 0.216279 -1.305855 -0.121635
two -1.015127 -0.445133 1.867681 2.179392 0.518801
In[18]: df.drop('one')
Out[18]:
a b c d e
two -1.015127 -0.445133 1.867681 2.179392 0.518801
Which is equivalent to:
In[19]: df.drop(df.index[[0]])
Out[19]:
a b c d e
two -1.015127 -0.445133 1.867681 2.179392 0.518801
If I want to drop a row which has, let's say, index x, I would do the following:
df = df[df.index != x]
If I would want to drop multiple indices (say these indices are in the list unwanted_indices), I would do:
desired_indices = [i for i in range(len(df.index)) if i not in unwanted_indices]
desired_df = df.iloc[desired_indices]
Here is a somewhat specific example I would like to show. Say you have many duplicate entries in some of your rows. If you have string entries, you can easily use string methods to find all the indexes to drop:
ind_drop = df[df['column_of_strings'].apply(lambda x: x.startswith('Keyword'))].index
And now drop those rows using their indexes:
new_df = df.drop(ind_drop)
Use the index argument alone to drop a row:
df.drop(index=2, inplace=True)
For multiple rows:
df.drop(index=[1, 3], inplace=True)
In a comment on @theodros-zelleke's answer, @j-jones asked what to do when the index is not unique. I had to deal with such a situation. What I did was rename the duplicates in the index before calling drop(), à la:
dropped_indexes = <determine-indexes-to-drop>
df.index = rename_duplicates(df.index)
df.drop(df.index[dropped_indexes], inplace=True)
where rename_duplicates() is a function I defined that goes through the elements of the index and renames the duplicates. I used the same renaming pattern as pd.read_csv() uses on columns, i.e., "%s.%d" % (name, count), where name is the name of the row and count is how many times it has occurred previously.
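For reference, here is one possible sketch of such a helper (my own reconstruction following the stated "%s.%d" pattern, not the author's exact code):
import pandas as pd

def rename_duplicates(index):
    # Mimic pd.read_csv()'s column deduplication: the first occurrence
    # keeps its name, later ones become name.1, name.2, ...
    counts = {}
    new_labels = []
    for name in index:
        if name in counts:
            counts[name] += 1
            new_labels.append("%s.%d" % (name, counts[name]))
        else:
            counts[name] = 0
            new_labels.append(name)
    return pd.Index(new_labels)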
Determining the index from a boolean mask as described above, e.g.
df[df['column'].isin(values)].index
can be more memory-intensive than determining the index using this method:
pd.Index(np.where(df['column'].isin(values))[0])
applied like so:
df.drop(pd.Index(np.where(df['column'].isin(values))[0]), inplace=True)
This method is useful when dealing with large dataframes and limited memory.
To drop rows with indices 1, 2, 4 you can use:
df[~df.index.isin([1, 2, 4])]
The tilde operator ~ negates the result of the method isin. Another option is to drop indices:
df.loc[df.index.drop([1, 2, 4])]
Look at the following dataframe df:
df
column1 column2 column3
0 1 11 21
1 2 12 22
2 3 13 23
3 4 14 24
4 5 15 25
5 6 16 26
6 7 17 27
7 8 18 28
8 9 19 29
9 10 20 30
Let's drop all the rows which have an odd number in column1.
Create a list of all the elements in column1 and keep only those that are even (the elements you don't want to drop):
keep_elements = [x for x in df.column1 if x % 2 == 0]
All the rows whose column1 value is in [2, 4, 6, 8, 10] will be retained (not dropped).
df.set_index('column1', inplace=True)
df.drop(df.index.difference(keep_elements), axis=0, inplace=True)
df.reset_index(inplace=True)
We make column1 the index, drop all the rows that are not required, and then reset the index back.
df
column1 column2 column3
0 2 12 22
1 4 14 24
2 6 16 26
3 8 18 28
4 10 20 30
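As an aside, the same result can be had without touching the index at all, using a plain boolean mask with the same even-number condition:
df_even = df[df['column1'] % 2 == 0].reset_index(drop=True)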
As Dennis Golomazov's answer suggests, you can use drop to drop rows. Alternatively, you can select the rows to keep. Let's say you have a list of row indices to drop called indices_to_drop. You can convert it to a mask as follows:
mask = np.ones(len(df), dtype=bool)
mask[indices_to_drop] = False
You can use this mask directly:
df_new = df.iloc[mask]
The nice thing about this method is that mask can come from any source: it can be a condition involving many columns, or something else.
The really nice thing is that you don't need the index of the original DataFrame at all, so it doesn't matter whether the index is unique or not.
The disadvantage is of course that you can't do the drop in-place with this method.
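For instance, a mask derived from a condition instead of a position list works identically (column names here are made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, -2, 3, -4], 'b': [10, 20, 30, 40]})
# Boolean mask built from a condition; iloc treats it positionally
mask = (df['a'] > 0).to_numpy()
df_new = df.iloc[mask]  # keeps rows 0 and 2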
Consider an example dataframe
df =
index column1
0 00
1 10
2 20
3 30
we want to drop the 2nd and 3rd rows (positional indexes 1 and 2).
Approach 1:
df = df.drop(df.index[[1, 2]])
or
df.drop(df.index[[1, 2]], inplace=True)
print(df)
df =
index column1
0 00
3 30
# This approach removes the rows we wanted, but the surviving labels keep their original values (0 and 3), so the index is no longer contiguous
Approach 2 (DataFrame.drop has no ignore_index parameter, so chain reset_index(drop=True) to renumber):
df = df.drop(df.index[[1, 2]]).reset_index(drop=True)
print(df)
df =
index column1
0 00
1 30
# This approach removes the rows we wanted and resets the index.
This worked for me:
# Create a list containing the index numbers you want to remove
index_list = list(range(42766, 42798))
df.drop(df.index[index_list], inplace=True)
df.shape
This should drop all rows whose positional indexes fall within that range.
So I have a dataframe like this:
0 1 2 ...
0 Index Something Something2 ...
1 1 5 8 ...
2 2 6 9 ...
3 3 7 10 ...
Now, I want to append some columns in between those "Something" column names, for which I have used this code:
j = 1
for i in range(2, 51):
    if i % 2 != 0 and i != 4:
        df.insert(i, f"% Difference {j}", " ")
        j += 1
where df is the dataframe. Now what happens is that the columns do get inserted, but like this:
0 1 Difference 1 2 ...
0 Index Something NaN Something2 ...
1 1 5 NaN 8 ...
2 2 6 NaN 9 ...
3 3 7 NaN 10 ...
whereas what I wanted was this:
0 1 2 3 ...
0 Index Something Difference 1 Something2 ...
1 1 5 NaN 8 ...
2 2 6 NaN 9 ...
3 3 7 NaN 10 ...
Edit 1: Using jezrael's logic:
df.columns = df.iloc[0].tolist()
df = df.iloc[1:].reset_index(drop = True)
print(df)
The output of that is still this:
0 1 2 ...
0 Index Something Something2 ...
1 1 5 8 ...
2 2 6 9 ...
3 3 7 10 ...
Any ideas or suggestions as to where or how I am going wrong?
If your dataframe looks like what you've shown in your first code block, your column names aren't Index, Something, etc. - they're actually 0, 1, etc.
Pandas is seeing Index, Something, etc. as data in row 0, NOT as column names (which exist above row 0). So when you add a column with the name Difference 1, you're adding a column above row 0, which is where the range of integers is located.
A couple of potential solutions to this:
If you'd like the actual column names to be Index, Something, etc., then the best solution is to import the data with that row as the headers. What is the source of your data? If it's a csv, make sure NOT to use the header=None option. If it's from somewhere else, there is likely an option to pass in a list of the column names to use. I can't think of any reason why you'd want a range of integer values as your column names rather than the more descriptive names you have listed.
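For instance (file name hypothetical), the default header behavior already does the right thing:
import pandas as pd
df = pd.read_csv('data.csv')  # first row becomes the column names
df_bad = pd.read_csv('data.csv', header=None)  # columns named 0, 1, 2, ...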
Alternatively, you can do what @jezrael suggested and convert your first row of data to column names, then delete that data row. I'm not sure why their solution isn't working for you, since the code seems to work fine in my testing. Here's what it's doing:
df.columns = df.iloc[0].tolist()
df.columns tells pandas what to (re)name the columns of the dataframe. df.iloc[0].tolist() creates a list out of the first row of data, which in your case is the column names that you actually want.
df = df.iloc[1:].reset_index(drop = True)
This grabs the 2nd through last rows of data to recreate the dataframe. So you have new column names based on the first row, then you recreate the dataframe starting at the second row. The .reset_index(drop = True) isn't totally necessary to include. That just restarts your actual data rows with an index value of 0 rather than 1.
If for some reason you want to keep the column names as they currently exist (as integers rather than labels), you could do something like the following under the if statement in your for loop:
df.insert(i, i, np.nan, allow_duplicates=True)
df.iat[0, i] = f"%Difference {j}"
df.columns = np.arange(len(df.columns))
The first line inserts a column with an integer label, filled with NaN values to start with (assuming you have numpy imported). You need to allow duplicates, otherwise you'll get an error, since the integer value would be the name of a pre-existing column.
The second line changes the value in the 1st row of the newly-created column to what you want.
The third line resets the column names to be a range of integers like you had to start with.
As @jezrael suggested, it seems like you might be a little unclear about the difference between column names, indices, and data rows and columns. An index is its own thing, so it's not usually necessary to have a column named Index like you have in your dataframe, especially since that column has the same values in it as the actual index. Clarifying those sorts of things at import can help prevent a lot of hassle later on, so I'd recommend taking a good look at your data source to see if you can create a clearer dataframe to start with!
I want to append some columns in between those "Something" column names
No, there are no columns named Something; for that, you need to set the first row of data as the column names:
print (df.columns)
Int64Index([0, 1, 2], dtype='int64')
print (df.iloc[0].tolist())
['Index', 'Something', 'Something2']
df.columns = df.iloc[0].tolist()
df = df.iloc[1:].reset_index(drop=True)
print (df)
Index Something Something2
0 1 5 8
1 2 6 9
2 3 7 10
print (df.columns)
Index(['Index', 'Something', 'Something2'], dtype='object')
Then your solution creates the Difference columns, but the output is different - there are no columns 0,1,2,3.
I have a DataFrame with two-level columns such as ('a', 'curr(A)') (it was shown as an image in the original post).
When I try to add a list of values (of arbitrary length) to one of the columns I get an error:
mydf['a','curr(A)'] = [6,6,6,6,6]
or
mydf['a','curr(A)'] = [6,6]
gives the following error:
"ValueError: Length of values does not match length of index"
But this works:
mydf['a','curr(A)'] = [6,6,6]
How can I add an arbitrary number of entries to a column and pad the DataFrame with NaN's when necessary? Is there a parameter I can set when defining the DataFrame to do this padding automatically?
Thanks for your help.
I think the best way to do this would be something with concat:
df2 = pd.DataFrame({
    0: [1, 2, 3],
    1: [1, 2, 3],
    2: [4, 5, 6]
})
row = pd.Series([6,6,6,6])
pd.concat([df2,row], axis=0, ignore_index=True)
Results:
0 1 2
0 1 1.0 4.0
1 2 2.0 5.0
2 3 3.0 6.0
3 6 NaN NaN
4 6 NaN NaN
5 6 NaN NaN
6 6 NaN NaN
I don't think you are able to do this by just assigning the values to a column.
Turn the sequence into another df (with the same column names) and then use .combine_first().
df_val = pd.DataFrame({('a', 'curr(A)'): [6, 6, 6]})
df_final = mydf.combine_first(df_val)
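A quick sketch of the padding behavior with a longer sequence (toy frame assumed): combine_first reindexes to the union of row labels, so the extra rows appear and any other columns get NaN there, while mydf's existing non-null values are kept.
import pandas as pd

mydf = pd.DataFrame({('a', 'curr(A)'): [1.0, 2.0, 3.0],
                     ('a', 'volt(V)'): [0.1, 0.2, 0.3]})
df_val = pd.DataFrame({('a', 'curr(A)'): [6, 6, 6, 6, 6]})
# Rows 0-2 keep mydf's values; rows 3-4 come from df_val, and
# ('a', 'volt(V)') is padded with NaN on those new rows.
df_final = mydf.combine_first(df_val)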
I found a workaround to solve my specific problem but it only works because I have all the columns I want in the dataframe ahead of time.
# 2 pairs of lists I want to use as column data.
mydf = pd.DataFrame([[1,2],[3,4],[5,6,7,8,9],[-3,4,-5,6,12]])
mydf = mydf.transpose() # Transpose to go from 4 rows to 4 columns.
# Create a MultiIndex with 4 column labels
multi_idx = pd.MultiIndex.from_product([['a', 'b'], ['curr(A)', 'volt(V)']])
for col in mydf.columns:  # loop through to rename each column
    mydf = mydf.rename(columns={col: multi_idx[col]})
It works, but it seems like there must be a simpler way to do this.
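Indeed, assuming the column count matches the MultiIndex length, the rename loop can probably be replaced by a single direct assignment:
mydf.columns = multi_idx  # relabel all four columns in one step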
Thanks for your help everyone!
I am testing dataframes for equality.
df_diff=(df1!=df2)
I get df_diff, which has the same shape as df1 and df2, and contains boolean True/False values.
Now I would like to keep only the columns and rows of df1 where there was at least a different value.
If I simply do
df1 = df1[df_diff.values]
I get all the rows where there was at least one True in df_diff, but lots of columns originally had False only.
As a second step, I would like then to be able to replace all the values (element-wise in the dataframe) which were equal (where df_diff==False) with NaNs.
example:
df1=pd.DataFrame(data=[[1,2,3],[4,5,6],[7,8,9]])
df2=pd.DataFrame(data=[[1,99,3],[4,5,99],[7,8,9]])
I would like to get from df1
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
to
1 2
0 2 NaN
1 NaN 6
I think you need DataFrame.any to check for at least one True per row or per column:
df = df_diff[df_diff.any(axis=1)]
It is possible to filter both of the original dataframes like so:
df11 = df1[df_diff.any(axis=1)]
df22 = df2[df_diff.any(axis=1)]
If you want to filter both rows and columns:
df = df_diff.loc[df_diff.any(axis=1), df_diff.any()]
EDIT: Filter df1 and add NaNs with where:
df_diff=(df1!=df2)
m1 = df_diff.any(axis=1)
m2 = df_diff.any()
out = df1.loc[m1, m2].where(df_diff.loc[m1, m2])
print (out)
1 2
0 2.0 NaN
1 NaN 6.0
Suppose I have a dataframe (say) of 25 columns as follows:
A B C ...... I J ......... Y
I-1 yes 3 1-2-2017 100 james
I-2 no 4 NaN 100 ashok
I-3 NaN 9 2-10-2017 5 mary
I-4 yes NaN 2-10-2017 0 sania
I would like to obtain 3 dataframes from the above dataframe such that
a) the first dataframe consists of columns A to G
b) the second dataframe consists of column A and columns I to J.
c) the third dataframe consists of column A and columns K to Y.
How should I approach it? (Preferably in Python. Only some column values are illustrated; I will show more if required.)
You can create new DataFrames by using loc in combination with join:
df_a_to_g = df.loc[:, 'A':'G']
df_a_and_i_to_j = df.loc[:, ['A']].join(df.loc[:, 'I':'J'])
df_a_and_k_to_y = df.loc[:, ['A']].join(df.loc[:, 'K':'Y'])
If you want to select the columns 'numerically' you can use iloc instead of loc:
# Select the first column and columns 11 through 25.
# Indexing starts at 0, so column 11 sits at position 10; the
# slice stop is exclusive, so 10:25 covers columns 11 through 25.
df_new = df.iloc[:, [0]].join(df.iloc[:, 10:25])