Pandas drop rows vs filter - python

I have a pandas dataframe and want to get rid of rows in which the column 'A' is negative. I know 2 ways to do this:
df = df[df['A'] >= 0]
or
selRows = df[df['A'] < 0].index
df = df.drop(selRows, axis=0)
What is the recommended solution? Why?

The recommended solution is the most efficient one, which in this case is the first:
df = df[df['A'] >= 0]
On the second solution
selRows = df[df['A'] < 0].index
df = df.drop(selRows, axis=0)
you are repeating the slicing process. But let's break it into pieces to understand why.
When you write
df['A'] >= 0
you are creating a mask: a Boolean Series with an entry for each index of df, whose value is either True or False depending on a condition (in this case, whether the value of column 'A' at a given index is greater than or equal to 0).
When you write
df[df['A'] >= 0]
you are accessing the rows for which your mask (df['A'] >= 0) is True. This is a slicing method supported by pandas that lets you select certain rows by passing a Boolean Series; it returns the subset of the original DataFrame containing only the entries for which the Series was True.
Finally, when you write this
selRows = df[df['A'] < 0].index
df = df.drop(selRows, axis=0)
you are repeating the process, because
df[df['A'] < 0]
is already slicing your DataFrame (in this case, selecting the rows you want to drop). You are then taking those indices, going back to the original DataFrame and explicitly dropping them. There is no need for this: you already sliced the DataFrame in the first step.
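A minimal sketch of the two routes side by side (the column name and values are just illustrative), showing that they end up with the same rows:

import pandas as pd

df = pd.DataFrame({'A': [3, -1, 0, -5, 2]})

# one step: boolean-mask indexing keeps the rows where 'A' >= 0
kept = df[df['A'] >= 0]

# two steps: find the offending index labels, then drop them
bad_rows = df[df['A'] < 0].index
dropped = df.drop(bad_rows, axis=0)

print(kept.equals(dropped))  # True - same result, but the second route slices twice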

df = df[df['A'] >= 0]
is indeed the faster solution. Just be aware that pandas may treat the result as a view of the original DataFrame rather than an independent copy. This can lead you into trouble, for example when you later want to change its values, as pandas will give you the SettingWithCopyWarning.
The simple fix of course is what Wen-Ben recommended:
df = df[df['A'] >= 0].copy()
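A small sketch of the pitfall and the fix (the column names and values here are made up):

import pandas as pd

df = pd.DataFrame({'A': [1, -2, 3], 'B': [10, 20, 30]})

sub = df[df['A'] >= 0]
sub['B'] = 0          # may raise SettingWithCopyWarning: pandas cannot tell whether sub is a view or a copy

sub = df[df['A'] >= 0].copy()
sub['B'] = 0          # safe: we explicitly work on an independent copy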

Your question is like this: "I have two identical cakes, but one has icing. Which has more calories?"
The second solution is doing the same thing, but twice. A single filtering step is enough; there is no need to filter and then call a function that does exactly what the filtering step already did.
To clarify: regardless of which operation you use, you are still doing the same thing: generating a boolean mask and then indexing with it.

Related

pandas - mask works on whole dataframe but not on selected columns?

I was replacing values in columns and noticed that if I use mask on the whole dataframe, it produces the expected results, but if I use it against selected columns with .loc, it doesn't change any value.
Can you explain why, and tell me whether this is the expected result?
You can try it with a dataframe dt containing zeros in its columns:
dt = pd.DataFrame(np.random.randint(0,3,size=(10, 3)), columns=list('ABC'))
dt.mask(lambda x: x == 0, np.nan, inplace=True)
# will replace all zeros with NaN - OK.
But:
dt = pd.DataFrame(np.random.randint(0,3,size=(10, 3)), columns=list('ABC'))
columns = list('BC')
dt.loc[:, columns].mask(lambda x: x == 0, np.nan, inplace=True)
# won't change anything. I expect the B and C columns to have their values replaced
I guess it's because the DataFrame.loc property just gives access to a slice of your dataframe, so you are masking a copy of the dataframe and it doesn't affect the original data.
You can try this instead:
dt[columns] = dt[columns].mask(dt[columns] == 0)
The .loc indexer returns a copy of the selected part of the dataframe. On this copy you are applying the mask function, which performs the operation in place on the copy's data. You can't do this in a one-liner, because the copy becomes unreachable as soon as the statement ends. To keep access to it, split the code into two lines and hold a reference to the copy:
tmp = dt.loc[:, columns]
tmp.mask(tmp[columns] == 0, np.nan, inplace=True)
and then you can go and update the dataframe:
dt[columns] = tmp
If, on the other hand, you do not use the in-place update of mask, you can do everything in one line of code:
dt[columns] = dt.loc[:, columns].mask(dt[columns] == 0, np.nan, inplace=False)
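Putting it together, a rough end-to-end sketch of that one-liner (the data is random, so your exact values will differ):

import numpy as np
import pandas as pd

dt = pd.DataFrame(np.random.randint(0, 3, size=(10, 3)), columns=list('ABC'))
columns = list('BC')

# assign the masked copy back into the original frame
dt[columns] = dt.loc[:, columns].mask(dt[columns] == 0, np.nan, inplace=False)

print(dt)  # zeros in B and C are now NaN; column A is untouched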
Extra:
If you want to better understand the use of the inplace parameter in pandas, I recommend you read these posts:
Understanding inplace=True in pandas
In pandas, is inplace = True considered harmful, or not?
What is the difference between using loc and using just square brackets to filter for columns in Pandas/Python?

Replacing all values in a column with conditions on a dataframe

I have a decent set of data (37509, 166). I am currently trying to replace 0 in several columns based on a set of conditions. I kept getting a memory error until I changed that value, and now my kernel keeps crashing. My question is: is there a better way to write this code that avoids memory problems?
df = pd.read_csv(".csv")
cols = list(df.select_dtypes(include=[np.number]).columns)
mask = (df["column1"] <= 0) & (df["column2"] == 0)
df.loc[mask, df[cols]] = np.nan
The two columns used for the mask are not included in the cols list and I've tried 1 column at a time. I run into MemoryError every time. I've tried running it through Terality with the same issue.
The error is:
MemoryError: Unable to allocate 10.5 GiB for an array with shape (37509, 37509) and data type float64.
The following code does not work either (I understand why this code won't work with the copy vs view) for the list of columns or individual column:
df[mask][cols].replace(0, np.nan, inplace=True)
If anyone would be willing to help explain a solution or even just explain the problem, I would greatly appreciate it.
DataFrame.loc accepts either booleans or labels:
Access a group of rows and columns by label(s) or a boolean array.
Currently the column indexer is an entire dataframe df[cols]:
df.loc[mask, df[cols]] = np.nan
# ^^^^^^^^
Instead of df[cols], use just the cols list:
df.loc[mask, cols] = np.nan
# ^^^^
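A minimal sketch of the corrected assignment on a toy frame (the column names here are invented; the question's real columns differ):

import numpy as np
import pandas as pd

df = pd.DataFrame({'column1': [-1, 2, 0],
                   'column2': [0, 0, 5],
                   'x': [1, 2, 3],
                   'y': [4, 5, 6]})
cols = ['x', 'y']

mask = (df['column1'] <= 0) & (df['column2'] == 0)
df.loc[mask, cols] = np.nan  # row indexer is a boolean Series, column indexer is a list of labels

print(df)  # 'x' and 'y' become NaN only in the first row, where the mask is True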

Filtering only 1 column in a df without returning the entire DF in 1 line

I'm hoping that there is a way I can return a Series from a df while I'm filtering it in one line.
Is there a way I could return a column from my df after I filter it?
Currently my process is something like this
df = df[df['a'] > 0 ]
list = df['a']
The df.loc syntax is the preferred way to do this, as #JohnM wrote in his comment, though I find the syntax from #Don'tAccept more readable and scalable, since it can handle cases like column names with spaces in them. These combine like:
df.loc[df['a'] > 0, 'a']
Note this is expandable to provide multiple columns, for example if you wanted columns 'a' and 'b' you would do:
df.loc[df['a'] > 0, ['a', 'b']]
Lastly, you can verify that df.a and df['a'] are the same by checking
in: df.a is df['a']
out: True
The is here (as opposed to ==) means df.a and df['a'] point to the same object in memory, so they are interchangeable.
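A short sketch of the difference in return types (the column names are illustrative):

import pandas as pd

df = pd.DataFrame({'a': [3, -1, 2], 'b': [10, 20, 30]})

s = df.loc[df['a'] > 0, 'a']           # single label -> pandas Series
sub = df.loc[df['a'] > 0, ['a', 'b']]  # list of labels -> DataFrame

print(type(s).__name__, type(sub).__name__)  # Series DataFrame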

How to only keep rows where a value in a column appear often enough

a = df.groupby('value').size()
newFrame = pd.DataFrame()
for el in a.keys():
    if a[el] > 300000:
        newFrame = pd.concat([newFrame, df[df.value == el]])
I have written this code, which does what I want but is really slow. I only want to keep the rows where the 'value' entry is the same as in more than 300000 other rows; if it occurs less often than that, I want to drop it.
Use GroupBy.transform with 'size' to build a Series of group counts that is the same length as the original, then filter with boolean indexing:
df = df[df.groupby('value')['value'].transform('size') > 300000]
If you will process the output later, take an explicit copy:
df = df[df.groupby('value')['value'].transform('size') > 300000].copy()
Just do value_counts to find the values that are too rare, then drop those rows:
df = df.drop(df[df.value.isin(df.value.value_counts().loc[lambda x: x <= 300000].index)].index)
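Both approaches on a toy frame, with the threshold lowered to 2 so the effect is visible (the data is made up):

import pandas as pd

df = pd.DataFrame({'value': ['a', 'a', 'a', 'b', 'c', 'c']})

# transform('size') broadcasts each group's count back onto every row
out1 = df[df.groupby('value')['value'].transform('size') > 2]

# value_counts + drop: find the labels that are too rare and drop their rows
rare = df.value.value_counts().loc[lambda x: x <= 2].index
out2 = df.drop(df[df.value.isin(rare)].index)

print(out1.equals(out2))  # True - only the 'a' rows survive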

df.where not working as expected [duplicate]

I was reading Andy's answer to the question Outputting difference in two Pandas dataframes side by side - highlighting the difference.
I have two questions regarding the code; unfortunately I don't yet have 50 rep to comment on the answer, so I hope I can get some help here.
What does In [24]: changed = ne_stacked[ne_stacked] do?
I'm not sure what df1 = df[df] does, and I can't seem to get an answer from the pandas docs. Could someone explain this to me please?
Is np.where(df1 != df2) the same as df.where(df1 != df2)? If not, what is the difference?
Question 1
ne_stacked is a pd.Series that consists of True and False values that indicate where df1 and df2 are not equal.
ne_stacked[boolean_array] is a way to filter the Series ne_stacked: it eliminates the rows of ne_stacked where boolean_array is False and keeps the rows where boolean_array is True.
It so happens that ne_stacked is itself a boolean array, and so it can be used to filter itself. Why would we want to do this? So we can see what the values of the index are after we've filtered.
So ne_stacked[ne_stacked] is a subset of ne_stacked with only True values.
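A tiny sketch of a boolean Series filtering itself (the index labels and values are made up):

import pandas as pd

ne_stacked = pd.Series([True, False, True, False], index=['w', 'x', 'y', 'z'])

changed = ne_stacked[ne_stacked]
print(changed)
# w    True
# y    True
# dtype: bool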
Question 2
np.where
np.where does two things. If you pass only a condition, as in np.where(df1 != df2), you get a tuple of arrays: the first holds the row indices and the second the column indices of the positions where the condition is True, meant to be used together. I usually use it like this:
i, j = np.where(df1 != df2)
Now I can get at all the elements of df1 or df2 in which there are differences, like:
df.values[i, j]
Or I can assign to those cells
df.values[i, j] = -99
Or lots of other useful things.
You can also use np.where as an if/then/else for arrays:
np.where(df1 != df2, -99, 99)
To produce an array the same size as df1 or df2 where you have -99 in all the places where df1 != df2 and 99 in the rest.
df.where
On the other hand, df.where evaluates its first argument as boolean values and returns an object of the same size as df, in which the cells that evaluated to True keep their values and the rest become either np.nan or the value passed as the second argument of df.where:
df1.where(df1 != df2)
Or
df1.where(df1 != df2, -99)
Are they the same?
Clearly they are not the "same", but you can use them to similar effect:
np.where(df1 != df2, df1, -99)
should be the same as
df1.where(df1 != df2, -99).values
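A quick sketch checking that claim on two small frames (the data is arbitrary):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'a': [1, 9, 3], 'b': [4, 5, 0]})

via_np = np.where(df1 != df2, df1, -99)      # plain NumPy array
via_pd = df1.where(df1 != df2, -99).values   # DataFrame, then its underlying array

print(np.array_equal(via_np, via_pd))  # True - same values, produced by different APIs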
