I have a decently sized dataset of shape (37509, 166). I am currently trying to replace 0 in several columns based on a set of conditions. I kept getting a memory error until I changed that value, and now my kernel keeps crashing. My question is: is there a better way to write this code that avoids memory problems?
import numpy as np
import pandas as pd

df = pd.read_csv(".csv")
cols = list(df.select_dtypes(include=[np.number]).columns)
mask = (df["column1"] <= 0) & (df["column2"] == 0)
df.loc[mask, df[cols]] = np.nan
The two columns used for the mask are not included in the cols list, and I've also tried one column at a time. I run into a MemoryError every time, and I've tried running it through Terality with the same issue.
The error is:
MemoryError: Unable to allocate 10.5 GiB for an array with shape (37509, 37509) and data type float64.
The following code does not work either, for the list of columns or for an individual column (I understand why it won't work, given the copy-vs-view issue):
df[mask][cols].replace(0, np.nan, inplace=True)
If anyone would be willing to help explain a solution or even just explain the problem, I would greatly appreciate it.
DataFrame.loc accepts either booleans or labels:
Access a group of rows and columns by label(s) or a boolean array.
Currently the column indexer is an entire dataframe, df[cols]; pandas then tries to align that whole frame as an indexer against df, which is what allocates the huge (37509, 37509) intermediate reported in the error:
df.loc[mask, df[cols]] = np.nan
# ^^^^^^^^
Instead of df[cols], use just the cols list:
df.loc[mask, cols] = np.nan
# ^^^^
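To see the fix in action, here is a minimal sketch with made-up column names (column1, column2, a, b stand in for the real data):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "column1": [-1, 2, 0, -3],
    "column2": [0, 0, 1, 0],
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [5.0, 6.0, 7.0, 8.0],
})
cols = ["a", "b"]  # numeric columns to blank out, excluding the mask columns

mask = (df["column1"] <= 0) & (df["column2"] == 0)
df.loc[mask, cols] = np.nan  # rows 0 and 3 of a/b become NaN; no giant intermediate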
I'm hoping to get someone's advice on a problem I'm running into while trying to apply a function over columns in a dataframe, a function that inverts the values in those columns.
For example, if the observation is 0 and the max of the column is 7, I take the absolute value of the observation minus the max: abs(0 - 7) = 7, so the smallest value becomes the largest.
All of the columns essentially have a range similar to the above example. The shape of the sliced df is (16984, 512).
The code I have written creates a bunch of empty columns that are then filled with the max values of the corresponding columns. The new shape becomes (16984, 1029), including the 5 columns that I sliced off before. Then I use a lambda to apply the function over the columns in question:
# create max cols
col = df.iloc[:, 5:]
col_names = col.columns
maximum = '_max'
for col in df[col_names]:
    max_value = df[col].max()
    df[col + maximum] = np.zeros((16984,))
    df[col + maximum].replace(to_replace=0, value=max_value)
# for each row and column, invert the value of the row
def invert_col(x, col):
    """Invert values of a column"""
    return abs(x[col] - x[col + "_max"])

for col in col_names:
    new_df = df.apply(lambda x: invert_col(x, col), axis=1)
I've tried this both with axis=1 included and with it removed, and the behaviour is quite different in each case. I am fairly new to Python, so I'm finding it difficult to troubleshoot why this is happening.
When I remove axis = 1, the error I get is a key error: KeyError: 'TV_TIME_LIVE'
TV_TIME_LIVE is the first column in col_names, so it's as if it's not finding it.
When I include axis = 1, I don't get an error, but all the columns in the df get flattened into a Series, with length equal to the original df.
What I'm expecting is a new_df with the same shape (16984, 1029), where the values of the 5th to the 517th columns have the inverse function applied to them.
I would really appreciate any guidance as to what's going on here and how we might get to the desired output.
Many thanks
apply is slow. It is better to use vectorized approaches as below.
axis=1 means the function receives each row (as a Series indexed by the column names); with the default axis=0 it receives each column instead. That is why you get the KeyError when you remove axis=1: x is then a whole column, and x['TV_TIME_LIVE'] looks for 'TV_TIME_LIVE' in the row index, where no such label exists. If you really must use apply, it is worth working through a few examples of exactly how it behaves.
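As a quick illustration (toy frame, names invented here) of what apply hands to the function under each axis:

import pandas as pd

demo = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Default axis=0: the function receives each column as a Series
print(demo.apply(lambda s: s.name))          # 'a', 'b'

# axis=1: the function receives each row, indexed by the column names,
# so s["a"] works here; with axis=0 the same lookup raises a KeyError
print(demo.apply(lambda s: s["a"], axis=1))  # 1, 2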
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 7, size=(100, 4)), columns=list('ABCD'))
col_list = df.columns.copy()
for col in col_list:
    df[col + "inversed"] = abs(df[col] - df[col].max())
I was replacing values in columns and noticed that if I use mask on the whole dataframe, it produces the expected results, but if I use it against selected columns with .loc, it doesn't change any values.
Can you explain why and tell if it is expected result?
You can try this with a dataframe dt that contains zeros in its columns:
import numpy as np
import pandas as pd

dt = pd.DataFrame(np.random.randint(0, 3, size=(10, 3)), columns=list('ABC'))
dt.mask(lambda x: x == 0, np.nan, inplace=True)
# replaces all zeros with NaN, OK
But:
dt = pd.DataFrame(np.random.randint(0, 3, size=(10, 3)), columns=list('ABC'))
columns = list('BC')
dt.loc[:, columns].mask(lambda x: x == 0, np.nan, inplace=True)
# won't change anything. I expect the B and C columns to have their values replaced
I guess it's because the DataFrame.loc property is just giving access to a slice of your dataframe, and you are masking a copy of it, so the original data is not affected.
You can try this instead:
dt[columns] = dt[columns].mask(dt[columns] == 0)
Selecting columns with loc returns a copy of that part of the dataframe. On this copy you are applying the mask function, which performs the operation in place on the copy's data. You can't do this as a one-liner with inplace=True, because the modified copy is immediately discarded. To keep a handle on it, split the code into two lines so you hold a reference to the copy:
tmp = dt.loc[:, columns]
tmp.mask(tmp[columns] == 0, np.nan, inplace=True)
and then you can go and update the dataframe:
dt[columns] = tmp
If, on the other hand, you skip the in-place update of mask, you can do everything with one line of code:
dt[columns] = dt.loc[:, columns].mask(dt[columns] == 0, np.nan, inplace=False)
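Putting both variants together in one runnable sketch (same dt frame as above):

import numpy as np
import pandas as pd

dt = pd.DataFrame(np.random.randint(0, 3, size=(10, 3)), columns=list('ABC'))
columns = list('BC')

# Silently does nothing: .loc[:, columns] materializes a copy, and mask
# edits that copy in place before it is thrown away
dt.loc[:, columns].mask(lambda x: x == 0, np.nan, inplace=True)

# Works: mask a copy, then assign the result back into the original frame
dt[columns] = dt[columns].mask(dt[columns] == 0)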
Extra:
If you want to better understand the use of the inplace method in pandas, I recommend you read these posts:
Understanding inplace=True in pandas
In pandas, is inplace = True considered harmful, or not?
What is the difference between using loc and using just square brackets to filter for columns in Pandas/Python?
This is a strange problem, and I am not able to produce an MVE.
I have two datasets in pandas. They contain some Series that can take three values: "Yes", "No", and NaN. These Series have dtype object at this point.
I want to remove the NaNs and prepare the data for use by ML algorithms, so I do this:
final_df1 = d1.dropna(how='any').replace({"Yes":1, "No":0})
final_df2 = d2.dropna(how='any').replace({"Yes":1, "No":0})
In final_df1, the dtype of the Series I mentioned automatically becomes int64 after dropping the NaN values and replacing the strings. In final_df2, this does not happen. They contain the same values (0 and 1), so I really do not understand this.
In order to create a Minimum Viable Example, I tried to
Isolate the Series, do the transformation on them one by one and check the results
Take only a small portion of the Dataframes
Save the DFs on disk and work on them from another script to recreate the problem
But in all of those attempts the result was different from what I see in the full program: either both DFs ended up with object-dtype Series, or both ended up with int64.
This matters to me because later on I need the dummies of those DFs, and if a Series that is int in one DF is object in the other, the columns will not match. The problem is easy to solve (I just need to cast explicitly), but I still have one doubt I would like to confirm:
If I replace the contents of an object Series (without NaNs) with numbers, is it essentially random whether the Series gets cast to int64?
I see this as the only explanation for what I am facing
Thanks in advance. If you find any way to clarify my question, please edit or comment
EDIT 1: Screenshots from Spyder
This is the code. I am printing to the console the most relevant data: dtype, values, and number of nulls.
This is the output before the Drop/Replace. I could have printed something nicer to read, but the idea is simple: before the Drop/Replace they both have null values, they both have "Yes" and "No" values, and they are both object-dtype Series.
Aaaaand this is after the Drop/Replace. As you can see, neither has nulls now, they both hold 1/0, but one of them is an object Series and the other is an int64 Series.
I really do not understand: they were the same type before!
Here is a sample to reproduce the behaviour.
If you change the '0' in col_1 to 0, the resulting dtype will change: replace only infers a numeric dtype when every remaining value in the column is numeric, so a single leftover string like '0' keeps the whole column as object.
import pandas as pd
import numpy as np

data = {'col_1': ['Yes', 'No', np.nan, '0'], 'col_2': [np.nan, 'Yes', 'Yes', 'No']}
df = pd.DataFrame.from_dict(data)
d1 = df[['col_1']]
d2 = df[['col_2']]
print(d1.dtypes)  # object
print(d2.dtypes)  # object
final_df1 = d1.dropna(how='any').replace({"Yes": 1, "No": 0})
final_df2 = d2.dropna(how='any').replace({"Yes": 1, "No": 0})
print(final_df1.dtypes)  # still object: the '0' string survived the replace
print(final_df2.dtypes)  # int64: every value is now numeric
You can also convert the dtypes explicitly in the final_df definitions:
final_df1 = d1.dropna(how='any').replace({"Yes":1, "No":0}).astype(int)
final_df2 = d2.dropna(how='any').replace({"Yes":1, "No":0}).astype(int)
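A quick way to check, before building dummies, that no column silently stayed object; a small sketch using the final_df frames from the reproduction above (before the astype):

# Any column still object after the replace kept at least one non-numeric
# value (e.g. a stray '0' string), so pandas never inferred an int dtype
print(final_df1.select_dtypes(include='object').columns)  # ['col_1']
print(final_df2.select_dtypes(include='object').columns)  # empty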
I have a pandas dataframe and want to get rid of rows in which the column 'A' is negative. I know 2 ways to do this:
df = df[df['A'] >= 0]
or
selRows = df[df['A'] < 0].index
df = df.drop(selRows, axis=0)
What is the recommended solution? Why?
The recommended solution is the most efficient one, which in this case is the first.
df = df[df['A'] >= 0]
On the second solution
selRows = df[df['A'] < 0].index
df = df.drop(selRows, axis=0)
you are repeating the slicing process. But let's break it into pieces to understand why.
When you write
df['A'] >= 0
you are creating a mask: a Boolean Series with an entry for each index of df, whose value is either True or False depending on a condition (in this case, whether the value of column 'A' at that index is greater than or equal to 0).
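For instance, on a toy frame (invented for illustration):

import pandas as pd

df = pd.DataFrame({'A': [3, -1, 0, -7]})
print(df['A'] >= 0)
# 0     True
# 1    False
# 2     True
# 3    False
# Name: A, dtype: bool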
When you write
df[df['A'] >= 0]
you are accessing the rows for which your mask (df['A'] >= 0) is True. This is a slicing method supported by pandas that lets you select certain rows by passing a Boolean Series; it returns a new DataFrame containing only the entries for which the Series was True.
Finally, when you write this
selRows = df[df['A'] < 0].index
df = df.drop(selRows, axis=0)
you are repeating the process, because
df[df['A'] < 0]
is already slicing your DataFrame (in this case into the rows you want to drop). You are then taking those indices, going back to the original DataFrame, and explicitly dropping them. There is no need for this; you already sliced the DataFrame in the first step.
df = df[df['A'] >= 0]
is indeed the faster solution. Just be aware that the result is derived from slicing the original data frame, and pandas cannot always tell whether it shares memory with it. This can lead you into trouble, for example when you want to change its values, as pandas will give you the SettingWithCopyWarning.
The simple fix of course is what Wen-Ben recommended:
df = df[df['A'] >= 0].copy()
Your question is like this: "I have two identical cakes, but one has icing. Which has more calories?"
The second solution does the same thing, but twice. A single filtering step is enough; there's no need to filter and then call a function that repeats the exact same work the filtering step already did.
To clarify: either way you are doing the same thing, generating a boolean mask and then indexing with it.
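If you want to measure the difference yourself, a quick (unscientific) timing sketch:

import numpy as np
import pandas as pd
from timeit import timeit

df = pd.DataFrame({'A': np.random.randn(1_000_000)})

t_filter = timeit(lambda: df[df['A'] >= 0], number=10)
t_drop = timeit(lambda: df.drop(df[df['A'] < 0].index, axis=0), number=10)
print(f"filter: {t_filter:.3f}s  drop: {t_drop:.3f}s")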
When I do this:
import pandas as pd

table = {'x': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
         'y': [1, 1, 2, 2, 2, 1, 2, 3, 4, 5, 1, 2, 2, 2, 3],
         'z': [0, 0, 2, 2, 0, 1, 2, 0, 4, 5, 0, 2, 0, 2, 3],
         'type': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'c']}
df = pd.DataFrame(table, columns=['x', 'y', 'z', 'type'])
mask = df.z == 0
df.x[mask] = 1./df.y[mask]
I get the desired behavior, but pandas complains and says:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df.x[mask] = 1./df.y[mask]
Now, this is just a tiny little df, so I can make the warning go away by making the changes in column 'x' row by row with iloc or suchlike. But in my actual data analysis program the df is on the larger side, so the row-by-row iloc approach slows things down quite a bit.
Is there a better way to get the changes made in column x, using column y values, only in rows where a condition is true in column z?
Thanks!
Use loc to avoid chained indexing, and in particular assignment through a chained index:
df.loc[mask, 'x'] = 1. / df.loc[mask, 'y']
That said, you could use chained indexing for the values you are reading. You only get the warning when you assign to an object obtained through chained indexing.
This works as well
df.x.values[mask] = 1. / df.y[mask]
As well as
df.loc[mask, 'x'] = 1. / df.y[mask]
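A minimal self-contained version of the fix (x is already float here, so no dtype upcast is involved):

import pandas as pd

df = pd.DataFrame({'x': [1., 2., 3.], 'y': [1, 2, 4], 'z': [0, 1, 0]})
mask = df.z == 0
df.loc[mask, 'x'] = 1. / df.loc[mask, 'y']  # no chained assignment, no warning
print(df)  # x becomes [1.0, 2.0, 0.25]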