Possible optimization of going through pandas dataframe - python

I'm trying to find a way to optimize looping through a pandas dataframe. The dataset contains ~450k rows with ~20 columns. The dataframe has 3 locational variables as a multiindex, and within each group I want to drop all of the group's rows if any column is entirely NaN; otherwise I want to fill NaNs with the mean of the group.
LOC = ['market_id', 'midmarket_id', 'submarket_id']
# Assign -1000 to NaN values in the multiindex columns
df = df.fillna({c: -1000 for c in LOC})
df = df.set_index(LOC).sort_index(level=list(range(len(LOC))))
# Loop through each (market, midmarket, submarket) group
for k, v in df.copy().groupby(level=list(range(len(LOC)))):
    # If any column is entirely NaN within the group, drop the group's rows
    if v.isnull().all().any():
        df = df.drop(v.index.values)
    # Otherwise, fill NaNs with the group mean
    else:
        df.loc[v.index.values] = df.loc[v.index.values].fillna(v.mean())
So, given a dataframe like the one in the "before" screenshot, it should be converted like the "after" screenshot, removing the rows of any group that has an all-NaN column. (Screenshots not reproduced here.)
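Since the screenshots aren't available here, a hypothetical toy frame illustrating the intended rule (all values below are made up):
import numpy as np
import pandas as pd

# Hypothetical example; the original screenshots are not reproduced.
df = pd.DataFrame({
    'market_id':    [1, 1, 2, 2],
    'midmarket_id': [1, 1, 1, 1],
    'submarket_id': [1, 1, 1, 1],
    'x':            [1.0, np.nan, np.nan, np.nan],
    'y':            [2.0, 4.0, 5.0, np.nan],
})
# Group (1, 1, 1): every column has at least one value, so NaNs are
# filled with the group means ('x' -> 1.0; 'y' stays [2.0, 4.0]).
# Group (2, 1, 1): 'x' is all NaN, so both of its rows are dropped.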
I apologize if this is redundant or not in accordance with the Stack Overflow question guidelines, but if anyone has a better solution, I would greatly appreciate it.
Thanks in advance.

There's no need to copy your entire dataframe. Nor is there a need to iterate GroupBy elements manually. Here's an alternative solution:
import numpy as np

LOC = ['market_id', 'midmarket_id', 'submarket_id']
# Assign -1000 to NaN values in the locational columns only
df = df.fillna({c: -1000 for c in LOC})
# Keep only columns containing at least one non-null value
non_nulls = np.where(df.notnull().any())[0]
df = df.iloc[:, non_nulls]
# Fill each remaining column with its respective groupwise mean
g = df.groupby(LOC)
for col in df.columns.difference(LOC):
    df[col] = df[col].fillna(g[col].transform('mean'))
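For reference, groupby(...).transform('mean') returns a Series aligned with the original index, which is what lets fillna line up row by row. A minimal sketch with made-up values:
import numpy as np
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1.0, np.nan, 2.0]})
means = df.groupby('g')['x'].transform('mean')  # index-aligned: 1.0, 1.0, 2.0
df['x'] = df['x'].fillna(means)                 # the NaN in group 'a' becomes 1.0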

Related

Is there a way to return a pandas dataframe with a modified column?

Say I have a dataframe df with column "age".
Say "age" has some NaN values and I want to create two new dataframes, dfMean and dfMedian, which fill in the NaN values differently.
This is the way I would do it:
# Step 1:
dfMean = df
dfMean["age"].fillna(df["age"].mean(), inplace=True)
# Step 2:
dfMedian = df
dfMedian["age"].fillna(df["age"].median(), inplace=True)
I'm curious whether there's a way to do each of these steps in one line instead of two, by returning the modified dataframe without needing to copy the original. But I haven't been able to find anything so far. Thanks, and let me know if I can clarify or if you have a better title in mind for the question :)
Doing dfMean = dfMean["age"].fillna(df["age"].mean()) creates a Series, not a DataFrame.
To add two new Series (=columns) to your DataFrame, use:
df2 = df.assign(
    age_fill_mean=df["age"].fillna(df["age"].mean()),
    age_fill_median=df["age"].fillna(df["age"].median()),
)
Alternatively, you can use pandas.DataFrame.agg(), which will "aggregate using one or more operations over the specified axis":
df.agg({'age': ['mean', 'median']})
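Note that agg only computes the statistics; if you then want to feed them back into fillna, one possible (hypothetical) follow-up is:
stats = df.agg({'age': ['mean', 'median']})  # DataFrame indexed by 'mean'/'median'
dfMean = df.fillna({'age': stats.loc['mean', 'age']})
dfMedian = df.fillna({'age': stats.loc['median', 'age']})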
No need to copy; define the two new DataFrames with DataFrame.fillna, passing a dictionary to specify the column whose missing values should be replaced:
dfMean = df.fillna({'age': df["age"].mean()})
dfMedian = df.fillna({'age': df["age"].median()})
In one line:
dfMean, dfMedian = df.fillna({'age': df["age"].mean()}), df.fillna({'age': df["age"].median()})
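A quick check with made-up values, showing that each fillna call returns a new frame and the two results differ:
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [10.0, np.nan, 20.0, 90.0]})
dfMean, dfMedian = df.fillna({'age': df['age'].mean()}), df.fillna({'age': df['age'].median()})
# dfMean['age']   -> [10.0, 40.0, 20.0, 90.0]  (mean of 10, 20, 90 is 40)
# dfMedian['age'] -> [10.0, 20.0, 20.0, 90.0]  (median is 20)
# df itself still contains the NaN.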

Pandas check if cell is null in any of two dataframes and if it is, make both cells nulls

I have two dataframes with same shape:
>>> df1.shape
(400,1200)
>>> df2.shape
(400,1200)
I would like to compare cell-by-cell and if a value is missing in one of the dataframes make the equivalent value in the other dataframe NaN as well.
Here's a (pretty inefficient) piece of code that works:
for i in df1.columns:           # iterate over columns
    for j in range(len(df1)):   # iterate over rows
        if pd.isna(df1[i][j]) or pd.isna(df2[i][j]):
            df1[i][j] = np.nan
            df2[i][j] = np.nan
What would be a better way to do this? I'm sure there is one.
This is a simple problem to solve with pandas. You can use this code:
df1[df2.isna()] = df2[df1.isna()] = np.nan
It first creates a mask of df2, i.e., a copy of the dataframe containing only True or False values. Each NaN in df2 will have True in the mask, and every other value will have False.
With pandas, you can use such masks to do bulk operations: pass the mask to df1's [] and assign a value, and wherever the mask is True, the corresponding value in df1 is set to that value.
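A quick demonstration on tiny frames with made-up values:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1.0, np.nan, 3.0]})
df2 = pd.DataFrame({'a': [np.nan, 2.0, 3.0]})
df1[df2.isna()] = df2[df1.isna()] = np.nan
# df1['a'] -> [NaN, NaN, 3.0]
# df2['a'] -> [NaN, NaN, 3.0]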

range(1:len(df)) assigns NaN to last rows in dataframe

I have this weird problem with my code. I am trying to generate an auto ID for my dataframe with this code:
df['id'] = pd.Series(range(1, len(df) + 1)).astype(str).apply('{:0>8}'.format)
Now, len(df) equals 799734,
but df['id'] is NaN after row 77998.
I tried to print the values using:
[print(i) for i in range(1, len(df) + 1)]
On the first attempt it printed None after 77998 values. On the second attempt it printed all values to the end normally, but the dataframe still has NaN in the last rows.
Maybe it has something to do with memory? I am not getting any hint. Please help me solve this issue.
The missing values mean the Series and the DataFrame have different index values; for this to work correctly they need to match.
So pass df.index to the Series constructor:
df['id'] = pd.Series(range(1, len(df) + 1), index=df.index).astype(str).apply('{:0>8}'.format)
Or use a two-line solution, assigning the range first:
df['id'] = range(1, len(df) + 1)
df['id'] = df['id'].astype(str).apply('{:0>8}'.format)
Or reset the DataFrame to the default index so it matches the Series:
df = df.reset_index(drop=True)
df['id'] = pd.Series(range(1, len(df) + 1)).astype(str).apply('{:0>8}'.format)
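To see why the alignment matters, here is a small sketch with a non-default index (values made up):
import pandas as pd

df = pd.DataFrame({'x': [10, 20]}, index=[100, 200])
df['bad'] = pd.Series(range(1, len(df) + 1))                   # Series index is 0..1 -> all NaN
df['good'] = pd.Series(range(1, len(df) + 1), index=df.index)  # indexes match -> [1, 2]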

Dataframe sum(axis=1) is returning NaN values

I'm trying to make a sum of the second column ('ALL_PPA'), grouping by Numéro_département.
Here's my code:
df.fillna(0,inplace=True)
df = df.loc[:, ('Numéro_département','ALL_PPA')]
df = df.groupby('Numéro_département').sum(axis=1)
print(df)
My DF is full of numbers and I don't have any NaN values, but when I apply df.sum(axis=1), some rows come out with a NaN value.
(Screenshots showing the table before and after sum() are not reproduced here.)
My question is: how am I supposed to do this? I've tried the numpy library, but it doesn't work the way I want it to.
Drop the first row of that dataframe, as it just has the column names in it, and convert the values to int; right now they are objects because of the mixed data types:
df2 = df.iloc[1:].astype(int).copy()
Then apply groupby.sum(), specifying the column as well:
df3 = df2.groupby('Numéro_département')['ALL_PPA'].sum()
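A minimal sketch of the resulting aggregation, assuming the cleaned frame holds numeric values (data made up):
import pandas as pd

df2 = pd.DataFrame({'Numéro_département': ['01', '01', '02'],
                    'ALL_PPA': [1, 2, 3]})
df3 = df2.groupby('Numéro_département')['ALL_PPA'].sum()
# Numéro_département
# 01    3
# 02    3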
I think using .dropna() before summing the DF will help remove any rows or columns (depending on the axis you choose) with NaN values. Per the screenshot provided, also drop the first line of the DF, as it is a string.

Why do all dataframes become NaN when I just assign NaN to one of them

I created two dataframes from one dataframe's values. I'm modifying the two dataframes so that they have some NaN rows by index. Assigning NaN to the first dataframe worked, but when I do the same thing to the other one, all three dataframes become NaN.
I tried to use dataframe.values instead of the original dataframe to create the new dataframes, since I know that if you let b = a, then whatever you do to a will also be reflected in b. But it still does not work.
df1 = pd.read_csv(...)
df2 = pd.DataFrame(df1.values, index=df1.index, columns=['a'])
df3 = pd.DataFrame(df1.values, index=df1.index, columns=['a'])
results = [5, 6, 111, 112, 145, 148]  # an example for demonstration
ss_index = (list(df1.index[5:6]) + list(df1.index[111:112])
            + list(df1.index[145:148]))
nss_index = df1.index.difference(ss_index)
df2.loc[ss_index, :] = np.nan   # this sets all three dfs at ss_index to NaN
df3.loc[nss_index, :] = np.nan  # this sets all three dfs at nss_index to NaN
New edit: .copy is a super useful method; numpy, pandas, and many other libraries have .copy built in. If not, one can use the standard copy module.
The first assignment sets the ss_index values to np.nan, which are only indices [5, 111, 145, 146, 147]. The second one sets the nss_index indices to np.nan, which are the indices different from ss_index, i.e. basically all the remaining ones. Since df2 and df3 were built from df1.values without copying, they share df1's underlying data, so when you modify one of them, all of them are modified.
You can create a copy of the values in the DataFrame using the .copy() method:
df2 = df1.copy(deep=True)
Now df2 won't be affected by changes to df1.
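A small demonstration of the difference (whether the buffer is actually shared depends on the pandas version and copy-on-write settings):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1.0, 2.0]})
df2 = pd.DataFrame(df1.values, index=df1.index, columns=['a'])  # may share df1's buffer
df3 = df1.copy(deep=True)                                       # always an independent copy

df2.loc[0, 'a'] = np.nan
# Without copy-on-write, df1.loc[0, 'a'] is now NaN too; df3 is unchanged either way.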
