Replace values in dataframe with values from series of different length - python

I would like to replace values in a column of a dataframe with values from a series. The catch is that I only want to replace values that are designated by a mask and the series does not have the same length as the dataframe.
More specifically, I want to replace all the values that are not null with values from a series that contains one value for each non-null value in the dataframe.
Assume the column in the dataframe contains [1,2,3,NaN,5] and the series contains [2,4,6,10]
I naively thought that this might work
df[pd.notna(df)] = s
But it will make the column look like [1,2,3,NaN,NaN]. I understand why it behaves this way, but I need to find something that will give me this: [2,4,6,NaN,10]

The approach you tried can work, but with some changes:
Update a single column, not the whole DataFrame.
To sidestep the mismatched index values, assign the raw values of the "updating" Series rather than the Series itself.
To show how to do it, let's define the DataFrame (df) as:
A B
0 1.0 11.0
1 2.0 12.0
2 3.0 13.0
3 NaN NaN
4 5.0 15.0
and the "updating" Series (upd) as:
11 2
12 4
13 6
14 10
dtype: int64
As you can see, indices in df and upd are different.
To update, e.g., column A in df the way you want, run:
df.loc[df.A.notna(), 'A'] = upd.values
The result is:
A B
0 2.0 11.0
1 4.0 12.0
2 6.0 13.0
3 NaN NaN
4 10.0 15.0
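The same update can be sketched as a self-contained script. The names mask and result are illustrative and not part of the original answer; the length check is an added safeguard, since positional assignment only makes sense when the Series has exactly one value per non-null cell.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, np.nan, 5], "B": [11, 12, 13, np.nan, 15]})
upd = pd.Series([2, 4, 6, 10], index=[11, 12, 13, 14])

# Boolean mask selecting the rows of column A that should be overwritten
mask = df["A"].notna()

# Added safeguard (assumption): one replacement value per selected row
assert mask.sum() == len(upd)

# Assign the raw values so pandas does not try to align on the index
df.loc[mask, "A"] = upd.to_numpy()
result = df["A"].tolist()
```

Using .loc with a row mask and a column label performs the assignment in a single indexing step, which avoids the chained-assignment pitfalls of df.A[...] = ....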

Related

How to filter multiple columns/rows in a python dataframe?

I have a dataframe called ski_data with two columns, AdultWeekend and AdultWeekday, which hold the weekend and weekday prices for each ski resort.
However, for some of the resorts both price columns are NaN, and I need to filter those rows out.
My approach was this:
Step 1)
Create a Series that gives each ski resort a value of 0, 1, or 2 based on how many of its prices are NaN:
missing_data = ski_data[['AdultWeekend', 'AdultWeekday']].isnull().sum(axis=1)
Step 2)
Create a counter variable and an empty list. Then iterate over missing_data and, if the value == 2, append the counter to the list, which keeps track of the corresponding index in the ski_data dataframe.
counter = 0
missingList = []
for x in missing_data:
    if x == 2:
        missingList.append(counter)
    counter += 1
Step 3)
Iterate over the list and drop each index it contains from the ski_data dataframe.
for i in missingList:
    ski_data.drop(labels=[i], axis=0, inplace=True)
However, I get multiple errors. One of them involves the first index label appended to missingList, which is 98: ski_data.loc[98] raises a KeyError. Can anyone explain or help me?
https://github.com/seungsooim32/stackoverflow/blob/main/02_data_wrangling.ipynb
code task #28
You can try:
ski_data.dropna(how="all", subset=['AdultWeekend', 'AdultWeekday'])
This will create a new dataframe with NaN rows dropped. If you want to modify the existing dataframe, add the argument inplace=True.
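One plausible cause of the KeyError, stated here as an assumption because the notebook's data is not shown: the counter tracks row positions, while drop and .loc expect index labels, and the two differ once the index is no longer 0..n-1. A minimal sketch with hypothetical data that selects the labels directly:

```python
import numpy as np
import pandas as pd

# Hypothetical data whose index labels are not 0..n-1 (e.g. after earlier drops)
ski_data = pd.DataFrame(
    {"AdultWeekend": [10.0, np.nan, 20.0], "AdultWeekday": [15.0, np.nan, 19.0]},
    index=[0, 99, 100],
)

# Number of missing prices per resort (0, 1, or 2)
missing_data = ski_data[["AdultWeekend", "AdultWeekday"]].isnull().sum(axis=1)

# Take the index labels of the fully-missing rows instead of counting positions
labels_to_drop = missing_data[missing_data == 2].index
cleaned = ski_data.drop(index=labels_to_drop)
```

Here the row to drop sits at position 1 but carries the label 99, so a position counter would point at a label that does not exist.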
Instead of dropping the NaN rows, select the rows you want to keep:
>>> ski_data
AdultWeekend AdultWeekday
0 10.0 15.0
1 NaN 12.0
2 13.0 NaN
3 20.0 19.0
4 NaN NaN
>>> ski_data.loc[ski_data[['AdultWeekend', 'AdultWeekday']].notna().any(axis=1)]
AdultWeekend AdultWeekday
0 10.0 15.0
1 NaN 12.0
2 13.0 NaN
3 20.0 19.0
>>> ski_data.loc[~ski_data[['AdultWeekend', 'AdultWeekday']].isna().all(axis=1)]
AdultWeekend AdultWeekday
0 10.0 15.0
1 NaN 12.0
2 13.0 NaN
3 20.0 19.0
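As a quick check, the sketch below builds the sample frame shown above and compares dropna(how="all") with the two boolean-mask selections. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

ski_data = pd.DataFrame({
    "AdultWeekend": [10.0, np.nan, 13.0, 20.0, np.nan],
    "AdultWeekday": [15.0, 12.0, np.nan, 19.0, np.nan],
})
price_cols = ["AdultWeekend", "AdultWeekday"]

# Drop rows where every price column is NaN
dropped = ski_data.dropna(how="all", subset=price_cols)

# Keep rows where at least one price is present
kept_any = ski_data.loc[ski_data[price_cols].notna().any(axis=1)]

# Keep rows where it is not the case that all prices are missing
kept_not_all = ski_data.loc[~ski_data[price_cols].isna().all(axis=1)]
```

All three expressions return a new dataframe and leave ski_data unchanged.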

Fill NaN values with mean of previous rows?

I have to fill the nan values of a column in a dataframe with the mean of the previous 3 instances.
Here is the following example:
df = pd.DataFrame({'col1': [1, 3, 4, 5, np.nan, np.nan, np.nan, 7]})
df
col1
0 1.0
1 3.0
2 4.0
3 5.0
4 NaN
5 NaN
6 NaN
7 7.0
And here is the output I need:
col1
0 1.0
1 3.0
2 4.0
3 5.0
4 4.0
5 4.3
6 4.4
7 7.0
I tried rolling, but it does not work the way I want when a window contains more than one NaN value:
df.fillna(df.rolling(3, min_periods=1).mean().shift())
col1
0 1.0
1 3.0
2 4.0
3 5.0
4 4.0 # np.nanmean([3, 4, 5])
5 4.5 # np.nanmean([np.nan, 4, 5])
6 5.0 # np.nanmean([np.nan, np.nan, 5])
7 7.0
Can someone help me with that? Thanks in advance!
Probably not the most efficient, but it is terse and gets the job done:
from functools import reduce
reduce(lambda d, _: d.fillna(d.rolling(3, min_periods=3).mean().shift()), range(df['col1'].isna().sum()), df)
output
col1
0 1.000000
1 3.000000
2 4.000000
3 5.000000
4 4.000000
5 4.333333
6 4.444444
7 7.000000
We basically use fillna but require min_periods=3, meaning each pass fills only those NaNs that have three non-NaN numbers immediately preceding them, so one NaN per run at a time. Then we use reduce to repeat this operation as many times as there are NaNs in col1.
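To make the mechanism visible, here is a small sketch that separates a single pass from the full reduce. The helper name fill_once is hypothetical and only wraps the lambda from the answer.

```python
from functools import reduce

import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 3, 4, 5, np.nan, np.nan, np.nan, 7]})

def fill_once(d, _):
    # Fill only NaNs preceded by three non-NaN values (min_periods=3)
    return d.fillna(d.rolling(3, min_periods=3).mean().shift())

# A single pass fills just the first NaN of the run
one_pass = fill_once(df, None)

# Repeating the pass once per NaN fills the whole run
n_missing = int(df["col1"].isna().sum())
filled = reduce(fill_once, range(n_missing), df)
```

After one pass two NaNs remain; each further pass can use the value filled by the previous one.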
I tried two approaches to this problem. One is a loop over the dataframe, and the second is essentially trying the approach you suggest multiple times, to converge on the right answer.
Loop approach
For each row in the dataframe, get the value from col1. Then, take the average of the previous 3 values. (There can be fewer than 3 in this list if we're at the beginning of the dataframe.) If the value is NaN, replace it with that average. Then, save the value back into the dataframe. If the list of previous values has more than 3 entries, remove the oldest one.
def impute(df2, col_name):
    # Sliding window of the (up to) 3 most recent values, already imputed
    last_3 = []
    for index in df2.index:
        val = df2.loc[index, col_name]
        if len(last_3) > 0:
            imputed = np.nanmean(last_3)
        else:
            # No previous values yet, so there is nothing to impute from
            imputed = np.nan
        if np.isnan(val):
            val = imputed
        last_3.append(val)
        df2.loc[index, col_name] = val
        if len(last_3) > 3:
            last_3.pop(0)
Repeated column operation
The core idea here is to notice that in your example of pd.rolling, the first NA replacement value is correct. So, you apply the rolling average, take the first NA value for each run of NA values, and use that number. If you apply this repeatedly, you fill in the first missing value, then the second missing value, then the third. You'll need to run this loop as many times as the longest series of consecutive NA values.
def impute(df2, col_name):
    # Note: a NaN in the very first row has no previous values and
    # would keep this loop running forever
    while df2[col_name].isna().any():
        is_na = df2[col_name].isna()
        # If there are multiple NA values in a row, identify just
        # the first one
        first_na = is_na & ~is_na.shift(fill_value=False)
        # Compute mean of previous 3 values
        imputed = df2[col_name].rolling(3, min_periods=1).mean().shift()
        # Replace NA values with mean if they are very first NA
        # value in run of NA values
        df2.loc[first_na, col_name] = imputed
Performance comparison
Running both of these on an 80000 row dataframe, I get the following results:
Loop approach takes 20.744 seconds
Repeated column operation takes 0.056 seconds
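A rough way to reproduce such a comparison is sketched below. The function names impute_loop and impute_vectorised are hypothetical renamings of the two approaches, the data is the small example from the question rather than the 80000-row frame, and the timings will differ by machine, so none are quoted here.

```python
import time

import numpy as np
import pandas as pd

def impute_loop(frame, col_name):
    """Row-by-row version: fill each NaN with the mean of up to 3 previous values."""
    last_3 = []
    for index in frame.index:
        val = frame.loc[index, col_name]
        if np.isnan(val) and last_3:
            val = float(np.nanmean(last_3))
            frame.loc[index, col_name] = val
        last_3.append(val)
        if len(last_3) > 3:
            last_3.pop(0)

def impute_vectorised(frame, col_name):
    """Repeated column operation: fill the first NaN of every run on each pass."""
    # Assumes the column does not start with NaN (the loop would not terminate)
    while frame[col_name].isna().any():
        is_na = frame[col_name].isna()
        first_na = is_na & ~is_na.shift(fill_value=False)
        imputed = frame[col_name].rolling(3, min_periods=1).mean().shift()
        frame.loc[first_na, col_name] = imputed

sample = pd.DataFrame({"col1": [1, 3, 4, 5, np.nan, np.nan, np.nan, 7]})
a, b = sample.copy(), sample.copy()

start = time.perf_counter()
impute_loop(a, "col1")
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
impute_vectorised(b, "col1")
vectorised_seconds = time.perf_counter() - start
```

Checking that both frames agree before comparing loop_seconds and vectorised_seconds guards against timing two functions that compute different things.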

combine rows with identical index

How do I combine the values from two rows that share the same index label and have no overlapping non-null values?
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,None,None],[None,5,6]],index=['a','b','b'])
df
#input
0 1 2
a 1.0 2.0 3.0
b 4.0 NaN NaN
b NaN 5.0 6.0
Desired output
0 1 2
a 1.0 2.0 3.0
b 4.0 5.0 6.0
Use stack(), which drops all NaNs, followed by unstack():
df.stack().unstack()
If it is possible to simplify the solution to the first non-missing value per index label, use GroupBy.first:
df1 = df.groupby(level=0).first()
For the sample data, summing per index label gives the same output:
df1 = df.groupby(level=0).sum()
If there are multiple non-missing values per group, you need to specify the expected output; that case is obviously more complicated.
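A short sketch comparing the two groupby variants on the question's data follows. The variable names are illustrative; the two results agree here only because each column has at most one non-missing value per index label.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, None, None], [None, 5, 6]],
    index=["a", "b", "b"],
)

# First non-missing value per column within each index label
combined_first = df.groupby(level=0).first()

# Sum per index label; NaNs are skipped
combined_sum = df.groupby(level=0).sum()
```

With several non-missing values in one group, first() would keep the earliest and sum() would add them, so the choice depends on the intended result.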

Count all NaNs in a pandas DataFrame

I'm trying to count the NaN elements (data type class 'numpy.float64') in a pandas Series (data type class 'pandas.core.series.Series') to know how many there are.
This is my attempt to count the null values in the Series:
import pandas as pd
oc=pd.read_csv(csv_file)
oc.count("NaN")
I expected the output of oc.count("NaN") to be 7, but it shows 'Level NaN must be same as name (None)'.
The argument to count isn't what you want counted (it's actually the axis name or index).
You're looking for df.isna().values.sum() (to count NaNs across the entire DataFrame), or len(df) - df['column'].count() (to count NaNs in a specific column).
You can use either of the following if your Series.dtype is float64:
oc.isin([np.nan]).sum()
oc.isna().sum()
If your Series is of mixed data-type you can use the following:
oc.isin([np.nan, 'NaN']).sum()
oc.size returns the total number of elements in the dataframe, including NaN.
oc.count().sum() returns the total number of elements in the dataframe, excluding NaN.
Therefore, another way to count the NaNs in a dataframe is to subtract one from the other:
NaN_count = oc.size - oc.count().sum()
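The counting options mentioned so far can be compared on a small frame. The data and the column names x and y below are hypothetical stand-ins for whatever the CSV file contains.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the data read from the CSV file
oc = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, np.nan, 6.0]})

# Count NaNs directly with isna()
nan_by_isna = int(oc.isna().sum().sum())

# Count NaNs as total elements minus non-NaN elements
nan_by_subtraction = int(oc.size - oc.count().sum())

# Count NaNs in a single column
nan_in_column = int(len(oc) - oc["x"].count())
```

Both whole-frame counts should agree, since count() excludes exactly the values that isna() flags.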
Just for fun, you can do either
df.isnull().sum().sum()
or
len(df)*len(df.columns) - len(df.stack())
If your dataframe looks like this:
aa = pd.DataFrame(np.array([[1,2,np.nan],[3,np.nan,5],[8,7,6],
[np.nan,np.nan,0]]), columns=['a','b','c'])
a b c
0 1.0 2.0 NaN
1 3.0 NaN 5.0
2 8.0 7.0 6.0
3 NaN NaN 0.0
To count 'nan' by cols, you can try this
aa.isnull().sum()
a 1
b 2
c 1
For total count of nan
aa.isnull().values.sum()
4

Reassign NaN based on column name in Pandas

I have a data frame with several NaN values like so:
first second
0 1.0 3.0
1 2.0 NaN
2 NaN 5.0
and another with lookup values:
fill
second 200
first 100
Is there a way to replace the NaN values with the fill values based on the column name to get this?:
first second
0 1.0 3.0
1 2.0 200
2 100 5.0
This is just an example, as I'm trying to do it on a much larger dataframe. I know that I can rearrange the fields in the dataframes so that the indices match up and I could use pd.where, but I'm wondering if there's a way to make the match just based on column name.
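You can use pandas.DataFrame.fillna() for this. When the fill value is a Series (or dict), fillna matches its labels against the column names, so passing the fill column of the lookup frame works without rearranging anything.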
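A minimal sketch with the question's data, assuming the lookup values live in a dataframe named lookup whose index holds the column names (both names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"first": [1.0, 2.0, np.nan], "second": [3.0, np.nan, 5.0]})

# Lookup frame indexed by column name; the row order does not need to match df
lookup = pd.DataFrame({"fill": [200, 100]}, index=["second", "first"])

# Each column's NaNs are filled with the value stored under that column's name
filled = df.fillna(lookup["fill"])
```

Columns without an entry in the lookup Series are left untouched.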
