Reassign NaN based on column name in Pandas - python

I have a data frame with several NaN values like so:
first second
0 1.0 3.0
1 2.0 NaN
2 NaN 5.0
and another with lookup values:
fill
second 200
first 100
Is there a way to replace the NaN values with the fill values, matched on column name, to get this?
first second
0 1.0 3.0
1 2.0 200
2 100 5.0
This is just an example, as I'm trying to do it on a much larger dataframe. I know that I can rearrange the fields in the dataframes so that the indices match up and I could use pd.where, but I'm wondering if there's a way to make the match just based on column name.

You can use pandas.DataFrame.fillna() for this.
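For example, a minimal sketch built from the frames in the question (the name lookup is just for illustration): fillna accepts a Series, and its index is matched against the DataFrame's column names, so each column picks up its own fill value regardless of row order.

import pandas as pd
import numpy as np

# The frame with missing values and the lookup table from the question.
df = pd.DataFrame({'first': [1.0, 2.0, np.nan],
                   'second': [3.0, np.nan, 5.0]})
lookup = pd.DataFrame({'fill': [200, 100]}, index=['second', 'first'])

# Passing a Series to fillna fills each column with the value whose
# index label matches that column's name.
result = df.fillna(lookup['fill'])
print(result)
#    first  second
# 0    1.0     3.0
# 1    2.0   200.0
# 2  100.0     5.0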


How to filter multiple columns/rows in a python dataframe?

Okay, so I have a dataframe called ski_data that has two columns named AdultWeekend and AdultWeekday, which show the weekend and weekday price for each ski resort.
However, for some of the resorts both price columns are NaN, and I need to filter those out.
My approach was this:
Step 1)
create a new Series that gives each ski resort a value of 0, 1, or 2 based on how many of its two prices are NaN, so the code was
missing_data = ski_data[['AdultWeekend', 'AdultWeekday']].isnull().sum(axis=1)
Step 2)
Create a counter variable and an empty list, then iterate over missing_data; if the value == 2, append the counter to the list, which keeps track of the corresponding index in the ski_data dataframe.
counter = 0
missingList = []
for x in missing_data:
    if x == 2:
        missingList.append(counter)
    counter += 1
Step 3)
Iterate over the list and drop the row of ski_data at each index that was appended to it.
for i in missingList:
    ski_data.drop(labels=[i], axis=0, inplace=True)
However, I get multiple errors. One of them: the first index appended to missingList is 98, but ski_data.loc[98] raises a KeyError. Can anyone explain or help me?
https://github.com/seungsooim32/stackoverflow/blob/main/02_data_wrangling.ipynb
code task #28
You can try:
ski_data.dropna(how="all", subset=['AdultWeekend', 'AdultWeekday'])
This will create a new dataframe with NaN rows dropped. If you want to modify the existing dataframe, add the argument inplace=True.
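For example, a minimal sketch of the non-inplace usage, with the column names from the question (the reset_index step is an optional extra, not part of the answer above):

# Keep only resorts that have at least one of the two prices,
# assigning the result back instead of passing inplace=True.
ski_data = ski_data.dropna(how="all", subset=['AdultWeekend', 'AdultWeekday'])

# Optionally renumber the rows so labels match positions again, which
# avoids the label/position mismatch behind the original KeyError.
ski_data = ski_data.reset_index(drop=True)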
Don't drop NaNs; keep the right rows instead:
>>> ski_data
AdultWeekend AdultWeekday
0 10.0 15.0
1 NaN 12.0
2 13.0 NaN
3 20.0 19.0
4 NaN NaN
>>> ski_data.loc[ski_data[['AdultWeekend', 'AdultWeekday']].notna().any(axis=1)]
AdultWeekend AdultWeekday
0 10.0 15.0
1 NaN 12.0
2 13.0 NaN
3 20.0 19.0
>>> ski_data.loc[~ski_data[['AdultWeekend', 'AdultWeekday']].isna().all(axis=1)]
AdultWeekend AdultWeekday
0 10.0 15.0
1 NaN 12.0
2 13.0 NaN
3 20.0 19.0

combine rows with identical index

How do I combine values from two rows that share an index and have no overlapping values?
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,None,None],[None,5,6]],index=['a','b','b'])
df
#input
0 1 2
a 1.0 2.0 3.0
b 4.0 NaN NaN
b NaN 5.0 6.0
Desired output
0 1 2
a 1.0 2.0 3.0
b 4.0 5.0 6.0
Use stack(), which drops all NaNs, then unstack():
df.stack().unstack()
If a simpler solution is acceptable, take the first non-missing value per index label with GroupBy.first:
df1 = df.groupby(level=0).first()
If summing per label happens to give the same output on your data, use a groupby sum:
df1 = df.groupby(level=0).sum()
If there are multiple non-missing values per group, you need to specify the expected output; that case is obviously more complicated.
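For reference, a quick sketch of the first two approaches applied to the sample frame:

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, 5, 6]],
                  index=['a', 'b', 'b'])

# stack() drops the NaN cells, unstack() rebuilds one row per label.
print(df.stack().unstack())
#      0    1    2
# a  1.0  2.0  3.0
# b  4.0  5.0  6.0

# groupby(level=0).first() keeps the first non-missing value per label,
# which gives the same result here.
print(df.groupby(level=0).first())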

Replace values in dataframe with values from series of different length

I would like to replace values in a column of a dataframe with values from a series. The catch is that I only want to replace values that are designated by a mask and the series does not have the same length as the dataframe.
More specifically, I want to replace all the values that are not null with values from a series that contains one value for each non-null value in the dataframe.
Assume the column in the dataframe contains [1,2,3,NaN,5] and the series contains [2,4,6,10]
I naively thought that this might work
df[pd.notna(df)] = s
But it will make the column look like [1,2,3,NaN,NaN]. I understand why it behaves this way, but I need to find something that will give me this: [2,4,6,NaN,10]
The approach you tried is possible, but with two changes:
Update an individual column, not the whole DataFrame.
To sidestep the mismatched index values, take the raw values from the "updating" Series.
To show how to do it, let's define the DataFrame (df) as:
A B
0 1.0 11.0
1 2.0 12.0
2 3.0 13.0
3 NaN NaN
4 5.0 15.0
and the "updating" Series (upd) as:
11 2
12 4
13 6
14 10
dtype: int64
As you can see, indices in df and upd are different.
To update, for example, column A in df the way you want, run:
df.A[df.A.notna()] = upd.values
The result is:
A B
0 2.0 11.0
1 4.0 12.0
2 6.0 13.0
3 NaN NaN
4 10.0 15.0
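As a side note, the same update can be written in a single .loc step, which sidesteps the chained-assignment warning that newer pandas versions may raise for df.A[...] = ...; a minimal sketch using the same column and mask:

# Select the non-NaN rows of column A in one indexing call and overwrite
# them with the Series' values (the lengths must match).
df.loc[df['A'].notna(), 'A'] = upd.values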

Count all NaNs in a pandas DataFrame

I'm trying to count the NaN elements (data type class 'numpy.float64') in a pandas Series (data type class 'pandas.core.series.Series') to find out how many there are.
This is for counting null values in a pandas Series:
import pandas as pd
oc=pd.read_csv(csv_file)
oc.count("NaN")
I expected the output of oc.count("NaN") to be 7, but instead it shows 'Level NaN must be same as name (None)'.
The argument to count isn't what you want counted (it's actually the index level to count within).
You're looking for df.isna().values.sum() (to count NaNs across the entire DataFrame), or len(df) - df['column'].count() (to count NaNs in a specific column).
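As a minimal sketch (the frame and column name here are just placeholders for the question's CSV data):

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, np.nan, 6.0]})

# NaNs across the entire DataFrame.
print(df.isna().values.sum())      # 3

# NaNs in a single column: total rows minus the non-NaN count.
print(len(df) - df['a'].count())   # 1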
You can use either of the following if your Series.dtype is float64:
oc.isin([np.nan]).sum()
oc.isna().sum()
If your Series is of mixed data-type you can use the following:
oc.isin([np.nan, 'NaN']).sum()
oc.size: returns the total element count of the dataframe, including NaN
oc.count().sum(): returns the total element count of the dataframe, excluding NaN
Therefore, another way to count the NaNs in a dataframe is to subtract the two:
NaN_count = oc.size - oc.count().sum()
Just for fun, you can do either
df.isnull().sum().sum()
or
len(df)*len(df.columns) - len(df.stack())
If your dataframe looks like this:
aa = pd.DataFrame(np.array([[1, 2, np.nan], [3, np.nan, 5], [8, 7, 6],
                            [np.nan, np.nan, 0]]), columns=['a', 'b', 'c'])
a b c
0 1.0 2.0 NaN
1 3.0 NaN 5.0
2 8.0 7.0 6.0
3 NaN NaN 0.0
To count NaNs by column, you can try this:
aa.isnull().sum()
a 1
b 2
c 1
For the total count of NaNs:
aa.isnull().values.sum()
4

Efficient way to select most recent index with finite value in column from Pandas DataFrame?

I'm trying to find the most recent index with a value that is not 'NaN' relative to the current index. So, say I have a DataFrame with 'NaN' values like this:
A B C
0 2.1 5.3 4.7
1 5.1 4.6 NaN
2 5.0 NaN NaN
3 7.4 NaN NaN
4 3.5 NaN NaN
5 5.2 1.0 NaN
6 5.0 6.9 5.4
7 7.4 NaN NaN
8 3.5 NaN 5.8
If I am currently at index 4, I have the values:
A B C
4 3.5 NaN NaN
I want to know the last known value of 'B' relative to index 4, which is at index 1:
A B C
1 5.1 -> 4.6 NaN
I know I can get a list of all indexes with NaN values using something like:
indexes = df.index[df['B'].apply(np.isnan)]
But this seems inefficient on a large dataframe. Is there a way to tail just the last one relative to the current index?
You may try something like this: convert the index to a series that has the same NaN values as column B, then use ffill(), which carries the last non-missing index forward over all subsequent NaNs:
import pandas as pd
import numpy as np
df['Last_index_notnull'] = df.index.to_series().where(df.B.notnull(), np.nan).ffill()
df['Last_value_notnull'] = df.B.ffill()
df
Now at index 4, you know the last non-missing value is 4.6 and its index is 1.
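A short usage sketch of that lookup at index 4, with the helper columns defined above:

# With the helper columns in place, the answer for any row is a plain lookup.
print(df.loc[4, 'Last_index_notnull'])   # 1.0 (stored as float because of the NaNs)
print(df.loc[4, 'Last_value_notnull'])   # 4.6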
Some useful methods to know:
last_valid_index
first_valid_index
For column B as of index 4:
df.B.loc[:4].last_valid_index()
1
You can use this for all columns in this way:
pd.concat([df.loc[:i].apply(pd.Series.last_valid_index) for i in df.index],
          axis=1).T
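As a spot-check of what that expression computes, a minimal sketch restricted to index 4 on the question's sample frame:

# Last valid index per column, looking only at rows up to and including index 4.
print(df.loc[:4].apply(pd.Series.last_valid_index))
# A    4
# B    1
# C    0
# dtype: int64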
