I have this in a Pandas DataFrame:
site channel week value
0 Canada A W01 NaN
1 Canada A W02 NaN
2 Canada A W03 12
3 Canada B W01 NaN
4 Canada B W02 NaN
5 Canada B W03 66
I need to get this:
site channel week value
0 Canada A W01 12
1 Canada A W02 12
2 Canada A W03 12
3 Canada B W01 66
4 Canada B W02 66
5 Canada B W03 66
In words, I need to fill the null values in column value with the values that correspond to the specific combination of site and channel.
How can I do this?
Use DataFrameGroupBy.bfill if the last value in each group is always the non-missing one and all other values are NaN:
df1 = df.groupby(['site', 'channel']).bfill()
A better, more general solution if some groups may contain only NaNs:
df1 = df.groupby(['site', 'channel']).apply(lambda x: x.bfill().ffill())
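If only the value column needs filling, a transform-based variant (a sketch under the same assumptions) leaves the site and channel columns untouched:
# Fill within each (site, channel) group: backward fill, then forward fill.
df['value'] = (df.groupby(['site', 'channel'])['value']
                 .transform(lambda s: s.bfill().ffill()))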
I have a messy dataset, shown below:
Sales Credit type Year Status
0 NaN GS 2000 Confirmed
1 NaN V 2000 Assigned
2 GS 2001 Assigned NaN
3 V 2004 Received NaN
I am trying to move the values over into the right columns. Ideally the result should look like this:
Sales Credit type Year Status
0 NaN GS 2000 Confirmed
1 NaN V 2000 Assigned
2 NaN GS 2001 Assigned
3 NaN V 2004 Received
I have tried to find a solution on this platform but had no luck. I used df.loc to place the data, but the result is not what I expected. I would really appreciate your support in solving this issue. Thank you.
Update: It works with @jezrael's solution, thanks! But is it possible to use it for this case?
ID Sales Credit_type Year Status
0 1 Aston GS 2000 Confirmed
1 1 NaN V 2000 Assigned
2 2 GS 2001 Assigned NaN
3 3 V 2004 Received NaN
And the result should be like this:
ID Sales Credit_type Year Status
0 1 Aston GS 2000 Confirmed
1 1 NaN V 2000 Assigned
2 2 NaN GS 2001 Assigned
3 3 NaN V 2004 Received
You can create a mask from the last column that tests for missing values with Series.isna, and then use DataFrame.shift with axis=1 only on the filtered rows:
m = df.iloc[:, -1].isna()
df[m] = df[m].shift(axis=1)
print (df)
Sales Credit type Year Status
0 NaN GS 2000 Confirmed
1 NaN V 2000 Assigned
2 NaN GS 2001 Assigned
3 NaN V 2004 Received
If you need to shift all columns except the first, use DataFrame.iloc with the indexing .iloc[m, 1:]:
m = df.iloc[:, -1].isna().to_numpy()
df.iloc[m, 1:] = df.iloc[m, 1:].shift(axis=1)
print (df)
ID Sales Credit_type Year Status
0 1 Aston GS 2000 Confirmed
1 1 NaN V 2000 Assigned
2 2 NaN GS 2001 Assigned
3 3 NaN V 2004 Received
I have a dataframe (df1) with the values below. I want to SUM rows 4 to 9 and put the resulting value in row 3. How can I achieve this? In Excel it would be a simple SUM formula like =SUM(B9:B14), but what is the alternative in pandas?
Detail Value
0 Day 23
1 Month Aug
2 Year 2020
3 Total Tickets NaN
4 Pune 2
5 Mumbai 3
6 Thane 33
7 Kolkatta NaN
8 Hyderabad NaN
9 Kerala 283
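One way, sketched under the assumption that the Value column is object-typed (it mixes strings like 'Aug' with numbers), is to coerce the slice to numeric and assign the sum back. Note that .loc slicing is inclusive of both endpoints, so 4:9 covers rows 4 through 9:
import pandas as pd

# Coerce the city rows to numbers (non-numeric values and NaN become NaN),
# sum them (NaN is skipped), and write the total into row 3.
df1.loc[3, 'Value'] = pd.to_numeric(df1.loc[4:9, 'Value'], errors='coerce').sum()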
I'm trying to drop the first two columns in a dataframe that has NaN for column headers. The dataframe looks like this:
15 NaN NaN NaN Energy Supply Energy Supply Renewable Energy
17 NaN Afghanistan Afghanistan 1 2 3
18 NaN Albania Albania 1 2 3
19 NaN Algeria Algeria 1 2 3
I need to drop the first two columns labeled NaN. I tried df=df.drop(df.columns[[1,2]],axis=1), which returns an error:
KeyError: '[nan nan] not found in axis'
What am I missing?
It's strange that you have NaN as column names. Try filtering to the columns that do not start with NaN using a regex:
df.filter(regex='^(?!NaN).+', axis=1)
Using your data
print(df)
15 NaN NaN.1 NaN.2 EnergySupply EnergySupply.1 RenewableEnergy
0 17 NaN Afghanistan Afghanistan 1 2 3
1 18 NaN Albania Albania 1 2 3
2 19 NaN Algeria Algeria 1 2 3
Solution
print(df.filter(regex='^(?!NaN).+', axis=1))
15 EnergySupply EnergySupply.1 RenewableEnergy
0 17 1 2 3
1 18 1 2 3
2 19 1 2 3
When the NaN columns exist, I had to use a case-insensitive version of the regex from wwnde's answer in order to successfully filter out the columns:
df = df.filter(regex='(?i)^(?!NaN).+', axis=1)
Other suggestions, such as df=df[df.columns.dropna()] and df=df.drop(np.nan, axis=1), do not work, but the above did.
I'm guessing this is related to the painful reality of np.nan == np.nan not evaluating to True, but ultimately it seems like a bug in pandas.
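For reference, a boolean mask over the column index sidesteps NaN label lookup entirely; a minimal sketch, assuming a reasonably recent pandas:
# Keep only the columns whose header is not NaN.
df = df.loc[:, df.columns.notna()]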
df1
Color date
0 A 2011
1 B 201411
2 C 20151231
3 A 2019
df2
Color date
0 A 2013
1 B 20151111
2 C 201101
df3
Color date
0 A 2011
1 B 201411
2 C 20151231
3 A 2019
4 Y 20070212
Assuming there are three dataframes as shown above, I want to create a new dataframe by extracting only the 'date' columns.
The output I want:
New df
df1-date df2-date df3-date
0 2011 2013 2011
1 201411 20151111 201411
2 20151231 201101 20151231
3 2019 NaN 2019
4 NaN NaN 20070212
I want to set the empty parts to NaN because the lengths are different.
I tried merge and concat but got errors.
Thank you for reading.
One more approach
df1.join(df2['date'],rsuffix='df2',how='outer').join(df3['date'],rsuffix='df3',how='outer')
Output
Color date datedf2 datedf3
0 A 2011.0 2013.0 2011
1 B 201411.0 20151111.0 201411
2 C 20151231.0 201101.0 20151231
3 A 2019.0 NaN 2019
4 NaN NaN NaN 20070212
This involves two problems: 1) merging multiple dataframes, and 2) merging on duplicated keys.
def multikey(x):
    # use groupby and cumcount to create the additional key per Color
    return x.assign(key=x.groupby('Color').cumcount())

# then use reduce to merge the frames pairwise on ['Color', 'key']
from functools import reduce
df = reduce(lambda left, right:
            pd.merge(left, right, on=['Color', 'key'], how='outer'),
            list(map(multikey, [df1, df2, df3])))
df
Color date_x key date_y date
0 A 2011.0 0 2013.0 2011
1 B 201411.0 0 20151111.0 201411
2 C 20151231.0 0 201101.0 20151231
3 A 2019.0 1 NaN 2019
4 Y NaN 0 NaN 20070212
Notice the column names here; we can always modify them with rename.
Method 2, using concat: this does not consider the key and merges on the index.
s=pd.concat([df1,df2,df3],keys=['df1','df2','df3'], axis=1)
s.columns=s.columns.map('_'.join)
s=s.filter(like='_date')
s
df1_date df2_date df3_date
0 2011.0 2013.0 2011
1 201411.0 20151111.0 201411
2 20151231.0 201101.0 20151231
3 2019.0 NaN 2019
4 NaN NaN 20070212
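To get exactly the requested column names, a direct concat of only the date columns also works; a minimal sketch:
import pandas as pd

# Concatenate just the 'date' columns side by side; shorter frames are
# padded with NaN on index alignment, and keys= supplies the column names.
new_df = pd.concat([df1['date'], df2['date'], df3['date']],
                   axis=1, keys=['df1-date', 'df2-date', 'df3-date'])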
The operation that I want to do is similar to a merge. For example, with an inner merge we get a data frame that contains rows that are present in the first AND the second data frame. With an outer merge we get a data frame with rows that are present EITHER in the first OR in the second data frame.
What I need is a data frame that contains rows that are present in the first data frame AND NOT present in the second one. Is there a fast and elegant way to do it?
Consider the following:
df_one is the first DataFrame
df_two is the second DataFrame
Present in First DataFrame and Not in Second DataFrame
Solution: by Index
df = df_one[~df_one.index.isin(df_two.index)]
index can be replaced by any column on which you wish to base the exclusion.
In the example above, I've used the index as the reference between both DataFrames.
Additionally, you can use a more complex query with a boolean pandas.Series to solve the above, as sketched below.
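A minimal sketch of that multi-column variant, using hypothetical key columns ['Team', 'Year'] like those in the examples that follow:
keys = ['Team', 'Year']  # hypothetical exclusion columns
mask = df_one.set_index(keys).index.isin(df_two.set_index(keys).index)
df = df_one[~mask]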
How about something like the following?
print(df1)
Team Year foo
0 Hawks 2001 5
1 Hawks 2004 4
2 Nets 1987 3
3 Nets 1988 6
4 Nets 2001 8
5 Nets 2000 10
6 Heat 2004 6
7 Pacers 2003 12
print(df2)
Team Year foo
0 Pacers 2003 12
1 Heat 2004 6
2 Nets 1988 6
As long as there is a non-key, commonly named column, you can let the added-on suffixes do the work (if there is no non-key common column then you could create one to use temporarily ... df1['common'] = 1 and df2['common'] = 1):
new = df1.merge(df2,on=['Team','Year'],how='left')
print(new[new.foo_y.isnull()])
Team Year foo_x foo_y
0 Hawks 2001 5 NaN
1 Hawks 2004 4 NaN
2 Nets 1987 3 NaN
4 Nets 2001 8 NaN
5 Nets 2000 10 NaN
Or you can use isin but you would have to create a single key:
df1['key'] = df1['Team'] + df1['Year'].astype(str)
df2['key'] = df2['Team'] + df2['Year'].astype(str)
print(df1[~df1.key.isin(df2.key)])
Team Year foo key
0 Hawks 2001 5 Hawks2001
2 Nets 1987 3 Nets1987
4 Nets 2001 8 Nets2001
5 Nets 2000 10 Nets2000
6 Heat 2004 6 Heat2004
7 Pacers 2003 12 Pacers2003
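A related sketch: tuple keys avoid the collisions that plain string concatenation can produce (for example, 'AB' + 'C' and 'A' + 'BC' concatenate to the same key):
# Build tuple keys instead of concatenated strings.
df2_keys = set(zip(df2['Team'], df2['Year']))
print(df1[[k not in df2_keys for k in zip(df1['Team'], df1['Year'])]])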
You could run into errors if your non-index column has cells with NaN.
print(df1)
Team Year foo
0 Hawks 2001 5
1 Hawks 2004 4
2 Nets 1987 3
3 Nets 1988 6
4 Nets 2001 8
5 Nets 2000 10
6 Heat 2004 6
7 Pacers 2003 12
8 Problem 2112 NaN
print(df2)
Team Year foo
0 Pacers 2003 12
1 Heat 2004 6
2 Nets 1988 6
3 Problem 2112 NaN
new = df1.merge(df2,on=['Team','Year'],how='left')
print(new[new.foo_y.isnull()])
Team Year foo_x foo_y
0 Hawks 2001 5 NaN
1 Hawks 2004 4 NaN
2 Nets 1987 3 NaN
4 Nets 2001 8 NaN
5 Nets 2000 10 NaN
6 Problem 2112 NaN NaN
The problem team in 2112 has no value for foo in either table. So, the left join here will falsely return that row, which matches in both DataFrames, as not being present in the right DataFrame.
Solution:
What I do is add an indicator column to each DataFrame and set a value for all of its rows. Then after the join, you can check whether the right table's column is NaN to find the records that exist only in the left table.
df1['in_df1'] = 'yes'
df2['in_df2'] = 'yes'
print(df2)
Team Year foo in_df2
0 Pacers 2003 12 yes
1 Heat 2004 6 yes
2 Nets 1988 6 yes
3 Problem 2112 NaN yes
new = df1.merge(df2,on=['Team','Year'],how='left')
print(new[new.in_df2.isnull()])
Team Year foo_x in_df1 foo_y in_df2
0 Hawks 2001 5 yes NaN NaN
1 Hawks 2004 4 yes NaN NaN
2 Nets 1987 3 yes NaN NaN
4 Nets 2001 8 yes NaN NaN
5 Nets 2000 10 yes NaN NaN
NB. The problem row is now correctly filtered out, because it has a value for in_df2:
Problem 2112 NaN yes NaN yes
I suggest using the indicator parameter in merge. Also, if on is None, the merge defaults to the intersection of the columns in both DataFrames.
new = df1.merge(df2,how='left', indicator=True) # adds a new column '_merge'
new = new[(new['_merge']=='left_only')].copy() #rows only in df1 and not df2
new = new.drop(columns='_merge').copy()
Team Year foo
0 Hawks 2001 5
1 Hawks 2004 4
2 Nets 1987 3
4 Nets 2001 8
5 Nets 2000 10
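The same anti-join can also be written as one chained expression; a sketch:
only_in_df1 = (df1.merge(df2, how='left', indicator=True)
                  .query("_merge == 'left_only'")
                  .drop(columns='_merge'))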
Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
indicator : boolean or string, default False
If True, adds a column to output DataFrame called “_merge” with information on the source of each row.
Information column is Categorical-type and takes on a value of
“left_only” for observations whose merge key only appears in ‘left’ DataFrame,
“right_only” for observations whose merge key only appears in ‘right’ DataFrame,
and “both” if the observation’s merge key is found in both.