Pandas Grouping by Id and getting non-NaN values [duplicate] - python

This question already has answers here:
pandas group by and find first non null value for all columns
(3 answers)
I have a table that tracks the changes made to each field of a Salesforce record. My goal is to group by the salesforce_id column and merge all the rows for an id into one, replacing null values with text values wherever a text value exists. I've tried different variations of groupby but can't seem to get the desired output.

This should do what you want:
df.groupby('salesforce_id').first().reset_index(drop=True)
That will collapse each group into one row, keeping the first non-NaN value of every column (unless a column has no non-NaN values at all within the group; then the value in the merged row stays NaN).
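A minimal sketch of that behaviour on a made-up change-tracking table (the salesforce_id values and the stage/owner field names are assumptions for illustration; as_index=False keeps the id as a column instead of dropping the index):
import numpy as np
import pandas as pd

df = pd.DataFrame({'salesforce_id': ['001A', '001A', '001B'],
                   'stage':         ['Open', np.nan,  np.nan],
                   'owner':         [np.nan, 'Alice', np.nan]})

# first() takes the first non-NaN value per column within each group;
# a column that is all-NaN within a group stays NaN.
merged = df.groupby('salesforce_id', as_index=False).first()
print(merged)
# Output:
#   salesforce_id stage  owner
# 0          001A  Open  Alice
# 1          001B   NaN    NaN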

Use melt and pivot:
out = df.melt('id').dropna() \
        .pivot(index='id', columns='variable', values='value') \
        .rename_axis(index=None, columns=None)
print(out)
# Output:
A B C
1 A1 B1 C2
Setup:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': [1, 1, 1],
                   'A': ['A1', np.nan, np.nan],
                   'B': [np.nan, 'B1', np.nan],
                   'C': [np.nan, np.nan, 'C2'],
                   'D': [np.nan, np.nan, np.nan]})
print(df)
# Output:
id A B C D
0 1 A1 NaN NaN NaN
1 1 NaN B1 NaN NaN
2 1 NaN NaN C2 NaN
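Note that column D drops out of the melt/pivot result because dropna removes every row for it. If you want to keep all original columns, a small sketch reusing the out and df objects above:
out = out.reindex(columns=df.columns.drop('id'))
print(out)
# Output:
#     A   B   C    D
# 1  A1  B1  C2  NaN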

Related

Pandas replace NaN values with respect to the columns

I have the following data frame.
df = pd.DataFrame({'A': [2, np.nan], 'B': [1, np.nan]})
df.fillna(0) replaces all the NaN values with 0. But
I want to replace the NaN values in column 'A' with 1 and in column 'B' with 0, simultaneously. How can I do that?
Use:
df["A"].fillna(1 , inplace=True) # for col A - NaN -> 1
df["B"].fillna(0 , inplace=True) # for col B - NaN -> 0
This does it in one line
(df['column_name'].fillna(0,inplace=True),df['column_name'].fillna(1,inplace=True))
print(df)
The fillna method also exists for Series objects:
df["A"] = df["A"].fillna(1)
df["B"] = df["B"].fillna(0)

How to fill nan based on first value?

Imagine you have the following df:
d = {'description#1': ['happy', 'coding', np.nan], 'description#2': [np.nan, np.nan, np.nan], 'description#3': [np.nan, np.nan, np.nan]}
dffinalselection = pd.DataFrame(data=d)
dffinalselection
description#1 description#2 description#3
0 happy NaN NaN
1 coding NaN NaN
2 NaN NaN NaN
I want to fill the NaN values in the df with the value from the description#1 column of the same row:
filldesc = dffinalselection.filter(like='description')
filldesc = filldesc.fillna(dffinalselection['description#1'], axis=1)
filldesc
However, getting the following error:
NotImplementedError: Currently only can fill with dict/Series column by column
How to workaround?
desired output:
description#1 description#2 description#3
0 happy happy happy
1 coding coding coding
2 NaN NaN NaN
Please help!
You can use apply() on the rows with axis=1, then use Series.fillna() to fill the NaN values.
import pandas as pd
import numpy as np
d = {'description#1': ['happy', 'coding', np.nan], 'description#2': [np.nan, 'tokeep', np.nan], 'description#3': [np.nan, np.nan, np.nan]}
dffinalselection = pd.DataFrame(data=d)
df_ = dffinalselection.apply(lambda row: row.fillna(row.iloc[0]), axis=1)
print(df_)
description#1 description#2 description#3
0 happy happy happy
1 coding tokeep coding
2 NaN NaN NaN
Use the ffill method with axis=1:
dffinalselection.ffill(axis=1)
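Note that ffill(axis=1) forward-fills each row from the nearest non-NaN column to the left, which is not always the same as filling with description#1. A small sketch, reusing the dffinalselection with 'tokeep' defined in the previous answer:
print(dffinalselection.ffill(axis=1))
# Output:
#   description#1 description#2 description#3
# 0         happy         happy         happy
# 1        coding        tokeep        tokeep   <- description#3 gets 'tokeep', not 'coding'
# 2           NaN           NaN           NaN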

Filter for rows in pandas dataframe where values in a column are greater than x or NaN

I'm trying to figure out how to filter a pandas dataframe so that the values in a certain column are either greater than a certain value, or are NaN. Let's say my dataframe looks like this:
df = pd.DataFrame({"col1":[1, 2, 3, 4], "col2": [4, 5, np.nan, 7]})
I've tried:
df = df[df["col2"] >= 5 | df["col2"] == np.nan]
and:
df = df[df["col2"] >= 5 | np.isnan(df["col2"])]
But the first causes an error, and the second excludes rows where the value is NaN. How can I get the result to be this:
pd.DataFrame({"col1":[2, 3, 4], "col2":[5, np.nan, 7]})
Please try:
df[df.col2.isna() | df.col2.ge(5)]
col1 col2
1 2 5.0
2 3 NaN
3 4 7.0
Also, you can fill the NaN values with the threshold and then filter on that column:
df[df["col2"].fillna(5) >= 5]

Pandas - Merge 2 df with same column names but exclusive values

I have 1 main df, MainDF, with a column key plus other columns that are not relevant here.
I also have 2 other dfs, dfA and dfB, each with 2 columns, key and tariff. The keys in dfA and dfB are exclusive, i.e. no key appears in both dfA and dfB.
On my MainDF, I do MainDF.merge(dfA, how='left', on='key'), which adds the column tariff to MainDF for the keys that are in both dfA and MainDF, and puts NaN for all keys in MainDF that are not in dfA.
Now, I need to do MainDF.merge(dfB, how='left', on='key') to add the tariff for the keys that are in MainDF but not in dfA.
When I do the second merge, it creates 2 columns in MainDF, tariff_x and tariff_y, because tariff is already in MainDF after the first merge. However, since the keys are exclusive, I need to keep only one tariff column holding the non-NaN value when there is one.
How can I do this in a pythonic way? I could add a new column that is either tariff_x or tariff_y, but I don't find that very elegant.
Thanks
You can first concat dfA and dfB before merging with MainDF:
MainDF.merge(pd.concat([dfA, dfB], axis=0), how='left', on='key')
Do you need something like this?
dfA = pd.DataFrame({'tariff': [1, 2, 3], 'A': list('abc')})
dfB = pd.DataFrame({'tariff': [4, 5, 6], 'B': list('def')})
dfJoin = pd.concat([dfA, dfB], ignore_index=True)
A B tariff
0 a NaN 1
1 b NaN 2
2 c NaN 3
3 NaN d 4
4 NaN e 5
5 NaN f 6
Now you can merge with dfJoin.
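A minimal end-to-end sketch of the concat-then-merge approach; the keys, tariffs and the other column here are made up for illustration, not the asker's actual data:
import pandas as pd

MainDF = pd.DataFrame({'key': ['k1', 'k2', 'k3'], 'other': [10, 20, 30]})
dfA = pd.DataFrame({'key': ['k1'], 'tariff': [100]})             # keys only in dfA
dfB = pd.DataFrame({'key': ['k2', 'k3'], 'tariff': [200, 300]})  # keys only in dfB

# One merge against the concatenated lookup table -> a single 'tariff' column, no tariff_x/tariff_y.
result = MainDF.merge(pd.concat([dfA, dfB], ignore_index=True), how='left', on='key')
print(result)
# Output:
#   key  other  tariff
# 0  k1     10     100
# 1  k2     20     200
# 2  k3     30     300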

Manipulate specific columns (sample features) conditional on another column's entries (feature value) using pandas/numpy dataframe

my input dataframe (shortened) looks like this:
>>> import numpy as np
>>> import pandas as pd
>>> df_in = pd.DataFrame([[1, 2, 'a', 3, 4], [6, 7, 'b', 8, 9]],
... columns=(['c1', 'c2', 'col', 'c3', 'c4']))
>>> df_in
c1 c2 col c3 c4
0 1 2 a 3 4
1 6 7 b 8 9
It is supposed to be manipulated as follows:
if a row (sample) has a specific value in column 'col' (e.g. 'b' here),
then convert the entries in columns 'c1' and 'c2' of that row to np.nan.
Result wanted:
>>> df_out = pd.DataFrame([[1, 2, 'a', 3, 4], [np.nan, np.nan, 'b', 8, 9]],
...                       columns=(['c1', 'c2', 'col', 'c3', 'c4']))
>>> df_out
c1 c2 col c3 c4
0 1 2 a 3 4
1 NaN NaN b 8 9
So far, I have managed to obtain the desired result with the code
>>> dic = {'col' : ['c1', 'c2']}        # auxiliary
>>> b_w = df_in[df_in['col'] == 'b']    # subset with 'b' in 'col'
>>> b_w = b_w.drop(dic['col'], axis=1)  # drop 'c1', 'c2' (they come back as NaN after the concat)
>>> b_wo = df_in[df_in['col'] != 'b']   # subset without 'b' in 'col'
>>> df_out = pd.concat([b_w, b_wo])     # both subsets together again
>>> df_out
c1 c2 c3 c4 col
1 NaN NaN 8 9 b
0 1.0 2.0 3 4 a
Although I get what I want (the original data consists entirely of floats, so don't
mind the int-to-float conversion here), it is a rather inelegant
snippet of code. My educated guess is that this could be done faster
by using the built-in functions from pandas and numpy, but I am unable to manage this.
Any suggestions how to code this in a fast and efficient way for daily use? Any help is highly appreciated. :)
You can condition on both the row and column positions to assign values using loc, which supports both boolean indexing and label-based indexing:
df_in.loc[df_in.col == 'b', ['c1', 'c2']] = np.nan
df_in
# c1 c2 col c3 c4
# 0 1.0 2.0 a 3 4
# 1 NaN NaN b 8 9
When using pandas I would go for the solution provided by @Psidom.
However, for larger datasets it is faster to do the whole pandas -> numpy -> pandas round trip, i.e. DataFrame -> numpy array -> DataFrame (about 10% less processing time for my setup). Without converting back to a DataFrame, numpy is almost twice as fast for my dataset.
Solution for the question asked:
cols, df_out = df_in.columns, df_in.values
for i in [0, 1]:
    df_out[df_out[:, 2] == 'b', i] = np.nan
df_out = pd.DataFrame(df_out, columns=cols)
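A slightly more compact sketch of the same numpy route, reusing df_in and the imports from the question above; it avoids the Python loop by assigning to both target columns at once:
cols, arr = df_in.columns, df_in.to_numpy()
arr[arr[:, 2] == 'b', :2] = np.nan  # rows where 'col' == 'b', columns c1 and c2
df_out = pd.DataFrame(arr, columns=cols)
print(df_out)
# Output:
#     c1   c2 col c3 c4
# 0    1    2   a  3  4
# 1  NaN  NaN   b  8  9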
