Update NaN values with a dictionary of values based on a condition - python

I have a data frame like this:
c1 c2
0 a 12
1 b NaN
2 a 45
3 c NaN
4 c 32
5 b NaN
and I have a dictionary like this:
di = {
    'a': 10, 'b': 20, 'c': 30
}
I want to update my data frame like this:
c1 c2
0 a 12
1 b 20
2 a 45
3 c 30
4 c 32
5 b 20
Is there any way to do it without using a long lambda function with conditions?
Here's the code to create the data frame:
import numpy as np
import pandas as pd

a = pd.DataFrame({
    'c1': ['a', 'b', 'a', 'c', 'c', 'b'],
    'c2': [12, np.nan, 45, np.nan, 32, np.nan]
})
di = {
    'a': 10, 'b': 20, 'c': 30
}

You can use the apply() method to deal with this.
Create a function and then apply it to the required columns:
def deal_na(cols):
    x = cols['c1']
    y = cols['c2']
    if pd.isnull(y):
        return di[x]
    else:
        return y

a['c2'] = a[['c1', 'c2']].apply(deal_na, axis=1)
Here, the values of columns 'c1' and 'c2' are passed to the function as a row (the cols variable). We assign them to the variables x and y, then check whether y is null. If it is, we replace it with di[x]; otherwise we return it as-is.
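As an aside, a vectorized alternative is numpy.where; this is just a sketch, assuming di covers every value that can appear opposite a NaN:
import numpy as np
# keep c2 where it is present, otherwise look the replacement up via c1
a['c2'] = np.where(a['c2'].isna(), a['c1'].map(di), a['c2'])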

Use Series.map with Series.fillna to replace only the missing values:
a['c2'] = a['c2'].fillna(a['c1'].map(di))
print (a)
c1 c2
0 a 12.0
1 b 20.0
2 a 45.0
3 c 30.0
4 c 32.0
5 b 20.0
Lastly, if all values of c1 are among the dictionary keys, every missing value gets replaced and it is possible to convert to integers:
a['c2'] = a['c2'].fillna(a['c1'].map(di)).astype(int)
print (a)
c1 c2
0 a 12
1 b 20
2 a 45
3 c 30
4 c 32
5 b 20
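A small caveat worth adding: if some c1 values could be missing from the dictionary, astype(int) would raise on the remaining NaN. One hedged option is pandas' nullable integer dtype, which tolerates missing values:
a['c2'] = a['c2'].fillna(a['c1'].map(di)).astype('Int64')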


Create a dataframe based on 3 linked dataframes using a constraint on cumsum

I have three dataframes like this:
import pandas as pd

df1 = pd.DataFrame(
    {
        'C1': [2, 7, 3, 6, 5, 3],
        'C2': [0, 8, 0, 1, 0, 0]
    }
)
df2 = pd.DataFrame(
    {
        'position1': range(11, 17),
        'column': ['C1', 'C2', 'C1', 'C1', 'C1', 'C2'],
        'mapper': list('aababb')
    }
)
df3 = pd.DataFrame(
    {
        'position2': range(1, 7),
        'C1': list('aabbab'),
        'C2': list('abbbaa')
    }
)
which look as follows:
C1 C2
0 2 0
1 7 8
2 3 0
3 6 1
4 5 0
5 3 0
position1 column mapper
0 11 C1 a
1 12 C2 a
2 13 C1 b
3 14 C1 a
4 15 C1 b
5 16 C2 b
position2 C1 C2
0 1 a a
1 2 a b
2 3 b b
3 4 b b
4 5 a a
5 6 b a
and I would like to create another dataframe using these 3 dataframes that looks as follows:
position1 position2 value
0 11 1 2
1 11 2 7
2 13 3 3
3 13 4 6
4 14 5 5
5 15 6 3
6 12 1 0
7 16 2 8
8 16 3 0
9 16 4 1
10 12 5 0
11 12 6 0
Here is the logic for C1:
First, one checks the first value in column C1 in df3 which is an a.
Second, one checks in df2 where one first finds the letter determined in 1) - in our case an a for the respective column (here: C1) and notes down the value of position1 (here: 11).
Now one goes to df1 and notes down the respective value for C1 (here: 2)
That gives us the first row of the desired outcome: position2 = 1, position1 = 11 and the value = 2.
So far, so good. The issue comes in due to a constraint:
In df2, each position1 can only be used as long as the sum of all corresponding values from df1 does not exceed 10; if it would, the next valid position1 in df2 should be used.
So, for the example above:
In df3, if I go to the next row in C1 I again find an a, so I check df2 again and end up with position1 = 11 once more. Checking df1 I find a value of 7; the cumulative sum would be 9, which is below 10, so all is good and I have the next row of my desired dataframe:
position2 = 2, position1 = 11 and the value = 7.
Now I go to the next row in df3 in column C1 and find a b; checking df2 gives me position1 = 13 and the value from df1 is 3, so I get the row:
position2 = 3, position1 = 13 and the value = 3.
Doing it once more gives
position2 = 4, position1 = 13 and the value = 6.
Doing it again, I now get the letter a again, which would point to position1 = 11 in df2. The value from df1 is 5; as the cumulative sum is already 9, I cannot use this position but have to find the next one in df2, which is position1 = 14. Therefore I can add the row:
position2 = 5, position1 = 14 and the value = 5.
And so on...
I am struggling with incorporating the check for the cumsum. Does anyone see an elegant solution to create the desired dataframe from the 3 inputs? The only solutions I have contain several loops, and the code is not very readable.
The example might be tricky to follow but I could not design an easier one.
Here's a solution that merges all the tables together, representing every option of which position1 to use, and then removes the options that surpass the 10-count threshold. This doesn't scale well when a letter in df3 has many options in df2.
I couldn't figure out how to do this without a loop; it currently performs the cumsum and the removal of over-threshold rows iteratively.
threshold = 10

# Convert df3 to long form
df3_long = df3.melt(
    id_vars='position2',
    var_name='column',
    value_name='mapper',
)

# Merge info from across the three dfs into a single df
m = df3_long.merge(
    df2,
    on=['mapper', 'column'],
    how='right',
)
m['value'] = m.apply(lambda r: df1.loc[r.position2 - 1][r.column], axis=1)

# Iteratively remove duplicates and redo cumsum (expensive operation!)
while True:
    # Count the running tally of the position1 usage and filter out overusage
    m['cumsum_score'] = m.groupby('position1')['value'].transform('cumsum')
    passing = m['cumsum_score'].le(threshold)
    is_dup = m.duplicated(['position2', 'column'], keep=False)
    not_first_dup = m.duplicated(['position2', 'column'])
    is_first_dup = is_dup & ~not_first_dup

    # successful stop when all duplicates are removed and all are passing
    if all(~is_dup) & all(passing):
        break

    # unsuccessful stop when all duplicates are removed, but still not all passing
    if all(~is_dup) & ~all(passing):
        print('ERROR')
        break

    m = m.loc[
        (passing & ~is_dup) |
        (passing & is_first_dup) |
        (~passing & not_first_dup)
    ]

# Format the output
m = m.sort_values(['column', 'position2'])
m = m[['position1', 'position2', 'value']].reset_index(drop=True)
m
The answer by @mitoRibo got me on the right track; pd.melt indeed seems to be the key to solving it. Here is my solution with a few comments:
import pandas as pd
import numpy as np

def assign_group_memberships(aniterable, max_sum):
    label = 0
    total_sum = 0
    for val in aniterable:
        total_sum += val
        if total_sum > max_sum:
            total_sum = val
            label += 1
        yield label
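# A quick sanity check of the generator on hypothetical values:
# list(assign_group_memberships([2, 7, 5], 10)) -> [0, 0, 1]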
# copy df1, df2 and df3 from the question
desired = pd.DataFrame(
    {
        'position1': [11, 11, 13, 13, 14, 15, 12, 16, 16, 16, 12, 12],
        'position2': list(range(1, 7)) + list(range(1, 7)),
        'value': [2, 7, 3, 6, 5, 3, 0, 8, 0, 1, 0, 0]
    }
)
threshold = 10

# Convert df1 and df3 to long form
df1_long = df1.melt(
    var_name='column'
)
df3_long = df3.melt(
    id_vars='position2',
    var_name='column',
    value_name='mapper',
)
df3_long['value'] = df1_long['value'].copy()
Now we can assign groups to the individual rows based on the threshold: whenever the threshold is exceeded, a new label is created within each (column, mapper) group.
df3_long['group'] = (
    df3_long.groupby(['column', 'mapper'])['value'].transform(
        lambda x: assign_group_memberships(x, threshold)
    )
)
position2 column mapper value group
0 1 C1 a 2 0
1 2 C1 a 7 0
2 3 C1 b 3 0
3 4 C1 b 6 0
4 5 C1 a 5 1
5 6 C1 b 3 1
6 1 C2 a 0 0
7 2 C2 b 8 0
8 3 C2 b 0 0
9 4 C2 b 1 0
10 5 C2 a 0 0
11 6 C2 a 0 0
Now we can also determine the respective group labels in df2:
df2['group'] = df2.groupby(['column', 'mapper']).cumcount()
position1 column mapper group
0 11 C1 a 0
1 12 C2 a 0
2 13 C1 b 0
3 14 C1 a 1
4 15 C1 b 1
5 16 C2 b 0
and the only thing left to do is to merge df2 and df3_long:
result = df3_long.merge(df2, on=['column', 'mapper', 'group'])
position2 column mapper value group position1
0 1 C1 a 2 0 11
1 2 C1 a 7 0 11
2 3 C1 b 3 0 13
3 4 C1 b 6 0 13
4 5 C1 a 5 1 14
5 6 C1 b 3 1 15
6 1 C2 a 0 0 12
7 5 C2 a 0 0 12
8 6 C2 a 0 0 12
9 2 C2 b 8 0 16
10 3 C2 b 0 0 16
11 4 C2 b 1 0 16
Now we can check whether result is equal to desired:
result = (
    result[
        ['position1', 'position2', 'value']
    ].sort_values(['position1', 'position2']).reset_index(drop=True)
)
desired = (
    desired.sort_values(
        ['position1', 'position2']
    ).reset_index(drop=True)
)
print(result.equals(desired))
which is indeed the case.
There might be better options, so please post them! And thanks again to mitoRibo for the inspiration!
Actually, I did not get how you obtain position1 in your last table, but I have found that you can get the table this way.
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {
        'C1': [2, 7, 3, 6, 5, 3],
        'C2': [0, 8, 0, 1, 0, 0]
    }
)
df1 = pd.melt(df1, var_name='column')
df2 = pd.DataFrame(
    {
        'position1': range(11, 17),
        'column': ['C1', 'C2', 'C1', 'C1', 'C1', 'C2'],
        'mapper': list('aababb')
    }
)
df3 = pd.DataFrame(
    {
        'position2': range(1, 7),
        'C1': list('aabbab'),
        'C2': list('abbbaa')
    }
)
df3 = pd.melt(df3, id_vars=['position2'], var_name='column', value_name='mapper')
df4 = pd.concat([df1, df3])
df = pd.concat([df4, df2])
df = df.apply(lambda x: pd.Series(x.dropna().values))
df = df[df['value'].notna()]
print(df)
The results will be similar to this:
column value position2 mapper position1
0 C1 2.0 1.0 a 11.0
1 C1 7.0 2.0 a 12.0
2 C1 3.0 3.0 b 13.0
3 C1 6.0 4.0 b 14.0
4 C1 5.0 5.0 a 15.0
5 C1 3.0 6.0 b 16.0
6 C2 0.0 1.0 a NaN
7 C2 8.0 2.0 b NaN
8 C2 0.0 3.0 b NaN
9 C2 1.0 4.0 b NaN
10 C2 0.0 5.0 a NaN
11 C2 0.0 6.0 a NaN

Getting the total for some columns (independently) in a data frame with python [duplicate]

I have the following DataFrame:
In [1]:
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'],
                   'd': [5, 9, 1]})
df
Out [1]:
a b c d
0 1 2 dd 5
1 2 3 ee 9
2 3 4 ff 1
I would like to add a column 'e' which is the sum of columns 'a', 'b' and 'd'.
From browsing other forums, I thought something like this would work:
df['e'] = df[['a', 'b', 'd']].map(sum)
But it didn't.
I would like to know the appropriate operation with the list of columns ['a', 'b', 'd'] and df as inputs.
You can just sum and set the param axis=1 to sum the rows; this will ignore non-numeric columns:
In [91]:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df['e'] = df.sum(axis=1)
df
Out[91]:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
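A caveat I should add: in recent pandas versions, summing across a row that contains strings can raise a TypeError instead of the strings being silently skipped; passing numeric_only=True, or selecting the numeric columns first, avoids that:
df['e'] = df.select_dtypes('number').sum(axis=1)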
If you want to just sum specific columns then you can create a list of the columns and remove the ones you are not interested in:
In [98]:
col_list= list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:
df['e'] = df[col_list].sum(axis=1)
df
Out[99]:
a b c d e
0 1 2 dd 5 3
1 2 3 ee 9 5
2 3 4 ff 1 7
If you have just a few columns to sum, you can write:
df['e'] = df['a'] + df['b'] + df['d']
This creates a new column e with the values:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
For longer lists of columns, EdChum's answer is preferred.
Create a list of the column names you want to add up:
df['total'] = df.loc[:, list_name].sum(axis=1)
The ':' selects all rows; to sum only certain rows, put a row selection in its place.
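A concrete sketch, assuming the column list from this question:
list_name = ['a', 'b', 'd']
df['total'] = df.loc[:, list_name].sum(axis=1)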
This is a simpler way using iloc to select which columns to sum:
df['f']=df.iloc[:,0:2].sum(axis=1)
df['g']=df.iloc[:,[0,1]].sum(axis=1)
df['h']=df.iloc[:,[0,3]].sum(axis=1)
Produces:
a b c d e f g h
0 1 2 dd 5 8 3 3 6
1 2 3 ee 9 14 5 5 11
2 3 4 ff 1 8 7 7 4
I can't find a way to combine a range and specific columns that works, e.g. something like:
df['i']=df.iloc[:,[[0:2],3]].sum(axis=1)
df['i']=df.iloc[:,[0:2,3]].sum(axis=1)
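For what it's worth, numpy's index-building helper np.r_ can combine slices and single positions; a sketch:
import numpy as np
# np.r_[0:2, 3] builds the positional index [0, 1, 3], i.e. columns 'a', 'b' and 'd'
df['i'] = df.iloc[:, np.r_[0:2, 3]].sum(axis=1)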
You can simply pass your dataframe into the following function:
def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
    frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
    return frame
Example:
I have a dataframe (awards_frame) with columns award_1, award_2 and award_3, and I want to create a new column that shows the sum of awards for each row.
Usage:
I simply pass my awards_frame into the function, also specifying the name of the new column and a list of the column names that are to be summed:
sum_frame_by_column(awards_frame, 'award_sum', ['award_1', 'award_2', 'award_3'])
The returned frame contains the new award_sum column.
The following syntax helped me when the columns are in sequence:
awards_frame.values[:, 1:4].sum(axis=1)
You can use the function aggregate, or its alias agg:
df[['a','b','d']].agg('sum', axis=1)
The advantage of agg is that you can use multiple aggregation functions:
df[['a','b','d']].agg(['sum', 'prod', 'min', 'max'], axis=1)
Output:
sum prod min max
0 8 10 1 5
1 14 54 2 9
2 8 12 1 4
The shortest and simplest way here is to use
df.eval('e = a + b + d')
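Note that eval returns a new DataFrame by default rather than modifying df, so assign the result back:
df = df.eval('e = a + b + d')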

Groupby, apply function to each row with shift, and create new column

I want to group by id, apply a function to the data, and create a new column with the results. It seems there must be a faster/more efficient way to do this than to pass the data to the function, make the changes, and return the data. Here is an example.
Example
dat = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'], 'x': [4, 8, 12, 25, 30, 50]})

def my_func(data):
    data['diff'] = data['x'] - data['x'].shift(1, fill_value=data['x'].iat[0])
    return data

dat.groupby('id').apply(my_func)
Output
> print(dat)
id x diff
0 a 4 0
1 a 8 4
2 a 12 4
3 b 25 0
4 b 30 5
5 b 50 20
Is there a more efficient way to do this?
You can use groupby(...).diff() for this and then fill the NaN with zero, like the following:
dat['diff'] = dat.groupby('id').x.diff().fillna(0)
print(dat)
id x diff
0 a 4 0.0
1 a 8 4.0
2 a 12 4.0
3 b 25 0.0
4 b 30 5.0
5 b 50 20.0
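If you want to keep an integer dtype rather than the float that fillna produces here, one option is to cast afterwards:
dat['diff'] = dat.groupby('id').x.diff().fillna(0).astype(int)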

Dataframe is not updated when columns are passed to function using apply

I have two dataframes like this:
A B
a 1 10
b 2 11
c 3 12
d 4 13
A B
a 11 NaN
b NaN NaN
c NaN 20
d 16 30
They have identical column names and indices. My goal is to replace the NAs in df2 with the values of df1. Currently, I do it like this:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'A': range(1, 5), 'B': range(10, 14)}, index=list('abcd'))
df2 = pd.DataFrame({'A': [11, np.nan, np.nan, 16], 'B': [np.nan, np.nan, 20, 30]}, index=list('abcd'))

def repl_na(s, d):
    s[s.isnull().values] = d[s.isnull().values][s.name]
    return s

df2.apply(repl_na, args=(df1, ))
which gives me the desired output:
A B
a 11 10
b 2 11
c 3 20
d 16 30
My question is now how this could be accomplished if the indices of the dataframes are different (column names are still the same, and the columns have the same length). So I would have a df2 like this(df1 is unchanged):
A B
0 11 NaN
1 NaN NaN
2 NaN 20
3 16 30
Then the above code does not work anymore since the indices of the dataframes are different. Could someone tell me how the line
s[s.isnull().values] = d[s.isnull().values][s.name]
has to be modified in order to get the same result as above?
You could temporarily change the index on df1 to be the same as df2's and just combine_first with df2:
df2.combine_first(df1.set_index(df2.index))
      A     B
0  11.0  10.0
1   2.0  11.0
2   3.0  20.0
3  16.0  30.0
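An index-agnostic alternative is to work on the underlying arrays directly; this sketch assumes both frames have the same shape and column order:
import numpy as np
# take df2's value where present, otherwise the value from df1 at the same position
df2 = pd.DataFrame(
    np.where(df2.isna(), df1.to_numpy(), df2.to_numpy()),
    index=df2.index, columns=df2.columns,
)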

Drop rows in pandas dataframe based on columns value

I have a dataframe like this:
import numpy as np
import pandas as pd

cols = ['a', 'b']
df = pd.DataFrame(data=[[np.nan, -1, np.nan, 34], [-32, 1, -4, np.nan],
                        [4, 5, 41, 14], [3, np.nan, 1, np.nan]],
                  columns=['a', 'b', 'c', 'd'])
I want to retrieve all rows where the columns 'a' and 'b' are non-negative, but if either or both of them are missing, I want to keep the row as well.
The result should be
a b c d
2 4 5 41 14
3 3 NaN 1 NaN
I've tried this but it doesn't give the expected result.
df[(df[cols]>0).all(axis=1) | df[cols].isnull().any(axis=1)]
IIUC, you actually want
>>> df[((df[cols] > 0) | df[cols].isnull()).all(axis=1)]
a b c d
2 4 5 41 14
3 3 NaN 1 NaN
Right now you're getting "if they're all positive" or "any are null". You want "if they're all (positive or null)". (Replace > 0 with >=0 for nonnegativity.)
And since NaN isn't positive, we could simplify by flipping the condition, and use something like
>>> df[~(df[cols] <= 0).any(axis=1)]
a b c d
2 4 5 41 14
3 3 NaN 1 NaN
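Equivalently, building the mask explicitly (with >= 0 for true non-negativity) reads quite clearly:
mask = (df[cols].ge(0) | df[cols].isnull()).all(axis=1)
df[mask]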
