Python Pandas: Aggregate rows conditional value picking

I have a dataframe like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'dim': {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'A'},
'id': {0: 1, 1: 1, 2: 2, 3: 2, 4: 3},
'value1': {0: np.nan, 1: 1.2, 2: 2.0, 3: np.nan, 4: 3.0},
'value2': {0: 1.0, 1: 2.0, 2: np.nan, 3: np.nan, 4: np.nan}})
dim id value1 value2
0 A 1 NaN 1.0
1 B 1 1.2 2.0
2 A 2 2.0 NaN
3 B 2 NaN NaN
4 A 3 3.0 NaN
I now want to aggregate the values for different dimensions over the id, so that the following is true:
If the value where dim == 'A' is not NaN, take it; otherwise take the value where dim == 'B' (if it is not NaN). If both are NaN, keep NaN.
So the result should be:
id value1 value2
0 1 1.2 1.0
1 2 2.0 NaN
2 3 3.0 NaN
My guess is that I need some form of groupby, but I am not too sure. Maybe something with apply?

You can use set_index with unstack and swaplevel to reshape, and then combine_first:
df1 = df.set_index(['id','dim']).unstack().swaplevel(0,1,axis=1)
#alternative
#df1 = df.pivot(index='id', columns='dim').swaplevel(0,1,axis=1)
print (df1)
dim A B A B
value1 value1 value2 value2
id
1 NaN 1.2 1.0 2.0
2 2.0 NaN NaN NaN
3 3.0 NaN NaN NaN
df2 = df1['A'].combine_first(df1['B']).reset_index()
print (df2)
id value1 value2
0 1 1.2 1.0
1 2 2.0 NaN
2 3 3.0 NaN
A similar solution uses xs to select from the MultiIndex:
df1 = df.set_index(['id','dim']).unstack()
#alternative
#df1 = df.pivot(index='id', columns='dim')
print (df1)
value1 value2
dim A B A B
id
1 NaN 1.2 1.0 2.0
2 2.0 NaN NaN NaN
3 3.0 NaN NaN NaN
df2 = df1.xs('A', axis=1, level=1).combine_first(df1.xs('B', axis=1, level=1)).reset_index()
print (df2)
id value1 value2
0 1 1.2 1.0
1 2 2.0 NaN
2 3 3.0 NaN
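As a side note (my addition, not part of the original answer), a shorter sketch relies on groupby.first() skipping NaN values per column: sorting by dim puts the 'A' rows first within each id, so their values win whenever they are present.
import numpy as np
import pandas as pd
df = pd.DataFrame({'dim': ['A', 'B', 'A', 'B', 'A'],
'id': [1, 1, 2, 2, 3],
'value1': [np.nan, 1.2, 2.0, np.nan, 3.0],
'value2': [1.0, 2.0, np.nan, np.nan, np.nan]})
# 'A' sorts before 'B', and groupby.first() returns the first
# non-NaN value per column within each group
out = df.sort_values('dim').groupby('id', as_index=False).first().drop(columns='dim')
print(out)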

Related

Add an empty row in a dataframe when the entries of a column repeat

I have a dataframe that stores time-series data.
Please find the code below.
import pandas as pd
from pprint import pprint
d = {
't': [0, 1, 2, 0, 2, 0, 1],
'input': [2, 2, 2, 2, 2, 2, 4],
'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
'value': [0.1, 0.2, 0.3, 1, 2, 3, 1],
}
df = pd.DataFrame(d)
pprint(df)
df>
t input type value
0 2 A 0.1
1 2 A 0.2
2 2 A 0.3
0 2 B 1.0
2 2 B 2.0
0 2 B 3.0
1 4 A 1.0
When the first entry of the column t repeats, I would like to add an empty row.
Expected output:
df>
t input type value
0 2 A 0.1
1 2 A 0.2
2 2 A 0.3
0 2 B 1.0
2 2 B 2.0
0 2 B 3.0
1 4 A 1.0
I am not sure how to do this. Suggestions will be really helpful.
EDIT:
dup = df['t'].eq(0).shift(-1, fill_value=False)
helps when the starting value in column t is 0.
But it could also be a non-zero value, like in the example below.
Additional example:
d = {
't': [25, 35, 90, 25, 90, 25, 35],
'input': [2, 2, 2, 2, 2, 2, 4],
'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
'value': [0.1, 0.2, 0.3, 1, 2, 3, 1],
}
There are several ways to achieve this.
option 1
you can use groupby.apply:
(df.groupby(df['t'].eq(0).cumsum(), as_index=False, group_keys=False)
.apply(lambda d: pd.concat([d, pd.Series(index=d.columns, name='').to_frame().T]))
)
output:
t input type value
0 0.0 2.0 A 0.1
1 1.0 2.0 A 0.2
2 2.0 2.0 A 0.3
NaN NaN NaN NaN
3 0.0 2.0 B 1.0
4 2.0 2.0 B 2.0
NaN NaN NaN NaN
5 0.0 2.0 B 3.0
6 1.0 4.0 A 1.0
NaN NaN NaN NaN
option 2
An alternative if the index is already sorted:
dup = df['t'].eq(0).shift(-1, fill_value=False)
pd.concat([df, df.loc[dup].assign(**{c: '' for c in df})]).sort_index()
output:
t input type value
0 0 2 A 0.1
1 1 2 A 0.2
2 2 2 A 0.3
2
3 0 2 B 1.0
4 2 2 B 2.0
4
5 0 2 B 3.0
6 1 4 A 1.0
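A hedged variant (my addition, not in the original answer): if t does not start at 0, the same trick works by marking the rows after which the sequence restarts, i.e. where t decreases. This reuses df and pd from above.
# True on rows whose successor has a smaller t (the sequence restarts)
dup = df['t'].diff().shift(-1).lt(0)
pd.concat([df, df.loc[dup].assign(**{c: '' for c in df})]).sort_index()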
addendum on grouping
set the group when the value decreases:
dup = df['t'].diff().lt(0).cumsum()
(df.groupby(dup, as_index=False, group_keys=False)
.apply(lambda d: pd.concat([d, pd.Series(index=d.columns, name='').to_frame().T]))
)
Because groupby is generally slow, you can instead create a helper DataFrame with one empty row per consecutive group (groups start where t resets), join with concat and sort:
#groups start where t equals 0
df.index = df['t'].eq(0).cumsum()
#alternative: groups start where the difference is not positive (t resets)
df.index = (~df['t'].diff().gt(0)).cumsum()
df = (pd.concat([df, pd.DataFrame('', columns=df.columns, index=df.index.unique())])
.sort_index(kind='mergesort', ignore_index=True)
.iloc[:-1])
print (df)
t input type value
0 0 2 A 0.1
1 1 2 A 0.2
2 2 2 A 0.3
3
4 0 2 B 1.0
5 2 2 B 2.0
6
7 0 2 B 3.0
8 1 4 A 1.0
df.index = (~df['t'].diff().gt(0)).cumsum()
df = (pd.concat([df, pd.DataFrame(' ', columns=df.columns, index=df.index.unique())])
.sort_index(kind='mergesort', ignore_index=True)
.iloc[:-1])
print (df)
t input type value
0 25 2 A 0.1
1 35 2 A 0.2
2 90 2 A 0.3
3
4 25 2 B 1.0
5 90 2 B 2.0
6
7 25 2 B 3.0
8 35 4 A 1.0
Here is my suggestion:
pd.concat([pd.DataFrame(index=df.index[df.t == df.t.iat[0]][1:]), df]).sort_index()
t input type value
0 25.0 2.0 A 0.1
1 35.0 2.0 A 0.2
2 90.0 2.0 A 0.3
3 NaN NaN NaN NaN
3 25.0 2.0 B 1.0
4 90.0 2.0 B 2.0
5 NaN NaN NaN NaN
5 25.0 2.0 B 3.0
6 35.0 4.0 A 1.0

Pandas left merge but overwrite with right data

I would like to merge two dataframes; df2 may have more columns and will always have exactly one row. I would like the data from the df2 row to overwrite the matching row in df, matching on a.
df = pd.DataFrame({'a': {0: 0, 1: 1, 2: 2}, 'b': {0: 3, 1: 4, 2: 5}})
df2 = pd.DataFrame({'a': {0: 1}, 'b': {0: 90}, 'c': {0: 76}})
>>> df
a b
0 0 3
1 1 4
2 2 5
>>> df2
a b c
0 1 90 76
The desired output:
a b c
0 0 3 NaN
1 1 90 76
2 2 5 NaN
I have tried merge left but this creates two b columns (b_x and b_y):
>>> pd.merge(df,df2,how='left', on='a')
a b_x b_y c
0 0 3 NaN NaN
1 1 4 90.0 76.0
2 2 5 NaN NaN
You can use df.combine_first here:
df2.set_index("a").combine_first(df.set_index("a")).reset_index()
Or with merge:
out = df.merge(df2,on=['a'],how='left')
out.loc[:,out.columns.str.endswith("_x")] = out.loc[:,
out.columns.str.endswith("_y")].to_numpy()
out = out.groupby(out.columns.str.split("_").str[0],axis=1).first()
print(out)
a b c
0 0 3.0 NaN
1 1 90.0 76.0
2 2 5.0 NaN
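Another sketch (my addition, not from the original answers): DataFrame.update overwrites values in place by index alignment, once df has been given df2's extra columns.
import pandas as pd
df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5]})
df2 = pd.DataFrame({'a': [1], 'b': [90], 'c': [76]})
# give df the missing 'c' column, align both frames on 'a',
# then overwrite df's values with df2's where they overlap
out = df.set_index('a').reindex(columns=df2.columns.drop('a'))
out.update(df2.set_index('a'))
print(out.reset_index())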

Fill out NA values in Pandas DataFrame by using another Pandas DataFrame

import pandas as pd
df1 = pd.DataFrame({
'value1': ["a","a","a","b","b","b","c","c"],
'value2': [1,2,3,4,4,4,5,5],
'value3': [1,2,3, None , None, None, None, None],
'value4': [1,2,3,None , None, None, None, None],
'value5': [1,2,3,None , None, None, None, None]})
df2 = pd.DataFrame({
'value1': ["k","j","l","m","x","y"],
'value2': [2, 2, 1, 3, 4, 5],
'value3': [2, 2, 2, 3, 4, 5],
'value4': [3, 2, 2, 3, 4, 5],
'value5': [2, 1, 2, 3, 4, 5]})
df1 =
value1 value2 value3 value4 value5
0 a 1 1.0 1.0 1.0
1 a 2 2.0 2.0 2.0
2 a 3 3.0 3.0 3.0
3 b 4 NaN NaN NaN
4 b 4 NaN NaN NaN
5 b 4 NaN NaN NaN
6 c 5 NaN NaN NaN
7 c 5 NaN NaN NaN
df2 =
value1 value2 value3 value4 value5
0 k 2 2 3 2
1 j 2 2 2 1
2 l 1 2 2 2
3 m 3 3 3 3
4 x 4 4 4 4
5 y 5 5 5 5
I would like to fill NaN in df1 from values in df2
So the results of df1 will look like
df1 =
value1 value2 value3 value4 value5
0 a 1 1.0 1.0 1.0
1 a 2 2.0 2.0 2.0
2 a 3 3.0 3.0 3.0
3 b 4 2 2 1
4 b 4 2 2 2
5 b 4 3 3 3
6 c 5 4 4 4
7 c 5 5 5 5
I used the following code.
tmp1 = df1[df1.value1 == 'b'].iloc[:, 2:]
tmp2 = df2.iloc[1:, 2:]
tmp1 = tmp2 can update the values in tmp1, but when I use the following
df1[df1.value1 == 'b'].iloc[:, 2:] = tmp2
it doesn't update the values in df1, as shown below.
value1 value2 value3 value4 value5
0 a 1 1.0 1.0 1.0
1 a 2 2.0 2.0 2.0
2 a 3 3.0 3.0 3.0
3 b 4 NaN NaN NaN
4 b 4 NaN NaN NaN
5 b 4 NaN NaN NaN
6 c 5 NaN NaN NaN
7 c 5 NaN NaN NaN
Why it happens and how can I solve this issue?
Thank you.
This line doesn't do what you think it's doing:
tmp1 = df1[df1.value1 == 'b'].iloc[:, 2:]
Indexing is applied sequentially: df1[df1.value1 == 'b'] first builds a temporary object holding only rows 3, 4, 5 of df1, and .iloc then selects from that temporary. When you assign through such a chain, the assignment lands on the copy, not on df1 (the classic chained-assignment pitfall). It also isn't what you want logically: you want to update all rows starting from the first one that satisfies your condition.
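A toy sketch of the difference (the frame and names here are illustrative only):
import pandas as pd
df = pd.DataFrame({'x': ['a', 'b'], 'y': [1.0, None]})
# chained indexing: df[df['x'] == 'b'] builds a temporary copy, so this
# assignment modifies the copy and df stays unchanged
# (pandas emits a SettingWithCopyWarning here)
df[df['x'] == 'b']['y'] = 99.0
# a single .loc call selects rows and columns in one operation,
# so the assignment writes into df itself
df.loc[df['x'] == 'b', 'y'] = 99.0
print(df)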
Instead, first find the required index.
idx = df1['value1'].eq('b').values.argmax()
You then need to explicitly assign the last n rows from df2:
df1.iloc[idx:, 2:] = df2.iloc[-(len(df1.index)-idx):, 2:].values
print(df1)
value1 value2 value3 value4 value5
0 a 1 1.0 1.0 1.0
1 a 2 2.0 2.0 2.0
2 a 3 3.0 3.0 3.0
3 b 4 2.0 2.0 1.0
4 b 4 2.0 2.0 2.0
5 b 4 3.0 3.0 3.0
6 c 5 4.0 4.0 4.0
7 c 5 5.0 5.0 5.0
If you want to replace the NaN values using index alignment, use pandas fillna:
df1.fillna(df2)
Add inplace if you want to update df1 in place:
df1.fillna(df2, inplace=True)
Edit, for the case without aligned indexes:
If the indexes of the target and replacement values are not aligned, they can be aligned first so that the DataFrame fillna method can be used.
To align them, get the indexes of the rows in df1 whose NaNs are to be replaced, filter df2 down to the replacement values, and assign those df1 indexes as the index of df2. Then use fillna to transfer the values from df2 to df1.
# in this case, find index values when df1.value1 is greater than or equal to 'b'
# (alternately could be indexes of rows containing nans)
idx = df1.index[df1.value1 >= 'b']
# get the section of df2 which will provide replacement values
# limit length to length of idx
align_df = df2[1:len(idx) + 1]
# set the index to match the nan rows from df1
align_df.index = idx
# use auto-alignment with fillna to transfer values from align_df(df2) to df1
df1.fillna(align_df)
# or can use df1.combine_first(align_df) because of the matching target and replacement indexes

Python convert specific dataframe columns to integer

I have a dataframe of 8 columns and I would like to convert the last six columns to integer. The dataframe also contains NaN values, and I don't want to remove them.
a b c d e f g h
0 john 1 NaN 2.0 2.0 42.0 3.0 NaN
1 david 2 28.0 52.0 15.0 NaN 2.0 NaN
2 kevin 3 1.0 NaN 1.0 10.0 1.0 5.0
Any ideas?
Thank you.
Thanks to MaxU, I'm adding this option with NaN replaced by -1:
Reason: NaN is a float value and can't coexist with integers in a plain integer column.
So you either keep NaN and floats, or treat -1 as NaN.
http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.to_numeric.html
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({'a': {0: 'john', 1: 'david', 2: 'kevin'},
'b': {0: 1, 1: 2, 2: 3},
'c': {0: np.nan, 1: 28.0, 2: 1.0},
'd': {0: 2.0, 1: 52.0, 2: np.nan},
'e': {0: 2.0, 1: 15.0, 2: 1.0},
'f': {0: 42.0, 1: np.nan, 2: 10.0},
'g': {0: 3.0, 1: 2.0, 2: 1.0},
'h': {0: np.nan, 1: np.nan, 2: 5.0}})
df.iloc[:, -6:] = df.iloc[:, -6:].fillna(-1)
df.iloc[:, -6:] = df.iloc[:, -6:].apply(pd.to_numeric, downcast='integer')
df
a b c d e f g h
0 john 1 -1 2 2 42 3 -1
1 david 2 28 52 15 -1 2 -1
2 kevin 3 1 -1 1 10 1 5
Thanks @AntonvBR for the downcast='integer' hint:
In [29]: df.iloc[:, -6:] = df.iloc[:, -6:].apply(pd.to_numeric, errors='coerce', downcast='integer')
In [30]: df
Out[30]:
a b c d e f g h
0 john 1 NaN 2.0 2 42.0 3 NaN
1 david 2 28.0 52.0 15 NaN 2 NaN
2 kevin 3 1.0 NaN 1 10.0 1 5.0
In [31]: df.dtypes
Out[31]:
a object
b int64
c float64
d float64
e int8
f float64
g int8
h float64
dtype: object
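Note: both options above work around the fact that a plain int64 column cannot hold NaN. Since pandas 0.24 the nullable Int64 extension dtype lifts that restriction, so a sketch for newer pandas versions could be:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': ['john', 'david', 'kevin'],
'b': [1, 2, 3],
'c': [np.nan, 28.0, 1.0],
'd': [2.0, 52.0, np.nan],
'e': [2.0, 15.0, 1.0],
'f': [42.0, np.nan, 10.0],
'g': [3.0, 2.0, 1.0],
'h': [np.nan, np.nan, 5.0]})
# nullable integer columns keep missing values, displayed as <NA>
cols = df.columns[-6:]
df[cols] = df[cols].astype('Int64')
print(df.dtypes)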

Python pandas.DataFrame: Make whole row NaN according to condition

I want to make the whole row NaN according to a condition based on a column. For example, if B > 5, I want to make the whole row NaN.
Unprocessed data frame looks like this:
A B
0 1 4
1 3 5
2 4 6
3 8 7
Make the whole row NaN if B > 5:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thank you.
Use boolean indexing to assign values by condition:
import numpy as np

df[df['B'] > 5] = np.nan
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Or DataFrame.mask, which by default fills with NaN where the condition holds:
df = df.mask(df['B'] > 5)
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thank you Bharath shetty:
df = df.where(~(df['B']>5))
You can also use df.loc[df.B > 5, :] = np.nan
Example
In [14]: df
Out[14]:
A B
0 1 4
1 3 5
2 4 6
3 8 7
In [15]: df.loc[df.B > 5, :] = np.nan
In [16]: df
Out[16]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
In human language, df.loc[df.B > 5, :] = np.nan can be translated to:
assign np.nan to every column (:) of the dataframe (df) in rows where
the condition df.B > 5 holds.
Or using reindex, which drops the rows failing the condition and then reintroduces them as NaN:
df.loc[df.B<=5,:].reindex(df.index)
Out[83]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
