Find the first non-NaN value in Pandas - python

I have a Pandas dataframe such that
|user_id|value|No|
|:-:|:-:|:-:|
|id1|100|1|
|id1|200|2|
|id1|250|3|
|id2|NaN|1|
|id2|100|2|
|id3|400|1|
|id3|NaN|2|
|id3|200|3|
|id4|NaN|1|
|id4|NaN|2|
|id4|300|3|
Then I want the following dataset:
|user_id|value|No|NewNo|
|:-:|:-:|:-:|:-:|
|id1|100|1|1|
|id1|200|2|2|
|id1|250|3|3|
|id2|100|2|1|
|id3|400|1|1|
|id3|NaN|2|2|
|id3|200|3|3|
|id4|300|3|1|
Namely, I want to drop the leading NaN rows of each user_id group, so that the first value in every group is not NaN, and then renumber the remaining rows. Thank you.
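For anyone who wants to run the answers below, here is a minimal construction of the example frame (a sketch assuming pandas and numpy, with NaN represented by np.nan):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user_id': ['id1', 'id1', 'id1', 'id2', 'id2',
                'id3', 'id3', 'id3', 'id4', 'id4', 'id4'],
    'value':   [100, 200, 250, np.nan, 100,
                400, np.nan, 200, np.nan, np.nan, 300],
    'No':      [1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 3],
})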

You can group by user_id and forward-fill the value column. Any nulls that remain after the fill are exactly the leading nulls of each group, so filter those rows out:
df2 = df[df.groupby('user_id').value.ffill().apply(pd.notnull)].copy()
# application of copy here creates a new data frame and allows us to assign
# values to the result (df2). This is needed to create the column `NewNo`
# in the next & final step
# df2 outputs:
user_id value No
0 id1 100.0 1
1 id1 200.0 2
2 id1 250.0 3
4 id2 100.0 2
5 id3 400.0 1
6 id3 NaN 2
7 id3 200.0 3
10 id4 300.0 3
Generate the NewNo column by ranking No within each group.
df2['NewNo'] = df2.groupby('user_id').No.rank()
# df2 outputs:
user_id value No NewNo
0 id1 100.0 1 1.0
1 id1 200.0 2 2.0
2 id1 250.0 3 3.0
4 id2 100.0 2 1.0
5 id3 400.0 1 1.0
6 id3 NaN 2 2.0
7 id3 200.0 3 3.0
10 id4 300.0 3 1.0
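For reference, the two steps can be chained into one short snippet (a sketch assuming the df constructed under the question; .notna() replaces .apply(pd.notnull), and the final astype(int) is only needed if you prefer integer labels over the float ranks shown above):
df2 = df[df.groupby('user_id')['value'].ffill().notna()].copy()
df2['NewNo'] = df2.groupby('user_id')['No'].rank().astype(int)
print(df2)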

groupby + first_valid_index + cumcount
You can calculate indices for first non-null values by group, then use Boolean indexing:
# use transform to align groupwise first_valid_index with dataframe
firsts = df.groupby('user_id')['value'].transform(pd.Series.first_valid_index)
# apply Boolean filter; take a copy so the new column below can be assigned without warnings
res = df[df.index >= firsts].copy()
# use groupby + cumcount to add groupwise labels
res['NewNo'] = res.groupby('user_id').cumcount() + 1
print(res)
user_id value No NewNo
0 id1 100.0 1 1
1 id1 200.0 2 2
2 id1 250.0 3 3
4 id2 100.0 2 1
5 id3 400.0 1 1
6 id3 NaN 2 2
7 id3 200.0 3 3
10 id4 300.0 3 1

Related

Backfill column values using real value divided by number of preceding NA values in Pandas

import numpy as np
import pandas as pd

test_df = pd.DataFrame({'a': [np.nan, np.nan, np.nan, 4, np.nan, np.nan, 6]})
test_df
a
0 NaN
1 NaN
2 NaN
3 4.0
4 NaN
5 NaN
6 6.0
I'm trying to backfill each value with the real value divided by the number of preceding NaNs plus one (its own row). The following is what I'm trying to get:
a
0 1.0
1 1.0
2 1.0
3 1.0
4 2.0
5 2.0
6 2.0
Try:
# identify the blocks by cumsum on the reversed non-nan series
groups = test_df['a'].notna()[::-1].cumsum()
# groupby and transform
test_df['a'] = test_df['a'].fillna(0).groupby(groups).transform('mean')
Output:
a
0 1.0
1 1.0
2 1.0
3 1.0
4 2.0
5 2.0
6 2.0
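To see why this works, it can help to inspect the group labels produced by the reversed cumsum (a quick check, not part of the answer): every NaN run gets the same label as the non-NaN value that terminates it.
groups = test_df['a'].notna()[::-1].cumsum()
print(groups.sort_index().tolist())   # [2, 2, 2, 2, 1, 1, 1]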
IIUC use:
# get reverse group
group = test_df.loc[::-1,'a'].notna().cumsum()
# get size and divide
test_df['a'] = (test_df['a']
                .bfill()
                .div(test_df.groupby(group)['a'].transform('size'))
                )
Or with rdiv:
test_df['a'] = (test_df
                .groupby(group)['a']
                .transform('size')
                .rdiv(test_df['a'].bfill())
                )
Output (as new column for clarity):
a a2
0 NaN 1.0
1 NaN 1.0
2 NaN 1.0
3 4.0 1.0
4 NaN 2.0
5 NaN 2.0
6 6.0 2.0

Join dataframes by key - repeated data as new columns

I'm facing the following situation. I have two dataframes, let's say df1 and df2, and I need to join them by a key (ID_ed, ID). The second dataframe may have more than one occurrence of the key; what I need is to join the two dataframes and add the repeated occurrences of the key as new columns (as shown in the next image).
I tried to use merge = df2.join(df1, lsuffix='_ZID', rsuffix='_IID', how='left') and concat operations, but no luck so far. It seems to preserve only the last occurrence (as if it were overwriting the data).
Any help in this is really appreciated, and thanks in advance.
Another approach is to create a serial counter for the ID_ed column and use set_index and unstack, before calling pivot_table with 'first' as the aggregation. This approach is fairly similar to this SO answer.
Generate the data
import pandas as pd
import numpy as np
a = [['ID_ed', 'color'], [1, 5], [2, 8], [3, 7]]
b = [['ID', 'code'], [1, 1], [1, 5],
     [2, np.nan], [2, 20], [2, 74],
     [3, 10], [3, 98], [3, 85],
     [3, 21], [3, 45]]
df1 = pd.DataFrame(a[1:], columns=a[0])
df2 = pd.DataFrame(b[1:], columns=b[0])
print(df1)
ID_ed color
0 1 5
1 2 8
2 3 7
print(df2)
ID code
0 1 1.0
1 1 5.0
2 2 NaN
3 2 20.0
4 2 74.0
5 3 10.0
6 3 98.0
7 3 85.0
8 3 21.0
9 3 45.0
First the merge and unstack
# Merge and add a serial counter column
df = df1.merge(df2, how='inner', left_on='ID_ed', right_on='ID')
df['counter'] = df.groupby('ID_ed').cumcount()+1
print(df)
ID_ed color ID code counter
0 1 5 1 1.0 1
1 1 5 1 5.0 2
2 2 8 2 NaN 1
3 2 8 2 20.0 2
4 2 8 2 74.0 3
5 3 7 3 10.0 1
6 3 7 3 98.0 2
7 3 7 3 85.0 3
8 3 7 3 21.0 4
9 3 7 3 45.0 5
# Set index and unstack; assign the result so it can be printed and inspected
df_wide = (df.set_index(['ID_ed', 'color', 'counter'])
             .unstack()
             .swaplevel(1, 0, axis=1)
             .sort_index(level=0, axis=1)
             .add_prefix('counter_'))
print(df_wide)
counter     counter_1              counter_2
           counter_ID counter_code counter_ID counter_code
ID_ed color
1     5           1.0          1.0        1.0          5.0
2     8           2.0          NaN        2.0         20.0
3     7           3.0         10.0        3.0         98.0

counter     counter_3              counter_4              counter_5
           counter_ID counter_code counter_ID counter_code counter_ID counter_code
ID_ed color
1     5           NaN          NaN        NaN          NaN        NaN          NaN
2     8           2.0         74.0        NaN          NaN        NaN          NaN
3     7           3.0         85.0        3.0         21.0        3.0         45.0
Next generate the pivot table
# Pivot table with 'first' aggregation
dfp = pd.pivot_table(df, index=['ID_ed','color'],
columns=['counter'],
values=['ID', 'code'],
aggfunc='first')
print(dfp)
ID code
counter 1 2 3 4 5 1 2 3 4 5
ID_ed color
1 5 1.0 1.0 NaN NaN NaN 1.0 5.0 NaN NaN NaN
2 8 2.0 2.0 2.0 NaN NaN NaN 20.0 74.0 NaN NaN
3 7 3.0 3.0 3.0 3.0 3.0 10.0 98.0 85.0 21.0 45.0
Finally rename the columns and slice by partial column name
# Rename columns
level_1_names = list(dfp.columns.get_level_values(1))
level_0_names = list(dfp.columns.get_level_values(0))
new_cnames = [b+'_'+str(f) for f, b in zip(level_1_names, level_0_names)]
dfp.columns = new_cnames
# Slice by new column names
print(dfp.loc[:, dfp.columns.str.contains('code')].reset_index(drop=False))
ID_ed color code_1 code_2 code_3 code_4 code_5
0 1 5 1.0 5.0 NaN NaN NaN
1 2 8 NaN 20.0 74.0 NaN NaN
2 3 7 10.0 98.0 85.0 21.0 45.0
I'd use cumcount and pivot_table:
In [11]: df1
Out[11]:
ID color
0 1 5
1 2 8
2 3 7
In [12]: df2
Out[12]:
ID code
0 1 1.0
1 1 5.0
2 2 NaN
3 2 20.0
4 2 74.0
In [13]: res = df1.merge(df2)  # merges on the shared column name (ID)
In [14]: res
Out[14]:
ID color code
0 1 5 1.0
1 1 5 5.0
2 2 8 NaN
3 2 8 20.0
4 2 8 74.0
In [15]: res['count'] = res.groupby('ID').cumcount()
In [16]: res.pivot_table('code', ['ID', 'color'], 'count')
Out[16]:
count 0 1 2
ID color
1 5 1.0 5.0 NaN
2 8 NaN 20.0 74.0
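If flat column names such as code_0, code_1, ... are wanted (a naming choice, not something the question requires), the pivoted columns can be renamed afterwards; this continues from the res built above:
out = res.pivot_table('code', ['ID', 'color'], 'count')
out.columns = [f'code_{c}' for c in out.columns]   # flatten the 'count' level into plain names
print(out.reset_index())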

Fill null values with values from a column in another dataset

I have 2 datasets like this:
df1
category cost
0 1 33.0
1 1 33.0
2 2 18.0
3 1 NaN
4 3 8.0
5 2 NaN
df2.head(2)
cost
3 33.0
5 55.0
df2 contains a single cost column whose values sit at exactly the indexes where df1's cost is null.
I would like to do get this result:
df1
category cost
0 1 33.0
1 1 33.0
2 2 18.0
3 1 33.0
4 3 8.0
5 2 55.0
So I want to fill the cost column in df1 with the values from df2 at the same indexes.
fillna
Pandas aligns on the index naturally:
df1['cost'] = df1['cost'].fillna(df2['cost'])
print(df1)
category cost
0 1 33.0
1 1 33.0
2 2 18.0
3 1 33.0
4 3 8.0
5 2 55.0
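An equivalent alternative that relies on the same index alignment is combine_first, which keeps df1's values and only fills its gaps from df2 (a self-contained sketch that rebuilds the question's frames):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'category': [1, 1, 2, 1, 3, 2],
                    'cost': [33.0, 33.0, 18.0, np.nan, 8.0, np.nan]})
df2 = pd.DataFrame({'cost': [33.0, 55.0]}, index=[3, 5])

# combine_first aligns on the index: NaNs in df1['cost'] are filled from df2['cost']
df1['cost'] = df1['cost'].combine_first(df2['cost'])
print(df1)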

Pandas: get two different rows with same pair of values in two different columns

I have two columns, _Id and _ParentId, with this example data. Using this I want to group _Id with _ParentId.
_Id _ParentId
1 NaN
2 NaN
3 1.0
4 2.0
5 NaN
6 2.0
After grouping the result should be shown as below.
_Id _ParentId
1 NaN
3 1.0
2 NaN
4 2.0
6 2.0
5 NaN
The main aim of this is to work out which _Id belongs to which _ParentId (e.g. _Id 3 belongs to _Id 1).
I have attempted to use groupby and duplicated but I can't seem to get the results shown above.
Use sort_values on a temporary key:
In [3188]: (df.assign(temp=df._ParentId.combine_first(df._Id))
              .sort_values(by='temp').drop('temp', axis=1))
Out[3188]:
_Id _ParentId
0 1 NaN
2 3 1.0
1 2 NaN
3 4 2.0
5 6 2.0
4 5 NaN
Details
In [3189]: df._ParentId.combine_first(df._Id)
Out[3189]:
0 1.0
1 2.0
2 1.0
3 2.0
4 5.0
5 2.0
Name: _ParentId, dtype: float64
In [3190]: df.assign(temp=df._ParentId.combine_first(df._Id))
Out[3190]:
_Id _ParentId temp
0 1 NaN 1.0
1 2 NaN 2.0
2 3 1.0 1.0
3 4 2.0 2.0
4 5 NaN 5.0
5 6 2.0 2.0
Your expected output is almost the same as the input, just with IDs 4 and 6 placed together and the NaN rows in different positions. It's not possible to get exactly that output.
Here is how group-by would ideally work:
print("Original: ")
print(df)
df = df.fillna(0)  # groupby drops NaN keys, so replace them with a placeholder first
df2 = df.groupby('_Parent')
print("\nAfter grouping: ")
for key, item in df2:
    print(df2.get_group(key))
Output:
Original:
_Id _Parent
0 1 NaN
1 2 NaN
2 3 1.0
3 4 2.0
4 5 NaN
5 6 2.0
After grouping:
_Id _Parent
0 1 0.0
1 2 0.0
4 5 0.0
_Id _Parent
2 3 1.0
_Id _Parent
3 4 2.0
5 6 2.0

Append rows to dataframe, add new columns if not exist

I have a df like the one below:
>>df
group sub_group max
0 A 1 30.0
1 B 1 300.0
2 B 2 3.0
3 A 2 2.0
I need to have group and sub_group as attributes (columns) and max as a row.
So I do
>>> df.set_index(['group','sub_group']).T
group A B A
sub_group 1 1 2 2
max 30.0 300.0 3.0 2.0
This gives me my intended formatting
Now I need to merge it with another similar dataframe, say
>>df2
group sub_group max
0 C 1 3000.0
1 A 1 4000.0
Such that my merge results in
group A B A C
sub_group 1 1 2 2 1
max 30.0 300.0 3.0 2.0 NaN
max 4000.0 NaN NaN NaN 3000.0
Basically, for every new df we place the values under the appropriate heading, and if there is a new group or sub_group we add it to the larger df. I am not sure if my approach of transposing and then trying to merge/append is a good one.
Since these dfs are generated in a loop (the loop items being dates), I would also like a way to replace the max printed in the first column (of the expected output) with the loop date.
dates = ['20170525', '20170623', '20170726']
for date in dates:
    df = pd.read_csv()
I think you can first pass the index_col parameter to read_csv to build a MultiIndex from the first and second columns:
dfs = []
for date in dates:
    df = pd.read_csv('name', index_col=[0, 1])
    dfs.append(df)
# another test df was added
print(df3)
max
group sub_group
D 1 3000.0
E 1 4000.0
Then concat them together, passing the list of dates as the keys parameter, then reshape with unstack and transpose:
#dfs = [df,df2,df3]
dates=['20170525', '20170623', '20170726']
df = pd.concat(dfs, keys=dates)['max'].unstack(0).T
print (df)
group A B C D E
sub_group 1 2 1 2 1 1 1
20170525 30.0 2.0 300.0 3.0 NaN NaN NaN
20170623 4000.0 NaN NaN NaN 3000.0 NaN NaN
20170726 NaN NaN NaN NaN NaN 3000.0 4000.0
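If you want to try the reshaping step without reading any files, the three frames can be built in memory instead (a sketch with made-up stand-ins for the CSVs; the concat/unstack/transpose line is the same as above):
import pandas as pd

df  = pd.DataFrame({'group': ['A', 'B', 'B', 'A'], 'sub_group': [1, 1, 2, 2],
                    'max': [30.0, 300.0, 3.0, 2.0]}).set_index(['group', 'sub_group'])
df2 = pd.DataFrame({'group': ['C', 'A'], 'sub_group': [1, 1],
                    'max': [3000.0, 4000.0]}).set_index(['group', 'sub_group'])
df3 = pd.DataFrame({'group': ['D', 'E'], 'sub_group': [1, 1],
                    'max': [3000.0, 4000.0]}).set_index(['group', 'sub_group'])

dates = ['20170525', '20170623', '20170726']
out = pd.concat([df, df2, df3], keys=dates)['max'].unstack(0).T
print(out)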
