I have the following df:
import pandas as pd

df = pd.DataFrame(data={'day': [1, 1, 1, 2, 2, 3], 'pos': 2*[1, 14, 18], 'value': 2*[1, 2, 3]})
df
day pos value
0 1 1 1
1 1 14 2
2 1 18 3
3 2 1 1
4 2 14 2
5 3 18 3
and I want to fill in rows such that every day has every possible value of column 'pos'.
desired result:
day pos value
0 1 1 1.0
1 1 14 2.0
2 1 18 3.0
3 2 1 1.0
4 2 14 2.0
5 2 18 NaN
6 3 1 NaN
7 3 14 NaN
8 3 18 3.0
My attempt:
df.set_index('pos').reindex(pd.Index(3*[1,14,18])).reset_index()
yields:
ValueError: cannot reindex from a duplicate axis
(reindex needs a unique index, and 'pos' contains duplicate labels once it is set as the index).
Let's try pivot then stack:
df.pivot(index='day', columns='pos', values='value').stack(dropna=False).reset_index(name='value')
Output:
day pos value
0 1 1 1.0
1 1 14 2.0
2 1 18 3.0
3 2 1 1.0
4 2 14 2.0
5 2 18 NaN
6 3 1 NaN
7 3 14 NaN
8 3 18 3.0
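Note that pivot requires the (day, pos) pairs to be unique. If duplicates are possible, pivot_table aggregates them instead; a minimal sketch using the mean:
# pivot_table tolerates duplicated (day, pos) pairs by aggregating them
df.pivot_table(index='day', columns='pos', values='value', aggfunc='mean') \
    .stack(dropna=False).reset_index(name='value')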
Option 2: merge with MultiIndex:
df.merge(pd.DataFrame(index=pd.MultiIndex.from_product([df['day'].unique(), df['pos'].unique()])),
left_on=['day','pos'], right_index=True, how='outer')
Output:
day pos value
0 1 1 1.0
1 1 14 2.0
2 1 18 3.0
3 2 1 1.0
4 2 14 2.0
5 3 18 3.0
5 2 18 NaN
5 3 1 NaN
5 3 14 NaN
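The outer merge appends the unmatched combinations at the end, reusing the last index label; a small follow-up sketch (assuming the merge result was stored in a variable out) restores a tidy order:
# out holds the merge result from above (hypothetical variable name)
out = out.sort_values(['day', 'pos']).reset_index(drop=True)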
You can reindex:
s = pd.MultiIndex.from_product([df["day"].unique(),df["pos"].unique()], names=["day","pos"])
print (df.set_index(["day","pos"]).reindex(s).reset_index())
day pos value
0 1 1 1.0
1 1 14 2.0
2 1 18 3.0
3 2 1 1.0
4 2 14 2.0
5 2 18 NaN
6 3 1 NaN
7 3 14 NaN
8 3 18 3.0
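reindex also accepts a fill_value if you would rather materialize the gaps with a default than with NaN, a small variation of the above:
# fill the newly created rows with 0 instead of NaN
print (df.set_index(["day","pos"]).reindex(s, fill_value=0).reset_index())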
I'd avoid the manual product of all possible values.
Instead, one can get the unique values and just reindex per day:
u = df.pos.unique()
df.groupby('day').apply(lambda s: s.set_index('pos').reindex(u))['value']\
.reset_index()
day pos value
0 1 1 1.0
1 1 14 2.0
2 1 18 3.0
3 2 1 1.0
4 2 14 2.0
5 2 18 NaN
6 3 1 NaN
7 3 14 NaN
8 3 18 3.0
You could use the complete function from pyjanitor to expose the missing values:
# pip install pyjanitor
import pandas as pd
import janitor as jn
df.complete('day', 'pos')
day pos value
0 1 1 1.0
1 1 14 2.0
2 1 18 3.0
3 2 1 1.0
4 2 14 2.0
5 2 18 NaN
6 3 1 NaN
7 3 14 NaN
8 3 18 3.0
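This is essentially the MultiIndex reindex from the previous answer in one call. Recent pyjanitor versions (an assumption worth checking against your installed version) also accept a fill_value to replace the introduced NaNs:
# assumption: fill_value is supported by the installed pyjanitor version
df.complete('day', 'pos', fill_value=0)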
Below is a toy Pandas dataframe that has three columns: 'id' (group id), 'b' (for condition), and 'c' (target):
import numpy as np
import pandas as pd

df = pd.DataFrame({'id' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
'b' : [3,4,5,'A',3,4,'A',1,'A',1,3,'A',2,3],
'c' : [1,0,1,10,1,1,20,1,10,0,1,20,1,1]})
print(df)
id b c
0 1 3 1
1 1 4 0
2 1 5 1
3 1 A 10
4 1 3 1
5 1 4 1
6 1 A 20
7 2 1 1
8 2 A 10
9 2 1 0
10 2 3 1
11 2 A 20
12 2 2 1
13 2 3 1
For each group, I want to replace the values in column 'c' with nan (i.e., np.nan) before the first occurrence of 'A' in column 'b'.
The desired output is the following:
desired_output_df = pd.DataFrame({'id' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
'b' : [3,4,5,'A',3,4,'A',1,'A',1,3,'A',2,3],
'c' : [np.nan,np.nan,np.nan,10,1,1,20,np.nan,10,0,1,20,1,1]})
print(desired_output_df)
id b c
0 1 3 NaN
1 1 4 NaN
2 1 5 NaN
3 1 A 10.0
4 1 3 1.0
5 1 4 1.0
6 1 A 20.0
7 2 1 NaN
8 2 A 10.0
9 2 1 0.0
10 2 3 1.0
11 2 A 20.0
12 2 2 1.0
13 2 3 1.0
I am able to get the index of the values of column c that I want to change using the following command: df.groupby('id').apply(lambda x: x.loc[:(x.b == 'A').idxmax()-1]).index. But the result is a "MultiIndex" and I can't seem to use it to replace the values.
MultiIndex([(1, 0),
(1, 1),
(1, 2),
(2, 7)],
names=['id', None])
Thanks in advance.
Try:
df['c'] = np.where(df.groupby('id').apply(lambda x: x['b'].eq('A').cumsum()) > 0, df['c'], np.nan)
print(df)
Prints:
id b c
0 1 3 NaN
1 1 4 NaN
2 1 5 NaN
3 1 A 10.0
4 1 3 1.0
5 1 4 1.0
6 1 A 20.0
7 2 1 NaN
8 2 A 10.0
9 2 1 0.0
10 2 3 1.0
11 2 A 20.0
12 2 2 1.0
13 2 3 1.0
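A variation that avoids groupby.apply (a sketch): Series.where keeps values where the mask is True and writes NaN elsewhere.
# cumulative count of 'A's within each id; rows before the first 'A' get a zero count
mask = df['b'].eq('A').groupby(df['id']).cumsum() > 0
df['c'] = df['c'].where(mask)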
I have to replace the s_months and incidents NaN values with the corresponding group means, in a Jupyter notebook.
Input data :
Types c_years o_periods s_months incidents
0 1 1 1 127.0 0.0
1 1 1 2 63.0 0.0
2 1 2 1 1095.0 3.0
3 1 2 2 1095.0 4.0
4 1 3 1 1512.0 6.0
5 1 3 2 3353.0 18.0
6 1 4 1 NaN NaN
7 1 4 2 2244.0 11.0
14 2 4 1 NaN NaN
I have tried the code below, but it does not seem to work, and I have tried different variations, such as moving the transform around.
df.fillna['s_months'] = df.fillna(df.grouby(['types' , 'o_periods']['s_months','incidents']).tranform('mean'),inplace = True)
For reference, the group means look like this (computed from the full dataset, which has more Types than the sample above):
s_months incidents
Types o_periods
1 1 911 3
2 1688 8
2 1 26851 36
2 14440 36
3 1 914 2
2 862 1
4 1 296 0
2 889 3
5 1 663 4
2 1046 6
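For completeness, a sketch of how such a means table can be produced, using the column names from the sample:
# per-group means of the two columns that need filling
df.groupby(['Types', 'o_periods'])[['s_months', 'incidents']].mean()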
From your DataFrame:
>>> import pandas as pd
>>> from io import StringIO
>>> df = pd.read_csv(StringIO("""
Types,c_years,o_periods,s_months,incidents
0,1,1,1,127.0,0.0
1,1,1,2,63.0,0.0
2,1,2,1,1095.0,3.0
3,1,2,2,1095.0,4.0
4,1,3,1,1512.0,6.0
5,1,3,2,3353.0,18.0
6,1,4,1,NaN,NaN
7,1,4,2,2244.0,11.0
14,2,4,1,NaN,NaN"""), sep=',')
>>> df
Types c_years o_periods s_months incidents
0 1 1 1 127.0 0.0
1 1 1 2 63.0 0.0
2 1 2 1 1095.0 3.0
3 1 2 2 1095.0 4.0
4 1 3 1 1512.0 6.0
5 1 3 2 3353.0 18.0
6 1 4 1 NaN NaN
7 1 4 2 2244.0 11.0
14 2 4 1 NaN NaN
>>> df[['c_years', 's_months', 'incidents']] = df.groupby(['Types', 'o_periods']).transform(lambda x: x.fillna(x.mean()))
>>> df
Types c_years o_periods s_months incidents
0 1 1 1 127.000000 0.0
1 1 1 2 63.000000 0.0
2 1 2 1 1095.000000 3.0
3 1 2 2 1095.000000 4.0
4 1 3 1 1512.000000 6.0
5 1 3 2 3353.000000 18.0
6 1 4 1 911.333333 3.0
7 1 4 2 2244.000000 11.0
14 2 4 1 NaN NaN
The last NaN remains because it belongs to the last group (Types=2, o_periods=1), which contains no values at all in s_months and incidents, so there is no group mean to fill it with.
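If the empty groups should be filled too, one possible fallback (a sketch, not part of the original answer) is a second pass with the overall column means:
# second pass: any NaN left after the group fill gets the overall column mean
cols = ['s_months', 'incidents']
df[cols] = df[cols].fillna(df[cols].mean())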
Try this: df['s_months'].fillna(df['s_months'].mean())
df['s_months'].mean() computes the mean while ignoring NaN values. Note that this uses the overall column mean, not the per-group mean.
Your code is close; you can modify it as follows to make it work:
df[['s_months','incidents']] = df[['s_months','incidents']].fillna(df.groupby(['Types' , 'o_periods'])[['s_months','incidents']].transform('mean'))
Data Input:
Types c_years o_periods s_months incidents
0 1 1 1 127.0 0.0
1 1 1 2 63.0 0.0
2 1 2 1 1095.0 3.0
3 1 2 2 1095.0 4.0
4 1 3 1 1512.0 6.0
5 1 3 2 3353.0 18.0
6 1 4 1 NaN NaN
7 1 4 2 2244.0 11.0
14 2 4 1 NaN NaN
Output:
Types c_years o_periods s_months incidents
0 1 1 1 127.000000 0.0
1 1 1 2 63.000000 0.0
2 1 2 1 1095.000000 3.0
3 1 2 2 1095.000000 4.0
4 1 3 1 1512.000000 6.0
5 1 3 2 3353.000000 18.0
6 1 4 1 911.333333 3.0
7 1 4 2 2244.000000 11.0
14 2 4 1 NaN NaN
I have a dataframe as below:
distance_along_path ID
0 0 1
1 2.2 1
2 4.5 1
3 7.0 1
4 0 2
5 0 3
6 3.0 2
7 5.0 3
8 0 4
9 2.0 4
10 5.0 4
11 0 5
12 3.0 5
13 7.0 4
I want to group these by ID first and then by the distance_along_path values: every time a 0 appears in distance_along_path for an ID, a new group starts, and all rows until the next 0 for that ID belong to that group, as indicated below:
distance_along_path ID group
0 0 1 1
1 2.2 1 1
2 4.5 1 1
3 7.0 1 1
4 0 1 2
5 0 2 3
6 3.0 1 2
7 5.0 2 3
8 0 2 4
9 2.0 2 4
10 5.0 2 4
11 0 1 5
12 3.0 1 5
13 7.0 1 5
14 0 1 6
15 0 2 7
16 3.0 1 6
17 5.0 2 7
18 1.0 2 7
Thank you
try the following:
# a new segment starts at every 0 within an ID; the running count labels the segments
df['group'] = df['distance_along_path'].eq(0).groupby(df['ID']).cumsum()
# give each (ID, segment) pair a unique label across IDs, in order of appearance
df['group'] = df.groupby(['ID', 'group'], sort=False).ngroup() + 1
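The core trick in isolation, on a toy series: the zeros mark segment starts, and their running count labels the segments.
import pandas as pd
s = pd.Series([0, 2.2, 4.5, 0, 3.0])
print(s.eq(0).cumsum())   # -> 1 1 1 2 2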
I have a data frame of many patients and their measurements over six hours, but for some patients not all six hourly values have been recorded.
For each subject_id, I want the hour column to contain the values 1 to 6: if an hour already exists, keep its value; otherwise leave the value blank.
(Note: I will deal with these blank values using missing-value techniques later.)
subject_id hour value
2 1 23
2 3 15
2 5 28
2 6 11
3 4 18
3 6 22
This is the output I want to get:
subject_id hour value
2 1 23
2 2
2 3 15
2 4
2 5 28
2 6 11
3 1
3 2
3 3
3 4 18
3 5
3 6 22
Can anyone help me do that?
Any help will be appreciated.
Use DataFrame.reindex with MultiIndex.from_product:
import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_product([df['subject_id'].unique(), np.arange(1,7)],
names=['subject_id','hour'])
df = df.set_index(['subject_id','hour']).reindex(mux).reset_index()
print (df)
subject_id hour value
0 2 1 23.0
1 2 2 NaN
2 2 3 15.0
3 2 4 NaN
4 2 5 28.0
5 2 6 11.0
6 3 1 NaN
7 3 2 NaN
8 3 3 NaN
9 3 4 18.0
10 3 5 NaN
11 3 6 22.0
An alternative is to create all possible combinations with itertools.product and then use DataFrame.merge with a left join:
from itertools import product
df1 = pd.DataFrame(list(product(df['subject_id'].unique(), np.arange(1,7))),
columns=['subject_id','hour'])
df = df1.merge(df, how='left')
print (df)
subject_id hour value
0 2 1 23.0
1 2 2 NaN
2 2 3 15.0
3 2 4 NaN
4 2 5 28.0
5 2 6 11.0
6 3 1 NaN
7 3 2 NaN
8 3 3 NaN
9 3 4 18.0
10 3 5 NaN
11 3 6 22.0
EDIT: If you get the error:
cannot handle a non-unique multi-index
it means there are duplicated subject_id and hour pairs:
print (df)
subject_id hour value
0 2 1 23 <- duplicate 2, 1
1 2 1 50 <- duplicate 2, 1
2 2 3 15
3 2 5 28
4 2 6 11
5 3 4 18
6 3 6 22
A possible solution is to aggregate with sum or mean instead of using set_index:
mux = pd.MultiIndex.from_product([df['subject_id'].unique(), np.arange(1,7)],
names=['subject_id','hour'])
df1 = df.groupby(['subject_id','hour']).sum().reindex(mux).reset_index()
print (df1)
subject_id hour value
0 2 1 73.0
1 2 2 NaN
2 2 3 15.0
3 2 4 NaN
4 2 5 28.0
5 2 6 11.0
6 3 1 NaN
7 3 2 NaN
8 3 3 NaN
9 3 4 18.0
10 3 5 NaN
11 3 6 22.0
Detail:
print (df.groupby(['subject_id','hour']).sum())
value
subject_id hour
2 1 73
3 15
5 28
6 11
3 4 18
6 22
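If averaging the duplicated hours is preferable to summing them, mean drops in the same way (a sketch reusing the mux defined above):
df1 = df.groupby(['subject_id','hour']).mean().reindex(mux).reset_index()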
Or remove the duplicates:
mux = pd.MultiIndex.from_product([df['subject_id'].unique(), np.arange(1,7)],
names=['subject_id','hour'])
df1 = (df.drop_duplicates(['subject_id','hour'])
.set_index(['subject_id','hour'])
.reindex(mux)
.reset_index())
print (df1)
subject_id hour value
0 2 1 23.0
1 2 2 NaN
2 2 3 15.0
3 2 4 NaN
4 2 5 28.0
5 2 6 11.0
6 3 1 NaN
7 3 2 NaN
8 3 3 NaN
9 3 4 18.0
10 3 5 NaN
11 3 6 22.0
Detail:
print (df.drop_duplicates(['subject_id','hour']))
subject_id hour value
0 2 1 23 <- duplicates are removed
2 2 3 15
3 2 5 28
4 2 6 11
5 3 4 18
6 3 6 22
Probably a similar question has been asked before, but I could not find one that solves my problem. Maybe I am not using the proper search words!
I have two pandas Dataframes as below:
import pandas as pd
import numpy as np
df1:
a = np.array([1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3])
b = np.array([1,1,2,2,3,3,1,1,2,2,3,3,1,1,2,2,3,3])
df1 = pd.DataFrame({'a':a, 'b':b})
print(df1)
a b
0 1 1
1 1 1
2 1 2
3 1 2
4 1 3
5 1 3
6 2 1
7 2 1
8 2 2
9 2 2
10 2 3
11 2 3
12 3 1
13 3 1
14 3 2
15 3 2
16 3 3
17 3 3
df2 is as below:
a2 = np.array([1,1,1,2,2,2,3,3,3])
b2 = np.array([1,2,3,1,2,3,1,2,3])
c = np.array([4,8,3,np.nan, 2, 5,6, np.nan, 1])
df2 = pd.DataFrame({'a':a2, 'b':b2, 'c': c})
a b c
0 1 1 4.0
1 1 2 8.0
2 1 3 3.0
3 2 1 NaN
4 2 2 2.0
5 2 3 5.0
6 3 1 6.0
7 3 2 NaN
8 3 3 1.0
Now I want to map column c from df2 onto df1, matching rows on the grouping columns a and b. df1 is therefore modified as shown below:
a b c
0 1 1 4.0
1 1 1 4.0
2 1 2 8.0
3 1 2 8.0
4 1 3 3.0
5 1 3 3.0
6 2 1 NaN
7 2 1 NaN
8 2 2 2.0
9 2 2 2.0
10 2 3 5.0
11 2 3 5.0
12 3 1 6.0
13 3 1 6.0
14 3 2 NaN
15 3 2 NaN
16 3 3 1.0
17 3 3 1.0
How can I achieve this in a simple and intuitive way using pandas?
Quite simple using merge:
df1.merge(df2)
a b c
0 1 1 4.0
1 1 1 4.0
2 1 2 8.0
3 1 2 8.0
4 1 3 3.0
5 1 3 3.0
6 2 1 NaN
7 2 1 NaN
8 2 2 2.0
9 2 2 2.0
10 2 3 5.0
11 2 3 5.0
12 3 1 6.0
13 3 1 6.0
14 3 2 NaN
15 3 2 NaN
16 3 3 1.0
17 3 3 1.0
If you have more columns and want to merge specifically on a and b only, use:
df1.merge(df2, on=['a','b'])
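If df2 could be missing some (a, b) combinations entirely and you still want to keep every row of df1, a left join (a small sketch) preserves them with NaN in c:
# keep all rows of df1 even when (a, b) has no match in df2
df1.merge(df2, on=['a', 'b'], how='left')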