My df looks like this:
id number
123 1
256 2
879 3
132 4
3215 5
216 6
Output should be like this:
id number
123 1
256 2
879 3
132 4
3215 5
216 6
NaN 7
NaN 8
NaN 9
NaN 10
So basically, in the number column I need to add 1 to the previous row's value, and the id column should stay empty. I need 30 new rows. I tried this:
n = 30
for i in range(n):
    df = df.append(df.tail(1).add(1))
but the result was not correct. Do you have any ideas? Thanks for the help.
Regards
Tomasz
You can use set_index, reindex and reset_index:
df.set_index('number').reindex(range(1, 11)).reset_index()
Output:
number id
0 1 123.0
1 2 256.0
2 3 879.0
3 4 132.0
4 5 3215.0
5 6 216.0
6 7 NaN
7 8 NaN
8 9 NaN
9 10 NaN
If you want to keep the column order:
cols = df.columns
df.set_index('number').reindex(range(1, 11)).reset_index()[cols]
id number
0 123.0 1
1 256.0 2
2 879.0 3
3 132.0 4
4 3215.0 5
5 216.0 6
6 NaN 7
7 NaN 8
8 NaN 9
9 NaN 10
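To get the 30 new rows from the question rather than the 10 shown here, a sketch that derives the target range from the frame length (assuming number runs from 1 to len(df) without gaps):
n = 30  # extra rows requested in the question
cols = df.columns
df.set_index('number').reindex(range(1, len(df) + n + 1)).reset_index()[cols]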
A merge is another efficient option, and maintains column order:
df.merge(pd.Series(range(1, 11), name='number'), how='right')
id number
0 123.0 1
1 256.0 2
2 879.0 3
3 132.0 4
4 3215.0 5
5 216.0 6
6 NaN 7
7 NaN 8
8 NaN 9
9 NaN 10
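The same merge extends to the 30 extra rows from the question (a sketch, under the same assumption that number is consecutive from 1):
df.merge(pd.Series(range(1, len(df) + 31), name='number'), how='right')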
Try set_index and reindex:
>>> df.set_index('number').reindex(range(11)).reset_index()
number id
0 0 NaN
1 1 123.0
2 2 256.0
3 3 879.0
4 4 132.0
5 5 3215.0
6 6 216.0
7 7 NaN
8 8 NaN
9 9 NaN
10 10 NaN
Note that range(11) starts at 0, so this also adds a row for number 0; use range(1, 11) to match the desired output exactly.
I am trying to calculate the moving average on the following dataframe, but I have trouble joining the result back to the dataframe.
The dataframe is (moving-average values are displayed in parentheses):
Key1 Key2 Value MovingAverage
1 2 1 (NaN)
1 7 2 (NaN)
1 8 3 (NaN)
2 5 1 (NaN)
2 3 2 (NaN)
2 2 3 (NaN)
3 7 1 (NaN)
3 5 2 (NaN)
3 8 3 (NaN)
4 7 1 (1.33)
4 2 2 (2)
4 9 3 (NaN)
5 8 1 (2.33)
5 3 2 (NaN)
5 9 3 (NaN)
6 2 1 (2)
6 7 2 (1.33)
6 9 3 (3)
The code is:
import pandas as pd
d = {'Key1':[1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6], 'Key2':[2,7,8,5,3,2,7,5,8,7,2,9,8,3,9,2,7,9],'Value':[1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3]}
df = pd.DataFrame(d)
print(df)
MaDf = df.groupby(['Key2'])['Value'].rolling(window=3).mean().to_frame('mean')
print (MaDf)
If you run the code it will correctly calculate the moving average based on 'Key2' and 'Value', but I can't find a way to correctly reinsert it back into the original dataframe (df).
Remove the first level of the MultiIndex with Series.reset_index and drop=True, so the result aligns on the second level (the original index):
df['mean'] = (df.groupby('Key2')['Value']
.rolling(window=3)
.mean()
.reset_index(level=0, drop=True))
print (df)
Key1 Key2 Value mean
0 1 2 1 NaN
1 1 7 2 NaN
2 1 8 3 NaN
3 2 5 1 NaN
4 2 3 2 NaN
5 2 2 3 NaN
6 3 7 1 NaN
7 3 5 2 NaN
8 3 8 3 NaN
9 4 7 1 1.333333
10 4 2 2 2.000000
11 4 9 3 NaN
12 5 8 1 2.333333
13 5 3 2 NaN
14 5 9 3 NaN
15 6 2 1 2.000000
16 6 7 2 1.333333
17 6 9 3 3.000000
If the DataFrame has the default RangeIndex, you can instead use Series.sort_index:
df['mean'] = (df.groupby(['Key2'])['Value']
.rolling(window=3)
.mean()
.sort_index(level=1)
.values)
print (df)
Key1 Key2 Value mean
0 1 2 1 NaN
1 1 7 2 NaN
2 1 8 3 NaN
3 2 5 1 NaN
4 2 3 2 NaN
5 2 2 3 NaN
6 3 7 1 NaN
7 3 5 2 NaN
8 3 8 3 NaN
9 4 7 1 1.333333
10 4 2 2 2.000000
11 4 9 3 NaN
12 5 8 1 2.333333
13 5 3 2 NaN
14 5 9 3 NaN
15 6 2 1 2.000000
16 6 7 2 1.333333
17 6 9 3 3.000000
Simply df['mean'] = df.groupby(['Key2'])['Value'].rolling(window=3).mean().sort_index(level=1).values. Note that the sort_index(level=1) is required here: without it the values come back ordered by Key2 group and would be misaligned with the original rows.
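An alternative that keeps the alignment automatic (a common pattern, not from the answers above) is GroupBy.transform, which reindexes the per-group result back onto the original rows:
df['mean'] = (df.groupby('Key2')['Value']
                .transform(lambda s: s.rolling(window=3).mean()))
This avoids any manual index handling, at the cost of calling a Python lambda per group.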
I have a data frame of many patients and their measurements over six hours, but for some patients not all six hourly values have been recorded.
For each subject_id I want the hour column to contain the values 1 to 6; if an hour already exists, keep its value, otherwise leave it blank.
Note: I will deal with the blank values using missing-value techniques later.
subject_id hour value
2 1 23
2 3 15
2 5 28
2 6 11
3 4 18
3 6 22
This is the output I want to get:
subject_id hour value
2 1 23
2 2
2 3 15
2 4
2 5 28
2 6 11
3 1
3 2
3 3
3 4 18
3 5
3 6 22
Can anyone help me do that? Any help will be appreciated.
Use DataFrame.reindex with MultiIndex.from_product:
mux = pd.MultiIndex.from_product([df['subject_id'].unique(), np.arange(1,7)],
names=['subject_id','hour'])
df = df.set_index(['subject_id','hour']).reindex(mux).reset_index()
print (df)
subject_id hour value
0 2 1 23.0
1 2 2 NaN
2 2 3 15.0
3 2 4 NaN
4 2 5 28.0
5 2 6 11.0
6 3 1 NaN
7 3 2 NaN
8 3 3 NaN
9 3 4 18.0
10 3 5 NaN
11 3 6 22.0
An alternative is to create all possible combinations with itertools.product and then use DataFrame.merge with a left join:
from itertools import product
df1 = pd.DataFrame(list(product(df['subject_id'].unique(), np.arange(1,7))),
columns=['subject_id','hour'])
df = df1.merge(df, how='left')
print (df)
subject_id hour value
0 2 1 23.0
1 2 2 NaN
2 2 3 15.0
3 2 4 NaN
4 2 5 28.0
5 2 6 11.0
6 3 1 NaN
7 3 2 NaN
8 3 3 NaN
9 3 4 18.0
10 3 5 NaN
11 3 6 22.0
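A pandas-only variant of the same scaffold, a sketch using MultiIndex.to_frame instead of itertools.product:
mux = pd.MultiIndex.from_product([df['subject_id'].unique(), np.arange(1,7)],
                                 names=['subject_id','hour'])
df = mux.to_frame(index=False).merge(df, how='left')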
EDIT: If you get the error:
cannot handle a non-unique multi-index
it means there are duplicated subject_id and hour pairs:
print (df)
subject_id hour value
0 2 1 23 <- duplicate 2, 1
1 2 1 50 <- duplicate 2, 1
2 2 3 15
3 2 5 28
4 2 6 11
5 3 4 18
6 3 6 22
A possible solution is to aggregate with sum or mean instead of set_index:
mux = pd.MultiIndex.from_product([df['subject_id'].unique(), np.arange(1,7)],
names=['subject_id','hour'])
df1 = df.groupby(['subject_id','hour']).sum().reindex(mux).reset_index()
print (df1)
subject_id hour value
0 2 1 73.0
1 2 2 NaN
2 2 3 15.0
3 2 4 NaN
4 2 5 28.0
5 2 6 11.0
6 3 1 NaN
7 3 2 NaN
8 3 3 NaN
9 3 4 18.0
10 3 5 NaN
11 3 6 22.0
Detail:
print (df.groupby(['subject_id','hour']).sum())
value
subject_id hour
2 1 73
3 15
5 28
6 11
3 4 18
6 22
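If averaging the duplicates is preferable to summing them, the same pattern works with mean (a sketch; the (2, 1) entry would then be 36.5 instead of 73.0):
df1 = df.groupby(['subject_id','hour']).mean().reindex(mux).reset_index()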
Or remove the duplicates:
mux = pd.MultiIndex.from_product([df['subject_id'].unique(), np.arange(1,7)],
names=['subject_id','hour'])
df1 = (df.drop_duplicates(['subject_id','hour'])
.set_index(['subject_id','hour'])
.reindex(mux)
.reset_index())
print (df1)
subject_id hour value
0 2 1 23.0
1 2 2 NaN
2 2 3 15.0
3 2 4 NaN
4 2 5 28.0
5 2 6 11.0
6 3 1 NaN
7 3 2 NaN
8 3 3 NaN
9 3 4 18.0
10 3 5 NaN
11 3 6 22.0
Detail:
print (df.drop_duplicates(['subject_id','hour']))
subject_id hour value
0 2 1 23 <- duplicates are removed
2 2 3 15
3 2 5 28
4 2 6 11
5 3 4 18
6 3 6 22
I have a data frame (sample, not real):
df =
A B C D E F
0 3 4 NaN NaN NaN NaN
1 9 8 NaN NaN NaN NaN
2 5 9 4 7 NaN NaN
3 5 7 6 3 NaN NaN
4 2 6 4 3 NaN NaN
Now I want to fill the NaN values in each row with the previous couple(!!!) of values, i.e. fill the NaNs with the nearest existing pair of numbers to their left, and apply this to the whole dataset.
There are a lot of answers concerning filling down columns, but in this case I need to fill along rows.
There are also answers about filling NaN based on another column, but in my case there are more than 2000 columns. This is sample data.
Desired output is:
df =
A B C D E F
0 3 4 3 4 3 4
1 9 8 9 8 9 8
2 5 9 4 7 4 7
3 5 7 6 3 6 3
4 2 6 4 3 4 3
IIUC, a quick solution without reshaping the data:
df.iloc[:, ::2] = df.iloc[:, ::2].ffill(axis=1)
df.iloc[:, 1::2] = df.iloc[:, 1::2].ffill(axis=1)
df
Output:
A B C D E F
0 3 4 3 4 3 4
1 9 8 9 8 9 8
2 5 9 4 7 4 7
3 5 7 6 3 6 3
4 2 6 4 3 4 3
The idea is to reshape the DataFrame so the missing values can be forward and back filled: relabel the columns with a MultiIndex built from the integer division and the modulo by 2 of the column positions, then stack:
c = df.columns
a = np.arange(len(df.columns))
df.columns = [a // 2, a % 2]
# if some pairs can be missing, remove .astype(int)
df1 = df.stack().ffill(axis=1).bfill(axis=1).unstack().astype(int)
df1.columns = c
print (df1)
A B C D E F
0 3 4 3 4 3 4
1 9 8 9 8 9 8
2 5 9 4 7 4 7
3 5 7 6 3 6 3
4 2 6 4 3 4 3
Detail:
print (df.stack())
0 1 2
0 0 3 NaN NaN
1 4 NaN NaN
1 0 9 NaN NaN
1 8 NaN NaN
2 0 5 4.0 NaN
1 9 7.0 NaN
3 0 5 6.0 NaN
1 7 3.0 NaN
4 0 2 4.0 NaN
1 6 3.0 NaN
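A NumPy-based sketch of the same pair-wise fill, producing a hypothetical df_filled; it assumes an even number of columns and that NaNs always cover whole pairs, as in the sample:
import numpy as np
import pandas as pd

vals = df.to_numpy(dtype=float).reshape(len(df), -1, 2)  # (rows, pairs, 2)
for j in range(1, vals.shape[1]):
    # copy the previous (already filled) pair where the current pair is fully missing
    mask = np.isnan(vals[:, j]).all(axis=1)
    vals[mask, j] = vals[mask, j - 1]
df_filled = pd.DataFrame(vals.reshape(len(df), -1), columns=df.columns)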
This is a particular case of the question in the title.
I have following dataframe:
values = [[100,54,25,26,32,33,15,2],[1,2,3,4,5,6,7,8]]
columns = ["numbers", "order"]
zipped = dict(zip(columns,values))
df = pd.DataFrame(zipped)
print(df)
numbers order
0 100 1
1 54 2
2 25 3
3 26 4
4 32 5
5 33 6
6 15 7
7 2 8
Imagine that the dataframe is sorted ascendingly by the order column. In the numbers column I want to replace a value with NaN if a bigger value is present further down the rows, achieving the following result:
numbers order
0 100 1
1 54 2
2 NaN 3
3 NaN 4
4 NaN 5
5 33 6
6 15 7
7 2 8
What would be the best approach to achieve this without looping?
Update: here is probably a better example for the initial DF and expected results (it adds discontiguous blocks of values to be replaced):
values = [[100,54,25,26,34,32,31,33,15,2],[1,2,3,4,5,6,7,8,9,10]]
numbers order
0 100 1
1 54 2
2 25 3
3 26 4
4 34 5
5 32 6
6 31 7
7 33 8
8 15 9
9 2 10
Results:
numbers order
0 100.0 1
1 54.0 2
2 NaN 3
3 NaN 4
4 34.0 5
5 NaN 6
6 NaN 7
7 33.0 8
8 15.0 9
9 2.0 10
I read this slightly differently: if there are bigger numbers below a value, then the reversed cummax at that position is higher than the value itself:
In [11]: df.at[3, 'numbers'] = 24 # more illustrative example
In [12]: df.numbers[::-1].cummax()[::-1]
Out[12]:
0 100
1 54
2 33
3 33
4 33
5 33
6 15
7 2
Name: numbers, dtype: int64
In [13]: df.loc[df.numbers < df.numbers[::-1].cummax()[::-1], 'numbers'] = np.nan
In [14]: df
Out[14]:
numbers order
0 100.0 1
1 54.0 2
2 NaN 3
3 NaN 4
4 NaN 5
5 33.0 6
6 15.0 7
7 2.0 8
You can loop through the values of the column and check whether each one is greater than all the elements that come after it:
arr = df['numbers'].values
df['numbers'] = [x if all(x > arr[n+1:]) else np.nan for n, x in enumerate(arr)]
df
Output:
numbers order
0 100.0 1
1 54.0 2
2 NaN 3
3 NaN 4
4 NaN 5
5 33.0 6
6 15.0 7
7 2.0 8
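A more idiomatic sketch of the cummax idea uses Series.where, and scales better than the quadratic list comprehension above: a value survives exactly when it equals the running maximum of its own suffix (assuming the frame is already sorted by order):
s = df['numbers']
df['numbers'] = s.where(s == s[::-1].cummax()[::-1])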
How can I merge two pandas dataframes with different lengths, like these:
df1 =
Index block_id Ut_rec_0
0 0 7
1 1 10
2 2 2
3 3 0
4 4 10
5 5 3
6 6 6
7 7 9
df2 =
Index block_id Ut_rec_1
0 0 3
2 2 5
3 3 5
5 5 9
7 7 4
result =
Index block_id Ut_rec_0 Ut_rec_1
0 0 7 3
1 1 10 NaN
2 2 2 5
3 3 0 5
4 4 10 NaN
5 5 3 9
6 6 6 NaN
7 7 9 4
I already tried something like this, but it did not work:
df_result = pd.concat([df1, df2], join_axes=[df1['block_id']])
I also tried:
df_result = pd.concat([df1, df2], axis=1)
But the result was:
Index block_id Ut_rec_0 Index block_id Ut_rec_1
0 0 7 0.0 0.0 3.0
1 1 10 1.0 2.0 5.0
2 2 2 2.0 3.0 5.0
3 3 0 3.0 5.0 9.0
4 4 10 4.0 7.0 4.0
5 5 3 NaN NaN NaN
6 6 6 NaN NaN NaN
7 7 9 NaN NaN NaN
pandas.DataFrame.join can "join" dataframes based on overlap in column data (or index). Something like this will likely work for you:
df1.join(df2.set_index('block_id'), on='block_id')
As @Wen said, the best option would be concat with axis=1, as in the code below. Note that concat aligns on the index, so this only matches rows correctly when both frames share a meaningful common index:
pd.concat([df1, df2], axis=1)
You need pd.merge with an outer join:
pd.merge(df1, df2, on=['Index', 'block_id'], how='outer')
Output:
Index block_id Ut_rec_0 Ut_rec_1
0 0 7 3.0
1 1 10 NaN
2 2 2 5.0
3 3 0 5.0
4 4 10 NaN
5 5 3 9.0
6 6 6 NaN
7 7 9 4.0