Fill DataFrame NaN with another DataFrame with groupby - python

I am sure this has been answered before, but I cannot seem to find the right solution. I have tried pd.merge, combine_first and update, and none of them does quite the right job: they either create new _x/_y columns or simply stack the rows underneath. I want to fill column c of df from df1 wherever c is missing, matching on id and date.
Example df for task
df
date id a b c d
1/1/2000 1 10 20 10 11
1/1/2000 2 11 21 NaN 11
1/1/2000 3 15 20 NaN 11
1/1/2000 4 12 24 13 11
1/2/2000 1 10 25 10 11
1/2/2000 2 10 20 NaN 15
1/2/2000 3 10 26 NaN 11
1/2/2000 4 10 20 16 13
1/3/2000 1 10 20 10 11
1/3/2000 2 10 20 NaN 11
1/3/2000 3 10 20 NaN 11
1/3/2000 4 10 20 10 11
df1
date id c
12/29/1999 2 1
12/30/1999 3 1
12/30/1999 2 1
12/31/1999 3 1
12/31/1999 2 1
12/31/1999 4 1
1/1/2000 2 1
1/1/2000 3 14
1/2/2000 2 13
1/2/2000 3 22
1/3/2000 2 13
1/3/2000 3 18
desired df after combining df and df1
df
date id a b c d
1/1/2000 1 10 20 10 11
1/1/2000 2 11 21 1 11
1/1/2000 3 15 20 14 11
1/1/2000 4 12 24 13 11
1/2/2000 1 10 25 10 11
1/2/2000 2 10 20 13 15
1/2/2000 3 10 26 22 11
1/2/2000 4 10 20 16 13
1/3/2000 1 10 20 10 11
1/3/2000 2 10 20 13 11
1/3/2000 3 10 20 18 11
1/3/2000 4 10 20 10 11
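For anyone who wants to run the answers below, a minimal sketch that rebuilds the two frames from the tables above (values copied from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': ['1/1/2000'] * 4 + ['1/2/2000'] * 4 + ['1/3/2000'] * 4,
    'id':   [1, 2, 3, 4] * 3,
    'a':    [10, 11, 15, 12, 10, 10, 10, 10, 10, 10, 10, 10],
    'b':    [20, 21, 20, 24, 25, 20, 26, 20, 20, 20, 20, 20],
    'c':    [10, np.nan, np.nan, 13, 10, np.nan, np.nan, 16, 10, np.nan, np.nan, 10],
    'd':    [11, 11, 11, 11, 11, 15, 11, 13, 11, 11, 11, 11],
})
df1 = pd.DataFrame({
    'date': ['12/29/1999', '12/30/1999', '12/30/1999', '12/31/1999', '12/31/1999', '12/31/1999',
             '1/1/2000', '1/1/2000', '1/2/2000', '1/2/2000', '1/3/2000', '1/3/2000'],
    'id':   [2, 3, 2, 3, 2, 4, 2, 3, 2, 3, 2, 3],
    'c':    [1, 1, 1, 1, 1, 1, 1, 14, 13, 22, 13, 18],
})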

Let's create a MultiIndex from the date and id columns in both dataframes, then use Series.fillna to fill the NaN values in column c of df from the corresponding values in df1 (the set_index key order must match on both sides):
df['c'] = df.set_index(['date', 'id'])['c']\
    .fillna(df1.set_index(['date', 'id'])['c']).tolist()
date id a b c d
0 1/1/2000 1 10 20 10.0 11
1 1/1/2000 2 11 21 1.0 11
2 1/1/2000 3 15 20 14.0 11
3 1/1/2000 4 12 24 13.0 11
4 1/2/2000 1 10 25 10.0 11
5 1/2/2000 2 10 20 13.0 15
6 1/2/2000 3 10 26 22.0 11
7 1/2/2000 4 10 20 16.0 13
8 1/3/2000 1 10 20 10.0 11
9 1/3/2000 2 10 20 13.0 11
10 1/3/2000 3 10 20 18.0 11
11 1/3/2000 4 10 20 10.0 11
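A merge-based alternative (a sketch, assuming df1 has at most one row per (date, id) pair): do a left merge so every row of df picks up df1's c as a helper column, then use it only where df's own c is missing.
merged = df.merge(df1, on=['date', 'id'], how='left', suffixes=('', '_fill'))
# a left merge preserves df's row order, so positional assignment is safe
df['c'] = merged['c'].fillna(merged['c_fill']).to_numpy()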

At least in your example, you can fill the NaN values from a plain list of values (ignoring the index), provided df1 holds exactly one value per missing cell, in the same (date, id) order in which the NaNs appear in df:
df = df.reset_index(drop=True)
df1 = df1.reset_index(drop=True)
df.loc[df['c'].isna(), 'c'] = list(df1['c'])
Result:
date id a b c d
0 1/1/2000 1 10 20 10.0 11
1 1/1/2000 2 11 21 1.0 11
2 1/1/2000 3 15 20 14.0 11
3 1/1/2000 4 12 24 13.0 11
4 1/2/2000 1 10 25 10.0 11
5 1/2/2000 2 10 20 13.0 15
6 1/2/2000 3 10 26 22.0 11
7 1/2/2000 4 10 20 16.0 13
8 1/3/2000 1 10 20 10.0 11
9 1/3/2000 2 10 20 13.0 11
10 1/3/2000 3 10 20 18.0 11
11 1/3/2000 4 10 20 10.0 11

Related

Pandas merge 2 dataframes

I am trying merge 2 dataframes.
df1
Date A B C
01.01.2021 1 8 14
02.01.2021 2 9 15
03.01.2021 3 10 16
04.01.2021 4 11 17
05.01.2021 5 12 18
06.01.2021 6 13 19
07.01.2021 7 14 20
df2
Date B
07.01.2021 14
08.01.2021 27
09.01.2021 28
10.01.2021 29
11.01.2021 30
12.01.2021 31
13.01.2021 32
Both dataframes share one row here (although in general there could be several overlapping rows).
So I want to get df3 that looks as follows:
df3
Date A B C
01.01.2021 1 8 14
02.01.2021 2 9 15
03.01.2021 3 10 16
04.01.2021 4 11 17
05.01.2021 5 12 18
06.01.2021 6 13 19
07.01.2021 7 14 20
08.01.2021 NaN 27 NaN
09.01.2021 NaN 28 NaN
10.01.2021 NaN 29 NaN
11.01.2021 NaN 30 NaN
12.01.2021 NaN 31 NaN
13.01.2021 NaN 32 NaN
I've tried
df3 = df1.merge(df2, on='Date', how='outer') but it gives duplicated, suffixed columns (B_x, B_y). Could you give some idea how to get df3?
Thanks a lot.
Merge with how='outer' without specifying on (by default on is the intersection of the two DataFrames' columns, in this case ['Date', 'B']):
df3 = df1.merge(df2, how='outer')
df3:
Date A B C
0 01.01.2021 1.0 8 14.0
1 02.01.2021 2.0 9 15.0
2 03.01.2021 3.0 10 16.0
3 04.01.2021 4.0 11 17.0
4 05.01.2021 5.0 12 18.0
5 06.01.2021 6.0 13 19.0
6 07.01.2021 7.0 14 20.0
7 08.01.2021 NaN 27 NaN
8 09.01.2021 NaN 28 NaN
9 10.01.2021 NaN 29 NaN
10 11.01.2021 NaN 30 NaN
11 12.01.2021 NaN 31 NaN
12 13.01.2021 NaN 32 NaN
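Equivalently, the join keys can be spelled out; making the default explicit gives the same frame as above (a sketch):
df3 = df1.merge(df2, on=['Date', 'B'], how='outer')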
Assuming you always want to keep the first, complete version of a row, you can concat df2 onto the end of df1 and drop duplicates on the Date column.
pd.concat([df1,df2]).drop_duplicates(subset='Date')
Output
Date A B C
0 01.01.2021 1.0 8 14.0
1 02.01.2021 2.0 9 15.0
2 03.01.2021 3.0 10 16.0
3 04.01.2021 4.0 11 17.0
4 05.01.2021 5.0 12 18.0
5 06.01.2021 6.0 13 19.0
6 07.01.2021 7.0 14 20.0
1 08.01.2021 NaN 27 NaN
2 09.01.2021 NaN 28 NaN
3 10.01.2021 NaN 29 NaN
4 11.01.2021 NaN 30 NaN
5 12.01.2021 NaN 31 NaN
6 13.01.2021 NaN 32 NaN
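If the duplicated index labels carried over from df2 (1-6 above) matter, a small variation of the same idea (a sketch) renumbers them:
pd.concat([df1, df2], ignore_index=True).drop_duplicates(subset='Date').reset_index(drop=True)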

Filter out rows with groupby that are all NaN for a specific column

I am trying to filter out, per id group, the rows where all values in column a are NaN. That is, if an id's observations are NaN across all of the dates, I want to drop that id's rows entirely. E.g. here I want to filter out id = 2.
df
date id a b c d
0 1/1/2000 1 10 20 10.0 11
1 1/1/2000 2 NaN 21 1.0 11
2 1/1/2000 3 15 20 14.0 11
3 1/1/2000 4 NaN 24 13.0 11
4 1/2/2000 1 10 25 10.0 11
5 1/2/2000 2 NaN 20 13.0 15
6 1/2/2000 3 10 26 22.0 11
7 1/2/2000 4 10 20 16.0 13
8 1/3/2000 1 10 20 10.0 11
9 1/3/2000 2 NaN 20 13.0 11
10 1/3/2000 3 10 20 18.0 11
11 1/3/2000 4 10 20 10.0 11
desired dataframe
date id a b c d
0 1/1/2000 1 10 20 10.0 11
1 1/1/2000 3 15 20 14.0 11
2 1/1/2000 4 NaN 24 13.0 11
3 1/2/2000 1 10 25 10.0 11
4 1/2/2000 3 10 26 22.0 11
5 1/2/2000 4 10 20 16.0 13
6 1/3/2000 1 10 20 10.0 11
7 1/3/2000 3 10 20 18.0 11
8 1/3/2000 4 10 20 10.0 11
Test for non-missing values with Series.notna, then flag the groups that have at least one non-missing value using GroupBy.transform with 'any'; transform returns a Series the same size as the original, so it can be used directly for boolean indexing:
df = df[df['a'].notna().groupby(df['id']).transform('any')]
print (df)
date id a b c d
0 1/1/2000 1 10.0 20 10.0 11
2 1/1/2000 3 15.0 20 14.0 11
3 1/1/2000 4 NaN 24 13.0 11
4 1/2/2000 1 10.0 25 10.0 11
6 1/2/2000 3 10.0 26 22.0 11
7 1/2/2000 4 10.0 20 16.0 13
8 1/3/2000 1 10.0 20 10.0 11
10 1/3/2000 3 10.0 20 18.0 11
11 1/3/2000 4 10.0 20 10.0 11
Or use DataFrame.loc to collect the ids that have at least one non-missing a, then filter the original frame with Series.isin and boolean indexing:
df = df[df['id'].isin(df.loc[df['a'].notna(), 'id'])]
print (df)
date id a b c d
0 1/1/2000 1 10.0 20 10.0 11
2 1/1/2000 3 15.0 20 14.0 11
3 1/1/2000 4 NaN 24 13.0 11
4 1/2/2000 1 10.0 25 10.0 11
6 1/2/2000 3 10.0 26 22.0 11
7 1/2/2000 4 10.0 20 16.0 13
8 1/3/2000 1 10.0 20 10.0 11
10 1/3/2000 3 10.0 20 18.0 11
11 1/3/2000 4 10.0 20 10.0 11
Note: not as fast as transform, but easy to understand:
out = df.groupby('id').filter(lambda x : x['a'].notnull().any())
Out[31]:
date id a b c d
0 1/1/2000 1 10.0 20 10.0 11
2 1/1/2000 3 15.0 20 14.0 11
3 1/1/2000 4 NaN 24 13.0 11
4 1/2/2000 1 10.0 25 10.0 11
6 1/2/2000 3 10.0 26 22.0 11
7 1/2/2000 4 10.0 20 16.0 13
8 1/3/2000 1 10.0 20 10.0 11
10 1/3/2000 3 10.0 20 18.0 11
11 1/3/2000 4 10.0 20 10.0 11
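For a quick sanity check of which ids are being dropped (a sketch; here only id 2, whose a is NaN on every date):
all_nan_ids = df.groupby('id')['a'].apply(lambda s: s.isna().all())
print(all_nan_ids[all_nan_ids].index.tolist())   # [2] for the sample data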

Python fillna based on a condition

I have the following dataframe grouped by datafile, and I want to apply fillna(method='bfill') only for those 'groups' that contain more than half of the data.
df.groupby('datafile').count()
datafile column1 column2 column3 column4
datafile1 5 5 3 4
datafile2 5 5 4 5
datafile3 5 5 5 5
datafile4 5 5 0 0
datafile5 5 5 1 1
As you can see in the counts above, I'd like to fill the groups that contain most of the information, but not those that have little or none. So I was thinking of a condition along the lines of: fill the groups that have more than half of the counts, and leave the rest (those with less than half) untouched.
I'm struggling with how to set up this condition, since it involves combining the result of a groupby with the original df.
Help is appreciated.
example df:
index datafile column1 column2 column3 column4
0 datafile1 5 5 NaN 20
1 datafile1 6 6 NaN 21
2 datafile1 7 7 9 NaN
3 datafile1 8 8 10 23
4 datafile1 9 9 11 24
5 datafile2 3 3 2 7
6 datafile2 4 4 3 8
7 datafile2 5 5 4 9
8 datafile2 6 6 NaN 10
9 datafile2 7 7 6 11
10 datafile3 10 10 24 4
11 datafile3 11 11 25 5
12 datafile3 12 12 26 6
13 datafile3 13 13 27 7
14 datafile3 14 14 28 8
15 datafile4 4 4 NaN NaN
16 datafile4 5 5 NaN NaN
17 datafile4 6 6 NaN NaN
18 datafile4 7 7 NaN NaN
19 datafile4 8 8 NaN NaN
19 datafile4 9 9 NaN NaN
20 datafile5 7 7 1 3
21 datafile5 8 8 NaN NaN
22 datafile5 9 9 NaN NaN
23 datafile5 10 10 NaN NaN
24 datafile5 11 1 NaN NaN
expected output df:
index datafile column1 column2 column3 column4
0 datafile1 5 5 9 20
1 datafile1 6 6 9 21
2 datafile1 7 7 9 23
3 datafile1 8 8 10 23
4 datafile1 9 9 11 24
5 datafile2 3 3 2 7
6 datafile2 4 4 3 8
7 datafile2 5 5 4 9
8 datafile2 6 6 6 10
9 datafile2 7 7 6 11
10 datafile3 10 10 24 4
11 datafile3 11 11 25 5
12 datafile3 12 12 26 6
13 datafile3 13 13 27 7
14 datafile3 14 14 28 8
15 datafile4 4 4 NaN NaN
16 datafile4 5 5 NaN NaN
17 datafile4 6 6 NaN NaN
18 datafile4 7 7 NaN NaN
19 datafile4 8 8 NaN NaN
19 datafile4 9 9 NaN NaN
20 datafile5 7 7 1 3
21 datafile5 8 8 NaN NaN
22 datafile5 9 9 NaN NaN
23 datafile5 10 10 NaN NaN
24 datafile5 11 1 NaN NaN
If the proportion of non-null values within a group is greater than or equal to 0.5 for a column, that column is filled with the bfill method:
rate = 0.5
not_na = df.notna()
g = not_na.groupby(df['datafile'])
df_fill = (
    df.bfill()
      .where(
          g.transform('sum')
           .div(g['datafile'].transform('size'), axis=0)
           .ge(rate) |
          not_na
      )
)
print(df_fill)
index datafile column1 column2 column3 column4
0 0 datafile1 5 5 9.0 20.0
1 1 datafile1 6 6 9.0 21.0
2 2 datafile1 7 7 9.0 23.0
3 3 datafile1 8 8 10.0 23.0
4 4 datafile1 9 9 11.0 24.0
5 5 datafile2 3 3 2.0 7.0
6 6 datafile2 4 4 3.0 8.0
7 7 datafile2 5 5 4.0 9.0
8 8 datafile2 6 6 6.0 10.0
9 9 datafile2 7 7 6.0 11.0
10 10 datafile3 10 10 24.0 4.0
11 11 datafile3 11 11 25.0 5.0
12 12 datafile3 12 12 26.0 6.0
13 13 datafile3 13 13 27.0 7.0
14 14 datafile3 14 14 28.0 8.0
15 15 datafile4 4 4 NaN NaN
16 16 datafile4 5 5 NaN NaN
17 17 datafile4 6 6 NaN NaN
18 18 datafile4 7 7 NaN NaN
19 19 datafile4 8 8 NaN NaN
20 19 datafile4 9 9 NaN NaN
21 20 datafile5 7 7 1.0 3.0
22 21 datafile5 8 8 NaN NaN
23 22 datafile5 9 9 NaN NaN
24 23 datafile5 10 10 NaN NaN
25 24 datafile5 11 1 NaN NaN
Also we can use:
m = (not_na.groupby(df['datafile'], sort=False)
           .sum()
           .div(df['datafile'].value_counts(), axis=0)
           .ge(rate)
           .reindex(df['datafile']).reset_index(drop=True))
df.bfill().where(m | not_na)
Both methods return the same result for the sample dataframe; timings:
%%timeit
rate = 0.5
not_na = df.notna()
m = (not_na.groupby(df['datafile'], sort=False)
           .sum()
           .div(df['datafile'].value_counts(), axis=0)
           .ge(rate)
           .reindex(df['datafile']).reset_index(drop=True))
df.bfill().where(m | not_na)
11.1 ms ± 53.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
rate = 0.5
not_na = df.notna()
g = not_na.groupby(df['datafile'])
df_fill = (df.bfill()
             .where(g.transform('sum')
                     .div(g['datafile'].transform('size'), axis=0)
                     .ge(rate) |
                    not_na)
          )
12.9 ms ± 225 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Use pandas GroupBy.filter:
def most_not_null(x):
    # keep a group only if it has fewer null cells than half of its non-null cells
    return x.isnull().sum().sum() < (x.notnull().sum().sum() // 2)

filtered_groups = df.groupby('datafile').filter(most_not_null)
df.loc[filtered_groups.index] = filtered_groups.bfill()
Output
>>> df
index datafile column1 column2 column3 column4
0 0 datafile1 5 5 9.0 20.0
1 1 datafile1 6 6 9.0 21.0
2 2 datafile1 7 7 9.0 23.0
3 3 datafile1 8 8 10.0 23.0
4 4 datafile1 9 9 11.0 24.0
5 5 datafile2 3 3 2.0 7.0
6 6 datafile2 4 4 3.0 8.0
7 7 datafile2 5 5 4.0 9.0
8 8 datafile2 6 6 6.0 10.0
9 9 datafile2 7 7 6.0 11.0
10 10 datafile3 10 10 24.0 4.0
11 11 datafile3 11 11 25.0 5.0
12 12 datafile3 12 12 26.0 6.0
13 13 datafile3 13 13 27.0 7.0
14 14 datafile3 14 14 28.0 8.0
15 15 datafile4 4 4 NaN NaN
16 16 datafile4 5 5 NaN NaN
17 17 datafile4 6 6 NaN NaN
18 18 datafile4 7 7 NaN NaN
19 19 datafile4 8 8 NaN NaN
20 19 datafile4 9 9 NaN NaN
21 20 datafile5 7 7 1.0 3.0
22 21 datafile5 8 8 NaN NaN
23 22 datafile5 9 9 NaN NaN
24 23 datafile5 10 10 NaN NaN
25 24 datafile5 11 1 NaN NaN
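A more explicit (but slower) variant of the same idea, as a sketch; value_cols is an assumption about which columns should be filled, and the threshold is applied per group across those columns rather than per column, which happens to give the expected output for this sample:
value_cols = ['column3', 'column4']   # assumed fill targets, not from the original post
rate = 0.5

def fill_if_mostly_present(g):
    # backfill the group's value columns only if at least half of their cells are non-null
    if g.notna().mean().mean() >= rate:
        return g.bfill()
    return g

out = df.copy()
out[value_cols] = (out.groupby('datafile', group_keys=False)[value_cols]
                      .apply(fill_if_mostly_present))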

Pandas Merge Resample Result for Missing Rows

Note: I asked a worse version of this yesterday, which I quickly deleted. However, @FlorianGD left a comment which gave me the answer I needed, so I've posted this in any case with the solution he advised.
Prepare Start Dataframe
I have some dates:
date_dict = {0: '1/31/2010',
             1: '12/15/2009',
             2: '3/19/2010',
             3: '10/25/2009',
             4: '1/17/2009',
             5: '9/4/2009',
             6: '2/21/2010',
             7: '8/30/2009',
             8: '1/31/2010',
             9: '11/30/2008',
             10: '2/8/2009',
             11: '4/9/2010',
             12: '9/13/2009',
             13: '10/19/2009',
             14: '1/24/2010',
             15: '3/8/2009',
             16: '11/30/2008',
             17: '7/30/2009',
             18: '12/12/2009',
             19: '3/8/2009',
             20: '6/18/2010',
             21: '11/30/2008',
             22: '12/30/2009',
             23: '10/28/2009',
             24: '1/28/2010'}
Convert to dataframe and datetime format:
import pandas as pd
from datetime import datetime
df = pd.DataFrame(list(date_dict.items()), columns=['Ind', 'Game_date'])
df['Date'] = df['Game_date'].apply(lambda x: datetime.strptime(x.strip(), "%m/%d/%Y"))
df.sort_values(by='Date', inplace=True)
df.reset_index(drop=True, inplace=True)
del df['Ind'], df['Game_date']
df['Count'] = 1
df
Date
0 2008-11-30
1 2008-11-30
2 2008-11-30
3 2009-01-17
4 2009-02-08
5 2009-03-08
6 2009-03-08
7 2009-07-30
8 2009-08-30
9 2009-09-04
10 2009-09-13
11 2009-10-19
12 2009-10-25
13 2009-10-28
14 2009-12-12
15 2009-12-15
16 2009-12-30
17 2010-01-24
18 2010-01-28
19 2010-01-31
20 2010-01-31
21 2010-02-21
22 2010-03-19
23 2010-04-09
24 2010-06-18
Now what I want to do is to resample this dataframe to group rows into groups of weeks, and return the information to the original dataframe.
Use resample() to group rows by week and return the count
I perform a resample for each week every Tuesday:
c_index = df.set_index('Date', drop=True).resample('1W-TUE').sum()['Count'].reset_index()
c_index.dropna(subset=['Count'], axis=0, inplace=True)
c_index = c_index.reset_index(drop=True)
c_index['Index_Col'] = c_index.index + 1
c_index
Date Count Index_Col
0 2008-12-02 3.0 1
1 2009-01-20 1.0 2
2 2009-02-10 1.0 3
3 2009-03-10 2.0 4
4 2009-08-04 1.0 5
5 2009-09-01 1.0 6
6 2009-09-08 1.0 7
7 2009-09-15 1.0 8
8 2009-10-20 1.0 9
9 2009-10-27 1.0 10
10 2009-11-03 1.0 11
11 2009-12-15 2.0 12
12 2010-01-05 1.0 13
13 2010-01-26 1.0 14
14 2010-02-02 3.0 15
15 2010-02-23 1.0 16
16 2010-03-23 1.0 17
17 2010-04-13 1.0 18
18 2010-06-22 1.0 19
This shows the number of rows in df that fall within each week in c_index; for example, 3 rows fell in the week ending 2008-12-02.
Broadcast Information back to original df
Now, I want to merge those columns back onto the original df, essentially broadcasting the grouped data onto the individual rows.
This should give:
Date Count_Raw Count_Total Index_Col
0 2008-11-30 1 3 1
1 2008-11-30 1 3 1
2 2008-11-30 1 3 1
3 2009-01-17 1 1 2
4 2009-02-08 1 1 3
5 2009-03-08 1 2 4
6 2009-03-08 1 2 4
7 2009-07-30 1 1 5
8 2009-08-30 1 1 6
9 2009-09-04 1 1 7
10 2009-09-13 1 1 8
11 2009-10-19 1 1 9
12 2009-10-25 1 1 10
13 2009-10-28 1 1 11
14 2009-12-12 1 2 12
15 2009-12-15 1 2 12
16 2009-12-30 1 1 13
17 2010-01-24 1 1 14
18 2010-01-28 1 3 15
19 2010-01-31 1 3 15
20 2010-01-31 1 3 15
21 2010-02-21 1 1 16
22 2010-03-19 1 1 17
23 2010-04-09 1 1 18
24 2010-06-18 1 1 19
So the Count_Total represents the total number in that group, and Index_Col tracks the order of the groups.
For example, in this case, the group info for 2010-02-02 has been assigned to 2010-01-28, 2010-01-31, and 2010-01-31.
To do this I have tried the following:
Failed attempt
df.merge(c_index, on='Date', how='left', suffixes=('_Raw', '_Total'))
Date Count_Raw Count_Total Index_Col
0 2008-11-30 1 NaN NaN
1 2008-11-30 1 NaN NaN
2 2008-11-30 1 NaN NaN
3 2009-01-17 1 NaN NaN
4 2009-02-08 1 NaN NaN
5 2009-03-08 1 NaN NaN
6 2009-03-08 1 NaN NaN
7 2009-07-30 1 NaN NaN
8 2009-08-30 1 NaN NaN
9 2009-09-04 1 NaN NaN
10 2009-09-13 1 NaN NaN
11 2009-10-19 1 NaN NaN
12 2009-10-25 1 NaN NaN
13 2009-10-28 1 NaN NaN
14 2009-12-12 1 NaN NaN
15 2009-12-15 1 2.0 12.0
16 2009-12-30 1 NaN NaN
17 2010-01-24 1 NaN NaN
18 2010-01-28 1 NaN NaN
19 2010-01-31 1 NaN NaN
20 2010-01-31 1 NaN NaN
21 2010-02-21 1 NaN NaN
22 2010-03-19 1 NaN NaN
23 2010-04-09 1 NaN NaN
24 2010-06-18 1 NaN NaN
Reasons for failure: This merges the two dataframes only when the date in c_index is also present in df. In this example the only week that has had information added is 2009-12-15 as this is the only date common across both dataframes.
How can I do a better merge to get what I'm after?
As indicated by @FlorianGD, this can be achieved using pandas.merge_asof with the direction='forward' argument:
pd.merge_asof(left=df, right=c_index, on='Date', suffixes=('_Raw', '_Total'), direction='forward')
Date Count_Raw Count_Total Index_Col
0 2008-11-30 1 3.0 1
1 2008-11-30 1 3.0 1
2 2008-11-30 1 3.0 1
3 2009-01-17 1 1.0 2
4 2009-02-08 1 1.0 3
5 2009-03-08 1 2.0 4
6 2009-03-08 1 2.0 4
7 2009-07-30 1 1.0 5
8 2009-08-30 1 1.0 6
9 2009-09-04 1 1.0 7
10 2009-09-13 1 1.0 8
11 2009-10-19 1 1.0 9
12 2009-10-25 1 1.0 10
13 2009-10-28 1 1.0 11
14 2009-12-12 1 2.0 12
15 2009-12-15 1 2.0 12
16 2009-12-30 1 1.0 13
17 2010-01-24 1 1.0 14
18 2010-01-28 1 3.0 15
19 2010-01-31 1 3.0 15
20 2010-01-31 1 3.0 15
21 2010-02-21 1 1.0 16
22 2010-03-19 1 1.0 17
23 2010-04-09 1 1.0 18
24 2010-06-18 1 1.0 19
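For reference, a commented version of the same call (a sketch; merge_asof requires both frames to be sorted on the key, which they already are here):
df_out = pd.merge_asof(
    left=df,                           # one row per game date, sorted by Date
    right=c_index,                     # one row per weekly bin edge, sorted by Date
    on='Date',
    suffixes=('_Raw', '_Total'),
    direction='forward',               # match the nearest bin edge on or after each date
)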

How to "unconcatenate" a dataframe in Pandas?

I have the following pandas DataFrame.
import pandas as pd
df = pd.read_csv('filename.csv')
print(df)
time Group blocks
0 1 A 4
1 2 A 7
2 3 A 12
3 4 A 17
4 5 A 21
5 6 A 26
6 7 A 33
7 8 A 39
8 9 A 48
9 10 A 59
.... .... ....
36 35 A 231
37 1 B 1
38 2 B 1.5
39 3 B 3
40 4 B 5
41 5 B 6
.... .... ....
911 35 Z 349
This is a dataframe with multiple time-series-like groups, with time running from min=1 to max=35. Each Group has values over the range time=1 to time=35.
I would like to segment this dataframe into columns Group A, Group B, Group C, etc.
How does one "unconcatenate" this dataframe?
Is that what you want?
In [84]: df.pivot_table(index='time', columns='Group')
Out[84]:
blocks
Group A B
time
1 4.0 1.0
2 7.0 1.5
3 12.0 3.0
4 17.0 5.0
5 21.0 6.0
6 26.0 NaN
7 33.0 NaN
8 39.0 NaN
9 48.0 NaN
10 59.0 NaN
35 231.0 NaN
data:
In [86]: df
Out[86]:
time Group blocks
0 1 A 4.0
1 2 A 7.0
2 3 A 12.0
3 4 A 17.0
4 5 A 21.0
5 6 A 26.0
6 7 A 33.0
7 8 A 39.0
8 9 A 48.0
9 10 A 59.0
36 35 A 231.0
37 1 B 1.0
38 2 B 1.5
39 3 B 3.0
40 4 B 5.0
41 5 B 6.0
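If each (time, Group) pair is unique, a plain pivot (sketch) gives the same wide layout without any aggregation:
df.pivot(index='time', columns='Group', values='blocks')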
