I am raising this question for learning a new method for myself.
I have a dataframe like below,
ID Value
0 1 10
1 1 12
2 1 14
3 1 16
4 1 18
5 2 32
6 2 12
7 2 -8
8 2 -28
9 2 -48
10 2 -68
11 3 12
12 3 1
13 3 43
I want to convert this into:
ID Value ID Value ID Value
0 1.0 10.0 2 32 3.0 12.0
1 1.0 12.0 2 12 3.0 1.0
2 1.0 14.0 2 -8 3.0 43.0
3 1.0 16.0 2 -28 NaN NaN
4 1.0 18.0 2 -48 NaN NaN
5 NaN NaN 2 -68 NaN NaN
one way to solve this,
print
pd.concat([df[df['ID']==1].reset_index(drop=True),df[df['ID']==2].reset_index(drop=True),df[df['ID']==3].reset_index(drop=True)],axis=1)
But I'm thinking can I do the same concat operation for each groupby method result instead of filtering by value?
Any better/new approaches are more appreciated.
Thanks in advance.
Yup, very possible and quite simple with pd.concat, in fact.
df = pd.concat({k : g.reset_index(drop=True) for k, g in df.groupby('ID')}, axis=1)
df.columns = df.columns.droplevel(0)
Or, a minor variation in Dark's (now deleted) answer (which does not give you the opportunity to specify column suffixes automatically) -
pd.concat([g.reset_index(drop=True) for _, g in df.groupby('ID')], axis=1)
df
ID Value ID Value ID Value
0 1.0 10.0 2 32 3.0 12.0
1 1.0 12.0 2 12 3.0 1.0
2 1.0 14.0 2 -8 3.0 43.0
3 1.0 16.0 2 -28 NaN NaN
4 1.0 18.0 2 -48 NaN NaN
5 NaN NaN 2 -68 NaN NaN
Those column names are terrible, though. Rather than dropping the first level, you should consider concatenating them to form a pre/suf-fix for the second level. That should be a good exercise for you with df.columns.map.
Related
I am looking for a method to create an array of numbers to label groups, based on the value of the 'number' column. If it's possible?
With this abbreviated example DF:
number = [nan,nan,1,nan,nan,nan,2,nan,nan,3,nan,nan,nan,nan,nan,4,nan,nan]
df = pd.DataFrame(columns=['number'])
df = pd.DataFrame.assign(df, number=number)
Ideally I would like to make a new column, 'group', based on the int in column 'number' - so there would be effectively be array's of 1, ,2, 3, etc. FWIW, the DF is 1000's lines long, with sporadically placed int's.
The result would be a new column, something like this:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
All advice much appreciated!
You can use notna combined with cumsum:
df['group'] = df['number'].notna().cumsum()
NB. if you had zeros: df['group'] = df['number'].ne(0).cumsum().
output:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
You can use forward fill:
df['number'].ffill().fillna(0)
Output:
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 2.0
7 2.0
8 2.0
9 3.0
10 3.0
11 3.0
12 3.0
13 3.0
14 3.0
15 4.0
16 4.0
17 4.0
Name: number, dtype: float64
I have a dataset similar to this
Serial A B
1 12
1 31
1
1 12
1 31 203
1 10
1 2
2 32 100
2 32 242
2 3
3 2
3 23 100
3
3 23
I group the dataframe based on Serial and find the maximum value of each A column by df['A_MAX'] = df.groupby('Serial')['A'].transform('max').values and retain the first value by df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), '')
Serial A B A_MAX B_corresponding
1 12 31 203
1 31
1
1 12
1 31 203
1 10
1 2
2 32 100 32 100
2 32 242
2 3
3 2 23 100
3 23 100
3
3 23
Now for the B_corresponding column, I would like to get the corresponding B values of the A_MAX. I thought of locating the A_MAX values in A but there are similar max A values per group. Additional condition, for example in Serial 2 I would also prefer to get the smallest B values between the 32
Idea is use DataFrame.sort_values for maximal values per groups, then remove missing values by DataFrame.dropna and get first rows by Serial by DataFrame.drop_duplicates. Create Series by DataFrame.set_index and last use Series.map:
df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated())
s = (df.sort_values(['Serial','A'], ascending=[True, False])
.dropna(subset=['B'])
.drop_duplicates('Serial')
.set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated())
print (df)
Serial A B A_MAX B_corresponding
0 1 12.0 NaN 31.0 203.0
1 1 31.0 NaN NaN NaN
2 1 NaN NaN NaN NaN
3 1 12.0 NaN NaN NaN
4 1 31.0 203.0 NaN NaN
5 1 10.0 NaN NaN NaN
6 1 2.0 NaN NaN NaN
7 2 32.0 100.0 32.0 100.0
8 2 32.0 242.0 NaN NaN
9 2 3.0 NaN NaN NaN
10 3 2.0 NaN 23.0 100.0
11 3 23.0 100.0 NaN NaN
12 3 NaN NaN NaN NaN
13 3 23.0 NaN NaN NaN
Converting missing values to empty strings is possible, but get mixed values - numeric and strings, so next processing should be problematic:
df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), '')
s = (df.sort_values(['Serial','A'], ascending=[True, False])
.dropna(subset=['B'])
.drop_duplicates('Serial')
.set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated(), '')
print (df)
Serial A B A_MAX B_corresponding
0 1 12.0 NaN 31 203
1 1 31.0 NaN
2 1 NaN NaN
3 1 12.0 NaN
4 1 31.0 203.0
5 1 10.0 NaN
6 1 2.0 NaN
7 2 32.0 100.0 32 100
8 2 32.0 242.0
9 2 3.0 NaN
10 3 2.0 NaN 23 100
11 3 23.0 100.0
12 3 NaN NaN
13 3 23.0 NaN
You could also use dictionaries to achieve the same if you are not so inclined to only use pandas.
a_to_b_mapping = df.groupby('A')['B'].min().to_dict()
series_to_a_mapping = df.groupby('Series')['A'].max().to_dict()
agg_df = {}
for series, a in series_to_a_mapping.items():
agg_df.append((series, a, a_to_b_mapping.get(a, None)))
agg_df = pd.DataFrame(agg_df, columns=['Series', 'A_max', 'B_corresponding'])
agg_df.head()
Series A_max B_corresponding
0 1 31.0 203.0
1 2 32.0 100.0
2 3 23.0 100.0
If you want, you could join this to original dataframe and mask duplicates.
dft = df.join(final_df.set_index('Serial'), on='Serial', how='left')
dft['A_max'] = dft['A_max'].mask(dft['A_max'].duplicated(), '')
dft['B_corresponding'] = dft['B_corresponding'].mask(dft['B_corresponding'].duplicated(), '')
dft
I have a dataframe with 4 columns that can have np.nan
df =
i_example i_frame OId HId
0 0 20 3.0 0.0
1 3 13 NaN 8.0
2 3 13 NaN 10.0
3 0 21 3.0 NaN
4 0 21 3.0 0.0
5 1 22 0.0 4.0
6 1 22 NaN 4.0
7 2 20 0.0 4.0
8 2 20 1.0 4.0
I am looking for invalid rows.
invalid rows are
[1] rows with duplicate columns = [i_example, i_frame, OId] or
[2] rows with duplicate columns = [i_example, i_frame, HId].
So in the example above, all the rows are invalid beside the first three rows.
valid_df =
i_example i_frame OId HId
0 0 20 3.0 0.0
1 3 13 NaN 8.0
2 3 13 NaN 10.0
and
invalid_df =
i_example i_frame OId HId
3 0 21 3.0 NaN
4 0 21 3.0 0.0
5 1 22 0.0 4.0
6 1 22 NaN 4.0
7 2 20 0.0 4.0
8 2 20 1.0 4.0
1 0 21 3.0 NaN
2 0 21 3.0 0.0
These two rows are invalid because of the condition [1].
and
3 1 22 0.0 4.0
4 1 22 NaN 4.0
are invalid because of the condition [2]
and
5 2 20 0.0 4.0
6 2 20 1.0 4.0
are invalid for the same reason
I tried is_duplicated but it does not work with nan values
I am not sure if the df.duplicated() function offers to eliminate NaNs. But you can add a condition to check of the value is NaN or not and find the duplicates.
df[df.duplicated(['i_example', 'i_frame', 'OId'], keep=False) & df['OId'].notna()]
Result:
i_example i_frame OId HId
3 0 21 3.0 NaN
4 0 21 3.0 0.0
So, for your question, I would see if the value is not NaN and then find the duplicates using df.duplicated() and create a boolean mask. With that filter the df as valid and invalid.
dupes = (df['OId'].notna() & df.duplicated(['i_example', 'i_frame', 'OId'], keep=False)) | (df['HId'].notna() & df.duplicated(['i_example', 'i_frame', 'HId'], keep=False))
invalid_df = df[dupes]
valid_df = df[~dupes]
Result:
valid_df =
i_example i_frame OId HId
0 0 20 3.0 0.0
1 3 13 NaN 8.0
2 3 13 NaN 10.0
invalid_df =
i_example i_frame OId HId
3 0 21 3.0 NaN
4 0 21 3.0 0.0
5 1 22 0.0 4.0
6 1 22 NaN 4.0
7 2 20 0.0 4.0
8 2 20 1.0 4.0
I'm trying to fill out missing values in a pandas dataframe by interpolating or copying the last-known value within a group (identified by trip). My data looks like this:
brake speed trip
0 0.0 NaN 1
1 1.0 NaN 1
2 NaN 1.264 1
3 NaN 0.000 1
4 0.0 NaN 1
5 NaN 1.264 1
6 NaN 6.704 1
7 1.0 NaN 1
8 0.0 NaN 1
9 NaN 11.746 2
10 1.0 NaN 2
11 0.0 NaN 2
12 NaN 16.961 3
13 1.0 NaN 3
14 NaN 11.832 3
15 0.0 NaN 3
16 NaN 17.082 3
17 NaN 22.435 3
18 NaN 28.707 3
19 NaN 34.216 3
I have found Pandas interpolate within a groupby but I need brake to simply be copied from the last-known, yet speed to be interpolated (my actual dataset has 12 columns that each need such treatment)
You can apply separate methods to each column. For example:
# interpolate speed
df['speed'] = df.groupby('trip').speed.transform(lambda x: x.interpolate())
# fill brake with last known value
df['brake'] = df.groupby('trip').brake.transform(lambda x: x.fillna(method='ffill'))
>>> df
brake speed trip
0 0.0 NaN 1
1 1.0 NaN 1
2 1.0 1.2640 1
3 1.0 0.0000 1
4 0.0 0.6320 1
5 0.0 1.2640 1
6 0.0 6.7040 1
7 1.0 6.7040 1
8 0.0 6.7040 1
9 NaN 11.7460 2
10 1.0 11.7460 2
11 0.0 11.7460 2
12 NaN 16.9610 3
13 1.0 14.3965 3
14 1.0 11.8320 3
15 0.0 14.4570 3
16 0.0 17.0820 3
17 0.0 22.4350 3
18 0.0 28.7070 3
19 0.0 34.2160 3
Note that this means you remain with some NaN in brake, because there was no "last known value" for the first row of a trip, and some NaNs in speed when the first few rows were NaN. You can replace these as you see fit with fillna()
I have a dataframe of race results. I'd like to create a series that takes the last stage position and subtracts that by the average of all the stages before that. Here is a small slice for the df (could have more stages, countries and rows)
race_location stage1_position stage2_position stage3_position number_of_stages
AUS 2.0 2.0 NaN 2
AUS 1.0 5.0 NaN 2
AUS 3.0 4.0 NaN 2
AUS 4.0 8.0 NaN 2
AUS 10.0 6.0 NaN 2
AUS 9.0 7.0 NaN 2
FRA 23.0 1.0 10.0 3
FRA 6.0 12.0 24.0 3
FRA 14.0 11.0 14.0 3
FRA 18.0 10.0 1.0 3
FRA 15.0 14.0 4.0 3
USA 24.0 NaN NaN 1
USA 7.0 NaN NaN 1
USA 22.0 NaN NaN 1
USA 11.0 NaN NaN 1
USA 8.0 NaN NaN 1
USA 16.0 NaN NaN 1
USA 13.0 NaN NaN 1
USA 19.0 NaN NaN 1
USA 5.0 NaN NaN 1
USA 25.0 NaN NaN 1
The output would be
last_stage_minus_average
0
4
1
4
-4
-2
-2
15
1.5
-13
-10.5
0
0
0
0
0
0
0
0
0
0
0
This wont work, but I was thinking something like this:
new_series = []
for country in country_list:
num_stages = df.loc[df['race_location'] == country, 'number_of_stages']
differnce = df.ix[df['race_location'] == country, num_stages] -
df.iloc[:, 0:num_stages-1].mean(axis=1)
new_series.append(difference)
I'm not sure how to go about doing this. Any help or direction would be amazing!
#use pandas apply to take the mean for the first n-1 stages and subtract from last stage.
df.apply(lambda x: x.iloc[x.number_of_stages]-np.mean(x.iloc[1:x.number_of_stages]),axis=1).fillna(0)
Out[264]:
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 0.0
17 0.0
18 0.0
19 0.0
20 0.0
dtype: float64
I'd use filter to get just he stage columns, then stack and groupby
stages = df.filter(regex='^stage\d+.*')
stages.stack().groupby(level=0).apply(
lambda x: x.iloc[-1] - x.iloc[:-1].mean()
).fillna(0)
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 0.0
17 0.0
18 0.0
19 0.0
20 0.0
dtype: float64
how it works
stack will automatically drop the NaN values when converting to a series.
Now, position -1 is the last value within each group if we grouped by the first level of the new multiindex
So, we use a lambda and calculate the mean with every thing up to the last value x.iloc[:-1].mean()
And subtract that from the last value x.iloc[-1]
subtracts that by the average of all the stages before that
It's not a big deal but I'm just curious! Unlike your desired output but along to your description, if one of the racers finished only one race, shouldn't their result be inf or nan instead of 0? (to specify them from the one who has already done 2~3 race but last race result is exactly same with average of races? like racer #1 vs racer #11~20)
df_sp = df.filter(regex='^stage\d+.*')
df['last'] = df_sp.T.fillna(method='ffill').T.iloc[:, -1]
df['mean'] = (df_sp.sum(axis=1) - df['last']) / (df['number_of_stages'] - 1)
print(df['last'] - df['mean'])
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 NaN