Use values from one DataFrame to keep values from another DataFrame - python

I have created two dataframes, let's call them df1 and df2, where
print(df1)
a b
1 0.375241 1.0
2 NaN 2.0
3 0.448792 3.0
4 NaN 4.0
df1 = df1[np.isfinite(df1['a'])]
print(df1)
a b
1 0.375241 1.0
3 0.448792 3.0
print(df2)
aa bb
1 0.047606 1.0
2 0.202927 1.0
3 0.205663 1.0
4 NaN 1.0
5 1.388080 1.0
6 0.097084 1.0
7 0.136873 1.0
8 NaN 1.0
9 NaN 1.0
10 NaN 1.0
12 0.084676 2.0
13 0.236850 2.0
14 0.532835 2.0
15 NaN 2.0
16 NaN 2.0
17 0.035106 2.0
18 NaN 2.0
19 NaN 2.0
20 NaN 2.0
22 0.419956 3.0
23 0.267132 3.0
24 0.944217 3.0
25 0.403024 3.0
26 NaN 3.0
27 0.184425 3.0
28 0.473998 3.0
29 NaN 3.0
30 NaN 3.0
31 NaN 3.0
33 0.465454 4.0
34 0.240867 4.0
35 NaN 4.0
36 NaN 4.0
37 0.323195 4.0
38 0.193764 4.0
39 NaN 4.0
40 NaN 4.0
41 NaN 4.0
42 NaN 4.0
Based on the results from df1['b'], where only 1 and 3 remain as valid numbers, how would I go about keeping the rows of df2 where df2['bb'] is 1 or 3, and setting all other values in df2 to np.nan, so that this would be the final product:
df2 = df2[np.isfinite(df2['aa'])]
print(df2)
aa bb
1 0.047606 1.0
2 0.202927 1.0
3 0.205663 1.0
5 1.388080 1.0
6 0.097084 1.0
7 0.136873 1.0
22 0.419956 3.0
23 0.267132 3.0
24 0.944217 3.0
25 0.403024 3.0
27 0.184425 3.0
28 0.473998 3.0
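No answer is shown above, but one way to get from the question's frames to the desired final product is Series.isin. This is only a sketch, run on a shortened reconstruction of df2 (row values taken from the printout above):

```python
import numpy as np
import pandas as pd

# Small reconstruction of the frames above (only a few of df2's rows, for brevity)
df1 = pd.DataFrame({'a': [0.375241, np.nan, 0.448792, np.nan],
                    'b': [1.0, 2.0, 3.0, 4.0]}, index=[1, 2, 3, 4])
df2 = pd.DataFrame({'aa': [0.047606, np.nan, 0.084676, 0.419956, np.nan, 0.465454],
                    'bb': [1.0, 1.0, 2.0, 3.0, 3.0, 4.0]})

df1 = df1[np.isfinite(df1['a'])]   # as in the question: b == 1.0 and 3.0 survive

# Blank out aa wherever bb is not one of the surviving b values...
df2.loc[~df2['bb'].isin(df1['b']), 'aa'] = np.nan
# ...then the question's own filter yields the desired final product.
df2 = df2[np.isfinite(df2['aa'])]
print(df2)
```

Only rows whose bb matches a surviving df1['b'] value and whose aa is finite remain.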

Related

Drop nan of each column in Pandas DataFrame

I have a dataframe, for example:
     A    B    C
0   1.0  NaN  NaN
1   1.0  NaN  NaN
2   1.0  NaN  NaN
3   1.0  2.0  NaN
4   1.0  2.0  NaN
5   1.0  2.0  NaN
6   NaN  2.0  3.0
7   NaN  2.0  3.0
8   NaN  2.0  3.0
9   NaN  NaN  3.0
10  NaN  NaN  3.0
11  NaN  NaN  3.0
And I would like to remove the NaN values of each column to get this result:
A B C
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 1 2 3
Is there an easy way to do that?
You can apply a custom sorting function to each column that doesn't actually sort numerically; it just moves all the NaN values to the end of the column. Then, dropna:
df = df.apply(lambda x: sorted(x, key=lambda v: isinstance(v, float) and np.isnan(v))).dropna()
Output:
>>> df
A B C
0 1.0 2.0 3.0
1 1.0 2.0 3.0
2 1.0 2.0 3.0
3 1.0 2.0 3.0
4 1.0 2.0 3.0
5 1.0 2.0 3.0
Given
>>> df
A B C
0 1.0 NaN NaN
1 1.0 NaN NaN
2 1.0 NaN NaN
3 1.0 2.0 NaN
4 1.0 2.0 NaN
5 1.0 2.0 NaN
6 NaN 2.0 3.0
7 NaN 2.0 3.0
8 NaN 2.0 3.0
9 NaN NaN 3.0
10 NaN NaN 3.0
11 NaN NaN 3.0
use
>>> df.apply(lambda s: s.dropna().to_numpy())
A B C
0 1.0 2.0 3.0
1 1.0 2.0 3.0
2 1.0 2.0 3.0
3 1.0 2.0 3.0
4 1.0 2.0 3.0
5 1.0 2.0 3.0
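Both answers rely on every column having the same number of non-NaN values. A self-contained run of the second approach on the "Given" frame above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1.0] * 6 + [np.nan] * 6,
    'B': [np.nan] * 3 + [2.0] * 6 + [np.nan] * 3,
    'C': [np.nan] * 6 + [3.0] * 6,
})

# Each column drops its NaNs independently; the resulting arrays all have
# length 6, so apply reassembles them into a 6-row DataFrame.
out = df.apply(lambda s: s.dropna().to_numpy())
print(out)
```

If the columns had differing non-NaN counts, the result would not line up into a rectangular frame.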

fill NA of a column with elements of another column

I'm in this situation:
my df looks like this
A B
0 0.0 2.0
1 3.0 4.0
2 NaN 1.0
3 2.0 NaN
4 NaN 1.0
5 4.8 NaN
6 NaN 1.0
and I want to apply this line of code:
df['A'] = df['B'].fillna(df['A'])
and I expect the intermediate step and final output to look like this:
A B
0 2.0 2.0
1 4.0 4.0
2 1.0 1.0
3 NaN NaN
4 1.0 1.0
5 NaN NaN
6 1.0 1.0
A B
0 2.0 2.0
1 4.0 4.0
2 1.0 1.0
3 2.0 NaN
4 1.0 1.0
5 4.8 NaN
6 1.0 1.0
but I receive this error:
TypeError: Unsupported type Series
probably because each time there is an NA it tries to fill it with the whole series rather than with the single element at the same index of the B column.
I receive the same error with syntax like this:
df['C'] = df['B'].fillna(df['A'])
so the problem does not seem to be that I'm first changing the values of A with the ones of B and then trying to fill the NAs of B with the values of a column that is technically the same as B.
I'm in a Databricks environment and I'm working with Koalas dataframes, but they behave like pandas ones.
Can you help me?
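For what it's worth, the TypeError appears to be Koalas-specific; in plain pandas the line from the question runs as-is. A sketch reproducing the final output, with the frame reconstructed from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0.0, 3.0, np.nan, 2.0, np.nan, 4.8, np.nan],
                   'B': [2.0, 4.0, 1.0, np.nan, 1.0, np.nan, 1.0]})

# B's NaNs are filled element-wise from A (matching on index), then assigned to A.
df['A'] = df['B'].fillna(df['A'])
print(df)
```

Rows 3 and 5, where B is NaN, keep A's original values (2.0 and 4.8), matching the question's expected final output.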
Another option
Suppose the following dataset
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'State':[1,2,3,4,5,6, 7, 8, 9, 10],
'Sno Center': ["Guntur", "Nellore", "Visakhapatnam", "Biswanath", "Doom-Dooma", "Guntur", "Labac-Silchar", "Numaligarh", "Sibsagar", "Munger-Jamalpu"],
'Mar-21': [121, 118.8, 131.6, 123.7, 127.8, 125.9, 114.2, 114.2, 117.7, 117.7],
'Apr-21': [121.1, 118.3, 131.5, np.NaN, 128.2, 128.2, 115.4, 115.1, np.NaN, 118.3]})
df
State Sno Center Mar-21 Apr-21
0 1 Guntur 121.0 121.1
1 2 Nellore 118.8 118.3
2 3 Visakhapatnam 131.6 131.5
3 4 Biswanath 123.7 NaN
4 5 Doom-Dooma 127.8 128.2
5 6 Guntur 125.9 128.2
6 7 Labac-Silchar 114.2 115.4
7 8 Numaligarh 114.2 115.1
8 9 Sibsagar 117.7 NaN
9 10 Munger-Jamalpu 117.7 118.3
Then
df.loc[(df["Mar-21"].notnull()) & (df["Apr-21"].isna()), "Apr-21"] = df["Mar-21"]
df
State Sno Center Mar-21 Apr-21
0 1 Guntur 121.0 121.1
1 2 Nellore 118.8 118.3
2 3 Visakhapatnam 131.6 131.5
3 4 Biswanath 123.7 123.7
4 5 Doom-Dooma 127.8 128.2
5 6 Guntur 125.9 128.2
6 7 Labac-Silchar 114.2 115.4
7 8 Numaligarh 114.2 115.1
8 9 Sibsagar 117.7 117.7
9 10 Munger-Jamalpu 117.7 118.3
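For reference, the same fill can be written more compactly with fillna. A sketch on just the two month columns, with a few values taken from the table above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Mar-21': [121.0, 123.7, 117.7],
                   'Apr-21': [121.1, np.nan, np.nan]})

# Equivalent to the .loc assignment above: NaNs in Apr-21 take the Mar-21 value.
df['Apr-21'] = df['Apr-21'].fillna(df['Mar-21'])
print(df)
```

The explicit .loc form is still useful when the condition is more involved than "fill NaN from the other column".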
IIUC:
try with max():
df['A'] = df[['A', 'B']].max(axis=1)
output of df:
A B
0 2.0 2.0
1 4.0 4.0
2 1.0 1.0
3 2.0 NaN
4 1.0 1.0
5 4.8 NaN
6 1.0 1.0

Change yearly ordered dataframe to seasonally ordered dataframe

In Pandas, I would like to create columns representing the season (e.g. travel season), starting in November and ending in October of the next year.
This is my snippet:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({
    'date': pd.date_range('1990-01-01', freq='M', periods=12),
    'travel_2016': np.random.randint(10, size=12),
    'travel_2017': np.random.randint(10, size=12),
    'travel_2018': np.random.randint(10, size=12),
    'travel_2019': np.random.randint(10, size=12),
    'travel_2020': np.random.randint(10, size=12)})
df['month_date'] = df['date'].dt.strftime('%m')
df = df.drop(columns=['date'])
I was trying the approach from "pandas groupby by customized year, e.g. a school year".
I failed after "unpivoting" the table with both of its solutions. It would be easier for me to keep the pivot table for future operations.
My desired output would be something like this:
season_2016/2017 season_2017/2018 season_2018/2019 season_2019/2020 month_date
0 8 7 7 4 11
1 0 1 4 8 12
2 1 4 5 9 01
3 8 3 5 7 02
4 4 7 8 3 03
5 6 8 4 4 04
6 5 8 3 1 05
7 7 0 1 1 06
8 1 2 1 3 07
9 8 9 7 5 08
10 7 7 7 8 09
11 9 1 4 0 10
Many thanks!
Your table is already formatted roughly as you want: you're basically shifting all the rows down by 2, and moving the 2 bottom rows up to the start, but shifted into the next year's column.
>>> year_ends = df.shift(-10)
>>> year_ends = year_ends.drop(columns=['month_date']).shift(axis='columns').join(year_ends['month_date'])
>>> year_ends
travel_2016 travel_2017 travel_2018 travel_2019 travel_2020 month_date
0 NaN 7.0 8.0 3.0 2.0 11
1 NaN 6.0 9.0 3.0 7.0 12
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN NaN
The rest is pretty easy:
>>> seasons = df.shift(2).fillna(year_ends)
>>> seasons
travel_2016 travel_2017 travel_2018 travel_2019 travel_2020 month_date
0 NaN 7.0 8.0 3.0 2.0 11
1 NaN 6.0 9.0 3.0 7.0 12
2 5.0 8.0 4.0 3.0 2.0 01
3 0.0 8.0 3.0 7.0 0.0 02
4 3.0 1.0 0.0 0.0 0.0 03
5 3.0 6.0 3.0 1.0 4.0 04
6 7.0 7.0 5.0 9.0 5.0 05
7 9.0 7.0 0.0 9.0 5.0 06
8 3.0 8.0 2.0 0.0 6.0 07
9 5.0 1.0 3.0 4.0 8.0 08
10 2.0 5.0 8.0 7.0 4.0 09
11 4.0 9.0 1.0 3.0 1.0 10
Of course you should now rename the columns appropriately:
>>> seasons.rename(columns=lambda c: c if not c.startswith('travel_') else f"season_{int(c[7:]) - 1}/{c[7:]}")
season_2015/2016 season_2016/2017 season_2017/2018 season_2018/2019 season_2019/2020 month_date
0 NaN 7.0 8.0 3.0 2.0 11
1 NaN 6.0 9.0 3.0 7.0 12
2 5.0 8.0 4.0 3.0 2.0 01
3 0.0 8.0 3.0 7.0 0.0 02
4 3.0 1.0 0.0 0.0 0.0 03
5 3.0 6.0 3.0 1.0 4.0 04
6 7.0 7.0 5.0 9.0 5.0 05
7 9.0 7.0 0.0 9.0 5.0 06
8 3.0 8.0 2.0 0.0 6.0 07
9 5.0 1.0 3.0 4.0 8.0 08
10 2.0 5.0 8.0 7.0 4.0 09
11 4.0 9.0 1.0 3.0 1.0 10
Note that the first 2 values of 2015 are NaN, which makes sense, as those months were not in the initial dataframe.
An alternate way is to use datetime tools. This may be more generic:
>>> data = df.set_index('month_date').rename_axis('year', axis='columns').stack().reset_index(name='data')
>>> data.head()
month_date year data
0 01 travel_2016 5
1 01 travel_2017 8
2 01 travel_2018 4
3 01 travel_2019 3
4 01 travel_2020 2
>>> dates = data['year'].str[7:].str.cat(data['month_date']).transform(pd.to_datetime, format='%Y%m')
>>> dates.head()
0 2016-01-01
1 2017-01-01
2 2018-01-01
3 2019-01-01
4 2020-01-01
Name: year, dtype: datetime64[ns]
Then, as in the linked question, get the fiscal year starting in November:
>>> season = dates.dt.to_period('Q-OCT').dt.qyear.rename('season')
>>> seasonal_data = data.join(season).pivot(index='month_date', columns='season', values='data')
>>> seasonal_data.rename(columns=lambda c: f"season_{c - 1}/{c}", inplace=True)
>>> seasonal_data.reindex([*df['month_date'][-2:], *df['month_date'][:-2]]).reset_index()
season month_date season_2015/2016 season_2016/2017 season_2017/2018 season_2018/2019 season_2019/2020 season_2020/2021
0 11 NaN 7.0 8.0 3.0 2.0 4.0
1 12 NaN 6.0 9.0 3.0 7.0 9.0
2 01 5.0 8.0 4.0 3.0 2.0 NaN
3 02 0.0 8.0 3.0 7.0 0.0 NaN
4 03 3.0 1.0 0.0 0.0 0.0 NaN
5 04 3.0 6.0 3.0 1.0 4.0 NaN
6 05 7.0 7.0 5.0 9.0 5.0 NaN
7 06 9.0 7.0 0.0 9.0 5.0 NaN
8 07 3.0 8.0 2.0 0.0 6.0 NaN
9 08 5.0 1.0 3.0 4.0 8.0 NaN
10 09 2.0 5.0 8.0 7.0 4.0 NaN
11 10 4.0 9.0 1.0 3.0 1.0 NaN
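The 'Q-OCT' period trick above can be checked in isolation: a fiscal year ending in October assigns November and December to the following qyear. A minimal sketch:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2016-10-15', '2016-11-15', '2017-01-15']))

# Fiscal year ends in October, so November and December roll into the next fiscal year.
fiscal = dates.dt.to_period('Q-OCT').dt.qyear
print(fiscal.tolist())
```

October 2016 is the last month of fiscal 2016, while November 2016 and January 2017 both belong to fiscal 2017.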

find duplicate subset of columns with nan values in dataframe

I have a dataframe with 4 columns that can have np.nan
df =
i_example i_frame OId HId
0 0 20 3.0 0.0
1 3 13 NaN 8.0
2 3 13 NaN 10.0
3 0 21 3.0 NaN
4 0 21 3.0 0.0
5 1 22 0.0 4.0
6 1 22 NaN 4.0
7 2 20 0.0 4.0
8 2 20 1.0 4.0
I am looking for invalid rows.
Invalid rows are:
[1] rows duplicated on the columns [i_example, i_frame, OId], or
[2] rows duplicated on the columns [i_example, i_frame, HId].
So in the example above, all rows except the first three are invalid.
valid_df =
i_example i_frame OId HId
0 0 20 3.0 0.0
1 3 13 NaN 8.0
2 3 13 NaN 10.0
and
invalid_df =
i_example i_frame OId HId
3 0 21 3.0 NaN
4 0 21 3.0 0.0
5 1 22 0.0 4.0
6 1 22 NaN 4.0
7 2 20 0.0 4.0
8 2 20 1.0 4.0
1 0 21 3.0 NaN
2 0 21 3.0 0.0
These two rows are invalid because of the condition [1].
and
3 1 22 0.0 4.0
4 1 22 NaN 4.0
are invalid because of the condition [2]
and
5 2 20 0.0 4.0
6 2 20 1.0 4.0
are invalid for the same reason
I tried duplicated() but it does not work with NaN values.
I am not sure if the df.duplicated() function offers a way to ignore NaNs, but you can add a condition to check whether the value is NaN and then find the duplicates.
df[df.duplicated(['i_example', 'i_frame', 'OId'], keep=False) & df['OId'].notna()]
Result:
i_example i_frame OId HId
3 0 21 3.0 NaN
4 0 21 3.0 0.0
So, for your question, I would check that the value is not NaN, then find the duplicates using df.duplicated() and create a boolean mask. With that, filter the df into valid and invalid.
dupes = ((df['OId'].notna() & df.duplicated(['i_example', 'i_frame', 'OId'], keep=False))
         | (df['HId'].notna() & df.duplicated(['i_example', 'i_frame', 'HId'], keep=False)))
invalid_df = df[dupes]
valid_df = df[~dupes]
Result:
valid_df =
i_example i_frame OId HId
0 0 20 3.0 0.0
1 3 13 NaN 8.0
2 3 13 NaN 10.0
invalid_df =
i_example i_frame OId HId
3 0 21 3.0 NaN
4 0 21 3.0 0.0
5 1 22 0.0 4.0
6 1 22 NaN 4.0
7 2 20 0.0 4.0
8 2 20 1.0 4.0
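Putting it together as a runnable sketch (frame reconstructed from the question). Note that df.duplicated treats NaN values as equal to each other, which is exactly why the notna guards are needed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'i_example': [0, 3, 3, 0, 0, 1, 1, 2, 2],
    'i_frame':   [20, 13, 13, 21, 21, 22, 22, 20, 20],
    'OId': [3.0, np.nan, np.nan, 3.0, 3.0, 0.0, np.nan, 0.0, 1.0],
    'HId': [0.0, 8.0, 10.0, np.nan, 0.0, 4.0, 4.0, 4.0, 4.0],
})

# Without the notna guards, rows 1 and 2 would be flagged as duplicates on
# [i_example, i_frame, OId] because their NaN OIds compare equal.
dupes = ((df['OId'].notna() & df.duplicated(['i_example', 'i_frame', 'OId'], keep=False))
         | (df['HId'].notna() & df.duplicated(['i_example', 'i_frame', 'HId'], keep=False)))
valid_df, invalid_df = df[~dupes], df[dupes]
```

This reproduces the valid_df / invalid_df split shown above.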

Indexing columns based on cell value in pandas

I have a dataframe of race results. I'd like to create a series that takes the last stage position and subtracts from it the average of all the stages before it. Here is a small slice of the df (it could have more stages, countries, and rows):
race_location stage1_position stage2_position stage3_position number_of_stages
AUS 2.0 2.0 NaN 2
AUS 1.0 5.0 NaN 2
AUS 3.0 4.0 NaN 2
AUS 4.0 8.0 NaN 2
AUS 10.0 6.0 NaN 2
AUS 9.0 7.0 NaN 2
FRA 23.0 1.0 10.0 3
FRA 6.0 12.0 24.0 3
FRA 14.0 11.0 14.0 3
FRA 18.0 10.0 1.0 3
FRA 15.0 14.0 4.0 3
USA 24.0 NaN NaN 1
USA 7.0 NaN NaN 1
USA 22.0 NaN NaN 1
USA 11.0 NaN NaN 1
USA 8.0 NaN NaN 1
USA 16.0 NaN NaN 1
USA 13.0 NaN NaN 1
USA 19.0 NaN NaN 1
USA 5.0 NaN NaN 1
USA 25.0 NaN NaN 1
The output would be
last_stage_minus_average
0
4
1
4
-4
-2
-2
15
1.5
-13
-10.5
0
0
0
0
0
0
0
0
0
0
0
This won't work, but I was thinking of something like this:
new_series = []
for country in country_list:
    num_stages = df.loc[df['race_location'] == country, 'number_of_stages']
    difference = (df.loc[df['race_location'] == country, num_stages]
                  - df.iloc[:, 0:num_stages - 1].mean(axis=1))
    new_series.append(difference)
I'm not sure how to go about doing this. Any help or direction would be amazing!
# Use pandas apply to take the mean of the first n-1 stages and subtract it from the last stage.
df.apply(lambda x: x.iloc[x.number_of_stages] - np.mean(x.iloc[1:x.number_of_stages]), axis=1).fillna(0)
Out[264]:
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 0.0
17 0.0
18 0.0
19 0.0
20 0.0
dtype: float64
I'd use filter to get just the stage columns, then stack and groupby:
stages = df.filter(regex=r'^stage\d+.*')
stages.stack().groupby(level=0).apply(
lambda x: x.iloc[-1] - x.iloc[:-1].mean()
).fillna(0)
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 0.0
17 0.0
18 0.0
19 0.0
20 0.0
dtype: float64
how it works
stack will automatically drop the NaN values when converting to a series.
Now, position -1 is the last value within each group if we group by the first level of the new MultiIndex.
So, we use a lambda and calculate the mean of everything up to the last value: x.iloc[:-1].mean()
And subtract that from the last value x.iloc[-1]
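A self-contained run of the stack/groupby approach, on a few rows taken from the question's frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'race_location': ['AUS', 'AUS', 'FRA', 'FRA', 'USA'],
    'stage1_position': [2.0, 1.0, 23.0, 6.0, 24.0],
    'stage2_position': [2.0, 5.0, 1.0, 12.0, np.nan],
    'stage3_position': [np.nan, np.nan, 10.0, 24.0, np.nan],
    'number_of_stages': [2, 2, 3, 3, 1],
})

stages = df.filter(regex=r'^stage\d+')
# stack() drops the NaNs; within each row's group, the last value is the final stage.
result = (stages.stack()
                .groupby(level=0)
                .apply(lambda x: x.iloc[-1] - x.iloc[:-1].mean())
                .fillna(0))
print(result.tolist())
```

The one-stage USA row produces NaN (mean of an empty slice), which fillna(0) maps to 0, matching the desired output.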
subtracts that by the average of all the stages before that
It's not a big deal, but I'm just curious! Unlike your desired output, but going by your description: if a racer finished only one race, shouldn't their result be inf or NaN instead of 0? (To distinguish them from someone who has already done 2-3 races but whose last result is exactly equal to their average, like racer #1 vs racers #11-20.)
df_sp = df.filter(regex=r'^stage\d+.*')
df['last'] = df_sp.ffill(axis=1).iloc[:, -1]
df['mean'] = (df_sp.sum(axis=1) - df['last']) / (df['number_of_stages'] - 1)
print(df['last'] - df['mean'])
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 NaN
