Get proportionate values of columns in a dataframe - Pandas - python

I have a dataframe like this,
ds 0 1 2 4 5 6
0 1991Q3 nan nan nan nan 1.0 nan
1 2014Q2 1.0 3.0 nan nan 1.0 nan
2 2014Q3 1.0 nan nan 1.0 4.0 nan
3 2014Q4 nan nan nan 2.0 3.0 nan
4 2015Q1 nan 1.0 2.0 4.0 4.0 nan
I would like the proportions for each column 0-6 like this,
ds 0 1 2 4 5 6
0 1991Q3 0.00 0.00 0.00 0.00 1.00 0.00
1 2014Q2 0.20 0.60 0.00 0.00 0.20 0.00
2 2014Q3 0.16 0.00 0.00 0.16 0.67 0.00
3 2014Q4 0.00 0.00 0.00 0.40 0.60 0.00
4 2015Q1 0.00 0.09 0.18 0.36 0.36 0.00
Is there a pandas way to do this? Any suggestion would be great.

You can do this:
import numpy as np

df = df.replace(np.nan, 0)
df = df.set_index('ds')
In [3194]: df.div(df.sum(1),0).reset_index()
Out[3194]:
ds 0 1 2 4 5 6
0 1991Q3 0.00 0.00 0.00 0.00 1.00 0.00
1 2014Q2 0.20 0.60 0.00 0.00 0.20 0.00
2 2014Q3 0.17 0.00 0.00 0.17 0.67 0.00
3 2014Q4 0.00 0.00 0.00 0.40 0.60 0.00
4 2015Q1 0.00 0.09 0.18 0.36 0.36 0.00
OR you can use df.apply:
In [3196]: df = df.replace(np.nan, 0)
In [3197]: df.iloc[:,1:] = df.iloc[:,1:].apply(lambda x: x/x.sum(), axis=1)
In [3198]: df
Out[3198]:
ds 0 1 2 4 5 6
0 1991Q3 0.00 0.00 0.00 0.00 1.00 0.00
1 2014Q2 0.20 0.60 0.00 0.00 0.20 0.00
2 2014Q3 0.17 0.00 0.00 0.17 0.67 0.00
3 2014Q4 0.00 0.00 0.00 0.40 0.60 0.00
4 2015Q1 0.00 0.09 0.18 0.36 0.36 0.00

Set the first column as the index, take the sum of each row, divide the frame by those row sums, and fill the null entries with 0:
res = df.set_index("ds")
res.fillna(0).div(res.sum(1),axis=0)
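For completeness, a minimal end-to-end sketch of that second answer, with the frame rebuilt from the question (column names and values copied from above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ds': ['1991Q3', '2014Q2', '2014Q3', '2014Q4', '2015Q1'],
    0: [np.nan, 1, 1, np.nan, np.nan],
    1: [np.nan, 3, np.nan, np.nan, 1],
    2: [np.nan, np.nan, np.nan, np.nan, 2],
    4: [np.nan, np.nan, 1, 2, 4],
    5: [1, 1, 4, 3, 4],
    6: [np.nan] * 5,
})

res = df.set_index('ds')
# Row sums skip NaN by default, so summing before fillna gives the same divisor.
out = res.fillna(0).div(res.sum(axis=1), axis=0).reset_index()
print(out.round(2))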

Related

Add selected interactions as columns to pandas dataframe

I'm fairly new to pandas and python. I'm trying to return a few selected interaction terms of all possible interactions in a data frame, and then return them as new features in the df.
My solution was to calculate the interactions of interest using sklearn's PolynomialFeatures() and attach them to the df in a for loop. See example:
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
np.random.seed(1111)
a1 = np.random.randint(2, size = (5,3))
a2 = np.round(np.random.random((5,3)),2)
df = pd.DataFrame(np.concatenate([a1, a2], axis = 1), columns = ['a','b','c','d','e','f'])
combinations = [['a', 'e'], ['a', 'f'], ['b', 'f']]
for comb in combinations:
    polynomizer = PolynomialFeatures(interaction_only=True, include_bias=False).fit(df[comb])
    newcol_nam = polynomizer.get_feature_names(comb)[2]
    newcol_val = polynomizer.transform(df[comb])[:,2]
    df[newcol_nam] = newcol_val
df
a b c d e f a e a f b f
0 0.0 1.0 1.0 0.51 0.45 0.10 0.00 0.00 0.10
1 1.0 0.0 0.0 0.67 0.36 0.23 0.36 0.23 0.00
2 0.0 0.0 0.0 0.97 0.79 0.02 0.00 0.00 0.00
3 0.0 1.0 0.0 0.44 0.37 0.52 0.00 0.00 0.52
4 0.0 0.0 0.0 0.16 0.02 0.94 0.00 0.00 0.00
Another solution would be to run
PolynomialFeatures(2, interaction_only=True, include_bias=False).fit_transform(df)
and then drop the interactions I'm not interested in.
However, neither option is ideal in terms of performance and I'm wondering if there is a better solution.
As commented, you can try:
df = df.join(pd.DataFrame({
    f'{x} {y}': df[x]*df[y] for x,y in combinations
}))
Or simply:
for comb in combinations:
    df[' '.join(comb)] = df[comb].prod(1)
Output:
a b c d e f a e a f b f
0 0.0 1.0 1.0 0.51 0.45 0.10 0.00 0.00 0.10
1 1.0 0.0 0.0 0.67 0.36 0.23 0.36 0.23 0.00
2 0.0 0.0 0.0 0.97 0.79 0.02 0.00 0.00 0.00
3 0.0 1.0 0.0 0.44 0.37 0.52 0.00 0.00 0.52
4 0.0 0.0 0.0 0.16 0.02 0.94 0.00 0.00 0.00
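A nice property of the prod variant (my observation, not part of the original answer): it generalizes to higher-order interactions with no extra code. A hypothetical three-way example:
# Hypothetical three-way interaction; prod multiplies across all listed columns.
for comb in [['a', 'b', 'f']]:
    df[' '.join(comb)] = df[comb].prod(axis=1)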

change values in dataframe row based on condition

I have this dataframe
Region 2021 2022 2023
0 Europe 0.00 0.00 0.00
1 N.Amerca 0.50 0.50 0.50
2 N.Amerca 4.40 4.40 4.40
3 N.Amerca 0.00 8.00 8.00
4 Asia 0.00 0.00 1.75
5 Asia 0.00 0.00 0.00
6 Asia 0.00 0.00 2.00
7 N.Amerca 0.00 0.00 0.50
8 Eurpoe 6.00 6.00 6.00
9 Asia 7.50 7.50 7.50
10 Asia 3.75 3.75 3.75
11 Asia 3.50 3.50 3.50
12 Asia 3.80 3.80 3.80
13 Asia 0.00 0.00 0.00
14 Europe 6.52 6.52 6.52
Once a non-zero value appears in 2021, the remaining columns (2022 and 2023) should become 0; likewise, if the first non-zero value is in 2022, everything to its right should become 0. In other words, once a value is found in any of the year columns from 2021 onward, all columns to its right should be zeroed.
expected result would be:
Region 2021 2022 2023
0 Europe 0.00 0.00 0.00
1 N.Amerca 0.50 0.00 0.00
2 N.Amerca 4.40 0.00 0.00
3 N.Amerca 0.00 8.00 0.00
4 Asia 0.00 0.00 1.75
5 Asia 0.00 0.00 0.00
6 Asia 0.00 0.00 2.00
7 N.Amerca 0.00 0.00 0.50
8 Eurpoe 6.00 0.00 0.00
9 Asia 7.50 0.00 0.00
10 Asia 3.75 0.00 0.00
11 Asia 3.50 0.00 0.00
12 Asia 3.80 0.00 0.00
13 Asia 0.00 0.00 0.00
14 Europe 6.52 0.00 0.00
I have tried to apply a lambda:
def foo(r):
    # if r['2021'] > 0, then the columns after it should be zero
    ...

df = df.apply(lambda x: foo(x), axis=1)
but the challenge is that the columns run from 2021 to 2030, and foo becomes a mess.
Let us try duplicated
df = df.mask(df.T.apply(pd.Series.duplicated).T,0)
Out[57]:
Region 2021 2022 2023
0 Europe 0.00 0.0 0.00
1 N.Amerca 0.50 0.0 0.00
2 N.Amerca 4.40 0.0 0.00
3 N.Amerca 0.00 8.0 0.00
4 Asia 0.00 0.0 1.75
5 Asia 0.00 0.0 0.00
6 Asia 0.00 0.0 2.00
7 N.Amerca 0.00 0.0 0.50
8 Eurpoe 6.00 0.0 0.00
9 Asia 7.50 0.0 0.00
10 Asia 3.75 0.0 0.00
11 Asia 3.50 0.0 0.00
12 Asia 3.80 0.0 0.00
13 Asia 0.00 0.0 0.00
14 Europe 6.52 0.0 0.00
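One caveat worth noting (an observation, not from the original answer): duplicated only zeroes values that exactly repeat an earlier value in the same row, so this relies on the carried-forward values being identical copies. A small hypothetical row shows where it would not fire:
import pandas as pd

# Hypothetical row whose later values differ from the first non-zero value.
row = pd.DataFrame({'Region': ['X'], '2021': [3.0], '2022': [5.0], '2023': [7.0]})
print(row.mask(row.T.apply(pd.Series.duplicated).T, 0))
# 2022 and 2023 survive because nothing repeats; the cumsum mask shown
# below zeroes them as intended.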
This is another way:
df2 = df.set_index('Region').diff(axis=1).reset_index()
df2['2021'] = df['2021']
or:
df.iloc[:,1:].where(df.iloc[:,1:].ne(0).cumsum(axis=1).eq(1),0)
Output:
2021 2022 2023
0 0.00 0.0 0.00
1 0.50 0.0 0.00
2 4.40 0.0 0.00
3 0.00 8.0 0.00
4 0.00 0.0 1.75
5 0.00 0.0 0.00
6 0.00 0.0 2.00
7 0.00 0.0 0.50
8 6.00 0.0 0.00
9 7.50 0.0 0.00
10 3.75 0.0 0.00
11 3.50 0.0 0.00
12 3.80 0.0 0.00
13 0.00 0.0 0.00
14 6.52 0.0 0.00
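Note that the where one-liner above only returns the year columns; to write the result back while keeping Region, assign it in place. A small sketch, assuming the year columns are everything after Region (the question mentions they run from 2021 to 2030):
year_cols = df.columns[1:]                       # '2021' ... '2030'
mask = df[year_cols].ne(0).cumsum(axis=1).eq(1)  # True through the first non-zero in each row
df[year_cols] = df[year_cols].where(mask, 0)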

Return column names for 3 highest values in rows

I'm trying to come up with a way to return the column names for the 3 highest values in each row of the table below. So far I've been able to return the highest value using idxmax but I haven't been able to figure out how to get the 2nd and 3rd highest.
Clust Stat1 Stat2 Stat3 Stat4 Stat5 Stat6
0 9 0.00 0.15 0.06 0.11 0.23 0.01
1 4 0.00 0.25 0.04 0.10 0.10 0.00
2 11 0.00 0.34 0.00 0.09 0.24 0.00
3 12 0.00 0.16 0.00 0.11 0.00 0.00
4 0 0.00 0.35 0.00 0.04 0.02 0.00
5 17 0.01 0.21 0.02 0.18 0.27 0.01
Expected output:
Clust Stat1 Stat2 Stat3 Stat4 Stat5 Stat6 TopThree
0 9 0.00 0.15 0.06 0.11 0.23 0.01 [Stat5,Stat2,Stat4]
1 4 0.00 0.25 0.04 0.10 0.10 0.00 [Stat2,Stat4,Stat5]
2 11 0.00 0.34 0.00 0.09 0.24 0.00 [Stat2,Stat5,Stat4]
3 12 0.00 0.16 0.00 0.19 0.00 0.01 [Stat4,Stat2,Stat6]
4 0 0.00 0.35 0.00 0.04 0.02 0.00 [Stat2,Stat4,Stat5]
5 17 0.01 0.21 0.02 0.18 0.27 0.01 [Stat5,Stat2,Stat4]
If anyone has ideas on how to do this I'd appreciate it.
Use numpy.argsort to get the positions of the sorted values, skipping the first column:
import numpy as np

a = df.iloc[:, 1:].to_numpy()
df['TopThree'] = df.columns[1:].to_numpy()[np.argsort(-a, axis=1)[:, :3]].tolist()
print (df)
Clust Stat1 Stat2 Stat3 Stat4 Stat5 Stat6 TopThree
0 9 0.00 0.15 0.06 0.11 0.23 0.01 [Stat5, Stat2, Stat4]
1 4 0.00 0.25 0.04 0.10 0.10 0.00 [Stat2, Stat4, Stat5]
2 11 0.00 0.34 0.00 0.09 0.24 0.00 [Stat2, Stat5, Stat4]
3 12 0.00 0.16 0.00 0.11 0.00 0.00 [Stat2, Stat4, Stat1]
4 0 0.00 0.35 0.00 0.04 0.02 0.00 [Stat2, Stat4, Stat5]
5 17 0.01 0.21 0.02 0.18 0.27 0.01 [Stat5, Stat2, Stat4]
If performance is not important:
df['TopThree'] = df.iloc[:, 1:].apply(lambda x: x.nlargest(3).index.tolist(), axis=1)
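One detail to be aware of (my addition): np.argsort's default sort is not stable, so when two columns tie, their relative order inside TopThree is not guaranteed. If ties should resolve toward the left-most column, pass kind='stable':
# Stable sort keeps the original column order among tied values.
order = np.argsort(-a, axis=1, kind='stable')[:, :3]
df['TopThree'] = df.columns[1:].to_numpy()[order].tolist()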

Pandas sum over partition by rows following SQL equivalent

I am looking for a way to aggregate (in pandas) a subset of values based on a particular partition, the equivalent of:
select table.*,
sum(income) over (order by id, num_yyyymm rows between 3 preceding and 1 preceding) as prev_income_3,
sum(income) over (order by id, num_yyyymm rows between 1 following and 3 following) as next_income_3
from table order by id_customer, num_yyyymm;
I tried the following solution, but it has some problems:
1) It takes ages to complete.
2) I have to merge all the results back together at the end of this loop:
for x, y in df.groupby(['id_customer']):
    print(y[['num_yyyymm', 'income']])
    y['next3'] = y['income'].iloc[::-1].rolling(3).sum()
    print(y[['num_yyyymm', 'income', 'next3']])
    break
Results:
num_yyyymm income next3
0 201501 0.00 0.00
1 201502 0.00 0.00
2 201503 0.00 0.00
3 201504 0.00 0.00
4 201505 0.00 0.00
5 201506 0.00 0.00
6 201507 0.00 0.00
7 201508 0.00 0.00
8 201509 0.00 0.00
9 201510 0.00 0.00
10 201511 0.00 0.00
11 201512 0.00 0.00
12 201601 0.00 0.00
13 201602 0.00 0.00
14 201603 0.00 0.00
15 201604 0.00 0.00
16 201605 0.00 0.00
17 201606 0.00 0.00
18 201607 0.00 0.00
19 201608 0.00 0.00
20 201609 0.00 1522.07
21 201610 0.00 1522.07
22 201611 0.00 1522.07
23 201612 1522.07 0.00
24 201701 0.00 -0.00
25 201702 0.00 1.52
26 201703 0.00 1522.07
27 201704 0.00 1522.07
28 201705 1.52 1520.55
29 201706 1520.55 0.00
30 201707 0.00 NaN
31 201708 0.00 NaN
32 201709 0.00 NaN
Does anybody have an alternative solution?
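Since the question is left open, here is a minimal sketch of one way to do it without a Python-level loop, using groupby().transform with rolling sums. It assumes the column names from the question (id_customer, num_yyyymm, income) and that the SQL window is intended to be per customer, as the groupby in the attempt suggests:
import pandas as pd

df = df.sort_values(['id_customer', 'num_yyyymm'])
g = df.groupby('id_customer')['income']

# Rows between 3 preceding and 1 preceding: shift(1) drops the current
# row out of the window, rolling(3) then covers the three rows before it.
df['prev_income_3'] = g.transform(
    lambda s: s.shift(1).rolling(3, min_periods=1).sum())

# Rows between 1 following and 3 following: the same idea on the
# reversed series, then reversed back into the original order.
df['next_income_3'] = g.transform(
    lambda s: s[::-1].shift(1).rolling(3, min_periods=1).sum()[::-1])
With min_periods=1 the edge windows produce partial sums; the first row of each group still comes out NaN for prev_income_3 because its window is empty.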

Delete non-consecutive values from a dataframe column

I have a dataframe like this:
Ind TIME PREC ET PET YIELD
0 1 1.21 0.02 0.02 0.00
1 2 0.00 0.03 0.04 0.00
2 3 0.00 0.03 0.05 0.00
3 4 0.00 0.04 0.05 0.00
4 5 0.00 0.05 0.07 0.00
5 6 0.00 0.03 0.05 0.00
6 7 0.00 0.02 0.04 0.00
7 8 1.14 0.03 0.04 0.00
8 9 0.10 0.02 0.03 0.00
9 10 0.00 0.03 0.04 0.00
10 11 0.10 0.05 0.11 0.00
11 12 0.00 0.06 0.15 0.00
12 13 2.30 0.14 0.44 0.00
13 14 0.17 0.09 0.29 0.00
14 15 0.00 0.13 0.35 0.00
15 16 0.00 0.14 0.39 0.00
16 17 0.00 0.10 0.31 0.00
17 18 0.00 0.15 0.51 0.00
18 19 0.00 0.22 0.58 0.00
19 20 0.10 0.04 0.09 0.00
20 21 0.00 0.04 0.06 0.00
21 22 0.27 0.13 0.43 0.00
22 23 0.00 0.10 0.25 0.00
23 24 0.00 0.03 0.04 0.00
24 25 0.00 0.04 0.05 0.00
25 26 0.43 0.04 0.15 0.00
26 27 0.17 0.06 0.23 0.00
27 28 0.50 0.02 0.04 0.00
28 29 0.00 0.03 0.04 0.00
29 30 0.00 0.04 0.08 0.00
30 31 0.00 0.04 0.08 0.00
31 1 6.48 1.97 5.10 0.03
32 32 0.00 0.22 0.70 0.00
33 33 0.00 0.49 0.88 0.00
In this dataframe the column 'TIME' holds the ordinal day number within the year, and after the end of every month it also holds the ordinal month number, which messes up all dataframe calculations. For this reason I would like to drop all rows that contain a month value. First I tried .shift():
df = df.loc[df.TIME == df.TIME.shift() + 1]
however, this deletes twice as many rows as it should. I also tried to delete every value after the end of every month:
for i in indexes:
    df = df.loc[df.index != i]
where indexes is a list of the row indexes that follow day values of 31, 59, ..., 365, i.e. the end of every month. However, in a leap year these values would be different; I could create another list for leap years, but that would be very non-Pythonic. So I wonder: is there a better way to delete non-consecutive values from a dataframe (excluding where one year ends and the next begins: 364, 365, 1, 2)?
EDIT: I should probably add that there are twenty years in this dataframe; this is what it looks like at the end of each year:
TIME PREC ET PET YIELD
370 360 0.00 0.14 0.26 0.04
371 361 0.00 0.15 0.27 0.04
372 362 0.00 0.14 0.25 0.04
373 363 0.11 0.18 0.32 0.04
374 364 0.00 0.15 0.25 0.04
375 365 0.00 0.17 0.29 0.04
376 12 16.29 4.44 7.74 1.89
377 1 0.00 0.16 0.28 0.03
378 2 0.00 0.18 0.32 0.03
379 3 0.00 0.22 0.40 0.03
df
TIME PREC ET PET YIELD
0 360 0.00 0.14 0.26 0.04
1 361 0.00 0.15 0.27 0.04
2 362 0.00 0.14 0.25 0.04
3 363 0.11 0.18 0.32 0.04
4 364 0.00 0.15 0.25 0.04
5 365 0.00 0.17 0.29 0.04
6 12 16.29 4.44 7.74 1.89
7 1 1.21 0.02 0.02 0.00
8 2 0.00 0.03 0.04 0.00
9 3 0.00 0.03 0.05 0.00
10 4 0.00 0.04 0.05 0.00
11 5 0.00 0.05 0.07 0.00
12 6 0.00 0.03 0.05 0.00
13 7 0.00 0.02 0.04 0.00
14 8 1.14 0.03 0.04 0.00
15 9 0.10 0.02 0.03 0.00
16 10 0.00 0.03 0.04 0.00
17 11 0.10 0.05 0.11 0.00
18 12 0.00 0.06 0.15 0.00
19 13 2.30 0.14 0.44 0.00
20 14 0.17 0.09 0.29 0.00
21 15 0.00 0.13 0.35 0.00
22 16 0.00 0.14 0.39 0.00
23 17 0.00 0.10 0.31 0.00
24 18 0.00 0.15 0.51 0.00
25 19 0.00 0.22 0.58 0.00
26 20 0.10 0.04 0.09 0.00
27 21 0.00 0.04 0.06 0.00
28 22 0.27 0.13 0.43 0.00
29 23 0.00 0.10 0.25 0.00
30 24 0.00 0.03 0.04 0.00
31 25 0.00 0.04 0.05 0.00
32 26 0.43 0.04 0.15 0.00
33 27 0.17 0.06 0.23 0.00
34 28 0.50 0.02 0.04 0.00
35 29 0.00 0.03 0.04 0.00
36 30 0.00 0.04 0.08 0.00
37 31 0.00 0.04 0.08 0.00
38 1 6.48 1.97 5.10 0.03
39 32 0.00 0.22 0.70 0.00
40 33 0.00 0.49 0.88 0.00
Look at the diffs in TIME and drop the rows where the diff is -12 or less. The month rows always drop TIME by at least 12 (e.g. 31 → 1 or 365 → 12), while the legitimate year rollover 12 → 1 only drops it by 11, so it survives:
df[~df.TIME.diff().le(-12)]
TIME PREC ET PET YIELD
0 360 0.00 0.14 0.26 0.04
1 361 0.00 0.15 0.27 0.04
2 362 0.00 0.14 0.25 0.04
3 363 0.11 0.18 0.32 0.04
4 364 0.00 0.15 0.25 0.04
5 365 0.00 0.17 0.29 0.04
7 1 1.21 0.02 0.02 0.00
8 2 0.00 0.03 0.04 0.00
9 3 0.00 0.03 0.05 0.00
10 4 0.00 0.04 0.05 0.00
11 5 0.00 0.05 0.07 0.00
12 6 0.00 0.03 0.05 0.00
13 7 0.00 0.02 0.04 0.00
14 8 1.14 0.03 0.04 0.00
15 9 0.10 0.02 0.03 0.00
16 10 0.00 0.03 0.04 0.00
17 11 0.10 0.05 0.11 0.00
18 12 0.00 0.06 0.15 0.00
19 13 2.30 0.14 0.44 0.00
20 14 0.17 0.09 0.29 0.00
21 15 0.00 0.13 0.35 0.00
22 16 0.00 0.14 0.39 0.00
23 17 0.00 0.10 0.31 0.00
24 18 0.00 0.15 0.51 0.00
25 19 0.00 0.22 0.58 0.00
26 20 0.10 0.04 0.09 0.00
27 21 0.00 0.04 0.06 0.00
28 22 0.27 0.13 0.43 0.00
29 23 0.00 0.10 0.25 0.00
30 24 0.00 0.03 0.04 0.00
31 25 0.00 0.04 0.05 0.00
32 26 0.43 0.04 0.15 0.00
33 27 0.17 0.06 0.23 0.00
34 28 0.50 0.02 0.04 0.00
35 29 0.00 0.03 0.04 0.00
36 30 0.00 0.04 0.08 0.00
37 31 0.00 0.04 0.08 0.00
39 32 0.00 0.22 0.70 0.00
40 33 0.00 0.49 0.88 0.00
df[df['TIME'].shift().fillna(0) <= df['TIME']]
Gives what you're looking for. You were almost there with
df.loc[df.TIME == df.TIME.shift() +1]
But you don't need to get rid of the cases where the shifted value is smaller than the current one; those are just the first days after each month row.
The addition of .fillna(0) takes care of the NaN in the first row of df['TIME'].shift().
Edit:
For the end-of-year case, also keep the rows where the shifted value exceeds the current one by up to 11, to catch the rollover after the month-12 row.
That would give
df[(df['TIME'].shift().fillna(0) <= df['TIME']+11)]
Edit2:
By the by, I checked solution runtimes, and the current version (df[~df.TIME.diff().le(-12)]) of @piRSquared's answer seems to run fastest.
For completeness, comparing the version presented in this post with the original version posted by @piRSquared: the former was a bit faster on datasets of roughly 10,000 rows or fewer, the latter somewhat faster on larger ones.
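For anyone who wants to rerun that comparison, a rough IPython harness (the data here is synthetic and purely hypothetical, so exact numbers will differ):
import numpy as np
import pandas as pd

# Days 1-31 followed by a month marker, tiled to a few hundred thousand rows.
df = pd.DataFrame({'TIME': np.tile(np.r_[np.arange(1, 32), 1], 10000)})

%timeit df[~df.TIME.diff().le(-12)]
%timeit df[df['TIME'].shift().fillna(0) <= df['TIME'] + 11]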
