I am trying to calculate the relative weights within each row of df1, capped at a maximum of 0.5. So far I have been able to calculate the relative weights in df2, but without an upper bound. Here is a simple example:
import pandas as pd
df1 = pd.DataFrame({
'Dates':['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05'],
'ID1':[0,0,2,1,1],
'ID2':[1,3,1,1,2],
'ID3':[1,0,0,1,0],
'ID4':[1,1,7,1,0],
'ID5':[0,6,0,0,1]})
df1:
Dates ID1 ID2 ID3 ID4 ID5
0 2021-01-01 0 1 1 1 0
1 2021-01-02 0 3 0 1 6
2 2021-01-03 2 1 0 7 0
3 2021-01-04 1 1 1 1 0
4 2021-01-05 1 2 0 0 1
df1 = df1.set_index('Dates').T
df2 = df1.transform(lambda x: x/sum(x)).T
df2.round(2)
df2:
Dates ID1 ID2 ID3 ID4 ID5
2021-01-01 0.00 0.33 0.33 0.33 0.00
2021-01-02 0.00 0.30 0.00 0.10 0.60
2021-01-03 0.20 0.10 0.00 0.70 0.00
2021-01-04 0.25 0.25 0.25 0.25 0.00
2021-01-05 0.25 0.50 0.00 0.00 0.25
I want to get df3, in which each relative weight is capped at 0.5:
df3:
Dates ID1 ID2 ID3 ID4 ID5
2021-01-01 0.00 0.33 0.33 0.33 0.00
2021-01-02 0.00 0.30 0.00 0.10 0.50
2021-01-03 0.20 0.10 0.00 0.50 0.00
2021-01-04 0.25 0.25 0.25 0.25 0.00
2021-01-05 0.25 0.50 0.00 0.00 0.25
When I use the following adjusted function, I get the error "Transform function failed":
df1.transform(lambda x: x/sum(x) if x/sum(x) < 0.5 else 0.5).T
Thanks a lot!
Instead of transposing and applying a transformation to each element, we can operate on the rows directly (here df1 is the original frame, before the transpose):
df3 = df1.copy().set_index('Dates')
df3 = df3.div(df3.sum(axis=1), axis=0).clip(upper=0.5).round(2).reset_index()
Output:
Dates ID1 ID2 ID3 ID4 ID5
0 2021-01-01 0.00 0.33 0.33 0.33 0.00
1 2021-01-02 0.00 0.30 0.00 0.10 0.50
2 2021-01-03 0.20 0.10 0.00 0.50 0.00
3 2021-01-04 0.25 0.25 0.25 0.25 0.00
4 2021-01-05 0.25 0.50 0.00 0.00 0.25
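As a possible explanation of the error in the question, here is a minimal sketch on the first three rows of the example data. It assumes the failure comes from using a whole Series inside `if`, which pandas refuses to evaluate as a single boolean; the names `weights`, `capped` and `row_totals` are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Dates': ['2021-01-01', '2021-01-02', '2021-01-03'],
    'ID1': [0, 0, 2],
    'ID2': [1, 3, 1],
    'ID3': [1, 0, 0],
    'ID4': [1, 1, 7],
    'ID5': [0, 6, 0],
})

# Row-wise relative weights: divide every row by its own sum.
weights = df1.set_index('Dates')
weights = weights.div(weights.sum(axis=1), axis=0)

# A comparison on a Series yields a Series of booleans, so it cannot
# drive a plain `if`; pandas raises ValueError (ambiguous truth value).
ambiguous = False
try:
    if weights.iloc[0] < 0.5:
        pass
except ValueError:
    ambiguous = True

# clip() applies the cap element by element instead.
capped = weights.clip(upper=0.5)

# Capping does not redistribute the excess, so capped rows sum to less than 1.
row_totals = capped.sum(axis=1).round(2)
```

Note that clipping only truncates: a row that had a weight above 0.5 no longer sums to 1 afterwards. If the weights must still sum to 1, the excess would have to be redistributed in a separate step.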
Would this work for you?
You can use apply(..., axis=1) and clip the values at a maximum of 0.5 (this assumes Dates is always the first column; alternatively, we could set it as the index):
df1[df1.columns[1:]] = df1[df1.columns[1:]].apply(lambda x:x/sum(x), axis=1).clip(upper=0.5)
# assumes the transposed df1 from the question (one column per date)
for col in df1.columns:
    df1[col] = df1[col].apply(lambda x: x/sum(df1[col]) if x/sum(df1[col]) < 0.5 else 0.5)
Have fun!
I'm fairly new to pandas and Python. I'm trying to compute a few selected interaction terms, out of all possible interactions in a data frame, and add them to the df as new features.
My solution was to calculate the interactions of interest using sklearn's PolynomialFeatures() and attach them to the df in a for loop. See example:
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
np.random.seed(1111)
a1 = np.random.randint(2, size = (5,3))
a2 = np.round(np.random.random((5,3)),2)
df = pd.DataFrame(np.concatenate([a1, a2], axis = 1), columns = ['a','b','c','d','e','f'])
combinations = [['a', 'e'], ['a', 'f'], ['b', 'f']]
for comb in combinations:
    polynomizer = PolynomialFeatures(interaction_only=True, include_bias=False).fit(df[comb])
    # get_feature_names was removed in scikit-learn 1.2; get_feature_names_out replaces it
    newcol_nam = polynomizer.get_feature_names_out(comb)[2]
    newcol_val = polynomizer.transform(df[comb])[:,2]
    df[newcol_nam] = newcol_val
df
a b c d e f a e a f b f
0 0.0 1.0 1.0 0.51 0.45 0.10 0.00 0.00 0.10
1 1.0 0.0 0.0 0.67 0.36 0.23 0.36 0.23 0.00
2 0.0 0.0 0.0 0.97 0.79 0.02 0.00 0.00 0.00
3 0.0 1.0 0.0 0.44 0.37 0.52 0.00 0.00 0.52
4 0.0 0.0 0.0 0.16 0.02 0.94 0.00 0.00 0.00
Another solution would be to run
PolynomialFeatures(2, interaction_only=True, include_bias=False).fit_transform(df)
and then drop the interactions I'm not interested in.
However, neither option is ideal in terms of performance and I'm wondering if there is a better solution.
As commented, you can try:
df = df.join(pd.DataFrame({
f'{x} {y}': df[x]*df[y] for x,y in combinations
}))
Or simply:
for comb in combinations:
    df[' '.join(comb)] = df[comb].prod(axis=1)
Output:
a b c d e f a e a f b f
0 0.0 1.0 1.0 0.51 0.45 0.10 0.00 0.00 0.10
1 1.0 0.0 0.0 0.67 0.36 0.23 0.36 0.23 0.00
2 0.0 0.0 0.0 0.97 0.79 0.02 0.00 0.00 0.00
3 0.0 1.0 0.0 0.44 0.37 0.52 0.00 0.00 0.52
4 0.0 0.0 0.0 0.16 0.02 0.94 0.00 0.00 0.00
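The row-wise product also extends to interactions of more than two columns. The sketch below uses a small made-up frame (not the random data from the question) and a hypothetical three-way combination, and builds all new columns in one step before attaching them.

```python
import pandas as pd

# Small illustrative frame; the values are made up for this sketch.
df = pd.DataFrame({
    'a': [0.0, 1.0, 1.0],
    'b': [1.0, 0.0, 1.0],
    'e': [0.45, 0.36, 0.5],
    'f': [0.10, 0.23, 0.2],
})

# The last entry is a hypothetical three-way interaction.
combinations = [['a', 'e'], ['b', 'f'], ['a', 'b', 'f']]

# One product per combination, named by joining the column names with a space.
interactions = pd.DataFrame({
    ' '.join(comb): df[comb].prod(axis=1) for comb in combinations
})

# Attach all interaction columns at once instead of one assignment per column.
result = pd.concat([df, interactions], axis=1)
```

Building the interaction frame first and concatenating once avoids repeated column insertion into `df`, which may matter when the list of combinations is long.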
I have two dataframes, df1 and df2.
One holds the clients' debts, the other the clients' payments with dates.
I want to create a new dataframe with the percentage of the debt paid in the month of each payment, up to 01-2017.
import pandas as pd
d1 = {'client number': ['2', '2','3','6','7','7','8','8','8','8','8','8','8','8'],
'month': [1, 2, 3,1,10,12,3,5,8,1,2,4,5,8],
'year':[2013,2013,2013,2019,2013,2013,2013,2013,2013,2014,2014,2015,2016,2017],
'payment' :[100,100,200,10000,200,100,300,500,200,100,200,200,500,50]}
df1 = pd.DataFrame(data=d1).set_index('client number')
df1
d2 = {'client number': ['2','3','6','7','8'],
'debt': [200, 600,10000,300,3000]}
df2 = pd.DataFrame(data=d2)
x=[1,2,3,4,5,6,7,8,9,10]
y=[2013,2014,2015,2016,2017]
for x in month and y in year
if df1['month']=x and df1['year']=year :
df2[month&year] = df1['payment']/df2['debt']
the result needs to be something like this for all the clients
what am I missing?
thank you for your time and help
First, set the index of both dataframes df1 and df2 to client number. Then use Index.map to map the client numbers in df1 to their corresponding debts from df2, and use Series.div to divide each client's payments by their debt, which gives the fraction of the debt paid. Next, create a new column date in df1 from the month and year columns. Finally, use DataFrame.join along with DataFrame.pivot_table:
# skip this line if df1 is already indexed by 'client number', as in the question's setup
df1 = df1.set_index('client number')
df2 = df2.set_index('client number')
df1['pct'] = df1['payment'].div(df1.index.map(df2['debt'])).round(2)
df1['date'] = df1['year'].astype(str) + '-' + df1['month'].astype(str).str.zfill(2)
df3 = (
df2.join(
df1.pivot_table(index=df1.index, columns='date', values='pct', aggfunc='sum').fillna(0))
.reset_index()
)
Result:
# print(df3)
client number debt 2013-01 2013-02 2013-03 2013-05 2013-08 ... 2013-12 2014-01 2014-02 2015-04 2016-05 2017-08 2019-01
0 2 200 0.5 0.5 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.0
1 3 600 0.0 0.0 0.33 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.0
2 6 10000 0.0 0.0 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 1.0
3 7 300 0.0 0.0 0.00 0.00 0.00 ... 0.33 0.00 0.00 0.00 0.00 0.00 0.0
4 8 3000 0.0 0.0 0.10 0.17 0.07 ... 0.00 0.03 0.07 0.07 0.17 0.02 0.0
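The question asks for payments only up to 01-2017, which the result above does not filter for. A minimal sketch of one way to add that cutoff follows; it uses a reduced subset of the question's data, assumes "until 01-2017" means up to and including January 2017, and the names `payments`, `debts`, `kept` and `wide` are illustrative.

```python
import pandas as pd

payments = pd.DataFrame({
    'client number': ['2', '2', '8', '8', '6'],
    'month': [1, 2, 3, 8, 1],
    'year': [2013, 2013, 2013, 2017, 2019],
    'payment': [100, 100, 300, 50, 10000],
})
debts = pd.DataFrame({
    'client number': ['2', '6', '8'],
    'debt': [200, 10000, 3000],
}).set_index('client number')

# Build a real date (first day of the month) so the cutoff is a date comparison.
payments['date'] = pd.to_datetime(payments[['year', 'month']].assign(day=1))

# Assumption: the cutoff is inclusive of January 2017.
kept = payments[payments['date'] <= pd.Timestamp('2017-01-01')].copy()

# Fraction of each client's debt covered by each payment.
kept['pct'] = kept['payment'] / kept['client number'].map(debts['debt'])
kept['period'] = kept['date'].dt.strftime('%Y-%m')

# One column per month, one row per client.
wide = kept.pivot_table(index='client number', columns='period',
                        values='pct', aggfunc='sum', fill_value=0)

# Clients with no payment before the cutoff get zeros after the join.
result = debts.join(wide).fillna(0)
```

Comparing dates rather than the 'YYYY-MM' strings keeps the cutoff logic explicit, and joining onto `debts` preserves clients whose payments all fall after the cutoff.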
I have a dataframe like this,
ds 0 1 2 4 5 6
0 1991Q3 nan nan nan nan 1.0 nan
1 2014Q2 1.0 3.0 nan nan 1.0 nan
2 2014Q3 1.0 nan nan 1.0 4.0 nan
3 2014Q4 nan nan nan 2.0 3.0 nan
4 2015Q1 nan 1.0 2.0 4.0 4.0 nan
I would like the proportions for each column 0-6 like this,
ds 0 1 2 4 5 6
0 1991Q3 0.00 0.00 0.00 0.00 1.00 0.00
1 2014Q2 0.20 0.60 0.00 0.00 0.20 0.00
2 2014Q3 0.16 0.00 0.00 0.16 0.67 0.00
3 2014Q4 0.00 0.00 0.00 0.40 0.60 0.00
4 2015Q1 0.00 0.09 0.18 0.36 0.36 0.00
Is there a pandas way to do this? Any suggestions would be great.
You can do this:
df = df.replace(np.nan, 0)
df = df.set_index('ds')
In [3194]: df.div(df.sum(1),0).reset_index()
Out[3194]:
ds 0 1 2 4 5 6
0 1991Q3 0.00 0.00 0.00 0.00 1.00 0.00
1 2014Q2 0.20 0.60 0.00 0.00 0.20 0.00
2 2014Q3 0.17 0.00 0.00 0.17 0.67 0.00
3 2014Q4 0.00 0.00 0.00 0.40 0.60 0.00
4 2015Q1 0.00 0.09 0.18 0.36 0.36 0.00
OR you can use df.apply:
In [3196]: df = df.replace(np.nan, 0)
In [3197]: df.iloc[:,1:] = df.iloc[:,1:].apply(lambda x: x/x.sum(), axis=1)
In [3198]: df
Out[3198]:
ds 0 1 2 4 5 6
0 1991Q3 0.00 0.00 0.00 0.00 1.00 0.00
1 2014Q2 0.20 0.60 0.00 0.00 0.20 0.00
2 2014Q3 0.17 0.00 0.00 0.17 0.67 0.00
3 2014Q4 0.00 0.00 0.00 0.40 0.60 0.00
4 2015Q1 0.00 0.09 0.18 0.36 0.36 0.00
Set the first column as the index, fill the null entries with 0, get the sum of each row, and divide the dataframe by those sums:
res = df.set_index("ds")
res.fillna(0).div(res.sum(1),axis=0)
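One edge case worth keeping in mind: a row whose entries are all null sums to 0, so the division yields NaN for that row. The sketch below uses a reduced version of the data plus a hypothetical all-null quarter (2099Q1, not in the original) and fills the result afterwards.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ds': ['1991Q3', '2014Q2', '2099Q1'],   # 2099Q1 is a made-up all-null row
    0: [np.nan, 1.0, np.nan],
    1: [np.nan, 3.0, np.nan],
    5: [1.0, 1.0, np.nan],
})

counts = df.set_index('ds').fillna(0)

# 0 / 0 gives NaN for the all-null row, so fill again after dividing.
props = counts.div(counts.sum(axis=1), axis=0).fillna(0)
```

Whether an all-null row should become zeros or stay NaN depends on how the proportions are used downstream; the second `fillna(0)` is a choice, not a requirement.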
I'm trying to come up with a way to return the column names for the 3 highest values in each row of the table below. So far I've been able to return the highest value using idxmax but I haven't been able to figure out how to get the 2nd and 3rd highest.
Clust Stat1 Stat2 Stat3 Stat4 Stat5 Stat6
0 9 0.00 0.15 0.06 0.11 0.23 0.01
1 4 0.00 0.25 0.04 0.10 0.10 0.00
2 11 0.00 0.34 0.00 0.09 0.24 0.00
3 12 0.00 0.16 0.00 0.11 0.00 0.00
4 0 0.00 0.35 0.00 0.04 0.02 0.00
5 17 0.01 0.21 0.02 0.18 0.27 0.01
Expected output:
Clust Stat1 Stat2 Stat3 Stat4 Stat5 Stat6 TopThree
0 9 0.00 0.15 0.06 0.11 0.23 0.01 [Stat5,Stat2,Stat4]
1 4 0.00 0.25 0.04 0.10 0.10 0.00 [Stat2,Stat4,Stat5]
2 11 0.00 0.34 0.00 0.09 0.24 0.00 [Stat2,Stat5,Stat4]
3 12 0.00 0.16 0.00 0.19 0.00 0.01 [Stat4,Stat2,Stat6]
4 0 0.00 0.35 0.00 0.04 0.02 0.00 [Stat2,Stat4,Stat5]
5 17 0.01 0.21 0.02 0.18 0.27 0.01 [Stat5,Stat2,Stat4]
If anyone has ideas on how to do this I'd appreciate it.
Use numpy.argsort to get the positions of the sorted values, selecting all columns except the first:
a = df.iloc[:, 1:].to_numpy()
df['TopThree'] = df.columns[1:].to_numpy()[np.argsort(-a, axis=1)[:, :3]].tolist()
print (df)
Clust Stat1 Stat2 Stat3 Stat4 Stat5 Stat6 TopThree
0 9 0.00 0.15 0.06 0.11 0.23 0.01 [Stat5, Stat2, Stat4]
1 4 0.00 0.25 0.04 0.10 0.10 0.00 [Stat2, Stat4, Stat5]
2 11 0.00 0.34 0.00 0.09 0.24 0.00 [Stat2, Stat5, Stat4]
3 12 0.00 0.16 0.00 0.11 0.00 0.00 [Stat2, Stat4, Stat1]
4 0 0.00 0.35 0.00 0.04 0.02 0.00 [Stat2, Stat4, Stat5]
5 17 0.01 0.21 0.02 0.18 0.27 0.01 [Stat5, Stat2, Stat4]
If performance is not important:
df['TopThree'] = df.iloc[:, 1:].apply(lambda x: x.nlargest(3).index.tolist(), axis=1)
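Rows with tied values (such as the zeros in row 3 above) raise the question of which column wins. The following sketch, on two rows of the example data, passes `kind='stable'` to `numpy.argsort` so that ties are broken by original column order; treating that tie-breaking rule as the desired one is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Clust': [9, 12],
    'Stat1': [0.00, 0.00],
    'Stat2': [0.15, 0.16],
    'Stat3': [0.06, 0.00],
    'Stat4': [0.11, 0.11],
    'Stat5': [0.23, 0.00],
    'Stat6': [0.01, 0.00],
})

stat_cols = df.columns[1:]
a = df[stat_cols].to_numpy()

# Negate for descending order; a stable sort keeps tied columns in
# their original left-to-right order.
order = np.argsort(-a, axis=1, kind='stable')[:, :3]

# Wrap in a Series so each cell holds one list of three column names.
df['TopThree'] = pd.Series(stat_cols.to_numpy()[order].tolist(), index=df.index)
```

With the default sort kind, the order among tied values is not guaranteed, so results for rows with repeated values could differ between runs or NumPy versions.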
I have calculated a row of totals by day of the week and appended it to the bottom of the totalpt dataframe.
I have set the index.name for the totalpt dataframe to None.
However, while the dataframe displays the default 0, 1, 2, 3 index, it does not show the usual empty cell in the top left, directly above the index.
How can I make this cell empty in the dataframe?
total_bill tip sex smoker day time size tip_pct
0 16.54 1.01 F N Sun D 2 0.061884
1 12.54 1.40 F N Mon D 2 0.111643
2 10.34 3.50 M Y Tue L 4 0.338491
3 20.25 2.50 M Y Wed D 2 0.123457
4 16.54 1.01 M Y Thu D 1 0.061064
5 12.54 1.40 F N Fri L 2 0.111643
6 10.34 3.50 F Y Sat D 3 0.338491
7 23.25 2.10 M Y Sun B 3 0.090323
pivot = tips.pivot_table('total_bill', index=['sex', 'size'],columns=['day'],aggfunc='sum').fillna(0)
print(pivot)
day Fri Mon Sat Sun Thu Tue Wed
sex size
F 2 12.54 12.54 0.00 16.54 0.00 0.00 0.00
3 0.00 0.00 10.34 0.00 0.00 0.00 0.00
M 1 0.00 0.00 0.00 0.00 16.54 0.00 0.00
2 0.00 0.00 0.00 0.00 0.00 0.00 20.25
3 0.00 0.00 0.00 23.25 0.00 0.00 0.00
4 0.00 0.00 0.00 0.00 0.00 10.34 0.00
totals_row = tips.pivot_table('total_bill',columns=['day'],aggfunc='sum').fillna(0).astype('float')
totalpt = pivot.reset_index('sex').reset_index('size')
totalpt.index.name = None
totalpt = totalpt[['Fri', 'Mon','Sat', 'Sun', 'Thu', 'Tue', 'Wed']]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job here
totalpt = pd.concat([totalpt, totals_row])
print(totalpt)
**day** Fri Mon Sat Sun Thu Tue Wed #problem text day
0 12.54 12.54 0.00 16.54 0.00 0.00 0.00
1 0.00 0.00 10.34 0.00 0.00 0.00 0.00
2 0.00 0.00 0.00 0.00 16.54 0.00 0.00
3 0.00 0.00 0.00 0.00 0.00 0.00 20.25
4 0.00 0.00 0.00 23.25 0.00 0.00 0.00
5 0.00 0.00 0.00 0.00 0.00 10.34 0.00
total_bill 12.54 12.54 10.34 39.79 16.54 10.34 20.25
That's the columns' name.
In [11]: df = pd.DataFrame([[1, 2]], columns=['A', 'B'])
In [12]: df
Out[12]:
A B
0 1 2
In [13]: df.columns.name = 'XX'
In [14]: df
Out[14]:
XX A B
0 1 2
You can set it to None to clear it.
In [15]: df.columns.name = None
In [16]: df
Out[16]:
A B
0 1 2
An alternative, if you wanted to keep it, is to give the index a name:
In [21]: df.columns.name = "XX"
In [22]: df.index.name = "index"
In [23]: df
Out[23]:
XX A B
index
0 1 2
You can use rename_axis (available since pandas 0.17.0):
In [3939]: df
Out[3939]:
XX A B
0 1 2
In [3940]: df.rename_axis(None, axis=1)
Out[3940]:
A B
0 1 2
In [3942]: df = df.rename_axis(None, axis=1)
In [3943]: df
Out[3943]:
A B
0 1 2
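To tie this back to the pivot table in the question, here is a minimal sketch on a reduced, made-up version of the tips data. It illustrates that `pivot_table` names the columns axis after the `columns` argument, which is where the stray "day" label comes from, and that clearing the name removes it; the variable names are illustrative.

```python
import pandas as pd

tips = pd.DataFrame({
    'total_bill': [16.54, 12.54, 10.34],
    'sex': ['F', 'F', 'M'],
    'day': ['Sun', 'Mon', 'Tue'],
})

pivot = tips.pivot_table('total_bill', index='sex', columns='day',
                         aggfunc='sum').fillna(0)

# pivot_table labels the columns axis with the name of the pivoted column.
name_before = pivot.columns.name

# Drop the row index and clear the columns' name in one chain.
flat = pivot.reset_index(drop=True).rename_axis(None, axis=1)
name_after = flat.columns.name
```

Setting `pivot.columns.name = None` directly has the same effect; `rename_axis` is convenient when the cleanup is part of a method chain.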