Given:
df = pd.DataFrame({"panum": ["PA1", "PA1", "PA1", "PA2", "PA2", "PA2"],
                   "which": ["A", "A", "A", "B", "B", "B"],
                   "score": [88, 80, 90, 92, 95, 99]})
df.set_index(['panum', 'which'], inplace=True)
df
             score
panum which
PA1   A         88
      A         80
      A         90
PA2   B         92
      B         95
      B         99
Is it possible to write something that would create a new index entry in 'which' called 'max', holding the maximum for that level, so it would create two new rows, (PA1, max) and (PA2, max)?
Update
I have corrected the indexes. The example above is not what I meant.
              score
panum factor
PA1   init       90
      resub      94
      final      93
PA2   init       60
      resub      90
      final      88
And my question, in this better scenario, would be: "I want to create a new panum called mean, which would have three rows: (mean, init), (mean, resub), (mean, final)."
Pseudocode would be something like df['mean'] = (df['pa1'] + df['pa2']) / 2
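For intuition, that pseudocode roughly corresponds to unstacking the panum level into columns and averaging across them (a sketch, assuming the corrected data from the update):

```python
import pandas as pd

df = pd.DataFrame({"panum": ["PA1"] * 3 + ["PA2"] * 3,
                   "factor": ["init", "resub", "final"] * 2,
                   "score": [90, 94, 93, 60, 90, 88]})
df = df.set_index(["panum", "factor"])

# Pivot panum into columns (PA1, PA2), then average across them row-wise
wide = df["score"].unstack("panum")
mean_by_factor = wide.mean(axis=1)
print(mean_by_factor)
```

This gives one mean per factor; the answers below show how to append those means back as new rows of the original frame.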
I know this is a different question!
You can create a new DataFrame of max values with groupby, add a second level max, concatenate it to the original and finally sort_index:
m = df.groupby(level=0).max().assign(max='max').set_index('max', append=True)
print(m)
           score
panum max
PA1   max     90
PA2   max     99
df = pd.concat([df, m]).sort_index()
print(df)
             score
panum which
PA1   A         88
      A         80
      A         90
      max       90
PA2   B         92
      B         95
      B         99
      max       99
EDIT: for the updated question, the solution is changed to take the mean over the second level, with swaplevel so the result aligns correctly with the final DataFrame:
df = pd.DataFrame({"panum": ["PA1", "PA1", "PA1", "PA2", "PA2", "PA2"],
                   "factor": ["init", "resub", "final"] * 2,
                   "score": [90, 94, 93, 60, 90, 88]})
df.set_index(['panum', 'factor'], inplace=True)
print(df)
              score
panum factor
PA1   init       90
      resub      94
      final      93
PA2   init       60
      resub      90
      final      88
m = (df.groupby(level='factor', sort=False).mean()
       .assign(panum='mean')
       .set_index('panum', append=True)
       .swaplevel(0, 1))
print(m)
              score
panum factor
mean  init     75.0
      resub    92.0
      final    90.5
df = pd.concat([df, m])
print(df)
              score
panum factor
PA1   init     90.0
      resub    94.0
      final    93.0
PA2   init     60.0
      resub    90.0
      final    88.0
mean  init     75.0
      resub    92.0
      final    90.5
Append a max row per group as we go with pd.concat:
pd.concat([
    pd.concat([d, pd.DataFrame([d.max()],
                               index=pd.MultiIndex.from_tuples([(n, 'max')],
                                                               names=d.index.names))])
    for n, d in df.groupby('panum')
])
             score
panum which
PA1   A         88
      A         80
      A         90
      max       90
PA2   B         92
      B         95
      B         99
      max       99
Related
If I have a dataframe and I want to sum the values of the columns, I could do something like:
import pandas as pd

studentdetails = {
    "studentname": ["Ram", "Sam", "Scott", "Ann", "John", "Bobo"],
    "mathantics": [80, 90, 85, 70, 95, 100],
    "science": [85, 95, 80, 90, 75, 100],
    "english": [90, 85, 80, 70, 95, 100]
}
index_labels = ['r1', 'r2', 'r3', 'r4', 'r5', 'r6']
df = pd.DataFrame(studentdetails, index=index_labels)
print(df)
df3 = df.sum()
print(df3)
col_list= ['studentname', 'mathantics', 'science']
print( df[col_list].sum())
How can I do something similar, but instead of getting only the sum, get the sum of the absolute values of some columns (which in this particular case would be the same)?
I tried abs in several ways but it did not work.
Edit:
   studentname  mathantics  science  english
r1         Ram          80       85       90
r2         Sam          90       95      -85
r3       Scott         -85       80       80
r4         Ann          70       90       70
r5        John          95      -75       95
r6        Bobo         100      100      100
Expected output
mathantics 520
science 525
english 520
Edit2:
The col_list cannot include string-valued columns.
You need numeric columns for absolute values:
col_list = df.columns.difference(['studentname'])
df[col_list].abs().sum()

df.set_index('studentname').abs().sum()

import numpy as np
df.select_dtypes(np.number).abs().sum()
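With the edited data (which contains negatives), any of these gives the expected totals; a quick check, assuming the values from the table above:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "studentname": ["Ram", "Sam", "Scott", "Ann", "John", "Bobo"],
    "mathantics": [80, 90, -85, 70, 95, 100],
    "science": [85, 95, 80, 90, -75, 100],
    "english": [90, -85, 80, 70, 95, 100],
}, index=["r1", "r2", "r3", "r4", "r5", "r6"])

# Keep only numeric columns, take absolute values, then sum each column
totals = df.select_dtypes(np.number).abs().sum()
print(totals)  # mathantics 520, science 525, english 520
```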
I have a dataframe as follows:
year,value
1970,2.0729729191557147
1971,1.0184197388632872
1972,2.574009084167593
1973,1.4986879160266255
1974,3.0246498975934464
1975,1.7876222478238608
1976,2.5631745148930913
1977,2.444014336917563
1978,2.619502688172043
1979,2.268273809523809
1980,2.6086169818316645
1981,0.8452720174091145
1982,1.3158922171018947
1983,-0.12695212493599603
1984,1.4374230626622169
1985,2.389290834613415
1986,2.3489311315924217
1987,2.6002265745007676
1988,1.2623717711036955
1989,1.1793426779313878
I would like to subtract a constant from each of the values in the second column. This is the code I have tried:
df = pd.read_csv(f1, sep=",", header=0)
df2 = df["value"].subtract(1)
However, when I do this, df2 becomes:
70 1.072973
71 0.018420
72 1.574009
73 0.498688
74 2.024650
75 0.787622
76 1.563175
77 1.444014
78 1.619503
79 1.268274
80 1.608617
81 -0.154728
82 0.315892
83 -1.126952
84 0.437423
85 1.389291
86 1.348931
87 1.600227
88 0.262372
89 0.179343
The year becomes only the last two digits. How can I retain all of the digits of the year?
Column year is not modified; you only need to assign the subtracted values back:
df["value"] = df["value"].subtract(1)
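A minimal check that the year column survives the in-place assignment (hypothetical data, since the original file is not available):

```python
import pandas as pd

df = pd.DataFrame({"year": [1970, 1971, 1972],
                   "value": [2.07, 1.02, 2.57]})

# Assign the shifted values back to the same column; year is untouched
df["value"] = df["value"].subtract(1)
print(df)
```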
I have a dataframe as follows,
import pandas as pd
df = pd.DataFrame({'value': [54, 74, 71, 78, 12]})
Expected output,
value score
54 scaled value
74 scaled value
71 scaled value
78 50.000
12 600.00
I want to assign a score between 50 and 600 to each value, but the lowest value must get the highest score. Do you have an idea?
Not sure what you want to achieve; maybe you could provide the exact expected output for this input.
But if I understand correctly, you could try:
import pandas as pd
df = pd.DataFrame({'value': [54, 74, 71, 78, 12]})
vmin = df['value'].min()
vmax = df['value'].max()
step = 550 / (vmax - vmin)
df['score'] = 600 - (df['value'] - vmin) * step
print(df)
This will output
value score
0 54 250.000000
1 74 83.333333
2 71 108.333333
3 78 50.000000
4 12 600.000000
This is my idea, but I think your scores follow a scale that is missing from your question.
import numpy as np

dfmin = df['value'].min()
dfmax = df['value'].max()
dfrange = dfmax - dfmin
score_value = (600 - 50) / dfrange
df.loc[:, 'score'] = np.where(df['value'] == dfmin, 600,
                     np.where(df['value'] == dfmax, 50,
                              600 - (df['value'] - dfmin) * (1 / score_value)))
df
that produces:
value score
0 54 594.96
1 74 592.56
2 71 592.92
3 78 50.00
4 12 600.00
Not matching your output, because of the missing scale.
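For reference, the inverted min-max scaling from the first answer can be written compactly (a sketch; the 50-600 target range comes from the question):

```python
import pandas as pd

df = pd.DataFrame({"value": [54, 74, 71, 78, 12]})

lo, hi = df["value"].min(), df["value"].max()
# Map the smallest value to 600 and the largest to 50, linearly in between
df["score"] = 600 - (df["value"] - lo) / (hi - lo) * (600 - 50)
print(df)
```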
I have a dataframe like so:
Class price demand
1 22 8
1 60 7
3 32 14
2 72 9
4 45 20
5 42 25
What I'd like to do is group classes 1-3 into one category and classes 4-5 into another. Then I'd like to get the sum of price and the sum of demand for each category, as well as the mean. The result should look something like this:
Class TotalPrice TotalDemand AveragePrice AverageDemand
P 186 38 46.5 9.5
E 87 45 43.5 22.5
Where P is classes 1-3 and E is classes 4-5. How can I group by categories in pandas? Is there a way to do this?
In [8]: df.groupby(np.where(df['Class'].isin([1, 2, 3]), 'P', 'E'))[['price', 'demand']].agg(['sum', 'mean'])
Out[8]:
price demand
sum mean sum mean
E 87 43.5 45 22.5
P 186 46.5 38 9.5
You can create a dictionary that defines your groups.
mapping = {**dict.fromkeys([1, 2, 3], 'P'), **dict.fromkeys([4, 5], 'E')}
Then if you pass a dictionary or callable to groupby, it automatically gets mapped onto the index. So let's set the index to Class:
d = df.set_index('Class').groupby(mapping).agg(['sum', 'mean']).sort_index(axis=1, level=1)
Finally, we do some tweaking to get column names the way you specified.
rename_dict = {'sum': 'Total', 'mean': 'Average'}
d.columns = d.columns.map(lambda c: f"{rename_dict[c[1]]}{c[0].title()}")
d.rename_axis('Class').reset_index()
Class TotalPrice TotalDemand AveragePrice AverageDemand
0 E 87 45 43.5 22.5
1 P 186 38 46.5 9.5
In general, you can form arbitrary bins to group your data using pd.cut, specifying the right bin edges:
import pandas as pd
pd.cut(df.Class, bins=[0, 3, 5], labels=['P', 'E'])
#0 P
#1 P
#2 P
#3 P
#4 E
#5 E
df2 = (df.groupby(pd.cut(df.Class, bins=[0, 3, 5], labels=['P', 'E']))[['demand', 'price']]
         .agg(['sum', 'mean']).reset_index())
# Get rid of the multi-level columns
df2.columns = [f'{i}_{j}' if j != '' else f'{i}' for i,j in df2.columns]
Output:
Class demand_sum demand_mean price_sum price_mean
0 P 38 9.5 186 46.5
1 E 45 22.5 87 43.5
I have a df that looks like:
import pandas as pd
import numpy as np

d = {'Hours': np.arange(12, 97, 12),
     'Average': np.random.random(8),
     'Count': [500, 250, 125, 75, 60, 25, 5, 15]}
df = pd.DataFrame(d)
This df has a decreasing number of cases in each row. After the count drops below a certain threshold, I'd like to drop off the remainder, for example once a count below 10 is reached.
Starting:
Average Count Hours
0 0.560671 500 12
1 0.743811 250 24
2 0.953704 125 36
3 0.313850 75 48
4 0.640588 60 60
5 0.591149 25 72
6 0.302894 5 84
7 0.418912 15 96
Finished (row 6 and everything after it removed):
Average Count Hours
0 0.560671 500 12
1 0.743811 250 24
2 0.953704 125 36
3 0.313850 75 48
4 0.640588 60 60
5 0.591149 25 72
We can use the first index label from the boolean mask and slice the df using iloc:
In [58]:
df.iloc[:df[df.Count < 10].index[0]]
Out[58]:
Average Count Hours
0 0.183016 500 12
1 0.046221 250 24
2 0.687945 125 36
3 0.387634 75 48
4 0.167491 60 60
5 0.660325 25 72
Just to break down what is happening here
In [54]:
# use a boolean mask to index into the df
df[df.Count < 10]
Out[54]:
Average Count Hours
6 0.244839 5 84
In [56]:
# we want the index and can subscript the first element using [0]
df[df.Count < 10].index
Out[56]:
Int64Index([6], dtype='int64')
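As a side note, `index[0]` raises an IndexError when no row falls below the threshold; a boolean cummax is one way to avoid that edge case (a hypothetical variant, not from the answer above):

```python
import pandas as pd

df = pd.DataFrame({"Hours": range(12, 97, 12),
                   "Average": [0.56, 0.74, 0.95, 0.31, 0.64, 0.59, 0.30, 0.42],
                   "Count": [500, 250, 125, 75, 60, 25, 5, 15]})

# cummax makes the mask True from the first sub-threshold row onward,
# so negating it keeps only the leading rows at or above the threshold
trimmed = df[~(df["Count"] < 10).cummax()]
print(trimmed)
```

If no Count is below 10, the mask is all False and the whole frame is kept, instead of raising.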