I have a DataFrame containing columns of numerical and non-numerical data. Here's a slice of it:
ATG12 Norm ATG5 Norm ATG7 Norm Cancer Stage
5.55 4.99 8.99 IIA
4.87 5.77 8.88 IIA
5.98 7.88 8.34 IIC
I want to group the data by Cancer Stage, take the mean of every numerical column, and produce a table that lists the means for each Cancer Stage, like this:
Cancer Stage ATG12 Mean ATG5 Mean ATG7 Mean
IIA 5.03 6.20 8.34
IIB 7.45 4.22 7.99
IIIA 5.32 3.85 6.68
I've figured out the groupby and mean() functions and can compute the means for one column at a time with:
AVG = data.groupby("Cancer Stage")['ATG12 Norm'].mean()
But that only gives me:
Cancer Stage
IIA 5.03
IIB 7.45
IIIA 5.32
Name: ATG12 Norm, dtype: float64
How can I apply this process to all the columns I want at once and produce a dataframe of it all? Sorry if this is a repeat; the pandas questions I've found that seem to be about related topics are all over my head.
Did you try
df.groupby('Cancer Stage').mean(numeric_only=True)
or
df.groupby('Cancer Stage')[['ATG12 Norm', 'ATG5 Norm']].mean()
Example data with extra text column:
import pandas as pd
from io import StringIO

data = '''ATG12 Norm  ATG5 Norm  ATG7 Norm  Cancer Stage  Text
5.55  4.99  8.99  IIA  ABC
4.87  5.77  8.88  IIA  ABC
5.98  7.88  8.34  IIC  ABC'''

# Column names contain single spaces, so split on runs of two or more spaces;
# the regex separator requires the python engine.
df = pd.read_csv(StringIO(data), sep=r'\s{2,}', engine='python')
print(df)
print(df.groupby('Cancer Stage')[['ATG12 Norm', 'ATG5 Norm']].mean())
print(df.groupby('Cancer Stage').mean(numeric_only=True))
result:
ATG12 Norm ATG5 Norm ATG7 Norm Cancer Stage Text
0 5.55 4.99 8.99 IIA ABC
1 4.87 5.77 8.88 IIA ABC
2 5.98 7.88 8.34 IIC ABC
ATG12 Norm ATG5 Norm
Cancer Stage
IIA 5.21 5.38
IIC 5.98 7.88
ATG12 Norm ATG5 Norm ATG7 Norm
Cancer Stage
IIA 5.21 5.38 8.935
IIC 5.98 7.88 8.340
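To get a frame shaped exactly like the desired table, rename the aggregated columns and turn the group labels back into an ordinary column with reset_index. A minimal sketch, continuing from the df above (the 'Mean' names are just the labels the question asked for):

means = df.groupby('Cancer Stage').mean(numeric_only=True).reset_index()
means.columns = ['Cancer Stage', 'ATG12 Mean', 'ATG5 Mean', 'ATG7 Mean']
print(means)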
I'm learning plotly line animation and came across this question.
My df:

         Date   1Mo   2Mo   3Mo   6Mo   1Yr   2Yr
0  2023-02-12  4.66  4.77  4.79  4.89  4.50  4.19
1  2023-02-11  4.66  4.77  4.77  4.90  4.88  4.49
2  2023-02-10  4.64  4.69  4.72  4.88  4.88  4.79
3  2023-02-09  4.62  4.68  4.71  4.82  4.88  4.89
4  2023-02-08  4.60  4.61  4.72  4.83  4.89  4.89
How do I animate this dataframe so that each frame has
x = [1Mo, 2Mo, 3Mo, 6Mo, 1Yr, 2Yr], and
y = the actual values on a date, e.g. y = df[df['Date'] == "2023-02-08"], with animation_frame = df['Date']?
I tried
plot = px.line(df, x=df.columns[1:], y=df['Date'], title="Treasury Yields", animation_frame=df_treasuries_yield['Date'])
No joy :(
I think the problem is you cannot pass multiple columns to the animation_frame parameter. But we can get around this by converting your df from wide to long format using pd.melt – for your data, we will want to take all of the values from [1Mo, 2Mo, 3Mo, 6Mo, 1Yr, 2Yr] and put them in a new column called "value", with a "variable" column telling us which original column each value came from.
df_long = pd.melt(df, id_vars=['Date'], value_vars=['1Mo', '2Mo', '3Mo', '6Mo', '1Yr', '2Yr'])
This will look like the following:
Date variable value
0 2023-02-12 1Mo 4.66
1 2023-02-11 1Mo 4.66
2 2023-02-10 1Mo 4.64
3 2023-02-09 1Mo 4.62
4 2023-02-08 1Mo 4.60
...
28 2023-02-09 2Yr 4.89
29 2023-02-08 2Yr 4.89
Now we can pass the argument animation_frame='Date' to px.line:
fig = px.line(df_long, x="variable", y="value", animation_frame="Date", title="Yields")
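One practical follow-up: animated figures are easier to read when the y-axis does not rescale between frames. A small sketch, assuming yields stay roughly in the 4-5 range shown above:

fig = px.line(df_long, x="variable", y="value", animation_frame="Date", title="Yields")
fig.update_yaxes(range=[4.0, 5.0])  # pin the axis so frames are directly comparable
fig.show()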
I have the following dataset as a CSV file.
Dataset ecoli.csv:
seq_name,mcg,gvh,lip,chg,aac,alm1,alm2,class
AAT_ECOLI,0.49,0.29,0.48,0.50,0.56,0.24,0.35,cp
ACEA_ECOLI,0.07,0.40,0.48,0.50,0.54,0.35,0.44,cp
(more entries...)
ACKA_ECOLI,0.59,0.49,0.48,0.50,0.52,0.45,0.36,cp
ADI_ECOLI,0.23,0.32,0.48,0.50,0.55,0.25,0.35,cp
My purpose for this dataset is to apply some classification algorithms. To prepare ecoli.csv, I want to move the class column to the front while dropping the seq_name column. Then I print a check for null values. Afterwards I plot a correlation heatmap with the sns library.
Code before error:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

column_drop = 'seq_name'
dataframe = pd.read_csv('ecoli.csv', header='infer')
dataframe.drop(column_drop, axis=1, inplace=True)  # Dropping the column I don't need
print(dataframe.isnull().sum())

plt.figure(figsize=(10, 8))
sns.heatmap(dataframe.corr(numeric_only=True), annot=True)  # class is non-numeric
plt.show()
Before the encoding, and the error I'm facing, I group the values of the dataset based on class. Finally, I try to encode the dataset with LabelEncoder, but an error appears:
Error code:
result = dataframe.groupby(by=("class")).sum().reset_index()
print(result)
le = preprocessing.LabelEncoder()
dataframe.result = le.fit_transform(dataframe.result)
print(result)
Error:
AttributeError: 'DataFrame' object has no attribute 'result'
Update: result is filled with the following:
class mcg gvh lip chg aac alm1 alm2
0 cp 51.99 58.59 68.64 71.5 64.99 44.71 56.52
1 im 36.84 38.24 37.48 38.5 41.28 58.33 56.24
2 imL 1.45 0.94 2.00 1.5 0.91 1.29 1.14
3 imS 1.48 1.02 0.96 1.0 1.07 1.28 1.14
4 imU 25.41 16.06 17.32 17.5 19.56 26.04 26.18
5 om 13.45 14.20 10.12 10.0 14.78 9.25 6.11
6 omL 3.49 2.56 5.00 2.5 2.71 2.82 1.11
7 pp 33.91 36.39 24.96 26.0 22.71 24.34 19.47
Desired output: the same table, but with the class labels encoded as integers.
Any thoughts?
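A hedged sketch of the likely fix: dataframe.result is attribute access for a column named result, which exists on neither dataframe nor result, hence the AttributeError. If the goal is to encode the class labels, fit the encoder on that column directly (this assumes the sklearn preprocessing import from the snippet above):

le = preprocessing.LabelEncoder()
result['class'] = le.fit_transform(result['class'])
print(result)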
I am creating a script that updates retail prices based on supplier cost changes.
I have successfully created a script that brings in external supplier data, matches it to internal data, outputs the changes, and passes these into an API to update our ERP and into Sheets so we can visualise the changes. My final task is to work out retail price changes, but I can't figure out the best way to use Pandas for this problem.
df1 (priceChange):
Cat Nr Net Cost Status
2801 0825646183913 8.50 ACTIVE
2802 0603497902941 7.96 ACTIVE
2803 0603497897452 9.35 ACTIVE
2804 4050538324761 14.45 ACTIVE
2805 4050538307429 10.20 ACTIVE
df2 (priceGrid):
Cost (low) Cost (upp) Retail
0 2.00 3.30 5.99
1 3.31 5.00 8.99
2 5.01 6.15 10.99
3 6.16 7.15 12.99
4 7.16 8.15 14.99
5 8.16 9.25 16.99
6 9.26 10.75 18.99
7 10.76 11.50 20.99
8 11.51 12.75 22.99
9 12.76 13.75 24.99
10 13.76 14.75 26.99
So I want to create df1['Retail'] by comparing df1['Net Cost'] against df2['Cost (low)'] and df2['Cost (upp)'] and returning the matching df2['Retail'].
For example, line 2801 has 'Net Cost' == 8.50, so it should return a 'Retail' of 16.99.
df1 would look like:
Cat Nr Net Cost Status Retail
2801 0825646183913 8.50 ACTIVE 16.99
2802 0603497902941 7.96 ACTIVE 14.99
2803 0603497897452 9.35 ACTIVE 18.99
2804 4050538324761 14.45 ACTIVE 26.99
2805 4050538307429 10.20 ACTIVE 18.99
You can use pandas.merge_asof for this.
A requirement of this method, however, is that the keys on the left frame must be sorted; hence the need for .reset_index, .sort_values and then .set_index, .sort_index in the example below:
df_merged = (pd.merge_asof(df1.reset_index().sort_values('Net Cost'),
                           df2[['Cost (low)', 'Retail']],
                           left_on='Net Cost',
                           right_on='Cost (low)')
             .set_index('index')
             .sort_index()
             .drop('Cost (low)', axis=1))
print(df_merged)
print(df_merged)
Cat Nr Net Cost Status Retail
index
2801 825646183913 8.50 ACTIVE 16.99
2802 603497902941 7.96 ACTIVE 14.99
2803 603497897452 9.35 ACTIVE 18.99
2804 4050538324761 14.45 ACTIVE 26.99
2805 4050538307429 10.20 ACTIVE 18.99
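A note on why dropping 'Cost (upp)' still yields the right brackets: merge_asof defaults to direction='backward', so each Net Cost is matched against the last row whose Cost (low) is less than or equal to it. The same call with the default spelled out:

pd.merge_asof(df1.reset_index().sort_values('Net Cost'),
              df2[['Cost (low)', 'Retail']],
              left_on='Net Cost', right_on='Cost (low)',
              direction='backward')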
Another approach you could choose is to create the cartesian product and filter the rows you are interested in. You would not need to sort the data twice (which can be costly), but you might need more memory.
# pandas 1.2+ can also do this directly with pd.merge(df1, df2, how='cross')
cartesian_product = pd.merge(df1.assign(key=0), df2.assign(key=0), how='outer').drop('key', axis=1)
mask = ((cartesian_product['Net Cost'] >= cartesian_product['Cost (low)'])
        & (cartesian_product['Net Cost'] < cartesian_product['Cost (upp)']))
cartesian_product[mask]
Cat Nr Net Cost Status Cost (low) Cost (upp) Retail
5 2801 825646183913 8.50 ACTIVE 8.16 9.25 16.99
15 2802 603497902941 7.96 ACTIVE 7.16 8.15 14.99
28 2803 603497897452 9.35 ACTIVE 9.26 10.75 18.99
43 2804 4050538324761 14.45 ACTIVE 13.76 14.75 26.99
50 2805 4050538307429 10.20 ACTIVE 9.26 10.75 18.99
Of course you can filter the dataframe accordingly.
Btw: does anybody have a hint on how to work properly with column names that contain spaces? read_clipboard mixes a lot of things up ;)
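A third option, sketched here under the assumption that the brackets in df2 are sorted and non-overlapping, is pd.cut, which is built for exactly this kind of bracket lookup; the bin edges are the Cost (low) values plus the final Cost (upp):

bins = list(df2['Cost (low)']) + [df2['Cost (upp)'].iloc[-1]]
df1['Retail'] = pd.cut(df1['Net Cost'], bins=bins,
                       labels=df2['Retail'], include_lowest=True)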
I have a dataframe, df, in which I am attempting to fill in values within the empty 'Set' column, depending on a condition. The condition is as follows: the value of the 'Set' column needs to be 'IN' whenever the 'valence_median_split' column's value is 'Low_Valence' in the corresponding row, and 'OUT' in all other cases.
Please see below for an example of my attempt to solve this:
df.head()
Out[65]:
ID Category Num Vert_Horizon Description Fem_Valence_Mean \
0 Animals_001_h Animals 1 h Dead Stork 2.40
1 Animals_002_v Animals 2 v Lion 6.31
2 Animals_003_h Animals 3 h Snake 5.14
3 Animals_004_v Animals 4 v Wolf 4.55
4 Animals_005_h Animals 5 h Bat 5.29
Fem_Valence_SD Fem_Av/Ap_Mean Fem_Av/Ap_SD Arousal_Mean ... Contrast \
0 1.30 3.03 1.47 6.72 ... 68.45
1 2.19 5.96 2.24 6.69 ... 32.34
2 1.19 5.14 1.75 5.34 ... 59.92
3 1.87 4.82 2.27 6.84 ... 75.10
4 1.56 4.61 1.81 5.50 ... 59.77
JPEG_size80 LABL LABA LABB Entropy Classification \
0 263028 51.75 -0.39 16.93 7.86
1 250208 52.39 10.63 30.30 6.71
2 190887 55.45 0.25 4.41 7.83
3 282350 49.84 3.82 1.36 7.69
4 329325 54.26 -0.34 -0.95 7.82
valence_median_split temp_selection set
0 Low_Valence Animals_001_h
1 High_Valence NaN
2 Low_Valence Animals_003_h
3 Low_Valence Animals_004_v
4 Low_Valence Animals_005_h
[5 rows x 36 columns]
df['set'] = np.where(df.loc[df['valence_median_split'] == 'Low_Valence'], 'IN', 'OUT')
ValueError: Length of values does not match length of index
I can accomplish this by using loc to separate the df into two different dfs, but I'm wondering if there is a more elegant solution using np.where or a similar approach.
Change it to
df['set'] = np.where(df['valence_median_split'] == 'Low_Valence', 'IN', 'OUT')
If you need .loc:
df.loc[df['valence_median_split'] == 'Low_Valence', 'set'] = 'IN'
df.loc[df['valence_median_split'] != 'Low_Valence', 'set'] = 'OUT'
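The reason the original call raises ValueError: np.where expects a boolean mask with one entry per row, while df.loc[condition] returns the filtered rows, which is shorter than the frame. A tiny sketch of the distinction:

mask = df['valence_median_split'] == 'Low_Valence'  # boolean Series, len(df) entries
df.loc[mask]                                        # filtered frame: fewer rows
df['set'] = np.where(mask, 'IN', 'OUT')             # np.where wants the mask itself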
I have a dataframe (df) with the following structure:
date a b c d e f g
23/02/2009 577.9102
24/02/2009 579.1345
25/02/2009 583.2158
26/02/2009 629.7425
27/02/2009 553.8306
02/03/2009 6.15 5.31 5.80 223716 790.8724 5.7916
03/03/2009 6.16 6.2 6.26 818424 770.6165 6.0161
04/03/2009 6.6 6.485 6.57 636544 858.5754 1.4331 6.4149
05/03/2009 6.1 5.98 6.06 810584 816.5025 1.7475 6.242
06/03/2009 5.845 5.95 6.00 331079 796.7618 1.7144 6.0427
09/03/2009 5.4 5.2 5.28 504271 744.0833 1.6449 5.4076
10/03/2009 5.93 5.59 5.595 906742 814.2862 1.4128 5.8434
Columns a and g both contain data that I would like to multiply together using the following:
df["h"] = df["a"]*df["g"]
However, as you can see from the time series above, there is not always data with which to perform the calculation, and I am returned the following error:
KeyError: 'g'
Is there a way to check if the data exists before performing the calculation? I am trying to use:
df["h"] = np.where((df.a == blank)|(df.g == blank),"",df.a*df.g)
I would like to have returned:
date a b c d e f g h
23/02/2009 577.9102
24/02/2009 579.1345
25/02/2009 583.2158
26/02/2009 629.7425
27/02/2009 553.8306
02/03/2009 6.15 5.31 5.8 223716 790.8724 5.7916 1.0618
03/03/2009 6.16 6.2 6.26 818424 770.6165 6.0161 1.0239
04/03/2009 6.6 6.485 6.57 636544 858.5754 1.4331 6.4149 1.0288
05/03/2009 6.1 5.98 6.06 810584 816.5025 1.7475 6.242 0.9772
06/03/2009 5.845 5.95 6.00 331079 796.7618 1.7144 6.0427 0.9672
09/03/2009 5.4 5.2 5.28 504271 744.0833 1.6449 5.4076 0.9985
10/03/2009 5.93 5.59 5.595 906742 814.2862 1.4128 5.8434 1.0148
but I am unsure of the syntax for a blank data field. What should that be?
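A hedged sketch of one way to handle this: if the frame came from pd.read_csv, blank fields are normally read in as NaN, and NaN already propagates through arithmetic, so the plain product may be all that is needed. If the blanks are empty strings instead, coerce the columns to numeric first (pd.to_numeric turns anything unparseable into NaN); the column names follow the frame above:

# If blanks are already NaN, multiplication propagates them automatically:
df["h"] = df["a"] * df["g"]

# If blanks are empty strings, coerce to numeric first:
df["h"] = pd.to_numeric(df["a"], errors="coerce") * pd.to_numeric(df["g"], errors="coerce")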