Calculated mean column for a group transposed in a dataframe - python

I'm having an issue with my final analysis column. I'm looking to get the mean of each row in the below output.
ValueSource StackedValues Count Sum_Weight Group_Count Mean
0 AgeBand 4.0 402 6152.237828 2418 NaN
2 AgeBand 2.0 402 5250.436317 2053 NaN
7 AgeBand 3.0 402 4344.387011 1667 NaN
11 AgeBand 5.0 402 7296.371395 2911 NaN
19 AgeBand 1.0 402 3260.035257 1254 NaN
20 AgeBand 6.0 402 8501.978737 3341 NaN
59 AgeBand 8.0 402 15487.932515 6210 NaN
92 AgeBand 7.0 402 12054.620941 4846 NaN
So for index row 0, the mean would be Sum_Weight / SUM(Sum_Weight), grouped across ValueSource.
I tried the following:
Data['Mean'] = Data.groupby("ValueSource")['Sum_Weight'].mean()
but as you can see, it didn't quite work.
The end result would be a Mean column that has a value for each row per ValueSource and StackedValues.
Any help would be much appreciated.

You could do that with groupby and apply, like:
Data['Mean'] = Data.groupby("ValueSource")['Sum_Weight'].apply(lambda x: x / x.sum())
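If you prefer the result to come back already aligned with the original index, a transform-based variant should work as well (a sketch of the same calculation, using the column names from the question):
Data['Mean'] = Data['Sum_Weight'] / Data.groupby("ValueSource")['Sum_Weight'].transform('sum')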

Related

Replace missing values based on value of a specific column in Python

I would like to replace missing values based on the values of the column Submitted.
Find below what I have:
Year  Country  Submitted  Age12  Age14
2018  CHI      1          267    NaN
2019  CHI      NaN        NaN    NaN
2020  CHI      1          244    203
2018  ALB      1          163    165
2019  ALB      1          NaN    NaN
2020  ALB      1          161    NaN
2018  GER      1          451    381
2019  GER      NaN        NaN    NaN
2020  GER      1          361    321
And this is what I would like to have:
Year  Country  Submitted  Age12  Age14
2018  CHI      1          267    NaN
2019  CHI      NaN        267    NaN
2020  CHI      1          244    203
2018  ALB      1          163    165
2019  ALB      1          NaN    NaN
2020  ALB      1          161    NaN
2018  GER      1          451    381
2019  GER      NaN        451    381
2020  GER      1          361    321
I tried using the command df.fillna(axis=0, method='ffill')
But this replaces every NaN with the previous value, which is not what I want, because some values should be kept as NaN if the "Submitted" column value is 1.
I would like to fill a value from the previous row only if the respective "Submitted" value is NaN.
Thank you
Try using where together with what you did:
df = df.where(~df.Submitted.isnull(), df.fillna(axis=0, method='ffill'))
This will replace the entries only when Submitted is null.
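For reference, a minimal runnable sketch of that idea, reduced to the CHI rows of the question's data (note that where swaps in the whole forward-filled row, so Submitted itself is also filled in those rows):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Year': [2018, 2019, 2020],
                   'Country': ['CHI', 'CHI', 'CHI'],
                   'Submitted': [1, np.nan, 1],
                   'Age12': [267, np.nan, 244],
                   'Age14': [np.nan, np.nan, 203]})

# Keep each row as-is where Submitted is present; elsewhere take the forward-filled row
df = df.where(~df.Submitted.isnull(), df.fillna(axis=0, method='ffill'))
print(df)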
You can do a conditional ffill() using np.where
import numpy as np

(
    df.assign(Age12=np.where(df.Submitted.isna(), df.Age12.ffill(), df.Age12))
      .assign(Age14=np.where(df.Submitted.isna(), df.Age14.ffill(), df.Age14))
)
You can use .filter() to select the related columns and put the columns in the list cols. Then, use .mask() to change the values of the selected columns by forward fill using ffill() when Submitted is NaN, as follows:
cols = df.filter(like='Age').columns
df[cols] = df[cols].mask(df['Submitted'].isna(), df[cols].ffill())
Result:
print(df)
Year Country Submitted Age12 Age14
0 2018 CHI 1.0 267.0 NaN
1 2019 CHI NaN 267.0 NaN
2 2020 CHI 1.0 244.0 203.0
3 2018 ALB 1.0 163.0 165.0
4 2019 ALB 1.0 NaN NaN
5 2020 ALB 1.0 161.0 NaN
6 2018 GER 1.0 451.0 381.0
7 2019 GER NaN 451.0 381.0
8 2020 GER 1.0 361.0 321.0
I just used a for loop to check and update the values in the dataframe
import pandas as pd

new_data = [[2018,'CHI',1,267,30], [2019,'CHI','NaN','NaN','NaN'], [2020,'CHI',1,244,203]]
df = pd.DataFrame(new_data, columns=['Year','Country','Submitted','Age12','Age14'])

prevValue12 = df.iloc[0]['Age12']
prevValue14 = df.iloc[0]['Age14']

for index, row in df.iterrows():
    if row['Submitted'] == 'NaN':
        df.at[index,'Age12'] = prevValue12
        df.at[index,'Age14'] = prevValue14
    prevValue12 = row['Age12']
    prevValue14 = row['Age14']

print(df)
output
Year Country Submitted Age12 Age14
0 2018 CHI 1 267 30
1 2019 CHI NaN 267 30
2 2020 CHI 1 244 203
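Note that the string comparison only works because this sample stores the literal text 'NaN'; with real missing values (np.nan, as in the original frame) the check inside the loop would need pd.isna instead, roughly:
if pd.isna(row['Submitted']):
    df.at[index,'Age12'] = prevValue12
    df.at[index,'Age14'] = prevValue14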

Only calculate mean of data rows in dataframe with no NaN-values

I have a dataframe with ID's of clients and their expenses for 2014-2018. What I want is to have the mean of the expenses for 2014-2018 of each ID in the dataframe.
There is however one condition: if any of the cells in the 2014-2018 columns is empty, NaN should be returned. So I only want the mean to be calculated when all 5 cells in the 2014-2018 columns have a numeric value.
Initial dataframe:
2014 2015 2016 2017 2018 ID
100 122.0 324 632 NaN 12.0
120 159.0 54 452 541.0 96.0
NaN 164.0 687 165 245.0 20.0
180 421.0 512 184 953.0 73.0
110 654.0 913 173 103.0 84.0
130 NaN 754 124 207.0 26.0
170 256.0 843 97 806.0 87.0
140 754.0 95 101 541.0 64.0
80 985.0 184 84 90.0 11.0
96 65.0 127 130 421.0 34.0
Desired output
2014 2015 2016 2017 2018 ID mean
100 122.0 324 632 NaN 12.0 NaN
120 159.0 54 452 541.0 96.0 265.20
NaN 164.0 687 165 245.0 20.0 NaN
180 421.0 512 184 953.0 73.0 450.00
110 654.0 913 173 103.0 84.0 390.60
130 NaN 754 124 207.0 26.0 NaN
170 256.0 843 97 806.0 87.0 434.40
140 754.0 95 101 541.0 64.0 326.20
80 985.0 184 84 90.0 11.0 284.60
96 65.0 127 130 421.0 34.0 167.80
The code I tried is below; however, it only gives me the mean, ignoring the NaN condition. Is there some brief lambda function that can add the condition to the code?
import pandas as pd
import numpy as np

data = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
"2014": [100,120,np.nan,180,110,130,170,140,80,96],
"2015": [122,159,164,421,654,np.nan,256,754,985,65],
"2016": [324,54,687,512,913,754,843,95,184,127],
"2017": [632,452,165,184,173,124,97,101,84,130],
"2018": [np.nan,541,245,953,103,207,806,541,90,421]})


print(data)

# If a cell in these columns is empty (NaN), then NaN should go into the new 'mean' column.
# I only want the mean when all 5 cells in the row have a numeric value.
fiveyear = ["2014", "2015", "2016", "2017", "2018"]


data.loc[:, 'mean'] = data[fiveyear].mean(axis=1)

print(data)
Use dropna to remove rows before calculating the mean. Because pandas aligns on the index when assigning the result back, and those rows were removed, the dropped rows end up as NaN.
df['mean'] = df[fiveyear].dropna(how='any').mean(1)
Also possible to mask the result to only those rows that were all non-null
df['mean'] = df[fiveyear].mean(1).mask(df[fiveyear].isnull().any(1))
A bit more of a hack, but because you know you need all 5 values you could also use sum which supports the min_count argument, so anything with fewer than 5 values is NaN
df['mean'] = df[fiveyear].sum(1, min_count=len(fiveyear))/len(fiveyear)
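On newer pandas versions, passing the axis positionally may warn, so spelling out the keyword is safer; a sketch of the mask-based variant with explicit keywords:
df['mean'] = df[fiveyear].mean(axis=1).mask(df[fiveyear].isnull().any(axis=1))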
This is the same as @ALollz's answer, but with a flexible way to detect the year columns regardless of how many years there are in the df:
#get years columns in a list
yearsCols= [c for c in df if c != 'ID']
#calculate mean
df['mean'] = df[yearsCols].dropna(how='any').mean(1)

Need to iterate over row to check conditions and retrieve values from different columns if the conditions are met

I have a daily price data for a stock. Pasting last 31 rows of the data as an example dataset as below:
Date RSI Smooth max min
110 2019-02-13 38.506874 224.006543 NaN NaN
111 2019-02-14 39.567068 227.309923 NaN NaN
112 2019-02-15 43.774479 229.830776 NaN NaN
113 2019-02-18 43.651440 231.690179 NaN NaN
114 2019-02-19 43.467237 232.701976 NaN NaN
115 2019-02-20 44.370123 233.526131 NaN NaN
116 2019-02-21 45.605073 233.834988 233.834988 NaN
117 2019-02-22 46.837518 232.335179 NaN NaN
118 2019-02-25 42.087860 229.570711 NaN NaN
119 2019-02-26 39.008014 226.379526 NaN NaN
120 2019-02-27 39.542339 225.607475 NaN 225.607475
121 2019-02-28 39.051104 228.305615 NaN NaN
122 2019-03-01 48.191687 232.544289 NaN NaN
123 2019-03-05 51.909527 237.063534 NaN NaN
124 2019-03-06 52.988668 240.243201 NaN NaN
125 2019-03-07 54.205990 242.265173 NaN NaN
126 2019-03-08 54.967076 243.912033 NaN NaN
127 2019-03-11 58.080738 244.432163 244.432163 NaN
128 2019-03-12 55.587328 243.573710 NaN NaN
129 2019-03-13 51.714123 241.191933 NaN NaN
130 2019-03-14 48.948075 238.470485 NaN NaN
131 2019-03-15 46.615111 236.144640 NaN NaN
132 2019-03-18 48.219815 233.588265 NaN NaN
133 2019-03-19 41.866898 230.271903 NaN 230.271903
134 2019-03-20 34.818844 239.457110 NaN NaN
135 2019-03-22 42.167870 246.824173 NaN NaN
136 2019-03-25 60.228588 255.294124 NaN NaN
137 2019-03-26 66.896640 267.069173 NaN NaN
138 2019-03-27 68.823285 278.222343 NaN NaN
139 2019-03-28 63.654023 289.042091 289.042091 NaN
I am trying to develop the logic below:
if max > 0, then search for the previous non-zero max value and assign it to max2. Also, assign the corresponding RSI of that previous non-zero max as RSI2.
Desired output:
For line 139 in the data set, max2 will be 244.432163 and RSI2 will be 58.080738
For line 138 in the data set, max2 will be 0 and RSI2 will be 0, and so on...
I tried different approaches but was unsuccessful at getting any output, so I do not have sample code to paste.
I also tried using if loops but I am unable to make it work. I am very new at programming.
First you will need to iterate the dataframe.
Then you will need to store the previous values so they can be used on the next hit. Since you are always going back to the previous max, you can reuse that as you loop through.
Something like this (did not test, just for an idea):
last_max = 0
last_rsi = 0
for index, row in df.iterrows():
    if row['max'] > 0:                       # NaN > 0 is False, so rows without a max are skipped
        df.at[index, 'max2'] = last_max      # write via df.at; assigning to row would not persist
        df.at[index, 'RSI2'] = last_rsi
        last_max = row['max']                # store this max/RSI for next time
        last_rsi = row['RSI']
The right answer is to add a line of code as below:
df[['max2', 'RSI2']] = df[['max', 'RSI']].dropna(subset=['max']).shift(1).fillna(0)
Here dropna(subset=['max']) keeps only the rows that have a max, shift(1) moves each of those max/RSI pairs down to the next such row, fillna(0) turns the first pair into 0, and index alignment on assignment leaves every other row as NaN.

How to retrieve one column from csv file using python?

I'm trying to retrieve the age column from one of the csv files; here is what I coded so far.
df = pd.DataFrame.from_csv('train.csv')
result = df[(df.Sex=='female') & (df.Pclass==3)]
print(result.Age)
# finding the average age of all people who survived
print len(result)
sum = len(result)
I printed out the age because I wanted to see the list of all ages belonging to the rows where the Sex column has the value "female" and the Pclass column has the value 3.
For some reason the print result shows the index number and the age next to it; I just want it to print the list of ages, that's all.
PassengerId
3 26.0
9 27.0
11 4.0
15 14.0
19 31.0
20 NaN
23 15.0
25 8.0
26 38.0
29 NaN
33 NaN
39 18.0
40 14.0
41 40.0
45 19.0
48 NaN
50 18.0
69 17.0
72 16.0
80 30.0
83 NaN
86 33.0
101 28.0
107 21.0
110 NaN
112 14.5
114 20.0
115 17.0
120 2.0
129 NaN
...
658 32.0
678 18.0
679 43.0
681 NaN
692 4.0
698 NaN
703 18.0
728 NaN
730 25.0
737 48.0
768 30.5
778 5.0
781 13.0
787 18.0
793 NaN
798 31.0
800 30.0
808 18.0
814 6.0
817 23.0
824 27.0
831 15.0
853 9.0
856 18.0
859 24.0
864 NaN
876 15.0
883 22.0
886 39.0
889 NaN
Name: Age, dtype: float64
This is what my program prints. I just want the list of ages in the right column, not the PassengerId column on the left.
Thank you
result.Age is a pandas Series object, and so when you print it, column headers, indices, and data types are shown as well. This is a good thing, because it makes the printed representation of the object much more useful.
If you want to control exactly how the data is displayed, you will need to do some string formatting. Something like this should do what you're asking for:
print('\n'.join(str(x) for x in result.Age))
If you want access to the raw data underlying that column for some reason (usually you can work with the Series just as well), without indices or headers, you can get a numpy array with
result.Age.values
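If the goal is just a cleaner printout rather than the underlying array, the Series to_string method can suppress the index, for example:
print(result.Age.to_string(index=False))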

Reindexing and filling on one level of a hierarchical index in pandas

I have a pandas dataframe with a two level hierarchical index ('item_id' and 'date'). Each row has columns for a variety of metrics for a particular item in a particular month. Here's a sample:
total_annotations unique_tags
date item_id
2007-04-01 2 30 14
2007-05-01 2 32 16
2007-06-01 2 36 19
2008-07-01 2 81 33
2008-11-01 2 82 34
2009-04-01 2 84 35
2010-03-01 2 90 35
2010-04-01 2 100 36
2010-11-01 2 105 40
2011-05-01 2 106 40
2011-07-01 2 108 42
2005-08-01 3 479 200
2005-09-01 3 707 269
2005-10-01 3 980 327
2005-11-01 3 1176 373
2005-12-01 3 1536 438
2006-01-01 3 1854 497
2006-02-01 3 2206 560
2006-03-01 3 2558 632
2007-02-01 3 5650 1019
As you can see, observations are not present for every consecutive month for each item. What I want to do is reindex the dataframe such that each item has rows for each month in a specified range. Now, this is easy to accomplish for any given item. So, for item_id 99, for example:
baseDateRange = pd.date_range('2005-07-01','2013-01-01',freq='MS')
data.xs(99,level='item_id').reindex(baseDateRange,method='ffill')
But with this method, I'd have to iterate through all the item_ids, then merge everything together, which seems woefully over-complicated.
So how can I apply this to the full dataframe, ffill-ing the observations (but also the item_id index) such that each item_id has properly filled rows for all the dates in baseDateRange?
Essentially for each group you want to reindex and ffill. The apply gets passed a data frame that has the item_id and date still in the index, so reset, then set and reindex with filling.
idx is your baseDateRange from above.
In [33]: df.groupby(level='item_id').apply(
             lambda x: x.reset_index().set_index('date').reindex(idx, method='ffill')).head(30)
Out[33]:
item_id annotations tags
item_id
2 2005-07-01 NaN NaN NaN
2005-08-01 NaN NaN NaN
2005-09-01 NaN NaN NaN
2005-10-01 NaN NaN NaN
2005-11-01 NaN NaN NaN
2005-12-01 NaN NaN NaN
2006-01-01 NaN NaN NaN
2006-02-01 NaN NaN NaN
2006-03-01 NaN NaN NaN
2006-04-01 NaN NaN NaN
2006-05-01 NaN NaN NaN
2006-06-01 NaN NaN NaN
2006-07-01 NaN NaN NaN
2006-08-01 NaN NaN NaN
2006-09-01 NaN NaN NaN
2006-10-01 NaN NaN NaN
2006-11-01 NaN NaN NaN
2006-12-01 NaN NaN NaN
2007-01-01 NaN NaN NaN
2007-02-01 NaN NaN NaN
2007-03-01 NaN NaN NaN
2007-04-01 2 30 14
2007-05-01 2 32 16
2007-06-01 2 36 19
2007-07-01 2 36 19
2007-08-01 2 36 19
2007-09-01 2 36 19
2007-10-01 2 36 19
2007-11-01 2 36 19
2007-12-01 2 36 19
Building on Jeff's answer, I consider this somewhat more readable. It is also considerably more efficient, since only the droplevel and reindex methods are used.
df = df.set_index(['item_id', 'date'])

def fill_missing_dates(x, idx=all_dates):
    x.index = x.index.droplevel('item_id')
    return x.reindex(idx, method='ffill')

filled_df = (df.groupby('item_id')
               .apply(fill_missing_dates))
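Here all_dates is assumed to play the same role as baseDateRange in the question, for example:
all_dates = pd.date_range('2005-07-01', '2013-01-01', freq='MS')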
