Suppose I have a DataFrame
my_df = pd.DataFrame([10, 20, 30, 40, 50], columns=['col_1'])
I would like to add a new column where the value of each row in the new column is the mean of the values in col_1 starting at that row. In this case the new column (let's call it 'col_2') would be [30, 35, 40, 45, 50].
The following is not good code but it at least describes generating the values.
for i in range(len(my_df)):
    my_df.loc[i]['col_2'] = my_df[i:]['col_1'].mean()
How can I do this in a clean, idiomatic way that doesn't raise a SettingWithCopyWarning?
You can reverse the column, take the incremental mean, and then reverse it back again.
my_df.loc[::-1, 'col_1'].expanding().mean()[::-1]
# 0 30.0
# 1 35.0
# 2 40.0
# 3 45.0
# 4 50.0
# Name: col_1, dtype: float64
A similar ndarray-level approach could be to use np.cumsum and divide by the increasing number of elements.
np.true_divide(np.cumsum(my_df.col_1.values[::-1]),
               np.arange(1, len(my_df)+1))[::-1]
# array([30., 35., 40., 45., 50.])
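To write the result back as the new column the question asks for, a plain column assignment works and does not trigger a SettingWithCopyWarning (a minimal sketch using the first approach):
import pandas as pd

my_df = pd.DataFrame([10, 20, 30, 40, 50], columns=['col_1'])
# assignment aligns on the index, so the reversed order of the Series is not a problem
my_df['col_2'] = my_df.loc[::-1, 'col_1'].expanding().mean()[::-1]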
I wish to convert a data frame consisting of two columns.
Here is the sample df:
   cost  numbers
1   360       23
2   120       35
3  2000       49
Both columns are float and I wish to convert them to categorical using binning.
I wish to create the following bins for each column when converting to categorical.
Bins for the numbers : 18-24, 25-44, 45-65, 66-92
Bins for cost column: >=1000, <1000
Finally, I don't want to create new columns; I just want to convert the existing ones in place.
Here is my attempted code at this:
def PreprocessDataframe(df):
    # use binning to convert age and budget to categorical columns
    df['numbers'] = pd.cut(df['numbers'], bins=[18, 24, 25, 44, 45, 65, 66, 92])
    df['cost'] = pd.cut(df['cost'], bins=['=>1000', '<1000'])
    return df
I understand how to convert the "numbers" column but I am having trouble with the "cost" one.
Any help on how to solve this would be appreciated.
Thanks in advance!
Cheers!
If you use bins=[18, 24, 25, 44, 45, 65, 66, 92], pd.cut will generate bins for 18-24, 24-25, 25-44, 44-45, and so on, and you don't need the ones for 24-25, 44-45, etc.
By default, each bin excludes its left edge and includes its right edge.
So for numbers you could instead use bins=[17, 24, 44, 65, 92] (note the 17 in the first position, so that 18 is included).
The optional labels parameter lets you choose labels for the bins.
df['numbers'] = pd.cut(df['numbers'], bins=[17, 24, 44, 65, 92], labels=['18-24', '25-44', '45-65', '66-92'])
df['cost'] = pd.cut(df['cost'], bins=[0, 999.99, df['cost'].max()], labels=['<1000', '=>1000'])
print(df)
>>> df
cost numbers
0 <1000 18-24
1 <1000 25-44
2 =>1000 45-65
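A small variation (my assumption, not part of the answer above): using np.inf as the upper edge keeps the bin definition independent of the current maximum of the column.
import numpy as np
df['cost'] = pd.cut(df['cost'], bins=[0, 999.99, np.inf], labels=['<1000', '=>1000'])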
My raw data is a dataframe with three columns that describe journeys: quantity, start date, end date. My goal is to create a new dataframe with a daily index and one single column that shows the sum of the quantities of the journeys that were "on the way" each day i.e. sum quantity if day > start date and day < end date.
I think I can achieve this by creating a daily index and then using a for loop that, on each day, masks the data and sums the filtered quantities. I haven't managed to make it work, but I suspect there might be a better approach anyway. Below is my attempt with some dummy data...
data = [[10, '2020-03-02', '2020-03-27'],
        [18, '2020-03-06', '2020-03-10'],
        [21, '2020-03-20', '2020-05-02'],
        [33, '2020-01-02', '2020-03-01']]
columns = ['quantity', 'startdate', 'enddate']
index = [1,2,3,4]
df = pd.DataFrame(data,index,columns)
index2 = pd.date_range(start='2020-01-01', end='2020-06-01', freq='D')
df2 = pd.DataFrame(0,index2,'quantities')
for t in index2:
    mask = (df['start']<t) & (df['end']>t)
    df2['quantities'] = df[mask]['quantity'].sum()
Maybe you could create a date range for each record, then explode and groupby:
data = [[10, '2020-03-02', '2020-03-27'],
        [18, '2020-03-06', '2020-03-10'],
        [21, '2020-03-20', '2020-05-02'],
        [33, '2020-01-02', '2020-03-01']]
columns = ['quantity', 'startdate', 'enddate']
index = [1,2,3,4]
df = pd.DataFrame(data,index,columns)
df['range'] = df.apply(lambda x: pd.date_range(x['startdate'],x['enddate'],freq='D'), axis=1)
df = df.explode('range')
df.groupby('range')['quantity'].sum()
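If you also want days with no journey under way to appear with a 0 (an assumption based on the daily index built in the question), you could reindex the grouped result onto the full date range:
daily = df.groupby('range')['quantity'].sum()
daily = daily.reindex(pd.date_range(start='2020-01-01', end='2020-06-01', freq='D'), fill_value=0)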
Your data describes a step function, i.e. on the 2nd of March (midnight) it increases by a value of 10, and on the 27th of March (midnight) it decreases by 10.
This solution uses a package called staircase which is built on pandas and numpy for working with (mathematical) step functions.
setup
data = [[10, '2020-03-02', '2020-03-27'],
        [18, '2020-03-06', '2020-03-10'],
        [21, '2020-03-20', '2020-05-02'],
        [33, '2020-01-02', '2020-03-01']]
columns = ['quantity', 'startdate', 'enddate']
index = [1,2,3,4]
df = pd.DataFrame(data,index,columns)
dates = pd.date_range(start='2020-01-01', end='2020-06-01', freq='D')
df["startdate"] = pd.to_datetime(df["startdate"])
df["enddate"] = pd.to_datetime(df["enddate"])
solution
Create a staircase.Stairs object (which is to staircase what pandas.Series is to pandas), which represents a step function. It is as easy as passing the start times, end times, and values; since your data is in a pandas.DataFrame, that can be done by passing the column names.
import staircase as sc
sf = sc.Stairs(frame=df, start="startdate", end="enddate", value="quantity")
The step function will be composed of left-closed intervals by default.
There are lots of things you can do with step functions including plotting
sf.plot(style="hlines")
If you want to just get the value at the start of each day then you can sample the step function like this
sf(dates, include_index=True)
The result will be a pandas.Series indexed by your date range
2020-01-01 0
2020-01-02 33
2020-01-03 33
2020-01-04 33
2020-01-05 33
..
2020-05-28 0
2020-05-29 0
2020-05-30 0
2020-05-31 0
2020-06-01 0
Freq: D, Length: 153, dtype: int64
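If you want the result in the shape the question sketched, a dataframe with a daily index and a single 'quantities' column (my assumption about the desired output), you can wrap the sampled Series:
df2 = sf(dates, include_index=True).rename('quantities').to_frame()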
A more general solution to your problem which includes start and end times at any datetime (not just midnight) and arbitrary bins can be achieved with slicing and integrating.
note: I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
It is common to use .pct_change() to get the daily change of time-series data.
Now I want to recover the original values from the pct_change result.
I have a data frame like this:
df = pd.DataFrame({
'value': [44, 45, 33, 56, 60]
})
df['pct_change'] = df['value'].pct_change() # get changes
initial_value=df['value'].values[0] # store the initial value
How can I use df['pct_change'] and initial_value to get the df['value']?
You can use cumprod:
df['pct_change'].add(1,fill_value=0).cumprod()*44
Out[200]:
0 44.0
1 45.0
2 33.0
3 56.0
4 60.0
Name: pct_change, dtype: float64
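A slightly more general variant (my assumption, not part of the original answer) uses the stored initial_value instead of the hard-coded 44 and writes the result back to the frame:
df['value_restored'] = df['pct_change'].add(1, fill_value=0).cumprod() * initial_value
# value_restored now matches df['value']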
Assume I have a small data frame:
import pandas as pd
df = pd.DataFrame(
    [
        ["A", 28, 726, 120],
        ["B", 28, 1746, 250],
        ["C", 543, 15307, 4500]
    ],
    columns=["case", "x", "y", "z"]
)
I know how to calculate a total column as (for example):
cols = ['x', 'y', 'z']  # numeric columns only
df['total'] = df.loc[:, cols].sum(axis=1)
Now I would like to append to df 3 other columns x_pct, y_pct, z_pct, containing the percentage of x,y,z in relation to total, that is to say: x_pct=100*(x/total), etc.
And after that, I would like to still append 3 new columns x_pctr, y_pctr, z_pctr, containing the percentages rounded to a whole number: round(x_pct), etc.
Although I know, of course, how to calculate x_pct, x_pctr and so on individually, I couldn't find how to express the calculation of the three "percentage columns" in one go (and likewise the three "rounded columns"), nor how to build a "global" data frame containing both the original columns and the new ones.
I am a little confused: I guess apply(lambda ...) would do the job, if only I knew how to use it. Could you help me out?
Try:
df[["x_pctr", "y_pctr", "z_pctr"]] = (
df.loc[:, "x":].div(df.sum(axis=1), axis=0) * 100
).round()
print(df)
Prints:
case x y z x_pctr y_pctr z_pctr
0 A 28 726 120 3.0 83.0 14.0
1 B 28 1746 250 1.0 86.0 12.0
2 C 543 15307 4500 3.0 75.0 22.0
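If you also want the unrounded x_pct, y_pct, z_pct columns the question mentions (a sketch under that assumption), you can compute the percentages once and round them in a second step:
pct = df.loc[:, "x":"z"].div(df.loc[:, "x":"z"].sum(axis=1), axis=0) * 100
df[["x_pct", "y_pct", "z_pct"]] = pct                 # unrounded percentages
df[["x_pctr", "y_pctr", "z_pctr"]] = pct.round()      # rounded to whole numbers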
My dataframe looks something like this (but with about 100,000 rows of data):
ID,Total,TotalDate,DaysBtwRead,Type,YearlyAvg
1,1250,6/2/2017,17,AT267,229
2,1670,2/3/2012,320,PQ43,50
I'm trying to groupby yearly average totals using
df.groupby(pd.cut(df['YearlyAvg'], np.arange(0,1250,50))).count()
so that I can set up a unique Monte Carlo distribution, but I need these grouped by each individual Type as well. This currently only counts each range, regardless of Type.
Rather than having an overall aggregate count, I'm trying to set my code up so that the output looks more like the following (with YearlyAvg containing a count of each range)
Index,YearlyAvg
AT267(0, 50], 200
PQ43(0, 50], 123
AT267(50, 100], 49
PQ43(50, 100], 67
Is there an easier way to do this outside of creating a separate dataframe for each Type value?
You can use unstack with stack:
df['bins']=pd.cut(df['YearlyAvg'], np.arange(0,1250,50))
df.groupby(['Type','bins']).size().unstack(fill_value=0).stack()  # unstack/stack keeps every Type/bin combination, filling missing counts with 0, and gives the MultiIndex you need
Out[1783]:
Type bins
AT267 (0, 50] 0
(200, 250] 1
PQ43 (0, 50] 1
(200, 250] 0
dtype: int64
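If you specifically want the flat index shown in the question, with the Type fused onto the interval (an assumption about the desired presentation, not part of the answer above), you can map the MultiIndex to strings afterwards:
counts = df.groupby(['Type','bins']).size().unstack(fill_value=0).stack()
counts.index = [f"{t}{b}" for t, b in counts.index]  # e.g. "AT267(0, 50]"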