Python - Aggregate by month and calculate average

I have a csv which looks like this:
Date,Sentiment
2014-01-03,0.4
2014-01-04,-0.03
2014-01-09,0.0
2014-01-10,0.07
2014-01-12,0.0
2014-02-24,0.0
2014-02-25,0.0
2014-02-25,0.0
2014-02-26,0.0
2014-02-28,0.0
2014-03-01,0.1
2014-03-02,-0.5
2014-03-03,0.0
2014-03-08,-0.06
2014-03-11,-0.13
2014-03-22,0.0
2014-03-23,0.33
2014-03-23,0.3
2014-03-25,-0.14
2014-03-28,-0.25
etc
My goal is to aggregate the dates by month and calculate the average for each month. The dates might not start on the 1st, or in January. The problem is that I have a lot of data spanning several years. For this purpose I would like to find the earliest date (month) and from there start counting months and their averages. For example:
Month count, average
1, 0.4 (<= the earliest month)
2, -0.3
3, 0.0
...
12, 0.1
13, -0.4 (<= new year, but the month count continues)
14, 0.3
I'm using Pandas to open the csv:
data = pd.read_csv("pks.csv", sep=",")
so in data['Date'] I have dates and in data['Sentiment'] I have values. Any idea how to do it?

Probably the simplest approach is to use the resample command. First, when you read in your data, make sure you parse the dates and set the date column as your index (ignore the StringIO part and the header=0 ... I am reading in your sample data from a multi-line string):
>>> df = pd.read_csv(StringIO(data), header=0, parse_dates=['Date'],
...                  index_col='Date')
>>> df
Sentiment
Date
2014-01-03 0.40
2014-01-04 -0.03
2014-01-09 0.00
2014-01-10 0.07
2014-01-12 0.00
2014-02-24 0.00
2014-02-25 0.00
2014-02-25 0.00
2014-02-26 0.00
2014-02-28 0.00
2014-03-01 0.10
2014-03-02 -0.50
2014-03-03 0.00
2014-03-08 -0.06
2014-03-11 -0.13
2014-03-22 0.00
2014-03-23 0.33
2014-03-23 0.30
2014-03-25 -0.14
2014-03-28 -0.25
>>> df.resample('M').mean()
Sentiment
2014-01-31 0.088
2014-02-28 0.000
2014-03-31 -0.035
And if you want a month counter, you can add it after your resample:
>>> agg = df.resample('M').mean()
>>> agg['cnt'] = range(len(agg))
>>> agg
Sentiment cnt
2014-01-31 0.088 0
2014-02-28 0.000 1
2014-03-31 -0.035 2
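The question's sample output counts months from 1 rather than 0; if you want that, just shift the range:
>>> agg['cnt'] = range(1, len(agg) + 1)
Because resample emits one row per calendar month over the whole span, the counter keeps increasing across year boundaries, as in the desired output.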
You can also do this with the groupby method and pd.Grouper (group by month and then call the mean convenience method that is available with groupby).
>>> df.groupby(pd.Grouper(freq='M')).mean()
Sentiment
2014-01-31 0.088
2014-02-28 0.000
2014-03-31 -0.035

To get the monthly average values of a DataFrame when the DataFrame has daily data rows in 'Sentiment', I would:
Convert the column with the dates, df['date'], into the index of the DataFrame df: df.set_index('date', inplace=True)
Then convert the index dates into a month index: df.index.month
Finally, calculate the mean of the DataFrame GROUPED BY MONTH: df.groupby(df.index.month).Sentiment.mean()
I'll go slowly through each step here:
Generate a DataFrame with dates and values
You first need to import Pandas and NumPy, as well as the datetime module:
from datetime import datetime
import numpy as np
import pandas as pd
Generate a column 'date' between 1/1/2018 and 3/05/2018, at weekly ('W') intervals, and a column 'Sentiment' with random integers between 0 and 99:
date_rng = pd.date_range(start='1/1/2018', end='3/05/2018', freq='W')
df = pd.DataFrame(date_rng, columns=['date'])
df['Sentiment']=np.random.randint(0,100,size=(len(date_rng)))
The df now has two columns, 'date' and 'Sentiment':
date Sentiment
0 2018-01-07 34
1 2018-01-14 32
2 2018-01-21 15
3 2018-01-28 0
4 2018-02-04 95
5 2018-02-11 53
6 2018-02-18 7
7 2018-02-25 35
8 2018-03-04 17
Set the 'date' column as the index of the DataFrame:
df.set_index('date',inplace=True)
df now has one column, 'Sentiment', and the index is 'date':
Sentiment
date
2018-01-07 34
2018-01-14 32
2018-01-21 15
2018-01-28 0
2018-02-04 95
2018-02-11 53
2018-02-18 7
2018-02-25 35
2018-03-04 17
Capture the month number from the index:
months=df.index.month
Obtain the mean value of each month grouping by month:
monthly_avg=df.groupby(months).Sentiment.mean()
The mean of the dataset by month, 'monthly_avg', is:
date
1 20.25
2 47.50
3 17.00
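One caveat for the multi-year case in the original question: grouping by df.index.month alone pools same-numbered months from different years (January 2014 and January 2015 land in the same bucket). A minimal sketch that keeps them separate by grouping on year and month together:
monthly_avg = df.groupby([df.index.year, df.index.month]).Sentiment.mean()
Alternatively, resample('M') as in the first answer keeps every calendar month distinct.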

Related

Calculate the amount of time that a column was positive before reverting to 0

I have a table containing non-negative values against a datetimeindex, like the following:
CapturableSeparation
date
2021-02-23 18:09:00 0.00
2021-02-23 18:10:00 0.00
2021-02-23 18:11:00 0.04
2021-02-23 18:12:00 0.04
2021-02-23 18:13:00 0.00
... ...
2021-02-25 23:56:00 0.00
2021-02-25 23:57:00 0.91
2021-02-25 23:58:00 0.74
2021-02-25 23:59:00 0.55
I want to create a table of the amount of time between non-consecutive 0s (amount of time that positive values persist before reverting to 0) keyed with the average value of the "CapturableSeparation" during those consecutive positive values. For the data that is visible, the table might look like:
AvgValue
persistence
00:02:00 0.04
00:03:00 0.73
where the first row corresponds to the positive values at the beginning of the Dataframe that persist for 2 minutes and the second row corresponds to those at the end that persist for 3 minutes.
How should this be done?
Here is one way of solving the problem by identifying the consecutive blocks of non-zero values using boolean masking and cumsum:
import numpy as np
import pandas as pd

m = df['CapturableSeparation'].eq(0)
b = m.cumsum()[~m]
agg_dict = {'persistence': ('date', np.ptp),
            'avgvalue': ('CapturableSeparation', 'mean')}
out = df.groupby(b, as_index=False).agg(**agg_dict)
out['persistence'] += pd.Timedelta(minutes=1)
Details:
Compare the CapturableSeparation column with 0 to create a boolean mask:
>>> m
0 True
1 True
2 False
3 False
4 True
5 True
6 False
7 False
8 False
Name: CapturableSeparation, dtype: bool
Then use cumsum on the above boolean mask to identify the blocks of consecutive non-zero values:
>>> b
2 2
3 2
6 4
7 4
8 4
Name: CapturableSeparation, dtype: int64
Group the dataframe on these consecutive blocks and aggregate the column date using np.ptp and column CapturableSeparation using mean:
>>> out
persistence avgvalue
0 0 days 00:02:00 0.040000
1 0 days 00:03:00 0.733333
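For reference, here is a minimal end-to-end sketch of the same approach, assuming the timestamps live in a plain 'date' column (call reset_index() first if they are your index, as in the printed sample). The lambda is just an explicit spelling of np.ptp:
import pandas as pd

# Rebuild the visible head of the sample data
df = pd.DataFrame({
    'date': pd.date_range('2021-02-23 18:09', periods=5, freq='min'),
    'CapturableSeparation': [0.00, 0.00, 0.04, 0.04, 0.00],
})

m = df['CapturableSeparation'].eq(0)   # True on the zero rows
b = m.cumsum()[~m]                     # one label per run of non-zero rows
out = df.groupby(b, as_index=False).agg(
    persistence=('date', lambda s: s.max() - s.min()),
    avgvalue=('CapturableSeparation', 'mean'),
)
out['persistence'] += pd.Timedelta(minutes=1)  # count the first sample too
print(out)  # persistence: 0 days 00:02:00, avgvalue: 0.04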

Is it possible to create an additional pct_change column for each column of the dataframe?

I tried to solve this on my own and have searched other topics for help, yet my problem remains. If anyone can help me or point me in the right direction, I would appreciate it.
I'm fairly new to Python and I'm trying to perform some changes on a pandas dataframe.
To summarize, I want to look at percentage changes over sales data.
I'm aware of the pct_change method and below is what I've tried.
This is a sample data that looks like my original dataframe:
store_id sales_value day
0 1 54,141.00 2020-12-22
1 1 78,921.00 2020-12-23
2 6 5,894.00 2020-12-24
3 6 22,991.00 2020-12-23
4 6 25,802.00 2020-12-22
I made a function to calculate variations over rows.
it looks like this:
def var_percent(df, n):
    return df.pct_change(n)
Then, using:
df['var_pct'] = var_percent(df['sales_value'],1)
it gave me something like the following:
store_id sales_value day var_pct
0 1 54,141.00 2020-12-22 nan
1 1 78,921.00 2020-12-23 0.46
4 6 25,802.00 2020-12-22 -0.67
3 6 22,991.00 2020-12-23 -0.11
2 6 5,894.00 2020-12-24 -0.74
That's not really what I want. I need to see the changes for each store individually (by store_id), whereas this configuration calculates over consecutive rows regardless of which store they come from.
Moving forward I tried this:
df.set_index(["day", "store_id"]).unstack(level=1)
Finally I get my actual dataframe, the one I'm stuck with and it's kind of like this:
store_id        1      6      15         22
day
2020-12-22  54141  25802  173399  36,200.00
2020-12-23  78921  22991  234885  32,762.00
2020-12-24      0   5894       0  10,956.00
2020-12-26      0      0   10980       0.00
2020-12-28      0      0       0       0.00
Now the dataframe is how I need it to be, but I haven't found a way to apply pct_change the way I want, which would be adding a percentage-change column for each existing column (these are dummy numbers, just a visual representation of how I'd like it to be):
store_id       1  1_pct      6  6_pct      15  15_pct
day
2020-12-22 54141 0 25802 0 173399 0
2020-12-23 78921 25 22991 -8 234885 20
2020-12-24 0 0 5894 -60 0 0
2020-12-26 0 0 0.00 0 10980 1000
2020-12-28 0 0 0.00 0 0 0
Is it even possible to do so?
You can use the following:
sales_value should be converted to a numeric type and day should be parsed as datetime, then the data should be sorted. If all of these are already done, you can skip this block:
df['sales_value']=pd.to_numeric(df['sales_value'].str.replace(",",''))
df['day'] = pd.to_datetime(df['day'])
df = df.sort_values(['store_id','day'])
Calculate pct_change for each group and then unstack
out = (df.assign(pct=df.groupby("store_id")['sales_value'].pct_change()
                     .mul(100).round())
         .set_index(["day", "store_id"])
         .unstack(level=1)
         .fillna(0)
         .sort_index(level=1, axis=1))
out.columns = out.columns.map('{0[1]} {0[0]}'.format)
print(out)
            1 pct  1 sales_value  6 pct  6 sales_value
day
2020-12-22    0.0        54141.0    0.0        25802.0
2020-12-23   46.0        78921.0  -11.0        22991.0
2020-12-24    0.0            0.0  -74.0         5894.0
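The columns.map line flattens the (measure, store) column MultiIndex into labels like '1 pct'. An equivalent, perhaps more explicit, spelling of that same step:
out.columns = [f'{store} {measure}' for measure, store in out.columns]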

How to group sum along timestamp in pandas?

One year's data is shown as follows:
datetime data
2008-01-01 00:00:00 0.044
2008-01-01 00:30:00 0.031
2008-01-01 01:00:00 -0.25
.....
2008-01-01 23:00:00 0.036
2008-01-01 23:30:00 0.42
2008-01-02 00:00:00 0.078
2008-01-02 00:30:00 0.008
2008-01-02 01:00:00 0.09
2008-01-02 01:30:00 0.054
.....
2008-12-31 22:00:00 0.55
2008-12-31 22:30:00 0.05
2008-12-31 23:00:00 0.08
2008-12-31 23:30:00 0.033
There is one value per half-hour. I want the sum of all values in each day, i.e. converting this to 365 rows of daily values.
year day sum values
2008 1 *
2008 2 *
...
2008 364 *
2008 365 *
You can use dt.year + dt.dayofyear with groupby and aggregate sum:
df = df.groupby([df['datetime'].dt.year, df['datetime'].dt.dayofyear]).sum()
print (df)
data
datetime datetime
2008 1 -0.175
2 0.230
31 0.456
366 0.713
And if you need a DataFrame, you can convert the index to columns and set the column names with rename_axis + reset_index:
df = (df.groupby([df['datetime'].dt.year, df['datetime'].dt.dayofyear])['data']
        .sum()
        .rename_axis(('year', 'dayofyear'))
        .reset_index())
print (df)
year dayofyear data
0 2008 1 -0.175
1 2008 2 0.230
2 2008 31 0.456
3 2008 366 0.713
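As a side note, resampling by calendar day produces the same daily sums keyed by date instead of by (year, dayofyear). A minimal sketch, assuming df is the original half-hourly frame and 'datetime' is already a datetime64 column:
daily = df.resample('D', on='datetime')['data'].sum()
Note that resample also emits a 0 for any calendar day with no readings, which the groupby version simply skips.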

Seasonal data selection in python pandas

I have a dataframe with time series data in one column and the index as a date. The data looks like below; it runs from 2000 to 2015.
2000-02-24 NaN
2000-02-25 NaN
2000-02-26 0.272
2000-02-27 0.417
2000-02-28 0.837
2000-02-29 1.082
2000-03-01 0.613
2000-03-02 0.709
2000-03-03 0.857
2000-03-04 0.391
2000-03-05 0.470
2000-03-06 0.288
2000-03-07 0.286
I want data only from the months March to July for each year, so is there any way to do that in pandas?
You can filter using the DatetimeIndex attribute month:
df[(df.index.month >= 3) & (df.index.month <= 7)]
months are one-based, so 3 corresponds to March and 7 is July
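An equivalent filter using isin reads a little closer to the intent:
df[df.index.month.isin(range(3, 8))]
Here range(3, 8) covers months 3 through 7, i.e. March through July.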

Can't access pure index data in option dataframe in pandas

I'm doing some calculations on data from the Yahoo options web page with the following code:
from pandas.io.data import Options
aapl = Options('aapl', 'yahoo')
data = aapl.get_all_data()
middle = data.query('Expiry == "2015-08-28" & Type == "call"')
strike = middle.ix[:, 0]
I find I can't access the 'Strike' column with the above code; when I retrieve the data of column 0, I get the 'Ask' column. It seems pandas treats 'Strike' as an index, instead of a data column.
If I use the following code:
strike = middle.index.levels[0]
the returned result is all the unique elements of the 'Strike' index level, not the actual per-row 'Strike' data from the query above (middle = data.query('Expiry == "2015-08-28" & Type == "call"')).
If I print middle, I can see a 'Strike' column in the result, but if I print middle.columns, there is no 'Strike' in it. So it seems 'Strike' is purely an index.
Can anyone tell me how to access the data of such a pure index?
The quickest way to "Explode" a nested index is to call reset_index:
data = data.reset_index()
data.head()
Strike Expiry Type Symbol Last Bid Ask Chg PctChg Vol Open_Int IV Root IsNonstandard Underlying Underlying_Price Quote_Time
0 34.29 2016-01-15 call AAPL160115C00034290 80.85 81.65 82.05 1.35 1.70% 3 420 83.79% AAPL False AAPL 115.96 2015-08-17 16:00:00
1 34.29 2016-01-15 put AAPL160115P00034290 0.03 0.00 0.04 0.00 0.00% 2 12469 67.19% AAPL False AAPL 115.96 2015-08-17 16:00:00
2 35.71 2016-01-15 call AAPL160115C00035710 77.30 80.25 80.60 0.00 0.00% 39 113 80.66% AAPL False AAPL 115.96 2015-08-17 16:00:00
3 35.71 2016-01-15 put AAPL160115P00035710 0.05 0.00 0.04 0.00 0.00% 150 9512 64.84% AAPL False AAPL 115.96 2015-08-17 16:00:00
4 37.14 2016-01-15 call AAPL160115C00037140 78.55 78.80 79.20 0.00 0.00% 10 0 78.52% AAPL False AAPL 115.96 2015-08-17 16:00:00
Then you can just access your column as data.Strike or data['Strike']:
data.Strike.head()
Out[175]:
0 34.29
1 34.29
2 35.71
3 35.71
4 37.14
Name: Strike, dtype: float64
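If you'd rather not reset the index, you can also pull the per-row values of a single index level with get_level_values, which (unlike middle.index.levels[0]) stays aligned with the filtered rows of your query:
strike = middle.index.get_level_values('Strike')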
